The HTTP interface

To synthesize text to speech with the HTTP REST interface of the IBM Watson® Text to Speech service, you call the GET or POST /v1/synthesize method. You specify the text that is to be synthesized and the voice and format for the spoken audio. You can also specify a custom model that is to be used with the request.

For more information about the HTTP interface, see the API & SDK reference.

Synthesizing text to audio

To synthesize text to audio, you call one of the two versions of the service's /v1/synthesize method:

The GET /v1/synthesize method accepts the text that is to be synthesized as a required text query parameter. The maximum size of the request is 8 KB, which includes the input text, any SSML that you specify, and the URL and headers.
The POST /v1/synthesize method accepts the text that is to be synthesized as a JSON construct in the required body of the request. The maximum size of the request is 8 KB for the URL and headers, and 5 KB for the input text that is sent in the body of the request. The 5 KB limit includes any SSML that you specify.
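
The input-text limit for POST requests can be checked on the client before a request is sent. The following Python sketch is illustrative only: the helper name is hypothetical, and both the reading of 1 KB as 1024 bytes and the choice to measure the text as UTF-8 bytes are assumptions rather than documented service behavior.

```python
# Hypothetical client-side check; not part of the service API.
POST_TEXT_LIMIT_BYTES = 5 * 1024  # assumption: 1 KB = 1024 bytes


def fits_post_text_limit(text: str, limit: int = POST_TEXT_LIMIT_BYTES) -> bool:
    """Return True if the UTF-8 size of text (including any SSML) is within limit."""
    return len(text.encode("utf-8")) <= limit


print(fits_post_text_limit("Hello world"))  # True
print(fits_post_text_limit("a" * 6000))     # False
```

Measuring bytes rather than characters matters for non-ASCII text, where one character can occupy several bytes.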

The two versions of the /v1/synthesize method have the following parameters in common:

accept (query parameter, optional string)

Specifies the requested audio format, or MIME type, in which the service is to return the audio. You can also specify this value with the HTTP Accept request header. URL-encode the argument to the accept query parameter. By default, the service returns the audio in the format audio/ogg;codecs=opus. For more information, see Using audio formats.

The Ogg audio format is not supported with the Safari browser. If you are using the Text to Speech service with the Safari browser, you must specify a different format in which you want the service to return the audio.

voice (query parameter, optional string)

Specifies the voice in which the text is to be spoken in the audio. Use the /v1/voices method to get the current list of supported voices. Omit the parameter to use the default voice. For more information, see Languages and voices and Using the default voice.

customization_id (query parameter, optional string)

Specifies a globally unique identifier (GUID) for a custom model that is to be used for the synthesis. A specified custom model must match the language of the voice that is used for the synthesis. If you include a customization ID, you must make the request with credentials for the instance of the service that owns the custom model. Omit the parameter to use the specified voice with no customization. For more information, see Understanding customization.

rate_percentage (query parameter, optional integer)

Specifies the global speaking rate for the entire synthesis request. The speaking rate is the speed at which the service speaks the text that it synthesizes into speech. A higher rate causes the text to be spoken more quickly; a lower rate causes the text to be spoken more slowly. The parameter changes the per-voice default rate for an entire request. For more information, see Modifying the speaking rate.

pitch_percentage (query parameter, optional integer)

Specifies the global speaking pitch for the entire synthesis request. The speaking pitch represents the tone of the speech that the service synthesizes. It represents how high or low the tone of the voice is perceived by the listener. A higher pitch results in speech that is spoken at a higher tone; a lower pitch results in speech that is spoken in a lower tone. The parameter changes the per-voice default pitch for an entire request. For more information, see Modifying the speaking pitch.

spell_out_mode (query parameter, optional string)

For German voices, specifies how individual characters of a string are to be spelled out. By default, the service spells out individual characters at the same rate at which it synthesizes text for a language. You can use the parameter to direct the service to spell out individual characters more slowly, in groups of one (singles), two (pairs), or three (triples). For more information, see Specifying how strings are spelled out.

X-Watson-Metadata (request header, optional string)

Associates a customer ID with data that is passed with a request. For more information, see Information security.

X-Watson-Learning-Opt-Out (request header, optional boolean)

IBM Cloud: Indicates whether the service logs request and response data to improve the service for future users. To prevent IBM from accessing your data for general service improvements, specify true for the parameter. Opting out directs IBM not to write any user data (text or audio) for your request to disk. You can also opt out at the account level. For more information, see Request logging.

If you specify an invalid query parameter or JSON field as part of the input to the /v1/synthesize method, the service returns a Warnings response header that describes and lists each invalid argument. The request succeeds despite the warnings.
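
As a rough illustration of how the common query parameters combine, the following Python sketch builds a URL-encoded query string. The rate_percentage value is illustrative only, and the {url} placeholder stands for your service URL, as in the curl examples.

```python
from urllib.parse import quote, urlencode

# Placeholder, as in the curl examples; substitute your service URL.
base_url = "{url}/v1/synthesize"

params = {
    "voice": "en-US_MichaelV3Voice",
    "accept": "audio/ogg;codecs=opus",
    "rate_percentage": 10,  # illustrative value only
}

# quote_via=quote percent-encodes reserved characters such as "/", ";", and "=".
query = urlencode(params, quote_via=quote)
print(query)
# voice=en-US_MichaelV3Voice&accept=audio%2Fogg%3Bcodecs%3Dopus&rate_percentage=10

full_url = f"{base_url}?{query}"
```

Encoding the accept value this way keeps the semicolon and equal sign in the MIME type from being read as part of the query string syntax.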

Specifying input text

Both the POST and GET /v1/synthesize methods accept plain input text or text that is annotated with SSML. The two versions differ primarily in how you specify the text that is to be synthesized. The following examples both pass the plain text Hello world.

The POST /v1/synthesize method accepts input text in the body of the request. You specify the input with a simple JSON construct that includes plain text or SSML. You must also specify a value of application/json for the Content-Type header.

IBM Cloud

curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output hello_world.wav \
--data "{\"text\":\"Hello world\"}" \
"{url}/v1/synthesize?voice=en-US_MichaelV3Voice"

IBM Cloud Pak for Data / IBM Software Hub

curl -X POST \
--header "Authorization: Bearer {token}" \
--header "Content-Type: application/json" \
--header "Accept: audio/wav" \
--output hello_world.wav \
--data "{\"text\":\"Hello world\"}" \
"{url}/v1/synthesize?voice=en-US_MichaelV3Voice"

The GET /v1/synthesize method accepts input text that is specified by the text query parameter. You specify the input as plain text or SSML, both of which must be URL-encoded.

IBM Cloud

curl -X GET -u "apikey:{apikey}" \
--header "Accept: audio/wav" \
--output hello_world.wav \
"{url}/v1/synthesize?text=Hello%20world&voice=en-US_MichaelV3Voice"

IBM Cloud Pak for Data / IBM Software Hub

curl -X GET \
--header "Authorization: Bearer {token}" \
--header "Accept: audio/wav" \
--output hello_world.wav \
"{url}/v1/synthesize?text=Hello%20world&voice=en-US_MichaelV3Voice"

Although the POST and GET methods offer equivalent functionality, it is always more secure to pass input text to the service with the POST method. A POST request passes input in the body of the request. A GET request exposes the data in the URL.
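
For readers who work in Python rather than curl, a POST request like the IBM Cloud example can be assembled with the standard library. This is a minimal sketch, not an official client: the function name, host, and API key are hypothetical placeholders, and the Basic authorization header mirrors what curl sends for -u "apikey:{apikey}". The sketch only builds the request and does not send it.

```python
import base64
import json
import urllib.request


def build_synthesize_request(base_url, apikey, text, voice):
    """Build (but do not send) a POST /v1/synthesize request."""
    # The input text travels in the JSON body, not in the URL.
    body = json.dumps({"text": text}).encode("utf-8")
    token = base64.b64encode(f"apikey:{apikey}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{base_url}/v1/synthesize?voice={voice}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
            "Accept": "audio/wav",
        },
    )


# Hypothetical host and key, for illustration only.
req = build_synthesize_request(
    "https://example.invalid", "my-api-key", "Hello world", "en-US_MichaelV3Voice"
)
print(req.get_method(), req.full_url)
print(req.data)
```

To send the request, pass req to urllib.request.urlopen and write the response bytes to a file such as hello_world.wav.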

Punctuating input text

Write text for synthesis with the punctuation you would use normally. For example, include commas, periods, exclamation points, and question marks as you would in normal writing.

The service considers punctuation when synthesizing text. For example, commas and end-of-sentence punctuation affect the audio by inserting pauses at appropriate places in the resulting synthesized speech. End-of-sentence punctuation such as periods, exclamation points, and question marks also changes the intonation and inflection of the speech. You can also use SSML elements to affect these aspects of the speech.

Specifying SSML input

The Speech Synthesis Markup Language (SSML) is an XML-based markup language that is designed to provide annotations of text for speech synthesis applications such as the Text to Speech service. You can use SSML elements and their attributes to gain greater control over the synthesis and resulting audio output.

For more information about using SSML to annotate input text, see Understanding SSML. For an inventory of all supported elements and attributes, see SSML elements.

Escaping XML control characters

Because you can submit input text that includes XML-based SSML annotations, the service validates all input to ensure that any SSML is correct and well formed. Therefore, you must escape any XML control characters that are present in the input text, regardless of whether the input includes SSML. Use the equivalent escape strings or character encodings from Table 1 instead of the indicated characters.

Table 1. Escaping XML control characters
Character	Escape strings	Character encoding
`"` (double quotes)	`&quot;`	`&#34;`
`'` (apostrophe or single quote)	`&apos;`	`&#39;`
`&` (ampersand)	`&amp;`	`&#38;`
`<` (left angle bracket)	`&lt;`	`&#60;`
`>` (right angle bracket)	`&gt;`	`&#62;`
`/` (forward slash)	None	`&#47;`

For more information about how the service validates input text, see SSML validation.
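
Escaping can be automated instead of done by hand. The following Python sketch uses the standard library's xml.sax.saxutils.escape function, which already handles the ampersand and angle brackets, and adds the two quote characters. The helper name is hypothetical. Apply it only to the literal text portions of the input, not to SSML tags, or the markup itself is escaped.

```python
from xml.sax.saxutils import escape

# escape() converts &, <, and > on its own; the quote characters are added here.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_plain_text(text: str) -> str:
    """Replace XML control characters in literal text with escape strings."""
    return escape(text, QUOTE_ENTITIES)


print(escape_plain_text('"What have I learned?" he asked. "Everything!"'))
# &quot;What have I learned?&quot; he asked. &quot;Everything!&quot;
print(escape_plain_text("Q&A <now>"))
# Q&amp;A &lt;now&gt;
```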

Examples of input text

The following examples show how to specify input text with either method of the HTTP interface. They also show how to escape XML control characters. The examples include line breaks for readability. Do not include the line breaks in actual input.

Example input with a GET request

The following examples pass URL-encoded input with the text query parameter of the GET /v1/synthesize method:

Plain text input:

text=This%20is%20the%20first%20sentence%20of%20the%20paragraph.%20Here
%20is%20another%20sentence.%20Finally,%20this%20is%20the%20last%20sentence.

SSML input:

text=%22%3Cp%3E%3Cs%3EThis%20is%20the%20first%20sentence%20of%20the%20%3C
break%20time=%225s%22/%3E%20paragraph.%3C/s%3E%3Cs%3EHere%20is%20another
%20sentence.%3C/s%3E%3Cs%3EFinally,%20this%20is%20the%20last%20sentence.
%3C/s%3E%3C/p%3E%22
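
URL-encoded values such as these are tedious to produce by hand. As a sketch, Python's urllib.parse.quote can generate the value of the text query parameter. The helper name is hypothetical. With safe="" the function encodes every reserved character, so its output is stricter than the examples in this section, which leave characters such as the comma and forward slash unencoded.

```python
from urllib.parse import quote


def encode_text_param(text: str) -> str:
    """Return a URL-encoded text query parameter for GET /v1/synthesize."""
    # safe="" percent-encodes every reserved character, including "/".
    return "text=" + quote(text, safe="")


print(encode_text_param("Hello world"))  # text=Hello%20world
print(encode_text_param("<s>Hi</s>"))    # text=%3Cs%3EHi%3C%2Fs%3E
```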

Example input with a POST request

The following examples pass input in the body of the POST /v1/synthesize method:

Plain text input:

{
  "text": "This is the first sentence of the paragraph. Here is another
    sentence. Finally, this is the last sentence."
}

SSML input:

{
  "text": "<p><s>This is the first sentence of the <break time=\"5s\"/>
    paragraph.</s><s>Here is another sentence.</s><s>Finally, this is
    the last sentence.</s></p>"
}
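
A JSON library can produce the request body as a single line and escape the double quotes around SSML attribute values automatically. The following Python sketch, which uses an illustrative variable layout, builds the body for the SSML example with json.dumps.

```python
import json

# Adjacent string literals are joined into one string with no line breaks.
ssml = (
    '<p><s>This is the first sentence of the <break time="5s"/> paragraph.</s>'
    "<s>Here is another sentence.</s>"
    "<s>Finally, this is the last sentence.</s></p>"
)

# json.dumps escapes the double quotes inside the SSML as \".
body = json.dumps({"text": ssml})
print(body)
```

The resulting string can be passed as the body of the POST request with a Content-Type of application/json.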

Example input with XML control characters

The following examples send two sentences to the POST /v1/synthesize method. The examples properly escape the embedded XML characters.

"What have I learned?" he asked. "Everything!"

Plain text input:

{
  "text": "&quot;What have I learned?&quot; he asked. &quot;Everything!&quot;"
}

SSML input:

{
  "text": "<s>&quot;What have I learned?&quot; he asked.
    &quot;<prosody rate=\"50\">Everything!</prosody>&quot;</s>"
}
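
Because the service validates that SSML is well formed, a local check before sending can catch simple mistakes such as a mismatched tag name or an unescaped ampersand. The following Python sketch uses the standard library XML parser. The helper name and the wrapper root element are assumptions for illustration, and the check covers only XML well-formedness, not the service's full SSML validation.

```python
import xml.etree.ElementTree as ET


def is_well_formed(ssml: str) -> bool:
    """Return True if the input parses as well-formed XML."""
    # Wrap the input in a throwaway root element so that plain text or
    # several top-level elements still form a single XML document.
    try:
        ET.fromstring(f"<root>{ssml}</root>")
        return True
    except ET.ParseError:
        return False


good = '<s>&quot;Everything!&quot; <prosody rate="50">Yes</prosody></s>'
bad = '<s><prodody rate="50">Yes</prosody></s>'  # opening tag is misspelled

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```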