Service features

You can access the speech synthesis capabilities of the IBM Watson® Text to Speech service through an HTTP or WebSocket interface. Both interfaces provide features that let you submit and receive different information from the service. As with all Watson services, SDKs are available to simplify application development in many programming languages.

Using languages and voices

The service supports speech synthesis with voices for the languages listed in Language support. For different languages, the service offers female voices, male voices, or both. Some languages and voices might be supported for IBM Cloud® only.

All of the service's voices use neural voice technology, which produces more natural-sounding speech. The service offers three types of voices: natural, expressive neural, and enhanced neural. Each type has different qualities and features. For information about the types of voices and about the supported languages and voices for each type, see Languages and voices.

Using audio formats

The service can return synthesized audio in the many formats listed in Audio support. For information about the supported audio formats, see Using audio formats.

Synthesizing speech with the service

The Text to Speech service offers an HTTP Representational State Transfer (REST) interface and a WebSocket interface:

The HTTP interface provides both GET and POST versions of the service's /v1/synthesize method. The two versions of the method offer generally equivalent functionality. You pass the text that is to be synthesized as a query parameter with the GET method and as the body of the request with the POST method.
The WebSocket interface provides a /v1/synthesize method. You pass the text that is to be synthesized over an established WebSocket connection.

With both the HTTP and WebSocket interfaces, you specify the language and voice that are to be used, and the format for the audio that is to be returned.
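The difference between the GET and POST versions can be sketched with Python's standard library. This is a minimal illustration, not the service's reference client: the service URL and API key are placeholders, the query parameter names (text, voice, accept), the JSON body shape, the basic-authentication scheme, and the example voice name are assumptions that should be checked against the API & SDK reference.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholders: substitute your own service URL and API key.
SERVICE_URL = "https://example.invalid/text-to-speech/api"
API_KEY = "your-api-key"


def _auth_header(api_key: str) -> str:
    # Assumption: HTTP basic authentication with the user name "apikey".
    token = base64.b64encode(f"apikey:{api_key}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_get_request(text: str, voice: str, audio_format: str) -> Request:
    """GET version: the text travels as a query parameter."""
    # Assumed parameter names: text, voice, accept.
    query = urlencode({"text": text, "voice": voice, "accept": audio_format})
    return Request(
        f"{SERVICE_URL}/v1/synthesize?{query}",
        headers={"Authorization": _auth_header(API_KEY)},
        method="GET",
    )


def build_post_request(text: str, voice: str, audio_format: str) -> Request:
    """POST version: the text travels in the body of the request."""
    query = urlencode({"voice": voice})
    # Assumed body shape: a JSON object with a single "text" field.
    body = json.dumps({"text": text}).encode("utf-8")
    return Request(
        f"{SERVICE_URL}/v1/synthesize?{query}",
        data=body,
        headers={
            "Authorization": _auth_header(API_KEY),
            "Content-Type": "application/json",
            "Accept": audio_format,
        },
        method="POST",
    )
```

Neither function sends anything. To perform the synthesis, you would pass the returned object to urllib.request.urlopen and write the response bytes to an audio file.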

For an overview of the features that are available for speech synthesis, see Using speech synthesis features.
For detailed descriptions and examples of the speech synthesis methods, see the API & SDK reference.

Data limits

The interfaces accept the following maximum amounts of text with a single request:

The HTTP GET /v1/synthesize method accepts a maximum of 8 KB of input, which includes the input text and the URL and headers.
The HTTP POST /v1/synthesize method accepts a maximum of 8 KB for the URL and headers, and a maximum of 5 KB for the input text that is sent in the body of the request.
The WebSocket /v1/synthesize method accepts a maximum of 5 KB of input text.

These limits include all characters of the input, including whitespace.

IBM Cloud: For billing purposes, whitespace characters are not counted. However, all other characters are counted, including those that are part of SSML elements.
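A client can check its input against these limits before it sends a request. The following sketch covers the 5 KB body limit and the billing rule. It assumes that 1 KB means 1024 bytes and that size is measured on the UTF-8 encoding of the text; neither detail is stated here, so treat the numbers as approximate. The function names are hypothetical.

```python
MAX_BODY_BYTES = 5 * 1024  # assumption: 1 KB = 1024 bytes


def input_size_bytes(text: str) -> int:
    # Assumption: size is measured on the UTF-8 encoding of the text.
    # Whitespace counts toward the size limit.
    return len(text.encode("utf-8"))


def fits_body_limit(text: str, limit: int = MAX_BODY_BYTES) -> bool:
    # True if the text fits within the POST body or WebSocket text limit.
    return input_size_bytes(text) <= limit


def billable_characters(text: str) -> int:
    # Whitespace is not counted; everything else, including SSML markup, is.
    return sum(1 for ch in text if not ch.isspace())
```

For example, billable_characters("<speak>Hi there</speak>") counts the markup and the two words but not the space between them.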

Using speech synthesis features

The service supports additional features that you can use to tailor the text that you send and the audio that you receive.

SSML

You can pass the service plain text or text that is annotated with the Speech Synthesis Markup Language (SSML). SSML is an XML-based markup language that provides annotations of text for speech synthesis applications such as the Text to Speech service.

For more information about specifying input text, see Specifying input text.
For more information about using SSML, see Understanding SSML.
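Because SSML is XML, plain text must be escaped before it is wrapped in markup. The sketch below joins sentences with a pause between them. The speak and break elements come from the W3C SSML specification and are assumed here to be accepted by the service; the helper name to_ssml is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ssml(sentences: list[str], pause: str = "500ms") -> str:
    # Escape &, <, and > so that plain text cannot break the XML markup.
    parts = [escape(s) for s in sentences]
    separator = f'<break time="{pause}"/>'
    return f"<speak>{separator.join(parts)}</speak>"


ssml = to_ssml(["Tom & Jerry are here.", "Please stand by."])
ET.fromstring(ssml)  # raises ParseError if the markup is not well formed
```

Parsing the result locally is a cheap way to catch malformed markup before it reaches the service.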

Modifying the speaking rate

To modify the global rate of speech synthesis for a request, you can use the rate_percentage query parameter. The speaking rate is the speed at which the service speaks the text that it synthesizes into speech. A higher rate causes the text to be spoken more quickly; a lower rate causes the text to be spoken more slowly. The parameter changes the per-voice default rate for an entire request. For more information, see Modifying the speaking rate.

The rate_percentage parameter is beta functionality.

Modifying the speaking pitch

To modify the global pitch of speech synthesis for a request, you can use the pitch_percentage query parameter. The speaking pitch is the tone of the speech that the service synthesizes, that is, how high or low the listener perceives the voice to be. A higher pitch results in speech that is perceived as a higher voice; a lower pitch results in speech that is perceived as a lower voice. The parameter changes the per-voice default pitch for an entire request. For more information, see Modifying the speaking pitch.

The pitch_percentage parameter is beta functionality.
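Both rate_percentage and pitch_percentage are query parameters, so they can be added to the same query string. The sketch below assumes that each value is an integer percentage change relative to the voice's default, with negative values meaning slower or lower; confirm the accepted range in the detailed topics. The voice query parameter and the helper name are also assumptions.

```python
from urllib.parse import urlencode


def synthesis_query(voice: str, rate_percentage: int = 0, pitch_percentage: int = 0) -> str:
    params = {"voice": voice}
    # Only send the parameters that differ from the voice's default.
    if rate_percentage:
        params["rate_percentage"] = str(rate_percentage)
    if pitch_percentage:
        params["pitch_percentage"] = str(pitch_percentage)
    return urlencode(params)


# Slightly slower and slightly higher than the default for the voice.
query = synthesis_query("en-US_MichaelV3Voice", rate_percentage=-10, pitch_percentage=5)
```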

Spelling out strings

To indicate how individual characters of a string (alphabetic, numeric, or alphanumeric) are to be spelled out, you can include the spell_out_mode query parameter with a request. By default, the service spells out individual characters at the same rate at which it synthesizes text for a language. You can use the parameter to direct the service to spell out individual characters more slowly, in groups of one, two, or three characters. Use the parameter with the SSML <say-as> element to control how the characters of a string are synthesized. For more information, see Specifying how strings are spelled out.

The spell_out_mode parameter is beta functionality that is supported only for German voices.
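A request that spells out a string therefore has two parts: the query parameter and the SSML that marks the string. The sketch below pairs them. The mode names (default, singles, pairs, triples), the interpret-as value letters, and the German voice name are assumptions that match the groupings described above; verify them in Specifying how strings are spelled out.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

# Assumed values, one per grouping mentioned above.
SPELL_OUT_MODES = {"default", "singles", "pairs", "triples"}


def spell_out_request(code: str, mode: str, voice: str) -> tuple[str, str]:
    if mode not in SPELL_OUT_MODES:
        raise ValueError(f"unknown spell_out_mode: {mode}")
    query = urlencode({"voice": voice, "spell_out_mode": mode})
    # Assumed attribute value: interpret-as="letters".
    ssml = f'<speak><say-as interpret-as="letters">{escape(code)}</say-as></speak>'
    return query, ssml


query, ssml = spell_out_request("AB12CD", "pairs", "de-DE_DieterV3Voice")
```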

Word timings

With the WebSocket interface, you can obtain timing information about the location of words in the audio that the service returns. Timing information is useful for synchronizing the input text and the audio.

You can use the SSML <mark> element to identify specific locations, such as word boundaries, in the audio. For languages other than Japanese, you can also request word timing information for all words of the input text. For more information, see Generating word timings.

Word timings are not supported for natural voices.
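To show how timing information might be used for synchronization, the sketch below builds a request message and parses a response. The message shapes are assumptions made for illustration: a JSON text frame with a timings field in the request, and a words array of word, start, and end entries (in seconds) in the response. The sample frame is invented, not captured from the service. See Generating word timings for the actual format.

```python
import json


def timing_request(text: str, audio_format: str = "audio/wav") -> str:
    # Assumed message shape: a single JSON text frame that asks for word timings.
    return json.dumps({"text": text, "accept": audio_format, "timings": ["words"]})


def parse_word_timings(frame: str) -> list[tuple[str, float, float]]:
    # Assumed response shape: {"words": [[word, start_seconds, end_seconds], ...]}.
    message = json.loads(frame)
    return [(w, float(start), float(end)) for w, start, end in message.get("words", [])]


# Hypothetical response frame, made up for illustration.
sample = '{"words": [["Hello", 0.0, 0.26], ["world", 0.26, 0.65]]}'
timings = parse_word_timings(sample)
```

An application could use the parsed tuples to highlight each word of the input text while the corresponding stretch of audio plays.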

Using speech synthesis features with expressive neural voice

With expressive neural voices, the service supports additional features that modify how the text that you pass is synthesized into audio.

Using speaking styles

The expressive neural voices determine the sentiment of the text from the context of its words and phrases. The speech that they produce, in addition to having a very conversational style, reflects the mood of the text. You can embellish the voices' natural tendencies by indicating that all or some of the text is to emphasize a specific style: cheerful, empathetic, neutral, or uncertain. You use SSML to indicate the style and the text to which it is to be applied. For more information, see Using speaking styles.
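A small helper can restrict the style to the four values named above and wrap the text in the corresponding markup. The element name express-as and its style attribute are assumptions here; confirm the exact syntax in Using speaking styles. The helper name with_style is hypothetical.

```python
from xml.sax.saxutils import escape

# The four styles named in the text above.
STYLES = {"cheerful", "empathetic", "neutral", "uncertain"}


def with_style(text: str, style: str) -> str:
    if style not in STYLES:
        raise ValueError(f"unsupported style: {style}")
    # Assumed element and attribute names; check the SSML documentation.
    return f'<express-as style="{style}">{escape(text)}</express-as>'


ssml = "<speak>" + with_style("Oh, that's good news!", "cheerful") + "</speak>"
```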

Emphasizing interjections

When you use expressive neural voices, the service automatically detects a collection of common interjections based on context. When it synthesizes these interjections, it gives them the natural emphasis that a human would use in normal conversation. For some of the interjections, you can use SSML to enable or disable their emphasis. For more information, see Emphasizing interjections.

Emphasizing words

The expressive neural voices use a conversational style that naturally applies the correct intonation from context. You can still indicate that one or more words are to be given more or less emphasis. The change in stress can be indicated by an increase or decrease in pitch, timing, volume, or other acoustic attributes. For more information, see Emphasizing words.
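As an illustration, the sketch below wraps the first occurrence of a word in an emphasis element. The element and its level values (none, reduced, moderate, strong) come from the W3C SSML specification; that the service accepts all four levels is an assumption, and the helper name is hypothetical.

```python
from xml.sax.saxutils import escape

# Level names from the W3C SSML specification.
LEVELS = {"none", "reduced", "moderate", "strong"}


def emphasize(sentence: str, word: str, level: str = "strong") -> str:
    if level not in LEVELS:
        raise ValueError(f"unknown emphasis level: {level}")
    # Split the sentence around the first occurrence of the word.
    before, found, after = sentence.partition(word)
    if not found:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return (
        f"<speak>{escape(before)}"
        f'<emphasis level="{level}">{escape(word)}</emphasis>'
        f"{escape(after)}</speak>"
    )


ssml = emphasize("I did not say that.", "not")
```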

Customizing the service

The service includes a customization interface that you can use to create custom models for use during speech synthesis. A custom model is a dictionary of words and their translations for a specific language. Each word/translation pair in a model tells the service how to pronounce a word when it occurs in input text.

You can use custom models to create application-specific translations for unusual words for which the service's regular pronunciation rules might yield imperfect pronunciations. For example, your application might routinely encounter domain-specific terms, special terms with foreign origins, personal or geographic names, or abbreviations and acronyms. By using customization, you can define translations that tell the service how you want such terms to be pronounced.

You can define the custom entry for a word/translation pair based on other words, or you can create pronunciations based on phoneme symbols in the standard International Phonetic Alphabet (IPA) or the proprietary IBM Symbolic Phonetic Representation (SPR). Customization is available for all languages.
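The two kinds of entries can be sketched as a small data structure. The field names word and translation follow the terms used above, but the JSON payload shape, the phoneme markup for IPA translations, and the sample pronunciations are assumptions for illustration; see Understanding customization for the actual request format.

```python
import json


def word_entry(word: str, translation: str) -> dict[str, str]:
    # One word/translation pair of a custom model.
    return {"word": word, "translation": translation}


def ipa_translation(phonemes: str) -> str:
    # Assumed markup for a phonetic translation that uses IPA symbols.
    return f'<phoneme alphabet="ipa" ph="{phonemes}"></phoneme>'


entries = [
    # Translation based on other words.
    word_entry("NCAA", "N C double A"),
    # Translation based on IPA phoneme symbols.
    word_entry("tomato", ipa_translation("təˈmeɪtoʊ")),
]

# Assumed payload shape: a JSON object with a "words" array.
payload = json.dumps({"words": entries}, ensure_ascii=False)
```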

For more information about customization, see Understanding customization.
For more information about using phonetic IPA and SPR symbols, see Understanding phonetic symbols.

IBM Cloud: You must have the Standard or Premium pricing plan to use customization. Users of the Lite plan cannot use the customization interface. For more information about pricing plans, see the Text to Speech service in the IBM Cloud® Catalog.

Creating a custom voice

IBM Cloud: Premium customers can work with IBM to train a new custom voice for their specific application needs. A custom voice is a unique voice that is based on audio training data that the customer provides. IBM can train a custom voice with as little as one hour of training data.

To request a custom voice or for more information, complete and submit this IBM Request Form.

Using Tune by Example

The Tune by Example feature lets you control how specified text is spoken by the service. The feature lets you dictate the intonation, cadence, and stress of the synthesized text. You create a custom prompt by providing a sample recording that speaks the text as you want to hear it. The service then duplicates the qualities of the recorded speech with its voices when you synthesize the prompt.

The feature provides a simpler mechanism than standard SSML for modifying how speech is synthesized. Tune by Example eliminates the need for complex SSML by letting you record text as you want it to be spoken rather than requiring you to emulate the intended prosody with SSML.

You can increase the quality of custom prompts by associating speaker models with those users who speak the prompts. You create a speaker model by providing an audio sample of a user's voice. The service trains itself on the voice to help it produce higher-quality prompts for that speaker.

For more information about Tune by Example, custom prompts, and speaker models, see Understanding Tune by Example.

The Tune by Example feature is beta functionality that is supported only for US English custom models and voices.

Using software development kits

SDKs are available for the Text to Speech service to simplify the development of speech applications. The SDKs support many popular programming languages and platforms.

For a complete list of SDKs and links to the SDKs on GitHub, see Watson SDKs.
For more information about all methods of the SDKs for the Text to Speech service, see the API & SDK reference.

Learning more about application development

For more information about working with Watson services and IBM Cloud:

For an introduction, see Getting started with Watson and IBM Cloud.
For information about using IBM Cloud Identity and Access Management, see Authenticating to Watson services.

Next steps

Explore the features introduced in this topic to gain a more in-depth understanding of the service's capabilities. Each feature includes links to topics that describe it in much greater detail.

Using languages and voices and Using audio formats describe the basic underpinnings of the service's capabilities. You must choose a language and voice that are suitable for your text and application, and you must understand the characteristics of the audio the service returns.
Synthesizing speech with the service provides links to detailed presentations of each of the service's interfaces. Experiment with the interfaces to determine which is best suited to your application needs.
Using speech synthesis features briefly describes the features that are available for speech synthesis and provides links for more information. Use the features to tailor the text that you send and the audio that you receive.
Using speech synthesis features with expressive neural voice introduces three additional features that are available for speech synthesis with expressive neural voices.
Customizing the service describes the more advanced topic of customization, which you can use to create custom models that contain dictionaries of words and their translations for specific languages.
Using Tune by Example introduces the Tune by Example feature that lets you create custom prompts. You can control the intonation, cadence, and stress of the synthesized text for your prompts.
Using software development kits provides links to the SDKs that are available to simplify application development in many programming languages.
Learning more about application development provides links to help you get started with Watson services and understand authentication.