Usage FAQs

FAQs for IBM Watson® Text to Speech include questions about speech synthesis, supported languages, audio formats, and other topics. To find all FAQs for IBM Cloud®, see the FAQ library.

How do I access my service credentials?

How you access your service credentials depends on whether you are using Text to Speech with IBM Cloud® or IBM Cloud Pak® for Data. For more information about obtaining your credentials for both versions, see Before you begin in the getting started tutorial.

Once you have your service credentials, see the following topics for information about authenticating to the service:

What languages does the service support?

The Text to Speech service supports male and female voices in various spoken languages:

The service offers expressive neural voices for English (Australian and United States).
The service offers enhanced neural voices for Dutch (Netherlands), English (United Kingdom and United States), French (Canadian and France), German, Italian, Japanese, Korean, Portuguese (Brazilian), and Spanish (Castilian, Latin American, and North American).

Some languages and voices are available only for IBM Cloud®, not for IBM Cloud Pak® for Data. For more information about the available voices for all languages, see Languages and voices.

How does the service synthesize audio?

The Text to Speech service offers voices that rely on neural technology to synthesize text to speech. The topic of synthesizing text to speech is inherently complex. For more information, see

What are the output audio formats?

By default, the Text to Speech service returns audio in Ogg format with the Opus codec (audio/ogg;codecs=opus). The service supports many other audio formats to suit your application needs. For more information, see Supported audio formats.
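
As a rough illustration of requesting a non-default format, the sketch below builds a URL for the GET /v1/synthesize method. It assumes the method accepts `accept` and `text` query parameters; the base URL is a hypothetical placeholder, and `audio/wav` is used only as an example format. Check Supported audio formats for the values the service actually accepts.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; substitute the URL from your own service credentials.
SERVICE_URL = "https://example.invalid/text-to-speech"

# Default format named in the answer above.
DEFAULT_FORMAT = "audio/ogg;codecs=opus"


def synthesize_get_url(text, audio_format=DEFAULT_FORMAT, base_url=SERVICE_URL):
    """Build a GET /v1/synthesize URL that asks for a specific audio format.

    Assumption: the format is passed in an `accept` query parameter.
    """
    query = urlencode({"accept": audio_format, "text": text})
    return f"{base_url}/v1/synthesize?{query}"


# Request WAV audio instead of the default Ogg/Opus.
print(synthesize_get_url("Hello world", "audio/wav"))
```

The URL is only constructed here, not sent, so the sketch runs without credentials or network access.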

How do I convert my text to speech?

To submit text to the service for synthesized audio output, you make an HTTP or WebSocket request. You can use the API directly or use one of the Watson SDKs. Getting started offers examples of both the HTTP POST /v1/synthesize and GET /v1/synthesize methods. The API & SDK reference shows examples of all interfaces and methods.

There is no graphical user interface for submitting text. See the Text to Speech demo to try an example of the service in action. The demo accepts a small amount of your text as input to generate speech with different voices.
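
For readers who call the API directly, the following minimal sketch prepares an HTTP POST /v1/synthesize request with Python's standard library. Several details are assumptions rather than statements from this FAQ: the JSON body with a `text` field, the `voice` query parameter, the example voice name, basic authentication with the user name `apikey`, and the placeholder service URL. Confirm each against the API & SDK reference before relying on it.

```python
import base64
import json
import urllib.request
from urllib.parse import urlencode


def build_synthesize_request(service_url, api_key, text, voice, audio_format="audio/wav"):
    """Prepare (but do not send) a POST /v1/synthesize request."""
    # Assumption: HTTP basic auth with the literal user name "apikey".
    token = base64.b64encode(f"apikey:{api_key}".encode("utf-8")).decode("ascii")
    # Assumption: the voice is selected with a `voice` query parameter.
    query = urlencode({"voice": voice})
    return urllib.request.Request(
        f"{service_url}/v1/synthesize?{query}",
        data=json.dumps({"text": text}).encode("utf-8"),  # assumed body shape
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
            "Accept": audio_format,  # requested output audio format
        },
    )


request = build_synthesize_request(
    "https://example.invalid/text-to-speech",  # hypothetical service URL
    "my-api-key",                              # hypothetical credential
    "Hello world",
    "en-US_AllisonV3Voice",                    # example voice name
)

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request) as response:
#     with open("hello_world.wav", "wb") as audio_file:
#         audio_file.write(response.read())
```

The Watson SDKs wrap these details, so an SDK is usually the simpler choice when one exists for your language.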

Can I change how the service interprets input text and produces synthesized audio?

You can use the Speech Synthesis Markup Language (SSML) to control aspects of the synthesis process such as pronunciation, volume, pitch, speed, and other attributes. You can also use the Tune by Example feature to tailor the prosody, intonation, and cadence of custom prompts to better suit your application needs.

For general information about SSML, see Understanding SSML.
For information about the supported SSML elements, see SSML elements.
For information about the Tune by Example feature, see Understanding Tune by Example.
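
To make the SSML option concrete, here is a small sketch that assembles an SSML string that slows one phrase and inserts a pause. The `prosody` and `break` elements come from the W3C SSML specification; whether a particular voice honors each attribute is an assumption to verify in SSML elements. The helper name is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def slow_with_pause(first, second, pause_ms=500):
    """Return SSML that speaks `first` slowly, pauses, then speaks `second`."""
    return (
        '<speak version="1.0">'
        f'<prosody rate="slow">{escape(first)}</prosody>'  # slower speaking rate
        f'<break time="{pause_ms}ms"/>'                    # explicit pause
        f"{escape(second)}"
        "</speak>"
    )


ssml = slow_with_pause("Terms & conditions apply.", "Thank you.")

# Parsing confirms the markup is well-formed XML before it is submitted;
# ET.fromstring raises ParseError otherwise.
root = ET.fromstring(ssml)
print(ssml)
```

Escaping the text matters because characters such as `&` and `<` would otherwise produce invalid markup.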

What programming languages can I use?

The service supports SDKs in many popular programming languages and platforms.

For more information about the SDKs and links to them on GitHub, see Watson SDKs.
For more information about all methods of the SDKs for the Text to Speech service, see the API & SDK reference.

What is the maximum amount of text that I can submit for synthesis?

You can submit the following maximum amount of text for a speech synthesis request with each of the service's methods:

HTTP GET /v1/synthesize method - Maximum of 8 KB of total input, which includes the input text, SSML, and the URL and headers.
HTTP POST /v1/synthesize method - Maximum of 8 KB for the URL and headers. Maximum of 5 KB for the input text, including SSML.
WebSocket /v1/synthesize method - Maximum of 5 KB of input text, including SSML.

All characters of the input, including whitespace and those that are part of SSML elements, are counted toward the data maximum. For billing purposes, whitespace characters are not counted. For more information, see Data limits.
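
A client can check the input size before sending it. The sketch below assumes that 1 KB means 1024 bytes and that the limit is measured on the UTF-8 encoding of the input; neither is stated here, so treat both as assumptions and see Data limits for the authoritative definition. The second helper applies only the whitespace rule described above and makes no claim about how SSML markup is billed.

```python
# Assumption: "5 KB" is 5 * 1024 bytes of UTF-8 encoded input.
POST_TEXT_LIMIT_BYTES = 5 * 1024


def fits_post_limit(text):
    """True if the input text (including any SSML) is within the assumed limit."""
    return len(text.encode("utf-8")) <= POST_TEXT_LIMIT_BYTES


def non_whitespace_count(text):
    """Count characters that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


sample = "Hello  world\n"
print(fits_post_limit(sample), non_whitespace_count(sample))
```

Text that exceeds the limit would need to be split into several requests, ideally at sentence boundaries so that each piece is synthesized naturally.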

How does customization work?

The customization interface of the Text to Speech service creates a dictionary of words and their translations for a specific language. This dictionary is referred to as a custom model. For more information, see Understanding customization.
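
Conceptually, a custom model maps each word to the translation that the service should speak in its place. The sketch below shows one way such a dictionary might be serialized for a request. The `words`, `word`, and `translation` field names are assumptions about the request format, and the sample entry is purely illustrative; see Creating and managing custom entries for the actual structure.

```python
import json


def custom_entries_payload(entries):
    """Serialize a {word: translation} mapping as a JSON request body.

    Assumption: entries are sent as a list of objects under a "words" key.
    """
    return json.dumps(
        {"words": [{"word": w, "translation": t} for w, t in entries.items()]}
    )


# Illustrative entry: pronounce the acronym as spoken words.
payload = custom_entries_payload({"IEEE": "I triple E"})
print(payload)
```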

How do I create a custom model?

Review the guidelines for working with the customization interface before you begin. Then, see the steps and examples for creating, querying, updating, and deleting custom models in Creating and managing custom models. Also review Creating and managing custom entries for examples and guidance about adding relevant training data.

Can I create a custom voice?

IBM Cloud

As a premium customer, you can work with IBM to train a new custom voice for your specific use case and target market. Creating a custom voice is different from customizing one of the service's existing voices. A custom voice is a unique new voice that is based on audio training data that the customer provides. IBM can train a custom voice with as little as one hour of training data.

To request a custom voice or for more information, complete and submit this IBM Request Form.

How do I use the Tune by Example feature?

Tune by Example lets you control exactly how specified text is spoken by the service. You provide text and spoken audio to add a custom prompt to a custom model. The spoken audio can stress different syllables or words, introduce pauses, and generally make the synthesized audio sound more natural and appropriate for its context. When you synthesize the prompt, the service duplicates the qualities of the recorded speech with its voices.

You can further enhance the quality of a prompt by creating an optional speaker model that contains a sample of a speaker's voice. The service leverages the sample audio to train itself on the voice, which can help it produce higher-quality prompts for that speaker.

For more information, see Understanding Tune by Example.

What limits exist for a custom model?

The following limits apply to all custom models:

A word in a custom entry can contain a maximum of 49 characters.
A translation in a custom entry can contain a maximum of 499 characters.
A custom model can include a maximum of 20,000 custom entries.
A custom model can include a maximum of 1,000 custom prompts.

For more information, see Rules for creating custom entries.
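
The entry limits listed above can be checked on the client before entries are submitted. This is a minimal sketch with hypothetical helper and constant names. It covers only the word, translation, and entry-count limits, and it measures length in Python characters, which may differ from how the service counts.

```python
# Limits taken from the list above.
MAX_WORD_CHARS = 49
MAX_TRANSLATION_CHARS = 499
MAX_ENTRIES = 20_000


def check_entries(entries):
    """Return a list of human-readable problems for a {word: translation} mapping."""
    problems = []
    if len(entries) > MAX_ENTRIES:
        problems.append(f"{len(entries)} entries exceed the maximum of {MAX_ENTRIES}")
    for word, translation in entries.items():
        if len(word) > MAX_WORD_CHARS:
            problems.append(f"word {word!r} is longer than {MAX_WORD_CHARS} characters")
        if len(translation) > MAX_TRANSLATION_CHARS:
            problems.append(
                f"translation for {word!r} is longer than {MAX_TRANSLATION_CHARS} characters"
            )
    return problems


# An empty list means every checked limit is satisfied.
print(check_entries({"IEEE": "I triple E"}))
```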

Where can I find plans and pricing information?

IBM Cloud

The Text to Speech service offers multiple pricing plans. For more information about pricing, see the Text to Speech service in the IBM Cloud Catalog.