Languages and voices
The IBM Watson® Text to Speech service supports a variety of languages, voices, and dialects. For different languages, the service offers female voices, male voices, or both. Each voice uses appropriate cadence and intonation for its dialect.
All of the service's voices use neural voice technology. Neural voice technology uses multiple Deep Neural Networks (DNNs) to predict the acoustic (spectral) features of the speech. The DNNs are trained on natural human speech and generate the resulting audio from the predicted acoustic features. During synthesis, the DNNs predict the pitch and phoneme duration (prosody), spectral structure, and waveform of the speech. Neural voices produce speech that is crisp and clear, with a very natural-sounding, smooth, and consistent audio quality.
Supported languages and voices
The service offers two types of voices with different qualities and capabilities:
- Expressive neural voices offer natural-sounding speech that is exceptionally clear and crisp. Their pronunciation and inflections are natural and conversational, and the resulting speech offers extremely smooth transitions between words. They also support the use of additional features that are not available with enhanced neural voices. For a list of all expressive voices, see Expressive neural voices.
- Enhanced neural voices achieve a high degree of natural-sounding speech and support most service features. For a list of all enhanced neural voices, see Enhanced neural voices.
The following pages provide more information about the voices and their technology:
- For a blog that introduces the expressive voices, see Is your conversational AI setting the right tone?.
- For more information about the service's neural voice technology, see The science behind the service.
Language support by type of voice
Table 1 shows the service's support for languages by type of voice. The following topics list the available languages and voices for each voice type.
Language | Expressive neural voices | Enhanced neural voices |
---|---|---|
Dutch (Netherlands) | | ✔ |
English (United Kingdom) | | ✔ |
English (Australian) | ✔ | |
English (United States) | ✔ | ✔ |
French (Canadian) | | ✔ |
French (France) | | ✔ |
German | | ✔ |
Italian | | ✔ |
Japanese | | ✔ |
Korean | | ✔ |
Portuguese (Brazilian) | | ✔ |
Spanish (Castilian) | | ✔ |
Spanish (Latin American) | ✔ | ✔ |
Spanish (North American) | | ✔ |
Expressive neural voices
Table 2 lists and provides audio samples for all available expressive neural voices. The Availability column indicates whether each voice is generally available (GA) for production use or beta. The column also indicates whether each voice is available for IBM Cloud, IBM Cloud Pak for Data, or both (no product version is cited).
- Expressive neural voices support additional features that are not available with other types of voices. These features include additional speaking styles, automatic emphasis of interjections, and emphasis of specified words. For more information, see Modifying speech synthesis with expressive neural voices.
- When used with the SSML `<prosody>` element, expressive voices support only percentage values for the `rate` and `pitch` attributes. For more information, see The `<prosody>` element.
Expressive neural voices determine sentiment from context and automatically use the proper intonation to suit the text. To produce the most natural-sounding prosody, expressive neural voices need to consider the context of all words and phrases of a sentence. Expressive voices are therefore more compute-intensive and have slightly higher latency than other types of voices. The initial response for a synthesis request that uses an expressive voice might take a fraction of a second longer (for example, a few hundred milliseconds) to arrive. The total response time for the request to complete is also longer.
To minimize the latency and response time for an expressive voice, use shorter sentences wherever possible.
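As a sketch of the percentage-only restriction described above, the following SSML adjusts the rate and pitch of speech for an expressive voice. The text and the specific percentage values are illustrative:

```xml
<speak version="1.0">
  <!-- With expressive voices, rate and pitch accept only relative
       percentage values; absolute values such as "fast" or "150Hz"
       are not supported. -->
  <prosody rate="-10%" pitch="+15%">
    Thanks for calling! How can I help you today?
  </prosody>
</speak>
```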
Language | Availability | Voice / Gender | Audio sample |
---|---|---|---|
English (Australian) | GA | `en-AU_HeidiExpressive` / Female | |
 | GA | `en-AU_JackExpressive` / Male | |
English (United States) | GA | `en-US_AllisonExpressive` / Female | |
 | GA | `en-US_EmmaExpressive` / Female | |
 | GA | `en-US_LisaExpressive` / Female | |
 | GA | `en-US_MichaelExpressive` / Male | |
Spanish (Latin American) | GA | `es-LA_DanielaExpressive` / Female | |
Enhanced neural voices
Table 3 lists and provides audio samples for all available enhanced neural voices. The Availability column indicates whether each voice is generally available (GA) for production use or beta. The column also indicates whether each voice is available for IBM Cloud, IBM Cloud Pak for Data, or both (no product version is cited).
Language | Availability | Voice / Gender | Audio sample |
---|---|---|---|
Dutch (Netherlands) | Beta | `nl-NL_MerelV3Voice` / Female | |
English (United Kingdom) | GA | `en-GB_CharlotteV3Voice` / Female | |
 | GA | `en-GB_JamesV3Voice` / Male | |
 | GA | `en-GB_KateV3Voice` / Female | |
English (United States) | GA | `en-US_AllisonV3Voice` / Female | |
 | GA | `en-US_EmilyV3Voice` / Female | |
 | GA | `en-US_HenryV3Voice` / Male | |
 | GA | `en-US_KevinV3Voice` / Male | |
 | GA | `en-US_LisaV3Voice` / Female | |
 | GA | `en-US_MichaelV3Voice` / Male | |
 | GA | `en-US_OliviaV3Voice` / Female | |
French (Canadian) | GA | `fr-CA_LouiseV3Voice` / Female | |
French (France) | GA | `fr-FR_NicolasV3Voice` / Male | |
 | GA | `fr-FR_ReneeV3Voice` / Female | |
German | GA | `de-DE_BirgitV3Voice` / Female | |
 | GA | `de-DE_DieterV3Voice` / Male | |
 | GA | `de-DE_ErikaV3Voice` / Female | |
Italian | GA | `it-IT_FrancescaV3Voice` / Female | |
Japanese | GA | `ja-JP_EmiV3Voice` / Female | |
Korean | GA | `ko-KR_JinV3Voice` / Female | |
Portuguese (Brazilian) | GA | `pt-BR_IsabelaV3Voice` / Female | |
Spanish (Castilian) | GA | `es-ES_EnriqueV3Voice` / Male | |
 | GA | `es-ES_LauraV3Voice` / Female | |
Spanish (Latin American) | GA | `es-LA_SofiaV3Voice` / Female | |
Spanish (North American) | GA | `es-US_SofiaV3Voice` / Female | |
The Spanish Latin American and North American `Sofia` voices are essentially the same voice. The most significant difference is how the two voices interpret a $ (dollar sign): the Latin American version uses the term *pesos*, while the North American version uses the term *dólares*. Other minor differences might also exist between the two voices.
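To use any voice from these tables, pass its name in the `voice` query parameter of a synthesis request. The following curl sketch assumes `{apikey}` and `{url}` are the credentials for your service instance:

```sh
# Request WAV audio from the North American Sofia voice; the dollar
# amount in the text is spoken as "dólares" by this voice.
curl -X POST -u "apikey:{apikey}" \
  --header "Content-Type: application/json" \
  --header "Accept: audio/wav" \
  --data '{"text": "El total es $20."}' \
  --output sofia.wav \
  "{url}/v1/synthesize?voice=es-US_SofiaV3Voice"
```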
Creating a custom model
When you synthesize text, the service applies language-dependent pronunciation rules to convert the ordinary spelling of each word to a phonetic spelling. The service's pronunciation rules work well for common words, but they can yield imperfect results for unusual words, such as terms with foreign origins, personal names, and abbreviations or acronyms. If your application's lexicon includes such words, you can use the customization interface to specify how the service pronounces them.
A custom model is a dictionary of words and their translations. You create a custom model for a specific language, not for a specific voice. So a custom model can be used with any voice for its specified language. For example, a custom model that you create for the `en-US` language can be used with any US English voice. It cannot, however, be used with an `en-GB` or `en-AU` voice.
Customization is available for all languages. All voices support the use of both standard International Phonetic Alphabet (IPA) and IBM Symbolic Phonetic Representation (SPR) phonetic symbols for word customization. For more information, see Understanding customization.
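As a minimal sketch of the customization workflow (again assuming `{apikey}` and `{url}` are your service credentials, and using an illustrative model name and word), you might create an `en-US` model and then add a word with an IPA translation:

```sh
# Create an empty custom model; the response includes its customization_id.
curl -X POST -u "apikey:{apikey}" \
  --header "Content-Type: application/json" \
  --data '{"name": "Product terms", "language": "en-US"}' \
  "{url}/v1/customizations"

# Add or update a single word in the model with an IPA pronunciation.
curl -X PUT -u "apikey:{apikey}" \
  --header "Content-Type: application/json" \
  --data '{"translation": "<phoneme alphabet=\"ipa\" ph=\"təˈmɑtoʊ\"></phoneme>"}' \
  "{url}/v1/customizations/{customization_id}/words/tomato"
```

To apply the model during synthesis, add `customization_id={customization_id}` to the `/v1/synthesize` query string along with the `voice` parameter.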
Creating a custom voice
IBM Cloud Premium customers can work with IBM to train a new custom voice for their specific use case and target market. Creating a custom voice is different from customizing one of the service's existing voices. A custom voice is a unique new voice that is based on audio training data that the customer provides. IBM can train a custom voice with as little as one hour of training data.
To request a custom voice or for more information, complete and submit the IBM Request Form.