IBM Cloud Docs

Languages and voices

The IBM Watson® Text to Speech service supports a variety of languages, voices, and dialects. For different languages, the service offers female voices, male voices, or both. Each voice uses appropriate cadence and intonation for its dialect.

All of the service's voices use neural voice technology. Neural voice technology uses multiple Deep Neural Networks (DNNs) to predict the acoustic (spectral) features of the speech. The DNNs are trained on natural human speech and generate the resulting audio from the predicted acoustic features. During synthesis, the DNNs predict the pitch and phoneme duration (prosody), spectral structure, and waveform of the speech. Neural voices produce speech that is crisp and clear, with a very natural-sounding, smooth, and consistent audio quality.

Supported languages and voices

The service offers two types of voices with different qualities and capabilities:

  • Expressive neural voices offer natural-sounding speech that is exceptionally clear and crisp. Their pronunciation and inflections are natural and conversational, and the resulting speech offers extremely smooth transitions between words. They also support the use of additional features that are not available with enhanced neural voices. For a list of all expressive voices, see Expressive neural voices.
  • Enhanced neural voices achieve a high degree of natural-sounding speech and support most service features. For a list of all enhanced neural voices, see Enhanced neural voices.
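Voice names follow a consistent pattern: a language code, an underscore, and a name whose suffix indicates the voice type (Expressive for expressive voices, V3Voice for enhanced voices). As a sketch, a small helper (hypothetical, not part of any IBM SDK) can recover both parts from a voice name:

```python
def parse_voice_name(voice):
    """Split a voice name like 'en-US_AllisonExpressive' into
    its language code and voice type."""
    language, name = voice.split("_", 1)
    if name.endswith("Expressive"):
        voice_type = "expressive"
    elif name.endswith("V3Voice"):
        voice_type = "enhanced"
    else:
        voice_type = "unknown"
    return language, voice_type
```

This kind of parsing is useful, for example, when grouping the voices returned by the service by language or filtering for expressive voices only.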

Language support by type of voice

Table 1 shows the service's support for languages by type of voice. The following topics list the available languages and voices for each voice type.

Table 1. Language support by type of voice
Language                   Expressive neural voices   Enhanced neural voices
Dutch (Netherlands)        -                          Yes
English (United Kingdom)   -                          Yes
English (Australian)       Yes                        -
English (United States)    Yes                        Yes
French (Canadian)          -                          Yes
French (France)            -                          Yes
German                     -                          Yes
Italian                    -                          Yes
Japanese                   -                          Yes
Korean                     -                          Yes
Portuguese (Brazilian)     -                          Yes
Spanish (Castilian)        -                          Yes
Spanish (Latin American)   -                          Yes
Spanish (North American)   -                          Yes

Expressive neural voices

Table 2 lists all available expressive neural voices. The Availability column indicates whether each voice is generally available (GA) for production use or in beta. Each voice is available for IBM Cloud, IBM Cloud Pak for Data, or both.

  • Expressive neural voices support additional features that are not available with other types of voices. These features include additional speaking styles, automatic emphasis of interjections, and emphasis of specified words. For more information, see Modifying speech synthesis with expressive neural voices.
  • When used with the SSML <prosody> element, expressive voices support only percentage values for the rate and pitch attributes. For more information, see The <prosody> element.
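For example, a rate or pitch modification for an expressive voice must be expressed as a percentage; the specific values below are illustrative:

```xml
<speak>
  <prosody rate="-10%" pitch="+5%">
    This sentence is spoken a bit more slowly and at a slightly higher pitch.
  </prosody>
</speak>
```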

Expressive neural voices determine sentiment from context and automatically use the proper intonation to suit the text. To produce the most natural-sounding prosody, expressive neural voices need to consider the context of all words and phrases of a sentence. Expressive voices are therefore more compute-intensive and have slightly higher latency than other types of voices. The initial response for a synthesis request that uses an expressive voice might take a fraction of a second longer (for example, a few hundred milliseconds) to arrive. The total response time for the request to complete is also longer.

To minimize the latency and response time for an expressive voice, use shorter sentences wherever possible.
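One way to follow this advice is to split long input into sentences and synthesize them as separate requests. The sketch below uses a simple punctuation-based split, which may need refinement for abbreviations and decimal numbers:

```python
import re

def split_sentences(text):
    # Split on whitespace that follows sentence-ending punctuation.
    # Shorter inputs reduce the latency of expressive-voice synthesis.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```

Each returned sentence can then be sent as its own synthesis request, so the audio for the first sentence arrives while later sentences are still being processed.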

Table 2. Expressive neural languages and voices
Language                  Availability   Voice                     Gender
English (Australian)      GA             en-AU_HeidiExpressive     Female
English (Australian)      GA             en-AU_JackExpressive      Male
English (United States)   GA             en-US_AllisonExpressive   Female
English (United States)   GA             en-US_EmmaExpressive      Female
English (United States)   GA             en-US_LisaExpressive      Female
English (United States)   GA             en-US_MichaelExpressive   Male

Enhanced neural voices

Table 3 lists all available enhanced neural voices. The Availability column indicates whether each voice is generally available (GA) for production use or in beta. Each voice is available for IBM Cloud, IBM Cloud Pak for Data, or both.

Table 3. Enhanced neural languages and voices
Language                   Availability   Voice                    Gender
Dutch (Netherlands)        Beta           nl-NL_MerelV3Voice       Female
English (United Kingdom)   GA             en-GB_CharlotteV3Voice   Female
English (United Kingdom)   GA             en-GB_JamesV3Voice       Male
English (United Kingdom)   GA             en-GB_KateV3Voice        Female
English (United States)    GA             en-US_AllisonV3Voice     Female
English (United States)    GA             en-US_EmilyV3Voice       Female
English (United States)    GA             en-US_HenryV3Voice       Male
English (United States)    GA             en-US_KevinV3Voice       Male
English (United States)    GA             en-US_LisaV3Voice        Female
English (United States)    GA             en-US_MichaelV3Voice     Male
English (United States)    GA             en-US_OliviaV3Voice      Female
French (Canadian)          GA             fr-CA_LouiseV3Voice      Female
French (France)            GA             fr-FR_NicolasV3Voice     Male
French (France)            GA             fr-FR_ReneeV3Voice       Female
German                     GA             de-DE_BirgitV3Voice      Female
German                     GA             de-DE_DieterV3Voice      Male
German                     GA             de-DE_ErikaV3Voice       Female
Italian                    GA             it-IT_FrancescaV3Voice   Female
Japanese                   GA             ja-JP_EmiV3Voice         Female
Korean                     GA             ko-KR_JinV3Voice         Female
Portuguese (Brazilian)     GA             pt-BR_IsabelaV3Voice     Female
Spanish (Castilian)        GA             es-ES_EnriqueV3Voice     Male
Spanish (Castilian)        GA             es-ES_LauraV3Voice       Female
Spanish (Latin American)   GA             es-LA_SofiaV3Voice       Female
Spanish (North American)   GA             es-US_SofiaV3Voice       Female

The Spanish Latin American and North American Sofia voices are essentially the same voice. The most significant difference concerns how the two voices interpret a $ (dollar sign). The Latin American version uses the term pesos; the North American version uses the term dólares. Other minor differences might also exist between the two voices.

Creating a custom model

When you synthesize text, the service applies language-dependent pronunciation rules to convert the ordinary spelling of each word to a phonetic spelling. The service's pronunciation rules work well for common words, but they can yield imperfect results for unusual words, such as terms with foreign origins, personal names, and abbreviations or acronyms. If your application's lexicon includes such words, you can use the customization interface to specify how the service pronounces them.

A custom model is a dictionary of words and their translations. You create a custom model for a specific language, not for a specific voice. So a custom model can be used with any voice for its specified language. For example, a custom model that you create for the en-US language can be used with any US English voice. It cannot, however, be used with an en-GB or en-AU voice.
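Because every voice name begins with its language code, checking whether a custom model applies to a given voice reduces to a prefix comparison. This helper is a sketch for illustration, not part of the service API:

```python
def model_applies_to_voice(model_language, voice):
    # A custom model for a language (for example, "en-US") can be
    # used with any voice whose name starts with that language code.
    return voice.split("_", 1)[0] == model_language
```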

Customization is available for all languages. All voices support the use of both standard International Phonetic Alphabet (IPA) and IBM Symbolic Phonetic Representation (SPR) phonetic symbols for word customization. For more information, see Understanding customization.
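A word entry in a custom model supplies its pronunciation as phonetic symbols inside a <phoneme> tag in the translation field; the word and IPA string below are illustrative:

```json
{
  "words": [
    {
      "word": "gnocchi",
      "translation": "<phoneme alphabet=\"ipa\" ph=\"ˈnjɒki\"></phoneme>"
    }
  ]
}
```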

Creating a custom voice

IBM Cloud

Premium customers can work with IBM to train a new custom voice for their specific use case and target market. Creating a custom voice is different from customizing one of the service's existing voices. A custom voice is a unique new voice that is based on audio training data that the customer provides. IBM can train a custom voice with as little as one hour of training data.

To request a custom voice or to get more information, complete and submit the IBM Request Form.