IBM Cloud Docs

Languages and voices

The IBM Watson® Text to Speech service supports a variety of languages, voices, and dialects. For different languages, the service offers female voices, male voices, or both. Each voice uses appropriate cadence and intonation for its dialect.

All of the service's voices use neural voice technology. Neural voice technology uses multiple Deep Neural Networks (DNNs) to predict the acoustic (spectral) features of the speech. The DNNs are trained on natural human speech and generate the resulting audio from the predicted acoustic features. During synthesis, the DNNs predict the pitch and phoneme duration (prosody), spectral structure, and waveform of the speech. Neural voices produce speech that is crisp and clear, with a very natural-sounding, smooth, and consistent audio quality.

Supported languages and voices

The service offers two types of voices with different qualities and capabilities:

  • Expressive neural voices offer natural-sounding speech that is exceptionally clear and crisp. Their pronunciation and inflections are natural and conversational, and the resulting speech offers extremely smooth transitions between words. They also support the use of additional features that are not available with enhanced neural voices. For a list of all expressive voices, see Expressive neural voices.
  • Enhanced neural voices achieve a high degree of natural-sounding speech and support most service features. For a list of all enhanced neural voices, see Enhanced neural voices.
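Voice names follow a consistent pattern: a language code, an underscore, and a name whose suffix indicates the voice type (Expressive for expressive voices, V3Voice for enhanced voices). As a sketch, a small helper (hypothetical, not part of any IBM SDK) can recover both parts from a voice name:

```python
def parse_voice_name(voice):
    """Split a voice name like 'en-US_AllisonExpressive' into
    its language code and voice type."""
    language, name = voice.split("_", 1)
    if name.endswith("Expressive"):
        voice_type = "expressive"
    elif name.endswith("V3Voice"):
        voice_type = "enhanced"
    else:
        voice_type = "unknown"
    return language, voice_type
```

This kind of parsing is useful, for example, when grouping the voices returned by the service by language or filtering for expressive voices only.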

Language support by type of voice

Table 1 shows the service's support for languages by type of voice. The following topics list the available languages and voices for each voice type.

Table 1. Language support by type of voice
Language                   Expressive neural voices   Enhanced neural voices
Dutch (Netherlands)        -                          Yes
English (United Kingdom)   -                          Yes
English (Australian)       Yes                        -
English (United States)    Yes                        Yes
French (Canadian)          -                          Yes
French (France)            -                          Yes
German                     -                          Yes
Italian                    -                          Yes
Japanese                   -                          Yes
Korean                     -                          Yes
Portuguese (Brazilian)     -                          Yes
Spanish (Castilian)        -                          Yes
Spanish (Latin American)   -                          Yes
Spanish (North American)   -                          Yes

Expressive neural voices

Table 2 lists all available expressive neural voices. The Availability column indicates whether each voice is generally available (GA) for production use or in beta. Each voice is available for IBM Cloud, IBM Cloud Pak for Data, or both.

  • Expressive neural voices support additional features that are not available with other types of voices. These features include additional speaking styles, automatic emphasis of interjections, and emphasis of specified words. For more information, see Modifying speech synthesis with expressive neural voices.
  • When used with the SSML <prosody> element, expressive voices support only percentage values for the rate and pitch attributes. For more information, see The <prosody> element.
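For example, a rate or pitch modification for an expressive voice must be expressed as a percentage; the specific values below are illustrative:

```xml
<speak>
  <prosody rate="-10%" pitch="+5%">
    This sentence is spoken a bit more slowly and at a slightly higher pitch.
  </prosody>
</speak>
```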

Expressive neural voices determine sentiment from context and automatically use the proper intonation to suit the text. To produce the most natural-sounding prosody, expressive neural voices need to consider the context of all words and phrases of a sentence. Expressive voices are therefore more compute-intensive and have slightly higher latency than other types of voices. The initial response for a synthesis request that uses an expressive voice might take a fraction of a second longer (for example, a few hundred milliseconds) to arrive. The total response time for the request to complete is also longer.

To minimize the latency and response time for an expressive voice, use shorter sentences wherever possible.
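One way to follow this advice is to split long input into sentences and synthesize them as separate requests. The sketch below uses a simple punctuation-based split, which may need refinement for abbreviations and decimal numbers:

```python
import re

def split_sentences(text):
    # Split on whitespace that follows sentence-ending punctuation.
    # Shorter inputs reduce the latency of expressive-voice synthesis.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```

Each returned sentence can then be sent as its own synthesis request, so the audio for the first sentence arrives while later sentences are still being processed.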

Table 2. Expressive neural languages and voices
Language                  Availability   Voice                     Gender
English (Australian)      GA             en-AU_HeidiExpressive     Female
English (Australian)      GA             en-AU_JackExpressive      Male
English (United States)   GA             en-US_AllisonExpressive   Female
English (United States)   GA             en-US_EmmaExpressive      Female
English (United States)   GA             en-US_LisaExpressive      Female
English (United States)   GA             en-US_MichaelExpressive   Male

Enhanced neural voices

Table 3 lists all available enhanced neural voices. The Availability column indicates whether each voice is generally available (GA) for production use or in beta. Each voice is available for IBM Cloud, IBM Cloud Pak for Data, or both.

Table 3. Enhanced neural languages and voices
Language                   Availability   Voice                    Gender
Dutch (Netherlands)        Beta           nl-NL_MerelV3Voice       Female
English (United Kingdom)   GA             en-GB_CharlotteV3Voice   Female
English (United Kingdom)   GA             en-GB_JamesV3Voice       Male
English (United Kingdom)   GA             en-GB_KateV3Voice        Female
English (United States)    GA             en-US_AllisonV3Voice     Female
English (United States)    GA             en-US_EmilyV3Voice       Female
English (United States)    GA             en-US_HenryV3Voice       Male
English (United States)    GA             en-US_KevinV3Voice       Male
English (United States)    GA             en-US_LisaV3Voice        Female
English (United States)    GA             en-US_MichaelV3Voice     Male
English (United States)    GA             en-US_OliviaV3Voice      Female
French (Canadian)          GA             fr-CA_LouiseV3Voice      Female
French (France)            GA             fr-FR_NicolasV3Voice     Male
French (France)            GA             fr-FR_ReneeV3Voice       Female
German                     GA             de-DE_BirgitV3Voice      Female
German                     GA             de-DE_DieterV3Voice      Male
German                     GA             de-DE_ErikaV3Voice       Female
Italian                    GA             it-IT_FrancescaV3Voice   Female
Japanese                   GA             ja-JP_EmiV3Voice         Female
Korean                     GA             ko-KR_JinV3Voice         Female
Portuguese (Brazilian)     GA             pt-BR_IsabelaV3Voice     Female
Spanish (Castilian)        GA             es-ES_EnriqueV3Voice     Male
Spanish (Castilian)        GA             es-ES_LauraV3Voice       Female
Spanish (Latin American)   GA             es-LA_SofiaV3Voice       Female
Spanish (North American)   GA             es-US_SofiaV3Voice       Female

The Spanish Latin American and North American Sofia voices are essentially the same voice. The most significant difference concerns how the two voices interpret a $ (dollar sign). The Latin American version uses the term pesos; the North American version uses the term dólares. Other minor differences might also exist between the two voices.

Creating a custom model

When you synthesize text, the service applies language-dependent pronunciation rules to convert the ordinary spelling of each word to a phonetic spelling. The service's pronunciation rules work well for common words, but they can yield imperfect results for unusual words, such as terms with foreign origins, personal names, and abbreviations or acronyms. If your application's lexicon includes such words, you can use the customization interface to specify how the service pronounces them.

A custom model is a dictionary of words and their translations. You create a custom model for a specific language, not for a specific voice. So a custom model can be used with any voice for its specified language. For example, a custom model that you create for the en-US language can be used with any US English voice. It cannot, however, be used with an en-GB or en-AU voice.
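Because every voice name begins with its language code, checking whether a custom model applies to a given voice reduces to a prefix comparison. This helper is a sketch for illustration, not part of the service API:

```python
def model_applies_to_voice(model_language, voice):
    # A custom model for a language (for example, "en-US") can be
    # used with any voice whose name starts with that language code.
    return voice.split("_", 1)[0] == model_language
```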

Customization is available for all languages. All voices support the use of both standard International Phonetic Alphabet (IPA) and IBM Symbolic Phonetic Representation (SPR) phonetic symbols for word customization. For more information, see Understanding customization.
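A word entry in a custom model supplies its pronunciation as phonetic symbols inside a <phoneme> tag in the translation field; the word and IPA string below are illustrative:

```json
{
  "words": [
    {
      "word": "gnocchi",
      "translation": "<phoneme alphabet=\"ipa\" ph=\"ˈnjɒki\"></phoneme>"
    }
  ]
}
```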

Creating a custom voice

IBM Cloud

Premium customers can work with IBM to train a new custom voice for their specific use case and target market. Creating a custom voice is different from customizing one of the service's existing voices. A custom voice is a unique new voice that is based on audio training data that the customer provides. IBM can train a custom voice with as little as one hour of training data.

To request a custom voice or to get more information, complete and submit the IBM Request Form.