Next-generation languages and models

The IBM Watson® Speech to Text service supports a growing collection of next-generation models that improve upon the speech recognition capabilities of the service's previous-generation models. The model indicates the language in which the audio is spoken and the rate at which it is sampled. Next-generation models have higher throughput than the previous-generation models, so the service can return transcriptions more quickly. Next-generation models also provide noticeably better transcription accuracy.

When you use next-generation models, the service analyzes audio bidirectionally. Using deep neural networks, the model analyzes and extracts information from the audio. The model then evaluates the information forwards and backwards to predict the transcription, effectively "listening" to the audio twice.

With the additional information and context afforded by bidirectional analysis, the service can make smarter hypotheses about the words spoken in the audio. Despite the added analysis, recognition with next-generation models is more efficient than with previous-generation models, so the service delivers results faster and with more accuracy. Most next-generation models also offer a low-latency option to receive results even faster, though low latency might impact transcription accuracy.

In addition to providing greater transcription accuracy, the models can hypothesize words that are not in the base language model and that they have not encountered in training. A model does not need to contain a specific vocabulary term to predict that word, which can decrease the need to customize the model for domain-specific terms.

Next-generation model types

The service makes available two types of next-generation models:

  • Telephony models are intended specifically for audio that is communicated over a telephone. Like previous-generation narrowband models, telephony models are intended for audio that has a minimum sampling rate of 8 kHz.
  • Multimedia models are intended for audio that is extracted from sources with a higher sampling rate, such as video. Use a multimedia model for any audio other than telephonic audio. Like previous-generation broadband models, multimedia models are intended for audio that has a minimum sampling rate of 16 kHz.

Choose the model type that most closely matches the source and sampling rate of your audio. The service automatically adjusts the sampling rate of your audio to match the model that you specify. To achieve the best recognition accuracy, also consider the frequency content of your audio. For more information, see Sampling rate and Audio frequency.
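
The choice between the two model types can be sketched as a small helper. This is an illustrative sketch only: the function names and the exact decision rule are assumptions, not part of the service. It relies on the minimum sampling rates stated above (8 kHz for telephony, 16 kHz for multimedia) and on the `<language>_<type>` naming pattern visible in the model tables.

```python
TELEPHONY_MIN_HZ = 8000    # minimum sampling rate for telephony models
MULTIMEDIA_MIN_HZ = 16000  # minimum sampling rate for multimedia models


def choose_model_type(source_is_telephone: bool, sampling_rate_hz: int) -> str:
    """Pick a next-generation model type (hypothetical helper).

    Assumed rule: telephone audio, or any audio sampled below the
    multimedia minimum, uses a telephony model; everything else uses
    a multimedia model.
    """
    if sampling_rate_hz < TELEPHONY_MIN_HZ:
        raise ValueError("audio is sampled below the 8 kHz telephony minimum")
    if source_is_telephone or sampling_rate_hz < MULTIMEDIA_MIN_HZ:
        return "Telephony"
    return "Multimedia"


def model_name(language: str, model_type: str) -> str:
    """Join a language code and model type, for example en-US_Multimedia."""
    return f"{language}_{model_type}"


print(model_name("en-US", choose_model_type(False, 44100)))
```

Not every language has a model of both types, so a name built this way still needs to be checked against the tables of supported models.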

Supported next-generation language models

The following sections list the next-generation models of each type that are available for each language. The tables in the sections provide the following information:

  • The Model name column indicates the name of the model. (Unlike previous-generation models, next-generation models do not include the word Model in their names.)
  • The Low-latency support column indicates whether the model supports the low_latency parameter for speech recognition. For more information, see Low latency.
  • The Status column indicates whether the model is generally available (GA) or beta.

The Model name and Low-latency support columns also indicate the product versions for which the model and low latency are supported. Unless a model or its low-latency support is labeled as IBM Cloud or IBM Cloud Pak for Data, it is supported for both versions of the service.

Telephony models

Table 1 lists the available next-generation telephony models.

Table 1. Next-generation telephony models
Language | Model name | Low-latency support | Status
Arabic (Modern Standard) | ar-MS_Telephony | Yes | GA
Chinese (Mandarin) | zh-CN_Telephony | Yes | GA
Czech | cs-CZ_Telephony | Yes | GA
Dutch (Belgian) | nl-BE_Telephony | Yes | GA
Dutch (Netherlands) | nl-NL_Telephony | Yes | GA
English (Australian) | en-AU_Telephony | Yes | GA
English (Indian) | en-IN_Telephony | Yes | GA
English (United Kingdom) | en-GB_Telephony | Yes | GA
English (United States) | en-US_Telephony | Yes | GA
English (all supported dialects) | en-WW_Medical_Telephony | Yes | Beta
French (Canadian) | fr-CA_Telephony | Yes | GA
French (France) | fr-FR_Telephony | Yes | GA
German | de-DE_Telephony | Yes | GA
Hindi (Indian) | hi-IN_Telephony | Yes | GA
Italian | it-IT_Telephony | Yes | GA
Japanese | ja-JP_Telephony | Yes | GA
Korean | ko-KR_Telephony | Yes | GA
Portuguese (Brazilian) | pt-BR_Telephony | Yes | GA
Spanish (Castilian) | es-ES_Telephony | Yes | GA
Spanish (Argentinian, Chilean, Colombian, Mexican, and Peruvian) | es-LA_Telephony | Yes | GA
Swedish | sv-SE_Telephony | Yes | GA

The Latin American Spanish model, es-LA_Telephony, applies to all Latin American dialects. It is the equivalent of the previous-generation models that are available for the Argentinian, Chilean, Colombian, Mexican, and Peruvian dialects. If you used a previous-generation model for any of these Latin American dialects, use the es-LA_Telephony model to migrate to the equivalent next-generation model.
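
The migration rule for Latin American Spanish can be expressed as a simple lookup. The sketch below is hypothetical: it assumes that a previous-generation model name begins with a dialect code such as es-MX followed by an underscore (for example, es-MX_NarrowbandModel), which this page does not spell out. The helper name is invented for illustration.

```python
# Dialect codes assumed for the Argentinian, Chilean, Colombian,
# Mexican, and Peruvian previous-generation models (an assumption).
LATIN_AMERICAN_DIALECTS = {"es-AR", "es-CL", "es-CO", "es-MX", "es-PE"}


def migrate_spanish_model(previous_model: str) -> str:
    """Map a previous-generation Latin American Spanish model name
    to the single next-generation equivalent, es-LA_Telephony."""
    dialect = previous_model.split("_", 1)[0]
    if dialect in LATIN_AMERICAN_DIALECTS:
        return "es-LA_Telephony"
    raise ValueError(f"not a Latin American Spanish model: {previous_model}")


print(migrate_spanish_model("es-MX_NarrowbandModel"))
```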

Multimedia models

Table 2 lists the available next-generation multimedia models.

Table 2. Next-generation multimedia models
Language | Model name | Low-latency support | Status
Dutch (Netherlands) | nl-NL_Multimedia | Yes | GA
English (Australian) | en-AU_Multimedia | Yes | GA
English (United Kingdom) | en-GB_Multimedia | Yes | GA
English (United States) | en-US_Multimedia | Yes | GA
French (Canadian) | fr-CA_Multimedia | Yes | GA
French (France) | fr-FR_Multimedia | Yes | GA
German | de-DE_Multimedia | Yes | GA
Italian | it-IT_Multimedia | Yes | GA
Japanese | ja-JP_Multimedia | Yes | GA
Korean | ko-KR_Multimedia | Yes | GA
Portuguese (Brazilian) | pt-BR_Multimedia | Yes | GA
Spanish (Castilian) | es-ES_Multimedia | Yes | GA

The English medical telephony model

The beta next-generation en-WW_Medical_Telephony model understands terms from the medical and pharmacological domains. Use the model when you need to transcribe common medical terminology such as medicine names, product brands, medical procedures, illnesses, types of doctor, or COVID-19-related terminology.

Common use cases include conversations between a patient and a medical provider (for example, a doctor, nurse, or pharmacist):

  • "My head is hurting. I need an Ibuprofen, please."
  • "Can you suggest an orthopedist who specializes in osteoarthritis?"
  • "Can you please help me find an internist in Chicago?"

The model is available for all supported English dialects: Australian, Indian, UK, and US. It supports language model customization and grammars as beta functionality. It supports most of the same parameters as the en-US_Telephony model, including smart_formatting for US English audio. In addition to the parameters that are listed as unsupported in Supported features for next-generation models, the model does not support the following parameters: profanity_filter, redaction, and speaker_labels.

Supported features for next-generation models

The next-generation models are supported for use with a large subset of the service's speech recognition features. In cases where a supported feature is restricted to certain languages, the same language restrictions usually apply to both previous- and next-generation models.

Next-generation models support all speech recognition parameters and headers except for the following:

  • acoustic_customization_id (Next-generation models do not support acoustic model customization.)
  • keywords and keywords_threshold
  • processing_metrics and processing_metrics_interval
  • word_alternatives_threshold
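
A client can check a request against this list before sending it. The following is a minimal sketch under stated assumptions: the function name and the idea of validating on the client side are illustrative, not part of the service. The parameter names come from the list above, and the extra restrictions for en-WW_Medical_Telephony come from the description of that model.

```python
from typing import Dict, List

# Parameters that no next-generation model supports.
UNSUPPORTED_NEXT_GEN = frozenset({
    "acoustic_customization_id",
    "keywords",
    "keywords_threshold",
    "processing_metrics",
    "processing_metrics_interval",
    "word_alternatives_threshold",
})

# Additional parameters that the beta medical telephony model does not support.
MEDICAL_EXTRA_UNSUPPORTED = frozenset({
    "profanity_filter",
    "redaction",
    "speaker_labels",
})


def unsupported_parameters(model: str, params: Dict[str, object]) -> List[str]:
    """Return the sorted names of request parameters that the given
    next-generation model does not support (hypothetical helper)."""
    blocked = set(UNSUPPORTED_NEXT_GEN)
    if model == "en-WW_Medical_Telephony":
        blocked |= MEDICAL_EXTRA_UNSUPPORTED
    return sorted(name for name in params if name in blocked)


print(unsupported_parameters("en-US_Telephony",
                             {"keywords": ["colorado"], "low_latency": True}))
```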

Next-generation models also support the following parameters, which are not available with previous-generation models:

  • character_insertion_bias, which is supported by all next-generation models. For more information, see Character insertion bias.
  • low_latency, which is supported by most next-generation models. For more information, see Low latency.
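
As a rough illustration of how these two parameters might accompany a model name in a request, the sketch below builds a query string with Python's standard library. The `/v1/recognize` path and the helper name are assumptions that do not appear on this page; consult the service's API reference for the actual request format and for the accepted range of character_insertion_bias.

```python
from typing import Optional
from urllib.parse import urlencode


def recognize_query(model: str,
                    low_latency: bool = False,
                    character_insertion_bias: Optional[float] = None) -> str:
    """Build a request path and query string for speech recognition.

    The "/v1/recognize" path is an assumption made for illustration.
    Parameters are added only when the caller sets them.
    """
    params = {"model": model}
    if low_latency:
        params["low_latency"] = "true"
    if character_insertion_bias is not None:
        params["character_insertion_bias"] = str(character_insertion_bias)
    return "/v1/recognize?" + urlencode(params)


print(recognize_query("en-US_Telephony", low_latency=True))
```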

Next-generation models also differ from previous-generation models with respect to the following additional features:

  • Next-generation models do not produce hesitation markers. They instead include the actual hesitations in transcription results. For more information, see Speech hesitations and hesitation markers.
  • Next-generation models support automatic capitalization only for German models. Previous-generation models support automatic capitalization only for US English models. For more information, see Capitalization.