Working with corpora and custom words for previous-generation models
This information is specific to custom models that are based on previous-generation models. For information about corpora and custom words for custom models that are based on next-generation models, see Working with corpora and custom words for next-generation models.
You can populate a custom language model with words by adding corpora or grammars to the model, or by adding custom words directly:
- Corpora - The recommended means of populating a custom language model with words is to add one or more corpora to the model. When you add a corpus, the service analyzes the file and automatically adds any new words that it finds to the custom model. Adding a corpus to a custom model allows the service to extract domain-specific words in context, which helps ensure better transcription results. For more information, see Working with Corpora.
- Grammars - You can add grammars to a custom model to limit speech recognition to the words or phrases that are recognized by a grammar. When you add a grammar to a model, the service automatically adds any new words that it finds to the model, just as it does with corpora. For more information, see Using grammars with custom language models.
- Individual words - You can also add individual custom words to a model directly. The service adds the words to the model just as it does words that it discovers from corpora or grammars. When you add a word directly, you can specify multiple pronunciations and indicate how the word is to be displayed. You can also update existing words to modify or augment the definitions that were extracted from corpora or grammars. For more information, see Working with custom words.
Regardless of how you add them, the service stores all words that you add to a custom language model in the model's words resource.
The words resource
The words resource includes all words that you add from corpora, from grammars, or directly. Its purpose is to define the pronunciation and spelling of words that are not already present in the service's base vocabulary. The definitions tell the service how to transcribe these out-of-vocabulary (OOV) words.
The words resource contains the following information about each OOV word. The service creates the definitions for words that are extracted from corpora and grammars. You specify the characteristics for words that you add or modify directly.
- word - The spelling of the word as found in a corpus or grammar or as added by you. Do not use characters that need to be URL-encoded. For example, do not use spaces, slashes, backslashes, colons, ampersands, double quotes, plus signs, equals signs, question marks, etc. in the name. The service does not prevent the use of these characters, but because they must be URL-encoded wherever they are used, it is strongly discouraged.
- sounds_like - The pronunciation of the word. For words extracted from corpora and grammars, the value represents how the service believes that the word is pronounced based on its language rules. In many cases, the pronunciation reflects the spelling of the word field. You can use the sounds_like field to modify the word's pronunciation. You can also use the field to specify multiple pronunciations for a word. For more information, see Using the sounds_like field.
- display_as - The spelling of the word that the service uses in transcripts. The field indicates how the word is to be displayed. In most cases, the spelling matches the value of the word field. You can use the display_as field to specify a different spelling for the word. For more information, see Using the display_as field.
- source - How the word was added to the words resource. If the service extracted the word from a corpus or grammar, the field lists the name of that resource. Because the service can encounter the same word in multiple resources, the field can list multiple corpus or grammar names. The field includes the string user if you add or modify the word directly.
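Before you add a word, you can check its spelling locally for characters that would need URL-encoding. The following Python sketch is one way to do that; the function name is hypothetical, and the check is deliberately conservative: it flags every ASCII character that the standard library's quote function would percent-encode (which includes apostrophes) and leaves non-ASCII letters such as ç alone.

```python
from urllib.parse import quote


def chars_needing_url_encoding(word):
    """Return the ASCII characters in `word` that would be percent-encoded.

    Hypothetical helper. Non-ASCII letters are ignored on purpose, because
    words such as "garçon" are legitimate in a words resource.
    """
    return sorted(
        {ch for ch in word if ch.isascii() and quote(ch, safe="") != ch}
    )


if __name__ == "__main__":
    for candidate in ["NCAA", "AT&T", "read/write mode", "garçon"]:
        print(candidate, chars_needing_url_encoding(candidate))
```

An empty list means that the spelling contains none of the discouraged characters.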
After adding or modifying a word in a model's words resource, it is important that you verify the correctness of the word's definition; for more information, see Validating a words resource for previous-generation models. You must also train the model for the changes to take effect during transcription; for more information, see Train the custom language model.
How much data do I need?
Many factors contribute to the amount of data that you need for an effective custom language model. It is not possible to indicate the exact number of words that you need to add for any custom model or application. Depending on the use case, even adding a few words directly to a custom model can improve the model's quality. But adding OOV words from a corpus that shows the words in the context in which they are used in audio can greatly improve transcription accuracy.
The service limits the number of words that you can add to a custom language model:
- You can add a maximum of 90 thousand OOV words to the words resource of a custom model. This figure includes OOV words from all sources (corpora, grammars, and individual custom words that you add directly).
- You can add a maximum of 10 million total words to a custom model from all sources. This figure includes all words, both OOV words and words that are already part of the service's base vocabulary, that are included in corpora or grammars. For corpora, the service uses these additional words to learn the context in which OOV words can appear, which is why corpora are a more effective means of improving recognition accuracy.
A large words resource can increase the latency of speech recognition, but the exact effect is difficult to quantify or predict. As with the amount of data that is needed to produce an effective custom model, the performance impact of a large words resource depends on many factors. Test your custom model with different amounts of data to determine the performance of your models and data.
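To get a rough sense of where a corpus stands against these limits before you upload it, you can count its words locally. The following Python sketch assumes simple whitespace tokenization, which is not how the service tokenizes, so treat the numbers as estimates. The function name is hypothetical. Because the base vocabulary is not available locally, the count of distinct tokens is only an upper bound on the number of OOV words.

```python
MAX_TOTAL_WORDS = 10_000_000  # total words from all sources (from the limits above)
MAX_OOV_WORDS = 90_000        # OOV words from all sources (from the limits above)


def corpus_word_budget(lines):
    """Estimate word counts for an iterable of corpus lines.

    Hypothetical helper: splits on whitespace only and is case-sensitive,
    so "GAAP" and "gap" count as different tokens.
    """
    total = 0
    distinct = set()
    for line in lines:
        for token in line.split():
            total += 1
            distinct.add(token)
    return {
        "total_words": total,
        "distinct_tokens": len(distinct),
        "within_total_limit": total <= MAX_TOTAL_WORDS,
        # Distinct tokens can only overestimate the number of OOV words.
        "within_oov_limit": len(distinct) <= MAX_OOV_WORDS,
    }


if __name__ == "__main__":
    sample = ["GAAP provides guidelines", "the gap between them is small"]
    print(corpus_word_budget(sample))
```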
Working with corpora for previous-generation models
You use the POST /v1/customizations/{customization_id}/corpora/{corpus_name} method to add a corpus to a custom model. A corpus is a plain text file that contains sample sentences from your domain. The following example shows an abbreviated corpus for the healthcare domain. A corpus file is typically much longer.
Am I at risk for health problems during travel?
Some people are more likely to have health problems when traveling outside the United States.
How Is Coronary Microvascular Disease Treated?
If you're diagnosed with coronary MVD and also have anemia, you may benefit from treatment for that condition.
Anemia is thought to slow the growth of cells needed to repair damaged blood vessels.
What causes autoimmune hepatitis?
A combination of autoimmunity, environmental triggers, and a genetic predisposition can lead to autoimmune hepatitis.
What research is being done for Spinal Cord Injury?
The National Institute of Neurological Disorders and Stroke NINDS conducts spinal cord research in its laboratories at the National Institutes of Health NIH.
NINDS also supports additional research through grants to major research institutions across the country.
Some of the more promising rehabilitation techniques are helping spinal cord injury patients become more mobile.
What is Osteogenesis imperfecta OI?
. . .
Speech recognition relies on statistical algorithms to analyze audio. Words from a custom model are in competition with words from the service's base vocabulary as well as other words of the model. (Factors such as audio noise and speaker accents also affect the quality of transcription.)
The accuracy of transcription can depend largely on how words are defined in a model and how speakers say them. To improve the service's accuracy, use corpora to provide as many examples as possible of how OOV words are used in the domain. Repeating the OOV words in corpora can improve the quality of the custom language model. How you duplicate the words in corpora depends on how you expect users to say them in the audio that is to be recognized. The more sentences that you add that represent the context in which speakers use words from the domain, the better the service's recognition accuracy.
The service does not apply a simple word-matching algorithm. Its transcription depends on the context in which words are used. When it parses a corpus, the service includes information about n-grams (bi-grams, tri-grams, and so on) from the sentences of the corpus in the custom model. This information helps the service transcribe audio with greater accuracy, and it explains why training a custom model on corpora is more valuable than training it on custom words alone.
For example, accountants adhere to a common set of standards and procedures that are known as Generally Accepted Accounting Principles (GAAP). When you create a custom model for a financial domain, provide sentences that use the term GAAP in context. The sentences help the service distinguish between general phrases such as "the gap between them is small" and domain-centric phrases such as "GAAP provides guidelines for measuring and disclosing financial information."
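To make the idea of n-gram context concrete, the following Python sketch counts the bi-grams in the two example phrases. It is only an illustration of what n-gram information is, under the assumption of whitespace tokenization; it is not the service's parser, and the function name is hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, n=2):
    """Count word n-grams in a list of sentences (whitespace tokenization)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    corpus = [
        "the gap between them is small",
        "GAAP provides guidelines for measuring and disclosing financial information",
    ]
    bigrams = count_ngrams(corpus, n=2)
    # The neighbors of "gap" and "GAAP" differ, which is the kind of context
    # that sentences in a corpus provide and isolated custom words do not.
    print(bigrams[("gap", "between")], bigrams[("GAAP", "provides")])
```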
In general, it is better for corpora to use words in different contexts, which can improve how the service learns the words. However, if users speak the words in only a couple of contexts, then showing the words in other contexts does not improve the quality of the custom model, because speakers never use the words in those contexts. If speakers are likely to use the same phrase frequently, then repeating that phrase in the corpora can improve the quality of the model. (In some cases, even adding a few custom words directly to a custom model can make a positive difference.)
Preparing a corpus text file
Follow these guidelines to prepare a corpus text file:
-
Provide a plain text file that is encoded in UTF-8 if it contains non-ASCII characters. The service assumes UTF-8 encoding if it encounters such characters.
Make sure that you know the character encoding of your corpus text files. The service preserves the encoding that it finds in the text files. You must use that same encoding when working with custom words in the custom model. For more information, see Character encoding for custom words.
-
Use consistent capitalization for words in the corpus. The words resource is case-sensitive. Mix upper- and lowercase letters and use capitalization only when intended.
-
Include each sentence of the corpus on its own line, and terminate each line with a carriage return. Including multiple sentences on the same line can degrade accuracy.
-
Add personal names as discrete units on separate lines. Do not add the individual elements of a name on separate lines or as individual custom words, and do not include multiple names on the same line of a corpus. The following example shows the correct way to improve recognition accuracy for three names:
Gakuto Kutara
Sebastian Leifson
Malcolm Ingersol
Include additional contextual information where appropriate, for example, Doctor Sebastian Leifson or President Malcolm Ingersol. As with all words, duplicating the names multiple times and, if possible, in different contexts can improve recognition accuracy.
-
Beware of typographical errors. The service assumes that typographical errors are new words. Unless you correct them before you train the model, the service adds them to the model's vocabulary. Remember the adage Garbage in, garbage out!
-
More sentences result in better accuracy. But the service does limit a model to a maximum of 10 million total words and 90 thousand OOV words from all sources combined.
The service cannot generate a pronunciation for all words. After adding a corpus, you must validate the words resource to ensure that each OOV word's definition is complete and valid. For more information, see Validating a words resource for previous-generation models.
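Some of these guidelines can be checked locally before you upload a file. The following Python sketch is a rough lint pass, not a complete validator: it confirms that the bytes decode as UTF-8, flags lines that might hold more than one sentence, and flags text that looks like leftover markup. The function name and the heuristics are assumptions. The sentence check in particular produces false positives for abbreviations such as Dr. or for spelled-out letters such as N. C. A. A.

```python
import re


def lint_corpus_text(raw_bytes):
    """Return a list of (line_number, message) warnings for a corpus file.

    Hypothetical helper with heuristic checks only. Line number 0 is used
    for problems that affect the whole file.
    """
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return [(0, "file is not valid UTF-8")]

    warnings = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue
        # Sentence-ending punctuation followed by more text suggests that
        # several sentences share one line.
        if re.search(r"[.!?]\s+\S", stripped):
            warnings.append((number, "possibly more than one sentence on this line"))
        # Angle-bracket pairs often indicate HTML tags left in the text.
        if re.search(r"<[^>]+>", stripped):
            warnings.append((number, "looks like leftover markup"))
    return warnings


if __name__ == "__main__":
    sample = "What causes autoimmune hepatitis?\nAnemia is common. It slows growth.\n"
    print(lint_corpus_text(sample.encode("utf-8")))
```

Typographical errors still need a manual review or a spelling checker; this sketch does not look for them.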
What happens when I add a corpus file?
When you add a corpus file, the service analyzes the file's contents. It extracts any new (OOV) words that it finds and adds each OOV word to the custom model's words resource. To distill the most meaning from the content, the service tokenizes and parses the data that it reads from a corpus file. The following sections describe how the service parses a corpus file for each supported language.
Parsing of Dutch, English, French, German, Italian, Portuguese, and Spanish
The following descriptions apply to all supported dialects of Dutch, English, French, German, Italian, Portuguese, and Spanish.
-
Converts numbers to their equivalent words.
Table 1. Examples of number conversion

| Language | Whole number | Decimal number |
|---|---|---|
| Dutch | 500 becomes vijfhonderd | 0,15 becomes nul komma vijftien |
| English | 500 becomes five hundred | 0.15 becomes zero point fifteen |
| French | 500 becomes cinq cents | 0,15 becomes zéro virgule quinze |
| German | 500 becomes fünfhundert | 0,15 becomes null punkt fünfzehn |
| Italian | 500 becomes cinquecento | 0,15 becomes zero virgola quindici |
| Portuguese | 500 becomes quinhentos | 0,15 becomes zero ponto quinze |
| Spanish | 500 becomes quinientos | 0,15 becomes cero coma quince |
-
Converts tokens that include certain symbols to meaningful string representations. These examples are not exhaustive. The service makes similar adjustments for other characters as needed. (For Spanish, if the dialect is es-LA, $100 and 100$ become cien pesos.)
Table 2. Examples of symbol conversion

| Language | A dollar sign and a number | A euro sign and a number | A percent sign and a number |
|---|---|---|---|
| Dutch | $100 becomes honderd dollar | €100 becomes honderd euro | 100% becomes honderd procent |
| English | $100 becomes one hundred dollars | €100 becomes one hundred euros | 100% becomes one hundred percent |
| French | $100 becomes cent dollars | €100 becomes cent euros | 100% becomes cent pour cent |
| German | $100 and 100$ become einhundert dollar | €100 and 100€ become einhundert euro | 100% becomes einhundert prozent |
| Italian | $100 becomes cento dollari | €100 becomes cento euro | 100% becomes cento per cento |
| Portuguese | $100 and 100$ become cem dólares | €100 and 100€ become cem euros | 100% becomes cem por cento |
| Spanish | $100 and 100$ become cien dólares | €100 and 100€ become cien euros | 100% becomes cien por ciento |
-
Processes non-alphanumeric, punctuation, and special characters depending on their context. For example, the service removes a $ (dollar sign) or € (euro symbol) unless it is followed by a number. Processing is context-dependent and consistent across the supported languages.
-
Ignores phrases that are enclosed in ( ) (parentheses), < > (angle brackets), [ ] (square brackets), or { } (curly braces).
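Because bracketed phrases are ignored for these languages, it can help to preview what remains of a line once they are dropped. The following Python sketch approximates that step for simple, non-nested brackets. The function name is hypothetical, and how the service treats nested or unbalanced brackets is not described here, so the output is only an estimate.

```python
import re

# Matches one non-nested phrase in (), <>, [], or {}.
_BRACKETED = re.compile(r"\([^()]*\)|<[^<>]*>|\[[^\[\]]*\]|\{[^{}]*\}")


def preview_without_bracketed(line):
    """Drop bracketed phrases from a line and collapse the leftover spaces."""
    without = _BRACKETED.sub(" ", line)
    return re.sub(r"\s+", " ", without).strip()


if __name__ == "__main__":
    print(preview_without_bracketed(
        "The NIH (National Institutes of Health) funds [most] research"
    ))
```

If a domain term appears only inside brackets, move it out of the brackets so that it is not ignored.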
Parsing of Japanese
- Converts all characters to full-width characters.
- Converts numbers to their equivalent words, for example, 500 becomes 五百, and 0.15 becomes 〇・一五.
- Does not convert tokens that include symbols to equivalent strings, for example, 100% becomes 百%.
- Does not automatically remove punctuation. IBM highly recommends that you remove punctuation if your application is transcription-based as opposed to dictation-based.
Parsing of Korean
- Converts numbers to their equivalent words, for example, 10 becomes 십.
- Removes the following punctuation and special characters: - ( ) * : . , ' ". However, not all punctuation and special characters that are removed for other languages are removed for Korean, for example:
  - Removes a period (.) symbol only when it occurs at the end of a line of input.
  - Does not remove a tilde (~) symbol.
  - Does not remove or otherwise process Unicode wide-character symbols, for example, … (triple dot or ellipsis).

  In general, IBM recommends that you remove punctuation, special characters, and Unicode wide-characters before you process a corpus file.
- Does not remove or ignore phrases that are enclosed in ( ) (parentheses), < > (angle brackets), [ ] (square brackets), or { } (curly braces).
- Converts tokens that include certain symbols to meaningful string representations, for example:
  - 24% becomes 이십사퍼센트.
  - $10 becomes 십달러.

  This list is not exhaustive. The service makes similar adjustments for other characters as needed.
- For phrases that consist of Latin (English) characters or a mix of Hangul and Latin characters, the service creates OOV words for the phrases exactly as they appear in the corpus file. And it creates sounds-like pronunciations for the words based on Hangul transcriptions.
  - It gives the OOV word London a sounds-like of 런던.
  - It gives the OOV word IBM홈페이지 a sounds-like of 아이 비 엠 홈페이지.
Working with custom words for previous-generation models
You can use the POST /v1/customizations/{customization_id}/words and PUT /v1/customizations/{customization_id}/words/{word_name} methods to add new words to a custom model's words resource. You can also use the methods to modify or augment a word in a words resource.
You might, for instance, need to use the methods to correct a typographical error or other mistake that is made when a word is added from a corpus. You might also need to add sounds-like definitions for an existing word. If you modify an existing word, the new data that you provide overwrites the word's existing definition in the words resource. The rules for adding a word also apply to modifying an existing word.
You are likely to add most custom words from corpora. Make sure that you know the character encoding of your corpus text files. The service preserves the encoding that it finds in the text files. You must use that same encoding when working with custom words in the custom model. For more information, see Character encoding for custom words.
Using the sounds_like field
The sounds_like field specifies how a word is pronounced by speakers. By default, the service automatically attempts to complete the field with the word's spelling. But the service cannot generate a pronunciation for all words.
After adding or modifying words, you must validate the words resource to ensure that each word's definition is complete and valid. For more information, see Validating a words resource for previous-generation models.
You can provide as many as five alternative pronunciations for a word that is difficult to pronounce or that can be pronounced in different ways. Consider using the field to
-
Provide different pronunciations for acronyms. For example, the acronym NCAA can be pronounced as it is spelled or colloquially as N. C. double A. The following example adds both of these sounds-like pronunciations for the word NCAA:
IBM Cloud
curl -X PUT -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--data "{\"sounds_like\": [\"N. C. A. A.\", \"N. C. double A.\"]}" \
"{url}/v1/customizations/{customization_id}/words/NCAA"
IBM Cloud Pak for Data
curl -X PUT \
--header "Authorization: Bearer {token}" \
--header "Content-Type: application/json" \
--data "{\"sounds_like\": [\"N. C. A. A.\", \"N. C. double A.\"]}" \
"{url}/v1/customizations/{customization_id}/words/NCAA"
-
Handle foreign words. For example, the French word garçon contains a character that is not found in the English language. You can specify a sounds-like of gaarson, replacing ç with s, to tell the service how English speakers would pronounce the word.
The following sections provide language-specific guidelines for specifying a sounds-like pronunciation. Speech recognition uses statistical algorithms to analyze audio, so adding a word does not guarantee that the service transcribes it with complete accuracy. When you add a word, consider how it might be pronounced. Use the sounds_like field to provide various pronunciations that reflect how a word can be spoken.
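Escaping quotation marks by hand in a --data string is error-prone, so you might prefer to generate the JSON body with a script. The following Python sketch builds a body with the same sounds_like and display_as fields that the curl examples on this page use and enforces the limit of five pronunciations locally. The function name is hypothetical, and the sketch only builds the string; it does not send a request.

```python
import json

MAX_PRONUNCIATIONS = 5  # "as many as five alternative pronunciations"


def build_word_body(sounds_like=None, display_as=None):
    """Return a JSON string for a request that adds or modifies one word.

    Hypothetical helper: it mirrors the fields shown in the curl examples
    and omits any field that is not supplied.
    """
    body = {}
    if sounds_like is not None:
        if len(sounds_like) > MAX_PRONUNCIATIONS:
            raise ValueError("at most five pronunciations are allowed")
        body["sounds_like"] = list(sounds_like)
    if display_as is not None:
        body["display_as"] = display_as
    # ensure_ascii=False keeps characters such as ™ readable in the output.
    return json.dumps(body, ensure_ascii=False)


if __name__ == "__main__":
    print(build_word_body(["N. C. A. A.", "N. C. double A."]))
```

You can pass the resulting string to the --data option of a curl command or to an HTTP client of your choice.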
Guidelines for English
Guidelines for Australian, United Kingdom, and United States English:
- Use English alphabetic characters: a-z and A-Z.
- Use real or made-up words that are pronounceable in English for words that are difficult to pronounce, for example, shuchesnie for the word Sczcesny.
- Substitute equivalent English letters for non-English letters, for example, s for ç or ny for ñ.
- Substitute non-accented letters for accented letters, for example, a for à or e for è.
- You can include multiple words that are separated by spaces. The service enforces a maximum of 40 total characters, not including leading or trailing spaces.
Guidelines for Australian and United States English only:
- To pronounce a single letter, use the letter followed by a period. If the period is followed by another character, be sure to use a space between the period and the next character. For example, use N. C. A. A., not N.C.A.A.
- Use the spelling of numbers, for example, seventy-five for 75.
Guidelines for United Kingdom English only:
- You cannot use periods or dashes in sounds-like pronunciations for UK English.
- To pronounce a single letter, use the letter followed by a space. For example, use N C A A, not N. C. A. A., N.C.A.A., or NCAA.
- Use the spelling of numbers without dashes, for example, seventy five for 75.
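The English guidelines lend themselves to a quick local check before you send a sounds-like value to the service. The following Python sketch encodes them as read from the lists above. The function name and the dialect strings are assumptions, and the check is stricter than the service might be: any character that the guidelines do not mention, such as an apostrophe, is reported.

```python
import re

MAX_CHARS = 40  # maximum total characters, ignoring leading/trailing spaces


def check_sounds_like(text, dialect="en-US"):
    """Return a list of problems found in an English sounds-like value.

    Hypothetical helper. Pass dialect="en-GB" for UK English; any other
    value applies the Australian and United States rules.
    """
    problems = []
    trimmed = text.strip()
    if len(trimmed) > MAX_CHARS:
        problems.append("longer than 40 characters")
    if dialect == "en-GB":
        # UK English: letters and spaces only (no periods or dashes).
        allowed = r"[A-Za-z ]"
    else:
        # AU/US English: a period must be followed by a space or end the value.
        if re.search(r"\.(?=[^ ])", trimmed):
            problems.append("a period must be followed by a space")
        allowed = r"[A-Za-z .\-]"
    bad = sorted({ch for ch in trimmed if not re.fullmatch(allowed, ch)})
    if bad:
        problems.append("unexpected characters: " + "".join(bad))
    return problems


if __name__ == "__main__":
    for value, dialect in [("N. C. A. A.", "en-US"), ("N.C.A.A.", "en-US"),
                           ("N C A A", "en-GB"), ("N. C. A. A.", "en-GB")]:
        print(dialect, repr(value), check_sounds_like(value, dialect))
```

An empty list means that the value passes these local checks; the service still performs its own validation.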
Guidelines for Dutch, French, German, Italian, Portuguese, and Spanish
Guidelines for all supported dialects of Dutch, French, German, Italian, Portuguese, and Spanish:
- You cannot use dashes in sounds-like pronunciations.
- Use alphabetic characters that are valid for the language: a-z and A-Z including valid accented letters.
- To pronounce a single letter, use the letter followed by a period. If the period is followed by another character, be sure to use a space between the period and the next character. For example, use N. C. A. A., not N.C.A.A.
- Use real or made-up words that are pronounceable in the language for words that are difficult to pronounce.
- Use the spelling of numbers without dashes, for example, for 75 use
  - Dutch (Netherlands): vijfenzeventig
  - French: soixante quinze
  - German: fünfundsiebzig
  - Italian: settantacinque
  - Portuguese (Brazilian): setenta e cinco
  - Spanish: setenta y cinco
- You can include multiple words that are separated by spaces. The service enforces a maximum of 40 total characters, not including leading or trailing spaces.
Guidelines for Japanese
-
Use only full-width Katakana characters by using the ― lengthen symbol (chou-on, or 長音, in Japanese). Do not use half-width characters.
-
Use contracted sounds (yoh-on, or 拗音, in Japanese) only in the following syllable contexts:
イェ, ウィ, ウェ, ウォ, キィ, キャ, キュ, キョ, ギャ, ギュ, ギョ, クァ, クィ, クェ, クォ, グァ, グォ, シィ, シェ, シャ, シュ, ショ, ジィ, ジェ, ジャ, ジュ, ジョ, スィ, ズィ, チェ, チャ, チュ, チョ, ヂェ, ヂャ, ヂュ, ヂョ, ツァ, ツィ, ツェ, ツォ, ティ, テュ, ディ, デャ, デュ, デョ, トゥ, ドゥ, ニェ, ニャ, ニュ, ニョ, ヒャ, ヒュ, ヒョ, ビャ, ビュ, ビョ, ピィ, ピャ, ピュ, ピョ, ファ, フィ, フェ, フォ, フュ, ミャ, ミュ, ミョ, リィ, リェ, リャ, リュ, リョ, ヴァ, ヴィ, ヴェ, ヴォ, ヴュ
-
Use only the following syllables after an assimilated sound (soku-on, or 促音, in Japanese):
バ, ビ, ブ, ベ, ボ, チ, チェ, チャ, チュ, チョ, ダ, デ, ディ, ド, ドゥ, フ, ファ, フィ, フェ, フォ, ガ, ギ, グ, ゲ, ゴ, ハ, ヒ, ヘ, ホ, ジ, ジェ, ジャ, ジュ, ジョ, カ, キ, ク, ケ, コ, キャ, キュ, キョ, パ, ピ, プ, ペ, ポ, ピャ, ピュ, ピョ, サ, ス, セ, ソ, シ, シェ, シャ, シュ, ショ, タ, テ, ト, ツ, ザ, ズ, ゼ, ゾ
-
Do not use ン as the first character of a word. For example, use ウーント instead of ンート, the latter of which is invalid.
-
Many compound words consist of prefix+noun or noun+suffix. The service's base vocabulary covers most compound words that occur frequently (for example, 長電話 and 古新聞) but not those compound words that occur infrequently. If your corpus commonly contains compound words, add them as one word as the first step of your customization. For example, 古鉛筆 is not common in general Japanese text; if you use it often, add it to your custom model to improve transcription accuracy.
-
Do not use a trailing assimilated sound.
Guidelines for Korean
- Use Korean Hangul characters, symbols, and syllables.
- You can also use Latin (English) alphabetic characters: a-z and A-Z.
- Do not use any characters or symbols that are not included in the previous sets.
Using the display_as field
The display_as field specifies how a word is displayed in a transcript. It is intended for cases where you want the service to display a string that is different from the word's spelling. For example, you can indicate that the word hhonors is to be displayed as HHonors regardless of whether it sounds like hilton honors or h honors.
IBM Cloud
curl -X PUT -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--data "{\"sounds_like\": [\"hilton honors\", \"H. honors\"], \"display_as\": \"HHonors\"}" \
"{url}/v1/customizations/{customization_id}/words/hhonors"
IBM Cloud Pak for Data
curl -X PUT \
--header "Authorization: Bearer {token}" \
--header "Content-Type: application/json" \
--data "{\"sounds_like\": [\"hilton honors\", \"H. honors\"], \"display_as\": \"HHonors\"}" \
"{url}/v1/customizations/{customization_id}/words/hhonors"
As another example, you can indicate that the word IBM is to be displayed as IBM™.
IBM Cloud
curl -X PUT -u "apikey:{apikey}" \
--header "Content-Type: application/json" \
--data "{\"sounds_like\": [\"I. B. M.\"], \"display_as\":\"IBM™\"}" \
"{url}/v1/customizations/{customization_id}/words/IBM"
IBM Cloud Pak for Data
curl -X PUT \
--header "Authorization: Bearer {token}" \
--header "Content-Type: application/json" \
--data "{\"sounds_like\": [\"I. B. M.\"], \"display_as\":\"IBM™\"}" \
"{url}/v1/customizations/{customization_id}/words/IBM"
Interaction with smart formatting and numeric redaction
If you use the smart_formatting or redaction parameters with a recognition request, be aware that the service applies smart formatting and redaction to a word before it considers the display_as field for the word. You might need to experiment with results to ensure that the features do not interfere with how your custom words are displayed. You might also need to add custom words to accommodate the effects.
For instance, suppose that you add the custom word one with a display_as field of one. Smart formatting changes the word one to the number 1, and the display-as value is not applied. To work around this issue, you could add a custom word for the number 1 and apply the same display_as field to that word.
For more information about working with these features, see Smart formatting and Numeric redaction.
What happens when I add or modify a custom word?
How the service responds to a request to add or modify a custom word depends on the fields and values that you specify. It also depends on whether the word exists in the service's base vocabulary.
- Omit both the sounds_like and display_as fields:
  - If the word does not exist in the service's base vocabulary, the service attempts to set the sounds_like field to its pronunciation of the word. It cannot generate a pronunciation for all words, so you must review the word's definition to ensure that it is complete and valid. The service sets the display_as field to the value of the word field.
  - If the word exists in the service's base vocabulary, the service leaves the sounds_like and display_as fields empty. These fields are empty only if the word exists in the service's base vocabulary. The word's presence in the model's words resource is harmless but unnecessary.
- Specify only the sounds_like field:
  - If the sounds_like field is valid, the service sets the display_as field to the value of the word field.
  - If the sounds_like field is invalid:
    - The POST /v1/customizations/{customization_id}/words method adds an error field to the word in the model's words resource.
    - The PUT /v1/customizations/{customization_id}/words/{word_name} method fails with a 400 response code and an error message. The service does not add the word to the words resource.
- Specify only the display_as field:
  - If the word does not exist in the service's base vocabulary, the service attempts to set the sounds_like field to its pronunciation of the word. It cannot generate a pronunciation for all words, so you must review the word's definition to ensure that it is complete and valid. The service leaves the display_as field as specified.
  - If the word exists in the service's base vocabulary, the service leaves the sounds_like empty and leaves the display_as field as specified.
- Specify both the sounds_like and display_as fields:
  - If the sounds_like field is valid, the service sets the sounds_like and display_as fields to the specified values.
  - If the sounds_like field is invalid, the service responds as it does in the case where the sounds_like field is specified but the display_as field is not.
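The cases above can be summarized as a small decision function. The following Python sketch restates the rules from this section so that you can look up the expected outcome for a given combination; the function name, its arguments, and the returned descriptions are hypothetical, and the sketch does not call the service.

```python
def describe_outcome(in_base_vocabulary, has_sounds_like, has_display_as,
                     sounds_like_valid=True, method="PUT"):
    """Describe how the service fills in a word's definition.

    Hypothetical helper that mirrors the cases listed in this section.
    `method` is "POST" (add multiple words) or "PUT" (add a single word).
    """
    if has_sounds_like and not sounds_like_valid:
        if method == "POST":
            return {"result": "word stored with an error field"}
        return {"result": "HTTP 400, word not added"}

    if has_sounds_like:
        sounds = "as specified"
    elif in_base_vocabulary:
        sounds = "empty"
    else:
        sounds = "generated by the service (review it)"

    if has_display_as:
        display = "as specified"
    elif has_sounds_like or not in_base_vocabulary:
        display = "value of the word field"
    else:
        display = "empty"

    return {"sounds_like": sounds, "display_as": display}


if __name__ == "__main__":
    print(describe_outcome(in_base_vocabulary=False,
                           has_sounds_like=False, has_display_as=True))
```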
Validating a words resource for previous-generation models
Especially when you add a corpus to a custom language model or add multiple custom words at one time, examine the OOV words in the model's words resource.
- Look for typographical and other errors. Especially when you add corpora, which can be large, mistakes are easily made. Typographical errors in a corpus file (or in custom words or a grammar file) have the unintended consequence of adding new words to a model's words resource, as do ill-formed HTML tags that are left in a corpus file.
- Verify the sounds-like pronunciations. The service tries to generate sounds-like pronunciations for OOV words automatically. In most cases, these pronunciations are sufficient. But the service cannot generate a pronunciation for all words, so you must review the word's definition to ensure that it is complete and valid. Reviewing the pronunciations for accuracy is also recommended for words that have unusual spellings or are difficult to pronounce, and for acronyms and technical terms.
To validate and, if necessary, correct a word for a custom model, regardless of how it was added to the words resource, use the following methods:
- List all of the words from a custom model by using the GET /v1/customizations/{customization_id}/words method or query an individual word with the GET /v1/customizations/{customization_id}/words/{word_name} method. For more information, see Listing custom words from a custom language model.
- Modify words in a custom model to correct errors or to add sounds-like or display-as values by using the POST /v1/customizations/{customization_id}/words or PUT /v1/customizations/{customization_id}/words/{word_name} method. For more information, see Working with custom words for previous-generation models.
- Delete extraneous words that are introduced in error (for example, by typographical or other mistakes in a corpus) by using the DELETE /v1/customizations/{customization_id}/words/{word_name} method. For more information, see Deleting a word from a custom language model.
  - If the word was extracted from a corpus, you can instead update the corpus text file to correct the error and then reload the file by using the allow_overwrite parameter of the POST /v1/customizations/{customization_id}/corpora/{corpus_name} method. For more information, see Add a corpus to the custom language model.
  - If the word was extracted from a grammar, you can update the grammar file to correct the error and then reload the file by using the allow_overwrite parameter of the POST /v1/customizations/{customization_id}/grammars/{grammar_name} method. For more information, see Add a grammar to the custom language model.
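When a words resource is large, a script can narrow down which entries to review first. The following Python sketch scans a saved listing and reports words that carry an error field or that have no sounds-like pronunciation. It assumes that the listing is JSON with a top-level words array whose entries use the word, sounds_like, and error fields named on this page; that response shape and the function name are assumptions, so compare them with an actual response before you rely on the sketch.

```python
import json


def review_words(listing_json):
    """Return (word, reason) pairs for entries that deserve a manual review.

    Hypothetical helper. Assumes a top-level "words" array; entries without
    a sounds-like value might be base-vocabulary words or words for which
    no pronunciation could be generated.
    """
    findings = []
    for entry in json.loads(listing_json).get("words", []):
        word = entry.get("word", "")
        if entry.get("error"):
            findings.append((word, "has an error field"))
        elif not entry.get("sounds_like"):
            findings.append((word, "no sounds-like pronunciation"))
    return findings


if __name__ == "__main__":
    # Illustrative data only, not real service output.
    sample = json.dumps({"words": [
        {"word": "NCAA", "sounds_like": ["N. C. A. A."], "display_as": "NCAA"},
        {"word": "helth", "sounds_like": [], "display_as": "helth"},
    ]})
    print(review_words(sample))
```

The sketch does not judge whether a pronunciation is accurate, so words with unusual spellings, acronyms, and technical terms still need a manual check.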