Managing custom words

The customization interface includes the POST /v1/customizations/{customization_id}/words and PUT /v1/customizations/{customization_id}/words/{word_name} methods, which are used to add or modify words for a custom model. For more information, see Add words to the custom language model. The interface also includes the following methods for listing and deleting words for a custom language model.

Character encoding for custom words

In general, you are likely to add most custom words from corpora. Make sure that you know the character encoding that is used in the text files for your corpora. The service preserves the encoding that it finds in the text files.

You must use that same encoding when working with individual words in the custom language model. When you specify a word with the GET, PUT, or DELETE /v1/customizations/{customization_id}/words/{word_name} method, you must URL-encode the word_name that you pass in the URL if the word includes non-ASCII characters.

For example, the following table shows what looks like the same letter in two different encodings, ASCII and UTF-8. You can pass the ASCII character in a URL as z, but you must pass the UTF-8 character as %EF%BD%9A.

Examples of character encoding
Letter	Encoding	Value
z	ASCII	`0x7a` (`7a`)
ｚ	UTF-8 hexadecimal	`0xEF 0xBD 0x9A` (`efbd9a`)
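
As a minimal sketch of the encoding step, the following Python snippet percent-encodes a word before it is placed in the {word_name} path segment. The helper name encode_word_name is hypothetical and is not part of the service or any SDK. The sketch assumes that the word is stored as UTF-8 in the custom model, and it relies only on urllib.parse.quote from the Python standard library.

```python
from urllib.parse import quote


def encode_word_name(word: str) -> str:
    """Percent-encode a custom word for use as the {word_name} path segment.

    Hypothetical helper; assumes the word is UTF-8 encoded in the custom model.
    """
    # safe="" also encodes "/" so the word cannot be read as a path separator.
    return quote(word, safe="", encoding="utf-8")


print(encode_word_name("z"))   # ASCII letter passes through unchanged
print(encode_word_name("ｚ"))  # fullwidth letter becomes %EF%BD%9A
```

If your corpora use an encoding other than UTF-8, the encoding argument would need to change to match.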

Listing custom words from a custom language model

The customization interface offers two methods for listing words from a custom language model:

The GET /v1/customizations/{customization_id}/words method lists information about the words from the custom model's words resource. The method includes two optional query parameters:
- The word_type parameter specifies which words are to be listed:
  - all (the default) shows all words.
  - user shows only custom words that were added or modified by the user.
  - corpora shows only OOV words that were extracted from corpora.
  - grammars shows only OOV words that were extracted from grammars.
  For custom models that are based on next-generation models, only all and user apply. Both options return the same results. Words from corpora and grammars are not added to the words resource for custom models that are based on next-generation models.
- The sort parameter indicates how the words are to be ordered. It accepts two arguments, alphabetical and count. You can prepend an optional + or - to an argument to sort the results in ascending or descending order, respectively. By default, the method displays the words in ascending alphabetical order.
The GET /v1/customizations/{customization_id}/words/{word_name} method lists information about a single specified word from the model's words resource.
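
The query parameters described above can be sketched in Python as follows. The function build_list_words_url and the constants WORD_TYPES and SORT_KEYS are hypothetical names introduced for illustration. The sketch percent-encodes a leading + as %2B because a literal + in a query string is commonly decoded as a space; whether the service requires this is an assumption to verify against the API reference.

```python
from urllib.parse import urlencode

# Values taken from the parameter descriptions above.
WORD_TYPES = {"all", "user", "corpora", "grammars"}
SORT_KEYS = {"alphabetical", "count"}


def build_list_words_url(base_url, customization_id, word_type=None, sort=None):
    """Build the URL for GET /v1/customizations/{customization_id}/words."""
    params = {}
    if word_type is not None:
        if word_type not in WORD_TYPES:
            raise ValueError(f"unsupported word_type: {word_type!r}")
        params["word_type"] = word_type
    if sort is not None:
        # Drop one optional leading + or - before validating the sort key.
        key = sort[1:] if sort.startswith(("+", "-")) else sort
        if key not in SORT_KEYS:
            raise ValueError(f"unsupported sort: {sort!r}")
        params["sort"] = sort
    url = f"{base_url}/v1/customizations/{customization_id}/words"
    query = urlencode(params)  # encodes "+" as %2B and leaves "-" as is
    return f"{url}?{query}" if query else url


# Placeholder host and ID; substitute your own {url} and {customization_id}.
print(build_list_words_url("https://example.test", "abc", word_type="user", sort="+count"))
```

Omitting both parameters yields the plain /words URL, which corresponds to the default of all words in ascending alphabetical order.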

In addition to a word field that identifies the word, both methods return the following information about each word:

A sounds_like field that presents an array of as many as five pronunciations for the word. The array of sounds-like pronunciations can include a sounds-like value that is automatically generated by the service if no sounds-like value is provided when the word is added to the custom model. For more information, see
- For custom models that are based on previous-generation models, Using the sounds_like field.
- For custom models that are based on next-generation models, Using the sounds_like field.
A display_as field that shows the spelling of the custom word that the service displays in transcriptions. The field contains an empty string if no display-as value is provided for the word, in which case the word is displayed as it is spelled. For more information, see
- For custom models that are based on previous-generation models, Using the display_as field.
- For custom models that are based on next-generation models, Using the display_as field.
A source field that indicates how the word was added to the custom model's words resource.
- For custom models that are based on previous-generation models, the field includes the name of each corpus and grammar from which the service extracted the word. If you modified or added the word directly, the field includes the string user.
- For custom models that are based on next-generation models, this field shows only user for custom words that were added directly to the custom model. Words from corpora and grammars are not added to the words resource for custom models that are based on next-generation models.
A count field that indicates the number of times the word is found across all corpora and grammars.
- For custom models that are based on previous-generation models, the count is the total across all corpora and grammars. For example, if the word occurs five times in one corpus and seven times in another, its count is 12. If you add a custom word to a model before it is added by any corpora or grammars, the count begins at 1. If the word is added from a corpus or grammar first and later modified, the count reflects only the number of times it is found in corpora and grammars.
- For custom models that are based on large speech models and next-generation models, the count field for any word is always 1.
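
The counting rule for previous-generation models can be illustrated with a small Python sketch. The function expected_count is a hypothetical model of the behavior described above, not code from the service, and it assumes that a word added by the user before any corpus or grammar contributes exactly 1 to the total.

```python
def expected_count(source_occurrences, added_by_user_first):
    """Approximate the count field for a previous-generation custom model.

    source_occurrences maps a corpus or grammar name to the number of times
    the word occurs in it. Hypothetical helper for illustration only.
    """
    total = sum(source_occurrences.values())
    # A word that the user adds before any corpus or grammar starts at 1.
    if added_by_user_first:
        total += 1
    return total


print(expected_count({"corpus1": 5, "corpus2": 7}, added_by_user_first=False))
print(expected_count({"corpus3": 2}, added_by_user_first=True))
```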

If the service discovers one or more problems with a custom word's definition, the output includes an error field. The field provides an array that lists each problem element from the definition and a message that describes the problem.

An error can occur, for example, if you add a custom word with an invalid sounds_like field, one that violates one of the rules for adding a pronunciation. You cannot train a custom model whose words resource includes a word with an error. You must correct or delete the word before you can train the model.
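
Because a single word with an error blocks training, it can help to scan the list response before you train. The following Python sketch assumes that the response has already been parsed into a dictionary with the shape described in this section. The function name words_with_errors and the sample data are hypothetical.

```python
def words_with_errors(response):
    """Return {word: [messages]} for every word whose entry has an error field."""
    problems = {}
    for entry in response.get("words", []):
        messages = []
        for item in entry.get("error", []):
            # Each item maps a problem element to a message that describes it.
            messages.extend(f"{element}: {message}" for element, message in item.items())
        if messages:
            problems[entry["word"]] = messages
    return problems


# Sample data modeled on the response format in this section.
sample = {
    "words": [
        {
            "word": "75.00",
            "sounds_like": ["75 dollars"],
            "error": [{"75 dollars": "Numbers are not allowed in sounds_like."}],
        },
        {"word": "tomato", "sounds_like": ["tomatoh", "tomayto"]},
    ]
}
print(words_with_errors(sample))
```

An empty result suggests that no word in the listing carries an error field; any word that is reported must be corrected or deleted before training.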

List all custom words example

The following example lists all of the words, regardless of type, from the custom model with the specified customization ID. The words are displayed in the default sort order, ascending alphabetical.

IBM Cloud

curl -X GET -u "apikey:{apikey}" \
"{url}/v1/customizations/{customization_id}/words"

IBM Cloud Pak for Data, IBM Software Hub

curl -X GET \
--header "Authorization: Bearer {token}" \
"{url}/v1/customizations/{customization_id}/words"

The words resource for the model contains four words. The first word was added directly by the user, but its sounds_like field contains an error: the field cannot contain numbers. The remaining words were added either by the user alone or by both the user and corpora. The corpus names in the source field indicate that the custom model is based on a previous-generation model.

{
  "words": [
    {
      "word": "75.00",
      "sounds_like": ["75 dollars"],
      "display_as": "75.00",
      "count": 1,
      "source": ["user"],
      "error": [{"75 dollars": "Numbers are not allowed in sounds_like. You can try for example 'seventy five dollars'."}]
    },
    {
      "word": "HHonors",
      "sounds_like": [
        "hilton honors",
        "H. honors"
      ],
      "display_as": "HHonors",
      "count": 1,
      "source": [
        "corpus1",
        "user"
      ]
    },
    {
      "word": "IEEE",
      "sounds_like": ["I. triple E."],
      "display_as": "IEEE",
      "count": 3,
      "source": [
        "corpus1",
        "corpus2",
        "user"
      ]
    },
    {
      "word": "tomato",
      "sounds_like": [
        "tomatoh",
        "tomayto"
      ],
      "display_as": "tomato",
      "count": 1,
      "source": ["user"]
    }
  ]
}

List a specific custom word example

The following example shows information about the word NCAA from the words resource of the specified model:

IBM Cloud

curl -X GET -u "apikey:{apikey}" \
"{url}/v1/customizations/{customization_id}/words/NCAA"

IBM Cloud Pak for Data, IBM Software Hub

curl -X GET \
--header "Authorization: Bearer {token}" \
"{url}/v1/customizations/{customization_id}/words/NCAA"

The user added the word initially. The service then found the word twice in corpus3, which gives a count of 3. The word was added to a custom model that is based on a previous-generation model.

{
  "word": "NCAA",
  "sounds_like": [
    "N. C. A. A.",
    "N. C. double A."
  ],
  "display_as": "NCAA",
  "count": 3,
  "source": [
    "corpus3",
    "user"
  ]
}

Deleting a custom word from a custom language model

Use the DELETE /v1/customizations/{customization_id}/words/{word_name} method to delete a word from a custom language model.

For custom models that are based on large speech models and previous-generation models, use the method to remove words that were added in error, for example, from a corpus with faulty data. You can remove any word that you added to the custom model's words resource by any means (for example, from corpora, from grammars, or directly).
For custom models that are based on next-generation models, you can delete only words that were added directly to the model. Words are not added to the model from corpora or grammars.

You cannot delete a word from the service's base vocabulary. If a word that you delete from a custom model is also present in the base vocabulary, the deletion removes only the custom pronunciation for the word. The word remains in the base vocabulary.

Removing a word from a custom model does not affect the model until you retrain it by using the POST /v1/customizations/{customization_id}/train method. If the model was previously trained on the word, the model continues to apply the word to speech recognition even after you delete the word from its words resource. You must retrain the model to reflect the deletion.
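
The delete-then-retrain sequence can be sketched in Python as a pair of prepared requests. The function build_delete_and_train_requests is a hypothetical helper. The sketch assumes bearer-token authentication, as in the IBM Cloud Pak for Data curl examples, and it only constructs the requests with the standard library without sending them.

```python
from urllib.parse import quote
from urllib.request import Request


def build_delete_and_train_requests(base_url, customization_id, word, token):
    """Prepare, but do not send, the DELETE and retrain requests for a word."""
    headers = {"Authorization": f"Bearer {token}"}
    model_url = f"{base_url}/v1/customizations/{customization_id}"
    # Percent-encode the word so that non-ASCII characters are valid in the URL.
    encoded_word = quote(word, safe="")
    delete_req = Request(f"{model_url}/words/{encoded_word}", headers=headers, method="DELETE")
    # Retraining is what makes the deletion take effect in the model.
    train_req = Request(f"{model_url}/train", headers=headers, method="POST")
    return delete_req, train_req


# Placeholder values; substitute your own {url}, {customization_id}, and {token}.
delete_req, train_req = build_delete_and_train_requests(
    "https://example.test", "abc", "IEEE", "my-token"
)
for req in (delete_req, train_req):
    print(req.get_method(), req.full_url)
```

To send the requests, you could pass each one in order to urllib.request.urlopen and check the response status of the deletion before you start training.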

Delete a custom word example

The following example deletes the word IEEE from the custom model with the specified customization ID:

IBM Cloud

curl -X DELETE -u "apikey:{apikey}" \
"{url}/v1/customizations/{customization_id}/words/IEEE"

IBM Cloud Pak for Data, IBM Software Hub

curl -X DELETE \
--header "Authorization: Bearer {token}" \
"{url}/v1/customizations/{customization_id}/words/IEEE"