Using grammars with custom language models

The IBM Watson® Speech to Text service supports the use of grammars with custom language models. You can add grammars to a custom language model and use them for speech recognition. Grammars restrict the set of phrases that the service can recognize from audio.

A grammar uses a formal language specification to define a set of production rules for transcribing strings. The rules specify how to form valid strings from the language's alphabet. When you apply a grammar to speech recognition, the service can return only one or more of the phrases that are generated by the grammar.
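
To make the idea of production rules concrete, the following minimal Python sketch expands a small, non-recursive rule set into every phrase that it generates. The rule table, the `$name` reference convention inside it, and the `expand` function are hypothetical illustrations only; they are not part of the service or of any grammar format.

```python
from itertools import product

# Hypothetical rule table: each rule maps to a list of alternatives,
# and each alternative is a sequence of words or "$rule" references.
RULES = {
    "answer": [["$polarity"], ["$polarity", "please"]],
    "polarity": [["yes"], ["no"]],
}

def expand(rule, rules=RULES):
    """Return every phrase that the named rule generates (rules must not recurse)."""
    phrases = []
    for alternative in rules[rule]:
        # Expand each token into the list of sub-phrases it can produce.
        options = [
            expand(token[1:], rules) if token.startswith("$") else [token]
            for token in alternative
        ]
        # Pick one sub-phrase per token, in order, for every combination.
        phrases.extend(" ".join(choice) for choice in product(*options))
    return phrases

print(expand("answer"))
```

With these rules, the only phrases that recognition could return are the four that `expand("answer")` lists. Any other utterance yields no match.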

For example, when you need to recognize specific words or phrases, such as yes or no, individual letters or numbers, or a list of names, using grammars can be more effective than examining alternative words and transcripts. Moreover, by limiting the search space for valid strings, the service can deliver results faster and more accurately.

When you use a custom language model and a grammar for speech recognition, the service returns either a valid phrase from the grammar or an empty result. If the result is not empty, the service includes a confidence score with the final transcript, as it does for all recognition requests. For grammars, the score indicates the likelihood that the response matched the grammar. False positives are always possible, especially for simple grammars, so always consider the confidence score when you evaluate the service's response.
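
As a minimal sketch of how you might act on the confidence score, the following Python function accepts a grammar result only when its confidence clears a threshold. It assumes that the response has already been parsed into a dictionary with a `results` list whose entries hold an `alternatives` list with `transcript` and `confidence` fields. The function name and the `0.8` threshold are hypothetical; choose a threshold that suits your grammar and audio.

```python
def accept_grammar_result(response, min_confidence=0.8):
    """Return the transcript if it clears the threshold, otherwise None."""
    results = response.get("results", [])
    if not results:
        return None  # Empty result: nothing matched the grammar.
    alternatives = results[0].get("alternatives", [])
    if not alternatives:
        return None
    best = alternatives[0]
    # Treat a missing confidence as 0.0 so that it is rejected.
    if best.get("confidence", 0.0) < min_confidence:
        return None  # Possible false positive: too uncertain to trust.
    return best.get("transcript", "").strip()
```

For example, a response whose first alternative has a confidence of 0.41 is rejected, and the caller can ask the speaker to repeat the phrase.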

For more information about the languages and models that support grammars and their level of support (generally available or beta), see Language support for customization.

Supported grammar formats

The Speech to Text service supports grammars that are defined in the following standard formats:

  • Augmented Backus-Naur Form (ABNF), which uses a plain-text representation that is similar to traditional BNF grammar. The media type for this format is application/srgs.
  • XML Form, which uses XML elements to represent the grammar. The media type for this format is application/srgs+xml.
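
The following Python sketch shows the same illustrative yes/no grammar in both formats and picks the matching media type. It relies on two properties of the formats: an ABNF grammar begins with a self-identifying `#ABNF` header, and an XML grammar begins with markup. The two example grammars and the `grammar_media_type` helper are hypothetical and are not part of the service.

```python
# Illustrative grammar in ABNF form.
ABNF_EXAMPLE = """#ABNF 1.0 UTF-8;
language en-US;
mode voice;
root $answer;
$answer = yes | no;
"""

# The same illustrative grammar in XML form.
XML_EXAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" root="answer">
  <rule id="answer">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>
"""

def grammar_media_type(grammar_text):
    """Guess the media type of a grammar from how its text begins."""
    # Ignore a byte order mark and leading whitespace.
    stripped = grammar_text.lstrip("\ufeff \t\r\n")
    if stripped.startswith("#ABNF"):
        return "application/srgs"
    if stripped.startswith("<"):
        return "application/srgs+xml"
    raise ValueError("unrecognized grammar format")

print(grammar_media_type(ABNF_EXAMPLE))
print(grammar_media_type(XML_EXAMPLE))
```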

Both grammar formats have the expressive power of a Context-Free Grammar (CFG). However, the service can decode only Type-3 (regular) grammars in the Chomsky hierarchy. Such grammars are equivalent to finite-state automata.

For general information about grammars, see the following Wikipedia pages:

The Speech Recognition Grammar Specification

The Speech to Text service supports grammars as defined by the W3C Speech Recognition Grammar Specification Version 1.0. The specification provides detailed information about the supported formats and about defining a grammar. For information about the supported media types, see Appendix G. Media Types and File Suffix of the specification.

The service does not currently support all features of the Speech Recognition Grammar Specification. Specifically, the service does not support the features described in the following sections of the specification:

Words in the grammar must be in UTF-8 encoding (ASCII is a subset of UTF-8). Using any other encoding can lead to issues when compiling the grammar or unexpected results in decoding. The service ignores an encoding specified in the header of the grammar.
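
Because the service ignores the encoding that is declared in the grammar header, it can help to verify the encoding of a grammar before you add it to a custom language model. The following is a minimal Python sketch that checks whether the raw bytes of a grammar decode as UTF-8. The function name is hypothetical, and the check is a local precaution rather than a feature of the service.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# The same rule encoded two ways: UTF-8 passes, Latin-1 does not.
rule = "$city = Zürich;"
print(is_valid_utf8(rule.encode("utf-8")))
print(is_valid_utf8(rule.encode("latin-1")))
```

To check a grammar file, read it in binary mode, for example with `open(path, "rb")`, and pass the bytes to the function.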