Speech activity detection

The IBM Watson® Speech to Text service offers two speech activity detection parameters to control what audio is used for speech recognition. The parameters specify the service's sensitivity to non-speech events and to background noise. The parameters are independent: You can use them individually or together.

Speech activity detection is supported for most language models. For more information, see Language model support.

How speech activity detection works

Speech activity detection uses the input audio stream and determines which parts of the stream to pass for speech recognition. Background speech and noise can cause speech recognition errors, including incorrect transcriptions, extra words, or missing words from the input audio. The speech activity detection feature can help ensure that only relevant audio is processed for speech recognition.

You can use the feature to control the following aspects of speech recognition:

Suppress background speech. Call-center data often contains cross-talk ("overhearing") from other agents. You can set a volume threshold that causes the service to ignore background speech quieter than the specified level.
Suppress background noise. Some audio, such as speech recorded in a factory, can contain a high level of background noise. You can set a threshold that causes the system to ignore background noise quieter than the specified level.
Suppress non-speech audio events. Background music and tone events, such as audio played to a client who is waiting on hold on a telephone line, can cause inaccurate recognition. Silence can also result in unnecessary recognition or transcription errors. You can set a threshold that causes the system to ignore such events quieter than the specified level.

By default, speech activity detection is configured to provide optimal performance for the general case for each model. For specific cases, the default settings might not be optimal and can lead either to slow transcription or to word insertions and deletions. You are encouraged to experiment with different settings to determine which values work best for your audio.

Speech detector sensitivity

Use the speech_detector_sensitivity parameter to adjust the sensitivity of speech activity detection. Use the parameter to suppress word insertions from music, coughing, and other non-speech events. The service biases the audio it passes for speech recognition by evaluating chunks of the input audio against prior models of speech and non-speech activity.

Specify a float value between 0.0 and 1.0. The default value is 0.5, which provides a reasonable compromise for the level of sensitivity. A value of 0.0 suppresses all audio (no speech is transcribed). A value of 1.0 suppresses no audio (speech detection sensitivity is disabled). The values increase on a monotonic curve of sensitivity versus speech. Specifying one or two decimal places of precision (for example, 0.55) is typically more than sufficient.

This parameter can affect both the quality and the latency of speech recognition:

Smaller values can decrease latency because less audio is potentially passed for speech recognition. However, setting a small value might discard chunks of audio that contain actual speech, losing viable content from the transcript.
Higher values can increase latency because more audio is potentially passed for speech recognition. However, setting a high value might pass chunks of audio that contain non-speech events, adding spurious content to the transcript.

Speech detector sensitivity example

The following example request specifies a value of 0.6 for the speech_detector_sensitivity parameter with the synchronous HTTP interface. The service recognizes slightly more potential non-speech events than it would by default.

IBM Cloud

curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: audio/flac" \
--data-binary @{path}audio-file1.flac \
"{url}/v1/recognize?speech_detector_sensitivity=0.6"

IBM Cloud Pak for Data IBM Software Hub

curl -X POST \
--header "Authorization: Bearer {token}" \
--header "Content-Type: audio/flac" \
--data-binary @{path}audio-file1.flac \
"{url}/v1/recognize?speech_detector_sensitivity=0.6"

Speech Activity Detection (SAD)

Speech Activity Detection (SAD), is now upgraded to include a new improved method in addition to the existing method in the Speech To Text recognize API. The new method increases accuracy and performance in detecting speech boundaries within the audio stream. It can be used by setting sad_module:2.

Enhanced features

SAD has the following enhanced features:

Better performance in detecting speech start and end.
Improved noise suppression.
Faster response time in identifying segments with speech.
Increased recognition accuracy for noisy environments.

Recommended usage

SAD can use both the existing method and the new improved method:

sad_module: 1 : Uses the existing SAD model (default).
sad_module: 2 : Uses the new SAD model. It is used for applications that require real-time responsiveness, background noise handling, or applications that depend on speech activity, such as speech_ begin_event.

For example,

curl -X POST -u "apikey:<apikey>" \
  -H "Content-Type: audio/wav" \
  --data-binary @{path}New_Recording.wav \
  "{url}/v1/detect_language/sad_module:2"

Speech Begin Event Detection

By using speech begin event detection, you can now receive an early notification with the recognize API. It detects the start of speech in the incoming audio stream and sends a notification to the user. When the speech_begin_event: true is set in the request, a new response object is sent that indicates the start of a speech segment. It can be useful for real-time transcription or voice-activity-driven applications. It can be used by client applications to trigger downstream action, for example, showing speech started or active indicators and preparing real-time processing pipelines.

A speech begin event is generated in two scenarios:

When a speech is detected first in an audio stream
After each end_of_phrase_silence_time period, if a new speech is detected

The sensitivity of the detection can be adjusted by using the speech_event_sensitivity (default 0.5) parameter, which is a measure of the intensity of the speech signal that triggers the speech begin event.

Greater values indicate that it is more sensitive to speech. Values closer to 0 make detection less sensitive to background noise. Values closer to 1 increase the chances of detecting a speech begin event but might trigger false alerts in noisy environments.

By using speech_event_sensitivity, you can control the sensitivity of the speech begin event separately from STT transcript, which is controlled by the speech_detector_sensitivity parameter. speech_event_sensitivity ignores noise for speech_begin_event, which can give false barge-in events.

Recommended usage

It is recommended to use sad_module:2 when you use speech_begin_event.

For example,

curl -X POST "[URL]/v1/recognize?model=en-US&speech_begin_event=true&sad_module=2" \
-u "apikey:[APIKEY] " \
--header "content-type:audio/wav" \
--data-binary @example.wav

Example response:

{
   "result_index": 0,
   "results": [
      {
         "speech_begin_event": {
            "begin": 2.5
         }
      }
   ]
}

Background audio suppression

Use the background_audio_suppression parameter to suppress background audio based on its volume to prevent it from being transcribed as speech. Use the parameter to suppress side conversations or background noise. For example, use this parameter to manage steady, quiet background sounds, such as weak audio signals. Because such noise can interfere with transcription, it can produce content where no actual speech occurs in the audio.

Specify a float value in the range of 0.0 to 1.0. The default value is 0.0, which provides no suppression (background audio suppression is disabled). A value of 0.5 provides a reasonable level of audio suppression for general usage. A value of 1.0 suppresses all audio (no speech is transcribed). The values increase on a monotonic curve. Specifying one or two decimal places of precision (for example, 0.55) is typically more than sufficient.

This parameter can also affect both the quality and the latency of speech recognition. However, because background noise suppression is disabled by default, setting the parameter to a value greater than zero can only improve latency. But higher values can gradually reduce the audio that is passed for speech recognition, which can cause valid content to be lost from the transcript.

Background audio suppression example

The following example request specifies a value of 0.5 for the background_audio_suppression parameter with the synchronous HTTP interface. The service suppresses a reasonable level of background audio.

IBM Cloud

curl -X POST -u "apikey:{apikey}" \
--header "Content-Type: audio/flac" \
--data-binary @{path}audio-file1.flac \
"{url}/v1/recognize?background_audio_suppression=0.5"

IBM Cloud Pak for Data IBM Software Hub

curl -X POST \
--header "Authorization: Bearer {token}" \
--header "Content-Type: audio/flac" \
--data-binary @{path}audio-file1.flac \
"{url}/v1/recognize?background_audio_suppression=0.5"

Language model support

The speech_detector_sensitivity and background_audio_suppression parameters are supported for use with the following language models:

For large speech models and next-generation models, the parameters are supported by all models.
For previous-generation models, the parameters are supported by most models. The following models do not support speech activity detection now. The parameters are ignored if used with these models.
- Arabic broadband model (ar-MS_BroadbandModel)
- Brazilian Portuguese broadband model (pt-BR_BroadbandModel)
- Chinese broadband model (zh-CN_BroadbandModel)
- Chinese narrowband model (zh-CN_NarrowbandModel)
- German broadband model (de-DE_BroadbandModel)