About Speech to Text

The IBM Watson® Speech to Text service provides speech transcription capabilities for your applications. The service leverages machine learning to combine knowledge of grammar, language structure, and the composition of audio and voice signals to accurately transcribe the human voice. It continuously updates and refines its transcription as it receives more speech.

The service provides APIs that make it suitable for any application where speech is the input and a textual transcript is the output. It can be used for applications such as voice-automated chatbots, analytic tools for customer-service call centers, and multimedia transcription. Other possible applications include voice control of embedded devices, transcription of meetings and conference calls, and dictation of messages and notes, among many others.

The service is ideal for clients who need to extract high-quality speech transcripts from call center audio. Clients in industries such as financial services, healthcare, insurance, and telecommunications can develop cloud-native applications for customer care, customer voice, agent assistance, and other solutions.

Product versions

Speech to Text can be deployed as a managed cloud service or can be installed on premises. This documentation describes how to use both versions of the product. Information such as topics, paragraphs, and examples that applies exclusively to one version is clearly denoted:

IBM Cloud: for managed instances of Speech to Text that are hosted on IBM Cloud or for instances that are hosted on IBM Cloud Pak for Data as a Service. For information about all service updates, see the Release notes for Speech to Text for IBM Cloud.
IBM Cloud Pak for Data and IBM Software Hub: for installed or on-premises instances of Watson Speech services. For more information about installing and managing Watson Speech services, see Installation overview. For information about all service updates, see the Release notes for Speech to Text for IBM Cloud Pak for Data and Release notes for Speech to Text for IBM Software Hub.

Speech recognition

The Speech to Text service offers three interfaces for speech recognition: a WebSocket interface, a synchronous HTTP interface, and an asynchronous HTTP interface. The interfaces let you specify the language of your audio and its format and sampling rate. They also provide many parameters that you can use to tailor how you submit audio and what information the service sends in response. You can also request metrics about the service's analysis of your audio and about the audio itself.

For more information about the speech recognition interfaces, see Recognizing speech with the service in the service features.
For more information about the speech recognition parameters, see Using speech recognition parameters in the service features.
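
As an illustration of the synchronous HTTP interface, the following sketch builds a recognition request without sending it. It is a minimal example under stated assumptions, not a definitive client: the `/v1/recognize` path, the `model` query parameter, Basic authentication with the user name `apikey`, and the model name in the usage note are taken from the public API for IBM Cloud instances rather than from this page, and `build_recognize_request` is a hypothetical helper name.

```python
import base64
import urllib.parse
import urllib.request


def build_recognize_request(service_url, api_key, audio_bytes, content_type, model=None):
    """Build (but do not send) a POST request for synchronous recognition.

    Assumptions: the endpoint path is /v1/recognize, the language model is
    chosen with a `model` query parameter, and the instance accepts Basic
    authentication with the literal user name "apikey".
    """
    endpoint = service_url.rstrip("/") + "/v1/recognize"
    if model:
        endpoint += "?" + urllib.parse.urlencode({"model": model})

    # Encode "apikey:<key>" for the Authorization header.
    token = base64.b64encode(f"apikey:{api_key}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": content_type,  # audio format of the request body
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(endpoint, data=audio_bytes, headers=headers, method="POST")
```

To send the request, pass the returned object to `urllib.request.urlopen` with the URL and API key of your own service instance, for example `build_recognize_request(url, key, audio, "audio/flac", model="en-US_Multimedia")`. Keeping construction separate from sending makes the headers and URL easy to inspect before any audio leaves your machine.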

Customization

The service provides a customization interface that you can use to tune speech recognition for your language and acoustic requirements. You can expand the vocabulary of a model with domain-specific terminology or adapt a model for the acoustic characteristics of your audio. You can also add grammars to restrict the phrases that the service can recognize. For more information, see Customizing the service in the service features.
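
To make vocabulary expansion more concrete, the sketch below assembles the JSON bodies that a client might send to create a custom language model and to add domain-specific words to it. The field names (`name`, `base_model_name`, `description`, `words`, `word`, `sounds_like`) are assumptions based on the public customization API and are not stated on this page; both function names and the sample values in the usage note are hypothetical. Nothing is sent over the network.

```python
import json


def new_language_model_payload(name, base_model_name, description=""):
    """Return the JSON body for creating a custom language model (assumed fields)."""
    body = {"name": name, "base_model_name": base_model_name}
    if description:
        body["description"] = description
    return json.dumps(body)


def custom_words_payload(pronunciations):
    """Return the JSON body for adding custom words (assumed fields).

    `pronunciations` maps each new word to a list of "sounds like" spellings;
    an empty list leaves the pronunciation to the service.
    """
    words = []
    for word, sounds_like in pronunciations.items():
        entry = {"word": word}
        if sounds_like:
            entry["sounds_like"] = list(sounds_like)
        words.append(entry)
    return json.dumps({"words": words})
```

For example, `custom_words_payload({"IEEE": ["I triple E"]})` yields a body with one word entry that carries a single pronunciation hint. Consult Customizing the service for the authoritative request formats before relying on these shapes.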

Language support

The service supports many languages and dialects:

Arabic (Modern Standard)
Chinese (Mandarin)
Czech
Dutch (Belgian and Netherlands)
English (Australian, Indian, United Kingdom, and United States)
French (Canadian and France)
German
Hindi (Indian)
Italian
Japanese
Korean
Portuguese (Brazilian)
Spanish (Castilian and Latin American)
Swedish

For more information about the supported languages and about using large speech models and previous- and next-generation models for speech recognition, see Using languages and models.

Audio support

The service accepts audio for transcription in many popular formats:

Ogg or Web Media (WebM) audio with the Opus or Vorbis codec
MP3 (or MPEG)
Waveform Audio File Format (WAV)
Free Lossless Audio Codec (FLAC)
Linear 16-bit Pulse-Code Modulation (PCM)
G.729
A-Law
Mu-law (or u-law)
Basic audio

For more information about the supported audio formats and their characteristics, see Using audio formats.
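
Because each request must state the format of its audio, a client typically maps a file to a content type before sending it. The following sketch shows one way to do that. The MIME type strings, and the rule that headerless formats (linear PCM, A-law, mu-law) need an explicit `rate` parameter, are assumptions based on the public API rather than statements from this page; the file extensions are an illustrative convention and `content_type_for` is a hypothetical helper.

```python
import os

# Formats whose files carry their own header, so no sampling rate is needed.
# The extension-to-type mapping is an illustrative assumption.
CONTAINER_TYPES = {
    ".ogg": "audio/ogg",
    ".webm": "audio/webm",
    ".mp3": "audio/mp3",
    ".wav": "audio/wav",
    ".flac": "audio/flac",
    ".g729": "audio/g729",
    ".au": "audio/basic",
}

# Headerless formats: the sampling rate is assumed to be required.
RAW_TYPES = {
    ".pcm": "audio/l16",
    ".alaw": "audio/alaw",
    ".ulaw": "audio/mulaw",
}


def content_type_for(path, rate=None):
    """Return a Content-Type value for an audio file, based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext in CONTAINER_TYPES:
        return CONTAINER_TYPES[ext]
    if ext in RAW_TYPES:
        if rate is None:
            raise ValueError(f"{ext} audio needs an explicit sampling rate")
        return f"{RAW_TYPES[ext]};rate={rate}"
    raise ValueError(f"unrecognized audio extension: {ext!r}")
```

For instance, `content_type_for("call.pcm", rate=16000)` returns `audio/l16;rate=16000`. Raising an error for unknown extensions, rather than guessing, avoids sending audio with a mismatched format declaration.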

Integrated use cases

You can use the Speech to Text service with other Watson services to create applications with even greater scope and functionality:

AI assistant on the phone - Eliminate hold times and improve customer satisfaction with IBM® watsonx™ Assistant phone integration. Provide live support to your customers with the pre-built integration of watsonx Assistant, Speech to Text, and IBM Watson® Text to Speech.
Analyze customer calls - Uncover patterns and conduct root-cause analysis on transcriptions of phone calls between your customers and call center agents. Transcribe audio by using Speech to Text, and then analyze the transcription with IBM Watson® Natural Language Understanding.
Support agents - Provide real-time information to improve agent efficiency and focus. Use Speech to Text to transcribe calls live, and then use IBM Watson® Discovery to automatically surface relevant information so that your agent can focus on the customer rather than on the search.

Beta features

IBM occasionally releases features and language support that are classified as beta. Such features are provided so that you can evaluate their functionality. They might be unstable and are subject to change or removal on short notice. They are not intended for use in a production environment.

Beta features might not provide the same level of performance or compatibility as generally available features. Generally available features are ready for use in a production environment.

Pricing

IBM Cloud

The service offers multiple pricing plans to suit your usage and application needs:

For general information about the pricing plans and answers to common questions, see the Pricing FAQs.
For more information about the pricing plans or to purchase a plan, see the Speech to Text service in the IBM Cloud® Catalog.