Speech to Text

Low-latency, streaming transcription

Date of last update: 12/12/2024

Type: Service
Provider: IBM
Last updated: 12/12/2024
Category: AI / Machine Learning
Compliance: EU Supported, HIPAA Enabled, IAM-enabled
Related links: API docs, Docs, Terms



Summary

The Speech to Text service converts the human voice into the written word. The service uses deep-learning AI to apply knowledge of grammar, language structure, and the composition of audio and voice signals to accurately transcribe human speech. It can be used in applications such as voice-automated chatbots, analytics tools for customer-service call centers, and multimedia transcription.

Features and capabilities

Available languages

Brazilian Portuguese, Chinese (Mandarin dialect), Dutch, English (US and UK dialects), French, German, Italian, Japanese, Korean, Spanish (Argentinian, Castilian, Chilean, Colombian, Mexican, and Peruvian dialects), and Modern Standard Arabic (broadband model only). Base models are available for broadband audio sampled at 16 kHz and narrowband audio sampled at 8 kHz, in a wide range of audio formats.
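Which base model fits depends on the audio's sampling rate. The sketch below is a hypothetical helper that picks a model name for a language code. The `<language>_BroadbandModel` / `<language>_NarrowbandModel` naming pattern and the `ar-MS` code for Modern Standard Arabic are assumptions not stated on this page, so check them against the models the service actually lists.

```python
# Hypothetical helper. The model naming pattern and language codes are
# assumptions; confirm them against the service's list of available models.

# Assumed code for Modern Standard Arabic, which has a broadband model only.
BROADBAND_ONLY = {"ar-MS"}


def choose_base_model(language: str, sample_rate_hz: int) -> str:
    """Return a broadband model for audio sampled at 16 kHz or higher,
    otherwise a narrowband (8 kHz) model."""
    if sample_rate_hz >= 16_000:
        return f"{language}_BroadbandModel"
    if language in BROADBAND_ONLY:
        raise ValueError(f"{language} has no narrowband base model")
    return f"{language}_NarrowbandModel"


print(choose_base_model("en-US", 16_000))  # en-US_BroadbandModel
print(choose_base_model("en-GB", 8_000))   # en-GB_NarrowbandModel
```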

Interfaces and SDKs

Request transcription with the synchronous or asynchronous HTTP REST APIs, or use the WebSocket interface for efficient, low-latency, high-throughput requests over a full-duplex connection. Send all audio at once, or stream continuous audio for live speech recognition. Use SDKs for Node.js, Java, Python, Swift, and other languages to simplify development.
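A synchronous REST request sends the whole audio file in a single POST. The following minimal sketch uses only the Python standard library. It assumes the `/v1/recognize` endpoint, a `model` query parameter, and IBM Cloud's basic-auth convention (user name `apikey`, password set to your API key). `build_recognize_request` is a hypothetical name, and the service URL is a placeholder for the one in your instance's credentials.

```python
import base64
import urllib.request
from urllib.parse import urlencode


def build_recognize_request(service_url: str, apikey: str, audio: bytes,
                            content_type: str, model: str) -> urllib.request.Request:
    """Build (but do not send) a synchronous POST /v1/recognize request.

    Assumes IBM Cloud basic auth: the literal user name "apikey" and the
    API key as the password.
    """
    url = f"{service_url.rstrip('/')}/v1/recognize?{urlencode({'model': model})}"
    credentials = base64.b64encode(f"apikey:{apikey}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=audio,  # the whole audio file is sent as the request body
        method="POST",
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": content_type,  # for example "audio/flac" or "audio/wav"
        },
    )
```

Passing the returned request to `urllib.request.urlopen` would send it and return the JSON transcript. The official SDKs wrap a similar flow; streaming over WebSockets is not covered by this sketch.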

Language and acoustic customization

Use language model customization to define domain-specific words that expand the service's base vocabulary, acoustic model customization to improve recognition of your audio's acoustic characteristics, and grammars to restrict recognition to specific strings and phrases. Create multiple models and grammars for different purposes, and combine all three capabilities to adapt recognition to your application's requirements.
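Custom language models, custom acoustic models, and grammars are selected per recognition request. This minimal sketch assumes the recognize query parameters `language_customization_id`, `acoustic_customization_id`, and `grammar_name`, which come from the service's API reference rather than this page. The helper `customization_query` is hypothetical. It encodes the rule that a grammar is defined within a custom language model, so a grammar name without that model's ID is rejected.

```python
from typing import Optional
from urllib.parse import urlencode


def customization_query(language_customization_id: Optional[str] = None,
                        acoustic_customization_id: Optional[str] = None,
                        grammar_name: Optional[str] = None) -> str:
    """Build the customization part of a recognize query string.

    A grammar belongs to a custom language model, so grammar_name is only
    meaningful together with language_customization_id.
    """
    if grammar_name and not language_customization_id:
        raise ValueError("grammar_name requires language_customization_id")
    params = {
        "language_customization_id": language_customization_id,
        "acoustic_customization_id": acoustic_customization_id,
        "grammar_name": grammar_name,
    }
    # Omit parameters that were not supplied.
    return urlencode({name: value for name, value in params.items() if value})
```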

Keyword spotting and speaker labels

Spot user-specified keywords in the audio, returning only matches that meet a confidence threshold you set. Identify the different speakers in a multi-participant conversation.
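Keyword matches and speaker labels arrive as extra fields in the recognition response. This sketch assumes the response shape described in the API reference: a per-result `keywords_result` map from each keyword to its matches, and a top-level `speaker_labels` list whose `from` times line up with word start timestamps. The sample response is hand-written for illustration, not real service output, and both helpers are hypothetical. The service itself filters matches with the `keywords_threshold` request parameter; `min_confidence` here is an additional client-side cutoff.

```python
from typing import Dict, List, Optional, Tuple

# Hand-written sample in the assumed response shape; not real service output.
KEYWORD_SAMPLE: Dict = {
    "results": [
        {
            "final": True,
            "alternatives": [
                {
                    "transcript": "cancel my order please ",
                    "timestamps": [
                        ["cancel", 0.12, 0.55],
                        ["my", 0.55, 0.7],
                        ["order", 0.7, 1.1],
                        ["please", 1.3, 1.8],
                    ],
                }
            ],
            "keywords_result": {
                "cancel": [{"normalized_text": "cancel", "start_time": 0.12,
                            "end_time": 0.55, "confidence": 0.91}],
                "order": [{"normalized_text": "order", "start_time": 0.7,
                           "end_time": 1.1, "confidence": 0.48}],
            },
        }
    ],
    "speaker_labels": [
        {"from": 0.12, "to": 0.55, "speaker": 0, "confidence": 0.6, "final": True},
        {"from": 0.55, "to": 0.7, "speaker": 0, "confidence": 0.6, "final": True},
        {"from": 0.7, "to": 1.1, "speaker": 0, "confidence": 0.6, "final": True},
        {"from": 1.3, "to": 1.8, "speaker": 1, "confidence": 0.5, "final": True},
    ],
}


def spotted_keywords(response: Dict, min_confidence: float) -> List[Tuple[str, float, float]]:
    """Return (keyword, start_time, confidence) for matches at or above min_confidence."""
    hits = []
    for result in response.get("results", []):
        for keyword, matches in result.get("keywords_result", {}).items():
            for match in matches:
                if match["confidence"] >= min_confidence:
                    hits.append((keyword, match["start_time"], match["confidence"]))
    return sorted(hits, key=lambda hit: hit[1])  # order by position in the audio


def words_by_speaker(response: Dict) -> List[Tuple[Optional[int], str]]:
    """Pair each word with the speaker whose label starts at the same time."""
    speaker_at = {label["from"]: label["speaker"]
                  for label in response.get("speaker_labels", [])}
    pairs = []
    for result in response.get("results", []):
        best = result["alternatives"][0]
        for word, start, _end in best.get("timestamps", []):
            pairs.append((speaker_at.get(start), word))  # None if no label matched
    return pairs
```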

Transcript metadata

Receive a JSON response that includes confidence scores, start and end times for each word, and multiple alternative transcripts. Split a transcript into multiple results based on semantic features such as sentences.
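The JSON metadata can be consumed directly. This sketch assumes the documented response layout: a `results` list, each entry with a `final` flag and an `alternatives` list whose first entry is the best hypothesis, carries `confidence`, and, when `timestamps=true` is requested, holds `[word, start, end]` triples. The sample is hand-written, and `final_transcripts` and `result_spans` are hypothetical helpers.

```python
from typing import Dict, List, Optional, Tuple

# Hand-written sample in the assumed response shape; not real service output.
METADATA_SAMPLE: Dict = {
    "result_index": 0,
    "results": [
        {"final": True, "alternatives": [
            {"transcript": "the meeting starts at noon ", "confidence": 0.96,
             "timestamps": [["the", 0.1, 0.2], ["meeting", 0.2, 0.6],
                            ["starts", 0.6, 1.0], ["at", 1.0, 1.1],
                            ["noon", 1.1, 1.5]]},
            {"transcript": "the meeting starts at new "}]},
        {"final": True, "alternatives": [
            {"transcript": "bring the slides ", "confidence": 0.88,
             "timestamps": [["bring", 2.0, 2.3], ["the", 2.3, 2.4],
                            ["slides", 2.4, 2.9]]}]},
    ],
}


def final_transcripts(response: Dict) -> List[Tuple[str, Optional[float]]]:
    """Return (best transcript, confidence) for each final result."""
    out = []
    for result in response.get("results", []):
        if not result.get("final"):
            continue  # interim results from streaming may still change
        best = result["alternatives"][0]  # first alternative = best hypothesis
        out.append((best["transcript"].strip(), best.get("confidence")))
    return out


def result_spans(response: Dict) -> List[Tuple[float, float]]:
    """Return (start, end) seconds of each final result from its word timestamps."""
    spans = []
    for result in response.get("results", []):
        stamps = result["alternatives"][0].get("timestamps", [])
        if result.get("final") and stamps:
            spans.append((stamps[0][1], stamps[-1][2]))
    return spans
```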

Transcript refinement

Apply smart formatting to convert dates, times, numbers, currency values, phone numbers, and more to conventional written forms in final transcripts. Redact sensitive personal information such as credit card numbers from transcripts. Censor profanity from US English transcripts and metadata.
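Each refinement is switched on per request. This minimal sketch assumes the boolean recognize query parameters `smart_formatting`, `redaction`, and `profanity_filter` from the service's API reference, encoded as lowercase `true`/`false`. `refinement_query` is a hypothetical helper.

```python
from urllib.parse import urlencode


def refinement_query(smart_formatting: bool = True,
                     redaction: bool = False,
                     profanity_filter: bool = True) -> str:
    """Build recognize query parameters for transcript refinement.

    Booleans are encoded as lowercase "true"/"false" strings.
    """
    params = {
        "smart_formatting": smart_formatting,  # dates, times, numbers, currency, ...
        "redaction": redaction,                # redact data such as credit card numbers
        "profanity_filter": profanity_filter,  # censor profanity (US English)
    }
    return urlencode({name: str(value).lower() for name, value in params.items()})
```

The resulting string would be appended to the recognize URL's query string, joined to any other parameters with `&`.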

Processing and audio metrics

Request processing metrics for detailed information about the service's analysis of your audio, or audio metrics for details about the precise signal characteristics of your audio.

Getting support

If you're experiencing issues with this product, go to the IBM Cloud Support Center and create a case. Use the All products option to search for this product and continue creating the case, or to find more information about getting support. Third-party and community-supported products might direct you to a support process outside of IBM Cloud.
