Speech to Text

Low-latency, streaming transcription

Type
  • Service
Provider
  • IBM
Updated on
  • 05/20/2022
Category
  • AI / Machine Learning
Compliance
  • EU Supported
  • HIPAA Enabled
  • IAM-enabled
Related links
  • API docs
  • Docs
  • Terms

Pricing plans

Plan | Features | Pricing

Summary

The Speech to Text service converts the human voice into the written word. The service uses deep-learning AI to apply knowledge of grammar, language structure, and the composition of audio and voice signals to accurately transcribe human speech. It can be used in applications such as voice-automated chatbots, analytics tools for customer-service call centers, and multimedia transcription, among many others.

Features

Available languages

Brazilian Portuguese, Chinese (Mandarin dialect), Dutch, English (US and UK dialects), French, German, Italian, Japanese, Korean, Spanish (Argentinian, Castilian, Chilean, Colombian, Mexican, and Peruvian dialects), and Modern Standard Arabic (broadband model only). Base models are available for audio sampled at 16 kHz (broadband) and 8 kHz (narrowband) in a wide range of audio formats.
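
To see which base models a particular service instance offers, you can query the service's list-models endpoint. The following is a minimal sketch, assuming the ibm_watson Python SDK; the API key and service URL are placeholders for your own credentials.

    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    # Placeholder credentials -- substitute your own API key and instance URL.
    speech_to_text = SpeechToTextV1(authenticator=IAMAuthenticator('{apikey}'))
    speech_to_text.set_service_url('{url}')

    # List every base model with its language and sampling rate in Hz.
    models = speech_to_text.list_models().get_result()
    for model in models['models']:
        print(f"{model['name']}: {model['language']}, {model['rate']} Hz")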

Interfaces and SDKs

Request transcription with synchronous or asynchronous HTTP REST APIs, or use WebSockets for efficient, low-latency, high-throughput requests over a full-duplex connection. Send all audio at once or stream continuous audio for live speech recognition. Use SDKs for simplified rapid development in Node, Java, Python, Swift, and many other languages.
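
As a rough sketch of the synchronous HTTP interface, the snippet below uses the ibm_watson Python SDK to send a complete audio file and print the transcript; the API key, service URL, and file name are placeholders.

    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    # Placeholder credentials -- substitute your own API key and instance URL.
    speech_to_text = SpeechToTextV1(authenticator=IAMAuthenticator('{apikey}'))
    speech_to_text.set_service_url('{url}')

    # Synchronous request: send the whole file and wait for the full transcript.
    with open('audio.flac', 'rb') as audio_file:
        response = speech_to_text.recognize(
            audio=audio_file,
            content_type='audio/flac',
            model='en-US_BroadbandModel',
        ).get_result()

    # Each result carries one or more alternatives; print the best transcript.
    for result in response['results']:
        print(result['alternatives'][0]['transcript'])

For live recognition, the same SDK exposes a WebSocket interface that streams audio over a single full-duplex connection and returns interim results as they become available.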

Language and Acoustic Customization

Use language model customization to define domain-specific words that expand the service's base vocabulary; acoustic model customization to enhance recognition for the acoustic characteristics of your audio; and grammars to limit recognition to specific strings and phrases only. Create multiple models and grammars for different purposes, and combine all three capabilities to adapt recognition for your application's requirements.
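
The sketch below, again assuming the ibm_watson Python SDK and placeholder credentials, creates a custom language model on top of a US English base model, adds a domain-specific word, and starts training; the model name and word are illustrative only.

    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    speech_to_text = SpeechToTextV1(authenticator=IAMAuthenticator('{apikey}'))
    speech_to_text.set_service_url('{url}')

    # Create an empty custom language model based on a US English broadband model.
    language_model = speech_to_text.create_language_model(
        name='Call center model',
        base_model_name='en-US_BroadbandModel',
        description='Domain-specific vocabulary for call center transcripts',
    ).get_result()
    customization_id = language_model['customization_id']

    # Add a domain-specific word, with a hint for how it is pronounced.
    speech_to_text.add_word(
        customization_id,
        word_name='IBMer',
        sounds_like=['eye bee emmer'],
    )

    # Train the model; it must finish training before it can be used.
    speech_to_text.train_language_model(customization_id)

Once trained, the model is applied by passing its ID as the language_customization_id parameter of a recognition request; acoustic models and grammars are referenced analogously with acoustic_customization_id and grammar_name.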

Keyword spotting and speaker labels

Identify specific keyword strings from the audio with a user-defined level of confidence. Identify different speakers from a multi-participant conversation.
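
A minimal sketch of both features, assuming the ibm_watson Python SDK with placeholder credentials, keywords, and file name: the keywords and keywords_threshold parameters control spotting, and speaker_labels asks the service to attribute words to speakers.

    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    speech_to_text = SpeechToTextV1(authenticator=IAMAuthenticator('{apikey}'))
    speech_to_text.set_service_url('{url}')

    with open('meeting.wav', 'rb') as audio_file:
        response = speech_to_text.recognize(
            audio=audio_file,
            content_type='audio/wav',
            model='en-US_BroadbandModel',
            keywords=['refund', 'cancellation'],
            keywords_threshold=0.5,   # only report matches with >= 0.5 confidence
            speaker_labels=True,      # label which speaker said each word
        ).get_result()

    # Keyword matches are reported per result; speaker labels are listed per word.
    for result in response['results']:
        print(result.get('keywords_result', {}))
    for label in response.get('speaker_labels', []):
        print(label['from'], label['to'], label['speaker'])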

Transcript metadata

Receive a JSON response that includes confidence scores, start and end times, and multiple possible alternatives. Split a transcript into multiple results based on semantic features such as sentences.
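
As an illustrative sketch (ibm_watson Python SDK, placeholder credentials and file name), the parameters below request word timestamps, per-word confidence, and up to three alternative transcripts, all of which come back in the JSON response.

    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    speech_to_text = SpeechToTextV1(authenticator=IAMAuthenticator('{apikey}'))
    speech_to_text.set_service_url('{url}')

    with open('audio.flac', 'rb') as audio_file:
        response = speech_to_text.recognize(
            audio=audio_file,
            content_type='audio/flac',
            timestamps=True,        # start/end time of every word
            word_confidence=True,   # per-word confidence scores
            max_alternatives=3,     # return up to three alternative transcripts
        ).get_result()

    # Inspect the best alternative of the first result.
    best = response['results'][0]['alternatives'][0]
    print(best['transcript'], best['confidence'])
    for word, start, end in best['timestamps']:
        print(f"{word}: {start:.2f}s - {end:.2f}s")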

Transcript refinement

Apply smart formatting to convert dates, times, numbers, currency values, phone numbers, and more to conventional written forms in final transcripts. Redact sensitive personal information such as credit card numbers from transcripts. Censor profanity from US English transcripts and metadata.
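
These options are individual request parameters. A minimal sketch with the ibm_watson Python SDK (placeholder credentials and file name) turns on smart formatting, redaction of long digit strings such as credit card numbers, and the profanity filter.

    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    speech_to_text = SpeechToTextV1(authenticator=IAMAuthenticator('{apikey}'))
    speech_to_text.set_service_url('{url}')

    with open('support_call.wav', 'rb') as audio_file:
        response = speech_to_text.recognize(
            audio=audio_file,
            content_type='audio/wav',
            model='en-US_NarrowbandModel',
            smart_formatting=True,   # dates, times, numbers, currency in written form
            redaction=True,          # mask long digit strings such as card numbers
            profanity_filter=True,   # censor profanity (US English)
        ).get_result()

    print(response['results'][0]['alternatives'][0]['transcript'])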

Processing and audio metrics

Request processing metrics for detailed information about the service's analysis of your audio, or audio metrics for details about the precise signal characteristics of your audio.
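
Audio metrics can be requested on any recognition call; processing metrics are delivered incrementally and are requested through the WebSocket and asynchronous HTTP interfaces. A minimal sketch of the former, assuming the ibm_watson Python SDK and placeholder credentials:

    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    speech_to_text = SpeechToTextV1(authenticator=IAMAuthenticator('{apikey}'))
    speech_to_text.set_service_url('{url}')

    with open('audio.flac', 'rb') as audio_file:
        response = speech_to_text.recognize(
            audio=audio_file,
            content_type='audio/flac',
            audio_metrics=True,   # include signal characteristics in the response
        ).get_result()

    # Accumulated metrics summarize the signal over the whole recognition request.
    print(response['audio_metrics']['accumulated'])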
