The Speech to Text service converts the human voice into the written word. The service uses deep-learning AI to apply knowledge of grammar, language structure, and the composition of audio and voice signals to accurately transcribe human speech. It can be used in applications such as voice-automated chatbots, analytic tools for customer-service call centers, and multi-media transcription, among many others.
The service supports Brazilian Portuguese, Chinese (Mandarin dialect), Dutch, English (US and UK dialects), French, German, Italian, Japanese, Korean, Spanish (Argentinian, Castilian, Chilean, Colombian, Mexican, and Peruvian dialects), and Modern Standard Arabic (broadband model only). Base models are available for audio sampled at 16 kHz (broadband) and 8 kHz (narrowband) in a wide range of audio formats.
Request transcription with synchronous or asynchronous HTTP REST APIs, or use WebSockets for efficient, low-latency, high-throughput requests over a full-duplex connection. Send all audio at once or stream continuous audio for live speech recognition. Use SDKs for simplified rapid development in Node, Java, Python, Swift, and many other languages.
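As a rough illustration of the synchronous HTTP interface, the sketch below builds (but does not send) a POST request that carries a whole audio file in one call. The base URL, the `/v1/recognize` path, the `model` query parameter, and the model name are assumptions made for illustration; consult the service's API reference for the actual values.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; substitute the URL of your own service instance.
BASE_URL = "https://speech.example.com/v1/recognize"


def build_recognize_request(audio: bytes, content_type: str, model: str) -> Request:
    """Build a POST request that sends all of the audio at once."""
    query = urlencode({"model": model})  # assumed query parameter name
    return Request(
        url=f"{BASE_URL}?{query}",
        data=audio,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Placeholder bytes stand in for real audio data.
request = build_recognize_request(b"\x00\x01", "audio/flac", "en-US_BroadbandModel")
```

Passing the returned object to `urllib.request.urlopen` would send it; authentication is omitted here because it depends on how the service instance is provisioned.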
Use language model customization to define domain-specific words that expand the service's base vocabulary; acoustic model customization to enhance recognition for the acoustic characteristics of your audio; and grammars to limit recognition to specific strings and phrases only. Create multiple models and grammars for different purposes, and combine all three capabilities to adapt recognition for your application's requirements.
Identify specific keyword strings from the audio with a user-defined level of confidence. Identify different speakers from a multi-participant conversation.
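The idea of a user-defined confidence level can be sketched as a simple threshold over candidate matches. The service applies this on its side; the `KeywordMatch` record and `filter_matches` helper below are hypothetical names used only to show the principle, and the values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class KeywordMatch:
    """Hypothetical record for one spotted keyword."""
    keyword: str
    confidence: float  # 0.0 to 1.0
    start_time: float  # seconds
    end_time: float    # seconds


def filter_matches(matches, threshold):
    """Keep only matches whose confidence meets the user-defined threshold."""
    return [m for m in matches if m.confidence >= threshold]


# Illustrative values only, not real service output.
candidates = [
    KeywordMatch("refund", 0.92, 3.1, 3.6),
    KeywordMatch("cancel", 0.41, 7.8, 8.3),
]
accepted = filter_matches(candidates, threshold=0.5)
```

A higher threshold yields fewer but more reliable matches; a lower one catches more occurrences at the cost of false positives.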
Receive a JSON response that includes confidence scores, start and end times, and multiple possible alternatives. Split a transcript into multiple results based on semantic features such as sentences.
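A minimal sketch of reading such a response follows. The payload is an illustrative example, and its field names (`results`, `alternatives`, `transcript`, `confidence`, `timestamps`) are assumptions about the response shape rather than confirmed output; `best_transcripts` is a hypothetical helper.

```python
import json

# Illustrative payload; the field names are assumptions, not confirmed output.
sample = """
{
  "results": [
    {
      "final": true,
      "alternatives": [
        {
          "transcript": "hello world",
          "confidence": 0.94,
          "timestamps": [["hello", 0.0, 0.4], ["world", 0.5, 0.9]]
        },
        {"transcript": "hello word"}
      ]
    }
  ]
}
"""


def best_transcripts(payload: str):
    """Return (transcript, confidence, word timings) for the first alternative of each result."""
    out = []
    for result in json.loads(payload)["results"]:
        top = result["alternatives"][0]
        out.append((top["transcript"], top.get("confidence"), top.get("timestamps", [])))
    return out


parsed = best_transcripts(sample)
```

Using `.get` for the confidence and timing fields keeps the helper tolerant of alternatives that carry only a transcript.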
Apply smart formatting to convert dates, times, numbers, currency values, phone numbers, and more to conventional written forms in final transcripts. Redact sensitive personal information such as credit card numbers from transcripts. Censor profanity from US English transcripts and metadata.
Request processing metrics for detailed information about the service's analysis of your audio, or audio metrics for details about the precise signal characteristics of your audio.