The Text to Speech service converts written text to natural-sounding speech and streams the synthesized audio back with minimal delay. The audio uses cadence and intonation appropriate to its language and dialect, so the voices sound smooth and natural. Typical uses include voice-automated chatbots and other voice-driven or screenless applications, such as accessibility tools for people with disabilities or visual impairments, video narration and voice-over, and educational and home-automation solutions.
The service supports Arabic, Brazilian Portuguese, Chinese (Mandarin dialect), Dutch, English (US and UK dialects), French, German, Italian, Japanese, Korean, and Spanish (Castilian, Latin American, and North American dialects).
Choose from a variety of male and female voices for different languages. Most languages provide both Neural and Standard voices, although some provide only one type. Neural voices generate audio by using deep neural networks to predict the acoustic features of the requested speech. Standard voices assemble audio by concatenating segments of recorded speech.
Request synthesis through the HTTP REST or WebSocket interface. For languages other than Japanese, the WebSocket interface can also return timing information for the words in the resulting audio. Use SDKs for simplified, rapid development in Node.js, Java, Python, Swift, and many other languages.
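As a rough illustration of an HTTP synthesis request, the Python sketch below builds a POST request but does not send it. Everything specific in it is an assumption, not something stated above: the `/v1/synthesize` path, the `voice` query parameter, the JSON body with a `text` field, Basic authentication with the user name `apikey`, the voice name, and the placeholder base URL. Check the service's API reference for the actual values.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_synthesize_request(base_url, api_key, text, voice, accept="audio/wav"):
    """Build (but do not send) a POST request that asks for synthesized audio."""
    # Assumed endpoint layout: <base_url>/v1/synthesize?voice=<voice name>
    query = urllib.parse.urlencode({"voice": voice})
    url = f"{base_url.rstrip('/')}/v1/synthesize?{query}"

    # Assumed auth scheme: HTTP Basic with the literal user name "apikey".
    token = base64.b64encode(f"apikey:{api_key}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
        "Accept": accept,  # requested audio format
    }

    # Assumed body shape: a JSON object carrying the text to speak.
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_synthesize_request(
    base_url="https://example.invalid/text-to-speech/api",  # placeholder URL
    api_key="YOUR_API_KEY",                                  # placeholder key
    text="Hello from the Text to Speech service.",
    voice="en-US_AllisonV3Voice",                            # assumed voice name
)
print(request.get_method(), request.full_url)
```

To actually send the request, pass it to `urllib.request.urlopen` and write the response bytes to an audio file. The sketch stops before that step so that it runs without credentials.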
Annotate input text with the Speech Synthesis Markup Language (SSML), a standard XML-based notation for speech-synthesis applications. Use SSML to control aspects of speech synthesis such as pronunciation, volume, pitch, speed, and other attributes.
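To make the SSML annotation concrete, here is a minimal Python sketch that wraps text in a `<prosody>` element and optionally appends a `<break>`. The helper name `build_ssml` and its parameters are hypothetical. The elements and attribute values come from the W3C SSML specification; which of them a given voice honors is not stated here and should be checked against the service documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml(text, rate="medium", pitch="medium", volume="medium", pause_ms=0):
    """Return an SSML document that speaks `text` with the given prosody."""
    speak = ET.Element("speak", {"version": "1.0"})

    # <prosody> controls speed (rate), pitch, and volume of the enclosed text.
    prosody = ET.SubElement(
        speak, "prosody", {"rate": rate, "pitch": pitch, "volume": volume}
    )
    prosody.text = text  # ElementTree escapes &, <, and > on serialization

    # <break> inserts a pause of the given length after the spoken text.
    if pause_ms > 0:
        ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})

    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Your order has shipped.", rate="slow", pitch="high", pause_ms=500)
print(ssml)
```

Building the markup with an XML library rather than string concatenation keeps the document well-formed even when the input text contains characters such as `&` or `<`.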
Use voice customization to refine the service's language-dependent rules for pronunciation. Define custom dictionaries for domain-specific terms, words with foreign origins, personal or geographic names, and abbreviations or acronyms in your application's lexicon. Define pronunciations based on other words, or create pronunciations based on phoneme symbols in the International Phonetic Alphabet (IPA) or IBM Symbolic Phonetic Representation (SPR).
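The two kinds of pronunciation entry described above, one based on other words and one based on phoneme symbols, can be sketched as plain data. This Python example is illustrative only: the field names `word` and `translation`, the `words` wrapper, the `<phoneme>` markup used for phonetic entries, and the sample pronunciations are all assumptions rather than the service's confirmed schema.

```python
import json


def sounds_like(word, spoken_as):
    """Entry that defines a pronunciation in terms of other words."""
    return {"word": word, "translation": spoken_as}


def phonetic(word, phonemes, alphabet="ipa"):
    """Entry that defines a pronunciation with phoneme symbols (assumed markup)."""
    markup = f'<phoneme alphabet="{alphabet}" ph="{phonemes}"></phoneme>'
    return {"word": word, "translation": markup}


# Hypothetical custom dictionary for an application's lexicon.
custom_dictionary = {
    "words": [
        sounds_like("IEEE", "I triple E"),   # acronym spelled out with words
        sounds_like("NCAA", "N C double A"),
        phonetic("tomato", "təˈmɑːtoʊ"),     # illustrative IPA transcription
    ]
}

# ensure_ascii=False keeps the IPA symbols readable in the JSON output.
print(json.dumps(custom_dictionary, ensure_ascii=False, indent=2))
```

Keeping the entries as data makes it straightforward to store a lexicon in version control and submit it through whichever customization interface the service exposes.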
Work with IBM to train a new voice for your specific use case and target market. IBM can train a new voice with as little as one hour of training data. This feature is currently available only to Premium customers.