The science behind the service

The IBM Watson® Text to Speech service offers voices that rely on natural voices, expressive voices and enhanced neural voices. A brief overview of each of these voices follows.

Natural voices

The natural voices in the portfolio use an encoder-decoder architecture that disentangles timbre and prosody characteristics to better guide the synthesis. That characteristics are able to provide state-of-the-art performance in terms of naturalness and expressiveness. These novel natural voices use various techniques to provide an edge over the existing expressive voices, including:

use of diffusion denoising models to better predict style and timbre features [1].
use of model pre-trained by using large amounts of data, followed by model refinement with smaller, dedicated expressive corpora.
minimizing modular approaches, and fully integrating the encoder and decoder into an end-to-end pipeline that is jointly trained.

On the encoder side, the architecture comprises:

phonetic encoders to process the linguistic inputs, which consist of a sequence of phonemes (augmented with orthographic punctuation) generated from raw text by a rules-based front-end responsible for text normalization and phonetization.
a diffusion model that is responsible for predicting latent representations for timbre and prosody, which is guided by a combination of a global speaker embedding and an optional reference prompt embedding.
a prosody prediction model that generates explicit duration and normalized (speaker-agnostic) pitch and volume targets.
a prosody de-normalization model that corrects the normalized pitch and volume targets based on a timbre latent representation

The decoder absorbs the information that is produced by the encoder modules to generate waveforms, which is guided by perceptual adversarial losses (involving mel-spectra and WavLM-based losses), deploying an advanced vocoder with a novel streaming support for improved latency.

Expressive voices

The expressive voices work a style-sensitive, prosody-controllable architecture based on non-attentive Tacotron2 acoustic model, augmented with a set of Hierarchical Prosodic Controls (HPCs) [2],[3]. At a high level, it contains the following components:

an encoder module, which embeds the full inputs to the model, and which comprise the phoneme sequence (generated by the same modules that are deployed in the Natural Voices architecture), phrase-level linguistic features, and a style-vector.
a prosodic module that use the output of the encoder to predict the HPCs: a nested sequence of speaker-agnostic prosodic descriptors that contain various statistics related to pitch, energy, and duration. These HPCs provide the fine-grained conditioning (for example, at the phone, word, and utterance levels) to help realize the distinct prosodic patterns associated with different styles.
a non-auto-regressive decoder that takes the output of the encoder and HPC modules, plus a speaker embedding, to generate a sequence of spectral and periodicity features that are finally fed into a separately trained neural vocoder (an LPC Network) to generate high-quality audio.

Enhanced neural voices

Enhanced neural voices represent the oldest technology in the catalog, and use a modular, cascaded, fully Deep-Neural-Network (DNN) based approach to provide the back-end for speech synthesis [4]. Just like natural and expressive voices, a separate module is responsible for processing the text to normalize and extract phonetic sequences that are then input into:

a prosody-prediction DNN that predicts the pitch and phoneme duration from textual features
an acoustic feature DNN that uses these predicted prosody targets, plus phonetic information, to generate spectral and periodicity features
a neural vocoder that uses these spectral features and generates an output waveform.

This modular approach has the advantage of enabling fast and simple training, and independent control of each component and fast run time performance.

References

[1]: Yinghao Aaron Li, Cong Han, Vinay Raghavan, Gavin Mischler, Nima Mesgarani -- StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. Proc. NeurIPS 2023.

[2]: Slava Shechtman, Raul Fernandez, Alexander Sorin, and David Haws -- Synthesis of Expressive Speaking Styles with Limited Training Data in a Multi-Speaker, Prosody-Controllable Sequence-to-Sequence Architecture. Proc. Interspeech 2021, pp. 4693-4697.

[3]: Raul Fernandez, David Haws, Guy Lorberbom, Slava Shechtman, and Alexander Sorin -- Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis. Proc. Interspeech 2022, pp. 5488-5492.

[4]: Zvi Kons, Slava Shechtman, Alex Sorin, Carmel Rabinovitz, and Ron Hoory -- High Quality, Lightweight and Adaptable TTS Using LPCNet. Proc. Interspeech 2019, pp. 176-180.