The science behind the service

IBM® has been at the forefront of speech recognition research since the early 1960s and continues this rich tradition of research and development with the IBM Watson® Speech to Text service.

IBM has demonstrated industry-record speech recognition accuracy on the public benchmark data sets for Conversational Telephone Speech (CTS) and Broadcast News (BN) transcription. IBM used neural networks for language modeling and also demonstrated the effectiveness of its acoustic modeling.

The following announcements summarize IBM's recent speech recognition accomplishments:

These accomplishments help to further advance IBM's speech services. Recent ideas that best fit the cloud-based Speech to Text service include the following:

For language modeling, IBM leverages a neural network-based language model to generate training text.
For acoustic modeling, IBM uses a fairly compact model to accommodate the resource limitations of the cloud. To train this compact model, IBM uses teacher-student training, also known as knowledge distillation. Large, high-capacity neural networks such as Long Short-Term Memory (LSTM), VGG, and Residual Network (ResNet) models are trained first. The output of these networks is then used as teacher signals to train a compact model for actual deployment.
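
The teacher-student idea can be illustrated with a minimal sketch. It is not IBM's implementation: the function names, the temperature parameter, and the toy logits are assumptions made for illustration only. The sketch computes a common form of distillation loss, the cross-entropy between the teacher's softened output distribution and the student's, which training of the compact model would aim to minimize.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature softens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened output against the teacher's.

    The teacher's probabilities act as the "teacher signals" (soft targets).
    The temperature value is an assumed hyperparameter, not taken from IBM.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


if __name__ == "__main__":
    # Hypothetical scores over three acoustic classes for one audio frame.
    teacher = [4.0, 1.0, 0.5]        # output of a large, already trained model
    close_student = [3.5, 1.2, 0.4]  # compact model that roughly agrees
    far_student = [0.5, 1.0, 4.0]    # compact model that disagrees

    print("loss (close student):", distillation_loss(close_student, teacher))
    print("loss (far student):  ", distillation_loss(far_student, teacher))
```

A student whose outputs track the teacher's yields a lower loss than one that disagrees. In practice this loss would be computed over many frames and minimized with a gradient-based optimizer, often combined with a loss against the reference transcriptions.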

To push the envelope further, IBM is also focusing on end-to-end modeling. For example, it has established a strong modeling pipeline for direct acoustic-to-word models, which it continues to improve. It is also working to create compact end-to-end models for future deployment on the cloud.