Using audio formats

The IBM Watson® Text to Speech service can return synthesized audio in a number of popular audio formats (or MIME types). For information about all supported formats, see Supported audio formats.

To make the best use of the service, you need to understand the sampling rate of the audio that the service returns and how to specify a different rate if you need to. For more information, see Sampling rate. The service always returns single-channel audio for all formats.

Sampling rate

The sampling rate (or sampling frequency) is the number of audio samples that are generated per second. Sampling rate is measured in hertz (Hz) or kilohertz (kHz). For example, a rate of 16,000 samples per second is equal to 16,000 Hz (or 16 kHz).

Internally, the service always synthesizes audio with a sampling rate of 22,050 Hz. For many formats, the service also returns audio with this sampling rate by default. For other formats, the default differs: for example, audio/basic is always returned at 8,000 Hz, and audio/ogg;codecs=opus defaults to 48,000 Hz.

For most formats, you can specify a different sampling rate for the audio. For the audio/alaw, audio/l16, and audio/mulaw formats, you must specify a sampling rate. You specify a sampling rate by including the rate={integer} parameter with the audio format specification. For more information, see Specifying an audio format.

When you specify a sampling rate, the service resamples the audio from 22,050 Hz to the specified rate before it returns the audio. A specified sampling rate must lie in the range of 8 kHz to 192 kHz. Some audio formats restrict the rate to specific values; the descriptions of the formats identify such restrictions.
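The range check described above can be sketched in a few lines of Python. The helper name with_rate is hypothetical and is not part of any IBM SDK. It builds only the format string, and it does not model the per-format restrictions.

```python
MIN_RATE_HZ = 8_000     # lower bound stated above (8 kHz)
MAX_RATE_HZ = 192_000   # upper bound stated above (192 kHz)


def with_rate(audio_format: str, rate_hz: int) -> str:
    """Append a rate parameter to a format such as 'audio/wav'."""
    if not MIN_RATE_HZ <= rate_hz <= MAX_RATE_HZ:
        raise ValueError(
            f"rate must be between {MIN_RATE_HZ} and {MAX_RATE_HZ} Hz, got {rate_hz}"
        )
    return f"{audio_format};rate={rate_hz}"


print(with_rate("audio/wav", 16000))  # audio/wav;rate=16000
```

Checking the range on the client side avoids a round trip to the service for a request that is certain to be rejected.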

Determining the sampling rate

The most reliable way to identify the sampling rate for any audio that the service returns is to extract the information from the audio stream itself. To determine the rate, call the /v1/synthesize method with some simple text (for example, "hello world") and specify the format and codec that you plan to use. You can then obtain the sampling rate by saving the audio stream to a file and opening it in an audio player such as one of those listed in Playing an audio file.
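As an alternative to opening the file in an audio player, the rate can be read programmatically. The following sketch applies only to audio that was requested as audio/wav, and it assumes that the header of the returned stream can be parsed by the wave module of the Python standard library, which has not been verified against the service's output. The function name wav_sampling_rate is hypothetical. So that the example runs without calling the service, it builds a small WAV stream in memory instead of using a real response.

```python
import io
import wave


def wav_sampling_rate(wav_bytes: bytes) -> int:
    """Return the sampling rate (in Hz) recorded in a WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getframerate()


# Stand-in for a saved response: 100 silent 16-bit mono samples at 22,050 Hz.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)      # the service returns single-channel audio
    out.setsampwidth(2)      # 2 bytes per sample (16-bit), an assumption
    out.setframerate(22050)
    out.writeframes(b"\x00\x00" * 100)

print(wav_sampling_rate(buffer.getvalue()))  # 22050
```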

Supported audio formats

Table 1 lists the audio formats in which you can request synthesized audio. By default, the service returns the audio in the Ogg format with the Opus codec (audio/ogg;codecs=opus).

The Safari browser does not support the Ogg audio format. If you use the Text to Speech service with Safari, you must specify a different format for the audio that the service returns. For more information, see Specifying an audio format.

The table provides the following information for each format:

  • Default sampling rate shows the sampling rate for audio in the indicated format if you do not specify an alternative rate.
  • Required parameters indicates those formats for which you must specify a sampling rate for the returned audio.
  • Optional parameters identifies formats for which you can optionally specify a sampling rate or other characteristics of the returned audio.

You separate all parameters of a format specification with a ; (semicolon), as the Audio formats column shows for those formats that accept a codecs parameter. For more information about the different formats, see the sections that follow the table.

Table 1. Summary of supported audio formats

Audio formats                        Default sampling rate   Required parameters   Optional parameters
audio/alaw                           None                    rate={integer}        None
audio/basic                          8,000 Hz                None                  None
audio/flac                           22,050 Hz               None                  rate={integer}
audio/l16                            None                    rate={integer}        endianness=big-endian or endianness=little-endian
audio/mp3, audio/mpeg                22,050 Hz               None                  rate={integer}
audio/mulaw                          None                    rate={integer}        None
audio/ogg, audio/ogg;codecs=vorbis   22,050 Hz               None                  rate={integer}
audio/ogg;codecs=opus                48,000 Hz               None                  rate={integer}
audio/wav                            22,050 Hz               None                  rate={integer}
audio/webm, audio/webm;codecs=opus   48,000 Hz               None                  None
audio/webm;codecs=vorbis             22,050 Hz               None                  rate={integer}
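For client code that must choose among formats, the table can be mirrored as a lookup structure. The following is a minimal sketch: the names FormatInfo, FORMATS, and requested_rate are hypothetical, and the function reports only the rate that a request asks for. It does not model format-specific behavior such as the Opus limitation that is described later.

```python
from typing import NamedTuple, Optional


class FormatInfo(NamedTuple):
    default_rate_hz: Optional[int]  # None: the format has no default rate
    rate: str                       # "required", "optional", or "none"


# Values copied from the table of supported audio formats.
FORMATS = {
    "audio/alaw": FormatInfo(None, "required"),
    "audio/basic": FormatInfo(8000, "none"),
    "audio/flac": FormatInfo(22050, "optional"),
    "audio/l16": FormatInfo(None, "required"),
    "audio/mp3": FormatInfo(22050, "optional"),
    "audio/mpeg": FormatInfo(22050, "optional"),
    "audio/mulaw": FormatInfo(None, "required"),
    "audio/ogg": FormatInfo(22050, "optional"),
    "audio/ogg;codecs=vorbis": FormatInfo(22050, "optional"),
    "audio/ogg;codecs=opus": FormatInfo(48000, "optional"),
    "audio/wav": FormatInfo(22050, "optional"),
    "audio/webm": FormatInfo(48000, "none"),
    "audio/webm;codecs=opus": FormatInfo(48000, "none"),
    "audio/webm;codecs=vorbis": FormatInfo(22050, "optional"),
}


def requested_rate(audio_format: str, rate_hz: Optional[int] = None) -> int:
    """Return the rate a request asks for, applying the table's rules."""
    info = FORMATS[audio_format]
    if rate_hz is None:
        if info.rate == "required":
            raise ValueError(f"{audio_format} requires a rate parameter")
        return info.default_rate_hz
    if info.rate == "none":
        raise ValueError(f"{audio_format} does not accept a rate parameter")
    return rate_hz


print(requested_rate("audio/wav"))         # 22050
print(requested_rate("audio/l16", 16000))  # 16000
```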

audio/alaw format

A-law (audio/alaw) is a single-channel, lossy audio format that is encoded by using the A-law algorithm. The format is similar to the u-law (or mu-law) encoding of the audio/basic and audio/mulaw formats, though the A-law algorithm produces different signal characteristics. You must specify the sampling rate with this format. For example, specify audio/alaw;rate=8000 for audio that is sampled at 8 kHz.

Due to the streaming nature of the returned audio, the A-law audio that is generated might not work in all audio players. Specifically, the attribute numSamples in the header of the audio stream is set to 0 regardless of the length of the audio.

audio/basic format

Basic audio is a single-channel, lossy audio format that is encoded by using 8-bit u-law (or mu-law) data. This format provides a lowest-common-denominator media type. Audio in this format always has a sampling rate of 8 kHz; you cannot specify a different rate.

audio/flac format

Free Lossless Audio Codec (FLAC) (.flac) is a lossless compressed audio coding format. You can optionally specify a sampling rate other than the default 22,050 Hz.

audio/l16 format

Linear 16-bit Pulse-Code Modulation (PCM) is an uncompressed audio data format (often .raw or .pcm). You must specify the sampling rate with this audio format. For example, specify audio/l16;rate=16000 for audio that is sampled at 16 kHz.

You can optionally specify the endianness for the audio by using the endianness parameter. Endianness indicates how bytes of data are ordered by the underlying computer architecture:

  • Big-endian (endianness=big-endian) stores the most-significant byte of each sample first.
  • Little-endian (endianness=little-endian) stores the least-significant byte of each sample first.

For example, specify audio/l16;rate=16000;endianness=big-endian to obtain audio that is sampled at 16 kHz and returned in big-endian order. If you omit the endianness, the default is little-endian. (Specifying the endianness is an issue only for the audio/l16 format, which does not include a header. Endianness is not a concern for the other formats.)
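The difference between the two byte orders can be illustrated with the struct module of the Python standard library. This is an illustrative sketch that uses a made-up sample value, not output from the service. The helper swap_endianness is hypothetical; it shows one way to convert headerless 16-bit audio from one order to the other.

```python
import struct

sample = 0x0102  # one 16-bit sample (decimal 258), chosen for illustration

little = struct.pack("<h", sample)  # least-significant byte first
big = struct.pack(">h", sample)     # most-significant byte first

print(little.hex())  # 0201
print(big.hex())     # 0102


def swap_endianness(pcm: bytes) -> bytes:
    """Swap the two bytes of every 16-bit sample in raw PCM data."""
    if len(pcm) % 2:
        raise ValueError("16-bit PCM data must have an even number of bytes")
    swapped = bytearray(len(pcm))
    swapped[0::2] = pcm[1::2]  # second byte of each pair moves to the front
    swapped[1::2] = pcm[0::2]  # first byte of each pair moves to the back
    return bytes(swapped)


print(swap_endianness(little) == big)  # True
```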

For more information, see IETF Request for Comment (RFC) 2586.

audio/mp3 and audio/mpeg formats

MP3 or Moving Picture Experts Group (MPEG) is a lossy data compression format (MP3 and MPEG refer to the same format). You can optionally specify a sampling rate other than the default 22,050 Hz.

audio/mulaw format

8-bit mu-law (or u-law) audio is a single-channel, lossy audio format that is encoded by using 8-bit mu-law data. You must specify the sampling rate with this audio format. For example, specify audio/mulaw;rate=16000 for audio that is sampled at 16 kHz.

audio/ogg format

Ogg format (.ogg) is a free, open container format that is maintained by the Xiph.org Foundation. You can specify the codecs parameter with the format to request an audio stream that is compressed with one of the following codecs:

  • The Opus codec by specifying audio/ogg;codecs=opus. You can optionally specify a sampling rate other than the default 48,000 Hz. Only the following values are valid sampling rates: 48000, 24000, 16000, 12000, or 8000. If you specify a value other than one of these, the service returns an error.

    A current limitation causes the service to disregard a valid sampling rate. The service always returns the audio with a sampling rate of 48 kHz.

  • The Vorbis codec by specifying audio/ogg;codecs=vorbis or simply audio/ogg. You can optionally specify a sampling rate other than the default 22,050 Hz.

Both codecs are free, open, lossy audio-compression formats. Opus is the preferred codec, but per the Ogg specification, the service returns the audio in Vorbis format if you omit the codec. If you omit an audio format altogether, the service returns the audio in Ogg format with the Opus codec by default.
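Because the service rejects any other rate for the Opus codec, a client can check the value before it sends a request. The following is a minimal sketch; the function name ogg_opus_format is hypothetical, and placing the rate parameter after the codecs parameter is an assumption based on the semicolon-separated syntax that this page describes.

```python
from typing import Optional

# Rates that this section lists as valid for audio/ogg;codecs=opus.
OGG_OPUS_RATES_HZ = (8000, 12000, 16000, 24000, 48000)


def ogg_opus_format(rate_hz: Optional[int] = None) -> str:
    """Build an Ogg/Opus format string, rejecting unsupported rates."""
    if rate_hz is None:
        return "audio/ogg;codecs=opus"  # service default of 48,000 Hz applies
    if rate_hz not in OGG_OPUS_RATES_HZ:
        raise ValueError(
            f"{rate_hz} is not a valid Opus rate; choose one of {OGG_OPUS_RATES_HZ}"
        )
    return f"audio/ogg;codecs=opus;rate={rate_hz}"


print(ogg_opus_format(24000))  # audio/ogg;codecs=opus;rate=24000
```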

audio/wav format

Waveform Audio File Format (WAV) (.wav) is a standard container format that is often used for uncompressed audio bitstreams but can contain compressed audio, as well. You can optionally specify a sampling rate other than the default 22,050 Hz.

Due to the streaming nature of the returned audio, the WAV audio that is generated might not work in all audio players. Specifically, the attribute numSamples in the header of the audio stream is set to 0 regardless of the length of the audio.

audio/webm format

Web Media (WebM) (.webm) is an open media-file format. You can specify the codecs parameter with the format to request an audio stream that is compressed with one of the following codecs:

  • The Opus codec by specifying audio/webm;codecs=opus or simply audio/webm. Audio in this format always has a sampling rate of 48 kHz.
  • The Vorbis codec by specifying audio/webm;codecs=vorbis. You can optionally specify a sampling rate other than the default 22,050 Hz.

Both codecs are free, open, lossy audio-compression formats. Opus is the preferred codec.

Specifying an audio format

By default, the service returns audio in the format audio/ogg;codecs=opus. You can specify a different audio format with either the HTTP or the WebSocket interface.

  • With the HTTP GET and POST /v1/synthesize methods, you can specify a format by using the Accept request header or the accept query parameter.
    • If you use the Accept request header, you specify the format and any additional parameters as shown in the following example. The example specifies the audio/l16 format and a sampling rate of 16,000 Hz.

      audio/l16;rate=16000
      
    • If you use the accept query parameter, you must URL-encode the format and any additional parameters as shown in the following example. The example specifies the same format and sampling rate as the previous example.

      audio%2Fl16%3Brate%3D16000
      
    To receive audio in the default format, omit both the header and the query parameter. For more information, see Synthesizing text to audio.
  • With the WebSocket /v1/synthesize method, you must specify a format by using the required accept parameter of the text message that you pass to initiate synthesis. To receive audio in the default format, specify the value */* for the parameter. For more information, see Send input text.
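The three ways of passing a format can be sketched with the Python standard library. The example makes no network calls; it prepares only the values that a client would send. The JSON shape of the WebSocket message (an object with text and accept fields) is an assumption for illustration; see Send input text for the authoritative description.

```python
import json
from urllib.parse import quote, unquote

audio_format = "audio/l16;rate=16000"

# HTTP with the Accept request header: the value is passed as-is.
headers = {"Accept": audio_format}

# HTTP with the accept query parameter: the value must be URL-encoded.
# safe="" ensures that "/" is encoded along with ";" and "=".
accept_param = quote(audio_format, safe="")
print(accept_param)  # audio%2Fl16%3Brate%3D16000

# Decoding restores the original format string.
print(unquote(accept_param))  # audio/l16;rate=16000

# WebSocket: the accept parameter is required; "*/*" selects the default format.
# The message layout here is assumed, not taken from this page.
message = json.dumps({"text": "hello world", "accept": "*/*"})
print(message)
```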

Playing an audio file

To play an audio file that the service generates, use one of the following tools:

  • A web browser such as Google Chrome™, Firefox®, or Microsoft® Internet Explorer®
  • An audio player such as Audacity® (audacityteam.org) or FFmpeg (ffmpeg.org)