SSML elements
With the IBM Watson® Text to Speech service, you can use most Speech Synthesis Markup Language (SSML) elements and attributes to control the synthesis of your text.
Supported elements and attributes
Table 1 summarizes the service's support for SSML elements and attributes:
- Full means that the service fully supports the element or attribute with its HTTP and WebSocket interfaces.
- Partial means that the service's support for the element or attribute is limited in one of the following ways:
  - The service supports only some aspects of the element or attribute.
  - The service supports the element or attribute with only some of its voices.
  - The service supports the element or attribute with only one of its interfaces, HTTP or WebSocket.
- None means that the service does not support the element or attribute.
The following sections provide descriptions of each element or attribute, including examples, restrictions, and whether the service's support differs from standard SSML. Support for some attributes and values differs slightly from the SSML specification. For more information, see W3C Speech Synthesis Markup Language (SSML) Version 1.1.
Element or attribute | Support | Element or attribute | Support |
---|---|---|---|
<audio> element | None | <prosody> element | Partial |
<break> element | Full | contour attribute | None |
<desc> element | None | duration attribute | None |
<emphasis> element | Partial | pitch attribute | Full |
<express-as> element | Partial | range attribute | None |
<lexicon> element | None | rate attribute | Full |
<mark> element | Partial | volume attribute | None |
<meta> element | None | <say-as> element | Partial |
<metadata> element | None | interpret-as attribute | Partial |
<paragraph> element | Full | <sentence> element | Full |
<phoneme> element | Full | <speak> element | Full |
<sub> element | Full | | |
<voice> element | None | | |
The <audio> element
The <audio> element inserts recorded elements into the service-generated audio. It is not supported.
The <break> element
The <break> element inserts a pause into the spoken text. It has the following optional attributes:
- strength specifies the length of the pause in terms of varying strength values:
  - none suppresses a break that might otherwise be produced during processing.
  - x-weak, weak, medium, strong, or x-strong insert increasingly stronger breaks.
- time specifies the length of the pause in terms of seconds or milliseconds. Valid value formats are {integer}s for seconds or {integer}ms for milliseconds.
Break size <break strength="none"/> no pause
Break size <break strength="x-weak"/> x-weak pause
Break size <break strength="weak"/> weak pause
Break size <break strength="medium"/> medium pause
Break size <break strength="strong"/> strong pause
Break size <break strength="x-strong"/> x-strong pause
Break size <break time="1s"/> one-second pause
Break size <break time="1500ms"/> 1500-millisecond pause
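The attribute rules above can be checked before a request is sent. The following Python sketch is only an illustration: the helper name break_tag is hypothetical and is not part of the service or any SDK. It builds a <break> element and rejects values that do not match the strength keywords or the {integer}s and {integer}ms formats.

```python
import re

# Strength keywords listed for the <break> element.
STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}
# Time values are an integer followed by "s" or "ms".
TIME_PATTERN = re.compile(r"\d+(s|ms)")


def break_tag(strength=None, time=None):
    """Return a <break/> element after checking the attribute values.

    Hypothetical helper for illustration; both attributes are optional.
    """
    attrs = []
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"unsupported strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if time is not None:
        if not TIME_PATTERN.fullmatch(time):
            raise ValueError(f"time must look like '1s' or '1500ms': {time!r}")
        attrs.append(f'time="{time}"')
    if not attrs:
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"


print("Break size " + break_tag(time="1500ms") + " 1500-millisecond pause")
```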
The <desc> element
The <desc> element can occur only within an <audio> element. Because the <audio> element is not supported, neither is the <desc> element.
The <emphasis> element
The <emphasis> element is supported for use only with the expressive neural voices.
With the expressive neural voices, you can use the <emphasis> element to emphasize or de-emphasize one or more words of the input text. The element supports an optional level attribute that accepts one of the following values:
- none: Prevents the service from emphasizing text that might otherwise be emphasized.
- moderate: Provides a noticeable amount of emphasis to the text. This level is the default if you omit the level attribute.
- strong: Provides a more significant amount of emphasis to the text than the moderate level provides.
- reduced: De-emphasizes the text by tending to reduce its significance in the audio. This level is the opposite of stressing the text.
The following example applies the moderate level to the word give:
I am going to <emphasis level="moderate">give</emphasis> her the book.
For more information, see Emphasizing words.
The <express-as> element
The <express-as> element is an SSML extension specific to the Text to Speech service. It is supported for use only with the expressive neural voices.
With the expressive neural voices, you can use the <express-as> element to apply speaking styles to enhance the service's emphasis of certain characteristics for all or part of the input text. The element supports a required style attribute that accepts one of the following speaking styles:
- cheerful: Expresses happiness and good news.
- empathetic: Expresses empathy and compassion.
- neutral: Expresses objectivity and evenness.
- uncertain: Expresses confusion and uncertainty.
The following example applies the cheerful style to the entire input text:
<express-as style="cheerful">Oh, that's good news! I'm glad that we could help.</express-as>
For more information, see Using speaking styles.
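Because the style attribute is required and accepts only the four styles listed above, a client might validate the style before it builds the markup. This Python sketch assumes a hypothetical helper named express_as; it escapes the text so that characters such as & and < do not break the SSML.

```python
from xml.sax.saxutils import escape

# Speaking styles listed for the <express-as> element.
STYLES = {"cheerful", "empathetic", "neutral", "uncertain"}


def express_as(text, style):
    """Wrap text in an <express-as> element (hypothetical helper)."""
    if style not in STYLES:
        raise ValueError(f"unsupported speaking style: {style!r}")
    # escape() replaces &, <, and > so the text stays well-formed XML.
    return f'<express-as style="{style}">{escape(text)}</express-as>'


print(express_as("Oh, that's good news! I'm glad that we could help.", "cheerful"))
```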
The <lexicon> element
The <lexicon> element introduces pronunciation dictionaries for the given SSML document. It is not supported.
You can use the service's customization interface to define a dictionary of custom entries (word/translation pairs) for use during speech synthesis. For more information, see Understanding customization.
The <mark> element
The <mark> element is supported only by the service's WebSocket interface, not by its HTTP interface, which ignores the element. For more information, see Specifying an SSML mark.
The <mark> element is an empty element that places a marker into the text to be synthesized. The client is notified when all of the text that precedes the <mark> element has been synthesized. The element accepts a single name attribute that specifies a string that uniquely identifies the mark; the name must begin with an alphanumeric character. The name is returned along with the time at which the mark occurs in the synthesized audio.
Hello <mark name="here"/> world.
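The naming rule for marks is easy to check on the client. The following Python sketch uses a hypothetical helper, mark_tag. It treats "alphanumeric" as whatever Python's str.isalnum accepts, which includes non-ASCII letters and digits; that interpretation is an assumption, not something the service documents.

```python
from xml.sax.saxutils import quoteattr


def mark_tag(name):
    """Return a <mark/> element for the given name (hypothetical helper)."""
    # Assumption: str.isalnum() is an adequate test for "alphanumeric".
    if not name or not name[0].isalnum():
        raise ValueError("a mark name must begin with an alphanumeric character")
    # quoteattr() adds the surrounding quotes and escapes special characters.
    return f"<mark name={quoteattr(name)}/>"


print("Hello " + mark_tag("here") + " world.")
```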
The <meta> and <metadata> elements
The <meta> and <metadata> elements are containers in which you can place information about the document. They are not supported.
The <paragraph> and <sentence> elements
The <paragraph> (or <p>) and <sentence> (or <s>) elements are optional elements that can be used to give hints about textual structure. If the text that is enclosed in a <paragraph> or <sentence> element does not end with an end-of-sentence punctuation character (like a period), the service adds a longer than normal pause to the synthesized audio.
The only valid attribute for either element is xml:lang, which allows for language switching. The attribute is not supported.
<paragraph>
<sentence>Text within a sentence element.</sentence>
<s>More text in another sentence.</s>
</paragraph>
The <phoneme> element
The <phoneme> element provides a phonetic pronunciation for the enclosed text. The phonetic spelling represents the sounds of a word, how the sounds are divided into syllables, and which syllables receive stress. The element has two attributes:
- alphabet is an optional attribute that specifies the phonology to be used. The supported alphabets are
  - The standard International Phonetic Alphabet (IPA): alphabet="ipa".
  - The IBM Symbolic Phonetic Representation (SPR): alphabet="ibm".
  If no alphabet is specified, the service uses IBM SPR by default. For more information, see Understanding phonetic symbols.
- ph is a required attribute that provides the pronunciation in the indicated alphabet. The following examples show the pronunciation for the word tomato in both formats:
  - IPA format:
    <phoneme alphabet="ipa" ph="təˈmeɪ.ɾoʊ">tomato</phoneme>
  - IPA format with Unicode symbols written as numeric character references:
    <phoneme alphabet="ipa" ph="t&#x259;&#x2C8;me&#x26A;.&#x27E;o&#x28A;">tomato</phoneme>
  - IBM SPR format:
    <phoneme alphabet="ibm" ph=".0tx.1me.0Fo">tomato</phoneme>
For more information about using SPR and IPA notations with the <phoneme> element, see Understanding phonetic symbols.
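When a client cannot reliably send raw IPA characters, the same pronunciation can be written with XML numeric character references, which any XML parser decodes back to the original symbols. The following Python sketch shows one way to do that; the helper names to_char_refs and phoneme_tag are hypothetical.

```python
from xml.sax.saxutils import escape


def to_char_refs(text):
    """Encode every non-ASCII character as an XML numeric character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


def phoneme_tag(word, ph, alphabet="ipa"):
    """Build a <phoneme> element (hypothetical helper).

    The alphabet is passed explicitly here; the service's own default
    when the attribute is omitted is IBM SPR.
    """
    if alphabet not in {"ipa", "ibm"}:
        raise ValueError(f"unsupported alphabet: {alphabet!r}")
    # Escape XML special characters first, then encode non-ASCII symbols.
    safe_ph = to_char_refs(escape(ph, {'"': "&quot;"}))
    return f'<phoneme alphabet="{alphabet}" ph="{safe_ph}">{escape(word)}</phoneme>'


print(phoneme_tag("tomato", "t\u0259\u02c8me\u026a.\u027eo\u028a"))
print(phoneme_tag("tomato", ".0tx.1me.0Fo", alphabet="ibm"))
```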
The <prosody> element
The <prosody> element controls the pitch and speaking rate of the text. All attributes are optional, but an error occurs if you do not specify at least one attribute with the element.
The service supports the following two attributes of the SSML specification:
- The pitch attribute
- The rate attribute
The SSML specification also offers four attributes that the service does not support:
- The contour attribute
- The range attribute
- The duration attribute
- The volume attribute
The service also supports query parameters that let you adjust the rate and pitch for all text of a speech synthesis request. These parameters interact with the pitch and rate attributes of the <prosody> element.
Differences from the SSML version 1.1 specification
The Text to Speech service bases its SSML support on W3C Speech Synthesis Markup Language (SSML) Version 1.1. However, the SSML specification has evolved since the service was first released. To maintain backward compatibility for users, the service continues to support some features of the <prosody> element that are different from the latest SSML specification.
- For the pitch attribute, the service supports the following additional features:
  - A relative change in percent indicated by a signed or unsigned number and followed by a % (percent sign). The default pitch for a voice is equivalent to passing a value of 0%.
  - A relative change in semitones indicated by a signed or unsigned number and followed by the string st.
- For the rate attribute, the service supports the following additional features:
  - A relative change in percent indicated by a signed or unsigned number and followed by a % (percent sign). The default speaking rate for a voice is equivalent to passing a value of 0%.
  - A number with no unit designation specifies the number of words per minute. The number is absolute; you cannot specify a relative increase or decrease in words per minute.
- For the expressive neural voices, the pitch and rate attributes support only percentage values.
  - For the pitch attribute, do not use Hertz, semitones, or keywords.
  - For the rate attribute, do not use words per minute or keywords.
For more information about the features supported by SSML version 1.1, refer to section 3.2.4 prosody Element of the SSML specification.
The pitch attribute
The pitch attribute modifies the baseline pitch, or tone, for the text within the element. Accepted values are
- A number followed by the Hz (Hertz) designation: The baseline pitch is transposed (up or down) to the specified value. For example, 150Hz.
- A relative change in percent: A number that causes a relative shift from the default baseline. The number is preceded by + (an increase) or - (a decrease) and followed by a % (percent sign). An unsigned number that is followed by a % is interpreted as a positive increase. For example, +10% or 10%. The default pitch for a voice is equivalent to passing a value of 0%.
- A relative change in semitones: A number that causes an absolute shift from the default baseline. The number is preceded by + (an increase) or - (a decrease) and followed by st (semitones). An unsigned number followed by st is interpreted as a positive increase. For example, +5st or 5st.
- A keyword: One of the following six keywords, which modify the pitch to the corresponding predefined values:
  - default uses the service's default baseline pitch.
  - x-low shifts the pitch baseline down by 12 semitones.
  - low shifts the pitch baseline down by six semitones.
  - medium produces the same behavior as default.
  - high shifts the pitch baseline up by six semitones.
  - x-high shifts the pitch baseline up by 12 semitones.
Expressive neural voices support only percentage values for the pitch attribute. They do not support the use of Hertz, semitones, or keywords.
The best way to determine what works for your application is to make adjustments based on percentages and experiment with different values. Try incremental changes of five or ten percent before making more significant modifications.
<prosody pitch="150Hz">Transpose pitch to 150 Hz</prosody>
<prosody pitch="-20Hz">Lower pitch by 20 Hz from baseline</prosody>
<prosody pitch="+20Hz">Increase pitch by 20 Hz from baseline</prosody>
<prosody pitch="-10%">Decrease pitch by 10 percent</prosody>
<prosody pitch="+10%">Increase pitch by 10 percent</prosody>
<prosody pitch="-12st">Lower pitch by 12 semitones from baseline</prosody>
<prosody pitch="+12st">Increase pitch by 12 semitones from baseline</prosody>
<prosody pitch="x-low">Lower pitch by 12 semitones from baseline</prosody>
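To see which form a given pitch value takes, and whether it is usable with the expressive neural voices, a client could classify the string first. The following Python sketch is an illustration only: the function names are hypothetical, and the patterns assume that numbers may carry an optional sign and an optional decimal part, which the text above does not spell out.

```python
import re

# Keywords listed for the pitch attribute.
PITCH_KEYWORDS = {"default", "x-low", "low", "medium", "high", "x-high"}

# Assumed syntax: optional sign, digits, optional decimal part, then a unit.
NUMBER = r"[+-]?\d+(\.\d+)?"
PITCH_FORMS = {
    "hertz": re.compile(NUMBER + "Hz"),
    "percent": re.compile(NUMBER + "%"),
    "semitones": re.compile(NUMBER + "st"),
}


def classify_pitch(value):
    """Return 'keyword', 'hertz', 'percent', or 'semitones' for a pitch value."""
    if value in PITCH_KEYWORDS:
        return "keyword"
    for kind, pattern in PITCH_FORMS.items():
        if pattern.fullmatch(value):
            return kind
    raise ValueError(f"unrecognized pitch value: {value!r}")


def pitch_ok_for_expressive(value):
    """Expressive neural voices accept only percentage values."""
    return classify_pitch(value) == "percent"


for example in ("150Hz", "-10%", "+12st", "x-low"):
    print(example, classify_pitch(example), pitch_ok_for_expressive(example))
```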
The rate attribute
The rate attribute indicates a change in the speaking rate for the text within the element. Accepted values are
- A number with no unit designation: The rate is changed to the specified number of words per minute. For example, a value of 50 indicates a speaking rate of 50 words per minute. The number is absolute; you cannot specify a relative increase or decrease in words per minute.
- A relative change in percent: A number that causes a relative shift from the default speaking rate. The number is preceded by + (an increase) or - (a decrease) and followed by a % (percent sign). An unsigned number that is followed by a % is interpreted as a positive increase. For example, +10% or 10%. The default speaking rate for a voice is equivalent to passing a value of 0%.
- A keyword: One of the following six keywords, which modify the speaking rate to the corresponding predefined values:
  - default uses the service's default speaking rate.
  - x-slow decreases the rate by 50 percent.
  - slow decreases the rate by 25 percent.
  - medium produces the same behavior as default.
  - fast increases the rate by 25 percent.
  - x-fast increases the rate by 50 percent.
Expressive neural voices support only percentage values for the rate attribute. They do not support words per minute or keywords.
The best way to determine what works for your application is to make adjustments based on percentages and experiment with different values. Try incremental changes of five or ten percent before making more significant modifications.
<prosody rate="50">Set speaking rate to 50 words per minute</prosody>
<prosody rate="-5%">Decrease speaking rate by 5 percent</prosody>
<prosody rate="+5%">Increase speaking rate by 5 percent</prosody>
<prosody rate="slow">Decrease speaking rate by 25%</prosody>
<prosody rate="fast">Increase speaking rate by 25%</prosody>
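Because the expressive neural voices accept only percentages, markup that uses rate keywords would need to be rewritten for them. The following Python sketch maps each keyword to the percentage that this section gives for it. The function name rate_as_percent is hypothetical, and treating a keyword and its percentage as interchangeable is an assumption based on the descriptions above.

```python
import re

# Percent changes that this section gives for each rate keyword.
RATE_KEYWORD_PERCENT = {
    "default": 0,
    "x-slow": -50,
    "slow": -25,
    "medium": 0,
    "fast": 25,
    "x-fast": 50,
}
# An optionally signed integer followed by a percent sign.
PERCENT = re.compile(r"[+-]?\d+%")


def rate_as_percent(value):
    """Return a percentage string for a keyword or percentage rate value."""
    if value in RATE_KEYWORD_PERCENT:
        return f"{RATE_KEYWORD_PERCENT[value]:+d}%"
    if PERCENT.fullmatch(value):
        return value
    # A bare number means words per minute, which has no percent equivalent here.
    raise ValueError(f"cannot express rate {value!r} as a percentage")


print(f'<prosody rate="{rate_as_percent("slow")}">Decrease speaking rate by 25%</prosody>')
```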
The <say-as> element
The <say-as> element provides information about the type of text that is contained within the element and specifies the level of detail for rendering the text.
- The element has one required attribute, interpret-as, which indicates how the enclosed text is to be interpreted.
- The element has two optional attributes, format and detail, which are used only with particular values of the interpret-as attribute, as shown in the following examples.
The service supports the <say-as> element with the following languages:
- The service fully supports the <say-as> element for US English.
- For most other languages, the service supports only the digits and letters values of the interpret-as attribute.
- For Japanese, the service supports only the digits value. The service ignores non-numeric characters that are included in the string of digits.
The service's default pronunciation of alphabetic, numeric, and alphanumeric strings varies by language, with each language having its own rules. You can use the <say-as> element to control how strings are pronounced, including whether they are to be spelled out as individual characters with the letters and digits values.
For German, you can also control the pace at which the service pronounces the characters. For more information, see Specifying how strings are spelled out.
The interpret-as attribute
The service supports the following values for the interpret-as attribute. A description and examples of each value follow the list.
- cardinal
- date
- digits
- interjection
- letters
- number
- ordinal
- vxml:boolean
- vxml:currency
- vxml:date
- vxml:time
- vxml:digits
- vxml:phone
cardinal
The cardinal value speaks the cardinal number for the numeral within the element. The following examples say Super Bowl forty-nine. The first is superfluous, since it does not change the service's default behavior.
Super Bowl <say-as interpret-as="cardinal">49</say-as>
Super Bowl <say-as interpret-as="cardinal">XLIX</say-as>
date
The date value speaks the date within the element according to the format given in the associated format attribute. The format attribute is required for the date value. If no format is present, the service still attempts to pronounce the date. The following examples speak the indicated dates in the specified formats, where d, m, and y represent day, month, and year.
<say-as interpret-as="date" format="mdy">12/17/2005</say-as>
<say-as interpret-as="date" format="ymd">2005/12/17</say-as>
<say-as interpret-as="date" format="dmy">17/12/2005</say-as>
<say-as interpret-as="date" format="ydm">2005/17/12</say-as>
<say-as interpret-as="date" format="my">12/2005</say-as>
<say-as interpret-as="date" format="md">12/17</say-as>
<say-as interpret-as="date" format="ym">2005/12</say-as>
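The date text has to list its components in the same order as the format attribute. The following Python sketch generates matching pairs from a datetime.date value. The helper name say_as_date is hypothetical, and the zero-padded, slash-separated layout is an assumption taken from the examples above.

```python
from datetime import date

# Format values shown in the examples for interpret-as="date".
DATE_FORMATS = {"mdy", "ymd", "dmy", "ydm", "my", "md", "ym"}


def say_as_date(value, fmt="mdy"):
    """Build a <say-as interpret-as="date"> element (hypothetical helper)."""
    if fmt not in DATE_FORMATS:
        raise ValueError(f"unsupported date format: {fmt!r}")
    fields = {
        "y": f"{value.year:04d}",
        "m": f"{value.month:02d}",
        "d": f"{value.day:02d}",
    }
    # Emit the components in the order named by the format string.
    text = "/".join(fields[letter] for letter in fmt)
    return f'<say-as interpret-as="date" format="{fmt}">{text}</say-as>'


print(say_as_date(date(2005, 12, 17), "ydm"))
```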
digits
The digits value speaks the digits in the number within the element. (The value also pronounces individually any alphabetic characters that are included in the enclosed string.) The following example speaks the individual digits 123456.
<say-as interpret-as="digits">123456</say-as>
interjection
The interjection value is an SSML extension specific to the Text to Speech service. It is supported for use only with the expressive neural voices.
With the expressive neural voices, the service automatically emphasizes the following interjections: aha, hmm, huh, oh, uh, uh-huh, and um. You can use the interjection value to enable or disable the service's emphasis of the interjections aha and oh. Include the additional enabled attribute with a value of true or false to enable or disable the interjection.
The following example disables emphasis of both the aha and oh interjections in the text:
<say-as interpret-as='interjection' enabled='false'>Oh</say-as>, in addition, the <say-as interpret-as='interjection' enabled='false'>aha</say-as> wasp is endemic to Australia.
For more information, see Emphasizing interjections.
letters
The letters value spells out the characters in the word within the element. (The value also pronounces individually any numeric characters that are included in the enclosed string.) The following example spells the letters of the word hello.
<say-as interpret-as="letters">Hello</say-as>
number
The number value offers an alternative to the cardinal and ordinal values. You can use the optional format attribute to indicate how a series of numbers is to be interpreted. The first example omits the format attribute to pronounce the number as a cardinal value. The second example explicitly specifies that the number is to be pronounced as a cardinal value. The third example specifies that the number is to be pronounced as an ordinal value.
<say-as interpret-as="number">123456</say-as>
<say-as interpret-as="number" format="cardinal">123456</say-as>
<say-as interpret-as="number" format="ordinal">123456</say-as>
You can also specify the value telephone for the format attribute. The examples show two different ways of pronouncing a series of numbers as a telephone number. To pronounce the numbers with the punctuation included, specify the value punctuation for the optional detail attribute.
<say-as interpret-as="number" format="telephone">555-555-5555</say-as>
<say-as interpret-as="number" format="telephone" detail="punctuation">555-555-5555</say-as>
ordinal
The ordinal value speaks the ordinal value for the digit within the element. The following examples say second and first.
<say-as interpret-as="ordinal">2</say-as>
<say-as interpret-as="ordinal">1</say-as>
vxml:boolean
The vxml:boolean value speaks yes or no depending on the true or false value within the element.
<say-as interpret-as="vxml:boolean">true</say-as>
<say-as interpret-as="vxml:boolean">false</say-as>
vxml:currency
The vxml:currency value is used to control the synthesis of monetary values. The string must be written in the format UUUmm.nn, where UUU is the three-character currency indicator that is specified by ISO standard 4217 and mm.nn is the quantity. The following example says forty-five dollars and thirty cents.
<say-as interpret-as="vxml:currency">USD45.30</say-as>
If the specified number includes more than two decimal places, the amount is synthesized as a decimal number followed by the currency indicator. If the three-character currency indicator is not present, the amount is synthesized as a decimal number only and the currency type is not pronounced. The following example says forty-five point three two nine US dollars.
<say-as interpret-as="vxml:currency">USD45.329</say-as>
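Because an amount with more than two decimal places is read as a plain decimal number, a client that wants the monetary reading might round the quantity before it builds the string. The following Python sketch does so with the decimal module. The helper name currency_tag is hypothetical, and both half-up rounding and the rejection of negative amounts are assumptions rather than service requirements.

```python
from decimal import Decimal, ROUND_HALF_UP


def currency_tag(code, amount):
    """Build a vxml:currency element in UUUmm.nn form (hypothetical helper)."""
    # The currency indicator is three uppercase letters, as in ISO 4217 codes.
    if not (len(code) == 3 and code.isascii() and code.isalpha() and code.isupper()):
        raise ValueError(f"expected a three-letter currency code: {code!r}")
    # Round to exactly two decimal places (half-up rounding is an assumption).
    quantity = Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    if quantity < 0:
        raise ValueError("negative amounts are not covered by this sketch")
    return f'<say-as interpret-as="vxml:currency">{code}{quantity}</say-as>'


print(currency_tag("USD", "45.3"))
```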
vxml:date
The vxml:date value works like the date value, but the format is predefined as YYYYMMDD. If a day, month, or year value is not known or if you do not want it to be spoken, replace the value with a ? (question mark). The second and third examples include question marks.
<say-as interpret-as="vxml:date">20050720</say-as>
<say-as interpret-as="vxml:date">????0720</say-as>
<say-as interpret-as="vxml:date">200507??</say-as>
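The question-mark placeholders are simple to generate when some components are unknown. The following Python sketch uses a hypothetical helper, vxml_date, in which None stands for a component that should not be spoken.

```python
def vxml_date(year=None, month=None, day=None):
    """Build a vxml:date element in YYYYMMDD form (hypothetical helper).

    A component that is None is replaced with question marks.
    """
    yyyy = "????" if year is None else f"{year:04d}"
    mm = "??" if month is None else f"{month:02d}"
    dd = "??" if day is None else f"{day:02d}"
    return f'<say-as interpret-as="vxml:date">{yyyy}{mm}{dd}</say-as>'


print(vxml_date(2005, 7, 20))
print(vxml_date(month=7, day=20))
print(vxml_date(2005, 7))
```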
vxml:time
The vxml:time value speaks the time within the element. The enclosed string must consist of four digits followed by either no suffix or one of the suffixes a, p, or h. The following examples speak the indicated times in each of these forms.
<say-as interpret-as="vxml:time">1230</say-as>
<say-as interpret-as="vxml:time">1230a</say-as>
<say-as interpret-as="vxml:time">1230p</say-as>
<say-as interpret-as="vxml:time">0100h</say-as>
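The four-digit form with an optional a, p, or h suffix can be checked with a short regular expression. The following Python sketch validates only the shape of the string; the helper name vxml_time_tag is hypothetical, and the sketch does not check that the digits form a sensible hour and minute.

```python
import re

# Four digits, optionally followed by one of the suffixes a, p, or h.
VXML_TIME = re.compile(r"\d{4}[aph]?")


def vxml_time_tag(value):
    """Build a vxml:time element after a shape check (hypothetical helper)."""
    if not VXML_TIME.fullmatch(value):
        raise ValueError(f"expected four digits plus optional a, p, or h: {value!r}")
    return f'<say-as interpret-as="vxml:time">{value}</say-as>'


print(vxml_time_tag("1230p"))
```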
vxml:digits
The vxml:digits value provides the same capabilities as the digits value.
vxml:phone
The vxml:phone value speaks a phone number with both digits and punctuation. It is equivalent to using the number value and specifying telephone for the format attribute and punctuation for the detail attribute.
<say-as interpret-as="vxml:phone">555-555-5555</say-as>
The <speak> element
The service supports SSML fragments, which are SSML elements that do not include the full XML header. The <speak> element is optional for SSML that you pass to the service.
The <speak> element is the root element for SSML documents. Valid attributes are
- version is a required attribute that specifies the version of the SSML specification. The accepted value is 1.0.
- xml:lang is not required by the service. Omit the attribute when you use this element. Note that you cannot use this attribute to change the language for a speech synthesis request.
- xml:base has no effect.
- xmlns is not required by the service. Omit the attribute when you use this element.
<speak version="1.0">
The text to be spoken.
</speak>
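Although the service accepts fragments, wrapping a fragment in a <speak> element gives it a single root, which makes it possible to check that the markup is well-formed before it is sent. The following Python sketch does this with the standard library's xml.etree.ElementTree parser. The helper name wrap_in_speak is hypothetical, and the check covers XML well-formedness only, not whether the service supports each element.

```python
import xml.etree.ElementTree as ET


def wrap_in_speak(fragment, version="1.0"):
    """Wrap an SSML fragment in <speak> and verify that it parses as XML."""
    document = f'<speak version="{version}">{fragment}</speak>'
    # fromstring() raises xml.etree.ElementTree.ParseError on malformed markup.
    ET.fromstring(document)
    return document


print(wrap_in_speak('Hello <break time="1s"/> world.'))
```

A fragment with an unclosed element, such as <prosody rate="slow">oops, raises ParseError instead of returning a document.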
The <sub> element
The <sub> element indicates that the text that is specified by the alias attribute is to replace the text that is enclosed within the element when speech is synthesized. The alias attribute is the only attribute of the element and is required.
<sub alias="International Business Machines">IBM</sub>
The <voice> element
The <voice> element requests a change in voice. It is not supported.