Customizing Spoken Text
Overview
You can control how the Vonage Voice API plays machine-generated text to your users by using a subset of the tags defined in the Speech Synthesis Markup Language (SSML) specification. This XML-based markup enables you to mix multiple languages, provide pronunciation hints for specific words and numbers and control the speed, volume and pitch of synthesized text.
In an NCCO talk action, you can send SSML tags as part of the text string. First, however, you must wrap the entire string in <speak></speak> tags to tell Vonage that the string includes SSML. You may use either single quotes or escaped double quotes for tag attribute values.
Here is an example of SSML in the text property of an NCCO talk action:
[
  {
    "action": "talk",
    "text": "<speak><p>Hello.</p><p>How are you?</p></speak>"
  }
]
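As a rough illustration of the quoting rule above, the following Python sketch builds the same kind of NCCO with the standard json module. The helper name build_talk_ncco is hypothetical and not part of any Vonage SDK; it only shows that a JSON serializer escapes double quotes for you, while single quotes need no escaping.

```python
import json


def build_talk_ncco(ssml_body: str) -> list:
    """Wrap an SSML fragment in <speak> tags and return a one-action NCCO.

    Hypothetical helper for illustration; not part of any Vonage SDK.
    """
    return [{"action": "talk", "text": f"<speak>{ssml_body}</speak>"}]


# Single quotes inside the SSML need no escaping in JSON.
ncco = build_talk_ncco("Hello.<break time='1s' />How are you?")

# Double quotes also work: json.dumps escapes them as \" automatically.
ncco_dq = build_talk_ncco('<emphasis level="strong">Hello</emphasis>')

print(json.dumps(ncco, indent=2))
print(json.dumps(ncco_dq))
```

Building the NCCO as a Python structure and serializing it at the end avoids hand-escaping quotes inside the text string.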
SSML tags
- Breaks: Add breaks (pauses) to spoken text
- Emphasizing: Add or remove emphasis from text
- Language: Specify another language for specific words
- Phonemes: Spell out the text using the phonetic alphabet
- Prosody: Set the pitch, speed and volume of the spoken text
- Say as: Provide pronunciation hints for words, numbers and dates
- Sentences and paragraphs: Force the API to recognize sentences and paragraphs when speaking your text
- Substitution: Replace specific text with a pronunciation of your choice
Breaks
The break tag allows you to add pauses to text. You can specify the duration of the pause either as a strength value or as a time in seconds or milliseconds.
<speak>My name is <break time='1s' />Slim Shady.</speak>
Valid strength values include:
- none or x-weak (removes a pause that might otherwise exist after a full stop)
- weak or medium (equivalent to a comma)
- strong or x-strong (equivalent to a paragraph break)
<speak>
To be <break strength='weak' />
or not to be <break strength='weak' />
that is the question.
</speak>
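The rules above can be sketched as a small Python helper that emits a break tag and rejects unknown values. The function name break_tag is hypothetical. The time pattern (a whole number followed by s or ms) and the choice to accept exactly one of time or strength are assumptions drawn from the wording above, not a statement of what the API enforces.

```python
import re

# Strength values listed in this section.
VALID_STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}

# Assumed time format: a whole number followed by "s" or "ms", e.g. "1s", "500ms".
_TIME_RE = re.compile(r"\d+(ms|s)")


def break_tag(*, time=None, strength=None):
    """Return a <break /> tag built from either a time or a strength value."""
    if (time is None) == (strength is None):
        raise ValueError("pass exactly one of time or strength")
    if time is not None:
        if not _TIME_RE.fullmatch(time):
            raise ValueError(f"unsupported time value: {time!r}")
        return f"<break time='{time}' />"
    if strength not in VALID_STRENGTHS:
        raise ValueError(f"unsupported strength value: {strength!r}")
    return f"<break strength='{strength}' />"


ssml = f"<speak>My name is {break_tag(time='1s')}Slim Shady.</speak>"
print(ssml)
```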
Emphasizing
To emphasize words, use the emphasis tag. Emphasizing words changes the speaking rate and volume. More emphasis makes the text louder and slower. Less emphasis makes it quieter and faster. To specify the degree of emphasis, use the level attribute.
Valid level values include:
- strong: Increases the volume and slows the speaking rate so that the speech is louder and slower.
- moderate: Increases the volume and slows the speaking rate, but less than strong. This is the default.
- reduced: Decreases the volume and speeds up the speaking rate. Speech is softer and faster.
<speak>
<emphasis level="moderate">This is an important announcement</emphasis>
</speak>
Language
The lang tag allows you to specify another language for a specific word, phrase, or sentence. This can improve the pronunciation of foreign words.
The xml:lang attribute should contain both the language code and the country code (e.g. pt-BR for Brazilian Portuguese, en-GB for British English), even for languages with no country variations where a country code might otherwise be redundant (e.g. it-IT for Italian).
<speak><lang xml:lang='it-IT'>Buongiorno</lang></speak>
Note that lang changes the pronunciation but does not change the "native" language of the voice. For example, if the language in the talk action/request is set to en-US and the SSML lang tag in the text is set to fr-FR, the sentence is spoken in American-accented French.
To change the language for the whole message, use the language parameter of the talk action/request instead.
Not all voice styles support the lang tag.
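To make the language-plus-country requirement concrete, here is a minimal Python sketch that wraps text in a lang tag and rejects bare language codes such as it. The helper name lang_tag and the regular expression are assumptions for illustration; the pattern covers codes shaped like it-IT or pt-BR and may be stricter than what the API actually accepts.

```python
import re

# Assumed shape: lowercase language code, hyphen, uppercase country code.
_LOCALE_RE = re.compile(r"[a-z]{2,3}-[A-Z]{2}")


def lang_tag(locale: str, text: str) -> str:
    """Wrap text in a <lang> tag, requiring a language-COUNTRY code."""
    if not _LOCALE_RE.fullmatch(locale):
        raise ValueError(f"expected language-COUNTRY (e.g. it-IT), got {locale!r}")
    return f"<lang xml:lang='{locale}'>{text}</lang>"


ssml = f"<speak>{lang_tag('it-IT', 'Buongiorno')}, and welcome.</speak>"
print(ssml)
```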
Phonemes
The phoneme tag allows you to supply a phonetic representation of a word. To use it, specify both the alphabet attribute (either ipa, for the International Phonetic Alphabet, or x-sampa) and the ph attribute containing the phonetic symbols.
<speak>
<phoneme alphabet='ipa' ph='təˈmætoː'>Tomato</phoneme> or
<phoneme alphabet='ipa' ph='təˈmeɪtoʊ'>tomato</phoneme>.
Two nations separated by a common language.
</speak>
Prosody
The prosody tag allows you to set the pitch, rate and volume of the text.
- The volume attribute can be set to the following values: default, silent, x-soft, soft, medium, loud and x-loud. You can also specify a relative decibel value in the form +ndB or -ndB, where n is an integer value.
- The rate attribute changes the speed of speech. Acceptable values include: x-slow, slow, medium, fast and x-fast.
- The pitch attribute changes the pitch of the voice. You can specify this using either predefined value labels or numerically. The value labels are: default, x-low, low, medium, high and x-high. The format for specifying a numerical pitch change is +n% or -n%.
The example below shows how to change the volume, rate and pitch.
<speak>
I am <prosody volume='loud'>loud and proud</prosody>,
<prosody rate='fast'>quick as a cricket</prosody>
and can <prosody pitch='x-low'>change my pitch</prosody>.
</speak>
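The attribute values described in this section can be checked before sending with a short Python sketch such as the one below. The helper names prosody and _check are hypothetical, and the regular expressions encode only the +ndB / -ndB and +n% / -n% forms described above; treat it as a sketch rather than a complete validator.

```python
import re

VOLUME_LABELS = {"default", "silent", "x-soft", "soft", "medium", "loud", "x-loud"}
RATE_LABELS = {"x-slow", "slow", "medium", "fast", "x-fast"}
PITCH_LABELS = {"default", "x-low", "low", "medium", "high", "x-high"}

_DECIBEL_RE = re.compile(r"[+-]\d+dB")  # e.g. "+6dB", "-3dB"
_PERCENT_RE = re.compile(r"[+-]\d+%")   # e.g. "+10%", "-5%"


def _check(name, value, labels, numeric_re):
    """Raise ValueError unless value is a known label or matches numeric_re."""
    if value in labels:
        return
    if numeric_re is not None and numeric_re.fullmatch(value):
        return
    raise ValueError(f"unsupported {name} value: {value!r}")


def prosody(text, *, volume=None, rate=None, pitch=None):
    """Wrap text in a <prosody> tag with the attributes that were supplied."""
    attrs = []
    if volume is not None:
        _check("volume", volume, VOLUME_LABELS, _DECIBEL_RE)
        attrs.append(f"volume='{volume}'")
    if rate is not None:
        _check("rate", rate, RATE_LABELS, None)
        attrs.append(f"rate='{rate}'")
    if pitch is not None:
        _check("pitch", pitch, PITCH_LABELS, _PERCENT_RE)
        attrs.append(f"pitch='{pitch}'")
    if not attrs:
        raise ValueError("set at least one of volume, rate or pitch")
    return f"<prosody {' '.join(attrs)}>{text}</prosody>"


print(prosody("loud and proud", volume="loud"))
print(prosody("a little quieter and lower", volume="-3dB", pitch="-10%"))
```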
Say As
The say-as tag allows you to provide instructions for how particular words and numbers are spoken. The TTS engine detects many of these features automatically, but the say-as tag allows you to mark them explicitly.
The say-as tag has a required attribute: interpret-as. That attribute must contain one of the following values:
Value of interpret-as | Effect on spoken text |
---|---|
character / spell-out | Spells each letter out, for example: I-A-T-A. |
cardinal / number | Pronounces the value as a number. For example, "974" would be pronounced "nine hundred and seventy four". |
ordinal | Pronounces the number as an ordinal. For example, "1" would be pronounced "first" and "33" would be pronounced "thirty-third". |
digits | Reads the specified numbers out as digits. For example, "747" would be pronounced "seven four seven" and not "seven hundred and forty seven". |
fraction | Reads the numbers out as a fraction. For example, "1/3" would be pronounced "one third" and "2 4/10" would be pronounced "two and four tenths". |
unit | Reads the specified number out as a unit. The value must be a number followed by a unit of measure with no space between the two. For example: "1m". |
date | Specifies how to pronounce dates. See the section below on date formatting. |
time | Pronounces time durations in minutes and seconds. For example: 1'30" is read as "one minute and thirty seconds". |
address | Reads out a street address with appropriate breaks. |
expletive | Replaces the content with a "bleep" to censor expletives. You can use this to automatically substitute filtered swear words. |
telephone | Reads out a telephone number with appropriate breaks. |
An example:
<speak>
On the <say-as interpret-as="ordinal">1</say-as> day of Christmas,
come to <say-as interpret-as="address">123 Main Street</say-as>.
<say-as interpret-as="spell-out">RSVP</say-as> for a mince pie.
</speak>
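As a sketch of how the interpret-as values in the table might be applied programmatically, the Python helper below validates the value and XML-escapes the enclosed text. The function name say_as is hypothetical; escape comes from the standard library module xml.sax.saxutils.

```python
from xml.sax.saxutils import escape

# Values of interpret-as listed in the table above.
INTERPRET_AS = {
    "character", "spell-out", "cardinal", "number", "ordinal", "digits",
    "fraction", "unit", "date", "time", "address", "expletive", "telephone",
}


def say_as(interpret_as: str, text: str) -> str:
    """Wrap text in a <say-as> tag, escaping &, < and > in the content."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    return f'<say-as interpret-as="{interpret_as}">{escape(text)}</say-as>'


print(say_as("ordinal", "1"))
print(say_as("spell-out", "R&D"))
```

Escaping matters because characters such as & would otherwise make the SSML string invalid XML.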
Date formatting
Dates can be formatted in the following ways:
Value of format | How the date is interpreted |
---|---|
mdy | month-day-year (e.g. "3/10/2019") |
dmy | day-month-year (e.g. "10/3/2019") |
ymd | year-month-day (e.g. "2019/3/10") |
md | month-day (e.g. "3/10") |
dm | day-month (e.g. "10/3") |
ym | year-month (e.g. "2019/3") |
my | month-year (e.g. "3/2019") |
d | day (e.g. "10") |
m | month (e.g. "3") |
y | year (e.g. "2019") |
yyyymmdd | year-month-day, with optional ? to replace unspecified components. For example: 20190310 or ????0310. |
The example below will be converted to "Today is March 10th".
<speak>
Today is <say-as interpret-as="date" format="dm">10/3</say-as>
</speak>
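When dates come from application data rather than literal strings, one possible approach is to render them in the yyyymmdd format, which is unambiguous about component order. The Python sketch below assumes hypothetical helper names say_date and say_full_date; only the format values come from the table above.

```python
from datetime import date

# Values of the format attribute listed in the table above.
DATE_FORMATS = {"mdy", "dmy", "ymd", "md", "dm", "ym", "my", "d", "m", "y", "yyyymmdd"}


def say_date(value: str, fmt: str) -> str:
    """Wrap a date string in a <say-as> tag with the given format."""
    if fmt not in DATE_FORMATS:
        raise ValueError(f"unsupported date format: {fmt!r}")
    return f'<say-as interpret-as="date" format="{fmt}">{value}</say-as>'


def say_full_date(d: date) -> str:
    """Render a datetime.date using the yyyymmdd format."""
    return say_date(d.strftime("%Y%m%d"), "yyyymmdd")


print(say_date("10/3", "dm"))
print(say_full_date(date(2019, 3, 10)))
```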
Sentences and paragraphs
Sentences
You can wrap sentences in the s tag. This is equivalent to putting a full stop at the end of the sentence.
<speak>
<s>Thank you Mario</s>
<s>But our princess is in another castle</s>
</speak>
Paragraphs
The p tag allows you to specify paragraphs in your speech.
<speak>
<p>Hello.</p>
<p>How are you?</p>
</speak>
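Because s, p and the other SSML tags are paired XML elements, a missing closing tag makes the whole string invalid. A minimal local check, sketched here in Python with the standard library XML parser, can catch such mistakes before the text is sent. The function name is_well_formed_ssml is hypothetical, and the check only confirms well-formed XML with a speak root; it does not confirm that the API supports every tag used.

```python
import xml.etree.ElementTree as ET


def is_well_formed_ssml(text: str) -> bool:
    """Return True if text parses as XML and its root element is <speak>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    return root.tag == "speak"


print(is_well_formed_ssml("<speak><p>Hello.</p><p>How are you?</p></speak>"))
print(is_well_formed_ssml("<speak><p>Hello.</speak>"))  # unclosed <p>
```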
Substitution
The sub tag allows you to provide a substitute pronunciation. The contents of the alias attribute are read instead of the enclosed text.
<speak>
Welcome to the <sub alias="United States">US</sub>.
</speak>
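If the alias text comes from user data, it may contain quotes or ampersands that would break the attribute. The Python sketch below, using a hypothetical helper named sub, relies on quoteattr and escape from the standard library module xml.sax.saxutils to quote the attribute value and escape the element content.

```python
from xml.sax.saxutils import escape, quoteattr


def sub(text: str, alias: str) -> str:
    """Wrap text in a <sub> tag whose alias attribute is safely quoted."""
    # quoteattr adds the surrounding quotes and escapes special characters.
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


print(f"<speak>Welcome to the {sub('US', 'United States')}.</speak>")
print(sub("AT&T", "A T and T"))
```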