Issues in Text-to-Speech Synthesis: Bellcore
Marian Macchi
Bellcore
mjm@bellcore.com
Abstract

The ultimate goal of text-to-speech synthesis is to convert ordinary orthographic text into an acoustic signal that is indistinguishable from human speech. Originally, synthesis systems were architected around a system of rules and models based on research on human language and on speech production and perception processes. The quality of speech produced by such systems is inherently limited by the quality of the rules and the models. Given that our knowledge of human speech processes is still incomplete, the quality of text-to-speech is far from natural-sounding. Hence, today's interest in high-quality speech for applications, in combination with advances in computer resources, has caused the focus to shift from rule- and model-based methods to corpus-based methods that presumably bypass rules and models. For example, many systems now rely on large word pronunciation dictionaries instead of letter-to-phoneme rules, and on large prerecorded sound inventories instead of rules predicting the acoustic correlates of phonemes. Because of the need to analyze large amounts of data, this approach relies on automated techniques such as those used in automatic speech recognition.

1. Introduction

The ultimate goal of text-to-speech synthesis is to convert ordinary orthographic text into an acoustic signal that is indistinguishable from human speech. The conversion process, illustrated in Figure 1, is considered to have two parts, because the two parts involve different types of knowledge and processes. The front end handles problems in text analysis and higher-level linguistic features; it interprets orthographic text and outputs a phonetic transcription that specifies the phonemes and an intonation for the text. The back end handles problems in phonetics, acoustics, and signal processing; it converts the phonetic transcription into a speech waveform containing appropriate values for acoustic parameters such as pitch, amplitude, duration, and spectral characteristics.

For both parts the conversion process is usually performed through one of two approaches. One approach is to rely on rules formulated by experts in natural language, linguistics, or acoustics. Another approach is to avoid rules and instead include large lists from machine-readable dictionaries, or to formulate rules by automatic methods from statistical analysis of a large corpus of transcribed speech. The trend today is to use the corpus-based approach [1-6]. The dependency on large amounts of data processing necessitates the use of automated techniques such as those used in automatic speech recognition. Crucial issues for this approach are the coverage of the corpus and how a system deals with cases outside the corpus.

2. Front-end of text-to-speech

The goal of the front end of a synthesizer is to convert text into a phonetic transcription. Because text is an impoverished representation of a speaker's verbal intentions, the front end must attempt to regularize the text and resolve ambiguities by inferring the speaker's intention from the text.

2.1. Text normalization.

As illustrated in Table 1, text contains many occurrences of strings that are not pronounceable as such. Some of these strings can be interpreted in a context-free fashion and can be handled by a simple table lookup. For example, the abbreviation "Mr." can be replaced with the word "Mister". Number strings are more complicated and cannot be handled by a dictionary per se. Even simple number cases require an algorithm that assigns an interpretation to the entire number.

In natural language, acronyms are sometimes spelled out, as is "NAACP", and are sometimes pronounced as words, as is "VISA". A synthesizer must have some mechanism for determining whether such uppercase text should be pronounced or spelled out. Some synthesizers use a dictionary listing the most common acronyms and then simply spell out all other occurrences of uppercase text. Another approach is to use algorithms based on constraints on letter sequences in actual words. For example, a word consisting of a long sequence of consonants or of vowels, such as "JKLHP" or "OAEIE," is not pronounceable as an English word. Consequently, such sequences should be spelled out. An advantage of the algorithmic approach is that it can be applied to strings that do not appear in any dictionary.
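The two normalization strategies just described (a context-free abbreviation table, and a letter-sequence constraint for deciding whether an uppercase string is pronounceable) might be sketched as follows. This is an illustrative sketch only: the table contents, the run-length threshold, and all function names are assumptions, not details of any particular synthesizer.

```python
import re

# Illustrative context-free abbreviation table (assumed contents).
ABBREVIATIONS = {"Mr.": "Mister", "Mrs.": "Missus", "Dr.": "Doctor"}

def expand_abbreviation(token):
    """Replace a context-free abbreviation with its full word, if listed."""
    return ABBREVIATIONS.get(token, token)

def is_pronounceable(word, max_run=4):
    """Heuristic: a long unbroken run of consonants or of vowels
    (e.g. "JKLHP", "OAEIE") suggests the string is not an English
    word and should be spelled out letter by letter."""
    runs = re.findall(r"[AEIOU]+|[^AEIOU]+", word.upper())
    return all(len(run) < max_run for run in runs)

def normalize_uppercase(token):
    """Pronounce plausible acronyms as words; spell out the rest."""
    if token.isupper() and not is_pronounceable(token):
        return " ".join(token)   # "JKLHP" -> "J K L H P"
    return token
```

Under this heuristic, "VISA" alternates consonants and vowels and is passed through to be pronounced as a word, while "JKLHP" contains a five-consonant run and is expanded for letter-by-letter spelling.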
Figure 1. Text-to-speech conversion. Orthography ("The children read to Dr. Smith.") passes through the front end (text normalization, word pronunciation, intonation) to a phonetic transcription ("% Dx CIldrxn rid tx daktR smIT %"), which the back end (an articulatory model, with synthesis-by-rule or concatenative synthesis) converts into an acoustic waveform.

We could collect the set of N-grams for "St.," consisting of all translations of "St." plus the adjacent words or word categories. For each N-gram, we would tabulate the number of "St."s representing "Saint" and the number representing "Street," as illustrated in Table 2 using bigrams. Then, one simple approach is to use the ratio of occurrences of the different pronunciations as a measure of the disambiguation power of the context. So for each N-gram, we would compute the ratio of the two pronunciations and store it in a table. For synthesis, we would choose the highest-valued ratio in the table for the matching context.

Table 2. Abbreviation disambiguation using bigrams.
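The counting-and-ratio procedure described above can be illustrated with a small Python sketch. The toy observation list, the fallback pronunciation, and the function names are assumptions for illustration only, not Bellcore's implementation; a real system would tabulate bigrams over a large transcribed corpus.

```python
from collections import defaultdict

# Assumed training data: (adjacent-word, pronunciation) pairs for "St.",
# standing in for bigrams tabulated from a large transcribed corpus.
OBSERVATIONS = [
    ("Main", "Street"), ("Main", "Street"), ("Main", "Street"),
    ("Louis", "Saint"), ("Louis", "Saint"),
]

def build_ratio_table(observations):
    """For each bigram context, count each pronunciation and keep the
    majority pronunciation along with its ratio of occurrences
    (a measure of the disambiguation power of the context)."""
    counts = defaultdict(lambda: defaultdict(int))
    for context, pron in observations:
        counts[context][pron] += 1
    table = {}
    for context, prons in counts.items():
        total = sum(prons.values())
        best = max(prons, key=prons.get)
        table[context] = (best, prons[best] / total)
    return table

def pronounce_st(context, table, default="Street"):
    """At synthesis time, pick the pronunciation stored for the matching
    context; fall back to a default when the context is unseen."""
    best, _ratio = table.get(context, (default, 0.0))
    return best

table = build_ratio_table(OBSERVATIONS)
# pronounce_st("Louis", table) -> "Saint"
# pronounce_st("Main", table)  -> "Street"
```

Storing the majority pronunciation together with its ratio keeps the table small while preserving a confidence score that could later be compared across longer N-gram contexts.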