
Issues in Text-to-Speech Synthesis

Marian Macchi
Bellcore
mjm@bellcore.com

Abstract

The ultimate goal of text-to-speech synthesis is to convert ordinary orthographic text into an acoustic signal that is indistinguishable from human speech. Originally, synthesis systems were architected around a system of rules and models that were based on research on human language and on speech production and perception processes. The quality of speech produced by such systems is inherently limited by the quality of the rules and the models. Given that our knowledge of human speech processes is still incomplete, the quality of text-to-speech is far from natural-sounding. Hence, today's interest in high-quality speech for applications, in combination with advances in computing resources, has caused the focus to shift from rule- and model-based methods to corpus-based methods that presumably bypass rules and models. For example, many systems now rely on large word pronunciation dictionaries instead of letter-to-phoneme rules, and on large prerecorded sound inventories instead of rules predicting the acoustic correlates of phonemes. Because of the need to analyze large amounts of data, this approach relies on automated techniques such as those used in automatic speech recognition.

1. Introduction

The ultimate goal of text-to-speech synthesis is to convert ordinary orthographic text into an acoustic signal that is indistinguishable from human speech. The conversion process, illustrated in Figure 1, is considered to have two parts, because the two parts involve different types of knowledge and processes. The front end handles problems in text analysis and higher-level linguistic features; it interprets orthographic text and outputs a phonetic transcription that specifies the phonemes and an intonation for the text. The back end handles problems in phonetics, acoustics, and signal processing; it converts the phonetic transcription into a speech waveform containing appropriate values for acoustic parameters such as pitch, amplitude, duration, and spectral characteristics.

Figure 1. Text-to-speech levels: orthography (e.g., "The children read to Dr. Smith.") is converted by the front end (text normalization, word pronunciation, intonation) into a phonetic transcription, which the back end (articulatory model, synthesis-by-rule, or concatenative synthesis) converts into the acoustic signal.

For both parts the conversion process is usually performed through one of two approaches. One approach is to rely on rules formulated by experts in natural language, linguistics, or acoustics. Another approach is to avoid rules and instead include large lists from machine-readable dictionaries, or to formulate rules by automatic methods from statistical analysis of a large corpus of transcribed speech. The trend today is to use the corpus-based approach ([1-6]). The dependency on large amounts of data processing necessitates the use of automated techniques such as those used in automatic speech recognition. Crucial issues for this approach are the coverage of the corpus and how a system deals with cases outside the corpus.

2. Front-end of text-to-speech

The goal of the front end of a synthesizer is to convert text into a phonetic transcription. Because text is an impoverished representation of a speaker's verbal intentions, the front end must attempt to regularize the text and resolve ambiguities by inferring the speaker's intention from the text.

2.1. Text normalization

As illustrated in Table 1, text contains many occurrences of strings that are not pronounceable as such. Some of these strings can be interpreted in a context-free fashion and can be handled by a simple table lookup. For example, the abbreviation "Mr." can be replaced with the word "Mister". Number strings are more complicated and cannot be handled by a dictionary per se. Even simple number cases require an algorithm that assigns an interpretation to the entire number.
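As an illustration of what such an algorithm involves (an editorial sketch, not a component of any particular system described here), the following Python fragment expands plain integer strings below a billion into words; a real text normalizer must first classify the string (year, phone number, currency, ordinal) before choosing an expansion like this one.

# Minimal sketch: expand an integer string into English words.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def expand_below_thousand(n: int) -> str:
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if 0 < n < 20:
        words.append(ONES[n])
    return " ".join(words) or "zero"

def expand_number(digits: str) -> str:
    """Assign one interpretation to the entire digit string (below 10^9)."""
    n = int(digits)
    if n < 1000:
        return expand_below_thousand(n)
    words = []
    for name, value in [("million", 1_000_000), ("thousand", 1_000)]:
        if n >= value:
            words += [expand_below_thousand(n // value), name]
            n %= value
    if n:
        words.append(expand_below_thousand(n))
    return " ".join(words)

print(expand_number("22"))    # twenty two
print(expand_number("1998"))  # one thousand nine hundred ninety eight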
In natural language, acronyms are sometimes spelled out, as is "NAACP", and are sometimes pronounced as words, as is "VISA". A synthesizer must have some mechanism for determining whether such uppercase text should be pronounced or spelled out. Some synthesizers use a dictionary listing the most common acronyms and then simply spell out all other occurrences of uppercase text. Another approach is to have algorithms based on constraints on letter sequences in actual words. For example, a word consisting of a long sequence of consonants or of vowels, such as "JKLHP" or "OAEIE", is not pronounceable as an English word. Consequently, such sequences should be spelled out. An advantage of the algorithmic approach is that it can be applied to material that contains only uppercase text, as is the case with many databases.
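The following sketch combines the two strategies just mentioned, a small acronym dictionary plus a letter-sequence constraint; the dictionary entries and the run-length thresholds are illustrative assumptions rather than values from any deployed synthesizer.

import re

# Tiny stand-in for a dictionary of common acronyms and their treatment.
ACRONYM_DICT = {"NAACP": "spell", "VISA": "word"}
VOWELS = set("AEIOU")

def spell_out(token: str) -> str:
    """Render a token letter by letter, e.g. "NAACP" -> "N. A. A. C. P."."""
    return " ".join(letter + "." for letter in token)

def normalize_uppercase(token: str) -> str:
    """Decide whether an all-uppercase token is spelled out or pronounced."""
    if ACRONYM_DICT.get(token) == "spell":
        return spell_out(token)
    if ACRONYM_DICT.get(token) == "word":
        return token.lower()
    # Fallback: crude pronounceability test on letter sequences.  Spell the
    # token out if it has no vowel, a run of 4+ consonants, or a run of
    # 3+ vowels (thresholds are illustrative, not tuned).
    unpronounceable = (not any(ch in VOWELS for ch in token)
                       or re.search(r"[^AEIOU]{4,}", token)
                       or re.search(r"[AEIOU]{3,}", token))
    return spell_out(token) if unpronounceable else token.lower()

for t in ["NAACP", "VISA", "JKLHP", "OAEIE", "NATO"]:
    print(t, "->", normalize_uppercase(t))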
Context-dependent homographs can be normalized in different ways. One approach is to rely on restrictions of the particular application of text-to-speech. For example, if an application is designed to speak only names, then the abbreviation "St." becomes unambiguously "Saint".

Table 1. Text normalization examples

Context-free
  Mr.             Mister
  22              twenty-two
  $N              N dollars
  NAACP           N. A. A. C. P.
  VISA            visa
  Johnd@xyz.com   John D. at XYZ dot com
Context-dependent
  St.       saint; street
  Dr.       doctor; drive
  Jesus     Jesus; /hAzus/
  1998      nineteen ninety-eight; one thousand nine hundred...; one nine nine eight
  read      /rid/ (Vpres); /rEd/ (Vpast)
  convict   /kxnvikt/ (Verb); /kanvikt/ (Noun)
  bass      /bAs/ (music); /bas/ (fish)

However, most applications require a more general solution. One possibility is to attempt to resolve the ambiguity through a list of nearby words. For example, the abbreviation "St." followed by "John", "Joseph", etc. would be "Saint". Another possibility is to use word categories such as proper noun or first name, and to specify that "St." followed by that category would be "Saint". The list of disambiguating cases can be prepared either via the intuition of the researcher or automatically, through statistical analysis of a pre-labeled corpus.

Typically, statistical analysis is based on the distribution of occurrences of the abbreviation plus one or two preceding or following words. For example, given a corpus of text in which the abbreviation "St." has been translated by a human into either "Saint" or "Street", we could collect the set of N-grams for "St.", consisting of all translations of "St." plus the adjacent words or word categories. For each N-gram, we would tabulate the number of "St."s representing "Saint" and the number representing "Street", as illustrated in Table 2 using bigrams. Then, one simple approach is to use the ratio of occurrences of the different pronunciations as a measure of the disambiguation power of the context. So for each N-gram, we would compute the ratio of the two pronunciations and store them in a table. For synthesis, we would choose the highest-valued ratio in the table for the matching context.

Table 2. Abbreviation disambiguation using bigrams

  Bigram     Word       #occurrences   Ratio
  St. John   = Street          1       1/177 = .005
  St. John   = Saint         177       177/1 = 177
  John St.   = Street         37       37/10 = 3.70
  John St.   = Saint          10       10/37 = .270

  Disambiguation table
  St. John   > Saint    177.0
  John St.   > Street     3.7
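A compact sketch of the procedure just described, using the counts of Table 2 (the helper names and the back-off to an overall default are editorial assumptions, not code from any particular system):

# Bigram counts as in Table 2: ("St.", "John") means "St." immediately
# followed by "John"; ("John", "St.") means "St." preceded by "John".
counts = {
    ("St.", "John"): {"Saint": 177, "Street": 1},
    ("John", "St."): {"Street": 37, "Saint": 10},
}

def build_disambiguation_table(counts):
    """For each bigram, keep the majority expansion and its count ratio.

    Assumes each context was observed with (at least) two competing
    expansions, as in Table 2.
    """
    table = {}
    for bigram, by_expansion in counts.items():
        ranked = sorted(by_expansion.items(), key=lambda kv: kv[1], reverse=True)
        (best, n_best), (_, n_other) = ranked[0], ranked[1]
        table[bigram] = (best, n_best / max(n_other, 1))
    return table

def expand_st(prev_word, next_word, table, default="Street"):
    """Choose the expansion whose matching context has the highest ratio."""
    candidates = [table[b] for b in [("St.", next_word), (prev_word, "St.")]
                  if b in table]
    if not candidates:
        return default   # back off to overall frequency for unseen contexts
    return max(candidates, key=lambda c: c[1])[0]

table = build_disambiguation_table(counts)
print(expand_st("on", "John", table))    # Saint  (ratio 177.0)
print(expand_st("John", "Ave.", table))  # Street (ratio 3.7)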
This same methodology can be used with word categories, like part of speech or other labels. A disadvantage of using words is the coverage of the corpus: for example, the name "Aloysius" might not appear in a corpus. Another disadvantage of the word category approach is that the categories may be too coarse.

The same basic approach can be used to resolve semantic ambiguities like that of "bass", which has two pronunciations, corresponding to different meanings, one referring to "bass" as a type of fish and one referring to "bass" in music. In this way, "bass" preceded by the word "sea" would refer to the fish. However, non-local context (for example, words occurring anywhere in the same sentence or paragraph) may be needed to disambiguate such semantic ambiguities. For example, in a text containing the word "opera", "bass" would most likely refer to the musical term. Given that text is an impoverished representation of a speaker's intention, many such cases cannot be resolved with high probability even by a human reading text; however, a synthesizer must choose some pronunciation, and the choice can be based on the general relative frequency of occurrence of one pronunciation versus another.

Thus, in both context-free and context-dependent normalization, table lookup translations can be derived either by intuition or by statistical analysis of a large corpus. A problem with intuition-based lists is that preparation of the list is labor intensive and subject to errors of omission and subjectivity of the preparer. A
problem with the automatic, statistical preparation method
is that unless the corpus on which the analysis is based is
extremely large, the coverage of the corpus material will suffer from errors of omission. Further, automatic methods require that the corpus be pre-labeled by a human with the correct expansions, a process which is itself labor intensive and error-prone.

Part-of-speech homonyms ("read" /rEd/, rhyming with "bread", as the past tense of the verb; "read" /rid/, rhyming with "seed") are also context-dependent ambiguities. They can be resolved, in general, by a part-of-speech tagger. Today, most taggers are based on methodology akin to that discussed above, using statistics of short, 2- or 3-word sequences or part-of-speech label sequences that occur in a corpus that has been prelabeled with part of speech for each word. Syntactic parsers are not used today, due to their computational demands. The utility of this approach depends in part on the granularity of the part-of-speech labels as well as on the size of the corpus.
2.2. Word pronunciation

Given the words, a synthesizer converts the orthography for each word into phonemes and stress. To perform this conversion, synthesizers typically use a combination of a dictionary and rules, as shown in Figure 2.

Figure 2. Word pronunciation component: dictionary (e.g., Beethoven > /be' to vxn/, two > /tu/, read, Vpres > /rid/), ethnic classification rules (e.g., Letourneaux > "french", Fujisaki > "japanese"), morphological rules (e.g., hopeful > hope + ful, baseline > base + line), letter-to-phoneme rules (e.g., o before a consonant and final e > /o/ as in "hope", otherwise /a/ as in "hop"; u > /u/ in "japanese" words such as "fuji", otherwise /yu/ as in "fugitive"), and syllabification and lexical stress (e.g., /sInTxsYzR/ > /sI'n Tx sY` zR/), yielding the pronunciation as phonemes and stress.

The first process in word pronunciation is usually dictionary lookup; words found in the dictionary need not be processed further. In the original conception of a word pronunciation component, the dictionary contains pronunciations for only those words that are exceptions to the rules. For example, in English the word "two" is pronounced /tu/, which is unpredictable from any English rules, so that it would definitely appear in the dictionary. However, what words constitute exceptions can only be defined in terms of the rules. Compared to many other languages, the correspondence between orthography and phonemes in English is particularly complex, and therefore it is a challenge to formulate a comprehensive set of rules.

Consequently, one current approach to word pronunciation is to greatly expand the dictionary to cover a large number of cases and to depend on rules only for backup. In this way, the dictionary would implicitly build in all the rules of word pronunciation. Given the availability of machine-readable dictionaries, it might seem sufficient simply to use a dictionary and dispense with any rules. However, many applications for speech synthesis involve items that are not covered by readily available dictionaries, for example, inflected words (e.g., "sings" versus "sing") and proper names. Solving the problem of pronouncing out-of-vocabulary words involves either compiling one's own dictionary or developing other strategies, like traditional rules or analogy systems. A typical partial solution is to prepare a dictionary covering the most frequently occurring words and to use rules for the other cases.

Morphological decomposition is often used in synthesis systems because some letter-to-phoneme rules can be stated more simply if the rules refer to morphemes. For example, in English, final "e" in many words is silent, while the medial letter "e" is often pronounced. In compound words in which a non-final morpheme ends in "e", the "e" is also silent. So, for example, in the word "Vaseline", the medial "e" is pronounced as a schwa, while in the word "baseline", which is composed of two morphemes ("base" + "line"), the "e" in "base" is silent. If morphological rules decompose compounds, the same silent-"e" pronunciation rule can be used to refer to both word-final and morpheme-final silent "e". Also, inflectional affixes ("-s", "-ed", "-ing") are typically split off for letter-to-phoneme rules, because inflectional affixes do not affect the pronunciation of the base word. However, morphological decomposition is not trivial, since not all occurrences of these letter sequences function as inflectional affixes ("Ahmed", "bring"), and many possible decompositions are not, in fact, correct decompositions (for example, "Vaseline" is not decomposed into "vase" + "line").

Some synthesizers tag words and names as to ethnic identity, assuming that certain letter sequences are pronounced differently depending on the ethnic origin of the word. Finally, letter-to-phoneme rules map letter sequences into sequences of phonemes and stress.
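To make the interaction of these components concrete, here is a toy sketch of a dictionary-first pronunciation routine with inflection stripping, naive compound splitting, and a letter-to-sound fallback; the entries, phoneme notation, and rules are stand-ins, not the actual component of Figure 2.

# Toy word-pronunciation component: dictionary first, then strip an
# inflectional suffix and retry, then try a compound split, then fall
# back to (extremely naive) letter-to-sound conversion.
DICTIONARY = {"two": "t u", "sing": "s I N", "base": "b e s", "line": "l Y n"}

def letter_to_sound(word: str) -> str:
    """Placeholder letter-to-phoneme fallback: one symbol per letter."""
    return " ".join(word)

def pronounce(word: str) -> str:
    word = word.lower()
    if word in DICTIONARY:
        return DICTIONARY[word]
    # Inflectional affixes usually do not change the base pronunciation.
    for suffix, phones in [("s", "z"), ("ed", "d"), ("ing", "I N")]:
        base = word[: -len(suffix)]
        if word.endswith(suffix) and base in DICTIONARY:
            return DICTIONARY[base] + " " + phones
    # Naive compound split: both halves must already be known words.
    for i in range(2, len(word) - 1):
        left, right = word[:i], word[i:]
        if left in DICTIONARY and right in DICTIONARY:
            return DICTIONARY[left] + " " + DICTIONARY[right]
    return letter_to_sound(word)

print(pronounce("two"))       # dictionary hit
print(pronounce("sings"))     # "sing" + plural /z/
print(pronounce("baseline"))  # "base" + "line"
print(pronounce("Aloysius"))  # letter-to-sound fallback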
Because many applications for text-to-speech involve pronouncing the names of people and places, and because no complete pronouncing dictionaries for names exist, much attention has been directed toward name pronunciation. This subproblem is particularly challenging in the United States, where the population is ethnically diverse and therefore has many names from different ethnic origins. Name pronunciation is one area where text-to-speech may have the potential of performing better than a single person. That is, since no single person can know how to pronounce all names correctly, it might be possible for a text-to-speech system to cull knowledge about name pronunciation and perform better, on average, than a person. Although synthesizers might make more serious pronunciation errors than a person would make, some synthesizers actually perform as well as or better than educated humans at pronouncing names with low frequency of occurrence.

2.3 Intonation

In natural speech, the same phonemic sequence can occur with a variety of different intonational patterns: accent, melody, and phrasing. A speaker's selection of an intonation is not random; the choice corresponds to different syntactic, semantic, and discourse-level factors. However, many of these factors are not explicitly indicated orthographically in text; a synthesizer, like a human who reads aloud from a text, must infer them.

Accent. In natural speech, certain words are more prominent, or emphasized more, than other words. Consequently, a synthesizer must assign accent to some words in sentences. Many factors affect accent placement, as illustrated in Table 3.

Table 3. Accent prediction factors

Part of speech
  Content word:  Which book did John leave?
  Preposition:   Which car did John leave in?
  Particle:      Which paragraph did John leave in?
Lexical accent
  Normal case:  large number; John Smith; long street; Walnut Drive
  Word compounds with deaccented last word:  telephone number; school building; heart attack; Walnut Street
Given versus new information
  Lexical match:  After John brought me an apple, I ate the apple.
  Morphology:  John brought me an apple, but I don't like apples.  John likes to swim, but I hate swimming.
  Compounds:  After John felt pains near his heart, he took better care of his heart.  After John had a heart attack, he took better care of his heart.
  Real-world knowledge:  John brought me an apple, but I don't like fruit.  After I met Bill Clinton, I had lunch with the President.

Accent is in part correlated with part of speech. An oft-cited default accent rule is to place the primary accent on the last content word of a sentence. Function words like auxiliary verbs and prepositions tend to be deaccented. However, words that can be used as prepositions are accented when they serve as verb particles. Determining the particle/preposition distinction is often handled via part-of-speech tagging or by other general text normalization procedures, but the resolution is often difficult, as suggested by the cases in Table 3.

Another factor that affects accent is noun phrase structure. Some noun phrases are accented on the last word (e.g., "Avenue", as in "Fifth Avenue") and some are accented on earlier words (e.g., "Fifth Street"). One approach to predicting accent is to list word compounds whose accent does not fall on the last word. Another approach is to formulate more lexically based or semantic rules. For example, in proper names of streets, most street words are accented ("Avenue", "Drive", "Lane", "Road"), but "Street" is always deaccented and causes accent to shift to the previous word.

However, accent is also determined by more global semantic properties of the text. In the absence of any special information, speakers often accent the last word in a phrase. However, speakers do not accent previously given information in many situations (in Table 3, given words are shown in italics, and accented words in the last phrase are underlined). So, for example, in the first example, the second occurrence of "apple" is deaccented. Identifying what information is given is clearly a problem in natural language processing. However, synthesizers adopt heuristics for deaccenting. For example, one approach to identifying given information is to rely on lexical match. However, as the other examples in Table 3 show, correct resolution requires identification of morphology, compounding, and, perhaps most difficult, real-world knowledge.
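A minimal sketch of the lexical-match heuristic (the function-word list and the accent marking are illustrative assumptions; by design, it fails on the morphology, compound, and real-world-knowledge cases of Table 3):

FUNCTION_WORDS = {"a", "an", "the", "to", "of", "in", "on", "and",
                  "but", "i", "me", "he", "she", "after", "did"}

def assign_accents(words):
    """Accent content words; deaccent words that are already "given".

    "Given" is approximated purely by lexical match on the word string, so
    morphological variants ("apples" after "apple"), compounds, and
    real-world associations are deliberately not handled.
    """
    seen = set()
    result = []
    for w in words:
        lw = w.lower()
        accent = lw not in FUNCTION_WORDS and lw not in seen
        result.append((w, accent))
        seen.add(lw)
    return result

tokens = "After John brought me an apple I ate the apple".split()
print([w + ("*" if acc else "") for w, acc in assign_accents(tokens)])
# accented words are starred; the second "apple" comes out unstarred (deaccented)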
Phrasing. In natural speech, speakers often break up sentences into several phrases, which can be articulated with pausing and which serve as the domain of certain types of accent placement, timing, and melody rules. Many factors play a role in determining intonational phrases (Table 4). Sometimes punctuation can serve as a guide, but in other cases punctuation can be misleading, and in still other cases no punctuation is given where phrase breaks should be inserted. One possibility would be to determine syntactic constituency via a parser, and to use the syntactic phrases in combination with other factors, such as phrase length, to decide on intonational phrases. One approach to predicting intonational phrases uses statistical analysis of a corpus that has been pre-labeled for phrase boundaries, using part of speech and other automatically derivable word categories.

Table 4. Some factors in phrasing

Punctuation
  201-234-5678
  (A + B) * C
  We saw Bill, and Mary saw Bob.
Syntactic parse
  We saw Bill and Mary leave.
  We saw Bill and Mary saw Bob.
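As a simple stand-in for such a predictor (not the statistical method itself), the following sketch proposes breaks at strong punctuation and otherwise only when a phrase grows long; the thresholds and exclusion list are arbitrary, and a corpus-trained model would replace the hand-written conditions.

def phrase_breaks(tokens, max_len=6):
    """Return token indices after which an intonational phrase break is placed.

    Breaks are proposed at sentence-internal punctuation and, failing that,
    when the current phrase grows beyond max_len words.  This stands in for
    a model trained on a corpus labeled for phrase boundaries.
    """
    breaks, current = [], 0
    for i, tok in enumerate(tokens):
        current += 1
        too_long = current >= max_len and tok not in {"a", "the", "and"}
        if tok in {",", ";", ":"} or too_long:
            breaks.append(i)
            current = 0
    return breaks

sent = "We saw Bill , and Mary saw Bob .".split()
print(phrase_breaks(sent))   # [3]: a break is proposed after the comma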
Melody. Melody refers to the pattern of tones with which a phrase, sentence, or paragraph is spoken. Much early work on melody did not include any independent level of representation for melody; that is, synthesizers simply mapped from a sentence type to a physical pitch time function. Since then, a great deal has been learned about the syntax of tones. However, much less is known about predicting the relationship between particular melodic patterns and syntactic, semantic, and discourse-level factors. Synthesizers typically have a relatively small taxonomy of melody patterns and, in synthesis, assign utterances one of the patterns. For example, deaccented words are not assigned tones; so-called 'yes-no' questions can end in a rising tone pattern, while 'wh-' questions do not; lists can exhibit particular downstepping melodies. The conversion of the pattern of tones to a physical pitch time function is usually considered a problem for the back end of the synthesizer.

Timing. The timing or duration of speech is not usually considered a feature of the front end of a synthesizer per se. Instead, timing is usually considered a problem related to the back end of a synthesizer, under the assumption that the durational effects to be modeled are a function of other factors that are part of the phonetic transcription. Factors that are often considered to determine timing include phonemic identity, syllable structure, phrasing, accent, melody, speaking rate, and intonational structures like the word or the metrical foot.

3. Back-end of text-to-speech

The goal of the back end of a speech synthesizer is to convert a given phonetic transcription into an acoustic signal. The back end addresses the problem that the same phoneme can have many different acoustic realizations, depending on many factors, for example, the identity of nearby phonemes, stress, emphasis, and position in a phrase. Further, the back end must realize the intonation specified in the phonetic transcription as acoustic parameters such as pitch, duration, and amplitude.

3.1 Approaches

Synthesis systems traditionally adopt one of three general approaches to converting a phonetic transcription into an acoustic signal: articulatory synthesis, synthesis-by-rule, and synthesis-by-concatenation.

Articulatory synthesis is based on a model of the human vocal tract, and has articulatory parameters corresponding to organs such as the lips, tongue, velum, and vocal folds. The ultimate research solution for speech synthesis is probably based on this approach. Articulatory synthesis requires two types of rules: linguistic rules specifying the relation between the phonetic transcription and articulatory activity, and physical rules specifying the relation between articulatory activity and the acoustic signal. Typically, the linguistic rules are formulated phoneme by phoneme. The linguistic rules specify, for example, when the articulators move to produce a phoneme, where they move, how fast they move, and how they are coordinated with movements of other articulators. Because a phoneme need not have movements specified for all articulators, different articulators can show movements for different phonemes at the same point in time. In this way, articulatory synthesis can provide elegant solutions to some difficult problems, particularly for the phonetics of consonant clusters and syllable structure. The physical rules in articulatory synthesis specify how particular vocal tract configurations map onto an acoustic signal. Both linguistic and physical rules, as well as the articulatory model itself, crucially affect the quality of the synthesized speech. However, current knowledge about all the details of both sets of rules and about the model is incomplete, and consequently the speech produced by articulatory synthesis is not completely natural-sounding. For these reasons, articulatory synthesizers today are research laboratory tools or prototypes and are not used in commercial applications.

Synthesis-by-rule is based on a model of the perceptually significant acoustic parameters underlying speech, such as formants, bandwidths, nasal zeros, aspiration, and voicing. This approach also involves several sets of rules. Linguistic rules specify so-called formant "target" positions for each phoneme in the utterance and the timing of those target positions. Other rules smooth the target sequence and convert the resulting time functions into an acoustic waveform. Synthesis-by-rule has provided much insight into speech production and perception. However, for text-to-speech systems, the speech quality produced by synthesis-by-rule suffers from the fact that current knowledge about the rules and about the acoustic model itself is incomplete.
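A schematic sketch of the target-and-smoothing idea for a single formant, with made-up target and duration values and linear interpolation standing in for the smoothing rules (a real synthesizer would also generate bandwidths, amplitudes, and a source signal):

# Sketch of synthesis-by-rule style target smoothing for one formant (F1).
# Targets and durations are invented; real rules depend on context, stress, etc.
TARGETS_HZ = {"a": 750.0, "i": 300.0, "s": 400.0}
DURATION_MS = {"a": 120, "i": 100, "s": 90}

def formant_track(phonemes, frame_ms=10):
    """Linearly interpolate between per-phoneme targets, frame by frame."""
    # Place each target at the temporal midpoint of its phoneme.
    points, total = [], 0.0
    for ph in phonemes:
        dur = DURATION_MS[ph]
        points.append((total + dur / 2.0, TARGETS_HZ[ph]))
        total += dur
    track, time = [], 0.0
    while time < total:
        if time <= points[0][0]:
            value = points[0][1]
        elif time >= points[-1][0]:
            value = points[-1][1]
        else:
            for (t0, v0), (t1, v1) in zip(points, points[1:]):
                if t0 <= time <= t1:
                    value = v0 + (v1 - v0) * (time - t0) / (t1 - t0)
                    break
        track.append(round(value))
        time += frame_ms
    return track

print(formant_track(["s", "a", "i"]))   # one smoothed F1 value per 10-ms frame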
3.2 Concatenative Synthesis

The original rationale for synthesis-by-concatenation was to overcome the inadequacies of rule-based systems by eliminating the need for some, or perhaps even all, of the rules and models. The way this is done is by storing an inventory of sound units that are not models at all, but rather are fragments of human speech, usually consisting of a sequence of several phonemes. The units can then be concatenated to form a new utterance. So in the concatenative approach, the inventory itself incorporates some of the acoustic variation that must be produced by rules in other systems.
Most high-quality speech synthesizers today are concatenative synthesizers. In a concatenative system, a person records speech containing a large set of basic sound units, usually corresponding to a relatively short sequence of phonemes. The units are excised from the speech and, in most systems, the units are processed with some type of speech coding method, and the resulting templates are stored in an inventory. For synthesis of a new utterance, given a phonetic transcription, the system uses rules to select the appropriate units, extracts them from the inventory, and concatenates them.

Typically, a synthesis-by-concatenation system contains only one token of each unit. Since pitch, duration, and amplitude for a short phoneme sequence vary widely with different intonational features (Figure 3), the system must modify the suprasegmental acoustic features of the units, like the pitch, duration, and amplitude, so that the concatenated sequence of units represents the appropriate intonation. Usually, a set of rules modifies the templates by adjusting these parameters and may also smooth across the boundary at the splicing point.

Figure 3. Duration and pitch for /layb/ in "libe" (one syllable) and "libel" (two syllables), spoken in isolation with a falling tone pattern: the F0 contour, spectrum, and waveform of the same phoneme sequence differ across the two words.

The quality of the speech produced by concatenative synthesis is determined by the choice of the basic concatenative unit, the rules for concatenating and modifying the units, and the speech coding method.

Speech coding methods. The reason that concatenative synthesizers need speech coding methods is to allow the synthesizer to change the pitch and duration of the templates without changing the phonemic identity of the unit. Coding methods are signal processing techniques that attempt to separate the source signal (vocal fold vibration or frication noise) from the acoustic parameters representing the vocal tract, and so allow the source to be controlled independently of the vocal tract. The earliest speech coding method used in concatenative synthesis was LPC (linear predictive coding). The LPC model decomposes the waveform into acoustic parameters representing formants, pitch, and amplitude over short 5- or 10-msec intervals. Because LPC has a very simplified model of the source (glottal fold vibration), the speech it produces is somewhat robotic-sounding. For this reason, newer systems are pursuing alternative coding methods that provide a more enriched representation. Some of these methods are based on LPC (e.g., MPLPC, or multipulse LPC, and RELP, or residual-excited LPC) but have a more elaborate representation of glottal fold vibration. An extremely simple coding method, PSOLA (pitch-synchronous overlap and add), uses raw waveforms in which the pitch period onsets have been labeled. In PSOLA, a template is shortened by eliminating pitch periods; a template is lengthened by repeating pitch periods; pitch is raised by overlapping the native pitch periods more closely together in time; pitch is lowered by spreading out the intervals. However, with any of these techniques, when the synthesizer imposes large changes on the native pitch and duration of the inventory templates, as is required in synthesis, the resulting speech quality is often degraded.
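A schematic sketch of these PSOLA-style manipulations, operating on a list of pre-labeled pitch periods rather than real audio; the windowing and overlap-add that a real implementation needs are omitted.

def change_duration(periods, factor):
    """Shorten or lengthen a template by dropping or repeating pitch periods."""
    out, acc = [], 0.0
    for p in periods:
        acc += factor
        while acc >= 1.0:          # factor > 1 repeats periods,
            out.append(p)          # factor < 1 drops some of them
            acc -= 1.0
    return out

def change_pitch(periods, factor, samples_per_period):
    """Raise or lower pitch by packing pitch periods closer together or
    spreading them apart (new spacing = old spacing / factor)."""
    new_spacing = int(samples_per_period / factor)
    # A real PSOLA implementation would window each period and
    # overlap-add it at these new onset times.
    return [(i * new_spacing, p) for i, p in enumerate(periods)]

periods = ["p0", "p1", "p2", "p3"]       # stand-ins for waveform chunks
print(change_duration(periods, 1.5))     # lengthened: some periods repeat
print(change_pitch(periods, 2.0, 80))    # onsets every 40 samples: pitch up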
Consequently, a new approach to synthesis-by-concatenation attempts to avoid using any speech coding at all by using only natural speech. To avoid having to change pitch and duration for a template, it is necessary to have multiple variants of each sound unit, representing all the possible pitch and duration values with which that unit could occur in natural speech. While this approach might avoid the speech quality distortions associated with speech coding methods, because of the large range of pitch and duration variation that exists in a language, this pure concatenation method requires huge sound inventories. Further, since manual labeling of such a large inventory is prohibitive if not impossible, it is necessary to devise automatic methods to label the speech; today, tools borrowed from automatic speech recognition, such as HMMs (Hidden Markov Models), are being used to that end.

Basic sound units. There is obviously an interaction between the number and size of the sound unit inventory and the number and complexity of the rules that are needed. Traditionally, the set of units has been selected according to linguistic criteria. Figure 4 lists a variety of different kinds of units that might be considered for concatenative synthesis. Some of the units have been proposed based on knowledge about the kinds of phonetic factors that cause acoustic variation in phonemes, for example, allophones. Some are based on the pragmatics of an application, for example phrases and sentences, which are typically used in voice response systems.
In general, the longer the unit, the more information it contains, and therefore the fewer or less complex the rules that are needed. To the extent that the rules are not completely understood, the fewer rules that are needed, the higher the quality of the speech. Most concatenation systems are based on relatively small units: allophones, diphones, triphones, or demisyllables. Recently, with the availability of large amounts of computer memory, longer units are no longer out of the question, and it is possible to envision systems that blend voice response with text-to-speech and systems that have huge sound inventories.

Figure 4. Traditional segmental units for concatenative synthesis.

  Unit (short to long)   # Units (English)
  Phoneme/Allophone      40/65
  Diphone                < 40^2 to 65^2
  Triphone               < 40^3 to 65^3
  Demisyllable           2K
  Syllable               11K
  VC*V
  2-syllable             < 11K^2
  Word                   100K-1.5M
  Phrase
  Sentence               (unbounded)
  Short units require many rules and yield lower quality; long units require few rules and yield higher quality.

Figure 5 illustrates one issue associated with relying on large inventories: namely, the coverage of the corpus. Figure 5 shows, for different kinds of units, the number of units needed to synthesize subsets of the most frequent 50K names in the United States. As the length of the unit increases, the number of units increases and, more importantly, the slope of the coverage line increases. That is, if the basic sound unit is the diphone or demisyllable, only a few thousand units are needed to synthesize all 50K names, and the low coverage slope means that their coverage is good; few additional units are needed to cover the rest of the language. However, if the basic sound unit is the two-syllable unit, the slope of the coverage line is very steep; the probability is high that other names in the language cannot be synthesized from an inventory based on 50K names. Synthesizing the next 100 names after the 50K names will require 83 more sound units. Consequently, if coverage is inadequate, the synthesizer needs to adopt a strategy for synthesizing names that require missing units. One possibility is to fall back on subunits of existing units; however, this method will succeed only if subunit concatenation rules are good.

Figure 5. Corpus coverage for different sound units: the number of distinct units needed to synthesize the top N surnames (N up to 50K, extrapolated toward 1.5M) for diphones, demisyllables, syllables, triphones, VC*V units, and 2-syllable units; the slopes of the coverage lines range from about .003 for the smallest units to .83 for 2-syllable units.

Figure 6. Automatic selection of phoneme-sized units using statistical clustering: a tree for the phoneme /a/ whose leaves are context-dependent units such as /a/ preceded by /b/ or /d/, or followed by /s/.

One goal of a sound unit inventory is to represent all the perceptually significant acoustic variations that occur in natural speech. Therefore, one approach is to use statistical clustering techniques in combination with some physical, phonetic distance measure to automatically select a set of units given a large database that has been phonemically labeled. Typically, the distance measures are those that have frequently been used in automatic speech recognition. As shown in Figure 6, the basic method involves computing the distance between all units with the same phonetic labels, and using hierarchical clustering techniques to minimize the distance among units within the same cluster and maximize the distance between clusters. The inventory units are represented by the final leaves of the tree. Figure 6 shows a leaf representing the phoneme /a/ preceded by /d/, for example. This technique can be used to select units on the basis of segmental environment alone or on the basis of both segmental features and suprasegmental features like pitch and duration. In addition to the issue of coverage, a challenge for this kind of approach is the perceptual reality of the physical distance measure that forms the basis of the clustering.
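A compact sketch of that clustering step using generic numerical tools (it assumes NumPy and SciPy are available; the feature vectors are random stand-ins for spectral measurements of /a/ tokens, and the distance metric and cluster count are not perceptually validated):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in data: 200 tokens of the phoneme /a/, each described by a
# 12-dimensional spectral feature vector plus its left-context label.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 12))
left_context = rng.choice(["b", "d", "s"], size=200)

# Agglomerative clustering on the acoustic distance between tokens.
tree = linkage(features, method="average", metric="euclidean")
cluster_of = fcluster(tree, t=8, criterion="maxclust")   # 8 leaf clusters

# Each cluster becomes one inventory unit; inspect which phonetic
# contexts it ended up covering (e.g., a leaf dominated by /d/ + /a/).
for c in range(1, 9):
    contexts = left_context[cluster_of == c]
    print(f"unit {c}: {len(contexts)} tokens, contexts {set(contexts)}")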
In another approach, there are no predefined sound units or templates per se. Rather, the sound inventory consists of a large corpus of prerecorded speech, which has been labeled in terms of phonemes and sometimes prosodic features. In synthesis, units matching the requirements of the new utterance are extracted from the corpus on a case-by-case basis, using various criteria, as illustrated in Figure 7. In this approach, the concatenation point between two units is also determined on a case-by-case basis, using some objective measure of the distance
between two units at the splicing point. The idea is to make the splice at the point at which the discontinuity between two adjacent units is minimized. In order to measure the discontinuity, the distance is computed between frames in the two units, based typically on physical parameters like the spectrum and possibly amplitude and pitch.

Figure 7. Unit selection and concatenation with non-uniform units; candidate units are underlined and splicing points are shown by arrows. To synthesize the new utterance "Turn right on Walnut Drive", candidate fragments in the corpus include "turn right off South", "right on Woodland", "in Walnut Creek", and "Scott Drive".
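A small sketch of such a splice-point search over candidate join frames (the frame features here are random stand-ins; real systems use spectral parameters such as cepstral coefficients computed every 5-10 ms):

import numpy as np

def best_splice_point(unit_a, unit_b, overlap=5):
    """Pick the join frame pair that minimizes spectral discontinuity.

    unit_a, unit_b: (frames x features) arrays for the end of the left unit
    and the start of the right unit.  Candidate splice points are the last
    `overlap` frames of unit_a against the first `overlap` frames of unit_b;
    the pair with the smallest Euclidean distance wins.
    """
    tail, head = unit_a[-overlap:], unit_b[:overlap]
    best = None
    for i, fa in enumerate(tail):
        for j, fb in enumerate(head):
            cost = float(np.linalg.norm(fa - fb))
            if best is None or cost < best[0]:
                best = (cost, i, j)
    return best   # (discontinuity, frame in A, frame in B)

rng = np.random.default_rng(1)
a = rng.normal(size=(20, 13))   # e.g. 20 frames of cepstral coefficients
b = rng.normal(size=(20, 13))
print(best_splice_point(a, b))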
Rules in concatenative synthesis. A great deal of research has been done over the years on duration and pitch rules. For duration, most recent work has focussed on predicting durations for individual phonemes, using large speech corpora that have been segmented and annotated with labels for various phonetic facts that are potentially predictive of duration. Statistical techniques such as multiple regression and neural networks have been applied to the corpora to produce quantitative models for use in synthesis. Similar approaches have been adopted for predicting pitch.
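A minimal illustration of the multiple-regression idea on synthetic data (the factor encoding, durations, and resulting weights are invented for the example; real models use far richer features and large corpora):

import numpy as np

# Synthetic training data: each row encodes factors thought to affect
# duration, [is_vowel, is_stressed, is_phrase_final], plus a constant term.
X = np.array([[1, 1, 1, 1],
              [1, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
y = np.array([180.0, 140.0, 110.0, 60.0, 120.0, 90.0])   # durations in ms

# Multiple linear regression by least squares: duration is approximated by X @ coef.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("weights (vowel, stress, final, base):", np.round(coef, 1))

def predict_duration(is_vowel, is_stressed, is_final):
    return float(np.array([is_vowel, is_stressed, is_final, 1.0]) @ coef)

print(round(predict_duration(1, 1, 1)), "ms for a stressed phrase-final vowel")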
However, since these rules are still not completely understood, one approach is to entertain the possibility of minimizing or even eliminating the need for such modifications by designing a larger inventory that contains multiple versions of each unit. Another approach is to replace some or all duration rules with inventory templates for duration and pitch, in analogous fashion to the templates for the sound inventory. For example, pitch time functions for particular melody patterns would be stored in an inventory. For synthesis, rules would select appropriate pitch templates and align them with the concatenated sound units. A similar approach can be used for duration rules, using time warps that have been derived from methods like DTW (Dynamic Time Warping).
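For concreteness, a textbook dynamic-time-warping sketch of the kind of alignment such time warps can be derived from; the scalar frame values are placeholders, and nothing here is specific to any particular synthesizer.

def dtw_path(ref, new):
    """Dynamic time warping between two sequences of frame values.

    Returns the alignment path as (ref_index, new_index) pairs; the path can
    be turned into a time-warp template that stretches or compresses the new
    unit's frames to match the reference timing.
    """
    n, m = len(ref), len(new)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - new[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch
                                 cost[i][j - 1],      # compress
                                 cost[i - 1][j - 1])  # match
    # Trace back the cheapest path from the end to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return list(reversed(path))

print(dtw_path([1, 2, 3, 3, 2], [1, 3, 2]))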
Different voices. One method for providing a new voice is to produce an entirely new sound inventory based on the new speaker. However, because of the amount of work involved, another approach is to apply signal processing techniques to the templates or to the output of a synthesizer to transform the acoustic parameters for the original speaker into those that are appropriate for a different speaker.

4. Monolog versus Dialog

Most research on text-to-speech to date has involved situations in which the synthesizer speaks a monolog – reading books, newspaper articles, email. However, many applications of synthesis in the future are likely to involve dialog between a human speaker and a computer. For example, a system might give directions in response to questions from a traveler, as shown in Table 5. The table compares accents for monolog and dialog, showing that in both cases it is inappropriate to accent the second occurrence of "Walnut Drive". Thus, a computer system should consider the output of the speech recognizer as part of the context for the text to be synthesized.

Table 5. Deaccenting in monolog and dialog

Monolog
  Text: After you get to Walnut Drive, turn right on Walnut Drive.
  *turn right on Walnut Drive
Dialog
  Q: I'm coming up on Walnut Drive. What do I do?
  A: Turn right on Walnut Drive.
  *Turn right on Walnut Drive.

5. Conclusion

Today, text-to-speech for unrestricted text is still far from entirely natural. Although there are a few subproblems, such as name pronunciation, where text-to-speech may perform better than humans, in general text-to-speech systems still only inadequately approach human speech and language competence. Improvements to speech quality have come most recently from the incorporation of rules based on the analysis of large amounts of data or from storing large tables, dictionaries, and sound inventories. Future improvements will depend on improvements in natural language understanding and on more natural-sounding speech quality.

6. References

[1] J. Allen, S. Hunnicutt, and D. H. Klatt, From Text to Speech: The MITalk System, Cambridge University Press, Cambridge, UK, 1987.
[2] G. Bailly and C. Benoit (eds.), Talking Machines, North-Holland, Amsterdam, 1992.
[3] Y. Sagisaka, N. Campbell, and N. Higuchi (eds.), Computing Prosody, Springer-Verlag, New York, 1997.
[4] D. B. Roe and J. G. Wilpon (eds.), Voice Communication between Humans and Machines, National Academy Press, Washington, D.C., 1994.
[5] J. P. H. van Santen, R. W. Sproat, J. P. Olive, and J. Hirschberg (eds.), Progress in Speech Synthesis, Springer-Verlag, New York, 1997.
[6] V. J. van Heuven and L. C. W. Pols (eds.), Analysis and Synthesis of Speech, Mouton de Gruyter, Berlin, 1993.
