TTS Notes

TTS (Text to Speech)
A TTS system takes a sequence of words as input and produces an acoustic waveform as output.
TTS has 10 attributes
1. Limited-domain waveform concatenation
- Can generate high-quality speech with only a small number of segments
- However, it cannot synthesize arbitrary text
2. Concatenative synthesis (no waveform modification)
- Can synthesize arbitrary text
- Can achieve good quality with a large set
- Quality is mediocre where poor concatenation takes place
3. Concatenative systems with waveform modification
- More flexible in selecting speech segments for concatenation, as the waveform can be modified for a better prosody match
- The prosody modification process can degrade the overall quality
4. Rule-based systems
- Sound uniform across different sentences
- May produce lower quality than the previous system types
5. Delay
- The time taken by the synthesizer to start speaking is very important
- Should be less than 200 ms
6. Memory resources
- Rule-based systems require less than 200 KB (a widely used option)
- RAM can be an issue for concatenative systems (may require 100 MB of storage)
7. CPU resources
- Some concatenative synthesizers need heavy computation to find the optimal sequence
8. Variable speed
- Some applications may need the speech synthesis module to generate speech at variable speed, e.g. fast speech
- Concatenative systems need to modify the waveform to get variable speed control, unless a large number of segments are recorded at different speeds
9. Pitch control
- Some spoken language systems require the output speech to have a specific pitch, e.g. to generate a voice for a song
- Concatenative systems need to modify the waveform to get pitch control, unless a large number of segments are recorded at different pitches
10. Voice characteristics
- Some spoken language systems require a specific voice (e.g. a robot voice)
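To put the 100 MB figure under Memory resources in perspective, the sketch below estimates how much uncompressed speech fits in that budget. The sample rate (16 kHz), sample width (16 bit), mono format and the reading of 1 MB as 10**6 bytes are assumptions made for illustration; none of them come from these notes.

```python
def seconds_of_audio(storage_bytes, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Seconds of uncompressed PCM audio that fit into storage_bytes."""
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return storage_bytes / bytes_per_second


# Assumed inventory size: 100 MB with 1 MB = 10**6 bytes.
inventory_bytes = 100 * 1_000_000
minutes = seconds_of_audio(inventory_bytes) / 60
print(f"about {minutes:.0f} minutes of speech")  # about 52 minutes of speech
```

Under these assumptions a 100 MB inventory holds a little under an hour of recordings, which is why memory is listed as a concern for concatenative systems but not for rule-based ones.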
Document Structure Detection:
- Determine the location of all punctuation marks in the input text
- Decide their significance based on the sentence and paragraph structure of the text
Text Normalisation:
- Handles a wide range of text issues
- Includes abbreviations and acronyms
- Abbreviations must be expanded to full words, but not always; it depends on context
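The point that abbreviation expansion depends on context can be sketched with a few regular-expression rules. This is only a toy heuristic: the rules, the function name `expand_abbreviations` and the example sentence are assumptions for illustration, and a real normaliser would need a much larger lexicon and better context handling.

```python
import re


def expand_abbreviations(text):
    """Expand 'Dr.' and 'St.' using the following word as context (toy rules)."""
    # Before a capitalised word, read "Dr." as a title and "St." as "Saint".
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Any remaining occurrence is treated as a street suffix.
    text = re.sub(r"\bDr\.", "Drive", text)
    text = re.sub(r"\bSt\.", "Street", text)
    return text


sentence = "Dr. Brown moved from Elm Dr. to St. Mary St. last year"
print(expand_abbreviations(sentence))
# Doctor Brown moved from Elm Drive to Saint Mary Street last year
```

The same written form ("Dr.", "St.") is expanded differently depending on what follows it, and abbreviations that match no rule are left untouched.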
ISSUE OF TEXTS
1. Non-delimited words
2. Expansion of digit sequences
3. Pronunciation of ordinary words and names needs morphological analysis
Text Markup Interpretation
1. Controls how a TTS engine renders its output
2. Makes TTS output sound intelligent (combined with emotion)
3. Speech Synthesis Markup Language (SSML): an XML-based markup standard for speech synthesis
LINGUISTIC ANALYSIS
- To determine the linguistic properties of each word in the input text
LINGUISTIC PROPERTIES
1. POS (part of speech) of each word
2. Sense of each word
3. Location of phrases
4. Presence of anaphora
5. Emphasis of words
6. Speaking style
HOMOGRAPH DISAMBIGUATION
Resolve the correct pronunciation of each word in the input string that has more than one pronunciation
Grapheme to Phoneme Conversion
Convert the input text into speech sounds
* refer to Intro to Speech Processing Part I
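One way to picture homograph disambiguation is as a lookup keyed on the word together with its part of speech. The sketch below assumes the POS tags are already available from linguistic analysis; the tiny lexicon, the tag names and the informal respellings are illustrative assumptions, not a real pronunciation dictionary.

```python
# Toy lexicon: (word, part of speech) -> informal respelling (assumed entries).
HOMOGRAPHS = {
    ("record", "NOUN"): "REK-ord",
    ("record", "VERB"): "ri-KORD",
    ("object", "NOUN"): "OB-jekt",
    ("object", "VERB"): "ob-JEKT",
}


def pronounce(word, pos):
    """Return the listed pronunciation for (word, pos), or the word unchanged."""
    return HOMOGRAPHS.get((word.lower(), pos), word)


tagged = [("They", "PRON"), ("record", "VERB"), ("a", "DET"), ("record", "NOUN")]
print([pronounce(word, pos) for word, pos in tagged])
# ['They', 'ri-KORD', 'a', 'REK-ord']
```

POS alone does not resolve every homograph, so a fuller system would also draw on word sense, which is listed among the linguistic properties above.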
PROSODIC ANALYSIS
Analyse a language based on its patterns of stress, intonation and tone
ACOUSTIC                      PERCEPTUAL  LINGUISTIC
Fundamental frequency (F0)    Pitch       Tone, intonation, aspect of stress
Amplitude, energy, intensity  Loudness    Aspect of stress
Duration                      Length      Aspect of stress
Amplitude dynamics            Strength    Aspect of stress
Prosody is important: it conveys semantic, syntactic and emotional information.
According to Dutoit (1997), prosody refers to audible changes in pitch, loudness and syllable length.
To other authors, prosody relates to speech timing, such as rhythm and speech rate.
Prosody operates on longer linguistic units than phones, and its study is hence called the study of suprasegmental phenomena.
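The acoustic correlates of prosody (F0, energy and duration) can each be measured from a frame of samples. The sketch below is a minimal illustration on a synthetic 200 Hz sine wave. The autocorrelation-based F0 estimator, the search range of 50 to 400 Hz and the 8 kHz sample rate are assumptions chosen for the demo, not something these notes prescribe.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Estimate F0 in Hz as sample_rate / lag, for the lag with the highest autocorrelation."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


def rms_energy(frame):
    """Root-mean-square amplitude, a simple correlate of loudness."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def duration_seconds(frame, sample_rate):
    """Length of the frame in seconds."""
    return len(frame) / sample_rate


sample_rate = 8000
frame = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(320)]

print(estimate_f0(frame, sample_rate))        # 200.0
print(round(rms_energy(frame), 4))            # 0.3536
print(duration_seconds(frame, sample_rate))   # 0.04
```

Real speech is far less regular than a sine wave, so practical pitch trackers add voicing decisions and smoothing on top of this basic idea.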
TONES
- Significant contrasts between words signalled by pitch differences
- May be lexical, as in Mandarin
- May be grammatical, as in African languages
INTONATION
- The rise and fall of the voice during speaking
- Used to emphasize focus, new information, relationships between words, finality, and the segmentation of sentences into groups of syllables
LINEAR PREDICTIVE CODING SYNTHESIS
- Used for encoding, transmitting and decoding a digital signal by reducing redundant information
- LPC analysis: estimate the vocal tract resonances from the signal waveform and remove their effect from the speech signal (inverse filtering) to get the residue / source signal
- LPC synthesis: the inverse process of LPC analysis
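The relationship between LPC analysis (inverse filtering to a residual) and LPC synthesis (running the residual back through the all-pole filter) can be checked numerically. The sketch below assumes the autocorrelation method with the Levinson-Durbin recursion, a prediction order of 4, and a synthetic noise-driven resonance standing in for speech. All of these are choices made for illustration only.

```python
import random


def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n + k], for k = 0..max_lag."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Coefficients of A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p from autocorrelations."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_analysis(x, a):
    """Inverse filtering: e[n] = x[n] + sum over k >= 1 of a[k] * x[n - k]."""
    p = len(a) - 1
    return [sum(a[k] * x[n - k] for k in range(p + 1) if n - k >= 0) for n in range(len(x))]


def lpc_synthesis(e, a):
    """Inverse of analysis: x[n] = e[n] - sum over k >= 1 of a[k] * x[n - k]."""
    p = len(a) - 1
    x = []
    for n in range(len(e)):
        x.append(e[n] - sum(a[k] * x[n - k] for k in range(1, p + 1) if n - k >= 0))
    return x


# Synthetic stand-in for speech: a resonance driven by seeded Gaussian noise.
rng = random.Random(0)
signal = [0.0, 0.0]
for _ in range(200):
    signal.append(1.3 * signal[-1] - 0.8 * signal[-2] + rng.gauss(0.0, 1.0))
signal = signal[2:]

coeffs = levinson_durbin(autocorrelation(signal, 4), 4)
residual = lpc_analysis(signal, coeffs)
rebuilt = lpc_synthesis(residual, coeffs)

print(max(abs(u - v) for u, v in zip(signal, rebuilt)))  # tiny round-off error
```

The residual carries less energy than the original signal because the predictable (resonant) part has been removed, and feeding it back through the synthesis filter recovers the signal up to floating-point round-off.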
SYNTHESIS METHOD
Rule-based synthesis consists of:
1. Articulatory synthesis
- Generates speech by direct simulation of human speech production
- Needs a highly accurate model of the vocal tract and vocal cords
- Needs rules for handling the dynamics of articulator motion
- In practice, acquiring the data to determine the rules and models is very difficult
- Mimicking the human system can get very complex and computationally intractable
2. Formant synthesis
- Source-filter model
- Filter characterised by slowly varying formant frequencies
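The source-filter idea behind formant synthesis can be sketched with a single two-pole digital resonator excited by an impulse train. The resonator form, the formant at 500 Hz with 60 Hz bandwidth, the 100 Hz source and the 8 kHz sample rate are all assumptions for illustration. A usable synthesizer would combine several resonators whose frequencies vary slowly over time.

```python
import cmath
import math


def resonator_coefficients(formant_hz, bandwidth_hz, sample_rate):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2], with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * formant_hz * t)
    a = 1.0 - b - c
    return a, b, c


def apply_resonator(source, coeffs):
    """Filter the source signal through the two-pole resonator."""
    a, b, c = coeffs
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for x in source:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def gain(coeffs, freq_hz, sample_rate):
    """Magnitude of the resonator's frequency response at freq_hz."""
    a, b, c = coeffs
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate)
    return abs(a / (1.0 - b * z_inv - c * z_inv * z_inv))


sample_rate = 8000
coeffs = resonator_coefficients(formant_hz=500.0, bandwidth_hz=60.0, sample_rate=sample_rate)

# Source: impulse train with one pulse every 80 samples (100 Hz at 8 kHz).
source = [1.0 if n % 80 == 0 else 0.0 for n in range(800)]
voiced = apply_resonator(source, coeffs)

for freq in (300.0, 500.0, 700.0):
    print(freq, round(gain(coeffs, freq, sample_rate), 2))
```

The printed gains peak near 500 Hz, which is the sense in which the filter is characterised by its formant frequency while the source sets the pitch.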
Corpus-based synthesis consists of:
1. Concatenative synthesis & unit selection-based synthesis
- Makes use of segments of recorded speech from a pre-recorded inventory
- Issues to address:
(What type of speech segment? How to design the acoustic inventory? How to select the best string of speech segments, given a phonetic string and its prosody? How to alter the prosody of a speech segment to best match the desired output prosody?)
2. HMM synthesis
DECODING PROCESS
- To choose the optimal string of units for a given phonetic string, i.e. the one that best matches the desired prosody
- For unit selection, an objective function is used
- Quality of the unit string is dominated by
1. Spectral discontinuities at unit boundaries
2. Pitch discontinuities at unit boundaries
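The decoding step can be sketched as a dynamic-programming (Viterbi) search over candidate units. The objective function below is an assumed one: a target cost (distance from the desired pitch) plus a join cost (pitch and spectral discontinuity at the boundary). The `Unit` class, the cost definitions, the single-number "spectrum" and all values are hypothetical and serve only to show the mechanics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    f0: float        # pitch of the unit in Hz
    spectrum: float  # one-number stand-in for a spectral feature vector


def target_cost(unit, desired_f0):
    """How far the unit's pitch is from the desired prosody."""
    return abs(unit.f0 - desired_f0)


def join_cost(left, right):
    """Pitch plus spectral discontinuity at the boundary between two units."""
    return abs(left.f0 - right.f0) + abs(left.spectrum - right.spectrum)


def select_units(candidates, desired_f0s):
    """Return (total cost, unit names) of the cheapest path through the candidates."""
    best = [(target_cost(u, desired_f0s[0]), [u.name]) for u in candidates[0]]
    for position in range(1, len(candidates)):
        new_best = []
        for unit in candidates[position]:
            cost, path = min(
                ((prev_cost + join_cost(prev_unit, unit), prev_path)
                 for (prev_cost, prev_path), prev_unit in zip(best, candidates[position - 1])),
                key=lambda item: item[0],
            )
            new_best.append((cost + target_cost(unit, desired_f0s[position]), path + [unit.name]))
        best = new_best
    return min(best, key=lambda item: item[0])


candidates = [
    [Unit("A1", 100.0, 1.0), Unit("A2", 105.0, 4.0)],
    [Unit("B1", 110.0, 5.0), Unit("B2", 108.0, 1.0)],
    [Unit("C1", 120.0, 1.0), Unit("C2", 115.0, 5.0)],
]
desired_f0s = [100.0, 110.0, 120.0]

total_cost, path = select_units(candidates, desired_f0s)
print(total_cost, path)  # 21.0 ['A2', 'B1', 'C2']
```

Picking the unit with the best pitch match at each position (A1, B1, C1) would cost 28.0 here because of the large discontinuities at the joins, so the search prefers units that match the target slightly worse but join more smoothly.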
EVALUATION OF TTS SYSTEM
Black Box Evaluation: Evaluate the system in the context of a real-world application.
Glass Box Evaluation: Test the different components of the TTS system.
Human Evaluation
Automated Evaluation
Evaluating Intelligibility: Word Recognition Test and Modified Rhyme Test
Evaluating Naturalness: Measured by Mean Opinion Score (MOS)
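A Mean Opinion Score is the arithmetic mean of listener ratings. The sketch below assumes the common five-point scale (1 = bad, 5 = excellent), which these notes do not state explicitly, and uses made-up ratings.

```python
def mean_opinion_score(ratings):
    """Average listener ratings, assuming a 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)


listener_ratings = [4, 5, 3, 4, 4]  # made-up ratings for one synthesized utterance
print(mean_opinion_score(listener_ratings))  # 4.0
```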
