Production and Classification of Speech Signals

Speech Production Process
Anatomy and Physiology of Speech production
Spectrographic Analysis of Speech
Categorization of Speech Sounds
Discrete-time Model of Speech Signals

Textbook: T. F. Quatieri. Discrete-Time Speech Signal Processing: Principles and Practice. PHI, 2002. (Chapter 3)

1. Speech production process

The sound sources are idealized as periodic, impulsive, or white noise and
can occur in the larynx or vocal tract.

Fig. 3.1 Simplified speech production model.


Speech Organs

The lungs act as a power supply and provide airflow to the larynx.
The larynx modulates the airflow and provides either a periodic puff-like or
a noisy airflow source to the vocal tract.

The vocal tract consists of oral, nasal, and pharynx cavities, giving the
modulated airflow its color by spectrally shaping the source.

The variation of air pressure at the lips results in a traveling sound wave
that the listener perceives as speech.

There are three general categories of speech sound source: periodic, noisy,
and impulsive, though combinations are often present.

Combining these sources with different vocal tract configurations gives
more refined speech sound classes, referred to as phonemes. The study of
phonemes is called phonemics.

The study of sound variations of phonemes that lead to the same meaning
is called phonetics.

Phonemes are the basic building blocks of a language. They are
concatenated according to certain phonemic and grammatical rules, the
study of which is called linguistics.

2. Anatomy and Physiology of Speech Production

Fig. 3.2 Cross-sectional view of the anatomy of speech production.

During speaking, we take in short spurts of air and release them steadily.
We override our rhythmic breathing by making the duration of exhaling
roughly equal to the length of a sentence or phrase, during which the lung
air pressure is maintained at an approximately constant level.

2.1 The Larynx

The larynx is a complicated system of cartilages, muscles, and ligaments,
whose primary purpose in speech production is to control the vocal cords
or vocal folds.

The vocal folds are two masses of flesh, ligament, and muscle, which
stretch between the front and back of the larynx.

Fig. 3.3 Sketches of downward-looking view of human larynx (a) voicing; (b)

The Glottis

The glottis is the slit-like orifice between the two folds.

The folds are fixed at the front of the larynx. They are free to move at the
back and sides of the larynx.

The size of the glottis is controlled in part by the arytenoid cartilages and
in part by muscles within the folds.

The tension of the folds is controlled by muscle within the folds, as well as
the cartilage around the folds.

The vocal folds, as well as the epiglottis, close during eating, and open
during breathing.

2.1.1 Voicing - How is a vowel produced?

During a vowel (e.g. a, e, i, ...), the arytenoid cartilages move toward
each other (Fig. 3.3a). The vocal folds tense up and are brought close
together. This partial closing of the glottis and increased fold tension
cause self-sustained oscillations of the folds.

Fig. 3.4 Bernoulli's principle in the glottis.


Suppose the vocal folds begin in a loose and open state.

The contraction of the lungs results in air flowing through the glottis.
The increase in tension of the folds, together with the decrease in
pressure at the glottis due to Bernoulli's principle, causes the vocal folds
to close abruptly.
Air pressure then builds behind the vocal folds as the lungs continue to
contract, forcing the folds to open.
The entire process then repeats, and the result is periodic puffs of air
that enter the vocal tract.
Both horizontal and vertical movement of the folds may occur.

The Glottal Airflow (Glottal flow)

If we measure the airflow velocity at the glottis as a function of time, we
would obtain the following waveform.

Fig. 3.6 Illustration of periodic glottal airflow velocity.

The time interval during which the vocal folds are closed, and no flow
occurs, is referred to as the glottal closed phase.

The time interval over which there is nonzero flow and up to the maximum
of the airflow velocity is referred to as the glottal open phase, and the time
interval from the airflow maximum to the time of glottal closure is referred
to as the return phase.

The time duration of one glottal cycle is referred to as the pitch period and
the reciprocal of the pitch period is the corresponding pitch, also referred
to as the fundamental frequency.
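As a quick numerical illustration of the reciprocal relation between pitch period and pitch, the sketch below converts between the two for a sampled signal. The sampling rate of 8000 Hz and the function names are assumptions made for this example only.

```python
def pitch_hz(period_in_samples, fs_hz):
    """Pitch (fundamental frequency) in Hz for a pitch period given in samples."""
    return fs_hz / period_in_samples


def pitch_period(pitch, fs_hz):
    """Pitch period in samples, rounded to the nearest integer."""
    return round(fs_hz / pitch)


fs = 8000  # assumed sampling rate in Hz
for f0 in (60, 100, 200, 400):
    P = pitch_period(f0, fs)
    print(f"{f0} Hz -> P = {P} samples -> {pitch_hz(P, fs):.1f} Hz")
```

Because the period is rounded to a whole number of samples, the pitch recovered from it can differ slightly from the requested value.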

In conversational speech, during vowel sounds, we might typically see one
to four pitch periods over the duration of the sound, although the number
of pitch periods changes with numerous factors such as stress and
speaking rate.

The pitch range is about 60 Hz to 400 Hz. Typically, males have lower pitch
than females because their vocal folds are longer and more massive.


Simple mathematical model of the glottal flow

A simple model of the glottal flow above is given by the convolution of a
periodic impulse train with the glottal flow over one cycle:

$$u[n] = g[n] * p[n],$$

where $g[n]$ is the glottal flow waveform over a single cycle and
$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$ is an impulse train with spacing $P$.
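The convolution model can be sketched in a few lines. This is a minimal illustration, assuming a made-up single-cycle pulse `g`, a pitch period of 10 samples, and a one-sided impulse train starting at n = 0; none of these values come from the text.

```python
def impulse_train(length, period):
    """p[n]: unit impulses at n = 0, P, 2P, ... (one-sided for simplicity)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


# Hypothetical single-cycle glottal pulse g[n]: slow rise, faster fall.
g = [0.0, 0.2, 0.5, 0.8, 1.0, 0.6, 0.1]
P = 10                          # assumed pitch period in samples
p = impulse_train(40, P)
u = convolve(g, p)[:40]         # u[n] = g[n] * p[n]
print(u[:20])
```

Since the pulse is shorter than the pitch period, each cycle of `u` is a copy of `g` followed by zeros, which plays the role of the closed phase.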

Fig. 3.7 Illustration of periodic glottal flow: (a) typical glottal flow; (b) same as
(a) with lower pitch; (c) same as (a) with softer glottal flow.

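To model the finite-length speech waveform, we multiply by an analysis window
$w[n, \tau]$, centered at time $\tau$, to give

$$u[n, \tau] = w[n, \tau]\,(g[n] * p[n]).$$

Using the multiplication and convolution theorems, the discrete-time Fourier
transform (the frequency-domain description) of $u[n, \tau]$ is

$$U(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big] = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau),$$

where $\circledast$ denotes convolution in frequency, $W(\omega, \tau)$ is the Fourier
transform of $w[n, \tau]$, $G(\omega)$ is the Fourier transform of $g[n]$,
$\omega_k = \frac{2\pi}{P} k$, and $\frac{2\pi}{P}$ is the fundamental frequency or pitch.

As the pitch period $P$ decreases, the spacing between the frequencies
$\omega_k = \frac{2\pi}{P} k$ increases (cf. Fig. 3.7b).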
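The dependence of the harmonic spacing on the pitch period can be checked numerically. The sketch below is an illustration under simplifying assumptions: a rectangular window spanning a whole number of pitch periods (so each harmonic falls on exactly one DFT bin), a made-up pulse `g`, and a direct DFT. With an N-point DFT the harmonics at 2πk/P land on bins that are multiples of N/P.

```python
import cmath


def dft_mag(x):
    """Magnitude of the DFT of x for bins 0 .. N/2."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N)))
            for k in range(N // 2 + 1)]


def periodic_glottal_flow(g, period, length):
    """u[n] = g[n] * p[n] for len(g) <= period: g repeated every `period` samples."""
    cycle = g + [0.0] * (period - len(g))
    return [cycle[n % period] for n in range(length)]


def harmonic_bins(g, period, length, rel_threshold=1e-6):
    """DFT bins whose magnitude is not negligible relative to the largest bin."""
    mags = dft_mag(periodic_glottal_flow(g, period, length))
    peak = max(mags)
    return [k for k, m in enumerate(mags) if m > rel_threshold * peak]


g = [0.0, 0.2, 0.5, 0.8, 1.0, 0.6, 0.1]   # hypothetical glottal pulse
N = 120                                   # analysis length in samples
print("P = 20:", harmonic_bins(g, 20, N)[:4])
print("P = 10:", harmonic_bins(g, 10, N)[:4])
```

Halving the pitch period doubles the bin spacing of the harmonics. With a tapered window such as a Hamming window, each harmonic would instead be smeared over a few neighboring bins by the window transform.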


The Fourier transform of the periodic glottal waveform is characterized by
harmonics.

Typically, the spectral envelope of the harmonics, governed by $G(\omega)$, has
on average a -12 dB/octave rolloff.

In more relaxed voicing (Fig. 3.7c), the vocal folds do not close as abruptly,
and the glottal waveform has more rounded corners, with an average -15 dB/octave rolloff.

Pitch jitter (variation of the pitch period) and amplitude shimmer (variation
of the glottal flow between cycles) can occur. These phenomena help give the
vowel its naturalness, in contrast to a machine-like sound.

The extent and form of jitter and shimmer can contribute to voice

A high degree of jitter results in a voice with a hoarse quality, which can be
characteristic of a particular speaker or can be created under specific
speaking conditions, such as stress and fear.
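Jitter and shimmer can be imitated by perturbing the impulse train that drives the glottal model. The sketch below is one possible illustration, not a measured model: the uniform distributions and the parameter names `jitter` and `shimmer` (maximum relative deviations) are assumptions.

```python
import random


def jittered_impulse_train(n_cycles, period, jitter=0.0, shimmer=0.0, seed=0):
    """Pulse positions and amplitudes with randomly perturbed period and amplitude.

    jitter  : maximum relative deviation of each pitch period (assumed uniform)
    shimmer : maximum relative deviation of each pulse amplitude (assumed uniform)
    """
    rng = random.Random(seed)
    positions, amplitudes = [], []
    t = 0
    for _ in range(n_cycles):
        positions.append(t)
        amplitudes.append(1.0 + shimmer * rng.uniform(-1.0, 1.0))
        t += max(1, round(period * (1.0 + jitter * rng.uniform(-1.0, 1.0))))
    return positions, amplitudes


pos, amp = jittered_impulse_train(8, period=80, jitter=0.03, shimmer=0.1)
periods = [b - a for a, b in zip(pos, pos[1:])]
print("pitch periods:", periods)
print("amplitudes   :", [round(a, 3) for a in amp])
```

Setting both parameters to zero recovers the strictly periodic train, which corresponds to the machine-like sound mentioned above.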

2.1.2 Unvoicing?

In the unvoiced state, the folds are closer together and more tense than in
the breathing state, thus allowing turbulence to be generated at the folds.

Turbulence at the vocal folds is called aspiration (e.g. h in he). These
sounds are sometimes called whispered sounds.

In certain voice types, aspiration normally occurs simultaneously with
voicing, resulting in a breathy voice.

Fig. 3.8 Sketches of various vocal fold configurations.


There are other forms of vocal fold movement that do not fall clearly into
any of the three states of breathing, voicing and unvoicing.
In vocal fry (Fig. 3.9.a), the folds are massy and relaxed with an abnormally
low and irregular pitch, which is characterized by secondary glottal pulses.
In diplophonia (Fig. 3.9.b), secondary glottal pulses occur between primary
pulses but within the closed phase.

Fig. 3.9 Illustration of secondary-pulse glottal flow.


2.2 The Vocal Tract

The vocal tract is comprised of the oral cavity from the larynx to the lips
and the nasal passage, which is coupled to the oral tract by way of the velum.

The vocal tract spectrally colors the source, which is important for
making perceptually distinct speech sounds.

Fig. 3.10. Illustration of changing vocal tract shapes.


2.2.1 Spectral shaping

Under certain conditions, the relationship between a glottal airflow
velocity input and the vocal tract airflow velocity output can be approximated
by a linear filter with resonances, much like the resonances of organ pipes
and wind instruments.

The resonance frequencies of the vocal tract are called formant
frequencies or simply formants.

Formants change with different vocal tract configurations.

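When the vocal tract is modeled as a time-invariant all-pole linear system,
a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract
formant at frequency $\omega = \omega_0$. Because the vocal tract is assumed stable,
the poles lie inside the unit circle, and the vocal tract transfer function can be
written in product or partial-fraction form as

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^* z^{-1})} = \sum_{k=1}^{N_i} \frac{\tilde{A}_k}{(1 - c_k z^{-1})(1 - c_k^* z^{-1})},$$

where the poles occur in complex-conjugate pairs $c_k, c_k^*$ with $|c_k| < 1$,
$A$ is a gain, and the $\tilde{A}_k$ denote the numerators of the partial-fraction terms.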
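To see how conjugate pole pairs produce formant peaks, the sketch below builds the denominator of an all-pole filter from a list of pole pairs and evaluates the magnitude response on the unit circle. The sampling rate, formant frequencies, and pole radii are illustrative assumptions, and the gain is simply set to 1.

```python
import cmath
import math


def pole_pair_section(freq_hz, radius, fs_hz):
    """Coefficients [1, a1, a2] of (1 - c z^-1)(1 - c* z^-1), with c = r e^{j w0}."""
    w0 = 2.0 * math.pi * freq_hz / fs_hz
    return [1.0, -2.0 * radius * math.cos(w0), radius * radius]


def poly_mul(a, b):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def all_pole_response(den, freq_hz, fs_hz, gain=1.0):
    """|H(e^{jw})| for H(z) = gain / sum_k den[k] z^-k."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    d = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(den))
    return abs(gain / d)


fs = 8000.0
formants = [(500.0, 0.97), (1500.0, 0.96), (2500.0, 0.95)]  # assumed (Hz, pole radius)
den = [1.0]
for f, r in formants:
    den = poly_mul(den, pole_pair_section(f, r, fs))

for f in (300, 500, 1000, 1500, 2000, 2500):
    level_db = 20 * math.log10(all_pole_response(den, f, fs))
    print(f"{f:5d} Hz: {level_db:6.1f} dB")
```

The response is larger at the pole angles than between them, and moving a pole radius closer to 1 narrows and sharpens the corresponding peak.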



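2.2.2 Fourier Transform of speech signals after going through the vocal tract

Assume a periodic glottal flow source of the form

$$u[n] = g[n] * p[n].$$

The vocal tract output after passing $u[n]$ through an LTI vocal tract with
impulse response $h[n]$ is

$$x[n] = h[n] * (g[n] * p[n]).$$

The windowed version of $x[n]$ is

$$x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}.$$

Using the convolution and multiplication theorems, one gets

$$x[n, \tau] \leftrightarrow X(\omega, \tau) = \frac{1}{P}\, W(\omega, \tau) \circledast \Big[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big],$$

where $\circledast$ denotes convolution in frequency and $\omega_k = \frac{2\pi}{P} k$.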



Fig. 3.11 Illustration of relation of glottal source harmonics $\omega_1, \omega_2, \ldots, \omega_N$, vocal
tract formants $F_1, F_2, \ldots, F_M$, and the spectral envelope $|H(\omega)G(\omega)|$.

Generally, the frequencies of the formants decrease as the vocal tract
length increases.

The spectral envelope is given by $|H(\omega)G(\omega)|$, consisting of a glottal and
a vocal tract contribution.

The peaks in the spectral envelope correspond to the vocal tract formant
frequencies $F_1, F_2, \ldots, F_M$.

The formants correspond to the vocal tract poles, while the harmonics arise
from the periodicity of the glottal source.
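The interplay of harmonics and envelope can be illustrated by sampling an envelope |H(ω)G(ω)| at multiples of the pitch. The sketch below assumes a single formant pole pair for H and a two-pole glottal term G(z) = 1/(1 - βz)^2; the formant frequency, pole radius, β, and sampling rate are all assumed values chosen for illustration.

```python
import cmath
import math


def envelope(freq_hz, fs_hz, formant_hz=500.0, radius=0.97, beta=0.9):
    """|H(w) G(w)| for a one-formant all-pole H and a two-pole glottal G."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    z = cmath.exp(1j * w)
    c = radius * cmath.exp(2j * math.pi * formant_hz / fs_hz)  # formant pole
    H = 1.0 / ((1 - c / z) * (1 - c.conjugate() / z))
    G = 1.0 / (1 - beta * z) ** 2
    return abs(H * G)


def harmonic_amplitudes(pitch_hz, fs_hz, n_harmonics):
    """Envelope sampled at the harmonics k * pitch, for k = 1 .. n_harmonics."""
    return [envelope(k * pitch_hz, fs_hz) for k in range(1, n_harmonics + 1)]


fs = 8000.0
amps = harmonic_amplitudes(100.0, fs, 8)   # assumed pitch of 100 Hz
for k, a in enumerate(amps, start=1):
    print(f"harmonic {k} at {100 * k:4d} Hz: {20 * math.log10(a):6.1f} dB")
```

With a 100 Hz pitch the fifth harmonic sits on the assumed 500 Hz formant and stands out from its neighbors, while the glottal term tilts the overall envelope toward low frequencies.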

Fig. 3.12 Illustration of formant movement to enhance the singing voice: (a) first harmonic higher than first formant frequency; (b) first harmonic matched to first formant frequency.

2.2.3 Categorization of sound by source

Different vocal tract configurations/shapes, such as those in Fig. 3.10, can
also generate different sound sources. E.g. a complete closure of the
tract by pressing the tongue against the palate generates an impulsive source.

Speech sounds generated with a periodic glottal source are termed
voiced. Sounds not so generated are called unvoiced.

There are a variety of unvoiced sounds, including those created with a
noise source at an oral tract constriction. Because they come from the
friction of the moving air against the constriction, these sounds are called
fricatives, e.g. th in the word thin.

A second unvoiced sound class is plosives, created with an impulsive
source within the oral tract (Fig. 3.10b), e.g. t in top.

Vocal fold vibration can occur simultaneously with impulsive or noisy
sources, e.g. z in zebra. Such sounds are called voiced fricatives.

There also exist voiced plosives, e.g. b in the word boat.

Fig. 3.13 Examples of voiced, fricative, and plosive sounds in the sentence
"Which tea party did Baker go to?": (a) speech waveform; (b)-(d) magnified
voiced, fricative, and plosive sounds from (a).


3 Spectrographic analysis of speech

The Fourier transform of the windowed speech waveform, i.e. the short-time Fourier transform (STFT), is given by

$$X(\omega, \tau) = \sum_{n=-\infty}^{\infty} x[n, \tau] \exp(-j\omega n),$$

where $x[n, \tau] = w[n, \tau]\,x[n]$ is the windowed speech segment as a
function of the window center at time $\tau$.

The spectrogram is a graphical display of the magnitude of the time-varying spectral characteristics and is given by

$$S(\omega, \tau) = |X(\omega, \tau)|^2,$$

which is a measure of the energy of the frequency component at
frequency $\omega$ in the neighborhood of time $\tau$.
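A direct, unoptimized implementation makes the definitions concrete. The sketch below is an illustration under stated assumptions: frequency is sampled on the DFT grid 2πk/L for a window of L samples, samples outside the signal are treated as zero, and the DFT phase is referenced to the start of the window rather than to absolute time, which changes only the phase and not the spectrogram values. The Hamming window and the test tone are also assumptions.

```python
import cmath
import math


def hamming(length):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]


def stft_frame(x, center, window):
    """DFT of the segment of x under a window centered at sample `center`."""
    L = len(window)
    start = center - L // 2
    seg = [(x[start + n] if 0 <= start + n < len(x) else 0.0) * window[n]
           for n in range(L)]
    return [sum(seg[n] * cmath.exp(-2j * math.pi * k * n / L) for n in range(L))
            for k in range(L // 2 + 1)]


def spectrogram(x, window, hop):
    """Squared STFT magnitude on a grid of window centers spaced `hop` samples apart."""
    return [[abs(c) ** 2 for c in stft_frame(x, tau, window)]
            for tau in range(0, len(x), hop)]


fs = 8000                                   # assumed sampling rate in Hz
x = [math.cos(2 * math.pi * 1000 * n / fs) for n in range(256)]  # 1 kHz test tone
S = spectrogram(x, hamming(64), hop=64)
frame = S[2]                                # window centered at n = 128
peak_bin = frame.index(max(frame))
print("peak bin:", peak_bin, "->", peak_bin * fs / 64, "Hz")
```

In practice an FFT would replace the explicit DFT sum; the loop form is kept here only to mirror the defining equation.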

Fig. 3.14 Formation of (a) the narrowband and (b) the wideband spectrograms.

The figure shows two types of spectrograms: narrowband (good spectral
resolution with a long window, e.g. 20 ms) and wideband (good time
resolution with a short window, e.g. a 4 ms Hamming window).

With voiced sources, the narrowband spectrogram shows horizontal
striations at the harmonic components. The wideband spectrogram instead
resolves the individual pitch periods, which appear as vertical striations,
but it cannot resolve the harmonics because of its poor frequency resolution.
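One way to reason about the two window lengths is to count how many pitch periods fit under the window. The sketch below uses a rule of thumb that is an assumption of this example rather than a statement from the text: a window covering at least two pitch periods behaves like a narrowband analysis, and a window shorter than one pitch period behaves like a wideband analysis.

```python
def window_summary(window_ms, pitch_hz, fs_hz):
    """Window length in samples, pitch periods covered, and a rough classification."""
    length = round(window_ms * fs_hz / 1000)   # window length in samples
    periods = window_ms * pitch_hz / 1000      # pitch periods under the window
    if periods >= 2:
        kind = "narrowband"     # harmonics can be separated (assumed threshold)
    elif periods < 1:
        kind = "wideband"       # individual glottal pulses can be separated
    else:
        kind = "intermediate"
    return {"samples": length, "pitch_periods": periods, "kind": kind}


fs = 8000       # assumed sampling rate in Hz
pitch = 100     # assumed pitch in Hz
for window_ms in (20, 4):
    print(window_ms, "ms:", window_summary(window_ms, pitch, fs))
```

The same window can therefore act as narrowband for a high-pitched voice and closer to wideband for a low-pitched one.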

With regard to fricatives, the STFT magnitude of noise sounds is often

called the periodogram, because of the random wiggles of the spectral

With plosive sounds, the wideband spectrogram is preferred because it
gives better temporal resolution of the sound's components.


Fig. 3.15 Comparison of measured spectrograms for the utterance "Which tea
party did Baker go to?": (a) speech waveform; (b) wideband spectrogram (better
time resolution, insufficient frequency resolution); (c) narrowband spectrogram
(fundamental and harmonics).

4 Categorization of Speech Sounds

Speech sounds can be categorized by the sound source, which can be
created either with the vocal folds or with a constriction in the vocal tract.

On the other hand, the time-varying spectral characteristics of speech can
be studied using the spectrogram.

Here we give more examples and classify sounds from the following
perspectives:
1. The nature of the source: periodic, noisy, or impulsive, and combinations of the three.
2. The shape of the vocal tract, e.g. the place of the tongue hump and the degree of
constriction at the hump (the place and manner of articulation, respectively).
3. The time-domain waveform, which gives the pressure change with time at the
lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

4.1 Elements of a Language (terminology)

A phoneme is a fundamental distinctive unit of a language, in that it is a
speech sound class that differentiates words of the language. E.g. the
phonemes c, b, and h give distinctive sounds, and hence meanings, to the
words cat, bat, and hat.
A particular instantiation of a phoneme is called a phone, and the study of
these sound variations is called phonetics.
Different languages contain different phoneme sets.
Syllables contain one or more phonemes, while words are formed from one
or more syllables and are concatenated to form phrases and sentences.
Linguistics is the study of the arrangement of speech sounds, i.e.
phonemes, according to the rules of a language.


The use of perspectives 1 and 2 in the study of phonemes is called articulatory
phonetics, because phonemes arise from a combination of vocal fold and
vocal tract articulatory features.
The use of the last two is called acoustic phonetics, because it is concerned
with the generation of the speech sounds from an acoustic point of view.
The variants of sounds, or phones, that convey the same phoneme are
called the allophones of the phoneme. E.g. the t in butter, but, and
The articulatory properties are influenced by adjacent phonemes and other


In English, the combinations of features give 40 phonemes as follows:

Fig. 3.17 Phonemes in American English.


Fig. 3.18 Vocal tract profiles for vowels in American English.


Fig. 3.19 Waveform, wideband spectrogram, and spectral slice of narrowband
spectrogram for two vowels: (a) /i/ as in eve; (b) /a/ as in father.

The second large phoneme grouping is that of consonants. The consonants
contain a number of subgroups: nasals, fricatives, plosives, whispers and
Fig. 3.20. Vocal tract configurations for nasal consonants.

Fig. 3.21 Wideband spectrograms of nasal consonants: (a) /n/ in no and (b)
/m/ in mo. Figure annotations: broad bandwidth, low frequency.

The source is quasi-periodic airflow puffs from the vibrating vocal folds.
The velum is lowered and the air flows mainly through the nasal cavity, the
oral tract being constricted; thus sound is radiated at the nostrils. E.g. /m/
in mo (oral tract constricted at the lips) and /n/ in no (constriction made
with the tongue against the gum ridge).
The spectrum of a nasal is dominated by the low resonance of the large
volume of the nasal cavity, which also has a large bandwidth because of
the viscous losses of airflow over the complexly configured surface.
The closed oral cavity has its own resonances and absorbs acoustic
energy. These anti-resonances can be modeled as zeros of the vocal tract
transfer function.
In nasalization of vowels, the velum is partially open. The speech sound is
primarily due to the sound at the lips and not the sound at the nose output.
Vowels adjacent to nasal consonants tend to be nasalized.

Fig. 3.22 Vocal tract configurations for pairs of voiced and unvoiced fricatives.

There are two classes of fricative consonants: voiced and unvoiced.

The location of the constriction by the tongue at the back, center, or front
of the oral tract, as well as at the teeth or lips, influences which fricative
sound is produced.

The transfer function consists primarily of high-frequency resonances, which
change with the location of the constriction.

Unvoiced fricatives are characterized by a noisy spectrum, while voiced
fricatives often show both noise and harmonics. E.g. with /S/, the frication
occurs at the palate, and with /f/ at the lips.


Fig. 3.23 Waveform, wideband spectrogram, and narrowband spectral slice of
voiced and unvoiced fricative pair: (a) /v/ as in vote; (b) /f/ as in for.


A simple model for a voiced fricative is

$$x[n] = x_g[n] + x_q[n],$$

where

$x_g[n] = h[n] * (g[n] * p[n])$ is the voiced component,

$x_q[n] = h_f[n] * (q[n]\,u[n])$ is the unvoiced component,

$q[n]$ is a (white) noise component, and $h_f[n]$ is the impulse response of the
front oral cavity.

Since frication is approximately synchronized with the airflow velocity, the
noise source is assumed to be modulated by the glottal waveform $u[n]$.
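The two-component model can be sketched directly from its definition. This is a minimal illustration, assuming short made-up FIR responses for the vocal tract and front cavity, a made-up pulse `g`, Gaussian white noise for q[n], and a one-sided impulse train; the `noise_gain` parameter is also an assumption.

```python
import random


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def voiced_fricative(g, period, h, h_f, length, noise_gain=0.3, seed=0):
    """Return (x, x_g, x_q) with x = x_g + x_q."""
    rng = random.Random(seed)
    p = [1.0 if n % period == 0 else 0.0 for n in range(length)]
    u = convolve(g, p)[:length]                       # glottal flow u[n] = g[n] * p[n]
    q = [noise_gain * rng.gauss(0.0, 1.0) for _ in range(length)]
    modulated = [qn * un for qn, un in zip(q, u)]     # noise modulated by u[n]
    x_g = convolve(h, u)[:length]                     # voiced component h * (g * p)
    x_q = convolve(h_f, modulated)[:length]           # unvoiced component h_f * (q u)
    return [a + b for a, b in zip(x_g, x_q)], x_g, x_q


g = [0.0, 0.5, 1.0, 0.5]          # hypothetical glottal pulse
x, x_g, x_q = voiced_fricative(g, period=8, h=[1.0, 0.5], h_f=[1.0], length=32)
print([round(v, 3) for v in x[:16]])
```

With the trivial front-cavity response used here, the noise component vanishes whenever the glottal flow is zero, which is the synchronization the model is meant to capture.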


Fig. 3.24 Vocal tract configurations for unvoiced and voiced plosive pairs.

Plosives can be both voiced and unvoiced.
The voice onset time is the difference between the time of the burst and
the onset of voicing in the following vowel. The length of the voice onset
time and the place of constriction vary with the plosive consonant.

Fig. 3.25 A schematic representation of (a) unvoiced and (b) voiced plosives.
The voice onset time is denoted by VOT.


Fig. 3.26 Waveform, wideband spectrogram, and narrowband spectral slice of
voiced and unvoiced plosive pair: (a) /g/ as in go; (b) /k/ as in key.


In voiced plosives, although the oral tract is closed, we hear a low-frequency vibration, called the voice bar, due to the propagation of the
vibration at the vocal folds through the walls of the throat. Unlike unvoiced
plosives, there is little aspiration.
A simple model for a voiced plosive is

$$x[n] = h[n, m] * u[n] + h_f[n, m] * \delta[n] = \sum_{m=-\infty}^{\infty} h[n, m]\,u[n - m] + \sum_{m=-\infty}^{\infty} h_f[n, m]\,\delta[n - m].$$

Due to the changing vocal tract shape during the transition from the burst
to a following steady vowel, $h$ and $h_f$ are assumed to be linear but time-varying.
The burst is modeled as an impulse, which is assumed to occur at time $n = 0$.
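The time-varying sums can be evaluated directly when the responses are causal and of limited duration. The sketch below is an illustration only: the responses `h_vowel` and `h_front` are hypothetical exponentials whose decay drifts with n, the lag is truncated at `max_lag`, and the voicing pattern is made up.

```python
def time_varying_filter(h, u, max_lag):
    """y[n] = sum_{m=0}^{max_lag} h(n, m) * u[n - m], with u[n] = 0 for n < 0."""
    return [sum(h(n, m) * u[n - m] for m in range(min(n, max_lag) + 1))
            for n in range(len(u))]


def h_vowel(n, m):
    """Hypothetical vocal tract response at time n to an input m samples earlier."""
    a = 0.5 + 0.4 * min(n, 20) / 20.0      # decay drifts as the tract shape changes
    return a ** m


def h_front(n, m):
    """Hypothetical front-cavity response, also slowly varying."""
    a = 0.3 + 0.2 * min(n, 20) / 20.0
    return a ** m


N = 40
u = [1.0 if n % 10 == 0 and n >= 10 else 0.0 for n in range(N)]  # voicing after the burst
burst = [1.0] + [0.0] * (N - 1)                                   # impulse at n = 0
voiced_part = time_varying_filter(h_vowel, u, 15)
burst_part = time_varying_filter(h_front, burst, 15)
x = [a + b for a, b in zip(voiced_part, burst_part)]
print([round(v, 3) for v in x[:12]])
```

Because the burst is an impulse at n = 0, its contribution at time n reduces to the single term with lag m = n, which is why `burst_part` dies out once n exceeds `max_lag`.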


5 Discrete-time Model of Speech Signals

It is possible to study the airflow velocity, pressure, etc. by solving a set of
partial differential equations describing the underlying acoustic phenomena. A
commonly used model is the concatenated tube model:

Fig. 4.14 Concatenated tube model. The k-th tube has cross-sectional area Ak
and length lk.
Due to time limitation, we shall not go into these acoustic models.


If the airflows are modeled as signal flows, then the above tube model can
be approximated by the following signal flow graph:

Fig. 4.16 Signal flow graphs of (a) two concatenated tubes; (b) lip boundary
condition; (c) glottal boundary condition.

Fig. 4.18 Signal flow graph conversion to discrete time of (a) lossless two-tube
model; (b) discrete-time version of (a); (c) conversion of (b) with single-sample delays.

The discrete-time speech production model for periodic, noise, and
impulsive sound sources is shown as follows:

Fig. 4.20 Overview of the complete discrete-time speech production model.

The voiced (periodic) speech is modeled by sending a periodic impulse train
having a period equal to the pitch period through G(z), V(z) and R(z).


G(z) is the z-transform of the glottal pulse, V(z) is the vocal tract
transfer function, and R(z) is the radiation loss at the lips (R(z) in
dotted line models the radiation loss at the glottis). An approximation of a
typical glottal flow waveform over one cycle is of the form

$$g[n] = (\beta^{-n} u[-n]) * (\beta^{-n} u[-n]),$$

where $u[n]$ here denotes the unit step. The z-transform is

$$G(z) = \frac{1}{(1 - \beta z)^2}, \quad \beta < 1.$$

Note that the poles are outside the unit circle.
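The closed form follows from a geometric series. As a short check, writing the decay constant as $\beta$ with $0 < \beta < 1$ (the symbol is an assumption here) and $u[n]$ for the unit step:

```latex
\begin{aligned}
\mathcal{Z}\{\beta^{-n} u[-n]\}
  &= \sum_{n=-\infty}^{0} \beta^{-n} z^{-n}
   = \sum_{m=0}^{\infty} (\beta z)^{m}
   = \frac{1}{1 - \beta z}, \qquad |z| < \frac{1}{\beta}, \\
G(z) &= \left( \frac{1}{1 - \beta z} \right)^{2}
      = \frac{1}{(1 - \beta z)^{2}} .
\end{aligned}
```

Because convolution in time corresponds to multiplication in the z-domain, the glottal term has a double pole at $z = 1/\beta$, which lies outside the unit circle. The region of convergence $|z| < 1/\beta$ still contains the unit circle, as expected for a left-sided sequence that decays toward negative time.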

R(z) is usually modeled as a single zero, $R(z) = 1 - \alpha z^{-1}$, with $\alpha < 1$.
V(z) is a stable all-pole filter.


To model fricative consonants, white noise is employed instead of the periodic
glottal pulses. It is still colored by the vocal tract, and there is a
similar radiation loss at the lips.
Similarly, to model plosives, an impulse at the appropriate onset time is used
to excite the vocal tract.
To model voiced plosives, for example, these three sources can be
combined either linearly or nonlinearly as the excitation to the vocal tract.
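The source selection described above can be sketched as a small synthesis routine. This is a simplified illustration rather than the full model: a short causal pulse `g` stands in for the glottal filter, the vocal tract is a single assumed resonance, the lip radiation is a first-order difference with an assumed coefficient `alpha`, and the radiation term at the glottis is omitted.

```python
import random


def excitation(kind, length, period=80, onset=0, seed=0):
    """Source signal: periodic impulses, white noise, or a single impulse."""
    if kind == "voiced":
        return [1.0 if n % period == 0 else 0.0 for n in range(length)]
    if kind == "fricative":
        rng = random.Random(seed)
        return [rng.gauss(0.0, 1.0) for _ in range(length)]
    if kind == "plosive":
        return [1.0 if n == onset else 0.0 for n in range(length)]
    raise ValueError(f"unknown source kind: {kind}")


def fir(x, b):
    """y[n] = sum_k b[k] x[n-k]; used for the glottal pulse and for the radiation zero."""
    return [sum(b[k] * x[n - k] for k in range(min(n, len(b) - 1) + 1))
            for n in range(len(x))]


def all_pole(x, a):
    """All-pole filter 1 / (a[0] + a[1] z^-1 + ...), with a[0] = 1."""
    y = []
    for n, xn in enumerate(x):
        feedback = sum(a[k] * y[n - k] for k in range(1, min(n, len(a) - 1) + 1))
        y.append(xn - feedback)
    return y


def synthesize(kind, length, a, g, alpha=0.95, **source_args):
    """Excitation -> (glottal shaping if voiced) -> vocal tract -> lip radiation."""
    e = excitation(kind, length, **source_args)
    if kind == "voiced":
        e = fir(e, g)                    # glottal shaping applies to the periodic source
    v = all_pole(e, a)                   # vocal tract
    return fir(v, [1.0, -alpha])         # lip radiation 1 - alpha z^-1


a = [1.0, -1.3, 0.9]                     # assumed stable resonance (pole radius about 0.95)
g = [0.2, 0.6, 1.0, 0.5, 0.1]            # hypothetical glottal pulse
for kind in ("voiced", "fricative", "plosive"):
    y = synthesize(kind, 200, a, g)
    print(kind, [round(v, 2) for v in y[:5]])
```

A voiced plosive could be approximated by adding the outputs for the "plosive" and "voiced" sources, which corresponds to the linear combination mentioned above.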


