
EEE 6211

Digital Speech Processing

Speech Perception
• What do speech perception and speech
understanding have to do with digital speech
processing methods? Why should we study speech
perception when our main interest is in how to process
speech to better code it, synthesize it, understand it by
machine, or process it in ways to increase intelligibility
or naturalness?
• The answer to these questions is naively simple yet
profound in impact; by achieving a good understanding
of how humans hear sounds and how humans perceive
speech, we are better able to design and implement
speech processing systems that are robust and
efficient for analyzing and representing speech signals.
Speech Chain
Anatomy of the ear
Function of the outer ear
• The function of the outer ear is to funnel as much
sound energy as possible into the 2 cm long ear
canal. Because of the spreading out of sound
energy by an inverse square law with distance, it
is essential that the receiver of the sound be as
large as possible so as to capture as much sound
energy as possible. The pinna, by design, is quite
large (as compared to the opening of the auditory
canal) and, as a result, it increases human hearing
sensitivity by a factor of between 2 and 3.
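The inverse square law mentioned above can be illustrated with a short calculation. This is a minimal sketch that assumes an idealized point source radiating into free space; the function name and the example distances are illustrative and do not come from the slides.

```python
import math

def intensity_drop_db(r1, r2):
    """Drop in sound intensity level (dB) when moving from distance r1 to r2.

    Assumes an ideal point source in free field, so intensity is
    proportional to 1 / r**2 (inverse square law).
    """
    if r1 <= 0 or r2 <= 0:
        raise ValueError("distances must be positive")
    # I1 / I2 = (r2 / r1)**2, expressed in decibels as 10 * log10(ratio)
    return 10.0 * math.log10((r2 / r1) ** 2)

# Doubling the distance from the source costs about 6 dB of intensity level.
print(round(intensity_drop_db(1.0, 2.0), 2))
```

Under these assumptions the sound energy reaching a receiver of fixed area shrinks quickly with distance, which is why a larger collecting surface such as the pinna helps.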
Function of the middle ear
• The middle ear is a mechanical transducer that converts
sound waves that impinge on the tympanic membrane to
mechanical vibrations along the inner ear.
• The mechanism for this transduction is the set of the three
smallest bones in the human body, namely the
hammer/malleus, the anvil/incus, and the stirrup/stapes.
• The movements of the tympanic membrane cause these
three bones to move in concert. Together they act as a
compound lever to amplify the sound vibrations, providing
force amplification of anywhere from a factor of 3 to a
factor of 15 from the eardrum to the stirrup, thereby
enabling humans to hear weak sounds.
• Muscles around these three tiny bones also protect the ear
against loud sounds by stiffening and thereby attenuating
excessively loud sounds so as to prevent permanent
damage to the hearing mechanisms.
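To put the force amplification factors of 3 and 15 into more familiar units, they can be converted to decibels. The sketch below assumes the factor is an amplitude-like (force) ratio, so the conversion uses 20 log10; that interpretation and the function name are assumptions, not statements from the slides.

```python
import math

def force_gain_db(factor):
    """Convert a force (amplitude-like) amplification factor to decibels."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    # Amplitude-type ratios use 20 * log10; power ratios would use 10 * log10.
    return 20.0 * math.log10(factor)

# Range quoted for the ossicle lever: a factor of 3 up to a factor of 15.
for factor in (3, 15):
    print(factor, round(force_gain_db(factor), 1), "dB")
```

Under this assumption the quoted range corresponds to roughly 9.5 dB to 23.5 dB of gain between the eardrum and the stirrup.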
Function of the inner ear
• The inner ear can be thought of as two organs, namely the
semicircular canals, which serve as the body’s balance
organ, and the cochlea, which serves as the microphone of
the auditory system, converting sound pressure signals
from the outer ear into electrical impulses that are passed
on to the brain via the auditory nerve.
• The cochlea, a snail-shaped organ of two and a half turns and a
length of approximately 3 cm, is a fluid-filled chamber that
is partitioned longitudinally by the basilar membrane.
Mechanical vibrations at the entrance to the cochlea (the
stapes end) create standing waves (of the fluid inside the
cochlea), causing the basilar membrane to vibrate at
frequencies commensurate with the input acoustic wave
frequencies, and with largest amplitude occurring at places
along the basilar membrane that are “tuned” to these frequencies.
Basilar Membrane Mechanics
• The basilar membrane is lined with more than
30,000 inner hair cells (IHCs) that are set in
motion by the mechanical movements along the
basilar membrane. The IHCs vibrate at different
rates and with different levels and are tuned to
different frequencies along the basilar
membrane. When the basilar membrane
vibration motion is sufficiently large, the IHC
vibration evokes an electrical impulse that travels
along the auditory nerve to the brain for
subsequent analysis.
Sound processing in auditory system
response of the BM
Frequency response of the BM and
critical band
Bark scale
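The Bark scale maps frequency in Hz onto critical-band rate. The slide's own formula is not recoverable here, so the sketch below uses one widely quoted analytic approximation (Zwicker and Terhardt) for both the Bark value and the critical bandwidth; treat the choice of formula and the function names as assumptions.

```python
import math

def hz_to_bark(f_hz):
    """Approximate critical-band rate (Bark) for a frequency in Hz.

    Uses the Zwicker-Terhardt analytic approximation (an assumption;
    other approximations exist and give slightly different values).
    """
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def critical_bandwidth_hz(f_hz):
    """Approximate critical bandwidth (Hz) centered at f_hz, same source."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

for f in (100, 500, 1000, 4000):
    print(f, round(hz_to_bark(f), 2), round(critical_bandwidth_hz(f), 1))
```

With this approximation the critical bandwidth stays near 100 Hz at low frequencies and grows with center frequency above roughly 500 Hz.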
Perception of sound
• Perception of sound implies making interpretations of
the nature of the sensed sound signal.
• Sound waves can be characterized by their amplitude
and frequency of variation. These are features that can
be represented mathematically and measured using
physical devices.
• Humans, however, sense these features of sound and
perceive them as loudness and pitch respectively, and
these perceptual features are not related in a simple
way to amplitude and frequency.
• The perceptual quantity that is related to
sound frequency is called pitch.
• The unit of frequency is Hz and the unit of
pitch is the mel (derived from the word melody).
• The pitch of complex sounds is an even more
complex and interesting phenomenon.
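The relation between frequency in Hz and pitch in mels is often approximated with a logarithmic formula. The sketch below uses the common form m = 2595 log10(1 + f/700); the slides do not state which formula they use, so this choice and the function names are assumptions.

```python
import math

def hz_to_mel(f_hz):
    """Approximate pitch in mels for a pure tone at f_hz (common formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in Hz do not give equal steps in perceived pitch.
for f in (250, 500, 1000, 2000, 4000):
    print(f, round(hz_to_mel(f), 1))
```

The formula is scaled so that a 1000 Hz tone maps to approximately 1000 mels; above that point, pitch in mels grows much more slowly than frequency in Hz.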
Masking Effect
Human speech perception experiments
Measurement of speech quality
• The SNR is a good measure of quality in the case where the
distortion introduced by the system can be modeled as an
additive noise that is uncorrelated with the signal, and it
has been used extensively to design and implement high
quality speech coding methods.
• However, over time it has been found that the SNR
measure is just not good enough as a subjective measure
for most speech coding methods. This is especially true for
“model-based” speech coders, where the waveform of the
speech signal is not inherently preserved, but instead the
speech signal is designed to match a speech production
model. In such cases the resulting noise, defined as the
difference between the coded speech and the original
speech, is not an uncorrelated noise source and thus the
assumptions behind the calculation of SNR are neither
correct nor relevant as a measure of speech quality.
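The SNR described above treats the difference between the coded and the original speech as noise. A minimal sketch of that calculation follows, assuming time-aligned sample sequences of equal length; the function name and the toy samples are illustrative only.

```python
import math

def snr_db(original, coded):
    """Signal-to-noise ratio in dB, with noise = original - coded."""
    if len(original) != len(coded):
        raise ValueError("signals must have the same length")
    signal_energy = sum(s * s for s in original)
    noise_energy = sum((s - c) ** 2 for s, c in zip(original, coded))
    if noise_energy == 0:
        return float("inf")  # coded signal is identical to the original
    return 10.0 * math.log10(signal_energy / noise_energy)

# Toy example (hypothetical samples): every sample is off by 10 percent.
original = [1.0, -1.0, 1.0, -1.0]
coded = [0.9, -0.9, 0.9, -0.9]
print(round(snr_db(original, coded), 2))
```

Because the calculation compares waveforms sample by sample, a coder that reproduces the sound well but not the exact waveform can score poorly, which is the limitation the text points out for model-based coders.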
Subjective test: MOS
• Excellent quality—MOS=5 (coded speech is essentially equivalent in
quality with original uncoded speech; speech recorded with high
• Good quality—MOS=4 (coded speech is of very good quality but
distinctly below the quality and naturalness of the uncoded speech);
• Fair quality—MOS=3 (coded speech is acceptable for
communications but is distinctly of lower quality than uncoded
original signal);
• Poor quality—MOS=2 (coded speech quality is significantly
degraded and barely acceptable for a voice communications system);
• Bad quality—MOS=1 (coded speech quality is unacceptable for
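A mean opinion score is the average of the ratings that listeners assign on the five-point scale above. The sketch below assumes integer ratings from 1 to 5; the listener ratings in the example are hypothetical and are not data from the slides.

```python
def mean_opinion_score(ratings):
    """Average of listener ratings on the 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if r not in (1, 2, 3, 4, 5):
            raise ValueError("ratings must be integers from 1 to 5")
    return sum(ratings) / len(ratings)

# Hypothetical ratings from five listeners for one coded utterance.
ratings = [4, 5, 3, 4, 4]
print(mean_opinion_score(ratings))
```

In practice a MOS test averages over many listeners and many utterances, so the single-utterance example here only shows the arithmetic.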
Objective Measure: PESQ

The correlation coefficients between subjective test ratings (MOS scores) and corresponding PESQ outputs range from 0.785 to 0.979.
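A correlation coefficient of this kind can be computed as a Pearson correlation between paired MOS and PESQ values. The sketch below is an illustration only: the paired scores are hypothetical, and the slides do not say exactly how the reported coefficients were obtained.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical paired scores for five test conditions (not real data).
mos_scores = [1.8, 2.5, 3.1, 3.9, 4.4]
pesq_scores = [1.6, 2.4, 3.3, 3.7, 4.3]
print(round(pearson_r(mos_scores, pesq_scores), 3))
```

A value close to 1 means the objective measure ranks and spaces the test conditions much as human listeners do.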
Speech production revisited: Lossless
tube model
Uniform lossless tube
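For a uniform lossless tube that is closed at one end (the glottis) and open at the other (the lips), the standard quarter-wavelength result places the resonances at F_n = (2n - 1) c / (4 L). The sketch below evaluates that formula; the boundary conditions, the tube length of 17.5 cm, and the speed of sound of 35000 cm/s are textbook-style assumptions, not values taken from the slides.

```python
def uniform_tube_resonances(length_cm, num_resonances=3, c_cm_per_s=35000.0):
    """Resonance frequencies (Hz) of a uniform lossless tube.

    Assumes the tube is closed at the glottis end and open at the lips,
    so resonances fall at odd multiples of c / (4 * L).
    """
    if length_cm <= 0:
        raise ValueError("tube length must be positive")
    return [(2 * n - 1) * c_cm_per_s / (4.0 * length_cm)
            for n in range(1, num_resonances + 1)]

# Assumed vocal tract length of 17.5 cm.
print(uniform_tube_resonances(17.5))
```

With these assumed values the resonances are evenly spaced at odd multiples of 500 Hz, which is the usual reference pattern for a neutral vowel.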
General Speech Production Model
