Voice DSP Processing: Chief Scientist RAD Data Communications

Voice DSP Processing I
Yaakov J. Stein
Chief Scientist
RAD Data Communications
Stein VoiceDSP 1.1
Voice DSP
Part 1 Speech biology and what we can learn from it
Part 2 Speech DSP (AGC, VAD, features, echo cancellation)
Part 3 Speech compression techniques
Part 4 Speech Recognition
Stein VoiceDSP 1.2

Voice DSP - Part 1a
Speech production mechanisms

Biology of the vocal tract
Pitch and formants
Sonograms
The basic LPC model
The cepstrum
LPC cepstrum
Line spectral pairs
Stein VoiceDSP 1.3

Voice DSP - Part 1b
Speech perception mechanisms

Biology of the ear
Psychophysical phenomena
Weber's law
Fechner's law
Changes
Masking
Stein VoiceDSP 1.4

Voice DSP - Part 1c
Speech quality measurement
Subjective measurement
MOS and its variants
Objective measurement
PSQM, PESQ
Stein VoiceDSP 1.5

Voice DSP - Part 2a
Basic speech processing

Simplest processing
AGC
Simplistic VAD
More complex processing
pitch tracking
formant tracking
U/V decision
computing LPC and other features
Stein VoiceDSP 1.6

Voice DSP - Part 2b
Echo Cancellation
Sources of echo (acoustic vs. line echo)
Echo suppression and cancellation
Adaptive noise cancellation
The LMS algorithm
Other adaptive algorithms
The standard LEC
Stein VoiceDSP 1.7

Voice DSP - Part 3
Speech compression techniques

PCM
ADPCM
SBC
VQ
ABS-CELP
MBE
MELP
STC
Waveform Interpolation
Stein VoiceDSP 1.8

Voice DSP - Part 4
Speech Recognition tasks
ASR Engine
Phonetic labeling
DTW
HMM
State-of-the-Art
Stein VoiceDSP 1.9

Voice DSP - Part 1a
Speech production mechanisms
Stein VoiceDSP 1.10
Speech Production Organs
[Figure: the speech production organs. Labels: brain, nasal cavity, hard palate, velum, teeth, lips, uvula, mouth cavity, pharynx, tongue, esophagus, larynx, trachea, lungs]
Stein VoiceDSP 1.11
Speech Production Organs - cont.
Air from lungs is exhaled into trachea (windpipe)
Vocal cords (folds) in larynx can produce periodic pulses of air
by opening and closing the glottis
Throat (pharynx), mouth, tongue and nasal cavity modify air flow
Teeth and lips can introduce turbulence
Epiglottis separates esophagus (food pipe) from trachea
Stein VoiceDSP 1.12

Voiced vs. Unvoiced Speech
When vocal cords are held open, air flows unimpeded
When laryngeal muscles stretch them, glottal flow is in bursts
When glottal flow is periodic, the result is called voiced speech
The basic interval/frequency is called the pitch
Pitch period usually between 2.5 and 20 milliseconds
Pitch frequency between 50 and 400 Hz
You can feel the vibration of the larynx
Vowels are always voiced (unless whispered)
Consonants come in voiced/unvoiced pairs
for example (voiced/unvoiced): B/P G/K D/T V/F J/CH TH/th W/WH Z/S ZH/SH
Stein VoiceDSP 1.13

Excitation spectra
Voiced speech
Pulse train is not sinusoidal - harmonic rich
[Figure: spectrum of the voiced excitation, plotted against frequency f]
Unvoiced speech
Common assumption : white noise
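
The "harmonic rich" remark can be illustrated with a minimal Python sketch that takes a naive DFT of an idealized impulse train. The length of 64 samples, the pitch period of 8 samples and the function names are illustrative assumptions, not values from the slides; a real glottal pulse is smoother, so its harmonics fall off with frequency.

```python
import cmath

def dft_magnitudes(x):
    """Naive O(N^2) DFT; returns the magnitude of each frequency bin."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

def impulse_train(length, period):
    """Idealized voiced excitation: a unit pulse every `period` samples."""
    return [1.0 if t % period == 0 else 0.0 for t in range(length)]

# Hypothetical values: 64 samples with a pitch period of 8 samples.
mags = dft_magnitudes(impulse_train(64, 8))

# Bins that carry energy: the pitch frequency and all of its multiples.
harmonic_bins = [k for k, m in enumerate(mags) if m > 1e-6]
```

Under these assumptions only every eighth bin is nonzero, and all of those bins have the same magnitude, which is the line spectrum the slide contrasts with the flat spectrum assumed for unvoiced excitation.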
Stein VoiceDSP 1.14

Effect of vocal tract
Mouth and nasal cavities have resonances
Resonant frequencies
depend on geometry
Stein VoiceDSP 1.15

Effect of vocal tract - cont.
Sound energy at these resonant frequencies is amplified
Frequencies of peak amplification are called formants
[Figure: vocal tract frequency response versus frequency, with formant peaks F1 to F4, shown for voiced speech (pitch F0 marked) and for unvoiced speech]
Stein VoiceDSP 1.16
Formant frequencies
Peterson - Barney data (note the vowel triangle)
Stein VoiceDSP 1.17

Sonograms
Stein VoiceDSP 1.18

Cylinder model(s)
Rough model of throat and mouth cavity
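[Figure: cylinder driven by the voice excitation at one end, open at the other]
With nasal cavity
[Figure: branched cylinder driven by the voice excitation, with one end open and the other end open/closed]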
Stein VoiceDSP 1.19

Phonemes
The smallest acoustic unit that can change meaning

Different languages have different phoneme sets
Types: (notations: phonetic, CVC, ARPABET)
Vowels
front (heed, hid, head, hat)
mid (hot, heard, hut, thought)
back (boot, book, boat)
diphthongs (buy, boy, down, date)
Semivowels
liquids (l, r)
glides (w, y)
Stein VoiceDSP 1.20

Phonemes - cont.
Consonants
nasals (murmurs) (n, m, ng)
stops (plosives)
voiced (b,d,g)
unvoiced (p, t, k)
fricatives
voiced (v, that, z, zh)
unvoiced (f, think, s, sh)
affricates (j, ch)
whispers (h, what)
gutturals ( ,)
clicks, etc. etc. etc.
Stein VoiceDSP 1.21

Basic LPC Model
[Block diagram: a pulse generator and a white noise generator feed the U/V switch, whose output drives the LPC synthesis filter]
Stein VoiceDSP 1.22

Basic LPC Model - cont.
Pulse generator produces a harmonic rich periodic impulse train
(with pitch period and gain)
White noise generator produces a random signal (with gain)
U/V switch chooses between voiced and unvoiced speech
LPC filter amplifies formant frequencies (all-pole or AR IIR filter)
The output will resemble true speech to within residual error
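
The model on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function names, the second-order coefficients, the pitch period of 80 samples and the sign convention s[t] = e[t] + sum of a[k] * s[t-1-k] are all chosen here for the example.

```python
import random

def excitation(n, voiced, pitch_period=80, gain=1.0, seed=0):
    """Pulse train (voiced) or white noise (unvoiced), scaled by gain."""
    if voiced:
        return [gain if t % pitch_period == 0 else 0.0 for t in range(n)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n)]

def lpc_synthesis(e, a):
    """All-pole (AR) filter: s[t] = e[t] + sum_k a[k] * s[t-1-k]."""
    s = []
    for t, e_t in enumerate(e):
        past = sum(a_k * s[t - 1 - k]
                   for k, a_k in enumerate(a) if t - 1 - k >= 0)
        s.append(e_t + past)
    return s

# Hypothetical 2nd-order filter with a single resonance (pole radius 0.9).
a = [1.2, -0.81]

# The U/V switch is the `voiced` flag: it selects which generator drives the filter.
voiced_speech = lpc_synthesis(excitation(400, voiced=True), a)
unvoiced_speech = lpc_synthesis(excitation(400, voiced=False), a)
```

A practical synthesizer would use a higher filter order (one pole pair per formant) and update the coefficients, gain, pitch and U/V flag frame by frame.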
Stein VoiceDSP 1.23

Cepstrum
Another way of thinking about the LPC model

Speech spectrum is obtained from multiplication
Spectrum of (pitch) pulse train times
Vocal tract (formant) frequency response
So log of this spectrum is obtained from addition
Log spectrum of pitch train plus
Log of vocal tract frequency response
Consider this log spectrum to be the spectrum of some new signal
called the cepstrum
The cepstrum is the sum of two components:
excitation plus vocal tract
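
The additive property can be checked numerically. The sketch below is a minimal illustration that uses the real cepstrum (inverse DFT of the log magnitude spectrum) and circular convolution so that the identity is exact; the two toy sequences and all function names are assumptions made for the example.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) DFT (or inverse DFT) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum of x."""
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [v.real for v in dft(log_mag, inverse=True)]

def circular_convolve(x, h):
    """Circular convolution, which multiplies the two DFT spectra."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]

# Hypothetical toy sequences standing in for excitation and vocal tract.
excitation = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
tract = [1.0, -0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
speech = circular_convolve(excitation, tract)

c_speech = real_cepstrum(speech)
c_sum = [a + b for a, b in zip(real_cepstrum(excitation), real_cepstrum(tract))]
```

Here `c_speech` and `c_sum` agree to within rounding error: convolution in time became multiplication in frequency, and the logarithm turned that product into a sum.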
Stein VoiceDSP 1.24

Cepstrum - cont.
Cepstral processing has its own language

Cepstrum (note that this is really a signal in the time domain)
Quefrency (its units are seconds)

Liftering (filtering)
Alanysis (analysis)
Saphe (phase)
Several variants:
complex cepstrum
power cepstrum
LPC cepstrum
Stein VoiceDSP 1.25

Do we know enough?
Standard speech model (LPC)
(used by most speech processing/compression/recognition systems)
is a model of speech production
Unfortunately, speech production and speech perception systems
are not matched
So next we'll look at the biology of the hearing (auditory) system
and some psychophysics (perception)
Stein VoiceDSP 1.26

Voice DSP - Part 1b
Speech hearing & perception mechanisms
Stein VoiceDSP 1.27
Hearing Organs
Stein VoiceDSP 1.28

Hearing Organs - cont.
Sound waves impinge on outer ear and enter auditory canal

Amplified waves cause eardrum to vibrate
Eardrum separates outer ear from middle ear
The Eustachian tube equalizes air pressure of middle ear
Ossicles (hammer, anvil, stirrup) amplify vibrations
Oval window separates middle ear from inner ear
Stirrup excites oval window which excites liquid in the cochlea
The cochlea is curled up like a snail
The basilar membrane runs along middle of cochlea
The organ of Corti transduces vibrations to electric pulses
Pulses are carried by the auditory nerve to the brain
Stein VoiceDSP 1.29

Function of Cochlea
Cochlea has 2 1/2 to 3 turns;
were it straightened out it would be 3 cm in length
The basilar membrane runs down the center of the cochlea,
as does the organ of Corti
15,000 cilia (hairs) contact the vibrating basilar membrane
and release neurotransmitter stimulating 30,000 auditory neurons
Cochlea is wide (1/2 cm) near oval window and tapers towards apex;
it is stiff near oval window and flexible near apex
Hence high frequencies cause section near oval window to vibrate,
low frequencies cause section near apex to vibrate
The result is a frequency decomposition by a bank of overlapping filters
Stein VoiceDSP 1.30

Psychophysics - Weber's law
Ernst Weber, professor of physiology at Leipzig in the early 1800s
Just Noticeable Difference (JND):
minimal stimulus change that can be detected by senses
Discovery: $\Delta I = K I$ (the JND is proportional to the stimulus intensity)
Example
Tactile sense: place coins in each hand
subject could discriminate between 10 coins and 11,
but not 20/21, but could 20/22!
Similarly for vision (lengths of lines), taste (saltiness), sound (frequency)
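
The coin example can be reproduced with a small Python sketch of Weber's law. The Weber fraction K = 1/10 is an assumption chosen only because it matches the 10/11 and 20/22 cases on the slide, and the function names are hypothetical.

```python
from fractions import Fraction

def just_noticeable_difference(intensity, k):
    """Weber's law: the JND is proportional to the stimulus, dI = K * I."""
    return k * intensity

def can_discriminate(base, other, k):
    """True when the change from `base` reaches at least one JND."""
    return abs(other - base) >= just_noticeable_difference(base, k)

# Assumed Weber fraction of 1/10 (exact arithmetic avoids rounding at the threshold).
K = Fraction(1, 10)
results = {
    (10, 11): can_discriminate(10, 11, K),
    (20, 21): can_discriminate(20, 21, K),
    (20, 22): can_discriminate(20, 22, K),
}
```

With this K, one extra coin is detectable on top of 10 coins but not on top of 20, where two extra coins are needed.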
Stein VoiceDSP 1.31

Weber's law - cont.
This makes a lot of sense
Bill Gates
Stein VoiceDSP 1.32

Psychophysics - Fechner's law
Weber's law is not a true psychophysical law:
it relates stimulus threshold to stimulus (both physical entities),
not internal representation (feelings) to physical entity
Gustav Theodor Fechner, student of Weber (medicine, physics, philosophy)
Simplest assumption: JND is a single internal unit
Using Weber's law we find:
Y = A log I + B
Fechner Day (October 22 1850)
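
One way to see where the logarithm comes from: if every JND counts as one internal unit and each JND multiplies the stimulus by (1 + K), then n steps above a reference I0 reach I = I0 (1 + K)^n, so n = log(I / I0) / log(1 + K). The Python sketch below computes that count; K = 0.1, the reference intensity of 1 and the function name are assumptions made for illustration.

```python
import math

def jnd_steps(intensity, reference=1.0, k=0.1):
    """Number of just noticeable differences between reference and intensity.

    Written in the slide's form Y = A log I + B, with
    A = 1 / log(1 + k) and B = -A log(reference).
    """
    a = 1.0 / math.log(1.0 + k)
    b = -a * math.log(reference)
    return a * math.log(intensity) + b

steps_10 = jnd_steps(10.0)     # perceived units for a 10-fold stimulus
steps_100 = jnd_steps(100.0)   # perceived units for a 100-fold stimulus
```

Multiplying the stimulus by ten again only adds the same number of perceived units, which is the compressive behavior discussed on the following slide.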
Stein VoiceDSP 1.33

Fechner's law - cont.
Log is very compressive
Fechner's law explains the fantastic ranges of our senses
Sight: single photon to direct sunlight, $10^{15}$
Hearing: eardrum moving by the width of 1 H atom to jet plane, $10^{12}$
Bel defined to be $\log_{10}$ of power ratio
decibel (dB) one tenth of a Bel
$d(\mathrm{dB}) = 10 \log_{10}(P_1 / P_2)$
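
As a quick worked example, the decibel definition can be written as a pair of Python helpers. The function names are hypothetical, and treating the hearing range quoted on this slide as a power ratio is an assumption of the example.

```python
import math

def power_ratio_to_db(p1, p2):
    """d(dB) = 10 log10(P1 / P2)."""
    return 10.0 * math.log10(p1 / p2)

def db_to_power_ratio(db):
    """Inverse mapping: power ratio corresponding to a level difference in dB."""
    return 10.0 ** (db / 10.0)

# Hearing range quoted on the slide, taken here as a power ratio of 10^12.
hearing_range_db = power_ratio_to_db(1e12, 1.0)
```

Twelve orders of magnitude in power collapse to a range of about 120 dB, which shows how compressive the logarithm is.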
Stein VoiceDSP 1.34

Fechner's law - sound amplitudes
Companding:
adaptation of logarithm to positive/negative signals
μ-law and A-law are piecewise linear approximations
Equivalent to linear sampling at 12-14 bits
(8 bit linear sampling is significantly more noisy)
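
The idea behind companding can be sketched with the continuous μ-law curve. This is a minimal illustration, not the standardized codec: telephony uses a piecewise linear approximation of this curve followed by 8-bit quantization, which the sketch omits. The function names are hypothetical, and input samples are assumed to be normalized to the range -1 to 1.

```python
import math

MU = 255.0  # value commonly used for 8-bit mu-law telephony

def mu_compress(x, mu=MU):
    """Continuous mu-law: sign(x) * ln(1 + mu|x|) / ln(1 + mu), for |x| <= 1."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

# A quiet sample is boosted far more than a loud one before quantization.
small = mu_compress(0.01)
large = mu_compress(0.5)
```

Because small amplitudes occupy a large share of the compressed range, uniform quantization after compression spends most of its levels on quiet sounds.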
Stein VoiceDSP 1.35

Fechner's law - sound frequencies
Octaves, well tempered scale: semitone ratio $\sqrt[12]{2}$
Critical bands
Frequency warping
Mel (melody): 1 kHz = 1000 mel, JND afterwards
$M \approx 1000 \log_2 (1 + f_{\mathrm{kHz}})$
Bark (Barkhausen): tones that can be simultaneously heard
excite different basilar membrane regions
$B \approx 25 + 75 \, (1 + 1.4 f_{\mathrm{kHz}}^2)^{0.69}$
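
The frequency warping formulas on this slide translate directly into Python. The sketch assumes frequencies are supplied in Hz and converted to kHz internally, and that the critical bandwidth B is expressed in Hz; the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """M ~ 1000 log2(1 + f_kHz), so 1 kHz maps to 1000 mel."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)

def critical_bandwidth_hz(f_hz):
    """B ~ 25 + 75 (1 + 1.4 f_kHz^2)^0.69."""
    f_khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

def semitone_ratio():
    """Frequency ratio of one semitone: the twelfth root of 2."""
    return 2.0 ** (1.0 / 12.0)

mel_at_1khz = hz_to_mel(1000.0)
bandwidth_low = critical_bandwidth_hz(0.0)
bandwidth_1khz = critical_bandwidth_hz(1000.0)
```

The critical bandwidth comes out at 100 Hz for very low frequencies and widens as frequency rises, while twelve semitone steps multiply the frequency by exactly one octave.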
Stein VoiceDSP 1.36

Psychophysics - changes
Our senses respond to changes
[Figure: block diagram containing an inverse filter (labels: Inverse Filter, E)]
Stein VoiceDSP 1.37

Psychophysics - masking
Masking: strong tones block weaker ones at nearby frequencies

narrowband noise blocks tones (up to critical band)
Stein VoiceDSP 1.38

Voice DSP - Part 1c
Speech quality measurement
Stein VoiceDSP 1.39
Why does it sound the way it sounds?
PSTN
BW = 0.2-3.8 kHz, SNR > 30 dB
PCM, ADPCM (BER $10^{-3}$)
five nines reliability
line echo cancellation
Voice over packet network

speech compression
delay, delay variation, jitter
packet loss/corruption/priority
echo cancellation
Stein VoiceDSP 1.40

Subjective Voice Quality
Old Measures
meet neat seat feet Pete beat heat
5/9
DRT (Diagnostic Rhyme Test)
DAM (Diagnostic Acceptability Measure)
The modern scale
MOS (Mean Opinion Score)
DMOS (Degradation MOS)
Stein VoiceDSP 1.41

MOS according to ITU
P.800 Subjective Determination of Transmission Quality
Annex B: Absolute Category Rating (ACR)
Score  Listening Quality  Listening Effort
5      excellent          relaxed
4      good               attention needed
3      fair               moderate effort
2      poor               considerable effort
1      bad                no meaning with feasible effort
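
A MOS is the average of listeners' category ratings. The Python sketch below shows the arithmetic on made-up votes; the vote list is hypothetical (not measured data), the function names are illustrative, and the normal-approximation confidence interval is an assumption added for the example.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Average of ACR ratings on the 1 (bad) to 5 (excellent) scale."""
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ACR ratings must be integers from 1 to 5")
    return statistics.mean(ratings)

def mos_confidence_halfwidth(ratings, z=1.96):
    """Approximate 95% half-width, assuming a normal approximation."""
    return z * statistics.stdev(ratings) / math.sqrt(len(ratings))

votes = [4, 4, 5, 3, 4, 4, 3, 5]   # hypothetical listener votes
mos = mean_opinion_score(votes)
halfwidth = mos_confidence_halfwidth(votes)
```

With only eight listeners the interval is about half a MOS point wide, which hints at why reliable tests need many listeners.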
Stein VoiceDSP 1.42

MOS according to ITU (cont)
Annex D Degradation Category Rating (DCR)
Annex E Comparison Category Rating (CCR)
ACR not good at high quality speech
Score  DCR                CCR
 5     inaudible
 4     not annoying
 3     slightly annoying  much better
 2     annoying           better
 1     very annoying      slightly better
 0                        the same
-1                        slightly worse
-2                        worse
-3                        much worse
Stein VoiceDSP 1.43

Some MOS numbers
Effect of Speech Compression:
(from ITU-T Study Group 15)
Condition                                    Rate       MOS
Quiet room, 48 kHz 16 bit linear sampling               5.0
PCM (A-law/μ-law)                            64 kb/s    4.1
G.723.1                                      6.3 kb/s   3.9
G.729                                        8 kb/s     3.9
ADPCM G.726                                  32 kb/s    3.8  (toll quality)
GSM                                          13 kb/s    3.6
VSELP IS54                                   8 kb/s     3.4
Stein VoiceDSP 1.44

The Problem(s) with MOS
Accurate MOS tests are the only reliable benchmark

BUT
MOS tests are off-line

MOS tests are slow
MOS tests are expensive
Different labs give consistently different results
Most MOS tests only check one aspect of system
Stein VoiceDSP 1.45

The Problem(s) with SNR
Naive question: Isn't CCR the same as SNR?
SNR does not correlate well with subjective criteria
Squared difference is not an accurate comparator
Gain
Delay
Phase
Nonlinear processing
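
A small experiment shows how the squared difference reacts to gain and delay. The Python sketch below compares a reference tone with a copy at half amplitude and with a copy delayed by half a period; the tone parameters and function name are assumptions made for the example.

```python
import math

def snr_db(reference, test):
    """10 log10 of reference energy over squared-difference energy."""
    signal = sum(r * r for r in reference)
    noise = sum((r - t) ** 2 for r, t in zip(reference, test))
    return math.inf if noise == 0 else 10.0 * math.log10(signal / noise)

# Hypothetical test tone: 4 cycles of a sinusoid in 64 samples.
n = 64
ref = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]

half_gain = [0.5 * r for r in ref]             # gain change only
half_period = n // (2 * 4)                     # delay of half a period (8 samples)
delayed = [ref[(t - half_period) % n] for t in range(n)]

snr_gain = snr_db(ref, half_gain)
snr_delay = snr_db(ref, delayed)
```

The half-amplitude copy scores about +6 dB and the delayed copy about -6 dB, even though both are undistorted versions of the same tone, which is why a plain SNR is a poor stand-in for a listener's comparison.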
Stein VoiceDSP 1.46

Speech distance measures
Many objective measures have been proposed:

Segmental SNR
Itakura-Saito distance
Euclidean distance in Cepstrum space
Bark spectral distortion
Coherence Function
None correlate well with MOS

ITU target - find a quality-measure that does correlate well
Stein VoiceDSP 1.47

Some objective methods
Perceptual Speech Quality Measure (PSQM)
ITU-T P.861
Perceptual Analysis Measurement System (PAMS)
BT proprietary technique
Perceptual Evaluation of Speech Quality (PESQ)
ITU-T P.862
Objective Measurement of Perceived Audio Quality (PEAQ)
ITU-R BS.1387
Stein VoiceDSP 1.48

Objective Quality Strategy
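[Block diagram: speech passes through the channel into a quality measure (QM); a QM to MOS mapping converts the QM value into a MOS estimate]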
Stein VoiceDSP 1.49

PSQM philosophy
(from P.861)
[Block diagram: each of the two signals passes through a perceptual model to give an internal representation; the audible difference between the two internal representations feeds the cognitive model]
Stein VoiceDSP 1.50

PSQM philosophy (cont)
Perceptual Modelling (Internal representation)
Short time Fourier transform
Frequency warping (telephone-band filtering, Hoth noise)
Intensity warping
Cognitive Modelling
Loudness scaling
Internal cognitive noise
Asymmetry
Silent interval processing
PSQM Values
0 (no degradation) to 6.5 (maximum degradation)
Conversion to MOS
PSQM to MOS calibration using known references
Equivalent Q values
Stein VoiceDSP 1.51

Problems with PSQM
Designed for telephony grade speech codecs
Doesn't take network effects into account:

filtering
variable time delay
localized distortions
Draft standard P.862 adds:

transfer function equalization
time alignment, delay skipping
distortion averaging
Stein VoiceDSP 1.52

PESQ philosophy
(from P.862)
[Block diagram: after time alignment, each of the two signals passes through a perceptual model to give an internal representation; the audible difference between the two internal representations feeds the cognitive model]
Stein VoiceDSP 1.53
