Voice DSP Processing: Chief Scientist RAD Data Communications

Voice DSP Processing I

Yaakov J. Stein
Chief Scientist
RAD Data Communications
Stein VoiceDSP 1.1
Voice DSP

Part 1 Speech biology and what we can learn from it

Part 2 Speech DSP (AGC, VAD, features, echo cancellation)

Part 3 Speech compression techniques

Part 4 Speech Recognition

Stein VoiceDSP 1.2


Voice DSP - Part 1a

Speech production mechanisms


Biology of the vocal tract

Pitch and formants

Sonograms

The basic LPC model

The cepstrum

LPC cepstrum

Line spectral pairs

Stein VoiceDSP 1.3


Voice DSP - Part 1b

Speech perception mechanisms


Biology of the ear
Psychophysical phenomena
Weber's law
Fechner's law
Changes
Masking

Stein VoiceDSP 1.4


Voice DSP - Part 1c

Speech quality measurement

Subjective measurement
MOS and its variants

Objective measurement
PSQM, PESQ

Stein VoiceDSP 1.5


Voice DSP - Part 2a

Basic speech processing


Simplest processing

AGC
Simplistic VAD
More complex processing

pitch tracking
formant tracking
U/V decision
computing LPC and other features

Stein VoiceDSP 1.6


Voice DSP - Part 2b

Echo Cancellation
Sources of echo (acoustic vs. line echo)

Echo suppression and cancellation

Adaptive noise cancellation

The LMS algorithm

Other adaptive algorithms

The standard LEC

Stein VoiceDSP 1.7


Voice DSP - Part 3

Speech compression techniques


PCM

ADPCM

SBC

VQ

ABS-CELP

MBE

MELP

STC

Waveform Interpolation

Stein VoiceDSP 1.8


Voice DSP - Part 4

Speech Recognition tasks

ASR Engine

Phonetic labeling

DTW

HMM

State-of-the-Art

Stein VoiceDSP 1.9


Voice DSP - Part 1a

Speech production mechanisms
Stein VoiceDSP 1.10
Speech Production Organs

[Figure labels: brain, nasal cavity, hard palate, velum, teeth, lips, uvula, mouth cavity, pharynx, tongue, esophagus, larynx, trachea, lungs]
Stein VoiceDSP 1.11
Speech Production Organs - cont.

Air from lungs is exhaled into trachea (windpipe)

Vocal cords (folds) in larynx can produce periodic pulses of air by opening and closing (glottis)

Throat (pharynx), mouth, tongue and nasal cavity modify air flow

Teeth and lips can introduce turbulence

Epiglottis separates esophagus (food pipe) from trachea

Stein VoiceDSP 1.12


Voiced vs. Unvoiced Speech

When vocal cords are held open, air flows unimpeded

When laryngeal muscles stretch them, glottal flow is in bursts

When glottal flow is periodic, the result is called voiced speech


Basic interval/frequency called the pitch
Pitch period usually between 2.5 and 20 milliseconds
Pitch frequency between 50 and 400 Hz
You can feel the vibration of the larynx
Vowels are always voiced (unless whispered)
Consonants come in voiced/unvoiced pairs
for example (voiced/unvoiced): B/P G/K D/T V/F J/CH TH/th W/WH Z/S ZH/SH
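
The pitch range above can be turned into a search range for a pitch estimator. The following is a minimal sketch, assuming an 8 kHz sample rate and a plain autocorrelation peak search; the function names and the synthetic test signal are illustrative and are not taken from the slides.

```python
import math

def autocorrelation(signal, lag):
    """Unnormalized autocorrelation of the signal at a single lag."""
    return sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))

def estimate_pitch_hz(signal, sample_rate_hz, min_hz=50.0, max_hz=400.0):
    """Pick the lag with the largest autocorrelation inside the pitch range."""
    min_lag = int(sample_rate_hz / max_hz)  # shortest pitch period (2.5 ms)
    max_lag = int(sample_rate_hz / min_hz)  # longest pitch period (20 ms)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(signal, lag))
    return sample_rate_hz / best_lag

SAMPLE_RATE_HZ = 8000  # assumed telephone-band sample rate
# Synthetic "voiced" signal: 100 Hz fundamental plus its second harmonic.
test_signal = [math.sin(2 * math.pi * 100 * n / SAMPLE_RATE_HZ)
               + 0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE_HZ)
               for n in range(400)]

print(estimate_pitch_hz(test_signal, SAMPLE_RATE_HZ))  # prints 100.0
```

Real pitch trackers add voicing decisions and protection against octave errors; this sketch only shows the period/frequency relationship.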

Stein VoiceDSP 1.13


Excitation spectra

Voiced speech
Pulse train is not sinusoidal - harmonic rich

[Figure: excitation spectrum plotted against frequency f]
Unvoiced speech
Common assumption : white noise
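
To illustrate why a pulse train is called harmonic rich, the sketch below compares the DFT of a sinusoid with that of an impulse train of the same period. The signal length, the period, and the helper names are assumptions chosen only for the example.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitude of the discrete Fourier transform (naive O(N^2) version)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

def active_bins(x, threshold=1e-6):
    """Non-negative frequency bins that carry energy."""
    half = len(x) // 2
    return [k for k, m in enumerate(dft_magnitudes(x)[:half + 1])
            if m > threshold]

N, PERIOD = 64, 8  # assumed values: 64 samples, pitch period of 8 samples
sinusoid = [math.cos(2 * math.pi * t / PERIOD) for t in range(N)]
pulse_train = [1.0 if t % PERIOD == 0 else 0.0 for t in range(N)]

print(active_bins(sinusoid))     # [8]
print(active_bins(pulse_train))  # [0, 8, 16, 24, 32]
```

The sinusoid occupies a single bin, while the pulse train has equal energy at every multiple of the fundamental bin.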

Stein VoiceDSP 1.14


Effect of vocal tract

Mouth and nasal cavities have resonances

Resonant frequencies
depend on geometry

Stein VoiceDSP 1.15


Effect of vocal tract - cont.
Sound energy at these resonant frequencies is amplified
Frequencies of peak amplification are called formants
[Figure: frequency response versus frequency showing formant peaks F1 to F4, for voiced speech and unvoiced speech; F0 marks the pitch]
Stein VoiceDSP 1.16
Formant frequencies
Peterson - Barney data (note the vowel triangle)

Stein VoiceDSP 1.17


Sonograms

Stein VoiceDSP 1.18


Cylinder model(s)

Rough model of throat and mouth cavity

[Figure: cylinder with voice excitation at one end, open at the other]

With nasal cavity


[Figure: branched cylinder with voice excitation at one end; one branch open, the other open/closed]

Stein VoiceDSP 1.19


Phonemes

The smallest acoustic unit that can change meaning


Different languages have different phoneme sets
Types: (notations: phonetic, CVC, ARPABET)

Vowels
front (heed, hid, head, hat)
mid (hot, heard, hut, thought)
back (boot, book, boat)
diphthongs (buy, boy, down, date)
Semivowels
liquids (w, l)
glides (r, y)

Stein VoiceDSP 1.20


Phonemes - cont.

Consonants
nasals (murmurs) (n, m, ng)
stops (plosives)
voiced (b,d,g)
unvoiced (p, t, k)
fricatives
voiced (v, that, z, zh)
unvoiced (f, think, s, sh)
affricates (j, ch)
whispers (h, what)
gutturals ( ,)
clicks, etc. etc. etc.

Stein VoiceDSP 1.21


Basic LPC Model
[Block diagram: Pulse Generator and White Noise Generator feed the U/V Switch, whose output drives the LPC synthesis filter]

Stein VoiceDSP 1.22


Basic LPC Model - cont.

Pulse generator produces a harmonic-rich periodic impulse train (with pitch period and gain)

White noise generator produces a random signal (with gain)

U/V switch chooses between voiced and unvoiced speech

LPC filter amplifies formant frequencies (all-pole or AR IIR filter)

The output will resemble true speech to within residual error
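
The model described above can be sketched in a few lines. This is a minimal illustration, not a production synthesizer: the sign convention s[n] = e[n] + sum of a[k] s[n-k], the single 500 Hz resonance, the pole radius of 0.95, and the 8 kHz sample rate are all assumptions made for the example.

```python
import math
import random

def excitation(num_samples, voiced, pitch_period, gain, rng):
    """U/V switch: impulse train when voiced, white Gaussian noise otherwise."""
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0
                for n in range(num_samples)]
    return [gain * rng.gauss(0.0, 1.0) for _ in range(num_samples)]

def lpc_synthesis(excitation_signal, lpc_coeffs):
    """All-pole (AR) filter: s[n] = e[n] + sum_k a[k] * s[n - k]."""
    output = []
    for n, e in enumerate(excitation_signal):
        s = e
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                s += a * output[n - k]
        output.append(s)
    return output

# Hypothetical two-pole resonator standing in for one formant.
SAMPLE_RATE_HZ = 8000
FORMANT_HZ = 500.0
POLE_RADIUS = 0.95
A1 = 2.0 * POLE_RADIUS * math.cos(2.0 * math.pi * FORMANT_HZ / SAMPLE_RATE_HZ)
A2 = -POLE_RADIUS ** 2
COEFFS = [A1, A2]

rng = random.Random(0)
voiced = lpc_synthesis(excitation(160, True, 80, 1.0, rng), COEFFS)
unvoiced = lpc_synthesis(excitation(160, False, 80, 0.1, rng), COEFFS)
```

A real LPC vocoder would use around ten coefficients estimated from each speech frame; two are enough here to show one resonance being excited by either source.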

Stein VoiceDSP 1.23


Cepstrum

Another way of thinking about the LPC model


Speech spectrum is obtained from multiplication
Spectrum of (pitch) pulse train times
Vocal tract (formant) frequency response
So log of this spectrum is obtained from addition
Log spectrum of pitch train plus
Log of vocal tract frequency response
Consider this log spectrum to be the spectrum of some new signal
called the cepstrum
The cepstrum is the sum of two components:
excitation plus vocal tract
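
The additivity claimed above can be checked numerically. The sketch below assumes the real cepstrum (inverse DFT of the log magnitude spectrum), circular convolution, and toy signals chosen so that no spectral bin is exactly zero; none of these choices come from the slides.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum."""
    n = len(x)
    log_mag = [math.log(abs(v)) for v in dft(x)]
    return [sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def circular_convolve(x, h):
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]

N, PITCH = 64, 8
# Slightly decaying pulse train (keeps every spectral bin non-zero).
excitation = [0.9 ** (t // PITCH) if t % PITCH == 0 else 0.0 for t in range(N)]
# Toy one-pole "vocal tract" impulse response.
vocal_tract = [0.7 ** t for t in range(N)]
speech = circular_convolve(excitation, vocal_tract)

cepstrum_speech = real_cepstrum(speech)
cepstrum_excitation = real_cepstrum(excitation)
cepstrum_tract = real_cepstrum(vocal_tract)

largest_gap = max(abs(s - (e + h)) for s, e, h in
                  zip(cepstrum_speech, cepstrum_excitation, cepstrum_tract))
print(largest_gap < 1e-9)  # True: convolution in time is addition in cepstrum
```

In this toy case the excitation cepstrum is non-zero only at multiples of the pitch period, which is why a peak at that quefrency is commonly used to detect pitch.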

Stein VoiceDSP 1.24


Cepstrum - cont.

Cepstral processing has its own language


Cepstrum (note that this is really a signal in the time domain)

Quefrency (its units are seconds)


Liftering (filtering)
Alanysis (analysis)
Saphe (phase)
Several variants:
complex cepstrum

power cepstrum

LPC cepstrum

Stein VoiceDSP 1.25


Do we know enough?

Standard speech model (LPC) (used by most speech processing/compression/recognition systems) is a model of speech production

Unfortunately, speech production and speech perception systems are not matched

So next we'll look at the biology of the hearing (auditory) system and some psychophysics (perception)

Stein VoiceDSP 1.26


Voice DSP - Part 1b

Speech hearing & perception mechanisms
Stein VoiceDSP 1.27
Hearing Organs

Stein VoiceDSP 1.28


Hearing Organs - cont.

Sound waves impinge on outer ear and enter auditory canal


Amplified waves cause eardrum to vibrate
Eardrum separates outer ear from middle ear
The Eustachian tube equalizes air pressure of middle ear
Ossicles (hammer, anvil, stirrup) amplify vibrations
Oval window separates middle ear from inner ear
Stirrup excites oval window which excites liquid in the cochlea
The cochlea is curled up like a snail
The basilar membrane runs along middle of cochlea
The organ of Corti transduces vibrations to electric pulses
Pulses are carried by the auditory nerve to the brain

Stein VoiceDSP 1.29


Function of Cochlea

Cochlea has 2 1/2 to 3 turns; were it straightened out it would be 3 cm in length
The basilar membrane runs down the center of the cochlea
as does the organ of Corti
15,000 cilia (hairs) contact the vibrating basilar membrane
and release neurotransmitter stimulating 30,000 auditory neurons
Cochlea is wide (1/2 cm) near oval window and tapers towards apex
is stiff near oval window and flexible near apex
Hence high frequencies cause section near oval window to vibrate
low frequencies cause section near apex to vibrate
Overlapping bank of filter frequency decomposition

Stein VoiceDSP 1.30


Psychophysics - Weber's law
Ernst Weber, professor of physiology at Leipzig in the early 1800s
Just Noticeable Difference (JND):
minimal stimulus change that can be detected by senses

Discovery: $\Delta I = K I$ (the JND is proportional to the stimulus intensity)
Example
Tactile sense: place coins in each hand
subject could discriminate between 10 coins and 11,
but not between 20 and 21, yet could between 20 and 22!

Similarly for vision (lengths of lines), taste (saltiness), sound (frequency)
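
The coin example can be reproduced directly from the proportionality rule. This is a small sketch assuming a Weber fraction K of 1/10, which is the value implied by the 10-versus-11 coin threshold; exact fractions are used so the boundary cases compare cleanly.

```python
from fractions import Fraction

def is_noticeable(intensity, changed_intensity, weber_fraction):
    """True when the change reaches the just noticeable difference K * I."""
    return abs(changed_intensity - intensity) >= weber_fraction * intensity

K = Fraction(1, 10)  # assumed Weber fraction for the coin example

print(is_noticeable(10, 11, K))  # True
print(is_noticeable(20, 21, K))  # False
print(is_noticeable(20, 22, K))  # True
```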

Stein VoiceDSP 1.31


Weber's law - cont.
This makes a lot of sense

Bill Gates

Stein VoiceDSP 1.32


Psychophysics - Fechner's law

Weber's law is not a true psychophysical law:
it relates stimulus threshold to stimulus (both physical entities),
not internal representation (feelings) to physical entity

Gustav Theodor Fechner, student of Weber (medicine, physics, philosophy)

Simplest assumption: JND is a single internal unit

Using Weber's law we find:

Y = A log I + B
Fechner Day (October 22 1850)
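
One way to see where the logarithm comes from: if every JND counts as one unit of sensation and each JND multiplies the intensity by (1 + K), then the number of units grows with the log of the intensity. The sketch below counts steps by brute force and compares with the closed form; the reference intensity and K = 0.1 are assumed values.

```python
import math

def count_jnd_steps(reference, intensity, weber_fraction):
    """Count how many JND-sized steps fit between reference and intensity."""
    steps = 0
    level = reference
    while level * (1.0 + weber_fraction) <= intensity:
        level *= 1.0 + weber_fraction
        steps += 1
    return steps

def fechner_sensation(intensity, reference, weber_fraction):
    """Closed form Y = A log I + B with A = 1 / log(1 + K), B = -A log I0."""
    return math.log(intensity / reference) / math.log(1.0 + weber_fraction)

print(count_jnd_steps(1.0, 1000.0, 0.1))                   # 72
print(math.floor(fechner_sensation(1000.0, 1.0, 0.1)))     # 72
```

Equal intensity ratios therefore produce equal sensation differences: going from 10 to 100 feels like the same step as going from 100 to 1000.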

Stein VoiceDSP 1.33


Fechner's law - cont.

Log is very compressive

Fechner's law explains the fantastic ranges of our senses

Sight: single photon - direct sunlight: $10^{15}$
Hearing: eardrum moves by 1 H atom - jet plane: $10^{12}$

Bel defined to be log10 of power ratio

decibel (dB): one tenth of a Bel

$d(\mathrm{dB}) = 10 \log_{10} (P_1 / P_2)$
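
As a quick check of how compressive the logarithm is, the sensory ranges quoted above can be converted to decibels. The function name is illustrative.

```python
import math

def power_ratio_db(p1, p2):
    """Decibel value of the power ratio p1 / p2."""
    return 10.0 * math.log10(p1 / p2)

sight_range_db = power_ratio_db(1e15, 1.0)    # about 150 dB
hearing_range_db = power_ratio_db(1e12, 1.0)  # about 120 dB
doubling_db = power_ratio_db(2.0, 1.0)        # about 3 dB
```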

Stein VoiceDSP 1.34


Fechner's law - sound amplitudes

Companding

adaptation of logarithm to positive/negative signals

μ-law and A-law are piecewise linear approximations

Equivalent to linear sampling at 12-14 bits

(8 bit linear sampling is significantly more noisy)
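
The benefit of companding for quiet signals can be sketched with the continuous μ-law curve (μ = 255). Note the assumptions: this uses the smooth logarithmic formula rather than the piecewise linear segments of the actual telephony codecs, and a simple uniform 8 bit quantizer with clamping.

```python
import math

MU = 255.0

def mu_compress(x):
    """Continuous mu-law compression of x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize(value, bits):
    """Uniform quantizer on [-1, 1) with 2**bits levels."""
    levels = 2 ** (bits - 1)
    code = max(-levels, min(levels - 1, round(value * levels)))
    return code / levels

quiet_sample = 0.01  # a low-amplitude sample, 40 dB below full scale
linear_error = abs(quantize(quiet_sample, 8) - quiet_sample)
companded = mu_expand(quantize(mu_compress(quiet_sample), 8))
companded_error = abs(companded - quiet_sample)

print(companded_error < linear_error)  # True
```

For this quiet sample the companded 8 bit path is more than ten times more accurate than plain 8 bit linear quantization, which is the effect behind the 12-14 bit equivalence.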

Stein VoiceDSP 1.35


Fechner's law - sound frequencies

octaves, well tempered scale: semitone ratio $\sqrt[12]{2}$

Critical bands

Frequency warping
Mel (from melody): 1 kHz = 1000 mel, JND afterwards: $M \approx 1000 \log_2 (1 + f_{\mathrm{kHz}})$

Bark (from Barkhausen): tones that can be simultaneously heard excite different basilar membrane regions: $B \approx 25 + 75 (1 + 1.4 f_{\mathrm{kHz}}^2)^{0.69}$
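
The two warping formulas are easy to evaluate. The sketch below takes frequency in kHz as in the formulas, and assumes B is a critical bandwidth in Hz; the function names are illustrative.

```python
import math

def mel(f_khz):
    """Mel scale approximation: M ~ 1000 log2(1 + f_kHz)."""
    return 1000.0 * math.log2(1.0 + f_khz)

def critical_bandwidth_hz(f_khz):
    """Critical bandwidth approximation: B ~ 25 + 75 (1 + 1.4 f^2)^0.69."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

SEMITONE_RATIO = 2.0 ** (1.0 / 12.0)  # well tempered scale step

print(mel(1.0))                    # 1000.0
print(critical_bandwidth_hz(0.0))  # 100.0
```

Both scales are roughly linear at low frequencies and logarithmic at high frequencies, so critical bands widen as frequency rises.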

Stein VoiceDSP 1.36


Psychophysics - changes

Our senses respond to changes

Inverse
E
Filter

Stein VoiceDSP 1.37


Psychophysics - masking

Masking: strong tones block weaker ones at nearby frequencies


narrowband noise blocks tones (up to critical band)

Stein VoiceDSP 1.38


Voice DSP - Part 1c

Speech Quality Measurement
Stein VoiceDSP 1.39
Why does it sound the way it sounds?
PSTN
BW = 0.2-3.8 kHz, SNR > 30 dB
PCM, ADPCM (BER $10^{-3}$)
five nines reliability
line echo cancellation

Voice over packet network


speech compression
delay, delay variation, jitter
packet loss/corruption/priority
echo cancellation

Stein VoiceDSP 1.40


Subjective Voice Quality

Old Measures
meet neat seat feet Pete beat heat
5/9
DRT
DAM

The modern scale


MOS
DMOS

Stein VoiceDSP 1.41


MOS according to ITU

P.800 Subjective Determination of Transmission Quality

Annex B: Absolute Category Rating (ACR)

Score  Listening Quality  Listening Effort
5      excellent          relaxed
4      good               attention needed
3      fair               moderate effort
2      poor               considerable effort
1      bad                no meaning with feasible effort

Stein VoiceDSP 1.42


MOS according to ITU (cont)
Annex D Degradation Category Rating (DCR)
Annex E Comparison Category Rating (CCR)
ACR not good at high quality speech
Score  DCR                CCR
 5     inaudible
 4     not annoying
 3     slightly annoying  much better
 2     annoying           better
 1     very annoying      slightly better
 0                        the same
-1                        slightly worse
-2                        worse
-3                        much worse

Stein VoiceDSP 1.43


Some MOS numbers

Effect of Speech Compression:

(from ITU-T Study Group 15)

System                               Rate      MOS
Quiet room, 16 bit linear sampling   48 kHz    5.0
PCM (A-law/μ-law)                    64 kb/s   4.1
G.723.1                              6.3 kb/s  3.9
G.729                                8 kb/s    3.9
ADPCM G.726                          32 kb/s   3.8 (toll quality)
GSM                                  13 kb/s   3.6
VSELP IS54                           8 kb/s    3.4

Stein VoiceDSP 1.44


The Problem(s) with MOS

Accurate MOS tests are the only reliable benchmark


BUT

MOS tests are off-line


MOS tests are slow
MOS tests are expensive
Different labs give consistently different results
Most MOS tests only check one aspect of system

Stein VoiceDSP 1.45


The Problem(s) with SNR

Naive question: Isn't CCR the same as SNR?

SNR does not correlate well with subjective criteria

Squared difference is not an accurate comparator

Gain
Delay
Phase
Nonlinear processing
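
A small numerical illustration of the gain and phase problems: the three degraded signals below would sound essentially like the original tone, yet their SNR ranges from about +6 dB to about -6 dB. The test tone and the specific distortions are assumptions chosen for the example.

```python
import math

def snr_db(reference, degraded):
    """SNR in dB, treating the sample-by-sample difference as noise."""
    signal = sum(r * r for r in reference)
    noise = sum((r - d) ** 2 for r, d in zip(reference, degraded))
    return 10.0 * math.log10(signal / noise)

# 200 samples of a tone with a period of 20 samples.
tone = [math.sin(2.0 * math.pi * n / 20.0) for n in range(200)]

half_gain = [0.5 * s for s in tone]  # quieter copy
inverted = [-s for s in tone]        # 180 degree phase flip
delayed = tone[-5:] + tone[:-5]      # quarter-period (circular) delay

gain_snr = snr_db(tone, half_gain)    # about +6.02 dB
phase_snr = snr_db(tone, inverted)    # about -6.02 dB
delay_snr = snr_db(tone, delayed)     # about -3.01 dB
```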

Stein VoiceDSP 1.46


Speech distance measures

Many objective measures have been proposed:


Segmental SNR
Itakura-Saito distance
Euclidean distance in Cepstrum space
Bark spectral distortion
Coherence Function

None correlate well with MOS


ITU target - find a quality-measure that does correlate well
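
As an example of the first measure in the list, segmental SNR averages the SNR in dB over short frames instead of computing one ratio for the whole signal. The sketch below assumes 80-sample frames and the commonly used practice of clamping each frame to a fixed dB range; both the limits and the frame length are assumptions, not values from the slides.

```python
import math

def segmental_snr_db(reference, degraded, frame_len=80, lo=-10.0, hi=35.0):
    """Mean of per-frame SNR values in dB, each clamped to [lo, hi]."""
    values = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        deg = degraded[start:start + frame_len]
        signal = sum(r * r for r in ref)
        noise = sum((r - d) ** 2 for r, d in zip(ref, deg))
        if signal == 0.0:
            continue  # skip silent frames
        snr = hi if noise == 0.0 else 10.0 * math.log10(signal / noise)
        values.append(min(hi, max(lo, snr)))
    if not values:
        raise ValueError("no non-silent frames to evaluate")
    return sum(values) / len(values)

reference_tone = [math.sin(2.0 * math.pi * n / 20.0) for n in range(160)]
attenuated = [0.5 * s for s in reference_tone]

print(round(segmental_snr_db(reference_tone, attenuated), 2))  # 6.02
```

Like plain SNR, this still penalizes a harmless gain change, which is consistent with the observation that such measures track MOS poorly.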

Stein VoiceDSP 1.47


Some objective methods

Perceptual Speech Quality Measurement (PSQM)


ITU-T P.861

Perceptual Analysis Measurement System (PAMS)


BT proprietary technique

Perceptual Evaluation of Speech Quality (PESQ)


ITU-T P.862

Objective Measurement of Perceived Audio Quality (PEAQ)


ITU-R BS.1387

Stein VoiceDSP 1.48


Objective Quality Strategy

[Block diagram: speech passes through the channel; a quality measure (QM) is computed and a QM to MOS mapping gives the MOS estimate]

Stein VoiceDSP 1.49


PSQM philosophy
(from P.861)

[Block diagram: two Perceptual model blocks each produce an Internal Representation; their Audible Difference feeds the Cognitive Model]

Stein VoiceDSP 1.50


PSQM philosophy (cont)
Perceptual Modelling (Internal representation)
Short time Fourier transform
Frequency warping (telephone-band filtering, Hoth noise)
Intensity warping

Cognitive Modelling
Loudness scaling
Internal cognitive noise
Asymmetry
Silent interval processing
PSQM Values
0 (no degradation) to 6.5 (maximum degradation)

Conversion to MOS
PSQM to MOS calibration using known references
Equivalent Q values

Stein VoiceDSP 1.51


Problems with PSQM

Designed for telephony grade speech codecs

Doesn't take network effects into account:


filtering
variable time delay
localized distortions

Draft standard P.862 adds:


transfer function equalization
time alignment, delay skipping
distortion averaging

Stein VoiceDSP 1.52


PESQ philosophy
(from P.862)

[Block diagram: after Time Alignment, two Perceptual model blocks each produce an Internal Representation; their Audible Difference feeds the Cognitive Model]

Stein VoiceDSP 1.53
