Voice DSP Processing: Chief Scientist RAD Data Communications
Voice DSP Processing I
Yaakov J. Stein
Chief Scientist
RAD Data Communications
Stein VoiceDSP 1.1
Voice DSP
Sonograms
The cepstrum
LPC cepstrum
Subjective measurement
MOS and its variants
Objective measurement
PSQM, PESQ
AGC
Simplistic VAD
More complex processing
pitch tracking
formant tracking
U/V decision
computing LPC and other features
Echo Cancellation
Sources of echo (acoustic vs. line echo)
ADPCM
SBC
VQ
ABS-CELP
MBE
MELP
STC
Waveform Interpolation
ASR Engine
Phonetic labeling
DTW
HMM
State-of-the-Art
Speech production mechanisms
Speech Production Organs
Brain
Nasal cavity
Hard palate
Velum
Teeth
Lips
Uvula
Mouth cavity
Pharynx
Tongue
Esophagus
Larynx
Trachea
Lungs
Speech Production Organs - cont.
Throat (pharynx), mouth, tongue and nasal cavity modify air flow
Voiced speech
Pulse train is not sinusoidal - harmonic rich
Unvoiced speech
Common assumption : white noise
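The two excitation types can be sketched numerically. This is a minimal illustration, assuming an idealised unit-impulse train for voiced speech and Gaussian white noise for unvoiced speech; the sampling rate, pitch and function names are illustrative choices, not values from these slides.

```python
import numpy as np

def pulse_train(f0_hz, fs_hz, n_samples):
    """Idealised voiced excitation: unit impulses one pitch period apart."""
    period = int(round(fs_hz / f0_hz))
    x = np.zeros(n_samples)
    x[::period] = 1.0
    return x

def white_noise(n_samples, seed=0):
    """Unvoiced excitation under the common white-noise assumption."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(n_samples)

fs = 8000          # assumed sampling rate in Hz
f0 = 100           # assumed pitch in Hz
n = 8000           # one second of signal, so DFT bins are 1 Hz apart

voiced = pulse_train(f0, fs, n)
unvoiced = white_noise(n)

# A pulse train is not a single sinusoid: its spectrum has a line at
# every multiple of the pitch frequency (and at DC).
spectrum = np.abs(np.fft.rfft(voiced))
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
harmonic_freqs = freqs[spectrum > 0.5 * spectrum.max()]
```

Here `harmonic_freqs` holds the frequencies where the pulse-train spectrum is concentrated, which fall on integer multiples of `f0`.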
Resonant frequencies (formants) depend on geometry
[Figure: spectrum versus frequency, marking the pitch F0 and formants F2, F3, F4]
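How geometry sets the resonances can be illustrated with the textbook uniform-tube approximation. This is a sketch under a strong assumption that is not taken from these slides: the vocal tract is modelled as a lossless tube of constant cross-section, closed at the glottis and open at the lips, so it resonates at odd multiples of c / 4L. The tube length and speed of sound below are illustrative values.

```python
def tube_resonances(length_m, n_resonances=3, c_m_per_s=343.0):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other.

    Quarter-wavelength resonator: F_k = (2k - 1) * c / (4 * L), k = 1, 2, ...
    """
    return [(2 * k - 1) * c_m_per_s / (4.0 * length_m)
            for k in range(1, n_resonances + 1)]

# Assumed tract length of 17 cm (a commonly quoted adult figure).
adult = tube_resonances(0.17)

# A shorter tube pushes every resonance upward.
shorter = tube_resonances(0.14)
```

With these assumed values the first resonance comes out near 500 Hz and the others at three and five times that; real vowels move the resonances away from this pattern because the tract is not uniform.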
Formant frequencies
Peterson - Barney data (note the vowel triangle)
Voice
open
Excitation
Voice
Excitation
open/closed
Vowels
front (heed, hid, head, hat)
mid (hot, heard, hut, thought)
back (boot, book, boat)
diphthongs (buy, boy, down, date)
Semivowels
liquids (l, r)
glides (w, y)
Consonants
nasals (murmurs) (n, m, ng)
stops (plosives)
voiced (b,d,g)
unvoiced (p, t, k)
fricatives
voiced (v, that, z, zh)
unvoiced (f, think, s, sh)
affricates (j, ch)
whispers (h, what)
gutturals ( ,)
clicks, etc. etc. etc.
[Block diagram: White Noise Generator → U/V Switch → LPC synthesis filter]
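The source-filter model in the diagram can be sketched in a few lines. This is a minimal illustration, not the slides' implementation: the pulse-train source for the voiced branch, the function names and the filter coefficients are all assumptions, and the sign convention used is s[n] = G e[n] + a1 s[n-1] + ... + ap s[n-p].

```python
import numpy as np

def excitation(voiced, n_samples, pitch_period=80, seed=0):
    """U/V switch: pulse train if voiced (assumed), white noise if unvoiced."""
    if voiced:
        e = np.zeros(n_samples)
        e[::pitch_period] = 1.0
        return e
    return np.random.default_rng(seed).standard_normal(n_samples)

def lpc_synthesis(e, a, gain=1.0):
    """All-pole synthesis filter: s[n] = gain*e[n] + sum_k a[k]*s[n-1-k]."""
    s = np.zeros(len(e))
    for n in range(len(e)):
        acc = gain * e[n]
        for k, a_k in enumerate(a):
            if n - 1 - k >= 0:
                acc += a_k * s[n - 1 - k]
        s[n] = acc
    return s

# Hypothetical coefficients: one resonance near 500 Hz at fs = 8 kHz,
# from a pole pair at radius r and angle theta.
r, theta = 0.9, 2 * np.pi * 500 / 8000
a = [2 * r * np.cos(theta), -r * r]

voiced_speech = lpc_synthesis(excitation(True, 400), a)
unvoiced_speech = lpc_synthesis(excitation(False, 400), a)
```

Both outputs share the same spectral envelope, set by `a`; only the excitation chosen by the switch differs.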
power cepstrum
LPC cepstrum
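A minimal sketch of the power cepstrum, taken here as the inverse DFT of the log power spectrum (some texts additionally square the result). The test signal is an illustrative assumption: an impulse plus a half-amplitude copy delayed by 20 samples, which produces a cepstral peak at quefrency 20. Periodic pitch pulses show up in the cepstrum through the same mechanism.

```python
import numpy as np

def power_cepstrum(x, eps=1e-12):
    """Inverse DFT of the log power spectrum of x.

    eps guards against log(0) when a spectral bin is exactly zero.
    """
    spectrum = np.fft.fft(x)
    log_power = np.log(np.abs(spectrum) ** 2 + eps)
    return np.real(np.fft.ifft(log_power))

# Impulse plus an echo of amplitude 0.5 delayed by 20 samples.
n, delay, alpha = 256, 20, 0.5
x = np.zeros(n)
x[0] = 1.0
x[delay] = alpha

c = power_cepstrum(x)

# Search positive quefrencies only (the cepstrum of a real signal is symmetric).
peak_quefrency = int(np.argmax(c[1:n // 2])) + 1
```

Because the logarithm turns the product of source and filter spectra into a sum, the slowly varying envelope lands at low quefrency while the repetition lands at the delay.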
Speech hearing & perception mechanisms
Hearing Organs
Discovery: ΔI = K I
Example
Tactile sense: place coins in each hand
subject could discriminate between 10 coins and 11,
but not 20/21, but could 20/22!
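The coin example can be checked with a one-line rule: a change is noticeable when it is at least a fixed fraction K of the starting intensity. A minimal sketch, assuming K = 1/10, a value chosen only because it reproduces the three cases above; the function name is illustrative.

```python
from fractions import Fraction

def just_noticeable(i_base, i_test, k=Fraction(1, 10)):
    """True if the change from i_base to i_test is at least k * i_base."""
    return abs(i_test - i_base) >= k * i_base

ten_vs_eleven = just_noticeable(10, 11)         # 1 coin is 10% of 10
twenty_vs_twentyone = just_noticeable(20, 21)   # 1 coin is only 5% of 20
twenty_vs_twentytwo = just_noticeable(20, 22)   # 2 coins are 10% of 20
```

Exact fractions are used so that the borderline cases (a change of exactly 10%) are not affected by floating-point rounding.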
Bill Gates
Y = A log I + B
Fechner Day (October 22 1850)
d(dB) = 10 log10 (P1 / P2)
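The decibel formula as a small helper. A minimal sketch; the function name is illustrative.

```python
import math

def power_ratio_db(p1, p2):
    """Difference in dB between two powers: 10 * log10(p1 / p2)."""
    return 10.0 * math.log10(p1 / p2)

factor_100 = power_ratio_db(100.0, 1.0)   # a 100-fold power ratio
doubling = power_ratio_db(2.0, 1.0)       # doubling the power
```

Equal power ratios give equal steps in dB, which is the logarithmic behaviour described by the law above.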
Companding
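One common form of companding is the μ-law curve, sketched below in its continuous form. The slide does not say which companding law is meant, so the choice of μ-law and of μ = 255 (the value used in North American telephony) is an assumption, as are the function names.

```python
import numpy as np

MU = 255.0  # assumed companding parameter

def mu_law_compress(x, mu=MU):
    """Logarithmic compression of samples in [-1, 1]."""
    x = np.clip(x, -1.0, 1.0)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

x = np.array([-1.0, -0.1, -0.01, 0.0, 0.01, 0.1, 1.0])
y = mu_law_compress(x)
x_back = mu_law_expand(y)
```

Small amplitudes are stretched (0.01 maps to roughly 0.23) so that a uniform quantiser applied to `y` spends more of its levels on quiet sounds, in line with logarithmic loudness perception.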
Frequency warping f
Melody: 1 kHz = 1000 mel, JND afterwards, M ≈ 1000 log2 (1 + f[kHz])
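The warping formula above and its inverse, as a minimal sketch. The function names are illustrative; the formula is the approximation given on the slide, with frequency passed in Hz and converted to kHz inside.

```python
import math

def hz_to_mel(f_hz):
    """M = 1000 * log2(1 + f_kHz), with f given in Hz."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)

def mel_to_hz(m):
    """Inverse mapping: f_kHz = 2**(M / 1000) - 1, returned in Hz."""
    return 1000.0 * (2.0 ** (m / 1000.0) - 1.0)

anchor = hz_to_mel(1000.0)      # the 1 kHz anchor point
three_khz = hz_to_mel(3000.0)   # 1 + 3 = 4 = 2**2
```

The mapping is close to linear well below 1 kHz and logarithmic above it, so equal steps in M correspond to ever wider steps in Hz at high frequencies.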
Inverse
E
Filter
Speech Quality Measurement
Why does it sound
the way it sounds?
PSTN
BW = 0.2-3.8 kHz, SNR > 30 dB
PCM, ADPCM (BER 10^-3)
five nines reliability
line echo cancellation
Old Measures
meet neat seat feet Pete beat heat
5/9
DRT
DAM
Gain
Delay
Phase
Nonlinear processing
[Block diagram: speech → channel → QM → (QM to MOS) → MOS estimate]
[Block diagram: two parallel Perceptual model → Internal Representation branches feed Audible Difference → Cognitive Model]
Cognitive Modelling
Loudness scaling
Internal cognitive noise
Asymmetry
Silent interval processing
PSQM Values
0 (no degradation) to 6.5 (maximum degradation)
Conversion to MOS
PSQM to MOS calibration using known references
Equivalent Q values
[Block diagram: Perceptual model → Internal Representation (two parallel branches)]