Mfccs PDF
Krzysztof Ślot
Politechnika Łódzka, Instytut Informatyki Stosowanej
Course information
• Schedule
– Meeting time: Wed 2pm-4pm, E110
– Lectures and labs alternate; the actual schedule is TBA
• Assessment
– Test
– Lab projects
• Contact information
– kslot@p.lodz.pl
– Office hours: Wednesdays, 3pm-4pm, building C7, room 6
Course outline
• Subject
– Speech recognition by computers
– Input: a sequence of samples representing sounds (electroacoustic conversion and A/D conversion will not be covered)
– Output: phonetic transcription / words
[Diagram: MIC → A/D → speech recognition algorithm → "Some text"]
• Topics
– Speech signal: production and characterization (phone classes)
– Speech signal representation (LPC, cepstrum, MFCC)
– Sequence matching (Dynamic Time Warping)
– Hidden Markov Models
– Language modeling
ASR preliminaries
• General framework
– Define the context and objectives, get training data
– Determine target categories and decide on class-modeling strategy
– Derive appropriate quantitative representation of speech signal
– Decide on recognition strategy
– Build models for considered categories and train their parameters
– Run the adopted recognition scheme
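[Diagram: choices to be made at each stage of the framework]
– Context: IWR – CSR; clean – noisy; dep. – indep.; small – large voc.; …
– Target categories: language model, phonetic transcription, word set
– Representation: MFCC, HFCC, LPC, PLP, other
– Recognition: DTW, HMM, neural nets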
ASR preliminaries
• Basic concepts
– Sounds and their representation
– Speech signal
Sounds
[Figure: waveforms of example sounds: bell (0.2 s), chirp (5 s), crickets (5 s), roar (0.2 s), woof (0.2 s)]
Sound perception and representation
[Figure: magnitude spectra (frequency axis f [Hz]) of sine, chirp, roar and woof; signals y1(t), y2(t)]
Speech production
[Figure: human vocal tract and simplified vocal tract model. Phonation: lungs, trachea, larynx, vocal folds. Articulation: throat cavity, oral cavity, nasal cavity, tongue. Output: speech]
Speech production
• Phonation
– Vocal folds either vibrate (voiced) or do not vibrate (voiceless)
• Voiced phonation
– Multi-harmonic
– Fundamental frequency (pitch)
– Voiced speech
• Unvoiced phonation
– Vocal folds do not vibrate
– Turbulent flow
– Unvoiced speech
Speech production: source-filter model
• Source
– Voiced excitation: vocal folds vibrate
– Unvoiced excitation
• Filter
– Resonant cavities of dynamically changing structure
[Diagram: voiced / unvoiced excitation → filter]
Speech signal: acoustics
• Phone types
– Vowels – flow of air is not impeded: a e o i u
– Consonants – impeded or stopped
– Nasals – formed by opening nasal cavity (m,n)
– Fricatives – formed by impeding air-flow: may be voiced (v, z, th as in this) or voiceless (f, s, sh, th as in thought, h)
– Stops – complete stopping then releasing air: voiced (b,d,g),
voiceless (p,t,k)
– Liquids – r,l
– Semi-vowels: w as in wet, y as in yard
Phonetic transcription
• Highlights
– Textual representation of sounds, i.e. of what we hear
– Several standards; the most common is IPA (International Phonetic Alphabet)
– There are more symbols than letters
– This is what is to be recognized in CSR
[Figure: IPA]
Voiced / unvoiced speech
• Clear differences
– Voiced: Few dominant frequencies
– Unvoiced: wide-band spectrum
Speech signal properties
ðɪ s ɪ z ðə fər st lɛk tʃ ər
Speech representation
• Corollary
– Spectral representation seems adequate
– Information content: very rich
– Is spectrum an appropriate basis for recognition?
• Spectrum
– Decomposition of a function w.r.t. some set of bases
– Many possible basis functions
– Periodic basis functions: also several candidates
– Discrete functions – discrete bases
Discrete Fourier Transform: a review
• Highlights
– Decomposition of a (periodic) signal into harmonic components
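– The sampled signal $x(t_n) = x_n$ is expressed as a composition of bases:
$$x_n = c \sum_{k=0}^{N-1} X_k \, b_{k,n}$$
where $X_k$ is the k-th weight, $b_{k,n}$ is the k-th basis function and $c$ is a constant
[Diagram: x(t) is matched against bases b_p, …, b_q, giving weights X_p, …, X_q; the weighted bases are summed to give x_n]
– DFT:
$$X_k = \sum_{n=0}^{N-1} x(t_n) \, e^{-j \omega_k t_n} = \sum_{n=0}^{N-1} x(t_n) \left( \cos \omega_k t_n - j \sin \omega_k t_n \right), \qquad \omega_k = 2\pi f_k = 2\pi \frac{k}{N} f_s$$
where $f_s$ is the sampling frequency and $N$ is the number of samples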
Discrete Fourier Transform: a review
DFT: $$X_k = \sum_{n=0}^{N-1} x_n \, e^{-j \frac{2\pi}{N} k n}$$
Inverse DFT: $$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{j \frac{2\pi}{N} k n}$$
• Comments
– Weights are regularly spaced in the frequency domain: $f_k = \frac{k}{N} f_s$
– Sampling frequency matters: Nyquist condition
– Number of samples in t and f is the same
– The more points, the higher the resolution
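The DFT pair and the bin spacing above can be checked numerically. The following is a minimal sketch in plain Python, not an efficient implementation (a real system would use an FFT). The test signal, its sampling frequency and the function names are assumptions chosen for illustration.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT: X_k = sum_n x_n * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT: x_n = (1/N) * sum_k X_k * exp(+j*2*pi*k*n/N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def bin_frequency(k, N, fs):
    """Frequency in Hz of the k-th DFT weight: f_k = k * fs / N."""
    return k * fs / N

# Assumed test signal: a 1 kHz cosine sampled at fs = 8 kHz, N = 64 samples.
fs, N = 8000.0, 64
x = [math.cos(2 * math.pi * 1000.0 * n / fs) for n in range(N)]
X = dft(x)

# Strongest weight below the Nyquist frequency (bins 0 .. N/2 - 1).
peak = max(range(N // 2), key=lambda k: abs(X[k]))
peak_hz = bin_frequency(peak, N, fs)

# The inverse transform recovers the original samples.
x_back = [v.real for v in idft(X)]
round_trip_error = max(abs(a - b) for a, b in zip(x, x_back))
```

With these values the bins are spaced by fs/N = 125 Hz, so the 1 kHz tone falls exactly on one weight.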
Signal spectrum
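• The answer (is the spectrum an appropriate basis for recognition?)
– Of course not
– "I am a boy, not a girl" and "I am a girl, not a boy" contain the same sounds in a different order, so their spectra are nearly the same
– The spectrum lacks temporal resolution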
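The loss of temporal information can be illustrated with a small sketch. Two assumed test signals are built from the same two tones in opposite order; swapping the halves is a circular shift, which changes only the phases of the DFT weights, so the magnitude spectra coincide although the signals differ.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitudes |X_k| of the direct O(N^2) DFT of x."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
            for k in range(N)]

# Assumed test data: two 32-sample tones with 4 and 10 periods each.
tone_a = [math.sin(2 * math.pi * 4 * n / 32) for n in range(32)]
tone_b = [math.sin(2 * math.pi * 10 * n / 32) for n in range(32)]

a_then_b = tone_a + tone_b   # "A, then B"
b_then_a = tone_b + tone_a   # "B, then A": same content, different order

mag_ab = dft_magnitudes(a_then_b)
mag_ba = dft_magnitudes(b_then_a)
max_diff = max(abs(p - q) for p, q in zip(mag_ab, mag_ba))
```

Here `max_diff` is at the level of rounding error, so a recognizer that sees only the magnitude spectrum of the whole signal cannot tell the two orderings apart.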
Spectrum and signal recognition
http://www.public.iastate.edu/~rpolikar/WAVELETS/WTpart1.html
Short-time Fourier Transform: STFT
• An objective
– To preserve information on temporal evolution (provide time resolution)
• Temporal windows
– Extract parts of a signal – define domains of subsequent DFT analyses
[Figure: windowing function w(n) with values between 0 and 1; consecutive windows w(n), w(n-200), w(n-400), w(n-600) and overlapping windows w(n-100), w(n-300) along the sample axis n]
K. Ślot: Speech recognition/Introduction
Short-time Fourier Transform: STFT
• Signal representation
– Spectra computed for consecutive time ‘frames’
[Figure: signal and its spectrum]
Windowing function: $w(k) = e^{-\left[ \frac{k - k_0}{a} \right]^2}$
[Figure: speech signal and its spectrogram (frequency axis f)]
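The STFT idea, windowing overlapping frames and computing a DFT for each, can be sketched as follows. This is an illustrative plain-Python version, not an optimized one. The frame length, the hop, the window width `a` and the test signal are assumptions; the window follows the Gaussian form $w(k) = e^{-[(k-k_0)/a]^2}$ with $k_0$ placed at the frame center.

```python
import cmath
import math

def gaussian_window(length, a):
    """w(k) = exp(-((k - k0) / a)^2), with k0 at the center of the frame."""
    k0 = (length - 1) / 2.0
    return [math.exp(-((k - k0) / a) ** 2) for k in range(length)]

def stft_magnitudes(x, frame_len, hop, window):
    """One magnitude spectrum (bins 0 .. frame_len/2) per overlapping frame."""
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(frame_len // 2 + 1)
        ])
    return frames

# Assumed test signal: a tone at bin 4 followed by a tone at bin 12.
frame_len, hop = 64, 32            # 50% overlap between neighboring frames
x = ([math.sin(2 * math.pi * 4 * n / frame_len) for n in range(256)] +
     [math.sin(2 * math.pi * 12 * n / frame_len) for n in range(256)])

spec = stft_magnitudes(x, frame_len, hop, gaussian_window(frame_len, a=16.0))

# The dominant bin changes over time, which a single DFT could not show.
first_peak = max(range(len(spec[0])), key=lambda k: spec[0][k])
last_peak = max(range(len(spec[-1])), key=lambda k: spec[-1][k])
```

Each row of `spec` is one time slice of the spectrogram; plotting the rows side by side gives the familiar time-frequency picture.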
Spectrograms
• Spectrogram contents
– Information on temporal dynamics of speech (sound) production
– The resonant cavities characteristic of individual phones are reflected in the spectral content of the spectrogram
[Figure: spectrogram of the utterance "This is the first lecture", with labels "i" and "e"]
Speech representation
• Spectrogram features
– Bands of high energies: formants
• Formants
– Produced in resonant cavities: local energy maxima in various frequency bands
– Vary in time as different phones get articulated
– Form patterns characteristic of phones
– Exist for voiced speech only; typically up to four (can be hard to extract)
– Can be used e.g. for vowel classification, but are insufficient for speech representation: a richer representation is required
Spectrogram computation considerations
• Spectrogram
– Provides information on the temporal evolution and frequency composition of the signal in windows
– Should carry all relevant information
– Q1: what frequencies should be covered?
– Q2: what should the duration of a window be?
http://en.wikipedia.org
Window length
Windowing functions
[Diagram: x(n) → DFT → |X(k)| → IDFT → x*(n)]
• The problem
– The DFT treats the windowed segment as one period of a periodic function, which is in general discontinuous at the segment boundaries
– For a rectangular window distortions are inevitable: the spectrum 'leaks' over neighboring frequencies
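Leakage is easy to reproduce with a rectangular window. In the sketch below (assumed test tones and an arbitrary 1% threshold), a sine with a whole number of periods in the window occupies a single DFT bin, while a sine with 8.5 periods, which is discontinuous when repeated periodically, spreads over many bins.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitudes |X_k| of the direct O(N^2) DFT of x."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
            for k in range(N)]

def significant_bins(x, rel=0.01):
    """Count bins below Nyquist whose magnitude exceeds rel * (largest magnitude)."""
    mags = dft_magnitudes(x)[:len(x) // 2]
    top = max(mags)
    return sum(1 for m in mags if m > rel * top)

N = 64
on_bin = [math.sin(2 * math.pi * 8.0 * n / N) for n in range(N)]    # 8 whole periods
off_bin = [math.sin(2 * math.pi * 8.5 * n / N) for n in range(N)]   # 8.5 periods

count_on = significant_bins(on_bin)     # energy confined to one bin
count_off = significant_bins(off_bin)   # energy leaks into neighboring bins
```

Tapered windows reduce this effect by bringing the segment smoothly to (or near) zero at its ends, at the cost of a wider main lobe.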
Windowing functions
[Figure: rectangular window in the time domain (axis t) and its DFT in the frequency domain. Source: http://www.dsprelated.com/freebooks/sasp/]
Windowing functions
• Hanning window
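$$W_{HN}(n) = W_R(n) \cos^2\left( \frac{\pi n}{N} \right)$$
– a=0.5, b=0.25 (coefficients of the generalized form $W_R(n) \left[ a + 2b \cos\left( \frac{2\pi n}{N} \right) \right]$)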
• Hamming window
– a=0.54, b=0.23 (chosen to completely cancel the largest side lobe)
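Both windows can be generated from one expression. The sketch below assumes the generalized form $w(n) = a + 2b\cos(2\pi n / N)$ on the rectangular support $n = -N/2, \dots, N/2$, which is the reading of the coefficients a and b used here; the function name and the length N = 64 are illustrative choices.

```python
import math

def generalized_window(N, a, b):
    """w(n) = a + 2*b*cos(2*pi*n/N) for n = -N/2 .. N/2 (assumed generalized form)."""
    half = N // 2
    return [a + 2 * b * math.cos(2 * math.pi * n / N) for n in range(-half, half + 1)]

N = 64
hanning = generalized_window(N, a=0.5, b=0.25)
hamming = generalized_window(N, a=0.54, b=0.23)

# With a = 0.5, b = 0.25 the expression equals cos^2(pi*n/N).
cos_squared = [math.cos(math.pi * n / N) ** 2 for n in range(-N // 2, N // 2 + 1)]
hanning_vs_cos2 = max(abs(u - v) for u, v in zip(hanning, cos_squared))

# End-point and center values: Hanning reaches zero, Hamming stays at a - 2b.
hanning_edge, hamming_edge = hanning[0], hamming[0]
hanning_center, hamming_center = hanning[N // 2], hamming[N // 2]
```

The Hamming window does not fall to zero at its ends (0.54 - 0.46 = 0.08), which is the price of cancelling the largest side lobe.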
Windowing functions
• Other candidates
– Blackman-Harris window family: similar concept as for Hanning/Hamming, but more sinc functions are used
– Triangular (Bartlett) window: convolution of two rectangular windows
– Poisson window
– Gaussian, Kaiser, …
[Figure source: http://www.dsprelated.com/freebooks/sasp/]
Speech representation
• Spectrogram
– A sequence of slices. A slice: a vector of DFT magnitudes for a window
– A sequence of vectors
Speech representation
• Feature vectors
– All DFT components?
[Figure: spectrum of 'e' (30 ms) and spectrum of 'i' (30 ms)]
• Observation 1
– Huge within-class variability, unclear between-class differences: probably a useless representation
– A better representation needs to be found
Speech representation
• Observation 2
– The spectrum looks like a mixture of slowly varying and rapidly varying components
[Figure: spectrum with formants marked]
Speech representation
• Speech signal
– Convolution of source (vocal fold excitation) and filter (articulation cavities)
– Its spectrum is the product of the two spectra
– We want to get rid of the excitation (for us it is noise)
– We are in the frequency domain: how to proceed?
Speech representation
• The problem
– Linear separation of components given in a product form
$$S(f) = A(f) \, P(f)$$
• Solution
– Homomorphic filtering (simple trick: use logarithms)
$$S(f) = A(f) \, P(f) \;\Rightarrow\; \log S(f) = \log A(f) + \log P(f)$$
– The log-spectrum becomes a sum of components, so it can be approached using the presented framework
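One common way to exploit this additivity is the real cepstrum, taken here as the inverse DFT of the log-magnitude spectrum. The sketch below is an illustration under assumed toy data: an impulse-train "excitation" and a short decaying "filter" are circularly convolved, and the cepstrum of the result equals the sum of the two component cepstra.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def real_cepstrum(x):
    """Inverse DFT of log|X_k| (requires a spectrum without exact zeros)."""
    N = len(x)
    log_mag = [math.log(abs(X_k)) for X_k in dft(x)]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * n / N)
                for k in range(N)).real / N
            for n in range(N)]

def circular_convolve(a, p):
    """Circular convolution, whose DFT is exactly the product A_k * P_k."""
    N = len(a)
    return [sum(a[m] * p[(n - m) % N] for m in range(N)) for n in range(N)]

N = 64
# Assumed toy source: one pulse every 10 samples (a crude "voiced" excitation).
excitation = [1.0 if n % 10 == 0 else 0.0 for n in range(N)]
# Assumed toy filter: short decaying impulse response.
filt = [0.5 ** n if n < 8 else 0.0 for n in range(N)]

speech = circular_convolve(filt, excitation)

c_speech = real_cepstrum(speech)
c_sum = [u + v for u, v in zip(real_cepstrum(filt), real_cepstrum(excitation))]
max_err = max(abs(u - v) for u, v in zip(c_speech, c_sum))
```

Because the two contributions are now added rather than multiplied, the slowly varying filter part can be separated from the rapidly varying excitation part by keeping only the low-order cepstral coefficients.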
Cepstral mean subtraction
• Cepstral mean
– Cepstral vector: combination of speech and channel components: $C_i = C_i^{s} + C^{ch}$
– Channel component does not change
– Mean of cepstral vectors:
$$\mu = \frac{1}{n} \sum_{i=0}^{n-1} \left( C_i^{s} + C^{ch} \right) = \frac{1}{n} \sum_{i=0}^{n-1} C_i^{s} + C^{ch}$$
– Mean cepstrum of speech is assumed to be zero (as the mean of the original signal is zero), hence $\mu = C^{ch}$
– Mean subtraction: channel elimination
$$\tilde{C}_k = C_k^{s} + C^{ch} - \mu = C_k^{s}$$
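The subtraction can be sketched in a few lines. The numbers below are made-up toy cepstra whose speech part has zero mean by construction, so the constant channel offset is removed exactly; with real speech the zero-mean assumption holds only approximately.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames from every frame."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in frames]

# Assumed toy data: zero-mean "speech" cepstra plus a constant channel term.
speech = [[1.0, -2.0], [-3.0, 0.5], [2.0, 1.5]]
channel = [0.7, -0.4]
observed = [[s + c for s, c in zip(frame, channel)] for frame in speech]

cleaned = cepstral_mean_subtraction(observed)
max_err = max(abs(u - v)
              for got, want in zip(cleaned, speech)
              for u, v in zip(got, want))
```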
Mel-frequency scale and MFCC
• Psychology of hearing
– Perception of frequencies is nonlinear: linear up to 1 kHz, then logarithmic
– Masking phenomenon and critical bands
$$F_{Mel} = 2595 \log_{10}\left( 1 + \frac{F_{Hz}}{700} \right)$$
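The mapping and its inverse are straightforward to compute. A small sketch follows; the function names are illustrative, and the inverse is obtained by solving the formula for the frequency in Hz.

```python
import math

def hz_to_mel(f_hz):
    """F_mel = 2595 * log10(1 + F_hz / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping: F_hz = 700 * (10^(F_mel / 2595) - 1)."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

mel_at_1k = hz_to_mel(1000.0)                 # close to 1000 mel
mel_at_4k = hz_to_mel(4000.0)                 # far less than 4000 mel
round_trip = mel_to_hz(hz_to_mel(440.0))      # recovers 440 Hz
```

The constants are chosen so that 1000 Hz maps to roughly 1000 mel, while higher frequencies are increasingly compressed.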
Mel-frequency scale and MFCC
• Masking
– Hiding of neighboring tones by a stronger component
– Can be explained only if we assume that we perceive sounds within bands (critical bands)
– The ear acts as a set of filters
– This cannot be ignored when modeling speech
Mel-frequency scale and MFCC
• Mel-filter bank
– 24 triangular overlapping filters
– Centers: linear in Mel-scale
– Bandwidth: masking range
[Figure: Mel-filter bank, |A| versus f, with frequency axis marks at 100, 300, 1000 and 1741; a single triangular filter centered at f0 and spanning f0 − B to f0 + B]
• Mel-spectrum
– Speech spectrum is integrated within critical bands
– Can be seen as spectrum downsampling
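A filter bank of this kind can be sketched as below. The construction is one common convention and an assumption here, not necessarily the one behind the figure: filter centers are equally spaced on the Mel scale, and each triangle extends from the previous center to the next one. The sampling frequency, the number of DFT bins and all function names are illustrative.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def mel_filter_bank(num_filters, n_bins, fs):
    """Triangular weights over DFT bins 0 .. n_bins-1 covering 0 .. fs/2."""
    mel_hi = hz_to_mel(fs / 2.0)
    # num_filters + 2 edge frequencies, equally spaced on the Mel scale.
    edges = [mel_to_hz(i * mel_hi / (num_filters + 1))
             for i in range(num_filters + 2)]
    bin_hz = [k * (fs / 2.0) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if left < f <= center:
                row.append((f - left) / (center - left))     # rising slope
            elif center < f < right:
                row.append((right - f) / (right - center))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mel_spectrum(magnitudes, bank):
    """Integrate (weighted sum) the spectrum within each filter."""
    return [sum(w * m for w, m in zip(row, magnitudes)) for row in bank]

# Assumed setup: 24 filters, 257 DFT bins (512-point DFT), fs = 16 kHz.
bank = mel_filter_bank(24, n_bins=257, fs=16000.0)
flat_spectrum = [1.0] * 257
mel_spec = mel_spectrum(flat_spectrum, bank)
```

For a flat input spectrum the outputs grow with the filter index, because filters that are equally wide in Mel become wider in Hz at high frequencies. The 257-element spectrum is reduced to 24 values.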
Mel-frequency scale and MFCC
• MFCC
– Spectrum is accumulated in bins defined by Mel-filters
– 24 filters – 24 intervals along frequency axis
– Mel-Cepstrum computed from 24-element Mel-spectrum
– Result: Mel-Frequency Cepstral Coefficients (MFCC)
– Low-order MFCCs are used to represent articulation (typically up to 12)
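The last step, from a 24-element Mel-spectrum to cepstral coefficients, can be sketched as a logarithm followed by a cosine transform. The use of an unnormalized DCT-II, the small floor inside the logarithm and the decision to keep coefficients 0 to 11 are assumptions of this sketch; conventions differ between toolkits. The input is a synthetic Mel-spectrum with a cosine-shaped log envelope.

```python
import math

def dct_ii(v):
    """Unnormalized DCT-II: c_q = sum_m v_m * cos(pi * q * (m + 0.5) / M)."""
    M = len(v)
    return [sum(v[m] * math.cos(math.pi * q * (m + 0.5) / M) for m in range(M))
            for q in range(M)]

def mfcc_from_mel_spectrum(mel_spec, num_coeffs=12, floor=1e-10):
    """Log of the Mel-spectrum, then DCT; keep the low-order coefficients."""
    log_mel = [math.log(max(e, floor)) for e in mel_spec]
    return dct_ii(log_mel)[:num_coeffs]

# Assumed synthetic input: log envelope 2 + cos(2 * pi * (m + 0.5) / M).
M = 24
mel_spec = [math.exp(2.0 + math.cos(math.pi * 2 * (m + 0.5) / M)) for m in range(M)]
coeffs = mfcc_from_mel_spectrum(mel_spec)
```

Because the log envelope is a constant plus a single cosine, only coefficients 0 and 2 are non-zero. This shows how a smooth spectral shape is captured by a few low-order coefficients.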
Modeling articulation dynamics
• Dynamics
– Rate of change: delta-MFCC, the difference between the MFCCs of consecutive frames
– Second derivative of the MFCCs: delta-delta-MFCC
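Both dynamic features follow from one frame-difference function applied once or twice. In the sketch below the first frame receives zeros so that the number of frames is preserved; this padding is an assumption, and practical systems often use a regression over several neighboring frames instead of a simple difference. The MFCC values are made up.

```python
def delta(frames):
    """Difference between consecutive frames; the first frame gets zeros."""
    dim = len(frames[0])
    out = [[0.0] * dim]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out

# Assumed toy MFCC sequence: 4 frames, 2 coefficients each.
mfcc = [[1.0, 0.0], [2.0, 0.5], [4.0, 0.5], [7.0, 0.0]]

delta_mfcc = delta(mfcc)               # first-order dynamics
delta_delta_mfcc = delta(delta_mfcc)   # second-order dynamics
```

A feature vector for one frame is then typically the concatenation of the MFCCs with their delta and delta-delta values.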