ASR: Training & Systems: EE E6820: Speech & Audio Processing & Recognition
Lecture 11:
ASR: Training & Systems
1 Training HMMs
2 Language modeling
- states q_i:  •  k  a  t  •
- transition probabilities a_ij, with p(q_n = j | q_{n-1} = i) ≡ a_ij:

          to:   k    a    t    •
   from •      1.0  0.0  0.0  0.0
        k      0.9  0.1  0.0  0.0
        a      0.0  0.9  0.1  0.0
        t      0.0  0.0  0.9  0.1

- emission distributions b_i(x), with p(x | q = i) ≡ b_i(x)
+ (initial state probabilities p(q_1 = i) ≡ π_i)
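The three parameter sets above (π_i, a_ij, b_i(x)) can be captured in a minimal sketch; the "cat" state names and toy discrete emission alphabet here are illustrative, not from the lecture:

```python
# Minimal discrete-output HMM sketch: initial probs pi, transitions A, emissions B.
import random

states = ["k", "a", "t"]
pi = {"k": 1.0, "a": 0.0, "t": 0.0}             # initial-state probabilities p(q_1)
A = {                                            # transition probabilities a_ij
    "k": {"k": 0.9, "a": 0.1, "t": 0.0},
    "a": {"k": 0.0, "a": 0.9, "t": 0.1},
    "t": {"k": 0.0, "a": 0.0, "t": 1.0},         # final state absorbs (toy choice)
}
B = {                                            # emission distributions b_i(x)
    "k": {"K": 0.8, "A": 0.1, "T": 0.1},
    "a": {"K": 0.1, "A": 0.8, "T": 0.1},
    "t": {"K": 0.1, "A": 0.1, "T": 0.8},
}

def draw(dist, rng):
    """Sample a key from a {key: probability} distribution."""
    r, acc = rng.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against float round-off

def sample(n, seed=0):
    """Generate n (state, observation) pairs from the HMM."""
    rng = random.Random(seed)
    q = draw(pi, rng)
    path = []
    for _ in range(n):
        path.append((q, draw(B[q], rng)))
        q = draw(A[q], rng)
    return path
```

Sampling from the model is a quick sanity check that the rows of A are proper distributions.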
• See e6820/papers/Rabiner89-hmm.pdf
• Given states Q = {q_1, q_2, …, q_N}
  + observations X = X_1^N = {x_1, x_2, …, x_N}
  [Figure: model M is assumed, state sequence Q is hidden, observations X are observed; best model M* and state sequence Q* are inferred]
2 Training HMMs
- Viterbi training
- EM for HMM parameters
- Forward-backward (Baum-Welch)
3 Language modeling
[Figure: Viterbi training — training data (utterances of "one", "two", "three", "four", "five") on the left, phone models ("w ah n", "t uw", "th r iy", …) on the right; a Viterbi pass against the current models yields state labels Q*]
• “Re-estimate model parameters”:
  - pdf, e.g. 1-D Gaussian:  µ_i = ( Σ_{n : q_n = i} x_n ) / #(q_n = i)
  - count transitions:  a_ij = #(q_{n-1} = i → q_n = j) / #(q_n = i)
• And repeat...
• But: converges only if good initialization
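The two re-estimation formulas can be computed directly from a fixed Viterbi alignment; a minimal sketch with made-up function and variable names:

```python
# Viterbi-style re-estimation from hard state labels (illustrative sketch).
# Given observations xs and aligned labels qs, re-estimate:
#   mu_i = sum_{n: q_n = i} x_n / #(q_n = i)
#   a_ij = #(q_{n-1} = i -> q_n = j) / #(q_n = i)
from collections import Counter, defaultdict

def reestimate(xs, qs):
    n_i = Counter(qs)                       # state occupancy counts #(q_n = i)
    mu = defaultdict(float)
    for x, q in zip(xs, qs):
        mu[q] += x / n_i[q]                 # running mean of frames in state q
    trans = Counter(zip(qs, qs[1:]))        # transition counts i -> j
    a = {(i, j): c / n_i[i] for (i, j), c in trans.items()}
    return dict(mu), a
```

Note the slide's denominator #(q_n = i) counts all occupancies of state i, as written, rather than only the frames that have a successor.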
[Figure: EM iteration — re-estimate the unknowns, giving successive parameter estimates Θ]
  ∑_{all Q^k} p(Q^k | X, Θ_old) · log [ p(X | Q^k, Θ) · p(Q^k | Θ) ]
    = ∑_{all Q^k} p(Q^k | X, Θ_old) · log ∏_n p(x_n | q_n) · p(q_n | q_{n-1})
    = ∑_{n=1..N} ∑_{i=1..S} p(q_n^i | X, Θ_old) · log p(x_n | q_n^i, Θ)
      + ∑_{i=1..S} p(q_1^i | X, Θ_old) · log p(q_1^i | Θ)
      + ∑_{n=2..N} ∑_{i=1..S} ∑_{j=1..S} p(q_{n-1}^i, q_n^j | X, Θ_old) · log p(q_n^j | q_{n-1}^i, Θ)
• Hence, define β_n(i) = p(X_{n+1}^N | q_n^i, X_1^n)
  so that α_n(i) · β_n(i) = p(q_n^i, X_1^N)
  then p(q_n^i | X_1^N) = α_n(i) · β_n(i) / ∑_j α_n(j) · β_n(j)
  [Figure: trellis step — α_{n-1}(i) at state q_{n-1}^i, transition a_ij into state q_n^j with emission b_j(x_n), and backward term β_n(j)]
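The α/β recursions and the resulting state posterior can be sketched in a few lines; model sizes and numbers in the test are toy values, not from the lecture:

```python
# Forward-backward sketch: alpha_n(i) = p(X_1^n, q_n = i),
# beta_n(i) = p(X_{n+1}^N | q_n = i), and posterior
# gamma_n(i) = alpha_n(i) beta_n(i) / sum_j alpha_n(j) beta_n(j).

def forward_backward(pi, A, B, obs):
    S, N = len(pi), len(obs)
    alpha = [[0.0] * S for _ in range(N)]
    beta = [[1.0] * S for _ in range(N)]          # beta_N(i) = 1
    for i in range(S):                            # forward initialization
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for n in range(1, N):                         # forward recursion
        for j in range(S):
            alpha[n][j] = sum(alpha[n-1][i] * A[i][j] for i in range(S)) * B[j][obs[n]]
    for n in range(N - 2, -1, -1):                # backward recursion
        for i in range(S):
            beta[n][i] = sum(A[i][j] * B[j][obs[n+1]] * beta[n+1][j] for j in range(S))
    gamma = []                                    # per-frame state posteriors
    for n in range(N):
        z = sum(alpha[n][i] * beta[n][i] for i in range(S))
        gamma.append([alpha[n][i] * beta[n][i] / z for i in range(S)])
    return alpha, beta, gamma
```

A useful invariant: ∑_i α_n(i)·β_n(i) equals p(X_1^N) at every frame n.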
E6820 SAPR - Dan Ellis L11 - Training 2003-04-28 - 14
GMM-HMMs in practice
• GMMs as acoustic models:
train by including mixture indices as unknowns
- just more complicated equations...
  µ_ik = [ ∑_n p(m_k | q^i, x_n, Θ_old) · p(q_n^i | X, Θ_old) · x_n ]
         / [ ∑_n p(m_k | q^i, x_n, Θ_old) · p(q_n^i | X, Θ_old) ]
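The µ_ik update is just a responsibility-weighted average. A sketch where the E-step posteriors are supplied as plain per-frame lists (the values in the test are made up, standing in for real E-step output):

```python
# GMM-HMM mean update sketch:
#   mu_ik = sum_n r_n x_n / sum_n r_n,
# where r_n = p(m_k | q^i, x_n, Theta_old) * p(q_n^i | X, Theta_old).

def update_mean(mix_post, state_post, xs):
    """Weighted-average mean for one (state i, mixture k) pair."""
    num = sum(m * s * x for m, s, x in zip(mix_post, state_post, xs))
    den = sum(m * s for m, s in zip(mix_post, state_post))
    return num / den
```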
• Practical GMMs:
- 9 to 39 feature dimensions
- 2 to 64 Gaussians per mixture
depending on number of training examples
• Lots of data → can model more classes
- e.g. context-independent (CI): q_i = ae aa ax ...
  → context-dependent (CD): q_i = b-ae-b b-ae-k ...
• EM training loop:
  - Initialization: parameters Θ from uniform initialization / uniform alignments
    of the training transcriptions (e.g. "dh ax k ae t", "s ae t aa n")
  - E-step: estimate probabilities of unknowns p(q_n^i | X_1^N, Θ_old)
  - M-step: maximize via parameters Θ: max E[log p(X, Q | Θ)]
  - Repeat until convergence
[Figure: embedded training — word models (ONE = "w ah n", TWO = "t uw", THREE = "th r iy", FIVE, …) concatenated with optional silence (sil)]
- alignment all handled implicitly
• What do the states end up meaning?
- not necessarily what you intended;
whatever locally maximizes data likelihood
• What if the models or transcriptions are bad?
- slow convergence, poor discrimination in models
• Other kinds of data, transcriptions
- less constrained initial models...
2 Training HMMs
3 Language modeling
- Pronunciation models
- Grammars
- Decoding
• So far, looked at p(X | M_j, Θ_A)
• What about p(M_j | Θ_L)?
  - M_j is a particular word sequence
  - Θ_L are parameters related to the language
• Two components:
  - link state sequences to words, p(Q | w_i): pronunciation model p(φ | w)
  - prior on word sequences: p(w_n | w_1^{n-1}, M)
  [Figure: pronunciation network for CAT (k → {ae, aa} → t) weighted by p(φ | w), feeding a word-sequence prior over THE, CAT, DOG, ATE, SAT]
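The word-sequence prior p(w_n | w_1^{n-1}, M) is typically approximated by an n-gram. A minimal bigram estimate by relative frequency (toy corpus; no smoothing and no end-of-sentence handling):

```python
# Bigram language-model sketch: p(w_n | w_{n-1}) estimated by relative frequency.
from collections import Counter

def bigram_lm(sentences):
    uni, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s                 # sentence-start token
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    # p(w | h) = count(h, w) / count(h)
    return lambda w, h: bi[(h, w)] / uni[h] if uni[h] else 0.0

p = bigram_lm([["the", "cat", "sat"], ["the", "dog", "sat"]])
```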
ZERO(0.5) z iy r ow
ZERO(0.5) z ih r ow
ONE(1.0) w ah n
TWO(1.0) tcl t uw
...
[Figure: surface phone string "f ah ay v y uh r ow l" compared against dictionary baseforms]
- infer modification rules
- predict other pronunciation variants
• e.g. ‘d deletion’:
d → Ø / l _ [stop] p = 0.9
• Generate pronunciation variants;
use forced alignment to find weights
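A sketch of expanding a baseform with one such rule; the phone strings and the STOPS set are illustrative, and for simplicity the rule context is checked against the base form only:

```python
# Apply the 'd deletion' rule  d -> 0 / l _ [stop]  (p = 0.9) to a phone string,
# enumerating every pronunciation variant with its probability.
STOPS = {"p", "b", "t", "d", "k", "g"}

def variants(phones, p_del=0.9):
    """Expand a base pronunciation into (variant, probability) pairs."""
    out = [([], 1.0)]
    for i, ph in enumerate(phones):
        fires = (ph == "d" and i > 0 and phones[i - 1] == "l"
                 and i + 1 < len(phones) and phones[i + 1] in STOPS)
        new = []
        for seq, pr in out:
            if fires:
                new.append((seq, pr * p_del))                # rule fires: /d/ deleted
                new.append((seq + [ph], pr * (1 - p_del)))   # rule blocked: /d/ kept
            else:
                new.append((seq + [ph], pr))
        out = new
    return [(tuple(seq), pr) for seq, pr in out]
```

Each variant can then be weighted or re-weighted via forced alignment against the training data.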
[Figure: lexical tree sharing phone prefixes from the root — d-uw (DO); d-iy-k-oy (DECOY); d-iy-k-ow-d (DECODE), extended by z (DECODES) or axr (DECODER)]
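The lexical tree can be sketched as a plain trie over phone strings; pronunciations below follow the figure's words, with ARPAbet-style phones assumed:

```python
# Lexical-tree sketch: share phone prefixes across dictionary entries.
def build_tree(lexicon):
    root = {}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.setdefault(ph, {})   # descend, creating nodes as needed
        node["#"] = word                     # word-end marker at the leaf
    return root

lex = {
    "DO":      ["d", "uw"],
    "DECOY":   ["d", "iy", "k", "oy"],
    "DECODE":  ["d", "iy", "k", "ow", "d"],
    "DECODES": ["d", "iy", "k", "ow", "d", "z"],
    "DECODER": ["d", "iy", "k", "ow", "d", "axr"],
}
tree = build_tree(lex)
```

All five words share the single root arc /d/, which is exactly the saving the tree structure buys during decoding.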
2 Training HMMs
3 Language modeling
[Figure: MLP acoustic model — feature calculation over time frames t_n … t_{n+w} (feature indices C_2 … C_k), a hidden layer, and posterior outputs p(q_i | X) for phone classes such as dcl]
Model adaptation
• Practical systems often suffer from mismatch
- test conditions are not like training data:
accent, microphone, background noise ...
• Desirable to continue tuning during recognition
= adaptation
- but: no ‘ground truth’ labels or transcription
• Assume that the recognizer output is correct;
  estimate a few parameters from those labels
- e.g. Maximum Likelihood Linear Regression
(MLLR)
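MLLR estimates an affine transform of the Gaussian means from adaptation data. A heavily simplified 1-D sketch, using least squares in place of the full maximum-likelihood solution (illustrative only; real MLLR estimates full regression matrices per class):

```python
# MLLR-flavoured sketch: fit a shared affine transform mu_i' = a * mu_i + b
# from (model mean, adaptation-data mean) pairs by ordinary least squares.

def fit_affine(means, targets):
    n = len(means)
    mx = sum(means) / n
    my = sum(targets) / n
    cov = sum((m - mx) * (t - my) for m, t in zip(means, targets))
    var = sum((m - mx) ** 2 for m in means)
    a = cov / var
    return a, my - a * mx                 # slope and offset of the transform
```

Because one (a, b) pair is shared across many Gaussians, only a little adaptation data is needed to shift the whole model set.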
[Figure: MLLR example — scatter plots of male data vs. female data over two feature dimensions]
[Figure: recognizer architecture — sound → feature calculation → feature vectors → acoustic classifier (network weights) → phone probabilities → HMM decoder (word models, language model)]
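The HMM decoder stage searches for the best state path given the model scores. A toy log-domain Viterbi sketch (model sizes and probabilities assumed, not the lecture's implementation):

```python
# Viterbi decoding sketch: best state path argmax_Q p(Q, X) in the log domain.
import math

def viterbi(pi, A, B, obs):
    S = len(pi)
    logs = lambda p: math.log(p) if p > 0 else float("-inf")
    delta = [logs(pi[i]) + logs(B[i][obs[0]]) for i in range(S)]
    back = []                                     # backpointers per frame
    for x in obs[1:]:
        prev = delta
        back.append([max(range(S), key=lambda i: prev[i] + logs(A[i][j]))
                     for j in range(S)])
        delta = [prev[back[-1][j]] + logs(A[back[-1][j]][j]) + logs(B[j][x])
                 for j in range(S)]
    q = max(range(S), key=lambda i: delta[i])     # best final state
    path = [q]
    for bp in reversed(back):                     # trace backpointers
        q = bp[q]
        path.append(q)
    return path[::-1]
```

In a full recognizer the same search runs over the composed word-model and language-model network rather than a bare HMM.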