Lecture 10
CS 753
Instructor: Preethi Jyothi
Unseen Ngrams
• Even when estimates are based on counts from large text corpora, many bigrams/trigrams encountered at test time will never have appeared in the training corpus (the sketch below measures this directly)
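A minimal sketch that makes this claim measurable; the toy corpora are illustrative stand-ins for real train/test splits:

```python
from itertools import chain

def unseen_bigram_rate(train_sents, test_sents):
    """Fraction of test bigrams that never occur in the training corpus."""
    def bigrams(s):
        return zip(s, s[1:])
    train = set(chain.from_iterable(bigrams(s) for s in train_sents))
    test = list(chain.from_iterable(bigrams(s) for s in test_sents))
    return sum(bg not in train for bg in test) / len(test)

# Toy usage; on real corpora this rate remains surprisingly high
train = [["i", "want", "to", "eat", "chinese", "food"]]
test = [["i", "want", "to", "eat", "lunch"]]
print(unseen_bigram_rate(train, test))  # 0.25: ("eat", "lunch") is unseen
```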
No smoothing: bigram counts

          i    want  to   eat  chinese  food  lunch  spend
i         5    827   0    9    0        0     0      2
want      2    0     608  1    6        6     5      1
to        2    0     4    686  2        0     6      211
eat       0    0     2    0    16       2     42     0
chinese   1    0     0    0    0        82    1      0
food      15   0     15   0    1        4     0      0
lunch     2    0     0    0    0        1     0      0
spend     1    0     1    0    0        0     0      0

Figure 4.1: Bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences. Zero counts are in gray.
Laplace (Add-one) smoothing: bigram counts

          i    want  to   eat  chinese  food  lunch  spend
i         6    828   1    10   1        1     1      3
want      3    1     609  2    7        7     6      2
to        3    1     5    687  3        1     7      212
eat       1    1     3    1    17       3     43     1
chinese   2    1     1    1    1        83    2      1
food      16   1     16   1    2        5     1      1
lunch     3    1     1    1    1        2     1      1
spend     2    1     2    1    1        1     1      1

Figure 4.5: Add-one smoothed bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences. Previously-zero counts are in gray.

For comparison, the unsmoothed bigram probabilities:

          i        want  to      eat     chinese  food    lunch   spend
i         0.002    0.33  0       0.0036  0        0       0       0.00079
want      0.0022   0     0.66    0.0011  0.0065   0.0065  0.0054  0.0011
to        0.00083  0     0.0017  0.28    0.00083  0       0.0025  0.087
eat       0        0     0.0027  0       0.021    0.0027  0.056   0
chinese   0.0063   0     0       0       0        0.52    0.0063  0
food      0.014    0     0.014   0       0.00092  0.0037  0       0
lunch     0.0059   0     0       0       0        0.0029  0       0
spend     0.0036   0     0.0036  0       0        0       0       0

Figure 4.2: Bigram probabilities for eight words in the Berkeley Restaurant Project corpus of 9332 sentences. Zero probabilities are in gray.
Add-α smoothing:

$$\Pr_\alpha(w_i \mid w_{i-1}) = \frac{\pi(w_{i-1}, w_i) + \alpha}{\pi(w_{i-1}) + \alpha V}$$
Choosing α: typically tuned to maximise the likelihood of held-out data.
• Laplace (add-one) smoothing can be written as

$$\Pr_{\text{Lap}}(w_i \mid w_{i-1}) = \frac{\pi^*(w_{i-1}, w_i)}{\pi(w_{i-1})}$$

• where the discounted count is

$$\pi^*(w_{i-1}, w_i) = \left(\pi(w_{i-1}, w_i) + 1\right)\frac{\pi(w_{i-1})}{\pi(w_{i-1}) + V}$$
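A minimal sketch of the add-α estimator defined above (α = 1 gives the Laplace case); the toy corpus is illustrative:

```python
from collections import Counter

def train_counts(sentences):
    """Collect bigram counts pi(w1, w2) and history counts pi(w1)."""
    bi, uni = Counter(), Counter()
    for s in sentences:
        for w1, w2 in zip(s, s[1:]):
            bi[(w1, w2)] += 1
            uni[w1] += 1
    return bi, uni

def p_add_alpha(w_prev, w, bi, uni, V, alpha=1.0):
    """Pr_alpha(w | w_prev) = (pi(w_prev, w) + alpha) / (pi(w_prev) + alpha * V)."""
    return (bi[(w_prev, w)] + alpha) / (uni[w_prev] + alpha * V)

# Toy usage
sents = [["i", "want", "to", "eat"], ["i", "want", "chinese", "food"]]
bi, uni = train_counts(sents)
V = len(set(w for s in sents for w in s))           # vocabulary size
print(p_add_alpha("i", "want", bi, uni, V))         # seen bigram
print(p_add_alpha("want", "food", bi, uni, V))      # unseen bigram, still > 0
```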
Laplace (Add-one) smoothing: bigram probabilities

          i        want     to       eat      chinese  food     lunch    spend
i         0.0015   0.21     0.00025  0.0025   0.00025  0.00025  0.00025  0.00075
want      0.0013   0.00042  0.26     0.00084  0.0029   0.0029   0.0025   0.00084
to        0.00078  0.00026  0.0013   0.18     0.00078  0.00026  0.0018   0.055
eat       0.00046  0.00046  0.0014   0.00046  0.0078   0.0014   0.02     0.00046
chinese   0.0012   0.00062  0.00062  0.00062  0.00062  0.052    0.0012   0.00062
food      0.0063   0.00039  0.0063   0.00039  0.00079  0.002    0.00039  0.00039
lunch     0.0017   0.00056  0.00056  0.00056  0.00056  0.0011   0.00056  0.00056
spend     0.0012   0.00058  0.0012   0.00058  0.00058  0.00058  0.00058  0.00058

Figure 4.6: Add-one smoothed bigram probabilities for eight of the words (out of V = 1446) in the BeRP corpus of 9332 sentences. Previously-zero probabilities are in gray.

Example: Bigram adjusted counts

It is often convenient to reconstruct the count matrix so we can see how much a smoothing algorithm has changed the original counts. These adjusted counts can be computed by Eq. 4.22; Figure 4.7 shows the reconstructed counts.

$$c^*(w_{n-1} w_n) = \frac{\left[C(w_{n-1} w_n) + 1\right] \times C(w_{n-1})}{C(w_{n-1}) + V} \qquad (4.22)$$

No smoothing: original bigram counts

          i    want  to   eat  chinese  food  lunch  spend
i         5    827   0    9    0        0     0      2
want      2    0     608  1    6        6     5      1
to        2    0     4    686  2        0     6      211
eat       0    0     2    0    16       2     42     0
chinese   1    0     0    0    0        82    1      0
food      15   0     15   0    1        4     0      0
lunch     2    0     0    0    0        1     0      0
spend     1    0     1    0    0        0     0      0

Figure 4.1: Bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences. Zero counts are in gray.
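As a quick worked check of Eq. 4.22, take the bigram *i want* (C = 827) together with the BeRP unigram count C(i) = 2533 (from the same corpus):

$$c^*(\textit{i want}) = \frac{(827 + 1) \times 2533}{2533 + 1446} = \frac{828 \times 2533}{3979} \approx 527$$

which matches the *i* row of Figure 4.7 below.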
Laplace (Add-one) smoothing: reconstituted counts

          i     want   to     eat    chinese  food  lunch  spend
i         3.8   527    0.64   6.4    0.64     0.64  0.64   1.9
want      1.2   0.39   238    0.78   2.7      2.7   2.3    0.78
to        1.9   0.63   3.1    430    1.9      0.63  4.4    133
eat       0.34  0.34   1      0.34   5.8      1     15     0.34
chinese   0.2   0.098  0.098  0.098  0.098    8.2   0.2    0.098
food      6.9   0.43   6.9    0.43   0.86     2.2   0.43   0.43
lunch     0.57  0.19   0.19   0.19   0.19     0.38  0.19   0.19
spend     0.32  0.16   0.32   0.16   0.16     0.16  0.16   0.16

Figure 4.7: Add-one reconstituted counts for eight words (of V = 1446) in the BeRP corpus of 9332 sentences. Previously-zero counts are in gray.
Advanced Smoothing Techniques
• Good-Turing Discounting
• Kneser-Ney Smoothing
Problems with Add-α Smoothing

[Table: r, N_r, r*-GT, r*-heldout — Good-Turing adjusted counts compared against held-out estimates]
• Good-Turing smoothing states that for an N-gram that occurs r times, we should use the adjusted count r* = θ(r) = (r + 1) N_{r+1}/N_r, where N_r is the number of N-grams occurring exactly r times (a sketch follows below)
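A minimal sketch of this adjustment computed from count-of-counts; with realistic data the N_r would be smoothed first (e.g. Simple Good-Turing), so treat the fallback for N_{r+1} = 0 as a placeholder:

```python
from collections import Counter

def good_turing_adjusted(counts):
    """Map each raw count r to r* = (r + 1) * N_{r+1} / N_r.

    counts: dict mapping ngram -> raw count r.
    Returns dict mapping ngram -> adjusted count r*.
    """
    n_r = Counter(counts.values())      # N_r: how many ngrams occur exactly r times
    adjusted = {}
    for ngram, r in counts.items():
        if n_r[r + 1] > 0:              # naive: only adjust when N_{r+1} is observed
            adjusted[ngram] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[ngram] = float(r)  # placeholder fallback to the raw count
    return adjusted

# Toy usage (N_1 = 3, N_2 = 2, N_3 = 1; tiny counts make the estimates noisy)
bigram_counts = {("i", "want"): 3, ("want", "to"): 2, ("want", "chinese"): 2,
                 ("to", "eat"): 1, ("chinese", "food"): 1, ("eat", "lunch"): 1}
print(good_turing_adjusted(bigram_counts))
# e.g. r = 1 ngrams get r* = 2 * N_2 / N_1 = 2 * 2 / 3
```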
Backoff and Interpolation

• General idea: it helps to use less context to generalise for contexts that the model doesn't know enough about
• Backoff: use the highest-order N-gram estimate when its count is available; otherwise fall back to the lower-order model
• Interpolation: always mix the estimates from all N-gram orders, with weights λ tuned on held-out data (see the sketch below)
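A minimal sketch of linear interpolation for a bigram model, reusing the `bi`/`uni` count structure from the add-α sketch earlier; the λ weights are illustrative and would normally be tuned on held-out data:

```python
def p_interp(w_prev, w, bi, uni, total, lambdas=(0.7, 0.3)):
    """P(w | w_prev) = l1 * P_ML(w | w_prev) + l2 * P_ML(w).

    bi:    Counter of bigram counts pi(w_prev, w)
    uni:   Counter of history counts pi(w_prev); doubles as a unigram
           count here (ignoring sentence-final tokens, for brevity)
    total: total number of tokens, for the unigram estimate
    """
    l1, l2 = lambdas
    p_bi = bi[(w_prev, w)] / uni[w_prev] if uni[w_prev] else 0.0
    p_uni = uni[w] / total
    return l1 * p_bi + l2 * p_uni
```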
Katz Backoff Smoothing

$$P_{\text{Katz}}(w_i \mid w_{i-1}) = \begin{cases} \dfrac{\pi^*(w_{i-1}, w_i)}{\pi(w_{i-1})} & \text{if } w_i \in \Phi(w_{i-1}) \\[2ex] \alpha(w_{i-1})\, P_{\text{Katz}}(w_i) & \text{if } w_i \notin \Phi(w_{i-1}) \end{cases}$$

$$\text{where } \alpha(w_{i-1}) = \frac{1 - \sum_{w \in \Phi(w_{i-1})} \frac{\pi^*(w_{i-1}, w)}{\pi(w_{i-1})}}{\sum_{w_i \notin \Phi(w_{i-1})} P_{\text{Katz}}(w_i)}$$

Here $\Phi(w_{i-1})$ is the set of words observed after $w_{i-1}$ in training, and $\pi^*$ are discounted (e.g. Good-Turing) counts; a sketch follows below.
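A minimal sketch of the bigram case of this recursion, assuming discounted counts `pi_star` (e.g. from the Good-Turing sketch above) and a unigram distribution `p_uni` have been precomputed; `followers` plays the role of Φ:

```python
def katz_prob(w_prev, w, pi_star, uni, followers, p_uni, vocab):
    """P_Katz(w | w_prev) per the case split above.

    pi_star:   dict (w_prev, w) -> discounted bigram count pi*
    uni:       dict w_prev -> history count pi(w_prev)
    followers: dict w_prev -> set of words seen after w_prev (Phi)
    p_uni:     dict w -> unigram probability P_Katz(w)
    """
    phi = followers.get(w_prev, set())
    if w in phi:                         # seen bigram: discounted ML estimate
        return pi_star[(w_prev, w)] / uni[w_prev]
    # Backoff: left-over probability mass, renormalised over unseen successors
    left_over = 1.0 - sum(pi_star[(w_prev, v)] / uni[w_prev] for v in phi)
    denom = sum(p_uni[v] for v in vocab if v not in phi)
    alpha = left_over / denom            # assumes some successors remain unseen
    return alpha * p_uni[w]
```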
Recall Good-Turing estimates

r   N_r            θ(r)
0   7.47 × 10^10   0.0000270
1   2 × 10^6       0.446
2   4 × 10^5       1.26
3   2 × 10^5       2.24
4   1 × 10^5       3.24
5   7 × 10^4       4.22
6   5 × 10^4       5.19
7   3.5 × 10^4     6.21
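Sanity check against the adjusted-count formula: $\theta(1) = 2N_2/N_1 = 2 \times (4 \times 10^5)/(2 \times 10^6) = 0.40$, close to the tabulated 0.446; the $N_r$ shown here are rounded to one significant figure, which accounts for the gap.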
Absolute Discounting

$$\Pr_{\text{abs}}(w_i \mid w_{i-1}) = \frac{\max\{\pi(w_{i-1}, w_i) - d, 0\}}{\pi(w_{i-1})} + \beta(w_{i-1})\,\Pr(w_i)$$

where $d$ is a fixed discount and $\beta(w_{i-1})$ redistributes the discounted mass over the unigram distribution.
Advanced Smoothing Techniques
• Good-Turing Discounting
• Kneser-Ney Smoothing
Kneser-Ney discounting

$$\Pr_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max\{\pi(w_{i-1}, w_i) - d, 0\}}{\pi(w_{i-1})} + \beta_{\text{KN}}(w_{i-1})\,\Pr_{\text{cont}}(w_i)$$

$$\Pr_{\text{cont}}(w_i) = \frac{|\{w' : \pi(w', w_i) > 0\}|}{|B|} \qquad \beta_{\text{KN}}(w_{i-1}) = \frac{d}{\pi(w_{i-1})}\,|\{w : \pi(w_{i-1}, w) > 0\}|$$

where $B$ is the set of all distinct bigram types: the continuation probability counts how many distinct contexts $w_i$ appears in, rather than how often it occurs.
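A minimal sketch of this bigram Kneser-Ney estimator; the discount d = 0.75 is a conventional illustrative choice:

```python
from collections import Counter, defaultdict

def kneser_ney(bi, d=0.75):
    """Build P_KN(w | w_prev) from bigram counts, per the formula above."""
    uni = Counter()                     # pi(w_prev)
    followers = defaultdict(set)        # {w  : pi(w_prev, w) > 0}
    predecessors = defaultdict(set)     # {w' : pi(w', w) > 0}
    for (w1, w2), c in bi.items():
        uni[w1] += c
        followers[w1].add(w2)
        predecessors[w2].add(w1)
    num_bigram_types = len(bi)          # |B|

    def p_kn(w_prev, w):
        p_cont = len(predecessors[w]) / num_bigram_types
        beta = d * len(followers[w_prev]) / uni[w_prev]
        return max(bi[(w_prev, w)] - d, 0.0) / uni[w_prev] + beta * p_cont

    return p_kn

# Toy usage
bi = Counter({("i", "want"): 2, ("want", "to"): 1, ("want", "food"): 1})
p = kneser_ney(bi)
print(p("want", "food"))  # discounted count plus continuation mass
```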
[Figure: a backoff N-gram LM as a weighted FSA. Trigram history states (a,b) and (b,c) are linked by the arc c / Pr(c|a,b); ε-arcs weighted α(a,b) and α(b,c) back off to the bigram states b and c, which emit c / Pr(c|b) and back off via ε / α(b) to the unigram state ε, which emits c / Pr(c).]
Putting it all together:
How do we recognise an utterance?

• $O_A$: acoustic observations of a speech utterance
• How do we decode?

$$W^* = \arg\max_W \underbrace{\Pr(O_A \mid W)}_{\text{acoustic model}}\,\Pr(W)$$

$$\Pr(O_A \mid W) = \sum_Q \Pr(O_A, Q \mid W) = \sum_{q_1^T, w_1^N}\, \prod_{t=1}^T \Pr(O_t \mid O_1^{t-1}, q_1^t, w_1^N)\,\Pr(q_t \mid q_1^{t-1}, w_1^N)$$

First-order HMM assumptions:

$$\approx \sum_{q_1^T, w_1^N}\, \prod_{t=1}^T \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N)$$

Viterbi approximation:

$$\approx \max_{q_1^T, w_1^N}\, \prod_{t=1}^T \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N)$$
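A minimal sketch of the Viterbi maximisation over state sequences for a fixed word sequence, in the log domain; the emission and transition matrices are assumed precomputed from the acoustic model:

```python
import math

def viterbi(n_states, T, log_emit, log_trans, log_init):
    """max over q_1..q_T of prod_t Pr(O_t|q_t) Pr(q_t|q_{t-1}), in log space.

    log_emit:  T x n_states matrix of log Pr(O_t | q)
    log_trans: n_states x n_states matrix of log Pr(q | q_prev)
    log_init:  length-n_states vector of log Pr(q_1)
    Returns (best log score, best state sequence).
    """
    V = [[log_init[q] + log_emit[0][q] for q in range(n_states)]]
    back = []
    for t in range(1, T):
        scores, ptrs = [], []
        for q in range(n_states):
            best_prev = max(range(n_states), key=lambda p: V[t-1][p] + log_trans[p][q])
            scores.append(V[t-1][best_prev] + log_trans[best_prev][q] + log_emit[t][q])
            ptrs.append(best_prev)
        V.append(scores)
        back.append(ptrs)
    # Trace back the best path from the best final state
    q_last = max(range(n_states), key=lambda q: V[T-1][q])
    path = [q_last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return V[T-1][q_last], list(reversed(path))

# Toy usage: 2 states, 3 frames
lt = [[math.log(0.7), math.log(0.3)], [math.log(0.4), math.log(0.6)]]
le = [[math.log(0.9), math.log(0.1)]] * 3
li = [math.log(0.5), math.log(0.5)]
print(viterbi(2, 3, le, lt, li))
```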
Acoustic Model

$$\Pr(O_A \mid W) = \max_{q_1^T, w_1^N}\, \prod_{t=1}^T \Pr(O_t \mid q_t, w_1^N)\,\underbrace{\Pr(q_t \mid q_{t-1}, w_1^N)}_{\text{transition probabilities}}$$

Emission probabilities modeled using a mixture of Gaussians:

$$\Pr(O \mid q; w_1^N) = \sum_{\ell=1}^{L_q} c_{q\ell}\, \mathcal{N}(O \mid \mu_{q\ell}, \Sigma_{q\ell}; w_1^N)$$

or via scaled state posteriors:

$$\Pr(O \mid q; w_1^N) \propto \frac{\Pr(q \mid O; w_1^N)}{\Pr(q)}$$
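A minimal sketch of the mixture-of-Gaussians emission density above, restricted to diagonal covariances as is common in ASR; all parameter values in the usage are illustrative:

```python
import numpy as np

def gmm_log_likelihood(o, weights, means, variances):
    """log Pr(O | q) = log sum_l c_l N(O | mu_l, diag(var_l)).

    o:         (D,) observation vector
    weights:   (L,) mixture weights c_l, summing to 1
    means:     (L, D) component means mu_l
    variances: (L, D) diagonal covariances
    """
    diff = o - means                                   # (L, D)
    log_norm = -0.5 * (np.log(2 * np.pi * variances) + diff**2 / variances).sum(axis=1)
    return np.logaddexp.reduce(np.log(weights) + log_norm)

# Toy usage: 2-component mixture over 3-dimensional features
o = np.array([0.1, -0.3, 0.5])
w = np.array([0.6, 0.4])
mu = np.zeros((2, 3)); mu[1] += 0.5
var = np.ones((2, 3))
print(gmm_log_likelihood(o, w, mu, var))
```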
Language Model

$$W^* = \arg\max_W \Pr(O_A \mid W)\,\underbrace{\Pr(W)}_{\text{language model}}$$

$$\Pr(W) = \Pr(w_1, w_2, \ldots, w_N) = \Pr(w_1) \cdots \Pr(w_N \mid w_{N-m+1}^{N-1})$$

Combining both models:

$$W^* = \arg\max_{w_1^N, N} \left\{ \left[\prod_{n=1}^N \Pr(w_n \mid w_{n-m+1}^{n-1})\right] \left[\sum_{q_1^T, w_1^N}\, \prod_{t=1}^T \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N)\right] \right\}$$

Viterbi approximation:

$$\approx \arg\max_{w_1^N, N} \left\{ \left[\prod_{n=1}^N \Pr(w_n \mid w_{n-m+1}^{n-1})\right] \left[\max_{q_1^T, w_1^N}\, \prod_{t=1}^T \Pr(O_t \mid q_t, w_1^N)\,\Pr(q_t \mid q_{t-1}, w_1^N)\right] \right\}$$