Lec 4
Zero probability bigrams
• Bigrams with zero probability mean that we will assign 0 probability to the test set!
• And hence we cannot compute perplexity (we can't divide by 0)!
• Zero mitigation: various smoothing techniques
Basic Smoothing:
Interpolation and Back-off
The intuition of smoothing
• When we have sparse statistics, e.g. counts for the words following a context:
  3 allegations, 2 reports, 1 claims, 1 request (7 total)
• Steal probability mass to generalize better:
  2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)
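The smoothed counts above can be reproduced by subtracting a fixed discount from each seen count and giving the collected mass to unseen words. A minimal sketch, assuming a discount of d = 0.5 (the value and the variable names are my choices, not stated on the slide):

```python
# Steal a fixed discount d from each seen count; the collected mass
# goes to unseen ("other") words. d = 0.5 reproduces the slide's numbers.
counts = {"allegations": 3, "reports": 2, "claims": 1, "request": 1}
d = 0.5

smoothed = {w: c - d for w, c in counts.items()}
smoothed["other"] = d * len(counts)   # 4 words * 0.5 = 2.0 stolen in total

total = sum(counts.values())          # still 7
probs = {w: c / total for w, c in smoothed.items()}
print(smoothed)  # {'allegations': 2.5, 'reports': 1.5, 'claims': 0.5, 'request': 0.5, 'other': 2.0}
```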
Simple interpolation and back-off
• A simple back-off scheme ("stupid backoff"): if the higher-order N-gram was seen, use its relative frequency; otherwise recurse on the shorter context, scaled by a constant (here 0.4), bottoming out at the unigram relative frequency:

$$S(w_i \mid w_{i-k+1}^{i-1}) = \begin{cases} \dfrac{\mathrm{count}(w_{i-k+1}^{i})}{\mathrm{count}(w_{i-k+1}^{i-1})} & \text{if } \mathrm{count}(w_{i-k+1}^{i}) > 0 \\ 0.4\, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise} \end{cases}$$

$$S(w_i) = \frac{\mathrm{count}(w_i)}{N}$$
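A minimal sketch of this back-off recursion over n-gram count tables; the toy corpus, function names, and Counter-based layout are illustrative assumptions:

```python
from collections import Counter

def train(tokens, max_n=3):
    """Count every n-gram up to max_n as a tuple key."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(word, context, counts, total_tokens, alpha=0.4):
    """Score S(word | context), backing off to shorter contexts scaled by alpha."""
    ngram = tuple(context) + (word,)
    if context and counts[ngram] > 0:
        return counts[ngram] / counts[tuple(context)]
    if context:
        # unseen: drop the leftmost context word and scale by alpha
        return alpha * stupid_backoff(word, context[1:], counts, total_tokens, alpha)
    return counts[(word,)] / total_tokens   # base case: S(w) = count(w) / N

tokens = "the cat sat on the mat and the cat ran".split()
counts = train(tokens)
print(stupid_backoff("sat", ("the", "cat"), counts, len(tokens)))   # seen trigram
print(stupid_backoff("ran", ("on", "the"), counts, len(tokens)))    # backs off twice
```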
Language Modeling: Advanced Smoothing
• Good-Turing
• Kneser-Ney
Reminder: Add-1 (Laplace) Smoothing

$$P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$$
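A quick sketch of add-1 bigram estimation on a toy corpus (the corpus and variable names are mine, not the lecture's):

```python
from collections import Counter

tokens = "<s> I am Sam </s> <s> Sam I am </s>".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
V = len(unigrams)   # vocabulary size

def p_add1(w_prev, w):
    # P_Add-1(w | w_prev) = (c(w_prev, w) + 1) / (c(w_prev) + V)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_add1("I", "am"))    # seen bigram: (2 + 1) / (2 + 5)
print(p_add1("I", "Sam"))   # unseen bigram still gets mass: 1 / 7
```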
More general formulations: Add-k

$$P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}$$

Equivalently, writing the total pseudo-count mass as m = kV:

$$P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\left(\frac{1}{V}\right)}{c(w_{i-1}) + m}$$
Unigram prior smoothing
• Replace the uniform prior 1/V in the Add-k formulation with the unigram probability P(wi):

$$P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\left(\frac{1}{V}\right)}{c(w_{i-1}) + m} \quad\Longrightarrow\quad P_{\text{UnigramPrior}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\,P(w_i)}{c(w_{i-1}) + m}$$
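A sketch contrasting the uniform 1/V prior with the unigram prior, assuming the toy corpus and m = 5 below (both my choices): a frequent word gets more of the pseudo-count mass under the unigram prior.

```python
from collections import Counter

tokens = ("<s> I am Sam </s> <s> Sam I am </s> "
          "<s> I do not like green eggs and ham </s>").split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
N, V = len(tokens), len(unigrams)
m = 5.0   # pseudo-count mass (tuning knob)

def p_addk(w_prev, w):
    # uniform prior: every word gets m * (1/V) pseudo-counts
    return (bigrams[(w_prev, w)] + m / V) / (unigrams[w_prev] + m)

def p_unigram_prior(w_prev, w):
    # unigram prior: frequent words get more of the pseudo-count mass
    return (bigrams[(w_prev, w)] + m * unigrams[w] / N) / (unigrams[w_prev] + m)

# unseen bigram ("am", "I"): the frequent word "I" gets more mass
# under the unigram prior than under the uniform one
print(p_addk("am", "I"), p_unigram_prior("am", "I"))
```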
Advanced smoothing algorithms
• Intuition used by many smoothing algorithms (Good-Turing, Kneser-Ney): use the count of things we've seen once to help estimate the count of things we've never seen
Good-Turing smoothing intuition
• You are fishing (a scenario from Josh Goodman), and caught:
• 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
• How likely is it that the next species is trout?
• 1/18
• How likely is it that the next species is new (i.e. catfish or bass)?
• Let's use our estimate of things-we-saw-once to estimate the new things:
• 3/18 (because N1 = 3)
• Assuming so, how likely is it that the next species is trout?
• Must be less than 1/18
• How do we estimate it?
Good-Turing calculations

$$P^*_{GT}(\text{things with zero frequency}) = \frac{N_1}{N} \qquad\qquad c^* = \frac{(c+1)\,N_{c+1}}{N_c}$$

• Unseen (bass or catfish):
• c = 0
• MLE: p = 0/18 = 0
• P*GT(unseen) = N1/N = 3/18
• Seen once (trout):
• c = 1
• MLE: p = 1/18
• c*(trout) = 2 × N2/N1 = 2 × 1/3 = 2/3
• P*GT(trout) = (2/3)/18 = 1/27
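The fishing arithmetic above, checked in a few lines (a sketch; the data is just the slide's catch):

```python
from collections import Counter

# Catch from the Josh Goodman scenario: 10 carp, 3 perch, 2 whitefish,
# 1 trout, 1 salmon, 1 eel = 18 fish
species_counts = {"carp": 10, "perch": 3, "whitefish": 2,
                  "trout": 1, "salmon": 1, "eel": 1}
N = sum(species_counts.values())        # 18
Nc = Counter(species_counts.values())   # N1 = 3, N2 = 1, ...

p_unseen = Nc[1] / N                    # P*_GT(unseen) = N1/N
print(p_unseen)                         # 3/18 ≈ 0.1667

c = 1                                   # e.g. trout
c_star = (c + 1) * Nc[c + 1] / Nc[c]    # c* = (c+1) N_{c+1} / N_c
print(c_star, c_star / N)               # 2/3, and 1/27 ≈ 0.037
```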
Resulting Good-Turing numbers
• Numbers from Church and Gale (1991)
• 22 million words of AP Newswire

$$c^* = \frac{(c+1)\,N_{c+1}}{N_c}$$

Count c    Good-Turing c*
0          0.0000270
1          0.446
2          1.26
3          2.24
4          3.24
5          4.22
6          5.19
7          6.21
8          7.24
9          8.25
Language Modeling
Advanced:
Kneser-Ney Smoothing
Resulting Good-Turing numbers (repeated)
• Numbers from Church and Gale (1991), 22 million words of AP Newswire; see the table above
• It sure looks like c* = (c - .75)
Absolute Discounting Interpolation
• Save ourselves some time and just subtract 0.75 (or some d)!

$$P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \underbrace{\frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})}}_{\text{discounted bigram}} + \underbrace{\lambda(w_{i-1})}_{\text{interpolation weight}}\,\underbrace{P(w)}_{\text{unigram}}$$

• But should we really just use the regular unigram P(w)?
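A bigram sketch of absolute discounting under the formula above, using max(c - d, 0) so the discounted count stays non-negative (the toy corpus and names are mine):

```python
from collections import Counter

tokens = "<s> I am Sam </s> <s> Sam I am </s>".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
N = len(tokens)
d = 0.75

def p_abs_discount(w_prev, w):
    # discounted bigram estimate
    bigram_part = max(bigrams[(w_prev, w)] - d, 0) / unigrams[w_prev]
    # lambda(w_prev): the mass we removed, spread over the unigram distribution
    n_types = len([b for b in bigrams if b[0] == w_prev])
    lam = d * n_types / unigrams[w_prev]
    return bigram_part + lam * unigrams[w] / N   # "regular" unigram P(w)

print(p_abs_discount("I", "am"))
print(sum(p_abs_discount("I", w) for w in unigrams))  # ≈ 1.0 sanity check
```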
Kneser-Ney Smoothing I
• Better estimate for probabilities of lower-order unigrams!
• Shannon game: I can't see without my reading ___________? ("glasses" or "Francisco"?)
• "Francisco" is more common than "glasses"
• … but "Francisco" always follows "San"
• The unigram is useful exactly when we haven't seen this bigram!
• Instead of P(w): "How likely is w?"
• Pcontinuation(w): "How likely is w to appear as a novel continuation?"
• For each word, count the number of bigram types it completes
• Every bigram type was a novel continuation the first time it was seen
$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1}, w_i) - d,\ 0\bigr)}{c(w_{i-1})} + \lambda(w_{i-1})\,P_{\text{CONTINUATION}}(w_i)$$

λ is a normalizing constant: the probability mass we've discounted

$$\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\bigl|\{w : c(w_{i-1}, w) > 0\}\bigr|$$

• d / c(wi-1) is the normalized discount
• |{w : c(wi-1, w) > 0}| is the number of word types that can follow wi-1 = the number of times we applied the normalized discount
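A minimal bigram Kneser-Ney sketch following the two equations above; the toy corpus and names are illustrative:

```python
from collections import Counter

tokens = "<s> I am Sam </s> <s> Sam I am </s>".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
d = 0.75

# P_continuation(w): fraction of bigram *types* that end in w
total_bigram_types = len(bigrams)
continuation = Counter(w for (_, w) in bigrams)

def p_continuation(w):
    return continuation[w] / total_bigram_types

def p_kn(w_prev, w):
    discounted = max(bigrams[(w_prev, w)] - d, 0) / unigrams[w_prev]
    n_types = len([b for b in bigrams if b[0] == w_prev])   # |{w : c(w_prev, w) > 0}|
    lam = d * n_types / unigrams[w_prev]
    return discounted + lam * p_continuation(w)

print(p_kn("I", "am"))
print(sum(p_kn("I", w) for w in unigrams))  # ≈ 1.0 sanity check
```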