IS 7118: Natural Language Processing
Unit-4: N-Grams and Language Modeling
• Numbers
• Misspellings
• Names
• Acronyms
• etc.
• More variables:
  P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
• The Chain Rule in general:
  P(w1 w2 … wn) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1,…,wn-1)
• Applied to the joint probability of the words in a sentence:
  P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi-1)
• Simplifying assumption (the Markov assumption, after Andrei Markov):
  P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe:
  P(the | its water is so transparent that) ≈ P(the | transparent that)
• More generally:
  P(w1 w2 … wn) ≈ ∏i P(wi | wi-k … wi-1)
• In other words, we approximate each component in the product using only the previous k words of context, as in the sketch below.
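As a concrete illustration of the Markov approximation, here is a minimal Python sketch (mine, not from the slides) that scores a sentence as a product of bigram probabilities; bigram_prob is a placeholder for any of the estimators developed later in this unit.

def sentence_prob(words, bigram_prob):
    # Approximate P(w1 ... wn) with a first-order Markov chain:
    # the product of P(wi | wi-1), padded with <s> and </s>.
    prob = 1.0
    for prev, cur in zip(["<s>"] + words, words + ["</s>"]):
        prob *= bigram_prob(prev, cur)   # P(cur | prev)
    return prob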
Language Modeling: Introduction to N-grams
• The Maximum Likelihood Estimate for bigram probabilities:
  P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
               = c(wi-1, wi) / c(wi-1)
• This estimate is called the Maximum Likelihood Estimate (MLE) because it is the choice of parameters that gives the highest probability to the training corpus.
• Example: in the three-sentence mini-corpus "<s> I am Sam </s>", "<s> Sam I am </s>", "<s> I do not like green eggs and ham </s>":
  P(I | <s>) = C(<s>, I) / C(<s>) = 2/3
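The same estimate is easy to compute directly; the following sketch (mine, not from the slides) builds bigram and history counts from the mini-corpus above and reproduces P(I | <s>) = 2/3.

from collections import Counter

corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]

bigrams, unigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)                   # history counts c(w)
    bigrams.update(zip(tokens, tokens[1:]))   # bigram counts c(w, w')

def p_mle(prev, cur):
    # P(cur | prev) = c(prev, cur) / c(prev)
    return bigrams[(prev, cur)] / unigrams[prev]

print(p_mle("<s>", "I"))   # 0.666... = 2/3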
Evaluation: How good is our model?
❑ The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
Perplexity
The perplexity (PP) of a language model on a test set is the inverse probability that the model assigns to the test set, normalized by the number of words. For a test set W = w1 w2 … wN:
  PP(W) = P(w1 w2 … wN)^(-1/N)
        = ( 1 / P(w1 w2 … wN) )^(1/N)
Chain rule:
  PP(W) = ( ∏i 1 / P(wi | w1 … wi-1) )^(1/N)
For bigrams:
  PP(W) = ( ∏i 1 / P(wi | wi-1) )^(1/N)
Because of the inverse, minimizing perplexity is the same as maximizing probability.
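A direct computation of this quantity, done in log space for numerical stability (a minimal sketch under the bigram assumption; bigram_prob is any estimator, e.g. p_mle above):

import math

def perplexity(test_sentences, bigram_prob):
    # PP(W) = exp(-(1/N) * sum_i log P(wi | wi-1))
    log_prob, n = 0.0, 0
    for sentence in test_sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            log_prob += math.log(bigram_prob(prev, cur))
            n += 1
    return math.exp(-log_prob / n)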
The Shannon Game intuition for perplexity
• From Josh Goodman:
• How hard is the task of recognizing the digits '0,1,2,3,4,5,6,7,8,9'?
  – Perplexity 10 (perplexity is related to the average branching factor)
• How hard is recognizing 30,000 names? (Speech recognition for an automated switchboard operator connecting callers to an employee.)
  – Perplexity = 30,000
• If a system has to recognize
  – Operator (1 in 4)
  – Sales (1 in 4)
  – Technical Support (1 in 4)
  – 30,000 names (1 in 120,000 each)
  – Perplexity is ≈ 53
• Perplexity is the weighted equivalent branching factor
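The ≈ 53 figure follows from treating perplexity as 2 raised to the entropy of the outcome distribution; a quick check (my computation, matching the slide's numbers):

import math

# (probability of each outcome, number of such outcomes)
outcomes = [(1/4, 3), (1/120000, 30000)]
entropy = -sum(n * p * math.log2(p) for p, n in outcomes)
print(2 ** entropy)   # ≈ 52.6, i.e. perplexity ≈ 53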
Perplexity as branching factor
• Suppose a sentence consists of N random digits.
• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
  PP(W) = P(w1 w2 … wN)^(-1/N) = ((1/10)^N)^(-1/N) = (1/10)^(-1) = 10
The intuition of smoothing
[Figure: two bar charts of counts for words following the same bigram context. Before smoothing: allegations, reports (2), claims (1), request (1), and others, 7 total, with no mass on unseen words such as outcome, attack, man. After smoothing: some mass is shifted to the unseen words (e.g., request 0.5, other 2), still 7 total.]
• Steal probability mass to generalize better
• MLE estimate:
  PMLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)
• Add-1 estimate:
  PAdd-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
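Continuing the counting sketch from the MLE example above (V is taken here as the number of observed word types, an assumption, since the slides do not fix the vocabulary at this point):

V = len(unigrams)   # vocabulary size (observed types; an assumption)

def p_add1(prev, cur):
    # P(cur | prev) = (c(prev, cur) + 1) / (c(prev) + V)
    return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)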
Maximum Likelihood Estimates
• The maximum likelihood estimate
  – of some parameter of a model M from a training set T
  – maximizes the likelihood of the training set T given the model M
• Suppose the word "bagel" occurs 400 times in a corpus of a million words.
• What is the probability that a random word from some other text will be "bagel"?
• The MLE estimate is 400/1,000,000 = .0004
• This may be a bad estimate for some other corpus
  – But it is the estimate that makes it most likely that "bagel" will occur 400 times in a million-word corpus.
  – Smoothing gives a non-maximum-likelihood estimator.
Berkeley Restaurant Corpus: Laplace smoothed bigram counts
• Add-one smoothed bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences.
• Take those probabilities and re-estimate the original counts, to see how much add-one smoothing changed them; the reconstituted count is
  c*(wi-1, wi) = (c(wi-1, wi) + 1) × c(wi-1) / (c(wi-1) + V)
Compare with raw bigram counts
• Simple interpolation:
  P̂(wi | wi-2, wi-1) = λ1 P(wi | wi-2, wi-1) + λ2 P(wi | wi-1) + λ3 P(wi)
• Stupid backoff (gives a score S, not a probability):
  S(wi | wi-k+1 … wi-1) = count(wi-k+1 … wi) / count(wi-k+1 … wi-1)   if count(wi-k+1 … wi) > 0
  S(wi | wi-k+1 … wi-1) = 0.4 × S(wi | wi-k+2 … wi-1)                 otherwise
  S(wi) = count(wi) / N
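A sketch of the backoff recursion (assuming a dict ngram_counts mapping token tuples of any order to counts, and N total tokens; the names are mine):

def stupid_backoff(word, context, ngram_counts, N, alpha=0.4):
    # S(word | context): back off with weight 0.4 until the unigram case.
    score = 1.0
    while context:
        ngram = context + (word,)
        if ngram_counts.get(ngram, 0) > 0:
            return score * ngram_counts[ngram] / ngram_counts[context]
        context = context[1:]   # drop the earliest context word
        score *= alpha          # multiply in 0.4 per backoff step
    return score * ngram_counts.get((word,), 0) / N   # unigram base case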
N-gram Smoothing Summary
• Add-1 smoothing:
– OK for text categorization, not for language modeling
• The most commonly used method:
– Extended Interpolated Kneser-Ney
• For very large N-grams like the Web:
– Stupid backoff
Add-k smoothing
• Add-1 estimate:
  PAdd-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
• Add-k estimate:
  PAdd-k(wi | wi-1) = (c(wi-1, wi) + k) / (c(wi-1) + kV)
• Now, let m = kV:
  PAdd-k(wi | wi-1) = (c(wi-1, wi) + m(1/V)) / (c(wi-1) + m)
Unigram prior smoothing
• With m = kV, add-k spreads the added mass uniformly (1/V per word):
  PAdd-k(wi | wi-1) = (c(wi-1, wi) + m(1/V)) / (c(wi-1) + m)
• Replacing the uniform 1/V with the unigram probability P(wi) gives:
  PUnigramPrior(wi | wi-1) = (c(wi-1, wi) + m P(wi)) / (c(wi-1) + m)
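In code, again reusing the counters from the MLE sketch above (m is the total added mass; its value is a free parameter in this sketch):

N_tokens = sum(unigrams.values())

def p_unigram_prior(prev, cur, m=1.0):
    # P(cur | prev) = (c(prev, cur) + m * P(cur)) / (c(prev) + m)
    p_uni = unigrams[cur] / N_tokens          # unigram prior P(cur)
    return (bigrams[(prev, cur)] + m * p_uni) / (unigrams[prev] + m)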
Advanced smoothing algorithms
• Intuition used by many smoothing algorithms:
  – Good-Turing
  – Kneser-Ney
  – Witten-Bell
• Use the count of things we've seen once to help estimate the count of things we've never seen.
• Notation: Nc = the frequency of frequency c, i.e. the number of word types seen c times. In the slide's example the words "do", "not", and "eat" each occur once, so N1 = 3, while N2 = 2 and N3 = 1.
Good-Turing smoothing intuition
• You are fishing (a scenario from Josh Goodman), and caught:
  – 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
• How likely is it that the next species is trout?
  – 1/18
• How likely is it that the next species is new (i.e., catfish or bass)?
  – Let's use our estimate of things-we-saw-once to estimate the new things:
  – 3/18 (because N1 = 3)
• Assuming so, how likely is it that the next species is trout?
  – It must be less than 1/18. We need to discount the probability to accommodate new things.
  – How to estimate?
Good-Turing calculations
  P*GT(things with zero frequency) = N1 / N          c* = (c+1) Nc+1 / Nc
• Unseen (bass or catfish):
  – c = 0
  – MLE p = 0/18 = 0
  – P*GT(unseen) = N1/N = 3/18
• Seen once (trout):
  – c = 1
  – MLE p = 1/18
  – c*(trout) = 2 × N2/N1 = 2 × 1/3 = 2/3
  – P*GT(trout) = (2/3) / 18 = 1/27 (discounted from the MLE 1/18)
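These numbers are mechanical to reproduce; a sketch for the fishing example (my code, with the slide's counts):

from collections import Counter

catch = {"carp": 10, "perch": 3, "whitefish": 2,
         "trout": 1, "salmon": 1, "eel": 1}
total = sum(catch.values())                  # 18 fish
freq_of_freq = Counter(catch.values())       # N1 = 3, N2 = 1, N3 = 1, N10 = 1

def c_star(c):
    # Good-Turing re-estimated count: c* = (c + 1) * N_{c+1} / N_c
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

print(freq_of_freq[1] / total)   # P*GT(unseen) = N1/N = 3/18
print(c_star(1) / total)         # P*GT(trout) = (2/3)/18 = 1/27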
Bigram Frequencies of Frequencies and GT Re-estimates: Example 2
[Table of bigram frequencies of frequencies omitted.]
• From the table (N3 = 642, N4 = 381), the Good-Turing re-estimate of count 3 is:
  3* = (3+1) × N4/N3 = 4 × (381/642) = 4 × 0.593 ≈ 2.37
GT Smoothed Bigram Probabilities
[Table of GT smoothed bigram probabilities and the held-out-word diagram (count classes such as N4417 → N4416, N3511 → N3510) omitted.]
– There are Nk words with training count k
– Each should occur with probability: (k+1) Nk+1 / (c · Nk), where c is the total training count
– …or with expected count: k* = (k+1) Nk+1 / Nk
Good-Turing complications (slide from Dan Klein)

Absolute Discounting Interpolation
  PAbsoluteDiscounting(wi | wi-1) = (c(wi-1, wi) − d) / c(wi-1) + λ(wi-1) P(wi)
  (the second term interpolates the unigram probability)
Kneser-Ney replaces the unigram with a continuation probability:
  PKN(wi | wi-1) = max(c(wi-1, wi) − d, 0) / c(wi-1) + λ(wi-1) PCONTINUATION(wi)
λ is a normalizing constant: the probability mass we've discounted
  λ(wi-1) = (d / c(wi-1)) × |{w : c(wi-1, w) > 0}|
where |{w : c(wi-1, w) > 0}| is the number of word types that can follow wi-1
  = the number of word types we discounted
  = the number of times we applied the normalized discount d / c(wi-1)
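A bigram-level sketch of absolute discounting (reusing the counters from the MLE example above; d = 0.75 is a commonly cited discount, but it is a tunable parameter):

def p_absolute_discount(prev, cur, d=0.75):
    # max(c - d, 0) / c(prev) + lambda(prev) * P_unigram(cur)
    followers = sum(1 for (p, _) in bigrams if p == prev)   # |{w: c(prev,w)>0}|
    lam = d * followers / unigrams[prev]                    # discounted mass
    p_uni = unigrams[cur] / sum(unigrams.values())          # unigram P(cur)
    return max(bigrams[(prev, cur)] - d, 0) / unigrams[prev] + lam * p_uni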
Kneser-Ney Smoothing: Recursive formulation
  PKN(wi | wi-n+1 … wi-1) = max(cKN(wi-n+1 … wi) − d, 0) / cKN(wi-n+1 … wi-1)
                            + λ(wi-n+1 … wi-1) PKN(wi | wi-n+2 … wi-1)
where
  cKN(•) = count(•) for the highest-order N-gram,
           continuation count(•) for lower-order N-grams.
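For the bigram base case, the continuation probability counts how many distinct left contexts a word appears in; a sketch (again reusing the counters from the MLE example; the naming is mine):

def p_continuation(cur):
    # P_CONTINUATION(cur) = |{w: c(w, cur) > 0}| / (number of bigram types)
    preceders = sum(1 for (_, w) in bigrams if w == cur)
    return preceders / len(bigrams)

def p_kn_bigram(prev, cur, d=0.75):
    # Interpolated Kneser-Ney for bigrams.
    followers = sum(1 for (p, _) in bigrams if p == prev)
    lam = d * followers / unigrams[prev]        # normalizing constant lambda
    return (max(bigrams[(prev, cur)] - d, 0) / unigrams[prev]
            + lam * p_continuation(cur))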