Language Modeling
Introduction to N-grams
Dan Jurafsky
• More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
• The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute
joint probability of words in sentence
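As a worked instance of the general rule, here is the chain rule written out for a five-word phrase. The phrase is borrowed from the Markov assumption slide; the expansion itself is an illustration, not a reproduction of the original slide.

```latex
P(\text{its water is so transparent}) =
  P(\text{its}) \times P(\text{water} \mid \text{its}) \times P(\text{is} \mid \text{its water})
  \times P(\text{so} \mid \text{its water is}) \times P(\text{transparent} \mid \text{its water is so})
```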
Markov Assumption
• Simplifying assumption (Andrei Markov):
P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe
P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
$$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-k} \ldots w_{i-1})$$
• In other words, we approximate each component in the product
$$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})$$
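A minimal sketch of what the approximation does to each conditioning context: the full history is cut down to the last k words. The helper name `markov_contexts` is hypothetical, and the sentence is the one used on the previous slide.

```python
def markov_contexts(words, k):
    """Pair each word with the (at most) k words that precede it."""
    pairs = []
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - k):i])  # truncated history
        pairs.append((context, word))
    return pairs


sentence = "its water is so transparent that the".split()

# With k=2 each factor conditions on two previous words only.
for context, word in markov_contexts(sentence, k=2):
    print(f"P({word} | {' '.join(context)})")
```

Setting k=0 gives an empty context for every word, and k=1 keeps a single previous word.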
$$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i)$$
Some automatically generated sentences from a unigram model
fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars,
quarter, in, is, mass
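Word salad like the sample above can be produced by drawing each word independently in proportion to its corpus frequency. The following is a rough sketch under that assumption; the toy token list and the function name `sample_unigram_sentence` are made up for illustration.

```python
import random
from collections import Counter


def sample_unigram_sentence(tokens, length, seed=0):
    """Draw `length` words independently, each weighted by its corpus count."""
    counts = Counter(tokens)
    words = list(counts)
    weights = [counts[w] for w in words]
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    return rng.choices(words, weights=weights, k=length)


# Toy corpus (invented); a real model would be trained on far more text.
toy_tokens = "the dollars the inflation of the quarter in a a the".split()
print(" ".join(sample_unigram_sentence(toy_tokens, length=8)))
```

Because no word depends on its neighbours, the output has the right word frequencies but no grammatical structure.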
Bigram model
Condition on the previous word:
$$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})$$
N-gram models
• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language
• because language has long-distance dependencies:
“The computer which I had just put into the machine room on
the fifth floor crashed.”
$$P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}$$
$$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$
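The count ratio above can be computed directly from a corpus. This is a minimal sketch: the three-sentence toy corpus, the `<s>` and `</s>` boundary markers, and the function names are assumptions made for the example.

```python
from collections import Counter


def train_bigram_mle(sentences):
    """Collect unigram and bigram counts, padding each sentence with <s> and </s>."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return unigram_counts, bigram_counts


def bigram_prob(prev, word, unigram_counts, bigram_counts):
    """Maximum likelihood estimate c(prev, word) / c(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0  # unseen context: the MLE is undefined, report 0 here
    return bigram_counts[(prev, word)] / unigram_counts[prev]


corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
unigram_counts, bigram_counts = train_bigram_mle(corpus)

print(bigram_prob("<s>", "I", unigram_counts, bigram_counts))   # 2 of 3 sentences start with "I"
print(bigram_prob("am", "Sam", unigram_counts, bigram_counts))  # "am" occurs twice, once before "Sam"
```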
An example
More examples:
Berkeley Restaurant Project sentences
• Result:
Practical Issues
• We do everything in log space
• Avoid underflow
• (also adding is faster than multiplying)
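A small demonstration of the underflow problem, relying on the identity log(p1 × p2) = log p1 + log p2. The probability values are invented; any long product of small numbers behaves the same way.

```python
import math

# 100 hypothetical word probabilities of 1e-5 each.
probs = [1e-5] * 100

# Multiplying directly: the true value 1e-500 is below the float range.
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the same information in a representable number.
log_prob = sum(math.log(p) for p in probs)

print(product)   # underflows to 0.0
print(log_prob)  # roughly -1151.3
```

Two models can still be compared by their log probabilities even when both raw products would underflow to zero.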
…
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Intuition of Perplexity
• The Shannon Game:
• How well can we predict the next word?
I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____
Candidate next words with probabilities:
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100
Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
Perplexity is the inverse probability of the test set, normalized by the number of words:
$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$$
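Chain rule:
$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$$
For bigrams:
$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$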
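Perplexity can be computed from the per-word conditional probabilities a model assigns to a test sequence. The sketch below works in log space to avoid underflow; the function name `perplexity` and the example probabilities are assumptions.

```python
import math


def perplexity(word_probs):
    """N-th root of the inverse product of the per-word probabilities."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)


# A model that gives every one of 10 test words probability 0.1
# is as uncertain as a uniform choice among 10 words.
print(perplexity([0.1] * 10))

# Higher probabilities on the test words give lower perplexity.
print(perplexity([0.5, 0.25]))
```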
Approximating Shakespeare
Shakespeare as corpus
• N=884,647 tokens, V=29,066
• Shakespeare produced 300,000 bigram types
out of V² = 844 million possible bigrams.
• So 99.96% of the possible bigrams were never seen
(have zero entries in the table)
• Quadrigrams worse: What's coming out looks
like Shakespeare because it is Shakespeare
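The sparsity figures above can be checked with a few lines of arithmetic, using only the numbers quoted on the slide (300,000 is the slide's rounded figure).

```python
V = 29_066                    # vocabulary size
seen_bigram_types = 300_000   # bigram types observed in Shakespeare

possible_bigrams = V ** 2
unseen_fraction = 1 - seen_bigram_types / possible_bigrams

print(f"{possible_bigrams:,}")    # about 844 million
print(f"{unseen_fraction:.2%}")   # share of possible bigrams never seen
```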
Zeros
• Training set:
… denied the allegations
… denied the reports
… denied the claims
… denied the request
• Test set:
… denied the offer
… denied the loan
3 allegations
2 reports
1 claims
1 request
7 total
[Figure: bar chart of these counts over allegations, reports, claims, request, outcome, attack, man, …]
• Steal probability mass to generalize better
P(w | denied the)
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other
7 total
[Figure: bar chart of the adjusted counts over allegations, reports, claims, request, outcome, attack, man, …]
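The adjusted counts are consistent with taking 0.5 away from each observed count and pooling the removed mass for unseen words. That reading is inferred from the numbers, not stated on the slide; the sketch below just reproduces the arithmetic.

```python
# Observed counts after "denied the".
counts = {"allegations": 3, "reports": 2, "claims": 1, "request": 1}

discount = 0.5  # assumed per-word discount, inferred from the adjusted counts

discounted = {word: c - discount for word, c in counts.items()}
reserved_for_other = discount * len(counts)  # mass freed for unseen words
total = sum(discounted.values()) + reserved_for_other

print(discounted)
print(reserved_for_other, total)
```

The total stays at 7, so the distribution still sums to one after the mass is moved.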
Add-one estimation
Laplace-smoothed bigrams
Reconstituted counts
Linear Interpolation
• Simple interpolation
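Simple interpolation is commonly written as a weighted mix of trigram, bigram, and unigram estimates with weights that sum to one. The sketch below assumes that form; the weight values and the input probabilities are placeholders, and in practice the weights would be tuned on held-out data.

```python
def interpolate(p_trigram, p_bigram, p_unigram, lambdas=(0.6, 0.3, 0.1)):
    """Weighted mix of three estimates; the weights must sum to 1."""
    l_tri, l_bi, l_uni = lambdas
    if abs(l_tri + l_bi + l_uni - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return l_tri * p_trigram + l_bi * p_bigram + l_uni * p_unigram


# An unseen trigram (probability 0) still gets mass from the lower orders.
print(interpolate(p_trigram=0.0, p_bigram=0.2, p_unigram=0.05))
```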
$$S(w_i) = \frac{\text{count}(w_i)}{N}$$
Advanced: Good-Turing Smoothing
$$P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$$
$$P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}$$
$$P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\left(\frac{1}{V}\right)}{c(w_{i-1}) + m}$$
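A minimal sketch of add-k smoothing on toy counts; add-one is the case k = 1, and the second form matches the first when m = kV. The counts, the ten-word vocabulary, and the function names are invented for illustration.

```python
from collections import Counter

# Toy counts (invented): words observed after the context word "the".
bigram_counts = Counter({("the", "allegations"): 3, ("the", "reports"): 2,
                         ("the", "claims"): 1, ("the", "request"): 1})
context_counts = Counter({"the": 7})
vocab = ["allegations", "reports", "claims", "request", "outcome",
         "attack", "man", "offer", "loan", "the"]


def add_k_prob(prev, word, k=1.0):
    """(c(prev, word) + k) / (c(prev) + k * V)."""
    V = len(vocab)
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * V)


def add_m_prob(prev, word, m):
    """Same estimate with a pseudo-count mass m spread uniformly, 1/V per word."""
    V = len(vocab)
    return (bigram_counts[(prev, word)] + m * (1 / V)) / (context_counts[prev] + m)


print(add_k_prob("the", "allegations"))   # add-one: (3 + 1) / (7 + 10)
print(add_k_prob("the", "offer", k=0.5))  # an unseen bigram still gets some mass
print(add_m_prob("the", "offer", m=5.0))  # same value, since m = k * V = 0.5 * 10
```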
$$P_{\text{UnigramPrior}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m P(w_i)}{c(w_{i-1}) + m}$$
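The unigram-prior variant replaces the uniform 1/V with each word's unigram probability. A rough sketch on a toy token stream follows; the stream, the value m = 2, and the function name are assumptions.

```python
from collections import Counter

# Toy token stream (invented for illustration).
tokens = ("denied the allegations denied the reports "
          "denied the claims denied the request").split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
total_tokens = len(tokens)


def unigram_prior_prob(prev, word, m=2.0):
    """(c(prev, word) + m * P(word)) / (c(prev) + m), with P(word) the unigram MLE."""
    p_unigram = unigram_counts[word] / total_tokens
    return (bigram_counts[(prev, word)] + m * p_unigram) / (unigram_counts[prev] + m)


# "the denied" never occurs, but "denied" is frequent, so it gets a larger
# share of the pseudo-count mass than a rare word would.
print(unigram_prior_prob("the", "denied"))
print(unigram_prior_prob("the", "allegations"))
```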
• P*GT (unseen) = N1/N = 3/18
• C*(trout) = 2 * N2/N1 = 2 * 1/3 = 2/3
• P*GT(trout) = (2/3) / 18 = 1/27
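The arithmetic above can be reproduced from a table of counts. The per-species catch below is hypothetical, chosen only so that N = 18, N1 = 3, and N2 = 1 as in the example; exact fractions are used so the results match the slide's values.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical catch consistent with N = 18, N1 = 3, N2 = 1.
catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}

N = sum(catch.values())
count_of_counts = Counter(catch.values())  # N_c: number of species seen c times


def good_turing_count(c):
    """Adjusted count c* = (c + 1) * N_{c+1} / N_c."""
    return Fraction((c + 1) * count_of_counts[c + 1], count_of_counts[c])


p_unseen = Fraction(count_of_counts[1], N)        # N1 / N
c_star_trout = good_turing_count(catch["trout"])  # 2 * N2 / N1
p_trout = c_star_trout / N

print(p_unseen, c_star_trout, p_trout)  # 1/6 2/3 1/27
```

For large counts N_{c+1} is often zero (here N_11 = 0 for carp), so the raw formula cannot be applied to every count as-is.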
Held-out words:
Ney et al. Good Turing Intuition
(slide from Dan Klein)
• Intuition from leave-one-out validation
• Take each of the c training words out in turn
• c training sets of size c–1, held-out of size 1
• What fraction of held-out words are unseen in training?
• N_1/c
• What fraction of held-out words are seen k times in training?
• (k+1)N_{k+1}/c
• So in the future we expect (k+1)N_{k+1}/c of the words to be those with training count k
• There are N_k words with training count k
• Each should occur with probability:
• (k+1)N_{k+1}/c/N_k
• …or expected count:
$$k^* = \frac{(k+1) N_{k+1}}{N_k}$$
[Figure: two columns, Training and Held out, pairing N1 with N0, N2 with N1, N3 with N2, …, N3511 with N3510, and N4417 with N4416]
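The leave-one-out argument can be checked by brute force: hold out each token in turn, record how often its word occurs in the remaining c–1 tokens, and compare with (k+1)N_{k+1}/c. The token list is a hypothetical sample and the function names are made up for this sketch.

```python
from collections import Counter
from fractions import Fraction


def held_out_fractions(tokens):
    """Fraction of held-out tokens whose word occurs k times in the other c-1 tokens."""
    counts = Counter(tokens)
    c = len(tokens)
    # Removing one token of a word leaves counts[word] - 1 copies in training.
    seen_k = Counter(counts[t] - 1 for t in tokens)
    return {k: Fraction(n, c) for k, n in seen_k.items()}


def formula_fractions(tokens):
    """The closed form (k+1) * N_{k+1} / c for every k that can occur."""
    counts = Counter(tokens)
    c = len(tokens)
    count_of_counts = Counter(counts.values())  # N_j for each count j
    return {j - 1: Fraction(j * n_j, c) for j, n_j in count_of_counts.items()}


# Hypothetical sample of 18 tokens.
sample = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
tokens = [word for word, n in sample.items() for _ in range(n)]

print(held_out_fractions(tokens) == formula_fractions(tokens))
print(held_out_fractions(tokens)[0])  # share of held-out tokens unseen in training
```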
Good-Turing complications
(slide from Dan Klein)
Advanced:
Kneser-Ney Smoothing
$$P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1}) P(w)$$
where P(w) is the unigram probability.
Kneser-Ney Smoothing I
• Better estimate for probabilities of lower-order unigrams!
• Shannon game: I can’t see without my reading ____? (Francisco or glasses?)
• “Francisco” is more common than “glasses”
• … but “Francisco” always follows “San”
• The unigram is useful exactly when we haven’t seen this bigram!
• Instead of P(w): “How likely is w”
• Pcontinuation(w): “How likely is w to appear as a novel continuation?”
• For each word, count the number of bigram types it completes
• Every bigram type was a novel continuation the first time it was seen
$$P_{\text{CONTINUATION}}(w) \propto \left|\{w_{i-1} : c(w_{i-1}, w) > 0\}\right|$$
Kneser-Ney Smoothing II
• How many times does w appear as a novel continuation:
$$P_{\text{CONTINUATION}}(w) \propto \left|\{w_{i-1} : c(w_{i-1}, w) > 0\}\right|$$
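A sketch of the continuation count on a toy token stream. Normalizing by the total number of bigram types is an assumption here, since the formula above only states proportionality; the token stream and function name are invented for illustration.

```python
from collections import Counter


def continuation_probs(bigram_counts):
    """Share of distinct bigram types that end in each word."""
    # Iterating over the Counter visits each bigram type once, ignoring its count.
    preceding_types = Counter(word for (_prev, word) in bigram_counts)
    total_types = len(bigram_counts)
    return {word: n / total_types for word, n in preceding_types.items()}


# Toy stream: "Francisco" is frequent but only ever follows "San".
tokens = ("San Francisco San Francisco San Francisco "
          "my glasses reading glasses").split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
p_cont = continuation_probs(bigram_counts)

print(unigram_counts["Francisco"], unigram_counts["glasses"])  # raw frequency favours Francisco
print(p_cont["Francisco"], p_cont["glasses"])                  # continuation favours glasses
```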
Kneser-Ney Smoothing IV
$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d, 0)}{c(w_{i-1})} + \lambda(w_{i-1}) P_{\text{CONTINUATION}}(w_i)$$
λ is a normalizing constant; the probability mass we’ve discounted
$$\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \left|\{w : c(w_{i-1}, w) > 0\}\right|$$
• d / c(w_{i-1}) is the normalized discount
• |{w : c(w_{i-1}, w) > 0}| is the number of word types that can follow w_{i-1}
= # of word types we discounted
= # of times we applied normalized discount
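Putting the pieces together, here is a minimal sketch of the interpolated bigram estimate above. It assumes a discount of d = 0.75, a continuation probability normalized by the number of bigram types, and a context word that was seen in training; the toy stream and the name `kneser_ney_bigram` are illustrative only.

```python
from collections import Counter


def kneser_ney_bigram(tokens, d=0.75):
    """Return prob(word, prev) for an interpolated Kneser-Ney bigram model."""
    context_counts = Counter(tokens[:-1])  # c(prev): times prev has a successor
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    followers = Counter(prev for (prev, _w) in bigram_counts)  # |{w : c(prev, w) > 0}|
    preceders = Counter(w for (_prev, w) in bigram_counts)     # |{prev : c(prev, w) > 0}|
    n_bigram_types = len(bigram_counts)

    def prob(word, prev):
        # Assumes prev occurred as a context; otherwise this divides by zero.
        discounted = max(bigram_counts[(prev, word)] - d, 0) / context_counts[prev]
        lam = d / context_counts[prev] * followers[prev]
        p_continuation = preceders[word] / n_bigram_types
        return discounted + lam * p_continuation

    return prob


tokens = ("San Francisco San Francisco San Francisco "
          "my glasses reading glasses").split()
p_kn = kneser_ney_bigram(tokens)

print(p_kn("glasses", "reading"))    # seen bigram
print(p_kn("Francisco", "reading"))  # unseen bigram, low continuation count
```

The mass removed by the discount is exactly what λ hands to the continuation distribution, so the estimates for a given context sum to one.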
$$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(c_{KN}(w_{i-n+1}^{i}) - d, 0)}{c_{KN}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1}) P_{KN}(w_i \mid w_{i-n+2}^{i-1})$$