Language Modeling
Pawan Goyal
CSE, IITKGP
Shakespeare as Corpus
Approximating Shakespeare
Training set
... denied the allegations
... denied the reports
... denied the claims
... denied the request

Test Data
Pretend as if we saw each word one more time than we actually did
Just add one to all the counts!
Add-1 estimate: $P_{\mathrm{Add\text{-}1}}(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$
Equivalently, in terms of adjusted (reconstituted) counts $c^*$:
$\dfrac{c^*(w_{n-1} w_n)}{c(w_{n-1})} = \dfrac{c(w_{n-1} w_n) + 1}{c(w_{n-1}) + V}$
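A minimal sketch of the Add-1 estimate above, using the example sentences that appear later in the deck; the helper names and the choice to include <s> in the vocabulary are illustrative assumptions, not from the slides.

```python
from collections import Counter

# Toy corpus (the example sentences used later in the deck);
# counting <s> as a vocabulary item is an illustrative choice.
sentences = [["<s>", "I", "am", "here"],
             ["<s>", "who", "am", "I"],
             ["<s>", "I", "would", "like"]]

unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
V = len(unigram_counts)  # vocabulary size (number of word types)

def p_add1(w_prev, w):
    """Add-1 (Laplace) estimate: (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

print(p_add1("am", "here"))   # seen bigram: (1 + 1) / (2 + 7)
print(p_add1("am", "would"))  # unseen bigram still gets non-zero mass: 1 / 9
```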
Add-1 estimation
Add-k estimate: $P_{\mathrm{Add\text{-}k}}(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}$
Equivalently, with $m = kV$: $P(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) + m\left(\frac{1}{V}\right)}{c(w_{i-1}) + m}$
Unigram prior smoothing: $P(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) + m\,P(w_i)}{c(w_{i-1}) + m}$
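A sketch of the two variants above; the default values of k and m and the example numbers are arbitrary illustrations, not from the slides.

```python
def p_add_k(c_bigram, c_prev, V, k=0.5):
    """Add-k: (c(w_prev, w) + k) / (c(w_prev) + kV)."""
    return (c_bigram + k) / (c_prev + k * V)

def p_unigram_prior(c_bigram, c_prev, p_w, m=2.0):
    """Unigram prior smoothing: (c(w_prev, w) + m * P(w)) / (c(w_prev) + m)."""
    return (c_bigram + m * p_w) / (c_prev + m)

# An unseen bigram whose context was seen twice, with a 7-word vocabulary:
print(p_add_k(0, 2, V=7))               # 0.5 / 5.5
print(p_unigram_prior(0, 2, p_w=0.25))  # the unigram prior, not 1/V, sets its share
```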
Basic Intuition
Use the count of things we have seen once
to help estimate the count of things we have never seen

Smoothing algorithms
Good-Turing
Kneser-Ney
Witten-Bell
Nc : Frequency of frequency c

Example Sentences
<s> I am here </s>
<s> who am I </s>
<s> I would like </s>

Computing Nc
I → 3, am → 2, here → 1, who → 1, would → 1, like → 1
N1 = 4, N2 = 1, N3 = 1
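The Nc table above can be reproduced directly; this sketch excludes the <s> marker so the numbers match the table.

```python
from collections import Counter

sentences = [["<s>", "I", "am", "here"],
             ["<s>", "who", "am", "I"],
             ["<s>", "I", "would", "like"]]

# Count word tokens, excluding the <s> marker so the numbers match the table above.
word_counts = Counter(w for s in sentences for w in s if w != "<s>")

# N_c = number of word types occurring exactly c times.
n_c = Counter(word_counts.values())
print(word_counts)             # I: 3, am: 2, here/who/would/like: 1 each
print(n_c[1], n_c[2], n_c[3])  # 4 1 1  ->  N1 = 4, N2 = 1, N3 = 1
```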
Good-Turing estimate of the adjusted count: $c^* = \dfrac{(c + 1)\,N_{c+1}}{N_c}$
Intuition
Intuition from leave-one-out validation
Training dataset: c tokens
Take each of the c training words out in turn
We expect a fraction $(k+1)N_{k+1}/c$ of the held-out words to be those with training count k
There are $N_k$ words with training count k
Each should occur with probability: $\dfrac{(k+1)N_{k+1}}{c\,N_k}$
Expected count: $k^* = \dfrac{(k+1)N_{k+1}}{N_k}$
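A quick worked check, using the toy $N_c$ values from earlier ($N_1 = 4$, $N_2 = 1$): a word seen once gets the expected count
$k^* = \dfrac{(1+1)\,N_2}{N_1} = \dfrac{2 \cdot 1}{4} = 0.5,$
i.e., roughly half its raw count, freeing probability mass for unseen events.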
Complications
Simple Good-Turing
Replace empirical $N_k$ with a best-fit power law once counts get unreliable
Absolute discounting (interpolated with the unigram distribution):
$P(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})\,P(w_i)$
Kneser-Ney Smoothing
Intuition
Shannon game: I can't see without my reading ...: glasses/Francisco?
"Francisco" is more common than "glasses"
But "Francisco" mostly follows "San"
Kneser-Ney Smoothing
How many times does w appear as a novel continuation?
$P_{\mathrm{continuation}}(w) \propto |\{w_{i-1} : c(w_{i-1}, w) > 0\}|$
Normalized by the total number of word bigram types:
$P_{\mathrm{continuation}}(w) = \dfrac{|\{w_{i-1} : c(w_{i-1}, w) > 0\}|}{|\{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0\}|}$
A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
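A sketch of the continuation probability over bigram types, on the toy corpus from earlier; the variable names are illustrative.

```python
from collections import defaultdict

sentences = [["<s>", "I", "am", "here"],
             ["<s>", "who", "am", "I"],
             ["<s>", "I", "would", "like"]]
bigram_types = {(s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1)}

# For each w, the set {w_prev : c(w_prev, w) > 0} of distinct left contexts.
left_contexts = defaultdict(set)
for prev, w in bigram_types:
    left_contexts[w].add(prev)

def p_continuation(w):
    """Fraction of all bigram types in which w appears as the second word."""
    return len(left_contexts[w]) / len(bigram_types)

print(p_continuation("I"))     # follows <s> and "am": 2 of 8 bigram types
print(p_continuation("here"))  # follows only "am": 1 of 8 bigram types
```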
Kneser-Ney Smoothing
$P_{\mathrm{KN}}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1}, w_i) - d,\ 0)}{c(w_{i-1})} + \lambda(w_{i-1})\,P_{\mathrm{continuation}}(w_i)$
$\lambda(w_{i-1})$ is a normalizing constant:
$\lambda(w_{i-1}) = \dfrac{d}{c(w_{i-1})}\,|\{w : c(w_{i-1}, w) > 0\}|$
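Putting the pieces together: a compact sketch of the bigram Kneser-Ney estimate above with a fixed discount d; the corpus and helper names are illustrative.

```python
from collections import Counter, defaultdict

sentences = [["<s>", "I", "am", "here"],
             ["<s>", "who", "am", "I"],
             ["<s>", "I", "would", "like"]]
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
followers = defaultdict(set)      # w_prev -> {w : c(w_prev, w) > 0}
left_contexts = defaultdict(set)  # w -> {w_prev : c(w_prev, w) > 0}
for prev, w in bigrams:
    followers[prev].add(w)
    left_contexts[w].add(prev)
D = 0.75  # fixed discount (illustrative)

def p_kn(w_prev, w):
    """max(c - d, 0) / c(w_prev) + lambda(w_prev) * P_continuation(w)."""
    discounted = max(bigrams[(w_prev, w)] - D, 0) / unigrams[w_prev]
    lam = D * len(followers[w_prev]) / unigrams[w_prev]
    p_cont = len(left_contexts[w]) / len(bigrams)  # len(bigrams) = number of bigram types
    return discounted + lam * p_cont

print(p_kn("am", "here"))   # seen bigram
print(p_kn("am", "would"))  # unseen bigram: falls back to the continuation probability
```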
Model Combination
As N increases
The power (expressiveness) of an N-gram model increases,
but the ability to estimate accurate parameters from sparse data decreases (i.e., the smoothing problem gets worse).
A general approach is to combine the results of multiple N-gram models.
Backoff
use trigram if you have good evidence,
otherwise bigram, otherwise unigram

Interpolation
mix unigram, bigram, trigram

Interpolation is found to work better.
Linear Interpolation

Simple Interpolation
$\hat{P}(w_n \mid w_{n-1} w_{n-2}) = \lambda_1 P(w_n \mid w_{n-1} w_{n-2}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$
with $\sum_i \lambda_i = 1$
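A minimal sketch of simple interpolation for a trigram model; the sentences are padded with two start symbols for the trigram context, and the lambda values are placeholders (in practice they are tuned on held-out data).

```python
from collections import Counter

# Sentences padded with two start symbols for the trigram context (illustrative).
sentences = [["<s>", "<s>", "I", "am", "here"],
             ["<s>", "<s>", "who", "am", "I"],
             ["<s>", "<s>", "I", "would", "like"]]
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(tuple(s[i:i + 2]) for s in sentences for i in range(len(s) - 1))
trigrams = Counter(tuple(s[i:i + 3]) for s in sentences for i in range(len(s) - 2))
N = sum(unigrams.values())

def p_interp(w, w1, w2, lambdas=(0.6, 0.3, 0.1)):
    """lambda1 * P(w | w2 w1) + lambda2 * P(w | w1) + lambda3 * P(w); the lambdas sum to 1."""
    l1, l2, l3 = lambdas
    p3 = trigrams[(w2, w1, w)] / bigrams[(w2, w1)] if bigrams[(w2, w1)] else 0.0
    p2 = bigrams[(w1, w)] / unigrams[w1] if unigrams[w1] else 0.0
    p1 = unigrams[w] / N
    return l1 * p3 + l2 * p2 + l3 * p1

print(p_interp("here", "am", "I"))  # P(here | I am) under the interpolated model
```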