
Language Modeling: Part II

Pawan Goyal
CSE, IITKGP

July 31, 2014


Lower perplexity = better model


WSJ Corpus
Training: 38 million words
Test: 1.5 million words

What does a unigram perplexity of 962 mean?


The model is as confused on test data as if it had to choose uniformly and
independently among 962 possibilities for each word.
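A minimal sketch (not from the slides) of how perplexity could be computed for an unsmoothed unigram model; `train_tokens` and `test_tokens` are hypothetical token lists.

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an unsmoothed MLE unigram model on a test set."""
    counts = Counter(train_tokens)
    total = sum(counts.values())
    log_prob = 0.0
    for w in test_tokens:
        p = counts[w] / total      # MLE unigram probability; 0 for unseen words
        log_prob += math.log2(p)   # raises ValueError if p == 0 (the zeros problem, below)
    # Perplexity = 2 ** (negative average log2 probability per word)
    return 2 ** (-log_prob / len(test_tokens))
```

Lower perplexity means the model assigns higher probability to the test set; a value of 962 corresponds to the uniform-choice interpretation above.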

The Shannon Visualization Method

Use the language model to generate word sequences


Choose a random bigram (<s>, w) as per its probability
Choose a random bigram (w, x) as per its probability
And so on, until we choose </s>
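A small sketch of the generation procedure just described, assuming the bigram model is stored as a dict mapping each word to a Counter of the words that follow it; the names (`bigram_counts`, `generate_sentence`) are illustrative, not from the slides.

```python
import random

def generate_sentence(bigram_counts, start="<s>", end="</s>", max_len=30):
    """Sample a word sequence from a bigram model (Shannon visualization)."""
    word, sentence = start, []
    for _ in range(max_len):
        followers = bigram_counts[word]                  # counts of words observed after `word`
        words, counts = zip(*followers.items())
        word = random.choices(words, weights=counts)[0]  # choose the next word as per its probability
        if word == end:
            break
        sentence.append(word)
    return " ".join(sentence)
```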


Shakespeare as Corpus

N = 884,647 tokens, V = 29,066


Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams.


Approximating Shakespeare

Problems with simple MLE estimate: zeros

Training set
... denied the allegations
... denied the reports
... denied the claims
... denied the request

Test Data
... denied the offer
... denied the loan

Zero probability bigrams


P(offer | denied the) = 0
The test set will be assigned probability 0
And the perplexity can't be computed
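To make this concrete, a hedged sketch (the probabilities below are hypothetical MLE estimates for the training set above): any test bigram unseen in training gets probability 0, so the whole test set gets probability 0 and perplexity cannot be computed.

```python
import math

# Hypothetical MLE estimates for words following "denied the" in the training set above
p_mle = {"allegations": 0.25, "reports": 0.25, "claims": 0.25, "request": 0.25}

p_offer = p_mle.get("offer", 0.0)   # "offer" never followed "denied the" in training
print(p_offer)                      # 0.0
# The test sentence "... denied the offer" therefore gets probability 0, and
# math.log(p_offer) raises ValueError, so perplexity cannot be computed.
```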

Language Modeling: Smoothing


With sparse statistics

Steal probability mass to generalize better


Laplace Smoothing (Add-one estimation)

Pretend as if we saw each word one more time than we actually did
Just add one to all the counts!
MLE estimate: P_MLE(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}

Add-1 estimate: P_Add-1(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}
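A minimal sketch of the two estimates above for bigrams; `bigram_counts` and `unigram_counts` are assumed to be count dictionaries (e.g. collections.Counter) built from the training corpus, and V is the vocabulary size.

```python
def mle_bigram_prob(w_prev, w, bigram_counts, unigram_counts):
    """P_MLE(w | w_prev) = c(w_prev, w) / c(w_prev)."""
    return bigram_counts.get((w_prev, w), 0) / unigram_counts[w_prev]

def add_one_bigram_prob(w_prev, w, bigram_counts, unigram_counts, V):
    """P_Add-1(w | w_prev) = (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    return (bigram_counts.get((w_prev, w), 0) + 1) / (unigram_counts[w_prev] + V)
```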


Reconstituted counts as effect of smoothing

Effective bigram count c^*(w_{n-1} w_n):

\frac{c^*(w_{n-1} w_n)}{c(w_{n-1})} = \frac{c(w_{n-1} w_n) + 1}{c(w_{n-1}) + V}


Comparing with bigrams: Restaurant corpus


Add-1 estimation

Not used for N-grams


There are better smoothing methods

Is used to smooth other NLP models


In domains where the number of zeros isn't so large
For text classification


More general formulations: Add-k

P_Add-k(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}

P_Add-k(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i) + m (1/V)}{c(w_{i-1}) + m}

Unigram prior smoothing:

P_UnigramPrior(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i) + m P(w_i)}{c(w_{i-1}) + m}
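The same pattern generalizes directly; a brief sketch of Add-k and unigram-prior smoothing, reusing the same hypothetical count tables as in the Add-1 sketch above (`p_unigram` is an assumed unigram distribution).

```python
def add_k_bigram_prob(w_prev, w, bigram_counts, unigram_counts, V, k=0.5):
    """P_Add-k(w | w_prev) = (c(w_prev, w) + k) / (c(w_prev) + kV)."""
    return (bigram_counts.get((w_prev, w), 0) + k) / (unigram_counts[w_prev] + k * V)

def unigram_prior_bigram_prob(w_prev, w, bigram_counts, unigram_counts, p_unigram, m=1.0):
    """P_UnigramPrior(w | w_prev) = (c(w_prev, w) + m * P(w)) / (c(w_prev) + m)."""
    return (bigram_counts.get((w_prev, w), 0) + m * p_unigram[w]) / (unigram_counts[w_prev] + m)
```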


Advanced smoothing algorithms

Basic Intuition
Use the count of things we have seen once
to help estimate the count of things we have never seen

Smoothing algorithms
Good-Turing
Kneser-Ney
Witten-Bell

Nc : Frequency of frequency c
Example Sentences
<s> I am here </s>
<s> who am I </s>
<s> I would like </s>

Computing Nc
I      3
am     2
here   1
who    1
would  1
like   1

N1 = 4
N2 = 1
N3 = 1
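A quick sketch of computing the frequency-of-frequency counts from a token list; on the example sentences above it reproduces N_1 = 4, N_2 = 1, N_3 = 1.

```python
from collections import Counter

def frequency_of_frequencies(tokens):
    """N_c: how many word types occur exactly c times."""
    word_counts = Counter(tokens)          # c(w) for each word type
    return Counter(word_counts.values())   # N_c for each count c

tokens = "I am here who am I I would like".split()
print(frequency_of_frequencies(tokens))    # N_1 = 4, N_2 = 1, N_3 = 1
```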


Good-Turing smoothing intuition


You are fishing and caught
10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

How likely is it that the next species is trout?


1/18

How likely is it that the next species is new?


Use the estimate of things-we-saw-once to estimate the new things
3/18 (N1 = 3)

So, how likely is it that the next species is trout?


Must be less than 1/18

Good Turing calculations

P_GT(things with zero frequency) = \frac{N_1}{N}

Unseen word: P_GT(unseen) = 3/18

Things with non-zero frequency:

c^* = \frac{(c+1) N_{c+1}}{N_c}

Seen once (trout):

c^*(trout) = 2 \cdot N_2 / N_1 = 2/3

P_GT(trout) = \frac{2/3}{18} = 1/27
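A sketch of the Good-Turing formulas applied to the fishing example; it assumes the raw N_c values are available and nonzero (real implementations first smooth the N_c curve, as discussed below).

```python
def good_turing_count(c, N):
    """c* = (c + 1) * N_{c+1} / N_c, for a dict N of frequency-of-frequency counts."""
    return (c + 1) * N.get(c + 1, 0) / N[c]

N = {1: 3, 2: 1, 3: 1, 10: 1}   # fishing example: 3 singletons, 1 doubleton, ...
total = 18                       # total fish caught

p_unseen = N[1] / total                      # P_GT(unseen) = N_1 / N = 3/18
p_trout = good_turing_count(1, N) / total    # c*(trout) = 2 * N_2 / N_1 = 2/3, so 1/27
print(p_unseen, p_trout)
```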

Intuition
Intuition from leave-one-out validation
Training dataset: c tokens
Take each of the c training words out in turn

c training sets of size c − 1, each with a held-out set of size 1


What fraction of held-out words are unseen in training? : N_1 / c
What fraction of held-out words are seen k times in training? : (k+1) N_{k+1} / c

We expect (k+1) N_{k+1} / c of the words to be those with training count k
There are N_k words with training count k
Each should occur with probability: \frac{(k+1) N_{k+1}}{c N_k}

Expected count: k^* = \frac{(k+1) N_{k+1}}{N_k}

Complications

What about "the"?

For small k, N_k > N_{k+1}
For large k, the counts get too jumpy

Simple Good-Turing
Replace the empirical N_k with a best-fit power law once counts get unreliable

Good-Turing numbers: Example

22 million words of AP Newswire

c^* = \frac{(c+1) N_{c+1}}{N_c}

It looks like c^* ≈ c − 0.75

Absolute Discounting Interpolation

Why don't we just subtract 0.75 (or some d)?

P_AbsoluteDiscounting(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1}) P(w_i)

We may keep separate values of d for counts 1 and 2

But can we do better than using the regular unigram probability P(w_i)?

Kneser-Ney Smoothing

Intuition
Shannon game: "I can't see without my reading ___": glasses or Francisco?
"Francisco" is more common than "glasses"
But "Francisco" mostly follows "San"

P(w): How likely is w?

Instead, P_continuation(w): How likely is w to appear as a novel continuation?
For each word, count the number of bigram types it completes
Every bigram type was a novel continuation the first time it was seen

P_continuation(w) \propto |\{w_{i-1} : c(w_{i-1}, w) > 0\}|

Kneser-Ney Smoothing
How many times does w appear as a novel continuation?
P_continuation(w) \propto |\{w_{i-1} : c(w_{i-1}, w) > 0\}|

Normalized by the total number of word bigram types:

P_continuation(w) = \frac{|\{w_{i-1} : c(w_{i-1}, w) > 0\}|}{|\{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0\}|}

A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability.

Kneser-Ney Smoothing

P_KN(w_i | w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d, 0)}{c(w_{i-1})} + \lambda(w_{i-1}) P_continuation(w_i)

\lambda is a normalizing constant:

\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} |\{w : c(w_{i-1}, w) > 0\}|
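A compact sketch of interpolated Kneser-Ney for bigrams, following the two formulas above; `bigram_counts` is a hypothetical dict mapping (w_prev, w) pairs to counts, and the discount d defaults to 0.75.

```python
from collections import Counter, defaultdict

def build_kneser_ney(bigram_counts, d=0.75):
    """Return a function computing P_KN(w | w_prev) for a bigram model."""
    context_counts = Counter()        # c(w_prev): tokens observed after each context
    followers = defaultdict(set)      # distinct continuations of each context
    histories = defaultdict(set)      # distinct left contexts of each word
    for (w_prev, w), c in bigram_counts.items():
        context_counts[w_prev] += c
        followers[w_prev].add(w)
        histories[w].add(w_prev)
    num_bigram_types = len(bigram_counts)

    def p_continuation(w):
        return len(histories[w]) / num_bigram_types

    def p_kn(w, w_prev):
        # Assumes w_prev was seen in training (context_counts[w_prev] > 0).
        discounted = max(bigram_counts.get((w_prev, w), 0) - d, 0) / context_counts[w_prev]
        lam = d * len(followers[w_prev]) / context_counts[w_prev]   # normalizing constant
        return discounted + lam * p_continuation(w)

    return p_kn
```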

Model Combination

As N increases
The power (expressiveness) of an N-gram model increases
but the ability to estimate accurate parameters from sparse data
decreases (i.e. the smoothing problem gets worse).
A general approach is to combine the results of multiple N-gram models.

Backoff and Interpolation

It might help to use less context


when you haven't learned much about larger contexts

Backoff
use trigram if you have good evidence
otherwise bigram, otherwise unigram

Interpolation
mix unigram, bigram, trigram
Interpolation is found to work better

Linear Interpolation

Simple Interpolation

\hat{P}(w_n | w_{n-1} w_{n-2}) = \lambda_1 P(w_n | w_{n-1} w_{n-2}) + \lambda_2 P(w_n | w_{n-1}) + \lambda_3 P(w_n)

\sum_i \lambda_i = 1

Lambdas conditional on context:

\hat{P}(w_n | w_{n-1} w_{n-2}) = \lambda_1(w_{n-2}, w_{n-1}) P(w_n | w_{n-1} w_{n-2}) + \lambda_2(w_{n-2}, w_{n-1}) P(w_n | w_{n-1}) + \lambda_3(w_{n-2}, w_{n-1}) P(w_n)
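A short sketch of simple linear interpolation; `p_tri`, `p_bi`, and `p_uni` stand for hypothetical component estimators (e.g. the smoothed N-gram models above), and the lambdas are assumed to sum to 1.

```python
def interpolated_prob(w, w1, w2, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """P_hat(w | w1 w2) = l1*P(w | w1 w2) + l2*P(w | w1) + l3*P(w).

    w1 is the previous word (w_{n-1}), w2 the word before that (w_{n-2}).
    """
    l1, l2, l3 = lambdas
    return l1 * p_tri(w, w1, w2) + l2 * p_bi(w, w1) + l3 * p_uni(w)
```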


Setting the lambda values

Use a held-out corpus


Choose the λs to maximize the probability of the held-out data:
Find the N-gram probabilities on the training data
Search for the λs that give the largest probability to the held-out data
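A sketch of that search as a simple grid search over lambda values, scoring each setting by the log probability it assigns to held-out (w_{n-2}, w_{n-1}, w_n) triples; it reuses the hypothetical `interpolated_prob` from the previous sketch, and grid search is just one possible way to do the search.

```python
import itertools
import math

def choose_lambdas(heldout_trigrams, p_tri, p_bi, p_uni, step=0.1):
    """Grid-search interpolation weights on held-out (w_{n-2}, w_{n-1}, w_n) triples."""
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in itertools.product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:          # skip weight combinations that don't sum to 1
            continue
        l3 = max(l3, 0.0)
        ll = sum(math.log(interpolated_prob(w, w1, w2, p_tri, p_bi, p_uni, (l1, l2, l3)) + 1e-12)
                 for (w2, w1, w) in heldout_trigrams)
        if ll > best_ll:        # keep the lambdas with the highest held-out log probability
            best, best_ll = (l1, l2, l3), ll
    return best
```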
