Language Modeling
Pawan Goyal
CSE, IITKGP
Shakespeare as Corpus
Approximating Shakespeare
Training set
... denied the allegations
... denied the reports
... denied the claims
... denied the request

Test Data
Pretend as if we saw each word one more time than we actually did
Just add one to all the counts!
Add-1 estimate: $P_{\mathrm{Add\text{-}1}}(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$
Equivalently, in terms of adjusted (reconstituted) counts $c^*$:
$\dfrac{c^*(w_{n-1} w_n)}{c(w_{n-1})} = \dfrac{c(w_{n-1} w_n) + 1}{c(w_{n-1}) + V}$
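A minimal sketch of the Add-1 estimate above, using the example sentences that appear later in the deck; the helper names and the choice to include <s> in the vocabulary are illustrative assumptions, not from the slides.

```python
from collections import Counter

# Toy corpus (the example sentences used later in the deck);
# counting <s> as a vocabulary item is an illustrative choice.
sentences = [["<s>", "I", "am", "here"],
             ["<s>", "who", "am", "I"],
             ["<s>", "I", "would", "like"]]

unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
V = len(unigram_counts)  # vocabulary size (number of word types)

def p_add1(w_prev, w):
    """Add-1 (Laplace) estimate: (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

print(p_add1("am", "here"))   # seen bigram: (1 + 1) / (2 + 7)
print(p_add1("am", "would"))  # unseen bigram still gets non-zero mass: 1 / 9
```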
Add-1 estimation
Add-k estimate: $P_{\mathrm{Add\text{-}k}}(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}$
Equivalently, with $m = kV$: $P(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) + m\left(\frac{1}{V}\right)}{c(w_{i-1}) + m}$
Unigram prior smoothing: $P(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) + m\,P(w_i)}{c(w_{i-1}) + m}$
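A sketch of the two variants above; the default values of k and m and the example numbers are arbitrary illustrations, not from the slides.

```python
def p_add_k(c_bigram, c_prev, V, k=0.5):
    """Add-k: (c(w_prev, w) + k) / (c(w_prev) + kV)."""
    return (c_bigram + k) / (c_prev + k * V)

def p_unigram_prior(c_bigram, c_prev, p_w, m=2.0):
    """Unigram prior smoothing: (c(w_prev, w) + m * P(w)) / (c(w_prev) + m)."""
    return (c_bigram + m * p_w) / (c_prev + m)

# An unseen bigram whose context was seen twice, with a 7-word vocabulary:
print(p_add_k(0, 2, V=7))               # 0.5 / 5.5
print(p_unigram_prior(0, 2, p_w=0.25))  # the unigram prior, not 1/V, sets its share
```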
Basic Intuition
Use the count of things we have seen once
to help estimate the count of things we have never seen

Smoothing algorithms
Good-Turing
Kneser-Ney
Witten-Bell
Nc : Frequency of frequency c

Example Sentences
<s> I am here </s>
<s> who am I </s>
<s> I would like </s>

Computing Nc
I → 3, am → 2, here → 1, who → 1, would → 1, like → 1
N1 = 4, N2 = 1, N3 = 1
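The Nc table above can be reproduced directly; this sketch excludes the <s> marker so the numbers match the table.

```python
from collections import Counter

sentences = [["<s>", "I", "am", "here"],
             ["<s>", "who", "am", "I"],
             ["<s>", "I", "would", "like"]]

# Count word tokens, excluding the <s> marker so the numbers match the table above.
word_counts = Counter(w for s in sentences for w in s if w != "<s>")

# N_c = number of word types occurring exactly c times.
n_c = Counter(word_counts.values())
print(word_counts)             # I: 3, am: 2, here/who/would/like: 1 each
print(n_c[1], n_c[2], n_c[3])  # 4 1 1  ->  N1 = 4, N2 = 1, N3 = 1
```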
Good-Turing estimate of the adjusted count: $c^* = \dfrac{(c + 1)\,N_{c+1}}{N_c}$
Intuition
Intuition from leave-one-out validation
Training dataset: c tokens
Take each of the c training words out in turn
We expect a fraction $(k+1)N_{k+1}/c$ of the held-out words to be those with training count k
There are $N_k$ words with training count k
Each should occur with probability: $\dfrac{(k+1)N_{k+1}}{c\,N_k}$
Expected count: $k^* = \dfrac{(k+1)N_{k+1}}{N_k}$
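A quick worked check, using the toy $N_c$ values from earlier ($N_1 = 4$, $N_2 = 1$): a word seen once gets the expected count
$k^* = \dfrac{(1+1)\,N_2}{N_1} = \dfrac{2 \cdot 1}{4} = 0.5,$
i.e., roughly half its raw count, freeing probability mass for unseen events.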
Complications
Simple Good-Turing
Replace empirical $N_k$ with a best-fit power law once counts get unreliable
Absolute discounting (interpolated with the unigram distribution):
$P(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} + \lambda(w_{i-1})\,P(w_i)$
Kneser-Ney Smoothing
Intuition
Shannon game: I can't see without my reading ...: glasses/Francisco?
"Francisco" is more common than "glasses"
But "Francisco" mostly follows "San"
Kneser-Ney Smoothing
How many times does w appear as a novel continuation?
$P_{\mathrm{continuation}}(w) \propto |\{w_{i-1} : c(w_{i-1}, w) > 0\}|$
Normalized by the total number of word bigram types:
$P_{\mathrm{continuation}}(w) = \dfrac{|\{w_{i-1} : c(w_{i-1}, w) > 0\}|}{|\{(w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0\}|}$
A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
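A sketch of the continuation probability over bigram types, on the toy corpus from earlier; the variable names are illustrative.

```python
from collections import defaultdict

sentences = [["<s>", "I", "am", "here"],
             ["<s>", "who", "am", "I"],
             ["<s>", "I", "would", "like"]]
bigram_types = {(s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1)}

# For each w, the set {w_prev : c(w_prev, w) > 0} of distinct left contexts.
left_contexts = defaultdict(set)
for prev, w in bigram_types:
    left_contexts[w].add(prev)

def p_continuation(w):
    """Fraction of all bigram types in which w appears as the second word."""
    return len(left_contexts[w]) / len(bigram_types)

print(p_continuation("I"))     # follows <s> and "am": 2 of 8 bigram types
print(p_continuation("here"))  # follows only "am": 1 of 8 bigram types
```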
Kneser-Ney Smoothing
$P_{\mathrm{KN}}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1}, w_i) - d,\ 0)}{c(w_{i-1})} + \lambda(w_{i-1})\,P_{\mathrm{continuation}}(w_i)$
$\lambda(w_{i-1})$ is a normalizing constant:
$\lambda(w_{i-1}) = \dfrac{d}{c(w_{i-1})}\,|\{w : c(w_{i-1}, w) > 0\}|$
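Putting the pieces together: a compact sketch of the bigram Kneser-Ney estimate above with a fixed discount d; the corpus and helper names are illustrative.

```python
from collections import Counter, defaultdict

sentences = [["<s>", "I", "am", "here"],
             ["<s>", "who", "am", "I"],
             ["<s>", "I", "would", "like"]]
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
followers = defaultdict(set)      # w_prev -> {w : c(w_prev, w) > 0}
left_contexts = defaultdict(set)  # w -> {w_prev : c(w_prev, w) > 0}
for prev, w in bigrams:
    followers[prev].add(w)
    left_contexts[w].add(prev)
D = 0.75  # fixed discount (illustrative)

def p_kn(w_prev, w):
    """max(c - d, 0) / c(w_prev) + lambda(w_prev) * P_continuation(w)."""
    discounted = max(bigrams[(w_prev, w)] - D, 0) / unigrams[w_prev]
    lam = D * len(followers[w_prev]) / unigrams[w_prev]
    p_cont = len(left_contexts[w]) / len(bigrams)  # len(bigrams) = number of bigram types
    return discounted + lam * p_cont

print(p_kn("am", "here"))   # seen bigram
print(p_kn("am", "would"))  # unseen bigram: falls back to the continuation probability
```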
Model Combination
As N increases
The power (expressiveness) of an N-gram model increases,
but the ability to estimate accurate parameters from sparse data decreases (i.e., the smoothing problem gets worse).
A general approach is to combine the results of multiple N-gram models.
Backoff
use trigram if you have good evidence,
otherwise bigram, otherwise unigram

Interpolation
mix unigram, bigram, trigram

Interpolation is found to work better.
Linear Interpolation

Simple Interpolation
$\hat{P}(w_n \mid w_{n-1} w_{n-2}) = \lambda_1 P(w_n \mid w_{n-1} w_{n-2}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$
with $\sum_i \lambda_i = 1$
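A minimal sketch of simple interpolation for a trigram model; the sentences are padded with two start symbols for the trigram context, and the lambda values are placeholders (in practice they are tuned on held-out data).

```python
from collections import Counter

# Sentences padded with two start symbols for the trigram context (illustrative).
sentences = [["<s>", "<s>", "I", "am", "here"],
             ["<s>", "<s>", "who", "am", "I"],
             ["<s>", "<s>", "I", "would", "like"]]
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(tuple(s[i:i + 2]) for s in sentences for i in range(len(s) - 1))
trigrams = Counter(tuple(s[i:i + 3]) for s in sentences for i in range(len(s) - 2))
N = sum(unigrams.values())

def p_interp(w, w1, w2, lambdas=(0.6, 0.3, 0.1)):
    """lambda1 * P(w | w2 w1) + lambda2 * P(w | w1) + lambda3 * P(w); the lambdas sum to 1."""
    l1, l2, l3 = lambdas
    p3 = trigrams[(w2, w1, w)] / bigrams[(w2, w1)] if bigrams[(w2, w1)] else 0.0
    p2 = bigrams[(w1, w)] / unigrams[w1] if unigrams[w1] else 0.0
    p1 = unigrams[w] / N
    return l1 * p3 + l2 * p2 + l3 * p1

print(p_interp("here", "am", "I"))  # P(here | I am) under the interpolated model
```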