

NLP BASIC

03-N-gram Language Model

NGUYỄN QUỐC THÁI


thai.nq07@gmail.com
Contents
● N-gram Overview
● N-gram Probabilities
● Estimating N-gram Probabilities
● Evaluating Language Models
● Issues with N-gram Language Models
● Smoothing
● Backoff and Interpolation

2
N-gram Overview
Language Models
- Compute the probability that a particular sequence of words occurs
- These probabilities are essential in many tasks:
  Machine Translation: “Tôi đi học”
    P(I go to school) > P(I go to work)
  Spelling Correction:
    P(Everything has improved) > P(Everything has improve)

3
N-gram Overview
Probabilistic Language Modeling
- Compute the probability of a sequence of words
  For a sequence of words W = (w_1, w_2, w_3, ..., w_n):
  P(W) = P(w_1, w_2, w_3, ..., w_n)
- Compute the probability of an upcoming word:
  P(w_n | w_1, w_2, w_3, ..., w_{n-1})
- A model that computes either of these is called a language model (LM)

4
N-gram Probabilities
Probabilistic Language Modeling
• Computing P(W) = P(w_1, w_2, w_3, ..., w_n)
• Conditional probability:
  P(B | A) = P(A, B) / P(A)  =>  P(A, B) = P(A) P(B | A)
• The chain rule of probability:
  P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)
  => P(w_1, w_2, w_3, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_{1:2}) ... P(w_n | w_{1:n-1})
                                = \prod_{k=1}^{n} P(w_k | w_{1:k-1})

5
N-gram Probabilities
Probabilistic Language Modeling
P(w_1, w_2, w_3, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_{1:2}) ... P(w_n | w_{1:n-1}) = \prod_{k=1}^{n} P(w_k | w_{1:k-1})
Example: given the sentence “tôi đang học lớp nlp” (“I am taking an NLP class”):
P(tôi, đang, học, lớp, nlp)
  = P(tôi) P(đang | tôi) P(học | tôi, đang)
    P(lớp | tôi, đang, học) P(nlp | tôi, đang, học, lớp)

6
N-gram Probabilities
Probabilistic Language Modeling
• Computing P(w | h)
  - w: the word “nlp”
  - h: the history “tôi đang học lớp”
  P(nlp | tôi đang học lớp) = count(tôi đang học lớp nlp) / count(tôi đang học lớp)
• Problem: even a very large corpus will not contain most long histories, so these counts are unreliable or zero

7
N-gram Probabilities
Markov Assumption
• Approximate the probability of a word using only the few preceding words
• An N-gram model looks N-1 words into the past:
  P(w_{1:n}) ≈ \prod_{i=1}^{n} P(w_i | w_{i-N+1:i-1})
  P(w_i | w_{1:i-1}) ≈ P(w_i | w_{i-N+1:i-1})
• N-gram models: N = 1, 2, 3, 4, 5, …

8
N-gram Probabilities
Markov Assumption
• Unigram model (1-gram)
  P(w_{1:n}) ≈ \prod_{i=1}^{n} P(w_i)
• Given the sentence “tôi đang học lớp nlp”:
  P(tôi, đang, học, lớp, nlp) = P(tôi) P(đang) P(học) P(lớp) P(nlp)

9
N-gram Probabilities
Markov Assumption
• Bigram model (2-gram)
  P(w_{1:n}) ≈ \prod_{i=1}^{n} P(w_i | w_{i-1})
• Given the sentence “tôi đang học lớp nlp”
  => Padding: “<s> tôi đang học lớp nlp </s>”
  P(tôi, đang, học, lớp, nlp)
    = P(tôi | <s>) P(đang | tôi) P(học | đang) P(lớp | học)
      P(nlp | lớp) P(</s> | nlp)
10
N-gram Probabilities
Markov Assumption
• Trigram model (3-gram)
  P(w_{1:n}) ≈ \prod_{i=1}^{n} P(w_i | w_{i-2:i-1})
• Given the sentence “tôi đang học lớp nlp”
  => Padding: “<s> <s> tôi đang học lớp nlp </s> </s>”
  P(tôi, đang, học, lớp, nlp)
    = P(tôi | <s>, <s>) P(đang | <s>, tôi) P(học | tôi, đang)
      P(lớp | đang, học) P(nlp | học, lớp) P(</s> | lớp, nlp) P(</s> | nlp, </s>)
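To make the padding and factorization concrete, here is a small Python sketch (the function names pad and ngrams are mine, not from the slides) that pads a sentence and lists the conditional probabilities a trigram model would multiply:

# Minimal sketch: padding a sentence and extracting its n-grams.
# Function names (pad, ngrams) are illustrative, not from the slides.

def pad(tokens, n):
    """Add n-1 <s> markers at the start and n-1 </s> markers at the end."""
    return ["<s>"] * (n - 1) + tokens + ["</s>"] * (n - 1)

def ngrams(tokens, n):
    """Return all consecutive n-word windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "tôi đang học lớp nlp".split()
for gram in ngrams(pad(sentence, 3), 3):
    context, word = gram[:-1], gram[-1]
    print(f"P({word} | {' '.join(context)})")

Running this prints the same seven factors shown above, from P(tôi | <s> <s>) through P(</s> | nlp </s>).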

11
Estimating N-gram Probabilities
Maximum likelihood estimation (MLE)
• Estimating bigram probabilities:
  P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
• Estimating n-gram probabilities:
  P(w_i | w_{i-N+1:i-1}) = c(w_{i-N+1:i-1}, w_i) / c(w_{i-N+1:i-1})

12
Estimating N-gram Probabilities
Maximum likelihood estimation (MLE)
• Example bigram model, with the toy corpus:
    <s> tôi đang học </s>
    <s> tôi đang học lớp nlp </s>
    <s> lớp nlp có vẻ hơi vui </s>
  P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
  P(tôi|<s>) = 2/3    P(đang|tôi) = 2/2    P(học|đang) = 2/2
  P(</s>|học) = 1/2   P(lớp|học) = 1/2     P(nlp|lớp) = 2/2
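The same estimates can be reproduced with a short Python sketch (the helper names train_bigram and bigram_prob are mine, not from the slides):

# Sketch: MLE bigram estimation on the toy corpus above.
from collections import Counter

corpus = [
    "tôi đang học",
    "tôi đang học lớp nlp",
    "lớp nlp có vẻ hơi vui",
]

def train_bigram(sentences):
    unigram, bigram = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigram.update(tokens[:-1])              # context counts c(w_{i-1})
        bigram.update(zip(tokens, tokens[1:]))   # pair counts c(w_{i-1}, w_i)
    return unigram, bigram

def bigram_prob(w_prev, w, unigram, bigram):
    return bigram[(w_prev, w)] / unigram[w_prev]

unigram, bigram = train_bigram(corpus)
print(bigram_prob("<s>", "tôi", unigram, bigram))   # 2/3 ≈ 0.667
print(bigram_prob("học", "lớp", unigram, bigram))   # 1/2 = 0.5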

13
Estimating N-gram Probabilities
Example: “Truyện Kiều” (training corpus: the opening lines of the Vietnamese poem The Tale of Kiều)
Trăm năm trong cõi người ta,
Chữ tài chữ mệnh khéo là ghét nhau.
Trải qua một cuộc bể dâu,
Những điều trông thấy mà đau đớn lòng.
Lạ gì bỉ sắc tư phong,
Trời xanh quen thói má hồng đánh ghen.
Cảo thơm lần giở trước đèn,
Phong tình cổ lục còn truyền sử xanh.
Rằng năm Gia Tĩnh triều Minh,
Bốn phương phẳng lặng, hai kinh vững vàng.

14
Estimating N-gram Probabilities
Example: “Truyện Kiều”
Bigram counts c(row word, column word) — row = previous word, column = next word:

         <s>   trăm   năm   trong   cõi   người   ta
<s>        0     13     5      23     1      22     1
trăm       0      0     9       0     0       0     0
năm        0      0     4       1     0       1     0
trong      0      0     1       0     3       1     0
cõi        0      0     0       0     0       1     0
người      0      0     0       2     0       2     7
ta         0      0     0       0     0       0     2

15
Estimating N-gram Probabilities
Example: “Truyện Kiều”
Bigram probabilities P(column word | row word):

         <s>      trăm     năm      trong    cõi      người    ta
<s>      0        0.004    0.0015   0.007    0.0003   0.0068   0.0003
trăm     0        0        0.29     0        0        0        0
năm      0        0        0.077    0.019    0        0.019    0
trong    0        0        0.0095   0        0.0285   0.0095   0
cõi      0        0        0        0        0        0.1      0
người    0        0        0        0.009    0        0.009    0.031
ta       0        0        0        0        0        0        0.035
16
Estimating N-gram Probabilities
Example: “Truyện Kiều”
Sequence probability for “trăm năm trong cõi người ta”:

P(trăm năm trong cõi người ta)
  = P(trăm | <s>) P(năm | trăm) P(trong | năm)
    P(cõi | trong) P(người | cõi) P(ta | người) P(</s> | ta)
  = 0.004 * 0.29 * 0.019 * 0.0285 * 0.1 * 0.031 * 0.439 ≈ 8.5e-10

17
Estimating N-gram Probabilities
Problem: Numerical Underflow
• 8.5e-10 is already tiny; multiplying many such probabilities quickly underflows
• Convert from linear space to log space:
  - Avoids underflow
  - Adding is faster than multiplying
  p_1 × p_2 × p_3 × p_4 = exp(log p_1 + log p_2 + log p_3 + log p_4)
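A minimal sketch of the log-space trick, reusing the bigram probabilities of the Truyện Kiều example from the slide above:

# Sketch: computing a sequence probability in log space to avoid underflow.
import math

# Bigram probabilities for "trăm năm trong cõi người ta </s>" (from the slide).
probs = [0.004, 0.29, 0.019, 0.0285, 0.1, 0.031, 0.439]

log_prob = sum(math.log(p) for p in probs)   # sum of logs instead of a product
print(log_prob)                 # ≈ -20.89 (natural log)
print(math.exp(log_prob))       # ≈ 8.5e-10, same value as multiplying directly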

18
Evaluating Language Models
Extrinsic evaluation:
• The best way to evaluate performance is to embed the model in an application
• Comparing models A and B:
  - Put each model into a task, e.g. an MT system
  - Run the MT task and get an accuracy for A and for B
  - Compare the accuracies; higher accuracy is better
• Problem: running big NLP systems end-to-end is often very expensive!

19
Evaluating Language Models
Intrinsic evaluation:
• Evaluate a model independently of any application
• Train the parameters of the model on a training set
• Test the model on unseen data – the test set
• In practice:
  - We need a fresh test set that is truly unseen
  => Split the data into a train set, a validation set, and a test set
  - Training: train set; tuning: validation set; evaluation: test set
  - A standard intrinsic metric: perplexity
20
Evaluating Language Models
Perplexity:
• The best language model is the one that best predicts an unseen test set
• Perplexity (PP) is the inverse probability of the test set, normalized by the number of words
• Given a test set W = w_1 w_2 w_3 ... w_n:
  PP(W) = P(w_1 w_2 ... w_n)^{-1/n}
• Using the chain rule:
  PP(W) = \sqrt[n]{\prod_{i=1}^{n} 1 / P(w_i | w_{1:i-1})}
• With a bigram model:
  PP(W) = \sqrt[n]{\prod_{i=1}^{n} 1 / P(w_i | w_{i-1})}
21
Evaluating Language Models
Perplexity:
P(trăm năm trong cõi người ta) ≈ 8.5e-10
=> PP(trăm năm trong cõi người ta) = ???
Lower perplexity = better model
• Example: training on 38 million words, testing on 1.5 million words (WSJ):

               Unigram   Bigram   Trigram
  Perplexity       962      170       109
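To answer the ??? above: treating the sequence as n = 7 predicted tokens (the six words plus </s>), PP = (8.5e-10)^(-1/7) ≈ 19.8. A small sketch (the function name is mine, not from the slides):

# Sketch: perplexity of one sequence from its per-token bigram probabilities.
import math

def perplexity(token_probs):
    """PP = (prod p_i)^(-1/n), computed in log space for numerical stability."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# Probabilities of "trăm năm trong cõi người ta </s>" from the slide.
probs = [0.004, 0.29, 0.019, 0.0285, 0.1, 0.031, 0.439]
print(perplexity(probs))   # ≈ 19.8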

22
Issues with N-gram Language Models
Sparsity problems, e.g. for a trigram model:
  P(w_3 | w_1, w_2) = c(w_1, w_2, w_3) / c(w_1, w_2)
• w_1, w_2 and w_3 never appear together in the corpus
  => P = 0
  => Solution: add a small count k to every count – called smoothing
• w_1 and w_2 never occur together in the corpus
  => the denominator is 0, so no probability can be computed for w_3
  => Solution: condition on w_2 alone – called backoff or interpolation
Storage problems: as n increases, the model size increases
23
Smoothing
Laplace Smoothing (Add-one)
• Add one to all the n-gram counts before normalizing them into probabilities
• MLE estimate:
  P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
• Add-1 estimate:
  P_{Add-1}(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
  V: vocabulary size
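With the counters from the earlier train_bigram sketch, the add-1 estimate is a one-line change to the MLE formula; a sketch assuming those unigram/bigram Counters and that crude vocabulary estimate:

# Sketch: add-1 (Laplace) smoothed bigram probability.
# Assumes the unigram/bigram Counters built by train_bigram() above.

def add1_bigram_prob(w_prev, w, unigram, bigram, vocab_size):
    return (bigram[(w_prev, w)] + 1) / (unigram[w_prev] + vocab_size)

vocab = {w for (a, b) in bigram for w in (a, b)}   # crude vocabulary estimate
print(add1_bigram_prob("tôi", "đang", unigram, bigram, len(vocab)))  # smoothed, now < 2/2
print(add1_bigram_prob("tôi", "nlp", unigram, bigram, len(vocab)))   # unseen pair, now > 0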

24
Smoothing
Add-1 Smoothing
Example: “Truyện Kiều” — counts with 1 added to every cell:

         <s>   trăm   năm   trong   cõi   người   ta
<s>        1     14     6      24     2      23     2
trăm       1      1    10       1     1       1     1
năm        1      1     5       2     1       2     1
trong      1      1     2       1     4       2     1
cõi        1      1     1       1     1       2     1
người      1      1     1       3     1       3     8
ta         1      1     1       1     1       1     3
25
Smoothing
Add-1 Smoothing
Example: “Truyện Kiều” — add-1 smoothed probabilities P(column word | row word):

         <s>       trăm      năm       trong     cõi       người     ta
<s>      0.00015   0.0021    0.0009    0.0036    0.0003    0.0035    0.003
trăm     0.0004    0.0004    0.0041    0.0004    0.0004    0.0004    0.0004
năm      0.0004    0.0004    0.002     0.0008    0.0004    0.0008    0.0004
trong    0.0004    0.0004    0.0008    0.0004    0.0015    0.0008    0.0004
cõi      0.0004    0.0004    0.0004    0.0004    0.0004    0.0008    0.0004
người    0.0004    0.0004    0.0004    0.0011    0.0004    0.0011    0.003
ta       0.0004    0.0004    0.0004    0.0004    0.0004    0.0004    0.0012
26
Smoothing
Add-k smoothing
  P_{Add-k}(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + k) / (c(w_{i-1}) + kV)
• Choose k by optimizing on a validation set
• In practice:
  - Add-k works poorly for n-gram language models and is rarely used for them
  - Add-k is still used to smooth other NLP models (e.g. text classification)
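As one way to read “choose k by optimizing on a validation set”: a simple grid search that keeps the k with the lowest validation perplexity. The candidate values, validation pairs, and helper names below are illustrative and reuse the counters and vocab from the earlier sketches:

# Sketch: pick k by grid search on validation perplexity.
# Assumes the unigram/bigram Counters and vocab set from the earlier sketches.
import math

def addk_prob(w_prev, w, unigram, bigram, k, vocab_size):
    return (bigram[(w_prev, w)] + k) / (unigram[w_prev] + k * vocab_size)

def validation_perplexity(pairs, unigram, bigram, k, vocab_size):
    log_prob = sum(math.log(addk_prob(p, w, unigram, bigram, k, vocab_size))
                   for p, w in pairs)
    return math.exp(-log_prob / len(pairs))

val_pairs = [("<s>", "tôi"), ("tôi", "đang"), ("đang", "học"), ("học", "nlp")]
candidates = [0.01, 0.05, 0.1, 0.5, 1.0]
best_k = min(candidates,
             key=lambda k: validation_perplexity(val_pairs, unigram, bigram,
                                                 k, len(vocab)))
print(best_k)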

27
Backoff and Interpolation
• Sometimes using less context is a good thing: condition on less context where the model has not learned much about the longer context
• Backoff:
  - use the trigram if there is good evidence for it
  - otherwise back off to the bigram, otherwise to the unigram
• Interpolation:
  - mix the unigram, bigram, and trigram estimates
• In practice, interpolation usually works better

28
Backoff and Interpolation
Simple Interpolation
  \hat{P}(w_n | w_{n-2} w_{n-1}) = λ_1 P(w_n | w_{n-2} w_{n-1}) + λ_2 P(w_n | w_{n-1}) + λ_3 P(w_n)
  with \sum_i λ_i = 1
• The λs are learned from a validation set
• Find the optimal set using the EM algorithm
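A minimal sketch of simple interpolation with fixed λs; the λ values and the toy probability tables below are illustrative placeholders, not from the slides:

# Sketch: interpolated trigram probability with fixed lambdas.
# p_uni, p_bi, p_tri stand for MLE estimates, however they were computed.

def interpolated_prob(w, w1, w2, p_uni, p_bi, p_tri,
                      lambdas=(0.5, 0.3, 0.2)):
    """P_hat(w | w1 w2) = l1*P(w|w1 w2) + l2*P(w|w2) + l3*P(w); lambdas sum to 1."""
    l_tri, l_bi, l_uni = lambdas
    return (l_tri * p_tri.get((w1, w2, w), 0.0)
            + l_bi * p_bi.get((w2, w), 0.0)
            + l_uni * p_uni.get(w, 0.0))

# Toy usage with hand-filled probability tables:
p_uni = {"nlp": 0.05}
p_bi = {("lớp", "nlp"): 1.0}
p_tri = {}   # trigram never seen
print(interpolated_prob("nlp", "học", "lớp", p_uni, p_bi, p_tri))  # 0.3*1.0 + 0.2*0.05 = 0.31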

29
Backoff and Interpolation
Advanced:
• Lambdas conditioned on the context:
  \hat{P}(w_n | w_{n-2} w_{n-1}) = λ_1(w_{n-2:n-1}) P(w_n | w_{n-2} w_{n-1})
                                 + λ_2(w_{n-2:n-1}) P(w_n | w_{n-1})
                                 + λ_3(w_{n-2:n-1}) P(w_n)
• Katz backoff
• Stupid backoff
• Kneser-Ney smoothing
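For a flavor of one of these methods, here is a sketch of stupid backoff (Brants et al., 2007): use the higher-order relative frequency when its count is non-zero, otherwise multiply a fixed weight (commonly 0.4) by the lower-order score. The toy counts below are hypothetical:

# Sketch: stupid backoff score (not a true probability; scores need not sum to 1).
ALPHA = 0.4  # fixed backoff weight suggested by Brants et al. (2007)

def stupid_backoff(words, counts, total_tokens):
    """Score of the last word in `words` given the preceding ones.
    `counts` maps n-gram tuples (of any order) to their corpus counts."""
    if len(words) == 1:
        return counts.get(words, 0) / total_tokens          # unigram base case
    if counts.get(words, 0) > 0 and counts.get(words[:-1], 0) > 0:
        return counts[words] / counts[words[:-1]]
    return ALPHA * stupid_backoff(words[1:], counts, total_tokens)

# Toy usage with hypothetical counts:
counts = {("học", "lớp", "nlp"): 0, ("lớp", "nlp"): 2, ("lớp",): 2, ("nlp",): 2}
print(stupid_backoff(("học", "lớp", "nlp"), counts, total_tokens=17))  # 0.4 * (2/2) = 0.4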

30
Thanks!
Any questions?

31
