

NLP BASIC

03-N-gram Language Model

NGUYỄN QUỐC THÁI


thai.nq07@gmail.com
Contents
● N-gram Overview
● N-gram Probabilities
● Estimating N-gram Probabilities
● Evaluating Language Models
● Issues with N-gram Language Models
● Smoothing
● Backoff and Interpolation

2
N-gram Overview
Language Models
- Compute the probability that a particular sequence of words occurs
- These probabilities are essential in many tasks:
  Machine Translation: “Tôi đi học”
    P(I go to school) > P(I go to work)
  Spelling Correction:
    P(Everything has improved) > P(Everything has improve)

3
N-gram Overview
Probabilistic Language Modeling
- Compute the probability of a sequence of words
  For a sequence of words W = (w_1, w_2, w_3, ..., w_n):
  P(W) = P(w_1, w_2, w_3, ..., w_n)
- Compute the probability of an upcoming word:
  P(w_n | w_1, w_2, w_3, ..., w_{n-1})
- A model that computes either of these is called a language model (LM)

4
N-gram Probabilities
Probabilistic Language Modeling
• Computing P(W) = P(w_1, w_2, w_3, ..., w_n)
• Conditional probability:
  P(B | A) = P(A, B) / P(A)  =>  P(A, B) = P(A) P(B | A)
• The chain rule of probability:
  P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)
  => P(w_1, w_2, w_3, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_{1:2}) ... P(w_n | w_{1:n-1})
                                = \prod_{k=1}^{n} P(w_k | w_{1:k-1})

5
N-gram Probabilities
Probabilistic Language Modeling
P(w_1, w_2, w_3, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_{1:2}) ... P(w_n | w_{1:n-1}) = \prod_{k=1}^{n} P(w_k | w_{1:k-1})
Example: given the sentence “tôi đang học lớp nlp” (“I am taking an NLP class”):
P(tôi, đang, học, lớp, nlp)
  = P(tôi) P(đang | tôi) P(học | tôi, đang)
    P(lớp | tôi, đang, học) P(nlp | tôi, đang, học, lớp)

6
N-gram Probabilities
Probabilistic Language Modeling
• Computing P(w | h)
  - w: the word “nlp”
  - h: the history “tôi đang học lớp”
  P(nlp | tôi đang học lớp) = count(tôi đang học lớp nlp) / count(tôi đang học lớp)
• Problem: even a very large corpus will not contain most long histories, so these counts are unreliable or zero

7
N-gram Probabilities
Markov Assumption
• Approximate the probability of a word using only the few preceding words
• An N-gram model looks N-1 words into the past:
  P(w_{1:n}) ≈ \prod_{i=1}^{n} P(w_i | w_{i-N+1:i-1})
  P(w_i | w_{1:i-1}) ≈ P(w_i | w_{i-N+1:i-1})
• N-gram models: N = 1, 2, 3, 4, 5, …

8
N-gram Probabilities
Markov Assumption
• Unigram model (1-gram)
  P(w_{1:n}) ≈ \prod_{i=1}^{n} P(w_i)
• Given the sentence “tôi đang học lớp nlp”:
  P(tôi, đang, học, lớp, nlp) = P(tôi) P(đang) P(học) P(lớp) P(nlp)

9
N-gram Probabilities
Markov Assumption
• Bigram model (2-gram)
  P(w_{1:n}) ≈ \prod_{i=1}^{n} P(w_i | w_{i-1})
• Given the sentence “tôi đang học lớp nlp”
  => Padding: “<s> tôi đang học lớp nlp </s>”
  P(tôi, đang, học, lớp, nlp)
    = P(tôi | <s>) P(đang | tôi) P(học | đang) P(lớp | học)
      P(nlp | lớp) P(</s> | nlp)
10
N-gram Probabilities
Markov Assumption
• Trigram model (3-gram)
  P(w_{1:n}) ≈ \prod_{i=1}^{n} P(w_i | w_{i-2:i-1})
• Given the sentence “tôi đang học lớp nlp”
  => Padding: “<s> <s> tôi đang học lớp nlp </s> </s>”
  P(tôi, đang, học, lớp, nlp)
    = P(tôi | <s>, <s>) P(đang | <s>, tôi) P(học | tôi, đang)
      P(lớp | đang, học) P(nlp | học, lớp) P(</s> | lớp, nlp) P(</s> | nlp, </s>)
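To make the padding and factorization concrete, here is a small Python sketch (the function names pad and ngrams are mine, not from the slides) that pads a sentence and lists the conditional probabilities a trigram model would multiply:

# Minimal sketch: padding a sentence and extracting its n-grams.
# Function names (pad, ngrams) are illustrative, not from the slides.

def pad(tokens, n):
    """Add n-1 <s> markers at the start and n-1 </s> markers at the end."""
    return ["<s>"] * (n - 1) + tokens + ["</s>"] * (n - 1)

def ngrams(tokens, n):
    """Return all consecutive n-word windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "tôi đang học lớp nlp".split()
for gram in ngrams(pad(sentence, 3), 3):
    context, word = gram[:-1], gram[-1]
    print(f"P({word} | {' '.join(context)})")

Running this prints the same seven factors shown above, from P(tôi | <s> <s>) through P(</s> | nlp </s>).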

11
Estimating N-gram Probabilities
Maximum likelihood estimation (MLE)
• Estimating bigram probabilities:
  P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
• Estimating n-gram probabilities:
  P(w_i | w_{i-N+1:i-1}) = c(w_{i-N+1:i-1}, w_i) / c(w_{i-N+1:i-1})

12
Estimating N-gram Probabilities
Maximum likelihood estimation (MLE)
• Example bigram model, with the toy corpus:
    <s> tôi đang học </s>
    <s> tôi đang học lớp nlp </s>
    <s> lớp nlp có vẻ hơi vui </s>
  P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
  P(tôi|<s>) = 2/3    P(đang|tôi) = 2/2    P(học|đang) = 2/2
  P(</s>|học) = 1/2   P(lớp|học) = 1/2     P(nlp|lớp) = 2/2
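The same estimates can be reproduced with a short Python sketch (the helper names train_bigram and bigram_prob are mine, not from the slides):

# Sketch: MLE bigram estimation on the toy corpus above.
from collections import Counter

corpus = [
    "tôi đang học",
    "tôi đang học lớp nlp",
    "lớp nlp có vẻ hơi vui",
]

def train_bigram(sentences):
    unigram, bigram = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigram.update(tokens[:-1])              # context counts c(w_{i-1})
        bigram.update(zip(tokens, tokens[1:]))   # pair counts c(w_{i-1}, w_i)
    return unigram, bigram

def bigram_prob(w_prev, w, unigram, bigram):
    return bigram[(w_prev, w)] / unigram[w_prev]

unigram, bigram = train_bigram(corpus)
print(bigram_prob("<s>", "tôi", unigram, bigram))   # 2/3 ≈ 0.667
print(bigram_prob("học", "lớp", unigram, bigram))   # 1/2 = 0.5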

13
Estimating N-gram Probabilities
Example: “Truyện Kiều” (training corpus: the opening lines of the Vietnamese poem The Tale of Kiều)
Trăm năm trong cõi người ta,
Chữ tài chữ mệnh khéo là ghét nhau.
Trải qua một cuộc bể dâu,
Những điều trông thấy mà đau đớn lòng.
Lạ gì bỉ sắc tư phong,
Trời xanh quen thói má hồng đánh ghen.
Cảo thơm lần giở trước đèn,
Phong tình cổ lục còn truyền sử xanh.
Rằng năm Gia Tĩnh triều Minh,
Bốn phương phẳng lặng, hai kinh vững vàng.

14
Estimating N-gram Probabilities
Example: “Truyện Kiều”
Bigram counts c(row word, column word) — row = previous word, column = next word:

         <s>   trăm   năm   trong   cõi   người   ta
<s>        0     13     5      23     1      22     1
trăm       0      0     9       0     0       0     0
năm        0      0     4       1     0       1     0
trong      0      0     1       0     3       1     0
cõi        0      0     0       0     0       1     0
người      0      0     0       2     0       2     7
ta         0      0     0       0     0       0     2

15
Estimating N-gram Probabilities
Example: “Truyện Kiều”
Bigram probabilities P(column word | row word):

         <s>      trăm     năm      trong    cõi      người    ta
<s>      0        0.004    0.0015   0.007    0.0003   0.0068   0.0003
trăm     0        0        0.29     0        0        0        0
năm      0        0        0.077    0.019    0        0.019    0
trong    0        0        0.0095   0        0.0285   0.0095   0
cõi      0        0        0        0        0        0.1      0
người    0        0        0        0.009    0        0.009    0.031
ta       0        0        0        0        0        0        0.035
16
Estimating N-gram Probabilities
Example: “Truyện Kiều”
Sequence probability for “trăm năm trong cõi người ta”:

P(trăm năm trong cõi người ta)
  = P(trăm | <s>) P(năm | trăm) P(trong | năm)
    P(cõi | trong) P(người | cõi) P(ta | người) P(</s> | ta)
  = 0.004 * 0.29 * 0.019 * 0.0285 * 0.1 * 0.031 * 0.439 ≈ 8.5e-10

17
Estimating N-gram Probabilities
Problem: Numerical Underflow
• 8.5e-10 is already tiny; multiplying many such probabilities quickly underflows
• Convert from linear space to log space:
  - Avoids underflow
  - Adding is faster than multiplying
  p_1 × p_2 × p_3 × p_4 = exp(log p_1 + log p_2 + log p_3 + log p_4)
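A minimal sketch of the log-space trick, reusing the bigram probabilities of the Truyện Kiều example from the slide above:

# Sketch: computing a sequence probability in log space to avoid underflow.
import math

# Bigram probabilities for "trăm năm trong cõi người ta </s>" (from the slide).
probs = [0.004, 0.29, 0.019, 0.0285, 0.1, 0.031, 0.439]

log_prob = sum(math.log(p) for p in probs)   # sum of logs instead of a product
print(log_prob)                 # ≈ -20.89 (natural log)
print(math.exp(log_prob))       # ≈ 8.5e-10, same value as multiplying directly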

18
Evaluating Language Models
Extrinsic evaluation:
• The best way to evaluate performance is to embed the model in an application
• Comparing models A and B:
  - Put each model into a task, e.g. an MT system
  - Run the MT task and get an accuracy for A and for B
  - Compare the accuracies; higher accuracy is better
• Problem: running big NLP systems end-to-end is often very expensive!

19
Evaluating Language Models
Intrinsic evaluation:
• Evaluate a model independently of any application
• Train the parameters of the model on a training set
• Test the model on unseen data – the test set
• In practice:
  - We need a fresh test set that is truly unseen
  => Split the data into a train set, a validation set, and a test set
  - Training: train set; tuning: validation set; evaluation: test set
  - A standard intrinsic metric: perplexity
20
Evaluating Language Models
Perplexity:
• The best language model is the one that best predicts an unseen test set
• Perplexity (PP) is the inverse probability of the test set, normalized by the number of words
• Given a test set W = w_1 w_2 w_3 ... w_n:
  PP(W) = P(w_1 w_2 ... w_n)^{-1/n}
• Using the chain rule:
  PP(W) = \sqrt[n]{\prod_{i=1}^{n} 1 / P(w_i | w_{1:i-1})}
• With a bigram model:
  PP(W) = \sqrt[n]{\prod_{i=1}^{n} 1 / P(w_i | w_{i-1})}
21
Evaluating Language Models
Perplexity:
P(trăm năm trong cõi người ta) ≈ 8.5e-10
=> PP(trăm năm trong cõi người ta) = ???
Lower perplexity = better model
• Example: training on 38 million words, testing on 1.5 million words (WSJ):

               Unigram   Bigram   Trigram
  Perplexity       962      170       109
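To answer the ??? above: treating the sequence as n = 7 predicted tokens (the six words plus </s>), PP = (8.5e-10)^(-1/7) ≈ 19.8. A small sketch (the function name is mine, not from the slides):

# Sketch: perplexity of one sequence from its per-token bigram probabilities.
import math

def perplexity(token_probs):
    """PP = (prod p_i)^(-1/n), computed in log space for numerical stability."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# Probabilities of "trăm năm trong cõi người ta </s>" from the slide.
probs = [0.004, 0.29, 0.019, 0.0285, 0.1, 0.031, 0.439]
print(perplexity(probs))   # ≈ 19.8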

22
Issues with N-gram Language Models
Sparsity problems, e.g. for a trigram model:
  P(w_3 | w_1, w_2) = c(w_1, w_2, w_3) / c(w_1, w_2)
• w_1, w_2 and w_3 never appear together in the corpus
  => P = 0
  => Solution: add a small count k to every count – called smoothing
• w_1 and w_2 never occur together in the corpus
  => the denominator is 0, so no probability can be computed for w_3
  => Solution: condition on w_2 alone – called backoff or interpolation
Storage problems: as n increases, the model size increases
23
Smoothing
Laplace Smoothing (Add-one)
• Add one to all the n-gram counts before normalizing them into probabilities
• MLE estimate:
  P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
• Add-1 estimate:
  P_{Add-1}(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
  V: vocabulary size
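With the counters from the earlier train_bigram sketch, the add-1 estimate is a one-line change to the MLE formula; a sketch assuming those unigram/bigram Counters and that crude vocabulary estimate:

# Sketch: add-1 (Laplace) smoothed bigram probability.
# Assumes the unigram/bigram Counters built by train_bigram() above.

def add1_bigram_prob(w_prev, w, unigram, bigram, vocab_size):
    return (bigram[(w_prev, w)] + 1) / (unigram[w_prev] + vocab_size)

vocab = {w for (a, b) in bigram for w in (a, b)}   # crude vocabulary estimate
print(add1_bigram_prob("tôi", "đang", unigram, bigram, len(vocab)))  # smoothed, now < 2/2
print(add1_bigram_prob("tôi", "nlp", unigram, bigram, len(vocab)))   # unseen pair, now > 0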

24
Smoothing
Add-1 Smoothing
Example: “Truyện Kiều” — counts with 1 added to every cell:

         <s>   trăm   năm   trong   cõi   người   ta
<s>        1     14     6      24     2      23     2
trăm       1      1    10       1     1       1     1
năm        1      1     5       2     1       2     1
trong      1      1     2       1     4       2     1
cõi        1      1     1       1     1       2     1
người      1      1     1       3     1       3     8
ta         1      1     1       1     1       1     3
25
Smoothing
Add-1 Smoothing
Example: “Truyện Kiều” — add-1 smoothed probabilities P(column word | row word):

         <s>       trăm      năm       trong     cõi       người     ta
<s>      0.00015   0.0021    0.0009    0.0036    0.0003    0.0035    0.003
trăm     0.0004    0.0004    0.0041    0.0004    0.0004    0.0004    0.0004
năm      0.0004    0.0004    0.002     0.0008    0.0004    0.0008    0.0004
trong    0.0004    0.0004    0.0008    0.0004    0.0015    0.0008    0.0004
cõi      0.0004    0.0004    0.0004    0.0004    0.0004    0.0008    0.0004
người    0.0004    0.0004    0.0004    0.0011    0.0004    0.0011    0.003
ta       0.0004    0.0004    0.0004    0.0004    0.0004    0.0004    0.0012
26
Smoothing
Add-k smoothing
  P_{Add-k}(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + k) / (c(w_{i-1}) + kV)
• Choose k by optimizing on a validation set
• In practice:
  - Add-k works poorly for n-gram language models and is rarely used for them
  - Add-k is still used to smooth other NLP models (e.g. text classification)
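As one way to read “choose k by optimizing on a validation set”: a simple grid search that keeps the k with the lowest validation perplexity. The candidate values, validation pairs, and helper names below are illustrative and reuse the counters and vocab from the earlier sketches:

# Sketch: pick k by grid search on validation perplexity.
# Assumes the unigram/bigram Counters and vocab set from the earlier sketches.
import math

def addk_prob(w_prev, w, unigram, bigram, k, vocab_size):
    return (bigram[(w_prev, w)] + k) / (unigram[w_prev] + k * vocab_size)

def validation_perplexity(pairs, unigram, bigram, k, vocab_size):
    log_prob = sum(math.log(addk_prob(p, w, unigram, bigram, k, vocab_size))
                   for p, w in pairs)
    return math.exp(-log_prob / len(pairs))

val_pairs = [("<s>", "tôi"), ("tôi", "đang"), ("đang", "học"), ("học", "nlp")]
candidates = [0.01, 0.05, 0.1, 0.5, 1.0]
best_k = min(candidates,
             key=lambda k: validation_perplexity(val_pairs, unigram, bigram,
                                                 k, len(vocab)))
print(best_k)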

27
Backoff and Interpolation
• Sometimes using less context is a good thing: condition on less context where the model has not learned much about the longer context
• Backoff:
  - use the trigram if there is good evidence for it
  - otherwise back off to the bigram, otherwise to the unigram
• Interpolation:
  - mix the unigram, bigram, and trigram estimates
• In practice, interpolation usually works better

28
Backoff and Interpolation
Simple Interpolation
  \hat{P}(w_n | w_{n-2} w_{n-1}) = λ_1 P(w_n | w_{n-2} w_{n-1}) + λ_2 P(w_n | w_{n-1}) + λ_3 P(w_n)
  with \sum_i λ_i = 1
• The λs are learned from a validation set
• Find the optimal set using the EM algorithm
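A minimal sketch of simple interpolation with fixed λs; the λ values and the toy probability tables below are illustrative placeholders, not from the slides:

# Sketch: interpolated trigram probability with fixed lambdas.
# p_uni, p_bi, p_tri stand for MLE estimates, however they were computed.

def interpolated_prob(w, w1, w2, p_uni, p_bi, p_tri,
                      lambdas=(0.5, 0.3, 0.2)):
    """P_hat(w | w1 w2) = l1*P(w|w1 w2) + l2*P(w|w2) + l3*P(w); lambdas sum to 1."""
    l_tri, l_bi, l_uni = lambdas
    return (l_tri * p_tri.get((w1, w2, w), 0.0)
            + l_bi * p_bi.get((w2, w), 0.0)
            + l_uni * p_uni.get(w, 0.0))

# Toy usage with hand-filled probability tables:
p_uni = {"nlp": 0.05}
p_bi = {("lớp", "nlp"): 1.0}
p_tri = {}   # trigram never seen
print(interpolated_prob("nlp", "học", "lớp", p_uni, p_bi, p_tri))  # 0.3*1.0 + 0.2*0.05 = 0.31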

29
Backoff and Interpolation
Advanced:
• Lambdas conditioned on the context:
  \hat{P}(w_n | w_{n-2} w_{n-1}) = λ_1(w_{n-2:n-1}) P(w_n | w_{n-2} w_{n-1})
                                 + λ_2(w_{n-2:n-1}) P(w_n | w_{n-1})
                                 + λ_3(w_{n-2:n-1}) P(w_n)
• Katz backoff
• Stupid backoff
• Kneser-Ney smoothing
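For a flavor of one of these methods, here is a sketch of stupid backoff (Brants et al., 2007): use the higher-order relative frequency when its count is non-zero, otherwise multiply a fixed weight (commonly 0.4) by the lower-order score. The toy counts below are hypothetical:

# Sketch: stupid backoff score (not a true probability; scores need not sum to 1).
ALPHA = 0.4  # fixed backoff weight suggested by Brants et al. (2007)

def stupid_backoff(words, counts, total_tokens):
    """Score of the last word in `words` given the preceding ones.
    `counts` maps n-gram tuples (of any order) to their corpus counts."""
    if len(words) == 1:
        return counts.get(words, 0) / total_tokens          # unigram base case
    if counts.get(words, 0) > 0 and counts.get(words[:-1], 0) > 0:
        return counts[words] / counts[words[:-1]]
    return ALPHA * stupid_backoff(words[1:], counts, total_tokens)

# Toy usage with hypothetical counts:
counts = {("học", "lớp", "nlp"): 0, ("lớp", "nlp"): 2, ("lớp",): 2, ("nlp",): 2}
print(stupid_backoff(("học", "lớp", "nlp"), counts, total_tokens=17))  # 0.4 * (2/2) = 0.4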

30
Thanks!
Any questions?

31
