Lecture 2. Words and Language Models
English Morphology
Lightweight Morphology
• No lexicon needed
• Essentially a cascade of staged rewrite rules that strip suffixes
• Handles both inflectional and derivational suffixes
• Doesn’t guarantee that the resulting stem is a real stem (a consequence of the first bullet: there is no lexicon to check against)
• The lack of a guarantee doesn’t matter for IR
Porter Example
• Computerization
  -ization -> -ize    computerization -> computerize
  -ize -> ε           computerize -> computer
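As a rough illustration of staged suffix stripping, here is a minimal sketch. It is not the full Porter algorithm (real Porter rules also check a "measure" condition on the remaining stem before firing), and the rule tables are illustrative:

```python
# Minimal sketch of staged suffix stripping in the spirit of Porter.
# Each stage is an ordered list of (suffix, replacement) rewrite rules;
# within a stage, at most one rule fires. NOT the full Porter algorithm.
STAGES = [
    [("ization", "ize"), ("ational", "ate")],  # derivational suffixes
    [("ize", ""), ("ate", "")],                # strip remaining endings
]

def strip_suffixes(word: str) -> str:
    for stage in STAGES:
        for suffix, replacement in stage:
            if word.endswith(suffix):
                word = word[: len(word) - len(suffix)] + replacement
                break  # at most one rule per stage
    return word

print(strip_suffixes("computerization"))  # -> computer
```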
Tokenization Issues
• Expanding clitics
  What’re -> what are
  I’m -> I am
• Multi-token words
  New York
  Rock ‘n’ roll
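A minimal sketch of clitic expansion with regular expressions follows; the contraction table here is illustrative, not exhaustive, and a real tokenizer must also handle ambiguous cases (e.g. “’s” as *is*, *has*, or possessive):

```python
import re

# Illustrative contraction table; a real tokenizer needs a much larger
# list and must disambiguate cases like "'s".
CONTRACTIONS = {
    r"\bwhat're\b": "what are",
    r"\bI'm\b": "I am",
    r"\bdoesn't\b": "does not",
}

def expand_clitics(text: str) -> str:
    for pattern, expansion in CONTRACTIONS.items():
        text = re.sub(pattern, expansion, text, flags=re.IGNORECASE)
    return text

print(expand_clitics("What're you doing? I'm walking."))
# -> "what are you doing? I am walking."
```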
Sentence Segmentation
• !, ? are relatively unambiguous
• Period “.” is quite ambiguous:
  Sentence boundary
  Abbreviations like Inc. or Dr.
• General idea: build a binary classifier that
  Looks at each “.”
  Decides EndOfSentence/NotEOS
  Could use hand-written rules or machine learning
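A minimal hand-written-rule sketch of such a classifier; the abbreviation list and the capitalization heuristic are illustrative simplifications of what real systems use:

```python
# Sketch of a rule-based EndOfSentence/NotEOS decision for ".".
# The abbreviation list is illustrative; a real classifier would use a
# much larger list plus many more features.
ABBREVIATIONS = {"inc.", "dr.", "mr.", "mrs.", "etc."}

def is_eos(tokens: list[str], i: int) -> bool:
    """Decide whether the '.' ending tokens[i] marks a sentence boundary."""
    if tokens[i].lower() in ABBREVIATIONS:
        return False
    # A '.' followed by a capitalized word is likely a boundary.
    if i + 1 < len(tokens):
        return tokens[i + 1][0].isupper()
    return True  # '.' at the very end of the text

tokens = ["She", "met", "Dr.", "Smith", ".", "He", "waved", "."]
print(is_eos(tokens, 2))  # False: "Dr." is an abbreviation
print(is_eos(tokens, 4))  # True: "." followed by capitalized "He"
```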
Word Segmentation in Chinese
Vietnamese Word Segmentation: Problem (cont.)
• Word segmentation ambiguities
  The syllable sequences “nhà cửa” (housing), “sắc đẹp” (beauty), and “hiệu sách” (bookstore) are words in:
  a. Nhà cửa bề bộn quá (“The house is such a mess”)
  b. Cô ấy giữ gìn sắc đẹp. (“She takes care of her beauty.”)
  c. Ngoài hiệu sách có bán cuốn này (“The bookstore out there sells this one”)
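One classic baseline for segmenting such syllable sequences is greedy maximum matching (MaxMatch): repeatedly take the longest dictionary word starting at the current position. A minimal sketch with a toy dictionary (the word list is illustrative; real systems use large lexicons and statistical models to resolve the ambiguities above):

```python
# Greedy maximum-matching (MaxMatch) word segmentation sketch.
DICTIONARY = {"nhà cửa", "nhà", "cửa", "bề bộn", "quá"}
MAX_WORD_LEN = 2  # longest dictionary word, in syllables

def max_match(syllables: list[str]) -> list[str]:
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink.
        for n in range(min(MAX_WORD_LEN, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i : i + n])
            if n == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += n
                break
    return words

print(max_match("nhà cửa bề bộn quá".split()))
# -> ['nhà cửa', 'bề bộn', 'quá']
```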
Computing P(W)
P(“the”, ”other”, ”day”, ”I”, ”was”, ”walking”, ”along”, ”and”, ”saw”, ”a”, ”lizard”)
The Chain Rule
• Recall the definition of conditional probability:
  $P(A \mid B) = \frac{P(A \wedge B)}{P(B)}$
• Rewriting:
  $P(A \wedge B) = P(A \mid B)\,P(B)$
• More generally:
  $P(A,B,C,D) = P(A)\,P(B \mid A)\,P(C \mid A,B)\,P(D \mid A,B,C)$
• In general:
  $P(x_1, x_2, \ldots, x_n) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2)\cdots P(x_n \mid x_1, \ldots, x_{n-1})$
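Applying the chain rule to the example sentence from the “Computing P(W)” slide gives the decomposition (first few factors shown):

```latex
P(\text{the, other, day, I, was, walking, }\ldots)
  = P(\text{the})
  \times P(\text{other} \mid \text{the})
  \times P(\text{day} \mid \text{the other})
  \times P(\text{I} \mid \text{the other day})
  \times \cdots
```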
The Chain Rule
• How can we estimate a probability like
  P(you | the river is so wide that)?
Unfortunately
• We will never see enough data to estimate probabilities conditioned on such long histories by counting
Markov Assumption
$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$
• Bigram version:
  $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$
Estimating bigram probabilities
$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
An example
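A minimal sketch of these maximum-likelihood bigram estimates on a toy corpus (the corpus and the <s>/</s> boundary markers are illustrative):

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_bigram(w_prev: str, w: str) -> float:
    """MLE estimate P(w | w_prev) = c(w_prev, w) / c(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(p_bigram("<s>", "I"))  # 2/3
print(p_bigram("I", "am"))   # 2/3
```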
Maximum Likelihood Estimates
• The maximum likelihood estimate of a model parameter is the value that maximizes the probability of the training set
Raw Bigram Counts
Raw Bigram Probabilities
• Normalize by unigrams: divide each bigram count $c(w_{i-1}, w_i)$ by $c(w_{i-1})$
• Result: a table of bigram probabilities
Bigram Estimates of Sentence Probabilities
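Under the bigram model, a sentence probability is just a product of bigram probabilities; for an illustrative sentence:

```latex
P(\texttt{<s>}\ \text{I want English food}\ \texttt{</s>})
  = P(\text{I} \mid \texttt{<s>})
  \times P(\text{want} \mid \text{I})
  \times P(\text{English} \mid \text{want})
  \times P(\text{food} \mid \text{English})
  \times P(\texttt{</s>} \mid \text{food})
```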
Kinds of knowledge?
Shakespeare
Shakespeare as corpus
The Wall Street Journal is Not Shakespeare
Why?
$P(S \mid X) = \frac{P(X \mid S)\,P(S)}{P(X)}$
Unknown words: Open versus closed vocabulary tasks
• If we know all the words in advance
  Vocabulary V is fixed
  Closed vocabulary task
• Often we don’t know this
  Out Of Vocabulary = OOV words
  Open vocabulary task
• Instead: create an unknown word token <UNK>
  Training of <UNK> probabilities:
    Create a fixed lexicon L of size V
    At the text normalization phase, change any training word not in L to <UNK>
    Now train its probabilities like a normal word
  At decoding time:
    Use the <UNK> probabilities for any input word not seen in training
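A minimal sketch of the <UNK> normalization step; the lexicon-selection rule here (keep words above a count threshold) is one common but illustrative choice:

```python
from collections import Counter

# Toy training tokens; in practice this is the full training corpus.
train_tokens = "the cat sat on the mat the cat slept".split()

# Illustrative lexicon choice: keep words seen at least twice.
counts = Counter(train_tokens)
lexicon = {w for w, c in counts.items() if c >= 2}

def normalize(tokens: list[str]) -> list[str]:
    """Map any token outside the fixed lexicon to <UNK>."""
    return [w if w in lexicon else "<UNK>" for w in tokens]

print(normalize(train_tokens))
# "sat", "on", "mat", "slept" all become <UNK>
print(normalize("the dog sat".split()))  # ['the', '<UNK>', '<UNK>']
```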
Evaluation
• Perplexity is the inverse probability of the test set, normalized by the number of words:
  $PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}$
• Chain rule:
  $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$
• For bigrams:
  $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$
(Slide from Josh Goodman)
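A minimal sketch of computing bigram perplexity; the probability function is passed in so the snippet stands alone, and the sketch assumes no zero probabilities in the test set:

```python
import math

def perplexity(test_sentences, prob):
    """Bigram perplexity; sums log-probabilities to avoid underflow."""
    log_prob, n_words = 0.0, 0
    for sentence in test_sentences:
        tokens = sentence.split()
        for w_prev, w in zip(tokens, tokens[1:]):
            log_prob += math.log(prob(w_prev, w))
            n_words += 1
    return math.exp(-log_prob / n_words)

# Toy uniform model over a 10-word vocabulary: PP comes out as 10,
# matching the intuition that perplexity is an effective branching factor.
print(perplexity(["<s> I am Sam </s>"], lambda w_prev, w: 1 / 10))
```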
Lower perplexity = better model
Lesson 1: the perils of overfitting
Lesson 2: zeros or not?
• Zipf’s Law:
  A small number of events occur with high frequency
  A large number of events occur with low frequency
  You can quickly collect statistics on the high-frequency events
  You might have to wait an arbitrarily long time to get valid statistics on low-frequency events
• Result:
  Our estimates are sparse! We have no counts at all for the vast bulk of things we want to estimate
  Some of the zeros in the table are real zeros, but others are simply low-frequency events you haven’t seen yet. After all, ANYTHING CAN HAPPEN!
  How do we address this?
• Answer:
  Estimate the likelihood of unseen N-grams!
Smoothing is like Robin Hood: steal from the rich and give to the poor (in probability mass)
Laplace smoothing
• MLE estimate:
  $P_{MLE}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
• Laplace (add-one) estimate, where V is the vocabulary size:
  $P_{Laplace}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$
• Reconstructed counts:
  $c^*(w_{i-1}, w_i) = \frac{\left(c(w_{i-1}, w_i) + 1\right)\, c(w_{i-1})}{c(w_{i-1}) + V}$
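A minimal sketch of add-one smoothing on a toy corpus (corpus and boundary markers are illustrative); note that unseen bigrams now get small nonzero probability:

```python
from collections import Counter

# Add-one (Laplace) smoothed bigram estimates on a toy corpus.
corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>"]
bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

V = len(unigram_counts)  # vocabulary size

def p_laplace(w_prev: str, w: str) -> float:
    """P(w | w_prev) = (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

print(p_laplace("I", "am"))     # seen bigram: (2 + 1) / (2 + V)
print(p_laplace("Sam", "Sam"))  # unseen bigram: still > 0
```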
Laplace-smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts
Big Changes to Counts
Better Discounting Methods
Good-Turing
• Allocate probability mass to unseen events using the counts of rarely seen events
• The probability mass reserved for all unseen N-grams is $\frac{N_1}{N}$: the number of types seen exactly once, over the total number of tokens
Good-Turing Intuition (slides from Josh Goodman)
• Use the count of things you’ve seen once to help estimate the count of things you’ve never seen
• Re-estimated count for an N-gram seen $c$ times:
  $c^* = (c + 1)\,\frac{N_{c+1}}{N_c}$
  where $N_c$ is the number of N-gram types occurring exactly $c$ times
Bigram frequencies of frequencies and GT re-estimates
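A minimal sketch of the Good-Turing re-estimation step from a frequency-of-frequencies table; the table values are illustrative, and the sketch ignores the smoothing of $N_c$ needed for high counts where $N_{c+1}$ may be zero:

```python
# Good-Turing re-estimated counts: c* = (c + 1) * N_{c+1} / N_c.
# freq_of_freqs[c] = number of bigram types seen exactly c times
# (illustrative values, not from a real corpus).
freq_of_freqs = {1: 1000, 2: 400, 3: 200, 4: 100}

def gt_reestimate(c: int) -> float:
    """Re-estimated count for an N-gram observed c times."""
    return (c + 1) * freq_of_freqs[c + 1] / freq_of_freqs[c]

for c in (1, 2, 3):
    print(c, "->", round(gt_reestimate(c), 3))
# 1 -> 0.8   2 -> 1.5   3 -> 2.0
```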
Backoff and Interpolation
Interpolation
• Simple interpolation: mix N-gram estimates of different orders with weights that sum to one
  $\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$
  with $\sum_i \lambda_i = 1$
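A minimal sketch of linear interpolation over bigram and unigram estimates on a toy corpus; the lambda values are illustrative and in practice are tuned on held-out data, as discussed on the next slide:

```python
from collections import Counter

# Linearly interpolated bigram/unigram model on a toy corpus.
corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>"]
bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

# Illustrative weights; they must sum to 1.
LAMBDA_BI, LAMBDA_UNI = 0.7, 0.3
TOTAL = sum(unigram_counts.values())

def p_interp(w_prev: str, w: str) -> float:
    p_bi = (bigram_counts[(w_prev, w)] / unigram_counts[w_prev]
            if unigram_counts[w_prev] else 0.0)
    p_uni = unigram_counts[w] / TOTAL
    return LAMBDA_BI * p_bi + LAMBDA_UNI * p_uni

print(p_interp("I", "am"))  # mixes P(am | I) with P(am)
```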
How to set the lambdas?
• Use a held-out corpus: choose the λ values that give the held-out data the highest probability
OOV words: <UNK> word
Practical Issues
• Do everything in log space: it avoids underflow, and adding is faster than multiplying
Language Modeling Toolkits
• SRILM
• CMU-Cambridge LM Toolkit
• These toolkits are publicly available
• They can be used to build N-gram models
• Lots of parameters (you need to know the theory!)
• Standard N-gram format: the ARPA language model format (see pages 108-109)
Google N-Gram Release