Language Models: Last Lecture's Key Concepts
- FSTs can be composed (cascaded), allowing us to define intermediate representations.

We'll also review some very basic probability theory.
Why do we need language models?
Many NLP tasks return output in natural language:
- Machine translation
- Speech recognition
- Natural language generation
- Spell-checking
Reminder:
Language models define probability distributions over (natural language) strings or sentences.

Basic Probability
Sampling with replacement

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'

Sampling with replacement

beginning by, very Alice but was and? reading no tired of to into sitting sister the, bank, and thought of without her nothing: having conversations Alice once do or on she it get the book her had peeped was conversation it pictures or sister in, 'what is the use had twice of a book''pictures or' to
Discrete probability distributions: single trials

Bernoulli distribution (two possible outcomes)
The probability of success (= head, yes) is p.
The probability of failure (= tail, no) is 1 − p.

Categorical distribution (N possible outcomes)
The probability of category/outcome c_i is p_i (0 ≤ p_i ≤ 1, ∑_i p_i = 1)

Joint and Conditional Probability

The conditional probability of X given Y, P(X | Y), is defined in terms of the probability of Y, P(Y), and the joint probability of X and Y, P(X, Y):

P(X \mid Y) = \frac{P(X, Y)}{P(Y)}

P(blue | ) = 2/5   [example refers to a picture on the original slide]
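A tiny worked example (numbers made up for illustration, not from the original slides): if P(X = a, Y = b) = 0.1 and P(Y = b) = 0.4, then

P(X = a \mid Y = b) = \frac{P(X = a, Y = b)}{P(Y = b)} = \frac{0.1}{0.4} = 0.25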
P(wi+1 = of | wi = tired) = 1
P(wi+1 = of | wi = use) = 1
P(wi+1 = sister | wi = her) = 1
P(wi+1 = beginning | wi = was) = 1/2
P(wi+1 = reading | wi = was) = 1/2
P(wi+1 = bank | wi = the) = 1/3
P(wi+1 = book | wi = the) = 1/3
P(wi+1 = use | wi = the) = 1/3
The chain rule

The joint probability P(X, Y) can also be expressed in terms of the conditional probability P(X | Y):

P(X, Y) = P(X \mid Y)\, P(Y)

This leads to the so-called chain rule:

P(X_1, X_2, \dots, X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, \dots, X_{n-1})
                        = P(X_1) \prod_{i=2}^{n} P(X_i \mid X_1, \dots, X_{i-1})

Independence

Two random variables X and Y are independent if

P(X, Y) = P(X)\, P(Y)

If X and Y are independent, then P(X | Y) = P(X):

P(X \mid Y) = \frac{P(X, Y)}{P(Y)} = \frac{P(X)\, P(Y)}{P(Y)}   (X, Y independent)
            = P(X)
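As an illustration (not on the original slides), applying the chain rule to three variables gives an exact factorization, with no independence assumptions yet:

P(X_1, X_2, X_3) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2)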
Probability models
Building a probability model consists of two steps:
1. Defining the model
2. Estimating the model’s parameters
(= training/learning)
Language modeling with N-grams

A language model over a vocabulary V assigns probabilities to strings drawn from V*.

Recall the chain rule:

P(w_1 \dots w_i) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_i \mid w_1 \dots w_{i-1})

An n-gram language model assumes each word depends only on the last n−1 words:

P_{ngram}(w_1 \dots w_i) := P(w_1)\, P(w_2 \mid w_1) \cdots P(\underbrace{w_i}_{i\text{-th word}} \mid \underbrace{w_{i-n+1} \dots w_{i-1}}_{\text{prev. } n-1 \text{ words}})

N-gram models

Unigram model:  P(w_1)\, P(w_2) \cdots P(w_i)
Bigram model:   P(w_1)\, P(w_2 \mid w_1) \cdots P(w_i \mid w_{i-1})
Trigram model:  P(w_1)\, P(w_2 \mid w_1) \cdots P(w_i \mid w_{i-2} w_{i-1})
N-gram model:   P(w_1)\, P(w_2 \mid w_1) \cdots P(w_i \mid w_{i-n+1} \dots w_{i-1})

N-gram models assume each word (event) depends only on the previous n−1 words (events). Such independence assumptions are called Markov assumptions (of order n−1):

P(w_i \mid w_1 \dots w_{i-1}) \approx P(w_i \mid w_{i-n+1} \dots w_{i-1})
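A minimal Python sketch (not from the original slides) of how the bigram factorization scores a string; the probability tables below are hypothetical stand-ins for parameters estimated from a corpus (see parameter estimation below).

# Hypothetical bigram model parameters.
unigram_prob = {"alice": 0.1}
bigram_prob = {("alice", "was"): 0.4, ("was", "beginning"): 0.5}

def bigram_sentence_prob(words, uni, bi):
    # P(w1 ... wN) = P(w1) * prod_{i >= 2} P(w_i | w_{i-1})
    p = uni.get(words[0], 0.0)
    for prev, w in zip(words, words[1:]):
        p *= bi.get((prev, w), 0.0)  # unseen bigrams get probability 0 under MLE
    return p

print(bigram_sentence_prob(["alice", "was", "beginning"], unigram_prob, bigram_prob))
# 0.1 * 0.4 * 0.5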
Parameter estimation (training)

Parameters: the actual probabilities
P(wi = 'the' | wi-1 = 'on') = ???

We need (a large amount of) text as training data to estimate the parameters of a language model.

The most basic estimation technique: relative frequency estimation (= counts)

P(wi = 'the' | wi-1 = 'on') = C('on the') / C('on')

This is also called Maximum Likelihood Estimation (MLE).
MLE assigns all probability mass to events that occur in the training corpus.

How do we use language models?

Independently of any application, we can use a language model as a random sentence generator (i.e. we sample sentences according to their language model probability).

Systems for applications such as machine translation, speech recognition, spell-checking, or generation often produce multiple candidate sentences as output.
- We prefer output sentences SOut that have a higher probability.
- We can use a language model P(SOut) to score and rank these different candidate output sentences, e.g. as follows:

\arg\max_{S_{Out}} P(S_{Out} \mid Input) = \arg\max_{S_{Out}} P(Input \mid S_{Out})\, P(S_{Out})
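The relative frequency (MLE) estimates described above can be computed directly from bigram and unigram counts. A minimal Python sketch (not from the original slides); the toy corpus is made up for illustration.

from collections import Counter

corpus = "the cat sat on the mat and the dog sat on the rug".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_bigram_prob(prev, word):
    """P(word | prev) = C(prev word) / C(prev); zero if the bigram was never seen."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram_prob("on", "the"))   # C('on the') / C('on') = 2/2 = 1.0
print(mle_bigram_prob("the", "cat"))  # 1/4 = 0.25
print(mle_bigram_prob("the", "fish")) # unseen bigram -> 0.0 under MLE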
Using a language model to generate language

Sampling from a discrete distribution:
Divide the interval [0, 1] into sub-intervals according to the probabilities of the outcomes.
- Generate a random number r between 0 and 1.
- Return the outcome x_i whose interval r falls into.

[Figure: the interval [0, 1] split at p_1, p_1+p_2, p_1+p_2+p_3, p_1+p_2+p_3+p_4 into intervals for outcomes x_1 ... x_5; r falls into one of them]
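A minimal Python sketch of this sampling procedure (not from the original slides); the vocabulary and probabilities are made up for illustration.

import random

def sample_categorical(outcomes, probs):
    """Draw r in [0, 1) and return the outcome whose cumulative-probability
    interval contains r, as described above."""
    r = random.random()
    cumulative = 0.0
    for x, p in zip(outcomes, probs):
        cumulative += p
        if r < cumulative:
            return x
    return outcomes[-1]  # guard against floating-point rounding

# Hypothetical unigram distribution over a tiny vocabulary:
words = ["the", "cat", "sat", "on", "mat"]
probs = [0.4, 0.2, 0.2, 0.1, 0.1]
print([sample_categorical(words, probs) for _ in range(10)])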
Generating the Wall Street Journal

Generating Shakespeare
Perplexity

Perplexity is the inverse of the probability of the test set (as assigned by the language model), normalized by the number of word tokens N:

PP(w_1 \dots w_N) := P(w_1 \dots w_N)^{-\frac{1}{N}}
                   = \sqrt[N]{\frac{1}{P(w_1 \dots w_N)}}
                   = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-n+1}, \dots, w_{i-1})}}

A lower perplexity is better: the model assigns a higher probability to the unseen test corpus.
Practical issues

Since language model probabilities are very small, multiplying them together often leads to underflow.

It is often better to use logarithms instead, so replace

PP(w_1 \dots w_N) := \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-n+1}, \dots, w_{i-1})}}

with

PP(w_1 \dots w_N) := \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-n+1}, \dots, w_{i-1})\right)

Perplexity and LM order

Bigram LMs have lower perplexity than unigram LMs.
Trigram LMs have lower perplexity than bigram LMs.
…

Example from the textbook (WSJ corpus, trained on 38M tokens, tested on 1.5M tokens, vocabulary: 20K word types):

             Unigram   Bigram   Trigram
Perplexity   962       170      109
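A minimal Python sketch of the log-space perplexity computation from the "Practical issues" slide (not from the original slides); the per-token conditional probabilities are made up for illustration.

import math

def perplexity(token_probs):
    """PP = exp(-(1/N) * sum_i log P(w_i | history)), computed in log space
    to avoid underflow."""
    n = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)

# Conditional probabilities the model assigns to a 4-token test string:
probs = [0.1, 0.25, 0.5, 0.05]
print(perplexity(probs))  # 1600 ** 0.25, approximately 6.32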
Word Error Rate (WER)
Originally developed for speech recognition.
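For reference (the standard definition; not spelled out on the slide): WER counts the minimum number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, normalized by the length of the reference:

\mathrm{WER} = \frac{\text{substitutions} + \text{deletions} + \text{insertions}}{\text{number of words in the reference}}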
But…
… unseen test data will contain unseen words
Generating Shakespeare
Getting back to Shakespeare…
Shakespeare as corpus

The Shakespeare corpus consists of N = 884,647 word tokens and a vocabulary of V = 29,066 word types.

Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigram types.
99.96% of possible bigrams don't occur in the corpus.

Our relative frequency estimate assigns non-zero probability to only 0.04% of the possible bigrams.
That percentage is even lower for trigrams, 4-grams, etc.
4-grams look like Shakespeare because they are Shakespeare!

MLE doesn't capture unseen events

We estimated a model on 440K word tokens, but:

Only 30,000 word types occur in the training data.
Any word that does not occur in the training data has zero probability!

Only 0.04% of all possible bigrams (over 30K word types) occur in the training data.
Any bigram that does not occur in the training data has zero probability (even if we have seen both words in the bigram).
[Figure: word frequency distribution on a log scale — a few words are very common, but most words are very rare]