Language Modeling and Spelling Correction
Language Modeling: Introduction to N-grams
Dan Jurafsky
P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi−1)

P("its water is so transparent") =
  P(its) × P(water | its) × P(is | its water)
  × P(so | its water is) × P(transparent | its water is so)
Markov Assumption
• Simplifying assumption (Andrei Markov):
  P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe:
  P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
• In general, approximate the conditional probability using only the last k words:
  P(wi | w1 w2 … wi−1) ≈ P(wi | wi−k … wi−1)

Bigram model
• Condition on the previous word:
  P(wi | w1 w2 … wi−1) ≈ P(wi | wi−1)
N-gram models
• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language
  • because language has long-distance dependencies:
    "The computer which I had just put into the machine room on the fifth floor crashed."
Maximum likelihood estimate:

P(wi | wi−1) = count(wi−1, wi) / count(wi−1) = c(wi−1, wi) / c(wi−1)
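The maximum likelihood estimate can be sketched in a few lines of Python. The toy corpus and the `<s>`/`</s>` sentence-boundary markers below are illustrative assumptions, not part of the slides:

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries (an assumption
# made here so the first word of a sentence has a conditioning context).
corpus = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, w_prev):
    """MLE estimate: P(w | w_prev) = c(w_prev, w) / c(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_bigram("i", "<s>"))   # c(<s>, i) = 2, c(<s>) = 3 -> 0.666...
print(p_bigram("am", "i"))    # c(i, am) = 2, c(i) = 3 -> 0.666...
```

Counting bigrams with `zip(words, words[1:])` pairs each word with its successor, which is exactly the c(wi−1, wi) table the formula needs.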
More examples: Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i'm looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i'm looking for a good place to eat breakfast
• when is caffe venezia open during the day
Practical Issues
• We do everything in log space
  • Avoid underflow
  • (also adding is faster than multiplying)

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
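A quick illustration of why log space matters. The probabilities below are made-up values chosen to force underflow:

```python
import math

probs = [1e-5] * 100  # 100 word probabilities of 1e-5 each

# The naive product (1e-5)^100 = 1e-500 is far below the smallest
# positive double (~5e-324), so it underflows to 0.0.
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Summing logs keeps the same quantity representable.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # 100 * log(1e-5) ≈ -1151.29
```

Once everything is a log probability, multiplying models together becomes addition, which is both safe and cheap.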
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Intuition of Perplexity
• The Shannon Game: How well can we predict the next word?
  I always order pizza with cheese and ____
    (mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100)
  The 33rd President of the US was ____
  I saw a ____
• Unigrams are terrible at this game. (Why?)
• A better model of a text is one which assigns a higher probability to the word that actually occurs
Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w1 w2 … wN)^(−1/N)

Chain rule: PP(W) = (∏i 1 / P(wi | w1 … wi−1))^(1/N)
For bigrams: PP(W) = (∏i 1 / P(wi | wi−1))^(1/N)
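One way to sketch the perplexity computation for a bigram model, working in log space as recommended earlier. The uniform model and the test sentence are hypothetical:

```python
import math

def perplexity(test_words, bigram_prob):
    """PP(W) = (prod 1 / P(wi | wi-1))^(1/N), computed in log space."""
    n = len(test_words) - 1  # number of predicted words
    log_sum = 0.0
    for prev, w in zip(test_words, test_words[1:]):
        log_sum += math.log(bigram_prob(prev, w))
    return math.exp(-log_sum / n)

# Hypothetical model: uniform over a 10-word vocabulary, so every
# conditional probability is 1/10 regardless of context.
uniform = lambda prev, w: 0.1
words = "<s> i want english food </s>".split()
print(perplexity(words, uniform))  # a uniform 1/10 model gives PP = 10
```

The uniform case shows the standard intuition: perplexity behaves like the effective branching factor, here exactly the vocabulary size 10.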
Approximating Shakespeare
Shakespeare as corpus
• N=884,647 tokens, V=29,066
• Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
• So 99.96% of the possible bigrams were never seen
(have zero entries in the table)
• Quadrigrams worse: What's coming out looks
like Shakespeare because it is Shakespeare
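The sparsity figure can be checked directly from the numbers quoted above:

```python
V = 29_066          # vocabulary size from the slide
seen = 300_000      # bigram types Shakespeare actually produced

possible = V ** 2                 # 844,832,356 ≈ 844 million
unseen_frac = 1 - seen / possible
print(possible)
print(f"{unseen_frac:.2%}")       # 99.96% of possible bigrams unseen
```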
Zeros
• Training set:
  … denied the allegations
  … denied the reports
  … denied the claims
  … denied the request
• Test set:
  … denied the offer
  … denied the loan
The Spelling Correction Task
Web search
Spelling Tasks
• Spelling Error Detection
• Spelling Error Correction:
  • Autocorrect (hte → the)
  • Suggest a correction
  • Suggestion lists
Spelling Correction and the Noisy Channel
The Noisy Channel
• We see an observation x of a misspelled word
• Find the correct word w:

  ŵ = argmax_{w ∈ V} P(w | x)
    = argmax_{w ∈ V} P(x | w) P(w) / P(x)
    = argmax_{w ∈ V} P(x | w) P(w)
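A minimal sketch of the noisy-channel decision rule. The candidate set follows the `acress` running example, but every probability below is invented purely for illustration:

```python
# Hypothetical scores for candidate corrections of the observed
# string x = "acress"; all numbers are made up for this sketch.
channel = {   # P(x | w): probability of typing "acress" given intended w
    "actress": 1e-4,
    "across":  2e-5,
    "acres":   5e-5,
}
language = {  # P(w): unigram language model probability
    "actress": 1e-6,
    "across":  1e-5,
    "acres":   2e-6,
}

# w_hat = argmax_w P(x | w) * P(w); P(x) is constant across candidates,
# so it can be dropped from the argmax.
w_hat = max(channel, key=lambda w: channel[w] * language[w])
print(w_hat)
```

Note how a weaker channel score ("across" is a less likely typo source here) can still win if the language model prior is strong enough.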
Example misspelling: acress
Candidate generation
• Words with similar spelling
  • Small edit distance to error
• Words with similar pronunciation
  • Small edit distance of pronunciation to error
Candidate generation
• 80% of errors are within edit distance 1
• Almost all errors are within edit distance 2
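The edit-distance-1 neighborhood is cheap to enumerate. This is the well-known Norvig-style generator, sketched here under the assumption of lowercase ASCII words:

```python
import string

def edits1(word):
    """All strings at edit distance 1 from word (deletion,
    transposition, substitution, insertion)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

cands = edits1("the")
print("hte" in cands, "thew" in cands)  # True True (transposition, insertion)
```

Applying `edits1` to every string in `edits1(word)` covers edit distance 2, which by the statistics above captures almost all real errors.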
Language Model
• Use any of the language modeling algorithms we've learned
  • Unigram, bigram, trigram
• Web-scale spelling correction
  • Stupid backoff
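Stupid backoff can be sketched in a few lines: use the relative frequency of the highest-order n-gram when it has been seen, otherwise back off with a fixed weight of 0.4 (the value from Brants et al., 2007). The toy corpus below is invented:

```python
from collections import Counter

ALPHA = 0.4  # fixed backoff weight from Brants et al. (2007)

# Toy counts; a real system would use web-scale n-gram counts.
corpus = "i want to eat i want to sleep i like to eat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def stupid_backoff(w, prev):
    """S(w | prev): bigram relative frequency if seen,
    else ALPHA * unigram relative frequency."""
    if bigrams[(prev, w)] > 0:
        return bigrams[(prev, w)] / unigrams[prev]
    return ALPHA * unigrams[w] / total

print(stupid_backoff("want", "i"))   # seen bigram: c(i, want)/c(i) = 2/3
print(stupid_backoff("sleep", "i"))  # unseen bigram: 0.4 * 1/12
```

The scores S are not true probabilities (they don't sum to 1), which is exactly the trade stupid backoff makes for simplicity at web scale.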
Evaluation
• Some spelling error test sets:
  • Wikipedia's list of common English misspellings
  • Aspell filtered version of that list
  • Birkbeck spelling error corpus
  • Peter Norvig's list of errors (includes Wikipedia and Birkbeck, for training or testing)
Real-Word Spelling Correction
Candidate words: {to, too, two}, {of, on}, {the, threw, thaw}
State-of-the-art Systems
Channel model
• Factors that could influence p(misspelling | word):
  • The source letter
  • The target letter
  • Surrounding letters
  • The position in the word
  • Nearby keys on the keyboard
  • Homology on the keyboard
  • Pronunciations
  • Likely morpheme transformations
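One common way to turn such factors into p(misspelling | word) is a confusion-matrix channel model in the style of Kernighan et al.; the counts below are invented for illustration:

```python
# Toy substitution confusion counts: sub_counts[(x, y)] is how often
# letter y was typed when letter x was intended (numbers invented).
sub_counts = {("e", "a"): 93, ("e", "i"): 40, ("e", "e"): 9000}
intended_counts = {"e": 9500}  # how often 'e' was intended overall

def p_sub(typed, intended):
    """P(typed letter | intended letter) from confusion counts."""
    return sub_counts.get((intended, typed), 0) / intended_counts[intended]

print(p_sub("a", "e"))  # 93 / 9500
```

A full channel model would hold one such table per edit type (substitution, insertion, deletion, transposition), estimated from a corpus of observed spelling errors.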
Nearby keys (keyboard layout figure)
Classifier-based methods for real-word spelling correction
• Instead of just channel model and language model
• Use many features in a classifier (next lecture).
• Build a classifier for a specific pair like: whether/weather
  • "cloudy" within ±10 words
  • ___ to VERB
  • ___ or not
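The cues above could be encoded as binary features along these lines; the feature names, the simplification of "to VERB" to just a following "to", and the exact window handling are all assumptions of this sketch:

```python
def features(tokens, i):
    """Binary cues for a whether/weather confusion word at position i."""
    window = tokens[max(0, i - 10): i + 11]  # "within +-10 words"
    return {
        "cloudy_within_10": "cloudy" in window,
        # Simplification: check the next token is "to" rather than
        # running a POS tagger to confirm "to VERB".
        "followed_by_to": i + 1 < len(tokens) and tokens[i + 1] == "to",
        "followed_by_or_not": tokens[i + 1: i + 3] == ["or", "not"],
    }

sent = "i do not know whether to bring an umbrella".split()
print(features(sent, sent.index("whether")))
```

A classifier trained on such features for each confusion pair can then pick whether/weather directly, instead of relying only on channel and language model scores.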