N-Gram Models
N-Gram Models
More formally, we can use knowledge of the counts of N-grams to
assess the conditional probability of candidate words as the next
word in a sequence.
Or, we can use them to assess the probability of an entire sequence of
words.
APPLICATIONS:
Automatic speech recognition
Handwriting and character recognition
Spelling correction
Machine translation
Slide 5
Counting
Simple counting lies at the core of any probabilistic approach, so let's first
take a look at what we're counting.
He stepped out into the hall, was delighted to encounter a water brother.
13 word tokens (15 if we count the punctuation marks as tokens).
Compare: They picnicked by the pool, then lay back on the grass and looked at the stars.
16 word tokens, but the is used 3 times, so there are only 14
unique types (as opposed to tokens).
Going forward, we'll have occasion to focus on counting both types
and tokens of both words and N-grams.
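The type/token distinction can be checked with a few lines of Python. This is a minimal sketch assuming whitespace tokenization with punctuation stripped; the second sentence is a standard illustration in which "the" repeats, so types < tokens.

```python
# Minimal token vs. type counting, assuming whitespace tokenization
# with commas and periods stripped.
def count(sentence):
    words = sentence.replace(",", "").replace(".", "").split()
    return len(words), len(set(words))

# The sentence from this slide: every word occurs once, so types == tokens.
tokens, types = count(
    "He stepped out into the hall, was delighted to encounter a water brother.")
print(tokens, types)  # 13 13

# A sentence with repeats ("the" occurs 3 times): fewer types than tokens.
tokens2, types2 = count(
    "They picnicked by the pool, then lay back on the grass and looked at the stars.")
print(tokens2, types2)  # 16 14
```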
Slide 7
Counting: Corpora
So what happens when we look at large bodies of text
instead of single utterances?
Brown et al. (1992): a large corpus of English text
583 million wordform tokens
293,181 wordform types
Google
Crawl of 1,024,908,267,229 English tokens
13,588,391 wordform types
That seems like a lot of types... After all, even large dictionaries of English have only around
500k types. Why so many here?
Numbers
Misspellings
Names
Acronyms
etc
Slide 8
Language Modeling
Back to word prediction
We can model the word prediction task as the ability to assess the
conditional probability of a word given the previous words in the
sequence:
P(wn | w1, w2, …, wn-1)
We'll call a statistical model that can assess this a Language Model.
How might we go about calculating such a conditional probability?
One way is to use the definition of conditional probabilities and look for
counts. So to get
P(the | its water is so transparent that)
By definition that's
P(its water is so transparent that the) / P(its water is so transparent that)
We can get each of those from counts in a large corpus.
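A sketch of that counting approach, using a tiny invented "corpus" purely for illustration (a real estimate would need an enormous corpus, and the three sentences here are assumptions, not data from the source):

```python
# Toy sketch: estimate P(the | its water is so transparent that) by
# counting occurrences of the prefix, and of the prefix followed by "the".
corpus = [
    "its water is so transparent that the fish seem to float",
    "its water is so transparent that you can see the bottom",
    "its water is so transparent that the light passes through",
]

prefix = "its water is so transparent that"
count_prefix = sum(s.count(prefix) for s in corpus)               # 3
count_prefix_the = sum(s.count(prefix + " the") for s in corpus)  # 2

p = count_prefix_the / count_prefix
print(p)  # 2/3 in this toy corpus
```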
Slide 9
Slide 10
Rewriting: for sequences...
P(A | B) = P(A ^ B) / P(B)
P(A ^ B) = P(A | B) P(B)
And, by the chain rule:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
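The chain rule is an identity, not an approximation, which can be verified numerically. A small sketch on an invented joint distribution over three binary variables (the weights are arbitrary, chosen only to make the check concrete):

```python
# Numeric check of the chain rule P(A,B,C) = P(A) P(B|A) P(C|A,B)
# on an arbitrary joint distribution over three binary variables.
from itertools import product

events = list(product([0, 1], repeat=3))
weights = [1, 2, 3, 4, 5, 6, 7, 8]          # arbitrary positive weights
total = sum(weights)
joint = {e: w / total for e, w in zip(events, weights)}

def marginal(prefix):
    """P of the first len(prefix) variables taking the given values."""
    return sum(p for e, p in joint.items() if e[:len(prefix)] == prefix)

a, b, c = 1, 0, 1
chain = (marginal((a,))
         * marginal((a, b)) / marginal((a,))     # P(B=b | A=a)
         * joint[(a, b, c)] / marginal((a, b)))  # P(C=c | A=a, B=b)

print(abs(chain - joint[(a, b, c)]) < 1e-12)  # True: the factorization is exact
```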
Slide 11
P(its water was so transparent) =
P(its) ×
P(water | its) ×
P(was | its water) ×
P(so | its water was) ×
P(transparent | its water was so)
There are still a lot of possible sentences.
In general, we'll never be able to get enough data to compute
the statistics for those longer prefixes.
Slide 12
The bigram model approximates the probability of a word given all
the previous words, P(wn | w1 … wn-1), by the conditional probability of the
preceding word, P(wn | wn-1).
This assumption that the probability of a word depends only on the
previous word is called the Markov assumption. Markov models are the
class of probabilistic models that assume that we can predict the
probability of some future unit without looking too far into the past.
An N-gram looks n-1 words into the past.
A bigram is called a first-order Markov model, a trigram is a second-order
Markov model, and an N-gram is an (N-1)th-order Markov model.
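Under the bigram approximation, a sentence is scored as a product of P(wn | wn-1) terms, each estimated as C(wn-1 wn) / C(wn-1). A minimal sketch; the counts below are invented just to exercise the function, not taken from any real corpus:

```python
# Bigram (first-order Markov) sentence scoring: multiply the MLE
# estimates C(prev cur) / C(prev) along the sentence.
from collections import Counter

def bigram_prob(sentence, unigram_counts, bigram_counts):
    words = sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return p

# Tiny invented counts, purely for illustration.
uni = Counter({"its": 2, "water": 2, "is": 2})
bi = Counter({("its", "water"): 2, ("water", "is"): 1})
print(bigram_prob("its water is", uni, bi))  # (2/2) * (1/2) = 0.5
```

In practice the product of many small probabilities underflows, so real implementations sum log probabilities instead.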
Slide 13
Independence Assumption
Make the simplifying assumption
P(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|a)
Or maybe
P(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|saw,a)
Bigram version: P(wn | w1 … wn-1) ≈ P(wn | wn-1)
Slide 14
Slide 15
Slide 16
An Example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
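The bigram estimates for this three-sentence corpus can be computed directly from counts, including the sentence-boundary markers, dividing each bigram count by the count of its first word:

```python
# Bigram MLE estimates from the three-sentence corpus above,
# keeping the <s> and </s> boundary markers as tokens.
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p(cur, prev):
    """MLE estimate P(cur | prev) = C(prev cur) / C(prev)."""
    return bigrams[(prev, cur)] / unigrams[prev]

print(p("I", "<s>"))    # 2/3
print(p("Sam", "<s>"))  # 1/3
print(p("am", "I"))     # 2/3
```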
Slide 17
Slide 18
The MLE, 400/1,000,000 = .0004, may not be a good estimate of P(Chinese) in
general, but it is the estimate that makes it most likely that Chinese will
occur 400 times in a million-word corpus.
Slide 19
Slide 20
Bigram Counts
Out of 9,222 sentences.
E.g., "I want" occurred 827 times.
Slide 21
Bigram Probabilities
Divide bigram counts by prefix unigram counts to get probabilities.
Slide 22
N-grams Issues
Sparse data
Change of domain
Slide 23
Smoothing Techniques
Every N-gram training matrix is sparse, even for
very large corpora (Zipf's law).
Solution: estimate the likelihood of unseen N-grams.
Slide 24
Add-one Smoothing
Add 1 to every N-gram count.
Unsmoothed (MLE): P(wn|wn-1) = C(wn-1wn) / C(wn-1)
Add-one smoothed: P(wn|wn-1) = [C(wn-1wn) + 1] / [C(wn-1) + V], where V is the vocabulary size.
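A sketch comparing the two formulas on one bigram. The counts are the Berkeley Restaurant figures quoted in Jurafsky & Martin (C(I want) = 827 as on the earlier slide, C(I) = 2533, V = 1446); treat them as illustrative:

```python
# Compare the MLE estimate with the add-one (Laplace) smoothed estimate
# for a single bigram. Counts follow the Berkeley Restaurant example.
from collections import Counter

bigram_counts = Counter({("I", "want"): 827})
unigram_counts = Counter({"I": 2533})
V = 1446  # vocabulary size

def p_mle(prev, cur):
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

def p_add_one(prev, cur):
    # Every bigram count gets +1; the denominator grows by V to renormalize.
    return (bigram_counts[(prev, cur)] + 1) / (unigram_counts[prev] + V)

print(p_mle("I", "want"))      # ≈ 0.33
print(p_add_one("I", "want"))  # ≈ 0.21 -- smoothing shifts mass to unseen bigrams
```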
Slide 25
Slide 26
Slide 27
Good-Turing estimates:
p0 = n1 / N (probability mass reserved for unseen N-grams)
r* = (r + 1) · n(r+1) / n(r) (revised count for N-grams seen r times)
Slide 28
Smoothing:
Good-Turing Example
10 carp, 3 cod, 2 tuna, 1 trout, 1 salmon, 1 eel.
p0 = n1 / N
r* = (r + 1) · n(r+1) / n(r)
Here N = 18 and n1 = 3 (trout, salmon, eel each seen once), so p0 = 3/18 ≈ 0.17;
the revised count for trout is 1* = 2 · n2/n1 = 2 · (1/3) ≈ 0.67.
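The fish example can be worked through in a few lines. A minimal sketch of the basic Good-Turing formulas only, with none of the smoothing-of-n(r) refinements a production implementation would need:

```python
# Good-Turing sketch for the fish example:
# 10 carp, 3 cod, 2 tuna, 1 trout, 1 salmon, 1 eel.
from collections import Counter

counts = {"carp": 10, "cod": 3, "tuna": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())          # 18 observations in total
n = Counter(counts.values())      # n[r] = number of species seen exactly r times

p0 = n[1] / N                     # mass reserved for unseen species: 3/18
def r_star(r):
    return (r + 1) * n[r + 1] / n[r]   # revised count r*

print(p0)         # ≈ 0.167
print(r_star(1))  # trout's revised count: 2 * n2/n1 = 2 * (1/3) ≈ 0.67
```

Note that r* is zero whenever n(r+1) is zero (e.g. r = 3 here, since nothing was seen 4 times), which is why practical Good-Turing implementations first smooth the n(r) counts themselves.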
Slide 29