Machine Learning and Statistical Natural Language Processing
Requirements:
Several programming projects
Research proposal
Machine Learning
[Diagram: Training Examples feed a Learning Algorithm, which produces a Learned Model; Test Examples are used to evaluate the Learned Model]
Modeling
Decide how to represent learned models:
Decision rules
Linear functions
Markov models
Generalization
Example Representation
Set of Features:
Continuous
Discrete (ordered and unordered)
Binary
Sets vs. sequences
Classes:
Continuous vs. discrete
Binary vs. multivalued
Disjoint vs. overlapping
Learning Algorithms
Find a good hypothesis consistent with the training data
Many hypotheses may be consistent, so we may need a preference bias
No hypothesis may be consistent, so we may need to find a nearly consistent one
Cross-validation
1. Divide the training data into k sets
2. Repeat for each set i:
   1. Train on the remaining k-1 sets
   2. Test on the i-th set
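For concreteness, a minimal Python sketch of the k-fold procedure above; train_model and accuracy are hypothetical placeholders for whatever learner and evaluation metric are being studied:

import random

def cross_validate(examples, k, train_model, accuracy):
    """Estimate generalization accuracy by k-fold cross-validation.

    train_model and accuracy are placeholders for the learner and the
    evaluation metric under study.
    """
    examples = examples[:]                       # copy so shuffling is safe
    random.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]   # k roughly equal subsets
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train_model(train)               # train on the remaining k-1 folds
        scores.append(accuracy(model, test))     # test on the held-out fold
    return sum(scores) / k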
Bootstrapping
For a corpus of n examples:
1. Choose n examples randomly (with replacement)
Note: We expect ~0.632n distinct examples, since each example is missed with probability (1 - 1/n)^n ≈ e^-1 ≈ 0.368
Bootstrap:
Often has higher bias (fewer distinct examples)
Best for small datasets
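A minimal sketch of drawing one bootstrap sample; returning the unchosen (out-of-bag) examples as a test set is an assumption added here, not stated above:

import random

def bootstrap_sample(examples):
    """Draw n examples with replacement from a corpus of n examples."""
    n = len(examples)
    sample = [random.choice(examples) for _ in range(n)]
    # Each example is missed with probability (1 - 1/n)^n ~ e^-1 ~ 0.368,
    # so the sample contains ~0.632n distinct examples on average.
    out_of_bag = [x for x in examples if x not in sample]   # never drawn; usable for testing
    return sample, out_of_bag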
A Paradigmatic Task
Language Modeling:
Predict the next word of a text (probabilistically): P(w_n | w_1 w_2 ... w_{n-1}) ≈ m(w_n | w_1 w_2 ... w_{n-1})
To do this perfectly, we would have to capture true notions of grammaticality
So: a better approximation of the probability of the next word means a better language model
Measuring Surprise
The lower the probability of the actual word, the more the model is surprised:
H(w_n | w_{1,n-1}) = -log2 m(w_n | w_{1,n-1})   (the conditional entropy of w_n given w_{1,n-1})
Cross-entropy: Suppose the actual distribution of the language is p(w_n | w_{1,n-1}); then our model is on average surprised by:
E_p[H(w_n | w_{1,n-1})] = Σ_w p(w_n = w | w_{1,n-1}) H(w_n = w | w_{1,n-1}) = E_p[-log2 m(w_n | w_{1,n-1})]
Over a whole text w_{1,n}, the chain rule gives Σ_k H(w_k | w_{1,k-1}) = -log2 m(w_{1,n}); dividing by n yields the per-word cross-entropy H(w_{1,n}, m) = -(1/n) log2 m(w_{1,n})
Perplexity
Commonly used measure of model fit:
perplexity(w_{1,n}, m) = 2^H(w_{1,n}, m) = m(w_{1,n})^(-1/n)
How many choices for next word on average? Lower perplexity = better model
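A minimal sketch of computing perplexity from per-word model probabilities; model_prob is a hypothetical callable standing in for m(w_n | w_{1,n-1}) and must return nonzero probabilities (i.e., a smoothed model):

import math

def perplexity(words, model_prob):
    """Perplexity of a text under a model.

    model_prob(history, word) is a hypothetical stand-in for m(w_n | w_1,n-1);
    it must return probabilities > 0 (i.e. the model is smoothed).
    """
    total_surprise = 0.0
    for i, w in enumerate(words):
        p = model_prob(words[:i], w)             # m(w_i | w_1 .. w_{i-1})
        total_surprise += -math.log2(p)          # surprisal of the actual word
    cross_entropy = total_surprise / len(words)  # H(w_1,n, m)
    return 2 ** cross_entropy                    # = m(w_1,n) ** (-1/n)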
N-gram Models
Assume a limited horizon:
P(w_k | w_1 w_2 ... w_{k-1}) ≈ P(w_k | w_{k-n+1} ... w_{k-1})
Each word depends only on the last n-1 words
Specific cases:
Unigram model: P(w_k) (words independent)
Bigram model: P(w_k | w_{k-1})
Using Bigrams
Compute the probability of a sentence:
W = The cat sat on the mat
P(W) = P(The|START) P(cat|The) P(sat|cat) P(on|sat) P(the|on) P(mat|the) P(END|mat)
Generate a random text and examine it for reasonableness
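A minimal sketch of an unsmoothed MLE bigram model and the sentence-probability computation above; the markers <s> and </s> play the roles of START and END:

from collections import Counter

def train_bigram(sentences):
    """Unsmoothed MLE bigram model from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]            # START / END markers
        unigrams.update(padded[:-1])                  # counts of conditioning words
        bigrams.update(zip(padded[:-1], padded[1:]))
    return unigrams, bigrams

def sentence_prob(sent, unigrams, bigrams):
    """P(W) as a product of bigram probabilities P(w | prev) = C(prev, w) / C(prev)."""
    padded = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, w in zip(padded[:-1], padded[1:]):
        if bigrams[(prev, w)] == 0:
            return 0.0                                # unseen bigram: this is why smoothing is needed
        prob *= bigrams[(prev, w)] / unigrams[prev]
    return prob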
Smoothing
Idea: Take some probability away from seen events and assign it to unseen events
Simple method (Laplace): Give every event an a priori count of 1:
P_Lap(X) = (C(X) + 1) / (N + B)
where X is any entity, N is the total number of observations, and B is the number of entity types
Problem: Assigns too much probability to new events
The more event types there are, the worse this becomes
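A minimal sketch of the Laplace estimate; counts is a Counter over observed events and num_types is B, which must include event types that were never observed:

from collections import Counter

def laplace_prob(counts, x, num_types):
    """P_Lap(X) = (C(X) + 1) / (N + B).

    counts: Counter of observed events; num_types: B, the number of possible
    event types, including types with zero observed count.
    """
    n = sum(counts.values())                 # N: total observed events
    return (counts[x] + 1) / (n + num_types)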
Interpolation
Lidstone: P_Lid(X) = (C(X) + d) / (N + dB), with d < 1
Johnson: P_Lid(X) = μ P_MLE(X) + (1 - μ)(1/B), where μ = N/(N + dB)
(i.e., Lidstone smoothing is a linear interpolation between the MLE and the uniform distribution)
Problems: How to choose d? Doesn't match low-frequency events well
Held-out Estimation
Idea: Estimate how often events will occur in unseen data by using held-out data
Divide the data into training and held-out subsets
C1(X) = frequency of X in the training data
C2(X) = frequency of X in the held-out data
N_r = number of entity types X with C1(X) = r
T_r = Σ_{X: C1(X)=r} C2(X)
P_ho(X) = T_r / (N_r N), where C1(X) = r and N is the number of tokens of held-out data
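A minimal sketch of the held-out estimate; train_counts and heldout_counts are Counters over the same kinds of events, and types unseen in both are not enumerated here, so the r = 0 case is only approximate:

from collections import Counter

def heldout_prob(x, train_counts, heldout_counts):
    """Held-out estimate P_ho(X) = T_r / (N_r * N), where r = C1(X)."""
    r = train_counts[x]
    types = set(train_counts) | set(heldout_counts)       # event types seen somewhere
    types_r = [t for t in types if train_counts[t] == r]
    n_r = len(types_r)                                     # N_r: types with training count r
    t_r = sum(heldout_counts[t] for t in types_r)          # T_r: their total held-out count
    n = sum(heldout_counts.values())                       # N: size of the held-out data
    return t_r / (n_r * n)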
Deleted Estimation
Generalize to use all the data:
Divide the data into two subsets, a and b:
N_r^a = number of entity types X such that C_a(X) = r
T_r^{ab} = Σ_{X: C_a(X)=r} C_b(X)
Combining both directions: P_del(X) = (T_r^{ab} + T_r^{ba}) / (N (N_r^a + N_r^b)) for an item X seen r times
Good-Turing
For observed items, discount the item count: r* = (r+1) E[N_{r+1}] / E[N_r]
The idea is that the chance of seeing the item one more time is about E[N_{r+1}] / E[N_r]
For unobserved items, the total probability is: E[N_1] / N
So, if we assume a uniform distribution over the N_0 unknown item types, we have: P(X) = E[N_1] / (N_0 N)
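A minimal sketch of the basic Good-Turing adjustment, using the raw counts-of-counts N_r in place of E[N_r] (smoothing E[N_r] is the subject of the next slide):

from collections import Counter

def good_turing_counts(counts):
    """Adjusted counts r* = (r + 1) N_{r+1} / N_r, using raw N_r as E[N_r]."""
    n_r = Counter(counts.values())           # N_r: number of types seen exactly r times
    adjusted = {}
    for x, r in counts.items():
        if n_r[r + 1] > 0:
            adjusted[x] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[x] = r                   # no types seen r+1 times: keep the raw count
    return adjusted

def unseen_mass(counts):
    """Total probability reserved for unseen items: N_1 / N."""
    n = sum(counts.values())
    return Counter(counts.values())[1] / n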
Good-Turing Issues
Has problems with high-frequency items (consider r*_max = (r_max + 1) E[N_{r_max+1}] / E[N_{r_max}] = 0, since nothing was seen more than r_max times)
Usual answers:
Use it only for low-frequency items (r < k)
Smooth E[N_r] with a function S(r)
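One common choice of S(r), following the Gale-Sampson "Simple Good-Turing" idea (an assumption here, not stated above), is a least-squares fit of log N_r against log r:

import math

def fit_s(n_r):
    """Fit log N_r = a + b log r by least squares; returns S(r) = e^a * r^b.

    n_r maps r to the nonzero count-of-counts N_r; needs at least two points.
    """
    rs = sorted(r for r in n_r if r > 0 and n_r[r] > 0)
    xs = [math.log(r) for r in rs]
    ys = [math.log(n_r[r]) for r in rs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda r: math.exp(a + b * math.log(r))   # smoothed stand-in for E[N_r]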
Back-off Models
If the high-order n-gram has insufficient data, use a lower-order n-gram:
P_bo(w_i | w_{i-n+1,i-1}) =
    (1 - d(w_{i-n+1,i-1})) P(w_i | w_{i-n+1,i-1})     if there is enough data
    α(w_{i-n+1,i-1}) P_bo(w_i | w_{i-n+2,i-1})        otherwise
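A minimal sketch of the recursive control flow; for simplicity it uses a fixed back-off weight ("stupid backoff" style) instead of the normalized d(.) and α(.) that Katz back-off derives from discounted counts, so its scores are not true probabilities:

from collections import Counter

def count_ngrams(sentences, max_order=3):
    """Count all n-grams up to max_order; keys are tuples of words."""
    counts = Counter()
    for sent in sentences:
        for order in range(1, max_order + 1):
            for i in range(len(sent) - order + 1):
                counts[tuple(sent[i:i + order])] += 1
    return counts

def backoff_score(word, history, counts, alpha=0.4):
    """Score the n-gram (history + word), backing off when it has no data.

    alpha is a fixed back-off weight ("stupid backoff"), not the normalized
    alpha(.) of Katz back-off, so the scores are not true probabilities.
    """
    history = tuple(history)
    if history:
        ngram = history + (word,)
        if counts[ngram] > 0:
            return counts[ngram] / counts[history]                       # enough data at this order
        return alpha * backoff_score(word, history[1:], counts, alpha)   # otherwise back off
    total = sum(c for key, c in counts.items() if len(key) == 1)
    return counts[(word,)] / total                                       # unigram base case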
Linear Interpolation
More generally, we can interpolate:
P_int(w_i | h) = Σ_k λ_k(h) P_k(w_i | h)
Interpolation between different orders
Usually set the weights by iterative training (gradient descent, EM algorithm)
Partition histories h into equivalence classes
The weights need to be responsive to the amount of data!
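A minimal sketch of interpolating component models and tuning a single weight vector by EM on held-out data; the P_k are hypothetical callables (e.g., unigram/bigram/trigram estimators), and history-dependent weights λ_k(h) are not modeled here:

def interpolated_prob(word, history, components, lambdas):
    """P_int(w | h) = sum_k lambda_k * P_k(w | h)."""
    return sum(l * p(word, history) for l, p in zip(lambdas, components))

def em_lambdas(heldout, components, n_iters=20):
    """Fit interpolation weights by EM on held-out (word, history) pairs.

    components is a list of callables P_k(word, history); a single weight
    vector is learned, not history-dependent lambda_k(h).
    """
    k = len(components)
    lambdas = [1.0 / k] * k                         # start from uniform weights
    for _ in range(n_iters):
        expected = [0.0] * k
        for word, history in heldout:
            parts = [l * p(word, history) for l, p in zip(lambdas, components)]
            total = sum(parts)
            if total == 0.0:
                continue
            for j in range(k):
                expected[j] += parts[j] / total     # E-step: which model explains this word
        z = sum(expected)
        lambdas = [e / z for e in expected]         # M-step: renormalize
    return lambdas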