
CS 595-052

Machine Learning and Statistical Natural Language Processing


Prof. Shlomo Argamon, argamon@iit.edu
Room: 237C
Office Hours: Mon 3-4 PM
Book: Foundations of Statistical Natural Language Processing,
C. D. Manning and H. Schütze

Requirements:
Several programming projects
Research proposal

Machine Learning

[Diagram: Training Examples → Learning Algorithm → Learned Model; Test Examples → Learned Model → Classification/Labeling Results]

Modeling
Decide how to represent learned models:
Decision rules
Linear functions
Markov models

Type chosen affects generalization accuracy (on new data)

Generalization

Example Representation
Set of Features:
Continuous
Discrete (ordered and unordered)
Binary
Sets vs. sequences

Classes:
Continuous vs. discrete
Binary vs. multivalued
Disjoint vs. overlapping

Learning Algorithms
Find a good hypothesis consistent with the training data
Many hypotheses may be consistent, so we may need a preference bias
No hypothesis may be consistent, so we may need to find a nearly consistent one

May rule out some hypotheses to start with:
Feature reduction

Estimating Generalization Accuracy


Accuracy on the training data says nothing about new examples!
Must train and test on different example sets
Estimate generalization accuracy over multiple train/test divisions
Sources of estimation error:
Bias: Systematic error in the estimate
Variance: How much the estimate changes between different runs
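A minimal sketch (not from the original slides) of estimating accuracy on a single held-out split; the fit/predict classifier interface and the data are placeholders:

```python
import random

def holdout_accuracy(examples, labels, train_fraction, make_model, seed=0):
    """Estimate generalization accuracy with a single train/test split.

    `make_model` is assumed to return an object with fit(X, y) and predict(X)
    methods -- a placeholder interface, not a specific library.
    """
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    cut = int(train_fraction * len(idx))
    train_idx, test_idx = idx[:cut], idx[cut:]

    model = make_model()
    model.fit([examples[i] for i in train_idx], [labels[i] for i in train_idx])
    predictions = model.predict([examples[i] for i in test_idx])
    correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return correct / len(test_idx)
```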

Cross-validation
1. Divide the training data into k sets
2. Repeat for each set:
   1. Train on the remaining k-1 sets
   2. Test on the k-th set
3. Average the k accuracies (and compute statistics)
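As a rough illustration (not from the slides), k-fold cross-validation might look like the following; the fit/predict interface is the same placeholder as above:

```python
import random

def cross_validate(examples, labels, k, make_model, seed=0):
    """k-fold cross-validation: train on k-1 folds, test on the held-out fold."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]                # k roughly equal folds

    accuracies = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [i for i in idx if i not in test_set]
        model = make_model()
        model.fit([examples[i] for i in train_idx], [labels[i] for i in train_idx])
        predictions = model.predict([examples[i] for i in test_idx])
        correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k, accuracies               # mean and per-fold accuracies
```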

Bootstrapping
For a corpus of n examples:
1. Choose n examples randomly (with replacement)
Note: We expect ~0.632n different examples
2. Train the model, and evaluate:
acc0 = accuracy of the model on the non-chosen examples
accS = accuracy of the model on the n training examples
3. Estimate accuracy as 0.632*acc0 + 0.368*accS
4. Average accuracies over b different runs


Also note: there are other similar bootstrapping techniques
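A sketch of the 0.632 bootstrap described above; the fit/predict interface is again a placeholder, not a specific library:

```python
import random

def bootstrap_632(examples, labels, make_model, b=10, seed=0):
    """Average the 0.632 bootstrap estimate over b resamples."""
    rng = random.Random(seed)
    n = len(examples)
    estimates = []
    for _ in range(b):
        chosen = [rng.randrange(n) for _ in range(n)]    # n indices, with replacement
        chosen_set = set(chosen)
        unseen = [i for i in range(n) if i not in chosen_set]
        if not unseen:                                   # degenerate resample; skip it
            continue
        model = make_model()
        model.fit([examples[i] for i in chosen], [labels[i] for i in chosen])

        def accuracy(indices):
            preds = model.predict([examples[i] for i in indices])
            return sum(p == labels[i] for p, i in zip(preds, indices)) / len(indices)

        acc0 = accuracy(unseen)                          # accuracy on non-chosen examples
        accS = accuracy(chosen)                          # accuracy on the n training examples
        estimates.append(0.632 * acc0 + 0.368 * accS)
    return sum(estimates) / len(estimates)
```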

Bootstrapping vs. Cross-validation


Cross-validation:
Equal participation of all examples
Class distribution in the test sets depends on the distribution in training
Stratified cross-validation: equalize class distributions

Bootstrap:
Often has higher bias (fewer distinct examples)
Best for small datasets

Natural Language Processing


Extract useful information from natural language texts (articles, books, web pages, queries, etc.)
Traditional method: Handcrafted lexicons, grammars, parsers
Statistical approach: Learn how to process language from a corpus of real usage

Some Statistical NLP Tasks


1. Part-of-speech tagging - How to distinguish between "book" the noun and "book" the verb.
2. Shallow parsing - Pick out phrases of different types from a text, such as "the purple people eater" or "would have been going".
3. Word sense disambiguation - How to distinguish between "river bank" and "bank" as a financial institution.
4. Alignment - Find the correspondence between words, sentences, and paragraphs of a source text and its translation.

A Paradigmatic Task
Language Modeling:
Predict the next word of a text (probabilistically):
P(w_n | w_1 w_2 ... w_{n-1}) = m(w_n | w_1 w_2 ... w_{n-1})

To do this perfectly, we must capture true notions of grammaticality
So: a better approximation of the probability of the next word ⇒ a better language model

Measuring Surprise
The lower the probability of the actual word, the more the model is surprised:
H(w_n | w_1 ... w_{n-1}) = -log2 m(w_n | w_1 ... w_{n-1})
(The conditional entropy of w_n given w_{1,n-1})

Cross-entropy: Suppose the actual distribution of the language is p(w_n | w_1 ... w_{n-1}); then our model is on average surprised by:
E_p[H(w_n | w_{1,n-1})] = Σ_w p(w_n = w | w_{1,n-1}) H(w_n = w | w_{1,n-1}) = E_p[-log2 m(w_n | w_{1,n-1})]

Estimating the Cross-Entropy


How can we estimate E_p[H(w_n | w_{1,n-1})] when we don't (by definition) know p?
Assume:
Stationarity: The language doesn't change
Ergodicity: The language never gets stuck
Then:

E_p[H(w_n | w_{1,n-1})] = lim_{n→∞} (1/n) Σ_{k=1}^{n} H(w_k | w_{1,k-1})

(In practice, estimated as -(1/n) log2 m(w_{1,n}) on a long sample of text.)

Perplexity
Commonly used measure of model fit:
perplexity(w_{1,n}, m) = 2^{H(w_{1,n}, m)} = m(w_{1,n})^{-(1/n)}

How many choices for the next word, on average?
Lower perplexity = better model
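A small sketch (not from the slides) of computing perplexity from the per-word probabilities a model assigns to a text:

```python
import math

def perplexity(word_probs):
    """Perplexity of a text, given the model probability of each of its words.

    `word_probs` is a placeholder list of m(w_k | w_1..w_k-1) values; how they
    are produced depends on the language model.
    """
    n = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n   # average surprise, in bits
    return 2 ** cross_entropy

# A model that gives every word of a 6-word text probability 1/4 has
# cross-entropy 2 bits/word and perplexity 4 ("4 choices on average").
print(perplexity([0.25] * 6))   # -> 4.0
```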

N-gram Models
Assume a limited horizon:
P(w_k | w_1 w_2 ... w_{k-1}) = P(w_k | w_{k-n+1} ... w_{k-1})
Each word depends only on the last n-1 words

Specific cases:
Unigram model: P(w_k) - words independent
Bigram model: P(w_k | w_{k-1})

Learning task: estimate these probabilities from a given corpus

Using Bigrams
Compute probability of a sentence:
W = "The cat sat on the mat"
P(W) = P(The|START) P(cat|The) P(sat|cat) P(on|sat) P(the|on) P(mat|the) P(END|mat)
Generate a random text and examine for reasonableness
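A sketch of the calculation above; `bigram_prob` is a hypothetical table mapping (previous word, word) to an estimated probability:

```python
def sentence_probability(sentence, bigram_prob):
    """P(W) as a product of bigram probabilities, with START/END markers."""
    words = ["START"] + sentence.split() + ["END"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob.get((prev, word), 0.0)   # unseen bigram -> probability 0 (MLE view)
    return prob

# e.g. sentence_probability("The cat sat on the mat", bigram_prob)
```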

Maximum Likelihood Estimation


P_MLE(w_1 ... w_n) = C(w_1 ... w_n) / N
P_MLE(w_n | w_1 ... w_{n-1}) = C(w_1 ... w_n) / C(w_1 ... w_{n-1})

Problem: Data sparseness!
For the vast majority of possible n-grams, we get 0 probability, even in a very large corpus
The larger the context, the greater the problem
But there are always new cases not seen before!
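One way to compute the MLE bigram estimates from a toy corpus (a sketch, not a prescribed implementation):

```python
from collections import Counter

def mle_bigram_model(corpus_sentences):
    """P_MLE(w_k | w_k-1) = C(w_k-1 w_k) / C(w_k-1), counted from whitespace-split sentences."""
    unigram_counts = Counter()   # counts of each word used as a context
    bigram_counts = Counter()
    for sentence in corpus_sentences:
        words = ["START"] + sentence.split() + ["END"]
        unigram_counts.update(words[:-1])                # every word that precedes another
        bigram_counts.update(zip(words, words[1:]))
    return {(prev, w): c / unigram_counts[prev]
            for (prev, w), c in bigram_counts.items()}

# bigram_prob = mle_bigram_model(["The cat sat on the mat", "The cat ate"])
```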

Smoothing
Idea: Take some probability away from seen events and assign it to unseen events
Simple method (Laplace): Give every event an a priori count of 1
P_Lap(X) = (C(X) + 1) / (N + B)
where X is any entity, N is the total number of observations, and B is the number of entity types
Problem: Assigns too much probability to new events
The more event types there are, the worse this becomes
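A one-line sketch of the Laplace estimate, with a toy number illustrating why it over-rewards unseen events:

```python
def laplace_prob(count, N, B):
    """P_Lap(X) = (C(X) + 1) / (N + B); N = total observations, B = number of event types."""
    return (count + 1) / (N + B)

# With N = 1,000 tokens and B = 10,000 types, every unseen event already gets
# probability 1/11,000 -- most of the probability mass goes to events never seen.
print(laplace_prob(0, 1000, 10000))
```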

Interpolation
Lidstone: P_Lid(X) = (C(X) + d) / (N + dB)   [d < 1]
Johnson: P_Lid(X) = m P_MLE(X) + (1 - m)(1/B), where m = N/(N + dB)
How to choose d?
Doesn't match low-frequency events well
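A quick numeric check (a sketch with made-up numbers) that the Lidstone form and Johnson's interpolation form are the same estimate:

```python
def lidstone_prob(count, N, B, d=0.5):
    """P_Lid(X) = (C(X) + d) / (N + d*B), for some d < 1."""
    return (count + d) / (N + d * B)

def johnson_form(count, N, B, d=0.5):
    """The same estimate written as an interpolation of the MLE and the uniform 1/B."""
    m = N / (N + d * B)
    return m * (count / N) + (1 - m) * (1 / B)

print(lidstone_prob(3, 1000, 500), johnson_form(3, 1000, 500))   # both 0.0028
```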

Held-out Estimation
Idea: Estimate frequencies on unseen data from a held-out portion of the data
Divide the data into training and held-out subsets:
C_1(X) = freq of X in training data
C_2(X) = freq of X in held-out data
T_r = Σ_{X: C_1(X)=r} C_2(X)
P_ho(X) = T_r / (N_r N), where C(X) = r and N_r is the number of entity types with C_1(X) = r
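A sketch of held-out estimation over token lists; taking N to be the number of tokens in the held-out data is an assumption of this sketch, and only items seen in training get a probability here:

```python
from collections import Counter

def held_out_probs(train_tokens, held_out_tokens):
    """P_ho(X) = T_r / (N_r * N) for every X with training count r >= 1."""
    c1 = Counter(train_tokens)          # C_1: frequencies in the training data
    c2 = Counter(held_out_tokens)       # C_2: frequencies in the held-out data
    N = len(held_out_tokens)            # assumed normalizer (see note above)

    N_r = Counter(c1.values())          # N_r: number of types with training count r
    T_r = Counter()                     # T_r: total held-out count of those types
    for x, r in c1.items():
        T_r[r] += c2[x]

    return {x: T_r[r] / (N_r[r] * N) for x, r in c1.items()}
```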

Deleted Estimation
Generalize to use all the data:
Divide the data into 2 subsets (a and b):
N_r^a = number of entities X s.t. C_a(X) = r
T_r^ab = Σ_{X: C_a(X)=r} C_b(X)

P_del(X) = (T_r^01 + T_r^10) / (N (N_r^0 + N_r^1))   [where C(X) = r]


Needs a large data set
Overestimates unseen data, underestimates infrequent data
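A sketch that computes the corrected count r* = (T_r^01 + T_r^10) / (N_r^0 + N_r^1) for each frequency r; P_del(X) is then r*/N for an item with count r (the slide's formula with N factored out):

```python
from collections import Counter

def deleted_corrected_counts(tokens_0, tokens_1):
    """Corrected counts r* from the two halves of the data."""
    c0, c1 = Counter(tokens_0), Counter(tokens_1)

    def totals(c_train, c_other):
        # N_r = number of types with count r in c_train; T_r = their total count in c_other
        Nr, Tr = Counter(c_train.values()), Counter()
        for x, r in c_train.items():
            Tr[r] += c_other[x]
        return Nr, Tr

    N0, T01 = totals(c0, c1)
    N1, T10 = totals(c1, c0)
    return {r: (T01[r] + T10[r]) / (N0[r] + N1[r]) for r in set(N0) | set(N1)}
```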

Good-Turing
For observed items, discount the item count:
r* = (r+1) E[N_{r+1}] / E[N_r]
The idea is that the chance of seeing the item one more time is about E[N_{r+1}] / E[N_r]
For unobserved items, the total probability is: E[N_1] / N

So, if we assume a uniform distribution over unknown items, we have:
P(X) = E[N_1] / (N_0 N)
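A sketch of the discounting step, using the raw N_r counts as the estimate of E[N_r] (so the high-frequency problem on the next slide applies):

```python
from collections import Counter

def good_turing_counts(tokens):
    """Return (adjusted count r* per type, total probability mass for unseen items)."""
    counts = Counter(tokens)
    N = len(tokens)
    Nr = Counter(counts.values())                        # N_r = number of types seen r times

    adjusted = {}
    for x, r in counts.items():
        if Nr[r + 1] > 0:
            adjusted[x] = (r + 1) * Nr[r + 1] / Nr[r]    # r* = (r+1) N_{r+1} / N_r
        else:
            adjusted[x] = r                              # no types with count r+1: fall back
    p_unseen_total = Nr[1] / N                           # E[N_1] / N
    return adjusted, p_unseen_total
```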

Good-Turing Issues
Has problems with high-frequency items
(consider r*_max = (r_max + 1) E[N_{r_max+1}] / E[N_{r_max}] = 0, since nothing occurs more than r_max times)
Usual answers:
Use it only for low-frequency items (r < k)
Smooth E[N_r] by a function S(r)

How to divide probability among unseen items?


Uniform distribution
Estimate which items seem more likely than others

Back-off Models
If the high-order n-gram has insufficient data, use a lower-order n-gram:

P_bo(w_i | w_{i-n+1,i-1}) =
  (1 - d(w_{i-n+1,i-1})) P(w_i | w_{i-n+1,i-1})     if enough data
  α(w_{i-n+1,i-1}) P_bo(w_i | w_{i-n+2,i-1})        otherwise

Note recursive formulation
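A recursive sketch of the scheme above; `counts` maps word tuples to counts, and `discount` and `alpha` are hypothetical functions supplying d(.) and α(.) (choosing them so the distribution normalizes, as in Katz back-off, is the real work and is not shown):

```python
def backoff_prob(word, history, counts, discount, alpha, threshold=1):
    """Recursive back-off: use the full history if it has enough data, else back off."""
    if not history:                                        # base case: unigram estimate
        total = sum(c for ngram, c in counts.items() if len(ngram) == 1)
        return counts.get((word,), 0) / total

    history = tuple(history)
    hist_count = counts.get(history, 0)
    ngram_count = counts.get(history + (word,), 0)
    if hist_count >= threshold and ngram_count > 0:        # "enough data": use this order
        return (1 - discount(history)) * ngram_count / hist_count
    # otherwise back off to the shorter history w_{i-n+2,i-1}
    return alpha(history) * backoff_prob(word, history[1:], counts, discount, alpha, threshold)
```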

Linear Interpolation
More generally, we can interpolate:
P_int(w_i | h) = Σ_k λ_k(h) P_k(w_i | h)
Interpolation between different orders
Usually set the weights by iterative training (gradient descent or the EM algorithm)

Partition histories h into equivalence classes
Need to be responsive to the amount of data!
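A sketch of the interpolation formula; `models` is a list of hypothetical functions P_k(word, history), and `lambdas(history)` returns one weight per model (summing to 1), as would be set by EM on held-out data:

```python
def interpolated_prob(word, history, models, lambdas):
    """P_int(w | h) = sum over k of lambda_k(h) * P_k(w | h)."""
    weights = lambdas(history)
    return sum(lam * p_k(word, history) for lam, p_k in zip(weights, models))

# Example: interpolate unigram, bigram, and trigram estimates with fixed weights
# p = interpolated_prob("mat", ("on", "the"), [p_uni, p_bi, p_tri],
#                       lambda h: (0.2, 0.3, 0.5))
```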
