Machine Learning and Statistical Natural Language Processing
Requirements:
Several programming projects
Research proposal
Machine Learning
[Diagram: Training Examples feed a Learning Algorithm, which produces a Learned Model; Test Examples are used to evaluate the Learned Model]
Modeling
Decide how to represent learned models:
Decision rules
Linear functions
Markov models
Generalization
Example Representation
Set of Features:
Continuous
Discrete (ordered and unordered)
Binary
Sets vs. sequences
Classes:
Continuous vs. discrete
Binary vs. multivalued
Disjoint vs. overlapping
Learning Algorithms
Find a good hypothesis consistent with the training data
Many hypotheses may be consistent, so we may need a preference bias
No hypothesis may be consistent, so we may need to find a nearly consistent one
Cross-validation
1. Divide the training data into k sets
2. Repeat for each set i:
   1. Train on the remaining k-1 sets
   2. Test on the i-th set
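For concreteness, a minimal Python sketch of the k-fold procedure above; train_model and accuracy are hypothetical placeholders for whatever learner and evaluation metric are being studied:

import random

def cross_validate(examples, k, train_model, accuracy):
    """Estimate generalization accuracy by k-fold cross-validation.

    train_model and accuracy are placeholders for the learner and the
    evaluation metric under study.
    """
    examples = examples[:]                       # copy so shuffling is safe
    random.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]   # k roughly equal subsets
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train_model(train)               # train on the remaining k-1 folds
        scores.append(accuracy(model, test))     # test on the held-out fold
    return sum(scores) / k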
Bootstrapping
For a corpus of n examples:
1. Choose n examples randomly (with replacement)
Note: We expect ~0.632n distinct examples, since each example is missed with probability (1 - 1/n)^n ≈ e^-1 ≈ 0.368
Bootstrap:
Often has higher bias (fewer distinct examples)
Best for small datasets
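A minimal sketch of drawing one bootstrap sample; returning the unchosen (out-of-bag) examples as a test set is an assumption added here, not stated above:

import random

def bootstrap_sample(examples):
    """Draw n examples with replacement from a corpus of n examples."""
    n = len(examples)
    sample = [random.choice(examples) for _ in range(n)]
    # Each example is missed with probability (1 - 1/n)^n ~ e^-1 ~ 0.368,
    # so the sample contains ~0.632n distinct examples on average.
    out_of_bag = [x for x in examples if x not in sample]   # never drawn; usable for testing
    return sample, out_of_bag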
A Paradigmatic Task
Language Modeling:
Predict the next word of a text (probabilistically): P(w_n | w_1 w_2 ... w_{n-1}) ≈ m(w_n | w_1 w_2 ... w_{n-1})
To do this perfectly, we would have to capture true notions of grammaticality
So: a better approximation of the probability of the next word means a better language model
Measuring Surprise
The lower the probability of the actual word, the more the model is surprised:
H(w_n | w_{1,n-1}) = -log2 m(w_n | w_{1,n-1})   (the conditional entropy of w_n given w_{1,n-1})
Cross-entropy: Suppose the actual distribution of the language is p(w_n | w_{1,n-1}); then our model is on average surprised by:
E_p[H(w_n | w_{1,n-1})] = Σ_w p(w_n = w | w_{1,n-1}) H(w_n = w | w_{1,n-1}) = E_p[-log2 m(w_n | w_{1,n-1})]
Over a whole text w_{1,n}, the chain rule gives Σ_k H(w_k | w_{1,k-1}) = -log2 m(w_{1,n}); dividing by n yields the per-word cross-entropy H(w_{1,n}, m) = -(1/n) log2 m(w_{1,n})
Perplexity
Commonly used measure of model fit:
perplexity(w_{1,n}, m) = 2^H(w_{1,n}, m) = m(w_{1,n})^(-1/n)
How many choices for next word on average? Lower perplexity = better model
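A minimal sketch of computing perplexity from per-word model probabilities; model_prob is a hypothetical callable standing in for m(w_n | w_{1,n-1}) and must return nonzero probabilities (i.e., a smoothed model):

import math

def perplexity(words, model_prob):
    """Perplexity of a text under a model.

    model_prob(history, word) is a hypothetical stand-in for m(w_n | w_1,n-1);
    it must return probabilities > 0 (i.e. the model is smoothed).
    """
    total_surprise = 0.0
    for i, w in enumerate(words):
        p = model_prob(words[:i], w)             # m(w_i | w_1 .. w_{i-1})
        total_surprise += -math.log2(p)          # surprisal of the actual word
    cross_entropy = total_surprise / len(words)  # H(w_1,n, m)
    return 2 ** cross_entropy                    # = m(w_1,n) ** (-1/n)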
N-gram Models
Assume a limited horizon:
P(w_k | w_1 w_2 ... w_{k-1}) ≈ P(w_k | w_{k-n+1} ... w_{k-1})
Each word depends only on the last n-1 words
Specific cases:
Unigram model: P(w_k) (words independent)
Bigram model: P(w_k | w_{k-1})
Using Bigrams
Compute the probability of a sentence:
W = The cat sat on the mat
P(W) = P(The|START) P(cat|The) P(sat|cat) P(on|sat) P(the|on) P(mat|the) P(END|mat)
Generate a random text and examine it for reasonableness
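A minimal sketch of an unsmoothed MLE bigram model and the sentence-probability computation above; the markers <s> and </s> play the roles of START and END:

from collections import Counter

def train_bigram(sentences):
    """Unsmoothed MLE bigram model from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]            # START / END markers
        unigrams.update(padded[:-1])                  # counts of conditioning words
        bigrams.update(zip(padded[:-1], padded[1:]))
    return unigrams, bigrams

def sentence_prob(sent, unigrams, bigrams):
    """P(W) as a product of bigram probabilities P(w | prev) = C(prev, w) / C(prev)."""
    padded = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, w in zip(padded[:-1], padded[1:]):
        if bigrams[(prev, w)] == 0:
            return 0.0                                # unseen bigram: this is why smoothing is needed
        prob *= bigrams[(prev, w)] / unigrams[prev]
    return prob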
Smoothing
Idea: Take some probability away from seen events and assign it to unseen events
Simple method (Laplace): Give every event an a priori count of 1:
P_Lap(X) = (C(X) + 1) / (N + B)
where X is any entity, N is the total number of observations, and B is the number of entity types
Problem: Assigns too much probability to new events
The more event types there are, the worse this becomes
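A minimal sketch of the Laplace estimate; counts is a Counter over observed events and num_types is B, which must include event types that were never observed:

from collections import Counter

def laplace_prob(counts, x, num_types):
    """P_Lap(X) = (C(X) + 1) / (N + B).

    counts: Counter of observed events; num_types: B, the number of possible
    event types, including types with zero observed count.
    """
    n = sum(counts.values())                 # N: total observed events
    return (counts[x] + 1) / (n + num_types)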
Interpolation
Lidstone: P_Lid(X) = (C(X) + d) / (N + dB), with d < 1
Johnson: P_Lid(X) = μ P_MLE(X) + (1 - μ)(1/B), where μ = N/(N + dB)
(i.e., Lidstone smoothing is a linear interpolation between the MLE and the uniform distribution)
Problems: How to choose d? Doesn't match low-frequency events well
Held-out Estimation
Idea: Estimate how often events will occur in unseen data by using held-out data
Divide the data into training and held-out subsets
C1(X) = frequency of X in the training data
C2(X) = frequency of X in the held-out data
N_r = number of entity types X with C1(X) = r
T_r = Σ_{X: C1(X)=r} C2(X)
P_ho(X) = T_r / (N_r N), where C1(X) = r and N is the number of tokens of held-out data
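A minimal sketch of the held-out estimate; train_counts and heldout_counts are Counters over the same kinds of events, and types unseen in both are not enumerated here, so the r = 0 case is only approximate:

from collections import Counter

def heldout_prob(x, train_counts, heldout_counts):
    """Held-out estimate P_ho(X) = T_r / (N_r * N), where r = C1(X)."""
    r = train_counts[x]
    types = set(train_counts) | set(heldout_counts)       # event types seen somewhere
    types_r = [t for t in types if train_counts[t] == r]
    n_r = len(types_r)                                     # N_r: types with training count r
    t_r = sum(heldout_counts[t] for t in types_r)          # T_r: their total held-out count
    n = sum(heldout_counts.values())                       # N: size of the held-out data
    return t_r / (n_r * n)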
Deleted Estimation
Generalize to use all the data:
Divide the data into two subsets, a and b:
N_r^a = number of entity types X such that C_a(X) = r
T_r^{ab} = Σ_{X: C_a(X)=r} C_b(X)
Combining both directions: P_del(X) = (T_r^{ab} + T_r^{ba}) / (N (N_r^a + N_r^b)) for an item X seen r times
Good-Turing
For observed items, discount the item count: r* = (r+1) E[N_{r+1}] / E[N_r]
The idea is that the chance of seeing the item one more time is about E[N_{r+1}] / E[N_r]
For unobserved items, the total probability is: E[N_1] / N
So, if we assume a uniform distribution over the N_0 unknown item types, we have: P(X) = E[N_1] / (N_0 N)
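A minimal sketch of the basic Good-Turing adjustment, using the raw counts-of-counts N_r in place of E[N_r] (smoothing E[N_r] is the subject of the next slide):

from collections import Counter

def good_turing_counts(counts):
    """Adjusted counts r* = (r + 1) N_{r+1} / N_r, using raw N_r as E[N_r]."""
    n_r = Counter(counts.values())           # N_r: number of types seen exactly r times
    adjusted = {}
    for x, r in counts.items():
        if n_r[r + 1] > 0:
            adjusted[x] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[x] = r                   # no types seen r+1 times: keep the raw count
    return adjusted

def unseen_mass(counts):
    """Total probability reserved for unseen items: N_1 / N."""
    n = sum(counts.values())
    return Counter(counts.values())[1] / n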
Good-Turing Issues
Has problems with high-frequency items (consider r*_max = (r_max + 1) E[N_{r_max+1}] / E[N_{r_max}] = 0, since nothing was seen more than r_max times)
Usual answers:
Use it only for low-frequency items (r < k)
Smooth E[N_r] with a function S(r)
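One common choice of S(r), following the Gale-Sampson "Simple Good-Turing" idea (an assumption here, not stated above), is a least-squares fit of log N_r against log r:

import math

def fit_s(n_r):
    """Fit log N_r = a + b log r by least squares; returns S(r) = e^a * r^b.

    n_r maps r to the nonzero count-of-counts N_r; needs at least two points.
    """
    rs = sorted(r for r in n_r if r > 0 and n_r[r] > 0)
    xs = [math.log(r) for r in rs]
    ys = [math.log(n_r[r]) for r in rs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda r: math.exp(a + b * math.log(r))   # smoothed stand-in for E[N_r]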
Back-off Models
If the high-order n-gram has insufficient data, use a lower-order n-gram:
P_bo(w_i | w_{i-n+1,i-1}) =
    (1 - d(w_{i-n+1,i-1})) P(w_i | w_{i-n+1,i-1})     if there is enough data
    α(w_{i-n+1,i-1}) P_bo(w_i | w_{i-n+2,i-1})        otherwise
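A minimal sketch of the recursive control flow; for simplicity it uses a fixed back-off weight ("stupid backoff" style) instead of the normalized d(.) and α(.) that Katz back-off derives from discounted counts, so its scores are not true probabilities:

from collections import Counter

def count_ngrams(sentences, max_order=3):
    """Count all n-grams up to max_order; keys are tuples of words."""
    counts = Counter()
    for sent in sentences:
        for order in range(1, max_order + 1):
            for i in range(len(sent) - order + 1):
                counts[tuple(sent[i:i + order])] += 1
    return counts

def backoff_score(word, history, counts, alpha=0.4):
    """Score the n-gram (history + word), backing off when it has no data.

    alpha is a fixed back-off weight ("stupid backoff"), not the normalized
    alpha(.) of Katz back-off, so the scores are not true probabilities.
    """
    history = tuple(history)
    if history:
        ngram = history + (word,)
        if counts[ngram] > 0:
            return counts[ngram] / counts[history]                       # enough data at this order
        return alpha * backoff_score(word, history[1:], counts, alpha)   # otherwise back off
    total = sum(c for key, c in counts.items() if len(key) == 1)
    return counts[(word,)] / total                                       # unigram base case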
Linear Interpolation
More generally, we can interpolate:
P_int(w_i | h) = Σ_k λ_k(h) P_k(w_i | h)
Interpolation between different orders
Usually set the weights by iterative training (gradient descent, EM algorithm)
Partition histories h into equivalence classes
The weights need to be responsive to the amount of data!
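A minimal sketch of interpolating component models and tuning a single weight vector by EM on held-out data; the P_k are hypothetical callables (e.g., unigram/bigram/trigram estimators), and history-dependent weights λ_k(h) are not modeled here:

def interpolated_prob(word, history, components, lambdas):
    """P_int(w | h) = sum_k lambda_k * P_k(w | h)."""
    return sum(l * p(word, history) for l, p in zip(lambdas, components))

def em_lambdas(heldout, components, n_iters=20):
    """Fit interpolation weights by EM on held-out (word, history) pairs.

    components is a list of callables P_k(word, history); a single weight
    vector is learned, not history-dependent lambda_k(h).
    """
    k = len(components)
    lambdas = [1.0 / k] * k                         # start from uniform weights
    for _ in range(n_iters):
        expected = [0.0] * k
        for word, history in heldout:
            parts = [l * p(word, history) for l, p in zip(lambdas, components)]
            total = sum(parts)
            if total == 0.0:
                continue
            for j in range(k):
                expected[j] += parts[j] / total     # E-step: which model explains this word
        z = sum(expected)
        lambdas = [e / z for e in expected]         # M-step: renormalize
    return lambdas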