

Machine Learning in Natural Language Processing

Fernando Pereira
University of Pennsylvania

NASSLLI, June 2002

Thanks to: William Bialek, John Lafferty, Andrew McCallum, Lillian Lee,
Lawrence Saul, Yves Schabes, Stuart Shieber, Naftali Tishby


Introduction

Why ML in NLP
- Examples are easier to create than rules
- Rule writers miss low-frequency cases
- Many factors involved in language interpretation
- People do it
  - AI
  - Cognitive science
- Let the computer do it
  - Moore's law
  - storage
  - lots of data

Classification
- Document topic: politics, business, national, environment
- Word sense: treasury bonds vs. chemical bonds

Analysis
- Tagging
    VBG     NNS      WDT  VBP  RP  NNS      JJ
    causing symptoms that show up  decades  later
- Parsing
    S(dumped)
      NP-C(workers)             VP(dumped)
        N(workers)   V(dumped)    NP-C(sacks)   PP(into)
        workers      dumped       N(sacks)      P(into)  NP-C(bin)
                                  sacks         into     D(a)  N(bin)
                                                         a     bin

Language Modeling
- Is this a likely English sentence?
    P(colorless green ideas sleep furiously) / P(furiously sleep ideas green colorless) ≈ 2 × 10^5
- Disambiguate noisy transcription
    It's easy to wreck a nice beach
    It's easy to recognize speech

Inference
- Translation
    treasury bonds → obrigações do tesouro
    covalent bonds → ligações covalentes
- Information extraction
    Sara Lee to Buy 30% of DIM
    Chicago, March 3 - Sara Lee Corp [acquirer] said it agreed to buy a 30 percent interest in
    Paris-based DIM S.A. [acquired], a subsidiary of BIC S.A., at cost of about 20 million
    dollars. DIM S.A., a hosiery manufacturer, had sales of about 2 million dollars.
    The investment includes the purchase of 5 million newly issued DIM shares
    valued at about 5 million dollars, and a loan of about 15 million dollars, it said.
    The loan is convertible into an additional 16 million DIM shares, it noted.
    The proposed agreement is subject to approval by the French government, it said.

Machine Learning Approach
- Algorithms that write programs
- Specify
  - Form of output programs
  - Accuracy criterion
- Input: set of training examples
- Output: program that performs as accurately as possible on the training examples
- But will it work on new examples?

Learning Tradeoffs
  [Figure: error vs. program class complexity. The error of the best program in the class decreases with complexity, while error measured by testing on new examples eventually rises again (overfitting); at the extreme the learner merely memorizes the training data (rote learning).]

Fundamental Questions
- Generalization: is the learned program useful on new examples?
  - Statistical learning theory: quantifiable tradeoffs between number of examples, complexity of program class, and generalization error
- Computational tractability: can we find a good program quickly?
  - If not, can we find a good approximation?
- Adaptation: can the program learn quickly from new evidence?
  - Information-theoretic analysis: relationship between adaptation and compression

Machine Learning Methods
- Classifiers
  - Document classification
  - Disambiguation
- Structured models
  - Tagging
  - Parsing
  - Extraction
- Unsupervised learning
  - Generalization
  - Structure induction

Jargon
- Instance: event type of interest
  - Document and its class
  - Sentence and its analysis
  - …
- Supervised learning: learn a classification function from hand-labeled instances
- Unsupervised learning: exploit correlations to organize training instances
- Generalization: how well does it work on unseen data
- Features: map an instance to a set of elementary events

Classification Ideas
- Represent instances by feature vectors
  - Content
  - Context
- Learn function from feature vectors
  - Class
  - Class-probability distribution
- Redundancy is our friend: many weak clues

Structured Model Ideas
- Interdependent decisions
  - Successive parts-of-speech
  - Parsing/generation steps
  - Lexical choice
    - Parsing
    - Translation
- Combining decisions
  - Sequential decisions
  - Generative models
  - Constraint satisfaction

Unsupervised Learning Ideas
- Clustering: class induction
- Latent variables
    I'm thinking of sports → more sporty words
- Distributional regularities
  - Know words by the company they keep
- Data compression
- Infer dependencies among variables: structure learning

Methodological Detour
- Empiricist/information-theoretic view: words combine following their associations in previous material
- Rationalist/generative view: words combine according to a formal grammar in the class of possible natural-language grammars

Chomsky's Challenge to Empiricism
(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.
"… It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not." (Chomsky 57)

The Return of Empiricism
- Empiricist methods work:
  - Markov models can capture a surprising fraction of the unpredictability in language
  - Statistical information retrieval methods beat alternatives
  - Statistical parsers are more accurate than competitors based on rationalist methods
  - Machine-learning, statistical techniques come close to human performance in part-of-speech tagging and sense disambiguation
- Just engineering tricks?

Unseen Events
- Chomsky's implicit assumption: any model must assign zero probability to unseen events
  - naïve estimation of Markov model probabilities from frequencies
  - no latent (hidden) events
- Any such model overfits the data: many events are likely to be missing in any finite sample
- ∴ The learned model cannot generalize to unseen data
- ∴ Support for poverty-of-the-stimulus arguments

The Science of Modeling
- Probability estimates can be smoothed to accommodate unseen events
- Redundancy in language supports effective statistical inference procedures
  ∴ the stimulus is richer than it might seem
- Statistical learning theory: the generalization ability of a model class can be measured independently of model representation
- Beyond Markov models: effects of latent conditioning variables can be estimated from data

Richness of the Stimulus
- Information about: mutual information
  - between linguistic and non-linguistic events
  - between parts of a linguistic event
- Global coherence:
    banks can now sell stocks and bonds
- Word statistics carry more information than it might seem
  - Markov models in speech recognition
  - Success of bag-of-words models in information retrieval
  - Statistical machine translation
- How far can these methods go?

Questions
- Generative or discriminative?
- Structured models: local classification or global constraint satisfaction?
- Does unsupervised learning help?

Classification

Generative or Discriminative?
- Generative models
  - Estimate the instance-label distribution p(x, y)
- Discriminative models
  - Estimate the label-given-instance distribution p(y | x)
  - Minimize an upper bound on the training error Σ_i [[f(x_i) ≠ y_i]]

Simple Generative Model
- Binary naïve Bayes:
  - Represent instances by sets of binary features
    - Does word occur in document
    - …
  - Finite predefined set of classes
      P(F_1, …, F_n, C) = P(C) ∏_i P(F_i | C)
      P(c | F_1, …, F_n) ∝ P(c) ∏_i P(F_i | c)
  [Graphical model: class C with arrows to features F_1 F_2 F_3 F_4]

Generative Claims
- Easy to train: just count
- Language modeling: probability of observed forms
- More robust
  - Small training sets
  - Label noise
- Full advantage of probabilistic methods
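To make the counting concrete, here is a minimal Python sketch of binary naïve Bayes (my illustration, not code from the tutorial; the toy vocabulary, documents and add-alpha smoothing are assumptions):

    import math
    from collections import defaultdict

    def train_naive_bayes(docs, labels, vocab, alpha=1.0):
        # Estimate P(c) and P(F_t = 1 | c) by counting, with add-alpha smoothing.
        class_count = defaultdict(float)
        feat_count = defaultdict(lambda: defaultdict(float))
        for words, c in zip(docs, labels):
            class_count[c] += 1
            present = set(words)
            for t in vocab:
                if t in present:
                    feat_count[c][t] += 1
        n = len(docs)
        prior = {c: class_count[c] / n for c in class_count}
        cond = {c: {t: (feat_count[c][t] + alpha) / (class_count[c] + 2 * alpha)
                    for t in vocab}
                for c in class_count}
        return prior, cond

    def classify(words, prior, cond, vocab):
        # P(c | F_1,...,F_n) is proportional to P(c) * prod_t P(F_t | c); take the argmax.
        present = set(words)
        scores = {}
        for c in prior:
            s = math.log(prior[c])
            for t in vocab:
                p = cond[c][t]
                s += math.log(p if t in present else 1.0 - p)
            scores[c] = s
        return max(scores, key=scores.get)

    # toy usage
    docs = [["bond", "treasury"], ["bond", "chemical"], ["stock", "treasury"]]
    labels = ["finance", "chemistry", "finance"]
    vocab = {"bond", "treasury", "chemical", "stock"}
    prior, cond = train_naive_bayes(docs, labels, vocab)
    print(classify(["treasury", "bond"], prior, cond, vocab))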

Discriminative Models
- Define a functional form for p(y | x; θ)
- Binary classification: define a discriminant function
    y = sign h(x; θ)
- Adjust parameter(s) θ to maximize the probability of training labels / minimize error

Simple Discriminative Forms
- Linear discriminant function
    h(x; θ_0, θ_1, …, θ_n) = θ_0 + Σ_i θ_i f_i(x)
- Logistic form:
    P(+1 | x) = 1 / (1 + exp −h(x; θ))
- Multi-class exponential form (maxent):
    h(x, y; θ_0, θ_1, …, θ_n) = θ_0 + Σ_i θ_i f_i(x, y)
    P(y | x; θ) = exp h(x, y; θ) / Σ_y′ exp h(x, y′; θ)
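A minimal Python sketch of the two discriminative forms above (illustrative only; the feature values and parameters are made up):

    import math

    def h(x, theta0, theta):
        # Linear discriminant h(x; theta) = theta_0 + sum_i theta_i f_i(x).
        return theta0 + sum(t * f for t, f in zip(theta, x))

    def p_plus(x, theta0, theta):
        # Logistic form P(+1 | x) = 1 / (1 + exp(-h(x; theta))).
        return 1.0 / (1.0 + math.exp(-h(x, theta0, theta)))

    def p_y_given_x(x, thetas_by_class):
        # Multi-class exponential (maxent) form: normalize exp h(x, y; theta) over y.
        scores = {y: math.exp(h(x, t0, t)) for y, (t0, t) in thetas_by_class.items()}
        z = sum(scores.values())
        return {y: s / z for y, s in scores.items()}

    x = [1.0, 0.0, 2.0]                        # feature values f(x)
    print(p_plus(x, -0.5, [0.3, -0.2, 0.1]))
    print(p_y_given_x(x, {"a": (0.0, [0.3, -0.2, 0.1]),
                          "b": (0.1, [-0.1, 0.4, 0.0])}))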

Classification Tasks
- Document categorization
  - News categorization
  - Message filtering
  - Web page selection
- Tagging
  - Named entity
  - Part-of-speech
  - Sense disambiguation
- Syntactic decisions
  - Attachment

Discriminative Claims
- Focus modeling resources on the instance-to-label mapping
- Avoid restrictive probabilistic assumptions on the instance distribution
- Optimize what you care about
- Higher accuracy

Document Models
- Binary vector
    f_t(d) ≡ [t ∈ d]
- Frequency vector
    tf(d, t) = number of occurrences of t in d,   idf(t) = |D| / |{d ∈ D : t ∈ d}|
    raw frequency: r_t(d) = tf(d, t)
    TF*IDF: x_t(d) = log(1 + tf(d, t)) log(1 + idf(t))
- N-gram language model
    p(d | c) = p(|d| | c) ∏_{i=1}^{|d|} p(d_i | d_1 … d_{i−1}; c)
    p(d_i | d_1 … d_{i−1}; c) ≈ p(d_i | d_{i−n} … d_{i−1}; c)

Term Weighting and Feature Selection
- Select or weigh the most informative features
- TF*IDF: adjust term weight by how document-specific the term is
- Feature selection:
  - Remove low, unreliable counts
  - Mutual information
  - Information gain
  - Other statistics
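The TF*IDF weighting above, x_t(d) = log(1 + tf(d, t)) log(1 + idf(t)), can be computed directly from counts. A small sketch with toy documents of my own:

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        # x_t(d) = log(1 + tf(d, t)) * log(1 + idf(t)),
        # with idf(t) = |D| / |{d in D : t in d}| as defined above.
        n_docs = len(docs)
        df = Counter()
        for d in docs:
            df.update(set(d))
        vectors = []
        for d in docs:
            tf = Counter(d)
            vectors.append({t: math.log(1 + tf[t]) * math.log(1 + n_docs / df[t])
                            for t in tf})
        return vectors

    docs = [["treasury", "bonds", "bonds"],
            ["chemical", "bonds"],
            ["treasury", "secretary"]]
    for v in tfidf_vectors(docs):
        print(v)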

Documents vs. Vectors (1)
- Many documents have the same binary or frequency vector
- Document multiplicity must be handled correctly in probability models
- Binary naïve Bayes
    p(f | c) = ∏_t [f_t p(t | c) + (1 − f_t)(1 − p(t | c))]
  - Multiplicity is not recoverable

Documents vs. Vectors (2)
- Document probability (unigram language model)
    p(d | c) = p(|d| | c) ∏_{i=1}^{|d|} p(d_i | c)
- Raw frequency vector probability
    r_t = tf(d, t)
    p(r | c) = p(L | c) L! ∏_t p(t | c)^{r_t} / r_t!   where L = Σ_t r_t

Documents vs. Vectors (3)
- Unigram model:
    p(c | d) = p(c) p(|d| | c) ∏_{i=1}^{|d|} p(d_i | c) / Σ_c′ p(c′) p(|d| | c′) ∏_{i=1}^{|d|} p(d_i | c′)
- Vector model:
    p(c | r) = p(c) p(L | c) ∏_t p(t | c)^{r_t} / Σ_c′ p(c′) p(L | c′) ∏_t p(t | c′)^{r_t}

Linear Classifiers
- Embedding into a high-dimensional vector space
- Geometric intuitions and techniques
- Easier separability
  - Increase dimension with interaction terms
  - Nonlinear embeddings (kernels)
- Swiss Army knife

Kinds of Linear Classifiers
- Naïve Bayes
- Exponential models
- Large margin classifiers
  - Support vector machines (SVM)
  - Boosting
- Online methods
  - Perceptron
  - Winnow

Learning Linear Classifiers
- Rocchio
    w_k = max(0, Σ_{x∈c} x_k / |c| − Σ_{x∉c} x_k / (|D| − |c|))
- Widrow-Hoff
    w ← w − 2η(w·x_i − y_i) x_i
- (Balanced) winnow
    y = sign(w₊·x − w₋·x − θ)
    positive error: w₊ ← αw₊, w₋ ← βw₋,   α > 1 > β > 0
    negative error: w₊ ← βw₊, w₋ ← αw₋


Linear Classification
- Linear discriminant function
    h(x) = w·x + b = Σ_k w_k x_k + b
  [Figure: positive (x) and negative (o) points separated by a hyperplane with normal w and offset b]

Margin
- Instance margin
    γ_i = y_i (w·x_i + b)
- Normalized (geometric) margin
    γ_i = y_i ((w/‖w‖)·x_i + b/‖w‖)
- Training set margin γ
  [Figure: geometric margins γ_i, γ_j of individual points x_i, x_j from the separating hyperplane]

Perceptron Algorithm
- Given
  - Linearly separable training set S
  - Learning rate η > 0

    w ← 0; b ← 0; R = max_i ‖x_i‖
    repeat
      for i = 1…N
        if y_i (w·x_i + b) ≤ 0
          w ← w + η y_i x_i
          b ← b + η y_i R²
    until there are no mistakes

Duality
- The final hypothesis is a linear combination of training points
    w = Σ_i α_i y_i x_i,   α_i ≥ 0
- Dual perceptron algorithm

    α ← 0; b ← 0; R = max_i ‖x_i‖
    repeat
      for i = 1…N
        if y_i (Σ_j α_j y_j x_j·x_i + b) ≤ 0
          α_i ← α_i + 1
          b ← b + y_i R²
    until there are no mistakes
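A direct transcription of the primal perceptron pseudocode above into Python (the toy data are mine; the bias update uses R² exactly as on the slide):

    def perceptron(xs, ys, eta=1.0, max_epochs=100):
        # On each mistake: w <- w + eta * y * x and b <- b + eta * y * R^2.
        dim = len(xs[0])
        w = [0.0] * dim
        b = 0.0
        R2 = max(sum(xi * xi for xi in x) for x in xs)
        for _ in range(max_epochs):
            mistakes = 0
            for x, y in zip(xs, ys):
                if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                    w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                    b += eta * y * R2
                    mistakes += 1
            if mistakes == 0:      # converged: a full pass with no mistakes
                break
        return w, b

    # toy linearly separable data
    xs = [(2.0, 1.0), (1.0, 3.0), (-1.0, -1.0), (-2.0, -1.5)]
    ys = [1, 1, -1, -1]
    print(perceptron(xs, ys))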

Why Maximize the Margin?
- There is a constant c such that for any data distribution D with support in a ball of radius R and any training sample S of size N drawn from D,
    P( err(h) ≤ (c/N) ((R²/γ²) log² N + log(1/δ)) ) ≥ 1 − δ
  where γ is the margin of h on S

Canonical Hyperplanes
- Multiple representations for the same hyperplane: (λw, λb), λ > 0
- Canonical hyperplane: functional margin = 1
- Geometric margin for a canonical hyperplane
    γ = ½ ((w/‖w‖)·x₊ − (w/‖w‖)·x₋)
      = (w·x₊ − w·x₋) / (2‖w‖)
      = 1/‖w‖

Convex Optimization (1)
- Constrained optimization problem:
    min_{w ∈ Ω ⊆ ℝⁿ} f(w)
    subject to g_i(w) ≤ 0, h_j(w) = 0
- Lagrangian function:
    L(w, α, β) = f(w) + Σ_i α_i g_i(w) + Σ_j β_j h_j(w)
- Dual problem:
    max_{α,β} inf_{w ∈ Ω} L(w, α, β)
    subject to α_i ≥ 0

Convex Optimization (2)
- Kuhn-Tucker conditions:
  - f convex
  - g_i, h_j affine (h(w) = Aw − b)
- The solution w*, α*, β* must satisfy:
    ∂L(w*, α*, β*)/∂w = 0
    ∂L(w*, α*, β*)/∂β = 0
    α_i* g_i(w*) = 0   (complementarity: a parameter is nonzero iff its constraint is active)
    g_i(w*) ≤ 0
    α_i* ≥ 0

Maximizing the Margin (1)
- Given a separable training sample:
    min_{w,b} ‖w‖² = w·w   subject to y_i (w·x_i + b) ≥ 1
- Lagrangian:
    L(w, b, α) = ½ w·w − Σ_i α_i [y_i (w·x_i + b) − 1]
    ∂L(w, b, α)/∂w = w − Σ_i y_i α_i x_i = 0
    ∂L(w, b, α)/∂b = Σ_i y_i α_i = 0

Maximizing the Margin (2)
- Dual Lagrangian at the stationary point:
    W(α) = L(w*, b*, α) = Σ_i α_i − ½ Σ_{i,j} y_i y_j α_i α_j x_i·x_j
- Dual maximization problem:
    max_α W(α)   subject to α_i ≥ 0, Σ_i y_i α_i = 0
- Maximum margin weight vector:
    w* = Σ_i y_i α_i* x_i   with margin γ = 1/‖w*‖ = (Σ_{i∈sv} α_i*)^{−1/2}

Building the Classifier
- Computing the offset (from the primal constraints):
    b* = −(max_{y_i=−1} w*·x_i + min_{y_i=+1} w*·x_i) / 2
- Decision function:
    h(x) = sgn(Σ_i y_i α_i* x_i·x + b*)

Consequences
- The complementarity condition yields the support vectors:
    α_i [y_i (w*·x_i + b) − 1] = 0
    α_i > 0 ⇒ w*·x_i + b = y_i
- A functional margin of 1 implies a minimum geometric margin
    γ = 1/‖w*‖

General SVM Form
- Margin maximization for an arbitrary kernel K:
    max_α Σ_i α_i − ½ Σ_{i,j} y_i y_j α_i α_j K(x_i, x_j)
    subject to α_i ≥ 0, Σ_i y_i α_i = 0
- Decision rule
    h(x) = sgn(Σ_i y_i α_i* K(x_i, x) + b*)

Soft Margin
- Handles the non-separable case
- Primal problem (2-norm):
    min_{w,b,ξ} w·w + C Σ_i ξ_i²
    subject to y_i (w·x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0
- Dual problem:
    max_α Σ_i α_i − ½ Σ_{i,j} y_i y_j α_i α_j (x_i·x_j + (1/C) δ_ij)
    subject to α_i ≥ 0, Σ_i y_i α_i = 0
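A sketch of the kernelized decision rule h(x) = sgn(Σ_i y_i α_i K(x_i, x) + b*). The α_i and b would come from solving the dual problem above with a QP solver, which is not shown; the Gaussian kernel and the toy values below are assumptions for illustration:

    import math

    def rbf_kernel(u, v, gamma=1.0):
        # Gaussian kernel; any positive-definite K fits the general SVM form.
        return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

    def svm_decision(x, support_xs, support_ys, alphas, b, kernel=rbf_kernel):
        # h(x) = sgn( sum_i y_i alpha_i K(x_i, x) + b ); alphas, b assumed given by a QP solver.
        s = sum(a * y * kernel(xi, x)
                for a, y, xi in zip(alphas, support_ys, support_xs)) + b
        return 1 if s >= 0 else -1

    # toy values standing in for a solved dual problem
    support_xs = [(1.0, 1.0), (-1.0, -1.0)]
    support_ys = [1, -1]
    alphas = [0.7, 0.7]
    b = 0.0
    print(svm_decision((0.8, 0.9), support_xs, support_ys, alphas, b))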

Conditional Maxent Model
- Model form
    p(y | x; Λ) = exp(Σ_k λ_k f_k(x, y)) / Z(x; Λ)
    Z(x; Λ) = Σ_y exp(Σ_k λ_k f_k(x, y))
- Useful properties
  - Multi-class
  - May use different features for different classes
  - Training is convex optimization

Duality
- Maximize the conditional log-likelihood
    Λ̃ = argmax_Λ Σ_i log p(y_i | x_i; Λ)
- Maximizing the conditional entropy
    p̃ = argmax_p −Σ_i Σ_y p(y | x_i) log p(y | x_i)
  subject to the constraints
    Σ_i f_k(x_i, y_i) = Σ_i Σ_y p(y | x_i) f_k(x_i, y)
  yields
    p̃(y | x) = p(y | x; Λ̃)
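A sketch of the conditional maxent form and of the conditional log-likelihood it is trained to maximize (the feature functions, classes and data below are toy assumptions, not the tutorial's):

    import math

    def p_y_given_x(x, lambdas, feats, classes):
        # p(y | x; Lambda) = exp(sum_k lambda_k f_k(x, y)) / Z(x; Lambda)
        scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, feats)))
                  for y in classes}
        z = sum(scores.values())
        return {y: s / z for y, s in scores.items()}

    def conditional_log_likelihood(data, lambdas, feats, classes):
        # Training objective: sum_i log p(y_i | x_i; Lambda) (convex in Lambda).
        return sum(math.log(p_y_given_x(x, lambdas, feats, classes)[y])
                   for x, y in data)

    # toy features f_k(x, y): each fires only for a particular (word, class) pair
    feats = [lambda x, y: 1.0 if ("treasury" in x and y == "finance") else 0.0,
             lambda x, y: 1.0 if ("chemical" in x and y == "chemistry") else 0.0]
    classes = ["finance", "chemistry"]
    data = [({"treasury", "bond"}, "finance"), ({"chemical", "bond"}, "chemistry")]
    print(conditional_log_likelihood(data, [1.0, 2.0], feats, classes))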

Relationship to (Binary) Logistic Discrimination
    p(+1 | x) = exp(Σ_k λ_k f_k(x, +1)) / [exp(Σ_k λ_k f_k(x, +1)) + exp(Σ_k λ_k f_k(x, −1))]
              = 1 / (1 + exp(−Σ_k λ_k (f_k(x, +1) − f_k(x, −1))))
              = 1 / (1 + exp(−Σ_k λ_k g_k(x)))

Relationship to Linear Discrimination
- Decision rule
    sign(log p(+1 | x)/p(−1 | x)) = sign Σ_k λ_k g_k(x)
- Bias term: parameter for an "always on" feature
- Question: relationship to other trainers for linear discriminant functions

Solution Techniques (1)
- Generalized iterative scaling (GIS)
- Parameter updates
    λ_k ← λ_k + (1/C) log [ Σ_i f_k(x_i, y_i) / Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) ]
- Requires that the features add up to a constant independent of instance or label (add a slack feature):
    Σ_k f_k(x_i, y) = C   ∀ i, y

Solution Techniques (2)
- Improved iterative scaling (IIS)
- Parameter updates λ_k ← λ_k + δ_k, where δ_k solves
    Σ_i f_k(x_i, y_i) = Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)}
    f#(x, y) = Σ_k f_k(x, y)
- For binary features this reduces to solving a polynomial with positive coefficients
- Reduces to GIS if the feature sum is constant
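A sketch of one GIS sweep over the parameters. It assumes the feature sums are a constant C, which the toy problem below enforces with an explicit slack feature; everything here is illustrative, not the tutorial's code:

    import math

    def model_probs(x, lambdas, feats, classes):
        # p(y | x; Lambda) in the exponential form used above.
        scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, feats)))
                  for y in classes}
        z = sum(scores.values())
        return {y: s / z for y, s in scores.items()}

    def gis_step(data, lambdas, feats, classes, C):
        # lambda_k <- lambda_k + (1/C) log(observed_k / expected_k)
        observed = [sum(f(x, y) for x, y in data) for f in feats]
        expected = [0.0] * len(feats)
        for x, _ in data:
            p = model_probs(x, lambdas, feats, classes)
            for k, f in enumerate(feats):
                expected[k] += sum(p[y] * f(x, y) for y in classes)
        return [l + (1.0 / C) * math.log(o / e) if o > 0 and e > 0 else l
                for l, o, e in zip(lambdas, observed, expected)]

    # toy problem: two indicator features plus a slack feature so that sums equal C = 2
    feats = [lambda x, y: 1.0 if (x == "a" and y == "+") else 0.0,
             lambda x, y: 1.0 if (x == "b" and y == "-") else 0.0]
    feats.append(lambda x, y: 2.0 - feats[0](x, y) - feats[1](x, y))   # slack feature
    classes = ["+", "-"]
    data = [("a", "+"), ("a", "+"), ("b", "-")]
    lambdas = [0.0, 0.0, 0.0]
    for _ in range(100):
        lambdas = gis_step(data, lambdas, feats, classes, C=2.0)
    print(lambdas)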

Deriving IIS (1)
- Conditional log-likelihood
    l(Λ) = Σ_i log p(y_i | x_i; Λ)
- Log-likelihood update
    l(Λ+Δ) − l(Λ) = Σ_i [Δ·f(x_i, y_i) − log (Z(x_i; Λ+Δ) / Z(x_i; Λ))]
                  = Σ_i Δ·f(x_i, y_i) − Σ_i log Σ_y e^{(Λ+Δ)·f(x_i, y)} / Z(x_i; Λ)
                  = Σ_i Δ·f(x_i, y_i) − Σ_i log Σ_y p(y | x_i; Λ) e^{Δ·f(x_i, y)}
    (using log x ≤ x − 1)
                  ≥ Σ_i Δ·f(x_i, y_i) + N − Σ_i Σ_y p(y | x_i; Λ) e^{Δ·f(x_i, y)} ≡ A(Δ)

Deriving IIS (2)
- By Jensen's inequality:
    A(Δ) ≥ Σ_i Δ·f(x_i, y_i) + N − Σ_i Σ_y p(y | x_i; Λ) Σ_k (f_k(x_i, y) / f#(x_i, y)) e^{δ_k f#(x_i, y)} ≡ B(Δ)
    ∂B(Δ)/∂δ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)}
- Maximize this lower bound on the update

Solution Techniques (3)
- GIS is very slow if the slack variable takes large values
- IIS is faster, but still problematic
- Recent suggestion: use standard convex optimization techniques
  - E.g. conjugate gradient
  - Some evidence of faster convergence

Gaussian Prior
- Log-likelihood gradient
    ∂l(Λ)/∂λ_k = Σ_i f_k(x_i, y_i) − Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) − λ_k/σ_k²
- Modified IIS update: λ_k ← λ_k + δ_k, where δ_k solves
    Σ_i f_k(x_i, y_i) = Σ_i Σ_y p(y | x_i; Λ) f_k(x_i, y) e^{δ_k f#(x_i, y)} + (λ_k + δ_k)/σ_k²
    f#(x, y) = Σ_k f_k(x, y)

Instance Representation
- Fixed-size instance (PP attachment): binary features
  - Word identity
  - Word class
- Variable-size instance (document classification)
  - Word identity
  - Word relative frequency in document

Enriching Features
- Word n-grams
- Sparse word n-grams
- Character n-grams (noisy transcriptions: speech, OCR)
- Unknown word features: suffixes, capitalization
- Feature combinations (cf. n-grams)

Structured Models: Finite State

"I understood each and every word you said but not the order in which they appeared."

Structured Model Applications
- Language modeling
- Story segmentation
- POS tagging
- Information extraction (IE)
- (Shallow) parsing

Structured Models
- Assign a labeling to a sequence
  - Story segmentation
  - POS tagging
  - Named entity extraction
  - (Shallow) parsing

Constraint Satisfaction in Structured Models
- Train to minimize labeling loss
    θ̂ = argmin_θ Σ_i Loss(x_i, y_i | θ)
- Computing the best labeling:
    argmin_y Loss(x, y | θ̂)
- Efficient minimization requires:
  - A common currency for local labeling decisions
  - An efficient algorithm to combine the decisions

Local Classification Models
- Train to minimize the per-decision loss in context
    θ̂ = argmin_θ Σ_i Σ_{0≤j<|x_i|} loss(y_{i,j} | x_i, y_i^{(j)}; θ)
- Apply by guessing the context and finding each lowest-loss label:
    argmin_{y_j} loss(y_j | x, ŷ^{(j)}; θ̂)

Structured Model Claims
- Constraint satisfaction
  - Principled
  - Probabilistic interpretation allows model composition
  - Efficient optimal decoding
- Local classification
  - Wider range of models
  - More efficient training
  - Heuristic decoding comparable to pruning in global models

Example: Language Modeling
- n-gram (order n−1 Markov) model:
    P(w_k | w_1 … w_{k−1}) ≈ P(w_k | w_{k−n+1} … w_{k−1})
- Example: character n-grams of increasing order (Shannon, Lucky):
    0  XFOML RXKHRJFFJUJ ZLPWCFWKCYJ
    1  OCRO HLI RGWR NMIELWIS EU LL
    2  ON IE ANTSOUTINYS ARE T INCTORE ST
    3  IN NO IST LAT WHEY CRATICT FROURE
    4  ABOVERY UPONDULTS WELL THE CODERST
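A minimal character-bigram (order-1 Markov) model in the spirit of the examples above; the add-alpha smoothing and the toy corpus are my choices, not something from the slides:

    import math
    from collections import Counter

    def train_bigram(text):
        # Counts for estimating P(c_k | c_{k-1}).
        return Counter(zip(text, text[1:])), Counter(text[:-1])

    def logprob(text, bigrams, unigrams, vocab_size, alpha=0.5):
        # sum_k log P(c_k | c_{k-1}) with add-alpha smoothing.
        lp = 0.0
        for prev, cur in zip(text, text[1:]):
            num = bigrams[(prev, cur)] + alpha
            den = unigrams[prev] + alpha * vocab_size
            lp += math.log(num / den)
        return lp

    corpus = "colorless green ideas sleep furiously " * 20
    bigrams, unigrams = train_bigram(corpus)
    V = len(set(corpus))
    # compare the model's scores for two character strings
    print(logprob("green ideas sleep", bigrams, unigrams, V))
    print(logprob("sdaei neprsg lee", bigrams, unigrams, V))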

Markov's Unreasonable Effectiveness
- Entropy estimates for English
    model                                bits/char
    compress                             4.43
    word trigrams (Brown et al. 92)      1.75
    human prediction (Cover & King 78)   1.34
- Local word relations dominate the statistics (Jelinek): ranked trigram-model predictions at successive sentence positions
    1    The are to know the issues necessary role
    2    This will have this problems data thing
    2    One the understand these the information that
    7    Please need use problem people point
    9    We insert all tools issues
    98   resolve old
    1641 important

Limits of Markov Models
- No dependency model
- Likelihoods based on sequencing, not dependency

Unseen Events (1)
- What's the probability of unseen events?
- Bias forces nonzero probabilities for some unseen events
- Typical bias: tie the probabilities of related events
  - specific unseen event → general seen event
      eat pineapple → eat _
  - event decomposition: event → event1 ∧ event2
      eat pineapple → eat _ ∧ _ pineapple
  - Factoring via latent variables:
      P(eat | pineapple) ≈ Σ_C P(eat | C) × P(C | pineapple)

Unseen Events (2)
- Discount the estimates for seen events
- Use the leftover mass for unseen events
- How to allocate the leftover?
  - Back off from an unseen event to less specific seen events: n-gram to (n−1)-gram
  - Hypothesize a hidden cause for unseen events: latent variable model
  - Relate an unseen event to distributionally similar seen events
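A sketch of the "discount and back off" idea above, as absolute discounting interpolated with a unigram back-off distribution (the discount constant and toy data are my assumptions, not a method given on the slides):

    from collections import Counter

    def backoff_bigram(tokens, discount=0.5):
        # P(w | v) = max(c(v,w) - d, 0)/c(v) + lambda(v) * P_unigram(w),
        # where lambda(v) = d * (number of distinct continuations of v) / c(v).
        bigrams = Counter(zip(tokens, tokens[1:]))
        unigram = Counter(tokens)
        total = sum(unigram.values())
        context = Counter(tokens[:-1])
        continuations = Counter(v for (v, _) in bigrams)   # distinct w types following v

        def prob(v, w):
            p_uni = unigram[w] / total
            if context[v] == 0:
                return p_uni                     # unseen context: back off completely
            discounted = max(bigrams[(v, w)] - discount, 0.0) / context[v]
            leftover = discount * continuations[v] / context[v]
            return discounted + leftover * p_uni

        return prob

    tokens = "eat fish eat rice eat fish drink water".split()
    p = backoff_bigram(tokens)
    print(p("eat", "fish"))    # seen bigram: mostly the discounted count
    print(p("eat", "water"))   # unseen bigram: leftover mass times the unigram estimate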

Important Detour: Expectation-Maximization (EM)

Latent Variable Models
- Latent (hidden) variable models
    p(y, x | Λ) = Σ_z p(y, x, z | Λ)
- Examples:
  - Mixture models
  - Class-based models (hidden classes)
  - Hidden Markov models

Maximizing Likelihood
- Data log-likelihood
    D = {(x_1, y_1), …, (x_N, y_N)}
    L(D | Λ) = Σ_i log p(x_i, y_i | Λ) = Σ_{x,y} p̃(x, y) log p(x, y | Λ)
    p̃(x, y) = |{i : x_i = x, y_i = y}| / N
- Find the parameters that maximize the (log-)likelihood
    Λ̂ = argmax_Λ Σ_{x,y} p̃(x, y) log p(x, y | Λ)

Convenient Lower Bounds (1)
- Convex function
  [Figure: for a convex f, the chord lies above the curve: α f(x_0) + (1 − α) f(x_1) ≥ f(α x_0 + (1 − α) x_1)]
- Jensen's inequality
    f(Σ_x p(x) x) ≤ Σ_x p(x) f(x)   if f is convex and p is a probability density

Convenient Lower Bounds (2)
  [Figure: the log-likelihood L(D | λ) and a sequence of lower bounds; each iteration maximizes the current lower bound (E and M steps) and thus increases the likelihood]

Auxiliary Function
- Find a convenient non-negative function that lower-bounds the likelihood increase
    L(D | Λ′) − L(D | Λ) ≥ Q(Λ′, Λ) ≥ 0
- Maximize the lower bound:
    Λ_{i+1} = argmax_{Λ′} Q(Λ′, Λ_i)

Comments
- The likelihood keeps increasing, but
  - Can get stuck in a local maximum (or saddle point!)
  - Can oscillate between different local maxima with the same log-likelihood
- If maximizing the auxiliary function is too hard, find some Λ that increases the likelihood: generalized EM (GEM)
- The sum over hidden variable values can be exponential if not done carefully (sometimes not possible)

Example: Mixture Model
- Base distributions
    p_i(y),   1 ≤ i ≤ m
- Mixture coefficients
    λ_i ≥ 0,   Σ_{i=1}^m λ_i = 1
- Mixture distribution
    p(y | Λ) = Σ_i λ_i p_i(y)

Auxiliary Quantities
- Mixture coefficient i = prior probability of being in class i
- Joint probability
    p(c, y | Λ) = λ_c p_c(y)
- Auxiliary function
    Q(Λ′, Λ) = Σ_y p̃(y) Σ_c p(c | y, Λ) log [p(y, c | Λ′) / p(y, c | Λ)]

Solution
- E step:
    C_i = Σ_y p̃(y) λ_i p_i(y) / Σ_j λ_j p_j(y)
- M step:
    λ_i ← C_i / Σ_j C_j
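The E and M steps above, specialized to re-estimating only the mixing coefficients of fixed base distributions; a minimal sketch with made-up base distributions and data:

    from collections import Counter

    def em_mixture(samples, base, n_iter=50):
        # E step: C_i = sum_y count(y) * lam_i p_i(y) / sum_j lam_j p_j(y)
        # M step: lam_i <- C_i / sum_j C_j
        counts = Counter(samples)
        m = len(base)
        lam = [1.0 / m] * m
        for _ in range(n_iter):
            C = [0.0] * m
            for y, n in counts.items():
                denom = sum(lam[j] * base[j].get(y, 0.0) for j in range(m))
                for i in range(m):
                    C[i] += n * lam[i] * base[i].get(y, 0.0) / denom
            total = sum(C)
            lam = [c / total for c in C]
        return lam

    # two fixed base distributions over the symbols 'a' and 'b'
    base = [{'a': 0.9, 'b': 0.1}, {'a': 0.2, 'b': 0.8}]
    samples = ['a'] * 60 + ['b'] * 40
    print(em_mixture(samples, base))   # estimated mixture weights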

More Finite-State Models

Example: Information Extraction
- Given: the types of entities and relationships we are interested in
  - People, places, organizations, dates, amounts, materials, processes, …
  - Employed by, located in, used for, arrived when, …
- Find all entities and relationships of the given types in the source material
- Collect them in a suitable database

IE Example
    Nance, who is also a paid consultant to ABC News, said …
    [person: "Nance"; person-descriptor: "who is also a paid consultant to ABC News"; organization: "ABC News"; employee relation between the person and the organization; co-reference relation between the person and the descriptor]
- Rely on:
  - Syntactic structure
  - Phrase classification

IE Methods
- Partial matching:
  - Hand-built patterns
  - Automatically-trained hidden Markov models
  - Cascaded finite-state transducers
- Parsing-based:
  - Parse the whole text:
    - Shallow parser (chunking)
    - Automatically-induced grammar
  - Classify phrases and phrase relations as the desired entities and relationships

Global Constraint Models
- Train to minimize labeling loss
    θ̂ = argmin_θ Σ_i Loss(x_i, y_i | θ)
- Computing the best labeling:
    argmin_y Loss(x, y | θ̂)
- Efficient minimization requires:
  - A common currency for local labeling decisions
  - A dynamic programming algorithm to combine the decisions

Local Classification Models
- Train to minimize the per-symbol loss in context
    θ̂ = argmin_θ Σ_i Σ_{0≤j<|x_i|} loss(y_{i,j} | x_i, y_i^{(j)}; θ)
- Apply by guessing the context and finding each lowest-loss label:
    argmin_{y_j} loss(y_j | x, ŷ^{(j)}; θ̂)

Structured Model Claims
- Global constraint models
  - Principled
  - Probabilistic interpretation allows model composition
  - Efficient optimal decoding
- Local classifiers
  - Wider range of models
  - More efficient training
  - Heuristic decoding comparable to pruning in global models

Generative vs. Discriminative
- Hidden Markov models (HMMs): generative, global
- Conditional exponential models (MEMMs, CRFs): discriminative, global
- Boosting, winnow: discriminative, local

Generative Models
- Stochastic process that generates instance-label pairs
  - Process structure
  - Process parameters
- (Hypothesize structure)
- Estimate parameters from training data

Model Structure
- Decompose the generation of instances into elementary steps
- Define dependencies between steps
- Parameterize the dependencies
- Useful descriptive language: graphical models

Binary Naïve Bayes
- Represent instances by sets of binary features
  - Does word occur in document
  - …
- Finite predefined set of classes
    P(F_1, …, F_n, C) = P(C) ∏_i P(F_i | C)
    P(c | F_1, …, F_n) ∝ P(c) ∏_i P(F_i | c)

Discrete Hidden Markov Model
- Instances: symbol sequences
- Labels: class sequences
  [Graphical model: a chain C_0 → C_1 → C_2 → …, with each C_i emitting X_i]
    P(X, C) = P(C_0) P(X_0 | C_0) ∏_i P(C_i | C_{i−1}) P(X_i | C_i)
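A sketch of the factorization on the slide and of the forward recursion that sums it over all state sequences; the states, symbols and probabilities below are toy assumptions:

    def hmm_joint(states, symbols, pi, A, B):
        # P(X, C) = P(C_0) P(X_0 | C_0) * prod_i P(C_i | C_{i-1}) P(X_i | C_i)
        p = pi[states[0]] * B[states[0]][symbols[0]]
        for prev, cur, x in zip(states, states[1:], symbols[1:]):
            p *= A[prev][cur] * B[cur][x]
        return p

    def hmm_forward(symbols, state_set, pi, A, B):
        # alpha_t(c) = (sum_{c'} alpha_{t-1}(c') A[c'][c]) * B[c][x_t];
        # returns P(X) = sum_c alpha_T(c).
        alpha = {c: pi[c] * B[c][symbols[0]] for c in state_set}
        for x in symbols[1:]:
            alpha = {c: sum(alpha[cp] * A[cp][c] for cp in state_set) * B[c][x]
                     for c in state_set}
        return sum(alpha.values())

    pi = {'N': 0.6, 'V': 0.4}
    A = {'N': {'N': 0.3, 'V': 0.7}, 'V': {'N': 0.8, 'V': 0.2}}
    B = {'N': {'ideas': 0.5, 'sleep': 0.5}, 'V': {'ideas': 0.1, 'sleep': 0.9}}
    print(hmm_joint(['N', 'V'], ['ideas', 'sleep'], pi, A, B))
    print(hmm_forward(['ideas', 'sleep'], ['N', 'V'], pi, A, B))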

Generating Multiple Features
- Instances: sequences of feature sets
  - Word identity
  - Word properties (e.g. spelling, capitalization)
- Labels: class sequences
  [Graphical model: chain C_0 → C_1 → …, with each C_t emitting a set of features F_{t,0}, F_{t,1}, F_{t,2}]
- Limitation: conditionally independent features

Independence or Intractability
- Trees are good: each node has a single immediate ancestor, so the joint probability is computed in linear time
- But that forces features to be conditionally independent given the class
- Unrealistic
  - Suffixes and capitalization
  - "San" and "Francisco" in a document

Information Extraction with HMMs [Seymore & McCallum '99]
- Parameters = P(s|s'), P(o|s) for all states in S = {s1, s2, …}
- Observations: words
- Training: maximize the probability of the observations (+ prior)
- For IE, states indicate the "database field"

Score Card [Freitag & McCallum '99]
✓ No independence assumptions
✓ Richer features: combinations of existing features
− Optimization problem for parameters
− Limited probabilistic interpretation
− Insensitive to input distribution

Problems with HMMs
1. Would prefer a richer representation of text: multiple overlapping features, whole chunks of text
   - Example word features:
     - identity of word
     - word is in all caps
     - word ends in "-tion"
     - word is part of a noun phrase
     - word is in bold font
     - word is on left hand side of page
     - word is under node X in WordNet
   - Example line features:
     - length of line
     - line is centered
     - percent of non-alphabetics
     - total amount of white space
     - line contains two verbs
     - line begins with a number
     - line is grammatically a question
2. HMMs are generative models.
   Generative models do not handle easily overlapping, non-independent features.
   Would prefer a conditional model: P({s…} | {o…}).

Solution: Conditional Model
- Hidden Markov model: P(s|s'), P(o|s)
- Maximum entropy Markov model: P(s|o,s') (represented by an exponential model)
- For the time being, capture the dependency on s' with |S| independent functions, P_s'(s|o)
- Each state contains a "next-state classifier", a black box that, given the next observation, produces a probability distribution over possible next states, P_s'(s|o)

Two Sequence Models
- HMM: s_{t−1} → s_t, s_t → o_t, with parameters P(s|s'), P(o|s)
- MEMM: s_{t−1} → s_t, o_t → s_t, with parameters P(s|o,s')
- Standard belief propagation: forward-backward procedure
- Viterbi and Baum-Welch follow naturally

Transition Features
- Model P_s'(s|o) in terms of multiple arbitrary overlapping (binary) features
- Example observation predicates:
  - o is the word "apple"
  - o is capitalized
  - o is on a left-justified line
- A feature f depends on both a predicate b and a destination state s:
    f_<b,s>(o, s') = 1 if b(o) is true and s' = s, 0 otherwise

Next-State Classifier
- Per-state conditional maxent model
    P_s'(s | o) = (1/Z(o, s')) exp(Σ_<b,q> λ_<b,q> f_<b,q>(o, s))
- Training: each state model is trained independently from labeled sequences

Example: Q-A Pairs from FAQ
    X-NNTP-Poster: NewsHound v1.33
    Archive-name: acorn/faq/part2
    Frequency: monthly

    2.6) What configuration of serial cable should I use?

    Here follows a diagram of the necessary connections for common terminal
    programs to work properly. They are as far as I know the informal standard
    agreed upon by commercial comms software developers for the Arc.

    Pins 1, 4, and 8 must be connected together inside the 9 pin plug. This
    is to avoid the well known serial port chip bugs. The modem's DCD (Data
    Carrier Detect) signal has been re-routed to the Arc's RI (Ring Indicator)
    most modems broadcast a software RING signal anyway, and even then it's
    really necessary to detect it for the model to answer the call.

    2.7) The sound from the speaker port seems quite muffled.
    How can I get unfiltered sound from an Acorn machine?

    All Acorn machine are equipped with a sound filter designed to remove
    high frequency harmonics from the sound output. To bypass the filter, hook
    into the Unfiltered port. You need to have a capacitor. Look for LM324 (chip
    39) and hook the capacitor like this:

Experimental Data
- 38 files belonging to 7 UseNet FAQs
    <head>     X-NNTP-Poster: NewsHound v1.33
    <head>     Archive-name: acorn/faq/part2
    <head>     Frequency: monthly
    <head>
    <question> 2.6) What configuration of serial cable should I use?
    <answer>
    <answer>   Here follows a diagram of the necessary connection…
    <answer>   programs to work properly. They are as far as I know…
    <answer>   agreed upon by commercial comms software developers fo…
    <answer>
    <answer>   Pins 1, 4, and 8 must be connected together inside…
    <answer>   is to avoid the well known serial port chip bugs. The…
- Procedure: for each FAQ, train on one file, test on the others; average.

Features in Experiments
    begins-with-number          contains-question-mark
    begins-with-ordinal         contains-question-word
    begins-with-punctuation     ends-with-question-mark
    begins-with-question-word   first-alpha-is-capitalized
    begins-with-subject         indented
    blank                       indented-1-to-4
    contains-alphanum           indented-5-to-10
    contains-bracketed-number   more-than-one-third-space
    contains-http               only-punctuation
    contains-non-space          prev-is-blank
    contains-number             prev-begins-with-ordinal
    contains-pipe               shorter-than-30

Models Tested
- ME-Stateless: a single maximum entropy classifier applied to each line independently.
- TokenHMM: a fully-connected HMM with four states, one for each of the line categories, each of which generates individual tokens (groups of alphanumeric characters and individual punctuation characters).
- FeatureHMM: identical to TokenHMM, except that the lines in a document are first converted to sequences of features.
- MEMM: maximum entropy Markov model.

Results
    Learner        Segmentation precision   Segmentation recall
    ME-Stateless   0.038                    0.362
    TokenHMM       0.276                    0.140
    FeatureHMM     0.413                    0.529
    MEMM           0.867                    0.681

Label Bias Problem
- Example (after Bottou '91): two paths through a finite-state model
    0 →r 1 →i 2 →b 3   (rib)
    0 →r 4 →o 5 →b 6   (rob)
    P(1, 2 | ro) = P(1 | r) P(2 | o, 1)
    P(1, 2 | ri) = P(1 | r) P(2 | i, 1)
- Bias toward states with fewer outgoing transitions.
- Per-state normalization does not allow the required score(1,2|ro) << score(1,2|ri).

Proposed Solutions
- Determinization:
  - not always possible
  - state-space explosion
- Fully-connected models:
  - lack prior structural knowledge
- Conditional random fields (CRFs):
  - Allow some transitions to vote more strongly than others
  - Whole-sequence rather than per-state normalization

From HMMs to CRFs
    s = s_1 … s_n,   o = o_1 … o_n

    HMM:   P(s | o) ∝ ∏_{t=1}^n P(s_t | s_{t−1}) P(o_t | s_t)

    MEMM:  P(s | o) = ∏_{t=1}^n P(s_t | s_{t−1}, o_t)
                    = ∏_{t=1}^n (1/Z(s_{t−1}, o_t)) exp(Σ_f λ_f f(s_t, s_{t−1}) + Σ_g μ_g g(s_t, o_t))

    CRF:   P(s | o) = (1/Z(o)) ∏_{t=1}^n exp(Σ_f λ_f f(s_t, s_{t−1}) + Σ_g μ_g g(s_t, o_t))

CRF General Form
- The state sequence is a Markov random field conditioned on the observation sequence.
- Model form:
    P(s | o) = (1/Z(o)) exp(Σ_{t=1}^n Σ_f λ_f f(s_{t−1}, s_t, o, t))
- Features represent the dependency between successive states conditioned on the observations
- Dependence on the whole observation sequence o (not possible in HMMs).

Efficient Estimation
- Matrix notation
    M_t(s', s | o) = exp Λ_t(s', s | o)
    Λ_t(s', s | o) = Σ_f λ_f f(s', s, o, t)
    P_Λ(s | o) = (1/Z_Λ(o)) ∏_t M_t(s_{t−1}, s_t | o)
    Z_Λ(o) = (M_1(o) M_2(o) ⋯ M_{n+1}(o))_{start,stop}
- Efficient normalization: forward-backward algorithm

Forward-Backward Calculations
- For any path function G(s) = Σ_t g_t(s_{t−1}, s_t):
    E_Λ G = Σ_s P_Λ(s | o) G(s)
          = Σ_{t,s',s} α_t(s' | o) g_{t+1}(s', s) M_{t+1}(s', s | o) β_{t+1}(s | o) / Z_Λ(o)
    α_t(o) = α_{t−1}(o) M_t(o)
    β_t(o) = M_{t+1}(o) β_{t+1}(o)
    Z_Λ(o) = α_{n+1}(end | o) = β_0(start | o)
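A sketch of the matrix form above: build M_t(s', s | o) from the features and read Z(o) off the product of the matrices with a forward vector. The toy states, features and weights are assumptions, and the sketch does not restrict transitions into 'start' or out of 'stop' as a real implementation would:

    import math

    def transition_matrices(obs, states, feats, lambdas):
        # M_t(s', s | o) = exp sum_f lambda_f f(s', s, o, t) for t = 1 .. n+1,
        # with padded 'start' and 'stop' states as in the slides.
        full = ['start'] + list(states) + ['stop']
        mats = []
        for t in range(1, len(obs) + 2):
            M = {sp: {s: math.exp(sum(l * f(sp, s, obs, t)
                                      for l, f in zip(lambdas, feats)))
                      for s in full}
                 for sp in full}
            mats.append(M)
        return mats, full

    def partition(mats, full):
        # Z(o) = (M_1 M_2 ... M_{n+1})[start, stop], computed as alpha_t = alpha_{t-1} M_t.
        alpha = {s: (1.0 if s == 'start' else 0.0) for s in full}
        for M in mats:
            alpha = {s: sum(alpha[sp] * M[sp][s] for sp in full) for s in full}
        return alpha['stop']

    # toy model: one feature rewards staying in the same state, one rewards
    # entering state 'B' on a capitalized observation
    obs = ['John', 'runs']
    states = ['B', 'I']
    feats = [lambda sp, s, o, t: 1.0 if sp == s else 0.0,
             lambda sp, s, o, t: 1.0 if t <= len(o) and s == 'B' and o[t - 1][0].isupper() else 0.0]
    lambdas = [0.5, 1.2]
    mats, full = transition_matrices(obs, states, feats, lambdas)
    print(partition(mats, full))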

Training
- Maximize
    L(Λ) = Σ_k log P_Λ(s^k | o^k)
- Log-likelihood gradient
    ∂L(Λ)/∂λ_f = Σ_k #_f(s^k | o^k) − Σ_k E_Λ #_f(S | o^k)
    #_f(s | o) = Σ_t f(s_{t−1}, s_t, o, t)
- Methods: iterative scaling, conjugate gradient
- Comparable to standard Baum-Welch

Label Bias Experiment
- Data source: noisy version of the rib/rob automaton
    0 →r 1 →i 2 →b 3   (rib)
    0 →r 4 →o 5 →b 6   (rob)
- P(intended symbol) = 29/30, P(other) = 1/30.
- Train both an MEMM and a CRF with identical topologies on data from the source.
- Decoding error: CRF 4.6%, MEMM 42% (2,000 training samples, 500 test)

Mixed-Order Sources
- Data generated by mixing sparse first- and second-order HMMs with varying mixing coefficient.
- Modeled by a first-order HMM, MEMM and CRF (without contextual or overlapping features).

Part-of-Speech Tagging
- Trained on 50% of the 1.1 million words in the Penn treebank. In this set, 5.45% of the words occur only once, and were mapped to "oov".
- Experiments with two different sets of features:
  - traditional: just the words
  - take advantage of the power of conditional models: use words, plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.

POS Tagging Results
    model   error    oov error
    HMM     5.69%    45.99%
    MEMM    6.37%    54.61%
    CRF     5.55%    48.05%
    MEMM+   4.81%    26.99%
    CRF+    4.27%    23.76%

Structured Models: Stochastic Grammars

Beyond Finite State
  [Figure: parse trees for "revolutionary new ideas advance slowly": S → NP VP, with the NP built by N' → Adj N' recursion over "revolutionary", "new", "ideas", and VP → VP Adv over "advance slowly"]
- Constituency and meaningful relations
    What sleeps? How does it sleep? What kind of ideas?
- Related sentences
    Sleep furiously is all that colorless green ideas do.

Stochastic Context-Free Grammars
    S  → NP VP      1.0
    NP → Det N′     0.2
    NP → N′         0.8
    N′ → Adj N′     0.3
    N′ → N          0.7
    VP → VP Adv     0.4
    VP → V          0.6
    N  → ideas      0.1
    N  → people     0.2
    V  → sleep      0.3

SCFG Derivations
- The probability of a derivation is the product of the probabilities of the rules it uses, e.g.
    P(tree) = P(S → NP VP) × P(NP → N′) × P(N′ → Adj N′) × ⋯ × P(N → ideas) × ⋯
- Context-freeness ⇒ independence of derivation steps

Stochastic CFG Inference
- Inside-outside algorithm (Baker 79): find rule probabilities that locally maximize the likelihood of a training corpus (an instance of EM)
- Extended inside-outside algorithm: use information about the training corpus phrase structure to guide rule probability reestimation
  - Better modeling of phrase structure
  - Improved convergence
  - Improved computational complexity
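Because of context-freeness, the probability of a derivation is just the product of its rule probabilities. A sketch scoring one parse of "colorless green ideas sleep furiously" with the grammar from the SCFG slide (lexical probabilities for the adjectives and the adverb are my assumption, set to 1.0):

    def tree_prob(tree, rule_probs):
        # P(tree) = product of rule probabilities over the derivation.
        label, children = tree
        if isinstance(children, str):                  # preterminal -> word
            return rule_probs[(label, (children,))]
        rhs = tuple(child[0] for child in children)
        p = rule_probs[(label, rhs)]
        for child in children:
            p *= tree_prob(child, rule_probs)
        return p

    rules = {
        ('S', ('NP', 'VP')): 1.0,
        ('NP', ("N'",)): 0.8,
        ("N'", ('Adj', "N'")): 0.3,
        ("N'", ('N',)): 0.7,
        ('VP', ('VP', 'Adv')): 0.4,
        ('VP', ('V',)): 0.6,
        ('N', ('ideas',)): 0.1,
        ('V', ('sleep',)): 0.3,
        ('Adj', ('colorless',)): 1.0,   # assumed lexical probabilities
        ('Adj', ('green',)): 1.0,
        ('Adv', ('furiously',)): 1.0,
    }
    tree = ('S', [('NP', [("N'", [('Adj', 'colorless'),
                                  ("N'", [('Adj', 'green'),
                                          ("N'", [('N', 'ideas')])])])]),
                  ('VP', [('VP', [('V', 'sleep')]), ('Adv', 'furiously')])])
    print(tree_prob(tree, rules))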

Inside-Outside Reestimation
- Ratios of expected rule frequencies to expected phrase frequencies
    P_{n+1}(N′ → Adj N′) = E_n(# N′ → Adj N′) / E_n(# N′)
- Computed from the probabilities of each phrase and phrase context in the training corpus
  [Figure: a phrase (subtree) and its surrounding context within a parse tree]
- Iterate

Problems with I-O Reestimation
- Hill-climbing procedure: sensitivity to the initial rule probabilities
- Does not learn grammar structure directly: structure is only implicit in the rule probabilities
- Linguistically inadequate grammars: high mutual information sequences are grouped into phrases
    ((What (((is the) cheapest) fare)) ((I can) (get ?)))
    Contrast: Is $300 the cheapest fare?

(Partially) Bracketed Text
- Hand-annotated text with (some) phrase boundaries
    (((List (the fares (for ((flight) (number 891)))))).)
- Use only derivations compatible with the training bracketing

Predictive Power
  [Figure: cross entropy vs. training iteration on training and test data, comparing training on raw text with training on (partially) bracketed text]

Bracketing Accuracy
- Accuracy criterion: proportion of phrases in the most likely analysis compatible with the treebank bracketing
  [Figure: bracketing accuracy vs. iteration for models trained on raw vs. bracketed text; only the bracketed-trained model reaches high accuracy]
- Conclusion: structure is not evident from distribution alone

Limitations of SCFGs
- Likelihoods independent of particular words
- Markovian assumption on syntactic categories
    Markov models: We need to resolve the issue (left-to-right sequencing)
    Hierarchical models: We need to resolve the issue (head-dependent structure)

Lexicalization
- Markov model: We need to resolve the issue (sequential word-to-word dependencies)
- Dependency model: We need to resolve the issue (head-word to dependent links)

Best Current Models
- Representation: surface trees with head-word propagation
- Generative power still context-free
- Model variables: head word, dependency type, argument vs. adjunct, heaviness, slash
- Main challenge: smoothing method for unseen dependencies
- Learned from hand-parsed text (treebank)
- Around 90% constituent accuracy

Lexicalized Tree (Collins 98)
    S(dumped)
      NP-C(workers)             VP(dumped)
        N(workers)   V(dumped)    NP-C(sacks)   PP(into)
        workers      dumped       N(sacks)      P(into)  NP-C(bin)
                                  sacks         into     D(a)  N(bin)
                                                         a     bin

    Dependency          Direction   Relation
    workers → dumped    L           ⟨NP, S, VP⟩
    sacks → dumped      R           ⟨NP, VP, V⟩
    into → dumped       R           ⟨PP, VP, V⟩
    a → bin             L           ⟨D, NP, N⟩
    bin → into          R           ⟨NP, PP, P⟩

Inducing Representations

Unsupervised Learning
- Latent variable models
  - Model observables from latent variables
  - Search for a "good" set of latent variables
- Information bottleneck
  - Find an efficient compression of some observables
  - … preserving the information about other observables

Do Induced Classes Help?
- Generalization
  - Better statistics for coarser events
- Dimensionality reduction
  - Smaller models
- Improved classification accuracy

Chomsky's Challenge to Empiricism
(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.
"… It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not." (Chomsky 57)

Complex Events
- What Chomsky was talking about: Markov models, where the state is just a record of observations
- But statistical models can have hidden state:
  - representation of past experience
  - uncertainty about the correct grammar
  - uncertainty about the correct interpretation of experience: ambiguity
- Probabilistic relationships involving hidden variables can be induced from observable data alone: the EM algorithm

In "Any Model"?
- Factored bigram model:
    P(w_{i+1} | w_i) ≈ Σ_{c=1}^{16} P(w_{i+1} | c) × P(c | w_i)
    P(w_1 ⋯ w_n) ≈ P(w_1) ∏_{i=1}^{n−1} P(w_{i+1} | w_i)
    P(colorless green ideas sleep furiously) / P(furiously sleep ideas green colorless) ≈ 2 × 10^5
- Trained by EM for large-vocabulary speech recognition from newswire text

Distributional Clustering
- Automatic grouping of words according to the contexts in which they appear
- Approach to data sparseness: approximate the distribution of a relatively rare event (word) by the collective distribution of similar events (a cluster)
- Sense ambiguity ≈ membership in several "soft" clusters
- Case study: cluster nouns according to the verbs that take them as direct objects

Training Data
- Universe: two word classes V and N, and a single relation between them (e.g. main verb – head noun of the verb's direct object)
- Data: frequencies f_vn of (v, n) pairs extracted from text by parsing or pattern matching

Distributional Representation
- Describing n ∈ N: use the conditional distribution p(V | n)
  [Figure: relative frequencies of the verbs own, purchase, select, sell, tender, trade, hold, issue, buy as direct-object contexts of the nouns "stock" and "bond"]

Reminder: Bottleneck Model
- Markov condition:
    p(ñ | v) = Σ_n p(ñ | n) p(n | v)
- Find p(ñ | n) maximizing the mutual information I(Ñ, V) for fixed I(Ñ, N):
    I(Ñ, V) = Σ_{ñ,v} p(ñ, v) log [p(ñ, v) / (p(ñ) p(v))]
- Solution:
    p(ñ | n) = (p(ñ)/Z_n) exp(−β D_KL(p(V | n) ‖ p(V | ñ)))
  [Figure: bottleneck variable Ñ between N and V, with the information quantities I(N,V), I(Ñ,N), I(Ñ,V)]

Search for Cluster Solutions
- The scale parameter β ("inverse temperature") determines how much a noun contributes to nearby centroids
- As β increases, clusters split ⇒ hierarchical clustering
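A sketch of the soft assignment p(ñ | n) ∝ p(ñ) exp(−β D_KL(p(V|n) ‖ p(V|ñ))). In a full run the centroids and assignments would be alternated to a fixed point; here the centroids, priors and the noun's verb distribution are fixed toy values, so only the effect of β is shown:

    import math

    def kl(p, q, eps=1e-12):
        # D_KL(p || q) over the shared verb vocabulary.
        return sum(pv * math.log((pv + eps) / (q.get(v, 0.0) + eps))
                   for v, pv in p.items() if pv > 0)

    def soft_assign(p_v_given_n, centroids, priors, beta):
        # p(cluster | n) proportional to p(cluster) * exp(-beta * KL(p(V|n) || p(V|cluster))).
        weights = {c: priors[c] * math.exp(-beta * kl(p_v_given_n, centroids[c]))
                   for c in centroids}
        z = sum(weights.values())
        return {c: w / z for c, w in weights.items()}

    # toy verb distribution for the noun "gun" and two cluster centroids
    p_gun = {'fire': 0.7, 'sell': 0.2, 'buy': 0.1}
    centroids = {'weapons':     {'fire': 0.6,  'sell': 0.25, 'buy': 0.15},
                 'commodities': {'fire': 0.05, 'sell': 0.5,  'buy': 0.45}}
    priors = {'weapons': 0.5, 'commodities': 0.5}
    for beta in (0.5, 5.0, 50.0):
        print(beta, soft_assign(p_gun, centroids, priors, beta))   # sharpens as beta grows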

Small Example
- Cluster the 64 most common direct objects of "fire" in the 1988 Associated Press newswire
- Sample clusters (each word with its score within the cluster):
    missile 0.835, rocket 0.850, bullet 0.917, gun 0.940
    officer 0.484, aide 0.612, chief 0.649, manager 0.651
    gun 0.758, missile 0.786, weapon 0.862, rocket 0.875
    shot 0.858, bullet 0.925, rocket 0.930, missile 1.037

Mutual Information Ratios
  [Figure: I(Ñ,V)/I(Ñ,N) plotted against I(Ñ,N)/H(N) for increasing β]

Using Clusters for Prediction
- Model verb–object associations through object clusters:
    p̂(v | n) = Σ_ñ p(v | ñ) p(ñ | n)
    p̂(v | n) = (1/Z_n) ∏_ñ p(v | ñ)^{p(ñ | n)}   (asymmetric form used in the experiments)
- Depends on β
- Intuition: the associations of a word are a mixture of the associations of the sense classes to which the word belongs

Evaluation
- Relative entropy of held-out data to the asymmetric model
- Decision task: which of two verbs is more likely to take a noun as direct object, estimated from training data in which the pairs relating the noun to one of the verbs have been deleted

Relative Entropy
- Verb-object pairs from the 1988 AP Newswire
  - Train: 756,721 verb-object pair training set
  - Test: 81,240 pair held-out test set
  - New: held-out data for 1000 nouns not in the training data
  [Figure: average relative entropy (bits) vs. number of clusters for the train, test, and new data]

Decision Task
- Which of v1 and v2 is more likely to take object o
- Held out: 10^4 (v2, o) pairs – need to guess p_c(v2)
- Testing: compare all v1 and v2 such that (v2, o) was held out and (v1, o) occurs
  - All: all test data
  - Exceptional: verb pairs whose frequency ratio with o is reversed from their overall frequency ratio
  [Figure: decision error vs. number of clusters for the all and exceptional test pairs]