ML in NLP
Why ML in NLP

- Examples are easier to create than rules
- Rule writers miss low-frequency cases
- Many factors are involved in language interpretation
- People do it:
  - AI
  - Cognitive science
- Let the computer do it:
  - Moore's law
  - Storage
  - Lots of data

Classification

- Document topic: politics, business, national, environment
- Word sense: treasury bonds vs. chemical bonds
Analysis

- Tagging:

  causing/VBG symptoms/NNS that/WDT show/VBP up/RP decades/NNS later/JJ

Language Modeling

- Is this a likely English sentence? (see the bigram sketch below)
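To make the question concrete, here is a minimal bigram language model sketch. The toy corpus, add-alpha smoothing, and vocabulary size are illustrative assumptions, not part of the tutorial.

```python
import math
from collections import defaultdict

# Toy corpus; real language models are trained on millions of words.
corpus = [
    ["causing", "symptoms", "that", "show", "up", "decades", "later"],
    ["symptoms", "show", "up", "later"],
]

counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    for prev, word in zip(["<s>"] + sent, sent + ["</s>"]):
        counts[prev][word] += 1

def bigram_logprob(sent, alpha=0.1, vocab=1000):
    # Add-alpha smoothing gives unseen bigrams a small nonzero probability.
    lp = 0.0
    for prev, word in zip(["<s>"] + sent, sent + ["</s>"]):
        c = counts[prev]
        lp += math.log((c[word] + alpha) / (sum(c.values()) + alpha * vocab))
    return lp

# A higher log-probability means "more likely English" under this tiny model.
print(bigram_logprob(["symptoms", "show", "up", "later"]))
print(bigram_logprob(["up", "show", "later", "symptoms"]))
```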
Inference

- Translation:

  treasury bonds → obrigações do tesouro
  covalent bonds → ligações covalentes

- Information extraction:

  Sara Lee to Buy 30% of DIM

  Chicago, March 3 - Sara Lee Corp [acquirer] said it agreed to buy a 30
  percent interest in Paris-based DIM S.A. [acquired], a subsidiary of BIC
  S.A., at a cost of about 20 million dollars. DIM S.A., a hosiery
  manufacturer, had sales of about 2 million dollars. The investment includes
  the purchase of 5 million newly issued DIM shares valued at about 5 million
  dollars, and a loan of about 15 million dollars, it said. The loan is
  convertible into an additional 16 million DIM shares, it noted. The
  proposed agreement is subject to approval by the French government, it
  said.

Machine Learning Approach

- Algorithms that write programs
- Specify:
  - Form of the output programs
  - Accuracy criterion
- Input: a set of training examples
- Output: a program that performs as accurately as possible on the training
  examples
- But will it work on new examples?
Learning Tradeoffs

Fundamental Questions

- Generalization: is the learned program useful on new examples?
- Overfitting
- Class-probability distribution
- Redundancy is our friend: many weak clues
- Combining decisions:
  - Sequential decisions
  - Generative models
  - Constraint satisfaction
Classification Tasks

- Document categorization:
  - News categorization
  - Message filtering
  - Web page selection
- Tagging:
  - Named entity
  - Part-of-speech
  - Sense disambiguation
- Syntactic decisions:
  - Attachment

Discriminative Claims

- Focus modeling resources on the instance-to-label mapping
- Avoid restrictive probabilistic assumptions on the instance distribution
- Optimize what you care about
- Higher accuracy
Documents vs. Vectors (3)

- Unigram model:

  p(c \mid d) = \frac{p(c)\, p(|d| \mid c) \prod_{i=1}^{|d|} p(d_i \mid c)}
                     {\sum_{c'} p(c')\, p(|d| \mid c') \prod_{i=1}^{|d|} p(d_i \mid c')}

- Vector model (term counts r_t, length L):

  p(c \mid r) = \frac{p(c)\, p(L \mid c) \prod_t p(t \mid c)^{r_t}}
                     {\sum_{c'} p(c')\, p(L \mid c') \prod_t p(t \mid c')^{r_t}}

  (a naive Bayes sketch follows below)

Linear Classifiers

- Embedding into a high-dimensional vector space
- Geometric intuitions and techniques
- Easier separability:
  - Increase dimension with interaction terms
  - Nonlinear embeddings (kernels)
- Swiss Army knife
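The unigram model above is exactly multinomial naive Bayes. A minimal sketch; the two-document training set and add-one smoothing are made-up assumptions for illustration:

```python
import math
from collections import Counter, defaultdict

# Toy training data: (class, document) pairs. Purely illustrative.
docs = [
    ("finance", "treasury bonds yield interest".split()),
    ("chemistry", "covalent bonds share electrons".split()),
]

class_counts = Counter(c for c, _ in docs)
word_counts = defaultdict(Counter)
for c, words in docs:
    word_counts[c].update(words)

vocab = {w for _, words in docs for w in words}

def log_posterior(c, doc):
    # log p(c) + sum_i log p(d_i | c), with add-one smoothing.
    lp = math.log(class_counts[c] / sum(class_counts.values()))
    total = sum(word_counts[c].values())
    for w in doc:
        lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return lp

print(max(class_counts, key=lambda c: log_posterior(c, ["covalent", "bonds"])))
```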
Linear Classification

- Linear discriminant function:

  h(x) = w \cdot x + b = \sum_k w_k x_k + b

Margin

- Instance margin: \gamma_i = y_i (w \cdot x_i + b)
- Normalized (geometric) margin:

  \gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right)

- Training set margin: \gamma = \min_i \gamma_i (numeric sketch below)

[Figure: two point classes (x, o) separated by a hyperplane w, with instance
margins \gamma_i, \gamma_j marked.]
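A small numpy sketch of these quantities; the toy data and candidate hyperplane are made up for illustration:

```python
import numpy as np

# Toy linearly separable data; labels in {-1, +1}. Illustrative values only.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0  # some candidate hyperplane

functional_margins = y * (X @ w + b)           # gamma_i = y_i (w . x_i + b)
geometric_margins = functional_margins / np.linalg.norm(w)
training_set_margin = geometric_margins.min()  # gamma = min_i gamma_i
print(training_set_margin)
```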
Perceptron Algorithm

- Given a linearly separable training set S and learning rate \eta > 0:

  w \leftarrow 0;\ b \leftarrow 0;\ R = \max_i \|x_i\|
  repeat
    for i = 1 \ldots N
      if y_i (w \cdot x_i + b) \le 0
        w \leftarrow w + \eta y_i x_i
        b \leftarrow b + \eta y_i R^2
  until there are no mistakes

Duality

- The final hypothesis is a linear combination of training points:

  w = \sum_i \alpha_i y_i x_i, \quad \alpha_i \ge 0

- Dual perceptron algorithm (Python transcription below):

  \alpha \leftarrow 0;\ b \leftarrow 0;\ R = \max_i \|x_i\|
  repeat
    for i = 1 \ldots N
      if y_i \left( \sum_j \alpha_j y_j x_j \cdot x_i + b \right) \le 0
        \alpha_i \leftarrow \alpha_i + 1
        b \leftarrow b + y_i R^2
  until there are no mistakes
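A direct transcription of the two listings above in Python (toy interface; the dual version touches the data only through inner products, which is what makes kernelization possible):

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    # Primal perceptron, following the pseudocode above.
    w, b = np.zeros(X.shape[1]), 0.0
    R = max(np.linalg.norm(x) for x in X)
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:
                w += eta * yi * xi
                b += eta * yi * R**2
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def dual_perceptron(X, y, max_epochs=100):
    # Dual perceptron: only the alpha counts and the Gram matrix are needed.
    n = len(X)
    alpha, b = np.zeros(n), 0.0
    R = max(np.linalg.norm(x) for x in X)
    G = X @ X.T  # Gram matrix of inner products x_j . x_i
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * ((alpha * y) @ G[:, i] + b) <= 0:
                alpha[i] += 1
                b += y[i] * R**2
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b
```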
Why Maximize the Margin?

- There is a constant c such that, for any distribution D with support in a
  ball of radius R and any training sample S of size N drawn from D,

  P\left[\, \mathrm{err}(h) \le \frac{c}{N} \left( \frac{R^2}{\gamma^2} \log^2 N + \log \frac{1}{\delta} \right) \right] \ge 1 - \delta

  where \gamma is the margin of h on S.

Canonical Hyperplanes

- Multiple representations for the same hyperplane: (\lambda w, \lambda b) for any \lambda > 0
- Canonical hyperplane: functional margin = 1
- Geometric margin for a canonical hyperplane:

  \gamma = \frac{1}{2} \left( \frac{w}{\|w\|} \cdot x^+ - \frac{w}{\|w\|} \cdot x^- \right)
         = \frac{1}{2\|w\|} (w \cdot x^+ - w \cdot x^-)
         = \frac{1}{\|w\|}
General SVM Form

- Margin maximization for an arbitrary kernel K:

  \max_\alpha \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j K(x_i, x_j)
  subject to \alpha_i \ge 0,\quad \sum_i y_i \alpha_i = 0

- Decision rule:

  h(x) = \mathrm{sgn}\left( \sum_i y_i \alpha_i^* K(x_i, x) + b^* \right)

Soft Margin

- Handles the non-separable case
- Primal problem (2-norm):

  \min_{w,b,\xi}\ w \cdot w + C \sum_i \xi_i^2
  subject to y_i (w \cdot x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0

- Dual problem (usage sketch below):

  \max_\alpha \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \left( x_i \cdot x_j + \tfrac{1}{C} \delta_{ij} \right)
  subject to \alpha_i \ge 0,\quad \sum_i y_i \alpha_i = 0
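In practice this quadratic program is handed to a solver. A hedged sketch using scikit-learn's SVC, assuming scikit-learn is available; note that SVC solves the 1-norm (hinge) soft margin rather than the 2-norm variant above, and the data is a toy:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data; illustrative only.
X = np.array([[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]])
y = np.array([-1, -1, -1, 1, 1, 1])

# RBF-kernel soft-margin SVM; C controls the margin/slack tradeoff.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)

# The dual solution is sparse: only support vectors get alpha_i > 0.
print(clf.support_vectors_)
print(clf.dual_coef_)  # y_i * alpha_i for the support vectors
print(clf.predict([[3, 3]]))
```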
Conditional Maxent Model

- Maximize the conditional log-likelihood
- Duality

Relationship to (Binary) Logistic Discrimination

  p(+1 \mid x) = \frac{\exp \sum_k \lambda_k f_k(x,+1)}
                      {\exp \sum_k \lambda_k f_k(x,+1) + \exp \sum_k \lambda_k f_k(x,-1)}
               = \frac{1}{1 + \exp\left( -\sum_k \lambda_k (f_k(x,+1) - f_k(x,-1)) \right)}
               = \frac{1}{1 + \exp\left( -\sum_k \lambda_k g_k(x) \right)}

  with g_k(x) = f_k(x,+1) - f_k(x,-1)

Relationship to Linear Discrimination

- Decision rule (numeric sketch below):

  \mathrm{sign}\left( \log \frac{p(+1 \mid x)}{p(-1 \mid x)} \right) = \mathrm{sign} \sum_k \lambda_k g_k(x)

- Bias term: a parameter for an "always on" feature
- Question: relationship to other trainers for linear discriminant functions
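A minimal numpy sketch of this correspondence; the weights and feature-difference vector g(x) are made-up values for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters lambda_k and a feature-difference vector
# g(x) = f(x,+1) - f(x,-1); both are assumptions, not tutorial data.
lam = np.array([0.5, -1.2, 2.0])
g_x = np.array([1.0, 0.0, 1.0])

score = lam @ g_x        # linear discriminant: sum_k lambda_k g_k(x)
p_plus = sigmoid(score)  # conditional maxent p(+1 | x)
decision = np.sign(score)  # same sign as log p(+1|x)/p(-1|x)
print(p_plus, decision)
```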
Solution Techniques (3)

- GIS: very slow if the slack variable takes large values
- IIS: faster, but still problematic
- Recent suggestion: use standard convex optimization techniques, e.g.
  conjugate gradient; some evidence of faster convergence

Gaussian Prior

- Log-likelihood gradient (numpy sketch below):

  \frac{\partial \ell(\Lambda)}{\partial \lambda_k}
    = \sum_i f_k(x_i, y_i) - \sum_i \sum_y p(y \mid x_i; \Lambda) f_k(x_i, y) - \frac{\lambda_k}{\sigma_k^2}

- Modified IIS update: \lambda_k \leftarrow \lambda_k + \delta_k, where \delta_k solves

  \sum_i f_k(x_i, y_i)
    = \sum_i \sum_y p(y \mid x_i; \Lambda)\, f_k(x_i, y)\, e^{\delta_k f^\#(x_i, y)} + \frac{\lambda_k + \delta_k}{\sigma_k^2}

  with f^\#(x, y) = \sum_k f_k(x, y)
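A sketch of this penalized gradient for a small conditional maxent model. The dense feature-tensor layout is an assumption for illustration; real implementations use sparse features:

```python
import numpy as np

def gradient(lam, F, y_idx, sigma2):
    """Gradient of the penalized conditional log-likelihood.

    F[i, y, k] holds the feature value f_k(x_i, y); y_idx[i] is the observed
    label of instance i. This layout is an illustrative assumption.
    """
    # p(y | x_i; Lambda) for every instance i and label y.
    scores = F @ lam                            # shape (n_instances, n_labels)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)

    observed = F[np.arange(len(y_idx)), y_idx].sum(axis=0)  # sum_i f_k(x_i, y_i)
    expected = np.einsum("iy,iyk->k", p, F)                 # sum_i sum_y p(y|x_i) f_k
    return observed - expected - lam / sigma2               # Gaussian prior term
```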
Structured Models: Finite State

"I understood each and every word you said but not the order in which they
appeared."
Constraint Satisfaction in Structured Models

- Train to minimize the labeling loss:

  \hat{\theta} = \mathrm{argmin}_\theta \sum_i \mathrm{Loss}(x_i, y_i \mid \theta)

- Computing the best labeling:

  \mathrm{argmin}_y \mathrm{Loss}(x, y \mid \hat{\theta})

- Efficient minimization requires:
  - A common currency for local labeling decisions
  - An efficient algorithm to combine the decisions

Local Classification Models

- Train to minimize the per-decision loss in context:

  \hat{\theta} = \mathrm{argmin}_\theta \sum_i \sum_{0 \le j < |x_i|} \mathrm{loss}(y_{i,j} \mid x_i, y_i^{(j)}; \theta)

- Apply by guessing the context and finding each lowest-loss label:

  \mathrm{argmin}_{y_j} \mathrm{loss}(y_j \mid x, \hat{y}^{(j)}; \hat{\theta})
Markov's Unreasonable Effectiveness

- Entropy estimates for English:

  model                                bits/char
  compress                             4.43
  word trigrams (Brown et al. 92)      1.75
  human prediction (Cover & King 78)   1.34

- Local word relations dominate the statistics (Jelinek): each column lists
  a trigram model's ranked predictions at successive positions of an actual
  sentence; the number at the left is the rank.

  1     The      are       to           know      the       issues        necessary  role
  2     This     will      have         this      problems  data          thing
  2     One      the       understand   these     the       information   that
  7     Please   need      use          problem   people    point
  9     We       insert    all          tools     issues
  98             resolve   old
  1641                     important

Limits of Markov Models

- No dependency model
- Likelihoods based on sequencing, not dependency
Important Detour: Expectation-Maximization (EM)

Latent Variable Models

- Latent (hidden) variable models:

  p(y, x \mid \Lambda) = \sum_z p(y, x, z \mid \Lambda)

- Examples:
  - Mixture models
  - Class-based models (hidden classes)
[Figure: the log-likelihood \ell(\Lambda) and the EM auxiliary function as a
lower bound on it.]
Comments

- Likelihood keeps increasing, but:
  - Can get stuck in a local maximum (or saddle point!)
  - Can oscillate between different local maxima with the same log-likelihood
- If maximizing the auxiliary function is too hard, find some \Lambda that
  increases the likelihood: generalized EM (GEM)
- The sum over hidden variable values can be exponential if not done
  carefully (sometimes it is not possible)

Example: Mixture Model

- Base distributions: p_i(y),\ 1 \le i \le m
- Mixture coefficients: \lambda_i \ge 0,\ \sum_{i=1}^m \lambda_i = 1
- Mixture distribution (EM sketch below):

  p(y \mid \Lambda) = \sum_i \lambda_i p_i(y)
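A compact EM sketch for this mixture model with the base distributions held fixed, so only the mixture coefficients λ_i are re-estimated. The base distributions and data counts are toy assumptions:

```python
import numpy as np

# Two fixed base distributions over 3 outcomes (rows of P); toy counts.
P = np.array([[0.7, 0.2, 0.1],    # p_1(y)
              [0.1, 0.3, 0.6]])   # p_2(y)
counts = np.array([30.0, 20.0, 50.0])  # observed counts of each outcome y

lam = np.array([0.5, 0.5])  # initial mixture coefficients
for _ in range(100):
    # E-step: posterior responsibility of component i for each outcome y.
    joint = lam[:, None] * P          # lambda_i * p_i(y)
    resp = joint / joint.sum(axis=0)  # p(i | y; Lambda)
    # M-step: re-estimate lambda_i from expected component counts.
    lam = (resp * counts).sum(axis=1) / counts.sum()

print(lam)  # converges to a local maximum of the likelihood
```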
IE Example

[Figure: annotated text with person, organization and person-descriptor
entities, plus employee and co-reference relations between them.]

IE Methods

- Partial matching:
  - Hand-built patterns
  - Automatically-trained hidden Markov models
  - Cascaded finite-state transducers
Hidden Markov Models

- Parameters: P(s'|s) and P(o|s) for all states in S = {s1, s2, ...}
- Observations: words
- Training: maximize the probability of the observations (+ prior)
- For IE, states indicate the "database field"

Conditional Models: Tradeoffs

✓ No independence assumptions
✓ Richer features: combinations of existing features
− Optimization problem for parameters
− Limited probabilistic interpretation
− Insensitive to input distribution
Example: FAQ text to segment (from an Acorn UseNet FAQ):

  2.7) The sound from the speaker port seems quite muffled.
  How can I get unfiltered sound from an Acorn machine?

  All Acorn machines are equipped with a sound filter designed to remove
  high-frequency harmonics from the sound output. To bypass the filter, hook
  into the Unfiltered port. You need to have a capacitor. Look for LM324
  (chip 39) and hook the capacitor like this:
Experimental Data

- 38 files belonging to 7 UseNet FAQs:

  <head>     X-NNTP-Poster: NewsHound v1.33
  <head>     Archive-name: acorn/faq/part2
  <head>     Frequency: monthly
  <head>
  <question> 2.6) What configuration of serial cable should I use?
  <answer>   Here follows a diagram of the necessary connection
  <answer>   programs to work properly. They are as far as I know
  <answer>   agreed upon by commercial comms software developers fo
  <answer>
  <answer>   Pins 1, 4, and 8 must be connected together inside
  <answer>   is to avoid the well known serial port chip bugs. The

- Procedure: for each FAQ, train on one file, test on the others; average.

Features in Experiments

- Binary line features (implementation sketch below):

  begins-with-number         contains-question-mark
  begins-with-ordinal        contains-question-word
  begins-with-punctuation    ends-with-question-mark
  begins-with-question-word  first-alpha-is-capitalized
  begins-with-subject        indented
  blank                      indented-1-to-4
  contains-alphanum          indented-5-to-10
  contains-bracketed-number  more-than-one-third-space
  contains-http              only-punctuation
  contains-non-space         prev-is-blank
  contains-number            prev-begins-with-ordinal
  contains-pipe              shorter-than-30
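A sketch of how such line features might be computed. The predicates below implement a few of the listed features; the exact definitions used in the original experiments may differ:

```python
import re

def line_features(line, prev_line=""):
    # A few of the listed features; the definitions are plausible guesses,
    # not the exact ones from the original experiments.
    return {
        "begins-with-number":      bool(re.match(r"\d", line)),
        "begins-with-ordinal":     bool(re.match(r"\d+\)", line)),
        "contains-question-mark":  "?" in line,
        "ends-with-question-mark": line.rstrip().endswith("?"),
        "blank":                   line.strip() == "",
        "indented-1-to-4":         bool(re.match(r" {1,4}\S", line)),
        "contains-http":           "http" in line,
        "prev-is-blank":           prev_line.strip() == "",
        "shorter-than-30":         len(line.strip()) < 30,
    }
```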
Models Tested

- ME-Stateless: a single maximum entropy classifier applied to each line
  independently.
- TokenHMM: a fully-connected HMM with four states, one for each of the line
  categories, each of which generates individual tokens (groups of
  alphanumeric characters and individual punctuation characters).
- FeatureHMM: identical to TokenHMM, only the lines in a document are first
  converted to sequences of features.
- MEMM: maximum entropy Markov model.

Results

  Learner        Segmentation precision  Segmentation recall
  ME-Stateless   0.038                   0.362
  TokenHMM       0.276                   0.140
  FeatureHMM     0.413                   0.529
  MEMM           0.867                   0.681
Label Bias

- Bias toward states with fewer outgoing transitions.
- Per-state normalization does not allow the required
  score(1,2|ro) << score(1,2|ri).

Conditional Random Fields (CRFs)

- Allow some transitions to vote more strongly than others
- Whole-sequence rather than per-state normalization
Efficient Estimation

- Matrix notation:

  M_t(s', s \mid o) = \exp \Lambda_t(s', s \mid o)
  \Lambda_t(s', s \mid o) = \sum_f \lambda_f f(s', s, o, t)
  P_\Lambda(s \mid o) = \frac{1}{Z_\Lambda(o)} \prod_t M_t(s_{t-1}, s_t \mid o)
  Z_\Lambda(o) = \left( M_1(o) M_2(o) \cdots M_{n+1}(o) \right)_{\mathrm{start},\,\mathrm{stop}}

- Efficient normalization: forward-backward algorithm (numpy sketch below)

Forward-Backward Calculations

- For any path function G(s) = \sum_t g_t(s_{t-1}, s_t):

  E_\Lambda G = \sum_s P_\Lambda(s \mid o)\, G(s)
            = \sum_{t,s',s} \frac{\alpha_t(s' \mid o)\, g_{t+1}(s', s)\, M_{t+1}(s', s \mid o)\, \beta_{t+1}(s \mid o)}{Z_\Lambda(o)}

  \alpha_t(o) = \alpha_{t-1}(o)\, M_t(o)
  \beta_t(o)' = M_{t+1}(o)\, \beta_{t+1}(o)
  Z_\Lambda(o) = \alpha_{n+1}(\mathrm{end} \mid o) = \beta_0(\mathrm{start} \mid o)
Mixed-Order Sources

- Data generated by mixing sparse first- and second-order HMMs with a
  varying mixing coefficient.
- Modeled by a first-order HMM, MEMM and CRF (without contextual or
  overlapping features); see the Viterbi decoding sketch below.

Part-of-Speech Tagging

- Trained on 50% of the 1.1 million words in the Penn treebank. In this set,
  5.45% of the words occur only once, and were mapped to "oov".
- Experiments with two different sets of features:
  - traditional: just the words
  - take advantage of the power of conditional models: use words, plus
    overlapping features: capitalized, begins with #, contains hyphen, ends
    in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.
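For reference, first-order HMM tagging decodes with the Viterbi recursion. A minimal log-space sketch; the array interface is a toy assumption:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, obs):
    """Most likely state sequence for a first-order HMM.

    log_init[s], log_trans[s', s], log_emit[s, o] are log-probabilities;
    obs is a list of observation indices. Toy interface for illustration.
    """
    n_states, T = len(log_init), len(obs)
    delta = log_init + log_emit[:, obs[0]]
    back = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (from-state, to-state)
        back[t] = scores.argmax(axis=0)       # best predecessor per state
        delta = scores.max(axis=0) + log_emit[:, obs[t]]
    # Trace back the best path from the highest-scoring final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```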
POS Tagging Results

  model   error   oov error
  HMM     5.69%   45.99%
  MEMM    6.37%   54.61%
  CRF     5.55%   48.05%
  MEMM+   4.81%   26.99%
  CRF+    4.27%   23.76%

Structured Models: Stochastic Grammars
Stochastic CFG Inference

- Inside-outside algorithm (Baker 79): find rule probabilities that locally
  maximize the likelihood of a training corpus (an instance of EM)

SCFG Derivations

[Figure: two derivation trees for the same string, each rooted at S with NP
and VP children but with different N'/VP/Adv expansions, and rule
probabilities P(S → ...) attached.]
Inside-Outside Experiments

- Training: raw text vs. text annotated with (some) phrase boundaries:

  (((List (the fares (for ((flight) (number 891)))))).)

- Use only derivations consistent with the given bracketing.

[Plots: training and test cross-entropy vs. iteration, and bracketing
accuracy vs. iteration, comparing raw and bracketed training.]

- Conclusion: structure is not evident from distribution alone
Lexicalization

- Markov model:

  We need to resolve the issue
  [left-to-right word-to-word links]

- Dependency model:

  We need to resolve the issue
  [head-to-dependent links]

Best Current Models

- Representation: surface trees with head-word propagation
- Generative power still context-free
- Model variables: head word, dependency type, argument vs. adjunct,
  heaviness, slash
- Main challenge: smoothing method for unseen dependencies
- Learned from hand-parsed text (treebank)
- Around 90% constituent accuracy
- Data: frequencies f_vn of (v, n) pairs extracted from text by parsing or
  pattern matching.

[Bar chart: relative frequency (up to 0.08) of the noun "bond" as object of
the verbs own, purchase, select, sell, tender, trade, hold, issue, buy.]
[Figure: nouns ordered by the ratio I(Ñ,V)/I(Ñ,N), increasing along the axis:

  missile 0.835    officer 0.484
  rocket  0.850    aide    0.612
  bullet  0.917    chief   0.649
  gun     0.940    manager 0.651]
Relative Entropy

- Verb-object pairs from the 1988 AP Newswire
- Train: 756,721 verb-object pairs
- Test: held-out test set
- New: held-out data for 1000 nouns not in the training data

[Plot: average relative entropy (bits) vs. number of clusters (0-600) for
train, test and new data.]

Decision Task

- Which of v1 and v2 is more likely to take object o?
- Held out: 10^4 (v2, o) pairs; need to guess from p_c(v2)
- Testing: compare all v1 and v2 such that (v2, o) was held out
- Exceptional: test pairs whose frequency ratio with o is reversed from
  their overall frequency ratio

[Plot: decision error vs. number of clusters (0-500), for all vs.
exceptional test pairs.]