
Lecture Slides for

INTRODUCTION
TO
MACHINE
LEARNING
3RD EDITION
ETHEM ALPAYDIN
The MIT Press, 2014

alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 4:

PARAMETRIC METHODS
Parametric Estimation
Sample: \( X = \{ x^t \}_{t=1}^N \) where \( x^t \sim p(x) \)

Parametric estimation: assume a form for \( p(x \mid \theta) \) and estimate \( \theta \), its sufficient statistics, using \( X \);
e.g., \( \mathcal{N}(\mu, \sigma^2) \) where \( \theta = \{ \mu, \sigma^2 \} \).
Maximum Likelihood Estimation
Likelihood of \( \theta \) given the sample \( X \):
\[ l(\theta \mid X) = p(X \mid \theta) = \prod_t p(x^t \mid \theta) \]

Log likelihood:
\[ \mathcal{L}(\theta \mid X) = \log l(\theta \mid X) = \sum_t \log p(x^t \mid \theta) \]

Maximum likelihood estimator (MLE):
\[ \theta^* = \arg\max_\theta \mathcal{L}(\theta \mid X) \]
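When no closed form is available, the MLE can be found numerically by minimizing the negative log likelihood. A minimal sketch in Python; the Bernoulli likelihood and the scipy optimizer are illustrative choices, not from the slides:

```python
import numpy as np
from scipy.optimize import minimize_scalar

X = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # observed Bernoulli sample

def neg_log_likelihood(p):
    # -L(p|X) = -sum_t [ x^t log p + (1 - x^t) log(1 - p) ]
    return -np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))

# Maximize L by minimizing -L over the open interval (0, 1)
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)  # close to the closed-form MLE, mean(X) = 0.75
```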
Examples: Bernoulli/Multinomial
Bernoulli: two states, failure/success, \( x \in \{0, 1\} \):
\[ P(x) = p_o^x (1 - p_o)^{1 - x} \]
\[ \mathcal{L}(p_o \mid X) = \log \prod_t p_o^{x^t} (1 - p_o)^{1 - x^t} \]
MLE: \( \hat{p}_o = \sum_t x^t / N \)

Multinomial: \( K > 2 \) states, \( x_i \in \{0, 1\} \):
\[ P(x_1, x_2, \ldots, x_K) = \prod_i p_i^{x_i} \]
\[ \mathcal{L}(p_1, p_2, \ldots, p_K \mid X) = \log \prod_t \prod_i p_i^{x_i^t} \]
MLE: \( \hat{p}_i = \sum_t x_i^t / N \)
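In both cases the MLE reduces to sample frequencies, so no optimizer is needed. A quick check in numpy (variable names are mine):

```python
import numpy as np

# Bernoulli: p_o is the fraction of successes
x = np.array([1, 0, 1, 1, 0])
p_o = x.mean()                       # = sum_t x^t / N

# Multinomial: each row is a one-of-K indicator vector x^t
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
p = X.mean(axis=0)                   # p_i = sum_t x_i^t / N
print(p_o, p)                        # 0.6 [0.5 0.25 0.25]
```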
Gaussian (Normal) Distribution

\( p(x) = \mathcal{N}(\mu, \sigma^2) \):
\[ p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right] \]

MLE for \( \mu \) and \( \sigma^2 \):
\[ m = \frac{\sum_t x^t}{N} \qquad s^2 = \frac{\sum_t (x^t - m)^2}{N} \]
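A minimal numpy check of these estimates; note the \( 1/N \) in \( s^2 \), which matches numpy's default (ddof=0) variance:

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)

m = x.sum() / len(x)                 # MLE of mu: the sample mean
s2 = ((x - m) ** 2).sum() / len(x)   # MLE of sigma^2: divide by N, not N-1

# numpy's var uses ddof=0 by default, i.e. the same 1/N estimator
assert np.isclose(s2, x.var())
print(m, s2)
```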
Bias and Variance
Unknown parameter \( \theta \); estimator \( d_i = d(X_i) \) on sample \( X_i \).

Bias: \( b_\theta(d) = E[d] - \theta \)
Variance: \( E[(d - E[d])^2] \)

Mean square error:
\[ r(d, \theta) = E[(d - \theta)^2] = (E[d] - \theta)^2 + E[(d - E[d])^2] = \text{Bias}^2 + \text{Variance} \]
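For example, the ML variance estimator \( s^2 \) above is biased: \( E[s^2] = \frac{N-1}{N}\sigma^2 \). A simulation sketch (sample size, variance, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma2 = 5, 4.0

# Draw many samples, compute the 1/N variance estimate on each, and average
s2_values = [np.var(rng.normal(0, np.sqrt(sigma2), N)) for _ in range(100_000)]
print(np.mean(s2_values))            # ~ (N-1)/N * sigma2 = 3.2, not 4.0
```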
Bayes Estimator
Treat \( \theta \) as a random variable with prior \( p(\theta) \).

Bayes' rule: \( p(\theta \mid X) = p(X \mid \theta)\, p(\theta) / p(X) \)
Full Bayes: \( p(x \mid X) = \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta \)

Maximum a posteriori (MAP): \( \theta_{MAP} = \arg\max_\theta p(\theta \mid X) \)
Maximum likelihood (ML): \( \theta_{ML} = \arg\max_\theta p(X \mid \theta) \)
Bayes estimator: \( \theta_{Bayes} = E[\theta \mid X] = \int \theta\, p(\theta \mid X)\, d\theta \)
Bayes Estimator: Example

\( x^t \sim \mathcal{N}(\theta, \sigma_o^2) \) and \( \theta \sim \mathcal{N}(\mu, \sigma^2) \)

\( \theta_{ML} = m \)
\[ \theta_{MAP} = \theta_{Bayes} = E[\theta \mid X] = \frac{N/\sigma_o^2}{N/\sigma_o^2 + 1/\sigma^2}\, m + \frac{1/\sigma^2}{N/\sigma_o^2 + 1/\sigma^2}\, \mu \]
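A direct numeric sketch of this shrinkage formula (all parameter values are invented):

```python
import numpy as np

x = np.array([3.1, 2.7, 3.4, 2.9])       # x^t ~ N(theta, sigma_o^2)
mu, sigma2 = 0.0, 1.0                     # prior: theta ~ N(mu, sigma^2)
sigma_o2 = 0.5                            # known observation variance
N, m = len(x), x.mean()

# Posterior mean: precision-weighted average of sample mean and prior mean
w = (N / sigma_o2) / (N / sigma_o2 + 1 / sigma2)
theta_bayes = w * m + (1 - w) * mu
print(m, theta_bayes)                     # the posterior mean shrinks m toward mu
```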
Parametric Classification
Discriminant:
\( g_i(x) = p(x \mid C_i)\, P(C_i) \), or
\( g_i(x) = \log p(x \mid C_i) + \log P(C_i) \)

With Gaussian class densities
\[ p(x \mid C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left[ -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right] \]
the discriminant becomes
\[ g_i(x) = -\frac{1}{2} \log 2\pi - \log \sigma_i - \frac{(x - \mu_i)^2}{2\sigma_i^2} + \log P(C_i) \]

Given the sample \( X = \{ x^t, r^t \}_{t=1}^N \) with
\[ r_i^t = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases} \]
the ML estimates are
\[ \hat{P}(C_i) = \frac{\sum_t r_i^t}{N} \qquad m_i = \frac{\sum_t x^t r_i^t}{\sum_t r_i^t} \qquad s_i^2 = \frac{\sum_t (x^t - m_i)^2 r_i^t}{\sum_t r_i^t} \]
giving the discriminant
\[ g_i(x) = -\frac{1}{2} \log 2\pi - \log s_i - \frac{(x - m_i)^2}{2s_i^2} + \log \hat{P}(C_i) \]
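A compact sketch of this classifier in Python: estimate the per-class priors, means, and variances, then classify by the largest discriminant (the toy data and function names are invented):

```python
import numpy as np

def fit(x, y, classes):
    """Per-class prior, mean, and (biased, 1/N) variance ML estimates."""
    params = {}
    for c in classes:
        xc = x[y == c]
        params[c] = (len(xc) / len(x), xc.mean(), xc.var())
    return params

def predict(x0, params):
    """Pick the class maximizing g_i(x) = log p(x|C_i) + log P(C_i)."""
    def g(c):
        prior, m, s2 = params[c]
        return -0.5 * np.log(2 * np.pi * s2) - (x0 - m) ** 2 / (2 * s2) + np.log(prior)
    return max(params, key=g)

x = np.array([1.0, 1.2, 0.8, 3.9, 4.1, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])
params = fit(x, y, classes=[0, 1])
print(predict(2.0, params))   # 0: closer to the first class mean
```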
[Figure: with equal variances, there is a single boundary halfway between the means.]

[Figure: with different variances, there are two boundaries.]
Regression
\( r = f(x) + \epsilon \), with estimator \( g(x \mid \theta) \) and noise \( \epsilon \sim \mathcal{N}(0, \sigma^2) \), so
\( p(r \mid x) \sim \mathcal{N}(g(x \mid \theta), \sigma^2) \)

\[ \mathcal{L}(\theta \mid X) = \log \prod_{t=1}^N p(x^t, r^t) = \sum_{t=1}^N \log p(r^t \mid x^t) + \sum_{t=1}^N \log p(x^t) \]
Regression: From LogL to Error
\[ \mathcal{L}(\theta \mid X) = \log \prod_{t=1}^N \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{\left( r^t - g(x^t \mid \theta) \right)^2}{2\sigma^2} \right] \]
\[ = -N \log\!\left( \sqrt{2\pi}\,\sigma \right) - \frac{1}{2\sigma^2} \sum_{t=1}^N \left[ r^t - g(x^t \mid \theta) \right]^2 \]

Maximizing the log likelihood is therefore equivalent to minimizing the squared error
\[ E(\theta \mid X) = \frac{1}{2} \sum_{t=1}^N \left[ r^t - g(x^t \mid \theta) \right]^2 \]
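A quick numeric illustration that the likelihood-maximizing parameter equals the squared-error minimizer, using a constant model \( g(x \mid \theta) = \theta \) and invented data:

```python
import numpy as np

r = np.array([1.0, 2.0, 4.0])
thetas = np.linspace(0, 5, 501)

# For a constant model g(x|theta) = theta and fixed sigma = 1,
# the log likelihood is a decreasing function of the squared error
sq_err = ((r[:, None] - thetas) ** 2).sum(axis=0) / 2
log_lik = -len(r) * np.log(np.sqrt(2 * np.pi)) - sq_err

print(thetas[log_lik.argmax()], thetas[sq_err.argmin()])  # both ~ mean(r)
```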
Linear Regression
\( g(x^t \mid w_1, w_0) = w_1 x^t + w_0 \)

Setting the derivatives of the squared error to zero gives the normal equations
\[ \sum_t r^t = N w_0 + w_1 \sum_t x^t \]
\[ \sum_t r^t x^t = w_0 \sum_t x^t + w_1 \sum_t (x^t)^2 \]
which in matrix form are \( A\,\mathbf{w} = \mathbf{y} \) with
\[ A = \begin{bmatrix} N & \sum_t x^t \\ \sum_t x^t & \sum_t (x^t)^2 \end{bmatrix} \qquad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} \qquad \mathbf{y} = \begin{bmatrix} \sum_t r^t \\ \sum_t r^t x^t \end{bmatrix} \]
so \( \mathbf{w} = A^{-1} \mathbf{y} \).
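A sketch solving these normal equations with numpy (data invented; np.linalg.solve is preferred over forming \( A^{-1} \) explicitly):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
r = np.array([1.1, 2.9, 5.2, 6.8])
N = len(x)

A = np.array([[N,        x.sum()],
              [x.sum(),  (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])

w0, w1 = np.linalg.solve(A, y)        # w = A^{-1} y
print(w0, w1)                          # intercept and slope
```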
Polynomial Regression

\[ g(x^t \mid w_k, \ldots, w_2, w_1, w_0) = w_k (x^t)^k + \cdots + w_2 (x^t)^2 + w_1 x^t + w_0 \]

With the design matrix and target vector
\[ \mathbf{D} = \begin{bmatrix} 1 & x^1 & (x^1)^2 & \cdots & (x^1)^k \\ 1 & x^2 & (x^2)^2 & \cdots & (x^2)^k \\ \vdots & & & & \vdots \\ 1 & x^N & (x^N)^2 & \cdots & (x^N)^k \end{bmatrix} \qquad \mathbf{r} = \begin{bmatrix} r^1 \\ r^2 \\ \vdots \\ r^N \end{bmatrix} \]
the least-squares solution is
\[ \mathbf{w} = (\mathbf{D}^T \mathbf{D})^{-1} \mathbf{D}^T \mathbf{r} \]
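A sketch of the same fit in numpy; np.linalg.lstsq solves the least-squares problem more stably than inverting \( \mathbf{D}^T \mathbf{D} \) (data invented):

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
r = np.array([0.1, 0.4, 1.1, 2.2, 4.1])
k = 2

# Design matrix D: columns 1, x, x^2, ..., x^k
D = np.vander(x, k + 1, increasing=True)

w, *_ = np.linalg.lstsq(D, r, rcond=None)  # solves min ||D w - r||^2
print(w)                                   # [w_0, w_1, ..., w_k]
```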
Other Error Measures
Square error:
\[ E(\theta \mid X) = \frac{1}{2} \sum_{t=1}^N \left[ r^t - g(x^t \mid \theta) \right]^2 \]

Relative square error:
\[ E(\theta \mid X) = \frac{\sum_{t=1}^N \left[ r^t - g(x^t \mid \theta) \right]^2}{\sum_{t=1}^N \left( r^t - \bar{r} \right)^2} \]

Absolute error: \( E(\theta \mid X) = \sum_t | r^t - g(x^t \mid \theta) | \)

\( \epsilon \)-sensitive error: \( E(\theta \mid X) = \sum_t \mathbf{1}\big( |r^t - g(x^t \mid \theta)| > \epsilon \big)\, \big( |r^t - g(x^t \mid \theta)| - \epsilon \big) \)
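These measures are straightforward to compute from residuals; a small helper sketch (function and variable names are mine):

```python
import numpy as np

def errors(r, g, eps=0.1):
    """Return the four error measures for targets r and predictions g."""
    res = np.abs(r - g)
    return {
        "square": 0.5 * (res ** 2).sum(),
        "relative_square": (res ** 2).sum() / ((r - r.mean()) ** 2).sum(),
        "absolute": res.sum(),
        "eps_sensitive": np.where(res > eps, res - eps, 0.0).sum(),
    }

r = np.array([1.0, 2.0, 3.0])
g = np.array([1.1, 1.8, 3.3])
print(errors(r, g))
```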
Bias and Variance
The expected square error at \( x \) decomposes as
\[ E\big[ (r - g(x))^2 \mid x \big] = \underbrace{E\big[ (r - E[r \mid x])^2 \mid x \big]}_{\text{noise}} + \underbrace{\big( E[r \mid x] - g(x) \big)^2}_{\text{squared error}} \]
and, averaging over samples \( X \),
\[ E_X\big[ (E[r \mid x] - g(x))^2 \mid x \big] = \underbrace{\big( E[r \mid x] - E_X[g(x)] \big)^2}_{\text{bias}^2} + \underbrace{E_X\big[ (g(x) - E_X[g(x)])^2 \big]}_{\text{variance}} \]
Estimating Bias and Variance
\( M \) samples \( X_i = \{ x^t_i, r^t_i \} \), \( i = 1, \ldots, M \), are used to fit \( g_i(x) \), \( i = 1, \ldots, M \):
\[ \text{Bias}^2(g) = \frac{1}{N} \sum_t \left[ \bar{g}(x^t) - f(x^t) \right]^2 \]
\[ \text{Variance}(g) = \frac{1}{NM} \sum_t \sum_i \left[ g_i(x^t) - \bar{g}(x^t) \right]^2 \]
where
\[ \bar{g}(x) = \frac{1}{M} \sum_{i=1}^M g_i(x) \]
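A simulation sketch of these estimates, using the sample-mean model from the next slide as \( g_i \) (the true function \( f \), noise level, and sample sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)                      # true function, assumed known here
x = np.linspace(0, np.pi, 20)                # evaluation points x^t
M, N = 100, 20

# Fit M constant models g_i(x) = mean of the i-th sample's noisy targets
g = np.array([np.full_like(x, (f(x) + rng.normal(0, 0.5, N)).mean())
              for _ in range(M)])            # shape (M, N)

g_bar = g.mean(axis=0)                       # average model gbar(x^t)
bias2 = ((g_bar - f(x)) ** 2).mean()         # (1/N) sum_t [gbar - f]^2
variance = ((g - g_bar) ** 2).mean()         # (1/NM) sum_t sum_i [g_i - gbar]^2
print(bias2, variance)
```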
Bias/Variance Dilemma
Example: \( g_i(x) = 2 \) has no variance and high bias;
\( g_i(x) = \sum_t r^t_i / N \) has lower bias, but with variance.

As we increase complexity,
bias decreases (a better fit to data) and
variance increases (fit varies more with data).
Bias/variance dilemma (Geman et al., 1992).

[Figure: the true function \( f \), the fitted functions \( g_i \) and their average \( \bar{g} \), with bias and variance indicated.]
Polynomial Regression
[Figure: polynomial fits of increasing order; the best fit minimizes the error.]

[Figure: error vs. model complexity; the best fit is at the elbow of the curve.]
Model Selection
Cross-validation: measure generalization accuracy by testing on data unused during training (see the sketch after this list).
Regularization: penalize complex models:
E' = error on data + λ · model complexity
Akaike's information criterion (AIC), Bayesian information criterion (BIC)
Minimum description length (MDL): Kolmogorov complexity, shortest description of data
Structural risk minimization (SRM)
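A minimal cross-validation sketch for choosing polynomial order (the data, split rule, and seed are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
r = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)

# Hold out every third point as a validation set
val = np.arange(40) % 3 == 0
x_tr, r_tr, x_va, r_va = x[~val], r[~val], x[val], r[val]

for k in range(1, 8):
    w = np.polyfit(x_tr, r_tr, k)             # fit an order-k polynomial
    err = ((np.polyval(w, x_va) - r_va) ** 2).mean()
    print(k, round(err, 4))                    # pick k with lowest validation error
```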
Bayesian Model Selection
Prior on models, \( p(\text{model}) \):
\[ p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})} \]

Regularization, when the prior favors simpler models.
Bayes: MAP of the posterior, \( p(\text{model} \mid \text{data}) \).
Average over a number of models with high posterior (voting, ensembles: Chapter 17).
Regression example
Coefficients increase in magnitude as order increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]

Regularization (L2):
\[ E(\mathbf{w} \mid X) = \frac{1}{2} \sum_{t=1}^N \left[ r^t - g(x^t \mid \mathbf{w}) \right]^2 + \lambda \sum_i w_i^2 \]
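A ridge-regression sketch for this penalized error; setting the gradient to zero gives a closed form (data, order, and λ invented; whether to penalize the intercept is a design choice, here it is penalized for simplicity):

```python
import numpy as np

x = np.linspace(0, 1, 20)
r = np.sin(2 * np.pi * x) + np.random.default_rng(3).normal(0, 0.2, 20)
k, lam = 7, 1e-3

D = np.vander(x, k + 1, increasing=True)     # columns 1, x, ..., x^k

# w = (D^T D + lam*I)^{-1} D^T r, the minimizer of
# (1/2)||r - Dw||^2 + (lam/2)||w||^2
w = np.linalg.solve(D.T @ D + lam * np.eye(k + 1), D.T @ r)
print(np.abs(w).max())                       # coefficients stay moderate
```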
