
Lecture Slides for

INTRODUCTION
TO
MACHINE
LEARNING
3RD EDITION
ETHEM ALPAYDIN
The MIT Press, 2014

alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 4:

PARAMETRIC METHODS
Parametric Estimation
Sample: \( X = \{ x^t \}_{t=1}^N \) where \( x^t \sim p(x) \)

Parametric estimation: assume a form for \( p(x \mid \theta) \) and estimate \( \theta \), its sufficient statistics, using \( X \);
e.g., \( \mathcal{N}(\mu, \sigma^2) \) where \( \theta = \{ \mu, \sigma^2 \} \).
Maximum Likelihood Estimation
Likelihood of \( \theta \) given the sample \( X \):
\[ l(\theta \mid X) = p(X \mid \theta) = \prod_t p(x^t \mid \theta) \]

Log likelihood:
\[ \mathcal{L}(\theta \mid X) = \log l(\theta \mid X) = \sum_t \log p(x^t \mid \theta) \]

Maximum likelihood estimator (MLE):
\[ \theta^* = \arg\max_\theta \mathcal{L}(\theta \mid X) \]
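When no closed form is available, the MLE can be found numerically by minimizing the negative log likelihood. A minimal sketch in Python; the Bernoulli likelihood and the scipy optimizer are illustrative choices, not from the slides:

```python
import numpy as np
from scipy.optimize import minimize_scalar

X = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # observed Bernoulli sample

def neg_log_likelihood(p):
    # -L(p|X) = -sum_t [ x^t log p + (1 - x^t) log(1 - p) ]
    return -np.sum(X * np.log(p) + (1 - X) * np.log(1 - p))

# Maximize L by minimizing -L over the open interval (0, 1)
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)  # close to the closed-form MLE, mean(X) = 0.75
```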
Examples: Bernoulli/Multinomial
Bernoulli: two states, failure/success, \( x \in \{0, 1\} \):
\[ P(x) = p_o^x (1 - p_o)^{1 - x} \]
\[ \mathcal{L}(p_o \mid X) = \log \prod_t p_o^{x^t} (1 - p_o)^{1 - x^t} \]
MLE: \( \hat{p}_o = \sum_t x^t / N \)

Multinomial: \( K > 2 \) states, \( x_i \in \{0, 1\} \):
\[ P(x_1, x_2, \ldots, x_K) = \prod_i p_i^{x_i} \]
\[ \mathcal{L}(p_1, p_2, \ldots, p_K \mid X) = \log \prod_t \prod_i p_i^{x_i^t} \]
MLE: \( \hat{p}_i = \sum_t x_i^t / N \)
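In both cases the MLE reduces to sample frequencies, so no optimizer is needed. A quick check in numpy (variable names are mine):

```python
import numpy as np

# Bernoulli: p_o is the fraction of successes
x = np.array([1, 0, 1, 1, 0])
p_o = x.mean()                       # = sum_t x^t / N

# Multinomial: each row is a one-of-K indicator vector x^t
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
p = X.mean(axis=0)                   # p_i = sum_t x_i^t / N
print(p_o, p)                        # 0.6 [0.5 0.25 0.25]
```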
Gaussian (Normal) Distribution

\( p(x) = \mathcal{N}(\mu, \sigma^2) \):
\[ p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right] \]

MLE for \( \mu \) and \( \sigma^2 \):
\[ m = \frac{\sum_t x^t}{N} \qquad s^2 = \frac{\sum_t (x^t - m)^2}{N} \]
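A minimal numpy check of these estimates; note the \( 1/N \) in \( s^2 \), which matches numpy's default (ddof=0) variance:

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)

m = x.sum() / len(x)                 # MLE of mu: the sample mean
s2 = ((x - m) ** 2).sum() / len(x)   # MLE of sigma^2: divide by N, not N-1

# numpy's var uses ddof=0 by default, i.e. the same 1/N estimator
assert np.isclose(s2, x.var())
print(m, s2)
```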
Bias and Variance
Unknown parameter \( \theta \); estimator \( d_i = d(X_i) \) on sample \( X_i \).

Bias: \( b_\theta(d) = E[d] - \theta \)
Variance: \( E[(d - E[d])^2] \)

Mean square error:
\[ r(d, \theta) = E[(d - \theta)^2] = (E[d] - \theta)^2 + E[(d - E[d])^2] = \text{Bias}^2 + \text{Variance} \]
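For example, the ML variance estimator \( s^2 \) above is biased: \( E[s^2] = \frac{N-1}{N}\sigma^2 \). A simulation sketch (sample size, variance, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma2 = 5, 4.0

# Draw many samples, compute the 1/N variance estimate on each, and average
s2_values = [np.var(rng.normal(0, np.sqrt(sigma2), N)) for _ in range(100_000)]
print(np.mean(s2_values))            # ~ (N-1)/N * sigma2 = 3.2, not 4.0
```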
Bayes Estimator
Treat \( \theta \) as a random variable with prior \( p(\theta) \).

Bayes' rule: \( p(\theta \mid X) = p(X \mid \theta)\, p(\theta) / p(X) \)
Full Bayes: \( p(x \mid X) = \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta \)

Maximum a posteriori (MAP): \( \theta_{MAP} = \arg\max_\theta p(\theta \mid X) \)
Maximum likelihood (ML): \( \theta_{ML} = \arg\max_\theta p(X \mid \theta) \)
Bayes estimator: \( \theta_{Bayes} = E[\theta \mid X] = \int \theta\, p(\theta \mid X)\, d\theta \)
Bayes Estimator: Example

\( x^t \sim \mathcal{N}(\theta, \sigma_o^2) \) and \( \theta \sim \mathcal{N}(\mu, \sigma^2) \)

\( \theta_{ML} = m \)
\[ \theta_{MAP} = \theta_{Bayes} = E[\theta \mid X] = \frac{N/\sigma_o^2}{N/\sigma_o^2 + 1/\sigma^2}\, m + \frac{1/\sigma^2}{N/\sigma_o^2 + 1/\sigma^2}\, \mu \]
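A direct numeric sketch of this shrinkage formula (all parameter values are invented):

```python
import numpy as np

x = np.array([3.1, 2.7, 3.4, 2.9])       # x^t ~ N(theta, sigma_o^2)
mu, sigma2 = 0.0, 1.0                     # prior: theta ~ N(mu, sigma^2)
sigma_o2 = 0.5                            # known observation variance
N, m = len(x), x.mean()

# Posterior mean: precision-weighted average of sample mean and prior mean
w = (N / sigma_o2) / (N / sigma_o2 + 1 / sigma2)
theta_bayes = w * m + (1 - w) * mu
print(m, theta_bayes)                     # the posterior mean shrinks m toward mu
```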
Parametric Classification
Discriminant:
\( g_i(x) = p(x \mid C_i)\, P(C_i) \), or
\( g_i(x) = \log p(x \mid C_i) + \log P(C_i) \)

With Gaussian class densities
\[ p(x \mid C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left[ -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right] \]
the discriminant becomes
\[ g_i(x) = -\frac{1}{2} \log 2\pi - \log \sigma_i - \frac{(x - \mu_i)^2}{2\sigma_i^2} + \log P(C_i) \]

Given the sample \( X = \{ x^t, r^t \}_{t=1}^N \) with
\[ r_i^t = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases} \]
the ML estimates are
\[ \hat{P}(C_i) = \frac{\sum_t r_i^t}{N} \qquad m_i = \frac{\sum_t x^t r_i^t}{\sum_t r_i^t} \qquad s_i^2 = \frac{\sum_t (x^t - m_i)^2 r_i^t}{\sum_t r_i^t} \]
giving the discriminant
\[ g_i(x) = -\frac{1}{2} \log 2\pi - \log s_i - \frac{(x - m_i)^2}{2s_i^2} + \log \hat{P}(C_i) \]
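A compact sketch of this classifier in Python: estimate the per-class priors, means, and variances, then classify by the largest discriminant (the toy data and function names are invented):

```python
import numpy as np

def fit(x, y, classes):
    """Per-class prior, mean, and (biased, 1/N) variance ML estimates."""
    params = {}
    for c in classes:
        xc = x[y == c]
        params[c] = (len(xc) / len(x), xc.mean(), xc.var())
    return params

def predict(x0, params):
    """Pick the class maximizing g_i(x) = log p(x|C_i) + log P(C_i)."""
    def g(c):
        prior, m, s2 = params[c]
        return -0.5 * np.log(2 * np.pi * s2) - (x0 - m) ** 2 / (2 * s2) + np.log(prior)
    return max(params, key=g)

x = np.array([1.0, 1.2, 0.8, 3.9, 4.1, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])
params = fit(x, y, classes=[0, 1])
print(predict(2.0, params))   # 0: closer to the first class mean
```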
[Figure: with equal variances, there is a single boundary halfway between the means.]

[Figure: with different variances, there are two boundaries.]
Regression
\( r = f(x) + \epsilon \), with estimator \( g(x \mid \theta) \) and noise \( \epsilon \sim \mathcal{N}(0, \sigma^2) \), so
\( p(r \mid x) \sim \mathcal{N}(g(x \mid \theta), \sigma^2) \)

\[ \mathcal{L}(\theta \mid X) = \log \prod_{t=1}^N p(x^t, r^t) = \sum_{t=1}^N \log p(r^t \mid x^t) + \sum_{t=1}^N \log p(x^t) \]
Regression: From LogL to Error
\[ \mathcal{L}(\theta \mid X) = \log \prod_{t=1}^N \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{\left( r^t - g(x^t \mid \theta) \right)^2}{2\sigma^2} \right] \]
\[ = -N \log\!\left( \sqrt{2\pi}\,\sigma \right) - \frac{1}{2\sigma^2} \sum_{t=1}^N \left[ r^t - g(x^t \mid \theta) \right]^2 \]

Maximizing the log likelihood is therefore equivalent to minimizing the squared error
\[ E(\theta \mid X) = \frac{1}{2} \sum_{t=1}^N \left[ r^t - g(x^t \mid \theta) \right]^2 \]
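A quick numeric illustration that the likelihood-maximizing parameter equals the squared-error minimizer, using a constant model \( g(x \mid \theta) = \theta \) and invented data:

```python
import numpy as np

r = np.array([1.0, 2.0, 4.0])
thetas = np.linspace(0, 5, 501)

# For a constant model g(x|theta) = theta and fixed sigma = 1,
# the log likelihood is a decreasing function of the squared error
sq_err = ((r[:, None] - thetas) ** 2).sum(axis=0) / 2
log_lik = -len(r) * np.log(np.sqrt(2 * np.pi)) - sq_err

print(thetas[log_lik.argmax()], thetas[sq_err.argmin()])  # both ~ mean(r)
```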
Linear Regression
\( g(x^t \mid w_1, w_0) = w_1 x^t + w_0 \)

Setting the derivatives of the squared error to zero gives the normal equations
\[ \sum_t r^t = N w_0 + w_1 \sum_t x^t \]
\[ \sum_t r^t x^t = w_0 \sum_t x^t + w_1 \sum_t (x^t)^2 \]
which in matrix form are \( A\,\mathbf{w} = \mathbf{y} \) with
\[ A = \begin{bmatrix} N & \sum_t x^t \\ \sum_t x^t & \sum_t (x^t)^2 \end{bmatrix} \qquad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} \qquad \mathbf{y} = \begin{bmatrix} \sum_t r^t \\ \sum_t r^t x^t \end{bmatrix} \]
so \( \mathbf{w} = A^{-1} \mathbf{y} \).
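A sketch solving these normal equations with numpy (data invented; np.linalg.solve is preferred over forming \( A^{-1} \) explicitly):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
r = np.array([1.1, 2.9, 5.2, 6.8])
N = len(x)

A = np.array([[N,        x.sum()],
              [x.sum(),  (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])

w0, w1 = np.linalg.solve(A, y)        # w = A^{-1} y
print(w0, w1)                          # intercept and slope
```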
Polynomial Regression

\[ g(x^t \mid w_k, \ldots, w_2, w_1, w_0) = w_k (x^t)^k + \cdots + w_2 (x^t)^2 + w_1 x^t + w_0 \]

With the design matrix and target vector
\[ \mathbf{D} = \begin{bmatrix} 1 & x^1 & (x^1)^2 & \cdots & (x^1)^k \\ 1 & x^2 & (x^2)^2 & \cdots & (x^2)^k \\ \vdots & & & & \vdots \\ 1 & x^N & (x^N)^2 & \cdots & (x^N)^k \end{bmatrix} \qquad \mathbf{r} = \begin{bmatrix} r^1 \\ r^2 \\ \vdots \\ r^N \end{bmatrix} \]
the least-squares solution is
\[ \mathbf{w} = (\mathbf{D}^T \mathbf{D})^{-1} \mathbf{D}^T \mathbf{r} \]
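A sketch of the same fit in numpy; np.linalg.lstsq solves the least-squares problem more stably than inverting \( \mathbf{D}^T \mathbf{D} \) (data invented):

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
r = np.array([0.1, 0.4, 1.1, 2.2, 4.1])
k = 2

# Design matrix D: columns 1, x, x^2, ..., x^k
D = np.vander(x, k + 1, increasing=True)

w, *_ = np.linalg.lstsq(D, r, rcond=None)  # solves min ||D w - r||^2
print(w)                                   # [w_0, w_1, ..., w_k]
```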
Other Error Measures
Square error:
\[ E(\theta \mid X) = \frac{1}{2} \sum_{t=1}^N \left[ r^t - g(x^t \mid \theta) \right]^2 \]

Relative square error:
\[ E(\theta \mid X) = \frac{\sum_{t=1}^N \left[ r^t - g(x^t \mid \theta) \right]^2}{\sum_{t=1}^N \left( r^t - \bar{r} \right)^2} \]

Absolute error: \( E(\theta \mid X) = \sum_t | r^t - g(x^t \mid \theta) | \)

\( \epsilon \)-sensitive error: \( E(\theta \mid X) = \sum_t \mathbf{1}\big( |r^t - g(x^t \mid \theta)| > \epsilon \big)\, \big( |r^t - g(x^t \mid \theta)| - \epsilon \big) \)
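These measures are straightforward to compute from residuals; a small helper sketch (function and variable names are mine):

```python
import numpy as np

def errors(r, g, eps=0.1):
    """Return the four error measures for targets r and predictions g."""
    res = np.abs(r - g)
    return {
        "square": 0.5 * (res ** 2).sum(),
        "relative_square": (res ** 2).sum() / ((r - r.mean()) ** 2).sum(),
        "absolute": res.sum(),
        "eps_sensitive": np.where(res > eps, res - eps, 0.0).sum(),
    }

r = np.array([1.0, 2.0, 3.0])
g = np.array([1.1, 1.8, 3.3])
print(errors(r, g))
```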
Bias and Variance
The expected square error at \( x \) decomposes as
\[ E\big[ (r - g(x))^2 \mid x \big] = \underbrace{E\big[ (r - E[r \mid x])^2 \mid x \big]}_{\text{noise}} + \underbrace{\big( E[r \mid x] - g(x) \big)^2}_{\text{squared error}} \]
and, averaging over samples \( X \),
\[ E_X\big[ (E[r \mid x] - g(x))^2 \mid x \big] = \underbrace{\big( E[r \mid x] - E_X[g(x)] \big)^2}_{\text{bias}^2} + \underbrace{E_X\big[ (g(x) - E_X[g(x)])^2 \big]}_{\text{variance}} \]
Estimating Bias and Variance
\( M \) samples \( X_i = \{ x^t_i, r^t_i \} \), \( i = 1, \ldots, M \), are used to fit \( g_i(x) \), \( i = 1, \ldots, M \):
\[ \text{Bias}^2(g) = \frac{1}{N} \sum_t \left[ \bar{g}(x^t) - f(x^t) \right]^2 \]
\[ \text{Variance}(g) = \frac{1}{NM} \sum_t \sum_i \left[ g_i(x^t) - \bar{g}(x^t) \right]^2 \]
where
\[ \bar{g}(x) = \frac{1}{M} \sum_{i=1}^M g_i(x) \]
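A simulation sketch of these estimates, using the sample-mean model from the next slide as \( g_i \) (the true function \( f \), noise level, and sample sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)                      # true function, assumed known here
x = np.linspace(0, np.pi, 20)                # evaluation points x^t
M, N = 100, 20

# Fit M constant models g_i(x) = mean of the i-th sample's noisy targets
g = np.array([np.full_like(x, (f(x) + rng.normal(0, 0.5, N)).mean())
              for _ in range(M)])            # shape (M, N)

g_bar = g.mean(axis=0)                       # average model gbar(x^t)
bias2 = ((g_bar - f(x)) ** 2).mean()         # (1/N) sum_t [gbar - f]^2
variance = ((g - g_bar) ** 2).mean()         # (1/NM) sum_t sum_i [g_i - gbar]^2
print(bias2, variance)
```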
Bias/Variance Dilemma
Example: \( g_i(x) = 2 \) has no variance and high bias;
\( g_i(x) = \sum_t r^t_i / N \) has lower bias, but with variance.

As we increase complexity,
bias decreases (a better fit to data) and
variance increases (fit varies more with data).
Bias/variance dilemma (Geman et al., 1992).

[Figure: the true function \( f \), the fitted functions \( g_i \) and their average \( \bar{g} \), with bias and variance indicated.]
Polynomial Regression
[Figure: polynomial fits of increasing order; the best fit minimizes the error.]

[Figure: error vs. model complexity; the best fit is at the elbow of the curve.]
Model Selection
Cross-validation: measure generalization accuracy by testing on data unused during training (see the sketch after this list).
Regularization: penalize complex models:
E' = error on data + λ · model complexity
Akaike's information criterion (AIC), Bayesian information criterion (BIC)
Minimum description length (MDL): Kolmogorov complexity, shortest description of data
Structural risk minimization (SRM)
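A minimal cross-validation sketch for choosing polynomial order (the data, split rule, and seed are invented):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
r = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)

# Hold out every third point as a validation set
val = np.arange(40) % 3 == 0
x_tr, r_tr, x_va, r_va = x[~val], r[~val], x[val], r[val]

for k in range(1, 8):
    w = np.polyfit(x_tr, r_tr, k)             # fit an order-k polynomial
    err = ((np.polyval(w, x_va) - r_va) ** 2).mean()
    print(k, round(err, 4))                    # pick k with lowest validation error
```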
Bayesian Model Selection
Prior on models, \( p(\text{model}) \):
\[ p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})} \]

Regularization, when the prior favors simpler models.
Bayes: MAP of the posterior, \( p(\text{model} \mid \text{data}) \).
Average over a number of models with high posterior (voting, ensembles: Chapter 17).
Regression example
Coefficients increase in magnitude as order increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]

Regularization (L2):
\[ E(\mathbf{w} \mid X) = \frac{1}{2} \sum_{t=1}^N \left[ r^t - g(x^t \mid \mathbf{w}) \right]^2 + \lambda \sum_i w_i^2 \]
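A ridge-regression sketch for this penalized error; setting the gradient to zero gives a closed form (data, order, and λ invented; whether to penalize the intercept is a design choice, here it is penalized for simplicity):

```python
import numpy as np

x = np.linspace(0, 1, 20)
r = np.sin(2 * np.pi * x) + np.random.default_rng(3).normal(0, 0.2, 20)
k, lam = 7, 1e-3

D = np.vander(x, k + 1, increasing=True)     # columns 1, x, ..., x^k

# w = (D^T D + lam*I)^{-1} D^T r, the minimizer of
# (1/2)||r - Dw||^2 + (lam/2)||w||^2
w = np.linalg.solve(D.T @ D + lam * np.eye(k + 1), D.T @ r)
print(np.abs(w).max())                       # coefficients stay moderate
```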
