
Maximum Likelihood Estimation

Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University
References:
1. Ethem Alpaydin, Introduction to Machine Learning, Chapter 4, MIT Press, 2004
Sample Statistics and Population Parameters
A Schematic Depiction

(Figure: statistics computed from a sample are used to make inferences about the parameters of the population, i.e., the sample space)
SP - Berlin Chen 2
Introduction

Statistic
Any value (or function) that is calculated from a given sample
Statistical inference: make a decision using the information
provided by a sample (or a set of examples/instances)

Parametric methods
Assume that examples are drawn from some distribution that
obeys a known model p(x|θ)
Advantage: the model is well defined up to a small number of
parameters
E.g., mean and variance are sufficient statistics for the
Gaussian distribution
Model parameters are typically estimated by either maximum
likelihood estimation or Bayesian (MAP) estimation
Maximum Likelihood Estimation (MLE) (1/2)

Assume the instances X = {x^1, x^2, ..., x^t, ..., x^N} are independent
and identically distributed (iid), and drawn from some known
probability distribution

  x^t ~ p(x|θ)

θ: model parameters (assumed to be fixed but unknown here)

MLE attempts to find θ that makes X the most likely to be drawn
Namely, maximize the likelihood of the iid instances x^1, ..., x^N:

  l(θ|X) = p(X|θ) = p(x^1, ..., x^N|θ) = ∏_{t=1}^{N} p(x^t|θ)
MLE (2/2)

Because the logarithm is monotonically increasing, taking it does not
change where the likelihood l(θ|X) attains its maximum
Finding θ that maximizes the likelihood of the instances is therefore
equivalent to finding θ that maximizes the log likelihood of the
samples (using log(ab) = log a + log b):

  L(θ|X) = log l(θ|X) = Σ_{t=1}^{N} log p(x^t|θ)

As we shall see, the logarithmic operation can further simplify the
computation when estimating the parameters of those distributions
that have exponents
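This equivalence can be checked numerically. A minimal sketch under assumed data (a hypothetical Bernoulli sample and a simple grid search, not from the slides): the likelihood, a product over instances, and the log likelihood, a sum over instances, peak at the same parameter value.

```python
import math

# Hypothetical Bernoulli sample, assumed for illustration
sample = [1, 0, 1, 1, 0, 1, 1, 1]

def likelihood(r, xs):
    """l(r|X) = prod_t p(x^t|r): a product over instances."""
    p = 1.0
    for x in xs:
        p *= r ** x * (1 - r) ** (1 - x)
    return p

def log_likelihood(r, xs):
    """L(r|X) = sum_t log p(x^t|r): a sum over instances."""
    return sum(x * math.log(r) + (1 - x) * math.log(1 - r) for x in xs)

# A grid search finds the same maximizer for both criteria,
# because log is monotonically increasing
grid = [i / 100 for i in range(1, 100)]
best_l = max(grid, key=lambda r: likelihood(r, sample))
best_L = max(grid, key=lambda r: log_likelihood(r, sample))
print(best_l, best_L)  # 0.75 0.75
```

Here 0.75 is exactly the fraction of ones in the sample, anticipating the Bernoulli result derived on the following slides.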
MLE: Bernoulli Distribution (1/3)

Bernoulli Distribution
A random variable X takes either the value x = 1 (with probability r)
or the value x = 0 (with probability 1 - r)
Can be thought of as X being generated from two distinct states
The associated probability distribution:

  P(x) = r^x (1 - r)^{1 - x},  x ∈ {0, 1}

The log likelihood for a set of iid instances X = {x^1, x^2, ..., x^t, ..., x^N}
drawn from a Bernoulli distribution:

  L(r|X) = log ∏_{t=1}^{N} r^{x^t} (1 - r)^{1 - x^t}
         = (Σ_{t=1}^{N} x^t) log r + (N - Σ_{t=1}^{N} x^t) log(1 - r)
MLE: Bernoulli Distribution (2/3)

MLE of the distribution parameter r:

  r̂ = (Σ_{t=1}^{N} x^t) / N

The estimate for r is the ratio of the number of occurrences of the
event (x^t = 1) to the number of experiments N

The expected value of X:

  E[X] = Σ_{x ∈ {0,1}} x P(x) = 0·(1 - r) + 1·r = r

The variance of X:

  Var(X) = E[X²] - (E[X])² = r - r² = r(1 - r)
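A minimal sketch of the closed-form Bernoulli MLE, using a hypothetical sample assumed for illustration:

```python
# Hypothetical coin-flip data (1 = success), assumed for illustration
sample = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

N = len(sample)
r_hat = sum(sample) / N           # MLE: fraction of occurrences of x = 1

mean = r_hat                      # E[X] = r
variance = r_hat * (1 - r_hat)    # Var(X) = r(1 - r)
print(r_hat, variance)
```

With 7 successes in 10 trials, r̂ = 0.7 and the plug-in variance is r̂(1 − r̂) = 0.21.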
MLE: Bernoulli Distribution (3/3)

Appendix A

  dL(r|X)/dr = (Σ_{t=1}^{N} x^t)/r - (N - Σ_{t=1}^{N} x^t)/(1 - r) = 0
  (using d(log y)/dy = 1/y)

  ⇒ (1 - r) Σ_{t=1}^{N} x^t = r (N - Σ_{t=1}^{N} x^t)

  ⇒ r̂ = (Σ_{t=1}^{N} x^t) / N

The maximum likelihood estimate of the mean is the sample average
MLE: Multinomial Distribution (1/4)

Multinomial Distribution
A generalization of the Bernoulli distribution
The value of a random variable X can be one of K mutually exclusive
and exhaustive states {s_1, s_2, ..., s_K} with probabilities
r_1, r_2, ..., r_K, respectively
The associated probability distribution:

  p(x) = ∏_{i=1}^{K} r_i^{x_i},  with Σ_{i=1}^{K} r_i = 1,

  where x_i = 1 if X chooses state s_i, and x_i = 0 otherwise

The log likelihood for a set of iid instances X = {x^1, x^2, ..., x^t, ..., x^N}
drawn from a multinomial distribution:

  L(r|X) = Σ_{t=1}^{N} Σ_{i=1}^{K} x_i^t log r_i
MLE: Multinomial Distribution (2/4)

MLE of the distribution parameter r_i:

  r̂_i = (Σ_{t=1}^{N} x_i^t) / N

The estimate for r_i is the ratio of the number of experiments with
outcome of state s_i (x_i^t = 1) to the number of experiments N
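A minimal sketch of the multinomial MLE as state counting. The draws below are hypothetical, chosen to match the P(B) = 3/10, P(W) = 4/10, P(R) = 3/10 example that appears later in the deck:

```python
from collections import Counter

# Hypothetical draws over K = 3 states, assumed for illustration
draws = ["B", "W", "B", "R", "W", "W", "B", "R", "R", "W"]

N = len(draws)
counts = Counter(draws)                        # number of outcomes per state
r_hat = {state: c / N for state, c in counts.items()}
print(r_hat)  # {'B': 0.3, 'W': 0.4, 'R': 0.3}
```

The estimates are non-negative and sum to 1 by construction, so no explicit normalization constraint is needed here.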
MLE: Multinomial Distribution (3/4)

Appendix B

Maximize the log likelihood

  L(r|X) = Σ_{t=1}^{N} Σ_{i=1}^{K} x_i^t log r_i

with the constraint Σ_{i=1}^{K} r_i = 1, via a Lagrange multiplier λ:

  J(r, λ) = Σ_{t=1}^{N} Σ_{i=1}^{K} x_i^t log r_i + λ (1 - Σ_{i=1}^{K} r_i)

  ∂J/∂r_i = (Σ_{t=1}^{N} x_i^t)/r_i - λ = 0  ⇒  r_i = (Σ_{t=1}^{N} x_i^t)/λ

Summing over i, and using Σ_{i=1}^{K} r_i = 1 and Σ_{i=1}^{K} Σ_{t=1}^{N} x_i^t = N:

  1 = (Σ_{i=1}^{K} Σ_{t=1}^{N} x_i^t)/λ = N/λ  ⇒  λ = N

  ⇒ r̂_i = (Σ_{t=1}^{N} x_i^t) / N

Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html
MLE: Multinomial Distribution (4/4)
(Figure: a drawing example whose MLE estimates are P(B) = 3/10, P(W) = 4/10, P(R) = 3/10)
MLE: Gaussian Distribution (1/3)

Also called the Normal Distribution
Characterized by the mean μ and the variance σ²:

  p(x) = (1/(√(2π) σ)) exp(-(x - μ)²/(2σ²)),  -∞ < x < ∞

Recall that mean and variance are sufficient statistics for the
Gaussian

The log likelihood for a set of iid instances X = {x^1, x^2, ..., x^t, ..., x^N}
drawn from a Gaussian distribution:

  L(μ, σ|X) = log ∏_{t=1}^{N} (1/(√(2π) σ)) exp(-(x^t - μ)²/(2σ²))
            = -(N/2) log(2π) - N log σ - (Σ_{t=1}^{N} (x^t - μ)²)/(2σ²)
MLE: Gaussian Distribution (2/3)

MLE of the distribution parameters μ and σ:

  μ̂ = m = (Σ_{t=1}^{N} x^t)/N              (sample average)

  σ̂² = s² = (Σ_{t=1}^{N} (x^t - m)²)/N     (sample variance)

Remember that μ and σ are still fixed but unknown
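A minimal sketch of the Gaussian MLE on a hypothetical sample (values assumed for illustration); note that the variance estimate divides by N, not N - 1:

```python
# Hypothetical observations, assumed for illustration
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

N = len(sample)
m = sum(sample) / N                           # sample average (MLE of mu)
s2 = sum((x - m) ** 2 for x in sample) / N    # sample variance (MLE of sigma^2)
print(m, s2)  # 5.0 4.0
```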
MLE: Gaussian Distribution (3/3)

Appendix C

  L(μ, σ|X) = -(N/2) log(2π) - N log σ - (Σ_{t=1}^{N} (x^t - μ)²)/(2σ²)

  ∂L(μ, σ|X)/∂μ = (Σ_{t=1}^{N} (x^t - μ))/σ² = 0
  ⇒ μ̂ = (Σ_{t=1}^{N} x^t)/N

  ∂L(μ, σ|X)/∂σ = -N/σ + (Σ_{t=1}^{N} (x^t - μ)²)/σ³ = 0
  ⇒ σ̂² = (Σ_{t=1}^{N} (x^t - μ̂)²)/N
Evaluating an Estimator: Bias and Variance (1/6)

The mean square error of an estimator d of the parameter θ can be
decomposed into two parts, composed of variance and bias respectively:

  r(d, θ) = E[(d - θ)²]
          = E[(d - E[d] + E[d] - θ)²]
          = E[(d - E[d])²] + 2 E[(d - E[d])(E[d] - θ)] + E[(E[d] - θ)²]
          = E[(d - E[d])²] + 2 (E[d] - θ) E[d - E[d]] + (E[d] - θ)²
            (E[d] - θ is a constant)
          = E[(d - E[d])²] + (E[d] - θ)²      (since E[d - E[d]] = 0)
          =      variance      +    bias²
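The decomposition can be verified exactly, with no simulation, by enumerating a small discrete estimator distribution; the values and probabilities below are hypothetical:

```python
# Hypothetical estimator d with a known discrete distribution
theta = 2.0
outcomes = [(1.0, 0.25), (2.0, 0.5), (4.0, 0.25)]  # (value of d, probability)

E_d = sum(d * p for d, p in outcomes)
mse = sum((d - theta) ** 2 * p for d, p in outcomes)   # E[(d - theta)^2]
var = sum((d - E_d) ** 2 * p for d, p in outcomes)     # E[(d - E[d])^2]
bias = E_d - theta                                     # E[d] - theta

print(mse, var, bias ** 2)  # mse equals var + bias^2
```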
Evaluating an Estimator: Bias and Variance (2/6)

(Figure: illustration of the bias and the variance of an estimator)
Evaluating an Estimator: Bias and Variance (3/6)

Example 1: sample average and sample variance
Assume samples X = {x^1, x^2, ..., x^t, ..., x^N} are independent and
identically distributed (iid), and drawn from some known probability
distribution with mean μ and variance σ²

  Mean:      E[X] = Σ_x x p(x) = μ
  Variance:  Var(X) = E[X²] - (E[X])² = σ²

Sample average (mean) for the observed samples:

  m = (Σ_{t=1}^{N} x^t)/N

Sample variance for the observed samples:

  s² = (Σ_{t=1}^{N} (x^t - m)²)/N
  (or s² = (Σ_{t=1}^{N} (x^t - m)²)/(N - 1) ?)
Evaluating an Estimator: Bias and Variance (4/6)

Example 1 (cont.)
Sample average m is an unbiased estimator of the mean μ:

  E[m] = E[(Σ_{t=1}^{N} x^t)/N] = (Σ_{t=1}^{N} E[X])/N = Nμ/N = μ
  ⇒ Bias(m) = E[m] - μ = 0

m is also a consistent estimator: Var(m) → 0 as N → ∞

  Var(m) = Var((Σ_{t=1}^{N} x^t)/N) = (1/N²) Σ_{t=1}^{N} Var(X)
         = Nσ²/N² = σ²/N → 0 as N → ∞

  (using Var(aX + b) = a² Var(X), and Var(X + Y) = Var(X) + Var(Y)
  for independent X and Y)
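Both properties can be checked exactly, without simulation, by enumerating every equally likely iid sample of a toy fair-coin variable (a hypothetical setup assumed for illustration):

```python
from itertools import product

# Fair Bernoulli toy variable: mu = 0.5, sigma^2 = 0.25 (assumed toy case)
values, mu, sigma2 = [0, 1], 0.5, 0.25
N = 2

samples = list(product(values, repeat=N))      # all 2^N equally likely samples
means = [sum(s) / N for s in samples]
E_m = sum(means) / len(samples)                # E[m]
Var_m = sum((m - E_m) ** 2 for m in means) / len(samples)  # Var(m)
print(E_m, Var_m)  # 0.5 0.125  (= mu and sigma^2 / N)
```

Increasing N shrinks Var(m) toward 0, matching the consistency argument above.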
Evaluating an Estimator: Bias and Variance (5/6)

Example 1 (cont.)
Sample variance s² is an asymptotically unbiased estimator of the
variance σ²:

  s² = (Σ_{t=1}^{N} (x^t - m)²)/N

  E[s²] = E[(1/N) Σ_{t=1}^{N} (X^t - m)²]        (the X^t's are iid)
        = E[(1/N) (Σ_{t=1}^{N} (X^t)² - 2m Σ_{t=1}^{N} X^t + N m²)]
        = E[(1/N) (Σ_{t=1}^{N} (X^t)² - 2N m² + N m²)]   (Σ_t X^t = N m)
        = E[(1/N) (Σ_{t=1}^{N} (X^t)² - N m²)]
        = (N E[X²] - N E[m²])/N
        = E[X²] - E[m²]
Evaluating an Estimator: Bias and Variance (6/6)

Example 1 (cont.)
Sample variance s² is an asymptotically unbiased estimator of the
variance σ²:

  Var(m) = σ²/N = E[m²] - (E[m])²  ⇒  E[m²] = σ²/N + μ²
  Var(X) = σ² = E[X²] - (E[X])²    ⇒  E[X²] = σ² + μ²

  E[s²] = E[X²] - E[m²] = (σ² + μ²) - (σ²/N + μ²) = ((N - 1)/N) σ² ≠ σ²

As N (the size of the observed sample set) → ∞, ((N - 1)/N) σ² → σ²,
so s² is asymptotically unbiased
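The bias factor (N - 1)/N can likewise be verified exactly by enumeration over a toy fair-coin variable (hypothetical setup; no random sampling involved):

```python
from itertools import product

# Fair Bernoulli toy variable: sigma^2 = 0.25 (assumed toy case)
values, sigma2 = [0, 1], 0.25
N = 4

def s2(sample):
    """Biased (divide-by-N) sample variance of one sample."""
    m = sum(sample) / N
    return sum((x - m) ** 2 for x in sample) / N

samples = list(product(values, repeat=N))      # all 2^N equally likely samples
E_s2 = sum(s2(s) for s in samples) / len(samples)
print(E_s2, (N - 1) / N * sigma2)  # 0.1875 0.1875  (not sigma^2 = 0.25)
```

Dividing by N - 1 instead of N removes this bias, which is why the earlier slide poses the N versus N - 1 question.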
Bias and Variance: Example 2

(Figure: different samples x_1, x_2, x_3 drawn from an unknown
population (X, y) lead to different fitted estimates y = F(x);
the observations follow y = F(x) + ε, where ε is the error of
measurement)
Simple is Elegant?