
Maximum Likelihood Estimation

Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University
References:
1. Ethem Alpaydin, Introduction to Machine Learning, Chapter 4, MIT Press, 2004
Sample Statistics and Population Parameters
A Schematic Depiction

(Figure: statistics computed from a sample are used to make inferences about the parameters of the population, i.e., the sample space)
SP - Berlin Chen 2
Introduction

Statistic
Any value (or function) that is calculated from a given sample
Statistical inference: make a decision using the information
provided by a sample (or a set of examples/instances)

Parametric methods
Assume that examples are drawn from some distribution that
obeys a known model p(x|θ)
Advantage: the model is well defined up to a small number of
parameters
E.g., mean and variance are sufficient statistics for the
Gaussian distribution
Model parameters are typically estimated by either maximum
likelihood estimation or Bayesian (MAP) estimation
Maximum Likelihood Estimation (MLE) (1/2)

Assume the instances X = {x^1, x^2, ..., x^t, ..., x^N} are independent
and identically distributed (iid), and drawn from some known
probability distribution

  x^t ~ p(x|θ)

θ: model parameters (assumed to be fixed but unknown here)

MLE attempts to find θ that makes X the most likely to be drawn
Namely, maximize the likelihood of the iid instances x^1, ..., x^N:

  l(θ|X) = p(X|θ) = p(x^1, ..., x^N|θ) = ∏_{t=1}^{N} p(x^t|θ)
MLE (2/2)

Because the logarithm is monotonically increasing, taking it does not
change where the likelihood l(θ|X) attains its maximum
Finding θ that maximizes the likelihood of the instances is therefore
equivalent to finding θ that maximizes the log likelihood of the
samples (using log(ab) = log a + log b):

  L(θ|X) = log l(θ|X) = Σ_{t=1}^{N} log p(x^t|θ)

As we shall see, the logarithmic operation can further simplify the
computation when estimating the parameters of those distributions
that have exponents
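This equivalence can be checked numerically. A minimal sketch under assumed data (a hypothetical Bernoulli sample and a simple grid search, not from the slides): the likelihood, a product over instances, and the log likelihood, a sum over instances, peak at the same parameter value.

```python
import math

# Hypothetical Bernoulli sample, assumed for illustration
sample = [1, 0, 1, 1, 0, 1, 1, 1]

def likelihood(r, xs):
    """l(r|X) = prod_t p(x^t|r): a product over instances."""
    p = 1.0
    for x in xs:
        p *= r ** x * (1 - r) ** (1 - x)
    return p

def log_likelihood(r, xs):
    """L(r|X) = sum_t log p(x^t|r): a sum over instances."""
    return sum(x * math.log(r) + (1 - x) * math.log(1 - r) for x in xs)

# A grid search finds the same maximizer for both criteria,
# because log is monotonically increasing
grid = [i / 100 for i in range(1, 100)]
best_l = max(grid, key=lambda r: likelihood(r, sample))
best_L = max(grid, key=lambda r: log_likelihood(r, sample))
print(best_l, best_L)  # 0.75 0.75
```

Here 0.75 is exactly the fraction of ones in the sample, anticipating the Bernoulli result derived on the following slides.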
MLE: Bernoulli Distribution (1/3)

Bernoulli Distribution
A random variable X takes either the value x = 1 (with probability r)
or the value x = 0 (with probability 1 - r)
Can be thought of as X being generated from two distinct states
The associated probability distribution:

  P(x) = r^x (1 - r)^{1 - x},  x ∈ {0, 1}

The log likelihood for a set of iid instances X = {x^1, x^2, ..., x^t, ..., x^N}
drawn from a Bernoulli distribution:

  L(r|X) = log ∏_{t=1}^{N} r^{x^t} (1 - r)^{1 - x^t}
         = (Σ_{t=1}^{N} x^t) log r + (N - Σ_{t=1}^{N} x^t) log(1 - r)
MLE: Bernoulli Distribution (2/3)

MLE of the distribution parameter r:

  r̂ = (Σ_{t=1}^{N} x^t) / N

The estimate for r is the ratio of the number of occurrences of the
event (x^t = 1) to the number of experiments N

The expected value of X:

  E[X] = Σ_{x ∈ {0,1}} x P(x) = 0·(1 - r) + 1·r = r

The variance of X:

  Var(X) = E[X²] - (E[X])² = r - r² = r(1 - r)
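A minimal sketch of the closed-form Bernoulli MLE, using a hypothetical sample assumed for illustration:

```python
# Hypothetical coin-flip data (1 = success), assumed for illustration
sample = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

N = len(sample)
r_hat = sum(sample) / N           # MLE: fraction of occurrences of x = 1

mean = r_hat                      # E[X] = r
variance = r_hat * (1 - r_hat)    # Var(X) = r(1 - r)
print(r_hat, variance)
```

With 7 successes in 10 trials, r̂ = 0.7 and the plug-in variance is r̂(1 − r̂) = 0.21.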
MLE: Bernoulli Distribution (3/3)

Appendix A

  dL(r|X)/dr = (Σ_{t=1}^{N} x^t)/r - (N - Σ_{t=1}^{N} x^t)/(1 - r) = 0
  (using d(log y)/dy = 1/y)

  ⇒ (1 - r) Σ_{t=1}^{N} x^t = r (N - Σ_{t=1}^{N} x^t)

  ⇒ r̂ = (Σ_{t=1}^{N} x^t) / N

The maximum likelihood estimate of the mean is the sample average
MLE: Multinomial Distribution (1/4)

Multinomial Distribution
A generalization of the Bernoulli distribution
The value of a random variable X can be one of K mutually exclusive
and exhaustive states {s_1, s_2, ..., s_K} with probabilities
r_1, r_2, ..., r_K, respectively
The associated probability distribution:

  p(x) = ∏_{i=1}^{K} r_i^{x_i},  with Σ_{i=1}^{K} r_i = 1,

  where x_i = 1 if X chooses state s_i, and x_i = 0 otherwise

The log likelihood for a set of iid instances X = {x^1, x^2, ..., x^t, ..., x^N}
drawn from a multinomial distribution:

  L(r|X) = Σ_{t=1}^{N} Σ_{i=1}^{K} x_i^t log r_i
MLE: Multinomial Distribution (2/4)

MLE of the distribution parameter r_i:

  r̂_i = (Σ_{t=1}^{N} x_i^t) / N

The estimate for r_i is the ratio of the number of experiments with
outcome of state s_i (x_i^t = 1) to the number of experiments N
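A minimal sketch of the multinomial MLE as state counting. The draws below are hypothetical, chosen to match the P(B) = 3/10, P(W) = 4/10, P(R) = 3/10 example that appears later in the deck:

```python
from collections import Counter

# Hypothetical draws over K = 3 states, assumed for illustration
draws = ["B", "W", "B", "R", "W", "W", "B", "R", "R", "W"]

N = len(draws)
counts = Counter(draws)                        # number of outcomes per state
r_hat = {state: c / N for state, c in counts.items()}
print(r_hat)  # {'B': 0.3, 'W': 0.4, 'R': 0.3}
```

The estimates are non-negative and sum to 1 by construction, so no explicit normalization constraint is needed here.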
MLE: Multinomial Distribution (3/4)

Appendix B

Maximize the log likelihood

  L(r|X) = Σ_{t=1}^{N} Σ_{i=1}^{K} x_i^t log r_i

with the constraint Σ_{i=1}^{K} r_i = 1, via a Lagrange multiplier λ:

  J(r, λ) = Σ_{t=1}^{N} Σ_{i=1}^{K} x_i^t log r_i + λ (1 - Σ_{i=1}^{K} r_i)

  ∂J/∂r_i = (Σ_{t=1}^{N} x_i^t)/r_i - λ = 0  ⇒  r_i = (Σ_{t=1}^{N} x_i^t)/λ

Summing over i, and using Σ_{i=1}^{K} r_i = 1 and Σ_{i=1}^{K} Σ_{t=1}^{N} x_i^t = N:

  1 = (Σ_{i=1}^{K} Σ_{t=1}^{N} x_i^t)/λ = N/λ  ⇒  λ = N

  ⇒ r̂_i = (Σ_{t=1}^{N} x_i^t) / N

Lagrange Multiplier: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html
MLE: Multinomial Distribution (4/4)
(Figure: a drawing example whose MLE estimates are P(B) = 3/10, P(W) = 4/10, P(R) = 3/10)
MLE: Gaussian Distribution (1/3)

Also called the Normal Distribution
Characterized by the mean μ and the variance σ²:

  p(x) = (1/(√(2π) σ)) exp(-(x - μ)²/(2σ²)),  -∞ < x < ∞

Recall that mean and variance are sufficient statistics for the
Gaussian

The log likelihood for a set of iid instances X = {x^1, x^2, ..., x^t, ..., x^N}
drawn from a Gaussian distribution:

  L(μ, σ|X) = log ∏_{t=1}^{N} (1/(√(2π) σ)) exp(-(x^t - μ)²/(2σ²))
            = -(N/2) log(2π) - N log σ - (Σ_{t=1}^{N} (x^t - μ)²)/(2σ²)
MLE: Gaussian Distribution (2/3)

MLE of the distribution parameters μ and σ:

  μ̂ = m = (Σ_{t=1}^{N} x^t)/N              (sample average)

  σ̂² = s² = (Σ_{t=1}^{N} (x^t - m)²)/N     (sample variance)

Remember that μ and σ are still fixed but unknown
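A minimal sketch of the Gaussian MLE on a hypothetical sample (values assumed for illustration); note that the variance estimate divides by N, not N - 1:

```python
# Hypothetical observations, assumed for illustration
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

N = len(sample)
m = sum(sample) / N                           # sample average (MLE of mu)
s2 = sum((x - m) ** 2 for x in sample) / N    # sample variance (MLE of sigma^2)
print(m, s2)  # 5.0 4.0
```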
MLE: Gaussian Distribution (3/3)

Appendix C

  L(μ, σ|X) = -(N/2) log(2π) - N log σ - (Σ_{t=1}^{N} (x^t - μ)²)/(2σ²)

  ∂L(μ, σ|X)/∂μ = (Σ_{t=1}^{N} (x^t - μ))/σ² = 0
  ⇒ μ̂ = (Σ_{t=1}^{N} x^t)/N

  ∂L(μ, σ|X)/∂σ = -N/σ + (Σ_{t=1}^{N} (x^t - μ)²)/σ³ = 0
  ⇒ σ̂² = (Σ_{t=1}^{N} (x^t - μ̂)²)/N
Evaluating an Estimator: Bias and Variance (1/6)

The mean square error of an estimator d of the parameter θ can be
decomposed into two parts, composed of variance and bias respectively:

  r(d, θ) = E[(d - θ)²]
          = E[(d - E[d] + E[d] - θ)²]
          = E[(d - E[d])²] + 2 E[(d - E[d])(E[d] - θ)] + E[(E[d] - θ)²]
          = E[(d - E[d])²] + 2 (E[d] - θ) E[d - E[d]] + (E[d] - θ)²
            (E[d] - θ is a constant)
          = E[(d - E[d])²] + (E[d] - θ)²      (since E[d - E[d]] = 0)
          =      variance      +    bias²
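The decomposition can be verified exactly, with no simulation, by enumerating a small discrete estimator distribution; the values and probabilities below are hypothetical:

```python
# Hypothetical estimator d with a known discrete distribution
theta = 2.0
outcomes = [(1.0, 0.25), (2.0, 0.5), (4.0, 0.25)]  # (value of d, probability)

E_d = sum(d * p for d, p in outcomes)
mse = sum((d - theta) ** 2 * p for d, p in outcomes)   # E[(d - theta)^2]
var = sum((d - E_d) ** 2 * p for d, p in outcomes)     # E[(d - E[d])^2]
bias = E_d - theta                                     # E[d] - theta

print(mse, var, bias ** 2)  # mse equals var + bias^2
```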
Evaluating an Estimator: Bias and Variance (2/6)

(Figure: illustration of the bias and the variance of an estimator)
Evaluating an Estimator: Bias and Variance (3/6)

Example 1: sample average and sample variance
Assume samples X = {x^1, x^2, ..., x^t, ..., x^N} are independent and
identically distributed (iid), and drawn from some known probability
distribution with mean μ and variance σ²

  Mean:      E[X] = Σ_x x p(x) = μ
  Variance:  Var(X) = E[X²] - (E[X])² = σ²

Sample average (mean) for the observed samples:

  m = (Σ_{t=1}^{N} x^t)/N

Sample variance for the observed samples:

  s² = (Σ_{t=1}^{N} (x^t - m)²)/N
  (or s² = (Σ_{t=1}^{N} (x^t - m)²)/(N - 1) ?)
Evaluating an Estimator: Bias and Variance (4/6)

Example 1 (cont.)
Sample average m is an unbiased estimator of the mean μ:

  E[m] = E[(Σ_{t=1}^{N} x^t)/N] = (Σ_{t=1}^{N} E[X])/N = Nμ/N = μ
  ⇒ Bias(m) = E[m] - μ = 0

m is also a consistent estimator: Var(m) → 0 as N → ∞

  Var(m) = Var((Σ_{t=1}^{N} x^t)/N) = (1/N²) Σ_{t=1}^{N} Var(X)
         = Nσ²/N² = σ²/N → 0 as N → ∞

  (using Var(aX + b) = a² Var(X), and Var(X + Y) = Var(X) + Var(Y)
  for independent X and Y)
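Both properties can be checked exactly, without simulation, by enumerating every equally likely iid sample of a toy fair-coin variable (a hypothetical setup assumed for illustration):

```python
from itertools import product

# Fair Bernoulli toy variable: mu = 0.5, sigma^2 = 0.25 (assumed toy case)
values, mu, sigma2 = [0, 1], 0.5, 0.25
N = 2

samples = list(product(values, repeat=N))      # all 2^N equally likely samples
means = [sum(s) / N for s in samples]
E_m = sum(means) / len(samples)                # E[m]
Var_m = sum((m - E_m) ** 2 for m in means) / len(samples)  # Var(m)
print(E_m, Var_m)  # 0.5 0.125  (= mu and sigma^2 / N)
```

Increasing N shrinks Var(m) toward 0, matching the consistency argument above.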
Evaluating an Estimator: Bias and Variance (5/6)

Example 1 (cont.)
Sample variance s² is an asymptotically unbiased estimator of the
variance σ²:

  s² = (Σ_{t=1}^{N} (x^t - m)²)/N

  E[s²] = E[(1/N) Σ_{t=1}^{N} (X^t - m)²]        (the X^t's are iid)
        = E[(1/N) (Σ_{t=1}^{N} (X^t)² - 2m Σ_{t=1}^{N} X^t + N m²)]
        = E[(1/N) (Σ_{t=1}^{N} (X^t)² - 2N m² + N m²)]   (Σ_t X^t = N m)
        = E[(1/N) (Σ_{t=1}^{N} (X^t)² - N m²)]
        = (N E[X²] - N E[m²])/N
        = E[X²] - E[m²]
Evaluating an Estimator: Bias and Variance (6/6)

Example 1 (cont.)
Sample variance s² is an asymptotically unbiased estimator of the
variance σ²:

  Var(m) = σ²/N = E[m²] - (E[m])²  ⇒  E[m²] = σ²/N + μ²
  Var(X) = σ² = E[X²] - (E[X])²    ⇒  E[X²] = σ² + μ²

  E[s²] = E[X²] - E[m²] = (σ² + μ²) - (σ²/N + μ²) = ((N - 1)/N) σ² ≠ σ²

As N (the size of the observed sample set) → ∞, ((N - 1)/N) σ² → σ²,
so s² is asymptotically unbiased
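The bias factor (N - 1)/N can likewise be verified exactly by enumeration over a toy fair-coin variable (hypothetical setup; no random sampling involved):

```python
from itertools import product

# Fair Bernoulli toy variable: sigma^2 = 0.25 (assumed toy case)
values, sigma2 = [0, 1], 0.25
N = 4

def s2(sample):
    """Biased (divide-by-N) sample variance of one sample."""
    m = sum(sample) / N
    return sum((x - m) ** 2 for x in sample) / N

samples = list(product(values, repeat=N))      # all 2^N equally likely samples
E_s2 = sum(s2(s) for s in samples) / len(samples)
print(E_s2, (N - 1) / N * sigma2)  # 0.1875 0.1875  (not sigma^2 = 0.25)
```

Dividing by N - 1 instead of N removes this bias, which is why the earlier slide poses the N versus N - 1 question.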
Bias and Variance: Example 2

(Figure: different samples x_1, x_2, x_3 drawn from an unknown
population (X, y) lead to different fitted estimates y = F(x);
the observations follow y = F(x) + ε, where ε is the error of
measurement)
Simple is Elegant?