
2.160 System Identification, Estimation, and Learning


Lecture Notes No. 22
May 3, 2006

15. Maximum Likelihood


15.1 Principle

Consider an unknown stochastic process that produces the observed data

y^N = ( y_1, y_2, \ldots, y_N )

[Diagram: Unknown Stochastic Process → Observed data y^N]

Assume that each observed datum y_i is generated by a stochastic process having the PDF

f(m, \lambda; x) = \frac{1}{\sqrt{2\pi\lambda}} \, e^{-\frac{(x-m)^2}{2\lambda}}    (1)

where m is the mean, λ is the variance, and x is the random variable associated with y_i.


We know that the mean m and the variance λ can be estimated from the data by

m = \frac{1}{N} \sum_{i=1}^{N} y_i , \qquad \lambda = \frac{1}{N} \sum_{i=1}^{N} (y_i - m)^2    (2)
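As a quick numerical check (a minimal sketch with synthetic data; the variable names and "true" values below are illustrative assumptions, not from the notes), the sample estimates in (2) can be computed directly:

```python
# Minimal sketch: sample mean and variance per (2), on synthetic Gaussian data.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
m_true, lam_true = 2.0, 0.5                 # assumed "true" mean and variance
y = rng.normal(m_true, np.sqrt(lam_true), size=N)

m_hat = y.sum() / N                          # m = (1/N) * sum(y_i)
lam_hat = ((y - m_hat) ** 2).sum() / N       # lambda = (1/N) * sum((y_i - m)^2)
print(m_hat, lam_hat)                        # close to 2.0 and 0.5 for large N
```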

Let us now obtain the same parameter values, m and λ , based on a different
principle: Maximum Likelihood.
Assuming that the N observations y_1, y_2, …, y_N are stochastically independent,
consider the joint probability associated with the N observations:

f(m, \lambda; x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\lambda}} \, e^{-\frac{(x_i - m)^2}{2\lambda}}    (3)

Now, once y_1, y_2, …, y_N have been observed (have taken specific values), what
parameter values m and λ provide the highest probability density f(m, λ; y_1, y_2, …, y_N)?
In other words, what values of m and λ are most likely to be the case? This means that
we maximize the following function with respect to m and λ:

\max_{m, \lambda} f(m, \lambda; y_1, y_2, \ldots, y_N)    (4)

Note that y_1, y_2, …, y_N are known values. Therefore, f(m, λ; y_1, y_2, …, y_N) is a
function of m and λ only. Using our standard notation, this can be rewritten as

\hat{\theta} = \arg\max_{\theta} f(m, \lambda; y^N)    (5)

where θ̂ is the estimate of θ = (m, λ). Maximizing f(m, λ; y^N) is equivalent to
maximizing its logarithm, log f(m, λ; y^N):

\hat{\theta} = \arg\max_{\theta} \left[ \log f(m, \lambda; y^N) \right]
             = \arg\max_{\theta} \sum_{i=1}^{N} \left( \log \frac{1}{\sqrt{2\pi\lambda}} - \frac{(y_i - m)^2}{2\lambda} \right)    (6)

Taking derivatives and setting them to zero,

\frac{\partial \log f}{\partial m} = \frac{1}{\lambda} \sum_{i=1}^{N} (y_i - m) = 0    (7)
\qquad\Longrightarrow\qquad
m = \frac{1}{N} \sum_{i=1}^{N} y_i    (9)

\frac{\partial \log f}{\partial \lambda} = -\frac{N}{2\lambda} + \frac{1}{2\lambda^2} \sum_{i=1}^{N} (y_i - m)^2 = 0    (8)
\qquad\Longrightarrow\qquad
\lambda = \frac{1}{N} \sum_{i=1}^{N} (y_i - m)^2    (10)

The above results, m = \frac{1}{N}\sum_{i=1}^{N} y_i and \lambda = \frac{1}{N}\sum_{i=1}^{N} (y_i - m)^2, provide the stochastic
process model that is most likely to have generated the observed data y^N, and they agree
with (2).
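As an illustration (a minimal sketch, assuming synthetic Gaussian data and the availability of scipy; none of the names below come from the notes), one can maximize the log-likelihood (6) numerically and confirm that the maximizer agrees with the closed-form results (9) and (10):

```python
# Minimal sketch: numerically maximize the Gaussian log-likelihood (6) by
# minimizing its negative, and compare with the closed-form MLE (9), (10).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
y = rng.normal(2.0, np.sqrt(0.5), size=500)   # synthetic data (assumed values)
N = len(y)

def neg_log_likelihood(theta):
    m, lam = theta
    if lam <= 0:                              # keep the variance positive
        return np.inf
    return 0.5 * N * np.log(2 * np.pi * lam) + np.sum((y - m) ** 2) / (2 * lam)

res = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
m_ml, lam_ml = res.x
print(m_ml, lam_ml)                                   # numerical MLE
print(y.mean(), ((y - y.mean()) ** 2).mean())         # closed-form (9) and (10)
```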
This Maximum Likelihood Estimate (MLE) is formally stated as follows.

Maximum Likelihood Estimate


Consider a joint probability density function with parameter vector θ as a
stochastic model of an unknown process:
f(\theta; x_1, x_2, \ldots, x_N)    (11)

Given the observed data y_1, y_2, …, y_N, form a deterministic function of θ, called the
likelihood function:

L(\theta) = f(\theta; y_1, y_2, \ldots, y_N)    (12)

Determine the parameter vector θ so that this likelihood function is maximized:

\hat{\theta} = \arg\max_{\theta} L(\theta)    (13)

This Maximum Likelihood Estimator was found by Ronald A. Fisher in 1912, when
he was an undergraduate junior at Cambridge University. He was 22 years old.

Don’t be disappointed. You are still young!

15.2 Likelihood Function for Probabilistic Models of Dynamic Systems

Consider a model set in predictor form


\mathcal{M}(\theta): \quad \hat{y}(t \mid \theta) = g\left(t, z^{t-1}; \theta\right)    (14)

and an associated PDF of the prediction error

f_{\varepsilon}(x, t; \theta)    (15)

\varepsilon(t; \theta) = y(t) - \hat{y}(t \mid \theta)    (16)

Assume that the random process ε (t ) is independent in t , and consider the joint
probability density function:
f(\theta; x_1, x_2, \ldots, x_N) = \prod_{t=1}^{N} f_{\varepsilon}(x_t, t; \theta)    (17)

Substituting the observed data y_1, y_2, …, y_N into (17) yields the likelihood function of
the parameter θ:

L(\theta) = \prod_{t=1}^{N} f_{\varepsilon}\left( y(t) - \hat{y}(t \mid \theta), t; \theta \right)    (18)

Maximizing this function is equivalent to maximizing its logarithm,

\frac{1}{N} \log L(\theta) = \frac{1}{N} \sum_{t=1}^{N} \log f_{\varepsilon}(\varepsilon, t; \theta)    (19)

Replacing \log f_{\varepsilon} by l(\varepsilon, \theta, t) = -\log f_{\varepsilon}(\varepsilon, t; \theta), the maximization of (19) becomes a minimization:

\hat{\theta}_{ML}(Z^N) = \arg\min_{\theta} \frac{1}{N} \sum_{t=1}^{N} l(\varepsilon, \theta, t)    (20)

The above minimization is of a deterministic function, and is analogous to the
prediction-error method with the least-squares criterion,

\hat{\theta}_{LS}(z^N) = \arg\min_{\theta} \frac{1}{N} \sum_{t=1}^{N} \frac{1}{2} \varepsilon^2(t, \theta)    (21)

Replacing \frac{1}{2}\varepsilon^2(t, \theta) with l(\varepsilon, \theta, t),

\hat{\theta}_{LS}(z^N) = \arg\min_{\theta} \frac{1}{N} \sum_{t=1}^{N} l(\varepsilon, \theta, t)

Therefore, the Maximum Likelihood Estimate is a special case of the prediction-error
method. The function l(ε, θ, t) may take different forms depending on the assumed
PDF of the process under consideration.
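As a sketch of this relationship (the AR(1) predictor, data, and variable names below are hypothetical, not taken from the notes): with a Gaussian error PDF of known variance λ, l(ε, θ, t) differs from ε²/(2λ) only by a constant, so minimizing (1/N)Σ l over θ gives the same estimate as the least-squares criterion (21).

```python
# Minimal sketch: for the hypothetical predictor yhat(t|theta) = theta * y(t-1),
# minimizing (1/N) * sum of l(eps) with Gaussian l (known variance lam) gives
# the same theta as least squares, illustrating MLE as a prediction-error method.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
a_true, lam, N = 0.7, 0.1, 2000               # assumed AR(1) parameters
y = np.zeros(N)
for t in range(1, N):
    y[t] = a_true * y[t - 1] + rng.normal(0.0, np.sqrt(lam))

def ml_criterion(theta):
    eps = y[1:] - theta * y[:-1]                              # prediction errors
    l = 0.5 * np.log(2 * np.pi * lam) + eps ** 2 / (2 * lam)  # l = -log f_eps
    return l.mean()

theta_ml = minimize_scalar(ml_criterion).x
theta_ls = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)       # closed-form least squares
print(theta_ml, theta_ls)                                     # essentially identical
```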

15.3 The Cramer-Rao Lower Bound

Consider the mean square error matrix of an estimate θˆN compared to its true
value θ0 ,

P = E\left[ \left( \hat{\theta}(y^N) - \theta_0 \right)\left( \hat{\theta}(y^N) - \theta_0 \right)^T \right]    (22)

This mean square error matrix measures the overall quality of an estimate. Our goal
is to establish a lower limit to this matrix. The limit, called the Cramer-Rao
Lower Bound, gives the best performance that we can expect among all unbiased
estimators.

Theorem (The Cramer-Rao Lower Bound)


Let \hat{\theta}(y^N) be an unbiased estimate of θ:

E\left[ \hat{\theta}(y^N) \right] = \theta_0    (23)

where the PDF of y^N is f_y(\theta_0; y^N). Then

E\left[ \left( \hat{\theta}(y^N) - \theta_0 \right)\left( \hat{\theta}(y^N) - \theta_0 \right)^T \right] \ge M^{-1}    (24)

where M is the Fisher Information Matrix given by

M = E\left\{ \left[ \frac{d}{d\theta} \log f_y(\theta; x^N) \right] \left[ \frac{d}{d\theta} \log f_y(\theta; x^N) \right]^T \right\} \Bigg|_{\theta_0}    (25-a)

\quad = -E\left[ \frac{d^2}{d\theta^2} \log f_y(\theta; x^N) \right] \Bigg|_{\theta_0}    (25-b)

Proof

From (23),

\theta_0 = \int_{R^N} \hat{\theta}(x^N) \, f_y(\theta_0; x^N) \, dx^N    (26)

where the dummy vector x^N is an N × 1 vector in the space R^N.


Differentiating (26), which is a d × 1 vector equation, with respect to θ0 ∈ R^{d×1}
yields

I = \int_{R^N} \hat{\theta}(x^N) \left[ \frac{d}{d\theta_0} f_y(\theta_0; x^N) \right]^T dx^N    (27)

where I = \mathrm{diag}(1, \ldots, 1) \in R^{d \times d} is the identity matrix.

Using the following relationship

\frac{d}{d\theta} \log f(\theta) = \frac{1}{f(\theta)} \frac{d}{d\theta} f(\theta)    (28)

(27) can be written as


I = \int_{R^N} \hat{\theta}(x^N) \left[ \frac{d}{d\theta_0} \log f_y(\theta_0; x^N) \right]^T f_y(\theta_0; x^N) \, dx^N
  = E\left\{ \hat{\theta}(x^N) \left[ \frac{d}{d\theta_0} \log f_y(\theta_0; x^N) \right]^T \right\}    (29)

Similarly, differentiating the identity 1 = \int_{R^N} f_y(\theta_0; x^N) \, dx^N yields

0 = \int_{R^N} \left[ \frac{d}{d\theta_0} f_y(\theta_0; x^N) \right]^T dx^N = \int_{R^N} \left[ \frac{d}{d\theta_0} \log f_y(\theta_0; x^N) \right]^T f_y(\theta_0; x^N) \, dx^N

\therefore \quad 0 = E\left[ \frac{d}{d\theta_0} \log f_y(\theta_0; x^N) \right]^T    (30)
Multiplying (30) by θ0 and subtracting it from (29),

E\left\{ \left[ \hat{\theta}(x^N) - \theta_0 \right] \left[ \frac{d}{d\theta_0} \log f_y(\theta_0; x^N) \right]^T \right\} = I    (31)

Define \alpha = \hat{\theta}(x^N) - \theta_0 \in R^{d \times 1} and \beta = \frac{d}{d\theta_0} \log f_y(\theta_0; x^N) \in R^{d \times 1}, so that (31) reads E(\alpha\beta^T) = I. Consider the 2d × 1 stacked vector \begin{pmatrix} \alpha \\ \beta \end{pmatrix}:

E\left[ \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \begin{pmatrix} \alpha^T & \beta^T \end{pmatrix} \right] = \begin{bmatrix} E(\alpha\alpha^T) & E(\alpha\beta^T) \\ E(\beta\alpha^T) & E(\beta\beta^T) \end{bmatrix} \ge 0 \quad \text{(positive semi-definite)}    (32)

Note that E(\alpha\beta^T) = I and E(\beta\alpha^T) = \left[ E(\alpha\beta^T) \right]^T = I. Since E(\beta\beta^T) > 0, the Schur complement of the lower-right block of (32) must also be positive semi-definite:

E(\alpha\alpha^T) - I \left[ E(\beta\beta^T) \right]^{-1} I \ge 0    (33)

\therefore \quad E(\alpha\alpha^T) \ge \left[ E(\beta\beta^T) \right]^{-1}    (34)

Since E(\alpha\alpha^T) = P and E(\beta\beta^T) = M by (25-a), this gives the bound (24) with M in its first form, (25-a). Differentiating the transpose of (30) yields

0 = \int_{R^N} \left[ \frac{d^2}{d\theta_0^2} \log f_y(\theta_0; x^N) \right] f_y(\theta_0; x^N) \, dx^N + \int_{R^N} \left[ \frac{d}{d\theta_0} \log f_y(\theta_0; x^N) \right] \left[ \frac{d}{d\theta_0} f_y(\theta_0; x^N) \right]^T dx^N    (35)

Using (28), the second term equals E\left\{ \left[ \frac{d}{d\theta_0} \log f_y \right] \left[ \frac{d}{d\theta_0} \log f_y \right]^T \right\} = M; rewriting (35) in expectation form then yields (25-b). Q.E.D.
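To illustrate the theorem numerically (a minimal Monte Carlo sketch with assumed values, not part of the notes): for N iid Gaussian samples with known variance λ, the Fisher information for the mean is M = N/λ, so any unbiased estimator satisfies var(m̂) ≥ λ/N, and the sample mean attains this bound.

```python
# Minimal sketch: Monte Carlo check of the Cramer-Rao bound for estimating the
# mean of N iid Gaussian samples with known variance lam (M = N/lam).
import numpy as np

rng = np.random.default_rng(3)
m0, lam, N, trials = 1.0, 0.5, 100, 20000     # assumed values

sample_means = rng.normal(m0, np.sqrt(lam), size=(trials, N)).mean(axis=1)
print("empirical variance of the sample mean:", sample_means.var())
print("Cramer-Rao lower bound lam/N:         ", lam / N)   # nearly identical
```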

15.4 Best Unbiased Estimators for Dynamical Systems.

The Cramer-Rao Lower Bound, given by the inverse of the Fisher Information Matrix M, is
the best error-covariance performance we can expect among all unbiased estimates.
Interestingly, the Maximum Likelihood Estimate attains this best error covariance for
large sample sizes. The following formulates the problem for proving this; the complete
proof, however, is left for your Extra-Credit Problem.
Based on the logarithmic likelihood function, we can compute the Fisher
Information Matrix, the inverse of which provides the lower bound of the error
covariance. To simplify the problem, we assume that the PDF of prediction error ε
is a function of ε only. Namely,

\frac{\partial f_{\varepsilon}}{\partial \theta} = 0, \qquad \frac{\partial f_{\varepsilon}}{\partial t} = 0    (36)

so that f_{\varepsilon}(\varepsilon, t, \theta) reduces to f_{\varepsilon}(\varepsilon).

Note, however, that the total derivative of f_ε with respect to θ is not zero, since ε depends on θ:
ε(t; θ) = y(t) − ŷ(t|θ). Denoting l_0(ε) ≡ −log f_ε(ε), we can rewrite (19); the
log-likelihood function of the dynamical system is given by

\log L(\theta) = \sum_{t=1}^{N} \log f_{\varepsilon}\left( \varepsilon(t, \theta) \right) = -\sum_{t=1}^{N} l_0\left( \varepsilon(t, \theta) \right)    (37)

Differentiating this w.r.t. θ

\frac{d}{d\theta} \log L(\theta) = -\sum_{t=1}^{N} \frac{dl_0}{d\varepsilon} \frac{d\varepsilon}{d\theta} = \sum_{t=1}^{N} l_0'\left[ \varepsilon(t, \theta) \right] \psi(t, \theta)    (38)

Note that \frac{d\varepsilon}{d\theta} = -\frac{d\hat{y}}{d\theta} = -\psi. Replacing \frac{d}{d\theta} \log f_y in (25) by the above expression
yields the Fisher Information Matrix M, evaluated at the true parameter value θ0.
At this θ0, we can assume that the prediction error sequence ε(t, θ0) is generated
as a sequence of independent random variables ε(t, θ0) = e_0(t) whose PDF is f_e(x).
This leads to the Fisher Information Matrix given by

1 N
[ ]
E ψ (t ,θ 0 )ψ T (t ,θ 0 )
1

[ f ′( x)] dx
M= ∑
k0 t =1
≡ ∫ e
k0 − ∞ [ f e ( x)]

On the other hand, as N tends to infinity, the prediction-error method provides the
asymptotic variance P_θ, as discussed last time. From (20), the Maximum Likelihood
Estimate can be treated as a special case of the prediction-error method, so the
previous result on asymptotic variance applies to the MLE.
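As a numerical preview of this result (a minimal Monte Carlo sketch under assumed values; the AR(1) example is illustrative and not from the notes): for y(t) = a·y(t−1) + e(t) with Gaussian e of variance λ, we have ψ(t, θ0) = y(t−1) and 1/κ0 = 1/λ, so M ≈ N·E[y²]/λ = N/(1 − a²), and the bound on var(â) is (1 − a²)/N, which the ML/least-squares estimate approaches for large N.

```python
# Minimal sketch: Monte Carlo comparison of var(a_hat) for the AR(1) ML /
# least-squares estimate against the Fisher-information bound (1 - a0^2) / N.
import numpy as np

rng = np.random.default_rng(4)
a0, lam, N, trials = 0.6, 0.2, 500, 2000      # assumed values
a_hat = np.empty(trials)
for k in range(trials):
    y = np.zeros(N)
    for t in range(1, N):
        y[t] = a0 * y[t - 1] + rng.normal(0.0, np.sqrt(lam))
    a_hat[k] = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)   # ML = LS estimate

print("empirical variance of a_hat:", a_hat.var())
print("bound (1 - a0^2) / N:       ", (1 - a0 ** 2) / N)      # close for large N
```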

Extra Credit Problem


Obtain the asymptotic variance of MLE for dynamical systems, and show that it
agrees with the Fisher Information Matrix. Namely, Maximum Likelihood Estimate
provides the best unbiased estimate for large N.
