Lecture 22
[Figure: an unknown stochastic process generates the observed data $y^N = (y_1, y_2, \ldots, y_N)$.]
The sample mean $m$ and sample variance $\lambda$ of the observed data are

$$m = \frac{1}{N}\sum_{i=1}^{N} y_i\,, \qquad \lambda = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - m\right)^2 \qquad (2)$$
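As a quick numerical illustration, the following sketch computes (2) for a synthetic data set. The data, seed, and use of NumPy are assumptions made for the example, not part of the lecture:

```python
# A minimal sketch of (2): sample mean and variance of observed data y^N.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=1000)  # hypothetical observations y_1..y_N

N = len(y)
m = y.sum() / N                   # m = (1/N) * sum(y_i)
lam = ((y - m) ** 2).sum() / N    # lambda = (1/N) * sum((y_i - m)^2)
print(m, lam)
```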
Let us now obtain the same parameter values, $m$ and $\lambda$, based on a different principle: Maximum Likelihood.
Assuming that the $N$ observations $y_1, y_2, \ldots, y_N$ are stochastically independent, consider the joint probability associated with the $N$ observations:
$$f(m, \lambda; x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\lambda}}\, e^{-\frac{(x_i - m)^2}{2\lambda}} \qquad (3)$$
Now, once $y_1, y_2, \ldots, y_N$ have been observed (have taken specific values), what parameter values $m$ and $\lambda$ provide the highest probability $f(m, \lambda; y_1, y_2, \ldots, y_N)$? In other words, what values of $m$ and $\lambda$ are most likely the case? This means that we maximize the following functional with respect to $m$ and $\lambda$:
$$\max_{m,\,\lambda}\; f(m, \lambda; y_1, y_2, \ldots, y_N) \qquad (4)$$
Note that $y_1, y_2, \ldots, y_N$ are known values. Therefore, $f(m, \lambda; y_1, y_2, \ldots, y_N)$ is a function of $m$ and $\lambda$ only. Using our standard notation, this can be rewritten as
$$\hat\theta = \arg\max_{\theta} f(m, \lambda; y^N) \qquad (5)$$

where $\theta = (m, \lambda)$.
Taking the logarithm of (3),

$$\log f = -\frac{N}{2}\log(2\pi\lambda) - \frac{1}{2\lambda}\sum_{i=1}^{N}(y_i - m)^2 \qquad (6)$$

and setting the partial derivatives with respect to $m$ and $\lambda$ to zero:

$$\frac{\partial \log f}{\partial m} = \frac{1}{\lambda}\sum_{i=1}^{N}(y_i - m) = 0 \qquad (7)$$

$$\frac{\partial \log f}{\partial \lambda} = -\frac{N}{2\lambda} + \frac{1}{2\lambda^2}\sum_{i=1}^{N}(y_i - m)^2 = 0 \qquad (8)$$

Solving (7) and (8) for the parameters,

$$m = \frac{1}{N}\sum_{i=1}^{N} y_i \qquad (9)$$

$$\lambda = \frac{1}{N}\sum_{i=1}^{N}(y_i - m)^2 \qquad (10)$$
The above results, $m = \frac{1}{N}\sum_{i=1}^{N} y_i$ and $\lambda = \frac{1}{N}\sum_{i=1}^{N}(y_i - m)^2$, provide the stochastic process model that is most likely to generate the observed data $y^N$, and they agree with (2).
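This agreement can also be seen numerically by maximizing the likelihood (3) directly and comparing with the closed-form results (9) and (10). A minimal sketch, assuming NumPy and SciPy are available and using synthetic Gaussian data:

```python
# Numerically maximize the likelihood (3) and compare with (9) and (10).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.5, size=500)  # hypothetical observations
N = len(y)

def neg_log_likelihood(params):
    m, lam = params
    # -log f(m, lambda; y) for the Gaussian model (3)
    return 0.5 * N * np.log(2 * np.pi * lam) + ((y - m) ** 2).sum() / (2 * lam)

res = minimize(neg_log_likelihood, x0=[0.0, 1.0],
               bounds=[(None, None), (1e-9, None)])
m_hat, lam_hat = res.x
print(m_hat, y.mean())                        # agrees with (9)
print(lam_hat, ((y - y.mean()) ** 2).mean())  # agrees with (10)
```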
This Maximum Likelihood Estimate (MLE) is formally stated as $\hat\theta_{ML}(y^N) = \arg\max_{\theta} f(\theta; y^N)$.
This Maximum Likelihood Estimator was found by Ronald A. Fisher in 1912, when
he was an undergraduate junior at Cambridge University. He was 22 years old.
Don’t be disappointed. You are still young!
Now let us apply the Maximum Likelihood principle to a dynamical system. Let the prediction error have the probability density function

$$f_\varepsilon(x, t; \theta) \qquad (15)$$

where the prediction error is

$$\varepsilon(t; \theta) = y(t) - \hat y(t \mid \theta) \qquad (16)$$
Assume that the random process $\varepsilon(t)$ is independent in $t$, and consider the joint probability density function:
$$f(\theta; x_1, x_2, \ldots, x_N) = \prod_{t=1}^{N} f_\varepsilon(x_t, t; \theta) \qquad (17)$$
Evaluating this at the observed prediction errors gives the likelihood function:

$$L(\theta) = \prod_{t=1}^{N} f_\varepsilon\bigl(y(t) - \hat y(t \mid \theta),\, t; \theta\bigr) \qquad (18)$$
$$\frac{1}{N}\log L(\theta) = \frac{1}{N}\sum_{t=1}^{N} \log f_\varepsilon(\varepsilon, t; \theta) \qquad (19)$$
The Maximum Likelihood Estimate maximizes this criterion:

$$\hat\theta_{ML}(Z^N) = \arg\max_{\theta} \frac{1}{N}\sum_{t=1}^{N} l(\varepsilon, \theta, t) \qquad (20)$$

where $l(\varepsilon, \theta, t) = \log f_\varepsilon(\varepsilon, t; \theta)$.
Compare this with the least-squares prediction-error estimate:

$$\hat\theta_{LS}(z^N) = \arg\min_{\theta} \frac{1}{N}\sum_{t=1}^{N} \frac{1}{2}\varepsilon^2(t, \theta) \qquad (21)$$

Replacing $-\frac{1}{2}\varepsilon^2(t, \theta) = l(\varepsilon, \theta, t)$, this takes the same form as (20):

$$\hat\theta_{LS}(z^N) = \arg\max_{\theta} \frac{1}{N}\sum_{t=1}^{N} l(\varepsilon, \theta, t)$$
Therefore, the Maximum Likelihood Estimate is a special case of the prediction-error method. The function $l(\varepsilon, \theta, t)$ may take a different form depending on the assumed PDF of the process under consideration.
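As an illustration of this correspondence, the sketch below fits a hypothetical one-parameter predictor $\hat y(t \mid \theta) = \theta\, y(t-1)$ by both criteria (21) and (20); with a Gaussian $f_\varepsilon$ the two estimates coincide up to numerical tolerance. The model, data, and noise variance are assumptions made for this example:

```python
# Prediction-error estimation of theta with two choices of l(eps, theta, t):
# least squares (21) and Gaussian log-likelihood (20).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
theta_true, N = 0.7, 2000
y = np.zeros(N)
for t in range(1, N):                       # hypothetical AR(1) data
    y[t] = theta_true * y[t - 1] + rng.normal(scale=0.5)

def eps(theta):
    # eps(t, theta) = y(t) - y_hat(t | theta)
    return y[1:] - theta * y[:-1]

ls = minimize_scalar(lambda th: 0.5 * np.mean(eps(th) ** 2))   # criterion (21)
sigma2 = 0.25
ml = minimize_scalar(lambda th: -np.mean(                      # criterion (20)
    -0.5 * np.log(2 * np.pi * sigma2) - eps(th) ** 2 / (2 * sigma2)))
print(ls.x, ml.x)   # both close to theta_true, and to each other
```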
Consider the mean square error matrix of an estimate $\hat\theta_N$ compared to its true value $\theta_0$:

$$P = E\left[\bigl(\hat\theta(y^N) - \theta_0\bigr)\bigl(\hat\theta(y^N) - \theta_0\bigr)^T\right] \qquad (22)$$
This mean square error matrix measures the overall quality of an estimate. Our goal is to establish a lower bound for this matrix. The bound, called the Cramer-Rao Lower Bound, gives the best performance that we can expect among all unbiased estimators.
Suppose that the estimate is unbiased,

$$E[\hat\theta(y^N)] = \theta_0 \qquad (23)$$
where the PDF of $y^N$ is $f_y(\theta_0; y^N)$. Then

$$E\left[\bigl(\hat\theta(y^N) - \theta_0\bigr)\bigl(\hat\theta(y^N) - \theta_0\bigr)^T\right] \ge M^{-1} \qquad (24)$$

where $M$ is the Fisher Information Matrix,

$$M = E\left[\left(\frac{d}{d\theta_0}\log f_y(\theta_0; y^N)\right)\left(\frac{d}{d\theta_0}\log f_y(\theta_0; y^N)\right)^T\right] \qquad \text{(25)-(a)}$$

$$\phantom{M} = -\,E\left[\frac{d^2}{d\theta_0^2}\log f_y(\theta_0; y^N)\right] \qquad \text{(25)-(b)}$$
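Before the proof, the bound is easy to check by simulation for the Gaussian mean with known variance $\lambda$: there $M = N/\lambda$, and the sample mean attains $M^{-1} = \lambda/N$. A minimal Monte Carlo sketch, with arbitrary parameter values chosen for the example:

```python
# Monte Carlo check of (24) for the Gaussian mean: var(m_hat) >= lambda / N,
# with equality for the sample mean.
import numpy as np

rng = np.random.default_rng(3)
m0, lam, N, trials = 2.0, 1.5, 50, 20000

estimates = rng.normal(m0, np.sqrt(lam), size=(trials, N)).mean(axis=1)
print(estimates.var())  # empirical variance of m_hat
print(lam / N)          # Cramer-Rao lower bound M^{-1}
```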
Proof
From (23)
$$\theta_0 = \int_{R^N} \hat\theta(x^N)\, f_y(\theta_0; x^N)\, dx^N \qquad (26)$$
Differentiating (26) with respect to $\theta_0$,

$$I = \int_{R^N} \hat\theta(x^N)\left[\frac{d}{d\theta_0} f_y(\theta_0; x^N)\right]^T dx^N \qquad (27)$$
where $I$ is the $d \times d$ identity matrix,

$$I = \begin{bmatrix} 1 & & 0 \\ & \ddots & \\ 0 & & 1 \end{bmatrix} \in R^{d \times d}$$
Using the identity

$$\frac{d}{d\theta}\log f(\theta) = \frac{1}{f(\theta)}\,\frac{d}{d\theta} f(\theta) \qquad (28)$$

equation (27) can be written as

$$E\left[\hat\theta(y^N)\left(\frac{d}{d\theta_0}\log f_y(\theta_0; x^N)\right)^T\right] = I \qquad (29)$$

Similarly, differentiating $\int_{R^N} f_y(\theta_0; x^N)\, dx^N = 1$ with respect to $\theta_0$ and using (28),

$$\therefore \quad 0 = E\left[\frac{d}{d\theta_0}\log f_y(\theta_0; x^N)\right]^T \qquad (30)$$
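As an aside, (30) can be checked by simulation. For the Gaussian model (3) with $\lambda$ known, the score with respect to $m$ is $\sum_i (y_i - m)/\lambda$ by (7), and its mean at the true parameter should vanish. A sketch, with arbitrary parameter values:

```python
# Verify (30): the score, evaluated at the true parameter, has zero mean.
import numpy as np

rng = np.random.default_rng(4)
m0, lam, N, trials = 2.0, 1.5, 20, 50000

y = rng.normal(m0, np.sqrt(lam), size=(trials, N))
score = (y - m0).sum(axis=1) / lam   # d/dm log f_y at theta_0, as in (7)
print(score.mean())                  # approximately 0, as (30) states
```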
Multiplying (30) by $\theta_0$ and subtracting it from (29),
$$E\left\{\bigl[\hat\theta(y^N) - \theta_0\bigr]\left[\frac{d}{d\theta_0}\log f_y(\theta_0; x^N)\right]^T\right\} = I \qquad (31)$$
Define $\alpha = \hat\theta(y^N) - \theta_0 \in R^{d \times 1}$ and $\beta = \frac{d}{d\theta_0}\log f_y(\theta_0; x^N) \in R^{d \times 1}$, so that (31) reads $E(\alpha\beta^T) = I$. Consider the $2d \times 1$ vector $\begin{pmatrix}\alpha \\ \beta\end{pmatrix}$:
$$E\left[\begin{pmatrix}\alpha \\ \beta\end{pmatrix}\begin{pmatrix}\alpha^T & \beta^T\end{pmatrix}\right] = \begin{bmatrix} E\alpha\alpha^T & E\alpha\beta^T \\ E\beta\alpha^T & E\beta\beta^T \end{bmatrix} \ge 0 \qquad \text{Positive Semi-definite} \quad (32)$$
Since $E\alpha\beta^T = E\beta\alpha^T = I$, taking the Schur complement of the block $E\beta\beta^T$ gives

$$E(\alpha\alpha^T) - \left[E(\beta\beta^T)\right]^{-1} \ge 0 \qquad (33)$$

$$E(\alpha\alpha^T) \ge \left[E(\beta\beta^T)\right]^{-1} \qquad (34)$$
This gives the first equation, (25)-(a). Differentiating the transpose of (30) yields
$$0 = \int_{R^N}\left[\frac{d^2}{d\theta_0^2}\log f_y(\theta_0; x^N)\right] f_y(\theta_0; x^N)\, dx^N + \int_{R^N}\left[\frac{d}{d\theta_0}\log f_y(\theta_0; x^N)\right]\left[\frac{d}{d\theta_0} f_y(\theta_0; x^N)\right]^T dx^N \qquad (35)$$

Applying (28) to the second term shows that $-E\left[\frac{d^2}{d\theta_0^2}\log f_y(\theta_0; x^N)\right] = E(\beta\beta^T)$, which is the second equation, (25)-(b). This completes the proof.
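The matrix inequality step (32)-(34) can also be verified numerically: draw correlated samples $\alpha$ and $\beta$ with $E(\alpha\beta^T) = I$ and inspect the eigenvalues of $E(\alpha\alpha^T) - [E(\beta\beta^T)]^{-1}$. A sketch, where the joint distribution is an arbitrary construction chosen only to satisfy the cross-covariance constraint:

```python
# Numerical check of (32)-(34) with samples satisfying E(alpha beta^T) = I.
import numpy as np

rng = np.random.default_rng(5)
d, n = 3, 200000

beta = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # some E(beta beta^T) = B
B = beta.T @ beta / n
# alpha = inv(B) beta + independent noise, so that E(alpha beta^T) = inv(B) B = I
alpha = beta @ np.linalg.inv(B).T + 0.5 * rng.normal(size=(n, d))

A = alpha.T @ alpha / n
print(np.linalg.eigvalsh(A - np.linalg.inv(B)))  # all (numerically) >= 0, as (33) says
```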
Returning to the dynamical system, assume that the PDF of the prediction error does not depend explicitly on $\theta$ or on $t$:

$$\frac{\partial f_\varepsilon}{\partial \theta} = 0\,, \qquad \frac{\partial f_\varepsilon}{\partial t} = 0 \qquad (36)$$

so that $f_\varepsilon(\varepsilon, t, \theta)$ reduces to $f_\varepsilon(\varepsilon)$.
Note, however, that the total derivative of $f_\varepsilon$ w.r.t. $\theta$ is not zero, since $\varepsilon$ depends on $\theta$: $\varepsilon(t; \theta) = y(t) - \hat y(t \mid \theta)$. Denoting $l_0(\varepsilon) \equiv -\log f_\varepsilon(\varepsilon)$, we can rewrite (19); that is, the log likelihood function of the dynamical system is given by

$$\log L(\theta) = \sum_{t=1}^{N} \log f_\varepsilon(\varepsilon) = -\sum_{t=1}^{N} l_0\bigl(\varepsilon(t, \theta)\bigr) \qquad (37)$$
$$\frac{d}{d\theta}\log L(\theta) = -\sum_{t=1}^{N} \frac{dl_0}{d\varepsilon}\frac{d\varepsilon}{d\theta} = \sum_{t=1}^{N} l_0'\bigl[\varepsilon(t, \theta)\bigr]\,\psi(t, \theta) \qquad (38)$$
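The gradient formula (38) can be sanity-checked for the hypothetical predictor $\hat y(t \mid \theta) = \theta\, y(t-1)$, where $\psi(t, \theta) = y(t-1)$ and, for Gaussian noise, $l_0(\varepsilon) = \varepsilon^2/(2\sigma^2)$ up to a constant. The sketch compares (38) against a finite-difference derivative; all numerical values are assumptions made for the example:

```python
# Check (38): analytic gradient of log L(theta) vs. finite differences.
import numpy as np

rng = np.random.default_rng(6)
N, sigma2, theta = 1000, 0.25, 0.5
y = np.zeros(N)
for t in range(1, N):
    y[t] = 0.7 * y[t - 1] + rng.normal(scale=np.sqrt(sigma2))

def log_L(th):
    e = y[1:] - th * y[:-1]
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - e ** 2 / (2 * sigma2))

e = y[1:] - theta * y[:-1]         # eps(t, theta)
psi = y[:-1]                       # psi(t, theta) = d y_hat / d theta
grad = np.sum((e / sigma2) * psi)  # sum of l0'(eps) * psi, as in (38)
h = 1e-6
print(grad, (log_L(theta + h) - log_L(theta - h)) / (2 * h))  # should match
```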
Note that $\frac{d\varepsilon}{d\theta} = -\frac{d\hat y}{d\theta} = -\psi$. Replacing $\frac{d}{d\theta}\log f_y$ in (25) by the above equation yields the Fisher Information Matrix $M$, evaluated at the true parameter value $\theta_0$. At this $\theta_0$, we can assume that the prediction error sequence $\varepsilon(t, \theta_0)$ is generated as a sequence of independent random variables $\varepsilon(t, \theta_0) = e_0(t)$ whose PDF is $f_e(x)$.
This leads to the Fisher Information Matrix given by
$$M = \frac{1}{k_0}\sum_{t=1}^{N} E\left[\psi(t, \theta_0)\,\psi^T(t, \theta_0)\right]\,, \qquad \frac{1}{k_0} \equiv \int_{-\infty}^{\infty} \frac{\left[f_e'(x)\right]^2}{f_e(x)}\, dx$$
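For a zero-mean Gaussian $f_e$ with variance $\lambda$, the integral gives $1/k_0 = 1/\lambda$, so $M$ reduces to $\frac{1}{\lambda}\sum_t E[\psi\psi^T]$. A sketch verifying the integral by quadrature, assuming SciPy is available:

```python
# For Gaussian f_e with variance lam, 1/k_0 = integral of [f_e'(x)]^2 / f_e(x) = 1/lam.
import numpy as np
from scipy.integrate import quad

lam = 1.5
f  = lambda x: np.exp(-x ** 2 / (2 * lam)) / np.sqrt(2 * np.pi * lam)
fp = lambda x: -x / lam * f(x)  # derivative of the Gaussian density

inv_k0, _ = quad(lambda x: fp(x) ** 2 / f(x), -np.inf, np.inf)
print(inv_k0, 1 / lam)          # agree
```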
On the other hand, as $N$ tends to infinity, the prediction-error method provides the asymptotic variance $P_\theta$, as discussed last time. From (20), we can treat the Maximum Likelihood Estimate as a special case of the prediction-error method. Therefore, the previous result on asymptotic variance applies to MLE as well.