
VE564 Summer 2023

Lecture 3-1: Maximum Likelihood Estimation and Least Squares

Prof. H. Qiao
UM-SJTU Joint Institute
May 24, 2023
Outline

Maximum Likelihood Estimator


MLE for Scalar Parameter
MLE for Vector of Parameters
Advanced Topic of MLE: Consistency and Efficiency

Least Squares
Basic Least Squares
Variant 1: Least Squares with Unknown Model Order
Variant 2: Least Squares with Incoming Data
Variant 3: Least Squares with Constraints
Variant 4: Nonlinear Least Squares
Maximum Likelihood Estimator
The maximum likelihood (ML) estimator is an alternative to the MVU estimator. The ML principle is the most popular approach for designing practical estimators. The ML estimator attains (asymptotically) optimal performance when a large volume of data is available.

Examples

Consider the observed data set

x[n] = A + w[n], \quad n = 0, 1, \ldots, N-1

where A > 0 is an unknown level. Different from earlier examples, we assume w[n] is WGN with variance A. The likelihood function for this modified case is

p(x; A) = \frac{1}{(2\pi A)^{N/2}} \exp\left[ -\frac{1}{2A} \sum_{n=0}^{N-1} (x[n] - A)^2 \right]

It can be shown that for any unbiased estimator Â:

\mathrm{Var}(\hat{A}) \ge \frac{A^2}{N\left(A + \frac{1}{2}\right)}

Examples

We first try to find the optimal estimator that may achieve the CRLB:

\frac{\partial \ln p(x; A)}{\partial A} = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}(x[n] - A) + \frac{1}{2A^2}\sum_{n=0}^{N-1}(x[n] - A)^2

No unbiased estimator Â satisfies

\frac{\partial \ln p(x; A)}{\partial A} = I(A)(\hat{A} - A)

Thus, no efficient estimator exists.

Examples

We next try to find the MVU estimator by studying sufficient statistics. The likelihood function can be factorized as

p(x; A) = \underbrace{\frac{1}{(2\pi A)^{N/2}} \exp\left[-\frac{1}{2}\left(\frac{1}{A}\sum_{n=0}^{N-1} x^2[n] + NA\right)\right]}_{g\left(\sum_{n=0}^{N-1} x^2[n],\; A\right)} \underbrace{\exp(N\bar{x})}_{h(x)}

Then T(x) = \sum_{n=0}^{N-1} x^2[n] is a sufficient statistic. However, it is not obvious how to find a function h(\cdot) of T such that

E_A\left[h\left(\sum_{n=0}^{N-1} x^2[n]\right)\right] = A

since E\left(\sum_{n=0}^{N-1} x^2[n]\right) = N(A + A^2). We can also compute E(x[0] \mid T) for the simple unbiased estimator x[0], but the computation is formidable.

Examples

Idea: we allow the estimator to be only approximately optimal as N \to \infty:

E_A(\hat{A}) \to A, \quad \mathrm{Var}(\hat{A}) \to \mathrm{CRLB}, \quad N \to \infty

That is, Â is asymptotically unbiased (consistent) and asymptotically efficient.

Examples

Now, we consider the estimator

\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x^2[n] + \frac{1}{4}}

It can be easily verified that E(\hat{A}) \ne A. However, when N is large, a Taylor expansion implies

\hat{A} \approx A + \frac{1}{2\left(A + \frac{1}{2}\right)}\left[\frac{1}{N}\sum_{n=0}^{N-1} x^2[n] - (A + A^2)\right]

And then we can show that

E(\hat{A}) \to A, \quad \mathrm{Var}(\hat{A}) \to \frac{A^2}{N\left(A + \frac{1}{2}\right)} = \mathrm{CRLB}

Thus, Â is asymptotically optimal.
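As a quick sanity check (not from the slides; the level A = 2 and the sample sizes are arbitrary test values), the following Python sketch simulates this model and compares the empirical bias and variance of Â with the CRLB:

```python
import numpy as np

rng = np.random.default_rng(0)
A = 2.0          # true level (arbitrary test value)
trials = 20000

for N in (10, 100, 1000):
    x = A + rng.normal(0.0, np.sqrt(A), size=(trials, N))   # WGN with variance A
    A_hat = -0.5 + np.sqrt(np.mean(x**2, axis=1) + 0.25)    # the estimator above
    crlb = A**2 / (N * (A + 0.5))
    print(f"N={N:5d}  bias={A_hat.mean() - A:+.4f}  var={A_hat.var():.5f}  CRLB={crlb:.5f}")
```

The bias shrinks and the variance approaches the CRLB as N grows, illustrating the asymptotic optimality claimed above.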


Finding the MLE

The MLE for a scalar parameter is defined to be the value of θ that maximizes p(x; θ).

Example

Still consider the DC level in WGN with variance A:

\frac{\partial \ln p(x; A)}{\partial A} = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}(x[n] - A) + \frac{1}{2A^2}\sum_{n=0}^{N-1}(x[n] - A)^2

By setting \frac{\partial \ln p(x; A)}{\partial A} = 0, we obtain the MLE (the positive root):

\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x^2[n] + \frac{1}{4}}

Example

If w[n] \sim \mathcal{N}(0, \sigma^2), we have

\frac{\partial \ln p(x; A)}{\partial A} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)

And the MLE is

\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]

which is the MVU estimator.

Property of MLE

Asymptotic Properties of the MLE


If the likelihood function p(x; θ) of the data x satisfies the "regularity" conditions, then the MLE of the unknown parameter θ is asymptotically distributed according to

\hat{\theta} \overset{a}{\sim} \mathcal{N}(\theta, I^{-1}(\theta))

where I(θ) is the Fisher information evaluated at the true value of the unknown parameter.

The regularity conditions require the existence of the derivatives of the log-likelihood function, as well as the Fisher information being nonzero. See Appendix 7B of the textbook.

Example: MLE of the Sinusoidal Phase

Consider the problem of estimating the phase φ of a sinusoid embedded in noise:

x[n] = A\cos(2\pi f_0 n + \phi) + w[n], \quad n = 0, 1, \ldots, N-1

where w[n] \sim \mathcal{N}(0, \sigma^2), and the amplitude A and frequency f_0 are known. We have found the sufficient statistics

T_1(x) = \sum_{n=0}^{N-1} x[n]\cos(2\pi f_0 n), \qquad T_2(x) = \sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n)

The MLE of φ is found by maximizing p(x; φ):

p(x; \phi) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\cos(2\pi f_0 n + \phi)\right)^2\right]

Example: MLE of the Sinusoidal Phase

Using the approximation (see Eq. (7.14) in the textbook)

\frac{1}{N}\sum_{n=0}^{N-1}\sin(2\pi f_0 n + \hat{\phi})\cos(2\pi f_0 n + \hat{\phi}) = \frac{1}{2N}\sum_{n=0}^{N-1}\sin(4\pi f_0 n + 2\hat{\phi}) \approx 0

for f_0 not near 0 or 1/2, the MLE φ̂ is given by

\hat{\phi} = -\arctan\frac{\sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n)}{\sum_{n=0}^{N-1} x[n]\cos(2\pi f_0 n)}

The MLE φ̂ is a function of the sufficient statistics. This should be the case due to the Neyman-Fisher factorization theorem:

p(x; \phi) = g(T_1(x), T_2(x), \phi)\, h(x)

Thus, φ̂ is a function of T_1(x) and T_2(x). The asymptotic distribution of the phase estimator is

\hat{\phi} \overset{a}{\sim} \mathcal{N}(\phi, I^{-1}(\phi)), \qquad I(\phi) = \frac{N A^2}{2\sigma^2}
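A minimal numerical sketch of this estimator (not part of the slides; the values of N, A, f0, φ, and σ are arbitrary test choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, A, f0, phi, sigma = 200, 1.0, 0.08, 0.7, 0.5   # arbitrary test values
n = np.arange(N)
x = A * np.cos(2 * np.pi * f0 * n + phi) + rng.normal(0, sigma, N)

T1 = np.sum(x * np.cos(2 * np.pi * f0 * n))   # sufficient statistic T1(x)
T2 = np.sum(x * np.sin(2 * np.pi * f0 * n))   # sufficient statistic T2(x)
phi_hat = -np.arctan2(T2, T1)                 # arctan2 resolves the quadrant
print("phi_hat =", phi_hat, " asymptotic std =", np.sqrt(2 * sigma**2 / (N * A**2)))
```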
Example: MLE of the Sinusoidal Phase

So the asymptotic variance is

\mathrm{Var}(\hat{\phi}) = \frac{1}{N}\,\frac{1}{A^2/(2\sigma^2)} = \frac{1}{N\eta}

where \eta = \dfrac{A^2/2}{\sigma^2} is the SNR.

Example: MLE of the Sinusoidal Phase

Now we fix the data record length and vary the SNR.

MLE for Transformed Parameters

Suppose we try to estimate a function of θ: g(θ). Can we easily find the MLE?

Invariance Property of the MLE
The MLE of the parameter α = g(θ) is given by

\hat{\alpha} = g(\hat{\theta})

where θ̂ is the MLE of θ. The MLE θ̂ is obtained by maximizing p(x; θ).

If g is not a one-to-one function, then α̂ maximizes the modified likelihood function \bar{p}_T(x; \alpha) defined as

\bar{p}_T(x; \alpha) = \max_{\{\theta:\ \alpha = g(\theta)\}} p(x; \theta)

Example: Transformed DC Level in WGN

Suppose we want to estimate α = A^2. Note that

A = \pm\sqrt{\alpha}

and the transformation is not one-to-one. Thus, we have to consider two likelihood functions, one for each branch of A:

p_{T_1}(x; \alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - \sqrt{\alpha}\right)^2\right] \quad (A \ge 0)

p_{T_2}(x; \alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] + \sqrt{\alpha}\right)^2\right] \quad (A < 0)

Then, the MLE of α is given by

\hat{\alpha} = \arg\max_{\alpha \ge 0}\max\{p_{T_1}(x; \alpha),\, p_{T_2}(x; \alpha)\} = \arg\max_{\alpha \ge 0}\max\{p(x; \sqrt{\alpha}),\, p(x; -\sqrt{\alpha})\}
= \left[\arg\max_{-\infty < A < \infty} p(x; A)\right]^2 = \bar{x}^2

Example: Power of WGN in dB

We observe N samples of WGN with variance σ^2 and wish to estimate the noise power in dB. That is, we want to find the MLE of

P = 10\log_{10}\sigma^2

To find the MLE of σ^2:

\frac{\partial \ln p(x; \sigma^2)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{n=0}^{N-1} x^2[n]

\hat{\sigma}^2 = \frac{1}{N}\sum_{n=0}^{N-1} x^2[n]

By the invariance property,

\hat{P} = 10\log_{10}\hat{\sigma}^2 = 10\log_{10}\left(\frac{1}{N}\sum_{n=0}^{N-1} x^2[n]\right)

Extension to a Vector Parameter

Asymptotic Properties of the MLE


If the likelihood function p(x; θ) satisfies the "regularity" conditions, then the MLE of the unknown parameter θ is asymptotically distributed according to

\hat{\boldsymbol{\theta}} \overset{a}{\sim} \mathcal{N}(\boldsymbol{\theta}, \mathbf{I}^{-1}(\boldsymbol{\theta}))

where I(θ) is the Fisher information matrix.

Remark: If the number of parameters is larger than the number of samples, the MLE may not follow the asymptotic distribution above.

Extension to a Vector Parameter

Invariance Property of the MLE


The MLE of the parameter α = g(θ), where g is an r-dimensional function of the p × 1 parameter θ, is given by

\hat{\boldsymbol{\alpha}} = \mathbf{g}(\hat{\boldsymbol{\theta}})

If g is not an invertible function, then α̂ maximizes the modified likelihood function \bar{p}_T(x; \boldsymbol{\alpha}):

\bar{p}_T(x; \boldsymbol{\alpha}) = \max_{\{\boldsymbol{\theta}:\ \boldsymbol{\alpha} = \mathbf{g}(\boldsymbol{\theta})\}} p(x; \boldsymbol{\theta})

Example: Signal in Non-Gaussian Noise

Consider the data

x[n] = s[n] + w[n], \quad n = 0, 1, \ldots, N-1

where w[n] is zero-mean i.i.d. noise with the Laplacian PDF

p(w[n]) = \frac{1}{4}\exp\left(-\frac{1}{2}|w[n]|\right)

All signal samples \{s[0], s[1], \ldots, s[N-1]\} are to be estimated. The PDF of the data is

p(x; \boldsymbol{\theta}) = \prod_{n=0}^{N-1}\frac{1}{4}\exp\left(-\frac{1}{2}|x[n] - s[n]|\right)

For each n, the MLE is given by

\hat{s}[n] = x[n]

Example: Signal in Non-Gaussian Noise

Or equivalently, the MLE of θ is given by

\hat{\boldsymbol{\theta}} = \mathbf{x}

The MLE is not asymptotically Gaussian as N \to \infty. This is because we have as many unknown parameters as data samples.

MLE for the Linear (Affine) Model

Optimality of the MLE for the Linear Model


Suppose the observation model is

\mathbf{x} = \mathbf{H}\boldsymbol{\theta} + \mathbf{w}

where \mathbf{H} \in \mathbb{R}^{N \times p} has full column rank p and \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}). Then the MLE of θ is

\hat{\boldsymbol{\theta}} = (\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{C}^{-1}\mathbf{x}

θ̂ is also an efficient estimator and hence is the MVU estimator.

The likelihood function is

p(\mathbf{x}; \boldsymbol{\theta}) = \frac{1}{(2\pi)^{N/2}\det^{1/2}(\mathbf{C})}\exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})\right]

\frac{\partial \ln p(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \mathbf{H}^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})
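A small illustrative sketch (synthetic H, C, and x; not from the slides) of computing this estimator; solving linear systems is preferred over forming explicit inverses:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 3
H = rng.normal(size=(N, p))                      # known observation matrix
theta_true = np.array([1.0, -2.0, 0.5])          # true parameter (test value)
C = np.diag(rng.uniform(0.5, 2.0, N))            # known noise covariance (diagonal here)
x = H @ theta_true + rng.multivariate_normal(np.zeros(N), C)

# MLE / MVU for the linear model: (H^T C^{-1} H)^{-1} H^T C^{-1} x
Ci_H = np.linalg.solve(C, H)                     # C^{-1} H without forming C^{-1}
Ci_x = np.linalg.solve(C, x)
theta_hat = np.linalg.solve(H.T @ Ci_H, H.T @ Ci_x)
print(theta_hat)
```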

Consistency of MLE*

Suppose that the observations \{x_k\}_{k=1}^{\infty} are an i.i.d. sequence of random variables with density p(x; θ). Define

\psi(x_k; \theta) = \frac{\partial \ln p(x_k; \theta)}{\partial \theta}, \qquad J(\theta; \theta') = E_{\theta}[\psi(x_1; \theta')]

Then the MLE θ̂_n is asymptotically consistent (θ̂_n → θ) if

(1) J(θ; θ') is a continuous function of θ' and has a unique root at θ' = θ, at which point it changes sign.
(2) ψ(x_k; θ') is a continuous function of θ' with probability 1.
(3) For each sample size n, \sum_{k=1}^{n}\psi(x_k; \theta')/n has a unique root θ̂_n.

V. Poor, Proposition IV.D.1
Asymptotic Normality of MLEs*

Suppose that the observations \{x_k\}_{k=1}^{\infty} are an i.i.d. sequence of random variables with density p(x; θ). Also assume that \{\hat{\theta}_n\}_{n=1}^{\infty} is a consistent sequence of roots of the likelihood equation. If ψ satisfies the following regularity conditions:

(1) 0 < i_{\theta} \triangleq E_{\theta}\left[(\psi(x_1; \theta))^2\right] < \infty
(2) The derivatives

\psi'(x_1; \theta') \triangleq \frac{\partial \psi(x_1; \theta')}{\partial \theta'}, \qquad \psi''(x_1; \theta') \triangleq \frac{\partial^2 \psi(x_1; \theta')}{\partial \theta'^2}

exist.
(3) There is a function M(x_1) such that |\psi''(x_1; \theta')| \le M(x_1) for all \theta' \in \Omega and E_{\theta}[M(x_1)] < \infty.

V. Poor, Proposition IV.D.2
Asymptotic Normality of MLEs*

(4) J(\theta; \theta) = 0
(5)

\int \frac{\partial^2 p(x; \theta)}{\partial \theta^2}\,\mu(dx) = \frac{\partial^2}{\partial \theta^2}\int p(x; \theta)\,\mu(dx)

Then θ̂_n is asymptotically normal.

V. Poor, Proposition IV.D.2
Example: One-parameter exponential family

Let x_1, \ldots, x_n be i.i.d. according to a one-parameter exponential family with density

f(x_i \mid \eta) = e^{\eta T(x_i) - A(\eta)}

with respect to a σ-finite measure μ, and let the estimand be η. To obtain η̂, we need to solve

\frac{1}{n}\sum_{i} T(x_i) = A'(\eta) \iff E_{\eta}[T(x_j)] = \frac{1}{n}\sum_{i} T(x_i)

Since

\frac{d}{d\eta}E_{\eta}[T(x_j)] = \mathrm{Var}_{\eta}(T(x_j)) > 0

the MLE solution is unique. Furthermore, since the Fisher information equals \mathrm{Var}_{\eta}(T), we have

\sqrt{n}(\hat{\eta} - \eta) \to \mathcal{N}\left(0, \frac{1}{\mathrm{Var}_{\eta}(T)}\right)
Least Squares

Pros:

• No probabilistic assumptions needed


• Simple to compute

Cons:

• No claims about optimality can be made


• The performance is difficult to analyze in general


C. F. Gauss claimed that he came up with the least squares method in 1795.

The Least Squares Approach

Idea: In earlier discussions, we wanted to find the unbiased estimator with the smallest variance. Now, we try to directly minimize the average discrepancy between the noisy data and the assumed signal model.

Formulation: Suppose the true signal s[n] is parameterized by θ and we only have the noisy data x[n]. Then the least squares estimator (LSE) of θ is given by

\hat{\theta}_{LS} = \arg\min_{\theta} J(\theta) = \arg\min_{\theta}\sum_{n=0}^{N-1}(x[n] - s[n])^2

Note that we do not make any probabilistic assumptions about the data x[n].

Examples

For the problem of estimating the DC level, the LSE of A is obtained by minimizing

J(A) = \sum_{n=0}^{N-1}(x[n] - A)^2

which gives \hat{A}_{LS} = \bar{x} = \frac{1}{N}\sum_{n=0}^{N-1} x[n].

Remark
Note that Â_LS may not be optimal in any sense, as we do not make any statistical assumptions. It is the MVU estimator for the WGN case but will be biased if the noise is not zero-mean.

Examples

Consider estimating the frequency/amplitude from

s[n] = A\cos(2\pi f_0 n)

If f_0 is unknown, we need to minimize

J(f_0) = \sum_{n=0}^{N-1}(x[n] - A\cos(2\pi f_0 n))^2

which is a nonlinear function of f_0. This is a nonlinear least squares problem.

If A is unknown, we again need to minimize

J(A) = \sum_{n=0}^{N-1}(x[n] - A\cos(2\pi f_0 n))^2

which has a closed-form solution in terms of f_0. If both A and f_0 are unknown, we can first minimize J(A, f_0) with respect to A and then with respect to f_0; this is known as a separable least squares problem.
Linear Least Squares (vector parameter)

As before, we assume the signal of interest can be modeled as

\mathbf{s} = \mathbf{H}\boldsymbol{\theta}

where the observation matrix \mathbf{H} \in \mathbb{R}^{N \times p} has full rank p. Different from previous discussions, we do not assume that the noise follows a particular distribution. The LSE is found by minimizing

J(\boldsymbol{\theta}) = \sum_{n=0}^{N-1}(x[n] - s[n])^2 = (\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})

The linear form allows a closed-form solution:

\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = -2\mathbf{H}^T\mathbf{x} + 2\mathbf{H}^T\mathbf{H}\boldsymbol{\theta}

\mathbf{H}^T\mathbf{H}\boldsymbol{\theta} = \mathbf{H}^T\mathbf{x} \quad \text{(Normal Equations)}

\hat{\boldsymbol{\theta}}_{LS} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}

J_{LS} = J(\hat{\boldsymbol{\theta}}_{LS}) = \mathbf{x}^T\left(\mathbf{I} - \mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\right)\mathbf{x}
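An illustrative sketch (synthetic data, not from the slides) of the linear LSE; numpy.linalg.lstsq solves the normal equations via a stable factorization rather than an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 100, 4
H = rng.normal(size=(N, p))                     # full-column-rank observation matrix
theta_true = rng.normal(size=p)
x = H @ theta_true + 0.1 * rng.normal(size=N)

theta_ls, *_ = np.linalg.lstsq(H, x, rcond=None)
J_ls = np.sum((x - H @ theta_ls) ** 2)          # minimum LS cost
print(theta_ls, J_ls)
```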
Weighted Least Squares

An extension of the linear least squares problem is weighted least squares. Given a positive definite weighting matrix W, the cost function becomes

J(\boldsymbol{\theta}) = (\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T\mathbf{W}(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})

The closed-form solution of the WLS is

\hat{\boldsymbol{\theta}} = (\mathbf{H}^T\mathbf{W}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{W}\mathbf{x}

with minimum error

J_{\min} = \mathbf{x}^T\left(\mathbf{W} - \mathbf{W}\mathbf{H}(\mathbf{H}^T\mathbf{W}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{W}\right)\mathbf{x}
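A short sketch (illustrative) of one way to compute the WLS estimate: with the Cholesky factor W = LLᵀ, minimizing (x − Hθ)ᵀW(x − Hθ) is ordinary LS on the whitened data (Lᵀx, LᵀH):

```python
import numpy as np

def wls(H, x, W):
    """Weighted LS: theta = (H^T W H)^{-1} H^T W x via whitening."""
    L = np.linalg.cholesky(W)              # W = L L^T (W positive definite)
    Hw, xw = L.T @ H, L.T @ x              # whitened design matrix and data
    theta, *_ = np.linalg.lstsq(Hw, xw, rcond=None)
    return theta
```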

Weighted Least Squares

Why do we consider weighted least squares?

1. Focusing Accuracy: As a regression problem on Hθ, we may only


care about the feature values in a certain region. This may be
because these values (similar to certain rows of H) will often appear
or induce higher risk. Then, we can assign different weights to
points in a particular region.
2. Discounting Imprecision: If the additive noise has constant variance, the case is called homoskedasticity; otherwise the data are heteroskedastic. In the heteroskedastic case, there exist other estimators that are unbiased and have smaller variance than ordinary least squares.
3. More General: Other optimization problems can be transformed
into or approximated by weighted least squares.

Geometrical Interpretation

The LSE is obtained by minimizing

J(\boldsymbol{\theta}) = (\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})

Let \mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_p]. Then J(θ) can be represented as

J(\boldsymbol{\theta}) = \|\mathbf{x} - \mathbf{H}\boldsymbol{\theta}\|_2^2 = \left\|\mathbf{x} - \sum_{i=1}^{p}\theta_i\mathbf{h}_i\right\|_2^2

Without further restriction on θ, the LSE attempts to minimize the squared distance from the data vector x to a signal vector \sum_{i=1}^{p}\theta_i\mathbf{h}_i, which lies in the column space of H.

If H has full column rank, each vector in the column space corresponds to a unique parameter θ.

Geometrical Interpretation

Suppose \mathbf{H} \in \mathbb{R}^{3 \times 2}. The LSE solution yields the vector \hat{\mathbf{s}} \in S^2 = \mathrm{span}(\mathbf{h}_1, \mathbf{h}_2), where \hat{\mathbf{s}} is the orthogonal projection of x:

(\mathbf{x} - \hat{\mathbf{s}}) \perp \{\mathbf{h}_1, \mathbf{h}_2\}

Geometrical Interpretation

In the general case, let \boldsymbol{\epsilon} = \mathbf{x} - \mathbf{H}\boldsymbol{\theta}. We have

\boldsymbol{\epsilon} \perp \mathcal{C}(\mathbf{H}) \iff \boldsymbol{\epsilon}^T\mathbf{H} = \mathbf{0}

And the LSE is given by

\hat{\boldsymbol{\theta}} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}

with the error

\|\boldsymbol{\epsilon}\|_2^2 = J_{\min} = \mathbf{x}^T\left(\mathbf{I} - \mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\right)\mathbf{x}

Note that

\mathbf{P} = \mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T

is an orthogonal projection matrix, and we have

J_{\min} = \mathbf{x}^T(\mathbf{I} - \mathbf{P})\mathbf{x} = \|\mathbf{P}^{\perp}\mathbf{x}\|_2^2


Variant 1: Order-Recursive Least Squares

In many cases, we do not know the signal model exactly, and we may try different models in a particular order. For example, suppose we try to fit a given set of data.

Variant 1: Order-Recursive Least Squares

If we assume the true signal is a constant

s_1[n] = A

then the LSE with \mathbf{H}_1 = [1, 1, \ldots, 1]^T is given by

\hat{A}_1 = \bar{x}

Variant 1: Order-Recursive Least Squares

If we assume the true signal is a line

s_2[n] = A + Bn

then the LSE is given by

\hat{A}_2 = \frac{2(2N-1)}{N(N+1)}\sum_{n=0}^{N-1} x[n] - \frac{6}{N(N+1)}\sum_{n=0}^{N-1} n\,x[n]

\hat{B}_2 = -\frac{6}{N(N+1)}\sum_{n=0}^{N-1} x[n] + \frac{12}{N(N^2-1)}\sum_{n=0}^{N-1} n\,x[n]

Variant 1: Order-Recursive Least Squares

\mathbf{H}_1 is a submatrix of the observation matrix \mathbf{H}_2 of the line case:

\mathbf{H}_2 = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & N-1 \end{bmatrix}

However, it is not good to assume a model more complicated than necessary (e.g., the line model when a constant suffices).

Variant 1: Order-Recursive Least Squares

Occam’s Razor Principle: entities should not be multiplied without necessity.

Manuscript illustration of William of Ockham (1287-1347)

Variant 1: Order-Recursive Least Squares

As illustrated by the example, we may have to try a sequence of models from the simplest to the most complicated. To reduce the computation, it is desirable to update the LSE recursively as the model order grows.

Specifically, we would like to compute the LSE based on \mathbf{H} \in \mathbb{R}^{N \times (k+1)} from the solution based on the first k columns of H.

Variant 1: Order-Recursive Least Squares

Denote by \mathbf{H}_k \in \mathbb{R}^{N \times k} the observation matrix with k columns. The LSE \hat{\boldsymbol{\theta}}_k based on \mathbf{H}_k is given by

\hat{\boldsymbol{\theta}}_k = (\mathbf{H}_k^T\mathbf{H}_k)^{-1}\mathbf{H}_k^T\mathbf{x}

with minimum cost

J_{\min,k} = \|\mathbf{x} - \mathbf{H}_k\hat{\boldsymbol{\theta}}_k\|_2^2

Variant 1: Order-Recursive Least Squares

Let \mathbf{H}_{k+1} = [\mathbf{H}_k, \mathbf{h}_{k+1}] \in \mathbb{R}^{N \times (k+1)}. To update \hat{\boldsymbol{\theta}}_k and J_{\min,k}, we use

\hat{\boldsymbol{\theta}}_{k+1} = \begin{bmatrix} \hat{\boldsymbol{\theta}}_k - \dfrac{(\mathbf{H}_k^T\mathbf{H}_k)^{-1}\mathbf{H}_k^T\mathbf{h}_{k+1}\,\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{x}}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} \\[2mm] \dfrac{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{x}}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} \end{bmatrix}

where \mathbf{P}_k^{\perp} = \mathbf{I} - \mathbf{H}_k(\mathbf{H}_k^T\mathbf{H}_k)^{-1}\mathbf{H}_k^T. To avoid inverting \mathbf{H}_k^T\mathbf{H}_k, we let

\mathbf{D}_k = (\mathbf{H}_k^T\mathbf{H}_k)^{-1}

and use the recursive formula

\mathbf{D}_{k+1} = \begin{bmatrix} \mathbf{D}_k + \dfrac{\mathbf{D}_k\mathbf{H}_k^T\mathbf{h}_{k+1}\mathbf{h}_{k+1}^T\mathbf{H}_k\mathbf{D}_k}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} & -\dfrac{\mathbf{D}_k\mathbf{H}_k^T\mathbf{h}_{k+1}}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} \\[2mm] -\dfrac{\mathbf{h}_{k+1}^T\mathbf{H}_k\mathbf{D}_k}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} & \dfrac{1}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} \end{bmatrix}

where \mathbf{P}_k^{\perp} = \mathbf{I} - \mathbf{H}_k\mathbf{D}_k\mathbf{H}_k^T. The minimum LS error is updated as

J_{\min,k+1} = J_{\min,k} - \frac{(\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{x})^2}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}}
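A small numerical check (illustrative, synthetic data) that the cost update above agrees with a direct fit of the (k+1)-column model:

```python
import numpy as np

rng = np.random.default_rng(4)
N, k = 30, 2
Hk = rng.normal(size=(N, k))
h_new = rng.normal(size=N)                  # candidate (k+1)-th column
x = rng.normal(size=N)

Pk_perp = np.eye(N) - Hk @ np.linalg.solve(Hk.T @ Hk, Hk.T)   # I - Hk (Hk^T Hk)^{-1} Hk^T
Jk = x @ Pk_perp @ x                                          # J_min,k
Jk1_update = Jk - (h_new @ Pk_perp @ x) ** 2 / (h_new @ Pk_perp @ h_new)

Hk1 = np.column_stack([Hk, h_new])
Jk1_direct = x @ (np.eye(N) - Hk1 @ np.linalg.solve(Hk1.T @ Hk1, Hk1.T)) @ x
print(Jk1_update, Jk1_direct)               # the two values agree
```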
Variant 1: Order-Recursive Least Squares

This recursive procedure determines the LSE for all lower-order models. We make several observations:

1. If the new column \mathbf{h}_{k+1} is orthogonal to all the previous ones, i.e. \mathbf{H}_k^T\mathbf{h}_{k+1} = \mathbf{0}, then

\hat{\boldsymbol{\theta}}_{k+1} = \begin{bmatrix} \hat{\boldsymbol{\theta}}_k \\ \dfrac{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{x}}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} \end{bmatrix}

2. The term \mathbf{P}_k^{\perp}\mathbf{x} = \boldsymbol{\epsilon}_k is the residual that cannot be modeled by the columns of \mathbf{H}_k.

3. If \mathbf{h}_{k+1} is nearly in the column space of \mathbf{H}_k, then \|\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}\|_2 \approx 0, and the recursive procedure will blow up as \mathbf{H}_{k+1}^T\mathbf{H}_{k+1} will be nearly singular.

4. We can rewrite J_{\min,k+1} as

J_{\min,k+1} = \mathbf{x}^T\mathbf{P}_k^{\perp}\mathbf{x} - \frac{(\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{x})^2}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} = J_{\min,k}(1 - r_{k+1}^2)
Variant 1: Order-Recursive Least Squares

where the coefficient r_{k+1}^2 is defined as

0 \le r_{k+1}^2 = \frac{\langle \mathbf{P}_k^{\perp}\mathbf{h}_{k+1}, \mathbf{P}_k^{\perp}\mathbf{x}\rangle^2}{\|\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}\|_2^2\,\|\mathbf{P}_k^{\perp}\mathbf{x}\|_2^2} \le 1

If \mathbf{P}_k^{\perp}\mathbf{x} and \mathbf{P}_k^{\perp}\mathbf{h}_{k+1} are collinear, then the residual can be perfectly modeled by \mathbf{h}_{k+1} and r_{k+1} = 1.

5. The orthogonal projection matrix can be recursively updated as

\mathbf{P}_{k+1} = \mathbf{P}_k + \frac{(\mathbf{I} - \mathbf{P}_k)\mathbf{h}_{k+1}\mathbf{h}_{k+1}^T(\mathbf{I} - \mathbf{P}_k)}{\mathbf{h}_{k+1}^T(\mathbf{I} - \mathbf{P}_k)\mathbf{h}_{k+1}}

Variant 2: Sequential Least Squares

In earlier discussions, we were given the full data set. But in many cases, the samples arrive sequentially, and it is not desirable to wait for all the data before computing the LSE.

Let C be the covariance matrix of the zero-mean noise. We consider the weighted LS obtained by minimizing

J = (\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})

And the solution is given by

\hat{\boldsymbol{\theta}} = (\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{C}^{-1}\mathbf{x}

If C is diagonal (that is, the noise is uncorrelated), the LSE θ̂ can be computed sequentially.

Variant 2: Sequential Least Squares

Assume the following conditions hold:

\mathbf{C}[n] = \mathrm{diag}(\sigma_0^2, \sigma_1^2, \ldots, \sigma_n^2)

\mathbf{H}[n] = \begin{bmatrix} \mathbf{H}[n-1] \\ \mathbf{h}^T[n] \end{bmatrix}, \qquad \mathbf{x}[n] = [x[0], x[1], \ldots, x[n]]^T

and let \hat{\boldsymbol{\theta}}[n] be the LSE of θ based on x[n]. The LSE is given by

\hat{\boldsymbol{\theta}}[n] = (\mathbf{H}^T[n]\mathbf{C}^{-1}[n]\mathbf{H}[n])^{-1}\mathbf{H}^T[n]\mathbf{C}^{-1}[n]\mathbf{x}[n]

with covariance matrix of the LSE

\boldsymbol{\Sigma}[n] = (\mathbf{H}^T[n]\mathbf{C}^{-1}[n]\mathbf{H}[n])^{-1}

Variant 2: Sequential Least Squares

Estimator Update:

\hat{\boldsymbol{\theta}}[n] = \hat{\boldsymbol{\theta}}[n-1] + \mathbf{K}[n]\left(x[n] - \mathbf{h}^T[n]\hat{\boldsymbol{\theta}}[n-1]\right)

where

\mathbf{K}[n] = \frac{\boldsymbol{\Sigma}[n-1]\mathbf{h}[n]}{\sigma_n^2 + \mathbf{h}^T[n]\boldsymbol{\Sigma}[n-1]\mathbf{h}[n]}
Variant 2: Sequential Least Squares

Covariance Update:

\boldsymbol{\Sigma}[n] = (\mathbf{I} - \mathbf{K}[n]\mathbf{h}^T[n])\boldsymbol{\Sigma}[n-1]
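A compact sketch (illustrative) of the three update equations as a single routine; sigma2 is the known variance of the new noise sample:

```python
import numpy as np

def sequential_ls_update(theta, Sigma, x_n, h_n, sigma2):
    """One sequential-LS step given a new sample x_n with regressor h_n."""
    K = Sigma @ h_n / (sigma2 + h_n @ Sigma @ h_n)            # gain K[n]
    theta = theta + K * (x_n - h_n @ theta)                   # estimator update
    Sigma = (np.eye(len(theta)) - np.outer(K, h_n)) @ Sigma   # covariance update
    return theta, Sigma
```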

Example: Fourier Analysis

Consider estimating \boldsymbol{\theta} = [a, b]^T from the signal model

s[n] = a\cos(2\pi f_0 n) + b\sin(2\pi f_0 n)

To apply sequential LS, we assume the noise is uncorrelated.

To initialize the algorithm, we first acquire two samples x[0], x[1] and compute

\hat{\boldsymbol{\theta}}[1] = \left(\mathbf{H}^T[1]\frac{\mathbf{I}}{\sigma^2}\mathbf{H}[1]\right)^{-1}\mathbf{H}^T[1]\frac{\mathbf{I}}{\sigma^2}\mathbf{x}[1] = (\mathbf{H}^T[1]\mathbf{H}[1])^{-1}\mathbf{H}^T[1]\mathbf{x}[1]

where

\mathbf{H}[1] = \begin{bmatrix} 1 & 0 \\ \cos(2\pi f_0) & \sin(2\pi f_0) \end{bmatrix}, \qquad \mathbf{x}[1] = [x[0], x[1]]^T

\boldsymbol{\Sigma}[1] = \sigma^2(\mathbf{H}^T[1]\mathbf{H}[1])^{-1}

Example: Fourier Analysis

To compute \hat{\boldsymbol{\theta}}[2], we first compute K[2]:

\mathbf{K}[2] = \frac{\boldsymbol{\Sigma}[1]\mathbf{h}[2]}{\sigma^2 + \mathbf{h}^T[2]\boldsymbol{\Sigma}[1]\mathbf{h}[2]}, \qquad \mathbf{h}[2] = [\cos(4\pi f_0), \sin(4\pi f_0)]^T

And then we have

\hat{\boldsymbol{\theta}}[2] = \hat{\boldsymbol{\theta}}[1] + \mathbf{K}[2]\left(x[2] - \mathbf{h}^T[2]\hat{\boldsymbol{\theta}}[1]\right)

with covariance

\boldsymbol{\Sigma}[2] = (\mathbf{I} - \mathbf{K}[2]\mathbf{h}^T[2])\boldsymbol{\Sigma}[1]

Variant 3: Constrained Least Squares

So far, we have not imposed any restrictions on the values of θ, and the geometrical explanations apply. Here, we assume the parameter satisfies the linear constraint

\mathbf{A}\boldsymbol{\theta} = \mathbf{b}

where \mathbf{A} \in \mathbb{R}^{r \times p} has full rank r < p. Full rank means the constraints are linearly independent.

To find the LSE subject to the constraints, we use Lagrange multipliers. That is, we seek to minimize

J_c = (\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T(\mathbf{x} - \mathbf{H}\boldsymbol{\theta}) + \boldsymbol{\lambda}^T(\mathbf{A}\boldsymbol{\theta} - \mathbf{b})

\frac{\partial J_c}{\partial \boldsymbol{\theta}} = -2\mathbf{H}^T\mathbf{x} + 2\mathbf{H}^T\mathbf{H}\boldsymbol{\theta} + \mathbf{A}^T\boldsymbol{\lambda}

Setting the gradient to zero gives

\hat{\boldsymbol{\theta}}_c = \hat{\boldsymbol{\theta}} - (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\frac{\boldsymbol{\lambda}}{2}

where θ̂ is the unconstrained LSE.
Variant 3: Constrained Least Squares

To find λ, we enforce the linear constraints:

\mathbf{A}\hat{\boldsymbol{\theta}}_c = \mathbf{b} \;\Rightarrow\; \frac{\boldsymbol{\lambda}}{2} = \left[\mathbf{A}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\right]^{-1}(\mathbf{A}\hat{\boldsymbol{\theta}} - \mathbf{b})

Eventually, we have

\hat{\boldsymbol{\theta}}_c = \hat{\boldsymbol{\theta}} - (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\left[\mathbf{A}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\right]^{-1}(\mathbf{A}\hat{\boldsymbol{\theta}} - \mathbf{b})

where \hat{\boldsymbol{\theta}} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}.
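A small sketch (illustrative) implementing this correction of the unconstrained LSE:

```python
import numpy as np

def constrained_ls(H, x, A, b):
    """LSE of theta subject to A @ theta = b (A has full row rank r < p)."""
    HtH_inv = np.linalg.inv(H.T @ H)
    theta = HtH_inv @ H.T @ x                    # unconstrained LSE
    G = A @ HtH_inv @ A.T                        # r x r, invertible when A has full rank
    correction = HtH_inv @ A.T @ np.linalg.solve(G, A @ theta - b)
    return theta - correction
```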

Example: Piecewise-polynomial fitting

Consider fitting a piecewise polynomial \hat{f} specified as

\hat{f}(x) = \begin{cases} p(x) = \theta_1 + \theta_2 x + \theta_3 x^2 + \theta_4 x^3, & x \le a \\ q(x) = \theta_5 + \theta_6 x + \theta_7 x^2 + \theta_8 x^3, & x > a \end{cases}

with the requirements at the boundary point

p(a) = q(a), \qquad p'(a) = q'(a)

We try to fit the polynomial by minimizing

\sum_{i=1}^{N}\left(\hat{f}(x_i) - y_i\right)^2

which is a constrained least squares problem.

S. Boyd, EE 103
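For concreteness (an illustrative sketch, not Boyd's code; the knot location a = 1 is assumed), the two boundary conditions can be written as the linear constraints Aθ = b used above, with θ = [θ₁, ..., θ₈]ᵀ:

```python
import numpy as np

a = 1.0   # assumed knot location
# Row 1: p(a) - q(a) = 0;  Row 2: p'(a) - q'(a) = 0
A = np.array([
    [1, a, a**2,     a**3,    -1, -a, -a**2,    -a**3],
    [0, 1, 2 * a, 3 * a**2,    0, -1, -2 * a, -3 * a**2],
], dtype=float)
b = np.zeros(2)
# theta_c = constrained_ls(H, y, A, b), with H built from the data points x_i
```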
Example: Least norm problem

A special case of the constrained least squares problem:

\min_{\mathbf{x}} \|\mathbf{x}\|_2^2 \quad \text{s.t.} \quad \mathbf{A}\mathbf{x} = \mathbf{b}

S. Boyd, EE 103
Example: Least norm problem

Force sequence:

• unit mass on a frictionless surface, initially at rest
• f ∈ R^{10} represents the piecewise-constant forces applied for one second each
• the final velocity and position are

v^{\mathrm{fin}} = \sum_{i=1}^{10} f_i, \qquad p^{\mathrm{fin}} = \frac{19}{2}f_1 + \frac{17}{2}f_2 + \cdots + \frac{1}{2}f_{10}

• applied constraints: v^{\mathrm{fin}} = 0, p^{\mathrm{fin}} = 1 (a least-norm sketch follows below)

S. Boyd, EE 103
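A least-norm sketch for this force program (illustrative; the pseudoinverse gives x = Aᵀ(AAᵀ)⁻¹b when A has full row rank):

```python
import numpy as np

# Row 1 gives the final velocity, row 2 the final position.
A = np.vstack([np.ones(10),
               np.arange(19, 0, -2) / 2.0])   # [19/2, 17/2, ..., 1/2]
b = np.array([0.0, 1.0])                      # zero final velocity, unit final position

f = np.linalg.pinv(A) @ b                     # minimum-norm force sequence
print(f)
print(A @ f)                                  # constraints are satisfied
```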
Bang-bang force sequence: (figure, S. Boyd, EE 103)

Least norm force sequence: (figure, S. Boyd, EE 103)
Variant 4: Nonlinear Least Squares

The general LS cost is given by

J(\boldsymbol{\theta}) = (\mathbf{x} - \mathbf{s}(\boldsymbol{\theta}))^T(\mathbf{x} - \mathbf{s}(\boldsymbol{\theta}))

where s(θ) is the signal model for x that depends on θ. In the linear LS problem, we take the convenient model s(θ) = Hθ, which has a simple closed-form solution. But in many cases, s(θ) is a nonlinear function of θ and no closed-form solution is available.

There are two methods that can reduce the complexity of the problem:

1. transformation of parameters
2. separability of parameters

Variant 4: Nonlinear Least Squares

In the first case, we try to find a transformation α = g(θ) such that

\mathbf{s}(\boldsymbol{\theta}(\boldsymbol{\alpha})) = \mathbf{s}(\mathbf{g}^{-1}(\boldsymbol{\alpha})) = \mathbf{H}\boldsymbol{\alpha}

Then we can find the LSE of α and transform back:

\hat{\boldsymbol{\theta}} = \mathbf{g}^{-1}(\hat{\boldsymbol{\alpha}})

In general, such a function g is hard to find.

Variant 4: Nonlinear Least Squares

For the sinusoidal signal model

s[n] = A\cos(2\pi f_0 n + \phi), \quad n = 0, 1, \ldots, N-1

we try to estimate A and φ. The cost function is nonlinear:

J(A, \phi) = \sum_{n=0}^{N-1}\left(x[n] - A\cos(2\pi f_0 n + \phi)\right)^2

Now we consider the transformation

\alpha_1 = A\cos\phi, \qquad \alpha_2 = -A\sin\phi

Then \mathbf{s} = \mathbf{H}\boldsymbol{\alpha} with

\mathbf{H} = \begin{bmatrix} 1 & 0 \\ \cos(2\pi f_0) & \sin(2\pi f_0) \\ \vdots & \vdots \\ \cos(2\pi f_0 (N-1)) & \sin(2\pi f_0 (N-1)) \end{bmatrix}
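A brief sketch (illustrative) of this transformation approach: solve the linear LS problem in α, then invert the transformation to recover A and φ:

```python
import numpy as np

def amp_phase_ls(x, f0):
    """LS amplitude/phase via alpha1 = A cos(phi), alpha2 = -A sin(phi)."""
    n = np.arange(len(x))
    H = np.column_stack([np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)])
    alpha, *_ = np.linalg.lstsq(H, x, rcond=None)
    A_hat = np.hypot(alpha[0], alpha[1])         # A = sqrt(alpha1^2 + alpha2^2)
    phi_hat = np.arctan2(-alpha[1], alpha[0])    # invert the transformation
    return A_hat, phi_hat
```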
Variant 4: Nonlinear Least Squares

The second method is to exploit a separability property. That is, in some cases we have

\mathbf{s} = \mathbf{H}(\boldsymbol{\alpha})\boldsymbol{\beta}

The β that minimizes J(α, β) for a fixed α is

\hat{\boldsymbol{\beta}} = \left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{x}

And then the cost function is

J(\boldsymbol{\alpha}, \hat{\boldsymbol{\beta}}) = \mathbf{x}^T\left[\mathbf{I} - \mathbf{H}(\boldsymbol{\alpha})\left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\right]\mathbf{x}

The problem reduces to a maximization over α of

\mathbf{x}^T\mathbf{H}(\boldsymbol{\alpha})\left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{x}

Variant 4: Nonlinear Least Squares

Damped Exponentials: Consider the signal model

s[n] = A_1 r^n + A_2 r^{2n} + A_3 r^{3n}

where the unknown parameters are \{A_1, A_2, A_3, r\} and 0 < r < 1. In the separable model, \boldsymbol{\beta} = [A_1, A_2, A_3]^T and \alpha = r. Then the nonlinear LSE is obtained by maximizing

\mathbf{x}^T\mathbf{H}(r)\left(\mathbf{H}^T(r)\mathbf{H}(r)\right)^{-1}\mathbf{H}^T(r)\mathbf{x}

over 0 < r < 1, where

\mathbf{H}(r) = \begin{bmatrix} 1 & 1 & 1 \\ r & r^2 & r^3 \\ \vdots & \vdots & \vdots \\ r^{N-1} & r^{2(N-1)} & r^{3(N-1)} \end{bmatrix}
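A sketch (illustrative) of the separable solution using a simple grid search over r; a finer grid or a 1-D optimizer could replace it:

```python
import numpy as np

def damped_exp_ls(x):
    """Separable NLS: grid-search r, then linear LS for the amplitudes."""
    n = np.arange(len(x))
    best = (-np.inf, None, None)
    for r in np.linspace(0.01, 0.99, 99):
        H = np.column_stack([r**n, r**(2 * n), r**(3 * n)])
        beta, *_ = np.linalg.lstsq(H, x, rcond=None)
        score = x @ H @ beta                  # equals x^T H (H^T H)^{-1} H^T x
        if score > best[0]:
            best = (score, r, beta)
    return best[1], best[2]                   # r_hat and [A1, A2, A3]
```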

Algorithm for NLS: Levenberg-Marquardt Algorithm

The basic idea:

• At any point z, we can approximate the nonlinear function f(x) as

\hat{f}(x; z) = f(z) + \nabla f(z)(x - z)

• \hat{f}(x; z) \approx f(x) if x is close to z
• we can minimize the affine approximation \|\hat{f}(x; z)\|_2^2 using linear LS
• we iterate the choice of z

S. Boyd, EE 103, Stanford
Algorithm for NLS: Levenberg-Marquardt Algorithm

Main structure:

• iterates x^{(1)}, x^{(2)}, \ldots
• form the affine approximation of f at x^{(k)}:

\hat{f}(x; x^{(k)}) = f(x^{(k)}) + \nabla f(x^{(k)})(x - x^{(k)})

• choose x^{(k+1)} as the minimizer of

\|\hat{f}(x; x^{(k)})\|_2^2 + \lambda^{(k)}\|x - x^{(k)}\|_2^2

for some \lambda^{(k)} > 0
• we impose the regularizer (second term) because we need the affine approximation to hold

S. Boyd, EE 103, Stanford
Algorithm for NLS: Levenberg-Marquardt Algorithm

Adjusting λ:

1. idea:
   • if \lambda^{(k)} is too big, x^{(k+1)} is close to x^{(k)}, so progress is slow
   • if \lambda^{(k)} is too small, x^{(k+1)} is far from x^{(k)}, and the linear approximation is poor
2. practical update mechanism:
   • if \|f(x^{(k+1)})\|_2^2 < \|f(x^{(k)})\|_2^2, accept the update and reduce λ: \lambda^{(k+1)} = 0.8\lambda^{(k)}
   • otherwise, increase λ and do not update: \lambda^{(k+1)} = 2\lambda^{(k)}, \; x^{(k+1)} = x^{(k)}

S. Boyd, EE 103, Stanford
Algorithm for NLS: Levenberg-Marquardt Algorithm

Closed-form update for a scalar parameter:

• The update:

x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{\lambda^{(k)} + (f'(x^{(k)}))^2}\, f(x^{(k)})

• For \lambda^{(k)} = 0, it reduces to the Newton update for solving f(x) = 0
• But the Newton update does not make sense if f'(x^{(k)}) = 0 (a scalar sketch follows below)

S. Boyd, EE 103, Stanford
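A compact scalar sketch (illustrative) combining the update above with the λ adjustment rule from the previous slide:

```python
def levenberg_marquardt_scalar(f, fprime, x0, lam=1.0, iters=50):
    """Minimize f(x)^2 for scalar x with the Levenberg-Marquardt update."""
    x = x0
    for _ in range(iters):
        x_new = x - f(x) * fprime(x) / (lam + fprime(x) ** 2)
        if f(x_new) ** 2 < f(x) ** 2:
            x, lam = x_new, 0.8 * lam   # accept the step, relax the regularization
        else:
            lam = 2.0 * lam             # reject the step, trust the affine model less
    return x

# Example: solve f(x) = x^2 - 2 = 0 in the least squares sense
print(levenberg_marquardt_scalar(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))
```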
Example: Location from range measurements (figures, S. Boyd, EE 103, Stanford)