
VE564 Summer 2023

Lecture 3-1: Maximum Likelihood Estimation and Least Squares

Prof. H. Qiao
UM-SJTU Joint Institute
May 24, 2023
Outline

Maximum Likelihood Estimator


MLE for Scalar Parameter
MLE for Vector of Parameters
Advanced Topic of MLE: Consistency and Efficiency

Least Squares
Basic Least Squares
Variant 1: Least Squares with Unknown Model Order
Variant 2: Least Squares with Incoming Data
Variant 3: Least Squares with Constraints
Variant 4: Nonlinear Least Squares
Maximum Likelihood Estimator
The maximum likelihood (ML) estimator is an alternative to the MVU estimator. The ML principle is the most popular approach for designing practical estimators. The ML estimator attains (asymptotically) optimal performance when a large volume of data is available.

Examples

Consider the observed data set

x[n] = A + w[n], \quad n = 0, 1, \ldots, N-1

where A > 0 is an unknown level. Different from earlier examples, we assume w[n] is WGN with variance A. The likelihood function for this modified case is

p(x; A) = \frac{1}{(2\pi A)^{N/2}} \exp\left[ -\frac{1}{2A} \sum_{n=0}^{N-1} (x[n] - A)^2 \right]

It can be shown that for any unbiased estimator Â:

\mathrm{Var}(\hat{A}) \ge \frac{A^2}{N\left(A + \frac{1}{2}\right)}

Examples

We first try to find the optimal estimator that may achieve the CRLB:

\frac{\partial \ln p(x; A)}{\partial A} = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}(x[n] - A) + \frac{1}{2A^2}\sum_{n=0}^{N-1}(x[n] - A)^2

No unbiased estimator Â satisfies

\frac{\partial \ln p(x; A)}{\partial A} = I(A)(\hat{A} - A)

Thus, no efficient estimator exists.

Examples

We next try to find the MVU estimator by studying sufficient statistics. The likelihood function can be factorized as

p(x; A) = \underbrace{\frac{1}{(2\pi A)^{N/2}} \exp\left[-\frac{1}{2}\left(\frac{1}{A}\sum_{n=0}^{N-1} x^2[n] + NA\right)\right]}_{g\left(\sum_{n=0}^{N-1} x^2[n],\; A\right)} \underbrace{\exp(N\bar{x})}_{h(x)}

Then T(x) = \sum_{n=0}^{N-1} x^2[n] is a sufficient statistic. However, it is not obvious how to find a function h(\cdot) of T such that

E_A\left[h\left(\sum_{n=0}^{N-1} x^2[n]\right)\right] = A

since E\left(\sum_{n=0}^{N-1} x^2[n]\right) = N(A + A^2). We can also compute E(x[0] \mid T) for the simple unbiased estimator x[0], but the computation is formidable.

Examples

Idea: we allow the estimator to be only approximately optimal as N \to \infty:

E_A(\hat{A}) \to A, \quad \mathrm{Var}(\hat{A}) \to \mathrm{CRLB}, \quad N \to \infty

That is, Â is asymptotically unbiased (consistent) and asymptotically efficient.

Examples

Now, we consider the estimator

\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x^2[n] + \frac{1}{4}}

It can be easily verified that E(\hat{A}) \ne A. However, when N is large, a Taylor expansion implies

\hat{A} \approx A + \frac{1}{2\left(A + \frac{1}{2}\right)}\left[\frac{1}{N}\sum_{n=0}^{N-1} x^2[n] - (A + A^2)\right]

And then we can show that

E(\hat{A}) \to A, \quad \mathrm{Var}(\hat{A}) \to \frac{A^2}{N\left(A + \frac{1}{2}\right)} = \mathrm{CRLB}

Thus, Â is asymptotically optimal.
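As a quick sanity check (not from the slides; the level A = 2 and the sample sizes are arbitrary test values), the following Python sketch simulates this model and compares the empirical bias and variance of Â with the CRLB:

```python
import numpy as np

rng = np.random.default_rng(0)
A = 2.0          # true level (arbitrary test value)
trials = 20000

for N in (10, 100, 1000):
    x = A + rng.normal(0.0, np.sqrt(A), size=(trials, N))   # WGN with variance A
    A_hat = -0.5 + np.sqrt(np.mean(x**2, axis=1) + 0.25)    # the estimator above
    crlb = A**2 / (N * (A + 0.5))
    print(f"N={N:5d}  bias={A_hat.mean() - A:+.4f}  var={A_hat.var():.5f}  CRLB={crlb:.5f}")
```

The bias shrinks and the variance approaches the CRLB as N grows, illustrating the asymptotic optimality claimed above.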


Finding the MLE

The MLE for a scalar parameter is defined to be the value of θ that maximizes p(x; θ).

Example

Still consider the DC level in WGN with variance A:

\frac{\partial \ln p(x; A)}{\partial A} = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}(x[n] - A) + \frac{1}{2A^2}\sum_{n=0}^{N-1}(x[n] - A)^2

By setting \frac{\partial \ln p(x; A)}{\partial A} = 0, we obtain the MLE (the positive root):

\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x^2[n] + \frac{1}{4}}

Example

If w[n] \sim \mathcal{N}(0, \sigma^2), we have

\frac{\partial \ln p(x; A)}{\partial A} = \frac{1}{\sigma^2}\sum_{n=0}^{N-1}(x[n] - A)

And the MLE is

\hat{A} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]

which is the MVU estimator.

Property of MLE

Asymptotic Properties of the MLE


If the likelihood function p(x; θ) of the data x satisfies the "regularity" conditions, then the MLE of the unknown parameter θ is asymptotically distributed according to

\hat{\theta} \overset{a}{\sim} \mathcal{N}(\theta, I^{-1}(\theta))

where I(θ) is the Fisher information evaluated at the true value of the unknown parameter.

The regularity conditions require the existence of the derivatives of the log-likelihood function, as well as the Fisher information being nonzero. See Appendix 7B of the textbook.

Example: MLE of the Sinusoidal Phase

Consider the problem of estimating the phase φ of a sinusoid embedded in noise:

x[n] = A\cos(2\pi f_0 n + \phi) + w[n], \quad n = 0, 1, \ldots, N-1

where w[n] \sim \mathcal{N}(0, \sigma^2), and the amplitude A and frequency f_0 are known. We have found the sufficient statistics

T_1(x) = \sum_{n=0}^{N-1} x[n]\cos(2\pi f_0 n), \qquad T_2(x) = \sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n)

The MLE of φ is found by maximizing p(x; φ):

p(x; \phi) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - A\cos(2\pi f_0 n + \phi)\right)^2\right]

Example: MLE of the Sinusoidal Phase

Using the approximation (see Eq. (7.14) in the textbook)

\frac{1}{N}\sum_{n=0}^{N-1}\sin(2\pi f_0 n + \hat{\phi})\cos(2\pi f_0 n + \hat{\phi}) = \frac{1}{2N}\sum_{n=0}^{N-1}\sin(4\pi f_0 n + 2\hat{\phi}) \approx 0

for f_0 not near 0 or 1/2, the MLE φ̂ is given by

\hat{\phi} = -\arctan\frac{\sum_{n=0}^{N-1} x[n]\sin(2\pi f_0 n)}{\sum_{n=0}^{N-1} x[n]\cos(2\pi f_0 n)}

The MLE φ̂ is a function of the sufficient statistics. This should be the case due to the Neyman-Fisher factorization theorem:

p(x; \phi) = g(T_1(x), T_2(x), \phi)\, h(x)

Thus, φ̂ is a function of T_1(x) and T_2(x). The asymptotic distribution of the phase estimator is

\hat{\phi} \overset{a}{\sim} \mathcal{N}(\phi, I^{-1}(\phi)), \qquad I(\phi) = \frac{N A^2}{2\sigma^2}
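A minimal numerical sketch of this estimator (not part of the slides; the values of N, A, f0, φ, and σ are arbitrary test choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, A, f0, phi, sigma = 200, 1.0, 0.08, 0.7, 0.5   # arbitrary test values
n = np.arange(N)
x = A * np.cos(2 * np.pi * f0 * n + phi) + rng.normal(0, sigma, N)

T1 = np.sum(x * np.cos(2 * np.pi * f0 * n))   # sufficient statistic T1(x)
T2 = np.sum(x * np.sin(2 * np.pi * f0 * n))   # sufficient statistic T2(x)
phi_hat = -np.arctan2(T2, T1)                 # arctan2 resolves the quadrant
print("phi_hat =", phi_hat, " asymptotic std =", np.sqrt(2 * sigma**2 / (N * A**2)))
```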
Example: MLE of the Sinusoidal Phase

So the asymptotic variance is

\mathrm{Var}(\hat{\phi}) = \frac{1}{N}\,\frac{1}{A^2/(2\sigma^2)} = \frac{1}{N\eta}

where \eta = \dfrac{A^2/2}{\sigma^2} is the SNR.

Example: MLE of the Sinusoidal Phase

Now we fix the data record length and vary the SNR.

MLE for Transformed Parameters

Suppose we try to estimate a function of θ: g(θ). Can we easily find the MLE?

Invariance Property of the MLE
The MLE of the parameter α = g(θ) is given by

\hat{\alpha} = g(\hat{\theta})

where θ̂ is the MLE of θ. The MLE θ̂ is obtained by maximizing p(x; θ).

If g is not a one-to-one function, then α̂ maximizes the modified likelihood function \bar{p}_T(x; \alpha) defined as

\bar{p}_T(x; \alpha) = \max_{\{\theta:\ \alpha = g(\theta)\}} p(x; \theta)

Example: Transformed DC Level in WGN

Suppose we want to estimate α = A^2. Note that

A = \pm\sqrt{\alpha}

and the transformation is not one-to-one. Thus, we have to consider two likelihood functions, one for each branch of A:

p_{T_1}(x; \alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] - \sqrt{\alpha}\right)^2\right] \quad (A \ge 0)

p_{T_2}(x; \alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}\left(x[n] + \sqrt{\alpha}\right)^2\right] \quad (A < 0)

Then, the MLE of α is given by

\hat{\alpha} = \arg\max_{\alpha \ge 0}\max\{p_{T_1}(x; \alpha),\, p_{T_2}(x; \alpha)\} = \arg\max_{\alpha \ge 0}\max\{p(x; \sqrt{\alpha}),\, p(x; -\sqrt{\alpha})\}
= \left[\arg\max_{-\infty < A < \infty} p(x; A)\right]^2 = \bar{x}^2

Example: Power of WGN in dB

We observe N samples of WGN with variance σ^2 and wish to estimate the noise power in dB. That is, we want to find the MLE of

P = 10\log_{10}\sigma^2

To find the MLE of σ^2:

\frac{\partial \ln p(x; \sigma^2)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{n=0}^{N-1} x^2[n]

\hat{\sigma}^2 = \frac{1}{N}\sum_{n=0}^{N-1} x^2[n]

By the invariance property,

\hat{P} = 10\log_{10}\hat{\sigma}^2 = 10\log_{10}\left(\frac{1}{N}\sum_{n=0}^{N-1} x^2[n]\right)

Extension to a Vector Parameter

Asymptotic Properties of the MLE


If the likelihood function p(x; θ) satisfies the "regularity" conditions, then the MLE of the unknown parameter θ is asymptotically distributed according to

\hat{\boldsymbol{\theta}} \overset{a}{\sim} \mathcal{N}(\boldsymbol{\theta}, \mathbf{I}^{-1}(\boldsymbol{\theta}))

where I(θ) is the Fisher information matrix.

Remark: If the number of parameters is larger than the number of samples, the MLE may not follow the asymptotic distribution above.

Extension to a Vector Parameter

Invariance Property of the MLE


The MLE of the parameter α = g(θ), where g is an r-dimensional function of the p × 1 parameter θ, is given by

\hat{\boldsymbol{\alpha}} = \mathbf{g}(\hat{\boldsymbol{\theta}})

If g is not an invertible function, then α̂ maximizes the modified likelihood function \bar{p}_T(x; \boldsymbol{\alpha}):

\bar{p}_T(x; \boldsymbol{\alpha}) = \max_{\{\boldsymbol{\theta}:\ \boldsymbol{\alpha} = \mathbf{g}(\boldsymbol{\theta})\}} p(x; \boldsymbol{\theta})

Example: Signal in Non-Gaussian Noise

Consider the data

x[n] = s[n] + w[n], \quad n = 0, 1, \ldots, N-1

where w[n] is zero-mean i.i.d. noise with the Laplacian PDF

p(w[n]) = \frac{1}{4}\exp\left(-\frac{1}{2}|w[n]|\right)

All signal samples \{s[0], s[1], \ldots, s[N-1]\} are to be estimated. The PDF of the data is

p(x; \boldsymbol{\theta}) = \prod_{n=0}^{N-1}\frac{1}{4}\exp\left(-\frac{1}{2}|x[n] - s[n]|\right)

For each n, the MLE is given by

\hat{s}[n] = x[n]

Example: Signal in Non-Gaussian Noise

Or equivalently, the MLE of θ is given by

\hat{\boldsymbol{\theta}} = \mathbf{x}

The MLE is not asymptotically Gaussian as N \to \infty. This is because we have as many unknown parameters as data samples.

MLE for the Linear (Affine) Model

Optimality of the MLE for the Linear Model


Suppose the observation model is

\mathbf{x} = \mathbf{H}\boldsymbol{\theta} + \mathbf{w}

where \mathbf{H} \in \mathbb{R}^{N \times p} has full column rank p and \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}). Then the MLE of θ is

\hat{\boldsymbol{\theta}} = (\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{C}^{-1}\mathbf{x}

θ̂ is also an efficient estimator and hence is the MVU estimator.

The likelihood function is

p(\mathbf{x}; \boldsymbol{\theta}) = \frac{1}{(2\pi)^{N/2}\det^{1/2}(\mathbf{C})}\exp\left[-\frac{1}{2}(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})\right]

\frac{\partial \ln p(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \mathbf{H}^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})
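A small illustrative sketch (synthetic H, C, and x; not from the slides) of computing this estimator; solving linear systems is preferred over forming explicit inverses:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 3
H = rng.normal(size=(N, p))                      # known observation matrix
theta_true = np.array([1.0, -2.0, 0.5])          # true parameter (test value)
C = np.diag(rng.uniform(0.5, 2.0, N))            # known noise covariance (diagonal here)
x = H @ theta_true + rng.multivariate_normal(np.zeros(N), C)

# MLE / MVU for the linear model: (H^T C^{-1} H)^{-1} H^T C^{-1} x
Ci_H = np.linalg.solve(C, H)                     # C^{-1} H without forming C^{-1}
Ci_x = np.linalg.solve(C, x)
theta_hat = np.linalg.solve(H.T @ Ci_H, H.T @ Ci_x)
print(theta_hat)
```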

Consistency of MLE*

Suppose that the observations \{x_k\}_{k=1}^{\infty} are an i.i.d. sequence of random variables with density p(x; θ). Define

\psi(x_k; \theta) = \frac{\partial \ln p(x_k; \theta)}{\partial \theta}, \qquad J(\theta; \theta') = E_{\theta}[\psi(x_1; \theta')]

Then the MLE θ̂_n is asymptotically consistent (θ̂_n → θ) if

(1) J(θ; θ') is a continuous function of θ' and has a unique root at θ' = θ, at which point it changes sign.
(2) ψ(x_k; θ') is a continuous function of θ' with probability 1.
(3) For each sample size n, \sum_{k=1}^{n}\psi(x_k; \theta')/n has a unique root θ̂_n.

V. Poor, Proposition IV.D.1
Asymptotic Normality of MLEs*

Suppose that the observations \{x_k\}_{k=1}^{\infty} are an i.i.d. sequence of random variables with density p(x; θ). Also assume that \{\hat{\theta}_n\}_{n=1}^{\infty} is a consistent sequence of roots of the likelihood equation. If ψ satisfies the following regularity conditions:

(1) 0 < i_{\theta} \triangleq E_{\theta}\left[(\psi(x_1; \theta))^2\right] < \infty
(2) The derivatives

\psi'(x_1; \theta') \triangleq \frac{\partial \psi(x_1; \theta')}{\partial \theta'}, \qquad \psi''(x_1; \theta') \triangleq \frac{\partial^2 \psi(x_1; \theta')}{\partial \theta'^2}

exist.
(3) There is a function M(x_1) such that |\psi''(x_1; \theta')| \le M(x_1) for all \theta' \in \Omega and E_{\theta}[M(x_1)] < \infty.

V. Poor, Proposition IV.D.2
Asymptotic Normality of MLEs*

(4) J(\theta; \theta) = 0
(5)

\int \frac{\partial^2 p(x; \theta)}{\partial \theta^2}\,\mu(dx) = \frac{\partial^2}{\partial \theta^2}\int p(x; \theta)\,\mu(dx)

Then θ̂_n is asymptotically normal.

V. Poor, Proposition IV.D.2
Example: One-parameter exponential family

Let x_1, \ldots, x_n be i.i.d. according to a one-parameter exponential family with density

f(x_i \mid \eta) = e^{\eta T(x_i) - A(\eta)}

with respect to a σ-finite measure μ, and let the estimand be η. To obtain η̂, we need to solve

\frac{1}{n}\sum_{i} T(x_i) = A'(\eta) \iff E_{\eta}[T(x_j)] = \frac{1}{n}\sum_{i} T(x_i)

Since

\frac{d}{d\eta}E_{\eta}[T(x_j)] = \mathrm{Var}_{\eta}(T(x_j)) > 0

the MLE solution is unique. Furthermore, since the Fisher information equals \mathrm{Var}_{\eta}(T), we have

\sqrt{n}(\hat{\eta} - \eta) \to \mathcal{N}\left(0, \frac{1}{\mathrm{Var}_{\eta}(T)}\right)
Least Squares

Pros:

• No probabilistic assumptions needed


• Simple to compute

Cons:

• No claims about optimality can be made


• The performance is difficult to analyze in general


C. F. Gauss claimed that he came up with the least squares method in 1795.

The Least Squares Approach

Idea: In earlier discussions, we wanted to find the unbiased estimator with the smallest variance. Now, we try to directly minimize the average discrepancy between the noisy data and the assumed signal model.

Formulation: Suppose the true signal s[n] is parameterized by θ and we only have the noisy data x[n]. Then the least squares estimator (LSE) of θ is given by

\hat{\theta}_{LS} = \arg\min_{\theta} J(\theta) = \arg\min_{\theta}\sum_{n=0}^{N-1}(x[n] - s[n])^2

Note that we do not make any probabilistic assumptions about the data x[n].

Examples

For the problem of estimating the DC level, the LSE of A is obtained by minimizing

J(A) = \sum_{n=0}^{N-1}(x[n] - A)^2

which gives \hat{A}_{LS} = \bar{x} = \frac{1}{N}\sum_{n=0}^{N-1} x[n].

Remark
Note that Â_LS may not be optimal in any sense, as we do not make any statistical assumptions. It is the MVU estimator for the WGN case but will be biased if the noise is not zero-mean.

Examples

Consider estimating the frequency/amplitude from

s[n] = A\cos(2\pi f_0 n)

If f_0 is unknown, we need to minimize

J(f_0) = \sum_{n=0}^{N-1}(x[n] - A\cos(2\pi f_0 n))^2

which is a nonlinear function of f_0. This is a nonlinear least squares problem.

If A is unknown, we again need to minimize

J(A) = \sum_{n=0}^{N-1}(x[n] - A\cos(2\pi f_0 n))^2

which has a closed-form solution in terms of f_0. If both A and f_0 are unknown, we can first minimize J(A, f_0) with respect to A and then with respect to f_0; this is known as a separable least squares problem.
Linear Least Squares (vector parameter)

As before, we assume the signal of interest can be modeled as

\mathbf{s} = \mathbf{H}\boldsymbol{\theta}

where the observation matrix \mathbf{H} \in \mathbb{R}^{N \times p} has full rank p. Different from previous discussions, we do not assume that the noise follows a particular distribution. The LSE is found by minimizing

J(\boldsymbol{\theta}) = \sum_{n=0}^{N-1}(x[n] - s[n])^2 = (\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})

The linear form allows a closed-form solution:

\frac{\partial J(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = -2\mathbf{H}^T\mathbf{x} + 2\mathbf{H}^T\mathbf{H}\boldsymbol{\theta}

\mathbf{H}^T\mathbf{H}\boldsymbol{\theta} = \mathbf{H}^T\mathbf{x} \quad \text{(Normal Equations)}

\hat{\boldsymbol{\theta}}_{LS} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}

J_{LS} = J(\hat{\boldsymbol{\theta}}_{LS}) = \mathbf{x}^T\left(\mathbf{I} - \mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\right)\mathbf{x}
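An illustrative sketch (synthetic data, not from the slides) of the linear LSE; numpy.linalg.lstsq solves the normal equations via a stable factorization rather than an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 100, 4
H = rng.normal(size=(N, p))                     # full-column-rank observation matrix
theta_true = rng.normal(size=p)
x = H @ theta_true + 0.1 * rng.normal(size=N)

theta_ls, *_ = np.linalg.lstsq(H, x, rcond=None)
J_ls = np.sum((x - H @ theta_ls) ** 2)          # minimum LS cost
print(theta_ls, J_ls)
```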
Weighted Least Squares

An extension of the linear least squares problem is weighted least squares. Given a positive definite weighting matrix W, the cost function becomes

J(\boldsymbol{\theta}) = (\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T\mathbf{W}(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})

The closed-form solution of the WLS is

\hat{\boldsymbol{\theta}} = (\mathbf{H}^T\mathbf{W}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{W}\mathbf{x}

with minimum error

J_{\min} = \mathbf{x}^T\left(\mathbf{W} - \mathbf{W}\mathbf{H}(\mathbf{H}^T\mathbf{W}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{W}\right)\mathbf{x}
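A short sketch (illustrative) of one way to compute the WLS estimate: with the Cholesky factor W = LLᵀ, minimizing (x − Hθ)ᵀW(x − Hθ) is ordinary LS on the whitened data (Lᵀx, LᵀH):

```python
import numpy as np

def wls(H, x, W):
    """Weighted LS: theta = (H^T W H)^{-1} H^T W x via whitening."""
    L = np.linalg.cholesky(W)              # W = L L^T (W positive definite)
    Hw, xw = L.T @ H, L.T @ x              # whitened design matrix and data
    theta, *_ = np.linalg.lstsq(Hw, xw, rcond=None)
    return theta
```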

Weighted Least Squares

Why do we consider weighted least squares?

1. Focusing Accuracy: As a regression problem on Hθ, we may only


care about the feature values in a certain region. This may be
because these values (similar to certain rows of H) will often appear
or induce higher risk. Then, we can assign different weights to
points in a particular region.
2. Discounting Imprecision: If the additive noise has constant variance, the case is called homoskedasticity; otherwise the data are heteroskedastic. In the heteroskedastic case, there exist other estimators that are unbiased and have smaller variance than ordinary least squares.
3. More General: Other optimization problems can be transformed
into or approximated by weighted least squares.

Geometrical Interpretation

The LSE is obtained by minimizing

J(\boldsymbol{\theta}) = (\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})

Let \mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_p]. Then J(θ) can be represented as

J(\boldsymbol{\theta}) = \|\mathbf{x} - \mathbf{H}\boldsymbol{\theta}\|_2^2 = \left\|\mathbf{x} - \sum_{i=1}^{p}\theta_i\mathbf{h}_i\right\|_2^2

Without further restriction on θ, the LSE attempts to minimize the squared distance from the data vector x to a signal vector \sum_{i=1}^{p}\theta_i\mathbf{h}_i, which lies in the column space of H.

If H has full column rank, each vector in the column space corresponds to a unique parameter θ.

Geometrical Interpretation

Suppose \mathbf{H} \in \mathbb{R}^{3 \times 2}. The LSE solution yields the vector \hat{\mathbf{s}} \in S^2 = \mathrm{span}(\mathbf{h}_1, \mathbf{h}_2), where \hat{\mathbf{s}} is the orthogonal projection of x:

(\mathbf{x} - \hat{\mathbf{s}}) \perp \{\mathbf{h}_1, \mathbf{h}_2\}

Geometrical Interpretation

In the general case, let \boldsymbol{\epsilon} = \mathbf{x} - \mathbf{H}\boldsymbol{\theta}. We have

\boldsymbol{\epsilon} \perp \mathcal{C}(\mathbf{H}) \iff \boldsymbol{\epsilon}^T\mathbf{H} = \mathbf{0}

And the LSE is given by

\hat{\boldsymbol{\theta}} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}

with the error

\|\boldsymbol{\epsilon}\|_2^2 = J_{\min} = \mathbf{x}^T\left(\mathbf{I} - \mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\right)\mathbf{x}

Note that

\mathbf{P} = \mathbf{H}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T

is an orthogonal projection matrix, and we have

J_{\min} = \mathbf{x}^T(\mathbf{I} - \mathbf{P})\mathbf{x} = \|\mathbf{P}^{\perp}\mathbf{x}\|_2^2


Variant 1: Order-Recursive Least Squares

In many cases, we do not know the signal model exactly, and we may try different models in a particular order. For example, suppose we try to fit a given set of data.

Variant 1: Order-Recursive Least Squares

If we assume the true signal is a constant

s_1[n] = A

then the LSE with \mathbf{H}_1 = [1, 1, \ldots, 1]^T is given by

\hat{A}_1 = \bar{x}

Variant 1: Order-Recursive Least Squares

If we assume the true signal is a line

s_2[n] = A + Bn

then the LSE is given by

\hat{A}_2 = \frac{2(2N-1)}{N(N+1)}\sum_{n=0}^{N-1} x[n] - \frac{6}{N(N+1)}\sum_{n=0}^{N-1} n\,x[n]

\hat{B}_2 = -\frac{6}{N(N+1)}\sum_{n=0}^{N-1} x[n] + \frac{12}{N(N^2-1)}\sum_{n=0}^{N-1} n\,x[n]

Variant 1: Order-Recursive Least Squares

\mathbf{H}_1 is a submatrix of the observation matrix \mathbf{H}_2 of the line case:

\mathbf{H}_2 = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & N-1 \end{bmatrix}

However, it is not good to assume a model more complicated than necessary (e.g., the line model when a constant suffices).

Variant 1: Order-Recursive Least Squares

Occam’s Razor Principle: entities should not be multiplied without necessity.

Manuscript illustration of William of Ockham (1287-1347)

Variant 1: Order-Recursive Least Squares

As illustrated by the example, we may have to try a sequence of models from the simplest to the most complicated. To reduce the computation, it is desirable to update the LSE recursively as the model order grows.

Specifically, we would like to compute the LSE based on \mathbf{H} \in \mathbb{R}^{N \times (k+1)} from the solution based on the first k columns of H.

Variant 1: Order-Recursive Least Squares

Denote by \mathbf{H}_k \in \mathbb{R}^{N \times k} the observation matrix with k columns. The LSE \hat{\boldsymbol{\theta}}_k based on \mathbf{H}_k is given by

\hat{\boldsymbol{\theta}}_k = (\mathbf{H}_k^T\mathbf{H}_k)^{-1}\mathbf{H}_k^T\mathbf{x}

with minimum cost

J_{\min,k} = \|\mathbf{x} - \mathbf{H}_k\hat{\boldsymbol{\theta}}_k\|_2^2

Variant 1: Order-Recursive Least Squares

Let \mathbf{H}_{k+1} = [\mathbf{H}_k, \mathbf{h}_{k+1}] \in \mathbb{R}^{N \times (k+1)}. To update \hat{\boldsymbol{\theta}}_k and J_{\min,k}, we use

\hat{\boldsymbol{\theta}}_{k+1} = \begin{bmatrix} \hat{\boldsymbol{\theta}}_k - \dfrac{(\mathbf{H}_k^T\mathbf{H}_k)^{-1}\mathbf{H}_k^T\mathbf{h}_{k+1}\,\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{x}}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} \\[2mm] \dfrac{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{x}}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} \end{bmatrix}

where \mathbf{P}_k^{\perp} = \mathbf{I} - \mathbf{H}_k(\mathbf{H}_k^T\mathbf{H}_k)^{-1}\mathbf{H}_k^T. To avoid inverting \mathbf{H}_k^T\mathbf{H}_k, we let

\mathbf{D}_k = (\mathbf{H}_k^T\mathbf{H}_k)^{-1}

and use the recursive formula

\mathbf{D}_{k+1} = \begin{bmatrix} \mathbf{D}_k + \dfrac{\mathbf{D}_k\mathbf{H}_k^T\mathbf{h}_{k+1}\mathbf{h}_{k+1}^T\mathbf{H}_k\mathbf{D}_k}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} & -\dfrac{\mathbf{D}_k\mathbf{H}_k^T\mathbf{h}_{k+1}}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} \\[2mm] -\dfrac{\mathbf{h}_{k+1}^T\mathbf{H}_k\mathbf{D}_k}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} & \dfrac{1}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} \end{bmatrix}

where \mathbf{P}_k^{\perp} = \mathbf{I} - \mathbf{H}_k\mathbf{D}_k\mathbf{H}_k^T. The minimum LS error is updated as

J_{\min,k+1} = J_{\min,k} - \frac{(\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{x})^2}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}}
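A small numerical check (illustrative, synthetic data) that the cost update above agrees with a direct fit of the (k+1)-column model:

```python
import numpy as np

rng = np.random.default_rng(4)
N, k = 30, 2
Hk = rng.normal(size=(N, k))
h_new = rng.normal(size=N)                  # candidate (k+1)-th column
x = rng.normal(size=N)

Pk_perp = np.eye(N) - Hk @ np.linalg.solve(Hk.T @ Hk, Hk.T)   # I - Hk (Hk^T Hk)^{-1} Hk^T
Jk = x @ Pk_perp @ x                                          # J_min,k
Jk1_update = Jk - (h_new @ Pk_perp @ x) ** 2 / (h_new @ Pk_perp @ h_new)

Hk1 = np.column_stack([Hk, h_new])
Jk1_direct = x @ (np.eye(N) - Hk1 @ np.linalg.solve(Hk1.T @ Hk1, Hk1.T)) @ x
print(Jk1_update, Jk1_direct)               # the two values agree
```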
Variant 1: Order-Recursive Least Squares

This recursive procedure determines the LSE for all lower-order models. We make several observations:

1. If the new column \mathbf{h}_{k+1} is orthogonal to all the previous ones, i.e. \mathbf{H}_k^T\mathbf{h}_{k+1} = \mathbf{0}, then

\hat{\boldsymbol{\theta}}_{k+1} = \begin{bmatrix} \hat{\boldsymbol{\theta}}_k \\ \dfrac{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{x}}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} \end{bmatrix}

2. The term \mathbf{P}_k^{\perp}\mathbf{x} = \boldsymbol{\epsilon}_k is the residual that cannot be modeled by the columns of \mathbf{H}_k.

3. If \mathbf{h}_{k+1} is nearly in the column space of \mathbf{H}_k, then \|\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}\|_2 \approx 0, and the recursive procedure will blow up as \mathbf{H}_{k+1}^T\mathbf{H}_{k+1} will be nearly singular.

4. We can rewrite J_{\min,k+1} as

J_{\min,k+1} = \mathbf{x}^T\mathbf{P}_k^{\perp}\mathbf{x} - \frac{(\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{x})^2}{\mathbf{h}_{k+1}^T\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}} = J_{\min,k}(1 - r_{k+1}^2)
Variant 1: Order-Recursive Least Squares

where the coefficient r_{k+1}^2 is defined as

0 \le r_{k+1}^2 = \frac{\langle \mathbf{P}_k^{\perp}\mathbf{h}_{k+1}, \mathbf{P}_k^{\perp}\mathbf{x}\rangle^2}{\|\mathbf{P}_k^{\perp}\mathbf{h}_{k+1}\|_2^2\,\|\mathbf{P}_k^{\perp}\mathbf{x}\|_2^2} \le 1

If \mathbf{P}_k^{\perp}\mathbf{x} and \mathbf{P}_k^{\perp}\mathbf{h}_{k+1} are collinear, then the residual can be perfectly modeled by \mathbf{h}_{k+1} and r_{k+1} = 1.

5. The orthogonal projection matrix can be recursively updated as

\mathbf{P}_{k+1} = \mathbf{P}_k + \frac{(\mathbf{I} - \mathbf{P}_k)\mathbf{h}_{k+1}\mathbf{h}_{k+1}^T(\mathbf{I} - \mathbf{P}_k)}{\mathbf{h}_{k+1}^T(\mathbf{I} - \mathbf{P}_k)\mathbf{h}_{k+1}}

Variant 2: Sequential Least Squares

In earlier discussions, we were given the full data set. But in many cases, the samples arrive sequentially, and it is not desirable to wait for all the data before computing the LSE.

Let C be the covariance matrix of the zero-mean noise. We consider the weighted LS obtained by minimizing

J = (\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T\mathbf{C}^{-1}(\mathbf{x} - \mathbf{H}\boldsymbol{\theta})

And the solution is given by

\hat{\boldsymbol{\theta}} = (\mathbf{H}^T\mathbf{C}^{-1}\mathbf{H})^{-1}\mathbf{H}^T\mathbf{C}^{-1}\mathbf{x}

If C is diagonal (that is, the noise is uncorrelated), the LSE θ̂ can be computed sequentially.

Variant 2: Sequential Least Squares

Assume the following conditions hold:

\mathbf{C}[n] = \mathrm{diag}(\sigma_0^2, \sigma_1^2, \ldots, \sigma_n^2)

\mathbf{H}[n] = \begin{bmatrix} \mathbf{H}[n-1] \\ \mathbf{h}^T[n] \end{bmatrix}, \qquad \mathbf{x}[n] = [x[0], x[1], \ldots, x[n]]^T

and let \hat{\boldsymbol{\theta}}[n] be the LSE of θ based on x[n]. The LSE is given by

\hat{\boldsymbol{\theta}}[n] = (\mathbf{H}^T[n]\mathbf{C}^{-1}[n]\mathbf{H}[n])^{-1}\mathbf{H}^T[n]\mathbf{C}^{-1}[n]\mathbf{x}[n]

with covariance matrix of the LSE

\boldsymbol{\Sigma}[n] = (\mathbf{H}^T[n]\mathbf{C}^{-1}[n]\mathbf{H}[n])^{-1}

Variant 2: Sequential Least Squares

Estimator Update:

\hat{\boldsymbol{\theta}}[n] = \hat{\boldsymbol{\theta}}[n-1] + \mathbf{K}[n]\left(x[n] - \mathbf{h}^T[n]\hat{\boldsymbol{\theta}}[n-1]\right)

where

\mathbf{K}[n] = \frac{\boldsymbol{\Sigma}[n-1]\mathbf{h}[n]}{\sigma_n^2 + \mathbf{h}^T[n]\boldsymbol{\Sigma}[n-1]\mathbf{h}[n]}
Variant 2: Sequential Least Squares

Covariance Update:

\boldsymbol{\Sigma}[n] = (\mathbf{I} - \mathbf{K}[n]\mathbf{h}^T[n])\boldsymbol{\Sigma}[n-1]
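A compact sketch (illustrative) of the three update equations as a single routine; sigma2 is the known variance of the new noise sample:

```python
import numpy as np

def sequential_ls_update(theta, Sigma, x_n, h_n, sigma2):
    """One sequential-LS step given a new sample x_n with regressor h_n."""
    K = Sigma @ h_n / (sigma2 + h_n @ Sigma @ h_n)            # gain K[n]
    theta = theta + K * (x_n - h_n @ theta)                   # estimator update
    Sigma = (np.eye(len(theta)) - np.outer(K, h_n)) @ Sigma   # covariance update
    return theta, Sigma
```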

Example: Fourier Analysis

Consider estimating \boldsymbol{\theta} = [a, b]^T from the signal model

s[n] = a\cos(2\pi f_0 n) + b\sin(2\pi f_0 n)

To apply sequential LS, we assume the noise is uncorrelated.

To initialize the algorithm, we first acquire two samples x[0], x[1] and compute

\hat{\boldsymbol{\theta}}[1] = \left(\mathbf{H}^T[1]\frac{\mathbf{I}}{\sigma^2}\mathbf{H}[1]\right)^{-1}\mathbf{H}^T[1]\frac{\mathbf{I}}{\sigma^2}\mathbf{x}[1] = (\mathbf{H}^T[1]\mathbf{H}[1])^{-1}\mathbf{H}^T[1]\mathbf{x}[1]

where

\mathbf{H}[1] = \begin{bmatrix} 1 & 0 \\ \cos(2\pi f_0) & \sin(2\pi f_0) \end{bmatrix}, \qquad \mathbf{x}[1] = [x[0], x[1]]^T

\boldsymbol{\Sigma}[1] = \sigma^2(\mathbf{H}^T[1]\mathbf{H}[1])^{-1}

Example: Fourier Analysis

To compute \hat{\boldsymbol{\theta}}[2], we first compute K[2]:

\mathbf{K}[2] = \frac{\boldsymbol{\Sigma}[1]\mathbf{h}[2]}{\sigma^2 + \mathbf{h}^T[2]\boldsymbol{\Sigma}[1]\mathbf{h}[2]}, \qquad \mathbf{h}[2] = [\cos(4\pi f_0), \sin(4\pi f_0)]^T

And then we have

\hat{\boldsymbol{\theta}}[2] = \hat{\boldsymbol{\theta}}[1] + \mathbf{K}[2]\left(x[2] - \mathbf{h}^T[2]\hat{\boldsymbol{\theta}}[1]\right)

with covariance

\boldsymbol{\Sigma}[2] = (\mathbf{I} - \mathbf{K}[2]\mathbf{h}^T[2])\boldsymbol{\Sigma}[1]

Variant 3: Constrained Least Squares

So far, we have not imposed any restrictions on the values of θ, and the geometrical explanations apply. Here, we assume the parameter satisfies the linear constraint

\mathbf{A}\boldsymbol{\theta} = \mathbf{b}

where \mathbf{A} \in \mathbb{R}^{r \times p} has full rank r < p. Full rank means the constraints are linearly independent.

To find the LSE subject to the constraints, we use Lagrange multipliers. That is, we seek to minimize

J_c = (\mathbf{x} - \mathbf{H}\boldsymbol{\theta})^T(\mathbf{x} - \mathbf{H}\boldsymbol{\theta}) + \boldsymbol{\lambda}^T(\mathbf{A}\boldsymbol{\theta} - \mathbf{b})

\frac{\partial J_c}{\partial \boldsymbol{\theta}} = -2\mathbf{H}^T\mathbf{x} + 2\mathbf{H}^T\mathbf{H}\boldsymbol{\theta} + \mathbf{A}^T\boldsymbol{\lambda}

Setting the gradient to zero gives

\hat{\boldsymbol{\theta}}_c = \hat{\boldsymbol{\theta}} - (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\frac{\boldsymbol{\lambda}}{2}

where θ̂ is the unconstrained LSE.
Variant 3: Constrained Least Squares

To find λ, we enforce the linear constraints:

\mathbf{A}\hat{\boldsymbol{\theta}}_c = \mathbf{b} \;\Rightarrow\; \frac{\boldsymbol{\lambda}}{2} = \left[\mathbf{A}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\right]^{-1}(\mathbf{A}\hat{\boldsymbol{\theta}} - \mathbf{b})

Eventually, we have

\hat{\boldsymbol{\theta}}_c = \hat{\boldsymbol{\theta}} - (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\left[\mathbf{A}(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{A}^T\right]^{-1}(\mathbf{A}\hat{\boldsymbol{\theta}} - \mathbf{b})

where \hat{\boldsymbol{\theta}} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{x}.
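A small sketch (illustrative) implementing this correction of the unconstrained LSE:

```python
import numpy as np

def constrained_ls(H, x, A, b):
    """LSE of theta subject to A @ theta = b (A has full row rank r < p)."""
    HtH_inv = np.linalg.inv(H.T @ H)
    theta = HtH_inv @ H.T @ x                    # unconstrained LSE
    G = A @ HtH_inv @ A.T                        # r x r, invertible when A has full rank
    correction = HtH_inv @ A.T @ np.linalg.solve(G, A @ theta - b)
    return theta - correction
```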

Example: Piecewise-polynomial fitting

Consider fitting a piecewise polynomial \hat{f} specified as

\hat{f}(x) = \begin{cases} p(x) = \theta_1 + \theta_2 x + \theta_3 x^2 + \theta_4 x^3, & x \le a \\ q(x) = \theta_5 + \theta_6 x + \theta_7 x^2 + \theta_8 x^3, & x > a \end{cases}

with the requirements at the boundary point

p(a) = q(a), \qquad p'(a) = q'(a)

We try to fit the polynomial by minimizing

\sum_{i=1}^{N}\left(\hat{f}(x_i) - y_i\right)^2

which is a constrained least squares problem.

S. Boyd, EE 103
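For concreteness (an illustrative sketch, not Boyd's code; the knot location a = 1 is assumed), the two boundary conditions can be written as the linear constraints Aθ = b used above, with θ = [θ₁, ..., θ₈]ᵀ:

```python
import numpy as np

a = 1.0   # assumed knot location
# Row 1: p(a) - q(a) = 0;  Row 2: p'(a) - q'(a) = 0
A = np.array([
    [1, a, a**2,     a**3,    -1, -a, -a**2,    -a**3],
    [0, 1, 2 * a, 3 * a**2,    0, -1, -2 * a, -3 * a**2],
], dtype=float)
b = np.zeros(2)
# theta_c = constrained_ls(H, y, A, b), with H built from the data points x_i
```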
Example: Least norm problem

A special case of the constrained least squares problem:

\min_{\mathbf{x}} \|\mathbf{x}\|_2^2 \quad \text{s.t.} \quad \mathbf{A}\mathbf{x} = \mathbf{b}

S. Boyd, EE 103
Example: Least norm problem

Force sequence:

• unit mass on a frictionless surface, initially at rest
• f ∈ R^{10} represents the piecewise-constant forces applied for one second each
• the final velocity and position are

v^{\mathrm{fin}} = \sum_{i=1}^{10} f_i, \qquad p^{\mathrm{fin}} = \frac{19}{2}f_1 + \frac{17}{2}f_2 + \cdots + \frac{1}{2}f_{10}

• applied constraints: v^{\mathrm{fin}} = 0, p^{\mathrm{fin}} = 1 (a least-norm sketch follows below)

S. Boyd, EE 103
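A least-norm sketch for this force program (illustrative; the pseudoinverse gives x = Aᵀ(AAᵀ)⁻¹b when A has full row rank):

```python
import numpy as np

# Row 1 gives the final velocity, row 2 the final position.
A = np.vstack([np.ones(10),
               np.arange(19, 0, -2) / 2.0])   # [19/2, 17/2, ..., 1/2]
b = np.array([0.0, 1.0])                      # zero final velocity, unit final position

f = np.linalg.pinv(A) @ b                     # minimum-norm force sequence
print(f)
print(A @ f)                                  # constraints are satisfied
```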
Bang-bang force sequence: (figure, S. Boyd, EE 103)

Least norm force sequence: (figure, S. Boyd, EE 103)
Variant 4: Nonlinear Least Squares

The general LS cost is given by

J(\boldsymbol{\theta}) = (\mathbf{x} - \mathbf{s}(\boldsymbol{\theta}))^T(\mathbf{x} - \mathbf{s}(\boldsymbol{\theta}))

where s(θ) is the signal model for x that depends on θ. In the linear LS problem, we take the convenient model s(θ) = Hθ, which has a simple closed-form solution. But in many cases, s(θ) is a nonlinear function of θ and no closed-form solution is available.

There are two methods that can reduce the complexity of the problem:

1. transformation of parameters
2. separability of parameters

Variant 4: Nonlinear Least Squares

In the first case, we try to find a transformation α = g(θ) such that

\mathbf{s}(\boldsymbol{\theta}(\boldsymbol{\alpha})) = \mathbf{s}(\mathbf{g}^{-1}(\boldsymbol{\alpha})) = \mathbf{H}\boldsymbol{\alpha}

Then we can find the LSE of α and transform back:

\hat{\boldsymbol{\theta}} = \mathbf{g}^{-1}(\hat{\boldsymbol{\alpha}})

In general, such a function g is hard to find.

Variant 4: Nonlinear Least Squares

For the sinusoidal signal model

s[n] = A\cos(2\pi f_0 n + \phi), \quad n = 0, 1, \ldots, N-1

we try to estimate A and φ. The cost function is nonlinear:

J(A, \phi) = \sum_{n=0}^{N-1}\left(x[n] - A\cos(2\pi f_0 n + \phi)\right)^2

Now we consider the transformation

\alpha_1 = A\cos\phi, \qquad \alpha_2 = -A\sin\phi

Then \mathbf{s} = \mathbf{H}\boldsymbol{\alpha} with

\mathbf{H} = \begin{bmatrix} 1 & 0 \\ \cos(2\pi f_0) & \sin(2\pi f_0) \\ \vdots & \vdots \\ \cos(2\pi f_0 (N-1)) & \sin(2\pi f_0 (N-1)) \end{bmatrix}
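A brief sketch (illustrative) of this transformation approach: solve the linear LS problem in α, then invert the transformation to recover A and φ:

```python
import numpy as np

def amp_phase_ls(x, f0):
    """LS amplitude/phase via alpha1 = A cos(phi), alpha2 = -A sin(phi)."""
    n = np.arange(len(x))
    H = np.column_stack([np.cos(2 * np.pi * f0 * n), np.sin(2 * np.pi * f0 * n)])
    alpha, *_ = np.linalg.lstsq(H, x, rcond=None)
    A_hat = np.hypot(alpha[0], alpha[1])         # A = sqrt(alpha1^2 + alpha2^2)
    phi_hat = np.arctan2(-alpha[1], alpha[0])    # invert the transformation
    return A_hat, phi_hat
```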
Variant 4: Nonlinear Least Squares

The second method is to exploit a separability property. That is, in some cases we have

\mathbf{s} = \mathbf{H}(\boldsymbol{\alpha})\boldsymbol{\beta}

The β that minimizes J(α, β) for a fixed α is

\hat{\boldsymbol{\beta}} = \left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{x}

And then the cost function is

J(\boldsymbol{\alpha}, \hat{\boldsymbol{\beta}}) = \mathbf{x}^T\left[\mathbf{I} - \mathbf{H}(\boldsymbol{\alpha})\left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\right]\mathbf{x}

The problem reduces to a maximization over α of

\mathbf{x}^T\mathbf{H}(\boldsymbol{\alpha})\left(\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{H}(\boldsymbol{\alpha})\right)^{-1}\mathbf{H}^T(\boldsymbol{\alpha})\mathbf{x}

Variant 4: Nonlinear Least Squares

Damped Exponentials: Consider the signal model

s[n] = A_1 r^n + A_2 r^{2n} + A_3 r^{3n}

where the unknown parameters are \{A_1, A_2, A_3, r\} and 0 < r < 1. In the separable model, \boldsymbol{\beta} = [A_1, A_2, A_3]^T and \alpha = r. Then the nonlinear LSE is obtained by maximizing

\mathbf{x}^T\mathbf{H}(r)\left(\mathbf{H}^T(r)\mathbf{H}(r)\right)^{-1}\mathbf{H}^T(r)\mathbf{x}

over 0 < r < 1, where

\mathbf{H}(r) = \begin{bmatrix} 1 & 1 & 1 \\ r & r^2 & r^3 \\ \vdots & \vdots & \vdots \\ r^{N-1} & r^{2(N-1)} & r^{3(N-1)} \end{bmatrix}
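A sketch (illustrative) of the separable solution using a simple grid search over r; a finer grid or a 1-D optimizer could replace it:

```python
import numpy as np

def damped_exp_ls(x):
    """Separable NLS: grid-search r, then linear LS for the amplitudes."""
    n = np.arange(len(x))
    best = (-np.inf, None, None)
    for r in np.linspace(0.01, 0.99, 99):
        H = np.column_stack([r**n, r**(2 * n), r**(3 * n)])
        beta, *_ = np.linalg.lstsq(H, x, rcond=None)
        score = x @ H @ beta                  # equals x^T H (H^T H)^{-1} H^T x
        if score > best[0]:
            best = (score, r, beta)
    return best[1], best[2]                   # r_hat and [A1, A2, A3]
```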

Algorithm for NLS: Levenberg-Marquardt Algorithm

The basic idea:

• At any point z, we can approximate the nonlinear function f(x) as

\hat{f}(x; z) = f(z) + \nabla f(z)(x - z)

• \hat{f}(x; z) \approx f(x) if x is close to z
• we can minimize the affine approximation \|\hat{f}(x; z)\|_2^2 using linear LS
• we iterate the choice of z

S. Boyd, EE 103, Stanford
Algorithm for NLS: Levenberg-Marquardt Algorithm

Main structure:

• iterates x^{(1)}, x^{(2)}, \ldots
• form the affine approximation of f at x^{(k)}:

\hat{f}(x; x^{(k)}) = f(x^{(k)}) + \nabla f(x^{(k)})(x - x^{(k)})

• choose x^{(k+1)} as the minimizer of

\|\hat{f}(x; x^{(k)})\|_2^2 + \lambda^{(k)}\|x - x^{(k)}\|_2^2

for some \lambda^{(k)} > 0
• we impose the regularizer (second term) because we need the affine approximation to hold

S. Boyd, EE 103, Stanford
Algorithm for NLS: Levenberg-Marquardt Algorithm

Adjusting λ:

1. idea:
   • if \lambda^{(k)} is too big, x^{(k+1)} is close to x^{(k)}, so progress is slow
   • if \lambda^{(k)} is too small, x^{(k+1)} is far from x^{(k)}, and the linear approximation is poor
2. practical update mechanism:
   • if \|f(x^{(k+1)})\|_2^2 < \|f(x^{(k)})\|_2^2, accept the update and reduce λ: \lambda^{(k+1)} = 0.8\lambda^{(k)}
   • otherwise, increase λ and do not update: \lambda^{(k+1)} = 2\lambda^{(k)}, \; x^{(k+1)} = x^{(k)}

S. Boyd, EE 103, Stanford
Algorithm for NLS: Levenberg-Marquardt Algorithm

Closed-form update for a scalar parameter:

• The update:

x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{\lambda^{(k)} + (f'(x^{(k)}))^2}\, f(x^{(k)})

• For \lambda^{(k)} = 0, it reduces to the Newton update for solving f(x) = 0
• But the Newton update does not make sense if f'(x^{(k)}) = 0 (a scalar sketch follows below)

S. Boyd, EE 103, Stanford
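A compact scalar sketch (illustrative) combining the update above with the λ adjustment rule from the previous slide:

```python
def levenberg_marquardt_scalar(f, fprime, x0, lam=1.0, iters=50):
    """Minimize f(x)^2 for scalar x with the Levenberg-Marquardt update."""
    x = x0
    for _ in range(iters):
        x_new = x - f(x) * fprime(x) / (lam + fprime(x) ** 2)
        if f(x_new) ** 2 < f(x) ** 2:
            x, lam = x_new, 0.8 * lam   # accept the step, relax the regularization
        else:
            lam = 2.0 * lam             # reject the step, trust the affine model less
    return x

# Example: solve f(x) = x^2 - 2 = 0 in the least squares sense
print(levenberg_marquardt_scalar(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))
```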
Example: Location from range measurements (figures, S. Boyd, EE 103, Stanford)