
Lecture 2

During this lecture you will learn about

• The Least Mean Squares algorithm (LMS)
• Convergence analysis of the LMS
• Channel equalizer

Background

The method of Steepest descent that was studied in the last lecture is a recursive algorithm for calculation of the Wiener filter when the statistics of the signals are known (knowledge of R and p).

The problem is that this information is often unknown!

LMS is a method that is based on the same principles as the method of Steepest descent, but where the statistics are estimated continuously.

Since the statistics are estimated continuously, the LMS algorithm can adapt to changes in the signal statistics; the LMS algorithm is thus an adaptive filter.

Because of the estimated statistics, the gradient becomes noisy. The LMS algorithm belongs to a group of methods referred to as stochastic gradient methods, while the method of Steepest descent belongs to the group of deterministic gradient methods.

The Least Mean Square (LMS) algorithm

We want to create an algorithm that minimizes E{|e(n)|^2}, just like the SD, but based on unknown statistics.

A strategy that can then be used is to use estimates of the autocorrelation matrix R and the cross-correlation vector p. If the instantaneous estimates

    R̂(n) = u(n)u^H(n)
    p̂(n) = u(n)d*(n)

are chosen, the resulting method is the Least Mean Squares algorithm.

For the SD, the update of the filter weights is given by

    w(n+1) = w(n) + (1/2)µ[−∇J(n)]

where ∇J(n) = −2p + 2Rw(n).

In the LMS we use the estimates R̂ and p̂ to calculate the gradient estimate ∇̂J(n). Thus, also the updated filter vector becomes an estimate. It is therefore denoted ŵ(n):

    ∇̂J(n) = −2p̂(n) + 2R̂(n)ŵ(n)
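Filling in the algebra between these two slides (a short derivation, not written out in the original): substituting the instantaneous estimates into the gradient estimate gives

\[
\hat{\nabla}J(n) = -2\hat{p}(n) + 2\hat{R}(n)\hat{w}(n)
               = -2u(n)\bigl[d^*(n) - u^H(n)\hat{w}(n)\bigr]
               = -2u(n)e^*(n),
\]

which, inserted into the SD recursion with the same stepsize convention, yields the LMS update stated next.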
For the LMS algorithm, the update equation of the filter weights becomes

    ŵ(n+1) = ŵ(n) + µu(n)e*(n)

where e*(n) = d*(n) − u^H(n)ŵ(n).

Compare with the corresponding equation for the SD:

    w(n+1) = w(n) + µE{u(n)e*(n)}

where e*(n) = d*(n) − u^H(n)w(n).

The difference is thus that E{u(n)e*(n)} is estimated as u(n)e*(n). This leads to gradient noise.

Because of the noise in the gradient, the LMS never reaches Jmin, which the SD does.

[Figure: Illustration of gradient noise, with R and p from the example in Lecture 1 and µ = 0.1. Two contour plots of the cost function over the weights (w0, w1): left panel SD, right panel LMS.]
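A minimal numerical sketch of this difference (synthetic, real-valued data; the R, p and w_o below are assumed for illustration, not the Lecture 1 example): one SD step uses the true E{u(n)e(n)} = p − Rw(n), while one LMS step replaces it with a single noisy sample u(n)e(n).

import numpy as np

rng = np.random.default_rng(0)
mu = 0.1
R = np.array([[1.0, 0.5],
              [0.5, 1.0]])                  # assumed autocorrelation matrix
w_o = np.array([1.0, -0.5])                 # assumed Wiener solution
p = R @ w_o                                 # cross-correlation consistent with w_o

w_sd = np.zeros(2)
w_lms = np.zeros(2)

# Deterministic gradient step (SD): w(n+1) = w(n) + mu*(p - R w(n)) = w(n) + mu*E{u(n)e(n)}
w_sd = w_sd + mu * (p - R @ w_sd)

# Stochastic gradient step (LMS): draw one snapshot u(n), d(n) and use u(n)e(n)
u = rng.multivariate_normal(np.zeros(2), R)
d = w_o @ u + 0.1 * rng.standard_normal()   # desired signal with measurement noise
e = d - w_lms @ u
w_lms = w_lms + mu * u * e                  # instantaneous estimate of E{u(n)e(n)}

print("SD step :", w_sd)
print("LMS step:", w_lms)                   # scatters around the SD step (gradient noise)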

Convergence analysis of the LMS, overview

Consider the update equation

    ŵ(n+1) = ŵ(n) + µu(n)e*(n).

We want to know how fast and towards what the algorithm converges, in terms of J(n) and ŵ(n).

Strategy:

1. Introduce the filter weight error vector ǫ(n) = ŵ(n) − wo.
2. Express the update equation in ǫ(n).
3. Express J(n) in ǫ(n). This leads to K(n) = E{ǫ(n)ǫ^H(n)}.
4. Calculate the difference equation in K(n), which controls the convergence.
5. Make a change of variables from K(n) to X(n), which converges equally fast.

1. The filter weight error vector

The LMS contains feedback, and just like for the SD there is a risk that the algorithm diverges if the stepsize µ is not chosen appropriately.

In order to investigate the stability, the filter weight error vector ǫ(n) is introduced:

    ǫ(n) = ŵ(n) − wo        (wo = R⁻¹p, the Wiener–Hopf solution)

Note that ǫ(n) corresponds to c(n) = w(n) − wo for the Steepest descent, but that ǫ(n) is stochastic because of ŵ(n).
2. Update equation in ǫ(n)

The update equation for ǫ(n) at time n+1 can be expressed recursively as

    ǫ(n+1) = [I − µR̂(n)]ǫ(n) + µ[p̂(n) − R̂(n)wo]
           = [I − µu(n)u^H(n)]ǫ(n) + µu(n)eo*(n)

where eo(n) = d(n) − wo^H u(n) is the error of the Wiener filter.

Compare to that of the SD:

    c(n+1) = (I − µR)c(n)

3. Express J(n) in ǫ(n)

The convergence analysis of the LMS is considerably more complicated than that for the SD. Therefore, two tools (approximations) are needed. These are

• Independence theory
• The direct-averaging method
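The step from the LMS update to the recursion in ǫ(n) is not spelled out on the slide; a short derivation, using the definitions above:

\[
\epsilon(n+1) = \hat{w}(n+1) - w_o
            = \epsilon(n) + \mu u(n)\bigl[d^*(n) - u^H(n)\hat{w}(n)\bigr]
            = \epsilon(n) + \mu u(n)\bigl[e_o^*(n) - u^H(n)\epsilon(n)\bigr]
            = \bigl[I - \mu u(n)u^H(n)\bigr]\epsilon(n) + \mu u(n)e_o^*(n),
\]

where d^*(n) − u^H(n)ŵ(n) = e_o^*(n) − u^H(n)ǫ(n) was used. The first form on the slide follows from p̂(n) − R̂(n)wo = u(n)[d^*(n) − u^H(n)wo] = u(n)e_o^*(n).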

3. Express J(n) in ǫ(n), cont.

Independence theory

1. The input vectors at different time instants n, u(1), u(2), . . . , u(n), are independent of each other (and thereby uncorrelated).
2. The input signal vector at time n is independent of the desired signal at all earlier time instants: u(n) is independent of d(1), d(2), . . . , d(n−1).
3. The desired signal at time n, d(n), is independent of all earlier desired signals, d(1), d(2), . . . , d(n−1), but dependent on the input signal vector at time n, u(n).
4. The signals d(n) and u(n) at the same time instant n are mutually normally distributed.

Direct averaging

Direct averaging means that in the update equation for ǫ,

    ǫ(n+1) = [I − µu(n)u^H(n)]ǫ(n) + µu(n)eo*(n),

the instantaneous estimate R̂(n) = u(n)u^H(n) is substituted with the expectation over the ensemble, R = E{u(n)u^H(n)}, such that

    ǫ(n+1) = (I − µR)ǫ(n) + µu(n)eo*(n)

results.
3. Express J(n) in ǫ(n), cont.

For the SD we could show that

    J(n) = Jmin + [w(n) − wo]^H R [w(n) − wo]

and that the eigenvalues of R control the convergence speed.

For the LMS, with ǫ(n) = ŵ(n) − wo,

    J(n) = E{|e(n)|^2} = E{|d(n) − ŵ^H(n)u(n)|^2}
         = E{|eo(n) − ǫ^H(n)u(n)|^2}
         = E{|eo(n)|^2} + E{ǫ^H(n) u(n)u^H(n) ǫ(n)}        (i)
         =     Jmin     + E{ǫ^H(n) R̂(n) ǫ(n)}

Here, (i) indicates that the independence assumption is used. Since both ǫ and R̂ are estimates, and in addition not independent, the expectation operator cannot simply be moved in over R̂ (to replace it by R).

The behavior of

    Jex(n) = J(n) − Jmin = E{ǫ^H(n)R̂(n)ǫ(n)}

determines the convergence. Since the following holds for the trace of a matrix,

1. tr(scalar) = scalar
2. tr(AB) = tr(BA)
3. E{tr(A)} = tr(E{A})

Jex(n) can be written as

    Jex(n) = E{tr[R̂(n)ǫ(n)ǫ^H(n)]}         (using 1 and 2)
           = tr[E{R̂(n)ǫ(n)ǫ^H(n)}]         (using 3)
           = tr[E{R̂(n)}E{ǫ(n)ǫ^H(n)}]      (using (i), the independence assumption)
           = tr[RK(n)]

The convergence thus depends on K(n) = E{ǫ(n)ǫ^H(n)}.
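The step marked (i) hides a cross term; a short argument for why it vanishes (not written out on the slides, and relying on the independence assumptions above):

\[
E\{|e_o(n) - \epsilon^H(n)u(n)|^2\}
  = E\{|e_o(n)|^2\} + E\{\epsilon^H(n)u(n)u^H(n)\epsilon(n)\}
    - 2\,\mathrm{Re}\,E\{e_o(n)\,u^H(n)\epsilon(n)\}.
\]

Under the independence theory, ǫ(n) depends only on data up to time n−1 and is therefore independent of the pair (u(n), d(n)), so the last expectation factors as E{e_o(n)u^H(n)}E{ǫ(n)}; by the principle of orthogonality E{u(n)e_o^*(n)} = 0, and the cross term disappears.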

4. Difference equation for K(n)

The update equation for the filter weight error is, as shown previously,

    ǫ(n+1) = [I − µu(n)u^H(n)]ǫ(n) + µu(n)eo*(n)

The solution to this stochastic difference equation is, for small µ, close to the solution of

    ǫ(n+1) = (I − µR)ǫ(n) + µu(n)eo*(n),

which results from direct averaging.

This equation is still difficult to solve, but now the behavior of K(n) can be described by the following difference equation:

    K(n+1) = (I − µR)K(n)(I − µR) + µ^2 Jmin R

5. Change of variables from K(n) to X(n)

The variable changes

    Q^H R Q = Λ
    Q^H K(n) Q = X(n)

where Λ is the diagonal eigenvalue matrix of R, and where X(n) normally is not diagonal, yield

    Jex(n) = tr(RK(n)) = tr(ΛX(n))

The convergence can be described by

    X(n+1) = (I − µΛ)X(n)(I − µΛ) + µ^2 Jmin Λ
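A minimal numerical sketch of these two recursions (R, Jmin, µ and K(0) below are assumed for illustration): iterating K(n) and X(n) = Q^H K(n) Q in parallel shows that both give the same Jex(n).

import numpy as np

R = np.array([[1.0, 0.5],
              [0.5, 1.0]])                  # assumed autocorrelation matrix
Jmin, mu = 0.1, 0.05
I = np.eye(2)

lam, Q = np.linalg.eigh(R)                  # R = Q Lambda Q^H
Lam = np.diag(lam)

K = 0.5 * np.eye(2)                         # assumed K(0) = E{eps(0) eps^H(0)}
X = Q.T @ K @ Q                             # X(0) = Q^H K(0) Q (real-valued example)

for n in range(500):
    K = (I - mu * R) @ K @ (I - mu * R) + mu**2 * Jmin * R
    X = (I - mu * Lam) @ X @ (I - mu * Lam) + mu**2 * Jmin * Lam

print(np.trace(R @ K), np.trace(Lam @ X))   # identical values, approaching Jex(inf)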
5. Change of variables from K(n) to X(n), cont.

Jex(n) depends only on the diagonal elements of X(n) (because of the trace tr[ΛX(n)]). We can therefore write

    Jex(n) = Σ_{i=1}^{M} λi xi(n).

The update of the diagonal elements is controlled by

    xi(n+1) = (1 − µλi)^2 xi(n) + µ^2 Jmin λi

This is an inhomogeneous difference equation. This means that the solution will contain a transient part, and a stationary part that depends on Jmin and µ. The cost function can therefore be written

    J(n) = Jmin + Jex(∞) + Jtrans(n)

where Jmin + Jex(∞) and Jtrans(n) are the stationary and transient solutions, respectively.

At best, the LMS can thus reach Jmin + Jex(∞).

Properties of the LMS

• Convergence for the same interval as the SD:

      0 < µ < 2/λmax

• The excess mean-square error in steady state:

      Jex(∞) = Jmin Σ_{i=1}^{M} µλi / (2 − µλi)

• The misadjustment M = Jex(∞)/Jmin is a measure of how close to the optimal solution the LMS comes (in the mean-square sense).
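The expression for Jex(∞) follows from the stationary point of the scalar recursion above (a short check, not shown on the slides). Setting xi(n+1) = xi(n) = xi(∞) gives

\[
x_i(\infty)\bigl[1-(1-\mu\lambda_i)^2\bigr] = \mu^2 J_{\min}\lambda_i
\;\Rightarrow\;
x_i(\infty) = \frac{\mu J_{\min}}{2-\mu\lambda_i},
\qquad
J_{ex}(\infty) = \sum_{i=1}^{M}\lambda_i x_i(\infty)
             = J_{\min}\sum_{i=1}^{M}\frac{\mu\lambda_i}{2-\mu\lambda_i}.
\]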

Rules of thumb for the LMS

The eigenvalues of R are rarely known. Therefore, a set of rules of thumb is commonly used that is based on the tap-input power,

    Σ_{k=0}^{M−1} E{|u(n−k)|^2} = M r(0) = tr(R).

• The stepsize:

      0 < µ < 2 / Σ_{k=0}^{M−1} E{|u(n−k)|^2}

• The misadjustment:

      M ≈ (µ/2) Σ_{k=0}^{M−1} E{|u(n−k)|^2}

• The average time constant:

      τmse,av ≈ 1/(2µλav),   where λav = (1/M) Σ_{i=1}^{M} λi.

When the λi are unknown, the following approximate relationship between M and τmse,av is useful:

      M ≈ M/(4 τmse,av)

with the filter length M in the numerator. Here, τmse,av denotes the time it takes for J(n) to decay by a factor e^−1.

Summary of the LMS

Summary of the iterations:

1. Set start values for the filter coefficients, ŵ(0).
2. Calculate the error e(n) = d(n) − ŵ^H(n)u(n).
3. Update the filter coefficients:
       ŵ(n+1) = ŵ(n) + µu(n)[d*(n) − u^H(n)ŵ(n)]
4. Repeat steps 2 and 3.
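A minimal sketch of these steps for real-valued signals (the identification setup in the usage example, with the assumed system [0.5, −0.4, 0.2], is for illustration only and not taken from the lecture):

import numpy as np

def lms(u, d, M, mu):
    """LMS with M filter taps and stepsize mu over input u and desired signal d."""
    w = np.zeros(M)                          # step 1: start values w_hat(0)
    e = np.zeros(len(u))
    for n in range(M - 1, len(u)):
        u_n = u[n - M + 1:n + 1][::-1]       # tap-input vector [u(n), ..., u(n-M+1)]
        e[n] = d[n] - w @ u_n                # step 2: e(n) = d(n) - w_hat^H(n) u(n)
        w = w + mu * u_n * e[n]              # step 3: w_hat(n+1) = w_hat(n) + mu u(n) e*(n)
    return w, e                              # step 4: the loop repeats steps 2 and 3

# Usage example: identify an assumed FIR system.
rng = np.random.default_rng(0)
u = rng.standard_normal(5000)
d = np.convolve(u, [0.5, -0.4, 0.2])[:len(u)] + 0.01 * rng.standard_normal(len(u))
w_hat, e = lms(u, d, M=3, mu=0.05)           # mu well below 2 / (M * E{|u|^2}) = 2/3
print(w_hat)                                 # approaches the assumed system [0.5, -0.4, 0.2]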
Example: Channel equalizer in training phase

[Figure: Block diagram of the training phase. A pn-sequence p(n) = . . . , 1, 1, −1, 1, −1, . . . is sent through the channel h(n); white Gaussian noise v(n) with σv^2 = 0.001 is added, giving u(n), which is fed to an adaptive FIR filter (LMS, M = 11) with output d̂(n). The desired signal d(n) is the pn-sequence delayed by z^−7, and the error is e(n) = d(n) − d̂(n). The channel is h(n) = {0, 0.2798, 1.000, 0.2798, 0}.]

Sent information in the form of a pn-sequence is affected by a channel (h(n)). An adaptive inverse filter is used to compensate for the channel, such that the total effect can be approximated by a delay. White Gaussian noise (WGN) is added during the transmission.

Example, cont.: Learning curve

[Figure: Learning curves, i.e. the average of the MSE J(n) over 200 realizations, plotted against the iteration n (0–1500) for µ = 0.0075, µ = 0.025 and µ = 0.075.]

The decay is related to τmse,av. A large µ gives fast convergence, but also a large Jex(∞). The curves illustrate the learning curves for the different values of µ.
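A sketch of the training-phase experiment (channel, noise level, delay, filter length and stepsizes as on the slides; the random ±1 training sequence and the averaging details are assumptions, intended to mimic Haykin chap. 5.7 rather than reproduce it exactly):

import numpy as np

rng = np.random.default_rng(0)
h = np.array([0.0, 0.2798, 1.0, 0.2798, 0.0])   # channel impulse response h(n)
M, delay, sigma_v = 11, 7, np.sqrt(0.001)       # equalizer length, z^-7 delay, WGN std
N, n_real = 1500, 200                           # iterations and number of realizations

def one_realization(mu):
    p_n = rng.choice([-1.0, 1.0], size=N + 20)  # assumed random +/-1 training sequence
    u = np.convolve(p_n, h)[:len(p_n)] + sigma_v * rng.standard_normal(len(p_n))
    w = np.zeros(M)
    J = np.zeros(N)
    for n in range(M, M + N):
        u_n = u[n - M + 1:n + 1][::-1]          # tap-input vector
        d_n = p_n[n - delay]                    # desired signal = delayed training symbol
        e = d_n - w @ u_n
        w = w + mu * u_n * e                    # LMS update
        J[n - M] = e**2
    return J

for mu in (0.0075, 0.025, 0.075):
    J_avg = np.mean([one_realization(mu) for _ in range(n_real)], axis=0)
    print(mu, J_avg[-100:].mean())              # larger mu: faster decay, larger Jex(inf)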

What to read?

• Least Mean Squares: Haykin chap. 5.1–5.3, 5.4 (see lecture), 5.5–5.7

Exercises: 4.1, 4.2, 1.2, 4.3, 4.5, (4.4)

Computer exercise, theme: Implement the LMS algorithm in Matlab, and generate the results in Haykin chap. 5.7.
