Chapter DOI: 10.1007/978-3-319-74222-9_4
16 Normal Equation
16.1 In experimental and observational sciences, one is often confronted with the
task of estimating parameters p = (θ1, . . . , θn)′ of a mathematical model from a
set of noisy measurements.63 If the parameters enter the model linearly, such a
set-up can be written in the form

    b = Ap + e,

with a matrix A ∈ Rm×n of known coefficients and a vector e of inaccessible
measurement errors. The method of least squares estimates p by the vector x that
minimizes the Euclidean norm of the residual r in

    b = Ax + r,    r = (ρ1, . . . , ρm).
The Gauss–Markov theorem asserts its optimality64 if the noise obeys certain statistical
properties: the errors must be centered and uncorrelated, with equal variances. For
unequal variances or correlated noise one uses the weighted or generalized least
squares method instead.
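To make the weighted variant concrete, here is a minimal sketch (in Python rather than the book's MATLAB, and not taken from the book) for the simplest model βj = θ + ej with known standard deviations σj: scaling the j-th row of the overdetermined system by 1/σj reduces weighted least squares to the ordinary method.

```python
# Weighted least squares for the one-parameter model b_j = theta + e_j,
# where measurement j has standard deviation sigma_j (an illustrative
# setup, not an example from the book): scaling each row of the
# overdetermined system by 1/sigma_j turns weighted LS into ordinary LS.
def weighted_mean(beta, sigma):
    # scaled system: (1/sigma_j) * theta ~ beta_j / sigma_j
    a = [1.0 / s for s in sigma]                  # scaled one-column "matrix"
    y = [bj / s for bj, s in zip(beta, sigma)]    # scaled right-hand side
    # normal equation (a'a) theta = a'y for a single parameter
    return sum(ai * yi for ai, yi in zip(a, y)) / sum(ai * ai for ai in a)

beta = [1.0, 2.0, 4.0]
sigma = [1.0, 1.0, 2.0]          # the last measurement is twice as noisy
theta = weighted_mean(beta, sigma)
# agrees with the closed form sum(beta_j/sigma_j^2) / sum(1/sigma_j^2)
```

For equal variances this estimator reduces to the plain arithmetic mean of the measurements.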
Remark. The least squares method was first published in 1805 by Adrien-Marie Legendre in his
work on trajectories of comets; it had however already been used in 1801 by Carl Friedrich
Gauss for calculating the trajectory of the minor planet Ceres. Both mathematicians, having
previously got in each other’s way while studying number theory and elliptic functions,
bitterly fought for decades over their priority in this discovery.65
Exercise. Show that the least squares estimator of the parameter θ from the measurements

    β j = θ + e j    (j = 1 : m)

is the arithmetic mean θ̂ = (β1 + · · · + βm)/m.
For A with full column rank (which we want to assume from now on), and thus for
A′A s.p.d. (Lemma 9.1), the normal equation

    A′A x = A′b

possesses a unique solution x ∈ Rn. This actually yields the unique minimum of F:
for every y ∈ Rn there holds

    F(y) = F(x) + (y − x)′A′A(y − x) ≥ F(x),

with equality precisely for y = x (A′A is s.p.d.). Hence, we have proven for K = R:
64 That is to say that it is the best linear unbiased estimator (BLUE).
65 R. L. Plackett: Studies in the History of Probability and Statistics. XXIX: The discovery of the method of least
squares, Biometrika 59, 239–251, 1972.
Theorem. For A ∈ Km×n with full column rank, the least squares problem
‖b − Ax‖2 = min! has a unique solution x ∈ Kn, namely the solution of the
normal equation A′A x = A′b.

16.4 Assembling A′A and A′b makes up the bulk of the computational cost
of the normal-equation approach (16.2) to the least squares problem (the vector
b being fixed); the normal equation itself is of course solved with the Cholesky
decomposition from §8.3.66 Using the symmetry of A′A, the computational cost is
therefore (in leading order)

    ≐ mn² + n³/3 flop.
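The procedure just described can be sketched in a few lines; the following illustrative Python code (hand-rolled, not the book's implementation; the book's own snippets use MATLAB) assembles A′A and A′b and solves the normal equation with a Cholesky decomposition as in §8.3.

```python
import math

def solve_normal_equation(A, b):
    """Solve min ||b - Ax||_2 via A'A x = A'b with a Cholesky decomposition."""
    m, n = len(A), len(A[0])
    # assemble the symmetric Gramian C = A'A and the right-hand side d = A'b
    C = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    d = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]
    # Cholesky decomposition C = L L' (C is s.p.d. for full column rank)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = C[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # forward substitution L y = d, then back substitution L' x = y
    y = [0.0] * n
    for i in range(n):
        y[i] = (d[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

# fit a straight line through the points (0,1), (1,2), (2,4):
# least squares solution is intercept 5/6, slope 3/2
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 4.0]
x = solve_normal_equation(A, b)
```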
16.5 To evaluate the stability of (16.2) using Criterion F from §13.8,67 we compare
the condition number κ2(A′A) of the normal equation with that of the least
squares problem. Since it can be shown for m = n, as was the case in §15.7, that

    κ2(A) = √κ2(A′A),    (16.3)

we define κ2(A) by (16.3) in the case of m > n. The relative condition number κLS
of the least squares problem with respect to perturbations in A is bounded as68

    max(κ2(A), ω · κ2(A)²) ≤ κLS ≤ κ2(A) + ω · κ2(A)²,    (16.4)

where ω = ‖r‖2 /(‖A‖2 ‖x‖2 ) measures the relative size of the residual. For small
residuals (ω ≪ 1) and an ill-conditioned matrix A, the condition number
κ2(A′A) = κ2(A)² of the normal equation is thus much larger than κLS, so that by
Criterion F the normal equation is an unstable algorithm for such least squares
problems. In contrast, for all other cases, i.e., for relatively large residuals or
well-conditioned matrices, the use of the normal equation is stable.
Example. The numerical instability of the algorithm (16.2) can famously be illustrated with
the following vanishing-residual example (P. Läuchli 1961):

    A = ( 1 1 ; e 0 ; 0 e ),  b = ( 2 ; e ; e ),  x = ( 1 ; 1 );
    A′A = ( 1+e² 1 ; 1 1+e² ),  κ2(A′A) = 2e⁻² + 1.
Here, A′A is actually exactly numerically singular for e² < 2emach, since then, in
machine arithmetic, 1 + e² evaluates exactly to 1 (the information about e is
completely lost while assembling A′A; after the start section, the parameter e can
no longer be reconstructed, cf. Criterion B from §13.8). For a slightly larger e we
obtain the following numerical example:
>> e = 1e-7;                            % e = 10⁻⁷
>> A = [1 1; e 0; 0 e]; b = [2; e; e];  % Läuchli example
>> x = (A'*A)\(A'*b)                    % solution of the normal equation
x =
     1.011235955056180e+00
     9.887640449438204e-01
The loss of 14 decimal places is consistent with κ2(A′A) ≈ 2 · 10¹⁴; however, due to the
condition number κLS = κ2(A) ≈ 10⁷, at most a loss of approximately 7 decimal places
would be acceptable.
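The loss of information can be reproduced directly in IEEE double precision; the following Python sketch (illustrative, not from the book) checks the exact numerical singularity for small e and solves the 2×2 normal equation by Cramer's rule for e = 10⁻⁷.

```python
# The Läuchli phenomenon in IEEE double precision: assembling A'A squares
# the condition number and can destroy the information about e entirely.

e = 1e-8                     # here e^2 = 1e-16 is below the rounding threshold,
assert 1.0 + e * e == 1.0    # so A'A rounds to [[1,1],[1,1]]: exactly singular

e = 1e-7                     # slightly larger e, as in the example above
kappa = 2.0 / (e * e) + 1.0  # kappa_2(A'A) = 2/e^2 + 1, about 2e14

# normal equation for A = [1 1; e 0; 0 e], b = [2; e; e]:
# A'A = [[1+e^2, 1], [1, 1+e^2]], A'b = [2+e^2; 2+e^2], exact solution (1, 1)
c = 1.0 + e * e
d = 2.0 + e * e
det = c * c - 1.0            # severe cancellation: det is about 2e-14
x1 = (d * c - d) / det       # Cramer's rule; by symmetry x2 = x1
```

With e = 10⁻⁷ the computed x1 typically agrees with the exact value 1 in only a couple of leading digits, consistent with the loss of about 14 of the roughly 16 available decimal places.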
17 Orthogonalization
For the following, we will assume that A ∈ Km×n has full column rank and, therefore, m ≥ n.
17.1 A stable algorithm for the least squares problem should access A directly69
and not via the detour of A′A. With the help of a (normalized) QR decomposition
A = QR, we can simplify the normal equation according to Theorem 9.3 to

    A′A x = A′b,    where A′A = R′R and A′b = R′Q′b;

since R′ is invertible, the factor R′ cancels and there remains

    Rx = Q′b.    (17.1)
69 In 1961, Eduard Stiefel spoke of the “principle of direct attack” in the field of numerical analysis.
Sec. 17 ] Orthogonalization 73
Here, due to the column orthonormality of Q, the inverse start section, that is,
(Q, R) ↦ Q · R = A, is actually well-conditioned: ‖Q‖2 ‖R‖2 = ‖A‖2.
Exercise. Show for the full QR decomposition (9.3): r = b − Ax = (I − Q1 Q1′)b = Q2 Q2′ b.
Explain why only the third formula is stable and should therefore be used for the residual.
17.2 By means of a simple trick, we can actually by and large forgo explicit knowl-
edge of the factor Q. To do so, we calculate the normalized QR decomposition of
the matrix A augmented by the column b, i.e.,
    ( A  b ) = ( Q  q ) · ( R z ; 0 ρ ).

Comparing the last column on both sides yields

    b = Qz + ρq.

Multiplying by Q′ and using Q′Q = I as well as Q′q = 0 (column orthonormality), we obtain

    Q′b = z,

so that (17.1) becomes

    Rx = z.

Further, q′q = ‖q‖2² = 1 implies that the positive ρ is the norm of the residual:

    ‖b − Ax‖2 = ‖Q(z − Rx) + ρq‖2 = ρ ‖q‖2 = ρ.
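As an illustration, the following Python sketch (not the book's code; in practice one would use the Householder method of §9) implements this augmented-matrix trick with modified Gram–Schmidt orthogonalization and applies it to the Läuchli example, for which it recovers the exact solution x = (1, 1) to high accuracy.

```python
import math

def lstsq_qr(A, b):
    """Q-free least squares: QR of the augmented matrix (A b) by modified
    Gram-Schmidt, then solve Rx = z; returns x and rho = ||b - Ax||_2."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)] + [list(b)]
    R = [[0.0] * (n + 1) for _ in range(n + 1)]
    for j in range(n + 1):
        for k in range(j):               # orthogonalize column j against q_k
            R[k][j] = sum(cols[k][i] * cols[j][i] for i in range(m))
            for i in range(m):
                cols[j][i] -= R[k][j] * cols[k][i]
        R[j][j] = math.sqrt(sum(v * v for v in cols[j]))
        if j < n:                        # normalize q_j; the last column
            for i in range(m):           # only contributes rho = R[n][n]
                cols[j][i] /= R[j][j]
    z, rho = [R[k][n] for k in range(n)], R[n][n]
    x = [0.0] * n
    for i in reversed(range(n)):         # back substitution R x = z
        x[i] = (z[i] - sum(R[i][k] * x[k] for k in range(i + 1, n))) / R[i][i]
    return x, rho

# Läuchli example from §16.5: unlike the normal equation, both components
# of the computed solution are now accurate
e = 1e-7
A = [[1.0, 1.0], [e, 0.0], [0.0, e]]
b = [2.0, e, e]
x, rho = lstsq_qr(A, b)
```

Note that b is orthogonalized against the already computed columns q_k inside the same sweep; computing Q′b from an explicitly stored Gram–Schmidt Q would be noticeably less accurate here.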
As for linear systems of equations, the least squares problem is briefly solved by the line
x = A\b;
For the Läuchli example from §16.5, despite the relatively large condition number
κLS = κ2(A) ≈ 10⁷, almost all digits are then correct; this time the machine
arithmetic does not “trigger” the worst case scenario.
Exercise. Let the columns of a matrix B ∈ Rm×n constitute a basis of the n-dimensional
subspace U ⊂ Rm . For x ∈ Rm , denote by d( x, U ) the distance from x to U.
• Render the calculation of d( x, U ) as a least squares problem.
• Write a two-line MATLAB code that calculates d( x, U ) given the input ( B, x ).
17.3 The computational cost of such a Q-free solution of a least squares problem
using the Householder method is, according to §§9.8 and D.5, (in leading order)

    ≐ 2mn² − 2n³/3 flop.

For m ≫ n, the computational cost is thus two times larger than the cost to
calculate the solution via the normal equation with Cholesky decomposition,
which in accordance with §16.4 is

    ≐ mn² + n³/3 flop.

For m ≈ n, both are approximately 4m³/3 flop and therefore of comparable size.
Therefore, both algorithms have their merit: the normal equation with Cholesky
should be used for m ≫ n with a relatively large residual (in the sense that ω is
not ≪ 1) or a comparatively well-conditioned matrix A; orthogonalization should
be used in all other cases (cf. §16.5).
Exercise. Let a matrix A ∈ Rm×n be given with full row rank m < n. Examine the
underdetermined linear system of equations Ax = b.
• Show that there exists a unique minimal solution x∗ for which it holds