Texts in Applied Mathematics

Editors: J.E. Marsden, L. Sirovich, M. Golubitsky, S.S. Antman
Advisors: G. Iooss, P. Holmes, D. Barkley, M. Dellnitz, P. Newton

Springer: New York, Berlin, Heidelberg, Hong Kong, London, Milan, Paris, Tokyo

Numerical Analysis in Modern Scientific Computing
Second Edition
With 65 Illustrations

Springer
Peter Deuflhard
Konrad-Zuse-Zentrum (ZIB)
Berlin-Dahlem, D-14195
Germany
deuflhard@zib.de

Andreas Hohmann
D2 Vodafone
Düsseldorf, D-40547
Germany
andreas.hohmann@d2vodafone.de
Series Editors

J.E. Marsden
Control and Dynamical Systems 107-81
California Institute of Technology
Pasadena, CA 91125
USA
marsden@cds.caltech.edu

L. Sirovich
Division of Applied Mathematics
Brown University
Providence, RI 02912
USA
chico@camelot.mssm.edu
For quite a number of years the rapid progress in the development of both computers and computing (algorithms) has stimulated an ever more detailed scientific and engineering modeling of reality. New branches of science and engineering, which until recently had been considered rather closed, have freshly opened up to mathematical modeling and to simulation on the computer. There is clear evidence that our present problem-solving ability depends not only on the accessibility of the fastest computers (hardware), but even more on the availability of the most efficient algorithms (software).

The construction and the mathematical understanding of numerical algorithms is the topic of the academic discipline Numerical Analysis. In this introductory textbook the subject is understood as part of the larger field of Scientific Computing. This rather new interdisciplinary field influences smart solutions in quite a number of industrial processes, from car production to biotechnology. At the same time it contributes immensely to investigations that are of general importance to our societies, such as the balanced economic and ecological use of primary energy, global climate change, or epidemiology.
The present book is predominantly addressed to students of mathematics,
computer science, science, and engineering. In addition, it intends to reach
computational scientists already on the job who wish to get acquainted
with established modern concepts of Numerical Analysis and Scientific
Computing on an elementary level via personal studies.
…tion of an adaptive multigrid quadrature; in this way we can deal with the adaptivity principle behind multigrid methods for partial differential equations in isolated form, clearly separated from the principle of fast solution, which is often predominant in the context of partial differential equations.
Contents

Preface vii
Outline xi

1 Linear Systems 1
  1.1 Solution of Triangular Systems 3
  1.2 Gaussian Elimination 4
  1.3 Pivoting Strategies and Iterative Refinement 7
  1.4 Cholesky Decomposition for Symmetric Positive Definite Matrices 14
  Exercises 16

2 Error Analysis 21
  2.1 Sources of Errors 22
  2.2 Condition of Problems 24
    2.2.1 Normwise Condition Analysis 26
    2.2.2 Componentwise Condition Analysis 31
  2.3 Stability of Algorithms 34
    2.3.1 Stability Concepts 35
    2.3.2 Forward Analysis 37
    2.3.3 Backward Analysis 42
  2.4 Application to Linear Systems 44

References 325
Software 331
Index 333
1
Linear Systems
Ax = b,
as a sum over all permutations $\sigma \in S_n$ of the set $\{1,\dots,n\}$, the cost of computing $\det A$ amounts to $n \cdot n!$ arithmetic operations. Even with the recursive scheme involving an expansion in subdeterminants according to Laplace's rule,
\[
\det A = \sum_{i=1}^{n} (-1)^{i+1} a_{1i} \det A_{1i},
\]
there are $2^n$ necessary arithmetic operations, where $A_{1i} \in \mathrm{Mat}_{n-1}(\mathbf{R})$ is the matrix obtained from $A$ by crossing out the first row and the $i$th column. As we will see, all methods to be described in the sequel are more efficient than Cramer's rule for $n \ge 3$. Speed is therefore certainly the second important property of a "good" algorithm.
In the search for an efficient solution method for arbitrary systems of linear equations we will begin with a study of simple special cases. The simplest one is certainly the case of a diagonal matrix A, where the system degenerates into n independent scalar linear equations. The idea of transforming a general system into a diagonal one underlies the Gauss-Jordan algorithm. This method, however, is less efficient than the one to be described in Section 1.2 and is therefore omitted here. In terms of complexity, next is the case of a triangular system, which is the topic of the following section.
\[
\begin{pmatrix} r_{11} & \cdots & r_{1n}\\ & \ddots & \vdots\\ 0 & & r_{nn} \end{pmatrix}
\begin{pmatrix} x_1\\ \vdots\\ x_n \end{pmatrix}
=
\begin{pmatrix} z_1\\ \vdots\\ z_n \end{pmatrix}
\tag{1.1}
\]
can be solved starting from the last row, $r_{nn}x_n = z_n$, and working upward. The cost amounts to
\[
1 + 2 + \cdots + n \doteq \frac{n^2}{2}
\]
multiplications. Here the notation "$\doteq$" stands for "equal up to lower-order terms"; i.e., we consider only the term containing the highest power of $n$, which dominates the cost for large values of $n$.

In total analogy, a triangular system of the form
\[
Lx = z \tag{1.2}
\]
with a lower triangular matrix $L$ can be solved starting from the first row and working through to the last one.

This way of solving triangular systems is called backward substitution in the case of (1.1) and forward substitution in the case of (1.2). The name substitution is used because each component of the right-hand side vector can be successively substituted (replaced) by the corresponding solution component.
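To make the two substitutions concrete, here is a minimal Python sketch (the function names and the use of numpy are illustrative assumptions, not from the book):

import numpy as np

def backward_substitution(R, z):
    """Solve R x = z for upper triangular R, from the last row upward."""
    n = len(z)
    x = np.empty(n)
    for i in range(n - 1, -1, -1):
        # subtract the already known components, then divide by the pivot
        x[i] = (z[i] - R[i, i+1:] @ x[i+1:]) / R[i, i]
    return x

def forward_substitution(L, z):
    """Solve L x = z for lower triangular L, from the first row downward."""
    n = len(z)
    x = np.empty(n)
    for i in range(n):
        x[i] = (z[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x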
Having achieved this, we can apply the same procedure to the last $n-1$ rows in order to obtain recursively a triangular system. Therefore it is sufficient to examine the first elimination step from (1.3) to (1.4). We assume that $a_{11} \neq 0$. In order to eliminate the term $a_{i1}x_1$ in row $i$ ($i = 2,\dots,n$), we subtract from row $i$ a multiple of row 1 (which itself remains unaltered), i.e., we replace row $i$ by row $i - l_{i1}\cdot$row $1$ with $l_{i1} := a_{i1}/a_{11}$, or explicitly
\[
a^{(2)}_{ij} := a_{ij} - l_{i1}a_{1j}, \qquad i,j = 2,\dots,n.
\]
After $k-1$ such elimination steps the matrix has the form
\[
A^{(k)} = \begin{pmatrix}
a^{(1)}_{11} & \cdots & & \cdots & a^{(1)}_{1n}\\
& a^{(2)}_{22} & & & a^{(2)}_{2n}\\
& & \ddots & & \vdots\\
& & & a^{(k)}_{kk} \;\cdots & a^{(k)}_{kn}\\
& & & \vdots & \vdots\\
& & & a^{(k)}_{nk} \;\cdots & a^{(k)}_{nn}
\end{pmatrix}.
\]
\[
L_k = \begin{pmatrix}
1 & & & &\\
& \ddots & & &\\
& & 1 & &\\
& & -l_{k+1,k} & \ddots &\\
& & \vdots & & \ddots\\
& & -l_{n,k} & & & 1
\end{pmatrix}
\]
is called a Frobenius matrix. It has the nice property that its inverse $L_k^{-1}$ is obtained from $L_k$ by merely changing the signs of the $l_{ik}$'s. Furthermore, the product of the $L_k^{-1}$'s satisfies
\[
L := L_1^{-1}\cdots L_{n-1}^{-1} =
\begin{pmatrix}
1 & & &\\
l_{21} & 1 & &\\
\vdots & \ddots & \ddots &\\
l_{n1} & \cdots & l_{n,n-1} & 1
\end{pmatrix}.
\]
Summarizing, we have in this way reduced the system $Ax = b$ to the equivalent triangular system $Rx = z$ with
\[
R = L^{-1}A \quad\text{and}\quad z = L^{-1}b.
\]
A lower (respectively, upper) triangular matrix whose main diagonal elements are all equal to one is called a unit lower (respectively, upper) triangular matrix. The above product representation $A = LR$ of the matrix $A$, with a unit lower triangular matrix $L$ and an upper triangular matrix $R$, is called the Gaussian triangular factorization, or briefly LR-factorization, of $A$. If such a factorization exists, then $L$ and $R$ are uniquely determined (cf. Exercise 1.2). (In most of the English literature the matrix $R$ is denoted by $U$, for Upper triangular, and accordingly the Gaussian triangular factorization is called the LU-factorization.)
Therefore the main cost comes from the LR-factorization. However, if different right-hand sides $b_1,\dots,b_j$ are considered, then this factorization has to be carried out only once.
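A sketch of this reuse pattern, assuming scipy is available (the example matrix and right-hand sides are arbitrary illustrations):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[4.0, 3.0], [6.0, 3.0]])
lu, piv = lu_factor(A)          # O(n^3): factorization with pivoting, done once

for b in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    x = lu_solve((lu, piv), b)  # O(n^2): only two triangular solves per b
    print(x)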
Gaussian elimination fails already for the invertible matrix
\[
A = \begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix},
\]
since the pivot $a_{11}$ vanishes, even though $A^2 = I = LR$ with $L = R = I$.
Consider, in three-digit decimal arithmetic, the system
\[
10^{-4}x_1 + x_2 = 1, \qquad x_1 + x_2 = 2,
\]
whose exact solution is $x_1 = 1.000\ldots$, $x_2 = 0.9999\ldots$, i.e., with three correct figures,
\[
x_1 = 1.00, \qquad x_2 = 1.00.
\]
Let us now carry out the Gaussian elimination on our computer, i.e., in three-digit arithmetic. Without row interchange the tiny pivot $10^{-4}$ produces $l_{21} = 1.00\cdot 10^{4}$ and a grossly wrong result. If we first interchange the two rows, then $l_{21} = 1.00\cdot 10^{-4}$, which yields the upper triangular system
\[
1.00\,x_1 + 1.00\,x_2 = 2.00, \qquad 1.00\,x_2 = 1.00,
\]
and therefore the correct result $x_2 = 1.00$, $x_1 = 1.00$. Thus, by interchanging the rows in the above example, the new pivot $\bar a_{11}$ becomes the largest element, in absolute value, of the first column.
We can deduce the partial pivoting or column pivoting strategy from the
above considerations. This strategy is to choose at each Gaussian elimina-
tion step as pivot row the one having the largest element in absolute value
within the pivot column. More precisely, we can formulate the following
algorithm:
In the $k$th step, choose $p \in \{k,\dots,n\}$ with $|a^{(k)}_{pk}| \ge |a^{(k)}_{ik}|$ for $i = k,\dots,n$, and interchange rows $p$ and $k$:
\[
\bar a^{(k)}_{ij} := \begin{cases}
a^{(k)}_{kj} & \text{if } i = p\\
a^{(k)}_{pj} & \text{if } i = k\\
a^{(k)}_{ij} & \text{otherwise.}
\end{cases}
\]
Remark 1.7 Instead of column pivoting with row interchange one can also perform row pivoting with column interchange. Both strategies require at most $O(n^2)$ additional operations. If we combine both methods and look at each step for the largest element in absolute value of the entire remaining matrix, then we need $O(n^3)$ additional operations. This total pivoting strategy is therefore almost never employed.
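A compact Python sketch of the factorization with partial pivoting as just described (names and structure are illustrative; a production code would add singularity checks):

import numpy as np

def lr_partial_pivoting(A):
    """Return P, L, R with P A = L R, using column (partial) pivoting."""
    A = A.astype(float).copy()
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(n - 1):
        # pivot row: largest element in absolute value in column k
        p = k + np.argmax(np.abs(A[k:, k]))
        if p != k:
            A[[k, p]] = A[[p, k]]
            perm[[k, p]] = perm[[p, k]]
        # eliminate below the pivot; store the multipliers l_ik in place
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    L = np.tril(A, -1) + np.eye(n)
    R = np.triu(A)
    P = np.eye(n)[perm]
    return P, L, R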
In the following formal description of the triangular factorization with partial pivoting we use permutation matrices $P \in \mathrm{Mat}_n(\mathbf{R})$. For each permutation $\pi \in S_n$ we define the corresponding matrix
\[
P_\pi := [\,e_{\pi(1)}\ \cdots\ e_{\pi(n)}\,],
\]
is different from zero and is also the largest element in absolute value in the first column, i.e., after the first elimination step
\[
L_1 P_{\tau_1} A = \begin{pmatrix} \bar a_{11} & *\\ 0 & B^{(2)} \end{pmatrix},
\]
where all elements of $L_1$ are less than or equal to one in absolute value, i.e., $|L_1| \le 1$, and $\det L_1 = 1$. The remaining matrix $B^{(2)}$ is again invertible, since $\bar a_{11} \neq 0$ and
matrix
\[
L_k = \begin{pmatrix}
1 & & & &\\
& \ddots & & &\\
& & 1 & &\\
& & -l_{k+1,k} & \ddots &\\
& & \vdots & & \ddots\\
& & -l_{n,k} & & & 1
\end{pmatrix}
\]
satisfies
\[
P_{\pi}L_kP_{\pi}^{-1} = \begin{pmatrix}
1 & & & &\\
& \ddots & & &\\
& & 1 & &\\
& & -l_{\pi(k+1),k} & \ddots &\\
& & \vdots & & \ddots\\
& & -l_{\pi(n),k} & & & 1
\end{pmatrix}. \tag{1.7}
\]
Therefore we can separate the Frobenius matrices $L_k$ and the permutations $P_{\tau_k}$ by inserting in (1.6) the identities $P_{\tau_k}^{-1}P_{\tau_k}$, i.e.,
where $\pi_{n-1} := \mathrm{id}$ and $\pi_k = \tau_{n-1}\cdots\tau_{k+1}$ for $k = 0,\dots,n-2$. Since the permutation $\pi_k$ interchanges in fact only indices $\ge k+1$, the matrices $\bar L_k$ are of the form (1.7). Consequently
\[
P_{\pi_0}A = LR
\]
with $L := \bar L_1^{-1}\cdots\bar L_{n-1}^{-1}$, or explicitly
\[
L = \begin{pmatrix}
1 & & &\\
l_{\pi_1(2),1} & 1 & &\\
l_{\pi_1(3),1} & l_{\pi_2(3),2} & 1 &\\
\vdots & & \ddots & \ddots\\
l_{\pi_1(n),1} & \cdots & & l_{\pi_{n-1}(n),n-1} \;\; 1
\end{pmatrix}.
\]
Remark 1.9 Let us also note that the determinant of $A$ can be easily computed by using the $PA = LR$-factorization of Proposition 1.8 via the formula
\[
\det A = \det(P_{\pi_0})\cdot\det(LR) = \mathrm{sgn}(\pi_0)\cdot r_{11}\cdots r_{nn}.
\]
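A small illustration of this remark using scipy's LU routine (the example matrix is arbitrary; scipy's convention is $A = PLR$ with $P$ a permutation matrix):

import numpy as np
from scipy.linalg import lu

A = np.array([[2.0, 1.0], [1.0, 3.0]])
P, L, R = lu(A)                     # A = P L R in scipy's convention
sign = np.linalg.det(P)             # +1 or -1: the sign of the permutation
det_A = sign * np.prod(np.diag(R))  # det A = sgn(pi) * r_11 ... r_nn
print(det_A, np.linalg.det(A))      # both values agree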
\[
A \longrightarrow \bar A := D_rAD_c,
\]
where
\[
D_r = \mathrm{diag}(\sigma_1,\dots,\sigma_n) \quad\text{and}\quad D_c = \mathrm{diag}(\tau_1,\dots,\tau_n).
\]
At first glance the following three strategies seem to be reasonable:

(a) Row equilibration of $A$ with respect to a vector norm $\|\cdot\|$. Let $A_i$ be the $i$th row of $A$ and assume that there are no zero rows. By setting $D_c := I$ and
\[
\sigma_i := \|A_i\|^{-1} \quad\text{for } i = 1,\dots,n,
\]
we make all rows of $\bar A$ have norm one.

(b) Column equilibration. Suppose that there are no columns $A^j$ of $A$ equal to zero. By setting $D_r := I$ and
\[
\tau_j := \|A^j\|^{-1} \quad\text{for } j = 1,\dots,n,
\]
we make all columns of $\bar A$ have norm one.
(i) $A$ is invertible.
(ii) $a_{ii} > 0$ for $i = 1,\dots,n$.
(iii) $\max_{i,j=1,\dots,n} |a_{ij}| = \max_{i=1,\dots,n} a_{ii}$.

Obviously (iii) and (iv) say that row or column pivoting is not necessary for the LR-factorization; in fact it is even absurd, because it might destroy the structure of $A$. In particular, (iii) means that total pivoting can be reduced to diagonal pivoting.
write $A = A^{(1)}$ as
\[
A^{(1)} = \begin{pmatrix} a_{11} & z^T\\ z & B \end{pmatrix},
\]
where $z = (a_{12},\dots,a_{1n})^T$, and after one elimination step we obtain
\[
A^{(2)} = L_1A^{(1)} = \begin{pmatrix} a_{11} & z^T\\ 0 & B^{(2)} \end{pmatrix}
\quad\text{with}\quad
L_1 = \begin{pmatrix} 1 & 0\\ -z/a_{11} & I \end{pmatrix}.
\]
Now if we multiply $A^{(2)}$ from the right by $L_1^T$, then $z^T$ in the first row is also eliminated and the remainder matrix $B^{(2)}$ remains unchanged, i.e.,
\[
L_1A^{(1)}L_1^T = \begin{pmatrix} a_{11} & 0\\ 0 & B^{(2)} \end{pmatrix}.
\]
Proof. We continue the construction from the proof of Theorem 1.10 for $k = 2,\dots,n-1$ and obtain immediately $L$ as the product of $L_1^{-1},\dots,L_{n-1}^{-1}$ and $D$ as the diagonal matrix of the pivots. □
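A minimal sketch of this square-root-free (rational Cholesky) factorization $A = LDL^T$ in Python, assuming a symmetric positive definite input and no pivoting (names illustrative):

import numpy as np

def ldlt(A):
    """A = L D L^T with unit lower triangular L and diagonal D (the pivots)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    d = np.empty(n)
    for k in range(n):
        # pivot d_k = a_kk minus the contributions of the previous columns
        d[k] = A[k, k] - (L[k, :k] ** 2) @ d[:k]
        for i in range(k + 1, n):
            L[i, k] = (A[i, k] - (L[i, :k] * L[k, :k]) @ d[:k]) / d[k]
    return L, np.diag(d)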
Exercises
Exercise 1.1 Give an example of a full nonsingular (3,3)-matrix for which
Gaussian elimination without pivoting fails.
Exercise 1.2
(a) Show that the unit (nonsingular) lower (upper) triangular matrices
form a subgroup of GL(n).
Show that the Gaussian triangular factorization can be performed for any matrix $A \in \mathrm{Mat}_n(\mathbf{R})$ with a strictly diagonally dominant transpose $A^T$. In particular, any such $A$ is invertible.
Hint: Use induction.
Exercise 1.4 The numerical range $W(A)$ of a matrix $A \in \mathrm{Mat}_n(\mathbf{R})$ is defined as the set
\[
W(A) := \{(Ax,x) \mid (x,x) = 1,\ x \in \mathbf{R}^n\}.
\]
Here $(\cdot,\cdot)$ is the Euclidean scalar product on $\mathbf{R}^n$.
(a) Show that the matrix $A \in \mathrm{Mat}_n(\mathbf{R})$ has an LR-factorization ($L$ unit lower triangular, $R$ upper triangular) if and only if the origin is not contained in the numerical range of $A$, i.e.,
\[
0 \notin W(A).
\]
Hint: Use induction.
(b) Use (a) to show that the matrix
[matrix omitted]
has no LR-factorization.
Exercise 1.5 Program the Gaussian triangular factorization. The pro-
gram should read data A and b from a data file and should be tested
on the following examples:
(a) with the matrix from Example 1.1,
(b) with n = 1, A = 25 and b = 4,
(d) $a^{(2)}_{ij} \ge a^{(1)}_{ij} \ge 0$ for $i,j = 2,\dots,n$ and $j \neq i$;
where $A$ is the matrix from Exercise 1.10. In order to solve this system we apply Gaussian elimination to the matrix $A$ with the following two additional rules, where the matrices produced during elimination are denoted again by $A = A^{(1)},\dots,A^{(n-1)}$ and the relative machine precision is denoted by eps.

(a) If during the algorithm $|a^{(k)}_{kk}| \le |a_{kk}|\,\mathrm{eps}$ for some $k < n$, then shift simultaneously column $k$ and row $k$ to the end and the other columns and rows toward the front (rotation of rows and columns).

(b) If $|a^{(k)}_{kk}| \le |a_{kk}|\,\mathrm{eps}$ for all remaining $k < n-1$, then terminate the algorithm.
Show that:
(i) If the algorithm does not terminate in (b), then after $n-1$ elimination steps it delivers a factorization of $A$ as $PAP = LR$, where $P$ is a permutation and $R = A^{(n-1)}$ is an upper triangular matrix with $r_{nn} = 0$, $r_{ii} < 0$ for $i = 1,\dots,n-1$, and $r_{ij} \ge 0$ for $j > i$.
(ii) The system has in this case a unique solution $x$, and all components of $x$ are nonnegative (interpretation: probabilities).
Give a simple scheme for computing $x$.
Exercise 1.12 Program the algorithm developed in Exercise 1.11 for solving the special system of equations, and test the program on two examples of your choice of dimensions $n = 5$ and $n = 7$, as well as on the matrix
[test matrix omitted]
Exercise 1.13 Let a linear system $Cx = b$ be given, where $C$ is an invertible $(2n,2n)$-matrix of the following special form:
\[
C = [\text{block matrix omitted}], \qquad A, B \text{ invertible},
\]
and suppose that
\[
C^{-1} = \begin{pmatrix} E & F\\ G & H \end{pmatrix}.
\]
In the previous chapter we got to know a class of methods for the numerical solution of linear systems. Formally speaking, we there computed, from given input data $(A,b)$, the solution $f(A,b) = A^{-1}b$. With this example in mind, we want to analyze algorithms from a more abstract point of view in the present section.

Let a problem be abstractly characterized by $(f,x)$ for a given mapping $f$ and given input data $x$. To solve the problem then means to compute the result $f(x)$ by means of an algorithm that may produce intermediate results as well. The situation is described by the scheme

input data $\longrightarrow$ algorithm $\longrightarrow$ output data
In this chapter we want to see how errors come up and influence this process and, in particular, whether Gaussian elimination is indeed a reliable algorithm. Errors in the numerical result arise from errors in the data (input errors) as well as from errors introduced by the algorithm.
In principle, we are powerless against input errors, since they belong to the
given problem and can only be avoided by changing the problem setting.
The situation is clearly different with errors caused by the algorithm. Here
we have the chance to avoid or, at least, to diminish errors by changing the
method. In what follows the distinction between the two kinds of errors
will lead us to the notions of the condition of a problem as opposed to the
stability of an algorithm. First we want to discuss the possible sources of
errors.
\[
a = \nu \sum_{i=1}^{l} a_i\,d^{-i},
\]
where $\nu \in \{\pm 1\}$ is the sign, $a_i \in \{0,\dots,d-1\}$ are the digits (it is assumed that $a = 0$ or $a_1 \neq 0$), and $l$ is the length of the mantissa. The numbers that are representable in this way form a subset
\[
\mathcal{N} := \{x \in \mathbf{R} \mid \text{there are } a, e \text{ as above such that } x = a\,d^{e}\}
\]
of the real numbers. The range of the exponent $e$ defines the largest and smallest number that can be represented on the machine (by which we mean the processor together with the compiler). The length of the mantissa is responsible for the relative precision of the representation of real numbers on the given machine. Every number $x \neq 0$ with
\[
d^{\,e_{\min}-1} \le |x| \le d^{\,e_{\max}}(1 - d^{-l})
\]
is represented as a floating point number $\tilde x$ by rounding to the closest machine number, whose relative error is estimated by
\[
|x - \tilde x| \le \mathrm{eps}\,|x|.
\]
In many important practical situations the relative precision $|\delta x/x|$ lies between $10^{-2}$ and $10^{-3}$, a quantity that in general outweighs by far the rounding of the input data. In this context the term technical precision is often used.
Let us now go to the second group of error sources, the errors in the algorithm. The realization of an elementary operation
\[
\circ \in \{+,-,\cdot,/\}
\]
by the corresponding floating point operation $\hat\circ \in \{\hat+,\hat-,\hat\cdot,\hat/\}$ does not avoid rounding errors. The relative error here is less than or equal to the machine precision; i.e., for $x, y \in \mathcal{N}$, we have
\[
x \mathbin{\hat\circ} y = (x \circ y)(1+\varepsilon) \quad\text{with}\quad |\varepsilon| \le \mathrm{eps}.
\]

[Figures: perturbed mapping versus exact mapping $f$; intersection point $r$ of two lines $g$ and $h$, nearly perpendicular and nearly parallel.]
The ratio between the input and the output sets depends strongly on the intersection angle $\angle(g,h)$ between $g$ and $h$. If $g$ and $h$ are nearly perpendicular, the variation of the intersection point $r$ is about the same as the variation of the lines $g$ and $h$. If, however, the angle $\angle(g,h)$ is small, i.e., the lines $g$ and $h$ are nearly parallel, then one has real difficulties locating the intersection point even with the naked eye (see Figure 2.3). Actually, the intersection point $r$ then moves several times more than any perturbation of the lines. We can therefore call the determination of the intersection point well-conditioned in the first case, but ill-conditioned in the second case.
We write $g(x) \doteq h(x)$ for $x \to x_0$ if
\[
g(x) = h(x) + o(\|h(x)\|) \quad\text{for } x \to x_0,
\]
where the Landau symbol "$o(\|h(x)\|)$ for $x \to x_0$" denotes a generic function $\varphi(x)$ having the property
\[
\lim_{x\to x_0} \frac{\|\varphi(x)\|}{\|h(x)\|} = 0.
\]
Thus for a differentiable function $f$ we have
\[
f(\tilde x) - f(x) \doteq f'(x)(\tilde x - x) \quad\text{for } \tilde x \to x.
\]
\[
\|A\| := \sup_{x\neq 0} \frac{\|Ax\|}{\|x\|} = \sup_{\|x\|=1} \|Ax\| \quad\text{for } A \in \mathrm{Mat}_{m,n}(\mathbf{R}).
\]
For illustration let us compute the condition numbers for some simple problems.

Example 2.3 Condition of addition (respectively, subtraction). Addition is a linear mapping
\[
f : \mathbf{R}^2 \to \mathbf{R}, \qquad \begin{pmatrix} a\\ b \end{pmatrix} \mapsto f(a,b) := a+b.
\]
An error in the seventh significant decimal digit of the input data $a, b$ leads to an error in the third significant decimal digit of the result $a - b$, i.e.,
\[
\kappa_{\mathrm{rel}} \approx 10^4.
\]
Be aware of the fact that the cancellation of digits in a result given by a computer cannot be noticed afterward. The appended zeros are zeros in the binary representation, which are lost via the transformation to the decimal system. Therefore we arrive at the following rule:
\[
x^2 - 2px + q = 0,
\]
whose solution is usually given by
\[
x_{1,2} = p \pm \sqrt{p^2 - q}.
\]
In this form the cancellation phenomenon occurs when one of the solutions is close to zero. However, this cancellation of significant digits is avoidable because, by Vieta's theorem, $q$ is the product of the roots, which can be exploited according to
\[
x_1 = p + \mathrm{sgn}(p)\sqrt{p^2 - q}, \qquad x_2 = \frac{q}{x_1}.
\]
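A small Python sketch of this cancellation-free evaluation (assuming real roots, i.e., $p^2 \ge q$; function name illustrative):

import math

def quadratic_roots(p, q):
    """Real roots of x^2 - 2 p x + q = 0, avoiding cancellation:
    the root of larger modulus is computed directly, the other via
    Vieta's theorem x1 * x2 = q."""
    if p == 0.0 and q == 0.0:
        return 0.0, 0.0
    x1 = p + math.copysign(math.sqrt(p * p - q), p)
    x2 = q / x1
    return x1, x2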
\[
\frac{1 - \cos x}{x}
= \frac{1}{x}\left(1 - \Bigl[1 - \frac{x^2}{2} + \frac{x^4}{24} \mp \cdots\Bigr]\right)
= \frac{x}{2}\left(1 - \frac{x^2}{12} \pm \cdots\right).
\]
For $x = 10^{-4}$ we have $x^2/12 < 10^{-9}$ and therefore, according to Leibniz's theorem on alternating series, $x/2$ is an approximation of $(1 - \cos x)/x$ correct up to eight decimal digits.
which is linear in $b$. Its derivative is $f'(b) = A^{-1}$, so that the condition numbers of the problem are given by
\[
\kappa_{\mathrm{abs}} = \|A^{-1}\| \quad\text{and}\quad
\kappa_{\mathrm{rel}} = \frac{\|b\|}{\|A^{-1}b\|}\,\|A^{-1}\| = \frac{\|Ax\|}{\|x\|}\,\|A^{-1}\|.
\]
Next we take perturbations in A into account, too. For that purpose we
consider the matrix A as input quantity
Remark 2.9 The differentiability of the inverse follows easily also from the Neumann series. If $C \in \mathrm{Mat}_n(\mathbf{R})$ is a matrix with $\|C\| < 1$, then $I - C$ is invertible and
\[
(I - C)^{-1} = \sum_{k=0}^{\infty} C^k = I + C + C^2 + \cdots
\]
Lemma 2.8 implies that the derivative with respect to $A$ of the solution $f(A) = A^{-1}b$ of the linear system satisfies
\[
f'(A)\,C = -A^{-1}CA^{-1}b = -A^{-1}Cx \quad\text{for } C \in \mathrm{Mat}_n(\mathbf{R}).
\]
In this way we arrive at the condition numbers
\[
\kappa_{\mathrm{rel}} = \frac{\|A\|\,\|f'(A)\|}{\|x\|} \le \|A\|\,\|A^{-1}\|.
\]
The relative condition number calculated earlier with respect to the input $b$ can be estimated by
\[
\kappa_{\mathrm{rel}} \le \|A\|\,\|A^{-1}\|
\]
because of the submultiplicativity $\|Ax\| \le \|A\|\,\|x\|$ of the matrix norm. Therefore henceforth the quantity
\[
\kappa(A) := \|A\|\,\|A^{-1}\|
\]
will be called the condition number of the matrix $A$.
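In practice this quantity is easy to evaluate numerically; a one-line illustration with numpy (the nearly singular example matrix is an arbitrary choice):

import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1e-8]])
kappa = np.linalg.cond(A, p=np.inf)   # ||A|| * ||A^{-1}|| in the max norm
print(kappa)                          # 1e8: normwise very ill-conditioned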
The solution of a linear system with the diagonal matrix
\[
A = \begin{pmatrix} 1 & 0\\ 0 & \varepsilon \end{pmatrix}, \qquad
A^{-1} = \begin{pmatrix} 1 & 0\\ 0 & \varepsilon^{-1} \end{pmatrix},
\]
is obviously a well-conditioned problem, because the equations are completely independent of each other (also called decoupled). Here we implicitly assume that the admissible perturbations preserve the diagonal form. The condition number of the matrix, however, is $\kappa(A) = \|A\|\,\|A^{-1}\| = 1/\varepsilon$, i.e., arbitrarily large for small $\varepsilon \le 1$. It describes the condition for all kinds of possible perturbations of the matrix.
The example suggests that the notion of condition defined in Section 2.2.1
turns out to be deficient in some situations. Intuitively, we expect that the
condition number of a diagonal matrix, i.e., of a completely decoupled linear
system, is equal to one, as in the case of a scalar linear equation. The fol-
lowing componentwise analysis will lead us to such a condition number. In
order to transfer the concept of Section 2.2.1 to the componentwise setting,
we merely have to replace norms with absolute values of the components.
In the following we will work out details for the relative error concept only.
\[
|f(\tilde x) - f(x)| \doteq |f'(x)|\,|\tilde x - x| \quad\text{for } \tilde x \to x
\]
gives componentwise
\[
\kappa_{\mathrm{rel}} = \frac{\bigl\|\,|f'(x)|\,|x|\,\bigr\|_\infty}{\|f(x)\|_\infty}.
\]
As earlier with the normwise concept, we also want to calculate the
componentwise condition number for a sequence of illustrative problems.
\[
\hat\kappa_{\mathrm{rel}} = \frac{\bigl\|\,|A^{-1}|\,|b|\,\bigr\|_\infty}{\|A^{-1}b\|_\infty}
= \frac{\bigl\|\,|A^{-1}|\,|b|\,\bigr\|_\infty}{\|x\|_\infty}.
\]
This number was introduced by R. D. Skeel [76]. With it the error $\tilde x - x$, $\tilde x = A^{-1}\tilde b$, can be estimated by
\[
\frac{\|\tilde x - x\|_\infty}{\|x\|_\infty} \le \hat\kappa_{\mathrm{rel}}\,\varepsilon
\quad\text{for}\quad |\tilde b - b| \le \varepsilon\,|b|.
\]
The ideas of Example 2.7 can be transferred to perturbations in $A$. We already know that the mapping $f : GL(n) \to \mathbf{R}^n$, $A \mapsto f(A) = A^{-1}b$, is differentiable, with
\[
\kappa_{\mathrm{rel}} = \frac{\bigl\|\,|f'(A)|\,|A|\,\bigr\|_\infty}{\|f(A)\|_\infty}
= \frac{\bigl\|\,|A^{-1}|\,|A|\,|x|\,\bigr\|_\infty}{\|x\|_\infty}.
\]
If we collect the results for perturbations in both $A$ and $b$, then the relative condition numbers add up, and we obtain as a condition number for the combined problem
\[
\kappa_{\mathrm{rel}} = \frac{\bigl\|\,|A^{-1}||A||x| + |A^{-1}||b|\,\bigr\|_\infty}{\|x\|_\infty}
\le 2\,\frac{\bigl\|\,|A^{-1}||A||x|\,\bigr\|_\infty}{\|x\|_\infty}.
\]
Taking for $x$ the vector $e = (1,\dots,1)^T$ yields the following characterization of the componentwise condition of $Ax = b$ for arbitrary right-hand sides $b$:
\[
\kappa_C(A) := \bigl\|\,|A^{-1}|\,|A|\,\bigr\|_\infty.
\]
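A small numpy sketch of this componentwise (Skeel) condition number; the diagonal example illustrates the contrast with the normwise condition number discussed above (function name illustrative):

import numpy as np

def skeel_condition(A):
    """Componentwise (Skeel) condition number || |A^{-1}| |A| ||_inf."""
    Ainv = np.linalg.inv(A)
    return np.linalg.norm(np.abs(Ainv) @ np.abs(A), ord=np.inf)

# For a diagonal matrix the componentwise condition number is 1,
# whereas the normwise condition number is 1/eps.
print(skeel_condition(np.diag([1.0, 1e-8])))   # 1.0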
To answer this question we must first think about how to characterize the perturbed mapping $\hat f$. We have seen above that the errors in performing a floating point operation $\circ \in \{+,-,\cdot,/\}$ can be estimated by
\[
a \mathbin{\hat\circ} b = (a \circ b)(1+\varepsilon), \quad \varepsilon = \varepsilon(a,b), \quad\text{with } |\varepsilon| \le \mathrm{eps}. \tag{2.4}
\]
Here it does not make too much sense (even if it may be possible in principle) to determine $\varepsilon = \varepsilon(a,b)$ for all values $a$ and $b$ on a given computer. In this respect our algorithm has to deal not with a single mapping $\hat f$, but with a whole class $\{\hat f\}$ containing all mappings characterized by estimates of the form (2.4). This class also contains the given problem, $f \in \{\hat f\}$.
The estimate (2.4) of the error of a floating point operation was derived in Section 2.1 only for machine numbers. Because we study the whole class of mappings, we can allow arbitrary real numbers as arguments. In this way we put the mathematical tools of calculus at our disposal. Our model of the algorithm consists therefore of mappings operating on real numbers and satisfying estimates of the form (2.4).
In order to avoid unwieldy notation, let us denote the family $\{\hat f\}$ by $\hat f$ as well; i.e., $\hat f$ stands for the whole family or for a representative, according to the context. Statements on such an algorithm $\hat f$ (for example, error estimates) are always appropriately interpreted for all the mappings in the family. In particular, we define the image $\hat f(E)$ of a set $E$ as the union
\[
\hat f(E) := \bigcup_{\phi \in \hat f} \phi(E).
\]
We are left with the question of how to assess the error $\hat f(x) - f(x)$. In our condition analysis we have seen that input data are always (at least for floating point numbers) affected by input errors, which, through the condition of the problem, lead to unavoidable errors in the output. From our algorithm we cannot expect to accomplish more than from the problem itself. Therefore we are happy when its error $\hat f(x) - f(x)$ lies within reasonable bounds of the error $f(\tilde x) - f(x)$ caused by the input error. Along this line of thought there are essentially two approaches: the forward analysis and the backward analysis. They will be treated in what follows.
[Figure: forward versus backward analysis; the computed result $\hat f(x)$ is compared with $f(\tilde x)$ for a suitably perturbed input $\tilde x$.]
$a \mathbin{\hat\circ} b = (a \circ b)(1+\varepsilon)$ for some $\varepsilon$ with $|\varepsilon| \le \mathrm{eps}$, and hence
\[
\frac{|a \mathbin{\hat\circ} b - a \circ b|}{|a \circ b|}
= \frac{|(a\circ b)(1+\varepsilon) - a \circ b|}{|a \circ b|}
= |\varepsilon| \le \mathrm{eps}. \qquad\Box
\]
Example 2.20 Subtraction. We see in particular that in the case of cancellation we have $\kappa \gg 1$, which implies for the stability indicator that $\sigma \ll 1$. Thus subtraction is outstandingly stable, and in the case of total cancellation it is indeed error free: $a \mathbin{\hat-} b = a - b$.
and we assume that the stability indicators $\sigma_g$, $\sigma_h$ for the partial algorithms $\hat g$ and $\hat h$ that implement $g$ and $h$ are known. How can we assess from here the stability of the composed algorithm $\hat f = \hat h \circ \hat g$?

Proof. We work out the proof for the normwise approach. Let $\hat g$ and $\hat h$ be arbitrary representatives of the algorithms for $g$ and $h$, as well as $\hat f = \hat h \circ \hat g$ for $f = h \circ g$. Then
denotes the addition of the first two components. We want to examine this "algorithm" componentwise. The condition number and stability indicator for $\alpha_n$ coincide with those for the addition of two numbers, i.e., $\kappa_{\alpha_n} = \kappa_+$ and $\sigma_{\alpha_n} = \sigma_+$. With the notation $\kappa_j := \kappa_{s_j}$ and $\sigma_j := \sigma_{s_j}$ we have by virtue of Lemma 2.21 that
\[
\sigma_n\kappa_n \le (\sigma_{n-1} + \sigma_+\kappa_+)\,\kappa_{n-1} \le (1 + \sigma_{n-1})\,\kappa_{n-1}.
\]
According to Example 2.3, the condition number $\kappa_n$ satisfies
In the following we will see how to carry out the forward analysis for
scalar functions. In this special case we have a simplified version of Lemma
2.21.
Lemma 2.25 If the functions $g$ and $h$ of Lemma 2.21 are scalar and differentiable, then the stability indicator $\sigma_f$ of the combined algorithm $\hat f = \hat h \circ \hat g$ satisfies

Proof. In this special case the condition number of the combined problem is the product of the condition numbers of the parts:
\[
\kappa_f = \frac{|x|\,|f'(x)|}{|f(x)|}
= \frac{|g(x)|\,|h'(g(x))|}{|h(g(x))|}\cdot\frac{|g'(x)|\,|x|}{|g(x)|}
= \kappa_h\,\kappa_g.
\]
Hence from Lemma 2.21 it follows that □
If the condition number $\kappa_g$ of the first partial problem is very small, $\kappa_g \ll 1$, then the algorithm can become unstable. A small condition number can also be interpreted as a loss of information: A change in the input has almost no influence on the output. Such a loss of information at the beginning of the algorithm therefore leads to instability. Moreover, we see that an instability at the beginning of the algorithm (large $\sigma_g$) fully affects the composed algorithm. For example, let us analyze the recursive method for computing $\cos mx$ and an intermediary result.
Example 2.27 Now we can analyze the recursive computation of $\cos mx$. It is important, for example, in Fourier synthesis, i.e., in the evaluation of
\[
f(x) = \sum_{k=1}^{m} \bigl(a_k\cos kx + b_k\sin kx\bigr).
\]
For the above numerical example this recurrence yields an essentially better solution with a relative error of $1.5\cdot 10^{-11}$. The recurrence for $x \to \pi$ can be stabilized in a similar way. (It turns out that these stabilizations ultimately lead to usable results only because the three-term recurrence relation (2.7) is well-conditioned; see Section 6.2.1.)
The algorithm is called stable with respect to the relative input error $\delta$ if
\[
\eta < \delta.
\]
For the input error $\delta = \mathrm{eps}$ caused by roundoff we define the stability indicator of the backward analysis as the quotient
\[
\sigma_R := \eta/\mathrm{eps}.
\]
As we see, the condition of the problem does not appear in this definition. Also, in contrast with forward analysis, the backward analysis does not require a preceding condition analysis of the problem. Furthermore, the results are easily interpreted by comparing the input error and the backward error. Because of these properties the backward analysis is preferable, especially in the case of complex algorithms. All the stability results for Gaussian elimination collected in the next section are related to backward analysis.
The two stability indicators $\sigma$ and $\sigma_R$ are not identical in general. The concept of backward analysis is rather stronger, as shown by the following lemma.

Lemma 2.29 The stability indicators $\sigma$ and $\sigma_R$ of the forward and backward analysis satisfy
Proof. From the definition of the backward error it follows that for any $x \in E$ there is an $\tilde x$ such that $\hat f(x) = f(\tilde x)$ and
\[
\frac{\|x - \tilde x\|}{\|x\|} \le \eta = \sigma_R\,\mathrm{eps} \quad\text{for } \mathrm{eps} \to 0. \qquad\Box
\]
As an example of the backward analysis let us look again at the scalar product $(x,y)$ for $x,y \in \mathbf{R}^n$ in its floating point implementation
\[
(x,y)_{\mathrm{fl}} := x_n \mathbin{\hat\cdot} y_n \mathbin{\hat+} (x^{n-1}, y^{n-1})_{\mathrm{fl}}, \tag{2.8}
\]
with $x^{n-1} := (x_1,\dots,x_{n-1})^T$, where
\[
\eta \le n\,\mathrm{eps},
\]
and the scalar product is (with $2n-1$ elementary operations) stable in the sense of backward analysis.
For the rounding error caused by putting the matrix $A$ into a computer we assume, for example, that $\delta = \mathrm{eps}$. With experimental data, $\delta$ is taken as the largest tolerance. However, we would like to stress the fact that linear systems with almost singular matrices may nevertheless be numerically well-behaved, a fact that can be interpreted through the $x$-dependency of the relative condition number $\kappa_{\mathrm{rel}}$.
The matrix $A$ and the right-hand side $b_1$ contain the common input variable $\varepsilon \ll 1$; i.e., they are connected with each other. The condition number of the matrix is
\[
\kappa(A) = \|A^{-1}\|_\infty\,\|A\|_\infty \doteq \frac{1}{\varepsilon} \gg 1.
\]
Second, we examine the same problem, but now for a different right-hand side $b_2 := (0,1)^T$ independent of $\varepsilon$. Here we obtain the solution
\[
x = \begin{pmatrix} -1/\varepsilon\\ 1/\varepsilon \end{pmatrix}
\]
and the componentwise condition numbers
Proof. Similarly to the scalar product (see Lemma 2.30), the forward substitution algorithm may also be recursively formulated as
\[
l_{kk}x_k = b_k - (l^{k-1}, x^{k-1}) \tag{2.11}
\]
for $k = 1,\dots,n$, where
\[
x^{k-1} = (x_1,\dots,x_{k-1})^T \quad\text{and}\quad l^{k-1} = (l_{k1},\dots,l_{k,k-1})^T.
\]
Floating point implementation turns (2.11) into the recurrence relation
\[
l_{kk}(1+\delta_k)(1+\varepsilon_k)\,\tilde x_k = b_k - (l^{k-1}, \tilde x^{k-1})_{\mathrm{fl}},
\]
where $\delta_k$ and $\varepsilon_k$ with $|\delta_k|, |\varepsilon_k| \le \mathrm{eps}$ describe the relative errors of the multiplication and the addition, respectively. For the floating point implementation of the scalar product we know already from Lemma 2.30 that
\[
(l^{k-1}, \tilde x^{k-1})_{\mathrm{fl}} = (\tilde l^{k-1}, \tilde x^{k-1})
\]
for some vector $\tilde l^{k-1} = (\tilde l_{k1},\dots,\tilde l_{k,k-1})^T$ with
\[
|l^{k-1} - \tilde l^{k-1}| \le (k-1)\,\mathrm{eps}\,|l^{k-1}|.
\]
Proof. A simple inductive proof of the weaker statement with $4n$ instead of $n$ is found in the book [41] of G. H. Golub and C. F. van Loan, Theorem 3.3.1. □
where
\[
\rho_n(A) := \frac{a_{\max}}{\max_{i,j}|a_{ij}|}
\]
and $a_{\max}$ is the largest absolute value of an element of the remainder matrices $A^{(1)} = A$ through $A^{(n)} = R$ appearing during elimination.
\[
\|L\|_\infty \le n.
\]
The norm of $R$ can be estimated by
\[
\|R\|_\infty \le n\,\max_{i,j}|r_{ij}| \le n\,a_{\max}.
\]
The statement follows therefore from (2.12), because $\max_{i,j}|a_{ij}| \le \|A\|_\infty$. □
So what is the stability of Gaussian elimination? This question is not clearly answered by Theorem 2.37. Whether a matrix is suitable for Gaussian elimination obviously depends on the number $\rho_n(A)$. In general this quantity can be estimated by $\rho_n(A) \le 2^{n-1}$, and this bound is attained, for example, for the Wilkinson matrix
\[
A_W = \begin{pmatrix}
1 & & & & 1\\
-1 & 1 & & & 1\\
\vdots & \ddots & \ddots & & \vdots\\
-1 & \cdots & -1 & 1 & 1\\
-1 & \cdots & -1 & -1 & 1
\end{pmatrix}.
\]
Therefore Gaussian elimination with column pivoting is not stable for the whole class of invertible matrices. However, for special classes of matrices the situation looks considerably better. In Table 2.1 we have listed some classes of matrices and the corresponding estimates for $\rho_n$.
Table 2.1. Classes of matrices, whether pivoting is necessary, and the corresponding estimates for $\rho_n$.

invertible                                   yes           $2^{n-1}$
upper Hessenberg                             yes           $n$
$A$ or $A^T$ strictly diagonally dominant    superfluous   $2$
tridiagonal                                  yes           $2$
symmetric positive definite                  no            $1$
random                                       yes           $n^{2/3}$ (average)
Furthermore, one could state with a clear conscience that Gaussian elimi-
nation is stable "as a rule," i.e., is stable for matrices usually encountered
in practice. This statement is also supported by the probabilistic consider-
ations carried out by L. N. Trefethen and R. S. Schreiber [84] (see the last
row of Table 2.1).
\[
\eta_C(\tilde x) = \max_i \frac{|b - A\tilde x|_i}{(|A|\,|\tilde x| + |b|)_i}. \tag{2.13}
\]
Proof. Let
\[
\eta_C = \min\{\omega \mid \text{there are } \tilde A, \tilde b \text{ with } \tilde A\tilde x = \tilde b,\ |\tilde A - A| \le \omega|A|,\ |\tilde b - b| \le \omega|b|\}.
\]
We set
\[
\theta := \max_i \frac{|r(\tilde x)|_i}{(|A|\,|\tilde x| + |b|)_i}
\]
and have to prove that $\eta_C(\tilde x) = \theta$. Half of the statement, $\eta_C(\tilde x) \ge \theta$, follows from the fact that for any feasible $\tilde A, \tilde b, \omega$ we have
\[
|A\tilde x - b| = |(A - \tilde A)\tilde x + (\tilde b - b)| \le |A - \tilde A|\,|\tilde x| + |\tilde b - b| \le \omega\bigl(|A|\,|\tilde x| + |b|\bigr),
\]
and therefore $\omega \ge \theta$. For the other half of the statement, $\eta_C(\tilde x) \le \theta$, we write the residual $r = b - A\tilde x$ as
\[
r = D\bigl(|A|\,|\tilde x| + |b|\bigr) \quad\text{with}\quad D = \mathrm{diag}(d_1,\dots,d_n),
\]
where $|d_i| \le \theta$. Now by setting
\[
\tilde A := A + D|A|\,\mathrm{diag}(\mathrm{sgn}(\tilde x)) \quad\text{and}\quad \tilde b := b - D|b|,
\]
we get $|\tilde A - A| \le \theta|A|$, $|\tilde b - b| \le \theta|b|$ and
\[
\tilde A\tilde x - \tilde b = A\tilde x + D|A|\,|\tilde x| - b + D|b| = -r + D\bigl(|A|\,|\tilde x| + |b|\bigr) = 0. \qquad\Box
\]
Remark 2.40 Naturally, formula (2.13) for the backward error is usable in practice only when the error occurring in its evaluation does not spoil the meaningfulness of the result. R. D. Skeel has shown that the value $\hat\eta$ computed in floating point arithmetic differs from the actual value $\eta$ by at most $n\cdot\mathrm{eps}$. Thus formula (2.13) can really be invoked to assess the quality of an approximate solution $\tilde x$.
In Chapter 1, iterative refinement was introduced as a possibility of "improving" an approximate solution. The following result proved by R. D. Skeel [77] shows the effectiveness of the iterative refinement: For Gaussian elimination a single refinement step already implies componentwise stability.
Theorem 2.41 Let
\[
K_n\,\kappa_C(A)\,\sigma(A,x)\,\mathrm{eps} < 1,
\]
with Skeel's condition number $\kappa_C(A) = \bigl\|\,|A|\,|A^{-1}|\,\bigr\|_\infty$, the definition
\[
\sigma(A,x) := \frac{\max_i (|A|\,|x|)_i}{\min_i (|A|\,|x|)_i},
\]
and $K_n$ a constant close to $n$. Then Gaussian elimination with column pivoting followed by one refinement step has a componentwise stability indicator
\[
\sigma \le n+1,
\]
i.e., this form of Gaussian elimination is stable.
The quantity $\sigma(A,x)$ is a measure of the quality of the scaling (see [77]).
Example 2.42 Hamming's example.
\[
A = [\text{matrix omitted}], \qquad b = [\text{vector omitted}].
\]
For $\varepsilon$ we take the relative machine precision $\mathrm{eps} = 3\cdot 10^{-10}$. For Skeel's condition number we then obtain
\[
\kappa_{\mathrm{rel}} = \frac{\bigl\|\,|A^{-1}||A||x| + |A^{-1}||b|\,\bigr\|_\infty}{\|x\|_\infty} \doteq 6,
\]
Exercises

Exercise 2.1 Show that the elementary operation $\hat+$ is not associative in floating point arithmetic.
Exercise 2.2 Determine through a short program the relative machine
precision of the computer you use.
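One possible sketch of such a program (Python; the halving loop and the printed value for IEEE double precision are illustrative):

eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
# smallest eps with 1 + eps > 1 in floating point arithmetic
print(eps)   # 2.220446049250313e-16 in IEEE double precision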
Exercise 2.3 The zeros of the cubic polynomial
\[
z^3 + 3qz - 2r = 0, \qquad r, q > 0,
\]
are to be found. According to Cardano-Tartaglia a real root is given by
\[
z = \Bigl(r + \sqrt{q^3 + r^2}\Bigr)^{1/3} + \Bigl(r - \sqrt{q^3 + r^2}\Bigr)^{1/3}.
\]
Compute
\[
\alpha = \Bigl(1 + \frac{1}{2n}\Bigr)^{n} - \sqrt{e}\;\frac{1 - \frac{1}{12n}}{1 + \frac{1}{12n}} > 0
\]
for $n = 10^6$ up to three exact digits.
Exercise 2.6 Compute the condition number of the evaluation of a polynomial given by its coefficients $a_0,\dots,a_n$,
\[
p(x) = a_nx^n + a_{n-1}x^{n-1} + \cdots + a_1x + a_0,
\]
at the point $x$, first with respect to perturbations of the coefficients $a_i \rightsquigarrow \tilde a_i = a_i(1+\varepsilon_i)$, and then with respect to perturbations $x \rightsquigarrow \tilde x = x(1+\varepsilon)$ of $x$. Consider in particular the polynomial
\[
p(x) = 8118x^4 - 11482x^3 + x^2 + 5741x - 2030
\]
at the point $x = 0.707107$. The exact result is
\[
p(x) = -1.9152732527082\cdot 10^{-11}.
\]
and for $p = \infty$ by
\[
\|x\|_\infty := \max_{i=1,\dots,n} |x_i|.
\]
Show that the corresponding matrix norms
\[
\|A\|_p := \sup_{x\neq 0} \frac{\|Ax\|_p}{\|x\|_p}
\]
satisfy
(a) $\|A\|_1 = \max_{j=1,\dots,n} \sum_{i=1}^{m} |a_{ij}|$.
(b) $\|A\|_\infty = \max_{i=1,\dots,m} \sum_{j=1}^{n} |a_{ij}|$.
(c) $\|A\|_2 \le \sqrt{\|A\|_1\,\|A\|_\infty}$.
(d) $\|AB\|_p \le \|A\|_p\,\|B\|_p$ for $1 \le p \le \infty$.
Show that
(a) $\|\cdot\|_0$ indeed defines a norm on $\mathrm{Mat}_{m,n}(\mathbf{R})$.
(b) $\|Ax\|_2 \le \|A\|_0\,\|x\|_1$ and $\|A^Tx\|_\infty \le \|A\|_0\,\|x\|_2$.
(c) $\|A\|_F \le \sqrt{n}\,\|A\|_0$ and $\|AB\|_0 \le \|A\|_F\,\|B\|_0$.
Exercise 2.11 Let $A$ be a nonsingular matrix that is transformed into $\tilde A := A + \delta A$ by a small perturbation $\delta A$, and let $\|\cdot\|$ be a submultiplicative norm with $\|I\| = 1$. Show that
(a) If $\|B\| < 1$, then $(I - B)^{-1} = \sum_{k=0}^{\infty} B^k$ exists, and the following inequality holds:
\[
\|(I - B)^{-1}\| \le \frac{1}{1 - \|B\|}.
\]
(b) If $\|A^{-1}\delta A\| \le \epsilon < 1$, then
\[
\kappa(\tilde A) \le \frac{1+\epsilon}{1-\epsilon}\,\kappa(A).
\]
Exercise 2.12 Show that the following property holds for any invertible
matrix A E GL(n):
(a) S 2 = -1-
n-1
L:
n
(Xi -x) 2
i=l
(b))
Exercises 55
where x = ~ I:~=l Xi is the mean value of Xi. Which of the two formulas for
computing S2 is numerically more stable, and therefore preferable? Support
your assertion on the stability indicator, and illustrate your choice with a
numerical example.
Exercise 2.17 We consider the approximation of the exponential $\exp(x)$ by the truncated Taylor series
\[
\exp(x) \approx \sum_{k=0}^{N} \frac{x^k}{k!}. \tag{2.14}
\]
Compute approximate values of $\exp(x)$ for $x = -5.5$ of the following three types, with $N = 3, 6, 9, \dots, 30$:
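The cancellation effect behind this exercise can be illustrated by a short Python sketch; the contrast between direct summation for negative x and the reciprocal of the series for |x| is an illustrative choice of variants, not the book's original list:

import math

def exp_taylor(x, N):
    """Truncated Taylor series (2.14), summed term by term."""
    term, s = 1.0, 1.0
    for k in range(1, N + 1):
        term *= x / k
        s += term
    return s

x = -5.5
for N in (3, 9, 15, 30):
    naive = exp_taylor(x, N)            # cancellation of large terms
    stable = 1.0 / exp_taylor(-x, N)    # all terms positive: no cancellation
    print(N, naive, stable, math.exp(x))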
with relative precision $\mathrm{eps} = 10^{-4}$. Assess the precision of the two approximate solutions
\[
\tilde x_1 = \begin{pmatrix} 0.999\\ -1.001 \end{pmatrix}
\quad\text{and}\quad
\tilde x_2 = \begin{pmatrix} 0.341\\ -0.087 \end{pmatrix}.
\]
Show that for the matrix
\[
A = \begin{pmatrix}
1 & & & & 1\\
-1 & 1 & & & 1\\
\vdots & \ddots & \ddots & & \vdots\\
-1 & \cdots & -1 & 1 & 1\\
-1 & \cdots & -1 & -1 & 1
\end{pmatrix}
\]
the maximal absolute value of a pivot in column pivoting is
\[
|a_{\max}| = 2^{n-1}.
\]
Exercise 2.21 According to R. D. Skeel [76] the componentwise backward error $\eta$ in Gaussian elimination with column pivoting for solving $Ax = b$ satisfies
\[
\eta \le \chi_n\,\sigma(A,x)\,\mathrm{eps},
\]
where $\chi_n$ is a constant depending only on $n$ (as a rule $\chi_n \approx n$), and
\[
\sigma(A,x) = \frac{\max_i (|A|\,|x|)_i}{\min_i (|A|\,|x|)_i}.
\]
Specify a row scaling $A \to DA$ of the matrix $A$ with a diagonal matrix $D = \mathrm{diag}(d_1,\dots,d_n)$, $d_i > 0$, such that
\[
\sigma(DA, x) = 1.
\]
Why is this an impractical stabilization method?
Exercise 2.22 Let
\[
A = [\text{$4\times 4$ matrix omitted; its entries involve } 1, -1, \varepsilon, 0].
\]
The solution of the linear system $Ax = b$ is $x = [1,\ \varepsilon^{-1},\ \varepsilon^{-1},\ 1]^T$.
(a) Show that this system is well-conditioned but badly scaled, by computing the condition number $\kappa_C(A) = \bigl\|\,|A^{-1}|\,|A|\,\bigr\|_\infty$ and the scaling quantity $\sigma(A,x)$ (see Exercise 2.21). What do you expect from Gaussian elimination when $\varepsilon$ is substituted by the relative machine precision eps?
(b) Solve the system by a Gaussian elimination program with column pivoting for $\varepsilon = \mathrm{eps}$. How big is the computed backward error $\hat\eta$?
(c) Check yourself that one single refinement step delivers a stable result.
3
Linear Least-Squares Problems
Remark 3.2 The relation between linear least-squares problems and probability theory is reflected in the equivalence of the minimization problem (3.1) with the maximization problem
\[
\exp(-\Delta^2) = \max.
\]
The exponential term characterizes here a probability distribution, the Gaussian normal distribution. The complete method is called the maximum likelihood method.
In (3.1) the errors of the individual measurements are equally weighted. However, the measurements $(t_i, b_i)$ are of different quality, just because the measuring apparatus works differently over different ranges, while the measurements are taken sometimes with more and sometimes with less care. To any individual measurement $b_i$ there pertains therefore in a natural way an absolute measuring precision, or tolerance, $\delta b_i$. These tolerances $\delta b_i$ can be included in the problem formulation (3.1) by weighting different errors with different tolerances, i.e.,
\[
\sum_{i=1}^{m} \Bigl(\frac{\Delta_i}{\delta b_i}\Bigr)^2 = \min.
\]
This form of minimization also has a reasonable statistical interpretation (somewhat similar to standard deviation). In some cases the linear least-squares problem is only uniquely solvable if the problem-specific measurement tolerances are explicitly included!
Within this chapter we consider only the special case when the model function $\varphi$ is linear in $x$, i.e., the framework of the linear least-squares problem: For given $b \in \mathbf{R}^m$ and $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$ with $m \ge n$, find an $x \in \mathbf{R}^n$ such that
\[
\|b - Ax\|_2 = \min.
\]
[Figure: orthogonal projection of $b$ onto the subspace $R(A)$.]

It is graphically clear that the distance $\|b - Ax\|$ is minimal exactly when the difference $b - Ax$ is perpendicular to the subspace $R(A)$. In other words: $Ax$ is the orthogonal projection of $b$ onto the subspace $R(A)$. As we want to come back to this result later, we formulate it in a somewhat more abstract form.
Remark 3.5 With this, the solution $u \in U$ of $\|v - u\| = \min$ is uniquely determined and is called the orthogonal projection of $v$ onto $U$. The mapping
\[
P : V \to U, \qquad v \mapsto Pv \quad\text{with}\quad \|v - Pv\| = \min_{u\in U} \|v - u\|,
\]
is linear and is called the orthogonal projection from $V$ onto $U$.
Remark 3.6 The theorem holds also when $U$ is replaced by an affine subspace $W = w_0 + U \subset V$, where $w_0 \in V$ and $U$ is a subspace of $V$ parallel to $W$. Then for all $v \in V$ and $w \in W$ it follows that
\[
\|v - w\| = \min_{w'\in W} \|v - w'\| \iff v - w \in U^\perp.
\]
and therefore the first statement. The second part follows from the fact that $A^TA$ is invertible if and only if $\mathrm{rank}(A) = n$. □

Remark 3.8 Geometrically, the normal equations mean precisely that the residual vector $b - Ax$ is orthogonal to $R(A) \subset \mathbf{R}^m$; hence the name.
3.1.3 Condition

We begin our condition analysis with the orthogonal projection $P : \mathbf{R}^m \to V$, $b \mapsto Pb$, onto a subspace $V$ of $\mathbf{R}^m$ (see Figure 3.3). Clearly the relative condition is governed by the angle $\vartheta$ between $b$ and $V$:
\[
\frac{\|Pb\|_2}{\|b\|_2} = \sqrt{1 - \sin^2\vartheta} = \cos\vartheta.
\]
Proof. (a) The solution $x$ is given through the normal equations by the linear mapping
\[
\varphi : \mathbf{R}^m \to \mathbf{R}^n, \qquad \varphi(b) = (A^TA)^{-1}A^Tb,
\]
so that
\[
\kappa_{\mathrm{rel}} = \frac{\|(A^TA)^{-1}A^T\|_2\,\|b\|_2}{\|x\|_2}.
\]
It is easily seen that for a full-column-rank matrix $A$ the condition number $\kappa_2(A)$ is precisely
\[
\kappa_2(A) = \|A\|_2\,\|(A^TA)^{-1}A^T\|_2,
\]
so that
\[
\kappa_{\mathrm{rel}} = \kappa_2(A)\,\frac{\|b\|_2}{\|A\|_2\,\|x\|_2}.
\]
The numerical treatment described above requires in the first step the computation of $A^TA$, i.e., of numerous scalar products of columns of $A$. The numerical intuition developed in the second chapter (see Example 2.33) makes this appear dubious. In each additional step, further errors may arise that will propagate to the final solution. Therefore, in most cases it is better to look for an efficient "direct" method operating only on $A$ itself. A further point of criticism is the fact that the errors in $A^Tb$ are amplified in the solution of the linear system $A^TAx = A^Tb$ by a factor close to the condition number (see Lemma 3.10)
\[
\kappa_2(A^TA) = \kappa_2(A)^2.
\]
For large residuals this agrees with the condition number of the linear least-squares problem (3.2). However, for small residuals the condition number of the latter problem is described instead by $\kappa_2(A)$, so that passing to the normal equations means a considerable worsening of the condition. In addition, the matrices usually arising in linear least-squares problems are often already badly conditioned, so that any further worsening of the condition by passing to $A^TA$ is not acceptable. Hence the solution of linear least-squares problems via the normal equations with Cholesky factorization can be recommended only for problems with large residuals.
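The squaring of the condition number is easy to observe numerically; a short sketch (the Vandermonde test matrix is an arbitrary, mildly ill-conditioned choice):

import numpy as np

m, n = 20, 5
A = np.vander(np.linspace(0.0, 1.0, m), n)   # mildly ill-conditioned
print(np.linalg.cond(A))          # kappa_2(A)
print(np.linalg.cond(A.T @ A))    # approximately kappa_2(A)^2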
[Figure: reflection of a vector $a$ about a hyperplane with normal $v$, $a \mapsto a - 2\frac{(a,v)}{(v,v)}v$, and rotation of $a$ onto $\alpha e_1$.]
\[
a \longmapsto \alpha e_1 = Qa \quad\text{with}\quad
Q := \begin{pmatrix} \cos\theta & \sin\theta\\ -\sin\theta & \cos\theta \end{pmatrix}.
\]
\[
(\Omega_{kl}x)_i = \begin{cases}
c\,x_k + s\,x_l & \text{if } i = k\\
-s\,x_k + c\,x_l & \text{if } i = l\\
x_i & \text{if } i \neq k,l.
\end{cases} \tag{3.3}
\]
If we premultiply a matrix $A$ by such a rotation, then according to (3.3) only rows $k$ and $l$ of the matrix $A$ are changed. This is especially important when the sparsity structure is to be preserved as much as possible by the transformation.
Now how can we determine the coefficients $c$ and $s$ in order to eliminate a component $x_l$ of the vector $x$? As $\Omega_{kl}$ operates only on the $(k,l)$-plane, it is sufficient to clarify the principle in the case $m = 2$. From $x_k^2 + x_l^2 \neq 0$ and $s^2 + c^2 = 1$ it follows that
\[
c = \frac{x_k}{\sqrt{x_k^2 + x_l^2}}, \qquad s = \frac{x_l}{\sqrt{x_k^2 + x_l^2}}.
\]
Actually, $c$ and $s$ are more conveniently computed via the formulas (where $\tau$ stands for $\tan\theta$ and $\cot\theta$, respectively)
\[
\tau := x_k/x_l, \quad s := 1/\sqrt{1+\tau^2}, \quad c := s\tau
\]
if $|x_l| > |x_k|$, and analogously with the roles of $x_k$ and $x_l$ exchanged otherwise.
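A Python sketch of this computation of c and s (the guard for x_l = 0 is an added assumption):

import math

def givens(xk, xl):
    """Compute c, s of a Givens rotation annihilating xl against xk,
    using the tangent/cotangent trick to avoid overflow in xk^2 + xl^2."""
    if xl == 0.0:
        return 1.0, 0.0
    if abs(xl) > abs(xk):
        tau = xk / xl                    # cot(theta)
        s = 1.0 / math.sqrt(1.0 + tau * tau)
        return s * tau, s
    tau = xl / xk                        # tan(theta)
    c = 1.0 / math.sqrt(1.0 + tau * tau)
    return c, c * tau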
[Schematic: QR-factorization of a $5\times 4$ matrix $A$ by Givens rotations. The rotations $(2,1),\dots,(5,4)$ annihilate the subdiagonal elements column by column, so that the $*$ pattern shrinks step by step until only the upper triangle remains.]
After carefully counting the operations we obtain the cost of the QR-factorization of a full matrix $A \in \mathrm{Mat}_{m,n}$:
(a) $\sim n^2/2$ square roots and $\sim \frac{4}{3}n^3$ multiplications, if $m \approx n$,
(b) $\sim mn$ square roots and $\sim 2mn^2$ multiplications, if $m \gg n$.
For $m = n$ we obtain an alternative to the Gaussian triangular factorization of Section 1.2. The better stability is bought at a considerably higher cost of $\sim \frac{4}{3}n^3$ multiplications versus $\sim \frac{1}{3}n^3$ for the Gaussian elimination.
However, one should observe that for sparse matrices the comparison turns out to be essentially more favorable. Thus only $n-1$ Givens rotations are needed to triangularize a matrix $A$ of upper Hessenberg form
\[
A = \begin{pmatrix}
* & \cdots & \cdots & *\\
* & \ddots & & \vdots\\
 & \ddots & \ddots & \vdots\\
0 & & * & *
\end{pmatrix},
\]
i.e., of almost upper triangular shape with nonzero components only in the first subdiagonal. With Gaussian elimination the pivot search may double the subdiagonal band.
Remark 3.14 If $A$ is stored with a row scaling $D_A$, then the Givens rotations can be realized (similarly to the rational Cholesky factorization) without evaluating square roots. In 1973 W. M. Gentleman [37] and S. Hammarling [49] developed such a variant, the fast Givens or rational Givens rotations. This type of factorization is invariant with respect to column scaling, i.e.,
\[
A = QR \implies AD = Q(RD) \quad\text{for a diagonal matrix } D.
\]
where
\[
Q_1 = I - 2\,\frac{v_1v_1^T}{v_1^Tv_1}
\quad\text{with}\quad v_1 := A^1 - \alpha_1e_1
\quad\text{and}\quad \alpha_1 := -\mathrm{sgn}(a_{11})\,\|A^1\|_2.
\]
After the $k$th step the matrix is brought to upper triangular form except for a remainder matrix $T^{(k+1)} \in \mathrm{Mat}_{m-k,n-k}(\mathbf{R})$:
\[
A^{(k+1)} = \begin{pmatrix} \ast & \ast\\ 0 & T^{(k+1)} \end{pmatrix},
\]
with the already final upper triangular part in the upper left block. Now let us build an orthogonal matrix
\[
Q_{k+1} = \begin{pmatrix} I & 0\\ 0 & \tilde Q_{k+1} \end{pmatrix},
\]
are stored in a separate vector, so that the Householder vectors $v_1,\dots,v_p$ find a place in the lower half of $A$ (see Figure 3.5). Another possibility is to normalize the Householder vectors in such a way that the first component $(v_i, e_i)$ is always 1 and therefore does not need to be stored.
[Figure 3.5: storage scheme for the QR-factorization: $R$ in the upper triangle, the Householder vectors $v_1, v_2, v_3, v_4$ in the lower part of $A$.]
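A minimal Python sketch of the Householder QR-factorization as described above (illustrative only: no column pivoting, and a nonzero pivot column is assumed in each step):

import numpy as np

def householder_qr(A):
    """Return the Householder vectors v_k and the triangular factor R."""
    A = A.astype(float).copy()
    m, n = A.shape
    V = []
    for k in range(min(m - 1, n)):
        a = A[k:, k]
        alpha = -np.sign(a[0]) * np.linalg.norm(a)   # sign choice avoids cancellation
        v = a.copy()
        v[0] -= alpha
        v /= np.linalg.norm(v)                       # assumes a nonzero column
        # apply Q_k = I - 2 v v^T to the remaining columns
        A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])
        V.append(v)
    return V, np.triu(A)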
(a) $\sim 2n^2m$ multiplications, if $m \gg n$,
for the matrix norm $\|A\|_0 := \max_j \|A^j\|_2$. If $p = \mathrm{rank}(A)$, then theoretically after $p$ steps the matrix has the form
\[
A^{(p+1)} = \begin{pmatrix} \ast & \ast\\ 0 & T^{(p+1)} \end{pmatrix},
\]
where the elements of the remainder matrix $T^{(p+1)} \in \mathrm{Mat}_{m-p,n-p}(\mathbf{R})$ are "very small." As the rank of the matrix is generally not known in advance, we have to decide during the algorithm when to neglect the remainder matrix.
In the course of the QR-factorization with column exchange the following criterion for the rank decision presents itself in a convenient way. If we define the numerical rank $p$ for a relative precision $\delta$ of the matrix $A$, we obtain the subcondition number
\[
\mathrm{sc}(A) := \frac{|r_{11}|}{|r_{nn}|}
\]
of P. Deuflhard and W. Sautter (1979) [28]. Analogously to the properties of the condition number $\kappa(A)$, we have
(a) $\mathrm{sc}(A) \ge 1$,
(b) $\mathrm{sc}(\alpha A) = \mathrm{sc}(A)$,
(c) $A \neq 0$ singular $\iff \mathrm{sc}(A) = \infty$,
[Figure 3.6: the affine solution set $L(b) = \bar x + N(A)$ and the smallest solution $x \in N(A)^\perp$.]
origin $0 \in \mathbf{R}^n$ onto the affine subspace $L(b)$ (see Figure 3.6). If $\bar x \in L(b)$ is an arbitrary solution of $\|b - Ax\| = \min$, then we obtain all the solutions by translating the nullspace $N(A)$ of $A$ by $\bar x$, i.e.,
\[
L(b) = \bar x + N(A).
\]
Here the smallest solution $x$ must be perpendicular to the nullspace $N(A)$; in other words: $x$ is the uniquely determined vector $x \in N(A)^\perp$ with $\|b - Ax\| = \min$.
Definition 3.15 The pseudo-inverse of a matrix $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$ is the matrix $A^+ \in \mathrm{Mat}_{n,m}(\mathbf{R})$ such that for all $b \in \mathbf{R}^m$ the vector $x = A^+b$ is the smallest solution of $\|b - Ax\| = \min$, i.e.,
\[
A^+b \in N(A)^\perp \quad\text{and}\quad \|b - AA^+b\| = \min.
\]
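For illustration, numpy's pinv realizes exactly this smallest solution (the rank-deficient example data are an arbitrary choice):

import numpy as np

A = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]])   # rank-deficient, m > n
b = np.array([1.0, 1.0, 1.0])

x = np.linalg.pinv(A) @ b   # smallest solution of ||b - Ax|| = min
print(x)                    # [1. 0.]: the component in N(A) is zero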
The situation can be most clearly represented by the following commutative diagram (where $i$ denotes each time the inclusion operator):

[Diagram: $R(A^+) = N(A)^\perp$.]

We can easily read off that the projection $\bar P$ is precisely $AA^+$, while $P = A^+A$ describes the projection from $\mathbf{R}^n$ onto the orthogonal complement $N(A)^\perp$ of the nullspace. Furthermore, because of the projection property, we obviously have $A^+AA^+ = A^+$ and $AA^+A = A$. As seen in the following theorem, the pseudo-inverse is uniquely determined by these two properties and the symmetry of the orthogonal projections $P = A^+A$ and $\bar P = AA^+$.
Theorem 3.16 The pseudo-inverse $A^+ \in \mathrm{Mat}_{n,m}(\mathbf{R})$ of a matrix $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$ is uniquely characterized by the following properties:
(i) $(A^+A)^T = A^+A$,
(ii) $(AA^+)^T = AA^+$,
(iii) $A^+AA^+ = A^+$,
(iv) $AA^+A = A$.
The properties (i) through (iv) are also called the Penrose axioms.
Proof. We have already seen that $A^+$ satisfies properties (i) through (iv), because $A^+A$ and $AA^+$ are orthogonal projections onto $N(A)^\perp = R(A^+)$ and $R(A)$, respectively. Conversely, (i) through (iv) imply that $P := A^+A$ and $\bar P := AA^+$ are orthogonal projections, because $P^T = P = P^2$ and $\bar P^T = \bar P = \bar P^2$. Analogously, from (iii) and $P = A^+A$ it follows that
Remark 3.17 If only part of the Penrose axioms hold, then we speak of
generalized inverses. A detailed investigation is found, e.g., in the book of
M. Z. Nashed [63].
(3.4)
Lemma 3.18 With the above notation, $x$ is a solution of $\|b - Ax\| = \min$ if and only if

Proof. According to Lemma 3.18 the solutions of $\|b - Ax\| = \min$ are characterized by $x_1 = u - Vx_2$. By inserting into $\|x\|$ we obtain
\[
\|x\|^2 = \|x_1\|^2 + \|x_2\|^2 = \|u - Vx_2\|^2 + \|x_2\|^2
= \|u\|^2 - 2(u, Vx_2) + (Vx_2, Vx_2) + (x_2, x_2)
= \|u\|^2 + \bigl(x_2, (I + V^TV)x_2 - 2V^Tu\bigr) =: \psi(x_2).
\]
Here
\[
\psi'(x_2) = -2V^Tu + 2(I + V^TV)x_2 \quad\text{and}\quad \psi''(x_2) = 2(I + V^TV).
\]
Because $I + V^TV$ is a symmetric positive definite matrix, $\psi(x_2)$ attains its minimum at the $x_2$ with $\psi'(x_2) = 0$, i.e., $(I + V^TV)x_2 = V^Tu$. This was exactly our claim. □
Exercises

Exercise 3.1 A Givens rotation
\[
Q = \begin{pmatrix} c & s\\ -s & c \end{pmatrix}
\]
can be stored, up to a sign, as a single number $\rho$ (naturally, the best storage location would be the place of the eliminated matrix entry):
\[
\rho := \begin{cases}
1 & \text{if } c = 0\\
\mathrm{sgn}(c)\,s/2 & \text{if } |s| < |c|\\
2\,\mathrm{sgn}(s)/c & \text{if } |s| \ge |c| \neq 0.
\end{cases}
\]
Give formulas which reconstruct, up to a sign, the Givens rotation $\pm Q$ from $\rho$. Why is this representation meaningful although the sign is lost? Is this representation stable?
Exercise 3.2 Let the matrix $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$, $m \ge n$, have full rank. Suppose that the solution $x$ of a linear least-squares problem $\|b - Ax\|_2 = \min$, computed in a stable way (by using a QR-factorization), is not accurate enough. According to Å. Björck [7] the solution can be improved by residual correction. This is implemented on the linear system
(4) $u := R^{-1}b_1 \in \mathbf{R}^n$,
\[
K_i = A\,\exp\Bigl(-\frac{E}{R\,T_i}\Bigr),
\]
one determines in the sense of least squares both the pre-exponential factor $A$ and the activation energy $E$, where the general gas constant $R$ is given in advance. Formulate the above nonlinear problem as a linear least-squares problem. What simplifications are obtained for the following two special cases:
(a) $\delta K_i = c\,K_i$ (constant relative error)?
(b) $\delta K_i = \mathrm{const}$ (constant absolute error)?
i     T_i      K_i
1     728.79   $7.4960\cdot 10^{-6}$
2     728.61   $1.0062\cdot 10^{-5}$
3     728.77   $9.0220\cdot 10^{-6}$
4     728.84   $1.4217\cdot 10^{-5}$
5     750.36   $3.6608\cdot 10^{-5}$
6     750.31   $3.0642\cdot 10^{-5}$
7     750.66   $3.4588\cdot 10^{-5}$
8     750.79   $2.8875\cdot 10^{-5}$
9     766.34   $6.2065\cdot 10^{-5}$
10    766.53   $7.1908\cdot 10^{-5}$
11    766.88   $7.6056\cdot 10^{-5}$
12    764.88   $6.7110\cdot 10^{-5}$
13    790.95   $3.1927\cdot 10^{-4}$
14    790.23   $2.5538\cdot 10^{-4}$
15    790.02   $2.7563\cdot 10^{-4}$
16    790.02   $2.5474\cdot 10^{-4}$
17    809.95   $1.0599\cdot 10^{-3}$
18    810.36   $8.4354\cdot 10^{-4}$
19    810.13   $8.9309\cdot 10^{-4}$
20    810.36   $9.4770\cdot 10^{-4}$
21    809.67   $8.3409\cdot 10^{-4}$
You can save the tedious typing of the above data by just looking into
the following web site:
http://www.zib.de/SciSoft/Codelib/arrhenius/
4
Nonlinear Systems and Least-Squares
Problems
So far we have been almost entirely concerned with linear problems. In this
chapter we shall direct our attention to the solution of nonlinear problems.
For this we should have a very clear picture of what is meant by a "solution"
of an equation. Probably everybody knows from high school the quadratic
equation
\[
f(x) := x^2 - 2px + q = 0
\]
and its analytic, closed form solution
\[
x_{1,2} = p \pm \sqrt{p^2 - q}.
\]
For a stable evaluation of this expression see Example 2.5. In fact, however, this solution only transfers the problem of solving the quadratic equation to the problem of computing a square root, i.e., the solution of a simpler quadratic equation of the form
\[
f(x) := x^2 - c = 0 \quad\text{with}\quad c = |p^2 - q|.
\]
The question of how to determine this solution, i.e., how to solve such a
problem numerically, still remains open.
[Figure 4.1: Graphical solution of $2x - \tan x = 0$; the graphs of $y = 2x$ and $y = \tan x$ intersect at $x^*$ between 1 and $\pi/2$.]
value for a fixed-point iteration. Equation (4.1) can be easily transformed into a fixed-point equation, for instance
\[
x = \tfrac{1}{2}\tan x =: \phi_1(x) \quad\text{or}\quad x = \arctan(2x) =: \phi_2(x).
\]
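A small Python sketch contrasting the two formulations (the starting value is arbitrary; $\phi_1$ in fact diverges near the positive solution since $|\phi_1'(x^*)| > 1$, while $|\phi_2'(x^*)| < 1$):

import math

def fixed_point(phi, x0, steps):
    x = x0
    for _ in range(steps):
        x = phi(x)
    return x

phi1 = lambda x: 0.5 * math.tan(x)     # diverges near x*: |phi1'(x*)| > 1
phi2 = lambda x: math.atan(2.0 * x)    # converges: |phi2'(x*)| < 1

print(fixed_point(phi2, 1.2, 50))      # -> x* = 1.16556...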
Proof. This is a simple application of the mean value theorem in $\mathbf{R}$: For all $x, y \in I$, $x < y$, there exists a $\xi \in [x,y]$ such that
\[
\phi(x) - \phi(y) = \phi'(\xi)(x - y).
\]
\[
|x_{k+l} - x_k| \le \frac{\theta^k}{1-\theta}\,|x_1 - x_0|,
\]
where we have used the triangle inequality and the formula for the sum of the geometric series $\sum_{k=0}^{\infty}\theta^k = 1/(1-\theta)$. Thus $\{x_k\}$ is a Cauchy sequence in the complete metric space of real numbers, and therefore it converges to a limit point
\[
x^* := \lim_{k\to\infty} x_k.
\]
With this we have proved the second part of the theorem and the existence of a fixed point. If $x^*, y^*$ are two fixed points, then
\[
0 \le |x^* - y^*| = |\phi(x^*) - \phi(y^*)| \le \theta\,|x^* - y^*|.
\]
Because $\theta < 1$, this is possible only if $|x^* - y^*| = 0$. This proves the uniqueness of the fixed point of $\phi$. □
Remark 4.5 Theorem 4.4 is a special case of the Banach fixed-point theorem. The only properties used in the proof are the triangle inequality for the absolute value and the completeness of $\mathbf{R}$. Therefore the proof remains valid in the much more general situation where $\mathbf{R}$ is replaced by a Banach space $X$, e.g., a function space, and the absolute value by the corresponding norm. Such theorems play a role not only in the theory but also in the numerics of differential and integral equations. In this introductory textbook we shall use only the extension to $X = \mathbf{R}^n$ with a norm $\|\cdot\|$ instead of the absolute value $|\cdot|$.
Remark 4.6 For the solution of scalar nonlinear equations in the case
when only a program for evaluating f(x) and an interval enclosing the
solution are available, the algorithm of R. P. Brent [10] has established
itself as a standard code. It is based on a mixture of rather elementary
techniques, such as bisection and inverse quadratic interpolation, which
will not be further elaborated here. For a detailed description we refer
the reader to [10]. If additional information regarding f, like convexity or
differentiability, is available, then methods with faster convergence can be
constructed, on which we will focus our attention in the following.
In order to assess the speed of convergence of a fixed-point iteration we define the notion of the order of convergence of a sequence $\{x_k\}$.

Definition 4.7 A sequence $\{x_k\}$, $x_k \in \mathbf{R}^n$, converges to $x^*$ with order (at least) $p \ge 1$ if there is a constant $C \ge 0$ such that
\[
\|x_{k+1} - x^*\| \le C\,\|x_k - x^*\|^p,
\]
where in the case $p = 1$ we also require that $C < 1$. We use the term linear convergence in the case $p = 1$, and quadratic convergence for $p = 2$. Furthermore, we say that $\{x_k\}$ converges superlinearly if there exists a null sequence $\varepsilon_k \ge 0$, $\lim_{k\to\infty}\varepsilon_k = 0$, such that
\[
\|x_{k+1} - x^*\| \le \varepsilon_k\,\|x_k - x^*\|.
\]
[Figure: one step of Newton's method; the tangent to $f$ at $(x_0, f(x_0))$ intersects the $x$-axis at $x_1$, closer to the root $x^*$.]
where
\[
\phi(x) = x - \frac{f(x)}{f'(x)} = x - \frac{x^2 - a}{2x}
= x - \frac{x}{2} + \frac{a}{2x} = \frac{1}{2}\Bigl(x + \frac{a}{x}\Bigr).
\]
k    x_k
0    1.0000000000
1    0.9050000000
2    0.9000138122
3    0.9000000001
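A minimal sketch of this iteration (the tolerance and starting value are arbitrary choices); note how the number of correct digits roughly doubles per step in the table above:

def sqrt_newton(a, x=1.0, tol=1e-12):
    """Newton's method for f(x) = x^2 - a, i.e., x <- (x + a/x)/2."""
    while abs(x * x - a) > tol * a:
        x = 0.5 * (x + a / x)
    return x

print(sqrt_newton(0.81))   # 0.9, reproducing the table above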
Proof. First, we use the Lipschitz condition (4.4) to derive the following result for all $x, y \in D$. Here we use the Lagrange form of the integral mean value theorem:
\[
\|F(y) - F(x) - F'(x)(y - x)\| \le \int_{s=0}^{1} s\,\omega\,\|y - x\|^2\,ds = \frac{\omega}{2}\,\|y - x\|^2,
\]
which proves (4.5). After this preparation we can turn our attention to the question of convergence of the Newton iteration. By using the iterative scheme (4.3) as well as the relation $F(x^*) = 0$ we get
\[
x^{k+1} - x^* = x^k - F'(x^k)^{-1}F(x^k) - x^*
= x^k - x^* - F'(x^k)^{-1}\bigl(F(x^k) - F(x^*)\bigr)
= F'(x^k)^{-1}\bigl(F(x^*) - F(x^k) - F'(x^k)(x^* - x^k)\bigr).
\]
With the help of (4.5) this leads to the following estimate of the speed of
convergence:
Since $\|x^0 - x^*\| = \rho$, we have $\|x^k - x^*\| < \rho$ for all $k > 0$, and the sequence $\{x^k\}$ converges toward $x^*$. In order to prove uniqueness in the ball $B_{2/\omega}(x^*)$ centered at $x^*$ with radius $2/\omega$, we employ again inequality (4.5). Let $x^{**} \in B_{2/\omega}(x^*)$ be another solution, so that $F(x^{**}) = 0$ and $\|x^* - x^{**}\| < 2/\omega$. By substituting in (4.5) we obtain
This standard monotonicity test, however, is not affine invariant. Multiplication of $F$ by any invertible matrix $A$ may arbitrarily change the result of the test (4.6). With the idea of transforming the inequality (4.6) into a condition that is both affine invariant and easily executable, P. Deuflhard suggested in 1972 the natural monotonicity test
For an extensive presentation of this subject see, e.g., the book [20]. On the right-hand side we recognize the Newton correction $\Delta x^k$, which has to be computed anyway. On the left-hand side we detect the simplified Newton correction $\overline{\Delta x}^{k+1}$ as the solution of the linear system
\[
F'(x^k)\,\overline{\Delta x}^{k+1} = -F(x^{k+1}).
\]
With this notation the natural monotonicity test (4.7) can be written as
\[
\|\overline{\Delta x}^{k+1}\| \le \bar\Theta\,\|\Delta x^k\|, \qquad \bar\Theta < 1. \tag{4.8}
\]
For the simplified Newton correction we obviously have to solve another system of linear equations with the same matrix $F'(x^k)$, but with a different right-hand side $F(x^{k+1})$, evaluated at the next iterate
\[
x^{k+1} = x^k + \Delta x^k.
\]
This can be done with little additional effort: If we apply an elimination method (requiring $O(n^3)$ operations for a full matrix), we only have to carry out the forward and backward substitutions (which means $O(n^2)$ additional operations).
The theoretical analysis in [20] also shows that, within the local convergence domain of the ordinary Newton method, ... If $\lambda_k < \lambda_{\min}$, the iteration is terminated. In critical examples one will start with $\lambda_0 = \lambda_{\min}$, in harmless examples preferably with $\lambda_0 = 1$. Whenever $\lambda_k$ was successful, we increase the damping factor again, say $\lambda_{k+1} := \min(2\lambda_k, 1)$, as in the sketch below.
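A simplified Python sketch of a damped Newton iteration with a monotonicity test of this type; the halving strategy, the acceptance factor $1 - \lambda/2$, and all parameter defaults are illustrative simplifications of the strategies discussed in [20]:

import numpy as np

def damped_newton(F, dF, x, lam_min=1e-8, tol=1e-10, max_iter=50):
    """Damped Newton method with a natural monotonicity test:
    accept x + lam*dx only if ||dx_bar|| <= (1 - lam/2)*||dx||,
    where F'(x) dx_bar = -F(x + lam*dx) is the simplified correction."""
    for _ in range(max_iter):
        J = dF(x)
        dx = np.linalg.solve(J, -F(x))              # ordinary Newton correction
        if np.linalg.norm(dx) <= tol:
            return x
        lam = 1.0
        while lam >= lam_min:
            x_new = x + lam * dx
            dx_bar = np.linalg.solve(J, -F(x_new))  # simplified correction, same matrix
            if np.linalg.norm(dx_bar) <= (1.0 - lam / 2.0) * np.linalg.norm(dx):
                break                               # monotonicity test passed
            lam /= 2.0                              # reduce the damping factor
        else:
            raise RuntimeError("no acceptable damping factor found")
        x = x_new
    return x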
Upon recalling the notation of the pseudo-inverse from Section 3.3, we here obtain the formal representation

x^{k+1} = x^k − F′(x^k)⁺ F(x^k),   (4.11)

and therefore the equation (4.9) to be solved for the nonlinear least-squares problem is equivalent to

F′(x)⁺ F(x) = 0.

This characterization holds for the rank-deficient and the underdetermined case as well.
Similarly as in Newton's method for nonlinear systems, we could have derived the iterative scheme (4.11) directly from the original minimization problem by expanding in a Taylor series and truncating after the linear term. Therefore (4.11) is also called the Gauss-Newton method for the nonlinear least-squares problem ‖F(x)‖₂ = min. The convergence of the Gauss-Newton method is characterized by the following theorem (compare [26]), which is an immediate generalization of our Theorem 4.10 for Newton's method.
Theorem 4.14 Let D ⊂ Rⁿ be open and convex and F : D → R^m, m ≥ n, a continuously differentiable mapping whose Jacobian matrix F′(x) has full rank n for all x ∈ D. Suppose there is a solution x* ∈ D of the corresponding nonlinear least-squares problem ‖F(x)‖₂ = min. Furthermore let ω > 0 and 0 ≤ κ* < 1 be two constants such that

‖F′(x)⁺(F′(x + sv) − F′(x))v‖ ≤ sω‖v‖²   (4.12)

for all s ∈ [0, 1], x ∈ D and v ∈ Rⁿ with x + v ∈ D, and assume that

‖F′(x)⁺F(x*)‖ ≤ κ*‖x − x*‖   (4.13)

for all x ∈ D. If for a given starting point x⁰ ∈ D we have

ρ := ‖x⁰ − x*‖ < 2(1 − κ*)/ω =: σ,   (4.14)

then the sequence {x^k} defined by the Gauss-Newton method (4.11) stays in the open ball B_ρ(x*) and converges toward x*, i.e.,

‖x^k − x*‖ < ρ for k > 0 and lim_{k→∞} x^k = x*.
Proof. The proof follows directly the main steps of the proof of Theorem
4.10. From the Lipschitz condition (4.12) it follows immediately that
Remark 4.15 Note that in the above theorem the existence of a solution has been assumed. A variant of the above theorem additionally yields the proof of existence of a solution x*, wherein the full-rank assumption on the Jacobian (see again, e.g., [20]) can be relaxed: only one out of the four Penrose axioms, namely

F′(x)⁺ F′(x) F′(x)⁺ = F′(x)⁺,

is needed.
Uniqueness, however, requires, just as in the linear problem (see Section 3.3), a maximal rank assumption. Otherwise there exists a solution manifold of a dimension equal to the rank deficiency and the Gauss-Newton
method converges toward any point on this manifold. We will exploit this
property in Section 4.4.2 in the context of continuation methods.
Finally we want to discuss the condition κ* < 1 in more detail. As we approach the solution x*, the linear term κ*‖x^k − x*‖ dominates the speed of convergence estimate (4.15), at least for κ* > 0. In this case the Gauss-Newton method converges linearly with asymptotic convergence factor κ*, which enforces the condition κ* < 1. Obviously, the quantity κ* reflects the omission of the tensor F″(x) in the derivation of the Gauss-Newton method from Newton's method (4.10).
Another interpretation of κ* comes from examining the influence of the statistical measurement error δb on the solution. In case the Jacobian matrix F′(x*) has full rank, the perturbation of the parameters induced by δb is determined, in a linearized error analysis, by

δx* = −F′(x*)⁺ δb.
A quantity of this general type is given as an a posteriori error analysis by virtually all software packages that are in widespread use today in statistics. Obviously this condition does not reflect the possible effects of the nonlinearity of the model. A more accurate analysis of this problem has been carried out by H. G. Bock [8]: he actually showed that one should perform the substitution (4.16).
In the compatible case we have F(x*) = 0 and κ* = 0. In the "almost compatible" case 0 < κ* ≪ 1 the linearized error theory is certainly satisfactory. However, as shown in [8], in the case κ* ≥ 1 there are always statistical errors such that the solution "runs away unboundedly." Such models might be called statistically ill-posed or inadequate. Conversely, a nonlinear least-squares problem is called statistically well-posed or adequate whenever κ* < 1. In this wording, Theorem 4.14 can be stated in short form as: for adequate nonlinear least-squares problems the ordinary Gauss-Newton method converges locally, for compatible least-squares problems even quadratically.
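In practice the correction F′(x^k)⁺F(x^k) is computed by solving a linear least-squares problem rather than by forming the pseudo-inverse explicitly. A minimal sketch of the ordinary Gauss-Newton iteration (not from the book; F and dF are user-supplied placeholders):

import numpy as np

def gauss_newton(F, dF, x0, tol=1e-10, kmax=50):
    # Ordinary Gauss-Newton: x^{k+1} = x^k - F'(x^k)^+ F(x^k); each
    # correction solves min ||F'(x^k) dx + F(x^k)||_2 (here via lstsq).
    x = np.asarray(x0, dtype=float)
    for _ in range(kmax):
        dx = np.linalg.lstsq(dF(x), -F(x), rcond=None)[0]
        x = x + dx
        if np.linalg.norm(dx) < tol * (1.0 + np.linalg.norm(x)):
            break
    return x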
Intuitively it is clear that not every model and every set of measurements allow for the determination of a unique suitable parameter vector. But only unique solutions permit a clear interpretation in connection with the basic theoretical model. The Gauss-Newton method presented here checks the uniqueness of a solution by three criteria:

(a) checking the full-rank condition for the corresponding Jacobians in the sense of a numerical rank determination, which can be done, for example, as in Section 3.2 on the basis of a QR-factorization;

(b) checking the statistical well-posedness with the help of the condition κ* < 1 by estimating

κ̂* = ‖Δx^{k+1}‖ / ‖Δx^k‖

in the asymptotic phase of the Gauss-Newton iteration;

(c) analyzing the error behavior by (4.16).
One should be aware of the fact that all three criteria are influenced by the choice of the measurement tolerances δb (cf. Section 3.1) as well as by the scaling of the parameters x.
Remark 4.16 As in Newton's method, the convergence domain of the ordinary Gauss-Newton method can be enlarged by some damping strategy. If we denote by Δx^k the ordinary Gauss-Newton correction, then a damped Gauss-Newton iteration reads

x^{k+1} = x^k + λ_k Δx^k.
Again there exist rather efficient theoretically backed strategies, which are implemented in a series of modern least-squares software packages; see [20]. These programs also check automatically whether the least-squares problem under consideration is adequate. If this is not the case, which happens rather seldom, one should either improve the model or increase the precision of the measurements. Moreover, these programs ensure automatically that the iteration is performed only down to a relative precision that matches the given precision of the measurements.
Table 4.3. Measurement sequence (ti, bi), i = 1, ... ,30, for Feulgen hydrolysis.
t b t b t b t b t b
6 24.19 42 57.39 78 52.99 114 49.64 150 46.72
12 35.34 48 59.56 84 53.83 120 57.81 156 40.68
18 43.43 54 55.60 90 59.37 126 54.79 162 35.14
24 42.63 60 51.91 96 62.35 132 50.38 168 45.47
30 49.92 66 58.27 102 61.84 138 43.85 174 42.40
36 51.53 72 62.99 108 61.62 144 45.16 180 55.21
The property sinh(x₃t) = x₃t + o(|t|) for small arguments is surely established in every standard routine for calculating sinh. Therefore only the evaluation of φ for x₃ = 0 must be especially handled by the program. As starting point for the iteration we choose x⁰ = (80, 0.055, 0.21). The iteration history of φ(x^k; t) over the interval t ∈ [0, 180] is represented in Figure 4.3. We come out with TOL = 0.142·10⁻³ as the "statistically reasonable" relative precision. At the solution x* we obtain the estimate κ̂* = 0.156 and the residual norm ‖F(x*)‖₂ ≈ 3·10². Therefore, despite a "large" residual, the problem is "almost compatible" in the sense of our above theoretical characterization.
Figure 4.3. Measurements and iterated model function for Example 4.17.
Remark 4.18 Most software packages for nonlinear least-squares problems are still based on another globalization method with enlarged convergence domain, the Levenberg-Marquardt method. This method is based on the idea that the local linearization is trusted only within a ball of radius δ around the current iterate, which leads to the multiplier condition

p(‖Δz‖₂² − δ²) ≥ 0.
(Figure: solution curve of F(x, λ) = 0 in the (λ, x)-plane with the two turning points.)
Therefore in our example there are exactly two turning points, namely, at x₁ = 1/√3 and x₂ = −1/√3. At these points the tangent to the solution curve is perpendicular to the λ-axis.
As a last property we want to note the symmetry of the equation,

F(x, λ) = F(−x, −λ).

This is reflected in the point symmetry of the solution set: if (x, λ) is a solution of F(x, λ) = 0, then so is (−x, −λ).
Unfortunately, we cannot go into all the phenomena observed in Example 4.19 within the frame of this introduction. We assume in what follows that the Jacobian F′(x, λ) has maximal rank at each solution point (x, λ) ∈ D × [a, b], i.e.,

rank F′(x, λ) = n whenever F(x, λ) = 0.   (4.20)

Thus we exclude bifurcation points, because according to the implicit function theorem, under the assumption (4.20) the solution set S := {(x, λ) | F(x, λ) = 0} can be represented locally around a solution (x₀, λ₀) as the image of a differentiable curve, i.e., there is a neighborhood U ⊂ D × [a, b] of (x₀, λ₀) and continuously differentiable mappings

x : ]−ε, ε[ → D and λ : ]−ε, ε[ → [a, b]

such that (x(0), λ(0)) = (x₀, λ₀) and the solutions of F(x, λ) = 0 in U are given exactly by

S ∩ U = {(x(s), λ(s)) | s ∈ ]−ε, ε[}.
In many applications an even more special case is also interesting, where the partial derivative F_x(x, λ) ∈ Mat_n(R) with respect to x is invertible at each solution point (x, λ). As starting point we take the old solution and set x̄ := x₀. This choice, originally suggested by Poincaré in his book on celestial mechanics [66], is today called classical continuation.
A geometric view suggests yet another choice: instead of moving parallel to the λ-axis we can move along the tangent (x′(0), 1) to the solution curve.
(Figure: tangent continuation; predictor x̄ = x₀ + s x′(0) and corrected solution (x₁, λ₁) on the curve.)
Thus each step of a continuation method contains two substeps: first the choice of a starting point (x̄, λ₁) as close as possible to the curve, and second the iteration from the starting point x̄ back to a solution (x₁, λ₁) on the curve, where Newton's method appears to be the most appropriate because of its quadratic convergence. The first substep is frequently called the predictor, the second substep the corrector, and the whole process a predictor-corrector method. If we denote by s := λ₁ − λ₀ the step size, then the dependence of the starting point x̄ on s for the two possibilities encountered so far can be expressed as

x̄(s) = x₀

for the classical continuation and

x̄(s) = x₀ + s x′(0)
for tangent continuation. The most difficult problem in the construction of a continuation algorithm consists of an appropriate choice of the step length s in conjunction with the predictor-corrector strategy. The optimist, who chooses too large a step size s, must constantly reduce the step length and therefore ends up with too many unsuccessful steps. The pessimist, on the other hand, chooses the step size too small and ends up with too many successful steps. Both characters waste computing time. In order to minimize cost, we therefore want to choose the step length as large as possible while still ensuring the convergence of Newton's method.
Remark 4.20 In practice one should take care of a third criterion, namely, not to leave the present solution curve and "jump" onto another solution curve without noticing it (see Figure 4.7). The problem of "jumping over" becomes important especially when considering bifurcations of solutions.
Naturally, the maximal feasible step size s_max for which Newton's method with starting point x⁰ := x̄(s) and fixed parameter λ = λ₀ + s converges depends on the quality of the predictor step: the better the curve is predicted, the larger the step size. For example, the point x̄(s) given by the tangent method appears graphically to be closer to the curve than the point given by the classical method. In order to describe this deviation of the predictor from the solution curve more precisely, we introduce the order of a continuation method (see [18]).
Definition 4.21 Let x and x̄ be two curves

x, x̄ : [−ε, ε] → Rⁿ

in Rⁿ. We say that the curve x̄(s) represents a continuation method of order p ∈ N at s = 0, if

‖x(s) − x̄(s)‖ = O(|s|^p).
Proof. According to the Lagrange form of the mean value theorem it follows that

‖x(s) − x(0)‖ = ‖s ∫₀¹ x′(τs) dτ‖ ≤ s max_{t∈[−ε,ε]} ‖x′(t)‖

and therefore the assertion follows. □
The following theorem connects a continuation method of order p as predictor with Newton's method as corrector. It characterizes the maximal feasible step size s_max for which Newton's method applied to x⁰ := x̄(s) with fixed parameter λ₀ + s converges.

Theorem 4.24 Let D ⊂ Rⁿ be open and convex, and let F : D × [a, b] → Rⁿ be a continuously differentiable parametrized system such that F_x(x, λ) is invertible for all (x, λ) ∈ D × [a, b]. Furthermore, let ω > 0 be given such that F satisfies the Lipschitz condition
Proof. We must check the hypotheses of Theorem 4.10 for Newton's method (4.22) and the starting point x⁰ = x̄(s). According to the condition (4.23)
Here Δx^k and Δx̄^{k+1} are the ordinary and the simplified Newton corrections of Newton's method (4.22), i.e., with x⁰ := x̄(s) and λ := λ₀ + s:

F_x(x^k, λ) Δx^k = −F(x^k, λ) and F_x(x^k, λ) Δx̄^{k+1} = −F(x^{k+1}, λ).
If we establish with the help of the criterion (4.25) that Newton's method does not converge for the step size s, then we reduce this step size by a factor β < 1 and perform the Newton iteration again with the new step size

s′ := β · s,

i.e., with the new starting point x⁰ = x̄(s′) and the new parameter λ := λ₀ + s′. This process is repeated until either the convergence criterion (4.25) for Newton's method is satisfied or we get below a minimal step size s_min. In the latter case we suspect that the assumptions on F are violated and that we might be in a close neighborhood of a turning point or a bifurcation point. On the other hand, we can choose a larger step size for the next step,
if Newton's method converges "too fast." This can also be seen from the two Newton corrections. If

(4.26)

then the method converges "too fast," and we can enlarge the step size for the next predictor step by a factor β, i.e., we suggest the step size

s′ := s/β.

Here the choice

β := 1/√2,

motivated by (4.24), is consistent with (4.25) and (4.26). The following
algorithm describes the tangent continuation from a solution (x₀, a) up to the right endpoint λ = b of the parameter interval.
Algorithm 4.25 Tangent Continuation. The procedure newton(x̄, λ) contains the (ordinary) Newton method (4.22) for the starting point x⁰ = x̄ and fixed value of the parameter λ. The Boolean variable done specifies whether the procedure has computed the solution accurately enough after at most k_max steps. Besides this information and (if necessary) the solution x, the program returns the quotient

Θ = ‖Δx̄¹‖ / ‖Δx⁰‖

of the norms of the simplified and ordinary Newton corrections. The procedure continuation realizes the continuation method with the step-size control described above. Beginning with a starting point x̄ for the solution of F(x, a) = 0 at the left endpoint λ = a of the parameter interval, the program tries to follow the solution curve up to the right endpoint λ = b. The program terminates if this is achieved, or if the step size s becomes too small, or if the maximal number i_max of computed solutions is exceeded.
function [done, x, Θ] = newton(x̄, λ)
  x := x̄;
  for k = 0 to k_max do
    A := F_x(x, λ);
    solve A Δx = −F(x, λ);
    x := x + Δx;
    solve A Δx̄ = −F(x, λ);  (use again the factorization of A)
    if k = 0 then
      Θ := ‖Δx̄‖/‖Δx‖;  (for the next predicted step size)
    end
    if ‖Δx̄‖ < tol then
      done := true;
      break;  (solution found)
    end
    if ‖Δx̄‖ > ϑ‖Δx‖ then
      done := false;
      break;  (monotonicity violated)
    end
  end
  if k > k_max then
    done := false;  (too many iterations)
  end
function continuation(x̄)
  λ₀ := a;
  [done, x₀, Θ] = newton(x̄, λ₀);
  if not done then
    poor starting point x̄ for F(x, a) = 0
  else
    s := s₀;  (starting step size)
    for i = 0 to i_max do
      solve F_x(x_i, λ_i) x′ = −F_λ(x_i, λ_i);
      repeat
        x̄ := x_i + s x′;
        λ_{i+1} := λ_i + s;
        [done, x_{i+1}, Θ] = newton(x̄, λ_{i+1});
        if not done then
          s := β s;
        elseif Θ < ϑ/4 then
          s := s/β;
        end
        s := min(s, b − λ_{i+1});
      until s < s_min or done
      if not done then
        break;  (algorithm breaks down)
      elseif λ_{i+1} = b then
        break;  (terminated, solution x_{i+1})
      end
    end
  end
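A compact transcription of Algorithm 4.25 into Python might look as follows; this is only a sketch under the assumptions of the text (F, Fx, Flam are user-supplied placeholders, β = 1/√2 as motivated above, ϑ = 1/2 as an ad-hoc choice):

import numpy as np

def newton(F, Fx, xbar, lam, tol=1e-10, kmax=25, theta=0.5):
    # Ordinary Newton corrector; returns (done, x, Theta) as in the text.
    x = np.array(xbar, dtype=float)
    Theta = np.inf
    for k in range(kmax + 1):
        A = Fx(x, lam)
        dx = np.linalg.solve(A, -F(x, lam))
        x = x + dx
        dx_bar = np.linalg.solve(A, -F(x, lam))   # simplified correction
        if k == 0:
            Theta = np.linalg.norm(dx_bar) / np.linalg.norm(dx)
        if np.linalg.norm(dx_bar) < tol:
            return True, x, Theta                 # solution found
        if np.linalg.norm(dx_bar) > theta * np.linalg.norm(dx):
            return False, x, Theta                # monotonicity violated
    return False, x, Theta                        # too many iterations

def continuation(F, Fx, Flam, xbar, a, b, s0, smin=1e-8, imax=100,
                 beta=1.0/np.sqrt(2.0), theta=0.5):
    lam = a
    done, x, Theta = newton(F, Fx, xbar, lam)
    if not done:
        raise RuntimeError("poor starting point for F(x, a) = 0")
    s = s0
    for _ in range(imax):
        if lam >= b:
            break                                 # terminated at lam = b
        xprime = np.linalg.solve(Fx(x, lam), -Flam(x, lam))  # tangent
        while True:
            s = min(s, b - lam)
            done, xnew, Theta = newton(F, Fx, x + s * xprime, lam + s)
            if done:
                break
            s = beta * s                          # reduce step size
            if s < smin:
                raise RuntimeError("step size below smin")
        lam, x = lam + s, xnew
        if Theta < theta / 4:
            s = s / beta                          # "too fast": enlarge
    return x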
We assume again that the Jacobian F′(y) of this system has full rank for all y ∈ D. Then for each solution y₀ ∈ S := {y ∈ D | F(y) = 0} there is a neighborhood U ⊂ R^{n+1} and a differentiable curve y : ]−ε, ε[ → D characterizing the solution set around y₀, i.e.,

S ∩ U = {y(s) | s ∈ ]−ε, ε[}.

If we differentiate the equation F(y(s)) = 0 with respect to s at s = 0, it follows that

F′(y(0)) y′(0) = 0;   (4.27)

i.e., the tangent y′(0) to the solution curve spans exactly the nullspace of the Jacobian F′(y₀). Since F′(y₀) has maximal rank, the tangent is through (4.27) uniquely determined up to a scalar factor. Therefore we define for all y ∈ D the normalized tangent t(y) ∈ R^{n+1} by

F′(y) t(y) = 0 and ‖t(y)‖₂ = 1,

which is uniquely determined up to its orientation (i.e., up to a factor ±1). We choose the orientation of the tangent during the continuation process such that two successive tangents t₀ = t(y₀) and t₁ = t(y₁) form an acute angle, i.e.,

⟨t₀, t₁⟩ > 0.

This guarantees that we are not going backward on the solution curve. With it we can also define tangent continuation for turning points (see Figure 4.8) by

ȳ = ȳ(s) := y₀ + s t(y₀).
Beginning with the starting vector y⁰ = ȳ, we want to find y(s) on the curve "as fast as possible." The vague expression "as fast as possible" can be interpreted geometrically as "almost orthogonal" to the tangent at a nearby point y(s) on the curve. However, since the tangent t(y(s)) is at our disposal only after computing y(s), we substitute for t(y(s)) the best approximation available at the present time, t(y^k). According to the geometric interpretation of the pseudo-inverse (cf. Section 3.3) this leads to the iterative scheme

y^{k+1} = y^k − F′(y^k)⁺ F(y^k).   (4.28)

The iterative scheme (4.28) is obviously a Gauss-Newton method for the underdetermined system F(y) = 0. We mention without proof that if F′(y) has maximal rank, then this method is quadratically convergent in a neighborhood of the solution curve, the same as the ordinary Newton method. The proof can be found in [20].
Here the orthogonal projection onto the complement of the tangent direction t reads

P = I − t tᵀ/(tᵀt).

For the correction Δy it follows that

Δy = (I − t tᵀ/(tᵀt)) z = z − (⟨t, z⟩/⟨t, t⟩) t.
With this we have a simple computational scheme for the pseudo-inverse (with rank defect 1), provided we only have some solution z and a nullspace vector t at our disposal. The Gauss-Newton method given in (4.28) is also easily implementable in close interplay with tangent continuation.
For the step-size control we realize a strategy similar to the one described in Algorithm 4.25. If the iterative method does not converge, then we reduce the step length s by a factor β = 1/√2. If the iterative method converges "too fast," we enlarge the step size for the next predictor step by a factor β⁻¹. This empirical continuation method is comparatively effective even in rather complex problems.
Remark 4.27 For this tangent continuation method there is also a theoretically backed, more effective step-size control, a description of which can be found in [23]. Additionally, one may apply approximations of the exact Jacobian F′(y). Extremely effective programs for parametrized systems work on this basis (see Figures 4.9 and 4.10).
Remark 4.28 The description of the solutions of the parametrized system (4.19) is also called a parameter study. At the same time, parametrized systems are used for enlarging the convergence domain of a method for solving nonlinear systems. The idea is to work our way, step by step, from a previously solved problem

G(x) = 0

to the actual problem

F(x) = 0.

For this we construct a parametrized problem

H(x, λ) = 0, λ ∈ [0, 1],

that connects the two problems:

H(x, 0) = G(x) and H(x, 1) = F(x) for all x.

Such a mapping H is called an embedding of the problem F(x) = 0, or a homotopy. The simplest example is the standard embedding,

H(x, λ) := λF(x) + (1 − λ)G(x).

Problem-specific embeddings are certainly preferable (see Example 4.29). If we apply a continuation method to this parametrized problem H(x, λ) = 0, where we start with a known solution x₀ of G(x) = 0, then we obtain a homotopy method for solving F(x) = 0.
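With the standard embedding, a homotopy method is only a few lines on top of any continuation routine; a sketch, reusing the hypothetical continuation routine from the code sketch after Algorithm 4.25:

def homotopy_solve(F, dF, G, dG, x0, s0=0.1):
    # Standard embedding H(x, lam) = lam*F(x) + (1 - lam)*G(x); follow
    # the solution curve from the known solution x0 of G(x) = 0 at
    # lam = 0 up to a solution of F(x) = 0 at lam = 1.
    H    = lambda x, lam: lam * F(x) + (1.0 - lam) * G(x)
    Hx   = lambda x, lam: lam * dF(x) + (1.0 - lam) * dG(x)
    Hlam = lambda x, lam: F(x) - G(x)             # dH/dlam
    return continuation(H, Hx, Hlam, x0, a=0.0, b=1.0, s0=s0)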
with starting point x⁰ = (0, …, 0) at λ = 0 is clearly advantageous (see Figure 4.9, right). Note that there are no bifurcations in this example. The intersections of the solution curves appear only in the projection onto the coordinate plane (x₉, λ). The points on both solution branches mark the intermediate values selected automatically by the program: their number is a measure of the computing cost required to go from λ = 0 to λ = 1.
z = (x, y)ᵀ, f(z) := ( A − (E + 1)x + x²y , Ex − x²y )ᵀ,

0 = f(z_i) + (1/λ²) Σ_{(i,j)} D(z_j − z_i), i = 1, …, k.
Exercises
Exercise 4.1 Explain the different convergence behavior of the two fixed-
point iterations described in Section 4.1 for the solution of
f(x) = 2x − tan x = 0.
Analyze the speed of convergence of the second method.
Exercise 4.2 In order to determine a fixed point x* of a continuously differentiable mapping φ with |φ′(x)| ≠ 1, let us define the following iterative procedures for k = 0, 1, …:

(i) x_{k+1} := φ(x_k),
(b) For implementing the method one computes only x₀, x₁, x₂ and x̂₀, and then one starts the iteration with the improved starting point x̂₀ (Steffensen's method). Try this method on our trusted example

φ₁(x) := (tan x)/2 and φ₂(x) := arctan 2x

with starting point x₀ = 1.2.
Exercise 4.5 Compute the solution of the nonlinear least-squares problem
arising in Feulgen hydrolysis by the ordinary Gauss-Newton method (from
a software package or written by yourself) for the data from Table 4.3 and
the starting points given there.
Hint: In this special case the ordinary Gauss-Newton method converges
faster than the damped method (cf. Figure 4.3).
Let us define

L(θ) := { t e^{iθ} | t ∈ R }, θ ∈ [0, π[.
(a) Show that
F(x, λ) := (
  x₁ + x₄ − 3,
  2x₁ + x₂ + x₄ + x₇ + x₈ + x₉ + 2x₁₀ − λ,
  2x₂ + 2x₅ + x₆ + x₇ − 8,
  2x₃ + x₉ − 4λ,
  x₁x₅ − 0.193 x₂x₄,
  x₆² x₁ − 0.67444·10⁻⁵ x₂x₄ s,
  x₇² x₄ − 0.1189·10⁻⁴ x₁x₂ s,
  x₈ x₄ − 0.1799·10⁻⁴ x₁ s,
  (x₉x₄)² − 0.4644·10⁻⁷ x₁²x₃ s,
  x₁₀ x₄² − 0.3846·10⁻⁴ x₁² s
) = 0.
x₁ = p + √(p² − q), x₂ = p − √(p² − q), or alternatively x₂ = q/x₁.

Write down the results in a table and underline each time the correct figures.
Ax = λx,

xᵀAx + bᵀx = min,

a_ij ≥ 0, Σ_{j=1}^n a_ij = 1.
0 ≠ χ′(λ₀) = (∂/∂λ) χ_{A+tC}(λ)|_{t=0}.

According to the implicit function theorem there is a neighborhood of the origin ]−ε, ε[ ⊂ R and a continuously differentiable mapping

λ : ]−ε, ε[ → C, t ↦ λ(t),

such that λ(0) = λ₀ and λ(t) is a simple eigenvalue of A + tC. Using again the fact that λ₀ is simple, we deduce the existence of a continuously differentiable function

x : ]−ε, ε[ → Cⁿ, t ↦ x(t),

such that x(0) = x₀ and x(t) is an eigenvector of A + tC for the eigenvalue λ(t); x(t) can be explicitly computed with adjoint determinants, see Exercise 5.2. If we differentiate the equation

(A + tC) x(t) = λ(t) x(t)

with respect to t at t = 0, then it follows that

C x₀ + A x′(0) = λ₀ x′(0) + λ′(0) x₀.
If we multiply by y₀ (in the sense of the scalar product), then we obtain

⟨Cx₀, y₀⟩ + ⟨Ax′(0), y₀⟩ = ⟨λ₀x′(0), y₀⟩ + ⟨λ′(0)x₀, y₀⟩.

As ⟨λ′(0)x₀, y₀⟩ = λ′(0)⟨x₀, y₀⟩ and

⟨Ax′(0), y₀⟩ = ⟨x′(0), A*y₀⟩ = λ₀⟨x′(0), y₀⟩ = ⟨λ₀x′(0), y₀⟩,

it follows that

λ′(0) = ⟨Cx₀, y₀⟩ / ⟨x₀, y₀⟩.

Hence we have computed the derivative of λ in the direction of the matrix C. The continuous differentiability of the directional derivative implies the differentiability of λ with respect to A and

λ′(A) : Mat_n(C) → C, C ↦ ⟨Cx, y⟩/⟨x, y⟩.
A = ( 0  1 ; 0  0 ) and Ã = ( 0  1 ; δ  0 ),

with the eigenvalues λ₁ = λ₂ = 0 and λ̃_{1,2} = ±√δ. For the condition of the eigenvalue problem (A, λ₁) we have

κ_abs ≥ |λ̃₁ − λ₁| / ‖A − Ã‖₂ = √δ/δ = 1/√δ → ∞ for δ → 0.
Without going into depth we just want to state the following: for multiple eigenvalues, or already for nearby eigenvalue clusters, the computation of single eigenvectors is ill-conditioned, but not the computation of orthogonal bases of the corresponding eigenspace.
For the well-conditioned real symmetric eigenvalue problem one could first think of setting up the characteristic polynomial and subsequently determining its zeros. Unfortunately, the information on the eigenvalues "disappears" once the characteristic polynomial is treated in coefficient representation. According to Section 2.2 the reverse problem is also ill-conditioned.
Example 5.4 J. H. Wilkinson [87] has given the polynomial
P(λ) = (λ − 1) ⋯ (λ − 20) ∈ P₂₀
x^k = A^k x₀ = Σ_{i=1}^n α_i λ_i^k η_i = α₁ λ₁^k ( η₁ + Σ_{i=2}^n (α_i/α₁)(λ_i/λ₁)^k η_i ) =: α₁ λ₁^k z_k.

Because |λ_i| < |λ₁| for all i = 2, …, n, we have lim_{k→∞} z_k = η₁, and therefore the assertion. □
The direct power method has several disadvantages: on one hand, we obtain only the eigenvector corresponding to the eigenvalue λ₁ of A that is largest in absolute value; on the other hand, the speed of convergence depends on the quotient |λ₂/λ₁|. Hence, if the absolute values of the eigenvalues λ₁ and λ₂ are close, then the direct power method converges rather slowly.
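A minimal sketch of the direct power method (normalization in every step; the Rayleigh quotient serves as eigenvalue estimate):

import numpy as np

def power_method(A, x0, kmax=1000, tol=1e-12):
    # Direct power method: approximates the eigenvalue of A largest in
    # absolute value; the convergence rate is governed by |lambda_2/lambda_1|.
    x = x0 / np.linalg.norm(x0)
    lam = 0.0
    for _ in range(kmax):
        y = A @ x
        lam_new = x @ y                  # Rayleigh quotient
        x = y / np.linalg.norm(y)
        if abs(lam_new - lam) <= tol * abs(lam_new):
            lam = lam_new
            break
        lam = lam_new
    return lam, x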
If λ̄ is chosen close to a single eigenvalue λ_i, then

|λ_i − λ̄| / |λ_j − λ̄| ≪ 1 for all j ≠ i,

so that the method converges very rapidly in this case. Thus with an appropriate choice of λ̄ this method can be used with a nearly arbitrary starting vector x₀ in order to pick out individual eigenvalues and eigenvectors. For an improvement of this method see Exercise 5.3.
Remark 5.6 Note that the matrix A − λ̄I is almost singular for a "well-chosen" λ̄ ≈ λ_i. In the following this poses no numerical difficulties, because we want to find only the directions of the eigenvectors, whose calculation is well-conditioned (cf. Example 2.33).
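The corresponding inverse power method with shift, again only a sketch; one LU factorization of the nearly singular matrix A − λ̄I is computed once and reused:

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def inverse_iteration(A, shift, x0, kmax=50, tol=1e-12):
    # Inverse power method: x_i ~ (A - shift*I)^{-1} x_{i-1}, normalized.
    # The near-singularity of A - shift*I is harmless here, since only the
    # direction of the eigenvector is sought (cf. Remark 5.6).
    n = A.shape[0]
    lu = lu_factor(A - shift * np.eye(n))
    x = x0 / np.linalg.norm(x0)
    for _ in range(kmax):
        y = lu_solve(lu, x)
        y = y / np.linalg.norm(y)
        if min(np.linalg.norm(y - x), np.linalg.norm(y + x)) < tol:
            x = y
            break
        x = y
    lam = x @ (A @ x)                    # Rayleigh quotient estimate
    return lam, x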
Example 5.7 Let us examine, for example, the 2×2-matrix

A := ( −1  3 ; −2  4 )

with the eigenvalues 1 and 2. For the shift λ̄ = 1 − ε we obtain

A − λ̄I = ( −2+ε  3 ; −2  3+ε )

and

(A − λ̄I)⁻¹ = (1/(ε² + ε)) ( 3+ε  −3 ; 2  −2+ε ).

Because the factor 1/(ε² + ε) cancels out through normalization, the computation of the direction of a solution x of (A − λ̄I)x = b is well-conditioned. This can also be read from the componentwise relative condition number.
(Scheme (5.4): the similarity transformation P₁ A P₁ᵀ with a Householder reflection P₁ annihilates the entries of the first column below the subdiagonal and, by symmetry, those of the first row beyond the superdiagonal.)
We formulate this insight as a lemma.

Lemma 5.8 Let A ∈ Mat_n(R) be symmetric. Then there is an orthogonal matrix P ∈ O(n), which is a product of n − 2 Householder reflections, such that P A Pᵀ is tridiagonal.
P_{n−2} ⋯ P₁ A P₁ᵀ ⋯ P_{n−2}ᵀ =: P A Pᵀ is tridiagonal. □
With this we have transformed our problem to finding the eigenvalues of a symmetric tridiagonal matrix. Therefore we need an algorithm for this special case. The idea of the following algorithm goes back to H. Rutishauser. He had first tried to find out what happens when the factors of the LR-factorization of a matrix A = LR are interchanged according to A′ = RL and this process is iterated recursively. It turned out that in many cases the matrices constructed in this way converged toward the diagonal matrix Λ of the eigenvalues. The QR-algorithm that goes back to J. G. F. Francis (1959) [33] and V. N. Kublanovskaja (1961) [56] employs the QR-factorization instead:
(Scheme: one step of the QR-algorithm for a symmetric tridiagonal matrix A. The QR-factorization A = QR yields an upper triangular R = QᵀA with fill-in entries ⊕ in the second superdiagonal; the product A′ = RQ = QᵀAQ a priori has fill-in entries ⊕ above the superdiagonal and ⊗ below the subdiagonal.)
According to (ii), A′ must be symmetric and therefore all fill-in entries ⊕ in A′ vanish. Hence A′ is also tridiagonal. □
We show the convergence properties only for the simple case when the absolute values of the eigenvalues of A are distinct, say |λ₁| > |λ₂| > ⋯ > |λₙ| > 0. Let A_k, Q_k, R_k be defined as in (5.5), with A_k = (a_ij^{(k)}). Then the following statements hold:

(a) lim_{k→∞} Q_k = I,

(b) lim_{k→∞} R_k = Λ,

(c) a_ij^{(k)} = O(|λ_i/λ_j|^k) for i > j.
Proof. The proof given here goes back to J. H. Wilkinson [88]. We show first that

A^k = Q₁ ⋯ Q_k R_k ⋯ R₁ =: P_k U_k for k = 1, 2, ….

The assertion is clear for k = 1, because A = A₁ = Q₁R₁. On the other hand, from the construction of A_k it follows that
Q=LR,
where L is a unit lower triangular matrix and R an upper triangular
matrix. We can always achieve this by conjugating A with appropriate
permutations. With this we have
(5.6)
lim_{k→∞} A_k = lim_{k→∞} Q_k R_k = lim_{k→∞} R_k = Λ.
Remark 5.11 A more precise analysis shows that the method converges also for multiple eigenvalues λ_i = ⋯ = λ_j. However, if λ_i = −λ_{i+1}, then the method does not converge; the corresponding 2×2 blocks are left as such.
If two eigenvalues λ_i, λ_{i+1} are very close in absolute value, then the method converges very slowly. This can be improved with the shift strategy. In principle one tries to push both eigenvalues closer to the origin so as to reduce the quotient |λ_{i+1}/λ_i|. In order to do that, one uses at each iteration step k a shift parameter σ_k and defines the sequence {A_k} by

(a) A₁ = A,
(b) A_k − σ_k I = Q_k R_k, QR-factorization,
(c) A_{k+1} = R_k Q_k + σ_k I.
We have already met such a convergence behavior in Section 5.2 in the case of the inverse power method.
In order to achieve a convergence acceleration, the σ_k have to be chosen as close as possible to the eigenvalues λ_i, λ_{i+1}. J. H. Wilkinson has proposed the following shift strategy: we start with a symmetric tridiagonal matrix A; if the lower end of the tridiagonal matrix A_k is of the form

( d_{n−1}^{(k)}  e_n^{(k)} ; e_n^{(k)}  d_n^{(k)} ),

then the (2,2)-matrix at the right end corner has two eigenvalues; we choose as σ_k the one that is closer to d_n^{(k)}.
Better than these explicit shift strategies, especially for badly scaled matrices, are the implicit shift methods, for which we refer again to [41] and [79]. With these techniques one finally needs O(n) arithmetic operations per computed eigenvalue, that is, O(n²) for all eigenvalues.
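For illustration, an explicit-shift QR iteration on the full matrix can be sketched as below; a production code would exploit the tridiagonal structure and use implicit shifts, as just discussed, and the deflation test here is an ad-hoc choice:

import numpy as np

def qr_algorithm_shifted(A, tol=1e-12, kmax=500):
    # Explicit-shift QR iteration A_{k+1} = R_k Q_k + sigma_k I for a
    # symmetric matrix; the shift is the eigenvalue of the trailing 2x2
    # block closest to the last diagonal entry (Wilkinson's strategy).
    A = np.array(A, dtype=float)
    n = A.shape[0]
    eigs = []
    while n > 1:
        for _ in range(kmax):
            if abs(A[n-1, n-2]) <= tol * (abs(A[n-1, n-1]) + abs(A[n-2, n-2])):
                break
            d = 0.5 * (A[n-2, n-2] - A[n-1, n-1])
            e = A[n-1, n-2]
            sgn = 1.0 if d >= 0 else -1.0
            sigma = A[n-1, n-1] - sgn * e**2 / (abs(d) + np.hypot(d, e))
            Q, R = np.linalg.qr(A[:n, :n] - sigma * np.eye(n))
            A[:n, :n] = R @ Q + sigma * np.eye(n)
        eigs.append(A[n-1, n-1])          # deflate the converged eigenvalue
        n -= 1
    eigs.append(A[0, 0])
    return np.array(eigs)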
Besides the eigenvalues we are also interested in the eigenvectors, which can be computed as follows: if Q ∈ O(n) is an orthogonal matrix such that

Λ ≈ QᵀAQ, Λ = diag(λ₁, …, λₙ),

then the columns of Q approximate the eigenvectors of A, i.e.,
UᵀAV = ( σ  wᵀ ; 0  B ).

The claim follows then by induction. Let σ := ‖A‖₂ = max_{‖x‖₂=1} ‖Ax‖₂. Because the maximum is attained, there are v ∈ Rⁿ and u ∈ R^m such that

Av = σu and ‖u‖₂ = ‖v‖₂ = 1.

We can extend {v} to an orthonormal basis {v = v₁, …, vₙ} of Rⁿ and {u} to an orthonormal basis {u = u₁, …, u_m} of R^m. Then

V := [v₁, …, vₙ] and U := [u₁, …, u_m]

are orthogonal matrices, V ∈ O(n), U ∈ O(m), and UᵀAV is of the form

UᵀAV = ( σ  wᵀ ; 0  B ). □
Definition 5.15 The factorization UᵀAV = Σ is called the singular value decomposition of A, and the σ_i are called the singular values of A.

With the singular value decomposition we have at our disposal the most important information about the matrix. The following properties can be easily deduced from Theorem 5.14.

Corollary 5.16 Let UᵀAV = Σ = diag(σ₁, …, σ_p) be the singular value decomposition of A with singular values σ₁, …, σ_p, where p = min(m, n). Then:

1. If U_i and V_i are the columns of U and V, respectively, then
   A V_i = σ_i U_i and Aᵀ U_i = σ_i V_i for i = 1, …, p.

2. If σ₁ ≥ ⋯ ≥ σ_r > σ_{r+1} = ⋯ = σ_p = 0, then rank A = r,
   ker A = span{V_{r+1}, …, Vₙ} and im A = span{U₁, …, U_r}.

3. The Euclidean norm of A is the largest singular value, i.e., ‖A‖₂ = σ₁.

4. For the Frobenius norm ‖A‖_F one has ‖A‖_F² = σ₁² + ⋯ + σ_p².

5. The condition number of A relative to the Euclidean norm is equal to the quotient of the largest and the smallest singular values, i.e., κ₂(A) = σ₁/σ_p.

6. The squares σ₁², …, σ_p² of the singular values are the eigenvalues of AᵀA and AAᵀ corresponding to the eigenvectors V₁, …, V_p and U₁, …, U_p, respectively.
Based on the invariance of the Euclidean norm ‖·‖₂ under the orthogonal transformations U and V, we obtain from the singular value decomposition of A another representation of the pseudo-inverse A⁺ of A.

Corollary 5.17 Let UᵀAV = Σ be the singular value decomposition of a matrix A ∈ Mat_{m,n}(R) with p = rank A and

Σ = diag(σ₁, …, σ_p, 0, …, 0).

Then the pseudo-inverse is given by A⁺ = V Σ⁺ Uᵀ with Σ⁺ = diag(σ₁⁻¹, …, σ_p⁻¹, 0, …, 0).

Proof. We have to prove that the right-hand side B := VΣ⁺Uᵀ satisfies the Penrose axioms. The (Moore-Penrose) pseudo-inverse of the diagonal matrix Σ is evidently Σ⁺. Then the Penrose axioms for B follow immediately, because VᵀV = I and UᵀU = I. □
Consider, for example, the symmetric matrix

A = Aᵀ = ( 1.005  0.995 ; 0.995  1.005 )

with singular values σ₁ = 2 and σ₂ = 0.01. For AᵀA we obtain in four-digit arithmetic

fl(AᵀA) = ( 2.000  2.000 ; 2.000  2.000 ), hence σ̃₁² = 4 and σ̃₂² = 0.
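This loss of information is easy to reproduce numerically; a small experiment (the rounding of AᵀA to four digits simulates the fl(·) above):

import numpy as np

A = np.array([[1.005, 0.995],
              [0.995, 1.005]])

# Singular values computed directly from A: 2 and 0.01.
print(np.linalg.svd(A, compute_uv=False))

# Forming A^T A and rounding to four digits makes the small
# singular value disappear entirely.
AtA = np.round(A.T @ A, 3)   # -> [[2.000, 2.000], [2.000, 2.000]]
print(np.sqrt(np.maximum(np.linalg.eigvalsh(AtA), 0.0)))   # -> [0., 2.]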
As in the case of linear least-squares problems we will search here also for a method operating only on the matrix A. For this we examine first the operations that leave the singular values invariant.

Lemma 5.19 Let A ∈ Mat_{m,n}(R), and let P ∈ O(m), Q ∈ O(n) be orthogonal matrices. Then A and B := PAQ have the same singular values.

Proof. Simple. □
Hence, we may pre- and post-multiply the matrix A with arbitrary orthogonal matrices without changing the singular values. In view of an application of the QR algorithm it is desirable to transform the matrix A in such a way that AᵀA is tridiagonal. The simplest way to accomplish this is by bringing A to bidiagonal form. The following lemma shows that this goal can be reached by means of alternating Householder transformations from the right and the left.
Lemma 5.20 Let A ∈ Mat_{m,n}(R) be a general matrix and suppose, without loss of generality, that m ≥ n. Then there exist orthogonal matrices P ∈ O(m) and Q ∈ O(n) such that

P A Q = ( B ; 0 ),

where B is an upper (square) bidiagonal matrix.
(Scheme: the bidiagonalization proceeds by alternating Householder transformations: a reflection P₁ from the left annihilates the first column below the diagonal, a reflection Q₁ from the right annihilates the first row beyond the second entry without destroying the zeros just created, and so on, until the matrix is upper bidiagonal.)
Therefore we have a new fill-in element generated in position ⊕ next to the bidiagonal. If we play back the QR algorithm for BᵀB in this way on B, then it appears that the method corresponds to the following elimination process.
(Scheme (5.8): the fill-in entries z₂, z₃, …, z_{2n−2} appear alternately next to the two diagonals of B and are chased down along them:

eliminate z₂ (Givens from the left) → fill-in z₃,
eliminate z₃ (Givens from the right) → fill-in z₄, and so on.)
We "chase" fill-in elements alongside both diagonals and remove the newly
generated fill-in entries with Givens rotations alternating from left and
right~whence the name chasing has been given to this process. In the end
the matrix has bidiagonal form and we have performed one iteration of the
QR method for BT B operating only on B. According to Theorem 5.10 we
have
If we introduce the special vector eᵀ = (1, …, 1), we may write the above row sum relation in compact form as

Ae = e.

Therefore there exists a (right) eigenvector e corresponding to an eigenvalue λ(A) = 1. Upon recalling that ‖A‖_∞ is just the row sum norm, we obtain for the spectral radius ρ(A) the inequality chain

|λ(A)| ≤ ρ(A) ≤ ‖A‖_∞ = 1,
y = A|x| − |x| = 0.

As a consequence, there exists an eigenvector |x| to the eigenvalue λ = 1, which is statement I of the theorem.
Due to |x| = A|x| > 0, all of its components must be positive. Obviously, the proof applies for left as well as right eigenvectors in the same way, since Aᵀ is also positive. This is statement III of the theorem. Moreover, the following is apparent: if an eigenvector for an eigenvalue on the unit circle exists, this eigenvalue must be λ = 1; indeed, the above assumption y ≠ 0, which included λ ≠ 1 on the unit circle, had led to a contradiction. The eigenvalue λ = 1 is therefore the only eigenvalue on the unit circle, which proves statement II of the theorem.
The only still missing part of the theorem is that the eigenvalue λ = 1 is simple. From the Jordan decomposition J = T⁻¹AT we conclude that

J^k = T⁻¹A^kT, ‖J^k‖ ≤ κ(T) · ‖A^k‖,

wherein κ(T) denotes the condition number of the transformation matrix T. Let us first assume that J contains a Jordan block J_ν(1) to the eigenvalue 1 with ν > 1. In this case we have on one hand that

lim_{k→∞} ‖J_ν(1)^k‖ = ∞ ⟹ lim_{k→∞} ‖J^k‖ = ∞ ⟹ lim_{k→∞} ‖A^k‖ = ∞.
On the other hand, for every ε > 0 there exists a norm ‖·‖ such that

‖A^k‖ ≤ ρ(A^k) + ε = max_{λ∈σ(A)} |λ|^k + ε = 1 + ε,
PᵀAP = ( C  0 ; E  F ),

where the block matrices C and F are square. If no zero block can be generated, the matrix is said to be irreducible.
The mathematical objects behind this notion are graphs. From any nonnegative matrix A = (a_ij) we may construct the corresponding graph by associating a node with each index i = 1, …, n and connecting node i with node j by an arrow whenever a_ij > 0. The operation PᵀAP describes just a renumbering of the nodes, leaving the graph as the mathematical object unaltered. Just like the matrix, a graph is called irreducible, or also strongly connected, if there exists a connected path (in the direction of the arrows) from each of the nodes to each other one. If the corresponding matrix is reducible, then the index set divides into (at least) two subsets: there exist no arrows from the nodes of the second subset to the nodes of the first subset. In this case the graph is also called reducible. In Figure 5.1 we give two (3,3)-matrices.
(Figure 5.1: two (3,3)-matrices with their associated graphs, one irreducible and one reducible.)
These elements vanish if at least one of the factors on the right side vanishes, i.e., if in the corresponding graph there is no connecting path from node i to node j. If, however, there exists such a path, then there exists at least one index sequence i, l₁, …, l_{k−1}, j such that

a_{i l₁} a_{l₁ l₂} ⋯ a_{l_{k−1} j} > 0.

For an irreducible graph this case occurs with guarantee at latest after running through all other nodes, which means through n − 1 nodes. With binomial coefficients c_{n−1,k} > 0, we then obtain the relation

(I + A)^{n−1} = Σ_{k=0}^{n−1} c_{n−1,k} A^k > 0.
Of course, the converse statement of the theorem does not hold. Incidentally, in the concrete case lower powers of (I + A) might already be positive: the connecting paths from each node to every other one might be significantly shorter, i.e., they might run through fewer than n − 1 other nodes; compare Figure 5.1.
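The characterization (I + A)^{n−1} > 0 yields a simple irreducibility test; a sketch:

import numpy as np

def is_irreducible(A):
    # Nonnegative A is irreducible iff (I + A)^{n-1} > 0; only the zero
    # pattern of A matters, so we work with the 0/1 adjacency structure.
    n = A.shape[0]
    M = np.eye(n) + (np.asarray(A) > 0)
    return bool((np.linalg.matrix_power(M, n - 1) > 0).all())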
In what follows we want to return to our original topic of interest, the class of stochastic matrices. The following theorem is an adaptation of the theorem of Perron-Frobenius (see, e.g., [61]) to our special case.

Theorem 5.26 Let A ≥ 0 be an irreducible stochastic matrix. Then:

I. The Perron eigenvalue λ = 1 is simple.
II. To λ = 1 there exists a corresponding left eigenvector πᵀ > 0.
The theorem does not state that in the case of irreducible nonnegative matrices the Perron eigenvalue is also the only eigenvalue on the unit circle. To assure this, we need a further structural property, as had also already been found by Frobenius.

Definition 5.27 Nonnegative irreducible matrices are called primitive when their Perron eigenvalue is the only eigenvalue on the unit circle (assuming the normalization ρ(A) = 1).

Such matrices can be characterized by the property that there exists an index m such that A^m > 0.
trace(A) = Σ_{i=1}^n a_ii = Σ_{λ∈σ(A)} λ = 0.   (5.10)
Figure 5.2. Markov chain with k = 3 uncoupled subchains. The set of states S = {s₁, …, s₉₀} divides into the subsets S₁ = {s₁, …, s₂₉}, S₂ = {s₃₀, …, s₄₉}, and S₃ = {s₅₀, …, s₉₀}. Left: characteristic function χ_{S₂}. Right: eigenbasis {X₁, X₂, X₃} to the 3-fold Perron eigenvalue λ = 1.
In our formal frame we may therefore restate the above problem of cluster analysis as:

Find index sets S_m, m = 1, …, k, corresponding to (nearly) uncoupled Markov chains.

In a first step we consider the case of uncoupled Markov chains. After what has been said, we understand that in this case the knowledge about the index subsets S_m is equivalent to the knowledge about the reduced right eigenvectors e_m to the k-fold Perron eigenvalue of the transition matrix A. However, we do not know any permutation P to transform A to block diagonal shape (5.10), and its actual computation would anyway be too expensive. Moreover, in the "nearly uncoupled" case we expect a "perturbed" block diagonal structure. For these reasons we must try to find a different solution approach.
At first we will certainly solve the numerical eigenvalue problem for the reversible transition matrix A; as an algorithm we recommend a variant of the QR iteration for stochastic matrices; as an indication see Exercise 5.8. Suppose now we thereby detect a Perron eigenvalue λ = 1 with multiplicity k; then we also know k. In this case the computation of single corresponding eigenvectors is known to be ill-conditioned, but not the computation of an (arbitrary, in general orthogonal) basis {X₁, …, X_k} of the eigenspace (compare our remark in Section 5.1). Without any advance knowledge about the index subsets S_m we are then automatically led to a linear combination of the form

X₁ = e, X_i = Σ_{m=1}^k α_{im} χ_{S_m}, i = 2, …, k.   (5.11)
Figure 5.2, right, represents the situation again for our illustrating ex-
ample. Obviously the eigenvectors over each subset Sm are locally constant.
Proof. Because of (5.11) all basis vectors X_m are locally constant over the index sets S_m, which includes also a common sign structure. This confirms statement I above. For the proof of statement II we may shorten the index sets S_m each to a single element, without loss of generality.
Let {Q₁, …, Q_k} be an orthogonal eigenbasis of the matrix A_sym = DAD⁻¹ and Q = [Q₁, …, Q_k] the corresponding (k, k)-matrix. As Q is orthogonal w.r.t. ⟨·, ·⟩, Qᵀ is also orthogonal, since Qᵀ = Q⁻¹. This means that not only the columns of Q but also the rows are mutually orthogonal. Let {X₁, …, X_k} denote the associated π-orthogonal basis of right eigenvectors corresponding to the matrix A. Then X_i = D⁻¹Q_i for i = 1, …, k. As the transformation matrix D⁻¹ only contains positive diagonal entries, the sign structures of X_i and Q_i are identical for i = 1, …, k. The sign structure of S_m is equal to the one of row m of the matrix X = [X₁, …, X_k]. Suppose now there were two index sets S_i and S_j with i ≠ j, but with the same sign structure. Then the rows i and j of X would have the same sign structure and, as a consequence, so would the associated rows of Q. Their inner product therefore could not vanish, in contradiction to the orthogonality of the rows of Q. This finally confirms statement II above. □
Lemma 5.29 clearly shows that the k right eigenvectors to the k-fold eigenvalue λ = 1 can be conveniently exploited for the identification of the k unknown index sets S₁, …, S_k via the sign structures as defined in (5.12): per component only k binary digits. The criterion can be tested componentwise and is therefore independent of any permutation. For example, in Figure 5.2, right, we obtain for component s₂₀ the sign structure (+, +, +), for s₆₉ accordingly (+, −, 0).
Figure 5.3. Markov chain with k = 3 nearly uncoupled subchains. Eigenbasis {X₁, X₂, X₃} to the Perron cluster λ₁ = 1, λ₂ = 0.75, λ₃ = 0.52. Compare Figure 5.2, right, for the uncoupled case.
In contrast to Figure 5.2, right, we now obtain from Figure 5.3 for the component s₂₀ the sign structure (+, +, +) as before, but for s₆₉ now (+, −, zero), where zero stands for some kind of "dirty zero" to be defined in close connection with the perturbation. In fact, the algorithm in [24] eventually even supplies the perturbation parameter ε and the transition probabilities between the nearly uncoupled Markov chains; for more details see there.
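A toy version of the sign-structure identification can be sketched as follows; the eigenvalue selection and the "dirty zero" threshold are ad-hoc assumptions:

import numpy as np

def perron_cluster_signs(A, k, zero_tol=1e-8):
    # Identify (nearly) uncoupled subchains of a stochastic matrix A from
    # the sign structures of a basis of its Perron cluster eigenspace.
    lam, X = np.linalg.eig(A)
    idx = np.argsort(-lam.real)[:k]     # k eigenvalues closest to 1
    X = X[:, idx].real
    signs = np.where(np.abs(X) < zero_tol, 0, np.sign(X)).astype(int)
    return [tuple(row) for row in signs]  # one sign tuple per state

# Example: two uncoupled subchains of a 4-state Markov chain.
A = np.array([[0.9, 0.1, 0.0, 0.0],
              [0.2, 0.8, 0.0, 0.0],
              [0.0, 0.0, 0.7, 0.3],
              [0.0, 0.0, 0.4, 0.6]])
print(perron_cluster_signs(A, k=2))   # states 0,1 and 2,3 group together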
Exercises
Exercise 5.1 Determine the eigenvalues, eigenvectors and the determi-
nant of a Householder matrix
Q = I − 2 v vᵀ / (vᵀ v).
Exercise 5.2 Give a formula (in terms of determinants) for an eigenvector x ∈ Cⁿ corresponding to a simple eigenvalue λ ∈ C of a matrix A ∈ Mat_n(C).
Exercise 5.3 The computation of an eigenvector η_j corresponding to an eigenvalue λ_j of a given matrix A can be done, according to Wielandt, by the inverse iteration

(A − λ̄_j I) z_i = z_{i−1}

with an approximation λ̄_j to the eigenvalue λ_j. Deduce from the relation

r(δ) := A z_i − (λ̄_j + δ) z_i = z_{i−1} − δ z_i

a correction δ for the approximation λ̄_j such that ‖r(δ)‖₂ is minimal.
Exercise 5.4 Let there be given a so-called arrow matrix Z of the form

Z = ( A  B ; Bᵀ  D ),

where A = Aᵀ ∈ Mat_n(R) is symmetric, B ∈ Mat_{n,m}, and D is a diagonal matrix, D = diag(d₁, …, d_m). For m ≫ n it is recommended to use the sparsity structure of Z.

(a) Show that

Z − λI = Lᵀ(λ) diag(M(λ) − λIₙ, D − λI_m) L(λ) for λ ≠ d_i, i = 1, …, m,

where

L(λ) := ( Iₙ  0 ; (D − λI_m)⁻¹Bᵀ  I_m ) and M(λ) := A − B(D − λI_m)⁻¹Bᵀ.
(b) Modify the method treated in Exercise 5.3 in such a way that one operates essentially only on (n, n)-matrices.
Exercise 5.5 Prove the properties of the singular value decomposition
from Corollary 5.16.
Exercise 5.6 Let there be given an (m, n)-matrix A, m ≥ n, and an m-vector b. The following linear system is to be solved for different values of p ≥ 0 (Levenberg-Marquardt method, compare Section 4.3):

(AᵀA + pIₙ) x = Aᵀb.   (5.14)

(a) Show that the matrix AᵀA + pIₙ is invertible for rank A < n and p > 0.

(b) Let A have the singular values σ₁ ≥ σ₂ ≥ ⋯ ≥ σₙ ≥ 0. Show that: if σₙ ≥ σ₁√eps, then

κ₂(AᵀA + pIₙ) ≤ 1/eps for p ≥ 0.

If σₙ < σ₁√eps, then there exists a p̄ ≥ 0 such that

κ₂(AᵀA + pIₙ) ≤ 1/eps for p ≥ p̄.

Determine p̄.
(c) Develop an efficient algorithm for solving (5.14) by using the singular
value decomposition of A.
Exercise 5.7 Determine the eigenvalues λ_i(t) and the eigenvectors η_i(t) of the matrix

A(t) = ( 1 + t cos(2/t)   −t sin(2/t) ; −t sin(2/t)   1 − t cos(2/t) ).

How do A(t), λ_i(t), and η_i(t) behave for t → 0?
Exercise 5.8 We consider the matrix A given in Exercises 1.10 and 1.11 describing a "cosmic maser." What is the connection between a stochastic matrix A_stoch and the matrix A there? Which iterative algorithm for the computation of all eigenvalues would be more natural than the QR-algorithm?
Exercise 5.9 Given a reversible primitive matrix A with left eigenvector π > 0. Let D = diag(√π₁, …, √πₙ) be a diagonal weighting matrix and

π_i = Σ_{j=1}^n b_ij.

Compute all eigenvalues of the matrix A. In particular, identify the Perron cluster (5.13), the associated spectral gap, and any index subsets corresponding to nearly uncoupled Markov chains. Experiment a little bit with the random number generator.
6
Three-Term Recurrence Relations
Example 6.1 Define a scalar product

(f, g) := ∫_{−π}^{π} f(x) g(x) dx

for functions f, g : [−π, π] → R. It is easy to convince oneself that the special functions P_{2k}(x) = cos kx for k = 0, 1, … and P_{2k−1}(x) = sin kx for k = 1, 2, … are orthogonal with respect to this scalar product. The functions for which the induced norm ‖f‖ := (f, f)^{1/2} is well-defined and finite can be approximated arbitrarily well with respect to this norm by the partial sums of the Fourier series

f_N(x) = Σ_{k=0}^{2N} a_k P_k(x) = a₀ + Σ_{k=1}^N (a_{2k} cos kx + a_{2k−1} sin kx),

if N is large enough.
Here we can compute the functions cos kx and sin kx via the three-term recurrence relations

cos((k+1)x) = 2 cos x · cos kx − cos((k−1)x), sin((k+1)x) = 2 cos x · sin kx − sin((k−1)x).

For a weight function w ≥ 0 on [a, b] we require that the norm

‖P‖ = √(P, P) = ( ∫_a^b w(t) P(t)² dt )^{1/2} < ∞

is well-defined and finite for all polynomials P ∈ P_k and all k ∈ N. In particular, under this assumption all moments exist, and the orthogonal polynomials can be normalized to have leading coefficient one,

P_k(t) = t^k + ⋯
with the starting values T₀(x) = 1 and T₁(x) = x. Using this, we can define T_k(x) for all x ∈ R. From the variable substitution x = cos α, i.e., dx = −sin α dα, we can see that the Chebyshev polynomials are indeed the orthogonal polynomials on [−1, 1] with respect to the weight function w(x) = 1/√(1−x²), i.e.,

∫_{−1}^{1} T_n(x) T_m(x) (1−x²)^{−1/2} dx = 0 if n ≠ m, π if n = m = 0, π/2 if n = m ≠ 0.

The Chebyshev polynomials are particularly important in approximation theory. We shall encounter them several times in the next chapters.
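Evaluating T_k(x) through the three-term recurrence relation is both cheap and simple; a minimal sketch (compare Exercise 6.1 at the end of this chapter):

def chebyshev_T(k, x):
    # T_{j+1}(x) = 2x T_j(x) - T_{j-1}(x), with T_0 = 1, T_1 = x.
    if k == 0:
        return 1.0
    t_prev, t = 1.0, x
    for _ in range(k - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t

print(chebyshev_T(31, 0.923))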
such that

P_k(t) = (t − a_k) P_{k−1}(t) − b_k P_{k−2}(t), k = 1, 2, …,

with P₋₁ := 0 and P₀ := 1. □
Proof. Let t₁, …, t_m be the m distinct points t_i ∈ ]a, b[ at which P_k changes sign. The polynomial

Q(t) := (t − t₁)(t − t₂) ⋯ (t − t_m)

then changes sign at the same points, so that the function w(t)Q(t)P_k(t) does not change sign in ]a, b[, and therefore

(Q, P_k) = ∫_a^b w(t) Q(t) P_k(t) dt ≠ 0.

Since P_k is orthogonal to all polynomials P ∈ P_{k−1}, it follows that deg Q = m ≥ k, as required. □
Figure 6.1. Discrete Green's function g(5, k) over k = 0, …, 10 for a_k = 2 and b_k = −1.
treated with special care. At least in that case, it was possible to stabilize
the trigonometric three-term recurrence relation numerically. The following
example shows that this is not always possible.
Example 6.9 Bessel's maze. The Bessel functions J_k = J_k(x) satisfy the three-term recurrence relation

J_{k+1} = (2k/x) J_k − J_{k−1} for k ≥ 1.   (6.9)
We start, for example, with x = 2.13 and the values

J₀ = 0.14960677044884, J₁ = 0.56499698056413,

which can be taken from a table (e.g., [73]). At the end of the chapter we shall be able to confirm these values (see Exercise 6.7). We can now try to compute the values J₂, …, J₂₃ by employing the three-term recurrence relation in forward mode. In order to "verify" (see below) the results J̃₂, …, J̃₂₃, we solve the recurrence relation (6.9) with respect to J_{k−1} and insert J̃₂₃ and J̃₂₂ into the recurrence relation in backward mode. This way we get J̃₂₁, …, J̃₀ back and actually expect that J̃₀ coincides approximately with the starting value J₀. However, with a relative machine precision of eps = 10⁻¹⁶, we obtain

J̃₀/J₀ ≈ 10⁹.

A comparison of the computed value J̃₂₃ with the actual value J₂₃ reveals that it is much worse, namely,

J̃₂₃/J₂₃ ≈ 10²⁷,

i.e., the result misses by several orders of magnitude! In Figure 6.2 we have plotted the repetition of this procedure, i.e., the renewed start with J̃₀, etc.:
Numerically, one does not find the way back to the starting value, hence
this phenomenon is called Bessel's maze. What happened? A first analysis
of the behavior of the rounding errors shows that

(2k/x) J_k ≈ J_{k−1} for k > x

(compare Table 6.1). Thus cancellation occurs in the forward recurrence relation every time J_{k+1} is computed (see Exercise 6.9). Moreover,
besides the Bessel functions J_k, the Neumann functions Y_k also satisfy the same recurrence relation (Bessel and Neumann functions are together called cylinder functions). However, these possess an opposite growth behavior: the Bessel functions decrease when k increases, whereas the Neumann functions increase rapidly. It is through the input errors for J₀ and J₁ (in the order of magnitude of the machine precision),

J̃₀ = J₀ + ε₀Y₀, J̃₁ = J₁ + ε₁Y₁,
Figure 6.2. Bessel's maze for x = 2.13: ln(|J_k(x)|) is plotted over k for 5 loops until k = 23.
Table 6.1. Cancellation in the three-term recurrence relation for the Bessel functions J_k = J_k(x), x = 2.13. (Columns: k, J_{k−1}, (2k/x) J_k.)
that the input J̃₀, J̃₁ always contains a portion of the Neumann function Y_k, which at first is very small, but which in the course of the recurrence increasingly overruns the Bessel function. Conversely, in the backward direction, the Bessel functions superimpose the Neumann functions.
In the following section we shall try to understand the observed numerical
phenomena.
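The maze is easy to reproduce; a small experiment with the data of Example 6.9 (the starting values are the table values quoted above):

x = 2.13
J0, J1 = 0.14960677044884, 0.56499698056413

# Forward recurrence J_{k+1} = (2k/x) J_k - J_{k-1}: unstable, since the
# decaying Bessel solution is overrun by the growing Neumann solution.
J = [J0, J1]
for k in range(1, 23):
    J.append(2.0 * k / x * J[k] - J[k - 1])

# Backward recurrence started from the contaminated values J[23], J[22]:
# one does not find the way back to J0 (Bessel's maze).
Jb = [0.0] * 24
Jb[23], Jb[22] = J[23], J[22]
for k in range(22, 0, -1):
    Jb[k - 1] = 2.0 * k / x * Jb[k] - Jb[k + 1]

print(f"forward  J23 = {J[23]:.3e}")          # exact J23(2.13) ~ 1e-22
print(f"backward J0  = {Jb[0]:.3e}  vs  J0 = {J0:.3e}")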
p₂, p₃, … as resulting quantities. Only two multiplications and one addition have to be carried out in each step, and we have verified the stability of these operations in Lemma 2.19. The execution of the three-term recurrence relation in floating point arithmetic is therefore stable. Thus only the condition of the three-term recurrence relation determines whether it is numerically useful. In order to analyze the numerical usefulness, we prescribe perturbed starting values
δ_k := (p̃_k − p_k)/p_k, p_k ≠ 0,

is the solution of the inhomogeneous recurrence relation

δ_k = a_k (p_{k−1}/p_k) δ_{k−1} + b_k (p_{k−2}/p_k) δ_{k−2} + ε_k for k ≥ 2

with the starting values δ₀ := ε₀, δ₁ := ε₁, where

ε_k := E_k/p_k = α_k a_k (p_{k−1}/p_k) + β_k b_k (p_{k−2}/p_k),

with α_k, β_k the relative rounding errors of the kth step.
lim_{k→∞} p_k/q_k = 0.

A minimal solution is often characterized by a normalization relation of the form

G_∞ := Σ_{k=0}^∞ m_k p_k = 1   (6.11)

with the weights m_k. Conversely, such relations generally hint that the corresponding solutions p_k are minimal. If they exist, then the minimal solutions form a one-dimensional subspace of ℒ. The existence can be guaranteed by imposing certain assumptions on the coefficients a_k and b_k.
Theorem 6.11 Suppose that the three-term recurrence relation is symmetric, i.e., b_k = −1 for all k, and that there exists a k₀ ∈ N such that the condition below holds for all k > k₀. Furthermore, for each dominant solution q, there is an index k₁ ≥ k₀ such that

For the Bessel functions one has, for example, the normalization relation

G_∞ := J₀ + 2 Σ_{k=1}^∞ J_{2k} = 1.
k=l
where P_k^l(x) denotes the associated Legendre functions of the first kind for |x| ≤ 1. They can be given explicitly as follows:

P_k^l(x) := ((−1)^{k+l} / ((k+l)! k! 2^k)) (1 − x²)^{l/2} (d^{k+l}/dx^{k+l}) (1 − x²)^k   (6.13)

and

P_l^l(x) = (−1)^l (1 − x²)^{l/2} / (2^l · l!),

which leads immediately to the two-term recurrence relation

P_l^l = −((1 − x²)^{1/2} / (2l)) P_{l−1}^{l−1},   (6.16)
where

σ_k(θ) := ((2k−1) cos θ / ((k−l)(k+l))) q_{k−1}^l − q_k^l r_k^l − (q_{k−2}^l / ((k−l)(k+l))) r_{k−1}^l.

In order for the expression (1 − cos θ) to be a factor of σ_k(θ), θ = 0 obviously has to be a root of σ_k, i.e., σ_k(0) = 0. Because of (6.17), we require in addition that q_{l+1}^l r_{l+1}^l = 1. These two requirements regarding the transformations q_k and r_k are satisfied by the choice

q_k^l = 1/r_k^l = k − l + 1.
P̄₀⁰ := P₀⁰ := 1;
for l := 0 to L do
  P^{l+1}_{l+1} := P̄^{l+1}_{l+1} := −(sin θ / (2(l+1))) P̄_l^l;
  ΔP_l^l := −sin²(θ/2) P̄_l^l;
  for k := l + 1 to K do
    ΔP_k^l := ((k − l − 1) ΔP_{k−1}^l − 2(2k−1) sin²(θ/2) P̄_{k−1}^l) / ((k+l)(k−l+1));
    P̄_k^l := (1/(k−l+1)) P̄_{k−1}^l + ΔP_k^l;
    P_k^l := (k−l+1) P̄_k^l;
  end for
end for
Remark 6.16 For the successful computation of orthogonal polynomials,
one obviously needs a kind of "look-up table of condition numbers" for as
many orthogonal polynomials as possible. A first step in this direction is
the paper [36]. However, the numerically necessary information is in many
cases more hidden than published. Moreover, the literature often does not
clearly distinguish between the notions "stability" and "condition."
G_n := Σ_{k=0}^n m_k p_k.

By computing an arbitrary solution p̂_k of the three-term recurrence relation in backward mode, e.g., with the starting values p̂_{n+1} = 0 and p̂_n = 1,
and normalizing these with the help of G_n, one obtains for increasing n increasingly better approximations of the minimal solution. These considerations motivate the following algorithm (Miller algorithm) for the computation of p_N with a relative precision ε.

1. Choose n > N and set p̂_{n+1}^{(n)} := 0, p̂_n^{(n)} := 1.

2. Compute p̂_{n−1}^{(n)}, …, p̂₀^{(n)} from the three-term recurrence relation in backward mode.

3. Compute

Ĝ_n := Σ_{k=0}^n m_k p̂_k^{(n)}.

4. Normalize according to p_k^{(n)} := p̂_k^{(n)} / Ĝ_n.

5. Repeat steps 1 to 4 for increasing n = n₁, n₂, …, and while doing this, test the accuracy by comparing p_N^{(n_i)} and p_N^{(n_{i−1})}. If the relative deviation is below ε, accept the approximation.
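For the Bessel functions, with the normalization relation G_∞ = J₀ + 2 Σ J_{2k} = 1 from above, the algorithm takes the following form; a sketch (the choice of the starting index n is ad hoc, and for large n the unnormalized values may overflow, in which case intermediate rescaling would be needed):

def bessel_j_miller(N, x, n=None):
    # Miller algorithm: backward recurrence J_{k-1} = (2k/x) J_k - J_{k+1}
    # from arbitrary starting values, then normalization via
    # G_n = p_0 + 2*(p_2 + p_4 + ...).
    if n is None:
        n = 2 * N + 20
    p = [0.0] * (n + 2)
    p[n + 1], p[n] = 0.0, 1.0
    for k in range(n, 0, -1):
        p[k - 1] = 2.0 * k / x * p[k] - p[k + 1]
    G = p[0] + 2.0 * sum(p[2:n + 1:2])
    return [pk / G for pk in p[:N + 1]]

J = bessel_j_miller(23, 2.13)
print(J[0], J[1])   # ~0.1496067704..., ~0.5649969805...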
Proof. The solution p̂_k^{(n)} of the three-term recurrence relation with the starting values p̂_n^{(n)} := 1 and p̂_{n+1}^{(n)} := 0 can be represented as a linear combination of p_k and q_k; indeed,

p̂_k^{(n)} = (p_k q_{n+1} − q_k p_{n+1}) / (p_n q_{n+1} − q_n p_{n+1}).

This implies
This sum can be computed in two different ways. The direct way is the forward recurrence relation (6.19). Alternatively one computes

u_{N+1} := 0,
u_k := x · u_{k+1} + a_k for k = N, N−1, …, 0,   (6.20)
S_N := u₀.

This is the Horner algorithm. Compared with the first algorithm (6.19), it saves N multiplications and is therefore approximately twice as fast.
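In code, the Horner algorithm (6.20) is a single loop; a minimal sketch:

def horner(a, x):
    # Evaluate sum_k a_k x^k via u_k = x*u_{k+1} + a_k, S_N = u_0.
    u = 0.0
    for ak in reversed(a):
        u = x * u + ak
    return u

print(horner([1.0, 2.0, 3.0], 2.0))   # 1 + 2*2 + 3*4 = 17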
(6.21)

multiplying them by the coefficients a_k, and adding the results up. The resulting algorithm corresponds to the forward recurrence relation (6.19). In matrix form it reads L p = r, where the unit lower triangular band matrix L carries the recurrence coefficients (−b_k, −a_k, 1) in its kth row, p = (p₀, …, p_N)ᵀ collects the values of the basis functions, and r = (p₀, p₁, 0, …, 0)ᵀ.
The adjoint (transposed) system Lᵀ u = a is upper triangular. By solving this triangular system of equations from the bottom up, we obtain the desired analogue of the algorithm (6.20):

u_{N+2} := u_{N+1} := 0,
u_k := 2 cos x · u_{k+1} − u_{k+2} + a_k for k = N, …, 1,   (6.24)

and the results

S_N = u₁ sin x and C_N = a₀ + u₁ cos x − u₂.
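A minimal sketch of this adjoint summation, checked against direct summation:

import math

def adjoint_sum_trig(a, x):
    # Algorithm (6.24): u_{N+2} = u_{N+1} = 0,
    # u_k = 2*cos(x)*u_{k+1} - u_{k+2} + a_k for k = N, ..., 1;
    # then S_N = u_1*sin(x), C_N = a_0 + u_1*cos(x) - u_2.
    u1 = u2 = 0.0
    for k in range(len(a) - 1, 0, -1):
        u1, u2 = 2.0 * math.cos(x) * u1 - u2 + a[k], u1
    return u1 * math.sin(x), a[0] + u1 * math.cos(x) - u2

a, x = [0.3, -1.2, 0.7, 2.5], 0.9
S, C = adjoint_sum_trig(a, x)
print(S - sum(ak * math.sin(k * x) for k, ak in enumerate(a)))  # ~0
print(C - sum(ak * math.cos(k * x) for k, ak in enumerate(a)))  # ~0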
where P_k^l(x) are the Legendre functions of the first kind as introduced in Example 6.14. Negative upper indices l are omitted here because of P_k^{−l} = (−1)^l P_k^l. A set of well-conditioned recurrence relations was already given in Example 6.14. Also in this situation, the application of similar stabilization techniques leads to the following algorithm:
V := 0; ΔV := 0;
for l := L to 1 step −1 do
  U := 0; ΔU := 0;
  for k := K to l + 1 step −1 do
    …
  end for
  …
end for
C(K, L; θ, φ) := U₀ − (sin θ / 2)(−2 sin²(φ/2) V + ΔV);
S(K, L; θ, φ) := −(1/2) sin θ sin φ · V;
However, the cost of computing the S_N^{(n)} would be quite high, since for each new n all values have to be computed again. Can that be avoided by employing some kind of adjoint summation? In order to answer this question, we proceed as in the derivation of the previous section, and we describe for given n > N one step of the Miller algorithm through a linear system M_n p^{(n)} = r^{(n)}: the first n rows of M_n contain the recurrence coefficients (b_k, a_k, −1), its last row contains the weights m₀, …, m_n, the unknown is p^{(n)} = (p₀^{(n)}, …, p_n^{(n)})ᵀ, and the right-hand side is r^{(n)} = (0, …, 0, 1)ᵀ.
With a(n) := (ao, ... ,aN,O, ... ,O)T E Rn+I, the sum 5}:;) can again be
written as a scalar product
where urn) is the solution of the system MJ urn) = a(n). More explicitly,
(n)
b2 mo Uo a o(n)
a2
-1 bn
an bn+1
-1 mn (n) (n)
an+l Un an
We solve this system by the Gaussian elimination method. The arising
computations and results are listed in the following theorem:
Theorem 6.23 Define ern) = (eo, ... , en) and f(n) = (fo, . .. , fn) by
1 fa
-1 , Un :=
1 fn-l
fn
and therefore
S (n) _ (n) _ en
N - Un - fn .
With the recurrence relations (6.26) and (6.27), we need 0(1) operations
in order to compute the next approximation sj;+l) from sj;), as opposed to
O(n) operations with the method, which is directly derived from the Miller
algorithm. In addition, we need less memory not depending on n but only
on N (if the coefficients {ad are given as a field). Because of (6.25), we call
the method developed in Theorem 6.23 the adjoint summation of minimal
solutions.
We now want to illustrate how this method can be employed to obtain
a useful algorithm from the theoretical description of Theorem 6.23. First
we replace the three-term recurrence relation (6.26) for ek by a system of
two-term recurrence relations for
(k) ek
Uk := Uk = - and b.uk:= Uk - uk-l
fk
because we are interested in precisely these two values (Uk as solution, b.uk
to scrutinize the precision). Furthermore, one has to consider that the f n
6.3. Adjoint Summation 175
and ek get very large, and they may fall outside the domain of numbers,
which can be represented on the computer. Instead of the ik, we therefore
use the new quantities
fk-l - 1
gk := T and h:= fk .
Exercises
Exercise 6.1 On a computer, calculate the value of the Chebyshev
polyilOmial T3I(X) for x = 0.923:
(a) by using the Horner scheme (by computing the coefficients of the
monomial representation of T3I on a computer, or by looking them
up in a table),
(b) by using the three-term recurrence relation.
Compare the results with the value
T3I(X) = 0.948715916161,
which is precise up to 12 digits, and explain the error.
Exercise 6.2 Consider the three-term recurrence relation
Tk = akTk-I + bkTk-2 .
For the relative error Bk = Ch - Tk)/Tk of n, there is a inhomogeneous
three-term recurrence relation, which is of the form
/l Tk-I/l Tk-2/l
Uk = akT;:Uk-I + bkT;:Uk-2 + Ek·
Consider the case ak 2 0, bk > 0, To, TI > 0, and verify that
(a ) IE k I :::; 3eps,
(b) Ilhl:::; (3k - 2)eps, k 21.
Exercise 6.3 The Legendre polynomials are defined through the recur-
rence relation
(6.29)
with the starting values Po(x) = 1 and PI (x) = x. (6.29) is well-conditioned
in forward mode. Show that the computation of Sk (B) := Pk (cos B) accord-
ing to (6.29) is numerically unstable for () - t O. For cos () > 0, find an
economical, stable version of (6.29) for the computation of Sk(B).
Hint: Define Dk = ak(Sk - Sk-J), and determine a suitable ak.
Exercise 6.4 Consider the three-term recurrence relation
(6.30)
(a) Find the general solution of (6.30) by seeking solutions of the form
Tk = w k (distinguish cases!).
(b) Show the existence of a minimal solution, if lal 2 l.
Exercise 6.5 Under the assumptions of Theorem 6.11, analyze the
condition of the adjoint three-term recurrence relation for a minimal
solution.
Exercises 177
where the
D(l, m) ;= FIQm - QIFm
denote the generalized Casorati determinants. For the special case of the
trigonometric recurrence relation, find a closed formula for g(j, k), and
carry out the limiting process x ----+ lJr, l E Z. Sketch g(j, k) for selected j,
k, and x.
Exercise 6.9 Consider the three-term recurrence relation for the cylinder
functions
(6.31)
to
Theorem 7.1 Suppose n + 1 nodes (ti' fi) for i = 0, ... , n are given with
pairwise distinct nodes to, ... , tn, then there exists a unique interpolating
polynomial P E P n , i.e., P(td = fi for i = 0, ... , n.
lC:) C:)
1 to t 02 t 0n
,
[ 1 tn t n2 tn
n
v. - - - - - - ' "
=: Vn
The matrix Vn is called Vandermonde matrix. For the determinant of Vn ,
we have
n n
det Vn = II II (ti - tj) ,
i=O j=i+l
which is proven in virtually every linear algebra textbook (see ,e.g., [61]). It
is different from zero exactly when the nodes to, ... ,tn are pairwise distinct
(in agreement with Theorem 7.1). However, the solution of the system
requires an excessive amount of computational effort when compared with
the methods that will be discussed below.
In addition, the Vandermonde matrices are almost singular in higher
dimensions n. Gaussian elimination without pivoting is recommended for
its solution, because pivoting strategies may perturb the structure of the
matrix (compare [51]). For special nodes, the above Vandermonde matrix
can easily be inverted analytically. In Section 7.2 we shall encounter an
example of this.
An alternative basis for the representation of the interpolation polyno-
mial is formed by the Lagrange polynomials La, . .. ,Ln. They are defined
as the uniquely determined interpolation polynomials Li E P n with
182 7. Interpolation and Approximation
1.2,---,---,---,---~-~-~-~--,
L;(t) = rrn t - tj .
j~O'
t - tJ
#i
The interpolating polynomial P for arbitrary nodes fo, ... , fn can easily
be built from the Lagrange polynomials by superposition: With
n
P(t) := L fiLi(t) , (7.2)
i=O
we obviously have
n n
i=O i=O
Remark 7.2 The above statement can also be phrased as follows: The
Lagrange polynomials form an orthogonal basis of P n with respect to the
scalar product
n
(P, Q) := L P(ti)Q(t i )
i=O
Theorem 7.3 Let a :::; to < ... < tn :::; b be pairwise distinct nodes, and let
Lin be the corresponding Lagrange polynomials. Then the absolute condition
number K:abs of the polynomial interpolation
K:abs = An := max L
tE[a,b] i=O
ILin(t)1
Proof. The polynomial interpolation is linear, i.e., 1/ (f) (g) = ¢(g). We have
to show that IWII = An. For every continuous function f E C[a, b], we have
n n
1¢(f)(t)1 I L f(ti)Lin(t)1 :::; L If(ti)IILin(t)1
i=O i=O
n
Ilflloo max L
tE[a,b] i=O
ILin(t)l,
and thus K:abs :::; An. For the opposite direction, we construct a function
g E C[a, b] such that
n
for aTE [a, b]. For this let T E [a, b] be the place where the maximum is
attained, i.e.,
n n
any choice of nodes. For comparison, Table 7.1 also shows the Lebesgue
constants for the Chebyshev nodes (see Section 7.1.4)
ti = cos 2i + 1)
(2n+
--7["
2
.
for z = 0, ... ,n
(where the maximum was taken over [-1,1]). They grow only very slowly.
Table 7.1. Lebesgue constant An for equidistant and for Chebyshev nodes.
n An for equidistant nodes An for Chebyshev nodes
5 3.106292 2.104398
10 29.890695 2.489430
15 512.052451 2.727778
20 10986.533993 2.900825
Proof. Let <p(t) be defined as the expression on the right-hand side of (7.3).
Then <p E P n , and
The interpolation polynomials for only one single node are nothing else
than the constants
P(f Iti) = fi for i = 0, ... ,n.
If we simplify the notation for fixed t by
7.1. Classical Polynomial Interpolation 185
then the value Pnn = PC! Ito, ... , tn)(t) can be computed according to the
Neville scheme
Poo
~
P lO -> Pll
ti sin ti
50° 0.7660444
55° 0.8191520 0.~935027
60° 0.8660254 0.~847748 0.8830292
65° 0.9063078 0.~821384 0.8829293 0.8829493
70° 0.9396926 0.~862768 0.8829661 0.8829465 0.8829476
The recursive structure of the interpolation polynomials according to
Lemma 7.4 can also be utilized for the determination of the entire poly-
nomial PC! I to, ... ,tn). This is also true for the generalized interpolation
problem, where besides function values I(ti), also derivatives at the nodes
are given, the Hermite interpolation. For this we introduce the following
practical notation. With
a = to ::; h ::; ... ::; tn = b
we allow for the occurrence of multiple nodes in ~ := {ti};=O, ... ,n. If at a
node ti, the value I(ti) and the derivatives !'(ti), ... , I(k)(ti) are given up
to an order k, then ti shall occur (k+1)-times in the sequence~. The same
nodes are enumerated from the left to the right by
di := max{j I ti = ti-j} ,
e.g.,
ti
di
I t3 1 o 1 2 0 o 1
186 7. Interpolation and Approximation
is obviously linear and also injective. Now fJ.(P) = implies that P pos-
sesses at least n + 1 roots (counted with mUltiplicity), and it is therefore
the null-polynomial. Since dimP n = dim Rn+l = n+ 1, this implies again
the existence. 0
_~(t-to)j (j)
P(f I to,···, tn)(t) - L..- ., f (to), (7.6)
j=O J.
also called the Taylor interpolation.
Remark 7.7 An important application is the cubic Hermite interpolation,
where function values fa, h and derivatives fa, ff are given at two nodes
to, t l . According to Theorem 7.6, this determines uniquely a cubic polyno-
mial P E P 3 . If the Hermite polynomials H5, ... , H5 E P 3 are defined by
7.1. Classical Polynomial Interpolation 187
Hr(to) = 0,
Hg(to) = 0,
Hl(to) = 0,
then the polynomials
{HJ(t),Hr(t),Hg(t),Hl(t)}
form a basis of P 3 , the cubic Hermite basis, with respect to the nodes to, tl.
The Hermite polynomial corresponding to the values
{fo,f6,iI,fD is thus formally given by
P = (ti - t)P(f I t l , ... , £j, ... ,t n ) - (t j - t)P(f I iI, ... , t;, ... ,tn)
4-0 '
where ~ indicates that the corresponding node is omitted. ("has to lift its
hat").
The coefficients with respect to this basis are the divided differences, which
we now define.
Definition 7.9 The leading coefficient an of the interpolation polynomial
If f E en+! , then
Proof. (i) is true, because the nth coefficient of a polynomial of degree less
than or equal to n - 1 vanishes. (ii) follows from the Taylor interpolation
(7.6) and (iii) from Lemma 7.8 and the uniqueness of the leading coefficient.
D
With properties (ii) and (iii), the divided differences can be computed
recursively from the function values and derivatives f(j) (td of f at the
nodes ti. We also need the recurrence relation in the proof of the following
theorem, which states a surprising interpretation of the divided differ-
ences: The nth divided difference of a function f E en with respect to the
nodes to, ... , tn is the integral of the nth derivative over the n-dimensional
standard simplex
[to, . .. , tnlf = J (t
En
fen)
2=0
Siti) ds. (7.10)
J
n
n+l
i=O
L 8i=1
i=O
J
n
i=l
n
1- LSi
i=l
J J
n
f(n+1) (to + 2:>i(ti - to) + Sn+1(tn+l - to)) ds
80=0
i=l
o
Corollary 7.13 Let g : Rn+1 -+ R be the mapping, which is given by the
nthdivided difference of a function f E en with
Then g is continuous in its arguments ti. Furthermore, for all nodes to :::::
... ::::: tn, there exists aTE [to, tnl such that
(7.11)
For pairwise distinct nodes to < ... < tn, the divided differences can be
arranged similar to the Neville-scheme because of the recurrence relation
(7.9).
7.1. Classical Polynomial Interpolation 191
fa [tolf
'\.
fr [tllf -+ [to, tllf
Proof. First suppose that the nodes to, ... ,tn are pairwise distinct. Set
i-I n
Wi(t) := IT (t - tk) and Wj(t):= II (t - tl).
k=O 1=)+1
The product
n
PQ = L [to, ... , tiJg [t j , ... , tnJh· WiWj
i,j=O
thus interpolates the function f := gh in to, ... , tn. Since Wi(tk)Wj(tk) = 0
for all k and i > j, it follows that
n
i,j=O
i5,j
is the interpolation polynomial of gh in to, ... , tn. As claimed, the leading
coefficient is
n
L[to, ... ,tiJg [ti, ... , tnJh.
i=O
For arbitrary, not necessarily distinct nodes ti, the statement now follows
from the continuity of the divided differences at the nodes ti. D
Example 7.17 In the case ofthe Taylor interpolation (7.6), i.e., to = ... =
tn, the error formula (7.12) is just the Lagrange remainder of the Taylor
expansion
f(n+l)(T) n+l
f(t) - P(f I to, .. ·, tn)(t) = ( )' (t - to) .
n+1.
7.1. Classical Polynomial Interpolation 193
by choosing again the Chebyshev nodes for the nodes t;. We now turn our
attention to the question of whether the polynomial interpolation satisfies
the approximation property. For merely continuous functions I E C[a, b],
and the supremum-norm 11111 = SUPtE[a,b] I(t), the approximation error can
in principle grow beyond all bounds. More precisely, according to Faber,
for each sequence {Td of sets of nodes Tk = {tk,o, ... ,tk,nk} C [a, b], there
exists a continuous function I E C[a, b] such that the sequence {Pd of
the interpolation polynomials, which belong to the Tk, does not converge
uniformly to I.
We shall now see that the Chebyshev polynomials Tn, which we already
encountered as orthogonal polynomials with respect to the weight function
w(x) = (1- X2)-~ over [-1,1], solve this min-max problem (up to a scalar
factor and an affine transformation). In order to show this, we first reduce
the problem to the interval [-1,1], which is suitable for the Chebyshev
polynomials, with the help of the affine mapping
194 7. Interpolation and Approximation
~
x: [a,b] -----7 [-1,1]
t-a 2t-a-b
X = x(t) = 2 - - -1 = - - -
b-a b-a'
then Fn(t) := Pn(t(x)) is the solution of the original problem (7.14) with
leading coefficient 2n / (b - a) n .
In Example 6.3, we introduced the Chebyshev polynomials (see Figure
7.3) via
-1.5.!;-1---:-O~.8--0~.6--~0.4'-----~O.2:---:::------::0.2:--70.4;--70.6;--70.::-8~
or easily verified. In particular we can directly give the roots Xl, ... ,X n of
Tn(x), which are real and simple according to theorem 6.5 (see property 7
below).
7.1. Classical Polynomial Interpolation 195
Remark 7.18
1. The Chebyshev polynomials have integer coefficients.
2. The leading coefficient of Tn is an = 2n-l.
3. Tn is an even function if n is even, and an odd one if n is odd.
2k - 1 )
Xk :=cos ( ~1f for k = 1, ... , n .
8. We have
cos(k arccos x) if -1:Sx:S1
11(x) ~ { cosh(karccoshx)
(-l)k cosh(karccosh( -x))
if x?l
if x:S -1
9. The Chebyshev polynomials have the global representation
Properties 8 and 9 are most easily checked by verifying that they satisfy
the three-term recurrence (including the starting values). The min-max
property of the Chebyshev polynomials follows from the intermediate-value
theorem:
Theorem 7.19 Every polynomial P n E P n with leading coefficient an =f. 0
attains a value of absolute value ? 1an 1/2 n - 1 in the interval [-1, 1]. In
particular the Chebyshev polynomials Tn (x) are minimal with respect to
the maximum-norm Ilflloo = maxxE[-l,l]lf(x)1 among the polynomials of
degree n with leading coefficient 2n - 1 .
2i + 1 ) = 0, ... , n
ti = cos ( 2n + 27r for i
Theorem 7.21 Let [a, b] be an arbitrary interval, and let to rt. [a, b]. Then
the modified Chebyshev polynomial
Proof. Since all roots of Tn(x(t)) lie in [a, b], we have c := Tn(x(to)) #- 0
and Tn is well defined. Furthermore Tn(tO) = 1, and ITn(t)1 :::; Icl- 1 for
all t E [a, b]. Suppose now that there is a polynomial Pn E P n such that
Pn(to) = 1, and IPn(t)1 < 1c1- 1 for all t E [a,b], then to is a root of the
difference Tn - Pn , i.e.,
[ 1 WN-1 w;" -1
v
N-1
WN _1
=: VN-1
n n
Co +L 2fR(Cje ijtk ) = Co + L(2Rcj cosjt - 2CJcj sinjt).
j=l j=l
Hence, the real trigonometric polynomial with the coefficients
aj = 2Rcj = Cj + Cj = Cj + CN-j
and
bj = -2CJcj = i(cj - Cj) = i(cj - CN-j)
solves the interpolation problem. For even N = 2n, the statement follows
similarly. D
(7.18)
L wjwjl = N r5kl .
j=O
In particular, the functions 'l/Jj (t) = eijt are orthonormal with respect to the
scalar product (7.18), i.e., ('l/Jk,'l/Jl) = r5kl.
L w~ = Nr5 ok .
j=O
(Observe that wj = w~.) Now the Nth unit roots Wk are solutions of the
equation
N-1
0= w N - 1 = (w - 1)(w N - 1 + w N - 2 + ... + 1) = (w- 1) L w j .
j=O
N 1 .
If k =I- 0, then Wk =I- 1 and therefore I:j~ w~ = O. In the other case we
~N-l j
·
obVlOUS 1y h ave 6j=0 ~N-l1
Wo = 6j=0 = N. D
200 7. Interpolation and Approximation
With the help of this orthogonality relation, we can easily give the
solution of the interpolation problem.
Theorem 7.25 The coefficients Cj of the tTigonometric interpolation COT-
Tesponding to the N nodes (tk' ik) with equidistant nodes tk = 27rk/N,
z.e.,
N-l N-l
¢iN(td = L cje ijtk = L CjW~ = ik fOT k = 0, ... , N - 1,
j=O j=O
aTe given by
1 N-l .
Cj = N LfkWi;J fOT j = 0, ... , N-1.
k=O
Proof. We insert the given solution for the coefficients Cj and obtain
N-l
L
j=O
cjwl ~l (~ % ikwi;j) w{
~
ior
27r
j(j) = (I, eijt ) = f(t)e- ijt dt. (7.19)
27r
In fact, the coefficients Cj can be considered as the approximation of the
integral in (7.19) by the trapezoidal sum (compare Section 9.2) with respect
to the nodes tk = 27rk/N. If we insert this approximation
ior
27r 27r ~l
g(t) dt ~ N 0 g(tk) (7.20)
k=O
7.2. Trigonometric Interpolation 201
with
N-l
Cj =~ L fke-27rijk/N for j = 0, ... ,N - 1
k=O
is called a discrete Fourier transform. The inverse mapping :r;/ is
N-l
fj = L cke27rijk/N for j = 0, ... , N - 1 .
k=O
The computation of the coefficients Cj from the values fj (or the other
way around) is in principle a matrix-vector multiplication, for which we
expect a cost of O(N 2 ) operations. However, there is an algorithm, which
requires only O(Nlog2 N) operations, the fast Fourier transform (FFT).
It is based on a separate analysis of the expressions for the coefficients Cj
for odd, respectively, even indices j, called the odd even reduction. This
way it is possible to transform the original problem into two similar partial
problems of half dimension.
(X21 L hw 2k1
k=O
N/2-1
L (hW2kl + fk+N/2 w 2 (k+N/2)1)
k=O
M-l
L
N-I
fk Wk (21+1)
k=O
L
N/2-I
(h wk (21+I) + fk+N/2 W (k+N/2)(21+1»)
k=O
M-I
L (h - fk+M )w k (w 2 l 1.
k=O
o
The lemma can be applied to the discrete Fourier analysis (h) f--+ (Cj),
as well as to the synthesis (Cj) f--+ (h). If the number N of the given points
is a power of two N = 2P , pEN, then we can iterate the process. This
algorithm due to W. Cooley and J. W. Tukey [85] is frequently called the
Cooley- Tukey algorithm. The computation can essentially be carried out
on a single vector, if the current number-pairs are overwritten. In the Al-
gorithm 7.28, we simply overwrite the input values fo, ... , iN-i' However,
here the order is interchanged in each reduction step because of the sep-
aration of even and odd indices. We have illustrated this permutation of
indices in Table 7.2. We obtain the right indices by reversing the order of
the bits in the dual-representation of the indices.
We therefore define a permutation rJ,
Table 7.2. Interchange of the indices of the fast Fourier transform for N = 8, i.e.,
p= 3.
Algorithm 7.28 Fast Fourier transform (FFT). From given input values
fo, ... , fN-l for N = 2P and w = e±27ri/N the algorithm computes the
transformed values ao, ... , aN-l with aj = L~=-Ol fkwkj.
N red := N;
z :=w;
while Nred > 1 do
Mred := N red /2;
for j := 0 to N/Nred -1 do
l := jNred ;
for k := 0 to M red - 1 do
a := flH + flH+M,ed;
flH+M,ed := (flH - izH+M,eJZ k ;
fl+k := a;
end for
end for
N red := M red ;
z:= z2;
end while
for k := 0 to N - 1 do
aa(k) := fk
end for
then
7.3. Bezier Techniques 205
The last two bases are already oriented toward interpolation and depend
on the nodes to, ... , tn. The basis polynomials, which we shall now present,
apply to two parameters a, b E R. They are therefore very suitable for the
local representation of a polynomial. In the following, the closed interval
between the two points a and b is denoted by [a, b] also when a > b, i.e.,
(compare Definition 7.37)
The first step consists of an affine transformation onto the unit interval
[0,1]'
[a, b] ---7 [0,1]
t- a
t f-----7), = ),(t) := b _ a' (7.22)
with the help of which we can usually restrict our consideration to [0, 1].
By virtue of the binomial theorem, we can represent the unit function as
The terms of this partition of unity are just the Bernstein polynomials with
respect to the interval [0, 1]. By composing these with the above affine trans-
formation (7.22), we then obtain the Bernstein polynomials with respect
to the interval [a, b].
Definition 7.30 The ith Bernstein polynomial (compare Figure 7.4) of
degree n with respect to the interval [0, 1] is the polynomial Bi E P n with
t -
Bi(t; a, b) := Bi(),(t)) = Bi ( b _ a
a) = (b _ a)n
1 (n)
i
. .
(t - a)'(b - t)n-,
Instead of Bi(t; a, b), we shall in the following often simply write Bi(t),
if confusion with the Bernstein polynomials Bi(A) with respect to [0,1] is
impossible. In the following theorem we list the most important properties
0.6
°
n
Bi(A) 2: for A E [0,1] and LBi(A) = 1 for A E R.
;=0
6. Bi has exactly one maximum value in the interval [0,1 J, namely, at
A = i/n.
7. The Bernstein polynomials satisfy the recurrence relation
(7.23)
for i = 1, ... , n and A E R.
7.3. Bezier Techniques 207
Proof. The first five statements are either obvious or can be easily verified.
Statement 6 follows from the fact that
for the binomial coefficients. For the last statement, we show that the n+ 1
polynomials Bf are linearly independent. If
n
0= L biBf()..),
i=O
Similar statements are of course true for the Bernstein polynomials with
respect to the interval [a, b]. Here the maximum value of Bi(tj a, b) in [a, b]
is attained at
i
t=a+-(b-a).
n
Remark 7.32 The property that the Bernstein polynomials form a par-
tition of unity is equivalent to the fact that the Bezier points are affine
invariant. If ¢ : Rd -> Rd is an affine mapping,
¢ : Rd ---+ Rd with A E Matd(R) and v E Rd
U 1-----4 Au + v,
then the images ¢(bi ) of the Bezier points bi of a polynomial P E P~ are
the Bezier points of ¢ 0 P.
We now know that we can write any polynomial P E P~ as a linear
combination with respect to the Bernstein basis
n
P(t) = L biBf(tj a, b), bi E Rd. (7.24)
i=O
208 7. Interpolation and Approximation
i=O i=O
i.e., the Bezier coefficients with respect to b, aare just the ones of a, bin
reverse order.
The coefficients bo, ... , bn are called control or Bezier points of P, the
corresponding polygonal path a Bezier polygon. Because of
the Bezier points of the polynomial P(t) = t are, just the maxima bi =
*
a + (b - a) of the Bernstein polynomials. The Bezier representation of the
graph fp of a polynomial P as in (7.24) is therefore just
········· ..P(t)....
with its Bezier polygon. It is striking that the shape of the curve is closely
related to the shape of the Bezier polygon. In the following we shall more
closely investigate this geometric meaning of the Bezier points. First, it
is clear from Theorem 7.31 that the beginning and ending points of the
polynomial curve and the Bezier polygon coincide. Furthermore, it appears
that the tangents at the boundary points also coincide with the straight
lines at the end of the Bezier polygon. In order to verify this property,
we compute the derivatives of a polynomial in Bezier representation. We
shall restrict ourselves to the derivatives of the Bezier representation with
respect to the unit interval [0,1]. Together with the derivative of the affine
7.3. Bezier Techniques 209
for i =0
for i = 1, ... , n - 1
for i = n.
Proof. The statement follows from
where the forward difference operator D, operates on the lower index, i. e.,
D,lb i := bH1 - bi and D,kb i := D,k-1b H1 - D,k-1b i for k> 1.
(c) P"(O) = n(n - 1)(b 2 - 2b 1 + bo) and P"(l) = n(n - l)(b n - 2bn - 1 +
bn - 2 ).
Proof. Note that B~-k (0) = 60,i and B~-k (1) = 6n -k,i. o
210 7. Interpolation and Approximation
I" B
t
co(A) n{B c Rd convex with A C B}
{ X= ~ Aixi I mEN, Xi E A, Ai 2: 0, Ai = 1 } .
As one can already see in Figure 7.5, for a cubic polynomial P E P3,
this means that the graph of P for t E [a, b] is completely contained in the
convex hull of the four Bezier points b I , b 2 , b 3 , and b 4 . The name control
point is explained by the fact that, because of their geometric significance,
the points b i can be used to control a polynomial curve. Because of The-
orem 7.31, at the position ..\ = i/n, over which it is plotted, the control
point bi has the greatest "weight" Bf(..\). This is another reason that the
curve between a and b is closely related to the Bezier polygon, as the figure
indicates.
..---,{-n-....
... ............. bi (t )
\\b 6(t)
b6 (t) , //b~(t )
Similar to the Aitken lemma, the following recurrence relation is true, which
is the base for the algorithm of de Casteljau.
Proof. We insert the recurrence relation (7.23) into the definition of the
partial polynomials bf and obtain
k
bf LbHjBJ
j=O
k-1
biBg + bHkBZ + L bHjBJ
j=l
k-1
bi (l - >.)B6k- 1) + bHk>'B~=i +L bi+j ((1 - >')BJ-1 + >.BJ=l)
j=l
k-1 k
LbHj (l- >')BJ-1 + Lbi+j>.BJ=f
j=O j=l
( 1 - >.)b k
t
- 1 + >.b k - 1
,+1 .
o
Because of b~(t) = bi , by continued convex combination (which, for t rt.
[a, b] is only an affine combination) we can compute the function value
P(t) = bo(t) from the Bezier points. The auxiliary points bf can, similar
7.3. Bezier Techniques 213
b1 bO -+ -+ bn - 1
1 1
'\. '\.
bo bO -+ -+ bn - 1 -+ b0n
0 0
Proof. The statement follows from Theorem 7.35 and from the fact that
the forwards difference operator commutes with the sum:
, n-k ,n-k
n. """ k n-k ( ) n. k """ n-k ( )
(n - k)! L /}. biBi >.. = (n _ k)!/}. L biBi >..
,=0 ,=0
n! /}.kb n - k (>..)
(n - k)! 0 .
o
Thus the kth derivative p(k) (>..) at the position>.. is computed from the
(n - k)th column of the de Casteljau scheme. In particular,
pet) b~ ,
P'(t) n(b n - 1 _ bn - 1)
1 0'
pI! (t) n(n - 1)(b~-2 - 2b~-2 + b~-2) .
So far we have only considered the Bezier representation of a single poly-
nomial with respect to a fixed reference interval. Here the question remains
open on how the Bezier points are transformed when we change the ref-
erence interval (see Figure 7.7). It would also be interesting to know how
to join several pieces of polynomial curves continuously or smoothly (see
Figure 7.8). Finally, we would be interested in the possibility of subdi-
viding curves, in the sense that we subdivide the reference interval and
214 7. Interpolation and Approximation
bl
/·············aT··· ......... ~~...
al ;' '\
/ a3 \\
/ \
~ \
bo = ao b3
all'
\.
I \.
;
/ .. .......'..
\.
;
; \.
·c~'
;
;
!
ao
b2
b.,t_·_··-···---·~-~·=-~~-··--:-;.·.l.···-·-·-········.
.r;.~.............. - \.
;
;
;
/
al//
/
/
!
ao = bo
Figure 7.9. Subdivision of a cubic Bezier curve.
compute the Bezier points for the subintervals (see Figure 7.9). According
to Theorem 7.39, the curve is contained in the convex hull of the Bezier
points. Hence, it is clear that the Bezier polygons approach more and more
closely the curve, when the subdivision is refined. These three questions
are closely related, which can readily be seen from the figures. We shall see
that they can be easily resolved in the context of the Bezier technique. The
connecting elements are the partial polynomials. We have already seen in
7.3. Bezier Techniques 215
..:!!...-k _ k! 1 _(n-l)!k!..:!!...-bn()
d,\.l bo(O) - (k _l)!.6. bo - (k -l)! n! d,\.l 0 0
for l = 0, ... , k. The statement follows, because a polynomial is completely
determined by all derivatives at one position. 0
Proof. We show (i) {o} (ii) =? (iii) =? (iv) =? (ii). According to Corollary
7.36 and Lemma 7.43, the two curves P(t) and Q(t) coincide at the position
t = a up to the kth derivative, if and only if they have the same partial
polynomials a~(t; a, b) = b~(t; a, c). The two first statements are therefore
equivalent. If a~ and b~ coincide, then so do their partial polynomials ab
and bb for l = 0, ... ,k; i.e., (ii) implies (iii). By inserting t = b into (iii), it
follows in particular that
al = aUl) = ab(b; a, b) = bb(b; a, c),
and therefore (iv). Since a polynomial is uniquely determined by its Bezier
coefficients, (iv) therefore implies (ii) and thus the equivalence of the four
statements. 0
216 7. Interpolation and Approximation
With this result in hand, we can easily answer our three questions. As a
first corollary we compute the Bezier points that are created when subdi-
viding the reference interval. At the same time, this answers the question
regarding the change of the reference interval.
Corollary 7.45 Let
ao(t;a,b) = bo(t;a,c) = co(t;b,c)
be the Bezier representations of a polynomial curve P(t) with respect to the
intervals [a, b], [a, c] and [b, c], i.e.,
n n n
(see Figure 7.9). Then the Bezier coefficients ai and Ci of the partial curves
can be computed from the Bezier coefficients bi with respect to the entire
interval via
Since the curve pieces always lie in the convex hull of their Bezier points,
the corresponding Bezier polynomials converge to the curve when continu-
ously subdivided. By employing this method, the evaluation of a polynomial
is very stable, since only convex combinations are computed in the algo-
rithm of de Casteljau. In Figure 7.10, we have always divided the reference
interval of a Bezier curve of degree 4 in half, and we have plotted the Bezier
polygon of the first three subdivisions. After only a few subdivisions, it is
almost impossible to distinguish the curve from the polygonal path.
If we do utilize the fact that only the derivatives at one position must co-
incide, then we can solve the problem of continuously joining two polygonal
curves:
Corollary 7.46 A joined Bezier curve
........................................................................ b2
bo
4
i.e., the point an = Co has to divide the segment [an-l, Cl] in the proportion
c - b to b - a. If the pieces of curves fit C 2 -smoothly, then a n -2, an-l and
an describe the same parabola as co, Cl and C2, namely, with respect to
[a, b], respectively, [b, c]. According to Corollary 7.46, the Bezier points of
this parabola with respect to the entire interval [a, c] are an -2, d and C2,
where d is the auxiliary point
d := a~_2(c; a, b) = a;'_2('\) = ci(a; b, c) = cUf.L)
218 7. Interpolation and Approximation
d
.~};""""'/""""""""
a) ......···
/
;
;
;
;
ao
,,"0.'.'.'.'.'.'.'.'.-·'·
C2 C3
and
an -2 = c6(1L) = (1-1L) C6(1L) +IL CUIL) .
"--v--" ~
= an-l =d
The joined curve is therefore C 2 -smooth, if and only if there is a point d
such that
C2 = (1 - A)d + ACI and an -2 = (1 - lL)a n - l + ILd.
The auxiliary point d, the de Boor point, will play an important role in the
next section in the construction of cubic splines.
7.4 Splines
As we have seen, the classical polynomial interpolation is incapable of solv-
ing the approximation problem with a large number of equidistant nodes.
Polynomials of high degree tend to oscillate a lot, as the sketches of the La-
grange polynomials indicate (see Figure 7.2). They may thus not only spoil
the condition number (small changes of the nodes Ii induce large changes
of the interpolation polynomial P(t) at intermediate values t =F t i ), but also
lead to large oscillations of the interpolating curve between the nodes. As
one can imagine, such oscillations are highly undesirable. One need only
think of the induced vibrations of an airfoil formed according to such an
interpolation curve. If we require that an interpolating curve passes "as
7.4. Splines 219
smooth as possible" through given nodes (ti' ji), then it is obvious to lo-
cally use polynomials of lower degree and to join these at the nodes. As
a first possibility, we have encountered the cubic Hermite interpolation in
Example 7.7, which was, however, dependent on the special prescription of
function values and derivatives at the nodes. A second possibility are the
spline junctions, with which we shall be concerned in this chapter.
Definition 7.47 Let ~ = {to, ... , tl+d be a grid of 1+2 pairwise distinct
node points
The most important spline functions are the linear splines of order k = 2
(see Figure 7.12) and the cubic splines of order k = 4 (see Figure 7.13). The
linear splines are the continuous, piecewise linear functions with respect to
the intervals [ti' ti+ll. The cubic splines are best suited for the graphic
So
a = to
s
So
a = to
curvature, i.e., of the second derivative. Thus the C 2 -smooth cubic splines
are recognized as "smooth."
It is obvious that Sk,L:> is a real vector space, which, in particular, contains
all polynomials of degree::; k - 1, i.e., Pk-l C Sk,L:>. Furthermore, the
truncated powers of degree k,
if t ? ti
if t < ti
are contained in Sk,L:>. Together with the monomials 1, t, ... ,t k - l , they
form a basis of Sk,L:> , as we shall show in the following theorem:
(7.27)
dimSk,L:> = k + l.
Proof. We first show that one has at most k + l degrees of freedom for the
construction of a spline s E Sk,L:>. On the interval [to, tIl, we can choose
any polynomial of degree::; k - 1; these are k free parameters. Because of
the smoothness requirement s E C k - 2 , the polynomials on the following
intervals [tl, t2]' ... , [tl' t£+l] are determined by their predecessor up to one
parameter. Thus, we have another l parameters. Therefore dim Sk,L:> ::; k+l.
The remaining claim is that the k+l functions in B are linearly independent.
To prove this, let
k-l 1
s(t) := L ai ti +L Ci(t - ti)~-l = 0 for all t E [a, b].
i=O i=l
7.4. Splines 221
to s (where f(t+) and f(r) denote the right, respectively, left-sided limits),
then for all i = 1, ... ,l, it follows that
k-1 I
0= Gi(s) = G i (L aje) + L Cj Gi(t - tj)~-l = Ci·
j=O j=l ~
~ =Oij
=0
k 1 .
Thus s(t) = Li:O ait' = 0 for all t E [a, b], and therefore also ao = ... =
~-1=0. D
However, the basis B of Sk,~ given in (7.27) has several disadvantages.
For one, the basis elements are not local; e.g., the support of the monomials
t i is the whole of R. Second, the truncated powers are "almost" linearly
dependent for close nodes ti, ti+1. This results in the fact that the evaluation
of a spline in the representation
k-1 I
s(t) = L ai ti + L Ci(t - td~-l
i=O i=l
X[T;,Ti+d
(t) = {I
0
if
else
Ti ::; t < Ti+ 1
' (7.28)
t - Ti Ti+k - t
- - - - Ni,k-1 (t) + Ni+1,k-1 (t) . (7.29)
Ti+k-1 - Ti Ti+k - Ti+1
Note that the characteristic function in (7.28) vanishes if the nodes coincide,
i.e.,
Nil = X[Ti,THd = 0 if Ti = Ti+1 .
The corresponding terms are omitted according to our convention % = 0
in the recurrence relation (7.29). Thus, even if the nodes coincide, the B-
splines Nik are well-defined by (7.28) and (7.29); furthermore, Nik = 0 if
222 7. Interpolation and Approximation
{~ if
else
Ti ::::; t < Ti+ 1
Proof. The first statement follows from the fact that the divided difference
[Ti,' .. , Ti+klJ contains at most the (m - l)st derivative of the function
f at the position Tj. However, the truncated power f(s) = (s - Tj)~-l
is (k - 2)-times continuously differentiable. The second statement follows
from
(
[TH1' ... ,Ti+kl (- - t)~-2 - h, ... ,Ti+k-1](' - t)~-2)
Ti+k - Ti
Lemma 7.53 With the above notation, we have for all t E [a, b] and s E R
that
n k-l
(t - s)k-l = L cpik(S)Nik(t) with CPik(S):= II (Ti+j - s).
i=l j=l
i=l
~ t - T·
L.... ( ---'-cpik(S) + , -1 -
T+k t
CPi-l,k(S)
)
N i ,k-1(t)
i=2 Ti+k-l - Ti T;+k-l - Ti
n k-2
L II (Ti+j - s) .
;=2 j=l
t- t )
.(
T
'(TiH-1 - S) + T+k
, -1 - (Ti - s) N i ,k-1(t)
Ti+k-1 - Ti Ti+k-l - Ti
, "
=t-s
n
i=2
(t - s)(t - s)k-2 = (t _ s)k-l .
Here note that the expression, which is "bracketed from below" is the linear
interpolation of t - s, hence t - s itself. 0
k, i.e.,
Pk-da, bl c span (Nlk , ... , Nnk)'
In particular,
n
1 = L Nik(t) for all t E [a, bl ,
i=l
Proof. For the lth derivative of the function f(s) := (t - s)k-1, it follows
from the Marsden identity that
n
f(l) (0) = (k - 1) ... (k -l)( _1)lt k - I - 1 = L i.p~~ (O)Nidt )
i=l
m (_1)k-m-1 ~ (k-m-1)()N ()
t = (k _ 1) ... (m + 1) L.." i.pik 0 ik t .
,=1
The (k - pt) derivative of ¢ik satisfies
k-1 k-1
¢7k- 1(s) = (II h+j -s)) = (( _1)k-1 s k-1+ ... )k-1 = (_1)k-1(k_1)!
j=l
Proof. Without loss of generality, we may assume that the open interval
lc, d[ does not contain any nodes (otherwise we decompose lc, d[ into subin-
tervals). According to Corollary 7.54, each polynomial of degree:::; k-1 over
lc, d[ can be represented by the B-splines N ik . However, only k = dim P k - 1
B-splines are different from zero on the intervallc, dr. They therefore have
to be linearly independent. 0
B := {N1 k, ... , Nnk} of the spline space Sk,.6.. They are locally linear in-
dependent, are locally supported, and form a positive partition of unity.
Each spline s E Sk,.6. therefore has a unique representation as a linear
combination of the form
n
S = LdiNik.
i=1
Here the second inequality follows from the fact that the B-splines form
a positive partition of unity. Perturbations in the function values s(t) of
the spline s = 2:~=1 CiNik and the coefficients can therefore be estimated
against each other. In particular, the evaluation of a spline in B-spline
representation is well-conditioned. Therefore, the basis is also called well-
conditioned.
hi = L i(ti-dNi2.
i=1
7.4. Splines 227
Besides this very simple case of linear spline interpolation, the case k = 4 of
cubic splines plays the most important role in the applications. In this case,
we are missing two conditions to uniquely characterize the interpolating
cubic spline S E Si,A, because
dim S4,A - number of nodes = l +k - l - 2 = 2.
N ow the starting idea for the construction of spline functions was to find
interpolating curves, which are as "smooth" as possible; we could also say
"possibly least curved." The curvature of a parametric curve (in the plane)
y(t)=lnt
00, hence the curve is straight. In order to simplify this, instead of the
curvature, we consider for small y'(t) the reasonable approximation yl/(t),
yl/(t) ~ 1/
(1 + y'(t)2)3/2 ~ Y (t),
and measure the curvature of the entire curve by the L 2 -norm
of this approximation with respect to [a, b]. The interpolating cubic splines,
which satisfy the additional properties of Corollary 7.58, minimize this
functional.
228 7. Interpolation and Approximation
(7.31)
Proof. Trivially, y" = s" + (y" - s"), and, inserted into the right-hand side
of (7.31), it follows that
(*) 2:0
if the term (*) vanishes. This holds true under the assumption (7.30),
because by partial integration, it follows that
n
- L dd(y(ti) -
;=1 ~
s(t;)) - (y(t;-d - s(ti-d)]
'" v '
=0 =0
O.
o
Corollary 7.58 In addition to the interpolation conditions SCti) = f(ti),
assume that the cubic spline s E S4,6 satisfies one of the following
boundary conditions:
7.4. Splines 229
Then there exists a unique solution s E S4,ll, which satisfies this boundary
condition. An arbitrary interpolating function y E C 2 [a, b], which satisfies
the same boundary condition, furthermore satisfies
Proof. The requirements are linear in s, and their number coincides with
the dimension n = l + 4 of the spline space S4,ll. It is therefore sufficient to
show that the trivial spline s == 0 is the only solution for the null-function
f == O. Since y == 0 satisfies all requirements, Theorem 7.57 implies that
(7.32)
Since s// is continuous, this implies s// == 0; i.e., s is a continuously dif-
ferentiable, piecewise linear function with S(ti) = 0, and is therefore the
null-function. 0
The three types (i), (ii), and (iii) are called complete, natural, and
periodic cubic spline interpolation. The physical interpretation of the
above minimization property (7.32) accounts for the name "spline." If y(t)
describes the position of a thin wooden beam, then
E -
-
I a
b (
(1
y//(t)
+ y'(tF)3/2
) 2
dt
E ~ lb y//(t)2 dt = Ily//II~.
The interpolating cubic spline s E S4,ll therefore describes approximately
the position of a thin wooden beam, which is fixed at the nodes ti. In the
complete spline interpolation, we have clamped the beam at the boundary
nodes with an additional prescription of the slopes. The natural boundary
conditions correspond to the situation when the beam is straight outside
the interval [a, b]. Such thin wooden beams were in fact used as drawing
tools and are called "splines."
Note that besides the function values, two additional pieces of informa-
tion regarding the original function f at the nodes enter in the complete
spline interpolation. Thus their approximation properties (particularly at
the boundary) are better than the ones of the other types (ii) and (iii). In
230 7. Interpolation and Approximation
is the largest distance of the nodes k We state the following related result
due to C. A. Hall and W. W. Meyer [48] without proof.
Theorem 7.59 Let I4f E S4,.c:. be the complete interpolating spline of a
function f E C 4 [a, b] with respect to the nodes ti with h := maXi Iti+1 - til.
Then
Ilf-I4fll00 ~ 3:4h41If(4)1100.
Note that this estimate is independent of the position of the nodes ti.
/
/~
/
.................-:_--._---../
d2 b7 d3 = bs
Figure 7.17. Cubic spline with de Boor points d i and Bezier points bi.
--- i
hi d
hi -
+ hi-h1 -+ hi b3i+l
1 i 1
h i- 1 + hi b hi- 1 d
3i-l - - - i·
hi hi
Graphically, this means that the straight line segment between b3i - 2 and d i ,
respectively, d i and b3i+2 is partitioned at the ratio hi-I: hi by the Bezier
points b3i - 1 , respectively, b3i + 1 . The points d i , b3i +l' b3i +2 and di+l are
therefore positioned as shown in Figure 7.17. Taken together, this implies
b hi+ hi+1 d ht - 1 d
3i+l h
i-I+ h i + hi+l i + h t-l + h t + h t+l t+l
h i- 2 + hi- 1 d hi d
i +
hi- 2 + h i- 1 + hi hi- 2 + hi- 1 + hi
i-I
.- h 2t
ai
h i- 2 + hi- 1 + hi
.- h i (hi- 2 + hi-I) hi- 1(hi + hi+d
(3i
hi- 2 + hi- 1 + hi
+ hi- 1 + hi + hi+ 1
.- hT-l
1i
h i- 1 + hi + hi+l
232 7. Interpolation and Approximation
We now only have to determine the Bezier points hand b31+2 from the
boundary conditions. We confine ourselves to the first two types. For the
complete spline interpolation, we obtain from
d1
h~/"'/\
....... ho
b2,~ - _ _b_3 _. __ ~.~\, b4
h~/-..'/ \
". hI
d_ 1 = do = bo = b1 \
b5 \----_.,.,.- - -
if, as above, Ni4 are the B-splines for the extended sequence of nodes
T = {ij}.
Exercises
Exercise 7.1 Let An(K, I) denote the Lebesgue constant with respect to
the set of nodes K on the interval I.
(a) Let K = {to, ... , tn} C I = [a, b] be pairwise distinct nodes. Suppose
that the affine transformation
2t - a - b
X: 1--->10 = [-1,1]' t 1--+ --=-b--
-a
of this interval onto the unit interval 10 maps the set of nodes K
onto the set of nodes Ko = X(K). Show that the Lebesgue constant
is invariant under this transformation, i.e.,
An(K, I) = An(Ko,Io) .
(b) Let K = {to, ... , tn} with a:S to < t1 < ... < tn :S b be nodes in the
interval I = [a, b]. Give the affine transformation
X : [to, t n ] ---> I
on I, which satisfies the property that for R = X(K) = {to, ... , tn}:
a = to < t1 < ... < tn = b .
Show that
Exercise 7.3 Count how many computations and how much storage space
an economically written program requires for the evaluation of interpolation
polynomials on the basis of the Lagrange representation.
Compare with the algorithms of Aitken-Neville and the representation
over Newton's divided differences.
Exercise 7.4 Let a = to < tl < ... < tn-l < tn = b be a distribution
of nodes in the interval I = [a, b]. For a continuous function 9 E C(1), the
interpolating polygon Ig E C(I) is defined by
h:abs = 1.
Discuss and evaluate the difference between this and the polynomial
interpolation.
Exercise 7.5 For the approximation of the first derivative of a pointwise
given function f, one utilizes the first divided difference
°
(a) Estimate the approximation error IDhf(x) - f'(x)1 for f E C3.
(Leading order in h for h -+ is sufficient.)
(b) Instead of Dhf(x), the floating point arithmetic computes Dhf(x).
Estimate the error IDhf(x) - Dhf(x)1 in leading order.
(c) Which h turns out to be optimal, i.e., minimizes the total error?
(d) Test your prediction at f(x) = eX at the position x = 1 with
h=lO-I, 5.10- 2 , 10- 2 , ... , eps.
are given by
d k ()
dt kP t =
n! ~
(n _ k)!h k ~
A
L.l
k n-k()
biBi A,
,=0
Exercise 7.7 Find the Bezier representation with respect to [0,1] of the
Hermite polynomials Hl for the nodes to, t 1, and sketch the Hermite
polynomials together with the Bezier polygons.
Exercise 7.8 We have learned three different bases for the space P3 of
polynomials of degree:::; 3: the monomial basis {I, t, t 2 , t 3 }, the Bernstein
basis {B8(t), Br(t), B~(t), B~(t)} with respect to the interval [0,1]' and the
Hermite basis {H8 (t), Hr
(t), H~ (t), H~ (t)} for the nodes to, h. Determine
the matrices for the basis changes.
Exercise 7.9 Show that a spline s = L~=l diNik in B-spline represen-
tation with respect to the nodes {T;} satisfies the following recurrence
relation:
L
n
s(t) = d~(t)Ni,k-1(t).
i=l+l
for l > O. Show that s(t) = d7- 1(t) for t E [7;, Ti+1]' Use this to derive
a scheme for the computation of the spline s(t) through continued convex
combination of the coefficients di (algorithm of de Boor).
8
Large Symmetric Systems of
Equations and Eigenvalue Problems
The previously described direct methods for the solution of a linear system
Ax = b (Gaussian elimination, Cholesky factorization, QR-factorization
with Householder or Givens transformations) have two properties in
common.
(a) The methods start with arbitrary (for the Cholesky factorization
symmetric) full (or dense) matrices A E Matn(R).
(b) The cost of solving the system is of the order O( n 3 ) (multiplications).
However, there are many important cases of problems Ax = b, where
(a) the matrix A is highly structured (see below) and most of the
components are zero (i.e., A is sparse),
(b) the dimension n of the problem is very large.
For example, discretization of the Laplace equation in two space dimensions
leads to block-tridiagonal matrices,
A= (8.1)
A q- 1,q-2 Aq-1,q-l Aq-1,q
Aq,q-l Aqq
with Aij E Matn/q(R), which, in addition, are symmetric, i.e., Aij = Af;.
The direct methods are unsuitable for the treatment of such problems; they
P. Deuflhard et al., Numerical Analysis in Modern Scientific Computing
© Springer-Verlag New York, Inc. 2003
238 8. Large Symmetric Systems of Equations and Eigenvalue Problems
do not exploit the special structure, and they take far too long. There are
essentially two approaches to develop new solution methods. The first con-
sists of exploiting the special structure of the matrix in the direct methods,
in particular its sparsity pattern, as much as possible. We have already
discussed questions of this kind, when we compared the Givens and House-
holder transformations. The rotations operate only on two rows (from the
left) or columns (from the right) of a matrix at a time, and they are
therefore suited largely to maintain a sparsity pattern. In contrast, the
Householder transformations are completely unsuitable for this purpose.
Already in one step, they destroy any pattern of the starting matrix, so
that from then on, the algorithm has to work with a full matrix. In gen-
eral, the Gaussian elimination treats the sparsity pattern of matrices most
sparingly. It is therefore the most commonly used starting basis for the
construction of direct methods, which utilize the structure of the matrix
(direct sparse solver). Typically, column pivoting with possible row inter-
change and row pivoting with possible column interchange alternate with
each other, depending on which strategy spares the most zero elements. In
addition, the pivot rule is relaxed (conditional pivoting) in order to keep
the number of additional nonzero elements (fill-in elements) small. In the
last few years, the direct sparse solvers have developed into a sophisticated
art form. Their description requires, in general, resorting to graphs that
characterize the prevailing systems (see, e.g., [39]). Their presentation is
not suitable for this introduction.
The second approach to solve large systems, which are rich in structure,
is to develop iterative methods for the approximation of the solution x.
This seems reasonable, also because we are generally only interested in
the solution x up to a prescribed precision E, which depends on the pre-
cision of the input data (compare the evaluation of approximate solutions
in Section 2.4.3). If, for example, the linear system was obtained by dis-
cretization of a differential equation, then the precision of the solution of
the system only has to lie within the error bounds, which are induced by
the discretization. Any extra work would be a waste of time.
In the following sections, we shall be concerned with the most common
iterative methods for the solution of large linear systems and eigenvalue
problems for symmetric matrices. The goal is then always the construction
of an iteration prescription Xk+l = ¢(xo, .. . , Xk) such that
(a) the sequence {xd of iterates converges as fast as possible to the
solution x, and
(b) Xk+l be computed with as little cost as possible from XO, ... , Xk.
In the second requirement, one usually asks that the evaluation of ¢ does
not cost much more than a simple matrix-vector multiplication (A, y) f--+
Ay. It is notable that the cost for sparse matrices is of the order O(n) and
not O(n 2 ) (as with full matrices), because often the number of nonzero
elements in a row is independent of the dimension n of the problem.
8.1. Classical Iteration Methods 239
Since p( G) :::: I Gil for any corresponding matrix norm, it follows that
IIGII < 1 is sufficient for p(G) < 1. In this case, we can estimate the errors
Xk - x = Gk(xo - x) by
IIxk - xII :::: IIGllklixo - xII·
Besides the convergence, we require that ¢(y) = Gy+c be easily computed.
For this purpose, the matrix Q has to be easily invertible. The matrix, which
is most easy to invert, is doubtless the identity Q = I. The method, which
thus arises for the iteration function G = I - A,
Xk+l = Xk - AXk + b,
240 8. Large Symmetric Systems of Equations and Eigenvalue Problems
In the first chapter, after the diagonal ones, the triangular systems have
proven to be simply solvable. For full lower or upper triangular matrices, the
cost is of the order O(n 2 ) per solution, for sparse matrices the cost is often
of the order O(n), i.e., of an order which we consider acceptable. By taking
Q as the lower triangular half Q := D + L, we obtain the Gauss-Seidel
method:
Xk+1 (I - (D + L)-l A)Xk + (D + L)-lb
-(D + L)-l RXk + (D + L)-lb.
It converges for any Spd-matrix A. In order to prove this property, we derive
a condition for the contraction property of p( G) < 1 of G = I - Q -1 A,
8.1. Classical Iteration Methods 241
which is easy to verify. For this, we note that every Spd-matrix A induces
a scalar product (x, y) := (x, Ay) on Rn. For any matrix B E Matn(R),
B* := A- l BT A is the adjoint matrix with respect to this scalar product,
i.e.,
(8.2)
The trick in the last manipulation consists of inserting the equation
(D + M)-l = (D + M)-l(D + M)(D + M)-l
for M = R, L, after carrying out the multiplications and then factoring.
From (8.2) it follows for all x oft 0 that
is an Spd-matrix.
The iteration matrices G of symmetric fixed-point methods have the
following properties.
Lemma 8.7 Let Xk+l = GXk+C, G = G(A) be a symmetrizable fixed-point
method, and let A an Spd-matrix. Then all eigenvalues of G are real and
less than 1,. i. e., the spectrum a-( G) of G satisfies
a-(G) c J- 00, 1[.
Now let Amin ::; Amax < 1 be the extreme eigenvalues of G. Then the
eigenvalues of G w are just
i.e.,
p(Gw ) = max {II - w(1 - Amin(G))I, 11 - w(1 - Amax( G))I} .
8.1. Classical Iteration Methods 243
Because of 0 < 1 - Amax (G) ~ 1 - Amin (G), the optimal damping parameter
w with
p(Gw ) = min p(G w ) = 1 - w(l - Amin(G))
O<w:9
j1 - w(1- Amin)j
\"
~\,
/
.
\
1 W 1
I-A m in l-A max
Gw = wG + (1 - w)I =I - wA.
244 8. Large Symmetric Systems of Equations and Eigenvalue Problems
with a suitable norm II . II. According to Remark 3.6, Yk is the (affine) or-
thogonal projection of x onto Vk with respect to the Euclidean norm IIYII =
,;r:;;:il;, and the minimization problem is equivalent to the variational
problem
(8.5)
i.e., the formula cannot be evaluated, and the minimization problem (8.4)
is thus not solvable yet. There are two ways to escape this situation: One
possibility is to replace the Euclidean scalar product with a different one,
which better suits the problem at hand. We shall pursue this approach in
Section 8.3. The second possibility, which is considered here, is to construct
a solvable substitute problem instead of the minimization problem (8.4).
246 8. Large Symmetric Systems of Equations and Eigenvalue Problems
The value p(Pk(G)) is also called the virtual spectral radius of G. This way
we finally arrive at the min-max problem
max IPd>") I = min with degPk=k and Pk (l)=l .
.\E[a,bj
(S.6)
Here note that
_ T k- 2(f) _ Tk(f) - 2tTk-l(t) _ 1-
Tk(f) - Tk(t) - Pk
and
Tk-l(f) t 2A-b-a __
2t (f) =Pk"'=Pk =Pk(l-w+wA).
Tk t t 2 - b- a
If we insert (S.6) into Yk = P k (¢ )xo for the fixed-point method ¢(y)
Gy + c, then we obtain the recurrence relation
Yk Pk(¢)xo
(Pk((1-w)Pk- 1 (¢) +W¢Pk- 1 (¢)) + (1- Pk)Pk- 2(¢))xo
Pk((l - W)Yk-l + W(GYk-l + c)) + (1 - Pk)Yk-2.
For a fixed-point method of the form G = 1- Q-l A and c = Q-1b we have
in particular
- 2 - Amax(G) - Amin(G)
t:= ;
Amax (G) - Amin (G)
To := 1, Tl := t;
Yl := w(GyO + c) + (1 - w)Yo;
for k := 2 to kmax do
Tk := 2tTk-l - T k - 2 ;
-Tk-l
Pk :=2tn;
248 8. Large Symmetric Systems of Equations and Eigenvalue Problems
IIYk-xl::; 1
I ITk(t"i'llllxo-xll - 2 - Amax(G) - Amin(G)
with t=
"J Amax(G) - Amin(G) .
I
Tk (~)12
fi:-l
~2 (VK+1)k
VK-1
Proof. One easily computes that for z := (fi: + 1)/(fi: - 1), it follows that
r::;--; VK ± 1
z ± V z2 -1 = ,
VK=f 1
Tk (~)
fi:-1
= ~ [(VK+l)k + (VK-l)k]
2 VK-1 VK+1
2 ~2 (~+l)k,
~-1
D
8.3. Method of Conjugate Gradients 249
~(pj,X-xo) ~(pj,Ax-Axo)
Xo +~ Pj = Xo +~ Pj
j=l (Pj,Pj) j=l (Pj,Pj)
k
~(pj,ro)
Xo + ~ -(-.-.) Pj . (8.9)
j=l~
:= CY-j
=: (3k+l
and because of
-ak(rk,Pk) = (-akApk, rk) = (rk - rk-l, rk) = (rk' rk)
we have
252 8. Large Symmetric Systems of Equations and Eigenvalue Problems
Pl := T'o := b - Axo;
for k := 1 to k max do
._ (T'k-l,T'k-l) (T'k-l, T'k-l) .
O'.k
(Pk,Pk) (Pk, Apk) ,
Xk := Xk-l + O'.kPk;
if accurate then exit;
T'k := T'k-l - O'.kAPk;
(3 .- (T'k,T'k) .
k+l - (
T'k-l, T'k-l )'
Pk+l := T'k + (3k+1Pk;
end for
which is, however, not feasible in this form. Because of this, (8.12) is in
practice replaced by requiring
(8.14)
at most k cg-iterations are needed, where k is the smallest integer such that
1
k? "2 V"'2(A) In(2/c).
2(~-1)k :S;c,
V"'2(A) +1
or, equivalently,
Ok <~
- c
with 0:= (~ +
V"'2(A)-1
1) > 1,
Thus the reduction factor is achieved if
k _ In(2/c)
? Ioge (2/)
c - In 0 .
Now the natural logarithm satisfies
(By differentiating both sides with respect to a, one sees that their differ-
ence is strictly decreasing for a > 1. In the limit case a --> 00 both sides
vanish.) By assumption we thus have
1 r::fA\ > In(2/c)
k ? "2V "'2 (A) In(2/c) lne
and therefore the statement. o
Remark 8.19 Because of the striking properties of the cg-method for Spd-
matrices, the natural question to ask is, which properties can be carried
over to nonsymmetric matrices? First one needs to point out the fact that
an arbitrary, only invertible matrix A does not, in general, induce a scalar
product. Two principle possibilities have been pursued so far:
If one interprets the cg-method for Spd-matrices as an orthogonal sim-
ilarity transformation to tridiagonal form (compare Chapter 6.1.1), then
one would have to transform an arbitrary, not necessarily symmetric ma-
trix to Hessenberg form (compare Remark 5.13). This means that a k-term
recurrence relation with growing k will replace a three-term recurrence re-
lation. This variant is also called Arnoldi method (compare [5]). Besides
the fact that it uses more storage space, it is not particularly robust.
8.3. Method of Conjugate Gradients 255
. jM~\.··,~:
Figure 8.2. Method of steepest descent for large 1£2(A).
axis, where Al(A) and An(A) are just the lengths of the smallest, respec-
tively, largest semiaxis of {¢(x) = I} (see Figure 8.2). The quantity 1£2(A)
thus describes the geometric "distortion" of the ellipsoids as compared with
spheres. However, the method of steepest descent converges best, when the
level surfaces are approximately spheres. An improvement can be reached
by replacing the directions of search rk by other "gradients." Here, the A-
orthogonal (or A-conjugate) vectors Pk with Pl = ro = b have particularly
good properties, which explains the historical naming.
8.4 Preconditioning
The estimates of the convergence speed for the Chebyshev acceleration,
as well as the estimates for the cg-method depend monotonically on the
condition 1£2 (A) with respect to the Euclidean norm. Our next question is
therefore the following: How can one make the condition of the matrix A
smaller? Or, more precisely: How can the problem Ax = b be transformed,
so that the condition of the resulting matrix is as small as possible? This
question is the topic of the preconditioning. Geometrically speaking, this
means: We want to transform the problem such that the level surfaces,
which in general are ellipsoids, become as close as possible to spheres.
Instead of the equation Ax = b with an Spd-matrix A E Matn(R), we
can solve also for any invertible matrix BE GL(n) the equivalent problem
because
(x, ABY)B = (x, BABy) = (ABx, By) = (ABx, Y)B.
The cg-method is therefore again applicable, if we change the scalar prod-
ucts accordingly: (., ·)B takes on the role of the Euclidean product (-, .),
and the corresponding "energy product"
(., ·)AB = (AB·, ·)B = (AB·, B·)
of A = AB the role of (-, .). This immediately yields the following iteration
xo, Xl, ... for the solution of Ax = b:
P1 := ro := b - ABxo;
for k := 1 to k max do
._ (rk-1, rk-dB (rk-1, Brk-1).
ak
(Pk,Pk)AB (ABpk, Bpk) ,
Xk := Xk-1 + akPk;
if accurate then exit;
rk := rk-1 - akABpk;
f3 .- (rk' rk)B (rk' Brk) .
k+1·- (
rk-1, rk-dB (rk-1, Brk-1)'
Pk+1 := rk + f3k+1Pk;
end for
We are of course interested in an iteration for the actual solution x = Bx,
and we thus replace the row for the Xk by
Xk = Xk-1 + akBpk .
It strikes us that the Pk now only occur explicitly in the last row. If for
this reason, we introduce the (A-orthogonal) vectors qk := BPb then this
yields the following economical version of the method, the preconditioned
eg-method or briefly peg-method.
Algorithm 8.21 peg-method for the starting value Xo.
ro := b - Axo;
q1 := Bro;
for k := 1 to kmax do
._ (rk-1, Brk-1)
ak
(qk, Aqk)
Xk := Xk-1 + akqk;
if accurate then exit;
rk := rk-1 - akAqk;
258 8. Large Symmetric Systems of Equations and Eigenvalue Problems
(3 .- (rk, Brk) .
k+1·- (rk-l, B rk-1 ) ,
qk+l := Brk + (3k+lqk;
end for
Per iteration step, each time we only need to carry out one multiplication
by the matrix A (for Aqk), respectively, by B (for Brk), thus, as compared
with the original cg-method, only one more multiplication by B.
Let us turn to the error X-Xk of the pcg-method. According to Theorem
8.17, for the error Ilx - xkliAB of the transformed iterate Xk in the "new"
energy norm
IIYIIAB := J(y, y)AB = J(ABy, By),
we have the estimate
JKB(AB) _l)k
Ilx - xkliAB ::; 2 ( Ilx - xoiIAB.
JKB(AB) + 1
Here KB (AB) is the condition of AB with respect to the energy norm 11·11 B.
However, the condition
Amax(AB)
KB(AB) = Amin(AB) = K2(AB)
J K2(AB) _l)k
Ilx- x kIIA::;2 ( J
K2(AB) + 1
Ilx-xoIIA'
We therefore seek an Spd-matrix B, a preconditioner, with the following
properties:
(a) the mapping (B, y) f-+ By is "simple" to carry out, and
(b) the condition K2(AB) of AB is "small,"
where, for the time being, we have to leave it with the vague expres-
sions "simple" and "small." The ideal matrix to satisfy (b), B = A-I,
unfortunately has the disadvantage that the evaluation of the mapping
y f-+ By = A-1y possesses the complexity of the entire problem and con-
tradicts therefore the requirement (a). However, the following lemma says
8.4. Preconditioning 259
that it is sufficient, if the energy norms 11·11 Band 11·llk1, which are induced
by Band A- 1 , can be estimated (as sharply as possible) from above and
below (compare [92]).
Lemma 8.23 Suppose that for two positive constants fJ.o, fJ.1 > 0, one of
the following three equivalent conditions is satisfied
(i) fJ.o(A- 1y, y) ~ (By, y) ~ fJ.1 (A- 1y, y) for all y E R n
(ii) fJ.o(By, y) ~ (BABy, y) ~ fJ.dBy, y) for all y E R n
(iii) Amin(AB) :::: fJ.o and Amax(AB) ~ fJ.1· (8.15)
Then the condition of AB satisfies
Proof. The equivalence of (i) and (ii) follows by inserting y = Au into (i),
because
(A- 1y, y) = (Au, u) and (By, y) = (BAu, Au) = (ABAu, u) .
Because of
\ . (AB) -_ m1n
. (ABy, y)B . (BABy,y)
/\mm = m1n -'---c_--'--"":-:-
#0 (y,y)B #0 (By, y)
and
Amax(AB) = max ...c..(A-,-B_y,.,--y-,-)_B = max (BABy, y)
yolO (y, y)B yolO (By, y) ,
the latter condition is equivalent to (iii) (compare Lemma 8.29), from which
the statement
follows immediately. o
If both norms II· liB and II· IIA-' are approximately equal, i.e., fJ.o ~ fJ.1,
then B and A -1 are called spectrally equivalent, or briefly also B ~ A -1.
In this case, according to Lemma 8.23, we have that K:2(AB) ~ l.
Remark 8.24 The three conditions of Lemma 8.23 are in fact symmetric
in A and B. One can see this most easily in the condition for the eigenvalues,
since
Amin(AB) = Amin(BA) and Amax(AB) = Amax(BA) .
(This follows, for example, from (ABy, y) = (BAy, y).) If we assume
that the vague relation ~ is transitive, then one can call the spectrally
equivalence with full right an "equivalence" in the sense of equivalence
relations.
260 8. Large Symmetric Systems of Equations and Eigenvalue Problems
Lemma 8.25 Suppose that the assumptions of Lemma 8.23 are satisfied,
and let II·IIA := ~ be the energy norm and II· liB := ~. Then
1 1
-lhllB :::: Ilx-XklIA :::: -llrkIIB;
v1Il y7IO
i. e., if B and A -1 are spectrally equivalent, then the computable residual
norm IhllB is a quite good estimate of the energy norm Ilx - xkllA of the
error.
Proof According to Lemma 8.23 and Remark 8.24, (8.15), with A and B
exchanged, implies that
or, equivalently
1 1
-(ABAy,y) :::: (Ay,y) :S -(ABAy,y).
111 110
The residue rk = b - AXk = A(x - Xk), however, satisfies
This way, in many cases one obtains a drastic acceleration of the cg-method,
far beyond the class of M-matrices, which is accessible to proofs.
Remark 8.28 For systems which originate from the discretization of par-
tial differential equations, this additional knowledge about the origin of the
system allows for a much more refined and effective construction of pre-
conditioners. For examples, we refer to the articles by H. Yserentant [93],
J. Xu [92], and (for time-dependent partial differential equations) to the
dissertation of F. A. Bornemann [9]. As a matter of fact, in solving par-
tial differential equations by discretization, one has to deal not only with
a single linear system of fixed, though high dimension. An adequate de-
scription is by a nested sequence of linear systems, whose dimension grows
when the discretization is successively refined. This sequence is solved by
a cascade of cg-methods of growing dimension. Methods of this type are
genuine alternatives to classical multigrid methods-for details see, e.g.,
the fundamental paper by P. Deuflhard, P. Leinen, and H. Yserentant [25].
In these methods, each linear system is only solved up to the precision of
the corresponding discretization. In addition, it allows for a simultaneous
construction of discretization grids, which fit the problem under consider-
ation. We shall explain this aspect in Section 9.7 in the simple example of
numerical quadrature. In Exercise 8.4, we illustrate an aspect of the cascade
principle which is suitable for this introductory text.
to approximate the eigenvalues Amin and Amax well for growing k. According
to Theorem 6.4, we can construct an orthonormal basis VI, ... ,Vk of Vk(X)
by the following three-term recurrence relation:
.- x
Vo 0, VI .-
IIxl12
Qk .- (Vk' AVk)
Wk+I .- AVk - QkVk - PkVk-I (8.17)
Pk+l .- Il wk+Iil2
Vk+I .- Wk+I falls Pk+l i= O.
Pk+l
This iteration is called Lanczos algorithm. Thus Qk := [VI, ... , Vk] is a
column-orthonormal matrix, and
and similarly that A~lx = Amax(Tk). Because of Vk+I :J Vk, the minimal
property yields immediately that
A(HI) < A(k) and A(HI) > A(k)
mIn - mIn max - max·
264 8. Large Symmetric Systems of Equations and Eigenvalue Problems
The approximations A;:{n and A~lx are therefore the extreme eigenvalues
of the symmetric tridiagonal matrix Tk, and, as such, can be easily com-
puted. However, in contrast to the cg-method, it is not guaranteed that
A;:i~ = Amin, since in general Vn(x) "I Rn. This shows in the three-term
recurrence relation (8.17) in a vanishing of (3k+l for a k < n. In this case,
the computation has to be started again with an x E Vk (x).1. .
The convergence speed of the method can again be estimated by utilizing
the Chebyshev polynomials.
Theorem 8.31 Let A be a symmetric matrix with the eigenvalues Al :::;
... :::; An and corresponding orthonormal eigenvectors T)l, ... ,T)n. Further-
more, let fJl :::; ... :::; /-lk be the eigenvalues of the tridiagonal matrix Tk of
the Lanczos method for the starting value x "I 0, and with the orthonormal
basis VI, ... ,Vk ofVdx) as in {8.17}. Then
(An - Al)tan2(~(vl,T)n))
/\\ > "k > /\\ - -'-----;~...,----'--'--'--.:....:...
n -,.., - n TLI (1 + 2pn) ,
( ) ._ (x, Ax)
11 x .- (x, Bx) ,
then we obtain the following statement, which is an analogue to Lemma
8.29.
Lemma 8.32 Let Amin and Amax be the smallest, respectively, largest
eigenvalue of the generalized eigenvalue problem Ax = ABx, where the
matrices A, B E Matn(R) are symmetric and B is positive definite. Then
. (x, Ax) (x, Ax)
Amin = mIn (
x#O x, Bx
) and Amax = max
x#O
( B)'
x, x
Exercises
Exercise 8.1 Show that the coefficients Pk = 2ITk-l(f)/Tk(f), which
occur in the Chebyshev acceleration, satisfy the two-term recurrence
relation
1
PI = 2, Pk+l = 1 2
1 - 4(} Pk
'
where () .- 1/1. Furthermore, show that the limit of the sequence {pd
satisfies
2
lim Pk =: P = - - - = = =
k-->oo 1+~'
Exercise 8.2 Given are sparse matrices of the following structure (band
matrices, block diagonal matrices, arrow matrices, block cyclical matrices).
* * * *
* * * *
* * *
* * * * *
* *
* *
* * *
* * *
Exercises 267
* * * * * * * * * *
* * * * * *
* * * * * *
* * * * * *
* * * * * *
* * * * * *
Estimate the required storage space and cost of computation (number of
operations) for
(a) LU-factorization with Gaussian elimination without pivoting, respec-
tively, with column pivot search and exchange of rows.
(b) QR-factorization with Householder transformations without, respec-
tively, with exchange of columns.
(c) QR-factorization with Givens transformations.
Exercise 8.3 Let B ~ A -1 be a spectrally equivalent preconditioning ma-
trix. In the special case that B is of the form B = CC T , a preconditioned
cg-method can also be derived differently. For this, from the system Ax = b,
one passes formally to the equivalent system
Ax=b with A=CACT , x=C-Tx, and b=Cb.
Here A is again an Spd-matrix. One applies the classical cg-method to this
transformed system. Using this idea, derive a convergence result, which
is based on the energy norm of the error (which, as is well-known, is not
directly approachable). Use this approach to derive two different effective
variants of the pcg-method, one of which coincides with our Algorithm 8.2l.
Derive both variants, and consider which would be preferable under which
conditions. Implement both variants in the special case of the incomplete
Cholesky factorization, and carry out computational comparisons.
Exercise 8.4 Short introduction to the cascade principle. We consider a
sequence of linear systems
Ajxj=bj , j=l, ... ,m
of dimension nj, as it could come up by successively finer (uniform) dis-
cretization of an (elliptic) partial differential equation. The dimension of
the systems is assumed to grow geometrically, i.e.,
nj+1 = Knj for a K > l.
We seek an approximate solution xm of the largest system (corresponding
to the finest discretization) such that the error in the energy norm satisfies
Ilxm - xmllAn> s:: cmo
for given c, 0 > o. Assume that the connection of the linear systems with
the discretizations is shown by the following properties:
268 8. Large Symmetric Systems of Equations and Eigenvalue Problems
(i) The matrices Aj are symmetric positive definite, and their conditions
are uniformly bounded by
K2(A j )::::: C for j = l, ... ,m.
(This is only true after a suitable preconditioning.)
(ii) If Xj is an approximation of Xj with
Ilxj - xjllA j ::::: E j 8,
a b t
since y(b) = 1(f). We will come back to this formulation in the course of
this chapter.
f 2: 0 ~ 1(f) 2: 0 .
9.1. Quadrature Formulas 271
Proof. For any perturbation Of E LI[a, b], the perturbation of the integral
can be estimated by
zoidal sum (see Figure 9.2). We partition the interval into n subintervals
f
-" ;
;
;
;
;
;
;
;
;
;
;
;
;
To ;
ho
i
a = to t
Figure 9.2. Trapezoidal sum with equidistant nodes.
T(n) := t
i=1
Ti, T i := ~i (1(ti-l) + f(ti)) .
b-a
T := -2-(1(a) + f(b)) (9.1 )
it is obvious that
R(n)
mIn -
< T(n) <
-
R(n)
max
.
For continuous f E Cora, b], the convergence of the Riemann sums therefore
implies the convergence of the trapezoidal sums:
R(n)
mm
< T(n) < Rmax
(n)
Below (see Lemma 9.8), we shall give the approximation error in more
detail.
The trapezoidal sum is a simple example for a quadrature formula, which
we define as follows.
Definition 9.3 A quadrature formula i for the computation of the definite
integral is a sum
n
i(J) = (b - a) L Ai!(ti) ,
i=O
with nodes to, ... ,tn, and weights AO, ... , An such that
n
(9.2)
measures by how much the quadrature formula deviates from the positivity
requirement. Because of the results for the scalar product, we do not have
to worry about the stability of the evaluation of a quadrature formula (see
Lemma 2.30).
which were introduced in the last chapter. In particular, for given nodes
to, ... , tn, the function
n
j(t) := P(f I to,···, tn ) = L f(ti)Lin(t),
i=O
i(f) = (b - a) L Ad(ti) ,
i=O
which is exact for all polynomials PEP n of degree less than or equal to n.
j=o j=o
and thus, in a unique way, obtain back the weights Ai = (b - a)-l I(Lin) =
~. 0
Ain = --
b- a
1 lb II--
a j=O
n t - tj
ti - tj
1 in n
dt= -
n 0
II-- j
S -
i - j
j=O
ds.
j#i j#i
The weights Ain, which are independent of the interval boundaries, only
have to be computed once, respectively, given once. In Table 9.1, we have
listed them up to order n = 4. The weights, and therefore also the quadra-
ture formulas, are always positive for the orders n = 1, ... , 7. Higher orders
are less attractive, since starting with n = 8, negative weights may occur.
In this case, the Lebesgue constant is the characteristic quantity (9.3), up
to the normalization factor (b - a)-I. Note that we have already encoun-
tered the Lebesgue constant in Section 7.1 as the condition number of the
polynomial interpolation.
1 1
"2 "2
1
~~ 1" (,) Trapezoidal rule
2 1
(;
4
(;
1
(; ~~f(4)(,) Simpson's rule, Kepler's barrel rule
3 1
8 8 8 8
3 3 1 3:0 5
f(4)(,) Newton's 3/8-rule
4 7
90
32
90
12
90
32
90
7
90
~~~ f(6)(,) Milne's rule
min h(t)
tEla,b] Ja
r
b
g(s) ds:S;
Ja
b
r
h(s)g(s) ds:S; max h(t)
tEla,b] Ja
b
g(s) ds. r
Therefore, for the continuous function
°
there exist to, tl E [a, bj with F(to) ::::: and F(td :s; 0, and thus, because
of the mean value theorem, there exits also aTE [a, bj such that F( T) = 0,
or, in other words
as required. o
Lemma 9.6 Let f E C 2 ([a, b]) be a twice continuously differentiable
function. Then the approximation error of the trapezoidal rule
b-a
T = -2-(f(a) + f(b))
with step size h := b - a can be expressed by
T -
Ja
rbf = h123 J"(T)
for some T E [a, bj.
with some T = T(t) E [a, b], which is independent of t. Inserted into the
quadrature formula, Lemma 9.5 implies that
lb f = lb P(t)dt+ lb[t,a,bjf.~dt
:so
T +
2 ,a J
r
f"(T) b(t - a)(t - b) dt
~
'V
(b_a)3
--6-
9.2. Newton-Cotes Formulas 277
T-lb a
f= h 3 1"(7).
12
o
a+b
Q(t) = P(t) + 'Y (t - a)(t - -2-) (t - b) ,
'- J
V
= W3(t)
where 'Y = QI//(t)/6 E R is a constant. This implies for the integral that
a+b a+b
to = a, h = -2-' t2 = -2-' t3 = b .
I f = 8 ititi-1 f .
a
b n
Thus
n
T(h) -lb a
f = (b - a)h 21"(T)
12
with some T E [a, b].
9.3. Gauss-Christoffel Quadrature 279
Proof. According to Lemma 9.6, there exists a E [ti-1, til such that
iti
7i
h3
T; - f = 121"(7;) ,
til
and therefore
T(h)- I =8
a
b
f
n (
T; -
iti)
ti-l f = 8 h121"(7;) = (b - 12a)h
n 3 2 1 n
:;;: 81"(7;) .
Since
1
L 1"(7i)
n
min 1"(t) ::; - ::; max 1"(t) ,
tE[a,b] n ;=1 tE[a,b]
and according to the mean value theorem, there exists a 7 E [a, b] such that
and therefore
as claimed. o
(l
1
IIPII:= (P,P)~ = b
w(t)p(t)2 dt) '2 < 00
are well-defined and finite for all polynomials PEP k and all kEN.
In contrast to Section 9.2, the interval may be infinite here. It is only
280 9. Definite Integrals
11k := lb tkW(t) dt
are bounded. For the definition of the absolute condition, we measure the
perturbations Sf in a natural way with respect to the weighted L 1 -norm
This way, the results of Lemma 9.1 remain valid also for weighted integrals,
only the interpretation of 1(f) is changed.
In Table 9.2, the most common weight functions are listed, together with
the corresponding intervals.
draw conclusions from our wishful thinking, which may be helpful in the
solution of the problem.
(f, g) = lb w(t)f(t)g(t) dt .
(Pj,Pn+d = 1
a
b
WPjPn+1 = in(PjPn+d
n
= l:AinPj(Tin)Pn+l(Tin) = O.
i=O ~
=0
o
Therefore the nodes Tin, which we seek, have to be roots of pairwise
orthogonal polynomials {Pj } of degree deg Pj = j. Such orthogonal poly-
nomials are not unknown to us. According to Theorem 6.2, there exists one
and only one family {Pj } of polynomials Pj E P j with leading coefficients
one, i.e., Pj(t) = t j + ... so that
From Theorem 6.5, we already know that the roots of these orthogonal
polynomials are real and have to lie in the interval [a, b]. We have therefore
constructed, in a unique way, candidates for the integration nodes Tin of
the quadrature formula in: the roots of the orthogonal polynomial Pn+l'
Once the nodes are determined, there is no choice for the weights: In order
that at least polynomials PEP n up to degree n are integrated exactly,
according to Lemma 9.4, the weights
Ain := _l_l
b- a a
b
Lin(t) dt
have to be chosen with the Lagrange polynomials Lin (Tjn) = 6ij. This, at
first, only guarantees exactness for polynomials up to degree n, which is in
fact enough.
Lemma 9.10 Let TOn, . .. , Tnn be the roots of the (n + 1)st orthogonal
polynomial Pn+1. Then any quadrature formula in(f) = 2:7=0 Ad(Tin)
satisfies
lb lbwP = wQPn+1 +
'-..-'
lb lbwR = wR = in(R) .
=0
On the other hand,
n n
i.e., i is exact on P 2n + 1 . o
We collect our results in the following theorem.
Theorem 9.11 There exist uniquely determined nodes Tan, . .. , Tnn and
weights AO n , ... ,Ann such that the quadrature formula
n
in(f) = L Ainf(Tin)
i=O
integrates exactly all polynomials of degree less than or equal to 2n + 1, i. e.,
in (P) = lb wP for P E P 2n + 1 .
The nodes Tin are the mots of the (n + l)st orthogonal polynomial P n+1
with respect to the weight function wand the weights
Ain := b ~a lb Lin(t) dt
with the Lagrange polynomials Lin (Tjn) = 6ij. Furthermore, the weights are
all positive, Ain > 0, i. e., in is a positive linear form, and they satisfy the
equation
(9.4)
Pmof. We only have to verify the positivity of the weights and their rep-
resentation (9.4). Suppose Q E P 2n +l is a polynomial such that Tkn is
the only node, at which it does not vanish, i.e., Q(Tin) = 0 for i i= k and
Q(Tkn) i= o. Then, obviously,
lba wQ
1
= AknQ(7"kn), hence Akn = Q(Tkn) lb
a wQ.
9.3. Gauss-Christoffel Quadrature 283
If we set, e.g.,
Q(t):=(Pn+l(t))2,
(t - Tkn)
then Q E P 2n has the required properties, where Q(Tkn) = P~+l (Tkn)2.
Thus the weights satisfy
Akn 1-
= --
Q(Tkn)
lb a
wQ = lb (
a
W I
Pn+dt)
Pn+1(Tkn)(t-Tkn)
)2 dt > 0 ;
i.e., all weights are positive. In order to verify formula (9.4), we put
Akn = I
Pn+l(Tkn)Pn(Tkn)
1 lb
a
w(t)
Pn+l(t)
t - Tkn
Pn(t) dt .
l a
b A
wf - In (f) =
f(2n+2) (T)
(2n + 2)! (Pn+l, Pn+d
= Pn + 1 (t)2 2: 0
284 9. Definite Integrals
in
lb + ( lb
Since integrates the interpolation P exactly, it follows that
f(2n+2) (T) 2
a
wP )1
2n + 2. a
WPn + 1
n f(2n+2) (T)
~ .Ain ~ + (2n + 2)! (Pn+1, Pn +1) .
= f(Tin)
D
,
In(f) = - -
7r Ln
f(Tin) wIth Tin =
. 2i + 1
COS - - 7 r .
n + 1 ,=0
. 2n+ 2
J 1
-1
f(t)
~
dt _ j (f) _
n
7r
- 22n+l(2n + 2)!
f(2n+2)(T)
t
R,
(9.5)
286 9. Definite Integrals
(9.6)
= (PO, PO) = 1
(9.7)
o
The actual determination of the weights Ain and nodes Tin is based on
techniques which are based on the contents of Chapter 6. For this recall
that, according to Theorem 6.2, the orthogonal polynomials Pk with respect
to the weight function w satisfy a three-term recurrence relation
(9.8)
where
(3k= (tPk-1, Pk-d ,"/k=2 (Pk- 1, Pk- 1)
(Pk- 1 , Pk- 1 ) (Pk-2, Pk- 2)
We therefore assume that the orthogonal polynomials are given by their
three-term recurrence relation (9.8), which, for k = 0, ... ,n, we can write
as a linear system Tp = tp + r with
(31 1
'Y~ (32 1
T·-
9.4. Classical Romberg Quadrature 287
and
p:=(Po(t), ... ,Pn(t)f, r:=(O, ... ,o,-Pn+1 (t)f·
Thus
i.e., the roots of P n + 1 are just the eigenvalues of T, where the eigenvector
p( T) corresponds to an eigenvalue T.
Because the roots Tin of Pn +1 are all real, one could have the idea that the
eigenvalue problem Tp = tp can be transformed into a symmetric eigenvalue
problem. The simplest possibility would be to scale with a diagonal matrix
D = diag(d o, ... , d n ) to obtain
Tp = tp withP= Dp , T = DTD- 1 ,
with the hope of achieving T = TT. More explicitly, diagonal scaling, as
applied to a matrix A E Matn+1 (R), satisfies
A~ DAD-1
A f-+:= WIt
~ = -;Faij
. h aij di
J
=TT,
11 (3n In +1
In+l (3n
at which the function was evaluated. In contrast to this, for the Romberg
quadrature, we employ a sequence of grids, and we try to construct a better
approximation of the integral from the corresponding trapezoidal sums.
with coefficients
where B2k are the Bernoulli numbers, and with the remainder
B2k ~ (2k)! ,
so that the series (9.9) in general also diverges with m ~ 00 for analytic
functions f E CW[a, b]; i.e., in contrast to the series expansion, which we
know from calculus (like Taylor or Fourier series), in Theorem 9.16 the
function is expanded into a divergent series. At first, this does not seem to
make any sense; in practice, however, the finite partial sums can often be
used to compute the function value with sufficient accuracy, even though
the corresponding series diverges.
In order to illustrate the fact that such an expansion into a divergent se-
ries can be numerically useful, we consider the following example (compare
[55]).
Example 9.18 Let f(h) be a function with an asymptotic expansion in h
such that for all hER and n E N
n
f(h) = 2:) -l)kk! . hk + ()( _l)n+l(n + I)! hn+l for a 0 < () = ()(h) < 1.
k=O
J(f) := lb f(t) dt
yielding
from the two trapezoidal sums (see Figure 9.4). We can also explain
1
4~0------~~~------------------~------~
h2
2 -- .!
4
2= 1
h1 h2
that T(h) is only defined for discrete values h (see above). In addition, we
require that the method converges to TO for h --7 0, i.e.,
lim T(h) = TO . (9.13)
h-->O
--7 Tk-l,k-l
""
--7 Tk,k-l
Remark 9.21 In accordance with [19], we start to count with 1 in the ex-
trapolation methods. As we shall see below, this leads to a more convenient
9.4. Classical Romberg Quadrature 293
(9.15)
Ck-I,I ---+ ---+ ck-I,k-I
\. \. \.
Ckl ---+ ---+ ck,k-I ---+ Ckk
Cik ~ hpl hLk+1 ... hf for 1 :::; k :::; i :::; m and h j :::; h ---+ 0.
~
k factors
More precisely,
This theorem says that, essentially, for each column of the extrapolation
table, we can gain p orders. However, since we have to deal with asymptotic
expansions, and not with series expansions, this viewpoint is too optimistic.
The high order is of little use, if the remainders of the asymptotic expansion,
which are hidden behind the O(h;k+1)p), become very large. For the proof
of the theorem we use the following auxiliary statement.
Lemma 9.23 The Lagrange functions L o , ... , Ln with respect to the nodes
to, ... ,tn satisfy
n for m=O
LLj(O)tj = for 1:::; m:::; n
j=O for m=n+1
294 9. Definite Integrals
Proof. For 0 ::; m ::; n, P(t) = t m is the interpolating polynomial for the
points (tj, tj) for j = 0, ... ,n, and therefore
n n
P(t) = tm = LLj(t)P(tj ) = LLj(t)tj .
j=O j=O
If we set t = 0, then the statement follows in the first two cases. In the case
m = n + 1, we consider the polynomial
n
Q(t) := t n+1 - L Lj(tW;+l .
j=O
This is a polynomial of degree n + 1 with leading coefficient 1 and roots
to, ... , tn, so that
Q(t) = (t - to)··· (t - tn) ,
and, in particular,
n
LLj (0)tr 1 = -Q(O) = (-l)nto···tn .
j=O
D
These numbers, of course, depend on the chosen sequence nl, n2, . ... To
each increasing sequence
we obtain
• 2 new f evaluations
a b
o~----------~------------~o
1 new f evaluation
a h2 = H/2 b
Figure 9.5. Computation of the trapezoidal sums for the Romberg sequence.
1
+ hi L
ni-l
if i = 2k
F B := {l, 2, 3, 4, 6, 8,12,16,24, ... }, if i = 2k + 1
if i = 1
Table 9.4. Weights Aj for the diagonal and subdiagonal elements of the extrap-
olation tableau at the nodes tj for the Bulirsch sequence (Note that Aj = Anj,
tj = tn-j).
0 1
"6 4
1 1
"3
1
"2 I: IAil
1
T 1 ,1 "2 1
1 1
T 2 ,1 4 "2 1
1 2
T 2 ,2 "6 "3 1
1 3 2
T 3 ,2 10 [; "3 1
11 27 8
T 3 ,3 120 40 -15 2.067
13 16 27 94
T 4 ,3 210 21 -35 105 4.086
151 256 243 104
T 4 ,4 2520 315 - 280 105 4.471
FH = {1,2,3, ... }, ni = i
298 9. Definite Integrals
1 1
-1
dt
10- 4 + t2
(9.17)
(compare Figure 9.10) with relative precision tol = 10- 8 . The values Tkk
of the extrapolation tableau are given in Table 9.5.
Table 9.5. Romberg quadrature for the needle impulse f(t) = 1/ (10- 4 +x 2 ). (ckk'
relative precision, A k , cost in terms of f evaluations.)
k Tkk ckk Ak
step size H = b - a, then all regions of the basic interval [a, b] are equal,
and we apply the same method everywhere. This cannot be the best way
to integrate functions. Rather, we should partition the integration interval
so that we can choose in each subregion a method which is tailor-made to
the function and which thus, with as little effort as possible, determines
the integral with a given relative precision. Such methods, which control
themselves in the course of the computation by adapting to the problem
at hand, are called adaptive methods. Their essential advantage consists
of the fact that a large class of problems can be handled with the same
program, without the user having to make adaptions, i.e., without having to
insert apriori knowledge about the problem into the method. The program
itself tries to adapt itself to the problem. In order to achieve this, the
intermediate results, which are computed in the course of the algorithm are
constantly checked. This serves two purposes: On one hand, the algorithm
can thus automatically choose an optimal solution strategy with respect to
cost, and thus solve the posed problem effectively. On the other hand, this
ensures that the program works more safely, and hopefully does not produce
fictitious solutions, which, in reality do not have a lot to do with the posed
problem. It should also be a goal that the program can recognize its own
limitations, and ,for instance, detect that a user-prescribed precision cannot
be achieved. This adaptive concept can in general only be carried out if a
reasonable estimate for the occurring approximation error is available and
can be computed at relatively little cost.
J:
In quadrature the problem is more precisely formulated: Approximate the
integral I = f(t) dt up to a given relative precision tol; i.e., compute an
approximation i of I so that
(9.19)
where Iscal ("scal" for "scaling") should be of the order of III. This value
is either given by the user together with tol, or is obtained from the first
approximations.
Whereas the classical Romberg quadrature merely adapts the order of
the method in order to achieve a desired precision, in the adaptive Romberg
quadrature the basic step size H is also adapted. There are two principal
possibilities to attack the problem: The initial value method (in this section)
and the boundary value method (two sections later).
300 9. Definite Integrals
The following considerations are based on [19] and [21]. We start with
the formulation of the quadrature problem as an initial value problem
and try to compute the integral successively from the left to the right (see
Figure 9.6). Here we partition the basic interval in suitable subintervals
[ti, t HI ] of length Hi := tHI - t i , which are adapted to the function f and
y(t) = J~ f(T)dT
a t b
a t+H b
I Input I
H, q
~ Quadrature
step t+H
It
Output
-
,H, ij
1
possible error message
(9.20)
that
Here the constants T2k depend on the integrand j, and thus on the problem.
Inserted into (9.21), it follows that
where
(9.22)
(9.23)
then, under the assumptions (9.22) and (9.23), the following picture
emerges.
Ell
i
E21 f- E22
i i
E31 f- E32 f- E33
i i i
The most precise approximation inside the row k is thus the diagonal el-
ement Tkk. It would therefore be ideal if we could estimate the error Ekk.
However, we are in a dilemma here: In order to estimate the error of Tkk,
we need a more precise approximation j of the integral, e.g., Tk+1,k. With
such an approximation at hand, it would be possible to estimate Ekk, e.g.,
by
But once we have computed T k +1,k , we can also directly produce the
(better) approximation Tk+1,k+1. However, we do not have an estimate of
the error Ek+1,k+1, unless we again compute a more precise approximation.
We escape this dilemma by the insight that the second-best solution may
also be useful. The second-best approximation available including the row
k, is the subdiagonal element T k ,k-1. The approximation error Ek,k-1 can
be estimated from known data up to this row as follows.
and therefore
( 1 - -Ekk)
- - Ek,k-1 < Ek,k-1 < ( 1+ ---
Ekk)
Ek,k-1.
Ek,k-1 Ek,k-1
'-v--' '-v--'
«1 «1
o
304 9. Definite Integrals
E = E(t, H) = lii+H(J) - I t
t+H
f(T) dTI ~ "Y(t)HP+l
with a number "Y(t), which depends on the left boundary t and on the
problem. With the data E and H of the current integration step, we can
estimate "Y(t) by
(9.25)
Suppose that H is the step size, for which we would have achieved the
desired precision
tol = E(t, H) ~ "Y(t)HP+l . (9.26)
By employing (9.25), we can compute an a posteriori approximation of H
from E and H, because
(9.27)
9.5. Adaptive Romberg Quadrature 305
We also call H the optimal step size in the sense of (9.26). Should H
be much smaller than the step size H that we actually used, then this
indicates that H was too large, and that we have possibly jumped over
a critical region (e.g., a small peak). In this case we should repeat the
integration step with H as basic step size. Otherwise we can use H as the
recommended step size for the next integration step, because for sufficiently
smooth integrands f and small basic step sizes H, the number '1(t) will
change only little over the integration interval [t, t + H], i.e.,
'1(t) ~ '1(t+H) for H ~ o. (9.28)
This implies
c(t + H, H) ~ '1(t + H)Hp+1 ~ '1(t)H P+l ~ tol ,
so that we may assume that H is also the optimal step size for
the next
step. The algorithm of course has to verify the Assumption (9.23) as well
as the Assumption (9.28), and possibly correct the step size.
So far, we have only considered a fixed order p, and we have determined
an optimal step size H for this order. The Romberg quadrature, as an
extrapolation method, produces an entire series of approximations Tik of
various orders p = 2k for the column index k, which could also vary, where
the approximation error satisfies
Eik~IT2kl '1ikH2k+l for f E C 2k [t, t + H] .
In the course of the investigation in the previous section, we worked out
the error estimator Ek,k-1 for the sub diagonal approximation T k,k-1 of the
order p = 2k - 2. If we now replace the unknown error E = Ek,k-l in (9.27)
by Ek,k-1, then we obtain the suggested step size
Hk := 2k-l !tol H
Ek,k-1
where we again have introduced the safety factor p < 1, in order to match
a possible variation of '1(t) in the interval [t, t + h], compare Assumption
(9.28).
For each column k the above derivation supplies a step-size proposal Hk,
the realization of which requires an amount of work Ak associated with the
subdivision sequence F. Still missing is a criterion to choose, in each basic
step j = 1, ... , J, from the triples
(k,Hk,A k ) = (column, step-size proposal, work amount)
for k = 1, ... , kmax = q /2 the best triple (kj, H kj , Ak j ). In an abstract
setting, we to have solve the following optimization problem: minimize the
total amount of work
J
Atotal = L Akj = min
j=l
306 9. Definite Integrals
2.:ihj = T = const.
j=l
The total number J of basic steps is here dependent on the selected se-
quence of indices {k j }, which means it is unknown in advance. We therefore
have to tackle a discrete minimization problem. For this type of problem
in particular, there exists a quite efficient established heuristics, the greedy
algorithm-see, e.g., Chapter 9.3 in the introductory textbook [3] by M.
Aigner. At step j this algorithm requires the minimization of the work per
unit step
if H > H then
H := H; (repeat the step for safety)
else
i:=t+H;
1= T kk ;
done := true; (done)
end
end
i:=i+1;
end
(2) The algorithm notices only rather late, namely, only after crossing
k max that a suggested step size H was too large and that it does not
pass the accuracy criterion (Ek,k-l :s: tol) for any column k.
(3) If our assumptions are not satisfied, then the error estimator does
not work. It may therefore happen that the algorithm recognizes an
incorrect solution as correct and supplies it as an output. This case
is referred to as pseudo-convergence.
Here we only want to discuss briefly one such possibility, which is based
on the information theory of C. E. Shannon (see, e.g., [75]). For more details
we refer to the paper [19] of P. Deufihard. In this model, the quadrature
algorithm is interpreted as an encoding device. It converts the information,
E;~n) = a(Ai - A i - k + 1)
with a constant a > O. The amount of information on the output side,
the output entropy E;~ut), can be characterized by the number of correct
binary digits of the approximation Tik. This leads to
I (1)
(out) = og2
E ik -
Eik
.
We now assume that our information channel works with a constant noise
factor 0 < f3 ::::: 1,
E (out) _ f3E(in) .
ik - ik'
i.e., that input and output entropies are proportional to each other. (If
f3 = 1, then the channel is noise free; no information gets lost.) In our case
this means that
(9.29)
with c := af3. In order to determine the proportionality factor c, we need
a pair of input and output entropies. In the above we required that for
a given column k, the subdiagonal error Ek,k-l is equal to the required
precision tol, hence Ek,k-l = tol. By inserting this relation into (9.29), we
conclude that
Having thus determined c, we can then determine for all i, j, which errors
Eijare to be expected by our model. If we denote these errors, which the
9.5. Adaptive Romberg Quadrature 309
where m is the number of basic steps, which were obtained in the adap-
tive quadrature (a-posteriori error estimate). The chosen strategy obviously
leads to a uniform distribution of the local discretization errors. This
principle is also important for considerably more general adaptive dis-
cretization methods (compare Section 9.7). If one wants to prescribe a
global discretization error, which is independent of m,
II - 11 :::; Iscal' E ,
then, following a suggestion by C. de Boor [14], in the derivation of the
order and step-size control, the precision tol is to be replaced by
H
tol ---+ --E .
b-a
This leads to smaller changes of the order and step-size control, but also
to additional difficulties and a less robust algorithm.
Or-----------------
-1 o 1
Figure 9.10. Automatic subdivision into basic steps by the program TRAPEX.
Discontinuous Integrands
A common problem in numerical quadrature are discontinuities of the in-
tegrand f or its derivatives (see Figure 9.11). Such integrands occur, e.g.,
a b
the adaptive Romberg method, freezes at the jumps. Thus the jumps can
be localized and treated separately.
Needle Impulses
We have considered this problem repeatedly in the above. It has to be
noted, however, that in principle, every quadrature program will fail if the
peaks are small enough (compare Exercise 9.8). On the other hand, such
integrands are pretty common: just think of the spectrum of a star whose
entire radiation is to be computed. If the positions of the peaks are known,
then one should subdivide the interval in a suitable way, and again compute
the sub integrals separately. Otherwise, there only remains the hope that
the adaptive quadrature program does not "overlook" them.
Highly Oscillatory Integrands
We have already noted in Section 9.1 that highly oscillatory integrands are
ill-conditioned from the relative error viewpoint. As an example, we have
plotted the function
f(t) = cos(te 4 t 2 )
for t E [-1,1] in Figure 9.12. The numerical quadrature is powerless against
such integrands. They have to be prepared by analytical averaging over
subintervals (pre-clarification of the structure of the inflection points of the
integrand).
-0.2
-0.4
-0.6
-0.8
l llLL_0:":.8:--..l.U...'--~"---_~0.-::-2-~O:-----:-O.'::-2_-:-'-:-_--::'"'L-.-:":,.u-uw
-l W
2
Figure 9.12. Highly oscillatory integrand f(t) = cos(te 4t ).
1 v't
71"
t=o'-v--'
cos t dt .
I(t)
I(A) := lb f(t, A) dt .
If, however, a peak varies with the parameter, and if this dependence is
known, then one can employ parameter-dependent grids. One transforms
the integral in dependence of A in such a way that the integrand stays the
same qualitatively (the movement of the peak is, e.g., counter-balanced) or,
in dependence of A, one shifts the adaptive partitioning of the integration
interval.
The last possibility requires a lot of insight into the respective problem.
We choose a fixed grid adapted to the respective problem and integrate
over this grid with a fixed quadrature formula (Newton-Cotes or Gauss-
Christoffel). In order to do this, the qualitative properties of the integrand
need to be largely known, of course.
Discrete Integrands
In many applications, the integrand is not given as a function f, but only
in the form of finitely many discrete points
refined at places where it is necessary for the required precision, i.e., the
qualitative behavior of the integrand becomes visible in the refinement of
the grids. The nodes condense where "a lot happens." In order to achieve
this, one requires two things: a local error estimator and local refinement
rules.
The local error estimator is typically realized by a comparison of methods
of lower and higher order, as we have seen in Section 9.5.3 in the subdiago-
nal error criterion. Here the theory of the respective approximation method
enters. In the definition of refinement rules, aspects of the data structures
play the decisive role. Thus, in fact part of the complexity of the mathe-
matical problem is transferred to the computer science side (in the form of
more complex data structures).
(I ~
;
;
I
T(J)
S(J)
= %(f(tz) + 2f(tm ) + f(t r ))
= ~ (f(tz) + 4f(tm) + f(t r))
t~
Figure 9.13. Trapezoidal and Simpson's rule for an interval J:= (t[,trn,tr ).
T(J) and S(J) are approximations of the integral Jt~r f(t) dt of order O(h3)
9.7. Adaptive Multigrid Quadrature 315
As in the Romberg quadrature, we assume (at first not checked) that the
method of higher order, the Simpson rule, is locally better, i.e.,
Under this assumption, the sub diagonal estimator of the local approxima-
tion error is
Eel) := IT(J) - S(J)I = [e(J)] ,
and we can use the Simpson result as a better approximation.
In the construction of local refinement rules, we essentially follow an
abstract suggestion by 1. Babuska and W. C. Rheinboldt [6], which they
made in the more general context of boundary value problems for partial
differential equations. The subintervals which are obtained when bisecting
an interval J := (tz, tm, t r ) are denoted by Jz and J r , where
When refining twice, we thus obtain the binary tree, which is displayed in
Figure 9.14. If J is obtained by refinement, then we denote the starting
J
•
~
Jz Jr
0 • 0 • 0
J ll
/\ /\ JZ r Jrz J rr
0 0 0 0 0
for the unknown error E(JZ )' We can therefore estimate in advance, what
... ·····Ch'
s(J) ,.
E+(J) .
~~~~~-----------,~~-----
(h/2)' h' (2h)'
Figure 9.15. Local extrapolation for the error estimator c+(J).
a threshold value for the local errors, above which we refine an interval. In
order to do this, we take the maximal local error, which we would obtain
from a global refinement, i.e., refinement of all intervals J E Do, and define
(9.31 )
a b
Figure 9.16. Estimated error distributions before and after global and local
refinement.
error at the right and left boundary is below the maximal local error K(Do),
which can possibly be achieved by a complete refinement. If we follow the
principle of equidistribution of the local error, then we do not have to refine
any more near the right and left boundary. Refinement does only payoff in
the middle region. We thus arrive at the following refinement rule: Refine
only intervals J E Do, for which
This yields the error distribution, which is displayed in Figure 9.16. It is ob-
viously one step closer to the desired equidistribution of the approximation
errors.
In order that the partitioning in fact yields an improvement, the order '"Y
has to satisfy the condition '"Y > 1 locally.
318 9. Definite Integrals
The sum 2::JE~ E(J) is not a suitable measure, since integration errors may
average out. Better suitable is a comparison with the approximation of the
previous grid ~ -. If
(9.32)
then
J:
The complete algorithm of the adaptive multigrid quadrature for the
computation of f(t) dt with a relative precision tol now looks as follows:
Algorithm 9.36 Simple multigrid quadrature
Choose an initial grid, e.g., ~ := {(a, (a + b)/2, b)};
for i = 0 to i max do
Compute T(J), S(J) and E(~) for all J E ~;
Compute E(~);
if E(~) s; tolIS(J)1 then
break; (done, solution S (~))
else
Compute E+ (J) and E( J) for all J E ~;
Compute K(~);
Replace all J E ~ with E(J) ::::: K(~) by J 1 and J r ;
end
end
The multigrid approach obviously leads to a considerably simpler adap-
tive quadrature algorithm than the adaptive Romberg quadrature. The only
difficulty consists in the storage of the grid sequence. However, this diffi-
culty can be mastered fairly easily by employing a structured programming
language (such as C or Pascal). In the one-dimensional quadrature, we can
store the sequence as a binary tree (as indicated in Figure 9.14). In prob-
lems in more than one spatial dimension, the question of data structures
9.7. Adaptive Multigrid Quadrature 319
Or-----------------
-1 o 1
Figure 9.17. Adapted grid for the needle impulse f(t) = 1/(10- 4 +t2 ) of the fifth
and ninth step for the tolerance 10- 3 .
The program can also be adapted to discrete integrands (it was originally
developed just for this case in [91] as the so-called SUMMATOR). Here
one only has to consider the case that there is no value available at a
bisecting point. As always, we do the next best, and this time in the literal
sense, by taking the nearest given point, which is next to the bisectional
point, and thus modify the bisection slightly. Once the required precision is
320 9. Definite Integrals
achieved, then for discrete integrands, and for reasons, which we discussed
in Section 9.6, we take the trapezoidal sum as the best approximation.
Example 9.38 Summation of the Harmonic Series. The sum
L -J:-
n 1
S = for n = 10 7
j=1
Figure 9.18. Summation of the harmonic series with the program SUMMATOR.
Exercises
li I I -
Exercise 9.1 Let
n n .
s-]
A·2n = - . . ds
n 0 j~O ~ - ]
jf.i
12 x 2e 3x dx
P = 211 v'"f=t2
~l
f(t) dt
'
of the radial movement of a satellite in an orbit in the equatorial plane
(apogeum height 492 km) under the influence ofthe flattening ofthe Earth.
Here
1 P2-1
(a) f(t) = , r(t) = 1 + (1 + t ) - - ,
V2g(r(t)) 2
(b) g(x) = 2w2(1 - pI/x),
k
(c) w 2 = ~(1 - c) + ~, P1=-6
2 '
W P2
with the constants c = 0.5 (elliptic eccentricity of the satellite orbit), P2 =
2.9919245059286 and k = 1.4·10~3 (constant, which describes the influence
of the Earth flattening). Write a program which computes the integral
In := -1f- ~
L f(Tin), Tin:= cos
n+1.,=0
(2i + 1 1f)
--.-
n+12
, n = 3,4, ... 7
for the extrapolation tableau from the one of the Aitken-Neville algorithm.
Exercise 9.6 Every element Tik in the extrapolation tableau of the extra-
polated trapezoidal rule can be considered as a quadrature formula. Show
that when using the Romberg sequence and polynomial extrapolation, the
following results hold:
(a) T22 is equal to the value, which is obtained by applying the Simpson
rule; T33 corresponds to the Milne rule.
(b) Tik, i > k is obtained by 2i - k -fold application of the quadrature
formula, which belongs to Tkk to suitably chosen subintervals.
(c) For every Tik, the weights of the corresponding quadrature formula
are positive.
Hint: By using (b), show that the weights Ai,n of the quadrature formula,
which corresponds to Tkk, satisfies
max Ai n :::; 4k. min Ai n .
i' i 1
Exercise 9.7 Implement the Romberg algorithm by only using one single
vector of length n (note that only one intermediate value of the table needs
to be extra stored).
Exercise 9.8 Experiment with an adaptive Romberg quadrature pro-
gram, test it with the "needle function"
I(n):= 1 1
-1
2-n
4- n +t
2 dt, for n = 1,2, ...
and determine the n, for which your program yields the value zero for a
given precision of eps = 10- 3 .
Exercise 9.9 Consider the computation of the integrals
11 +1
f(t)dt::::; f-Lof( -1)
n-l
+ f-Lnf(l) + ~ f-Ld(t i )
with fixed nodes -1 and + 1 and variable nodes to be determined such that
the order is as high as possible (Gauss-Lobatto quadrature).
References
[1] ABDULLE, A., AND WANNER, G. 200 years of least squares method. Elemente
der Mathematik (2002).
[2] ABRAMOWITZ, M., AND STEGUN, 1. A. Pocketbook of Mathematical
Functions. Verlag Harri Deutsch, Thun, Frankfurt/Main, 1984.
[3] AIGNER, M. Diskrete Mathematik, 4. ed. Vieweg, Braunschweig, Wiesbaden,
200l.
[4] ANDERSON, E., BAI, Z., BISCHOF, C., DEMMEL, J., DONGARRA, J.,
DUCROZ, J., GREENBAUM, A., HAMMARLING, S., McKENNEY, A., OSTRU-
CHOV, S., AND SORENSEN, D. LAPACK Users' Guide. SIAM, Philadelphia,
1999.
[5] ARNOLDI, W. E. The principle of minimized iterations in the solution of the
matrix eigenvalue problem. Quart. Appl. Math. 9 (1951), 17-29.
[6] BABUSKA, 1., AND RHEINBOLDT, W. C. Error estimates for adaptive finite
element computations. SIAM J. Numer. Anal. 15 (1978), 736-754.
[7] BJ0RCK, A. Iterative refinement of linear least squares solutions I. BIT 7
(1967), 257-278.
[8] BOCK, H. G. Randwertproblemmethoden zur Parameteridentijizierung in
Systemen nichtlinearer Differentialgleichungen. PhD thesis, Universitiit zu
Bonn, 1985.
[9] BORNEMANN, F. A. An Adaptive Multilevel Approach to Parabolic Equations
in two Dimensions. PhD thesis, Freie Universitiit Berlin, 1991.
[10] BRENT, R. P. Algorithms for Minimization Without Derivatives. Prentice
Hall, Englewood Cliffs, N.J., 1973.
[11] BULIRSCH, R. Bemerkungen zur Romberg-Integration. Numer. Math. 6
(1964),6-16.
326 References
[71] RIGAL, J. L., AND GACHES, J. On the compatibility of a given solution with
the data of a linear system. J. Assoc. Comput. Mach. 14 (1967), 543-548.
[72] ROMBERG, W. Vereinfachte Numerische Integration. Det Kongelige Norske
Videnskabers Selskabs Forhandlinger Bind 28, 7 (1955).
[73] SAUER, R., AND SZABO, 1. Mathematische Hilfsmittel des Ingenieurs.
Springer Verlag, Berlin, Heidelberg, New York, 1968.
[74] SAUTTER, W. Fehlerfortpfianzung und Rundungsfehler bei der verallge-
meinerten Inversion von Matrizen. PhD thesis, TU Miinchen, Fakultiit fiir
Allgemeine Wissenschaften, 1971.
[75] SHANNON, C. E. The Mathematical Theory of Communication. The
University of Illinois Press, Urbana, Chicago, London, 1949.
[76] SKEEL, R. D. Scaling for numerical stability in Gaussian elimination. J.
ACM 26, 3 (1979), 494-526.
[77] SKEEL, R. D. Iterative refinement implies numerical stability for Gaussian
elimination. Math. Compo 35, 151 (1980), 817-832.
[78] SONNEVELD, P. A fast Lanczos-type solver for nonsymmetric linear systems.
SIAM J. Sci. Stat. Comput. 10 (1989), 36-52.
[79] STEWART, G. W. Introduction to Matrix Computations. Academic Press,
New York, San Francisco, London, 1973.
[80] STEWART, G. W. On the structure of nearly uncoupled Markov chains.
In Mathematical Computer Performance and Reliability, G. Iazeolla, P. J.
Courtois, and A. Hordijk, Eds. Elsevier, New York, 1984.
[81] STOER, J. Solution of large systems of linear equations by conjugate gra-
dient type methods. In Mathematical Programming, the State of the Art,
A. Bachem, M. Grotschel, and B. Korte, Eds. Springer Verlag, Berlin,
Heidelberg, New York, 1983.
[82] SZEGO, G. Orthogonal Polynomials, fourth ed. AMS, Providence, RI, 1975.
[83] TRAUB, J., AND WOZNIAKOWSKI, H. General Theory of Optimal Algorithms.
Academic Press, Orlando, San Diego, San Francisco, 1980.
[84] TREFETHEN, L. N., AND SCHREIBER, R. S. Average-case stability of
gaussian elimination. SIAM J. Matrix Anal. Appl. 11,3 (1990), 335-360.
[85] TUKEY, J. W., AND COOLEY, J. W. An algorithm for the machine
calculation of complex Fourier series. Math. Comp 19 (1965), 197-30l.
[86] VARGA, J. Matrix Iterative Analysis. Prentice Hall, Englewood Cliffs, N.J.,
1962.
[87] WILKINSON, J. H. Rounding Errors in Algebraic Processes. Her Majesty's
Stationary Office, London, 1963.
[88] WILKINSON, J. H. The Algebraic Eigenvalue Problem. Oxford University
Press, Oxford, UK, 1965.
[89] WILKINSON, J. H., AND REINSCH, C. Handbook for Automatic Computation,
Volume II, Linear Algebra. Springer Verlag, New York, Heidelberg, Berlin,
1971.
[90] WITTUM, G. Mehrgitterverfahren. Spektrum der Wissenschajt (April 1990),
78-90.
330 References
Software
For most of the algorithms described in this book there exists rather so-
phisticated software, which is public domain. Of central importance is the
netlib, a library of mathematical software, data, documents, etc. Its address
IS
http://www.netlib.org/
http://www.netlib.org/lapack
http://www.netlib.org/eispack
Please study the therein given hints carefully (e.g., README, etc.) to
make sure that you download all necessary material. Sometimes a bit of
additional browsing in the neighborhood is needed.
http://elib.zib.de/pub/elib/codelib/
http://www.zib.de/SciSoft/CodeLib/
All of the there available programs are free as long as they are exclusively
used for research or teaching purposes.
Index
complexity backward, 36
of problems, 2 forward, 35
condition equidistribution, 316
intersection point, 24 linearised theory, 26
condition number relative, 25
absolute, 26 extrapolation
componentwise, 32 algorithm, 295
of addition, 27 local, 316
of multiplication, 32 methods, 291, 295
of scalar product, 33 sub diagonal error criterion, 304
relative, 26 tableau, 292
Skeel's, 33
conjugate gradients, 252 Farin, G., 204
continuation method, 92 FFT,203
classical, 102 fixed-point
order, 104 Banach theorem, 84
tangent, 103, 108 equation, 82
convergence iteration, 82, 239
linear, 85 method
model, 309 symmetrizable, 242
monitor, 307 Fletcher, R., 255
quadratic, 85 floating point number, 22
super linear, 85 forward substitution, 4
Cooley, W., 202 Fourier
cost series, 152, 200
QR-factorization, 69, 72 transform, 197
Cholesky decomposition, 16 fast, 201
Gaussian elimination, 7 Francis, J. G. F., 127
QR method Frobenius, F. G., 140
for singular values, 137
QR-algorithm, 132 Gaches, J., 50
Cramer's rule, 1 Gauss
Cullum, J., 266 Jordan decomposition, 3
cylinder functions, 159 Newton method, 109
Seidel method:, 240
de Boor algorithm, 235 Gauss, C. F., 4, 57
de Boor, C., 204, 309 Gautschi, W., 164
de Casteljau algorithm, 213 generalized inverse, 76
detailed balance, 143 Gentleman, W. M., 70
Deuflhard, P., 73, 90, 261, 308 Givens
fast, 70
eigenvalue rational, 70
derivative, 120 rotations, 68
Perron, 140 Givens, W., 68
elementary operation, 23 Goertzel algorithm, 171
Ericsson, T., 266 Goertzel, G., 171
error Golub, G. H., 47, 72, 119
absolute, 25 graph, 140
analysis irreducible, 140
Index 335
3l. Bremaud: Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues.
32. Durran: Numerical Methods for Wave Equations in Geophysical Fluid
Dynamics.
33. Thomas: Numerical Partial Differential Equations: Conservation Laws and
Elliptic Equations.
34. Chicone: Ordinary Differential Equations with Applications.
3 ::J.
~
Kevorkian: Partial Differential Equations: Analytical Solution Techniques,
2nd ed.
36. Dllllerlld/Paganini: A Course in Robust Control Theory: A Convex Approach.
37. Quarteroni/Sacco/Saleri: Numerical Mathematics.
38. Gallier: Geometric Methods and Applications: For Computer Science and
Engineering.
39. Atkinson/Han: Theoretical Numerical Analysis: A Functional Analysis
Framework.
40. Braller/Castill(}-Chimez: Mathematical Models in Population Biology and
Epidemiology.
41. Davies: Integral Transforms and Their Applications, 3rd ed.
42. Deuflhard/Bornemann: Scientific Computing with Ordinary Differential
Equations.
43. Deuflhard/Hohmann: Numerical Analysis in Modern Scientific Computing: An
Introduction, 2nd ed.