
Texts in Applied Mathematics 43

Editors
J.E. Marsden
L. Sirovich
M. Golubitsky
S.S. Antman

Advisors
G. Iooss
P. Holmes
D. Barkley
M. Dellnitz
P. Newton

Springer
New York
Berlin
Heidelberg
Hong Kong
London
Milan
Paris
Tokyo
Texts in Applied Mathematics

1. Sirovich: Introduction to Applied Mathematics.


2. Wiggins: Introduction to Applied Nonlinear Dynamical Systems and Chaos.
3. Hale/Koçak: Dynamics and Bifurcations.
4. Chorin/Marsden: A Mathematical Introduction to Fluid Mechanics, 3rd ed.
5. Hubbard/West: Differential Equations: A Dynamical Systems Approach:
Ordinary Differential Equations.
6. Sontag: Mathematical Control Theory: Deterministic Finite Dimensional
Systems, 2nd ed.
7. Perko: Differential Equations and Dynamical Systems, 3rd ed.
8. Seaborn: Hypergeometric Functions and Their Applications.
9. Pipkin: A Course on Integral Equations.
10. Hoppensteadt/Peskin: Modeling and Simulation in Medicine and the Life
Sciences, 2nd ed.
11. Braun: Differential Equations and Their Applications, 4th ed.
12. Stoer/Bulirsch: Introduction to Numerical Analysis, 3rd ed.
13. Renardy/Rogers: An Introduction to Partial Differential Equations.
14. Banks: Growth and Diffusion Phenomena: Mathematical Frameworks and
Applications.
15. Brenner/Scott: The Mathematical Theory of Finite Element Methods, 2nd ed.
16. Van de Velde: Concurrent Scientific Computing.
17. Marsden/Ratiu: Introduction to Mechanics and Symmetry, 2nd ed.
18. Hubbard/West: Differential Equations: A Dynamical Systems Approach:
Higher-Dimensional Systems.
19. Kaplan/Glass: Understanding Nonlinear Dynamics.
20. Holmes: Introduction to Perturbation Methods.
21. Curtain/Zwart: An Introduction to Infinite-Dimensional Linear Systems
Theory.
22. Thomas: Numerical Partial Differential Equations: Finite Difference Methods.
23. Taylor: Partial Differential Equations: Basic Theory.
24. Merkin: Introduction to the Theory of Stability of Motion.
25. Naber: Topology, Geometry, and Gauge Fields: Foundations.
26. Polderman/Willems: Introduction to Mathematical Systems Theory: A
Behavioral Approach.
27. Reddy: Introductory Functional Analysis with Applications to Boundary-Value
Problems and Finite Elements.
28. Gustafson/Wilcox: Analytical and Computational Methods of Advanced
Engineering Mathematics.
29. Tveito/Winther: Introduction to Partial Differential Equations: A
Computational Approach.
30. Gasquet/Witomski: Fourier Analysis and Applications: Filtering, Numerical
Computation, Wavelets.
(continued after index)
Peter Deuflhard Andreas Hohmann

Numerical Analysis in Modern


Scientific Computing
An Introduction

Second Edition

With 65 Illustrations

Springer
Peter Deuflhard
Konrad-Zuse-Zentrum (ZIB)
Berlin-Dahlem, D-14195
Germany
deuflhard@zib.de

Andreas Hohmann
AMS
D2 Vodafone TPAI
Düsseldorf, D-40547
Germany
andreas.hohmann@d2vodafone.de

Series Editors
J.E. Marsden
Control and Dynamical Systems 107-81
California Institute of Technology
Pasadena, CA 91125
USA
marsden@cds.caltech.edu

L. Sirovich
Division of Applied Mathematics
Brown University
Providence, RI 02912
USA
chico@camelot.mssm.edu

M. Golubitsky
Department of Mathematics
University of Houston
Houston, TX 77204-3476
USA

S.S. Antman
Department of Mathematics
and
Institute for Physical Science and Technology
University of Maryland
College Park, MD 20742-4015
USA
ssa@math.umd.edu

Mathematics Subject Classification (2000): 65-XX, 68-XX, 65-01, 65Fxx, 65Nxx

Library of Congress Cataloging-in-Publication Data


Deuflhard, P. (Peter)
Numerical analysis in modern scientific computing: an introduction / Peter Deuflhard,
Andreas Hohmann.-2nd ed.
p. cm. - (Texts in applied mathematics; 43)
Rev. ed. of: Numerical analysis. 1995.
Includes bibliographical references and index.

1. Numerical analysis-Data processing. I. Hohmann, Andreas, 1964- II. Deuflhard, P.


(Peter). Numerische Mathematik I. English. III. Title. IV. Series.
QA297 .D45 2003
519.4-dc21 2002030564

ISBN 978-1-4419-2990-7 ISBN 978-0-387-21584-6 (eBook) Printed on acid-free paper.


DOI 10.1007/978-0-387-21584-6
© 2003 Springer-Verlag New York, Inc.
Softcover reprint of the hardcover 1st edition 2003
All rights reserved. This work may not be translated or copied in whole or in part without the written per-
mission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA),
except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form
of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are
not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject
to proprietary rights.

9 8 7 6 5 4 3 2 1    SPIN 10861791
www.springer-ny.com

Springer-Verlag New York Berlin Heidelberg


A member of BertelsmannSpringer Science+Business Media GmbH
Series Preface

Mathematics is playing an ever more important role in the physical and


biological sciences, provoking a blurring of boundaries between scientific
disciplines and a resurgence of interest in the modern as well as the clas-
sical techniques of applied mathematics. This renewal of interest, both in
research and teaching, has led to the establishment of the series Texts in
Applied Mathematics (TAM).
The development of new courses is a natural consequence of a high level of
excitement on the research frontier as newer techniques, such as numerical
and symbolic computer systems, dynamical systems, and chaos, mix with
and reinforce the traditional methods of applied mathematics. Thus, the
purpose of this textbook series is to meet the current and future needs of
these advances and to encourage the teaching of new courses.
TAM will publish textbooks suitable for use in advanced undergraduate
and beginning graduate courses, and will complement the Applied Mathe-
matical Sciences (AMS) series, which will focus on advanced textbooks and
research-level monographs.

Pasadena, California J.E. Marsden


Providence, Rhode Island L. Sirovich
Houston, Texas M. Golubitsky
College Park, Maryland S.S. Antman
Preface

For quite a number of years the rapid progress in the development of both
computers and computing (algorithms) has stimulated a more and more de-
tailed scientific and engineering modeling of reality. New branches of science
and engineering, which had been considered rather closed until recently,
have freshly opened up to mathematical modeling and to simulation on the
computer. There is clear evidence that our present problem-solving ability
does not only depend on the accessibility of the fastest computers (hard-
ware), but even more on the availability of the most efficient algorithms
(software).
The construction and the mathematical understanding of numerical al-
gorithms is the topic of the academic discipline Numerical Analysis. In
this introductory textbook the subject is understood as part of the larger
field Scientific Computing. This rather new interdisciplinary field influ-
ences smart solutions in quite a number of industrial processes, from car
production to biotechnology. At the same time it contributes immensely
to investigations that are of general importance to our societies-such as
the balanced economic and ecological use of primary energy, global climate
change, or epidemiology.
The present book is predominantly addressed to students of mathematics,
computer science, science, and engineering. In addition, it intends to reach
computational scientists already on the job who wish to get acquainted
with established modern concepts of Numerical Analysis and Scientific
Computing on an elementary level via personal studies.

The field of Scientific Computing, situated at the confluence of mathe-


matics, computer science, natural science, and engineering, has established
itself in most teaching curricula, sometimes still under the traditional name
Numerical Analysis. However, basic changes in the contents and the presen-
tation have taken place in recent years, and this already at the introductory
level: classical topics, which had been considered important for quite a
time, have just dropped out, new ones have entered the stage. The guiding
principle of this introductory textbook is to explain and exemplify essential
concepts of modern Numerical Analysis for ordinary and partial differential
equations using the simplest possible model problems. Nevertheless, read-
ers are only assumed to have basic knowledge about topics typically taught
in undergraduate Linear Algebra and Calculus courses. Further knowledge
is definitely not required.
The primary aim of the book is to develop algorithmic feeling and think-
ing. After all, the algorithmic approach has historically been one of the
roots of today's mathematics. It is no mere coincidence that, besides con-
temporary names, historical names like Gauss, Newton, and Chebyshev
are found in numerous places all through the text. The orientation toward
algorithms, however, should by no means be misunderstood. In fact, the
most efficient algorithms often require a substantial amount of mathemat-
ical theory, which will be developed in the book. As a rule, elementary
mathematical arguments are preferred. In topics like interpolation or in-
tegration we deliberately restrict ourselves to the one-dimensional case.
Wherever meaningful, the reasoning appeals to geometric intuition-which
also explains the quite large number of graphical representations. Notions
like scalar product and orthogonality are used throughout-in the finite
dimensional case as well as in infinite dimensions (functions). Despite the
elementary presentation, the book contains a significant number of other-
wise unpublished material. Some of our derivations of classical results differ
significantly from traditional derivations-in many cases they are simpler
and nevertheless more stringent. As an example we refer to our condition
and error analysis, which requires only multidimensional differentiation as
the main analytical prerequisite.
Compared to the first English edition, a polishing of the book as a whole
has been performed. The essential new item is Section 5.5 on stochastic
eigenvalue problems-a problem class that has gained increasing impor-
tance and appeared to be well-suited for an elementary presentation within
our conceptual frame. As a recent follow-up, there exists an advanced
textbook on numerical ordinary differential equations [22].

Of course, any selection of material expresses the scientific taste of the


authors. The first author founded the Zuse Institute Berlin (ZIB) as a re-
search institute for Scientific Computing in 1986. He has given Numerical
Analysis courses at the Technical University of Munich and the Univer-
sity of Heidelberg, and is now teaching at the Free University of Berlin.
Needless to say, he has presented his research results in numerous invited
talks at international conferences and seminars at renowned universities
and industry places all over the world. The second author originally got his
mathematical training in pure mathematics and switched over to compu-
tational mathematics later. He is presently working in the communication
industry. We are confident that the combination of a senior and a junior
author, of a pure and an applied mathematician, as well as a member of
academia and a representative from industry has had a stimulating effect
on our presentation.
At this point it is our pleasure to thank all those who have particularly
helped us with the preparation of this book. The first author remembers
with gratitude his early time as an assistant of Roland Bulirsch (Technical
University of Munich, retired since 2001), in whose tradition his present
views on Scientific Computing have been shaped. Of course, our book has
significantly profited from intensive discussions with numerous colleagues,
some of which we want to mention explicitly here: Ernst Hairer and Ger-
hard Wanner (University of Geneva) for discussions on the general concept
of the book; Folkmar Bornemann (Technical University of Munich) for
the formulation of the error analysis, the different condition number con-
cepts, and the definition of the stability indicator in Chapter 2; Wolfgang
Dahmen (RWTH Aachen) for Chapter 7; and Dietrich Braess (Ruhr Uni-
versity Bochum) for the recursive derivation of the Fast Fourier Transform
in Section 7.2.
The first edition of this textbook, which already contained the bulk of
material presented in this text, was translated by Florian Potra and Fried-
mar Schulz-again many thanks to them. For this, the second edition,
we cordially thank Rainer Roitzsch (ZIB), without whose deep knowledge
about a rich variety of fiddly TEX questions this book could never have
appeared. Our final thanks go to Erlinda Kornig and Sigrid Wacker for all
kinds of assistance.

Berlin and Düsseldorf, March 2002

Peter Deuflhard and Andreas Hohmann


Outline

This introductory textbook is, in the first place, addressed to students of


mathematics, computer science, science, and engineering. In the second
place, it is also addressed to computational scientists already on the job
who wish to get acquainted with modern concepts of Numerical Analysis
and Scientific Computing on an elementary level via personal studies.
The book is divided into nine chapters, including associated exercises, a
software list, a reference list, and an index. The contents of the first five
and of the last four chapters are each closely related.
In Chapter 1 we begin with Gaussian elimination for linear systems
of equations as the classical prototype of an algorithm. Beyond the ele-
mentary elimination technique we discuss pivoting strategies and iterative
refinement as additional issues. Chapter 2 contains the indispensable error
analysis based on the fundamental ideas of J. H. Wilkinson. The condition
of a problem and the stability of an algorithm are presented in a unified
framework, well separated and illustrated by simple examples. The quite
unpopular "E-battle" in linearized error analysis is avoided~which leads
to a drastic simplification of the presentation and to an improved under-
standing. A stability indicator arises naturally, which allows a compact
classification of numerical stability. On this basis we derive an algorithmic
criterion to determine whether a given approximate solution of a linear
system of equations is acceptable or not. In Chapter 3 we treat orthogo-
nalization methods in the context of Gaussian linear least-squares problems
and introduce the extremely useful calculus of pseudo-inverses. It is imme-
diately applied in the following Chapter 4, where we present iterative

methods for systems of nonlinear equations (Newton method), nonlinear


least-squares problems (Gauss-Newton method), and parameter-dependent
problems (continuation methods) in close mutual connection. Special at-
tention is paid to modern affine invariant convergence theory and iterative
algorithms. Chapter 5 starts with a condition analysis of linear eigen-
value problems for general matrices. From this analysis, interest is naturally
drawn to the real symmetric case, for which we present the power method
(direct and inverse) and the QR-algorithm in some detail. Into the same
context fits the singular value decomposition for general matrices, which is
of utmost importance in application problems. As an add-on in this second
edition, we finally consider stochastic eigenvalue problems, which in recent
years have played an increasing role, especially in cluster analysis.
The second closely related chapter sequence begins in Chapter 6 with
an extensive theoretical treatment of three-term recurrences, which play
a key role in the realization of orthogonal projections in function spaces.
The condition of three-term recurrences is represented in terms of discrete
Green's functions-thus paving the way toward mathematical structures in
initial and boundary value problems for differential equations. The signif-
icant recent spread of symbolic computing has renewed interest in special
functions also within Numerical Analysis. Numerical algorithms for their
fast summation via the corresponding three-term recurrences are exem-
plified for spherical harmonics and for Bessel functions. In Chapter 7 we
start with classical polynomial interpolation and approximation in the one-
dimensional case. We then continue over Bezier techniques and splines up
to methods that nowadays are of central importance in CAD (Computer-
Aided Design) or CAGD (Computer-Aided Geometric Design), disciplines
of computer graphics. Our presentation in Chapter 8 on iterative methods
for the solution of large symmetric systems of linear equations benefits con-
veniently from Chapter 6 (three-term recurrences) and Chapter 7 (minimax
property of Chebyshev polynomials). The same is true for our treatment
of the Lanczos algorithm for large symmetric eigenvalue problems.
Finally, Chapter 9 has deliberately gotten somewhat longer: it bears
the main burden of presenting principles of the numerical solution of or-
dinary and partial differential equations without any technicalities at the
simplest possible problem type, which here is numerical quadrature. We
start with the historical Newton-Cotes and Gauss-Christoffel quadrature.
As a first adaptive algorithm, we introduce the classical Romberg quadra-
ture, wherein, however, only the approximation order can be varied. The
formulation of the quadrature problem as an initial value problem offers the
opportunity to work out an adaptive Romberg algorithm with variable or-
der and step-size control; this approach opens the possibility to discuss the
principle of extrapolation methods, which play a key role in the numerical
solution of ordinary differential equations. The alternative formulation of
the quadrature problem as a boundary value problem is used for the deriva-

tion of an adaptive multigrid quadrature; in this way we can deal with the
adaptivity principle behind multigrid methods for partial differential equa-
tions in isolated form-clearly separated from the principle of fast solution,
which is often predominant in the context of partial differential equations.
Contents

Preface vii

Outline xi

1 Linear Systems 1
1.1 Solution of Triangular Systems. . . . . . . . 3
1.2 Gaussian Elimination . . . . . . . . . . . . . 4
1.3 Pivoting Strategies and Iterative Refinement 7
1.4 Cholesky Decomposition for Symmetric Positive Definite
Matrices 14
Exercises . . . . 16

2 Error Analysis 21
2.1 Sources of Errors . . . . . . . . . . . 22
2.2 Condition of Problems . . . . . . . . 24
2.2.1 Normwise Condition Analysis 26
2.2.2 Componentwise Condition Analysis 31
2.3 Stability of Algorithms . . 34
2.3.1 Stability Concepts 35
2.3.2 Forward Analysis . 37
2.3.3 Backward Analysis 42
2.4 Application to Linear Systems 44

2.4.1 A Zoom into Solvability . . . . . . . . . . . 44


2.4.2 Backward Analysis of Gaussian Elimination 46
2.4.3 Assessment of Approximate Solutions. 49
Exercises . 52

3 Linear Least-Squares Problems 57


3.1 Least-Squares Method of Gauss 57
3.1.1 Formulation of the Problem 57
3.1.2 Normal Equations . . . . . . 60
3.1.3 Condition . . . . . . . . . . 62
3.1.4 Solution of Normal Equations 65
3.2 Orthogonalization Methods .. . 66
3.2.1 Givens Rotations . . . . 68
3.2.2 Householder Reflections 70
3.3 Generalized Inverses. 74
Exercises . . . . . . . . . . . . . . . . 78

4 Nonlinear Systems and Least-Squares Problems 81


4.1 Fixed-Point Iterations. . . . . . . . . . . . . . . . 81
4.2 Newton Methods for Nonlinear Systems. . . . . . 86
4.3 Gauss-Newton Method for Nonlinear Least-Squares Prob-
lems . . . . . . . . . . . . . . . . . . . . . . . 92
4.4 Nonlinear Systems Depending on Parameters. 99
4.4.1 Solution Structure 100
4.4.2 Continuation Methods 102
Exercises . 113

5 Linear Eigenvalue Problems 119


5.1 Condition of General Eigenvalue Problems . . . . . 120
5.2 Power Method . . . . . . . . . . . . . . . . . . . . . 123
5.3 QR-Algorithm for Symmetric Eigenvalue Problems 126
5.4 Singular Value Decomposition . 132
5.5 Stochastic Eigenvalue Problems 137
Exercises . . . . . . . . . . . . . . . . 148

6 Three-Term Recurrence Relations 151


6.1 Theoretical Background . . . . . 153
6.1.1 Orthogonality and Three-Term Recurrence Rela-
tions . . . . . . . . . . . . . . . . . . . . . . 153
6.1.2 Homogeneous and Inhomogeneous Recurrence Re-
lations . . . . . . . 156
6.2 Numerical Aspects 158
6.2.1 Condition Number 160
6.2.2 Idea of the Miller Algorithm 166
6.3 Adjoint Summation . . . . . . . . . 168

6.3.1 Summation of Dominant Solutions 169


6.3.2 Summation of Minimal Solutions 172
Exercises 176

7 Interpolation and Approximation 179


7.1 Classical Polynomial Interpolation. . . . . . . . . . . 180
7.1.1 Uniqueness and Condition Number . . . . . . 180
7.1.2 Hermite Interpolation and Divided Differences 184
7.1.3 Approximation Error . . . . . . . . . . . . . . 192
7.1.4 Min-Max Property of Chebyshev Polynomials 193
7.2 Trigonometric Interpolation . . . . . . . . . . . . . . 197
7.3 Bezier Techniques . . . . . . . . . . . . . . . . . . . . 204
7.3.1 Bernstein Polynomials and Bezier Representation 205
7.3.2 De Casteljau Algorithm .. 211
7.4 Splines................ 218
7.4.1 Spline Spaces and B-Splines 219
7.4.2 Spline Interpolation. . . . . 226
7.4.3 Computation of Cubic Splines 230
Exercises . 233

8 Large Symmetric Systems of Equations and Eigenvalue


Problems 237
8.1 Classical Iteration Methods 239
8.2 Chebyshev Acceleration . . . 244
8.3 Method of Conjugate Gradients 249
8.4 Preconditioning. 256
8.5 Lanczos Methods 261
Exercises . . . . . . 266

9 Definite Integrals 269


9.1 Quadrature Formulas. . . . . 270
9.2 Newton-Cotes Formulas. . . . 273
9.3 Gauss-Christoffel Quadrature 279
9.3.1 Construction of the Quadrature Formula 280
9.3.2 Computation of Nodes and Weights. . . 285
9.4 Classical Romberg Quadrature. . . . . . . . . . 287
9.4.1 Asymptotic Expansion of the Trapezoidal Sum 288
9.4.2 Idea of Extrapolation . . 290
9.4.3 Details of the Algorithm 295
9.5 Adaptive Romberg Quadrature 298
9.5.1 Principle of Adaptivity . 299
9.5.2 Estimation of the Approximation Error. 301
9.5.3 Derivation of the Algorithm 304
9.6 Hard Integration Problems . . . 310
9.7 Adaptive Multigrid Quadrature . . 313

9.7.1 Local Error Estimation and Refinement Rules . . 314


9.7.2 Global Error Estimation and Details of the Algo-
rithm . 318
Exercises 321

References 325

Software 331

Index 333
1
Linear Systems

In this chapter we deal with the numerical solution of a system of n linear


equations

a_11 x_1 + a_12 x_2 + ... + a_1n x_n = b_1
a_21 x_1 + a_22 x_2 + ... + a_2n x_n = b_2
                  ...
a_n1 x_1 + a_n2 x_2 + ... + a_nn x_n = b_n

or, in short notation,

Ax = b,

where A ∈ Mat_n(R) is a real (n, n)-matrix and b, x ∈ R^n are real n-vectors.


Before starting to compute the solution x, we should ask ourselves:
When is a linear equation system solvable at all?
From linear algebra, we know the following result, which characterizes
solvability in terms of the determinant of the matrix A.

Theorem 1.1 Let A ∈ Mat_n(R) be a real square matrix with det A ≠ 0
and b ∈ R^n. Then there exists a unique x ∈ R^n such that Ax = b.

Whenever det A ≠ 0, the solution x = A⁻¹b can be computed by
Cramer's rule-at least in principle. Obviously, this is a direct connection
from the existence and uniqueness theorem to an algorithm. In general,
we will require that whenever a problem does not have a solution, a reli-
able algorithm should not "compute" one. Surprisingly, this requirement is

not self-evident; there are counter-examples. Reliability therefore is a first


important property of a "good" algorithm.
However, Cramer's rule will not be the ultimate goal of our considera-
tions: if we calculate the determinant via the Leibniz representation

det A = Σ_{σ ∈ S_n} sgn(σ) · a_{1,σ(1)} ··· a_{n,σ(n)}

as a sum over all permutations σ ∈ S_n of the set {1, ..., n}, the cost of
computing det A amounts to n · n! arithmetic operations. Even with the
recursive scheme involving an expansion in subdeterminants according to
Laplace's rule

det A = Σ_{i=1}^{n} (−1)^{i+1} a_{1i} det A_{1i},

about 2^n arithmetic operations are still necessary, where A_{1i} ∈ Mat_{n−1}(R) is the
matrix obtained from A by crossing out the first row and the ith column. As
we will see, all methods to be described in the sequel are more efficient than
Cramer's rule for n ≥ 3. Speed is therefore certainly the second important
property of a "good" algorithm.

Remark 1.2 Of course, we expect that a good numerical method should


solve a given problem at lowest possible cost (in terms of arithmetic oper-
ations). Intuitively there is a minimal cost for each problem that is called
the complexity of the problem. The closer the cost of an algorithm is to
the complexity of the problem, the more efficient that algorithm is. The
cost of a specific algorithm is therefore always an upper bound for the com-
plexity of the corresponding problem. Obtaining lower bounds is in general
much more difficult-for details see the monograph of J. Traub and H.
Wozniakowski [83].

The notation x = A⁻¹b might suggest the idea of computing the solution
of Ax = b by first computing the inverse matrix A⁻¹ and then multiplying
it to b. However, the computation of A⁻¹ inherently contains all difficulties
related to solving Ax = b for arbitrary right-hand sides b. We will see in the
second chapter that the computation of A⁻¹ can be "nasty," even when for
special b the solution of Ax = b is "well-behaved." x = A⁻¹b is therefore
only meant as a formal notation which has nothing to do with the actual
computation of the solution x. One should therefore avoid talking about
"inverting matrices," when in fact one is concerned with "solving systems
of linear equations."

Remark 1.3 There has been a longstanding bet by an eminent colleague,


who wagered a significant amount that in practice the problem of "inverting
a matrix" is always avoidable. As far as we know he has won his bet in all
cases.

In the search for an efficient solution method for arbitrary linear equation
systems we will begin with a study of simple special cases. The simplest one
is certainly the case of a diagonal matrix A where the system degenerates
to n independent scalar linear equations. The idea to transform a general
system into a diagonal one underlies the Gauss-Jordan decomposition. This
method, however, is less efficient than the one to be described in Section
1.2 and is therefore omitted here. In terms of complexity, next is the case
of a triangular system, which is the topic of the following section.

1.1 Solution of Triangular Systems


Here we consider the case of a triangular system
r_11 x_1 + r_12 x_2 + ... + r_1n x_n = z_1
          r_22 x_2 + ... + r_2n x_n = z_2
                        ...
                          r_nn x_n = z_n,

and in short matrix notation,

Rx = z,                                                   (1.1)

where R is an upper triangular matrix; i.e., r_ij = 0 for all i > j. Obviously
the components of x can be obtained recursively starting with the nth row:

x_n     := z_n / r_nn                                     if r_nn ≠ 0,
x_{n-1} := (z_{n-1} − r_{n-1,n} x_n) / r_{n-1,n-1}        if r_{n-1,n-1} ≠ 0,
   ...
x_1     := (z_1 − r_12 x_2 − ... − r_1n x_n) / r_11       if r_11 ≠ 0.

For the determinant of the matrix R we have det R = r_11 ··· r_nn, and
therefore

det R ≠ 0  ⟺  r_ii ≠ 0 for all i = 1, ..., n.

The above-defined algorithm is therefore applicable (as in the case of Cra-
mer's rule) if and only if det R ≠ 0, i.e., under the hypothesis of the
existence and uniqueness theorem. The computational cost amounts to:
(a) for the ith row: n - i additions and multiplications, and one division
(b) for rows n through 1 together:

Σ_{i=1}^{n} (i − 1) = n(n − 1)/2 ≐ n²/2

multiplications and as many additions.



Here the notation "≐" stands for "equal up to lower-order terms"; i.e., we
consider only the term containing the highest power of n, which dominates
the cost for large values of n.
In total analogy a triangular system of the form

Lx = z,                                                   (1.2)

with a lower triangular matrix L can be solved starting from the first row
and working through to the last one.
This way of solving triangular systems is called backward substitution
in case of (1.1) and forward substitution in case of (1.2). The name sub-
stitution is used because each component of the right hand-side vector can
be successively substituted (replaced) by the solution, as indicated in the
following storage scheme for backward substitution:

(z_1, z_2, ..., z_{n-1}, z_n)
(z_1, z_2, ..., z_{n-1}, x_n)
          ...
(z_1, x_2, ..., x_{n-1}, x_n)
(x_1, x_2, ..., x_{n-1}, x_n).

The case of forward substitution is just reversed.
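For illustration, the two substitution processes may be sketched in Python/NumPy
as follows (a sketch only, with names of our choosing and without checks for
vanishing diagonal elements):

    import numpy as np

    def backward_substitution(R, z):
        # Solve R x = z for an upper triangular matrix R, starting with the nth row.
        n = len(z)
        x = np.array(z, dtype=float)   # overwrite the right-hand side, as in the storage scheme
        for i in range(n - 1, -1, -1):
            x[i] = (x[i] - R[i, i+1:] @ x[i+1:]) / R[i, i]
        return x

    def forward_substitution(L, z):
        # Solve L x = z for a lower triangular matrix L, starting with the first row.
        n = len(z)
        x = np.array(z, dtype=float)
        for i in range(n):
            x[i] = (x[i] - L[i, :i] @ x[:i]) / L[i, i]
        return x

The inner products realize exactly the ≐ n²/2 multiplications counted above.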

1.2 Gaussian Elimination


We now turn to the most efficient among the classical solution methods
for systems of linear equations, the Gaussian elimination method. Carl
Friedrich Gauss (1777-1855) describes this method in 1809 in his work
on celestial mechanics Theoria Motus Corporum Coelestium [35] by saying,
"the values can be obtained by the usual elimination method." There he
had used the elimination in connection with his least-squares method (cf.
Section 3). However, elimination had already been used by J. L. Lagrange
in 1759 and in fact been known in China as early as the first century B.C.
Let us return to the general linear system

a_11 x_1 + a_12 x_2 + ... + a_1n x_n = b_1
a_21 x_1 + a_22 x_2 + ... + a_2n x_n = b_2
                  ...                                     (1.3)
a_n1 x_1 + a_n2 x_2 + ... + a_nn x_n = b_n

and try to transform it into a triangular one. If we aim at an upper triangu-


lar matrix, the first row does not need to be changed. As for the remaining
rows, we want to manipulate them in such a way that the coefficients in
front of Xl vanish; i.e., the variable Xl from rows 2 through n is eliminated.

Thus we produce a system of the form

a_11 x_1 + a_12 x_2 + ... + a_1n x_n = b_1
           a'_22 x_2 + ... + a'_2n x_n = b'_2
                       ...                                (1.4)
           a'_n2 x_2 + ... + a'_nn x_n = b'_n.

Having achieved this we can apply the same procedure to the last n -1 rows
in order to obtain recursively a triangular system. Therefore it is sufficient
to examine the first elimination step from (1.3) to (1.4). We assume that
a_11 ≠ 0. In order to eliminate the term a_i1 x_1 in row i (i = 2, ..., n), we
subtract from row i a multiple of row 1 (unaltered), i.e.,

new row i := row i − l_i1 · row 1,

or explicitly

a'_ij := a_ij − l_i1 a_1j  for j = 1, ..., n,   b'_i := b_i − l_i1 b_1.

From a_i1 − l_i1 a_11 = 0 it follows immediately that l_i1 = a_i1/a_11. Therefore


the first elimination step can be performed under the assumption a_11 ≠ 0.
The element a_11 is called a pivot element and the first row a pivot row.
After this first elimination step there remains an (n−1, n−1)-submatrix
in rows 2 through n. We are now in a situation as in the beginning, but
one dimension lower.
By recursive application of this kind of elimination procedure we obtain
a sequence of matrices

A = A^(1) → A^(2) → ... → A^(n) =: R,

each of which has the special form

          [ a_11^(1)   ···    ···     ···    a_1n^(1) ]
          [          a_22^(2) ···     ···    a_2n^(2) ]
A^(k) =   [                    ⋱                      ]          (1.5)
          [                  a_kk^(k) ···    a_kn^(k) ]
          [                     ⋮               ⋮     ]
          [                  a_nk^(k) ···    a_nn^(k) ]

with an (n − k + 1, n − k + 1)-submatrix, the so-called remainder matrix, in
the right bottom corner. Whenever the pivot a_kk^(k) does not vanish-which

is unknown in advance-we can apply the elimination step


.- (k) / (k)
lik a ik a kk for i=k+1, ... ,n
(k+l) (k) (k)
a ij a ij - likakj for i,j=k+1, ... ,n
b(k+l)
t
.- b(k) _ l. b(k)
i tk k for i = k + 1, ... , n
to the corresponding remainder matrix. Since every elimination step is a
linear operation applied to the rows of A, the transformation from A^(k)
and b^(k) to A^(k+1) and b^(k+1) can be represented as a premultiplication by
a matrix L_k ∈ Mat_n(R), i.e.,

A^(k+1) = L_k A^(k)   and   b^(k+1) = L_k b^(k).

(In operations on columns one obtains an analogous postmultiplication.)


The matrix

        [ 1                              ]
        [    ⋱                           ]
L_k =   [       1                        ]
        [      −l_{k+1,k}  1             ]
        [         ⋮            ⋱         ]
        [      −l_{n,k}            1     ]

is called a Frobenius matrix. It has the nice property that its inverse L_k⁻¹
is obtained from L_k by changing the signs of the l_ik's. Furthermore the
product of the L_k⁻¹'s satisfies

                           [ 1                          ]
                           [ l_21    1                  ]
L := L_1⁻¹ ··· L_{n-1}⁻¹ = [  ⋮          ⋱              ]
                           [ l_n1  ···  l_{n,n-1}   1   ]

Summarizing, we have in this way reduced the system Ax = b to the
equivalent triangular system Rx = z with

R = L⁻¹ A   and   z = L⁻¹ b.
A lower (respectively, upper) triangular matrix, whose main diagonal el-
ements are all equal to one, is called a unit lower (respectively, upper)
triangular matrix. The above product representation A = LR of the matrix
A with a unit lower triangular matrix L and an upper triangular matrix R
is called the Gaussian triangular factorization, or briefly LR-factorization
of A. If such a factorization exists, then L and R are uniquely determined
(cf. Exercise 1.2). (In most of the English literature the matrix R is de-
noted by U-for Upper triangular-and accordingly Gaussian triangular

factorization is then called LU-factorization; in this book we typically use


U for unitary matrices.)
Algorithm 1.4 Gaussian Elimination.
(a) A = LR Triangular Factorization, R upper and L lower
triangular matrix,
(b) Lz = b Forward Substitution,
(c) Rx = z Backward Substitution.
The array storage scheme for Gaussian elimination is based on the repre-
sentation (1.5) of the matrices A (k). In the remaining memory locations
one can store the l_ik's, because the other elements, with values 0 or 1,
do not have to be stored. The entire memory cost for Gaussian elimination
amounts to n( n + 1) memory locations, i.e., as many as are needed to define
the problem. The cost in terms of number of multiplications is

Σ_{k=1}^{n−1} k²  ≐  n³/3        for (a),

Σ_{k=1}^{n−1} k   ≐  n²/2        both for (b) and (c).

Therefore the main cost comes from the LR-factorization. However, if dif-
ferent right-hand sides b_1, ..., b_j are considered, then this factorization has
to be carried out only once.
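As an illustration, step (a) of Algorithm 1.4 can be sketched in Python/NumPy as
follows (a sketch only, assuming that no pivot vanishes); the substitution routines
sketched in Section 1.1 then realize steps (b) and (c):

    import numpy as np

    def lr_factorize(A):
        # Gaussian triangular factorization A = LR without pivoting (Algorithm 1.4 (a)).
        # Assumes that all pivots a_kk^(k) are nonzero.
        A = np.array(A, dtype=float)
        n = A.shape[0]
        L = np.eye(n)
        for k in range(n - 1):
            L[k+1:, k] = A[k+1:, k] / A[k, k]               # multipliers l_ik
            A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])   # update of the remainder matrix
        return L, np.triu(A)

    # Solving Ax = b then amounts to
    #   L, R = lr_factorize(A)
    #   x = backward_substitution(R, forward_substitution(L, b))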

1.3 Pivoting Strategies and Iterative Refinement


As seen from the simple example

A = [ 0  1 ] ,     det A = −1,   a_11 = 0,
    [ 1  0 ]

there are cases where the triangular factorization fails even when det A ≠ 0.
However, an interchange of rows leads to the simplest LR-factorization we
can imagine, namely,

Ā = [ 1  0 ] = I = LR   with   L = R = I.
    [ 0  1 ]

In the numerical implementation of Gaussian elimination, difficulties can


arise not only when pivot elements vanish, but also when they are "too
small."
Example 1.5 (cf. [32]) We compute the solution of the system
(a)  1.00 · 10⁻⁴ x_1 + 1.00 x_2 = 1.00
(b)  1.00 x_1 + 1.00 x_2 = 2.00
on a machine, which, for the sake of simplicity, works only with three
exact decimal places. By completing the numbers with zeros, we obtain the

"exact" solution with four correct figures

x_1 = 1.000,   x_2 = 0.9999,

and with three correct figures

x_1 = 1.00,   x_2 = 1.00.

Let us now carry out the Gaussian elimination on our computer, i.e., in
three exact decimal figures:

l_21 = a_21 / a_11 = 1.00 / (1.00 · 10⁻⁴) = 1.00 · 10⁴,

(1.00 − 1.00 · 10⁴ · 1.00 · 10⁻⁴) x_1 + (1.00 − 1.00 · 10⁴ · 1.00) x_2
                                      = 2.00 − 1.00 · 10⁴ · 1.00.

Thus we obtain the upper triangular system

1.00 · 10⁻⁴ x_1 + 1.00 x_2 =  1.00
            −1.00 · 10⁴ x_2 = −1.00 · 10⁴

and the "solution"

x_2 = 1.00 (true),   x_1 = 0.00 (false!).

However, if before starting the elimination, we interchange the rows

(a)  1.00 x_1 + 1.00 x_2 = 2.00
(b)  1.00 · 10⁻⁴ x_1 + 1.00 x_2 = 1.00,

then l_21 = 1.00 · 10⁻⁴, which yields the upper triangular system

1.00 x_1 + 1.00 x_2 = 2.00
           1.00 x_2 = 1.00

as well as the "true solution"

x_2 = 1.00,   x_1 = 1.00.
By interchanging the rows in the above example we thus obtain as the new
pivot ā_11 the largest element, in absolute value, of the first column.
We can deduce the partial pivoting or column pivoting strategy from the
above considerations. This strategy is to choose at each Gaussian elimina-
tion step as pivot row the one having the largest element in absolute value
within the pivot column. More precisely, we can formulate the following
algorithm:

Algorithm 1.6 Gaussian elimination with column pivoting.

(a) In elimination step A^(k) → A^(k+1) choose a p ∈ {k, ..., n} such that

    |a_pk^(k)| ≥ |a_jk^(k)|   for j = k, ..., n.

    Row p becomes pivot row.

(b) Interchange rows p and k,

                  a_kj^(k)   if i = p,
    ā_ij^(k) :=   a_pj^(k)   if i = k,
                  a_ij^(k)   otherwise.

    Now we have

    |l_ik| = | ā_ik^(k) / ā_kk^(k) | = | a_ik^(k) / a_pk^(k) | ≤ 1.

(c) Perform the next elimination step for Ā^(k), i.e.,

    Ā^(k) → A^(k+1).
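A compact sketch of the complete factorization with column pivoting, P A = LR,
might look as follows (Python/NumPy; the bookkeeping of the permutation is one of
several possible choices and is ours, not the book's):

    import numpy as np

    def lr_factorize_pivot(A):
        # Triangular factorization with column pivoting, P A = L R (Algorithm 1.6).
        A = np.array(A, dtype=float)
        n = A.shape[0]
        perm = np.arange(n)                          # row order; encodes the permutation matrix P
        L = np.zeros((n, n))
        for k in range(n - 1):
            p = k + np.argmax(np.abs(A[k:, k]))      # (a) pivot row: largest |a_jk^(k)|, j = k, ..., n
            if p != k:                               # (b) interchange rows p and k
                A[[k, p], :] = A[[p, k], :]
                L[[k, p], :k] = L[[p, k], :k]
                perm[[k, p]] = perm[[p, k]]
            L[k+1:, k] = A[k+1:, k] / A[k, k]        # (c) elimination step; all |l_ik| <= 1
            A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
        np.fill_diagonal(L, 1.0)
        P = np.eye(n)[perm]                          # P A reorders the rows according to perm
        return P, L, np.triu(A)

    # Solve Ax = b via P A x = P b:
    #   P, L, R = lr_factorize_pivot(A)
    #   x = backward_substitution(R, forward_substitution(L, P @ b))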

Remark 1.7 Instead of column pivoting with row interchange one can
also perform row pivoting with column interchange. Both strategies require
at most O(n²) additional operations. If we combine both methods and
look at each step for the largest element in absolute value of the entire
remaining matrix, then we need O(n³) additional operations. This total
pivoting strategy is therefore almost never employed.
In the following formal description of the triangular factorization with
partial pivoting we use permutation matrices P ∈ Mat_n(R). For each
permutation π ∈ S_n we define the corresponding matrix

P_π = [e_{π(1)} ··· e_{π(n)}],

where e_j = (δ_1j, ..., δ_nj)^T is the jth unit vector. A permutation π of the
rows of the matrix A can be expressed as a premultiplication by P_π:

Permutation of rows π:      A → P_π A,

and analogously a permutation π of the columns as a postmultiplication:

Permutation of columns π:   A → A P_π.

It is known from linear algebra that the mapping

π ⟼ P_π

is a group homomorphism S_n → O(n) of the symmetric group S_n into the
orthogonal group O(n). In particular we have

P_π⁻¹ = P_π^T.
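As a small illustration of our own, the matrix P_π can be set up directly from
its defining columns and its orthogonality checked numerically:

    import numpy as np

    def permutation_matrix(pi):
        # P_pi = [e_{pi(1)} ... e_{pi(n)}]: column j is the unit vector e_{pi(j)} (0-based indices).
        n = len(pi)
        P = np.zeros((n, n))
        P[pi, np.arange(n)] = 1.0
        return P

    P = permutation_matrix([2, 0, 1])                 # a sample permutation
    assert np.allclose(np.linalg.inv(P), P.T)         # P_pi^{-1} = P_pi^T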

The determinant of a permutation matrix is just the sign of the
corresponding permutation,

det P_π = sgn π ∈ {±1};

i.e., it is equal to +1, if π consists of an even number of transpositions,
and −1 otherwise. The following proposition shows that, theoretically, the
triangular factorization with partial pivoting fails only when the matrix A
is singular.

Theorem 1.8 For every invertible matrix A there exists a permutation
matrix P_π such that a triangular factorization of the form

P_π A = LR

is possible. Here P_π can be chosen so that all elements of L are less than
or equal to one in absolute value, i.e.,

|L| ≤ 1.
Proof. We employ the LR-factorization algorithm with column pivoting.
Since det A ≠ 0, there is a transposition τ_1 ∈ S_n such that the first diagonal
element a_11^(1) of the matrix

A^(1) := P_{τ_1} A

is different from zero and is also the largest element in absolute value in
the first column, i.e.,

0 ≠ |a_11^(1)| ≥ |a_i1^(1)|   for i = 1, ..., n.

After eliminating the remaining elements of the first column we obtain the
matrix

A^(2) = L_1 P_{τ_1} A = [ a_11^(1)    *    ]
                        [    0              ]
                        [    ⋮     B^(2)    ]
                        [    0              ]

where all elements of L_1 are less than or equal to one in absolute value, i.e.,
|L_1| ≤ 1, and det L_1 = 1. The remaining matrix B^(2) is again invertible,
since |a_11^(1)| ≠ 0 and

0 ≠ sgn(τ_1) det A = det A^(2) = a_11^(1) det B^(2).

Now we can proceed by induction and obtain

R = A^(n) = L_{n−1} P_{τ_{n−1}} ··· L_1 P_{τ_1} A,                  (1.6)

where |L_k| ≤ 1, and τ_k is either the identity or the transposition of two
numbers ≥ k. If π ∈ S_n only permutes numbers ≥ k+1, then the Frobenius
matrix

        [ 1                               ]
        [    ⋱                            ]
L_k =   [       1                         ]
        [      −l_{k+1,k}   1             ]
        [         ⋮             ⋱         ]
        [      −l_{n,k}             1     ]

satisfies

                 [ 1                                 ]
                 [    ⋱                              ]
P_π L_k P_π⁻¹ =  [       1                           ]              (1.7)
                 [      −l_{π(k+1),k}   1            ]
                 [         ⋮                ⋱        ]
                 [      −l_{π(n),k}             1    ]

Therefore we can separate Frobenius matrices L_k and permutations P_{τ_k} by
inserting in (1.6) the identities P_{τ_k}⁻¹ P_{τ_k}, i.e.,

R = L_{n−1} P_{τ_{n−1}} L_{n−2} P_{τ_{n−1}}⁻¹ P_{τ_{n−1}} P_{τ_{n−2}} L_{n−3} ··· L_1 P_{τ_1} A.

Hence we obtain

R = L̂_{n−1} ··· L̂_1 P_{π_0} A    with    L̂_k := P_{π_k} L_k P_{π_k}⁻¹,

where π_{n−1} := id and π_k := τ_{n−1} ··· τ_{k+1} for k = 0, ..., n−2. Since the
permutation π_k interchanges in fact only numbers ≥ k+1, the matrices
L̂_k are of the form (1.7). Consequently

P_{π_0} A = LR

with L := L̂_1⁻¹ ··· L̂_{n−1}⁻¹, or explicitly

     [ 1                                          ]
     [ l_{π_1(2),1}      1                        ]
L =  [ l_{π_1(3),1}  l_{π_2(3),2}    1            ]
     [      ⋮              ⋮             ⋱        ]
     [ l_{π_1(n),1}  l_{π_2(n),2}   ···       1   ]

and therefore |L| ≤ 1.                                               □


Note that we have used Gaussian decomposition with column pivoting as
a constructive tool for proving an existence theorem. Again we see that,
as in the case of Cramer's rule, there exists a direct connection between
the algorithm and existence and uniqueness statements. In other words:
Gaussian elimination with column pivoting is a reliable algorithm.

Remark 1.9 Let us also note that the determinant of A can be easily
computed by using the PA = LR-factorization of Theorem 1.8 via the
formula

det A = det(P) · det(LR) = sgn(π_0) · r_11 ··· r_nn.

A warning should be made against a naive computation of determinants!
As is well known, multiplication of a linear system by an arbitrary scalar
α results in

det(αA) = α^n det A.

This trivial transformation may be used to convert a "small" determinant
into an arbitrarily "large" one and the other way around. The only invari-
ants under this class of trivial transformations are the Boolean quantities
det A = 0 or det A ≠ 0; for an odd n we have additionally sgn(det A).
Subsequently, this feature will lead to a theoretical characterization of the
solvability of linear systems, which is not based on determinants. Moreover,
another criterion for the assessment of algorithms becomes apparent: if the
problem itself is invariant under some kind of transformation, then we will
require the same invariance for a "good" algorithm whenever this can be
realized.
Attentive observers will have already recognized that the pivoting strat-
egy can be arbitrarily changed by multiplying different rows by different,
instead of the same, scalars. This observation leads us to the enormously
important question of scaling. By row scaling we mean premultiplication
of A by a diagonal matrix

A → D_r A,   D_r diagonal matrix,

and analogously, by column scaling we mean postmultiplication by a


diagonal matrix

A → A D_c,   D_c diagonal matrix.


(As we have already seen in the context of Gaussian elimination, linear
operations on the rows of a matrix can be expressed by premultiplication
with suitable matrices and correspondingly operations on columns are rep-
resented by postmultiplication.) Mathematically speaking, scaling changes
the length of the basis vectors of the range (row scaling) and of the domain
(column scaling) of the linear mapping defined by the matrix A, respec-
tively. If this mapping models a physical phenomenon, then we can interpret
scaling as a change of unit, or gauge transformation (e.g., from Å to km).
In order to make the solution of the linear system Ax = b independent
of the choice of unit we have to scale the system appropriately by pre- or
postmultiplying the matrix A by suitable diagonal matrices:

A → Ā := D_r A D_c,

where

D_r = diag(σ_1, ..., σ_n)   and   D_c = diag(τ_1, ..., τ_n).
At first glance the following three strategies seem to be reasonable:
(a) Row equilibration of A with respect to a vector norm ||·||. Let A_i be
    the ith row of A and assume that there are no zero rows. By setting
    D_c := I and

    σ_i := ||A_i||⁻¹   for i = 1, ..., n,

    we make all rows of Ā have norm one.

(b) Column equilibration. Suppose that there are no columns A^j of A
    equal to zero. By setting D_r := I and

    τ_j := ||A^j||⁻¹   for j = 1, ..., n,

    we make all columns of Ā have norm one.


(c) Following (a) and (b), it is natural to require that all rows of A have
the same norm and at the same time that all columns of A have the
same norm. In order to determine σ_i and τ_j up to a mutual common
factor one has to solve a nonlinear system with 2n − 2 unknowns. This
obviously requires a great deal more effort than solving the original
problem. As will be seen in the fourth chapter, the solution of this
nonlinear system requires the solution of a sequence of linear systems,
now in 2n − 2 unknowns, for which the problem of scaling has to be
addressed again.
In view of this dilemma, most programs (e.g., LAPACK [4]) leave the
scaling issue up to the user.
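For completeness, a minimal sketch of the two equilibration strategies (a) and (b);
the choice of the maximum norm here is arbitrary and only serves as an example:

    import numpy as np

    def row_equilibrate(A, b):
        # Row equilibration A -> D_r A, b -> D_r b with sigma_i = ||A_i||^{-1} (maximum norm).
        sigma = 1.0 / np.abs(A).max(axis=1)          # assumes there are no zero rows
        return sigma[:, None] * A, sigma * b

    def column_equilibrate(A):
        # Column equilibration A -> A D_c with tau_j = ||A^j||^{-1}; after solving the
        # scaled system for y, the original unknowns are recovered as x = D_c y.
        tau = 1.0 / np.abs(A).max(axis=0)            # assumes there are no zero columns
        return A * tau[None, :], tau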
Obviously, the pivoting strategies discussed above cannot prevent the
possibility of computing a "rather inaccurate" solution x̃. How can we
improve the accuracy of x̃ without too much effort? Of course, we could
simply discard the solution x̃ altogether and try to compute a "better" one
by using a higher machine precision. However, in this way all information
contained in the computed x̃ would be lost. This can be avoided in the
iterative refinement method by explicitly evaluating the residual

r(y) := b − Ay = A(x − y).

The absolute error Δx_0 := x − x_0 of x_0 := x̃ satisfies the equation

A Δx_0 = r(x_0).                                          (1.8)

In solving this corrector equation (1.8), we obtain an approximate correction
Δx̃_0 ≈ Δx_0, which is again afflicted by rounding errors. In spite of this fact
we expect that the approximate solution

x_1 := x_0 + Δx̃_0

is "better" than Xo. The idea of iterative refinement consists in repeating


this process until the approximate solution Xi is "accurate enough." We
should remark that the linear system (1.8) differs from the original lin-
ear system only by the right-hand side, so that the computation of the
corrections ~Xi requires comparatively little additional effort.
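A sketch of the refinement loop (in Python/NumPy; the argument
solve_with_factorization is a name of our choosing and stands for a routine that
reuses the LR-factorization of A already computed, e.g., the substitution steps above):

    import numpy as np

    def iterative_refinement(A, b, solve_with_factorization, steps=1):
        # Iterative refinement: repeatedly solve the corrector equation A dx = r(x_i)
        # with the residual r(x_i) = b - A x_i and update x_{i+1} = x_i + dx.
        x = solve_with_factorization(b)          # x_0: the first (rounded) approximate solution
        for _ in range(steps):
            r = b - A @ x                        # residual r(x_i)
            dx = solve_with_factorization(r)     # approximate correction from the same factorization
            x = x + dx
        return x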
Iterative refinement works particularly well in conjunction with Gaussian
elimination with column pivoting: in Section 2.4.3 we will formulate the
substantial result of R. D. Skeel [77] that states that in this algorithmic
setting a single refinement step is enough for obtaining an "acceptable" or
sufficiently accurate solution. There we will also state more precisely what
we mean by the up to now vague terms "better approximate solution" and
"sufficiently accurate."

1.4 Cholesky Decomposition for Symmetric Positive Definite Matrices
We want now to apply Gaussian elimination to the special class of systems
of equations with symmetric positive definite (Spd) matrices. It will become
clear that, in this case, the triangular factorization can be substantially
simplified. We recall that a symmetric matrix A = A^T ∈ Mat_n(R) is
positive definite if and only if

(x, Ax) > 0   for all x ≠ 0.                              (1.9)

We call such matrices for short Spd-matrices.

Theorem 1.10 For any Spd-matrix A ∈ Mat_n(R) we have

(i) A is invertible.

(ii) a_ii > 0 for i = 1, ..., n.

(iii) max_{i,j=1,...,n} |a_ij| = max_{i=1,...,n} a_ii.

(iv) Each remainder matrix obtained during Gaussian elimination


without pivoting is also symmetric positive definite.

Obviously (iii) and (iv) say that row or column pivoting is not necessary
for LR-factorization; in fact it is even absurd because it might destroy the
structure of A. In particular (iii) means that total pivoting can be reduced
to diagonal pivoting.

Proof. The invertibility of A follows immediately from (1.9). If we put
in (1.9) a basis vector e_i instead of x, it follows immediately that a_ii =
(e_i, A e_i) > 0 and therefore the second claim is proven. The third statement
is proved similarly (cf. Exercise 1.7). In order to prove statement (iv) we
write A = A^(1) as

A^(1) = [ a_11   z^T ]
        [   z     B  ]

where z = (a_12, ..., a_1n)^T, and after one elimination step we obtain

L_1 A^(1) = [ a_11   z^T  ]      with   L_1 = [    1        0 ]
            [   0   B^(2) ]                   [ −z/a_11     I ] .

Now if we multiply A^(2) = L_1 A^(1) from the right with L_1^T, then z^T in the
first row is also eliminated and the remainder matrix B^(2) remains unchanged, i.e.,

L_1 A^(1) L_1^T = [ a_11    0   ]
                  [   0   B^(2) ] .

The operation A → L_1 A L_1^T describes a change of basis for the bilinear
form defined by the symmetric matrix A. According to the inertia theorem
of Sylvester, L_1 A^(1) L_1^T and with it B^(2) remain positive definite.        □
of Sylvester, L1A (1) L[ and with it B(2) remain positive definite. D

Together with the LR-factorization we can now deduce the rational


Cholesky decomposition or factorization for symmetric positive definite
matrices.
Theorem 1.11 For every symmetric positive definite matrix A there exists
a uniquely determined factorization of the form
A = L D L^T,

where L is a unit lower triangular matrix and D a positive diagonal matrix.

Proof. We continue the construction from the proof of Theorem 1.10 for
k = 2, ..., n−1 and obtain immediately L as the product of L_1⁻¹, ..., L_{n−1}⁻¹
and D as the diagonal matrix of the pivots.                                      □

Corollary 1.12 Since D = diag(d_i) is positive, the square root D^{1/2} :=
diag(√d_i) exists and with it the Cholesky factorization

A = L̄ L̄^T,                                               (1.10)

where L̄ is the lower triangular matrix L̄ := L D^{1/2}.

The matrix L̄ = (l_ij) can be computed by using Cholesky's method,
which in compact form reads:

Algorithm 1.13 Cholesky decomposition.

for k := 1 to n do
    l_kk := (a_kk − Σ_{j=1}^{k−1} l_kj²)^{1/2};
    for i := k + 1 to n do
        l_ik := (a_ik − Σ_{j=1}^{k−1} l_ij l_kj) / l_kk;
    end for
end for
The derivation of this algorithm is nothing more than the elementwise
evaluation of equation (1.10)

i = k :   a_kk = l_k1² + ... + l_{k,k−1}² + l_kk²

i > k :   a_ik = l_i1 l_k1 + ... + l_{i,k−1} l_{k,k−1} + l_ik l_kk.
The tricky idea of the method is contained in the sequence of computations
for the elements of L̄. As for the computational cost we have about n³/6
multiplications and n square roots.
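In Python/NumPy, Algorithm 1.13 can be sketched as follows (our illustration; no
check of the Spd-property is made, so for a non-Spd matrix the square root would fail):

    import numpy as np

    def cholesky_factorize(A):
        # Cholesky decomposition A = L L^T following Algorithm 1.13.
        n = A.shape[0]
        L = np.zeros((n, n))
        for k in range(n):
            L[k, k] = np.sqrt(A[k, k] - L[k, :k] @ L[k, :k])
            for i in range(k + 1, n):
                L[i, k] = (A[i, k] - L[i, :k] @ L[k, :k]) / L[k, k]
        return L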

In contrast, the rational Cholesky factorization requires no square roots,


but only rational operations (whence the name). By smart programming
the cost can be kept here also to about n³/6. An advantage of the rational
Cholesky factorization is that almost singular matrices D can be recognized.
In addition, the method can be extended to symmetric indefinite matrices
(x^T A x ≠ 0 for all x ≠ 0).
Remark 1.14 The supplemental Spd-property has obviously led to a sen-
sible reduction of the computational cost. At the same time, this property
forms the basis of a completely different type of solution methods that will
be described in detail in Section 8. Such methods play a role when the
occurring matrices are large and even sparse.

Exercises
Exercise 1.1 Give an example of a full nonsingular (3,3)-matrix for which
Gaussian elimination without pivoting fails.
Exercise 1.2
(a) Show that the unit (nonsingular) lower (upper) triangular matrices
form a subgroup of GL(n).

(b) Apply (a) to show that the representation


A=LR
of a nonsingular matrix A E GL(n) as the product of a unit lower
triangular matrix L and a nonsingular upper triangular matrix R is
unique, provided it exists.
(c) If A = LR as in (b), then Land R can be computed by Gaussian
triangular factorization. Why is this another proof of (b) ?
Hint: Use induction.
Exercise 1.3 A matrix A ∈ Mat_n(R) is called strictly diagonally
dominant if

|a_ii| > Σ_{j=1, j≠i}^{n} |a_ij|   for i = 1, ..., n.

Show that Gaussian triangular factorization can be performed for any ma-
trix A E Matn(R) with a strictly diagonally dominant transpose AT. In
particular any such A is invertible.
Hint: Use induction.
Exercise 1.4 The numerical range W(A) of a matrix A ∈ Mat_n(R) is
defined as the set

W(A) := {(Ax, x) | (x, x) = 1, x ∈ R^n}.

Here (·,·) is the Euclidean scalar product on R^n.

(a) Show that the matrix A ∈ Mat_n(R) has an LR-factorization (L unit
    lower triangular, R upper triangular) if and only if the origin is not
    contained in the numerical range of A, i.e.,

    0 ∉ W(A).

    Hint: Use induction.

(b) Use (a) to show that the matrix

    [~    :n

    has no LR-factorization.
Exercise 1.5 Program the Gaussian triangular factorization. The pro-
gram should read data A and b from a data file and should be tested
on the following examples:
(a) with the matrix from Example 1.1,
(b) with n = 1, A = 25 and b = 4,

(c) with a_ij = i^{j−1} and b_i = i for n = 7, 15, and 50.


Compare in each case the computed and the exact solutions.
Exercise 1.6 Gaussian elimination with column pivoting applied to the
matrix A delivers the factorization P A = LR, where P is the permutation
matrix produced during elimination. Show that:
(a) Gaussian elimination with column pivoting is invariant with respect
to
(i) permutation of rows of A (with the trivial exception that there
are several elements of equal absolute value per column).
(ii) Multiplication of the matrix by a number σ ≠ 0, A → σA.
(b) If D is a diagonal matrix, then Gaussian elimination with column
pivoting applied to Ā := AD delivers the factorization P Ā = L R̄
with R̄ = RD.
Consider the corresponding behavior for a row pivoting strategy with
column interchange as well as for total pivoting with row and column
interchange.
Exercise 1.7 Let the matrix A ∈ Mat_n(R) be symmetric positive definite.

(a) Show that

    |a_ij| ≤ √(a_ii a_jj) ≤ (a_ii + a_jj)/2   for all i, j = 1, ..., n.

    Hint: Show first that the matrix

    [ a_ii  a_ij ]
    [ a_ji  a_jj ]

    is symmetric positive definite for all i, j.

(b) Deduce from (a) that

    max_{i,j} |a_ij| = max_{i} a_ii.

Interpret the result in the context of pivoting strategies.


Exercise 1.8 Show that for any u, v ∈ R^n we have

(a) (I + u v^T)⁻¹ = I − u v^T / (1 + v^T u),   whenever u^T v ≠ −1.

(b) I + u v^T is singular whenever u^T v = −1.
Exercise 1.9 The linear system Ax = b with matrix

A = [  R    v ]
    [ u^T   0 ]

is to be solved, where R ∈ Mat_n(R) is an invertible upper triangular
matrix, u, v ∈ R^n and x, b ∈ R^{n+1}.

(a) Specify the triangular factorization of A.


(b) Show that A is nonsingular if and only if u^T R⁻¹ v ≠ 0.
(c) Formulate an economical algorithm for solving the above linear
system and determine its computational cost.
Exercise 1.10 In the context of probability distributions one encounters
matrices A ∈ Mat_n(R) with the following properties:

(i) Σ_{i=1}^{n} a_ij = 0 for j = 1, ..., n;

(ii) a_ii < 0 and a_ij ≥ 0 for i = 1, ..., n and j ≠ i.

Let A = A^(1), A^(2), ..., A^(n) be produced during Gaussian elimination.
Show that

(a) |a_11| ≥ |a_i1| for i = 2, ..., n;

(b) Σ_{i=2}^{n} a_ij^(2) = 0 for j = 2, ..., n;

(c) a_ii^(1) ≤ a_ii^(2) ≤ 0 for i = 2, ..., n;

(d) a_ij^(2) ≥ a_ij^(1) ≥ 0 for i, j = 2, ..., n and j ≠ i;

(e) If the diagonal elements produced successively during the first n − 2
    Gaussian elimination steps are all nonzero (i.e., a_ii^(i) < 0 for i =
    1, ..., n − 1), then a_nn^(n) = 0.
Exercise 1.11 A problem from astrophysics ("cosmic maser") can be for-
mulated as a system of (n + 1) linear equations in n unknowns of the
form

A x = 0,   x_1 + ··· + x_n = 1,

where A is the matrix from Exercise 1.10. In order to solve this system
we apply Gaussian elimination on the matrix A with the following two
additional rules, where the matrices produced during elimination are de-
noted again by A = A^(1), ..., A^(n−1) and the relative machine precision is
denoted by eps.

(a) If during the algorithm |a_kk^(k)| ≤ |a_kk| eps for some k < n, then shift
    simultaneously column k and row k to the end and the other columns
    and rows toward the front (rotation of rows and columns).

(b) If |a_kk^(k)| ≤ |a_kk| eps for all remaining k < n − 1, then terminate the
    algorithm.

Show that:
(i) If the algorithm does not terminate in (b), then after n−1 elimination
    steps it delivers a factorization of A as P A P^T = LR, where P is a
    permutation and R = A^(n−1) is an upper triangular matrix with
    r_nn = 0, r_ii < 0 for i = 1, ..., n−1 and r_ij ≥ 0 for j > i.

(ii) The system has in this case a unique solution x, and all components
of x are nonnegative (interpretation: probabilities).
Give a simple scheme for computing x.
Exercise 1.12 Program the algorithm developed in Exercise 1.11 for solv-
ing the special system of equations and test the program on two examples

of your choice of dimensions n = 5 and n = 7, as well as on the matrix
2 0
-4 1
1 -2 .
1 0 -2
Exercise 1.13 Let a linear system Cx = b be given, where C is an
invertible (2n, 2n)-matrix of the following special form:

C = [ A  B ] ,      A, B invertible.
    [ B  A ]

(a) Let C⁻¹ be partitioned as C:

    C⁻¹ = [ E  F ] .
          [ G  H ]

    Prove the identity by I. Schur:

    E = H = (A − BA⁻¹B)⁻¹   and   F = G = (B − AB⁻¹A)⁻¹.

(b) Let x = (x_1, x_2)^T and b = (b_1, b_2)^T be likewise partitioned and

    (A + B) y_1 = b_1 + b_2,   (A − B) y_2 = b_1 − b_2.

    Show that

    x_1 = (y_1 + y_2)/2,   x_2 = (y_1 − y_2)/2.
Numerical advantage?
2
Error Analysis

In the previous chapter, we got to know a class of methods for the numerical
solution of linear systems. Formally speaking, we there computed, from a
given input data (A, b), the solution f(A, b) = A⁻¹b. With this example in
mind, we want to analyze algorithms from a more abstract point of view
in the present section.
Let a problem be abstractly characterized by (f, x) for a given mapping
f and given input data x. To solve the problem then means to compute
the result f(x) by means of an algorithm that may produce intermediate
results as well. The situation is described by the scheme

input data → algorithm → output data.

In this chapter we want to see how errors come up and influence this pro-
cess and, in particular, whether Gaussian elimination is indeed a reliable
algorithm. Errors in the numerical result arise from errors in the data or
input errors as well as from errors from the algorithm.

input errors, algorithm errors → errors in output.

In principle, we are powerless against input errors, since they belong to the
given problem and can only be avoided by changing the problem setting.
The situation is clearly different with errors caused by the algorithm. Here
we have the chance to avoid or, at least, to diminish errors by changing the
P. Deuflhard et al., Numerical Analysis in Modern Scientific Computing
© Springer-Verlag New York, Inc. 2003
22 2. Error Analysis

method. In what follows the distinction between the two kinds of errors
will lead us to the notions of the condition of a problem as opposed to the
stability of an algorithm. First we want to discuss the possible sources of
errors.

2.1 Sources of Errors


Even when input data are considered to be given exactly, errors in the
data may still occur because of the machine representation of noninteger
numbers. With today's usual floating point representation, a number z of
"real type" is represented as z = ad e , where the basis d is a power of two
(as a rule d is 2, 8, or 16) and the exponent e is an integer of a given
maximum number of binary positions,
e E {emin, ... , emax } C Z .
The mantissa a is either 0 or a number satisfying d- 1 ~ lal < 1 and has
the form

a = v Laid-i,
i=1

where v E {±1} is the sign, ai E {O, ... ,d -I} are the digits (it is assumed
that a = 0 or a1 =f. 0), and l is the length of the mantissa. The numbers
that are representable in this way form a subset
N := {x E R I there is a, e as above, so that x = ad e }
of real numbers. The range of the exponent e defines the largest and small-
est number that can be represented on the machine (by which we mean
the processor together with the compiler). The length of the mantissa is
responsible for the relative precision of the representation of real numbers
on the given machine. Every number x =f. 0 with
demin-1 ~ Ixl ~ d emax (l _ d- 1)
is represented as a floating point number by rounding to the closest machine
number whose relative error is estimated by

Ix 1~(x)1 ~ eps := d 1 - 1/2.


Here we use for division the convention % = 0 and x/O = 00 for x > O.
We say that we have an underflow when Ixl is smaller than the smallest
machine number demin-1 and, an overflow when Ixl > demax(l_d-l). We call
eps the relative machine precision or the machine epsilon. In the literature
this quantity is also denoted by u for "unit roundoff" or "unit round." For
single precision in FORTRAN, or float in C, we have usually eps :=;:j 10- 7 .
2.1. Sources of Errors 23

Let us imagine that we wanted to enter in the machine a mathematically


exact real number x, for example,

x = 'if = 3.141592653589 ... ,


It is known theoretically that 'if as an irrational number cannot be rep-
resented with a finite mantissa and is therefore affected by errors on any
computer, e.g., for eps = 10- 7 we have

'if f---+ fl('if) = 3.141593, Ifl('if) - 'if I ~ eps 'if.


Here it is essential that the number x after being introduced into the ma-
chine is indistinguishable from any other numbers x that are rounded to
the same floating point number fl(x) = fl(x). In particular the special
real number obtained by appending zeros to a machine number is by no
means "distinguished." Therefore it would be unreasonable to look out for
an "exact" input. This insight will decisively influence the following error
analysis.
A further important source of input errors are measurement errors that
come up when input quantities are obtained from experimental data. Such
data x are usually given together with their absolute error Ox, the so-called
tolerance, which is the distance from x to the "true" value x; it can be
estimated componentwise by

Ix - xl ~ Ox.
In many important practical situations the relative precision lox/xl lies in
between 10- 2 and 1O- 3 -a quantity that in general outweighs by far the
rounding of the input data. In this context the term technical precision is
often used.
Let us now go to the second group of error sources, the errors in the
algorithm. The realization of an elementary operation
o E {+, -, ., /}
by the corresponding floating point operation 0 E {-t-,.':..,~, /} does not
avoid the rounding errors. The relative error here is less than or equal to
the machine precision; i.e., for x, YEN, we have

xoy = (x 0 y)(l + E) for an E = E(X, y) with lEI ~ eps.


One should notice that in general the operation 0 is not associative (see
Exercise 2.1), so that within N the order sequence of the operations to be
executed is very important.
Besides rounding errors, an algorithm may also produce approxima-
tion errors. Such errors appear whenever a function cannot be calculated
exactly, but must be approximated. This happens, for example, when com-
puting sine by a truncated power series or, to mention a more complex
problem, in the solution of a differential equation. In this chapter we will
24 2. Error Analysis

essentially limit ourselves to the treatment of rounding errors. Approxima-


tion errors will be studied in the context of particular algorithms in later
chapters.

2.2 Condition of Problems


In this section we treat the following question:
How do perturbations of input variables influence the result indepen-
dently of the choice of algorithm?
We have seen in the above description of input errors that the input x is
logically indistinguishable from all input data x that are within the range of
a given precision. Instead of the "exact" input x we should instead consider
an input set E, that contains all perturbed inputs x (see Figure 2.1).

fE:;\ .................
f
·.· ..... ·· ....cdf (x)
~ R

Figure 2.1. Input and output sets.

A machine number x clearly represents the input set


E = {x E R I Ix - x I :s: eps Ix I} .
In case the input x is given up to absolute tolerance ox, then the input set
would be
E = {x E R I Ix - xl:s: ox}.
The function f defining our problem maps the input set E into an output set
R = f(E). With the same arguments that led us from the putatively exact
input x to consider an input set E, we will have to replace the pointwise
mapping f : x f-+ f(x) the set valued mapping f : E f-+ R = f(E). The
effect of perturbations of input data on the output quantities can then be
expressed by some measure of a ratio of output versus input sets-a ratio
that we will call the condition of a problem stated by (j, x).
Example 2.1 In order to create a feeling for this not yet precise term,
let us examine beforehand the geometric problem of determining the
intersection point r of two lines g and h in the plane (see Figure 2.2).
Already in the graphical solution of this problem it is impossible to
represent the lines g and h exactly. The question is to what extent the
constructed intersection point depends on the drawing error (or: input er-
ror). In our example, the input set E consists of all lines 9 and h lying
within drawing precision from g and h, respectively; the output set R con-
sists of the corresponding intersection points r. We see at once that the
2.2. Condition of Problems 25

h
h
9

Figure 2.2. Nearly perpendicular intersection of two lines g, h (well-conditioned).

9 r
h

Figure 2.3. Small angle intersection of two lines g, h (ill-conditioned).

ratio between the input and the output sets depends strongly on the inter-
section angle <r.(g, h) between 9 and h. If 9 and h are nearly perpendicular,
the variation of the intersection point r is about the same as the varia-
tion of the lines 9 and h. If, however, the angle <r.(g, h) is small, i.e., the
lines 9 and h are nearly parallel, then one has real difficulties locating the
intersection point even with the naked eye (see Figure 2.3). Actually, the in-
tersection point r moves several times more compared to any perturbation
of the lines. We can therefore call the determination of the intersection
point well-conditioned in the first case, but ill-conditioned in the second
case.

Thus we arrive at a mathematical definition of the concept of condition.


For the sake of simplicity we assume that the problem (j, x) is given by a
mapping

from an open subset U c Rn into Rm, a point x E U and a (relative or


absolute) precision Ii of the input data. The precision Ii can be given either
through a norm II· lion R n

IIi: - xii :s: Ii (absolute) or IIi: - xii :s: Ii Ilxll (relative)

or componentwise

for i = 1, ... , n. Correspondingly, we measure the output error f(i:) - f(x)


either normwise or componentwise.
26 2. Error Analysis

2.2.1 Normwise Condition Analysis


In order to keep the calculations involved with the quantitative condition
analysis manageable, one usually chooses a sufficiently small input error (j
and examines the asymptotic ratio defining the condition in the linearized
error theory as (j ---7 O. In this setting the following short notation is ex-
tremely useful: Two functions g, h : R n ---7 Rm are equal to first-order
approximation or in leading order of x ---7 Xo, written as
g(x) ~ h(x) for x ---7 XO,

if
g(x) = h(x) + o(llh(x)ll) for x ---7 xo,
where the Landau symbol "o(llh(x)ll) for x ---7 xo" denotes a generic
function cp(x) having the property

lim Ilcp(x)11 = O.
X->Xo Ilh(x)11
Thus for a differentiable function f we have
f (x) - f (x) ~ l' (x ) (x - x) for x ---7 X •

Analogously we define "g(x) ~ h(x) for x ---7 xo" (componentwise) by


"g(x) ::; h(x) + o(llh(x)ll) for x ---7 xo."
Of course, we can characterize the normwise ratio between input and out-
put sets-without formal recourse to the derivative of f-via an asymptotic
Lipschitz condition.
Definition 2.2 The absolute normwise condition number of the problem
(j, x) is the smallest number Kabs 2: 0 such that
Ilf(x) - f(x)11 ~ Kabs Ilx - xii for x ---7 x.
The problem (j, x) is said to be ill-posed whenever such a number does not
exist, which is formally equivalent to Kabs = 00. Analogously the relative
normwise condition number of (j, x) is the smallest number Krel 2: 0 such
that
Ilf(x) - f(x)11 <: Ilx - xii -
Ilf(x) I - Krel I xii for x ---7 x.
Thus Kabs and Krel describe the increase of the absolute and the relative
errors, respectively. A problem (j, x) is said to be well-conditioned when its
condition number is small and ill-conditioned when it is large. Naturally,
the meaning of "small" and "large" has to be considered separately for each
problem. For the relative condition number, unity serves for orientation: it
corresponds to the case of pure rounding of the result (see the discussion
in Section 2.1). In what follows we will show what the terms "small" and
"large" mean in a sequence of illustrative examples.
2.2. Condition of Problems 27

If f is differentiable in x, then according to the mean value theorem, we


can express the condition numbers in terms of the derivative:
Ilxll ,
/'i;abs = 11f'(x)11 and /'i;rel = Ilf(x)llllf (x)11 , (2.1)

where 11f'(x)11 is the norm of the Jacobian f'(x) E Matm,n(R) in the


subordinate (or operator) norm

IIAxl1
IIAII := sup -II-II = sup IIAxl1 for A E Matm,n(R).
x¥O x Ilxll=l
For illustration let us compute the condition numbers for some simple
problems.
Example 2.3 Condition of addition (respectively, subtraction). Addition
is a linear mapping

f:R2-->R, G) f-+f(a,b):=a+b

with derivative f'(a, b) = (1,1) E Mat 1 ,2(R). If we choose the I-norm on


R2

II(a, bfll = tal + Ibl


and the absolute value on R, then it follows that the subordinate matrix
norm (see Exercise 2.8) is

11f'(a, b)11 = 11(1,1)11 = 1.


Therefore the condition numbers of addition are
lal + Ibl
/'i;abs = 1 and /'i;rel = Ia+b I'
Hence for the addition of two numbers of the same sign we have /'i;rel = 1. On
the other hand, it turns out that the subtraction of two nearly equal num-
bers is ill-conditioned according to the relative condition number, because
in this case we have

la + bl « tal + Ibl {=;> /'i;rel» 1.


This phenomenon is called cancellation of leading digits.
Example 2.4 Unavoidable cancellation. For illustration we give the
following example with eps = 10- 7 :
a 0.123467* +- perturbation at position 7
b 0.123456* +- perturbation at position 7
a-b 0.000011* 000 +- perturbation at position 3.
'-".-' '-v-"
leading zeros appended zeros
28 2. Error Analysis

An error in the seventh significant decimal digit of the input data a, b leads
to an error in the third significant decimal digit of the result a - b, i.e.,
K:rel ~ 104 .

Be aware of the fact that the cancellation of a digit of the result given
by a computer cannot be noticed afterward. The appended zeros are zeros
in the binary representation, which are lost via the transformation to the
decimal system. Therefore we arrive at the following rule:

A void avoidable subtraction of nearly equal numbers.

For unavoidable subtractions, a further rule will be derived in the


subsequent Section 2.3.

Example 2.5 Quadratic equations. A really classical example of an avoid-


able cancellation of leading digits arises in the solution of the quadratic
equation (see also Chapter 4)

x2 - 2px + q = 0,
whose solution is usually given by

Xl,2 = p ± Vp2 - q.

In this form the cancellation phenomenon occurs when one of the solutions
is close to zero. However, this cancellation of significant digits is avoidable
because, by Vieta's theorem, q is the product of the roots, which can be
exploited according to

to compute both solutions in a stable manner.

Example 2.6 Power series. Often cancellation errors can be avoided by


using power series expansions. For example, let us consider the function

1 - cos (x) 2 x2
X =;:1 (
1 - [1 - 2x + x4 ) X (
24 ± ... J ="2 1 - 12
)
± ... .

For x = 10- 4 we have x 2 /12 < 10- 9 and therefore, according to Leib-
niz's theorem on alternating power series, x/2 is an approximation of
(1 - cosx)/x correct up to eight decimal digits.

Example 2.7 Condition of a linear system Ax = b. In the solution of the


linear system Ax = b, we first consider only the vector b ERn as input.
Then the problem can be described by the mapping
2.2. Condition of Problems 29

which is linear in b. Its derivative is f' (b) = A-I so that the condition
numbers of the problem are given by

-1 Ilbll -1 IIAxl1 -1
Kabs = IIA I and Krel = IIA-lbIIIIA I = WIIA II·
Next we take perturbations in A into account, too. For that purpose we
consider the matrix A as input quantity

and keep b fixed. This mapping is nonlinear in A and differentiable. (This


follows, for example, via Cramer's rule and the fact that the determinant
of a matrix is a polynomial in the entries of the matrix-see also Remark
2.9 below). For the computation of the derivative we need:

Lemma 2.8 The mapping 9 : GL(n) c Matn(R) -) GL(n) with g(A) =


A -1 is differentiable, and

g'(A)C = -A- 1CA- 1 for all C E Matn(R). (2.2)

Proof. We differentiate with respect to t the equation (A + tC) (A + tC) -1 =


1 and obtain
d
0= C(A + tC)-1 + (A + tC) dt (A + tC)-1 .

In particular for t = 0 it follows that

g'(A)C = ~(A + tC)-11 = -A- 1CA- 1 .


dt t=O

Remark 2.9 The differentiability of the inverse follows easily also from
the Neumann series. If C E Matn(R) is a matrix with 11011 < 1, then 1 - C
is invertible and

C) -1 = L
00

(I - C k = 1 + C + C2 + ...
k=O

This fact is proved as for the summation formula 2:%':0 qk = 1/(1 - q) of


a geometrical progression with Iql < 1. Hence, for a matrix C E Matn(R)
with IIA-ICII < 1 it follows that
(A + C)-I (A(1 + A- 1C))-1 = (1 + A- 1C)-1 A-I
(I - A-IC + 0(11011)) A-I = A-I - A- 1CA- 1 + 0(11011)
for IICII -) 0 and therefore (2.2) holds. This argument remains valid for
bounded linear operators in Banach spaces as well.
30 2. Error Analysis

Lemma 2.8 implies that the derivative with respect to A of the solution
f(A) = A -lb of the linear system satisfies
f'(A) C = -A-1CA-lb = -A-1Cx for C E Matn(R).
In this way we arrive at the condition numbers

I\:abs 11f'(A)11 = sup IIA-1Cxll::; IIA-lllllxll,


Ilcll=l

~llf'(A)11
Ilxll < IIAIIIIA-lil
- .
The earlier calculated relative condition number with respect to the input
b can be estimated by
IIAII I A-111
I\:[el ::;

because of the submultiplicativity IIAxl1 ::; IIAII Ilxll of the matrix norm.
Therefore henceforth the quantity

will be called the condition number of the matrix A. It describes in particu-


lar the relative condition number of a linear system Ax = b for all possible
right-hand sides b ERn. Another representation for 1\:( A) (see Exercise
2.12) is
max IIAxl1
Ilxll=l
I\:(A):= min IIAxl1 E [0,00]. (2.3)
Ilxll=l
It has the advantage that it is well-defined for noninvertible and rectan-
gular matrices as well. With this representation we immediately verify the
following three properties of I\:(A):
(i) I\:(A) ~ 1,
(ii) l\:(aA) = I\:(A) for all a E R, a =I 0,
(iii) A =I 0 is singular if and only if I\:(A) = 00.
We see that the condition number I\:(A) , as opposed to the determinant
det A, is invariant under the scalar transformation A ---+ aA. Together
with properties (i) and (iii) this favors the use of condition numbers rather
than determinants for characterizing the solvability of a linear system. We
will go deeper into this subject in Section 2.4.1 below.
Example 2.10 Condition of nonlinear systems. Assume that we want to
solve a nonlinear system f(x) = y, where f : Rn ---+ Rn is a continuously
differentiable function and y ERn is the input quantity (mostly y = 0).
We see immediately that the problem is well-defined only if the derivative
f'(x) is invertible. In this case, according to the inverse function theorem,
2.2. Condition of Problems 31

f is also invertible in a neighborhood of y, i.e., x = f-l(y). The derivative


satisfies
(f-l)'(y) = f'(X)-I.
The condition numbers of the problem (f-l, y) are therefore

The conclusion clearly agrees with the geometrical determination of the


intersection point of two lines. If h:abs or h:rel are large, we have a situation
similar to the small angle intersection of two lines (see Figure 2.4).

y = 0 f--------::::===--=------r--
Xo x

Figure 2.4. Ill-conditioned zero at Xo, well-conditioned zero at Xl.

2.2.2 Componentwise Condition Analysis


The above presented normwise consideration does not take into account
any possible special structure of the matrix A, but rather analyzes the
behavior relative to arbitrary perturbations <SA-including those that do
not preserve this special structure. Moreover, individual components may
be afflicted by relative errors in a rather different way so that important
phenomena cannot be characterized by normwise considerations. In view
of such situations we here want to turn to some concept of componentwise
error analysis.
Example 2.11 The solution of a linear system Ax = b with a diagonal
matrix

0)
E'
A-I = ( 1
0 ),
is obviously a well-conditioned problem, because the equations are com-
pletely independent of each other (also called decoupled). Here we implicitly
assume that the admissible perturbations preserve the diagonal form. The
32 2. Error Analysis

normwise condition number, however, is obtained as


1
K:oo(A) = IIA -1 1100 IIAlloo = - ,
E

i.e., arbitrarily large for small E :S 1. It describes the condition for all kinds
of possible perturbations of the matrix.
The example suggests that the notion of condition defined in Section 2.2.1
turns out to be deficient in some situations. Intuitively, we expect that the
condition number of a diagonal matrix, i.e., of a completely decoupled linear
system, is equal to one, as in the case of a scalar linear equation. The fol-
lowing componentwise analysis will lead us to such a condition number. In
order to transfer the concept of Section 2.2.1 to the componentwise setting,
we merely have to replace norms with absolute values of the components.
In the following we will work out details for the relative error concept only.

Definition 2.12 The (prototype) componentwise condition number of the


problem (I, x) is the smallest number K:rel 2: 0 such that

for i: ---> x.

Remark 2.13 Alternatively we can define the relative componentwise


condition number also by
Ifi(X) - fi(X)1 <: IXi - xii _
mr-x Ifi(X)1 - K:rel mr-x IXil for x ---> x.

The condition number defined in this way is even submultiplicative, i.e.,

K:rel(g 0 h, x) :S K:rel(g, h(x)) . K:rel(h, x).


By analogy to (2.1) we can compute this condition number for differ-
entiable mappings via the derivative at x. Application of the mean value
theorem

f(x) - f(x) = l~o f'(x + t(x - x))(x - x) dt

gives componentwise

If(x) - f(x)1 :S l~o If'(x + t(x - x))II(x - x)1 dt;


and hence

K:rel =
I 1f'(x)1 Ixl 1100
Ilf(x)lloo
As earlier with the normwise concept, we also want to calculate the
componentwise condition number for a sequence of illustrative problems.
2.2. Condition of Problems 33

Example 2.14 Condition of multiplication. The multiplication of two real


numbers x, y is described by the mapping
f: R2 --> R, (x,yf f-> f(x,y) = xy.
It is differentiable with f'(x, y) = (y, x); and the relative componentwise
condition number becomes

fi:rel = \\\f'(x,y)\\(x,y)T\\\oo = 2\x y \ = 2


\\f(x,y)\\oo \xy\·
Therefore multiplication can be considered as well-conditioned.
Example 2.15 Condition of scalar products. When computing a scalar
product (x, y) = L~=l XiYi, we evaluate the mapping

f : R2n --> R, (x, y) f-> (x, y)


at (x,y). Since f is differentiable with f'(x,y) = (yT,XT), it follows that
the componentwise relative condition number is

fi:rel = \\\(yT, xT)\\(x, Y)\\\oo = 2 (\x\, \y\) .


\\(x,Y)\\oo \(x,y)\
Example 2.16 Componentwise condition of a linear system (Skeel's con-
dition number). If we consider, as in Example 2.7, the problem (A-I,b) of
a linear system with b as input, then we obtain the following value of the
componentwise relative condition number:

\\\A-I\\b\\\oo \\\A-I\\b\\\oo
fi:rel = \\A-Ib\\oo = \\x\\oo .
This number was introduced by R. D. Skeel [76]. With it the error i: - x,
i: = A-It; can be estimated by

IIi: - x\\oo -
\\x\\oo ~ fi:relE for \b - b\ ~ E\bI.
The ideas of Example 2.7 can be transferred for perturbations in A. We
already know that the mapping f : GL(n) --> Rn, A f-> f(A) = A-Ib, is
differentiable with

It follows that (see Exercise 2.14) the componentwise relative condition


number is given by

\\\f'(A)\\A\\\oo \\\A-I\\A\\x\\\oo
fi:rel = \\f(A) \\00 = \\x\\oo .
If we collect the results for perturbations in both A and b, then the relative
condition numbers add up and we obtain as a condition number for the
34 2. Error Analysis

combined problem
II lA-II IAI Ixl + IA-Illbili oo I lA-II IAI Ixl 1100
I'Crel = Ilxll oo :::; 2 Ilxll oo .

Taking for x the vector e = (1, ... ,1), yields the following characterization
of the componentwise condition of Ax = b for arbitrary right-hand sides b

~ < I lA-II IAI lei 1100 = IIIA- 1 1IAIII


2 I'Crel - Ilell oo 00
in terms of the Skeel's condition number

of A. This condition number I'Cc(A) satisfies, just as I'C(A), properties (i)


through (iii) of Example 2.7. Moreover, I'Cc(A) has the property I'Cc(D) = 1
for any diagonal matrix D that we have been intuitively looking for from
the beginning. Moreover, because of

it is even invariant under row scaling, i.e.,

Example 2.17 Componentwise condition of nonlinear systems. Let us


compute the componentwise condition number of the problem (f-1, y) of
a nonlinear system f(x) = y, where f : Rn ~ Rn is as in Example 2.10
a continuously differentiable function. It follows in a completely analogous
manner that
111f'(x)-lllf(x)llloo
I'Crel =
Ilxll oo
The expression 1f'(x)-lllf(x)1 strongly resembles the correction
~x = - f'(X)-l f(x)
of Newton's method that we are going to meet in Section 4.2.

2.3 Stability of Algorithms


In this section we turn to the second group of errors, those arising in the
algorithm. Instead of the ideal output f(x) given by the problem any al-
gorithm realizes a perturbed output }(x) in terms of a mapping J that
contains all rounding and approximation errors. In this setting the question
of stability of an algorithm is then:

Is the output J(x) instead of f (x) acceptable?


2.3. Stability of Algorithms 35

To answer this quest~on we must first think about how to characterize the
perturbed mapping f. We have seen above that the errors in performing a
floating point operation 0 E {+, -, ., /} can be estimated by
aob=(aob)(1+c), c=c(a,b) with !c!::;eps. (2.4)
Here it does not make too much sense (even if it may be possible in prin-
ciple) to determine c = c(a, b) for all values a and b on a given computer.
In this respect our algorithm has to deal not with a single mapping j, but
with a whole class {j}, containing all mappings characterized by estimates
of the form (2.4). This class also contains the given problem f E {j}.
The estimate of the error of a floating point operation (2.4) was derived
in Section 2.1 only for machine numbers. Because we want to study the
relation of the whole class of mappings we can allow arbitrary real numbers
as arguments. In this way we put the mathematical tools of calculus at
our disposal. Our model of the algorithm consists therefore of mappings
operating on real numbers and satisfying estimates of the form (2.4).
In order t~ avoid unwieldy notation, let us denote the family {j} by j
as well; i.e., f stands for the whole family or for a representative according
to the context. Statements on such an algorithm j (for example, error
estimates) are always appropriately interpreted for all the mappings in the
family. In particular, we define the image j (E) of a set E as the union

j(E) := U¢(E) .
<PEj
We are left with the question of how to assess the error ](x) - f(x). In
our condition analysis we have seen that input data are always (at least
for floating point numbers) affected by input errors, which-through the
condition of the problem-lead to unavoidable errors in the output. From
our algorithm we cannot expect to accomplish more than from the problem
itself. Therefore we are happy when its error ](x) - f(x) lies within reason-
able bounds of the error f(5;) - f(x) caused by the input error. Along this
line of thought, there are essentially two approaches: the forward analysis
and the backward analysis. They will be treated in what follows.

2.3.1 Stability Concepts


With forward analysis we analyze the set
R:= j(E)
of all outputs, as perturbed by input errors and errors in the algorithm.
Because f E j, the set R contains in particular R = f(E). The enlargement
from R to R identifies the stability in the sense of forward analysis-see
Figure 2.5. If the measure of R has the same order of magnitude as that of
R, then we say that the algorithm is stable in the sense of forward analysis.
36 2. Error Analysis

-;. -- ;

\ /
.... R
1(~)./··/

Figure 2.5. Input and output sets in forward analysis.

The idea of backward analysis introduced by J. H. Wilkinson consists of


passing the errors of the algorithm back to the starting point and inter-
preting them as input errors. To do this we formulate the perturbed output
fj = j(x) as the exact result fj = f(x) corresponding to the perturbed input
quantity x-see Figure 2.6. Obviously this can be done only if J(E) lies in
the image of f. If this is not the case, a backward analysis is not possible
and the algorithm is regarded as unstable in the sense of backward analysis.

f
f

.
;
f
;
;
;
;

Figure 2.6. Input and output sets in backward analysis.

For noninjective mappings f this is expressed in the form x E f-l(fj). In


the concept of backward analysis x is chosen as the least deviation of the
input, i.e.,
Ilx - xii = min.
(As f was supposed to be continuous, there is at least one such x.) If we
construct the set
E := {x I f(x) = j(x) and Ilx - xii = min for some xE E}
of all those perturbed input quantities x of minimal distance, then any
perturbed result fj E J(E) can be interpreted as the exact result f(x)
corresponding to an input x E E. The ratio between the sets E and E is
the measure of stability of an algorithm in the sense of backward analysis.
2.3. Stability of Algorithms 37

In the following quantitative description of stability we will limit our-


selves to roundoff errors both in input and algorithm, and therefore we
choose the relative error concept as an adequate basis for our analysis.

2.3.2 Forward Analysis


In forward analysis we have to relate the error ](x) - f(x) caused by the
algorithm with the unavoidable error which, according to Section 2.2, can
be estimated by the product '" eps of the relative condition number", and
the relative input error eps. We therefore define a stability indicator u as
the factor by which the algorithm enlarges the unavoidable error, which is
'" eps.
Definition 2.18 Let J be the floating point realization of an algorithm
for solving the problem (j, x) with (normwise) relative condition number
"'rel. The stability indicator of normwise forward analysis is the smallest
number u ::::: 0 such that for all x E E
II](x) - f(x)11 .
I f (x) I ::; u "'rei eps for eps - 0 .

Analogously the stability indicator of componentwise forward analysis is


the smallest number u ::::: 0 such that
IJi(x) - fi(x)1 <' _
mfx Ifi(x) I - U "'rei eps for eps - 0,

where Krel is the componentwise relative condition number of (j, x) (see


remark 2.13).
The algorithm J is called stable in the sense of forward analysis, if u is
smaller than the number of successively performed elementary operations
Lemma 2.19 For the elementary operations {+, -, *, /} and their floating
*, /}
point realizations {-t, -=-, we have
u'" : : ; 1.

Proof. For any elementary operation 0 E {-t, -=-, *, /} we have aob = (a 0

b)(l + 10) for some 10 with 1101 ::::; eps and hence
laob-aobl l(aob)(l+s)-aobl_ll<
la 0 bl la 0 bl - 10 - eps.
o
Example 2.20 Subtraction. We see in particular that in the case of can-
cellation we have", » 1, which implies for the stability indicator that
u « 1. Thus subtraction is outstandingly stable and in the case of total
cancellation is indeed error free a-=-b = a-b.
38 2. Error Analysis

When computing the stability indicators it is advantageous first to ana-


lyze the algorithm step by step and from these to obtain the stability of the
algorithm as a whole recursively. We will clarify this process for algorithms
that can be split in two steps. To do this we divide the problem (f, x) in
two partial problems (g,x) and (h,g(x)), where

and we assume that the stability indicators 0" g, O"h for the partial algorithms
g and h that implement 9 and h are known. How can we assess from here
the stability of the composed algorithm

Lemma 2.21 The stability indicator 0" f of the composed algorithm J


satisfies
(2.5)

for both norm- and componentwise approaches.

Proof. We work out the proof for the normwise approach. Let g and h be
arbitrary representatives of the algorithms for 9 and h as well as J = hog
for f = hog. Then

))J(x) - f(x))) ))h(g(x)) - h(g(x))))


< ))h(g(x)) - h(g(x)))) + ))h(g(x))
- h(g(x))))
))g(x) - g(x)))
< O"h Kh eps ))h(g(x)))) + Kh ))g(x))) ))h(g(x))))

<: O"hKhepsl)h(g(x)))) + KhO"g Kg eps))h(g(x))11


< (O"h Kh + O"g Kg Kh) eps ))f(x))).
o
This lemma has an immediate important consequence. As we have 0"K, S 1
for each elementary operation, the stability of the composed mapping f =
hog is in danger only if a subtraction arises in the second mapping h so that
K,h » 1. An illustrative example is the computation of the variance (see
Exercise 2.16). We therefore arrive at the following algorithmic construction
rule (compare page 28):

Put unavoidable subtractions as close to the beginning of the


algorithm as possible.
We now analyze via the recursive application offormula (2.5) the floating
point realization of the sum of n real numbers and the scalar product.
2.3. Stability of Algorithms 39

Example 2.22 Summation. The simplest algorithm for the sum


n
8n :Rn ___ R, (XI, ... ,Xn)f------+LXi
i=l
is by recursive computation in accordance with 8n = 8 n -1 0 an for n > 2
and 82 = 002, where

denotes the addition of the first two components. We want to examine this
"algorithm" componentwise. The condition number and stability indicator
for an coincide with those for the addition of two numbers, i.e., /\'cx n = /\,+
and O'CX n = 0'+. With the notation /\,j := /\'Sj and O'j := O'Sj we have by
virtue of Lemma 2.21 that
O'n /\'n ::; (O'n-l + 0'+ /\'+)/\'n-l ::; (1 + O'n-d/\'n-l .
According to Example 2.3, the condition number /\'n satisfies

_ l:~=l IXil > 1 d _ l:~=3Ixil + IXI + x21


I",n
/\'n -
LJi=1 X,.1 - an /\'n-l - l",n.1
LJi=1 X, ::; /\'n
and therefore 0'n ::; 1 + an-I. Since 0'2 = 0'+ ::; II/\,+ ::; 1, we obtain for the
stability indicator that
O'n ::; n - 1.
Hence the naive summation algorithm is numerically stable for the required
n - 1 elementary operations.
Example 2.23 Implementation of scalar products. We subdivide the com-
putation of the scalar product f(x,y) = (x,y) for X,Y E R n into the
componentwise multiplication
p: R n x R n ___ R n , ((Xi), (Yi)) f------+ (XiYi) '
followed by the summation, f = 8 n 0 p, analyzed in the last example.
According to Lemmas 2.21 and 2.19 together with the estimation of the
stability indicator of the last example we have
0'1 /\, 1 ::; (an + 0'p I£p) I£n ::; (1 + 0'n) I£n ::; nl£n
and therefore
I£n _ l:~=1 IXiYil/ Il::l XiYil _ 12
0'1 <
_ n- - n n n - n . (2.6)
1£1 2l: i=1I x iYill ll:i=l XiYil
At 2n - 1 elementary operations, this algorithm for the scalar product is
also numerically stable.
Actually, this estimate proves to be as a rule too pessimistic. Frequently
one observes a factor Vn rather than n in (2.6).
40 2. Error Analysis

Remark 2.24 Some computers have a "scalar product function" with


variable mantissa length, the dot-product 8. Here the mantissas are en-
larged in such a way that the additions can be performed in fixed point
arithmetic. Thus one achieves the same stability as for addition, i.e., a ~ 1.

In the following we will see how to carry out the forward analysis for
scalar functions. In this special case we have a simplified version of Lemma
2.21.

Lemma 2.25 If the functions 9 and h of Lemma 2.21 are scalar an_d differ-
entiable then the stability indicator a f of the combined algorithm f = hog
satisfies

Proof. In this special case the condition number of the combined problem
is the product of the condition numbers of the parts
Ixllf'(x)1 Ig(x)llh'(g(x))IIg'(x)llxl
""f = If(x)1 = Ih(g(x))llg(x)1 = ""h ""9'
Hence from Lemma 2.21 it follows that

o
If the condition number ""9 of the first partial problem is very small, ""9 « 1,
then the algorithm becomes unstable. A small condition number can also
be interpreted as loss of information: A change in the input has almost
no influence on the output. Such a loss of information at the beginning
of the algorithm leads therefore to instability. Moreover, we see that an
instability in the beginning of the algorithm (large a 9) fully affects the
composed algorithm. For example, let us analyze the recursive method for
computing cos mx and an intermediary result.

Example 2.26 For the mapping f(x) = cos x we have

""abs = sin x and ""reI = x tan x .


If x -+ 0 then ""reI ~ X 2 -+ O. The evaluation of the function alone is
extremely well-conditioned for small x. However, if the information of x
were to be subsequently used, then it becomes inaccessible through this
intermediary step. We will come back to this phenomenon in the following
example:

Example 2.27 Now we can analyze the recursive computation of cos mx.
It is important for example in the Fourier-synthesis, i.e., in the evaluation
2.3. Stability of Algorithms 41

of trigonometric polynomials of the form


N

f(x) = Lakcoskx+bksinkx.
k=l

On the basis of the addition theorem


cos(k + l)x = 2cosx· coskx - cos(k -l)x
we can compute Cm := f(x) := cosmx from Co = 1 and Cl cosx by
means of the three-term recurrence relation
(2.7)
Is this a stable algorithm? A crucial role is played by the evaluation of
g(x) := cosx, which occurs in each step of the three-term recurrence. We
have just seen in Example 2.26 that for small x information is lost when
computing cos x and this can lead to instability. The corresponding stability
indicator contains the factor

in each term of the sum. Since


1 1
x-+O =? - - --> - -+ 00
K(X) x2
1 1
X-+7I =? -- --> -+ 00,
K(X) 7I(x -71)
the recurrence is unstable for both limit cases x -+ 0,71 with the former
case x -+ 0 being the more critical one. If we have a relative machine
precision eps = 5.10- 12 and we compute the value cosmx for m = 1,240
and x = 10- 4 according to (2.7), we obtain for example a relative error of
10- 8 . With the condition number
K = Imx tan mxl ~ 1.5 . 10- 2
it follows that (J" > 1.3.10 5 ; i.e., the calculation is clearly unstable. There is,
however, a stable recurrence relation for computing cos mx for small values
of x, developed by C. Reinsch [70]. It is based on introducing differences
~Ck := Ck+l - Ck and transforming the three-term recurrence relation (2.7)
into a system of two-term recurrence relation
~Ck -4sin2 (x/2)· Ck + ~Ck-l
Ck+l Ck + ~Ck
for k = 1,2, ... with starting values Co = 1 and ~co = -2 sin2 (x/2). The
evaluation of h(x) := sin 2 (x/2) is stable for small x E [-71/2,71/2]' since

_1_=\ h(x) \=\tan(x/2)\-+~ for x-+O.


K(h, x) h'(x)x x 2
42 2. Error Analysis

For the above numerical example this recurrence yields an essentially better
solution with a relative error of 1.5 . 10- 11 . The recurrence for x -+ 7r can
be stabilized in a similar way. (It turns out that these stabilizations lead
ultimately to usable results only because the three-term recurrence relation
(2.7) is well-conditioned-see Section 6.2.1.)

2.3.3 Backward Analysis


For the quantitative description of the backward analysis we have to relate
the errors in the algorithm, passed back to the input side with the original
input errors.
Definition 2.28 The normwise backward error of the algorithm f for solv-
ing the problem (f, x), is the smallest number T} ::::: 0, having the property
that for any x E E there is x such that

Ilx - xii <: for eps -+ 0


Ilxll _./ .
'11

The componentwise backward error is defined analogously by replacing the


inequality with

The algorithm is called stable with respect to the relative input error J, if
T} < J.
For the input error J = eps caused by roundoff we define the stability
indicator of the backward analysis as the quotient
O"R := T}/eps.
As we see, the condition of the problem does not appear in the defini-
tion. Also, in contrast with forward analysis the backward analysis does
not require a beforehand condition analysis of the problem. Furthermore,
the results are easily interpreted by comparing the input error and the
backward error. Due to these properties the backward analysis is prefer-
able, especially in case of complex algorithms. All the stability results for
Gaussian elimination collected in the next section are related to backward
analysis.
The two stability indicators 0" and O"R are not identical in general. The
concept of backward analysis is rather stronger, as shown by the following
lemma.
Lemma 2.29 The stability indicators 0" and O"R of the forward and
backward analysis satisfy
2.3. Stability of Algorithms 43

In particular backward stability implies forward stability.

Proof. From the definition of the backward error it follows that for any
x E E there is a x such that f(x) = /(x) and
Ilx-xll .
Ilxll S TJ = aR eps for eps -+ O.

It follows that the relative error of the results satisfies


Ili(x) - f(x)11 Ilf(x) - f(x)ll· Ilx - xii .
Ilf(x) I = Ilf(x) II S K, Ilxll S K, aR eps

for eps -+ O. o
As an example of the backward analysis let us look again to the scalar
product (x, y) for x, y E Rn in its floating point implementation
(x,y:=xnyn
) + (x n-l ,yn-l) , (2.8)
where

Xn-l := (Xl,'" ,Xn-l )T an d Yn-l := (YI,··· ,Yn-l )T .


This is the form used on sequential machines. We concentrate on the sta-
bility corresponding to the input x, because we need it in Section 2.4.2 for
the analysis of the backward substitution.
Lemma 2.30 The floating point implementation of the scalar product in
accordance with (2.8) computes for x, Y E Rn a solution (x, Y)fl such that
(x, Y)fl = (x, y)
for an x E Rn with

Ix - xl <: n eps lxi,


i. e., the relative componentwise backward error amounts to

TJ S neps
and the scalar product is (with 2n - 1 elementary operations) stable in the
sense of backward analysis.

Proof. The recursive formulation (2.8) naturally suggests an inductive


proof. For n = 1 the assertion is clear. Therefore let n > 1 and the as-
sertion already proved for n - 1. For the floating point implementation of
the recurrence relation (2.8), we have
(x, Y)fl = (xnYn(l + 8) + (x n - l , yn-l)fl)(l + e),
where 8 and ewith 181, lei S eps characterize the relative errors of multipli-
cation and addition, respectively. Furthermore, according to the induction
44 2. Error Analysis

hypothesis we have

for a Z E Rn-I with


Ixn- I - zl :S (n - 1) eps Ixn-Il·
If we set Xn := xn (l + 8)(1 + c:) and Xk := zk(l + c:) for k = 1, ... , n - 1, it
follows that
(x, y) = xnYn(l + 8)(1 + c:) + (z(l + c:), y) = XnYn + (x n- I , y) = (x, y),
where IXn - xnl :S 2 eps Ixnl :S neps Ixnl and
IXk - xkl < IXk - zkl + IZk - xkl
< (n - 1) eps Ix kI + eps Ix kI :S n eps Ix kI
for k = 1, ... , n - 1, hence Ix - xl:S neps Ixl. o
Remark 2.31 If we pass back the errors in x and y in an equally dis-
tributed way, then we deduce as in the case of forward analysis that the
stability indicator satisfies rJR :S n/2.

2.4 A pplication to Linear Systems


In what follows the above concepts of condition and stability will be
discussed again and deepened in the context of linear systems.

2.4.1 A Zoom into Solvability


As we did before in Chapter 1, we ask ourselves again the question: When
is a linear system Ax = b solvable? In contrast to Chapter 1, where this
question was answered on the fictive basis of real numbers, we want to use
now the above-derived error theory. Accordingly, the characteristic quantity
det A will be replaced by the condition number. We restate the results of
Section 2.2.1
Ilx-xll .
Ilxll :S I1: re l 8, (2.9)

where 8 is the relative input error of A (or b). By normwise examination


we obtained for the relative condition number

and by componentwise examination


2.4. Application to Linear Systems 45

As seen in Example 2.7 for a nonzero matrix A E Matn(R) we have

det A =0 {==} K:(A) = 00. (2.lO)


If K:(A) < 00, then the linear system is in principle uniquely solvable for
any right-hand side. On the other hand, according to (2.9) we expect a
numerically usable result if K:re1J is sufficiently small. In addition to that
we only have seen that a point condition of type (2.10) makes no sense
because instead of an individual matrix A we have to consider the set of
all matrices that are indistinguishable from A, for example,

E:= {A I IIA - All:::; JIIAII}·


It is therefore suitable to call a matrix A with relative precision J "almost
singular," or "numerically singular" whenever the corresponding input set
contains an (exactly) singular matrix. This leads to the following definition:

Definition 2.32 A matrix A is called almost singular or numerically


singular with respect to the condition number K:(A), whenever

where J is the relative precision of the matrix A.

For the rounding error caused by putting the matrix A into a computer
we assume, for example that J = eps. With experimental data J is taken as
the largest tolerance. However, we would like to stress the fact that linear
systems with almost singular matrices may nevertheless be numerically
well-behaved~a fact that can be interpreted through the x-dependency of
the relative condition number K:rel.

Example 2.33 We examine a linear system Ax = b, first for a right-hand


side b1 with

The matrix A and the right-hand side b1 contain the common input variable
E « 1, i.e., they are connected with each other. The condition number of
the matrix

K:(A) = IIA-11IooIIAlioo = ~E » 1

is clearly not meaningful for the solution of the special problem

The solution is indeed independent of E, i.e., f{(E) = 0, and consequently


the problem is well-conditioned (relatively and absolutely) for any E.
46 2. Error Analysis

Second, we examine the same problem, but now for a different right-hand
side br := (0,1) independent of E:. Here we obtain the solution

( -liE:)
liE:
and the componentwise condition numbers

'( )11 1 II If&(E:)1 lE:I II = 1.


h:abs = II f2 E: = E: 2 and h:rel = Ilh(E:)11

In this case only the direction of the solution is well-conditioned, which is


reflected in the relative condition number. We will encounter a situation of
this type again in Sections 3.1.2 and 5.2

2.4.2 Backward Analysis of Gaussian Elimination


We have already seen in Lemma 2.30 that the floating point computation
of the scalar product f(x, y) = (x, y) is stable in the sense of the backward
analysis. The algorithm for solving a linear system Ax = b discussed in the
first chapter requires only the evaluation of scalar products of certain rows
and columns of matrices, so that on the basis of this insight it is possible
to perform a backward analysis of the Gaussian elimination. The techni-
calities of the proof for backward analysis will certainly increase with the
complexity of the algorithm. Therefore, we will carry it out explicitly only
for the forward substitution, while for the Gaussian triangular factorization
we will just state the main results, which can be essentially found in the
books [88] and [87] by J. H. Wilkinson. A nice overview is also offered by
the paper of N. J. Higham [51].
Theorem 2.34 The floating point implementation of the forward substi-
tution for the solution of a triangular system Lx = b computes a solution
x such that there is a lower triangular matrix L with
Lx=b and IL-LI:snepsILI,
(i. e., for the componentwise relative backward error we have rJ ~ n . eps) so
that the forward substitution is stable in the sense of backward analysis.

Proof. Similar to the scalar product (see Lemma 2.30), the forward
substitution algorithm may also be recursively formulated as
l kkXk = bk - (l k-l , Xk-l) (2.11)
for k = 1, ... , n, where
Xk - 1 = (Xl, ... ,Xk-lf and lk-l = (lkl, ... ,lk,k_d T .
Floating point implementation turns (2.11) into the recurrence relation
lkk(l + 5k )(1 + E:k)Xk = bk - (lk-\ Xk-1)fl,
2.4. Application to Linear Systems 47

where 6k and Ek with 16kl,IEkl ::; eps describe the relative errors of the
multiplication and the addition, respectively. For the floating point im-
plementation of the scalar product we know already from Lemma 2.30
that
(lk-l, xk-1)fl = (Zk-l, x k- 1)
'k l ' , T
for some vector l - = (tkl,' .. , lk,k-d with
Ilk-I - Zk-II ::; (k -l)eps Ilk-II.

By setting also Zkk := lkk(1 + 6d(1 + Ek), we get, as asserted,


Lx = band IL - LI ::; neps ILl.
D

As a first result of backward analysis we assess the quality of the LR-


factorization.
Lemma 2.35 Let A possess an LR-Jactorization. Then the Gaussian
elimination computes Land R such that
LR=A
Jor a matrix A with

Proof. A simple inductive proof for the weaker statement with 4n instead
of n is found in the book [41] of G. H. Golub and C. van Loan, Theorem
3.3.1. D

The following estimation of the componentwise backward error of the


Gaussian elimination was proved by W. Sautter. [74] The weaker statement
with 3n instead of 2n follows easily from Theorem 2.34 and Lemma 2.35.
Theorem 2.36 Let A possess a LR-Jactorization. Then the Gaussian
elimination Jor the linear system Ax = b computes a solution x with
Ax =b
Jor a matrix A with
IA - AI < 2nlLIIRI eps.
From here we can derive the following statement on the normwise back-
ward error of the Gaussian elimination that goes back to J. H. Wilkinson
[88].
Theorem 2.37 The Gaussian elimination with column pivoting Jor the
linear system Ax = b computes an x such that
Ax=b
48 2. Error Analysis

for a matrix A with

where
Pn(A) := Qmax
maXi,j laij I
and Qmax is the largest absolute value of an element of the remainder
matrices A (1) = A through A (n) = R appearing during elimination.

Proof. We denote by P, L, Rand i; the quantities computed during the


Gaussian factorization with column pivoting P A = LR. Then PA possesses
an LR-factorization and according to Theorem 2.36 there is a matrix A such
that Ai; = band

IA - mAl :S 2n pT ILIIRI eps.


Since IIPlloo = 1, it follows that

IIA - Alloo :S 2n IILlloo IIRlloo eps. (2.12)


The column pivoting strategy takes care that all the components of L are
less than or equal to one in absolute value, i.e.,

IILlloo :s:; n.
The norm of R can be estimated by
IIRlloo :s:; n max Irij I :s:; n Qmax·
2,)

The statement follows therefore from (2.12), because maXi,j laijl :s:; IIAlloo.
o
So what is the stability of Gaussian elimination? This question is not
clearly answered by Theorem 2.37. Whether the matrix is suitable for Gaus-
sian elimination depends obviously on the number Pn(A). In general this
quantity can be estimated by

where the estimate is sharp because the bound is attained by the


(pathological) matrix given by J. H. Wilkinson (see Exercise 2.20):
1 1

-1
Aw=

-1 -1 1
2.4. Application to Linear Systems 49

Therefore Gaussian elimination with column pivoting is not stable for the
whole class of invertible matrices. However, for special classes of matrices
the situation looks considerably better. In Table 2.1 we have listed some
classes of matrices and the corresponding estimates for Pn.

Table 2.1. Stability of Gaussian elimination for special matrices.


Type of matrix column pivoting Pn ~

invertible yes 2n - 1
upper Hessenberg yes n
A or AT strictly diagonally dominant superfluous 2
tridiagonal yes 2
symmetric positive definite no 1
random yes n 2 / 3 (average)

Gaussian elimination is stable for symmetric positive definite matrices


(Cholesky factorization) as well as for diagonally dominant matrices, i.e.,
n
laiil > L laijl for i = 1, ... ,no
j=l
i-::;ej

Furthermore, one could state with a clear conscience that Gaussian elimi-
nation is stable "as a rule," i.e., is stable for matrices usually encountered
in practice. This statement is also supported by the probabilistic consider-
ations carried out by L. N. Trefethen and R. S. Schreiber [84] (see the last
row of Table 2.1).

2·4.3 Assessment of Approximate Solutions


Now let us assume that an approximate solution x of a given linear system
Ax = b is already available and we want to answer the question:
How good is an approximate solution x of the Problem Ax = b?
The fact that x is the exact solution of the linear system Ax = b is reflected
in the property that the corresponding residual vanishes,
r(x) := b - Ax = o.
From a "good" approximation one would expect the residual r(x) to be
"sufficiently small." The question then is: how small is "sufficiently small?"
In looking for a reasonable limit we run into the problem that the norm
Ilr(x)11 can be arbitrarily changed by a row scaling
Ax = b --+ (DrA)x = Drb
even though the problem remains unchanged. This feature also shows up in
the invariance of Skeel's condition number I'i:c(A) under row scaling. There-
fore the residual can be invoked at best when it possesses a problem-specific
50 2. Error Analysis

meaning. Rather, the concept of backward analysis introduced in Defini-


tion 2.28 seems to be better suited: both normwise and componentwise
backward errors of an approximate solution x can be directly derived.
Theorem 2.38 The normwise relative backward error of an approximate
solution x of a linear system f(A,b) = f-1b relative to II(A,b)" := IIAII +
Ilbll is
_ IIAx - bll
TJN(X) = IIAII Ilxl + Ilbll .

Proof. See J. L. Rigal and J. Gaches [71]. o


The following theorem dates back to W. Prager and W. Oettli [68].
Theorem 2.39 The componentwise relative backward error of an approx-
imate solution x of Ax = b is

(2.13)

Proof. Let
TJc = min{w I there is A, b with Ax = b, IA - AI:::; wlAI, Ib - bl :::; wlbl}·
We set
Ir(x) Ii
e := mfx ( IAllxl + Ibl)i
and we have to prove that TJc(x) = e. Half of the statement, TJc(x) 2 e,
follows from the fact that for any solution A, b, w we have
lAx - bl = I(A - A)x + (b - b)1 :::; IA - Allxl + Ib - bl :::; w(IAllxl + Ibl),
and therefore w 2 e. For the other half of the statement TJc (x) :::; e, we
write the residual f = b - Ax as
f = D(IAllxl + Ibl) with D = diag(d 1 , ... , dn ),
where

Now by setting
A:= A + DIAl diag(sgn(x)) and b:= b - Dlbl,
we get IA - AI :::; elAI, Ib - bl :::; elbl and
Ax - b = Ax + DIAl Ixl- b + Dlbl = -f + D(IAllxl + Ibl) = o.
o
2.4. Application to Linear Systems 51

Remark 2.40 Naturally, formula (2.13) for the backward error is usable
in practice only when the error occurring in the evaluation does not alter
the meaningfulness of the result. R. D. Skeel has shown that the value ii
computed in floating point arithmetic differs from the actual value 'r/ by at
most n· eps. Thus formula (2.13) can really be invoked to assess the quality
of an approximate solution x.
In Chapter 1, the iterative refinement was introduced as a possibility
of "improving" approximate solution. The following result proved by R.
D. Skeel [77] shows the effectiveness of the iterative refinement: For Gaus-
sian elimination a single refinement step already implies componentwise
stability.
Theorem 2.41 Let
Knli:c(A) o-(A, x)eps < 1,
with Skeel's condition number li:c(A) = " lAllA-II 1100' the definition

(A ).= maxi(IAllxl)i
0" ,x. mini (IAllxl)i '
and Kn a constant close to n. Then Gaussian elimination with column
pivoting followed by one refinement step has a componentwise stability
indicator
O":Sn+l,
i. e, this form of Gaussian elimination is stable.
The quantity O"(A, x) is a measure of the quality of the scaling (see [77]).
Example 2.42 Hamming's example.

A= , b=

For c we take the relative machine precision eps = 3 . 10- 10 . For Skeel's
condition number we then obtain
_ IIIA-lIIAllxl + lA-II Ibl 1100 -'- 6
li:rel - Ilxll oo - ,

i.e., this system of equations is well-conditioned. By using Gaussian


elimination with column pivoting we obtain the solution
x = Xo = (6.20881717.10- 10 ,1, I?
with the componentwise backward error (according to Theorem 2.39)
'r/(xo) = 0.15. From the quantity O"(A,x) = 2.0.10 9 we can see that the
problem is extremely badly scaled. In spite of that, after one refinement
step we obtain the solution Xl = X with the computed backward error
'r/(xI) = 1 . 10- 9 .
52 2. Error Analysis

Exercises
Exercise 2.1 Show that the elementary operation +is not associative in
floating point arithmetic.
Exercise 2.2 Determine through a short program the relative machine
precision of the computer you use.
Exercise 2.3 The zeros of the cubic polynomial
z3 + 3qz - 2r = 0, r, q >0
are to be found. According to Cardano-Tartaglia a real root is given by

Z= (r + J q3 + r2) + (r _ J q3 + r2) 3
1 1

3 .

This formula is numerically inappropriate because two cubic roots must


be computed and total cancellation occurs for r ........, O. Give a cancellation-
free formula for z that requires the computation of only one cubic and one
square root.
Exercise 2.4 Determination of zeros of polynomials in coefficient repre-
sentation is in general an ill-conditioned problem. For illustration consider
the following polynomial:
P(x)=x 4 -4x 3 +6x 2 -4x+a, a=1.
How do the roots Xi of P change ifthe coefficient a is distorted to a := a-c,
o< c ::; eps? Give the condition number of the problem. Specify the
condition of the problem. With how many exact digits can the solution be
computed if the relative machine precision is eps = 10- 16 ?
Exercise 2.5 Determine

0: = (1 + 2~ ) n _ e (1 + _12 ) > 0
1- l2n 12n
for n = 106 up to three exact digits.
Exercise 2.6 Compute the condition number of the evaluation of a
polynomial that is given by its coefficients ao, ... ,an
p(x) = anx n + an_lX n - 1 + ... + alX + ao
at the point X, first with respect to perturbation of the coefficients ai ........,
iii = ai(l + ci), and then with respect to perturbations X ........, x = x(l + c)
of x. Consider in particular the polynomial
p(x) = 8118x 4 - 11482x 3 + x 2 + 5741x - 2030
at the point x = 0.707107. The exact result is
p(x) = -1.9152732527082· 1O- 11 .
Exercises 53

Suppose that a computer delivers the value


p(x) = -1.9781509763561· 10- 11 .

Assess the solution by using the condition of the problem.


Exercise 2.7 Letll . III and 11·112 be norms on Rn and Rffi, respectively.
Show that
IIAxl12
II All := sup -11-11-
x¥O x 1
defines a norm on the space Matm,n(R) of all real (m, n) matrices.
Exercise 2.8 Vector and matrix norms. The p-norms on R n are defined
for 1 :S p < 00 by

and for p = 00 by
Ilxll oo := . max
l.=l, ... ,n
IXi I·
Show that the corresponding matrix norms
IIAxll p
IIAllp := sup -11-11-
x¥O x p
satisfy
(a) IIAl11 = maXj=l, ... ,n 2.::1Iaijl·
(b) IIAlloo = maXi=l, ... ,m 2.:7=1 laijl·
(c) IIAII2:S JIIAI111IAlloo.
(d) IIABllp:S IIAllpllBllp for 1 :S p:S 00.

Exercise 2.9 Spectral and Frobenius norm. Let A E Matn(R).


(a) Show that IIAI12 = IIA T I12.
(b) Show that IIAIIF = Jtr(AT A) = IIa-(A)112, where O"(A)
V
[0"1, ... , 0" n is the vector of singular values of A.
(c) Deduce from (b) that IIAI12 :S IIAIIF :S fo11A112.
(d) Prove the submultiplicativity of the Frobenius norm II ·IIF.
(e) Show that the Frobenius norm II . IIF is not an operator norm.
Exercise 2.10 Let $A$ be a matrix with columns $A_j$, $j = 1,\dots,n$. We define a matrix norm by
\[
\|A\|_0 := \max_{j=1,\dots,n} \|A_j\|_2 .
\]
Show that
(a) $\|\cdot\|_0$ indeed defines a norm on $\mathrm{Mat}_{m,n}(\mathbf{R})$.
(b) $\|Ax\|_2 \le \|A\|_0 \|x\|_1$ and $\|A^T x\|_\infty \le \|A\|_0 \|x\|_2$.
(c) $\|A\|_F \le \sqrt{n}\,\|A\|_0$ and $\|AB\|_0 \le \|A\|_F \|B\|_0$.
Exercise 2.11 Let $A$ be a nonsingular matrix that is transformed into $\tilde A := A + \delta A$ by a small perturbation $\delta A$, and let $\|\cdot\|$ be a submultiplicative norm with $\|I\| = 1$. Show that
(a) If $\|B\| < 1$, then $(I-B)^{-1} = \sum_{k=0}^{\infty} B^k$ exists, and the following inequality holds:
\[
\|(I - B)^{-1}\| \le \frac{1}{1 - \|B\|}.
\]
(b) If $\|A^{-1}\delta A\| \le \varepsilon < 1$, then
\[
\kappa(\tilde A) \le \frac{1+\varepsilon}{1-\varepsilon}\,\kappa(A).
\]

Exercise 2.12 Show that the following property holds for any invertible matrix $A \in GL(n)$:
\[
\kappa(A) = \|A\|\,\|A^{-1}\| = \frac{\max_{\|x\|=1} \|Ax\|}{\min_{\|x\|=1} \|Ax\|}.
\]
Exercise 2.13 Let $A \in GL(n)$ be symmetric positive definite. Show that in this case
\[
\kappa_2(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}.
\]
Exercise 2.14 Show that the mapping $f : GL(n) \to \mathbf{R}^n$, defined by $f(A) = A^{-1}b$, satisfies

Exercise 2.15 Determine the componentwise relative condition number of the sum $\sum_{i=1}^{n} x_i$ of $n$ real numbers $x_1,\dots,x_n$ and compare it with the normwise relative condition number with respect to the norm $\|\cdot\|_1$.
Exercise 2.16 There are two formulas for computing the variance of a vector $x \in \mathbf{R}^n$:
\[
\text{(a)}\quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar x)^2,
\qquad
\text{(b)}\quad s^2 = \frac{1}{n-1}\Bigl(\sum_{i=1}^{n} x_i^2 - n\,\bar x^2\Bigr),
\]
where $\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the mean value of the $x_i$. Which of the two formulas for computing $s^2$ is numerically more stable, and therefore preferable? Support your assertion with the stability indicator, and illustrate your choice with a numerical example.
Exercise 2.17 We consider the approximation of the exponential $\exp(x)$ by the truncated Taylor series
\[
\exp(x) \approx \sum_{k=0}^{N} \frac{x^k}{k!}. \tag{2.14}
\]
Compute approximate values for $\exp(x)$ for $x = -5.5$ in the following three ways with $N = 3, 6, 9, \dots, 30$:
(a) with the help of formula (2.14);
(b) with $\exp(-5.5) = 1/\exp(5.5)$ and formula (2.14) for $\exp(5.5)$;
(c) with $\exp(-5.5) = (\exp(0.5))^{-11}$ and formula (2.14) for $\exp(0.5)$.
The "exact" value is
\[
\exp(-5.5) = 0.0040867714\ldots
\]
How can the results be interpreted?
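A small script along the following lines reproduces the three sequences of approximations; the helper name taylor_exp and the output format are our own choices:

    import math

    def taylor_exp(x, N):
        # truncated Taylor series (2.14): sum_{k=0}^{N} x^k / k!
        s, term = 1.0, 1.0
        for k in range(1, N + 1):
            term *= x / k
            s += term
        return s

    for N in range(3, 31, 3):
        a = taylor_exp(-5.5, N)            # variant (a)
        b = 1.0 / taylor_exp(5.5, N)       # variant (b)
        c = taylor_exp(0.5, N) ** (-11)    # variant (c)
        print(N, a, b, c)
    print("exact:", math.exp(-5.5))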

Exercise 2.18 Given a matrix $A_\varepsilon$

and two right-hand sides
\[
b_1^T = (1, 1), \qquad b_2^T = (-1, 1),
\]
compute the two solutions $x_1$ and $x_2$ of the equations $A_\varepsilon x_i = b_i$, the condition number $\kappa_\infty(A_\varepsilon)$, and Skeel's condition numbers $\kappa_{\mathrm{rel}}(A_\varepsilon, x_1)$ and $\kappa_{\mathrm{rel}}(A_\varepsilon, x_2)$. How can the results be interpreted?
Exercise 2.19 Let a linear system $Ax = b$ be given with
\[
A = \begin{pmatrix} 0.780 & 0.563 \\ 0.913 & 0.659 \end{pmatrix}, \qquad
b = \begin{pmatrix} 0.217 \\ 0.254 \end{pmatrix}
\]
and relative precision $\mathrm{eps} = 10^{-4}$. Assess the precision of the two approximate solutions
\[
\tilde x_1 = \begin{pmatrix} 0.999 \\ -1.001 \end{pmatrix}
\quad\text{and}\quad
\tilde x_2 = \begin{pmatrix} 0.341 \\ -0.087 \end{pmatrix}
\]
with the help of Theorem 2.39 of W. Prager and W. Oettli.

Exercise 2.20 Show that in the pathological example of J. H. Wilkinson


1 1

-1
A=

-1 -1 1
the maximal absolute value of a pivot in column pivoting is
lamaxl = 2n - 1 .
Exercise 2.21 According to R. D. Skeel [76] the componentwise backward error $\eta$ in Gaussian elimination with column pivoting for solving $Ax = b$ satisfies
\[
\eta \le \chi_n\, \sigma(A,x)\, \mathrm{eps},
\]
where $\chi_n$ is a constant depending only on $n$ (as a rule $\chi_n \approx n$), and
\[
\sigma(A,x) = \frac{\max_i (|A|\,|x|)_i}{\min_i (|A|\,|x|)_i}.
\]
Specify a row scaling $A \to DA$ of the matrix $A$ with a diagonal matrix $D = \mathrm{diag}(d_1,\dots,d_n)$, $d_i > 0$, such that
\[
\sigma(DA, x) = 1.
\]
Why is this an impractical stabilization method?
Exercise 2.22 Let
\[
A = \;\cdots
\]
The solution of the linear system $Ax = b$ is $x = [1,\ \varepsilon - 1,\ \varepsilon - 1,\ 1]^T$.
(a) Show that this system is well-conditioned but badly scaled, by computing the condition number $\kappa_C(A) = \|\,|A^{-1}|\,|A|\,\|_\infty$ and the scaling quantity $\sigma(A,x)$ (see Exercise 2.21). What do you expect from Gaussian elimination when $\varepsilon$ is substituted by the relative machine precision eps?
(b) Solve the system by a Gaussian elimination program with column pivoting for $\varepsilon = \mathrm{eps}$. How large is the computed backward error $\tilde\eta$?
(c) Check that one single refinement step delivers a stable result.
3
Linear Least-Squares Problems

This chapter deals with the solution of overdetermined linear systems by


means of the linear least-squares method, also known as the maximum
likelihood method. Because of the invariance of this type of problem, or-
thogonal transformations are very well suited for its solution. In Section
3.2 we give the description of orthogonalization methods, which can be used
for a stable solution of such problems. Associated with them are (somewhat
more expensive) alternatives to Gaussian elimination for systems of linear
equations.

3.1 Least-Squares Method of Gauss


C. F. Gauss first described the method in 1809 in his book Theoria Motus
Corporum Coelestium (theory of the motions of celestial bodies), as well
as the elimination method discussed in the first chapter, which he gave as
part of the whole solution method (see Section 3.1.4).
Within the same work he also studied fundamental questions from prob-
ability theory-for an extensive appreciation of the historical connections
we refer to the recent treatise [1].

3.1.1 Formulation of the Problem


We start with the following problem formulation: Let
\[
(t_i, b_i), \qquad t_i, b_i \in \mathbf{R}, \quad i = 1,\dots,m,
\]

be $m$ given points, where $b_i$ may describe, for example, the position of an object at time $t_i$. We assume that these measurements are in conformity with a natural law, so that the dependence of $b$ on $t$ can be expressed by a model function $\varphi$,
\[
b = \varphi(t; x_1,\dots,x_n),
\]
where the model function contains $n$ unknown parameters $x_i$.


Example 3.1 We consider as an example Ohm's law b = xt = <p(t; x),
where t is the intensity of the current, b the voltage, and x the resistance.
The task is to draw a line through the origin that is "as close as possible"
to the measurements (Figure 3.1).

Figure 3.1. Linear least-squares computation for Ohm's law.

If there were no measurement errors, then the model would describe the situation exactly and the parameters $x_1,\dots,x_n$ would need to be determined such that
\[
b_i = b(t_i) = \varphi(t_i; x_1,\dots,x_n) \quad\text{for } i = 1,\dots,m.
\]

Actually, however, all measurements are corrupted by errors, and model functions usually describe reality only partially. As an example, Ohm's law holds only as an approximation within a certain temperature range; a fact that will become clear at the latest when a wire burns through. Therefore we can only require that the deviations
\[
\Delta_i := b_i - \varphi(t_i; x_1,\dots,x_n), \qquad i = 1,\dots,m,
\]
be as small as possible. There are several possibilities for weighting the individual deviations $\Delta_i$.
Gauss chose first to minimize even powers of the deviations. Based on


considerations from the theory of probability he finally chose the squares
$\Delta_i^2$. This leads to the problem of determining the unknown parameters $x_1,\dots,x_n$ such that
\[
\Delta^2 := \sum_{i=1}^{m} \Delta_i^2 = \min. \tag{3.1}
\]

Remark 3.2 The relation between linear least-squares problems and probability theory is reflected in the equivalence of the minimization problem (3.1) with the maximization problem
\[
\exp(-\Delta^2) = \max.
\]
The exponential term characterizes here a probabilistic distribution, the Gaussian normal distribution. The complete method is called the maximum likelihood method.
In (3.1) the errors of individual measurements are equally weighted. However, the measurements $(t_i, b_i)$ differ in quality, because the measuring apparatus works differently over different ranges, and the measurements are taken sometimes with more and sometimes with less care. To any individual measurement $b_i$ there pertains therefore in a natural way an absolute measuring precision, or tolerance, $\delta b_i$. These tolerances $\delta b_i$ can be included in the problem formulation (3.1) by weighting different errors with different tolerances, i.e.,
\[
\sum_{i=1}^{m} \Bigl(\frac{\Delta_i}{\delta b_i}\Bigr)^2 = \min.
\]
This form of minimization also has a reasonable statistical interpretation (somewhat similar to standard deviation). In some cases the linear least-squares problem is only uniquely solvable if the problem-specific measurement tolerances are explicitly included!
Within this chapter we consider only the special case when the model function $\varphi$ is linear in $x$, i.e.,
\[
\varphi(t; x_1,\dots,x_n) = x_1 a_1(t) + \dots + x_n a_n(t),
\]
where $a_1,\dots,a_n : \mathbf{R} \to \mathbf{R}$ are arbitrary functions. The nonlinear case will be dealt with in Section 4.3. In this section $\|\cdot\|$ always denotes the Euclidean norm $\|\cdot\|_2$. In the linear case the least-squares problem can be written in short form as
\[
\|b - Ax\| = \min,
\]
where $b = (b_1,\dots,b_m)^T$, $x = (x_1,\dots,x_n)^T$, and
\[
A = (a_{ij}) \in \mathrm{Mat}_{m,n}(\mathbf{R}) \quad\text{with}\quad a_{ij} := a_j(t_i).
\]
As a further restriction we will consider here only the overdetermined case $m \ge n$; i.e., there are more data than parameters to be determined, which sounds reasonable from a statistical point of view. We obtain therefore
the framework of the linear least-squares problem: For given $b \in \mathbf{R}^m$ and $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$ with $m \ge n$, find an $x \in \mathbf{R}^n$ such that
\[
\|b - Ax\| = \min.
\]

Remark 3.3 By replacing the 2-norm by the I-norm, one obtains the stan-
dard linear optimization problem, which will not be treated here, because
in the natural sciences and engineering this problem type rarely appears.
It plays, however, an important role in economics and management sci-
ence. With the maximum-norm instead of the 2-norm we come across the
Chebyshev approximation problem. This problem does occur in the natural
sciences, although not as frequently as the Gaussian least-squares problem.
As the Euclidean norm is induced by a scalar product, the latter offers
more geometric insight.

3.1.2 Normal Equations


Geometrically speaking, the solution of the linear least-squares problem consists of finding a point $z = Ax$ in the range space $R(A)$ of $A$ that has the smallest distance to the given point $b$. For $m = 2$ and $n = 1$, $R(A) \subset \mathbf{R}^2$ is either the origin or a straight line through the origin (see Figure 3.2).

Figure 3.2. Projection on the range space R(A).

It is graphically clear that the distance lib - Axil is minimal exactly when
the difference b - Ax is perpendicular to the subspace R( A). In other words:
Ax is the orthogonal projection of b onto the subspace R(A). As we want
to come back to this result later, we will formulate it in a somewhat more
abstract form.

Theorem 3.4 Let $V$ be a finite-dimensional Euclidean vector space with scalar product $(\cdot,\cdot)$, $U \subset V$ a subspace, and
\[
U^\perp = \{\, v \in V \mid (v,u) = 0 \ \text{for all } u \in U \,\}
\]


its orthogonal complement in $V$. Then for all $v \in V$ we have the following property with respect to the norm $\|v\| = \sqrt{(v,v)}$ induced by the scalar product:
\[
\|v - u\| = \min_{u' \in U} \|v - u'\| \iff v - u \in U^\perp.
\]
Proof. Let $u \in U$ be the (uniquely determined) point such that $v - u \in U^\perp$. Then for all $u' \in U$ we have
\[
\|v - u'\|^2 = \|v - u\|^2 + 2(v - u, u - u') + \|u - u'\|^2
= \|v - u\|^2 + \|u - u'\|^2 \ge \|v - u\|^2,
\]
where equality holds if and only if $u = u'$. \qed

Remark 3.5 With this, the solution $u \in U$ of $\|v - u\| = \min$ is uniquely determined and is called the orthogonal projection of $v$ onto $U$. The mapping
\[
P : V \to U, \quad v \mapsto Pv \quad\text{with}\quad \|v - Pv\| = \min_{u \in U} \|v - u\|,
\]
is linear and is called the orthogonal projection from $V$ onto $U$.
Remark 3.6 The theorem also holds more generally when $U$ is replaced by an affine subspace $W = w_0 + U \subset V$, where $w_0 \in V$ and $U$ is a subspace of $V$ parallel to $W$. Then for all $v \in V$ and $w \in W$ it follows that
\[
\|v - w\| = \min_{w' \in W} \|v - w'\| \iff v - w \in U^\perp.
\]
This defines, as in Remark 3.5, a mapping
\[
P : V \to W, \quad v \mapsto Pv \quad\text{with}\quad \|v - Pv\| = \min_{w \in W} \|v - w\|,
\]
which is an affine mapping called the orthogonal projection of $V$ onto the affine subspace $W$. This consideration will prove to be quite useful in Chapter 8.
With Theorem 3.4 we can easily prove a statement on the existence and
uniqueness of the solution of the linear least-squares problem.
Theorem 3.7 The vector $x \in \mathbf{R}^n$ is a solution of the linear least-squares problem $\|b - Ax\| = \min$ if and only if it satisfies the normal equations
\[
A^T A x = A^T b.
\]
In particular, the linear least-squares problem is uniquely solvable if and only if the rank of $A$ is maximal, i.e., $\mathrm{rank}(A) = n$.

Proof. By applying Theorem 3.4 to $V = \mathbf{R}^m$ and $U = R(A)$ we get
\[
\begin{aligned}
\|b - Ax\| = \min &\iff (b - Ax, Ax') = 0 \ \text{for all } x' \in \mathbf{R}^n \\
&\iff (A^T(b - Ax), x') = 0 \ \text{for all } x' \in \mathbf{R}^n \\
&\iff A^T(b - Ax) = 0 \\
&\iff A^T A x = A^T b,
\end{aligned}
\]
and therefore the first statement. The second part follows from the fact that $A^T A$ is invertible if and only if $\mathrm{rank}(A) = n$. \qed

Remark 3.8 Geometrically, the normal equations mean precisely that the
residual vector b - Ax is normal to R(A) c Rm; hence the name.

3.1.3 Condition
We begin our condition analysis with the orthogonal projection P : R m ---+
V, b f--7 Pb, onto a subspace V of Rm (see Figure 3.3). Clearly the relative

Figure 3.3. Projection onto the subspace V.

condition number of the projection problem (P, b) corresponding to the


input b depends strongly on the angle {) of intersection between b and the
subspace V. If the angle is small, i.e., b ~ Pb, then perturbations of b
leave the result Pb nearly unchanged. On the other hand, if b is almost
perpendicular to V, then small perturbations of b produce relatively large
variations of Pb. These observations are reflected in the following lemma.
Lemma 3.9 Let $P : \mathbf{R}^m \to V$ be the orthogonal projection onto a subspace $V$ of $\mathbf{R}^m$. For an input $b$ let $\vartheta$ denote the angle between $b$ and $V$, i.e.,
\[
\sin\vartheta = \frac{\|b - Pb\|_2}{\|b\|_2}.
\]

Then the relative condition number of the problem $(P, b)$ with respect to the Euclidean norm satisfies
\[
\kappa = \frac{1}{\cos\vartheta}\,\|P\|_2.
\]

Proof. According to the Pythagorean theorem $\|Pb\|^2 = \|b\|^2 - \|b - Pb\|^2$, and therefore
\[
\frac{\|Pb\|_2^2}{\|b\|_2^2} = 1 - \sin^2\vartheta = \cos^2\vartheta.
\]
Because $P$ is linear, it follows that the relative condition number of $(P,b)$ satisfies, as stated,
\[
\kappa = \frac{\|b\|}{\|Pb\|}\,\|P'(b)\| = \frac{\|b\|}{\|Pb\|}\,\|P\| = \frac{1}{\cos\vartheta}\,\|P\|. \qquad\qed
\]
For the next theorem we also need the following relationship between the condition numbers of $A$ and $A^T A$ with respect to the Euclidean norm.

Lemma 3.10 For a matrix $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$ of maximal rank $p = n$ we have
\[
\kappa_2(A^T A) = \kappa_2(A)^2.
\]
Proof. According to Definition (2.3) the condition number of a rectangular matrix satisfies
\[
\kappa_2(A)^2 = \left(\frac{\max_{\|x\|_2=1} \|Ax\|_2}{\min_{\|x\|_2=1} \|Ax\|_2}\right)^{2}
= \frac{\max_{\|x\|_2=1} (A^T A x, x)}{\min_{\|x\|_2=1} (A^T A x, x)}
= \kappa_2(A^T A). \qquad\qed
\]
With these preparations the following result on the condition of a linear
least-squares problem no longer comes as a surprise.
Theorem 3.11 Let $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$, $m \ge n$, be a matrix of full column rank, $b \in \mathbf{R}^m$, and $x$ the (unique) solution of the linear least-squares problem
\[
\|b - Ax\|_2 = \min.
\]
We assume that $x \ne 0$ and we denote by $\vartheta$ the angle between $b$ and the range space $R(A)$ of $A$, i.e.,
\[
\sin\vartheta = \frac{\|r\|_2}{\|b\|_2}
\]
with residual $r = b - Ax$. Then the relative condition number of $x$ in the Euclidean norm satisfies:
(a) with respect to perturbations of $b$,
\[
\kappa \le \frac{\kappa_2(A)}{\cos\vartheta}\,;
\]
(b) with respect to perturbations of $A$,
\[
\kappa \le \kappa_2(A) + \kappa_2(A)^2 \tan\vartheta. \tag{3.2}
\]
Proof. (a) The solution $x$ is given through the normal equations by the linear mapping
\[
x = (A^T A)^{-1} A^T b,
\]
so that
\[
\kappa = \frac{\|(A^T A)^{-1} A^T\|_2\,\|b\|_2}{\|x\|_2}
= \frac{\|A\|_2\,\|(A^T A)^{-1} A^T\|_2\,\|b\|_2}{\|A\|_2\,\|x\|_2}.
\]
It is easily seen that for a full-column-rank matrix $A$ the condition number $\kappa_2(A)$ is precisely
\[
\kappa_2(A) = \frac{\max_{\|x\|_2=1} \|Ax\|_2}{\min_{\|x\|_2=1} \|Ax\|_2} = \|A\|_2\,\|(A^T A)^{-1} A^T\|_2.
\]
Now, as in Lemma 3.9, the assertion follows from
\[
\kappa = \frac{\|b\|_2}{\|A\|_2\,\|x\|_2}\,\kappa_2(A) \le \frac{\|b\|_2}{\|Ax\|_2}\,\kappa_2(A) = \frac{\kappa_2(A)}{\cos\vartheta}.
\]

(b) Here we consider $x = \phi(A) = (A^T A)^{-1} A^T b$ as a function of $A$. Because the matrices of rank $n$ form an open subset of $\mathrm{Mat}_{m,n}(\mathbf{R})$, $\phi$ is differentiable in a neighborhood of $A$. We construct the directional derivative $\phi'(A)C$ for a matrix $C \in \mathrm{Mat}_{m,n}(\mathbf{R})$ by differentiating the equation characterizing $\phi(A + tC)$,
\[
(A + tC)^T (A + tC)\,\phi(A + tC) = (A + tC)^T b,
\]
with respect to $t$ at the point $t = 0$. It follows that
\[
C^T A x + A^T C x + A^T A\,\phi'(A)C = C^T b;
\]
i.e.,
\[
\phi'(A)C = (A^T A)^{-1}\bigl(C^T r - A^T C x\bigr), \qquad r = b - Ax.
\]
From this we can estimate the derivative of $\phi$ by
\[
\|\phi'(A)C\|_2 \le \|(A^T A)^{-1} A^T\|_2\,\|C\|_2\,\|x\|_2 + \|(A^T A)^{-1}\|_2\,\|C\|_2\,\|r\|_2,
\]
so that
\[
\kappa = \frac{\|\phi'(A)\|_2\,\|A\|_2}{\|x\|_2}
\le \underbrace{\|A\|_2\,\|(A^T A)^{-1} A^T\|_2}_{=\,\kappa_2(A)}
+ \underbrace{\|A\|_2^2\,\|(A^T A)^{-1}\|_2}_{=\,\kappa_2(A^T A)\,=\,\kappa_2(A)^2}\,
\frac{\|r\|_2}{\|A\|_2\,\|x\|_2}.
\]
Now the assertion follows as in (a) from
\[
\frac{\|r\|_2}{\|b\|_2}\,\frac{\|b\|_2}{\|A\|_2\,\|x\|_2} \le \sin\vartheta \cdot \frac{1}{\cos\vartheta} = \tan\vartheta. \qquad\qed
\]
If the residual $r = b - Ax$ of the linear least-squares problem is small compared with the input $b$, then we have $\cos\vartheta \approx 1$ and $\tan\vartheta \approx 0$. In this case (which should be the normal case for linear least-squares problems) the problem behaves like a linear system from the point of view of condition. For large residuals, i.e., $\cos\vartheta \ll 1$ and $\tan\vartheta > 1$, the estimate (3.2) contains the quantity $\kappa_2(A)$, which is relevant for linear systems, as well as its square $\kappa_2(A)^2$. Hence for large residuals the linear least-squares problem behaves essentially differently from linear systems.

3.1.4 Solution of Normal Equations


We switch now to the solution of the normal equations. We start from the assumption that the linear least-squares problem is uniquely solvable, i.e., $\mathrm{rank}(A) = n$, so that $A^T A$ is an Spd-matrix. According to Section 1.4 this gives the possibility of using the Cholesky decomposition. As for the cost of solving the linear least-squares problem with the help of the normal equations, we have (in number of multiplications):
(a) computation of $A^T A$: $\sim \tfrac{1}{2}n^2 m$,
(b) Cholesky factorization of $A^T A$: $\sim \tfrac{1}{6}n^3$.
For $m \gg n$ part (a) predominates, so that we have altogether a cost of
\[
\sim \tfrac{1}{2}n^2 m \ \text{ for } m \gg n \qquad\text{and}\qquad \sim \tfrac{2}{3}n^3 \ \text{ for } m \approx n.
\]
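Purely as an illustration of the approach just described (and not as a recommendation, see the discussion below), the normal-equations solution can be sketched in a few lines of Python with NumPy; the routine name is ours:

    import numpy as np

    def lsq_normal_equations(A, b):
        # Solve ||b - Ax||_2 = min via A^T A x = A^T b, assuming rank(A) = n.
        AtA = A.T @ A
        Atb = A.T @ b
        L = np.linalg.cholesky(AtA)        # Cholesky factorization A^T A = L L^T
        y = np.linalg.solve(L, Atb)        # solve L y = A^T b
        return np.linalg.solve(L.T, y)     # solve L^T x = y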

The numerical treatment described above requires in the first step the computation of $A^T A$, i.e., of numerous scalar products of columns of $A$. The numerical intuition developed in the second chapter (see Example 2.33) makes this appear dubious: in each additional step, additional errors may arise that will propagate further to the final solution. Therefore, in most cases it is better to look for an efficient "direct" method operating only on $A$ itself. A further point of criticism is the fact that the errors in $A^T b$ are amplified in the solution of the linear system $A^T A x = A^T b$ by a factor close to the condition number (see Lemma 3.10)
\[
\kappa_2(A^T A) = \kappa_2(A)^2.
\]
For large residuals this agrees with the condition number of the linear least-squares problem (3.2). However, for small residuals the condition number of the latter problem is described instead by $\kappa_2(A)$, so that passing to the normal equations means a considerable worsening of the condition number. In addition, the matrices usually arising in linear least-squares problems are already badly conditioned, so that they require extreme attention, and a further worsening of the condition by passing to $A^T A$ is no longer tolerable. Hence the solution of linear least-squares problems via the normal equations with Cholesky factorization can be recommended only for problems with large residuals.
Remark 3.12 Based on the properties of the linear least-squares problem discussed in the last section, one should perform a residual correction for problems with matrices of maximal rank and large residuals. In the iterative refinement the residual $r = b - Ax$ is introduced as an explicit variable (see A. Björck [7] and Exercise 3.2). With this variable we write the normal equations as an equivalent symmetric system
\[
\begin{pmatrix} I & A \\ A^T & 0 \end{pmatrix}
\begin{pmatrix} r \\ x \end{pmatrix}
=
\begin{pmatrix} b \\ 0 \end{pmatrix}
\]
of double dimension with the input quantities $A$ and $b$ appearing directly. This formulation is also especially appropriate for large sparse matrices $A$ and is the basis for iterative refinement in linear least-squares problems with "large" residuals.

3.2 Orthogonalization Methods


Each elimination process for linear systems, as for example the Gaussian elimination from Section 1.2, can be formally represented as
\[
B_k \cdots B_1 A = R,
\]
where the matrices $B_j$ describe the operations on the matrix $A$. We have seen in the recursive stability analysis from Lemma 2.21 that the stability indicators of the partial steps of an algorithm are increased by the condition of all successive partial steps. For an elimination process as described above these are the condition numbers of the matrices $B_j$. These condition numbers, e.g., that of the elimination matrix $B_j = L_j$ of the Gaussian triangular factorization, are in general not bounded from above, so that we may encounter instabilities. If, on the other hand, we choose orthogonal transformations $Q_j$ instead of the $L_j$ for the elimination, then we have in the Euclidean norm
\[
\kappa_2(Q_j) = 1.
\]
Therefore these orthogonalization methods are always stable. Unfortunately the stability comes together with a somewhat higher cost than, for example, Gaussian elimination, as will be seen in the following. In addition there is a further reason that makes orthogonalization methods a good alternative to the solution of the normal equations for linear least-squares problems: it is the invariance of the Euclidean norm with respect to orthogonal transformations that suggests their application to such problems. Let us assume that we have brought a matrix $A \in \mathrm{Mat}_{m,n}$ with $m \ge n$ to the upper triangular
form $R$ by means of an orthogonal matrix $Q \in N(m)$,
\[
Q^T A = \begin{pmatrix} R \\ 0 \end{pmatrix}.
\]
Then, as an alternative to solving the normal equations, we can determine the solution of the linear least-squares problem $\|b - Ax\| = \min$ as follows:

Theorem 3.13 Let $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$, $m \ge n$, be a matrix of maximal rank, $b \in \mathbf{R}^m$ a vector, and $Q \in N(m)$ an orthogonal matrix with
\[
Q^T A = \begin{pmatrix} R \\ 0 \end{pmatrix} \quad\text{and}\quad Q^T b = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix},
\]
where $b_1 \in \mathbf{R}^n$, $b_2 \in \mathbf{R}^{m-n}$, and $R \in \mathrm{Mat}_n(\mathbf{R})$ is an (invertible) upper triangular matrix. Then $x = R^{-1}b_1$ is the solution of the linear least-squares problem $\|b - Ax\| = \min$.

Proof. Because $Q \in N(m)$, we have for all $x \in \mathbf{R}^n$
\[
\|b - Ax\|^2 = \|Q^T(b - Ax)\|^2
= \left\| \begin{pmatrix} b_1 - Rx \\ b_2 \end{pmatrix} \right\|^2
= \|b_1 - Rx\|^2 + \|b_2\|^2 \ge \|b_2\|^2.
\]
Since $\mathrm{rank}(A) = \mathrm{rank}(R) = n$, $R$ is invertible. The term $\|b_1 - Rx\|^2$ vanishes precisely for $x = R^{-1}b_1$. Observe that the residual $r := b - Ax$ does not vanish in general and that $\|r\| = \|b_2\|$. \qed
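In NumPy the procedure of Theorem 3.13 can be sketched as follows; the reduced QR-factorization returned by numpy.linalg.qr plays the role of Q and R, and the function name is ours:

    import numpy as np

    def lsq_via_qr(A, b):
        # Least-squares solution for full-rank A (m >= n) as in Theorem 3.13.
        Q, R = np.linalg.qr(A)          # reduced factorization: Q is m x n, R is n x n
        b1 = Q.T @ b                    # first n components of the transformed right-hand side
        return np.linalg.solve(R, b1)   # upper triangular system R x = b1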

For $m = 2$ the orthogonal transformations in question can be derived geometrically as rotations and reflections (see Figure 3.4). If we want to

Figure 3.4. Rotation and reflection of $a$ onto $\alpha e_1$.

map the vector $a \in \mathbf{R}^2$ via an orthogonal transformation onto a multiple $\alpha e_1$ of the first unit vector, it follows that $|\alpha| = \|a\|$. The first possibility is to rotate $a$ by an angle $\theta$ onto $\alpha e_1$. We obtain
\[
a \mapsto \alpha e_1 = Qa \quad\text{with}\quad Q := \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}.
\]
As a second possibility we can reflect $a$ with respect to a line that is perpendicular to $v$, i.e.,
\[
a \mapsto \alpha e_1 = a - 2\,\frac{(v,a)}{(v,v)}\,v,
\]
where $v$ is collinear to the difference $a - \alpha e_1$. The numerical handling by means of rotations will be introduced in Section 3.2.1 and that by means of reflections in Section 3.2.2.

3.2.1 Givens Rotations


The name Givens rotation (W. Givens, 1953) is used to denote matrices of the form
\[
\Omega_{kl} := \begin{pmatrix}
I & & & & \\
  & c & & s & \\
  & & I & & \\
  & -s & & c & \\
  & & & & I
\end{pmatrix} \in \mathrm{Mat}_m(\mathbf{R}),
\]
with the entries $c$ and $s$ placed in rows and columns $k$ and $l$, where $I$ denotes identity blocks of corresponding dimension and $c^2 + s^2 = 1$. Here $c$ and $s$ should remind us naturally of $\cos\theta$ and $\sin\theta$. Geometrically the matrix describes a rotation by the angle $\theta$ in the $(k,l)$-plane. If we apply $\Omega_{kl}$ to a vector $x \in \mathbf{R}^m$, it follows that
\[
(\Omega_{kl}x)_i = \begin{cases}
c\,x_k + s\,x_l & \text{if } i = k,\\
-s\,x_k + c\,x_l & \text{if } i = l,\\
x_i & \text{if } i \ne k,l.
\end{cases} \tag{3.3}
\]
If we premultiply a matrix
\[
A = [A_1, \dots, A_n]
\]
with columns $A_1,\dots,A_n \in \mathbf{R}^m$ by $\Omega_{kl}$, then the Givens rotation operates on the columns, i.e.,
\[
\Omega_{kl}A = [\Omega_{kl}A_1, \dots, \Omega_{kl}A_n].
\]
According to (3.3) only rows $k$ and $l$ of the matrix $A$ are changed. This is especially important when the sparsity structure is to be preserved as much as possible by the transformation.
Now how can we determine the coefficients $c$ and $s$ in order to eliminate a component $x_l$ of the vector $x$? As $\Omega_{kl}$ operates only on the $(k,l)$-plane, it is sufficient to clarify the principle in the case $m = 2$. From $x_k^2 + x_l^2 \ne 0$ and $s^2 + c^2 = 1$ it follows that
\[
\begin{pmatrix} c & s \\ -s & c \end{pmatrix}
\begin{pmatrix} x_k \\ x_l \end{pmatrix}
= \begin{pmatrix} r \\ 0 \end{pmatrix}
\iff
r = \pm\sqrt{x_k^2 + x_l^2}, \quad c = x_k/r, \quad s = x_l/r.
\]
Actually, $c$ and $s$ are more conveniently computed via the formulas (where $\tau$ stands for $\tan\theta$ and $\cot\theta$, respectively)
\[
\tau := x_k/x_l, \quad s := 1/\sqrt{1+\tau^2}, \quad c := s\tau
\qquad\text{if } |x_l| > |x_k|,
\]
and
\[
\tau := x_l/x_k, \quad c := 1/\sqrt{1+\tau^2}, \quad s := c\tau
\qquad\text{if } |x_k| \ge |x_l|.
\]
By this implementation exponential overflow is also avoided.
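The overflow-avoiding computation of c and s just described translates, for example, into the following small Python function (the name givens is ours):

    import math

    def givens(xk, xl):
        # c, s with c^2 + s^2 = 1 such that the rotation maps (xk, xl) to (r, 0).
        if xl == 0.0:
            return 1.0, 0.0
        if abs(xl) > abs(xk):
            tau = xk / xl
            s = 1.0 / math.sqrt(1.0 + tau * tau)
            return s * tau, s
        tau = xl / xk
        c = 1.0 / math.sqrt(1.0 + tau * tau)
        return c, c * tau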
Suppose now we want to bring a given matrix $A \in \mathrm{Mat}_{m,n}$ to upper triangular form $R$, where $r_{ij} = 0$ for $i > j$, with the help of Givens rotations. This is again done by eliminating the nonzero subdiagonal elements column by column. As an example, we illustrate the algorithm on a full $(5,4)$-matrix (the index pairs over the arrows give the indices of the Givens rotation $\Omega_{kl}$ performed at that step):

\[
A=\begin{pmatrix}
*&*&*&*\\ *&*&*&*\\ *&*&*&*\\ *&*&*&*\\ *&*&*&*\end{pmatrix}
\xrightarrow{(5,4)}\cdots\xrightarrow{(2,1)}
\begin{pmatrix}
*&*&*&*\\ 0&*&*&*\\ 0&*&*&*\\ 0&*&*&*\\ 0&*&*&*\end{pmatrix}
\xrightarrow{(5,4)}\cdots\xrightarrow{(3,2)}
\begin{pmatrix}
*&*&*&*\\ 0&*&*&*\\ 0&0&*&*\\ 0&0&*&*\\ 0&0&*&*\end{pmatrix}
\xrightarrow{(5,4)}\xrightarrow{(4,3)}
\begin{pmatrix}
*&*&*&*\\ 0&*&*&*\\ 0&0&*&*\\ 0&0&0&*\\ 0&0&0&*\end{pmatrix}
\xrightarrow{(5,4)}
\begin{pmatrix}
*&*&*&*\\ 0&*&*&*\\ 0&0&*&*\\ 0&0&0&*\\ 0&0&0&0\end{pmatrix}.
\]
After carefully counting the operations we obtain the cost of the QR-factorization of a full matrix $A \in \mathrm{Mat}_{m,n}$:
(a) $\sim n^2/2$ square roots and $\sim 4n^3/3$ multiplications, if $m \approx n$;
(b) $\sim mn$ square roots and $\sim 2mn^2$ multiplications, if $m \gg n$.
For $m = n$ we obtain an alternative to the Gaussian triangular factorization of Section 1.2. The better stability is bought at a considerably higher cost of $\sim 4n^3/3$ multiplications versus $\sim n^3/3$ for Gaussian elimination.
However, one should observe that for sparse matrices the comparison turns out to be essentially more favorable. Thus only $n-1$ Givens rotations are needed to bring a matrix $A$ in upper Hessenberg form,
\[
A = \begin{pmatrix}
* & * & \cdots & * \\
* & * & \cdots & * \\
0 & \ddots & \ddots & \vdots \\
0 & 0 & * & *
\end{pmatrix},
\]
which has almost upper triangular shape with nonzero components only in the first subdiagonal, to upper triangular form. With Gaussian elimination, the pivot search may double the subdiagonal band.
Remark 3.14 If $A$ is stored with a row scaling $DA$, then the Givens rotations can be realized (similarly to the rational Cholesky factorization) without evaluating square roots. In 1973 W. M. Gentleman [37] and S. Hammarling [49] developed such a variant, the fast Givens or rational Givens rotations. This type of factorization is invariant with respect to column scaling, i.e.,
\[
A = QR \implies AD = Q(RD) \quad\text{for a diagonal matrix } D.
\]

3.2.2 Householder Reflections


In 1958 A. S. Householder [52] introduced matrices $Q \in \mathrm{Mat}_n(\mathbf{R})$ of the form
\[
Q = I - 2\,\frac{vv^T}{v^Tv}
\]
with $v \in \mathbf{R}^n$. Today they are called Householder reflections. Such matrices describe exactly the reflections with respect to the plane perpendicular to $v$ (compare Figure 3.4). In particular, $Q$ depends only on the direction of $v$. The Householder reflections $Q$ have the following properties:
(a) $Q$ is symmetric, i.e., $Q^T = Q$.
(b) $Q$ is orthogonal, i.e., $QQ^T = Q^TQ = I$.
(c) $Q$ is involutory, i.e., $Q^2 = I$.
If we apply $Q$ to a vector $y \in \mathbf{R}^n$ we get
\[
y \mapsto Qy = \Bigl(I - 2\,\frac{vv^T}{v^Tv}\Bigr)y = y - 2\,\frac{(v,y)}{(v,v)}\,v.
\]
If $y$ is to be mapped onto a multiple of the unit vector $e_1$,
\[
\alpha e_1 = y - 2\,\frac{(v,y)}{(v,v)}\,v \in \mathrm{span}(e_1),
\]
then
\[
|\alpha| = \|y\|_2 \quad\text{and}\quad v \in \mathrm{span}(y - \alpha e_1).
\]
From this we determine $Q$ by
\[
v := y - \alpha e_1 \quad\text{with}\quad \alpha = \pm\|y\|_2.
\]
In order to avoid cancellation in computing $v = (y_1 - \alpha, y_2, \dots, y_n)^T$ we choose $\alpha := -\mathrm{sgn}(y_1)\,\|y\|_2$. Because
\[
(v,v) = (y - \alpha e_1,\ y - \alpha e_1) = \|y\|_2^2 - 2\alpha(y,e_1) + \alpha^2 = -2\alpha(y_1 - \alpha),
\]
we can compute $Qx$ for arbitrary $x \in \mathbf{R}^n$ by the very simple formula
\[
Qx = x - 2\,\frac{(v,x)}{(v,v)}\,v = x + \frac{(v,x)}{\alpha(y_1 - \alpha)}\,v.
\]
= x + a (Yl - a ) v.
With the help of Householder reflections we can transform a matrix A =
[AI' ... ' An] E Matm,n(R) to upper triangular form as well, by eliminating
successively the elements below the main diagonal. In the first step we
"shorten" the column Al and obtain

where
vlvi
Ql = 1- 2-T- with Vl:= Al - aIel and al:= -sgn(all)IIAI!I2.
VI vI
After the kth step the output matrix is brought to upper triangular form
except for a remainder matrix T(k+l) E Matm-k,n-k(R)

* *

=
o* *
A(k)

o
Now let us build an orthogonal matrix
\[
Q_{k+1} = \begin{pmatrix} I_k & 0 \\ 0 & \tilde Q_{k+1} \end{pmatrix},
\]
where $\tilde Q_{k+1} \in N(m-k)$ is constructed as in the first step with $T^{(k+1)}$ instead of $A$. Altogether, after $p = \min(m-1, n)$ steps we obtain the upper triangular matrix
\[
R = Q_p \cdots Q_1 A
\]
and from here, because $Q_i^2 = I$, the factorization
\[
A = QR \quad\text{with}\quad Q = Q_1 \cdots Q_p.
\]
Suppose now we calculate the solution of the linear least-squares problem as in Theorem 3.13 by computing the QR-factorization of the matrix $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$, $m \ge n$, with the help of Householder reflections $Q_j \in N(m)$. Then we arrive at the following method:
(1) $A = QR$, QR-factorization with Householder reflections,
(2) $(b_1, b_2)^T := Q^T b$, transformation of the right-hand side,
(3) $Rx = b_1$, solution of the upper triangular system.
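A compact, non-optimized sketch of steps (1) through (3) in Python/NumPy might look as follows; a production code would store only the Householder vectors as described next, and all names here are our own:

    import numpy as np

    def householder_qr_lsq(A, b):
        # Least-squares solution via Householder QR for full-rank A (m >= n).
        A, b = A.astype(float).copy(), b.astype(float).copy()
        m, n = A.shape
        for k in range(n):                               # step (1): transform A to R
            y = A[k:, k]
            alpha = -np.linalg.norm(y) if y[0] >= 0 else np.linalg.norm(y)
            v = y.copy(); v[0] -= alpha
            beta = v @ v
            if beta == 0.0:
                continue
            A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:]) / beta
            b[k:]     -= 2.0 * v * (v @ b[k:]) / beta    # step (2): apply Q^T to b
        return np.linalg.solve(A[:n, :n], b[:n])         # step (3): R x = b_1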

In a computer implementation we have to store the Householder vectors $v_1,\dots,v_p$ as well as the upper triangular matrix $R$. The diagonal elements
\[
r_{ii} = \alpha_i \quad\text{for } i = 1,\dots,p
\]
are stored in a separate vector, so that the Householder vectors $v_1,\dots,v_p$ find a place in the lower half of $A$ (see Figure 3.5). Another possibility is to normalize the Householder vectors in such a way that the first component $(v_i, e_i)$ is always 1 and therefore does not need to be stored.

Figure 3.5. Storage scheme for QR-factorization with Householder reflections ($m = 5$, $n = 4$).

For the cost of this method we obtain
(a) $\sim n^2 m$ multiplications, if $m \gg n$;
(b) $\sim \tfrac{2}{3}n^3$ multiplications, if $m \approx n$.
For $m \approx n$ we have about the same cost as for the Cholesky method applied to the normal equations. For $m \gg n$ the cost is worse by a factor of two, but the method has the stability advantage discussed above.
As in the case of Gaussian elimination there is also a pivoting strategy for the QR-factorization, the column permutation strategy of P. Businger and G. H. Golub [12]. In contrast to Gaussian elimination, this strategy is of minor importance for the numerical stability of the algorithm. If in each elimination step one pushes the column with maximal 2-norm to the front, so that after the exchange we have
\[
\|A_1\|_2 \ge \|A_j\|_2 \quad\text{for } j = 1,\dots,n
\]
(and correspondingly for the remainder matrices), then the diagonal elements $r_{kk}$ of $R$ satisfy
\[
|r_{11}| \ge |r_{22}| \ge \dots \ge |r_{pp}| \quad\text{and}\quad |r_{11}| = \|A\|_0
\]
for the matrix norm $\|A\|_0 := \max_j \|A_j\|_2$. If $p = \mathrm{rank}(A)$, then we obtain theoretically that after $p$ steps the matrix has the form
\[
\begin{pmatrix} R & S \\ 0 & 0 \end{pmatrix}
\]
with an invertible upper triangular matrix $R \in \mathrm{Mat}_p(\mathbf{R})$ and a matrix $S \in \mathrm{Mat}_{p,n-p}(\mathbf{R})$. Because of roundoff errors we obtain instead the matrix
\[
\begin{pmatrix} R & S \\ 0 & T^{(p+1)} \end{pmatrix},
\]
where the elements of the remainder matrix $T^{(p+1)} \in \mathrm{Mat}_{m-p,n-p}(\mathbf{R})$ are "very small." As the rank of the matrix is not generally known in advance,
we have to decide during the algorithm when to neglect the rest matrix.
In the course of the QR-factorization with column exchange the following criterion for the rank decision presents itself in a convenient way. If we define the numerical rank $p$ for a relative precision $\delta$ of the matrix $A$ by the condition
\[
|r_{p+1,p+1}| < \delta\,|r_{11}| \le |r_{pp}|,
\]
then it follows directly that
\[
\|T^{(p+1)}\|_0 = |r_{p+1,p+1}| < \delta\,|r_{11}| = \delta\,\|A\|_0;
\]
i.e., $T^{(p+1)}$ is below the error in $A$ with respect to the norm $\|\cdot\|_0$. If $p = n$,
then we can easily compute in addition the subcondition number
\[
\mathrm{sc}(A) := \frac{|r_{11}|}{|r_{nn}|}
\]
of P. Deuflhard and W. Sautter (1979) [28]. Analogously to the properties of the condition number $\kappa(A)$, we have
(a) $\mathrm{sc}(A) \ge 1$,
(b) $\mathrm{sc}(\alpha A) = \mathrm{sc}(A)$,
(c) $A \ne 0$ singular $\iff \mathrm{sc}(A) = \infty$,
(d) $\mathrm{sc}(A) \le \kappa_2(A)$ (hence the name).


In agreement with the above definition of the numerical rank, $A$ is called almost singular if we have
\[
\delta\,\mathrm{sc}(A) \ge 1 \quad\text{or equivalently}\quad \delta\,|r_{11}| \ge |r_{nn}|
\]
for a QR-factorization with possible column permutation.


We have substantiated above that this concept makes sense. Each matrix,
which is almost singular according to this definition, is also almost singular
with respect to the condition number K:2(A), as shown by property (d) (d.
Definition 2.32). The reverse is not true.

3.3 Generalized Inverses


According to Theorem 3.7 the solution $x$ of the linear least-squares problem $\|b - Ax\| = \min$, for $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$, $m \ge n$, and $\mathrm{rank}(A) = n$, is uniquely determined. Clearly it depends linearly on $b$, and it is formally denoted by $x = A^+ b$. Under the above assumptions the normal equations imply that
\[
A^+ = (A^T A)^{-1} A^T.
\]
Because $A^+ A = I$ is precisely the identity, $A^+$ is also called the pseudo-inverse of $A$. The definition of $A^+$ can be extended to arbitrary matrices $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$. In this case the solution of $\|b - Ax\| = \min$ is in general no longer uniquely determined. On the contrary, if we denote by
\[
\bar P : \mathbf{R}^m \to R(A) \subset \mathbf{R}^m
\]
the orthogonal projection of $\mathbf{R}^m$ onto the image space $R(A)$, then according to Theorem 3.4 the solutions form an affine subspace
\[
L(b) := \{\, x \in \mathbf{R}^n \mid \|b - Ax\| = \min \,\} = \{\, x \in \mathbf{R}^n \mid Ax = \bar P b \,\}.
\]
Nevertheless, in order to enforce uniqueness we choose the smallest solution
x E L(b) in the Euclidean norm II . II, and we denote again x = A+b.
According to Remark 3.6, x is precisely the orthogonal projection of the

Figure 3.6. "Smallest" solution of the least-squares problem as a projection of $0$ onto $L(b)$.
origin $0 \in \mathbf{R}^n$ onto the affine subspace $L(b)$ (see Figure 3.6). If $\bar x \in L(b)$ is an arbitrary solution of $\|b - Ax\| = \min$, then we obtain all the solutions by translating the nullspace $N(A)$ of $A$ by $\bar x$, i.e.,
\[
L(b) = \bar x + N(A).
\]
Here the smallest solution $x$ must be perpendicular to the nullspace $N(A)$; in other words: $x$ is the uniquely determined vector $x \in N(A)^\perp$ with $\|b - Ax\| = \min$.
Definition 3.15 The pseudo-inverse of a matrix $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$ is a matrix $A^+ \in \mathrm{Mat}_{n,m}(\mathbf{R})$ such that for all $b \in \mathbf{R}^m$ the vector $x = A^+b$ is the smallest solution of $\|b - Ax\| = \min$, i.e.,
\[
A^+ b \in N(A)^\perp \quad\text{and}\quad \|b - AA^+ b\| = \min.
\]
The situation can be most clearly represented by a commutative diagram (where $i$ denotes in each case the inclusion operator) connecting $\mathbf{R}^m$, $R(A)$, $\mathbf{R}^n$, and $R(A^+) = N(A)^\perp$. We can easily read off that the projection $\bar P$ is precisely $AA^+$, while $P = A^+A$ describes the projection from $\mathbf{R}^n$ onto the orthogonal complement $N(A)^\perp$ of the nullspace. Furthermore, because of the projection property, we obviously have $A^+AA^+ = A^+$ and $AA^+A = A$. As seen in the following theorem, the pseudo-inverse is uniquely determined by these two properties and the symmetry of the orthogonal projections $P = A^+A$ and $\bar P = AA^+$.
Theorem 3.16 The pseudo-inverse $A^+ \in \mathrm{Mat}_{n,m}(\mathbf{R})$ of a matrix $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$ is uniquely characterized by the following properties:
(i) $(A^+A)^T = A^+A$,
(ii) $(AA^+)^T = AA^+$,
(iii) $A^+AA^+ = A^+$,
(iv) $AA^+A = A$.
The properties (i) through (iv) are also called the Penrose axioms.

Proof. We have already seen that $A^+$ satisfies properties (i) through (iv), because $A^+A$ and $AA^+$ are orthogonal projections onto $N(A)^\perp = R(A^+)$ and $R(A)$, respectively. Conversely, (i) through (iv) imply that $P := A^+A$ and $\bar P := AA^+$ are orthogonal projections, because $P^T = P = P^2$ and $\bar P^T = \bar P = \bar P^2$. Analogously, from (iii) and $P = A^+A$ it follows that
$N(P) = N(A)$. Thus the projections $P$ and $\bar P$ are uniquely determined (independently of $A^+$) by properties (i) through (iv). From this the uniqueness of $A^+$ follows: If $A_1^+$ and $A_2^+$ satisfy conditions (i) through (iv), then $P = A_1^+A = A_2^+A$ and $\bar P = AA_1^+ = AA_2^+$, and therefore
\[
A_1^+ = A_1^+ A A_1^+ = A_1^+ A A_2^+ = A_2^+ A A_2^+ = A_2^+. \qquad\qed
\]
Remark 3.17 If only part of the Penrose axioms hold, then we speak of
generalized inverses. A detailed investigation is found, e.g., in the book of
M. Z. Nashed [63].

Now we want to derive a way of computing the smallest solution $x = A^+b$ for an arbitrary matrix $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$ and $b \in \mathbf{R}^m$ with the help of the QR-factorization. Let $p = \mathrm{rank}(A) \le \min(m,n)$ be the rank of the matrix $A$. In order to simplify notation we neglect permutations and bring $A$ to upper triangular form by orthogonal transformations $Q \in N(m)$ (e.g., Householder reflections),
\[
QA = \begin{pmatrix} R & S \\ 0 & 0 \end{pmatrix}, \tag{3.4}
\]
where $R \in \mathrm{Mat}_p(\mathbf{R})$ is an invertible upper triangular matrix and $S \in \mathrm{Mat}_{p,n-p}(\mathbf{R})$. We formally partition the vectors $x$ and $Qb$ in an analogous way as
\[
x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \ \text{with } x_1 \in \mathbf{R}^p,\ x_2 \in \mathbf{R}^{n-p},
\qquad
Qb = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} \ \text{with } b_1 \in \mathbf{R}^p,\ b_2 \in \mathbf{R}^{m-p}.
\]
Then we can characterize the solution of the least-squares problem as follows:

Lemma 3.18 With the above notation, $x$ is a solution of $\|b - Ax\| = \min$ if and only if
\[
R x_1 + S x_2 = b_1.
\]

Proof. Because of the invariance of the Euclidean norm under orthogonal transformations we have
\[
\|b - Ax\|^2 = \|Q(b - Ax)\|^2 = \|b_1 - Rx_1 - Sx_2\|^2 + \|b_2\|^2.
\]
The expression is minimal if and only if $Rx_1 + Sx_2 - b_1 = 0$. \qed

The case $p = \mathrm{rank}(A) = n$ corresponds to the overdetermined full-rank case that has already been treated. The matrix $S$ vanishes and we get, as in Theorem 3.13, the solution $x = x_1 = R^{-1}b_1$. In the underdetermined or rank-deficient case $p < n$ the solution can be computed as follows:

Lemma 3.19 Let $p < n$, $V := R^{-1}S \in \mathrm{Mat}_{p,n-p}(\mathbf{R})$, and $u := R^{-1}b_1 \in \mathbf{R}^p$. Then the smallest solution $x$ of $\|b - Ax\| = \min$ is given by $x = (x_1, x_2) \in \mathbf{R}^p \times \mathbf{R}^{n-p}$ with
\[
(I + V^TV)\,x_2 = V^T u \quad\text{and}\quad x_1 = u - V x_2.
\]

Proof. According to Lemma 3.18 the solutions of $\|b - Ax\| = \min$ are characterized by $x_1 = u - Vx_2$. Inserting this into $\|x\|$ we obtain
\[
\begin{aligned}
\|x\|^2 &= \|x_1\|^2 + \|x_2\|^2 = \|u - Vx_2\|^2 + \|x_2\|^2 \\
&= \|u\|^2 - 2(u, Vx_2) + (Vx_2, Vx_2) + (x_2, x_2) \\
&= \|u\|^2 + \bigl(x_2, (I + V^TV)x_2 - 2V^Tu\bigr) =: \psi(x_2).
\end{aligned}
\]
Here
\[
\psi'(x_2) = -2V^Tu + 2(I + V^TV)x_2 \quad\text{and}\quad \psi''(x_2) = 2(I + V^TV).
\]
Because $I + V^TV$ is a symmetric positive definite matrix, $\psi(x_2)$ attains its minimum for the $x_2$ with $\psi'(x_2) = 0$, i.e., $(I + V^TV)x_2 = V^Tu$. This was exactly our claim. \qed

Since $I + V^TV$ is an Spd-matrix, we can use the Cholesky factorization for computing $x_2$. Altogether we obtain the following algorithm for computing the smallest solution $x = A^+b$ of $\|b - Ax\| = \min$.

Algorithm 3.20 Pseudo-inverse via QR-factorization.
Let $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$, $b \in \mathbf{R}^m$. Then $x = A^+b$ is computed as follows:
1. QR-factorization (3.4) of $A$ with $p = \mathrm{rank}(A)$, where $Q \in N(m)$, $R \in \mathrm{Mat}_p(\mathbf{R})$ is an upper triangular matrix, and $S \in \mathrm{Mat}_{p,n-p}(\mathbf{R})$.
2. Compute $V \in \mathrm{Mat}_{p,n-p}(\mathbf{R})$ from $RV = S$.
3. Cholesky factorization of $I + V^TV$,
\[
I + V^TV = LL^T,
\]
where $L \in \mathrm{Mat}_{n-p}(\mathbf{R})$ is a lower triangular matrix.
4. $(b_1, b_2)^T := Qb$ with $b_1 \in \mathbf{R}^p$, $b_2 \in \mathbf{R}^{m-p}$.
5. Compute $u \in \mathbf{R}^p$ from $Ru = b_1$.
6. Compute $x_2 \in \mathbf{R}^{n-p}$ from $LL^Tx_2 = V^Tu$.
7. Set $x_1 := u - Vx_2$.
Then it follows that $x = (x_1, x_2)^T = A^+b$.

Note that for different right-hand sides b we have to perform steps 1


through 3 only once.
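Assuming that the quantities of (3.4), i.e. Q, R, S with rank p, have already been computed (NumPy has no routine that delivers exactly this form, so they are taken as inputs here), the steps of Algorithm 3.20 can be sketched as:

    import numpy as np

    def smallest_solution(Q, R, S, b):
        # Algorithm 3.20: QA = [[R, S], [0, 0]], R (p x p) invertible; returns x = A^+ b.
        p = R.shape[0]
        V = np.linalg.solve(R, S)                               # step 2: R V = S
        L = np.linalg.cholesky(np.eye(S.shape[1]) + V.T @ V)    # step 3: I + V^T V = L L^T
        b1 = (Q @ b)[:p]                                        # step 4
        u = np.linalg.solve(R, b1)                              # step 5: R u = b_1
        x2 = np.linalg.solve(L.T, np.linalg.solve(L, V.T @ u))  # step 6: L L^T x_2 = V^T u
        x1 = u - V @ x2                                         # step 7
        return np.concatenate([x1, x2])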

Exercises
Exercise 3.1 A Givens rotation
\[
Q = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}
\]
can be stored, up to a sign, as a single number $\rho$ (naturally, the best storage location would be the place of the eliminated matrix entry):
\[
\rho := \begin{cases}
1 & \text{if } c = 0, \\
\mathrm{sgn}(c)\,s/2 & \text{if } |s| < |c|, \\
2\,\mathrm{sgn}(s)/c & \text{if } |s| \ge |c| \ne 0.
\end{cases}
\]
Give formulas which reconstruct, up to a sign, the Givens rotation $\pm Q$ from $\rho$. Why is this representation meaningful although the sign is lost? Is this representation stable?
Exercise 3.2 Let the matrix $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$, $m \ge n$, have full rank. Suppose that the solution $x$ of a linear least-squares problem $\|b - Ax\|_2 = \min$ computed in a stable way (by using a QR-factorization) is not accurate enough. According to A. Björck [7] the solution can be improved by residual correction. This is implemented on the linear system
\[
\begin{pmatrix} I & A \\ A^T & 0 \end{pmatrix}
\begin{pmatrix} r \\ x \end{pmatrix}
=
\begin{pmatrix} b \\ 0 \end{pmatrix},
\]
where $r$ is the residual $r = b - Ax$.
(a) Show that the vector $(r, x)$ is the solution of the above system of equations if and only if $x$ is the solution of the least-squares problem $\|b - Ax\|_2 = \min$ and $r$ is the residual $r = b - Ax$.
(b) Construct an algorithm for solving the above system that uses the available QR-factorization of $A$. What is the cost of one residual correction?
Exercise 3.3 In the special case $\mathrm{rank}(A) = p = m = n-1$ of an underdetermined system of codimension 1, the matrix $S$ of the QR-factorization (3.4) of $A$ is a vector $s \in \mathbf{R}^m$. Show that Algorithm 3.20 for computing the pseudo-inverse $x = A^+b = (x_1, x_2)^T$ simplifies to
(1) QR-factorization $QA = [R, s]$,
(2) $v := R^{-1}s \in \mathbf{R}^m$,
(3) $b_1 := Qb \in \mathbf{R}^m$,
(4) $u := R^{-1}b_1 \in \mathbf{R}^m$,
(5) $x_2 := (v,u)/(1 + (v,v)) \in \mathbf{R}$,
(6) $x_1 := u - x_2 v$.

Exercise 3.4 An experiment with $m$ measurements leads to a linear least-squares problem with $A \in \mathrm{Mat}_{m,n}(\mathbf{R})$. Let a QR-factorization $A = QR$ of $A$ be available. Then
(a) a (first) measurement is added, or
(b) the (first) measurement is omitted.
Give formulas for computing $\bar Q\bar R = \bar A$ for
\[
\bar A = \begin{pmatrix} a^T \\ A \end{pmatrix}
\quad\text{and}\quad
A = \begin{pmatrix} a^T \\ \bar A \end{pmatrix},
\]
respectively, by using the QR-factorization of $A$. How does the formula for modifying the $k$th row of $A$ read?
Exercise 3.5 Let $A = BC$. Show that $A^+ = C^+B^+$ if and only if $C$ or $B$ is an orthogonal matrix (of appropriate dimension). Derive from it formally the solution of the linear least-squares problem.
Hint: In case of rank deficiency ($p < n$) there exists an orthogonal transformation from the right such that only one regular triangular matrix of dimension $p$ remains to be inverted. Consider such a transformation in detail.
Exercise 3.6 Let $B^+$ be the pseudo-inverse of a matrix $B$ and $A^- := LB^+R$ a generalized inverse, where $L$, $R$ are regular matrices. Derive axioms for $A^-$ corresponding to the Penrose axioms. What are the consequences for row or column scaling of linear least-squares problems in the full-rank case and the rank-deficient case? Consider especially the influence on rank determination.
Exercise 3.7 In chemistry one often measures the so-called reaction rate constants $K_i$ ($i = 1,\dots,m$) at temperatures $T_i$ with absolute precision (tolerance) $\delta K_i$. With the help of the Arrhenius law
\[
K_i = A\,\exp\Bigl(-\frac{E}{R\,T_i}\Bigr)
\]
one determines in the sense of least squares both the pre-exponential factor $A$ and the activation energy $E$, where the general gas constant $R$ is given in advance. Formulate this nonlinear problem as a linear least-squares problem. What simplifications are obtained in the following two special cases:
(a) $\delta K_i = cK_i$ (constant relative error)?
(b) $\delta K_i = \mathrm{const}$ (constant absolute error)?
Exercise 3.8 Program the Householder orthogonalization procedure without column interchange for $(m,n)$-matrices, $m \ge n$, and solve with it the linear least-squares problem from Exercise 3.7 for the data file of Table 3.1 ($\delta K_i = 1$).

Table 3.1. Data file for the linear least-squares problem.

Ti Ki
1 728.79 7.4960. 10- 6
2 728.61 1.0062 . 10- 5
3 728.77 9.0220. 10- 6
4 728.84 1.4217. 10- 5
5 750.36 3.6608 . 10- 5
6 750.31 3.0642 . 10- 5
7 750.66 3.4588 . 10- 5
8 750.79 2.8875 . 10- 5
9 766.34 6.2065 . 10- 5
10 766.53 7.1908.10- 5
11 766.88 7.6056 . 10- 5
12 764.88 6.7110.10- 5
13 790.95 3.1927.10- 4
14 790.23 2.5538 . 10- 4
15 790.02 2.7563. 10- 4
16 790.02 2.5474.10- 4
17 809.95 1.0599 . 10- 3
18 810.36 8.4354. 10- 4
19 810.13 8.9309 . 10- 4
20 810.36 9.4770.10- 4
21 809.67 8.3409 . 10- 4

You can save the tedious typing of the above data by just looking into
the following web site:
http://www.zib.de/SciSoft/Codelib/arrhenius/
4
Nonlinear Systems and Least-Squares
Problems

So far we have been almost entirely concerned with linear problems. In this chapter we shall direct our attention to the solution of nonlinear problems. For this we should have a very clear picture of what is meant by a "solution" of an equation. Probably everybody knows from high school the quadratic equation
\[
f(x) := x^2 - 2px + q = 0
\]
and its analytic, closed-form solution
\[
x_{1,2} = p \pm \sqrt{p^2 - q}.
\]
For a stable evaluation of this expression see Example 2.5. In fact, however, this solution only transfers the problem of solving the quadratic equation to the problem of computing a square root, i.e., the solution of a simpler quadratic equation of the form
\[
f(x) := x^2 - c = 0 \quad\text{with}\quad c = |p^2 - q|.
\]
The question of how to determine this solution, i.e., how to solve such a problem numerically, still remains open.

4.1 Fixed-Point Iterations


For the time being we continue with the scalar nonlinear equation
f(x) = 0
with an arbitrary function $f : \mathbf{R} \to \mathbf{R}$. The idea of fixed-point iteration consists of transforming this equation into an equivalent fixed-point equation
\[
\phi(x) = x
\]
and of constructing a sequence $\{x_0, x_1, \dots\}$ with the help of the iterative scheme
\[
x_{k+1} = \phi(x_k), \qquad k = 0, 1, \dots,
\]
for a given starting value $x_0$. We hope that the sequence $\{x_k\}$ defined in this way converges to a fixed point $x^*$ with $\phi(x^*) = x^*$, which consequently is also a solution of the nonlinear equation, i.e., $f(x^*) = 0$.
Example 4.1 We consider the equation
\[
f(x) := 2x - \tan x = 0. \tag{4.1}
\]
From Figure 4.1 we can read off the value $x^* \approx 1.2$ as an approximation for the solution of (4.1) in the interval $[0.5, 2]$. We choose $x_0 = 1.2$ as a starting

Figure 4.1. Graphical solution of $2x - \tan x = 0$ (graphs of $y = 2x$ and $y = \tan x$).

value for a fixed-point iteration. Equation (4.1) can be easily transformed into a fixed-point equation, for instance into
\[
x = \tfrac{1}{2}\tan x =: \phi_1(x) \qquad\text{or}\qquad x = \arctan(2x) =: \phi_2(x).
\]
If we try the two corresponding fixed-point iterations with the starting value $x_0 = 1.2$, we obtain the numerical values in Table 4.1. We see that the first sequence diverges ($\tan x$ has a pole at $\pi/2$ and $x_2 > \pi/2$), whereas the second one converges. The convergent sequence has the property that roughly every second iteration a new "correct" decimal digit appears.
Obviously not every naively constructed fixed-point iteration converges.
Therefore we consider now general sequences {xd, which are given by an
iteration mapping ¢
Table 4.1. Comparison of the fixed-point iterations $\phi_1$ and $\phi_2$.

 k   x_{k+1} = (1/2) tan x_k    x_{k+1} = arctan(2 x_k)
 0   1.2                        1.2
 1   1.2860                     1.1760
 2   1.70... > pi/2             1.1687
 3                              1.1665
 4                              1.1658
 5                              1.1656
 6                              1.1655
 7                              1.1655
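The two iterations of Table 4.1 can be reproduced with a few lines of Python; the function name is ours:

    import math

    def fixed_point(phi, x0, steps):
        # iterates x_0, ..., x_steps of x_{k+1} = phi(x_k)
        xs = [x0]
        for _ in range(steps):
            xs.append(phi(xs[-1]))
        return xs

    phi1 = lambda x: 0.5 * math.tan(x)     # diverges from x0 = 1.2
    phi2 = lambda x: math.atan(2.0 * x)    # converges to x* = 1.1655...
    print(fixed_point(phi2, 1.2, 7))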

If we want to estimate the difference of two consecutive terms,
\[
|x_{k+1} - x_k| = |\phi(x_k) - \phi(x_{k-1})|,
\]
by the difference of the previous terms $|x_k - x_{k-1}|$ (naturally we have the geometric series in mind), we are necessarily led to the following theoretical characterization:
Definition 4.2 Let $I = [a,b] \subset \mathbf{R}$ be an interval and $\phi : I \to \mathbf{R}$ a mapping. $\phi$ is contractive on $I$ if there is a $0 \le \theta < 1$ such that
\[
|\phi(x) - \phi(y)| \le \theta\,|x - y| \quad\text{for all } x, y \in I.
\]
The Lipschitz constant $\theta$ can be easily computed if $\phi$ is continuously differentiable.
Lemma 4.3 If $\phi : I \to \mathbf{R}$ is continuously differentiable, $\phi \in C^1(I)$, then
\[
\sup_{x,y \in I,\ x \ne y} \left| \frac{\phi(x) - \phi(y)}{x - y} \right| = \sup_{z \in I} |\phi'(z)| < \infty.
\]

Proof. This is a simple application of the mean value theorem in $\mathbf{R}$: for all $x, y \in I$, $x < y$, there exists a $\xi \in [x,y]$ such that
\[
\phi(x) - \phi(y) = \phi'(\xi)(x - y). \qquad\qed
\]
Theorem 4.4 Let $I = [a,b] \subset \mathbf{R}$ be an interval and $\phi : I \to I$ a contractive mapping with Lipschitz constant $\theta < 1$. Then it follows that:
(a) There exists a unique fixed point $x^*$ of $\phi$, $\phi(x^*) = x^*$.
(b) For any starting point $x_0 \in I$, the fixed-point iteration $x_{k+1} = \phi(x_k)$ converges to $x^*$ such that
\[
|x_k - x^*| \le \frac{\theta^k}{1-\theta}\,|x_1 - x_0|.
\]
Proof. For all $x_0 \in I$ we have
\[
|x_{k+1} - x_k| = |\phi(x_k) - \phi(x_{k-1})| \le \theta\,|x_k - x_{k-1}|
\]
and therefore inductively
\[
|x_{k+1} - x_k| \le \theta^k\,|x_1 - x_0|.
\]
We want to show that $\{x_k\}$ is a Cauchy sequence, and therefore we write
\[
\begin{aligned}
|x_{k+m} - x_k| &\le |x_{k+m} - x_{k+m-1}| + \dots + |x_{k+1} - x_k| \\
&\le \bigl(\theta^{k+m-1} + \theta^{k+m-2} + \dots + \theta^{k}\bigr)\,|x_1 - x_0| \\
&\le \frac{\theta^k}{1-\theta}\,|x_1 - x_0|,
\end{aligned}
\]
where we have used the triangle inequality and the formula for the sum of the geometric series $\sum_{k=0}^{\infty}\theta^k = 1/(1-\theta)$. Thus $\{x_k\}$ is a Cauchy sequence in the complete metric space of the real numbers, and therefore it converges to a limit point
\[
x^* := \lim_{k\to\infty} x_k.
\]
But then $x^*$ is a fixed point of $\phi$, because
\[
\begin{aligned}
|x^* - \phi(x^*)| &= |x^* - x_{k+1} + x_{k+1} - \phi(x^*)| \\
&= |x^* - x_{k+1} + \phi(x_k) - \phi(x^*)| \\
&\le |x^* - x_{k+1}| + |\phi(x_k) - \phi(x^*)| \\
&\le |x^* - x_{k+1}| + \theta\,|x^* - x_k| \longrightarrow 0 \quad\text{for } k \to \infty.
\end{aligned}
\]
With this we have proved the second part of the theorem and the existence of a fixed point. If $x^*, y^*$ are two fixed points, then
\[
0 \le |x^* - y^*| = |\phi(x^*) - \phi(y^*)| \le \theta\,|x^* - y^*|.
\]
Because $\theta < 1$, this is possible only if $|x^* - y^*| = 0$. This proves the uniqueness of the fixed point of $\phi$. \qed

Remark 4.5 Theorem 4.4 is a special case of the Banach fixed-point theorem. The only properties used in the proof are the triangle inequality for the absolute value and the completeness of $\mathbf{R}$. Therefore the proof remains valid in the much more general situation where $\mathbf{R}$ is replaced by a Banach space $X$, e.g., a function space, and the absolute value by the corresponding norm. Such theorems play a role not only in the theory but also in the numerics of differential and integral equations. In this introductory textbook we shall use only the extension to $X = \mathbf{R}^n$ with a norm $\|\cdot\|$ instead of the absolute value $|\cdot|$.

Remark 4.6 For the solution of scalar nonlinear equations in the case
when only a program for evaluating f(x) and an interval enclosing the
solution are available, the algorithm of R. P. Brent [10] has established
itself as a standard code. It is based on a mixture of rather elementary
techniques, such as bisection and inverse quadratic interpolation, which
will not be further elaborated here. For a detailed description we refer
the reader to [10]. If additional information regarding f, like convexity or
differentiability, is available, then methods with faster convergence can be
constructed, on which we will focus our attention in the following.
In order to assess the speed of convergence of a fixed-point iteration we define the notion of the order of convergence of a sequence $\{x_k\}$.

Definition 4.7 A sequence $\{x_k\}$, $x_k \in \mathbf{R}^n$, converges to $x^*$ with order (at least) $p \ge 1$ if there is a constant $C \ge 0$ such that
\[
\|x_{k+1} - x^*\| \le C\,\|x_k - x^*\|^p,
\]
where in the case $p = 1$ we also require that $C < 1$. We use the term linear convergence in the case $p = 1$, and quadratic convergence for $p = 2$. Furthermore, we say that $\{x_k\}$ is superlinearly convergent if there exists a nonnegative null sequence $c_k \ge 0$ with $\lim_{k\to\infty} c_k = 0$ such that
\[
\|x_{k+1} - x^*\| \le c_k\,\|x_k - x^*\|.
\]
Remark 4.8 Often, for reasons of simplicity, the convergence order $p$ is alternatively defined by the analogous inequalities for the iterates; that is,
\[
\|x_{k+1} - x_k\| \le C\,\|x_k - x_{k-1}\|^p
\]
for convergence with order $p$, and
\[
\|x_{k+1} - x_k\| \le c_k\,\|x_k - x_{k-1}\|
\]
for superlinear convergence.

As we have seen above in the simple example $f(x) = 2x - \tan x$, in order to solve the nonlinear problem $f(x) = 0$ we must choose a suitable one from many possible fixed-point iterations. In general this is not a simple task. Since
\[
|x_{k+1} - x^*| = |\phi(x_k) - \phi(x^*)| \le \theta\,|x_k - x^*|
\]
and $0 \le \theta < 1$, the fixed-point iteration
\[
x_{k+1} = \phi(x_k),
\]
where $\phi$ is a contractive mapping, converges only linearly in the general case. We would expect that the convergence of a good iterative method be at least superlinear, or linear with a small constant $C \ll 1$. Therefore in the next section we will turn our attention toward a quadratically convergent method.

4.2 Newton Methods for Nonlinear Systems


For the time being we consider again a scalar nonlinear equation $f(x) = 0$, and we are trying to find a zero $x^*$ of $f$. As the function $f$ is not given in a global manner and we merely have the possibility of pointwise evaluation, we approximate it by its tangent $p(x)$ at the starting point $x_0$. Instead of the intersection point $x^*$ of the graph of $f$ with the $x$-axis, we compute the $x$-intercept of the tangent line (see Figure 4.2). The tangent line is

Figure 4.2. Idea of Newton's method in $\mathbf{R}^1$.

represented by the first-order polynomial
\[
p(x) = f(x_0) + f'(x_0)(x - x_0).
\]
In case $f'(x_0) \ne 0$ the corresponding zero $x_1$ may be written as
\[
x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}.
\]
The fundamental idea of Newton's method consists in the repeated application of this rule,
\[
x_{k+1} := x_k - \frac{f(x_k)}{f'(x_k)}, \qquad k = 0, 1, 2, \dots.
\]
This is obviously a specific fixed-point iteration with iteration mapping
\[
\phi(x) := x - \frac{f(x)}{f'(x)}.
\]

Naturally, $\phi$ can only be constructed if $f$ is differentiable and $f'(x)$ does not vanish, at least in a neighborhood of the solution. The convergence properties of the method will be analyzed later in a general theoretical framework.

Example 4.9 Computation of the square root. We have to solve the equation
\[
f(x) := x^2 - c = 0.
\]
In a computer the number $c$ has the floating point representation
\[
c = a\,2^{p} \quad\text{with}\quad 0.5 < a \le 1,
\]
with mantissa $a$ and integer exponent $p$. Therefore
\[
\sqrt{c} =
\begin{cases}
\sqrt{a}\;2^{m} & \text{if } p = 2m, \\
\sqrt{0.5}\,\sqrt{a}\;2^{m} & \text{if } p = 2m-1,
\end{cases}
\]
where
\[
\sqrt{0.5} < \sqrt{a} \le 1.
\]
Once $\sqrt{0.5} = 1/\sqrt{2} \approx 0.71$ is computed and stored with the necessary number of digits, only the problem
\[
f(x) := x^2 - a = 0 \quad\text{for}\quad a \in \;]0.5, 1]
\]
remains to be solved.

remains to be solved. Because

f'(x) = 2x i- 0 for x E ]0.5,1],

Newton's method is applicable and the corresponding iteration mapping is


given by

f (x ) x2 - a x a 1 ( a)
¢(x)=x- f'(x) =x-~=x-2+2x=2 x+~

Therefore the Newton iteration is defined as

Xk+l := ~2 (Xk + .!!...-)


Xk

Division by 2 can be cheaply implemented by subtracting 1 from the ex-


ponent, so that only a division and an addition have to be carried out
per iteration step. We have rendered the Newton iteration for a = 0.81
and Xo = 1 in Table 4.2. It shows that the number of exact figures is ap-
proximately doubled at each iteration step, a typical behavior of quadratic
convergence.

Table 4.2. Newton iteration for a = 0.81 and Xo = 1.

o 1.0000000000
1 0.2.050000000
2 0.9000138122
3 0.90000000001
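The iteration of Table 4.2 corresponds to the following few lines of Python; the stopping criterion is our own choice:

    def newton_sqrt(a, x0=1.0, tol=1e-12):
        # Newton's method for f(x) = x^2 - a:  x_{k+1} = (x_k + a/x_k) / 2
        x = x0
        while abs(x * x - a) > tol:
            x = 0.5 * (x + a / x)
        return x

    print(newton_sqrt(0.81))   # 0.9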
Newton's method can be easily extended to nonlinear systems
\[
F(x) = 0,
\]
where $F : \mathbf{R}^n \to \mathbf{R}^n$ is a continuously differentiable mapping satisfying certain additional properties. The graphical derivation of the method is of course no longer possible for dimension $n > 1$. In principle, however, we only have to replace the nonlinear map by a linear one. The Taylor expansion of $F$ about a starting point $x^0$ yields
\[
0 = F(x) = \underbrace{F(x^0) + F'(x^0)(x - x^0)}_{=: \bar F(x)} + o\bigl(\|x - x^0\|\bigr) \quad\text{for } x \to x^0. \tag{4.2}
\]
The zero $x^1$ of the linear substitute map $\bar F$ is precisely
\[
x^1 = x^0 - F'(x^0)^{-1} F(x^0),
\]
as long as the Jacobian matrix $F'(x^0)$ is invertible. This inspires the Newton iteration ($k = 0, 1, \dots$)
\[
F'(x^k)\,\Delta x^k = -F(x^k) \quad\text{with}\quad x^{k+1} = x^k + \Delta x^k. \tag{4.3}
\]
Of course, one does not compute the inverse $F'(x^k)^{-1}$ at each iteration step, but instead one determines the Newton correction $\Delta x^k$ as the solution of the above linear system. Therefore we reduce the numerical solution of a system of nonlinear equations to the numerical solution of a sequence of linear systems.
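A minimal sketch of the basic (undamped) iteration (4.3) in NumPy; the Jacobian is assumed to be available as a function dF, and the termination criterion shown here is only one possible choice:

    import numpy as np

    def newton_system(F, dF, x0, tol=1e-10, kmax=25):
        # Ordinary Newton iteration: solve F'(x^k) dx = -F(x^k), then x^{k+1} = x^k + dx.
        x = np.array(x0, dtype=float)
        for _ in range(kmax):
            dx = np.linalg.solve(dF(x), -F(x))
            x = x + dx
            if np.linalg.norm(dx) <= tol * np.linalg.norm(x):
                break
        return x

    # example: intersection of the unit circle with the line x_1 = x_2
    F  = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
    dF = lambda x: np.array([[2.0*x[0], 2.0*x[1]], [1.0, -1.0]])
    print(newton_system(F, dF, [1.0, 0.5]))   # approx. (0.7071, 0.7071)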
Before turning to the analysis of the convergence properties of Newton's method, we wish to point out an invariance property. Obviously, the problem of solving the equation $F(x) = 0$ is equivalent to solving
\[
G(x) := AF(x) = 0,
\]
where $A \in GL(n)$ is an arbitrary invertible matrix. At the same time note that, for a given $x^0$, the Newton sequence $\{x^k\}$ is independent of $A$, since
\[
G'(x)^{-1}G(x) = (AF'(x))^{-1}AF(x) = F'(x)^{-1}A^{-1}AF(x) = F'(x)^{-1}F(x).
\]
The transformation $F \to G$ is an affine transformation (without the translation component). Therefore it has become common usage to say that the problem $F(x) = 0$ as well as Newton's method are affine invariant. Accordingly we require that the convergence properties of Newton's method be described by an affine invariant theory. Among many convergence theorems for Newton's method we select a relatively new one [27], because it yields particularly clear results and is nevertheless relatively easy to prove.
Theorem 4.10 Let $D \subset \mathbf{R}^n$ be open and convex, and let $F : D \to \mathbf{R}^n$ be a continuously differentiable mapping with an invertible Jacobian matrix $F'(x)$ for all $x \in D$. Suppose that for an $\omega \ge 0$ the following (affine invariant) Lipschitz condition holds:
\[
\|F'(x)^{-1}(F'(x + sv) - F'(x))v\| \le s\,\omega\,\|v\|^2 \tag{4.4}
\]
for all $s \in [0,1]$, $x \in D$, and $v \in \mathbf{R}^n$ such that $x + v \in D$. Furthermore, let us assume that there exist a solution $x^* \in D$ and a starting point $x^0 \in D$ such that
\[
\rho := \|x^* - x^0\| < \frac{2}{\omega} \quad\text{and}\quad B_\rho(x^*) \subseteq D.
\]
Then the sequence $\{x^k\}$, $k > 0$, defined by Newton's method stays in the open ball $B_\rho(x^*)$ and converges to $x^*$, i.e.,
\[
\|x^k - x^*\| < \rho \ \text{ for } k > 0 \qquad\text{and}\qquad \lim_{k\to\infty} x^k = x^*.
\]
The speed of convergence can be estimated by
\[
\|x^{k+1} - x^*\| \le \frac{\omega}{2}\,\|x^k - x^*\|^2 \quad\text{for } k = 0, 1, \dots.
\]
Moreover, the solution $x^*$ is unique in $B_{2/\omega}(x^*)$.

Proof. First, we use the Lipschitz condition (4.4) to derive the following result for all $x, y \in D$:
\[
\|F'(x)^{-1}\bigl(F(y) - F(x) - F'(x)(y - x)\bigr)\| \le \frac{\omega}{2}\,\|y - x\|^2. \tag{4.5}
\]
Here we use the Lagrange form of the integral mean value theorem:
\[
F(y) - F(x) - F'(x)(y - x) = \int_{s=0}^{1} \bigl(F'(x + s(y-x)) - F'(x)\bigr)(y - x)\,ds.
\]
The left-hand side of (4.5) can thus be rewritten and estimated as
\[
\Bigl\| \int_{s=0}^{1} F'(x)^{-1}\bigl(F'(x + s(y-x)) - F'(x)\bigr)(y - x)\,ds \Bigr\|
\le \int_{s=0}^{1} s\,\omega\,\|y - x\|^2\,ds = \frac{\omega}{2}\,\|y - x\|^2,
\]
which proves (4.5). After this preparation we can turn our attention to the question of convergence of the Newton iteration. By using the iterative scheme (4.3) as well as the relation $F(x^*) = 0$ we get
\[
\begin{aligned}
x^{k+1} - x^* &= x^k - F'(x^k)^{-1}F(x^k) - x^* \\
&= x^k - x^* - F'(x^k)^{-1}\bigl(F(x^k) - F(x^*)\bigr) \\
&= F'(x^k)^{-1}\bigl(F(x^*) - F(x^k) - F'(x^k)(x^* - x^k)\bigr).
\end{aligned}
\]
With the help of (4.5) this leads to the following estimate of the speed of convergence:
\[
\|x^{k+1} - x^*\| \le \frac{\omega}{2}\,\|x^k - x^*\|^2.
\]
If $0 < \|x^k - x^*\| \le \rho$, then
\[
\|x^{k+1} - x^*\| \le \underbrace{\frac{\omega}{2}\,\|x^k - x^*\|}_{\le\, \rho\omega/2 \,<\, 1} \,\|x^k - x^*\| < \|x^k - x^*\|.
\]
Since $\|x^0 - x^*\| = \rho$, we have $\|x^k - x^*\| < \rho$ for all $k > 0$, and the sequence $\{x^k\}$ converges toward $x^*$.
In order to prove uniqueness in the ball $B_{2/\omega}(x^*)$ centered at $x^*$ with radius $2/\omega$, we again employ inequality (4.5). Let $x^{**} \in B_{2/\omega}(x^*)$ be another solution, so that $F(x^{**}) = 0$ and $\|x^* - x^{**}\| < 2/\omega$. By substituting in (4.5) we obtain
\[
\|x^{**} - x^*\| = \|F'(x^*)^{-1}\bigl(0 - 0 - F'(x^*)(x^{**} - x^*)\bigr)\|
\le \underbrace{\frac{\omega}{2}\,\|x^{**} - x^*\|}_{<\,1}\,\|x^{**} - x^*\|.
\]
This is possible only if $x^{**} = x^*$. \qed
In short, the above theorem reads: Newton's method converges locally and
quadratically.

Remark 4.11 Different variants of the above theorem additionally prove


the existence of a solution x*, which we had assumed above. However, the
corresponding proofs are much more involved (see [20]).
Because the solution x* is not known a priori, the theoretical assump-
tions of Theorem 4.10 cannot be directly verified. On the other hand, of
course, in order to save unnecessary iteration steps, we would like to know
as early as possible whether the Newton iteration converges. Therefore we
are looking for a convergence criterion that can be verified within the al-
gorithm itself, and allows us to decide after one, or a few steps, if Newton's
method converges. As in Section 2.4.3, where we considered the solution of
linear systems, we will first look at the residuals F(xk). Obviously, solving
the system of nonlinear equations is equivalent to minimizing the resid-
ual. Therefore one may require that the residual should be monotonically
decreasing, i.e.,

    ‖F(x^{k+1})‖ ≤ Θ ‖F(x^k)‖  for k = 0, 1, ...  and a Θ < 1.         (4.6)

This standard monotonicity test, however, is not affine invariant. Multiplication
of F by any invertible matrix A may arbitrarily change the result of
the test (4.6). With the idea of transforming the inequality (4.6) into a
condition that is both affine invariant and easily executable, P. Deuflhard
suggested in 1972 the natural monotonicity test

    ‖F'(x^k)^{-1} F(x^{k+1})‖ ≤ Θ ‖F'(x^k)^{-1} F(x^k)‖.               (4.7)

For an extensive representation of this subject see, e.g., the book [20].
On the right-hand side we recognize the Newton correction Δx^k that has
to be computed anyway. On the left-hand side we detect the simplified
Newton correction Δx̄^{k+1} as the solution of the linear equation system

    F'(x^k) Δx̄^{k+1} = -F(x^{k+1}).

With this notation the natural monotonicity test (4.7) can be written as

    ‖Δx̄^{k+1}‖ ≤ Θ ‖Δx^k‖.                                            (4.8)

For the simplified Newton correction we obviously have to solve another
system of linear equations with the same matrix F'(x^k), but with different
right-hand side F(x^{k+1}) evaluated at the next iterate

    x^{k+1} = x^k + Δx^k.

This can be done with little additional effort: If we apply an elimination
method (requiring O(n³) operations for a full matrix), we only have to
carry out the forward and backward substitutions (which means O(n²)
additional operations).
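The reuse of the factorization can be sketched as follows (Python with SciPy; the helper name and the threshold parameter are illustrative assumptions, not the book's code). One LU factorization of F'(x^k) serves both the ordinary and the simplified Newton correction, so the second correction costs only the forward and backward substitutions.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def newton_step_with_monotonicity(F, J, xk, theta=0.5):
        # J = F'(x^k) as an (n, n) array, F evaluates the residual
        lu, piv = lu_factor(J)                  # O(n^3), done once
        dx = lu_solve((lu, piv), -F(xk))        # ordinary correction, O(n^2)
        xk1 = xk + dx
        dx_bar = lu_solve((lu, piv), -F(xk1))   # simplified correction, O(n^2)
        ok = np.linalg.norm(dx_bar) <= theta * np.linalg.norm(dx)   # test (4.8)
        return xk1, dx, dx_bar, ok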
The theoretical analysis in [20] also shows that, within the local
convergence domain of the ordinary Newton method,

    ‖Δx̄^{k+1}‖ ≤ (1/4) ‖Δx^k‖.

We "soften" this condition somewhat and set Θ := 1/2 in the above natural
monotonicity test (4.8). If this test fails for some k, i.e., if

    ‖Δx̄^{k+1}‖ > (1/2) ‖Δx^k‖,

then the Newton iteration should be terminated. In this case we have no
other option than to restart it with a different, hopefully better, initial
guess x^0.
One possibility to globalize the local convergence of the ordinary Newton
method is the damping of the Newton corrections Δx^k. This leads to the
modified iteration

    x^{k+1} = x^k + λ_k Δx^k,

where λ_k is the damping factor. As a simple damping strategy we recommend
choosing the damping factors λ_k such that the natural monotonicity
test (4.8) is satisfied with Θ := 1 - λ_k/2, i.e.,

    ‖Δx̄^{k+1}(λ_k)‖ ≤ (1 - λ_k/2) ‖Δx^k‖.

In the simplest implementation we choose a threshold value λ_min ≪ 1 and
damping factors λ_k from some sequence, e.g.,

    λ_k ∈ {1, 1/2, 1/4, ..., λ_min}.

If λ_k < λ_min, the iteration is terminated. In critical examples one will start
with λ_0 = λ_min, in harmless examples preferably with λ_0 = 1. Whenever
λ_k was successful, we set

    λ_{k+1} := min(2λ_k, 1)

in the next iteration to attain asymptotically the quadratic convergence of
the ordinary Newton method (with λ_k = 1 throughout). If the monotonicity
test for λ_k fails, we try again with λ_k/2. Theoretically sounder and more
efficient damping strategies are worked out in the already cited book [20].
Such strategies are implemented in quite a number of program packages in
Scientific Computing.
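As a hedged sketch of the simple damping strategy just described (assumed function names and start value λ_0 = 1; the efficient strategies from [20] are not reproduced here):

    import numpy as np

    def damped_newton(F, dF, x0, lam_min=1e-4, tol=1e-10, kmax=50):
        x, lam = np.asarray(x0, dtype=float), 1.0
        for _ in range(kmax):
            J = dF(x)
            dx = np.linalg.solve(J, -F(x))
            if np.linalg.norm(dx) < tol:
                return x
            while lam >= lam_min:
                x_trial = x + lam * dx
                dx_bar = np.linalg.solve(J, -F(x_trial))   # simplified correction
                if np.linalg.norm(dx_bar) <= (1 - lam/2) * np.linalg.norm(dx):
                    break                                  # natural monotonicity test passed
                lam *= 0.5                                 # reduce the damping factor
            else:
                raise RuntimeError("damping factor fell below lambda_min")
            x, lam = x_trial, min(2*lam, 1.0)              # be more optimistic next step
        raise RuntimeError("no convergence within kmax iterations")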
Remark 4.12 In the implementation of Newton methods it is generally
sufficient to replace the exact Jacobian matrix by a suitable approxima-
tion. Popular approximation techniques are finite differences (see, however,
Exercise 4.7) or automatic differentiation as suggested by A. Griewank
[44]. Moreover, the convergence of the method is usually not impaired, if
"nonessential" elements of the Jacobian matrix are omitted. This spars-
ing technique is particularly recommended for large nonlinear systems. It
requires, however, some deeper insight into the underlying problem.
An alternative approach to extend the local convergence domain of the
ordinary Newton method is represented by continuation methods. In this
kind of approach an embedding of the nonlinear problem with respect to a
parameter λ is constructed, say

    F(x, λ) = 0.
Such methods that require additional knowledge about the problem to be
solved will be treated in Section 4.4.2.

4.3 Gauss-Newton Method for Nonlinear


Least-Squares Problems
In Section 3.1 we have dealt in detail with the general setting of the prob-
lem of Gaussian least-squares computation. The goal is to determine a
parameter x ∈ R^n, which arises within a model function φ such that it fits
given measurements b in a least-squares sense. When the parameters enter
linearly in φ, then this leads to the linear least-squares problems treated
in Chapter 3. When the parameters enter nonlinearly in φ, then we have a
nonlinear least-squares problem of the general form

    g(x) := ‖F(x)‖₂² = min,

where we assume that F : D ⊂ R^n → R^m is a twice continuously differentiable
function F ∈ C²(D) on an open set D ⊂ R^n. In this section we

treat only the overdetermined case m > n. In applications we have, as a
rule, significantly more measurement data than parameters, i.e., m ≫ n,
which is why this case is also known under the term data compression. In
what follows we will omit necessary considerations concerning the boundary
∂D and rather look only for interior local minima x* ∈ D of g, which
satisfy the sufficient conditions

    g'(x*) = 0   and   g''(x*) positive definite.

Since g'(x) = 2 F'(x)^T F(x), we therefore have to solve the following system
of n nonlinear equations

    G(x) := F'(x)^T F(x) = 0.                                           (4.9)
The Newton iteration for this system of equations is

    G'(x^k) Δx^k = -G(x^k)  with  k = 0, 1, ...,                        (4.10)

where under the above assumptions the Jacobian matrix

    G'(x) = F'(x)^T F'(x) + F''(x)^T F(x)

is positive definite in a neighborhood of x* and hence invertible. When the
model and data fully agree at x*, i.e., when they are compatible, then we
have

    F(x*) = 0   and   G'(x*) = F'(x*)^T F'(x*).

The condition "G'(x*) is positive definite" is equivalent in this case with
the condition that F'(x*) has full rank n. For compatible, or at least for
"almost compatible," nonlinear least-squares problems, we would like to
save the effort of evaluating the tensor F''(x). Therefore we modify the
Jacobian matrix G'(x) in the Newton iteration (4.10) by dropping F''(x),
thus obtaining the iterative scheme

    F'(x^k)^T F'(x^k) Δx^k = -F'(x^k)^T F(x^k).
Obviously these are the normal equations for the linear least-squares
problem

    ‖F'(x^k) Δx^k + F(x^k)‖₂ = min.

Upon recalling the notation of the pseudo-inverse from Section 3.3 we here
obtain the formal representation

    Δx^k = -F'(x^k)^+ F(x^k)  with  x^{k+1} = x^k + Δx^k.              (4.11)

In this way we have reduced the numerical solution of a nonlinear least-squares
problem to a sequence of linear least-squares problems.
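A minimal sketch of the resulting iteration (illustrative names and tolerances, not from the book); each step solves the linear least-squares problem above with a library routine rather than via the normal equations:

    import numpy as np

    def gauss_newton(F, dF, x0, tol=1e-10, kmax=50):
        # F: R^n -> R^m with m >= n, dF returns the (m, n) Jacobian;
        # each correction is dx = -F'(x^k)^+ F(x^k), cf. (4.11)
        x = np.asarray(x0, dtype=float)
        for _ in range(kmax):
            dx, *_ = np.linalg.lstsq(dF(x), -F(x), rcond=None)
            x = x + dx
            if np.linalg.norm(dx) < tol:
                return x
        raise RuntimeError("iteration did not converge")

    # illustration: fit a*exp(b*t) to synthetic, compatible data
    t = np.linspace(0.0, 1.0, 20)
    b_data = 2.0 * np.exp(-1.3 * t)
    F  = lambda x: x[0] * np.exp(x[1] * t) - b_data
    dF = lambda x: np.column_stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)])
    print(gauss_newton(F, dF, [1.0, -1.0]))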
Remark 4.13 If the Jacobian matrix has full rank, then

    F'(x)^+ = (F'(x)^T F'(x))^{-1} F'(x)^T,

and therefore the equation (4.9) to be solved for the nonlinear least-squares
problem is equivalent to

    F'(x)^+ F(x) = 0.

This characterization holds for the rank-deficient and the underdetermined
case as well.
Similarly as in Newton's method for nonlinear systems, we could have
derived the iterative scheme (4.11) directly from the original minimiza-
tion problem by expanding in a Taylor series and truncating after the
linear term. Therefore (4.11) is also called Gauss-Newton method for the
nonlinear least-squares problem IIF(x)112 = min. The convergence of the
Gauss-Newton method is characterized by the following theorem (com-
pare [26]), which is an immediate generalization of our Theorem 4.10 for
Newton's method.
Theorem 4.14 Let D ⊂ R^n be open and convex and F : D → R^m,
m ≥ n, a continuously differentiable mapping whose Jacobian matrix F'(x)
has full rank n for all x ∈ D. Suppose there is a solution x* ∈ D of the
corresponding nonlinear least-squares problem ‖F(x)‖₂ = min. Furthermore
let ω > 0 and 0 ≤ κ* < 1 be two constants such that

    ‖F'(x)^+(F'(x + sv) - F'(x))v‖ ≤ s ω ‖v‖²                           (4.12)

for all s ∈ [0,1], x ∈ D and v ∈ R^n with x + v ∈ D, and assume that

    ‖F'(x)^+ F(x*)‖ ≤ κ* ‖x - x*‖                                       (4.13)

for all x ∈ D. If for a given starting point x^0 ∈ D, we have

    ρ := ‖x^0 - x*‖ < 2(1 - κ*)/ω =: σ,                                 (4.14)

then the sequence {x^k} defined by the Gauss-Newton method (4.11) stays
in the open ball B_ρ(x*) and converges toward x*, i.e.,

    ‖x^k - x*‖ < ρ  for k > 0   and   lim_{k→∞} x^k = x*.

The speed of convergence can be estimated by

    ‖x^{k+1} - x*‖ ≤ (ω/2) ‖x^k - x*‖² + κ* ‖x^k - x*‖.                 (4.15)

In particular quadratic convergence is obtained for compatible nonlinear
least-squares problems. Moreover the solution x* is unique in B_σ(x*).

Proof. The proof follows directly the main steps of the proof of Theorem
4.10. From the Lipschitz condition (4.12) it follows immediately that

    ‖F'(x)^+(F(y) - F(x) - F'(x)(y - x))‖ ≤ (ω/2) ‖y - x‖²

for all x, y ∈ D. In order to estimate the speed of convergence we use
the definition (4.11) of the Gauss-Newton iteration as well as the property
F'(x*)^+ F(x*) = 0 of the solution x* whose existence has been assumed.
From the full-rank Jacobian assumption it follows immediately (see Section
3.3) that

    F'(x)^+ F'(x) = I_n   for all x ∈ D.

Therefore we obtain

    x^{k+1} - x* = x^k - x* - F'(x^k)^+ F(x^k)
                 = F'(x^k)^+(F(x*) - F(x^k) - F'(x^k)(x* - x^k)) - F'(x^k)^+ F(x*).

By applying conditions (4.12) and (4.13) we get

    ‖x^{k+1} - x*‖ ≤ ((ω/2) ‖x^k - x*‖ + κ*) ‖x^k - x*‖.

Together with assumption (4.14) and induction on k this implies

    ‖x^{k+1} - x*‖ < ‖x^k - x*‖ ≤ ρ.

From here it follows immediately that the iterates remain in B_ρ(x*) and
converge toward the solution x*. For compatible least-squares problems,
i.e., F(x*) = 0, we can choose κ* = 0 in (4.13) and hence obtain quadratic
convergence. The uniqueness of the solution is obtained as in the proof of
Theorem 4.10.    □

Remark 4.15 Note that in the above theorem the existence of a solution
has been assumed. A variant of the above theorem additionally yields the
proof of existence of a solution x* wherein the full-rank assumption on the
Jacobian (see again, e.g., [20]) can be relaxed: only one out of the four
Penrose axioms, namely,
F'(x)+ F'(x)F'(x)+ = F'(x)+ ,
is needed.
Uniqueness, however, requires-just as in the linear problem (see Section
3.3)-a maximal rank assumption. Otherwise there exists a solution man-
ifold of a dimension equal to the rank deficiency and the Gauss-Newton
method converges toward any point on this manifold. We will exploit this
property in Section 4.4.2 in the context of continuation methods.
Finally we want to discuss the condition κ* < 1 in more detail. As we
approach the solution x*, the linear term κ* ‖x^k - x*‖ dominates the speed
of convergence estimate (4.15), at least for κ* > 0. In this case the Gauss-
Newton method converges linearly with asymptotic convergence factor κ*,
which enforces the condition κ* < 1. Obviously, the quantity κ* reflects the
omission of the tensor F''(x) in the derivation of the Gauss-Newton method
from Newton's method (4.10).
Another interpretation of κ* comes from examining the influence of the
statistical measure of the error δb on the solution. In case the Jacobian

matrix F'(x*) has full rank, the perturbation of the parameters induced by
δb is determined, in a linearized error analysis, by

    δx* = -F'(x*)^+ δb.

A quantity of this general type is given as an a posteriori error analysis by
virtually all software packages that are in widespread use today in statistics.
Obviously this condition does not reflect the possible effects of the
nonlinearity of the model. A more accurate analysis of this problem has
been carried out by H. G. Bock [8]: he actually showed that one should
perform the substitution
(4.16)
In the compatible case we have F(x*) = 0 and κ* = 0. In the "almost
compatible" case 0 < κ* ≪ 1 the linearized error theory is certainly
satisfactory. However, as shown in [8], in the case κ* ≥ 1 there are always
statistical errors such that the solution "runs away unboundedly." Such
models might be called statistically ill-posed or inadequate. Conversely,
a nonlinear least-squares problem is called statistically well-posed or adequate
whenever κ* < 1. In this wording, Theorem 4.14 can be stated
in short form as: For adequate nonlinear least-squares problems the ordinary
Gauss-Newton method converges locally, for compatible least-squares
problems even quadratically.
Intuitively it is clear that not every model and every set of measurements
allow for a determination of a unique suitable parameter vector. But only
unique solutions permit a clear interpretation in connection with the basic
theoretical model. The Gauss-Newton method presented here proves the
uniqueness of a solution by three criteria:

(a) checking the full-rank condition for the corresponding Jacobians in
the sense of a numerical rank determination, which can be done, for
example, as in Section 3.2 on the basis of a QR-factorization;

(b) checking the statistical well-posedness with the help of the condition
κ* < 1 by estimating

    κ̂* := ‖Δx^{k+1}‖ / ‖Δx^k‖

in the asymptotic phase of the Gauss-Newton iteration;

(c) analyzing the error behavior by (4.16).

One should be aware of the fact that all three criteria are influenced by the
choice of the measurement tolerances δb (cf. Section 3.1) as well as by the
scaling of the parameters x.
Remark 4.16 As in Newton's method the convergence domain of the
ordinary Gauss-Newton method can be enlarged by some damping strategy.
If we denote by Δx^k the ordinary Gauss-Newton correction, then a
corresponding iteration step reads

    x^{k+1} = x^k + λ_k Δx^k.
Again there exist rather efficient theoretically backed strategies, which are
implemented in a series of modern least-squares software packages; see
[20]. These programs also check automatically whether the least-squares
problem under consideration is adequate. If this is not the case, as happens
rather seldom, one should either improve the model or increase the precision
of the measurements. Moreover, these programs ensure automatically that
the iteration is performed only down to a relative precision that matches
the given precision of the measurements.

Example 4.17 A biochemical reaction. In order to illustrate the behavior


of the damped Gauss-Newton method we give a nonlinear least-squares
problem from biochemistry, the Feulgen hydrolysis [67]. From an extensive
series of measurements we choose a problem with m = 30 measurements
and n = 3 unknown parameters x_1, x_2, and x_3 (see Table 4.3).

Table 4.3. Measurement sequence (t_i, b_i), i = 1, ..., 30, for Feulgen hydrolysis.

t b t b t b t b t b
6 24.19 42 57.39 78 52.99 114 49.64 150 46.72
12 35.34 48 59.56 84 53.83 120 57.81 156 40.68
18 43.43 54 55.60 90 59.37 126 54.79 162 35.14
24 42.63 60 51.91 96 62.35 132 50.38 168 45.47
30 49.92 66 58.27 102 61.84 138 43.85 174 42.40
36 51.53 72 62.99 108 61.62 144 45.16 180 55.21

Originally the model function was given in the form

    φ(x; t) := (x_1 x_2 / (x_2 - x_3)) (exp(-x_3 t) - exp(-x_2 t)),

where x_1 is the DNA concentration and x_2, x_3 are chemical reaction rate
constants. For x_2 = x_3 we obviously have a limit of the form 0/0. Hence
numerical cancellation already occurs for x_2 ≈ x_3, which induces difficulties
in the iteration behavior. Therefore we introduce a different parametrization,
which, in passing, also takes the inequalities x_2 > x_3 ≥ 0 (coming
from biochemistry) into account: Instead of x_1, x_2, and x_3, we consider the
unknowns

and the transformed model function

    φ(x; t) := x_1 exp(-(x_2² + x_3²) t) · sinh(x_3² t) / x_3².

The property sinh(x_3² t) = x_3² t + o(|t|) for small arguments is surely established
in every standard routine for calculating sinh. Therefore only the
evaluation of φ for x_3 = 0 must be especially handled by the program. As
starting point for the iteration we choose x^0 = (80, 0.055, 0.21). The iteration
history of φ(x^k; t) over the interval t ∈ [0, 180] is represented in Figure
4.3. We come out with TOL = 0.142 · 10⁻³ as the "statistically reasonable"
relative precision. At the solution x* we obtain the estimate κ̂* = 0.156 and
the residual norm ‖F(x*)‖₂ ≈ 3 · 10². Therefore despite a "large" residual,
the problem is "almost compatible" in the sense of our above theoretical
characterization.

Figure 4.3. Measurements and iterated model function for Example 4.17.

Remark 4.18 Most software packages for nonlinear least-squares problems
are still based on another globalization method with enlarged convergence
domain, the Levenberg-Marquardt method. This method is based on the
idea that the local linearization

    F(x) ≈ F(x^k) + F'(x^k)(x - x^k),

which underlies the (ordinary) Newton method, is reasonable for iterates
x^k "far away" from the solution point x* only in a neighborhood of x^k.
Therefore instead of (4.2) one solves the substitute problem

    ‖F(x^k) + F'(x^k) Δz‖₂ = min                                        (4.17)

under the constraint

    ‖Δz‖₂ ≤ δ,                                                          (4.18)

where the (locally fitted) parameter δ is to be discussed. The constraints
(4.18) can be coupled with the help of a Lagrange multiplier p ≥ 0 to the
minimization problem (4.17) by

    p (‖Δz‖₂² - δ²) = 0.

If ‖Δz‖₂ = δ, then we must have p > 0, while if ‖Δz‖₂ < δ, then p = 0.
This leads to the formulation

    ‖F(x^k) + F'(x^k) Δz‖₂² + p (‖Δz‖₂² - δ²) = min.

This quadratic function in Δz attains its minimum with a correction Δz^k
satisfying the equation

    (F'(x^k)^T F'(x^k) + p I_n) Δz^k = -F'(x^k)^T F(x^k).

The fitting of the parameter δ is usually replaced by the fitting of p. For
p > 0 the symmetric matrix appearing here is positive definite even in
case of a rank-deficient Jacobian F'(x^k), which gives the method a certain
robustness. On the other hand we pay for this robustness with a series of
disadvantages: The "solutions" are often not minima of g(x), but merely
saddle points with g'(x) = O. Furthermore, the masking of the Jacobian
rank also masks the uniqueness of a given solution, which implies that in
this setting numerical results are often incorrectly interpreted. Besides, the
above linear system is a generalization of the normal equations having all
the numerical disadvantages discussed in Section 3.1.2 (cf. Exercise 5.6).
Finally, the above formulation is not affine invariant (cf. Exercise 4.12).
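For comparison only, a single Levenberg-Marquardt correction as derived above can be sketched as follows (illustrative; the residual vector F and the Jacobian J are assumed to be given as arrays, and the strategy for fitting p is omitted). Note that the linear system inherits the numerical drawbacks of the normal equations mentioned above.

    import numpy as np

    def levenberg_marquardt_step(F, J, x, p):
        # solve (F'(x)^T F'(x) + p*I) dz = -F'(x)^T F(x) for a given p >= 0
        A = J.T @ J + p * np.eye(J.shape[1])
        dz = np.linalg.solve(A, -J.T @ F)
        return x + dz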

Altogether we see that nonlinear least-squares problems represent, be-


cause of their statistical background, a considerably more subtle problem
class than just solving systems of nonlinear equations.

4.4 Nonlinear Systems Depending on Parameters


In numerous applications in the natural sciences and engineering one has
to solve not only one isolated nonlinear problem F(x) = 0, but a whole
family of problems

    F(x, λ) = 0  with  F : D ⊂ R^n × R^p → R^n,                         (4.19)

that depend on one or several parameters λ ∈ R^p.
In this section we restrict our attention to the case p = 1, i.e., to
a scalar parameter λ ∈ R, and begin with the analysis of the solution
structure of the parametrized nonlinear system, whose main elements are
illustrated by a simple example. With the insight gained there we construct
in Section 4.4.2 a class of methods for solving parametrized systems, the
continuation methods.
100 4. Nonlinear Systems and Least-Squares Problems

4.4.1 Solution Structure


We consider a parametrized system of nonlinear equations

    F : D × [a, b] → R^n,   D ⊂ R^n open,

with a scalar parameter λ ∈ [a, b] ⊂ R, where we assume that F is continuously
differentiable on D × [a, b]. Our task is to determine eventually
all solutions (x, λ) ∈ D × [a, b] of the equation F(x, λ) = 0; i.e., we are
interested in the solution set

    S := {(x, λ) ∈ D × [a, b] | F(x, λ) = 0}.

In order to develop a feeling for the structure of the solution of a parametrized
system, let us look first at the following simple example.
Example 4.19 We have to solve the scalar equation

    F(x, λ) = x (x³ - x - λ) = 0,

whose solution set is represented in Figure 4.4 for λ ∈ [-1, 1]. Because of
the equivalence

    F(x, λ) = 0  ⟺  x = 0 or λ = x³ - x,

the solution set

    S := {(x, λ) | λ ∈ [-1, 1] and F(x, λ) = 0}

consists of two solution curves S = S_0 ∪ S_1, the λ-axis

    S_0 := {(0, λ) | λ ∈ [-1, 1]}

as trivial solution and the cubic parabola

    S_1 := {(x, λ) | λ = x³ - x}.

Both solution curves intersect at the point (0, 0), which is called a
bifurcation point. At this point the Jacobian matrix of F vanishes since

    F'(x, λ) = [F_x(x, λ), F_λ(x, λ)] = [4x³ - 2x - λ, -x].

A special role is played by the solutions at which only the derivative with
respect to x vanishes, i.e.,

    F(x, λ) = 0,  F_x(x, λ) = 4x³ - 2x - λ = 0  and  F_λ(x, λ) ≠ 0.

These turning points (x, λ) distinguish themselves by the fact that the
solution cannot be expressed as a function of the parameter λ in any neighborhood
(no matter how small) of (x, λ), while it may be expressed as a
function of x. If we substitute λ = x³ - x in F_x(x, λ) and F_λ(x, λ), it
follows that the turning points of F are characterized by

    x ≠ 0,  λ = x³ - x  and  3x² - 1 = 0.
Figure 4.4. Solution of x(x³ - x - λ) = 0 for λ ∈ [-1, 1].

Therefore in our example there are exactly two turning points, namely, at
x_1 = 1/√3 and x_2 = -1/√3. At these points the tangent to the solution
curve is perpendicular to the λ-axis.
As a last property we want to note the symmetry of the equation

    F(x, λ) = F(-x, -λ).

This is reflected in the point symmetry of the solution set: If (x, λ) is a
solution of F(x, λ) = 0, then so is (-x, -λ).
Unfortunately, we cannot go into all the phenomena observed in Example
4.19 within the frame of this introduction. We assume in what
follows that the Jacobian F'(x, λ) has maximal rank at each solution point
(x, λ) ∈ D × [a, b], i.e.,

    rank F'(x, λ) = n  whenever  F(x, λ) = 0.                           (4.20)

Thus we exclude bifurcation points because according to the implicit
function theorem under the assumption (4.20) the solution set can be
represented locally around a solution (x_0, λ_0) as the image of a differentiable
curve, i.e., there is a neighborhood U ⊂ D × [a, b] of (x_0, λ_0) and
continuously differentiable mappings

    x : ]-ε, ε[ → D   and   λ : ]-ε, ε[ → [a, b]

such that (x(0), λ(0)) = (x_0, λ_0) and the solutions of F(x, λ) = 0 in U are
given exactly by

    S ∩ U = {(x(s), λ(s)) | s ∈ ]-ε, ε[}.
In many applications an even more special case is also interesting, where
the partial derivative F_x(x, λ) ∈ Mat_n(R) with respect to x is invertible at
every solution point, i.e.,

    F_x(x, λ) is invertible whenever F(x, λ) = 0.

In this case both bifurcation points and turning points are excluded, and
we can parametrize the solution curve with respect to λ, i.e., there are, as
above, around each solution point (x_0, λ_0) a neighborhood U ⊂ D × [a, b]
and a differentiable function x : ]-ε, ε[ → D such that x(0) = x_0 and

    S ∩ U = {(x(s), λ_0 + s) | s ∈ ]-ε, ε[}.

4.4.2 Continuation Methods


Now we turn to the numerical computation of the solution of a parametrized
system of equations. First we assume that the derivative F_x(x, λ) is invertible
for all (x, λ) ∈ D × [a, b]. In the last section we have seen that in this
case the solution set of the parametrized system is made up of differentiable
curves that can be parametrized by λ. The idea of the continuation
methods consists of computing successive points on such a solution curve.
If we keep the parameter λ fixed, then

    F(x, λ) = 0                                                          (4.21)

is a nonlinear system in x as treated in Section 4.2. Therefore we can try
to compute a solution with the help of Newton's method:

    F_x(x^k, λ) Δx^k = -F(x^k, λ)   and   x^{k+1} = x^k + Δx^k.         (4.22)

Let us assume that we have found a solution (x_0, λ_0) and we want
to compute another solution (x_1, λ_1) on the solution curve (x(s), λ_0 + s)
through (x_0, λ_0). Then in selecting the starting point x̄ for the Newton
iteration (4.22), with fixed value of the parameter λ = λ_1, we can use the
fact that both solutions (x_0, λ_0) and (x_1, λ_1) lie on the same solution curve.
The simplest possibility is indicated in Figure 4.5.


Figure 4.5. Classical continuation method.

As starting point we take the old solution and set x̄ := x_0. This choice,
originally suggested by Poincaré in his book on celestial mechanics [66], is
today called classical continuation.
A geometric view brings yet another choice: instead of going parallel to
the λ-axis we can move along the tangent (x'(0), 1) to the solution curve
(x(s), λ_0 + s) at the point (x_0, λ_0) and choose x̄ := x_0 + (λ_1 - λ_0) x'(0) as
starting point (see Figure 4.6). This is the tangent continuation method. If
we differentiate the equation

    F(x(s), λ_0 + s) = 0

with respect to s at s = 0, it follows that

    F_x(x_0, λ_0) x'(0) = -F_λ(x_0, λ_0),

i.e., the slope x'(0) is computed from a linear system similar to (4.22).


Figure 4.6. Tangent continuation method.
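In code, the tangent predictor amounts to one additional linear solve with the already available matrix F_x (sketch with assumed function names, not the book's implementation):

    import numpy as np

    def tangent_predictor(Fx, Flam, x0, lam0, s):
        # solve F_x(x0, lam0) x' = -F_lambda(x0, lam0) and step along the tangent;
        # the returned pair serves as starting guess for the Newton corrector (4.22)
        xprime = np.linalg.solve(Fx(x0, lam0), -Flam(x0, lam0))
        return x0 + s * xprime, lam0 + s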

Thus each step of a continuation method contains two substeps: first the
choice of a point (x̄, λ_1) as close as possible to the curve; and second the
iteration from the starting point x̄ back to a solution (x_1, λ_1) on the curve,
where Newton's method appears to be the most appropriate because of
its quadratic convergence. The first substep is frequently called predictor,
the second substep corrector, and the whole process a predictor-corrector
method. If we denote by s := λ_1 - λ_0 the step size, then the dependence of
the starting point x̄ on s for the two possibilities encountered so far can
be expressed as

    x̄(s) = x_0

for the classical continuation and

    x̄(s) = x_0 + s x'(0)

for tangent continuation. The most difficult problem in the construction
of a continuation algorithm consists of an appropriate choice of the step
length s in conjunction with the predictor-corrector strategy. The optimist
who chooses a too-large step size s must constantly reduce the step length
and therefore ends up with too many unsuccessful steps. The pessimist,
on the other hand, chooses the step size too small and ends up with too
many successful steps. Both characters waste computing time. In order
to minimize cost, we therefore want to choose the step length as large as
possible while still ensuring the convergence of Newton's method.
Remark 4.20 In practice one should take care of a third criterion, namely,
not to leave the present solution curve and "jump" onto another solution

curve without noticing it (see Figure 4.7). The problem of "jumping over"
becomes important especially when considering bifurcations of solutions.

Figure 4.7. Unintentional "jump" between different solution branches.

Naturally, the maximal feasible step size s_max for which Newton's method
with starting point x^0 := x̄(s) and fixed parameter λ = λ_0 + s converges
depends on the quality of the predictor step. The better the curve is predicted,
the larger the step size. For example, the point x̄(s) given by the
tangent method appears graphically to be closer to the curve than the point
given by the classical method. In order to describe this deviation of the predictor
from the solution curve more precisely, we introduce the order of a
continuation method (see [18]).
Definition 4.21 Let x and x̄ be two curves

    x, x̄ : [-ε, ε] → R^n

in R^n. We say that the curve x̄(s) represents a continuation method of
order p ∈ N at s = 0, if

    ‖x(s) - x̄(s)‖ = O(|s|^p)  for s → 0,

i.e., if there are constants 0 < s_0 ≤ ε and η > 0 such that

    ‖x(s) - x̄(s)‖ ≤ η |s|^p  for all |s| < s_0.
From the mean value theorem it follows immediately that for a sufficiently
differentiable mapping F the classical continuation has order p = 1
while the tangent continuation has order p = 2. The constant η can be
given explicitly. For the classical continuation we set x̄(s) = x(0).
Lemma 4.22 For any continuously differentiable curve x : [-ε, ε] → R^n
we have

    ‖x(s) - x(0)‖ ≤ η |s|   with   η := max_{t∈[-ε,ε]} ‖x'(t)‖.

Proof. According to the Lagrange form of the mean value theorem it follows
that

    ‖x(s) - x(0)‖ = ‖∫_{τ=0}^{1} s x'(τs) dτ‖ ≤ |s| max_{t∈[-ε,ε]} ‖x'(t)‖.    □

For the tangent continuation x̄(s) = x_0 + s x'(0) we obtain analogously
the following statement:

Lemma 4.23 Let x : [-ε, ε] → R^n be a twice differentiable curve and
x̄(s) = x(0) + s x'(0). Then

    ‖x(s) - x̄(s)‖ ≤ η s²   with   η := (1/2) max_{t∈[-ε,ε]} ‖x''(t)‖.

Proof. As in the proof of Lemma 4.22 we have

    x(s) - x̄(s) = x(s) - x(0) - s x'(0) = ∫_{τ=0}^{1} (s x'(τs) - s x'(0)) dτ

and therefore

    ‖x(s) - x̄(s)‖ ≤ (1/2) s² max_{t∈[-ε,ε]} ‖x''(t)‖.    □
The following theorem connects a continuation method of order p as
predictor and Newton's method as corrector. It characterizes the maximal
feasible step size s_max for which Newton's method applied to x^0 := x̄(s)
with fixed parameter λ_0 + s converges.

Theorem 4.24 Let D ⊂ R^n be open and convex and let F : D × [a, b] →
R^n be a continuously differentiable parametrized system such that F_x(x, λ)
is invertible for all (x, λ) ∈ D × [a, b]. Furthermore, let an ω > 0 be given
such that F satisfies the Lipschitz condition

    ‖F_x(x, λ)^{-1}(F_x(x + sv, λ) - F_x(x, λ))v‖ ≤ s ω ‖v‖²

for all s ∈ [0,1], λ ∈ [a, b] and x, x + v ∈ D. Also let (x(s), λ_0 + s),
s ∈ [-ε, ε], be a continuously differentiable solution
curve around (x_0, λ_0), i.e.,

    F(x(s), λ_0 + s) = 0  and  x(0) = x_0,

and x̄(s) a continuation method (predictor) of order p with

    ‖x(s) - x̄(s)‖ ≤ η s^p  for all |s| ≤ ε.                             (4.23)

Then Newton's method (4.22) with starting point x^0 := x̄(s) converges
toward the solution x(s) of F(x, λ_0 + s) = 0, whenever

    s < s_max := min(ε, (2/(ωη))^{1/p}).                                 (4.24)

Proof. We must check the hypothesis of Theorem 4.10 for Newton's method
(4.22) and the starting point x^0 = x̄(s). According to the condition (4.23)
the following relation holds:

    ρ(s) = ‖x(s) - x^0‖ = ‖x(s) - x̄(s)‖ ≤ η s^p.

If we put this inequality into the convergence condition ρ < 2/ω of Theorem
4.10 we obtain the sufficient condition

    η s^p < 2/ω,

or, equivalently, (4.24).    □

This theorem guarantees that with the above-described continuation
methods, consisting of classical continuation (order p = 1) or tangent continuation
(order p = 2) as predictor and Newton's method as corrector, we
will succeed in following a solution curve as long as the step sizes are chosen
sufficiently small (depending on the problem). On the other hand, the
characterizing quantities ω and η are not known in general, and therefore
the formula (4.24) for s_max cannot be used in practice. Hence, we must develop
a corresponding strategy that exclusively uses information available
during the course of computation. Such a step-size control consists, first,
of an initial step-size guess s (most of the time the step size of the previous
continuation step), and second, of a strategy for choosing a smaller step size
in case Newton's method fails to converge for the starting point x̄(s). The
convergence of Newton's method is assessed with the natural monotonicity
test introduced in Section 4.2,

    ‖Δx̄^{k+1}‖ ≤ Θ ‖Δx^k‖   with   Θ := 1/2.                            (4.25)

Here Δx^k and Δx̄^{k+1} are the ordinary and the simplified Newton corrections
of Newton's method (4.22), i.e., with x^0 := x̄(s) and λ̄ := λ_0 + s:

    F_x(x^k, λ̄) Δx^k = -F(x^k, λ̄)   and   F_x(x^k, λ̄) Δx̄^{k+1} = -F(x^{k+1}, λ̄).

If we establish with the help of the criterion (4.25) that Newton's method
does not converge for the step size s, i.e., if

    ‖Δx̄^{k+1}‖ > Θ ‖Δx^k‖  for some k,

then we reduce this step size by a factor β < 1 and we perform again the
Newton iteration with the new step size

    s' := β · s,

i.e., with the new starting point x^0 = x̄(s') and the new parameter λ̄ :=
λ_0 + s'. This process is repeated until either the convergence criterion (4.25)
for Newton's method is satisfied or we get below a minimal step size s_min.
In the latter case we suspect that the assumptions on F are violated and
that we might be in a close neighborhood of a turning point or a bifurcation
point. On the other hand, we can choose a larger step size for the next step,

if Newton's method converges "too fast." This can also be seen from the
two Newton corrections. If

    ‖Δx̄^{k+1}‖ ≤ (1/4) ‖Δx^k‖,                                          (4.26)

then the method converges "too fast," and we can enlarge the step size for
the next predictor step by a factor β; i.e., we suggest the step size

    s' := s/β.

Here the choice

    β := 1/√2,

motivated by (4.24), is consistent with (4.25) and (4.26). The following
algorithm describes the tangent continuation from a solution (x_0, a) up to
the right endpoint λ = b of the parameter interval.
Algorithm 4.25 Tangent Continuation. The procedure newton(x̄, λ)
contains the (ordinary) Newton method (4.22) for the starting point x^0 = x̄
and fixed value of the parameter λ. The Boolean variable done specifies
whether the procedure has computed the solution accurately enough after
at most kmax steps. Besides this information and (if necessary) the solution
x the program will return the quotient

    Θ = ‖Δx̄^1‖ / ‖Δx^0‖

of the norms of the simplified and ordinary Newton corrections. The procedure
continuation realizes the continuation method with the step-size
control described above. Beginning with a starting point x̄ for the solution
of F(x, a) = 0 at the left endpoint λ = a of the parameter interval, the
program tries to follow the solution curve up to the right endpoint λ = b.
The program terminates if this is achieved, or if the step size s becomes too
small, or if the maximal number imax of computed solutions is exceeded.
function [done, x, Θ] = newton(x̄, λ)
    x := x̄;
    for k = 0 to kmax do
        A := F_x(x, λ);
        solve A Δx = -F(x, λ);
        x := x + Δx;
        solve A Δx̄ = -F(x, λ);          (use again the factorization of A)
        if k = 0 then
            Θ := ‖Δx̄‖/‖Δx‖;             (for the next predicted step size)
        end
        if ‖Δx̄‖ < tol then
            done := true;
            break;                       (solution found)
        end
        if ‖Δx̄‖ > Θ̄ ‖Δx‖ then
            done := false;
            break;                       (monotonicity violated)
        end
    end
    if k > kmax then
        done := false;                   (too many iterations)
    end
function continuation(x̄)
    λ_0 := a;
    [done, x_0, Θ] = newton(x̄, λ_0);
    if not done then
        poor starting point x̄ for F(x, a) = 0
    else
        s := s_0;                                  (starting step size)
        for i = 0 to imax do
            solve F_x(x_i, λ_i) x' = -F_λ(x_i, λ_i);
            repeat
                x̄ := x_i + s x';
                λ_{i+1} := λ_i + s;
                [done, x_{i+1}, Θ] = newton(x̄, λ_{i+1});
                if not done then
                    s := β s;
                elseif Θ < 1/4 then
                    s := s/β;
                end
                s := min(s, b - λ_{i+1});
            until s < s_min or done
            if not done then
                break;                             (algorithm breaks down)
            elseif λ_{i+1} = b then
                break;                             (terminated, solution x_{i+1})
            end
        end
    end

Remark 4.26 There is a significantly more efficient step-size control strategy,
which uses the fact that the quantities ω and η can be locally
approximated by quantities accessible from the algorithm. That strategy is
also well founded theoretically. Its description cannot be done within the
frame of this introduction; for details we refer to [20, 18].
We want to describe yet another variant of the tangent continuation
because it fits well into the context of Chapter 3 and Section 4.3. It allows
at the same time dealing with turning points (x, λ) with

    rank F'(x, λ) = n   and   F_x(x, λ) singular.

In a small neighborhood of such a point the automatically chosen step sizes
s of the continuation method described above become arbitrarily small,
because the solution curve around (x, λ) can no longer be parametrized
with respect to λ. We overcome this difficulty by giving up the "special
role" of the parameter λ and consider instead directly the underdetermined

nonlinear system in y = (x, λ),

    F(y) = 0  with  F : D ⊂ R^{n+1} → R^n.

We assume again that the Jacobian F'(y) of this system has full rank for all
y ∈ D. Then for each solution y_0 ∈ D there is a neighborhood U ⊂ R^{n+1}
and a differentiable curve y : ]-ε, ε[ → D characterizing the solution set
S := {y ∈ D | F(y) = 0} around y_0, i.e.,

    S ∩ U = {y(s) | s ∈ ]-ε, ε[}.

If we differentiate the equation F(y(s)) = 0 with respect to s at s = 0, it
follows that

    F'(y(0)) y'(0) = 0;                                                  (4.27)

i.e., the tangent y'(0) to the solution curve spans exactly the nullspace of
the Jacobian F'(y_0). Since F'(y_0) has maximal rank, the tangent through
(4.27) is uniquely determined up to a scalar factor. Therefore we define for
all y ∈ D the normalized tangent t(y) ∈ R^{n+1} by

    F'(y) t(y) = 0  and  ‖t(y)‖₂ = 1,

which is uniquely determined up to its orientation (i.e., up to a factor ±1).
We choose the orientation of the tangent during the continuation process
such that two successive tangents t_0 = t(y_0) and t_1 = t(y_1) form an acute
angle, i.e.,

    ⟨t_0, t_1⟩ > 0.
This guarantees that we are not going backward on the solution curve.
With it we can also define tangent continuation for turning points (see
Figure 4.8) by

    ŷ = ŷ(s) := y_0 + s t(y_0).

Beginning with the starting vector y^0 = ŷ, we want to find y(s) on the
curve "as fast as possible." The vague expression "as fast as possible"
can be interpreted geometrically as "almost orthogonal" to the tangent at
a nearby point y(s) on the curve. However, since the tangent t(y(s)) is
at our disposal only after computing y(s), we substitute t(y(s)) with the
best approximation available at the present time, t(y^k). According to the
geometric interpretation of the pseudo-inverse (cf. Section 3.3) this leads
to the iterative scheme

    y^{k+1} = y^k + Δy^k   with   Δy^k := -F'(y^k)^+ F(y^k).            (4.28)

The iterative scheme (4.28) is obviously a Gauss-Newton method for the
underdetermined system F(y) = 0. We mention without proof that if F'(y)
has maximal rank then this method is quadratically convergent in a neighborhood
of the solution curve, the same as the ordinary Newton method.
The proof can be found in [20].


Figure 4.8. Tangent continuation through turning points.

Here we want to examine the computation of the correction Δy^k. We
will drop the index k. The correction Δy in (4.28) is the shortest solution
in the solution set Z(y) of the underdetermined linear problem

    Z(y) := {z ∈ R^{n+1} | F'(y) z + F(y) = 0}.

By applying a Gaussian elimination (with row pivoting and possible column
exchange, cf. Section 1.3) or a QR-factorization (with possible column permutation,
cf. Section 3.2.2) we succeed relatively easily in computing some
solution z ∈ Z(y) as well as a nullspace vector t(y) with

    F'(y) t(y) = 0.

Then the following equation holds:

    Δy = -F'(y)^+ F(y) = F'(y)^+ F'(y) z.

As we have seen in Section 3.3, P = F'(y)^+ F'(y) is the projection onto the
orthogonal complement of the nullspace of F'(y) and therefore

    P = I - t t^T / (t^T t).

For the correction Δy it follows that

    Δy = (I - t t^T / (t^T t)) z = z - (⟨t, z⟩ / ⟨t, t⟩) t.

With this we have a simple computational scheme for the pseudo-inverse
(with rank defect 1) provided we only have some solution z and nullspace
vector t at our disposal. The Gauss-Newton method given in (4.28) is also
easily implementable in close interplay with tangent continuation.
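A small sketch of this computational scheme in Python; the way z and t are obtained here, via a square subsystem and the SVD, is only one possible choice and an assumption of this sketch:

    import numpy as np

    def correction_from_z_and_t(z, t):
        # Delta y = z - (t, z)/(t, t) * t: projection of a particular solution z
        # onto the orthogonal complement of the nullspace vector t of F'(y)
        return z - (t @ z) / (t @ t) * t

    def particular_solution_and_tangent(J, Fy):
        # J is the n x (n+1) Jacobian F'(y), Fy = F(y); assumes the leading
        # n x n block of J is invertible
        n = J.shape[0]
        z = np.append(np.linalg.solve(J[:, :n], -Fy), 0.0)  # some solution of J w = -Fy
        t = np.linalg.svd(J)[2][-1]                          # spans the nullspace of J
        return z, t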
For a step-size strategy we realize a strategy similar to the one described
in Algorithm 4.25. If the iterative method does not converge, then we reduce
the step length s by a factor β = 1/√2. If the iterative method converges
"too fast" we enlarge the step size for the next predictor step by a factor
β^{-1}. This empirical continuation method is comparatively effective even in
rather complex problems.

Remark 4.27 For this tangent continuation method there is also a theo-
retically based, more effective step-size control, the description of which can
be found in [23]. Additionally, one may apply approximations of the exact
Jacobian F'(y). Extremely effective programs for parametrized systems are
working on this basis (see Figures 4.9 and 4.10).
Remark 4.28 The description of the solutions of the parametrized system
(4.19) is also called parameter study. At the same time parametrized
systems are used for enlarging the convergence domain of a method for
solving nonlinear systems. The idea is to work our way, step by step, from
a previously solved problem

    G(x) = 0

to the actual problem

    F(x) = 0.

For this we construct a parametrized problem

    H(x, λ) = 0,   λ ∈ [0, 1],

that connects the two problems:

    H(x, 0) = G(x)  and  H(x, 1) = F(x)  for all x.

Such a mapping H is called an embedding of the problem F(x) = 0, or a
homotopy. The simplest example is the standard embedding,

    H(x, λ) := λ F(x) + (1 - λ) G(x).

Problem-specific embeddings are certainly preferable (see Example 4.29). If
we apply a continuation method to this parametrized problem H(x, λ) = 0,
where we start with a known solution x_0 of G(x) = 0, then we obtain a
homotopy method for solving F(x) = 0.
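The standard embedding itself is a one-liner; feeding it into a continuation routine in λ from 0 to 1 yields the homotopy method (sketch, assuming F and G are given as vector-valued functions):

    import numpy as np

    def standard_embedding(F, G):
        # H(x, lam) = lam*F(x) + (1 - lam)*G(x)
        return lambda x, lam: lam * np.asarray(F(x)) + (1.0 - lam) * np.asarray(G(x))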

Example 4.29 Continuation for different embeddings. The following
problem is given in [38]:

    F(x) := x - φ(x) = 0,

where

    φ_i(x) = exp(cos(i · Σ_{j=1}^{10} x_j)),   i = 1, ..., 10.

There the trivial embedding

    H(x, λ) = λ F(x) + (1 - λ) x = x - λ φ(x)

with starting point x^0 = (0, ..., 0) at λ = 0 is suggested. The continuation
with respect to λ leads indeed for λ = 1 to the solution (see Figure 4.9,

Figure 4.9. Comparison of continuation methods; depicted is x_9(λ).
Left: Trivial embedding. Right: Problem-specific embedding.

left). The more problem-specific embedding, however,

    H_i(x, λ) := x_i - exp(λ · cos(i · Σ_{j=1}^{10} x_j)),   i = 1, ..., 10,

with starting point x^0 = (0, ..., 0) at λ = 0 is clearly advantageous (see
Figure 4.9, right). Note that there are no bifurcations in this example. The
intersections of the solution curves appear only in the projection onto the
coordinate plane (x_9, λ). The points on both solution branches mark the
intermediate values selected automatically by the program: their number
is a measure for the computing cost required to go from λ = 0 to λ = 1.

The above example has a merely illustrative character. It can be easily
transformed into a purely scalar problem and solved as such (Exercise 4.6).
Therefore we add another, more interesting problem.
Therefore we add another more interesting problem.

Example 4.30 Brusselator. In [69] a chemical reaction-diffusion equation
is considered as a discrete model, where two chemical species with concentrations
z = (x, y) in several cells react with each other according to the
rule

    ż = (ẋ, ẏ)^T = (A - (B + 1)x + x²y, Bx - x²y)^T =: f(z).

Diffusion appears from coupling with neighboring cells. If one considers
only solutions that are constant in time and diffusion is parametrized by

λ, then the following nonlinear system is obtained:

    0 = f(z_i) + (1/λ²) Σ_{(i,j)} D(z_j - z_i),   i = 1, ..., k.

Here D = diag(1, 10) is a (2, 2) diagonal matrix. Because the equations


reflect the symmetry of the geometrical arrangement of the cells, a rich
set of bifurcations appears (see Figure 4.10), which is analyzed in [34] by
exploiting the symmetry of the system in combination with methods from
symbolic computation.

Figure 4.10. Brusselator with four cells in a linear chain (A = 2, B = 6); depicted
is x_8(λ).

Exercises
Exercise 4.1 Explain the different convergence behavior of the two fixed-
point iterations described in Section 4.1 for the solution of
    f(x) = 2x - tan x = 0.
Analyze the speed of convergence of the second method.
Exercise 4.2 In order to determine a fixed point x* of a continuously
differentiable mapping φ with |φ'(x)| ≠ 1 let us define the following iterative
procedures for k = 0, 1, ...:

(I)  x_{k+1} := φ(x_k),

(II) x_{k+1} := φ^{-1}(x_k).


Show that at least one of the two iterations is locally convergent.

Exercise 4.3 Let f ∈ C¹[a, b] be a function having a simple root x* ∈
[a, b], and let p(x) be the uniquely determined quadratic interpolation
polynomial through the three nodes

    (a, f_a), (c, f_c), (b, f_b),   with a < c < b,  f_a f_b < 0.

(a) Show that p has exactly one simple zero y in [a, b].

(b) Given a formal procedure

    y = y(a, b, c, f_a, f_b, f_c)

that computes the zero of p in [a, b], construct an algorithm for
evaluating x* with a prescribed precision eps.
Exercise 4.4 In order to accelerate the convergence of a linearly convergent
fixed-point method in R¹,

    x_{i+1} := φ(x_i),   x_0 given,  x* fixed point,

we can use the so-called Δ²-method of Aitken. This consists of computing
from the sequence {x_i} a transformed sequence {x̄_i},

    x̄_i := x_i - (Δx_i)² / Δ²x_i,

where Δ is the difference operator Δx_i := x_{i+1} - x_i.

(a) Show that if the sequence {x_i}, with x_i ≠ x*, satisfies

    x_{i+1} - x* = (κ + δ_i)(x_i - x*),

where |κ| < 1 and {δ_i} is a null sequence, lim_{i→∞} δ_i = 0, then the sequence
{x̄_i} is well-defined for sufficiently large i and has the property that

    lim_{i→∞} (x̄_i - x*)/(x_i - x*) = 0.

(b) For implementing the method one computes only x_0, x_1, x_2 and x̄_0
and then one starts the iteration with the improved starting point x̄_0
(Steffensen's method). Try this method on our trusted example

    φ_1(x) := (tan x)/2  and  φ_2(x) := arctan 2x

with starting point x_0 = 1.2.
Exercise 4.5 Compute the solution of the nonlinear least-squares problem
arising in Feulgen hydrolysis by the ordinary Gauss-Newton method (from
a software package or written by yourself) for the data from Table 4.3 and
the starting points given there.
Hint: In this special case the ordinary Gauss-Newton method converges
faster than the damped method (cf. Figure 4.3).

Exercise 4.6 Compute the solution of F(x) = x - φ(x) = 0 with

    φ_i(x) := exp(cos(i · Σ_{j=1}^{10} x_j)),   i = 1, ..., 10,

by first setting up an equation for u = Σ_{j=1}^{10} x_j and then solving it.


Exercise 4.7 Let there be given a function F : D → R^n, D ⊂ R^n, F ∈
C²(D). We consider the approximation of the Jacobian J(x) = F'(x) by
divided differences.
In order to obtain a sufficiently good approximation of the Jacobian we
compute the quantity
and require that
where eps is the relative machine precision. Show that

    κ(η) ≤ c₁ η + c₂ η²  for η → 0.

Specify a rule that provides an estimate for η̂ in case κ(η) ≪ κ₀. Why is a
corresponding estimate for κ(η) ≫ κ₀ not a useful result?
Exercise 4.8 The zeros of P_n(z) := z^n - 1 for n even have to be determined
with the complex Newton method. Let us define

    L(θ) := {t e^{2πiθ/n} | t ∈ R},   θ ∈ [0, n[.

(a) Show that

    z_k ∈ L(θ)  ⟹  z_{k+1} ∈ L(θ).

(b) Prepare a sketch describing the convergence behavior. Compute

    K(θ) := L(θ) ∩ {z : |Φ'(z)| < 1}

and all fixed points of Φ in L(θ) for

    θ = 0, 1, ..., n - 1   and   θ = 1/2, 3/2, ..., n - 1/2.

Exercise 4.9 Consider the system of n = 10 nonlinear equations (with
s := Σ_{i=1}^{10} x_i)

    F(x, λ) := ( x_1 + x_4 - 3
                 2x_1 + x_2 + x_4 + x_7 + x_8 + x_9 + 2x_{10} - λ
                 2x_2 + 2x_5 + x_6 + x_7 - 8
                 2x_3 + x_9 - 4λ
                 x_1 x_5 - 0.193 x_2 x_4
                 x_6² x_1 - 0.67444 · 10⁻⁵ x_2 x_4 s
                 x_7² x_4 - 0.1189 · 10⁻⁴ x_1 x_2 s
                 x_8 x_4 - 0.1799 · 10⁻⁴ x_1 s
                 (x_9 x_4)² - 0.4644 · 10⁻⁷ x_1² x_3 s
                 x_{10} x_4² - 0.3846 · 10⁻⁴ x_1² s ) = 0.

It describes a chemical equilibrium (for propane). All solutions of interest
must be nonnegative because they are interpreted as chemical
concentrations.

(a) Show that we must have λ ≥ 3 if x_i ≥ 0, i = 1, ..., n. Compute (by
hand) the special degenerate case λ = 3.

(b) Write a program for a continuation method with ordinary Newton's
method as local iterative procedure and empirical step-size strategy.

(c) Test the program on the above example with λ > 3.
Exercise 4.10 Prove the following theorem: Let D ⊂ R^n be open and
convex and let F : D ⊂ R^n → R^n be differentiable. Suppose there exists
a solution x* ∈ D such that F'(x*) is invertible. Assume further that the
following (affine-invariant) Lipschitz condition is satisfied for all x, y ∈ D:

    ‖F'(x*)^{-1}(F'(y) - F'(x))‖ ≤ ω* ‖y - x‖.

Let ρ := ‖x^0 - x*‖ < 2/(3ω*) and B_ρ(x*) ⊂ D. Then it follows that: The
sequence {x^k} defined by the ordinary Newton method stays in B_ρ(x*) and
converges toward x*. Moreover x* is the unique solution in B_ρ(x*).
Hint: Use the perturbation lemma, based on the Neumann series, for the
Jacobian F'(x).
Exercise 4.11 Consider solving the quadratic equation

    x² - 2px + q = 0   with p² - q ≥ 0 and q = 0.123451234.

Compute for p ∈ {1, 10, 10², ...} the two solutions

    x_1 = x̂_1 = p + √(p² - q),
    x_2 = p - √(p² - q),   x̂_2 = q/x̂_1.

Write down the results in a table and underline each time the correct
figures.

Exercise 4.12 The principle used in deriving the Levenberg-Marquardt
method

    x^{k+1} = x^k + Δz^k,   k = 0, 1, ...,

for solving nonlinear systems is not affine-invariant. This shortcoming
is naturally inherited also by the method itself. An affine-invariant
modification reads: Minimize

under the constraints

What is the resulting method?


5
Linear Eigenvalue Problems

The following chapter is dedicated to the study of the numerical eigenvalue


problem of linear algebra

    Ax = λx,

where A is a square matrix of order n and x is an eigenvector corresponding
to the eigenvalue λ ∈ C. For an elementary introduction to
applied linear algebra we recommend the well-written and extremely stim-
ulating textbook [61] of C. D. Meyer. As for the numerical aspects of linear
algebra, the classical textbook [41] by G. H. Golub and C. van Loan has
set a standard over the years.
Apart from general matrices we are also interested in the following special
matrices:

• A symmetric: all eigenvalues are real; in science and engineering this


type of eigenvalue problem occurs most frequently.

• A symmetric positive definite (or semidefinite): all eigenvalues are


positive (or nonnegative); corresponding to this eigenvalue problem
is usually a minimization problem of the form

    x^T A x + b^T x = min,

where we do not further specify the vector b here.



• A stochastic: all entries of such matrices can be interpreted as


probabilities, which shows up in the relations
n

aij 2: 0, L aij = 1.
j=1

In this case there exists a Perron eigenvalue Al equal to the spectral


radius p(A) = 1; the corresponding left and right eigenvectors are
positive up to some common phase factor for all components; such
eigenvalue problems play an increasing role within cluster analysis.
In what follows we begin with a condition analysis for the general eigen-
value problem (Section 5.1). As we will see, the eigenvalue problem is
well-conditioned with guarantee only for normal matrices, whose most im-
portant subclass are the real symmetric matrices. For this reason we first
treat algorithms for the numerical computation of eigenvalues and eigen-
vectors for this special case (Sections 5.2 and 5.3). For general matrices
the problem of singular value decomposition is well-conditioned and at the
same time of utmost practical relevance-see Section 5.4. In recent years
eigenvalue problems for stochastic matrices have played an increasing role,
which is why we turn to this problem class in the last Section 5.5.

5.1 Condition of General Eigenvalue Problems


We start with determining the condition of the eigenvalue problem

    Ax = λx

for an arbitrary complex matrix A ∈ Mat_n(C). For the sake of simplicity we
assume that λ_0 is an (algebraically) simple eigenvalue of A, i.e., a simple
zero of the characteristic polynomial χ_A(λ) = det(A - λI). Under these
assumptions λ is differentiable in A, as will be seen in the following lemma.

Lemma 5.1 Let λ_0 ∈ C be a simple eigenvalue of A ∈ Mat_n(C). Then
there is a continuously differentiable mapping

    λ : V ⊂ Mat_n(C) → C,   B ↦ λ(B),

from a neighborhood V of A in Mat_n(C) such that λ(A) = λ_0 and λ(B) is
a simple eigenvalue of B for all B ∈ V. If x_0 is an eigenvector of A for λ_0,
and y_0 an (adjoint) eigenvector of A* := Ā^T for the eigenvalue λ̄_0, i.e.,

    A x_0 = λ_0 x_0   and   A* y_0 = λ̄_0 y_0,

then the derivative of λ at A satisfies

    λ'(A) C = ⟨C x_0, y_0⟩ / ⟨x_0, y_0⟩   for all C ∈ Mat_n(C).

Proof. Let C ∈ Mat_n(C) be an arbitrary complex matrix. Because λ_0 is a
simple zero of the characteristic polynomial χ_A, we have

    0 ≠ χ'_A(λ_0) = (∂/∂λ) χ_{A+tC}(λ)|_{t=0}.

According to the implicit function theorem there is a neighborhood of the
origin ]-ε, ε[ ⊂ R and a continuously differentiable mapping

    λ : ]-ε, ε[ → C,   t ↦ λ(t),

such that λ(0) = λ_0 and λ(t) is a simple eigenvalue of A + tC. Using
again the fact that λ_0 is simple we deduce the existence of a continuously
differentiable function

    x : ]-ε, ε[ → C^n,   t ↦ x(t),

such that x(0) = x_0 and x(t) is an eigenvector of A + tC for the eigenvalue
λ(t); x(t) can be explicitly computed with adjoint determinants, see
Exercise 5.2. If we differentiate the equation

    (A + tC) x(t) = λ(t) x(t)

with respect to t at t = 0, then it follows that

    C x_0 + A x'(0) = λ_0 x'(0) + λ'(0) x_0.

If we multiply from the right by y_0 (in the sense of the scalar product),
then we obtain

    ⟨C x_0, y_0⟩ + ⟨A x'(0), y_0⟩ = ⟨λ_0 x'(0), y_0⟩ + ⟨λ'(0) x_0, y_0⟩.

As ⟨λ'(0) x_0, y_0⟩ = λ'(0) ⟨x_0, y_0⟩ and

    ⟨A x'(0), y_0⟩ = ⟨x'(0), A* y_0⟩ = λ_0 ⟨x'(0), y_0⟩ = ⟨λ_0 x'(0), y_0⟩,

it follows that

    λ'(0) = ⟨C x_0, y_0⟩ / ⟨x_0, y_0⟩.

Hence we have computed the derivative of λ in the direction of the matrix
C. The continuous differentiability of the directional derivative implies the
differentiability of λ with respect to A and

    λ'(A) C = λ'(0) = ⟨C x_0, y_0⟩ / ⟨x_0, y_0⟩

for all C ∈ Mat_n(C).    □

To compute the condition of the eigenvalue problem (A, λ_0) we must
calculate the norm of λ'(A) as a linear mapping,

    λ'(A) : Mat_n(C) → C,   C ↦ ⟨C x, y⟩ / ⟨x, y⟩,

where x is an eigenvector for the simple eigenvalue λ_0 of A and y is an
adjoint eigenvector for the eigenvalue λ̄_0 of A*. On Mat_n(C) we choose the
matrix norm induced by the Euclidean vector norm, and on C the absolute
value. For each matrix C ∈ Mat_n(C) we have (the Cauchy-Schwarz
inequality)

    |⟨C x, y⟩| ≤ ‖C x‖ ‖y‖ ≤ ‖C‖ ‖x‖ ‖y‖,

where we have equality for C = y x*, x* := x̄^T. Since ‖y x*‖ = ‖x‖ ‖y‖, it
follows that

    ‖λ'(A)‖ = sup_{C≠0} |⟨C x, y⟩ / ⟨x, y⟩| / ‖C‖ = ‖x‖ ‖y‖ / |⟨x, y⟩| = 1 / |cos(∠(x, y))|,

where ∠(x, y) is the angle between the eigenvector x and the adjoint
eigenvector y. For normal matrices each eigenvector x is also an adjoint
eigenvector, i.e., A*x = λ̄ x, and therefore ‖λ'(A)‖ = 1. We can summarize
our results as follows:
Theorem 5.2 The absolute condition number of determining a simple eigenvalue
λ_0 of a matrix A ∈ Mat_n(C) with respect to the 2-norm is

    κ_abs = ‖λ'(A)‖ = ‖x‖ ‖y‖ / |⟨x, y⟩| = 1 / |cos(∠(x, y))|

and the relative condition number is

    κ_rel = (‖A‖ / |λ_0|) ‖λ'(A)‖ = ‖A‖ / |λ_0 cos(∠(x, y))|,

where x is an eigenvector of A for the eigenvalue λ_0, i.e., Ax = λ_0 x, and y
an adjoint eigenvector, i.e., A*y = λ̄_0 y. In particular for normal matrices
the eigenvalue problem is well-conditioned with κ_abs = 1.
Example 5.3 If A is not symmetric then the eigenvalue problem is in
general not well-conditioned anymore. As an example let us examine the
matrices

    A = ( 0  1 ),    Ã = ( 0  1 ),
        ( 0  0 )         ( δ  0 )

with the eigenvalues λ_1 = λ_2 = 0 and λ̃_{1,2} = ±√δ. For the condition of the
eigenvalue problem (A, λ_1) we have

    κ_abs ≥ |λ̃_1 - λ_1| / ‖Ã - A‖₂ = √δ / δ = 1/√δ → ∞  for δ → 0.

The computation of the eigenvalue λ = 0 of A is therefore an ill-posed
problem (with respect to the absolute error).
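This ill-conditioning is easy to observe numerically (illustrative snippet; the eigenvalues move by √δ although the matrix is perturbed only by δ):

    import numpy as np

    for delta in [1e-4, 1e-8, 1e-12]:
        A_pert = np.array([[0.0, 1.0], [delta, 0.0]])
        print(delta, np.linalg.eigvals(A_pert))   # approximately +/- sqrt(delta)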
A more precise perturbation analysis for (multiple) eigenvalues and eigen-
vectors for general nonsymmetric matrices (and operators) can be found in
the book of T. Kato [54].

Without going into depth we just want to state the following: For multiple
eigenvalues or already for nearby eigenvalue clusters the computation of
single eigenvectors is ill-conditioned, but not the computation of orthogonal
bases of the corresponding eigenspace.
For the well-conditioned real symmetric eigenvalue problem one could
first think of setting up the characteristic polynomial and subsequently
determining its zeros. Unfortunately the information on eigenvalues "dis-
appears" once the characteristic polynomial is treated in coefficient
representation. According to Section 2.2 the reverse problem is also
ill-conditioned.
Example 5.4 J. H. Wilkinson [87] has given the polynomial
P()..) = ().. - 1)··· ().. - 20) E P20

as a cautionary example. If we perform the multiplication in this root repre-


sentation then the resulting coefficients have orders of magnitude between
1 (coefficient of )..20) and 10 20 (the constant term is, e.g., 20!). Now let us
perturb the coefficient of )..19 (which has an order of magnitude of 10 3 ) by
the very small value E := 2- 23 ~ 10- 7 . In Table 5.1 we have entered the

Table 5.1. Exact zeros of the polynomial P(>\) for E := 2- 23 .

1.000 000 000 10.095 266 145 ± 0.643 500 904i


2.000 000 000 11.793633881 ± 1.652329 728i
3.000 000 000 13.992358 137 ± 2.518830 070i
4.000000000 16.730737466 ± 2.812624 894i
4.999999928 19.502439400 ± 1.940330 347i
6.000 006 944 20.846 908 101
6.999 697 234
8.007267603
8.917 250 249

exact zeros of the perturbed polynomial


F()..) = P()..) - E)..19

In spite of the extremely small perturbation the errors are considerable. In


particular five pairs of zeros are complex.

5.2 Power Method


Computing the eigenvalues of a matrix A E Matn(R) as zeros of the char-
acteristic polynomial XA()..) = det(A - )..1), may be reasonable only for
n = 2. Here we will develop direct computational methods for determining
124 5. Linear Eigenvalue Problems

eigenvalues and eigenvectors. The simplest possibility is the power method,


and we will discuss in what follows both of its variations, the direct and
the inverse power method.
The (direct) power method introduced by R. von Mises is based on the
following idea: we iterate the mapping given by the matrix A E Matn(R)
and define a sequence {Xdk=O,l, ... for an arbitrary starting point Xo ERn
by

Xk+l := AXk for k = 0, 1, ....

If a simple eigenvalue A of A is strictly greater in absolute value than all


other eigenvalues of A, then we can suspect that A "asserts" itself against
all other eigenvalues during the iteration and Xk converges toward an eigen-
vector of A corresponding to the eigenvalue A. This suspicion is confirmed
by the following theorem. For the sake of simplicity we limit ourselves here
to symmetric matrices, for which according to Theorem 5.2 the eigenvalue
problem is well-conditioned.

Theorem 5.5 Let Al be a simple eigenvalue of the symmetric matrix A E


Matn(R) that is strictly greater in absolute value than all other eigenvalues
of A, i.e.,

Furthermore let Xo E Rn be a vector that is not orthogonal to the eigenspace


of A1. Then the sequence Yk := Xk/))Xk)) with Xk+1 = AXk converges
toward a normalized eigenvector of A corresponding to the eigenvalue A1.

Proof. Let 1]1, ... ,1]n be an orthonormal basis of eigenvectors of A with


A1]i = Ai1]i. Then Xo = 2:::7=1 ai1]i with a1 = (Xo, 1]1) =1= O. Consequently

xk = A k Xo =~
~ ai\k1]i = a1 Ak1 ( 1]1 + ~ ai (Ai)
~;- :\ k 1]i ) .
i=l i=2 1 1
'----,v '
=: Zk

Because )Ai) < )A1) we have limk->oo Zk 1]1 for all 2, ... , n, and
therefore

o
The direct power method has several disadvantages: On one hand, we
obtain only the eigenvector corresponding to the largest eigenvalue (in ab-
solute value) Al of A; on the other hand, the speed of convergence depends
on the quotient )A2/ A1)' Hence, if the absolute values of the eigenvalues A1
and A2 are close, then the direct power method converges rather slowly.
5.2. Power Method 125

The disadvantages of the direct power method described above are


avoided by the inverse power method developed 1944 by H. Wielandt (for
reference see the nice historical survey paper by 1. C. F. Ipsen [53]). As-
suming that we had an estimated value >- ;: : ; Ai of an arbitrary eigenvalue
Ai of the matrix A at our disposal such that
(5.1)
Then (>- - Ai)-1 is the largest eigenvalue of the matrix (A - >-1)-1. Conse-
quently we apply the power method for this matrix. This idea delivers the
iterative scheme
(A - >-1)xk+l = Xk for k = 0,1, .... (5.2)
This is called the inverse power method. One should be aware of the fact
that at each iteration one has to solve the linear system (5.2) for different
right-hand sides Xk. Therefore one has to factor the matrix A - >-1 only
once (e.g., by LR-factorization). According to Theorem 5.5 the sequence
Yk := xk/llxkll converges under assumption (5.1) for k ----> 00 toward a
normalized eigenvector of A corresponding to the eigenvalue Ai, provided
the starting vector Xo is not orthogonal to the eigenvector T}i of eigenvalue
Ai. Its convergence factor is

max Ai -
1

#i Aj - A
< l. ~ 1

If >- is a particularly good estimate of Ai, then

A' ---
- >-1 «1 for all j #- i ,
1-"
Aj - A
so that the method converges very rapidly in this case. Thus with appro-
priate choice of >- this method can be used with nearly arbitrary starting
vector Xo in order to pick out individual eigenvalues and eigenvectors. For
an improvement of this method see Exercise 5.3.
Remark 5.6 Note that the matrix A - >-1 is almost singular for "well-
chosen" >- ;::::; Ai. In the following this poses no numerical difficulties because
we want to find only the directions of the eigenvectors whose calculation is
well-conditioned (cf. Example 2.33).
Example 5.7 Let us examine for example the 2 x 2-matrix

A:= ( -1 3)
-2 4 .

Its eigenvalues are Al = 1 and A2 = 2. If we take as starting point an


approximation >- = 1 - c: of Al with 0 < c: « 1, then the matrix

A- >-1 = ( -2+c:
-2
126 5. Linear Eigenvalue Problems

is almost singular and

(A _ 5..I) -1 = _1_ ( 3 + c:
c: 2 +c: 2
-3
-2+c: ).
Because the factor 1/(c: 2 + c:) simplifies through normalization, the compu-
tation of the direction of a solution x of (A - 5..I)x = b is well-conditioned.
This can be also read from the componentwise relative condition number

III(A - 5..I)-lll blll oo


~c = Ilxll ao
corresponding to perturbations of the right-hand side. For example if b :=
(l,O)T we get

x = (A - 5..I)- l b = I(A _ 5..I)- l ll bl = 1 (3 +


c:(c:+1) 2
c:),

and hence ~c ((A - 5..I)-1, b) = l.


Actually in programs for a (genuinely) singular matrix A - 5..[ a pivot
element c: = 0 is substituted by the relative machine precision eps and thus
the inverse power method is performed only for nearly singular matrices.
(cf. [89]).

5.3 QR-Algorithm for Symmetric Eigenvalue


Problems
As described in Section 5.1 the eigenvalue problem for symmetric matrices
is well-conditioned. In the present section we are interested in the question
of how to compute simultaneously all the eigenvalues of a real symmetric
matrix A E Matn(R) in an efficient way. We know that A has only real
eigenvalues AI,"" An E R and that there exists an orthonormal basis
fJ1, ... , fJn ERn of eigenvectors AfJi = AifJi' i.e.,

QT AQ = A = diag (AI, ... , An) with Q = [fJ1,"" fJnJ E O(n). (5.3)


The first idea that comes to one's mind would be to determine Q in finitely
many steps. Because the eigenvalues are zeros of the characteristic polyno-
mial this would also give us a finite procedure for determining the zeros of
polynomials of arbitrary degree (in case of symmetric matrices only with
real roots). This would be in conflict with Abel's theorem, which says that
in general such a procedure (based on the operations +, -,', / and radicals)
does not exist.
The second idea, suggested by (5.3) is to bring A closer to diagonal form
by a similarity transformation (conjugation), e.g., via orthogonal matrices,
because the eigenvalues are invariant under similarity transformations. If
5.3. QR-Algorithm for Symmetric Eigenvalue Problems 127

we try to bring a symmetric matrix A to diagonal form by conjugation with


Householder matrices, then we realize quickly that this is impossible.
0 0
* * * * * *
Q1·
--+
0 ·Qf
--+ * *
0
* * * * * * *
What is done by multiplying with a Householder matrix from the left is
undone by the multiplication from the right. Things look different if we only
want to bring A to tridiagonal form. Here the Householder transforms from
the left and right do not disturb each other:

* * * * * * * 0 0

* . pT
* * *
Pl· 1
--+ 0 --+ 0

0
* * 0
* * * *
(5.4)
We formulate this insight as a lemma.
Lemma 5.8 Let A E Matn(R) be symmetric. Then there is an orthogonal
matrix P E O(n), which is a product ofn - 2 Householder reflections such
that P ApT is tridiagonal.

Proof. We iterate the process shown in (5.4), and we obtain Householder


reflections PI, ... ,Pn - 2 such that

* *
Pn - 2 ... PI A pi· ..
P'L2 = *
'---....-----" ' - . . . - '
=P =pT *
* *
o
With this we have transformed our problem to finding the eigenvalues of a
symmetric tridiagonal matrix. Therefore we need an algorithm for this spe-
cial case. The idea of the following algorithm goes back to H. Rutishauser.
He had first tried to find out what happens when the factors of the LR-
factorization of a matrix A = LR are interchanged according to A' = RL
and this process is recursively iterated. It turned out that in many cases the
matrices constructed in this way converged toward the diagonal matrix A of
the eigenvalues. The QR-algorithm that goes back to J. G. F. Francis (1959)
[33] and V. N. Kublanovskaja (1961) [56] employs the QR-factorization in-
128 5. Linear Eigenvalue Problems

stead of the of the LR-factorization. This factorization always exists (no


permutation is necessary) and above all it is inherently stable, as seen in
Section 3.2. Therefore we define a sequence {A k h=1,2, ... of matrices by
(a) Al A
(b) Ak QkRk, QR-factorization (5.5)
(c) Ak+l RkQk.

Lemma 5.9 The matrices Ak have the following properties:


(i) The matrices Ak are all similar to A.
(ii) If A is symmetric, then so are all A k .
(iii) If A is symmetric and tridiagonal, then so are all A k .

Proof. (i) Let A = QR and AI = RQ. Then


QAIQT = QRQQT = QR = A.
(ii) The transformations of the form A ----+ BT AB, B E GL(n), represent
a change of basis for bilinear forms and therefore are symmetric. In partic-
ular this holds for orthogonal similarity transformations. This follows also
directly from
(Alf = (AlfQTQ = QTRTQTQ = QT ATQ = QT AQ = AI.
(iii) Let A be symmetric and tridiagonal. We realize Q with n - 1 Givens
rotations 0 12 , ... , On-l,n, so that QT = On-l,n··· 0 12 (181 to eliminate, EEl
fill- in entry).

* * EEl
* * * * EEl
181 * *
181 * *
EEl
181 * *
181 * *
A ----+ R=QTA
*

* * EEl * * EEl
* * E9 * * EEl
*
----+
EEl E9

* *
R
* *
AI = RQ = QTAQ
*
5.3. QR-Algorithm for Symmetric Eigenvalue Problems 129

According to (ii) A' must be symmetric and therefore all EEl entries in A'
vanish. Hence A' is also tridiagonal. 0

We show the convergence properties only for the simple case when the
absolute value of the eigenvalues of A are distinct.

Theorem 5.10 Let A E Matn(R) be symmetric with eigenvalues )'1,"" An


such that

(k)
and let A k , Qk, Rk be defined as in (5.5), with Ak (a ij ). Then the
following statements hold:
(a) lim Qk I,
k-+oo
(b) lim Rk A,
k-+DO
(c)
(k)
a·2,J. 0(1 ~; Ik) for i > j.

Proof. The proof given here goes back to J. H. Wilkinson [88]. We show
first that
Ak = Ql ... Qk Rk ... Rl for k = 1,2, ...
'--v--'~
=: Pk =: Uk
The assertion is clear for k = 1 because A = Al = QlR l .
On the other hand from the construction of Ak it follows that

Ak+l = Qk+lRk+l = Qr ... Qr AQI ... Qk = Pk- l APk


and from it the induction step

A k+l = AAk = APkUk = PkQk+lRk+lUk = Pk+lUk+l .


Because Pk E O(n) is orthogonal and Uk upper triangular, we can express
the QR-factorization Ak = PkUk of Ak through the QR-factorization of
AI, .. . , A k . Further

We assume for the sake of simplicity that Q has an LR-factorization

Q=LR,
where L is a unit lower triangular matrix and R an upper triangular
matrix. We can always achieve this by conjugating A with appropriate
permutations. With this we have

(5.6)
130 5. Linear Eigenvalue Problems

The unit lower triangular matrix Ak LA -k satisfies

(A k LA -k) ij = lij (Ai)


Aj k

In particular all off-diagonal entries vanish for k ---> 00, i.e.,


Ak LA -k = I + Ek with Ek ---> 0 for k ---> 00 .

By substituting in (5.6) it follows that


Ak = Q(I +Ek)AkR.
Now we (formally) apply a QR-factorization
I + Ek = (hRk ,
where all diagonal elements of Rk are positive. Then from the uniqueness
of the QR-factorization and limk--+oo Ek = 0 it follows that
Qk, Rk ---> I for k ---> 00.

With this we have deduced a second QR-factorization of Ak, because


Ak = (QQk)(RkAkR).
Therefore the following equality holds up to the sign of the diagonal
elements:
Pk = QQk, Uk = RkAkR.
For k ---> 00 it follows that

lim Ak
k~(X)
= k---+CX)
lim QkRk = lim Rk = A.
k-HXJ

Remark 5.11 A more precise analysis shows that the method converges
also for multiple eigenvalues Ai = ... = Aj. However, if Ai = -Ai+1 then
the method does not converge. The 2 x 2 blocks are left as such.
If two eigenvalues Ai, Ai+1 are very close in absolute value, then the
method converges very slowly. This can be improved with the shift strategy.
In principle one tries to push both eigenvalues closer to the origin so as to
reduce the quotients IAi+l1 Ail. In order to do that one uses at each iteration
step k a shift parameter O"k and one defines the sequence {Ak} by
(a) Al A,
(b) Ak - O"kI QkRk, QR-factorization,
(c) Ak+l RkQk + O"k I .
5.3. QR-Algorithm for Symmetric Eigenvalue Problems 131

As above it follows that


(1) A k+ 1 = Qr AkQk rv A k,
(2) (A - akI) ... (A - ad) = Q1 ... QkRk ... R 1·
The sequence {Ak} converges toward A with the speed

a~k),J = 0 (I AjAi -- a1a1 1···1 AjAi -- ak-1


ak-1
I) for i > j .

We have already met such a convergence behavior in Section 5.2 in the case
of the inverse power method.
In order to achieve a convergence acceleration the ak's have to be chosen
as close as possible to the eigenvalues Ai, Ai+1. J. H. Wilkinson has proposed
the following shift strategy: We start with a symmetric tridiagonal matrix
A; if the lower end of the tridiagonal matrix Ak is of the form

(k)
dn-1 (k)
en
e~k) d~k)

then the (2, 2)-matrix at the right end corner has two eigenvalues; we choose
as ak the one that is closer to d~k).
Better than these explicit shift strategies, especially for badly scaled
matrices, are the implicit shift methods, for which we refer again to [41] and
[79]. With these techniques one finally needs O(n) arithmetic operations
per computed eigenvalue that is O(n 2 ) for all eigenvalues.
Besides the eigenvalues we are also interested in the eigenvectors, which
can be computed as follows: If Q E O(n) is an orthogonal matrix, then
A;::::: QT AQ, A = diag(A1, ... , An),
then the columns of Q approximate the eigenvectors of A, i.e.,

Together we obtain the following algorithm for determining all eigenvalues


and eigenvectors of a symmetric matrix.

Algorithm 5.12 QR algorithm.


(a) Reduce the problem to tridiagonal form:
A -+ A1 = PAp T , A1 symmetric and tridiagonal, P E O(n).
(b) Approximate the eigenvalues with the QR algorithm with Givens
rotations applied to A 1 :

nA1n T ;::::: A, n Product of all Givens rotations n~~) .


132 5. Linear Eigenvalue Problems

(c) The columns of DP approximate the eigenvectors of A:


DP ~ [171, ... , 17nl·
The cost amounts to
(a) 1n3 multiplications for reduction to tridiagonal form,
(b) O(n 2 ) multiplications for the QR algorithm.
Hence, for large n the cost of reduction to tridiagonal form dominates.
Remark 5.13 For nonsymmetric matrices an orthogonal conjugation re-
duces in a first step the matrix to Hessenberg form. In a second step the
QR algorithm iteratively brings this matrix to Schur normal form (com-
plex upper triangular matrix). Details can be found in the book of J. H.
Wilkinson and C. Reinsch [89l.

5.4 Singular Value Decomposition


A very useful tool for analyzing matrices is provided by the singular value
decomposition of a matrix A E Matm,n(R). First we prove the existence of
such a decomposition and list some of its properties. Finally we will see how
to compute the singular values by a variant of the Q R algorithm described
above.
Theorem 5.14 Let A E Matm,n(R) be an arbitrary real matrix. Then
there are orthogonal matrices U E 0 (m) and V E 0 (n) such that
U T AV = ~ = diag(ul, ... , up) E Matm,n(R),
where p = min(m, n) and Ul ;:::: U2 ;:::: ... ;:::: up ;:::: o.
Proof. It is sufficient to show that there are U E Oem) and V E O(n) such
that

UT AV = (; ~).
The claim follows then by induction. Let U := IIAI12 = maxllxll=l IIAxll·
Because the maximum is attained there are v ERn, U E R m such that
Av = uu and IIul12 = IIvl12 = 1.
We can extend {v} to an orthonormal basis {v = VI, ... , Vn } of R nand
{u} to an orthonormal basis {u = Ul , ... , Um} of Rm. Then
V:= [VI, ... , Vnl and U:= [U l , ... , Uml
are orthogonal matrices, V E O(n), U E Oem), and U T AV is of the form

Al := U TAV = (Uo U;;)


5.4. Singular Value Decomposition 133

with w ERn-I. Since

we have a 2 = IIAII~ = IIAd~ 2: a 2 + IIwll~ and therefore w = 0, so that

U T AV = (; ~).
o
Definition 5.15 The factorization U T AV = ~ is called the singular value
decomposition of A, and the ai's are called the singular values of A.
With the singular value decomposition we have at our disposal the most
important information about the matrix. The following properties can be
easily deduced from Theorem 5.14.
Corollary 5.16 Let U T AV = ~ = diag(aI, ... ,ap) be the singular value
decomposition of A with singular values aI, ... , a p' where p = min( m, n).
Then
1. If Ui and V; are the columns of U and V, respectively, then
AV; = aiUi and ATUi = ai V; for i = 1, ... ,p.
2. If al 2: ... 2: a r > ar+I = ... = a p = 0, then Rang A = r,
ker A = span{Vr+I"'" Vn } and imA = span{UI , ... , Ur }.
3. The Euclidean norm of A is the largest singular value, z.e.,
IIAII2 = aI·
4. The Frobenius norm IIAIIF = (2:~=1 IIAII~)1/2 is equal to
II All} = ai + ... + a; .
5. The condition number of A relative to the Euclidean norm is equal to
the quotient of the largest and the smallest singular values , i. e.,
i'L2(A) = aI/ap .
6. The squares of the singular values ai, ... , a~
are the eigenvalues of
AT A and AAT corresponding to the eigenvectors VI"'" Vp and
U I , ... , Up, respectively.
Based on the invariance of the Euclidean norm 11·112 under the orthogonal
transformations U and V we obtain from the singular value decomposition
of A another representation of the pseudo-inverse A + of A.
Corollary 5.17 Let U T AV = ~ be the singular value decomposition of a
matrix A E Matm,n(R) with p = Rang A and
~ = diag(aI, . .. ,ap,O, . .. ,0).
134 5. Linear Eigenvalue Problems

Then the pseudo-inverse A+ E Matn,m(R) is given by


A+ = V~+UT with ~+ = diag(O'll, .. . ,0';1,0, ... ,0) .

Proof. We have to prove that the right-hand side B := V~+UT satisfies the
Penrose axioms. The (Moore-Penrose) pseudo-inverse of the diagonal ma-
trix ~ is evidently ~+. Then the Penrose axioms for B follow immediately
because VTV = I and UTU = I. 0

N ow we turn to the problem of the numerical computation of the singular


values. According to Corollary 5.16 the singular values of a matrix A E
Matn(R) are the square roots of the eigenvalues of AT A,

(5.7)

The eigenvalue problem for the symmetric matrix AT A is well-conditioned


and with it so is the singular value problem of A provided we can avoid the
computation of AT A. Relation (5.7) suggests a computational method for
O'i(A). This detour is, however, inappropriate as will be easily seen from
the following example.
Example 5.18 We compute with four significant figures (rounded).

A = AT = (1.005
0.995
For AT A we obtain
fl(AT A) = (2.000 2.000) 0- 2 =4 0- 2 = O.
2.000 2.000' 1 , 2

As in the case of linear least squares we will search here also for a method
operating only on the matrix A. For this we examine first the operations
that leave the singular values invariant.
Lemma 5.19 Let A E Matm,n(R), and let P E O(m), Q E O(n) be
orthogonal matrices. Then A and B := PAQ have the same singular values

Proof. Simple. o
Hence, we may pre- and post-multiply the matrix A with arbitrary or-
thogonal matrices, without changing the singular values. In view of an
application of the Q R algorithm it is desirable to transform the matrix A
in such a way that AT A is tridiagonal. The simplest way to accomplish this
is by bringing A to bidiagonal form. The following lemma shows that this
goal can be reached by means of alternate Householder transformations
from the right and the left.
Lemma 5.20 Let A E Matm,n(R) be a general matrix and suppose, with-
out loss of generality that with m 2 n. Then there exist orthogonal matrices
5.4. Singular Value Decomposition 135

P E O(m) and Q E O(n) such that

* *

PAQ= *
*
o o

o o
where B is an upper (square) bidiagonal matrix.

Proof. We illustrate the construction of P and Q with Householder


matrices:

* * * * o
o* * * 0
o * *
P"
---+

* * o * * o * *

* * 0
0
0
* *
* * * Pn
* *
P2' - 1·
---+ 0 ---+ ... ---+

*
0 0
* * *
Therefore we have

(~) = Pn- 1 ..... P1 AQ1·····


'-v-----"
Qn-2 .
----....--
=:P =:Q
D

In order to derive an effective algorithm we examine now the QR algo-


rithm for the tridiagonal matrix BT B. The goal is to find a simplified
version that operates exclusively on B. If we change our notation to
A = BT B and perform the first Givens elimination steps of the QR
algorithm

A -+ 012BT Bof2 = (BOf2? Bof2 ,


' - - v - " '---'
iF B
136 5. Linear Eigenvalue Problems

then we obtain as 13 the matrix

* *
EEl * *

*
*
where in position EEl a new fill-in element is generated. If we play back the
QR algorithm for BT B in this way on B then it appears that the method
corresponds to the following elimination process.

* * Z3
Z2
* * Z5
Z4
* * Z7

(5.8)
Z2n-6
* * Z2n-3
Z2n-4
* *
Z2n-2
*
eliminate Z2 (Givens from left) ---t fill-in Z3
eliminate Z3 (Givens from right) ---t fill-in Z4

eliminate Z2n-3 (Givens from right) ---t fill-in Z2n-2


eliminate Z2n-2 (Givens from left)

We "chase" fill-in elements alongside both diagonals and remove the newly
generated fill-in entries with Givens rotations alternating from left and
right~whence the name chasing has been given to this process. In the end
the matrix has bidiagonal form and we have performed one iteration of the
QR method for BT B operating only on B. According to Theorem 5.10 we
have

B[ Bk ---t A = diag(a"i, ... , O"~) = ~2 for k ---t 00 .

Therefore the sequence Bk converges toward the diagonal matrix of the


singular values of B, i.e.,

Bk ---t ~ for k ---t 00 .

Summarizing we obtain the following algorithm for determining the


singular values of A (for details we refer to [41]):

Algorithm 5.21 QR algorithm for singular values.


5.5. Stochastic Eigenvalue Problems 137

(a) Bring A E Matm,n(R) to bidiagonal form via orthogonal transforms,


P E O(m) and Q E O(n) (i.e., Householder reflections).

P AQ = (~), B E Matn (R) upper bidiagonal matrix.

(b) Perform the QR algorithm for BT B by the "chasing" method (5.8) on


B and obtain a sequence of bidiagonal matrices {Bd that converges
toward a diagonal matrix ~ of the singular values.
For the cost in case m = n we count
(a) rv ~n3 multiplications for reduction to bidiagonal form,

(b) O(n 2 ) multiplications for the modified QR algorithm.

5.5 Stochastic Eigenvalue Problems


The problem type to be treated in this section is closely related to stochastic
processes: Let X (.) denote a stochastic variable, which, at discrete times
k = 0,1, ... , can realize discrete states out of a finite set S = {Sl' ... ,sn}.
Let
P( X(k + 1) = Sj I X(k) = Si ) = aij(k)
denote the probability that the variable realizes the state Sj at time k + 1
having been in state Si at time k. Obviously only the immediately preced-
ing states enter here, but not any earlier ones, i.e., this process has "no
memory"; such special stochastic processes are called Markov processes, in
our here discussed setting more precisely Markov chains. They had been
invented in 1907 by the Russian mathematician A. A. Markov (for details
see [61]). If the probabilities do not depend on time, which means that
aij(k) = aij, we may speak of a homogeneous Markov chain.
The aij can be interpreted as probabilities. This leads to:
Definition 5.22 A matrix A = (aij) is called stochastic when

aij20, Laij=l, i,j=l, ... ,n.


j

If we introduce the special vector e T = (1, ... ,1), we may write the above
row sum relation in compact form as
Ae = e.
Therefore there exists a (right) eigenvector e corresponding to an eigenvalue
A1(A) = 1. Upon recalling that IIAlloo is just the row sum norm, we obtain
for the spectral radius p(A) the inequality chain
IA(A)I :S p(A) :S IIAlloo = 1,
138 5. Linear Eigenvalue Problems

which immediately implies )'1 = p(A) = l.


Let p( k) 2': 0 denote a probability distribution over all states in S at time
k normalized as

Then the Markov chain yields the recursive relation


pT(k + 1) = pT(k)A, k = 0, 1, ... ,
and thus

With this background the matrix A is also called transition matrix of a


Markov chain. Assume now that the following spectral property holds:
The eigenvalue Al = 1 is simple and the only one on the unit circle.
Then the results of the power method (see Section 5.2) confirm, also for a
nonsymmetric matrix A, the limiting property
lim pT(k) = pT(O) lim Ak =;rT, (5.9)
k----+oo k----+oo

where ;r is a (normalized) left eigenvector corresponding to the dominant


eigenvalue Al = 1, i.e.,
;rT A =;rT ;rT e = 1 .
By definition all components of ;r are surely nonnegative, so that the
following normalization holds:
11;r1l1 = 1 (=;rTe).
In what follows we want to clarify under which assumptions on the matrix
A the above spectral property actually holds.
Positive Matrices. As an intermediate step we first treat the spectral
properties of positive matrices A = IAI > 0 where, as in previous chapters,
the modulus 1·1 is to be understood elementwise, which here means aij > 0
for all indices i, j. The door to this interesting problem class has been
opened by O. Perron also in the year 1907 (see Markov!). For positive
matrices we certainly have p(A) > 0, since otherwise all eigenvalues would
have to vanish, which implies that A were nilpotent-in contradiction to
the assumption A > O. Therefore we may equally well consider the matrix
AI p(A) instead of A. Consequently we will assume in the following that
p(A) = 1 without loss of generality. We will, however, not make use of the
special fact that e is an eigenvector. The following theorem dates back to
Perron [65].

Theorem 5.23 Let A > 0 be a positive matrix having a spectral radius


p(A) = l. Then
1. The spectral radius p(A) = 1 is an eigenvalue.
5.5. Stochastic Eigenvalue Problems 139

II. The eigenvalue A = 1 is the only one on the unit circle.


III. To A = 1 there exist corresponding positive left and right eigenvectors.
IV. The eigenvalue A = 1 is simple.

Proof. For the time being, we consider possible eigenvectors x corresponding


to eigenvalues A on the unit circle, i.e., with IAI = 1. We obtain

Ixl = IAI Ixl = IAxl = IAxl <:::; IAI Ixl = A Ixl·


If such eigenvectors x i= 0 exist, then we may conclude that z = A Ixl > O.
If we further define y = z -lxi, we may write the above inequality as y ::::: o.
We now make the (contradictory) assumption y i= 0, which implies Ay > O.
Then there exists some T > 0 such that Ay > TZ, which implies
Ay = Az - A Ixl = Az - z > TZ
or, equivalently, Bz > z for B = A/(1 + T). Upon repeated application of
B we arrive at

By construction, we have p(B) = 1/(1 + T) < 1 and therefore


lim Bk z =0>z.
k~co

Clearly, this is a contradiction to our assumption z > O. Hence, the


assumption y i= 0 must be wrong, i.e., we must have

y = A Ixl - Ixl = o.
As a consequence, there exists an eigenvector Ixl to the eigenvalue A =
I-which is statement I of the theorem.
Due to Ixl = A Ixl > 0, all of its components must be positive. Obviously,
the proof applies for left as well as right eigenvectors in the same way, since
AT is also positive. This is statement III of the theorem. Moreover, the
following is apparent: If an eigenvector for an eigenvalue on the unit circle
exists, this eigenvalue must be A = 1; indeed, the above assumption y i= 0,
which included A i= 1 on the unit circle had led to a contradiction for A i= 1.
The eigenvalue A = 1 is therefore the only eigenvalue on the unit circle,
which proves statement II of the theorem.
The only still missing part of the theorem is that the eigenvalue A = 1
is simple. From the Jordan decomposition J = T- 1 AT we conclude that
Jk = T- 1 AkT, IIJkl1 <:::; K,(T) ·IIAkll ,
wherein K,(T) denotes the condition number of the transformation matrix T.
Let us first assume that J contains a Jordan block J v (l) to the eigenvalue
1 with v > 1. In this case we have on one hand that
lim
k~co
IIJv (l)kll = 00 =} lim
k~co
IIJkll = 00 =} lim
k~co
IIAkl1 = 00.
140 5. Linear Eigenvalue Problems

On the other hand, there exists a norm I . I such that for E > 0 we obtain
IIAkl1 ::; p(Ak) + E = max I>-kl + E = 1 + E,
AEa(A)

which, because of the norm equivalence in R n, is in apparent contradiction


to our above assumption. Hence, the index must be v = 1. In this case
the eigenvalue >- = 1 may still have multiplicity m > 1. Then there exist
left eigenvectors Xi, i = 1, ... , m and right eigenvectors Xi, i = 1, ... , m all
of whose components are positive. At the same time they must satisfy the
orthogonality relations
i,j = 1, ... ,m.
For i i= j this means that the eigenvectors must have nonvanishing compo-
nents of either sign~in contradiction to the fact that all components are
positive. This implies m = 1~which completes the proof of statement IV.
o
Today the eigenvalue >- = p(A) is generally called Perron eigenvalue.
The proof of the theorem heavily relies on the strict positivity of all matrix
elements. However, our stochastic matrices, from which we had started out,
may well contain zero elements. We therefore have to find out whether and
how the just proved results extend to the nonnegative case.
Nonnegative Matrices. Already in 1912, only five years after Perron,
the Berlin mathematician F. G. Frobenius found the ingenious extension
of Perron's results to the case of matrices with aij ~ O. He detected that
in this case the matrices must have an additional property: they must at
least also be irreducible.
Definition 5.24 A matrix is said to be reducible if a permutation P exists
such that

pT AP = (~ ~)
where the block matrices C and F are quadratic. If no zero block can be
generated, the matrix is said to be irreducible.
The mathematical objects behind this notion are graphs. From any non-
negative matrix A = (aij) we may construct the corresponding graph by
associating a node with each index i = 1, ... ,n and connecting node i with
node j by an arrow whenever aij > O. The operation p T AP describes just
a renumbering of the nodes leaving the graph as the mathematical object
unaltered. Just like the matrix a graph is irreducible or also strongly con-
nected, if there exists a connected path (in the direction of the arrows) from
each of the nodes to each other. If the corresponding matrix is reducible,
then the index set divides into (at least) two subsets: there exist no arrows
from the nodes of the second subset to the nodes of the first subset. In this
case the graph is also reducible. In Figure 5.1 we give two (3,3)-matrices
5.5. Stochastic Eigenvalue Problems 141

together with their corresponding graphs. For the representation it is suf-


ficient to give the incidence matrices, wherein 0 stands for aij = 0 and 1
for aij > O.

A=[~~~l
010
A=[~~~l
010

Figure 5.1. Examples of graphs and corresponding incidence matrices. Left:


irreducible case. Right: reducible case.

For the subsequent proof we need the following algebraic characterization


of irreducibility.
Lemma 5.25 Whenever an (n, n)-matrix A ~ 0 is irreducible, then
(I + A)n-l > o.

Proof. Let Ak = (a~7)) denote the powers of the nonnegative matrix A.


Elementwise we have

a~7) = L ail, aZ,Z2 ... aZk_d .


ll, ... ,lk-l

These elements vanish, if at least one of the factors on the right side vanish,
i.e., if in the corresponding graph there is no connecting path from node i
to node j. If, however, there exists such a path, then there exists at least
one index sequence i, l)', ... ,l'k-l' j such that

For an irreducible graph this case occurs with guarantee at latest after
running through all other nodes, which means through n - 1 nodes. With
binomial coefficients Cn -l,k > 0, we then obtain the relation

[(J + A)n-l Lj = [ L Cn _l,k Ak1 = n-l


,,-1
L Cn-l,ka~7) > O. D
k=O ij k=O
142 5. Linear Eigenvalue Problems

Of course, the inverse statement of the theorem does not hold. Inci-
dentally, in the concrete case lower powers of (I + A) might already be
positive: the connecting paths from each node to every other one might
be significantly shorter, i. e. they might run through less than n - 1 other
nodes-compare Figure 5.1.
In what follows we want to return to our original topic of interest, the
class of stochastic matrices. The following theorem is an adaptation of the
theorem of Perron-Frobenius (see, e.g., [61]) to our special case.
Theorem 5.26 Let A ~ 0 be an irreducible stochastic matrix. Then
1. The Perron eigenvalue .\ = 1 is simple.
II. To.\ = 1 there exists a corresponding left eigenvector JrT > O.

Proof. For stochastic matrices A we already know the Perron eigenvalue


.\ = p(A) = 1 and its corresponding right eigenvector e > O. It remains to
be shown that this eigenvalue is again simple.
Our proof will be based on the preceding Lemma 5.25, which states that
the matrix B = (1 + A)n-l is positive. Let .\ with 1.\1 ::; 1 denote any
eigenvalue of A, then (1 + .\)n-l is an associated eigenvalue of B. From the
theorem of Perron the dominant eigenvalue and spectral radius of B is
JJ = p(B) = max 11 + .\In-l = 2n - 1 .
1,\19
This eigenvalue is therefore simple and the only one on the spectral circle
with radius JJ. The above maximum is achieved at .\ = 1. Therefore, the
multiplicities of the eigenvalue JJ of B and the eigenvalue .\ = 1 of A are
the same, which proves statement I of our theorem here.
Each eigenvector to eigenvalue .\ of A is also an eigenvector to the eigen-
value (1+.\)n-l of B. Let x be the eigenvector to the eigenvalue JJ of Band
simultaneously to .\ = 1 of A. Then x = Ixl > 0 is clear from the theorem
of Perron. Specification of the statement to the right eigenvector is trivial,
since e > O. Specification to the left eigenvector supplies, in view of (5.9),
the property JrT > 0, which is just statement II above. 0

The theorem does not state that in the case of irreducible nonnegative
matrices the Perron eigenvalue is also the only eigenvalue on the unit circle.
To assure this additionally, we need a further structural property-as has
also already been found by Frobenius.
Definition 5.27 Nonnegative irreducible matrices are called primitive
when their Perron eigenvalue is the only eigenvalue on the unit circle
(assuming the normalization p(A) = 1).
Such matrices can be characterized by the property that there exists an
index m such that
5.5. Stochastic Eigenvalue Problems 143

One even knows an upper bound m ::; n 2 - 2n + 2, which, however, had


been found later by H. Wielandt. The proof for primitive matrices is com-
paratively simple: one merely applies the theorem of Perron to the positive
matrix Am with eigenvalues ..\ m. That is why we omit it here. Instead we
take a short glimpse on the interesting structure of irreducible matrices
with more than one eigenvalue on the unit circle. For this purpose we cite
without proof the following theorem [61].
Theorem 5.28 Let A 2:: 0 be an irreducible stochastic matrix with LJ eigen-
values on the unit circle, LJ > 1. Then the whole spectrum is invariant under
rotation by the angle 27T / LJ.
It is for this property that such matrices with LJ > 1 are also called cyclic
matrices. As a consequence the trace of A vanishes for these matrices, which
implies
n

trace(A) = I:>ii = L . \ = O.
i=l AEo-(A)

As all elements of A are nonnegative, we immediately conclude that


aii = 0, i = 1, ... ,n.
If only a single diagonal element of an irreducible matrix A is unequal 0,
then this matrix is certainly primitive-a sufficient condition that can be
easily checked.
With these preparations we have shed enough light into the theoret-
ical background of (5.9): for primitive stochastic matrices every initial
distribution p(O) converges asymptotically toward the left eigenvector
7T T = (7Tl, ... ,7Tn) > O.
The description of stochastic matrices by the underlying graph permits a
natural interpretation of the underlying Markov chain: the elements aij 2:: 0
are just the probabilities for a transition from a discrete state i to a discrete
state j. In applications the following inverse problem comes up in the larger
context of cluster analysis:
Given a Markov chain over a known set of states, either as executable
Markov chain or by its transition matrix, decompose the set of states
into subsets or clusters such that the Markov chain decomposes into
uncoupled or "nearly uncoupled" subchains over these subsets of states.
In what follows we will state more precisely what the terms "uncoupled"
and "nearly uncoupled" mean. For the solution of this problem we may
profit from our just gained insight, as will be shown now.
Perron Cluster Analysis. The following presentation is based on
the rather recent paper [24]. It restricts itself to Markov chains where
additionally the principle of detailed balance holds, which means
144 5. Linear Eigenvalue Problems

Because of 7r > 0 the corresponding graph has a connection from i to


j whenever it has one from j to i. Such Markov chains as well as their
transition matrices are called reversible.
If we introduce a weighting matrix D = diag( J7Tl, ... ,0), we can
write the above condition in compact form as
D2A = ATD2.

This implies at once that the matrix


Asym = DAD-I,
which is similar to A, is real symmetric, but in general not stochastic. As
a consequence, all eigenvalues of A as well as those of Asym are real and
lie in the interval [-1, + 1]. In the same way as Asym is associated with an
orthogonal eigenbasis with respect to the Euclidean inner product (x, y) =
x T Y , the matrix A is associated with a 7r-orthogonal eigenbasis, where
orthogonality here is defined with respect to the weighted inner product
(x, y)7r = xT D2y.
See also Exercise 5.9.
If the graph is reducible, then the transition matrix, after suitable per-
mutation P, consists in the reversible case of k diagonal blocks. For the
purpose of illustration we consider k = 3:

(5.10)

Each of the block matrices Am, m = 1, ... , k, is in itself a reversible


stochastic matrix. Assume further that these block matrices are primi-
tive, then there exists a multiple Perron eigenvalue Am = 1, m = 1, ... , k,
and a reduced right eigenvector e;;' = (1, ... ,1) each. The block diagonal
matrix as a whole therefore represents k uncoupled Markov chains whose
asymptotic stationary distributions correspond to reduced left eigenvectors
to each of the diagonal blocks.
To be more precise, we collect all indices associated with the submatrices
Am in index subsets Sm. The corresponding left eigenvectors 7rsm > 0 then
have components 7ri > 0, i E Sm as well as 7ri = 0, i E S\Sm' If we
formally extend the right eigenvectors em to the full index set S, then we
obtain
em = Xs m , m = 1, ... , k ,
where Xs", denotes the characteristic function of these index sets: its value
is 1 for indices in Sm, otherwise O. In Figure 5.2, left, we illustrate the
situation by a simple example. Here we have ordered the indices already in
such a way that the subsets show up as connected sets.
5.5. Stochastic Eigenvalue Problems 145

0.5 0.5 -- - - -- -- - -- -~ - -- - ---.

0\------' o )- ------------
,
/

-0.5 -0.5

,
~,- ., . . ,. "
-,---------------
-1 -1

o 30 60 90 o 30 60 90

Figure 5.2. Markov chain with k = 3 uncoupled subchains. The set of states
S = {Sl, ... , S90} divides into the subsets Sl = {Sl, ... , S29}, S2 = {S30, ... , S49},
and S3 = {S50, ... , S90}. Left: Characteristic function XS2' Right: Eigenbasis
{X 1 ,X2 ,X3} to 3-fold Perron eigenvalue A = 1.

In our formal frame we may therefore restate the above problem of cluster
analysis as:
Find index sets Sm, m = 1, ... , k corresponding to (nearly)
uncoupled Markov chains.
In a first step we consider the case of uncoupled Markov chains. After what
has been said we understand that in this case the knowledge about the
index subsets Sm is equivalent to the knowledge about the reduced right
eigenvectors em to the k-fold Perron eigenvalue of the transition matrix
A. However, we do not know any permutation P to transform A to block
diagonal shape (5.10), its actual computation would anyway be too expen-
sive. Moreover, in the "nearly uncoupled" case we expect a "perturbed"
block diagonal structure. For these reasons we must try to find a different
solution approach.
At first we will certainly solve the numerical eigenvalue problem for the
reversible transition matrix Ai as an algorithm we recommend a variant
of the QR iteration for stochastic matrices~as an indication see Exer-
cise 5.S. Suppose now we thereby detect a Perron eigenvalue A = 1 with
multiplicity k, then we have k after all. In this case the computation of sin-
gle corresponding eigenvectors is known to be ill-conditioned, but not the
computation of an (arbitrary, in general orthogonal) basis {Xl, ... , X k} of
the eigenspace (compare our remark in Section 5.1). Without any advance
knowledge about the index subsets Sm we are then automatically led to a
linear combination of the form
k
Xl = e, Xi = L aim XS m , i = 2, ... k. (5.11)
m=l

Figure 5.2, right, represents the situation again for our illustrating ex-
ample. Obviously the eigenvectors over each subset Sm are locally constant.
146 5. Linear Eigenvalue Problems

For well-ordered indices (via a suitable permutation P) we would be able


to simply "read off" the index sets Sm, m = 1, ... , k. However, in the given
matrix A the indices will in general not be so nicely ordered; therefore we
will need an efficient additional criterion, which is invariant under permu-
tation of the indices. Such a criterion is supplied by the following lemma
[24].
Lemma 5.29 Given a stochastic matrix A consisting of reversible prim-
itive diagonal blocks A 1 , ... ,A k ! up to permutations. Let {X 1 , .•• ,Xd be
a 7r-orthogonal basis of the eigenspace to the Perron eigenvalue A = 1 with
multiplicity k. Let Sm, m = 1, ... , k, denote the index sets corresponding to
the diagonal blocks. Let each state Si E S be associated with a sign structure
(5.12)
based on the ith component of the eigenbasis. Then
I. All elements Si E Sm have the same sign structure.
II. Elements from different index sets Sm have different sign structures.

Proof. Because of (5.ll) all basis vectors Xm are locally constant over the
index sets Sm, which includes also a common sign structure. This confirms
statement I above. For the proof of statement II we may shorten the index
sets Sm each to a single element without loss of generality.
Let {Q1,"" Qk} be an orthogonal eigenbasis of the matrix Asym =
DAD- 1 and Q = [Q1,"" Qk] the corresponding (k, k)-matrix. As Q is
orthogonal w.r.t. (-, .;, QT is also orthogonal, since QT = Q-1. This means
that not only the columns of Q, but also the rows are mutually orthogonal.
Let {Xl,' .. , X k} denote the associated 7r-orthogonal basis of right eigen-
vectors corresponding to the matrix A. Then Xi = D- 1 Qi for i = 1, ... , k.
As the transformation matrices D- 1 only contain positive diagonal entries,
the sign structures of Xi and Qi are identical for i = 1, ... , k. The sign
structure of Sm is equal to the one of row m of matrix X = [Xl, ... , X k]'
Suppose now there were two index sets Si and Sj with i i' j, but with the
same sign structure. Then the rows i and j of X would have the same sign
structure and, as a consequence, the associated rows of Q. Their inner prod-
uct (','; therefore could not vanish-in contradiction to the orthogonality
of the rows of Q. This finally confirms statement II above. 0

Lemma 5.29 clearly shows that the k right eigenvectors to the k-fold
eigenvalue A = 1 can be conveniently exploited for the identification of
the k unknown index sets Sl, ... ,Sk via the sign structures as defined in
(5.12)-per each component only k binary digits. The criterion can be
tested componentwise and is therefore independent of any permutation.
For example, in Figure 5.2, right, we obtain for component S20 the sign
structure (+, +, +), for S69 accordingly (+, -,0).
5.5. Stochastic Eigenvalue Problems 147

In a second step we now turn to the case of nearly uncoupled Markov


chains. In this case the matrix A has a, hidden by an unknown permutation,
block-diagonally dominant shape. As a variation of (5.10) for k = 3 we could
have the shape

Herein the matrix blocks Eij = O( E) represent a perturbation of the


block diagonal shape, the quadratic diagonal blocks Am, m = 1, ... , k
are stochastic reversible matrices only up to O(E). Assume now that the
total matrix A is primitive, then there exists exactly one Perron eigenvalue
A = 1, a corresponding right eigenvector e, and a corresponding left eigen-
vector 7r > O. The k-fold root A = 1 has split under perturbation into a
cluster of eigenvalues, which we will call Perron cluster from now on. Be-
sides the Perron eigenvalue this cluster also contains the perturbed Perron
eigenvalues
'\1 = 1, '\2 = 1 - O(E), .... , (5.13)
Different theoretical characterizations of the perturbation E can be found
in the paper [80] by G. W. Stewart and, more recently, in [24]. The pre-
sentation of the associated perturbation theory would, however, be beyond
the scope of this introductory textbook.

0.5 --. ____ -. __ •• : .. _., ___ -_,


, ,
, ,
o

-0.5
.,
',,,,_, }"f '-.\ I : : .... " ~, .. \ , .. \ .. ,~ ~ ~
.... .. ' \ " ; \
-1

o 30 60 90

Figure 5.3. Markov chain with k = 3 nearly uncoupled sub chains. Eigenbasis
{X 1 ,X2 ,X3} to Perron cluster.\l = 1,.\2 = 0.75,.\3 = 0.52. Compare Figure
5.2, right, for the uncoupled case.

Instead of that we here just want to illustrate the effect of perturbations


on the eigenvectors (cf. Exercise 5.10): in Figure 5.3 we show a perturbation
of the unperturbed system depicted in Figure 5.2, right. As can be observed,
the sign structure introduced in the uncoupled case has nearly not changed
under perturbation--which means that here, too, a simple characterization
of the unknown index sets Sm seem to be possible. In analogy to Figure
148 5. Linear Eigenvalue Problems

5.2, right, we now obtain from Figure 5.3 for the component 820 the sign
structure (+, +, +) as before, but for 869 now (+, -, zero), where zero
stands for some kind of "dirty zero" to be defined in close connection with
the perturbation. In fact, the algorithm in [24] eventually even supplies
the perturbation parameter E and the transition probabilities between the
nearly uncoupled Markov chains~for more details see there.

Exercises
Exercise 5.1 Determine the eigenvalues, eigenvectors and the determi-
nant of a Householder matrix
T
Q = I _ 2 vV .
vTv
Exercise 5.2 Give a formula (in terms of determinants) for an eigenvector
x E en corresponding to a simple eigenvalue A E e of a matrix A E
Matn(C).
Exercise 5.3 The computation of an eigenvector 'f}j corresponding to an
eigenvalue Aj of a given matrix A can be done, according to Wielandt, by
the inverse iteration

AZi - >'jZi = Zi-l

with an approximation >'j to the eigenvalue Aj. Deduce from the relation
r(o) := AZi - (>'j + O)Zi = Zi-l - OZi
a correction 0 for the approximation >'j such that Ilr(0)112 is minimal.
Exercise 5.4 Let there be given a so-called arrow matrix Z of the form

z=[t T ~],
where A = AT E Matn(R) is symmetric, BE Matn,m and D is a diagonal
matrix, D = diag(d 1 , ... , d m ). For m » n it is recommended to use the
sparsity structure of Z.
(a) Show that
Z-AI=LT(A)(Y(A)-AI)L(A) for A#di ,i=l, ... ,m,
where
In
L(A) := [ (D - AIm)-l BT
L] ~]
and M(A) := A - B(D - AIm)-l BT.
Exercises 149

(b) Modify the method handled in Exercise 5.3 in such a way that one
operates essentially only on (n, n) matrices.
Exercise 5.5 Prove the properties of the singular value decomposition
from Corollary 5.16.
Exercise 5.6 Let there be given an (m, n)-matrix A, m :::: n, and an m-
vector b. The following linear system is to be solved for different values of
p :::: 0 (Levenberg-Marquardt method, compare Section 4.3):
(AT A + pIn)x = ATb. (5.14)
(a) Show that the matrix AT A + pIn is invertible for rank A < nand
p> o.
(b) Let A have the singular values CT1 :::: CT2 :::: ••• :::: CT n :::: o.
Show that: If CT n :::: CT1y'8PS, then
1
il:2(A T A + pIn) :S -eps for P:::: O.

If CTn < CT1 y'8PS, then there exists a p :::: 0 such that
1
il:2(AT A + pIn) :S -eps for P:::: p.

Determine p.
(c) Develop an efficient algorithm for solving (5.14) by using the singular
value decomposition of A.
Exercise 5.7 Determine the eigenvalues Ai (t) and the eigenvectors 1)i (t)
of the matrix
-tsin(2/t) ]
I-tcos(2/t) .
How do A(t) , Ai(t) and 1)i(t) behave for t ----+ O?
Exercise 5.8 We consider the matrix A given in Exercises 1.10 and 1.11
describing a "cosmic maser." What is the connection between a stochas-
tic matrix Astoch and the matrix A there? Which iterative algorithm
for the computation of all eigenvalues would be more natural than the
Q R-algorithm?
Exercise 5.9 Given a reversible primitive matrix A with left eigenvector
7r > O. Let D = diag( J7i'1, ... , Fn) be a diagonal weighting matrix and

(x, Y)7r = x T D2y


an associated inner product. Show that
1. A is symmetric with respect to this inner product.
2. All eigenvalues of A are real and contained in the interval [-1, +1].
150 5. Linear Eigenvalue Problems

3. There exists a 7r-orthogonal basis of right eigenvectors, which


diagonalize A.
4. For each right eigenvector x there exists a left eigenvector y = D 2x
to the same eigenvalue.
Exercise 5.10 We construct a stochastic transition matrix for k = 3
nearly uncoupled Markov chains. For this purpose we determine first a
symmetric block diagonal matrix D with three blocks and a positive sym-
metric perturbation matrix E-both matrices with uniformly distributed
stochastic entries (use a random number generator). For 0 < fJ < 1 we
define the symmetric matrix
B = (1- fJ)D + fJE.
Now we normalize B = (b ij ) such that
n
L bij = l.
i,j=l
Thus we obtain, as wanted, a reversible stochastic matrix
A = (aij) = (bij/7ri)
with stationary distribution 7r T = (7r1' ... , 7rn) defined by
n

7ri = Lb ij .
j=l
Compute all eigenvalues of the matrix A. In particular, identify the Perron
cluster (5.13), the associated spectral gap, and any index subsets corre-
sponding to nearly uncoupled Markov chains. Experiment a little bit with
the random number generator.
6
Three-Term Recurrence Relations

There are many problems in mathematics and science where a solution


function can be represented in terms of special functions. These functions
are distinguished by special properties, which make them particularly suit-
able for the problem under consideration, and which often allow for simple
construction. The study and use of special functions is an old branch of
mathematics to which many outstanding mathematicians have contributed.
Recently this area has experienced a resurgence because of new discoveries
and extended computational capabilities (e.g., through symbolic computa-
tion). As examples of classical special functions, let us mention here the
Chebyshev, Legendre, Jacobi, Laguerre, and Hermite polynomials and the
Bessel functions. In the next section, we shall use some of these polynomials
and derive the pertinent and important properties.
Here we want to consider the aspect of evaluating linear combinations
N

f(x) = L CXkPk(X) (6.1)


k=D

of special functions P k (x), where we consider the coefficients CXk as given.


The computation or even approximation of these coefficients may be very
difficult. We shall address this question in Section 7.2 when we discuss the
discrete Fourier transform.
One property that is common to all special functions is their orthogonal-
ity. So far, we only considered orthogonality in connection with a scalar
product on a finite dimensional vector space. Many of the familiar struc-
tures carryover to (infinite dimensional) function spaces. Here the scalar
P. Deuflhard et al., Numerical Analysis in Modern Scientific Computing
© Springer-Verlag New York, Inc. 2003
152 6. Three-Term Recurrence Relations

product is usually an integral (or also an infinite sum). To illustrate this,


we consider the following example.

1:
Example 6.1 Define a scalar product

(j, g) f(x)g(x)dx

for functions f,g : [-7r,7r] ---+ R. It is easy to convince oneself that the
special functions P2d x) = cos kx and P2k +1 (x) = sin kx for k = 0, 1, ...
are orthogonal with respect to this scalar product, i.e.,

As in the finite dimensional case, this scalar product induces a norm

(1:
1

Ilfll = J(D) = If(XWdX) 2 .

The functions, for which this norm is well-defined and finite, can be ap-
proximated arbitrarily well with respect to this norm by the partial sums
of the Fourier series
2N N
fN(X) = L akPk(x) = ao + L(a2k cos kx + a2k-l sin kx),
k=O k=l

if N is large enough.
Here we can compute the functions cos kx and sin kx via the three-term
recurrence relation

Tk(X)=2cosx·Tk-l(X)-Tk-2(X) for k=2,3,... (6.2)


as in Example 2.27.

It is not by accident that we can compute the trigonometric functions cos kx


and sin kx by a three-term recurrence relation in k, since the existence of a
three-term recurrence relation for special functions is connected with their
orthogonality.
First we shall study this connection, and we shall in particular be
concerned with orthogonal polynomials. The theoretical investigation of
three-term recurrence relations as difference equations is the central part
of Section 6.1.2. A detailed numerical example will show that the obvious
and naive idea of using the three-term recurrence relation as an algorithm
may not lead to useful results. In Section 6.2.1 we shall therefore analyze
the conditioning of the three-term recurrence relation and thus obtain a
classification of the solutions. This will finally enable us to give stable al-
gorithms for the computation of special functions and linear combinations
of the form (6.1).
6.l. Theoretical Background 153

6.1 Theoretical Background


Three-term recurrence relations as, e.g., the trigonometric recurrence rela-
tion (6.2), are of central importance in the computation of special functions.
In the following section, we shall study the general connection between or-
thogonality and three-term recurrence relations. Subsequently, we shall be
concerned with the theory of homogeneous and inhomogeneous three-term
recurrence relations.

6.1.1 Orthogonality and Three- Term Recurrence Relations


As a generalization of the scalar product in Example 6.1, we consider the
scalar product

(j, g) := lb w(t)f(t)g(t) dt (6.3)

with an additional positive weight function w :]a, b[---t R, w(t) > o. We


assume that the induced norm

(l
1

IIPII = v(P,P) =
b
w(t)p(t)2 dt)" < 00

is well-defined and finite for all of the polynomials P E P k and all kEN.
In particular, under this assumption all moments

exist, because 1, t k E P k implies that

by the Cauchy-Schwarz inequality. Suppose that {Pdt)} is a sequence


of pairwise orthogonal polynomials Pk E P k of degree k, i.e., with
nonvanishing leading coefficient and

then the Pk are called orthogonal polynomials on [a, b] with respect to


the weight function w(t). In order to define the orthogonal polynomi-
als uniquely, we require an additional normalization condition, e.g., by
assuming that Pk(O) = 1 or that the leading coefficient is one, i.e.,

Pdt) = t k + ...
154 6. Three-Term Recurrence Relations

The existence and uniqueness of a system of orthogonal polynomials with


respect to the scalar product (6.3) will now be shown by employing the
three-term recurrence relation.
Theorem 6.2 For each weighted scalar product (6.3), there exist uniquely
determined orthogonal polynomials P k E P k with leading coefficient one.
These satisfy the three-term recurrence relation

with starting values Po := 1, H(t) = t + al and coefficients


(tPk- 1 , Pk-d (Pk- 1 , Pk-d
ak = - bk = - ...,...-:--::-----:-
(Pk- 1 , Pk-d ' (Pk- 2, P k- 2) .

Proof. The only polynomial of degree 0 with leading coefficient one is


Po == 1 E Po. Suppose that Po, . .. , Pk- 1 have already been constructed
as pairwise orthogonal polynomials Pj E P j of degree j with leading coef-
ficient one. If Pk E Pk is an arbitrary normalized polynomial of degree k,
then Pk - tPk- 1 is a polynomial of degree:::; k - l. On the other hand, the
Po, ... ,Pk- 1 form an orthogonal basis of P k - 1 with respect to the weighted
scalar product (., .), so that

If Pk is to be orthogonal to Po, ... , Pk- 1 , then


(tPk-l, Pj ) (P - , tPj )
Cj = - =- k 1 .
(Pj , Pj) (Pj , Pj)
This implies that Co = ... = Ck-3 = 0,
(Pk- 1 , Pk-d
(Pk - 2, Pk - 2) .

We therefore obtain the next orthogonal polynomial from the formula

Pk = (t + ck-dPk-l + Ck-2Pk-2 = (t + ak)Pk- 1 + bkPk- 2,


and the statement follows by induction. o
Example 6.3 By putting cos a = x and viewing cos ka as a function of x,
one is led to the Chebyshev polynomials
Tdx) = cos(karccosx) for x E [-1,1].
The three-term recurrence relation for cos kx implies that the Tk are in fact
polynomials that satisfy the recurrence relation
6.1. Theoretical Background 155

with the starting values To(x) = 1 and T 1 (x) = x. Using this, we can
define Tk(X) for all x E R. From the variable substitution x = cosa, i.e.,
dx = - sin ada, we can see that the Chebyshev polynomials are indeed
the orthogonal polynomials on [-1, 1] with respect to the weight function
w(x) = 1/JI=X2, i.e.,

,if n -=f. m
,if n=m=O
, if n = m -=f. 0
The Chebyshev polynomials are particularly important in approximation
theory. We shall encounter them several times in the next chapters.

By carefully analyzing the proof of Theorem 6.2, we can understand the


connection between orthogonality and three-term recurrence relations in
greater generality. We shall encounter this structure again in Section 8.3
when studying the method of conjugate gradients.

Theorem 6.4 Let VI C 112 C ... C X be an increasing chain of subspaces


of dimension dim V k = k in a vector space X! and let A : X --; X be a
self-adjoint linear mapping with respect to a scalar product (., .) on X! i. e.!

(Au, v) = (u,Av) for all u,v E X

such that

Then for each PI E VI! there exists a unique extension to an orthogonal


system {pd with Pk E Vk for all k and

(Pk,Pk) = (Apk-I,Pk) for all k ~ 2.

The family {pd satisfies the three-term recurrence relation

Pk = (A + ak)Pk-I + bkPk-2 for k = 2,3, ... (6.4)

with Po := 0 and

ak '= - (Apk-I,Pk-I) bk '= _ (Pk-I,Pk-I)


. (Pk-I,Pk-I)' . (Pk-2,Pk-2) .

Proof. Completely analogous to theorem 6.2, where the self-adjoint operator


A : X --; X was multiplication by t,

t : P k --; Pk+I, P(t) f--+ tP(t).

The self-adjointness is used in the proof of Theorem 6.2 in the transition:

o
156 6. Three-Term Recurrence Relations

A remarkable property of orthogonal polynomials is that they possess


only real and simple roots, which lie in the interval la, b[.
Theorem 6.5 The orthogonal polynomials Pk(t) E P k posseS8 exactly k
simple roots in la, b[.

Proof. Let tl,"" tm be the m distinct points ti Ela, b[, at which P k changes
sign. The polynomial
Q(t) := (t - tl)(t - t2)'" (t - t m )
then changes sign at the same points, so that the function w(t)Q(t)Pk(t)
does not change sign in la, b[, and therefore

(Q, P k ) = lb w(t)Q(t)Pk(t) dt -I O.
Since Pk is orthogonal to all polynomials PEP k-l, it follows that deg Q =
m ;::: k as required. 0

6.1. 2 Homogeneous and Inhomogeneous Recurrence Relations


Because of their importance, which became clear in the last section, we
shall now study real three-term recurrence relations of the form
Pk = akPk-l + bkPk-2 + Ck for k = 2,3, ... (6.5)
for values Pk E R with coefficients ak, bk, Ck E R. We assume that bk -I 0
for all k, so that this actually is a three-term recurrence relation. Under
this assumption, we can perform the recurrence relation backwards, i.e.,
ak 1 Ck
Pk-2 = - bk Pk-l + bk Pk - bk for k = N, N - 1, ... ,2. (6.6)

As in the trigonometric or the Bessel recurrence relation, it is often the


case that bk = -1 for all k, so that the three-term recurrence relation
(6.6) can be obtained from the original one by interchanging Pk and Pk-2.
We shall call such a three-term recurrence relation symmetric. If all Ck
vanish, then the three-term recurrence relation is called homogeneous, and
otherwise inhomogeneou8. So far all of our examples were homogeneous and
symmetric.
For each pair Pj,Pj+l of starting values, the three-term recurrence re-
lation (6.5) determines exactly one sequence P = (Po, PI"") ERN. The
solutions P = (Pk) of the homogeneous three-term recurrence relation
Pk = akPk-l + bkPk-2 for k = 2,3, ... (6.7)
depend linearly on the starting values Pj, PH 1, and they therefore form a
two-dimensional subspace
£ := {p E RN I Pk = akPk-l + bkPk-2 for k = 2,3, ... }
6.1. Theoretical Background 157

of RN. Two solutions p, q E £ are linearly independent, if and only if the


Casorati determinants of p and q,
D(k, k + 1) := Pkqk+l - qkPk+l ,
do not vanish. It is easy to compute that
D(k, k + 1) = -bk+1D(k - 1, k),
and, because bk i= 0, then either all D(k, k + 1) vanish or none do. In
particular, for all symmetric recurrence relations, i.e., h = -1 for allk, we
have
D(k, k + 1) = D(O, 1) for all k.
Example 6.6 For x tf. Z7f, the trigonometric recurrence relation
Pk = akPk-l + bkPk-2, ak = 2 cos x , bk = -1,
has the linearly independent solutions cos kx and sin kx, since
D (0, 1) = cos 0 sin x - sin 0 cos x = sin x i= 0 .
If x = l7f with l E Z, then D(O, 1) = 0; and cos kx and sin kx would not be
linearly independent. Instead
Pk = coskx = (_I)lk, qk= k(_I)lk
are two linearly independent solutions with D(O, 1) =
1. Note that this
value of the Casorati determinant can obviously not be obtained by passing
to the limit x -+ l7f, a theoretical weakness. In the following, we shall learn
about a different characteristic quantity, which does satisfy the required
limiting property (compare Exercise 6.8).
We shall now try to put together the solution of the general inhomoge-
neous recurrence relation (6.5) from solutions of inhomogeneous recurrence
relations, which are as simple as possible. In order to do this, we study how
a single inhomogeneity Ck = Ojk propagates from position j.
Definition 6.7 Let g+(j,k) and g-(j,k) be the solutions of the inhomo-
geneous three-term recurrence relation
g-(j, k) - akg-(j, k -1) - bkg-(j, k - 2)
g+(j, k) - akg+(j, k - 1) - bkg+(j, k - 2)
for j, kEN and k :::: 2 with the starting values
g-(j,j-2) = 9 - (j, j - 1) = 0 respectively,
g+(j,j+2) = g+(j,j+l)=O.
Then the discrete Green's function g(j, k) of the three-term recurrence
relation (6.5) is defined by

( .k).-{ g-(j,k) if k::::j


gJ, .- g+(j,k) if k~j
158 6. Three-Term Recurrence Relations

Here note that g- (j, j) = g+ (j, j) = l. The solutions of the inhomogene-


ous recurrence relation (6.5) with the starting values Po = Co and PI = CI
can now be obtained by superposition according to
k k
Pk=LCjg(j,k)=LCjg-(j,k) for k=O,l,... (6.8)
j=o j=o

(proof as an exercise). Conversely, for the backward recurrence relation


(6.6), it follows that
N+I N+I
Pk = L Cjg(j, k) = L Cjg+(j, k) for k = 0, ... ,N + 1
j=k j=k

is the solution for the starting values PN = CN and PN+I = CN+I.

Figure 6.1. Discrete Green's function g(5, k) over k = 0, ... , 10 for ak = 2 and
bk = -1.

Remark 6.8 Readers knowledgeable in the theory of ordinary differential


equations may recognize familiar structures in the above method for dif-
ference equations. In fact, the name "discrete Green's function" is chosen
analogous to the terminology used in differential equations. Similarly, the
Casorati determinant corresponds to the Wronski determinant, and the
special starting values of the inhomogeneous differential equation, which
are defined via the Kronecker IS ij , correspond to the IS-distribution.

6.2 Numerical Aspects


The mathematical structure of the three-term recurrence relation suggests
a direct translation into an algorithm (simple loop). In Example 2.27, we
have already seen that this way of computing special functions has to be
6.2. Numerical Aspects 159

treated with special care. At least in that case, it was possible to stabilize
the trigonometric three-term recurrence relation numerically. The following
example shows that this is not always possible.
Example 6.9 Bessel's maze. The Bessel junctions, Jk = Jk(x), satisfy the
three-term recurrence relation
2k
Jk+1 = -Jk - J k- 1 for k 2: 1. (6.9)
x
We start, for example, with x = 2.13 and the values
Jo 0.14960677044884
J1 0.56499698056413,
which can be taken from a table (e.g., [73]). At the end of the chapter
we shall be able to confirm these values (see Exercise 6.7). We can now
try to compute the values J 2, ... , h3 by employing the three-term recur-
rence relation in forward mode. In order to "verify" (see below) the results
]2, ... , ]23, we solve the recurrence relation (6.9) with respect to Jk-1, and
insert ]23 and ]22 into the recurrence relation in backward mode. This way
we get 121 , ... , 10 back and actually expect that 10 coincides approximately
with the starting value J o. However, with a relative machine precision of
eps = 10- 16 , we obtain
- 9
Jo/Jo~lO .

A comparison of the computed value ]23 with the actual value h3 reveals
that it is much worse, namely,
27
J 23 /h3 ~ 10 ,
A

i.e., the result misses by several orders of magnitude! In Figure 6.2, we have
plotted the repetition of this procedure, i.e., the renewed start with 10 etc.:
Numerically, one does not find the way back to the starting value, hence
this phenomenon is called Bessel's maze. What happened? A first analysis
of the behavior of the rounding errors shows that
2k
-Jk ~ Jk-1 for k > x
x
(compare Table 6.1). Thus cancellation occurs in the forward recurrence
relation every time when Jk+l is computed (see Exercise 6.9). Moreover,
besides the Bessel functions Jk, the Neumann junctions Y k also satisfy
the same recurrence relation (Bessel and Neumann functions are called
cylinder functions). However, these possess an opposite growth behavior.
The Bessel functions decrease when k increases, whereas the Neumann
functions increase rapidly. It is through the input error for J o and h (in
the order of magnitude of machine precision),

10 = J o + EOYO , 11 = J 1+ E1 Y 1 ,
160 6. Three-Term Recurrence Relations

p
60

40

20

-20
~r,0--------5~------1~0------~1~5------~2~0----~k

Figure 6.2. Bessel's maze for x = 2.13, In([Jk(x)[) is plotted over k for 5 loops
until k = 23.

Table 6.1. Cancellation in the three-term recurrence relation for the Bessel
functions Jk = Jk(X), x = 2.13.

k J k- 1 2k Jk
x

1 1.496. 10- 1 5.305. 10- 1


2 5.649.10- 1 7.153.10- 1
3 3.809. 10- 1 4.234.10- 1
4 1.503. 10- 1 1.597. 10- 1
5 4.253.10- 2 4.425.10- 2
6 9.425.10- 3 9.693.10- 3
7 1.720.10- 3 1.756.10- 3
8 2.672.10- 4 2.716.10- 4
9 3.615.10- 5 3.662.10- 5
10 4.333.10- 6 4.379.10- 6

that the input ]0,]1 always contains a portion of the Neumann function
Yk, which at first is very small, but, which in the course of the recur-
rence increasingly overruns the Bessel function. Conversely in the backward
direction, the Bessel functions superimpose the Neumann functions.
In the following section we shall try to understand the observed numerical
phenomena.

6.2.1 Condition Number


We view the three-term recurrence relation (6.5) as a mapping that relates
the starting values Po, P1 and the ak, bk as input quantities to the values
6.2. Numerical Aspects 161

P2, P3, ... as resulting quantities. Only two multiplications and one addition
have to be carried out in each step, and we have verified the stability of
these operations in Lemma 2.19. The execution of the three-term recur-
rence relation in floating point arithmetic is therefore stable. Thus only the
condition number of the three-term recurrence relation determines whether
it is numerically useful. In order to analyze the numerical usefulness, we
prescribe perturbed starting values

and perturbed coefficients


ih=ad1+ak), bk=bk(l+(3k) for k?2,
whose errors are bounded by 6 > 0,

and we compute the error


D.Pk := Pk - Pk ,
where P is the solution of the perturbed three-term recurrence relation.
By employing the recursion for P and p, it turns out that D.p satisfies the
inhomogeneous recurrence relation

D.Pk = akD.Pk-l + bkD.Pk-2 + Ek for k? 2


with the starting values D.po = Eo := eo Po , D.Pl = El := e 1P1 and
coefficients

Ek = akakPk-1 + f3kbkPk-2 ~ akakPk-1 + (3kbkPk-2 for 6 --t O.


By utilizing the discrete Green's function as in (6.8), we obtain
k
D.Pk = L Ejg(j, k).
j=O

The discrete Green's function thus characterizes the absolute condition of


the three-term recurrence relations. Similarly, it follows that the relative
error

Pk 1= 0,
is the solution of the inhomogeneous recurrence relation
akPk-l ll hPk-2 II
II
Ok = ---uk-l + ---Ok-2 + ck Dor I ? 2
Pk Pk
with the starting values co := eo, C1 := e 1, where
Ek. akPk-1 1-1 bkPk-2
Ck := - = ak--- + fJk---,
Pk Pk Pk
162 6. Three-Term Recurrence Relations

and we therefore have


k
Ok = LEjr(j,k) with r(j,k):= Pj g(j,k). (6.10)
j=O Pk

The functions r(j, k) obviously describe the propagation of the relative


errors and characterize therefore the relative condition of the three-term
recurrence relations. Motivated by the Bessel and Neumann functions, we
distinguish between two types of solutions in order to judge r(j, k).
Definition 6.10 A solution pEl: is called recessive or a minimal solution,
if for each solution q E 1:, which is linearly independent of p we have

lim Pk = o.
k ...... oo qk

The solutions q, which are linearly independent of p are called dominant.


It is clear that the minimal solution is only uniquely determined up to a
scalar factor. In many cases the free factor is determined by a normalization
condition
00

G oo := LmkPk = 1 (6.11)
k=O

with the weights mk. Conversely, such relations generally hint that the
corresponding solutions Pk are minimal. If they exist, then the minimal
solutions form a one-dimensional subspace of 1:. The existence can be
guaranteed by imposing certain assumptions on the coefficients ak and bk.
Theorem 6.11 Suppose that the three-term recurrence relation is sym-
metric, i. e., bk = -1 for all k, and that there exists a ko E N such
that

Then there is a minimal solution P with the properties


1
IPkl ::; Iak+l I - 1
IPk-ll and Pk+l(X) #0 (6.12)

for all k > ko. Furthermore, for each dominant solution q, there is an index
kl ~ ko such that

Proof. The proof is by continued fractions and can be found in J. Meixner


and W. Schiiflke [60]. 0

Example 6.12 The three-term recurrence relations of the trigonometric


functions
6.2. Numerical Aspects 163

cos kx, sin kx satisfy bk = -1 and


lakl = 21 cosxl ~ 2 ¢::=? x = l7r with l E Z.
If x = l7r E Z7r, then Pk = (_I)lk is a minimal solution, and the sequences
qk = J3k( -1 )lk + apk with J3 i- 0
are dominant solutions.
Example 6.13 For the recurrence relations of the cylinder functions, we
have bk = -1 and
k-l
lakl = 2Tx1 ~ 2 {o} k > ko := [lxll·
The minimal solution is the Bessel function J k , whereas the Neumann
function Yk is dominant. This can be proved by invoking the asymptotic
approximations for Jk, respectively, Y k for k ---+ 00, because

Jk(x) ~ _1_ (ex)k Yk(X) ~ _ {2 (eX)-k


v'27rk 2k' V;k 2k
The Bessel functions Jk(X) satisfy the normalization condition (see, e.g.,
[2])

LJ
00

Goo := J o + 2 2k = 1.
k=l

Under the assumptions of Theorem 6.11, it can be shown that


Ig(j, k)1 ~ Ik - j + 11 for all k ~ j > ko ·
So the discrete Green's functions g(j, k) are themselves dominant solutions
and increase beyond any bounds. On the other hand, because of (6.12), a
minimal solution P satisfies Ipj / Pk I ~ 1, and therefore

Ir(j,k)1 = 1~:g(j,k)1 ~ Ig(j,k)1 ~ Ik-j+ll


for all k ~ j > ko. Beginning with the index ko, the three-term recur-
rence relation is thus ill-conditioned for the computation of a minimal
solution. For dominant solutions, the growth of the discrete Green's func-
tions can be compensated by the growth of the solution itself, so that the
relative error amplification, which is expressed by r(j, k), stays moderate,
and the three-term recurrence relation is well-conditioned. Thus the three-
term recurrence relation (in forward direction) is ill-conditioned for the
Bessel functions as minimal solution, but well-conditioned for the Neumann
functions.

Example 6.14 Spherical Harmonics. We now harvest what has been


planted above by giving a more complicated example, which plays an im-
portant role in many applications, for example, in theoretical physics or
164 6. Three-Term Recurrence Relations

geodesy. In general, one has to compute expansions with respect to spheri-


cal harmonics, as well as entire sets of spherical harmonics. They are usually
denoted by Y~ (0, cp), where the Euler angles, 0 and cp, are variables on the
sphere subject to

o :::; 0 :::; 7'1 and 0:::; cp :::; 27'1 .

Among the numerous representations of spherical harmonics, we choose the


complex representation

where pUx) denotes the associated Legendre functions of the first kind for
Ixl :::; 1. They can be given explicitly as follows:
(_l)k+l 2.!. d k+1 2 k
Pk(X):= (k+l)!k!2 k (1-x)2 dx k+l(l-x) . (6.13)

Among the numerous normalizations of these functions, which appear in


the literature, we have chosen the one according to [36]. Using the relations

Pk (x) == 0 for l > k :::: 0 and l < - k :::; 0 (6.14)

and

it is sufficient to compute the real spherical harmonics

Ck (0, cp) Pk (cos 0) cos(lcp) for 0:::; l :::; k


51(0, cp) Pk(cosO) sin(lcp) for 0 < l :::; k.
We have earlier discussed the three-term recurrence relations for the
trigonometric functions in great detail. We therefore draw our attention to
the Legendre functions for the argument x = cos O. All three-term recur-
rence relations, which are valid for these doubly indexed Legendre functions
of the first kind (see, e.g., [40]), are also valid for the Legendre functions of
the second kind, which, in contrast to those of the first kind, have singular-
ities of order l at x = 1 and x = -1. This property carries directly over to
the corresponding discrete Green's functions (compare Exercise 6.8). Thus
recurrence relations with variable l would be ill-conditioned for the Pk.
Consequently, among the many three-term recurrence relations, we choose
those with constant l. This leads to the recurrence relation
I (2k - l)xpLl - pL2
(6.15)
Pk = (k-l)(k+l) ,

with the only running index k. It is well-conditioned for the Pk in forward


direction with respect to k (see, e.g., the paper [36] by W. Gautschi).
6.2. Numerical Aspects 165

Still missing is a well-conditioned link for different l. From k = l in


Definition (6.13), we obtain the representation

pl(X) = (-1)l(1_x2)~
l 2l . l! '
which leads immediately to the two-term recurrence relation

pl _ _ (1 - X
2).12 p l- 1
l - 2l l-1 , (6.16)

which as such is well-conditioned. In order to start the recurrence relations,


the value P3 = 1 is used for (6.16), and for (6.15) we use the recurrence
relation for k = l + 1, which, because of Pi-I == 0 and according to (6.14),
also degenerates into a two-term recurrence relation:
(6.17)
If we replace the argument x by cosO, then we expect, by the results of
Section 2.3 that the corresponding algorithm is numerically unstable. As
in the stabilization of the trigonometric recurrence relation, we here try
again to replace the argument cosO by 1- cosO = -2sin2(O/2) for 0 ---> o.
Unfortunately, the stabilization is not as easily accomplished as for the
trigonometric functions. We therefore seek solutions of the form:
l l -l -l l -l
P k = qkPk and P k = rk Pk-l + !:"Pkl
with suitably chosen transformations qk and rk. Observe that the relative
condition numbers in (6.10) are the same for Pk and independent of PL
the choice of the transformations qk. Insertion of and into (6.15) Pk Pk-2
then gives

where
(2k-1)cosO l l l qL2
IJ"k(O) := (k -l)(k + l) . qk-l - qkrk - (k -l)(k + l) rLl
In order for the expression (1 - cos 0) to be a factor in IJ"k (0), 0 = 0
obviously has to be a root of IJ"k, i.e., IJ"dO) = O. Because of (6.17), we
require in addition that ql+1 rl+1 = 1. These two requirements regarding
the transformations qk and rk are satisfied by the choice
l 1
qk = I = k - l + 1.
rk

With this transformation, and by starting with = P3 P3


= 1, one obtains
the following numerically stable recurrence relation representation:
166 6. Three-Term Recurrence Relations

Algorithm 6.15 Computation of the spherical harmonics Pk (cos B) for


l = 0, ... , Land k = l, ... , K.

p3:= P3:= 1;
for l := 0 to L do
p l+1 ._ pl+l._ sin BpI.
l+l . - /+1 2(l + 1) I'
LlPl := - sin 2 (B/2)pi;

for k := l + 1 to K do
(k - l - l)LlPk-l - 2(2k - 1) sin 2 (B/2)Pk-l
LlP I '= .
k' (k+l)(k-l+1) ,
-l._ 1 -I 1
Pk + 1) Pk - 1 + LlPk ;
(k -l
Pk := (k - l + l)Pk;

end for

end for
Remark 6.16 For the successful computation of orthogonal polynomials,
one obviously needs a kind of "look-up table of condition numbers" for as
many orthogonal polynomials as possible. A first step in this direction is
the paper [36]. However, the numerically necessary information is in many
cases more hidden than published. Moreover, the literature often does not
clearly distinguish between the notions "stability" and "condition."

6.2.2 Idea of the Miller Algorithm


Do we have to abandon the three-term recurrence relation for the computa-
tion of a minimal solution because of the above error analysis? This is not
the case, as we shall show here. The remedy, which is due to J. C. P. Miller
[62], is based on two ideas. The first consists in analyzing the three-term
recurrence relation in backward mode with the starting values Pn, Pn+l
with respect to its condition. By carrying the above considerations over to
this case (see Exercise 6.5), it can be shown that the three-term recurrence
relation is well-conditioned for a minimal solution in backward mode. The
second idea consists of utilizing the normalization condition (6.11). Since
the minimal solutions Pk(X) become arbitrarily small in absolute value for
k --> 00, Goo can be approximated by the finite partial sums
n

G n := LmkPk.
k=O
By computing an arbitrary solution 'Pk of the three-term recurrence rela-
tion in backward mode, e.g., with the starting values Pn+l = 0 and Pn = 1,
6.2. Numerical Aspects 167

and normalizing these with the help of G n , one obtains for increasing n
increasingly better approximations of the minimal solution. These consid-
erations motivate the following algorithm for the computation of p N with
a relative precision E.

Algorithm 6.17 Miller algorithm for the computation of a minimal


solution p N .

1. Choose a break-off index n > N and put


~(n) .- 0 ~(n) .= 1
Pn+l·- , Pn · .

,en) f
t ,en) ... 'Po
2. C ompuePn_l' rom

3. Compute
n

On:= LmkPk.
k=O
4. Normalize according to

p~n) := p~n) IOn.

5. Repeat steps 1 to 4 for increasing n = nl, n2, ... , and while doing
this, test the accuracy by comparing p<;)i) and p<;)i-,). If

Ip<;)i) - p<;)i-,) I :s:: E p<;);) ,

then p<;)i) is a sufficiently accurate approximation of PN.

In the following theorem it is shown that this algorithm converges indeed.

Theorem 6.18 Let p E .c be a minimal solution of a homogeneous three-


term recurrence relation, which satisfies the normalization condition

In addition, it is assumed that there is a dominant solution q E .c such that

Then the sequence of Miller approximations p<;)) converges to p N,


r (n)
n~PN =PN·
168 6. Three-Term Recurrence Relations

Proof. The solution pin) of the three-term recurrence relation with the
starting values p~n) := 1 and p~nJI := 0 can be represented as a linear
combination of Pk and qk, because
,(n) Pkqn+1 - qkPn+1
Pk =
Pnqn+1 - qnPn+1
This implies

For the Miller approximations p<;J) = p<;J) fGn, this yields

for n ---+ 00. 0

This algorithm, which was developed by J. C. P. Miller in 1952, is now


outdated, since it uses too much storage space and computation cost. The
basic idea, however, enters into the more efficient algorithm, which we shall
present in the next chapter.

6.3 Adjoint Summation


We turn our attention again toward the original problem, namely, to
evaluate linear combinations of the form
N

f(x) = SN = L 0kPk, (6.18)


k=O

where Pk := Pk(X) satisfies a homogeneous three-term recurrence relation


(6.7) with starting values PO,PI and given coefficients Ok. The subsequent
presentation essentially follows the lines of the two papers [16, 17]. As an
illustration we start with a two-term recurrence relation.

Example 6.19 The evaluation of a polynomial

SN := p(x) = 00 + 0IX + ... + ONX N


can be considered as a computation of a linear combination (6.18), where
Pk := xk satisfies the two-term recurrence relation

Po := 1, Pk:= XPk-1 for k;::: 1.


6.3. Adjoint Summation 169

This sum can be computed in two different ways. The direct way is the
forward recurrence relations
(6.19)

for k = 1, ... , N with starting values 50 := ao and Po := 1. It corresponds


to the naive evaluation of a polynomial. But one can factor cleverly

5N = ao + x(al + x(·· . (aN-l + x aN) ... ))


'-v-'
UN
'-.r--"'
UN-l

and compute the sum by the backward recurrence relation

UN+l .- 0
Uk .- XUk+l + ak for k = N, N - 1, ... ,0 (6.20)
5N .- Uo·

This is the Horner algorithm. Compared with the first algorithm (6.20), it
saves N multiplications and is therefore approximately twice as fast.

6.3.1 Summation of Dominant Solutions


An obvious way of computing the sum (6.18) consists of computing the
values Pk in each step by the three-term recurrence relation

(6.21)

multiplying them by the coefficients ak, and adding the result up. The
resulting algorithm corresponds to the forward recurrence relation (6.19):

Pk := akPk-l + bkPk-2 and 5k:= 5 k- l + akPk for k = 2, ... ,N


with the starting value 51 := aopo + alPl. We wonder if the procedure that
we used in the derivation of the Horner algorithm carries over to the case
at hand. In order to construct an analogue to "factoring," we extend the
three-term recurrence relation (6.7) by the two trivial equations Po = Po
and PI = Pl· For the computation of the values P = (Po, ... ,pN ), we thus
obtain the triangular system

Po
1 Po
1 PI
-b 2 -a2 1 0

-bN -aN 1 0
v PN ~
=:L ~ =: r
=:p
170 6. Three-Term Recurrence Relations

with a unipotent lower triangular matrix L E MatN+l(R) and the trivial


right-hand side r. The linear combination SN is just the (Euclidean) scalar
product
N
SN = LakPk = (a,p) with Lp =r
k=O

of P and a = (ao, ... , aN )T. Therefore


SN = (a, L -lr) = (L -T ex, r) . (6.22)
IfU denotes the solution of the (adjoint) triangular system LT u = a, i.e.,
u := L -T ex, then it follows that
SN = (u, r) = UoPo + U1Pl .
Explicitly, u is the solution of
1 -b 2 Uo

1 -a2

1 -bN

-aN
1 UN aN
By solving this triangular system of equations, we obtain the desired
analogue of the algorithm (6.20):

b2U2 + ao, (6.23)


uOPo + UIPl,
where we set UN+l = UN+2 := O. The homogeneous three-term recurrence
relation, which is defined by LT U = 0, is called the recurrence relation
adjoint to (6.21). Similarly, because of the relation (6.22), we call (6.23)
the adjoint summation for SN = 2:~=0 akPk. Compared with the first
algorithm, we here save again N multiplications.
Example 6.20 The adjoint summation, as applied to the partial Fourier
sums
N N
SN := L ak sinkx respectively eN := L ak cos kx
k=l k=O
with the trigonometric three-term recurrence relation
Pk = 2cosx· Pk-l - Pk-2
for Sk = sin kx and Ck = cos kx and the starting values
So = 0, Sl = sin x respectively, Co = 1, Cl = cos X ,
6.3. Adjoint Summation 171

leads to the recurrence relation

UN+1 = 0 (6.24)
Uk 2 cos x . Uk+1 - Uk+2 + ak for k = N, ... , 1
and to the results
SN = U1 sinx respectively, C N = ao + U1 cosx - U2·

This algorithm, which G. Goertzel [40] devised in 1958, is unstable for


x --+ In, I E Z, like the three-term recurrence relation on which it is based
(as an algorithm for the mapping x f---> cos kx, compare Example 2.27).
However, the recurrence relation (6.24) can be stabilized by introducing
the differences !:1uk = Uk - Uk+! for cos x ~ 0, and by changing to a
system of two-term recurrence relation. As in Example 2.27, we obtain the
following stable form of the recurrence relation for k = N, N -1, ... ,1, the
algorithm of G. Goertzel and C. Reinsch:
!:1uk -4 sin 2 (x /2) . Uk+1 + !:1uk+1 + ak
Uk Uk+1 + !:1uk
with the starting values UN+1 = !:1uN+1 = O. For the sums we obtain
SN = U1 sinx and C N = ao - 2 sin 2 (x/2)U1 + !:1U1 .
As in the error analysis of the three-term recurrence relation, the ex-
ecution of the additions and multiplications of the adjoint summation is
stable; the resulting errors can be interpreted as modifications of the in-
put errors in the coefficients ak, bk and ak. Only the condition determines
the numerical usefulness; however, note that the condition is independent
of the algorithmic realization. It is therefore true that if the original re-
currence relation (6.21) is well-conditioned in forward mode, then this is
also the case for the adjoint three-term recurrence relation (6.23), which of
course runs backward. The algorithm (6.23) is thus only suitable for the
summation of dominant solutions.
Example 6.21 Adjoint summation of spherical harmonics. To illustrate
this, we consider the evaluation of expansions of the form
L K
C(K, L; 8, cp) Lcos(lcp) LA~Pk(cos8)
1=0 k=l
L K
S(K, L; 8, cp) L sin(lcp) L A~Pk(cos 8) ,
1=1 k=l

where pUx) are the Legendre functions of the first kind as introduced
in Example 6.14. Negative indices I are omitted here because of P k- 1 =
(-1)1 Pk. A set of well-conditioned recurrence relations was already given in
Example 6.14. Also in this situation, the application of similar stabilization
172 6. Three-Term Recurrence Relations

techniques yields an economical numerically stable version. After a few


intermediate calculations, one obtains the following algorithm for cos e 2: 0
and cos <.p 2: 0:
Algorithm 6.22 Computation of the sums C(K, L; e, <.p) and S(K, L; e, <.p).

V := 0; ~ V := 0;

for l := L to 1 step -1 do

U := 0; ~U := 0;

for k := K to l + 1 step -1 do

~U := (A~ - 2(2k + 1) sin2(ej2) U + ~U)j(k -l);


U := (U + ~U)j(k + l);
end for

Ul := Al- 2(2l + 1) sin2(ej2) U + ~U;


if l > 0 then
sin e .2
~V := Ul - 2(l + 1) (-4sm (<.pj2)V + ~V);
V .- sin e V ~ V-
2(l+1) + ,
end

end for
sin e 2
C(K,L;e,<.p) := Uo - - (-2sin (<.pj2)V + ~V);
1 2
S (K, L; e, <.p) := -"2 sin e sin <.p . V;

6.3.2 Summation of Minimal Solutions


For the summation of minimal solutions like, e.g., the Bessel functions, we
go back to the idea of the Miller algorithm. We again assume that the min-
imal solution p E .c, which we have to compute, satisfies the normalization
condition (6.11). In this case one can, in principle, derive from the Miller
algorithm 6.17 a method to compute the approximations of the sum SN:
N n
SY:) = L o:d)~n) jG~n) with G~n):= L mkj3~n) .
k=O k=O
Under the assumptions of Theorem 6.18, we then have
lim S(n) = SN.
n---+oo N
6.3. Adjoint Summation 173

However, the cost of computing the 5}:;) would be quite high, since for
each new n all values have to be computed again. Can that be avoided
by employing some kind of adjoint summation? In order to answer this
question, we proceed as in the derivation of the previous section, and we
describe for given n > N one step of the Miller algorithm through a linear
system MnP(n) = r(n):
(n)
Po
°
bn an -1
bn+ 1 an+l
mo mn (n)
~----------~v~------------ Pn
~
=: p(n)

With a(n) := (ao, ... ,aN,O, ... ,O)T E Rn+I, the sum 5}:;) can again be
written as a scalar product

L akP~n) = (a(n), p(n))


N
5}:;) = with MnP(n) = r(n) .
k=O

If we assume that Mn is invertible (otherwise G~n) = °and the


normalization cannot be carried out), then it follows that
5}:;) = (a(n), M;;lr(n)) = (M;;T a(n), r(n)) . (6.25)
By setting urn) := M;;T a(n), we have
5(n)
N
= (u(n) , r(n)) = urn)
n ,

where urn) is the solution of the system MJ urn) = a(n). More explicitly,
(n)
b2 mo Uo a o(n)
a2
-1 bn

an bn+1
-1 mn (n) (n)
an+l Un an
We solve this system by the Gaussian elimination method. The arising
computations and results are listed in the following theorem:
Theorem 6.23 Define ern) = (eo, ... , en) and f(n) = (fo, . .. , fn) by

(a) Cl := 0, eo := ao/b 2 and


1
ek+l:=-b-(ak+l+ek-l-ak+2ek) for k=0, ... ,n-1, (6.26)
k+3
174 6. Three-Term Recurrence Relations

(b) f-l := 0, fa := mo/b 2 and


1
fk+l := -b-(mk+l + fk-l - ak+2!k) for k = 0, ... , n - 1, (6.27)
k+3
where ak := ° for k > N. Then, under the assumption fn =1= 0, we have
N
(n) _ "
SN (n) _ en
- ~akPk - - .
k=O fn

Proof. The values fa, ... , fn are computed by LU-factorization M;;


LnUn of M;;,
where

1 fa

-1 , Un :=

1 fn-l
fn

and therefore Lnf(n) = m(n). This is equivalent to the recurrence relation


(b) for fa, ... , f n. The recurrence relation (a) for eo, ... ,en is equivalent to
Lne(n) = a(n). By inserting the factorization MT = LnUn into the system
MT urn) = oJn), it follows that
Unu(n) = ern) ,

and therefore
S (n) _ (n) _ en
N - Un - fn .

With the recurrence relations (6.26) and (6.27), we need 0(1) operations
in order to compute the next approximation sj;+l) from sj;), as opposed to
O(n) operations with the method, which is directly derived from the Miller
algorithm. In addition, we need less memory not depending on n but only
on N (if the coefficients {ad are given as a field). Because of (6.25), we call
the method developed in Theorem 6.23 the adjoint summation of minimal
solutions.
We now want to illustrate how this method can be employed to obtain
a useful algorithm from the theoretical description of Theorem 6.23. First
we replace the three-term recurrence relation (6.26) for ek by a system of
two-term recurrence relations for
(k) ek
Uk := Uk = - and b.uk:= Uk - uk-l
fk
because we are interested in precisely these two values (Uk as solution, b.uk
to scrutinize the precision). Furthermore, one has to consider that the f n
6.3. Adjoint Summation 175

and ek get very large, and they may fall outside the domain of numbers,
which can be represented on the computer. Instead of the ik, we therefore
use the new quantities
fk-l - 1
gk := T and h:= fk .

In the transformation of the recurrence relations (6.26) and (6.27) to


the new quantities Uk, f).Uk, gk and Ik, it turns out to be appropriate to
introduce the notation
_ bk+2 fk _ - mk
gk:= - - = bk+2-f and mk:= mkh-l = -f .
gk k-l k-l

From (6.27), it thus follows that (multiplication by bk+2/ fk-d


?1k = mk - ak+l + gk-l for k::::: 1, (6.28)
and from (6.26) (multiplication with bk+d h-l and insertion of (6.28))
that
?1kf).Uk = Ik-lak - gk-lf).uk-l - mkuk-l .
By arranging the operations so that as little storage space as possible is
used and by omitting the no longer required indices, we then obtain the
following numerically useful algorithm.
Algorithm 6.24 Computation of SN = 2:~=o akPk for a minimal
solution (Pk) with relative precision c.
g := f).u := 0, I:= bdmo, U:= ao/mo, k:= 1
repeat
m :=mI;
f).u:= Iak - gf).u - mu;
g := m - ak+l + g;
f).u := f).u/ g;
u := u + f).u;
if (k > Nand If).ul :::; lui' c) then exit; (Solution SN ~ u)
g := bk+2/g;
I :=Ig;
k := k + 1;
until (k > n max )
176 6. Three-Term Recurrence Relations

Exercises
Exercise 6.1 On a computer, calculate the value of the Chebyshev
polyilOmial T3I(X) for x = 0.923:
(a) by using the Horner scheme (by computing the coefficients of the
monomial representation of T3I on a computer, or by looking them
up in a table),
(b) by using the three-term recurrence relation.
Compare the results with the value
T3I(X) = 0.948715916161,
which is precise up to 12 digits, and explain the error.
Exercise 6.2 Consider the three-term recurrence relation
Tk = akTk-I + bkTk-2 .
For the relative error Bk = Ch - Tk)/Tk of n, there is a inhomogeneous
three-term recurrence relation, which is of the form
/l Tk-I/l Tk-2/l
Uk = akT;:Uk-I + bkT;:Uk-2 + Ek·
Consider the case ak 2 0, bk > 0, To, TI > 0, and verify that
(a ) IE k I :::; 3eps,
(b) Ilhl:::; (3k - 2)eps, k 21.
Exercise 6.3 The Legendre polynomials are defined through the recur-
rence relation
(6.29)
with the starting values Po(x) = 1 and PI (x) = x. (6.29) is well-conditioned
in forward mode. Show that the computation of Sk (B) := Pk (cos B) accord-
ing to (6.29) is numerically unstable for () - t O. For cos () > 0, find an
economical, stable version of (6.29) for the computation of Sk(B).
Hint: Define Dk = ak(Sk - Sk-J), and determine a suitable ak.
Exercise 6.4 Consider the three-term recurrence relation
(6.30)
(a) Find the general solution of (6.30) by seeking solutions of the form
Tk = w k (distinguish cases!).
(b) Show the existence of a minimal solution, if lal 2 l.
Exercise 6.5 Under the assumptions of Theorem 6.11, analyze the
condition of the adjoint three-term recurrence relation for a minimal
solution.
Exercises 177

Exercise 6.6 Show that the symmetry of three-term recurrence relation


carries over to the discrete Green's function, i.e., for symmetric three-term
recurrence relation, with bk = -1 for all k, it is true that
g(j, k) = g(k,j) for all j, kEN.
Exercise 6.7 Compute the Bessel functions Jo(x) and J1(x) for x = 2.13,
and compute Jk(X) for x = 1024 and k = 0, ... ,1024,
(a) by employing the Miller algorithm,
(b) by specializing the adjoint summation algorithm 6.24.
Compare both algorithms with respect to storage space and computational
cost.
Exercise 6.8 Starting from two arbitrary linearly independent solutions
{Fd, {Qd, show that the discrete Green's functions can be written in the
form
_. D(j - 1, k) and +(. k) _ D(j + 1, k)
9 (J, k) = D(j _ 1, j) g], - D(j + 1, j) ,

where the
D(l, m) ;= FIQm - QIFm
denote the generalized Casorati determinants. For the special case of the
trigonometric recurrence relation, find a closed formula for g(j, k), and
carry out the limiting process x ----+ lJr, l E Z. Sketch g(j, k) for selected j,
k, and x.
Exercise 6.9 Consider the three-term recurrence relation for the cylinder
functions

(6.31)

Starting with the asymptotic representations for the Bessel functions

Jk(X) ~ -1- (ex)k


- for k ----+ 00,
V2Jrk 2k
and the Neumann functions

Yk(X) ~ _ (2 (ex)-k for k ----+ 00,


V~ 2k
show that there is cancellation in (6.31) for Jk+l in forward mode, and
cancellation for Yk-l in backward mode.
Exercise 6.10 Consider a three-term recurrence relation of the form
(6.32)
178 6. Three-Term Recurrence Relations

(a) Transform Pk by Pk = ckih such that the Pk satisfy the symmetric


three-term recurrence relation
(6.33)
(b) Using the assumption

prove a theorem about the existence of minimal solutions, which is


similar to Theorem 6.11. Compare the application of Theorem 6.11
to the recurrence relation (6.33) with the application of (a) to (6.32).
Exercise 6.11 Consider a symmetric tridiagonal matrix
d1 e2

Show the following:


(a) The polynomials Pi(A) := det(Ti -Ali) satisfy a three-term recurrence
relation.
(b) Under the assumption rl:~2 ei i= 0 it is true that:
For i 2: 1, Pi has only real simple roots. (The roots of Pi separate
those of PHd
(c) If Tn possesses an eigenvalue of multiplicity k, then at least k - 1
nondiagonal elements ei vanish.
Hint: {pd is called a Sturm sequence.
7
Interpolation and Approximation

In Numerical Analysis one often encounters the situation that instead of


a function f : R --> R, only a few discrete function values f(t;) are given,
and maybe derivatives f(j) (t;) at finitely many points t;. This is the case,
for example, when the function f is given in the form of experimental data.
Also, most methods for solving differential equations calculate a solution
f(t) (including its derivative) only at finitely many positions. Historically,
this problem occurred in the computation of additional function values
between tabulated ones. Nowadays, one of the most important applications
occurs in computer graphics, known by the abbreviations CAD (Computer-
Aided Design) and CAGD (Computer-Aided Geometric Design).
If one is interested in total behavior of the function, then, from the given
data
f(j)(t;) for i=O, ... ,n and j=O, ... ,c;,
one should construct a function cP which differs as little as possible from the
original function f. In addition, the function cP should be simple to evaluate,
like, for example, (piecewise) polynomials, trigonometric, exponential, or
rational functions. A first obvious requirement regarding the function cp is
the interpolation property: cp should coincide with the function f at the
nodes (sometimes also called knots) t;,
cp(j) (ti) = f(j) (ti) for all i, j .
The values f(j)(ti) are called node values. If we compare the two functions
CPl and CP2 in Figure 7.1, then both obviously satisfy the interpolation con-
dition at given values f(t i ). Nevertheless, we would prefer CPl. In addition
P. Deuflhard et al., Numerical Analysis in Modern Scientific Computing
© Springer-Verlag New York, Inc. 2003
180 7. Interpolation and Approximation

to

Figure 7.1. Various interpolating functions for f.

to the interpolation property, we therefore require also the approximation


property: cp should differ as little as possible from 1 with respect to a norm
I . II in a suitable function space,
Ilcp - 111 "small."

7.1 Classical Polynomial Interpolation


We start with the simplest case, when only the values
Ii := l(t i ) for i = 0, ... , n
are given at the pairwise distinct nodes to, ... , tn. We now seek a polynomial
P E P n of degree deg P ::::: n,
P(t) = ant n + an_It n - 1 + ... + alt + ao with ao, ... , an E R,
which interpolates 1 at the n + 1 nodes to, ... , tn, i.e.,
P(ti) = Ii for i = 0, ... , n. (7.1)

7.1.1 Uniqueness and Condition Number


The following argument shows that the n+ 1 unknown coefficients ao, ... , an
are uniquely determined by the condition (7.1): If P, Q E P n are two in-
terpolating polynomials with P(ti) = Q(t i ) for i = 0, ... ,n, then P - Q is
a polynomial of at most nth degree with the n + 1 roots to, ... , tn, and is
therefore the null-polynomial. But the rule
Pn --7 Rn+l, P f-t (P(to), ... , P(t n ))
is also a linear mapping between the two (n + 1)-dimensional real vector
spaces P nand Rn+l, so that injectivity already implies surjectivity. We
have therefore proven the following theorem.
7.1. Classical Polynomial Interpolation 181

Theorem 7.1 Suppose n + 1 nodes (ti' fi) for i = 0, ... , n are given with
pairwise distinct nodes to, ... , tn, then there exists a unique interpolating
polynomial P E P n , i.e., P(td = fi for i = 0, ... , n.

The unique polynomial P, which is given by theorem 7.1, is called inter-


polating polynomial of f for the pairwise distinct nodes to, ... , tn, and it is
denoted by

In order actually to compute the interpolating polynomial, we have to


choose a basis of the space of polynomials P n. In order to illustrate this, we
first give two classical representations. If we write P as above in coefficient
representation

P(t) = ant n + an_It n - 1 + ... + alt + ao,


i.e., with respect to the monomial basis {I, t, ... , tn} of P n, then the
interpolation conditions P(ti) = fi can be formulated as a linear system

lC:) C:)
1 to t 02 t 0n

,
[ 1 tn t n2 tn
n
v. - - - - - - ' "
=: Vn
The matrix Vn is called Vandermonde matrix. For the determinant of Vn ,
we have
n n
det Vn = II II (ti - tj) ,
i=O j=i+l

which is proven in virtually every linear algebra textbook (see ,e.g., [61]). It
is different from zero exactly when the nodes to, ... ,tn are pairwise distinct
(in agreement with Theorem 7.1). However, the solution of the system
requires an excessive amount of computational effort when compared with
the methods that will be discussed below.
In addition, the Vandermonde matrices are almost singular in higher
dimensions n. Gaussian elimination without pivoting is recommended for
its solution, because pivoting strategies may perturb the structure of the
matrix (compare [51]). For special nodes, the above Vandermonde matrix
can easily be inverted analytically. In Section 7.2 we shall encounter an
example of this.
An alternative basis for the representation of the interpolation polyno-
mial is formed by the Lagrange polynomials La, . .. ,Ln. They are defined
as the uniquely determined interpolation polynomials Li E P n with
182 7. Interpolation and Approximation

1.2,---,---,---,---~-~-~-~--,

Figure 7.2. Lagrange polynomials Li for n = 4 and equidistant nodes ti.

The corresponding explicit form (compare Figure 7.2) is

L;(t) = rrn t - tj .
j~O'
t - tJ
#i
The interpolating polynomial P for arbitrary nodes fo, ... , fn can easily
be built from the Lagrange polynomials by superposition: With
n
P(t) := L fiLi(t) , (7.2)
i=O

we obviously have
n n

i=O i=O

Remark 7.2 The above statement can also be phrased as follows: The
Lagrange polynomials form an orthogonal basis of P n with respect to the
scalar product
n
(P, Q) := L P(ti)Q(t i )
i=O

for P,Q E P n . Let (P,L i ) = P(ti), then, obviously,


n n n

i=O i=O i=O

For practical purposes, the Lagrange representation (7.2) is computation-


ally too costly; however, in many theoretical questions it is advantageous.
An example of this is the determination of the condition number of the
interpolation problem.
7.1. Classical Polynomial Interpolation 183

Theorem 7.3 Let a :::; to < ... < tn :::; b be pairwise distinct nodes, and let
Lin be the corresponding Lagrange polynomials. Then the absolute condition
number K:abs of the polynomial interpolation

with respect to the supremum-norm is the Lebesgue constant


n

K:abs = An := max L
tE[a,b] i=O
ILin(t)1

for the nodes to, ... , tn.

Proof. The polynomial interpolation is linear, i.e., 1/ (f) (g) = ¢(g). We have
to show that IWII = An. For every continuous function f E C[a, b], we have
n n
1¢(f)(t)1 I L f(ti)Lin(t)1 :::; L If(ti)IILin(t)1
i=O i=O
n
Ilflloo max L
tE[a,b] i=O
ILin(t)l,

and thus K:abs :::; An. For the opposite direction, we construct a function
g E C[a, b] such that
n

1¢(g)(T)1 = Ilglloo max


tE[a,b] i=O
L ILin(t)1

for aTE [a, b]. For this let T E [a, b] be the place where the maximum is
attained, i.e.,
n n

and let g E C[a, b] be a function with Ilglloo = 1 and g(t i ) = sgnLi(T),


e.g., the piecewise linear interpolation function corresponding to the points
(ti' sgn Li (T)). Then, as desired
n n
1¢(g)(T)1 = L
i=O
ILin(T)1 = Ilglloo max L
tE[a,b] i=O
ILin(t)l,

and thus K:abs 2: An, and altogether K:abs = An· o


One easily computes that the Lebesgue constant An is invariant under
affine transformations (see Exercise 7.1), and therefore depends only on
the relative position of the nodes ti with respect to each other. In Table
7.1, An is given for equidistant nodes in dependence of n. Obviously, An
grows rapidly beyond all reasonable bounds. However, this is not true for
184 7. Interpolation and Approximation

any choice of nodes. For comparison, Table 7.1 also shows the Lebesgue
constants for the Chebyshev nodes (see Section 7.1.4)

ti = cos 2i + 1)
(2n+
--7["
2
.
for z = 0, ... ,n

(where the maximum was taken over [-1,1]). They grow only very slowly.

Table 7.1. Lebesgue constant An for equidistant and for Chebyshev nodes.
n An for equidistant nodes An for Chebyshev nodes
5 3.106292 2.104398
10 29.890695 2.489430
15 512.052451 2.727778
20 10986.533993 2.900825

7.1.2 Hermite Interpolation and Divided Differences


If one is only interested in the interpolating polynomial P at a single po-
sition t, then the recursive computation of P(t) turns out to be the most
effective method. It is based on the following simple observation, the Aitken
lemma.
Lemma 7.4 The interpolating polynomial P = P(f Ito,· .. , tn) satisfies
the recurrence relation
P(f I to, . .. , tn ) = (to - t)P(f I tl,·.·, tn) - (tn - t)P(f I to,.··, tn-d .
to - tn
(7.3)

Proof. Let <p(t) be defined as the expression on the right-hand side of (7.3).
Then <p E P n , and

<p (t t.) -- (to - ti)fi - (tn - ti)!; -- f.t ClOr z• -- 1 , ... , n - 1 .


to - tn
Similarly, it is simple to conclude that <p(to) = fa and <p(tn) = fn, and the
statement therefore follows. 0

The interpolation polynomials for only one single node are nothing else
than the constants
P(f Iti) = fi for i = 0, ... ,n.
If we simplify the notation for fixed t by
7.1. Classical Polynomial Interpolation 185

then the value Pnn = PC! Ito, ... , tn)(t) can be computed according to the
Neville scheme
Poo
~
P lO -> Pll

Pn-1,o -> -> Pn-1,n-l


~
PnO -> -> Pn,n-l -> Pnn
given by

Pia Ii for i = 0, ... ,n


t - ti
Pi k-l + (Pi k-l - Pi- 1 k-d for i ~ k. (7.4)
, ti - ti-k' ,
Example 7.5 Computation of sin 62° from the nodes
(50°, sin 50°), (55°, sin 55°), ... , (70°, sin 70°)
by the Aitken-Neville algorithm.

ti sin ti
50° 0.7660444
55° 0.8191520 0.~935027
60° 0.8660254 0.~847748 0.8830292
65° 0.9063078 0.~821384 0.8829293 0.8829493
70° 0.9396926 0.~862768 0.8829661 0.8829465 0.8829476
The recursive structure of the interpolation polynomials according to
Lemma 7.4 can also be utilized for the determination of the entire poly-
nomial PC! I to, ... ,tn). This is also true for the generalized interpolation
problem, where besides function values I(ti), also derivatives at the nodes
are given, the Hermite interpolation. For this we introduce the following
practical notation. With
a = to ::; h ::; ... ::; tn = b
we allow for the occurrence of multiple nodes in ~ := {ti};=O, ... ,n. If at a
node ti, the value I(ti) and the derivatives !'(ti), ... , I(k)(ti) are given up
to an order k, then ti shall occur (k+1)-times in the sequence~. The same
nodes are enumerated from the left to the right by
di := max{j I ti = ti-j} ,
e.g.,

ti
di
I t3 1 o 1 2 0 o 1
186 7. Interpolation and Approximation

By defining now for i = 0, ... ,n the linear mappings

then the problem of the Hermite interpolation can be phrased as follows:


Find a polynomial PEP n such that
fJ.i(P) = fJ.i(f) for all i = 0, ... , n. (7.5)
The solution P = P(f I to, ... ,tn ) E P n of the interpolation problem (7.5)
is called Hermite interpolation of f at the nodes to, ... ,tn. Existence and
uniqueness follows as in Theorem 7.l.
Theorem 7.6 For each function f E Cn[a, b] and each monotone sequence
a = to ::::: tl ::::: ... ::::: tn =b
of (not necessarily distinct) nodes, there exists a unique polynomial PEP n
such that
fJ.iP = fJ.d for all i = 0, ... , n .

Proof. The mapping


fJ. : P n -+ R n + l , P (fJ.o P , ... , fJ.nP)
°
f--+

is obviously linear and also injective. Now fJ.(P) = implies that P pos-
sesses at least n + 1 roots (counted with mUltiplicity), and it is therefore
the null-polynomial. Since dimP n = dim Rn+l = n+ 1, this implies again
the existence. 0

If all nodes are pairwise distinct, then we recover the Lagrange


interpolation
n
P(flto, ... ,tn ) = Lf(ti)Li .
i=O

If all nodes coincide, to = tl = ... = tn, then the interpolation polynomial


is the Taylor polynomial centered at t = to,

_~(t-to)j (j)
P(f I to,···, tn)(t) - L..- ., f (to), (7.6)
j=O J.
also called the Taylor interpolation.
Remark 7.7 An important application is the cubic Hermite interpolation,
where function values fa, h and derivatives fa, ff are given at two nodes
to, t l . According to Theorem 7.6, this determines uniquely a cubic polyno-
mial P E P 3 . If the Hermite polynomials H5, ... , H5 E P 3 are defined by
7.1. Classical Polynomial Interpolation 187

Hr(to) = 0,
Hg(to) = 0,

Hl(to) = 0,
then the polynomials

{HJ(t),Hr(t),Hg(t),Hl(t)}

form a basis of P 3 , the cubic Hermite basis, with respect to the nodes to, tl.
The Hermite polynomial corresponding to the values
{fo,f6,iI,fD is thus formally given by

P(t) = foHJ(t) + f~Hr(t) + iIHg(t) + f{Hl(t).


If an entire series to, ... , tn of nodes is given, with function values hand
derivatives fI, then on each interval [ti, t i +1], we can consider the cubic
Hermite interpolation and join these polynomials at the nodes. Because of
the data, it is clear that this piecewise defined function is C I . This kind of
interpolation is called locally cubic Hermite interpolation. We have already
seen an application in Section 4.4.2. When computing a solution curve by
a tangent continuation method we obtain solution points (Xi, Ai) together
with slopes x;.
In order to get an impression of the entire solution curve
from this discrete information, we had connected the points by their locally
cubic Hermite interpolation.

Similar to the Aitken lemma, the following recurrence relation is valid


for two distinct nodes ti "I t j .

Lemma 7.8 If ti "I t j , then the Hermite interpolation polynomial P =


P(f I to,··· ,tn ) satisfies

P = (ti - t)P(f I t l , ... , £j, ... ,t n ) - (t j - t)P(f I iI, ... , t;, ... ,tn)
4-0 '
where ~ indicates that the corresponding node is omitted. ("has to lift its
hat").

Proof. Verification of the interpolation property by inserting the definitions.


o
For the representation of the interpolation polynomial, we use the Newton
basis Wo, ... ,Wn of the space of polynomials P n:
i-I

Wi(t) := II (t - ti) , Wi E Pi.


j=O
188 7. Interpolation and Approximation

The coefficients with respect to this basis are the divided differences, which
we now define.
Definition 7.9 The leading coefficient an of the interpolation polynomial

P(f I to,···, tn)(t) = ant n + an_It n - 1 + ... + ao


of f corresponding to the (not necessarily distinct) nodes to ~ tl ~ ... ~ tn
is called nth divided difference of f at to, .. . , tn, and it is denoted by

[to, ... ,tnlJ := an .


Theorem 7.10 For each function f E en and for given (not necessarily
distinct) nodes to ~ ... ~ t n ) the interpolation polynomial P(f I to, ... , t n )
of f at to, ... ,tn is given by
n
P := ~)to, ... , tilJ . Wi .
i=O

If f E en+! , then

f(t) = P(t) + [to, ... , tn, tlJ· wn+!(t). (7.7)

Proof. We show the first statement by induction over n. The statement is


trivial for n = o. Thus let n > 0, and let
n-I
Pn- I := P(f Ito,···, tn-d = 2:)to, ... , tilf· Wi
i=O

be the interpolation polynomial of f at to, . .. , tn-I. Then the interpolation


polynomial Pn = P(f I to, ... ,tn ) of f at to, ... ,tn can be written in the
form
Pn(t) + an_Itn - 1 + ... + ao
[to, ... , tnlJ· t n
[to, ... , tnlJ· wn(t) + Qn-I(t) ,
with a polynomial Qn-I E P n - I . But
Qn-I = Pn - [to, ... , tnlJ . Wn
obviously satisfies the interpolation conditions for to, ... , tn-I, SO that
n-I
Qn-I = Pn- I = l)to, . .. , tilJ . Wi·
i=O

This proves the first statement. In particular, it follows that

Pn + [to, ... ,tn, tlJ . Wn+1


interpolates the function f at the nodes to, ... , tn and t, which proves (7.7).
o
7.1. Classical Polynomial Interpolation 189

From the properties of the Hermite interpolation one can immediately


deduce the following statements about the divided differences of f.
Lemma 7.11 The divided differences [to, ... , tnlf satisfy the following
properties (f E en):
(i) [to, ... , tnlP = 0 for all P E Pn- 1.

(ii) For multiple nodes to = ... = tn,


[to, ... , tnlf = f(n)(to)/n!. (7.8)
(iii) The following recurrence relation holds for ti i= tj:
[to,···, tn I f - [to, ... , t:, ... ,tnlf - [to, ... , ij, ... , tnlf
- . (7.9)
tj - ti

Proof. (i) is true, because the nth coefficient of a polynomial of degree less
than or equal to n - 1 vanishes. (ii) follows from the Taylor interpolation
(7.6) and (iii) from Lemma 7.8 and the uniqueness of the leading coefficient.
D

With properties (ii) and (iii), the divided differences can be computed
recursively from the function values and derivatives f(j) (td of f at the
nodes ti. We also need the recurrence relation in the proof of the following
theorem, which states a surprising interpretation of the divided differ-
ences: The nth divided difference of a function f E en with respect to the
nodes to, ... , tn is the integral of the nth derivative over the n-dimensional
standard simplex

~n := {s = (so, ... ,sn) E R n I +1 t


2=0
Si = 1 and Si 2: o} .
Theorem 7.12 Hermite-Genocchi formula. The nth divided difference of
a n-times continuously differentiable function f E en satisfies

[to, . .. , tnlf = J (t
En
fen)
2=0
Siti) ds. (7.10)

Proof. We prove the formula by induction over n. The statement is trivial


for n = O. The induction step from n to n + 1 is as follows: If all nodes
coincide, then the statement follows from (7.8). We can therefore assume
without loss of generality that to i= t n +1' We then have
190 7. Interpolation and Approximation

J
n

n+l
i=O
L 8i=1
i=O

J
n

i=l

n
1- LSi
i=l

J J
n
f(n+1) (to + 2:>i(ti - to) + Sn+1(tn+l - to)) ds
80=0
i=l

o
Corollary 7.13 Let g : Rn+1 -+ R be the mapping, which is given by the
nthdivided difference of a function f E en with

g(to, ... ,tn ) := [to, ... , tnlf .

Then g is continuous in its arguments ti. Furthermore, for all nodes to :::::
... ::::: tn, there exists aTE [to, tnl such that

(7.11)

Proof. The integral representation (7.10) yields immediately the continu-


ity; and (7.11) follows from the integral mean-value theorem, because the
volume of the n-dimensional standard simplex is vol(~n) = lin! . 0

For pairwise distinct nodes to < ... < tn, the divided differences can be
arranged similar to the Neville-scheme because of the recurrence relation
(7.9).
7.1. Classical Polynomial Interpolation 191

fa [tolf
'\.
fr [tllf -+ [to, tllf

fn-l = [tn-lJ! -+ -+ [to, ... ,tn-llf


'\.
fn [tnJ! -+ -+ [h, ... ,tnlf -+ [to, ... ,tnJ!
Example 7.14 We compute the interpolation polynomial corresponding
to the values
ti I0 1 2 3
fi 1 2 0 1
by employing Newton's divided differences,
f[tol =1
J[hl = 2 f[to, tIl = 1
J[t2l = 0 J[tl' t2l = -2 f[to, tl, t2l = -3/2
f[t3l = 1 J[t2,t3l = 1 J[tl,t2,t3l = 3/2 f[tO,tl,t2,t3l = 1,
i.e.,
(ao, aI, a2, (3) = (1,1, -3/2, 1).
The interpolation polynomial is therefore
P(t) = 1 + l(t - 0) + (-3/2)(t - O)(t -1) + l(t - O)(t -l)(t - 2)
= t 3 - 4.5 t 2 + 4.5 t + 1.
A further important property of the divided differences is the following
Leibniz formula.
Lemma 7.15 Let g, h E en, and let to ::; tl ::; ... ::; tn be an arbitrary
sequence of nodes. Then
n
[to, ... , tnlgh = ~)to, ... ,tilg . [ti, ... , tnlh.
i=O

Proof. First suppose that the nodes to, ... ,tn are pairwise distinct. Set
i-I n
Wi(t) := IT (t - tk) and Wj(t):= II (t - tl).
k=O 1=)+1

Then, according to Theorem 7.10, the interpolation polynomials P, Q E P n


of g, respectively, h are given by
n n
P = ~)to, ... , tdg· Wi and Q = L)tj , ... , tnlh· Wj .
i=O j=o
192 7. Interpolation and Approximation

The product
n
PQ = L [to, ... , tiJg [t j , ... , tnJh· WiWj
i,j=O
thus interpolates the function f := gh in to, ... , tn. Since Wi(tk)Wj(tk) = 0
for all k and i > j, it follows that
n

i,j=O
i5,j
is the interpolation polynomial of gh in to, ... , tn. As claimed, the leading
coefficient is
n
L[to, ... ,tiJg [ti, ... , tnJh.
i=O
For arbitrary, not necessarily distinct nodes ti, the statement now follows
from the continuity of the divided differences at the nodes ti. D

7.1.3 A pproximation Error


We shall now turn our attention toward the second requirement, namely,
the approximation property, and analyze to what extent the polynomials
P(f I to, ... ,tn ) approximate the original function f. By employing previ-
ously derived properties of the divided differences, it is simple to find a
representation of the approximation error.
Theorem 7.16 Suppose that f E C n+1 . Then for the approximation error
of the Hermite interpolation P(f I to, ... ,tn ) and t i , t E [a, bJ we have
f(n+l)(T)
f(t) - P(f I to, .. ·, tn)(t) = ( )' Wn+l(t) (7.12)
n+ 1 .
for some T = T(t) EJa, b[.

Proof. According to theorem 7.10 and theorem 7.13, we have for P .-


P(f I to, ... ,tn ) that

for some T EJa, b[. D

Example 7.17 In the case ofthe Taylor interpolation (7.6), i.e., to = ... =
tn, the error formula (7.12) is just the Lagrange remainder of the Taylor
expansion
f(n+l)(T) n+l
f(t) - P(f I to, .. ·, tn)(t) = ( )' (t - to) .
n+1.
7.1. Classical Polynomial Interpolation 193

If we consider the class of functions


F:= {J E Cn+1[a, b] I sup Ir+1(T)I::::; M(n + I)!}
TE[a,b]

for a constant M > 0, then the approximation error obviously depends


crucially on the choice of the nodes to, ... , tn via the expression
(7.13)
In the next section (see Example 7.20) we will show that, in the case of pair-
wise distinct nodes, the expression (7.13) can be minimized on an interval
[a, b],
max IWn+l(t)1 =min,
tE[a,b]

by choosing again the Chebyshev nodes for the nodes t;. We now turn our
attention to the question of whether the polynomial interpolation satisfies
the approximation property. For merely continuous functions I E C[a, b],
and the supremum-norm 11111 = SUPtE[a,b] I(t), the approximation error can
in principle grow beyond all bounds. More precisely, according to Faber,
for each sequence {Td of sets of nodes Tk = {tk,o, ... ,tk,nk} C [a, b], there
exists a continuous function I E C[a, b] such that the sequence {Pd of
the interpolation polynomials, which belong to the Tk, does not converge
uniformly to I.

7.1.4 Min-Max Property of Chebyshev Polynomials


In the previous chapter we have repeatedly mentioned the Chebyshev
nodes, for which the polynomial interpolation has particularly nice proper-
ties (such as bounded condition and optimal approximation of lover the
class F). The Chebyshev nodes are the roots of the Chebyshev polynomials,
which we already encountered in Chapter 6.1.1. In the investigation of the
approximation error, as well as in the condition analysis of the polynomial
interpolation, we were lead to the following approximation problem: Find
the polynomial Pn E P n of degree deg Pn = n with leading coefficient 1,
and the smallest supremum-norm over the interval [a, b], i.e.,
max /Pn(t) I = min. (7.14)
tE[a,b]

We shall now see that the Chebyshev polynomials Tn, which we already
encountered as orthogonal polynomials with respect to the weight function
w(x) = (1- X2)-~ over [-1,1], solve this min-max problem (up to a scalar
factor and an affine transformation). In order to show this, we first reduce
the problem to the interval [-1,1], which is suitable for the Chebyshev
polynomials, with the help of the affine mapping
194 7. Interpolation and Approximation

~
x: [a,b] -----7 [-1,1]
t-a 2t-a-b
X = x(t) = 2 - - -1 = - - -
b-a b-a'

whose inverse mapping is t : [-1, 1] ~ [a, b],


I-x l+x
t = t(x) = -2-a + -2-b.
If Pn E P n with deg P = n and leading coefficient 1 is the solution of the
min-max problem

max IPn(x) I = min,


XE[-l,l]

then Fn(t) := Pn(t(x)) is the solution of the original problem (7.14) with
leading coefficient 2n / (b - a) n .
In Example 6.3, we introduced the Chebyshev polynomials (see Figure
7.3) via

Tn(x) = cos(narccosx) for x E [-1,1]


and more generally for x E R via the three-term recurrence relation

Tk(X) = 2xTk-1 (x) - Tk-2(X) , To(x) = 1, TI (x) = x.


The following properties of the Chebyshev polynomials are either obvious

-1.5.!;-1---:-O~.8--0~.6--~0.4'-----~O.2:---:::------::0.2:--70.4;--70.6;--70.::-8~

Figure 7.3. Chebyshev polynomials To, . .. , T 4 •

or easily verified. In particular we can directly give the roots Xl, ... ,X n of
Tn(x), which are real and simple according to theorem 6.5 (see property 7
below).
7.1. Classical Polynomial Interpolation 195

Remark 7.18
1. The Chebyshev polynomials have integer coefficients.
2. The leading coefficient of Tn is an = 2n-l.
3. Tn is an even function if n is even, and an odd one if n is odd.

4. Tn(1) = I, Tn(-l) = (_l)n.


5. ITn(x)1 :s 1 for x E [-1,1].
6. ITn(x)1 takes on the value 1 at the Chebyshev abscissas
Xk = cos(k7r/n), i.e.,
k7r
ITn(x)l=l -{=} X=Xk=COS- forak=O, ... ,n.
n
7. The roots of Tn(x) are

2k - 1 )
Xk :=cos ( ~1f for k = 1, ... , n .

8. We have
cos(k arccos x) if -1:Sx:S1
11(x) ~ { cosh(karccoshx)
(-l)k cosh(karccosh( -x))
if x?l
if x:S -1
9. The Chebyshev polynomials have the global representation

Tk(X) = ~((x + Vx~l/ + (x - Vx2-=1/) for x E R.

Properties 8 and 9 are most easily checked by verifying that they satisfy
the three-term recurrence (including the starting values). The min-max
property of the Chebyshev polynomials follows from the intermediate-value
theorem:
Theorem 7.19 Every polynomial P n E P n with leading coefficient an =f. 0
attains a value of absolute value ? 1an 1/2 n - 1 in the interval [-1, 1]. In
particular the Chebyshev polynomials Tn (x) are minimal with respect to
the maximum-norm Ilflloo = maxxE[-l,l]lf(x)1 among the polynomials of
degree n with leading coefficient 2n - 1 .

Proof. Let P n E P n be a polynomial with leading coefficient an = 2n - 1 and


JPn (x) < 1 for x E [-1, 1]. Then Tn - P n is a polynomial of degree less
1

then or equal to n - 1. At the Chebyshev abscissas Xk := cos k: we have

T n (X2k) = 1, Pn (X2k) < 1 ===} P n (X2k) - T n (X2k) < 0


Tn(X2k+d = -1, Pn (x2k+d > -1 ===} Pn (X2k+d - T n (X2k+l) > 0;
i.e., at the n + 1 Chebyshev abscissae, the difference Tn - P n is alternating
between positive and negative and has therefore at least n roots in [-1, 1],
196 7. Interpolation and Approximation

which contradicts 0 #- Tn - Pn E P n - 1 . As a consequence, for each poly-


nomial Pn E P n with leading coefficient an = 2n - 1 there must exist an
x E [-1,1] such that IPn(X) 1 2': 1. For an arbitrary polynomial Pn E P n
with leading coefficient an #- 0, the statement follows from the fact that
Fn := 2 nan- 1 Pn is a polynomial with leading coefficient an = 2n-l. 0

Example 7.20 When minimizing the approximation error of the polyno-


mial interpolation, we try to find the nodes to, ... ,tn E [a, b], which solve
the min-max problem

max Iw(t)1 = max I(t - to) ... (t - tn)1 = min


tE[a,b] tE[a,b]

To put it differently, the goal is to determine the normalized polynomial


w(t) E P n + 1 with the real roots to, ... , tn, for which maXtE[a,b]lw(t)1 = min.
According to theorem 7.19, for the interval [a,b] = [-1,1], this is just the
(n + 1)st Chebyshev polynomial w(t) = T n+1 (t), whose roots

2i + 1 ) = 0, ... , n
ti = cos ( 2n + 27r for i

are just the Chebyshev nodes.

We shall now derive a second property of the Chebyshev polynomials,


which we will be useful in Chapter 8.

Theorem 7.21 Let [a, b] be an arbitrary interval, and let to rt. [a, b]. Then
the modified Chebyshev polynomial

TOn(t):= Tn(x(t)) with x (t ) := 2t- a - 1


- -
Tn(x(to)) b- a

is minimal with respect to the maximum-norm 1111100 = maXtE[a,b]11(t)1


among the polynomials P n E P n with Pn (to) = 1.

Proof. Since all roots of Tn(x(t)) lie in [a, b], we have c := Tn(x(to)) #- 0
and Tn is well defined. Furthermore Tn(tO) = 1, and ITn(t)1 :::; Icl- 1 for
all t E [a, b]. Suppose now that there is a polynomial Pn E P n such that
Pn(to) = 1, and IPn(t)1 < 1c1- 1 for all t E [a,b], then to is a root of the
difference Tn - Pn , i.e.,

Tn(t) - Pn(t) = Qn-l(t)(t - to) for a polynomial Qn-l E Pn- 1.

As in the proof of theorem 7.19, Qn-l changes sign at the Chebyshev


abscissae tk = t(Xk) for k = 0, ... , n and has therefore at least n distinct
roots in [a, b]. This contradicts 0 #- Qn-l E P n - 1 . 0
7.2. Trigonometric Interpolation 197

7.2 Trigonometric Interpolation


In this section we shall interpolate periodic functions by trigonometric poly-
nomials, i.e., linear combinations of trigonometric functions. Next to the
polynomials, this class is one of the most important as far as interpolation
is concerned, not least because there is an extremely effective algorithm for
the solution of the interpolation problem, the fast Fourier transform.
In Example 6.20, we have already encountered the algorithm of Goertzel
and Reinsch for the evaluation of a trigonometric polynomial

First, we define the N-dimensional spaces of trigonometric polynomials,


real as well as complex. In the real case it is necessary to distinguish between
the cases of even and odd N. The following considerations will therefore
be carried out mostly in the complex case only.
Definition 7.22 By T{j we denote the N-dimensional space of the
complex trigonometric polynomials
N-1
¢iN(t) = L cje ijt with Cj E C
j=O

of degree N - 1. The N-dimensional spaces T~ contain all real trigono-


metric polynomials cP N (t) of the form
n
cP2n+1 (t) = ~O + L(aj cosjt + bj sinjt), (7.15)
j=l

for odd N = 2n + 1, respectively,


n-1
cP2n t = 2 + ""
() ao 6 ( aj COS]t. + bj sm]t
..) + 2
an cos nt, (7.16)
j=l

for even N = 2n, where aj, bj E R.


The thus defined trigonometric polynomials are apparently 21f-periodic,
i.e., ip(t+21f) = ip(t). Hence, considering interpolation, we choose N distinct
nodes
o :s; to < t1 < ... < tN-1 < 21f
in the half-open interval [0, 21f[. Given nodal values fo, ... , fN-1 we look
for trigonometric polynomials cP N E TI!, respectively, Tft satisfying
(7.17)
Here, the nodes fj are of course complex in one case and real in the other.
The complex trigonometric interpolation may be easily derived from the
198 7. Interpolation and Approximation

standard (complex) polynomial interpolation since we have the bijection


T!! -----+ P N-1
N-1 N-1

¢(t) = 2.:>je ijt f-------> P(w) = 2.:>jw j .


h=O j=O
Hence, there is a unique polynomial ¢N E T!! satisfying the interpolation
condition (7.17). For equidistant nodes
tj = 27rj / N, j = 0, ... ,N - 1,
to which we shall restrict ourselves, we introduce the Nth unit roots Wj :=
e itj = e21rij/N. Thus for the coefficients Cj of the complex trigonometric
interpolation ¢N E T{i, we obtain the Vandermonde system:
1 Wo w02 N-1
Wo

[ 1 WN-1 w;" -1
v
N-1
WN _1

=: VN-1

Using the complex solution ¢N E T!! of a real interpolation problem, fj E


R, we can compute the real solution ¢ N E Tf{ by the following lemma.
Lemma 7.23 Let
N-1
¢N(t) = LCjeijt E T!!
j=O
be the solution of the interpolation problem (7.17) with real nodes fj E R
and equidistant nodes tj = 27rj / N. Then the real trigonometric polynomial
'l/JN E Tf{, given by the coefficients
aj = 2~cj = Cj + CN-j and bj = -28'cj = i(cj - CN-j)
from the representations (7.15), respectively, (7.16), also satisfies the
interpolation conditions, i. e., 'P N (tj) = fj·

Proof. We evaluate the polynomial at the N equidistant nodes tk' Since


e 21ri (N-j)/N = e-21rij/N, we have
N-1 N-1 N-1
¢N(tk) = L cje ijtk = ¢N(tk) = L cje- ijtk = L CN_je ijtk .
j=O j=O j=O
The unique interpolation property therefore implies that Cj = CN-j. In
particular, Co is real and for even N = 2n also C2n. For odd N = 2n + 1,
we obtain
2n n
Co + L cje ijtk = Co + L (cje ijtk + cje- ijtk )
j=l j=l
7.2. Trigonometric Interpolation 199

n n
Co +L 2fR(Cje ijtk ) = Co + L(2Rcj cosjt - 2CJcj sinjt).
j=l j=l
Hence, the real trigonometric polynomial with the coefficients

aj = 2Rcj = Cj + Cj = Cj + CN-j
and
bj = -2CJcj = i(cj - Cj) = i(cj - CN-j)
solves the interpolation problem. For even N = 2n, the statement follows
similarly. D

Lemma 7.23 shows the existence of a real trigonometric interpolating


polynomial. However, since the interpolation problem is linear and the
numbers of nodes and coefficients coincide, this also implies uniqueness.

In the case at hand, the Vandermonde matrix VN -1 can easily be inverted


analytically. In order to demonstrate this, we shall first show the orthonor-
mality of the basis functions 'l/Jj (t) := eijt with respect to the following
scalar product, which is given by the equidistant nodes tj,

(7.18)

Lemma 7.24 For the Nth unit roots Wj := e2rrijjN we have


N-1

L wjwjl = N r5kl .
j=O
In particular, the functions 'l/Jj (t) = eijt are orthonormal with respect to the
scalar product (7.18), i.e., ('l/Jk,'l/Jl) = r5kl.

Proof. The statement is obviously equivalent to


N-1

L w~ = Nr5 ok .
j=O
(Observe that wj = w~.) Now the Nth unit roots Wk are solutions of the
equation
N-1
0= w N - 1 = (w - 1)(w N - 1 + w N - 2 + ... + 1) = (w- 1) L w j .
j=O
N 1 .
If k =I- 0, then Wk =I- 1 and therefore I:j~ w~ = O. In the other case we
~N-l j
·
obVlOUS 1y h ave 6j=0 ~N-l1
Wo = 6j=0 = N. D
200 7. Interpolation and Approximation

With the help of this orthogonality relation, we can easily give the
solution of the interpolation problem.
Theorem 7.25 The coefficients Cj of the tTigonometric interpolation COT-
Tesponding to the N nodes (tk' ik) with equidistant nodes tk = 27rk/N,
z.e.,
N-l N-l
¢iN(td = L cje ijtk = L CjW~ = ik fOT k = 0, ... , N - 1,
j=O j=O
aTe given by
1 N-l .
Cj = N LfkWi;J fOT j = 0, ... , N-1.
k=O

Proof. We insert the given solution for the coefficients Cj and obtain
N-l
L
j=O
cjwl ~l (~ % ikwi;j) w{

1 N-l N-l .. 1 N-l


N L ik L wi;J wr = N L ikOkl N = fl'
k=O j=O k=O
o
Remark 7.26 For odd N = 2n + 1, and letting C_j := CN-j for j > 0, we
can rewrite the trigonometric interpolation polynomial in symmetric form:
N-l n

tpN-l(td = L cje ijtk = L cje ijtk .


j=O j=-n
In this form, it strongly resembles the truncated Fourier series
n
fn(t) = L j(j)e ijt
j=-n
of a 27r-periodic function f E L2(R) with coefficients

~
ior
27r
j(j) = (I, eijt ) = f(t)e- ijt dt. (7.19)
27r
In fact, the coefficients Cj can be considered as the approximation of the
integral in (7.19) by the trapezoidal sum (compare Section 9.2) with respect
to the nodes tk = 27rk/N. If we insert this approximation

ior
27r 27r ~l
g(t) dt ~ N 0 g(tk) (7.20)
k=O
7.2. Trigonometric Interpolation 201

into (7.19), then this yields


N-l N-l
J(j) ~ ~ L ike- ijtk = ~ L e-27rijk/N fk = Cj . (7.21 )
k=O k=O
Observe that the formula (7.20) is actually exact for trigonometric polyno-
mials 9 E Tij, and equality in (7.21) holds therefore also for f E Tij. For
this reason the isomorphism

with
N-l
Cj =~ L fke-27rijk/N for j = 0, ... ,N - 1
k=O
is called a discrete Fourier transform. The inverse mapping :r;/ is
N-l
fj = L cke27rijk/N for j = 0, ... , N - 1 .
k=O
The computation of the coefficients Cj from the values fj (or the other
way around) is in principle a matrix-vector multiplication, for which we
expect a cost of O(N 2 ) operations. However, there is an algorithm, which
requires only O(Nlog2 N) operations, the fast Fourier transform (FFT).
It is based on a separate analysis of the expressions for the coefficients Cj
for odd, respectively, even indices j, called the odd even reduction. This
way it is possible to transform the original problem into two similar partial
problems of half dimension.

Lemma 7.27 Let N = 2M be even and w e±27ri/N. Then the


trigonometric sums
N-l

Ctj = L fk wkj for j = 0, ... , N - 1


k=O
can be computed as follows, where ~ := w 2 and l = 0, ... ,M - 1 :
M-l

Ct21 L gke l with gk = fk + fk+M ,


k=O
M-l
L hk~kl with hk = Uk - fk+M )w k ;
k=O
i.e., the computation of the Ctj can be reduced to two similar problems of
half dimension M = N /2.
202 7. Interpolation and Approximation

Proof. In the even case j = 21, and because of w NI = 1, it follows that


N-l

(X21 L hw 2k1
k=O
N/2-1
L (hW2kl + fk+N/2 w 2 (k+N/2)1)
k=O
M-l

L (h + fk+M )(w 2 )kl .


k=O

Similarly, because of w N /2 = -1, we obtain for odd indices j = 21 + 1 ,

L
N-I
fk Wk (21+1)
k=O

L
N/2-I
(h wk (21+I) + fk+N/2 W (k+N/2)(21+1»)
k=O
M-I

L (h - fk+M )w k (w 2 l 1.
k=O

o
The lemma can be applied to the discrete Fourier analysis (h) f--+ (Cj),
as well as to the synthesis (Cj) f--+ (h). If the number N of the given points
is a power of two N = 2P , pEN, then we can iterate the process. This
algorithm due to W. Cooley and J. W. Tukey [85] is frequently called the
Cooley- Tukey algorithm. The computation can essentially be carried out
on a single vector, if the current number-pairs are overwritten. In the Al-
gorithm 7.28, we simply overwrite the input values fo, ... , iN-i' However,
here the order is interchanged in each reduction step because of the sep-
aration of even and odd indices. We have illustrated this permutation of
indices in Table 7.2. We obtain the right indices by reversing the order of
the bits in the dual-representation of the indices.
We therefore define a permutation rJ,

rJ:{O, ... ,N-l} -+ {O, ... ,N-l}


p-I p-I
Laj 2j f----7 Lap - l - j 2 j , ajE{O,I},
j=O j=O

which represents this operation, and which can be realized on a computer


at little cost by a corresponding bit-manipulation.
7.2. Trigonometric Interpolation 203

Table 7.2. Interchange of the indices of the fast Fourier transform for N = 8, i.e.,
p= 3.

k dual 1. reduction 2. reduction dual


0 000 0 0 000
1 001 2 4 100
2 010 4 2 010
3 011 6 6 110
4 100 1 1 001
5 101 3 5 101
6 110 5 3 011
7 111 7 7 111

Algorithm 7.28 Fast Fourier transform (FFT). From given input values
fo, ... , fN-l for N = 2P and w = e±27ri/N the algorithm computes the
transformed values ao, ... , aN-l with aj = L~=-Ol fkwkj.

N red := N;
z :=w;
while Nred > 1 do
Mred := N red /2;
for j := 0 to N/Nred -1 do
l := jNred ;
for k := 0 to M red - 1 do
a := flH + flH+M,ed;
flH+M,ed := (flH - izH+M,eJZ k ;
fl+k := a;
end for
end for
N red := M red ;
z:= z2;
end while
for k := 0 to N - 1 do
aa(k) := fk
end for

In each reduction step, we need 2 . 2P = 2N multiplications, where the


evaluation of the exponential function counts for one multiplication (recur-
sive computation of cosjx, sinjx). After p = log2 N steps, all ao, . .. , aN-l
are computed at the cost of 2N log2 N multiplications.
204 7. Interpolation and Approximation

7.3 Bezier Techniques


The topics, which have so far been presented in this chapter belong to the
classical part of Numerical Analysis, as the names Lagrange and Newton
indicate. With the increasing importance of computer-aided construction,
new ground has recently been broken (i.e., in the last thirty years) in inter-
polation and approximation theory, which we shall indicate in this section.
It is interesting that geometric aspects gain a decisive importance here. A
curve or surface has to be represented on a computer in a way that it can
be drawn and manipulated quickly. In order to achieve this, parametriza-
tions of the geometric objects are used, whose relevant parameters have
geometric meaning.
In this introduction we can only illustrate these considerations in the
simplest situations. In particular, we shall restrict ourselves to polynomial
curves, i.e., one-dimensional geometric objects. The book by C. de Boor
[15] and the newer textbook by G. Farin [30] are recommended to those
who want to familiarize themselves in more detail with this area.
We start with a generalization of real-valued polynomials.

Definition 7.29 A polynomial (or a polynomial curve) of degree n in Rd


is a function P of the form
n
P :R ---7 R d , P(t) = L aiti with ao, ... , an E R d , an i- 0.
i=O

The space of polynomials of degree less than or equal to n in R d is denoted


by P~.

The most interesting cases for us are the curves in space (d = 3) or in


the plane (d = 2). If {Po, ... , Pn } is a basis of P n and {el,"" ed} the
standard basis of R d, then the polynomials

{eiPj I i = 1, ... , d and j = 0, ... , n}

form a basis of P~. The graph rp of a polynomial P E P~

can now again be considered as a polynomial rp E p~+l. If P is given in


coefficient representation

then
7.3. Bezier Techniques 205

7.3.1 Bernstein Polynomials and Bezier Representation


So far, we have considered three different bases of the space Pn of
polynomials of degree less than or equal to n:
(a) Monomial basis {I, t, t 2 , ... , tn},

(b) Lagrange basis {Lo(t), ... , Ln(t)},

(c) Newton basis {wo(t), ... ,Wn (t)}.

The last two bases are already oriented toward interpolation and depend
on the nodes to, ... , tn. The basis polynomials, which we shall now present,
apply to two parameters a, b E R. They are therefore very suitable for the
local representation of a polynomial. In the following, the closed interval
between the two points a and b is denoted by [a, b] also when a > b, i.e.,
(compare Definition 7.37)

[a,b]:= {x = ),a+ (1- )')b I ), E [0, In.

The first step consists of an affine transformation onto the unit interval
[0,1]'
[a, b] ---7 [0,1]
t- a
t f-----7), = ),(t) := b _ a' (7.22)

with the help of which we can usually restrict our consideration to [0, 1].
By virtue of the binomial theorem, we can represent the unit function as

The terms of this partition of unity are just the Bernstein polynomials with
respect to the interval [0, 1]. By composing these with the above affine trans-
formation (7.22), we then obtain the Bernstein polynomials with respect
to the interval [a, b].
Definition 7.30 The ith Bernstein polynomial (compare Figure 7.4) of
degree n with respect to the interval [0, 1] is the polynomial Bi E P n with

where i = 0, ... , n. Similarly, Bi(' ; a, b) E P n with

t -
Bi(t; a, b) := Bi(),(t)) = Bi ( b _ a
a) = (b _ a)n
1 (n)
i
. .
(t - a)'(b - t)n-,

is the ith Bernstein polynomial of degree n with respect to the interval


[a, b].
206 7. Interpolation and Approximation

Instead of Bi(t; a, b), we shall in the following often simply write Bi(t),
if confusion with the Bernstein polynomials Bi(A) with respect to [0,1] is
impossible. In the following theorem we list the most important properties

0.6

Figure 7.4. Bernstein polynomials for n = 4.

of the Bernstein polynomials.


Theorem 7.31 The Bernstein polynomials Bi(A) satisfy the
following properties:
1. A = ° is a multiplicity i root of Bi.
2. A = 1 is a multiplicity (n - i) root of Bi.
3. Bi(A) = B:;;;_;(1- A) for i = 0, ... , n (symmetry).

4. (1 - A)B[f = B~+l and AB:;;; = B~ti.


5. The Bernstein polynomials Bi are nonnegative on [0,1 J and form a
partition of unity, i. e.,

°
n
Bi(A) 2: for A E [0,1] and LBi(A) = 1 for A E R.
;=0
6. Bi has exactly one maximum value in the interval [0,1 J, namely, at
A = i/n.
7. The Bernstein polynomials satisfy the recurrence relation
(7.23)
for i = 1, ... , n and A E R.
7.3. Bezier Techniques 207

8. The Bernstein polynomials form a basis B := {Bo, ... , B~} of P n.

Proof. The first five statements are either obvious or can be easily verified.
Statement 6 follows from the fact that

d~ Bf()..) = (7) (1- )..t- i - 1 )..i-l(i - n)..)

for i = 1, . .. , n. The recurrence relation (7.23) follows from the definition


and the formula

for the binomial coefficients. For the last statement, we show that the n+ 1
polynomials Bf are linearly independent. If
n
0= L biBf()..),
i=O

then according to 1 and 2,


n
0= L biBf(l) = bnB~(l) = bn
i=O

and therefore inductively bo = ... = bn = O. D

Similar statements are of course true for the Bernstein polynomials with
respect to the interval [a, b]. Here the maximum value of Bi(tj a, b) in [a, b]
is attained at
i
t=a+-(b-a).
n
Remark 7.32 The property that the Bernstein polynomials form a par-
tition of unity is equivalent to the fact that the Bezier points are affine
invariant. If ¢ : Rd -> Rd is an affine mapping,
¢ : Rd ---+ Rd with A E Matd(R) and v E Rd
U 1-----4 Au + v,
then the images ¢(bi ) of the Bezier points bi of a polynomial P E P~ are
the Bezier points of ¢ 0 P.
We now know that we can write any polynomial P E P~ as a linear
combination with respect to the Bernstein basis
n
P(t) = L biBf(tj a, b), bi E Rd. (7.24)
i=O
208 7. Interpolation and Approximation

Remark 7.33 The symmetry Bi()..) = B~_i(l - )..) of the Bernstein


polynomials yields in particular
n n

i=O i=O

i.e., the Bezier coefficients with respect to b, aare just the ones of a, bin
reverse order.
The coefficients bo, ... , bn are called control or Bezier points of P, the
corresponding polygonal path a Bezier polygon. Because of

the Bezier points of the polynomial P(t) = t are, just the maxima bi =
*
a + (b - a) of the Bernstein polynomials. The Bezier representation of the
graph fp of a polynomial P as in (7.24) is therefore just

fp(t) - P(t) - bit:o'


_( t )_~(a+*(b-a)) n .
Bi (t, a, b). (7.25)

In Figure 7.5 we have plotted the graph of a cubic polynomial together

········· ..P(t)....

Figure 7.5. Cubic polynomial with its Bezier points.

with its Bezier polygon. It is striking that the shape of the curve is closely
related to the shape of the Bezier polygon. In the following we shall more
closely investigate this geometric meaning of the Bezier points. First, it
is clear from Theorem 7.31 that the beginning and ending points of the
polynomial curve and the Bezier polygon coincide. Furthermore, it appears
that the tangents at the boundary points also coincide with the straight
lines at the end of the Bezier polygon. In order to verify this property,
we compute the derivatives of a polynomial in Bezier representation. We
shall restrict ourselves to the derivatives of the Bezier representation with
respect to the unit interval [0,1]. Together with the derivative of the affine
7.3. Bezier Techniques 209

transformation >.(t) from [a, b] onto [0,1],


d 1
-d >.(t)
t
= -b-a
-,

one immediately obtains the derivatives in the general case also.


Lemma 7.34 The derivative of the Bernstein polynomials Bf(>.) with
respect to [0, 1] satisfies

for i =0
for i = 1, ... , n - 1
for i = n.
Proof. The statement follows from

:>. Bf(>.) = (~) [i(l- >.)n-i >.i-l - (n - i)(l - >.)n-i-l >.i]

by virtue of the identities of Theorem 7.31. o


Theorem 7.35 Let P(>.) = 2::7=0 biBf(>.) be a polynomial in Bbier
representation with respect to [0,1]. Then the kth derivative of P satisfies
, n-k
p(k)(>.) = n. """"' D,kbB n - k (>.)
(n - k)! L.." t t ,
t=O

where the forward difference operator D, operates on the lower index, i. e.,
D,lb i := bH1 - bi and D,kb i := D,k-1b H1 - D,k-1b i for k> 1.

Proof. Induction over k, see Exercise 7.6. o


Corollary 7.36 For the boundary points>' = 0,1 one obtains the values
(k) _ n! k (k) _ n! k
P (0) - (n _ k)! D, bo and P (1) - (n _ k)! D, bn-k,

thus in particular up to the second derivative,

(a) P(O) = bo and P(l) = bn ,

(b) P'(O) = n(b 1 - bo ) and P'(l) = n(bn - bn-d,

(c) P"(O) = n(n - 1)(b 2 - 2b 1 + bo) and P"(l) = n(n - l)(b n - 2bn - 1 +
bn - 2 ).

Proof. Note that B~-k (0) = 60,i and B~-k (1) = 6n -k,i. o
210 7. Interpolation and Approximation

Corollary 7.36 therefore confirms the geometric observations that we


described above. It is important that at a boundary point, the curve is
determined up to the kth derivative by the k closest Bezier points. This
property will be crucial later on for the purpose of joining several pieces
together. The Bezier points are of further geometric importance. In order
to describe this, we need the notion of a convex hull of a set A c R d, which
we shall briefly review.
Definition 7.37 A set A c R d is called a convex, if together with any two
points x, YEA, the straight line, which joins them is also contained in A,
i.e.,
[x,y]:= {Ax + (1- A)y I A E [0, In c A for all x,y E A.
The convex hull co(A) of a set A C Rd is the smallest convex subset of R d,
which contains A. A linear combination of the form
k k
X = LAixi, with Xi E R d , Ai 2: 0 and LAi = 1
i=l i=l
is called convex combination of Xl, ... ,Xk.
Remark 7.38 The convex hull co(A) of A c Rd is the set of all convex
combinations of points of A, i.e.,

I" B
t
co(A) n{B c Rd convex with A C B}

{ X= ~ Aixi I mEN, Xi E A, Ai 2: 0, Ai = 1 } .

The following theorem states that a polynomial curve is always contained


in the convex hull of its Bezier points.
Theorem 7.39 The image P([a, b]) of a polynomial P E P~ in Bernstein
representation P(t) = L~=o biB['(t; a, b) with respect to [a, b] is contained
in the convex hull of the Bbier points bi , i.e.,
P(t) E co(b o , .. . , bn ) for t E [a, b].
In particular, the graph of the polynomial for t E [a, b] is contained in the
convex hull of the points bi.

Proof. On [a, b], the Bernstein polynomials form a nonnegative partition of


unity, i.e., B['(t; a, b) 2: 0 for t E [a, b] and L~=o B['(t) = 1. Therefore
n
P(t) = L biB['(t; a, b)
i=O
is a convex combination of the Bezier points bo, .. . , bn . The second state-
ment follows from the Bezier representation (7.25) of the graph r p of P.
o
7.3. Bezier Techniques 211

As one can already see in Figure 7.5, for a cubic polynomial P E P3,
this means that the graph of P for t E [a, b] is completely contained in the
convex hull of the four Bezier points b I , b 2 , b 3 , and b 4 . The name control
point is explained by the fact that, because of their geometric significance,
the points b i can be used to control a polynomial curve. Because of The-
orem 7.31, at the position ..\ = i/n, over which it is plotted, the control
point bi has the greatest "weight" Bf(..\). This is another reason that the
curve between a and b is closely related to the Bezier polygon, as the figure
indicates.

7.3.2 De Casteljau Algorithm


Besides the geometric interpretation of the Bezier points, the importance of
the Bezier representation rests mainly on the fact that there is an algorithm,
which builds on continued convex combinations, and which, besides the
function value P(t) at an arbitrary position t, also yields information about
the derivatives. Furthermore, the same algorithm can be used to subdivide
the Bezier curve into two segments. By repeating this partitioning into
segments, the sequence of the Bezier polygons converges extremely fast to
the curve (exponentially when dividing the interval into halves), so that
this method is very well-suited to effectively plot a curve, e.g., in computer
graphics. This construction principle is also tailor-made to control a milling
cutter, which can only "remove" material. We start with the definition of
the partial polynomials of P.

Definition 7.40 Let P("\) = L~=o biBf(..\) be a polynomial in Bezier rep-


resentation with respect to [0,1]. Then we define the partial polynomials
b7 E P% of P for i = 0, ... , n - k by
k i+k
b7(..\) := L bi+jBj(..\) = L bjBj_i(..\)·
j=O j=i
For a polynomial P(t) = L~=o biBf(t; a, b) in Bezier representation with
respect to [a, b], the partial polynomials b7 are similarly defined by
k
b7(t; a, b) := b7(..\(t)) = L bi+jBj(t; a, b).
j=O
Thus the partial polynomial b7 E P% is just the polynomial, which is
defined by the Bezier points bi , ... , bi+k (see Figure 7.6). If no confusion
arises, then we simply write b7(t) for b7(t; a, b). In particular, b'O(t) = P(t)
is the starting polynomial, and the b? (t) = bi are its Bezier points for all
t E R. Furthermore, for all boundary points,

b~(a) = bi and b~(b) = bi+k .


212 7. Interpolation and Approximation

..---,{-n-....
... ............. bi (t )
\\b 6(t)
b6 (t) , //b~(t )

Figure 7.6. Cubic polynomial with its partial polynomials.

Similar to the Aitken lemma, the following recurrence relation is true, which
is the base for the algorithm of de Casteljau.

Lemma 7.41 The partial polynomials bf(t) of P(t) = L~o biBf(t; a, b)


satisfy the recurrence relation

for k = 0, ... ,n and i = 0, ... ,n - k.

Proof. We insert the recurrence relation (7.23) into the definition of the
partial polynomials bf and obtain
k
bf LbHjBJ
j=O
k-1
biBg + bHkBZ + L bHjBJ
j=l
k-1
bi (l - >.)B6k- 1) + bHk>'B~=i +L bi+j ((1 - >')BJ-1 + >.BJ=l)
j=l
k-1 k
LbHj (l- >')BJ-1 + Lbi+j>.BJ=f
j=O j=l
( 1 - >.)b k
t
- 1 + >.b k - 1
,+1 .
o
Because of b~(t) = bi , by continued convex combination (which, for t rt.
[a, b] is only an affine combination) we can compute the function value
P(t) = bo(t) from the Bezier points. The auxiliary points bf can, similar
7.3. Bezier Techniques 213

to the scheme of Neville, be arranged in the de Casteljau scheme.


bn bO
n
'\.
bn - 1 b~_l -+ b~_l

b1 bO -+ -+ bn - 1
1 1
'\. '\.
bo bO -+ -+ bn - 1 -+ b0n
0 0

In fact, the derivatives of P are hidden behind the auxiliary points b~ of


de Casteljau's scheme, as the following theorem shows. Here we again only
consider the Bezier representation with respect to the unit interval [0, 1].
Theorem 7.42 Let P(>..) = I:~=o biBi(>") be a polynomial in Bezier repre-
sentation with respect to [0,1]. Then the derivatives p(k) (>..) for k = 0, ... , n
can be computed from the partial polynomials b~ (>..) via the relation

p(k)(>..) = n! /}.kb n - k (>..)


(n-k)! 0 ,

Proof. The statement follows from Theorem 7.35 and from the fact that
the forwards difference operator commutes with the sum:
, n-k ,n-k
n. """ k n-k ( ) n. k """ n-k ( )
(n - k)! L /}. biBi >.. = (n _ k)!/}. L biBi >..
,=0 ,=0

n! /}.kb n - k (>..)
(n - k)! 0 .

o
Thus the kth derivative p(k) (>..) at the position>.. is computed from the
(n - k)th column of the de Casteljau scheme. In particular,

pet) b~ ,
P'(t) n(b n - 1 _ bn - 1)
1 0'
pI! (t) n(n - 1)(b~-2 - 2b~-2 + b~-2) .
So far we have only considered the Bezier representation of a single poly-
nomial with respect to a fixed reference interval. Here the question remains
open on how the Bezier points are transformed when we change the ref-
erence interval (see Figure 7.7). It would also be interesting to know how
to join several pieces of polynomial curves continuously or smoothly (see
Figure 7.8). Finally, we would be interested in the possibility of subdi-
viding curves, in the sense that we subdivide the reference interval and
214 7. Interpolation and Approximation

bl
/·············aT··· ......... ~~...

al ;' '\

/ a3 \\

/ \
~ \

bo = ao b3

Figure 7.7. Cubic polynomial with two Bezier representations.

all'
\.
I \.
;
/ .. .......'..
\.
;
; \.
·c~'
;
;
!

ao

Figure 7.8. Two cubic Bezier curves with CI-smoothness at a3 = Co.

b2
b.,t_·_··-···---·~-~·=-~~-··--:-;.·.l.···-·-·-········.
.r;.~.............. - \.
;
;
;

/
al//
/
/
!
ao = bo
Figure 7.9. Subdivision of a cubic Bezier curve.

compute the Bezier points for the subintervals (see Figure 7.9). According
to Theorem 7.39, the curve is contained in the convex hull of the Bezier
points. Hence, it is clear that the Bezier polygons approach more and more
closely the curve, when the subdivision is refined. These three questions
are closely related, which can readily be seen from the figures. We shall see
that they can be easily resolved in the context of the Bezier technique. The
connecting elements are the partial polynomials. We have already seen in
7.3. Bezier Techniques 215

Corollary 7.36 that, at a boundary point, a Bezier curve P is determined


to the kth derivative by the k closest Bezier points. The opposite is also
true: The values of P up to the kth derivative at the position ,\. = 0 already
determine the Bezier points bo, ... , bk. More precisely, this is even true for
the partial polynomials b8(,\.), ... , b~('\'), as we shall prove in the following
lemma.
Lemma 7.43 The partial polynomial b~('\') of a Bezier curve P(,\.)= bo ('\')
= 2:~obiBT:('\') is completely determined by the values of P up to and
including the kth derivative at the position ,\. = O.

Proof. According to Theorem 7.42, the derivatives at the position ,\. =0


satisfy

..:!!...-k _ k! 1 _(n-l)!k!..:!!...-bn()
d,\.l bo(O) - (k _l)!.6. bo - (k -l)! n! d,\.l 0 0
for l = 0, ... , k. The statement follows, because a polynomial is completely
determined by all derivatives at one position. 0

Together with Corollary 7.36, we obtain the following theorem:


Theorem 7.44 Let P(t) = ao(t; a, b) and Q(t) = bo(t; a, c) be two Bezier
curves with respect to a, b, respectively, a, c. Then the following statements
are equivalent:
(i) P(t) and Q(t) coincide at the position t = a up to the kth derivative,
~.e.,

p(l)(a)=Q(l)(a) for l=O, ... ,k.


(ii) a~(t; a, b) = b~(t; a, c) for all t E R.
(iii) ab(t; a, b) = bb(t; a, c) for all t E Rand l = 0, ... , k.
(iv) al = bb(b; a, c) for l = 0, ... , k.

Proof. We show (i) {o} (ii) =? (iii) =? (iv) =? (ii). According to Corollary
7.36 and Lemma 7.43, the two curves P(t) and Q(t) coincide at the position
t = a up to the kth derivative, if and only if they have the same partial
polynomials a~(t; a, b) = b~(t; a, c). The two first statements are therefore
equivalent. If a~ and b~ coincide, then so do their partial polynomials ab
and bb for l = 0, ... ,k; i.e., (ii) implies (iii). By inserting t = b into (iii), it
follows in particular that
al = aUl) = ab(b; a, b) = bb(b; a, c),
and therefore (iv). Since a polynomial is uniquely determined by its Bezier
coefficients, (iv) therefore implies (ii) and thus the equivalence of the four
statements. 0
216 7. Interpolation and Approximation

With this result in hand, we can easily answer our three questions. As a
first corollary we compute the Bezier points that are created when subdi-
viding the reference interval. At the same time, this answers the question
regarding the change of the reference interval.
Corollary 7.45 Let
ao(t;a,b) = bo(t;a,c) = co(t;b,c)
be the Bezier representations of a polynomial curve P(t) with respect to the
intervals [a, b], [a, c] and [b, c], i.e.,
n n n

i=O i=O i=O

(see Figure 7.9). Then the Bezier coefficients ai and Ci of the partial curves
can be computed from the Bezier coefficients bi with respect to the entire
interval via

for k = 0, ... ,n.


Proof. Because a polynomial of degree n is completely determined by its
derivatives at one point, the statement follows from Theorem 7.44 for k = n
and the symmetry of the Bezier representation, see Remark 7.33. 0

Since the curve pieces always lie in the convex hull of their Bezier points,
the corresponding Bezier polynomials converge to the curve when continu-
ously subdivided. By employing this method, the evaluation of a polynomial
is very stable, since only convex combinations are computed in the algo-
rithm of de Casteljau. In Figure 7.10, we have always divided the reference
interval of a Bezier curve of degree 4 in half, and we have plotted the Bezier
polygon of the first three subdivisions. After only a few subdivisions, it is
almost impossible to distinguish the curve from the polygonal path.
If we do utilize the fact that only the derivatives at one position must co-
incide, then we can solve the problem of continuously joining two polygonal
curves:
Corollary 7.46 A joined Bezier curve

R(t) = { ao(t; a, b) if a:::; t < b


co(t;b,c) if b:::; t :::; c
is C k -smooth, if and only if
cl=a~_l(c;a,b) for l=O, ... ,k
or, equivalently,
an-l = c~(a; b, c) for l = 0, ... , k.
7.3. Bezier Techniques 217

........................................................................ b2

bo
4

Figure 7.10. Threefold subdivision of a Bezier curve of degree n = 4.

Therefore, through the Ck-smoothness, the first k + 1 Bezier points of


the second partial curve are determined by the last k + 1 Bezier points of
the first and vice versa. A polynomial ao(t; a, b) over [a, b] can therefore be
continued Ck-smoothly by a polynomial co(t; b, c) over [b, c], by determining
the Bezier points co, ... ,Ck according to Corollary 7.45 by employing the
algorithm of de Casteljau, whereas the remaining Ck+l, ... , Cn can be chosen
freely.
In particular, the joined curve R( t) is continuous, if and only if

It is continuously differentiable, if and only if, in addition,


Cl a~_l(c;a,b)
1 C- a
a n - l (,\) = (1 - ,\)an-l + '\a n with ,\ = - -
b-a
or, equivalently,
an-l c6(a;b,c)
a-b
c6(f.L) = (1 - f.L)co + f.LCl with f.L = - - .
c-b
This implies that
c-b b-a
an = Co = --an-l
c-a
+ --Cl
c-a
, (7.26)

i.e., the point an = Co has to divide the segment [an-l, Cl] in the proportion
c - b to b - a. If the pieces of curves fit C 2 -smoothly, then a n -2, an-l and
an describe the same parabola as co, Cl and C2, namely, with respect to
[a, b], respectively, [b, c]. According to Corollary 7.46, the Bezier points of
this parabola with respect to the entire interval [a, c] are an -2, d and C2,
where d is the auxiliary point
d := a~_2(c; a, b) = a;'_2('\) = ci(a; b, c) = cUf.L)
218 7. Interpolation and Approximation

(see Figure 7.11). Furthermore, according to Corollary 7.46, it follows from

d
.~};""""'/""""""""
a) ......···

/
;
;
;
;

ao
,,"0.'.'.'.'.'.'.'.'.-·'·

C2 C3

Figure 7.11. Two cubic Bezier curves with C 2 -transition at a3 = Co.

the C 2 -smoothness that


C2 = a;_2(A) = (1- A) a~_2(A) +Aa~_l(A)
'-.,----' '-.,----'
=d = Cl

and
an -2 = c6(1L) = (1-1L) C6(1L) +IL CUIL) .
"--v--" ~
= an-l =d
The joined curve is therefore C 2 -smooth, if and only if there is a point d
such that
C2 = (1 - A)d + ACI and an -2 = (1 - lL)a n - l + ILd.
The auxiliary point d, the de Boor point, will play an important role in the
next section in the construction of cubic splines.

7.4 Splines
As we have seen, the classical polynomial interpolation is incapable of solv-
ing the approximation problem with a large number of equidistant nodes.
Polynomials of high degree tend to oscillate a lot, as the sketches of the La-
grange polynomials indicate (see Figure 7.2). They may thus not only spoil
the condition number (small changes of the nodes Ii induce large changes
of the interpolation polynomial P(t) at intermediate values t =F t i ), but also
lead to large oscillations of the interpolating curve between the nodes. As
one can imagine, such oscillations are highly undesirable. One need only
think of the induced vibrations of an airfoil formed according to such an
interpolation curve. If we require that an interpolating curve passes "as
7.4. Splines 219

smooth as possible" through given nodes (ti' ji), then it is obvious to lo-
cally use polynomials of lower degree and to join these at the nodes. As
a first possibility, we have encountered the cubic Hermite interpolation in
Example 7.7, which was, however, dependent on the special prescription of
function values and derivatives at the nodes. A second possibility are the
spline junctions, with which we shall be concerned in this chapter.

7.4.1 Spline Spaces and B-Splines


We start with the definition of kth order splines over a grid ~ =
{to, ... ,tl+d of node points. These functions have proven to be an ex-
tremely versatile tool, from interpolation and approximation and the
modeling in CAGD to collocation and Galerkin methods for differential
equations.

Definition 7.47 Let ~ = {to, ... , tl+d be a grid of 1+2 pairwise distinct
node points

a = to < tl < ... < tl+l = b.


A spline of degree k - 1 (order k) with respect to ~ is a function s E
C k - 2 [a, b], which on each interval [ti' tHll for i = 0, ... ,I coincides with
a polynomial Si E P k - 1 of degree <::: k - 1. The space of splines of degree
k - 1 with respect to ~ is denoted by Sk,t-,..

The most important spline functions are the linear splines of order k = 2
(see Figure 7.12) and the cubic splines of order k = 4 (see Figure 7.13). The
linear splines are the continuous, piecewise linear functions with respect to
the intervals [ti' ti+ll. The cubic splines are best suited for the graphic

So

a = to

Figure 7.12. Linear splines, order k = 2.

representation of curves, since the eye can still recognize discontinuities of


220 7. Interpolation and Approximation

s
So

a = to

Figure 7.13. Cubic splines, order k = 4.

curvature, i.e., of the second derivative. Thus the C 2 -smooth cubic splines
are recognized as "smooth."
It is obvious that Sk,L:> is a real vector space, which, in particular, contains
all polynomials of degree::; k - 1, i.e., Pk-l C Sk,L:>. Furthermore, the
truncated powers of degree k,

if t ? ti
if t < ti
are contained in Sk,L:>. Together with the monomials 1, t, ... ,t k - l , they
form a basis of Sk,L:> , as we shall show in the following theorem:

Theorem 7.48 The monomials and truncated powers form a basis

(7.27)

of the spline space Sk,L:>. In particular, the dimension of Sk,fl. zs

dimSk,L:> = k + l.

Proof. We first show that one has at most k + l degrees of freedom for the
construction of a spline s E Sk,L:>. On the interval [to, tIl, we can choose
any polynomial of degree::; k - 1; these are k free parameters. Because of
the smoothness requirement s E C k - 2 , the polynomials on the following
intervals [tl, t2]' ... , [tl' t£+l] are determined by their predecessor up to one
parameter. Thus, we have another l parameters. Therefore dim Sk,L:> ::; k+l.
The remaining claim is that the k+l functions in B are linearly independent.
To prove this, let
k-l 1
s(t) := L ai ti +L Ci(t - ti)~-l = 0 for all t E [a, b].
i=O i=l
7.4. Splines 221

By applying the linear functionals

Gi(f) := (k ~ I)! (J(k-1)(tn - f(k-1) (r:))

to s (where f(t+) and f(r) denote the right, respectively, left-sided limits),
then for all i = 1, ... ,l, it follows that
k-1 I
0= Gi(s) = G i (L aje) + L Cj Gi(t - tj)~-l = Ci·
j=O j=l ~
~ =Oij
=0
k 1 .
Thus s(t) = Li:O ait' = 0 for all t E [a, b], and therefore also ao = ... =
~-1=0. D
However, the basis B of Sk,~ given in (7.27) has several disadvantages.
For one, the basis elements are not local; e.g., the support of the monomials
t i is the whole of R. Second, the truncated powers are "almost" linearly
dependent for close nodes ti, ti+1. This results in the fact that the evaluation
of a spline in the representation
k-1 I
s(t) = L ai ti + L Ci(t - td~-l
i=O i=l

is poorly conditioned with respect to perturbations in the coefficients Ci.


Third, the coefficients ai and Ci have no geometric meaning-unlike the
Bezier points bi . In the following, we shall therefore construct a basis for
the spline space Sk,~, which has properties as good as the Bernstein basis
has for P k . In order to achieve this, we define recursively the following
generalization of the characteristic function X[Ti,Ti+d of an interval and
also the "hat function" (see Figure 7.14).
Definition 7.49 Let T1 ::; ... ::; Tn be an arbitrary sequence of nodes.
Then the B-splines Nik(t) of order k for k = 1, ... , nand i = 1, ... , n - k
are recursively defined by

X[T;,Ti+d
(t) = {I
0
if
else
Ti ::; t < Ti+ 1
' (7.28)

t - Ti Ti+k - t
- - - - Ni,k-1 (t) + Ni+1,k-1 (t) . (7.29)
Ti+k-1 - Ti Ti+k - Ti+1
Note that the characteristic function in (7.28) vanishes if the nodes coincide,
i.e.,
Nil = X[Ti,THd = 0 if Ti = Ti+1 .
The corresponding terms are omitted according to our convention % = 0
in the recurrence relation (7.29). Thus, even if the nodes coincide, the B-
splines Nik are well-defined by (7.28) and (7.29); furthermore, Nik = 0 if
222 7. Interpolation and Approximation

Figure 7.14. B-splines of order k = 1,2,3.

Ti = Ti+k. The following properties are obvious because of the recursive


definition.
Remark 7.50 The B-splines satisfy
(a) SUppNik C [Ti,' .. , Ti+kl (local support),
(b) Nik(t) ~ 0 for all t E R (nonnegative),
(c) Nik is a piecewise polynomial of degree ::::; k - 1 with respect to the
intervals [Tj, Tj+ll.
In order to derive further properties, it is convenient to represent the B-
splines in closed form. In fact, they can be written as an application of a kth
divided difference [Ti' ... ,Ti+kl to the truncated powers f(s) = (s - t)~-l.
Lemma 7.51 If Ti < Ti+k, then the B-spline Nik satisfies
Nik(t) = (Ti+k - Ti)[Ti, ... , Ti+kl(' - t)~-l .

Proof. For k = 1, we obtain for the right-hand side

{~ if
else
Ti ::::; t < Ti+ 1

Furthermore, by employing the Leibniz formula (Lemma 7.15), it can easily


be verified that the right-hand side also satisfies the recurrence relation
(7.29). The statement now follows inductively. 0

Corollary 7.52 If Tj is an m-fold node,i.e.,


Tj-1 < Tj = ... = Tj+m-l < Tj+m ,
7.4. Splines 223

then, at the position Tj, Nik is at least (k - 1 - m)-times continuously


differentiable. The derivative of Nik satisfies

Proof. The first statement follows from the fact that the divided difference
[Ti,' .. , Ti+klJ contains at most the (m - l)st derivative of the function
f at the position Tj. However, the truncated power f(s) = (s - Tj)~-l
is (k - 2)-times continuously differentiable. The second statement follows
from

-(k -1)(Ti+k - Ti)[Ti, ... ,Ti+kl(' - t)~-2


-(k - 1)(Ti+k - Ti)

(
[TH1' ... ,Ti+kl (- - t)~-2 - h, ... ,Ti+k-1](' - t)~-2)
Ti+k - Ti

(k -1) ( N i ,k-1(t) _ N H 1,k-1(t)) .


Ti+k-1 - Ti Ti+k - Ti+l
D

We now return to the space Sk,l'. of splines of order k with respect to


the grid ~ = {tj}j=O, ... ,I+1:

~ : a = to < h < ... < tl+1 = b.

For the construction of the desired basis, to ~ we assign the following


extended sequence of nodes T = {Tj}j=l, ... ,n+k, where the boundary nodes
a = to and b = t/+1 are counted k-times.

~: a to < h < ... < t/+1 b


II II II
T: T1 = ... Tk < Tk+1 < ... < Tn+1 ... = Tn+k

Here n = I + k = dim Sk,l'. is the dimension of the spline space Sk,l'..


Consider the n B-splines Nik for i = 1, ... , n, which correspond to the
extended sequence of nodes T = {Tj}. In the following, we shall see that
they form the desired basis of Sk,l'. (see Figure 7.15). To begin with, it is
clear from Corollary 7.52 that the B-splines Nik are indeed splines of order
k, i.e.,

Nik E Sk,l'. for all i = 1, ... ,n.


Since the number n coincides with the dimension n = dim S k,l'., it only
remains to show that they are linearly independent. For this, we need the
following technical statement, which is also known as Marsden identity.
224 7. Interpolation and Approximation

Figure 7.15. B-spline basis of order k = 3 (locally quadratic).

Lemma 7.53 With the above notation, we have for all t E [a, b] and s E R
that
n k-l
(t - s)k-l = L cpik(S)Nik(t) with CPik(S):= II (Ti+j - s).
i=l j=l

Proof. Because of the recursive definition of the B-splines, the proof is


by induction over k. The statement is clear for k = I, because of 1 =
L~=l N i1 (t). Thus let k > I, and suppose that the statement is true for all
I :S k - 1. Insertion of the recurrence relation (7.29) on the right-hand side
yields
n

i=l

~ t - T·
L.... ( ---'-cpik(S) + , -1 -
T+k t
CPi-l,k(S)
)
N i ,k-1(t)
i=2 Ti+k-l - Ti T;+k-l - Ti
n k-2
L II (Ti+j - s) .
;=2 j=l

t- t )
.(
T
'(TiH-1 - S) + T+k
, -1 - (Ti - s) N i ,k-1(t)
Ti+k-1 - Ti Ti+k-l - Ti
, "
=t-s
n

i=2
(t - s)(t - s)k-2 = (t _ s)k-l .

Here note that the expression, which is "bracketed from below" is the linear
interpolation of t - s, hence t - s itself. 0

Corollary 7.54 The space Pk-da, b] of polynomials of degree :S k-1 over


[a, b] is contained in the space, which is spanned by the B-splines of order
7.4. Splines 225

k, i.e.,
Pk-da, bl c span (Nlk , ... , Nnk)'
In particular,
n
1 = L Nik(t) for all t E [a, bl ,
i=l

i. e., the B -splines form a partition of unity on [a, bl.

Proof. For the lth derivative of the function f(s) := (t - s)k-1, it follows
from the Marsden identity that
n
f(l) (0) = (k - 1) ... (k -l)( _1)lt k - I - 1 = L i.p~~ (O)Nidt )
i=l

and therefore, with m = k - I - 1,

m (_1)k-m-1 ~ (k-m-1)()N ()
t = (k _ 1) ... (m + 1) L.." i.pik 0 ik t .
,=1
The (k - pt) derivative of ¢ik satisfies
k-1 k-1
¢7k- 1(s) = (II h+j -s)) = (( _1)k-1 s k-1+ ... )k-1 = (_1)k-1(k_1)!
j=l

and the second statement thus also follows. o


After these preparations, we can now prove the linear independence of
the B-splines. They are locally independent as the following theorem shows.
Theorem 7.55 The B -splines Nik are locally linear independent, i. e., if
n
L CiNidt) = 0 for all t Elc, d[ C [a, bl
i=l

and lc, d[ n h, Ti+k[ ic ¢, then


Ci = O.

Proof. Without loss of generality, we may assume that the open interval
lc, d[ does not contain any nodes (otherwise we decompose lc, d[ into subin-
tervals). According to Corollary 7.54, each polynomial of degree:::; k-1 over
lc, d[ can be represented by the B-splines N ik . However, only k = dim P k - 1
B-splines are different from zero on the intervallc, dr. They therefore have
to be linearly independent. 0

Let us summarize briefly what we have shown: The B-splines Nik of


order k with respect to the sequence of nodes T = {Tj} form a basis
226 7. Interpolation and Approximation

B := {N1 k, ... , Nnk} of the spline space Sk,.6.. They are locally linear in-
dependent, are locally supported, and form a positive partition of unity.
Each spline s E Sk,.6. therefore has a unique representation as a linear
combination of the form
n
S = LdiNik.
i=1

The coefficients d i are called de Boor points of s. The function values


s(t) are therefore convex combinations of the de Boor points d i . For the
evaluation, we can use the recursive definition of the B-splines Nik, and we
can therefore also derive the recurrence relation for the linear combinations
themselves, which is given in Exercise 7.9, the algorithm of de Boor.
Remark 7.56 By employing the Marsden identity, one can explicitly give
the dual basis B' = {VI, ... ,vn } of the B-spline basis B,
Vj : Sk,.6. -> R linearly with Vj(Nik) = rSij .
With this at hand, it can be shown that there is a constant Db which
depends only on the order k such that
n

Dkmax Idjl:=; II LdjNjklloo


J=l, ... , n .
:=;max Idjl·
)=l, ... ,n
J=1

Here the second inequality follows from the fact that the B-splines form
a positive partition of unity. Perturbations in the function values s(t) of
the spline s = 2:~=1 CiNik and the coefficients can therefore be estimated
against each other. In particular, the evaluation of a spline in B-spline
representation is well-conditioned. Therefore, the basis is also called well-
conditioned.

7.4.2 Spline Interpolation


We now turn our attention again toward the problem of interpolating a
function f, which is given pointwise on a grid ~ = {to, ... , tl+ d,
a = to < tl < ... < tl+l = b.
In the linear case k = 2, the number I + 2 of nodes coincides with the
dimension of the spline space n = dim S2,.6. = I + k. The linear B-splines
Ni2 with respect to the extended sequence of nodes

T = = T2 < ... < Tn+! = Tn+2} with Tj = tj-2 for


{Tl j = 2, ... ,n
satisfy Ni2 (tj) = OJ+l,i. The piecewise linear spline hf E S2,.6., which
interpolates f, is therefore uniquely determined with
n

hi = L i(ti-dNi2.
i=1
7.4. Splines 227

Besides this very simple case of linear spline interpolation, the case k = 4 of
cubic splines plays the most important role in the applications. In this case,
we are missing two conditions to uniquely characterize the interpolating
cubic spline s in S_{4,Delta}, because

    dim S_{4,Delta} - number of nodes = l + k - l - 2 = 2 .

Now the starting idea for the construction of spline functions was to find
interpolating curves which are as "smooth" as possible; we could also say
"as little curved as possible." The curvature of a parametric curve (in the plane)

    y : [a, b] -> R ,   y in C^2[a, b] ,

at t in [a, b] is given by

    kappa(t) := y''(t) / (1 + y'(t)^2)^{3/2} .

The absolute value of the curvature is just the reciprocal 1/r of the radius
r of the osculating circle to the curve at the point (t, y(t)) (see Figure 7.16);
i.e., the curvature is zero if and only if the osculating circle has radius
infinity, hence the curve is straight.

Figure 7.16. Osculating circle of the curve (t, y(t)), here y(t) = ln t.

In order to simplify this, instead of the curvature, we consider for small
y'(t) the reasonable approximation y''(t),

    kappa(t) = y''(t) / (1 + y'(t)^2)^{3/2}  ~  y''(t) ,

and measure the curvature of the entire curve by the L_2-norm
of this approximation with respect to [a, b]. The interpolating cubic splines,
which satisfy the additional properties of Corollary 7.58, minimize this
functional.

Theorem 7.57 Let s be an interpolating cubic spline of f at the nodes
a = t_0 < ... < t_{l+1} = b, and let y in C^2[a, b] be an arbitrary interpolating
function of f such that

    [ s''(t) (y'(t) - s'(t)) ]_{t=a}^{t=b} = 0 .                      (7.30)

Then

    int_a^b (s''(t))^2 dt  <=  int_a^b (y''(t))^2 dt .                (7.31)

Proof. Trivially, y'' = s'' + (y'' - s''), and, inserted into the right-hand side
of (7.31), it follows that

    int_a^b (y'')^2 dt = int_a^b (s'')^2 dt + 2 int_a^b s''(y'' - s'') dt + int_a^b (y'' - s'')^2 dt
                      >= int_a^b (s'')^2 dt ,

provided the mixed term (*) := int_a^b s''(y'' - s'') dt vanishes, since the last
integral is nonnegative. This holds true under the assumption (7.30),
because by partial integration, it follows that

    int_a^b s''(y'' - s'') dt = [ s''(y' - s') ]_a^b - int_a^b s'''(y' - s') dt ,

where s''' is in general discontinuous at the nodes t_1, ..., t_l, and is
constant,

    s'''(t) = s_i'''(t) = d_i   for t in (t_i, t_{i+1}) ,

in the interior of the subintervals (the s_i are cubic polynomials). Therefore,
under the assumption (7.30), it is true that

    int_a^b s''(y'' - s'') dt = - sum_{i=0}^{l} d_i [ (y(t_{i+1}) - s(t_{i+1})) - (y(t_i) - s(t_i)) ] = 0 ,

since y and s both interpolate f at the nodes.  []
Corollary 7.58 In addition to the interpolation conditions s(t_i) = f(t_i),
assume that the cubic spline s in S_{4,Delta} satisfies one of the following
boundary conditions:

(i) s'(a) = f'(a) and s'(b) = f'(b),

(ii) s''(a) = s''(b) = 0,

(iii) s'(a) = s'(b) and s''(a) = s''(b) (if f is periodic with period b - a).

Then there exists a unique solution s in S_{4,Delta}, which satisfies this boundary
condition. An arbitrary interpolating function y in C^2[a, b], which satisfies
the same boundary condition, furthermore satisfies

    int_a^b (s''(t))^2 dt  <=  int_a^b (y''(t))^2 dt .                (7.32)

Proof. The requirements are linear in s, and their number coincides with
the dimension n = l + 4 of the spline space S_{4,Delta}. It is therefore sufficient to
show that the trivial spline s == 0 is the only solution for the null-function
f == 0. Since y == 0 satisfies all requirements, Theorem 7.57 implies that

    int_a^b (s''(t))^2 dt <= 0 .

Since s'' is continuous, this implies s'' == 0; i.e., s is a continuously dif-
ferentiable, piecewise linear function with s(t_i) = 0, and is therefore the
null-function.  []

The three types (i), (ii), and (iii) are called complete, natural, and
periodic cubic spline interpolation. The physical interpretation of the
above minimization property (7.32) accounts for the name "spline." If y(t)
describes the position of a thin wooden beam, then

    E = int_a^b ( y''(t) / (1 + y'(t)^2)^{3/2} )^2 dt

measures the "deformation energy" of the beam. Because of Hamilton's
principle, the beam takes a position so that this energy is minimized. For
small deformations one has approximately

    E ~ int_a^b y''(t)^2 dt = ||y''||_2^2 .

The interpolating cubic spline s in S_{4,Delta} therefore describes approximately
the position of a thin wooden beam, which is fixed at the nodes t_i. In the
complete spline interpolation, we have clamped the beam at the boundary
nodes with an additional prescription of the slopes. The natural boundary
conditions correspond to the situation when the beam is straight outside
the interval [a, b]. Such thin wooden beams were in fact used as drawing
tools and are called "splines."
Note that besides the function values, two additional pieces of informa-
tion regarding the original function f at the nodes enter in the complete
spline interpolation. Thus its approximation properties (particularly at
the boundary) are better than the ones of the other types (ii) and (iii). In
fact, the complete interpolating spline I_4 f in S_{4,Delta} approximates a function
f in C^4[a, b] of the order h^4, where

    h := max_{i=0,...,l} |t_{i+1} - t_i|

is the largest distance between the nodes t_i. We state the following related result
due to C. A. Hall and W. W. Meyer [48] without proof.
Theorem 7.59 Let I_4 f in S_{4,Delta} be the complete interpolating spline of a
function f in C^4[a, b] with respect to the nodes t_i with h := max_i |t_{i+1} - t_i|.
Then

    || f - I_4 f ||_inf  <=  (5/384) h^4 || f^(4) ||_inf .
Note that this estimate is independent of the position of the nodes t_i.
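As a small numerical illustration (not part of the text), the h^4 order can be checked with a library routine; the sketch below assumes that scipy's CubicSpline with first-derivative end conditions can serve as a stand-in for the complete spline interpolation described here.

    import numpy as np
    from scipy.interpolate import CubicSpline

    f, fp = np.sin, np.cos
    a, b = 0.0, np.pi
    x = np.linspace(a, b, 2001)
    for n in (10, 20, 40, 80):
        t = np.linspace(a, b, n + 1)                                   # equidistant nodes, h = (b - a)/n
        s = CubicSpline(t, f(t), bc_type=((1, fp(a)), (1, fp(b))))     # s'(a) = f'(a), s'(b) = f'(b)
        err = np.max(np.abs(f(x) - s(x)))
        h = (b - a) / n
        print(n, err, err / h**4)    # err/h^4 stays bounded, consistent with 5/384 * max|f''''|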

7.4.3 Computation of Cubic Splines

In the following, we shall derive a system of equations for the cubic in-
terpolating splines. For this purpose, we describe the spline s in S_{4,Delta} by
employing the local Bezier representation

    s(t) = s_i(t) = sum_{j=0}^{3} b_{3i+j} B_j^3(t; t_i, t_{i+1})   for t in [t_i, t_{i+1}]          (7.33)

of the partial polynomials s_i with respect to the intervals [t_i, t_{i+1}]. Here

    B_j^3(t; t_i, t_{i+1}) = B_j^3( (t - t_i)/h_i )   with h_i := t_{i+1} - t_i .

The continuity of s enters implicitly into the representation (7.33).

Figure 7.17. Cubic spline with de Boor points d_i and Bezier points b_i.

By (7.26), the C^1-smoothness implies that

    b_{3i} = h_i/(h_{i-1} + h_i) b_{3i-1} + h_{i-1}/(h_{i-1} + h_i) b_{3i+1} .          (7.34)
Furthermore, according to the C^2-smoothness of s, we have shown that
there are de Boor points d_i such that

    b_{3i-1} = h_i/(h_{i-1} + h_i) b_{3i-2} + h_{i-1}/(h_{i-1} + h_i) d_i ,
    b_{3i+1} = h_i/(h_{i-1} + h_i) d_i + h_{i-1}/(h_{i-1} + h_i) b_{3i+2} .

Graphically, this means that the straight line segment between b_{3i-2} and d_i,
respectively, d_i and b_{3i+2}, is partitioned at the ratio h_{i-1} : h_i by the Bezier
points b_{3i-1}, respectively, b_{3i+1}. The points d_i, b_{3i+1}, b_{3i+2}, and d_{i+1} are
therefore positioned as shown in Figure 7.17. Taken together, this implies

    b_{3i+1} = (h_i + h_{i+1})/(h_{i-1} + h_i + h_{i+1}) d_i + h_{i-1}/(h_{i-1} + h_i + h_{i+1}) d_{i+1} ,
    b_{3i-1} = (h_{i-2} + h_{i-1})/(h_{i-2} + h_{i-1} + h_i) d_i + h_i/(h_{i-2} + h_{i-1} + h_i) d_{i-1} .

If we define at the boundary h_{-1} := h_{l+1} := 0 and

    d_{-1} := b_0  and  d_{l+2} := b_{3(l+1)} ,          (7.35)

then the Bezier coefficients b_{3i+j}, and thus also the spline s, are completely
determined by the l + 4 points d_{-1} to d_{l+2} and the equations (7.34) to
(7.35).
By inserting the interpolation conditions

    f_i = s(t_i) = b_{3i}   for i = 0, ..., l + 1 ,

it follows at the boundary that

    d_{-1} := f_0  and  d_{l+2} = f_{l+1} .
The remaining points d_0, ..., d_{l+1} of the interpolating spline must solve the
following tridiagonal system (proof as an exercise):

    d_0 = b_1 ,
    alpha_i d_{i-1} + beta_i d_i + gamma_i d_{i+1} = (h_{i-1} + h_i) f_i   for i = 1, ..., l ,
    d_{l+1} = b_{3l+2} ,

with

    alpha_i := h_i^2 / (h_{i-2} + h_{i-1} + h_i) ,

    beta_i := h_i (h_{i-2} + h_{i-1}) / (h_{i-2} + h_{i-1} + h_i) + h_{i-1} (h_i + h_{i+1}) / (h_{i-1} + h_i + h_{i+1}) ,

    gamma_i := h_{i-1}^2 / (h_{i-1} + h_i + h_{i+1}) .
We now only have to determine the Bezier points b_1 and b_{3l+2} from the
boundary conditions. We confine ourselves to the first two types. For the
complete spline interpolation, we obtain from

    f_0' = s'(a) = (3/h_0)(b_1 - b_0)   and   f_{l+1}' = s'(b) = (3/h_l)(b_{3(l+1)} - b_{3l+2})

that we have to set

    b_1 = (h_0/3) f_0' + f_0   and   b_{3l+2} = -(h_l/3) f_{l+1}' + f_{l+1} .

For the natural boundary conditions we have to choose b_1 and b_{3l+2} so that
s''(a) = s''(b) = 0. This is satisfied for

    b_1 := b_0 = f_0   and   b_{3l+2} := b_{3(l+1)} = f_{l+1}

(see Figure 7.18).

Figure 7.18. Cubic interpolating spline with natural boundary conditions.

Remark 7.60 For an equidistant grid, i.e., h_i = h for all i, we have

    alpha_i = gamma_i = h/3   and   beta_i = 4h/3   for i = 2, ..., l - 1 .

In this case (and also for an almost equidistant grid) the matrix is strictly
diagonally dominant, and the system can therefore be solved efficiently and in a
stable manner by Gaussian elimination without pivoting.
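The assembly of these coefficients is easily programmed. The following Python sketch (the function name is illustrative, not from the text) builds alpha_i, beta_i, gamma_i from an arbitrary node vector using the convention h_{-1} = h_{l+1} = 0 and checks the equidistant values of Remark 7.60.

    import numpy as np

    def spline_system_coefficients(t):
        """alpha_i, beta_i, gamma_i (i = 1, ..., l) of the tridiagonal system
        for the de Boor points of the interpolating cubic spline;
        t = [t_0, ..., t_{l+1}] are the interpolation nodes."""
        l = len(t) - 2
        hp = np.concatenate(([0.0], np.diff(t), [0.0]))   # hp[i+1] = h_i, with h_{-1} = h_{l+1} = 0
        alpha, beta, gamma = np.empty(l), np.empty(l), np.empty(l)
        for i in range(1, l + 1):
            hm2, hm1, hi, hp1 = hp[i - 1], hp[i], hp[i + 1], hp[i + 2]
            alpha[i - 1] = hi**2 / (hm2 + hm1 + hi)
            beta[i - 1] = hi * (hm2 + hm1) / (hm2 + hm1 + hi) + hm1 * (hi + hp1) / (hm1 + hi + hp1)
            gamma[i - 1] = hm1**2 / (hm1 + hi + hp1)
        return alpha, beta, gamma

    t = np.linspace(0.0, 1.0, 9)                  # equidistant grid with h = 1/8
    al, be, ga = spline_system_coefficients(t)
    h = t[1] - t[0]
    print(np.allclose(al[1:-1], h / 3), np.allclose(be[1:-1], 4 * h / 3))   # values of Remark 7.60
    print(np.all(be > al + ga))                   # strict diagonal dominance on this grid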
Remark 7.61 The de Boor points d_i are just the B-spline coefficients of
the interpolating cubic spline, i.e.,

    s = sum_{i=1}^{l+4} d_{i-2} N_i4 ,

if, as above, the N_i4 are the B-splines for the extended sequence of nodes
T = {tau_j}.

Exercises
Exercise 7.1 Let Lambda_n(K, I) denote the Lebesgue constant with respect to
the set of nodes K on the interval I.

(a) Let K = {t_0, ..., t_n} in I = [a, b] be pairwise distinct nodes. Suppose
that the affine transformation

    chi : I -> I_0 = [-1, 1] ,   t |-> (2t - a - b)/(b - a)

of this interval onto the unit interval I_0 maps the set of nodes K
onto the set of nodes K_0 = chi(K). Show that the Lebesgue constant
is invariant under this transformation, i.e.,

    Lambda_n(K, I) = Lambda_n(K_0, I_0) .

(b) Let K = {t_0, ..., t_n} with a <= t_0 < t_1 < ... < t_n <= b be nodes in the
interval I = [a, b]. Give the affine transformation

    chi : [t_0, t_n] -> I

and show that the transformed nodes K' = chi(K) = {t_0', ..., t_n'} satisfy

    a = t_0' < t_1' < ... < t_n' = b .

Show that

    Lambda_n(K', I) <= Lambda_n(K, I) ,

i.e., inclusion of the boundary nodes improves the Lebesgue constant.


Exercise 7.2 Consider the class of functions

    F := { f in C^{n+1}[-1, 1] | ||f^(n+1)||_inf <= (n+1)! } .

For f in F, let P_n(f) denote the polynomial of degree n of the (Hermite)
interpolation for the nodes K = {t_0, ..., t_n} in I_0 = [-1, 1].

(a) Show that

    E_n(K) := sup_{f in F} || f - P_n(f) ||_inf = || omega_{n+1} ||_inf ,

where omega_{n+1}(t) = (t - t_0) ... (t - t_n).

(b) Show that E_n(K) >= 2^{-n} and that equality holds if and only if K is
the set of the Chebyshev nodes, i.e.,

    t_j = cos( (2j + 1)/(2n + 2) pi )   for j = 0, ..., n .

Exercise 7.3 Count how many computations and how much storage space
an economically written program requires for the evaluation of interpolation
polynomials on the basis of the Lagrange representation.
Compare with the algorithms of Aitken-Neville and the representation
over Newton's divided differences.
Exercise 7.4 Let a = t_0 < t_1 < ... < t_{n-1} < t_n = b be a distribution
of nodes in the interval I = [a, b]. For a continuous function g in C(I), the
interpolating polygon Ig in C(I) is defined by

(a) Ig(t_i) = g(t_i) for i = 0, ..., n,

(b) Ig restricted to [t_i, t_{i+1}] is a polynomial of degree one for i = 0, ..., n - 1.

Show the following:

(a) Any function g in C^2(I) satisfies

    || g - Ig ||_inf <= (h^2/8) || g'' ||_inf ,

where h = max_{0<=i<=n-1} (t_{i+1} - t_i) is the "grid-width parameter."

(b) The absolute condition of the polygonal interpolation satisfies

    kappa_abs = 1 .

Discuss and evaluate the difference between this and the polynomial
interpolation.
Exercise 7.5 For the approximation of the first derivative of a pointwise
given function f, one utilizes the first divided difference

    (D_h f)(x) := [x, x + h] f .

(a) Estimate the approximation error |D_h f(x) - f'(x)| for f in C^3.
(The leading order in h for h -> 0 is sufficient.)

(b) Instead of D_h f(x), the floating point arithmetic computes D~_h f(x).
Estimate the error |D~_h f(x) - D_h f(x)| in leading order.

(c) Which h turns out to be optimal, i.e., minimizes the total error?

(d) Test your prediction for f(x) = e^x at the position x = 1 with
h = 10^{-1}, 5*10^{-2}, 10^{-2}, ..., eps.

Exercise 7.6 Show that the derivatives of a polynomial in Bezier
representation with respect to the interval [t_0, t_1],

    P(t) = sum_{i=0}^{n} b_i B_i^n(lambda) ,   lambda := (t - t_0)/(t_1 - t_0) ,

are given by

    d^k P(t)/dt^k = n! / ((n - k)! h^k) * sum_{i=0}^{n-k} (Delta^k b_i) B_i^{n-k}(lambda) ,

where h := t_1 - t_0 and Delta denotes the forward difference operator
Delta b_i = b_{i+1} - b_i.

Exercise 7.7 Find the Bezier representation with respect to [0, 1] of the
Hermite polynomials H_i^3 for the nodes t_0, t_1, and sketch the Hermite
polynomials together with the Bezier polygons.
Exercise 7.8 We have learned three different bases for the space P_3 of
polynomials of degree <= 3: the monomial basis {1, t, t^2, t^3}, the Bernstein
basis {B_0^3(t), B_1^3(t), B_2^3(t), B_3^3(t)} with respect to the interval [0, 1], and the
Hermite basis {H_0^3(t), H_1^3(t), H_2^3(t), H_3^3(t)} for the nodes t_0, t_1. Determine
the matrices for the basis changes.
Exercise 7.9 Show that a spline s = sum_{i=1}^{n} d_i N_ik in B-spline represen-
tation with respect to the nodes {tau_i} satisfies the following recurrence
relation:

    s(t) = sum_{i=l+1}^{n} d_i^l(t) N_{i,k-l}(t) .

Here the d_i^l are defined by d_i^0(t) := d_i and

    d_i^l(t) := (t - tau_i)/(tau_{i+k-l} - tau_i) d_i^{l-1}(t) + (tau_{i+k-l} - t)/(tau_{i+k-l} - tau_i) d_{i-1}^{l-1}(t)   if tau_{i+k-l} != tau_i ,
    d_i^l(t) := 0   else,

for l > 0. Show that s(t) = d_i^{k-1}(t) for t in [tau_i, tau_{i+1}]. Use this to derive
a scheme for the computation of the spline s(t) through continued convex
combination of the coefficients d_i (algorithm of de Boor).
8
Large Symmetric Systems of
Equations and Eigenvalue Problems

The previously described direct methods for the solution of a linear system
Ax = b (Gaussian elimination, Cholesky factorization, QR-factorization
with Householder or Givens transformations) have two properties in
common.

(a) The methods start with arbitrary (for the Cholesky factorization
symmetric) full (or dense) matrices A in Mat_n(R).

(b) The cost of solving the system is of the order O(n^3) (multiplications).

However, there are many important cases of problems Ax = b, where

(a) the matrix A is highly structured (see below) and most of the
components are zero (i.e., A is sparse),

(b) the dimension n of the problem is very large.

For example, discretization of the Laplace equation in two space dimensions
leads to block-tridiagonal matrices

    A = ( A_11        A_12                                        )
        ( A_21        A_22        A_23                            )
        (             ...         ...          ...                )
        (             A_{q-1,q-2} A_{q-1,q-1}  A_{q-1,q}          )
        (                         A_{q,q-1}    A_{qq}             )          (8.1)

with A_ij in Mat_{n/q}(R), which, in addition, are symmetric, i.e., A_ij = A_ji^T.
The direct methods are unsuitable for the treatment of such problems; they

do not exploit the special structure, and they take far too long. There are
essentially two approaches to develop new solution methods. The first con-
sists of exploiting the special structure of the matrix in the direct methods,
in particular its sparsity pattern, as much as possible. We have already
discussed questions of this kind, when we compared the Givens and House-
holder transformations. The rotations operate only on two rows (from the
left) or columns (from the right) of a matrix at a time, and they are
therefore suited largely to maintain a sparsity pattern. In contrast, the
Householder transformations are completely unsuitable for this purpose.
Already in one step, they destroy any pattern of the starting matrix, so
that from then on, the algorithm has to work with a full matrix. In gen-
eral, the Gaussian elimination treats the sparsity pattern of matrices most
sparingly. It is therefore the most commonly used starting basis for the
construction of direct methods, which utilize the structure of the matrix
(direct sparse solver). Typically, column pivoting with possible row inter-
change and row pivoting with possible column interchange alternate with
each other, depending on which strategy preserves the most zero elements. In
addition, the pivot rule is relaxed (conditional pivoting) in order to keep
the number of additional nonzero elements (fill-in elements) small. In the
last few years, the direct sparse solvers have developed into a sophisticated
art form. Their description requires, in general, resorting to graphs that
characterize the prevailing systems (see, e.g., [39]). Their presentation is
not suitable for this introduction.
The second approach to solve large systems, which are rich in structure,
is to develop iterative methods for the approximation of the solution x.
This seems reasonable also because we are generally only interested in
the solution x up to a prescribed precision eps, which depends on the pre-
cision of the input data (compare the evaluation of approximate solutions
in Section 2.4.3). If, for example, the linear system was obtained by dis-
cretization of a differential equation, then the precision of the solution of
the system only has to lie within the error bounds, which are induced by
the discretization. Any extra work would be a waste of time.
In the following sections, we shall be concerned with the most common
iterative methods for the solution of large linear systems and eigenvalue
problems for symmetric matrices. The goal is then always the construction
of an iteration prescription x_{k+1} = phi(x_0, ..., x_k) such that

(a) the sequence {x_k} of iterates converges as fast as possible to the
solution x, and

(b) x_{k+1} can be computed with as little cost as possible from x_0, ..., x_k.

In the second requirement, one usually asks that the evaluation of phi does
not cost much more than a simple matrix-vector multiplication (A, y) |-> Ay.
It is notable that the cost for sparse matrices is of the order O(n) and
not O(n^2) (as with full matrices), because often the number of nonzero
elements in a row is independent of the dimension n of the problem.

8.1 Classical Iteration Methods


In Chapter 4, we have solved nonlinear systems by using fixed-point
methods. This idea is also the basis of most classical iteration methods.
For a fixed-point method x_{k+1} = phi(x_k) for the solution of a linear system
Ax = b, we shall of course construct an iteration function phi so that it has
a unique fixed point x*, which is the exact solution x* = x of Ax = b.
This is most easily achieved by transforming the equation Ax = b into a
fixed-point equation,

    Ax = b   <=>   Q^{-1}(b - Ax) = 0
           <=>   phi(x) := (I - Q^{-1}A) x + Q^{-1}b = x ,

with G := I - Q^{-1}A and c := Q^{-1}b, where Q in GL(n) is an arbitrary regular
matrix. In order to obtain a reasonable iteration method, we have to take care
that the fixed-point method x_{k+1} = phi(x_k) = G x_k + c converges.

Theorem 8.1 The fixed-point method x_{k+1} = G x_k + c with G in Mat_n(R)
converges for each starting value x_0 in R^n if and only if

    rho(G) < 1 ,

where rho(G) = max_j |lambda_j(G)| is the spectral radius of G.

Proof. We again restrict ourselves to the simple case of a symmetric matrix
G = G^T, which is the only case that we need in the following. Then there
is an orthogonal matrix Q in O(n) such that

    Q G Q^T = Lambda = diag(lambda_1, ..., lambda_n)

is the diagonal matrix of the eigenvalues of G. Since |lambda_i| <= rho(G) < 1 for all
i and Lambda^k = diag(lambda_1^k, ..., lambda_n^k), we have lim_{k->inf} Lambda^k = 0, and therefore also

    lim_{k->inf} G^k = lim_{k->inf} Q^T Lambda^k Q = 0 .          []

Since rho(G) <= ||G|| for any corresponding matrix norm, it follows that
||G|| < 1 is sufficient for rho(G) < 1. In this case, we can estimate the errors
x_k - x = G^k(x_0 - x) by

    || x_k - x || <= ||G||^k || x_0 - x || .
Besides the convergence, we require that phi(y) = Gy + c be easily computed.
For this purpose, the matrix Q has to be easily invertible. The matrix which
is easiest to invert is doubtless the identity Q = I. The method which thus
arises, with the iteration matrix G = I - A,

    x_{k+1} = x_k - A x_k + b ,

is the Richardson method. If we start with an Spd-matrix A, then we obtain
for the spectral radius of G:

    rho(G) = rho(I - A) = max { |1 - lambda_max(A)|, |1 - lambda_min(A)| } .

A necessary condition for the convergence of the Richardson iteration is
therefore lambda_max(A) < 2. Taken by itself, this iteration is thus only rarely
usable. However, we shall discuss below possibilities to improve the conver-
gence. The next more complicated matrices are the diagonal matrices, so
that the diagonal D of the decomposition

    A = L + D + R

is a candidate for a second possibility for Q. Here D = diag(a_11, ..., a_nn),
L is the strictly lower triangular part of A (the entries a_ij with i > j), and
R is the strictly upper triangular part (the entries a_ij with i < j). The
corresponding method

    x_{k+1} = -D^{-1}(L + R) x_k + D^{-1} b

is called the Jacobi method. A sufficient condition for its convergence is
that A is strictly diagonally dominant.

Theorem 8.2 The Jacobi iteration x_{k+1} = -D^{-1}(L + R) x_k + D^{-1} b con-
verges for any starting value x_0 to the solution x = A^{-1}b, if A is strictly
diagonally dominant, i.e.,

    |a_ii| > sum_{j != i} |a_ij|   for all i = 1, ..., n .

Proof. The statement follows from Theorem 8.1, because

    rho(D^{-1}(L + R)) <= || D^{-1}(L + R) ||_inf = max_i sum_{j != i} |a_ij| / |a_ii| < 1 .          []
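As a minimal Python sketch (names are illustrative, not from the text), the Jacobi iteration can be written with the residual, so that only one matrix-vector multiplication per step is needed:

    import numpy as np

    def jacobi(A, b, x0, tol=1e-10, kmax=500):
        """Jacobi iteration x_{k+1} = -D^{-1}(L+R) x_k + D^{-1} b,
        rewritten as x_{k+1} = x_k + D^{-1}(b - A x_k)."""
        d = np.diag(A)
        x = np.array(x0, dtype=float)
        for k in range(1, kmax + 1):
            x_new = x + (b - A @ x) / d
            if np.linalg.norm(x_new - x) <= tol * np.linalg.norm(x_new):
                return x_new, k
            x = x_new
        return x, kmax

    # strictly diagonally dominant example, hence convergence by Theorem 8.2
    A = np.array([[4., 1., 0.], [1., 4., 1.], [0., 1., 4.]])
    b = np.array([1., 2., 3.])
    x, its = jacobi(A, b, np.zeros(3))
    print(its, np.linalg.norm(A @ x - b))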

In the first chapter, after the diagonal ones, the triangular systems have
proven to be simply solvable. For full lower or upper triangular matrices, the
cost is of the order O(n^2) per solution; for sparse matrices the cost is often
of the order O(n), i.e., of an order which we consider acceptable. By taking
Q as the lower triangular half Q := D + L, we obtain the Gauss-Seidel
method:

    x_{k+1} = (I - (D + L)^{-1} A) x_k + (D + L)^{-1} b
            = -(D + L)^{-1} R x_k + (D + L)^{-1} b .

It converges for any Spd-matrix A. In order to prove this property, we derive
a condition for the contraction property rho(G) < 1 of G = I - Q^{-1}A,
which is easy to verify. For this, we note that every Spd-matrix A induces
a scalar product (x, y)_A := (x, Ay) on R^n. For any matrix B in Mat_n(R),
B* := A^{-1} B^T A is the adjoint matrix with respect to this scalar product,
i.e.,

    (Bx, y)_A = (x, B*y)_A   for all x, y in R^n .

A self-adjoint matrix B = B* is called positive with respect to (.,.)_A, if

    (Bx, x)_A > 0   for all x != 0 .

Lemma 8.3 Let G in Mat_n(R), and let G* be the adjoint matrix of G with
respect to a scalar product (.,.). Then, if B := I - G*G is a positive matrix
with respect to (.,.), it follows that rho(G) < 1.

Proof. Since B is positive, we have for all x != 0

    (Bx, x) = (x, x) - (G*Gx, x) = (x, x) - (Gx, Gx) > 0 ,

which shows, with the norm ||.|| induced by (.,.), that ||x|| > ||Gx|| for all x != 0.
This implies

    rho(G) <= ||G|| := sup_{||x||=1} ||Gx|| < 1 ,

where the supremum is attained at some x_0 with ||x_0|| = 1 because of the
compactness of the sphere.  []

Theorem 8.4 The Gauss-Seidel method converges for any Spd-matrix A.

Proof. We have to show that B := I - G*G, with G = I - (D + L)^{-1} A, is
a positive matrix with respect to (.,.) := (., A.). Because of R^T = L:

    G* = I - A^{-1} A^T (D + L)^{-T} A = I - (D + R)^{-1} A

and

    B = I - G*G = (D + R)^{-1} D (D + L)^{-1} A .          (8.2)

The trick in the last manipulation consists of inserting the equation

    (D + M)^{-1} = (D + M)^{-1}(D + M)(D + M)^{-1}

for M = R, L, after carrying out the multiplications and then factoring.
From (8.2) it follows for all x != 0 that

    (Bx, x)_A = ( (D + R)^{-1} D (D + L)^{-1} Ax, Ax )
              = ( D^{1/2}(D + L)^{-1} Ax, D^{1/2}(D + L)^{-1} Ax ) > 0 ;

i.e., B is positive and rho(G) < 1.  []


The speed of convergence of a fixed-point iteration x_{k+1} = G x_k + c
depends strongly on the spectral radius rho(G). For any concretely chosen
G, there is, however, a way to improve it, namely the extrapolation or,
better, relaxation method. For this we consider convex combinations of the,
respectively, "old" and "new" iterate

    x_{k+1} = omega (G x_k + c) + (1 - omega) x_k
            = G_omega x_k + omega c   with   G_omega := omega G + (1 - omega) I ,

where omega in [0, 1] is a damping parameter. This way we obtain from a
fixed-point iteration x_{k+1} = G x_k + c an entire family of relaxed fixed-point
iterations with the iteration function, which depends on omega:

    phi_omega(x) = omega phi(x) + (1 - omega) x = G_omega x + omega c .

The art consists now of choosing the damping parameter omega so that rho(G_omega) is
as small as possible. In fact, for a class of fixed-point iterations it is possible
to even force convergence by a suitable choice of omega, despite the fact that
the starting iteration in general does not converge.

Definition 8.5 A fixed-point method x_{k+1} = G x_k + c, G = G(A), is
called symmetrizable, if for any Spd-matrix A, I - G is equivalent (similar)
to an Spd-matrix; i.e., there is a regular matrix W in GL(n) such that

    W (I - G) W^{-1}

is an Spd-matrix.

Example 8.6 The Richardson method G = I - A is trivially symmetriz-
able. The same is also true for the Jacobi method G = I - D^{-1}A: With
W := D^{1/2} we have that

    D^{1/2}(I - G) D^{-1/2} = D^{1/2} D^{-1} A D^{-1/2} = D^{-1/2} A D^{-1/2}

is an Spd-matrix.

The iteration matrices G of symmetrizable fixed-point methods have the
following properties.

Lemma 8.7 Let x_{k+1} = G x_k + c, G = G(A), be a symmetrizable fixed-point
method, and let A be an Spd-matrix. Then all eigenvalues of G are real and
less than 1; i.e., the spectrum sigma(G) of G satisfies

    sigma(G) in ]-inf, 1[ .

Proof. Since I - G is similar to an Spd-matrix, the eigenvalues of I - G are
real and positive, and the eigenvalues of G are therefore real and < 1.  []

Now let lambda_min <= lambda_max < 1 be the extreme eigenvalues of G. Then the
eigenvalues of G_omega are just

    lambda_i(G_omega) = omega lambda_i(G) + 1 - omega = 1 - omega (1 - lambda_i(G)) ,

i.e.,

    rho(G_omega) = max { |1 - omega(1 - lambda_min(G))|, |1 - omega(1 - lambda_max(G))| } .
Because of 0 < 1 - lambda_max(G) <= 1 - lambda_min(G), the optimal damping parameter
omega with

    rho(G_omega) = min_{0 < omega' <= 1} rho(G_omega') = 1 - omega (1 - lambda_max(G))

satisfies the equation

    1 - omega (1 - lambda_max(G)) = -1 + omega (1 - lambda_min(G))

(see Figure 8.1). We thus obtain the following result.

Figure 8.1. Spectral radius versus damping parameter omega.

Lemma 8.8 With the above notation,

    omega = 2 / (2 - lambda_max(G) - lambda_min(G))

is the optimal damping parameter for the symmetrizable iteration method
x_{k+1} = G x_k + c. The spectral radius of the iteration matrix of the relaxed
method satisfies

    rho(G_omega) = 1 - omega (1 - lambda_max(G)) < 1 .

In other words: We have just shown that, for any symmetrizable iteration
method, convergence can be forced by a suitable choice of the damping pa-
rameter. Here information about the matrix A enters in the determination
of omega, so that extrapolating methods can only be given for certain classes of
matrices.

Example 8.9 For the Richardson method with G = I - A, the relaxed
iteration has a particularly simple form:

    G_omega = omega G + (1 - omega) I = I - omega A .

An Spd-matrix A satisfies lambda_min(G) = 1 - lambda_max(A) and lambda_max(G) = 1 - lambda_min(A),
and therefore

    omega = 2 / (2 - lambda_max(G) - lambda_min(G)) = 2 / (lambda_min(A) + lambda_max(A)) .

The spectral radius of the optimally damped Richardson method is
therefore

    rho(G_omega) = (lambda_max(A) - lambda_min(A)) / (lambda_max(A) + lambda_min(A)) = (kappa_2(A) - 1) / (kappa_2(A) + 1) ,

with the condition kappa_2(A) of A with respect to the Euclidean norm.
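A small numerical sketch of Example 8.9 (illustrative only, not from the text): for an Spd matrix with lambda_max(A) > 2 the undamped Richardson iteration diverges, while the optimally damped one contracts with the factor (kappa_2(A) - 1)/(kappa_2(A) + 1).

    import numpy as np

    A = np.diag([0.5, 1.0, 3.5])                     # Spd; lambda_max = 3.5 > 2
    lam = np.linalg.eigvalsh(A)
    lmin, lmax = lam[0], lam[-1]
    omega = 2.0 / (lmin + lmax)                      # optimal damping, Example 8.9
    rho_undamped = max(abs(1.0 - lam))               # spectral radius of I - A
    rho_damped = max(abs(1.0 - omega * lam))         # spectral radius of I - omega*A
    kappa = lmax / lmin
    print(rho_undamped)                              # 2.5 > 1: divergence
    print(rho_damped, (kappa - 1) / (kappa + 1))     # both 0.75 < 1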

For a more detailed account on optimal relaxation methods, we refer to
the book by L. A. Hageman and D. M. Young [46]. We now leave this topic,
because nowadays, the most important application of relaxation methods,
namely the multigrid methods, in fact does not use the optimal damping
parameter omega computed above. Instead, the contraction behavior is analyzed
in more detail, namely, with respect to the frequency parts of the iter-
ates. If the solutions of a linear system are interpreted as a superposition
of "standing waves," then it turns out that, with a suitable choice (with
respect to the problem) of the relaxation parameter, the relaxed Jacobi,
as well as the relaxed Gauss-Seidel, methods, prefer to damp the high-
frequency parts of the error. This property is called smoothing property
and is used in the construction of multigrid methods for the iterative so-
lution of discretized partial differential equations. The fundamental idea
is to damp the high-frequency parts of the error with a relaxation method
over a fine discretization grid and then to go to the next coarser grid, which
serves to eliminate the low-frequency parts. By recursive application of this
rule, an iteration method emerges, which only requires the direct solution
of a linear system on the coarsest grid, and thus with few unknowns. Since
we do not address the question of discretization in this elementary intro-
duction, we have to leave it here with a brief hint. For further details we
recommend the survey article by G. Wittum [90], or the textbook by W.
Hackbusch [45].

8.2 Chebyshev Acceleration


Without any exception, the methods which were introduced in the last
section are all fixed-point methods, x_{k+1} = phi(x_k) = G x_k + c. For the com-
putation of x_{k+1}, only information from the last iteration step is utilized, and
the already computed values x_0, ..., x_{k-1} are not taken into consideration.
In the following, we shall present a technique to improve a given fixed-point
method, by constructing a linear combination

    y_k = sum_{j=0}^{k} v_kj x_j

from all values x_0, ..., x_k. With a suitable choice of the coefficients v_kj,
the replacement sequence {y_0, y_1, ...} will then converge faster than the
original sequence {x_0, x_1, ...}. How should the v_kj be determined? If x_0 =
... = x_k = x is already the solution, then we require that y_k = x also
solves, from which it follows that

    sum_{j=0}^{k} v_kj = 1 .

Thus the error y_k - x in particular satisfies

    y_k - x = sum_{j=0}^{k} v_kj (x_j - x) .          (8.3)

We therefore seek a vector y_k in the affine subspace

    V_k := { y_k = sum_{j=0}^{k} v_kj x_j | v_kj in R with sum_{j=0}^{k} v_kj = 1 }  in  R^n ,

which approximates the exact solution as well as possible, i.e.,

    || y_k - x || = min_{y in V_k} || y - x ||          (8.4)

with a suitable norm ||.||. According to Remark 3.6, y_k is the (affine) or-
thogonal projection of x onto V_k with respect to the Euclidean norm
||y|| = sqrt((y, y)), and the minimization problem is equivalent to the variational
problem

    (y_k - x, y - x_0) = 0   for all y in V_k .

Now, if q_1, ..., q_k is an orthogonal basis of the linear subspace U_k, which
is parallel to V_k, i.e., V_k = x_0 + U_k, then we can give explicitly the affine
orthogonal projection Q_k : R^n -> V_k,

    Q_k x = x_0 + sum_{j=1}^{k} ( (q_j, x - x_0) / (q_j, q_j) ) q_j .          (8.5)

However, the right-hand side of (8.5) contains the unknown solution x;
i.e., the formula cannot be evaluated, and the minimization problem (8.4)
is thus not solvable yet. There are two ways to escape this situation: One
possibility is to replace the Euclidean scalar product with a different one,
which better suits the problem at hand. We shall pursue this approach in
Section 8.3. The second possibility, which is considered here, is to construct
a solvable substitute problem instead of the minimization problem (8.4).
The iterates x_k of the given fixed-point iteration are x_k = phi^k(x_0), where
phi(y) = Gy + c is the iteration rule. Therefore,

    y_k = sum_{j=0}^{k} v_kj x_j = P_k(phi) x_0 ,

where P_k in P_k is a polynomial of degree k with

    P_k(lambda) = sum_{j=0}^{k} v_kj lambda^j   and   P_k(1) = sum_{j=0}^{k} v_kj = 1 .

According to (8.3), for the error y_k - x we thus obtain

    y_k - x = P_k(G)(x_0 - x) .

In order to split off the initially unknown solution, we make the (generally
rough) estimate

    || y_k - x || <= || P_k(G) || || x_0 - x || .

Instead of the solution of the minimization problem (8.4), we now seek a
polynomial P_k with P_k(1) = 1 such that

    || P_k(G) || = min .

For this purpose, we assume that the underlying fixed-point iteration is
symmetrizable and set

    a := lambda_min(G)   and   b := lambda_max(G) .

Thus the 2-norm of P_k(G) satisfies

    || P_k(G) ||_2 = max_i |P_k(lambda_i)| <= max_{lambda in [a,b]} |P_k(lambda)| =: rho~(P_k(G)) .

The value rho~(P_k(G)) is also called the virtual spectral radius of P_k(G). This way
we finally arrive at the min-max problem

    max_{lambda in [a,b]} |P_k(lambda)| = min   with   deg P_k = k and P_k(1) = 1 .

We have already encountered and solved this min-max problem in Sec-
tion 7.1.4. According to Theorem 7.21, the P_k turn out to be the specially
normalized Chebyshev polynomials

    P_k(lambda) = T_k(t(lambda)) / T_k(t(1))   with   t(lambda) = 2 (lambda - a)/(b - a) - 1 .

For the computation of y_k we can utilize the three-term recurrence relation
of the Chebyshev polynomials,

    T_{k+1}(t) = 2 t T_k(t) - T_{k-1}(t) .

Note that in the first improvement step, from

    P_1(lambda) = t(lambda)/t(1) = omega lambda + 1 - omega ,
we recover the relaxation method with the optimal damping parameter
omega = 2/(2 - b - a), which was described in Section 8.1. If we set

    t~ := t(1) = (2 - b - a)/(b - a)   and   rho_k := 2 t~ T_{k-1}(t~) / T_k(t~) ,

then it follows that

    P_k(lambda) = rho_k (1 - omega + omega lambda) P_{k-1}(lambda) + (1 - rho_k) P_{k-2}(lambda) .          (8.6)

Here note that

    - T_{k-2}(t~)/T_k(t~) = (T_k(t~) - 2 t~ T_{k-1}(t~)) / T_k(t~) = 1 - rho_k

and

    2 t(lambda) T_{k-1}(t~)/T_k(t~) = rho_k t(lambda)/t~ = rho_k (2 lambda - b - a)/(2 - b - a) = rho_k (1 - omega + omega lambda) .

If we insert (8.6) into y_k = P_k(phi) x_0 for the fixed-point method phi(y) =
Gy + c, then we obtain the recurrence relation

    y_k = P_k(phi) x_0
        = ( rho_k ((1 - omega) P_{k-1}(phi) + omega phi P_{k-1}(phi)) + (1 - rho_k) P_{k-2}(phi) ) x_0
        = rho_k ((1 - omega) y_{k-1} + omega (G y_{k-1} + c)) + (1 - rho_k) y_{k-2} .

For a fixed-point method of the form G = I - Q^{-1}A and c = Q^{-1}b we have
in particular

    y_k = rho_k ( y_{k-1} - y_{k-2} + omega Q^{-1}(b - A y_{k-1}) ) + y_{k-2} .

This iteration for the y_k is called Chebyshev iteration or Chebyshev accel-
eration for the fixed-point iteration x_{k+1} = G x_k + c with G = I - Q^{-1}A
and c = Q^{-1}b.

Algorithm 8.10 Chebyshev iteration (for the starting value y_0 = x_0 with
a prescribed relative precision TOL).

    t~ := (2 - lambda_max(G) - lambda_min(G)) / (lambda_max(G) - lambda_min(G));
    omega := 2 / (2 - lambda_max(G) - lambda_min(G));
    T_0 := 1;  T_1 := t~;
    y_1 := omega (G y_0 + c) + (1 - omega) y_0;
    for k := 2 to k_max do
        T_k := 2 t~ T_{k-1} - T_{k-2};
        rho_k := 2 t~ T_{k-1} / T_k;
        solve the system Q z = b - A y_{k-1} for z;
        y_k := rho_k (y_{k-1} - y_{k-2} + omega z) + y_{k-2};
        if ||y_k - y_{k-1}|| <= TOL ||y_k|| then exit;
    end for
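A direct transcription of Algorithm 8.10 into Python might look as follows (a sketch only; here Q = I, i.e., the Chebyshev acceleration of the Richardson iteration, and lam_min, lam_max are assumed bounds for the extreme eigenvalues of G = I - A; the function name is illustrative).

    import numpy as np

    def chebyshev_richardson(A, b, y0, lam_min, lam_max, tol=1e-10, kmax=200):
        a_, b_ = lam_min, lam_max                  # extreme eigenvalues of G = I - A
        omega = 2.0 / (2.0 - b_ - a_)
        tbar = (2.0 - b_ - a_) / (b_ - a_)
        T_prev, T_cur = 1.0, tbar
        y_prev = np.array(y0, dtype=float)
        y_cur = y_prev + omega * (b - A @ y_prev)  # y_1 = omega*(G y_0 + c) + (1 - omega)*y_0
        for k in range(2, kmax + 1):
            T_next = 2.0 * tbar * T_cur - T_prev
            rho = 2.0 * tbar * T_cur / T_next
            z = b - A @ y_cur                      # solve Qz = b - A y_{k-1}; Q = I here
            y_next = rho * (y_cur - y_prev + omega * z) + y_prev
            if np.linalg.norm(y_next - y_cur) <= tol * np.linalg.norm(y_next):
                return y_next, k
            T_prev, T_cur, y_prev, y_cur = T_cur, T_next, y_cur, y_next
        return y_cur, kmax

    A = np.diag([0.5, 1.0, 3.5]); b = np.ones(3)
    y, its = chebyshev_richardson(A, b, np.zeros(3), 1 - 3.5, 1 - 0.5)
    print(its, np.linalg.norm(A @ y - b))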
The rate of convergence of the Chebyshev acceleration for a symmetriz-
able fixed-point iteration can be estimated as follows:

Theorem 8.11 Let G = G(A) be the iteration matrix of a symmetrizable
fixed-point iteration x_{k+1} = phi(x_k) = G x_k + c for the Spd-matrix A, and let
x_0 in R^n be an arbitrary starting value. Then the corresponding Chebyshev
iteration y_k = P_k(phi) x_0 satisfies

    || y_k - x || <= (1 / |T_k(t~)|) || x_0 - x ||   with   t~ = (2 - lambda_max(G) - lambda_min(G)) / (lambda_max(G) - lambda_min(G)) .

Proof. According to the construction of the iteration, we have

    || y_k - x || <= || P_k(G) || || x_0 - x || <= max_{lambda in [a,b]} ( |T_k(t(lambda))| / |T_k(t(1))| ) || x_0 - x || ,

where, as above, t(lambda) is the transformation

    t(lambda) = 2 (lambda - lambda_min(G)) / (lambda_max(G) - lambda_min(G)) - 1
              = (2 lambda - lambda_max(G) - lambda_min(G)) / (lambda_max(G) - lambda_min(G)) .

The statement follows, because |T_k(t)| <= 1 for t in [-1, 1].  []

Example 8.12 For the Chebyshev acceleration of the Richardson itera-
tion, we have in particular:

    t~ = (2 - lambda_max(G) - lambda_min(G)) / (lambda_max(G) - lambda_min(G))
       = (lambda_max(A) + lambda_min(A)) / (lambda_max(A) - lambda_min(A))
       = (kappa_2(A) + 1) / (kappa_2(A) - 1) > 1 .

Lemma 8.13 For the Chebyshev polynomials T_k and kappa > 1, we have the
estimate

    T_k( (kappa + 1)/(kappa - 1) ) >= (1/2) ( (sqrt(kappa) + 1)/(sqrt(kappa) - 1) )^k .

Proof. One easily computes that for z := (kappa + 1)/(kappa - 1),

    z +- sqrt(z^2 - 1) = (sqrt(kappa) +- 1) / (sqrt(kappa) -+ 1) ,

and therefore

    T_k( (kappa + 1)/(kappa - 1) )
      = (1/2) [ ( (sqrt(kappa) + 1)/(sqrt(kappa) - 1) )^k + ( (sqrt(kappa) - 1)/(sqrt(kappa) + 1) )^k ]
      >= (1/2) ( (sqrt(kappa) + 1)/(sqrt(kappa) - 1) )^k .          []
According to Theorem 8.11 and Lemma 8.13, we therefore have

    || y_k - x || <= 2 ( (sqrt(kappa_2(A)) - 1) / (sqrt(kappa_2(A)) + 1) )^k || x_0 - x || .

The Chebyshev iteration requires a very good knowledge of the limits lambda_min
and lambda_max of the (real) spectrum of A. Modern methods therefore com-
bine these methods with a vector iteration, which in a few steps generally
produces usable estimates of lambda_min and lambda_max.
Remark 8.14 The idea of the Chebyshev acceleration, which we presented
here, can also be carried over to nonsymmetric matrices. Instead of the
intervals, which in the symmetric case enclose the real spectrum, one has
ellipses, which in general enclose the complex spectrum. Details can be
found in the article by T. A. Manteuffel [58].

8.3 Method of Conjugate Gradients


In the last section, we tried to approximate the solution of the linear prob-
lem Ax = b by vectors y_k in an affine subspace V_k. The use of the Euclidean
norm ||y|| = sqrt((y, y)) had first led us to an impasse, which we overcame by
passing to a solvable substitute problem. Here we want to pursue further
our original idea by passing to a scalar product, which is adapted to the
problem at hand. In fact, any Spd-matrix A defines in a natural way a
scalar product

    (x, y)_A := (x, Ay)

with the corresponding norm

    ||x||_A := sqrt((x, Ax)) ,

the energy norm of A. We encountered both in Section 8.1. Now we repeat
the chain of reasoning of Section 8.2, only with the energy norm instead
of the Euclidean norm of the iteration error. Let V_k = x_0 + U_k in R^n be
a k-dimensional affine subspace, where U_k is the linear subspace, which is
parallel to V_k. The solution x_k of the minimization problem

    || x_k - x ||_A = min_{y in V_k} || y - x ||_A

is also called the Ritz-Galerkin approximation of x in V_k. According to Theorem
3.4, x_k is the orthogonal projection of x onto V_k with respect to (.,.)_A; i.e.,
the minimization problem || x_k - x ||_A = min is equivalent to the variational
problem

    (x_k - x, u)_A = 0   for all u in U_k .          (8.7)
Instead of "orthogonal with respect to (.,.)_A = (., A.)," we shall in the
following simply say "A-orthogonal" (historically also A-conjugate); i.e.,
according to (8.7), the errors x - x_k must be A-orthogonal to U_k. If we
denote the residues again by r_k := b - A x_k, then we have

    (x - x_k, u)_A = (A(x - x_k), u) = (r_k, u) .

The variational problem (8.7) is therefore equivalent to the requirement
that the residues r_k must be orthogonal (with respect to the Euclidean
scalar product) to U_k, i.e.,

    (r_k, u) = 0   for all u in U_k .          (8.8)

Now let p_1, ..., p_k be an A-orthogonal basis of U_k, i.e.,

    (p_k, p_j)_A = delta_kj (p_k, p_k)_A .

Then, for the A-orthogonal projection P_k : R^n -> V_k, it follows that

    P_k x = x_0 + sum_{j=1}^{k} ( (p_j, x - x_0)_A / (p_j, p_j)_A ) p_j
          = x_0 + sum_{j=1}^{k} ( (p_j, Ax - Ax_0) / (p_j, p_j)_A ) p_j
          = x_0 + sum_{j=1}^{k} ( (p_j, r_0) / (p_j, p_j)_A ) p_j ,          (8.9)

where we abbreviate alpha_j := (p_j, r_0) / (p_j, p_j)_A.
We note that, in contrast to Section 8.2, the initially unknown solution x
does not appear on the right-hand side; i.e., we can explicitly compute the
A-orthogonal projection x_k of x onto V_k, without knowing x. From (8.9),
for x_k and r_k, we immediately obtain the recurrence relations

    x_k = x_{k-1} + alpha_k p_k   and   r_k = r_{k-1} - alpha_k A p_k ,          (8.10)

because

    r_k = A(x - x_k) = A(x - x_{k-1} - alpha_k p_k) = r_{k-1} - alpha_k A p_k .

For the construction of an approximation method, we only lack suitable
subspaces V_k in R^n, for which an A-orthogonal basis p_1, ..., p_k can be
easily computed. By the Cayley-Hamilton theorem (see, e.g., [61]), there is
a polynomial P_{n-1} in P_{n-1} such that

    A^{-1} = P_{n-1}(A) ,

and therefore

    x - x_0 = A^{-1} r_0 = P_{n-1}(A) r_0 in span{ r_0, A r_0, ..., A^{n-1} r_0 } .

If we choose V_k = x_0 + U_k with U_0 := {0} for the approximation spaces, and

    U_k := span{ r_0, A r_0, ..., A^{k-1} r_0 }   for k = 1, ..., n ,

then x in V_n; i.e., in the worst case, the nth approximation x_n is the solution
itself. The spaces U_k = U_k(A, x_0) are called Krylov spaces. They are also
automatically obtained from our requirement essentially to carry out only
a single matrix-vector multiplication (A, y) |-> Ay in each iteration step. If
we furthermore recall Theorem 6.4, then we see that we can construct an
A-orthogonal basis p_1, ..., p_k of U_k with the three-term recurrence relation
(6.4). However, we can compute the p_k directly from the residues.

Lemma 8.15 Let r_k != 0. Then the residues r_0, ..., r_k are pairwise
orthogonal, i.e.,

    (r_i, r_j) = delta_ij (r_i, r_i)   for i, j = 0, ..., k ,

and they span U_{k+1}, i.e.,

    U_{k+1} = span{ r_0, ..., r_k } .

Proof. The proof is by induction over k. The case k = 0 is trivial because
of U_1 = span{r_0}. Suppose the statement is true for k - 1. From (8.10),
it follows immediately that r_k in U_{k+1}. Furthermore, we have seen in (8.8)
that r_k is orthogonal to U_k. By the induction assumption we have

    U_k = span{ r_0, ..., r_{k-1} } ,

and thus (r_k, r_j) = 0 for j < k. Finally, r_k != 0 implies that U_{k+1} =
span{ r_0, ..., r_k }.  []

We therefore construct the A-orthogonal basis vectors p_k as follows: If
r_0 != 0 (otherwise x_0 is the solution), then we set p_1 := r_0. Lemma 8.15 now
states for k >= 1 that r_k either vanishes, or that the vectors p_1, ..., p_k
and r_k are linearly independent and span U_{k+1}. In the first case, we have
x = x_k, and we are done. In the second case, by choosing

    p_{k+1} := r_k - sum_{j=1}^{k} ( (r_k, p_j)_A / (p_j, p_j)_A ) p_j
             = r_k - ( (r_k, p_k)_A / (p_k, p_k)_A ) p_k =: r_k + beta_{k+1} p_k ,          (8.11)

we obtain an A-orthogonal basis of U_{k+1}. By inserting (8.11), the evaluation
of alpha_k and beta_k can be further simplified. Since

    (x - x_0, p_k)_A = (x - x_{k-1}, p_k)_A = (r_{k-1}, p_k) = (r_{k-1}, r_{k-1}) ,

it follows that

    alpha_k = (r_{k-1}, r_{k-1}) / (p_k, p_k)_A ,

and because of

    - alpha_k (r_k, p_k)_A = (- alpha_k A p_k, r_k) = (r_k - r_{k-1}, r_k) = (r_k, r_k)

we have

    beta_{k+1} = (r_k, r_k) / (r_{k-1}, r_{k-1}) .

Together we obtain the method of conjugate gradients, or briefly cg-method,
which was introduced by M. R. Hestenes and E. Stiefel [50] in 1952.

Algorithm 8.16 cg-method (for given starting value x_0).

    p_1 := r_0 := b - A x_0;
    for k := 1 to k_max do
        alpha_k := (r_{k-1}, r_{k-1}) / (p_k, p_k)_A = (r_{k-1}, r_{k-1}) / (p_k, A p_k);
        x_k := x_{k-1} + alpha_k p_k;
        if accurate then exit;
        r_k := r_{k-1} - alpha_k A p_k;
        beta_{k+1} := (r_k, r_k) / (r_{k-1}, r_{k-1});
        p_{k+1} := r_k + beta_{k+1} p_k;
    end for

Note that in fact essentially only one matrix-vector multiplication has
to be carried out per iteration step, namely A p_k, and the method thus
fully satisfies our requirements with respect to the cost of computation. In the
above, we have not further specified the termination criterion "accurate."
Of course, one would like to have a criterion of the kind

    || x - x_k ||   "sufficiently small,"          (8.12)

which is, however, not feasible in this form. Because of this, (8.12) is in
practice replaced by requiring

    || r_k ||_2 = || x - x_k ||_{A^2}   "sufficiently small."          (8.13)

As we have already explained in great detail in Section 2.4.3, the residue
norm is not a suitable measure for the convergence: For ill-conditioned
systems, i.e., for kappa(A) >> 1, the iterates can improve drastically, even though
the residue norms grow. We shall return to this topic again in Section 8.4.
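For completeness, a compact Python sketch of Algorithm 8.16 (illustrative only; the unspecified test "accurate" is replaced here by a simple relative residual bound, with the caveat on residual norms just discussed):

    import numpy as np

    def cg(A, b, x0, tol=1e-10, kmax=None):
        x = np.array(x0, dtype=float)
        r = b - A @ x
        p = r.copy()
        rr = r @ r
        kmax = kmax or len(b)
        for k in range(1, kmax + 1):
            Ap = A @ p
            alpha = rr / (p @ Ap)                  # alpha_k = (r_{k-1}, r_{k-1}) / (p_k, A p_k)
            x = x + alpha * p
            r = r - alpha * Ap
            rr_new = r @ r
            if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
                return x, k
            p = r + (rr_new / rr) * p              # beta_{k+1} = (r_k, r_k) / (r_{k-1}, r_{k-1})
            rr = rr_new
        return x, kmax

    n = 100
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # Spd model matrix
    b = np.ones(n)
    x, its = cg(A, b, np.zeros(n))
    print(its, np.linalg.norm(A @ x - b))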
The question regarding convergence properties remains to be answered.
Since the cg-method produces the solution x_n = x in finitely many steps,
we only lack a statement regarding the convergence speed.

Theorem 8.17 The approximation error x - x_k of the cg-method can be
estimated in the energy norm ||y||_A = sqrt((y, Ay)) by

    || x - x_k ||_A <= 2 ( (sqrt(kappa_2(A)) - 1) / (sqrt(kappa_2(A)) + 1) )^k || x - x_0 ||_A ,          (8.14)

where kappa_2(A) is the condition of A with respect to the Euclidean norm.

Proof. Since x_k is the solution of the minimization problem || x - x_k ||_A =
min_{V_k}, we have

    (x - x_k, x - x_k)_A <= (x - y, x - y)_A   for all y in V_k .

The elements of V_k are of the form

    y = x_0 + P_{k-1}(A) r_0 = x_0 + P_{k-1}(A)(b - A x_0) = x_0 + A P_{k-1}(A)(x - x_0) ,

with a polynomial P_{k-1} in P_{k-1}, such that

    x - y = x - x_0 - A P_{k-1}(A)(x - x_0) = (I - A P_{k-1}(A))(x - x_0) =: Q_k(A)(x - x_0) ,

where Q_k in P_k is a polynomial with Q_k(0) = 1. By inserting into the
minimization condition, it follows that

    || x - x_k ||_A <= min_{Q_k(0)=1} || Q_k(A)(x - x_0) ||_A
                    <= min_{Q_k(0)=1} max_{lambda in sigma(A)} |Q_k(lambda)| || x - x_0 ||_A .

Here we have used the fact that

    || Q_k(A) ||_A = || Q_k(A) ||_2 = max_{lambda in sigma(A)} |Q_k(lambda)| ,

which follows from

    || Q_k(A) ||_A = sup_{z != 0} || Q_k(A) z ||_A / || z ||_A

by inserting z = A^{1/2} w. It remains to be shown that the solution of the
min-max problem can be estimated by

    alpha := min_{Q_k(0)=1} max_{lambda in sigma(A)} |Q_k(lambda)| <= 2 ( (sqrt(kappa_2(A)) - 1) / (sqrt(kappa_2(A)) + 1) )^k .

In order to prove this, let

    0 < a = lambda_1 <= lambda_2 <= ... <= lambda_n = b

be the eigenvalues of the Spd-matrix A. Theorem 7.21 is applicable, since
0 is not in [a, b]; and we therefore have

    alpha <= min_{Q_k(0)=1} max_{lambda in [a,b]} |Q_k(lambda)| <= 1/c ,

where c := |T_k(t(0))| is the value at 0 of the Chebyshev polynomial transformed
to [a, b]. Here

    t(0) = 2 (0 - a)/(b - a) - 1 = -(b + a)/(b - a) = -(kappa_2(A) + 1)/(kappa_2(A) - 1) ,

because kappa_2(A) = lambda_n / lambda_1 = b/a, so that we obtain the statement from
Lemma 8.13 as in Example 8.12.  []
Corollary 8.18 In order to reduce the error in the energy norm by a factor
eps, i.e.,

    || x - x_k ||_A <= eps || x - x_0 ||_A ,

at most k cg-iterations are needed, where k is the smallest integer such that

    k >= (1/2) sqrt(kappa_2(A)) ln(2/eps) .

Proof. According to Theorem 8.17, we have to show that

    2 ( (sqrt(kappa_2(A)) - 1) / (sqrt(kappa_2(A)) + 1) )^k <= eps ,

or, equivalently,

    Theta^k >= 2/eps   with   Theta := (sqrt(kappa_2(A)) + 1) / (sqrt(kappa_2(A)) - 1) > 1 .

Thus the reduction factor eps is achieved if

    k >= log_Theta(2/eps) = ln(2/eps) / ln Theta .

Now the natural logarithm satisfies

    ln( (a + 1)/(a - 1) ) > 2/a   for a > 1 .

(By differentiating both sides with respect to a, one sees that their differ-
ence is strictly decreasing for a > 1. In the limit case a -> infinity both sides
vanish.) By assumption we thus have

    k >= (1/2) sqrt(kappa_2(A)) ln(2/eps) >= ln(2/eps) / ln Theta ,

and therefore the statement.  []
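For orientation (an illustration, not from the text), the bound of Corollary 8.18 gives, e.g., for eps = 10^-6:

    import numpy as np
    eps = 1.0e-6
    for kappa in (1.0e2, 1.0e4, 1.0e6):
        k = int(np.ceil(0.5 * np.sqrt(kappa) * np.log(2.0 / eps)))
        print(kappa, k)      # about 73, 726, and 7255 cg-iterations, respectively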
Remark 8.19 Because of the striking properties of the cg-method for Spd-
matrices, the natural question to ask is, which properties can be carried
over to nonsymmetric matrices? First one needs to point out the fact that
an arbitrary, only invertible matrix A does not, in general, induce a scalar
product. Two principal possibilities have been pursued so far:
If one interprets the cg-method for Spd-matrices as an orthogonal sim-
ilarity transformation to tridiagonal form (compare Chapter 6.1.1), then
one would have to transform an arbitrary, not necessarily symmetric ma-
trix to Hessenberg form (compare Remark 5.13). This means that a k-term
recurrence relation with growing k will replace a three-term recurrence re-
lation. This variant is also called Arnoldi method (compare [5]). Besides
the fact that it uses more storage space, it is not particularly robust.
If one insists on maintaining the three-term recurrence relation as a struc-
tural element (it uses little storage space), then one generally passes to the
normal equations

    A^T A x = A^T b

and realizes a cg-method for the Spd-matrix A^T A. However, because of
kappa_2(A^T A) = kappa_2(A)^2 (see Lemma 3.10), in this method, in the estimate
(8.14) of the convergence speed, the factor sqrt(kappa_2(A)) has to be replaced
by kappa_2(A). This is in general a significant change for the worse. A nice
overview on this scientific area is given by the extensive presentation by
J. Stoer [81]. Regarding algorithms on this basis, two variants have so
far essentially established themselves, namely, the cgs-method or conjugate
gradient squared method by P. Sonneveld [78] and the bi-cg-method, which
was originally proposed by R. Fletcher [31]. Furthermore, the cgs-method
avoids the additional evaluation of the mapping y |-> A^T y, which in certain
applications is costly to program.
Remark 8.20 In the derivation of the cg-method, we did not follow the
historical development, but chose the more lucid way using Galerkin ap-
proximations, which at the same time was supposed to illustrate a more
general concept. Here the meaning of the notion "conjugate gradients" at
first remained mysterious. The original approach starts with a different iter-
ation method, the method of steepest descent. Here one tries to successively
approximate the solution x of the minimization problem

    phi(x) = (1/2)(x, Ax) - (x, b) = min

by minimizing phi in the direction of the steepest descent

    - grad phi(x_k) = b - A x_k = r_k .

We divide the minimization problem phi(x) = min into a sequence of one-
dimensional minimization problems

    phi(x_k + alpha r_k) = min over alpha ,

whose solution, the optimal line search, is given by

    alpha_{k+1} = (r_k, r_k) / (r_k, A r_k) .

We now expect that the sequence of the thus constructed approximations
x_{k+1} = x_k + alpha_{k+1} r_k converges to x. This is indeed the case, because

    | phi(x_k) - phi(x) | <= (1 - 1/kappa_2(A)) | phi(x_{k-1}) - phi(x) | ,

which, however, is very slow for a large condition kappa_2(A) = lambda_n(A)/lambda_1(A).
From geometric intuition, it is clear that there exists the problem that the
level surfaces {x | phi(x) = c} for c >= 0 of phi are ellipsoids with very different
axes, where lambda_1(A) and lambda_n(A) are just the lengths of the smallest, respec-
tively, largest semiaxis of {phi(x) = 1} (see Figure 8.2). The quantity kappa_2(A)
thus describes the geometric "distortion" of the ellipsoids as compared with
spheres. However, the method of steepest descent converges best when the
level surfaces are approximately spheres. An improvement can be reached
by replacing the directions of search r_k by other "gradients." Here, the A-
orthogonal (or A-conjugate) vectors p_k with p_1 = r_0 = b - A x_0 have particularly
good properties, which explains the historical naming.

Figure 8.2. Method of steepest descent for large kappa_2(A).

Finally, we wish to mention that the cg-method, as opposed to the
Chebyshev method, does not require adjustment of parameters.

8.4 Preconditioning
The estimates of the convergence speed for the Chebyshev acceleration,
as well as the estimates for the cg-method, depend monotonically on the
condition kappa_2(A) with respect to the Euclidean norm. Our next question is
therefore the following: How can one make the condition of the matrix A
smaller? Or, more precisely: How can the problem Ax = b be transformed
so that the condition of the resulting matrix is as small as possible? This
question is the topic of preconditioning. Geometrically speaking, this
means: We want to transform the problem such that the level surfaces,
which in general are ellipsoids, become as close as possible to spheres.
Instead of the equation Ax = b with an Spd-matrix A in Mat_n(R), we
can also solve, for any invertible matrix B in GL(n), the equivalent problem

    A~ x~ = b   with   x~ := B^{-1} x   and   A~ := AB .

Here we have to take care that the symmetry of the problem does not get
destroyed, so that our iteration methods remain applicable. If B is also
symmetric and positive definite, then the matrix A~ = AB is no longer
self-adjoint with respect to the Euclidean scalar product (.,.), but it is with
respect to the product induced by B,

    (.,.)_B := (., B.) ,
because

    (x, ABy)_B = (x, BABy) = (ABx, By) = (ABx, y)_B .

The cg-method is therefore again applicable if we change the scalar prod-
ucts accordingly: (.,.)_B takes on the role of the Euclidean product (.,.),
and the corresponding "energy product"

    (.,.)_{AB} := (AB., .)_B = (AB., B.)

of A~ = AB takes the role of (.,.)_A. This immediately yields the following
iteration x~_0, x~_1, ... for the solution of A~ x~ = b:

    p_1 := r_0 := b - AB x~_0;
    for k := 1 to k_max do
        alpha_k := (r_{k-1}, r_{k-1})_B / (p_k, p_k)_{AB} = (r_{k-1}, B r_{k-1}) / (AB p_k, B p_k);
        x~_k := x~_{k-1} + alpha_k p_k;
        if accurate then exit;
        r_k := r_{k-1} - alpha_k AB p_k;
        beta_{k+1} := (r_k, r_k)_B / (r_{k-1}, r_{k-1})_B = (r_k, B r_k) / (r_{k-1}, B r_{k-1});
        p_{k+1} := r_k + beta_{k+1} p_k;
    end for

We are of course interested in an iteration for the actual solution x = B x~,
and we thus replace the row for the x~_k by

    x_k = x_{k-1} + alpha_k B p_k .

It strikes us that the p_k now only occur explicitly in the last row. If, for
this reason, we introduce the (A~-orthogonal) vectors q_k := B p_k, then this
yields the following economical version of the method, the preconditioned
cg-method or briefly pcg-method.

Algorithm 8.21 pcg-method (for the starting value x_0).

    r_0 := b - A x_0;
    q_1 := B r_0;
    for k := 1 to k_max do
        alpha_k := (r_{k-1}, B r_{k-1}) / (q_k, A q_k);
        x_k := x_{k-1} + alpha_k q_k;
        if accurate then exit;
        r_k := r_{k-1} - alpha_k A q_k;
        beta_{k+1} := (r_k, B r_k) / (r_{k-1}, B r_{k-1});
        q_{k+1} := B r_k + beta_{k+1} q_k;
    end for

Per iteration step, each time we only need to carry out one multiplication
by the matrix A (for A q_k), respectively, by B (for B r_k); thus, as compared
with the original cg-method, only one more multiplication by B.
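A Python sketch of Algorithm 8.21 (illustrative only): the preconditioner is passed as a function apply_B computing r |-> Br, and the unspecified test "accurate" is replaced by the computable residual B-norm suggested by Lemma 8.25 below.

    import numpy as np

    def pcg(A, b, x0, apply_B, tol=1e-10, kmax=None):
        x = np.array(x0, dtype=float)
        r = b - A @ x
        Br = apply_B(r)
        q = Br.copy()
        rBr = r @ Br
        kmax = kmax or len(b)
        for k in range(1, kmax + 1):
            Aq = A @ q
            alpha = rBr / (q @ Aq)                 # (r_{k-1}, B r_{k-1}) / (q_k, A q_k)
            x = x + alpha * q
            r = r - alpha * Aq
            Br = apply_B(r)
            rBr_new = r @ Br
            if np.sqrt(rBr_new) <= tol * np.sqrt(b @ apply_B(b)):
                return x, k
            q = Br + (rBr_new / rBr) * q           # beta_{k+1} = (r_k, B r_k) / (r_{k-1}, B r_{k-1})
            rBr = rBr_new
        return x, kmax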
Let us turn to the error x - x_k of the pcg-method. According to Theorem
8.17, for the error || x~ - x~_k ||_{AB} of the transformed iterate x~_k in the "new"
energy norm

    || y ||_{AB} := sqrt((y, y)_{AB}) = sqrt((ABy, By)) ,

we have the estimate

    || x~ - x~_k ||_{AB} <= 2 ( (sqrt(kappa_B(AB)) - 1) / (sqrt(kappa_B(AB)) + 1) )^k || x~ - x~_0 ||_{AB} .

Here kappa_B(AB) is the condition of AB with respect to the energy norm ||.||_B.
However, the condition

    kappa_B(AB) = lambda_max(AB) / lambda_min(AB) = kappa_2(AB)

is now independent of the underlying scalar product, and, because of

    || x~ ||_{AB} = sqrt((AB x~, B x~)) = sqrt((A(B x~), B x~)) = || B x~ ||_A = || x ||_A ,

the norm ||.||_{AB} is nothing else than the transformed energy norm ||.||_A.
We therefore obtain the following analogue to Theorem 8.17.

Theorem 8.22 Consider the approximation error x - x_k of the cg-method
8.21, which is preconditioned with the Spd-matrix B. Then x - x_k can be
estimated in the energy norm ||y||_A = sqrt((y, Ay)) by

    || x - x_k ||_A <= 2 ( (sqrt(kappa_2(AB)) - 1) / (sqrt(kappa_2(AB)) + 1) )^k || x - x_0 ||_A .

We therefore seek an Spd-matrix B, a preconditioner, with the following
properties:

(a) the mapping (B, y) |-> By is "simple" to carry out, and

(b) the condition kappa_2(AB) of AB is "small,"

where, for the time being, we have to leave it with the vague expres-
sions "simple" and "small." The ideal matrix to satisfy (b), B = A^{-1},
unfortunately has the disadvantage that the evaluation of the mapping
y |-> By = A^{-1}y possesses the complexity of the entire problem, and thus
contradicts the requirement (a). However, the following lemma says
that it is sufficient if the energy norms ||.||_B and ||.||_{A^{-1}}, which are induced
by B and A^{-1}, can be estimated (as sharply as possible) from above and
below (compare [92]).

Lemma 8.23 Suppose that for two positive constants mu_0, mu_1 > 0, one of
the following three equivalent conditions is satisfied:

(i) mu_0 (A^{-1}y, y) <= (By, y) <= mu_1 (A^{-1}y, y)   for all y in R^n,

(ii) mu_0 (By, y) <= (BABy, y) <= mu_1 (By, y)   for all y in R^n,

(iii) lambda_min(AB) >= mu_0 and lambda_max(AB) <= mu_1.          (8.15)

Then the condition of AB satisfies

    kappa_2(AB) <= mu_1 / mu_0 .

Proof. The equivalence of (i) and (ii) follows by inserting y = Au into (i),
because

    (A^{-1}y, y) = (Au, u)   and   (By, y) = (BAu, Au) = (ABAu, u) .

Because of

    lambda_min(AB) = min_{y != 0} (ABy, y)_B / (y, y)_B = min_{y != 0} (BABy, y) / (By, y)

and

    lambda_max(AB) = max_{y != 0} (ABy, y)_B / (y, y)_B = max_{y != 0} (BABy, y) / (By, y) ,

the latter condition is equivalent to (iii) (compare Lemma 8.29), from which
the statement

    kappa_2(AB) = lambda_max(AB) / lambda_min(AB) <= mu_1 / mu_0

follows immediately.  []

If both norms ||.||_B and ||.||_{A^{-1}} are approximately equal, i.e., mu_0 ~ mu_1,
then B and A^{-1} are called spectrally equivalent, or briefly also B ~ A^{-1}.
In this case, according to Lemma 8.23, we have that kappa_2(AB) ~ 1.

Remark 8.24 The three conditions of Lemma 8.23 are in fact symmetric
in A and B. One can see this most easily in the condition for the eigenvalues,
since

    lambda_min(AB) = lambda_min(BA)   and   lambda_max(AB) = lambda_max(BA) .

(This follows, for example, from (ABy, y) = (BAy, y).) If we assume
that the vague relation ~ is transitive, then one can rightfully call the
spectral equivalence an "equivalence" in the sense of equivalence
relations.
An important consequence of Lemma 8.23 concerns the termination
criteria.

Lemma 8.25 Suppose that the assumptions of Lemma 8.23 are satisfied,
and let ||.||_A := sqrt((., A.)) be the energy norm and ||.||_B := sqrt((., B.)). Then

    (1/sqrt(mu_1)) || r_k ||_B <= || x - x_k ||_A <= (1/sqrt(mu_0)) || r_k ||_B ;

i.e., if B and A^{-1} are spectrally equivalent, then the computable residual
norm || r_k ||_B is a quite good estimate of the energy norm || x - x_k ||_A of the
error.

Proof. According to Lemma 8.23 and Remark 8.24, (8.15), with A and B
exchanged, implies that

    mu_0 (Ay, y) <= (ABAy, y) <= mu_1 (Ay, y) ,

or, equivalently,

    (1/mu_1)(ABAy, y) <= (Ay, y) <= (1/mu_0)(ABAy, y) .

The residue r_k = b - A x_k = A(x - x_k), however, satisfies

    || r_k ||_B^2 = (r_k, B r_k) = (ABA(x - x_k), x - x_k) ,

and therefore, as claimed,

    (1/sqrt(mu_1)) || r_k ||_B <= || x - x_k ||_A <= (1/sqrt(mu_0)) || r_k ||_B .          []

As a termination criterion for the pcg-method, instead of (8.13), we
therefore utilize the much more sensible condition

    || r_k ||_B = sqrt((r_k, B r_k))   "sufficiently small."

Example 8.26 A very simple, but often already effective, preconditioning
is the inverse B := D^{-1} of the diagonal D of A, called diagonal precon-
ditioning. Variants of this utilize block diagonal matrices (see (8.1)), where
the blocks have to be invertible.
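With the pcg sketch above, diagonal preconditioning amounts to one componentwise division per step. A small illustration (assuming the pcg function from the previous sketch; the example matrix is ad hoc):

    import numpy as np
    n = 100
    A = np.diag(2.0 + 10.0 * np.arange(n)) - np.eye(n, k=1) - np.eye(n, k=-1)   # Spd, strongly varying diagonal
    b = np.ones(n)
    d = np.diag(A)
    x, its = pcg(A, b, np.zeros(n), lambda r: r / d)        # B = D^{-1}
    print(its, np.linalg.norm(A @ x - b))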

Example 8.27 If one applies the Cholesky factorization A = L L^T from
Section 1.4 to a symmetric sparse matrix, then one observes that outside
the sparsity pattern of A, usually only "relatively small" elements l_ij are
produced. This observation leads to the idea of the incomplete Cholesky
decomposition (IC): It consists of simply omitting these elements. If by

    P(A) := { (i, j) | a_ij != 0 }

we denote the index set of the nonvanishing elements of a matrix A, then,
instead of L, we construct a matrix L~ with

    P(L~) in P(A) ,

by proceeding as in the Cholesky factorization and setting l~_ij := 0 for all
(i, j) not in P(A). Here we expect that

    A ~ A~ := L~ L~^T .

In [59], a proof of the existence and numerical stability was given for the
case that A is an M-matrix, i.e.,

    a_ii > 0,  a_ij <= 0 for i != j,  and  A^{-1} >= 0 (elementwise).

Such matrices occur when discretizing simple partial differential equations
(see [86]). In this case, an M-matrix is produced in each elimination step.
For B, we then set

    B := A~^{-1} = L~^{-T} L~^{-1} .

This way, in many cases one obtains a drastic acceleration of the cg-method,
far beyond the class of M-matrices, which is accessible to proofs.
Remark 8.28 For systems which originate from the discretization of par-
tial differential equations, this additional knowledge about the origin of the
system allows for a much more refined and effective construction of pre-
conditioners. For examples, we refer to the articles by H. Yserentant [93],
J. Xu [92], and (for time-dependent partial differential equations) to the
dissertation of F. A. Bornemann [9]. As a matter of fact, in solving par-
tial differential equations by discretization, one has to deal not only with
a single linear system of fixed, though high dimension. An adequate de-
scription is by a nested sequence of linear systems, whose dimension grows
when the discretization is successively refined. This sequence is solved by
a cascade of cg-methods of growing dimension. Methods of this type are
genuine alternatives to classical multigrid methods-for details see, e.g.,
the fundamental paper by P. Deuflhard, P. Leinen, and H. Yserentant [25].
In these methods, each linear system is only solved up to the precision of
the corresponding discretization. In addition, it allows for a simultaneous
construction of discretization grids, which fit the problem under consider-
ation. We shall explain this aspect in Section 9.7 in the simple example of
numerical quadrature. In Exercise 8.4, we illustrate an aspect of the cascade
principle which is suitable for this introductory text.

8.5 Lanczos Methods


In Section 5.3, we have computed the eigenvalues of a symmetric ma-
trix by first transforming it to tridiagonal form and then applying the
QR-algorithm. In this section, we shall again turn our attention to the
eigenvalue problem

    A x = lambda x          (8.16)

with a real symmetric matrix A. We shall be concerned with large sparse
matrices A, as they occur in most applications. For these problems, the
methods which were presented in Chapter 5 are too expensive. In the fol-
lowing, we shall therefore develop iterative methods for the approximation
of the eigenvalues of a symmetric matrix which essentially only require one
matrix-vector multiplication per iteration step. The idea for these meth-
ods goes back to the Hungarian mathematician C. Lanczos (pronounced
Luntsosh) in his work [57] from the year 1950. As in the derivation of the
cg-method in Section 8.3, we pose the eigenvalue problem (8.16) first as an
extremum problem.

Lemma 8.29 Suppose lambda_min and lambda_max are the smallest, respectively, largest
eigenvalues of the real symmetric matrix A in Mat_n(R). Then

    lambda_min = min_{x != 0} (x, Ax)/(x, x)   and   lambda_max = max_{x != 0} (x, Ax)/(x, x) .

Proof. Since A is symmetric, there exists an orthogonal matrix Q in O(n)
such that

    Q A Q^T = Lambda = diag(lambda_1, ..., lambda_n) .

With y := Qx, we have (x, x) = (y, y) and

    (x, Ax) = (Q^T y, A Q^T y) = (y, Q A Q^T y) = (y, Lambda y) .

Thus the statement is reduced to the case of a diagonal matrix, for which
the statement is obvious.  []

The function mu : R^n \ {0} -> R,

    mu(x) := (x, Ax) / (x, x) ,

is called the Rayleigh quotient of A.

Corollary 8.30 Let lambda_1 <= ... <= lambda_n be the eigenvalues of the symmetric
matrix A in Mat_n(R), and let eta_1, ..., eta_n be the corresponding eigenvectors.
Then

    lambda_i = min_{x in span(eta_i, ..., eta_n), x != 0} mu(x) = max_{x in span(eta_1, ..., eta_i), x != 0} mu(x) .
If we compare the formulation of the eigenvalue problem as an extremum
problem in Lemma 8.29 with the derivation of the cg-method in Section 8.3,
then this suggests approximating the eigenvalues Amin and Amax, by solving
8.5. Lanczos Methods 263

the corresponding extremum problem on a sequence of subspaces VI e V2 e


... eRn.
Since the Krylov spaces proved useful in Section 8.3, we choose as
subspaces
Vk (x) := span { x, Ax, ... , A k-I X }
for a starting value x i= 0 and k = 1,2, .... We expect that the extreme
values
(k)._ . (y, Ay) > . \ (k) ._ (y, Ay) < \
A min · - mIn ( ) _ A mm , A max . - max ( ) _ Amax
yEVdx) y, y yEVdx) y, Y
Y#O Y#O

to approximate the eigenvalues Amin and Amax well for growing k. According
to Theorem 6.4, we can construct an orthonormal basis VI, ... ,Vk of Vk(X)
by the following three-term recurrence relation:
.- x
Vo 0, VI .-
IIxl12
Qk .- (Vk' AVk)
Wk+I .- AVk - QkVk - PkVk-I (8.17)
Pk+l .- Il wk+Iil2
Vk+I .- Wk+I falls Pk+l i= O.
Pk+l
This iteration is called Lanczos algorithm. Thus Qk := [VI, ... , Vk] is a
column-orthonormal matrix, and

is a symmetric tridiagonal matrix. Now set y = QkV. Then, for a V E Rk,


we have (y, y) = (v, v), and

(y, Ay) = (QkV, AQkV) = (v, Qf AQkV) = (v, TkV) .


Hence it follows that
A(k) = min (y, Ay)
mm yEVk(x) (y, y)
Y#O

and similarly that A~lx = Amax(Tk). Because of Vk+I :J Vk, the minimal
property yields immediately that
A(HI) < A(k) and A(HI) > A(k)
mIn - mIn max - max·
264 8. Large Symmetric Systems of Equations and Eigenvalue Problems

The approximations A;:{n and A~lx are therefore the extreme eigenvalues
of the symmetric tridiagonal matrix Tk, and, as such, can be easily com-
puted. However, in contrast to the cg-method, it is not guaranteed that
A;:i~ = Amin, since in general Vn(x) "I Rn. This shows in the three-term
recurrence relation (8.17) in a vanishing of (3k+l for a k < n. In this case,
the computation has to be started again with an x E Vk (x).1. .
The convergence speed of the method can again be estimated by utilizing
the Chebyshev polynomials.
Theorem 8.31 Let A be a symmetric matrix with the eigenvalues Al :::;
... :::; An and corresponding orthonormal eigenvectors T)l, ... ,T)n. Further-
more, let fJl :::; ... :::; /-lk be the eigenvalues of the tridiagonal matrix Tk of
the Lanczos method for the starting value x "I 0, and with the orthonormal
basis VI, ... ,Vk ofVdx) as in {8.17}. Then
(An - Al)tan2(~(vl,T)n))
/\\ > "k > /\\ - -'-----;~...,----'--'--'--.:....:...
n -,.., - n TLI (1 + 2pn) ,

where Pn := (An - An-d/(An-l - Ad·

Proof. Because of Vk(X) = {P(A)x I P E Pk-d, we have


(y, Ay) (P(A)Vl, AP(A)Vl)
fJk = max - - - = max .
YEVk(X) (y, y) PEPk-l (P(A)Vl, P(A)Vl)
yf.O

By representing VI with respect to the orthonormal basis T)l, ... ,T)n , VI =


'L7=1 ~jT)j with ~j = (VI, T)j) = cos( ~(Vl' T)j)), it then follows that
n
L~;P2(Aj)Aj
j=1
n
(P(A)Vl, P(A)Vl) L~; p2(>\j) .
j=1
We thus obtain
(P(A)Vl' AP(A)Vl) A 'L7~11 ~JP2(Aj)(Aj - An)
(P(A)Vl, P(A)Vl) n+ ",n C2P2(A)
L...J=l <'J J
",n-l C2p2(A)
> An + (AI - An) L...J=1 <'J J .
c2 p2 (A ) + ",n-l c2 p2 (A.)
<'n n L...J=1 <'J J

In order to obtain an estimate as sharp as possible, we need to insert a poly-


nomial PEP n-l, which is as small as possible in the interval [AI, An-I].
According to Theorem 7.19, one should take the transformed Chebyshev
polynomial

peA) := Tk_l(t(A)) with teA) = 2 A - Al _ 1 = 1 + 2 A - An-l


An-l - Al An-l - Al
8.5. Lanczos Methods 265

with the property IP(Aj) I ~ 1 for j = 1, ... ,n - 1. Because of "L7=1 E,; =


Ilvd~ = 1, it thus follows that
1 - ~2 1
11k ~ An - (An - Ad~ T2
'-o.n k-l
(1 + 2Pn )'
and the statement follows from the fact that
1 - ~~
- - 2- = tan 2 (<r( VI, 7)n )) .
~n
D

In many applications (e.g., in structural mechanics) one encounters the


generalized symmetric eigenvalue problem,
Ax = ABx, (8.18)
where the matrices A, B E Matn(R) are both symmetric and B is in addi-
tion positive definite. If we insert the Cholesky factorization B = LLT of
B into (8.18), then we have
Ax = ABx ¢=} Ax = ALLT x¢=} (L -1 AL -T) LT x = ALT.T.
' - v - " '-v-"
=: A =:x
Since A = L -1 AL -T is again symmetric, it follows that the generalized
eigenvalue problem Ax = ABx is equivalent to the symmetric eigenvalue
problem Ax = AX. Thus all eigenvalues Ai are real. Furthermore, there is
an orthonormal basis 7)1, ... ,7)n of generalized eigenvectors A7)i = Ai B 7)i.
If we therefore define the generalized Rayleigh quotient of (A, B) by

( ) ._ (x, Ax)
11 x .- (x, Bx) ,
then we obtain the following statement, which is an analogue to Lemma
8.29.
Lemma 8.32 Let Amin and Amax be the smallest, respectively, largest
eigenvalue of the generalized eigenvalue problem Ax = ABx, where the
matrices A, B E Matn(R) are symmetric and B is positive definite. Then
. (x, Ax) (x, Ax)
Amin = mIn (
x#O x, Bx
) and Amax = max
x#O
( B)'
x, x

Proof. With the above notation, we have


(x, Bx) = (x, LLT x) = (x, x)
and
(x, Ax) = (L -T x, AL -Tx) = (x, Ax) .
The statement follows from Ax = ABx <=? Ax = AX and Lemma 8.29. D
266 8. Large Symmetric Systems of Equations and Eigenvalue Problems

Similarly, the Lanczos algorithm (8.17) carries over, by maintaining


(x, Ax), but replacing (x, x) by (x, Bx) and IIxl12 by (x, Bx)1/2. A detailed
presentation can be found in the book by J. Cullum and R. Willoughby
[13].
Finally, we want to discuss the case that not only the extremal eigenval-
ues, but eigenvalues in a given interval, or even all eigenvalues are to be
found. In this case, we go back to the fundamental idea of the inverse power
method. Let 5. be a given estimate in the neighborhood of the unknown
eigenvalues. Then (for the generalized eigenvalue problem) the matrix
C := (A - 5.1)-1 = LT(A - 5.B)-1 L

has the eigenvalues


Vi := (Ai - 5.)-1 for i = 1,2, .... (8.19)
The application of the Lanczos method to the matrix C then produces just
the dominant eigenvalues Vi, thus, because of (8.19), eigenvalues Ai in a
neighborhood of 5.. By variation of the shift parameter 5. in a given interval,
one can therefore obtain all eigenvalues. This variant of the algorithm is
called spectral Lanczos and was described in 1980 by T. Ericsson and A.
Ruhe [29].

Exercises
Exercise 8.1 Show that the coefficients Pk = 2ITk-l(f)/Tk(f), which
occur in the Chebyshev acceleration, satisfy the two-term recurrence
relation
1
PI = 2, Pk+l = 1 2
1 - 4(} Pk
'

where () .- 1/1. Furthermore, show that the limit of the sequence {pd
satisfies
2
lim Pk =: P = - - - = = =
k-->oo 1+~'

Exercise 8.2 Given are sparse matrices of the following structure (band
matrices, block diagonal matrices, arrow matrices, block cyclical matrices).

* * * *
* * * *
* * *
* * * * *
* *
* *
* * *
* * *
Exercises 267

* * * * * * * * * *
* * * * * *
* * * * * *
* * * * * *
* * * * * *
* * * * * *
Estimate the required storage space and cost of computation (number of
operations) for
(a) LU-factorization with Gaussian elimination without pivoting, respec-
tively, with column pivot search and exchange of rows.
(b) QR-factorization with Householder transformations without, respec-
tively, with exchange of columns.
(c) QR-factorization with Givens transformations.
Exercise 8.3 Let B ~ A -1 be a spectrally equivalent preconditioning ma-
trix. In the special case that B is of the form B = CC T , a preconditioned
cg-method can also be derived differently. For this, from the system Ax = b,
one passes formally to the equivalent system
Ax=b with A=CACT , x=C-Tx, and b=Cb.
Here A is again an Spd-matrix. One applies the classical cg-method to this
transformed system. Using this idea, derive a convergence result, which
is based on the energy norm of the error (which, as is well-known, is not
directly approachable). Use this approach to derive two different effective
variants of the pcg-method, one of which coincides with our Algorithm 8.2l.
Derive both variants, and consider which would be preferable under which
conditions. Implement both variants in the special case of the incomplete
Cholesky factorization, and carry out computational comparisons.
Exercise 8.4 Short introduction to the cascade principle. We consider a
sequence of linear systems
Ajxj=bj , j=l, ... ,m
of dimension nj, as it could come up by successively finer (uniform) dis-
cretization of an (elliptic) partial differential equation. The dimension of
the systems is assumed to grow geometrically, i.e.,
nj+1 = Knj for a K > l.
We seek an approximate solution xm of the largest system (corresponding
to the finest discretization) such that the error in the energy norm satisfies
Ilxm - xmllAn> s:: cmo
for given c, 0 > o. Assume that the connection of the linear systems with
the discretizations is shown by the following properties:
268 8. Large Symmetric Systems of Equations and Eigenvalue Problems

(i) The matrices Aj are symmetric positive definite, and their conditions
are uniformly bounded by
K2(A j )::::: C for j = l, ... ,m.
(This is only true after a suitable preconditioning.)
(ii) If Xj is an approximation of Xj with
Ilxj - xjllA j ::::: E j 8,

then the vector XJ+l := (Xj, 0), which is augmented by zeros, is an


approximation of XJ+l with

IlxJ+l - xJ+11IA + j 1 ::::: Ej 8,


(iii) The cost of a cg-iteration for Aj is
Cj = (Jnj for a (J > O.
We compare two algorithms for the computation of xm:
(a) Standard cg-method: How many cg-iterations are necessary to
compute Xm , beginning with a starting solution x;;" with
Ilx?n - xmllAm ::::: 8?
What is the cost?
(b) Cascade cg-method: Let x~ be an approximation of Xl with
Ilx? - xlllA, : : : 8.
From this, we compute successive approximations Xl, ... , xm with
Ilxj - xjllAJ ::::: E j 8,
by always taking the approximate solutions of the previous system,
which are augmented by zeros, as a starting solution
xJ := (Xj-l, 0) for j = 2, ... ,m
in the cg-method for the system Ajxj = bj . Show that the approx-
imate solution Xm , which is computed this way, has the required
precision. How many cg-iterations are necessary for each j, and what
is their cost? What is the total cost of the computation of xm?
The method (b) even opens the possibility to compute the starting value
x~ by a direct method, e.g., the Gaussian elimination, if the dimension nl
is sufficiently small.
Exercise 8.5 Carry the spectral Lanczos algorithm out in detail. In par-
ticular, replace the explicit computation of A by the product representation
L-1AL- T .
9
Definite Integrals

A relatively common problem is the computation of the Riemann integral

J(f) := J~(f) := lb f(t) dt .

Here f is a piecewise continuous function on the interval [a, b], which,


however, is usually piecewise smooth in the applications. (A piecewise con-
tinuous, but not piecewise smooth, function could not be implemented on
a computer anyway.) The Riemann integral may be interpreted as the area
below the graph of f~see Figure 9.1.

a b t

Figure 9.1. Problem of quadrature.


P. Deuflhard et al., Numerical Analysis in Modern Scientific Computing
© Springer-Verlag New York, Inc. 2003
270 9. Definite Integrals

In calculus many techniques to "solve" such an integral are taught-in


the sense that a "simpler" expression is found in lieu of J(f). Whenever this
is possible the integral is said to be analytically solvable and the solution
can be expressed in closed form. However, these two notions do not mean
much more than that the "analytic expressions" are better known math-
ematically, and are therefore preferred to the original integral expression.
However, in many cases, a closed analytic representation of the integral
does not exist. Then only the purely numerical evaluation remains. In many
cases, this is even preferable when a closed-form solution is known. One can
very easily convince oneself of this fact by looking into an integral table
(e.g., [43]).
The numerical computation of J(f) is also called numerical quadrature.
This notion reminds us of the quadrature of the circle, the problem of only
using compasses and ruler to construct a square, which has the area of
the, unit circle, which is unsolvable because of the transcendence of 7r. To-
days meaning of the term quadrature emerged from this construction of the
square of equal area. More often, one finds the notion of numerical integra-
tion, which, however, and more generally, at the same time describes the
solution of differential equations. In fact, the computation of the integral
1(f) corresponds formally to the solution of the initial value problem

y'(t) = f(t), y(a) = 0, t E [a, b] ,

since y(b) = 1(f). We will come back to this formulation in the course of
this chapter.

9.1 Quadrature Formulas


First, we shall list some analytic properties of the Riemann integral. In the
following we assume that b > a. The definite integral

1~ = 1 : C[a, b] --+ R, f f------t 1(f) = lb f(x) dx

is a positive linear form on the space C[a, b] of continuous functions on the


interval [a, b]. In other words,

(a) 1 is linear, i.e., for all continuous functions f, g and 0, (3 E R, we


have

1(of + (3g) = 01(f) + (31(g) .

(b) 1 is positive, i.e., if f is nonnegative, then so is the integral 1(f),

f 2: 0 ~ 1(f) 2: 0 .
9.1. Quadrature Formulas 271

In addition, the integral is additive with respect to a partition of the


integration interval, i.e., for all T E [a, b], we have
1~ + 1~ = 1~ .
Before starting with the computation, we should ask the question of the
condition of the integral. In order to do this, we first have to clarify how
to measure perturbations of of the integrand. If we choose the supremum-
norm (as the standard norm on CO), then, on an infinite integration interval,
the perturbations can become infinite. Because of this, we decide here for
the so-called Ll-norm

Ilflll := lb If(t)1 dt = 1(lfl) .


Lemma 9.1 The absolute and relative condition of the quadrature problem
(I, 1), 1(f) =J:
f, with respect to the Ll-norm II . 111 are
. 1(lfl)
h:abs =1 respectwely, h:rel = 11 (f) I .

Proof. For any perturbation Of E LI[a, b], the perturbation of the integral
can be estimated by

where equality holds for nonnegative or nonpositive perturbations. D

In the absolute concept, perturbation is therefore a harmless problem.


In our relative view, however, we expect difficulties when the ratio of the
integral of the absolute value and the absolute value of the integral is very
large, and the problem is thus ill-conditioned. Obviously, this is completely
analogous to the condition of addition (respectively, subtraction) (com-
pare Example 2.3). A danger for the relative condition comes from highly
oscillatory integrands, which do occur in many applications. Already the
integration of a single oscillation of the sine-function over a single period
is ill-conditioned with respect to the relative error concept.
Of course, we expect that a method to compute the integral preserves
its structural properties. The goal of the numerical quadrature is therefore
the construction of positive linear forms
i : C[a, b] ---7 R, f ~ i(f) ,
which approximate the integral 1 as well as possible, i.e.,
i(f) - 1(f) is "small."
Example 9.2 The first, and, because of the definition of the Riemann
integral, most obvious method for the computation of 1~(f), is the trape-
272 9. Definite Integrals

zoidal sum (see Figure 9.2). We partition the interval into n subintervals

f
-" ;
;
;
;
;
;
;
;
;
;
;
;
;

To ;

ho
i
a = to t
Figure 9.2. Trapezoidal sum with equidistant nodes.

[t i - 1, til (i = 1, ... , n) of length hi := ti - ti-l such that

a = to < tl < ... < tn = b,


and approximate the integral 1 (1) by the sum of the areas of the trapezoids

T(n) := t
i=1
Ti, T i := ~i (1(ti-l) + f(ti)) .

The trapezoidal sum T(n) is obviously a positive linear form. We also


interpret it as an application of the trapezoidal rule (see Section 9.2)

b-a
T := -2-(1(a) + f(b)) (9.1 )

to the subintervals [t i - l , til. By comparing the trapezoidal sum T(n) with


the Riemann lower, respectively, upper sums,
n n
R;:i~ = L
i=1
hi min
tE[ti_l,t;]
f(t) and R~2x = L
i=1
hi max
tE[ti-l,ti]
f(t) ,

it is obvious that
R(n)
mIn -
< T(n) <
-
R(n)
max
.

For continuous f E Cora, b], the convergence of the Riemann sums therefore
implies the convergence of the trapezoidal sums:
R(n)
mm
< T(n) < Rmax
(n)

1 1 1 for n -t 00, hi S h -> 0 .


1(1) 1(1) 1(1)
9.2. Newton-Cotes Formulas 273

Below (see Lemma 9.8), we shall give the approximation error in more
detail.
The trapezoidal sum is a simple example for a quadrature formula, which
we define as follows.
Definition 9.3 A quadrature formula i for the computation of the definite
integral is a sum
n
i(J) = (b - a) L Ai!(ti) ,
i=O

with nodes to, ... ,tn, and weights AO, ... , An such that
n
(9.2)

The condition (9.2) concerning the weights guarantees that a quadrature


formula integrates constant functions exactly, i.e., i(l) = 1(1) = b - a. It
is furthermore obvious that a quadrature formula is positive, if all weights
are, i.e.,
i positive {==} Ak;:::: 0 for all k = 0, ... , n.
This means that for all practical purposes, only quadrature formulas with
positive weights are interesting. If this is not the case, then the sum of the
absolute values of the weights
n
(9.3)

measures by how much the quadrature formula deviates from the positivity
requirement. Because of the results for the scalar product, we do not have
to worry about the stability of the evaluation of a quadrature formula (see
Lemma 2.30).

9.2 Newton-Cotes Formulas


The idea of the trapezoidal sum consists of replacing the function f by an
approximation j, here the linear interpolation, for which the quadrature
can be easily carried out, and to view 1(}) as an approximation of 1(J),
i.e.,
i(J) := 1(j) .
Here, of course, not only linear approximations can be used, as is the case
in the trapezoidal rule (9.1), but arbitrary approximations, like the ones
274 9. Definite Integrals

which were introduced in the last chapter. In particular, for given nodes
to, ... , tn, the function
n
j(t) := P(f I to,···, tn ) = L f(ti)Lin(t),
i=O

is the interpolation polynomial of f, where Lin E P n is the ith Lagrange


polynomial for the nodes t j , i.e., Lin(tj) = Oij. This approach yields the
quadrature formulas
n
i(f) := I(P(f I to, ... , tn )) = (b - a) L Ainf(td,
i=O

where the weights

Ain := b ~ alb Lin(t) dt


only depend on the choice of the nodes to, ... , tn.
It is clear from the construction that the quadrature formulas, which are
defined this way, are exact for polynomials PEP n of degree less than or
equal to n,
i(p) = I(Pn(P)) = I(P) for P E P n .
For given nodes ti, the quadrature formula is already uniquely determined
by this property.
Lemma 9.4 For n + 1 pairwise distinct nodes to, ... , tn, there exists one
and only one quadrature formula

i(f) = (b - a) L Ad(ti) ,
i=O

which is exact for all polynomials PEP n of degree less than or equal to n.

Proof. We insert the Lagrange polynomials Lin E P n, which belong to


the nodes t i , and which by assumption are integrated exactly, into the
quadrature formula
n n

j=o j=o
and thus, in a unique way, obtain back the weights Ai = (b - a)-l I(Lin) =
~. 0

In the special case of equidistant nodes


b-a
hi = h = - - , ti = a + ih, i = 0, ... , n,
n
9.2. Newton-Cotes Formulas 275

the constructed quadrature formulas are called Newton-Cotes formulas.


The term for the corresponding Newton-Cotes weights Ain simplifies via
the substitution s := (t - a)/h:

Ain = --
b- a
1 lb II--
a j=O
n t - tj
ti - tj
1 in n
dt= -
n 0
II-- j
S -

i - j
j=O
ds.
j#i j#i

The weights Ain, which are independent of the interval boundaries, only
have to be computed once, respectively, given once. In Table 9.1, we have
listed them up to order n = 4. The weights, and therefore also the quadra-
ture formulas, are always positive for the orders n = 1, ... , 7. Higher orders
are less attractive, since starting with n = 8, negative weights may occur.
In this case, the Lebesgue constant is the characteristic quantity (9.3), up
to the normalization factor (b - a)-I. Note that we have already encoun-
tered the Lebesgue constant in Section 7.1 as the condition number of the
polynomial interpolation.

Table 9.1. Newton-Cotes weights Ain for n = 1, ... ,4.


n AOn , ... , Ann Error Name

1 1
"2 "2
1
~~ 1" (,) Trapezoidal rule

2 1
(;
4
(;
1
(; ~~f(4)(,) Simpson's rule, Kepler's barrel rule

3 1
8 8 8 8
3 3 1 3:0 5
f(4)(,) Newton's 3/8-rule

4 7
90
32
90
12
90
32
90
7
90
~~~ f(6)(,) Milne's rule

In Table 9.1, we have already assigned the respective approximation er-


rors to the Newton-Cotes formulas (for sufficiently smooth integrands),
expressed as a power of the step size h and a derivative at an intermediate
position, E [a, b]. Observe that for the even orders n = 2,4, the power
of h and the degree of the derivative always jump by 2. In the following
we shall verify these estimates for the first two formulas, by which one can
already see the principle for odd n (see Exercise 9.3).
Before starting, we recall a not so obvious variant of the mean value theo-
rem, which we shall encounter repeatedly in the proof of the approximation
statements.
Lemma 9.5 Let g, h E C[a, b] be continuous functions on [a, b], where g
has only one sign, i.e., either g(t) 2: 0 or g(t) ::; 0 for all t E [a, b]. Then

lb h(t)g(t) dt = h(,) lb g(t) dt

for some, E [a, b].


276 9. Definite Integrals

Proof. Assume without loss of generality that g is nonnegative. Then

min h(t)
tEla,b] Ja
r
b
g(s) ds:S;
Ja
b
r
h(s)g(s) ds:S; max h(t)
tEla,b] Ja
b
g(s) ds. r
Therefore, for the continuous function

F(t) := lb h(s)g(s) ds - h(t) lb g(s) ds

°
there exist to, tl E [a, bj with F(to) ::::: and F(td :s; 0, and thus, because
of the mean value theorem, there exits also aTE [a, bj such that F( T) = 0,
or, in other words

lb h(t)g(t) dt = h(T) lb g(t) dt ,

as required. o
Lemma 9.6 Let f E C 2 ([a, b]) be a twice continuously differentiable
function. Then the approximation error of the trapezoidal rule
b-a
T = -2-(f(a) + f(b))
with step size h := b - a can be expressed by

T -
Ja
rbf = h123 J"(T)
for some T E [a, bj.

Proof. According to Theorem 7.10, and by using Newton's remainder, the


linear interpolation P = P1 (f) satisfies
f(t) = P(t) + [t, a, blf· (t - a)(t - b) .
From Corollary 7.13 the second divided difference can be written as

[t ,a, bjf = f"(T)


2

with some T = T(t) E [a, b], which is independent of t. Inserted into the
quadrature formula, Lemma 9.5 implies that

lb f = lb P(t)dt+ lb[t,a,bjf.~dt
:so
T +
2 ,a J
r
f"(T) b(t - a)(t - b) dt
~

'V

(b_a)3
--6-
9.2. Newton-Cotes Formulas 277

for some 7 E [a, b], hence

T-lb a
f= h 3 1"(7).
12
o

Lemma 9.7 Kepler's barrel rule

s= b~a (f(a)+4 f (a;b) +f(b))

is also exact for polynomials Q E P3 of degree 3. For f E C 4 ([a, b]), the


approximation error of step size h := b;a
can be expressed by

with some 7 E [a,b].

Proof. Let Q E P 3 . Then, according to Newton's remainder formula, the


quadratic interpolation P = P 2 (Q) at the nodes a, b and (a + b)/2 satisfies

a+b
Q(t) = P(t) + 'Y (t - a)(t - -2-) (t - b) ,
'- J
V

= W3(t)

where 'Y = QI//(t)/6 E R is a constant. This implies for the integral that

because the integral J:


Wk of Newton's basis functions vanishes for odd k.
Kepler's barrel rule is therefore also exact for polynomials of degree 3. For
f E C 4 ([a, b]), we now form the cubic Hermite interpolation Q = P3 (f) E
P 3 with respect to the four nodes

a+b a+b
to = a, h = -2-' t2 = -2-' t3 = b .

For the description of the approximation error of Q, we again utilize


Newton's remainder,

a+b a+b a+b 2


f(t) = Q(t) + [t,a, -2-' -2-,b]f (t - a)(x - -2-) (t - b)
'- J
v
=W4(t):::;O
278 9. Definite Integrals

Estimation of the remainder according to Corollary 7.13, insertion into the


integral and application of the mean value theorem

yield again the statement. D

As we already saw in the introductory Example 9.2 of the trapezoidal


sum, further quadrature formulas can be constructed by partitioning the in-
terval and applying a quadrature formula on the respective subintervals. We
partition the interval again into n subintervals [ti-l, til with i = 1, ... , n,

a = to < tl < ... < tn = b,

so that, according to the additivity of the integral,

I f = 8 ititi-1 f .
a
b n

Thus
n

i(f) := L i;:_1 (f)


i=1

is a (possibly better) approximation of the integral, where i:;+1 denotes


an arbitrary quadrature formula on the interval [t i , tHl]' In the following,
we shall derive the approximation error for the trapezoidal sum from the
approximation error of the trapezoidal rule.

Lemma 9.8 Let h := (b - a)/n, and let ti be the equidistant nodes ti =


a + ih for i = 0, ... , n. Furthermore, let Ti denote the trapezoidal rule
h
Ti := 2(f(ti-d + f(ti))
on the interval [t i - 1 ,ti ], where i = 1, ... ,n. Then, for f E C 2 ([a,b]), the
approximation error of the trapezoidal sum

T(h) := t T i = h (~(f(a) + f(b)) + ~ f(a + ih))


can be expressed in the form

T(h) -lb a
f = (b - a)h 21"(T)
12
with some T E [a, b].
9.3. Gauss-Christoffel Quadrature 279

Proof. According to Lemma 9.6, there exists a E [ti-1, til such that

iti
7i

h3
T; - f = 121"(7;) ,
til

and therefore

T(h)- I =8
a
b
f
n (
T; -
iti)
ti-l f = 8 h121"(7;) = (b - 12a)h
n 3 2 1 n
:;;: 81"(7;) .

Since
1
L 1"(7i)
n
min 1"(t) ::; - ::; max 1"(t) ,
tE[a,b] n ;=1 tE[a,b]

and according to the mean value theorem, there exists a 7 E [a, b] such that

and therefore

as claimed. o

9.3 Gauss-Christoffel Quadrature


In the construction of the Newton-Cotes formulas, starting with n + 1
given integration nodes t;, we have determined the weights A; so that the
quadrature formula integrates exactly polynomials up to degree n. Can we
possibly achieve more by also putting the nodes at our disposal?
In this section, we want to answer this question for the more general
problem of weighted integrals

J(f) := lb w(t) f(t) dt ,

which we have already encountered, when we introduced orthogonal poly-


nomials in Chapter 6.1.1. Again, let w be a positive weight function,
w(t) > 0 for all t E]a, b[, so that the norms

(l
1

IIPII:= (P,P)~ = b
w(t)p(t)2 dt) '2 < 00

are well-defined and finite for all polynomials PEP k and all kEN.
In contrast to Section 9.2, the interval may be infinite here. It is only
280 9. Definite Integrals

important that the corresponding moments

11k := lb tkW(t) dt

are bounded. For the definition of the absolute condition, we measure the
perturbations Sf in a natural way with respect to the weighted L 1 -norm

IIfl11 = lb w(t) If(t)1 dt = 1(lfl) .

This way, the results of Lemma 9.1 remain valid also for weighted integrals,
only the interpretation of 1(f) is changed.
In Table 9.2, the most common weight functions are listed, together with
the corresponding intervals.

Table 9.2. Typical weight functions.


w(t) Interval [a, b]
1
v'1-t 2
[-1,1]
e- t [0,00]
e- t2 [-00,00]
1 [-1,1]

9.3.1 Construction of the Quadrature Formula


Our goal is the construction of quadrature formulas of the form
n
in(f) := L Ainf(Tin) ,
i=O

which approximate the integral 1(f) as good as possible. More precisely,


for a given n, we seek the n + 1 nodes TOn"", Tnn and n + 1 weights
Aon , ... , Ann, so that polynomials up to as high degree N as possible can
be integrated exactly, i.e.,
in(p) = I(P) for all P E PN .

If we first try to estimate, which degree N can be achieved for a given


n, then we observe that we have 2n + 2 parameters (respectively, n + 1
nodes and weights) at our disposal, as opposed to N + 1 coefficients of
a polynomial of degree N. The best that we can expect is therefore that
polynomials up to a degree N ::; 2n + 1 can be integrated exactly. Since
the integration nodes enter the quadrature formula in a nonlinear way, it
is not enough to simply count the degrees of freedom. We instead try to
9.3. Gauss-Christoffel Quadrature 281

draw conclusions from our wishful thinking, which may be helpful in the
solution of the problem.

Lemma 9.9 If in is exact for all polynomials P E P2n+l, then the


polynomials {Pj }, which are defined by their root representation

are orthogonal with respect to the scalar product, which is induced by w

(f, g) = lb w(t)f(t)g(t) dt .

Proof. For j < n + 1, we have Pn+1Pj E P 2n +1; so that

(Pj,Pn+d = 1
a
b
WPjPn+1 = in(PjPn+d
n
= l:AinPj(Tin)Pn+l(Tin) = O.
i=O ~
=0
o
Therefore the nodes Tin, which we seek, have to be roots of pairwise
orthogonal polynomials {Pj } of degree deg Pj = j. Such orthogonal poly-
nomials are not unknown to us. According to Theorem 6.2, there exists one
and only one family {Pj } of polynomials Pj E P j with leading coefficients
one, i.e., Pj(t) = t j + ... so that

From Theorem 6.5, we already know that the roots of these orthogonal
polynomials are real and have to lie in the interval [a, b]. We have therefore
constructed, in a unique way, candidates for the integration nodes Tin of
the quadrature formula in: the roots of the orthogonal polynomial Pn+l'
Once the nodes are determined, there is no choice for the weights: In order
that at least polynomials PEP n up to degree n are integrated exactly,
according to Lemma 9.4, the weights

Ain := _l_l
b- a a
b
Lin(t) dt

have to be chosen with the Lagrange polynomials Lin (Tjn) = 6ij. This, at
first, only guarantees exactness for polynomials up to degree n, which is in
fact enough.

Lemma 9.10 Let TOn, . .. , Tnn be the roots of the (n + 1)st orthogonal
polynomial Pn+1. Then any quadrature formula in(f) = 2:7=0 Ad(Tin)
satisfies

in exact on P n {=::? in exact on P 2n+ 1 .


282 9. Definite Integrals

Pmof. Suppose that in is exact on P nand P E P 2n + 1 . Then there exist


polynomials Q, REP n (Euclidean algorithm) such that
P=QPn+1+R.
Since Pn +1 is orthogonal to P n , it follows for the weighted integral that

lb lbwP = wQPn+1 +
'-..-'
lb lbwR = wR = in(R) .

=0
On the other hand,
n n

i.e., i is exact on P 2n + 1 . o
We collect our results in the following theorem.
Theorem 9.11 There exist uniquely determined nodes Tan, . .. , Tnn and
weights AO n , ... ,Ann such that the quadrature formula
n
in(f) = L Ainf(Tin)
i=O
integrates exactly all polynomials of degree less than or equal to 2n + 1, i. e.,

in (P) = lb wP for P E P 2n + 1 .

The nodes Tin are the mots of the (n + l)st orthogonal polynomial P n+1
with respect to the weight function wand the weights

Ain := b ~a lb Lin(t) dt

with the Lagrange polynomials Lin (Tjn) = 6ij. Furthermore, the weights are
all positive, Ain > 0, i. e., in is a positive linear form, and they satisfy the
equation

(9.4)

Pmof. We only have to verify the positivity of the weights and their rep-
resentation (9.4). Suppose Q E P 2n +l is a polynomial such that Tkn is
the only node, at which it does not vanish, i.e., Q(Tin) = 0 for i i= k and
Q(Tkn) i= o. Then, obviously,

lba wQ
1
= AknQ(7"kn), hence Akn = Q(Tkn) lb
a wQ.
9.3. Gauss-Christoffel Quadrature 283

If we set, e.g.,

Q(t):=(Pn+l(t))2,
(t - Tkn)
then Q E P 2n has the required properties, where Q(Tkn) = P~+l (Tkn)2.
Thus the weights satisfy

Akn 1-
= --
Q(Tkn)
lb a
wQ = lb (
a
W I
Pn+dt)
Pn+1(Tkn)(t-Tkn)
)2 dt > 0 ;

i.e., all weights are positive. In order to verify formula (9.4), we put

Q(t) := Pn+1(t) Pn(t) .


t - Tkn
Again, Q E P 2n has the required properties, and it follows that

Akn = I
Pn+l(Tkn)Pn(Tkn)
1 lb
a
w(t)
Pn+l(t)
t - Tkn
Pn(t) dt .

The polynomial Pn+1(t)/(t - Tkn) again has leading coefficient I, so that

Pn+l(t) = Pn(t) + Qn-l(t)


t - Tkn
with a Qn-l E P n - 1. Since Pn is orthogonal to P n - 1, the statement finally
follows:

These quadrature formulas in are the Gauss-Christoffel formulas for the


weight function w. As is the case for the Newton-Cotes formulas, it is easy to
deduce the approximation error from the exactness for a certain polynomial
degree.
Theorem 9.12 For any function f E C 2n+2! the approximation error of
the Gauss-Christoffel quadrature can be expressed in the form

l a
b A

wf - In (f) =
f(2n+2) (T)
(2n + 2)! (Pn+l, Pn+d

with some T E [a, b].

Proof. As expected, we employ the Newton remainder for the Hermite


interpolation P E P 2n + 1 for the 2n + 2 nodes Tan, Tan,· .. , Tnn , Tnn:
f(t) = P(t) + [t, Tan, Tan) ... , Tnn )Tnnlf . (t - TOn)2 ... (t - Tnn)2
" V .I

= Pn + 1 (t)2 2: 0
284 9. Definite Integrals

in
lb + ( lb
Since integrates the interpolation P exactly, it follows that
f(2n+2) (T) 2

a
wP )1
2n + 2. a
WPn + 1
n f(2n+2) (T)
~ .Ain ~ + (2n + 2)! (Pn+1, Pn +1) .
= f(Tin)
D

Example 9.13 Gauss-Chebyshev quadrature. Consider the weight func-


tion w(t) = 1/~ on the interval [-1,1]. Then the Chebyshev
polynomials Tk, with which we are so well-acquainted, are orthogonal,
because
,ifk=j=O
,ifk=j>O
,ifk#j.
The orthogonal polynomials with leading coefficient 1 are therefore Pn(t) =
2 1- nT n (t). The roots of P n+1 (respectively, Tn+d are the Chebyshev nodes
2i + 1 .
Tin = COS - - - 7 r , for z = 0, ... ,n .
2n+2
By employing (9.4), one easily calculates that the weights for n > 0 are
given by

The Gauss-Chebyshev quadrature has therefore the simple form

,
In(f) = - -
7r Ln
f(Tin) wIth Tin =
. 2i + 1
COS - - 7 r .
n + 1 ,=0
. 2n+ 2

According to Theorem 9.12, and because of (Tn+1' T n+1) = 7r /2, its


approximation error satisfies

J 1
-1
f(t)
~
dt _ j (f) _
n
7r
- 22n+l(2n + 2)!
f(2n+2)(T)

for aTE [-1,1].


In Table 9.3, we have listed some names of classes of orthogonal polyno-
mials, together with their associated weight functions. The corresponding
quadrature methods always carry the name "Gauss," hyphenated with the
name of the respective polynomial class. The Gauss-Legendre quadrature
9.3. Gauss-Christoffel Quadrature 285

Table 9.3. Commonly occurring classes of orthogonal polynomials.


w(t) Interval I = [a, b] Orthogonal polynomials
1
v'1-t 2
[-1,1] Chebyshev polynomials Tn
e- t [0,00] Laguerre polynomials Ln
e- t2 [-00,00] Hermite polynomials Hn
1 [-1,1] Legendre polynomials P n

(w == 1) is only used in special applications. For general integrands, the


trapezoidal sum extrapolation, about which we shall learn in the next sec-
tion, is superior. However, the weight function of the Gauss-Chebyshev
quadrature is weakly singular at t = ±1, so that the trapezoidal rule
is not applicable. Of particular interest are the Gauss-Hermite and the
Gauss-Laguerre quadrature, which allow the approximation of integrals
over infinite intervals (and even solve exactly for polynomials P E P 2n +1)'
Let us finally note an essential property of the Gauss quadrature for
weight functions w t 1: The quality of the approximation can only be
improved by increasing the order. A partitioning into subintervals, however,
is only possible for the Gauss-Legendre quadrature (respectively, the Gauss-
Lobatto quadrature; compare Exercise 9.11).

9.3.2 Computation of Nodes and Weights


For the effective computation of the weights Ain, we need another represen-
tation. For this purpose, let {Fd be a family of orthonormal polynomials
Fk E Pk, i.e.,
(Fi' Fj ) = bij .
These satisfy the Christoffel-Darboux formula (see, e.g., [82] or [64]).
Lemma 9.14 Suppose that k n are the leading coefficients of the or-
thonormal polynomials Fn(t) = knt n + O(t n - 1 ). Then, for all 8, t E

t
R,

~ (Fn+l (t)Fn(s) - Fn(t)Fn+l (8)) = Fj(t)Fj(s) .


kn+l t - s .
J=O
The following formula for the weights Ain can be derived from this
formula.
Lemma 9.15 The weights Ain satisfy

(9.5)
286 9. Definite Integrals

Proof. Let s = Tin. Then the Christoffel-Darboux formula implies that

(9.6)

and, in the limit case t ---+ Tin,

= (PO, PO) = 1

(9.7)

o
The actual determination of the weights Ain and nodes Tin is based on
techniques which are based on the contents of Chapter 6. For this recall
that, according to Theorem 6.2, the orthogonal polynomials Pk with respect
to the weight function w satisfy a three-term recurrence relation
(9.8)
where
(3k= (tPk-1, Pk-d ,"/k=2 (Pk- 1, Pk- 1)
(Pk- 1 , Pk- 1 ) (Pk-2, Pk- 2)
We therefore assume that the orthogonal polynomials are given by their
three-term recurrence relation (9.8), which, for k = 0, ... ,n, we can write
as a linear system Tp = tp + r with
(31 1
'Y~ (32 1
T·-
9.4. Classical Romberg Quadrature 287

and
p:=(Po(t), ... ,Pn(t)f, r:=(O, ... ,o,-Pn+1 (t)f·
Thus

i.e., the roots of P n + 1 are just the eigenvalues of T, where the eigenvector
p( T) corresponds to an eigenvalue T.
Because the roots Tin of Pn +1 are all real, one could have the idea that the
eigenvalue problem Tp = tp can be transformed into a symmetric eigenvalue
problem. The simplest possibility would be to scale with a diagonal matrix
D = diag(d o, ... , d n ) to obtain
Tp = tp withP= Dp , T = DTD- 1 ,
with the hope of achieving T = TT. More explicitly, diagonal scaling, as
applied to a matrix A E Matn+1 (R), satisfies

A~ DAD-1
A f-+:= WIt
~ = -;Faij
. h aij di
J

For T to be symmetric, it is necessary that


2 di d2
d-
= d2i-1 / I i2
i I .
Ii -d. = -d. ' I.e., i ,
2-1 2

which we can satisfy without any problems, e.g., by


do := 1 and d i := h2" 'li+1)-1 for i = 1, ... , n .
With this choice of D, T is the symmetric tridiagonal matrix

=TT,
11 (3n In +1
In+l (3n

whose eigenvalues TOn, .. . , Tnn we can compute by employing the QR-algo-


rithm (compare Section 5.3). The weights '\;n can also be computed from
(9.5) via the three-term recurrence relation (9.8). The Gauss quadrature
can be carried out as soon as the (>\in, Tin) are at hand (see also [42]).

9.4 Classical Romberg Quadrature


In the following, we want to learn about a different kind of integration
method which is based on the trapezoidal sum. The quadrature formulas
discussed so far are all based on a single fixed grid of nodes to, ... , tn,
288 9. Definite Integrals

at which the function was evaluated. In contrast to this, for the Romberg
quadrature, we employ a sequence of grids, and we try to construct a better
approximation of the integral from the corresponding trapezoidal sums.

9.4.1 Asymptotic Expansion of the Trapezoidal Sum


Before we can describe the method, we have to analyze in more detail the
structure of the approximation error of the trapezoidal rule. For a step size
h = (b - a)/n, n = 1,2, ... , let T(h) denote the trapezoidal sum

T(h) := Tn := h (~(j(a) + f(b)) + ~ f(a + ih))


for the equidistant nodes ti = a + ih. The following theorem shows that the
trapezoidal sum T(h), when viewed as a function of h, can be expanded
into an asymptotic expansion in terms of h 2 .
Theorem 9.16 Let f E c 2 m+l [a, b], and let h = b-;.a for an n E N \ {O}.
Then the approximation error of the trapezoidal sum T(h) has the following
asymptotic expansion:

T(h) = lb f(t) dt + T2h2 + T4h4 + ... + T2mh2m + R2m+2(h)h2m+2 (9.9)

with coefficients

T2k = (~~~! (j(2k-l) (b) - f(2k-l)(a)) ,

where B2k are the Bernoulli numbers, and with the remainder

R 2m +2(h) = -lb K 2m +2(t, h)f(2ml(t) dt .

The remainder R 2m +2 is uniformly bounded in h; i.e., there is a constant


C 2m +2 2': 0, which is independent of h such that
IR2m+2(h)1 :::; C2m+2 1b - al for all h = (b - a)/n .
The proof of this classical theorem is based on the Euler formula for the
sums, for which we refer to [55]. The h 2 -expansion also comes up in the
general context of solving initial value problems for ordinary differential
equations, however, in a much simpler way (see [47]).
Remark 9.17 The functions K 2m +2 , which occur in the remainder, are
closely related to the Bernoulli functions B 2m +2 .
For periodic functions f of period b- a, all T2k vanish; i.e., the entire error
stays in the remainder. In this case, no improvement can be achieved with
the Romberg integration, as we shall describe below. In fact, in this case,
the simple trapezoidal rule already yields the result of the trigonometric
interpolation (compare Section 7.2).
9.4. Classical Romberg Quadrature 289

For large k, the Bernoulli numbers B2k satisfy

B2k ~ (2k)! ,

so that the series (9.9) in general also diverges with m ~ 00 for analytic
functions f E CW[a, b]; i.e., in contrast to the series expansion, which we
know from calculus (like Taylor or Fourier series), in Theorem 9.16 the
function is expanded into a divergent series. At first, this does not seem to
make any sense; in practice, however, the finite partial sums can often be
used to compute the function value with sufficient accuracy, even though
the corresponding series diverges.
In order to illustrate the fact that such an expansion into a divergent se-
ries can be numerically useful, we consider the following example (compare
[55]).
Example 9.18 Let f(h) be a function with an asymptotic expansion in h
such that for all hER and n E N
n
f(h) = 2:) -l)kk! . hk + ()( _l)n+l(n + I)! hn+l for a 0 < () = ()(h) < 1.
k=O

The series L (-1) k k! h k diverges for all h i= o. When considering the


sequence of partial sums
n
8 n (h) := 2:) -l)kk! hk
k=O

for small h, 0 i= h « 1, it appears at first that they converge, because the


terms (-l)kk! hk of the series at first decay very much. However, starting
from a certain index, the factor k! dominates, the terms get arbitrarily
large, and the sequence of partial sums diverges. Because

the error made by approximating f by Sn is always smaller than the first


term, which we drop. In order to determine f(h) to an (absolute) precision
of tol, we have to find an n so that

I(n + I)! hn+11 < tol .


We actually obtain f(1O- 3 ) with ten precise decimal positions for n = 3 by
f(1O- 3 ) ~ 83(10- 3 ) = 1 - 10- 3 + 2· 10- 6 - 6 . 10- 9 .

Because of their "almost convergent" behavior, Legendre called such se-


ries "semiconvergent." Euler made his life easier by using the same notation
for any kind of infinite series, whether they were convergent or not. (Not
everybody is allowed to do so-only geniuses!)
290 9. Definite Integrals

9·4·2 Idea of Extrapolation


In Section 9.2, we have approximated the integral

J(f) := lb f(t) dt

by the trapezoidal sum

T(h) = T(n) = h (~(f(a) + f(b)) + ~ f(a + ih)) with h


b-a
= --,
n
where the quality of the approximation depends on the step size h. For
h --+ 0, the expression T(h) converges to the integral J(f); more precisely
we should say "for n --+ (Xl and h = (b-a)/n," because T(h) is only defined
for discrete values h = (b - a)/n, n = 1,2, .... We then write
lim T(h):= lim T(n) = J(f) . (9.10)
h----'?O n-+CX)

In order to illustrate the basic idea of the Romberg quadrature, we first


start by assuming that we have computed T(h) for two step sizes
b-a
hi := - - , i = 1,2 ,
ni
and we consider the most simple function f(t) = t 2 over the unit interval
[a, b] = [0,1] and ni = i (see Figure 9.3). Because the second derivative f(2)

Figure 9.3. Trapezoidal sums T(h 1 ) and T(h2) for f(t) = t 2.

is constant 2, we have R4(h) = 0, and therefore


T(h) = J(f) + T2h2 + R4(h)h 4 = J(f) + T2h2 . (9.11)
We can determine the coefficient T2 from the trapezoidal sums
= J(f) + T2hi = 1/2,
T(1)
= T(1/2) = J(f) + T2h~ = 3/8 ,
9.4. Classical Romberg Quadrature 291

yielding

Again inserted into (9.11), we obtain the integral

1(f) = T(h ) - h2 = T(h ) _ T(h2) - T(h1) h2 = ~ (9.12)


1 T2 1 1 h2 _ h2 1 3
2 1

from the two trapezoidal sums (see Figure 9.4). We can also explain

1
4~0------~~~------------------~------~
h2
2 -- .!
4
2= 1
h1 h2

Figure 9.4. (Linear) extrapolation.

formula (9.12) as follows: Based on the asymptotic expansion of the trape-


zoidal rule, we determine the integration polynomial in h 2 for the points
(hi, T(hd) and (h~, T(h 2),

P(T(h) I hi, h§)(h 2) = T(h1) + T(h~~


2
=~2(h1) (h 2 - hi) ,
1

and we extrapolate for h 2 = 0, i.e.,

P(T(h) I h2 h2)(0) = T(h 1 ) _ T(h2) - T(hd h2


1, 2 h 2 -h 2 l'
2 1

We expect the extrapolated value P(T(h) I hi, h~)(O) for h2 = 0 to be a


better approximation of 1(f).
This basic idea carries over to higher orders in a natural way, by re-
spective repeated evaluation of T(h) for successively smaller h = hi. In
particular, it can be used in a more general context whenever a method
allows for an asymptotic expansion of the approximation error. This leads
to the general class of extrapolation methods (see, e.g., [22]). In order to
present the chain of reasoning, we start with a method T(h), which, de-
pending on some "step size" h, computes the wanted value TO. Here we allow
292 9. Definite Integrals

that T(h) is only defined for discrete values h (see above). In addition, we
require that the method converges to TO for h --7 0, i.e.,
lim T(h) = TO . (9.13)
h-->O

Definition 9.19 The method T(h) for the computation of TO has an


asymptotic expansion in h P up to the order pm, if there exist constants
Tp, T2p,'" , Tmp E R such that
T(h) = TO+TphP+T2ph2p+ .. '+Tmphmp+O(h(m+l)p) for h --7 O. (9.14)
Remark 9.20 According to Theorem 9.16, the trapezoidal rule has an
asymptotic expansion in h 2 up to order 2m for functions f E C 2m+l[a,bj.
Once we have computed T(h) for k different step sizes

we can determine the interpolating polynomial in hP as

Pik(h P) = P(h P; hLk+l"'" hf) E Pk-l (h P)

with respect to the nodes

(hLk+l' T(hi-k+l)),"" (hf, T(hi )) ,


and extrapolate by evaluating at the value h = O. This way we obtain the
approximations Tik,
Tik := Pik (0) for 1 ::; k ::; i ,
of TO. According to Section 7.1.2, we of course employ the algorithm of
Aitken and Neville for the computation of Tik. The recurrence relation
(7.4) is then transformed into the present situation as follows.
Til .- T(h i ) for i = 1,2, ...
'- Ti,'" + T(::' -T)-;"'H 2" k <:
i-k+l _ 1
fm i
hi
The Neville scheme turns into the extrapolation table.
Tll

--7 Tk-l,k-l

""
--7 Tk,k-l
Remark 9.21 In accordance with [19], we start to count with 1 in the ex-
trapolation methods. As we shall see below, this leads to a more convenient
9.4. Classical Romberg Quadrature 293

connection between the order of approximation Tkk and the number k of


computed values T(h I ) to T(hk)'
If we denote the approximation error of the approximations Tik of TO,
which were obtained by extrapolation, by

Cik := ITik - TO I for 1:::; k :::; i,

then we can arrange these accordingly into an error table.


Cll
\.
C2I ---+ C22

(9.15)
Ck-I,I ---+ ---+ ck-I,k-I
\. \. \.
Ckl ---+ ---+ ck,k-I ---+ Ckk

The following theorem, which goes back to R. Bulirsch [11], provides


information about the behavior of these errors.

Theorem 9.22 Let T(h) be a method with an asymptotic expansion (9.14)


in h P up to the order pm, and let hI, ... ,hm be distinct step sizes. Then
the approximation error Cik of the extrapolation values Tik satisfies

Cik ~ hpl hLk+1 ... hf for 1 :::; k :::; i :::; m and h j :::; h ---+ 0.
~
k factors
More precisely,

Cik ITkpl hLk+I ... hf + L O(h;k+I)P) for h j :::; h ---+ 0.


j=i-k+I

This theorem says that, essentially, for each column of the extrapolation
table, we can gain p orders. However, since we have to deal with asymptotic
expansions, and not with series expansions, this viewpoint is too optimistic.
The high order is of little use, if the remainders of the asymptotic expansion,
which are hidden behind the O(h;k+1)p), become very large. For the proof
of the theorem we use the following auxiliary statement.

Lemma 9.23 The Lagrange functions L o , ... , Ln with respect to the nodes
to, ... ,tn satisfy
n for m=O
LLj(O)tj = for 1:::; m:::; n
j=O for m=n+1
294 9. Definite Integrals

Proof. For 0 ::; m ::; n, P(t) = t m is the interpolating polynomial for the
points (tj, tj) for j = 0, ... ,n, and therefore
n n
P(t) = tm = LLj(t)P(tj ) = LLj(t)tj .
j=O j=O
If we set t = 0, then the statement follows in the first two cases. In the case
m = n + 1, we consider the polynomial
n
Q(t) := t n+1 - L Lj(tW;+l .
j=O
This is a polynomial of degree n + 1 with leading coefficient 1 and roots
to, ... , tn, so that
Q(t) = (t - to)··· (t - tn) ,
and, in particular,
n
LLj (0)tr 1 = -Q(O) = (-l)nto···tn .
j=O
D

We now turn to the proof of Theorem 9.22.

Proof. Since T(h) possesses an asymptotic expansion in h P up to order pm,


we have, for 1 ::; k ::; m,
(9.16)
It is sufficient, to show the statement for i = k (the case i 1- k follows then
by shifting the indices of the h j ). Thus let
Pkk = P(h P; hi,···, hD
be the interpolating polynomial in h P with respect to the nodes (hf, T(h j ))
for j = 1, ... , k, and let L 1(h P), ... , Lk(hP) be the Lagrange polynomials
for the nodes hi, ... , h~. Then
k
Pkk(h P) = L L j (h P)Tj1 .
j=l
Thus, according to (9.16) and Lemma 9.23,
k

Tkk Pkk(O) = L Lj(0)Tj1


j=l
k
L Lj(O) [TO + Tphf + ... + Tkph~P + O(hj"+l)P)]
j=l
9.4. Classical Romberg Quadrature 295

+ Tkp(-1)k- 1hi ..... h~ + L


k
TO O(hjk+1)P) ,
j=l
and therefore
k
ckk = ITkk - Tal = ITkpl hi· .... h~ +L O(hjk+1)P)
j=l
o
The theory, which we presented so far suggests, for a method T(h) with
an asymptotic expansion, the following extrapolation algorithm. We start
with a basic step size H, and form the step sizes hi by dividing H, hence,
hi = H/ni with ni EN.
Algorithm 9.24 Extrapolation method
1. For a basic step size H, choose a sequence of step sizes hI, h2 ... with
hj = H/nj, nj+1 > nj and set i := 1.

2. Determine Til = T(h i ) .


3. Compute Tik for k = 2, ... , i from Neville's scheme
T - T T i,k-1 - Ti- 1,k-1
,k- "k-1+ ( ni )P
-- -1
ni-k+1
4. If Tii is precise enough, or if i is too large, then stop.
Otherwise increase i by 1 and go back to 2.
This rough description leaves many questions open. It is obviously not
clear what is meant by "Tii is precise enough" or "i is too large." It is also
not clear how the step sizes h j should be chosen. We shall discuss this in
more detail in the context of the Romberg quadrature in the next sections.

9.4.3 Details of the Algorithm


As we have seen in the above, the trapezoidal sum is a method with an
asymptotic expansion in h 2 , where the order up to which we can expand,
depends on the smoothness of the integrand. We can therefore apply ev-
erything, which we described in the previous section to the trapezoidal
rule, and we obtain as an extrapolation method the classical Romberg
quadrature, which was introduced by Romberg.
The cost Ai for the computation of Tii can essentially be measured by
the number of necessary function evaluations of j,

Ai := number of j evaluations used for the computation of T ii .


296 9. Definite Integrals

These numbers, of course, depend on the chosen sequence nl, n2, . ... To
each increasing sequence

we assign therefore the corresponding sequence of costs

For the Romberg sequence

F R ={1,2,4,8,16, ... }, ni=2 i - l ,

we obtain

AR = {2, 3, 5, 9,17, ... }, Ai = ni +1.


For this sequence, the recursive evaluation of the trapezoidal sums is

• 2 new f evaluations
a b

o~----------~------------~o
1 new f evaluation
a h2 = H/2 b

o~---- __ ----~e------ __ ----~o


2 new f evaluations

Figure 9.5. Computation of the trapezoidal sums for the Romberg sequence.

particularly simple (see Figure 9.5). For h = H/n we obtain

T(h/2) = ~ (f(a) + 2 2~1 f(a + ih/2) + f(b))


(n-l
"4h f(a)+2~f(a+kh)+f(b) +2~f
h n ( 2k
a+---i- h
) 1)
v
= T(h)/2

and therefore for the Romberg sequence hi = H /2 i - 1

1
+ hi L
ni-l

Til = "2Ti-I,1 f(a + (2k - l)hi ) .


k=l
9.4. Classical Romberg Quadrature 297

By representing the extrapolation values Tii of the Romberg quadrature as


a quadrature formula with weights Aj,
Ai
Tii = H"LAjfj,
j=l

where fj is the /h computed function value, it can be shown (see Exercise


9.6) that for the Romberg sequence F R only positive weights Aj result.
Concerning the cost, W. Romberg in his original paper [72] had already
proposed an even more favorable sequence, which is now called the Bulirsch
sequence:

if i = 2k
F B := {l, 2, 3, 4, 6, 8,12,16,24, ... }, if i = 2k + 1
if i = 1

The corresponding sequence of costs is

AB = {2, 3, 5, 7, 9,13,17, ... }.

However, the sequence F B has the disadvantage that the corresponding


quadrature formula may now also contain negative weights (see Table 9.4).
In the context of a method of variable order, this property is not so dra-
matic, because such methods can, in critical regions, be switched back to
lower order (with positive weights) (see Chapter 9.5).

Table 9.4. Weights Aj for the diagonal and subdiagonal elements of the extrap-
olation tableau at the nodes tj for the Bulirsch sequence (Note that Aj = Anj,
tj = tn-j).

0 1
"6 4
1 1
"3
1
"2 I: IAil
1
T 1 ,1 "2 1
1 1
T 2 ,1 4 "2 1
1 2
T 2 ,2 "6 "3 1
1 3 2
T 3 ,2 10 [; "3 1
11 27 8
T 3 ,3 120 40 -15 2.067
13 16 27 94
T 4 ,3 210 21 -35 105 4.086
151 256 243 104
T 4 ,4 2520 315 - 280 105 4.471

Remark 9.25 In the solution of initial value problems for ordinary


differential equations, the most simple harmonic sequence

FH = {1,2,3, ... }, ni = i
298 9. Definite Integrals

occurs. In our context of quadrature, it is, however, substantially less


favorable than the Romberg sequence, because for one, the cost is larger,
AH = {2, 3, 5, 7, 11, 13, 19,23,29, ...} ,
and, for the other, the trapezoidal sums Til cannot be computed recursively.
The computation of the extrapolation tableau is carried out by rows. We
stop if sufficiently many digits "stand" or if no improvement of convergence
is noticed.
Example 9.26 Needle impulse: Computation of the integral

1 1

-1
dt
10- 4 + t2
(9.17)

(compare Figure 9.10) with relative precision tol = 10- 8 . The values Tkk
of the extrapolation tableau are given in Table 9.5.

Table 9.5. Romberg quadrature for the needle impulse f(t) = 1/ (10- 4 +x 2 ). (ckk'
relative precision, A k , cost in terms of f evaluations.)
k Tkk ckk Ak

1 1.999800 9.9. 10- 1 2


2 13333.999933 4.2.10 1 3
3 2672.664361 7.6.10° 5
4 1551.888793 4.0.10° 9
5 792.293096 1.5.10° 17
6 441.756664 4.2.10- 1 33
7 307.642217 1.4 . 10- 2 65
8 293.006708 6.1.10- 2 129
9 309.850398 7.4.10- 3 257
10 312.382805 7.2.10- 4 513
11 312.160140 2.6.10- 6 1025
12 312.159253 2.5.10- 7 2049
13 312.159332 1.1. 10- 9 4097

9.5 Adaptive Romberg Quadrature


So far, we have taken the entire length of the interval H = b- a as the basic
step size. If we consider the needle impulse (9.17), then it strikes us that in
this example, the essential contributions to the entire integral often come
only from one or more much smaller subintervals. If we start with a basic
9.5. Adaptive Romberg Quadrature 299

step size H = b - a, then all regions of the basic interval [a, b] are equal,
and we apply the same method everywhere. This cannot be the best way
to integrate functions. Rather, we should partition the integration interval
so that we can choose in each subregion a method which is tailor-made to
the function and which thus, with as little effort as possible, determines
the integral with a given relative precision. Such methods, which control
themselves in the course of the computation by adapting to the problem
at hand, are called adaptive methods. Their essential advantage consists
of the fact that a large class of problems can be handled with the same
program, without the user having to make adaptions, i.e., without having to
insert apriori knowledge about the problem into the method. The program
itself tries to adapt itself to the problem. In order to achieve this, the
intermediate results, which are computed in the course of the algorithm are
constantly checked. This serves two purposes: On one hand, the algorithm
can thus automatically choose an optimal solution strategy with respect to
cost, and thus solve the posed problem effectively. On the other hand, this
ensures that the program works more safely, and hopefully does not produce
fictitious solutions, which, in reality do not have a lot to do with the posed
problem. It should also be a goal that the program can recognize its own
limitations, and ,for instance, detect that a user-prescribed precision cannot
be achieved. This adaptive concept can in general only be carried out if a
reasonable estimate for the occurring approximation error is available and
can be computed at relatively little cost.

9.5.1 Principle of Adaptivity

In quadrature the problem is more precisely formulated: Approximate the integral I = ∫_a^b f(t) dt up to a given relative precision tol; i.e., compute an approximation Î of I so that

    |Î − I| ≤ |I| · tol .                                             (9.18)

Since we do not know I, we replace (9.18) by the requirement

    |Î − I| ≤ I_scal · tol ,                                          (9.19)

where Iscal ("scal" for "scaling") should be of the order of III. This value
is either given by the user together with tol, or is obtained from the first
approximations.
Whereas the classical Romberg quadrature merely adapts the order of
the method in order to achieve a desired precision, in the adaptive Romberg
quadrature the basic step size H is also adapted. There are two principal
possibilities to attack the problem: The initial value method (in this section)
and the boundary value method (two sections later).

The following considerations are based on [19] and [21]. We start with
the formulation of the quadrature problem as an initial value problem

    y'(t) = f(t),   y(a) = 0,   y(b) = ∫_a^b f(t) dt

and try to compute the integral successively from the left to the right (see
Figure 9.6). Here we partition the basic interval into suitable subintervals [t_i, t_{i+1}] of length H_i := t_{i+1} − t_i, which are adapted to the function f, and

Figure 9.6. Quadrature as an initial value problem: y(t) = ∫_a^t f(τ) dτ.

apply the Romberg quadrature to the thus obtained subproblems

    I_{t_i}^{t_{i+1}}(f) = ∫_{t_i}^{t_{i+1}} f(t) dt

up to a certain degree q_i.


Remark 9.27 In this initial value approach, however, the symmetry I_a^b(f) = −I_b^a(f) is destroyed, because we distinguish one direction (from left to right). This will not be the case in the boundary value approach (see Section 9.7).
Numerous questions arise from this first superficial description. Which step sizes should the algorithm choose? Up to which order should the Romberg quadrature be carried out with respect to a subinterval? How can the result be (locally) verified?
In the following, we construct a method which, starting with input step size H and order q, computes the integral over the next subinterval, and which makes proposals for the step size Ĥ and the order q̂ of the following step (see Figures 9.7 and 9.8). Should the computation of the subintegral I_t^{t+H}(f) not be possible with the required precision and with the given order, then the method is to choose a new order q and/or a reduced step size H. The method should stop after "too frequent" reduction of H.

Figure 9.7. One step of the adaptive Romberg quadrature.

Figure 9.8. Schematic description of a quadrature step: input H, q → quadrature step I_t^{t+H} → output Ĥ, q̂ (or possibly an error message).

9.5.2 Estimation of the Approximation Error


In order to realize the adaptive concept sketched above, with respect to order and basic step size, we need a reasonable and cheap technique to estimate the approximation error. This is particularly simple for extrapolation methods, since extrapolation tableaux include an array of different approximations. The estimation technique will be based on this tableau of approximations. In general, a quantity ε̂ is called an estimator for an inaccessible approximation error ε, shortly ε̂ = [ε], if ε̂ can be bounded by ε both from above and below, i.e., if there exist constants κ_1 ≤ 1 ≤ κ_2 such that

    κ_1 ε ≤ ε̂ ≤ κ_2 ε .                                              (9.20)

The construction of an effective error estimator is one of the most difficult


problems in the development of an adaptive algorithm. A common method
consists of comparing an approximation of low order with one of higher
order. All error estimators, which we shall encounter in this book, are
based on this construction principle.
In our case, locally, with respect to a subinterval [t, t+H] of basic step size H, the approximation quality is described by the already earlier introduced error table (9.15) of the ε_ik. According to Theorem 9.22, we already know that

    ε_ik ≐ |τ_{2k}(t, t+H)| · h²_{i-k+1} ··· h²_i .                   (9.21)

Besides, for H → 0, the coefficients of the asymptotic expansion of the trapezoidal rule can be estimated by

    |τ_{2k}(t, t+H)| ≐ |τ_{2k}| · H .

Here the constants τ_{2k} depend on the integrand f, and thus on the problem. Inserted into (9.21), it follows that

    ε_ik ≐ |τ_{2k}| h²_{i-k+1} ··· h²_i · H = |τ_{2k}| γ_ik H^{2k+1}   for H → 0,

where

    γ_ik := (n_{i-k+1} ··· n_i)^{-2} .

The order 2k + 1 with respect to H only depends on the column index k of the extrapolation table. In particular, for two consecutive errors within column k, and independent of the problem, we have

    ε_{i+1,k} / ε_{i,k} ≐ γ_{i+1,k} / γ_{i,k} = (n_{i-k+1} / n_{i+1})² ≪ 1 .      (9.22)

In other words, independent of the problem and independent of H, within each column k, the approximation errors decrease very fast with increasing row index i.

For the relation between the columns, we need a further assumption, namely, that higher approximation orders yield smaller approximation errors, i.e., for 1 ≤ k < i we have

    ε_{i,k+1} ≪ ε_{i,k} .                                             (9.23)

This assumption is plausible; however, it is not imperative: It is surely true for "sufficiently small" step sizes H, but it has to be verified in the program for each concrete H in a suitable way. In Section 9.5.3, we shall present one possibility to test whether our assumption agrees with the given situation. If we denote these relations in the error table by arrows,
then, under the assumptions (9.22) and (9.23), the following picture emerges:

    ε_11
     ↑
    ε_21 ← ε_22
     ↑      ↑
    ε_31 ← ε_32 ← ε_33
     ↑      ↑      ↑

The most precise approximation within row k is thus the diagonal element T_kk. It would therefore be ideal if we could estimate the error ε_kk. However, we are in a dilemma here: In order to estimate the error of T_kk, we need a more precise approximation Î of the integral, e.g., T_{k+1,k}. With such an approximation at hand, it would be possible to estimate ε_kk, e.g., by

    ε̂_kk := |T_{k+1,k} − T_kk| .

But once we have computed T_{k+1,k}, we can also directly produce the (better) approximation T_{k+1,k+1}. However, we do not have an estimate of the error ε_{k+1,k+1}, unless we again compute a more precise approximation. We escape this dilemma by the insight that the second-best solution may also be useful. The second-best approximation available up to and including row k is the subdiagonal element T_{k,k-1}. The approximation error ε_{k,k-1} can be estimated from known data up to this row as follows.

Lemma 9.28 Under the assumption (9.23),

    ε̂_{k,k-1} := |T_{k,k-1} − T_kk| = [ε_{k,k-1}]

is an error estimator for ε_{k,k-1} in the sense of (9.20).

Proof. Let I := ∫_t^{t+H} f(t) dt. Then

    ε̂_{k,k-1} = |(T_{k,k-1} − I) − (T_kk − I)| ≤ ε_{k,k-1} + ε_kk ,

and, with the assumption (9.23),

    ε̂_{k,k-1} = |(T_{k,k-1} − I) − (T_kk − I)| ≥ ε_{k,k-1} − ε_kk ,

and therefore

    (1 − ε_kk/ε_{k,k-1}) ε_{k,k-1} ≤ ε̂_{k,k-1} ≤ (1 + ε_kk/ε_{k,k-1}) ε_{k,k-1} ,

where both factors in parentheses are close to 1, since ε_kk/ε_{k,k-1} ≪ 1.   □

In order to simplify the notation in the following, we assume that I_scal = 1. Then we replace the termination criterion

    |Î − I| ≤ tol

by the condition

    ε̂_{k,k-1} ≤ ρ · tol ,                                            (9.24)

which can be verified in the algorithm, where ρ < 1 (typically ρ := 0.25) is a safety factor. The diagonal element T_kk is thus accepted as a solution if and only if the termination condition (9.24) is satisfied. This condition is also called the subdiagonal error criterion.
Remark 9.29 For a long time, there had been intense discussions of
whether one should be allowed to consider the "best" solution (here, the
diagonal element Tkk) as an approximation, even though only the error
of the "second-best" solution (here, the sub diagonal element Tk,k-d was
estimated. In fact, the employed error estimator is useful for the solution
Tk,k-l only if Tkk is the "best" solution, so that it would be inconsistent
to ignore this more precise solution.
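In code, the subdiagonal error criterion is a one-liner. The following Python fragment is merely a sketch (the names and the interface are ours, not the book's): it takes the current tableau row and checks (9.24), returning the estimator ε̂_{k,k-1} along with the accept/reject decision.

    def subdiagonal_test(row, tol, I_scal=1.0, rho=0.25):
        """Check criterion (9.24) for one tableau row [T_k1, ..., T_kk], k >= 2.
        Accept the diagonal element row[-1] = T_kk if the estimated error of
        the subdiagonal element T_{k,k-1} is small enough."""
        eps_hat = abs(row[-2] - row[-1]) / abs(I_scal)   # estimator of Lemma 9.28, scaled
        return eps_hat <= rho * tol, eps_hat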

9.5.3 Derivation of the Algorithm


It is the goal of the adaptive algorithm to approximate the integral up to a desired precision at as little cost as possible. At our disposal are two parameters of the algorithm for the adaptation to the problem, namely, the basic step size H and the order p = 2k, i.e., the maximally used column k of the extrapolation table. We first start with a method Î with a given fixed order p, i.e.,

    ε = ε(t, H) = | Î_t^{t+H}(f) − ∫_t^{t+H} f(τ) dτ | ≐ γ(t) H^{p+1}

with a number γ(t), which depends on the left boundary t and on the problem. With the data ε and H of the current integration step, we can estimate γ(t) by

    γ(t) ≐ ε / H^{p+1} .                                              (9.25)

Suppose that Ĥ is the step size for which we would have achieved the desired precision

    tol = ε(t, Ĥ) ≐ γ(t) Ĥ^{p+1} .                                    (9.26)

By employing (9.25), we can compute an a posteriori approximation of Ĥ from ε and H, because

    Ĥ ≐ (tol / ε)^{1/(p+1)} · H .                                     (9.27)

We also call Ĥ the optimal step size in the sense of (9.26). Should Ĥ be much smaller than the step size H that we actually used, then this indicates that H was too large, and that we have possibly jumped over a critical region (e.g., a small peak). In this case we should repeat the integration step with Ĥ as basic step size. Otherwise we can use Ĥ as the recommended step size for the next integration step, because for sufficiently smooth integrands f and small basic step sizes H, the number γ(t) will change only little over the integration interval [t, t + H], i.e.,

    γ(t) ≈ γ(t + H)   for H → 0 .                                     (9.28)

This implies

    ε(t + H, Ĥ) ≐ γ(t + H) Ĥ^{p+1} ≈ γ(t) Ĥ^{p+1} ≐ tol ,

so that we may assume that Ĥ is also the optimal step size for the next step. The algorithm of course has to verify the Assumption (9.23) as well as the Assumption (9.28), and possibly correct the step size.
So far, we have only considered a fixed order p, and we have determined an optimal step size Ĥ for this order. The Romberg quadrature, as an extrapolation method, produces an entire series of approximations T_ik of various orders p = 2k for the column index k, which could also vary, where the approximation error satisfies

    ε_ik ≐ |τ_{2k}| γ_ik H^{2k+1}   for f ∈ C^{2k}[t, t + H] .

In the course of the investigation in the previous section, we worked out the error estimator ε̂_{k,k-1} for the subdiagonal approximation T_{k,k-1} of order p = 2k − 2. If we now replace the unknown error ε = ε_{k,k-1} in (9.27) by ε̂_{k,k-1}, then we obtain the suggested step size

    Ĥ_k := H · ( ρ · tol / ε̂_{k,k-1} )^{1/(2k−1)} ,

where we again have introduced the safety factor ρ < 1, in order to match a possible variation of γ(t) in the interval [t, t + Ĥ], compare Assumption (9.28).
For each column k the above derivation supplies a step-size proposal Ĥ_k, the realization of which requires an amount of work A_k associated with the subdivision sequence F. Still missing is a criterion to choose, in each basic step j = 1, ..., J, from the triples

    (k, Ĥ_k, A_k) = (column, step-size proposal, work amount),   k = 1, ..., k_max = q/2,

the best triple (k_j, Ĥ_{k_j}, A_{k_j}). In an abstract setting, we have to solve the following optimization problem: minimize the total amount of work

    A_total = Σ_{j=1}^{J} A_{k_j} = min

subject to the condition that an integration interval of length T is prescribed, i.e., subject to

    Σ_{j=1}^{J} Ĥ_{k_j} = T = const.

The total number J of basic steps depends on the selected sequence of indices {k_j}, which means it is unknown in advance. We therefore have to tackle a discrete minimization problem. For this type of problem in particular, there exists a quite efficient established heuristic, the greedy algorithm—see, e.g., Chapter 9.3 of the introductory textbook [3] by M. Aigner. At step j this algorithm requires the minimization of the work per unit step,

    W_k := A_k / Ĥ_k .

In this sense the column k̂ = k_j with

    W_k̂ = min_{k=1,...,k_max} W_k

is "optimal," as well as the order q̂ = 2k̂. We thus have found an appropriate order q̂ and basic step size Ĥ = Ĥ_k̂, exploiting data available in the present integration step.
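As a small illustration of this order and step-size selection, here is a Python sketch. The dictionary-based interface is our own, not the book's: eps_hat maps each admissible column k to the estimated subdiagonal error ε̂_{k,k-1}, and work maps k to the cumulative cost A_k of the subdivision sequence.

    def propose_order_and_stepsize(eps_hat, work, H, tol, rho=0.25):
        """Greedy choice of the column k (order q = 2k) and of the step-size
        proposal, minimizing the work per unit step W_k = A_k / H_k."""
        best = None
        for k, eps in eps_hat.items():
            H_k = H * (rho * tol / eps) ** (1.0 / (2 * k - 1))   # step-size proposal
            W_k = work[k] / H_k                                   # work per unit step
            if best is None or W_k < best[2]:
                best = (k, H_k, W_k)
        k_best, H_best, _ = best
        return k_best, H_best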
Summarizing all considerations of the previous sections, we arrive at the
following algorithm for one basic step of the adaptive Romberg quadrature.
Algorithm 9.30 One step of the adaptive Romberg quadrature.
As input, the procedure step gets the beginning t of the interval under consideration, the suggested column k, and the step size H. Besides the possible success notice done, the output is: the corresponding values t̂, k̂, and Ĥ for the next step, as well as the approximation Î of the integral I_t^{t̂} over the interval [t, t̂].

    function [done, Î, t̂, k̂, Ĥ] = step(t, k, H)
        done := false;
        i := 1;
        while not done and i < i_max do
            Compute the approximations T_11, ..., T_kk of I_t^{t+H};
            while k < k_max and ε̂_{k,k-1} > tol do
                k := k + 1;
                Compute T_kk;
            end
            Compute Ĥ_1, ..., Ĥ_k and W_1, ..., W_k;
            Choose k̂ ≤ k with minimal work W_k̂;
            Ĥ := Ĥ_k̂;
            if k < k_max then
                if H > Ĥ then
                    H := Ĥ;            (repeat the step for safety)
                else
                    t̂ := t + H;
                    Î := T_kk;
                    done := true;      (done)
                end
            end
            i := i + 1;
        end
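To make the overall flow concrete, the following strongly simplified, self-contained Python sketch strings such basic steps together over [a, b]. It is illustrative only and not the book's program: the order control via the work model W_k, the scaling I_scal, and the convergence monitor discussed below are omitted, and the step-size proposal of the accepted column is simply reused for the next step.

    def adaptive_romberg(f, a, b, tol, k_max=6, rho=0.25):
        """Simplified adaptive Romberg quadrature (initial value approach)."""
        t, H, total = a, b - a, 0.0
        while t < b:
            H = min(H, b - t)
            for _ in range(60):                       # repeat the step until accepted
                prev, trap, n, accepted = None, 0.5 * H * (f(t) + f(t + H)), 1, None
                for i in range(1, k_max + 1):
                    if i > 1:
                        n *= 2
                        h = H / n
                        trap = 0.5 * trap + h * sum(
                            f(t + (2 * j - 1) * h) for j in range(1, n // 2 + 1))
                    row = [trap]
                    for k in range(1, i):             # Aitken-Neville extrapolation
                        row.append(row[k - 1] +
                                   (row[k - 1] - prev[k - 1]) / (4.0 ** k - 1.0))
                    if i > 1:
                        eps_hat = abs(row[-2] - row[-1])      # subdiagonal estimator
                        if eps_hat <= rho * tol:              # criterion (9.24), I_scal = 1
                            accepted = (row[-1], i, eps_hat)
                            break
                    prev = row
                if accepted is not None:
                    I_sub, k, eps_hat = accepted
                    total += I_sub
                    t += H
                    # step-size proposal for the next basic step (p + 1 = 2k - 1)
                    H *= (rho * tol / max(eps_hat, 1e-300)) ** (1.0 / (2 * k - 1))
                    break
                H *= 0.5                              # accuracy not reached: reduce H
            else:
                raise RuntimeError("step size reduced too often near t = %g" % t)
        return total

    # needle impulse (9.17)
    print(adaptive_romberg(lambda x: 1.0 / (1e-4 + x * x), -1.0, 1.0, tol=1e-6))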

When programming this adaptive algorithm, one encounters the unfortu-


nate experience that in the present form it does not (yet) work as one might
have hoped. It is precisely the example of the needle function, which causes
the trouble. The step sizes nicely contract with decreasing orders toward
the center of the needle (as expected). However, after crossing the tip of
the needle, the orders still remain low and the step sizes small. We shall
briefly analyze this situation together with two further difficulties of the
present algorithm.

Disadvantages of the Algorithm:

(1) Trapping of order, as explained above. Once a low order q = 2k is reached and the condition ε̂_{k,k-1} ≤ tol is always satisfied, the algorithm does not test any higher order—even though this might be advantageous. The order remains low and the step sizes small, as we observed in the case of the integration of the needle.

(2) The algorithm notices only rather late, namely, only after reaching k_max, that a suggested step size H was too large and that it does not pass the accuracy criterion (ε̂_{k,k-1} ≤ tol) for any column k.

(3) If our assumptions are not satisfied, then the error estimator does
not work. It may therefore happen that the algorithm recognizes an
incorrect solution as correct and supplies it as an output. This case
is referred to as pseudo-convergence.

In the last two problems mentioned, it would be desirable to recognize early


on whether the approximations behave "reasonably," i.e., entirely within
our theoretical assumptions. One thus needs a convergence monitor. The
main difficulty in the construction of such a monitor is that one would have
to make algorithmic decisions on the basis of information which is not (yet)
available. Because of this, we try to obtain a model, which hopefully de-
scribes the situation, at least in the statistical average over a large number
of problems. We may then compare the actually obtained values with this
model.

Here we only want to discuss briefly one such possibility, which is based
on the information theory of C. E. Shannon (see, e.g., [75]). For more details
we refer to the paper [19] of P. Deuflhard. In this model, the quadrature
algorithm is interpreted as an encoding device. It converts the information,

Figure 9.9. Quadrature algorithm as an encoding device: input information (f evaluations) → quadrature algorithm → output information (approximations T_ik).

which is obtained by evaluating the function, into information about the integral (see the schematic Figure 9.9). The amount of information on the input side, the input entropy E_ik^(in), is measured by the number of f evaluations which are required for the computation of T_ik. This assumes that no redundant f evaluations are considered, i.e., that all digits of f are independent of each other. Since the values T_{i-k+1,1}, ..., T_{i,1} are needed as input for the computation of T_ik, we obtain

    E_ik^(in) = α (A_i − A_{i-k} + 1)

with a constant α > 0. The amount of information on the output side, the output entropy E_ik^(out), can be characterized by the number of correct binary digits of the approximation T_ik. This leads to

    E_ik^(out) = log_2 (1 / ε_ik) .

We now assume that our information channel works with a constant noise factor 0 < β ≤ 1,

    E_ik^(out) = β E_ik^(in) ,

i.e., that input and output entropies are proportional to each other. (If β = 1, then the channel is noise free; no information gets lost.) In our case this means that

    log_2 (1 / ε_ik) = c (A_i − A_{i-k} + 1)                           (9.29)

with c := αβ. In order to determine the proportionality factor c, we need a pair of input and output entropies. In the above we required that for a given column k, the subdiagonal error ε_{k,k-1} is equal to the required precision tol, hence ε_{k,k-1} = tol. By inserting this relation into (9.29), we conclude that

    c = log_2 (1 / tol) / (A_k − A_1 + 1) .

Having thus determined c, we can then determine for all i, j which errors ε_ij are to be expected by our model. If we denote these errors, which the information-theoretic model implies, by α_ij^(k) (where k is the row from which we have obtained the proportionality factor), then it follows that

    log_2 α_ij^(k) = −c (A_i − A_{i-j} + 1) = (A_i − A_{i-j} + 1) / (A_k − A_1 + 1) · log_2 tol .
In a quite elementary manner, we have thus in fact constructed a statistical comparison model, with which we can test the convergence behavior of our algorithm for a concrete problem, namely, by comparing the estimated errors ε̂_{i,i-1} with the values α_{i,i-1}^(k) of the convergence model. On the one hand, we thus obtain the desired convergence monitor; on the other hand, we can also estimate how higher orders would behave. We shall omit further details. They have been worked out for a large class of extrapolation methods in [19]. For the adaptive Romberg quadrature they are implemented in the program TRAPEX (see [21]).
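Evaluating the comparison model is cheap. The following Python fragment is only a sketch (the interface is ours, not taken from TRAPEX): given the cumulative costs A_i of the subdivision sequence, it returns the model values α_{i,i-1}^(k) against which the estimated errors ε̂_{i,i-1} can be compared.

    import math

    def model_subdiagonal_errors(A, k, tol):
        """Model values alpha_{i,i-1}^(k) for comparison with the estimated
        subdiagonal errors.  A[i-1] = A_i is the cumulative number of f
        evaluations up to row i; k is the row used to calibrate c."""
        c = math.log2(1.0 / tol) / (A[k - 1] - A[0] + 1)   # calibration: eps_{k,k-1} = tol
        return {i: 2.0 ** (-c * (A[i - 1] - A[0] + 1))     # log2 alpha = -c (A_i - A_1 + 1)
                for i in range(2, len(A) + 1)}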

Obtained Global Precision


If we ignore the safety factor ρ, then the above algorithm approximates the integral I = I(f) with a global precision

    |I − Î| ≤ I_scal · m · tol ,

where m is the number of basic steps, which were obtained in the adap-
tive quadrature (a-posteriori error estimate). The chosen strategy obviously
leads to a uniform distribution of the local discretization errors. This
principle is also important for considerably more general adaptive dis-
cretization methods (compare Section 9.7). If one wants to prescribe a
global discretization error, which is independent of m,

    |I − Î| ≤ I_scal · ε ,
then, following a suggestion by C. de Boor [14], in the derivation of the
order and step-size control, the precision tol is to be replaced by
    tol → (H / (b − a)) · ε .
This leads to smaller changes of the order and step-size control, but also
to additional difficulties and a less robust algorithm.

Example 9.31 We again return to the example of the needle impulse,


whose treatment with the classical Romberg quadrature we have documented in Section 9.4.3, Table 9.5: We needed 4,097 f calls for an achieved precision of approximately 10^{-9}. In the adaptive Romberg quadrature, for a required precision of tol = 10^{-9}, we only need 321 f evaluations (for 27 basic steps) with an achieved precision of ε = 1.4·10^{-9}. The automatic subdivision into basic steps by the program TRAPEX is given in Figure 9.10.

Figure 9.10. Automatic subdivision into basic steps by the program TRAPEX.

9.6 Hard Integration Problems


Of course, even the adaptive Romberg quadrature cannot solve all problems
of numerical quadrature. In this section we shall discuss some difficulties.

Discontinuous Integrands
Common problems in numerical quadrature are discontinuities of the integrand f or of its derivatives (see Figure 9.11). Such integrands occur, e.g.,

Figure 9.11. Jump of f at t_1, jump of f' at t_2.

when a physical-technical system is described by different models in differ-


ent regions, which do not quite fit at the interface positions. If the jumps are
known, then one should subdivide the integration interval at these positions
and solve the arising subproblems separately. Otherwise the quadrature
program reacts quite differently. Without any further preparation, a non-
adaptive quadrature program yields incorrect results or does not converge.
The jumps cannot be localized. An adaptive quadrature program, such as
the adaptive Romberg method, freezes at the jumps. Thus the jumps can
be localized and treated separately.
Needle Impulses
We have considered this problem repeatedly in the above. It has to be
noted, however, that in principle, every quadrature program will fail if the
peaks are small enough (compare Exercise 9.8). On the other hand, such
integrands are pretty common: just think of the spectrum of a star whose
entire radiation is to be computed. If the positions of the peaks are known,
then one should subdivide the interval in a suitable way, and again compute
the sub integrals separately. Otherwise, there only remains the hope that
the adaptive quadrature program does not "overlook" them.
Highly Oscillatory Integrands
We have already noted in Section 9.1 that highly oscillatory integrands are
ill-conditioned from the relative error viewpoint. As an example, we have
plotted the function

    f(t) = cos(t e^{4t²})
for t E [-1,1] in Figure 9.12. The numerical quadrature is powerless against
such integrands. They have to be prepared by analytical averaging over
subintervals (pre-clarification of the structure of the inflection points of the
integrand).

Figure 9.12. Highly oscillatory integrand f(t) = cos(t e^{4t²}).

Weakly Singular Integrands


A function f which is integrable over the interval [a, b] is called weakly singular if one of its derivatives f^(k) does not exist in [a, b]. As an example, take the functions f(t) = t^α g(t), where g ∈ C^∞[0, T] is an arbitrarily smooth function and α > −1.

Example 9.32 As an example, we consider the integral

    ∫_{t=0}^{π} √t cos t dt ,

with integrand f(t) = √t cos t. The derivative f'(t) = (cos t)/(2√t) − √t sin t has a pole at 0.


In the case of weakly singular integrands, adaptive quadrature programs usually tend to contract step size and order, and they therefore tend to be extremely slow. Nonadaptive quadrature algorithms, however, do not become slow, but usually deliver wrong results. The singularities can often be removed via a substitution.
Example 9.33 In the above example, we obtain after the substitution s = √t:

    ∫_{t=0}^{π} √t cos t dt = 2 ∫_{s=0}^{√π} s² cos(s²) ds .
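A small numerical experiment (ours, not from the book) makes the effect of the substitution visible: the same composite Simpson rule is applied to the weakly singular integrand and to the transformed, smooth one, and the second variant converges markedly faster.

    import math

    def simpson(f, a, b, n):
        """Composite Simpson rule with n (even) subintervals."""
        h = (b - a) / n
        s = f(a) + f(b) \
            + 4 * sum(f(a + (2 * j - 1) * h) for j in range(1, n // 2 + 1)) \
            + 2 * sum(f(a + 2 * j * h) for j in range(1, n // 2))
        return s * h / 3.0

    f_weak   = lambda t: math.sqrt(t) * math.cos(t)        # weakly singular at t = 0
    f_smooth = lambda s: 2.0 * s * s * math.cos(s * s)     # after the substitution s = sqrt(t)

    ref = simpson(f_smooth, 0.0, math.sqrt(math.pi), 4096) # fine reference value
    for n in (8, 32, 128):
        err_weak   = abs(simpson(f_weak,   0.0, math.pi,           n) - ref)
        err_smooth = abs(simpson(f_smooth, 0.0, math.sqrt(math.pi), n) - ref)
        print(n, err_weak, err_smooth)                     # the smooth form converges much faster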

In some cases, however, proceeding like this becomes inefficient when


the substitution leads to functions that are tedious to evaluate. A second
possibility consists of the recursive computation of the integral under con-
sideration (Miller-trick), which we shall not consider here (see Exercise
9.10).

Parameter- Dependent Integrands


Often the integrand f depends on an additional parameter λ ∈ R:

    f(t, λ),   λ ∈ R a parameter.

We thus have to solve an entire family of problems

    I(λ) := ∫_a^b f(t, λ) dt .

The most important class of examples for such parameter-dependent


integrals is the multi-dimensional quadrature. Usually the integrand is differentiable with respect to λ, and so is therefore the integral I(λ). Of course, one hopes that the approximation Î(λ) inherits this property. Unfortunately, however, it turns out that just our best methods, the adaptive quadrature methods, do not have this property—in contrast to the simple nonadaptive quadrature formulas. There are essentially three possibilities to rescue the adaptive approach for parameter-dependent problems.
The first possibility consists of carrying out the quadrature for one parameter value, storing away the employed orders and step sizes, and using them again for all other parameter values. This is also called freezing of orders and step sizes. This can only be successful if the integrand qualitatively does not change too much in dependence of the parameter.

If, however, a peak varies with the parameter, and if this dependence is known, then one can employ parameter-dependent grids. One transforms the integral in dependence of λ in such a way that the integrand stays qualitatively the same (the movement of the peak is, e.g., counterbalanced) or, in dependence of λ, one shifts the adaptive partitioning of the integration interval.
The last possibility requires a lot of insight into the respective problem.
We choose a fixed grid adapted to the respective problem and integrate
over this grid with a fixed quadrature formula (Newton-Cotes or Gauss-
Christoffel). In order to do this, the qualitative properties of the integrand
need to be largely known, of course.

Discrete Integrands
In many applications, the integrand is not given as a function f, but only
in the form of finitely many discrete points

    (t_i, f_i),   i = 0, ..., N,


(e.g., nuclear spin spectrum, digitalized measurement data). The simplest
and best way to deal with this situation consists of forming the trapezoidal
sum over these points. The trapezoidal sum has the advantage that errors
in measurement data often get averaged out in the computation of the
integral with an equidistant grid. If the measurement errors oj; have the
expectation 0, i.e., L~o ofi = 0, then this is also the case for the induced
error of the trapezoidal sum. This property holds only for methods where
all weights are equal, and it is not true any more for methods of higher
order. In the next section, we shall consider an effective method for the
solution of such problems.
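For discrete data, the trapezoidal sum itself is a two-line routine; the following sketch (ours, not the SUMMATOR program mentioned in Section 9.7) also works for nonequidistant nodes.

    def trapezoidal_sum(t, f):
        """Trapezoidal sum for discrete data (t_i, f_i), i = 0, ..., N."""
        return sum(0.5 * (f[i] + f[i + 1]) * (t[i + 1] - t[i])
                   for i in range(len(t) - 1))

    # usage: t = [t_0, ..., t_N] measured abscissas, f = [f_0, ..., f_N] measured values
    # I = trapezoidal_sum(t, f)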

9.7 Adaptive Multigrid Quadrature


In the present section, we consider a second approach to the adaptive
quadrature, which rests on ideas that were originally developed for the
solution of considerably more complicated problems in partial differential
equations (see [6]). This multigrid approach, or, more generally, multilevel
approach, is based on the formulation of the quadrature problem as a bound-
ary value problem. In the adaptive Romberg quadrature, which is based on
the initial value approach, we traversed the interval in an arbitrarily chosen
direction. According to the problem, we then subdivided it into subinter-
vals, and then integrated over these with local fine grids (of the Romberg
quadrature). In contrast to this, the multigrid quadrature starts with the
entire basic interval or with a coarse initial partitioning Δ_0, and step by step generates a sequence of finer global subdivisions Δ_i of the interval and more precise approximations I(Δ_i) of the integral. Here the grids are only
refined at places where it is necessary for the required precision, i.e., the
qualitative behavior of the integrand becomes visible in the refinement of
the grids. The nodes condense where "a lot happens." In order to achieve
this, one requires two things: a local error estimator and local refinement
rules.
The local error estimator is typically realized by a comparison of methods
of lower and higher order, as we have seen in Section 9.5.3 in the subdiago-
nal error criterion. Here the theory of the respective approximation method
enters. In the definition of refinement rules, aspects of the data structures
play the decisive role. Thus, in fact part of the complexity of the mathe-
matical problem is transferred to the computer science side (in the form of
more complex data structures).

9.7.1 Local Error Estimation and Refinement Rules


As an example of a multigrid quadrature, we here present a particular
method where the trapezoidal rule (locally linear) is used as the method
of lower order, and where Simpson's rule (locally quadratic) is used as the
method of higher order. As a refinement method, we shall restrict ourselves
to the local bisection of an interval.
We start with a subinterval [t_l, t_r] ⊂ [a, b] (l: left, r: right). Since we need three nodes for Simpson's rule, we add the center t_m := (t_l + t_r)/2 and describe the interval by the triple J := (t_l, t_m, t_r). The length of the interval is denoted by h = h(J) := t_r − t_l. A grid Δ is a family Δ = {J_i} of such intervals, which together form a partition of the original interval [a, b].
By T(J) and S(J), we denote the results of the trapezoidal rule, as applied to the subintervals [t_l, t_m] and [t_m, t_r], and of Simpson's rule with respect to the nodes t_l, t_m and t_r. The formulas are given in Figure 9.13.
Observe that Simpson's rule is obtained from the Romberg quadrature
as S(J) = T22(J) (see Exercise 9.6). For sufficiently smooth functions f,

    T(J) = (h/4) ( f(t_l) + 2 f(t_m) + f(t_r) ) ,
    S(J) = (h/6) ( f(t_l) + 4 f(t_m) + f(t_r) ) .

Figure 9.13. Trapezoidal and Simpson's rule for an interval J := (t_l, t_m, t_r).

T(J) and S(J) are approximations of the integral ∫_{t_l}^{t_r} f(t) dt of order O(h³)

or O(h⁵), respectively. The error of Simpson's approximation therefore satisfies

    | ∫_{t_l}^{t_r} f(t) dt − S(J) | = O(h⁵) .

By summation over all subintervals J ∈ Δ, we obtain the approximation of the entire integral ∫_a^b f(t) dt:

    T(Δ) = Σ_{J∈Δ} T(J)   and   S(Δ) = Σ_{J∈Δ} S(J) .

As in the Romberg quadrature, we assume (at first unchecked) that the method of higher order, Simpson's rule, is locally better, i.e.,

    | ∫_{t_l}^{t_r} f(t) dt − S(J) | ≪ | ∫_{t_l}^{t_r} f(t) dt − T(J) | =: ε(J) .

Under this assumption, the subdiagonal estimator of the local approximation error is

    ε̂(J) := |T(J) − S(J)| = [ε(J)] ,

and we can use the Simpson result as a better approximation.
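In code, the two local rules and the estimator read, for example, as follows (a sketch; in an actual implementation the three f values per interval would of course be stored and reused):

    def local_rules(f, J):
        """Trapezoidal and Simpson approximations on J = (t_l, t_m, t_r),
        together with the subdiagonal estimator eps_hat(J) = |T(J) - S(J)|."""
        t_l, t_m, t_r = J
        h = t_r - t_l
        T = h / 4.0 * (f(t_l) + 2.0 * f(t_m) + f(t_r))   # trapezoidal rule on both halves
        S = h / 6.0 * (f(t_l) + 4.0 * f(t_m) + f(t_r))   # Simpson's rule on J
        return T, S, abs(T - S)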
In the construction of local refinement rules, we essentially follow an abstract suggestion by I. Babuška and W. C. Rheinboldt [6], which they made in the more general context of boundary value problems for partial differential equations. The subintervals which are obtained when bisecting an interval J := (t_l, t_m, t_r) are denoted by J_l and J_r, where

    J_l := (t_l, (t_l + t_m)/2, t_m)   and   J_r := (t_m, (t_m + t_r)/2, t_r) .
When refining twice, we thus obtain the binary tree, which is displayed in
Figure 9.14. If J is obtained by refinement, then we denote the starting
Figure 9.14. Twofold refinement of the interval J := (t_l, t_m, t_r): J splits into J_l, J_r, which split further into J_ll, J_lr, J_rl, J_rr (binary tree).


interval of the last step by J⁻, i.e., J_l⁻ = J_r⁻ = J.


The principle according to which we want to proceed when deriving a set of refinement rules is the equidistribution of the local discretization error (compare Section 9.5.3). This means that the grid Δ is to be refined such that the estimated local approximation errors of the refined grid Δ⁺ are approximately equal, i.e.,

    ε̂(J) ≈ const   for all J ∈ Δ⁺ .

For the estimated error of the trapezoidal rule, we make the theoretical assumption (see Section 9.7.1)

    ε̂(J) ≈ C h^γ ,   where h = h(J),                                  (9.30)

with a local order γ and a local constant C, which depends on the problem.
Remark 9.34 The trapezoidal rule actually has the order γ = 3. Hidden in the constant, however, is the second derivative of the integrand, so that an order γ ≤ 3 characterizes the method more realistically, if we assume that C is locally constant. In the following considerations, the order γ cancels out, so that this does not cause any trouble.
We can thus define a second error estimator ε⁺(J), which yields information about the error ε(J_l) of the next step, in the case that we partition the interval J. Assumption (9.30) implies

    ε̂(J⁻) ≈ C (2h)^γ = 2^γ C h^γ ≈ 2^γ ε̂(J) ,   with h = h(J),

thus 2^γ ≈ ε̂(J⁻)/ε̂(J), and therefore

    ε(J_l) ≈ C (h/2)^γ ≈ ε̂(J) · ε̂(J)/ε̂(J⁻) .

Thus, through local extrapolation (see Figure 9.15), we have obtained an error estimator

    ε⁺(J) := ε̂(J)² / ε̂(J⁻) = [ε(J_l)]

for the unknown error ε(J_l). We can therefore estimate in advance what

Figure 9.15. Local extrapolation for the error estimator ε⁺(J).

effect a refinement of an interval J ∈ Δ would have. We only have to fix a threshold value for the local errors, above which we refine an interval. In order to do this, we take the maximal local error which we could possibly achieve by a complete refinement, i.e., refinement of all intervals J ∈ Δ, and define

    κ(Δ) := max_{J∈Δ} ε⁺(J) .                                         (9.31)

In order to illustrate the situation, we plot the estimated errors ε̂(J) and ε⁺(J) in a histogram (see Figure 9.16). Already before the refinement, the

Figure 9.16. Estimated error distributions before and after global and local
refinement.

error at the right and left boundary is below the maximal local error κ(Δ) which can possibly be achieved by a complete refinement. If we follow the principle of equidistribution of the local error, then we do not have to refine any more near the right and left boundary. Refinement only pays off in the middle region. We thus arrive at the following refinement rule: Refine only intervals J ∈ Δ for which

    ε̂(J) ≥ κ(Δ) .

This yields the error distribution which is displayed in Figure 9.16. It is obviously one step closer to the desired equidistribution of the approximation errors.
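A sketch of one such refinement sweep in Python might look as follows. The names and the dictionary interface are ours, not the book's; the parent errors ε̂(J⁻) are assumed to be available, e.g., stored when an interval was created by bisection.

    def refine(grid, eps_hat, eps_hat_parent):
        """One refinement sweep over a grid (list of triples J = (t_l, t_m, t_r)).
        eps_hat[J]        : estimated local error of J
        eps_hat_parent[J] : estimated local error of the parent interval J^-
        Refines all J with eps_hat(J) >= kappa(grid), cf. (9.31)."""
        # local extrapolation: eps_plus(J) = eps_hat(J)**2 / eps_hat(J^-)
        eps_plus = {J: eps_hat[J] ** 2 / eps_hat_parent[J] for J in grid}
        kappa = max(eps_plus.values())                     # threshold (9.31)
        new_grid = []
        for J in grid:
            t_l, t_m, t_r = J
            if eps_hat[J] >= kappa:                        # refine by bisection
                new_grid.append((t_l, 0.5 * (t_l + t_m), t_m))
                new_grid.append((t_m, 0.5 * (t_m + t_r), t_r))
            else:
                new_grid.append(J)
        return new_grid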

Remark 9.35 By local refinement, the contribution of the interval J is replaced by the two contributions of the arising subintervals J_l and J_r:

    ε(J_l) + ε(J_r) ≈ 2 C (h/2)^γ = 2^{1−γ} ε(J) .

In order that the partitioning in fact yields an improvement, the order γ has to satisfy the condition γ > 1 locally.

9.7.2 Global Error Estimation and Details of the Algorithm


A difficulty of the multigrid quadrature is the estimation of the global approximation error

    ε(Δ) := | ∫_a^b f(t) dt − S(Δ) | .

The sum Σ_{J∈Δ} ε(J) is not a suitable measure, since integration errors may average out. Better suited is a comparison with the approximation of the previous grid Δ⁻. If

    ε(Δ) ≪ ε(Δ⁻) ,                                                    (9.32)

then

    ε̂(Δ) := | S(Δ) − S(Δ⁻) |

is an estimator of the global approximation error ε(Δ). In order that the


condition (9.32) be satisfied, sufficiently many intervals have to be refined
from step to step. In order to guarantee this, it has turned out to be useful to replace the threshold value κ(Δ) from (9.31) by

    κ(Δ) := min ( max_{J∈Δ} ε⁺(J),  ½ · max_{J∈Δ} ε̂(J) ) .

The complete algorithm of the adaptive multigrid quadrature for the computation of ∫_a^b f(t) dt with a relative precision tol now looks as follows:
Algorithm 9.36 Simple multigrid quadrature.

    Choose an initial grid, e.g., Δ := {(a, (a + b)/2, b)};
    for i = 0 to i_max do
        Compute T(J), S(J), and ε̂(J) for all J ∈ Δ;
        Compute ε̂(Δ);
        if ε̂(Δ) ≤ tol · |S(Δ)| then
            break;                     (done, solution S(Δ))
        else
            Compute ε⁺(J) for all J ∈ Δ;
            Compute κ(Δ);
            Replace all J ∈ Δ with ε̂(J) ≥ κ(Δ) by J_l and J_r;
        end
    end
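The following self-contained Python sketch mirrors Algorithm 9.36. It is not the book's program: the parent estimator needed for ε⁺ is simply carried along with each interval, and on the coarsest grid we fall back to ε̂(J) itself.

    def multigrid_quadrature(f, a, b, tol=1e-3, i_max=40):
        """Minimal sketch of the adaptive multigrid quadrature (Algorithm 9.36)."""
        def T_S(J):
            t_l, t_m, t_r = J
            h = t_r - t_l
            T = h / 4.0 * (f(t_l) + 2 * f(t_m) + f(t_r))
            S = h / 6.0 * (f(t_l) + 4 * f(t_m) + f(t_r))
            return T, S

        grid = [(a, 0.5 * (a + b), b)]
        parent_eps = {grid[0]: None}
        S_old = None
        for _ in range(i_max):
            eps_hat, S_total = {}, 0.0
            for J in grid:
                T, S = T_S(J)
                eps_hat[J] = abs(T - S)
                S_total += S
            # global estimator eps_hat(grid) = |S(grid) - S(previous grid)|
            if S_old is not None and abs(S_total - S_old) <= tol * abs(S_total):
                return S_total, grid
            # local extrapolation; coarsest-grid fallback: use eps_hat(J) itself
            eps_plus = {J: eps_hat[J] ** 2 / (parent_eps[J] or eps_hat[J] or 1.0)
                        for J in grid}
            kappa = min(max(eps_plus.values()), 0.5 * max(eps_hat.values()))
            new_grid, new_parent = [], {}
            for J in grid:
                t_l, t_m, t_r = J
                if eps_hat[J] >= kappa:                    # refine by bisection
                    for child in ((t_l, 0.5 * (t_l + t_m), t_m),
                                  (t_m, 0.5 * (t_m + t_r), t_r)):
                        new_grid.append(child)
                        new_parent[child] = eps_hat[J]
                else:
                    new_grid.append(J)
                    new_parent[J] = parent_eps[J]
            grid, parent_eps, S_old = new_grid, new_parent, S_total
        return S_total, grid

    # needle impulse (9.17)
    value, grid = multigrid_quadrature(lambda t: 1.0 / (1e-4 + t * t), -1.0, 1.0, tol=1e-3)
    print(value, len(grid), "intervals")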
The multigrid approach obviously leads to a considerably simpler adap-
tive quadrature algorithm than the adaptive Romberg quadrature. The only
difficulty consists in the storage of the grid sequence. However, this diffi-
culty can be mastered fairly easily by employing a structured programming
language (such as C or Pascal). In the one-dimensional quadrature, we can
store the sequence as a binary tree (as indicated in Figure 9.14). In prob-
lems in more than one spatial dimension, the question of data structures
often conceals a much higher complexity-consider only the refinement of


meshes of tetrahedrons in three spatial dimensions.
Our current presentation of adaptive multigrid algorithms also overcomes
difficulties regarding special integrands, which we discussed in the previ-
ous section (Section 9.6). Note the case of discontinuous or weakly singular
integrands, where the nodes collect automatically at the critical places,
without the integrator "grinding to halt" at these places, as would be the
case with the initial value approach of Section 9.5.3. The refinement strat-
egy still works locally for these places, because it was derived for general
local orders 'Y > 1.

Example 9.37 Needle Impulse. We have repeatedly used this example (9.17) for illustration purposes (for classical and adaptive Romberg quadrature). The result for the tolerance tol = 10^{-3} is presented in Figure 9.17 in the case that the initial grid Δ_0 already contains the tip of the needle. The final grid Δ_9 has 61 nodes, thus requiring 121 f evaluations. The estimated total error amounts to ε̂(Δ_9) = 2.4·10^{-4}, with an actual error of ε(Δ_9) = 2.1·10^{-4}. When the interval is shifted asymmetrically, i.e., when the tip of the needle is not represented within the initial grid, this does not deteriorate the result.

Figure 9.17. Adapted grid for the needle impulse f(t) = 1/(10^{-4} + t²) after the fifth and the ninth step for the tolerance 10^{-3}.

The program can also be adapted to discrete integrands (it was originally developed just for this case in [91] as the so-called SUMMATOR). Here one only has to consider the case that there is no value available at a bisection point. As always, we do the next best thing, and this time in the literal sense, by taking the given point nearest to the bisection point, and thus modify the bisection slightly. Once the required precision is achieved, then for discrete integrands, and for reasons which we discussed in Section 9.6, we take the trapezoidal sum as the best approximation.
Example 9.38 Summation of the Harmonic Series. The sum

    S = Σ_{j=1}^{n} 1/j   for n = 10^7

is to be computed, i.e., a sum of 10^7 terms. For a required precision of tol = 10^{-2}, respectively tol = 10^{-4}, the program SUMMATOR only needs 47, respectively 129, terms! In order to illustrate this, the automatically chosen grids are presented in Figure 9.18. (Observe the logarithmic scale.)

Figure 9.18. Summation of the harmonic series with the program SUMMATOR.

We finally return again to the parameter-dependent case, which causes


difficulties in the adaptive multigrid approach, too, similar to those of the
other approaches; i.e., it requires additional considerations, like the ones
described in Section 9.6. Overall, however, even for considerably more gen-
eral boundary value problems (e.g., for partial differential equations), the
adaptive multigrid concept turned out to be simple, fast and reliable.

Exercises

Exercise 9.1 Let

    λ_{in} = (1/n) ∫_0^n ∏_{j=0, j≠i}^{n} (s − j)/(i − j) ds

be the constants of the Newton-Cotes formulas. Show that

    λ_{n−i,n} = λ_{in}   and   Σ_{i=0}^{n} λ_{in} = 1 .

Exercise 9.2 Compute an approximation of the integral

    ∫_1^2 x² e^{3x} dx

by fivefold application of Simpson's rule using equidistant nodes.


Exercise 9.3 The nth Newton-Cotes formula is constructed such that it
yields the exact integral value for polynomials of degree ≤ n. Show that
for even n, even polynomials of degree n + 1 are integrated exactly.
Hint: Employ the remainder formula of the polynomial interpolation, and
use the symmetries with respect to (a + b)/2.
Exercise 9.4 (R. van Veldhuizen) Compute the period

    P = 2 ∫_{−1}^{1} f(t) / √(1 − t²) dt

of the radial movement of a satellite in an orbit in the equatorial plane (apogee height 492 km) under the influence of the flattening of the Earth. Here
(a) f(t) = 1/√(2 g(r(t))) ,   r(t) = 1 + (1 + t)(p_2 − 1)/2 ,
(b) g(x) = 2ω²(1 − p_1/x) ,
k
(c) w 2 = ~(1 - c) + ~, P1=-6
2 '
W P2
with the constants ε = 0.5 (elliptic eccentricity of the satellite orbit), p_2 = 2.9919245059286 and k = 1.4·10^{-3} (a constant which describes the influence of the Earth flattening). Write a program which computes the integral

In := -1f- ~
L f(Tin), Tin:= cos
n+1.,=0
(2i + 1 1f)
--.-
n+12
, n = 3,4, ... 7

by using the Gauss-Chebyshev quadrature.


Hint: For verification: P = 2 . 4.4395413186376.

Exercise 9.5 Derive the formula

    T_{ik} = T_{i,k-1} + (T_{i,k-1} − T_{i-1,k-1}) / ( (n_i / n_{i-k+1})² − 1 )

for the extrapolation tableau from the one of the Aitken-Neville algorithm.
Exercise 9.6 Every element Tik in the extrapolation tableau of the extra-
polated trapezoidal rule can be considered as a quadrature formula. Show
that when using the Romberg sequence and polynomial extrapolation, the
following results hold:
(a) T22 is equal to the value, which is obtained by applying the Simpson
rule; T33 corresponds to the Milne rule.
(b) T_{ik}, i > k, is obtained by 2^{i−k}-fold application of the quadrature formula which belongs to T_{kk} to suitably chosen subintervals.
(c) For every Tik, the weights of the corresponding quadrature formula
are positive.
Hint: By using (b), show that the weights λ_{in} of the quadrature formula which corresponds to T_{kk} satisfy

    max_i λ_{in} ≤ 4^k · min_i λ_{in} .

Exercise 9.7 Implement the Romberg algorithm by only using one single
vector of length n (note that only one intermediate value of the table needs
to be extra stored).
Exercise 9.8 Experiment with an adaptive Romberg quadrature pro-
gram, test it with the "needle function"

    I(n) := ∫_{−1}^{1} 2^{−n} / (4^{−n} + t²) dt ,   for n = 1, 2, ...,

and determine the n for which your program yields the value zero for a given precision of eps = 10^{-3}.
Exercise 9.9 Consider the computation of the integrals

    I_n = ∫_1^2 (ln x)^n dx ,   n = 1, 2, ...

(a) Show that the I_n satisfy the recurrence relation

    I_n = 2 (ln 2)^n − n I_{n−1} ,   n ≥ 2 .                           (R)

(b) Note that I_1 = 0.3863... and I_7 = 0.0124.... Investigate the increase of the input error in the computation of
    (1) I_7 from I_1 by means of (R) (forward recursion),
    (2) I_1 from I_7 by means of (R) (backward recursion).

Assume an accuracy of four decimal places and neglect any rounding


errors.
(c) Use (R) as a backward recursion for the computation of I_n from I_{n+k} with starting value

    I_{n+k} = 0 .

    How is k to be chosen in order to compute I_7 accurately up to 8 digits by this method?
Exercise 9.10 Consider integrals of the following form:

    I_n(α) := ∫_0^1 t^{2n+α} sin(πt) dt ,   where α > −1 and n = 0, 1, 2, ...

(a) For I_n, derive the following inhomogeneous two-term recurrence relation:

    I_n(α) = 1/π − (2n + α)(2n + α − 1)/π² · I_{n−1}(α) .

(b) Show that

    lim_{n→∞} I_n(α) = 0   and   0 ≤ I_{n+1}(α) ≤ I_n(α)   for n ≥ 1 .

(c) Give an informal algorithm for the computation of I_0(α) (compare Chapter 6.2-3). Write a program to compute I_0(α) for a given relative precision.
Exercise 9.11 A definite integral over [−1, +1] is to be computed. Based on the idea of the Gauss-Christoffel quadrature, derive a quadrature formula

    ∫_{−1}^{+1} f(t) dt ≈ μ_0 f(−1) + Σ_{i=1}^{n−1} μ_i f(t_i) + μ_n f(+1)

with fixed nodes −1 and +1, and variable nodes t_i to be determined such that the order is as high as possible (Gauss-Lobatto quadrature).
References

[1] ABDULLE, A., AND WANNER, G. 200 years of least squares method. Elemente
der Mathematik (2002).
[2] ABRAMOWITZ, M., AND STEGUN, 1. A. Pocketbook of Mathematical
Functions. Verlag Harri Deutsch, Thun, Frankfurt/Main, 1984.
[3] AIGNER, M. Diskrete Mathematik, 4. ed. Vieweg, Braunschweig, Wiesbaden, 2001.
[4] ANDERSON, E., BAI, Z., BISCHOF, C., DEMMEL, J., DONGARRA, J.,
DUCROZ, J., GREENBAUM, A., HAMMARLING, S., McKENNEY, A., OSTRU-
CHOV, S., AND SORENSEN, D. LAPACK Users' Guide. SIAM, Philadelphia,
1999.
[5] ARNOLDI, W. E. The principle of minimized iterations in the solution of the
matrix eigenvalue problem. Quart. Appl. Math. 9 (1951), 17-29.
[6] BABUSKA, 1., AND RHEINBOLDT, W. C. Error estimates for adaptive finite
element computations. SIAM J. Numer. Anal. 15 (1978), 736-754.
[7] BJ0RCK, A. Iterative refinement of linear least squares solutions I. BIT 7
(1967), 257-278.
[8] BOCK, H. G. Randwertproblemmethoden zur Parameteridentijizierung in
Systemen nichtlinearer Differentialgleichungen. PhD thesis, Universität zu
Bonn, 1985.
[9] BORNEMANN, F. A. An Adaptive Multilevel Approach to Parabolic Equations
in two Dimensions. PhD thesis, Freie Universität Berlin, 1991.
[10] BRENT, R. P. Algorithms for Minimization Without Derivatives. Prentice
Hall, Englewood Cliffs, N.J., 1973.
[11] BULIRSCH, R. Bemerkungen zur Romberg-Integration. Numer. Math. 6
(1964),6-16.

[12] BUSINGER, P., AND GOLUB, G. H. Linear least squares solutions by Householder transformations. Numer. Math. 7 (1965), 269-276.
[13] CULLUM, J., AND WILLOUGHBY, R. Lanczos Algorithms for Large Symmet-
ric Eigenvalue Computations, Vol I, II. Birkhäuser, Boston, 1985.
[14] DE BOOR, C. An algorithm for numerical quadrature. In Mathematical
Software, J. Rice, Ed. Academic Press, London, 1971.
[15] DE BOOR, C. A Practical Guide to Splines, reprint ed. Springer-Verlag,
Berlin, Heidelberg, New York, 1994.
[16] DEUFLHARD, P. On algorithms for the summation of certain special
functions. Computing 17 (1976), 37-48.
[17] DEUFLHARD, P. A summation technique for minimal solutions of linear
homogeneous difference equations. Computing 18 (1977), 1-13.
[18] DEUFLHARD, P. A stepsize control for continuation methods and its special
application to multiple shooting techniques. Numer. Math. 33 (1979), 115-
146.
[19] DEUFLHARD, P. Order and stepsize control in extrapolation methods.
Numer. Math. 41 (1983), 399-422.
[20] DEUFLHARD, P. Newton Methods for Nonlinear Problems. Affine Invariance
and Adaptive Algorithms. Springer International, 2002.
[21] DEUFLHARD, P., AND BAUER, H. J. A note on Romberg quadrature.
Preprint 169, Universität Heidelberg, 1982.
[22] DEUFLHARD, P., AND BORNEMANN, F. Scientific Computing with Ordinary
Differential Equations. Springer, New York, 2002.
[23] DEUFLHARD, P., FIEDLER, B., AND KUNKEL, P. Efficient numerical
pathfollowing beyond critical points. SIAM J. Numer. Anal. 18 (1987),
949-987.
[24] DEUFLHARD, P., HUISINGA, W., FISCHER, A., AND SCHUTTE, C. Identifi-
cation of almost invariant aggregates in reversible nearly uncoupled Markov
chains. Lin. Alg. Appl. 315 (2000), 39-59.
[25] DEUFLHARD, P., LEINEN, P., AND YSERENTANT, H. Concept of an adap-
tive hierarchical finite element code. Impact of Computing in Science and
Engineering 1, 3 (1989), 3-35.
[26] DEUFLHARD, P., AND POTRA, F. A. A refined Gauss-Newton-Mysovskii
theorem. ZIB Report SC 91-4, ZIB, Berlin, 1991.
[27] DEUFLHARD, P., AND POTRA, F. A. Asymptotic mesh independence for
Newton-Galerkin methods via a refined Mysovskii theorem. SIAM J. Numer.
Anal. 29,5 (1992), 1395-1412.
[28] DEUFLHARD, P., AND SAUTTER, W. On rank-deficient pseudoinverses. Lin.
Alg. Appl. 29 (1980),91-111.
[29] ERICSSON, T., AND RUHE, A. The spectral transformation Lanczos method
for the numerical solution of large sparse generalized symmetric eigenvalue
problems. Math. Compo 35 (1980), 1251-1268.
[30] FARIN, G. Curves and Surfaces for Computer Aided Geometric Design: A
Practical Guide. Academic Press, New York, 1988.

[31] FLETCHER, R. Conjugate gradient methods. In Proc. Dundee Biennial Conference on Numerical Analysis. Springer Verlag, New York, 1975.
[32J FORSYTHE, G. E., AND MOLER, C. Computer Solution of Linear Algebra
Systems. Prentice Hall, Englewood Cliffs, N.J., 1967.
[33J FRANCIS, J. G. F. The QR-transformation. A unitary analogue to the LR-
transformation - Part 1 and 2. Compo J. 4 (1961/62),265-271 and 332-344.
[34J GATERMANN, K., AND HOHMANN, A. Symbolic exploitation of symmetry in
numerical pathfollowing. Impact Compo Sci. Eng. 3,4 (1991), 330-365.
[35J GAUSS, C. F. Theoria Motus Corporum Coelestium. Vol. 7. Perthes et
Besser, Hamburgi, 1809.
[36J GAUTSCHI, W. Computational aspects of three-term recurrence relations.
SIAM Rev. 9 (1967), 24-82.
[37J GENTLEMAN, W. M. Least squares computations by Givens transformations
without square roots. J. Inst. Math. Appl. 12 (1973), 189-197.
[38J GEORG, K. On tracing an implicitly defined curve by quasi-Newton steps and
calculating bifurcation by local perturbations. SIAM J. Sci. Stat. Comput.
2, 1 (1981), 35-50.
[39J GEORGE, A., AND LIU, J. W. Computer Solution of Large Sparse Positive
Definite Systems. Prentice Hall, Englewood Cliffs, N.J., 1981.
[40J GOERTZEL, G. An algorithm for the evaluation of finite trigonometric series.
Amer. Math. Monthly 65 (1958), 34-35.
[41J GOLUB, G. H., AND VAN LOAN, C. F. Matrix Computations, second ed. The
Johns Hopkins University Press, Baltimore, MD, 1989.
[42J GOLUB, G. H., AND WELSCH, J. H. Calculation of Gauss quadrature rules.
Math. Compo 23 (1969), 221-230.
[43] GRADSHTEYN, I. S., AND RYZHIK, I. M. Table of Integrals, Series, and Products, sixth ed. Academic Press, New York, San Francisco, London, 2000.
[44J GRIEWANK, A., AND CORLISS, G. F. Automatic Differentiation of Al-
gorithms: Theory, Implementation, and Application. SIAM Publications,
Philadelphia, PA, 1991.
[45J HACKBUSCH, W. Multi-Grid Methods and Applications. Springer Verlag,
Berlin, Heidelberg, New York, Tokyo, 1985.
[46J HAGEMAN, L. A., AND YOUNG, D. M. Applied Iterative Methods. Academic
Press, Orlando, San Diego, New York, 1981.
[47J HAIRER, E., N0RSETT, S. P., AND WANNER, G. Solving Ordinary Differ-
ential Equations I, Nonstiff Problems. Springer Verlag, Berlin, Heidelberg,
New York, Tokyo, 1987.
[48J HALL, C. A., AND MEYER, W. W. Optimal error bounds for cubic spline
interpolation. J. Appmx. Theory 16 (1976), 105-122.
[49J HAMMARLING, S. A note on modifications to the Givens plane rotations. J.
[nst. Math. Appl. 13 (1974), 215-218.
[50J HESTENES, M. R., AND STIEFEL, E. Methods of conjugate gradients for
solving linear systems. J. Res. Nat. Bur. Stand 49 (1952), 409-436.

[51J HIGHAM, N. J. How accurate is Gaussian elimination? In Numerical Analy-


sis, Pmc. 13th Biennial Conf., Dundee / UK 1989. Pitman Res. Notes Math.
Ser. 228, 1990, pp. 137-154.
[52J HOUSEHOLDER, A. S. The Theory of Matrices in Numerical Analysis.
Blaisdell, New York, 1964.
[53J IpSEN, I. C. F. A history of inverse iteration. In Helmut Wielandt, Mathe-
matische Werke, Mathematical Works, B. Huppert and H. Schneider, Eds.,
vol. II: Matrix Theory and Analysis. Walter de Gruyter, New York, 1996,
pp.464-72.
[54J KATO, T. Perturbation Theory for Linear Operators, reprint ed. Springer
Verlag, Berlin, Heidelberg, New York, Tokyo, 1995.
[55J KNOPP, K. Theorie und Anwendung der unendlichen Reihen. Springer
Verlag, Berlin, Heidelberg, New York, (5. Auflage) 1964.
[56J KUBLANOVSKAYA, V. N. On some algorithms for the solution of the complete
eigenvalue problem. USSR Compo Math. Phys. 3 (1961),637-657.
[57J LANCZOS, C. An iteration method for the solution of the eigenvalue prob-
lem of linear differential and integral operators. J. Res. Nat. Bur. Stand 45
(1950), 255-282.
[58J MANTEUFFEL, T. A. The Tchebychev iteration for nonsymmetric linear
systems. Numer. Math. 28 (1977), 307-327.
[59J MEIJERINK, J., AND VAN DER VORST, H. An iterative solution method for
linear systems of which the coefficient matrix is a symmetric M-matrix.
Math. Compo 31 (1977), 148-162.
[60] MEIXNER, J., AND SCHÄFKE, F. W. Mathieusche Funktionen und Sphäroidfunktionen. Springer Verlag, Berlin, Göttingen, Heidelberg, 1954.
[61J MEYER, C. D. Matrix Analysis and Applied Linear Algebra. SIAM
Publications, Philadelphia, PA, 2000.
[62J MILLER, J. C. P. Bessel Functions, Part II (Math. Tables X). Cambridge
University Press, Cambridge, UK, 1952.
[63J NASHED, M. Z. Generalized Inverses and Applications. Academic Press, New
York, 1976.
[64J NIKIFOROV, A. F., AND UVAROV, V. B. Special Functions of Mathematical
Physics. Birkhiiuser, Basel, Boston, 1988.
[65] PERRON, O. Über Matrizen. Math. Annalen 64 (1907), 248-263.
[66J POINCARE, H. Les Methodes Nouvelles de la Mecanique Celeste. Gauthier-
Villars, Paris, 1892.
[67J POPPE, C., PELLICIARI, C., AND BACHMANN, K. Computer analysis of
Feulgen hydrolysis kinetics. Histochemistry 60 (1979), 53-60.
[68J PRAGER, W., AND OETTLI, W. Compatibility of approximate solutions of
linear equations with given error bounds for coefficients and right hand sides.
Numer. Math. 6 (1964), 405-409.
[69J PRIGOGINE, I., AND LEFEVER, R. Symmetry breaking instabilities in
dissipative systems II. J. Chem. Phys. 48 (1968), 1695-1701.
[70J REINSCH, C. A note on trigonometric interpolation. Manuscript, 1967.

[71] RIGAL, J. L., AND GACHES, J. On the compatibility of a given solution with
the data of a linear system. J. Assoc. Comput. Mach. 14 (1967), 543-548.
[72] ROMBERG, W. Vereinfachte Numerische Integration. Det Kongelige Norske
Videnskabers Selskabs Forhandlinger Bind 28, 7 (1955).
[73] SAUER, R., AND SZABO, 1. Mathematische Hilfsmittel des Ingenieurs.
Springer Verlag, Berlin, Heidelberg, New York, 1968.
[74] SAUTTER, W. Fehlerfortpfianzung und Rundungsfehler bei der verallge-
meinerten Inversion von Matrizen. PhD thesis, TU Miinchen, Fakultiit fiir
Allgemeine Wissenschaften, 1971.
[75] SHANNON, C. E. The Mathematical Theory of Communication. The
University of Illinois Press, Urbana, Chicago, London, 1949.
[76] SKEEL, R. D. Scaling for numerical stability in Gaussian elimination. J.
ACM 26, 3 (1979), 494-526.
[77] SKEEL, R. D. Iterative refinement implies numerical stability for Gaussian
elimination. Math. Compo 35, 151 (1980), 817-832.
[78] SONNEVELD, P. A fast Lanczos-type solver for nonsymmetric linear systems.
SIAM J. Sci. Stat. Comput. 10 (1989), 36-52.
[79] STEWART, G. W. Introduction to Matrix Computations. Academic Press,
New York, San Francisco, London, 1973.
[80] STEWART, G. W. On the structure of nearly uncoupled Markov chains.
In Mathematical Computer Performance and Reliability, G. Iazeolla, P. J.
Courtois, and A. Hordijk, Eds. Elsevier, New York, 1984.
[81] STOER, J. Solution of large systems of linear equations by conjugate gra-
dient type methods. In Mathematical Programming, the State of the Art,
A. Bachem, M. Grotschel, and B. Korte, Eds. Springer Verlag, Berlin,
Heidelberg, New York, 1983.
[82] SZEGO, G. Orthogonal Polynomials, fourth ed. AMS, Providence, RI, 1975.
[83] TRAUB, J., AND WOZNIAKOWSKI, H. General Theory of Optimal Algorithms.
Academic Press, Orlando, San Diego, San Francisco, 1980.
[84] TREFETHEN, L. N., AND SCHREIBER, R. S. Average-case stability of
gaussian elimination. SIAM J. Matrix Anal. Appl. 11,3 (1990), 335-360.
[85] TUKEY, J. W., AND COOLEY, J. W. An algorithm for the machine
calculation of complex Fourier series. Math. Comp. 19 (1965), 297-301.
[86] VARGA, J. Matrix Iterative Analysis. Prentice Hall, Englewood Cliffs, N.J.,
1962.
[87] WILKINSON, J. H. Rounding Errors in Algebraic Processes. Her Majesty's
Stationary Office, London, 1963.
[88] WILKINSON, J. H. The Algebraic Eigenvalue Problem. Oxford University
Press, Oxford, UK, 1965.
[89] WILKINSON, J. H., AND REINSCH, C. Handbook for Automatic Computation,
Volume II, Linear Algebra. Springer Verlag, New York, Heidelberg, Berlin,
1971.
[90] WITTUM, G. Mehrgitterverfahren. Spektrum der Wissenschaft (April 1990),
78-90.

[91] WULKOW, M. Numerical treatment of countable systems of ordinary


differential equations. ZIB Report TR 90-8, ZIB, Berlin, 1990.
[92] Xu, J. Theory of Multilevel Methods. PhD thesis, Penn State University,
1989.
[93] YSERENTANT, H. On the multi-level splitting of finite element spaces.
Numer. Math. 49 (1986), 379-412.

Software
For most of the algorithms described in this book there exists rather so-
phisticated software, which is public domain. Of central importance is the
netlib, a library of mathematical software, data, documents, etc. Its address
is

http://www.netlib.org/

Linear algebra (LAPACK):

http://www.netlib.org/lapack

Especially linear eigenvalue problems (EISPACK):

http://www.netlib.org/eispack

Please study the hints given there carefully (e.g., README, etc.) to
make sure that you download all necessary material. Sometimes a bit of
additional browsing in the neighborhood is needed.

The commercial program package MATLAB also offers a variety of


methods associated with topics of this book.

In addition, the book presents a series of algorithms as informal algo-


rithms which can be easily programmed from this description-such as the
fast summation of spherical harmonics.
Numerous further programs (not only by the authors) can be downloaded
from the electronic library Elib by ZIB, either via the ftp-oriented address

http://elib.zib.de/pub/elib/codelib/

or via the web-oriented address

http://www.zib.de/SciSoft/CodeLib/

All of the programs available there are free as long as they are exclusively
used for research or teaching purposes.
Index

A-orthogonal, 250 bi-cg-method, 255


Abel's theorem, 126 bifurcation point, 100
Aigner, M., 306 Bjorck, A., 66, 78
Aitken's L~.2-method, 114 Bock, H. G., 96
Aitken-Neville algorithm, 184 Bornemann, F. A., 261
algorithm Brent, R. P., 85
invariance, 12 Brusselator, 112
reliability, 2 Bulirsch sequence, 297
speed, 2 Bulirsch, R., 293
almost singular, 45, 73 Businger, P., 72
Arnoldi method, 254
Arrhenius law, 79 cancellation, 27
asymptotic expansion, 292 cascadic principle, 267
automatic differentiation, 92 Casorati determinant, 157, 177
cg-method, 252
B-spline preconditioned, 257
basis property, 224 termination criterion, 252, 260
recurrence relation, 221 Chebyshev
Bezier abscissas, 195
curve, 208 approximation problem, 60
points, 208 iteration, 247
Babuska, 1., 315 nodes, 184, 196
backward substitution, 4 polynomials, 193, 248
Bernoulli numbers, 288 min-max property, 193, 246, 253
Bessel Cholesky decomposition
functions, 159, 177 rational, 15
maze, 159 Christoffel-Darboux formula, 285

complexity backward, 36
of problems, 2 forward, 35
condition equidistribution, 316
intersection point, 24 linearised theory, 26
condition number relative, 25
absolute, 26 extrapolation
componentwise, 32 algorithm, 295
of addition, 27 local, 316
of multiplication, 32 methods, 291, 295
of scalar product, 33 sub diagonal error criterion, 304
relative, 26 tableau, 292
Skeel's, 33
conjugate gradients, 252 Farin, G., 204
continuation method, 92 FFT,203
classical, 102 fixed-point
order, 104 Banach theorem, 84
tangent, 103, 108 equation, 82
convergence iteration, 82, 239
linear, 85 method
model, 309 symmetrizable, 242
monitor, 307 Fletcher, R., 255
quadratic, 85 floating point number, 22
super linear, 85 forward substitution, 4
Cooley, W., 202 Fourier
cost series, 152, 200
QR-factorization, 69, 72 transform, 197
Cholesky decomposition, 16 fast, 201
Gaussian elimination, 7 Francis, J. G. F., 127
QR method Frobenius, F. G., 140
for singular values, 137
QR-algorithm, 132 Gaches, J., 50
Cramer's rule, 1 Gauss
Cullum, J., 266 Jordan decomposition, 3
cylinder functions, 159 Newton method, 109
Seidel method, 240
de Boor algorithm, 235 Gauss, C. F., 4, 57
de Boor, C., 204, 309 Gautschi, W., 164
de Casteljau algorithm, 213 generalized inverse, 76
detailed balance, 143 Gentleman, W. M., 70
Deuflhard, P., 73, 90, 261, 308 Givens
fast, 70
eigenvalue rational, 70
derivative, 120 rotations, 68
Perron, 140 Givens, W., 68
elementary operation, 23 Goertzel algorithm, 171
Ericsson, T., 266 Goertzel, G., 171
error Golub, G. H., 47, 72, 119
absolute, 25 graph, 140
analysis irreducible, 140
strongly connected, 140 Lebesgue constant, 183


greedy algorithm, 306 Leibniz formula, 191
Green's function Leinen, P., 261
discrete, 157, 177 Levenberg-Marquardt method, 98,
Griewank, A., 92 117, 149

Hackbusch, W., 244 Manteuffel, T. A., 249


Hagemann, L. A., 244 Markov
Hall, C. A., 230 chain, 137
Hammarling, S., 70 nearly uncoupled, 147, 150
Hermite interpolation reversible, 144
cubic, 186 uncoupled, 145
Hestenes, M. R., 252 process, 137
Higham, N. J., 46 Markov, A. A., 137
homotopy, 111 Marsden identity, 223
method, 111 matrix
Horner algorithm, 169 bidiagonal, 134
generalized, 170 determinant, 1, 12, 30
Householder Hessenberg, 132, 254
reflections, 70 incidence, 141
Householder, A. S., 70 irreducible, 140
norms, 53
incidence matrix, 141 numerical range, 17
information theory permutation, 9
Shannon, 308 primitive, 142
initial value problem, 270 Spd-, 14
interpolation stochastic, 137
Hermite, 185 triangular, 3
nodes, 179 Vandermonde, 181
iterative refinement maximum likelihood method, 59
for linear equations, 13 measurement tolerance, 59
for linear least-squares problems, Meixner, J., 162
66, 78
Meyer, W. W., 230
Jacobi method, 240 Miller algorithm, 167
Miller, J. C. P., 166, 168
Kato, T., 122 monotonicity test
Krylov spaces, 250 natural, 90, 106
Kublanovskaja, V. N., 127 standard, 90
multigrid methods, 244, 313, 320
Lagrange
polynomials, 181 Nashed, M. Z., 76
representation, 182 needle impulse, 298, 309, 319
Lagrange, J. L., 4 Neumann
Lanczos method functions, 159, 177
spectral, 266 series, 29
Lanczos, C., 262 Neville scheme, 185
Landau symbol, 26 Newton
LAPACK, 13
simplified correction, 91 Prager, W., 50, 55


Newton method preconditioning
affine invariance, 88, 90 diagonal, 260
complex, 115 incomplete Cholesky, 260
for square root, 86 pseudo-inverse, 75, 93, 109
nodes QR-factorization, 76
Gauss-Christoffel, 282 singular value decomposition, 133
of a quadrature formula, 273
nonlinear least-squares problem QR decomposition
almost compatible, 96 column permutation, 72
compatible, 93 QR-algorithm
norm shift strategy, 131
L¹-, 271
quadratic equation, 28, 81
energy, 249 quadrature
Frobenius, 53 condition of problem, 271
matrix, 53 error, 283
spectral, 53
estimator, 301
vector, 53
formula, 273
normal distribution, 59
Gauss-Christoffel, 282
numerical rank, 73
Newton-Cotes, 275
numerically singular, 45
Gauss-Chebyshev, 285
Gauss-Christoffel, 285
Oettli, W., 50, 55
Gauss-Hermite, 285
Ohm's law, 58
Gauss-Laguerre, 285
Gauss-Legendre, 285
pcg-method, 257, 267
numerical, 270
Penrose axioms, 76
parameter-dependent, 312
Perron cluster, 147
analysis, 143
Perron, O., 138
pivot decision, 73
element, 7 determination, 96
row, 5 Rayleigh quotient, 262
pivoting generalized, 265
column, 8 refinement
conditional, 238 global, 317
partial, 8 local, 317
total, 9 Reinsch, C., 41, 132, 171
polynomials residual, 49
Bernstein, 205 Rheinboldt, W. C., 315
Chebyshev, 154, 176, 193, 248, 285 Richardson method, 240
Hermite, 186, 285 relaxed, 243
Laguerre, 285 Rigal, J. L., 50
Legendre, 164, 176, 285 Ritz-Galerkin approximation, 249
orthogonal, 153, 279, 285 Romberg quadrature
trigonometric, 197 adaptive algorithm, 306
power method Romberg sequence, 296
direct, 124 Ruhe, A., 266
inverse, 125 Rutishauser, H., 127
Sautter, W., 47, 73 minimal solution, 162


scaling, 12 symmetric, 156
column, 12 trigonometric, 40, 162, 170
row, 12 Traub, J., 2
Schäfke, W., 162
Schreiber, R. S., 49 Tukey, J. W., 202
Schur normal form, 132 turning point, 100
Schur, I., 20
Shannon, C. E., 308 van Loan, C., 47, 119
shift strategy, 130 van Veldhuizen, R., 321
Skeel, R. D., 14, 33, 51, 56 von Mises, R., 124
Sonneveld, P., 255
sparse solvers, 238, 266 weight function, 279
sparsing, 92 weights
special functions, 151 Gauss-Christoffel, 282, 285
spectral equivalence, 259 Newton-Cotes, 275
spherical harmonics of a quadrature formula, 273
algorithm, 163, 166 Wielandt, H., 125, 143
fast summation, 171 Wilkinson
splines pathological example, 48, 56
complete, 232 Wilkinson, J. H., 36, 46, 47, 123, 129,
minimization property, 229 131, 132
natural, 232 Willoughby, R., 266
stability indicator, 37, 42 Wittum, G., 244
statistical model work per unit step, 306
inadequate, 96 Woźniakowski, H., 2
steepest descent method, 255 Wronski determinant, 158
step size, 103
basic, 295, 299 Xu, J., 261
internal, 295
step-size Young, D. M., 244
control, 300 Yserentant, H., 261
Stewart, G. W., 147
Stiefel, E., 252
stochastic process, 137
Stoer, J., 255
Sturm sequence, 178
subcondition, 73
substitution
backward, 4
forward, 4

Taylor interpolation, 186


three-term recurrence relation
adjoint, 170
condition, 161
dominant solution, 162
homogeneous, 156
inhomogeneous, 156, 158
Texts in Applied Mathematics
(continued from page ii)

31. Brémaud: Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues.
32. Durran: Numerical Methods for Wave Equations in Geophysical Fluid
Dynamics.
33. Thomas: Numerical Partial Differential Equations: Conservation Laws and
Elliptic Equations.
34. Chicone: Ordinary Differential Equations with Applications.
35. Kevorkian: Partial Differential Equations: Analytical Solution Techniques,
2nd ed.
36. Dullerud/Paganini: A Course in Robust Control Theory: A Convex Approach.
37. Quarteroni/Sacco/Saleri: Numerical Mathematics.
38. Gallier: Geometric Methods and Applications: For Computer Science and
Engineering.
39. Atkinson/Han: Theoretical Numerical Analysis: A Functional Analysis
Framework.
40. Brauer/Castillo-Chávez: Mathematical Models in Population Biology and
Epidemiology.
41. Davies: Integral Transforms and Their Applications, 3rd ed.
42. Deuflhard/Bornemann: Scientific Computing with Ordinary Differential
Equations.
43. Deuflhard/Hohmann: Numerical Analysis in Modern Scientific Computing: An
Introduction, 2nd ed.
