
Numerical Mathematics I

Peter Philip∗

Lecture Notes
Originally Created for the Class of Winter Semester 2008/2009 at LMU Munich,
Revised and Extended for Several Subsequent Classes†

September 22, 2021

Contents
1 Introduction and Motivation 5

2 Tools: Landau Symbols, Norms, Condition 9


2.1 Landau Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Operator Norms and Natural Matrix Norms . . . . . . . . . . . . . . . . 16
2.3 Condition of a Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3 Interpolation 36
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.1 Polynomials and Polynomial Functions . . . . . . . . . . . . . . . 37
3.2.2 Existence and Uniqueness . . . . . . . . . . . . . . . . . . . . . . 40
3.2.3 Hermite Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . 42

∗ E-Mail: philip@math.lmu.de

† Resources used in the preparation of this text include [DH08, GK99, HB09, Pla10].


3.2.4 Divided Differences and Newton’s Interpolation Formula . . . . . 50


3.2.5 Convergence and Error Estimates . . . . . . . . . . . . . . . . . . 58
3.2.6 Chebyshev Polynomials . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3 Spline Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.1 Introduction, Definition of Splines . . . . . . . . . . . . . . . . . . 64
3.3.2 Linear Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3.3 Cubic Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4 Numerical Integration 78
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2 Quadrature Rules Based on Interpolating Polynomials . . . . . . . . . . . 82
4.3 Newton-Cotes Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.1 Definition, Weights, Degree of Accuracy . . . . . . . . . . . . . . 84
4.3.2 Rectangle Rules (n = 0) . . . . . . . . . . . . . . . . . . . . . . . 88
4.3.3 Trapezoidal Rule (n = 1) . . . . . . . . . . . . . . . . . . . . . . . 89
4.3.4 Simpson’s Rule (n = 2) . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3.5 Higher Order Newton-Cotes Formulas . . . . . . . . . . . . . . . . 91
4.4 Convergence of Quadrature Rules . . . . . . . . . . . . . . . . . . . . . . 92
4.5 Composite Newton-Cotes Quadrature Rules . . . . . . . . . . . . . . . . 94
4.5.1 Introduction, Convergence . . . . . . . . . . . . . . . . . . . . . . 95
4.5.2 Composite Rectangle Rules (n = 0) . . . . . . . . . . . . . . . . . 97
4.5.3 Composite Trapezoidal Rules (n = 1) . . . . . . . . . . . . . . . . 98
4.5.4 Composite Simpson’s Rules (n = 2) . . . . . . . . . . . . . . . . . 99
4.6 Gaussian Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.6.2 Orthogonal Polynomials . . . . . . . . . . . . . . . . . . . . . . . 100
4.6.3 Gaussian Quadrature Rules . . . . . . . . . . . . . . . . . . . . . 103

5 Numerical Solution of Linear Systems 108


5.1 Background and Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

5.2 Pivot Strategies for Gaussian Elimination . . . . . . . . . . . . . . . . . . 110


5.3 Cholesky Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4 QR Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.4.1 Definition and Motivation . . . . . . . . . . . . . . . . . . . . . . 116
5.4.2 QR Decomposition via Gram-Schmidt Orthogonalization . . . . . 118
5.4.3 QR Decomposition via Householder Reflections . . . . . . . . . . 123

6 Iterative Methods, Solution of Nonlinear Equations 129


6.1 Motivation: Fixed Points and Zeros . . . . . . . . . . . . . . . . . . . . . 129
6.2 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.2.1 Newton’s Method in One Dimension . . . . . . . . . . . . . . . . 131
6.2.2 Newton’s Method in Several Dimensions . . . . . . . . . . . . . . 134

7 Eigenvalues 139
7.1 Introductory Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.2 Estimates, Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.3 Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.4 QR Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

8 Minimization 158
8.1 Motivation, Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.2 Local Versus Global and the Role of Convexity . . . . . . . . . . . . . . . 159
8.3 Gradient-Based Descent Methods . . . . . . . . . . . . . . . . . . . . . . 165
8.4 Conjugate Gradient Method . . . . . . . . . . . . . . . . . . . . . . . . . 174

A Representations of Real Numbers, Rounding Errors 179


A.1 b-Adic Expansions of Real Numbers . . . . . . . . . . . . . . . . . . . . . 179
A.2 Floating-Point Numbers Arithmetic, Rounding . . . . . . . . . . . . . . . 181
A.3 Rounding Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

B Operator Norms and Natural Matrix Norms 192



C Integration 194
C.1 Km -Valued Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
C.2 Continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

D Divided Differences 197

E DFT and Trigonometric Interpolation 201


E.1 Discrete Fourier Transform (DFT) . . . . . . . . . . . . . . . . . . . . . . 201
E.2 Approximation of Periodic Functions . . . . . . . . . . . . . . . . . . . . 204
E.3 Trigonometric Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . 208
E.4 Fast Fourier Transform (FFT) . . . . . . . . . . . . . . . . . . . . . . . . 213

F The Weierstrass Approximation Theorem 220

G Baire Category and the Uniform Boundedness Principle 223

H Matrix Decomposition 226


H.1 Cholesky Decomposition of Positive Semidefinite Matrices . . . . . . . . 226
H.2 Multiplication with Householder Matrix over R . . . . . . . . . . . . . . 230

I Newton’s Method 231

J Eigenvalues 233
J.1 Continuous Dependence on Matrix Coefficients . . . . . . . . . . . . . . . 233
J.2 Transformation to Hessenberg Form . . . . . . . . . . . . . . . . . . . . . 235
J.3 QR Method for Matrices in Hessenberg Form . . . . . . . . . . . . . . . . 239

K Minimization 243
K.1 Golden Section Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
K.2 Nelder-Mead Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

1 Introduction and Motivation


The central motivation of Numerical Mathematics is to provide constructive and effective
methods (so-called algorithms, see Def. 1.1 below) that reliably compute solutions (or
sufficiently accurate approximations of solutions) to classes of mathematical problems.
Moreover, such methods should also be efficient, i.e. the algorithm should be as fast as
possible while using as little memory as possible. Frequently, both goals cannot be
achieved simultaneously: For example, one might decide to recompute intermediate
results (which costs more time) to avoid storing them (which would require more memory)
or vice versa. One typically also has a trade-off between accuracy and requirements for
memory and execution time, where higher accuracy means use of more memory and
longer execution times.
Thus, one of the main tasks of Numerical Mathematics consists of proving that a given
method is constructive, effective, and reliable. That a method is constructive, effective,
and reliable means that, given certain hypotheses, it is guaranteed to converge to the
solution. This means that it either finds the solution in a finite number of steps, or, more
typically, given a desired error bound, within a finite number of steps, it approximates
the true solution such that the error is less than the given bound. Proving error esti-
mates is another main task of Numerical Mathematics and so is proving bounds on an
algorithm's complexity, i.e. bounds on its use of memory (i.e. data) and run time (i.e.
number of steps). Moreover, in addition to being convergent, for a method to be useful,
it is of crucial importance that it is also stable in the sense that a small perturbation of
the input data does not destroy the convergence and results in, at most, a small increase
of the error. This is of the essence as, for most applied problems, the input data will
not be exact, and most algorithms are subject to round-off errors.
Instead of a method, we will usually speak of an algorithm, by which we mean a “useful”
method. To give a mathematically precise definition of the notion algorithm is beyond
the scope of this lecture (it would require an unjustifiably long detour into the field of
logic), but the following definition will be sufficient for our purposes.

Definition 1.1. An algorithm is a finite sequence of instructions for the solution of a


class of problems. Each instruction must be representable by a finite number of symbols.
Moreover, an algorithm must be guaranteed to terminate after a finite number of steps.

Remark 1.2. Even though we here require an algorithm to terminate after a finite
number of steps, in the literature, one sometimes omits this part from the definition.
The question of whether a given method can be guaranteed to terminate after a finite number of
steps is often tremendously difficult (sometimes even impossible) to answer.

Example 1.3. Let a, a0 ∈ R+ and consider the sequence (xn )n∈N0 defined recursively
by

x0 := a0 ,   xn+1 := (1/2) (xn + a/xn )   for each n ∈ N0 .   (1.1)
It is an exercise to show that, for each a, a0 ∈ R+ , this sequence is well-defined (i.e. xn >
0 for each n ∈ N0 ) and converges to √a (this is Newton's method for the computation
of the zero of the function f : R+ −→ R, f (x) := x² − a, cf. Ex. 6.4(b)). The xn can be
computed using the following finite sequence of instructions:

1 : x = a0           % store the number a0 in the variable x
2 : x = (x + a/x)/2  % compute (x + a/x)/2 and replace the
                     % contents of the variable x with the computed value     (1.2)
3 : goto 2           % continue with instruction 2

Even though the contents of the variable x will converge to √a, (1.2) does not constitute
an algorithm in the sense of Def. 1.1 since it does not terminate. To guarantee termi-
nation and to make the method into an algorithm, one might introduce the following
modification:
1 : ǫ = 10^(−10) ∗ a   % store the number 10^(−10) · a in the variable ǫ
2 : x = a0             % store the number a0 in the variable x
3 : y = x              % copy the contents of the variable x to
                       % the variable y to save the value for later use
4 : x = (x + a/x)/2    % compute (x + a/x)/2 and replace the
                       % contents of the variable x with the computed value     (1.3)
5 : if |x − y| > ǫ
      then goto 3      % if |x − y| > ǫ, then continue with instruction 3
      else quit        % if |x − y| ≤ ǫ, then terminate the method
Now the convergence of the sequence guarantees that the method terminates within
finitely many steps.
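To illustrate, here is a minimal Python sketch of the terminating method (1.3); the function name heron_sqrt and the relative tolerance 10^(−10) · a are choices made here, mirroring the pseudocode above.

def heron_sqrt(a, a0):
    """Approximate the square root of a via x_{n+1} = (x_n + a/x_n)/2, cf. (1.1) and (1.3)."""
    eps = 1e-10 * a            # instruction 1: tolerance scaled with a
    x = a0                     # instruction 2: initial value
    while True:
        y = x                  # instruction 3: save the old value
        x = (x + a / x) / 2    # instruction 4: one iteration step
        if abs(x - y) <= eps:  # instruction 5: terminate once the update is small
            return x

print(heron_sqrt(2.0, 1.0))    # approximately 1.4142135623730951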

Another problem with regard to algorithms, one that we already touched on in Ex. 1.3, is
the implicit requirement of Def. 1.1 for an algorithm to be well-defined. That means, for
every initial condition, given a number n ∈ N, the method has either terminated after
m ≤ n steps, or it provides a (feasible!) instruction to carry out step number n + 1.
Such methods are called complete. Methods that can run into situations, where they
have not reached their intended termination point, but can not carry out any further
instruction, are called incomplete. Algorithms must be complete! We illustrate the issue
in the next example:

Example 1.4. Let a ∈ R \ {2} and N ∈ N. Define the following finite sequence of
instructions:
1 : n = 1; x = a
2 : x = 1/(2 − x); n = n + 1
3 : if n ≤ N                   (1.4)
      then goto 2
      else quit

Consider what occurs for N = 10 and a = 5/4. The successive values contained in
the variable x are 5/4, 4/3, 3/2, 2. At this stage n = 4 ≤ N , i.e. instruction 3 tells the
method to continue with instruction 2. However, the denominator has become 0, and
the instruction has become meaningless. The following modification makes the method
complete and, thereby, an algorithm:
1 : n = 1; x = a
2 : if x ≠ 2
      then x = 1/(2 − x); n = n + 1
      else x = −5; n = n + 1   (1.5)
3 : if n ≤ N
      then goto 2
      else quit
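A direct Python transcription of the complete method (1.5) might look as follows (a sketch; exact rational arithmetic via fractions.Fraction is used here so that the intermediate value x = 2 from Ex. 1.4 is actually reached, which floating-point arithmetic would generally miss).

from fractions import Fraction

def iterate_complete(a, N):
    """Apply x -> 1/(2 - x) at most N times, cf. (1.5); the else-branch keeps the method complete."""
    n, x = 1, a
    while n <= N:
        if x != 2:
            x = 1 / (2 - x)
        else:
            x = Fraction(-5)   # fallback value from (1.5), avoids division by zero
        n += 1
    return x

# the problematic input a = 5/4 from (1.4) no longer causes a breakdown:
print(iterate_complete(Fraction(5, 4), 10))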

We can only expect to find stable algorithms if the underlying problem is sufficiently
benign. This leads to the following definition:
Definition 1.5. We say that a mathematical problem is well-posed if, and only if, its
solutions enjoy the three benign properties of existence, uniqueness, and continuity with
respect to the input data. More precisely, given admissible input data, the problem
must have a unique solution (output), thereby providing a map between the set of
admissible input data and (a superset of) the set of possible solutions. This map must
be continuous with respect to suitable norms or metrics on the respective sets (small
changes of the input data must only cause small changes in the solution). A problem
which is not well-posed is called ill-posed.

We can thus add to the important tasks of Numerical Mathematics mentioned earlier
the additional important tasks of investigating a problem’s well-posedness. Then, once
well-posedness is established, the task is to provide a stable algorithm for its solution.

Example 1.6. (a) The problem “find a minimum of a given polynomial function p :
R −→ R” is inherently ill-posed: Depending on p, the problem has no solution
(e.g. p(x) = x), a unique solution (e.g. p(x) = x²), finitely many solutions (e.g.
p(x) = x²(x − 1)²(x + 2)²) or infinitely many solutions (e.g. p(x) = 1).

(b) Frequently, one can transform an ill-posed problem into a well-posed problem, by
choosing an appropriate setting: Consider the problem “find a zero of f (x) =
ax² + c”. If, for example, one admits a, c ∈ R and looks for solutions in R, then the
problem is ill-posed as one has no solutions for ac > 0 and no solutions for a = 0,
c ≠ 0. Even for ac < 0, the problem is not well-posed as the solution is not always
unique. However, in this case, one can make the problem well-posed by considering
solutions in R2. The correspondence between the input and the solution (sometimes
referred to as the solution operator) is then given by the continuous map

S : {(a, c) ∈ R2 : ac < 0} −→ R2 ,   S(a, c) := ( √(−c/a) , −√(−c/a) ).   (1.6a)

The problem is also well-posed when just requiring a ≠ 0, but admitting complex
solutions. The continuous solution operator is then given by

S : {(a, c) ∈ R2 : a ≠ 0} −→ C2 ,
S(a, c) := ( √(|c|/|a|) , −√(|c|/|a|) )     for ac ≤ 0,
S(a, c) := ( i √(|c|/|a|) , −i √(|c|/|a|) )   for ac > 0.   (1.6b)

(c) The problem “determine if x ∈ R is positive” might seem simple at first glance,
however it is ill-posed, as it is equivalent to computing the values of the function

S : R −→ {0, 1},   S(x) := 1 for x > 0,   S(x) := 0 for x ≤ 0,   (1.7)

which is discontinuous at 0.

As stated before, the analysis and control of errors is of central interest. Errors occur
due to several causes:

(1) Modeling Errors: Even in the best of cases, a mathematical model can only approximate
the physical situation. Often, models have to be further simplified in order to
compute solutions and to make them accessible to mathematical analysis.

(2) Data Errors: Typically, there are errors in the input data. Input data often result
from measurements of physical experiments or from calculations that are potentially
subject to every type of error in the present list.

(3) Blunders: For example, logical errors and implementation errors.

One should always be aware that errors of the types just listed can (and often will) be present.
However, in the context of Numerical Mathematics, one focuses mostly on the following
error types:

(4) Truncation Errors: Such errors occur when replacing an infinite process (e.g. an
infinite series) by a finite process (e.g. a finite summation).

(5) Round-Off Errors: Errors occurring when discarding digits needed for the exact
representation of a (e.g. real or rational) number.

The functioning of our society increasingly relies on the use of numerical
algorithms. In consequence, avoiding and controlling numerical errors is vital. Several
examples of major disasters caused by numerical errors can be found on the following
web page of D.N. Arnold at the University of Minnesota:
http://www.ima.umn.edu/~arnold/disasters/
For a much more comprehensive list of numerical and related errors that had significant
consequences, see the web page of T. Huckle at TU Munich:
http://www5.in.tum.de/~huckle/bugse.html

2 Tools: Landau Symbols, Norms, Condition


Notation 2.1. We will write K in situations where we allow K to be R or C.

2.1 Landau Symbols


When calculating errors (and also when calculating the complexity of algorithms), one is
frequently not so much interested in the exact value of the error (or the computing time
and size of an algorithm), but only in the order of magnitude and in the asymptotics.
The Landau symbols O (big O) and o (small o) provide a notation for exactly this purpose.
Here is the precise definition:

Definition 2.2. Let (X, d) be a metric space and let (Y, k · k), (Z, k · k) be normed
vector spaces over K. Let D ⊆ X and consider functions f : D −→ Y , g : D −→ Z,
where we assume g(x) ≠ 0 for each x ∈ D. Moreover, let x0 ∈ X be a cluster point of
D (note that x0 does not have to be in D, and f and g do not have to be defined in x0 ).

(a) f is called of order big O of g or just big O of g (denoted by f (x) = O(g(x))) for
x → x0 if, and only if,

lim sup_{x→x0} kf (x)k / kg(x)k < ∞   (2.1)

(i.e. there exists M ∈ R+0 such that, for each sequence (xn )n∈N in D with the
property limn→∞ xn = x0 , the sequence ( kf (xn )k / kg(xn )k )n∈N in R+0 is bounded from above
by M ).

(b) f is called of order small o of g or just small o of g (denoted by f (x) = o(g(x)))
for x → x0 if, and only if,

lim_{x→x0} kf (x)k / kg(x)k = 0   (2.2)

(i.e., for each sequence (xn )n∈N in D with the property limn→∞ xn = x0 , we have
limn→∞ kf (xn )k / kg(xn )k = 0).

As mentioned before, O and o are known as Landau symbols.

Remark 2.3. In applications, the space X in Def. 2.2 is often the extended real line
R̄ := R ∪ {−∞, ∞}, where convergence is defined in the usual way. This usual notion
of convergence is, indeed, given by a suitable metric, where, as a metric space, R̄ is then
homeomorphic to [−1, 1], where

φ : R̄ −→ [−1, 1],   φ(x) := −1 for x = −∞,   φ(x) := x/(|x| + 1) for x ∈ R,   φ(x) := 1 for x = ∞,

provides a homeomorphism (cf. [Phi17, Sec. A], in particular, [Phi17, Rem. A.5]).

Proposition 2.4. Consider the setting of Def. 2.2. In addition, define the functions

f0 , g0 : D −→ R, f0 (x) := kf (x)k, g0 (x) := kg(x)k.

(a) The following statements are equivalent:

(i) f (x) = O(g(x)) for x → x0 .


(ii) f0 (x) = O(g0 (x)) for x → x0 .
2 TOOLS: LANDAU SYMBOLS, NORMS, CONDITION 11

(iii) There exist C, δ > 0 such that

kf (x)k / kg(x)k ≤ C   for each x ∈ D \ {x0 } with d(x, x0 ) < δ.   (2.3)

(b) The following statements are equivalent:

(i) f (x) = o(g(x)) for x → x0 .


(ii) f0 (x) = o(g0 (x)) for x → x0 .
(iii) For every C > 0, there exists δ > 0 such that (2.3) holds.

Proof. Note that

∀ x ∈ D :   kf (x)k / kg(x)k = f0 (x) / g0 (x).   (2.4)

(a): The equivalence between (i) and (ii) is immediate from (2.4). Now suppose (i)
holds, and let M := lim sup_{x→x0} kf (x)k / kg(x)k, C := 1 + M . If there were no δ > 0 such that
(2.3) holds, then, for each δn := 1/n, n ∈ N, there would be some xn ∈ D \ {x0 } with
d(xn , x0 ) < δn and yn := kf (xn )k / kg(xn )k > C = 1 + M , implying lim sup_{n→∞} yn ≥ 1 + M > M ,
in contradiction to the definition of M . Thus, (iii) must hold. Conversely, suppose
(iii) holds, i.e. there exist C, δ > 0 such that (2.3) is valid. Then, for each sequence
(xn )n∈N in D \ {x0 } such that limn→∞ xn = x0 , the sequence (yn )n∈N with yn := kf (xn )k / kg(xn )k
cannot have a cluster point larger than C (only finitely many of the xn are not in
Bδ (x0 ) := {x ∈ D : d(x, x0 ) < δ}, which is the neighborhood of x0 determined by δ).
Thus lim sup_{n→∞} yn ≤ C, thereby implying (2.1) and (i).

(b): Again, the equivalence between (i) and (ii) is immediate from (2.4). The equivalence
between (i) and (iii) is known from Analysis 2, cf. [Phi16b, Cor. 2.9]. 

Due to the respective equivalences in Prop. 2.4, in the literature, the Landau symbols
are often only defined for real-valued functions.
Proposition 2.5. Consider the setting of Def. 2.2. Suppose k · kY and k · kZ are
equivalent norms on Y and Z, respectively. Then f (x) = O(g(x)) (resp. f (x) = o(g(x)))
for x → x0 with respect to the original norms if, and only if, f (x) = O(g(x)) (resp.
f (x) = o(g(x))) for x → x0 with respect to the new norms k · kY and k · kZ .

Proof. Let CY , CZ ∈ R+ be such that

∀ y ∈ Y :  kykY ≤ CY kyk,      ∀ z ∈ Z :  CZ kzk ≤ kzkZ .
2 TOOLS: LANDAU SYMBOLS, NORMS, CONDITION 12

Suppose f (x) = O(g(x)) for x → x0 with respect to the original norms. Then

lim sup_{x→x0} kf (x)kY / kg(x)kZ ≤ (CY /CZ ) lim sup_{x→x0} kf (x)k / kg(x)k < ∞,

showing f (x) = O(g(x)) for x → x0 with respect to the new norms. Suppose f (x) =
o(g(x)) for x → x0 with respect to the original norms. Then

lim_{x→x0} kf (x)kY / kg(x)kZ = (CY /CZ ) lim_{x→x0} kf (x)k / kg(x)k = 0,

showing f (x) = o(g(x)) for x → x0 with respect to the new norms. The converse
implications now also follow, since one can switch the roles of the old norms and the
new norms. 
Example 2.6. (a) Consider a polynomial function on R, i.e. (a0 , . . . , an ) ∈ Rn+1 , n ∈
N0 , and

P : R −→ R,   P (x) := Σ_{i=0}^{n} ai x^i ,   an ≠ 0.   (2.5)

Then, for p ∈ R,

P (x) = O(x^p ) for x → ∞   ⇔   p ≥ n,   (2.6a)
P (x) = o(x^p ) for x → ∞   ⇔   p > n :   (2.6b)

For each x ≠ 0:

P (x) / x^p = Σ_{i=0}^{n} ai x^(i−p) .   (2.7)

Since

lim_{x→∞} x^(i−p) = ∞ for i − p > 0,   = 1 for i = p,   = 0 for i − p < 0,   (2.8)
(2.7) implies (2.6). Thus, in particular, for x → ∞, each constant function is big O
of 1: a0 = O(1).

(b) Recall the notion of differentiability: If G is an open subset of Rn , n ∈ N, ξ ∈ G,


then f : G −→ Rm , m ∈ N, is called differentiable in ξ if, and only if, there exists a
linear map L : Rn −→ Rm and another (not necessarily linear) map r : Rn −→ Rm
such that

f (ξ + h) − f (ξ) = L(h) + r(h) (2.9a)


2 TOOLS: LANDAU SYMBOLS, NORMS, CONDITION 13

for each h ∈ Rn with sufficiently small khk2 , and

lim_{h→0} r(h) / khk2 = 0.   (2.9b)

Now, using the Landau symbol o, (2.9b) can be equivalently expressed as


r(h) = o(h) for h → 0. (2.9c)

In the literature, one also finds statements of the form f (x) = h(x) + o(g(x)), which are
not covered by Def. 2.2 since, according to Def. 2.2, o(g(x)) does not denote an element
of a set, i.e. adding o(g(x)) to (or otherwise combining it with) an element of a set does
not immediately make any sense. To give a meaning to such expressions is the purpose
of the following definition:
Definition 2.7. As in Def. 2.2, let (X, d) be a metric space, let (Y, k · k), (Z, k · k)
be normed vector spaces over K, let D ⊆ X and consider functions f : D −→ Y ,
g : D −→ Z, where g(x) ≠ 0 for each x ∈ D. Moreover, let x0 ∈ X be a cluster point of
D. Now let (V, k · k) be another normed vector space and let S ⊆ V be a subset¹ and,
for each x ∈ D, let Fx : S −→ Y be invertible on f (D) ⊆ Y . Define

f (x) = Fx (O(g(x))) for x → x0   :⇔   Fx−1 (f (x)) = O(g(x)) for x → x0 ,
f (x) = Fx (o(g(x))) for x → x0   :⇔   Fx−1 (f (x)) = o(g(x)) for x → x0 .
Example 2.8. (a) In the situation of Ex. 2.6(b), we can now state that f is differen-
tiable in ξ if, and only if, there exists a linear map L : Rn −→ Rm such that
f (ξ + h) − f (ξ) = L(h) + o(h) for h → 0 : (2.10)
Indeed, for each h ∈ Rn , the map
Fh : Rm −→ Rm , Fh (s) := L(h) + s,
is invertible and, by Def. 2.7,
f (ξ + h) − f (ξ) = L(h) + o(h) = Fh (o(h)) for h → 0
is equivalent to

r(h) := f (ξ + h) − f (ξ) − L(h) = Fh−1 ( f (ξ + h) − f (ξ) ) = o(h),
which is (2.9c).
¹ Even though, when using this kind of notation, one typically thinks of Fx as operating on g(x), as
O(x) and o(x) are merely notations and not actual mathematical objects, it is not even necessary for
S to contain g(D).

(b) Using the notation of Def. 2.7, we claim

sin x = x ln(O(1)) for x → 0 :   (2.11)

Indeed, for each x ∈ R \ {0}, the map

Fx : R+ −→ R,   Fx (s) := x ln s,

is invertible with Fx−1 : R −→ R+ , Fx−1 (y) := exp(y/x) and, by Def. 2.7, (2.11) is
equivalent to

Fx−1 (sin x) = e^(sin x / x) = O(1),

which holds, due to lim_{x→0} exp(sin x / x) = e¹ = e.

(c) Let I ⊆ R be an open interval and a, x ∈ I, x ≠ a. If m ∈ N0 and f ∈ C^(m+1) (I),
then

f (x) = Tm (x, a) + Rm (x, a),   (2.12a)

where

Tm (x, a) := Σ_{k=0}^{m} ( f^(k) (a) / k! ) (x − a)^k   (2.12b)

is the mth Taylor polynomial and

Rm (x, a) := ( f^(m+1) (θ) / (m + 1)! ) (x − a)^(m+1)   with some suitable θ ∈ ]x, a[   (2.12c)

is the Lagrange form of the remainder term. Since the continuous function f^(m+1)
is bounded on each compact interval [a, y], y ∈ I, (2.12) imply

f (x) = Tm (x, a) + O( (x − a)^(m+1) )   for x → x0 ,   for each x0 ∈ I.
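As a quick numerical illustration of the last statement (a sketch; the choices f = exp, a = 0 and m = 3 are arbitrary), the quotient |f(x) − Tm(x, a)| / |x − a|^(m+1) indeed stays bounded as x approaches a.

import math

def taylor_exp(x, a, m):
    """m-th Taylor polynomial of exp at a, cf. (2.12b); every derivative of exp is exp."""
    return sum(math.exp(a) / math.factorial(k) * (x - a) ** k for k in range(m + 1))

a, m = 0.0, 3
for x in [1.0, 0.1, 0.01]:
    quotient = abs(math.exp(x) - taylor_exp(x, a, m)) / abs(x - a) ** (m + 1)
    print(x, quotient)   # the quotients remain bounded (they approach 1/4! = 1/24)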

Proposition 2.9. As in Def. 2.2, let (X, d) be a metric space, let (Y, k · k), (Z, k · k)
be normed vector spaces over K, let D ⊆ X and consider functions f : D −→ Y ,
g : D −→ Z, where g(x) ≠ 0 for each x ∈ D. Moreover, let x0 ∈ X be a cluster point
of D.

(a) Suppose kf k ≤ kgk in some neighborhood of x0 , i.e. there exists ǫ > 0 such that

kf (x)k ≤ kg(x)k for each x ∈ D \ {x0 } with d(x, x0 ) < ǫ.

Then f (x) = O(g(x)) for x → x0 .



(b) f (x) = O(f (x)) for x → x0 , but not f (x) = o(f (x)) for x → x0 (assuming f ≠ 0).
(c) f (x) = o(g(x)) for x → x0 implies f (x) = O(g(x)) for x → x0 .

Proof. Exercise. 
Proposition 2.10. As in Def. 2.2, let (X, d) be a metric space, let (Y, k · k), (Z, k · k)
be normed vector spaces over K, let D ⊆ X and consider functions f : D −→ Y ,
g : D −→ Z, where g(x) 6= 0 for each x ∈ D. Moreover, let x0 ∈ X be a cluster point
of D. In addition, for use in some of the following assertions, we introduce functions
f1 : D −→ Y1 , f2 : D −→ Y2 , g1 : D −→ Z1 , g2 : D −→ Z2 , where (Y1 , k · k), (Y2 , k · k),
(Z1 , k · k), (Z2 , k · k) are normed vector spaces over K. Assume g1 (x) ≠ 0 and g2 (x) ≠ 0
for each x ∈ D.

(a) If f (x) = O(g(x)) (resp. f (x) = o(g(x))) for x → x0 , and if kf1 k ≤ kf k and kgk ≤
kg1 k in some neighborhood of x0 , then f1 (x) = O(g1 (x)) (resp. f1 (x) = o(g1 (x)))
for x → x0 .
(b) If α ∈ K \ {0}, then for x → x0 :

αf (x) = O(g(x)) ⇒ f (x) = O(g(x)),


αf (x) = o(g(x)) ⇒ f (x) = o(g(x)),
f (x) = O(αg(x)) ⇒ f (x) = O(g(x)),
f (x) = o(αg(x)) ⇒ f (x) = o(g(x)).

(c) For x → x0 :

f (x) = O(g1 (x)) and g1 (x) = O(g2 (x)) implies f (x) = O(g2 (x)),
f (x) = o(g1 (x)) and g1 (x) = o(g2 (x)) implies f (x) = o(g2 (x)).

(d) Assume (Y1 , k · k) = (Y2 , k · k). Then, for x → x0 :

f1 (x) = O(g(x)) and f2 (x) = O(g(x)) implies f1 (x) + f2 (x) = O(g(x)),


f1 (x) = o(g(x)) and f2 (x) = o(g(x)) implies f1 (x) + f2 (x) = o(g(x)).

(e) For x → x0 :

f1 (x) = O(g1 (x)) and f2 (x) = O(g2 (x))


implies kf1 (x)k kf2 (x)k = O(kg1 (x)k kg2 (x)k),
f1 (x) = o(g1 (x)) and f2 (x) = o(g2 (x))
implies kf1 (x)k kf2 (x)k = o(kg1 (x)k kg2 (x)k).

(f ) For x → x0 :

f (x) = O(kg1 (x)k kg2 (x)k)   ⇒   kf (x)k / kg1 (x)k = O(g2 (x)),
f (x) = o(kg1 (x)k kg2 (x)k)   ⇒   kf (x)k / kg1 (x)k = o(g2 (x)).

Proof. Exercise. 

2.2 Operator Norms and Natural Matrix Norms


When considering maps between normed vector spaces, the corresponding norms provide
a natural measure for error analyses. A special class of norms of importance to us is the
class of so-called operator norms defined on linear maps between normed vector spaces (cf. Def.
2.12 below).

Definition 2.11. Let A : X −→ Y be a linear map between two normed vector spaces
(X, k · k) and (Y, k · k) over K. Then A is called bounded if, and only if, A maps bounded
sets to bounded sets, i.e. if, and only if, A(B) is a bounded subset of Y for each bounded
B ⊆ X. The vector space of all bounded linear maps between X and Y is denoted by
L(X, Y ).

Definition 2.12. Let A : X −→ Y be a linear map between two normed vector spaces
(X, k · k) and (Y, k · k) over K. The number

kAk := sup { kAxk / kxk : x ∈ X, x ≠ 0 } = sup { kAxk : x ∈ X, kxk = 1 } ∈ [0, ∞]   (2.13)

is called the operator norm of A induced by the norms on X and Y , respectively (strictly
speaking, the term operator norm is only justified if the value is finite, but it is often
convenient to use the term in the generalized way defined here).
In the special case, where X = Kn , Y = Km , and A is given via an m × n matrix over K,
the operator norm is also called a natural matrix norm or an induced matrix norm.

Remark 2.13. According to Th. B.1 of the Appendix, for a linear map A : X −→
Y between two normed vector spaces X and Y over K, the following statements are
equivalent:

(a) A is bounded.

(b) kAk < ∞.

(c) A is Lipschitz continuous.

(d) A is continuous.

(e) There is x0 ∈ X such that A is continuous at x0 .

For linear maps between finite-dimensional spaces, the above equivalent properties al-
ways hold: Each linear map A : Kn −→ Km , (n, m) ∈ N2 , is continuous (cf. [Phi16b,
Ex. 2.16]). In particular, each linear map A : Kn −→ Km has all the above equivalent
properties.
Remark 2.14. According to Th. B.2 of the Appendix, for two normed vector spaces X
and Y over K, the operator norm does, indeed, constitute a norm on the vector space of
bounded linear maps L(X, Y ), and, moreover, if A ∈ L(X, Y ), then kAk is the smallest
Lipschitz constant for A.
Lemma 2.15. If Id : X −→ X, Id(x) := x, is the identity map on a normed vector
space X over K, then k Id k = 1 (in particular, the operator norm of a unit matrix is
always 1). Caveat: In principle, one can consider two different norms on X simultane-
ously, and then the operator norm of the identity can differ from 1.

Proof. If kxk = 1, then k Id(x)k = kxk = 1. 


Lemma 2.16. Let X, Y, Z be normed vector spaces over K and consider linear maps
A ∈ L(X, Y ), B ∈ L(Y, Z). Then

kBAk ≤ kBk kAk. (2.14)

Proof. Let x ∈ X with kxk = 1. If Ax = 0, then kB(A(x))k = 0 ≤ kBk kAk. If Ax ≠ 0,
then one estimates

kB(Ax)k = kAxk · k B( Ax / kAxk ) k ≤ kAk kBk,

thereby establishing the case. 
Notation 2.17. For each m, n ∈ N, let M(m, n, K) ≅ Kmn denote the vector space
over K of m × n matrices (akl ) := (akl )(k,l)∈{1,...,m}×{1,...,n} over K. Moreover, we define
M(n, K) := M(n, n, K).
Example 2.18. Let m, n ∈ N and let A : Kn −→ Km be the K-linear map given by
(akl ) ∈ M(m, n, K) (with respect to the standard bases of Kn and Km , respectively).

(a) The norm defined by

kAk∞ := max { Σ_{l=1}^{n} |akl | : k ∈ {1, . . . , m} }

is called the row sum norm of A. It is an exercise to show that kAk∞ is the operator
norm induced if Kn and Km are endowed with the ∞-norm.

(b) The norm defined by

kAk1 := max { Σ_{k=1}^{m} |akl | : l ∈ {1, . . . , n} }

is called the column sum norm of A. It is an exercise to show that kAk1 is the
operator norm induced if Kn and Km are endowed with the 1-norm.

(c) The operator norm on A induced if Kn and Km are endowed with the 2-norm
(i.e. the Euclidean norm) is called the spectral norm of A and is denoted kAk2 .
Unfortunately, there is no formula as simple as the ones in (a) and (b) for the
computation of kAk2 . We will see in Th. 2.24 below that the value of kAk2 is
related to the eigenvalues of A∗ A.
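For a concrete matrix, the three norms of this example can be evaluated as follows (a sketch using NumPy; the matrix entries are arbitrary, and numpy.linalg.norm with ord = inf, 1, 2 returns exactly the row sum, column sum, and spectral norms).

import numpy as np

A = np.array([[1.0, -2.0,  3.0],
              [4.0,  0.0, -1.0]])

row_sum    = np.abs(A).sum(axis=1).max()                        # kAk_inf, cf. (a)
column_sum = np.abs(A).sum(axis=0).max()                        # kAk_1,   cf. (b)
spectral   = np.sqrt(np.linalg.eigvalsh(A.conj().T @ A).max())  # kAk_2,   cf. (c) and Th. 2.24

print(row_sum,    np.linalg.norm(A, np.inf))   # 6.0
print(column_sum, np.linalg.norm(A, 1))        # 5.0
print(spectral,   np.linalg.norm(A, 2))        # approx 4.16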

Caveat 2.19. Let m, n ∈ N and p ∈ [1, ∞]. Let A : Kn −→ Km be the K-linear
map given by (akl ) ∈ M(m, n, K) (with respect to the standard bases of Kn and Km ,
respectively). It is common to denote the operator norm on A ∈ L(Kn , Km ) induced by
the p-norms on Kn and Km , respectively, by kAkp (as we did in Ex. 2.18 for p = ∞, 1, 2).
However, for n, m > 1, the operator norm k·kp is not(!) the p-norm on Kmn ≅ L(Kn , Km )
(the 2-norm on Kmn is known as the Frobenius norm or the Hilbert-Schmidt norm on
Kmn ). As it turns out, for n > 1, the p-norm on K^(n²) is not an operator norm induced
by any norm on Kn at all: Let

kAkp,nop := ( Σ_{k=1}^{n} Σ_{l=1}^{n} |akl |^p )^(1/p)   for p < ∞,
kAkp,nop := max{ |akl | : k, l ∈ {1, . . . , n} }   for p = ∞,

denote the p-norm on K^(n²). If p ∈ [1, ∞[, then

k Id kp,nop = n^(1/p) ≠ 1   for n > 1,

showing that k · kp,nop does not satisfy Lem. 2.15. If akl := 1 for each k, l ∈ {1, . . . , n},
then

n = kAAk∞,nop > kAk∞,nop · kAk∞,nop = 1 · 1 = 1   for n > 1,

showing that k · k∞,nop does not satisfy Lem. 2.16. For p ∈ [1, ∞[, one might think that
rescaling k · kp,nop might make it into an operator norm: The modified norm

k · k′p,nop : K^(n²) −→ R+0 ,   kAk′p,nop := kAkp,nop / n^(1/p) ,

does satisfy Lem. 2.15. However, if

akl := 1 for k = l = 1,   akl := 0 otherwise,

then AA = A and

1 / n^(1/p) = kAAk′p,nop > kAk′p,nop · kAk′p,nop = (1 / n^(1/p)) · (1 / n^(1/p))   for n > 1,

i.e. the modified norm does not satisfy Lem. 2.16.
i.e. the modified norm does not satisfy Lem. 2.16.

In the rest of this section, we will further investigate the spectral norm defined in Ex.
2.18(c). In preparation, we recall the following notions and notation:
Notation 2.20. Let m, n ∈ N. If A = (akl )(k,l)∈{1,...,m}×{1,...,n} ∈ M(m, n, K), then

Ā := (ākl )(k,l)∈{1,...,m}×{1,...,n} ∈ M(m, n, K),
Aᵗ := (akl )(l,k)∈{1,...,n}×{1,...,m} ∈ M(n, m, K),
A∗ := (Ā)ᵗ ∈ M(n, m, K),

where Ā is called the (complex) conjugate of A, Aᵗ is the transpose of A, and A∗ is the
conjugate transpose, also called the adjoint, of A (thus, if K = R, then Ā = A and
A∗ = Aᵗ ).
Definition 2.21. Let n ∈ N and let A = (akl ) ∈ M(n, K).

(a) A is called symmetric if, and only if, A = Aᵗ , i.e. if, and only if, akl = alk for each
(k, l) ∈ {1, . . . , n}².

(b) A is called Hermitian if, and only if, A = A∗ , i.e. if, and only if, akl = ālk for each
(k, l) ∈ {1, . . . , n}².

(c) A is called positive semidefinite if, and only if, x∗Ax ∈ R+0 for each x ∈ Kn .

(d) A is called positive definite if, and only if, A is positive semidefinite and x∗Ax =
0 ⇔ x = 0, i.e. if, and only if, x∗Ax > 0 for each 0 ≠ x ∈ Kn .

Lemma 2.22. Let m, n ∈ N and A ∈ M(m, n, K). Then both A∗A ∈ M(n, K) and
AA∗ ∈ M(m, K) are Hermitian and positive semidefinite.

Proof. Since (A∗)∗ = A, it suffices to consider A∗A. That A∗A is Hermitian is due to

(A∗A)∗ = A∗(A∗)∗ = A∗A.

Moreover, if x ∈ Kn , then x∗A∗Ax = (Ax)∗(Ax) = kAxk2² ∈ R+0 , showing A∗A to be
positive semidefinite. 
Definition 2.23. We define, for each n ∈ N and each A ∈ M(n, C),

r(A) := max { |λ| : λ ∈ C and λ is an eigenvalue of A }.   (2.15)

The number r(A) is called the spectral radius of A.


Theorem 2.24. Let m, n ∈ N. If A ∈ M(m, n, K), then its spectral norm kAk2 (cf.
Ex. 2.18(c)) satisfies

kAk2 = √( r(A∗A) ).   (2.16a)

If m = n and A is Hermitian (i.e. A∗ = A), then kAk2 is also given by the following
simpler formula:

kAk2 = r(A).   (2.16b)

Proof. According to Lem. 2.22, A∗A is a Hermitian and positive semidefinite n × n
matrix. Then there exist (cf. [Phi19b, Th. 10.41, Th. 11.3(a)]) real nonnegative (not
necessarily distinct) eigenvalues λ1 , . . . , λn ≥ 0 and a corresponding ordered orthonormal
basis (v1 , . . . , vn ) of Kn , consisting of eigenvectors satisfying, in particular,

A∗Avk = λk vk   for each k ∈ {1, . . . , n}.

As (v1 , . . . , vn ) is a basis of Kn , for each x ∈ Kn , there are numbers x1 , . . . , xn ∈ K such
that x = Σ_{k=1}^{n} xk vk . Thus, one computes

kAxk2² = (Ax)∗ · (Ax) = x∗A∗Ax = ( Σ_{k=1}^{n} xk vk )∗ ( Σ_{k=1}^{n} xk λk vk )
       = Σ_{k=1}^{n} λk |xk |² ≤ r(A∗A) Σ_{k=1}^{n} |xk |² = r(A∗A) kxk2² ,   (2.17)
proving kAk2 ≤ √( r(A∗A) ). To verify the remaining inequality, let λj := r(A∗A) be the
largest of the nonnegative λk , and let v := vj be the corresponding eigenvector from the
orthonormal basis. Then kvk2 = 1 and choosing x = v = vj (i.e. xk = δjk ) in (2.17)
yields kAvk2² = r(A∗A), thereby proving kAk2 ≥ √( r(A∗A) ) and completing the proof of
(2.16a). It remains to consider the case where A = A∗ . Then A∗A = A² and, since

Av = λv   ⇒   A²v = λ²v   ⇒   r(A²) = r(A)² ,

we have

kAk2 = √( r(A∗A) ) = √( r(A²) ) = r(A),

thereby establishing the case. 

Caveat 2.25. In general, it is not admissible to use the simpler formula (2.16b) for
non-Hermitian n × n matrices: For example, for

A := ( 0  1 )
     ( 0  1 ) ,

one has

A∗A = ( 0  0 ) ( 0  1 ) = ( 0  0 )
      ( 1  1 ) ( 0  1 )   ( 0  2 ) ,

such that 1 = r(A) ≠ √2 = √( r(A∗A) ) = kAk2 .

Lemma 2.26. Let n ∈ N. Then

kyk2 = max { |v · y| : v ∈ Kn and kvk2 = 1 }   for each y ∈ Kn .   (2.18)

Proof. Let v ∈ Kn such that kvk2 = 1. One estimates, using the Cauchy-Schwarz
inequality [Phi16b, (1.41)],

|v · y| ≤ kvk2 kyk2 = kyk2 . (2.19a)

Note that (2.18) is trivially true for y = 0. If y ≠ 0, then letting w := y/kyk2 yields
kwk2 = 1 and
|w · y| = y · y/kyk2 = kyk2 . (2.19b)
Together, (2.19a) and (2.19b) establish (2.18). 

Proposition 2.27. Let m, n ∈ N. If A ∈ M(m, n, K), then kAk2 = kA∗k2 and, in
particular, r(A∗A) = r(AA∗). This allows one to use the smaller of the two matrices
A∗A and AA∗ to compute kAk2 (see Ex. 2.29 below).

Proof. We know from Linear Algebra (cf. [Phi19b, Th. 10.22(a)] and [Phi19b, Th.
10.24(a)]) that

∀ x ∈ Kn ∀ v ∈ Km :   v · Ax = (A∗v) · x.

Thus, one calculates

kAk2 = max { kAxk2 : x ∈ Kn and kxk2 = 1 }
     = max { |v · Ax| : v ∈ Km , x ∈ Kn , and kvk2 = kxk2 = 1 }   (by (2.18))
     = max { |(A∗v) · x| : v ∈ Km , x ∈ Kn , and kvk2 = kxk2 = 1 }
     = max { kA∗vk2 : v ∈ Km and kvk2 = 1 }   (by (2.18))
     = kA∗k2 ,

proving the proposition. 
Remark 2.28. One can actually show more than we did in Prop. 2.27: The eigenvalues
(including multiplicities) of A∗ A and AA∗ are always identical, except that the larger of
the two matrices (if any) can have additional eigenvalues of value 0 – the square roots of
the nonzero (i.e. positive) eigenvalues of A∗ A and AA∗ are the so-called singular values
of A and A∗ (see, e.g., [HB09, Def. and Th. 12.1]).
 
Example 2.29. Consider

A := ( 1  0  1  0 )
     ( 0  1  1  0 )   ∈ M(2, 4, K).

One obtains

AA∗ = ( 2  1 )          A∗A = ( 1  0  1  0 )
      ( 1  2 ) ,               ( 0  1  1  0 )
                               ( 1  1  2  0 )
                               ( 0  0  0  0 ) .

So one would tend to compute the eigenvalues using AA∗ . One finds λ1 = 1 and λ2 = 3.
Thus, r(AA∗) = 3 and kAk2 = √3.
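The computation of Ex. 2.29 is easily reproduced numerically (a sketch with NumPy, using the matrix from the example).

import numpy as np

A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0]])

AAs = A @ A.conj().T                            # the smaller 2 x 2 matrix A A*
print(np.linalg.eigvalsh(AAs))                  # eigenvalues 1 and 3
print(np.sqrt(np.linalg.eigvalsh(AAs).max()))   # kAk_2 = sqrt(3) approx 1.732
print(np.linalg.norm(A, 2))                     # NumPy's spectral norm agrees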

2.3 Condition of a Problem


We are now in a position to apply our acquired knowledge of operator norms to error
analysis. The general setting is the following: The input is given in some normed vector
space X (e.g. Kn ) and the output is likewise a vector lying in some normed vector space
Y (e.g. Km ). The solution operator, i.e. the map between input and output, is some
function f defined on an open subset of X. The goal in this section is to study the
behavior of the output given small changes (errors) of the input.
Definition 2.30. Let (X, k · k) and (Y, k · k) be normed vector spaces over K, U ⊆ X
open, and f : U −→ Y . Fix x ∈ U \ {0} such that f (x) ≠ 0 and δ > 0 such that
Bδ (x) = {x̃ ∈ X : kx̃ − xk < δ} ⊆ U . We call (the problem represented by the solution
operator) f well-conditioned in Bδ (x) if, and only if, there exists K ≥ 0 such that

kf (x̃) − f (x)k / kf (x)k ≤ K kx̃ − xk / kxk   for every x̃ ∈ Bδ (x).   (2.20)

If f is well-conditioned in Bδ (x), then we define K(δ) := K(f, x, δ) to be the smallest


number K ∈ R+ 0 such that (2.20) is valid. If there exists δ > 0 such that f is well-
conditioned in Bδ (x), then f is called well-conditioned at x.
Definition and Remark 2.31. We remain in the situation of Def. 2.30 and notice that
0 < α ≤ β ≤ δ implies Bα (x) ⊆ Bβ (x) ⊆ Bδ (x), and, thus 0 ≤ K(α) ≤ K(β) ≤ K(δ).
In consequence, the following definition makes sense if f is well-conditioned at x:

krel := krel (f, x) := lim_{α→0} K(α).   (2.21)

The number krel ∈ R+ 0 is called the relative condition of (the problem represented by the
solution operator) f at x. The number krel provides a measure for how much a relative
error in x might get inflated by the application of f , i.e. the smaller krel , the more well-
behaved is the problem in the vicinity of x. What value for krel is still acceptable in
terms of numerical stability depends on the situation: While krel ≈ 1 generally indicates
stability, even krel ≈ 100 might be acceptable in some situations.
Remark 2.32. Let us compare the newly introduced notion of Def. and Rem. 2.31 of
f being well-conditioned with some other regularity notions for f . For example, if f is
L-Lipschitz in Bδ (x), x ∈ X, δ > 0, then

∀ kf (x̃) − f (x)k ≤ L kx̃ − xk,


x̃∈Bδ (x)

L kxk
i.e. f is well-conditioned in Bδ (x) and (2.20) holds with K := kf (x)k
. The converse is not
true: There are examples, where f is well-conditioned in some Bδ (x) without being Lip-
schitz continuous on Bδ (x) (exercise). On the other hand, (2.20) does imply continuity
in x, and, once again, the converse is not true (another exercise). In particular, a prob-
lem can be well-posed in the sense of Def. 1.5 without being well-conditioned. On the
other hand, if f is everywhere well-conditioned, then it is everywhere continuous, and,
thus, well-posed. In that sense well-conditionedness is stronger than well-posedness.
However, one should also note that we defined well-conditionedness as a local property,
whereas well-posedness was defined as a global property.

As an example, we consider the problem of solving the linear system Ax = b with an


invertible n × n matrix A ∈ M(n, K). Before studying the problem systematically using
the notion of well-conditionedness, let us look at a particular example that illustrates a
typical instability that can occur:
Example 2.33. Consider the linear system

Ax = b

for the unknown x ∈ R4 with

A := ( 10   7   8   7 )
     (  7   5   6   5 )
     (  8   6  10   9 )
     (  7   5   9  10 ) ,      b := (32, 23, 33, 31)ᵗ .

One finds that the solution is x := (1, 1, 1, 1)ᵗ .
If instead of b, we are given the following perturbed b̃,

b̃ := (32.1, 22.9, 33.1, 30.9)ᵗ ,

where the absolute error is 0.1 and the relative error is even smaller, then the corresponding
solution is

x̃ := (9.2, −12.6, 4.5, −1.1)ᵗ ,
in no way similar to the solution to b. In particular, the absolute and relative errors
have been hugely amplified.
One might suspect that the issue behind the instability lies in A being “almost” singular
such that applying A−1 is similar to dividing by a small number. However, that is not
the case, as we have

A−1 = (  25  −41   10   −6 )
      ( −41   68  −17   10 )
      (  10  −17    5   −3 )
      (  −6   10   −3    2 ) .
We will see in Ex. 2.42 below that the actual reason behind the instability is the range
of eigenvalues of A, in particular, the relation between the largest and the smallest
eigenvalue. One has:

λ1 ≈ 0.010,   λ2 ≈ 0.843,   λ3 ≈ 3.86,   λ4 ≈ 30.3,   λ4 / λ1 ≈ 2984.

Proposition 2.34. Let m, n ∈ N and consider the normed vector spaces (Kn , k · k) and
(Km , k · k). Let A : Kn −→ Km be K-linear. Moreover, let x ∈ Kn be such that x ≠ 0
and Ax ≠ 0. Then the following statements hold true:

(a) A is well-conditioned at x and

krel (A, x) = kAk kxk / kAxk ,

where kAk denotes the corresponding operator norm of A.

(b) If n = m and A is invertible, then

krel (A, x) ≤ kAkkA−1 k,

where kAk and kA−1 k denote the corresponding operator norms of A and A−1 ,
respectively.

Proof. (a): For each x̃ ∈ Kn , we estimate

kAx̃ − Axk / kAxk ≤ kAk kx̃ − xk / kAxk = ( kAk kxk / kAxk ) · ( kx̃ − xk / kxk ),

showing krel (A, x) ≤ kAk kxk / kAxk. It still remains to prove the reverse inequality. To that
end, choose v ∈ Kn such that kvk = 1 and kAvk = kAk (such a vector v must exist due
to the fact that the continuous function z ↦ kAzk must attain its max on the compact
unit sphere S1 (0) = {z ∈ Kn : kzk = 1} – note that this argument does not work in
infinite-dimensional spaces, where S1 (0) is not compact). Consider x̃ := x + αv. Then

kAx̃ − Axk / kAxk = α kAvk / kAxk = ( kAk kxk / kAxk ) · ( kx̃ − xk / kxk )   (using kAvk = kAk, kvk = 1).

As this equality is independent of α, for α → 0, we obtain krel (A, x) ≥ kAk kxk / kAxk , as
desired.
(b): If n = m and A is invertible, then (a) implies

krel (A, x) = kAk kxk / kAxk = kAk kA−1Axk / kAxk ≤ kAk kA−1k kAxk / kAxk = kAk kA−1k,

thereby proving (b). 



Example 2.35. Let A ∈ M(n, K) be an invertible n × n matrix, n ∈ N and consider


the problem of solving the linear system Ax = b, b ∈ Kn . Then the solution operator is

A−1 : Kn −→ Kn , b ↦ A−1 b.

Fixing some norm on Kn and using the induced natural matrix norm, we obtain from
Prop. 2.34(b) that, for each 0 ≠ b ∈ Kn :

krel (A−1 , b) ≤ kAkkA−1 k. (2.22)

Definition and Remark 2.36. In view of Prop. 2.34(b) and (2.22), we define the
condition number κ(A) (also just called the condition) of an invertible A ∈ M(n, K),
n ∈ N, by
κ(A) := kAkkA−1 k, (2.23)
where k · k denotes a natural matrix norm induced by some norm on Kn . Then one
immediately gets from (2.22) that

krel (A−1 , b) ≤ κ(A) for each b ∈ Kn \ {0}. (2.24)

The condition number clearly depends on the underlying natural matrix norm. If the
natural matrix norm is the spectral norm, then one calls the condition number the
spectral condition and one also writes κ2 instead of κ.

Notation 2.37. For each A ∈ M(n, C), n ∈ N, let

σ(A) := {λ ∈ C : λ is an eigenvalue of A},

|σ(A)| := {|λ| : λ ∈ C is an eigenvalue of A},

denote the set of eigenvalues of A (also known as the spectrum of A) and the set of
absolute values of eigenvalues of A, respectively.

Lemma 2.38. For the spectral condition of an invertible matrix A ∈ M(n, K), n ∈ N,
one obtains

κ2 (A) = √( max σ(A∗A) / min σ(A∗A) )   (2.25)

(recall that all eigenvalues of A∗A are real and positive). Moreover, if A is Hermitian,
then (2.25) simplifies to

κ2 (A) = max |σ(A)| / min |σ(A)|   (2.26)

(note that min |σ(A)| > 0, as A is invertible).

Proof. Exercise. 
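A sketch of how (2.25) and (2.26) can be evaluated numerically, using the Hermitian matrix of Ex. 2.33; numpy.linalg.cond(A, 2) returns the same spectral condition.

import numpy as np

A = np.array([[10., 7.,  8.,  7.],
              [ 7., 5.,  6.,  5.],
              [ 8., 6., 10.,  9.],
              [ 7., 5.,  9., 10.]])

sigma = np.linalg.eigvalsh(A.conj().T @ A)      # eigenvalues of A* A
kappa2 = np.sqrt(sigma.max() / sigma.min())     # (2.25)
abs_eig = np.abs(np.linalg.eigvalsh(A))         # A is Hermitian, so (2.26) applies
print(kappa2, abs_eig.max() / abs_eig.min(), np.linalg.cond(A, 2))   # all approx 2984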
Theorem 2.39. Let A ∈ M(n, K) be invertible, n ∈ N. Assume that x, b, ∆x, ∆b ∈ Kn
satisfy
Ax = b, A(x + ∆x) = b + ∆b, (2.27)
i.e. ∆b can be seen as a perturbation of the input and ∆x can be seen as the resulting
perturbation of the output. One then has the following estimates for the absolute and
relative errors:
k∆xk ≤ kA−1k k∆bk ,      k∆xk / kxk ≤ κ(A) k∆bk / kbk ,   (2.28)

where x, b ≠ 0 is assumed for the second estimate (as before, it does not matter which
norm on Kn one uses, as long as one uses the induced operator norm for the matrix).

Proof. Since A is linear, (2.27) implies A(∆x) = ∆b, i.e. ∆x = A−1 (∆b), which already
yields the first estimate in (2.28). For x, b ≠ 0, the first estimate together with Ax = b
easily implies the second:

k∆xk / kxk ≤ kA−1k k∆bk / kxk = kA−1k ( k∆bk / kbk ) · ( kAxk / kxk ) ≤ κ(A) k∆bk / kbk ,

thereby establishing the case. 
Example 2.40. Suppose we want to find out how much we can perturb b in the problem

Ax = b,   A := ( 1  2 )        b := ( 1 )
               ( 1  1 ) ,            ( 4 ) ,

if the resulting relative error er in x is to be less than 10⁻² with respect to the ∞-norm.
From (2.28), we know

er ≤ κ(A) k∆bk∞ / kbk∞ = κ(A) k∆bk∞ / 4 .

Moreover, since

A−1 = ( −1   2 )
      (  1  −1 ) ,

from (2.23) and Ex. 2.18(a), we obtain

κ(A) = kAk∞ kA−1 k∞ = 3 · 3 = 9.

Thus, if the perturbation ∆b satisfies k∆bk∞ < 4/900 ≈ 0.0044, then er < (9/4) · (4/900) = 10⁻² .
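A small numerical check of this example (a sketch; the chosen perturbation has ∞-norm 0.004 < 4/900, so the relative error must stay below 10⁻²).

import numpy as np

A = np.array([[1., 2.],
              [1., 1.]])
b = np.array([1., 4.])

kappa_inf = np.linalg.norm(A, np.inf) * np.linalg.norm(np.linalg.inv(A), np.inf)
print(kappa_inf)                        # 9.0

delta_b = np.array([0.004, -0.004])     # infinity norm below 4/900 approx 0.0044
x      = np.linalg.solve(A, b)
x_pert = np.linalg.solve(A, b + delta_b)
rel_err = np.linalg.norm(x_pert - x, np.inf) / np.linalg.norm(x, np.inf)
print(rel_err)                          # well below 1e-2, as guaranteed by (2.28)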
Remark 2.41. Note that (2.28) is a much stronger and more useful statement than
the krel (A−1 , b) ≤ κ(A) of (2.22). The relative condition only provides a bound in the
limit of smaller and smaller neighborhoods of b, without providing any information
on how small the neighborhood actually has to be (one can estimate the amplification
of the error in x provided that the error in b is very small). On the other hand, (2.28)
provides an effective control of the absolute and relative errors without any restrictions
with regard to the size of the error in b (it holds for each ∆b).

Example 2.42. We come back to the instability observed in Ex. 2.33 and investigate
its actual cause. Suppose A ∈ M(n, C) is an arbitrary Hermitian invertible matrix,
λmin , λmax ∈ σ(A) such that |λmin | = min |σ(A)|, |λmax | = max |σ(A)|. Let vmin and vmax
be eigenvectors for λmin and λmax , respectively, satisfying kvmin k = kvmax k = 1. The
most unstable behavior of Ax = b occurs if b = vmax and b is perturbed in the direction
of vmin , i.e. b̃ := b + ǫ vmin , ǫ > 0. The solution to Ax = b is x = (1/λmax ) vmax , whereas the
solution to Ax = b̃ is

x̃ = A−1 (vmax + ǫ vmin ) = (1/λmax ) vmax + (ǫ/λmin ) vmin .

Thus, the resulting relative error in the solution is

kx̃ − xk / kxk = ǫ |λmax | / |λmin | ,

while the relative error in the input was merely

kb̃ − bk / kbk = ǫ.

This shows that, in the worst case, the relative error in b can be amplified by a factor
of κ2 (A) = |λmax |/|λmin |, and that is exactly what had occurred in Ex. 2.33.
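The worst-case amplification just described can be observed directly (a sketch; it reuses the symmetric matrix of Ex. 2.33, takes b = v_max, and perturbs it along v_min with ǫ = 10⁻³, an arbitrary choice).

import numpy as np

A = np.array([[10., 7.,  8.,  7.],
              [ 7., 5.,  6.,  5.],
              [ 8., 6., 10.,  9.],
              [ 7., 5.,  9., 10.]])

eigenvalues, eigenvectors = np.linalg.eigh(A)    # ascending eigenvalues, orthonormal eigenvectors
v_min, v_max = eigenvectors[:, 0], eigenvectors[:, -1]

eps = 1e-3
x       = np.linalg.solve(A, v_max)
x_tilde = np.linalg.solve(A, v_max + eps * v_min)

print(np.linalg.norm(eps * v_min) / np.linalg.norm(v_max))   # relative input error: 1e-3
print(np.linalg.norm(x_tilde - x) / np.linalg.norm(x))       # approx 2.98 = 1e-3 * lambda_max/lambda_min
print(eps * eigenvalues[-1] / eigenvalues[0])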

Next, we study a generalization of the problem from (2.27). Now, in addition to a


perturbation of the right-hand side b, we also allow a perturbation of the matrix A by
a matrix ∆A:
Ax = b, (A + ∆A)(x + ∆x) = b + ∆b. (2.29)
We would now like to control ∆x in terms of ∆b and ∆A.

Remark 2.43. Let n ∈ N. The set GLn (K) of invertible n × n matrices over K is open
in M(n, K) ≅ K^(n²) (where all norms are equivalent) – this follows, for example, from the
determinant being a continuous map (even a polynomial, cf. [Phi19b, Rem. 4.33]) from
K^(n²) into K, which implies that GLn (K) = det−1 (K \ {0}) must be open. Thus, if A in
(2.29) is invertible and k∆Ak is sufficiently small, then A + ∆A must also be invertible.

Lemma 2.44. Let n ∈ N, consider some norm on Kn , and the induced natural matrix
norm on M(n, K). If B ∈ M(n, K) is such that kBk < 1, then Id +B is invertible and

k(Id +B)−1 k ≤ 1 / (1 − kBk).   (2.30)

Proof. For each x ∈ Kn :

k(Id +B)xk ≥ kxk − kBxk ≥ kxk − kBk kxk = (1 − kBk) kxk,   (2.31)

showing that x ≠ 0 implies (Id +B)x ≠ 0, i.e. Id +B is invertible. Fixing y ∈ Kn and
applying (2.31) with x := (Id +B)−1 y yields

k(Id +B)−1 yk ≤ ( 1 / (1 − kBk) ) kyk,
thereby proving (2.30) and concluding the proof of the lemma. 
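A quick numerical sanity check of (2.30) (a sketch; B is a random matrix rescaled so that its spectral norm equals 0.5 < 1, an arbitrary choice).

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
B *= 0.5 / np.linalg.norm(B, 2)          # enforce kBk_2 = 0.5 < 1

lhs = np.linalg.norm(np.linalg.inv(np.eye(5) + B), 2)
rhs = 1.0 / (1.0 - np.linalg.norm(B, 2))
print(lhs, rhs, lhs <= rhs)              # the bound (2.30) holds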
Lemma 2.45. Let n ∈ N, consider some norm on Kn and the induced natural matrix
norm on M(n, K). If A, ∆A ∈ M(n, K), A is invertible, and k∆Ak < kA−1 k−1 , then
A + ∆A is invertible and

k(A + ∆A)−1 k ≤ 1 / ( kA−1 k−1 − k∆Ak ).   (2.32)

Moreover,

k∆Ak ≤ 1 / (2 kA−1 k)   ⇒   k(A + ∆A)−1 − A−1 k ≤ C k∆Ak,   (2.33)

where the constant can be chosen as C := 2 kA−1 k² .

Proof. We can write A + ∆A = A(Id +A−1 ∆A). As kA−1 ∆Ak ≤ kA−1 k k∆Ak < 1,
Lem. 2.44 yields that Id +A−1 ∆A is invertible. Since A is also invertible, so is A + ∆A.
Moreover,

k(A + ∆A)−1 k = k(Id +A−1 ∆A)−1 A−1 k ≤ kA−1 k / (1 − kA−1 ∆Ak) ≤ kA−1 k / (1 − kA−1 k k∆Ak),   (2.34)

where the first estimate uses (2.30), proving (2.32). The identity

(A + ∆A)−1 − A−1 = −(A + ∆A)−1 (∆A) A−1   (2.35)

holds, as it is equivalent to

Id −(A + ∆A)A−1 = −(∆A)A−1

and

A − A − ∆A = −∆A.

For k∆Ak ≤ 1 / (2 kA−1 k), (2.35) yields

k(A + ∆A)−1 − A−1 k ≤ k(A + ∆A)−1 k k∆Ak kA−1 k ≤ kA−1 k k∆Ak kA−1 k / (1 − kA−1 k k∆Ak) ≤ 2 kA−1 k² k∆Ak,

where the second estimate uses (2.34), proving (2.33). 
Theorem 2.46. As in Lem. 2.45, let n ∈ N, consider some norm on Kn with the
induced natural matrix norm on M(n, K) and let A, ∆A ∈ M(n, K) such that A is invertible
and k∆Ak < kA−1 k−1 . Moreover, assume b, x, ∆b, ∆x ∈ Kn satisfy

Ax = b,   (A + ∆A)(x + ∆x) = b + ∆b.   (2.36)

Then, for b, x ≠ 0, one has the following bound for the relative error:

k∆xk / kxk ≤ ( κ(A) / (1 − kA−1 k k∆Ak) ) ( k∆Ak / kAk + k∆bk / kbk ).   (2.37)

Proof. The equations (2.36) imply

(A + ∆A)(∆x) = ∆b − (∆A)x.   (2.38)

In consequence, from (2.32) and kbk ≤ kAk kxk, one obtains

k∆xk / kxk = k(A + ∆A)−1 (∆b − (∆A)x)k / kxk
           ≤ ( 1 / (kA−1 k−1 − k∆Ak) ) ( k∆Ak + k∆bk kAk / kbk )
           = ( κ(A) / (1 − kA−1 k k∆Ak) ) ( k∆Ak / kAk + k∆bk / kbk ),

thereby establishing the case. 
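A sketch verifying the bound (2.37) numerically, using the data of Ex. 2.33 and small random perturbations of A and b (the perturbation size 10⁻⁴ is an arbitrary choice that keeps k∆Ak < kA−1 k−1).

import numpy as np

A = np.array([[10., 7.,  8.,  7.],
              [ 7., 5.,  6.,  5.],
              [ 8., 6., 10.,  9.],
              [ 7., 5.,  9., 10.]])
b = np.array([32., 23., 33., 31.])

rng = np.random.default_rng(1)
dA = 1e-4 * rng.standard_normal((4, 4))
db = 1e-4 * rng.standard_normal(4)

x  = np.linalg.solve(A, b)
dx = np.linalg.solve(A + dA, b + db) - x

norm  = lambda M: np.linalg.norm(M, 2)
kappa = norm(A) * norm(np.linalg.inv(A))
bound = kappa / (1 - norm(np.linalg.inv(A)) * norm(dA)) * (norm(dA) / norm(A) + norm(db) / norm(b))
print(norm(dx) / norm(x), bound)        # the relative error stays below the bound (2.37)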

For nonlinear problems, the following result can be useful:


Theorem 2.47. Let m, n ∈ N, let U ⊆ Rn be open, and assume that f : U −→ Rm
is continuously differentiable: f ∈ C 1 (U, Rm ). If 0 ≠ x ∈ U and f (x) ≠ 0, then f is
well-conditioned at x and

krel = krel (f, x) = kDf (x)k kxk / kf (x)k ,   (2.39)

where Df (x) : Rn −→ Rm is the (total) derivative of f in x (which is a linear map


represented by the Jacobian Jf (x)). In (2.39), any norm on Rn and Rm will work as
long as kDf (x)k is the corresponding induced operator norm. While the present result is
only formulated and proved for R-differentiable functions, see Rem. 2.48 and Ex. 2.49(c)
below to learn how it can also be applied to C-differentiable functions.

Proof. Choose δ > 0 sufficiently small such that B̄δ (x) ⊆ U . Since f ∈ C 1 (U, Rm ), we
can apply Taylor's theorem. Recall that its lowest order version with the remainder
term in integral form states that, for each h ∈ Bδ (0) (which implies x + h ∈ Bδ (x)),
the following holds (cf. [Phi16b, Th. 4.44], which can be applied to the components
f1 , . . . , fm of f ):

f (x + h) = f (x) + ∫₀¹ Df (x + th)(h) dt .   (2.40)

If x̃ ∈ Bδ (x), then we can let h := x̃ − x, which permits us to restate (2.40) in the form

f (x̃) − f (x) = ∫₀¹ Df ((1 − t)x + tx̃)(x̃ − x) dt .   (2.41)

This allows us to estimate the norm as follows:

kf (x̃) − f (x)k = k ∫₀¹ Df ((1 − t)x + tx̃)(x̃ − x) dt k
               ≤ ∫₀¹ kDf ((1 − t)x + tx̃)(x̃ − x)k dt        (by Th. C.5)
               ≤ ∫₀¹ kDf ((1 − t)x + tx̃)k kx̃ − xk dt
               ≤ sup { kDf (y)k : y ∈ Bδ (x) } kx̃ − xk
               = S(δ) kx̃ − xk   (2.42)

with S(δ) := sup { kDf (y)k : y ∈ Bδ (x) }. This implies

kf (x̃) − f (x)k / kf (x)k ≤ ( kxk / kf (x)k ) S(δ) · kx̃ − xk / kxk

and, therefore,

K(δ) ≤ ( kxk / kf (x)k ) S(δ).   (2.43)

The hypothesis that f is continuously differentiable implies limy→x Df (y) = Df (x),
which implies limy→x kDf (y)k = kDf (x)k (continuity of the norm, cf. [Phi16b, Ex.
2.6(a)]), which, in turn, implies limδ→0 S(δ) = kDf (x)k. Since, also, limδ→0 K(δ) = krel ,
(2.43) yields

krel ≤ kDf (x)k kxk / kf (x)k .   (2.44)
It still remains to prove the reverse inequality. To that end, choose v ∈ Rn such that
kvk = 1 and

kDf (x)(v)k = kDf (x)k   (2.45)

(such a vector v must exist due to the fact that the continuous function y ↦ kDf (x)(y)k
must attain its max on the compact unit sphere S1 (0) – as in the proof of Prop. 2.34(a),
this argument does not extend to infinite-dimensional spaces, where S1 (0) is not com-
pact). For 0 < ǫ < δ consider x̃ := x + ǫv. Then kx̃ − xk = ǫ < δ, i.e. x̃ ∈ Bδ (x). In
particular, x̃ is admissible in (2.41), which provides

f (x̃) − f (x) = ǫ ∫₀¹ Df ((1 − t)x + tx̃)(v) dt
             = ǫ Df (x)(v) + ǫ ∫₀¹ ( Df ((1 − t)x + tx̃) − Df (x) )(v) dt .   (2.46)

Once again using kx̃ − xk = ǫ as well as the triangle inequality in the form ka + bk ≥
kak − kbk, we obtain from (2.46) and (2.45):

kf (x̃) − f (x)k / kf (x)k ≥ ( kxk / kf (x)k ) ( kDf (x)k − ∫₀¹ kDf ((1 − t)x + tx̃) − Df (x)k dt ) · kx̃ − xk / kxk .   (2.47)

Thus,

K(ǫ) ≥ ( kxk / kf (x)k ) ( kDf (x)k − T (ǫ) ),   (2.48)

where

T (ǫ) := sup { ∫₀¹ kDf ((1 − t)x + tx̃) − Df (x)k dt : x̃ ∈ B̄ǫ (x) }.   (2.49)

Since Df is continuous in x, for each α > 0, there is ǫ > 0 such that ky − xk ≤ ǫ implies
kDf (y) − Df (x)k < α. Thus, since k(1 − t)x + tx̃ − xk = t kx̃ − xk ≤ ǫ for x̃ ∈ B̄ǫ (x)
and t ∈ [0, 1], one obtains

T (ǫ) ≤ ∫₀¹ α dt = α,

implying limǫ→0 T (ǫ) = 0. In particular, we can take limits in (2.48) to get

krel ≥ kDf (x)k kxk / kf (x)k .   (2.50)

Finally, (2.50) together with (2.44) completes the proof of (2.39). 

Remark 2.48. While Th. 2.47 was formulated and proved for (continuously) R-differen-
tiable functions, it can also be used for C-differentiable (i.e. for holomorphic) functions:
Recall that C-differentiability is actually a much stronger condition than continuous R-
differentiability: A function that is C-differentiable is, in particular, R-differentiable and
even C ∞ (see [Phi16b, Def. 4.18] and [Phi16b, Rem. 4.19(c)]). As a concrete example,
we apply Th. 2.47 to complex multiplication in Ex. 2.49(c) below. For more information
regarding the relation between norms on Cn and norms on R2n , also see [Phi16b, Sec.
D].

Example 2.49. (a) Let us investigate the condition of the problem of real multiplica-
tion, by considering

f : R2 −→ R, f (x, y) := xy, Df (x, y) = (y, x). (2.51)

Using the Euclidean norm k · k2 on R2 , one obtains, for each (x, y) ∈ R2 such that
xy ≠ 0,

krel (x, y) := krel (f, (x, y)) = kDf (x, y)k2 k(x, y)k2 / |f (x, y)| = (x² + y²) / |xy| = |x|/|y| + |y|/|x| .   (2.52)

Thus, we see that the relative condition explodes if |x| ≫ |y| or |x| ≪ |y|. To show
f is well-conditioned in Bδ (x, y) for each δ > 0, let (x̃, ỹ) ∈ Bδ (x, y) and estimate

f (x, y) − f (x̃, ỹ) = |xy − x̃ỹ| ≤ |xy − xỹ| + |xỹ − x̃ỹ| = |x||y − ỹ| + |ỹ||x − x̃|

≤ max k(x, y)k2 , k(x̃, ỹ)k2 (x, y) − (x̃, ỹ) 1

≤ C δ + k(x, y)k2 (x, y) − (x̃, ỹ) 2

with suitable C ∈ R+ . Thus, f is well-conditioned in Bδ (x, y). In consequence, we


see that multiplication is numerically stable if, and only if, |x| and |y| are roughly
of the same order of magnitude.
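The blow-up predicted by (2.52) is easy to observe numerically. The following small Python sketch (not part of the original development; the function name k_rel_mult is ours) simply evaluates the relative condition of multiplication for balanced and unbalanced factors:

# Numerical illustration of (2.52): k_rel(x, y) = |x|/|y| + |y|/|x| stays small for
# balanced factors and blows up for unbalanced ones (a sketch, names are ours).
def k_rel_mult(x, y):
    # k_rel = ||Df(x,y)||_2 * ||(x,y)||_2 / |f(x,y)| = (x^2 + y^2) / |x y|
    return (x * x + y * y) / abs(x * y)

for x, y in [(3.0, 2.0), (1.0, 1.0), (1e6, 1e-6)]:
    print(f"x={x:10.3g}, y={y:10.3g}, k_rel={k_rel_mult(x, y):10.3g}")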

(b) Analogous to (a), we now investigate real division, i.e.
$$g : \mathbb{R} \times \big(\mathbb{R}\setminus\{0\}\big) \longrightarrow \mathbb{R}, \quad g(x,y) := \frac{x}{y}, \quad Dg(x,y) = \left( \frac{1}{y}, -\frac{x}{y^2} \right). \tag{2.53}$$
Once again using the Euclidean norm on $\mathbb{R}^2$, one obtains from (2.39) that, for each $(x,y) \in \mathbb{R}^2$ such that $xy \neq 0$,
$$k_{\mathrm{rel}}(x,y) := k_{\mathrm{rel}}(g,(x,y)) = \|Dg(x,y)\|_2\,\frac{\|(x,y)\|_2}{|g(x,y)|} = \frac{|y|}{|x|}\,\sqrt{x^2+y^2}\,\sqrt{\frac{1}{y^2} + \frac{x^2}{y^4}} = \frac{1}{|x|}\,\sqrt{x^2+y^2}\,\sqrt{1 + \frac{x^2}{y^2}} = \frac{x^2+y^2}{|x|\,|y|} = \frac{|x|}{|y|} + \frac{|y|}{|x|}, \tag{2.54}$$
i.e. the relative condition is the same as for multiplication. In particular, division also becomes unstable if $|x| \gg |y|$ or $|x| \ll |y|$, while $k_{\mathrm{rel}}(x,y)$ remains small if $|x|$ and $|y|$ are of the same order of magnitude. However, we now again investigate if $g$ is well-conditioned in $B_\delta(x,y)$, $\delta > 0$, and here we find an important difference between division and multiplication. For $\delta \ge |y|$, it is easy to see that $g$ is not well-conditioned in $B_\delta(x,y)$: For each $n \in \mathbb{N}$, let $z_n := (x, \frac{y}{n})$. Then $\|(x,y) - z_n\|_2 = (1 - \frac{1}{n})|y| < |y| \le \delta$, but
$$\left| \frac{x}{y} - \frac{nx}{y} \right| = (n-1)\,\frac{|x|}{|y|} \to \infty \quad\text{for } n \to \infty.$$
For $0 < \delta < |y|$, the result is different: Suppose $|\tilde{y}| \le |y| - \delta$. Then
$$\big\|(x,y) - (\tilde{x},\tilde{y})\big\|_2 \ge |\tilde{y} - y| \ge |y| - |\tilde{y}| \ge \delta.$$
Thus, $(\tilde{x},\tilde{y}) \in B_\delta(x,y)$ implies $|\tilde{y}| > |y| - \delta$. In consequence, one estimates, for each $(\tilde{x},\tilde{y}) \in B_\delta(x,y)$,
$$\big| g(x,y) - g(\tilde{x},\tilde{y}) \big| = \left| \frac{x}{y} - \frac{\tilde{x}}{\tilde{y}} \right| \le \left| \frac{x}{y} - \frac{x}{\tilde{y}} \right| + \left| \frac{x}{\tilde{y}} - \frac{\tilde{x}}{\tilde{y}} \right| = \left| \frac{x}{y\tilde{y}} \right| |y - \tilde{y}| + \frac{1}{|\tilde{y}|}\,|x - \tilde{x}|$$
$$\le \max\left\{ \frac{|x|}{|y|(|y|-\delta)}, \frac{1}{|y|-\delta} \right\} \big\|(x,y) - (\tilde{x},\tilde{y})\big\|_1 \le C \max\left\{ \frac{|x|}{|y|(|y|-\delta)}, \frac{1}{|y|-\delta} \right\} \big\|(x,y) - (\tilde{x},\tilde{y})\big\|_2$$
for some suitable $C \in \mathbb{R}^+$, showing that $g$ is well-conditioned in $B_\delta(x,y)$ for $0 < \delta < |y|$. The significance of the present example lies in its demonstration that the knowledge of $k_{\mathrm{rel}}(x,y)$ alone is not enough to determine the numerical stability of $g$
(c) We now consider the problem of complex multiplication, i.e.
$$f : \mathbb{C}^2 \longrightarrow \mathbb{C}, \quad f(z,w) := zw. \tag{2.55}$$
We check that $f$ is $\mathbb{C}$-differentiable with $Df(z,w) = (w,z)$: Indeed,
$$\lim_{h=(h_1,h_2)\to 0} \frac{\big| f\big((z,w) + (h_1,h_2)\big) - f(z,w) - (w,z)(h_1,h_2) \big|}{\|h\|_\infty}
= \lim_{h\to 0} \frac{\big| (z+h_1)(w+h_2) - zw - wh_1 - zh_2 \big|}{\|h\|_\infty}
= \lim_{h\to 0} \frac{|h_1 h_2|}{\max\{|h_1|,|h_2|\}} = \lim_{h\to 0} \min\{|h_1|,|h_2|\} = 0.$$
Since $(x+iy)(u+iv) = xu - yv + i(xv + yu)$, to apply Th. 2.47, we now consider complex multiplication as the $\mathbb{R}$-differentiable map
$$g : \mathbb{R}^4 \longrightarrow \mathbb{R}^2, \quad g(x,y,u,v) := (xu - yv,\, xv + yu), \quad
Dg(x,y,u,v) = \begin{pmatrix} u & -v & x & -y \\ v & u & y & x \end{pmatrix}. \tag{2.56}$$
Noting
$$Dg(x,y,u,v)\,\big(Dg(x,y,u,v)\big)^t = \begin{pmatrix} u & -v & x & -y \\ v & u & y & x \end{pmatrix} \begin{pmatrix} u & v \\ -v & u \\ x & y \\ -y & x \end{pmatrix} = \begin{pmatrix} u^2+v^2+x^2+y^2 & 0 \\ 0 & u^2+v^2+x^2+y^2 \end{pmatrix},$$
we obtain $\|Dg(x,y,u,v)\|_2 = \sqrt{u^2+v^2+x^2+y^2} = \|(x,y,u,v)\|_2$ and
$$k_{\mathrm{rel}}(x,y,u,v) := k_{\mathrm{rel}}(g,(x,y,u,v)) = \|Dg(x,y,u,v)\|_2\,\frac{\|(x,y,u,v)\|_2}{\|g(x,y,u,v)\|_2}
= \frac{x^2+y^2+u^2+v^2}{\sqrt{x^2u^2 + y^2v^2 + x^2v^2 + y^2u^2}} = \frac{x^2+y^2+u^2+v^2}{\sqrt{(x^2+y^2)(u^2+v^2)}}
= \frac{\|(x,y)\|_2}{\|(u,v)\|_2} + \frac{\|(u,v)\|_2}{\|(x,y)\|_2}. \tag{2.57}$$

Thus, letting $z := x + iy$ and $w := u + iv$, in generalization of what occurred in (a), the relative condition becomes arbitrarily large if $\|z\|_2 \gg \|w\|_2$ or $\|z\|_2 \ll \|w\|_2$. An estimate completely analogous to the one in (a) shows $g$ to be well-conditioned in $B_\delta(x,y,u,v) = B_\delta(z,w)$ for each $\delta > 0$: We have, for each
$$(\tilde{z},\tilde{w}) := (\tilde{x},\tilde{y},\tilde{u},\tilde{v}) \in B_\delta(x,y,u,v),$$
that
$$\big\| g(x,y,u,v) - g(\tilde{x},\tilde{y},\tilde{u},\tilde{v}) \big\|_2 = |zw - \tilde{z}\tilde{w}| \le |zw - z\tilde{w}| + |z\tilde{w} - \tilde{z}\tilde{w}| = |z|\,|w - \tilde{w}| + |\tilde{w}|\,|z - \tilde{z}|$$
$$\le \max\big\{ \|(z,w)\|_2, \|(\tilde{z},\tilde{w})\|_2 \big\}\,\big\|(z,w) - (\tilde{z},\tilde{w})\big\|_1 \le C\big( \delta + \|(z,w)\|_2 \big)\,\big\|(z,w) - (\tilde{z},\tilde{w})\big\|_2 = C\big( \delta + \|(x,y,u,v)\|_2 \big)\,\big\|(x,y,u,v) - (\tilde{x},\tilde{y},\tilde{u},\tilde{v})\big\|_2$$
with suitable $C \in \mathbb{R}^+$.

3 Interpolation

3.1 Motivation
Given a finite number of points (x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ), so-called data points, sup-
porting points, or tabular points, the goal is to find a function f such that f (xi ) = yi for
each i ∈ {0, . . . , n}. Then f is called an interpolate of the data points. The data points
can be measured values from a physical experiment or computed values (for example,
it could be that yi = g(xi ) and g(x) can, in principle, be computed, but it could be
very difficult and computationally expensive (g could be the solution of a differential
equation) – in that case, it can also be advantageous to interpolate the data points by
an approximation f to g such that f is efficiently computable).
To have an interpolate f at hand can be desirable for many reasons, including

(a) f (x) can then be computed at an arbitrary point.

(b) If f is differentiable, then derivatives can be computed at an arbitrary point, at least in principle.

(c) Computation of integrals of the form $\int_a^b f(x)\,\mathrm{d}x$, if f is integrable.

(d) Determination of extreme points of f (e.g. for R-valued f ).



(e) Determination of zeros of f (e.g. for R-valued f ).

While a general function f , defined on an infinite domain, is never determined by its


values on finitely many points, the situation can be different if it is known that f has
additional properties, for instance, that it has to lie in a certain set of functions or that
it can at least be approximated by functions from a certain function set. For example,
we will see in the next section that polynomials are determined by finitely many data
points.
The interpolate should be chosen such that it reflects properties expected from the ex-
act solution (if any). For example, polynomials are not the best choice if one expects
a periodic behavior or if the exact solution is known to have singularities. If periodic
behavior is expected, then interpolation by trigonometric functions is usually advised
(cf. Sections E.2 and E.3 of the Appendix), whereas interpolation by rational functions
is often desirable if singularities are expected (cf. [FH07, Sec. 2.2]). Due to time con-
straints, we will only study interpolations related to polynomials in this class, but one
should keep in the back of one’s mind that there are other types of interpolates that
might be more suitable, depending on the circumstances.

3.2 Polynomial Interpolation


If polynomial interpolation is to mean the problem of finding a polynomial (function)
that interpolates given data points, then we will see that (cf. Th. 3.4 below), in contrast
to the general interpolation problem described above, the problem is purely algebraic
with (in the absence of roundoff errors) an exact solution, not involving any approxi-
mation. However, for the reasons described above, one might then use the interpolating
polynomial as an approximation of some more general function.

3.2.1 Polynomials and Polynomial Functions

Definition and Remark 3.1. Let F be a field.

(a) We call
$$F[X] := F^{\mathbb{N}_0}_{\mathrm{fin}} := \big\{ (f : \mathbb{N}_0 \longrightarrow F) : \# f^{-1}(F\setminus\{0\}) < \infty \big\} \tag{3.1}$$
the set of polynomials over F (i.e. a polynomial over F is a sequence (ai )i∈N0 in
F such that all, but finitely many, of the entries ai are 0). On F [X], we have the

following pointwise-defined addition and scalar multiplication:

$$\forall_{f,g\in F[X]} \quad (f+g) : \mathbb{N}_0 \longrightarrow F, \quad (f+g)(i) := f(i) + g(i), \tag{3.2a}$$
$$\forall_{f\in F[X]}\ \forall_{\lambda\in F} \quad (\lambda\cdot f) : \mathbb{N}_0 \longrightarrow F, \quad (\lambda\cdot f)(i) := \lambda\, f(i), \tag{3.2b}$$
where we know from Linear Algebra (cf. [Phi19a, Ex. 5.16(c)]) that, with these compositions, $F[X]$ forms a vector space over $F$ with standard basis $B := \{e_i : i \in \mathbb{N}_0\}$, where
$$\forall_{i\in\mathbb{N}_0} \quad e_i : \mathbb{N}_0 \longrightarrow F, \quad e_i(j) := \delta_{ij} := \begin{cases} 1 & \text{for } i = j, \\ 0 & \text{for } i \neq j. \end{cases}$$

In the context of polynomials, we will usually use the common notation X i := ei


and we will call these polynomials monomials. Moreover, one commonly uses the
simplified notation X := X 1 ∈ F [X] and λ := λ X 0 ∈ F [X] for each λ ∈ F . Next,
one defines a multiplication on F [X] by letting

$$(a_i)_{i\in\mathbb{N}_0},\, (b_i)_{i\in\mathbb{N}_0} \;\mapsto\; (c_i)_{i\in\mathbb{N}_0} := (a_i)_{i\in\mathbb{N}_0} \cdot (b_i)_{i\in\mathbb{N}_0}, \qquad
c_i := \sum_{k+l=i} a_k b_l := \sum_{(k,l)\in(\mathbb{N}_0)^2:\, k+l=i} a_k b_l = \sum_{k=0}^{i} a_k b_{i-k}. \tag{3.3}$$

We then know from Linear Algebra (cf. [Phi19b, Th. 7.4(b)]) that, with the above
addition and multiplication, F [X] forms a commutative ring with unity, where
1 = X 0 is the neutral element of multiplication.

(b) If f := (ai )i∈N0 ∈ F [X], then we call the ai ∈ F the coefficients of f , and we define
the degree of f by
$$\deg f := \begin{cases} -\infty & \text{for } f \equiv 0, \\ \max\{i \in \mathbb{N}_0 : a_i \neq 0\} & \text{for } f \not\equiv 0. \end{cases} \tag{3.4}$$

If deg f = n ∈ N0 and an = 1, then the polynomial f is called monic. Clearly, for


each n ∈ N0 , the set

F [X]n := span{1, X, . . . , X n } = {f ∈ F [X] : deg f ≤ n}, (3.5)

of polynomials of degree at most n, forms an (n + 1)-dimensional vector subspace


of F [X] with basis Bn := {1, X, . . . , X n }.

Remark 3.2. In the situation of Def. 3.1, using the notation $X^i = e_i$, we can write addition, scalar multiplication, and multiplication in the following, perhaps, more familiar-looking forms: If $\lambda \in F$, $f = \sum_{i=0}^{n} f_i X^i$, $g = \sum_{i=0}^{n} g_i X^i$, $n \in \mathbb{N}_0$, $f_0,\dots,f_n, g_0,\dots,g_n \in F$, then
$$f + g = \sum_{i=0}^{n} (f_i + g_i)\, X^i, \qquad
\lambda f = \sum_{i=0}^{n} (\lambda f_i)\, X^i, \qquad
f g = \sum_{i=0}^{2n} \Bigg( \sum_{k+l=i} f_k g_l \Bigg) X^i.$$
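For readers who prefer to see the coefficient arithmetic of Rem. 3.2 in executable form, here is a minimal Python sketch (the function names poly_add and poly_mul are ours, not notation from these notes) that represents a polynomial by its finite list of coefficients and implements (3.3) as a convolution:

# A minimal sketch of the operations of Def. 3.1 / Rem. 3.2, with a polynomial
# represented by its coefficient list [f_0, f_1, ..., f_n] (here over the reals).
def poly_add(f, g):
    n = max(len(f), len(g))
    f = f + [0.0] * (n - len(f))
    g = g + [0.0] * (n - len(g))
    return [fi + gi for fi, gi in zip(f, g)]

def poly_mul(f, g):
    # c_i = sum_{k+l=i} a_k b_l, cf. (3.3)
    c = [0.0] * (len(f) + len(g) - 1)
    for k, fk in enumerate(f):
        for l, gl in enumerate(g):
            c[k + l] += fk * gl
    return c

# (1 + X) * (1 - X) = 1 - X^2
print(poly_mul([1.0, 1.0], [1.0, -1.0]))  # [1.0, 0.0, -1.0]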

Definition and Remark 3.3. Let F be a field.

(a) For each x ∈ F , the map


$$\epsilon_x : F[X] \longrightarrow F, \quad f \mapsto \epsilon_x(f) = \epsilon_x\Bigg( \sum_{i=0}^{\deg f} f_i X^i \Bigg) := \sum_{i=0}^{\deg f} f_i x^i, \tag{3.6}$$

is called the substitution homomorphism or evaluation homomorphism correspond-


ing to x (indeed, ǫx is both linear and a ring homomorphism, cf. [Phi19b, Def. and
Rem. 7.10]). We call x ∈ F a zero or a root of f ∈ F [X] if, and only if, ǫx (f ) = 0.
It is common to define
∀ f (x) := ǫx (f ), (3.7)
x∈F

even though this constitutes an abuse of notation, since, according to Def. and
Rem. 3.1(a), f ∈ F [X] is a function on N0 and not a function on F . However, the
notation of (3.7) tends to improve readability and the meaning of f (x) is usually
clear from the context.

(b) The map

φ : F [X] −→ F F = F(F, F ), f 7→ φ(f ), φ(f )(x) := ǫx (f ), (3.8)

constitutes a linear and unital ring homomorphism, which is injective if, and only
if, F is infinite (cf. [Phi19b, Th. 7.17]). Here, F F = F(F, F ) denotes the set of
functions from F into F , and φ being unital means it maps 1 = X 0 ∈ F [X] to the
constant function φ(X 0 ) ≡ 1. We define

Pol(F ) := φ(F [X])



and call the elements of Pol(F ) polynomial functions. Then φ : F [X] −→ Pol(F )
is an epimorphism (i.e. surjective) and an isomorphism if, and only if, F is infinite.
If F is infinite, then we also define

∀ deg P := deg φ−1 (P ) ∈ N0 ∪ {−∞}


P ∈Pol(F )

and 
∀ Poln (F ) := φ F [X]n = {P ∈ Pol(F ) : deg P ≤ n}.
n∈N0

3.2.2 Existence and Uniqueness

Theorem 3.4 (Polynomial Interpolation). Let F be a field and n ∈ N0 . Given n + 1


data points (x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ) ∈ F 2 , such that xi 6= xj for i 6= j, there exists
a unique interpolating polynomial f ∈ F [X]n , satisfying

∀ ǫxi (f ) = yi . (3.9)
i∈{0,1,...,n}

Moreover, one can identify f as the Lagrange interpolating polynomial, which is given
by the explicit formula
$$f := \sum_{j=0}^{n} y_j L_j, \quad\text{where}\quad
\forall_{j\in\{0,1,\dots,n\}} \quad L_j := \prod_{\substack{i=0 \\ i\neq j}}^{n} \frac{X - x_i}{x_j - x_i} = \prod_{\substack{i=0 \\ i\neq j}}^{n} \frac{X^1 - x_i X^0}{x_j - x_i} \tag{3.10}$$

(the Lj are called Lagrange basis polynomials).

Proof. Since f = f0 X 0 + f1 X 1 + · · · + fn X n , (3.9) is equivalent to the linear system

f0 + f1 x0 + · · · + fn xn0 = y0
f0 + f1 x1 + · · · + fn xn1 = y1
..
.
f0 + f1 xn + · · · + fn xnn = yn ,

for the n + 1 unknowns f0 , . . . , fn ∈ F . This linear system has a unique solution if, and
only if, the determinant
$$D := \begin{vmatrix} 1 & x_0 & x_0^2 & \dots & x_0^n \\ 1 & x_1 & x_1^2 & \dots & x_1^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \dots & x_n^n \end{vmatrix}$$
does not vanish. According to Linear Algebra, $D$ constitutes a so-called Vandermonde determinant, satisfying (cf. [Phi19b, Th. 4.35])
$$D = \prod_{\substack{i,j=0 \\ i>j}}^{n} (x_i - x_j).$$

In particular, D 6= 0, as we assumed xi 6= xj for i 6= j, proving both existence and


uniqueness of f (bearing in mind that the X 0 , . . . , X n form a basis of F [X]n ). It
remains
Pn to identify f with the Lagrange interpolating polynomial. To this end, set
g := j=0 yj Lj . Since, clearly, deg g ≤ n, it merely remains to show g satisfies (3.9).
Since the $x_i$ are distinct, we obtain
$$\forall_{k\in\{0,\dots,n\}} \quad \epsilon_{x_k}(g) = \sum_{j=0}^{n} y_j\,\epsilon_{x_k}(L_j) = \sum_{j=0}^{n} y_j\,\delta_{jk} = y_k,$$

thereby establishing the case. 
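As an illustration of Th. 3.4 over $F = \mathbb{R}$, the following Python sketch (the name lagrange_eval is ours) evaluates the Lagrange interpolating polynomial (3.10) at a given point without computing its coefficients:

# A small sketch of Th. 3.4 over F = R: evaluate the Lagrange interpolating
# polynomial (3.10) at a point x.
def lagrange_eval(nodes, values, x):
    assert len(nodes) == len(values)
    total = 0.0
    for j, yj in enumerate(values):
        Lj = 1.0
        for i, xi in enumerate(nodes):
            if i != j:
                Lj *= (x - xi) / (nodes[j] - xi)  # Lagrange basis polynomial L_j(x)
        total += yj * Lj
    return total

# The unique f in R[X]_2 through (0,1), (1,3), (2,2):
print(lagrange_eval([0.0, 1.0, 2.0], [1.0, 3.0, 2.0], 0.5))
print(lagrange_eval([0.0, 1.0, 2.0], [1.0, 3.0, 2.0], 1.0))  # reproduces y_1 = 3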


Example 3.5. Let $F$ be a field. We compute the first three Lagrange basis polynomials and resulting Lagrange interpolating polynomials, given data points $(x_0,y_0), (x_1,y_1), (x_2,y_2) \in F^2$ with $x_0, x_1, x_2$ being distinct: For $n = 0$, the product in (3.10) is empty, such that $L_0 = 1 = X^0$ and $f_0 = y_0 = y_0 X^0$. For $n = 1$, (3.10) yields
$$L_0 = \frac{X - x_1}{x_0 - x_1}, \qquad L_1 = \frac{X - x_0}{x_1 - x_0},$$
and
$$f_1 = y_0 L_0 + y_1 L_1 = y_0\,\frac{X - x_1}{x_0 - x_1} + y_1\,\frac{X - x_0}{x_1 - x_0} = y_0 + \frac{y_1 - y_0}{x_1 - x_0}\,(X - x_0),$$
which is the straight line through $(x_0,y_0)$ and $(x_1,y_1)$. Finally, for $n = 2$, (3.10) yields
$$L_0 = \frac{(X - x_1)(X - x_2)}{(x_0 - x_1)(x_0 - x_2)}, \qquad L_1 = \frac{(X - x_0)(X - x_2)}{(x_1 - x_0)(x_1 - x_2)}, \qquad L_2 = \frac{(X - x_0)(X - x_1)}{(x_2 - x_0)(x_2 - x_1)},$$
and
$$f_2 = y_0 L_0 + y_1 L_1 + y_2 L_2 = y_0 + \frac{y_1 - y_0}{x_1 - x_0}\,(X - x_0) + \frac{1}{x_2 - x_0}\left( \frac{y_2 - y_1}{x_2 - x_1} - \frac{y_1 - y_0}{x_1 - x_0} \right)(X - x_0)(X - x_1).$$


The above formulas for f0 , f1 , f2 indicate that the interpolating polynomial can be built
recursively, and we will see below (cf. (3.20a)) that this is, indeed, the case. Having a
recursive formula at hand can be advantageous in many circumstances. For example,
even though the complexity of building the interpolating polynomial from scratch is
O(n2 ) in either case, if one already has the interpolating polynomial for x0 , . . . , xn , then,
using the recursive formula (3.20a) below, one obtains the interpolating polynomial for
x0 , . . . , xn , xn+1 in just O(n) steps.
Looking at the structure of f2 above, one sees that, while the first coefficient is just y0 , the
second one is a difference quotient (which, for F = R, can be seen as an approximation
of some derivative y ′ (x)), and the third coefficient is a difference quotient of difference
quotients (which, for F = R, can be seen as an approximation of some second derivative
y ′′ (x)). Such iterated difference quotients are called divided differences and we will see in
Th. 3.17 below that the same structure does, indeed, hold for interpolating polynomials
of all orders. However, we will first introduce and study a slightly more general form
of polynomial interpolation, so-called Hermite interpolation, where the given points
x0 , . . . , xn no longer need to be distinct.

3.2.3 Hermite Interpolation

So far, we have determined an interpolating polynomial f of degree at most n from


values at n + 1 distinct points x0 , . . . , xn ∈ F . Alternatively, one can prescribe the
values of f at less than n + 1 points and, instead, additionally prescribe the values of
derivatives of f (cf. Def. 3.6 below) at some of the same points, where the total number
of pieces of data still needs to be n + 1. For example, to determine f ∈ R[X] of degree at
most 7, one can prescribe f (1), f (2), f ′ (2), f (3), f ′ (3), f ′′ (3), f ′′′ (3), and f (4) (3). One
then says that x1 = 2 is considered with multiplicity 2 and x2 = 3 is considered with
multiplicity 5. This particular kind of polynomial interpolation is known as Hermite
interpolation.
While one is probably most familiar with the derivative as an analytic concept (e.g.
for functions from R into R), for polynomials, one can define the derivative in entirely
algebraic terms:
Definition 3.6. Let $F$ be a field. Let $n \in \mathbb{N}_0$ and $f_0,\dots,f_n \in F$. For $f = \sum_{i=0}^{n} f_i X^i \in F[X]$, we define the derivative $f' \in F[X]$ of $f$ by letting
$$f' := 0 \ \text{ for } n = 0, \qquad f' := \sum_{i=0}^{n-1} (i+1)\, f_{i+1}\, X^i \ \text{ for } n \ge 1.$$
Higher derivatives are defined recursively, in the usual way, by letting
$$f^{(0)} := f, \qquad \forall_{k\in\mathbb{N}_0} \quad f^{(k+1)} := \big(f^{(k)}\big)',$$
and we make use of the notation $f'' := f^{(2)}$, $f''' := f^{(3)}$.
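Since the derivative of Def. 3.6 is a purely algebraic operation on the coefficient sequence, it is a one-line computation; the following Python sketch (the name poly_derivative is ours) shows it on the coefficient-list representation used in the sketch after Rem. 3.2:

# A minimal sketch of Def. 3.6: the (purely algebraic) derivative of a polynomial
# given by its coefficient list [f_0, ..., f_n].
def poly_derivative(f):
    if len(f) <= 1:
        return [0.0]
    return [(i + 1) * f[i + 1] for i in range(len(f) - 1)]

# f = 1 + 2X + 3X^2  =>  f' = 2 + 6X
print(poly_derivative([1.0, 2.0, 3.0]))  # [2.0, 6.0]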

Proposition 3.7. Let F be a field.

(a) Forming the derivative in F [X] is linear, i.e.

∀ ∀ (λf + µg)′ = λf ′ + µg ′ .
f,g∈F [X] λ,µ∈F

(b) Forming the derivative in F [X] satisfies the product rule, i.e.

∀ (f g)′ = f ′ g + f g ′ .
f,g∈F [X]

(c) For each $x \in F$ and each $n \in \mathbb{N}$, one has
$$\big( (X - x)^n \big)' = n\,(X - x)^{n-1}.$$

(d) Forming higher derivatives in $F[X]$ satisfies the general Leibniz rule, i.e.
$$\forall_{n\in\mathbb{N}_0}\ \forall_{f,g\in F[X]} \quad (fg)^{(n)} = \sum_{k=0}^{n} \binom{n}{k}\, f^{(n-k)}\, g^{(k)}.$$

(e) For each $x \in F$, one has, setting $g^{(-1)} := 0$,
$$\forall_{n\in\mathbb{N}_0}\ \forall_{g\in F[X]} \quad \big( (X - x)\, g \big)^{(n)} = n\, g^{(n-1)} + (X - x)\, g^{(n)}.$$

Proof. Exercise. 

Theorem 3.8. Let F be a field, n ∈ N, and assume char F = 0 or char F ≥ n (i.e., in


F , k · 1 6= 0 for each k ∈ {1, . . . , n − 1}, cf. [Phi19a, Def. and Rem. 4.43]). Let f ∈ F [X]
with deg f = n, x ∈ F . If m ∈ {0, . . . , n − 1} and, using the notation of (3.7),

f (x) = f ′ (x) = · · · = f (m) (x) = 0, (3.11)

then x is a zero of multiplicity at least m + 1 for f , i.e. (cf. [Phi19b, Cor. 7.14])
 
∃ deg q = n − m − 1 ∧ f = q (X − x)m+1 .
q∈F [X]

Proof. We conduct the proof via induction on n: If n = 1, then m = 0, f (x) = 0 and,


thus, f = q (X − x) with q ∈ F [X] and deg q = n − 1 by [Phi19b, Prop. 7.13]. Now fix
n ∈ N with n ≥ 2 and conduct an induction over m ∈ {0, . . . , n − 1}: The case m = 0
holds as before. Thus, let 0 < m ≤ n − 1. By the induction hypothesis for m − 1, there
exists g ∈ F [X] with deg g = n − m such that

f = g (X − x)m . (3.12)

Taking derivatives in (3.12) and using Prop. 3.7(b),(c) yields

f ′ = g ′ (X − x)m + m g (X − x)m−1 . (3.13)

As we are assuming (3.11) and deg f ′ = n−1, we can now apply the induction hypothesis
for n − 1 to f ′ to obtain h ∈ F [X] with deg h = n − 1 − m, satisfying f ′ = h (X − x)m .
Using this in (3.13) yields

h (X − x)m = g ′ (X − x)m + m g (X − x)m−1 .

Computing in the field of rational fractions F (X) (cf. [Phi19b, Ex. 7.40(b)]), we may
divide by (X − x)m−1 to obtain

h (X − x) = g ′ (X − x) + m g.

As we assume $m \neq 0$ in $F$, we can solve this equation for $g$, yielding
$$g = \frac{(X - x)\,(h - g')}{m}. \tag{3.14}$$
Setting $q := \frac{1}{m}\,(h - g') \in F[X]$ and plugging (3.14) into (3.12) yields

f = q (X − x)m+1 .

As this implies n = deg f = deg q + m + 1 and deg q = n − m − 1, both inductions and


the proof are complete. 

Example 3.9. The present example shows that, in general, one can not expect the
conclusion of Th. 3.8 to hold without the hypothesis regarding the field’s characteristic:
Consider F := Z2 = {0, 1} (i.e. F is the field with two elements, char F = 2) and

f := X 3 + X 2 = X 2 (X + 1), f ′ = 3X 2 + 2X = X 2 , f ′′ = 2X = 0.

Then f (0) = f ′ (0) = f ′′ (0) = 0 (i.e. we have n = 3 and m = 2 in Th. 3.8), but x = 0 is
not a zero of multiplicity 3 of f (but merely a zero of multiplicity 2).

Theorem 3.10 (Hermite Interpolation). Let $F$ be a field and $r, n \in \mathbb{N}_0$. Let $\tilde{x}_0,\dots,\tilde{x}_r \in F$ be $r+1$ distinct elements of $F$. Moreover, let $m_0,\dots,m_r \in \mathbb{N}$ be such that $\sum_{k=0}^{r} m_k = n+1$ and assume $y_j^{(m)} \in F$ for each $j \in \{0,\dots,r\}$, $m \in \{0,\dots,m_j-1\}$.

(a) The map


A : F [X]n −→ F n+1 ,

A(p) := p(x̃0 ), . . . , p(m0 −1) (x̃0 ), . . . , p(x̃r ), . . . , p(mr −1) (x̃r ) .
is linear.
(b) Assume char F = 0 or char F ≥ M := max{m0 , . . . , mr }. Then there exists a
unique polynomial f ∈ F [X]n (i.e. of degree at most n) such that (using the notation
of (3.7))
(m)
f (m) (x̃j ) = yj for each j ∈ {0, . . . , r}, m ∈ {0, . . . , mj − 1}. (3.15)

Proof. (a): If p, q ∈ F [X]n and λ, µ ∈ F , then


 Prop. 3.7(a) 
A(λp + µq) = . . . , (λp + µq)(l) (x̃k ), . . . = . . . , (λp(l) + µq (l) )(x̃k ), . . .

= . . . , λp(l) (x̃k ) + µq (l) (x̃k ), . . .
 
= λ . . . , p(l) (x̃k ), . . . + µ . . . , q (l) (x̃k ), . . . = λA(p) + µA(q),
proving the linearity of A.
(b): The existence and uniqueness statement of the theorem is the same as saying that
the map A of (a) is bijective. Since A is linear by (a) and since dim F [X]n = dim F n+1 =
n + 1, it suffices to show that A is injective, i.e. ker A = {0}. If A(p) = (0, . . . , 0), then,
by Th. 3.8, each x̃k is a zero of p with multiplicity at least mk , such that p has at least
n + 1 zeros. More precisely, Th. 3.8 implies that, if pQ6= 0, then the prime factorization
of p (cf. [Phi19b, Cor. 7.37]) must contain the factor rk=0 (X − x̃k )mk , i.e. deg p ≥ n + 1.
Since deg p ≤ n, we know p = 0, showing ker A = {0}, thereby completing the proof. 
Example 3.11. The present example shows that, in general, one can not expect the
conclusion of Th. 3.10(b) to hold without the hypothesis regarding the field’s character-
istic: We show that, for F := Z2 = {0, 1}, the Hermite interpolation problem can have
multiple solutions or no solutions: The 4 conditions
f (0) = f ′ (0) = f ′′ (0) = 0, f (1) = 0
are all satisfied by both f = 0 ∈ F [X]3 and by f = X 3 + X 2 ∈ F [X]3 (cf. Ex. 3.9). On
the other hand, since (X 3 )′ = 3X 2 = X 2 and (X 2 )′ = 2X = 0, one has f ′′ = 0 for each
f ∈ F [X]3 , i.e. the 4 conditions
f (0) = f ′ (0) = f ′′ (0) = 1, f (1) = 0

are not satisfied by any f ∈ F [X]3 .

Definition 3.12. As in Th. 3.10, let $F$ be a field and $r, n \in \mathbb{N}_0$; let $\tilde{x}_0,\dots,\tilde{x}_r \in F$ be $r+1$ distinct elements of $F$; let $m_0,\dots,m_r \in \mathbb{N}$ be such that $\sum_{k=0}^{r} m_k = n+1$; and assume $y_j^{(m)} \in F$ for each $j \in \{0,\dots,r\}$, $m \in \{0,\dots,m_j-1\}$. Also, as in Th. 3.10(b),
assume char F = 0 or char F ≥ M := max{m0 , . . . , mr }. Finally, let (x0 , . . . , xn ) ∈ F n+1
be such that, for each k ∈ {0, . . . , r}, precisely mk of the xj are equal to x̃k (i.e. x̃k occurs
precisely mk times in the vector (x0 , . . . , xn )).

(a) Let P [x0 , . . . , xn ] := f ∈ F [X]n denote the unique interpolating polynomial satisfy-
(m)
ing (3.15). As a caveat, we note that P [x0 , . . . , xn ] also depends on the yj , which
is, however, suppressed in the notation.

(b) If x0 , . . . , xn are all distinct (i.e. n = r and (x0 , . . . , xn ) constitutes a permutation


of (x̃0 , . . . , x̃r )) and g : F −→ F is a function such that
(0)
∀ g(x̃j ) = yj ,
j∈{0,...,n}

then we define the additional notation

P [g|x0 , . . . , xn ] := P [x0 , . . . , xn ].

(c) Let F = R with a, b ∈ R, a < b, such that x̃0 , . . . , x̃r ∈ [a, b], and let g : [a, b] −→ R
be an M times differentiable function such that
(m)
g (m) (x̃j ) = yj for each j ∈ {0, . . . , r}, m ∈ {0, . . . , mj − 1}.

Then, as in (b), we define the additional notation

P [g|x0 , . . . , xn ] := P [x0 , . . . , xn ].

We remain in the situation of Def. 3.12 and denote the interpolating polynomial by
f := P [x0 , . . . , xn ]. If one is only interested in the value f (x) for some x ∈ F , rather
than all the coefficients of f (or rather than evaluating f at many different elements of
F ), then an efficient way to obtain f (x) is by means of a so-called Neville tableau (cf.
Rem. 3.14 below), making use of the recursion provided by the following Th. 3.13(a).

Theorem 3.13. We consider the situation of Def. 3.12.



(a) The interpolating polynomials satisfy the following recursion: For $r = 0$, one has
$$P[x_0,\dots,x_n] = P[\tilde{x}_0,\dots,\tilde{x}_0] = \sum_{j=0}^{n} \frac{(X - \tilde{x}_0)^j}{j!}\, y_0^{(j)}. \tag{3.16a}$$
For $r > 0$, one has, for each $k, l \in \{0,\dots,n\}$ such that $x_k \neq x_l$,
$$P[x_0,\dots,x_n] = \frac{(X - x_l)\,P[x_0,\dots,\widehat{x_l},\dots,x_n] - (X - x_k)\,P[x_0,\dots,\widehat{x_k},\dots,x_n]}{x_k - x_l} \tag{3.16b}$$
(the hatted symbols $\widehat{x_l}$ and $\widehat{x_k}$ mean that the corresponding arguments are omitted).

(b) The interpolating polynomials do not depend on the order of the x0 , . . . , xn ∈ F ,


i.e., for each permutation π on {0, . . . , n},

P [xπ(0) , . . . , xπ(n) ] = P [x0 , . . . , xn ].

Proof. (a): The case r = 0 holds, since, clearly,



 j
(k) 
0 for k < j,
(X − x̃0 )
∀ (x̃0 ) = 1 for k = j,
k∈N0 j! 

0 for k > j,

implying
n
!(m)
X (X − x̃0 )j (j) (m)
∀ y0 (x̃0 ) = y0 .
m∈{0,...,n}
j=0
j!

We now consider the case r > 0. Letting p := P [x0 , . . . , xn ] and letting q denote the
polynomial given by the right-hand side of (3.16b), we have to show p = q. Clearly,
both p and q are in F [X]n . Thus, the proof is complete, if we can show that q also
satisfies (3.15), i.e.
(m)
q (m) (x̃j ) = yj for each j ∈ {0, . . . , r}, m ∈ {0, . . . , mj − 1}. (3.17)

To verify (3.17), given k, l ∈ {0, . . . , n} with xk 6= xl , we use Prop. 3.7(e) to obtain, for
each m ∈ N0 ,

m P [x0 , . . . , xbl , . . . , xn ](m−1) + (X − xl )P [x0 , . . . , xbl , . . . , xn ](m)


q (m) =
xk − xl
(m−1)
m P [x0 , . . . , xbk , . . . , xn ] + (X − xk )P [x0 , . . . , xbk , . . . , xn ](m)
− . (3.18)
xk − xl

Now we fix j ∈ {0, . . . , r} such that x̃j = xk and let m ∈ {0, . . . , mj − 1}. Then (setting
(−1)
yj := 0)
(m−1) (m) (m−1)
(m)
m yj + (xk − xl ) yj − m yj −0 (m)
q (x̃j ) = = yj .
xk − xl
Similarly, for j ∈ {0, . . . , r} such that x̃j = xl :
(m−1) (m−1) (m)
(m)
m yj + 0 − m yj − (xl − xk ) yj (m)
q (x̃j ) = = yj .
xk − xl
Finally, if j ∈ {0, . . . , r} is such that x̃j ∈
/ {xk , xl }, then
(m−1) (m) (m−1) (m)
(m)
m yj + (x̃j − xl ) yj − m yj − (x̃j − xk ) yj (m)
q (x̃j ) = = yj ,
xk − xl
proving that (3.17) does, indeed, hold.
(b) holds due to the fact that the interpolating polynomial P [x0 , . . . , xn ] is determined
by the conditions (3.15), which do not depend on the order of x0 , . . . , xn (according to
Def. 3.12, this order independence of the interpolating polynomial is already built into
the definition of P [x0 , . . . , xn ]). 

Remark 3.14. To use the recursion (3.16), to compute P [x0 , . . . , xn ](x) for x ∈ F , we
now assume that (x0 , . . . , xn ) is ordered in such a way that identical entries are adjacent:

(x0 , . . . , xn ) = (x̃0 , . . . , x̃0 , x̃1 , . . . , x̃1 , . . . , x̃r , . . . , x̃r ). (3.19)

Moreover, for fixed x ∈ F , we now define

∀ Pαβ := P [xα−β , . . . , xα ](x).


α,β∈{0,...,n},
α≥β

Then $P_{nn} = P[x_0,\dots,x_n](x)$ is the desired value. According to Th. 3.13(a) and (3.19),
$$P_{\alpha\beta} = \sum_{k=0}^{m} \frac{(x - \tilde{x}_j)^k}{k!}\, y_j^{(k)} \qquad \text{for } x_{\alpha-\beta} = x_\alpha = \tilde{x}_j \text{ and } m = \beta,$$
$$P_{\alpha\beta} = \frac{(x - x_{\alpha-\beta})\,P[x_{1+\alpha-\beta},\dots,x_\alpha](x) - (x - x_\alpha)\,P[x_{\alpha-\beta},\dots,x_{\alpha-1}](x)}{x_\alpha - x_{\alpha-\beta}}
= \frac{(x - x_{\alpha-\beta})\,P_{\alpha,\beta-1} - (x - x_\alpha)\,P_{\alpha-1,\beta-1}}{x_\alpha - x_{\alpha-\beta}} \qquad \text{for } x_{\alpha-\beta} \neq x_\alpha.$$

In consequence, Pnn can be computed, in O(n2 ) steps, via the following Neville tableau:
(0)
y0 = P00
ց
(0)
y0 = P10 → P11
ց ց
(0)
y0 = P20 → P21 → P22
.. .. .. . .
. . . .
(0)
yr = Pn−1,0 → Pn−1,1 → . . . . . .Pn−1,n−1
ց ց ց
(0)
yr = Pn0 → Pn1 → . . . . . . Pn,n−1 → Pnn

Example 3.15. Consider a field F with char F 6= 2, n := 4, r := 1, and let

x̃0 := 1, x̃1 := 2,
y0 := 1, y0′ := 4, y1 := 3, y1′ := 1, y1′′ := 2.

Setting x := 3, we want to compute P44 = P [1, 1, 2, 2, 2](3), using a Neville tableau


according to Rem. 3.14. In preparation, we compute
(1)
P [1, 1](3) = y0 + (3 − 1) y0 = 1 + 2 · 4 = 9,
(1)
P [2, 2](3) = y1 + (3 − 2) y1 = 3 + 1 = 4,
(3 − 2)2 (2)
P [2, 2, 2](3) = P [2, 2](3) + y1 = 4 + 1 = 5.
2!
Thus, we obtain the Neville tableau
P00 = y0 = 1
ց
P10 = y0 = 1 → P11 = P [1, 1](3) = 9
ց ց
P20 = y1 = 3 → P21 = P [1, 2](3) = (3−1)·3−(3−2)·1
2−1 =5 → P22 = P [1, 1, 2](3) = 2·5−9
2−1 =1
ց ց
2·4−5
P30 = y1 = 3 → P31 = P [2, 2](3) = 4 → P32 = P [1, 2, 2](3) = 2−1 =3
ց ց
P40 = y1 = 3 → P41 = P [2, 2](3) = 4 → P42 = P [2, 2, 2](3) = 5

and, continued:
... P22 = 1
ց
2·3−1
... P32 = 3 → P33 = P [1, 1, 2, 2](3) = 2−1 =5
ց ց
2·5−3 2·7−5
... P42 = 5 → P43 = P [1, 2, 2, 2](3) = 2−1 =7 → P44 = P [1, 1, 2, 2, 2](3) = 2−1 = 9.
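For pairwise distinct nodes, the recursion of Rem. 3.14 can be coded compactly by overwriting a single array; the following Python sketch (the name neville is ours) implements this special case only – the repeated-node blocks of a Hermite tableau, such as those occurring in Ex. 3.15, would additionally require the Taylor-type formula (3.16a) on the diagonal:

# A sketch of the Neville recursion of Rem. 3.14 for pairwise distinct nodes.
def neville(nodes, values, x):
    # p[alpha] holds P_{alpha,beta} after the beta-th sweep, cf. Rem. 3.14
    p = list(values)
    n = len(nodes)
    for beta in range(1, n):
        for alpha in range(n - 1, beta - 1, -1):
            p[alpha] = ((x - nodes[alpha - beta]) * p[alpha]
                        - (x - nodes[alpha]) * p[alpha - 1]) / (nodes[alpha] - nodes[alpha - beta])
    return p[n - 1]

# Interpolate (0,1), (1,3), (2,2) and evaluate at 1.5:
print(neville([0.0, 1.0, 2.0], [1.0, 3.0, 2.0], 1.5))   # 2.875
# Evaluating at a node reproduces the data value:
print(neville([0.0, 1.0, 2.0], [1.0, 3.0, 2.0], 1.0))   # 3.0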

3.2.4 Divided Differences and Newton’s Interpolation Formula

We now come back to the divided differences mentioned at the end of Sec. 3.2.2. We
proceed to immediately define divided differences in the general context of Hermite
interpolation:

Definition 3.16. As in Def. 3.12, let $F$ be a field and $r, n \in \mathbb{N}_0$; let $\tilde{x}_0,\dots,\tilde{x}_r \in F$ be $r+1$ distinct elements of $F$; let $m_0,\dots,m_r \in \mathbb{N}$ be such that $\sum_{k=0}^{r} m_k = n+1$; and assume $y_j^{(m)} \in F$ for each $j \in \{0,\dots,r\}$, $m \in \{0,\dots,m_j-1\}$; assume $\operatorname{char} F = 0$ or
char F ≥ M := max{m0 , . . . , mr }; and let (x0 , . . . , xn ) ∈ F n+1 be such that, for each
k ∈ {0, . . . , r}, precisely mk of the xj are equal to x̃k (i.e. x̃k occurs precisely mk times
in the vector (x0 , . . . , xn )).

Pn the iinterpolating polynomial f := P [x0 , . . . , xn ] ∈ F [X]n in the form


(a) We write
f = i=0 ai X with a0 , . . . , an ∈ F . The value

[x0 , . . . , xn ] := an ∈ F

is called a divided difference of nth order (as P [x0 , . . . , xn ], it is determined by the


(m) (m)
xj together with the yj , but, once again, the dependence on the yj is suppressed
in the notation).

(b) If x0 , . . . , xn are all distinct (i.e. n = r and (x0 , . . . , xn ) constitutes a permutation


of (x̃0 , . . . , x̃r )) and g : F −→ F is a function such that
(0)
∀ g(x̃j ) = yj ,
j∈{0,...,n}

then we define the additional notation

[g|x0 , . . . , xn ] := [x0 , . . . , xn ].

(c) Let F = R with a, b ∈ R, a < b, such that x̃0 , . . . , x̃r ∈ [a, b], and let g : [a, b] −→ R
be an M times differentiable function such that
(m)
g (m) (x̃j ) = yj for each j ∈ {0, . . . , r}, m ∈ {0, . . . , mj − 1}.

Then, as in (b), we define the additional notation

[g|x0 , . . . , xn ] := [x0 , . . . , xn ].

Theorem 3.17. In the situation of Def. 3.16, the interpolating polynomial P [x0 , . . . , xn ]
∈ F [X]n can be written in the following form known as Newton’s divided difference
interpolation formula:
$$P[x_0,\dots,x_n] = \sum_{j=0}^{n} [x_0,\dots,x_j]\,\omega_j, \tag{3.20a}$$
where, for each $j \in \{0,1,\dots,n\}$,
$$\omega_j := \prod_{i=0}^{j-1} (X - x_i) \in F[X]_j \subseteq F[X]_n \tag{3.20b}$$
are the so-called Newton basis polynomials. In particular, (3.20a) means that the interpolating polynomials satisfy the recursive relation
$$P[x_0] = y_0^{(0)}, \tag{3.21a}$$
$$P[x_0,\dots,x_j] = P[x_0,\dots,x_{j-1}] + [x_0,\dots,x_j]\,\omega_j \quad\text{for } 1 \le j \le n. \tag{3.21b}$$

Proof. We conduct the proof via induction on n ∈ N0 : Noting ω0 = 1 (empty product),


the base case (n = 0) is provided by the true statement
(0)
P [x0 ] = y0 = [x0 ] ω0 .
Now let n > 0. Then, according to Def. 3.16, there exist a0 , . . . , an−1 ∈ F and q ∈
F [X]n−1 such that
n−1
X
n
p := P [x0 , . . . , xn ] = [x0 , . . . , xn ] X + ai X i = [x0 , . . . , xn ] ωn + q.
i=0

Thus, q = p − [x0 , . . . , xn ] ωn and, to show q = pn−1 := P [x0 , . . . , xn−1 ], we need to show


q satisfies
(m)
q (m) (x̃j ) = yj for each j ∈ {0, . . . , r} \ {j0 }, m ∈ {0, . . . , mj − 1}, (3.22a)
(m)
q (m) (x̃j0 ) = yj0 for each m ∈ {0, . . . , mj0 − 1} \ {mj0 − 1}, (3.22b)
where j0 ∈ {0, . . . , r} is such that xn = x̃j0 . However, q does satisfy (3.22), since
(3.22) is satisfied with q replaced by p and, by (3.20b) in combination with Prop. 3.7(d),
(m)
ωn (x̃j ) = 0 for each j ∈ {0, . . . , r} \ {j0 } and each m ∈ {0, . . . , mj − 1}, as well as for
each m ∈ {0, . . . , mj0 − 1} \ {mj0 − 1} for j = j0 . Thus, q = pn−1 , showing
n−1
X
ind.hyp.
p = [x0 , . . . , xn ] ωn + q = [x0 , . . . , xn ] ωn + pn−1 = [x0 , . . . , xn ] ωn + [x0 , . . . , xj ] ωj ,
j=0

thereby completing the induction. 



We can now make use of our knowledge regarding interpolating polynomials to infer
properties of the divided differences (in particular, we will see that divided differences
can also be efficiently computed using Neville tableaus):

Theorem 3.18. In the situation of Def. 3.16, the divided differences satisfy the follow-
ing:

(a) Recursion: For $r = 0$, one has
$$[x_0,\dots,x_n] = [\tilde{x}_0,\dots,\tilde{x}_0] = \frac{y_0^{(n)}}{n!}. \tag{3.23a}$$
For $r > 0$, one has, for each $k, l \in \{0,\dots,n\}$ such that $x_k \neq x_l$,
$$[x_0,\dots,x_n] = \frac{[x_0,\dots,\widehat{x_l},\dots,x_n] - [x_0,\dots,\widehat{x_k},\dots,x_n]}{x_k - x_l} \tag{3.23b}$$
(where, as before, the hatted symbols $\widehat{x_l}$ and $\widehat{x_k}$ mean that the corresponding arguments are omitted).

(b) [x0 , . . . , xn ] does not depend on the order of the x0 , . . . , xn ∈ F , i.e., for each per-
mutation π on {0, . . . , n},

[xπ(0) , . . . , xπ(n) ] = [x0 , . . . , xn ].

Proof. (a): Case r = 0: According to (3.16a), we have

Xn
(X − x̃0 )j (j)
P [x0 , . . . , xn ] = P [x̃0 , . . . , x̃0 ] = y0 ,
j=0
j!

(n)
y
i.e. the coefficient of X n is n! 0
. Case r > 0: According to (3.16b), we have, for each
k, l ∈ {0, . . . , n} such that xk 6= xl ,

(X − xl )P [x0 , . . . , xbl , . . . , xn ] − (X − xk )P [x0 , . . . , xbk , . . . , xn ]


P [x0 , . . . , xn ] = ,
xk − xl
where, comparing the coefficient of X n on the left-hand side with the coefficient of X n
on the right-hand side immediately proves (3.23b).
(b) is immediate from Th. 3.13(b). 

Remark 3.19. Analogous to Rem. 3.14, to use the recursion (3.23) to compute the
divided difference [x0 , . . . , xn ], we now assume that (x0 , . . . , xn ) is ordered in such a way
that identical entries are adjacent, i.e.

(x0 , . . . , xn ) = (x̃0 , . . . , x̃0 , x̃1 , . . . , x̃1 , . . . , x̃r , . . . , x̃r ).

Then Pnn = [x0 , . . . , xn ] can be computed via the Neville tableau of Rem. 3.14, provided
that we redefine
∀ Pαβ := [xα−β , . . . , xα ],
α,β∈{0,...,n},
α≥β

since, with this redefinition, Th. 3.18(a) implies
$$P_{\alpha\beta} = \frac{y_j^{(m)}}{m!} \qquad \text{for } x_{\alpha-\beta} = x_\alpha = \tilde{x}_j \text{ and } m = \beta,$$
$$P_{\alpha\beta} = \frac{[x_{1+\alpha-\beta},\dots,x_\alpha] - [x_{\alpha-\beta},\dots,x_{\alpha-1}]}{x_\alpha - x_{\alpha-\beta}} = \frac{P_{\alpha,\beta-1} - P_{\alpha-1,\beta-1}}{x_\alpha - x_{\alpha-\beta}} \qquad \text{for } x_{\alpha-\beta} \neq x_\alpha.$$

Example 3.20. We remain in the situation of Def. 3.16.

(a) Suppose char F 6= 2, x0 6= x1 , and we want to compute [x0 , x1 , x1 , x1 ] according to


(3.23). First, we obtain [x0 ], [x1 ] , [x1 , x1 ], and [x1 , x1 , x1 ] from (3.23a). Then we
use (3.23b) to compute, in succession,

[x1 ] − [x0 ] [x1 , x1 ] − [x0 , x1 ]


[x0 , x1 ] = , [x0 , x1 , x1 ] = ,
x1 − x0 x1 − x0
[x1 , x1 , x1 ] − [x0 , x1 , x1 ]
[x0 , x1 , x1 , x1 ] = .
x1 − x0

(0)
(b) Suppose x0 , . . . , xn ∈ F are all distinct and set yj := yj for each j ∈ {0, . . . , n}.
Then the Neville tableau of Rem. 3.19 takes the form
y0 = [x0 ]
ց
y1 = [x1 ] → [x0 , x1 ]
ց ց
y2 = [x2 ] → [x1 , x2 ] → [x0 , x1 , x2 ]
.. .. .. ..
. . . .
yn−1 = [xn−1 ] → [xn−2 , xn−1 ] → ... ... [x0 , . . . , xn−1 ]
ց ց ց
yn = [xn ] → [xn−1 , xn ] → ... ... [x1 , . . . , xn ] → [x0 , . . . , xn ]

(c) As in Ex. 3.15, consider a field F with char F 6= 2, n := 4, r := 1, and the given
data

x̃0 = 1, x̃1 = 2,
y0 = 1, y0′ = 4, y1 = 3, y1′ = 1, y1′′ = 2.

This time, we seek to obtain the complete interpolating polynomial P [1, 1, 2, 2, 2],
using Newton’s interpolation formula (3.20a):

P [1, 1, 2, 2, 2] = [1] + [1, 1](X − 1) + [1, 1, 2](X − 1)2 + [1, 1, 2, 2](X − 1)2 (X − 2)
+ [1, 1, 2, 2, 2](X − 1)2 (X − 2)2 . (3.24)

The coefficients can be computed from the Neville tableau given by Rem. 3.19:
[1] = y0 = 1
ց
[1] = y0 = 1 → [1, 1] = y0′ = 4
ց ց
3−1 2−4
[2] = y1 = 3 → [1, 2] = =2 →
2−1 [1, 1, 2] = 2−1 = −2
ց ց ց
1−2 −1−(−2)
[2] = y1 = 3 → [2, 2] = y1′ = 1 → [1, 2, 2] = 2−1 = −1 → [1, 1, 2, 2] = 2−1 =1
ց ց ց
′ y1′′ 1−(−1)
[2] = y1 = 3 → [2, 2] = y1 = 1 → [2, 2, 2] = 2! =1 → [1, 2, 2, 2] = 2−1 =2

and, finally,
... [1, 1, 2, 2] = 1
ց
2−1
... [1, 2, 2, 2] = 2 → [1, 1, 2, 2, 2] = 2−1 = 1.

Hence, plugging the results of the tableau into (3.24),

P [1, 1, 2, 2, 2] = 1 + 4(X − 1) − 2(X − 1)2 + (X − 1)2 (X − 2) + (X − 1)2 (X − 2)2 .
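The tableau of Rem. 3.19 and Newton's formula (3.20a) translate into a few lines of code when all nodes are distinct; the following Python sketch (the names divided_differences and newton_eval are ours; the repeated-node case would, in addition, use (3.23a) with derivative data) computes the coefficients $[x_0], [x_0,x_1], \dots, [x_0,\dots,x_n]$ and evaluates the Newton form by a Horner-like scheme:

# A sketch of the divided-difference tableau (Rem. 3.19) and of Newton's
# interpolation formula (3.20a), restricted to pairwise distinct nodes.
def divided_differences(nodes, values):
    # returns the top edge of the tableau: [x_0], [x_0,x_1], ..., [x_0,...,x_n]
    coef = list(values)
    n = len(nodes)
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (nodes[i] - nodes[i - j])
    return coef

def newton_eval(nodes, coef, x):
    # Horner-like evaluation of sum_j [x_0,...,x_j] * prod_{i<j} (x - x_i)
    result = coef[-1]
    for j in range(len(coef) - 2, -1, -1):
        result = result * (x - nodes[j]) + coef[j]
    return result

nodes = [0.0, 1.0, 2.0]
coef = divided_differences(nodes, [1.0, 3.0, 2.0])
print(coef)                           # [1.0, 2.0, -1.5], i.e. [x_0], [x_0,x_1], [x_0,x_1,x_2]
print(newton_eval(nodes, coef, 1.0))  # reproduces the data value 3.0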

Additional formulas for divided differences over a general field F can be found in Ap-
pendix D. Here, we now proceed to present a result for the field F = R, where there
turns out to be a surprising relation between divided differences and integrals over sim-
plices. We start with some preparations:

Notation 3.21. For n ∈ N, the following notation is introduced:


$$\Sigma^n := \left\{ (s_1,\dots,s_n) \in \mathbb{R}^n \;:\; \sum_{i=1}^{n} s_i \le 1 \ \text{ and } \ 0 \le s_i \ \text{ for each } i \in \{1,\dots,n\} \right\}.$$

Remark 3.22. The set Σn introduced in Not. 3.21 is the convex hull of the set consisting
of the origin and the n standard unit vectors e1 , . . . , en ∈ Rn , i.e. the convex hull of the
(n − 1)-dimensional standard simplex ∆n−1 united with the origin (cf. [Phi19b, Def. and
Rem. 1.23]):
   
Σn = conv {0} ∪ {ei : i ∈ {1, . . . , n}} = conv {0} ∪ ∆n−1 ,

where
ei := (0, . . . , 1, . . . , 0), ∆n−1 := conv{ei : i ∈ {1, . . . , n}}. (3.25)

i

Lemma 3.23. For each $n \in \mathbb{N}$:
$$\mathrm{vol}(\Sigma^n) := \int_{\Sigma^n} 1\,\mathrm{d}s = \frac{1}{n!}. \tag{3.26}$$

Proof. Consider the change of variables T : Rn −→ Rn , t = T (s), where

t1 := s1 , s1 := t1 ,
t2 := s1 + s2 , s2 := t2 − t1 ,
t3 := s1 + s2 + s3 , s3 := t3 − t2 , (3.27)
.. ..
. .
tn := s1 + · · · + sn , sn := tn − tn−1 .

According to (3.27), T is clearly linear and bijective, det JT −1 ≡ det T −1 = 1, and



T (Σn ) = (t1 , . . . , tn ) ∈ Rn : 0 ≤ t1 ≤ t2 ≤ · · · ≤ tn ≤ 1 .

Thus, using the change of variables theorem (CVT) and the Fubini theorem (FT), we
can compute vol(Σn ):
Z Z Z
n (CVT)
vol(Σ ) = 1 ds = det JT −1 (t) dt = 1 dt
Σn T (Σn ) T (Σn )
Z 1 Z t3 Z t2 Z 1 Z t3
(FT)
= ··· dt1 dt2 · · · dtn = ··· t2 dt2 · · · dtn
Z0 1 Z0 t4 0 0
Z 1 n−1 0
t23 tn 1
= ··· dt3 · · · dtn = dtn = ,
0 0 2! 0 (n − 1)! n!

proving (3.26). 

Theorem 3.24 (Hermite-Genocchi Formula). Consider the situation of Def. 3.16(c), i.e. $r, n \in \mathbb{N}_0$ with $\tilde{x}_0,\dots,\tilde{x}_r \in [a,b] \subseteq \mathbb{R}$ distinct ($a, b \in \mathbb{R}$, $a < b$); $m_0,\dots,m_r \in \mathbb{N}$ such that $\sum_{k=0}^{r} m_k = n+1$ and $g \in C^n[a,b]$ (this assumption is actually stronger than in Def. 3.16(c)) with
$$g^{(m)}(\tilde{x}_j) = y_j^{(m)} \in \mathbb{R} \qquad \text{for each } j \in \{0,\dots,r\},\ m \in \{0,\dots,m_j-1\};$$
and $(x_0,\dots,x_n) \in \mathbb{R}^{n+1}$ such that, for each $k \in \{0,\dots,r\}$, precisely $m_k$ of the $x_j$ are equal to $\tilde{x}_k$. Then the divided differences satisfy the following relation, known as the Hermite-Genocchi formula:
$$n \ge 1 \quad\Rightarrow\quad [g|x_0,\dots,x_n] = \int_{\Sigma^n} g^{(n)}\Bigg( x_0 + \sum_{i=1}^{n} s_i (x_i - x_0) \Bigg)\,\mathrm{d}s, \tag{3.28}$$
where $\Sigma^n$ is the set according to Not. 3.21.

Proof. First, note that, for each s ∈ Σn ,


n
X n
X
a = x0 + a − x0 ≤ x0 + si (a − x0 ) ≤ xs := x0 + si (xi − x0 )
i=1 i=1
Xn
≤ x0 + si (b − x0 ) ≤ x0 + b − x0 = b,
i=1

such that g (n) is defined at xs . Moreover, the integral in (3.28) exists, as the continuous
function g (n) is bounded on the compact set Σn . If r = 0 (i.e. x0 = · · · = xn = x̃0 ), then
Z n
! Z
X (3.26) g
(n)
(x0 )
(n)
g x0 + si (xi − x0 ) ds = g (n) (x0 ) ds =
Σn i=1 Σn n!
(n)
y (3.23a)
= 0 = [g|x0 , . . . , xn ],
n!
proving (3.28) in this case. For r > 0, we prove (3.28) via induction on n ∈ N, where,
by Th. 3.18(b), we may assume x0 6= xn . Then, for n = 1, (3.28) is an easy consequence
of the fundamental theorem of calculus (FTC) (note Σ1 = [0, 1]):
Z 1 "  #1
 g x 0 + s(x 1 − x 0 )
g ′ x0 + s(x1 − x0 ) ds =
0 x1 − x0
0
g(x1 ) − g(x0 ) y1 − y0
= = = [g|x0 , x1 ].
x1 − x0 x1 − x0

Next, for the induction step, let n ∈ N and compute


Z n+1
!
X
g (n+1) x0 + si (xi − x0 ) ds
Σn+1 i=1
Z Z Pn
n
! !
Fubini
1− i=1 si X
= g (n+1) x0 + si (xi − x0 ) + sn+1 (xn+1 − x0 ) dsn+1 d(s1 , . . . , sn )
Σn 0 i=1
Z n
!
FTC 1 X
(n)
= g xn+1 + si (xi − xn+1 )
xn+1 − x0 Σn i=1
n
!!
X
(n)
−g x0 + si (xi − x0 ) d(s1 , . . . , sn )
i=1
ind.hyp. [g|x1 , . . . , xn+1 ] − [g|x0 , . . . , xn ] (3.23b)
= = [g|x0 , . . . , xn+1 ],
xn+1 − x0
thereby establishing the case. 
Theorem 3.25. Under the assumptions of Th. 3.24 and letting p := P [g|x0 , . . . , xn ] ∈
R[X]n be the corresponding interpolating polynomial (cf. Def. 3.12(c)), the following
holds true:

(a) For each $x \in [a,b]$:
$$g(x) = p(x) + [g|x, x_0,\dots,x_n]\,\omega_{n+1}(x), \tag{3.29}$$
where
$$\omega_{n+1} = \prod_{i=0}^{n} (X - x_i)$$
(if $x = x_0 = \dots = x_n$, then $[g|x, x_0,\dots,x_n]$ is only defined if $g$ is $(n+1)$ times differentiable – however, in this case, $\omega_{n+1}(x) = 0$ and $g(x) = p(x)$ still holds, even if $[g|x, x_0,\dots,x_n]$ is not defined).

(b) (Mean Value Theorem for Divided Differences) Given $x \in [a,b]$, let
$$m := \min\{x, x_0,\dots,x_n\}, \qquad M := \max\{x, x_0,\dots,x_n\}.$$
If $g \in C^{n+1}[a,b]$, then there exists $\xi = \xi(x, x_0,\dots,x_n) \in [m,M]$ such that
$$[g|x, x_0,\dots,x_n] = \frac{g^{(n+1)}(\xi)}{(n+1)!}. \tag{3.30}$$
In consequence, with $\omega_{n+1}$ as in (a), one then also has
$$g(x) = p(x) + \frac{g^{(n+1)}(\xi)}{(n+1)!}\,\omega_{n+1}(x). \tag{3.31}$$

(c) The function

δ : [a, b]n+1 −→ R, (x0 , . . . , xn ) 7→ [g|x0 , . . . , xn ],

is continuous, in particular, continuous in each xk , k ∈ {0, . . . , n}. In consequence,


if ξ(x, x0 , . . . , xn ) is chosen as in (b), then the map

(x0 , . . . , xn ) 7→ g (n+1) ξ(x, x0 , . . . , xn )

is continuous on [a, b]n+1 .

Proof. (a): For x ∈ [a, b], one obtains:


n
X
(3.20)
p(x) + [g|x, x0 , . . . , xn ] ωn+1 (x) = [g|x0 , . . . , xj ] ωj (x) + [g|x0 , . . . , xn , x] ωn+1 (x)
j=0

(3.20)
= g(x).

(b): If x = x0 = · · · = xn , then (3.30) holds due to


(n+1)
(3.23a) y0 g (n+1) (x0 )
[g|x, x0 , . . . , xn ] = = .
(n + 1)! (n + 1)!
Otherwise, we can apply Th. 3.24 with a, b replaced by m, M . Using (3.28) together
with the mean value theorem of integration (see, e.g., [Phi17, Th. 2.16(f)]) then yields
ξ ∈ [m, M ] such that
Z (n+1)
(n+1) (3.26) g (ξ)
[g|x, x0 , . . . , xn ] = g (ξ) 1 ds = ,
Σn+1 (n + 1)!
proving (3.30). Plugging (3.30) into (3.29) yields (3.31).
(c): For n = 0, the continuity of x0 7→ [g|x0 ] = g(x0 ) is merely the assumed continuity
of g. Let n > 0. Since g (n) is continuous and the continuous function |g (n) | is uniformly
bounded on the compact interval [a, b], the continuity of δ is due to (3.28) and the
continuity of parameter-dependent integrals (see, e.g., [Phi17, Th. 2.22]). 

3.2.5 Convergence and Error Estimates

We will now provide a convergence result for the approximation of a (benign) C ∞


function by interpolating polynomial functions. Recall from Def. and Rem. 3.3(b) that,
given f ∈ R[X], φ(f ) ∈ Pol(R), φ(f )(x) = f (x), is the corresponding polynomial
function.

Theorem 3.26. As in Th. 3.24, let $a, b \in \mathbb{R}$, $a < b$. Let $g \in C^\infty[a,b]$ and consider a sequence $(x_k)_{k\in\mathbb{N}_0}$ in $[a,b]$. For each $n \in \mathbb{N}$, define $p_n := P[g|x_0,\dots,x_n] \in \mathbb{R}[X]_n$ to be the interpolating polynomial corresponding to the first $n+1$ points – thus, according to Th. 3.17 and Th. 3.24,
$$\forall_{n\in\mathbb{N}} \quad p_n = \sum_{j=0}^{n} [g|x_0,\dots,x_j]\,\omega_j,$$
where
$$[g|x_0,\dots,x_j] = \int_{\Sigma^j} g^{(j)}\Bigg( x_0 + \sum_{i=1}^{j} s_i (x_i - x_0) \Bigg)\,\mathrm{d}s, \qquad \omega_j := \prod_{i=0}^{j-1} (X - x_i).$$
If there exist $K, R \in \mathbb{R}_0^+$ such that
$$\forall_{n\in\mathbb{N}} \quad \|g^{(n)}\|_\infty \le K\, n!\, R^n, \tag{3.32}$$
then one has the error estimate
$$\forall_{n\in\mathbb{N}_0} \quad \big\| g - \phi(p_n) \big\|_\infty \le \frac{\|g^{(n+1)}\|_\infty}{(n+1)!}\,\big\| \phi(\omega_{n+1}) \big\|_\infty \le K R^{n+1} (b-a)^{n+1}. \tag{3.33}$$
If, moreover, $R < (b-a)^{-1}$, then one also has the convergence
$$\lim_{n\to\infty} \big\| g - \phi(p_n) \big\|_\infty = 0. \tag{3.34}$$

Proof. The first estimate in (3.33) is merely (3.31), while the second estimate in (3.33)
is (3.32) combined with kφ(ωn+1 )k∞ ≤ (b − a)n+1 . As R < (b − a)−1 , (3.33) establishes
(3.34). 

Example 3.27. Suppose $g \in C^\infty[a,b]$ is such that there exists $C \ge 0$ satisfying
$$\forall_{n\in\mathbb{N}} \quad \|g^{(n)}\|_\infty \le C^n. \tag{3.35}$$
We claim that, in that case, there are $K \in \mathbb{R}_0^+$ and $0 \le R < (b-a)^{-1}$ satisfying (3.32) (in particular, this shows that Th. 3.26 applies to all functions of the form $g(x) = e^{kx}$, $g(x) = \sin(kx)$, $g(x) = \cos(kx)$). To verify that (3.35) implies (3.32), set $R := \frac{1}{2}(b-a)^{-1}$, choose $N \in \mathbb{N}$ sufficiently large such that $n! > (C/R)^n$ for each $n > N$, and set $K := \max\{(C/R)^m : 0 \le m \le N\}$. Then, for each $n \in \mathbb{N}$,
$$\|g^{(n)}\|_\infty \le C^n = (C/R)^n R^n \le K\, n!\, R^n.$$
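The bound (3.33) can also be checked numerically; for $g = \sin$ on $[0,1]$ one may take $K = R = 1$, so that the first bound in (3.33) reduces to $1/(n+1)!$. The following Python sketch (a rough experiment, not part of the notes; all names are ours) compares the actual maximal error on a fine grid with this bound for equally spaced nodes:

# Numerical check of (3.33) for g = sin on [0,1]: |sin^(n+1)| <= 1 and
# |omega_{n+1}(x)| <= (b-a)^(n+1), so the error is bounded by (b-a)^(n+1)/(n+1)!.
import math

def lagrange_eval(nodes, values, x):
    total = 0.0
    for j, yj in enumerate(values):
        Lj = 1.0
        for i, xi in enumerate(nodes):
            if i != j:
                Lj *= (x - xi) / (nodes[j] - xi)
        total += yj * Lj
    return total

a, b = 0.0, 1.0
for n in range(1, 8):
    nodes = [a + k * (b - a) / n for k in range(n + 1)]
    values = [math.sin(t) for t in nodes]
    xs = [a + k * (b - a) / 400 for k in range(401)]
    err = max(abs(math.sin(x) - lagrange_eval(nodes, values, x)) for x in xs)
    bound = (b - a) ** (n + 1) / math.factorial(n + 1)   # first bound in (3.33)
    print(f"n={n}: max error {err:.2e} <= bound {bound:.2e}")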



Remark 3.28. When approximating g ∈ C ∞ [a, b] by an interpolating polynomial pn ,


the accuracy of the approximation depends on the choice of the xj . If one chooses
equally-spaced xj , then ωn+1 (x) will be much larger in the vicinity of a and b than in
the center of the interval. This behavior of ωn+1 can be counteracted by choosing more
data points in the vicinity of a and b. This leads to the problem of finding data points
such that kφ(ωn+1 )k∞ is minimized. If a = −1, b = 1, then the optimal choice for the
data points is
$$x_j := \cos\left( \frac{2j+1}{2n+2}\,\pi \right) \qquad \text{for each } j \in \{0,\dots,n\}.$$
These xj are precisely the zeros of the so-called Chebyshev polynomials, which we will
study in Sec. 3.2.6 below (cf. Ex. 3.35).

One can compute n! [g|x0 , . . . , xn ] as a numerical approximation to g (n) (x). We obtain


the following error estimate:

Theorem 3.29 (Numerical Differentiation). Let $a, b \in \mathbb{R}$, $a < b$, $g \in C^{n+1}[a,b]$, $n \in \mathbb{N}_0$, $x \in [a,b]$. Then, given $n+1$ (not necessarily distinct) points $x_0,\dots,x_n \in [a,b]$ and defining $[g|x_0,\dots,x_n]$ as in Def. 3.16(c), the following estimate holds:
$$\big| g^{(n)}(x) - n!\,[g|x_0,\dots,x_n] \big| \le (M - m)\,\big\| g^{(n+1)}{\restriction}_{[m,M]} \big\|_\infty, \tag{3.36}$$
where ${\restriction}$ denotes restriction and, as in Th. 3.25(b),
$$m := \min\{x, x_0,\dots,x_n\}, \qquad M := \max\{x, x_0,\dots,x_n\}.$$

Proof. For each ξ ∈ [m, M ], we have


(n)
g (x) − g (n) (ξ) ≤ |x − ξ| g (n+1) ↾[m,M ] , (3.37)

which is clear for x = ξ and, otherwise, due to the mean value theorem [Phi16a, Th.
9.18]. For n = 0, the left-hand side of (3.36) is |g(x) − g(x0 )| and (3.36) follows from
(3.37). If n ≥ 1, then we apply (3.30) with n − 1, yielding n! [g|x0 , . . . , xn ] = g (n) (ξ) for
some ξ ∈ [m, M ], such that (3.36) follows from (3.37) also in this case. 
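As a numerical illustration of Th. 3.29, the following Python sketch (all names are ours) approximates $g''(x)$ for $g = \exp$ by $2!\,[g|x-h, x, x+h]$ and prints the estimate from (3.36) alongside; here $m = x-h$, $M = x+h$, so the right-hand side of (3.36) is $2h \cdot \sup|g'''|$ on $[x-h, x+h]$:

# Approximate g''(x) by 2! [g|x-h, x, x+h] for g = exp and compare with (3.36).
import math

def divided_difference(nodes, values):
    coef = list(values)
    n = len(nodes)
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (nodes[i] - nodes[i - j])
    return coef[-1]   # highest-order divided difference [x_0,...,x_n]

x, n = 0.3, 2
for h in [0.1, 0.01, 0.001]:
    nodes = [x - h, x, x + h]
    dd = divided_difference(nodes, [math.exp(t) for t in nodes])
    approx = math.factorial(n) * dd
    bound = 2 * h * math.exp(x + h)   # (M - m) * sup |g'''| on [m, M], cf. (3.36)
    print(f"h={h}: approx {approx:.8f}, exact {math.exp(x):.8f}, bound {bound:.2e}")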

3.2.6 Chebyshev Polynomials

As mentioned in Rem. 3.28 above, the zeros of the Chebyshev polynomials yield optimal
points for the use of polynomial interpolation. We will provide this result as well as some
other properties of Chebyshev polynomials in the present section.

Definition 3.30. We define the Chebyshev polynomials Tn ∈ Pol(R), n ∈ N0 , via the


following recursion:
T0 (x) := 1, T1 (x) := x, ∀ Tn (x) := 2x Tn−1 (x) − Tn−2 (x). (3.38)
n≥2

Theorem 3.31. The Chebyshev polynomials Tn ∈ Pol(R), n ∈ N0 , as defined in Def.


3.30, have the following properties:

(a) Tn ∈ Poln (R), deg Tn = n, and all coefficients of Tn are integers.


(b) For n ≥ 1, the coefficient of xn in Tn is an (Tn ) = 2n−1 .
(c) Tn is even for n even; Tn is odd for n odd, i.e.
(
Tn (x) for n even,
∀ Tn (−x) =
x∈R −Tn (x) for n odd.

(d) Tn (1) = 1, Tn (−1) = (−1)n .


(e) One has
$$T_n(x) = \begin{cases} \cos\big( n \arccos x \big) & \text{for } x \in [-1,1], \\ \cosh\big( n \operatorname{arcosh} x \big) & \text{for } x \in [1,\infty[, \\ (-1)^n \cosh\big( n \operatorname{arcosh}(-x) \big) & \text{for } x \in\ ]-\infty,-1], \end{cases}$$
where
$$\cosh : \mathbb{R} \longrightarrow [1,\infty[, \quad \cosh(x) = \frac{e^x + e^{-x}}{2},$$
with the inverse function
$$\operatorname{arcosh} : [1,\infty[\ \longrightarrow \mathbb{R}, \quad \operatorname{arcosh}(x) = \ln\big( x + \sqrt{x^2 - 1} \big).$$

(f) $T_n$ has $n$ simple zeros, all in the interval $[-1,1]$, namely, for $n \ge 1$,
$$x_k := \cos\left( \frac{2k-1}{2n}\,\pi \right), \qquad k \in \{1,\dots,n\}.$$

(g) One has
$$\forall_{x\in[-1,1]} \quad T_n(x) \in [-1,1],$$
where, for $n \ge 1$,
$$T_n(x) = \begin{cases} 1 & \text{if, and only if, } x = \cos\frac{k\pi}{n} \text{ with } k \in \{0,\dots,n\} \text{ even}, \\ -1 & \text{if, and only if, } x = \cos\frac{k\pi}{n} \text{ with } k \in \{0,\dots,n\} \text{ odd}. \end{cases}$$

Proof. Exercise. 
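The recursion (3.38) is also a convenient way to evaluate $T_n$ numerically; the following Python sketch (the name chebyshev is ours) checks it against the closed form of Th. 3.31(e) on $[-1,1]$:

# Evaluate T_n via the three-term recursion (3.38) and compare with cos(n arccos x).
import math

def chebyshev(n, x):
    if n == 0:
        return 1.0
    t_prev, t = 1.0, x
    for _ in range(2, n + 1):
        t_prev, t = t, 2.0 * x * t - t_prev   # T_n = 2x T_{n-1} - T_{n-2}
    return t

for n in (3, 5, 8):
    for x in (-0.9, 0.0, 0.7):
        closed = math.cos(n * math.acos(x))
        assert abs(chebyshev(n, x) - closed) < 1e-12
print("recursion (3.38) matches cos(n arccos x) at the sampled points")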

Notation 3.32. Let n ∈ N0 . For each p ∈ Pol(R), let an (p) ∈ R denote the coefficient
of xn in p.

Lemma 3.33. Let $a, b \in \mathbb{R}$ with $a < b$. Define
$$\alpha : [a,b] \longrightarrow [-1,1], \quad \alpha(x) := 2\,\frac{x-a}{b-a} - 1, \qquad
\beta : [-1,1] \longrightarrow [a,b], \quad \beta(x) := \frac{b-a}{2}\,x + \frac{b+a}{2}.$$

(a) α, β are bijective, strictly increasing affine functions with β = α−1 .

(b) If f : [a, b] −→ R, then (in [0, ∞]):


 
kf k∞ = sup |f (x)| : x ∈ [a, b] = sup |f (β(x))| : x ∈ [−1, 1] = kf ◦ βk∞ .

(c) If $p \in \mathrm{Pol}_n(\mathbb{R})$, then
$$a_n(p \circ \beta) = a_n(p)\,\frac{(b-a)^n}{2^n}.$$

Proof. (a): Clearly, $\alpha, \beta$ are both restrictions of affine polynomial functions with $\deg\alpha = \deg\beta = 1$. They are strictly increasing (and, thus, bijective) e.g. due to $\alpha' = \frac{2}{b-a} > 0$ and $\beta' = \frac{b-a}{2} > 0$. Moreover, $\alpha(a) = -1$, $\alpha(b) = 1$, $\beta(-1) = a$, $\beta(1) = b$, which also
shows β = α−1 .
(b) holds, as the bijectivity of β implies |f |([a, b]) = |f ◦ β|([−1, 1]).
(c) is a simple consequence of the binomial theorem (cf. [Phi16a, Th. 5.16]). 

Theorem 3.34. Let $n \in \mathbb{N}$ and $a, b \in \mathbb{R}$ with $a < b$. Then
$$\forall_{p\in\mathrm{Pol}_n(\mathbb{R})} \quad \left( a_n(p) \neq 0 \;\Rightarrow\; \big\| p{\restriction}_{[a,b]} \big\|_\infty \ge |a_n(p)|\,\frac{(b-a)^n}{2^{2n-1}} \right)$$
and, in particular,
$$1 = \big\| T_n{\restriction}_{[-1,1]} \big\|_\infty = \min\Big\{ \big\| p{\restriction}_{[-1,1]} \big\|_\infty : p \in \mathrm{Pol}_n(\mathbb{R}),\ \deg p = n,\ a_n(p) = 2^{n-1} \Big\}$$
(i.e. $T_n$ minimizes the $\infty$-norm on $[-1,1]$ in the set of polynomial functions with degree $n$ and highest coefficient $2^{n-1}$).

Proof. We first consider the special case a = −1, b = 1, and an (p) = 2n−1 . Seeking a
contradiction, assume p ∈ Poln (R) with deg p = n, an (p) = 2n−1 , and kp↾[−1,1] k∞ < 1.
Using Th. 3.31(g), we obtain
 (
kπ < 0 for k even,
∀ (p − Tn ) cos
k∈{0,...,n} n > 0 for k odd,
showing p − Tn to have at least n zeros due to the intermediate value theorem (cf.
[Phi16a, Th. 7.57]). Since p, Tn ∈ Poln (R) with an (p) = an (Tn ) = 2n−1 , we have
p − Tn ∈ Poln−1 (R) and the n zeros yield p − Tn = 0 in contradiction to p 6= Tn . Thus,
there must exist $x_0 \in [-1,1]$ such that $|p(x_0)| \ge 1$. Now if $q \in \mathrm{Pol}_n(\mathbb{R})$ is arbitrary with $a_n(q) \neq 0$ and $p := \frac{2^{n-1}}{a_n(q)}\,q$, then $p \in \mathrm{Pol}_n(\mathbb{R})$ with $a_n(p) = 2^{n-1}$ and, due to the special case,
$$|q(x_0)| = \left| \frac{a_n(q)}{2^{n-1}} \right| |p(x_0)| \ge \frac{|a_n(q)|}{2^{n-1}}, \qquad x_0 \in [-1,1],$$
as desired. Moreover, on the interval $[a,b]$, one then obtains
$$\big\| q{\restriction}_{[a,b]} \big\|_\infty \overset{\text{Lem. 3.33(b)}}{=} \|q \circ \beta\|_\infty \overset{\text{Lem. 3.33(c)}}{\ge} |a_n(q)|\,\frac{(b-a)^n}{2^{2n-1}},$$
thereby completing the proof. □
Example 3.35. Let $n \in \mathbb{N}$ and $a, b \in \mathbb{R}$ with $a < b$. As mentioned in Rem. 3.28, minimizing the error in (3.33) leads to the problem of finding $x_0,\dots,x_n \in [a,b]$ such that
$$\max\left\{ \prod_{j=0}^{n} |x - x_j| \;:\; x \in [a,b] \right\}$$
is minimized. In other words, one needs to find the zeros of a monic polynomial function $p$ in $\mathrm{Pol}_{n+1}(\mathbb{R})$ such that $p$ minimizes the $\infty$-norm on $[a,b]$. According to Th. 3.34, for $a = -1$, $b = 1$, $p = T_{n+1}/2^n$, which, according to Th. 3.31(f), has zeros
$$x_j = \cos\left( \frac{2(j+1)-1}{2(n+1)}\,\pi \right) = \cos\left( \frac{2j+1}{2n+2}\,\pi \right), \qquad j \in \{0,\dots,n\}.$$
In the general case, we then obtain the solution $p$ by using the functions $\alpha, \beta$ of Lem. 3.33: According to Lem. 3.33(b), we have $p \circ \beta = c\,T_{n+1}$, where $c \in \mathbb{R}$ is chosen such that $a_{n+1}(p) = 1$. By Lem. 3.33(c),
$$c \cdot 2^n = a_{n+1}(p \circ \beta) = a_{n+1}(p)\,\frac{(b-a)^{n+1}}{2^{n+1}} = \frac{(b-a)^{n+1}}{2^{n+1}} \quad\Rightarrow\quad c = \frac{(b-a)^{n+1}}{2^{2n+1}},$$
implying $p = (T_{n+1} \circ \alpha) \cdot \frac{(b-a)^{n+1}}{2^{2n+1}}$ with the zeros
$$x_j = \beta\left( \cos\left( \frac{2j+1}{2n+2}\,\pi \right) \right), \qquad j \in \{0,\dots,n\}.$$
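In code, the Chebyshev nodes of Ex. 3.35 on a general interval $[a,b]$ are obtained by composing the nodes on $[-1,1]$ with $\beta$ from Lem. 3.33; a short Python sketch (the name chebyshev_nodes is ours):

# Chebyshev nodes x_j = beta(cos((2j+1)/(2n+2) * pi)) on [a, b], cf. Ex. 3.35.
import math

def chebyshev_nodes(a, b, n):
    # the n+1 zeros of T_{n+1} o alpha, mapped to [a, b] via beta
    return [(b - a) / 2.0 * math.cos((2 * j + 1) * math.pi / (2 * n + 2)) + (b + a) / 2.0
            for j in range(n + 1)]

print(chebyshev_nodes(-1.0, 1.0, 2))  # zeros of T_3: +-sqrt(3)/2 and 0
print(chebyshev_nodes(0.0, 2.0, 2))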

Theorem 3.36. Let $n \in \mathbb{N}$ and $a, b \in \mathbb{R}$ with $a < b$. Assume $s \in \mathbb{R}\setminus[a,b]$ and define
$$c_s := \frac{1}{T_n(\alpha(s))} \in \mathbb{R}\setminus\{0\}, \qquad p_s := c_s\,(T_n \circ \alpha) \in \mathrm{Pol}_n(\mathbb{R}),$$
where $\alpha : [a,b] \longrightarrow [-1,1]$ is as in Lem. 3.33, extended to $\alpha : \mathbb{R} \longrightarrow \mathbb{R}$ (also note $\alpha(s) \notin [-1,1]$ and $T_n(\alpha(s)) \neq 0$). Then
$$|c_s| = \big\| p_s{\restriction}_{[a,b]} \big\|_\infty = \min\Big\{ \big\| p{\restriction}_{[a,b]} \big\|_\infty : p \in \mathrm{Pol}_n(\mathbb{R}),\ \deg p = n,\ p(s) = 1 \Big\}$$
(i.e. $p_s$ minimizes the $\infty$-norm on $[a,b]$ in the set of polynomial functions with degree $n$ and value $1$ at $s$).

Proof. Exercise. 

3.3 Spline Interpolation


3.3.1 Introduction, Definition of Splines

In the previous sections, we have used a single polynomial to interpolate given data
points. If the number of data points is large, then the degree of the interpolating
polynomial is likewise large. If the data points are given by a function, and the goal
is for the interpolating polynomial to approximate that function, then, in general, one
might need many data points to achieve a desired accuracy. We have also seen that the
interpolating polynomials tend to be large in the vicinity of the outermost data points,
where strong oscillations are typical. One possibility to counteract that behavior is
to choose an additional number of data points in the vicinity of the boundary of the
considered interval (cf. Rem. 3.28).
The advantage of the approach of the previous sections is that the interpolating poly-
nomial is C ∞ (i.e. it has derivatives of all orders) and one can make sure (by Hermite
interpolation) that, in a given point x0 , the derivatives of the interpolating polynomial p
agree with the derivatives of a given function in x0 (which, again, will typically increase
the degree of p). The disadvantage is that one often needs interpolating polynomials of
large degrees, which can be difficult to handle (e.g. numerical problems resulting from
large numbers occurring during calculations).
Spline interpolation follows a different approach: Instead of using one single interpo-
lating polynomial of potentially large degree, one uses a piecewise interpolation, fitting
together polynomials of low degree (e.g. 1 or 3). The resulting function s will typically
not be of class C ∞ , but merely of class C or C 2 , and the derivatives of s will usually not

agree with the derivatives of a given f (only the values of f and s will agree in the data
points). The advantage is that one only needs to deal with polynomials of low degree. If
one does not need high differentiability of the interpolating function and one is mostly
interested in a good approximation with respect to k · k∞ , then spline interpolation is
usually a good option.
Definition 3.37. Let a, b ∈ R, a < b. Then, given N ∈ N,

∆ := (x0 , . . . , xN ) ∈ RN +1 (3.39)

is called a knot vector for [a, b] if, and only if, it corresponds to a partition a = x0 <
x1 < · · · < xN = b of [a, b]. The xk are then also called knots. Let l ∈ N. A spline of
order l for ∆ is a function s ∈ C l−1 [a, b] such that, for each k ∈ {0, . . . , N − 1}, there
exists a polynomial function pk ∈ Poll (R) that coincides with s on [xk , xk+1 ]. The set of
all such splines is denoted by S∆,l :
 
l−1
S∆,l := s ∈ C [a, b] : ∀ ∃ pk ↾[xk ,xk+1 ] = s↾[xk ,xk+1 ] . (3.40)
k∈{0,...,N −1} pk ∈Poll (R)

Of particular importance are the elements of S∆,1 (linear splines) and of S∆,3 (cubic
splines). It is also useful to define

S∆,0 := span χk : k ∈ {0, . . . , N − 1} , (3.41)

where
$$\forall_{k\in\{0,\dots,N-1\}} \quad \chi_k : [a,b] \longrightarrow \mathbb{R}, \quad \chi_k(x) := \begin{cases} 1 & \text{if } x \in [x_k, x_{k+1}[, \\ 0 & \text{if } x \notin [x_k, x_{k+1}[, \end{cases}$$
i.e. each $\chi_k$ is a step function, namely the characteristic function of the interval $[x_k, x_{k+1}[$.
We call the elements of S∆,0 splines of order 0.
Remark and Definition 3.38. In the situation of Def. 3.37, let k ∈ {0, . . . , N − 1}.
If s ∈ S∆,l , l ∈ N0 , then, as s↾]xk ,xk+1 [ coincides with a polynomial function of degree at
most l, on ]xk , xk+1 [ the derivative s(l) exists and is constant, i.e. there exists αk ∈ R
such that s(l) ↾]xk ,xk+1 [ ≡ αk . We extend s(l) to all of [a, b] by defining
N
X −1
(l)
s := αk χk ∈ S∆,0 . (3.42)
k=0

Definition 3.39. In the situation of Def. 3.37, if s ∈ S∆,0 , then, for each l ∈ N0 , define
the lth antiderivative Il (s) : [a, b] −→ R of s recursively by letting
Z x
I0 (s) := s, ∀ Il+1 (s)(x) := Il (s)(t) dt .
l∈N0 a

Theorem 3.40. Let a, b ∈ R, a < b, N ∈ N, l ∈ N0 , with ∆ := (x0 , . . . , xN ) being a


knot vector for [a, b]. Then the following statements hold true:

(a) Poll (R) ⊆ S∆,l (here, and in the following, we are sometimes a bit lax regarding the
notation, implicitly identifying polynomials on R with their respective restriction on
[a, b]).

(b) For each j ∈ {0, . . . , l}, we have

s ∈ S∆,l ⇒ s(j) ∈ S∆,l−j .

(c) If s ∈ S∆,0 and S := Il (s) is the lth antiderivative of s as defined in Def. 3.39, then
S ∈ S∆,l .

(d) S∆,l is an (N + l)-dimensional vector space over R and, if, for each k ∈ {0, . . . , N −
1}, Xk := Il (χk ) is the lth antiderivative of χk , then
 
B := Xk : k ∈ {0, . . . , N − 1} ∪ xj : j ∈ {0, . . . , l − 1}

forms a basis of S∆,l .

Proof. (a) is clear from the definition of S∆,l .


(b): For j ∈ {0, . . . , l − 1}, (b) holds since s ∈ C l−1 [a, b] implies s(j) ∈ C l−j−1 [a, b] and
(j)
pk ∈ Poll (R) implies pk ∈ Poll−j (R). For j = l, (b) holds due to Rem. and Def. 3.38.
(c): We obtain the continuity of I1 (s) from Th. C.6 of the appendix. We then apply the
fundamental theorem of calculus (cf. [Phi16a, Th. 10.20(a)]) (l − 1) times to I1 (s) to
obtain S ∈ C l−1 [a, b]. For each k ∈ {0, . . . , N − 1}, we apply the fundamental theorem
of calculus l times on [xk , xk+1 ] to obtain S↾[xk ,xk+1 ] ∈ Poll (R).
(d): S∆,l being a vector space over R is a consequence of both C l−1 [a, b] and Poll (R)
being vector spaces over R. Since B ⊆ S∆,l by (a),(c) and, clearly, #B = N + l, it
merely remains to show that B is, indeed, a basis of S∆,l . Let s ∈ S∆,l . Since s(l) ∈ S∆,0 ,
P −1 PN −1
there exist α0 , . . . , αN −1 ∈ R such that s(l) = N k=0 αk χk . If s̃ := k=0 αk Xk , then
s̃(l) = s(l) . We need to show (s − s̃) ∈ Poll−1 (R). To this end, let k ∈ {1, . . . , N − 1}.
Then the fundamental theorem of calculus yields p := (s − s̃)↾[xk−1 ,xk ] ∈ Poll−1 (R) and
q := (s − s̃)↾[xk ,xk+1 ] ∈ Poll−1 (R). However, as s − s̃ is C l−1 , we also know

∀ p(j) (xk ) = q (j) (xk ),


j∈{0,...,l−1}

implying p = q due to Th. 3.10(b) about Hermite interpolation. As k ∈ {1, . . . , N − 1}


was arbitrary, this shows (s − s̃) ∈ Poll−1 (R) and span B = S∆,l . It remains to show

that B is linearly independent. To this end assume


N
X −1
∃ ∃ ∀ p(x) + αk Xk (x) = 0.
p∈Poll−1 (R) α0 ,...,αN −1 ∈R x∈[a,b]
k=0

Taking the derivative l times then yields


N
X −1
∀ αk χk (x) = 0.
x∈[a,b]
k=0

As the χk are clearly linearly independent, we obtain α0 = · · · = αN −1 = 0, implying


p = 0 as well. Thus, B is linearly independent and, hence, a basis of S∆,l . 

3.3.2 Linear Splines

Given a knot vector as in (3.39), and values y0 , . . . , yN ∈ R, a corresponding linear


spline is a continuous, piecewise linear function s satisfying s(xk ) = yk . In this case,
existence, uniqueness, and an error estimate are still rather simple to obtain:
Theorem 3.41. Let a, b ∈ R, a < b. Then, given N ∈ N, a knot vector ∆ :=
(x0 , . . . , xN ) ∈ RN +1 for [a, b], and values (y0 , . . . , yN ) ∈ RN +1 , there exists a unique
interpolating linear spline s ∈ S∆,1 satisfying

s(xk ) = yk for each k ∈ {0, . . . , N }. (3.43)

Moreover, s is explicitly given by


s(x) = yk + ((yk+1 − yk )/(xk+1 − xk )) (x − xk )   for each x ∈ [xk , xk+1 ].   (3.44)

If f ∈ C 2 [a, b] and yk = f (xk ), then the following error estimate holds:


kf − sk∞ ≤ (1/8) kf ′′ k∞ h² ,   (3.45)
where
h := h(∆) := max{xk+1 − xk : k ∈ {0, . . . , N − 1}}
is the mesh size of the partition ∆.

Proof. Since s defined by (3.44) clearly satisfies (3.43), is continuous on [a, b], and,
for each k ∈ {0, . . . , N − 1}, the definition of (3.44) extends to a unique polynomial in
Pol1 (R), we already have existence. Uniqueness is equally obvious, since, for each k, the

prescribed values in xk and xk+1 uniquely determine a polynomial in Pol1 (R) and, thus,
s. For the error estimate, consider x ∈ [xk , xk+1 ], k ∈ {0, . . . , N − 1}. Note that, due to
the inequality αβ ≤ ((α + β)/2)², valid for all α, β ∈ R, one has
(x − xk )(xk+1 − x) ≤ (((x − xk ) + (xk+1 − x))/2)² ≤ h²/4.   (3.46)
Moreover, on [xk , xk+1 ], s coincides with the interpolating polynomial p ∈ Pol1 (R) with
p(xk ) = f (xk ) and p(xk+1 ) = f (xk+1 ). Thus, we can use (3.31) to compute
|f (x) − s(x)| = |f (x) − p(x)| ≤ (kf ′′ k∞ /2!) (x − xk )(xk+1 − x) ≤ (1/8) kf ′′ k∞ h²   by (3.46),
thereby proving (3.45). 
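The explicit formula (3.44) and the estimate (3.45) are easy to try out numerically. The following Python sketch is ours (not part of these notes) and assumes NumPy; it evaluates the interpolating linear spline of f = cos on [−6, 6] and compares the observed maximal error with the right-hand side of (3.45):

import numpy as np

def linear_spline(knots, values, x):
    # evaluate the interpolating linear spline (3.44) at the points x
    k = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, len(knots) - 2)
    slope = (values[k + 1] - values[k]) / (knots[k + 1] - knots[k])
    return values[k] + slope * (x - knots[k])

a, b, N = -6.0, 6.0, 12
knots = np.linspace(a, b, N + 1)
xs = np.linspace(a, b, 10001)
err = np.max(np.abs(np.cos(xs) - linear_spline(knots, np.cos(knots), xs)))
h = (b - a) / N
print(err, h ** 2 / 8)   # observed error vs. bound (3.45) with kf''k_inf = 1

Since kf ′′ k∞ = 1 for f = cos, the printed error stays below h²/8, as predicted by (3.45).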

3.3.3 Cubic Splines

Given a knot vector


∆ = (x0 , . . . , xN ) ∈ RN +1 and values (y0 , . . . , yN ) ∈ RN +1 , (3.47)
the task is to find an interpolating cubic spline, i.e. an element s of S∆,3 satisfying
s(xk ) = yk for each k ∈ {0, . . . , N }. (3.48)
As we require s ∈ S∆,3 , according to (3.40), for each k ∈ {0, . . . , N − 1}, we must find
pk ∈ Pol3 (R) such that pk ↾[xk ,xk+1 ] = s↾[xk ,xk+1 ] . To that end, for each k ∈ {0, . . . , N − 1},
we employ the ansatz
pk (x) = ak + bk (x − xk ) + ck (x − xk )2 + dk (x − xk )3 (3.49a)
with suitable
(ak , bk , ck , dk ) ∈ R4 ; (3.49b)
to then define s by
s(x) = pk (x) for each x ∈ [xk , xk+1 ]. (3.49c)
It is immediate that s is well-defined by (3.49c) and an element of S∆,3 (cf. (3.40))
satisfying (3.48) if, and only if,
p0 (x0 ) = y0 , (3.50a)
pk−1 (xk ) = yk = pk (xk ) for each k ∈ {1, . . . , N − 1}, (3.50b)
pN −1 (xN ) = yN , (3.50c)
p′k−1 (xk ) = p′k (xk ) for each k ∈ {1, . . . , N − 1}, (3.50d)
p′′k−1 (xk ) = p′′k (xk ) for each k ∈ {1, . . . , N − 1}, (3.50e)

using the convention {1, . . . , N − 1} = ∅ for N − 1 < 1. Note that (3.50) constitutes a
system of 2 + 4(N − 1) = 4N − 2 conditions, while (3.49a) provides 4N variables. This
already indicates that one will be able to impose two additional conditions on s. One
typically imposes so-called boundary conditions, i.e. conditions on the values for s or its
derivatives at x0 and xN (see (3.61) below for commonly used boundary conditions).
In order to show that one can, indeed, choose the ak , bk , ck , dk such that all conditions of
(3.50) are satisfied, the strategy is to transform (3.50) into an equivalent linear system.
An analysis of the structure of that system will then show that it has a solution. The
trick that will lead to the linear system is the introduction of
s′′k := s′′ (xk ), k ∈ {0, . . . , N }, (3.51)
as new variables. As we will see in the following lemma, (3.50) is satisfied if, and only
if, the s′′k satisfy the linear system (3.53):
Lemma 3.42. Given a knot vector and values according to (3.47), and introducing the
abbreviations
hk := xk+1 − xk , for each k ∈ {0, . . . , N − 1}, (3.52a)
gk := 6 ((yk+1 − yk )/hk − (yk − yk−1 )/hk−1 ) = 6 (hk + hk−1 ) [xk−1 , xk , xk+1 ]   for each k ∈ {1, . . . , N − 1},   (3.52b)

the pk ∈ Pol3 (R) of (3.49a) satisfy (3.50) (and that means s given by (3.49c) is in S∆,3
and satisfies (3.48)) if, and only if, the N + 1 numbers s′′0 , . . . , s′′N defined by (3.51) solve
the linear system consisting of the N − 1 equations
hk s′′k + 2(hk + hk+1 )s′′k+1 + hk+1 s′′k+2 = gk+1 , k ∈ {0, . . . , N − 2}, (3.53)
and the ak , bk , ck , dk are given in terms of the xk , yk and s′′k by
ak = yk   for each k ∈ {0, . . . , N − 1},   (3.54a)
bk = (yk+1 − yk )/hk − (hk /6)(s′′k+1 + 2s′′k )   for each k ∈ {0, . . . , N − 1},   (3.54b)
ck = s′′k /2   for each k ∈ {0, . . . , N − 1},   (3.54c)
dk = (s′′k+1 − s′′k )/(6hk )   for each k ∈ {0, . . . , N − 1}.   (3.54d)

Proof. First, for subsequent use, from (3.49a), we obtain the first two derivatives of the
pk for each k ∈ {0, . . . , N − 1}:
p′k (x) = bk + 2ck (x − xk ) + 3dk (x − xk )2 , (3.55a)
p′′k (x) = 2ck + 6dk (x − xk ). (3.55b)

As usual, for the claimed equivalence, we prove two implications.


(3.50) implies (3.53) and (3.54):
Plugging x = xk into (3.49a) yields
pk (xk ) = ak for each k ∈ {0, . . . , N − 1}. (3.56)
Thus, (3.54a) is a consequence of (3.50a) and (3.50b).
Plugging x = xk into (3.55b) yields
s′′k = p′′k (xk ) = 2ck for each k ∈ {0, . . . , N − 1}, (3.57)
i.e. (3.54c).
Plugging (3.50e), i.e. p′′k−1 (xk ) = p′′k (xk ), into (3.55b) yields
2ck−1 + 6dk−1 (xk − xk−1 ) = 2ck for each k ∈ {1, . . . , N − 1}. (3.58a)
When using (3.54c) and solving for dk−1 , one obtains
dk−1 = (s′′k − s′′k−1 )/(6 hk−1 )   for each k ∈ {1, . . . , N − 1},   (3.58b)
i.e. the relation of (3.54d) for each k ∈ {0, . . . , N − 2}. Moreover, (3.51) and (3.55b)
imply
s′′N = s′′ (xN ) = p′′N −1 (xN ) = 2cN −1 + 6dN −1 hN −1 , (3.58c)
which, after solving for dN −1 , shows that the relation of (3.54d) also holds for k = N −1.
Plugging the first part of (3.50b) and (3.50c), i.e. pk−1 (xk ) = yk into (3.49a) yields
ak−1 + bk−1 hk−1 + ck−1 h2k−1 + dk−1 h3k−1 = yk for each k ∈ {1, . . . , N }. (3.59a)
When using (3.54a), (3.54c), (3.54d), and solving for bk−1 , one obtains
bk−1 = (yk − yk−1 )/hk−1 − (s′′k−1 /2) hk−1 − ((s′′k − s′′k−1 )/(6 hk−1 )) h²k−1 = (yk − yk−1 )/hk−1 − (hk−1 /6)(s′′k + 2s′′k−1 )   (3.59b)
for each k ∈ {1, . . . , N }, i.e. (3.54b).
Plugging (3.50d), i.e. p′k−1 (xk ) = p′k (xk ), into (3.55a) yields
bk−1 + 2ck−1 hk−1 + 3dk−1 h2k−1 = bk for each k ∈ {1, . . . , N − 1}. (3.60a)
Finally, applying (3.54b), (3.54c), and (3.54d), one obtains
(yk − yk−1 )/hk−1 − (hk−1 /6)(s′′k + 2s′′k−1 ) + s′′k−1 hk−1 + ((s′′k − s′′k−1 )/2) hk−1 = (yk+1 − yk )/hk − (hk /6)(s′′k+1 + 2s′′k ),   (3.60b)

or, rearranged,

hk−1 s′′k−1 + 2(hk−1 + hk )s′′k + hk s′′k+1 = gk , k ∈ {1, . . . , N − 1}, (3.60c)

which is (3.53).
(3.53) and (3.54) imply (3.50):
First, note that plugging x = xk into (3.55b) again yields s′′k = p′′k (xk ). Using (3.54a) in
(3.49a) yields (3.50a) and the pk (xk ) = yk part of (3.50b). Since (3.54b) is equivalent to
(3.59b), and (3.59b) (when using (3.54a), (3.54c), and (3.54d)) implies (3.59a), which,
in turn, is equivalent to pk−1 (xk ) = yk for each k ∈ {1, . . . , N }, we see that (3.54)
implies (3.50c) and the pk−1 (xk ) = yk part of (3.50b). Since (3.53) is equivalent to
(3.60c) and (3.60b), and (3.60b) (when using (3.54a) – (3.54d)) implies (3.60a), which,
in turn, is equivalent to p′k−1 (xk ) = p′k (xk ) for each k ∈ {1, . . . , N − 1}, we see that
(3.53) and (3.54) imply (3.50d). Finally, (3.54d) is equivalent to (3.58b), which (when
using (3.54c)) implies (3.58a). Since (3.58a) is equivalent to p′′k−1 (xk ) = p′′k (xk ) for each
k ∈ {1, . . . , N − 1}, (3.54) also implies (3.50e), concluding the proof of the lemma. 

As already mentioned after formulating (3.50), (3.50) yields two fewer conditions than
there are variables. Not surprisingly, the same is true in the linear system (3.53), where
there are N − 1 equations for N + 1 variables. As also mentioned before, this allows one
to impose two additional conditions. Here are some of the most commonly used additional
conditions:
Natural boundary conditions:

s′′ (x0 ) = s′′ (xN ) = 0. (3.61a)

Periodic boundary conditions:

s′ (x0 ) = s′ (xN ) and s′′ (x0 ) = s′′ (xN ). (3.61b)

Dirichlet boundary conditions for the first derivative:

s′ (x0 ) = y0′ ∈ R   and   s′ (xN ) = yN′ ∈ R.   (3.61c)

Due to time constraints, we will only investigate the simplest of these cases, namely
natural boundary conditions, in the following.
Since (3.61a) fixes the values of s′′0 and s′′N as 0, (3.53) now yields a linear system
consisting of N − 1 equations for the N − 1 variables s′′1 through s′′N −1 . We can rewrite
this system in matrix form
As′′ = g (3.62a)

by introducing the matrix


A := \begin{pmatrix}
2(h_0 + h_1) & h_1 & & & \\
h_1 & 2(h_1 + h_2) & h_2 & & \\
& \ddots & \ddots & \ddots & \\
& & h_{N-3} & 2(h_{N-3} + h_{N-2}) & h_{N-2} \\
& & & h_{N-2} & 2(h_{N-2} + h_{N-1})
\end{pmatrix}   (3.62b)

and the vectors s′′ := (s′′1 , . . . , s′′N −1 )ᵀ and g := (g1 , . . . , gN −1 )ᵀ .   (3.62c)
The matrix A from (3.62b) belongs to an especially benign class of matrices, so-called
strictly diagonally dominant matrices:

Definition 3.43. Let A ∈ M(N, C), N ∈ N. Then A is called strictly diagonally


dominant if, and only if,
Σ_{j=1, j≠k}^{N} |akj | < |akk |   for each k ∈ {1, . . . , N }.   (3.63)

Lemma 3.44. Let A ∈ M(N, K), N ∈ N. If A is strictly diagonally dominant, then,


for each x ∈ KN ,
kxk∞ ≤ max{ 1/(|akk | − Σ_{j≠k} |akj |) : k ∈ {1, . . . , N } } kAxk∞ .   (3.64)

In particular, A is invertible with


kA−1 k∞ ≤ max{ 1/(|akk | − Σ_{j≠k} |akj |) : k ∈ {1, . . . , N } },   (3.65)

where kA−1 k∞ denotes the row sum norm according to Ex. 2.18(a).

Proof. Exercise. 
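Although the proof is an exercise, the bound (3.65) is easy to check numerically. The following Python sketch is ours (not part of these notes); the sample matrix is an arbitrary strictly diagonally dominant choice:

import numpy as np

A = np.array([[4.0, 1.0, 0.5],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 3.0]])                       # strictly diagonally dominant
slack = np.abs(np.diag(A)) - (np.abs(A).sum(axis=1) - np.abs(np.diag(A)))
bound = np.max(1.0 / slack)                           # right-hand side of (3.65)
norm_inv = np.linalg.norm(np.linalg.inv(A), np.inf)   # row sum norm of A^{-1}
print(norm_inv, bound)                                # norm_inv <= bound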

Theorem 3.45. Given a knot vector and values according to (3.47), there is a unique
cubic spline s ∈ S∆,3 that satisfies (3.48) and the natural boundary conditions (3.61a).
Moreover, s can be computed by first solving the linear system (3.62) for the s′′k (the hk
and the gk are given by (3.52)), then determining the ak , bk , ck , and dk according to
(3.54), and, finally, using (3.49) to get s.

Proof. Lemma 3.42 shows that the existence and uniqueness of s is equivalent to (3.62a)
having a unique solution. That (3.62a) does, indeed, have a unique solution follows from
Lem. 3.44, as A as defined in (3.62b) is clearly strictly diagonally dominant. It is also
part of the statement of Lem. 3.42 that s can be computed in the described way. 
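The computation described in Th. 3.45 translates directly into a short program. The following Python sketch is ours (not part of these notes; it assumes NumPy and N ≥ 2): it assembles the matrix (3.62b), solves for the s′′k under the natural boundary conditions (3.61a), recovers the coefficients via (3.54), and evaluates s via (3.49):

import numpy as np

def natural_cubic_spline(x, y):
    # returns a callable evaluating the natural cubic spline through (x[k], y[k])
    N = len(x) - 1
    h = np.diff(x)                                                  # h_k, cf. (3.52a)
    g = 6.0 * (np.diff(y[1:]) / h[1:] - np.diff(y[:-1]) / h[:-1])   # g_k, cf. (3.52b)
    A = np.diag(2.0 * (h[:-1] + h[1:]))                             # matrix (3.62b)
    A += np.diag(h[1:-1], 1) + np.diag(h[1:-1], -1)
    s2 = np.zeros(N + 1)                                            # s''_0 = s''_N = 0, cf. (3.61a)
    s2[1:N] = np.linalg.solve(A, g)
    a_ = y[:-1]                                                     # (3.54a)
    b_ = np.diff(y) / h - h / 6.0 * (s2[1:] + 2.0 * s2[:-1])        # (3.54b)
    c_ = s2[:-1] / 2.0                                              # (3.54c)
    d_ = np.diff(s2) / (6.0 * h)                                    # (3.54d)
    def s(t):
        k = np.clip(np.searchsorted(x, t, side="right") - 1, 0, N - 1)
        dt = t - x[k]
        return a_[k] + b_[k] * dt + c_[k] * dt ** 2 + d_[k] * dt ** 3   # (3.49a)
    return s

xk = np.linspace(-6.0, 6.0, 7)
spline = natural_cubic_spline(xk, 1.0 / (xk ** 2 + 1.0))
print(spline(np.array([-5.0, -1.0, 2.5])))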

Theorem 3.46. Let a, b ∈ R, a < b, and f ∈ C 4 [a, b]. Moreover, given N ∈ N and a
knot vector ∆ := (x0 , . . . , xN ) ∈ RN +1 for [a, b], define

hk := xk+1 − xk for each k ∈ {0, . . . , N − 1}, (3.66a)



hmin := min{hk : k ∈ {0, . . . , N − 1}},   (3.66b)
hmax := max{hk : k ∈ {0, . . . , N − 1}}.   (3.66c)

Let s ∈ S∆,3 be any cubic spline function satisfying s(xk ) = f (xk ) for each k ∈
{0, . . . , N }. If there exists C > 0 such that
max{|s′′ (xk ) − f ′′ (xk )| : k ∈ {0, . . . , N }} ≤ C kf (4) k∞ h²max ,   (3.67)

then the following error estimates hold:

kf − sk∞ ≤ c kf (4) k∞ h⁴max ,   (3.68a)
kf ′ − s′ k∞ ≤ 2c kf (4) k∞ h³max ,   (3.68b)
kf ′′ − s′′ k∞ ≤ 2c kf (4) k∞ h²max ,   (3.68c)
kf ′′′ − s′′′ k∞ ≤ 2c kf (4) k∞ hmax ,   (3.68d)

where
c := (hmax /hmin )(C + 1/4).   (3.69)
At the xk , the third derivative s′′′ does not need to exist. In that case, (3.68d) is to be
interpreted to mean that the estimate holds for all values of one-sided third derivatives
of s (which must exist, since s agrees with a polynomial on each [xk , xk+1 ]).

Proof. Exercise. 

Theorem 3.47. Once again, consider the situation of Th. 3.46, but, instead of (3.67),
assume that the interpolating cubic spline s satisfies natural boundary conditions s′′ (a) =
s′′ (b) = 0. If f ∈ C 4 [a, b] satisfies f ′′ (a) = f ′′ (b) = 0, then it follows that s satisfies
(3.67) with C = 3/4, and, in consequence, the error estimates (3.68) with c = hmax /hmin .

Proof. We merely need to show that (3.67) holds with C = 3/4. Since s is the interpo-
lating cubic spline satisfying natural boundary conditions, s′′ satisfies the linear system
(3.62a). Dividing, for each k ∈ {1, . . . , N − 1}, the (k − 1)th equation of that system
by 3(hk−1 + hk ) yields the form
Bs′′ = ĝ, (3.70a)
with
B := \begin{pmatrix}
2/3 & h_1/(3(h_0+h_1)) & & & \\
h_1/(3(h_1+h_2)) & 2/3 & h_2/(3(h_1+h_2)) & & \\
& \ddots & \ddots & \ddots & \\
& & h_{N-3}/(3(h_{N-3}+h_{N-2})) & 2/3 & h_{N-2}/(3(h_{N-3}+h_{N-2})) \\
& & & h_{N-2}/(3(h_{N-2}+h_{N-1})) & 2/3
\end{pmatrix}   (3.70b)
and ĝ := (ĝ1 , . . . , ĝN −1 )ᵀ := (g1 /(3(h0 + h1 )), . . . , gN −1 /(3(hN −2 + hN −1 )))ᵀ .   (3.70c)

Taylor’s theorem applied to f ′′ provides, for each k ∈ {1, . . . , N − 1}, αk ∈]xk−1 , xk [


and βk ∈]xk , xk+1 [ such that the following equations (3.71) hold, where (3.71a) has been
multiplied by hk−1 /(3(hk−1 + hk )) and (3.71b) has been multiplied by hk /(3(hk−1 + hk )):

(hk−1 f ′′ (xk−1 ))/(3(hk−1 + hk )) = (hk−1 f ′′ (xk ))/(3(hk−1 + hk )) − (h²k−1 f ′′′ (xk ))/(3(hk−1 + hk )) + (h³k−1 f (4) (αk ))/(6(hk−1 + hk )),   (3.71a)
(hk f ′′ (xk+1 ))/(3(hk−1 + hk )) = (hk f ′′ (xk ))/(3(hk−1 + hk )) + (h²k f ′′′ (xk ))/(3(hk−1 + hk )) + (h³k f (4) (βk ))/(6(hk−1 + hk )).   (3.71b)
Adding (3.71a) and (3.71b) results in
(hk−1 /(3(hk−1 + hk ))) f ′′ (xk−1 ) + (2/3) f ′′ (xk ) + (hk /(3(hk−1 + hk ))) f ′′ (xk+1 ) = f ′′ (xk ) + Rk + δk ,   (3.72a)

where
Rk := (1/3)(hk − hk−1 ) f ′′′ (xk ),   (3.72b)
δk := (h³k−1 f (4) (αk ) + h³k f (4) (βk ))/(6(hk−1 + hk )).   (3.72c)

Using the matrix B from (3.70b), (3.72a) takes the form


B (f ′′ (x1 ), . . . , f ′′ (xN −1 ))ᵀ = (f ′′ (x1 ), . . . , f ′′ (xN −1 ))ᵀ + (R1 , . . . , RN −1 )ᵀ + (δ1 , . . . , δN −1 )ᵀ .   (3.73)

In the same way we applied Taylor’s theorem to f ′′ to obtain (3.71), we now apply
Taylor’s theorem to f to obtain for each k ∈ {1, . . . , N − 1}, ξk ∈]xk , xk+1 [ and ηk ∈
]xk−1 , xk [ such that

f (xk+1 ) = f (xk ) + hk f ′ (xk ) + (h²k /2) f ′′ (xk ) + (h³k /6) f ′′′ (xk ) + (h⁴k /24) f (4) (ξk ),   (3.74a)
f (xk−1 ) = f (xk ) − hk−1 f ′ (xk ) + (h²k−1 /2) f ′′ (xk ) − (h³k−1 /6) f ′′′ (xk ) + (h⁴k−1 /24) f (4) (ηk ).   (3.74b)
Multiplying (3.74a) and (3.74b) by 2/hk and 2/hk−1 , respectively, leads to

2 (f (xk+1 ) − f (xk ))/hk = 2 f ′ (xk ) + hk f ′′ (xk ) + (h²k /3) f ′′′ (xk ) + (h³k /12) f (4) (ξk ),   (3.75a)
−2 (f (xk ) − f (xk−1 ))/hk−1 = −2 f ′ (xk ) + hk−1 f ′′ (xk ) − (h²k−1 /3) f ′′′ (xk ) + (h³k−1 /12) f (4) (ηk ).   (3.75b)

Adding (3.75a) and (3.75b) followed by a division by hk−1 + hk yields

2 (f (xk+1 ) − f (xk ))/(hk (hk−1 + hk )) − 2 (f (xk ) − f (xk−1 ))/(hk−1 (hk−1 + hk )) = f ′′ (xk ) + Rk + δ̂k ,   (3.76)

where Rk is as in (3.72b) and


δ̂k := (h³k−1 f (4) (ηk ) + h³k f (4) (ξk ))/(12(hk−1 + hk )).

Observing that

ĝk = gk /(3(hk−1 + hk )) = 2 (f (xk+1 ) − f (xk ))/(hk (hk−1 + hk )) − 2 (f (xk ) − f (xk−1 ))/(hk−1 (hk−1 + hk )),
where the first equality is (3.70c) and the second follows from (3.52b) and yk = f (xk ),

we can write (3.76) as


(f ′′ (x1 ), . . . , f ′′ (xN −1 ))ᵀ = (ĝ1 , . . . , ĝN −1 )ᵀ − (R1 , . . . , RN −1 )ᵀ − (δ̂1 , . . . , δ̂N −1 )ᵀ .   (3.77)

Using (3.77) in the right-hand side of (3.73) and then subtracting (3.70a) from the result
yields
B (f ′′ (x1 ) − s′′ (x1 ), . . . , f ′′ (xN −1 ) − s′′ (xN −1 ))ᵀ = (δ1 − δ̂1 , . . . , δN −1 − δ̂N −1 )ᵀ .
From
2/3 − hk−1 /(3(hk−1 + hk )) − hk /(3(hk−1 + hk )) = 1/3   for k ∈ {1, . . . , N − 1},
we conclude that B is strictly diagonally dominant, such that Lem. 3.44 allows us to
estimate
max{|f ′′ (xk ) − s′′ (xk )| : k ∈ {0, . . . , N }} ≤ 3 max{|δk | + |δ̂k | : k ∈ {1, . . . , N − 1}} ≤(∗) (3/4) h²max kf (4) k∞ ,   (3.78)
where
|δk | + |δ̂k | ≤ (1/6 + 1/12) ((h³k−1 + h³k )/(hk−1 + hk )) kf (4) k∞ ≤ (1/4) h²max kf (4) k∞
was used for the inequality at (∗). The estimate (3.78) shows that s satisfies (3.67) with
C = 3/4, thereby concluding the proof of the theorem. 

Note that (3.68) is, in more than one way, much better than (3.45): From (3.68) we
get convergence of the first three derivatives, and the convergence of kf − sk∞ is also
much faster in (3.68) (proportional to h4 rather than to h2 in (3.45)). The disadvantage
of (3.68), however, is that we needed to assume the existence of a continuous fourth
derivative of f .
Remark 3.48. A piecewise interpolation by cubic polynomials different from spline
interpolation is given by carrying out a Hermite interpolation on each interval [xk , xk+1 ]
such that the resulting function satisfies f (xk ) = s(xk ) as well as

f ′ (xk ) = s′ (xk ). (3.79)

The result will usually not be a spline, since s will usually not have second derivatives
in the xk . On the other hand the corresponding cubic spline will usually not satisfy

(3.79). Thus, one will prefer one or the other form of interpolation, depending on either
(3.79) (i.e. the agreement of the first derivatives of the given and interpolating function)
or s ∈ C 2 being more important in the case at hand.

We conclude the section by depicting different interpolations of the functions f (x) =


1/(x2 + 1) and f (x) = cos x on the interval [−6, 6] in Figures 3.1 and 3.2, respectively.
While f is depicted as a solid blue curve, the interpolating polynomial of degree 6 is
shown as a dotted black curve, the interpolating linear spline as a solid red curve, and
the interpolating cubic spline is shown as a dashed green curve. One clearly observes the
large values and strong oscillations of the interpolating polynomial toward the bound-
aries of the interval.


Figure 3.1: Different interpolations of f (x) = 1/(x2 + 1) with data points at (x0 , . . . , x6 )
= (−6, −4, . . . , 4, 6).


Figure 3.2: Different interpolations of f (x) = cos(x) with data points at (x0 , . . . , x6 ) =
(−6, −4, . . . , 4, 6).

4 Numerical Integration

4.1 Introduction
The goal is the computation of (an approximation of) the value of definite integrals
∫_a^b f (x) dx .   (4.1)

Even if we know from the fundamental theorem of calculus that f has an antiderivative,
even for simple f , the antiderivative can often not be expressed in terms of so-called
elementary functions (such as polynomials, exponential functions, and trigonometric
functions). An example is the important function
Φ(y) := (1/√(2π)) ∫_0^y e^{−x²/2} dx ,   (4.2)

which can not be expressed by elementary functions.


On the other hand, as we will see, there are effective and efficient methods to compute
accurate approximations of such integrals. As a general rule, one can say that numerical
integration is easy, whereas symbolic integration is hard (for differentiation, one has the
reverse situation – symbolic differentiation is easy, while numeric differentiation is hard).
A first approximative method for the evaluation of integrals is given by the very defini-
tion of the Riemann integral. We recall it in the form of the following Th. 4.3. First,
let us formally introduce some notation that, in part, we already used when studying
spline interpolation:
Notation 4.1. Given a real interval [a, b], a < b, (x0 , . . . , xn ) ∈ Rn+1 , n ∈ N, is called
a partition of [a, b] if, and only if, a = x0 < x1 < · · · < xn = b (what was called a knot
vector in the context of spline interpolation). A tagged partition of [a, b] is a partition
together with a vector (t1 , . . . , tn ) ∈ Rn such that ti ∈ [xi−1 , xi ] for each i ∈ {1, . . . , n}.
Given a partition ∆ (with or without tags) of [a, b] as above, the number
h(∆) := max{hi : i ∈ {1, . . . , n}},   hi := xi − xi−1 ,   (4.3)

is called the mesh size of ∆. Moreover, if f : [a, b] −→ R, then


Σ_{i=1}^{n} f (ti ) hi   (4.4)

is called the Riemann sum of f associated with a tagged partition of [a, b].
Notation 4.2. Given a real interval [a, b], a < b, we denote the set of all Riemann
integrable functions f : [a, b] −→ R by R[a, b].
Theorem 4.3. Given a real interval [a, b], a < b, a function f : [a, b] −→ R is in R[a, b]
if, and only if, for each sequence of tagged partitions

∆k = ((xk0 , . . . , xknk ), (tk1 , . . . , tknk )),   k ∈ N,   (4.5)

of [a, b] such that limk→∞ h(∆k ) = 0, the limit of associated Riemann sums
I(f ) := ∫_a^b f (x) dx := lim_{k→∞} Σ_{i=1}^{nk} f (tki ) hki   (4.6)

exists. The limit is then unique and called the Riemann integral of f . Recall that every
continuous and every piecewise continuous function on [a, b] is in R[a, b].

Proof. See, e.g., [Phi16a, Th. 10.10(d)]. 



Thus, for each Riemann integrable function, we can use (4.6) to compute a sequence of
approximations that converges to the correct value of the Riemann integral.
Example 4.4. To employ (4.4) to compute approximations to Φ(1), where Φ is the
function defined in (4.2), one can use the tagged partitions ∆n of [0, 1] given by xi =
ti = i/n for i ∈ {0, . . . , n}, n ∈ N. Noticing hi = 1/n, one obtains
In := Σ_{i=1}^{n} (1/n) (1/√(2π)) exp(−(1/2)(i/n)²).

From (4.6), we know that Φ(1) = limn→∞ In , since the integrand is continuous. For
example, one can compute the following values:

n In
1 0.2426
2 0.2978
10 0.3342
100 0.3415
1000 0.3422
5000 0.3422
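The sums In of Ex. 4.4 can be reproduced with a few lines of Python (this sketch is ours, not part of these notes); the printed values can be compared with the table above:

import numpy as np

def I(n):
    i = np.arange(1, n + 1)
    return np.sum(np.exp(-0.5 * (i / n) ** 2)) / (n * np.sqrt(2.0 * np.pi))

for n in (1, 2, 10, 100, 1000, 5000):
    print(n, I(n))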

A problem is that, when using (4.6), we have, a priori, no control over the approximation
error. Thus, given a required accuracy, we do not know what mesh size we need to
choose. And even though the first four digits in the table of Example 4.4 seem to have
stabilized, without further investigation, we cannot be certain that this is, indeed,
the case. The main goal in the following is to improve on this situation by providing
approximation formulas for Riemann integrals together with effective error estimates.
We will slightly generalize the problem class by considering so-called weighted integrals:
Definition 4.5. Given a real interval [a, b], a < b, each Riemann integrable function
ρ : [a, b] −→ R+0 , such that
∫_a^b ρ(x) dx > 0   (4.7)
(in particular, ρ ≢ 0, ρ bounded, 0 < ∫_a^b ρ(x) dx < ∞), is called a weight function.
Given a weight function ρ, the weighted integral of f ∈ R[a, b] is
Iρ (f ) := ∫_a^b f (x)ρ(x) dx ,   (4.8)

that means the usual integral I(f ) corresponds to ρ ≡ 1 (note that f ρ ∈ R[a, b], cf.
[Phi16a, Th. 10.18(d)]).

In the sequel, we will first study methods designed for the numerical solution of non-
weighted integrals (i.e. ρ ≡ 1) in Sections 4.2, 4.3, and 4.5, followed by an investigation
of Gaussian quadrature in Sec. 4.6, which is suitable for the numerical solution of both
weighted and nonweighted integrals.
Definition 4.6. Given a real interval [a, b], a < b, in a generalization of the Riemann
sums (4.4), a map
In : R[a, b] −→ R,   In (f ) = (b − a) Σ_{i=0}^{n} σi f (xi ),   (4.9)

is called a quadrature formula or a quadrature rule with distinct points x0 , . . . , xn ∈ [a, b]


and weights σ0 , . . . , σn ∈ R, n ∈ N0 .
Remark 4.7. Note that every quadrature rule constitutes a linear functional on R[a, b].

A decent quadrature rule should give the exact integral for polynomials of low degree.
This gives rise to the next definition:
Definition 4.8. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→
R+ 0 , a quadrature rule In is defined to have the degree of accuracy r ∈ N0 if, and only
if,
In (x^m ) = ∫_a^b x^m ρ(x) dx   for each m ∈ {0, . . . , r},   (4.10a)
but
In (x^{r+1} ) ≠ ∫_a^b x^{r+1} ρ(x) dx .   (4.10b)

Remark 4.9. (a) Due to the linearity of In , In has degree of accuracy at least r ∈ N0
if, and only if, In is exact for each p ∈ Polr (R).
(b) A quadrature rule In as given by (4.9) can never be exact for the polynomial
p : R −→ R,   p(x) := Π_{i=0}^{n} (x − xi )²,   (4.11)

since
In (p) = 0 ≠ ∫_a^b p(x)ρ(x) dx .   (4.12)
In particular, In can have degree of accuracy at most r = 2n + 1.
(c) In view of (b), a quadrature rule In as given by (4.9) has any degree of accuracy
(i.e. at least degree of accuracy 0) if, and only if,
∫_a^b ρ(x) dx = In (1) = (b − a) Σ_{i=0}^{n} σi ,   (4.13)

which simplifies to
Σ_{i=0}^{n} σi = 1   for ρ ≡ 1.   (4.14)

4.2 Quadrature Rules Based on Interpolating Polynomials


In this section, we consider ρ ≡ 1, that means only nonweighted integrals. We can make
use of interpolating polynomials to produce an abundance of useful quadrature rules.
Definition 4.10. Given a real interval [a, b], a < b, the quadrature rule based on
interpolating polynomials for the distinct points x0 , . . . , xn ∈ [a, b], n ∈ N0 , is defined
by
In : R[a, b] −→ R,   In (f ) := ∫_a^b pn (x) dx ,   (4.15)
where pn = pn (f ) ∈ Poln (R) is the interpolating polynomial corresponding to the data
points (x0 , f (x0 )), . . . , (xn , f (xn )).
Theorem 4.11. Let [a, b] be a real interval, a < b, and let In be the quadrature rule
based on interpolating polynomials for the distinct points x0 , . . . , xn ∈ [a, b], n ∈ N0 , as
defined in (4.15).

(a) In is, indeed, a quadrature rule in the sense of Def. 4.6, where the weights σi ,
i ∈ {0, . . . , n}, are given in terms of the Lagrange basis polynomials Li of (3.10) by
σi = (1/(b − a)) ∫_a^b Li (x) dx = (1/(b − a)) ∫_a^b Π_{k=0, k≠i}^{n} (x − xk )/(xi − xk ) dx
   =(∗) ∫_0^1 Π_{k=0, k≠i}^{n} (t − tk )/(ti − tk ) dt ,   where ti := (xi − a)/(b − a).   (4.16)

(b) In has at least degree of accuracy n, i.e. it is exact for each q ∈ Poln (R).

(c) For each f ∈ C n+1 [a, b], one has the error estimate
|∫_a^b f (x) dx − In (f )| ≤ (kf (n+1) k∞ /(n + 1)!) ∫_a^b |ωn+1 (x)| dx ,   (4.17)
where ωn+1 (x) = Π_{i=0}^{n} (x − xi ) is the Newton basis polynomial.

Proof. (a): If pn is as in (4.15), then, according to (3.10), for each x ∈ R,


pn (x) = Σ_{i=0}^{n} f (xi ) Li (x) = Σ_{i=0}^{n} f (xi ) Π_{k=0, k≠i}^{n} (x − xk )/(xi − xk ).

Comparing with (4.9) then yields (4.16), where the equality labeled (∗) is due to

(x − xk )/(xi − xk ) = ((x − a)/(b − a) − (xk − a)/(b − a)) / ((xi − a)/(b − a) − (xk − a)/(b − a))
and the change of variables x 7→ t := (x − a)/(b − a).
(b): For q ∈ Poln (R), we have pn (q) = q, such that In (q) = ∫_a^b q(x) dx is clear from (4.15).
(c) follows by using (4.15) in combination with (3.31). 
Remark 4.12. If f ∈ C ∞ [a, b] and there exist K ∈ R+0 and R < (b − a)⁻¹ such that
(3.32) holds, i.e.
kf (n) k∞ ≤ K n! R^n   for each n ∈ N,
then, for each sequence of distinct numbers (x0 , x1 , . . . ) in [a, b], the corresponding
quadrature rules based on interpolating polynomials converge, i.e.
lim_{n→∞} (∫_a^b f (x) dx − In (f )) = 0 :
Indeed, |∫_a^b f (x) dx − In (f )| ≤ ∫_a^b kf − pn k∞ dx = (b − a) kf − pn k∞ and limn→∞ kf − pn k∞ = 0 according to (3.34).

In certain situations, (4.17) can be improved by determining the sign of the error term
(see Prop. 4.13 below). This information will come in handy when proving subsequent
error estimates for special cases.

Proposition 4.13. Let [a, b] be a real interval, a < b, and let In be the quadrature rule
based on interpolating polynomials for the distinct points x0 , . . . , xn ∈ [a, b], n ∈ N0 , as
defined in (4.15). If ωn+1 (x) = Π_{i=0}^{n} (x − xi ) has only one sign on [a, b] (i.e. ωn+1 ≥ 0
or ωn+1 ≤ 0 on [a, b], see Rem. 4.14 below), then, for each f ∈ C n+1 [a, b], there exists
τ ∈ [a, b] such that
∫_a^b f (x) dx − In (f ) = (f (n+1) (τ )/(n + 1)!) ∫_a^b ωn+1 (x) dx .   (4.18)

In particular, if f (n+1) has just one sign as well, then one can infer the sign of the error
term (i.e. of the right-hand side of (4.18)).

Proof. From Th. 3.25, we know that, for each x ∈ [a, b], there exists ξ(x) ∈ [a, b] such
that (cf. (3.31))
f (x) = pn (x) + (f (n+1) (ξ(x))/(n + 1)!) ωn+1 (x),   (4.19)

where, according to Th. 3.25(c), the function x 7→ f (n+1) ξ(x) is continuous on [a, b].
Thus, integrating (4.19) and applying the mean value theorem of integration (cf. [Phi17,
(2.15f)]) yields τ ∈ [a, b] satisfying
∫_a^b f (x) dx − In (f ) = (1/(n + 1)!) ∫_a^b f (n+1) (ξ(x)) ωn+1 (x) dx = (f (n+1) (τ )/(n + 1)!) ∫_a^b ωn+1 (x) dx ,

proving (4.18). 

Remark 4.14. There exist precisely 3 examples where ωn+1 has only one sign on [a, b],
namely ω1 (x) = (x − a) ≥ 0, ω1 (x) = (x − b) ≤ 0, and ω2 (x) = (x − a)(x − b) ≤ 0. In
all other cases, there exists at least one xi ∈]a, b[. As xi is a simple zero of ωn+1 , ωn+1
changes its sign in the interior of [a, b].

4.3 Newton-Cotes Formulas


We will now study quadrature rules based on interpolating polynomials with equally-
spaced data points. In particular, we continue to assume ρ ≡ 1 (nonweighted integrals).

4.3.1 Definition, Weights, Degree of Accuracy

Definition and Remark 4.15. Given a real interval [a, b], a < b, a quadrature rule
based on interpolating polynomials according to Def. 4.10 is called a Newton-Cotes

formula, also known as a Newton-Cotes quadrature rule if, and only if, the points
x0 , . . . , xn ∈ [a, b], n ∈ N, are equally-spaced. Moreover, a Newton-Cotes formula is
called closed or a Newton-Cotes closed quadrature rule if, and only if,
xi := a + i h,   h := (b − a)/n,   i ∈ {0, . . . , n},   n ∈ N.   (4.20)
In the present context of equally-spaced xi , the discretization’s mesh size h is also called
step size. It is an exercise to compute the weights of the Newton-Cotes closed quadrature
rules. One obtains
σi = σi,n = (1/n) ∫_0^n Π_{k=0, k≠i}^{n} (s − k)/(i − k) ds .   (4.21)

Given n, one can compute the σi explicitly and store or tabulate them for further use.
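For instance, the weights can be obtained by expanding the product in (4.21) into a polynomial in s and integrating it exactly. The following Python sketch is ours (not part of these notes) and uses NumPy's polynomial class; for n = 2 it yields 1/6, 2/3, 1/6 (cf. Sec. 4.3.4 below), and the symmetry stated in Lem. 4.16 below can be checked numerically:

import numpy as np
from numpy.polynomial import Polynomial

def newton_cotes_weights(n):
    # weights sigma_{i,n} of the closed Newton-Cotes rule, cf. (4.21)
    weights = []
    for i in range(n + 1):
        p = Polynomial([1.0])
        for k in range(n + 1):
            if k != i:
                p = p * Polynomial([-k, 1.0]) * (1.0 / (i - k))   # factor (s - k)/(i - k)
        P = p.integ()                                             # antiderivative in s
        weights.append((P(n) - P(0)) / n)
    return np.array(weights)

print(newton_cotes_weights(2))   # approximately [1/6, 2/3, 1/6]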
The next lemma shows that one actually does not need to compute all σi :

Lemma 4.16. The weights σi of the Newton-Cotes closed quadrature rules (see (4.21))
are symmetric, i.e.

σi = σn−i for each i ∈ {0, . . . , n}, n ∈ N. (4.22)

Proof. Exercise. 

If n is even, then, as we will show in Th. 4.18 below, the corresponding closed Newton-
Cotes formula In is exact not only for p ∈ Poln (R), but even for p ∈ Poln+1 (R). On the
other hand, we will also see in Th. 4.18 that I(x^{n+2} ) ≠ In (x^{n+2} ) as a consequence of the
following preparatory lemma:

Lemma 4.17. If [a, b], a < b, is a real interval, n ∈ N is even, and xi , i ∈ {0, . . . , n},
are defined according to (4.20), then the antiderivative of the Newton basis polynomial
ω := ωn+1 , i.e. the function
F : R −→ R,   F (x) := ∫_a^x Π_{i=0}^{n} (y − xi ) dy ,   (4.23)

satisfies
F (x) = 0 for x = a,   F (x) > 0 for a < x < b,   F (x) = 0 for x = b.   (4.24)

Proof. F (a) = 0 is clear. The remaining assertions of (4.24) are shown in several steps.

Claim 1. With h ∈ ]0, b − a[ defined according to (4.20), one has
ω(x2j + τ ) > 0   and   ω(x2j+1 + τ ) < 0   for each τ ∈ ]0, h[ and each j ∈ {0, . . . , n/2 − 1}.   (4.25)

Proof. As ω has degree precisely n + 1 and a zero at each of the n + 1 distinct points
x0 , . . . , xn , each of these zeros must be simple. In particular, ω must change its sign at
each xi . Moreover, limx→−∞ ω(x) = −∞ due to the fact that the degree n + 1 of ω is
odd. This implies that ω changes its sign from negative to positive at x0 = a and from
positive to negative at x1 = a + h, establishing the case j = 0. The general case (4.25)
now follows by induction. N
Claim 2. On the left half of the interval [a, b], |ω| decays in the following sense:
|ω(x + h)| < |ω(x)|   for each x ∈ [a, (a + b)/2 − h] \ {x0 , . . . , xn/2−1 }.   (4.26)

Proof. For each x that is not a zero of ω, we compute


ω(x + h)/ω(x) = Π_{i=0}^{n} (x + h − xi ) / Π_{i=0}^{n} (x − xi ) = ((x + h − a) Π_{i=1}^{n} (x + h − xi )) / ((x − b) Π_{i=0}^{n−1} (x − xi ))
= ((x + h − a) Π_{i=0}^{n−1} (x − xi )) / ((x − b) Π_{i=0}^{n−1} (x − xi )) = (x + h − a)/(x − b),
using xi − h = xi−1 . Moreover, if a < x < (a + b)/2 − h, then
|x + h − a| = x + h − a < (b − a)/2 < b − x = |x − b|,
proving (4.26). N
Claim 3. F (x) > 0 for each x ∈ ]a, (a + b)/2].

Proof. There exists τ ∈ ]0, h] and i ∈ {0, . . . , n/2 − 1} such that x = xi + τ . Thus,
F (x) = ∫_{xi}^{x} ω(y) dy + Σ_{k=0}^{i−1} ∫_{xk}^{xk+1} ω(y) dy ,   (4.27)
and it suffices to show that, for each τ ∈ ]0, h],
∫_{x2j}^{x2j +τ} ω(y) dy > 0   if j ∈ {k ∈ N0 : 0 ≤ k ≤ (n/2 − 1)/2},   (4.28a)
∫_{x2j}^{x2j+1 +τ} ω(y) dy > 0   if j ∈ {k ∈ N0 : 0 ≤ k ≤ (n/2 − 2)/2}.   (4.28b)

While (4.28a) is immediate from (4.25), for (4.28b), one calculates


∫_{x2j}^{x2j+1 +τ} ω(y) dy = ∫_{x2j}^{x2j +τ} (ω(y) + ω(y + h)) dy + ∫_{x2j +τ}^{x2j+1} ω(y) dy > 0,
where the first integrand is > 0 by (4.26) (there, ω(y + h) = −|ω(y + h)| and |ω(y + h)| < |ω(y)| = ω(y)) and the second integral is ≥ 0 by (4.25),
for each τ ∈ ]0, h] and each 0 ≤ j ≤ (n/2 − 2)/2, thereby establishing the case. N
Claim 4. ω is antisymmetric with respect to the midpoint of [a, b], i.e.
ω((a + b)/2 + τ ) = −ω((a + b)/2 − τ )   for each τ ∈ R.   (4.29)

Proof. From (a + b)/2 − xi = −((a + b)/2 − xn−i ) = (n/2 − i)h for each i ∈ {0, . . . , n},
one obtains
ω((a + b)/2 + τ ) = Π_{i=0}^{n} ((a + b)/2 + τ − xi ) = Π_{i=0}^{n} (−((a + b)/2 − τ − xn−i ))
= −Π_{i=0}^{n} ((a + b)/2 − τ − xi ) = −ω((a + b)/2 − τ ),

which is (4.29). N
Claim 5. F is symmetric with respect to the midpoint of [a, b], i.e.
F ((a + b)/2 + τ ) = F ((a + b)/2 − τ )   for each τ ∈ R.   (4.30)
Proof. For each τ ∈ [0, (b − a)/2], we obtain
F ((a + b)/2 + τ ) = ∫_a^{(a+b)/2−τ} ω(x) dx + ∫_{(a+b)/2−τ}^{(a+b)/2+τ} ω(x) dx = F ((a + b)/2 − τ ),

since the second integral vanishes due to (4.29). N

Finally, (4.24) follows from combining F (a) = 0 with Claims 3 and 5. 

Theorem 4.18. If [a, b], a < b, is a real interval and n ∈ N is even, then the Newton-
Cotes closed quadrature rule In : R[a, b] −→ R has degree of accuracy n + 1.

Proof. It is an exercise to show that In has at least degree of accuracy n + 1. It then


remains to show that
I(x^{n+2} ) ≠ In (x^{n+2} ).   (4.31)
To that end, let xn+1 ∈ [a, b] \ {x0 , . . . , xn }, and let q ∈ Poln+1 (R) be the interpolating
polynomial for g(x) = xn+2 and x0 , . . . , xn+1 , whereas pn ∈ Poln (R) is the corresponding
interpolating polynomial for x0 , . . . , xn (note that pn interpolates both g and q). As we
already know that In has at least degree of accuracy n + 1, we obtain
∫_a^b pn (x) dx = In (g) = In (q) = ∫_a^b q(x) dx .   (4.32)

By applying (3.31) with g(x) = x^{n+2} and using g (n+2) (x) = (n + 2)!, one obtains
g(x) − q(x) = (g (n+2) (ξ(x))/(n + 2)!) ωn+2 (x) = ωn+2 (x) = Π_{i=0}^{n+1} (x − xi )   for each x ∈ [a, b].   (4.33)

Integrating (4.33) while taking into account (4.32) as well as using F from (4.23) yields
I(g) − In (g) = ∫_a^b Π_{i=0}^{n+1} (x − xi ) dx = ∫_a^b F ′ (x)(x − xn+1 ) dx
= [F (x)(x − xn+1 )]_a^b − ∫_a^b F (x) dx = −∫_a^b F (x) dx < 0   by (4.24),

proving (4.31). 

4.3.2 Rectangle Rules (n = 0)

Note that, for n = 0, there is no closed Newton-Cotes formula, as (4.20) can not be
satisfied with just one point x0 . Every Newton-Cotes quadrature rule with n = 0 is
called a rectangle rule, since the integral ∫_a^b f (x) dx is approximated by (b − a)f (x0 ),
x0 ∈ [a, b], i.e. by the area of the rectangle with vertices (a, 0), (b, 0), (a, f (x0 )), (b, f (x0 )).
Not surprisingly, the rectangle rule with x0 = (b + a)/2 is known as the midpoint rule.
Rectangle rules have degree of accuracy r ≥ 0, since they are exact for constant func-
tions. Rectangle rules actually all have r = 0, except for the midpoint rule with r = 1
(exercise).
In the following lemma, we provide an error estimate for the rectangle rule using x0 = a.
For other choices of x0 , one obtains similar results.

Lemma 4.19. Given a real interval [a, b], a < b, and f ∈ C 1 [a, b], there exists τ ∈ [a, b]
such that
∫_a^b f (x) dx − (b − a)f (a) = ((b − a)²/2) f ′ (τ ).   (4.34)

Proof. As ω1 (x) = (x − a) ≥ 0 on [a, b], we can apply Prop. 4.13 to get τ ∈ [a, b]
satisfying
∫_a^b f (x) dx − (b − a)f (a) = f ′ (τ ) ∫_a^b (x − a) dx = ((b − a)²/2) f ′ (τ ),

proving (4.34). 

4.3.3 Trapezoidal Rule (n = 1)

Definition and Remark 4.20. The closed Newton-Cotes formula with n = 1 is called
trapezoidal rule. Its explicit form is easily computed, e.g. from (4.15):
I1 (f ) := ∫_a^b (f (a) + ((f (b) − f (a))/(b − a)) (x − a)) dx = [f (a) x + (1/2)((f (b) − f (a))/(b − a)) (x − a)²]_a^b
= (b − a) ((1/2) f (a) + (1/2) f (b))   (4.35)

for each f ∈ C[a, b]. Note that (4.35) justifies the name trapezoidal rule, as this is
precisely the area of the trapezoid with vertices (a, 0), (b, 0), (a, f (a)), (b, f (b)).

Lemma 4.21. The trapezoidal rule has degree of accuracy r = 1.

Proof. Exercise. 

Lemma 4.22. Given a real interval [a, b], a < b, and f ∈ C 2 [a, b], there exists τ ∈ [a, b]
such that
∫_a^b f (x) dx − I1 (f ) = −((b − a)³/12) f ′′ (τ ) = −(h³/12) f ′′ (τ ).   (4.36)

Proof. Exercise. 

4.3.4 Simpson’s Rule (n = 2)

Definition and Remark 4.23. The closed Newton-Cotes formula with n = 2 is called
Simpson’s rule. To find the explicit form, we compute the weights σ0 , σ1 , σ2 . According
to (4.21), one finds
σ0 = (1/2) ∫_0^2 ((s − 1)/(−1)) ((s − 2)/(−2)) ds = (1/4) [s³/3 − 3s²/2 + 2s]_0^2 = 1/6.   (4.37)

Next, one obtains σ2 = σ0 = 1/6 from (4.22) and σ1 = 1 − σ0 − σ2 = 2/3 from Rem. 4.9.
Plugging this into (4.9) yields
I2 (f ) = (b − a) ((1/6) f (a) + (2/3) f ((a + b)/2) + (1/6) f (b))   (4.38)

for each f ∈ C[a, b].

Lemma 4.24. Simpson's rule has degree of accuracy 3.

Proof. The assertion is immediate from Th. 4.18. 
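The degree of accuracy can also be verified directly. The following Python sketch is ours (not part of these notes); it applies (4.38) on [0, 1] to the monomials x^m and compares with the exact integrals 1/(m + 1):

def simpson(f, a, b):
    return (b - a) * (f(a) / 6 + 2 * f((a + b) / 2) / 3 + f(b) / 6)   # cf. (4.38)

for m in range(5):
    print(m, simpson(lambda x: x ** m, 0.0, 1.0), 1.0 / (m + 1))
# the values agree for m = 0, ..., 3; for m = 4, Simpson's rule yields 5/24 instead of 1/5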

Lemma 4.25. Given a real interval [a, b], a < b, and f ∈ C 4 [a, b], there exists τ ∈ [a, b]
such that
∫_a^b f (x) dx − I2 (f ) = −((b − a)⁵/2880) f (4) (τ ) = −(h⁵/90) f (4) (τ ).   (4.39)

Proof. In addition to x0 = a, x1 = (a + b)/2, x2 = b, we choose x3 := x1 . Using Hermite


interpolation, we know that there is a unique q ∈ Pol3 (R) such that q(xi ) = f (xi ) for
each i ∈ {0, 1, 2} and q ′ (x3 ) = f ′ (x3 ). As before, let p2 ∈ Pol2 (R) be the corresponding
interpolating polynomial merely satisfying p2 (xi ) = f (xi ). As, by Lem. 4.24, I2 has
degree of accuracy 3, we know
∫_a^b p2 (x) dx = I2 (f ) = I2 (q) = ∫_a^b q(x) dx .   (4.40)

By applying (3.31) with n = 3, one obtains



f (x) = q(x) + (f (4) (ξ(x))/4!) ω4 (x)   for each x ∈ [a, b],   (4.41)
where
ω4 (x) = (x − a)(x − (a + b)/2)²(x − b)   (4.42)


and the function x 7→ f (4) ξ(x) is continuous on [a, b] by Th. 3.25(c). Moreover,
according to (4.42), ω4 (x) ≤ 0 for each x ∈ [a, b]. Thus, integrating (4.41) and applying
(4.40) as well as the mean value theorem of integration (cf. [Phi17, (2.15f)]) yields
τ ∈ [a, b] satisfying
∫_a^b f (x) dx − I2 (f ) = (f (4) (τ )/4!) ∫_a^b ω4 (x) dx = −(f (4) (τ )/24) (b − a)⁵/120
= −((b − a)⁵/2880) f (4) (τ ) = −(h⁵/90) f (4) (τ ),   (4.43)
where it was used that
∫_a^b ω4 (x) dx = −∫_a^b ((x − a)²/2) (2(x − (a + b)/2)(x − b) + (x − (a + b)/2)²) dx
= −((b − a)⁵/24 − ∫_a^b ((x − a)³/6) (2(x − b) + 4(x − (a + b)/2)) dx)
= −((b − a)⁵/24 − 2(b − a)⁵/24 + ∫_a^b 6 (x − a)⁴/24 dx)
= −(b − a)⁵/120.   (4.44)
This completes the proof of (4.39). 

4.3.5 Higher Order Newton-Cotes Formulas

As before, let [a, b] be a real interval, a < b, and let In : R[a, b] −→ R be the Newton-
Cotes closed quadrature rule based on interpolating polynomials p ∈ Poln (R). According
to Th. 4.11(b), we know that In has degree of accuracy at least n, and one might hope
that, by increasing n, one obtains more and more accurate quadrature rules for all
f ∈ R[a, b] (or at least for all f ∈ C[a, b]). Unfortunately, this is not the case! As it
turns out, Newton-Cotes formulas with larger n have quite a number of drawbacks. As
a result, in practice, Newton-Cotes formulas with n > 3 are hardly ever used (for n = 3,
one obtains Simpson’s 3/8 rule, and even that is not used all that often).
The problems with higher order Newton-Cotes formulas have to do with the fact that
the formulas involve negative weights (negative weights first occur for n = 8). As n
increases, the situation becomes worse and worse and one can actually show that (see
[Bra77, Sec. IV.2])
lim_{n→∞} Σ_{i=0}^{n} |σi,n | = ∞.   (4.45)

For polynomials and some other functions, terms with negative and positive weights
cancel, but for general f ∈ C[a, b] (let alone general f ∈ R[a, b]), that is not the case,
leading to instabilities and divergence. That, in the light of (4.45), convergence can not
be expected is a consequence of the following Th. 4.27.

4.4 Convergence of Quadrature Rules


We first compute the operator norm of a quadrature rule in terms of its weights:
Proposition 4.26. Given a real interval [a, b], a < b, and a quadrature rule
In : C[a, b] −→ R,   In (f ) = (b − a) Σ_{i=0}^{n} σi f (xi ),   (4.46)

with distinct x0 , . . . , xn ∈ [a, b] and weights σ0 , . . . , σn ∈ R, n ∈ N, the operator norm of


In induced by k · k∞ on C[a, b] is
kIn k = (b − a) Σ_{i=0}^{n} |σi |.   (4.47)

Proof. For each f ∈ C[a, b], we estimate


|In (f )| ≤ (b − a) Σ_{i=0}^{n} |σi | kf k∞ ,
which already shows kIn k ≤ (b − a) Σ_{i=0}^{n} |σi |. For the remaining inequality, define
φ : {0, . . . , n} −→ {0, . . . , n} to be a reordering such that xφ(0) < xφ(1) < · · · < xφ(n) ,
let yi := sgn(σφ(i) ) for each i ∈ {0, . . . , n}, and
f (x) := y0   for x ∈ [a, xφ(0) ],
f (x) := yi + ((yi+1 − yi )/(xφ(i+1) − xφ(i) )) (x − xφ(i) )   for x ∈ [xφ(i) , xφ(i+1) ], i ∈ {0, . . . , n − 1},
f (x) := yn   for x ∈ [xφ(n) , b].

Note that f is a linear spline on [xφ(0) , xφ(n) ] (cf. (3.44)), possibly with constant con-
tinuous extensions on [a, xφ(0) ] and [xφ(n) , b]. In particular, f ∈ C[a, b]. Clearly, either
In ≡ 0 or kf k∞ = 1 and
In (f ) = (b − a) Σ_{i=0}^{n} σi f (xi ) = (b − a) Σ_{i=0}^{n} σi sgn(σi ) = (b − a) Σ_{i=0}^{n} |σi |,
showing the remaining inequality kIn k ≥ (b − a) Σ_{i=0}^{n} |σi | and, thus, (4.47). 

Theorem 4.27 (Polya). Given a real interval [a, b], a < b, and a weight function
ρ : [a, b] −→ R+
0 , consider a sequence of quadrature rules

In : C[a, b] −→ R,   In (f ) = (b − a) Σ_{i=0}^{n} σi,n f (xi,n ),   (4.48)

with distinct points x0,n , . . . , xn,n ∈ [a, b] and weights σ0,n , . . . , σn,n ∈ R, n ∈ N. The
sequence converges in the sense that
lim_{n→∞} In (f ) = ∫_a^b f (x)ρ(x) dx   for each f ∈ C[a, b]   (4.49)

if, and only if, the following two conditions are satisfied:
(i) lim_{n→∞} In (p) = ∫_a^b p(x)ρ(x) dx holds for each polynomial p.

(ii) The sums of the absolute values of the weights are uniformly bounded, i.e. there
exists C ∈ R+ such that
Σ_{i=0}^{n} |σi,n | ≤ C   for each n ∈ N.

Proof. Suppose (i) and (ii) hold true. Let W := ∫_a^b ρ(x) dx . Given f ∈ C[a, b] and ǫ > 0,
according to the Weierstrass Approximation Theorem F.1, there exists a polynomial p
such that
kf − p↾[a,b] k∞ < ǫ̃ := ǫ/(C(b − a) + 1 + W ).   (4.50)
Due to (i), there exists N ∈ N such that, for each n ≥ N ,
|In (p) − ∫_a^b p(x)ρ(x) dx | < ǫ̃.

Thus, for each n ≥ N ,


|In (f ) − ∫_a^b f (x)ρ(x) dx |
≤ |In (f ) − In (p)| + |In (p) − ∫_a^b p(x)ρ(x) dx | + |∫_a^b p(x)ρ(x) dx − ∫_a^b f (x)ρ(x) dx |
< (b − a) Σ_{i=0}^{n} |σi,n | |f (xi,n ) − p(xi,n )| + ǫ̃ + ǫ̃ W ≤ ǫ̃ ((b − a)C + 1 + W ) = ǫ,

proving (4.49).
Conversely, (4.49) trivially implies (i). That it also implies (ii) is much more involved.
Letting T := {In : n ∈ N}, we have T ⊆ L(C[a, b], R) and (4.49) implies
sup{|T (f )| : T ∈ T } = sup{|In (f )| : n ∈ N} < ∞   for each f ∈ C[a, b].   (4.51)

Then the Banach-Steinhaus Th. G.4 yields


sup{kIn k : n ∈ N} = sup{kT k : T ∈ T } < ∞,   (4.52)

which, according to Prop. 4.26, is (ii). The proof of the Banach-Steinhaus theorem is
usually part of Functional Analysis. It is provided in Appendix G. As it only uses
elementary metric space theory, the interested reader should be able to understand
it. 

While, for ρ ≡ 1, Condition (i) of Th. 4.27 is obviously satisfied by the Newton-Cotes
formulas, since, given q ∈ Poln (R), Im (q) is exact for each m ≥ n, Condition (ii) of Th.
4.27 is violated due to (4.45). As the Newton-Cotes formulas are based on interpolating
polynomials, and we had already needed additional conditions to guarantee the conver-
gence of interpolating polynomials (see Th. 3.26), it should not be too surprising that
Newton-Cotes formulas do not converge for all continuous functions.
So we see that improving the accuracy of Newton-Cotes formulas by increasing n is,
in general, not feasible. However, the accuracy can be improved using a number of
different strategies, two of which will be considered in the following. In Sec. 4.5, we will
study so-called composite Newton-Cotes rules that result from subdividing [a, b] into
small intervals, using a low-order Newton-Cotes formula on each small interval, and
then summing up the results. In Sec. 4.6, we will consider Gaussian quadrature, where,
in contrast to the Newton-Cotes formulas, one does not use equally-spaced points xi for
the interpolation, but chooses the locations of the xi more carefully. As it turns out,
one can choose the xi such that all weights remain positive even for n → ∞. Then Rem.
4.9 combined with Th. 4.27 yields convergence.

4.5 Composite Newton-Cotes Quadrature Rules


Before studying Gaussian quadrature rules in the context of weighted integrals in Sec.
4.6, for the present section, we return to the situation of Sec. 4.3, i.e. nonweighted
integrals (ρ ≡ 1).

4.5.1 Introduction, Convergence

As discussed in the previous section, improving the accuracy of numerical integration


by using high-order Newton-Cotes rules does usually not work. On the other hand,
better accuracy can be achieved by subdividing [a, b] into small intervals and then using
a low-order Newton-Cotes rule (typically n ≤ 2) on each of the small intervals. The
composite Newton-Cotes quadrature rules are the results of this strategy.
We begin with a general definition and a general convergence result based on the con-
vergence of Riemann sums for Riemann integrable functions.

Definition 4.28. Suppose, for f : [0, 1] −→ R, In , n ∈ N0 , has the form


In (f ) = Σ_{l=0}^{n} σl f (ξl )   (4.53)

with 0 ≤ ξ0 < ξ1 < · · · < ξn ≤ 1 and weights σ0 , . . . , σn ∈ R. Given a real interval [a, b],
a < b, and N ∈ N, set
xk := a + k h,   h := (b − a)/N ,   for each k ∈ {0, . . . , N },   (4.54a)
xkl := xk−1 + h ξl , for each k ∈ {1, . . . , N }, l ∈ {0, . . . , n}. (4.54b)

Then
In,N (f ) := h Σ_{k=1}^{N} Σ_{l=0}^{n} σl f (xkl ),   f : [a, b] −→ R,   (4.55)

are called the composite quadrature rules based on In .

Remark 4.29. The heuristics behind (4.55) is


∫_{xk−1}^{xk} f (x) dx = h ∫_0^1 f (xk−1 + ξh) dξ ≈ h Σ_{l=0}^{n} σl f (xkl ).   (4.56)

Theorem 4.30. Let [a, b], a < b, be a real interval, n ∈ N0 . If In has the form (4.53)
and is exact for each constant function (i.e. it has at least degree of accuracy 0 in the
language of Def. 4.8), then the composite quadrature rules (4.55) based on In converge
for each Riemann integrable f : [a, b] −→ R, i.e.
lim_{N→∞} In,N (f ) = ∫_a^b f (x) dx   for each f ∈ R[a, b].   (4.57)

Proof. Let f ∈ R[a, b]. Introducing, for each l ∈ {0, . . . , n}, the abbreviation
Sl (N ) := Σ_{k=1}^{N} ((b − a)/N ) f (xkl ),   (4.58)

(4.55) yields
In,N (f ) = Σ_{l=0}^{n} σl Sl (N ).   (4.59)

Note that, for each l ∈ {0, . . . , n}, Sl (N ) is a Riemann sum for the tagged partition
∆N l := ((x0 , . . . , xN ), (x1l , . . . , xN l )) and limN →∞ h(∆N l ) = limN →∞ (b − a)/N = 0. Thus,
applying Th. 4.3,
lim_{N→∞} Sl (N ) = ∫_a^b f (x) dx   for each l ∈ {0, . . . , n}.

This, in turn, implies


lim_{N→∞} In,N (f ) = lim_{N→∞} Σ_{l=0}^{n} σl Sl (N ) = ∫_a^b f (x) dx Σ_{l=0}^{n} σl = ∫_a^b f (x) dx

as, due to hypothesis,


Σ_{l=0}^{n} σl = In (1) = ∫_0^1 dx = 1,

proving (4.57). 

The convergence result of Th. 4.30 still suffers from the drawback of the original con-
vergence result for the Riemann sums, namely that we do not obtain any control on the
rate of convergence and the resulting error term. We will achieve such results in the
following sections by assuming additional regularity of the integrated functions f . We
introduce the following notion as a means to quantify the convergence rate of composite
quadrature rules:
Definition 4.31. Given a real interval [a, b], a < b, and a subset F of the Riemann
integrable functions on [a, b], composite quadrature rules In,N according to (4.55) are
said to have order of convergence r > 0 for f ∈ F if, and only if, for each f ∈ F, there
exists K = K(f ) ∈ R+ 0 such that
|In,N (f ) − ∫_a^b f (x) dx | ≤ K h^r   for each N ∈ N   (4.60)

(h = (b − a)/N as before).

4.5.2 Composite Rectangle Rules (n = 0)


Definition and Remark 4.32. Using the rectangle rule ∫_0^1 f (x) dx ≈ f (0) of Sec.
4.3.2 as I0 on [0, 1], the formula (4.55) yields, for f : [a, b] −→ R, the corresponding
composite rectangle rules
I0,N (f ) = h Σ_{k=1}^{N} f (xk0 ) = h (f (x0 ) + f (x1 ) + · · · + f (xN −1 )),   using xk0 = xk−1 .   (4.61)

Theorem 4.33. For each f ∈ C 1 [a, b], there exists τ ∈ [a, b] satisfying
∫_a^b f (x) dx − I0,N (f ) = ((b − a)²/(2N )) f ′ (τ ) = ((b − a)/2) h f ′ (τ ).   (4.62)
In particular, the composite rectangle rules have order of convergence r = 1 for f ∈
C 1 [a, b] in terms of Def. 4.31.

Proof. Fix f ∈ C 1 [a, b]. From Lem. 4.19, we obtain, for each k ∈ {1, . . . , N }, a point
τk ∈ [xk−1 , xk ] such that
∫_{xk−1}^{xk} f (x) dx − ((b − a)/N ) f (xk−1 ) = ((b − a)²/(2N ²)) f ′ (τk ) = (h²/2) f ′ (τk ).   (4.63)

Summing (4.63) from k = 1 through N yields


∫_a^b f (x) dx − I0,N (f ) = ((b − a)/2) h (1/N ) Σ_{k=1}^{N} f ′ (τk ).

As f ′ is continuous and
min{f ′ (x) : x ∈ [a, b]} ≤ α := (1/N ) Σ_{k=1}^{N} f ′ (τk ) ≤ max{f ′ (x) : x ∈ [a, b]},

the intermediate value theorem provides τ ∈ [a, b] satisfying α = f ′ (τ ), proving (4.62).


The claimed order of convergence now follows as well, since (4.62) implies that (4.60)
holds with r = 1 and K := (b − a)kf ′ k∞ /2. 

4.5.3 Composite Trapezoidal Rules (n = 1)

Definition and Remark 4.34. Using the trapezoidal rule (4.35) as I1 on [0, 1], the
formula (4.55) yields, for f : [a, b] −→ R, the corresponding composite trapezoidal rules
I1,N (f ) = h Σ_{k=1}^{N} (1/2)(f (xk0 ) + f (xk1 )) = h Σ_{k=1}^{N} (1/2)(f (xk−1 ) + f (xk ))
= (h/2) (f (a) + 2 (f (x1 ) + · · · + f (xN −1 )) + f (b)),   (4.64)
using xk0 = xk−1 and xk1 = xk .
Theorem 4.35. For each f ∈ C 2 [a, b], there exists τ ∈ [a, b] satisfying
∫_a^b f (x) dx − I1,N (f ) = −((b − a)³/(12N ²)) f ′′ (τ ) = −((b − a)/12) h² f ′′ (τ ).   (4.65)
In particular, the composite trapezoidal rules have order of convergence r = 2 for f ∈
C 2 [a, b] in terms of Def. 4.31.

Proof. Fix f ∈ C 2 [a, b]. From Lem. 4.22, we obtain, for each k ∈ {1, . . . , N }, a point
τk ∈ [xk−1 , xk ] such that
∫_{xk−1}^{xk} f (x) dx − (h/2)(f (xk−1 ) + f (xk )) = −(h³/12) f ′′ (τk ).

Summing the above equation from k = 1 through N yields


∫_a^b f (x) dx − I1,N (f ) = −((b − a)/12) h² (1/N ) Σ_{k=1}^{N} f ′′ (τk ).

As f ′′ is continuous and
min{f ′′ (x) : x ∈ [a, b]} ≤ α := (1/N ) Σ_{k=1}^{N} f ′′ (τk ) ≤ max{f ′′ (x) : x ∈ [a, b]},

the intermediate value theorem provides τ ∈ [a, b] satisfying α = f ′′ (τ ), proving (4.65).


The claimed order of convergence now follows as well, since (4.65) implies that (4.60)
holds with r = 2 and K := (b − a)kf ′′ k∞ /12. 

4.5.4 Composite Simpson’s Rules (n = 2)

Definition and Remark 4.36. Using Simpson’s rule (4.38) as I2 on [0, 1], the formula
(4.55) yields, for f : [a, b] −→ R, the corresponding composite Simpson’s rules I2,N (f ).
It is an exercise to check that
I2,N (f ) = (h/6) (f (a) + 4 Σ_{k=1}^{N} f (xk−1/2 ) + 2 Σ_{k=1}^{N−1} f (xk ) + f (b)),   (4.66)
where xk−1/2 := (xk−1 + xk )/2 for each k ∈ {1, . . . , N }.

Theorem 4.37. For each f ∈ C 4 [a, b], there exists τ ∈ [a, b] satisfying
∫_a^b f (x) dx − I2,N (f ) = −((b − a)⁵/(2880N ⁴)) f (4) (τ ) = −((b − a)/2880) h⁴ f (4) (τ ).   (4.67)
In particular, the composite Simpson’s rules have order of convergence r = 4 for f ∈
C 4 [a, b] in terms of Def. 4.31.

Proof. Exercise. 
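The convergence orders asserted in Theorems 4.33, 4.35 and 4.37 can be observed numerically. The following Python sketch is ours (not part of these notes); it applies the composite rules (4.61), (4.64) and (4.66) to f(x) = e^x on [0, 1]:

import numpy as np

def composite_rules(f, a, b, N):
    x = np.linspace(a, b, N + 1)
    h = (b - a) / N
    mid = (x[:-1] + x[1:]) / 2
    rect = h * np.sum(f(x[:-1]))                                                 # (4.61)
    trap = h * (0.5 * f(a) + np.sum(f(x[1:-1])) + 0.5 * f(b))                    # (4.64)
    simp = h / 6 * (f(a) + 4 * np.sum(f(mid)) + 2 * np.sum(f(x[1:-1])) + f(b))   # (4.66)
    return rect, trap, simp

exact = np.e - 1.0
for N in (4, 8, 16):
    print(N, [abs(I - exact) for I in composite_rules(np.exp, 0.0, 1.0, N)])
# doubling N reduces the three errors roughly by factors 2, 4 and 16 (orders 1, 2 and 4)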

4.6 Gaussian Quadrature


4.6.1 Introduction

As discussed in Sec. 4.3.5, quadrature rules based on interpolating polynomials for


equally-spaced points xi are suboptimal. This fact is related to equally-spaced points
being suboptimal for the construction of interpolating polynomials in general as men-
tioned in Rem. 3.28. In Gaussian quadrature, one invests additional effort into smarter
placing of the xi , resulting in more accurate quadrature rules. For example, this will
ensure that all weights remain positive as the number of xi increases, which, in turn,
will yield the convergence of Gaussian quadrature rules for all continuous functions (see
Th. 4.49 below).
We will see in the following that Gaussian quadrature rules In are quadrature rules
in the sense of Def. 4.6. However, for historical reasons, in the context of Gaussian
quadrature, it is common to use the notation
In (f ) = Σ_{i=1}^{n} σi f (λi ),   (4.68)

i.e. one writes λi instead of xi and the factor (b − a) is incorporated into the weights σi .

4.6.2 Orthogonal Polynomials

During Gaussian quadrature, the λi are determined as the zeros of polynomials that
are orthogonal with respect to certain scalar products on the (infinite-dimensional) real
vector space Pol(R).
Notation 4.38. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R+ 0
as defined in Def. 4.5, let
h·, ·iρ : Pol(R) × Pol(R) −→ R,   hp, qiρ := ∫_a^b p(x)q(x)ρ(x) dx ,   (4.69a)
k · kρ : Pol(R) −→ R+0 ,   kpkρ := √(hp, piρ ).   (4.69b)

Lemma 4.39. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R+
0,
(4.69a) defines a scalar product on Pol(R).
Proof. As ρ ∈ R[a, b], ρ ≥ 0, and ∫_a^b ρ(x) dx > 0, there must be some nontrivial interval
J ⊆ [a, b] and ǫ > 0 such that ρ ≥ ǫ on J. Thus, if 0 ≠ p ∈ Pol(R), then there must
be some nontrivial interval Jp ⊆ J and ǫp > 0 such that p²ρ ≥ ǫp on Jp , implying
hp, piρ = ∫_a^b p²(x)ρ(x) dx ≥ ǫp |Jp | > 0 and proving h·, ·iρ to be positive definite. Given
p, q, r ∈ Pol(R) and λ, µ ∈ R, one computes
hλp + µq, riρ = ∫_a^b (λp(x) + µq(x)) r(x)ρ(x) dx
= λ ∫_a^b p(x)r(x)ρ(x) dx + µ ∫_a^b q(x)r(x)ρ(x) dx
= λhp, riρ + µhq, riρ ,

proving h·, ·iρ to be linear in its first argument. Since, for p, q ∈ Pol(R), one also has
hp, qiρ = ∫_a^b p(x)q(x)ρ(x) dx = ∫_a^b q(x)p(x)ρ(x) dx = hq, piρ ,

h·, ·iρ is symmetric as well and the proof of the lemma is complete. 
Remark 4.40. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R+0,
consider Pol(R) with the scalar product h·, ·iρ given by (4.69a). Then, for each n ∈ N0 ,
q ∈ Poln (R)⊥ (cf. [Phi19b, Def. 10.10(b)]) if, and only if,
∫_a^b q(x)p(x)ρ(x) dx = 0   for each p ∈ Poln (R).   (4.70)

At first glance, it might not be obvious that there are always q ∈ Pol(R) \ {0} that
satisfy (4.70). However, such q 6= 0 can always be found by the general procedure of
Gram-Schmidt orthogonalization, known from Linear Algebra: According to [Phi19b,
Th. 10.13], if h·, ·i is an arbitrary scalar product on Pol(R), if (xn )n∈N0 is a sequence in
Pol(R) and if (pn )n∈N0 are recursively defined by
n−1
X hxn , pk i
p0 := x0 , ∀ pn := xn − pk , (4.71)
n∈N
k=0,
kpk k2
pk 6=0

then the sequence (pn )n∈N0 constitutes an orthogonal system in Pol(R).


While Gram-Schmidt orthogonalization works in every space with a scalar product, for
polynomials, the following theorem provides a more efficient recursive orthogonalization
algorithm:
Theorem 4.41. Given a scalar product h·, ·i on Pol(R) that satisfies
hxp, qi = hp, xqi for each p, q ∈ Pol(R) (4.72)
(clearly, the scalar product from (4.69a) satisfies (4.72)), let k · k denote the induced
norm. If, for each n ∈ N0 , xn ∈ Pol(R) is defined by xn (x) := xn and pn is given by
Gram-Schmidt orthogonalization according to (4.71), then the pn satisfy the following
recursive relation (sometimes called three-term recursion):
p0 = x0 ≡ 1, p1 = x − β0 , (4.73a)
pn+1 = (x − βn )pn − γn2 pn−1 for each n ∈ N, (4.73b)

where
βn := hxpn , pn i/kpn k²   for each n ∈ N0 ,   γn := kpn k/kpn−1 k   for each n ∈ N.   (4.73c)

Proof. We know from [Phi19b, Th. 10.13] that the linear independence of the xn guar-
antees the linear independence of the pn . We observe that p0 = x0 ≡ 1 is clear from
(4.71) and the definition of x0 . Next, from (4.71), we compute
p1 = x1 − (hx1 , p0 i/kp0 k²) p0 = x − hxp0 , p0 i/kp0 k² = x − β0 .
It remains to consider n + 1 for each n ∈ N. Letting
qn+1 := (x − βn ) pn − γn2 pn−1 ,

the proof is complete once we have shown

qn+1 = pn+1 .

Clearly,
r := pn+1 − qn+1 ∈ Poln (R),
as the coefficient of xn+1 in both pn+1 and qn+1 is 1. Since Poln (R) ∩ Poln (R)⊥ =
{0} (cf. [Phi19b, Lem. 10.11(a)]), it now suffices to show r ∈ Poln (R)⊥ . Moreover,
we know pn+1 ∈ Poln (R)⊥ , since pn+1 ⊥ pk for each k ∈ {0, . . . , n} and Poln (R) =
span{p0 , . . . , pn }. Thus, as Poln (R)⊥ is a vector space, to prove r ∈ Poln (R)⊥ , it suffices to
show
qn+1 ∈ Poln (R)⊥ . (4.74)
To this end, we start by using hpn−1 , pn i = 0 and the definition of βn to compute
hqn+1 , pn i = h(x − βn )pn − γn² pn−1 , pn i = hxpn , pn i − βn hpn , pn i = 0.   (4.75a)

Similarly, also using (4.72) as well as the definition of γn , we obtain

hqn+1 , pn−1 i = hxpn , pn−1 i − γn² kpn−1 k² = hpn , xpn−1 i − kpn k² =(∗) hpn , xpn−1 − pn i = 0,   (4.75b)

where, at (∗), it was used that xpn−1 −pn ∈ Poln−1 (R) due to the fact that xn is canceled.
Finally, for an arbitrary q ∈ Poln−2 (R) (setting Pol−1 (R) := {0}), it holds that

hqn+1 , qi = hpn , xqi − βn hpn , qi − γn2 hpn−1 , qi = 0, (4.75c)

due to pn ∈ Poln−1 (R)⊥ and pn−1 ∈ Poln−2 (R)⊥ . Combining the equations (4.75) yields
qn+1 ⊥ q for each q ∈ span({pn , pn−1 } ∪ Poln−2 (R)) = Poln (R), establishing (4.74) and
completing the proof. 

Remark 4.42. The reason that (4.73) is more efficient than (4.71) lies in the fact that
the computation of pn+1 according to (4.71) requires the computation of the n + 1 scalar
products hxn+1 , p0 i, . . . , hxn+1 , pn i, whereas the computation of pn+1 according to (4.73)
merely requires the computation of the 2 scalar products hpn , pn i and hxpn , pn i occurring
in βn and γn .

Definition 4.43. In the situation of Th. 4.41, we call the polynomials pn , n ∈ N0 ,


given by (4.73), the orthogonal polynomials with respect to h·, ·i (pn is called the nth
orthogonal polynomial).

Example 4.44. Considering ρ ≡ 1 on [−1, 1], one obtains \( \langle p, q\rangle_\rho = \int_{-1}^{1} p(x)q(x)\,dx \) and it is an exercise to check that the first four resulting orthogonal polynomials are
\[ p_0(x) = 1, \quad p_1(x) = x, \quad p_2(x) = x^2 - \tfrac{1}{3}, \quad p_3(x) = x^3 - \tfrac{3}{5}\, x. \]
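The three-term recursion (4.73) is easy to implement numerically. The following Python/NumPy sketch (added here for concreteness; the function names and the use of NumPy's polynomial module are choices made for this illustration, not part of the referenced literature) computes the monic orthogonal polynomials for ρ ≡ 1 on [−1, 1] and reproduces the polynomials of Ex. 4.44:

```python
import numpy as np
from numpy.polynomial import polynomial as P

def inner(p, q):
    # <p, q> = integral over [-1, 1] of p(x) q(x) dx, for coefficient arrays p, q
    antider = P.polyint(P.polymul(p, q))
    return P.polyval(1.0, antider) - P.polyval(-1.0, antider)

def orthogonal_polynomials(N):
    """Monic orthogonal polynomials p_0, ..., p_N via the three-term
    recursion (4.73); coefficients are stored in ascending order."""
    x = np.array([0.0, 1.0])            # the polynomial x
    p = [np.array([1.0])]               # p_0 = 1
    beta = inner(P.polymul(x, p[0]), p[0]) / inner(p[0], p[0])
    p.append(P.polysub(x, [beta]))      # p_1 = x - beta_0
    for n in range(1, N):
        pn, pnm1 = p[n], p[n - 1]
        beta = inner(P.polymul(x, pn), pn) / inner(pn, pn)      # beta_n, cf. (4.73c)
        gamma2 = inner(pn, pn) / inner(pnm1, pnm1)              # gamma_n^2
        p.append(P.polysub(P.polymul(P.polysub(x, [beta]), pn), gamma2 * pnm1))
    return p

if __name__ == "__main__":
    for k, pk in enumerate(orthogonal_polynomials(3)):
        print(f"p_{k}:", np.round(pk, 6))   # expect 1; x; x^2 - 1/3; x^3 - 3/5 x
```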

The Gaussian quadrature rules are based on polynomial interpolation with respect to the
zeros of orthogonal polynomials. The following theorem provides information regarding
such zeros. In general, the explicit computation of the zeros is a nontrivial task and a
disadvantage of Gaussian quadrature rules (the price one has to pay for higher accuracy
and better convergence properties as compared to Newton-Cotes rules).
Theorem 4.45. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R^+_0 , let pn , n ∈ N0 , be the orthogonal polynomials with respect to h·, ·iρ . For a fixed n ≥ 1, pn has precisely n distinct zeros λ1 , . . . , λn , which all are simple and lie in the open interval ]a, b[.

Proof. Let a < λ1 < λ2 < · · · < λk < b be an enumeration of the zeros of pn that lie in
]a, b[ and have odd multiplicity (i.e. pn changes sign at these λj ). From pn ∈ Poln (R),
we know k ≤ n. The goal is to show k = n. Seeking a contradiction, we assume k < n.
This implies
\[ q(x) := \prod_{i=1}^{k} (x - \lambda_i) \in \mathrm{Pol}_{n-1}(\mathbb{R}). \]
Thus, as pn ∈ Poln−1 (R)⊥ , we get
\[ \langle p_n, q\rangle_\rho = 0. \qquad (4.76) \]
On the other hand, the polynomial pn q has only zeros of even multiplicity on ]a, b[, which means either pn q ≥ 0 or pn q ≤ 0 on the entire interval [a, b], such that
\[ \langle p_n, q\rangle_\rho = \int_a^b p_n(x)\, q(x)\, \rho(x)\, dx \neq 0, \qquad (4.77) \]

in contradiction to (4.76). This shows k = n, and, in particular, that all zeros of pn are
simple. 

4.6.3 Gaussian Quadrature Rules

Definition 4.46. Given a real interval [a, b], a < b, a weight function ρ : [a, b] −→ R^+_0 , and n ∈ N, let λ1 , . . . , λn ∈ [a, b] be the zeros of the nth orthogonal polynomial pn with respect to h·, ·iρ and let L1 , . . . , Ln be the corresponding Lagrange basis polynomials (cf. (3.10)), now written as polynomial functions, i.e.
\[ L_j(x) := \prod_{\substack{i=1 \\ i \neq j}}^{n} \frac{x - \lambda_i}{\lambda_j - \lambda_i} \quad \text{for each } j \in \{1, \dots, n\}, \qquad (4.78) \]
then the quadrature rule
\[ I_n : \mathrm{R}[a,b] \longrightarrow \mathbb{R}, \qquad I_n(f) := \sum_{j=1}^{n} \sigma_j\, f(\lambda_j), \qquad (4.79a) \]
where
\[ \sigma_j := \langle L_j, 1\rangle_\rho = \int_a^b L_j(x)\,\rho(x)\,dx \quad \text{for each } j \in \{1, \dots, n\}, \qquad (4.79b) \]

is called the nth order Gaussian quadrature rule with respect to ρ.
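For ρ ≡ 1 on [−1, 1], the quantities of Def. 4.46 can be computed directly: the λj are the roots of pn (which, in this case, is a scalar multiple of the nth Legendre polynomial), and the σj are obtained by integrating the Lagrange basis polynomials as in (4.79b). The following sketch (assuming NumPy; the helper name gauss_rule is ours) illustrates this and checks the exactness asserted in Th. 4.47(a) below:

```python
import numpy as np
from numpy.polynomial import polynomial as P
from numpy.polynomial import legendre

def gauss_rule(n):
    """Nodes and weights of the n-point Gaussian rule for rho = 1 on [-1, 1],
    built directly from Def. 4.46: lambda_j are the roots of p_n,
    sigma_j = integral of the Lagrange basis polynomial L_j, cf. (4.79b)."""
    lam = np.sort(legendre.legroots([0] * n + [1]).real)   # roots of the nth Legendre polynomial
    sigma = np.empty(n)
    for j in range(n):
        Lj = np.array([1.0])                               # build L_j as in (4.78)
        for i in range(n):
            if i != j:
                Lj = P.polymul(Lj, [-lam[i], 1.0]) / (lam[j] - lam[i])
        antider = P.polyint(Lj)
        sigma[j] = P.polyval(1.0, antider) - P.polyval(-1.0, antider)
    return lam, sigma

if __name__ == "__main__":
    lam, sigma = gauss_rule(3)
    # the rule should integrate polynomials up to degree 2n - 1 = 5 exactly
    print(np.dot(sigma, lam**4), 2.0 / 5.0)   # integral of x^4 over [-1, 1]
```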

Theorem 4.47. Consider the situation of Def. 4.46; in particular let In be the Gaussian
quadrature rule defined according to (4.79).

(a) In is exact for each polynomial of degree at most 2n − 1. This can be stated in the form
\[ \langle p, 1\rangle_\rho = \sum_{j=1}^{n} \sigma_j\, p(\lambda_j) \quad \text{for each } p \in \mathrm{Pol}_{2n-1}(\mathbb{R}). \qquad (4.80) \]

Comparing with (4.9) and Rem. 4.9(b) shows that In has degree of accuracy precisely
2n − 1, i.e. the maximal possible degree of accuracy.

(b) All weights σj , j ∈ {1, . . . , n}, are positive: σj > 0.

(c) For ρ ≡ 1, In is based on interpolating polynomials for λ1 , . . . , λn in the sense of


Def. 4.10.

Proof. (a): Recall from Linear Algebra that the remainder theorem [Phi19b, Th. 7.9]
holds in the ring of polynomials R[X] and, thus, in Pol(R) ∼ = R[X] (using that the
field R is infinite). In particular, since pn has degree precisely n, given an arbitrary
p ∈ Pol2n−1 (R), there exist q, r ∈ Poln−1 (R) such that

p = qpn + r. (4.81)

Then, the relation pn (λj ) = 0 implies

p(λj ) = r(λj ) for each j ∈ {1, . . . , n}. (4.82)

Since r ∈ Poln−1 (R), it must be the unique interpolating polynomial for the data
 
λ1 , r(λ1 ) , . . . , λn , r(λn ) .

Thus,
\[ r(x) \overset{(3.10)}{=} \sum_{j=1}^{n} r(\lambda_j)\, L_j(x) = \sum_{j=1}^{n} p(\lambda_j)\, L_j(x). \qquad (4.83) \]
This allows us to compute
\[ \langle p, 1\rangle_\rho \overset{(4.81)}{=} \int_a^b \bigl( q(x)\, p_n(x) + r(x) \bigr) \rho(x)\, dx = \langle q, p_n\rangle_\rho + \langle r, 1\rangle_\rho \overset{p_n \in \mathrm{Pol}_{n-1}(\mathbb{R})^\perp,\ (4.83)}{=} \sum_{j=1}^{n} p(\lambda_j)\, \langle L_j, 1\rangle_\rho \overset{(4.79b)}{=} \sum_{j=1}^{n} \sigma_j\, p(\lambda_j), \qquad (4.84) \]

proving (4.80).
(b): For each j ∈ {1, . . . , n}, we apply (4.80) to L_j^2 ∈ Pol2n−2 (R) to obtain
\[ 0 < \|L_j\|_\rho^2 = \langle L_j^2, 1\rangle_\rho \overset{(4.80)}{=} \sum_{k=1}^{n} \sigma_k\, L_j^2(\lambda_k) \overset{(3.10)}{=} \sigma_j . \]

(c): This follows from (a): If f : [a, b] −→ R is given and p ∈ Poln−1 (R) is the interpolating polynomial function for the data
\[ \bigl(\lambda_1, f(\lambda_1)\bigr), \dots, \bigl(\lambda_n, f(\lambda_n)\bigr), \]
then
\[ I_n(f) = \sum_{j=1}^{n} \sigma_j f(\lambda_j) = \sum_{j=1}^{n} \sigma_j p(\lambda_j) \overset{(4.80)}{=} \langle p, 1\rangle_1 = \int_a^b p(x)\,dx , \]

which establishes the case. 

Theorem 4.48. The converse of Th. 4.47(a) is also true: If, for n ∈ N, λ1 , . . . , λn ∈ R
and σ1 , . . . , σn ∈ R are such that (4.80) holds, then λ1 , . . . , λn must be the zeros of the
nth orthogonal polynomial pn and σ1 , . . . , σn must satisfy (4.79b).

Proof. We first verify q = pn for
\[ q : \mathbb{R} \longrightarrow \mathbb{R}, \qquad q(x) := \prod_{j=1}^{n} (x - \lambda_j). \qquad (4.85) \]
To that end, let m ∈ {0, . . . , n − 1} and apply (4.80) to the polynomial p(x) := x^m q(x) (note p ∈ Poln+m (R) ⊆ Pol2n−1 (R)) to obtain
\[ \langle q, x^m\rangle_\rho = \langle x^m q, 1\rangle_\rho = \sum_{j=1}^{n} \sigma_j\, \lambda_j^m\, q(\lambda_j) = 0, \]

showing q ∈ Poln−1 (R)⊥ and, since also pn ∈ Poln−1 (R)⊥ , q − pn ∈ Poln−1 (R)⊥ . On the other hand, both q and pn
have degree precisely n and the coefficient of xn is 1 in both cases. Thus, in q − pn ,
xn cancels, showing q − pn ∈ Poln−1 (R). Since Poln−1 (R) ∩ Poln−1 (R)⊥ = {0}, we have
established q = pn , thereby also identifying the λj as the zeros of pn as claimed. Finally,
applying (4.80) with p = Lj yields
\[ \langle L_j, 1\rangle_\rho \overset{(4.80)}{=} \sum_{k=1}^{n} \sigma_k\, L_j(\lambda_k) = \sigma_j , \]

showing that the σj , indeed, satisfy (4.79b). 

Theorem 4.49. For each f ∈ C[a, b], the Gaussian quadrature rules In (f ) of Def. 4.46 converge:
\[ \lim_{n\to\infty} I_n(f) = \int_a^b f(x)\rho(x)\,dx \quad \text{for each } f \in C[a,b]. \qquad (4.86) \]

Proof. The convergence is a consequence of Polya’s Th. 4.27: Condition (i) of Th. 4.27 is satisfied as In (p) is exact for each p ∈ Pol2n−1 (R) and Condition (ii) of Th. 4.27 is satisfied since the positivity of the weights yields
\[ \sum_{j=1}^{n} |\sigma_{j,n}| = \sum_{j=1}^{n} \sigma_{j,n} = I_n(1) = \int_a^b \rho(x)\,dx \in \mathbb{R}^+ \qquad (4.87) \]

for each n ∈ N. 

Remark 4.50. (a) Using the linear change of variables
\[ x \mapsto u = u(x) := \frac{b-a}{2}\, x + \frac{a+b}{2} , \]
we can transform an integral over [a, b] into an integral over [−1, 1]:
\[ \int_a^b f(u)\, du = \frac{b-a}{2} \int_{-1}^{1} f\bigl(u(x)\bigr)\, dx = \frac{b-a}{2} \int_{-1}^{1} f\Bigl(\frac{b-a}{2}\, x + \frac{a+b}{2}\Bigr)\, dx . \qquad (4.88) \]

This is useful in the context of Gaussian quadrature rules, since the computation
of the zeros λj of the orthogonal polynomials is, in general, not easy. However, due
to (4.88), one does not have to recompute them for each interval [a, b], but one can
just compute them once for [−1, 1] and tabulate them. Of course, one still needs to
recompute for each n ∈ N and each new weight function ρ.

(b) If one is given the λj for a large n with respect to some strange ρ > 0, and one wants
to exploit the resulting Gaussian quadrature rule to just compute the nonweighted
integral of f , one can obviously always do that by applying the rule to g := f /ρ
instead of to f .
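As a minimal illustration of (4.88) (assuming NumPy, whose leggauss routine tabulates the Gauss-Legendre nodes and weights on [−1, 1]), one can wrap the tabulated rule for an arbitrary interval [a, b] as follows:

```python
import numpy as np

def gauss_legendre_on_interval(f, a, b, n):
    """Approximate the integral of f over [a, b] via the n-point
    Gauss-Legendre rule (rho = 1), using tabulated nodes and weights on
    [-1, 1] together with the change of variables (4.88)."""
    x, w = np.polynomial.legendre.leggauss(n)   # nodes/weights on [-1, 1]
    u = 0.5 * (b - a) * x + 0.5 * (a + b)       # transformed nodes
    return 0.5 * (b - a) * np.dot(w, f(u))

if __name__ == "__main__":
    # integral of exp over [0, 2] is e^2 - 1 = 6.389056...
    print(gauss_legendre_on_interval(np.exp, 0.0, 2.0, 5))
```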

As usual, to obtain an error estimate, one needs to assume more regularity of the
integrated function f .

Theorem 4.51. Consider the situation of Def. 4.46; in particular let In be the Gaussian quadrature rule defined according to (4.79). Then, for each f ∈ C 2n [a, b], there exists τ ∈ [a, b] such that
\[ \int_a^b f(x)\rho(x)\,dx - I_n(f) = \frac{f^{(2n)}(\tau)}{(2n)!} \int_a^b p_n^2(x)\,\rho(x)\,dx , \qquad (4.89) \]
where pn is the nth orthogonal polynomial, which can be further estimated by
\[ |p_n(x)| \le (b-a)^n \quad \text{for each } x \in [a,b]. \qquad (4.90) \]

Proof. As before, let λ1 , . . . , λn be the zeros of the nth orthogonal polynomial pn . Given
f ∈ C 2n [a, b], let q ∈ Pol2n−1 (R) be the Hermite interpolating polynomial for the 2n
points
λ1 , λ1 , λ2 , λ2 , . . . , λn , λn .
The identity (3.29) yields
\[ f(x) = q(x) + [f\,|\,x, \lambda_1, \lambda_1, \lambda_2, \lambda_2, \dots, \lambda_n, \lambda_n] \underbrace{\prod_{j=1}^{n} (x - \lambda_j)^2}_{=\, p_n^2(x)} . \qquad (4.91) \]
Now we multiply (4.91) by ρ, replace the divided difference by using the mean value theorem for divided differences (3.30), and integrate over [a, b]:
\[ \int_a^b f(x)\rho(x)\,dx = \int_a^b q(x)\rho(x)\,dx + \frac{1}{(2n)!} \int_a^b f^{(2n)}\bigl(\xi(x)\bigr)\, p_n^2(x)\,\rho(x)\,dx . \qquad (4.92) \]
Since the map x 7→ f (2n) (ξ(x)) is continuous on [a, b] by Th. 3.25(c) and since p_n^2 ρ ≥ 0, we can apply the mean value theorem of integration (cf. [Phi17, (2.15f)]) to obtain τ ∈ [a, b] satisfying
\[ \int_a^b f(x)\rho(x)\,dx = \int_a^b q(x)\rho(x)\,dx + \frac{f^{(2n)}(\tau)}{(2n)!} \int_a^b p_n^2(x)\,\rho(x)\,dx . \qquad (4.93) \]
Since q ∈ Pol2n−1 (R), In is exact for q, i.e.
\[ \int_a^b q(x)\rho(x)\,dx = I_n(q) = \sum_{j=1}^{n} \sigma_j\, q(\lambda_j) = \sum_{j=1}^{n} \sigma_j\, f(\lambda_j) = I_n(f). \qquad (4.94) \]
Combining (4.94) and (4.93) proves (4.89). Finally, (4.90) is clear from the representation \( p_n(x) = \prod_{j=1}^{n} (x - \lambda_j) \) and λj ∈ [a, b] for each j ∈ {1, . . . , n}. 
Remark 4.52. One can obviously also combine the strategies of Sec. 4.5 and Sec. 4.6.
The results are composite Gaussian quadrature rules.

5 Numerical Solution of Linear Systems

5.1 Background and Setting


Definition 5.1. Let F be a field. Given an m × n matrix A ∈ M(m, n, F ), m, n ∈ N, and b ∈ M(m, 1, F ) ∼= F m , the equation
\[ Ax = b \qquad (5.1) \]
is called a linear system for the unknown x ∈ M(n, 1, F ) ∼= F n . The matrix one obtains by adding b as the (n + 1)th column to A is called the augmented matrix of the linear system. It is denoted by (A|b). The linear system (5.1) is called homogeneous for b = 0 and inhomogeneous for b 6= 0. By
\[ L(A|b) := \{ x \in F^n : Ax = b \}, \qquad (5.2) \]
we denote the set of solutions to (5.1).


We have already learned in Linear Algebra how to determine the set of solutions L(A|b)
for linear systems in echelon form (cf. [Phi19a, Def. 8.7]) using back substitution (cf.
[Phi19a, Sec. 8.3.1]), and we have learned how to transform a general linear system into
echelon form without changing the set of solutions using the Gaussian elimination algo-
rithm (cf. [Phi19a, Sec. 8.3.3]). Moreover, we have seen how to augment the Gaussian
elimination algorithm to obtain a so-called LU decomposition for a linear system, which
is more efficient if one needs to solve a linear system Ax = b, repeatedly, with fixed
A and varying b (cf. [Phi19a, Sec. 8.4], in particular, [Phi19a, Sec. 8.4.4]). We recall
from [Phi19a, Th. 8.34] that, for each A ∈ M(m, n, F ), there exists a permutation
matrix P ∈ M(m, F ) (which contains precisely one 1 in each row and one 1 in each
column and only zeros, otherwise), a unipotent lower triangular matrix L ∈ M(m, F )
and à ∈ M(m, n, F ) in echelon form such that

P A = LÃ. (5.3)

It is an LU decomposition in the strict sense (i.e. with Ã being upper triangular) for A being square (i.e. m × m).

Remark 5.2. Let F be a field. We recall from Linear Algebra how to make use of
an LU decomposition P A = LÃ to solve linear systems Ax = b with fixed matrix
A ∈ M(m, n, F ) and varying b ∈ F m . One has

x ∈ L(A|b) ⇔ P Ax = LÃx = P b ⇔ Ãx = L−1 P b ⇔ x ∈ L(Ã|L−1 P b).

Thus, solving Ax = b is equivalent to solving Lz = P b for z and then solving Ãx =


z = L−1 P b for x. Since à is in echelon form, Ãx = z = L−1 P b can be solved via back
substitution; and, since L is lower triangular, Lz = P b can be solved analogously via
forward substitution.
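In practice, this strategy is exactly what library routines implement. The following short sketch uses SciPy (the matrix A and the right-hand sides are arbitrary sample data): lu_factor computes the decomposition of A once, and lu_solve then performs the forward and back substitutions for each b:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# One factorization of A, reused for several right-hand sides b,
# in the spirit of Rem. 5.2 (forward substitution for the lower triangular
# factor, back substitution for the upper triangular one).
A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
lu, piv = lu_factor(A)            # stores L, U and the row permutation

for b in (np.array([1.0, 2.0, 3.0]), np.array([0.0, 1.0, 0.0])):
    x = lu_solve((lu, piv), b)    # two triangular solves per right-hand side
    print(x, np.allclose(A @ x, b))
```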

Gaussian elimination and LU decomposition have the advantage that, at least in prin-
ciple, they work over every field F and for every matrix A. The algorithm is still
rather simple and, to the best of my knowledge, still the most efficient one known in
the absence of any further knowledge regarding F and A. However, in the presence of
roundoff errors, Gaussian elimination can become numerically unstable and so-called
pivot strategies can help to improve the situation (cf. Sec. 5.2 below). Numerically, it
can also be an issue that the condition numbers of L and U in an LU decomposition
A = LU can be much larger than the condition number of A (cf. Rem. 5.15 below), in
which case, the use of the strategy of Rem. 5.2 is numerically ill-advised. If F = K,
then the so-called QR decomposition A = QR of Sec. 5.4 below is usually a numerically
better option. Moreover, if A has a special structure, such as being Hermitian or sparse

(i.e. having many 0 entries), then it is usually much more efficient (and, thus, advisable)
to use algorithms designed to exploit this special structure. In this class, we will not
have time to consider algorithms designed for sparse matrices, but we will present the
Cholesky decomposition for Hermitian matrices in Sec. 5.3 below.

5.2 Pivot Strategies for Gaussian Elimination


The Gaussian elimination algorithm can be numerically unstable even for matrices with
small condition number, as demonstrated in the following example.
Example 5.3. Consider the linear system Ax = b over R with
\[ A := \begin{pmatrix} 0.0001 & 1 \\ 1 & 1 \end{pmatrix}, \qquad b := \begin{pmatrix} 1 \\ 2 \end{pmatrix}. \qquad (5.4) \]
The exact solution is
\[ x = \begin{pmatrix} \tfrac{10000}{9999} \\[1mm] \tfrac{9998}{9999} \end{pmatrix} = \begin{pmatrix} 1.00010001\ldots \\ 0.99989999\ldots \end{pmatrix}. \qquad (5.5) \]
We now examine what occurs if we solve the system approximately using the Gaussian
elimination algorithm and a floating point arithmetic, rounding to 3 significant digits:
When applying the Gaussian elimination algorithm to (A|b), we replace the second row
by the sum of the second row and the first row multiplied by −10^4 . The exact result is
\[ \begin{pmatrix} 10^{-4} & 1 & | & 1 \\ 0 & -9999 & | & -9998 \end{pmatrix}. \]

However, when rounding to 3 digits, it is replaced by
\[ \begin{pmatrix} 10^{-4} & 1 & | & 1 \\ 0 & -10^4 & | & -10^4 \end{pmatrix}, \]

yielding the approximate solution
\[ x = \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \qquad (5.6) \]
Thus, the relative error in the solution is about 10000 times the relative error of rounding.
The condition number is low, namely κ2 (A) ≈ 2.618 with respect to the Euclidean norm.
The error should not be amplified by a factor of 10000 if the algorithm is stable.
Now consider what occurs if we first switch the rows of (A|b):
\[ \begin{pmatrix} 1 & 1 & | & 2 \\ 10^{-4} & 1 & | & 1 \end{pmatrix}. \]

Now Gaussian elimination and rounding yields
\[ \begin{pmatrix} 1 & 1 & | & 2 \\ 0 & 1 & | & 1 \end{pmatrix} \]
with the approximate solution
\[ x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad (5.7) \]
which is much better than (5.6).
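The effect described in Example 5.3 can be reproduced in a few lines by simulating a floating point arithmetic with 3 significant digits. The following sketch (the rounding helper rnd and the small 2×2 elimination routine are ad-hoc constructions for this demonstration) yields the approximations (5.6) and (5.7), respectively:

```python
import numpy as np

def rnd(x, digits=3):
    """Round to the given number of significant digits (elementwise)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.zeros_like(x)
    nz = x != 0
    mag = 10.0 ** (np.floor(np.log10(np.abs(x[nz]))) - digits + 1)
    out[nz] = np.round(x[nz] / mag) * mag
    return out if out.size > 1 else float(out[0])

def eliminate_2x2(M):
    """One elimination step on the augmented 2x3 matrix M, rounding every
    intermediate result to 3 significant digits, then back substitution."""
    M = rnd(M)
    factor = rnd(M[1, 0] / M[0, 0])
    M[1] = rnd(M[1] - rnd(factor * M[0]))
    x2 = rnd(M[1, 2] / M[1, 1])
    x1 = rnd(rnd(M[0, 2] - rnd(M[0, 1] * x2)) / M[0, 0])
    return np.array([x1, x2])

Ab = np.array([[1e-4, 1.0, 1.0],
               [1.0,  1.0, 2.0]])
print("without pivoting:", eliminate_2x2(Ab))          # far from the true solution, cf. (5.6)
print("with row switch: ", eliminate_2x2(Ab[::-1]))    # close to (1, 1), cf. (5.7)
```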

Example 5.3 shows that it can be advantageous to switch rows even if one could proceed with Gaussian elimination without switching. Similarly, from a numerical point of view, not
every choice of the row number i, used for switching, is equally good. It is advisable
to use what is called a pivot strategy – a strategy to determine if, in the Gaussian
elimination’s kth step, one should first switch the current row with another row, and, if
so, which other row should be used.
A first strategy that can be used to avoid instabilities of the form that occurred in
Example 5.3 is the column maximum strategy (cf. Def. 5.5 below): In the kth step of
Gaussian elimination (i.e. when eliminating in the kth column), find the row number
i ≥ r(k) (all rows with numbers less than r(k) are already in echelon form and remain
unchanged, cf. [Phi19a, Sec. 8.3.3]), where the pivot element (i.e. the row’s first nonzero
element) has the maximal absolute value; if i 6= r(k), then switch rows i and r(k)
before proceeding. This was the strategy that resulted in the acceptable solution (5.7).
However, observe what happens in the following example:

Example 5.4. Consider the linear system Ax = b with
\[ A := \begin{pmatrix} 1 & 10000 \\ 1 & 1 \end{pmatrix}, \qquad b := \begin{pmatrix} 10000 \\ 2 \end{pmatrix}. \]

The exact solution is the same as in Example 5.3, i.e. given by (5.5) (the first row of
Example 5.3 has merely been multiplied by 10000). Again, we examine what occurs if
we solve the system approximately using Gaussian elimination, rounding to 3 significant
digits:
As the pivot element of the first row is maximal, the column maximum strategy does not require any switching. After the first step we obtain
\[ \begin{pmatrix} 1 & 10000 & | & 10000 \\ 0 & -9999 & | & -9998 \end{pmatrix}, \]

which, after rounding to 3 digits, is replaced by
\[ \begin{pmatrix} 1 & 10000 & | & 10000 \\ 0 & -10000 & | & -10000 \end{pmatrix}, \]
yielding the unsatisfactory approximate solution
\[ x = \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \]

The problem here is that the pivot element of the first row is small as compared with
the other elements of that row! This leads to the so-called relative column maximum
strategy (cf. Def. 5.5 below), where, instead of choosing the pivot element with the
largest absolute value, one chooses the pivot element such that its absolute value divided
by the sum of the absolute values of all elements in that row is maximized.
In the above case, the relative column maximum strategy would require first switching rows:
\[ \begin{pmatrix} 1 & 1 & | & 2 \\ 1 & 10000 & | & 10000 \end{pmatrix}. \]
Now Gaussian elimination and rounding yields
\[ \begin{pmatrix} 1 & 1 & | & 2 \\ 0 & 10000 & | & 10000 \end{pmatrix}, \]

once again with the acceptable approximate solution
\[ x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}. \]

Both pivot strategies discussed above are summarized in the following definition:

Definition 5.5. Let A = (aji ) ∈ M(m, n, K), m, n ∈ N, and b ∈ Km . We define two


modified versions of the Gaussian elimination algorithm [Phi19a, Alg. 8.17] by adding
one of the following actions (0) or (0′ ) at the beginning of the algorithm’s kth step (for
each k ≥ 1 such that r(k) < m and k ≤ n).
As in [Phi19a, Alg. 8.17], let (A(1) |b(1) ) := (A|b), r(1) := 1, and, for each k ≥ 1 such that
r(k) < m and k ≤ n, transform (A(k) |b(k) ) into (A(k+1) |b(k+1) ) and r(k) into r(k + 1). In
contrast to [Phi19a, Alg. 8.17], to achieve the transformation, first perform either (0) or
(0′ ):

(0) Column Maximum Strategy: Determine i ∈ {r(k), . . . , m} such that
\[ \bigl| a^{(k)}_{ik} \bigr| = \max\Bigl\{ \bigl| a^{(k)}_{\alpha k} \bigr| : \alpha \in \{r(k), \dots, m\} \Bigr\}. \qquad (5.8) \]

(0′ ) Relative Column Maximum Strategy: For each α ∈ {r(k), . . . , m}, define
\[ S^{(k)}_{\alpha} := \sum_{\beta=k}^{n} \bigl| a^{(k)}_{\alpha\beta} \bigr| + \bigl| b^{(k)}_{\alpha} \bigr|. \qquad (5.9a) \]
If max{ S^{(k)}_α : α ∈ {r(k), . . . , m} } = 0, then (A^{(k)}|b^{(k)}) is already in echelon form and the algorithm is halted. Otherwise, determine i ∈ {r(k), . . . , m} such that
\[ \frac{\bigl| a^{(k)}_{ik} \bigr|}{S^{(k)}_{i}} = \max\Biggl\{ \frac{\bigl| a^{(k)}_{\alpha k} \bigr|}{S^{(k)}_{\alpha}} : \alpha \in \{r(k), \dots, m\},\ S^{(k)}_{\alpha} \neq 0 \Biggr\}. \qquad (5.9b) \]
If a^{(k)}_{ik} = 0, then nothing needs to be done in the kth step and one merely proceeds to the next column (if any) without changing the row by setting r(k + 1) := r(k) (cf. [Phi19a, Alg. 8.17(c)]). If a^{(k)}_{ik} 6= 0 and i = r(k), then no row switching is needed before elimination in the kth column (cf. [Phi19a, Alg. 8.17(a)]). If a^{(k)}_{ik} 6= 0 and i 6= r(k), then one proceeds by switching rows i and r(k) before elimination in the kth column (cf. [Phi19a, Alg. 8.17(b)]).
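A compact sketch of Gaussian elimination with the two pivot strategies of Def. 5.5 might look as follows (restricted, for simplicity, to square nonsingular systems, so that the row index r(k) never lags behind k; the function and parameter names are ours):

```python
import numpy as np

def gauss_eliminate(A, b, strategy="column_max"):
    """Transform (A|b) into echelon form for square nonsingular A, choosing
    the pivot row per Def. 5.5: 'column_max' implements (0), 'relative'
    implements (0') with the row sums S_alpha of (5.9a)."""
    M = np.hstack([np.asarray(A, float), np.asarray(b, float).reshape(-1, 1)])
    m = M.shape[0]
    for k in range(m - 1):
        col = np.abs(M[k:, k])
        if strategy == "relative":
            S = np.abs(M[k:, k:]).sum(axis=1)            # cf. (5.9a), includes |b|
            with np.errstate(divide="ignore", invalid="ignore"):
                crit = np.where(S != 0, col / S, -np.inf)
        else:
            crit = col                                    # cf. (5.8)
        i = k + int(np.argmax(crit))
        if i != k:
            M[[k, i]] = M[[i, k]]                         # switch rows k and i
        for r in range(k + 1, m):
            M[r, k:] -= (M[r, k] / M[k, k]) * M[k, k:]
    return M[:, :-1], M[:, -1]

A = np.array([[1.0, 10000.0], [1.0, 1.0]])
b = np.array([10000.0, 2.0])
print(gauss_eliminate(A, b, strategy="relative"))   # switches rows as in Ex. 5.4
```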

Remark 5.6. (a) Clearly, [Phi19a, Th. 8.19] remains valid for the Gaussian elimina-
tion algorithm with column maximum strategy as well as with relative column
maximum strategy (i.e. the modified Gaussian elimination algorithm still trans-
forms the augmented matrix (A|b) of the linear system Ax = b into a matrix (Ã|b̃)
in echelon form, such that the linear system Ãx = b̃ has precisely the same set of
solutions as the original system).

(b) The relative maximum strategy is more stable in cases where the order of magnitude
of matrix elements varies strongly, but it is also less efficient.

5.3 Cholesky Decomposition


Hermitian positive definite matrices A ∈ M(n, K) (and also Hermitian positive semidef-
inite matrices, see Appendix H.1) can be decomposed in the particularly simple form
A = LL∗ , where L ∈ M(n, K) is lower triangular:

Theorem 5.7. If A ∈ M(n, K) is Hermitian and positive definite (cf. Def. 2.21), then
there exists a unique lower triangular L ∈ M(n, K) with positive diagonal entries (i.e.
with ljj ∈ R+ for each j ∈ {1, . . . , n}) and
A = LL∗ . (5.10)
This decomposition is called the Cholesky decomposition of A.

Proof. The proof is conducted via induction on n. For n = 1, we have A = (a11 ), a11 > 0, and L = (l11 ) is uniquely given by the square root of a11 : l11 = √a11 > 0. For n > 1,
we write A in block form
\[ A = \begin{pmatrix} B & v \\ v^* & a_{nn} \end{pmatrix}, \qquad (5.11) \]
where B ∈ M(n − 1, K) is the matrix with bkl = akl for each k, l ∈ {1, . . . , n − 1} and
v ∈ Kn−1 is the column vector with vk = akn for each k ∈ {1, . . . , n − 1}. According
to [Phi19b, Prop. 11.6], B is positive definite (and, clearly, B is also Hermitian) and
ann > 0. According to the induction hypothesis, there exists a unique lower triangular
matrix L′ ∈ M(n − 1, K) with positive diagonal entries and B = L′ (L′ )∗ . Then L′ is
invertible and we can define
w := (L′ )−1 v ∈ Kn−1 . (5.12)
Moreover, let
α := ann − w∗ w. (5.13)
Then α ∈ R and there exists β ∈ C such that β² = α. We set
\[ L := \begin{pmatrix} L' & 0 \\ w^* & \beta \end{pmatrix}. \qquad (5.14) \]
Then L is lower triangular and
\[ M := \begin{pmatrix} (L')^* & w \\ 0 & \beta \end{pmatrix} \]
is upper triangular with
\[ LM = \begin{pmatrix} L'(L')^* & L'w \\ w^*(L')^* & w^*w + \beta^2 \end{pmatrix} = \begin{pmatrix} B & v \\ v^* & a_{nn} \end{pmatrix} = A. \qquad (5.15) \]

It only remains to be shown that one can choose β ∈ R+ and that this choice determines β uniquely (note that β ∈ R then also implies M = L∗ and A = LM = LL∗ ). Since L′ is a triangular matrix with positive entries on its diagonal, we have det L′ ∈ R+ and
\[ \det A = \det L \cdot \det M \overset{\text{[Phi19b, Th. 11.3(a)]}}{=} (\det L')^2\, \beta^2 > 0. \qquad (5.16) \]
Thus, α = β² > 0, one can choose β = √α ∈ R+ , and √α is the only positive number with square α. 

Remark 5.8. From the proof of Th. 5.7, it is not hard to extract a recursive algorithm
to actually compute the Cholesky decomposition of an Hermitian positive definite A ∈
M(n, K): One starts by setting

\[ l_{11} := \sqrt{a_{11}} , \qquad (5.17a) \]
where the proof of Th. 5.7 guarantees a11 > 0. In the recursive step, one already has constructed the entries for the first r − 1 rows of L, 1 < r ≤ n (corresponding to the entries of L′ in the proof of Th. 5.7), and one needs to construct lr1 , . . . , lrr . Using (5.12), one has to solve L′ w = v for
\[ w = \begin{pmatrix} l_{r1} \\ \vdots \\ l_{r,r-1} \end{pmatrix}, \quad \text{where} \quad v = \begin{pmatrix} a_{1r} \\ \vdots \\ a_{r-1,r} \end{pmatrix}; \quad \text{then} \quad w^* = (l_{r1} \ \dots \ l_{r,r-1}). \]
As L′ has lower triangular form, we obtain lr1 , . . . , lr,r−1 successively, using forward substitution, where
\[ l_{rj} := \frac{a_{jr} - \sum_{k=1}^{j-1} l_{jk}\, l_{rk}}{l_{jj}} \quad \text{for each } j \in \{1, \dots, r-1\} \qquad (5.17b) \]
and the proof of Th. 5.7 guarantees ljj > 0. Finally, one uses (5.13) (with n replaced by r) to obtain
\[ l_{rr} = \sqrt{\alpha} = \sqrt{a_{rr} - w^* w} = \sqrt{a_{rr} - \sum_{k=1}^{r-1} |l_{rk}|^2} , \qquad (5.17c) \]
where the proof of Th. 5.7 guarantees α > 0.
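A direct implementation of (5.17) might look as follows (a sketch for real symmetric positive definite matrices; the Hermitian case additionally requires complex conjugates in (5.17b) and (5.17c)); it can be checked against Ex. 5.9 below and against NumPy's built-in Cholesky routine:

```python
import numpy as np

def cholesky(A):
    """Cholesky factor L with A = L L^t, following the recursion (5.17) of
    Rem. 5.8; this sketch assumes a real symmetric positive definite A."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for r in range(n):
        for j in range(r):
            # (5.17b): forward substitution for l_{r1}, ..., l_{r,r-1}
            L[r, j] = (A[j, r] - L[j, :j] @ L[r, :j]) / L[j, j]
        # (5.17c): the positive diagonal entry
        L[r, r] = np.sqrt(A[r, r] - np.sum(L[r, :r] ** 2))
    return L

A = np.array([[ 1.0, -1.0, -2.0],
              [-1.0,  2.0,  3.0],
              [-2.0,  3.0,  9.0]])
L = cholesky(A)
print(L)                                  # compare with the example below (Ex. 5.9)
print(np.allclose(L @ L.T, A), np.allclose(L, np.linalg.cholesky(A)))
```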

Example 5.9. We compute the Cholesky decomposition for the symmetric positive definite matrix
\[ A = \begin{pmatrix} 1 & -1 & -2 \\ -1 & 2 & 3 \\ -2 & 3 & 9 \end{pmatrix}. \]
According to the algorithm described in Rem. 5.8, we compute
\[ l_{11} = \sqrt{a_{11}} = \sqrt{1} = 1, \qquad l_{21} = \frac{a_{12}}{l_{11}} = \frac{-1}{1} = -1, \qquad l_{22} = \sqrt{a_{22} - l_{21}^2} = \sqrt{2-1} = 1, \]
\[ l_{31} = \frac{a_{13}}{l_{11}} = \frac{-2}{1} = -2, \qquad l_{32} = \frac{a_{23} - l_{21} l_{31}}{l_{22}} = \frac{3 - (-1)(-2)}{1} = 1, \qquad l_{33} = \sqrt{a_{33} - l_{31}^2 - l_{32}^2} = \sqrt{9-4-1} = 2. \]
Thus, the Cholesky decomposition for A is
\[ LL^* = LL^t = \begin{pmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ -2 & 1 & 2 \end{pmatrix} \begin{pmatrix} 1 & -1 & -2 \\ 0 & 1 & 1 \\ 0 & 0 & 2 \end{pmatrix} = \begin{pmatrix} 1 & -1 & -2 \\ -1 & 2 & 3 \\ -2 & 3 & 9 \end{pmatrix} = A. \]

We conclude the section by showing that a Cholesky decomposition cannot exist if A is not Hermitian positive semidefinite:
Proposition 5.10. For n ∈ N, let A, L ∈ M(n, K), where A = LL∗ . Then A is
Hermitian and positive semidefinite.

Proof. A is Hermitian, since
\[ A^* = (LL^*)^* = LL^* = A. \]
A is positive semidefinite, since
\[ \forall_{x \in \mathbb{K}^n} \quad x^* A x = x^* L L^* x = (L^* x)^* (L^* x) \ge 0, \]

which completes the proof. 

5.4 QR Decomposition
5.4.1 Definition and Motivation

Definition 5.11. Let n ∈ N and let A ∈ M(n, K). Then A is called unitary if, and
only if, A is invertible and A−1 = A∗ . Moreover, for K = R, unitary matrices are also
called orthogonal.

Definition 5.12. Let m, n ∈ N and A ∈ M(m, n, K).

(a) A decomposition
A = QÃ (5.18a)
is called a QR decomposition of A if, and only if, Q ∈ M(m, K) is unitary and
à ∈ M(m, n, K) is in echelon form. If A ∈ M(m, K) (i.e. quadratic), then à is
upper (i.e. right) triangular, i.e. (5.18a) is a QR decomposition in the strict sense,
which is emphasized by writing R := Ã:

A = QR. (5.18b)

(b) Let r := rk(A) be the rank of A. A decomposition

A = QÃ (5.19a)

is called an economy size QR decomposition of A if, and only if, Q ∈ M(m, r, K)


is such that its columns form an orthonormal system with respect to the standard
scalar product on Km and à ∈ M(r, n, K) is in echelon form. If r = n, then Ã
is upper triangular, i.e. (5.19a) is an economy size QR decomposition in the strict
sense, which is emphasized by writing R := Ã:

A = QR. (5.19b)

Note: In the German literature, the economy size QR decomposition is usually called
just QR decomposition, while our QR decomposition is referred to as the extended QR
decomposition.

While, even for an invertible A ∈ M(n, K), the QR decomposition is not unique, the
nonuniqueness can be precisely described:

Proposition 5.13. Suppose Q1 , Q2 , R1 , R2 ∈ M(n, K), n ∈ N, are such that Q1 , Q2 are


unitary, R1 , R2 are invertible upper triangular, and

Q 1 R1 = Q 2 R2 . (5.20)

Then there exists a diagonal matrix S ∈ M(n, K), S = diag(σ1 , . . . , σn ), |σ1 | = · · · =


|σn | = 1 (for K = R, this means σ1 , . . . , σn ∈ {−1, 1}), such that

Q2 = Q1 S and R2 = S ∗ R1 . (5.21)

Proof. As (5.20) implies S := Q_1^{-1} Q_2 = R_1 R_2^{-1} , S must be both unitary and upper triangular. Thus, S must be diagonal, S = diag(σ1 , . . . , σn ) with |σk | = 1. 

Proposition 5.14. Let n ∈ N and A, Q ∈ M(n, K), A being invertible and Q being
unitary. With respect to the 2-norm k · k2 on Kn , the following holds (recall Def. 2.12
of a natural matrix norm and Def. and Rem. 2.36 for the condition of a matrix):

kQAk2 = kAQk2 = kAk2 ,


kQk2 = κ2 (Q) = 1,
κ2 (QA) = κ2 (AQ) = κ2 (A).

Proof. Exercise. 

Remark 5.15. Suppose the goal is to solve linear systems Ax = b with fixed A ∈
M(m, n, K) and varying b ∈ Km . In Rem. 5.2, we described how to make use of a given
LU decomposition P A = LÃ in this situation. Even though the strategy of Rem. 5.2
works fine in many situations, it can fail numerically due to the following stability issue:
While the condition numbers, for invertible A, always satisfy

κ(P A) ≤ κ(L)κ(Ã) (5.22)

due to (2.23) and (2.14), the right-hand side of (5.22) can be much larger than the left-
hand side. In that case, it is numerically ill-advised to solve Lz = P b and Ãx = z instead
of solving Ax = b. An analogous procedure can be used given a QR decomposition
A = QÃ. Solving Ax = QÃx = b is equivalent to solving Qz = b for z and then
solving Ãx = z for x. Due to Q−1 = Q∗ , solving for z = Q∗ b is particularly simple.
Moreover, one does not encounter the possibility of an increased condition as for the
LU decomposition: Due to Prop. 5.14, the condition of using the QR decomposition is
exactly the same as the condition of A. Thus, if available, the QR decomposition should
always be used!

5.4.2 QR Decomposition via Gram-Schmidt Orthogonalization

As just noted in Rem. 5.15, it is numerically advisable to make use of a QR decompo-


sition if it is available. However, so far we have not addressed the question of how to
obtain such a decomposition. The simplest way, but not the most numerically stable,
is to use Gram-Schmidt orthogonalization. More precisely, Gram-Schmidt orthogonal-
ization provides the economy size QR decomposition of Def. 5.12(b) as described in
Th. 5.16. In Sec. 5.4.3 below, we will study the Householder method, which is more
numerically stable and also provides the full QR decomposition directly.

Theorem 5.16. Let A ∈ M(m, n, K), m, n ∈ N, with columns x1 , . . . , xn . If r :=


rk(A) > 0, then define increasing functions
ρ : {1, . . . , n} −→ {0, . . . , r}, ρ(k) := dim(span{x1 , . . . , xk }), (5.23a)

µ : {1, . . . , r} −→ {1, . . . , n}, µ(k) := min{ j ∈ {1, . . . , n} : ρ(j) = k }. (5.23b)

(a) There exists a QR decomposition of A as well as, for r > 0, an economy size
QR decomposition of A. More precisely, there exist a unitary Q ∈ M(m, K), an
à ∈ M(m, n, K) in echelon form, a Qe ∈ M(m, r, K) (for r > 0) such that its
columns form an orthonormal system with respect to the standard scalar product
h·, ·i on Km , and (for r > 0) an Ãe ∈ M(r, n, K) in echelon form such that
A = QÃ, (5.24a)
A = Qe Ãe for r > 0. (5.24b)
(i) For r > 0, the columns q1 , . . . , qr ∈ Km of Qe can be computed from the columns x1 , . . . , xn ∈ Km of A, obtaining v1 , . . . , vn ∈ Km from
\[ v_1 := x_1, \qquad \forall_{k \ge 2} \quad v_k := x_k - \sum_{\substack{i=1,\\ v_i \neq 0}}^{k-1} \frac{\langle x_k, v_i\rangle}{\|v_i\|_2^2}\, v_i , \qquad (5.25) \]
letting, for each k ∈ {1, . . . , r}, ṽk := vµ(k) (i.e. the ṽ1 , . . . , ṽr are the v1 , . . . , vn with each vk = 0 omitted) and, finally, letting qk := ṽk /kṽk k2 for each k = 1, . . . , r.
(ii) For r > 0, the entries rik of Ãe ∈ M(r, n, K) can be defined by
\[ r_{ik} := \begin{cases} \|v_1\|_2 & \text{for } i = k = 1,\ \rho(1) = 1,\\ \langle x_k, q_i\rangle & \text{for } k \in \{2,\dots,n\},\ i < \rho(k),\\ \langle x_k, q_i\rangle & \text{for } k \in \{2,\dots,n\},\ i = \rho(k) = \rho(k-1),\\ \|\tilde v_{\rho(k)}\|_2 & \text{for } k \in \{2,\dots,n\},\ i = \rho(k) > \rho(k-1),\\ 0 & \text{otherwise.} \end{cases} \]

(iii) For the columns q1 , . . . , qm of Q, one can complete the orthonormal system
q1 , . . . , qr of (i) by qr+1 , . . . , qm ∈ Km to an orthonormal basis of Km .
(iv) For Ã, one can use Ãe as in (ii), completed by m − r zero rows at the bottom.
(b) For r > 0, there is a unique economy size QR decomposition of A such that all
pivot elements of Ãe are positive. Similarly, if one requires all pivot elements of Ã
to be positive, then the QR decomposition of A is unique, except for the last columns
qr+1 , . . . , qm of Q.

Proof. (a): We start by showing that, for r > 0, if Qe and Ãe are defined by (i) and
(ii), respectively, then they provide the desired economy size QR decomposition of A.
From [Phi19b, Th. 10.13], we know that vk = 0 if, and only if, xk ∈ span{x1 , . . . , xk−1 }.
Thus, in (i), we indeed obtain r linearly independent ṽk and Qe is in M(m, r, K) with
orthonormal columns. To see that Ãe according to (ii) is in echelon form, consider its
ith row, i ∈ {1, . . . , r}. If k < µ(i), then ρ(k) < i, i.e. rik = 0. Thus, kṽρ(k) k2 with
k = µ(i) is the pivot element of the ith row, and as µ is strictly increasing, Ãe is in
echelon form. To show A = Qe Ãe , note that, for each k ∈ {1, . . . , n},
\[ x_k = \begin{cases} 0 & \text{for } k = 1,\ \rho(1) = 0,\\ \|\tilde v_{\rho(1)}\|_2\, q_{\rho(1)} = \|v_1\|_2\, q_1 & \text{for } k = 1,\ \rho(1) = 1,\\ \sum_{i=1}^{\rho(k)} \langle x_k, q_i\rangle\, q_i & \text{for } k > 1,\ \rho(k) = \rho(k-1),\\ \sum_{i=1}^{\rho(k)-1} \langle x_k, q_i\rangle\, q_i + \|\tilde v_{\rho(k)}\|_2\, q_{\rho(k)} & \text{for } k > 1,\ \rho(k) > \rho(k-1) \end{cases} \;=\; \sum_{i=1}^{\rho(k)} r_{ik}\, q_i \]
as a consequence of (5.25), completing the proof of the economy size QR decomposition.


Moreover, given the economy size QR decomposition, (iii) and (iv) clearly provide a QR
decomposition of A.
(b): Recall from the proof of (a) that, for the Ãe defined in (ii) and, thus, for the Ã
defined in (iv), the pivot elements are given by the kṽρ(k) k2 and, thus, always positive.
It suffices to prove the uniqueness statement for the economy size QR decomposition,
as it clearly implies the uniqueness statement for the QR decomposition. Suppose

A = QR = Q̃R̃, (5.26)

where Q, Q̃ ∈ M(m, r, K) with orthonormal columns, and R, R̃ ∈ M(r, n, K) are in


echelon form such that all pivot elements are positive. We write (5.26) in the equivalent
form r r
X X
xk = rik qi = r̃ik q̃i for each k ∈ {1, . . . , n}, (5.27)
i=1 i=1

and show, by induction on α ∈ {1, . . . , r + 1}, that qα = q̃α for α ≤ r and rik = r̃ik for
each i ∈ {1, . . . , r}, k ∈ {1, . . . , min{µ(α), n}}, µ(r + 1) := n + 1, as well as

span{x1 , . . . , xµ(α) } = span{q1 , . . . , qα−1 , qα } for α ≤ r (5.28)

(the induction goes to r + 1, as it might occur that µ(r) < n). Consider α = 1. For
1 ≤ k < µ(1), we have xk = 0, and the linear independence of the qi (respectively, q̃i )
yields rik = r̃ik = 0 for each i ∈ {1, . . . , r}, 1 ≤ k < µ(1) (if any). Next,

0 6= xµ(1) = r1,µ(1) q1 = r̃1,µ(1) q̃1 , (5.29)



since the echelon forms of R and R̃ imply ri,µ(1) = r̃i,µ(1) = 0 for each i > 1. Moreover,
(5.29)
kq1 k2 = kq̃1 k2 = 1 ⇒ |r1,µ(1) | = |r̃1,µ(1) |
r1,µ(1) ,r̃1,µ(1) ∈R+ 
⇒ r1,µ(1) = r̃1,µ(1) ∧ q1 = q̃1 .

Note that (5.29) also implies span{x1 , . . . , xµ(1) } = span{xµ(1) } = span{q1 }. Now let
α > 1 and, by induction, assume qβ = q̃β for 1 ≤ β < α and rik = r̃ik for each
i ∈ {1, . . . , r}, k ∈ {1, . . . , µ(α − 1)}, as well as

∀ span{x1 , . . . , xµ(β) } = span{q1 , . . . , qβ }. (5.30)


1≤β<α

We have to show rik = r̃ik for each i ∈ {1, . . . , r}, k ∈ {µ(α − 1) + 1, . . . , min{µ(α), n}},
and qα = q̃α for α ≤ r, as well as (5.28). For each k ∈ {µ(α − 1) + 1, . . . , µ(α) − 1},
from (5.30) with β = α − 1, we know

span{x1 , . . . , xk } = span{x1 , . . . , xµ(α−1) } = span{q1 , . . . , qα−1 }, (5.31)

such that (5.27) implies rik = r̃ik = 0 for each i > α − 1. Then, for 1 ≤ i ≤ α − 1,
multiplying (5.27) by qi and using the orthonormality of the qi provides rik = r̃ik =
hxk , qi i. For α = r + 1, we are already done. Otherwise, a similar argument also works
for k = µ(α) ≤ n: As we had just seen, rαl = r̃αl = 0 for each 1 ≤ l < k. So the echelon
forms of R and R̃ imply rik = r̃ik = 0 for each i > α. Then, as before, rik = r̃ik = hxk , qi i
for each i < α by multiplying (5.27) by qi , and, finally,
α−1
X α−1
X
0 6= xk − rik qi = xµ(α) − rik qi = rαk qα = r̃αk q̃α . (5.32)
i=1 i=1

Analogous to the case α = 1, the positivity of rαk and r̃αk implies rαk = r̃αk and qα = q̃α .
To conclude the induction, we note that combining (5.32) with (5.31) verifies (5.28). 

Remark 5.17. Gram-Schmidt orthogonalization according to (4.71) becomes numeri-


cally unstable in cases where xk is “almost” in span{x1 , . . . , xk−1 }, resulting in a very
small 0 6= vk , in which case kvk k−1
2 can become arbitrarily large. In the following section,
we will study an alternative method to compute the QR decomposition that avoids this
issue.

Example 5.18. Consider
\[ A := \begin{pmatrix} 1 & 4 & 2 & 3 \\ 1 & 2 & 1 & 0 \\ 2 & 6 & 3 & 1 \\ 0 & 0 & 1 & 4 \end{pmatrix}. \]

Applying Gram-Schmidt orthogonalization (4.71) to the columns
\[ x_1 := (1,1,2,0)^t, \quad x_2 := (4,2,6,0)^t, \quad x_3 := (2,1,3,1)^t, \quad x_4 := (3,0,1,4)^t \]
of A yields
\[ v_1 = x_1 = (1,1,2,0)^t, \]
\[ v_2 = x_2 - \frac{\langle x_2, v_1\rangle}{\|v_1\|_2^2}\, v_1 = x_2 - \frac{18}{6}\, v_1 = (1,-1,0,0)^t, \]
\[ v_3 = x_3 - \frac{\langle x_3, v_1\rangle}{\|v_1\|_2^2}\, v_1 - \frac{\langle x_3, v_2\rangle}{\|v_2\|_2^2}\, v_2 = x_3 - \frac{9}{6}\, v_1 - \frac{1}{2}\, v_2 = (0,0,0,1)^t, \]
\[ v_4 = x_4 - \frac{\langle x_4, v_1\rangle}{\|v_1\|_2^2}\, v_1 - \frac{\langle x_4, v_2\rangle}{\|v_2\|_2^2}\, v_2 - \frac{\langle x_4, v_3\rangle}{\|v_3\|_2^2}\, v_3 = x_4 - \frac{5}{6}\, v_1 - \frac{3}{2}\, v_2 - 4\, v_3 = \bigl(\tfrac{2}{3}, \tfrac{2}{3}, -\tfrac{2}{3}, 0\bigr)^t. \]
Thus,
\[ q_1 = \frac{v_1}{\|v_1\|_2} = \frac{v_1}{\sqrt{6}}, \qquad q_2 = \frac{v_2}{\|v_2\|_2} = \frac{v_2}{\sqrt{2}}, \qquad q_3 = \frac{v_3}{\|v_3\|_2} = v_3, \qquad q_4 = \frac{v_4}{\|v_4\|_2} = \frac{v_4}{2/\sqrt{3}} \]
and
\[ Q = \begin{pmatrix} 1/\sqrt{6} & 1/\sqrt{2} & 0 & 1/\sqrt{3} \\ 1/\sqrt{6} & -1/\sqrt{2} & 0 & 1/\sqrt{3} \\ 2/\sqrt{6} & 0 & 0 & -1/\sqrt{3} \\ 0 & 0 & 1 & 0 \end{pmatrix}. \]
Next, we obtain
\[ r_{11} = \|v_1\|_2 = \sqrt{6}, \quad r_{12} = \langle x_2, q_1\rangle = 3\sqrt{6}, \quad r_{13} = \langle x_3, q_1\rangle = 9/\sqrt{6}, \quad r_{14} = \langle x_4, q_1\rangle = 5/\sqrt{6}, \]
\[ r_{21} = 0, \quad r_{22} = \|v_2\|_2 = \sqrt{2}, \quad r_{23} = \langle x_3, q_2\rangle = 1/\sqrt{2}, \quad r_{24} = \langle x_4, q_2\rangle = 3/\sqrt{2}, \]
\[ r_{31} = 0, \quad r_{32} = 0, \quad r_{33} = \|v_3\|_2 = 1, \quad r_{34} = \langle x_4, q_3\rangle = 4, \]
\[ r_{41} = 0, \quad r_{42} = 0, \quad r_{43} = 0, \quad r_{44} = \|v_4\|_2 = 2/\sqrt{3}, \]

that means
\[ R = \tilde A = \begin{pmatrix} \sqrt{6} & 3\sqrt{6} & 9/\sqrt{6} & 5/\sqrt{6} \\ 0 & \sqrt{2} & 1/\sqrt{2} & 3/\sqrt{2} \\ 0 & 0 & 1 & 4 \\ 0 & 0 & 0 & 2/\sqrt{3} \end{pmatrix}. \]
One verifies A = QR.
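The computation of the previous example is easily automated. The following sketch implements the economy size QR decomposition via classical Gram-Schmidt for the full-rank case (so that no vk vanishes and the bookkeeping via ρ and µ of Th. 5.16 is unnecessary); note that it inherits the instability discussed in Rem. 5.17 below:

```python
import numpy as np

def gram_schmidt_qr(A):
    """Economy size QR decomposition via classical Gram-Schmidt, cf. (5.25),
    assuming here that A has full column rank."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        v = A[:, k].copy()
        for i in range(k):
            R[i, k] = Q[:, i] @ A[:, k]     # r_{ik} = <x_k, q_i>
            v -= R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(v)         # pivot element ||v_k||_2 > 0
        Q[:, k] = v / R[k, k]
    return Q, R

A = np.array([[1.0, 4.0, 2.0, 3.0],
              [1.0, 2.0, 1.0, 0.0],
              [2.0, 6.0, 3.0, 1.0],
              [0.0, 0.0, 1.0, 4.0]])
Q, R = gram_schmidt_qr(A)
print(np.round(R, 4))                                       # compare with Ex. 5.18
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(4)))
```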

5.4.3 QR Decomposition via Householder Reflections

The Householder method uses reflections to compute the QR decomposition of a matrix.


It does not go through the economy size QR decomposition (as does the Gram-Schmidt
method), but provides the QR decomposition directly (which can be seen as an advan-
tage or disadvantage, depending on the circumstances). The Householder method does
not suffer from the numerical instability issues discussed in Rem. 5.17. On the other
hand, the Gram-Schmidt method provides k correct columns of Q in k steps, whereas,
for the Householder method, in general, one might not obtain any correct columns of
Q before completing the final step. We begin with some preparatory definitions and
remarks:

Definition 5.19. A matrix H ∈ M(n, K), n ∈ N, is called a Householder matrix or


Householder transformation over K if, and only if, there exists u ∈ Kn such that kuk2 = 1
and
H = Id −2 uu∗ , (5.33)
where, here and in the following, u is treated as a column vector. If H is a Householder
matrix, then the linear map H : Kn −→ Kn is also called a Householder reflection or
Householder transformation.

Lemma 5.20. Let H ∈ M(n, K), n ∈ N, be a Householder matrix. Then the following
holds:

(a) H is Hermitian: H ∗ = H.

(b) H is involutary: H 2 = Id.

(c) H is unitary: H ∗ H = Id.

Proof. As H is a Householder matrix, there exists u ∈ Kn such that kuk2 = 1 and (5.33)
is satisfied. Then ∗
H ∗ = Id −2 uu∗ = Id −2(u∗ )∗ u∗ = H

proves (a) and


 
H 2 = Id −2 uu∗ Id −2 uu∗ = Id −4 uu∗ + 4uu∗ uu∗ = Id,
proves (b), where u∗ u = kuk22 = 1 was used in the last equality. Combining (a) and (b)
then yields (c). 
Remark 5.21. Let H ∈ M(n, R), n ∈ N, be a Householder matrix, with H = Id −2 uut ,
u ∈ Rn , kuk2 = 1. Then the map H : Rn −→ Rn , x 7→ Hx constitutes the reflection
through the hyperplane
V := span{u}⊥ = {z ∈ Rn : z t u = 0} :
Note that, for each x ∈ Rn , x − uut x ∈ V :
ut u=1
(x − uut x)t u = xt u − (xt u)ut u = 0.
Also:
z t u=0
∀ huut x, zi = z t uut x = 0.
z∈V

In particular, Id − uut is the orthogonal projection onto V , showing that H constitutes


the reflection through V .

The key idea of the Householder method for the computation of a QR decomposition
is to find a Householder reflection that transforms a given vector into the span of the
first standard unit vector e1 ∈ Kk . This task then has to be performed for varying
dimensions k. That such Householder reflections exist is the contents of the following
lemma2 :
Lemma 5.22. Let k ∈ N. We consider the elements of Kk as column vectors. Let e1 ∈ Kk be the first standard unit vector of Kk . If x ∈ Kk \ {0}, then
\[ u := u(x) := \frac{v(x)}{\|v(x)\|_2}, \qquad v := v(x) := \begin{cases} \dfrac{|x_1|\, x + x_1 \|x\|_2\, e_1}{|x_1|\, \|x\|_2} & \text{for } x_1 \neq 0,\\[3mm] \dfrac{x}{\|x\|_2} + e_1 & \text{for } x_1 = 0, \end{cases} \qquad (5.34a) \]
satisfies
\[ \|u\|_2 = 1 \qquad (5.34b) \]
and
\[ (\operatorname{Id} - 2uu^*)\, x = \xi\, e_1, \qquad \xi = \begin{cases} -\dfrac{x_1 \|x\|_2}{|x_1|} & \text{for } x_1 \neq 0,\\[3mm] -\|x\|_2 & \text{for } x_1 = 0. \end{cases} \qquad (5.34c) \]
2
If one is only interested in the case K = R, then one can use the simpler version provided in
Appendix Sec. H.2.

Proof. If x1 = 0, then v1 = 0 + 1 = 1, i.e. v 6= 0 and u is well-defined. If x1 6= 0, then


x1 (|x1 | + kxk2 )
v1 = 6= 0,
|x1 | kxk2

i.e. v 6= 0 and u is well-defined in this case as well. Moreover, (5.34b) is immediate from
the definition of u. To verify (5.34c), note
( 2


1 + |x2|x 1|
1 | kxk2
+ 1 = 2 + 2|x 1|
kxk2
for x1 6= 0,
v v= 2|x1 |
1 + 0 + 1 = 2 + kxk2 for x1 = 0,
(
kxk2 + |x1 | for x1 6= 0,
v∗x =
kxk2 + 0 = kxk2 + |x1 | for x1 = 0.

Thus, for x1 = 0, we obtain


2vv ∗ x v (kxk2 + |x1 |)
(Id −2uu∗ )x = x − ∗
=x− |x1 |
= x − v kxk2 = −kxk2 e1 = ξ e1 ,
v v 1 + kxk 2

proving (5.34c). Similarly, for x1 6= 0, we obtain

|x1 | x + x1 kxk2 e1 x1 kxk2


(Id −2uu∗ )x = x − =− e1 = ξ e1 ,
|x1 | |x1 |

once again proving (5.34c) and concluding the proof of the lemma. 

We note that the numerator in (5.34a) is constructed in such a way that there is no
subtractive cancellation of digits, independent of the signs of Re x1 and Im x1 .

Definition 5.23. Given A ∈ M(m, n, K) with m, n ∈ N, the Householder method is


the following procedure: Let A(1) := A, Q(1) := Idm (identity matrix on Km ), r(1) := 1.
For k ≥ 1, as long as r(k) < m and k ≤ n, the Householder method transforms A(k)
into A(k+1) , Q(k) into Q(k+1) , and r(k) into r(k + 1) by performing precisely one of the
following actions:
(k)
(a) If aik = 0 for each i ∈ {r(k), . . . , m}, then

A(k+1) := A(k) , Q(k+1) := Q(k) , r(k + 1) := r(k).

(k) (k)
(b) If ar(k),k 6= 0, but aik = 0 for each i ∈ {r(k) + 1, . . . , m},

A(k+1) := A(k) , Q(k+1) := Q(k) , r(k + 1) := r(k) + 1.



(c) Otherwise, i.e. if x^{(k)} ∉ span{e1 }, where
\[ x^{(k)} := \bigl( a^{(k)}_{r(k),k}, \dots, a^{(k)}_{m,k} \bigr)^t \in \mathbb{K}^{m-r(k)+1} \qquad (5.35) \]
and e1 is the standard basis vector in K^{m−r(k)+1} , then
\[ A^{(k+1)} := H^{(k)} A^{(k)}, \qquad Q^{(k+1)} := Q^{(k)} H^{(k)}, \qquad r(k+1) := r(k) + 1, \]
where
\[ H^{(k)} := \begin{pmatrix} \operatorname{Id}_{r(k)-1} & 0 \\ 0 & \operatorname{Id}_{m-r(k)+1} - 2\, u^{(k)} (u^{(k)})^* \end{pmatrix}, \qquad u^{(k)} := u(x^{(k)}), \qquad (5.36) \]
with u(x^{(k)}) according to (5.34a).

Theorem 5.24. Given A ∈ M(m, n, K) with m, n ∈ N, the Householder method as


defined in Def. 5.23 yields a QR decomposition of A. More precisely, if r(k) = m or
k = n + 1, then Q := Q(k) ∈ M(m, K) is unitary and à := A(k) ∈ M(m, n, K) is in
echelon form, satisfying A = QÃ.

Proof. First note that the Householder method terminates after at most n steps. If
1 ≤ N ≤ n + 1 is the maximal k occurring during the Householder method, and one
lets H (k) := Idm for each k such that Def. 5.23(a) or Def. 5.23(b) was used, then

Q = H (1) · · · H (N −1) , Ã = H (N −1) · · · H (1) A. (5.37)

Since, according to Lem. 5.20(b), (H (k) )2 = Idm for each k, QÃ = A is an immediate
consequence of (5.37). Moreover, it follows from Lem. 5.20(c) that the H (k) are all
unitary, implying Q to be unitary as well. To prove that à is in echelon form, we show
by induction over k that the first k − 1 columns of A(k) are in echelon form as well as
(k)
the first r(k) rows of A(k) with aij = 0 for each (i, j) ∈ {r(k), . . . , m} × {1, . . . , k − 1}.
For k = 1, the assertion is trivially true. By induction, we assume the assertion for
k and prove it for k + 1 (for k = 1, we are not assuming anything). Observe that
multiplication of A(k) with H (k) as defined in (5.36) does not change the first r(k) − 1
(k)
rows of A(k) . It does not change the first k − 1 columns of A(k) , either, since aij = 0
for each (i, j) ∈ {r(k), . . . , m} × {1, . . . , k − 1}. Moreover, if we are in the case of Def.

5.23(c), then the choice of u(k) in (5.36) and (5.34c) of Lem. 5.22 guarantee
 
(k+1)
ar(k),k  
 (k+1)  
 1 
a  0
 
 r(k)+1,k   
 ..  ∈ span  ..  . (5.38)
 .  
  . 
   0 
 
(k+1)
am,k

(k+1)
In each case, after the application of (a), (b), or (c) of Def. 5.23, aij = 0 for each
(i, j) ∈ {r(k + 1), . . . , m} × {1, . . . , k}. We have to show that the first r(k + 1) rows of
A(k+1) are in echelon form. For Def. 5.23(a), it is r(k + 1) = r(k) and there is nothing
(k+1)
to prove. For Def. 5.23(b),(c), we know ar(k+1),j = 0 for each j ∈ {1, . . . , k}, while
(k+1)
ar(k),k 6= 0, showing that the first r(k + 1) rows of A(k+1) are in echelon form in each
case. So we know that the first r(k + 1) rows of A(k+1) are in echelon form, and all
elements in the first k columns of A(k+1) below row r(k + 1) are zero, showing that the
first k columns of A(k+1) are also in echelon form. As, at the end of the Householder
method, r(k) = m or k = n + 1, we have shown that à is in echelon form. 

Example 5.25. We apply the Householder method of Def. 5.23 to the matrix A of the
previous Ex. 5.18, i.e. to  
1 4 2 3
1 2 1 0
A := 
2 6 3 1 .
 (5.39a)
0 0 1 4
We start with
A(1) := A, Q(1) := Id, r(1) := 1. (5.39b)
Since x(1) := (1, 1, 2, 0)t ∈
/ span{(1, 0, 0, 0)t }, we compute
 √ 
1+ 6
x(1) + σ(x(1) ) e1 1  1 
u(1) := (1) = √  , (5.39c)
kx + σ(x(1) ) e1 k2 kx(1) + 6 e1 k2  2 
0

 √ √ √ 
7 + 2√ 6 1 + 6 2 + 2 6 0
1 1+ 6 1 2 0
H (1) := Id −2u(1) (u(1) )t = Id − √ 

√ 
6+ 6 2+2 6 2 4 0
0 0 0 0
 
−0.4082 −0.4082 −0.8165 0
−0.4082 0.8816 −0.2367 0
≈−0.8165 −0.2367 0.5266 0 ,
 (5.39d)
0 0 0 1

 
−2.4495 −7.3485 −3.6742 −2.0412
 0 −1.2899 −0.6449 −1.4614
A(2) := H (1) A(1) ≈

, (5.39e)
0 −0.5798 −0.2899 −1.9229
0 0 1 4

Q(2) := Q(1) H (1) = H (1) , r(2) := r(1) + 1 = 2. (5.39f)

Since x(2) := (−1.2899, −0.5798, 0)t ∈ / span{(1, 0, 0)t }, we compute


 
(2) (2) (2) (2) −0.9778
x + σ(x ) e1 x − kx k2 e1
u(2) := (2) = ≈ −0.2096 , (5.39g)
(2)
kx + σ(x ) e1 k2 (2) (2) k e
x − kx 2 1 0
2
 
  1 0 0 0
1 0 0 −0.9121 −0.4100 0
(2)
H := (2) (2) t ≈ 0 −0.4100 0.9121 0 ,
 (5.39h)
0 Id −2u (u )
0 0 0 1
 
−2.4495 −7.3485 −3.6742 −2.0412
 0 1.4142 0.7071 2.1213 
A(3) = H (2) A(2) ≈ 

, (5.39i)
0 0 0 −1.1547
0 0 1 4
 
−0.4082 0.7071 −0.5774 0
 −0.4082 −0.7071 −0.5774 0
Q(3) = Q(2) H (2) ≈  
−0.8165 −0.0000 0.5774 0 , r(3) = r(2) + 1 = 3. (5.39j)
0 0 0 1

Since x(3) := (0, 1)t ∈


/ span{(1, 0)t }, we compute
 √ 
(3) x(3) + σ(x(3) ) e1 1/√2
u := (3) (3)
= , (5.39k)
kx + σ(x ) e1 k2 1/ 2
 
! 1 0 0 0
Id 0 0 1 0 0
H (3) := = 
, (5.39l)
(3) (3) t
0 Id −2u (u ) 0 0 0 −1
0 0 −1 0
 
−2.4495 −7.3485 −3.6742 −2.0412
 0 1.4142 0.7071 2.1213 
R = Ã = A(4) = H (3) A(3) ≈ 
, (5.39m)
0 0 −1 −4 
0 0 0 1.1547
 
−0.4082 0.7071 0 0.5774
 −0.4082 −0.7071 0 0.5774 
Q = Q(4) = Q(3) H (3) ≈ −0.8165
. (5.39n)
0 0 −0.5774
0 0 −1 0
One verifies A = QR.
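A compact implementation of the Householder method might look as follows (a sketch for real matrices of full rank, so that the column-skipping cases (a) and (b) of Def. 5.23 reduce to a simple test; function names are ours). Applied to the matrix of Ex. 5.18/5.25, it reproduces the factors computed above up to rounding:

```python
import numpy as np

def householder_qr(A):
    """QR decomposition via Householder reflections, a simplified sketch of
    the method of Def. 5.23 for real matrices."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    R = A.copy()
    Q = np.eye(m)
    for k in range(min(m - 1, n)):
        x = R[k:, k]
        if np.allclose(x[1:], 0.0):
            continue                       # already in the span of e_1
        # vector v of (5.34a) (real case): built so that no cancellation occurs
        v = x.copy()
        v[0] += np.sign(x[0]) * np.linalg.norm(x) if x[0] != 0 else np.linalg.norm(x)
        u = v / np.linalg.norm(v)
        H = np.eye(m)
        H[k:, k:] -= 2.0 * np.outer(u, u)  # cf. (5.36)
        R = H @ R
        Q = Q @ H
    return Q, R

A = np.array([[1.0, 4.0, 2.0, 3.0],
              [1.0, 2.0, 1.0, 0.0],
              [2.0, 6.0, 3.0, 1.0],
              [0.0, 0.0, 1.0, 4.0]])
Q, R = householder_qr(A)
print(np.round(R, 4))                      # compare with (5.39m)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(4)))
```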

6 Iterative Methods, Solution of Nonlinear Equa-


tions

6.1 Motivation: Fixed Points and Zeros


Definition 6.1. (a) Given a function ϕ : A −→ B, where A, B can be arbitrary sets,
x ∈ A is called a fixed point of ϕ if, and only if, ϕ(x) = x.
(b) If f : A −→ Y , where A can be an arbitrary set, and Y is (a subset of) a vector
space, then x ∈ A is called a zero (or sometimes a root) of f if, and only if, f (x) = 0.

In the following, we will study methods for finding zeros of functions in special situations.
While, if formulated in the generality of Def. 6.1, the setting of fixed point and zero
problems is different, in many situations, a fixed point problem can be transformed
into an equivalent zero problem and vice versa (see Rem. 6.2 below). In consequence,
methods that are formulated for fixed points, such as the Banach fixed point method, can
often also be used to find zeros, while methods formulated for zeros, such as Newton’s
method, can often also be used to find fixed points.

Remark 6.2. (a) If ϕ : A −→ B, where A, B are subsets of a common vector space


X, then x ∈ A is a fixed point of ϕ if, and only if, ϕ(x) − x = 0, i.e. if, and only if,
x is a zero of the function f : A −→ X, f (x) := ϕ(x) − x (we can no longer admit
arbitrary sets A, B, as adding and subtracting must make sense; but note that it
might happen that ϕ(x) − x ∈ / A ∪ B).

(b) If f : A −→ Y , where A and Y are both subsets of some common vector space Z,
then x ∈ A is a zero of f if, and only if, f (x) + x = x, i.e. if, and only if, x is a
fixed point of the function ϕ : A −→ Z, ϕ(x) := f (x) + x.

Given a function g, an iterative method defines a sequence x0 , x1 , . . . , where, for n ∈ N0 ,


xn+1 is computed from xn (or possibly x0 , . . . , xn ) by using g. For example, “using g”
can mean applying g to obtain xn+1 := g(xn ) or it can mean applying g and g ′ in a
suitable way (as in the case of Newton’s method according to (6.1) and (6.5) below).
The iterative method with xn+1 := g(xn ) is already known from the Banach fixed point
theorem of Analysis 2 (cf. [Phi16b, Th. 2.29]). In the following, we will provide an
introduction to the iterative method known as Newton’s method.

6.2 Newton’s Method


As mentioned above, the Banach fixed point method has already been studied in Anal-
ysis 2, [Phi16b, Sec. 2.2]. The Banach fixed point method has the advantage of being
applicable in the rather general setting of a complete metric space and of not requiring
any differentiability. Moreover, convergence results and error estimates are compara-
tively simple to prove.
Newton’s method on the other hand is only applicable to differentiable maps between
normed vector spaces and convergence results as well as error estimates are typically
considerably more difficult to establish, especially in more than one dimension. However,
if Newton’s method converges, then the convergence is typically considerably faster than
for the Banach fixed point method and, hence, Newton’s method is preferred in such
situations.
While the Banach fixed point method is designed to provide sequences that converge to
a fixed point of a function ϕ, Newton’s method is designed to provide sequences that
converge to a zero of a function f (cf. Sec. 6.1 above).
The subject of Newton’s and related methods is a vast field and we will only be able
to provide a brief introduction – see, e.g., [Deu11], [OR70, Sections 12.6,13.3,14.4] for
further results regarding such methods.

6.2.1 Newton’s Method in One Dimension

If A ⊆ R and f : A −→ R is differentiable, then Newton’s method is defined by the


recursion
\[ x_0 \in A, \qquad x_{n+1} := x_n - \frac{f(x_n)}{f'(x_n)} \quad \text{for each } n \in \mathbb{N}_0 . \qquad (6.1) \]
There are clearly several issues with Newton’s method as formulated in (6.1):

(a) Is the derivative of f in xn invertible (i.e. nonzero in the case of (6.1))?

(b) Is xn+1 an element of A such that the iteration can be continued?

(c) Does the sequence (xn )n∈N0 converge?

To ensure that we can answer “yes” to all of the above questions, we need suitable
hypotheses. The following theorem provides examples for sufficient conditions:

Theorem 6.3. Let I ⊆ R be a nontrivial interval (i.e. an interval consisting of more


than one point). Assume f : I −→ R to be differentiable with f ′ (x) 6= 0 for each x ∈ I
and to have a zero x̂ ∈ I. If f satisfies one of the following four conditions (a) – (d),
then x̂ is the only zero of f in I, and (6.1) well-defines a sequence (xn )n∈N0 in I that
converges to x̂, provided that x0 = x̂ (in which case the sequence is constant) or x0 , x1
are on different sides of x̂ with x1 ∈ I or x0 , x1 are on the same side of x̂ (in which case
the sequence is monotone).

(a) f is strictly increasing and convex.

(b) f is strictly decreasing and concave.

(c) f is strictly increasing and concave.

(d) f is strictly decreasing and convex.

For (a) and (b), x0 , x1 are on different sides of x̂ for x0 < x̂ and (xn )n∈N0 is decreasing
for x0 > x̂; for (c) and (d), x0 , x1 are on different sides of x̂ for x0 > x̂ and (xn )n∈N0 is
increasing for x0 < x̂.

Proof. In each case, f is strictly monotone and, hence, injective. Thus, x̂ must be the
only zero of f . For each n ∈ N0 , if f (xn ) = 0, then (6.1) implies xn+1 = xn , showing
(xn )n∈N0 to be constant for x0 = x̂.

We now assume (a). The convexity of f yields (cf. [Phi16a, (9.37)])
\[ \forall_{x,y \in I} \quad f(x) \ge f'(y)(x-y) + f(y). \qquad (6.2) \]
We now apply (6.2) to x := xn+1 and y := xn , assuming n ∈ N0 with xn , xn+1 ∈ I, obtaining
\[ f(x_{n+1}) \ge f'(x_n)(x_{n+1} - x_n) + f(x_n) \overset{(6.1)}{=} 0. \qquad (6.3) \]
Claim 1. If (a) holds, then
\[ \eta : I \longrightarrow \mathbb{R}, \qquad \eta(x) := x - \frac{f(x)}{f'(x)} , \]
is increasing on [x̂, sup I[.

Proof. Let x1 , x2 ∈ I such that x̂ ≤ x1 < x2 . Then we estimate
\[ \eta(x_2) - \eta(x_1) = x_2 - x_1 + \frac{f(x_1)}{f'(x_1)} - \frac{f(x_2)}{f'(x_2)} \overset{(6.2)}{\ge} \frac{f(x_2) - f(x_1)}{f'(x_2)} + \frac{f(x_1)}{f'(x_1)} - \frac{f(x_2)}{f'(x_2)} = f(x_1)\Bigl( \frac{1}{f'(x_1)} - \frac{1}{f'(x_2)} \Bigr) \ge 0, \]

where we used f (x1 ) ≥ 0 (as f is increasing) and f ′ (x1 ) ≤ f ′ (x2 ) (as f is convex,
[Phi16a, (9.37)]). N

If xn ∈ I with xn > x̂, then f (xn ) > 0 (as f is strictly increasing) and f ′ (xn ) > 0 (as f is strictly increasing and f ′ (xn ) 6= 0), implying
\[ \hat x = \eta(\hat x) \overset{\text{Cl. 1}}{\le} \eta(x_n) = x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} < x_n \quad \Rightarrow \quad x_{n+1} \in I. \qquad (6.4) \]

According to (6.4), if x0 ≥ x̂, then (xn )n∈N0 is a decreasing sequence in [x̂, x0 ], converging
to some x∗ ∈ [x̂, x0 ] (this is the case, where x0 , x1 are on the same side of x̂). If x0 < x̂
and x1 ∈ I, then (6.3) shows x1 ≥ x̂ and, then, (6.4) yields (xn )n∈N is a decreasing
sequence in [x̂, x1 ], converging to some x∗ ∈ [x̂, x1 ] (this is the case, where x0 , x1 are on
different sides of x̂). In each case, we then obtain
\[ x_* = \lim_{n\to\infty} x_{n+1} = \lim_{n\to\infty} \Bigl( x_n - \frac{f(x_n)}{f'(x_n)} \Bigr) = x_* - \frac{f(x_*)}{f'(x_*)} , \]

proving f (x∗ ) = 0, i.e. x∗ = x̂ (note that we did not assume the continuity of f ′ –
however the boundedness of f ′ on [x̂, x1 ], as given by its monotonicity, already suffices

to imply f (x∗ ) = 0; but the above equation does actually hold as stated, since the
monotonicity of f ′ , together with the intermediate value theorem for derivatives (cf.
[Phi16a, Th. 9.24]), does also imply the continuity of f ′ ).
If f is strictly decreasing and concave, then −f is strictly increasing and convex. Since
the zeros of f and −f are the same, and (6.1) does not change if f is replaced by −f ,
(b) follows from (a).
If f is strictly increasing and concave, then g : (−I) −→ R, g(x) := f (−x), −I :=
{−x : x ∈ I}, is strictly decreasing and concave: For each x1 , x2 ∈ (−I),
x1 < x2 ⇒ −x2 < −x1

⇒ g(x1 ) = f (−x1 ) > f (−x2 ) = g(x2 )

∧ g ′ (x1 ) = −f ′ (−x1 ) > −f ′ (−x2 ) = g ′ (x2 ) .
As (6.1) implies
f (xn ) g(−xn )
∀ − xn+1 = −xn − = −x n − ,
n∈N0 −f ′ (xn ) g ′ (−xn )
(b) applies to g, yielding that −xn converges to −x̂ in the stated manner, i.e. xn con-
verges to x̂ in the manner stated in (c).
If f is strictly decreasing and convex, then −f is strictly increasing and concave, i.e. (d)
follows from (c) in the same way that (b) follows from (a). 

While Th. 6.3 shows the convergence of Newton’s method for each x0 in the domain I,
it does not provide error estimates. We will prove error estimates in Th. 6.5 below in
the context of the multi-dimensional Newton’s method, where the error estimates are
only shown in a sufficiently small neighborhood of the zero x∗ . Results of this type are
quite typical in the context of Newton’s method, albeit somewhat unsatisfactory, as the
location of the zero is usually unknown in applications (also cf. Ex. 6.6 below).
Example 6.4. (a) We show that cos x = x has a unique solution in [0, π2 ] and that the
solution can be approximated via Newton’s method: To this end, we consider
f : [0, π/2] −→ R, f (x) := cos x − x.
Then f is differentiable with f ′ (x) = − sin x − 1 < 0 on [0, π/2[ and f ′′ (x) =
− cos x < 0 on [0, π/2[, showing f to be strictly decreasing and concave. Thus, Th.
6.3(b) applies for each x0 ∈ [0, π/2[ (see Ex. 6.6(a) below for error estimates).
(b) Theorem 6.3(a) applies to the function
f : R+ −→ R, f (x) := x2 − a, a ∈ R+
of Ex. 1.3, which is, clearly, strictly increasing and convex.
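For illustration, here is a bare-bones implementation of (6.1) (a sketch without any of the safeguards whose necessity is discussed above), applied to the two functions of Ex. 6.4:

```python
import numpy as np

def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton's method (6.1); a simple sketch without safeguards."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# (a) the fixed point equation cos x = x, cf. Ex. 6.4(a)
x = newton(lambda x: np.cos(x) - x, lambda x: -np.sin(x) - 1.0, 0.0)
print(x, np.cos(x) - x)      # approx. 0.7390851332...

# (b) the square root of a = 2 via f(x) = x^2 - a, cf. Ex. 6.4(b)
print(newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 2.0))
```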

6.2.2 Newton’s Method in Several Dimensions

Newton’s method, as defined in (6.1), naturally generalizes to a differentiable function


f : A −→ Rm , A ⊆ Rm , as follows:
−1
x0 ∈ A, xn+1 := xn − Df (xn ) f (xn ) for each n ∈ N0 (6.5)

(indeed, the same still works if Rm is replaced by some general normed vector space,
but then the notion of differentiability needs to be generalized to such spaces, which is
beyond the scope of the present class).
In practice, in each step of Newton’s method (6.5), one will determine xn+1 as the
solution to the linear system

Df (xn )xn+1 = Df (xn )xn − f (xn ). (6.6)

Not surprisingly, the issues formulated as (a), (b), (c) after (6.1) (i.e. invertibility of
Df (xn ), xn+1 being in A, and convergence of (xn )n∈N0 ), are also present in the multi-
dimensional situation.
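A corresponding sketch in Python/NumPy (solving the linear system (6.6) with a standard linear solver in each step; the test map is the one studied in Ex. 6.6(b) below):

```python
import numpy as np

def newton_system(f, Df, x0, tol=1e-12, max_iter=50):
    """Newton's method (6.5) in R^m; each step solves the linear system
    Df(x_n) * delta = f(x_n), equivalent to (6.6) with x_{n+1} = x_n - delta,
    instead of forming the inverse of Df explicitly."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        delta = np.linalg.solve(Df(x), f(x))
        x = x - delta
        if np.linalg.norm(delta) < tol:
            break
    return x

# the map of Ex. 6.6(b): f(x) = x - (1/4) (cos x1 - sin x2, cos x1 - 2 sin x2)^t
def f(x):
    return x - 0.25 * np.array([np.cos(x[0]) - np.sin(x[1]),
                                np.cos(x[0]) - 2.0 * np.sin(x[1])])

def Df(x):
    return 0.25 * np.array([[4.0 + np.sin(x[0]), np.cos(x[1])],
                            [np.sin(x[0]), 4.0 + 2.0 * np.cos(x[1])]])

x = newton_system(f, Df, np.zeros(2))
print(x, f(x))
```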
As in the 1-dimensional case, we will present a result that provides an example for suffi-
cient conditions for convergence (cf. Th. 6.5 below). While Th. 6.3 for the 1-dimensional
Newton’s method is an example of a global Newton theorem (proving convergence for
each x0 in the considered domain of f ), Th. 6.5 is an example of a local Newton theorem,
proving convergence under the assumption that x0 is sufficiently close to the zero x̂ –
however, in contrast to Th. 6.3, Th. 6.5 also includes error estimates (cf. (6.12), (6.13)).
In the literature (e.g. in the previously mentioned references [Deu11], [OR70]), one finds
many variants of both local and global Newton theorems. A global Newton theorem in
a multi-dimensional setting is also provided by Th. I.1 in the appendix.
Theorem 6.5. Let m ∈ N and fix a norm k · k on Rm . Let A ⊆ Rm be open. Moreover
let f : A −→ A be differentiable, let x∗ ∈ A be a zero of f , and let r > 0 be (sufficiently
small) such that
\[ B_r(x_*) = \bigl\{ x \in \mathbb{R}^m : \|x - x_*\| < r \bigr\} \subseteq A. \qquad (6.7) \]
Assume that Df (x∗ ) is invertible and that β > 0 is such that
\[ \bigl\| \bigl( Df(x_*) \bigr)^{-1} \bigr\| \le \beta \qquad (6.8) \]

(here and in the following, we consider the induced operator norm on real n×n matrices).
Finally, assume that the map Df : Br (x∗ ) −→ L(Rm , Rm ) is Lipschitz continuous with
Lipschitz constant L ∈ R+ , i.e.

\[ \bigl\| (Df)(x) - (Df)(y) \bigr\| \le L\, \|x - y\| \quad \text{for each } x, y \in B_r(x_*). \qquad (6.9) \]

Then, letting
\[ \delta := \min\Bigl\{ r, \frac{1}{2\beta L} \Bigr\}, \qquad (6.10) \]
Newton’s method (6.5) is well-defined for each x0 ∈ Bδ (x∗ ) (i.e., for each n ∈ N0 ,
Df (xn ) is invertible and xn+1 ∈ Bδ (x∗ )), and

lim xn = x∗ (6.11)
n→∞

with the error estimates


1
kxn+1 − x∗ k ≤ β L kxn − x∗ k2 ≤ kxn − x∗ k, (6.12)
2
 2n −1
1
kxn − x∗ k ≤ kx0 − x∗ k (6.13)
2

for each n ∈ N0 . Moreover, within Bδ (x∗ ), the zero x∗ is unique.

Proof. The proof is conducted via several steps.


Claim 1. For each x ∈ Bδ (x∗ ), the matrix Df (x) is invertible with
\[ \bigl\| \bigl( Df(x) \bigr)^{-1} \bigr\| \le 2\beta. \qquad (6.14) \]

Proof. Fix x ∈ Bδ (x∗ ). We will apply Lem. 2.45 with

A := Df (x∗ ), ∆A := Df (x) − Df (x∗ ), Df (x) = A + ∆A. (6.15)

To apply Lem. 2.45, we must estimate ∆A. We obtain


(6.9) 1 (6.8) 1 −1 −1
k∆Ak ≤ L kx − x∗ k < L δ ≤ ≤ kA k , (6.16)
2β 2

such that we can, indeed, apply Lem. 2.45. In consequence, Df (x) is invertible with
−1 kA−1 k
(2.32) (6.16)
Df (x) ≤ ≤ 2kA−1 k ≤ 2β, (6.17)
1 − kA−1 k k∆Ak

thereby establishing the case. N


Claim 2. For each x, y ∈ Br (x∗ ), the following holds:

    ‖f(x) − f(y) − Df(y)(x − y)‖ ≤ (L/2) ‖x − y‖².                                        (6.18)

Proof. We apply the chain rule to the auxiliary function



    φ : [0, 1] −→ R^m,   φ(t) := f(y + t(x − y))                                          (6.19)

to obtain
    φ′(t) = Df(y + t(x − y)) (x − y)   for each t ∈ [0, 1].                               (6.20)
This allows us to write
    f(x) − f(y) − Df(y)(x − y) = φ(1) − φ(0) − φ′(0) = ∫₀¹ (φ′(t) − φ′(0)) dt.            (6.21)

Next, we estimate the norm of the integrand:



    ‖φ′(t) − φ′(0)‖ ≤ ‖Df(y + t(x − y)) − Df(y)‖ ‖x − y‖ ≤ L t ‖x − y‖².                  (6.22)

Thus,
    ‖∫₀¹ (φ′(t) − φ′(0)) dt‖ ≤ ∫₀¹ ‖φ′(t) − φ′(0)‖ dt ≤ ∫₀¹ L t ‖x − y‖² dt = (L/2) ‖x − y‖².   (6.23)
Combining (6.23) with (6.21) proves (6.18). N

We will now show that


xn ∈ Bδ (x∗ ) (6.24)
and the error estimate (6.12) holds for each n ∈ N0 . We proceed by induction. For
n = 0, (6.24) holds by hypothesis. Next, we assume that n ∈ N0 and (6.24) holds.
Using f (x∗ ) = 0 and (6.5), we write
    xn+1 − x∗ = xn − x∗ − (Df(xn))^{−1} (f(xn) − f(x∗))
              = −(Df(xn))^{−1} (f(xn) − f(x∗) − Df(xn)(xn − x∗)).                         (6.25)

Applying the norm to (6.25) and using (6.14) and (6.18) implies

    ‖xn+1 − x∗‖ ≤ 2β (L/2) ‖xn − x∗‖² = β L ‖xn − x∗‖² ≤ β L δ ‖xn − x∗‖ ≤ (1/2) ‖xn − x∗‖,   (6.26)
which proves xn+1 ∈ Bδ (x∗ ) as well as (6.12).
Another induction shows that (6.12) implies, for each n ∈ N,
    ‖xn − x∗‖ ≤ (βL)^{1+2+···+2^{n−1}} ‖x0 − x∗‖^{2^n} ≤ (βLδ)^{2^n − 1} ‖x0 − x∗‖,       (6.27)

where the geometric sum formula 1 + 2 + · · · + 2^{n−1} = (2^n − 1)/(2 − 1) = 2^n − 1 was used
for the last inequality. Moreover, βLδ ≤ 1/2 together with (6.27) proves (6.13), and, in
particular, (6.11).
Finally, we show that x∗ is the unique zero in Bδ (x∗ ): Suppose x∗ , x∗∗ ∈ Bδ (x∗ ) with
f (x∗ ) = f (x∗∗ ) = 0. Then
    ‖x∗∗ − x∗‖ = ‖(Df(x∗))^{−1} (f(x∗∗) − f(x∗) − Df(x∗)(x∗∗ − x∗))‖.                     (6.28)

Applying (6.14) and (6.18) yields

    ‖x∗∗ − x∗‖ ≤ 2β (L/2) ‖x∗ − x∗∗‖² ≤ β L δ ‖x∗ − x∗∗‖ ≤ (1/2) ‖x∗ − x∗∗‖,              (6.29)
implying kx∗ − x∗∗ k = 0 and completing the proof of the theorem. 

Example 6.6. (a) We already know from Ex. 6.4(a), that

f : [0, π/2] −→ R, f (x) := cos x − x,

has a unique zero x∗ ∈]0, π/2[ and that Newton’s method converges to x∗ for each
x0 ∈ [0, π/2[. Moreover,

    −2 ≤ f′(x) = −sin x − 1 ≤ −1   for each x ∈ [0, π/2]   (⇒ |1/f′(x)| ≤ 1),

and f ′ is Lipschitz continuous with Lipschitz constant L = 1. Thus, Th. 6.5 applies
with β = L = 1. However, without further knowledge regarding the location
of the zero in [0, π/2], we do not have any lower positive bound on r to obtain
δ = min{r, 1/(2βL)}. One possibility is to extend the domain to A := ]−1/2, π/2 + 1/2[,
yielding δ = r = 1/(2βL) = 1/2. Thus, in combination with the information of Ex. 6.4(a),
we know that, for each x0 ∈ [0, π/2[, Newton's method converges to the zero x∗
and the error estimates (6.12), (6.13) hold if |x0 − x∗| < 1/2 (otherwise, (6.12), (6.13)
might not hold for some initial steps).
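
A minimal Python sketch of this one-dimensional case (assuming NumPy; the variable
names are illustrative): since β = L = 1, (6.12) predicts |x_{n+1} − x∗| ≤ |x_n − x∗|²
once |x_n − x∗| < 1/2, which can be observed in the printed residuals.

import numpy as np

# Newton's method (6.1) for f(x) = cos(x) - x, cf. Ex. 6.4(a) and Ex. 6.6(a)
f = lambda x: np.cos(x) - x
df = lambda x: -np.sin(x) - 1.0

x = 1.0                      # x0 in [0, pi/2[
for n in range(6):
    x = x - f(x) / df(x)     # Newton step
    print(n + 1, x, abs(f(x)))
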

(b) Consider the map

    f : R² −→ R²,   f(x1, x2) := (x1, x2)ᵀ − (1/4) (cos x1 − sin x2, cos x1 − 2 sin x2)ᵀ.
Then f is differentiable with, for each x = (x1, x2)ᵀ ∈ R²,

    Df(x) = (1/4) ( 4 + sin x1      cos x2
                    sin x1          4 + 2 cos x2 ),

    det Df(x) = (1/16) (16 + 4 sin x1 + 8 cos x2 + sin x1 cos x2) ≥ 3/16 > 0,

    (Df(x))^{−1} = (1 / det Df(x)) · (1/4) ( 4 + 2 cos x2    −cos x2
                                             −sin x1          4 + sin x1 ).

We now consider R2 with k · k∞ and recall from Ex. 2.18(a) that the row sum norm
is the corresponding operator norm. For each x, y ∈ R2 , we estimate, using that
both sin and cos are Lipschitz continuous with Lipschitz constant 1,

    ‖Df(x) − Df(y)‖∞ = (1/4) ‖ ( |sin x1 − sin y1|     |cos x2 − cos y2|
                                 |sin x1 − sin y1|    2|cos x2 − cos y2| ) ‖∞
                     ≤ (1/4) (|x1 − y1| + 2|x2 − y2|) ≤ (3/4) ‖x − y‖∞,
as well as

    ‖(Df(x))^{−1}‖∞ ≤ (16/3) · (7/4) = 28/3   for each x ∈ R².

Thus, if we can establish f to have a zero in B1(0) = [−1, 1] × [−1, 1], then Th. 6.5
applies with β := 28/3 and L := 3/4 (and arbitrary r ∈ R+), i.e. with
δ = 1/(2 · (28/3) · (3/4)) = 1/14.
To verify that f has a zero in B1 (0), we note that f (x) = 0 is equivalent to the
fixed point problem φ(x) = x with

    φ : R² −→ R²,   φ(x1, x2) := (1/4) (cos x1 − sin x2, cos x1 − 2 sin x2)ᵀ.

As φ, clearly, maps B1(0) (even all of R²) into B_{3/4}(0) and is a contraction, due to

    ‖φ(x) − φ(y)‖∞ ≤ (3/4) ‖x − y‖∞   for each x, y ∈ R²,

the Banach fixed point theorem [Phi16b, Th. 2.29] applies, showing φ to have a
unique fixed point x∗, which must be located in B_{3/4}(0). As we now only know Th.
6.5 to yield convergence of Newton's method for x0 ∈ B_{1/14}(x∗), in practice, without
more accurate knowledge of the location of x∗, one could cover B_{3/4}(0) with squares
of side length 1/7, successively starting iterations in each of the squares, aborting an
iteration if one finds it to be inconsistent with the error estimates of Th. 6.5.
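
As a hedged illustration (assuming NumPy and the newton_system sketch from Sec. 6.2.2;
the names f and df below are illustrative), one can apply the multidimensional Newton
iteration directly to the map f of this example:

import numpy as np

def f(x):
    x1, x2 = x
    return np.array([x1 - 0.25 * (np.cos(x1) - np.sin(x2)),
                     x2 - 0.25 * (np.cos(x1) - 2.0 * np.sin(x2))])

def df(x):
    x1, x2 = x
    return 0.25 * np.array([[4.0 + np.sin(x1), np.cos(x2)],
                            [np.sin(x1), 4.0 + 2.0 * np.cos(x2)]])

x_star = newton_system(f, df, x0=np.array([0.0, 0.0]))
print(x_star, f(x_star))   # residual should be near machine precision
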

7 Eigenvalues

7.1 Introductory Remarks


We recall some basic definitions and results from Linear Algebra: If A ∈ M(n, F ) is an
n × n matrix over the field F , n ∈ N, then λ ∈ F is called an eigenvalue of A if, and
only if,
    ∃ 0 ≠ v ∈ Fⁿ :  Av = λ v,

where every 0 ≠ v ∈ Fⁿ satisfying Av = λv is called an eigenvector to the eigenvalue λ.


The set of all eigenvalues of A is called the spectrum of A, denoted σ(A). According to
[Phi19b, Prop. 8.2(b)], σ(A) is precisely the set of zeros of the characteristic polynomial
χA of A, where χA is defined by (cf. [Phi19b, Def. 8.1] and [Phi19b, Rem. 8.7(a)])

χA := det(X Idn −A) ∈ F [X].

For λ ∈ σ(A), its algebraic multiplicity ma (λ) and geometric multiplicity mg (λ) are (cf.
[Phi19b, Def. 6.12] and [Phi19b, Th. 8.4]):

ma (λ) := ma (λ, A) := multiplicity of λ as a zero of χA ,


mg (λ) := mg (λ, A) := dim ker(A − λ Id).

Remark 7.1. Let F be a field, n ∈ N, A ∈ M(n, F ).

(a) Eigenvalues of A and their multiplicities are invariant under similarity transforma-
tions, i.e. if T ∈ M(n, F ) is invertible and B := T −1 AT , then σ(B) = σ(A) and, for
each λ ∈ σ(A), ma (λ, A) = ma (λ, B) as well as mg (λ, A) = mg (λ, B) (e.g. due to
the fact that χA and ker(A − λ Id) only depend on the linear maps (in L(Fⁿ, Fⁿ))
represented by A and A − λ Id, respectively, where A and B (resp. A − λ Id and
T −1 (A − λ Id)T ) represent the same map, merely with respect to different bases of
F n (cf. [Phi19b, Prop. 8.2(a)])).

(b) Note that the characteristic polynomial χA = det(X Idn −A) is always a monic
polynomial of degree n (i.e. a polynomial of degree n, where the coefficient of X n is
1, cf. [Phi19b, Rem. 8.3]). In general, the eigenvalues of A will be in the algebraic
closure F̄ of F, cf. [Phi19b, Def. 7.11, Th. 7.38, Def. D.13, Th. D.16]³. On the
other hand, every monic polynomial f ∈ F[X] of degree n ≥ 1 is the characteristic
polynomial of some matrix A(f) ∈ M(n, F), where A(f) is the so-called companion
matrix of the polynomial: If a1 , . . . , an ∈ F , and
    f := Xⁿ + Σ_{j=1}^{n} aj X^{n−j},

then χ_{A(f)} = f for

    A(f) := ( −a1  −a2  −a3  ···  −a_{n−1}  −a_n
               1    0
                    1    0
                         ⋱    ⋱
                              1    0
                                   1    0    )

(cf. [Phi19b, Rem. 8.3(c)]). In the following, we will only consider matrices over R
and C, and we know that, due to the fundamental theorem of algebra (cf. [Phi16a,
Th. 8.32]), C is the algebraic closure of R.

³ It is, perhaps, somewhat misleading to speak of the algebraic closure of F: Even though, according
to [Phi19b, Cor. D.20], all algebraic closures of a given field F are isomorphic, the existence of the
isomorphism is, in general, nonconstructive, based on an application of Zorn's lemma.
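
For illustration, a small Python sketch (assuming NumPy; the helper name companion is
illustrative) builds the companion matrix A(f) of a monic polynomial and compares its
eigenvalues with the roots of f:

import numpy as np

def companion(a):
    """a = [a_1, ..., a_n] are the coefficients of the monic polynomial f."""
    n = len(a)
    A = np.zeros((n, n))
    A[0, :] = -np.asarray(a, dtype=float)   # first row: -a_1, ..., -a_n
    A[1:, :-1] = np.eye(n - 1)              # ones on the first subdiagonal
    return A

a = [0.0, -2.0, 1.0]                         # f = X^3 - 2X + 1
print(np.sort_complex(np.linalg.eigvals(companion(a))))
print(np.sort_complex(np.roots([1.0] + a)))  # roots of the same polynomial
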

The computation of eigenvalues is of importance to numerous applications. A standard


example from Physics is the occurrence of eigenvalues as eigenfrequencies of oscillators.
More generally, one often needs to compute eigenvalues when solving (linear) differential
equations (cf., e.g., [Phi16c, Th. 4.32, Rem. 4.51]). In this and many other applications,
the occurrence of eigenvalues can be traced back to the fact that eigenvalues occur
when transforming a matrix into suitable normal forms, e.g. into Jordan normal form
(cf. [Phi19b, Th. 9.13]): To compute the Jordan normal form of a matrix, one first
needs to compute its eigenvalues. For another example, recall that the computation
of the spectral norm kAk2 of a matrix A ∈ M(m, K) requires the computation of the
absolute value of an eigenvalue of A with maximal size (cf. Th. 2.24).

Remark 7.2. Given that eigenvalues are precisely the roots of the characteristic poly-
nomial, and given that, as mentioned in Rem. 7.1(b), every monic polynomial of degree
at least 1 can occur as the characteristic polynomial of a matrix, it is not surprising
that computing eigenvalues is, in general, a difficult task. According to Galois theory
in Algebra, for a generic polynomial of degree at least 5, it is not possible to obtain
its zeros using so-called radicals (which are, roughly, zeros of polynomials of the form
X k − λ, k ∈ N, λ ∈ F , see, e.g., [Bos13, Def. 6.1.1] for a precise definition) in finitely
many steps (cf., e.g., [Bos13, Cor. 6.1.7]) and, in general, an approximation by iterative

numerical methods is warranted. Having said that, let us note that the problem of com-
puting eigenvalues is, indeed, typically easier than the general problem of computing
zeros of polynomials. This is due to the fact that the difficulty of computing the zeros
of a polynomial depends tremendously on the form in which the polynomial is given: It
is typically hard if the polynomial is expanded into the form Σ_{j=0}^{n} aj X^j, but it is easy
(trivial, in fact) if the polynomial is given in a factored form c Π_{j=1}^{n} (X − λj). If the
characteristic polynomial is given implicitly by a matrix, one is, in general, somewhere
between the two extremes. In particular, for a large matrix, it usually makes no sense to
compute the characteristic polynomial in its expanded form (this is an expensive task in
itself and, in the process, one even loses the additional structure given by the matrix). It
makes much more sense to use methods tailored to the computation of eigenvalues, and,
if available, one should make use of additional structure (such as symmetry) a matrix
might have.

7.2 Estimates, Localization


As discussed in Rem. 7.2, iterative numerical methods will usually need to be employed
to approximate the eigenvalues of a given matrix. Iterative methods (e.g. Newton’s
method of Sec. 6.2 above) will, in general, only converge if one starts sufficiently close
to the exact solution. The localization results of the present section can help in choosing
suitable initial points.

Definition 7.3. For each A = (akl) ∈ M(n, C), n ∈ N, the closed disks

    Gk := Gk(A) := { z ∈ C : |z − akk| ≤ Σ_{l=1, l≠k}^{n} |akl| },   k ∈ {1, . . . , n},   (7.1)

are called the Gershgorin disks of the matrix A.

Theorem 7.4 (Gershgorin Circle Theorem). For each A = (akl ) ∈ M(n, C), n ∈ N,
one has
    σ(A) ⊆ ⋃_{k=1}^{n} Gk(A),                                                             (7.2)

that means the spectrum of a square complex matrix A is always contained in the union
of the Gershgorin disks of A.

Proof. Let λ ∈ σ(A) and let x ∈ Cn \ {0} be a corresponding eigenvector. Moreover,


choose k ∈ {1, . . . , n} such that |xk | = kxk∞ . The kth component of the equation

Ax = λx reads

    λ xk = Σ_{l=1}^{n} akl xl

and implies

    |λ − akk| |xk| ≤ Σ_{l=1, l≠k}^{n} |akl| |xl| ≤ Σ_{l=1, l≠k}^{n} |akl| |xk|.

Dividing the estimate by |xk| ≠ 0 shows λ ∈ Gk and, thus, proves (7.2). 

Corollary 7.5. For each A = (akl ) ∈ M(n, C), n ∈ N, instead of using row sums to
define the Gershgorin disks of (7.1), one can use the column sums to define

    G′k := G′k(A) := { z ∈ C : |z − akk| ≤ Σ_{l=1, l≠k}^{n} |alk| },   k ∈ {1, . . . , n}.   (7.3)

Then one can strengthen the statement of Th. 7.4 to

    σ(A) ⊆ (⋃_{k=1}^{n} Gk(A)) ∩ (⋃_{k=1}^{n} G′k(A)).                                    (7.4)

Proof. Since σ(A) = σ(At ) and Gk (At ) = G′k (A), (7.4) is an immediate consequence of
(7.2). 
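
As a hedged illustration (NumPy assumed; the function name gershgorin is illustrative),
the centers and radii of the disks (7.1) and (7.3) are easily computed, here for the
matrix A of Ex. 7.6 below:

import numpy as np

def gershgorin(A):
    A = np.asarray(A)
    centers = np.diag(A)
    row_radii = np.sum(np.abs(A), axis=1) - np.abs(centers)   # radii for G_k
    col_radii = np.sum(np.abs(A), axis=0) - np.abs(centers)   # radii for G'_k
    return centers, row_radii, col_radii

A = np.array([[1.0, -1.0, 0.0], [2.0, -1.0, 0.0], [0.0, 1.0, 4.0]])
print(gershgorin(A))           # centers 1, -1, 4; row radii 1, 2, 1; column radii 2, 2, 0
print(np.linalg.eigvals(A))    # approximately -i, i, 4
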

Example 7.6. Consider the matrix

    A := ( 1  −1  0
           2  −1  0
           0   1  4 ).

Then the Gershgorin disks according to (7.1) and (7.3) are (cf. Fig. 7.1)

G1 = B 1 (1), G2 = B 2 (−1), G3 = B 1 (4),


G′1 = B 2 (1), G′2 = B 2 (−1), G′3 = {4}.

Here, the intersection of (7.4) is G1 ∪ G2 ∪ G′3 and, thus, noticeably smaller than either
G1 ∪ G2 ∪ G3 or G′1 ∪ G′2 ∪ G′3 (cf. Fig. 7.1). The spectrum is

σ(A) = {−i, i, 4} :

Figure 7.1: The Gershgorin disks of matrix A of Ex. 7.6 are depicted in the complex
plane together with the exact locations of A’s eigenvalues: The black dots are the
locations of the eigenvalues, the grey disks are the rowwise disks G1 , G2 , G3 , whereas
the columnwise disks G′1 , G′2 , G′3 are not shaded (only visible for G′1 , since G2 = G′2 and
G′3 = {4}).
 
Indeed, A is diagonalizable: with

    T = (1/34) ( 0   17 + 17i   17 − 17i          T^{−1} = (1/2) ( 2    3        17
                 0   34i        −34i                               2   −1 − i     0
                 4   −2 − 8i    −2 + 8i ),                         2   −1 + i     0 ),

one computes

    T^{−1} A T = ( 4    0    0
                   0   −i    0
                   0    0    i ).
Figure 7.1 depicts the Gershgorin disks of A as well as the exact location of A’s eigen-
values in the complex plane. Note that, while G3 contains precisely one eigenvalue, G1
contains none, and G2 contains 2.
Theorem 7.7. Let A = (akl ) ∈ M(n, C), n ∈ N. Suppose k1 , . . . , kn is an enumeration
of the numbers from 1 to n, and let q ∈ {1, . . . , n}. Moreover, suppose
D(1) = Gk1 ∪ · · · ∪ Gkq (7.5a)

is the union of q Gershgorin disks of A and

D(2) = Gkq+1 ∪ · · · ∪ Gkn (7.5b)



is the union of the remaining n − q Gershgorin disks of A. If D(1) and D(2) are disjoint,
then D(1) must contain precisely q eigenvalues of A and D(2) must contain precisely n−q
eigenvalues of A (counted according to their algebraic multiplicity).

Proof. Let D := diag(a11 , . . . , ann ) be the diagonal matrix that has precisely the same
diagonal elements as A. The idea of the proof is to connect A with D via a path (in
fact, a line segment) through M(n, C). Then we will show that the claim of the theorem
holds for D and must remain valid along the entire path. In consequence, it must also
hold for A. The path is defined by
    Φ : [0, 1] −→ M(n, C),   t ↦ Φ(t) = (φkl(t)),   φkl(t) := akl for k = l,   φkl(t) := t akl for k ≠ l,   (7.6)

i.e.

    Φ(t) = ( a11        t a12      ...          ...           t a1n
             t a21      a22        t a23        ...           t a2n
             ...        ...        ...          ...           ...
             t an−1,1   ...        t an−1,n−2   an−1,n−1      t an−1,n
             t an1      t an2      ...          t an,n−1      ann ).
Clearly,
Φ(0) = D = diag(a11 , . . . , ann ), Φ(1) = A.
We denote the corresponding Gershgorin disks as follows:

    Gk(t) := Gk(Φ(t)) = { z ∈ C : |z − akk| ≤ t Σ_{l=1, l≠k}^{n} |akl| }
                                   for each k ∈ {1, . . . , n} and each t ∈ [0, 1].       (7.7)

In consequence of (7.7), we have

    {akk} = Gk(0) ⊆ Gk(t) ⊆ Gk(1) = Gk(A) = Gk   for each k ∈ {1, . . . , n}, t ∈ [0, 1].   (7.8)

We now consider the set

S := {t ∈ [0, 1] : Φ(t) has precisely q eigenvalues in D(1) }. (7.9)

In (7.9) as well as in the rest of the proof, we always count all eigenvalues according
to their respective algebraic multiplicities. Due to (7.8), we know that precisely the q

eigenvalues ak1 k1 , . . . , akq ,kq of D are in D(1) . In particular, 0 ∈ S and S 6= ∅. The proof
is done once we show
τ := sup S = 1 (7.10a)
and
τ ∈S (i.e. τ = max S). (7.10b)
As D(1) and D(2) are compact and disjoint, they have a positive distance,

ǫ := dist(D(1) , D(2) ) > 0. (7.11)

Due to the continuous dependence of the eigenvalues on the coefficients of the matrix
(cf. Cor. J.3 in the Appendix), there exists δ > 0 such that, for each s, t ∈ [0, 1]
with |s − t| < δ, there exist enumerations λ1 (s), . . . , λn (s) and λ1 (t), . . . , λn (t) of the
eigenvalues of Φ(s) and Φ(t), respectively, satisfying

    |λk(s) − λk(t)| < ǫ   for each k ∈ {1, . . . , n}.                                    (7.12)

Combining (7.12) with (7.11) yields

    (s ∈ S and |s − t| < δ)  ⇒  t ∈ S   for each s, t ∈ [0, 1].                           (7.13)

Clearly, (7.13) implies both (7.10a) (since, otherwise, there would exist τ < t < 1 with |τ − t| <
δ) and (7.10b) (since there are 0 < t < τ, t ∈ S, with |τ − t| < δ), thereby completing the
proof. 

Remark 7.8. If A ∈ M(n, R), n ∈ N, then Th. 7.7 can sometimes help to determine
which regions can contain nonreal eigenvalues: For real matrices all Gershgorin disks
are centered on the real line. Moreover, if A is real, then λ ∈ σ(A) implies λ̄ ∈ σ(A),
and λ and λ̄ must lie in the same Gershgorin disks (as they have the same distance to the
real line). Thus, if λ is not real, then it cannot lie in any Gershgorin disk known to
contain just one eigenvalue.

Example 7.9. For the matrix A of Example 7.6, we see that the sets G1 ∪ G2 =
B 1 (1) ∪ B 2 (−1) and G3 = B 1 (4) are disjoint (cf. Fig. 7.1). Thus, Th. 7.7 implies that
G3 contains precisely one eigenvalue and G1 ∪ G2 contains precisely two eigenvalues.
Combining this information with Rem. 7.8 shows that G3 must contain a single real
eigenvalue, whereas G1 ∪ G2 could contain (as, indeed, it does) two complex conjugate nonreal
eigenvalues.

7.3 Power Method


Given an n × n matrix A ∈ M(n, C) and an initial vector z0 ∈ Cn , the power method
(also known as power iteration or von Mises iteration) means applying increasingly
higher powers of A to z0 according to the definition
zk := A zk−1 for each k ∈ N (i.e. zk = Ak z0 for each k ∈ N0 ). (7.14)
We will see below that this simple iteration can, in many situations and with suitable
modifications, be useful to approximate eigenvalues and eigenvectors (cf. Th. 7.12 and
Cor. 7.15 below). Before we can prove the main results, we need some preparation:
Proposition 7.10. The following holds true for each A ∈ M(n, C), n ∈ N:
lim Ak = 0 ⇔ r(A) < 1, (7.15)
k→∞

where r(A) denotes the spectral radius of A (cf. Def. 2.23).

Proof. First, suppose limk→∞ Ak = 0. Let λ ∈ σ(A) and choose a corresponding eigen-
vector v ∈ Cn \ {0}. Then, for each norm k · k on Cn with induced matrix norm on
M(n, C),
    ‖A^k v‖ = ‖λ^k v‖ = |λ|^k ‖v‖ ≤ ‖A^k‖ ‖v‖   for each k ∈ N,

implying (since lim_{k→∞} ‖A^k‖ = 0 and ‖v‖ > 0)

    lim_{k→∞} |λ|^k = 0,

which, in turn, implies |λ| < 1. Conversely, suppose r(A) < 1. If B, T ∈ M(n, C) and
T is invertible with A = T BT −1 , then an obvious induction shows
∀ Ak = T B k T −1 . (7.16)
k∈N0

According to [Phi19b, Th. 9.13], we can choose T such that (7.16) holds with B in
Jordan normal form, i.e. with B in block diagonal form

    B = ( B1       0
              ⋱
          0        Br ),                                                                  (7.17)

1 ≤ r ≤ n, where

    Bj = (λj)   or   Bj = ( λj  1   0  ...  0
                                λj  1
                                    ⋱   ⋱
                                        λj  1
                            0               λj ),

λj ∈ σ(A), Bj ∈ M(nj, C), nj ∈ {1, . . . , n} for each j ∈ {1, . . . , r}, and Σ_{j=1}^{r} nj = n.
Using blockwise matrix multiplication (cf. [Phi19a, Sec. 7.5]) in (7.17) yields

    B^k = ( B1^k       0
                  ⋱
            0          Br^k )   for each k ∈ N0.                                          (7.18)

We now show
    lim_{k→∞} Bj^k = 0   for each j ∈ {1, . . . , r}.                                     (7.19)

To this end, we write Bj = λj Id + Nj, where Nj ∈ M(nj, C) is a canonical nilpotent
matrix,

    Nj = 0 (zero matrix)   or   Nj = ( 0  1  0  ...  0
                                          0  1
                                             ⋱  ⋱
                                                0  1
                                       0           0 ),
and apply the binomial theorem to obtain

    Bj^k = Σ_{α=0}^{k} (k choose α) λj^{k−α} Nj^α
         = Σ_{α=0}^{min{k, nj−1}} (k choose α) λj^{k−α} Nj^α   for each k ∈ N             (7.20)

(using Nj^{nj} = 0 in the second equality).

Using

    (k choose α) = k! / (α! (k − α)!) ≤ k(k − 1) · · · (k − α + 1) ≤ k^α
                                   for each k ∈ N and each α ∈ {0, . . . , k},

and an arbitrary induced matrix norm on M(n, C), (7.20) implies


    ‖Bj^k‖ ≤ Σ_{α=0}^{nj−1} k^α |λj|^{k−α} ‖Nj‖^α → 0   for k → ∞   (since |λj| < 1),

proving (7.19). Next, we conclude lim_{k→∞} B^k = 0 from (7.18). Thus, as
‖A^k‖ ≤ ‖T‖ ‖B^k‖ ‖T^{−1}‖, the claimed lim_{k→∞} A^k = 0 also follows and completes the proof. 

Definition 7.11. Let A ∈ M(n, C), z ∈ Cn , n ∈ N. Recalling from [Phi19b, Th.


9.13(a)] that Cⁿ is the direct sum of the generalized eigenspaces of A,

    Cⁿ = ⊕_{λ∈σ(A)} M(λ),   M(λ) := M(λ, A) := ker(A − λ Id)^{ma(λ)} for each λ ∈ σ(A),   (7.21)

we say that z has a nontrivial part in M(µ), µ ∈ σ(A), if, and only if, in the unique
decomposition of z with respect to the direct sum (7.21),

    z = Σ_{λ∈σ(A)} zλ   with zµ ≠ 0,   zλ ∈ M(λ) for each λ ∈ σ(A).                       (7.22)

Theorem 7.12. Let A ∈ M(n, C), n ∈ N. Suppose that A has an eigenvalue that is
larger in size than all other eigenvalues of A, i.e.
    ∃ λ ∈ σ(A)  ∀ µ ∈ σ(A) \ {λ} :  |λ| > |µ|.                                            (7.23)

Moreover, assume λ to be semisimple (i.e. A↾M (λ) is diagonalizable and all Jordan blocks
with respect to λ are trivial). Choosing an arbitrary norm k · k on Cn and starting with
z0 ∈ Cn , define the sequence (zk )k∈N0 via the power iteration (7.14). From the zk also
define sequences (wk)k∈N0 and (αk)k∈N0 by

    wk := zk / ‖zk‖ for zk ≠ 0,   wk := z0 for zk = 0,   wk ∈ Cⁿ for each k ∈ N0,         (7.24a)

    αk := (zk* A zk) / (zk* zk) for zk ≠ 0,   αk := 0 for zk = 0,   αk ∈ C for each k ∈ N0   (7.24b)

(treating the zk as column vectors). If z0 has a nontrivial part in M (λ) in the sense of
Def. 7.11, then
lim αk = λ (7.25a)
k→∞

and there exist sequences (φk )k∈N in R and (rk )k∈N in Cn such that, for all k sufficiently
large,
    wk = e^{iφk} (vλ + rk) / ‖vλ + rk‖,   ‖vλ + rk‖ ≠ 0,   lim_{k→∞} rk = 0,              (7.25b)

where vλ is an eigenvector corresponding to the eigenvalue λ.

Proof. If λ = 0, then condition (7.23) together with λ being semisimple implies A = 0


and there is nothing to prove. Thus, we proceed under the assumption λ 6= 0. As in the
proof of Prop. 7.10, we choose T ∈ M(n, C) invertible such that (7.16) holds with B
in Jordan normal form. However, this time, we choose T such that the (trivial) Jordan
blocks corresponding to λ come first, i.e.

    B = ( B1          0
              B2
                  ⋱
          0          Br ),   B1 = λ Id ∈ M(n1, C),                                        (7.26)

Bj ∈ M(nj, C) is a Jordan matrix to λj ∈ σ(A) \ {λ} for each j = 2, . . . , r, r ∈
{1, . . . , n}, n1, . . . , nr ∈ N, where Σ_{j=1}^{r} nj = n. Since T is invertible, the columns
t1 , . . . , tn of T form a basis of Cn . For each j ∈ {1, . . . , n1 }, we compute
Atj = T BT −1 tj = T Bej = T λej = λtj , (7.27)
showing tj to be an eigenvector corresponding to λ. Thus, t1 , . . . , tn1 form a basis of
M (λ) = ker(A − λ Id) (here, the eigenspace is the same as the generalized eigenspace),
and, if z0 has a nontrivial part in M (λ), then there exist ζ1 , . . . , ζn ∈ C with at least
one ζj ≠ 0, j ∈ {1, . . . , n1}, such that

    z0 = vλ + Σ_{j=n1+1}^{n} ζj tj,   vλ := Σ_{j=1}^{n1} ζj tj.

We note vλ to be an eigenvector corresponding to λ. Next, we compute, for each k ∈ N0,

    A^k z0 = T B^k T^{−1} (vλ + Σ_{j=n1+1}^{n} ζj tj) = λ^k vλ + T B^k (Σ_{j=n1+1}^{n} ζj ej)
           = λ^k (vλ + T ((1/λ^k) B^k (Σ_{j=n1+1}^{n} ζj ej))),                           (7.28)

where, in the last step, we have used λ ≠ 0. In consequence, letting, for each k ∈ N0,

    rk := T ((1/λ^k) B^k (Σ_{j=n1+1}^{n} ζj ej)),                                         (7.29)

we obtain, for each k such that vλ + rk ≠ 0,

    wk = A^k z0 / ‖A^k z0‖ = (λ^k / |λ|^k) · (vλ + rk) / ‖vλ + rk‖.                       (7.30)

Let βk := λ^k / |λ|^k. Then |βk| = 1 and, thus, there is φk ∈ [0, 2π[ such that βk = e^{iφk}. To
prove (7.25b), it now only remains to show lim_{k→∞} rk = 0. However, the hypothesis
(7.23) implies

    r((1/λ) Bj) = |λj| / |λ| < 1   for each j ∈ {2, . . . , r}

and Prop. 7.10, in turn, implies lim_{k→∞} (1/λ^k) Bj^k = 0. Using this in (7.29), where
(1/λ^k) B^k is applied to a vector in span{e_{n1+1}, . . . , en}, yields

    lim_{k→∞} rk = T (lim_{k→∞} (1/λ^k) B^k (Σ_{j=n1+1}^{n} ζj ej)) = 0,                  (7.31)

finishing the proof of (7.25b). Finally, it is

    lim_{k→∞} αk = lim_{k→∞} (zk* A zk)/(zk* zk) = lim_{k→∞} (wk* A wk)/(wk* wk)
                 = lim_{k→∞} ((vλ* + rk*) A (vλ + rk)) / ((vλ* + rk*)(vλ + rk)) = (λ vλ* vλ)/(vλ* vλ) = λ,

establishing (7.25a) and completing the proof. 

Caveat 7.13. It is emphasized that, due to the presence of the factor eiφk , (7.25b) does
not claim the convergence of the wk to an eigenvector of λ. Indeed, in general, the
sequence (wk)k∈N cannot be expected to converge. However, as vλ is an eigenvector for
λ, so is every (e^{iφk} / ‖vλ + rk‖) vλ (and every other nonzero scalar multiple). Thus, what (7.25b) does claim
is that, for sufficiently large k, wk is arbitrarily close to an eigenvector of λ, namely to
the eigenvector e^{iφk} vλ / ‖vλ‖ (the eigenvector that wk is close to still depends on k).

Remark 7.14. (a) The hypothesis of Th. 7.12, that z0 have a nontrivial part in M (λ),
can be difficult to check in practice. Still, it is usually not a severe restriction: Even
if, initially, z0 does not satisfy the condition, roundoff errors during the computation
of the zk = Ak z0 will cause the condition to be satisfied for sufficiently large k.

(b) The rate of convergence of the rk in Th. 7.12 (and, thus, of the αk as well) is
determined by the size of |µ|/|λ|, where µ is a second largest eigenvalue (in terms
of absolute value). Therefore, the convergence will be slow if |µ| and |λ| are close.
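
A hedged Python sketch of the power method (NumPy assumed; the renormalization in each
step is a standard practical modification that avoids over- and underflow and does not
change the quantities wk and αk of (7.24)):

import numpy as np

def power_method(A, z0, steps=100):
    z = np.asarray(z0, dtype=complex)
    for _ in range(steps):
        z = A @ z                                    # power iteration (7.14)
        z = z / np.linalg.norm(z)                    # renormalize (same direction as w_k)
    alpha = np.vdot(z, A @ z) / np.vdot(z, z)        # Rayleigh quotient, cf. (7.24b)
    return alpha, z

A = np.array([[1.0, -1.0, 0.0], [2.0, -1.0, 0.0], [0.0, 1.0, 4.0]])
print(power_method(A, np.ones(3)))                   # dominant eigenvalue of Ex. 7.6 is 4
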

Even though the power method applied according to Th. 7.12 will (under the hypothesis
of the theorem) only approximate the dominant eigenvalue (i.e. the largest in terms of
absolute value), the following modification, known as the inverse power method, can be
used to approximate other eigenvalues:

Corollary 7.15. Let A ∈ M(n, C), n ∈ N, and assume λ ∈ σ(A) to be semisimple. Let
p ∈ C \ {λ} be closer to λ than to any other eigenvalue of A, i.e.

∀ 0 < |λ − p| < |µ − p|. (7.32)


µ∈σ(A)\{λ}

Set
G := (A − p Id)−1 . (7.33)
Choosing an arbitrary norm k · k on Cn and starting with z0 ∈ Cn \ {0}, define the
sequence (zk )k∈N0 via the power iteration (7.14) with A replaced by G. From the zk also

define sequences (wk)k∈N0 and (αk)k∈N0 by

    wk := zk / ‖zk‖ = G^k z0 / ‖G^k z0‖ ∈ Cⁿ   for each k ∈ N0,                           (7.34a)

    αk := (zk* zk) / (zk* G zk) + p for zk* G zk ≠ 0,   αk := 0 otherwise,   αk ∈ C for each k ∈ N0   (7.34b)

(treating the zk as column vectors). If z0 has a nontrivial part in M ((λ − p)−1 , G) in


the sense of Def. 7.11, then
lim αk = λ (7.35a)
k→∞

and there exist sequences (φk )k∈N in R and (rk )k∈N in Cn such that, for all k sufficiently
large,
    wk = e^{iφk} (vλ + rk) / ‖vλ + rk‖,   ‖vλ + rk‖ ≠ 0,   lim_{k→∞} rk = 0,              (7.35b)

where vλ is an eigenvector corresponding to the eigenvalue λ.

Proof. First note that (7.32) implies p ∉ σ(A), such that G is well-defined. As G is
invertible, z0 ≠ 0 implies G^k z0 ≠ 0 for each k ∈ N0, showing the wk to be well-defined
as well. The idea is to apply Th. 7.12 to G. Let µ ∈ C \ {p} and v ∈ Cⁿ \ {0}. Then

    Av = µv
    ⇔ (µ − p) v = (A − p Id) v
    ⇔ (A − p Id)^{−1} v = (µ − p)^{−1} v,                                                 (7.36)

showing µ to be an eigenvalue for A with corresponding eigenvector v if, and only if,
(µ − p)^{−1} is an eigenvalue for G = (A − p Id)^{−1} with the same corresponding eigenvector
v. In particular, (7.32) implies 1/(λ − p) to be the dominant eigenvalue of G. Next, notice
that the hypothesis that λ be semisimple for A implies M(λ, A) to have a basis of
eigenvectors v1, . . . , vn1, n1 = ma(λ) (cf. (7.21)), which, by (7.36), then constitutes a
basis of eigenvectors for M(1/(λ − p), G) as well, showing 1/(λ − p) to be semisimple for G. Thus,
all hypotheses of Th. 7.12 are satisfied and we obtain lim_{k→∞} αk = (λ − p) + p = λ,
proving (7.35a). Moreover, (7.35b) also follows, where vλ is an eigenvector for 1/(λ − p) and
G. However, then vλ is also an eigenvector for λ and A by (7.36). 
Remark 7.16. In practice, one computes the zk for the inverse power method of Cor.
7.15 by solving the linear systems (A−p Id)zk = zk−1 for zk . As the matrix A−p Id does
not vary, but only the right-hand side of the system varies, one should solve the systems
by first computing a suitable decomposition of A − p Id (e.g. a QR decomposition, cf.
Rem. 5.15).
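
A hedged Python sketch of the inverse power method, following Rem. 7.16 (SciPy's
lu_factor/lu_solve are assumed here; the notes mention a QR decomposition, and an LU
decomposition is used instead merely for brevity, the point being that the factorization
of A − p Id is computed only once):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def inverse_power_method(A, p, z0, steps=50):
    n = A.shape[0]
    factorization = lu_factor((A - p * np.eye(n)).astype(complex))
    z = np.asarray(z0, dtype=complex)
    for _ in range(steps):
        z = lu_solve(factorization, z)       # solve (A - p Id) z_k = z_{k-1}
        z = z / np.linalg.norm(z)
    alpha = np.vdot(z, z) / np.vdot(z, lu_solve(factorization, z)) + p   # cf. (7.34b)
    return alpha, z

A = np.array([[1.0, -1.0, 0.0], [2.0, -1.0, 0.0], [0.0, 1.0, 4.0]])
print(inverse_power_method(A, 0.5 + 0.9j, np.ones(3)))   # approximates the eigenvalue i
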

Remark 7.17. In sufficiently benign situations (say, if A is diagonalizable and all


eigenvalues are in R₀⁺, e.g., if A is Hermitian), one can combine the inverse power method
(IPM) of Cor. 7.15 with a bisection (or similar) method to obtain all eigenvalues of A:
One starts by determining upper and lower bounds for σ(A) ⊆ [m, M ] (for example,
by using Cor. 7.5). One then chooses p := (m + M )/2 as the midpoint and determines
λ = λ(p). Next, one uses the midpoints p1 := (m + p)/2 and p2 := (p + M )/2 and
iterates the method, always choosing the midpoints of previous intervals. For each new
midpoint p, four possibilities can occur: (i) one finds a new eigenvalue by IPM , (ii) IPM
finds an eigenvalue that has already been found before, (iii) IPM fails, since p itself is an
eigenvalue, (iv) IPM fails since p is exactly in the middle between two other eigenvalues.
Cases (iii) and (iv) are unlikely, but if they do occur, they need to be recognized and
treated separately. If not all eigenvalues are real, one can still use an analogous method,
one just has to cover a certain compact section of the complex plane by finer and finer
grids. If one were to use squares for the grids, one could use the centers of the squares
to replace the midpoints of the intervals in the described procedure.

7.4 QR Method
The QR algorithm for the approximation of the eigenvalues of A ∈ M(n, K) makes use
of the QR decomposition A = QR of A into a unitary matrix Q and an upper triangular
matrix R (cf. Def. 5.12(a)). We already know from Th. 5.24 that Q and R can be
obtained via the Householder method of Def. 5.23.
Algorithm 7.18 (QR Method). Given A ∈ M(n, K), n ∈ N, we define the following
algorithm for the computation of a sequence (A(k) )k∈N of matrices in M(n, K): Let
A(1) := A. For each k ∈ N, let

A(k+1) := Rk Qk , (7.37a)

where

A(k) = Qk Rk (7.37b)

is a QR decomposition of A(k) .
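
A hedged Python sketch of Alg. 7.18 (NumPy assumed; the function name qr_method is
illustrative): each step computes a QR decomposition of A^(k) and multiplies the factors
in reverse order.

import numpy as np

def qr_method(A, steps=200):
    Ak = np.array(A, dtype=float)
    for _ in range(steps):
        Q, R = np.linalg.qr(Ak)      # A^(k) = Q_k R_k, cf. (7.37b)
        Ak = R @ Q                   # A^(k+1) = R_k Q_k, cf. (7.37a)
    return Ak                        # asymptotically upper triangular under Th. 7.22

A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 1.0]])
print(np.diag(qr_method(A)))         # approximations of the eigenvalues of A
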
Remark 7.19. In each step of the QR method of Alg. 7.18, one needs to obtain a QR
decomposition of A(k) . For a fully populated matrix A(k) , the complexity of obtaining a
QR decomposition is O(n3 ) (obtaining a QR decomposition via the Householder method
of Def. 5.23 requires O(n) matrix multiplications, each matrix multiplication needing
O(n2 ) multiplications). The complexity can be improved to O(n2 ) by first transforming
A into so-called Hessenberg form (the complexity of the transformation is O(n3 ), cf.

Sections J.2 and J.3 of the Appendix). Thus, in practice, it is always advisable to
transform A into Hessenberg form before applying the QR method.
Lemma 7.20. Let A ∈ M(n, K), n ∈ N. Moreover, for each k ∈ N, let matrices
A(k) , Rk , Qk ∈ M(n, K) be defined according to Alg. 7.18. Then the following identities
hold for each k ∈ N:
A(k+1) = Q∗k A(k) Qk , (7.38a)
A(k+1) = (Q1 Q2 · · · Qk )∗ A (Q1 Q2 · · · Qk ), (7.38b)
Ak = (Q1 Q2 · · · Qk )(Rk Rk−1 · · · R1 ). (7.38c)

Proof. (7.38a) holds, as


    Qk* A^(k) Qk = Qk* Qk Rk Qk = Rk Qk = A^(k+1)   for each k ∈ N.

A simple induction applied to (7.38a) yields (7.38b). Another induction yields (7.38c):
For k = 1, it is merely (7.37b). For k ∈ N, we compute

    Q1 · · · Qk (Qk+1 Rk+1) Rk · · · R1 = (Q1 · · · Qk) A^(k+1) (Rk · · · R1)
        = (Q1 · · · Qk)(Q1 · · · Qk)* A (Q1 · · · Qk)(Rk · · · R1)   (by (7.38b))
        = A A^k = A^{k+1}   (by unitarity and the induction hypothesis),
which completes the induction and establishes the case. 

In the following Th. 7.22, it will be shown that, under suitable hypotheses, the matrices
A(k) of the QR method become, asymptotically, upper triangular, with the diagonal
entries converging to the eigenvalues of A. Theorem 7.22 is not the most general con-
vergence result of this kind. We will remark on possible extensions afterwards and we
will also indicate that some of the hypotheses are not as restrictive as they might seem
at first glance.
Notation 7.21. Let n ∈ N. By diag(λ1 , . . . , λn ) we denote the diagonal n × n matrix
with diagonal entries λ1, . . . , λn, i.e.

    diag(λ1, . . . , λn) := D = (dkl),   dkl := λk for k = l,   dkl := 0 for k ≠ l

(we have actually used this notation before). If A = (akl ) is an n × n matrix, then,
by diag(A), we denote the diagonal n × n matrix that has precisely the same diagonal
entries as A, i.e.

    diag(A) := D = (dkl),   dkl := akk for k = l,   dkl := 0 for k ≠ l.

Theorem 7.22. Let A ∈ M(n, K), n ∈ N. Assume A to be diagonalizable, invertible,


and such that all eigenvalues λ1 , . . . , λn ∈ σ(A) ⊆ K are distinct, satisfying

|λ1 | > |λ2 | > · · · > |λn | > 0. (7.39a)

Let T ∈ M(n, K) be invertible and such that

A = T DT −1 , D := diag(λ1 , . . . , λn ). (7.39b)

Assume T −1 to have an LU decomposition (without permutations), i.e., assume

T −1 = LU, (7.39c)

where L, U ∈ M(n, K), L being unipotent and lower triangular, U being upper trian-
gular. If (A(k) )k∈N is defined according to Alg. 7.18, then the A(k) are, asymptotically,
upper triangular, with the diagonal entries converging to the eigenvalues of A, i.e.
(k)
∀ lim ajl = 0 (7.40a)
j,l∈{1,...,n}, k→∞
j>l

and
(k)
∀ lim ajj = λj . (7.40b)
j∈{1,...,n} k→∞

Moreover, there exists a constant C > 0, such that we have the error estimates

    |a^{(k)}_{jl}| ≤ C q^k  (j > l)   and   |a^{(k)}_{jj} − λj| ≤ C q^k  (j ∈ {1, . . . , n})
                                                                  for each k ∈ N,        (7.40c)

where
    q := max{ |λ_{j+1}| / |λ_j| : j ∈ {1, . . . , n − 1} } < 1.                           (7.40d)

Proof. For n = 1, there is nothing to prove (one has A = D = A(k) for each k ∈ N).
Thus, we assume n > 1 for the rest of the proof. We start by making a number of
observations for k ∈ N being fixed. From (7.38c), we obtain a QR decomposition of Ak :

Ak = Q̃k R̃k , Q̃k := Q1 · · · Qk , R̃k := Rk · · · R1 . (7.41)

On the other hand, we also have

Ak = T Dk LU = Xk Dk U, Xk := T Dk LD−k . (7.42)

We note Xk to be invertible, as it is the product of invertible matrices. Thus Xk has a


QR decomposition Xk = QXk RXk with invertible upper triangular matrix RXk . Plugging

the QR decomposition of Xk into (7.42), thus, yields a second QR decomposition of Ak ,


namely
Ak = QXk (RXk Dk U ).
Due to Prop. 5.13, we now obtain a unitary diagonal matrix Sk such that

Q̃k = QXk Sk , R̃k = Sk∗ RXk Dk U. (7.43)

This will now allow us to compute a somewhat complicated-looking representation of


A(k) , which, nonetheless, will turn out to be useful in advancing the proof: We have, for
k ≥ 2,

    Qk = (Q1 · · · Qk−1)^{−1} (Q1 · · · Qk−1 Qk) = Q̃_{k−1}^{−1} Q̃k = S_{k−1}^* Q_{Xk−1}^{−1} Q_{Xk} Sk,

    Rk = (Rk Rk−1 · · · R1)(Rk−1 · · · R1)^{−1} = R̃k R̃_{k−1}^{−1}
       = Sk^* R_{Xk} D^k U U^{−1} D^{−(k−1)} R_{Xk−1}^{−1} S_{k−1}
       = Sk^* R_{Xk} D R_{Xk−1}^{−1} S_{k−1},

implying

    A^(k) = Qk Rk = S_{k−1}^* Q_{Xk−1}^{−1} Q_{Xk} R_{Xk} D R_{Xk−1}^{−1} S_{k−1}   (by (7.37b))
          = S_{k−1}^* R_{Xk−1} R_{Xk−1}^{−1} Q_{Xk−1}^{−1} Q_{Xk} R_{Xk} D R_{Xk−1}^{−1} S_{k−1}
          = S_{k−1}^* R_{Xk−1} X_{k−1}^{−1} Xk D R_{Xk−1}^{−1} S_{k−1}.                   (7.44)

The next step is to investigate the entries of

Xk = T L(k) , L(k) := Dk LD−k . (7.45)

As L is lower triangular with ljj = 1, we have

    l^{(k)}_{jl} = λ_j^k l_{jl} λ_l^{−k} =  0 for j < l,   1 for j = l,   l_{jl} (λ_j/λ_l)^k for j > l,

and

    |l^{(k)}_{jl}| = 0 for j < l,   |l^{(k)}_{jl}| = 1 for j = l,   |l^{(k)}_{jl}| ≤ CL q^k for j > l,

where q is as in (7.40d) and

    CL := max{|l_{jl}| : l, j ∈ {1, . . . , n}}.



Thus, there is a lower triangular matrix Ek and a constant CE ∈ R+ such that


    L^(k) = Id + Ek,   ‖Ek‖2 ≤ CE CL q^k.                                                 (7.46)
Then
    Xk = T L^(k) = T + T Ek.                                                              (7.47)
Set, for k ≥ 2,
    Fk := (Ek − Ek−1)(Id + Ek−1)^{−1}.                                                    (7.48)

As lim_{k→∞} Ek = 0, the sequence (‖(Id + Ek−1)^{−1}‖2)_{k∈N} is bounded, i.e. there is a con-
stant CF > 0 such that, for each k ≥ 2,

    ‖Fk‖2 ≤ ‖Ek − Ek−1‖2 ‖(Id + Ek−1)^{−1}‖2 ≤ CE CL (q^k + q^{k−1}) CF ≤ C1 q^k,
                                                  C1 := 2 CE CL CF q^{−1} (note q > 0).
Now (7.48) is equivalent to
    Id + Ek = (Id + Ek−1)(Id + Fk),
implying
    X_{k−1}^{−1} Xk = (T + T Ek−1)^{−1} (T + T Ek) = (Id + Ek−1)^{−1} (Id + Ek) = Id + Fk.
Plugging the last equality into (7.44) yields

    A^(k) = S_{k−1}^* R_{Xk−1} (Id + Fk) D R_{Xk−1}^{−1} S_{k−1}
          = S_{k−1}^* R_{Xk−1} D R_{Xk−1}^{−1} S_{k−1} + S_{k−1}^* R_{Xk−1} Fk D R_{Xk−1}^{−1} S_{k−1}
          =: A_D^{(k)} + A_F^{(k)}.                                                       (7.49)

To estimate ‖A_F^{(k)}‖2, we note

    ‖R_{Xk}‖2 = ‖Q_{Xk}^* Xk‖2 = ‖Xk‖2,   ‖R_{Xk}^{−1}‖2 = ‖Xk^{−1} Q_{Xk}‖2 = ‖Xk^{−1}‖2,

since Q_{Xk} is unitary and, thus, isometric. Due to (7.46) and (7.47), lim_{k→∞} Xk = T,
such that the sequences (‖R_{Xk}‖2)_{k∈N} and (‖R_{Xk}^{−1}‖2)_{k∈N} are bounded by some constant
CR > 0. Thus, as the Sk are also unitary, for each k ≥ 2,

    ‖A_F^{(k)}‖2 = ‖S_{k−1}^* R_{Xk−1} Fk D R_{Xk−1}^{−1} S_{k−1}‖2 ≤ CR² ‖D‖2 C1 q^k.    (7.50)
Next, we observe each A_D^{(k)} to be an upper triangular matrix, as it is the product of
upper triangular matrices. Moreover, each Sk is a diagonal matrix and (diag(R_{Xk}))^{−1} =
diag(R_{Xk}^{−1}). Thus, for each k ≥ 2,

    diag(A_D^{(k)}) = S_{k−1}^* diag(R_{Xk−1}) D diag(R_{Xk−1}^{−1}) S_{k−1} = D.         (7.51)

Finally, as we know the 2-norm on M(n, K) to be equivalent to the max-norm on K^{n²},
combining (7.49) – (7.51) proves everything that was claimed in (7.40). 

Remark 7.23. (a) Hypothesis (7.39c) of Th. 7.22 that T −1 has an LU decomposition
without permutations is not as severe as it may seem. If permutations are necessary
in the LU decomposition of T −1 , then Th. 7.22 and its proof remain the same,
except that the eigenvalues will occur permuted in the limiting diagonal matrix D.
Moreover, it is an exercise to show that, if the matrix A in Th. 7.22 has Hessenberg
form with 0 ∉ {a21, . . . , a_{n,n−1}} (in view of Rem. 7.19, this is the case of most
interest), then hypothesis (7.39c) can always be satisfied: As an intermediate step,
one can show that, in this case,

∀ span{e1 , . . . , ej } ∩ span{tj+1 , . . . , tn } = {0},


j∈{1,...,n−1}

where e1 , . . . , en−1 are the standard unit vectors, and t1 , . . . , tn are the eigenvectors
of A that form the columns of T .

(b) If A has several eigenvalues of equal modulus (e.g. multiple eigenvalues or nonreal
eigenvalues of a real matrix), then the sequence of the QR method can no longer
be expected to be asymptotically upper triangular, only block upper triangular. For
example, if |λ1| = |λ2| > |λ3| = |λ4| = |λ5|, then the asymptotic form will look like

    ( ∗ ∗ ∗ ∗ ∗
      ∗ ∗ ∗ ∗ ∗
          ∗ ∗ ∗
          ∗ ∗ ∗
          ∗ ∗ ∗ ),

where the upper diagonal 2 × 2 block still has eigenvalues λ1 , λ2 and the lower diag-
onal 3 × 3 block still has eigenvalues λ3 , λ4 , λ5 (see [Cia89, Sec. 6.3] and references
therein).

Remark 7.24. (a) The efficiency of the QR method can be improved significantly by
the introduction of so-called spectral shifts. One replaces (7.37) by

A(k+1) := Rk Qk + µk Id, (7.52a)

where

A(k) − µk Id = Qk Rk (7.52b)

is a QR decomposition of A(k) − µk Id and µk ∈ C is the shift parameter of the kth


step. As it turns out, using the lowest diagonal entry of the previous iteration as
shift parameter, µk := a^{(k)}_{nn}, is a simple choice that yields good results (see [SK11,
Sec. 5.5.2]). According to [HB09, Sec. 27.3], an even better choice, albeit less simple,
is to choose µk as the eigenvalue of

    ( a^{(k)}_{n−1,n−1}    a^{(k)}_{n−1,n}
      a^{(k)}_{n,n−1}      a^{(k)}_{nn}    )
that is closest to a^{(k)}_{nn}. If one needs to find nonreal eigenvalues of a real matrix, the
related strategy of so-called double shifts is warranted (see [SK11, Sec. 5.5.3]).
(b) The above-described QR method with shifts is particularly successful if applied to
Hessenberg matrices in combination with a strategy known as deflation. After each
step, each element of the first subdiagonal is tested to see if it is sufficiently close to 0 to be
treated as “already converged”, i.e. to be replaced by 0. Then, if a^{(k)}_{j,j−1}, 2 ≤ j ≤ n,
has been replaced by 0, the QR method is applied separately to the upper diagonal
(j − 1) × (j − 1) block and the lower diagonal (n − j + 1) × (n − j + 1) block. If
j = 2 or j = n (this is what usually happens if the shift strategy as described in (a)
is applied), then a^{(k)}_{11} = λ1 or a^{(k)}_{nn} = λn is already correct and one has to continue
with an (n − 1) × (n − 1) matrix.
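
A hedged Python sketch of the QR method with the simple shift µk := a^{(k)}_{nn} of (a)
(NumPy assumed; deflation as described in (b) is omitted here for brevity):

import numpy as np

def qr_method_shifted(A, steps=100):
    Ak = np.array(A, dtype=float)
    n = Ak.shape[0]
    I = np.eye(n)
    for _ in range(steps):
        mu = Ak[-1, -1]                      # shift parameter mu_k = a_nn^(k)
        Q, R = np.linalg.qr(Ak - mu * I)     # QR decomposition of A^(k) - mu_k Id, cf. (7.52b)
        Ak = R @ Q + mu * I                  # A^(k+1) = R_k Q_k + mu_k Id, cf. (7.52a)
    return Ak

A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 1.0]])
print(np.diag(qr_method_shifted(A)))
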

8 Minimization

8.1 Motivation, Setting


Minimization (more generally, optimization) is a vast field, and we can only cover a few
selected aspects in this class. We will consider minimization problems of the form
    min_{a∈A} f(a),   f : A −→ R.                                                         (8.1)
The function f to be minimized is often called the objective function or the objective
functional of the problem, even though this seems to be particularly common in the
context of constrained minimization (see below).
In Sec. 6.1, we discussed the tasks of finding zeros and fixed points of functions, and we
found that, in most situations, both problems are equivalent. Minimization provides a
third perspective onto this problem class: If F : A −→ Y , where (Y, k · k) is a normed
vector space, and α > 0, then
    F(x) = 0 ⇔ ‖F(x)‖^α = 0 = min{‖F(a)‖^α : a ∈ A}.

More generally, given y ∈ Y, we have the equivalence

    F(x) = y ⇔ ‖F(x) − y‖^α = 0 = min{‖F(a) − y‖^α : a ∈ A}.                              (8.2)

However, minimization is more general: If we seek the (global) min of the function
f : A −→ R, f(a) := ‖F(a) − y‖^α, then it will yield a solution to F(x) = y, provided
such a solution exists. On the other hand, f can have a minimum even if F (x) = y
does not have a solution. If Y = Rn , n ∈ N, with the 2-norm, and α = 2, then the
minimization of f is known as a least squares problem. The least squares problem is
called linear if A is a normed vector space over R and F : A −→ Y is linear, and
nonlinear otherwise.
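
A hedged Python sketch of the linear least squares problem (NumPy assumed; the data M
and b below are purely illustrative), i.e. of minimizing f(a) = ‖Ma − b‖₂² over a ∈ Rⁿ:

import numpy as np

M = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # illustrative m x n matrix
b = np.array([1.0, 2.0, 2.0])
a_opt, residual, rank, sv = np.linalg.lstsq(M, b, rcond=None)   # returns a minimizer
print(a_opt, np.linalg.norm(M @ a_opt - b))
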
One distinguishes between constrained and unconstrained minimization, where con-
strained minimization can be seen as restricting the objective function f : A −→ R
to some subset of A, determined by the constraints. Constraints can be given explic-
itly or implicitly. If A = R, then an explicit constraint might be to seek nonnegative
solutions. Implicit constraints can be given in the form of equation constraints – for
example, if A = Rn , then one might want to minimize f on the set of all solutions to
M a = b, where M is some real m × n matrix. In this class, we will restrict ourselves
to unconstrained minimization. In particular, we will not cover the simplex method, a
method for so-called linear programming, which consists of minimizing a linear f with
affine constraints in finite dimensions (see, e.g., [HH92, Sec. 9] or [Str08, Sec. 5]). For
constrained nonlinear optimization see, e.g., [QSS07, Sec. 7.3]. The optimal control of
differential equations is an example of constrained minimization in infinite dimensions
(see, e.g., [Phi09]).
We will only consider methods that are suitable for unconstrained nonlinear minimiza-
tion, noting that even the linear least squares problem is a nonlinear minimization prob-
lem. Both methods involving derivatives (such as gradient methods) and derivative-free
methods are of interest. Derivative-based methods typically show faster convergence
rates, but, on the other hand, they tend to be less robust, need more function evalua-
tions (i.e. each step of the method is more costly), and have a smaller scope of applica-
bility (e.g., the function to be minimized must be differentiable). Below, we will study
derivative-based methods in Sections 8.3 and 8.4, whereas derivative-free methods can
be found in Appendix K.1 and K.2.

8.2 Local Versus Global and the Role of Convexity


Definition 8.1. Let (X, k · k) be a normed vector space, A ⊆ X, and f : A −→ R.

(a) Given x ∈ A, f has a (strict) global min at x if, and only if, f (x) ≤ f (y) (f (x) <
f (y)) for each y ∈ A \ {x}.

(b) Given x ∈ X, f has a (strict) local min at x if, and only if, there exists r > 0 such
that f (x) ≤ f (y) (f (x) < f (y)) for each y ∈ {y ∈ A : ky − xk < r} \ {x}.

Basically all deterministic minimization methods (in particular, all methods considered
in this class) have the drawback that they will find local minima rather than global
minima. This problem vanishes if every local min is a global min, which is, indeed, the
case for convex functions. For this reason, we will briefly study convex functions in this
section, before treating concrete minimization methods thereafter.

Definition 8.2. A subset C of a real vector space X is called convex if, and only if,
for each (x, y) ∈ C 2 and each 0 ≤ α ≤ 1, one has α x + (1 − α) y ∈ C. If A ⊆ X, then
conv A is the intersection of all convex sets containing A and is called the convex hull
of A (clearly, the intersection of convex sets is always convex and, thus, the convex hull
of a set is always convex, cf. [Phi19b, Def. and Rem. 1.23]).

Definition 8.3. Let C be a convex subset of a real vector space X, f : C −→ R.

(a) f is called convex if, and only if,



    f(α x + (1 − α) y) ≤ α f(x) + (1 − α) f(y)   for each (x, y) ∈ C² and each 0 ≤ α ≤ 1.   (8.3)

(b) f is called strictly convex if, and only if, the inequality in (8.3) is strict whenever
x 6= y and 0 < α < 1.

(c) f is called strongly or uniformly⁴ convex if, and only if, there exists a norm ‖ · ‖ on
X and µ ∈ R+ such that

    f(α x + (1 − α) y) + µ α (1 − α) ‖x − y‖² ≤ α f(x) + (1 − α) f(y)
                                   for each (x, y) ∈ C² and each 0 ≤ α ≤ 1.               (8.4)

⁴ In the literature, strong convexity is often defined as a special case of a more general notion of
uniform convexity.

Remark 8.4. It is immediate from Def. 8.3 that uniform convexity implies strict con-
vexity, which, in turn, implies convexity. In general, a convex function need not be
strictly convex (cf. Ex. 8.5(b)), and a strictly convex function need not be uniformly
convex (cf. Ex. 8.5(d)).

Example 8.5. (a) We know from Calculus (cf. [Phi16a, Prop. 9.32(b)]) that a contin-
uous function f : [a, b] −→ R that is twice differentiable on ]a, b[ is strictly convex
if f″ > 0 on ]a, b[. Thus, f : R₀⁺ −→ R, f(x) = x^p, p > 1, is strictly convex (note
f″(x) = p(p − 1) x^{p−2} > 0 for x > 0).

(b) Let C be a convex subset of a normed vector space (X, ‖ · ‖) over R. Then ‖ · ‖ :
C −→ R is convex by the triangle inequality, but not strictly convex if C contains
a segment S of a one-dimensional subspace: If 0 ≠ x0 ∈ X and 0 ≤ a < b (a, b ∈ R)
such that S = {t x0 : a ≤ t ≤ b} ⊆ C, then

    ‖α a x0 + (1 − α) b x0‖ = (α a + (1 − α) b) ‖x0‖ = α ‖a x0‖ + (1 − α) ‖b x0‖   for each 0 ≤ α ≤ 1.

If ⟨·, ·⟩ is an inner product on X and ‖x‖ = √⟨x, x⟩, then

    f : C −→ R,   f(x) := ‖x‖²,

is uniformly (in particular, strictly) convex: Indeed, if x, y ∈ C and 0 ≤ α ≤ 1,
then, using
    (1 − α) − α(1 − α) = 1 − 2α + α² = (1 − α)²,                                          (8.5)
we obtain

    f(αx + (1 − α)y) = α² ‖x‖² + 2α(1 − α) ⟨x, y⟩ + (1 − α)² ‖y‖²
                     = α ‖x‖² + (1 − α) ‖y‖² − α (1 − α) (‖x‖² − 2⟨x, y⟩ + ‖y‖²)   (by (8.5))
                     = α f(x) + (1 − α) f(y) − α (1 − α) (‖x‖² − 2⟨x, y⟩ + ‖y‖²)
                     = α f(x) + (1 − α) f(y) − α (1 − α) ⟨x − y, x − y⟩
                     = α f(x) + (1 − α) f(y) − α (1 − α) ‖x − y‖²,

showing (8.4) to hold with µ = 1.

(c) The objective function of the linear least squares problem is convex and, if the
linear function F is injective, even strictly convex (exercise).

(d) Consider f : R+ −→ R, f(x) := x⁴. Then f is strictly convex by (a). We claim
that f is not uniformly convex: Indeed, if f were uniformly convex, then, by (8.8)
below, it would have to hold that, with µ > 0,

    x⁴ − y⁴ = f(x) − f(y) ≥ f′(y)(x − y) + µ (x − y)² = 4y³(x − y) + µ (x − y)²   for each x, y ∈ R+.

However, if we let x := 2/n and y := 1/n, then the above inequality is equivalent to

    16/n⁴ − 1/n⁴ ≥ (4/n³) · (1/n) + µ/n²   ⇔   11/n⁴ ≥ µ/n²   ⇔   11/µ ≥ n²,

showing it will always be violated for sufficiently large n.

Theorem 8.6. Let n ∈ N and let C ⊆ Rn be convex and open. Moreover, let f : C −→
R be continuously differentiable. Then the following holds:

(a) f is convex if, and only if,

    f(x) − f(y) ≥ ∇f(y) (x − y)   for each x, y ∈ C.                                      (8.6)

(b) f is strictly convex if, and only if,


 
    x ≠ y  ⇒  f(x) − f(y) > ∇f(y) (x − y)   for each x, y ∈ C.                            (8.7)

(c) f is uniformly convex if, and only if,

    f(x) − f(y) ≥ ∇f(y) (x − y) + µ ‖x − y‖²   for each x, y ∈ C                          (8.8)

for some µ ∈ R+ and some norm k · k on Rn .

In each case, the stated condition is still sufficient for the stated form of convexity if the
partials of f exist, but are not necessarily continuous.

Proof. Suppose (8.8) holds with µ ∈ R₀⁺ and some norm ‖ · ‖ on Rⁿ. Fix x, y ∈ C, α ∈ [0, 1],
and define z := αx + (1 − α)y ∈ C. Then

    f(x) − f(z) ≥ ∇f(z) (x − z) + µ ‖x − z‖²,                                             (8.9a)
    f(y) − f(z) ≥ ∇f(z) (y − z) + µ ‖y − z‖².                                             (8.9b)

We multiply (8.9a) by α, (8.9b) by (1 − α), and add the resulting inequalities, also using

x − z = (1 − α) (x − y), y − z = α (y − x),

to obtain:

    α f(x) + (1 − α) f(y) − f(α x + (1 − α) y)
        = α f(x) + (1 − α) f(y) − α f(z) − (1 − α) f(z)
        ≥ α ∇f(z) (1 − α)(x − y) − (1 − α) ∇f(z) α(x − y)
          + µ α (1 − α)² ‖x − y‖² + µ α² (1 − α) ‖x − y‖²
        = µ α (1 − α) ‖x − y‖².                                                           (8.10)

Thus, if (8.8) holds with µ > 0, then (8.10) holds with µ > 0, showing f to be uniformly
convex; if (8.6) holds, then (8.8) and (8.10) hold with µ = 0, showing f to be convex;

if (8.7) holds and α ∈]0, 1[, x 6= y, then (8.10) holds with µ = 0 and strict inequality,
showing f to be strictly convex.
Now assume f to be uniformly convex. Then there exist µ ∈ R+ and some norm k · k
on Rⁿ such that, for each x, y ∈ C and each α ∈ [0, 1],

    f(y + α(x − y)) = f(α x + (1 − α) y) ≤ α f(x) + (1 − α) f(y) − µ α (1 − α) ‖x − y‖²   (8.11a)

or, rearranged for α > 0,

    (f(y + α(x − y)) − f(y)) / α ≤ f(x) − f(y) − µ (1 − α) ‖x − y‖².                      (8.11b)

Thus, according to the mean value theorem (cf. [Phi16b, Th. 4.35]), there exists s(α) ∈
]0, α[ such that

    ∇f(y + s(α)(x − y)) (x − y) = (f(y + α(x − y)) − f(y)) / α ≤ f(x) − f(y) − µ (1 − α) ‖x − y‖².   (8.11c)

Taking the limit for α → 0 and using the assumed continuity of ∇f then yields

    ∇f(y)(x − y) ≤ f(x) − f(y) − µ ‖x − y‖²,                                              (8.11d)

proving (8.8). Setting µ := 0 in (8.11) shows (8.6) to hold for convex f . It remains
to prove the validity of (8.7) for f being strictly convex. The argument of (8.11) does
not suffice in this case, as we lose the strict inequality, when taking the limit in (8.11c).
Thus, we assume f to be strictly convex, x 6= y, and proceed as follows: Letting
1
z := (x + y) ∈ C,
2
the strict convexity of f yields
 
1 1 1 1
f (z) = f x+ y < f (x) + f (y)
2 2 2 2

and, thus, we obtain


(8.6) 
∇ f (y)(x − y) = 2 ∇ f (y)(z − y) ≤ 2 f (z) − f (y) < f (x) − f (y),

completing the proof of (8.7). 



Corollary 8.7. Let n ∈ N and let C ⊆ Rn be convex and open. Moreover, let f : C −→
R be continuously differentiable and let k · k denote a norm on Rn . If f is uniformly
convex and x∗ ∈ C is a local min of f (which, by Th. 8.9 below, must then be the unique
global min of f ), then there exists µ ∈ R+ such that

    µ ‖x − x∗‖² ≤ f(x) − f(x∗)   for each x ∈ C.                                          (8.12)

Proof. If f has a local min at x∗ , then ∇ f (x∗ ) = 0 (cf. [Phi16b, Th. 5.13]). Due to this
fact and since all norms on Rn are equivalent, (8.12) is an immediate consequence of
Th. 8.6(c). 

Theorem 8.8. Let (X, k·k) be a normed vector space over R, C ⊆ X, and f : C −→ R.
Assume C is a convex set, and f is a convex function.

(a) If f has a local min at x0 ∈ C, then f has a global min at x0 .

(b) The set of mins of f is convex.

Proof. (a): Suppose f has a local min at x0 ∈ C, and consider an arbitrary x ∈ C,


x 6= x0 . As x0 is a local min, there is r > 0 such that f (x0 ) ≤ f (y) for each y ∈ Cr :=
{y ∈ C : ky − x0 k < r}. Note that, for each α ∈ R,

x0 + α (x − x0 ) = (1 − α) x0 + α x.

Thus, due to the convexity of C, x0 + α (x − x0) ∈ C for each α ∈ [0, 1]. Moreover,
for sufficiently small α, namely for each α ∈ R := ]0, min{1, r/‖x0 − x‖}[, one has
x0 + α (x − x0) ∈ Cr. As x0 is a local min and f is convex, for each α ∈ R, one obtains:

    f(x0) ≤ f((1 − α) x0 + α x) ≤ (1 − α) f(x0) + α f(x).                                 (8.13)

After subtracting f (x0 ) and dividing by α > 0, (8.13) yields f (x0 ) ≤ f (x), showing that
x0 is actually a global min as claimed.
(b): Let x0 ∈ C be a min of f . From (a), we already know that x0 must be a global
min. So, if x ∈ C is any min of f , then it follows that f (x) = f (x0 ). If α ∈ [0, 1], then
the convexity of f implies that

    f((1 − α) x0 + α x) ≤ (1 − α) f(x0) + α f(x) = f(x0).                                 (8.14)

As x0 is a global min, (8.14) implies that (1 − α) x0 + α x is also a global min for each
α ∈ [0, 1], showing that the set of mins of f is convex as claimed. 

Theorem 8.9. Let (X, k·k) be a normed vector space over R, C ⊆ X, and f : C −→ R.
Assume C is a convex set, and f is a strictly convex function. If x ∈ C is a local min
of f , then this is the unique local min of f , and, moreover, it is strict.

Proof. According to Th. 8.8, every local min of f is also a global min of f . Seeking a
contradiction, assume there is y ∈ C, y 6= x, such that y is also a min of f . As x and y
are both global mins, f(x) = f(y) is implied. Define z := (1/2)(x + y). Then z ∈ C due to
the convexity of C. Moreover, due to the strict convexity of f,

    f(z) < (1/2)(f(x) + f(y)) = f(x)                                                      (8.15)
in contradiction to x being a global min. Thus, x must be the unique min of f , also
implying that the min must be strict. 

8.3 Gradient-Based Descent Methods


We are concerned with differentiable objective functions f defined on (subsets of) Rn .
Definition 8.10. Let G ⊆ Rn be open, n ∈ N, and f : G −→ R.

(a) Given x1 ∈ G, a stepsize direction iteration defines a sequence (xk )k∈N in G by


setting
    xk+1 := xk + tk uk   for each k ∈ N,                                                  (8.16)

where tk ∈ R+ is the stepsize and uk ∈ Rn is the direction of the kth step. In general,

tk and uk will depend on G, f , x1 , . . . , xk , t1 , . . . , tk−1 , u1 , . . . , uk−1 . Different choices


of tk and uk correspond to different methods – some examples are provided in (b)
– (d) below.
(b) A stepsize direction iteration as in (a) is called a descent method for f if, and only
if, for each k ∈ N, either uk = 0 or

    φk : Ik −→ R,   φk(t) := f(xk + t uk),   Ik := {t ∈ R₀⁺ : xk + t uk ∈ G},             (8.17)

is strictly decreasing in some neighborhood of 0.


(c) If f is differentiable, then a stepsize direction iteration as in (a) is called gradient
method for f or method of steepest descent if, and only if,

    uk = − ∇f(xk)   for each k ∈ N,                                                       (8.18)

i.e. if the directions are always chosen as the antigradient of f in xk .



(d) If f is differentiable, then the conjugate gradient method is a stepsize direction


iteration, where
    uk = − ∇f(xk) + βk uk−1   for each k ∈ {2, . . . , n},                                (8.19)

and the βk ∈ R+ are chosen such that (uk )k∈{1,...,n} forms an orthogonal system with
respect to a suitable inner product.
Remark 8.11. In the situation of Def. 8.10 above, if f is differentiable, then its direc-
tional derivative at xk ∈ G in the direction uk ∈ Rn is
    φ′k(0) = lim_{t↓0} (f(xk + t uk) − f(xk)) / t = ∇f(xk) · uk.                          (8.20)

If ‖uk‖2 = 1, then the Cauchy-Schwarz inequality implies

    |φ′k(0)| = |∇f(xk) · uk| ≤ ‖∇f(xk)‖2,                                                 (8.21)

justifying the name method of steepest descent for the gradient method of Def. 8.10(c).
The next lemma shows it is, indeed, a descent method as defined in Def. 8.10(b), provided
that f is continuously differentiable.
Notation 8.12. If X is a real vector space and x, y ∈ X, then

[x, y] := {(1 − t)x + ty : t ∈ [0, 1]} = conv{x, y} (8.22)

denotes the line segment between x and y.


Lemma 8.13. Let G ⊆ Rn be open, n ∈ N, and f : G −→ R continuously differentiable.
If
    ∇f(xk) · uk < 0   for each k ∈ N,                                                     (8.23)

then each stepsize direction iteration as in Def. 8.10(a) is a descent method according to
Def. 8.10(b). In particular, the gradient method of Def. 8.10(c) is a descent method.

Proof. According to (8.20) and (8.23), we have

φ′k (0) < 0 (8.24)

for φk of (8.17). As φ′k is continuous in 0, (8.24) implies φk to be strictly decreasing in


a neighborhood of 0. 

We continue to remain in the setting of Def. 8.10. Suppose we are in step k of a descent
method, i.e. we know φk of (8.17) to be strictly decreasing in a neighborhood of 0.
Then the question remains how to choose the next stepsize tk. One might want to

choose tk such that φk has a global (or at least a local) min at tk . A local min can, for
instance, be obtained by applying the golden section search of Alg. K.5 in the Appendix.
However, if function evaluations are expensive, then one might want to resort to some
simpler method, even if it does not provide a local min of φk . Another potential problem
with Alg. K.5 is that, in general, one has little control over the size of tk , which makes
convergence proofs difficult. The following Armijo rule has the advantage of being simple
and, at the same time, relating the size of tk to the size of ∇ f (xk ):

Algorithm 8.14 (Armijo Rule). Let G ⊆ Rn be open, n ∈ N, and f : G −→ R


differentiable. The Armijo rule has two parameters, namely σ, β ∈]0, 1[. If xk ∈ G and
uk ∈ Rn is either 0 or such that ∇f(xk) · uk < 0, then set

    tk := 1   for uk = 0,
    tk := max{β^l : l ∈ N0, xk + β^l uk ∈ G, and β^l satisfies (8.26)}   for uk ≠ 0,      (8.25)

where
    f(xk + β^l uk) ≤ f(xk) + σ β^l ∇f(xk) · uk.                                           (8.26)

In practice, since (β^l)_{l∈N0} is strictly decreasing, one merely has to test the validity of
xk + β^l uk ∈ G and (8.26) for l = 0, 1, 2, . . ., stopping once (8.26) holds for the first time.
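
A hedged Python sketch (NumPy assumed; the function names are illustrative) of the
gradient method of Def. 8.10(c) combined with the Armijo rule of Alg. 8.14, here for
G = Rⁿ, so the domain test is omitted:

import numpy as np

def gradient_descent_armijo(f, grad, x0, sigma=0.1, beta=0.5, steps=200):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        u = -grad(x)                          # direction of steepest descent, cf. (8.18)
        slope = grad(x) @ u                   # = -||grad f(x)||_2^2 <= 0
        if slope == 0.0:
            break
        t = 1.0
        while f(x + t * u) > f(x) + sigma * t * slope:   # Armijo condition (8.26)
            t *= beta
        x = x + t * u
    return x

# illustrative use on a smooth strictly convex function with minimum at (1, -2):
f = lambda x: (x[0] - 1.0)**2 + 4.0 * (x[1] + 2.0)**2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 8.0 * (x[1] + 2.0)])
print(gradient_descent_armijo(f, grad, np.zeros(2)))
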

Lemma 8.15. Under the assumptions of Alg. 8.14, if uk 6= 0, then the max in (8.25)
exists. In particular, the Armijo rule is well-defined.

Proof. Let uk 6= 0. Since (β l )l∈N0 is decreasing, we only have to show that an l ∈ N0


exists such that xk + β l uk ∈ G and (8.26) holds. Since liml→∞ β l = 0 and G is open,
xk + β l uk ∈ G holds for each sufficiently large l. Seeking a contradiction, we now assume
 
    ∃ N ∈ N  ∀ l ≥ N :  xk + β^l uk ∈ G  and  f(xk + β^l uk) > f(xk) + σ β^l ∇f(xk) · uk.

Thus, as f is differentiable,

    ∇f(xk) · uk = lim_{l→∞} (f(xk + β^l uk) − f(xk)) / β^l ≥ σ ∇f(xk) · uk   (using (8.20)).

Then the assumption ∇ f (xk ) · uk < 0 implies 1 ≤ σ, in contradiction to 0 < σ < 1. 

In preparation for a convergence theorem for the gradient method (Th. 8.17 below), we
prove the following lemma:

Lemma 8.16. Let G ⊆ Rn be open, n ∈ N, and f : G −→ R continuously differentiable.


Suppose x ∈ G, u ∈ Rn , and consider sequences (xk )k∈N in G, (uk )k∈N in Rn , (tk )k∈N in
R+ , satisfying
$$\lim_{k\to\infty} x^k = x, \qquad \lim_{k\to\infty} u^k = u, \qquad \lim_{k\to\infty} t_k = 0. \tag{8.27}$$
Then
$$\lim_{k\to\infty} \frac{f(x^k + t_k u^k) - f(x^k)}{t_k} = \nabla f(x) \cdot u. \tag{8.28}$$

Proof. As G is open, (8.27) guarantees

∃ ∃ ∀ y k := xk + tk uk ∈ Bǫ (xk ) ⊆ G.
ǫ>0 N ∈N k>N

Thus, we can apply the mean value theorem to obtain

∀ ∃ f (xk + tk uk ) − f (xk ) = tk ∇ f (ξ k ) · uk . (8.29)


k>N ξ k ∈[xk ,y k ]

Since limk→∞ ξ k = x and ∇ f is continuous, (8.29) implies (8.28). 


Theorem 8.17. Let f : Rn −→ R be continuously differentiable, n ∈ N. If the sequence
(xk )k∈N is given by the gradient method of Def. 8.10(c) together with the Armijo rule
according to Alg. 8.14, then every cluster point x of the sequence is a stationary point
of f .

Proof. If there is k0 ∈ N such that uk0 = − ∇ f (xk0 ) = 0, then limk→∞ xk = xk0 is


clear from (8.16) and we are already done. Thus, we may now assume uk 6= 0 for each
k ∈ N. Let x ∈ Rn be a cluster point of (xk )k∈N and (xkj )j∈N a subsequence such
that x = limj→∞ xkj . The continuity of f implies f (x) = limj→∞ f (xkj ). But then, as
(f (xk ))k∈N is decreasing by the Armijo rule, one also obtains

f (x) = lim f (xkj ) = lim f (xk ).


j→∞ k→∞

Thus, limk→∞ (f (xk+1 ) − f (xk )) = 0, and, using the Armijo rule again, we have

0 < σtk k ∇ f (xk )k22 ≤ f (xk ) − f (xk+1 ) → 0, (8.30)

where σ ∈]0, 1[ is the parameter from the Armijo rule. Seeking a contradiction, we now
assume ∇ f (x) 6= 0. Then the continuity of ∇ f yields limj→∞ ∇ f (xkj ) = ∇ f (x) 6= 0,
which, together with (8.30) implies limj→∞ tkj = 0. Moreover, for each j ∈ N, tkj = β lj ,
where β ∈]0, 1[ is the parameter from the Armijo rule and lj ∈ N0 is chosen according
to (8.25). Thus, for each sufficiently large j, mj := lj − 1 ∈ N0 and

f (xkj + β mj ukj ) > f (xkj ) + σβ mj ∇ f (xkj ) · ukj ,



which, rearranged, reads
$$\frac{f(x^{k_j} + \beta^{m_j} u^{k_j}) - f(x^{k_j})}{\beta^{m_j}} > \sigma\, \nabla f(x^{k_j}) \cdot u^{k_j}$$

(here we have used that xkj + β mj ukj ∈ G = Rn ). Taking the limit for j → ∞ and
observing limj→∞ β mj = 0 in connection with Lem. 8.16 then yields

−k ∇ f (x)k22 ≥ −σk ∇ f (x)k22 ,

i.e. σ ≥ 1. This contradiction to σ < 1 completes the proof of ∇ f (x) = 0. 

As a caveat, it is underlined that Th. 8.17 does not claim the sequence (xk )k∈N to have
cluster points. It merely states that, if cluster points exist, then they must be stationary
points. For convex differentiable functions, stationary points are minima:

Proposition 8.18. If C ⊆ Rn is open and convex, n ∈ N, and f : C −→ R is


differentiable and (strictly) convex, then f has a (strict) global min at each stationary
point (of which there is at most one in the strictly convex case, cf. Th. 8.9).

Proof. Let x ∈ C be a stationary point of f , i.e. ∇ f (x) = 0. Let y ∈ C, y 6= x.


Since C is open and convex, there exists an open interval I ⊆ R such that 0, 1 ∈ I and
{x + t(y − x) : t ∈ I} ⊆ C. Consider the auxiliary function

φ : I −→ R, φ(t) := f x + t(y − x) . (8.31)

We claim that φ is (strictly) convex if f is (strictly) convex: Indeed, let s, t ∈ I, s 6= t,


and α ∈]0, 1[. Then
 
φ αs + (1 − α)t = f x + (αs + (1 − α)t)(y − x)

= f α(x + s(y − x)) + (1 − α)(x + t(y − x))
 
≤ αf x + s(y − x) + (1 − α)f x + t(y − x)
= αφ(s) + (1 − α)φ(t) (8.32)

with strict inequality if f is strictly convex. By the chain rule, φ is differentiable with

∀ φ′ (t) = ∇ f x + t(y − x) · (y − x).
t∈I

Thus, we know from Calculus (cf. [Phi16a, Prop. 9.31]) that φ′ is (strictly) increasing
and, as φ′ (0) = 0, φ must be (strictly) increasing on I ∩ R+0 . Thus, f (y) ≥ f (x)
(f (y) > f (x)), showing x to be a (strict) global min. 

Corollary 8.19. Let f : Rn −→ R be (strictly) convex and continuously differentiable,


n ∈ N. If the sequence (xk )k∈N is given by the gradient method of Def. 8.10(c) together
with the Armijo rule according to Alg. 8.14, then every cluster point x of the sequence is
a (strict) global min of f (in particular, there is at most one cluster point in the strictly
convex case).

Proof. Everything is immediate when combining Th. 8.17 with Prop. 8.18. 

A disadvantage of the hypotheses of the previous results is that they do not guarantee
the convergence of the gradient method (or even the existence of local minima). An
important class of functions, where we can, indeed, guarantee both (together with an
error estimate that shows a linear convergence rate) is the class of convex quadratic
functions, which we will study next.
Definition 8.20. Let Q ∈ M(n, R) be symmetric and positive definite, a ∈ Rn , n ∈ N,
γ ∈ R. Then the function
$$f : \mathbb{R}^n \longrightarrow \mathbb{R}, \qquad f(x) := \tfrac{1}{2}\, x^t Q x + a^t x + \gamma, \tag{8.33}$$
where a, x are considered column vectors, is called the quadratic function given by Q,
a, and γ.
Proposition 8.21. Let Q ∈ M(n, R) be symmetric and positive definite, a ∈ Rn ,
n ∈ N, γ ∈ R. Then the function f of (8.33) has the following properties:

(a) f is uniformly convex.


(b) f is differentiable with

∇ f : Rn −→ Rn , ∇ f (x) = xt Q + at . (8.34)

(c) f has a unique and strict global min at x∗ := −Q−1 a.


(d) Given x, u ∈ Rn , u 6= 0, the function

φ : R −→ R, φ(t) := f (x + tu), (8.35)


has a unique and strict global min at $t^* := -\dfrac{(x^t Q + a^t)\,u}{u^t Q u}$.

Proof. (a): Since hx, yi := xt Qy defines an inner product on Rn , the uniform convexity
of f follows from Ex. 8.5(b): Indeed, if we let
$$\forall_{x \in \mathbb{R}^n} \quad \|x\| := \sqrt{x^t Q x},$$

then Ex. 8.5(b) implies, for each x, y ∈ Rn and each α ∈ [0, 1]:
$$\begin{aligned}
f\bigl(\alpha x + (1-\alpha) y\bigr) + \tfrac{1}{2}\, \alpha (1-\alpha) \|x-y\|^2
&= \tfrac{1}{2} \bigl(\alpha x + (1-\alpha) y\bigr)^t Q \bigl(\alpha x + (1-\alpha) y\bigr) + a^t \bigl(\alpha x + (1-\alpha) y\bigr) + \gamma + \tfrac{1}{2}\, \alpha (1-\alpha) \|x-y\|^2 \\
&\le \tfrac{1}{2}\, \alpha\, x^t Q x + \tfrac{1}{2}\, (1-\alpha)\, y^t Q y + \alpha\, a^t x + (1-\alpha)\, a^t y + \alpha \gamma + (1-\alpha) \gamma \\
&= \alpha f(x) + (1-\alpha) f(y).
\end{aligned}$$

(b): Exercise.
(c) follows from (a), (b), and Prop. 8.18, as f has a stationary point at x∗ = −Q−1 a.
(d): The computation in (8.32) with y − x replaced by u shows φ to be strictly convex.
Moreover,

φ′ : R −→ R, φ′ (t) = ∇ f (x + tu)u = (xt + tut )Qu + at u.


Then the assertion follows from Prop. 8.18, as φ has a stationary point at $t^* = -\dfrac{(x^t Q + a^t)\,u}{u^t Q u}$.


Theorem 8.22 (Kantorovich Inequality). Let Q ∈ M(n, R) be symmetric and positive


definite, n ∈ N. If α := min σ(Q) ∈ R+ and β := max σ(Q) ∈ R+ are the smallest and
largest eigenvalues of Q, respectively, then
$$\forall_{x \in \mathbb{R}^n \setminus \{0\}} \quad F(x) := \frac{(x^t x)^2}{(x^t Q x)(x^t Q^{-1} x)} \ge \frac{4\alpha\beta}{(\alpha+\beta)^2} = 4\left(\sqrt{\frac{\alpha}{\beta}} + \sqrt{\frac{\beta}{\alpha}}\right)^{-2}. \tag{8.36}$$

Proof. Since Q is symmetric positive definite, we can write the eigenvalues in the form
0 < α = λ1 ≤ · · · ≤ λn = β with corresponding orthonormal eigenvectors v1 , . . . , vn .
Fix x ∈ Rn \ {0}. As the v1 , . . . , vn form a basis,
$$\exists_{\mu_1,\dots,\mu_n \in \mathbb{R}} \quad \left( x = \sum_{i=1}^n \mu_i v_i, \quad \mu := \sum_{i=1}^n \mu_i^2 > 0 \right).$$
As the v1 , . . . , vn are orthonormal eigenvectors for the λ1 , . . . , λn , we obtain
$$F(x) \overset{(8.36)}{=} \frac{\mu^2}{\bigl(\sum_{i=1}^n \lambda_i \mu_i^2\bigr)\bigl(\sum_{i=1}^n \lambda_i^{-1} \mu_i^2\bigr)},$$

which, letting $\gamma_i := \mu_i^2/\mu$, becomes
$$F(x) = \frac{1}{\bigl(\sum_{i=1}^n \gamma_i \lambda_i\bigr)\bigl(\sum_{i=1}^n \gamma_i \lambda_i^{-1}\bigr)}.$$
Noting $0 \le \gamma_i \le 1$ and $\sum_{i=1}^n \gamma_i = 1$, we see that
$$\lambda := \sum_{i=1}^n \gamma_i \lambda_i$$

is a convex combination of the λi , i.e. α ≤ λ ≤ β. We would now like to show


$$\sum_{i=1}^n \gamma_i \lambda_i^{-1} \le \frac{\alpha + \beta - \lambda}{\alpha\beta}. \tag{8.37}$$

To this end, observe that $l : \mathbb{R} \longrightarrow \mathbb{R}$, $l(\lambda) := \frac{\alpha+\beta-\lambda}{\alpha\beta}$, represents the line through the points $P_\alpha := (\alpha, \alpha^{-1})$ and $P_\beta := (\beta, \beta^{-1})$, and that the set
$$C := \bigl\{(\lambda, \mu) \in \mathbb{R}^2 : \alpha \le \lambda \le \beta,\ \mu \le l(\lambda)\bigr\},$$
below that line, is a convex subset of R2 . The function $\varphi : \mathbb{R}^+ \longrightarrow \mathbb{R}^+$, $\varphi(\lambda) := \lambda^{-1}$, is strictly convex (e.g., due to $\varphi''(\lambda) = 2\lambda^{-3} > 0$), implying $P_i := (\lambda_i, \lambda_i^{-1}) \in C$ for each $i \in \{1, \dots, n\}$ (cf. [Phi16a, Prop. 9.29]). In consequence, $P := \sum_{i=1}^n \gamma_i P_i = \bigl(\lambda, \sum_{i=1}^n \gamma_i \lambda_i^{-1}\bigr) \in C$, i.e. $\sum_{i=1}^n \gamma_i \lambda_i^{-1} \le l(\lambda)$, proving (8.37). Thus,
$$F(x) = \frac{1}{\lambda\, \bigl(\sum_{i=1}^n \gamma_i \lambda_i^{-1}\bigr)} \overset{(8.37)}{\ge} \frac{\alpha\beta}{\lambda\,(\alpha + \beta - \lambda)}. \tag{8.38}$$
To estimate F (x) further, we determine the global min of the differentiable function
$$\phi : \,]\alpha, \beta[\, \longrightarrow \mathbb{R}, \qquad \phi(\lambda) = \frac{\alpha\beta}{\lambda\,(\alpha + \beta - \lambda)}.$$
The derivative is
$$\phi' : \,]\alpha, \beta[\, \longrightarrow \mathbb{R}, \qquad \phi'(\lambda) = -\frac{\alpha\beta\,(\alpha + \beta - 2\lambda)}{\lambda^2\,(\alpha + \beta - \lambda)^2}.$$
Let $\lambda_0 := \frac{\alpha+\beta}{2}$. Clearly, φ′ (λ) < 0 for λ < λ0 , φ′ (λ0 ) = 0, and φ′ (λ) > 0 for λ > λ0 , showing that φ has its global min at λ0 . Using this fact in (8.38) yields
$$F(x) \ge \frac{\alpha\beta}{\lambda_0\,(\alpha + \beta - \lambda_0)} = \frac{4\alpha\beta}{(\alpha + \beta)^2}, \tag{8.39}$$
completing the proof of (8.36). 

Theorem 8.23. Let Q ∈ M(n, R) be symmetric and positive definite, a ∈ Rn , n ∈


N, γ ∈ R, and let f be the corresponding quadratic function according to (8.33). If
the sequence (xk )k∈N is given by the gradient method of Def. 8.10(c), where (cf. Prop.
8.21(d))
$$\forall_{k\in\mathbb{N}} \quad \left( u^k \ne 0 \;\Rightarrow\; t_k := -\frac{\bigl((x^k)^t Q + a^t\bigr) u^k}{(u^k)^t Q u^k} > 0 \right), \tag{8.40}$$
then limk→∞ xk = x∗ , where x∗ = −Q−1 a is the global min of f according to Prop.
8.21(c). Moreover, we have the error estimate
$$\forall_{k\in\mathbb{N}} \quad f(x^{k+1}) - f(x^*) \le \left(\frac{\beta-\alpha}{\alpha+\beta}\right)^{2} \bigl(f(x^k) - f(x^*)\bigr) \le \left(\frac{\beta-\alpha}{\alpha+\beta}\right)^{2k} \bigl(f(x^1) - f(x^*)\bigr), \tag{8.41}$$
where α := min σ(Q) ∈ R+ and β := max σ(Q) ∈ R+ .

Proof. Introducing the abbreviations

∀ g k := (∇ f (xk ))t = Qxk + a,


k∈N

we see that, indeed, for uk ≠ 0,
$$t_k = \frac{(g^k)^t g^k}{(g^k)^t Q g^k} > 0.$$
If there is k0 ∈ N such that uk0 = − ∇ f (xk0 ) = 0, then, clearly, limk→∞ xk = xk0 = x∗
and we are already done. Thus, we may now assume uk 6= 0 for each k ∈ N. It suffices
to show the first estimate in (8.41), as this implies the second estimate as well as
convergence to x∗ (where the latter claim is due to the uniform convexity of f according
to Prop. 8.21(a) in combination with Cor. 8.7). Using the formulas for g k and tk from
above, we have
$$\forall_{k\in\mathbb{N}} \quad x^{k+1} = x^k - t_k g^k = x^k - \frac{(g^k)^t g^k}{(g^k)^t Q g^k}\, g^k,$$
implying
$$\begin{aligned}
\forall_{k\in\mathbb{N}} \quad f(x^{k+1}) &= \tfrac{1}{2}\,(x^k - t_k g^k)^t Q (x^k - t_k g^k) + a^t (x^k - t_k g^k) + \gamma \\
&= f(x^k) - \tfrac{1}{2}\, t_k (x^k)^t Q g^k - \tfrac{1}{2}\, t_k (g^k)^t Q x^k + \tfrac{1}{2}\, t_k^2 (g^k)^t Q g^k - t_k a^t g^k \\
&= f(x^k) - t_k (Q x^k + a)^t g^k + \tfrac{1}{2}\, t_k^2 (g^k)^t Q g^k = f(x^k) - \frac{1}{2}\, \frac{\bigl((g^k)^t g^k\bigr)^2}{(g^k)^t Q g^k}, \tag{8.42}
\end{aligned}$$

where it was used that (g k )t Qxk = ((g k )t Qxk )t = (xk )t Qg k ∈ R. Next, we claim that
$$\forall_{k\in\mathbb{N}} \quad f(x^{k+1}) - f(x^*) = \left(1 - \frac{\bigl((g^k)^t g^k\bigr)^2}{\bigl((g^k)^t Q g^k\bigr)\bigl((g^k)^t Q^{-1} g^k\bigr)}\right) \bigl(f(x^k) - f(x^*)\bigr), \tag{8.43}$$
which, in view of (8.42), is equivalent to
$$\frac{1}{2}\, \frac{\bigl((g^k)^t g^k\bigr)^2}{(g^k)^t Q g^k} = \frac{\bigl((g^k)^t g^k\bigr)^2}{\bigl((g^k)^t Q g^k\bigr)\bigl((g^k)^t Q^{-1} g^k\bigr)}\, \bigl(f(x^k) - f(x^*)\bigr)$$
and
$$2\bigl(f(x^k) - f(x^*)\bigr) = (g^k)^t Q^{-1} g^k. \tag{8.44}$$
Indeed,
$$\begin{aligned}
2\bigl(f(x^k) - f(x^*)\bigr) &= (x^k)^t Q x^k + 2 a^t x^k + 2\gamma - \bigl(a^t Q^{-1} Q Q^{-1} a - 2 a^t Q^{-1} a + 2\gamma\bigr) \\
&= (x^k)^t g^k + a^t x^k + a^t Q^{-1} a = \bigl((x^k)^t Q + a^t\bigr) Q^{-1} (Q x^k + a) = (g^k)^t Q^{-1} g^k,
\end{aligned}$$
which is (8.44). Applying the Kantorovich inequality (8.36) with x = g k to (8.43) now
yields
$$\forall_{k\in\mathbb{N}} \quad f(x^{k+1}) - f(x^*) \le \left(1 - \frac{4\alpha\beta}{(\alpha+\beta)^2}\right) \bigl(f(x^k) - f(x^*)\bigr) = \left(\frac{\beta-\alpha}{\alpha+\beta}\right)^{2} \bigl(f(x^k) - f(x^*)\bigr),$$
α+β
thereby establishing the case. 

8.4 Conjugate Gradient Method


The conjugate gradient (CG) method is an iterative method designed for the solution
of linear systems
Qx = b, (8.45)
where Q ∈ M(n, R) is symmetric and positive definite, b ∈ Rn . While (in exact arith-
metic), it will find the exact solution in at most n steps, it is most suitable for problems
where n is large and/or Q is sparse, the idea being to stop the iteration after k < n
steps, obtaining a sufficiently accurate approximation to the exact solution. According
to Prop. 8.21(c), (8.45) is equivalent to finding the global min of the quadratic function
given by Q and −b (and 0 ∈ R), namely of
$$f : \mathbb{R}^n \longrightarrow \mathbb{R}, \qquad f(x) := \tfrac{1}{2}\, x^t Q x - b^t x. \tag{8.46}$$

Remark 8.24. If Q ∈ M(n, R) is symmetric and positive definite, then the map
h·, ·iQ : Rn × Rn −→ R, hx, yiQ := xt Qy, (8.47)
clearly forms an inner product on Rn . We call x, y ∈ Rn Q-orthogonal if, and only
if, they are orthogonal with respect to the inner product of (8.47), i.e. if, and only if,
hx, yiQ = 0.
Proposition 8.25. Let Q ∈ M(n, R) be symmetric and positive definite, b ∈ Rn , n ∈ N.
If x1 , . . . , xn+1 ∈ Rn are given by a stepsize direction iteration according to Def. 8.10(a),
if the directions u1 , . . . , un ∈ Rn \ {0} are Q-orthogonal, i.e.
∀ ∀ huk , uj iQ = 0, (8.48)
k∈{2,...,n} j<k

and if t1 , . . . , tn ∈ R are according to (8.40) with a replaced by −b, then xn+1 = Q−1 b,
i.e. the iteration converges to the exact solution to (8.45) in at most n steps (here, in
contrast to Def. 8.10(a), we do not require tk ∈ R+ – however, if, indeed, tk < 0, then
one could replace uk by −uk and tk by −tk > 0 to remain in the setting of Def. 8.10(a)).

Proof. Let f be according to (8.46). As in the proof of Th. 8.22, we set


∀ g k := (∇ f (xk ))t = Qxk − b.
k∈{1,...,n+1}

It then suffices to show


∀ ∀ (g k+1 )t uj = 0 : (8.49)
k∈{1,...,n} j≤k

Since u1 , . . . , un are Q-orthogonal, they must be linearly independent. Moreover, by


(8.49), g n+1 is orthogonal to each u1 , . . . , un . In consequence, if g n+1 6= 0, then we have
n + 1 linearly independent vectors in Rn , which is impossible. Thus, g n+1 = 0, i.e.
Qxn+1 = b, proving the proposition. So it merely remains to verify (8.49). We start
with the case j = k: Indeed,
 t
k+1 t k k+1 t k (8.16),(8.40) k ((xk )t Q − bt )uk k
(g ) u = (Qx − b) u = Qx − Qu uk − bt uk = 0.
(uk )t Quk
(8.50)
Moreover,
∀ (g i+1 − g i )t uj = (Qxi+1 − Qxi )t uj = ti (ui )t Quj = 0, (8.51)
j<i≤k≤n

implying
k
X
k+1 t j j+1 t j (8.50),(8.51)
∀ ∀ (g ) u = (g )u + (g i+1 − g i )t uj = 0,
k∈{1,...,n} j≤k
i=j+1

which establishes (8.49) and completes the proof. 



Proposition 8.25 suggests choosing Q-orthogonal search directions. We already know methods for orthogonalization from Numerical Analysis I, namely the Gram-Schmidt method and the three-term recursion. The following CG method is an efficient way to carry out the steps of the stepsize direction iteration simultaneously with a Q-orthogonalization:

Algorithm 8.26 (CG Method). Let Q ∈ M(n, R) be symmetric and positive definite,
b ∈ Rn , n ∈ N. The conjugate gradient (CG) method then defines a finite sequence
(x1 , . . . , xn+1 ) in Rn via the following recursion: Initialization: Choose x1 ∈ Rn (barring
further information on Q and b, choosing x1 = 0 is fine) and set

g 1 := Qx1 − b, u1 := −g 1 . (8.52)

Iteration: Let k ∈ {1, . . . , n}. If uk = 0, then set tk := βk := 1, xk+1 := xk , g k+1 := g k ,


uk+1 := uk (or, in practice, stop the iteration). If uk ≠ 0, then set

$$\begin{aligned}
t_k &:= \frac{\|g^k\|_2^2}{\langle u^k, u^k\rangle_Q}, &&\text{(8.53a)}\\
x^{k+1} &:= x^k + t_k u^k, &&\text{(8.53b)}\\
g^{k+1} &:= g^k + t_k Q u^k, &&\text{(8.53c)}\\
\beta_k &:= \frac{\|g^{k+1}\|_2^2}{\|g^k\|_2^2}, &&\text{(8.53d)}\\
u^{k+1} &:= -g^{k+1} + \beta_k u^k. &&\text{(8.53e)}
\end{aligned}$$

Remark 8.27. Consider Alg. 8.26.

(a) We will see in (8.54d) below that uk 6= 0 implies g k 6= 0, such that Alg. 8.26 is
well-defined.

(b) The most expensive operation in (8.53) is the matrix multiplication Quk , which is
O(n2 ). As the result is needed in both (8.53a) and (8.53c), it should be temporarily
stored in some vector z k := Quk . If the matrix Q is sparse, it might be possible to
reduce the cost of carrying out (8.53) significantly – for example, if Q is tridiagonal,
then the cost reduces from O(n2 ) to O(n).
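
The recursion (8.53) translates almost literally into code. The following is a minimal Python sketch of Alg. 8.26, using a temporary vector z = Q u^k as suggested in part (b) of the remark above; the function name, the stopping tolerance, and the default start x^1 = 0 are illustrative choices, not part of the algorithm.

```python
import numpy as np

def conjugate_gradient(Q, b, x0=None, tol=1e-12):
    """CG method (Alg. 8.26) for Qx = b with Q symmetric positive definite."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    g = Q @ x - b                           # g^1 = Q x^1 - b, cf. (8.52)
    u = -g                                  # u^1 = -g^1
    for _ in range(n):                      # at most n steps, cf. Th. 8.28
        if np.linalg.norm(u) <= tol:        # u^k (numerically) zero: stop
            break
        z = Q @ u                           # z^k = Q u^k, reused twice
        t = (g @ g) / (u @ z)               # (8.53a)
        x = x + t * u                       # (8.53b)
        g_new = g + t * z                   # (8.53c)
        beta = (g_new @ g_new) / (g @ g)    # (8.53d)
        u = -g_new + beta * u               # (8.53e)
        g = g_new
    return x
```

For example, conjugate_gradient(np.array([[4.0, 1.0], [1.0, 3.0]]), np.array([1.0, 2.0])) recovers the solution of the corresponding 2×2 system up to rounding.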

Theorem 8.28. Let Q ∈ M(n, R) be symmetric and positive definite, b ∈ Rn , n ∈ N.


If x1 , . . . , xn+1 ∈ Rn as well as the related vectors uk , g k ∈ Rn and numbers tk , βk ∈ R+
0

are given by Alg. 8.26, then the following orthogonality relations hold:
∀ huk , uj iQ = 0, (8.54a)
k,j∈{1,...,n+1},
j<k

∀ (g k )t g j = 0, (8.54b)
k,j∈{1,...,n+1},
j<k

∀ (g k )t uj = 0, (8.54c)
k,j∈{1,...,n+1},
j<k

∀ (g k )t uk = −kg k k22 . (8.54d)


k∈{1,...,n+1}

Moreover,
xn+1 = Q−1 b, (8.55)
i.e. the CG method converges to the exact solution to (8.45) in at most n steps.

Proof. Let  

m := min k ∈ {1, . . . , n + 1} : uk = 0 ∪ {n + 1} . (8.56)
A first induction shows
∀ g k = Qxk − b : (8.57)
k∈{1,...,m}

Indeed, g 1 = Qx1 − b holds by initialization and, for 1 ≤ k < m,


ind.hyp.
g k+1 = g k + tk Quk = Qxk − b + tk Quk = Q(xk + tk uk ) − b = Qxk+1 − b,
proving (8.57). It now suffices to show that (8.54) holds with n + 1 replaced by m: If
m < n + 1, then um = 0, i.e. g m = 0 by (8.54d), implying xm = Q−1 b by (8.57). Then
(8.54) and (8.55) also follow as stated. Next, note
kg k k22 (8.54d) (g k )t uk
∀ tk = = − , (8.58)
k∈{1,...,m−1} huk , uk iQ huk , uk iQ
which is consistent with (8.40). If m = n + 1, then Prop. 8.25 applies according to
(8.54a) and (8.58), implying the validity of (8.55). It, thus, remains to verify (8.54)
holds with n + 1 replaced by m. We proceed via induction on k = 1, . . . , m: For k = 1,
(8.54a) – (8.54c) are trivially true and (8.54d) holds as (g 1 )t u1 = −kg 1 k22 is true by
initialization. Thus, let 1 ≤ k < m. As in the proof of Prop. 8.25, we obtain (8.49) from
(8.54a), that means the induction hypothesis for (8.54a) implies (8.54c) for k + 1. Next,
we verify (8.54d) for k + 1:
(g k+1 )t uk+1 = (g k+1 )t (−g k+1 + βk uk ) = −kg k+1 k22 + (g k + tk Quk )t βk uk
ind.hyp.
= −kg k+1 k22 − βk kg k k22 + βk kg k k22 = −kg k+1 k22 .

It remains to show (8.54a) and (8.54b) for k + 1. We start with (8.54b):


(8.54c)
(g k+1 )t g 1 = −(g k+1 )t u1 = 0
and, for j ∈ {2, . . . , k},
(8.54c)
(g k+1 )t g j = (g k+1 )t (−uj + βj−1 uj−1 ) = 0.
Finally, to obtain (8.54a) for k + 1, we note, for each j ∈ {1, . . . , k},
1
huk+1 , uj iQ = (−g k+1 + βk uk )t Quj = − (g k+1 )t (g j+1 − g j ) + βk (uk )t Quj . (8.59)
tj
Thus, for j < k, (8.59) together with (8.54b) (for k + 1) and (8.54a) (for k, i.e. the
induction hypothesis) yields huk+1 , uj iQ = 0. For j = k, we compute
(8.59),(8.54b) kg k+1 k22 kg k k22
huk+1 , uk iQ = − + βk = 0,
tk tk
which completes the induction proof of (8.54). 
Theorem 8.29. Let Q ∈ M(n, R) be symmetric and positive definite, b ∈ Rn , n ∈ N.
If x1 , . . . , xn+1 ∈ Rn are given by Alg. 8.26 and x∗ := Q−1 b, then the following error
estimate holds:
$$\forall_{k\in\{1,\dots,n\}} \quad \|x^{k+1} - x^*\|_Q \le 2 \left(\frac{\sqrt{\kappa_2(Q)} - 1}{\sqrt{\kappa_2(Q)} + 1}\right)^{k} \|x^1 - x^*\|_Q, \tag{8.60}$$
where $\|\cdot\|_Q$ denotes the norm induced by $\langle\cdot,\cdot\rangle_Q$ and $\kappa_2(Q) = \|Q\|_2\, \|Q^{-1}\|_2 = \frac{\max \sigma(Q)}{\min \sigma(Q)}$ is the spectral condition of Q.

Proof. See, e.g., [DH08, Th. 8.17]. 


Remark 8.30. (a) According to Th. 8.28, Alg. 8.26 finds the exact solution after at
most n steps. However, it might find the exact solution even earlier: According to
[Sco11, Cor. 9.10], if Q has precisely m distinct eigenvalues, then Alg. 8.26 finds
the exact solution after at most m steps; and it is stated in [GK99, p. 225] that
the same also holds if one starts at x1 = 0 and b is the linear combination of m
eigenvectors of Q.
(b) If one needs to solve Ax = b with invertible A ∈ M(n, R), where A is not symmetric
positive definite, one can still make use of the CG method, namely by applying it
to At Ax = At b (noting Q := At A to be symmetric positive definite). However, a
disadvantage then is that the condition of Q can be much larger than the condition
of A.

A Representations of Real Numbers, Rounding Errors

A.1 b-Adic Expansions of Real Numbers


We are mostly used to representing real numbers in the decimal system. For example, we write
$$x = \frac{395}{3} = 131.\overline{6} = 1 \cdot 10^2 + 3 \cdot 10^1 + 1 \cdot 10^0 + \sum_{n=1}^{\infty} 6 \cdot 10^{-n}. \tag{A.1a}$$
The decimal system represents real numbers as, in general, infinite series of decimal
fractions. Digital computers represent numbers in the dual system, using base 2 instead
of 10. For example, the number from (A.1a) has the dual representation

$$x = 10000011.\overline{10} = 2^7 + 2^1 + 2^0 + \sum_{n=0}^{\infty} 2^{-(2n+1)}. \tag{A.1b}$$

Representations with base 16 (hexadecimal) and 8 (octal) are also of importance when
working with digital computers. More generally, each natural number b ≥ 2 can be used
as a base.
In most cases, it is understood that we work only with decimal representations such
that there is no confusion about the meaning of symbol strings like 101.01. However,
in general, 101.01 could also be meant with respect to any other base, and, the number
represented by the same string of symbols does obviously depend on the base used.
Thus, when working with different representations, one needs some notation to keep
track of the base.
Notation A.1. Given a natural number b ≥ 2 and finite sequences
(dN1 , dN1 −1 , . . . , d0 ) ∈ {0, . . . , b − 1}N1 +1 , (A.2a)
N2
(e1 , e2 , . . . , eN2 ) ∈ {0, . . . , b − 1} , (A.2b)
(p1 , p2 , . . . , pN3 ) ∈ {0, . . . , b − 1}N3 , (A.2c)
N1 , N2 , N3 ∈ N0 (where N2 = 0 or N3 = 0 is supposed to mean that the corresponding
sequence is empty), the respective string
$$(d_{N_1} d_{N_1-1} \dots d_0)_b \quad\text{for } N_2 = N_3 = 0, \qquad (d_{N_1} d_{N_1-1} \dots d_0\,.\,e_1 \dots e_{N_2}\, \overline{p_1 \dots p_{N_3}})_b \quad\text{for } N_2 + N_3 > 0 \tag{A.3}$$
represents the number
$$\sum_{\nu=0}^{N_1} d_\nu\, b^{\nu} + \sum_{\nu=1}^{N_2} e_\nu\, b^{-\nu} + \sum_{\alpha=0}^{\infty} \sum_{\nu=1}^{N_3} p_\nu\, b^{-N_2 - \alpha N_3 - \nu}. \tag{A.4}$$

Example A.2. For the number from (A.1), we get

$$x = (131.\overline{6})_{10} = (10000011.\overline{10})_2 = (83.\overline{A})_{16} \tag{A.5}$$

(for the hexadecimal system, it is customary to use the symbols 0, 1, 2, 3, 4, 5, 6, 7, 8,


9, A, B, C, D, E, F).
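
For concreteness, the following Python sketch computes the digits of a nonnegative real number with respect to a base b, truncating the fractional part after a prescribed number of digits; the function name and the truncation length are illustrative, and the trailing digits are only as accurate as double-precision arithmetic allows.

```python
def digits_in_base(x, b, num_frac_digits=10):
    """Integer digits d_N ... d_0 and the first fractional digits of x >= 0
    in base b (fractional part truncated, not rounded)."""
    int_part, frac_part = int(x), x - int(x)
    int_digits = []
    while True:                       # integer part: repeated division by b
        int_digits.append(int_part % b)
        int_part //= b
        if int_part == 0:
            break
    frac_digits = []
    for _ in range(num_frac_digits):  # fractional part: repeated multiplication by b
        frac_part *= b
        frac_digits.append(int(frac_part))
        frac_part -= int(frac_part)
    return int_digits[::-1], frac_digits

# x = 395/3 from (A.1): ([1,0,0,0,0,0,1,1], [1,0,1,0,1,0,1,0,1,0]) for b = 2
print(digits_in_base(395 / 3, 2))
```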

Definition A.3. Given a natural number b ≥ 2, an integer N ∈ Z, and a sequence


(dN , dN −1 , dN −2 , . . . ) of nonnegative integers such that dn ∈ {0, . . . , b − 1} for each
n ∈ {N, N − 1, N − 2, . . . }, the expression

$$\sum_{\nu=0}^{\infty} d_{N-\nu}\, b^{N-\nu} \tag{A.6}$$

is called a b-adic series. The number b is called the base or the radix, and the numbers
dν are called digits.

Definition A.4. If b ≥ 2 is a natural number and x ∈ R+ 0 is the value of the b-adic


series given by (A.6), then one calls the b-adic series a b-adic expansion of x.

Lemma A.5. Given a natural number b ≥ 2, consider the b-adic series given by (A.6).
Then
$$\sum_{\nu=0}^{\infty} d_{N-\nu}\, b^{N-\nu} \le b^{N+1}, \tag{A.7}$$

and, in particular, the b-adic series converges to some x ∈ R+ 0 . Moreover, equality in


(A.7) holds if, and only if, dn = b − 1 for every n ∈ {N, N − 1, N − 2, . . . }.

Proof. One estimates, using the formula for the value of a geometric series:

X ∞
X ∞
X
N −ν N −ν N 1
dN −ν b ≤ (b − 1) b = (b − 1)b b−ν = (b − 1)bN 1 = bN +1 . (A.8)
ν=0 ν=0 ν=0
1− b

Note that (A.8) also shows that equality is achieved if all dn are equal to b−1. Conversely,
if there is n ∈ {N, N − 1, N − 2, . . . } such that dn < b − 1, then there is ñ ∈ N such
that dN −ñ < b − 1 and one estimates

X ñ−1
X ∞
X
N −ν N −ν N −ñ
dN −ν b < dN −ν b + (b − 1)b + dN −ν bN −ν ≤ bN +1 , (A.9)
ν=0 ν=0 ν=ñ+1

showing that the inequality in (A.7) is strict. 



A.2 Floating-Point Numbers Arithmetic, Rounding


Numerical problems are often related to the computation of (approximations of) real
numbers. The b-adic representations of real numbers considered above, in general, need
infinitely many digits. However, computer memory can only store a finite amount of
data. This implies the need to introduce representations that use only strings of finite
length and only a finite number of symbols. Clearly, given a supply of s ∈ N symbols
and strings of length l ∈ N, one can represent a maximum of sl < ∞ numbers. The
representation of this form most commonly used to approximate real numbers is the
so-called floating-point representation:

Definition A.6. Let b ∈ N, b ≥ 2, l ∈ N, and N− , N+ ∈ Z with N− ≤ N+ . Then, for


each

(x1 , . . . , xl ) ∈ {0, 1, . . . , b − 1}l , (A.10a)


N ∈ Z satisfying N− ≤ N ≤ N+ , (A.10b)

the strings
0 . x1 x2 . . . xl · bN and − 0 . x1 x2 . . . xl · bN (A.11)
are called floating-point representations of the (rational) numbers
l
X
N
x := b xν b−ν and − x, (A.12)
ν=1

respectively, provided that x1 6= 0 if x 6= 0. For floating-point representations, b is


called the radix (or base), 0 . x1 . . . xl (i.e. |x|/bN ) is called the significand (or mantissa
or coefficient), x1 , . . . , xl are called the significant digits, and N is called the exponent.
The number l is sometimes called the precision. Let fll (b, N− , N+ ) ⊆ Q denote the set of
all rational numbers that have a floating point representation of precision l with respect
to base b and exponent between N− and N+ .

Remark A.7. To get from the floating-point representation (A.11) to the form of (A.3),
one has to shift the radix point of the significand by a number of places equal to the
value of the exponent – to the right if the exponent is positive or to the left if the
exponent is negative.

Remark A.8. If one restricts Def. A.6 to the case N− = N+ , then one obtains what
is known as fixed-point representations. However, for many applications, the required
numbers vary over sufficiently many orders of magnitude to render fixed-point represen-
tations impractical.

Remark A.9. Given the assumptions of Def. A.6, for the numbers in (A.12), it always
holds that
l
X l
X Lem. A.5
N− −1 N −1 N −ν
b ≤b ≤ xν b = |x| ≤ (b − 1) bN+ −ν =: max(l, b, N+ ) < bN+ ,
ν=1 ν=1
(A.13)
provided that x 6= 0. In other words:

fll (b, N− , N+ ) ⊆ Q ∩ [− max(l, b, N+ ), −bN− −1 ] ∪ {0} ∪ [bN− −1 , max(l, b, N+ )] . (A.14)

Obviously, fll (b, N− , N+ ) is not closed under arithmetic operations. A result of absolute
value bigger than max(l, b, N+ ) is called an overflow, whereas a nonzero result of absolute
value less than bN− −1 is called an underflow. In practice, the result of an underflow is
usually replaced by 0.

Definition A.10. Let b ∈ N, b ≥ 2, l ∈ N, N ∈ Z, and σ ∈ {−1, 1}. Then, given



$$x = \sigma\, b^N \sum_{\nu=1}^{\infty} x_\nu\, b^{-\nu}, \tag{A.15}$$
where xν ∈ {0, 1, . . . , b − 1} for each ν ∈ N and x1 ≠ 0 for x ≠ 0, define
$$\mathrm{rd}_l(x) := \begin{cases} \sigma\, b^N \sum_{\nu=1}^{l} x_\nu\, b^{-\nu} & \text{for } x_{l+1} < b/2,\\[0.5ex] \sigma\, b^N \bigl(b^{-l} + \sum_{\nu=1}^{l} x_\nu\, b^{-\nu}\bigr) & \text{for } x_{l+1} \ge b/2. \end{cases} \tag{A.16}$$

The number rdl (x) is called x rounded to l digits.

Remark A.11. We note that the notation rdl (x) of Def. A.10 is actually not entirely
correct, since rdl is actually a function of the sequence (σ, N, x1 , x2 , . . . ) rather than of
x: It can actually occur that rdl takes on different values for different representations of
the same number x (for example, using decimal notation, consider x = 0.349 = 0.350,
yielding rd1 (0.349) = 0.3 and rd1 (0.350) = 0.4). However, from basic results on b-
adic expansions of real numbers (see [Phi16a, Th. 7.99]), we know that x can have at
most two different b-adic representations and, for the same x, rdl (x) can vary at most
bN b−l . Since always writing rdl (σ, N, x1 , x2 , . . . ) does not help with readability, and since
writing rdl (x) hardly ever causes confusion as to what is meant in a concrete case, the
abuse of notation introduced in Def. A.10 is quite commonly employed.
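
The following Python sketch mimics the rounding map rd_l of Def. A.10 for a number given by its sign, exponent, and digit sequence (rather than by a machine float); the carry propagation corresponds to the two cases (A.17a)/(A.17b) distinguished in Lemma A.12 below. All names are illustrative.

```python
def rd(sigma, N, digits, l, b=10):
    """rd_l(x) of (A.16) for x = sigma * b**N * sum(digits[i] * b**(-i-1)).

    digits lists the base-b digits x_1, x_2, ...; returns (sigma, N', new_digits)."""
    d = list(digits[:l]) + [0] * max(0, l - len(digits))
    if l < len(digits) and digits[l] >= b / 2:     # x_{l+1} >= b/2: round up
        i = l - 1
        while i >= 0 and d[i] == b - 1:            # propagate the carry
            d[i] = 0
            i -= 1
        if i >= 0:
            d[i] += 1                              # case (A.17a)
        else:
            d, N = [1] + [0] * (l - 1), N + 1      # case (A.17b): N' = N + 1
    return sigma, N, d

# rd_1(0.349) = 0.3 but rd_1(0.350) = 0.4, cf. Remark A.11
print(rd(1, 0, [3, 4, 9], 1), rd(1, 0, [3, 5, 0], 1))
```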

Lemma A.12. Let b ∈ N, b ≥ 2, l ∈ N. If x ∈ R is given by (A.15), then $\mathrm{rd}_l(x) = \sigma\, b^{N'} \sum_{\nu=1}^{l} x'_\nu\, b^{-\nu}$, where $N' \in \{N, N+1\}$ and $x'_\nu \in \{0, 1, \dots, b-1\}$ for each ν ∈ N and $x'_1 \ne 0$ for x ≠ 0. In particular, $\mathrm{rd}_l$ maps $\bigcup_{k=1}^{\infty} \mathrm{fl}_k(b, N_-, N_+)$ into $\mathrm{fl}_l(b, N_-, N_+ + 1)$.

Proof. For xl+1 < b/2, there is nothing to prove. Thus, assume xl+1 ≥ b/2. Case (i): There exists ν ∈ {1, . . . , l} such that xν < b − 1. Then, letting $\nu_0 := \max\{\nu \in \{1, \dots, l\} : x_\nu < b-1\}$, one finds N ′ = N and
$$x'_\nu = \begin{cases} x_\nu & \text{for } 1 \le \nu < \nu_0,\\ x_{\nu_0} + 1 & \text{for } \nu = \nu_0,\\ 0 & \text{for } \nu_0 < \nu \le l. \end{cases} \tag{A.17a}$$

Case (ii): xν = b − 1 holds for each ν ∈ {1, . . . , l}. Then one obtains N ′ = N + 1,
$$x'_\nu := \begin{cases} 1 & \text{for } \nu = 1,\\ 0 & \text{for } 1 < \nu \le l, \end{cases} \tag{A.17b}$$

thereby concluding the proof. 

When composing floating point numbers of a fixed precision l ∈ N by means of arithmetic


operations such as ‘+’, ‘−’, ‘·’, and ‘:’, the exact result is usually not representable as a
floating point number with the same precision l: For example, 0.434 · 104 + 0.705 · 10−1 =
0.43400705 · 104 , which is not representable exactly with just 3 significant digits. Thus,
when working with floating point numbers of a fixed precision, rounding will generally
be necessary.
The following Notation A.13 makes sense for general real numbers x, y, but is intended
to be used with numbers from some fll (b, N− , N+ ), i.e. numbers given in floating point
representation (in particular, with a finite precision).

Notation A.13. Let b ∈ N, b ≥ 2. Assume x, y ∈ R are given in a form analogous to


(A.15). We then define, for each l ∈ N,

x ⋄l y := rdl (x ⋄ y), (A.18)

where ⋄ can stand for any of the operations ‘+’, ‘−’, ‘·’, and ‘:’.
S
Remark A.14. Consider x, y ∈ fll (b, N− , N+ ). If x ⋄ y ∈ ∞ k=1 flk (b, N− , N+ ), then,
according to Lem. A.12, x ⋄l y as defined in (A.18) is either in fll (b, N− , N+ ) or in
fll (b, N− , N+ + 1) with the digits given by (A.17b) (which constitutes an overflow).

Remark A.15. The definition of (A.18) should only be taken as an example of how
floating-point operations can be realized. On concrete computer systems, the rounding
operations implemented can be different from the one considered here. However, one
would still expect a corresponding version of Rem. A.14 to hold.

Caveat A.16. Unfortunately, the associative laws of addition and multiplication as well
as the law of distributivity are lost for arithmetic operations of floating-point numbers.
More precisely, even if x, y, z ∈ fll (b, N− , N+ ) and one assumes that no overflow or
underflow occurs, then there are examples, where (x+l y)+l z 6= x+l (y +l z), (x·l y)·l z 6=
x ·l (y ·l z), and x ·l (y +l z) 6= x ·l y +l x ·l z:
For l = 1, b = 10, N− = 0, N+ = 1, x = 0.1 · 101 , y = 0.4 · 100 , z = 0.1 · 100 , one has

(x +1 y) +1 z = rd1 (0.1 · 101 + 0.4 · 100 ) +1 0.1 · 100 = rd1 (0.14 · 101 ) +1 0.1 · 100
= rd1 (0.1 · 101 + 0.1 · 100 ) = 0.1 · 101 , (A.19)

but

x +1 (y +1 z) = 0.1 · 101 +1 rd1 (0.4 · 100 + 0.1 · 100 ) = rd1 (0.1 · 101 + 0.5 · 100 )
= rd1 (0.15 · 101 ) = 0.2 · 101 . (A.20)

For l = 1, b = 10, N− = 0, N+ = 1, x = 0.8 · 101 , y = 0.3 · 100 , z = 0.5 · 100 , one has

(x ·1 y) ·1 z = rd1 (0.8 · 101 · 0.3 · 100 ) ·1 0.5 · 100 = rd1 (0.24 · 101 ) ·1 0.5 · 100
= rd1 (0.2 · 101 · 0.5 · 100 ) = 0.1 · 101 , (A.21)

but

x ·1 (y ·1 z) = 0.8 · 101 ·1 rd1 (0.3 · 100 · 0.5 · 100 ) = 0.8 · 101 ·1 rd1 (0.15 · 100 )
= rd1 (0.8 · 101 · 0.2 · 100 ) = 0.2 · 101 . (A.22)

For l = 1, b = 10, N− = 0, N+ = 0, x = 0.5 · 100 , y = z = 0.3 · 100 , one has

x ·1 (y +1 z) = 0.5 · 100 ·1 rd1 (0.3 · 100 + 0.3 · 100 )


= 0.5 · 100 ·1 0.6 · 100 = 0.3 · 100 , (A.23)

but

x ·l y +l x ·l z = rd1 (0.5 · 100 · 0.3 · 100 ) +1 rd1 (0.5 · 100 · 0.3 · 100 )
= 0.2 · 100 +1 0.2 · 100 = 0.4 · 100 . (A.24)
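
The same loss of associativity is easily reproduced in ordinary double-precision (binary, l = 53) arithmetic; the particular values below are chosen only to exhibit the effect.

```python
x, y, z = 1.0, 1e-16, 1e-16
print((x + y) + z == x + (y + z))   # False: 1.0 + 1e-16 already rounds to 1.0
print((x + y) + z, x + (y + z))     # the two sums differ in the last place
```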

Lemma A.17. Let b ∈ N, b ≥ 2. Suppose



X ∞
X
N −ν M
x=b xν b , y=b yν b−ν , (A.25)
ν=1 ν=1

where N, M ∈ Z; xν , yν ∈ {0, 1, . . . , b − 1} for each ν ∈ N; and x1 , y1 6= 0.



(a) If N > M , then x ≥ y.

(b) If N = M and there is n ∈ N such that xn > yn and xν = yν for each ν ∈


{1, . . . , n − 1}, then x ≥ y.

Proof. (a): One estimates



X ∞
X ∞
X
N −ν M −ν N −1 N −1
x−y =b xν b −b yν b ≥b −b (b − 1)b−ν
ν=1 ν=1 ν=1
 
1
= bN −1 − bN −1 (b − 1) −1 = 0. (A.26a)
1 − b−1

(b): One estimates



X ∞
X ∞
X ∞
X
N −ν M −ν N −ν N
x−y =b xν b −b yν b =b xν b −b yν b−ν
ν=1 ν=1 ν=n ν=n

X as in (A.26a)
≥ bN −n − bN −n (b − 1)b−ν = 0, (A.26b)
ν=1

concluding the proof of the lemma. 

Lemma A.18. Let b ∈ N, b ≥ 2, l ∈ N. Then, for each x, y ∈ R:

(a) rdl (x) = sgn(x) rdl (|x|).

(b) 0 ≤ x < y implies 0 ≤ rdl (x) ≤ rdl (y).

(c) 0 ≥ x > y implies 0 ≥ rdl (x) ≥ rdl (y).

Proof. (a): If x is given by (A.15), one obtains from (A.16):



!
X
rdl (x) = σ rdl bN xν b−ν = sgn(x) rdl (|x|).
ν=1

(b): Suppose x and y are given as in (A.25). Then Lem. A.17(a) implies N ≤ M .
Case xl+1 < b/2: In this case, according to (A.16),
l
X
N
rdl (x) = b xν b−ν . (A.27)
ν=1

For N < M , we estimate


l
X l
X
Lem. A.17(a)
M −ν N
rdl (y) ≥ b yν b ≥ b xν b−ν = rdl (x), (A.28)
ν=1 ν=1

which establishes the case. We claim that (A.28) also holds for N = M , now due to
Lem. A.17(b): This is clear if xν = yν for each ν ∈ {1, . . . , l}. Otherwise, define

n := min ν ∈ {1, . . . , l} : xν 6= yν . (A.29)

Then x < y and Lem. A.17(b) yield xn < yn . Moreover, another application of Lem.
A.17(b) then implies (A.28) for this case.
Case xl+1 ≥ b/2: In this case, according to Lem. A.12,
l
X
N′
rdl (x) = b x′ν b−ν , (A.30)
ν=1

where either N ′ = N and the x′ν are given by (A.17a), or N ′ = N + 1 and the x′ν are
given by (A.17b). For N < M , we obtain
l
X l
X
(∗) ′
rdl (y) ≥ bM yν b−ν ≥ bN x′ν b−ν = rdl (x), (A.31)
ν=1 ν=1

where, for N ′ < M , (∗) holds by Lem. A.17(a), and, for N ′ = N + 1 = M , (∗) holds
by (A.17b). It remains to consider N = M . If xν = yν for each ν ∈ {1, . . . , l}, then
x < y and Lem. A.17(b) yield b/2 ≤ xl+1 ≤ yl+1 , which, in turn, yields rdl (y) = rdl (x).
Otherwise, once more define n according to (A.29). As before, x < y and Lem. A.17(b)
yield xn < yn . From Lem. A.12, we obtain N ′ = N and that the x′ν are given by (A.17a)
with ν0 ≥ n. Then the values from (A.17a) show that (A.31) holds true once again.
(c) follows by combining (a) and (b). 
Definition and Remark A.19. Let b ∈ N, b ≥ 2, l ∈ N, and N− , N+ ∈ Z with
N− ≤ N+ . If there exists a smallest positive number ǫ ∈ fll (b, N− , N+ ) such that

1 +l ǫ 6= 1, (A.32)

then it is called the relative machine precision. We will show that, for N+ < −l + 1, one
has 1 +l y = 1 for every 0 < y ∈ fll (b, N− , N+ ), such that there is no positive number in
fll (b, N− , N+ ) satisfying (A.32), whereas, for N+ ≥ −l + 1:
$$\epsilon = \begin{cases} b^{N_- - 1} & \text{for } -l+1 < N_-,\\ \lceil b/2 \rceil\, b^{-l} & \text{for } -l+1 \ge N_- \end{cases}$$

(for x ∈ R, ⌈x⌉ := min{k ∈ Z : x ≤ k} is called ceiling of x or x rounded up):


First, one notices that 0 < y < ⌈b/2⌉b−l implies

X
1
1+y =b yi b−i , (A.33)
i=1

where 

 =1 for i = 1,

< ⌈b/2⌉ for i = l + 1,
yi (A.34)
= 0
 for 1 < i < l + 1,


∈ {0, . . . , b − 1} for l + 1 < i.
As yl+1 < b/2, the definition of rounding in (A.16) yields:
l
X
1
1 +l y = rdl (1 + y) = b yi b−i = b1 b−1 = 1, (A.35)
i=1

From Rem. A.9, we know



fll (b, N− , N+ ) ⊆ Q ∩ ] − bN+ , −bN− −1 ] ∪ {0} ∪ [bN− −1 , bN+ [ . (A.36)

In particular, every element in fll (b, N− , N+ ) is less than bN+ .


If N+ < −l + 1, then bN+ ≤ b−l < ⌈b/2⌉b−l . Thus, using what we have shown above,
1 +l y = 1 for each 0 < y ∈ fll (b, N− , N+ ).
Now let N+ ≥ −l + 1.
One notices next that
l+1
X
−l 1 1
1 + ⌈b/2⌉b = 0 . x1 . . . xl+1 · b = b xi b−i , (A.37)
i=1

where 

1 for i = 1,
xi := ⌈b/2⌉ for i = l + 1, (A.38)


0 6 1, l + 1.
for i =
Using (A.37) and the definition of rounding in (A.16), we have

1 +l ⌈b/2⌉b−l = rdl 1 + ⌈b/2⌉b−l = 1 + b−l+1 > 1. (A.39)

Now let −l + 1 < N− . Due to (A.14), bN− −1 is the smallest positive number contained
in fll (b, N− , N+ ). In addition, due to bN− −1 ≥ b−l+1 > ⌈b/2⌉b−l , we have
Lem. A.18(b)  (A.39)
1 +l bN− −1 = rdl (1 + bN− −1 ) ≥ rdl 1 + ⌈b/2⌉b−l = 1 + b−l+1 > 1, (A.40)
implying ǫ = bN− −1 .
For −l + 1 ≥ N− , we have ⌈b/2⌉b−l ∈ fll (b, N− , N+ ) due to ⌈b/2⌉b−l = 0.⌈b/2⌉ · b−l+1
(where N+ ≥ −l + 1 was used as well). As we have already shown that 1 +l y = 1 for
each 0 < y < ⌈b/2⌉b−l , ǫ = ⌈b/2⌉b−l is now a consequence of (A.39).
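
For IEEE double precision (b = 2, l = 53, rounding to nearest with ties to even instead of the round-half-up rule (A.16)), the analogue of the defining property (A.32) can be probed directly; note that, because of the ties-to-even rule, the loop below returns 2^(-52) rather than the value ⌈b/2⌉ b^(-l) = 2^(-53) predicted by the computation above.

```python
eps = 1.0
while 1.0 + eps / 2 != 1.0:   # keep halving while the sum is still distinguishable
    eps /= 2
print(eps)                    # 2.220446049250313e-16 = 2**(-52)
```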

A.3 Rounding Errors


Definition A.20. Let v ∈ R be the exact value and let a ∈ R be an approximation for
v. The numbers
$$e_a := |v - a|, \qquad e_r := \frac{e_a}{|v|} \tag{A.41}$$
are called the absolute error and the relative error, respectively, where the relative error
is only defined for v 6= 0. It can also be useful to consider variants of the absolute and
relative error, respectively, where one does not take the absolute value. Thus, in the
literature, one finds the definitions with and without the absolute value.
Proposition A.21. Let b ∈ N be even, b ≥ 2, l ∈ N, N ∈ Z, and σ ∈ {−1, 1}. Suppose
x ∈ R is given by (A.15), i.e.
X∞
N
x = σb xν b−ν , (A.42)
ν=1

where xν ∈ {0, 1, . . . , b − 1} for each ν ∈ N and x1 6= 0 for x 6= 0,

(a) The absolute error of rounding to l digits satisfies
$$e_a(x) = \bigl|\mathrm{rd}_l(x) - x\bigr| \le \frac{b^{N-l}}{2}.$$
(b) The relative error of rounding to l digits satisfies, for each x ≠ 0,
$$e_r(x) = \frac{\bigl|\mathrm{rd}_l(x) - x\bigr|}{|x|} \le \frac{b^{-l+1}}{2}.$$

(c) For each x ≠ 0, one also has the estimate
$$\frac{\bigl|\mathrm{rd}_l(x) - x\bigr|}{|\mathrm{rd}_l(x)|} \le \frac{b^{-l+1}}{2}.$$

Proof. (a): First, consider the case xl+1 < b/2: One computes

X

ea (x) = rdl (x) − x = −σ rdl (x) − x = bN xν b−ν
ν=l+1

X
= bN −l−1 xl+1 + bN xν b−ν
ν=l+2
 
b even, (A.7) b bN −l
≤ bN −l−1 − 1 + bN −l−1 = . (A.43a)
2 2
It remains to consider the case xl+1 ≥ b/2. In that case, one obtains

X
 N −l N −l−1 N
σ rdl (x) − x = b − b xl+1 b −b xν b−ν
ν=l+2

X bN −l
= bN −l−1 (b − xl+1 ) − bN xν b−ν ≤ . (A.43b)
ν=l+2
2
P∞
Due to 1 ≤ b − xl+1 , one has bN −l−1 ≤ bN −l−1 (b − x
 l+1 ). Therefore, b
N −ν
ν=l+2 xν b ≤
bN −l−1 together with (A.43b) implies σ rdl (x) − x ≥ 0, i.e. ea (x) = σ rdl (x) − x .
(b): Since x 6= 0, one has x1 ≥ 1. Thus, |x| ≥ bN −1 . Then (a) yields

ea bN −l b−N +1 b−l+1
er = ≤ = . (A.44)
|x| 2 2

(c): Again, x1 ≥ 1 as x 6= 0. Thus, (A.16) implies | rdl (x)| ≥ bN −1 . This time, (a) yields

ea bN −l b−N +1 b−l+1
≤ = , (A.45)
| rdl (x)| 2 2
establishing the case and completing the proof of the proposition. 
Corollary A.22. In the situation of Prop. A.21, let, for x ≠ 0:
$$\epsilon_l(x) := \frac{\mathrm{rd}_l(x) - x}{x}, \qquad \eta_l(x) := \frac{\mathrm{rd}_l(x) - x}{\mathrm{rd}_l(x)}. \tag{A.46}$$
Then
$$\max\bigl\{|\epsilon_l(x)|, |\eta_l(x)|\bigr\} \le \frac{b^{-l+1}}{2} =: \tau_l. \tag{A.47}$$
The number τ l is called the relative computing precision of floating-point arithmetic with
l significant digits. 

Remark A.23. From the definitions of ǫl (x) and η l (x) in (A.46), one immediately
obtains the relations
$$\mathrm{rd}_l(x) = x\bigl(1 + \epsilon_l(x)\bigr) = \frac{x}{1 - \eta_l(x)} \qquad\text{for } x \ne 0. \tag{A.48}$$
In particular, according to (A.18), one has for floating-point operations (for x ⋄ y ≠ 0):
$$x \diamond_l y = \mathrm{rd}_l(x \diamond y) = (x \diamond y)\bigl(1 + \epsilon_l(x \diamond y)\bigr) = \frac{x \diamond y}{1 - \eta_l(x \diamond y)}. \tag{A.49}$$

One can use the formulas of Rem. A.23 to perform what is sometimes called a forward
analysis of the rounding error. This technique is illustrated in the next example.
Example A.24. Let x := 0.9995 · 100 and y := −0.9984 · 100 . Computing the sum with
precision 3 yields

rd3 (x) +3 rd3 (y) = 0.100 · 101 +3 (−0.998 · 100 ) = rd3 (0.2 · 10−2 ) = 0.2 · 10−2 .

Letting ǫ := ǫ3 (rd3 (x) + rd3 (y)), applying the formulas of Rem. A.23 provides:
(A.49) 
rd3 (x) +3 rd3 (y) = rd3 (x) + rd3 (y) (1 + ǫ)
(A.48) 
= x(1 + ǫ3 (x)) + y(1 + ǫ3 (y)) (1 + ǫ)
= (x + y) + ea , (A.50)

where  
ea = x ǫ + ǫ3 (x)(1 + ǫ) + y ǫ + ǫ3 (y)(1 + ǫ) . (A.51)
Using (A.46) and plugging in the numbers yields:

rd3 (x) +3 rd3 (y) − rd3 (x) + rd3 (y) 0.002 − 0.002
ǫ= = = 0,
rd3 (x) + rd3 (y) 0.002
1 − 0.9995 1
ǫ3 (x) = = = 0.00050 . . . ,
0.9995 1999
−0.998 + 0.9984 1
ǫ3 (y) = =− = −0.00040 . . . ,
−0.9984 2496
ea = xǫ3 (x) + yǫ3 (y) = 0.0005 + 0.0004 = 0.0009.

The corresponding relative error is
$$e_r = \frac{e_a}{|x+y|} = \frac{0.0009}{0.0011} = \frac{9}{11} = 0.\overline{81}.$$

Thus, er is much larger than both er (x) and er (y). This is an example of subtractive
cancellation of digits, which can occur when subtracting numbers that are almost iden-
tical. If possible, such situations should be avoided in practice (cf. Examples A.25 and
A.26 below).

Example A.25. Let us generalize the situation of Example A.24 to general x, y ∈ R\{0}
and a general precision l ∈ N. The formulas in (A.50) and (A.51) remain valid if one
replaces 3 by l. Moreover, with b = 10 and l ≥ 2, one obtains from (A.47):

|ǫ| ≤ 0.5 · 10−l+1 ≤ 0.5 · 10−1 = 0.05.

Thus,  
|ea | ≤ |x| ǫ + 1.05|ǫl (x)| + |y| ǫ + 1.05|ǫl (y)| ,
and
|ea | |x|  |y| 
er = ≤ |ǫ| + 1.05|ǫl (x)| + |ǫ| + 1.05|ǫl (y)| .
|x + y| |x + y| |x + y|
One can now distinguish three (not completely disjoint) cases:

(a) If |x + y| < max{|x|, |y|} (in particular, if sgn(x) = − sgn(y)), then er is typically
larger (potentially much larger, as in Example A.24) than |ǫl (x)| and |ǫl (y)|. Sub-
tractive cancellation falls into this case. If possible, this should be avoided (cf.
Example A.26 below).

(b) If sgn(x) = sgn(y), then |x + y| = |x| + |y|, implying



er ≤ |ǫ| + 1.05 max |ǫl (x)|, |ǫl (y)|

i.e. the relative


 error is at most of the same order of magnitude as the number
|ǫ| + max |ǫl (x)|, |ǫl (y)| .

(c) If |y| ≪ |x| (resp. |x| ≪ |y|), then the bound for er is predominantly determined by
|ǫl (x)| (resp. |ǫl (y)|) (error dampening).

As the following Example A.26 illustrates, subtractive cancellation can often be avoided
by rearranging an expression into a mathematically equivalent formula that is numeri-
cally more stable.

Example A.26. Given a, b, c ∈ R satisfying a 6= 0 and 4ac ≤ b2 , the quadratic equation

ax2 + bx + c = 0

has the solutions
$$x_1 = \frac{1}{2a}\Bigl(-b - \mathrm{sgn}(b)\sqrt{b^2 - 4ac}\Bigr), \qquad x_2 = \frac{1}{2a}\Bigl(-b + \mathrm{sgn}(b)\sqrt{b^2 - 4ac}\Bigr).$$
If |4ac| ≪ b2 , then, for the computation of x2 , one is in the situation of Example A.25(a),
i.e. the formula is numerically unstable. However, due to x1 x2 = c/a, one can use the
equivalent formula
$$x_2 = \frac{2c}{-b - \mathrm{sgn}(b)\sqrt{b^2 - 4ac}},$$
where subtractive cancellation can not occur.
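
A quick Python comparison of the two formulas for x2 illustrates the effect; the coefficients below are chosen only so that |4ac| ≪ b², and the printed values show that the rearranged formula retains full accuracy while the original one does not.

```python
import math

a, b, c = 1.0, 1e8, 1.0                                   # |4ac| << b^2
root = math.copysign(math.sqrt(b * b - 4 * a * c), b)     # sgn(b) * sqrt(b^2 - 4ac)
x2_unstable = (-b + root) / (2 * a)                       # subtractive cancellation
x2_stable = (2 * c) / (-b - root)                         # formula from Example A.26
print(x2_unstable, x2_stable)   # the first value has lost most significant digits
```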

B Operator Norms and Natural Matrix Norms


Theorem B.1. For a linear map A : X −→ Y between two normed vector spaces
(X, k · k) and (Y, k · k) over K, the following statements are equivalent:

(a) A is bounded.

(b) kAk < ∞.

(c) A is Lipschitz continuous.

(d) A is continuous.

(e) There is x0 ∈ X such that A is continuous at x0 .

Proof. Since every Lipschitz continuous map is continuous and since every continuous
map is continuous at every point, “(c) ⇒ (d) ⇒ (e)” is clear.
“(e) ⇒ (c)”: Let x0 ∈ X be such that A is continuous at x0 . Thus, for each ǫ > 0, there
is δ > 0 such that kx − x0 k < δ implies kAx − Ax0 k < ǫ. As A is linear, for each x ∈ X
with kxk < δ, one has kAxk = kA(x + x0 ) − Ax0 k < ǫ, due to kx + x0 − x0 k = kxk < δ.
Moreover, one has k(δx)/2k ≤ δ/2 < δ for each x ∈ X with kxk ≤ 1. Letting L := 2ǫ/δ,
this means that kAxk = kA((δx)/2)k/(δ/2) < 2ǫ/δ = L for each x ∈ X with kxk ≤ 1.
Thus, for each x, y ∈ X with x ≠ y, one has
$$\|Ax - Ay\| = \|A(x - y)\| = \|x - y\|\, \left\|A\!\left(\frac{x - y}{\|x - y\|}\right)\right\| < L\, \|x - y\|. \tag{B.1}$$

Together with the fact that kAx − Ayk ≤ Lkx − yk is trivially true for x = y, this shows
that A is Lipschitz continuous.

“(c) ⇒ (b)”: As A is Lipschitz continuous, there exists L ∈ R+


0 such that kAx − Ayk ≤
L kx − yk for each x, y ∈ X. Considering the special case y = 0 and kxk = 1 yields
kAxk ≤ L kxk = L, implying kAk ≤ L < ∞.
“(b) ⇒ (c)”: Let kAk < ∞. We will show

kAx − Ayk ≤ kAk kx − yk for each x, y ∈ X. (B.2)

For x = y, there is nothing to prove. Thus, let x 6= y. One computes


 
kAx − Ayk x − y
= A ≤ kAk (B.3)
kx − yk kx − yk

x−y
as kx−yk = 1, thereby establishing (B.2).
“(b) ⇒ (a)”: Let kAk < ∞ and let M ⊆ X be bounded. Then there is r > 0 such that
M ⊆ Br (0). Moreover, for each 0 6= x ∈ M :
 
kAxk x ≤ kAk
= A (B.4)
kxk kxk

x
as kxk = 1. Thus kAxk ≤ kAkkxk ≤ rkAk, showing that A(M ) ⊆ B rkAk (0). Thus,
A(M ) is bounded, thereby establishing the case.
“(a) ⇒ (b)”: Since A is bounded, it maps the bounded set B 1 (0) ⊆ X into some
bounded subset of Y . Thus, there is r > 0 such that A(B 1 (0)) ⊆ Br (0) ⊆ Y . In
particular, kAxk < r for each x ∈ X satisfying kxk = 1, showing kAk ≤ r < ∞. 

Theorem B.2. Let X and Y be normed vector spaces over K.

(a) The operator norm does, indeed, constitute a norm on the set of bounded linear
maps L(X, Y ).

(b) If A ∈ L(X, Y ), then kAk is the smallest Lipschitz constant for A, i.e. kAk is a
Lipschitz constant for A and kAx − Ayk ≤ L kx − yk for each x, y ∈ X implies
kAk ≤ L.

Proof. (a): If A = 0, then, in particular, Ax = 0 for each x ∈ X with kxk = 1, implying


kAk = 0. Conversely, kAk = 0 implies Ax = 0 for each x ∈ X with kxk = 1. But then
Ax = kxk A(x/kxk) = 0 for every 0 6= x ∈ X, i.e. A = 0. Thus, the operator norm is
positive definite. If A ∈ L(X, Y ), λ ∈ R, and x ∈ X, then

(λA)x = λ(Ax) = |λ| Ax , (B.5)

yielding
 
kλAk = sup k(λA)xk : x ∈ X, kxk = 1 = sup |λ| kAxk : x ∈ X, kxk = 1

= |λ| sup kAxk : x ∈ X, kxk = 1 = |λ| kAk, (B.6)
showing that the operator norm is homogeneous of degree 1. Finally, if A, B ∈ L(X, Y )
and x ∈ X, then
k(A + B)xk = kAx + Bxk ≤ kAxk + kBxk, (B.7)
yielding

kA + Bk = sup k(A + B)xk : x ∈ X, kxk = 1

≤ sup kAxk + kBxk : x ∈ X, kxk = 1
 
≤ sup kAxk : x ∈ X, kxk = 1 + sup kBxk : x ∈ X, kxk = 1
= kAk + kBk, (B.8)
showing that the operator norm also satisfies the triangle inequality, thereby completing
the verification that it is, indeed, a norm.
(b): To see that kAk is a Lipschitz constant for A, we have to show
kAx − Ayk ≤ kAk kx − yk for each x, y ∈ X. (B.9)
For x = y, there is nothing to prove. Thus, let x 6= y. One computes
 
kAx − Ayk x − y
= A ≤ kAk
kx − yk kx − yk

x−y
as kx−yk = 1, thereby establishing (B.9). Now let L ∈ R+0 be such that kAx − Ayk ≤
L kx − yk for each x, y ∈ X. Specializing to y = 0 and kxk = 1 implies kAxk ≤ L kxk =
L, showing kAk ≤ L. 

C Integration

C.1 Km -Valued Integration


In the proof of the well-conditionedness of a continuously differentiable function f :
U −→ Rm , U ⊆ Rn , in Th. 2.47, we make use of Rm -valued integrals. In particular, in the estimate (2.42), we use that $\bigl\|\int_0^1 f\bigr\| \le \int_0^1 \|f\|$ for each norm on Rm . While this
is true for vector-valued integrals in general, the goal here is only to provide a proof
for our special situation (except that we also allow Cm -valued integrals in the following
treatment). For a general treatment of vector-valued integrals, see, for example, [Alt06,
Sec. A1] or [Yos80, Sec. V.5].

Definition C.1. Let A ⊆ Rn be (Lebesgue) measurable, m, n ∈ N.

(a) A function f : A −→ Km is called (Lebesgue) measurable (respectively, (Lebesgue)


integrable) if, and only if, each coordinate function fj = πj ◦ f : A −→ K, j =
1, . . . , m, is (Lebesgue) measurable (respectively, (Lebesgue) integrable), which, for
K = C, means if, and only if, each Re fj and each Im fj , j = 1, . . . , m, is (Lebesgue)
measurable (respectively, (Lebesgue) integrable).

(b) If f : A −→ Km is integrable, then
$$\int_A f := \left(\int_A f_1, \dots, \int_A f_m\right) \in \mathbb{K}^m \tag{C.1}$$

is the (Km -valued) (Lebesgue) integral of f over A.

Remark C.2. The linearity of the K-valued integral implies the linearity of the Km -
valued integral.

Theorem C.3. Let A ⊆ Rn be measurable, m, n ∈ N. Then f : A −→ Km is measurable


in the sense of Def. C.1(a) if, and only if, f −1 (O) is measurable for each open subset O
of Km .

Proof. Assume f −1 (O) is measurable for each open subset O of Km . Let j ∈ {1, . . . , m}.
If Oj ⊆ K is open in K, then O := πj−1 (Oj ) = {x ∈ Km : xj ∈ Oj } is open in Km .
Thus, fj−1 (Oj ) = f −1 (O) is measurable, showing that each fj is measurable, i.e. f is
measurable. Now assume f is measurable, i.e. each fj is measurable. Since every open
O ⊆ Km is a countable union of open sets of the form O = O1 × · · · × Om with each Oj
being an open subset of K, it suffices to show that the
Tm preimages of such open sets are
−1
measurable. So let O be as above. Then f (O) = j=1 fj (Oj ), showing that f −1 (O)
−1

is measurable. 

Corollary C.4. Let A ⊆ Rn be measurable, m, n ∈ N. If f : A −→ Km is measurable,


then kf k : A −→ R is measurable.

Proof. If O ⊆ R is open, then k · k−1 (O) is an open subset


 of Km by the continuity of
the norm. In consequence, kf k−1 (O) = f −1 k · k−1 (O) is measurable. 

Theorem C.5. Let A ⊆ Rn be measurable, m, n ∈ N. For each norm k · k on Km and


each integrable f : A −→ Km , the following holds:
$$\left\|\int_A f\right\| \le \int_A \|f\|. \tag{C.2}$$

Proof. First assume that B ⊆ A is measurable, y ∈ Km , and f = y χB , where χB is the


characteristic function of B (i.e. the fj are yj on B and 0 on A \ B). Then
Z Z

f =
kf k,
y1 λn (B), . . . , ym λn (B) = λn (B)kyk = (C.3)
A A

where λn denotes n-dimensional Lebesgue measure on Rn . Next, consider the case


that f is a so-called simple function, that means f takes only finitely many values
y1 , . . . , yN ∈ Km , N ∈ N, and each preimage Bj := f −1 {yj } ⊆ Rn is measurable. Then
PN
f = j=1 yj χBj , where, without loss of generality, we may assume that the Bj are
pairwise disjoint. We obtain
Z XN Z X N Z Z X N

f ≤ yj χ B = yj χ B = yj χ B
j j j
A j=1 A j=1 A A j=1

Z X
N Z
(∗)
= yj χ B j = kf k, (C.4)
A j=1
A

where, at (∗), it was used that, as the Bj are disjoint, the integrands of the two integrals
are equal at each x ∈ A.
Now, if f is integrable, then each Re fj and each Im fj , j ∈ {1, . . . , m}, is integrable
(i.e., for each j ∈ {1, . . . , m}, Re fj , Im fj ∈ L1 (A)) and there exist sequences of simple
functions φj,k : A −→ R and ψj,k : A −→ R such that limk→∞ kφj,k − Re fj kL1 (A) =
limk→∞ kψj,k − Im fj kL1 (A) = 0. In particular, for each j ∈ {1, . . . , m},
Z Z

0 ≤ lim φj,k − Re fj ≤ lim kφj,k − Re fj kL1 (A) = 0, (C.5a)
k→∞ A A k→∞
Z Z

0 ≤ lim ψj,k − Im fj ≤ lim kψj,k − Im fj kL1 (A) = 0, (C.5b)
k→∞ A A k→∞

and also, for each j ∈ {1, . . . , m},

0 ≤ lim kφj,k + iψj,k − fj kL1 (A)


k→∞
≤ lim kφj,k − Re fj kL1 (A) + lim kψj,k − Im fj kL1 (A) = 0. (C.6)
k→∞ k→∞

Thus, letting, for each k ∈ N, φk := (φ1,k , . . . , φm,k ) and ψk := (ψ1,k , . . . , ψm,k ), we



obtain
Z Z Z 

f = f , . . . , f
1 m
A A A
 Z Z Z Z 

= lim
k→∞ φ 1,k + i lim ψ 1,k , . . . , lim φ m,k + i lim ψ m,k


A k→∞ A k→∞ A k→∞ A
Z Z Z
(∗)
= lim (φ k + i ψ k ) ≤ lim
kφk + iψk k = kf k, (C.7)
k→∞ A k→∞ A A

where the equality at (∗) holds due to limk→∞ kφk + iψk k − kf k L1 (A) = 0, which, in
turn, is verified by
Z Z Z

0≤ kφk + iψk k − kf k ≤ kφk + iψk − f k ≤ C kφk + iψk − f k1
A A A
Z X
m

=C φj,k + iψj,k − fj → 0 for k → ∞, (C.8)
A j=1

with C ∈ R+ since the norms k · k and k · k1 are equivalent on Km .

C.2 Continuity
Theorem C.6. Let a, b ∈ R, a < b. If f : [a, b] −→ R is (Lebesgue-) integrable, then
$$F : [a, b] \longrightarrow \mathbb{R}, \qquad F(t) := \int_a^t f, \tag{C.9}$$

is continuous.

Proof. We show F to be continuous at each t ∈ [a, b]: Let (tk )k∈N be a sequence in [a, b] such that limk→∞ tk = t. Let fk := f χ[a,tk ] . Then fk → f χ[a,t] (pointwise everywhere, except, possibly, at t) and the dominated convergence theorem (DCT) yields (note |fk | ≤ |f |)
$$\lim_{k\to\infty} F(t_k) = \lim_{k\to\infty} \int_a^{t_k} f \overset{\text{DCT}}{=} \int_a^t f = F(t),$$
proving the continuity of F at t. 

D Divided Differences
The present section provides some additional results regarding the divided differences
of Def. 3.16:

Theorem D.1 (Product Rule). Let F be a field and consider the situation of Def. 3.16.
Moreover, for each j ∈ {0, . . . , r}, m ∈ {0, . . . , mj − 1}, let $y_{f,j}^{(m)}, y_{g,j}^{(m)}, y_{h,j}^{(m)} \in F$ be such that
$$y_{h,j}^{(m)} = \sum_{k=0}^{m} \binom{m}{k}\, y_{f,j}^{(m-k)}\, y_{g,j}^{(k)}. \tag{D.1}$$
Let Pf , Pg , Ph ∈ F [X]n be the interpolating polynomials corresponding to the case, where
(m) (m) (m) (m)
the yj of Def. 3.16 are replaced by yf,j , yg,j , yh,j , respectively, with

[x0 , . . . , xn ]f , [x0 , . . . , xn ]g , [x0 , . . . , xn ]h

denoting the corresponding divided differences. Then one has


$$[x_0, \dots, x_n]_h = \sum_{i=0}^{n} [x_0, \dots, x_i]_f\, [x_i, \dots, x_n]_g \tag{D.2}$$

(m) (m) (m)


(thus, in particular, the product rule holds if yf,j = f (m) (xj ), yg,j = g (m) (xj ), yh,j =
h(m) (xj ), and h = f g, where f, g ∈ F [X] or F = R with f, g : R −→ R differentiable
sufficiently often).

Proof. According to (3.20) and (b), we have


n
X n
X
Pf = [x0 , . . . , xi ]f ωf,i , Pg = [xj , . . . , xn ]g ωg,j ,
i=0 j=0

where, for each i, j ∈ {0, 1, . . . , n},


i−1 j+1 n
Y Y Y
ωf,i := (X − xk ), ωg,j := (X − xk ) = (X − xk ).
k=0 k=n k=j+1

Consider the polynomials


n
X
Pf Pg = [x0 , . . . , xi ]f [xj , . . . , xn ]g ωf,i ωg,j ∈ F [X]2n
i,j=0

and n
X
Q := [x0 , . . . , xi ]f [xj , . . . , xn ]g ωf,i ωg,j ∈ F [X]n ,
i,j=0,
i≤j

where the degree estimates for Pf Pg and Q are due to


(
2n,
∀ deg(ωf,i ωg,j ) = i + n − j ≤
i,j∈{0,...,n} n for i ≤ j.
We will show Q = Ph , but, first, we note, for each j ∈ {0, . . . , r}, m ∈ {0, . . . , mj − 1},
m  
(m) Prop. 3.7(d) X m (m−k)
(Pf Pg ) (x̃j ) = Pf (x̃j )Pg(k) (x̃j )
k=0
k
Xm  
m (m−k) (k) (D.1) (m)
= yf,j yg,j = yh,j . (D.3)
k=0
k
Q
If i > j, then ωf,i ωg,j has the divisor (i.e. the factor) nk=0 (X − xk ), i.e.
∀ ∀ (ωf,i ωg,j )(x̃α ) = 0,
α∈{0,...,r} m∈{0,...,mα −1}

implying
(D.3) (m)
∀ ∀ Q(x̃α ) = (Pf Pg )(m) (x̃α ) = yh,α ,
α∈{0,...,r} m∈{0,...,mα −1}

and, thus, Q = Ph . In Q, the summands with deg(ωf,i ωg,j ) = n are precisely those with
i = j, thereby proving (D.2). 

The following result is proved under the additional assumption that x0 , . . . , xn ∈ F are
all distinct:
Proposition D.2. Let F be a field and consider the situation of Def. 3.16 with r = n ∈
N0 , i.e. with x̃0 , . . . , x̃n ∈ F all being distinct. Moreover, let (x0 , . . . , xn ) := (x̃0 , . . . , x̃n ).
Then, according to Def. 3.12, P [x0 , . . . , xn ] denotes the unique polynomial f ∈ P [X]n
that, given y0 , . . . , yn ∈ F , satisfies
(0)
∀ f (xj ) = yj := yj ; (D.4)
j∈{0,...,n}

and, according to Def. 3.16, [x0 , . . . , xn ] denotes its coefficient of X n . Then


$$[x_0, \dots, x_n] = \sum_{i=0}^{n} y_i \prod_{\substack{m=0\\ m \ne i}}^{n} \frac{1}{x_i - x_m}. \tag{D.5}$$

Proof. The proof is carried out by induction on n ∈ N0 : For n = 0, one has


0
X 0
Y 1
yi = y0 = [x0 ],
i=0 m=0
xi − xm
m6=i

as desired. Thus, let n > 0.



Claim 1. It suffices to show that the following identity holds true:


 
n
X Y n n n n−1 n−1
1 1 X Y 1 X Y 1 
yi =  yi − yi . (D.6)
i=0
x − xm
m=0 i
xn − x0 i=1 m=1 xi − xm i=0
x − xm
m=0 i
m6=i m6=i m6=i

Proof. One computes as follows (where (D.5) is used, it holds by induction):


 
n
X Y n n n n−1 n−1
1 (D.6) 1 X Y 1 X Y 1 
yi =  yi − yi 
i=0
x − xm
m=0 i
xn − x0 i=1 m=1 xi − xm i=0
x − xm
m=0 i
m6=i m6=i m6=i
(D.5) 1 
= [x1 , . . . , xn ] − [x0 , . . . , xn−1 ]
xn − x0
(3.23b)
= [x0 , . . . , xn ],

showing that (D.6) does, indeed, suffice. N

So it remains to verify (D.6). To this end, we show that, for each i ∈ {0, . . . , n}, the
coefficients of yi on both sides of (D.6) are identical.
Case i = n: Since yn occurs only in first sum on the right-hand side of (D.6), one has
to show that n n
Y 1 1 Y 1
= . (D.7)
m=0
xn − xm xn − x0 m=1 xn − xm
m6=n m6=n

However, (D.7) is obviously true.


Case i = 0: Since y0 occurs only in second sum on the right-hand side of (D.6), one has
to show that
Yn n−1
Y
1 1 1
=− . (D.8)
m=0
x 0 − x m x n − x 0 m=0
x 0 − x m
m6=0 m6=0

Once again, (D.8) is obviously true.


Case 0 < i < n: In this case, one has to show
 
Yn n n−1
1 1 Y 1 Y 1 
=  − . (D.9)
m=0 i
x − xm xn − x0 m=1 xi − xm m=0 xi − xm
m6=i m6=i m6=i

Multiplying (D.9) by
n
Y
(xi − xm )
m=0
m6=i,0,n

shows that it is equivalent to


 
1 1 1 1 1
= − ,
xi − x0 xi − xn xn − x0 xi − xn xi − x0
which is also clearly valid. 
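
As a small numerical check of Prop. D.2, the following Python sketch evaluates the highest divided difference once via the recursion (3.23b) and once via the explicit formula (D.5) for distinct real nodes; the node and value lists are arbitrary illustrative data.

```python
def divided_difference_recursive(xs, ys):
    """[x_0, ..., x_n] for distinct nodes via the recursion (3.23b)."""
    c, n = list(ys), len(xs)
    for k in range(1, n):
        for i in range(n - 1, k - 1, -1):
            c[i] = (c[i] - c[i - 1]) / (xs[i] - xs[i - k])
    return c[-1]

def divided_difference_formula(xs, ys):
    """[x_0, ..., x_n] via the explicit formula (D.5)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for m, xm in enumerate(xs):
            if m != i:
                weight /= (xi - xm)
        total += yi * weight
    return total

xs, ys = [0.0, 1.0, 2.0, 4.0], [1.0, 3.0, 2.0, 5.0]
print(divided_difference_recursive(xs, ys), divided_difference_formula(xs, ys))
```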

E Discrete Fourier Transform and Trigonometric Interpolation

E.1 Discrete Fourier Transform (DFT)


Definition E.1. Let n ∈ N = {1, 2, 3, . . . }. The discrete Fourier transform (DFT) is
the map
F : Cn −→ Cn , (f0 , . . . , fn−1 ) 7→ (d0 , . . . , dn−1 ), (E.1a)
where
$$\forall_{k\in\{0,\dots,n-1\}} \quad d_k := \frac{1}{n} \sum_{j=0}^{n-1} f_j\, e^{-\frac{ijk2\pi}{n}}, \tag{E.1b}$$
i ∈ C denoting the imaginary unit.

Important applications of DFT include the approximation and interpolation of periodic


functions (Sec. E.2 and Sec. E.3), practical applications include digital data transmission
(cf. Rem. E.9). First, we need to study some properties of DFT and its inverse map.
Remark E.2. Introducing the vectors
   
f0 d0
   
f :=  ...  ∈ Cn , d :=  ...  ∈ Cn ,
fn−1 dn−1

we can write (E.1) in matrix form as


$$\mathcal{F} : \mathbb{C}^n \longrightarrow \mathbb{C}^n, \qquad d := \mathcal{F}(f) = \frac{1}{n}\, \overline{A}\, f, \tag{E.2}$$

where  
1 1 1 ... 1
1 ω ω2 . . . ω n−1 
 
 ω2 ω4 . . . ω 2(n−1) 
A = 1 , (E.3a)
 .. .. .. ... .. 
. . . . 
(n−1)2
1 ω n−1 ω 2(n−1) ... ω
that means A = (akj ) ∈ M(n, C) is a symmetric complex matrix with
ijk2π i2π
∀ akj = ω kj = e n , ω := e n . (E.3b)
k,j∈{0,...,n−1}

In (E.2), A = (akj ) ∈ M(n, C) denotes the complex conjugate matrix of A.


Proposition E.3. Let A ∈ M(n, C), n ∈ N, be the matrix defined in (E.3a), and set
B := √1n A.

(a) The columns of B form an orthonormal basis with respect to the standard inner
product on Cn .

(b) The matrix B is unitary, i.e. invertible with B −1 = B ∗ .

(c) DFT, as defined in (E.1), is invertible and its inverse map, called inverse DFT, is
given by
F −1 : Cn −→ Cn , (d0 , . . . , dn−1 ) 7→ (f0 , . . . , fn−1 ), (E.4a)
where
$$\forall_{k\in\{0,\dots,n-1\}} \quad f_k := \sum_{j=0}^{n-1} d_j\, e^{\frac{ijk2\pi}{n}}. \tag{E.4b}$$

(d) If d, f ∈ Cn are such that d = F(f ), then


$$\sum_{j=0}^{n-1} |d_j|^2 = \frac{1}{n} \sum_{j=0}^{n-1} |f_j|^2, \tag{E.5a}$$
i.e.
$$\|\mathcal{F}(f)\|_2 = \frac{1}{\sqrt{n}}\, \|f\|_2. \tag{E.5b}$$

Proof. (a): For each k = 0, . . . , n − 1, we let bk denote the kth column of B. Then,
noting |ω| = 1,
n−1
X n−1
1 X 2jk n
hbk , bk i = bjk bjk = |ω| = = 1.
j=0
n j=0 n

n
Moreover, for each k, l ∈ {0, . . . , n − 1} and k 6= l, we note ω l−k 6= 1, but ω l−k = 1,
and we use the geometric sum to obtain
n−1 n−1 n−1 n
X 1 X −jk jl 1 X l−k j 1 1 − ω l−k
hbk , bl i = bjk bjl = ω ω = ω = · l−k
= 0,
j=0
n j=0
n j=0
n 1 − ω

proving (a).
(b) is immediate due to (a) and the equivalences (i) and (iii) of [Phi19b, Pro. 10.36].
(c) follows from (b), as
 −1
−1 1 √
∀n f := F (d) = √ B d= nB ∗ d (E.6)
d∈C n
and √
√ n ijk2π
∀ n b∗kj = √ ajk = e n .
k,j∈{0,...,n−1} n
(d): According to [Phi19b, Pro. 10.36(viii)], B is isometric with respect to k · k2 . Thus,

1 1
kF(f )k2 =
√n Bf = √n kf k2 ,
2

proving (E.5b). Then (E.5a) also follows, as it is merely (E.5b) squared. 

As it turns out, the inverse DFT of Prop. E.3(c) has the useful property that there are
several simple ways to express it in terms of (forward) DFT:
Proposition E.4. Let n ∈ N and let F denote DFT as defined in (E.1).

(a) One has


∀ F −1 (d) = n F(d0 , dn−1 , dn−2 , . . . , d1 ).
d=(d0 ,...,dn−1 )∈Cn

(b) Applying complex conjugation, one has
\[
\forall d \in \mathbb{C}^n: \quad \mathcal{F}^{-1}(d) = n\, \overline{\mathcal{F}\big(\overline{d}\big)}.
\]
(c) Let σ be the map that swaps the real and imaginary parts of a complex number, i.e.
\[
\sigma : \mathbb{C} \longrightarrow \mathbb{C}, \quad \sigma(a + bi) := b + ai = i\, \overline{(a + bi)}.
\]
Using componentwise application, we can also consider σ to be defined on C^n:
\[
\sigma : \mathbb{C}^n \longrightarrow \mathbb{C}^n, \quad \sigma(z_0, \dots, z_{n-1}) := (\sigma(z_0), \dots, \sigma(z_{n-1})) = i\, \overline{z}.
\]
Then one has
\[
\forall d \in \mathbb{C}^n: \quad \mathcal{F}^{-1}(d) = n\, \sigma\big(\mathcal{F}(\sigma d)\big).
\]

Proof. (a): We let f = (f_0, . . . , f_{n−1}) := n F(d_0, d_{n−1}, d_{n−2}, . . . , d_1) and compute for each k = 0, . . . , n − 1:
\[
f_k \overset{(E.1b)}{=} d_0 + \sum_{j=1}^{n-1} d_{n-j}\, e^{-\frac{ijk2\pi}{n}}
= d_0 + \sum_{j=1}^{n-1} d_j\, e^{-\frac{i(n-j)k2\pi}{n}}
= d_0 + \sum_{j=1}^{n-1} d_j\, e^{\frac{ijk2\pi}{n}} \underbrace{e^{-ik2\pi}}_{=1}
= \sum_{j=0}^{n-1} d_j\, e^{\frac{ijk2\pi}{n}},
\]
which, according to (E.4), proves (a).

(b): Letting f := n F(\overline{d}), we obtain for each k = 0, . . . , n − 1:
\[
f_k \overset{(E.1b)}{=} \sum_{j=0}^{n-1} \overline{d_j}\, e^{-\frac{ijk2\pi}{n}} = \overline{\sum_{j=0}^{n-1} d_j\, e^{\frac{ijk2\pi}{n}}}.
\]
Taking the complex conjugate of f_k and comparing with (E.4) proves (b).

(c): Using the C-linearity of DFT and (b), we compute
\[
n\, \sigma\big(\mathcal{F}(\sigma d)\big) = n\, \sigma\big(\mathcal{F}(i\, \overline{d})\big) = n\, \sigma\big(i\, \mathcal{F}(\overline{d})\big) = n\, i\, \overline{i\, \mathcal{F}(\overline{d})} = n\, \overline{\mathcal{F}(\overline{d})} \overset{(b)}{=} \mathcal{F}^{-1}(d),
\]
which proves (c). 
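The identities of Prop. E.4 are easily checked numerically. The following Python/NumPy sketch (illustrative only; the scaling conventions of (E.1) and (E.4) are mapped onto numpy.fft as indicated in the comments) verifies all three expressions on a random vector.
\begin{verbatim}
import numpy as np

def F(f):
    """DFT of (E.1) (with the 1/n factor)."""
    f = np.asarray(f, dtype=complex)
    return np.fft.fft(f) / len(f)

def F_inv(d):
    """Inverse DFT of (E.4) (no 1/n factor)."""
    d = np.asarray(d, dtype=complex)
    return np.fft.ifft(d) * len(d)

rng = np.random.default_rng(0)
d = rng.standard_normal(8) + 1j * rng.standard_normal(8)
n = len(d)

# Prop. E.4(a): reverse the entries d_1, ..., d_{n-1}
assert np.allclose(F_inv(d), n * F(np.concatenate(([d[0]], d[:0:-1]))))
# Prop. E.4(b): conjugate before and after the forward transform
assert np.allclose(F_inv(d), n * np.conj(F(np.conj(d))))
# Prop. E.4(c): sigma swaps real and imaginary parts, sigma(z) = i * conj(z)
sigma = lambda z: 1j * np.conj(z)
assert np.allclose(F_inv(d), n * sigma(F(sigma(d))))
\end{verbatim}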

E.2 Approximation of Periodic Functions


As a first application of the discrete Fourier transform, we consider the approximation
of periodic functions; how this is used in the field of data transmission is indicated in
Rem. E.9 below. The approximation is based on the Fourier series representation of
(sufficiently regular) periodic functions, which we will proceed to, briefly, consider as a
preparation. We will make use of the following well-known relations between the expo-
nential function and the trigonometric functions sine and cosine (see [Phi16a, (8.46)]):

∀ eiz = cos z + i sin z (Euler formula), (E.7a)


z∈C

eiz + e−iz
∀ cos z = , (E.7b)
z∈C 2
eiz − e−iz
∀ sin z = . (E.7c)
z∈C 2i

Definition E.5. Let L ∈ R+ and let f : [0, L] −→ R be integrable.

(a) For each j ∈ Z, we define numbers


\[
a_j := a_j(f) := \frac{1}{L} \int_0^L f(t) \cos\left(-\frac{j2\pi t}{L}\right) dt \in \mathbb{R}, \tag{E.8a}
\]
\[
b_j := b_j(f) := \frac{1}{L} \int_0^L f(t) \sin\left(-\frac{j2\pi t}{L}\right) dt \in \mathbb{R}, \tag{E.8b}
\]
\[
c_j := c_j(f) := \frac{1}{L} \int_0^L f(t)\, e^{-\frac{ij2\pi t}{L}} dt \in \mathbb{C}. \tag{E.8c}
\]
We call the aj , bj the real Fourier coefficients of f and the cj the complex Fourier
coefficients of f . As a consequence of (E.7a), one has

∀ c j = aj + i b j . (E.9)
j∈Z

(b) For each n ∈ N, we define the functions


\[
S_n := S_n(f) : [0, L] \longrightarrow \mathbb{R}, \quad S_n(x) := \sum_{j=-n}^{n} c_j\, e^{\frac{ij2\pi x}{L}} \tag{E.10}
\]
and consider the series
\[
(S_n)_{n\in\mathbb{N}}, \quad \text{i.e.} \quad \forall x \in [0,L]: \quad \big( S_n(x) \big)_{n\in\mathbb{N}} =: \sum_{j=-\infty}^{\infty} c_j\, e^{\frac{ij2\pi x}{L}}. \tag{E.11}
\]

We call the Sn the Fourier partial sums of f and the series of (E.11) the Fourier
series of f .

Remark E.6. (a) We verify that the Fourier partial sums Sn of (E.10) are, indeed,
real-valued. The key observation is an identity that we will prove in a form so that
we can use it both here and, later, in the proof of Th. E.13(a). To this end, we
fix n ∈ N and assume, for each j ∈ {−n, . . . , n}, we have numbers aj , bj ∈ R and
cj ∈ C (not necessarily given by (E.8)), satisfying
 
∀ cj = aj + ibj and ∀ aj = a−j , bj = −b−j . (E.12)
j∈{−n,...,n} j∈{−n+1,...,n−1}\{0}

Then, for each x ∈ R, we compute


n
X n
X
ij2πx (E.12) ij2πx
cj e L = (aj + i bj ) e L

j=−n j=−n
(E.12) in2πx in2πx
= (a−n + i b−n ) e− L + a0 + ib0 + (an + i bn ) e L
n−1
X  ij2πx ij2πx
 n−1
X  ij2πx ij2πx


+ aj e L + e L +i bj e L − e − L
j=1 j=1
(E.7b), (E.7c) in2πx in2πx
= (a−n + i b−n ) e− L + a0 + ib0 + (an + i bn ) e L
n−1
X   n−1
X  
j2πx j2πx
+2 aj cos −2 bj sin . (E.13)
j=1
L j=1
L

We now come back to the case, where aj , bj , cj are defined by (E.8). Then (E.12)
holds due to (E.9) together with the symmetry of cosine (i.e. cos(z) = cos(−z)) and
the antisymmetry of sine (i.e. sin(z) = − sin(−z)), where we have aj = a−j and
bj = −b−j even for j = n. Thus, for each x ∈ [0, L], we use (E.13) to obtain
n
X ij2πx
Sn (x) = cj e L

j=−n
n
X   n
X  
(E.13), b0 =0 j2πx j2πx
= a0 + 2 aj cos −2 bj sin ,
j=1
L j=1
L

showing Sn (x) to be real-valued.


(b) In general, the question, whether (and in which sense) the Fourier series of f , as
defined in (E.11), converges to f does not have an easy answer. We will provide
some sufficient conditions in the following Th. E.7. For convergence results under
weaker assumptions, see, e.g., [Rud87, Sec. 4.26, Sec. 5.11] or [Heu08, Th. 136.2,
Th. 137.1, Th. 137.2].
Theorem E.7. Let L ∈ R+ and let f : [0, L] −→ R be integrable and periodic in
the sense that f (0) = f (L). Moreover, assume that f is continuously differentiable
(i.e. f ∈ C 1 ([0, L])) with f ′ (0) = f ′ (L). Then the following convergence results for the
Fourier series (Sn )n∈N of f , with Sn according to (E.10), hold:

(a) The Fourier series of f converges uniformly to f . In particular, one has the point-
wise convergence

X ij2πx
∀ f (x) = cj e L = lim Sn (x). (E.14)
x∈[0,L] n→∞
j=−∞

(b) The Fourier series of f converges to f in Lp [0, L] for each p ∈ [1, ∞].

Proof. (a): For the main work of the proof, we refer to [Wer11, Th. IV.2.9], where (a) is
proved for the case L = 2π. Here, we only verify that the case L = 2π then also implies
the general case of L ∈ R+ : Assume (a) holds for L = 2π, let L ∈ R+ be arbitrary, and
f ∈ C 1 ([0, L]) with f (0) = f (L) as well as f ′ (0) = f ′ (L). Define
 
tL
g : [0, 2π] −→ R, g(t) := f .

Then g is continuously differentiable with
 
′ ′ L ′ tL
g : [0, 2π] −→ R, g (t) := f .
2π 2π
L L
Also g(0) = f (0) = f (L) = g(2π) and g ′ (0) = 2π f ′ (0) = 2π f ′ (L) = g ′ (2π), i.e. we know
t 2π
(a) to hold for g. Next, we note f (t) = g( L ) for each t ∈ [0, L] and compute, for each
j ∈ Z,
Z L Z L  
1 − ij2πt 1 t 2π ij2πt
cj (f ) = f (t) e L dt = g e− L dt
L 0 L 0 L
Z 2π
1
= g(s) e−ij s ds = cj (g),
2π 0
where we used the change of variables s := t L2π . Thus,
( )
Xn
ij2πx
lim Sn (f ) − f ∞ = lim sup f (x) − cj (f ) e L : x ∈ [0, L]
n→∞ n→∞
j=−n
(   X )
x 2π
n
ij2πx
= lim sup g − cj (f ) e L : x ∈ [0, L]
n→∞ L
j=−n
( )
Xn

= lim sup g(t) − cj (g) eij t : t ∈ [0, 2π]
n→∞
j=−n
(a) for g
= lim Sn (g) − g ∞ = 0,
n→∞

which completes the proof of (a).


(b): According to (a), we have convergence in L∞ [0, L] (note that the continuity of f
and the compactness of [0, L] imply f ∈ L∞ [0, L]). As a consequence of the Hölder
inequality (cf. [Phi17, Th. 2.42]), one then also obtains, for each p ∈ [1, ∞[,
1
lim Sn (f ) − f p = L p lim Sn (f ) − f ∞ = 0,
n→∞ n→∞

thereby establishing the case. 



For a periodic function f , applying the inverse discrete Fourier transform can yield a
useful approximation of the values of f at equidistant points:
Example E.8. Let L ∈ R+ and let f : [0, L] −→ R be integrable and such that f(0) = f(L). Given n ∈ N, we now define the equidistant points
\[
\forall k \in \{0,\dots,2n+1\}: \quad x_k := kh, \qquad h := \frac{L}{2n+1}. \tag{E.15}
\]
Plugging x := x_k into (E.10) then yields the following approximations f_k of f(x_k),
\[
\forall k \in \{0,\dots,2n\}: \quad f_k := S_n(f)(x_k) = \sum_{j=-n}^{n} c_j\, e^{\frac{ijk2\pi}{2n+1}}
= \sum_{j=0}^{2n} c_{j-n}\, e^{\frac{i(j-n)k2\pi}{2n+1}}
= e^{-\frac{ink2\pi}{2n+1}} \sum_{j=0}^{2n} c_{j-n}\, e^{\frac{ijk2\pi}{2n+1}}, \tag{E.16}
\]
where we have omitted f(x_{2n+1}) in (E.16), since we know f(x_{2n+1}) = f(L) = f(0) = f(x_0), due to the assumed periodicity of f. Comparing (E.16) with (E.4), we see that
\[
\big( f_0,\; e^{\frac{in2\pi}{2n+1}} f_1,\; \dots,\; e^{\frac{in2n2\pi}{2n+1}} f_{2n} \big) = \mathcal{F}^{-1}(c_{-n}, \dots, c_n). \tag{E.17}
\]
If Th. E.7(a) holds (e.g. under the hypotheses of Th. E.7), then, letting
\[
F_n := \big( f(x_0), \dots, f(x_{2n}) \big), \qquad G_n := (f_0, \dots, f_{2n}),
\]
we conclude
\[
\lim_{n\to\infty} \|F_n - G_n\|_\infty = \lim_{n\to\infty} \sup\big\{ |f(x_k) - S_n(f)(x_k)| : k \in \{0, \dots, 2n\} \big\} \overset{\text{Th. E.7(a)}}{=} 0. \tag{E.18}
\]

Remark E.9. A physical application of the above-described approximation of periodic


functions in connection with DFT is digital data transmission. Given an analog signal,
represented by a function f , the signal is compressed into finitely many cj according to
(E.16) and (E.8c) (physically, this can be accomplished using high-pass and low-pass
filters). The digital signal consisting of the cj is then transmitted and, at the remote
location, one wants to reconstruct the original f , at least approximately. According to
(E.17) and (E.18), an approximation of f can be obtained by applying inverse DFT to
the cj (combined with a suitable interpolation method).
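The following Python/NumPy sketch (illustrative only; the test function, the period L = 1, and the quadrature rule are ad-hoc choices) mimics this procedure: it approximates the coefficients c_{−n}, . . . , c_n by numerical quadrature of (E.8c) and recovers the values S_n(f)(x_k) at the nodes (E.15) via inverse DFT, cf. (E.16), (E.17).
\begin{verbatim}
import numpy as np

L = 1.0                                          # period length (assumption for this sketch)
f = lambda t: np.sin(2*np.pi*t/L) + 0.3*np.cos(6*np.pi*t/L)

n = 8
N = 2*n + 1                                      # nodes x_0, ..., x_{2n}, cf. (E.15)
x = np.arange(N) * L / N

# complex Fourier coefficients c_j, j = -n, ..., n, by a midpoint rule for (E.8c)
M = 4000
t = (np.arange(M) + 0.5) * L / M
c = np.array([np.sum(f(t) * np.exp(-2j*np.pi*j*t/L)) / M for j in range(-n, n+1)])

# values of S_n(f) at the nodes via inverse DFT, cf. (E.16), (E.17)
Finv = N * np.fft.ifft(c)                        # inverse DFT (E.4) applied to (c_{-n},...,c_n)
k = np.arange(N)
fk = np.exp(-2j*np.pi*n*k/N) * Finv              # undo the phase factor of (E.17)
print(np.max(np.abs(fk.real - f(x))))            # tiny, as f is a trigonometric polynomial
\end{verbatim}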

E.3 Trigonometric Interpolation


Another useful application of DFT is the interpolation of periodic functions by means
of trigonometric functions.

Definition E.10. Let L ∈ R+ , n ∈ N, and d := (d0 , . . . , dn−1 ) ∈ Cn . Then the function


\[
p := p(d) : \mathbb{R} \longrightarrow \mathbb{C}, \quad p(x) := \sum_{k=0}^{n-1} d_k\, e^{\frac{ik2\pi x}{L}}, \tag{E.19}
\]
is called a trigonometric polynomial (which is justified due to Euler's formula). As it turns out, the following modification will be more useful for our purposes here: We let
\[
r := r(d) : \mathbb{R} \longrightarrow \mathbb{C}, \quad
r(x) := e^{-\frac{in\pi x}{L}}\, p(d)(x) = e^{-\frac{in\pi x}{L}} \sum_{k=0}^{n-1} d_k\, e^{\frac{ik2\pi x}{L}} = \sum_{k=0}^{n-1} d_k\, e^{\frac{i(k-n/2)2\pi x}{L}}. \tag{E.20}
\]

Theorem E.11. In the situation of Def. E.10, let x0 , . . . , xn−1 be defined by xj := jL/n
for each j ∈ {0, . . . , n − 1}. Moreover, let z := (z0 , . . . , zn−1 ) ∈ Cn .

(a) The function r : R −→ C of Def. E.10 satisfies

∀ r(xj ) = zj (E.21)
j∈{0,...,n−1}

if, and only if, 


F (−1)0 z0 , . . . , (−1)n−1 zn−1 = d. (E.22)

(b) Let m ∈ N. If f : [0, L] −→ C is an m times continuously differentiable function


such that f (0) = f (L), and the function r : R −→ C of Def. E.10 satisfies (E.21)
with zj := f (xj ), then there exists cm ∈ R+ such that one has the following error
estimate with respect to the L2 -norm:

\[
\|r - f\|_2 \le \frac{c_m \big( \|f\|_2 + \|f^{(m)}\|_2 \big)}{n^m} \tag{E.23}
\]
– recall that, for a square-integrable function g : [0, L] −→ C, the L2-norm is defined by \|g\|_2 := \big( \int_0^L |g|^2 \big)^{1/2}.

Proof. (a): According to (E.20) and using xj = jL/n, (E.21) is equivalent to


n−1
X n−1
X
inπxj ik2πxj ijk2π
∀ zj = e − L dk e L = e−ijπ dk e n
j∈{0,...,n−1}
k=0 k=0
n−1
X ijk2π
= (−1)j dk e n ,
k=0

which, according to (E.4), is equivalent to



F −1 (d) = (−1)0 z0 , . . . , (−1)n−1 zn−1 .

Applying F to the previous equation completes the proof of (a).


(b) is stated as [Pla10, Th. 3.10], which refers to [SV01] for the proof. It actually
turns out to be a special case of the more general result [HB09, Th. 52.6]. 

Example E.12. Consider


\[
f : [0, 1] \longrightarrow \mathbb{R}, \quad f(x) := \begin{cases} x & \text{for } 0 \le x \le \tfrac{1}{2}, \\ 1 - x & \text{for } \tfrac{1}{2} \le x \le 1. \end{cases}
\]

It is an exercise to use different values of n (e.g. n = 4 and n = 16) to compute and plot
the corresponding r satisfying the conditions of Th. E.11(a) with zj = f (xj ), xj = j/n,
for instance by using MATLAB.
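In the absence of MATLAB, the exercise can also be carried out, e.g., in Python/NumPy; the following sketch (illustrative only) assembles r from (E.20) with d determined by (E.22) and checks the interpolation conditions (E.21) at the nodes.
\begin{verbatim}
import numpy as np

f = lambda x: np.where(x <= 0.5, x, 1.0 - x)     # the hat function of Example E.12
L = 1.0

def trig_interpolant(zvals, L=1.0):
    """Return r of (E.20) with d from (E.22), as a callable (a sketch)."""
    n = len(zvals)
    d = np.fft.fft((-1.0) ** np.arange(n) * zvals) / n   # d = F((-1)^j z_j), cf. (E.22)
    def r(x):
        k = np.arange(n)
        return np.exp(-1j*np.pi*n*x/L) * np.sum(
            d * np.exp(2j*np.pi*np.outer(x, k)/L), axis=1)
    return r

n = 16
xj = np.arange(n) * L / n
r = trig_interpolant(f(xj), L)
print(np.max(np.abs(r(xj) - f(xj))))             # ~1e-15: (E.21) holds at the nodes
\end{verbatim}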

When approximating real-valued periodic functions f : [0, L] −→ R, it makes sense to


use real-valued trigonometric functions. Indeed, we can choose real-valued trigonometric
functions such that we can make use of the above results:

Theorem E.13. Let L ∈ R+ , let n ∈ N be even, and

\[
(a, b) := (a_0, \dots, a_{\frac{n}{2}}, b_1, \dots, b_{\frac{n}{2}-1}) \in \mathbb{R}^n.
\]
Define the trigonometric function
\[
t := t(a, b) : \mathbb{R} \longrightarrow \mathbb{R}, \quad
t(x) := a_0 + 2 \sum_{k=1}^{\frac{n}{2}-1} \left( a_k \cos\left(\frac{k2\pi x}{L}\right) + b_k \sin\left(\frac{k2\pi x}{L}\right) \right) + a_{\frac{n}{2}} \cos\left(\frac{n\pi x}{L}\right). \tag{E.24}
\]
Furthermore, let d = (d_0, \dots, d_{n-1}) \in \mathbb{C}^n, let the function r = r(d) be defined according to (E.20), and assume the relations
\[
d_0 = a_{\frac{n}{2}}, \quad d_{\frac{n}{2}} = a_0, \quad
\forall k \in \{1,\dots,\tfrac{n}{2}-1\}: \quad d_{\frac{n}{2}-k} = a_k + i b_k, \quad d_{\frac{n}{2}+k} = a_k - i b_k. \tag{E.25}
\]

(a) The above conditions imply t = Re r.



(b) If x0 , . . . , xn−1 ∈ [0, L] are defined by xj := jL/n for each j ∈ {0, . . . , n − 1},
z := (z0 , . . . , zn−1 ) ∈ Rn , and r satisfies the conditions of Th. E.11(a), then
∀ t(xj ) = zj (E.26)
j∈{0,...,n−1}

and, moreover,
\[
\forall k \in \{1,\dots,\tfrac{n}{2}-1\}: \quad a_k = \frac{1}{n} \sum_{j=0}^{n-1} z_j \cos\left(\frac{jk2\pi}{n}\right), \qquad b_k = \frac{1}{n} \sum_{j=0}^{n-1} z_j \sin\left(\frac{jk2\pi}{n}\right). \tag{E.27}
\]

(c) Let f : [0, L] −→ R. If f and r satisfy the conditions of Th. E.11(b), then there
exists cm ∈ R+ such that the error estimate (E.23) holds with r replaced by t.

Proof. (a): To apply (E.13), we let


c n2 := 0, ∀ ck := dk+ n2 .
k∈{− n
2
,..., n
2
−1}

Then, for each k ∈ {1, . . . , n/2 − 1},


(E.25) (E.25)
Re(ck ) = Re(d n2 +k ) = ak = Re(d n2 −k ) = Re(c−k ) (E.28a)
as well as
(E.25) (E.25)
Im(ck ) = Im(d n2 +k ) = −bk = − Im(d n2 −k ) = − Im(c−k ), (E.28b)
i.e. (E.12) is satisfied and, hence, (E.13) applies. We also note
(E.25) (E.25)
c− n2 = d0 = a n2 , c0 = d n2 = a0 , c n2 = 0. (E.28c)
n
We use the above in (E.13) (with n replaced by 2
and j replaced by k) to obtain, for
each x ∈ R,
n n
n−1 −1
X i(k−n/2)2πx X
2
ik2πx
X
2
ik2πx
r(x) = dk e L = dk+ n2 e L = ck e L

k=0 k=− n
2
k=− n
2

(E.13) inπx inπx


= c− n2 e− L + c0 + c n2 e L

n
X
2
−1   n
X
2
−1  
k2πx k2πx
+2 Re(ck ) cos −2 Im(ck ) sin
k=1
L k=1
L
n
−1     
(E.28) X
2
k2πx k2πx
= a0 + 2 ak cos + bk sin
L L
 k=1  nπx   nπx 
+a n2 cos − i sin ,
L L

proving t = Re r.
(b): (E.26) is an immediate consequence of (a). Furthermore, as a consequence of (E.22)
and (E.1b), we have, for each k ∈ {0, . . . , n},
n−1
1 X ijk2π
dk = zj (−1)j e− n . (E.29)
n j=0

Thus, for each k ∈ {1, . . . , n2 − 1},

(E.25) 1
ak = (d n −k + d n2 +k )
2 2
(−1)j
(E.29) 1
n−1
X z }| {  ijk2π ijk2π

= zj (−1)j e−ijπ e n + e− n
2n j=0
n−1
X  
(E.7b) 1 jk2π
= zj cos ,
n j=0
n

thereby establishing the case. Analogously, for each k ∈ {1, . . . , n2 − 1},

(E.25) 1
bk = (d n −k − d n2 +k )
2i 2
(−1)j
(E.29) 1 X
n−1 z }| {  ijk2π ijk2π

= zj (−1)j e−ijπ e n − e− n
2in j=0
n−1  
(E.7c) 1X jk2π
= zj sin ,
n j=0 n

thereby completing the proof of (E.27).


(c): According to (a), we have t = Re r. Since f and f (m) are real-valued, we obtain
from Th. E.11(b):
Z L Z L Z L

kt − f k22 = | Re r − f | ≤2 2
| Re r − f | + | Im r| 2
= |r − f |2
0 0 0
(m)
 !2
(E.23) cm kf k2 + kf k2
≤ ,
nm

which establishes the case. 



Remark E.14. To make use of Th. E.13 to interpolate a real-valued function f on


[0, L], one would proceed as follows:

(i) Choose n ∈ N even and compute the equidistant xj := jL/n ∈ [0, L].
(ii) Compute the corresponding values zj := f (xj ).

(iii) Compute d := F (−1)0 z0 , . . . , (−1)n−1 zn−1 according to (E.22).
1
(iv) Compute the ak and bk from (E.25), i.e. from ak = 2
(d n2 −k + d n2 +k ) and bk =
1
2i
(d n2 −k − d n2 +k ).
(v) Obtain the real-valued interpolating function t from (E.24).
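The following Python/NumPy sketch (illustrative only; function and parameter names are ad hoc) carries out steps (i)–(v) for the hat function of Example E.12.
\begin{verbatim}
import numpy as np

def real_trig_interpolant(f, L, n):
    """Steps (i)-(v) of Rem. E.14 (a sketch); n must be even."""
    xj = np.arange(n) * L / n                          # (i)
    z = f(xj)                                          # (ii)
    d = np.fft.fft((-1.0) ** np.arange(n) * z) / n     # (iii), cf. (E.22)
    m = n // 2
    a = np.empty(m + 1); b = np.zeros(m)
    a[0], a[m] = d[m].real, d[0].real                  # (iv), cf. (E.25)
    k = np.arange(1, m)
    a[1:m] = ((d[m - k] + d[m + k]) / 2).real
    b[1:m] = ((d[m - k] - d[m + k]) / 2j).real
    def t(x):                                          # (v), cf. (E.24)
        x = np.asarray(x, dtype=float)
        s = a[0] + a[m] * np.cos(n * np.pi * x / L)
        for kk in range(1, m):
            s = s + 2 * (a[kk] * np.cos(2*np.pi*kk*x/L) + b[kk] * np.sin(2*np.pi*kk*x/L))
        return s
    return t

L = 1.0
f = lambda x: np.where(x <= 0.5, x, 1.0 - x)           # hat function of Example E.12
t = real_trig_interpolant(f, L, 16)
xj = np.arange(16) * L / 16
print(np.max(np.abs(t(xj) - f(xj))))                   # ~1e-16: t interpolates f at the nodes
\end{verbatim}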

E.4 Fast Fourier Transform (FFT)


If one computes DFT using matrix multiplication as described in Rem. E.2, then one
needs O(n2 ) complex multiplications to apply F to a vector with n components. How-
ever, if one is able to choose n to be a power of 2, then one can apply a method called fast
Fourier transform (FFT) to reduce the complexity to just O(n log2 n) complex multipli-
cations (cf. Rem. E.23 below). We will study the FFT method in the present section.
Remark E.15. (a) FFT can also be used to apply inverse DFT by making use of Prop.
E.4.
(b) There are variants of FFT that make use of the prime factorization of n and which
can be applied if n is not a power of 2 (see [SK11, p. 161] and the references given
there). However, these variants fail to be fast if n is (a large) prime (or, more
generally, if n has few large prime factors). A different approach that is also fast for
large prime numbers n is to extend the vectors to be transformed to larger vectors of
size m, where m is a power of 2. One can simply do the extension by so-called zero
padding, i.e. by adding components with value zero to the original vector. However,
as ω in (E.3) depends on n, the details of this approach are somewhat tricky. In
the literature, this is known as Bluestein’s algorithm and it involves convolutions
and the so-called chirp z-transform (see [Blu68, RSR69]). We will not study the
details of these approaches in this class, i.e. FFT as presented below applies only
to the case where n is a power of 2.

FFT is based on the following Th. E.17, which allows to compute the DFT of a vector
of length 2n from the DFT of two vectors of length n. In preparation for Th. E.17, we
introduce some notation.

Notation E.16. If we want to emphasize the dimension on which a DFT operator


operates, we write the dimension as a superscript of the operator, denoting DFT from
(n)
Cn to Cn , n ∈ N, by F (n) : Cn −→ Cn . For each k ∈ {0, . . . , n − 1}, we let Fk :=
(n) (n)
πk ◦ F (n) : Cn −→ C, where πk : Cn −→ C is the projection (z0 , . . . , zn−1 ) 7→ zk ,
denote the kth coordinate function of F (n) .

Theorem E.17. Let n ∈ N. Then, for each k ∈ {0, . . . , n − 1} and each (f_0, . . . , f_{2n−1}) ∈ C^{2n}, it holds that
\[
\frac{\mathcal{F}^{(n)}_k(f_0, f_1, \dots, f_{n-1}) + e^{-\frac{ik\pi}{n}}\, \mathcal{F}^{(n)}_k(f_n, f_{n+1}, \dots, f_{2n-1})}{2}
= \mathcal{F}^{(2n)}_k(f_0, f_n, f_1, f_{n+1}, \dots, f_{n-1}, f_{2n-1}) \tag{E.30a}
\]
and
\[
\frac{\mathcal{F}^{(n)}_k(f_0, f_1, \dots, f_{n-1}) - e^{-\frac{ik\pi}{n}}\, \mathcal{F}^{(n)}_k(f_n, f_{n+1}, \dots, f_{2n-1})}{2}
= \mathcal{F}^{(2n)}_{n+k}(f_0, f_n, f_1, f_{n+1}, \dots, f_{n-1}, f_{2n-1}). \tag{E.30b}
\]

Proof. Given (f0 , . . . , f2n−1 ) ∈ C2n , we compute


(2n)
Fk (f0 , fn , f1 , fn+1 , . . . , fn−1 , f2n−1 )
n−1 n−1
!
(E.1b) 1 X i2jk2π X i(2j+1)k2π
= fj e− 2n + fn+j e− 2n
2n j=0 j=0
n−1 n−1
!
1 X ijk2π ikπ
X ijk2π
= fj e − n + e − n fn+j e− n
2n j=0 j=0
(n) ikπ (n)
(E.1b) Fk (f0 , f1 , . . . , fn−1 ) + e− n Fk (fn , fn+1 , . . . , f2n−1 )
= ,
2

proving (E.30a), and


(2n)
Fn+k (f0 , fn , f1 , fn+1 , . . . , fn−1 , f2n−1 )
n−1 n−1
!
(E.1b) 1 X i2j(n+k)2π X i(2j+1)(n+k)2π
= fj e − 2n + fn+j e− 2n
2n j=0 j=0
 −1

n−1 n−1 z }| {
1 X i2j(n+k)2π X i(2j+1)k2π i(2j+1)n2π 
=  fj e− 2n + fn+j e− 2n e− 2n 
2n j=0 j=0

n−1 n−1
!
1 X ijk2π X ijk2π
− ikπ
= fj e − n −e n fn+j e− n
2n j=0 j=0
(n) − ikπ (n)
(E.1b) Fk (f0 , f1 , . . . , fn−1 ) −e n Fk (fn , fn+1 , . . . , f2n−1 )
= ,
2
proving (E.30b). 
Example E.18. (a) If n = 2q , q ∈ N, then one can apply Th. E.17 q times to reduce
an application of F (n) to n applications of F (1) . Moreover, we observe that F (1) is
merely the identity on C: According to (E.1),
\[
\forall f_0 \in \mathbb{C}: \quad \mathcal{F}^{(1)}(f_0) = \frac{1}{1}\, f_0\, e^{0} = f_0.
\]

(b) For n = 8 = 23 , one can build the result of an application of F (n) from trivial
one-dimensional transforms in 3 steps as illustrated in the scheme in (E.31) below.

f0 f4 f2 f6 f1 f5 f3 f7
= = = = = = = =
F (1) (f0 ) F (1) (f4 ) F (1) (f2 ) F (1) (f6 ) F (1) (f1 ) F (1) (f5 ) F (1) (f3 ) F (1) (f7 )
ց ւ ց ւ ց ւ ց ւ
F (2) (f0 , f4 ) F (2) (f2 , f6 ) F (2) (f1 , f5 ) F (2) (f3 , f7 )
ց ւ ց ւ
F (4) (f0 , f2 , f4 , f6 ) F (4) (f1 , f3 , f5 , f7 )
ց ւ
F (8) (f0 , f1 , f2 , f3 , f4 , f5 , f6 , f7 ) (E.31)

The scheme in (E.31) also illustrates that getting all the indices right needs some
careful accounting. It turns out that this can be done quite elegantly and efficiently

by a trick known as bit reversal. Before studying bit reversal and its application to
FFT systematically, let us see how it works in the present example: In (E.31), we
want to compute F (8) (f0 , . . . , f7 ) and need to figure out the order of indices we need
in the first row of (E.31). Using the bit reversal algorithm we can do that directly,
without having to compute (and store) the intermediate lines of the scheme: We
start by writing the numbers 0, . . . , 7 in their binary representation instead of in
their usual decimal representation:

(0)10 = (000)2 , (1)10 = (001)2 , (2)10 = (010)2 , (3)10 = (011)2 ,


(4)10 = (100)2 , (5)10 = (101)2 , (6)10 = (110)2 , (7)10 = (111)2 . (E.32)

Then we reverse the bits of each number in (E.32) (and we also provide the resulting
decimal representation):

(0)10 = (000)2 , (4)10 = (100)2 , (2)10 = (010)2 , (6)10 = (110)2 ,


(1)10 = (001)2 , (5)10 = (101)2 , (3)10 = (011)2 , (7)10 = (111)2 . (E.33)

We note that, as if by magic, we have obtained the correct order needed in the first
row of (E.31). In the following, we will prove that this procedure works in general.

Definition E.19. For each q ∈ N0, we denote M_q := Z_{2^q} = {0, . . . , 2^q − 1}. As every n ∈ M_q has a (unique) binary representation n = \sum_{j=0}^{q-1} b_j 2^j, b_j ∈ {0, 1}, we can define bit reversal (in q dimensions) as the map
\[
\rho_q : M_q \longrightarrow M_q, \quad \rho_q\left( \sum_{j=0}^{q-1} b_j 2^j \right) := \sum_{j=0}^{q-1} b_{q-1-j}\, 2^j. \tag{E.34}
\]

Lemma E.20. Let q ∈ N0 .

(a) The bit reversal map ρq : Mq −→ Mq is bijective with ρ−1


q = ρq .

(b) One has


∀ ρq (n) = ρq+1 (2n).
n∈Mq

(c) One has


∀ 2q + ρq (n) = ρq+1 (2n + 1).
n∈Mq

Proof. (a): Since, for each j ∈ {0, . . . , q − 1}, one has q − 1 − (q − 1 − j) = j, this is
immediate from (E.34).

(b): If n ∈ Mq , then 0 ≤ n ≤ 2q − 1, i.e. 0 ≤ 2n ≤ 2q+1 − 2 < 2q+1 − 1 and 2n ∈ Mq+1 .


Pq−1
Moreover, for n = j=0 bj 2j ∈ Mq , bj ∈ {0, 1}, we compute

q−1
! q−1
!
X X
ρq+1 (2n) = ρq+1 2 bj 2j = ρq+1 0 · 20 + bj−1 2j + bq−1 2q
j=0 j=1
q−1
X
= bq−1−j 2j = ρq (n),
j=0

proving (b).
(c): If n ∈ Mq , then 0 ≤ n ≤ 2q − 1, i.e. 0 ≤ 2n + 1 ≤ 2q+1 − 1 and 2n + 1 ∈ Mq+1 .
Pq−1
Moreover, for n = j=0 bj 2j ∈ Mq , bj ∈ {0, 1}, we compute

q−1
! q−1
!
X X
ρq+1 (2n + 1) = ρq+1 1 + 2 bj 2j = ρq+1 1 · 20 + bj−1 2j + bq−1 2q
j=0 j=1
q−1
X
= bq−1−j 2j + 1 · 2q = 2q + ρq (n),
j=0

proving (c). 

Definition E.21. We define the fast Fourier transform (FFT) algorithm for the compu-
tation of the n-dimensional discrete Fourier transform F (n) in the case n = 2q , q ∈ N0 ,
by the following recursion: We start with the n = 2q complex numbers z0 , . . . , zn−1 ∈ C
(we will see below (cf. Th. E.22) that, to compute d := F (n) (f ), f ∈ Cn , one has to
choose z0 := fρq (0) , z1 := fρq (1) , . . . , zn−1 := fρq (n−1) ):
\[
d^{(0,0)}_0 := z_0, \quad \dots, \quad d^{(0,2^q-1)}_0 := z_{2^q-1} = z_{n-1}. \tag{E.35a}
\]
Then, for each 1 ≤ r ≤ q, one defines recursively the following 2^{q−r} vectors d^{(r,0)}, . . . , d^{(r,2^{q−r}−1)} ∈ C^{2^r} by
\[
\forall k \in \{0,\dots,2^{r-1}-1\} \;\; \forall j \in \{0,\dots,2^{q-r}-1\}: \quad
d^{(r,j)}_k := \frac{d^{(r-1,2j)}_k + e^{-\frac{ik\pi}{2^{r-1}}}\, d^{(r-1,2j+1)}_k}{2}, \qquad
d^{(r,j)}_{2^{r-1}+k} := \frac{d^{(r-1,2j)}_k - e^{-\frac{ik\pi}{2^{r-1}}}\, d^{(r-1,2j+1)}_k}{2} \tag{E.35b}
\]
(note that the recursion ends after q steps with the single vector d(q,0) ∈ Cn ).

Theorem E.22. Let n = 2q , q ∈ N0 , and z = (z0 , . . . , zn−1 ) ∈ Cn . If, for r ∈ {0, . . . , q},
j ∈ {0, . . . , 2q−r − 1}, the d(r,j) are defined by the FFT algorithm of Def. E.21, then
r
∀ ∀q−r d(r,j) = F (2 ) (zj2r +ρr (0) , zj2r +ρr (1) , . . . , zj2r +ρr (2r −1) ). (E.36)
r∈{0,...,q} j∈{0,...,2 −1}

In particular,
d(q,0) = F (n) (zρq (0) , zρq (1) , . . . , zρq (2q −1) ), (E.37)
i.e., to compute F (n) (f0 , . . . , fn−1 ) for a given f ∈ Cn , one has to set

z := (fρq (0) , . . . , fρq (n−1) ) (E.38)

to obtain
d(q,0) = F (n) (f0 , . . . , fn−1 ). (E.39)

Proof. The proof of (E.36) is conducted via induction with respect to r ∈ {0, . . . , q}.
For r = 0, (E.36) reads
∀q d(0,j) = F (1) (zj ), (E.40)
j∈{0,...,2 −1}

which is correct due to (E.35a) and Example E.18(a). If r ∈ {1, . . . , q} and j ∈


{0, . . . , 2q−r − 1}, then, for each k ∈ {0, . . . , 2r−1 − 1}, we calculate

(r,j) (E.35b) 1  (r−1,2j) ikπ (r−1,2j+1)



dk = dk + e− 2r−1 dk
2
ind.hyp. 1 (2r−1 )
= F (z2j2r−1 +ρr−1 (0) , z2j2r−1 +ρr−1 (1) , . . . , z2j2r−1 +ρr−1 (2r−1 −1) )
2 k
1 ikπ (2r−1 )
+ e− 2r−1 Fk (z(2j+1)2r−1 +ρr−1 (0) , . . . , z(2j+1)2r−1 +ρr−1 (2r−1 −1) )
2
1 (2r−1 )
= F (zj2r +ρr−1 (0) , . . . , zj2r +ρr−1 (2r−1 −1) )
2 k
1 ikπ (2r−1 )
+ e− 2r−1 Fk (zj2r +2r−1 +ρr−1 (0) , . . . , zj2r +2r−1 +ρr−1 (2r−1 −1) )
2
(∗) (2r )
= Fk (zj2r +ρr (0) , zj2r +ρr (1) , . . . , zj2r +ρr (2r −1) ), (E.41)

where the equality at (∗) holds due to (E.30a) and Lem. E.20(b),(c), since, for each
α ∈ {0, . . . , 2r−1 − 1},

j2r + ρr−1 (α) = j2r + ρr (2α) and j2r + 2r−1 + ρr−1 (α) = j2r + ρr (2α + 1). (E.42)

Similarly,
(r,j) (E.35b) 1  (r−1,2j) ikπ (r−1,2j+1)

d2r−1 +k = dk − e− 2r−1 dk
2
ind.hyp. 1 (2r−1 )
= F (zj2r +ρr−1 (0) , . . . , zj2r +ρr−1 (2r−1 −1) )
2 k
1 ikπ (2r−1 )
− e− 2r−1 Fk (zj2r +2r−1 +ρr−1 (0) , . . . , zj2r +2r−1 +ρr−1 (2r−1 −1) )
2
(E.30a),(E.42) (2r )
= F2r−1 +k (zj2r +ρr (0) , zj2r +ρr (1) , . . . , zj2r +ρr (2r −1) ), (E.43)
completing the induction proof of (E.36).
Clearly, (E.36) turns into (E.37) for r = q, j = 0; and (E.39) follows from (E.37) and
(E.38) as ρq ◦ ρq = Id. 
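The recursion (E.35) together with the bit reversal of Def. E.19 translates directly into code. The following Python/NumPy sketch (illustrative, not optimized; it assumes n is a power of 2) implements Def. E.21 / Th. E.22 and compares the result with the DFT of (E.1).
\begin{verbatim}
import numpy as np

def bit_reverse(j, q):
    """rho_q of (E.34): reverse the q binary digits of j."""
    return int(format(j, f"0{q}b")[::-1], 2)

def fft(f):
    """FFT per Def. E.21 for n = 2^q (a compact sketch)."""
    n = len(f)
    q = n.bit_length() - 1
    assert n == 2 ** q, "this sketch requires n to be a power of 2"
    d = np.array([f[bit_reverse(j, q)] for j in range(n)], dtype=complex)  # (E.38), (E.35a)
    for r in range(1, q + 1):                      # one pass per recursion level, cf. (E.35b)
        m = 2 ** (r - 1)
        w = np.exp(-1j * np.pi * np.arange(m) / m)
        for start in range(0, n, 2 * m):
            top = d[start:start + m].copy()
            bot = w * d[start + m:start + 2 * m]
            d[start:start + m] = (top + bot) / 2
            d[start + m:start + 2 * m] = (top - bot) / 2
    return d

rng = np.random.default_rng(1)
f = rng.standard_normal(16) + 1j * rng.standard_normal(16)
assert np.allclose(fft(f), np.fft.fft(f) / len(f))   # same values as the DFT of (E.1)
\end{verbatim}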
Remark E.23. As mentioned at the beginning of the section, FFT reduces the number
of complex multiplications needed to compute an n-dimensional DFT from O(n2 ) (when
done by matrix multiplication) to O(n log2 n) (an actually quite significant improvement
– even for a moderate n like n = 8 = 23 , one has n2 = 64 versus n log2 n = 24, i.e. one
gains almost a factor 3; the larger n gets the more essential it becomes to use FFT).
Let us check the O(n log2 n) claim for FFT as computed by (E.35) for n = 2q , q ∈ N,
(where we need to estimate the number of multiplications involved in the executions of
(E.35b)):
iπ iπ
(i) In a preparatory step, one computes the q − 1 factors e− 21 , . . . , e− 2q−1 by starting

with e− 2q−1 and using

 iπ
2
∀ e− 2r = e− 2r+1 .
r∈{0,...,q−2}

Thus, the number of multiplications used in this step is


N1 := q − 2 < q.

− iπ0
It is used that e = −1 and e− 2q−1 is assumed to be given (it has to be precom-
2

puted once, which adds a constant number of multiplications, depending on the


desired accuracy).

Then (E.35b) has to be executed q times, namely once for each r = 1, . . . , q. We compute
the number of multiplications needed to obtain the d(r,j) :

(ii) Starting from the precomputed θr := e− 2r−1 , one uses successive multiplications
ikπ
with θr to obtain the factors θrk = e− 2r−1 , k = 1, . . . , 2r−1 − 1. Thus, the number
of multiplications for this step is
N2,r := 2r−1 − 2 < 2r−1 .

(r,j)
(iii) The computation of each dk needs two multiplications (one by the e-factor and
one by 12 ). As the range for k is {0, . . . , 2r − 1} and the range for j is {0, . . . , 2q−r −
1}, the total number of multiplications for this step is

N3,r := 2 · 2r · 2q−r = 2q+1 .

Finally, summing all multiplications from steps (i) – (iii), one obtains
q q
X X
N = N1 + (N2,r + N3,r ) < q + (2r−1 + 2q+1 ) = q + 2q − 1 + q 2q+1
r=1 r=1
< log2 n + n + 2 n log2 n, (E.44)

which is O(n log2 n) as claimed.

F The Weierstrass Approximation Theorem


An important result in the context of the approximation of continuous functions by
polynomials, which we will need for subsequent use, is the following Weierstrass approx-
imation Th. F.1. Even though related, this topic is not strictly part of the theory of
interpolation.

Theorem F.1 (Weierstrass Approximation Theorem). Let a, b ∈ R with a < b. For each
continuous function f ∈ C[a, b] and each ǫ > 0, there exists a polynomial p : R −→ R
such that kf − p↾[a,b] k∞ < ǫ, where p↾[a,b] denotes the restriction of p to [a, b].

Theorem F.1 will be a corollary of the fact that the Bernstein polynomials corresponding
to f ∈ C[0, 1] (see Def. F.2) converge uniformly to f on [0, 1] (see Th. F.3 below).

Definition F.2. Given f : [0, 1] −→ R, define the Bernstein polynomials B_n f corresponding to f by
\[
B_n f : \mathbb{R} \longrightarrow \mathbb{R}, \quad (B_n f)(x) := \sum_{\nu=0}^{n} f\left(\frac{\nu}{n}\right) \binom{n}{\nu}\, x^{\nu} (1-x)^{n-\nu} \quad \text{for each } n \in \mathbb{N}. \tag{F.1}
\]
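The following Python/NumPy sketch (illustrative only) evaluates B_n f for a continuous, but not everywhere differentiable, f and exhibits the (slow) uniform convergence asserted in Th. F.3 below.
\begin{verbatim}
import numpy as np
from math import comb

def bernstein(f, n, x):
    """Evaluate (B_n f)(x) from (F.1) at the points x (a sketch)."""
    x = np.asarray(x, dtype=float)
    nu = np.arange(n + 1)
    coeffs = np.array([f(v / n) * comb(n, v) for v in nu])
    return sum(c * x**v * (1 - x)**(n - v) for c, v in zip(coeffs, nu))

f = lambda t: np.abs(t - 0.5)              # continuous, not differentiable at 1/2
x = np.linspace(0.0, 1.0, 201)
for n in (10, 100, 1000):
    print(n, np.max(np.abs(f(x) - bernstein(f, n, x))))   # decreases slowly with n
\end{verbatim}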

Theorem F.3. For each f ∈ C[0, 1], the sequence of Bernstein polynomials (Bn f )n∈N
corresponding to f according to Def. F.2 converges uniformly to f on [0, 1], i.e.

lim kf − (Bn f )↾[0,1] k∞ = 0. (F.2)


n→∞

Proof. We begin by noting


(Bn f )(0) = f (0) and (Bn f )(1) = f (1) for each n ∈ N. (F.3)
For each n ∈ N and ν ∈ {0, . . . , n}, we introduce the abbreviation
 
n ν
qnν (x) := x (1 − x)n−ν . (F.4)
ν
Then n
n X
1 = x + (1 − x) = qnν (x) for each n ∈ N (F.5)
ν=0
implies
n 
X  ν 
f (x) − (Bn f )(x) = f (x) − f qnν (x) for each x ∈ [0, 1], n ∈ N,
ν=0
n

and
n  ν 
X
f (x) − (Bn f )(x) ≤ f (x) − f qnν (x) for each x ∈ [0, 1], n ∈ N. (F.6)
ν=0
n

As f is continuous on the compact interval [0, 1], it is uniformly continuous, i.e. for each
ǫ > 0, there exists δ > 0 such that, for each x ∈ [0, 1], n ∈ N, ν ∈ {0, . . . , n}:
ν  ν  ǫ

x − < δ ⇒ f (x) − f < . (F.7)
n n 2
For the moment, we fix x ∈ [0, 1] and n ∈ N and define
n ν o

N1 := ν ∈ {0, . . . , n} : x − < δ ,
n n o
ν
N2 := ν ∈ {0, . . . , n} : x − ≥ δ .
n
Then
X  ν 
ǫ X ǫX
n
ǫ
f (x) − f qnν (x) ≤ qnν (x) ≤ qnν (x) = , (F.8)
ν∈N
n 2 ν∈N 2 ν=0 2
1 1

and with M := kf k∞ ,
2
X  ν 
X  ν 
x − nν
f (x) − f qnν (x) ≤ f (x) − f qnν (x)
ν∈N
n ν∈N
n δ2
2 2

2M X
n  ν 2
≤ q nν (x) x − . (F.9)
δ 2 ν=0 n

To compute the sum on the right-hand side of (F.9), observe


 ν 2 ν  ν 2
x− = x2 − 2x + (F.10)
n n n
and
n  
X Xn  
n ν n−ν ν n − 1 ν−1 (F.5)
x (1 − x) =x x (1 − x)(n−1)−(ν−1) = x (F.11)
ν=0
ν n ν=1
ν−1

as well as
Xn    2 x X n  
n ν n−ν ν n − 1 ν−1 x
x (1 − x) = (ν − 1) x (1 − x)(n−1)−(ν−1) +
ν=0
ν n n ν=1 ν−1 n
Xn  
x2 n − 2 ν−2 x
= (n − 1) x (1 − x)(n−2)−(ν−2) +
n ν=2
ν−2 n
 
1 x x
= x2 1 − + = x2 + (1 − x). (F.12)
n n n
Thus, we obtain
n
X  ν 2 (F.10),(F.5),(F.11),(F.12) x
qnν (x) x − = x2 · 1 − 2x · x + x2 + (1 − x)
ν=0
n n
1
≤ for each x ∈ [0, 1], n ∈ N,
4n
and together with (F.9):
X  ν 
2M 1 ǫ M
f (x) − f qnν (x) ≤ 2 < for each x ∈ [0, 1], n > . (F.13)
ν∈N
n δ 4n 2 δ2ǫ
2

Combining (F.6), (F.8), and (F.13) yields



f (x) − (Bn f )(x) < ǫ + ǫ = ǫ for each x ∈ [0, 1], n > M ,
2 2 δ2ǫ
proving the claimed uniform convergence. 

Proof of Th. F.1. Define


x−a
φ : [a, b] −→ [0, 1], φ(x) := ,
b−a
φ−1 : [0, 1] −→ [a, b], φ−1 (x) := (b − a)x + a.

Given ǫ > 0, and letting



g : [0, 1] −→ R, g(x) := f φ−1 (x) ,

Th. F.3 provides a polynomial q : R −→ R such that kg − q↾[0,1] k∞ < ǫ. Defining


 
 x−a
p : R −→ R, p(x) := q φ(x) = q
b−a

(having extended φ in the obvious way) yields a polynomial p such that



|f (x) − p(x)| = g(φ(x)) − q(φ(x)) < ǫ for each x ∈ [a, b]

as needed. 

G Baire Category and the Uniform Boundedness Principle
Theorem G.1 (Baire Category Theorem). Let ∅ 6= X be a nonempty complete metric
space, and, for each k ∈ N, let Ak be a closed subset of X. If

[
X= Ak , (G.1)
k=1

then there exists k0 ∈ N such that Ak0 has nonempty interior: int Ak0 6= ∅.

Proof. Seeking a contradiction, we assume (G.1) holds with closed sets Ak such that
int Ak = ∅ for each k ∈ N. Then, for each nonempty open set O ⊆ X and each k ∈ N,
the set O \ Ak = O ∩ (X \ Ak) is open and nonempty (as O \ Ak = ∅ would imply O ⊆ Ak, such that the points in O would be interior points of Ak). Since O \ Ak is open and nonempty, we can choose x ∈ O \ Ak and 0 < ǫ < 1/k such that

B ǫ (x) ⊆ O \ Ak . (G.2)

As one can choose Bǫ (x) as a new O, starting with arbitrary x0 ∈ X and ǫ0 > 0, one can
inductively construct sequences of points xk ∈ X, numbers ǫk > 0, and corresponding
balls such that
1
B ǫk (xk ) ⊆ Bǫk−1 (xk−1 ) \ Ak , ǫk ≤ for each k ∈ N. (G.3)
k

Then l ≥ k implies xl ∈ Bǫk (xk ) for each k, l ∈ N, such that (xk )k∈N constitutes a
Cauchy sequence due to limk→∞ ǫk = 0. The assumed completeness of X provides a limit
x = limk→∞ xk ∈ X. The nested form (G.3) of the balls Bǫk (xk ) implies x ∈ B ǫk (xk ) for
each k ∈ N on the one hand and x ∈ / Ak for each k ∈ N on the other hand. However,
this last conclusion is in contradiction to (G.1). 
Remark G.2. The purpose of this remark is to explain the name Baire Category The-
orem in Th. G.1. To that end, we need to introduce some terminology that will not be
used again outside the present remark.
Let X be a metric space.
Recall that a subset D of X is called dense if, and only if, D = X, or, equivalently, if,
and only if, for each x ∈ X and each ǫ > 0, one has D ∩ Bǫ (x) 6= ∅. A subset E of X is
called nowhere dense if, and only if, X \ E is dense.
S
A subset M of X is said to be of first category or meager if, and only if, M = ∞ k=1 Ek
is a countable union of nowhere dense sets Ek , k ∈ N. A subset S of X is said to be of
second category or nonmeager or fat if, and only if, S is not of first category. Caveat:
These notions of category are completely different from the notion of category occurring
in the more algebraic discipline called category theory.
Finally, the reason Th. G.1 carries the name Baire Category Theorem lies in the fact
that it implies that every nonempty complete metric space X is of second S∞ category
(considered as a subset of itself): If X were of first category, then X = k=1 Ek with
nowhere dense sets Ek . As the Ek are nowhereS∞ dense, the complements X \ E k are
dense, i.e. E k has empty interior, and X = k=1 E k were a countable union of closed
sets with empty interior in contradiction to Th. G.1.
Theorem G.3 (Uniform Boundedness Principle). Let X be a nonempty complete metric
space and let Y be a normed vector space. Consider a set of continuous functions
F ⊆ C(X, Y ). If

Mx := sup kf (x)k : f ∈ F < ∞ for each x ∈ X, (G.4)
then there exists x0 ∈ X and ǫ0 > 0 such that

sup Mx : x ∈ B ǫ0 (x0 ) < ∞. (G.5)
In other words, if a collection of continuous functions from X into Y is bounded point-
wise in X, then it is uniformly bounded on an entire ball.

Proof. If F = ∅, then Mx = sup ∅ := inf R+


0 = 0 for each x ∈ X and there is nothing to
prove. Thus, let F 6= ∅. For f ∈ F, k ∈ N, as both f and the norm are continuous, the
set 
Ak,f := x ∈ X : kf (x)k ≤ k (G.6)

is the inverse image of the closed set [0, k] under a continuous map and, hence, closed. Since arbitrary
intersections of closed sets are closed, so is
\ 
Ak := Ak,f = x ∈ X : kf (x)k ≤ k for each f ∈ F . (G.7)
f ∈F

Now (G.4) implies



[
X= Ak (G.8)
k=1
and the Baire Category Th. G.1 provides k0 ∈ N such that Ak0 has nonempty interior.
Thus, there exist x0 ∈ Ak0 and ǫ0 > 0 such that B ǫ0 (x0 ) ⊆ Ak0 , and, in consequence,

sup Mx : x ∈ B ǫ0 (x0 ) ≤ k0 , (G.9)
proving (G.5). 

We now specialize the uniform boundedness principle to bounded linear maps:


Theorem G.4 (Banach-Steinhaus Theorem). Let X be a Banach space (i.e. a complete
normed vector space) and let Y be a normed vector space. Consider a set of bounded
linear functions T ⊆ L(X, Y ). If

sup kT (x)k : T ∈ T < ∞ for each x ∈ X, (G.10)
then T is a bounded subset of L(X, Y ), i.e.

sup kT k : T ∈ T < ∞. (G.11)

Proof. To apply Th. G.3, define, for each T ∈ T :


fT : X −→ R, fT (x) := kT xk. (G.12)
Then F := {fT : T ∈ T } ⊆ C(X, R) and (G.10) implies that F satisfies (G.4). Thus,
if Mx is as in (G.4), we obtain x0 ∈ X and ǫ0 > 0 such that

sup Mx : x ∈ B ǫ0 (x0 ) < ∞, (G.13)
i.e. there exists C ∈ R+ satisfying
kT xk ≤ C for each T ∈ T and each x ∈ X with kx − x0 k ≤ ǫ0 . (G.14)
In consequence, for each x ∈ X \ {0},
 
kxk x kxk
kT xk = · T x + ǫ − T x ≤ · 2 C, (G.15)
ǫ0
0 0 0
kxk ǫ0
2C
showing kT k ≤ ǫ0
for each T ∈ T , thereby establishing the case. 

H Matrix Decomposition

H.1 Cholesky Decomposition of Positive Semidefinite Matrices


In this section, we show that the results of Sec. 5.3 (except uniqueness of the decompo-
sition) extend to Hermitian positive semidefinite matrices.

Theorem H.1. Let Σ ∈ M(n, K) be Hermitian positive semidefinite, n ∈ N. Then


there exists a lower triangular A ∈ M(n, K) with nonnegative diagonal entries (i.e. with
ajj ∈ R+0 for each j ∈ {1, . . . , n}) and

Σ = AA∗ . (H.1)

In extension of the term for the positive definite case, we still call this decomposition a
Cholesky decomposition of Σ. It is recalled from Th. 5.7 that, if Σ is positive definite,
then the diagonal entries of A can be chosen to be all positive and this determines A
uniquely.

Proof. The positive definite case was proved in Th. 5.7. The general (positive semidefi-
nite) case can be deduced from the positive definite case as follows: If Σ ∈ M(n, K) is
Hermitian positive semidefinite, then each Σk := Σ + k1 Id ∈ M(n, K), k ∈ N, is positive
definite:
x∗ x
x∗ Σk x = x∗ Σx + > 0 for each 0 6= x ∈ Kn . (H.2)
k
Thus, for each k ∈ N, there exists a lower triangular matrix Ak ∈ M(n, K) with positive
diagonal entries such that Σk = Ak A∗k .
On the other hand, Σk converges to Σ with respect to each norm on K^{n²} (since all norms on K^{n²} are equivalent). In particular, lim_{k→∞} ‖Σk − Σ‖_2 = 0. Thus,
\[
\|A_k\|_2^2 = r(\Sigma_k) = \|\Sigma_k\|_2, \tag{H.3}
\]
such that lim_{k→∞} ‖Σk − Σ‖_2 = 0 implies that the set K := {Ak : k ∈ N} is bounded with respect to ‖·‖_2. Thus, the closure of K in K^{n²} is compact, which implies (Ak)_{k∈N} has a convergent subsequence (A_{k_l})_{l∈N}, converging to some matrix A ∈ K^{n²}. As this convergence is with respect to the norm topology on K^{n²}, each entry of the A_{k_l} must converge (in K) to the respective entry of A. In particular, A is lower triangular with all nonnegative diagonal entries. It only remains to show AA∗ = Σ. However,

Σ = lim Σkl = lim Akl A∗kl = AA∗ , (H.4)


l→∞ l→∞

which establishes the case. 



Example H.2. The decomposition


    
0 0 0 sin x 0 0
= for each x ∈ R (H.5)
sin x cos x 0 cos x 0 1

shows a Hermitian positive semidefinite matrix (which is not positive definite) can have
uncountably many different Cholesky decompositions.

Theorem H.3. Let Σ ∈ M(n, K) be Hermitian positive semidefinite, n ∈ N. Define


the index set 
I := (i, j) ∈ {1, . . . , n}2 : j ≤ i . (H.6)
Then a matrix A = (Aij ) providing a Cholesky decomposition Σ = AA∗ of Σ = (σij ) is
obtained via the following algorithm, defined recursively over I, using the order (1, 1) <
(2, 1) < · · · < (n, 1) < (2, 2) < · · · < (n, 2) < · · · < (n, n) (which corresponds to
traversing the lower half of Σ by columns from left to right):

\[
A_{11} := \sqrt{\sigma_{11}}. \tag{H.7a}
\]
For (i, j) ∈ I \ {(1, 1)}:
\[
A_{ij} := \begin{cases}
\left( \sigma_{ij} - \sum_{k=1}^{j-1} A_{ik}\, \overline{A_{jk}} \right) / A_{jj} & \text{for } i > j \text{ and } A_{jj} \neq 0, \\[1ex]
0 & \text{for } i > j \text{ and } A_{jj} = 0, \\[1ex]
\sqrt{\, \sigma_{ii} - \sum_{k=1}^{i-1} |A_{ik}|^2 \,} & \text{for } i = j.
\end{cases} \tag{H.7b}
\]

Proof. A lower triangular n × n matrix A provides a Cholesky decomposition of Σ if,


and only if,
  
A11 A11 A21 . . . An1
 A21 A22  A22 . . . An2 
∗   
AA =  .. .. . .  . . ..  = Σ, (H.8)
 . . .  . . 
An1 An2 . . . Ann Ann

i.e. if, and only if, the n(n + 1)/2 lower half entries of A constitute a solution to the
following (nonlinear) system of n(n + 1)/2 equations:
j
X
Aik Ajk = σij , (i, j) ∈ I. (H.9a)
k=1

Using the order on I introduced in the statement of the theorem, (H.9a) takes the form

|A11 |2 = σ11 ,
A21 A11 = σ21 ,
..
.
An1 A11 = σn1 ,
(H.9b)
|A21 |2 + |A22 |2 = σ22 ,
A31 A21 + A32 A22 = σ32 ,
..
.
|An1 |2 + · · · + |Ann |2 = σnn .

From Th. H.1, we know (H.9) must have at least one solution with Ajj ∈ R+ 0 for each
j ∈ {1, . . . , n}. In particular, σjj ∈ R+
0 for each j ∈ {1, . . . , n} (this is also immediate
from Σ being positive semidefinite, since σjj = e∗j Σej , where ej denotes the jth standard
unit vector of Kn ). We need to show that (H.7) yields a solution to (H.9). The proof is

carried out by induction on n. For n = 1, we have σ11 ∈ R+ 0 and A11 = σ11 ∈ R+
0 , i.e.
there is nothing to prove. Now let n > 1. If σ11 > 0, then

A11 = σ11 , A21 = σ21 /A11 , ..., An1 = σn1 /A11 (H.10a)

is the unique solution to the first n equations of (H.9b) satisfying A11 ∈ R+ , and this
solution is provided by (H.7). If σ11 = 0, then σ21 = · · · = σn1 = 0: Otherwise, let
s := (σ21 . . . σn1 )∗ ∈ Kn−1 \ {0}, α ∈ R, and note
    
∗ 0 s∗ α ∗ s∗ s
(α, s ) = (α, s ) = 2αksk2 + s∗ Σn−1 s < 0
s Σn−1 s αs + Σn−1 s

for α < −s∗ Σn−1 s/(2ksk22 ), in contradiction to Σ being positive semidefinite. Thus,

A11 = A21 = · · · = An1 = 0 (H.10b)

is a particular solution to the first n equations of (H.9b), and this solution is provided
by (H.7). We will now denote the solution to (H.9) given by Th. H.1 by B11 , . . . , Bnn
to distinguish it from the Aij constructed via (H.7).
In each case, A11 , . . . , An1 are given by (H.10), and, for (i, j) ∈ I with i, j ≥ 2, we define

τij := σij − Ai1 Aj1 ∈ K for each (i, j) ∈ J := (i, j) ∈ I : i, j ≥ 2 . (H.11)

To be able to proceed by induction, we show that the Hermitian (n − 1) × (n − 1) matrix


 
τ22 . . . τn2
 
T :=  ... . . . ...  (H.12)
τn2 . . . τnn

is positive semidefinite. If σ11 = 0, then (H.10b) implies τij = σij for each (i, j) ∈ J and
T is positive semidefinite, as Σ being positive semidefinite implies
 
  0
x2
     x2 


x2 . . . xn T  ...  = 0 x2 . . . xn Σ  ..  ≥ 0 (H.13)
.
xn
xn

for each (x2 , . . . , xn ) ∈ Kn−1 . If σ11 > 0, then (H.10a) holds as well as B11 = A11 , . . . ,
Bn1 = An1 . Thus, (H.11) implies
τij := σij − Ai1 Aj1 = σij − Bi1 B j1 for each (i, j) ∈ J. (H.14)
Then (H.9) with A replaced by B together with (H.14) implies
j
X
Bik B jk = τij for each (i, j) ∈ J (H.15)
k=2

or, written in matrix form,


    
B22 B 22 . . . B n2 τ22 . . . τn2
 .. ...  ... ..  =  .. . . . 
 .  .   . . ..  = T, (H.16)
Bn2 . . . Bnn B nn τn2 . . . τnn

which, once again, establishes T to be positive semidefinite (since, for each x ∈ Kn−1 ,
x∗ T x = x∗ BB ∗ x = (B ∗ x)∗ (B ∗ x) ≥ 0).
By induction, we now know the algorithm of (H.7) yields a (possibly different from
(H.16) for σ11 > 0) decomposition of T :
    
C22 C 22 . . . C n2 τ22 . . . τn2
  ..  =  .. . . . 
CC ∗ =  ... ...  ...
.   . . ..  = T (H.17)
Cn2 . . . Cnn C nn τn2 . . . τnn
or
j
X
Cik C jk = τij = σij − Ai1 Aj1 for each (i, j) ∈ J, (H.18)
k=2

where (√
√ σ22 for σ11 = 0,
C22 := τ22 = (H.19a)
B22 for σ11 > 0,
and, for (i, j) ∈ J \ {(2, 2)},
 Pj−1 

 τij − k=2 Cik C jk /Cjj for i > j and Cjj 6= 0,


Cij := 0 for i > j and Cjj = 0, (H.19b)

 q

 τ − Pi−1 |C |2
ii k=2 ik for i = j.

Substituting τij = σij − Ai1 Aj1 from (H.11) into (H.19) and comparing with (H.7), an
induction over J with respect to the order introduced in the statement of the theorem
shows Aij = Cij for each (i, j) ∈ J. In particular, since all Cij are well-defined by
induction, all Aij are well-defined by (H.7) (i.e. all occurring square roots exist as real
numbers). It also follows that {Aij : (i, j) ∈ I} is a solution to (H.9): The first n
equations are satisfied according to (H.10); the remaining (n − 1) n/2 equations are
satisfied according to (H.18) combined with Cij = Aij . This concludes the proof that
(H.7) furnishes a solution to (H.9). 
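The recursion (H.7) can be implemented directly; the following Python/NumPy sketch (illustrative only; the tolerance parameter is an ad-hoc guard against roundoff when deciding whether A_jj = 0) computes a Cholesky factor of a positive semidefinite matrix and verifies Σ = AA∗.
\begin{verbatim}
import numpy as np

def cholesky_psd(Sigma, tol=1e-12):
    """Cholesky factor A with Sigma = A A^* for Hermitian positive semidefinite Sigma,
    following the column-wise recursion (H.7) (a sketch)."""
    Sigma = np.asarray(Sigma, dtype=complex)
    n = Sigma.shape[0]
    A = np.zeros((n, n), dtype=complex)
    for j in range(n):                          # traverse the lower half by columns, cf. Th. H.3
        s = Sigma[j, j] - np.sum(np.abs(A[j, :j]) ** 2)
        A[j, j] = np.sqrt(max(s.real, 0.0))     # diagonal entry, last case of (H.7b)
        for i in range(j + 1, n):
            if abs(A[j, j]) > tol:
                A[i, j] = (Sigma[i, j] - A[i, :j] @ np.conj(A[j, :j])) / A[j, j]
            else:
                A[i, j] = 0.0                   # second case of (H.7b)
    return A

B = np.array([[1.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
Sigma = B.T @ B                                 # real symmetric, positive semidefinite, rank 2
A = cholesky_psd(Sigma)
print(np.max(np.abs(A @ A.conj().T - Sigma)))   # ~0
\end{verbatim}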

H.2 Multiplication with Householder Matrix over R


If K = R, then one can use the following, simpler, version of Lem. 5.22:

Lemma H.4. Let k ∈ N. We consider the elements of Rk as column vectors. Let


e1 ∈ Rk be the first standard unit vector of Rk. If x ∈ Rk and x ∉ span{e1}, then
\[
u := \frac{x + \sigma e_1}{\|x + \sigma e_1\|_2} \quad \text{for } \sigma = \pm\|x\|_2, \tag{H.20}
\]
satisfies
\[
\|u\|_2 = 1 \tag{H.21}
\]
and
\[
(\mathrm{Id} - 2uu^t)\, x = -\sigma e_1. \tag{H.22}
\]

Proof. First, note that x ∈


/ span{e1 } implies kx + σ e1 k2 =
6 0 such that u is well-defined.
Moreover, (H.21) is immediate from the definition of u. To verify (H.22), note

kx + σ e1 k22 = kxk22 + 2σ et1 x + σ 2 = 2(x + σ e1 )t x (H.23a)



due to the definition of σ. Together with the definition of u, this yields


2(x + σ e1 )t x
2ut x = = kx + σ e1 k2 (H.23b)
kx + σ e1 k2
and
2uut x = x + σ e1 , (H.23c)
proving (H.22). 
Remark H.5. In (H.20), one has the freedom to choose the sign of σ. It is numerically
advisable to choose
\[
\sigma = \sigma(x) := \begin{cases} \|x\|_2 & \text{for } x_1 \ge 0, \\ -\|x\|_2 & \text{for } x_1 < 0, \end{cases} \tag{H.24}
\]
to avoid subtractive cancellation of digits.
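A short Python/NumPy sketch (illustrative only) of (H.20) with the sign choice (H.24):
\begin{verbatim}
import numpy as np

def householder_vector(x):
    """u and sigma from (H.20), sign chosen as in (H.24) (real case, a sketch)."""
    sigma = np.linalg.norm(x) if x[0] >= 0 else -np.linalg.norm(x)
    v = x.astype(float).copy()
    v[0] += sigma                          # x + sigma*e_1; no cancellation due to (H.24)
    u = v / np.linalg.norm(v)
    return u, sigma

x = np.array([3.0, 4.0, 0.0])
u, sigma = householder_vector(x)
Hx = x - 2 * u * (u @ x)                   # (Id - 2 u u^t) x
print(Hx, -sigma)                          # Hx = -sigma*e_1 = (-5, 0, 0), cf. (H.22)
\end{verbatim}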

I Newton’s Method
A drawback of Theorems 6.3 and 6.5 is the assumption of the existence and (for Th. 6.5)
location of the zero x∗ , which (especially for Th. 6.5) can be very difficult to verify in
cases where one would like to apply Newton’s method. In the following variant Th. I.1,
the hypotheses are such that the existence of a zero can be proved, as well as the convergence of Newton's method to that zero.
Theorem I.1. Let m ∈ N and fix a norm k · k on Rm . Let A ⊆ Rm be open and convex.
Moreover let f : A −→ A be differentiable, and assume Df to satisfy the Lipschitz
condition
∀ (Df )(x) − (Df )(y) ≤ L kx − yk, L ∈ R+ 0. (I.1)
x,y∈A

Assume Df (x) to be invertible for each x ∈ A, satisfying


−1

∀ Df (x) ≤ β ∈ R+ 0, (I.2)
x∈A

where, once again, we consider the induced operator norm on the real n × n matrices.
If x0 ∈ A and r ∈ R+ are such that x1 ∈ B r (x0 ) ⊆ A (x1 according to (6.5)) and
−1

α := Df (x0 ) (f (x0 )) (I.3)

satisfies
αβL
h := <1 (I.4a)
2

and
α
r≥ , (I.4b)
1−h
then Newton’s method (6.5) is well-defined for x0 (i.e., for each n ∈ N0 , Df (xn ) is
invertible and xn+1 ∈ A) and converges to a zero of f : More precisely, one has
∀ xn ∈ B r (x0 ), (I.5)
n∈N0

lim xn = x∗ ∈ A, (I.6)
n→∞
f (x∗ ) = 0. (I.7)
One also has the error estimate
n
h2 −1
∀ kxn − x∗ k ≤ α . (I.8)
n∈N 1 − h2n

Proof. Analogous to (6.18) in the proof of Th. 6.5, we obtain



∀ f (x) − f (y) − Df (y)(x − y) ≤ L kx − yk2 (I.9)
x,y∈A 2
from the Lipschitz condition (I.1) and the assumed convexity of A. We now show via
induction on n ∈ N0 that
 n

∀ xn+1 ∈ B r (x0 ) ∧ kxn+1 − xn k ≤ α h2 −1 . (I.10)
n∈N0

For n = 0, x1 ∈ B r (x0 ) by hypothesis and kx1 − x0 k ≤ α by (6.5) and (I.3). Now let
n ≥ 1. From (6.5), we obtain
∀ Df (xn−1 )(xn ) = Df (xn−1 )(xn−1 ) − f (xn−1 )
n∈N

and, as we may apply f to xn by the induction hypothesis,


f (xn ) = f (xn ) − f (xn−1 ) − Df (xn−1 )(xn − xn−1 ).
Together with (I.9), this yields
L
kf (xn )k ≤ kxn − xn−1 k2 . (I.11)
2
Thus,
(6.5) −1 (I.2)

kxn+1 − xn k ≤ Df (x n ) kf (xn )k ≤ β kf (xn )k
(I.11) βL ind.hyp. βL 2 2n −2 (I.4a) n
≤ kxn − xn−1 k2 ≤ α h = α h2 −1 , (I.12)
2 2

completing the induction step for the second part of (I.10). Next, we use the second
part of (I.10) to estimate
n
X n
X ∞
X
2k −1
(I.4a) α (I.4b)
kxn+1 − x0 k ≤ kxk+1 − xk k ≤ α h ≤ α hk = ≤ r, (I.13)
k=0 k=0 k=0
1−h

showing xn+1 ∈ B r (x0 ) and completing the induction.


We use
n+k −1 n −1 n+k −2n n −1 n k −1 n −1 n
∀ h2 = h2 h2 = h2 (h2 )2 ≤ h2 (h2 )k (I.14)
n,k∈N

to estimate, for each m, n ∈ N,


m−1
X m−1
X
(I.10) n+k −1
kxn+m − xn k ≤ kxn+k+1 − xn+k k ≤ α h2
k=0 k=0
X∞ n
(I.14)
2n −1 2n k α h2 −1
≤ αh (h ) = 2n
→ 0 for n → ∞, (I.15)
k=0
1 − h

showing (xn )n∈N to be a Cauchy sequence and implying the existence of


x∗ = lim xn ∈ B r (x0 ).
n→∞

For fixed n ∈ N and m → ∞, (I.15) proves (I.8). Finally,


L(I.11) (I.10) Lα2 n−1
2
kf (x∗ )k = lim kf (xn )k ≤ lim kxn − xn−1 k ≤ lim (h2 −1 )2 = 0
n→∞ 2 n→∞ 2 n→∞

shows f (x∗ ) = 0, completing the proof. 
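As an illustration of the iteration whose convergence Th. I.1 guarantees, the following Python/NumPy sketch (the example system and the stopping criterion are ad-hoc choices) implements Newton's method x_{n+1} = x_n − (Df(x_n))^{-1} f(x_n) for a map from R² to R².
\begin{verbatim}
import numpy as np

def newton(f, Df, x0, tol=1e-12, maxit=50):
    """Newton iteration (a sketch): solve Df(x) dx = f(x), update x <- x - dx."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        dx = np.linalg.solve(Df(x), f(x))
        x = x - dx
        if np.linalg.norm(dx) < tol:
            break
    return x

# Example: intersection of the circle x^2 + y^2 = 4 with the parabola y = x^2
f = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[1] - x[0]**2])
Df = lambda x: np.array([[2*x[0], 2*x[1]], [-2*x[0], 1.0]])
x_star = newton(f, Df, np.array([1.0, 1.0]))
print(x_star, f(x_star))                   # residual close to zero
\end{verbatim}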

J Eigenvalues

J.1 Continuous Dependence on Matrix Coefficients


The eigenvalues of a complex matrix depend, in a certain sense, continuously on the
coefficients of the matrix, which can be seen as a consequence of the fact that the zeros
of a complex polynomial depend, in the same sense, continuously on its coefficients.
The precise results will be given in Th. J.2 and Cor. J.3 below. The proof is based on
Rouché’s theorem of Complex Analysis, which we now state (not in its most general
form) for the convenience of the reader (Rouché’s theorem is usually proved as a simple
consequence of the residue theorem).

Theorem J.1 (Rouché). Let D := Br (z0 ) = {z ∈ C : |z − z0 | < r} be a disk in C


(z0 ∈ C, r ∈ R+ ) with boundary ∂D. If G ⊆ C is an open set containing the closed disk
D = D ∪ ∂D and if f, g : G −→ C are holomorphic functions satisfying

∀ |f (z) − g(z)| < |f (z)|, (J.1)


z∈∂D

then Nf = Ng , where Nf , Ng denote the number of zeros that f and g have in D,


respectively, and where each zero is counted with its multiplicity.

Proof. See, e.g., [Rud87, Th. 10.43(b)]. 


Theorem J.2. Let p : C −→ C be a monic complex polynomial function of degree n,
n ∈ N, i.e.
p(z) = z n + an−1 z n−1 + · · · + a1 z + a0 , (J.2a)
a0 , . . . , an−1 ∈ C. Then, for each ǫ > 0, there exists δ > 0 such that, for each monic
complex polynomial q : C −→ C of degree n, satisfying

q(z) = z n + bn−1 z n−1 + · · · + b1 z + b0 , (J.2b)

b0 , . . . , bn−1 ∈ C, and
∀ |ak − bk | < δ, (J.2c)
k∈{0,...,n−1}

there exist enumerations λ1 , . . . , λn ∈ C and µ1 , . . . , µn ∈ C of the zeros of p and q,


respectively, listing each zero according to its multiplicity, where

∀ |λk − µk | < ǫ. (J.3)


k∈{1,...,n}

Proof. Let α ∈ R+ ∪ {∞} denote the minimal distance between distinct zeros of p, i.e.
(
∞ if p has only one distinct zero,
α :=  (J.4)
min |λ − λ̃| : p(λ) = p(λ̃) = 0, λ 6= λ̃ otherwise.

Without loss of generality, we may assume 0 < ǫ < α/2. Let λ̃1 , . . . , λ̃l , where 1 ≤ l ≤ n
be the distinct zeros of p. For each k ∈ {1, . . . , l}, define the open disk Dk := Bǫ (λ̃k ).
Since ǫ < α, p does not have any zeros on ∂Dk . Thus,

mk := min |p(z)| : z ∈ ∂Dk > 0. (J.5a)

Also define
( n−1 )
X
Mk := max |z|j : z ∈ ∂Dk and note 1 ≤ Mk < ∞. (J.5b)
j=0

Then, for each δ > 0 satisfying

∀ Mk δ < m k
k∈{1,...,l}

and each q satisfying (J.2b) and (J.2c), the following holds true:
n−1
X n−1
X
(J.2c)
∀ ∀ |q(z) − p(z)| ≤ |aj − bj ||z|j ≤ δ |z|j ≤ δMk < mk ≤ |p(z)|,
k∈{1,...,l} z∈∂Dk
j=0 j=0
(J.6)
showing p and q satisfy hypothesis (J.1) of Rouché’s theorem. Thus, in each Dk , the
polynomials p and q must have precisely the same number of zeros (counted according to
their respective multiplicities). Since p and q both have precisely n zeros (not necessarily
distinct), the only (distinct) zero of p in Dk is λ̃k , and the radius of Dk is ǫ, the proof
of Th. J.2 is complete. 
Corollary J.3. Let A ∈ M(n, C), n ∈ N, and let k · k denote some arbitrary norm
on M(n, C). Then, for each ǫ > 0, there exists δ > 0 such that, for each matrix
B ∈ M(n, C), satisfying
kA − Bk < δ, (J.7)
there exist enumerations λ1 , . . . , λn ∈ C and µ1 , . . . , µn ∈ C of the eigenvalues of A and
B, respectively, listing each eigenvalue according to its (algebraic) multiplicity, where

∀ |λk − µk | < ǫ. (J.8)


k∈{1,...,n}

Proof. Recall that each coefficient of the characteristic polynomial χA is a polynomial


in the coefficients of A (due to χA = det(X Id −A), cf. [Phi19b, Rem. 4.33]), and thus,
the function that maps the coefficients of A to the coefficients of χA is continuous from
2 2
Cn to Cn . Thus (and since all norms on Cn are equivalent as well as all norms on Cn ),
given γ > 0, there exists δ = δ(γ) > 0, such that (J.7) implies

∀ |ak − bk | < γ,
k∈{0,...,n−1}

if a0 , . . . , an−1 and b0 , . . . , bn−1 denote the coefficients of χA and χB , respectively. Since


the eigenvalues are precisely the zeros of the respective characteristic polynomial, the
statement of the corollary is now immediate from the statement of Th. J.2. 

J.2 Transformation to Hessenberg Form


In preparation for the application of a numerical eigenvalue approximation algorithm
(e.g. the QR algorithm of Sec. 7.4), it is often advisable to transform the matrix under

consideration to Hessenberg form (see Def. J.4). The complexity of the transformation
to Hessenberg form is O(n3 ) (see Th. J.6(c) below). However, each step of the QR
algorithm applied to a matrix in Hessenberg form is O(n2 ) (see Rem. J.12 below) as
compared to O(n3 ) for a fully populated matrix (obtaining a QR decomposition via
the Householder method of Def. 5.23 requires O(n) matrix multiplications, each matrix
multiplication needing O(n2 ) multiplications).
Definition J.4. An n × n matrix A = (akl ), n ∈ N, is said to be in upper (resp. lower)
Hessenberg form if, and only if, it is almost in upper (resp. lower) triangular form,
except that there are allowed to be nonzero elements in the first subdiagonal (resp.
superdiagonal), i.e., if, and only if,

akl = 0 for k ≥ l + 2 (resp. akl = 0 for k ≤ l − 2). (J.9)

Thus, according to Def. J.4, the n × n matrix A = (akl ) is in upper Hessenberg form if,
and only if, it can be written as
 
∗ ... ... ... ∗
∗ ∗ ∗
 .. 
 ... 
A = 0 ∗ . .
. ... ... ... .. 
 .. .
0 ... 0 ∗ ∗
How can one transform an arbitrary matrix A ∈ M(n, K) into Hessenberg form without
changing its eigenvalues? The idea is to use several similarity transformations of the
form A 7→ HAH, using Householder matrices H = H −1 (cf. Def. 5.19). Householder
transformations are also used in Sec. 5.4.3 to obtain QR decompositions.
Algorithm J.5 (Transformation to Hessenberg Form). Given A ∈ M(n, K), n ∈ N,
n > 2 (for n = 1 and n = 2, A is always in Hessenberg form), we define the following
algorithm: Let A(1) := A. For k = 1, . . . , n − 2, A(k) is transformed into A(k+1) by
performing precisely one of the following actions:
(a) If a^{(k)}_{ik} = 0 for each i ∈ {k + 2, . . . , n}, then
\[
A^{(k+1)} := A^{(k)}, \qquad H^{(k)} := \mathrm{Id} \in M(n, \mathbb{K}).
\]
(b) Otherwise, it is x^{(k)} ≠ 0, where
\[
x^{(k)} := \big( a^{(k)}_{k+1,k}, \dots, a^{(k)}_{n,k} \big)^{t} \in \mathbb{K}^{n-k}, \tag{J.10}
\]
and we set
\[
A^{(k+1)} := H^{(k)} A^{(k)} H^{(k)},
\]
where
\[
H^{(k)} := \begin{pmatrix} \mathrm{Id}_k & 0 \\ 0 & \underbrace{\mathrm{Id}_{n-k} - 2\, u^{(k)} (u^{(k)})^{*}}_{=:\, H_l^{(k)}} \end{pmatrix}, \qquad u^{(k)} := u(x^{(k)}), \tag{J.11}
\]
with u(x^{(k)}) according to (5.34a).
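For the real case, Alg. J.5 can be implemented with the Householder vectors of Lem. H.4; the following Python/NumPy sketch (illustrative only, K = R) performs the reduction and checks that the result is in Hessenberg form with the spectrum preserved, as asserted in Th. J.6(a) below.
\begin{verbatim}
import numpy as np

def hessenberg(A):
    """Reduce a real square matrix to upper Hessenberg form by Householder
    similarity transformations, in the spirit of Alg. J.5 (real case, a sketch)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n - 2):
        x = A[k + 1:, k].copy()
        if np.allclose(x[1:], 0.0):       # case (a) of Alg. J.5: nothing to eliminate
            continue
        sigma = np.linalg.norm(x) if x[0] >= 0 else -np.linalg.norm(x)   # cf. (H.24)
        v = x.copy(); v[0] += sigma
        u = v / np.linalg.norm(v)          # cf. (H.20)
        Hl = np.eye(n - k - 1) - 2.0 * np.outer(u, u)    # lower right block of (J.11)
        A[k + 1:, :] = Hl @ A[k + 1:, :]   # A <- H A
        A[:, k + 1:] = A[:, k + 1:] @ Hl   # A <- (H A) H, similarity transformation
    return A

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
B = hessenberg(A)
print(np.max(np.abs(np.tril(B, -2))))      # ~0: upper Hessenberg form
print(np.sort(np.abs(np.linalg.eigvals(A))) - np.sort(np.abs(np.linalg.eigvals(B))))
\end{verbatim}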


Theorem J.6. Let A ∈ M(n, K), n ∈ N, n > 2, and let B := A(n−1) be the matrix
obtained as a result after having applied Alg. J.5 to A.

(a) Then B is in upper Hessenberg form and σ(B) = σ(A). Moreover, ma (λ, B) =
ma (λ, A) for each λ ∈ σ(A).

(b) If A is Hermitian, then B must be Hermitian as well. Thus, B is a Hermitian


matrix in Hessenberg form, i.e., it is tridiagonal (of course, for K = R, Hermitian
is the same as symmetric).

(c) The transformation of A into B requires O(n3 ) arithmetic operations.

Proof. (a): Using (5.34c) and blockwise matrix multiplication, a straightforward induc-
tion shows that each A(k+1) has the form
 
 
 (k)
A1
(k)
A2 Hl
(k) 
 
 
 
A(k+1) = H (k) A(k) H (k) =

,
 (J.12)
 
 
 0 ξk e 1
(k)
Hl A3 Hl
(k) (k) 
 

(k)
where A1 is in upper Hessenberg form. In particular, for k + 1 = n − 1, we obtain B
to be in upper Hessenberg form. Moreover, since each A(k+1) is obtained from A(k) via
a similarity transformation, both matrices always have the same eigenvalues with the
same multiplicities. In particular, by another induction, the same holds for A and B.

(b): Since each H (k) is Hermitian according to Lem. 5.20(a), if A(k) is Hermitian, then
(A(k+1) )∗ = (H (k) A(k) H (k) )∗ = A(k+1) , i.e. A(k+1) is Hermitian as well. Thus, another
straightforward induction shows all A(k) to be Hermitian.
(c): The claim is that the number N (n) of arithmetic operations needed to compute
B from A is O(n3 ) for n → ∞. For simplicity, we do not distinguish between different
arithmetic operations, considering addition, multiplication, division, complex conjuga-
tion, and even the taking of a square root as one arithmetic operation each, even though
the taking of a square root will tend to be orders of magnitude slower than an addi-
tion. Still, none of the operations depends on the size n of the matrix, which justifies
treating them as the same, if one is just interested in the asymptotics n → ∞. First,
if u, x ∈ Cm , H = Id −2uu∗ , then computing Hx = x − 2u(u∗ x) needs m conjugations,
3m multiplications, m additions, and m subtractions, yielding

N1 (m) = 6m.

Thus, the number of operations to obtain x∗ H = (Hx)∗ is

N2 (m) = 7m

(the same as N1 (m) plus the additional m conjugations). The computation of the
norm kxk2 needs m conjugations, m multiplications, m additions, and 1 square root, a
scalar multiple ξx needs m multiplications. Thus, the computation of u(x) according to
(5.34a) needs 8m operations (2 norms, 2 scalar multiples) plus a constant number C of
operations (conjugations, multiplications, divisions, square roots) that does not depend
on m, the total being
N3 (m) = 8m + C.
Now we are in a position to compute the number N k (n) of operations needed to obtain
A(k+1) from A(k) : N3 (n − k) operations to obtain u(k) according to (J.11), N1 (n − k)
(k) (k)
operations to obtain ξk e1 according to (J.12), k ·N2 (n−k) operations to obtain A2 Hl
according to (J.12), and (n − k) · N1 (n − k) + (n − k) · N2 (n − k) operations to obtain
(k) (k) (k)
Hl A3 Hl according to (J.12), the total being

N k (n) = N3 (n − k) + N1 (n − k) + k · N2 (n − k)
+ (n − k) · N1 (n − k) + (n − k) · N2 (n − k)
= C + 8(n − k) + 6(n − k) + 7k(n − k) + 6(n − k)(n − k) + 7(n − k)(n − k)
= C + 14(n − k) + 7n(n − k) + 6(n − k)2 .

Finally, we obtain
n−2
X n−1
X n−1
X
k
N (n) = N (n) = C(n − 2) + (14 + 7n) k+6 k2
k=1 k=2 k=2
   
(n − 1)n (n − 1)n(2n − 1)
= C(n − 2) + (14 + 7n) −1 +6 −1 ,
2 6

which is O(n3 ) for n → ∞, completing the proof. 

J.3 QR Method for Matrices in Hessenberg Form


As mentioned before, it is not advisable to apply the QR method of the previous section
directly to a fully populated matrix, since, in this case, the complexity of obtaining
the QR decomposition in each step is O(n3 ). In the present section, we describe how
each step of the QR method can be performed more efficiently for Hessenberg matrices,
reducing the complexity of each step to O(n2 ). The key idea is to replace the use of
Householder matrices by so-called Givens rotations.

Notation J.7. Let n ∈ N, n > 1, m ∈ {1, . . . , n − 1}, and c, s ∈ K. By G(m, c, s) =
(g_{jl}) ∈ M(n, K) we denote the matrix, where

    g_{jl} = \begin{cases}
        c & \text{for } j = l = m,\\
        \bar{c} & \text{for } j = l = m + 1,\\
        s & \text{for } j = m + 1,\ l = m,\\
        -\bar{s} & \text{for } j = m,\ l = m + 1,\\
        1 & \text{for } j = l \notin \{m, m + 1\},\\
        0 & \text{otherwise},
    \end{cases}                                                              (J.13)

that means

    G(m, c, s) = \begin{pmatrix}
        1 &        &   &          &        &   \\
          & \ddots &   &          &        &   \\
          &        & c & -\bar{s} &        &   \\
          &        & s & \bar{c}  &        &   \\
          &        &   &          & \ddots &   \\
          &        &   &          &        & 1
    \end{pmatrix},

where the 2 × 2 block with entries c, −s̄, s, c̄ occupies rows and columns m and m + 1.

Lemma J.8. Let n ∈ N, n > 1, m ∈ {1, . . . , n − 1}, and c, s ∈ K. Let G(m, c, s) =


(gjl ) ∈ M(n, K) be as in Not. J.7.

(a) If |c|2 + |s|2 = 1, then G(m, c, s) is unitary (in this case, G(m, c, s) is a special case
of what is known as a Givens rotation).

(b) If A = (a_{jl}) ∈ M(n, K) is in Hessenberg form with a_{21} = · · · = a_{m,m−1} = 0,
a_{m+1,m} ≠ 0,

    c := \frac{a}{\sqrt{|a|^2 + |b|^2}}, \quad s := \frac{b}{\sqrt{|a|^2 + |b|^2}}, \quad a := a_{mm}, \quad b := a_{m+1,m},              (J.14)

and B = (b_{jl}) := G(m, c, s)^* A, then B is still in Hessenberg form, but with b_{21} =
· · · = b_{m,m−1} = b_{m+1,m} = 0. Moreover, |c|^2 + |s|^2 = 1. Rows m + 2 through n are
identical in A and B.

Proof. (a): G(m, c, s) is unitary, since

    \begin{pmatrix} c & -\bar{s} \\ s & \bar{c} \end{pmatrix}
    \begin{pmatrix} \bar{c} & \bar{s} \\ -s & c \end{pmatrix}
    =
    \begin{pmatrix} |c|^2 + |s|^2 & c\bar{s} - \bar{s}c \\ s\bar{c} - \bar{c}s & |c|^2 + |s|^2 \end{pmatrix}
    = \operatorname{Id},

implying G(m, c, s)G(m, c, s)^* = Id.


(b): Clearly, |c|^2 + |s|^2 = 1. Due to the form of G(m, c, s)^*, G(m, c, s)^* A can only
affect rows m and m + 1 of A (all other rows are the same in A and B). As A is in
Hessenberg form with a_{21} = · · · = a_{m,m−1} = 0, this further implies that columns 1
through m − 1 are also the same in A and B, implying B to be in Hessenberg form with
b_{21} = · · · = b_{m,m−1} = 0 as well. Finally,

    b_{m+1,m} = -s\, a_{mm} + c\, a_{m+1,m} = \frac{-ba + ab}{\sqrt{|a|^2 + |b|^2}} = 0,

thereby establishing the case.
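As a small numerical illustration of Lem. J.8 (only a sketch: the function name givens, the use of numpy, the test matrix, and the 0-based indexing are choices made here), the following Python fragment forms G(m, c, s), computes c and s according to (J.14), and checks both the unitarity of G(m, c, s) and that left multiplication by G(m, c, s)^* annihilates the subdiagonal entry a_{m+1,m}:

import numpy as np

def givens(n, m, c, s):
    # G(m, c, s) as in Not. J.7, with m a 0-based row/column index
    G = np.eye(n, dtype=complex)
    G[m, m], G[m, m + 1] = c, -np.conj(s)
    G[m + 1, m], G[m + 1, m + 1] = s, np.conj(c)
    return G

A = np.array([[2.0, 1.0, 3.0],
              [4.0, 1.0, 1.0],
              [0.0, 5.0, 2.0]], dtype=complex)    # a Hessenberg matrix
m = 0
a, b = A[m, m], A[m + 1, m]
r = np.sqrt(abs(a) ** 2 + abs(b) ** 2)
c, s = a / r, b / r                               # (J.14)
G = givens(3, m, c, s)
B = G.conj().T @ A                                # B = G(m, c, s)^* A
print(np.allclose(G @ G.conj().T, np.eye(3)))     # G(m, c, s) is unitary
print(abs(B[m + 1, m]) < 1e-14)                   # b_{m+1,m} = 0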

The idea of each step of the QR method for matrices in Hessenberg form is to transform
A(k) into A(k+1) by a sequence of n − 1 unitary similarity transformations via Givens
rotations Gk1 , . . . , Gk,n−1 . The Gkm are chosen such that the sequence

    A^{(k)} =: A^{(k,1)} ↦ A^{(k,2)} := G_{k1}^* A^{(k,1)} ↦ · · · ↦ R_k := A^{(k,n)} := G_{k,n-1}^* A^{(k,n-1)}

transforms A(k) into an upper triangular matrix Rk :



Lemma J.9. If A^{(k)} ∈ M(n, K) is a matrix in Hessenberg form, n ∈ N, n ≥ 2,
A^{(k,1)} := A^{(k)}, and, for each m ∈ {1, . . . , n − 1},

    A^{(k,m+1)} := G_{km}^* A^{(k,m)},                                       (J.15)

G_{km} = Id for a^{(k)}_{m+1,m} = 0, otherwise G_{km} = G(m, c, s) with c, s defined as in (J.14),
except with a := a^{(k,m)}_{mm} and b := a^{(k)}_{m+1,m}, then each A^{(k,m)} is in Hessenberg form,
a^{(k,m)}_{21} = · · · = a^{(k,m)}_{m,m−1} = 0, and rows m + 1 through n are identical in A^{(k,m)} and
in A^{(k)}. In particular, R_k := A^{(k,n)} is an upper triangular matrix.

Proof. This is just Lem. J.8(b) combined with a straightforward induction on m. 

Algorithm J.10 (QR Method for Matrices in Hessenberg Form). Given A ∈ M(n, K)
in Hessenberg form, n ∈ N, n ≥ 2, we define the following algorithm for the computation
of a sequence (A^{(k)})_{k∈N} of matrices in M(n, K): Let A^{(1)} := A. For each k ∈ N, A^{(k+1)}
is obtained from A^{(k)} by a sequence of n − 1 unitary similarity transformations:

    A^{(k)} =: B^{(k,1)} ↦ B^{(k,2)} := G_{k1}^* B^{(k,1)} G_{k1}
            ↦ · · · ↦ A^{(k+1)} := B^{(k,n)} := G_{k,n-1}^* B^{(k,n-1)} G_{k,n-1},              (J.16a)

where the G_{km}, m ∈ {1, . . . , n − 1}, are defined as in Lem. J.9. To obtain G_{km}, one
needs to know a^{(k,m)}_{mm} and a^{(k)}_{m+1,m}. However, as it turns out, one does not actually have
to compute the matrices A^{(k,m)} explicitly, since, letting

    ∀_{m∈{1,...,n−1}}:   C^{(k,m)} := G_{km}^* B^{(k,m)},                     (J.16b)

one has (see Th. J.11(c) below)

    ∀_{m∈{1,...,n−2}}:   c^{(k,m)}_{m+1,m+1} = a^{(k,m+1)}_{m+1,m+1},   c^{(k,m)}_{m+2,m+1} = a^{(k)}_{m+2,m+1}.              (J.16c)

Thus, when computing B^{(k,m+1)} for (J.16a), one computes C^{(k,m)} first and stores the
elements in (J.16c) to be available to obtain G_{k,m+1}.

Theorem J.11. Let A ∈ M(n, K) be in Hessenberg form, n ∈ N, n ≥ 2. Let the


matrices A(k) , A(k,m) , B (k,m) , and C (k,m) be defined as in Alg. J.10.

(a) The QR method for matrices in Hessenberg form as defined in Alg. J.10 is, indeed,
a QR method as defined in Alg. 7.18, i.e. (7.37) holds for the A(k) . In particular,
Alg. J.10 preserves the eigenvalues (with multiplicities) of A and Th. 7.22 applies
(also cf. Rem. 7.23(a)).

(b) Each C (k,m) and each A(k) is in Hessenberg form.

(c) Let m ∈ {1, . . . , n − 1}. Then all columns with indices m + 1 through n of the
matrices C (k,m) and A(k,m+1) are identical (in particular, (J.16c) holds true).

Proof. Step 1: Assuming A(k) to be in Hessenberg form, we show that each C (k,m) ,
m ∈ {1, . . . , n − 1}, is in Hessenberg form as well, (c) holds, and A(k+1) is in Hessenberg
form: Let m ∈ {1, . . . , n − 1}. According to the definition in Lem. J.9,

A(k,m+1) = G∗km · · · G∗k1 A(k) .

Thus, according to (J.16a),

C (k,m) = G∗km B (k,m) = G∗km G∗k,m−1 · · · G∗k1 A(k) Gk1 · · · Gk,m−1


= A(k,m+1) Gk1 · · · Gk,m−1 . (J.17)

If a matrix Z ∈ M(n, K) is in Hessenberg form with z_{l+1,l} = z_{l+2,l+1} = 0, then right
multiplication by G_{kl} only affects columns l, l + 1 of Z and rows 1, . . . , l + 1 of Z. As
A^{(k,m+1)} is in Hessenberg form with a^{(k,m+1)}_{21} = · · · = a^{(k,m+1)}_{m+1,m} = 0 by Lem. J.9, (J.17)
implies C^{(k,m)} is in Hessenberg form with columns m + 1, . . . , n of C^{(k,m)} and A^{(k,m+1)} being
identical. As A^{(k+1)} = C^{(k,n−1)} G_{k,n−1}, and G_{k,n−1} only affects columns n − 1 and n of
C^{(k,n−1)}, A^{(k+1)} must also be in Hessenberg form, concluding Step 1.
Step 2: Assuming A(k) to be in Hessenberg form, we show (7.37) to hold for A(k) and
A(k+1) : According to (J.16a),

A(k+1) = G∗k,n−1 · · · G∗k1 A(k) Gk1 · · · Gk,n−1 = Rk Qk , (J.18a)

where

Rk = G∗k,n−1 · · · G∗k1 A(k) , Qk := Gk1 · · · Gk,n−1 , A(k) = Qk Rk . (J.18b)

As Rk is upper triangular by Lem. J.9 and Qk is unitary (as it is the product of unitary
matrices), we have proved that (7.37) holds and, thereby, completed Step 2.
Finally, since A(1) is in Hessenberg form by hypothesis, using Steps 1 and 2 in an
induction on k concludes the proof of the theorem. 

Remark J.12. In the situation of Alg. J.10 it takes O(n^2) arithmetic operations to
transform A(k) into A(k+1) (see the proof of Th. J.6(c) for a list of admissible arithmetic
operations): One has to obtain the n − 1 Givens rotations Gk1 , . . . , Gk,n−1 using (J.14),
and one also needs the adjoints of these n − 1 Givens rotations. As the number of nontrivial
entries in the Gkm is always 4 and, thus, independent of n, the total of arithmetic operations

to obtain the Gkm is NG(n) = O(n). Since each C(k,m) is in Hessenberg form, right
multiplication with Gkm only affects at most an (m + 2) × 2 submatrix, needing at most
C(m + 2) arithmetic operations, C ∈ N. The result differs from a Hessenberg matrix at
most in one element, namely the element with index (m + 2, m), i.e. left multiplication
of the result with G^*_{k,m+1} only affects at most a 2 × (n − m) submatrix, needing at most
C(n − m) arithmetic operations (this is also consistent with the very first step, where
one computes G^*_{k1} A(k) with A(k) in Hessenberg form). Thus, the total is at most

    NG(n) + (n − 1)C(n − m + 1 + m + 2) = NG(n) + (n − 1)C(n + 3),

which is O(n^2) for n → ∞.
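The following Python sketch carries out one step A^(k) ↦ A^(k+1) in the product form (J.18): first all left multiplications by the G^*_{km} (yielding the upper triangular R_k), then all right multiplications by the G_{km} (yielding A^(k+1) = R_k Q_k). The rotations are applied as 2-row and 2-column updates, in line with the O(n^2) count above; the function name qr_step_hessenberg and the test matrix are illustrative choices.

import numpy as np

def qr_step_hessenberg(A):
    # One step A^(k) -> A^(k+1) of the QR method for Hessenberg matrices.
    B = np.array(A, dtype=complex)
    n = B.shape[0]
    rotations = []                                # (m, c, s) for the right multiplications
    for m in range(n - 1):                        # left multiplications: B -> R_k
        a, b = B[m, m], B[m + 1, m]
        if b == 0:
            rotations.append((m, 1.0, 0.0))       # G_{km} = Id
            continue
        r = np.sqrt(abs(a) ** 2 + abs(b) ** 2)
        c, s = a / r, b / r                       # (J.14)
        rotations.append((m, c, s))
        rows = B[m:m + 2, :].copy()               # rows m, m+1 are replaced via G^*
        B[m, :] = np.conj(c) * rows[0] + np.conj(s) * rows[1]
        B[m + 1, :] = -s * rows[0] + c * rows[1]
    for m, c, s in rotations:                     # right multiplications: R_k -> R_k Q_k
        cols = B[:, m:m + 2].copy()
        B[:, m] = c * cols[:, 0] + s * cols[:, 1]
        B[:, m + 1] = -np.conj(s) * cols[:, 0] + np.conj(c) * cols[:, 1]
    return B

A = np.array([[4.0, 1.0, 2.0],
              [3.0, 1.0, 0.5],
              [0.0, 2.0, 1.0]])
B = qr_step_hessenberg(A)
print(np.allclose(np.tril(B, -2), 0))             # A^(k+1) is again in Hessenberg form
print(np.allclose(np.poly(A), np.poly(B)))        # eigenvalues are preserved

Iterating this step produces the sequence (A^(k))_{k∈N} of Alg. J.10.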

K Minimization

K.1 Golden Section Search


The goal of this section is to provide a derivative-free algorithm for one-dimensional
minimization problems

min f (t), f : I −→ R, I ⊆ R, (K.1)

that is worst-case optimal in the same way the bisection method is worst-case optimal for
root-finding problems. Let us emphasize that one-dimensional minimization problems
also often occur as subproblems, so-called line searches in multi-dimensional minimiza-
tion: One might use a derivative-based method to determine a descent direction u at
some point x for some objective function J : X −→ R, that means some u ∈ X such
that the auxiliary objective function

f : [0, a[−→ R, f (t) := J(x + tu) (K.2)

is known to be decreasing in some neighborhood of 0. The line search then aims to


minimize J on the line segment {x+tu : t ∈ [0, a[}, i.e. to perform the (one-dimensional)
minimization of f .
Recall the bisection algorithm for finding a zero of a one-dimensional continuous function
f : [a, b] −→ R, defined on some interval [a, b] ⊆ R, a < b, with f (a) < 0 < f (b): In each
step, one has an interval [an , bn ], an < bn , with f (an ) < 0 < f (bn ), and one evaluates
f at the midpoint cn := (an + bn )/2, choosing [an , cn ] as the next interval if f (cn ) > 0
and [cn , bn ] if f (cn ) < 0. If f is continuous, then the boundary points of the intervals
converge to a zero of f . Choosing the midpoint is worst-case optimal in the sense that,
without further knowledge on f , any other choice for cn might force one to choose the

larger of the two subintervals for the next step, which would then have length bigger
than (bn − an )/2.
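A minimal Python sketch of this bisection step (the function name bisection, the tolerance tol, and the example are illustrative assumptions, not taken from the text):

def bisection(f, a, b, tol=1e-12):
    # bisection for a continuous f with f(a) < 0 < f(b); returns an approximate zero
    assert f(a) < 0 < f(b), "a zero must be bracketed: f(a) < 0 < f(b)"
    while b - a > tol:
        c = 0.5 * (a + b)          # worst-case optimal choice of the next point
        fc = f(c)
        if fc == 0:
            return c
        if fc > 0:
            b = c                  # keep [a, c], since f(a) < 0 < f(c)
        else:
            a = c                  # keep [c, b], since f(c) < 0 < f(b)
    return 0.5 * (a + b)

print(bisection(lambda x: x**2 - 2.0, 0.0, 2.0))   # approximately sqrt(2)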
As we will see below, golden section search is the analogue of the bisection algorithm
for minimization problems, where the situation is slightly more subtle. First, we have
to address the problem of bracketing a minimum: For the bisection method, by the
intermediate value theorem, a zero of the continuous function f is bracketed if f(a) <
0 < f(b) (or f(a) > 0 > f(b)). However, to bracket the min of a continuous function f, we need
three points a < b < c, where

f (b) = min{f (a), f (b), f (c)}. (K.3)

Definition K.1. Let I ⊆ R be an interval, f : I −→ R. Then (a, b, c) ∈ I 3 is called a


minimization triplet for f if, and only if, a < b < c and (K.3) holds.

Clearly, if f is continuous, then (K.3) implies f to have (at least one) local min in ]a, c[ (a global min of the restriction of f to [a, c]).
Thus, the problem of bracketing a minimum is to determine a minimization triplet for
f , where one has to be aware that the problem has no solution if f is strictly increasing
or strictly decreasing. Still, we can formulate the following algorithm:
Algorithm K.2 (Bracketing a Minimum). Let f : I −→ R be defined on a nontrivial
interval I ⊆ R, m := inf I, M := sup I. We define a finite sequence (a1 , a2 , . . . , aN ),
N ≥ 3, of distinct points in I via the following recursion: For the first two points choose
a < b in the interior of I. Set
    a_1 := a, a_2 := b, σ := 1 if f(b) ≤ f(a),
    a_1 := b, a_2 := a, σ := −1 if f(b) > f(a).                              (K.4)

If the algorithm has not been stopped in the previous step (i.e. if 2 ≤ n < N ), then set
t_n := a_n + 2σ|a_n − a_{n−1}| and

    a_{n+1} := \begin{cases}
        t_n & \text{if } t_n \in I,\\
        M & \text{if } t_n > M,\ M \in I,\\
        (a_n + M)/2 & \text{if } t_n > M,\ M \notin I,\\
        m & \text{if } t_n < m,\ m \in I,\\
        (m + a_n)/2 & \text{if } t_n < m,\ m \notin I.
    \end{cases}                                                              (K.5)

Stop the iteration (and set N := n + 1) if at least one of the following conditions
is satisfied: (i) f(a_{n+1}) ≥ f(a_n), (ii) a_{n+1} ∈ {m, M}, (iii) a_{n+1} ∉ [amin, amax] (where
amin , amax ∈ I are some given bounds), (iv) n + 1 = nmax (where nmax ∈ N is some given
bound).

Remark K.3. (a) Since (K.4) guarantees f (a1 ) ≥ f (a2 ), condition (i) plus an induc-
tion shows
f (a1 ) ≥ f (a2 ) > · · · > f (aN −1 ).

(b) If Alg. K.2 terminates due to condition (i), then (a) implies (aN −2 , aN −1 , aN ) (for
σ = 1) or (aN , aN −1 , aN −2 ) (for σ = −1) to be a minimization triplet for f .

(c) If Alg. K.2 terminates due to conditions (ii) – (iv) (and (i) is not satisfied), then the
method has failed to find a minimization triplet for f . One then has three options:
(a) abort the search (after all, a minimization triplet for f might not exist and aN
is a good candidate for a local min), (b) search on the other side of a1 , (c) continue
the original search with adjusted bounds (if possible).
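The following Python sketch implements the core recursion of Alg. K.2 in the simplified situation I = R, so that the boundary cases of (K.5) cannot occur; the function name bracket_minimum, the bound n_max, and the example are illustrative choices.

def bracket_minimum(f, a, b, n_max=50):
    # Search for a minimization triplet of f, starting from two distinct points a, b
    # (case I = R, so t_n is always admissible). Returns the produced points and a
    # flag telling whether stopping condition (i) was met; in that case the last
    # three points bracket a local min of f (cf. Rem. K.3(b)).
    if f(b) <= f(a):                  # (K.4)
        pts, sigma = [a, b], 1.0
    else:
        pts, sigma = [b, a], -1.0
    for _ in range(2, n_max + 1):
        t = pts[-1] + 2.0 * sigma * abs(pts[-1] - pts[-2])   # (K.5) with t_n in I = R
        pts.append(t)
        if f(t) >= f(pts[-2]):        # stopping condition (i): triplet found
            return pts, True
    return pts, False                 # stopped due to the bound n_max (condition (iv))

pts, ok = bracket_minimum(lambda x: (x - 3.0)**2, 0.0, 0.5)
print(ok, pts[-3:])                   # the last three points bracket the min at x = 3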

Given a minimization triplet (a, b, c) for some function f , we now want to evaluate f at
x ∈ ]a, b[ or at x ∈ ]b, c[ to obtain a new minimization triplet (a′, b′, c′), where c′ − a′ < c − a.
Moreover, the strategy for choosing x should be worst-case optimal.

Lemma K.4. Let I ⊆ R be an interval, f : I −→ R. If (a, b, c) ∈ I 3 is a minimization


triplet for f , then so are the following triples:

(a, x, b) for a < x < b and f (b) ≥ f (x), (K.6a)


(x, b, c) for a < x < b and f (b) ≤ f (x), (K.6b)
(a, b, x) for b < x < c and f (b) ≤ f (x), (K.6c)
(b, x, c) for b < x < c and f (b) ≥ f (x). (K.6d)

Proof. All four cases are immediate. 

While Lem. K.4 shows the forming of new, smaller triplets to be easy, the question
remains of how to choose x. Given a minimization triplet (a, b, c), the location of b is a
fraction w of the way between a and c, i.e.
    w := \frac{b - a}{c - a}, \qquad 1 - w = \frac{c - b}{c - a}.              (K.7)

We will now assume w ≤ 1/2 – if w > 1/2, then one can make the analogous (symmetric)
argument, starting from c rather than from a (we will come back to this point below).
The new point x will be an additional fraction z toward c,

    z := \frac{x - b}{c - a}.                                                  (K.8)

If one chooses the new minimization triplet according to (K.6), then its length l is

    l = \begin{cases}
        w (c - a) & \text{for } a < x < b \text{ and } f(b) ≥ f(x),\\
        (1 - (w + z)) (c - a) & \text{for } a < x < b \text{ and } f(b) ≤ f(x),\\
        (w + z) (c - a) & \text{for } b < x < c \text{ and } f(b) ≤ f(x),\\
        (1 - w) (c - a) & \text{for } b < x < c \text{ and } f(b) ≥ f(x).
    \end{cases}                                                              (K.9)

For the moment, consider the case b < x < c. As we assume no prior knowledge about
f , we have no control over which of the two choices we will have to make to obtain the
new minimization triplet, i.e., if possible, we should make the last two factors in front
of c − a in (K.9) equal, leading to w + z = 1 − w or

z = 1 − 2w. (K.10)

If we were to choose x between a and b, then it could occur that l = (1 − (w + z))(c − a) >
(1 − w)(c − a) (note z < 0 in this case), that means this choice is not worst-case optimal. Together
with the symmetric argument for w > 1/2, this means we always need to put x in the
larger of the two subintervals. While (K.10) provides a formula for determining z in
terms of w, we still have the freedom to choose w. If there is an optimal value for w,
then this value should remain constant throughout the algorithm – it should be the
same for the old triplet and for the new triplet. This leads to

    w = \frac{x - b}{c - b} \overset{(K.8)}{=} \frac{z(c - a)}{c - b} \overset{(K.7)}{=} \frac{z}{1 - w},              (K.11)
which, combined with (K.10), yields

    w^2 - 3w + 1 = 0 \quad \Leftrightarrow \quad w = \frac{3 \pm \sqrt{5}}{2}.              (K.12)

Fortunately, precisely one of the values of w in (K.12) satisfies 0 < w < 1/2, namely

    w = \frac{3 - \sqrt{5}}{2} \approx 0.38, \qquad 1 - w = \frac{\sqrt{5} - 1}{2} \approx 0.62.              (K.13)

Then (1 − w)/w = (1 + √5)/2 is the value of the golden ratio or golden section, which gives the
following Alg. K.5 its name. Together with the symmetric argument for w > 1/2, we can
summarize our findings by stating that the new point x should be a fraction of w = (3 − √5)/2
into the larger of the two subintervals (where the fraction is measured from the central
point of the triplet):

Algorithm K.5 (Golden Section Search). Let I ⊆ R be a nontrivial interval, f : I −→


R. Assume that (a, b, c) ∈ I 3 is a minimization triplet for f . We define the following
recursive algorithm for the computation of a sequence of minimization triplets for f :
Set (a1 , b1 , c1 ) := (a, b, c). Given n ∈ N and a minimization triplet (an , bn , cn ), set
    x_n := \begin{cases}
        b_n + w(c_n - b_n) & \text{if } c_n - b_n ≥ b_n - a_n,\\
        b_n - w(b_n - a_n) & \text{if } c_n - b_n < b_n - a_n,
    \end{cases}
    \qquad w := \frac{3 - \sqrt{5}}{2},                                      (K.14)

and choose (a_{n+1}, b_{n+1}, c_{n+1}) according to (K.6) (in practice, one will stop the iteration
if c_n − a_n < ǫ, where ǫ > 0 is some given bound).
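A minimal Python sketch of Alg. K.5 (the function name golden_section and the stopping tolerance eps are illustrative choices; for brevity, f(b) is re-evaluated in every step, whereas a careful implementation would store it):

import math

def golden_section(f, a, b, c, eps=1e-10):
    # golden section search, started on a minimization triplet (a, b, c) of f;
    # returns an approximation of the point x = lim b_n of Lem. K.6(c)
    w = (3.0 - math.sqrt(5.0)) / 2.0          # the fraction from (K.13)
    while c - a > eps:
        if c - b >= b - a:                    # (K.14): x goes into the larger subinterval
            x = b + w * (c - b)
            if f(b) <= f(x):                  # (K.6c): new triplet (a, b, x)
                c = x
            else:                             # (K.6d): new triplet (b, x, c)
                a, b = b, x
        else:
            x = b - w * (b - a)
            if f(b) <= f(x):                  # (K.6b): new triplet (x, b, c)
                a = x
            else:                             # (K.6a): new triplet (a, x, b)
                b, c = x, b
    return b

print(golden_section(lambda t: (t - 1.0)**2 + 0.5, 0.0, 0.4, 3.0))   # approximately 1.0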

Lemma K.6. Consider Alg. K.5.

(a) The value of f (bn ) decreases monotonically, i.e. f (bn+1 ) ≤ f (bn ) for each n ∈ N.

(b) With the possible exception of one step, one always has

    c_{n+1} - a_{n+1} ≤ q (c_n - a_n), \quad \text{where} \quad q := \frac{1}{2} + \frac{α}{2} \approx 0.69, \quad α := \frac{3 - \sqrt{5}}{2}.              (K.15)

(c) One has

    x := \lim_{n→∞} a_n = \lim_{n→∞} b_n = \lim_{n→∞} c_n, \qquad \{x\} = \bigcap_{n∈N} [a_n, c_n].              (K.16)

Proof. (a) clearly holds, since each new minimization triplet is chosen according to
(K.6).
(b): As, according to (K.14), x_n is always chosen in the larger subinterval, it suffices to
consider the case w := (b_n − a_n)/(c_n − a_n) ≤ 1/2. If w = α, then (K.9) and (K.10) guarantee

    c_{n+1} - a_{n+1} = (w + z)(c_n - a_n) = (1 - w)(c_n - a_n) < q(c_n - a_n).

Moreover, due to (K.11), if w = α (or w = 1 − α) has occurred once, then w = α or
w = 1 − α will hold for all following steps. Let w ≠ α, w ≤ 1/2. If (a_{n+1}, b_{n+1}, c_{n+1}) =
(b_n, x_n, c_n), then (K.15) might not hold, but w = α holds in the next step, i.e. (K.15) does
hold for all following steps. Thus, it only remains to consider the case (a_{n+1}, b_{n+1}, c_{n+1})
= (a_n, b_n, x_n). In this case,

    \frac{c_{n+1} - a_{n+1}}{c_n - a_n} = \frac{x_n - a_n}{c_n - a_n}
        = \frac{b_n - a_n}{c_n - a_n} + \frac{x_n - b_n}{c_n - a_n}
        = w + \frac{x_n - b_n}{c_n - b_n} \cdot \frac{c_n - b_n}{c_n - a_n}
        = w + α(1 - w) = α + (1 - α)w ≤ α + (1 - α)\,\frac{1}{2} = q,

establishing the case.


(c): Clearly, [an+1 , cn+1 ] ⊆ [an , cn ] for each n ∈ N. Applying an induction to (K.15)
yields
    ∀_{n∈N}:   c_{n+1} - a_{n+1} ≤ q^{n-1} (c_1 - a_1),

where the exponent on the right-hand side is n − 1 instead of n due to the possible
exception in one step. Since q < 1, this means limn→∞ (cn − an ) = 0, proving (K.16). 
Definition K.7. Let I ⊆ R be an interval, f : I −→ R. We call f locally monotone
if, and only if, for each x ∈ I, there exists ǫ > 0 such that f is monotone (increasing or
decreasing) on I∩]x − ǫ, x] and monotone on I ∩ [x, x + ǫ[.
Example K.8. The function
    f : R −→ R, \qquad f(x) := \begin{cases} x \sin\frac{1}{x} & \text{for } x ≠ 0,\\ 0 & \text{for } x = 0, \end{cases}

is an example of a continuous function that is not locally monotone (clearly, f is not


monotone in any interval ] − ǫ, 0] or [0, ǫ[, ǫ > 0).
Theorem K.9. Let I ⊆ R be a nontrivial interval and assume f : I −→ R to be
continuous and locally monotone in the sense of Def. K.7. Consider Alg. K.5 and let
x := limn→∞ bn (cf. Lem. K.6(c)). Then f has a local min at x.

Proof. The continuity of f implies f (x) = limn→∞ f (bn ). Then Lem. K.6(a) implies

f (x) = inf{f (bn ) : n ∈ N}. (K.17)

If f does not have a local min at x, then

    ∀_{n∈N} \; ∃_{y_n∈I} \; \bigl( y_n \in\, ]a_n, c_n[ \;\text{ and }\; f(y_n) < f(x) \bigr).              (K.18)

If an < yn < x, then f (an ) > f (yn ) < f (x), showing f is not monotone in [an , x].
Likewise, if x < yn < cn , then f (x) > f (yn ) < f (cn ), showing f is not monotone
in [x, cn ]. Thus, there is no ǫ > 0 such that f is monotone on both I∩]x − ǫ, x] and
I ∩ [x, x + ǫ[, in contradiction to the assumption that f be locally monotone. 

K.2 Nelder-Mead Method


The Nelder-Mead method of [NM65] is a method for minimization in Rn , i.e. for an
objective function f : Rn −→ R. It is also known as the downhill simplex method or

the amoeba method. The method has nothing to do with the simplex method of linear
programming (except that both methods make use of the geometrical structure called
simplex).
The Nelder-Mead method is a derivative-free method that is simple to implement. It
is an example of a method that seems to perform quite well in practice, even though
there are known examples, where it will fail, and, in general, the related convergence
theory available in the literature remains quite limited. Here, we will merely present
the algorithm together with its heuristics, and provide some relevant references to the
literature.
Recall that a (nondegenerate) simplex in Rn is the convex hull of n+1 points x1 , . . . , xn+1
∈ Rn such that the points x2 − x1 , . . . , xn+1 − x1 are linearly independent. If ∆ =
conv{x1 , . . . , xn+1 }, then x1 , . . . , xn+1 are called the vertices of ∆.
The idea of the Nelder-Mead method is to construct a sequence (∆k )k∈N ,

∆k = conv{xk1 , . . . , xkn+1 } ⊆ Rn , (K.19)

of simplices such that the size of the ∆k converges to 0 and the value of f at the vertices
of the ∆k decreases, at least on average. To this end, in each step, one finds the vertex
v, where f is largest and either replaces v by a new vertex (by moving it in the direction
of the centroid of the remaining vertices, see below) or one shrinks the simplex.

Algorithm K.10 (Nelder-Mead). Given f : Rn −→ R, n ∈ N, we provide a recursion


to obtain a sequence of simplices (∆k )k∈N as in (K.19):
Initialization: Choose x11 ∈ Rn \ {e1 , . . . , en } (the initial guess for the desired min of f ),
and set
    ∀_{i∈{1,...,n}}:   x^1_{i+1} := x^1_1 + δ_i e_i,                          (K.20)

where the ei denote the standard unit vectors, and the δi > 0 are initialization param-
eters that one should try to adapt to the (conjectured) characteristic length scale (or
scales) of the considered problem.
Iteration Step: Given ∆k , k ∈ N, sort the vertices such that (K.19) holds with

    f(x^k_1) ≤ f(x^k_2) ≤ · · · ≤ f(x^k_{n+1}).                              (K.21)

According to the following rules, we will either replace (only) xkn+1 or shrink the simplex.
In preparation, we compute the centroid of the first n vertices, i.e.
    x := \frac{1}{n} \sum_{i=1}^{n} x^k_i,                                   (K.22a)

and the search direction u for the new vertex,

u := x − xkn+1 . (K.22b)

One will now apply precisely one of the following rules (1) – (4) in decreasing priority
(i.e. rule (j) can only be used if none of the conditions for rules (1) – (j − 1) applied):

(1) Reflection: Compute the reflected point

xr := x + µr u = xkn+1 + (µr + 1) u (K.23a)

and the corresponding value


fr := f (xr ), (K.23b)
where µr > 0 is the reflection parameter (µr = 1 being a common setting). If

f (xk1 ) ≤ fr < f (xkn ), (K.23c)

then replace xkn+1 with xr to obtain the new simplex ∆k+1 . Heuristics: As f is largest
at xkn+1 , one reflects the point through the hyperplane spanned by the remaining
vertices hoping to find smaller values on the other side. If one, indeed, finds a
sufficiently smaller value (better than the second worst), then one takes the new
point, except if the new point has a smaller value than all other vertices, in which
case, one tries to go even further in that direction by applying rule (2) below.

(2) Expansion: If
f (xr ) < f (xk1 ) (K.24a)
(i.e. (K.23c) does not hold and xr is better than all xki ), then compute the expanded
point
xe := x + µe u = xkn+1 + (µe + 1) u, (K.24b)
where µe > µr is the expansion parameter (µe = 2 being a common setting). If

f (xe ) < fr , (K.24c)

then replace xkn+1 with xe to obtain the new simplex ∆k+1 . If (K.24c) does not hold,
then, as in (1), replace xkn+1 with xr to obtain ∆k+1 . If neither (K.23c) nor (K.24a)
holds, then continue with (3) below. Heuristics: If xr has the smallest value so far,
then one hopes to find even smaller values by going further in the same direction.

(3) Contraction: If neither (K.23c) nor (K.24a) holds, then

fr ≥ f (xkn ). (K.25a)

In this case, compute the contracted point

xc := x + µc u = xkn+1 + (µc + 1) u, (K.25b)

where −1 < µ_c < 0 is the contraction parameter (µ_c = −1/2 being a common setting).
If
f (xc ) < f (xkn+1 ), (K.25c)
then replace xkn+1 with xc to obtain the new simplex ∆k+1 . Heuristics: If (K.25a)
holds, then xr would still be the worst vertex of the new simplex. If the new vertex
is to be still the worst, then it should at least be such that it reduces the volume of
the simplex. Thus, the contracted point is chosen between xkn+1 and x.

(4) Reduction: If (K.25a) holds and (K.25c) does not hold, then we have failed to find a
better vertex in the direction u. We then shrink the simplex around the best point
x^k_1 by setting x^{k+1}_1 := x^k_1 and

    ∀_{i∈{2,...,n+1}}:   x^{k+1}_i := x^k_1 + µ_s (x^k_i − x^k_1),            (K.26)

where 0 < µ_s < 1 is the shrink parameter (µ_s = 1/2 being a common setting).
Heuristics: If f increases by moving from the worst point in the direction of the
better points, then that indicates the simplex being too large, possibly containing
several local mins. Thus, we shrink the simplex around the best point, hoping to
thereby improve the situation.
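The following Python sketch is one way to implement the rules (1) – (4); the function name nelder_mead, the termination after a fixed number of iterations, and the example are illustrative choices, while the default parameters correspond to the common settings mentioned above.

import numpy as np

def nelder_mead(f, x0, deltas, n_iter=200,
                mu_r=1.0, mu_e=2.0, mu_c=-0.5, mu_s=0.5):
    # A sketch of the Nelder-Mead iteration of Alg. K.10; x0 is the initial guess
    # x_1^1 and deltas contains the initialization parameters delta_i of (K.20).
    n = len(x0)
    verts = [np.array(x0, dtype=float)]
    for i in range(n):                          # (K.20): initial simplex
        v = np.array(x0, dtype=float)
        v[i] += deltas[i]
        verts.append(v)
    for _ in range(n_iter):
        verts.sort(key=f)                       # (K.21)
        xbar = sum(verts[:-1]) / n              # (K.22a): centroid of the best n vertices
        u = xbar - verts[-1]                    # (K.22b): search direction
        xr = xbar + mu_r * u                    # (1) reflection, (K.23a)
        fr = f(xr)
        if f(verts[0]) <= fr < f(verts[-2]):    # (K.23c)
            verts[-1] = xr
        elif fr < f(verts[0]):                  # (2) expansion, (K.24a)
            xe = xbar + mu_e * u                # (K.24b)
            verts[-1] = xe if f(xe) < fr else xr
        else:                                   # now (K.25a) holds
            xc = xbar + mu_c * u                # (3) contraction, (K.25b)
            if f(xc) < f(verts[-1]):            # (K.25c)
                verts[-1] = xc
            else:                               # (4) reduction, (K.26)
                verts = [verts[0]] + [verts[0] + mu_s * (v - verts[0]) for v in verts[1:]]
    verts.sort(key=f)
    return verts[0]

xmin = nelder_mead(lambda x: (x[0] - 1.0)**2 + (x[1] + 2.0)**2, [0.0, 0.0], [0.5, 0.5])
print(xmin)                                     # a point close to (1, -2)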

Remark K.11. Consider Alg. K.10.

(a) Clearly, all simplices ∆k , k ∈ N, are nondegenerate: For ∆1 this is guaranteed by


choosing x^1_1 ∉ {e_1, . . . , e_n}; for k > 1, this is guaranteed since (1) – (3) choose
the new vertex not in conv{xk1 , . . . , xkn } (note {xkn+1 + µ(x − xkn+1 ) : µ ∈ R} ∩
conv{xk1 , . . . , xkn } = {x}), and xk2 − xk1 , . . . , xkn+1 − xk1 are linearly independent in
(4).

(b) (1) – (3) are designed such that the average value of f at the vertices decreases:
    \frac{1}{n+1} \sum_{i=1}^{n+1} f(x^{k+1}_i) < \frac{1}{n+1} \sum_{i=1}^{n+1} f(x^k_i).              (K.27)

However, if (4) occurs, then (K.27) can fail to hold.

(c) As mentioned above, even though Alg. K.10 seems to perform well in practice, one
can construct examples, where it will converge to points that are not local minima

(see, e.g., [Kel99, Sec. 8.1.3]). Some positive results regarding the convergence of
Alg. K.10 can be found in [Kel99, Sec. 8.1.2] and [LRWW98]. In general, multidi-
mensional minimization is not an easy task, and it is not uncommon for algorithms
to converge to wrong points. It is therefore recommended in [PTVF07, pp. 503-504]
to restart the algorithm at a point that it claimed to be a minimum (also see [Kel99,
Sec. 8.1.4] regarding restarts of Alg. K.10).

References
[Alt06] Hans Wilhelm Alt. Lineare Funktionalanalysis, 5th ed. Springer-Verlag,
Berlin, 2006 (German).

[Blu68] L.I. Bluestein. A linear filtering approach to the computation of the dis-
crete Fourier transform. Northeast Electronics Research and Engineering
Meeting Record 10 (1968), 218–219.

[Bos13] Siegfried Bosch. Algebra, 8th ed. Springer-Verlag, Berlin, 2013 (Ger-
man).

[Bra77] Helmut Brass. Quadraturverfahren. Studia Mathematica, Vol. 3, Vandenhoeck & Ruprecht, Göttingen, Germany, 1977 (German).

[Cia89] Philippe G. Ciarlet. Introduction to numerical linear algebra and opti-


mization. Cambridge Texts in Applied Mathematics, Cambridge University
Press, Cambridge, UK, 1989.

[Deu11] Peter Deuflhard. Newton Methods for Nonlinear Problems. Springer


Series in Computational Mathematics, Vol. 35, Springer, Berlin, Germany,
2011.

[DH08] Peter Deuflhard and Andreas Hohmann. Numerische Mathematik


1, 4th ed. Walter de Gruyter, Berlin, Germany, 2008 (German).

[FH07] Roland W. Freund and Ronald H.W. Hoppe. Stoer/Bulirsch: Nu-


merische Mathematik 1, 10th ed. Springer, Berlin, Germany, 2007 (Ger-
man).

[GK99] Carl Geiger and Christian Kanzow. Numerische Verfahren zur


Lösung unrestringierter Optimierungsaufgaben. Springer-Verlag, Berlin,
Germany, 1999 (German).

[HB09] Martin Hanke-Bourgeois. Grundlagen der Numerischen Mathematik


und des Wissenschaftlichen Rechnens, 3rd ed. Vieweg+Teubner, Wies-
baden, Germany, 2009 (German).

[Heu08] Harro Heuser. Lehrbuch der Analysis Teil 2, 14th ed. Vieweg+Teubner,
Wiesbaden, Germany, 2008 (German).

[HH92] G. Hämmerlin and K.-H. Hoffmann. Numerische Mathematik, 3rd ed.


Grundwissen Mathematik, Vol. 7, Springer-Verlag, Berlin, 1992 (German).

[Kel99] C.T. Kelley. Iterative Methods for Optimization. SIAM, Philadelphia,


USA, 1999.

[LRWW98] J.C. Lagarias, J.A. Reeds, M.H. Wright, and P.E. Wright. Con-
vergence Properties of the Nelder-Mead Simplex Method in Low Dimensions.
SIAM J. Optim. 9 (1998), No. 1, 112–147.

[NM65] J.A. Nelder and R. Mead. A simplex method for function minimization.
Comput. J. 7 (1965), 308–313.

[OR70] J.M. Ortega and W.C. Rheinboldt. Iterative Solution of Nonlinear


Equations in Several Variables. Computer Science and Applied Mathemat-
ics, Academic Press, New York, 1970.

[Phi09] P. Philip. Optimal Control of Partial Differential Equations. Lecture


Notes, Ludwig-Maximilians-Universität, Germany, 2009, available in PDF
format at http://www.math.lmu.de/~philip/publications/lectureNotes/philipPeter_OptimalControlOfPDE.pdf.

[Phi16a] P. Philip. Analysis I: Calculus of One Real Variable. Lecture Notes, LMU
Munich, 2015/2016, AMS Open Math Notes Ref. # OMN:202109.111306,
available in PDF format at
https://www.ams.org/open-math-notes/omn-view-listing?listingId=111306.

[Phi16b] P. Philip. Analysis II: Topology and Differential Calculus of Several Vari-
ables. Lecture Notes, LMU Munich, 2016, AMS Open Math Notes Ref. #
OMN:202109.111307, available in PDF format at
https://www.ams.org/open-math-notes/omn-view-listing?listingId=111307.

[Phi16c] P. Philip. Ordinary Differential Equations. Lecture Notes, Ludwig-


Maximilians-Universität, Germany, 2016, available in PDF format at
http://www.math.lmu.de/~philip/publications/lectureNotes/philipPeter_ODE.pdf.

[Phi17] P. Philip. Analysis III: Measure and Integration Theory of Several Vari-
ables. Lecture Notes, LMU Munich, 2016/2017, AMS Open Math Notes
Ref. # OMN:202109.111308, available in PDF format at
https://www.ams.org/open-math-notes/omn-view-listing?listingId=111308.

[Phi19a] P. Philip. Linear Algebra I. Lecture Notes, Ludwig-Maximilians-Universi-


tät, Germany, 2018/2019, available in PDF format at
http://www.math.lmu.de/~philip/publications/lectureNotes/philipPeter_LinearAlgebra1.pdf.

[Phi19b] P. Philip. Linear Algebra II. Lecture Notes, Ludwig-Maximilians-Univer-


sität, Germany, 2019, available in PDF format at
http://www.math.lmu.de/~philip/publications/lectureNotes/philipPeter_LinearAlgebra2.pdf.

[Pla10] Robert Plato. Numerische Mathematik kompakt, 4th ed. Vieweg Verlag,
Wiesbaden, Germany, 2010 (German).

[PTVF07] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flan-
nery. Numerical Recipes. The Art of Scientific Computing, 3rd ed. Cam-
bridge University Press, New York, USA, 2007.

[QSS07] Alfio Quarteroni, Riccardo Sacco, and Fausto Saleri. Numerical


Mathematics, 2nd ed. Texts in Applied Mathematics, Vol. 37, Springer-
Verlag, Berlin, 2007.

[RSR69] L.R. Rabiner, R.W. Schafer, and C.M. Rader. The chirp z-transform
algorithm and its application. Bell Syst. Tech. J. 48 (1969), 1249–1292.

[Rud87] W. Rudin. Real and Complex Analysis, 3rd ed. McGraw-Hill Book Com-
pany, New York, 1987.

[Sco11] L. Ridgway Scott. Numerical Analysis. Princeton University Press,


Princeton, USA, 2011.

[SK11] H.R. Schwarz and N. Köckler. Numerische Mathematik, 8th ed.


Vieweg+Teubner Verlag, Wiesbaden, 2011 (German).

[Str08] Gernot Stroth. Lineare Algebra, 2nd ed. Berliner Studienreihe zur
Mathematik, Vol. 7, Heldermann Verlag, Lemgo, Germany, 2008 (German).

[SV01] J. Saranen and G. Vainikko. Periodic Integral and Pseudodifferential


Equations with Numerical Approximation. Springer, Berlin, 2001.

[Wer11] D. Werner. Funktionalanalysis, 7th ed. Springer-Verlag, Berlin, 2011


(German).

[Yos80] K. Yosida. Functional Analysis, 6th ed. Grundlehren der mathematischen


Wissenschaften, Vol. 123, Springer-Verlag, Berlin, 1980.
