Numerical Mathematics I (Lecture Notes) - Peter Philip
Peter Philip∗
Lecture Notes
Originally Created for the Class of Winter Semester 2008/2009 at LMU Munich,
Revised and Extended for Several Subsequent Classes†
∗E-Mail: philip@math.lmu.de
†Resources used in the preparation of this text include [DH08, GK99, HB09, Pla10].

Contents

1 Introduction and Motivation

3 Interpolation
  3.1 Motivation
  3.2 Polynomial Interpolation
    3.2.1 Polynomials and Polynomial Functions
    3.2.2 Existence and Uniqueness
    3.2.3 Hermite Interpolation
4 Numerical Integration
  4.1 Introduction
  4.2 Quadrature Rules Based on Interpolating Polynomials
  4.3 Newton-Cotes Formulas
    4.3.1 Definition, Weights, Degree of Accuracy
    4.3.2 Rectangle Rules (n = 0)
    4.3.3 Trapezoidal Rule (n = 1)
    4.3.4 Simpson’s Rule (n = 2)
    4.3.5 Higher Order Newton-Cotes Formulas
  4.4 Convergence of Quadrature Rules
  4.5 Composite Newton-Cotes Quadrature Rules
    4.5.1 Introduction, Convergence
    4.5.2 Composite Rectangle Rules (n = 0)
    4.5.3 Composite Trapezoidal Rules (n = 1)
    4.5.4 Composite Simpson’s Rules (n = 2)
  4.6 Gaussian Quadrature
    4.6.1 Introduction
    4.6.2 Orthogonal Polynomials
    4.6.3 Gaussian Quadrature Rules

7 Eigenvalues
  7.1 Introductory Remarks
  7.2 Estimates, Localization
  7.3 Power Method
  7.4 QR Method

8 Minimization
  8.1 Motivation, Setting
  8.2 Local Versus Global and the Role of Convexity
  8.3 Gradient-Based Descent Methods
  8.4 Conjugate Gradient Method

C Integration
  C.1 K^m-Valued Integration
  C.2 Continuity

J Eigenvalues
  J.1 Continuous Dependence on Matrix Coefficients
  J.2 Transformation to Hessenberg Form
  J.3 QR Method for Matrices in Hessenberg Form

K Minimization
  K.1 Golden Section Search
  K.2 Nelder-Mead Method
1 Introduction and Motivation
Remark 1.2. Even though we require here that an algorithm terminate after a finite number of steps, this part of the definition is sometimes omitted in the literature. The question of whether a given method can be guaranteed to terminate after a finite number of steps is often tremendously difficult (sometimes even impossible) to answer.
Example 1.3. Let a, a0 ∈ R⁺ and consider the sequence (x_n)_{n∈N0} defined recursively by

    x0 := a0,   x_{n+1} := (x_n + a/x_n)/2   for each n ∈ N0.                (1.1)

It is an exercise to show that, for each a, a0 ∈ R⁺, this sequence is well-defined (i.e. x_n > 0 for each n ∈ N0) and converges to √a (this is Newton’s method for the computation of the zero of the function f : R⁺ −→ R, f(x) := x² − a, cf. Ex. 6.4(b)). The x_n can be computed using the following finite sequence of instructions:
    1 : x = a0            % store the number a0 in the variable x
    2 : x = (x + a/x)/2   % compute (x + a/x)/2 and replace the
                          % contents of the variable x with the computed value
    3 : goto 2            % continue with instruction 2                      (1.2)
Even though the contents of the variable x will converge to √a, (1.2) does not constitute an algorithm in the sense of Def. 1.1, since it does not terminate. To guarantee termination and to make the method into an algorithm, one might introduce the following modification:
    1 : ε = 10⁻¹⁰ * a     % store the number 10⁻¹⁰ a in the variable ε
    2 : x = a0            % store the number a0 in the variable x
    3 : y = x             % copy the contents of the variable x to
                          % the variable y to save the value for later use
    4 : x = (x + a/x)/2   % compute (x + a/x)/2 and replace the              (1.3)
                          % contents of the variable x with the computed value
    5 : if |x − y| > ε
          then goto 3     % if |x − y| > ε, then continue with instruction 3
          else quit       % if |x − y| ≤ ε, then terminate the method
Now the convergence of the sequence guarantees that the method terminates within
finitely many steps.
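The terminating method (1.3) translates directly into code; the following is a minimal Python sketch (the function name and the printed check are illustrative, not part of the text):

```python
import math

def sqrt_newton(a, a0):
    """Approximate sqrt(a) by the method (1.3); a, a0 > 0 is assumed."""
    eps = 1e-10 * a          # tolerance, as in instruction 1
    x = a0                   # initial value, as in instruction 2
    while True:
        y = x                # save the previous iterate (instruction 3)
        x = (x + a / x) / 2  # one Newton step (instruction 4)
        if abs(x - y) <= eps:
            return x         # |x - y| <= eps: terminate (instruction 5)

print(abs(sqrt_newton(2.0, 1.0) - math.sqrt(2.0)) < 1e-9)
```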
—
Another issue regarding algorithms, already touched on in Ex. 1.3, is the implicit requirement of Def. 1.1 that an algorithm be well-defined. That means that, for every initial condition, given a number n ∈ N, the method has either terminated after m ≤ n steps, or it provides a (feasible!) instruction to carry out step number n + 1. Such methods are called complete. Methods that can run into situations where they have not reached their intended termination point, but can not carry out any further instruction, are called incomplete. Algorithms must be complete! We illustrate the issue in the next example:
Example 1.4. Let a ∈ R \ {2} and N ∈ N. Define the following finite sequence of
instructions:
    1 : n = 1; x = a
    2 : x = 1/(2 − x); n = n + 1
    3 : if n ≤ N                                                             (1.4)
          then goto 2
          else quit
Consider what occurs for N = 10 and a = 5/4. The successive values contained in the variable x are 5/4, 4/3, 3/2, 2. At this stage n = 4 ≤ N, i.e. instruction 3 tells the method to continue with instruction 2. However, the denominator has become 0, and the instruction has become meaningless. The following modification makes the method complete and, thereby, an algorithm:
    1 : n = 1; x = a
    2 : if x ≠ 2
          then x = 1/(2 − x); n = n + 1
          else x = −5; n = n + 1                                             (1.5)
    3 : if n ≤ N
          then goto 2
          else quit
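The complete method (1.5) can be sketched in Python; exact rational arithmetic via fractions.Fraction (an implementation choice, not from the text) ensures that the case x = 2 is actually reached rather than missed due to rounding:

```python
from fractions import Fraction

def iterate_complete(a, N):
    """Run method (1.5): N updates of x -> 1/(2 - x), substituting
    x = -5 whenever x equals 2 (which would give division by zero)."""
    n, x = 1, a
    while n <= N:            # instruction 3: continue while n <= N
        if x != 2:
            x = 1 / (2 - x)  # regular update; denominator nonzero here
        else:
            x = -5           # the completing branch of instruction 2
        n += 1
    return x

# Starting from a = 5/4 as in Ex. 1.4: 5/4 -> 4/3 -> 3/2 -> 2 -> -5
print(iterate_complete(Fraction(5, 4), 4))  # -5
```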
We can only expect to find stable algorithms if the underlying problem is sufficiently
benign. This leads to the following definition:
Definition 1.5. We say that a mathematical problem is well-posed if, and only if, its
solutions enjoy the three benign properties of existence, uniqueness, and continuity with
respect to the input data. More precisely, given admissible input data, the problem
must have a unique solution (output), thereby providing a map between the set of
admissible input data and (a superset of) the set of possible solutions. This map must
be continuous with respect to suitable norms or metrics on the respective sets (small
changes of the input data must only cause small changes in the solution). A problem
which is not well-posed is called ill-posed.
—
We can thus add to the important tasks of Numerical Mathematics mentioned earlier the additional important task of investigating a problem’s well-posedness. Then, once well-posedness is established, the task is to provide a stable algorithm for its solution.
Example 1.6. (a) The problem “find a minimum of a given polynomial function p : R −→ R” is inherently ill-posed: Depending on p, the problem has no solution (e.g. p(x) = x), a unique solution (e.g. p(x) = x²), finitely many solutions (e.g. p(x) = x²(x − 1)²(x + 2)²), or infinitely many solutions (e.g. p(x) = 1).
(b) Frequently, one can transform an ill-posed problem into a well-posed problem by choosing an appropriate setting: Consider the problem “find a zero of f(x) = ax² + c”. If, for example, one admits a, c ∈ R and looks for solutions in R, then the problem is ill-posed, as one has no solutions for ac > 0 and no solutions for a = 0, c ≠ 0. Even for ac < 0, the problem is not well-posed, as the solution is not always unique. However, in this case, one can make the problem well-posed by considering solutions in R². The correspondence between the input and the solution (sometimes referred to as the solution operator) is then given by the continuous map

    S : {(a, c) ∈ R² : ac < 0} −→ R²,   S(a, c) := (√(−c/a), −√(−c/a)).     (1.6a)
The problem is also well-posed when just requiring a ≠ 0, but admitting complex solutions. The continuous solution operator is then given by

    S : {(a, c) ∈ R² : a ≠ 0} −→ C²,

    S(a, c) := (√(|c|/|a|), −√(|c|/|a|))     for ac ≤ 0,                     (1.6b)
    S(a, c) := (i√(|c|/|a|), −i√(|c|/|a|))   for ac > 0.
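A small Python sketch of the solution operator (1.6b): the principal complex square root from the cmath module reproduces exactly the case distinction above (the function name is illustrative):

```python
import cmath

def S(a, c):
    """The two zeros of f(x) = a*x**2 + c for a != 0, as in (1.6b)."""
    root = cmath.sqrt(-c / a)  # principal square root of -c/a
    return root, -root

print(S(1, -4))  # ac < 0: real zeros, here 2 and -2
print(S(1, 4))   # ac > 0: purely imaginary zeros, here 2j and -2j
```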
(c) The problem “determine if x ∈ R is positive” might seem simple at first glance; however, it is ill-posed, as it is equivalent to computing the values of the function

    S : R −→ {0, 1},   S(x) := 1 for x > 0,   S(x) := 0 for x ≤ 0,           (1.7)

which is discontinuous at 0.
—
As stated before, the analysis and control of errors is of central interest. Errors occur
due to several causes:
(1) Modeling Errors: A mathematical model can, even in the best of cases, only approximate the physical situation. Often models have to be further simplified in order to compute solutions and to make them accessible to mathematical analysis.
(2) Data Errors: Typically, there are errors in the input data. Input data often result
from measurements of physical experiments or from calculations that are potentially
subject to every type of error in the present list.
One should always be aware that errors of the types just listed can, and often will, be present. However, in the context of Numerical Mathematics, one focuses mostly on the following error types:
(4) Truncation Errors: Such errors occur when replacing an infinite process (e.g. an
infinite series) by a finite process (e.g. a finite summation).
(5) Round-Off Errors: Errors occurring when discarding digits needed for the exact
representation of a (e.g. real or rational) number.
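The two error types can be made concrete in a short, hypothetical Python snippet: the series for e is truncated after finitely many terms (type (4)), and binary floating-point arithmetic rounds decimal fractions (type (5)):

```python
import math

# Truncation error: replace the infinite series e = sum 1/k! by a finite sum.
def exp_truncated(terms):
    return sum(1.0 / math.factorial(k) for k in range(terms))

truncation_error = abs(math.e - exp_truncated(10))  # nonzero, about 3e-7

# Round-off error: 0.1 and 0.2 have no exact binary representation,
# so 0.1 + 0.2 differs from 0.3 by a small nonzero amount.
roundoff_error = abs((0.1 + 0.2) - 0.3)

print(truncation_error < 1e-6, roundoff_error > 0)
```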
The functioning of our society increasingly relies on the use of numerical algorithms. In consequence, avoiding and controlling numerical errors is vital. Several
examples of major disasters caused by numerical errors can be found on the following
web page of D.N. Arnold at the University of Minnesota:
http://www.ima.umn.edu/~arnold/disasters/
For a much more comprehensive list of numerical and related errors that had significant
consequences, see the web page of T. Huckle at TU Munich:
http://www5.in.tum.de/~huckle/bugse.html
2 Tools: Landau Symbols, Norms, Condition

Definition 2.2. Let (X, d) be a metric space and let (Y, ‖·‖), (Z, ‖·‖) be normed vector spaces over K. Let D ⊆ X and consider functions f : D −→ Y, g : D −→ Z, where g(x) ≠ 0 for each x ∈ D. Moreover, let x0 ∈ X be a cluster point of D.
(a) f is called of order big O of g or just big O of g (denoted by f(x) = O(g(x))) for x → x0 if, and only if,

    lim sup_{x→x0} ‖f(x)‖/‖g(x)‖ < ∞                                         (2.1)

(i.e. there exists M ∈ R⁺₀ such that, for each sequence (x_n)_{n∈N} in D with the property lim_{n→∞} x_n = x0, the sequence (‖f(x_n)‖/‖g(x_n)‖)_{n∈N} in R⁺₀ is bounded from above by M).

(b) f is called of order little o of g or just little o of g (denoted by f(x) = o(g(x))) for x → x0 if, and only if, lim_{x→x0} ‖f(x)‖/‖g(x)‖ = 0 (i.e., for each sequence (x_n)_{n∈N} in D with the property lim_{n→∞} x_n = x0, we have lim_{n→∞} ‖f(x_n)‖/‖g(x_n)‖ = 0).
Remark 2.3. In applications, the space X in Def. 2.2 is often R̄ := R ∪ {−∞, ∞}, where convergence is defined in the usual way. This usual notion of convergence is, indeed, given by a suitable metric, where, as a metric space, R̄ is then homeomorphic to [−1, 1], where

    φ : R̄ −→ [−1, 1],   φ(x) := −1 for x = −∞,   x/(|x| + 1) for x ∈ R,   1 for x = ∞,

provides a homeomorphism (cf. [Phi17, Sec. A], in particular, [Phi17, Rem. A.5]).
Proposition 2.4. Consider the setting of Def. 2.2. In addition, define the functions

    ‖f(x)‖/‖g(x)‖ ≤ C   for each x ∈ D \ {x0} with d(x, x0) < δ.             (2.3)
Proof. (a): The equivalence between (i) and (ii) is immediate from (2.4). Now suppose (i) holds, and let M := lim sup_{x→x0} ‖f(x)‖/‖g(x)‖, C := 1 + M. If there were no δ > 0 such that (2.3) holds, then, for each δ_n := 1/n, n ∈ N, there were some x_n ∈ D \ {x0} with d(x_n, x0) < δ_n and y_n := ‖f(x_n)‖/‖g(x_n)‖ > C = 1 + M, implying lim sup_{n→∞} y_n ≥ 1 + M > M, in contradiction to the definition of M. Thus, (iii) must hold. Conversely, suppose (iii) holds, i.e. there exist C, δ > 0 such that (2.3) is valid. Then, for each sequence (x_n)_{n∈N} in D \ {x0} such that lim_{n→∞} x_n = x0, the sequence (y_n)_{n∈N} with y_n := ‖f(x_n)‖/‖g(x_n)‖ can not have a cluster point larger than C (only finitely many of the x_n are not in B_δ(x0) := {x ∈ D : d(x, x0) < δ}, which is the neighborhood of x0 determined by δ). Thus lim sup_{n→∞} y_n ≤ C, thereby implying (2.1) and (i).

(b): Again, the equivalence between (i) and (ii) is immediate from (2.4). The equivalence between (i) and (iii), we know from Analysis 2, cf. [Phi16b, Cor. 2.9].
Due to the respective equivalences in Prop. 2.4, in the literature, the Landau symbols
are often only defined for real-valued functions.
Proposition 2.5. Consider the setting of Def. 2.2. Suppose k · kY and k · kZ are
equivalent norms on Y and Z, respectively. Then f (x) = O(g(x)) (resp. f (x) = o(g(x)))
for x → x0 with respect to the original norms if, and only if, f (x) = O(g(x)) (resp.
f (x) = o(g(x))) for x → x0 with respect to the new norms k · kY and k · kZ .
Proof. Suppose f(x) = O(g(x)) for x → x0 with respect to the original norms. Then, with constants CY, CZ > 0 from the equivalence of the norms (such that ‖f(x)‖Y ≤ CY ‖f(x)‖ and CZ ‖g(x)‖ ≤ ‖g(x)‖Z for each x ∈ D),

    lim sup_{x→x0} ‖f(x)‖Y/‖g(x)‖Z ≤ (CY/CZ) lim sup_{x→x0} ‖f(x)‖/‖g(x)‖ < ∞,

showing f(x) = O(g(x)) for x → x0 with respect to the new norms. Suppose f(x) = o(g(x)) for x → x0 with respect to the original norms. Then

    lim_{x→x0} ‖f(x)‖Y/‖g(x)‖Z ≤ (CY/CZ) lim_{x→x0} ‖f(x)‖/‖g(x)‖ = 0,

showing f(x) = o(g(x)) for x → x0 with respect to the new norms. The converse implications now also follow, since one can switch the roles of the old norms and the new norms.
Example 2.6. (a) Consider a polynomial function on R, i.e. (a_0, . . . , a_n) ∈ R^{n+1}, n ∈ N0, and

    P : R −→ R,   P(x) := Σ_{i=0}^{n} a_i x^i,   a_n ≠ 0.                    (2.5)

Then, for p ∈ R,

    P(x) = O(x^p) for x → ∞  ⇔  p ≥ n:                                       (2.6)

For each x ≠ 0:

    P(x)/x^p = Σ_{i=0}^{n} a_i x^{i−p}.                                      (2.7)

Since

    lim_{x→∞} x^{i−p} = ∞ for i − p > 0,   1 for i = p,   0 for i − p < 0,   (2.8)

(2.7) implies (2.6). Thus, in particular, for x → ∞, each constant function is big O of 1: a0 = O(1).
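The statement of Ex. 2.6(a) can be checked numerically; a hypothetical Python snippet with P(x) = 3x² + 2x + 1 (so n = 2 and a_n = 3):

```python
def ratio(p, x):
    """P(x) / x**p for P(x) = 3*x**2 + 2*x + 1."""
    return (3 * x**2 + 2 * x + 1) / x**p

# For p = n = 2, the quotient stays bounded (it tends to a_n = 3),
# so P(x) = O(x^2) for x -> infinity; for p = 3 > n it tends to 0.
print(ratio(2, 1e8), ratio(3, 1e8))
```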
In the literature, one also finds statements of the form f (x) = h(x) + o(g(x)), which are
not covered by Def. 2.2 since, according to Def. 2.2, o(g(x)) does not denote an element
of a set, i.e. adding o(g(x)) to (or otherwise combining it with) an element of a set does
not immediately make any sense. To give a meaning to such expressions is the purpose
of the following definition:
Definition 2.7. As in Def. 2.2, let (X, d) be a metric space, let (Y, ‖·‖), (Z, ‖·‖) be normed vector spaces over K, let D ⊆ X and consider functions f : D −→ Y, g : D −→ Z, where g(x) ≠ 0 for each x ∈ D. Moreover, let x0 ∈ X be a cluster point of D. Now let (V, ‖·‖) be another normed vector space, let S ⊆ V be a subset¹ and, for each x ∈ D, let Fx : S −→ Y be invertible on f(D) ⊆ Y. Define

    f(x) = Fx(O(g(x))) for x → x0   :⇔   Fx⁻¹(f(x)) = O(g(x)) for x → x0,
    f(x) = Fx(o(g(x))) for x → x0   :⇔   Fx⁻¹(f(x)) = o(g(x)) for x → x0.
Example 2.8. (a) In the situation of Ex. 2.6(b), we can now state that f is differentiable in ξ if, and only if, there exists a linear map L : Rⁿ −→ Rᵐ such that

    f(ξ + h) − f(ξ) = L(h) + o(h)   for h → 0:                               (2.10)

Indeed, for each h ∈ Rⁿ, the map

    Fh : Rᵐ −→ Rᵐ,   Fh(s) := L(h) + s,

is invertible and, by Def. 2.7,

    f(ξ + h) − f(ξ) = L(h) + o(h) = Fh(o(h))   for h → 0

is equivalent to

    r(h) := f(ξ + h) − f(ξ) − L(h) = Fh⁻¹(f(ξ + h) − f(ξ)) = o(h),

which is (2.9c).
¹Even though, when using this kind of notation, one typically thinks of Fx as operating on g(x), as O(x) and o(x) are merely notations and not actual mathematical objects, it is not even necessary for S to contain g(D).
    Fx : R⁺ −→ R,   Fx(s) := x ln s,

is invertible with Fx⁻¹ : R −→ R⁺, Fx⁻¹(y) := exp(y/x) and, by Def. 2.7, (2.11) is equivalent to

    Fx⁻¹(sin x) = e^{(sin x)/x} = O(1),

which holds, due to lim_{x→0} exp((sin x)/x) = e¹ = e.
    R_m(x, a) := (f^{(m+1)}(θ)/(m + 1)!) (x − a)^{m+1}   with some suitable θ ∈ ]x, a[   (2.12c)

is the Lagrange form of the remainder term. Since the continuous function f^{(m+1)} is bounded on each compact interval [a, y], y ∈ I, (2.12) imply

    f(x) = T_m(x, a) + O((x − a)^{m+1}).
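For a concrete instance of the last formula, take f = sin, a = 0, m = 4: then T4(x, 0) = x − x³/6, and the Lagrange remainder gives |sin x − T4(x, 0)| ≤ |x|⁵/5!, since |f⁽⁵⁾| = |cos| ≤ 1. A quick Python check:

```python
import math

def taylor_sin(x):
    """Taylor polynomial T4(x, 0) = x - x**3/6 of sin at a = 0."""
    return x - x**3 / 6

# The remainder is O(x^5): bounded by |x|^5 / 5! near 0.
for x in [0.5, 0.1, 0.01]:
    assert abs(math.sin(x) - taylor_sin(x)) <= abs(x)**5 / math.factorial(5)
```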
Proposition 2.9. As in Def. 2.2, let (X, d) be a metric space, let (Y, ‖·‖), (Z, ‖·‖) be normed vector spaces over K, let D ⊆ X and consider functions f : D −→ Y, g : D −→ Z, where g(x) ≠ 0 for each x ∈ D. Moreover, let x0 ∈ X be a cluster point of D.

(a) Suppose ‖f‖ ≤ ‖g‖ in some neighborhood of x0, i.e. there exists ε > 0 such that

(b) f(x) = O(f(x)) for x → x0, but not f(x) = o(f(x)) for x → x0 (assuming f ≠ 0).

(c) f(x) = o(g(x)) for x → x0 implies f(x) = O(g(x)) for x → x0.

Proof. Exercise.
Proposition 2.10. As in Def. 2.2, let (X, d) be a metric space, let (Y, ‖·‖), (Z, ‖·‖) be normed vector spaces over K, let D ⊆ X and consider functions f : D −→ Y, g : D −→ Z, where g(x) ≠ 0 for each x ∈ D. Moreover, let x0 ∈ X be a cluster point of D. In addition, for use in some of the following assertions, we introduce functions f1 : D −→ Y1, f2 : D −→ Y2, g1 : D −→ Z1, g2 : D −→ Z2, where (Y1, ‖·‖), (Y2, ‖·‖), (Z1, ‖·‖), (Z2, ‖·‖) are normed vector spaces over K. Assume g1(x) ≠ 0 and g2(x) ≠ 0 for each x ∈ D.

(a) If f(x) = O(g(x)) (resp. f(x) = o(g(x))) for x → x0, and if ‖f1‖ ≤ ‖f‖ and ‖g‖ ≤ ‖g1‖ in some neighborhood of x0, then f1(x) = O(g1(x)) (resp. f1(x) = o(g1(x))) for x → x0.

(b) If α ∈ K \ {0}, then for x → x0:

(c) For x → x0:

    f(x) = O(g1(x)) and g1(x) = O(g2(x)) implies f(x) = O(g2(x)),
    f(x) = o(g1(x)) and g1(x) = o(g2(x)) implies f(x) = o(g2(x)).

(e) For x → x0:

(f) For x → x0:

    f(x) = O(‖g1(x)‖ ‖g2(x)‖)  ⇒  ‖f(x)‖/‖g1(x)‖ = O(g2(x)),
    f(x) = o(‖g1(x)‖ ‖g2(x)‖)  ⇒  ‖f(x)‖/‖g1(x)‖ = o(g2(x)).

Proof. Exercise.
Definition 2.11. Let A : X −→ Y be a linear map between two normed vector spaces (X, ‖·‖) and (Y, ‖·‖) over K. Then A is called bounded if, and only if, A maps bounded sets to bounded sets, i.e. if, and only if, A(B) is a bounded subset of Y for each bounded B ⊆ X. The vector space of all bounded linear maps between X and Y is denoted by L(X, Y).
Definition 2.12. Let A : X −→ Y be a linear map between two normed vector spaces (X, ‖·‖) and (Y, ‖·‖) over K. The number

    ‖A‖ := sup{ ‖Ax‖/‖x‖ : x ∈ X, x ≠ 0 } = sup{ ‖Ax‖ : x ∈ X, ‖x‖ = 1 } ∈ [0, ∞]   (2.13)

is called the operator norm of A induced by the norms on X and Y, respectively (strictly speaking, the term operator norm is only justified if the value is finite, but it is often convenient to use the term in the generalized way defined here).

In the special case, where X = Kⁿ, Y = Kᵐ, and A is given via an m × n matrix over K, the operator norm is also called a natural matrix norm or an induced matrix norm.
Remark 2.13. According to Th. B.1 of the Appendix, for a linear map A : X −→ Y between two normed vector spaces X and Y over K, the following statements are equivalent:

(a) A is bounded.

(d) A is continuous.

For linear maps between finite-dimensional spaces, the above equivalent properties always hold: Each linear map A : Kⁿ −→ Kᵐ, (n, m) ∈ N², is continuous (cf. [Phi16b, Ex. 2.16]). In particular, each linear map A : Kⁿ −→ Kᵐ has all the above equivalent properties.
Remark 2.14. According to Th. B.2 of the Appendix, for two normed vector spaces X and Y over K, the operator norm does, indeed, constitute a norm on the vector space of bounded linear maps L(X, Y), and, moreover, if A ∈ L(X, Y), then ‖A‖ is the smallest Lipschitz constant for A.
Lemma 2.15. If Id : X −→ X, Id(x) := x, is the identity map on a normed vector space X over K, then ‖Id‖ = 1 (in particular, the operator norm of a unit matrix is always 1). Caveat: In principle, one can consider two different norms on X simultaneously, and then the operator norm of the identity can differ from 1.
(a) The number ‖A‖∞ := max_{k=1,...,m} Σ_{l=1}^{n} |a_{kl}| is called the row sum norm of A. It is an exercise to show that ‖A‖∞ is the operator norm induced if Kⁿ and Kᵐ are endowed with the ∞-norm.

(b) The number ‖A‖₁ := max_{l=1,...,n} Σ_{k=1}^{m} |a_{kl}| is called the column sum norm of A. It is an exercise to show that ‖A‖₁ is the operator norm induced if Kⁿ and Kᵐ are endowed with the 1-norm.
(c) The operator norm of A induced if Kⁿ and Kᵐ are endowed with the 2-norm (i.e. the Euclidean norm) is called the spectral norm of A and is denoted ‖A‖₂. Unfortunately, there is no formula as simple as the ones in (a) and (b) for the computation of ‖A‖₂. We will see in Th. 2.24 below that the value of ‖A‖₂ is related to the eigenvalues of A*A.
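The formulas from (a) and (b) are simple enough to compute directly; a small pure-Python sketch (helper names are illustrative):

```python
def row_sum_norm(A):
    """Operator norm induced by the infinity-norm: max absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def col_sum_norm(A):
    """Operator norm induced by the 1-norm: max absolute column sum."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

A = [[1.0, -2.0], [3.0, 4.0]]
print(row_sum_norm(A), col_sum_norm(A))  # 7.0 6.0
```

By contrast, computing the spectral norm from (c) requires the eigenvalues of A*A, for which no comparably simple closed expression in the entries exists in general.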
showing that ‖·‖_{p,nop} does not satisfy Lem. 2.15. If a_{kl} := 1 for each k, l ∈ {1, . . . , n}, then

    n = ‖AA‖_{∞,nop} > ‖A‖_{∞,nop} · ‖A‖_{∞,nop} = 1 · 1 = 1   for n > 1,

showing that ‖·‖_{∞,nop} does not satisfy Lem. 2.16. For p ∈ [1, ∞[, one might think that rescaling ‖·‖_{p,nop} might make it into an operator norm: The modified norm

    ‖·‖′_{p,nop} : K^{n²} −→ R⁺₀,   ‖A‖′_{p,nop} := ‖A‖_{p,nop} / n^{1/p},

does satisfy Lem. 2.15. However, if

    a_{kl} := 1 for k = l = 1,   a_{kl} := 0 otherwise,

then AA = A and

    1/n^{1/p} = ‖AA‖′_{p,nop} > ‖A‖′_{p,nop} · ‖A‖′_{p,nop} = (1/n^{1/p}) · (1/n^{1/p})   for n > 1,

i.e. the modified norm does not satisfy Lem. 2.16.
—
In the rest of this section, we will further investigate the spectral norm defined in Ex.
2.18(c). In preparation, we recall the following notions and notation:
Notation 2.20. Let m, n ∈ N. If A = (a_{kl})_{(k,l)∈{1,...,m}×{1,...,n}} ∈ M(m, n, K), then

(a) A is called symmetric if, and only if, A = Aᵗ, i.e. if, and only if, a_{kl} = a_{lk} for each (k, l) ∈ {1, . . . , n}².

(b) A is called Hermitian if, and only if, A = A*, i.e. if, and only if, a_{kl} is the complex conjugate of a_{lk} for each (k, l) ∈ {1, . . . , n}².

(c) A is called positive semidefinite if, and only if, x*Ax ∈ R⁺₀ for each x ∈ Kⁿ.

(d) A is called positive definite if, and only if, A is positive semidefinite and (x*Ax = 0 ⇔ x = 0), i.e. if, and only if, x*Ax > 0 for each 0 ≠ x ∈ Kⁿ.
Lemma 2.22. Let m, n ∈ N and A ∈ M(m, n, K). Then both A∗ A ∈ M(n, K) and
AA∗ ∈ M(m, K) are Hermitian and positive semidefinite.
yields ‖Av‖₂² = r(A*A), thereby proving ‖A‖₂ ≥ √(r(A*A)) and completing the proof of (2.16a). It remains to consider the case where A = A*. Then A*A = A² and, since

    Av = λv  ⇒  A²v = λ²v  ⇒  r(A²) = r(A)²,

we have

    ‖A‖₂ = √(r(A*A)) = √(r(A²)) = r(A),

thereby establishing the case.
Proof. Let v ∈ Kⁿ such that ‖v‖₂ = 1. One estimates, using the Cauchy-Schwarz inequality [Phi16b, (1.41)],

Note that (2.18) is trivially true for y = 0. If y ≠ 0, then letting w := y/‖y‖₂ yields ‖w‖₂ = 1 and

    |w · y| = y · y/‖y‖₂ = ‖y‖₂.                                             (2.19b)

Together, (2.19a) and (2.19b) establish (2.18).
Proof. We know from Linear Algebra (cf. [Phi19b, Th. 10.22(a)] and [Phi19b, Th. 10.24(a)]) that

    ∀ x ∈ Kⁿ  ∀ v ∈ Kᵐ :   v · Ax = (A*v) · x.
The number krel ∈ R⁺₀ is called the relative condition of (the problem represented by the solution operator) f at x. The number krel provides a measure for how much a relative error in x might get inflated by the application of f, i.e. the smaller krel, the better behaved the problem is in the vicinity of x. What value for krel is still acceptable in terms of numerical stability depends on the situation: While krel ≈ 1 generally indicates stability, even krel ≈ 100 might be acceptable in some situations.
Remark 2.32. Let us compare the newly introduced notion of Def. and Rem. 2.31 of f being well-conditioned with some other regularity notions for f. For example, if f is L-Lipschitz in Bδ(x), x ∈ X, δ > 0, then f is well-conditioned in Bδ(x) and (2.20) holds with K := L ‖x‖/‖f(x)‖. The converse is not true: There are examples where f is well-conditioned in some Bδ(x) without being Lipschitz continuous on Bδ(x) (exercise). On the other hand, (2.20) does imply continuity in x, and, once again, the converse is not true (another exercise). In particular, a problem can be well-posed in the sense of Def. 1.5 without being well-conditioned. On the other hand, if f is everywhere well-conditioned, then it is everywhere continuous and, thus, well-posed. In that sense, well-conditionedness is stronger than well-posedness. However, one should also note that we defined well-conditionedness as a local property, whereas well-posedness was defined as a global property.
—
Ax = b
where the absolute error is 0.1 and the relative error is even smaller, then the corresponding solution is

    x̃ := (9.2, −12.6, 4.5, −1.1)ᵗ,

in no way similar to the solution for b. In particular, the absolute and relative errors have been hugely amplified.
One might suspect that the issue behind the instability lies in A being “almost” singular, such that applying A⁻¹ is similar to dividing by a small number. However, that is not the case, as we have

    A⁻¹ = (  25  −41   10   −6
            −41   68  −17   10
             10  −17    5   −3
             −6   10   −3    2 ).

We will see in Ex. 2.42 below that the actual reason behind the instability is the range of eigenvalues of A, in particular, the relation between the largest and the smallest eigenvalue. One has:

    λ1 ≈ 0.010,   λ2 ≈ 0.843,   λ3 ≈ 3.86,   λ4 ≈ 30.3,   λ4/λ1 ≈ 2984.
Proposition 2.34. Let m, n ∈ N and consider the normed vector spaces (Kⁿ, ‖·‖) and (Kᵐ, ‖·‖). Let A : Kⁿ −→ Kᵐ be K-linear. Moreover, let x ∈ Kⁿ be such that x ≠ 0 and Ax ≠ 0. Then the following statements hold true:

    krel(A, x) = ‖A‖ ‖x‖/‖Ax‖,

where ‖A‖ and ‖A⁻¹‖ denote the corresponding operator norms of A and A⁻¹, respectively.
    A⁻¹ : Kⁿ −→ Kⁿ,   b ↦ A⁻¹b.

Fixing some norm on Kⁿ and using the induced natural matrix norm, we obtain from Prop. 2.34(b) that, for each 0 ≠ b ∈ Kⁿ:
Definition and Remark 2.36. In view of Prop. 2.34(b) and (2.22), we define the condition number κ(A) (also just called the condition) of an invertible A ∈ M(n, K), n ∈ N, by

    κ(A) := ‖A‖ ‖A⁻¹‖,                                                       (2.23)

where ‖·‖ denotes a natural matrix norm induced by some norm on Kⁿ. Then one immediately gets from (2.22) that

    krel(A⁻¹, b) ≤ κ(A).                                                     (2.24)

The condition number clearly depends on the underlying natural matrix norm. If the natural matrix norm is the spectral norm, then one calls the condition number the spectral condition, and one also writes κ2 instead of κ.
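For a 2 × 2 matrix, the condition number with respect to the row sum norm can be computed in a few lines of Python; a hypothetical sketch, using the matrix that also appears in Ex. 2.40 below:

```python
def inv2(A):
    """Inverse of an invertible 2x2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def norm_inf(A):
    """Row sum norm, i.e. the natural matrix norm for the infinity-norm."""
    return max(sum(abs(x) for x in row) for row in A)

def cond_inf(A):
    return norm_inf(A) * norm_inf(inv2(A))  # kappa(A) = ||A|| * ||A^{-1}||

print(cond_inf([[1.0, 2.0], [1.0, 1.0]]))  # 9.0
```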
denote the set of eigenvalues of A (also known as the spectrum of A) and the set of
absolute values of eigenvalues of A, respectively.
Lemma 2.38. For the spectral condition of an invertible matrix A ∈ M(n, K), n ∈ N, one obtains

    κ2(A) = √( max σ(A*A) / min σ(A*A) )                                     (2.25)

(recall that all eigenvalues of A*A are real and positive). Moreover, if A is Hermitian, then (2.25) simplifies to

    κ2(A) = max |σ(A)| / min |σ(A)|                                          (2.26)

(note that min |σ(A)| > 0, as A is invertible).
Proof. Exercise.
Theorem 2.39. Let A ∈ M(n, K) be invertible, n ∈ N. Assume that x, b, ∆x, ∆b ∈ Kⁿ satisfy

    Ax = b,   A(x + ∆x) = b + ∆b,                                            (2.27)

i.e. ∆b can be seen as a perturbation of the input and ∆x can be seen as the resulting perturbation of the output. One then has the following estimates for the absolute and relative errors:

    ‖∆x‖ ≤ ‖A⁻¹‖ ‖∆b‖,   ‖∆x‖/‖x‖ ≤ κ(A) ‖∆b‖/‖b‖,                          (2.28)

where x, b ≠ 0 is assumed for the second estimate (as before, it does not matter which norm on Kⁿ one uses, as long as one uses the induced operator norm for the matrix).

Proof. Since A is linear, (2.27) implies A(∆x) = ∆b, i.e. ∆x = A⁻¹(∆b), which already yields the first estimate in (2.28). For x, b ≠ 0, the first estimate together with Ax = b easily implies the second:

    ‖∆x‖/‖x‖ ≤ ‖A⁻¹‖ (‖∆b‖/‖b‖) (‖Ax‖/‖x‖) ≤ ‖A⁻¹‖ ‖A‖ ‖∆b‖/‖b‖ = κ(A) ‖∆b‖/‖b‖,

thereby establishing the case.
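The estimate (2.28) is easy to test numerically; a hypothetical check in Python with the ∞-norm and the matrix of Ex. 2.40 below:

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def ninf(v):
    return max(abs(x) for x in v)

A_inv = [[-1.0, 2.0], [1.0, -1.0]]   # inverse of A = [[1, 2], [1, 1]]
kappa = 9.0                          # = ||A||_inf * ||A^{-1}||_inf = 3 * 3

b = [1.0, 4.0]
db = [0.003, -0.002]                 # perturbation Delta b
x = matvec(A_inv, b)                 # x = A^{-1} b
dx = matvec(A_inv, db)               # Delta x = A^{-1} Delta b

lhs = ninf(dx) / ninf(x)             # relative error in x
rhs = kappa * ninf(db) / ninf(b)     # bound from (2.28)
print(lhs <= rhs)
```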
Example 2.40. Suppose we want to find out how much we can perturb b in the problem

    Ax = b,   A := ( 1 2
                     1 1 ),   b := ( 1
                                     4 ),

if the resulting relative error er in x is to be less than 10⁻² with respect to the ∞-norm. From (2.28), we know

    er ≤ κ(A) ‖∆b‖∞/‖b‖∞ = κ(A) ‖∆b‖∞/4.

Moreover, since

    A⁻¹ = ( −1  2
             1 −1 ),

from (2.23) and Ex. 2.18(a), we obtain κ(A) = ‖A‖∞ ‖A⁻¹‖∞ = 3 · 3 = 9. Thus, if the perturbation ∆b satisfies ‖∆b‖∞ < 4/900 ≈ 0.0044, then er < (9/4) · (4/900) = 10⁻².
Remark 2.41. Note that (2.28) is a much stronger and more useful statement than
the krel (A−1 , b) ≤ κ(A) of (2.22). The relative condition only provides a bound in the
limit of smaller and smaller neighborhoods of b and without providing any information
on how small the neighborhood actually has to be (one can estimate the amplification
of the error in x provided that the error in b is very small). On the other hand, (2.28)
provides an effective control of the absolute and relative errors without any restrictions
with regard to the size of the error in b (it holds for each ∆b).
—
Example 2.42. We come back to the instability observed in Ex. 2.33 and investigate its actual cause. Suppose A ∈ M(n, C) is an arbitrary Hermitian invertible matrix, λ_min, λ_max ∈ σ(A) such that |λ_min| = min |σ(A)|, |λ_max| = max |σ(A)|. Let v_min and v_max be eigenvectors for λ_min and λ_max, respectively, satisfying ‖v_min‖ = ‖v_max‖ = 1. The most unstable behavior of Ax = b occurs if b = v_max and b is perturbed in the direction of v_min, i.e. b̃ := b + ε v_min, ε > 0. The solution to Ax = b is x = λ_max⁻¹ v_max, whereas the solution to Ax = b̃ is

    x̃ = A⁻¹(v_max + ε v_min) = λ_max⁻¹ v_max + ε λ_min⁻¹ v_min.

The relative errors then satisfy

    ‖x̃ − x‖/‖x‖ = ε |λ_max|/|λ_min|,   ‖b̃ − b‖/‖b‖ = ε.

This shows that, in the worst case, the relative error in b can be amplified by a factor of κ2(A) = |λ_max|/|λ_min|, and that is exactly what had occurred in Ex. 2.33.
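The worst-case amplification can be reproduced numerically; a hypothetical 2 × 2 instance with a diagonal (hence Hermitian) matrix with eigenvalues 100 and 0.01:

```python
lam_max, lam_min, eps = 100.0, 0.01, 1e-6

# b = vmax = (1, 0); perturbed b~ = b + eps * vmin with vmin = (0, 1).
x = (1 / lam_max, 0.0)                       # solution of A x = vmax
x_tilde = (1 / lam_max, eps / lam_min)       # solution of A x = vmax + eps*vmin

rel_err_b = eps                              # ||b~ - b|| / ||b||
rel_err_x = (eps / lam_min) / (1 / lam_max)  # ||x~ - x|| / ||x||

# The amplification factor is |lam_max| / |lam_min| = kappa_2(A) = 10^4.
print(rel_err_x / rel_err_b)
```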
—
Remark 2.43. Let n ∈ N. The set GLn(K) of invertible n × n matrices over K is open in M(n, K) ≅ K^{n²} (where all norms are equivalent); this follows, for example, from the determinant being a continuous map (even a polynomial, cf. [Phi19b, Rem. 4.33]) from K^{n²} into K, which implies that GLn(K) = det⁻¹(K \ {0}) must be open. Thus, if A in (2.29) is invertible and ‖∆A‖ is sufficiently small, then A + ∆A must also be invertible.
Lemma 2.44. Let n ∈ N, consider some norm on Kⁿ, and the induced natural matrix norm on M(n, K). If B ∈ M(n, K) is such that ‖B‖ < 1, then Id + B is invertible and

    ‖(Id + B)⁻¹‖ ≤ 1/(1 − ‖B‖).                                              (2.30)

    ‖(Id + B)x‖ ≥ ‖x‖ − ‖Bx‖ ≥ ‖x‖ − ‖B‖ ‖x‖ = (1 − ‖B‖) ‖x‖,               (2.31)
Proof. We can write A + ∆A = A(Id + A⁻¹∆A). As ‖A⁻¹∆A‖ ≤ ‖A⁻¹‖ ‖∆A‖ < 1, Lem. 2.44 yields that Id + A⁻¹∆A is invertible. Since A is also invertible, so is A + ∆A. Moreover,

    ‖(A + ∆A)⁻¹‖ = ‖(Id + A⁻¹∆A)⁻¹ A⁻¹‖ ≤ ‖A⁻¹‖/(1 − ‖A⁻¹∆A‖) ≤ ‖A⁻¹‖/(1 − ‖A⁻¹‖ ‖∆A‖),   (2.34)

where the first inequality uses (2.30), proving (2.32). The identity

    (A + ∆A)⁻¹ − A⁻¹ = (A + ∆A)⁻¹ (A − (A + ∆A)) A⁻¹                         (2.35)

holds, as it is equivalent to

    Id − (A + ∆A) A⁻¹ = (A − (A + ∆A)) A⁻¹

and

    A − (A + ∆A) = −∆A.

For ‖∆A‖ ≤ 1/(2‖A⁻¹‖), (2.35) yields

    ‖(A + ∆A)⁻¹ − A⁻¹‖ ≤ ‖(A + ∆A)⁻¹‖ ‖∆A‖ ‖A⁻¹‖ ≤ ‖A⁻¹‖ ‖∆A‖ ‖A⁻¹‖/(1 − ‖A⁻¹‖ ‖∆A‖) ≤ 2 ‖A⁻¹‖² ‖∆A‖,

where the second inequality uses (2.34), proving (2.33).
Theorem 2.46. As in Lem. 2.45, let n ∈ N, consider some norm on Kⁿ with the induced natural matrix norm on M(n, K) and let A, ∆A ∈ M(n, K) such that A is invertible and ‖∆A‖ < ‖A^{−1}‖^{−1}. Moreover, assume b, x, ∆b, ∆x ∈ Kⁿ satisfy

Ax = b,    (A + ∆A)(x + ∆x) = b + ∆b.    (2.36)

Then, for b, x ≠ 0, one has the following bound for the relative error:

‖∆x‖ / ‖x‖ ≤ ( κ(A) / (1 − ‖A^{−1}‖ ‖∆A‖) ) ( ‖∆A‖ / ‖A‖ + ‖∆b‖ / ‖b‖ ).    (2.37)
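The estimate (2.37) can be verified for a concrete perturbed system. The following Python sketch uses the max norm on Kⁿ with its induced row-sum matrix norm; the matrices and perturbations are illustrative choices:

```python
# Sketch verifying the bound (2.37) on one 2x2 system; A, dA, b, db are
# illustrative. Max norm on K^n, induced row-sum matrix norm, and
# kappa(A) = ||A|| * ||A^{-1}||.
A  = [[4.0, 1.0], [1.0, 3.0]]
dA = [[0.01, 0.0], [0.0, -0.02]]
b  = [1.0, 2.0]
db = [0.001, -0.001]

def solve2(M, r):  # Cramer's rule for a 2x2 system M y = r
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - r[0] * M[1][0]) / det]

def inv2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

mat_norm = lambda M: max(sum(abs(e) for e in row) for row in M)
vec_norm = lambda v: max(abs(e) for e in v)

x = solve2(A, b)
A_pert = [[A[i][j] + dA[i][j] for j in range(2)] for i in range(2)]
x_pert = solve2(A_pert, [b[i] + db[i] for i in range(2)])

inv_norm = mat_norm(inv2(A))
kappa = mat_norm(A) * inv_norm
lhs = vec_norm([x_pert[i] - x[i] for i in range(2)]) / vec_norm(x)
rhs = (kappa / (1 - inv_norm * mat_norm(dA))
       * (mat_norm(dA) / mat_norm(A) + vec_norm(db) / vec_norm(b)))
print(lhs, "<=", rhs)
```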
Proof. Choose δ > 0 sufficiently small such that B̄δ(x) ⊆ U. Since f ∈ C¹(U, R^m), we can apply Taylor's theorem. Recall that its lowest order version with the remainder term in integral form states that, for each h ∈ Bδ(0) (which implies x + h ∈ Bδ(x)), the following holds (cf. [Phi16b, Th. 4.44], which can be applied to the components f1, . . . , fm of f):

f(x + h) = f(x) + ∫₀¹ Df(x + th)(h) dt.    (2.40)

If x̃ ∈ Bδ(x), then we can let h := x̃ − x, which permits us to restate (2.40) in the form

f(x̃) − f(x) = ∫₀¹ Df((1 − t)x + tx̃)(x̃ − x) dt.    (2.41)
and, therefore,

K(δ) ≤ (‖x‖ / ‖f(x)‖) S(δ).    (2.43)

2.6(a)]), which, in turn, implies lim_{δ→0} S(δ) = ‖Df(x)‖. Since, also, lim_{δ→0} K(δ) = krel, (2.43) yields

krel ≤ ‖Df(x)‖ ‖x‖ / ‖f(x)‖.    (2.44)
It still remains to prove the reverse inequality. To that end, choose v ∈ Rⁿ such that ‖v‖ = 1 and

‖Df(x)(v)‖ = ‖Df(x)‖    (2.45)

(such a vector v must exist due to the fact that the continuous function y ↦ ‖Df(x)(y)‖ must attain its max on the compact unit sphere S₁(0) – as in the proof of Prop. 2.34(a), this argument does not extend to infinite-dimensional spaces, where S₁(0) is not compact). For 0 < ε < δ consider x̃ := x + εv. Then ‖x̃ − x‖ = ε < δ, i.e. x̃ ∈ Bδ(x). In particular, x̃ is admissible in (2.41), which provides
f(x̃) − f(x) = ε ∫₀¹ Df((1 − t)x + tx̃)(v) dt
= ε Df(x)(v) + ε ∫₀¹ ( Df((1 − t)x + tx̃) − Df(x) )(v) dt.    (2.46)

Once again using ‖x̃ − x‖ = ε as well as the triangle inequality in the form ‖a + b‖ ≥ ‖a‖ − ‖b‖, we obtain from (2.46) and (2.45):

( ‖f(x̃) − f(x)‖ / ‖f(x)‖ ) / ( ‖x̃ − x‖ / ‖x‖ ) ≥ (‖x‖ / ‖f(x)‖) ( ‖Df(x)‖ − ∫₀¹ ‖Df((1 − t)x + tx̃) − Df(x)‖ dt ).    (2.47)
Thus,

K(ε) ≥ (‖x‖ / ‖f(x)‖) ( ‖Df(x)‖ − T(ε) ),    (2.48)

where

T(ε) := sup { ∫₀¹ ‖Df((1 − t)x + tx̃) − Df(x)‖ dt : x̃ ∈ B̄ε(x) }.    (2.49)
Since Df is continuous in x, for each α > 0, there is ε > 0 such that ‖y − x‖ ≤ ε implies ‖Df(y) − Df(x)‖ < α. Thus, since ‖(1 − t)x + tx̃ − x‖ = t ‖x̃ − x‖ ≤ ε for x̃ ∈ B̄ε(x) and t ∈ [0, 1], one obtains

T(ε) ≤ ∫₀¹ α dt = α,
implying

lim_{ε→0} T(ε) = 0.

Combining this with (2.48) and lim_{ε→0} K(ε) = krel, one obtains

krel ≥ ‖Df(x)‖ ‖x‖ / ‖f(x)‖.    (2.50)
Remark 2.48. While Th. 2.47 was formulated and proved for (continuously) R-differen-
tiable functions, it can also be used for C-differentiable (i.e. for holomorphic) functions:
Recall that C-differentiability is actually a much stronger condition than continuous R-
differentiability: A function that is C-differentiable is, in particular, R-differentiable and
even C ∞ (see [Phi16b, Def. 4.18] and [Phi16b, Rem. 4.19(c)]). As a concrete example,
we apply Th. 2.47 to complex multiplication in Ex. 2.49(c) below. For more information
regarding the relation between norms on Cn and norms on R2n , also see [Phi16b, Sec.
D].
Example 2.49. (a) Let us investigate the condition of the problem of real multiplication, by considering

f : R² −→ R,    f(x, y) := xy,    Df(x, y) = (y, x).

Using the Euclidean norm ‖·‖₂ on R², one obtains, for each (x, y) ∈ R² such that xy ≠ 0,

krel(x, y) := krel(f, (x, y)) = (‖(x, y)‖₂ / |f(x, y)|) ‖Df(x, y)‖₂ = (x² + y²) / |xy| = |x|/|y| + |y|/|x|.

Thus, we see that the relative condition explodes if |x| ≫ |y| or |x| ≪ |y|. To show f is well-conditioned in Bδ(x, y) for each δ > 0, let (x̃, ỹ) ∈ Bδ(x, y) and estimate

|f(x, y) − f(x̃, ỹ)| = |xy − x̃ỹ| ≤ |xy − xỹ| + |xỹ − x̃ỹ| = |x| |y − ỹ| + |ỹ| |x − x̃|
≤ max{‖(x, y)‖₂, ‖(x̃, ỹ)‖₂} ‖(x, y) − (x̃, ỹ)‖₁
≤ C (δ + ‖(x, y)‖₂) ‖(x, y) − (x̃, ỹ)‖₂
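The closed form krel(x, y) = |x|/|y| + |y|/|x| is easily confirmed numerically; in the following Python sketch, the sample points are arbitrary illustrative choices:

```python
# k_rel for f(x, y) = x*y: Df(x, y) = (y, x), so ||Df(x, y)||_2 = ||(x, y)||_2.
import math

def k_rel_mult(x, y):
    return math.hypot(x, y) * math.hypot(y, x) / abs(x * y)

for x, y in [(1.0, 1.0), (3.0, 4.0), (1000.0, 0.001)]:
    closed_form = abs(x) / abs(y) + abs(y) / abs(x)
    assert math.isclose(k_rel_mult(x, y), closed_form)
    print(x, y, closed_form)   # minimum 2 at |x| = |y|; huge for |x| >> |y|
```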
Once again using the Euclidean norm on R², one obtains from (2.39) that, for each (x, y) ∈ R² such that xy ≠ 0,

krel(x, y) := krel(g, (x, y)) = (‖(x, y)‖₂ / |g(x, y)|) ‖Dg(x, y)‖₂ = (|y|/|x|) √(x² + y²) √(1/y² + x²/y⁴)
= (1/|x|) √(x² + y²) √(1 + x²/y²) = (x² + y²) / (|x| |y|) = |x|/|y| + |y|/|x|,    (2.54)
i.e. the relative condition is the same as for multiplication. In particular, division
also becomes unstable if |x| ≫ |y| or |x| ≪ |y|, while krel (x, y) remains small if |x|
and |y| are of the same order of magnitude. However, we now again investigate if
g is well-conditioned in Bδ (x, y), δ > 0, and here we find an important difference
between division and multiplication. For δ ≥ |y|, it is easy to see that g is not well-
conditioned in Bδ (x, y): For each n ∈ N, let zn := (x, ny ). Then k(x, y) − zn k2 =
(1 − n1 )|y| < |y| ≤ δ, but
x xn
− = (n − 1) |x| → ∞ for n → ∞.
y y |y|
For 0 < δ < |y|, the result is different: Suppose |ỹ| ≤ |y| − δ. Then

‖(x, y) − (x̃, ỹ)‖₂ ≥ |ỹ − y| ≥ |y| − |ỹ| ≥ δ.

Thus, (x̃, ỹ) ∈ Bδ(x, y) implies |ỹ| > |y| − δ. In consequence, one estimates, for each (x̃, ỹ) ∈ Bδ(x, y),

|g(x, y) − g(x̃, ỹ)| = |x/y − x̃/ỹ| ≤ |x/y − x/ỹ| + |x/ỹ − x̃/ỹ| = (|x| / (|y| |ỹ|)) |y − ỹ| + (1/|ỹ|) |x − x̃|
≤ max{ |x| / (|y|(|y| − δ)), 1 / (|y| − δ) } ‖(x, y) − (x̃, ỹ)‖₁
≤ C max{ |x| / (|y|(|y| − δ)), 1 / (|y| − δ) } ‖(x, y) − (x̃, ỹ)‖₂
at (x, y): Even though krel (x, y) can remain bounded for y → 0, the problem is that
the neighborhood Bδ (x, y), where division is well-conditioned, becomes arbitrarily
small. Thus, without an effective bound (from below) on Bδ (x, y), the knowledge
of krel (x, y) can be completely useless and even misleading.
(c) We now consider the problem of complex multiplication, i.e.
f : C2 −→ C, f (z, w) := zw. (2.55)
We check that f is C-differentiable with Df(z, w) = (w, z): Indeed,

lim_{h=(h₁,h₂)→0} ‖f((z, w) + (h₁, h₂)) − f(z, w) − (w, z)(h₁, h₂)‖ / ‖h‖∞
= lim_{h→0} |(z + h₁)(w + h₂) − zw − wh₁ − zh₂| / ‖h‖∞
= lim_{h→0} |h₁h₂| / max{|h₁|, |h₂|} = lim_{h→0} min{|h₁|, |h₂|} = 0.
Since (x + iy)(u + iv) = xu − yv + i(xv + yu), to apply Th. 2.47, we now consider
complex multiplication as the R-differentiable map
g : R4 −→ R2 , g(x, y, u, v) := (xu − yv, xv + yu),
Dg(x, y, u, v) = ( u  −v  x  −y
                   v   u  y   x ).    (2.56)
Noting

Dg(x, y, u, v) Dg(x, y, u, v)ᵗ = ( u² + v² + x² + y²    0
                                    0    u² + v² + x² + y² ),
we obtain ‖Dg(x, y, u, v)‖₂ = √(u² + v² + x² + y²) = ‖(x, y, u, v)‖₂ and

krel(x, y, u, v) := krel(g, (x, y, u, v)) = (‖(x, y, u, v)‖₂ / ‖g(x, y, u, v)‖₂) ‖Dg(x, y, u, v)‖₂
= (x² + y² + u² + v²) / √(x²u² + y²v² + x²v² + y²u²) = (x² + y² + u² + v²) / √((x² + y²)(u² + v²))
= ‖(x, y)‖₂ / ‖(u, v)‖₂ + ‖(u, v)‖₂ / ‖(x, y)‖₂.    (2.57)
3 INTERPOLATION 36
that

‖g(x, y, u, v) − g(x̃, ỹ, ũ, ṽ)‖₂ = |zw − z̃w̃| ≤ |zw − zw̃| + |zw̃ − z̃w̃|
= |z| |w − w̃| + |w̃| |z − z̃|
≤ max{‖(z, w)‖₂, ‖(z̃, w̃)‖₂} ‖(z, w) − (z̃, w̃)‖₁
≤ C (δ + ‖(z, w)‖₂) ‖(z, w) − (z̃, w̃)‖₂
= C (δ + ‖(x, y, u, v)‖₂) ‖(x, y, u, v) − (x̃, ỹ, ũ, ṽ)‖₂

with suitable C ∈ R⁺.
3 Interpolation
3.1 Motivation
Given a finite number of points (x0 , y0 ), (x1 , y1 ), . . . , (xn , yn ), so-called data points, sup-
porting points, or tabular points, the goal is to find a function f such that f (xi ) = yi for
each i ∈ {0, . . . , n}. Then f is called an interpolate of the data points. The data points can be measured values from a physical experiment or computed values: for example, it could be that yi = g(xi), where g(x) can, in principle, be computed, but only with great difficulty and computational expense (g could be the solution of a differential equation). In that case, it can also be advantageous to interpolate the data points by an approximation f to g such that f is efficiently computable.
To have an interpolate f at hand can be desirable for many reasons, including
(a) We call

F[X] := F_fin^{N₀} := { (f : N₀ −→ F) : #f^{−1}(F \ {0}) < ∞ }    (3.1)

the set of polynomials over F (i.e. a polynomial over F is a sequence (ai)_{i∈N₀} in F such that all, but finitely many, of the entries ai are 0). On F[X], we have the
where we know from Linear Algebra (cf. [Phi19a, Ex. 5.16(c)]) that, with these
compositions, F [X] forms a vector space over F with standard basis B := {ei : i ∈
N0 }, where
∀ i ∈ N₀ :  ei : N₀ −→ F,  ei(j) := δij := { 1 for i = j,  0 for i ≠ j }.
We then know from Linear Algebra (cf. [Phi19b, Th. 7.4(b)]) that, with the above
addition and multiplication, F [X] forms a commutative ring with unity, where
1 = X 0 is the neutral element of multiplication.
(b) If f := (ai)_{i∈N₀} ∈ F[X], then we call the ai ∈ F the coefficients of f, and we define the degree of f by

deg f := { −∞ for f ≡ 0,  max{i ∈ N₀ : ai ≠ 0} for f ≢ 0 }.    (3.4)
Remark 3.2. In the situation of Def. 3.1, using the notation Xⁱ = ei, we can write addition, scalar multiplication, and multiplication in the following, perhaps, more familiar-looking forms: If λ ∈ F, f = Σ_{i=0}^{n} fi Xⁱ, g = Σ_{i=0}^{n} gi Xⁱ, n ∈ N₀, f0, . . . , fn, g0, . . . , gn ∈ F, then

f + g = Σ_{i=0}^{n} (fi + gi) Xⁱ,

λf = Σ_{i=0}^{n} (λfi) Xⁱ,

fg = Σ_{i=0}^{2n} ( Σ_{k+l=i} fk gl ) Xⁱ.
even though this constitutes an abuse of notation, since, according to Def. and
Rem. 3.1(a), f ∈ F [X] is a function on N0 and not a function on F . However, the
notation of (3.7) tends to improve readability and the meaning of f (x) is usually
clear from the context.
constitutes a linear and unital ring homomorphism, which is injective if, and only if, F is infinite (cf. [Phi19b, Th. 7.17]). Here, F^F = F(F, F) denotes the set of functions from F into F, and φ being unital means it maps 1 = X⁰ ∈ F[X] to the constant function φ(X⁰) ≡ 1. We define

Pol(F) := φ(F[X])

and call the elements of Pol(F) polynomial functions. Then φ : F[X] −→ Pol(F) is an epimorphism (i.e. surjective) and an isomorphism if, and only if, F is infinite.
If F is infinite, then we also define

F[X]n := { f ∈ F[X] : deg f ≤ n }

and

∀ n ∈ N₀ :  Poln(F) := φ(F[X]n) = {P ∈ Pol(F) : deg P ≤ n}.
∀ i ∈ {0, 1, . . . , n} :  εxi(f) = yi.    (3.9)
Moreover, one can identify f as the Lagrange interpolating polynomial, which is given by the explicit formula

f := Σ_{j=0}^{n} yj Lj,  where  ∀ j ∈ {0, 1, . . . , n} :  Lj := ∏_{i=0, i≠j}^{n} (X − xi) / (xj − xi) = ∏_{i=0, i≠j}^{n} (X¹ − xi X⁰) / (xj − xi)    (3.10)
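For pairwise distinct real nodes, (3.10) translates directly into code. The following Python sketch evaluates the Lagrange form at a given point; the data points are illustrative:

```python
# Evaluation of the Lagrange form (3.10) for pairwise distinct real nodes;
# the data below samples the illustrative polynomial p(x) = x^2 + x + 1.
def lagrange_interpolate(nodes, values, x):
    total = 0.0
    for j, (xj, yj) in enumerate(zip(nodes, values)):
        Lj = 1.0
        for i, xi in enumerate(nodes):
            if i != j:
                Lj *= (x - xi) / (xj - xi)   # L_j(x) = prod (x - x_i)/(x_j - x_i)
        total += yj * Lj
    return total

nodes = [0.0, 1.0, 2.0]
values = [1.0, 3.0, 7.0]
print(lagrange_interpolate(nodes, values, 0.5))   # p(0.5) = 1.75 by uniqueness
```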
f0 + f1 x0 + · · · + fn x0ⁿ = y0
f0 + f1 x1 + · · · + fn x1ⁿ = y1
⋮
f0 + f1 xn + · · · + fn xnⁿ = yn,

for the n + 1 unknowns f0, . . . , fn ∈ F. This linear system has a unique solution if, and only if, the determinant

D := | 1  x0  x0²  · · ·  x0ⁿ |
     | 1  x1  x1²  · · ·  x1ⁿ |
     | ⋮   ⋮    ⋮    ⋱    ⋮  |
     | 1  xn  xn²  · · ·  xnⁿ |
The above formulas for f0 , f1 , f2 indicate that the interpolating polynomial can be built
recursively, and we will see below (cf. (3.20a)) that this is, indeed, the case. Having a
recursive formula at hand can be advantageous in many circumstances. For example,
even though the complexity of building the interpolating polynomial from scratch is
O(n2 ) in either case, if one already has the interpolating polynomial for x0 , . . . , xn , then,
using the recursive formula (3.20a) below, one obtains the interpolating polynomial for
x0 , . . . , xn , xn+1 in just O(n) steps.
Looking at the structure of f2 above, one sees that, while the first coefficient is just y0 , the
second one is a difference quotient (which, for F = R, can be seen as an approximation
of some derivative y ′ (x)), and the third coefficient is a difference quotient of difference
quotients (which, for F = R, can be seen as an approximation of some second derivative
y ′′ (x)). Such iterated difference quotients are called divided differences and we will see in
Th. 3.17 below that the same structure does, indeed, hold for interpolating polynomials
of all orders. However, we will first introduce and study a slightly more general form
of polynomial interpolation, so-called Hermite interpolation, where the given points
x0 , . . . , xn no longer need to be distinct.
∀ f, g ∈ F[X] ∀ λ, µ ∈ F :  (λf + µg)′ = λf′ + µg′.

(b) Forming the derivative in F[X] satisfies the product rule, i.e.

∀ f, g ∈ F[X] :  (fg)′ = f′g + fg′.

(d) Forming higher derivatives in F[X] satisfies the general Leibniz rule, i.e.

∀ n ∈ N₀ ∀ f, g ∈ F[X] :  (fg)^{(n)} = Σ_{k=0}^{n} (n choose k) f^{(n−k)} g^{(k)}.
Proof. Exercise.
then x is a zero of multiplicity at least m + 1 for f, i.e. (cf. [Phi19b, Cor. 7.14])

∃ q ∈ F[X] :  deg q = n − m − 1  ∧  f = q (X − x)^{m+1}.
f = g (X − x)m . (3.12)
As we are assuming (3.11) and deg f ′ = n−1, we can now apply the induction hypothesis
for n − 1 to f ′ to obtain h ∈ F [X] with deg h = n − 1 − m, satisfying f ′ = h (X − x)m .
Using this in (3.13) yields
Computing in the field of rational fractions F (X) (cf. [Phi19b, Ex. 7.40(b)]), we may
divide by (X − x)m−1 to obtain
h (X − x) = g ′ (X − x) + m g.
f = q (X − x)m+1 .
Example 3.9. The present example shows that, in general, one cannot expect the conclusion of Th. 3.8 to hold without the hypothesis regarding the field's characteristic: Consider F := Z₂ = {0, 1} (i.e. F is the field with two elements, char F = 2) and
f := X 3 + X 2 = X 2 (X + 1), f ′ = 3X 2 + 2X = X 2 , f ′′ = 2X = 0.
Then f (0) = f ′ (0) = f ′′ (0) = 0 (i.e. we have n = 3 and m = 2 in Th. 3.8), but x = 0 is
not a zero of multiplicity 3 of f (but merely a zero of multiplicity 2).
Theorem 3.10 (Hermite Interpolation). Let F be a field and r, n ∈ N₀. Let x̃0, . . . , x̃r ∈ F be r + 1 distinct elements of F. Moreover, let m0, . . . , mr ∈ N be such that Σ_{k=0}^{r} mk = n + 1 and assume yj^{(m)} ∈ F for each j ∈ {0, . . . , r}, m ∈ {0, . . . , mj − 1}.
Definition 3.12. As in Th. 3.10, let F be a field and r, n ∈ N₀; let x̃0, . . . , x̃r ∈ F be r + 1 distinct elements of F; let m0, . . . , mr ∈ N be such that Σ_{k=0}^{r} mk = n + 1; and assume yj^{(m)} ∈ F for each j ∈ {0, . . . , r}, m ∈ {0, . . . , mj − 1}. Also, as in Th. 3.10(b), assume char F = 0 or char F ≥ M := max{m0, . . . , mr}. Finally, let (x0, . . . , xn) ∈ F^{n+1} be such that, for each k ∈ {0, . . . , r}, precisely mk of the xj are equal to x̃k (i.e. x̃k occurs precisely mk times in the vector (x0, . . . , xn)).

(a) Let P[x0, . . . , xn] := f ∈ F[X]n denote the unique interpolating polynomial satisfying (3.15). As a caveat, we note that P[x0, . . . , xn] also depends on the yj^{(m)}, which is, however, suppressed in the notation.
P [g|x0 , . . . , xn ] := P [x0 , . . . , xn ].
(c) Let F = R with a, b ∈ R, a < b, such that x̃0, . . . , x̃r ∈ [a, b], and let g : [a, b] −→ R be an M times differentiable function such that

g^{(m)}(x̃j) = yj^{(m)}  for each j ∈ {0, . . . , r}, m ∈ {0, . . . , mj − 1}.
P [g|x0 , . . . , xn ] := P [x0 , . . . , xn ].
We remain in the situation of Def. 3.12 and denote the interpolating polynomial by
f := P [x0 , . . . , xn ]. If one is only interested in the value f (x) for some x ∈ F , rather
than all the coefficients of f (or rather than evaluating f at many different elements of
F ), then an efficient way to obtain f (x) is by means of a so-called Neville tableau (cf.
Rem. 3.14 below), making use of the recursion provided by the following Th. 3.13(a).
(a) The interpolating polynomials satisfy the following recursion: For r = 0, one has

P[x0, . . . , xn] = P[x̃0, . . . , x̃0] = Σ_{j=0}^{n} ((X − x̃0)^j / j!) y0^{(j)}.    (3.16a)

implying

∀ m ∈ {0, . . . , n} :  ( Σ_{j=0}^{n} ((X − x̃0)^j / j!) y0^{(j)} )^{(m)} (x̃0) = y0^{(m)}.
We now consider the case r > 0. Letting p := P [x0 , . . . , xn ] and letting q denote the
polynomial given by the right-hand side of (3.16b), we have to show p = q. Clearly,
both p and q are in F [X]n . Thus, the proof is complete, if we can show that q also
satisfies (3.15), i.e.
q^{(m)}(x̃j) = yj^{(m)}  for each j ∈ {0, . . . , r}, m ∈ {0, . . . , mj − 1}.    (3.17)
To verify (3.17), given k, l ∈ {0, . . . , n} with xk ≠ xl, we use Prop. 3.7(e) to obtain, for each m ∈ N₀,
Now we fix j ∈ {0, . . . , r} such that x̃j = xk and let m ∈ {0, . . . , mj − 1}. Then (setting yj^{(−1)} := 0)

q^{(m)}(x̃j) = ( m yj^{(m−1)} + (xk − xl) yj^{(m)} − m yj^{(m−1)} − 0 ) / (xk − xl) = yj^{(m)}.

Similarly, for j ∈ {0, . . . , r} such that x̃j = xl:

q^{(m)}(x̃j) = ( m yj^{(m−1)} + 0 − m yj^{(m−1)} − (xl − xk) yj^{(m)} ) / (xk − xl) = yj^{(m)}.

Finally, if j ∈ {0, . . . , r} is such that x̃j ∉ {xk, xl}, then

q^{(m)}(x̃j) = ( m yj^{(m−1)} + (x̃j − xl) yj^{(m)} − m yj^{(m−1)} − (x̃j − xk) yj^{(m)} ) / (xk − xl) = yj^{(m)},
proving that (3.17) does, indeed, hold.
(b) holds due to the fact that the interpolating polynomial P [x0 , . . . , xn ] is determined
by the conditions (3.15), which do not depend on the order of x0 , . . . , xn (according to
Def. 3.12, this order independence of the interpolating polynomial is already built into
the definition of P [x0 , . . . , xn ]).
Remark 3.14. To use the recursion (3.16) to compute P[x0, . . . , xn](x) for x ∈ F, we now assume that (x0, . . . , xn) is ordered in such a way that identical entries are adjacent, and define Pαβ := P[x_{α−β}, . . . , xα](x) for α, β ∈ {0, . . . , n}, α ≥ β. Then Pnn = P[x0, . . . , xn](x) is the desired value. According to Th. 3.13(a) and (3.19),

Pαβ = Σ_{k=0}^{m} ((x − x̃j)^k / k!) yj^{(k)}  for x_{α−β} = xα = x̃j and m = β,

Pαβ = ( (x − x_{α−β}) P[x_{1+α−β}, . . . , xα](x) − (x − xα) P[x_{α−β}, . . . , x_{α−1}](x) ) / (xα − x_{α−β})
    = ( (x − x_{α−β}) P_{α,β−1} − (x − xα) P_{α−1,β−1} ) / (xα − x_{α−β})  for x_{α−β} ≠ xα.
In consequence, Pnn can be computed, in O(n²) steps, via the following Neville tableau:

y0^{(0)} = P00
    ց
y0^{(0)} = P10 → P11
    ց       ց
y0^{(0)} = P20 → P21 → P22
  ⋮         ⋮      ⋮     ⋱
yr^{(0)} = P_{n−1,0} → P_{n−1,1} → · · · → P_{n−1,n−1}
    ց         ց                 ց
yr^{(0)} = P_{n0} → P_{n1} → · · · → P_{n,n−1} → Pnn
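For pairwise distinct real nodes, the tableau can be sketched in a few lines of Python (the Hermite case with repeated nodes would additionally need the Taylor-type entries for Pαβ); the data below is illustrative:

```python
# Neville tableau computing P[x_0,...,x_n](x) in O(n^2) operations, sketched
# for pairwise distinct nodes (repeated nodes would additionally require the
# Taylor-type entries for P_{alpha,beta}); the data is illustrative.
def neville(nodes, values, x):
    P = list(values)          # column beta = 0 of the tableau
    n = len(nodes) - 1
    for beta in range(1, n + 1):
        for alpha in range(n, beta - 1, -1):
            P[alpha] = ((x - nodes[alpha - beta]) * P[alpha]
                        - (x - nodes[alpha]) * P[alpha - 1]) \
                       / (nodes[alpha] - nodes[alpha - beta])
    return P[n]               # P_nn = P[x_0,...,x_n](x)

nodes = [0.0, 1.0, 2.0]
values = [1.0, 3.0, 7.0]      # samples of x^2 + x + 1
print(neville(nodes, values, 0.5))   # -> 1.75
```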
x̃0 := 1, x̃1 := 2,
y0 := 1, y0′ := 4, y1 := 3, y1′ := 1, y1′′ := 2.
and, continued:

· · · P22 = 1
          ց
· · · P32 = 3 → P33 = P[1, 1, 2, 2](3) = (2·3 − 1)/(2 − 1) = 5
          ց          ց
· · · P42 = 5 → P43 = P[1, 2, 2, 2](3) = (2·5 − 3)/(2 − 1) = 7 → P44 = P[1, 1, 2, 2, 2](3) = (2·7 − 5)/(2 − 1) = 9.
We now come back to the divided differences mentioned at the end of Sec. 3.2.2. We
proceed to immediately define divided differences in the general context of Hermite
interpolation:
Definition 3.16. As in Def. 3.12, let F be a field and r, n ∈ N₀; let x̃0, . . . , x̃r ∈ F be r + 1 distinct elements of F; let m0, . . . , mr ∈ N be such that Σ_{k=0}^{r} mk = n + 1; and assume yj^{(m)} ∈ F for each j ∈ {0, . . . , r}, m ∈ {0, . . . , mj − 1}; assume char F = 0 or char F ≥ M := max{m0, . . . , mr}; and let (x0, . . . , xn) ∈ F^{n+1} be such that, for each k ∈ {0, . . . , r}, precisely mk of the xj are equal to x̃k (i.e. x̃k occurs precisely mk times in the vector (x0, . . . , xn)).
(a) Let [x0, . . . , xn] := an ∈ F denote the coefficient of Xⁿ in the interpolating polynomial P[x0, . . . , xn] of Def. 3.12(a).
[g|x0 , . . . , xn ] := [x0 , . . . , xn ].
(c) Let F = R with a, b ∈ R, a < b, such that x̃0, . . . , x̃r ∈ [a, b], and let g : [a, b] −→ R be an M times differentiable function such that

g^{(m)}(x̃j) = yj^{(m)}  for each j ∈ {0, . . . , r}, m ∈ {0, . . . , mj − 1}.
[g|x0 , . . . , xn ] := [x0 , . . . , xn ].
Theorem 3.17. In the situation of Def. 3.16, the interpolating polynomial P [x0 , . . . , xn ]
∈ F [X]n can be written in the following form known as Newton’s divided difference
interpolation formula:
P[x0, . . . , xn] = Σ_{j=0}^{n} [x0, . . . , xj] ωj,    (3.20a)
are the so-called Newton basis polynomials. In particular, (3.20a) means that the inter-
polating polynomials satisfy the recursive relation
P[x0] = y0^{(0)},    (3.21a)
P [x0 , . . . , xj ] = P [x0 , . . . , xj−1 ] + [x0 , . . . , xj ] ωj for 1 ≤ j ≤ n. (3.21b)
We can now make use of our knowledge regarding interpolating polynomials to infer
properties of the divided differences (in particular, we will see that divided differences
can also be efficiently computed using Neville tableaus):
Theorem 3.18. In the situation of Def. 3.16, the divided differences satisfy the follow-
ing:
(b) [x0, . . . , xn] does not depend on the order of the x0, . . . , xn ∈ F, i.e., for each permutation π on {0, . . . , n},

[x_{π(0)}, . . . , x_{π(n)}] = [x0, . . . , xn].
P[x0, . . . , xn] = P[x̃0, . . . , x̃0] = Σ_{j=0}^{n} ((X − x̃0)^j / j!) y0^{(j)},

i.e. the coefficient of Xⁿ is y0^{(n)}/n!. Case r > 0: According to (3.16b), we have, for each k, l ∈ {0, . . . , n} such that xk ≠ xl,
Remark 3.19. Analogous to Rem. 3.14, to use the recursion (3.23) to compute the
divided difference [x0 , . . . , xn ], we now assume that (x0 , . . . , xn ) is ordered in such a way
that identical entries are adjacent, i.e.
Then Pnn = [x0, . . . , xn] can be computed via the Neville tableau of Rem. 3.14, provided that we redefine

∀ α, β ∈ {0, . . . , n}, α ≥ β :  Pαβ := [x_{α−β}, . . . , xα],
(b) Suppose x0, . . . , xn ∈ F are all distinct and set yj := yj^{(0)} for each j ∈ {0, . . . , n}.
Then the Neville tableau of Rem. 3.19 takes the form
y0 = [x0 ]
ց
y1 = [x1 ] → [x0 , x1 ]
ց ց
y2 = [x2 ] → [x1 , x2 ] → [x0 , x1 , x2 ]
.. .. .. ..
. . . .
yn−1 = [xn−1 ] → [xn−2 , xn−1 ] → ... ... [x0 , . . . , xn−1 ]
ց ց ց
yn = [xn ] → [xn−1 , xn ] → ... ... [x1 , . . . , xn ] → [x0 , . . . , xn ]
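For pairwise distinct real nodes, this divided-difference tableau and the Newton form (3.20a) can be sketched as follows in Python; the data is illustrative:

```python
# Divided-difference tableau and Newton form (3.20a), sketched for pairwise
# distinct nodes via the recursion
# [x_{a-b},...,x_a] = ([x_{a-b+1},...,x_a] - [x_{a-b},...,x_{a-1}]) / (x_a - x_{a-b});
# the data is illustrative.
def divided_differences(nodes, values):
    coef = list(values)
    n = len(nodes)
    for b in range(1, n):
        for a in range(n - 1, b - 1, -1):
            coef[a] = (coef[a] - coef[a - 1]) / (nodes[a] - nodes[a - b])
    return coef               # coef[j] = [x_0,...,x_j]

def newton_eval(nodes, coef, x):   # Horner-like evaluation of (3.20a)
    result = coef[-1]
    for j in range(len(coef) - 2, -1, -1):
        result = result * (x - nodes[j]) + coef[j]
    return result

nodes = [0.0, 1.0, 2.0]
values = [1.0, 3.0, 7.0]      # samples of x^2 + x + 1
coef = divided_differences(nodes, values)
print(coef)                   # -> [1.0, 2.0, 1.0]
print(newton_eval(nodes, coef, 0.5))   # -> 1.75
```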
(c) As in Ex. 3.15, consider a field F with char F 6= 2, n := 4, r := 1, and the given
data
x̃0 = 1, x̃1 = 2,
y0 = 1, y0′ = 4, y1 = 3, y1′ = 1, y1′′ = 2.
This time, we seek to obtain the complete interpolating polynomial P [1, 1, 2, 2, 2],
using Newton’s interpolation formula (3.20a):
P [1, 1, 2, 2, 2] = [1] + [1, 1](X − 1) + [1, 1, 2](X − 1)2 + [1, 1, 2, 2](X − 1)2 (X − 2)
+ [1, 1, 2, 2, 2](X − 1)2 (X − 2)2 . (3.24)
The coefficients can be computed from the Neville tableau given by Rem. 3.19:
[1] = y0 = 1
    ց
[1] = y0 = 1 → [1, 1] = y0′ = 4
    ց              ց
[2] = y1 = 3 → [1, 2] = (3 − 1)/(2 − 1) = 2 → [1, 1, 2] = (2 − 4)/(2 − 1) = −2
    ց              ց                               ց
[2] = y1 = 3 → [2, 2] = y1′ = 1 → [1, 2, 2] = (1 − 2)/(2 − 1) = −1 → [1, 1, 2, 2] = (−1 − (−2))/(2 − 1) = 1
    ց              ց                  ց
[2] = y1 = 3 → [2, 2] = y1′ = 1 → [2, 2, 2] = y1″/2! = 1 → [1, 2, 2, 2] = (1 − (−1))/(2 − 1) = 2

and, finally,

· · · [1, 1, 2, 2] = 1
           ց
· · · [1, 2, 2, 2] = 2 → [1, 1, 2, 2, 2] = (2 − 1)/(2 − 1) = 1.
Additional formulas for divided differences over a general field F can be found in Ap-
pendix D. Here, we now proceed to present a result for the field F = R, where there
turns out to be a surprising relation between divided differences and integrals over sim-
plices. We start with some preparations:
Remark 3.22. The set Σⁿ introduced in Not. 3.21 is the convex hull of the set consisting of the origin and the n standard unit vectors e1, . . . , en ∈ Rⁿ, i.e. the convex hull of the (n − 1)-dimensional standard simplex ∆^{n−1} united with the origin (cf. [Phi19b, Def. and Rem. 1.23]):

Σⁿ = conv({0} ∪ {ei : i ∈ {1, . . . , n}}) = conv({0} ∪ ∆^{n−1}),

where

ei := (0, . . . , 0, 1, 0, . . . , 0)  (with the 1 in the ith position),    ∆^{n−1} := conv{ei : i ∈ {1, . . . , n}}.    (3.25)
t1 := s1 , s1 := t1 ,
t2 := s1 + s2 , s2 := t2 − t1 ,
t3 := s1 + s2 + s3 , s3 := t3 − t2 , (3.27)
.. ..
. .
tn := s1 + · · · + sn , sn := tn − tn−1 .
Thus, using the change of variables theorem (CVT) and the Fubini theorem (FT), we can compute vol(Σⁿ):

vol(Σⁿ) = ∫_{Σⁿ} 1 ds =(CVT) ∫_{T(Σⁿ)} |det J_{T⁻¹}(t)| dt = ∫_{T(Σⁿ)} 1 dt
=(FT) ∫₀¹ ∫₀^{tn} · · · ∫₀^{t3} ∫₀^{t2} dt1 dt2 · · · dtn = ∫₀¹ · · · ∫₀^{t3} t2 dt2 · · · dtn
= ∫₀¹ · · · ∫₀^{t4} (t3²/2!) dt3 · · · dtn = ∫₀¹ (tn^{n−1}/(n − 1)!) dtn = 1/n!,

proving (3.26).
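The value vol(Σⁿ) = 1/n! can also be checked by Monte Carlo sampling, since Σⁿ is the portion of the unit cube [0, 1]ⁿ where the coordinates sum to at most 1; the sample count and seed in the following Python sketch are illustrative:

```python
# Monte Carlo check of vol(Sigma^n) = 1/n!: Sigma^n is the part of the unit
# cube [0,1]^n where s_1 + ... + s_n <= 1, so the hit fraction of uniform
# samples estimates its volume; sample count and seed are illustrative.
import math, random

def simplex_volume_estimate(n, samples=200_000, seed=0):
    rng = random.Random(seed)
    hits = sum(1 for _ in range(samples)
               if sum(rng.random() for _ in range(n)) <= 1.0)
    return hits / samples

for n in (2, 3):
    est, exact = simplex_volume_estimate(n), 1 / math.factorial(n)
    print(n, est, exact)
```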
and (x0, . . . , xn) ∈ R^{n+1} such that, for each k ∈ {0, . . . , r}, precisely mk of the xj are equal to x̃k. Then the divided differences satisfy the following relation, known as the Hermite-Genocchi formula:

n ≥ 1  ⇒  [g|x0, . . . , xn] = ∫_{Σⁿ} g^{(n)}( x0 + Σ_{i=1}^{n} si (xi − x0) ) ds,    (3.28)
such that g (n) is defined at xs . Moreover, the integral in (3.28) exists, as the continuous
function g (n) is bounded on the compact set Σn . If r = 0 (i.e. x0 = · · · = xn = x̃0 ), then
∫_{Σⁿ} g^{(n)}( x0 + Σ_{i=1}^{n} si (xi − x0) ) ds = g^{(n)}(x0) ∫_{Σⁿ} ds =(3.26) g^{(n)}(x0)/n! = y0^{(n)}/n! =(3.23a) [g|x0, . . . , xn],
proving (3.28) in this case. For r > 0, we prove (3.28) via induction on n ∈ N, where, by Th. 3.18(b), we may assume x0 ≠ xn. Then, for n = 1, (3.28) is an easy consequence of the fundamental theorem of calculus (FTC) (note Σ¹ = [0, 1]):
∫₀¹ g′(x0 + s(x1 − x0)) ds = [ g(x0 + s(x1 − x0)) / (x1 − x0) ]₀¹ = (g(x1) − g(x0)) / (x1 − x0) = (y1 − y0) / (x1 − x0) = [g|x0, x1].
(3.20)
= g(x).
Theorem 3.26. As in Th. 3.24, let a, b ∈ R, a < b. Let g ∈ C ∞ [a, b] and consider a
sequence (xk )k∈N0 in [a, b]. For each n ∈ N, define pn := P [g|x0 , . . . , xn ] ∈ R[X]n to be
the interpolating polynomial corresponding to the first n + 1 points – thus, according to
Th. 3.17 and Th. 3.24,
∀ n ∈ N :  pn = Σ_{j=0}^{n} [g|x0, . . . , xj] ωj,

where

[g|x0, . . . , xj] = ∫_{Σ^j} g^{(j)}( x0 + Σ_{i=1}^{j} si (xi − x0) ) ds,    ωj := ∏_{i=0}^{j−1} (X − xi).
If there exist K, R ∈ R₀⁺ such that
If, moreover, R < (b − a)^{−1}, then one also has the convergence

lim_{n→∞} ‖g − φ(pn)‖∞ = 0.    (3.34)
Proof. The first estimate in (3.33) is merely (3.31), while the second estimate in (3.33)
is (3.32) combined with kφ(ωn+1 )k∞ ≤ (b − a)n+1 . As R < (b − a)−1 , (3.33) establishes
(3.34).
∀ n ∈ N :  ‖g^{(n)}‖∞ ≤ Cⁿ.    (3.35)
m := min{x, x0 , . . . , xn }, M := max{x, x0 , . . . , xn }.
which is clear for x = ξ and, otherwise, due to the mean value theorem [Phi16a, Th.
9.18]. For n = 0, the left-hand side of (3.36) is |g(x) − g(x0 )| and (3.36) follows from
(3.37). If n ≥ 1, then we apply (3.30) with n − 1, yielding n! [g|x0 , . . . , xn ] = g (n) (ξ) for
some ξ ∈ [m, M ], such that (3.36) follows from (3.37) also in this case.
As mentioned in Rem. 3.28 above, the zeros of the Chebyshev polynomials yield optimal
points for the use of polynomial interpolation. We will provide this result as well as some
other properties of Chebyshev polynomials in the present section.
(f) Tn has n simple zeros, all in the interval [−1, 1], namely, for n ≥ 1,

xk := cos( (2k − 1)π / (2n) ),  k ∈ {1, . . . , n}.
where, for n ≥ 1,

Tn(x) = 1  if, and only if,  x = cos(kπ/n) with k ∈ {0, . . . , n} even,
Tn(x) = −1  if, and only if,  x = cos(kπ/n) with k ∈ {0, . . . , n} odd.
Proof. Exercise.
Notation 3.32. Let n ∈ N0 . For each p ∈ Pol(R), let an (p) ∈ R denote the coefficient
of xn in p.
Proof. (a): Clearly, α, β are both restrictions of affine polynomial functions with deg α = deg β = 1. They are strictly increasing (and, thus, bijective) e.g. due to α′ = 2/(b − a) > 0 and β′ = (b − a)/2 > 0. Moreover, α(a) = −1, α(b) = 1, β(−1) = a, β(1) = b, which also shows β = α^{−1}.
(b) holds, as the bijectivity of β implies |f |([a, b]) = |f ◦ β|([−1, 1]).
(c) is a simple consequence of the binomial theorem (cf. [Phi16a, Th. 5.16]).
and, in particular,

1 = ‖Tn↾[−1,1]‖∞ = min { ‖p↾[−1,1]‖∞ : p ∈ Poln(R), deg p = n, an(p) = 2^{n−1} }

(i.e. Tn minimizes the ∞-norm on [−1, 1] in the set of polynomial functions with degree n and highest coefficient 2^{n−1}).
Proof. We first consider the special case a = −1, b = 1, and an(p) = 2^{n−1}. Seeking a contradiction, assume p ∈ Poln(R) with deg p = n, an(p) = 2^{n−1}, and ‖p↾[−1,1]‖∞ < 1. Using Th. 3.31(g), we obtain

∀ k ∈ {0, . . . , n} :  (p − Tn)(cos(kπ/n))  { < 0 for k even,  > 0 for k odd },

showing p − Tn to have at least n zeros due to the intermediate value theorem (cf. [Phi16a, Th. 7.57]). Since p, Tn ∈ Poln(R) with an(p) = an(Tn) = 2^{n−1}, we have p − Tn ∈ Pol_{n−1}(R) and the n zeros yield p − Tn = 0, in contradiction to p ≠ Tn. Thus, there must exist x0 ∈ [−1, 1] such that |p(x0)| ≥ 1. Now if q ∈ Poln(R) is arbitrary with an(q) ≠ 0 and p := (2^{n−1}/an(q)) q, then p ∈ Poln(R) with an(p) = 2^{n−1} and, due to the special case,

∃ x0 ∈ [−1, 1] :  |q(x0)| = |(an(q)/2^{n−1}) p(x0)| ≥ |an(q)| / 2^{n−1},

as desired. Moreover, on the interval [a, b], one then obtains

‖q↾[a,b]‖∞ = ‖q ◦ β‖∞ ≥ |an(q)| (b − a)ⁿ / 2^{2n−1},

using Lem. 3.33(b) for the equality and Lem. 3.33(c) for the estimate,
thereby completing the proof.
Example 3.35. Let n ∈ N and a, b ∈ R with a < b. As mentioned in Rem. 3.28, minimizing the error in (3.33) leads to the problem of finding x0, . . . , xn ∈ [a, b] such that

max { ∏_{j=0}^{n} |x − xj| : x ∈ [a, b] }

is minimized. In other words, one needs to find the zeros of a monic polynomial function p in Pol_{n+1}(R) such that p minimizes the ∞-norm on [a, b]. According to Th. 3.34, for a = −1, b = 1, p = T_{n+1}/2ⁿ, which, according to Th. 3.31(f), has zeros

xj = cos( (2(j + 1) − 1)π / (2(n + 1)) ) = cos( (2j + 1)π / (2n + 2) ),  j ∈ {0, . . . , n}.
In the general case, we then obtain the solution p by using the functions α, β of Lem. 3.33: According to Lem. 3.33(b), we have p ◦ β = c T_{n+1}, where c ∈ R is chosen such that a_{n+1}(p) = 1. By Lem. 3.33(c),

c · 2ⁿ = a_{n+1}(p ◦ β) = a_{n+1}(p) (b − a)^{n+1}/2^{n+1} = (b − a)^{n+1}/2^{n+1}  ⇒  c = (b − a)^{n+1}/2^{2n+1},

implying p = (T_{n+1} ◦ α) · (b − a)^{n+1}/2^{2n+1} with the zeros

xj = β( cos( (2j + 1)π / (2n + 2) ) ),  j ∈ {0, . . . , n}.
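This node construction can be sketched in Python; the interval and degree below are illustrative choices, and the zeros are checked via the representation Tn(y) = cos(n arccos y) on [−1, 1]:

```python
# Chebyshev interpolation nodes on [a, b]: x_j = beta(cos((2j+1)pi/(2n+2))),
# i.e. the images under beta of the zeros of T_{n+1}; interval and degree
# are illustrative.
import math

def chebyshev_nodes(a, b, n):
    return [a + (b - a) * (math.cos((2 * j + 1) * math.pi / (2 * n + 2)) + 1) / 2
            for j in range(n + 1)]

def T(k, y):                  # T_k(y) = cos(k arccos y) on [-1, 1]
    return math.cos(k * math.acos(y))

a, b, n = 0.0, 2.0, 4
nodes = chebyshev_nodes(a, b, n)
alpha = lambda x: 2 * (x - a) / (b - a) - 1   # affine map [a, b] -> [-1, 1]
print(all(abs(T(n + 1, alpha(xj))) < 1e-12 for xj in nodes))
```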
Theorem 3.36. Let n ∈ N and a, b ∈ R with a < b. Assume s ∈ R \ [a, b] and define

cs := 1 / Tn(α(s)) ∈ R \ {0},    ps := cs (Tn ◦ α) ∈ Poln(R),
(i.e. ps minimizes the ∞-norm on [a, b] in the set of polynomial functions with degree n
and value 1 at s).
Proof. Exercise.
In the previous sections, we have used a single polynomial to interpolate given data
points. If the number of data points is large, then the degree of the interpolating
polynomial is likewise large. If the data points are given by a function, and the goal
is for the interpolating polynomial to approximate that function, then, in general, one
might need many data points to achieve a desired accuracy. We have also seen that the
interpolating polynomials tend to be large in the vicinity of the outermost data points,
where strong oscillations are typical. One possibility to counteract that behavior is
to choose an additional number of data points in the vicinity of the boundary of the
considered interval (cf. Rem. 3.28).
The advantage of the approach of the previous sections is that the interpolating poly-
nomial is C ∞ (i.e. it has derivatives of all orders) and one can make sure (by Hermite
interpolation) that, in a given point x0 , the derivatives of the interpolating polynomial p
agree with the derivatives of a given function in x0 (which, again, will typically increase
the degree of p). The disadvantage is that one often needs interpolating polynomials of
large degrees, which can be difficult to handle (e.g. numerical problems resulting from
large numbers occurring during calculations).
Spline interpolation follows a different approach: Instead of using one single interpo-
lating polynomial of potentially large degree, one uses a piecewise interpolation, fitting
together polynomials of low degree (e.g. 1 or 3). The resulting function s will typically not be of class C^∞, but merely of class C⁰ or C², and the derivatives of s will usually not
agree with the derivatives of a given f (only the values of f and s will agree in the data
points). The advantage is that one only needs to deal with polynomials of low degree. If
one does not need high differentiability of the interpolating function and one is mostly
interested in a good approximation with respect to k · k∞ , then spline interpolation is
usually a good option.
Definition 3.37. Let a, b ∈ R, a < b. Then, given N ∈ N,
∆ := (x0 , . . . , xN ) ∈ RN +1 (3.39)
is called a knot vector for [a, b] if, and only if, it corresponds to a partition a = x0 <
x1 < · · · < xN = b of [a, b]. The xk are then also called knots. Let l ∈ N. A spline of
order l for ∆ is a function s ∈ C l−1 [a, b] such that, for each k ∈ {0, . . . , N − 1}, there
exists a polynomial function pk ∈ Poll (R) that coincides with s on [xk , xk+1 ]. The set of
all such splines is denoted by S∆,l :
S_{∆,l} := { s ∈ C^{l−1}[a, b] : ∀ k ∈ {0, . . . , N − 1} ∃ pk ∈ Pol_l(R) :  pk↾[xk,xk+1] = s↾[xk,xk+1] }.    (3.40)
Of particular importance are the elements of S∆,1 (linear splines) and of S∆,3 (cubic
splines). It is also useful to define
S∆,0 := span χk : k ∈ {0, . . . , N − 1} , (3.41)
where

∀ k ∈ {0, . . . , N − 1} :  χk : [a, b] −→ R,  χk(x) := { 1 if x ∈ [xk, xk+1[,  0 if x ∉ [xk, xk+1[ },
i.e. each χk is a step function, namely the characteristic function of the interval [xk , xk+1 [.
We call the elements of S∆,0 splines of order 0.
Remark and Definition 3.38. In the situation of Def. 3.37, let k ∈ {0, . . . , N − 1}.
If s ∈ S∆,l , l ∈ N0 , then, as s↾]xk ,xk+1 [ coincides with a polynomial function of degree at
most l, on ]xk , xk+1 [ the derivative s(l) exists and is constant, i.e. there exists αk ∈ R
such that s(l) ↾]xk ,xk+1 [ ≡ αk . We extend s(l) to all of [a, b] by defining
s^{(l)} := Σ_{k=0}^{N−1} αk χk ∈ S_{∆,0}.    (3.42)
Definition 3.39. In the situation of Def. 3.37, if s ∈ S∆,0 , then, for each l ∈ N0 , define
the lth antiderivative I_l(s) : [a, b] −→ R of s recursively by letting

I₀(s) := s,    ∀ l ∈ N₀ :  I_{l+1}(s)(x) := ∫_a^x I_l(s)(t) dt.
(a) Pol_l(R) ⊆ S_{∆,l} (here, and in the following, we are sometimes a bit lax regarding the notation, implicitly identifying polynomials on R with their respective restrictions to [a, b]).
(c) If s ∈ S∆,0 and S := Il (s) is the lth antiderivative of s as defined in Def. 3.39, then
S ∈ S∆,l .
(d) S_{∆,l} is an (N + l)-dimensional vector space over R and, if, for each k ∈ {0, . . . , N − 1}, Xk := I_l(χk) is the lth antiderivative of χk, then

B := { Xk : k ∈ {0, . . . , N − 1} } ∪ { x^j : j ∈ {0, . . . , l − 1} }

is a basis of S_{∆,l}.
Proof. Since s defined by (3.44) clearly satisfies (3.43), is continuous on [a, b], and, for each k ∈ {0, . . . , N }, the definition of (3.44) extends to a unique polynomial in Pol1(R), we already have existence. Uniqueness is equally obvious, since, for each k, the prescribed values in xk and xk+1 uniquely determine a polynomial in Pol1(R) and, thus, s. For the error estimate, consider x ∈ [xk , xk+1 ], k ∈ {0, . . . , N − 1}. Note that, due to the inequality αβ ≤ ((α + β)/2)², valid for all α, β ∈ R, one has

(x − xk)(xk+1 − x) ≤ ((x − xk) + (xk+1 − x))²/4 ≤ h²/4. (3.46)

Moreover, on [xk , xk+1 ], s coincides with the interpolating polynomial p ∈ Pol1(R) with p(xk) = f(xk) and p(xk+1) = f(xk+1). Thus, we can use (3.31) and (3.46) to compute

|f(x) − s(x)| = |f(x) − p(x)| ≤ (‖f′′‖∞/2!) (x − xk)(xk+1 − x) ≤ (1/8) ‖f′′‖∞ h²,

thereby proving (3.45).
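The bound just proved is easy to test numerically. A minimal sketch (Python; the helper name linear_spline and the test function are our choices, not the notes'): interpolate f = sin on [0, π] with N = 10 equidistant knots, where ‖f′′‖∞ = 1, so the estimate gives ‖f − s‖∞ ≤ h²/8.

```python
import math

def linear_spline(knots, values):
    """Return the piecewise linear interpolant s with s(x_k) = values[k]
    (cf. (3.44)); knots must be strictly increasing."""
    def s(x):
        # locate the subinterval [x_k, x_{k+1}] containing x
        for k in range(len(knots) - 1):
            if x <= knots[k + 1]:
                t = (x - knots[k]) / (knots[k + 1] - knots[k])
                return (1 - t) * values[k] + t * values[k + 1]
        return values[-1]
    return s

# f = sin on [0, pi]: ||f''||_inf = 1, so (3.45) gives |f - s| <= h^2 / 8.
N = 10
h = math.pi / N
knots = [k * h for k in range(N + 1)]
s = linear_spline(knots, [math.sin(x) for x in knots])
err = max(abs(math.sin(x) - s(x)) for x in [j * math.pi / 1000 for j in range(1001)])
print(err, h * h / 8)  # observed error stays below the bound
```

The observed maximal error (about 0.012 here) indeed stays below the theoretical bound h²/8 ≈ 0.0123.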
using the convention {1, . . . , N − 1} = ∅ for N − 1 < 1. Note that (3.50) constitutes a
system of 2 + 4(N − 1) = 4N − 2 conditions, while (3.49a) provides 4N variables. This
already indicates that one will be able to impose two additional conditions on s. One
typically imposes so-called boundary conditions, i.e. conditions on the values for s or its
derivatives at x0 and xN (see (3.61) below for commonly used boundary conditions).
In order to show that one can, indeed, choose the ak , bk , ck , dk such that all conditions of
(3.50) are satisfied, the strategy is to transform (3.50) into an equivalent linear system.
An analysis of the structure of that system will then show that it has a solution. The
trick that will lead to the linear system is the introduction of
s′′k := s′′ (xk ), k ∈ {0, . . . , N }, (3.51)
as new variables. As we will see in the following lemma, (3.50) is satisfied if, and only
if, the s′′k satisfy the linear system (3.53):
Lemma 3.42. Given a knot vector and values according to (3.47), and introducing the
abbreviations
hk := xk+1 − xk for each k ∈ {0, . . . , N − 1}, (3.52a)
gk := 6 ( (yk+1 − yk)/hk − (yk − yk−1)/hk−1 ) for each k ∈ {1, . . . , N − 1} (3.52b)

(the difference in parentheses equals (hk + hk−1)[xk−1 , xk , xk+1 ], i.e. hk + hk−1 times a second divided difference of the data),
the pk ∈ Pol3 (R) of (3.49a) satisfy (3.50) (and that means s given by (3.49c) is in S∆,3
and satisfies (3.48)) if, and only if, the N + 1 numbers s′′0 , . . . , s′′N defined by (3.51) solve
the linear system consisting of the N − 1 equations
hk s′′k + 2(hk + hk+1 )s′′k+1 + hk+1 s′′k+2 = gk+1 , k ∈ {0, . . . , N − 2}, (3.53)
and the ak , bk , ck , dk are given in terms of the xk , yk and s′′k by
ak = y k for each k ∈ {0, . . . , N − 1}, (3.54a)
yk+1 − yk hk ′′
bk = − (s + 2s′′k ) for each k ∈ {0, . . . , N − 1}, (3.54b)
hk 6 k+1
s′′k
ck = for each k ∈ {0, . . . , N − 1}, (3.54c)
2
s′′k+1 − s′′k
dk = for each k ∈ {0, . . . , N − 1}. (3.54d)
6hk
Proof. First, for subsequent use, from (3.49a), we obtain the first two derivatives of the
pk for each k ∈ {0, . . . , N − 1}:
p′k (x) = bk + 2ck (x − xk ) + 3dk (x − xk )2 , (3.55a)
p′′k (x) = 2ck + 6dk (x − xk ). (3.55b)
or, rearranged,
which is (3.53).
(3.53) and (3.54) imply (3.50):
First, note that plugging x = xk into (3.55b) again yields s′′k = p′′k (xk ). Using (3.54a) in
(3.49a) yields (3.50a) and the pk (xk ) = yk part of (3.50b). Since (3.54b) is equivalent to
(3.59b), and (3.59b) (when using (3.54a), (3.54c), and (3.54d)) implies (3.59a), which,
in turn, is equivalent to pk−1 (xk ) = yk for each k ∈ {1, . . . , N }, we see that (3.54)
implies (3.50c) and the pk−1 (xk ) = yk part of (3.50b). Since (3.53) is equivalent to
(3.60c) and (3.60b), and (3.60b) (when using (3.54a) – (3.54d)) implies (3.60a), which,
in turn, is equivalent to p′k−1(xk) = p′k(xk) for each k ∈ {1, . . . , N − 1}, we see that (3.53) and (3.54) imply (3.50d). Finally, (3.54d) is equivalent to (3.58b), which (when using (3.54c)) implies (3.58a). Since (3.58a) is equivalent to p′′k−1(xk) = p′′k(xk) for each k ∈ {1, . . . , N − 1}, (3.54) also implies (3.50e), concluding the proof of the lemma.
As already mentioned after formulating (3.50), (3.50) yields two fewer conditions than there are variables. Not surprisingly, the same is true in the linear system (3.53), where there are N − 1 equations for N + 1 variables. As also mentioned before, this allows one to
impose two additional conditions. Here are some of the most commonly used additional
conditions:
Natural boundary conditions:
Due to time constraints, we will only investigate the simplest of these cases, namely natural boundary conditions, in the following.
Since (3.61a) fixes the values of s′′0 and s′′N as 0, (3.53) now yields a linear system
consisting of N − 1 equations for the N − 1 variables s′′1 through s′′N −1 . We can rewrite
this system in matrix form
As′′ = g (3.62a)
where ‖A^{−1}‖∞ denotes the row sum norm according to Ex. 2.18(a).
Proof. Exercise.
Theorem 3.45. Given a knot vector and values according to (3.47), there is a unique
cubic spline s ∈ S∆,3 that satisfies (3.48) and the natural boundary conditions (3.61a).
Moreover, s can be computed by first solving the linear system (3.62) for the s′′k (the hk
and the gk are given by (3.52)), then determining the ak , bk , ck , and dk according to
(3.54), and, finally, using (3.49) to get s.
Proof. Lemma 3.42 shows that the existence and uniqueness of s is equivalent to (3.62a)
having a unique solution. That (3.62a) does, indeed, have a unique solution follows from
Lem. 3.44, as A as defined in (3.62b) is clearly strictly diagonally dominant. It is also
part of the statement of Lem. 3.42 that s can be computed in the described way.
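Theorem 3.45 is, in effect, an algorithm. The following is a minimal sketch in Python (function and variable names are ours; in practice one would normally call a library spline routine): assemble the tridiagonal system (3.53) with the natural boundary conditions, solve it by forward elimination and back substitution, and recover the coefficients via (3.54).

```python
def natural_cubic_spline(x, y):
    """Natural cubic spline through (x_k, y_k) as in Th. 3.45: solve (3.53)
    with s''_0 = s''_N = 0, then recover a_k, b_k, c_k, d_k via (3.54)."""
    N = len(x) - 1
    h = [x[k + 1] - x[k] for k in range(N)]                       # (3.52a)
    g = [6 * ((y[k + 1] - y[k]) / h[k] - (y[k] - y[k - 1]) / h[k - 1])
         for k in range(1, N)]                                    # (3.52b)
    # Row k of (3.53): h_{k-1} s''_{k-1} + 2(h_{k-1}+h_k) s''_k + h_k s''_{k+1} = g_k.
    M = N - 1
    diag = [2 * (h[i] + h[i + 1]) for i in range(M)]
    rhs = list(g)
    for i in range(1, M):        # forward elimination; no pivoting needed, since
        w = h[i] / diag[i - 1]   # the matrix is strictly diagonally dominant
        diag[i] -= w * h[i]
        rhs[i] -= w * rhs[i - 1]
    s2 = [0.0] * (N + 1)         # natural boundary conditions: s''_0 = s''_N = 0
    for i in range(M - 1, -1, -1):
        s2[i + 1] = (rhs[i] - h[i + 1] * s2[i + 2]) / diag[i]
    a = list(y[:N])                                               # (3.54a)
    b = [(y[k + 1] - y[k]) / h[k] - h[k] / 6 * (s2[k + 1] + 2 * s2[k])
         for k in range(N)]                                       # (3.54b)
    c = [s2[k] / 2 for k in range(N)]                             # (3.54c)
    d = [(s2[k + 1] - s2[k]) / (6 * h[k]) for k in range(N)]      # (3.54d)
    return a, b, c, d

# On [x_k, x_{k+1}], s(x) = a_k + b_k(x-x_k) + c_k(x-x_k)^2 + d_k(x-x_k)^3.
a, b, c, d = natural_cubic_spline([0.0, 1.0, 2.5, 4.0], [1.0, 3.0, 2.0, 5.0])
```

By the equivalence proved in Lem. 3.42, the resulting piecewise polynomials automatically match in value and in first and second derivatives at the interior knots, which serves as a correctness check of the solver.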
Theorem 3.46. Let a, b ∈ R, a < b, and f ∈ C 4 [a, b]. Moreover, given N ∈ N and a
knot vector ∆ := (x0 , . . . , xN ) ∈ RN +1 for [a, b], define
Let s ∈ S∆,3 be any cubic spline function satisfying s(xk ) = f (xk ) for each k ∈
{0, . . . , N }. If there exists C > 0 such that

max { |s′′(xk) − f′′(xk)| : k ∈ {0, . . . , N } } ≤ C ‖f⁽⁴⁾‖∞ h²max , (3.67)

where

c := (hmax/hmin) (C + 1/4). (3.69)
At the xk , the third derivative s′′′ does not need to exist. In that case, (3.68d) is to be
interpreted to mean that the estimate holds for all values of one-sided third derivatives
of s (which must exist, since s agrees with a polynomial on each [xk , xk+1 ]).
Proof. Exercise.
Theorem 3.47. Once again, consider the situation of Th. 3.46, but, instead of (3.67),
assume that the interpolating cubic spline s satisfies natural boundary conditions s′′ (a) =
s′′ (b) = 0. If f ∈ C 4 [a, b] satisfies f ′′ (a) = f ′′ (b) = 0, then it follows that s satisfies
(3.67) with C = 3/4, and, in consequence, the error estimates (3.68) with c = hmax /hmin .
Proof. We merely need to show that (3.67) holds with C = 3/4. Since s is the interpo-
lating cubic spline satisfying natural boundary conditions, s′′ satisfies the linear system
(3.62a). Dividing, for each k ∈ {1, . . . , N − 1}, the (k − 1)th equation of that system
by 3(hk−1 + hk ) yields the form
Bs′′ = ĝ, (3.70a)
with
B := the tridiagonal (N − 1) × (N − 1) matrix whose kth row (k ∈ {1, . . . , N − 1}) has diagonal entry 2/3, subdiagonal entry hk−1/(3(hk−1 + hk)), and superdiagonal entry hk/(3(hk−1 + hk)), all other entries being 0, (3.70b)

and

ĝ := (ĝ1 , . . . , ĝN−1)^T , ĝk := gk/(3(hk−1 + hk)) for each k ∈ {1, . . . , N − 1}. (3.70c)
where
Rk := (1/3)(hk − hk−1) f′′′(xk), (3.72b)

δk := (1/(6(hk−1 + hk))) ( h³k−1 f⁽⁴⁾(αk) + h³k f⁽⁴⁾(βk) ). (3.72c)
In the same way we applied Taylor's theorem to f′′ to obtain (3.71), we now apply Taylor's theorem to f to obtain, for each k ∈ {1, . . . , N − 1}, ξk ∈ ]xk , xk+1[ and ηk ∈ ]xk−1 , xk[ such that

f(xk+1) = f(xk) + hk f′(xk) + (h²k/2) f′′(xk) + (h³k/6) f′′′(xk) + (h⁴k/24) f⁽⁴⁾(ξk), (3.74a)
f(xk−1) = f(xk) − hk−1 f′(xk) + (h²k−1/2) f′′(xk) − (h³k−1/6) f′′′(xk) + (h⁴k−1/24) f⁽⁴⁾(ηk). (3.74b)

Multiplying (3.74a) and (3.74b) by 2/hk and 2/hk−1 , respectively, leads to

2 (f(xk+1) − f(xk))/hk = 2 f′(xk) + hk f′′(xk) + (h²k/3) f′′′(xk) + (h³k/12) f⁽⁴⁾(ξk), (3.75a)
−2 (f(xk) − f(xk−1))/hk−1 = −2 f′(xk) + hk−1 f′′(xk) − (h²k−1/3) f′′′(xk) + (h³k−1/12) f⁽⁴⁾(ηk). (3.75b)
Observing that
Using (3.77) in the right-hand side of (3.73) and then subtracting (3.70a) from the result
yields
B ( f′′(x1) − s′′(x1), . . . , f′′(xN−1) − s′′(xN−1) )^T = ( δ1 − δ̂1 , . . . , δN−1 − δ̂N−1 )^T .
From

2/3 − hk−1/(3(hk−1 + hk)) − hk/(3(hk−1 + hk)) = 1/3 for k ∈ {1, . . . , N − 1},
we conclude that B is strictly diagonally dominant, such that Lem. 3.44 allows us to
estimate
max { |f′′(xk) − s′′(xk)| : k ∈ {0, . . . , N } } ≤ 3 max { |δk| + |δ̂k| : k ∈ {1, . . . , N − 1} }
≤(∗) (3/4) h²max ‖f⁽⁴⁾‖∞ , (3.78)

where

|δk| + |δ̂k| ≤ (1/6 + 1/12) ( (h³k−1 + h³k)/(hk−1 + hk) ) ‖f⁽⁴⁾‖∞ ≤ (1/4) h²max ‖f⁽⁴⁾‖∞
was used for the inequality at (∗). The estimate (3.78) shows that s satisfies (3.67) with
C = 3/4, thereby concluding the proof of the theorem.
Note that (3.68) is, in more than one way, much better than (3.45): From (3.68) we
get convergence of the first three derivatives, and the convergence of kf − sk∞ is also
much faster in (3.68) (proportional to h⁴ rather than to h² in (3.45)). The disadvantage
of (3.68), however, is that we needed to assume the existence of a continuous fourth
derivative of f .
Remark 3.48. A piecewise interpolation by cubic polynomials different from spline
interpolation is given by carrying out a Hermite interpolation on each interval [xk , xk+1 ]
such that the resulting function satisfies f (xk ) = s(xk ) as well as
The result will usually not be a spline, since s will usually not have second derivatives at the xk . On the other hand, the corresponding cubic spline will usually not satisfy
(3.79). Thus, one will prefer one or the other form of interpolation, depending on either
(3.79) (i.e. the agreement of the first derivatives of the given and interpolating function)
or s ∈ C 2 being more important in the case at hand.
—
[Figure: graphs of f, the linear spline, the cubic spline, and the interpolating polynomial of degree 6.]

Figure 3.1: Different interpolations of f(x) = 1/(x² + 1) with data points at (x0 , . . . , x6 ) = (−6, −4, . . . , 4, 6).
[Figure: graphs of f, the linear spline, the cubic spline, and the interpolating polynomial of degree 6.]

Figure 3.2: Different interpolations of f(x) = cos(x) with data points at (x0 , . . . , x6 ) = (−6, −4, . . . , 4, 6).
4 Numerical Integration
4.1 Introduction
The goal is the computation of (an approximation of) the value of definite integrals

∫_a^b f(x) dx . (4.1)

Even though we know from the fundamental theorem of calculus that every continuous f has an antiderivative, the antiderivative can often not be expressed in terms of so-called elementary functions (such as polynomials, exponential functions, and trigonometric functions), even for quite simple f. An example is the important function

Φ(y) := (1/√(2π)) ∫_0^y e^{−x²/2} dx , (4.2)
is called the Riemann sum of f associated with a tagged partition of [a, b].
Notation 4.2. Given a real interval [a, b], a < b, we denote the set of all Riemann
integrable functions f : [a, b] −→ R by R[a, b].
Theorem 4.3. Given a real interval [a, b], a < b, a function f : [a, b] −→ R is in R[a, b]
if, and only if, for each sequence of tagged partitions
∆k = ( (xk0 , . . . , xknk ), (tk1 , . . . , tknk ) ), k ∈ N, (4.5)

of [a, b] such that limk→∞ h(∆k) = 0, the limit of associated Riemann sums

I(f) := ∫_a^b f(x) dx := lim_{k→∞} Σ_{i=1}^{nk} f(tki) hki (4.6)
exists. The limit is then unique and called the Riemann integral of f . Recall that every
continuous and every piecewise continuous function on [a, b] is in R[a, b].
Thus, for each Riemann integrable function, we can use (4.6) to compute a sequence of
approximations that converges to the correct value of the Riemann integral.
Example 4.4. To employ (4.4) to compute approximations to Φ(1), where Φ is the
function defined in (4.2), one can use the tagged partitions ∆n of [0, 1] given by xi =
ti = i/n for i ∈ {0, . . . , n}, n ∈ N. Noticing hi = 1/n, one obtains
In := Σ_{i=1}^{n} (1/n) (1/√(2π)) exp( −(1/2)(i/n)² ).
From (4.6), we know that Φ(1) = limn→∞ In , since the integrand is continuous. For
example, one can compute the following values:
n        In
1        0.2420
2        0.2970
10       0.3333
100      0.3406
1000     0.3413
5000     0.3413
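These sums are immediate to reproduce in code. A minimal sketch (Python; the function name is ours), using exactly the tags ti = i/n from above:

```python
import math

def riemann_sum_phi1(n):
    """Right-endpoint Riemann sum I_n for Phi(1), tags t_i = i/n, h_i = 1/n."""
    return sum(math.exp(-0.5 * (i / n) ** 2) for i in range(1, n + 1)) \
        / (n * math.sqrt(2 * math.pi))

for n in (1, 2, 10, 100, 1000):
    print(n, round(riemann_sum_phi1(n), 4))
```

Since the integrand is positive and decreasing on [0, 1], these right-endpoint sums approach the limit Φ(1) ≈ 0.3413 from below, and the convergence is visibly slow.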
A problem is that, when using (4.6), a priori, we have no control over the approximation error. Thus, given a required accuracy, we do not know what mesh size we need to choose. And even though the first four digits in the table of Example 4.4 seem to have stabilized, without further investigation, we cannot be certain that that is, indeed, the case. The main goal in the following is to improve on this situation by providing approximation formulas for Riemann integrals together with effective error estimates.
We will slightly generalize the problem class by considering so-called weighted integrals:
Definition 4.5. Given a real interval [a, b], a < b, each Riemann integrable function ρ : [a, b] −→ R+0 such that

∫_a^b ρ(x) dx > 0 (4.7)

(in particular, ρ ≢ 0, ρ bounded, 0 < ∫_a^b ρ(x) dx < ∞) is called a weight function. Given a weight function ρ, the weighted integral of f ∈ R[a, b] is

Iρ(f) := ∫_a^b f(x) ρ(x) dx , (4.8)
that means the usual integral I(f ) corresponds to ρ ≡ 1 (note that f ρ ∈ R[a, b], cf.
[Phi16a, Th. 10.18(d)]).
—
In the sequel, we will first study methods designed for the numerical solution of nonweighted integrals (i.e. ρ ≡ 1) in Sections 4.2, 4.3, and 4.5, followed by an investigation
of Gaussian quadrature in Sec. 4.6, which is suitable for the numerical solution of both
weighted and nonweighted integrals.
Definition 4.6. Given a real interval [a, b], a < b, in a generalization of the Riemann
sums (4.4), a map
In : R[a, b] −→ R, In(f) = (b − a) Σ_{i=0}^{n} σi f(xi), (4.9)
A decent quadrature rule should give the exact integral for polynomials of low degree.
This gives rise to the next definition:
Definition 4.8. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R+0, a quadrature rule In is defined to have the degree of accuracy r ∈ N0 if, and only if,

In(x^m) = ∫_a^b x^m ρ(x) dx for each m ∈ {0, . . . , r}, (4.10a)

but

In(x^{r+1}) ≠ ∫_a^b x^{r+1} ρ(x) dx . (4.10b)
Remark 4.9. (a) Due to the linearity of In , In has degree of accuracy at least r ∈ N0
if, and only if, In is exact for each p ∈ Polr (R).
(b) A quadrature rule In as given by (4.9) can never be exact for the polynomial

p : R −→ R, p(x) := Π_{i=0}^{n} (x − xi)², (4.11)

since

In(p) = 0 ≠ ∫_a^b p(x) ρ(x) dx . (4.12)

In particular, In can have degree of accuracy at most r = 2n + 1.

(c) In view of (b), a quadrature rule In as given by (4.9) has any degree of accuracy (i.e. at least degree of accuracy 0) if, and only if,

∫_a^b ρ(x) dx = In(1) = (b − a) Σ_{i=0}^{n} σi , (4.13)

which simplifies to

Σ_{i=0}^{n} σi = 1 for ρ ≡ 1. (4.14)
(a) In is, indeed, a quadrature rule in the sense of Def. 4.6, where the weights σi ,
i ∈ {0, . . . , n}, are given in terms of the Lagrange basis polynomials Li of (3.10) by
σi = (1/(b − a)) ∫_a^b Li(x) dx = (1/(b − a)) ∫_a^b Π_{k=0, k≠i}^{n} (x − xk)/(xi − xk) dx
=(∗) ∫_0^1 Π_{k=0, k≠i}^{n} (t − tk)/(ti − tk) dt , where ti := (xi − a)/(b − a). (4.16)
(b) In has at least degree of accuracy n, i.e. it is exact for each q ∈ Poln (R).
(c) For each f ∈ C n+1 [a, b], one has the error estimate
| ∫_a^b f(x) dx − In(f) | ≤ (‖f^{(n+1)}‖∞/(n + 1)!) ∫_a^b |ωn+1(x)| dx , (4.17)

where ωn+1(x) = Π_{i=0}^{n} (x − xi) is the Newton basis polynomial.
Comparing with (4.9) then yields (4.16), where the equality labeled (∗) is due to
(x − xk)/(xi − xk) = ( (x − a)/(b − a) − (xk − a)/(b − a) ) / ( (xi − a)/(b − a) − (xk − a)/(b − a) )

and the change of variables x ↦ t := (x − a)/(b − a).
(b): For q ∈ Poln(R), we have pn(q) = q, such that In(q) = ∫_a^b q(x) dx is clear from (4.15).
(c) follows by using (4.15) in combination with (3.31).
Remark 4.12. If f ∈ C∞[a, b] and there exist K ∈ R+0 and R < (b − a)^{−1} such that (3.32) holds, i.e.

‖f^{(n)}‖∞ ≤ K n! R^n for each n ∈ N,
then, for each sequence of distinct numbers (x0 , x1 , . . . ) in [a, b], the corresponding
quadrature rules based on interpolating polynomials converge, i.e.
lim_{n→∞} | ∫_a^b f(x) dx − In(f) | = 0 :

Indeed, | ∫_a^b f(x) dx − In(f) | ≤ ∫_a^b ‖f − pn‖∞ dx = (b − a) ‖f − pn‖∞ and lim_{n→∞} ‖f − pn‖∞ = 0 according to (3.34).
—
In certain situations, (4.17) can be improved by determining the sign of the error term
(see Prop. 4.13 below). This information will come in handy when proving subsequent
error estimates for special cases.
Proposition 4.13. Let [a, b] be a real interval, a < b, and let In be the quadrature rule based on interpolating polynomials for the distinct points x0 , . . . , xn ∈ [a, b], n ∈ N0, as defined in (4.15). If ωn+1(x) = Π_{i=0}^{n} (x − xi) has only one sign on [a, b] (i.e. ωn+1 ≥ 0 or ωn+1 ≤ 0 on [a, b], see Rem. 4.14 below), then, for each f ∈ C^{n+1}[a, b], there exists τ ∈ [a, b] such that

∫_a^b f(x) dx − In(f) = (f^{(n+1)}(τ)/(n + 1)!) ∫_a^b ωn+1(x) dx . (4.18)

In particular, if f^{(n+1)} has just one sign as well, then one can infer the sign of the error term (i.e. of the right-hand side of (4.18)).
Proof. From Th. 3.25, we know that, for each x ∈ [a, b], there exists ξ(x) ∈ [a, b] such that (cf. (3.31))

f(x) = pn(x) + (f^{(n+1)}(ξ(x))/(n + 1)!) ωn+1(x), (4.19)

where, according to Th. 3.25(c), the function x ↦ f^{(n+1)}(ξ(x)) is continuous on [a, b]. Thus, integrating (4.19) and applying the mean value theorem of integration (cf. [Phi17, (2.15f)]) yields τ ∈ [a, b] satisfying

∫_a^b f(x) dx − In(f) = (1/(n + 1)!) ∫_a^b f^{(n+1)}(ξ(x)) ωn+1(x) dx = (f^{(n+1)}(τ)/(n + 1)!) ∫_a^b ωn+1(x) dx ,

proving (4.18).
Remark 4.14. There exist precisely 3 examples where ωn+1 has only one sign on [a, b],
namely ω1 (x) = (x − a) ≥ 0, ω1 (x) = (x − b) ≤ 0, and ω2 (x) = (x − a)(x − b) ≤ 0. In
all other cases, there exists at least one xi ∈]a, b[. As xi is a simple zero of ωn+1 , ωn+1
changes its sign in the interior of [a, b].
Definition and Remark 4.15. Given a real interval [a, b], a < b, a quadrature rule
based on interpolating polynomials according to Def. 4.10 is called a Newton-Cotes
formula, also known as a Newton-Cotes quadrature rule if, and only if, the points
x0 , . . . , xn ∈ [a, b], n ∈ N, are equally-spaced. Moreover, a Newton-Cotes formula is
called closed or a Newton-Cotes closed quadrature rule if, and only if,
xi := a + i h, h := (b − a)/n, i ∈ {0, . . . , n}, n ∈ N. (4.20)
In the present context of equally-spaced xi , the discretization’s mesh size h is also called
step size. It is an exercise to compute the weights of the Newton-Cotes closed quadrature
rules. One obtains

σi = σi,n = (1/n) ∫_0^n Π_{k=0, k≠i}^{n} (s − k)/(i − k) ds . (4.21)
Given n, one can compute the σi explicitly and store or tabulate them for further use.
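Since the rule is interpolatory, the weights can also be obtained without evaluating (4.21): by Th. 4.11(b), they are the unique solution of the moment conditions Σi σi (i/n)^m = 1/(m + 1), m ∈ {0, . . . , n} (exactness on the monomials). A sketch in exact rational arithmetic (Python; the function name and the elimination routine are our own, not from the notes):

```python
from fractions import Fraction

def closed_newton_cotes_weights(n):
    """Weights sigma_0, ..., sigma_n of the closed Newton-Cotes rule,
    computed exactly from the moment conditions on [0, 1] with x_i = i/n."""
    A = [[Fraction(i, n) ** m for i in range(n + 1)] for m in range(n + 1)]
    b = [Fraction(1, m + 1) for m in range(n + 1)]
    # Gauss-Jordan elimination over the rationals.
    for col in range(n + 1):
        piv = next(r for r in range(col, n + 1) if A[r][col] != 0)
        A[col], A[piv], b[col], b[piv] = A[piv], A[col], b[piv], b[col]
        for r in range(n + 1):
            if r != col and A[r][col] != 0:
                f = A[r][col] / A[col][col]
                A[r] = [arc - f * acc for arc, acc in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[r] / A[r][r] for r in range(n + 1)]

print(closed_newton_cotes_weights(1))      # trapezoidal rule: 1/2, 1/2
print(closed_newton_cotes_weights(2))      # Simpson's rule: 1/6, 4/6, 1/6, cf. (4.37)
print(min(closed_newton_cotes_weights(8))) # negative weights appear for n = 8
```

The output also illustrates the symmetry σi,n = σn−i,n of Lemma 4.16 below, and the appearance of negative weights for n = 8, which is the root of the stability problems discussed in Sec. 4.3.5.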
The next lemma shows that one actually does not need to compute all σi :
Lemma 4.16. The weights σi of the Newton-Cotes closed quadrature rules (see (4.21))
are symmetric, i.e.

σi,n = σn−i,n for each i ∈ {0, . . . , n}, n ∈ N. (4.22)
Proof. Exercise.
If n is even, then, as we will show in Th. 4.18 below, the corresponding closed Newton-
Cotes formula In is exact not only for p ∈ Poln (R), but even for p ∈ Poln+1 (R). On the
other hand, we will also see in Th. 4.18 that I(xn+2 ) 6= In (xn+2 ) as a consequence of the
following preparatory lemma:
Lemma 4.17. If [a, b], a < b, is a real interval, n ∈ N is even, and xi , i ∈ {0, . . . , n},
are defined according to (4.20), then the antiderivative of the Newton basis polynomial
ω := ωn+1 , i.e. the function
F : R −→ R, F(x) := ∫_a^x Π_{i=0}^{n} (y − xi) dy , (4.23)

satisfies

F(x) = 0 for x = a, F(x) > 0 for a < x < b, F(x) = 0 for x = b. (4.24)
Proof. F (a) = 0 is clear. The remaining assertions of (4.24) are shown in several steps.
Proof. As ω has degree precisely n + 1 and a zero at each of the n + 1 distinct points
x0 , . . . , xn , each of these zeros must be simple. In particular, ω must change its sign at
each xi . Moreover, limx→−∞ ω(x) = −∞ due to the fact that the degree n + 1 of ω is
odd. This implies that ω changes its sign from negative to positive at x0 = a and from
positive to negative at x1 = a + h, establishing the case j = 0. The general case (4.25)
now follows by induction. N
Claim 2. On the left half of the interval [a, b], |ω| decays in the following sense:
|ω(x + h)| < |ω(x)| for each x ∈ [ a, (a + b)/2 − h ] \ {x0 , . . . , xn/2−1}. (4.26)
Proof. There exist τ ∈ ]0, h] and i ∈ {0, . . . , n/2 − 1} such that x = xi + τ . Thus,

F(x) = ∫_{xi}^{x} ω(y) dy + Σ_{k=0}^{i−1} ∫_{xk}^{xk+1} ω(y) dy , (4.27)
for each τ ∈ ]0, h] and each 0 ≤ j ≤ n/2 − 2, thereby establishing the case. N
Claim 4. ω is antisymmetric with respect to the midpoint of [a, b], i.e.

ω((a + b)/2 + τ) = −ω((a + b)/2 − τ) for each τ ∈ R. (4.29)

Proof. From (a + b)/2 − xi = −((a + b)/2 − xn−i) = (n/2 − i)h for each i ∈ {0, . . . , n}, one obtains

ω((a + b)/2 + τ) = Π_{i=0}^{n} ( (a + b)/2 + τ − xi ) = Π_{i=0}^{n} ( −((a + b)/2 − τ − xn−i) )
= −Π_{i=0}^{n} ( (a + b)/2 − τ − xi ) = −ω((a + b)/2 − τ),

which is (4.29). N
Claim 5. F is symmetric with respect to the midpoint of [a, b], i.e.

F((a + b)/2 + τ) = F((a + b)/2 − τ) for each τ ∈ R. (4.30)

Proof. For each τ ∈ [0, (b − a)/2], we obtain

F((a + b)/2 + τ) = ∫_a^{(a+b)/2−τ} ω(x) dx + ∫_{(a+b)/2−τ}^{(a+b)/2+τ} ω(x) dx = F((a + b)/2 − τ),

since the second integral vanishes by the antisymmetry (4.29),
Theorem 4.18. If [a, b], a < b, is a real interval and n ∈ N is even, then the Newton-
Cotes closed quadrature rule In : R[a, b] −→ R has degree of accuracy n + 1.
By applying (3.31) with g(x) = x^{n+2} and using g^{(n+2)}(x) = (n + 2)!, one obtains

g(x) − q(x) = (g^{(n+2)}(ξ(x))/(n + 2)!) ωn+2(x) = ωn+2(x) = Π_{i=0}^{n+1} (x − xi) for each x ∈ [a, b]. (4.33)

Integrating (4.33) while taking into account (4.32) as well as using F from (4.23) yields

I(g) − In(g) = ∫_a^b Π_{i=0}^{n+1} (x − xi) dx = ∫_a^b F′(x)(x − xn+1) dx
= [ F(x)(x − xn+1) ]_a^b − ∫_a^b F(x) dx = −∫_a^b F(x) dx < 0

(the boundary term vanishes and the strict inequality holds by (4.24)),
proving (4.31).
Note that, for n = 0, there is no closed Newton-Cotes formula, as (4.20) cannot be satisfied with just one point x0. Every Newton-Cotes quadrature rule with n = 0 is called a rectangle rule, since the integral ∫_a^b f(x) dx is approximated by (b − a) f(x0), x0 ∈ [a, b], i.e. by the area of the rectangle with vertices (a, 0), (b, 0), (a, f(x0)), (b, f(x0)).
Not surprisingly, the rectangle rule with x0 = (b + a)/2 is known as the midpoint rule.
Rectangle rules have degree of accuracy r ≥ 0, since they are exact for constant func-
tions. Rectangle rules actually all have r = 0, except for the midpoint rule with r = 1
(exercise).
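Both claims are easy to check on monomials; a small sketch (Python; the helper name is ours):

```python
def midpoint_rule(f, a, b):
    """Rectangle rule with x_0 = (a + b)/2."""
    return (b - a) * f((a + b) / 2)

# Exact for f(x) = x on [0, 2] (degree of accuracy at least 1) ...
print(midpoint_rule(lambda x: x, 0.0, 2.0))      # 2.0, equals the integral
# ... but not for f(x) = x^2: the rule gives 2.0, while the integral is 8/3.
print(midpoint_rule(lambda x: x * x, 0.0, 2.0))  # 2.0
```

By contrast, the rectangle rule with x0 = a already fails for f(x) = x, consistent with r = 0.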
In the following lemma, we provide an error estimate for the rectangle rule using x0 = a.
For other choices of x0 , one obtains similar results.
Lemma 4.19. Given a real interval [a, b], a < b, and f ∈ C¹[a, b], there exists τ ∈ [a, b] such that

∫_a^b f(x) dx − (b − a) f(a) = ((b − a)²/2) f′(τ). (4.34)

Proof. As ω1(x) = (x − a) ≥ 0 on [a, b], we can apply Prop. 4.13 to get τ ∈ [a, b] satisfying

∫_a^b f(x) dx − (b − a) f(a) = f′(τ) ∫_a^b (x − a) dx = ((b − a)²/2) f′(τ),

proving (4.34).
Definition and Remark 4.20. The closed Newton-Cotes formula with n = 1 is called the trapezoidal rule. Its explicit form is easily computed, e.g. from (4.15):

I1(f) := ∫_a^b ( f(a) + ((f(b) − f(a))/(b − a)) (x − a) ) dx = [ f(a) x + (1/2) ((f(b) − f(a))/(b − a)) (x − a)² ]_a^b
= (b − a) ( (1/2) f(a) + (1/2) f(b) ) (4.35)

for each f ∈ C[a, b]. Note that (4.35) justifies the name trapezoidal rule, as this is precisely the area of the trapezoid with vertices (a, 0), (b, 0), (a, f(a)), (b, f(b)).
Proof. Exercise.
Lemma 4.22. Given a real interval [a, b], a < b, and f ∈ C²[a, b], there exists τ ∈ [a, b] such that

∫_a^b f(x) dx − I1(f) = −((b − a)³/12) f′′(τ) = −(h³/12) f′′(τ). (4.36)
Proof. Exercise.
Definition and Remark 4.23. The closed Newton-Cotes formula with n = 2 is called
Simpson’s rule. To find the explicit form, we compute the weights σ0 , σ1 , σ2 . According
to (4.21), one finds

σ0 = (1/2) ∫_0^2 ((s − 1)/(−1)) ((s − 2)/(−2)) ds = (1/4) [ s³/3 − (3 s²)/2 + 2 s ]_0^2 = 1/6. (4.37)
Lemma 4.25. Given a real interval [a, b], a < b, and f ∈ C⁴[a, b], there exists τ ∈ [a, b] such that

∫_a^b f(x) dx − I2(f) = −((b − a)⁵/2880) f⁽⁴⁾(τ) = −(h⁵/90) f⁽⁴⁾(τ). (4.39)
and the function x ↦ f⁽⁴⁾(ξ(x)) is continuous on [a, b] by Th. 3.25(c). Moreover, according to (4.42), ω4(x) ≤ 0 for each x ∈ [a, b]. Thus, integrating (4.41) and applying (4.40) as well as the mean value theorem of integration (cf. [Phi17, (2.15f)]) yields τ ∈ [a, b] satisfying

∫_a^b f(x) dx − I2(f) = (f⁽⁴⁾(τ)/4!) ∫_a^b ω4(x) dx = −(f⁽⁴⁾(τ)/24) ((b − a)⁵/120)
= −((b − a)⁵/2880) f⁽⁴⁾(τ) = −(h⁵/90) f⁽⁴⁾(τ), (4.43)

where it was used that, integrating by parts repeatedly,

∫_a^b ω4(x) dx = −∫_a^b ((x − a)²/2) ( 2 (x − (a+b)/2)(x − b) + (x − (a+b)/2)² ) dx
= −(b − a)⁵/24 + ∫_a^b ((x − a)³/6) ( 2 (x − b) + 4 (x − (a+b)/2) ) dx
= −(b − a)⁵/24 + 2 (b − a)⁵/24 − 6 ∫_a^b ((x − a)⁴/24) dx
= −(b − a)⁵/120. (4.44)

This completes the proof of (4.39).
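Since f(x) = x⁴ has the constant fourth derivative f⁽⁴⁾ ≡ 24, (4.39) predicts the error of Simpson's rule on [0, 1] exactly, namely −24/2880 = −1/120, independently of τ. A quick check (the helper name is ours; Simpson's rule is written out with the weights 1/6, 4/6, 1/6 from (4.37)):

```python
def simpson(f, a, b):
    """Simpson's rule I_2(f) on [a, b], weights 1/6, 4/6, 1/6."""
    return (b - a) * (f(a) + 4 * f((a + b) / 2) + f(b)) / 6

# integral of x^4 over [0, 1] is 1/5; Simpson gives 5/24
err = 1 / 5 - simpson(lambda x: x ** 4, 0.0, 1.0)
print(err)  # -1/120 = -0.008333...
```

That the error comes out exactly as predicted (up to rounding) is special to polynomials of degree 4; for general f ∈ C⁴[a, b], (4.39) only locates the error up to the unknown intermediate point τ.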
As before, let [a, b] be a real interval, a < b, and let In : R[a, b] −→ R be the Newton-
Cotes closed quadrature rule based on interpolating polynomials p ∈ Poln (R). According
to Th. 4.11(b), we know that In has degree of accuracy at least n, and one might hope
that, by increasing n, one obtains more and more accurate quadrature rules for all
f ∈ R[a, b] (or at least for all f ∈ C[a, b]). Unfortunately, this is not the case! As it
turns out, Newton-Cotes formulas with larger n have quite a number of drawbacks. As
a result, in practice, Newton-Cotes formulas with n > 3 are hardly ever used (for n = 3,
one obtains Simpson’s 3/8 rule, and even that is not used all that often).
The problems with higher order Newton-Cotes formulas have to do with the fact that
the formulas involve negative weights (negative weights first occur for n = 8). As n
increases, the situation becomes worse and worse and one can actually show that (see
[Bra77, Sec. IV.2])

lim_{n→∞} Σ_{i=0}^{n} |σi,n| = ∞. (4.45)
For polynomials and some other functions, terms with negative and positive weights
cancel, but for general f ∈ C[a, b] (let alone general f ∈ R[a, b]), that is not the case,
leading to instabilities and divergence. That, in the light of (4.45), convergence can not
be expected is a consequence of the following Th. 4.27.
Note that f is a linear spline on [xφ(0) , xφ(n) ] (cf. (3.44)), possibly with constant continuous extensions on [a, xφ(0) ] and [xφ(n) , b]. In particular, f ∈ C[a, b]. Clearly, either In ≡ 0 or ‖f‖∞ = 1 and

In(f) = (b − a) Σ_{i=0}^{n} σi f(xi) = (b − a) Σ_{i=0}^{n} σi sgn(σi) = (b − a) Σ_{i=0}^{n} |σi|,

showing the remaining inequality ‖In‖ ≥ (b − a) Σ_{i=0}^{n} |σi| and, thus, (4.47).
Theorem 4.27 (Polya). Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R+0, consider a sequence of quadrature rules

In : C[a, b] −→ R, In(f) = (b − a) Σ_{i=0}^{n} σi,n f(xi,n), (4.48)

with distinct points x0,n , . . . , xn,n ∈ [a, b] and weights σ0,n , . . . , σn,n ∈ R, n ∈ N. The sequence converges in the sense that

lim_{n→∞} In(f) = ∫_a^b f(x) ρ(x) dx for each f ∈ C[a, b] (4.49)

if, and only if, the following two conditions are satisfied:

(i) lim_{n→∞} In(p) = ∫_a^b p(x) ρ(x) dx holds for each polynomial p.

(ii) The sums of the absolute values of the weights are uniformly bounded, i.e. there exists C ∈ R+ such that

Σ_{i=0}^{n} |σi,n| ≤ C for each n ∈ N.

Proof. Suppose (i) and (ii) hold true. Let W := ∫_a^b ρ(x) dx. Given f ∈ C[a, b] and ε > 0, according to the Weierstrass Approximation Theorem F.1, there exists a polynomial p such that

‖f − p↾[a,b]‖∞ < ε̃ := ε/(C(b − a) + 1 + W). (4.50)

Due to (i), there exists N ∈ N such that, for each n ≥ N,

| In(p) − ∫_a^b p(x) ρ(x) dx | < ε̃.
proving (4.49).
Conversely, (4.49) trivially implies (i). That it also implies (ii) is much more involved.
Letting T := {In : n ∈ N}, we have T ⊆ L(C[a, b], R) and (4.49) implies

sup { |T(f)| : T ∈ T } = sup { |In(f)| : n ∈ N } < ∞ for each f ∈ C[a, b]. (4.51)
which, according to Prop. 4.26, is (ii). The proof of the Banach-Steinhaus theorem is
usually part of Functional Analysis. It is provided in Appendix G. As it only uses
elementary metric space theory, the interested reader should be able to understand
it.
While, for ρ ≡ 1, Condition (i) of Th. 4.27 is obviously satisfied by the Newton-Cotes formulas, since, given q ∈ Poln(R), Im is exact for q for each m ≥ n, Condition (ii) of Th. 4.27 is violated due to (4.45). As the Newton-Cotes formulas are based on interpolating
polynomials, and we had already needed additional conditions to guarantee the conver-
gence of interpolating polynomials (see Th. 3.26), it should not be too surprising that
Newton-Cotes formulas do not converge for all continuous functions.
So we see that improving the accuracy of Newton-Cotes formulas by increasing n is,
in general, not feasible. However, the accuracy can be improved using a number of
different strategies, two of which will be considered in the following. In Sec. 4.5, we will
study so-called composite Newton-Cotes rules that result from subdividing [a, b] into
small intervals, using a low-order Newton-Cotes formula on each small interval, and
then summing up the results. In Sec. 4.6, we will consider Gaussian quadrature, where,
in contrast to the Newton-Cotes formulas, one does not use equally-spaced points xi for
the interpolation, but chooses the locations of the xi more carefully. As it turns out,
one can choose the xi such that all weights remain positive even for n → ∞. Then Rem.
4.9 combined with Th. 4.27 yields convergence.
with 0 ≤ ξ0 < ξ1 < · · · < ξn ≤ 1 and weights σ0 , . . . , σn ∈ R. Given a real interval [a, b],
a < b, and N ∈ N, set
xk := a + k h, h := (b − a)/N, for each k ∈ {0, . . . , N }, (4.54a)
xkl := xk−1 + h ξl , for each k ∈ {1, . . . , N }, l ∈ {0, . . . , n}. (4.54b)

Then

In,N(f) := h Σ_{k=1}^{N} Σ_{l=0}^{n} σl f(xkl), f : [a, b] −→ R, (4.55)
Theorem 4.30. Let [a, b], a < b, be a real interval, n ∈ N0 . If In has the form (4.53)
and is exact for each constant function (i.e. it has at least degree of accuracy 0 in the
language of Def. 4.8), then the composite quadrature rules (4.55) based on In converge
for each Riemann integrable f : [a, b] −→ R, i.e.
∀ f ∈ R[a, b] : lim_{N→∞} In,N(f) = ∫_a^b f(x) dx . (4.57)
Proof. Let f ∈ R[a, b]. Introducing, for each l ∈ {0, . . . , n}, the abbreviation

Sl(N) := Σ_{k=1}^{N} ((b − a)/N) f(xkl), (4.58)

(4.55) yields

In,N(f) = Σ_{l=0}^{n} σl Sl(N). (4.59)

Note that, for each l ∈ {0, . . . , n}, Sl(N) is a Riemann sum for the tagged partition ∆N,l := ((x0 , . . . , xN), (x1l , . . . , xNl)) and lim_{N→∞} h(∆N,l) = lim_{N→∞} (b − a)/N = 0. Thus, applying Th. 4.3,

lim_{N→∞} Sl(N) = ∫_a^b f(x) dx for each l ∈ {0, . . . , n}.

Since exactness of In for constant functions means Σ_{l=0}^{n} σl = 1, combining this with (4.59) proves (4.57).
The convergence result of Th. 4.30 still suffers from the drawback of the original con-
vergence result for the Riemann sums, namely that we do not obtain any control on the
rate of convergence and the resulting error term. We will achieve such results in the
following sections by assuming additional regularity of the integrated functions f . We
introduce the following notion as a means to quantify the convergence rate of composite
quadrature rules:
Definition 4.31. Given a real interval [a, b], a < b, and a subset F of the Riemann integrable functions on [a, b], composite quadrature rules In,N according to (4.55) are said to have order of convergence r > 0 for f ∈ F if, and only if, for each f ∈ F, there exists K = K(f) ∈ R+0 such that

| In,N(f) − ∫_a^b f(x) dx | ≤ K h^r for each N ∈ N (4.60)

(h = (b − a)/N as before).
Theorem 4.33. For each f ∈ C¹[a, b], there exists τ ∈ [a, b] satisfying

∫_a^b f(x) dx − I0,N(f) = ((b − a)²/(2N)) f′(τ) = ((b − a)/2) h f′(τ). (4.62)

In particular, the composite rectangle rules have order of convergence r = 1 for f ∈ C¹[a, b] in terms of Def. 4.31.

Proof. Fix f ∈ C¹[a, b]. From Lem. 4.19, we obtain, for each k ∈ {1, . . . , N }, a point τk ∈ [xk−1 , xk] such that

∫_{xk−1}^{xk} f(x) dx − ((b − a)/N) f(xk−1) = ((b − a)²/(2N²)) f′(τk) = (h²/2) f′(τk). (4.63)

As f′ is continuous and

min { f′(x) : x ∈ [a, b] } ≤ α := (1/N) Σ_{k=1}^{N} f′(τk) ≤ max { f′(x) : x ∈ [a, b] },
Definition and Remark 4.34. Using the trapezoidal rule (4.35) as I1 on [0, 1], the
formula (4.55) yields, for f : [a, b] −→ R, the corresponding composite trapezoidal rules
\[
I_{1,N}(f) = h \sum_{k=1}^{N} \frac{1}{2}\big( f(x_{k0}) + f(x_{k1}) \big)
\overset{x_{k0}=x_{k-1},\ x_{k1}=x_k}{=} h \sum_{k=1}^{N} \frac{1}{2}\big( f(x_{k-1}) + f(x_k) \big)
= \frac{h}{2}\big( f(a) + 2 f(x_1) + \dots + 2 f(x_{N-1}) + f(b) \big). \tag{4.64}
\]
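As a quick sanity check, the composite trapezoidal rule (4.64) is only a few lines of code. The integrand, interval, and N below are merely illustrations:

```python
import numpy as np

def trapezoid_composite(f, a, b, N):
    # Composite trapezoidal rule (4.64):
    # h/2 * ( f(a) + 2 f(x_1) + ... + 2 f(x_{N-1}) + f(b) ),  h = (b-a)/N
    x = np.linspace(a, b, N + 1)
    h = (b - a) / N
    return h / 2 * (f(x[0]) + 2 * np.sum(f(x[1:-1])) + f(x[-1]))

# Example: ∫_0^1 e^x dx = e - 1; the error behaves like h^2 (cf. Th. 4.35)
approx = trapezoid_composite(np.exp, 0.0, 1.0, 1000)
```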
Theorem 4.35. For each f ∈ C²[a, b], there exists τ ∈ [a, b] satisfying
\[
\int_a^b f(x)\,dx - I_{1,N}(f) = -\frac{(b-a)^3}{12 N^2}\, f''(\tau) = -\frac{b-a}{12}\, h^2\, f''(\tau). \tag{4.65}
\]
In particular, the composite trapezoidal rules have order of convergence r = 2 for f ∈ C²[a, b] in terms of Def. 4.31.
Proof. Fix f ∈ C²[a, b]. From Lem. 4.22, we obtain, for each k ∈ {1, . . . , N}, a point τ_k ∈ [x_{k−1}, x_k] such that
\[
\int_{x_{k-1}}^{x_k} f(x)\,dx - \frac{h}{2}\big( f(x_{k-1}) + f(x_k) \big) = -\frac{h^3}{12}\, f''(\tau_k).
\]
As f′′ is continuous and
\[
\min\big\{ f''(x) : x \in [a,b] \big\} \le \alpha := \frac{1}{N} \sum_{k=1}^{N} f''(\tau_k) \le \max\big\{ f''(x) : x \in [a,b] \big\},
\]
Definition and Remark 4.36. Using Simpson’s rule (4.38) as I2 on [0, 1], the formula
(4.55) yields, for f : [a, b] −→ R, the corresponding composite Simpson’s rules I2,N (f ).
It is an exercise to check that
\[
I_{2,N}(f) = \frac{h}{6} \left( f(a) + 4 \sum_{k=1}^{N} f\big(x_{k-\frac{1}{2}}\big) + 2 \sum_{k=1}^{N-1} f(x_k) + f(b) \right), \tag{4.66}
\]
Theorem 4.37. For each f ∈ C⁴[a, b], there exists τ ∈ [a, b] satisfying
\[
\int_a^b f(x)\,dx - I_{2,N}(f) = -\frac{(b-a)^5}{2880 N^4}\, f^{(4)}(\tau) = -\frac{b-a}{2880}\, h^4\, f^{(4)}(\tau). \tag{4.67}
\]
In particular, the composite Simpson’s rules have order of convergence r = 4 for f ∈ C⁴[a, b] in terms of Def. 4.31.
Proof. Exercise.
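The order r = 4 predicted by (4.67) can be observed numerically: halving h should shrink the error by roughly 2⁴ = 16. The integrand below is only an illustration:

```python
import numpy as np

def simpson_composite(f, a, b, N):
    # Composite Simpson's rule (4.66); needs the midpoints x_{k-1/2} as well
    x = np.linspace(a, b, N + 1)
    mid = 0.5 * (x[:-1] + x[1:])
    h = (b - a) / N
    return h / 6 * (f(x[0]) + 4 * np.sum(f(mid)) + 2 * np.sum(f(x[1:-1])) + f(x[-1]))

# Example: ∫_0^1 sin x dx = 1 - cos 1; order 4 means the error ratio is
# approximately 16 when N is doubled (h halved)
exact = 1.0 - np.cos(1.0)
e1 = abs(simpson_composite(np.sin, 0.0, 1.0, 4) - exact)
e2 = abs(simpson_composite(np.sin, 0.0, 1.0, 8) - exact)
ratio = e1 / e2
```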
i.e. one writes λi instead of xi and the factor (b − a) is incorporated into the weights σi .
4 NUMERICAL INTEGRATION 100
In Gaussian quadrature, the λ_i are determined as the zeros of polynomials that are orthogonal with respect to certain scalar products on the (infinite-dimensional) real vector space Pol(R).
Notation 4.38. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R_0^+ as defined in Def. 4.5, let
\[
\langle \cdot, \cdot \rangle_\rho : \operatorname{Pol}(\mathbb{R}) \times \operatorname{Pol}(\mathbb{R}) \longrightarrow \mathbb{R}, \qquad \langle p, q \rangle_\rho := \int_a^b p(x)\,q(x)\,\rho(x)\,dx\,, \tag{4.69a}
\]
\[
\| \cdot \|_\rho : \operatorname{Pol}(\mathbb{R}) \longrightarrow \mathbb{R}_0^+, \qquad \|p\|_\rho := \sqrt{\langle p, p \rangle_\rho}\,. \tag{4.69b}
\]
Lemma 4.39. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R_0^+, (4.69a) defines a scalar product on Pol(R).
Proof. As ρ ∈ R[a, b], ρ ≥ 0, and ∫_a^b ρ(x) dx > 0, there must be some nontrivial interval J ⊆ [a, b] and ε > 0 such that ρ ≥ ε on J. Thus, if 0 ≠ p ∈ Pol(R), then there must be some nontrivial interval J_p ⊆ J and ε_p > 0 such that p²ρ ≥ ε_p on J_p, implying ⟨p, p⟩_ρ = ∫_a^b p²(x)ρ(x) dx ≥ ε_p |J_p| > 0 and proving ⟨·, ·⟩_ρ to be positive definite. Given p, q, r ∈ Pol(R) and λ, µ ∈ R, one computes
\[
\langle \lambda p + \mu q, r \rangle_\rho = \int_a^b \big( \lambda p(x) + \mu q(x) \big)\, r(x)\rho(x)\,dx
= \lambda \int_a^b p(x)r(x)\rho(x)\,dx + \mu \int_a^b q(x)r(x)\rho(x)\,dx\,,
\]
proving ⟨·, ·⟩_ρ to be linear in its first argument. Since, for p, q ∈ Pol(R), one also has
\[
\langle p, q \rangle_\rho = \int_a^b p(x)q(x)\rho(x)\,dx = \int_a^b q(x)p(x)\rho(x)\,dx = \langle q, p \rangle_\rho\,,
\]
⟨·, ·⟩_ρ is symmetric as well and the proof of the lemma is complete.
Remark 4.40. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R_0^+, consider Pol(R) with the scalar product ⟨·, ·⟩_ρ given by (4.69a). Then, for each n ∈ N₀, q ∈ Pol_n(R)^⊥ (cf. [Phi19b, Def. 10.10(b)]) if, and only if,
\[
\int_a^b q(x)p(x)\rho(x)\,dx = 0 \quad \text{for each } p \in \operatorname{Pol}_n(\mathbb{R}). \tag{4.70}
\]
At first glance, it might not be obvious that there are always q ∈ Pol(R) \ {0} that
satisfy (4.70). However, such q 6= 0 can always be found by the general procedure of
Gram-Schmidt orthogonalization, known from Linear Algebra: According to [Phi19b,
Th. 10.13], if h·, ·i is an arbitrary scalar product on Pol(R), if (xn )n∈N0 is a sequence in
Pol(R) and if (p_n)_{n∈N₀} are recursively defined by
\[
p_0 := x_0, \qquad \forall_{n \in \mathbb{N}} \quad p_n := x_n - \sum_{\substack{k=0,\\ p_k \neq 0}}^{n-1} \frac{\langle x_n, p_k \rangle}{\|p_k\|^2}\, p_k\,, \tag{4.71}
\]
While Gram-Schmidt orthogonalization works in every space with a scalar product, for
polynomials, the following theorem provides a more efficient recursive orthogonalization
algorithm:
Theorem 4.41. Given a scalar product h·, ·i on Pol(R) that satisfies
hxp, qi = hp, xqi for each p, q ∈ Pol(R) (4.72)
(clearly, the scalar product from (4.69a) satisfies (4.72)), let k · k denote the induced
norm. If, for each n ∈ N0 , xn ∈ Pol(R) is defined by xn (x) := xn and pn is given by
Gram-Schmidt orthogonalization according to (4.71), then the pn satisfy the following
recursive relation (sometimes called three-term recursion):
p0 = x0 ≡ 1, p1 = x − β0 , (4.73a)
pn+1 = (x − βn )pn − γn2 pn−1 for each n ∈ N, (4.73b)
where
\[
\beta_n := \frac{\langle x p_n, p_n \rangle}{\|p_n\|^2} \ \text{ for each } n \in \mathbb{N}_0, \qquad
\gamma_n := \frac{\|p_n\|}{\|p_{n-1}\|} \ \text{ for each } n \in \mathbb{N}. \tag{4.73c}
\]
Proof. We know from [Phi19b, Th. 10.13] that the linear independence of the xn guar-
antees the linear independence of the pn . We observe that p0 = x0 ≡ 1 is clear from
(4.71) and the definition of x₀. Next, from (4.71), we compute
\[
p_1 = x_1 - \frac{\langle x_1, p_0 \rangle}{\|p_0\|^2}\, p_0 = x - \frac{\langle x p_0, p_0 \rangle}{\|p_0\|^2} = x - \beta_0.
\]
It remains to consider n + 1 for each n ∈ N. Letting
\[
q_{n+1} := (x - \beta_n)\, p_n - \gamma_n^2\, p_{n-1},
\]
we have to show
\[
q_{n+1} = p_{n+1}.
\]
Clearly,
r := pn+1 − qn+1 ∈ Poln (R),
as the coefficient of xn+1 in both pn+1 and qn+1 is 1. Since Poln (R) ∩ Poln (R)⊥ =
{0} (cf. [Phi19b, Lem. 10.11(a)]), it now suffices to show r ∈ Poln (R)⊥ . Moreover,
we know pn+1 ∈ Poln (R)⊥ , since pn+1 ⊥ pk for each k ∈ {0, . . . , n} and Poln (R) =
span{p0 , . . . , pn }. Thus, as Poln (R)⊥ is a vector space, to prove r ∈ Pn⊥ , it suffices to
show
qn+1 ∈ Poln (R)⊥ . (4.74)
To this end, we start by using ⟨p_{n−1}, p_n⟩ = 0 and the definition of β_n to compute
\[
\langle q_{n+1}, p_n \rangle = \big\langle (x - \beta_n) p_n - \gamma_n^2 p_{n-1},\, p_n \big\rangle = \langle x p_n, p_n \rangle - \beta_n \langle p_n, p_n \rangle = 0. \tag{4.75a}
\]
where, at (∗), it was used that xpn−1 −pn ∈ Poln−1 (R) due to the fact that xn is canceled.
Finally, for an arbitrary q ∈ Poln−2 (R) (setting Pol−1 (R) := {0}), it holds that
Remark 4.42. The reason that (4.73) is more efficient than (4.71) lies in the fact that the computation of p_{n+1} according to (4.71) requires the computation of the n + 1 scalar products ⟨x_{n+1}, p₀⟩, . . . , ⟨x_{n+1}, p_n⟩, whereas the computation of p_{n+1} according to (4.73) merely requires the computation of the two scalar products ⟨p_n, p_n⟩ and ⟨x p_n, p_n⟩ occurring in β_n and γ_n.
Example 4.44. Considering ρ ≡ 1 on [−1, 1], one obtains ⟨p, q⟩_ρ = ∫_{−1}^{1} p(x)q(x) dx and it is an exercise to check that the first four resulting orthogonal polynomials are
\[
p_0(x) = 1, \quad p_1(x) = x, \quad p_2(x) = x^2 - \frac{1}{3}, \quad p_3(x) = x^3 - \frac{3}{5}\,x.
\]
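These polynomials can be reproduced with the three-term recursion (4.73); a minimal sketch using NumPy's polynomial class, where the inner product is computed exactly via the antiderivative of p·q:

```python
import numpy as np
from numpy.polynomial import Polynomial as P

def inner(p, q):
    # <p, q>_rho for rho ≡ 1 on [-1, 1], computed exactly via the antiderivative
    F = (p * q).integ()
    return F(1.0) - F(-1.0)

def monic_orthogonal(n_max):
    # Three-term recursion (4.73): p_0 = 1, p_1 = x - beta_0,
    # p_{n+1} = (x - beta_n) p_n - gamma_n^2 p_{n-1}
    x = P([0.0, 1.0])
    ps = [P([1.0])]
    ps.append(x - inner(x * ps[0], ps[0]) / inner(ps[0], ps[0]))
    for n in range(1, n_max):
        beta = inner(x * ps[n], ps[n]) / inner(ps[n], ps[n])
        gamma2 = inner(ps[n], ps[n]) / inner(ps[n - 1], ps[n - 1])
        ps.append((x - beta) * ps[n] - gamma2 * ps[n - 1])
    return ps

ps = monic_orthogonal(3)   # ps[2] = x^2 - 1/3, ps[3] = x^3 - (3/5) x
```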
—
The Gaussian quadrature rules are based on polynomial interpolation with respect to the
zeros of orthogonal polynomials. The following theorem provides information regarding
such zeros. In general, the explicit computation of the zeros is a nontrivial task and a
disadvantage of Gaussian quadrature rules (the price one has to pay for higher accuracy
and better convergence properties as compared to Newton-Cotes rules).
Theorem 4.45. Given a real interval [a, b], a < b, and a weight function ρ : [a, b] −→ R_0^+, let p_n, n ∈ N₀, be the orthogonal polynomials with respect to ⟨·, ·⟩_ρ. For a fixed
n ≥ 1, pn has precisely n distinct zeros λ1 , . . . , λn , which all are simple and lie in the
open interval ]a, b[.
Proof. Let a < λ1 < λ2 < · · · < λk < b be an enumeration of the zeros of pn that lie in
]a, b[ and have odd multiplicity (i.e. pn changes sign at these λj ). From pn ∈ Poln (R),
we know k ≤ n. The goal is to show k = n. Seeking a contradiction, we assume k < n.
This implies
\[
q(x) := \prod_{i=1}^{k} (x - \lambda_i) \in \operatorname{Pol}_{n-1}(\mathbb{R}).
\]
Thus, as p_n ∈ Pol_{n−1}(R)^⊥, we get
\[
\langle p_n, q \rangle_\rho = 0. \tag{4.76}
\]
On the other hand, the polynomial pn q has only zeros of even multiplicity on ]a, b[,
which means either pn q ≥ 0 or pn q ≤ 0 on the entire interval [a, b], such that
\[
\langle p_n, q \rangle_\rho = \int_a^b p_n(x)q(x)\rho(x)\,dx \neq 0, \tag{4.77}
\]
in contradiction to (4.76). This shows k = n, and, in particular, that all zeros of pn are
simple.
Definition 4.46. Given a real interval [a, b], a < b, a weight function ρ : [a, b] −→ R_0^+, and n ∈ N, let λ₁, . . . , λ_n ∈ [a, b] be the zeros of the nth orthogonal polynomial p_n with
respect to h·, ·iρ and let L1 , . . . , Ln be the corresponding Lagrange basis polynomials (cf.
(3.10)), now written as polynomial functions, i.e.
\[
L_j(x) := \prod_{\substack{i=1\\ i \neq j}}^{n} \frac{x - \lambda_i}{\lambda_j - \lambda_i} \quad \text{for each } j \in \{1, \dots, n\}, \tag{4.78}
\]
where
\[
\sigma_j := \langle L_j, 1 \rangle_\rho = \int_a^b L_j(x)\rho(x)\,dx \quad \text{for each } j \in \{1, \dots, n\}, \tag{4.79b}
\]
Theorem 4.47. Consider the situation of Def. 4.46; in particular let In be the Gaussian
quadrature rule defined according to (4.79).
(a) In is exact for each polynomial of degree at most 2n − 1. This can be stated in the
form
\[
\langle p, 1 \rangle_\rho = \sum_{j=1}^{n} \sigma_j\, p(\lambda_j) \quad \text{for each } p \in \operatorname{Pol}_{2n-1}(\mathbb{R}). \tag{4.80}
\]
Comparing with (4.9) and Rem. 4.9(b) shows that In has degree of accuracy precisely
2n − 1, i.e. the maximal possible degree of accuracy.
Proof. (a): Recall from Linear Algebra that the remainder theorem [Phi19b, Th. 7.9] holds in the ring of polynomials R[X] and, thus, in Pol(R) ≅ R[X] (using that the
field R is infinite). In particular, since pn has degree precisely n, given an arbitrary
p ∈ Pol2n−1 (R), there exist q, r ∈ Poln−1 (R) such that
p = qpn + r. (4.81)
Since r ∈ Pol_{n−1}(R), it must be the unique interpolating polynomial for the data
\[
\big( \lambda_1, r(\lambda_1) \big), \dots, \big( \lambda_n, r(\lambda_n) \big).
\]
Thus,
\[
r(x) \overset{(3.10)}{=} \sum_{j=1}^{n} r(\lambda_j)\, L_j(x) = \sum_{j=1}^{n} p(\lambda_j)\, L_j(x), \tag{4.83}
\]
proving (4.80).
(b): For each j ∈ {1, . . . , n}, we apply (4.80) to L_j² ∈ Pol_{2n−2}(R) to obtain
\[
0 < \|L_j\|_\rho^2 = \langle L_j^2, 1 \rangle_\rho \overset{(4.80)}{=} \sum_{k=1}^{n} \sigma_k\, L_j^2(\lambda_k) \overset{(3.10)}{=} \sigma_j.
\]
(c): This follows from (a): If f : [a, b] −→ R is given and p ∈ Poln−1 (R) is the
interpolating polynomial function for the data
\[
\big( \lambda_1, f(\lambda_1) \big), \dots, \big( \lambda_n, f(\lambda_n) \big),
\]
then
\[
I_n(f) = \sum_{j=1}^{n} \sigma_j f(\lambda_j) = \sum_{j=1}^{n} \sigma_j p(\lambda_j) \overset{(4.80)}{=} \langle p, 1 \rangle_1 = \int_a^b p(x)\,dx\,,
\]
Theorem 4.48. The converse of Th. 4.47(a) is also true: If, for n ∈ N, λ1 , . . . , λn ∈ R
and σ1 , . . . , σn ∈ R are such that (4.80) holds, then λ1 , . . . , λn must be the zeros of the
nth orthogonal polynomial pn and σ1 , . . . , σn must satisfy (4.79b).
To that end, let m ∈ {0, . . . , n − 1} and apply (4.80) to the polynomial p(x) := x^m q(x) (note p ∈ Pol_{n+m}(R) ⊆ Pol_{2n−1}(R)) to obtain
\[
\langle q, x^m \rangle_\rho = \langle x^m q, 1 \rangle_\rho = \sum_{j=1}^{n} \sigma_j\, \lambda_j^m\, q(\lambda_j) = 0,
\]
showing q ∈ Poln−1 (R)⊥ and q − pn ∈ Poln−1 (R)⊥ . On the other hand, both q and pn
have degree precisely n and the coefficient of xn is 1 in both cases. Thus, in q − pn ,
xn cancels, showing q − pn ∈ Poln−1 (R). Since Poln−1 (R) ∩ Poln−1 (R)⊥ = {0}, we have
established q = pn , thereby also identifying the λj as the zeros of pn as claimed. Finally,
applying (4.80) with p = L_j yields
\[
\langle L_j, 1 \rangle_\rho = \sum_{k=1}^{n} \sigma_k\, L_j(\lambda_k) = \sigma_j\,,
\]
Theorem 4.49. For each f ∈ C[a, b], the Gaussian quadrature rules In (f ) of Def. 4.46
converge:
\[
\lim_{n\to\infty} I_n(f) = \int_a^b f(x)\rho(x)\,dx \quad \text{for each } f \in C[a, b]. \tag{4.86}
\]
Proof. The convergence is a consequence of Polya’s Th. 4.27: Condition (i) of Th. 4.27
is satisfied as In (p) is exact for each p ∈ Pol2n−1 (R) and Condition (ii) of Th. 4.27 is
satisfied since the positivity of the weights yields
\[
\sum_{j=1}^{n} |\sigma_{j,n}| = \sum_{j=1}^{n} \sigma_{j,n} = I_n(1) = \int_a^b \rho(x)\,dx \in \mathbb{R}^+ \tag{4.87}
\]
for each n ∈ N.
\[
x \mapsto u = u(x) := \frac{b-a}{2}\, x + \frac{a+b}{2}\,,
\]
we can transform an integral over [a, b] into an integral over [−1, 1]:
\[
\int_a^b f(u)\,du = \frac{b-a}{2} \int_{-1}^{1} f\big(u(x)\big)\,dx = \frac{b-a}{2} \int_{-1}^{1} f\left( \frac{b-a}{2}\,x + \frac{a+b}{2} \right) dx\,. \tag{4.88}
\]
This is useful in the context of Gaussian quadrature rules, since the computation
of the zeros λj of the orthogonal polynomials is, in general, not easy. However, due
to (4.88), one does not have to recompute them for each interval [a, b], but one can
just compute them once for [−1, 1] and tabulate them. Of course, one still needs to
recompute for each n ∈ N and each new weight function ρ.
(b) If one is given the λj for a large n with respect to some strange ρ > 0, and one wants
to exploit the resulting Gaussian quadrature rule to just compute the nonweighted
integral of f , one can obviously always do that by applying the rule to g := f /ρ
instead of to f .
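For ρ ≡ 1 this workflow — tabulated nodes and weights on [−1, 1], transformed to [a, b] via (4.88) — can be sketched with NumPy's Gauss–Legendre tables; the integrand and interval are only illustrations:

```python
import numpy as np

def gauss_on_ab(f, a, b, n):
    # Tabulated lambda_j and sigma_j for rho ≡ 1 on [-1, 1] (Gauss-Legendre)
    x, w = np.polynomial.legendre.leggauss(n)
    # Transform to [a, b] via (4.88): u = (b-a)/2 * x + (a+b)/2
    u = 0.5 * (b - a) * x + 0.5 * (a + b)
    return 0.5 * (b - a) * np.dot(w, f(u))

# With n nodes the rule is exact up to degree 2n - 1 (Th. 4.47(a)):
# ∫_0^2 x^5 dx = 64/6 is integrated exactly with only n = 3 nodes
approx = gauss_on_ab(lambda x: x**5, 0.0, 2.0, 3)
```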
As usual, to obtain an error estimate, one needs to assume more regularity of the
integrated function f .
Theorem 4.51. Consider the situation of Def. 4.46; in particular let I_n be the Gaussian quadrature rule defined according to (4.79). Then, for each f ∈ C^{2n}[a, b], there exists τ ∈ [a, b] such that
\[
\int_a^b f(x)\rho(x)\,dx - I_n(f) = \frac{f^{(2n)}(\tau)}{(2n)!} \int_a^b p_n^2(x)\rho(x)\,dx\,, \tag{4.89}
\]
Proof. As before, let λ₁, . . . , λ_n be the zeros of the nth orthogonal polynomial p_n. Given f ∈ C^{2n}[a, b], let q ∈ Pol_{2n−1}(R) be the Hermite interpolating polynomial for the 2n points
\[
\lambda_1, \lambda_1, \lambda_2, \lambda_2, \dots, \lambda_n, \lambda_n.
\]
The identity (3.29) yields
\[
f(x) = q(x) + [f\,|\,x, \lambda_1, \lambda_1, \lambda_2, \lambda_2, \dots, \lambda_n, \lambda_n] \underbrace{\prod_{j=1}^{n} (x - \lambda_j)^2}_{=\, p_n^2(x)}\,. \tag{4.91}
\]
5 NUMERICAL SOLUTION OF LINEAR SYSTEMS 108
Now we multiply (4.91) by ρ, replace the divided difference by using the mean value
theorem for divided differences (3.30), and integrate over [a, b]:
\[
\int_a^b f(x)\rho(x)\,dx = \int_a^b q(x)\rho(x)\,dx + \frac{1}{(2n)!} \int_a^b f^{(2n)}\big(\xi(x)\big)\, p_n^2(x)\rho(x)\,dx\,. \tag{4.92}
\]
Since the map x 7→ f (2n) (ξ(x)) is continuous on [a, b] by Th. 3.25(c) and since p2n ρ ≥ 0,
we can apply the mean value theorem of integration (cf. [Phi17, (2.15f)]) to obtain
τ ∈ [a, b] satisfying
\[
\int_a^b f(x)\rho(x)\,dx = \int_a^b q(x)\rho(x)\,dx + \frac{f^{(2n)}(\tau)}{(2n)!} \int_a^b p_n^2(x)\rho(x)\,dx\,. \tag{4.93}
\]
Combining (4.94) and (4.93) proves (4.89). Finally, (4.90) is clear from the representation p_n(x) = ∏_{j=1}^{n} (x − λ_j) and λ_j ∈ [a, b] for each j ∈ {1, . . . , n}.
Remark 4.52. One can obviously also combine the strategies of Sec. 4.5 and Sec. 4.6.
The results are composite Gaussian quadrature rules.
We have already learned in Linear Algebra how to determine the set of solutions L(A|b)
for linear systems in echelon form (cf. [Phi19a, Def. 8.7]) using back substitution (cf.
[Phi19a, Sec. 8.3.1]), and we have learned how to transform a general linear system into
echelon form without changing the set of solutions using the Gaussian elimination algo-
rithm (cf. [Phi19a, Sec. 8.3.3]). Moreover, we have seen how to augment the Gaussian
elimination algorithm to obtain a so-called LU decomposition for a linear system, which
is more efficient if one needs to solve a linear system Ax = b, repeatedly, with fixed
A and varying b (cf. [Phi19a, Sec. 8.4], in particular, [Phi19a, Sec. 8.4.4]). We recall
from [Phi19a, Th. 8.34] that, for each A ∈ M(m, n, F ), there exists a permutation
matrix P ∈ M(m, F ) (which contains precisely one 1 in each row and one 1 in each
column and only zeros, otherwise), a unipotent lower triangular matrix L ∈ M(m, F )
and à ∈ M(m, n, F ) in echelon form such that
P A = LÃ. (5.3)
It is an LU decomposition in the strict sense (i.e. with Ã being upper triangular) if A is quadratic (i.e. m × m).
Remark 5.2. Let F be a field. We recall from Linear Algebra how to make use of an LU decomposition P A = LÃ to solve linear systems Ax = b with fixed matrix A ∈ M(m, n, F) and varying b ∈ F^m. One has
Gaussian elimination and LU decomposition have the advantage that, at least in prin-
ciple, they work over every field F and for every matrix A. The algorithm is still
rather simple and, to the best of my knowledge, still the most efficient one known in
the absence of any further knowledge regarding F and A. However, in the presence of
roundoff errors, Gaussian elimination can become numerically unstable and so-called
pivot strategies can help to improve the situation (cf. Sec. 5.2 below). Numerically, it
can also be an issue that the condition numbers of L and U in an LU decomposition
A = LU can be much larger than the condition number of A (cf. Rem. 5.15 below), in
which case, the use of the strategy of Rem. 5.2 is numerically ill-advised. If F = K,
then the so-called QR decomposition A = QR of Sec. 5.4 below is usually a numerically
better option. Moreover, if A has a special structure, such as being Hermitian or sparse (i.e. having many 0 entries), then it is usually much more efficient (and, thus, advisable)
to use algorithms designed to exploit this special structure. In this class, we will not
have time to consider algorithms designed for sparse matrices, but we will present the
Cholesky decomposition for Hermitian matrices in Sec. 5.3 below.
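The LU-based strategy recalled in Rem. 5.2 — factor once, then reuse the factors for each new b — can be sketched with SciPy; the matrix and right-hand side below are only illustrations (note that `scipy.linalg.lu` returns A = P L U, i.e. PᵀA = LU):

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
P, L, U = lu(A)                  # A = P @ L @ U, so P.T @ A = L @ U
b = np.array([1.0, 2.0, 5.0])
z = solve_triangular(L, P.T @ b, lower=True)    # forward substitution: L z = P^T b
x = solve_triangular(U, z, lower=False)         # back substitution:    U x = z
```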
Example 5.3 shows that it can be advantageous to switch rows even if one could proceed with Gaussian elimination without switching. Similarly, from a numerical point of view, not
every choice of the row number i, used for switching, is equally good. It is advisable
to use what is called a pivot strategy – a strategy to determine if, in the Gaussian
elimination’s kth step, one should first switch the current row with another row, and, if
so, which other row should be used.
A first strategy that can be used to avoid instabilities of the form that occurred in
Example 5.3 is the column maximum strategy (cf. Def. 5.5 below): In the kth step of
Gaussian elimination (i.e. when eliminating in the kth column), find the row number
i ≥ r(k) (all rows with numbers less than r(k) are already in echelon form and remain
unchanged, cf. [Phi19a, Sec. 8.3.3]), where the pivot element (i.e. the row’s first nonzero
element) has the maximal absolute value; if i 6= r(k), then switch rows i and r(k)
before proceeding. This was the strategy that resulted in the acceptable solution (5.7).
However, observe what happens in the following example:
The exact solution is the same as in Example 5.3, i.e. given by (5.5) (the first row of
Example 5.3 has merely been multiplied by 10000). Again, we examine what occurs if
we solve the system approximately using Gaussian elimination, rounding to 3 significant
digits:
As the pivot element of the first row is maximal, the column maximum strategy does not require any switching. After the first step we obtain
\[
\left( \begin{array}{cc|c} 1 & 10000 & 10000 \\ 0 & -9999 & -9998 \end{array} \right),
\]
The problem here is that the pivot element of the first row is small as compared with
the other elements of that row! This leads to the so-called relative column maximum
strategy (cf. Def. 5.5 below), where, instead of choosing the pivot element with the
largest absolute value, one chooses the pivot element such that its absolute value divided
by the sum of the absolute values of all elements in that row is maximized.
In the above case, the relative column maximum strategy would require first switching
rows:
\[
\left( \begin{array}{cc|c} 1 & 1 & 2 \\ 1 & 10000 & 10000 \end{array} \right).
\]
Now Gaussian elimination and rounding yields
\[
\left( \begin{array}{cc|c} 1 & 1 & 2 \\ 0 & 10000 & 10000 \end{array} \right),
\]
Both pivot strategies discussed above are summarized in the following definition:
(0′) Relative Column Maximum Strategy: For each α ∈ {r(k), . . . , m}, define
\[
S_\alpha^{(k)} := \sum_{\beta=k}^{n} \big| a_{\alpha\beta}^{(k)} \big| + \big| b_\alpha^{(k)} \big|. \tag{5.9a}
\]
If max{ S_α^{(k)} : α ∈ {r(k), . . . , m} } = 0, then (A^{(k)} | b^{(k)}) is already in echelon form and the algorithm is halted. Otherwise, determine i ∈ {r(k), . . . , m} such that
\[
\frac{\big| a_{ik}^{(k)} \big|}{S_i^{(k)}} = \max\left\{ \frac{\big| a_{\alpha k}^{(k)} \big|}{S_\alpha^{(k)}} : \alpha \in \{r(k), \dots, m\},\ S_\alpha^{(k)} \neq 0 \right\}. \tag{5.9b}
\]
If a_{ik}^{(k)} = 0, then nothing needs to be done in the kth step and one merely proceeds to the next column (if any) without changing the row by setting r(k + 1) := r(k) (cf. [Phi19a, Alg. 8.17(c)]). If a_{ik}^{(k)} ≠ 0 and i = r(k), then no row switching is needed before elimination in the kth column (cf. [Phi19a, Alg. 8.17(a)]). If a_{ik}^{(k)} ≠ 0 and i ≠ r(k), then one proceeds by switching rows i and r(k) before elimination in the kth column (cf. [Phi19a, Alg. 8.17(b)]).
Remark 5.6. (a) Clearly, [Phi19a, Th. 8.19] remains valid for the Gaussian elimina-
tion algorithm with column maximum strategy as well as with relative column
maximum strategy (i.e. the modified Gaussian elimination algorithm still trans-
forms the augmented matrix (A|b) of the linear system Ax = b into a matrix (Ã|b̃)
in echelon form, such that the linear system Ãx = b̃ has precisely the same set of
solutions as the original system).
(b) The relative maximum strategy is more stable in cases where the order of magnitude
of matrix elements varies strongly, but it is also less efficient.
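A minimal sketch of Gaussian elimination with the plain column maximum strategy of Def. 5.5, for square, invertible, real systems only; the test system with its tiny first pivot is merely an illustration in the spirit of Example 5.3:

```python
import numpy as np

def gauss_column_max(A, b):
    # Gaussian elimination with column maximum strategy (partial pivoting),
    # followed by back substitution; sketch for square, invertible, real A
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    for k in range(n - 1):
        i = k + np.argmax(np.abs(A[k:, k]))   # row with maximal pivot element
        if i != k:                            # switch rows i and k
            A[[k, i]] = A[[i, k]]
            b[[k, i]] = b[[i, k]]
        for r in range(k + 1, n):
            m = A[r, k] / A[k, k]
            A[r, k:] -= m * A[k, k:]
            b[r] -= m * b[k]
    x = np.zeros(n)                           # back substitution
    for k in range(n - 1, -1, -1):
        x[k] = (b[k] - A[k, k + 1:] @ x[k + 1:]) / A[k, k]
    return x

# Example: without the row switch, the tiny pivot 1e-4 would amplify roundoff
A0 = np.array([[1e-4, 1.0], [1.0, 1.0]])
b0 = np.array([1.0, 2.0])
x = gauss_column_max(A0, b0)
```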
Theorem 5.7. If A ∈ M(n, K) is Hermitian and positive definite (cf. Def. 2.21), then
there exists a unique lower triangular L ∈ M(n, K) with positive diagonal entries (i.e.
with ljj ∈ R+ for each j ∈ {1, . . . , n}) and
A = LL∗ . (5.10)
This decomposition is called the Cholesky decomposition of A.
Proof. The proof is conducted via induction on n. For n = 1, we have A = (a₁₁), a₁₁ > 0, and L = (l₁₁) is uniquely given by the square root of a₁₁: l₁₁ = √a₁₁ > 0. For n > 1, we write A in block form
\[
A = \begin{pmatrix} B & v \\ v^* & a_{nn} \end{pmatrix}, \tag{5.11}
\]
where B ∈ M(n − 1, K) is the matrix with bkl = akl for each k, l ∈ {1, . . . , n − 1} and
v ∈ Kn−1 is the column vector with vk = akn for each k ∈ {1, . . . , n − 1}. According
to [Phi19b, Prop. 11.6], B is positive definite (and, clearly, B is also Hermitian) and
ann > 0. According to the induction hypothesis, there exists a unique lower triangular
matrix L′ ∈ M(n − 1, K) with positive diagonal entries and B = L′ (L′ )∗ . Then L′ is
invertible and we can define
w := (L′ )−1 v ∈ Kn−1 . (5.12)
Moreover, let
α := ann − w∗ w. (5.13)
Then α ∈ R and there exists β ∈ C such that β² = α. We set
\[
L := \begin{pmatrix} L' & 0 \\ w^* & \beta \end{pmatrix}. \tag{5.14}
\]
Then L is lower triangular and
\[
M := \begin{pmatrix} (L')^* & w \\ 0 & \beta \end{pmatrix}
\]
is upper triangular with
\[
L M = \begin{pmatrix} L'(L')^* & L'w \\ w^*(L')^* & w^*w + \beta^2 \end{pmatrix} = \begin{pmatrix} B & v \\ v^* & a_{nn} \end{pmatrix} = A. \tag{5.15}
\]
It only remains to be shown that one can choose β ∈ R⁺ and that this choice determines β uniquely (note that β ∈ R then also implies M = L* and A = LM = LL*). Since L′ is a triangular matrix with positive entries on its diagonal, we have det L′ ∈ R⁺ and
\[
\det A \overset{\text{[Phi19b, Th. 11.3(a)]}}{=} \det L \cdot \det M = (\det L')^2\, \beta^2 > 0. \tag{5.16}
\]
Thus, α = β² > 0, and one can choose β = √α ∈ R⁺; moreover, √α is the only positive number with square α.
Remark 5.8. From the proof of Th. 5.7, it is not hard to extract a recursive algorithm
to actually compute the Cholesky decomposition of an Hermitian positive definite A ∈
M(n, K): One starts by setting
\[
l_{11} := \sqrt{a_{11}}\,, \tag{5.17a}
\]
where the proof of Th. 5.7 guarantees a11 > 0. In the recursive step, one already has
constructed the entries for the first r − 1 rows of L, 1 < r ≤ n (corresponding to the
entries of L′ in the proof of Th. 5.7), and one needs to construct lr1 , . . . , lrr . Using
(5.12), one has to solve L′ w = v for
\[
w = \begin{pmatrix} l_{r1} \\ \vdots \\ l_{r,r-1} \end{pmatrix}, \quad \text{where} \quad v = \begin{pmatrix} a_{1r} \\ \vdots \\ a_{r-1,r} \end{pmatrix}; \quad \text{then } w^* = (l_{r1} \ \dots \ l_{r,r-1}).
\]
As L′ has lower triangular form, we obtain lr1 , . . . , lr,r−1 successively, using forward
substitution, where
\[
l_{rj} := \frac{a_{jr} - \sum_{k=1}^{j-1} l_{jk} l_{rk}}{l_{jj}} \quad \text{for each } j \in \{1, \dots, r-1\} \tag{5.17b}
\]
and the proof of Th. 5.7 guarantees l_{jj} > 0. Finally, one uses (5.13) (with n replaced by r) to obtain
\[
l_{rr} = \sqrt{\alpha} = \sqrt{a_{rr} - w^* w} = \sqrt{a_{rr} - \sum_{k=1}^{r-1} |l_{rk}|^2}\,, \tag{5.17c}
\]
Example 5.9. We compute the Cholesky decomposition for the symmetric positive
definite matrix
\[
A = \begin{pmatrix} 1 & -1 & -2 \\ -1 & 2 & 3 \\ -2 & 3 & 9 \end{pmatrix}.
\]
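The recursion (5.17) translates directly into code; applied to a real symmetric positive definite matrix such as the one of Example 5.9 (real case, so no complex conjugation is needed), it produces the unique factor L with positive diagonal. A sketch:

```python
import numpy as np

def cholesky(A):
    # Cholesky decomposition A = L L^T via (5.17), for real symmetric
    # positive definite A (sketch; no positivity checks)
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    L[0, 0] = np.sqrt(A[0, 0])                               # (5.17a)
    for r in range(1, n):
        for j in range(r):                                   # forward substitution (5.17b)
            L[r, j] = (A[j, r] - L[j, :j] @ L[r, :j]) / L[j, j]
        L[r, r] = np.sqrt(A[r, r] - L[r, :r] @ L[r, :r])     # (5.17c)
    return L

L = cholesky(np.array([[1.0, -1.0, -2.0],
                       [-1.0, 2.0, 3.0],
                       [-2.0, 3.0, 9.0]]))
```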
We conclude the section by showing that a Cholesky decomposition cannot exist if A is not Hermitian and positive semidefinite:
Proposition 5.10. For n ∈ N, let A, L ∈ M(n, K), where A = LL∗ . Then A is
Hermitian and positive semidefinite.
5.4 QR Decomposition
5.4.1 Definition and Motivation
Definition 5.11. Let n ∈ N and let A ∈ M(n, K). Then A is called unitary if, and
only if, A is invertible and A−1 = A∗ . Moreover, for K = R, unitary matrices are also
called orthogonal.
(a) A decomposition
A = QÃ (5.18a)
is called a QR decomposition of A if, and only if, Q ∈ M(m, K) is unitary and
à ∈ M(m, n, K) is in echelon form. If A ∈ M(m, K) (i.e. quadratic), then à is
upper (i.e. right) triangular, i.e. (5.18a) is a QR decomposition in the strict sense,
which is emphasized by writing R := Ã:
A = QR. (5.18b)
A = QÃ (5.19a)
A = QR. (5.19b)
Note: In the German literature, the economy size QR decomposition is usually called
just QR decomposition, while our QR decomposition is referred to as the extended QR
decomposition.
—
While, even for an invertible A ∈ M(n, K), the QR decomposition is not unique, the
nonuniqueness can be precisely described:
Q 1 R1 = Q 2 R2 . (5.20)
Q2 = Q1 S and R2 = S ∗ R1 . (5.21)
Proposition 5.14. Let n ∈ N and A, Q ∈ M(n, K), A being invertible and Q being
unitary. With respect to the 2-norm k · k2 on Kn , the following holds (recall Def. 2.12
of a natural matrix norm and Def. and Rem. 2.36 for the condition of a matrix):
Proof. Exercise.
Remark 5.15. Suppose the goal is to solve linear systems Ax = b with fixed A ∈ M(m, n, K) and varying b ∈ K^m. In Rem. 5.2, we described how to make use of a given
LU decomposition P A = LÃ in this situation. Even though the strategy of Rem. 5.2
works fine in many situations, it can fail numerically due to the following stability issue:
While the condition numbers, for invertible A, always satisfy
due to (2.23) and (2.14), the right-hand side of (5.22) can be much larger than the left-
hand side. In that case, it is numerically ill-advised to solve Lz = P b and Ãx = z instead
of solving Ax = b. An analogous procedure can be used given a QR decomposition
A = QÃ. Solving Ax = QÃx = b is equivalent to solving Qz = b for z and then
solving Ãx = z for x. Due to Q−1 = Q∗ , solving for z = Q∗ b is particularly simple.
Moreover, one does not encounter the possibility of an increased condition as for the
LU decomposition: Due to Prop. 5.14, the condition of using the QR decomposition is
exactly the same as the condition of A. Thus, if available, the QR decomposition should
always be used!
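This QR-based solving strategy — z = Q*b by a matrix–vector product, then back substitution on Ãx = z — can be sketched as follows; the factorization itself is delegated to NumPy here, and the system is only an illustration:

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
Q, R = np.linalg.qr(A)            # A = Q R, Q unitary (here real, so Q^* = Q^T)
z = Q.T @ b                       # z = Q^* b, since Q^{-1} = Q^*
x = solve_triangular(R, z, lower=False)   # back substitution on R x = z
```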
(a) There exists a QR decomposition of A as well as, for r > 0, an economy size
QR decomposition of A. More precisely, there exist a unitary Q ∈ M(m, K), an
à ∈ M(m, n, K) in echelon form, a Qe ∈ M(m, r, K) (for r > 0) such that its
columns form an orthonormal system with respect to the standard scalar product
h·, ·i on Km , and (for r > 0) an Ãe ∈ M(r, n, K) in echelon form such that
A = QÃ, (5.24a)
A = Qe Ãe for r > 0. (5.24b)
(i) For r > 0, the columns q1 , . . . , qr ∈ Km of Qe can be computed from the
columns x1 , . . . , xn ∈ Km of A, obtaining v1 , . . . , vn ∈ Km from
\[
v_1 := x_1, \qquad \forall_{k \ge 2} \quad v_k := x_k - \sum_{\substack{i=1,\\ v_i \neq 0}}^{k-1} \frac{\langle x_k, v_i \rangle}{\|v_i\|_2^2}\, v_i\,. \tag{5.25}
\]
letting, for each k ∈ {1, . . . , r}, ṽk := vµ(k) (i.e. the ṽ1 , . . . , ṽr are the v1 , . . . , vn
with each vk = 0 omitted) and, finally, letting qk := ṽk /kṽk k2 for each k =
1, . . . , r.
(ii) For r > 0, the entries rik of Ãe ∈ M(r, n, K) can be defined by
\[
r_{ik} := \begin{cases}
\|v_1\|_2 & \text{for } i = k = 1,\ \rho(1) = 1, \\
\langle x_k, q_i \rangle & \text{for } k \in \{2, \dots, n\},\ i < \rho(k), \\
\langle x_k, q_i \rangle & \text{for } k \in \{2, \dots, n\},\ i = \rho(k) = \rho(k-1), \\
\|\tilde v_{\rho(k)}\|_2 & \text{for } k \in \{2, \dots, n\},\ i = \rho(k) > \rho(k-1), \\
0 & \text{otherwise}.
\end{cases}
\]
(iii) For the columns q1 , . . . , qm of Q, one can complete the orthonormal system
q1 , . . . , qr of (i) by qr+1 , . . . , qm ∈ Km to an orthonormal basis of Km .
(iv) For Ã, one can use Ãe as in (ii), completed by m − r zero rows at the bottom.
(b) For r > 0, there is a unique economy size QR decomposition of A such that all
pivot elements of Ãe are positive. Similarly, if one requires all pivot elements of Ã
to be positive, then the QR decomposition of A is unique, except for the last columns
qr+1 , . . . , qm of Q.
Proof. (a): We start by showing that, for r > 0, if Qe and Ãe are defined by (i) and
(ii), respectively, then they provide the desired economy size QR decomposition of A.
From [Phi19b, Th. 10.13], we know that vk = 0 if, and only if, xk ∈ span{x1 , . . . , xk−1 }.
Thus, in (i), we indeed obtain r linearly independent ṽk and Qe is in M(m, r, K) with
orthonormal columns. To see that Ãe according to (ii) is in echelon form, consider its
ith row, i ∈ {1, . . . , r}. If k < µ(i), then ρ(k) < i, i.e. rik = 0. Thus, kṽρ(k) k2 with
k = µ(i) is the pivot element of the ith row, and as µ is strictly increasing, Ãe is in
echelon form. To show A = Q_e Ã_e, note that, for each k ∈ {1, . . . , n},
\[
x_k = \begin{cases}
0 & \text{for } k = 1,\ \rho(1) = 0, \\
\|\tilde v_{\rho(1)}\|_2\, q_{\rho(1)} = \|v_1\|_2\, q_1 & \text{for } k = 1,\ \rho(1) = 1, \\
\sum_{i=1}^{\rho(k)} \langle x_k, q_i \rangle\, q_i & \text{for } k > 1,\ \rho(k) = \rho(k-1), \\
\sum_{i=1}^{\rho(k)-1} \langle x_k, q_i \rangle\, q_i + \|\tilde v_{\rho(k)}\|_2\, q_{\rho(k)} & \text{for } k > 1,\ \rho(k) > \rho(k-1),
\end{cases}
\;=\; \sum_{i=1}^{r} r_{ik}\, q_i\,,
\]
A = QR = Q̃R̃, (5.26)
and show, by induction on α ∈ {1, . . . , r + 1}, that qα = q̃α for α ≤ r and rik = r̃ik for
each i ∈ {1, . . . , r}, k ∈ {1, . . . , min{µ(α), n}}, µ(r + 1) := n + 1, as well as
(the induction goes to r + 1, as it might occur that µ(r) < n). Consider α = 1. For
1 ≤ k < µ(1), we have xk = 0, and the linear independence of the qi (respectively, q̃i )
yields rik = r̃ik = 0 for each i ∈ {1, . . . , r}, 1 ≤ k < µ(1) (if any). Next,
since the echelon forms of R and R̃ imply r_{i,µ(1)} = r̃_{i,µ(1)} = 0 for each i > 1. Moreover,
\[
\|q_1\|_2 = \|\tilde q_1\|_2 = 1 \overset{(5.29)}{\Longrightarrow} |r_{1,\mu(1)}| = |\tilde r_{1,\mu(1)}|
\overset{r_{1,\mu(1)},\, \tilde r_{1,\mu(1)} \in \mathbb{R}^+}{\Longrightarrow} r_{1,\mu(1)} = \tilde r_{1,\mu(1)} \ \wedge\ q_1 = \tilde q_1.
\]
Note that (5.29) also implies span{x1 , . . . , xµ(1) } = span{xµ(1) } = span{q1 }. Now let
α > 1 and, by induction, assume qβ = q̃β for 1 ≤ β < α and rik = r̃ik for each
i ∈ {1, . . . , r}, k ∈ {1, . . . , µ(α − 1)}, as well as
We have to show rik = r̃ik for each i ∈ {1, . . . , r}, k ∈ {µ(α − 1) + 1, . . . , min{µ(α), n}},
and qα = q̃α for α ≤ r, as well as (5.28). For each k ∈ {µ(α − 1) + 1, . . . , µ(α) − 1},
from (5.30) with β = α − 1, we know
such that (5.27) implies rik = r̃ik = 0 for each i > α − 1. Then, for 1 ≤ i ≤ α − 1,
multiplying (5.27) by qi and using the orthonormality of the qi provides rik = r̃ik =
hxk , qi i. For α = r + 1, we are already done. Otherwise, a similar argument also works
for k = µ(α) ≤ n: As we had just seen, rαl = r̃αl = 0 for each 1 ≤ l < k. So the echelon
forms of R and R̃ imply rik = r̃ik = 0 for each i > α. Then, as before, rik = r̃ik = hxk , qi i
for each i < α by multiplying (5.27) by qi , and, finally,
\[
0 \neq x_k - \sum_{i=1}^{\alpha-1} r_{ik} q_i = x_{\mu(\alpha)} - \sum_{i=1}^{\alpha-1} r_{ik} q_i = r_{\alpha k}\, q_\alpha = \tilde r_{\alpha k}\, \tilde q_\alpha\,. \tag{5.32}
\]
Analogous to the case α = 1, the positivity of rαk and r̃αk implies rαk = r̃αk and qα = q̃α .
To conclude the induction, we note that combining (5.32) with (5.31) verifies (5.28).
Applying the Gram–Schmidt orthogonalization (5.25) to the columns x₁, . . . , x₄ of
\[
A = \begin{pmatrix} 1 & 4 & 2 & 3 \\ 1 & 2 & 1 & 0 \\ 2 & 6 & 3 & 1 \\ 0 & 0 & 1 & 4 \end{pmatrix}
\]
yields
\[
v_1 = x_1 = \begin{pmatrix} 1 \\ 1 \\ 2 \\ 0 \end{pmatrix}, \qquad
v_2 = x_2 - \frac{\langle x_2, v_1 \rangle}{\|v_1\|_2^2}\, v_1 = x_2 - \frac{18}{6}\, v_1 = \begin{pmatrix} 1 \\ -1 \\ 0 \\ 0 \end{pmatrix},
\]
\[
v_3 = x_3 - \frac{\langle x_3, v_1 \rangle}{\|v_1\|_2^2}\, v_1 - \frac{\langle x_3, v_2 \rangle}{\|v_2\|_2^2}\, v_2 = x_3 - \frac{9}{6}\, v_1 - \frac{1}{2}\, v_2 = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix},
\]
\[
v_4 = x_4 - \frac{\langle x_4, v_1 \rangle}{\|v_1\|_2^2}\, v_1 - \frac{\langle x_4, v_2 \rangle}{\|v_2\|_2^2}\, v_2 - \frac{\langle x_4, v_3 \rangle}{\|v_3\|_2^2}\, v_3 = x_4 - \frac{5}{6}\, v_1 - \frac{3}{2}\, v_2 - 4\, v_3 = \begin{pmatrix} 2/3 \\ 2/3 \\ -2/3 \\ 0 \end{pmatrix}.
\]
Thus,
\[
q_1 = \frac{v_1}{\|v_1\|_2} = \frac{v_1}{\sqrt{6}}, \quad
q_2 = \frac{v_2}{\|v_2\|_2} = \frac{v_2}{\sqrt{2}}, \quad
q_3 = \frac{v_3}{\|v_3\|_2} = v_3, \quad
q_4 = \frac{v_4}{\|v_4\|_2} = \frac{v_4}{2/\sqrt{3}}
\]
and
\[
Q = \begin{pmatrix}
1/\sqrt{6} & 1/\sqrt{2} & 0 & 1/\sqrt{3} \\
1/\sqrt{6} & -1/\sqrt{2} & 0 & 1/\sqrt{3} \\
2/\sqrt{6} & 0 & 0 & -1/\sqrt{3} \\
0 & 0 & 1 & 0
\end{pmatrix}.
\]
Next, we obtain
\[
r_{11} = \|v_1\|_2 = \sqrt{6}, \quad r_{12} = \langle x_2, q_1 \rangle = 3\sqrt{6}, \quad r_{13} = \langle x_3, q_1 \rangle = 9/\sqrt{6}, \quad r_{14} = \langle x_4, q_1 \rangle = 5/\sqrt{6},
\]
\[
r_{21} = 0, \quad r_{22} = \|v_2\|_2 = \sqrt{2}, \quad r_{23} = \langle x_3, q_2 \rangle = 1/\sqrt{2}, \quad r_{24} = \langle x_4, q_2 \rangle = 3/\sqrt{2},
\]
\[
r_{31} = 0, \quad r_{32} = 0, \quad r_{33} = \|v_3\|_2 = 1, \quad r_{34} = \langle x_4, q_3 \rangle = 4,
\]
\[
r_{41} = 0, \quad r_{42} = 0, \quad r_{43} = 0, \quad r_{44} = \|v_4\|_2 = 2/\sqrt{3},
\]
that means
\[
R = \tilde A = \begin{pmatrix}
\sqrt{6} & 3\sqrt{6} & 9/\sqrt{6} & 5/\sqrt{6} \\
0 & \sqrt{2} & 1/\sqrt{2} & 3/\sqrt{2} \\
0 & 0 & 1 & 4 \\
0 & 0 & 0 & 2/\sqrt{3}
\end{pmatrix}.
\]
One verifies A = QR.
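For full-column-rank real matrices, the computation above can be sketched in code via classical Gram–Schmidt; the 4 × 4 matrix is chosen to be consistent with the column computations of the example, and for K = C the inner products would need complex conjugation:

```python
import numpy as np

def gram_schmidt_qr(A):
    # QR decomposition of a full-column-rank real A via (5.25):
    # q_k = v_k / ||v_k||_2, with r_ik = <x_k, q_i> and r_kk = ||v_k||_2
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        v = A[:, k].copy()
        for i in range(k):
            R[i, k] = Q[:, i] @ A[:, k]
            v -= R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(v)
        Q[:, k] = v / R[k, k]
    return Q, R

A = np.array([[1.0, 4.0, 2.0, 3.0],
              [1.0, 2.0, 1.0, 0.0],
              [2.0, 6.0, 3.0, 1.0],
              [0.0, 0.0, 1.0, 4.0]])
Q, R = gram_schmidt_qr(A)
```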
Lemma 5.20. Let H ∈ M(n, K), n ∈ N, be a Householder matrix. Then the following
holds:
(a) H is Hermitian: H ∗ = H.
Proof. As H is a Householder matrix, there exists u ∈ K^n such that ‖u‖₂ = 1 and (5.33) is satisfied. Then
\[
H^* = \big( \operatorname{Id} - 2\, u u^* \big)^* = \operatorname{Id} - 2\, (u^*)^* u^* = H
\]
The key idea of the Householder method for the computation of a QR decomposition
is to find a Householder reflection that transforms a given vector into the span of the
first standard unit vector e1 ∈ Kk . This task then has to be performed for varying
dimensions k. That such Householder reflections exist is the content of the following lemma²:
Lemma 5.22. Let k ∈ N. We consider the elements of K^k as column vectors. Let e₁ ∈ K^k be the first standard unit vector of K^k. If x ∈ K^k \ {0}, then
\[
u := u(x) := \frac{v(x)}{\|v(x)\|_2}, \qquad
v := v(x) := \begin{cases}
\dfrac{|x_1|\, x + x_1 \|x\|_2\, e_1}{|x_1|\, \|x\|_2} & \text{for } x_1 \neq 0, \\[2ex]
\dfrac{x}{\|x\|_2} + e_1 & \text{for } x_1 = 0,
\end{cases} \tag{5.34a}
\]
satisfies
\[
\|u\|_2 = 1 \tag{5.34b}
\]
and
\[
(\operatorname{Id} - 2\, u u^*)\, x = \xi\, e_1, \qquad
\xi = \begin{cases}
-\dfrac{x_1 \|x\|_2}{|x_1|} & \text{for } x_1 \neq 0, \\[1.5ex]
-\|x\|_2 & \text{for } x_1 = 0.
\end{cases} \tag{5.34c}
\]
²If one is only interested in the case $\mathbb{K} = \mathbb{R}$, then one can use the simpler version provided in Appendix Sec. H.2.
i.e. $v \neq 0$ and $u$ is well-defined in this case as well. Moreover, (5.34b) is immediate from the definition of $u$. To verify (5.34c), note
$$v^* v = \begin{cases} 1 + \dfrac{2|x_1|}{\|x\|_2} + 1 = 2 + \dfrac{2|x_1|}{\|x\|_2} & \text{for } x_1 \neq 0,\\[2mm] 1 + 0 + 1 = 2 + \dfrac{2|x_1|}{\|x\|_2} & \text{for } x_1 = 0, \end{cases}$$
$$v^* x = \begin{cases} \|x\|_2 + |x_1| & \text{for } x_1 \neq 0,\\ \|x\|_2 + 0 = \|x\|_2 + |x_1| & \text{for } x_1 = 0. \end{cases}$$
Hence, in both cases, $2\,\frac{v^* x}{v^* v} = \|x\|_2$ and
$$(\operatorname{Id} - 2\,uu^*)\,x = x - 2\,\frac{v^* x}{v^* v}\,v = x - \|x\|_2\,v = \xi\, e_1,$$
once again proving (5.34c) and concluding the proof of the lemma.
We note that the numerator in (5.34a) is constructed in such a way that there is no
subtractive cancellation of digits, independent of the signs of Re x1 and Im x1 .
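Lemma 5.22 can be checked directly in pure Python (the helper names are ours): the sketch below builds $u(x)$ according to (5.34a) and verifies that the reflection maps $x$ into $\operatorname{span}\{e_1\}$ with the value $\xi$ of (5.34c).

```python
def householder_u(x):
    """u(x) per (5.34a): v = x/||x||_2 + (x1/|x1|) e1 if x1 != 0, else x/||x||_2 + e1."""
    x = [complex(c) for c in x]
    nx = sum(abs(c) ** 2 for c in x) ** 0.5
    v = [c / nx for c in x]
    v[0] += x[0] / abs(x[0]) if x[0] != 0 else 1.0
    nv = sum(abs(c) ** 2 for c in v) ** 0.5
    return [c / nv for c in v]

def reflect(u, x):
    """(Id - 2 u u^*) x = x - 2 (u^* x) u."""
    c = 2 * sum(uc.conjugate() * xc for uc, xc in zip(u, x))
    return [xc - c * uc for xc, uc in zip(x, u)]

x = [1.0, 1.0, 2.0, 0.0]
y = reflect(householder_u(x), x)   # expect y = -sqrt(6) e1, since xi = -||x||_2
```

The second case of (5.34a) ($x_1 = 0$) is exercised, e.g., by $x = (0, 3, 4)^{\mathrm t}$, which is mapped to $-5\,e_1$.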
(b) If $a^{(k)}_{r(k),k} \neq 0$, but $a^{(k)}_{ik} = 0$ for each $i \in \{r(k)+1, \dots, m\}$,
where
$$H^{(k)} := \begin{pmatrix} \operatorname{Id}_{r(k)-1} & 0\\ 0 & \operatorname{Id}_{m-r(k)+1} - 2\,u^{(k)}(u^{(k)})^* \end{pmatrix}, \qquad u^{(k)} := u(x^{(k)}), \qquad (5.36)$$
Proof. First note that the Householder method terminates after at most n steps. If
1 ≤ N ≤ n + 1 is the maximal k occurring during the Householder method, and one
lets H (k) := Idm for each k such that Def. 5.23(a) or Def. 5.23(b) was used, then
Since, according to Lem. 5.20(b), (H (k) )2 = Idm for each k, QÃ = A is an immediate
consequence of (5.37). Moreover, it follows from Lem. 5.20(c) that the H (k) are all
unitary, implying Q to be unitary as well. To prove that à is in echelon form, we show
by induction over $k$ that the first $k - 1$ columns of $A^{(k)}$ are in echelon form as well as the first $r(k)$ rows of $A^{(k)}$, with $a^{(k)}_{ij} = 0$ for each $(i, j) \in \{r(k), \dots, m\} \times \{1, \dots, k-1\}$.
For k = 1, the assertion is trivially true. By induction, we assume the assertion for
k and prove it for k + 1 (for k = 1, we are not assuming anything). Observe that
multiplication of $A^{(k)}$ with $H^{(k)}$ as defined in (5.36) does not change the first $r(k) - 1$ rows of $A^{(k)}$. It does not change the first $k - 1$ columns of $A^{(k)}$, either, since $a^{(k)}_{ij} = 0$ for each $(i, j) \in \{r(k), \dots, m\} \times \{1, \dots, k-1\}$. Moreover, if we are in the case of Def.
5.23(c), then the choice of u(k) in (5.36) and (5.34c) of Lem. 5.22 guarantee
$$\begin{pmatrix} a^{(k+1)}_{r(k),k}\\ a^{(k+1)}_{r(k)+1,k}\\ \vdots\\ a^{(k+1)}_{m,k} \end{pmatrix} \in \operatorname{span}\left\{ \begin{pmatrix} 1\\ 0\\ \vdots\\ 0 \end{pmatrix} \right\}. \qquad (5.38)$$
In each case, after the application of (a), (b), or (c) of Def. 5.23, $a^{(k+1)}_{ij} = 0$ for each
(i, j) ∈ {r(k + 1), . . . , m} × {1, . . . , k}. We have to show that the first r(k + 1) rows of
$A^{(k+1)}$ are in echelon form. For Def. 5.23(a), we have $r(k+1) = r(k)$ and there is nothing to prove. For Def. 5.23(b),(c), we know $a^{(k+1)}_{r(k+1),j} = 0$ for each $j \in \{1, \dots, k\}$, while $a^{(k+1)}_{r(k),k} \neq 0$, showing that the first $r(k+1)$ rows of $A^{(k+1)}$ are in echelon form in each
case. So we know that the first r(k + 1) rows of A(k+1) are in echelon form, and all
elements in the first k columns of A(k+1) below row r(k + 1) are zero, showing that the
first k columns of A(k+1) are also in echelon form. As, at the end of the Householder
method, r(k) = m or k = n + 1, we have shown that à is in echelon form.
Example 5.25. We apply the Householder method of Def. 5.23 to the matrix A of the
previous Ex. 5.18, i.e. to
$$A := \begin{pmatrix} 1 & 4 & 2 & 3\\ 1 & 2 & 1 & 0\\ 2 & 6 & 3 & 1\\ 0 & 0 & 1 & 4 \end{pmatrix}. \qquad (5.39a)$$
We start with
A(1) := A, Q(1) := Id, r(1) := 1. (5.39b)
Since $x^{(1)} := (1, 1, 2, 0)^{\mathrm t} \notin \operatorname{span}\{(1, 0, 0, 0)^{\mathrm t}\}$, we compute
$$u^{(1)} := \frac{x^{(1)} + \sigma(x^{(1)})\, e_1}{\|x^{(1)} + \sigma(x^{(1)})\, e_1\|_2} = \frac{1}{\|x^{(1)} + \sqrt6\, e_1\|_2} \begin{pmatrix} 1 + \sqrt6\\ 1\\ 2\\ 0 \end{pmatrix}, \qquad (5.39c)$$
$$H^{(1)} := \operatorname{Id} - 2\,u^{(1)}(u^{(1)})^{\mathrm t} = \operatorname{Id} - \frac{1}{6+\sqrt6} \begin{pmatrix} 7+2\sqrt6 & 1+\sqrt6 & 2+2\sqrt6 & 0\\ 1+\sqrt6 & 1 & 2 & 0\\ 2+2\sqrt6 & 2 & 4 & 0\\ 0 & 0 & 0 & 0 \end{pmatrix}$$
$$\approx \begin{pmatrix} -0.4082 & -0.4082 & -0.8165 & 0\\ -0.4082 & 0.8816 & -0.2367 & 0\\ -0.8165 & -0.2367 & 0.5266 & 0\\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad (5.39d)$$
$$A^{(2)} := H^{(1)} A^{(1)} \approx \begin{pmatrix} -2.4495 & -7.3485 & -3.6742 & -2.0412\\ 0 & -1.2899 & -0.6449 & -1.4614\\ 0 & -0.5798 & -0.2899 & -1.9229\\ 0 & 0 & 1 & 4 \end{pmatrix}, \qquad (5.39e)$$
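The first Householder step (5.39c)-(5.39e) can be reproduced numerically; a small pure-Python sketch for the real case (helper name ours):

```python
def householder_step(a):
    """One Householder step: H a with H = Id - 2 u u^t, u built from column 0
    of a via v = x + sigma(x) e1 (cf. (5.39c))."""
    m, n = len(a), len(a[0])
    x = [row[0] for row in a]
    sigma = (1.0 if x[0] >= 0 else -1.0) * sum(c * c for c in x) ** 0.5
    v = list(x)
    v[0] += sigma                                    # v = x + sigma(x) e1
    nv = sum(c * c for c in v) ** 0.5
    u = [c / nv for c in v]
    w = [sum(u[i] * a[i][j] for i in range(m)) for j in range(n)]   # w = u^t a
    return [[a[i][j] - 2 * u[i] * w[j] for j in range(n)] for i in range(m)]

A = [[1, 4, 2, 3], [1, 2, 1, 0], [2, 6, 3, 1], [0, 0, 1, 4]]
A2 = householder_step(A)   # matches (5.39e) up to rounding
```

In particular, the first column of the result is $-\sqrt6\,e_1 \approx (-2.4495, 0, 0, 0)^{\mathrm t}$, and the last row of $A$ is untouched, since the fourth component of $u^{(1)}$ vanishes.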
In the following, we will study methods for finding zeros of functions in special situations.
While, if formulated in the generality of Def. 6.1, the setting of fixed point and zero
problems is different, in many situations, a fixed point problem can be transformed
into an equivalent zero problem and vice versa (see Rem. 6.2 below). In consequence,
methods that are formulated for fixed points, such as the Banach fixed point method, can
often also be used to find zeros, while methods formulated for zeros, such as Newton’s
method, can often also be used to find fixed points.
6 ITERATIVE METHODS, SOLUTION OF NONLINEAR EQUATIONS 130
(b) If f : A −→ Y , where A and Y are both subsets of some common vector space Z,
then x ∈ A is a zero of f if, and only if, f (x) + x = x, i.e. if, and only if, x is a
fixed point of the function ϕ : A −→ Z, ϕ(x) := f (x) + x.
—
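As a small illustration of (b): the zero problem for $f(x) = \cos x - x$ (cf. Ex. 6.4(a) below) turns into the fixed point problem for $\varphi(x) = f(x) + x = \cos x$, which can be attacked by plain fixed point iteration. A sketch in Python (function name ours):

```python
import math

# Zero problem f(x) = cos(x) - x = 0, recast as the fixed point problem
# phi(x) = f(x) + x = cos(x), per part (b) above.
def fixed_point_iteration(phi, x0, tol=1e-14, max_iter=10000):
    x = x0
    for _ in range(max_iter):
        x_next = phi(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

x_star = fixed_point_iteration(math.cos, 0.5)
```

Since $|\varphi'| = |\sin| < 1$ near the fixed point, the iteration is a contraction there and converges (here to $x_* \approx 0.7390851$).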
To ensure that we can answer “yes” to all of the above questions, we need suitable
hypotheses. The following theorem provides examples for sufficient conditions:
For (a) and (b), x0 , x1 are on different sides of x̂ for x0 < x̂ and (xn )n∈N0 is decreasing
for x0 > x̂; for (c) and (d), x0 , x1 are on different sides of x̂ for x0 > x̂ and (xn )n∈N0 is
increasing for x0 < x̂.
Proof. In each case, f is strictly monotone and, hence, injective. Thus, x̂ must be the
only zero of f . For each n ∈ N0 , if f (xn ) = 0, then (6.1) implies xn+1 = xn , showing
(xn )n∈N0 to be constant for x0 = x̂.
$$\eta : I \longrightarrow \mathbb{R}, \qquad \eta(x) := x - \frac{f(x)}{f'(x)},$$
where we used f (x1 ) ≥ 0 (as f is increasing) and f ′ (x1 ) ≤ f ′ (x2 ) (as f is convex,
[Phi16a, (9.37)]). N
If $x_n \in I$ with $x_n > \hat x$, then $f(x_n) > 0$ (as $f$ is strictly increasing) and $f'(x_n) > 0$ (as $f$ is strictly increasing and $f'(x_n) \neq 0$), implying
$$\hat x = \eta(\hat x) \overset{\text{Cl.\,1}}{\leq} \eta(x_n) = x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} < x_n \quad\Rightarrow\quad x_{n+1} \in I. \qquad (6.4)$$
According to (6.4), if x0 ≥ x̂, then (xn )n∈N0 is a decreasing sequence in [x̂, x0 ], converging
to some x∗ ∈ [x̂, x0 ] (this is the case, where x0 , x1 are on the same side of x̂). If x0 < x̂
and x1 ∈ I, then (6.3) shows x1 ≥ x̂ and, then, (6.4) yields (xn )n∈N is a decreasing
sequence in [x̂, x1 ], converging to some x∗ ∈ [x̂, x1 ] (this is the case, where x0 , x1 are on
different sides of x̂). In each case, we then obtain
$$x_* = \lim_{n\to\infty} x_{n+1} = \lim_{n\to\infty} \left( x_n - \frac{f(x_n)}{f'(x_n)} \right) = x_* - \frac{f(x_*)}{f'(x_*)},$$
proving f (x∗ ) = 0, i.e. x∗ = x̂ (note that we did not assume the continuity of f ′ –
however the boundedness of f ′ on [x̂, x1 ], as given by its monotonicity, already suffices
to imply f (x∗ ) = 0; but the above equation does actually hold as stated, since the
monotonicity of f ′ , together with the intermediate value theorem for derivatives (cf.
[Phi16a, Th. 9.24]), does also imply the continuity of f ′ ).
If f is strictly decreasing and concave, then −f is strictly increasing and convex. Since
the zeros of f and −f are the same, and (6.1) does not change if f is replaced by −f ,
(b) follows from (a).
If f is strictly increasing and concave, then g : (−I) −→ R, g(x) := f (−x), −I :=
{−x : x ∈ I}, is strictly decreasing and concave: For each x1 , x2 ∈ (−I),
$$x_1 < x_2 \ \Rightarrow\ -x_2 < -x_1 \ \Rightarrow\ \big( g(x_1) = f(-x_1) > f(-x_2) = g(x_2) \ \wedge\ g'(x_1) = -f'(-x_1) > -f'(-x_2) = g'(x_2) \big).$$
As (6.1) implies
$$\forall_{n\in\mathbb{N}_0}\qquad -x_{n+1} = -x_n - \frac{f(x_n)}{-f'(x_n)} = -x_n - \frac{g(-x_n)}{g'(-x_n)},$$
(b) applies to g, yielding that −xn converges to −x̂ in the stated manner, i.e. xn con-
verges to x̂ in the manner stated in (c).
If f is strictly decreasing and convex, then −f is strictly increasing and concave, i.e. (d)
follows from (c) in the same way that (b) follows from (a).
While Th. 6.3 shows the convergence of Newton’s method for each x0 in the domain I,
it does not provide error estimates. We will prove error estimates in Th. 6.5 below in
the context of the multi-dimensional Newton’s method, where the error estimates are
only shown in a sufficiently small neighborhood of the zero x∗ . Results of this type are
quite typical in the context of Newton’s method, albeit somewhat unsatisfactory, as the
location of the zero is usually unknown in applications (also cf. Ex. 6.6 below).
Example 6.4. (a) We show that cos x = x has a unique solution in [0, π2 ] and that the
solution can be approximated via Newton’s method: To this end, we consider
f : [0, π/2] −→ R, f (x) := cos x − x.
Then f is differentiable with f ′ (x) = − sin x − 1 < 0 on [0, π/2[ and f ′′ (x) =
− cos x < 0 on [0, π/2[, showing f to be strictly decreasing and concave. Thus, Th.
6.3(b) applies for each x0 ∈ [0, π/2[ (see Ex. 6.6(a) below for error estimates).
(b) Theorem 6.3(a) applies to the function
$$f : \mathbb{R}^+ \longrightarrow \mathbb{R}, \qquad f(x) := x^2 - a, \qquad a \in \mathbb{R}^+,$$
of Ex. 1.3, which is, clearly, strictly increasing and convex.
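Both parts of this example can be run with a direct implementation of the iteration (6.1); a sketch in pure Python (function name ours):

```python
import math

def newton(f, df, x0, steps=60):
    """Newton iteration (6.1): x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / df(x)
    return x

# (a): cos x = x on [0, pi/2[ (f strictly decreasing and concave there)
x_a = newton(lambda x: math.cos(x) - x, lambda x: -math.sin(x) - 1, 0.0)
# (b): f(x) = x^2 - a with a = 2 (f strictly increasing and convex on R+)
x_b = newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
```

In (a) the derivative satisfies $-2 \le f' \le -1$, so no division by zero can occur; in (b) one recovers Heron's method for $\sqrt a$.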
(indeed, the same still works if Rm is replaced by some general normed vector space,
but then the notion of differentiability needs to be generalized to such spaces, which is
beyond the scope of the present class).
In practice, in each step of Newton’s method (6.5), one will determine xn+1 as the
solution to the linear system
Not surprisingly, the issues formulated as (a), (b), (c) after (6.1) (i.e. invertibility of
Df (xn ), xn+1 being in A, and convergence of (xn )n∈N0 ), are also present in the multi-
dimensional situation.
As in the 1-dimensional case, we will present a result that provides an example for suffi-
cient conditions for convergence (cf. Th. 6.5 below). While Th. 6.3 for the 1-dimensional
Newton’s method is an example of a global Newton theorem (proving convergence for
each x0 in the considered domain of f ), Th. 6.5 is an example of a local Newton theorem,
proving convergence under the assumption that x0 is sufficiently close to the zero x̂ –
however, in contrast to Th. 6.3, Th. 6.5 also includes error estimates (cf. (6.12), (6.13)).
In the literature (e.g. in the previously mentioned references [Deu11], [OR70]), one finds many variants of both local and global Newton theorems. A global Newton theorem in a multi-dimensional setting is also provided by Th. I.1 in the appendix.
Theorem 6.5. Let $m \in \mathbb{N}$ and fix a norm $\|\cdot\|$ on $\mathbb{R}^m$. Let $A \subseteq \mathbb{R}^m$ be open. Moreover, let $f : A \longrightarrow \mathbb{R}^m$ be differentiable, let $x_* \in A$ be a zero of $f$, and let $r > 0$ be (sufficiently small) such that
$$B_r(x_*) = \{ x \in \mathbb{R}^m : \|x - x_*\| < r \} \subseteq A. \qquad (6.7)$$
Assume that $Df(x_*)$ is invertible and that $\beta > 0$ is such that
$$\big\| \big(Df(x_*)\big)^{-1} \big\| \leq \beta \qquad (6.8)$$
(here and in the following, we consider the induced operator norm on real $m \times m$ matrices). Finally, assume that the map $Df : B_r(x_*) \longrightarrow \mathcal{L}(\mathbb{R}^m, \mathbb{R}^m)$ is Lipschitz continuous with Lipschitz constant $L \in \mathbb{R}^+$, i.e.
$$\big\| (Df)(x) - (Df)(y) \big\| \leq L\, \|x - y\| \quad \text{for each } x, y \in B_r(x_*). \qquad (6.9)$$
Then, letting
$$\delta := \min\left\{ r, \frac{1}{2\beta L} \right\}, \qquad (6.10)$$
Newton's method (6.5) is well-defined for each $x_0 \in B_\delta(x_*)$ (i.e., for each $n \in \mathbb{N}_0$, $Df(x_n)$ is invertible and $x_{n+1} \in B_\delta(x_*)$), and
$$\lim_{n\to\infty} x_n = x_* \qquad (6.11)$$
such that we can, indeed, apply Lem. 2.45. In consequence, $Df(x)$ is invertible with
$$\big\| \big(Df(x)\big)^{-1} \big\| \overset{(2.32)}{\leq} \frac{\|A^{-1}\|}{1 - \|A^{-1}\|\,\|\Delta A\|} \overset{(6.16)}{\leq} 2\,\|A^{-1}\| \leq 2\beta, \qquad (6.17)$$
to obtain
$$\phi'(t) = Df\big(y + t(x-y)\big)\,(x-y) \quad \text{for each } t \in [0,1]. \qquad (6.20)$$
This allows us to write
$$f(x) - f(y) - Df(y)(x-y) = \phi(1) - \phi(0) - \phi'(0) = \int_0^1 \big( \phi'(t) - \phi'(0) \big)\,\mathrm{d}t. \qquad (6.21)$$
Thus,
$$\left\| \int_0^1 \big( \phi'(t) - \phi'(0) \big)\,\mathrm{d}t \right\| \leq \int_0^1 \big\| \phi'(t) - \phi'(0) \big\|\,\mathrm{d}t \leq \int_0^1 L\,t\,\|x-y\|^2\,\mathrm{d}t = \frac{L}{2}\,\|x-y\|^2. \qquad (6.23)$$
Combining (6.23) with (6.21) proves (6.18). N
Applying the norm to (6.25) and using (6.14) and (6.18) implies
$$\|x_{n+1} - x_*\| \leq 2\beta\,\frac{L}{2}\,\|x_n - x_*\|^2 = \beta L\,\|x_n - x_*\|^2 \leq \beta L\,\delta\,\|x_n - x_*\| \leq \frac{1}{2}\,\|x_n - x_*\|, \qquad (6.26)$$
which proves xn+1 ∈ Bδ (x∗ ) as well as (6.12).
Another induction shows that (6.12) implies, for each $n \in \mathbb{N}$,
$$\|x_n - x_*\| \leq (\beta L)^{1+2+\dots+2^{n-1}}\,\|x_0 - x_*\|^{2^n} \leq (\beta L \delta)^{2^n - 1}\,\|x_0 - x_*\|, \qquad (6.27)$$
where the geometric sum formula $1 + 2 + \dots + 2^{n-1} = \frac{2^n - 1}{2 - 1} = 2^n - 1$ was used for the last inequality. Moreover, $\beta L \delta \leq \frac{1}{2}$ together with (6.27) proves (6.13), and, in particular, (6.11).
Finally, we show that x∗ is the unique zero in Bδ (x∗ ): Suppose x∗ , x∗∗ ∈ Bδ (x∗ ) with
f (x∗ ) = f (x∗∗ ) = 0. Then
$$\|x_{**} - x_*\| = \big\| \big(Df(x_*)\big)^{-1} \big( f(x_{**}) - f(x_*) - Df(x_*)(x_{**} - x_*) \big) \big\|. \qquad (6.28)$$
$$\|x_{**} - x_*\| \leq 2\beta\,\frac{L}{2}\,\|x_* - x_{**}\|^2 \leq \beta L \delta\,\|x_* - x_{**}\| \leq \frac{1}{2}\,\|x_* - x_{**}\|, \qquad (6.29)$$
implying kx∗ − x∗∗ k = 0 and completing the proof of the theorem.
has a unique zero x∗ ∈]0, π/2[ and that Newton’s method converges to x∗ for each
x0 ∈ [0, π/2[. Moreover,
$$\forall_{x\in[0,\pi/2]}\qquad -2 \leq f'(x) = -\sin x - 1 \leq -1 \quad\Rightarrow\quad \left| \frac{1}{f'(x)} \right| \leq 1,$$
and $f'$ is Lipschitz continuous with Lipschitz constant $L = 1$. Thus, Th. 6.5 applies with $\beta = L = 1$. However, without further knowledge regarding the location of the zero in $[0, \pi/2]$, we do not have any lower positive bound on $r$ to obtain $\delta = \min\{r, \frac{1}{2\beta L}\}$. One possibility is to extend the domain to $A := \left]-\frac{1}{2}, \frac{\pi}{2} + \frac{1}{2}\right[$, yielding $\delta = r = \frac{1}{2\beta L} = \frac{1}{2}$. Thus, in combination with the information of Ex. 6.4(a), we know that, for each $x_0 \in [0, \pi/2[$, Newton's method converges to the zero $x_*$ and the error estimates (6.12), (6.13) hold if $|x_0 - x_*| < \frac{1}{2}$ (otherwise, (6.12), (6.13) might not hold for some initial steps).
Then $f$ is differentiable with, for each $x = \binom{x_1}{x_2} \in \mathbb{R}^2$,
$$Df(x) = \frac{1}{4} \begin{pmatrix} 4 + \sin x_1 & \cos x_2\\ \sin x_1 & 4 + 2\cos x_2 \end{pmatrix},$$
$$\det Df(x) = \frac{1}{16}\,\big( 16 + 4\sin x_1 + 8\cos x_2 + \sin x_1 \cos x_2 \big) \geq \frac{3}{16} > 0,$$
$$\big(Df(x)\big)^{-1} = \frac{1}{\det Df(x)} \cdot \frac{1}{4} \begin{pmatrix} 4 + 2\cos x_2 & -\cos x_2\\ -\sin x_1 & 4 + \sin x_1 \end{pmatrix}.$$
We now consider $\mathbb{R}^2$ with $\|\cdot\|_\infty$ and recall from Ex. 2.18(a) that the row sum norm is the corresponding operator norm. For each $x, y \in \mathbb{R}^2$, we estimate, using that both $\sin$ and $\cos$ are Lipschitz continuous with Lipschitz constant 1,
$$\big\| Df(x) - Df(y) \big\|_\infty = \frac{1}{4} \left\| \begin{pmatrix} |\sin x_1 - \sin y_1| & |\cos x_2 - \cos y_2|\\ |\sin x_1 - \sin y_1| & 2\,|\cos x_2 - \cos y_2| \end{pmatrix} \right\|_\infty \leq \frac{1}{4} \big( |x_1 - y_1| + 2\,|x_2 - y_2| \big) \leq \frac{3}{4}\, \|x - y\|_\infty,$$
as well as
$$\big\| \big(Df(x)\big)^{-1} \big\|_\infty \leq \frac{16}{3} \cdot \frac{7}{4} = \frac{28}{3} \quad \text{for each } x \in \mathbb{R}^2.$$
Thus, if we can establish $f$ to have a zero in $B_1(0) = [-1,1] \times [-1,1]$, then Th. 6.5 applies with $\beta := \frac{28}{3}$ and $L := \frac{3}{4}$ (and arbitrary $r \in \mathbb{R}^+$), i.e. with $\delta = \frac{1}{2} \cdot \frac{3}{28} \cdot \frac{4}{3} = \frac{1}{14}$.
To verify that $f$ has a zero in $B_1(0)$, we note that $f(x) = 0$ is equivalent to the fixed point problem $\phi(x) = x$ with
$$\phi : \mathbb{R}^2 \longrightarrow \mathbb{R}^2, \qquad \phi\begin{pmatrix}x_1\\x_2\end{pmatrix} := \frac{1}{4} \begin{pmatrix} \cos x_1 - \sin x_2\\ \cos x_1 - 2\sin x_2 \end{pmatrix}.$$
As $\phi$, clearly, maps $B_1(0)$ (even all of $\mathbb{R}^2$) into $B_{3/4}(0)$ and is a contraction, due to
$$\forall_{x,y\in\mathbb{R}^2}\qquad \big\| \phi(x) - \phi(y) \big\|_\infty \leq \frac{3}{4}\, \|x - y\|_\infty,$$
the Banach fixed point theorem [Phi16b, Th. 2.29] applies, showing $\phi$ to have a unique fixed point $x_*$, which must be located in $B_{3/4}(0)$. As we now only know Th. 6.5 to yield convergence of Newton's method for $x_0 \in B_{1/14}(x_*)$, in practice, without more accurate knowledge of the location of $x_*$, one could cover $B_{3/4}(0)$ with squares of side length $\frac{1}{7}$, successively starting iterations in each of the squares, aborting an iteration if one finds it to be inconsistent with the error estimates of Th. 6.5.
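For a concrete run of this example, note that $f$ is determined (consistently with the Jacobian $Df$ displayed above) by $f(x) = x - \phi(x)$. A pure-Python sketch of the multi-dimensional Newton iteration, solving the $2 \times 2$ linear system in each step by Cramer's rule (function names ours):

```python
import math

def f(x1, x2):
    # f(x) = x - phi(x), consistent with the Jacobian Df of this example
    return (x1 - (math.cos(x1) - math.sin(x2)) / 4,
            x2 - (math.cos(x1) - 2 * math.sin(x2)) / 4)

def df(x1, x2):
    # Df(x) = (1/4) [[4 + sin x1, cos x2], [sin x1, 4 + 2 cos x2]]
    return (1 + math.sin(x1) / 4, math.cos(x2) / 4,
            math.sin(x1) / 4, 1 + math.cos(x2) / 2)

def newton2(x1, x2, steps=25):
    for _ in range(steps):
        a, b, c, d = df(x1, x2)
        f1, f2 = f(x1, x2)
        det = a * d - b * c              # det Df >= 3/16 > 0, cf. the estimate above
        x1 -= (d * f1 - b * f2) / det    # Cramer's rule for Df s = f(x), x <- x - s
        x2 -= (a * f2 - c * f1) / det
    return x1, x2

x1, x2 = newton2(0.0, 0.0)
```

The computed point satisfies $f(x_*) \approx 0$ and lies in $B_{3/4}(0)$ with respect to $\|\cdot\|_\infty$, as predicted by the fixed point argument.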
7 Eigenvalues
For $\lambda \in \sigma(A)$, its algebraic multiplicity $m_a(\lambda)$ (the multiplicity of $\lambda$ as a zero of the characteristic polynomial $\chi_A$) and its geometric multiplicity $m_g(\lambda)$ (the dimension of the eigenspace $\ker(A - \lambda\operatorname{Id})$) are as in [Phi19b, Def. 6.12] and [Phi19b, Th. 8.4]:
(a) Eigenvalues of A and their multiplicities are invariant under similarity transforma-
tions, i.e. if T ∈ M(n, F ) is invertible and B := T −1 AT , then σ(B) = σ(A) and, for
each $\lambda \in \sigma(A)$, $m_a(\lambda, A) = m_a(\lambda, B)$ as well as $m_g(\lambda, A) = m_g(\lambda, B)$ (e.g. due to the fact that $\chi_A$ and $\ker(A - \lambda\operatorname{Id})$ only depend on the linear maps (in $\mathcal{L}(F^n, F^n)$) represented by $A$ and $A - \lambda\operatorname{Id}$, respectively, where $A$ and $B$ (resp. $A - \lambda\operatorname{Id}$ and $T^{-1}(A - \lambda\operatorname{Id})T$) represent the same map, merely with respect to different bases of $F^n$ (cf. [Phi19b, Prop. 8.2(a)])).
(b) Note that the characteristic polynomial χA = det(X Idn −A) is always a monic
polynomial of degree n (i.e. a polynomial of degree n, where the coefficient of X n is
1, cf. [Phi19b, Rem. 8.3]). In general, the eigenvalues of A will be in the algebraic
closure F of F , cf. [Phi19b, Def. 7.11, Th. 7.38, Def. D.13, Th. D.16]3 . On the
other hand, every monic polynomial f ∈ F [X] of degree n ≥ 1 is the characteristic
³It is, perhaps, somewhat misleading to speak of the algebraic closure of $F$: Even though, according to [Phi19b, Cor. D.20], all algebraic closures of a given field $F$ are isomorphic, the existence of the isomorphism is, in general, nonconstructive, based on an application of Zorn's lemma.
polynomial of some matrix A(f ) ∈ M(n, F ), where A(f ) is the so-called companion
matrix of the polynomial: If a1 , . . . , an ∈ F , and
$$f := X^n + \sum_{j=1}^{n} a_j X^{n-j},$$
(cf. [Phi19b, Rem. 8.3(c)])). In the following, we will only consider matrices over R
and C, and we know that, due to the fundamental theorem of algebra (cf. [Phi16a,
Th. 8.32]), C is the algebraic closure of R.
Remark 7.2. Given that eigenvalues are precisely the roots of the characteristic poly-
nomial, and given that, as mentioned in Rem. 7.1(b), every monic polynomial of degree
at least 1 can occur as the characteristic polynomial of a matrix, it is not surprising
that computing eigenvalues is, in general, a difficult task. According to Galois theory
in Algebra, for a generic polynomial of degree at least 5, it is not possible to obtain
its zeros using so-called radicals (which are, roughly, zeros of polynomials of the form
X k − λ, k ∈ N, λ ∈ F , see, e.g., [Bos13, Def. 6.1.1] for a precise definition) in finitely
many steps (cf., e.g., [Bos13, Cor. 6.1.7]) and, in general, an approximation by iterative
numerical methods is warranted. Having said that, let us note that the problem of com-
puting eigenvalues is, indeed, typically easier than the general problem of computing
zeros of polynomials. This is due to the fact that the difficulty of computing the zeros
of a polynomial depends tremendously on the form in which the P polynomial is given: It
is typically hard if the polynomial is expanded into the form nj=0 a
Qnj X j
, but it is easy
(trivial, in fact), if the polynomial is given in a factored form c j=1 (X − λj ). If the
characteristic polynomial is given implicitly by a matrix, one is, in general, somewhere
between the two extremes. In particular, for a large matrix, it usually makes no sense to
compute the characteristic polynomial in its expanded form (this is an expensive task in
itself and, in the process, one even loses the additional structure given by the matrix). It
makes much more sense to use methods tailored to the computation of eigenvalues, and,
if available, one should make use of additional structure (such as symmetry) a matrix
might have.
Definition 7.3. For each $A = (a_{kl}) \in M(n,\mathbb{C})$, $n \in \mathbb{N}$, the closed disks
$$G_k := G_k(A) := \Big\{ z \in \mathbb{C} : |z - a_{kk}| \leq \sum_{\substack{l=1,\\ l\neq k}}^{n} |a_{kl}| \Big\}, \qquad k \in \{1, \dots, n\}, \qquad (7.1)$$
are called the Gershgorin disks of $A$.
Theorem 7.4 (Gershgorin Circle Theorem). For each $A = (a_{kl}) \in M(n,\mathbb{C})$, $n \in \mathbb{N}$, one has
$$\sigma(A) \subseteq \bigcup_{k=1}^{n} G_k(A), \qquad (7.2)$$
that means the spectrum of a square complex matrix $A$ is always contained in the union of the Gershgorin disks of $A$.
The $k$-th component of $Ax = \lambda x$ reads
$$\lambda\, x_k = \sum_{l=1}^{n} a_{kl}\, x_l$$
and implies
$$|\lambda - a_{kk}|\,|x_k| \leq \sum_{\substack{l=1,\\ l\neq k}}^{n} |a_{kl}|\,|x_l| \leq \sum_{\substack{l=1,\\ l\neq k}}^{n} |a_{kl}|\,|x_k|.$$
Corollary 7.5. For each $A = (a_{kl}) \in M(n,\mathbb{C})$, $n \in \mathbb{N}$, instead of using row sums to define the Gershgorin disks of (7.1), one can use the column sums to define
$$G_k' := G_k'(A) := \Big\{ z \in \mathbb{C} : |z - a_{kk}| \leq \sum_{\substack{l=1,\\ l\neq k}}^{n} |a_{lk}| \Big\}, \qquad k \in \{1, \dots, n\}. \qquad (7.3)$$
Then
$$\sigma(A) \subseteq \Big( \bigcup_{k=1}^{n} G_k \Big) \cap \Big( \bigcup_{k=1}^{n} G_k' \Big). \qquad (7.4)$$
Proof. Since σ(A) = σ(At ) and Gk (At ) = G′k (A), (7.4) is an immediate consequence of
(7.2).
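The row and column disks, and the containments (7.2) and (7.4), are easy to check numerically. A pure-Python sketch (function names ours), using a $3 \times 3$ matrix with spectrum $\{-i, i, 4\}$ whose disks match those of Ex. 7.6 below (the matrix itself is our reconstruction from that example's diagonalization):

```python
def gershgorin_disks(a, columns=False):
    """Disks per (7.1) (rows) or (7.3) (columns), as (center, radius) pairs."""
    n = len(a)
    return [(a[k][k],
             sum(abs(a[l][k] if columns else a[k][l]) for l in range(n) if l != k))
            for k in range(n)]

def in_union(z, disks, eps=1e-12):
    return any(abs(z - c) <= r + eps for c, r in disks)

A = [[1, -1, 0], [2, -1, 0], [0, 1, 4]]     # spectrum {i, -i, 4}
row_disks = gershgorin_disks(A)             # [(1, 1), (-1, 2), (4, 1)]
col_disks = gershgorin_disks(A, columns=True)
```

Every eigenvalue lies in the union of the row disks and in the union of the column disks, while, e.g., the column disk around 4 has radius 0.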
Then the Gershgorin disks according to (7.1) and (7.3) are (cf. Fig. 7.1)
$$G_1 = \overline B_1(1), \quad G_2 = \overline B_2(-1), \quad G_3 = \overline B_1(4), \qquad G_1' = \overline B_2(1), \quad G_2' = \overline B_2(-1), \quad G_3' = \{4\}.$$
Here, the intersection of (7.4) is G1 ∪ G2 ∪ G′3 and, thus, noticeably smaller than either
G1 ∪ G2 ∪ G3 or G′1 ∪ G′2 ∪ G′3 (cf. Fig. 7.1). The spectrum is
σ(A) = {−i, i, 4} :
Figure 7.1: The Gershgorin disks of matrix A of Ex. 7.6 are depicted in the complex
plane together with the exact locations of A’s eigenvalues: The black dots are the
locations of the eigenvalues, the grey disks are the rowwise disks G1 , G2 , G3 , whereas
the columnwise disks G′1 , G′2 , G′3 are not shaded (only visible for G′1 , since G2 = G′2 and
G′3 = {4}).
Indeed, $A$ is diagonalizable and, with
$$T := \frac{1}{34} \begin{pmatrix} 0 & 17+17i & 17-17i\\ 0 & 34i & -34i\\ 4 & -2-8i & -2+8i \end{pmatrix},$$
$$T^{-1} A T = \frac{1}{2} \begin{pmatrix} 2 & 3 & 17\\ 2 & -1-i & 0\\ 2 & -1+i & 0 \end{pmatrix} \begin{pmatrix} 1 & -1 & 0\\ 2 & -1 & 0\\ 0 & 1 & 4 \end{pmatrix} \cdot \frac{1}{34} \begin{pmatrix} 0 & 17+17i & 17-17i\\ 0 & 34i & -34i\\ 4 & -2-8i & -2+8i \end{pmatrix} = \begin{pmatrix} 4 & 0 & 0\\ 0 & -i & 0\\ 0 & 0 & i \end{pmatrix}.$$
Figure 7.1 depicts the Gershgorin disks of A as well as the exact location of A’s eigen-
values in the complex plane. Note that, while G3 contains precisely one eigenvalue, G1
contains none, and G2 contains 2.
Theorem 7.7. Let $A = (a_{kl}) \in M(n,\mathbb{C})$, $n \in \mathbb{N}$. Suppose $k_1, \dots, k_n$ is an enumeration of the numbers from 1 to $n$, and let $q \in \{1, \dots, n\}$. Moreover, suppose
$$D^{(1)} = G_{k_1} \cup \dots \cup G_{k_q} \qquad (7.5a)$$
is the union of $q$ Gershgorin disks of $A$ and
$$D^{(2)} = G_{k_{q+1}} \cup \dots \cup G_{k_n} \qquad (7.5b)$$
is the union of the remaining $n - q$ Gershgorin disks of $A$. If $D^{(1)}$ and $D^{(2)}$ are disjoint, then $D^{(1)}$ must contain precisely $q$ eigenvalues of $A$ and $D^{(2)}$ must contain precisely $n - q$ eigenvalues of $A$ (counted according to their algebraic multiplicity).
Proof. Let D := diag(a11 , . . . , ann ) be the diagonal matrix that has precisely the same
diagonal elements as A. The idea of the proof is to connect A with D via a path (in
fact, a line segment) through M(n, C). Then we will show that the claim of the theorem
holds for D and must remain valid along the entire path. In consequence, it must also
hold for A. The path is defined by
$$\Phi : [0,1] \longrightarrow M(n,\mathbb{C}), \quad t \mapsto \Phi(t) = (\phi_{kl}(t)), \qquad \phi_{kl}(t) := \begin{cases} a_{kl} & \text{for } k = l,\\ t\,a_{kl} & \text{for } k \neq l, \end{cases} \qquad (7.6)$$
i.e.
$$\Phi(t) = \begin{pmatrix} a_{11} & t a_{12} & \dots & \dots & t a_{1n}\\ t a_{21} & a_{22} & t a_{23} & \dots & t a_{2n}\\ \vdots & \ddots & \ddots & \ddots & \vdots\\ t a_{n-1,1} & \dots & t a_{n-1,n-2} & a_{n-1,n-1} & t a_{n-1,n}\\ t a_{n1} & t a_{n2} & \dots & t a_{n,n-1} & a_{nn} \end{pmatrix}.$$
Clearly,
Φ(0) = D = diag(a11 , . . . , ann ), Φ(1) = A.
We denote the corresponding Gershgorin disks as follows:
$$\forall_{k\in\{1,\dots,n\}}\ \forall_{t\in[0,1]}\qquad G_k(t) := G_k\big(\Phi(t)\big) = \Big\{ z \in \mathbb{C} : |z - a_{kk}| \leq t \sum_{\substack{l=1,\\ l\neq k}}^{n} |a_{kl}| \Big\}. \qquad (7.7)$$
In (7.9) as well as in the rest of the proof, we always count all eigenvalues according
to their respective algebraic multiplicities. Due to (7.8), we know that precisely the q
eigenvalues $a_{k_1 k_1}, \dots, a_{k_q k_q}$ of $D$ are in $D^{(1)}$. In particular, $0 \in S$ and $S \neq \emptyset$. The proof is done once we show
$$\tau := \sup S = 1 \qquad (7.10a)$$
and
$$\tau \in S \quad \text{(i.e. } \tau = \max S\text{)}. \qquad (7.10b)$$
As $D^{(1)}$ and $D^{(2)}$ are compact and disjoint, they have a positive distance,
Due to the continuous dependence of the eigenvalues on the coefficients of the matrix
(cf. Cor. J.3 in the Appendix), there exists δ > 0 such that, for each s, t ∈ [0, 1]
with |s − t| < δ, there exist enumerations λ1 (s), . . . , λn (s) and λ1 (t), . . . , λn (t) of the
eigenvalues of Φ(s) and Φ(t), respectively, satisfying
Clearly, (7.13) implies both (7.10a) (since, otherwise, there were τ < t < 1 with |τ − t| <
δ) and (7.10b) (since there are 0 < t < τ with |τ − t| < δ), thereby completing the
proof.
Remark 7.8. If $A \in M(n,\mathbb{R})$, $n \in \mathbb{N}$, then Th. 7.7 can sometimes help to determine which regions can contain nonreal eigenvalues: For real matrices, all Gershgorin disks are centered on the real line. Moreover, if $A$ is real, then $\lambda \in \sigma(A)$ implies $\bar\lambda \in \sigma(A)$, and $\lambda$, $\bar\lambda$ must lie in the same Gershgorin disks (as they have the same distance to the real line). Thus, if $\lambda$ is not real, then it cannot lie in any Gershgorin disk known to contain just one eigenvalue.
Example 7.9. For the matrix $A$ of Example 7.6, we see that the sets $G_1 \cup G_2 = \overline B_1(1) \cup \overline B_2(-1)$ and $G_3 = \overline B_1(4)$ are disjoint (cf. Fig. 7.1). Thus, Th. 7.7 implies that $G_3$ contains precisely one eigenvalue and $G_1 \cup G_2$ contains precisely two eigenvalues. Combining this information with Rem. 7.8 shows that $G_3$ must contain a single real eigenvalue, whereas $G_1 \cup G_2$ could contain (as, indeed, it does) two conjugate nonreal eigenvalues.
Proof. First, suppose $\lim_{k\to\infty} A^k = 0$. Let $\lambda \in \sigma(A)$ and choose a corresponding eigenvector $v \in \mathbb{C}^n \setminus \{0\}$. Then, for each norm $\|\cdot\|$ on $\mathbb{C}^n$ with induced matrix norm on $M(n,\mathbb{C})$,
$$\forall_{k\in\mathbb{N}}\qquad \|A^k v\| = \|\lambda^k v\| = |\lambda|^k\,\|v\| \leq \|A^k\|\,\|v\|,$$
which, in turn, implies |λ| < 1. Conversely, suppose r(A) < 1. If B, T ∈ M(n, C) and
T is invertible with A = T BT −1 , then an obvious induction shows
$$\forall_{k\in\mathbb{N}_0}\qquad A^k = T B^k T^{-1}. \qquad (7.16)$$
According to [Phi19b, Th. 9.13], we can choose T such that (7.16) holds with B in
Jordan normal form, i.e. with B in block diagonal form
B1 0
...
B= , (7.17)
0 Br
1 ≤ r ≤ n, where
λj 1 0 . . . 0
λj 1
... ...
Bj = (λj ) or Bj = 0 ,
0 λj 1
λj
$\lambda_j \in \sigma(A)$, $B_j \in M(n_j, \mathbb{C})$, $n_j \in \{1, \dots, n\}$ for each $j \in \{1, \dots, r\}$, and $\sum_{j=1}^{r} n_j = n$. Using blockwise matrix multiplication (cf. [Phi19a, Sec. 7.5]) in (7.17) yields
$$\forall_{k\in\mathbb{N}_0}\qquad B^k = \begin{pmatrix} B_1^k & & 0\\ & \ddots & \\ 0 & & B_r^k \end{pmatrix}. \qquad (7.18)$$
We now show
$$\forall_{j\in\{1,\dots,r\}}\qquad \lim_{k\to\infty} B_j^k = 0. \qquad (7.19)$$
Using
$$\forall_{k\in\mathbb{N}}\ \forall_{\alpha\in\{0,\dots,k\}}\qquad \binom{k}{\alpha} = \frac{k!}{\alpha!\,(k-\alpha)!} \leq k(k-1)\cdots(k-\alpha+1) \leq k^\alpha,$$
we say that $z$ has a nontrivial part in $M(\mu)$, $\mu \in \sigma(A)$, if, and only if, in the unique decomposition of $z$ with respect to the direct sum (7.21),
$$z = \sum_{\lambda\in\sigma(A)} z_\lambda \quad\text{with}\quad z_\mu \neq 0, \qquad \forall_{\lambda\in\sigma(A)}\ z_\lambda \in M(\lambda). \qquad (7.22)$$
Theorem 7.12. Let A ∈ M(n, C), n ∈ N. Suppose that A has an eigenvalue that is
larger in size than all other eigenvalues of A, i.e.
$$\exists_{\lambda\in\sigma(A)}\ \forall_{\mu\in\sigma(A)\setminus\{\lambda\}}\qquad |\lambda| > |\mu|. \qquad (7.23)$$
Moreover, assume λ to be semisimple (i.e. A↾M (λ) is diagonalizable and all Jordan blocks
with respect to $\lambda$ are trivial). Choosing an arbitrary norm $\|\cdot\|$ on $\mathbb{C}^n$ and starting with
z0 ∈ Cn , define the sequence (zk )k∈N0 via the power iteration (7.14). From the zk also
define sequences (wk )k∈N0 and (αk )k∈N0 by
$$w_k := \begin{cases} \dfrac{z_k}{\|z_k\|} & \text{for } z_k \neq 0,\\[2mm] z_0 & \text{for } z_k = 0 \end{cases} \ \in \mathbb{C}^n \quad\text{for each } k \in \mathbb{N}_0, \qquad (7.24a)$$
$$\alpha_k := \begin{cases} \dfrac{z_k^* A z_k}{z_k^* z_k} & \text{for } z_k \neq 0,\\[2mm] 0 & \text{for } z_k = 0 \end{cases} \ \in \mathbb{C} \quad\text{for each } k \in \mathbb{N}_0 \qquad (7.24b)$$
(treating the zk as column vectors). If z0 has a nontrivial part in M (λ) in the sense of
Def. 7.11, then
$$\lim_{k\to\infty} \alpha_k = \lambda \qquad (7.25a)$$
and there exist sequences $(\phi_k)_{k\in\mathbb{N}}$ in $\mathbb{R}$ and $(r_k)_{k\in\mathbb{N}}$ in $\mathbb{C}^n$ such that, for all $k$ sufficiently large,
$$w_k = e^{i\phi_k}\, \frac{v_\lambda + r_k}{\|v_\lambda + r_k\|}, \qquad \|v_\lambda + r_k\| \neq 0, \qquad \lim_{k\to\infty} r_k = 0, \qquad (7.25b)$$
where, in the last step, we have used $\lambda \neq 0$. In consequence, letting, for each $k \in \mathbb{N}_0$,
$$r_k := \frac{1}{\lambda^k}\, T B^k \left( \sum_{j=n_1+1}^{n} \zeta_j\, e_j \right), \qquad (7.29)$$
Caveat 7.13. It is emphasized that, due to the presence of the factor $e^{i\phi_k}$, (7.25b) does not claim the convergence of the $w_k$ to an eigenvector of $\lambda$. Indeed, in general, the sequence $(w_k)_{k\in\mathbb{N}}$ cannot be expected to converge. However, as $v_\lambda$ is an eigenvector for $\lambda$, so is every $\frac{e^{i\phi_k}}{\|v_\lambda + r_k\|}\, v_\lambda$ (and every other scalar multiple). Thus, what (7.25b) does claim is that, for sufficiently large $k$, $w_k$ is arbitrarily close to an eigenvector of $\lambda$, namely to the eigenvector $e^{i\phi_k}\, \frac{v_\lambda}{\|v_\lambda\|}$ (the eigenvector that $w_k$ is close to still depends on $k$).
Remark 7.14. (a) The hypothesis of Th. 7.12, that z0 have a nontrivial part in M (λ),
can be difficult to check in practice. Still, it is usually not a severe restriction: Even
if, initially, z0 does not satisfy the condition, roundoff errors during the computation
of the zk = Ak z0 will cause the condition to be satisfied for sufficiently large k.
(b) The rate of convergence of the rk in Th. 7.12 (and, thus, of the αk as well) is
determined by the size of |µ|/|λ|, where µ is a second largest eigenvalue (in terms
of absolute value). Therefore, the convergence will be slow if |µ| and |λ| are close.
—
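A minimal power iteration in the sense of Th. 7.12, in pure Python (function name ours; the iterates are normalized in each step to avoid overflow, which changes $z_k$ only by a positive scalar and hence changes neither $w_k$ nor $\alpha_k$):

```python
def power_iteration(a, z0, steps=100):
    """Power iteration z_{k+1} = A z_k with Rayleigh quotient alpha_k per (7.24b)."""
    n = len(a)
    matvec = lambda v: [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
    z = list(z0)
    for _ in range(steps):
        z = matvec(z)
        nz = sum(c * c for c in z) ** 0.5
        z = [c / nz for c in z]             # normalize (positive scaling only)
    az = matvec(z)
    alpha = sum(zc * ac for zc, ac in zip(z, az))   # z^t A z / z^t z, ||z||_2 = 1
    return alpha, z

A = [[2.0, 1.0], [1.0, 2.0]]                # eigenvalues 3 (dominant) and 1
alpha, w = power_iteration(A, [1.0, 0.0])
```

Here the dominant eigenvalue is 3 with eigenvector proportional to $(1, 1)^{\mathrm t}$, and the error decays like $(|\mu|/|\lambda|)^k = (1/3)^k$, illustrating Rem. 7.14(b).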
Even though the power method applied according to Th. 7.12 will (under the hypothesis
of the theorem) only approximate the dominant eigenvalue (i.e. the largest in terms of
absolute value), the following modification, known as the inverse power method, can be
used to approximate other eigenvalues:
Corollary 7.15. Let $A \in M(n,\mathbb{C})$, $n \in \mathbb{N}$, and assume $\lambda \in \sigma(A)$ to be semisimple. Let $p \in \mathbb{C} \setminus \{\lambda\}$ be closer to $\lambda$ than to any other eigenvalue of $A$, i.e.
$$\forall_{\mu\in\sigma(A)\setminus\{\lambda\}}\qquad |p - \lambda| < |p - \mu|. \qquad (7.32)$$
Set
$$G := (A - p\operatorname{Id})^{-1}. \qquad (7.33)$$
Choosing an arbitrary norm $\|\cdot\|$ on $\mathbb{C}^n$ and starting with $z_0 \in \mathbb{C}^n \setminus \{0\}$, define the
sequence (zk )k∈N0 via the power iteration (7.14) with A replaced by G. From the zk also
$$w_k := \frac{z_k}{\|z_k\|} = \frac{G^k z_0}{\|G^k z_0\|} \in \mathbb{C}^n \quad\text{for each } k \in \mathbb{N}_0, \qquad (7.34a)$$
$$\alpha_k := \begin{cases} \dfrac{z_k^* z_k}{z_k^* G z_k} + p & \text{for } z_k^* G z_k \neq 0,\\[2mm] 0 & \text{otherwise} \end{cases} \ \in \mathbb{C} \quad\text{for each } k \in \mathbb{N}_0 \qquad (7.34b)$$
and there exist sequences $(\phi_k)_{k\in\mathbb{N}}$ in $\mathbb{R}$ and $(r_k)_{k\in\mathbb{N}}$ in $\mathbb{C}^n$ such that, for all $k$ sufficiently large,
$$w_k = e^{i\phi_k}\, \frac{v_\lambda + r_k}{\|v_\lambda + r_k\|}, \qquad \|v_\lambda + r_k\| \neq 0, \qquad \lim_{k\to\infty} r_k = 0, \qquad (7.35b)$$
Proof. First note that (7.32) implies $p \notin \sigma(A)$, such that $G$ is well-defined. As $G$ is invertible, $z_0 \neq 0$ implies $G^k z_0 \neq 0$ for each $k \in \mathbb{N}_0$, showing the $w_k$ to be well-defined as well. The idea is to apply Th. 7.12 to $G$. Let $\mu \in \mathbb{C} \setminus \{p\}$ and $v \in \mathbb{C}^n \setminus \{0\}$. Then
$$Av = \mu v \quad\Leftrightarrow\quad (\mu - p)\,v = (A - p\operatorname{Id})\,v \quad\Leftrightarrow\quad (A - p\operatorname{Id})^{-1} v = (\mu - p)^{-1} v, \qquad (7.36)$$
showing $\mu$ to be an eigenvalue for $A$ with corresponding eigenvector $v$ if, and only if, $(\mu - p)^{-1}$ is an eigenvalue for $G = (A - p\operatorname{Id})^{-1}$ with the same corresponding eigenvector $v$. In particular, (7.32) implies $\frac{1}{\lambda - p}$ to be the dominant eigenvalue of $G$. Next, notice that the hypothesis that $\lambda$ be semisimple for $A$ implies $M(\lambda, A)$ to have a basis of eigenvectors $v_1, \dots, v_{n_1}$, $n_1 = m_a(\lambda)$ (cf. (7.21)), which, by (7.36), then constitutes a basis of eigenvectors for $M(\frac{1}{\lambda - p}, G)$ as well, showing $\frac{1}{\lambda - p}$ to be semisimple for $G$. Thus, all hypotheses of Th. 7.12 are satisfied and we obtain $\lim_{k\to\infty} \alpha_k = (\lambda - p) + p = \lambda$, proving (7.35a). Moreover, (7.35b) also follows, where $v_\lambda$ is an eigenvector for $\frac{1}{\lambda - p}$ and $G$. However, then $v_\lambda$ is also an eigenvector for $\lambda$ and $A$ by (7.36).
Remark 7.16. In practice, one computes the zk for the inverse power method of Cor.
7.15 by solving the linear systems (A−p Id)zk = zk−1 for zk . As the matrix A−p Id does
not vary, but only the right-hand side of the system varies, one should solve the systems
by first computing a suitable decomposition of A − p Id (e.g. a QR decomposition, cf.
Rem. 5.15).
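A sketch of the inverse power method for a $2 \times 2$ real matrix in pure Python (function name ours; for simplicity, the inverse $G = (A - p\operatorname{Id})^{-1}$ is formed explicitly here, whereas, as noted above, in practice one solves the linear systems instead):

```python
def inverse_power(a, p, z0, steps=60):
    """Inverse power method (Cor. 7.15) for a real 2x2 matrix: power iteration
    applied to G = (A - p Id)^{-1}, with alpha_k per (7.34b)."""
    (a11, a12), (a21, a22) = a
    b11, b12, b21, b22 = a11 - p, a12, a21, a22 - p
    det = b11 * b22 - b12 * b21
    g11, g12, g21, g22 = b22 / det, -b12 / det, -b21 / det, b11 / det
    z1, z2 = z0
    for _ in range(steps):
        z1, z2 = g11 * z1 + g12 * z2, g21 * z1 + g22 * z2
        nz = (z1 * z1 + z2 * z2) ** 0.5
        z1, z2 = z1 / nz, z2 / nz           # normalize
    gz1, gz2 = g11 * z1 + g12 * z2, g21 * z1 + g22 * z2
    return (z1 * z1 + z2 * z2) / (z1 * gz1 + z2 * gz2) + p

A = [[2.0, 1.0], [1.0, 2.0]]                 # spectrum {1, 3}
lam = inverse_power(A, 0.8, [1.0, 0.0])      # p = 0.8 is closest to lambda = 1
```

Since $p = 0.8$ is closer to 1 than to 3, the dominant eigenvalue of $G$ is $\frac{1}{1 - 0.8} = 5$ and the method recovers the eigenvalue 1, not the dominant eigenvalue 3 of $A$.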
7.4 QR Method
The QR algorithm for the approximation of the eigenvalues of A ∈ M(n, K) makes use
of the QR decomposition A = QR of A into a unitary matrix Q and an upper triangular
matrix R (cf. Def. 5.12(a)). We already know from Th. 5.24 that Q and R can be
obtained via the Householder method of Def. 5.23.
Algorithm 7.18 (QR Method). Given A ∈ M(n, K), n ∈ N, we define the following
algorithm for the computation of a sequence (A(k) )k∈N of matrices in M(n, K): Let
A(1) := A. For each k ∈ N, let
A(k+1) := Rk Qk , (7.37a)
where
A(k) = Qk Rk (7.37b)
is a QR decomposition of A(k) .
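For a $2 \times 2$ matrix, one step of Alg. 7.18 can be written out with a single Givens rotation; a pure-Python sketch (function name ours), iterated on a symmetric test matrix:

```python
def qr_step_2x2(a):
    """One step A -> R Q of Alg. 7.18; the QR decomposition of a 2x2 matrix
    is obtained from a single Givens rotation zeroing the (2,1) entry."""
    (a11, a12), (a21, a22) = a
    r = (a11 * a11 + a21 * a21) ** 0.5
    c, s = a11 / r, a21 / r                 # Q = [[c, -s], [s, c]], Q^t A = R
    r11, r12 = r, c * a12 + s * a22
    r22 = -s * a12 + c * a22                # the (2,1) entry of R vanishes by construction
    return [[r11 * c + r12 * s, -r11 * s + r12 * c],   # A' = R Q
            [r22 * s, r22 * c]]

A = [[2.0, 1.0], [1.0, 2.0]]                # eigenvalues 3 and 1
for _ in range(60):
    A = qr_step_2x2(A)
```

After the iteration, the matrix is (numerically) upper triangular with diagonal entries 3 and 1, in line with the convergence statement of Th. 7.22 below.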
Remark 7.19. In each step of the QR method of Alg. 7.18, one needs to obtain a QR
decomposition of A(k) . For a fully populated matrix A(k) , the complexity of obtaining a
QR decomposition is O(n3 ) (obtaining a QR decomposition via the Householder method
of Def. 5.23 requires O(n) matrix multiplications, each matrix multiplication needing
O(n2 ) multiplications). The complexity can be improved to O(n2 ) by first transforming
A into so-called Hessenberg form (the complexity of the transformation is O(n3 ), cf.
Sections J.2 and J.3 of the Appendix). Thus, in practice, it is always advisable to
transform A into Hessenberg form before applying the QR method.
Lemma 7.20. Let A ∈ M(n, K), n ∈ N. Moreover, for each k ∈ N, let matrices
A(k) , Rk , Qk ∈ M(n, K) be defined according to Alg. 7.18. Then the following identities
hold for each k ∈ N:
$$A^{(k+1)} = Q_k^*\, A^{(k)}\, Q_k, \qquad (7.38a)$$
$$A^{(k+1)} = (Q_1 Q_2 \cdots Q_k)^*\, A\, (Q_1 Q_2 \cdots Q_k), \qquad (7.38b)$$
$$A^k = (Q_1 Q_2 \cdots Q_k)(R_k R_{k-1} \cdots R_1). \qquad (7.38c)$$
A simple induction applied to (7.38a) yields (7.38b). Another induction yields (7.38c): For $k = 1$, it is merely (7.37b). For $k \in \mathbb{N}$, we compute
$$Q_1 \cdots Q_k\, \underbrace{Q_{k+1} R_{k+1}}_{A^{(k+1)}}\, R_k \cdots R_1 \overset{(7.38b)}{=} \underbrace{(Q_1 \cdots Q_k)(Q_1 \cdots Q_k)^*}_{\operatorname{Id}}\, A\, (Q_1 \cdots Q_k)(R_k \cdots R_1) \overset{\text{ind.\,hyp.}}{=} A\, A^k = A^{k+1},$$
which completes the induction and establishes the case.
In the following Th. 7.22, it will be shown that, under suitable hypotheses, the matrices
A(k) of the QR method become, asymptotically, upper triangular, with the diagonal
entries converging to the eigenvalues of A. Theorem 7.22 is not the most general con-
vergence result of this kind. We will remark on possible extensions afterwards and we
will also indicate that some of the hypotheses are not as restrictive as they might seem
at first glance.
Notation 7.21. Let n ∈ N. By diag(λ1 , . . . , λn ) we denote the diagonal n × n matrix
with diagonal entries λ1 , . . . , λn , i.e.
diag(λ_1, . . . , λ_n) := D = (d_kl),   d_kl := λ_k for k = l,   d_kl := 0 for k ≠ l
(we have actually used this notation before). If A = (akl ) is an n × n matrix, then,
by diag(A), we denote the diagonal n × n matrix that has precisely the same diagonal
entries as A, i.e.
diag(A) := D = (d_kl),   d_kl := a_kk for k = l,   d_kl := 0 for k ≠ l.
Theorem 7.22. Let A ∈ M(n, K), n ∈ N, have n eigenvalues λ_1, . . . , λ_n ∈ K with
|λ_1| > |λ_2| > · · · > |λ_n| > 0, (7.39a)
and assume A is diagonalizable via
A = T D T^{−1},  D := diag(λ_1, . . . , λ_n). (7.39b)
Moreover, assume T^{−1} has an LU decomposition
T^{−1} = LU, (7.39c)
where L, U ∈ M(n, K), L being unipotent and lower triangular, U being upper triangular. If (A^{(k)})_{k∈N} is defined according to Alg. 7.18, then the A^{(k)} are, asymptotically, upper triangular, with the diagonal entries converging to the eigenvalues of A, i.e.
∀ j, l ∈ {1, . . . , n}, j > l :  lim_{k→∞} a_jl^{(k)} = 0 (7.40a)
and
∀ j ∈ {1, . . . , n} :  lim_{k→∞} a_jj^{(k)} = λ_j. (7.40b)
Moreover, there exists a constant C > 0, such that we have the error estimates
∀ k ∈ N :  |a_jl^{(k)}| ≤ C q^k for j > l,   |a_jj^{(k)} − λ_j| ≤ C q^k for j ∈ {1, . . . , n}, (7.40c)
where
q := max { |λ_{j+1}| / |λ_j| : j ∈ {1, . . . , n − 1} } < 1. (7.40d)
Proof. For n = 1, there is nothing to prove (one has A = D = A(k) for each k ∈ N).
Thus, we assume n > 1 for the rest of the proof. We start by making a number of
observations for k ∈ N being fixed. From (7.38c), we obtain a QR decomposition of Ak :
A^k = T D^k LU = X_k D^k U,   X_k := T D^k L D^{−k}. (7.42)
implying
A^{(k)} = Q_k R_k   (by (7.37b))
= S_{k−1}^* Q_{X_{k−1}}^{−1} Q_{X_k} R_{X_k} D R_{X_{k−1}}^{−1} S_{k−1}
= S_{k−1}^* R_{X_{k−1}} R_{X_{k−1}}^{−1} Q_{X_{k−1}}^{−1} Q_{X_k} R_{X_k} D R_{X_{k−1}}^{−1} S_{k−1}
= S_{k−1}^* R_{X_{k−1}} X_{k−1}^{−1} X_k D R_{X_{k−1}}^{−1} S_{k−1}. (7.44)
and
|l_jl^{(k)}|  = 0 for j < l,   = 1 for j = l,   ≤ C_L q^k for j > l,
where q is as in (7.40d) and
To estimate ‖A_F^{(k)}‖_2, we note
‖R_{X_k}‖_2 = ‖Q_{X_k}^* X_k‖_2 = ‖X_k‖_2,   ‖R_{X_k}^{−1}‖_2 = ‖X_k^{−1} Q_{X_k}‖_2 = ‖X_k^{−1}‖_2,
since Q_{X_k} is unitary and, thus, isometric. Due to (7.46) and (7.47), lim_{k→∞} X_k = T,
such that the sequences (‖R_{X_k}‖_2)_{k∈N} and (‖R_{X_k}^{−1}‖_2)_{k∈N} are bounded by some constant
C_R > 0. Thus, as the S_k are also unitary,
∀ k ≥ 2 :  ‖A_F^{(k)}‖_2 = ‖S_{k−1}^* R_{X_{k−1}} F_k D R_{X_{k−1}}^{−1} S_{k−1}‖_2 ≤ C_R^2 ‖D‖_2 C_1 q^k. (7.50)
Next, we observe each A_D^{(k)} to be an upper triangular matrix, as it is the product of
upper triangular matrices. Moreover, each S_k is a diagonal matrix and diag(R_{X_k})^{−1} = diag(R_{X_k}^{−1}). Thus,
∀ k ≥ 2 :  diag(A_D^{(k)}) = S_{k−1}^* diag(R_{X_{k−1}}) D diag(R_{X_{k−1}}^{−1}) S_{k−1} = D. (7.51)
Finally, as we know the 2-norm on M(n, K) to be equivalent to the max-norm on K^{n^2},
combining (7.49) – (7.51) proves everything that was claimed in (7.40).
Remark 7.23. (a) Hypothesis (7.39c) of Th. 7.22 that T −1 has an LU decomposition
without permutations is not as severe as it may seem. If permutations are necessary
in the LU decomposition of T −1 , then Th. 7.22 and its proof remain the same,
except that the eigenvalues will occur permuted in the limiting diagonal matrix D.
Moreover, it is an exercise to show that, if the matrix A in Th. 7.22 has Hessenberg
form with 0 ∉ {a_21, . . . , a_{n,n−1}} (in view of Rem. 7.19, this is the case of most
interest), then hypothesis (7.39c) can always be satisfied: As an intermediate step,
one can show that, in this case,
where e1 , . . . , en−1 are the standard unit vectors, and t1 , . . . , tn are the eigenvectors
of A that form the columns of T .
(b) If A has several eigenvalues of equal modulus (e.g. multiple eigenvalues or nonreal
eigenvalues of a real matrix), then the sequence of the QR method can no longer
be expected to be asymptotically upper triangular, only block upper triangular. For
example, if |λ1 | = |λ2 | > |λ3 | = |λ4 | = |λ5 |, then the asymptotic form will look like
∗ ∗ ∗ ∗ ∗
∗ ∗ ∗ ∗ ∗
        ∗ ∗ ∗
        ∗ ∗ ∗
        ∗ ∗ ∗ ,
where the upper diagonal 2 × 2 block still has eigenvalues λ1 , λ2 and the lower diag-
onal 3 × 3 block still has eigenvalues λ3 , λ4 , λ5 (see [Cia89, Sec. 6.3] and references
therein).
Remark 7.24. (a) The efficiency of the QR method can be improved significantly by
the introduction of so-called spectral shifts. One replaces (7.37) by
A^{(k+1)} := R_k Q_k + µ_k Id, (7.52a)
where
A^{(k)} − µ_k Id = Q_k R_k (7.52b)
is a QR decomposition of A^{(k)} − µ_k Id. A simple strategy is to choose µ_k := a_nn^{(k)} (cf. [SK11,
Sec. 5.5.2]). According to [HB09, Sec. 27.3], an even better choice, albeit less simple,
is to choose µk as the eigenvalue of
( a_{n−1,n−1}^{(k)}   a_{n−1,n}^{(k)} )
( a_{n,n−1}^{(k)}     a_{n,n}^{(k)}   )
that is closest to a_nn^{(k)}. If one needs to find nonreal eigenvalues of a real matrix, the
related strategy of so-called double shifts is warranted (see [SK11, Sec. 5.5.3]).
(b) The above-described QR method with shifts is particularly successful if applied to
Hessenberg matrices in combination with a strategy known as deflation. After each
step, each element of the first subdiagonal is tested if it is sufficiently close to 0 to be
treated as “already converged”, i.e. to be replaced by 0. Then, if a_{j,j−1}^{(k)}, 2 ≤ j ≤ n,
has been replaced by 0, the QR method is applied separately to the upper diagonal
(j − 1) × (j − 1) block and the lower diagonal (n − j + 1) × (n − j + 1) block. If
j = 2 or j = n (this is what usually happens if the shift strategy as described in (a)
is applied), then a_11^{(k)} = λ_1 or a_nn^{(k)} = λ_n is already correct and one has to continue
with an (n − 1) × (n − 1) matrix.
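A minimal sketch of the shifted iteration (7.52) combined with the deflation test just described: the simple shift µ_k = a_nn^{(k)} is used, and the loop stops once the subdiagonal entry is negligible. The 2 × 2 test matrix and the tolerance 1e-13 are illustrative choices, not part of the algorithm.

```python
# Shifted QR iteration on a 2x2 symmetric example: factor A - mu*Id = QR
# with the simple shift mu = a_22, then set A = RQ + mu*Id.  The loop stops
# as soon as |a_21| has "converged" to 0 (deflation test).

def qr_2x2(A):
    # Gram-Schmidt QR decomposition of a 2x2 matrix (enough for this sketch)
    c1 = [A[0][0], A[1][0]]
    c2 = [A[0][1], A[1][1]]
    r11 = (c1[0] ** 2 + c1[1] ** 2) ** 0.5
    q1 = [c1[0] / r11, c1[1] / r11]
    r12 = q1[0] * c2[0] + q1[1] * c2[1]
    w = [c2[0] - r12 * q1[0], c2[1] - r12 * q1[1]]
    r22 = (w[0] ** 2 + w[1] ** 2) ** 0.5
    q2 = [w[0] / r22, w[1] / r22]
    return [[q1[0], q2[0]], [q1[1], q2[1]]], [[r11, r12], [0.0, r22]]

A = [[4.0, 1.0], [1.0, 2.0]]            # eigenvalues 3 +/- sqrt(2)
for _ in range(50):
    if abs(A[1][0]) <= 1e-13:           # deflation: subdiagonal converged
        break
    mu = A[1][1]                        # simple shift
    B = [[A[0][0] - mu, A[0][1]], [A[1][0], A[1][1] - mu]]
    Q, R = qr_2x2(B)
    A = [[sum(R[i][k] * Q[k][j] for k in range(2))
          + (mu if i == j else 0.0)     # A = R Q + mu*Id
          for j in range(2)] for i in range(2)]
```

Only a handful of iterations are needed here, in contrast to the linear decay of the unshifted method.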
8 Minimization
However, minimization is more general: If we seek the (global) min of the function
f : A −→ R, f(a) := ‖F(a) − y‖^α, then it will yield a solution to F(x) = y, provided
such a solution exists. On the other hand, f can have a minimum even if F(x) = y
does not have a solution. If Y = R^n, n ∈ N, with the 2-norm, and α = 2, then the
minimization of f is known as a least squares problem. The least squares problem is
called linear if A is a normed vector space over R and F : A −→ Y is linear, and
nonlinear otherwise.
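A tiny worked instance of the linear least squares problem (with made-up data): fitting a line c_0 + c_1 t to three points by solving the normal equations M^t M x = M^t y. This sketch assumes M has full column rank, so that M^t M is invertible.

```python
# Linear least squares sketch: minimize f(x) = ||M x - y||_2^2 by solving
# the normal equations (M^t M) x = M^t y.  The data below is made up for
# illustration; the resulting 2x2 system is solved by Cramer's rule.

M = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]     # rows (1, t) for t = 0, 1, 2
y = [0.1, 0.9, 2.0]

s00 = sum(r[0] * r[0] for r in M)            # entries of M^t M
s01 = sum(r[0] * r[1] for r in M)
s11 = sum(r[1] * r[1] for r in M)
b0 = sum(M[i][0] * y[i] for i in range(3))   # entries of M^t y
b1 = sum(M[i][1] * y[i] for i in range(3))

det = s00 * s11 - s01 * s01                  # nonzero: M has rank 2
c0 = (b0 * s11 - s01 * b1) / det
c1 = (s00 * b1 - s01 * b0) / det             # best-fit line c0 + c1*t
```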
One distinguishes between constrained and unconstrained minimization, where con-
strained minimization can be seen as restricting the objective function f : A −→ R
to some subset of A, determined by the constraints. Constraints can be given explic-
itly or implicitly. If A = R, then an explicit constraint might be to seek nonnegative
solutions. Implicit constraints can be given in the form of equation constraints – for
example, if A = Rn , then one might want to minimize f on the set of all solutions to
M a = b, where M is some real m × n matrix. In this class, we will restrict ourselves
to unconstrained minimization. In particular, we will not cover the simplex method, a
method for so-called linear programming, which consists of minimizing a linear f with
affine constraints in finite dimensions (see, e.g., [HH92, Sec. 9] or [Str08, Sec. 5]). For
constrained nonlinear optimization, see, e.g., [QSS07, Sec. 7.3]. The optimal control of
differential equations is an example of constrained minimization in infinite dimensions
(see, e.g., [Phi09]).
We will only consider methods that are suitable for unconstrained nonlinear minimiza-
tion, noting that even the linear least squares problem is a nonlinear minimization prob-
lem. Both methods involving derivatives (such as gradient methods) and derivative-free
methods are of interest. Derivative-based methods typically show faster convergence
rates, but, on the other hand, they tend to be less robust, need more function evalua-
tions (i.e. each step of the method is more costly), and have a smaller scope of applica-
bility (e.g., the function to be minimized must be differentiable). Below, we will study
derivative-based methods in Sections 8.3 and 8.4, whereas derivative-free methods can
be found in Appendix K.1 and K.2.
(a) Given x ∈ A, f has a (strict) global min at x if, and only if, f (x) ≤ f (y) (f (x) <
f (y)) for each y ∈ A \ {x}.
(b) Given x ∈ X, f has a (strict) local min at x if, and only if, there exists r > 0 such
that f (x) ≤ f (y) (f (x) < f (y)) for each y ∈ {y ∈ A : ky − xk < r} \ {x}.
Basically all deterministic minimization methods (in particular, all methods considered
in this class) have the drawback that they will find local minima rather than global
minima. This problem vanishes if every local min is a global min, which is, indeed, the
case for convex functions. For this reason, we will briefly study convex functions in this
section, before treating concrete minimization methods thereafter.
Definition 8.2. A subset C of a real vector space X is called convex if, and only if,
for each (x, y) ∈ C^2 and each 0 ≤ α ≤ 1, one has α x + (1 − α) y ∈ C. If A ⊆ X, then
conv A is the intersection of all convex sets containing A and is called the convex hull
of A (clearly, the intersection of convex sets is always convex and, thus, the convex hull
of a set is always convex, cf. [Phi19b, Def. and Rem. 1.23]).
(b) f is called strictly convex if, and only if, the inequality in (8.3) is strict whenever
x ≠ y and 0 < α < 1.
(c) f is called strongly or uniformly4 convex if, and only if, there exists a norm k · k on
X and µ ∈ R+ such that
∀ (x, y) ∈ C^2  ∀ 0 ≤ α ≤ 1 :  f(α x + (1 − α) y) + µ α (1 − α) ‖x − y‖^2 ≤ α f(x) + (1 − α) f(y). (8.4)
Remark 8.4. It is immediate from Def. 8.3 that uniform convexity implies strict con-
vexity, which, in turn, implies convexity. In general, a convex function need not be
strictly convex (cf. Ex. 8.5(b)), and a strictly convex function need not be uniformly
convex (cf. Ex. 8.5(d)).
Example 8.5. (a) We know from Calculus (cf. [Phi16a, Prop. 9.32(b)]) that a continuous function f : [a, b] −→ R that is twice differentiable on ]a, b[ is strictly convex
if f′′ > 0 on ]a, b[. Thus, f : R_0^+ −→ R, f(x) = x^p, p > 1, is strictly convex (note
f′′(x) = p(p − 1)x^{p−2} > 0 for x > 0).
4
In the literature, strong convexity is often defined as a special case of a more general notion of
uniform convexity.
(b) Let C be a convex subset of a normed vector space (X, k · k) over R. Then k · k :
C −→ R is convex by the triangle inequality, but not strictly convex if C contains
a segment S of a one-dimensional subspace: If 0 ≠ x_0 ∈ X and 0 ≤ a < b (a, b ∈ R)
such that S = {t x_0 : a ≤ t ≤ b} ⊆ C, then
∀ 0 ≤ α ≤ 1 :  ‖α ax_0 + (1 − α) bx_0‖ = (αa + (1 − α)b) ‖x_0‖ = α ‖ax_0‖ + (1 − α) ‖bx_0‖.
If ⟨·, ·⟩ is an inner product on X and ‖x‖ = √⟨x, x⟩, then
f : C −→ R,  f(x) := ‖x‖^2,
is uniformly convex: (8.4) holds with µ = 1, since ‖α x + (1 − α) y‖^2 + α (1 − α) ‖x − y‖^2 = α ‖x‖^2 + (1 − α) ‖y‖^2 for each x, y ∈ X and α ∈ [0, 1].
(c) The objective function of the linear least squares problem is convex and, if the
linear function F is injective, even strictly convex (exercise).
However, if we let x := 2/n and y := 1/n, then the above inequality is equivalent to
16/n^4 − 1/n^4 ≥ (4/n^3) · (1/n) + µ/n^2  ⇔  11/n^4 ≥ µ/n^2  ⇔  11/µ ≥ n^2,
showing it will always be violated for sufficiently large n.
Theorem 8.6. Let n ∈ N and let C ⊆ Rn be convex and open. Moreover, let f : C −→
R be continuously differentiable. Then the following holds:
In each case, the stated condition is still sufficient for the stated form of convexity if the
partials of f exist, but are not necessarily continuous.
We multiply (8.9a) by α, (8.9b) by (1 − α), and add the resulting inequalities, also using
x − z = (1 − α) (x − y), y − z = α (y − x),
to obtain:
α f(x) + (1 − α) f(y) − f(α x + (1 − α) y)
= α f(x) + (1 − α) f(y) − α f(z) − (1 − α) f(z)
≥ α ∇f(z) (1 − α) (x − y) − (1 − α) ∇f(z) α (x − y) + µ α (1 − α)^2 ‖x − y‖^2 + µ α^2 (1 − α) ‖x − y‖^2
= µ α (1 − α) ‖x − y‖^2. (8.10)
Thus, if (8.8) holds with µ > 0, then (8.10) holds with µ > 0, showing f to be uniformly
convex; if (8.6) holds, then (8.8) and (8.10) hold with µ = 0, showing f to be convex;
if (8.7) holds and α ∈]0, 1[, x 6= y, then (8.10) holds with µ = 0 and strict inequality,
showing f to be strictly convex.
Now assume f to be uniformly convex. Then there exist µ ∈ R+ and some norm k · k
on Rn such that, for each x, y ∈ C and each α ∈ [0, 1],
f(y + α(x − y)) = f(α x + (1 − α) y) ≤ α f(x) + (1 − α) f(y) − µ α (1 − α) ‖x − y‖^2, (8.11a)
proving (8.8). Setting µ := 0 in (8.11) shows (8.6) to hold for convex f . It remains
to prove the validity of (8.7) for f being strictly convex. The argument of (8.11) does
not suffice in this case, as we lose the strict inequality, when taking the limit in (8.11c).
Thus, we assume f to be strictly convex, x ≠ y, and proceed as follows: Letting
z := (1/2)(x + y) ∈ C,
the strict convexity of f yields
f(z) = f((1/2) x + (1/2) y) < (1/2) f(x) + (1/2) f(y).
Corollary 8.7. Let n ∈ N and let C ⊆ Rn be convex and open. Moreover, let f : C −→
R be continuously differentiable and let k · k denote a norm on Rn . If f is uniformly
convex and x∗ ∈ C is a local min of f (which, by Th. 8.9 below, must then be the unique
global min of f), then there exists µ ∈ R^+ such that
∀ x ∈ C :  f(x) ≥ f(x^*) + µ ‖x − x^*‖^2. (8.12)
Proof. If f has a local min at x∗ , then ∇ f (x∗ ) = 0 (cf. [Phi16b, Th. 5.13]). Due to this
fact and since all norms on Rn are equivalent, (8.12) is an immediate consequence of
Th. 8.6(c).
Theorem 8.8. Let (X, ‖·‖) be a normed vector space over R, C ⊆ X, and f : C −→ R.
Assume C is a convex set, and f is a convex function. Then every local min of f is a global min, and the set of mins of f is convex.
x0 + α (x − x0 ) = (1 − α) x0 + α x.
After subtracting f (x0 ) and dividing by α > 0, (8.13) yields f (x0 ) ≤ f (x), showing that
x0 is actually a global min as claimed.
(b): Let x0 ∈ C be a min of f . From (a), we already know that x0 must be a global
min. So, if x ∈ C is any min of f , then it follows that f (x) = f (x0 ). If α ∈ [0, 1], then
the convexity of f implies that
f (1 − α) x0 + α x ≤ (1 − α) f (x0 ) + α f (x) = f (x0 ). (8.14)
As x0 is a global min, (8.14) implies that (1 − α) x0 + α x is also a global min for each
α ∈ [0, 1], showing that the set of mins of f is convex as claimed.
Theorem 8.9. Let (X, k·k) be a normed vector space over R, C ⊆ X, and f : C −→ R.
Assume C is a convex set, and f is a strictly convex function. If x ∈ C is a local min
of f , then this is the unique local min of f , and, moreover, it is strict.
Proof. According to Th. 8.8, every local min of f is also a global min of f . Seeking a
contradiction, assume there is y ∈ C, y ≠ x, such that y is also a min of f. As x and y
are both global mins, f (x) = f (y) is implied. Define z := 12 (x + y). Then z ∈ C due to
the convexity of C. Moreover, due to the strict convexity of f ,
f(z) < (1/2)(f(x) + f(y)) = f(x) (8.15)
in contradiction to x being a global min. Thus, x must be the unique min of f , also
implying that the min must be strict.
where t_k ∈ R^+ is the stepsize and u^k ∈ R^n is the direction of the kth step. In general,
φ_k : I_k −→ R,  φ_k(t) := f(x^k + t u^k),  I_k := {t ∈ R_0^+ : x^k + t u^k ∈ G}, (8.17)
∀ k ∈ N :  u^k = −∇f(x^k), (8.18)
and the β_k ∈ R^+ are chosen such that (u^k)_{k∈{1,...,n}} forms an orthogonal system with
respect to a suitable inner product.
Remark 8.11. In the situation of Def. 8.10 above, if f is differentiable, then its direc-
tional derivative at xk ∈ G in the direction uk ∈ Rn is
φ′_k(0) = lim_{t↓0} (f(x^k + t u^k) − f(x^k)) / t = ∇f(x^k) · u^k. (8.20)
If ‖u^k‖_2 = 1, then the Cauchy-Schwarz inequality implies
|φ′_k(0)| = |∇f(x^k) · u^k| ≤ ‖∇f(x^k)‖_2, (8.21)
justifying the name method of steepest descent for the gradient method of Def. 8.10(c).
The next lemma shows it is, indeed, a descent method as defined in Def. 8.10(b), provided
that f is continuously differentiable.
Notation 8.12. If X is a real vector space and x, y ∈ X, then
then each stepsize direction iteration as in Def. 8.10(a) is a descent method according to
Def. 8.10(b). In particular, the gradient method of Def. 8.10(c) is a descent method.
We continue to remain in the setting of Def. 8.10. Suppose we are in step k of a descent
method, i.e. we know φk of (8.17) to be strictly decreasing in a neighborhood of 0.
Then the question remains how to choose the next stepsize t_k. One might want to
choose tk such that φk has a global (or at least a local) min at tk . A local min can, for
instance, be obtained by applying the golden section search of Alg. K.5 in the Appendix.
However, if function evaluations are expensive, then one might want to resort to some
simpler method, even if it does not provide a local min of φk . Another potential problem
with Alg. K.5 is that, in general, one has little control over the size of tk , which makes
convergence proofs difficult. The following Armijo rule has the advantage of being simple
and, at the same time, relating the size of tk to the size of ∇ f (xk ):
t_k := max { β^l : l ∈ N_0, x^k + β^l u^k ∈ G and (8.26) holds }, (8.25)
where
f(x^k + β^l u^k) ≤ f(x^k) + σ β^l ∇f(x^k) · u^k. (8.26)
In practice, since (β^l)_{l∈N_0} is strictly decreasing, one merely has to test the validity of
x^k + β^l u^k ∈ G and (8.26) for l = 0, 1, 2, . . . , stopping once (8.26) holds for the first time.
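The backtracking just described is easy to implement. The following sketch applies the Armijo rule within the steepest descent method for an illustrative smooth function f(x) = x_1^2 + 10 x_2^2 on G = R^2; the parameter values β = 1/2, σ = 1/4 are arbitrary choices in ]0, 1[, not prescribed by the rule.

```python
# Steepest descent with the Armijo stepsize rule on f(x) = x1^2 + 10*x2^2
# (illustrative test function, unique global min at the origin).

def f(x):
    return x[0] ** 2 + 10.0 * x[1] ** 2

def grad(x):
    return [2.0 * x[0], 20.0 * x[1]]

beta, sigma = 0.5, 0.25                     # Armijo parameters in ]0,1[
x = [1.0, 1.0]
for _ in range(300):
    g = grad(x)
    u = [-g[0], -g[1]]                      # steepest descent direction
    slope = g[0] * u[0] + g[1] * u[1]       # grad f(x) . u  (<= 0)
    t = 1.0                                 # test beta^l for l = 0, 1, 2, ...
    while f([x[0] + t * u[0], x[1] + t * u[1]]) > f(x) + sigma * t * slope:
        t *= beta                           # backtrack until (8.26) holds
    x = [x[0] + t * u[0], x[1] + t * u[1]]
```

Note that the backtracking loop always terminates: for small enough t, (8.26) holds by differentiability of f, and if ∇f(x) = 0, the condition is satisfied immediately.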
Lemma 8.15. Under the assumptions of Alg. 8.14, if u^k ≠ 0, then the max in (8.25)
exists. In particular, the Armijo rule is well-defined.
Thus, as f is differentiable,
In preparation for a convergence theorem for the gradient method (Th. 8.17 below), we
prove the following lemma:
Then
lim_{k→∞} (f(x^k + t_k u^k) − f(x^k)) / t_k = ∇f(x) · u. (8.28)
∃ ε > 0  ∃ N ∈ N  ∀ k > N :  y^k := x^k + t_k u^k ∈ B_ε(x^k) ⊆ G.
Thus, limk→∞ (f (xk+1 ) − f (xk )) = 0, and, using the Armijo rule again, we have
where σ ∈ ]0, 1[ is the parameter from the Armijo rule. Seeking a contradiction, we now
assume ∇f(x) ≠ 0. Then the continuity of ∇f yields lim_{j→∞} ∇f(x^{k_j}) = ∇f(x) ≠ 0,
which, together with (8.30), implies lim_{j→∞} t_{k_j} = 0. Moreover, for each j ∈ N, t_{k_j} = β^{l_j},
where β ∈ ]0, 1[ is the parameter from the Armijo rule and l_j ∈ N_0 is chosen according
to (8.25). Thus, for each sufficiently large j, m_j := l_j − 1 ∈ N_0 and
(here we have used that xkj + β mj ukj ∈ G = Rn ). Taking the limit for j → ∞ and
observing limj→∞ β mj = 0 in connection with Lem. 8.16 then yields
As a caveat, it is underlined that Th. 8.17 does not claim the sequence (xk )k∈N to have
cluster points. It merely states that, if cluster points exist, then they must be stationary
points. For convex differentiable functions, stationary points are minima:
with strict inequality if f is strictly convex. By the chain rule, φ is differentiable with
∀ t ∈ I :  φ′(t) = ∇f(x + t(y − x)) · (y − x).
Thus, we know from Calculus (cf. [Phi16a, Prop. 9.31]) that φ′ is (strictly) increasing
and, as φ′(0) = 0, φ must be (strictly) increasing on I ∩ R_0^+. Thus, f(y) ≥ f(x)
(f(y) > f(x)), showing x to be a (strict) global min.
Proof. Everything is immediate when combining Th. 8.17 with Prop. 8.18.
A disadvantage of the hypotheses of the previous results is that they do not guarantee
the convergence of the gradient method (or even the existence of local minima). An
important class of functions, where we can, indeed, guarantee both (together with an
error estimate that shows a linear convergence rate) is the class of convex quadratic
functions, which we will study next.
Definition 8.20. Let Q ∈ M(n, R) be symmetric and positive definite, a ∈ Rn , n ∈ N,
γ ∈ R. Then the function
f : R^n −→ R,  f(x) := (1/2) x^t Qx + a^t x + γ, (8.33)
where a, x are considered column vectors, is called the quadratic function given by Q,
a, and γ.
Proposition 8.21. Let Q ∈ M(n, R) be symmetric and positive definite, a ∈ Rn ,
n ∈ N, γ ∈ R. Then the function f of (8.33) has the following properties:
∇f : R^n −→ R^n,  ∇f(x) = x^t Q + a^t. (8.34)
Proof. (a): Since ⟨x, y⟩ := x^t Qy defines an inner product on R^n, the uniform convexity
of f follows from Ex. 8.5(b): Indeed, if we let
∀ x ∈ R^n :  ‖x‖ := √(x^t Qx),
then Ex. 8.5(b) implies, for each x, y ∈ R^n and each α ∈ [0, 1]:
f(α x + (1 − α) y) + (1/2) α (1 − α) ‖x − y‖^2
= (1/2) (α x + (1 − α) y)^t Q (α x + (1 − α) y) + a^t (α x + (1 − α) y) + γ + (1/2) α (1 − α) ‖x − y‖^2
≤ (1/2) α x^t Qx + (1/2) (1 − α) y^t Qy + α a^t x + (1 − α) a^t y + α γ + (1 − α) γ
= α f(x) + (1 − α) f(y).
(b): Exercise.
(c) follows from (a), (b), and Prop. 8.18, as f has a stationary point at x∗ = −Q−1 a.
(d): The computation in (8.32) with y − x replaced by u shows φ to be strictly convex.
Moreover,
Proof. Since Q is symmetric positive definite, we can write the eigenvalues in the form
0 < α = λ_1 ≤ · · · ≤ λ_n = β with corresponding orthonormal eigenvectors v_1, . . . , v_n.
Fix x ∈ R^n \ {0}. As the v_1, . . . , v_n form a basis,
∃ µ_1, . . . , µ_n ∈ R :  x = Σ_{i=1}^n µ_i v_i,  µ := Σ_{i=1}^n µ_i^2 > 0.
Thus,
F(x) = µ^2 / ((Σ_{i=1}^n λ_i µ_i^2)(Σ_{i=1}^n λ_i^{−1} µ_i^2)), (8.36)
and, by (8.37),
F(x) = 1 / (λ Σ_{i=1}^n γ_i λ_i^{−1}) ≥ αβ / (λ (α + β − λ)). (8.38)
To estimate F(x) further, we determine the global min of the differentiable function
φ : ]α, β[ −→ R,  φ(λ) = αβ / (λ (α + β − λ)).
The derivative is
φ′ : ]α, β[ −→ R,  φ′(λ) = − αβ (α + β − 2λ) / (λ^2 (α + β − λ)^2).
Let λ_0 := (α + β)/2. Clearly, φ′(λ) < 0 for λ < λ_0, φ′(λ_0) = 0, and φ′(λ) > 0 for λ > λ_0,
showing that φ has its global min at λ_0. Using this fact in (8.38) yields
F(x) ≥ αβ / (λ_0 (α + β − λ_0)) = 4αβ / (α + β)^2, (8.39)
completing the proof of (8.36).
where it was used that (g^k)^t Qx^k = ((g^k)^t Qx^k)^t = (x^k)^t Qg^k ∈ R. Next, we claim that
∀ k ∈ N :  f(x^{k+1}) − f(x^*) = (1 − ((g^k)^t g^k)^2 / (((g^k)^t Qg^k)((g^k)^t Q^{−1} g^k))) (f(x^k) − f(x^*)), (8.43)
which, in view of (8.42), is equivalent to
(1/2) ((g^k)^t g^k)^2 / ((g^k)^t Qg^k) = (((g^k)^t g^k)^2 / (((g^k)^t Qg^k)((g^k)^t Q^{−1} g^k))) (f(x^k) − f(x^*))
and
2 (f(x^k) − f(x^*)) = (g^k)^t Q^{−1} g^k. (8.44)
Indeed,
2 (f(x^k) − f(x^*)) = (x^k)^t Qx^k + 2a^t x^k + 2γ − a^t Q^{−1} QQ^{−1} a + 2a^t Q^{−1} a − 2γ
= (x^k)^t g^k + a^t x^k + a^t Q^{−1} a
= ((x^k)^t Q + a^t) Q^{−1} (Qx^k + a) = (g^k)^t Q^{−1} g^k,
which is (8.44). Applying the Kantorovich inequality (8.36) with x = g^k to (8.43) now
yields
∀ k ∈ N :  f(x^{k+1}) − f(x^*) ≤ (1 − 4αβ/(α + β)^2) (f(x^k) − f(x^*)) = ((β − α)/(α + β))^2 (f(x^k) − f(x^*)),
thereby establishing the case.
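The estimate just proved is easy to check numerically. The sketch below runs the gradient method with the exact stepsize on the illustrative quadratic f(x) = (1/2) x^t Q x with Q = diag(1, 10), a = 0, γ = 0 (so α = 1, β = 10, x* = 0, f(x*) = 0), and records the per-step ratio of f(x^{k+1}) − f(x*) to f(x^k) − f(x*), which by the theorem never exceeds ((β − α)/(α + β))^2.

```python
# Rate check for steepest descent with exact line search on a diagonal
# quadratic: each step must contract the error f(x^k) - f(x*) at least by
# the factor ((beta - alpha)/(alpha + beta))^2 = (9/11)^2.

alpha, beta = 1.0, 10.0
bound = ((beta - alpha) / (alpha + beta)) ** 2

def f(x):
    return 0.5 * (alpha * x[0] ** 2 + beta * x[1] ** 2)

x = [1.0, 1.0]
ratios = []
for _ in range(25):
    g = [alpha * x[0], beta * x[1]]          # gradient Qx (a = 0)
    gg = g[0] ** 2 + g[1] ** 2
    if gg == 0.0:
        break                                # exact minimum reached
    gQg = alpha * g[0] ** 2 + beta * g[1] ** 2
    t = gg / gQg                             # exact line search stepsize
    fx = f(x)
    x = [x[0] - t * g[0], x[1] - t * g[1]]
    ratios.append(f(x) / fx)                 # error contraction factor
```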
Remark 8.24. If Q ∈ M(n, R) is symmetric and positive definite, then the map
⟨·, ·⟩_Q : R^n × R^n −→ R,  ⟨x, y⟩_Q := x^t Qy, (8.47)
clearly forms an inner product on R^n. We call x, y ∈ R^n Q-orthogonal if, and only
if, they are orthogonal with respect to the inner product of (8.47), i.e. if, and only if,
⟨x, y⟩_Q = 0.
Proposition 8.25. Let Q ∈ M(n, R) be symmetric and positive definite, b ∈ Rn , n ∈ N.
If x1 , . . . , xn+1 ∈ Rn are given by a stepsize direction iteration according to Def. 8.10(a),
if the directions u1 , . . . , un ∈ Rn \ {0} are Q-orthogonal, i.e.
∀ k ∈ {2, . . . , n}  ∀ j < k :  ⟨u^k, u^j⟩_Q = 0, (8.48)
and if t1 , . . . , tn ∈ R are according to (8.40) with a replaced by −b, then xn+1 = Q−1 b,
i.e. the iteration converges to the exact solution to (8.45) in at most n steps (here, in
contrast to Def. 8.10(a), we do not require tk ∈ R+ – however, if, indeed, tk < 0, then
one could replace uk by −uk and tk by −tk > 0 to remain in the setting of Def. 8.10(a)).
implying
∀ k ∈ {1, . . . , n}  ∀ j ≤ k :  (g^{k+1})^t u^j = (g^{j+1})^t u^j + Σ_{i=j+1}^{k} (g^{i+1} − g^i)^t u^j = 0   (by (8.50), (8.51)),
Algorithm 8.26 (CG Method). Let Q ∈ M(n, R) be symmetric and positive definite,
b ∈ Rn , n ∈ N. The conjugate gradient (CG) method then defines a finite sequence
(x1 , . . . , xn+1 ) in Rn via the following recursion: Initialization: Choose x1 ∈ Rn (barring
further information on Q and b, choosing x1 = 0 is fine) and set
g^1 := Qx^1 − b,  u^1 := −g^1. (8.52)
t_k := ‖g^k‖_2^2 / ⟨u^k, u^k⟩_Q, (8.53a)
x^{k+1} := x^k + t_k u^k, (8.53b)
g^{k+1} := g^k + t_k Qu^k, (8.53c)
β_k := ‖g^{k+1}‖_2^2 / ‖g^k‖_2^2, (8.53d)
u^{k+1} := −g^{k+1} + β_k u^k. (8.53e)
(a) We will see in (8.54d) below that u^k ≠ 0 implies g^k ≠ 0, such that Alg. 8.26 is
well-defined.
(b) The most expensive operation in (8.53) is the matrix multiplication Qu^k, which is
O(n^2). As the result is needed in both (8.53a) and (8.53c), it should be temporarily
stored in some vector z^k := Qu^k. If the matrix Q is sparse, it might be possible to
reduce the cost of carrying out (8.53) significantly – for example, if Q is tridiagonal,
then the cost reduces from O(n^2) to O(n).
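For concreteness, here is a direct plain-Python transcription of (8.52) – (8.53) for an illustrative 3 × 3 symmetric positive definite Q, storing Qu^k once per step as suggested in (b); by (8.55), the iterate after at most n steps solves Qx = b up to rounding error.

```python
# CG method per (8.52)-(8.53): in exact arithmetic the solution of Qx = b
# is reached after at most n steps.  Q below is symmetric positive definite
# (an arbitrary small example).

Q = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
n = 3

def mv(M, v):                      # matrix-vector product
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))

x = [0.0] * n                                      # x^1 := 0
g = [gi - bi for gi, bi in zip(mv(Q, x), b)]       # g^1 = Qx^1 - b
u = [-gi for gi in g]                              # u^1 = -g^1
for _ in range(n):
    if dot(g, g) < 1e-30:
        break                      # g^k = 0: x^k already solves Qx = b
    z = mv(Q, u)                   # store Qu^k once per step (cf. (b))
    t = dot(g, g) / dot(u, z)                          # (8.53a)
    x = [xi + t * ui for xi, ui in zip(x, u)]          # (8.53b)
    g_new = [gi + t * zi for gi, zi in zip(g, z)]      # (8.53c)
    beta = dot(g_new, g_new) / dot(g, g)               # (8.53d)
    u = [-gi + beta * ui for gi, ui in zip(g_new, u)]  # (8.53e)
    g = g_new

residual = [ri - bi for ri, bi in zip(mv(Q, x), b)]    # Qx - b, nearly 0
```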
are given by Alg. 8.26, then the following orthogonality relations hold:
∀ k, j ∈ {1, . . . , n + 1}, j < k :  ⟨u^k, u^j⟩_Q = 0, (8.54a)
∀ k, j ∈ {1, . . . , n + 1}, j < k :  (g^k)^t g^j = 0, (8.54b)
∀ k, j ∈ {1, . . . , n + 1}, j < k :  (g^k)^t u^j = 0. (8.54c)
Moreover,
x^{n+1} = Q^{−1} b, (8.55)
i.e. the CG method converges to the exact solution to (8.45) in at most n steps.
Proof. Let
m := min ({k ∈ {1, . . . , n + 1} : u^k = 0} ∪ {n + 1}). (8.56)
A first induction shows
∀ k ∈ {1, . . . , m} :  g^k = Qx^k − b. (8.57)
Representations with base 16 (hexadecimal) and 8 (octal) are also of importance when
working with digital computers. More generally, each natural number b ≥ 2 can be used
as a base.
In most cases, it is understood that we work only with decimal representations such
that there is no confusion about the meaning of symbol strings like 101.01. However,
in general, 101.01 could also be meant with respect to any other base, and the number
represented by the same string of symbols obviously depends on the base used.
Thus, when working with different representations, one needs some notation to keep
track of the base.
Notation A.1. Given a natural number b ≥ 2 and finite sequences
(d_{N_1}, d_{N_1−1}, . . . , d_0) ∈ {0, . . . , b − 1}^{N_1+1}, (A.2a)
(e_1, e_2, . . . , e_{N_2}) ∈ {0, . . . , b − 1}^{N_2}, (A.2b)
(p_1, p_2, . . . , p_{N_3}) ∈ {0, . . . , b − 1}^{N_3}, (A.2c)
N1 , N2 , N3 ∈ N0 (where N2 = 0 or N3 = 0 is supposed to mean that the corresponding
sequence is empty), the respective string
(d_{N_1} d_{N_1−1} . . . d_0)_b for N_2 = N_3 = 0,
(d_{N_1} d_{N_1−1} . . . d_0 . e_1 . . . e_{N_2} p_1 . . . p_{N_3})_b for N_2 + N_3 > 0 (A.3)
represents the number
Σ_{ν=0}^{N_1} d_ν b^ν + Σ_{ν=1}^{N_2} e_ν b^{−ν} + Σ_{α=0}^{∞} Σ_{ν=1}^{N_3} p_ν b^{−N_2−αN_3−ν}. (A.4)
A REPRESENTATIONS OF REAL NUMBERS, ROUNDING ERRORS
is called a b-adic series. The number b is called the base or the radix, and the numbers
dν are called digits.
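Under the conventions of Notation A.1, the value (A.4) can be computed exactly with rational arithmetic. In the sketch below, the periodic block contributes (Σ_ν p_ν b^{−ν}) b^{−N_2} / (1 − b^{−N_3}), obtained by summing the geometric series over α; the function name and digit-list interface are merely illustrative.

```python
# Exact evaluation of (A.4) using rational arithmetic: d are the integer
# digits (most significant first), e the finite fractional digits, p the
# periodic block (possibly empty), b >= 2 the base.

from fractions import Fraction

def badic_value(d, e, p, b):
    val = Fraction(0)
    for digit in d:                            # integer part, Horner scheme
        val = val * b + digit
    for nu, digit in enumerate(e, start=1):    # finite fractional part
        val += Fraction(digit, b ** nu)
    if p:                                      # periodic part: geometric series
        block = sum(Fraction(digit, b ** nu) for nu, digit in enumerate(p, 1))
        val += block * Fraction(1, b ** len(e)) / (1 - Fraction(1, b ** len(p)))
    return val

# examples: (101.01)_2 = 21/4, and (0.3 3 3 ...)_10 = 1/3
```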
Lemma A.5. Given a natural number b ≥ 2, consider the b-adic series given by (A.6).
Then
Σ_{ν=0}^{∞} d_{N−ν} b^{N−ν} ≤ b^{N+1}, (A.7)
Proof. One estimates, using the formula for the value of a geometric series:
Σ_{ν=0}^{∞} d_{N−ν} b^{N−ν} ≤ (b − 1) Σ_{ν=0}^{∞} b^{N−ν} = (b − 1) b^N Σ_{ν=0}^{∞} b^{−ν} = (b − 1) b^N · 1/(1 − 1/b) = b^{N+1}. (A.8)
Note that (A.8) also shows that equality is achieved if all d_n are equal to b − 1. Conversely,
if there is n ∈ {N, N − 1, N − 2, . . . } such that d_n < b − 1, then there is ñ ∈ N such
that d_{N−ñ} < b − 1 and one estimates
Σ_{ν=0}^{∞} d_{N−ν} b^{N−ν} < Σ_{ν=0}^{ñ−1} d_{N−ν} b^{N−ν} + (b − 1) b^{N−ñ} + Σ_{ν=ñ+1}^{∞} d_{N−ν} b^{N−ν} ≤ b^{N+1}, (A.9)
the strings
0 . x_1 x_2 . . . x_l · b^N and −0 . x_1 x_2 . . . x_l · b^N (A.11)
are called floating-point representations of the (rational) numbers
x := b^N Σ_{ν=1}^{l} x_ν b^{−ν} and −x, (A.12)
Remark A.7. To get from the floating-point representation (A.11) to the form of (A.3),
one has to shift the radix point of the significand by a number of places equal to the
value of the exponent – to the right if the exponent is positive or to the left if the
exponent is negative.
Remark A.8. If one restricts Def. A.6 to the case N− = N+ , then one obtains what
is known as fixed-point representations. However, for many applications, the required
numbers vary over sufficiently many orders of magnitude to render fixed-point represen-
tations impractical.
Remark A.9. Given the assumptions of Def. A.6, for the numbers in (A.12), it always
holds that
b^{N_−−1} ≤ b^{N−1} ≤ Σ_{ν=1}^{l} x_ν b^{N−ν} = |x| ≤ (b − 1) Σ_{ν=1}^{l} b^{N_+−ν} =: max(l, b, N_+) < b^{N_+}, (A.13)
where the last estimate is due to Lem. A.5, provided that x ≠ 0. In other words:
fl_l(b, N_−, N_+) ⊆ Q ∩ ([− max(l, b, N_+), −b^{N_−−1}] ∪ {0} ∪ [b^{N_−−1}, max(l, b, N_+)]). (A.14)
Obviously, fll (b, N− , N+ ) is not closed under arithmetic operations. A result of absolute
value bigger than max(l, b, N+ ) is called an overflow, whereas a nonzero result of absolute
value less than bN− −1 is called an underflow. In practice, the result of an underflow is
usually replaced by 0.
Remark A.11. We note that the notation rd_l(x) of Def. A.10 is actually not entirely
correct, since rd_l is actually a function of the sequence (σ, N, x_1, x_2, . . . ) rather than of
x: It can actually occur that rd_l takes on different values for different representations of
the same number x (for example, using decimal notation, consider x = 0.3499. . . = 0.3500. . . ,
yielding rd_1(0.3499. . . ) = 0.3 and rd_1(0.3500. . . ) = 0.4). However, from basic results on b-adic
expansions of real numbers (see [Phi16a, Th. 7.99]), we know that x can have at
most two different b-adic representations and, for the same x, rd_l(x) can vary by at most
b^N b^{−l}. Since always writing rd_l(σ, N, x_1, x_2, . . . ) does not help with readability, and since
writing rdl (x) hardly ever causes confusion as to what is meant in a concrete case, the
abuse of notation introduced in Def. A.10 is quite commonly employed.
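For b = 10, the rounding map rd_l can be sketched as follows; exact rational arithmetic avoids binary floating-point artifacts, and the function assumes a positive input and rounds "half up", reproducing rd_1(0.349) = 0.3 but rd_1(0.35) = 0.4 as in the remark.

```python
# rd_l for positive rationals in base b = 10: determine N with
# b^(N-1) <= x < b^N, scale the significand to l digits, round half up.

from fractions import Fraction
import math

def rd(x, l, b=10):
    """Round the positive Fraction x to l significant base-b digits."""
    N = 0
    while Fraction(b) ** N <= x:          # increase N while b^N <= x
        N += 1
    while Fraction(b) ** (N - 1) > x:     # decrease N while b^(N-1) > x
        N -= 1
    scaled = x * Fraction(b) ** (l - N)            # l-digit significand
    rounded = math.floor(scaled + Fraction(1, 2))  # round half up
    return rounded * Fraction(b) ** (N - l)

# rd(0.349, 1) = 0.3 but rd(0.35, 1) = 0.4
```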
Proof. For x_{l+1} < b/2, there is nothing to prove. Thus, assume x_{l+1} ≥ b/2. Case
(i): There exists ν ∈ {1, . . . , l} such that x_ν < b − 1. Then, letting ν_0 := max{ν ∈ {1, . . . , l} : x_ν < b − 1}, one finds N′ = N and
x′_ν = x_ν for 1 ≤ ν < ν_0,   x′_ν = x_{ν_0} + 1 for ν = ν_0,   x′_ν = 0 for ν_0 < ν ≤ l. (A.17a)
Case (ii): x_ν = b − 1 holds for each ν ∈ {1, . . . , l}. Then one obtains N′ = N + 1,
x′_ν := 1 for ν = 1,   x′_ν := 0 for 1 < ν ≤ l, (A.17b)
where ⋄ can stand for any of the operations ‘+’, ‘−’, ‘·’, and ‘:’.
Remark A.14. Consider x, y ∈ fl_l(b, N_−, N_+). If x ⋄ y ∈ ⋃_{k=1}^{∞} fl_k(b, N_−, N_+), then,
according to Lem. A.12, x ⋄_l y as defined in (A.18) is either in fl_l(b, N_−, N_+) or in
fl_l(b, N_−, N_+ + 1) with the digits given by (A.17b) (which constitutes an overflow).
Remark A.15. The definition of (A.18) should only be taken as an example of how
floating-point operations can be realized. On concrete computer systems, the rounding
operations implemented can be different from the one considered here. However, one
would still expect a corresponding version of Rem. A.14 to hold.
Caveat A.16. Unfortunately, the associative laws of addition and multiplication as well
as the law of distributivity are lost for arithmetic operations of floating-point numbers.
More precisely, even if x, y, z ∈ fll (b, N− , N+ ) and one assumes that no overflow or
underflow occurs, then there are examples, where (x +_l y) +_l z ≠ x +_l (y +_l z), (x ·_l y) ·_l z ≠
x ·_l (y ·_l z), and x ·_l (y +_l z) ≠ x ·_l y +_l x ·_l z:
For l = 1, b = 10, N_− = 0, N_+ = 1, x = 0.1 · 10^1, y = 0.4 · 10^0, z = 0.1 · 10^0, one has
(x +_1 y) +_1 z = rd_1(0.1 · 10^1 + 0.4 · 10^0) +_1 0.1 · 10^0 = rd_1(0.14 · 10^1) +_1 0.1 · 10^0
= rd_1(0.1 · 10^1 + 0.1 · 10^0) = 0.1 · 10^1, (A.19)
but
x +_1 (y +_1 z) = 0.1 · 10^1 +_1 rd_1(0.4 · 10^0 + 0.1 · 10^0) = rd_1(0.1 · 10^1 + 0.5 · 10^0)
= rd_1(0.15 · 10^1) = 0.2 · 10^1. (A.20)
For l = 1, b = 10, N_− = 0, N_+ = 1, x = 0.8 · 10^1, y = 0.3 · 10^0, z = 0.5 · 10^0, one has
(x ·_1 y) ·_1 z = rd_1(0.8 · 10^1 · 0.3 · 10^0) ·_1 0.5 · 10^0 = rd_1(0.24 · 10^1) ·_1 0.5 · 10^0
= rd_1(0.2 · 10^1 · 0.5 · 10^0) = 0.1 · 10^1, (A.21)
but
x ·_1 (y ·_1 z) = 0.8 · 10^1 ·_1 rd_1(0.3 · 10^0 · 0.5 · 10^0) = 0.8 · 10^1 ·_1 rd_1(0.15 · 10^0)
= rd_1(0.8 · 10^1 · 0.2 · 10^0) = 0.2 · 10^1. (A.22)
For l = 1, b = 10, N_− = 0, N_+ = 1, x = 0.5 · 10^0, y = z = 0.3 · 10^0, one has
x ·_1 (y +_1 z) = 0.5 · 10^0 ·_1 rd_1(0.3 · 10^0 + 0.3 · 10^0) = rd_1(0.5 · 10^0 · 0.6 · 10^0) = 0.3 · 10^0, (A.23)
but
x ·_1 y +_1 x ·_1 z = rd_1(0.5 · 10^0 · 0.3 · 10^0) +_1 rd_1(0.5 · 10^0 · 0.3 · 10^0)
= 0.2 · 10^0 +_1 0.2 · 10^0 = 0.4 · 10^0. (A.24)
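These computations can be replayed mechanically. The sketch below simulates l = 1, b = 10 arithmetic with exact rationals, defining x +_1 y := rd_1(x + y), and reproduces the associativity failure of (A.19)/(A.20).

```python
# Simulated 1-significant-digit decimal arithmetic: (x +_1 y) +_1 z = 1.0
# but x +_1 (y +_1 z) = 2.0 for x = 1, y = 0.4, z = 0.1 (cf. (A.19), (A.20)).

from fractions import Fraction
import math

def rd1(x):
    # round a positive rational to 1 significant decimal digit, half up
    N = 0
    while Fraction(10) ** N <= x:
        N += 1
    while Fraction(10) ** (N - 1) > x:
        N -= 1
    return (math.floor(x * Fraction(10) ** (1 - N) + Fraction(1, 2))
            * Fraction(10) ** (N - 1))

def add1(x, y):                  # the floating-point addition x +_1 y
    return rd1(x + y)

x, y, z = Fraction(1), Fraction(4, 10), Fraction(1, 10)
left = add1(add1(x, y), z)       # (x +_1 y) +_1 z
right = add1(x, add1(y, z))      # x +_1 (y +_1 z)
```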
(b): Suppose x and y are given as in (A.25). Then Lem. A.17(a) implies N ≤ M .
Case xl+1 < b/2: In this case, according to (A.16),
rd_l(x) = b^N Σ_{ν=1}^{l} x_ν b^{−ν}. (A.27)
which establishes the case. We claim that (A.28) also holds for N = M , now due to
Lem. A.17(b): This is clear if xν = yν for each ν ∈ {1, . . . , l}. Otherwise, define
n := min{ν ∈ {1, . . . , l} : x_ν ≠ y_ν}. (A.29)
Then x < y and Lem. A.17(b) yield xn < yn . Moreover, another application of Lem.
A.17(b) then implies (A.28) for this case.
Case x_{l+1} ≥ b/2: In this case, according to Lem. A.12,
rd_l(x) = b^{N′} Σ_{ν=1}^{l} x′_ν b^{−ν}, (A.30)
where either N ′ = N and the x′ν are given by (A.17a), or N ′ = N + 1 and the x′ν are
given by (A.17b). For N < M, we obtain
rd_l(y) ≥ b^M Σ_{ν=1}^{l} y_ν b^{−ν} ≥ (∗) b^{N′} Σ_{ν=1}^{l} x′_ν b^{−ν} = rd_l(x), (A.31)
where, for N ′ < M , (∗) holds by Lem. A.17(a), and, for N ′ = N + 1 = M , (∗) holds
by (A.17b). It remains to consider N = M . If xν = yν for each ν ∈ {1, . . . , l}, then
x < y and Lem. A.17(b) yield b/2 ≤ xl+1 ≤ yl+1 , which, in turn, yields rdl (y) = rdl (x).
Otherwise, once more define n according to (A.29). As before, x < y and Lem. A.17(b)
yield xn < yn . From Lem. A.12, we obtain N ′ = N and that the x′ν are given by (A.17a)
with ν0 ≥ n. Then the values from (A.17a) show that (A.31) holds true once again.
(c) follows by combining (a) and (b).
Definition and Remark A.19. Let b ∈ N, b ≥ 2, l ∈ N, and N− , N+ ∈ Z with
N− ≤ N+ . If there exists a smallest positive number ǫ ∈ fll (b, N− , N+ ) such that
1 +l ǫ 6= 1, (A.32)
then it is called the relative machine precision. We will show that, for N+ < −l + 1, one
has 1 +l y = 1 for every 0 < y ∈ fll (b, N− , N+ ), such that there is no positive number in
fll (b, N− , N+ ) satisfying (A.32), whereas, for N+ ≥ −l + 1:
ǫ = { b^{N₋−1}      for −l + 1 < N₋,
      ⌈b/2⌉ b^{−l}  for −l + 1 ≥ N₋,   (A.33)
where

y_i { = 1                 for i = 1,
      = 0                 for 1 < i < l + 1,
      < ⌈b/2⌉             for i = l + 1,
      ∈ {0, . . . , b − 1}  for l + 1 < i.   (A.34)
As yl+1 < b/2, the definition of rounding in (A.16) yields:
1 +_l y = rd_l(1 + y) = b^1 · ∑_{i=1}^{l} y_i b^{−i} = b^1 · b^{−1} = 1,   (A.35)
where
x_i := { 1       for i = 1,
         ⌈b/2⌉   for i = l + 1,
         0       for i ≠ 1, l + 1.   (A.38)
Using (A.37) and the definition of rounding in (A.16), we have
1 +_l ⌈b/2⌉b^{−l} = rd_l(1 + ⌈b/2⌉b^{−l}) = 1 + b^{−l+1} > 1.   (A.39)
Now let −l + 1 < N₋. Due to (A.14), b^{N₋−1} is the smallest positive number contained in fl_l(b, N₋, N₊). In addition, due to b^{N₋−1} ≥ b^{−l+1} > ⌈b/2⌉b^{−l}, we have

1 +_l b^{N₋−1} = rd_l(1 + b^{N₋−1}) ≥ rd_l(1 + ⌈b/2⌉b^{−l}) = 1 + b^{−l+1} > 1,   (A.40)

where the inequality holds by Lem. A.18(b) and the last two steps by (A.39), implying ǫ = b^{N₋−1}.
For −l + 1 ≥ N₋, we have ⌈b/2⌉b^{−l} ∈ fl_l(b, N₋, N₊) due to ⌈b/2⌉b^{−l} = 0.⌈b/2⌉ · b^{−l+1} (where N₊ ≥ −l + 1 was used as well). As we have already shown that 1 +_l y = 1 for each 0 < y < ⌈b/2⌉b^{−l}, ǫ = ⌈b/2⌉b^{−l} is now a consequence of (A.39).
Proof. (a): First, consider the case x_{l+1} < b/2: One computes

e_a(x) = |rd_l(x) − x| = −σ (rd_l(x) − x) = b^N ∑_{ν=l+1}^{∞} x_ν b^{−ν}
       = b^{N−l−1} x_{l+1} + b^N ∑_{ν=l+2}^{∞} x_ν b^{−ν}
       ≤ b^{N−l−1} (b/2 − 1) + b^{N−l−1} = b^{N−l}/2,   (A.43a)

using that b is even and (A.7). It remains to consider the case x_{l+1} ≥ b/2. In that case, one obtains

σ (rd_l(x) − x) = b^{N−l} − b^{N−l−1} x_{l+1} − b^N ∑_{ν=l+2}^{∞} x_ν b^{−ν}
               = b^{N−l−1} (b − x_{l+1}) − b^N ∑_{ν=l+2}^{∞} x_ν b^{−ν} ≤ b^{N−l}/2.   (A.43b)

Due to 1 ≤ b − x_{l+1}, one has b^{N−l−1} ≤ b^{N−l−1} (b − x_{l+1}). Therefore, b^N ∑_{ν=l+2}^{∞} x_ν b^{−ν} ≤ b^{N−l−1} together with (A.43b) implies σ (rd_l(x) − x) ≥ 0, i.e. e_a(x) = σ (rd_l(x) − x).
(b): Since x ≠ 0, one has x₁ ≥ 1. Thus, |x| ≥ b^{N−1}. Then (a) yields

e_r = e_a / |x| ≤ (b^{N−l}/2) · b^{−N+1} = b^{−l+1}/2.   (A.44)

(c): Again, x₁ ≥ 1 as x ≠ 0. Thus, (A.16) implies |rd_l(x)| ≥ b^{N−1}. This time, (a) yields

e_a / |rd_l(x)| ≤ (b^{N−l}/2) · b^{−N+1} = b^{−l+1}/2,   (A.45)

establishing the case and completing the proof of the proposition.
Corollary A.22. In the situation of Prop. A.21, let, for x ≠ 0:

ǫ_l(x) := (rd_l(x) − x)/x,   η_l(x) := (rd_l(x) − x)/rd_l(x).   (A.46)

Then

max{ |ǫ_l(x)|, |η_l(x)| } ≤ b^{−l+1}/2 =: τ_l.   (A.47)

The number τ_l is called the relative computing precision of floating-point arithmetic with l significant digits.
Remark A.23. From the definitions of ǫ_l(x) and η_l(x) in (A.46), one immediately obtains the relations

rd_l(x) = x (1 + ǫ_l(x)) = x / (1 − η_l(x))   for x ≠ 0.   (A.48)

In particular, according to (A.18), one has for floating-point operations (for x ⋄ y ≠ 0):

x ⋄_l y = rd_l(x ⋄ y) = (x ⋄ y)(1 + ǫ_l(x ⋄ y)) = (x ⋄ y) / (1 − η_l(x ⋄ y)).   (A.49)
One can use the formulas of Rem. A.23 to perform what is sometimes called a forward
analysis of the rounding error. This technique is illustrated in the next example.
Example A.24. Let x := 0.9995 · 10^0 and y := −0.9984 · 10^0. Computing the sum with precision 3 yields

rd3(x) +3 rd3(y) = 0.100 · 10^1 +3 (−0.998 · 10^0) = rd3(0.2 · 10^{−2}) = 0.2 · 10^{−2}.

Letting ǫ := ǫ3(rd3(x) + rd3(y)), applying the formulas of Rem. A.23 provides:

rd3(x) +3 rd3(y) = (rd3(x) + rd3(y))(1 + ǫ)                    (by (A.49))
                 = (x (1 + ǫ3(x)) + y (1 + ǫ3(y)))(1 + ǫ)       (by (A.48))
                 = (x + y) + e_a,   (A.50)

where

e_a = x (ǫ + ǫ3(x)(1 + ǫ)) + y (ǫ + ǫ3(y)(1 + ǫ)).   (A.51)

Using (A.46) and plugging in the numbers yields:

ǫ = (rd3(x) +3 rd3(y) − (rd3(x) + rd3(y))) / (rd3(x) + rd3(y)) = (0.002 − 0.002)/0.002 = 0,
ǫ3(x) = (1 − 0.9995)/0.9995 = 1/1999 = 0.00050 . . . ,
ǫ3(y) = (−0.998 + 0.9984)/(−0.9984) = −1/2496 = −0.00040 . . . ,
e_a = x ǫ3(x) + y ǫ3(y) = 0.0005 + 0.0004 = 0.0009.

Thus, the relative error e_r = e_a/|x + y| = 0.0009/0.0011 ≈ 0.82 is much larger than both e_r(x) and e_r(y). This is an example of subtractive cancellation of digits, which can occur when subtracting numbers that are almost identical. If possible, such situations should be avoided in practice (cf. Examples A.25 and A.26 below).
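The numbers of Example A.24 can be checked directly. The sketch below (again using a hypothetical rounding helper `rd`, not code from the text) reproduces the catastrophic relative error:

```python
from decimal import Decimal, ROUND_HALF_UP

def rd(x, l):
    """Round x to l significant decimal digits, half away from zero (cf. (A.16))."""
    d = Decimal(str(x))
    q = Decimal(1).scaleb(d.adjusted() - l + 1)
    return float(d.quantize(q, rounding=ROUND_HALF_UP))

x, y = 0.9995, -0.9984
exact = x + y                                  # 0.0011
computed = rd(rd(x, 3) + rd(y, 3), 3)          # rd3(1.00 - 0.998) = 0.002
rel_err = abs(computed - exact) / abs(exact)   # ~0.82, as computed in the text
print(computed, exact, rel_err)
```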
Example A.25. Let us generalize the situation of Example A.24 to general x, y ∈ R\{0} and a general precision l ∈ N. The formulas in (A.50) and (A.51) remain valid if one replaces 3 by l. Moreover, with b = 10 and l ≥ 2, one obtains from (A.47) that |ǫ|, |ǫ_l(x)|, |ǫ_l(y)| ≤ τ_l ≤ 0.05, and, thus, 1 + |ǫ| ≤ 1.05. Thus,

|e_a| ≤ |x| (|ǫ| + 1.05 |ǫ_l(x)|) + |y| (|ǫ| + 1.05 |ǫ_l(y)|),

and

e_r = |e_a| / |x + y| ≤ (|x|/|x + y|)(|ǫ| + 1.05 |ǫ_l(x)|) + (|y|/|x + y|)(|ǫ| + 1.05 |ǫ_l(y)|).
One can now distinguish three (not completely disjoint) cases:
(a) If |x + y| < max{|x|, |y|} (in particular, if sgn(x) = − sgn(y)), then er is typically
larger (potentially much larger, as in Example A.24) than |ǫl (x)| and |ǫl (y)|. Sub-
tractive cancellation falls into this case. If possible, this should be avoided (cf.
Example A.26 below).
(c) If |y| ≪ |x| (resp. |x| ≪ |y|), then the bound for er is predominantly determined by
|ǫl (x)| (resp. |ǫl (y)|) (error dampening).
As the following Example A.26 illustrates, subtractive cancellation can often be avoided by rearranging an expression into a mathematically equivalent formula that is numerically more stable. A classical instance is the solution of the quadratic equation

ax^2 + bx + c = 0.
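One standard rearrangement (a sketch under the assumption of real roots; `quadratic_roots` is a hypothetical helper, and the text's own Example A.26 may proceed differently) avoids the cancellation in (−b + √(b² − 4ac))/(2a) when b² ≫ 4ac by computing the large-magnitude root first and recovering the other from Vieta's formula x₁x₂ = c/a:

```python
import math

def quadratic_roots(a, b, c):
    """Real roots of a*x^2 + b*x + c = 0, avoiding subtractive cancellation:
    the root of larger magnitude is computed without cancellation, the other
    via Vieta (x1 * x2 = c/a)."""
    disc = math.sqrt(b * b - 4.0 * a * c)
    q = -(b + math.copysign(disc, b)) / 2.0
    x1 = q / a          # larger-magnitude root, no cancellation
    x2 = c / q          # Vieta's formula for the small root
    return x1, x2

# b^2 >> 4ac: the naive formula would lose most digits of the small root
print(quadratic_roots(1.0, -1e8, 1.0))   # roots near 1e8 and 1e-8
```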
B OPERATOR NORMS AND NATURAL MATRIX NORMS 192
(a) A is bounded.
(d) A is continuous.
Proof. Since every Lipschitz continuous map is continuous and since every continuous
map is continuous at every point, “(c) ⇒ (d) ⇒ (e)” is clear.
“(e) ⇒ (c)”: Let x0 ∈ X be such that A is continuous at x0. Thus, for each ǫ > 0, there is δ > 0 such that ‖x − x0‖ < δ implies ‖Ax − Ax0‖ < ǫ. As A is linear, for each x ∈ X with ‖x‖ < δ, one has ‖Ax‖ = ‖A(x + x0) − Ax0‖ < ǫ, due to ‖x + x0 − x0‖ = ‖x‖ < δ. Moreover, one has ‖(δx)/2‖ ≤ δ/2 < δ for each x ∈ X with ‖x‖ ≤ 1. Letting L := 2ǫ/δ, this means that ‖Ax‖ = ‖A((δx)/2)‖/(δ/2) < 2ǫ/δ = L for each x ∈ X with ‖x‖ ≤ 1. Thus, for each x, y ∈ X with x ≠ y, one has

‖Ax − Ay‖ = ‖A(x − y)‖ = ‖x − y‖ · ‖A((x − y)/‖x − y‖)‖ < L ‖x − y‖.   (B.1)

Together with the fact that ‖Ax − Ay‖ ≤ L ‖x − y‖ is trivially true for x = y, this shows that A is Lipschitz continuous.
(a) The operator norm does, indeed, constitute a norm on the set of bounded linear maps L(X, Y).

(b) If A ∈ L(X, Y), then ‖A‖ is the smallest Lipschitz constant for A, i.e. ‖A‖ is a Lipschitz constant for A and ‖Ax − Ay‖ ≤ L ‖x − y‖ for each x, y ∈ X implies ‖A‖ ≤ L.
yielding

‖λA‖ = sup{ ‖(λA)x‖ : x ∈ X, ‖x‖ = 1 } = sup{ |λ| ‖Ax‖ : x ∈ X, ‖x‖ = 1 }
     = |λ| sup{ ‖Ax‖ : x ∈ X, ‖x‖ = 1 } = |λ| ‖A‖,   (B.6)

showing that the operator norm is homogeneous of degree 1. Finally, if A, B ∈ L(X, Y) and x ∈ X, then

‖(A + B)x‖ = ‖Ax + Bx‖ ≤ ‖Ax‖ + ‖Bx‖,   (B.7)

yielding

‖A + B‖ = sup{ ‖(A + B)x‖ : x ∈ X, ‖x‖ = 1 }
        ≤ sup{ ‖Ax‖ + ‖Bx‖ : x ∈ X, ‖x‖ = 1 }
        ≤ sup{ ‖Ax‖ : x ∈ X, ‖x‖ = 1 } + sup{ ‖Bx‖ : x ∈ X, ‖x‖ = 1 }
        = ‖A‖ + ‖B‖,   (B.8)

showing that the operator norm also satisfies the triangle inequality, thereby completing the verification that it is, indeed, a norm.
(b): To see that ‖A‖ is a Lipschitz constant for A, we have to show

‖Ax − Ay‖ ≤ ‖A‖ ‖x − y‖   for each x, y ∈ X.   (B.9)

For x = y, there is nothing to prove. Thus, let x ≠ y. One computes

‖Ax − Ay‖ / ‖x − y‖ = ‖A((x − y)/‖x − y‖)‖ ≤ ‖A‖,

as ‖(x − y)/‖x − y‖‖ = 1, thereby establishing (B.9). Now let L ∈ R₀⁺ be such that ‖Ax − Ay‖ ≤ L ‖x − y‖ for each x, y ∈ X. Specializing to y = 0 and ‖x‖ = 1 implies ‖Ax‖ ≤ L ‖x‖ = L, showing ‖A‖ ≤ L.
C Integration
Remark C.2. The linearity of the K-valued integral implies the linearity of the Km -
valued integral.
Proof. Assume f⁻¹(O) is measurable for each open subset O of Kᵐ. Let j ∈ {1, . . . , m}. If O_j ⊆ K is open in K, then O := π_j⁻¹(O_j) = {x ∈ Kᵐ : x_j ∈ O_j} is open in Kᵐ. Thus, f_j⁻¹(O_j) = f⁻¹(O) is measurable, showing that each f_j is measurable, i.e. f is measurable. Now assume f is measurable, i.e. each f_j is measurable. Since every open O ⊆ Kᵐ is a countable union of open sets of the form O = O₁ × · · · × O_m with each O_j being an open subset of K, it suffices to show that the preimages of such open sets are measurable. So let O be as above. Then f⁻¹(O) = ∩_{j=1}^{m} f_j⁻¹(O_j), showing that f⁻¹(O) is measurable.
where, at (∗), it was used that, as the Bj are disjoint, the integrands of the two integrals
are equal at each x ∈ A.
Now, if f is integrable, then each Re f_j and each Im f_j, j ∈ {1, . . . , m}, is integrable (i.e., for each j ∈ {1, . . . , m}, Re f_j, Im f_j ∈ L¹(A)) and there exist sequences of simple functions φ_{j,k} : A −→ R and ψ_{j,k} : A −→ R such that lim_{k→∞} ‖φ_{j,k} − Re f_j‖_{L¹(A)} = lim_{k→∞} ‖ψ_{j,k} − Im f_j‖_{L¹(A)} = 0. In particular, for each j ∈ {1, . . . , m},

0 ≤ lim_{k→∞} | ∫_A φ_{j,k} − ∫_A Re f_j | ≤ lim_{k→∞} ‖φ_{j,k} − Re f_j‖_{L¹(A)} = 0,   (C.5a)
0 ≤ lim_{k→∞} | ∫_A ψ_{j,k} − ∫_A Im f_j | ≤ lim_{k→∞} ‖ψ_{j,k} − Im f_j‖_{L¹(A)} = 0,   (C.5b)

we obtain (writing φ_k := (φ_{1,k}, . . . , φ_{m,k}) and ψ_k := (ψ_{1,k}, . . . , ψ_{m,k}))

‖ ∫_A f ‖ = ‖ ( ∫_A f₁, . . . , ∫_A f_m ) ‖
          = ‖ ( lim_{k→∞} ∫_A φ_{1,k} + i lim_{k→∞} ∫_A ψ_{1,k}, . . . , lim_{k→∞} ∫_A φ_{m,k} + i lim_{k→∞} ∫_A ψ_{m,k} ) ‖
          =(∗) lim_{k→∞} ‖ ∫_A (φ_k + i ψ_k) ‖ ≤ lim_{k→∞} ∫_A ‖φ_k + i ψ_k‖ = ∫_A ‖f‖,   (C.7)

where the equality at (∗) holds due to lim_{k→∞} ‖ ‖φ_k + iψ_k‖ − ‖f‖ ‖_{L¹(A)} = 0, which, in turn, is verified by

0 ≤ ∫_A | ‖φ_k + iψ_k‖ − ‖f‖ | ≤ ∫_A ‖φ_k + iψ_k − f‖ ≤ C ∫_A ‖φ_k + iψ_k − f‖₁
  = C ∫_A ∑_{j=1}^{m} |φ_{j,k} + iψ_{j,k} − f_j| → 0   for k → ∞,   (C.8)
C.2 Continuity
Theorem C.6. Let a, b ∈ R, a < b. If f : [a, b] −→ R is (Lebesgue-) integrable, then

F : [a, b] −→ R,   F(t) := ∫_a^t f,   (C.9)

is continuous.

Proof. We show F to be continuous at each t ∈ [a, b]: Let (t_k)_{k∈N} be a sequence in [a, b] such that lim_{k→∞} t_k = t. Let f_k := f χ_{[a,t_k]}. Then f_k → f χ_{[a,t]} (pointwise everywhere, except, possibly, at t) and the dominated convergence theorem (DCT) yields (note |f_k| ≤ |f|)

lim_{k→∞} F(t_k) = lim_{k→∞} ∫_a^{t_k} f = ∫_a^t f = F(t),

proving the continuity of F at t.
D Divided Differences
The present section provides some additional results regarding the divided differences
of Def. 3.16:
Theorem D.1 (Product Rule). Let F be a field and consider the situation of Def. 3.16. Moreover, for each j ∈ {0, . . . , r}, m ∈ {0, . . . , m_j − 1}, let y_{f,j}^{(m)}, y_{g,j}^{(m)}, y_{h,j}^{(m)} ∈ F be such that

y_{h,j}^{(m)} = ∑_{k=0}^{m} \binom{m}{k} y_{f,j}^{(m−k)} y_{g,j}^{(k)}.   (D.1)
Let P_f, P_g, P_h ∈ F[X]_n be the interpolating polynomials corresponding to the case, where the y_j^{(m)} of Def. 3.16 are replaced by y_{f,j}^{(m)}, y_{g,j}^{(m)}, y_{h,j}^{(m)}, respectively, with

and

Q := ∑_{i,j=0, i≤j}^{n} [x_0, . . . , x_i]f · [x_j, . . . , x_n]g · ω_{f,i} ω_{g,j} ∈ F[X]_n,
implying, by (D.3),

Q^{(m)}(x̃_α) = (P_f P_g)^{(m)}(x̃_α) = y_{h,α}^{(m)}   for each α ∈ {0, . . . , r}, m ∈ {0, . . . , m_α − 1},

and, thus, Q = P_h. In Q, the summands with deg(ω_{f,i} ω_{g,j}) = n are precisely those with i = j, thereby proving (D.2).
The following result is proved under the additional assumption that x0 , . . . , xn ∈ F are
all distinct:
Proposition D.2. Let F be a field and consider the situation of Def. 3.16 with r = n ∈
N0 , i.e. with x̃0 , . . . , x̃n ∈ F all being distinct. Moreover, let (x0 , . . . , xn ) := (x̃0 , . . . , x̃n ).
Then, according to Def. 3.12, P[x_0, . . . , x_n] denotes the unique polynomial f ∈ F[X]_n
that, given y0 , . . . , yn ∈ F , satisfies
f(x_j) = y_j^{(0)} := y_j   for each j ∈ {0, . . . , n};   (D.4)
So it remains to verify (D.6). To this end, we show that, for each i ∈ {0, . . . , n}, the
coefficients of yi on both sides of (D.6) are identical.
Case i = n: Since y_n occurs only in the first sum on the right-hand side of (D.6), one has to show that

∏_{m=0, m≠n}^{n} 1/(x_n − x_m) = (1/(x_n − x_0)) · ∏_{m=1, m≠n}^{n} 1/(x_n − x_m).   (D.7)
Multiplying (D.9) by

∏_{m=0, m≠i,0,n}^{n} (x_i − x_m)
where

A = ( 1   1         1           . . .   1
      1   ω         ω^2         . . .   ω^{n−1}
      1   ω^2       ω^4         . . .   ω^{2(n−1)}
      ⋮   ⋮         ⋮           ⋱       ⋮
      1   ω^{n−1}   ω^{2(n−1)}  . . .   ω^{(n−1)²} ),   (E.3a)

that means A = (a_{kj}) ∈ M(n, C) is a symmetric complex matrix with

a_{kj} = ω^{kj} = e^{−ijk2π/n},   ω := e^{−i2π/n},   for each k, j ∈ {0, . . . , n − 1}.   (E.3b)
(a) The columns of B form an orthonormal basis with respect to the standard inner
product on Cn .
(c) DFT, as defined in (E.1), is invertible and its inverse map, called inverse DFT, is given by

F⁻¹ : Cⁿ −→ Cⁿ,   (d₀, . . . , d_{n−1}) ↦ (f₀, . . . , f_{n−1}),   (E.4a)

where

f_k := ∑_{j=0}^{n−1} d_j e^{ijk2π/n}   for each k ∈ {0, . . . , n − 1}.   (E.4b)
i.e.

‖F(f)‖₂ = (1/√n) ‖f‖₂.   (E.5b)
Proof. (a): For each k = 0, . . . , n − 1, we let b_k denote the kth column of B. Then, noting |ω| = 1,

⟨b_k, b_k⟩ = ∑_{j=0}^{n−1} \bar{b}_{jk} b_{jk} = (1/n) ∑_{j=0}^{n−1} |ω|^{2jk} = n/n = 1.
E DFT AND TRIGONOMETRIC INTERPOLATION 203
Moreover, for each k, l ∈ {0, . . . , n − 1} with k ≠ l, we note ω^{l−k} ≠ 1, but (ω^{l−k})^n = 1, and we use the geometric sum to obtain

⟨b_k, b_l⟩ = ∑_{j=0}^{n−1} \bar{b}_{jk} b_{jl} = (1/n) ∑_{j=0}^{n−1} ω^{−jk} ω^{jl} = (1/n) ∑_{j=0}^{n−1} (ω^{l−k})^j = (1/n) · (1 − (ω^{l−k})^n)/(1 − ω^{l−k}) = 0,

proving (a).
(b) is immediate due to (a) and the equivalences (i) and (iii) of [Phi19b, Pro. 10.36].
(c) follows from (b), as

f := F⁻¹(d) = ((1/√n) B)⁻¹ d = √n B∗ d   for each d ∈ Cⁿ,   (E.6)

and

√n b∗_{kj} = (√n/√n) \bar{a}_{jk} = e^{ijk2π/n}   for each k, j ∈ {0, . . . , n − 1}.
(d): According to [Phi19b, Pro. 10.36(viii)], B is isometric with respect to ‖ · ‖₂. Thus,

‖F(f)‖₂ = ‖(1/√n) B f‖₂ = (1/√n) ‖f‖₂,
As it turns out, the inverse DFT of Prop. E.3(c) has the useful property that there are
several simple ways to express it in terms of (forward) DFT:
Proposition E.4. Let n ∈ N and let F denote DFT as defined in (E.1).
(c) Let σ be the map that swaps the real and imaginary parts of a complex number, i.e.

σ : C −→ C,   σ(a + bi) := b + ai = i · \overline{(a + bi)}.

Using componentwise application, we can also consider σ to be defined on Cⁿ:

σ : Cⁿ −→ Cⁿ,   σ(z₀, . . . , z_{n−1}) := (σ(z₀), . . . , σ(z_{n−1})) = i \bar{z}.

Then one has

F⁻¹(d) = n σ(F(σd))   for each d ∈ Cⁿ.
Proof. (a): We let f = (f₀, . . . , f_{n−1}) := n F(d₀, d_{n−1}, d_{n−2}, . . . , d₁) and compute for each k = 0, . . . , n − 1:

f_k = d₀ + ∑_{j=1}^{n−1} d_{n−j} e^{−ijk2π/n} = d₀ + ∑_{j=1}^{n−1} d_j e^{−i(n−j)k2π/n}
    = d₀ + ∑_{j=1}^{n−1} d_j e^{ijk2π/n} e^{−ik2π} = ∑_{j=0}^{n−1} d_j e^{ijk2π/n},

using (E.1b) in the first step and e^{−ik2π} = 1 in the last, which, compared with (E.4b), proves (a).
(b): This time, letting f := n F(\bar{d}), (E.1b) yields, for each k = 0, . . . , n − 1,

f_k = ∑_{j=0}^{n−1} \bar{d}_j e^{−ijk2π/n} = \overline{ ∑_{j=0}^{n−1} d_j e^{ijk2π/n} }.
Taking the complex conjugate of fk and comparing with (E.4) proves (b).
(c): Using the C-linearity of DFT and (b), we compute

n σ(F(σd)) = n σ(F(i \bar{d})) = n i \overline{i F(\bar{d})} = n \overline{F(\bar{d})} = F⁻¹(d),

where the last equality holds by (b).
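The identity of Prop. E.4(c) can be checked numerically. The sketch below assumes NumPy and the text's normalization, in which F carries the 1/n factor of (E.1), so that F corresponds to `np.fft.fft(f)/n` and F⁻¹ to `n * np.fft.ifft(d)`; σ(z) = i·conj(z) swaps real and imaginary parts:

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
d = rng.standard_normal(n) + 1j * rng.standard_normal(n)

sigma = lambda z: 1j * np.conj(z)            # sigma(a + bi) = b + ai
lhs = n * np.fft.ifft(d)                     # F^{-1}(d) in the text's normalization
rhs = n * sigma(np.fft.fft(sigma(d)) / n)    # n * sigma(F(sigma d))
assert np.allclose(lhs, rhs)
print("Prop. E.4(c) identity verified")
```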
cos z = (e^{iz} + e^{−iz})/2   for each z ∈ C,   (E.7b)
sin z = (e^{iz} − e^{−iz})/(2i)   for each z ∈ C.   (E.7c)
c_j = a_j + i b_j   for each j ∈ Z.   (E.9)
We call the Sn the Fourier partial sums of f and the series of (E.11) the Fourier
series of f .
Remark E.6. (a) We verify that the Fourier partial sums Sn of (E.10) are, indeed,
real-valued. The key observation is an identity that we will prove in a form so that
we can use it both here and, later, in the proof of Th. E.13(a). To this end, we
fix n ∈ N and assume, for each j ∈ {−n, . . . , n}, we have numbers aj , bj ∈ R and
cj ∈ C (not necessarily given by (E.8)), satisfying
c_j = a_j + i b_j   for each j ∈ {−n, . . . , n},   and   a_j = a_{−j},  b_j = −b_{−j}   for each j ∈ {−n+1, . . . , n−1} \ {0}.   (E.12)
∑_{j=−n}^{n} c_j e^{ij2πx/L}
  = (a_{−n} + i b_{−n}) e^{−in2πx/L} + a₀ + i b₀ + (a_n + i b_n) e^{in2πx/L}
    + ∑_{j=1}^{n−1} a_j ( e^{ij2πx/L} + e^{−ij2πx/L} ) + i ∑_{j=1}^{n−1} b_j ( e^{ij2πx/L} − e^{−ij2πx/L} )
  = (a_{−n} + i b_{−n}) e^{−in2πx/L} + a₀ + i b₀ + (a_n + i b_n) e^{in2πx/L}
    + 2 ∑_{j=1}^{n−1} a_j cos(j2πx/L) − 2 ∑_{j=1}^{n−1} b_j sin(j2πx/L),   (E.13)

using (E.12) in the first step and (E.7b), (E.7c) in the second.
We now come back to the case, where aj , bj , cj are defined by (E.8). Then (E.12)
holds due to (E.9) together with the symmetry of cosine (i.e. cos(z) = cos(−z)) and
the antisymmetry of sine (i.e. sin(z) = − sin(−z)), where we have aj = a−j and
bj = −b−j even for j = n. Thus, for each x ∈ [0, L], we use (E.13) to obtain
S_n(x) = ∑_{j=−n}^{n} c_j e^{ij2πx/L} = a₀ + 2 ∑_{j=1}^{n} a_j cos(j2πx/L) − 2 ∑_{j=1}^{n} b_j sin(j2πx/L),

where (E.13) and b₀ = 0 were used,
(a) The Fourier series of f converges uniformly to f. In particular, one has the pointwise convergence

f(x) = ∑_{j=−∞}^{∞} c_j e^{ij2πx/L} = lim_{n→∞} S_n(x)   for each x ∈ [0, L].   (E.14)
(b) The Fourier series of f converges to f in Lp [0, L] for each p ∈ [1, ∞].
Proof. (a): For the main work of the proof, we refer to [Wer11, Th. IV.2.9], where (a) is
proved for the case L = 2π. Here, we only verify that the case L = 2π then also implies
the general case of L ∈ R+ : Assume (a) holds for L = 2π, let L ∈ R+ be arbitrary, and
f ∈ C¹([0, L]) with f(0) = f(L) as well as f′(0) = f′(L). Define

g : [0, 2π] −→ R,   g(t) := f(tL/(2π)).

Then g is continuously differentiable with

g′ : [0, 2π] −→ R,   g′(t) = (L/(2π)) f′(tL/(2π)).

Also g(0) = f(0) = f(L) = g(2π) and g′(0) = (L/(2π)) f′(0) = (L/(2π)) f′(L) = g′(2π), i.e. we know (a) to hold for g. Next, we note f(t) = g(t · 2π/L) for each t ∈ [0, L] and compute, for each j ∈ Z,

c_j(f) = (1/L) ∫₀^L f(t) e^{−ij2πt/L} dt = (1/L) ∫₀^L g(t · 2π/L) e^{−ij2πt/L} dt
       = (1/(2π)) ∫₀^{2π} g(s) e^{−ijs} ds = c_j(g),

where we used the change of variables s := t · 2π/L.
Thus,

lim_{n→∞} ‖S_n(f) − f‖_∞ = lim_{n→∞} sup{ |f(x) − ∑_{j=−n}^{n} c_j(f) e^{ij2πx/L}| : x ∈ [0, L] }
  = lim_{n→∞} sup{ |g(x · 2π/L) − ∑_{j=−n}^{n} c_j(f) e^{ij2πx/L}| : x ∈ [0, L] }
  = lim_{n→∞} sup{ |g(t) − ∑_{j=−n}^{n} c_j(g) e^{ijt}| : t ∈ [0, 2π] }
  = lim_{n→∞} ‖S_n(g) − g‖_∞ = 0,

where the last equality holds by (a) for g,
For a periodic function f , applying the inverse discrete Fourier transform can yield a
useful approximation of the values of f at equidistant points:
Example E.8. Let L ∈ R+ and let f : [0, L] −→ R be integrable and such that
f (0) = f (L). Given n ∈ N, we now define the equidistant points
x_k := kh,   h := L/(2n + 1),   for each k ∈ {0, . . . , 2n + 1}.   (E.15)

Plugging x := x_k into (E.10) then yields the following approximations f_k of f(x_k), for each k ∈ {0, . . . , 2n}:

f_k := S_n(f)(x_k) = ∑_{j=−n}^{n} c_j e^{ijk2π/(2n+1)} = ∑_{j=0}^{2n} c_{j−n} e^{i(j−n)k2π/(2n+1)}
     = e^{−ink2π/(2n+1)} ∑_{j=0}^{2n} c_{j−n} e^{ijk2π/(2n+1)},   (E.16)

where we have omitted f(x_{2n+1}) in (E.16), since we know f(x_{2n+1}) = f(L) = f(0) = f(x₀), due to the assumed periodicity of f. Comparing (E.16) with (E.4), we see that

( f₀, e^{in2π/(2n+1)} f₁, . . . , e^{in2n·2π/(2n+1)} f_{2n} ) = F⁻¹(c_{−n}, . . . , c_n).   (E.17)
If Th. E.7(a) holds (e.g. under the hypotheses of Th. E.7), then, letting

F_n := (f(x₀), . . . , f(x_{2n})),   G_n := (f₀, . . . , f_{2n}),

we conclude, by Th. E.7(a),

lim_{n→∞} ‖F_n − G_n‖_∞ = lim_{n→∞} sup{ |f(x_k) − S_n(f)(x_k)| : k ∈ {0, . . . , 2n} } = 0.   (E.18)
r := r(d) : R −→ C,
r(x) := e^{−inπx/L} p(d)(x) = e^{−inπx/L} ∑_{k=0}^{n−1} d_k e^{ik2πx/L} = ∑_{k=0}^{n−1} d_k e^{i(k−n/2)2πx/L}.   (E.20)
Theorem E.11. In the situation of Def. E.10, let x₀, . . . , x_{n−1} be defined by x_j := jL/n for each j ∈ {0, . . . , n − 1}. Moreover, let z := (z₀, . . . , z_{n−1}) ∈ Cⁿ.

r(x_j) = z_j   for each j ∈ {0, . . . , n − 1}   (E.21)
It is an exercise to use different values of n (e.g. n = 4 and n = 16) to compute and plot
the corresponding r satisfying the conditions of Th. E.11(a) with zj = f (xj ), xj = j/n,
for instance by using MATLAB.
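A Python sketch of the exercise (NumPy instead of MATLAB; the sample function `f` is a hypothetical choice, and the coefficients d are computed as in (E.22), with the text's DFT corresponding to `np.fft.fft(...)/n`):

```python
import numpy as np

L, n = 1.0, 16
f = lambda x: np.exp(np.sin(2 * np.pi * x))   # hypothetical periodic sample function
xj = np.arange(n) * L / n
z = f(xj)

signs = (-1.0) ** np.arange(n)
d = np.fft.fft(signs * z) / n                 # d = F((-1)^j z_j), cf. (E.22)

def r(x):
    """Evaluate the trigonometric interpolant (E.20) at a point x."""
    k = np.arange(n)
    return np.exp(-1j * n * np.pi * x / L) * np.sum(d * np.exp(1j * k * 2 * np.pi * x / L))

vals = np.array([r(x) for x in xj])
assert np.allclose(vals, z)                   # (E.21): r interpolates the z_j
print("max node error:", np.max(np.abs(vals - z)))
```

Increasing n (e.g. from 4 to 16) and plotting the real part of r on a fine grid shows the rapid convergence for smooth periodic f.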
(a, b) := (a₀, . . . , a_{n/2}, b₁, . . . , b_{n/2−1}) ∈ Rⁿ.

t := t(a, b) : R −→ R,
t(x) := a₀ + 2 ∑_{k=1}^{n/2−1} ( a_k cos(k2πx/L) + b_k sin(k2πx/L) ) + a_{n/2} cos(nπx/L).   (E.24)
Furthermore, let d = (d₀, . . . , d_{n−1}) ∈ Cⁿ, let the function r = r(d) be defined according to (E.20), and assume the relations

d₀ = a_{n/2},   d_{n/2} = a₀,   d_{n/2−k} = a_k + i b_k,   d_{n/2+k} = a_k − i b_k   for each k ∈ {1, . . . , n/2 − 1}.   (E.25)
(b) If x0 , . . . , xn−1 ∈ [0, L] are defined by xj := jL/n for each j ∈ {0, . . . , n − 1},
z := (z0 , . . . , zn−1 ) ∈ Rn , and r satisfies the conditions of Th. E.11(a), then
t(x_j) = z_j   for each j ∈ {0, . . . , n − 1},   (E.26)
and, moreover,

a_k = (1/n) ∑_{j=0}^{n−1} z_j cos(jk2π/n),   b_k = (1/n) ∑_{j=0}^{n−1} z_j sin(jk2π/n)   for each k ∈ {1, . . . , n/2 − 1}.   (E.27)
(c) Let f : [0, L] −→ R. If f and r satisfy the conditions of Th. E.11(b), then there
exists cm ∈ R+ such that the error estimate (E.23) holds with r replaced by t.
r(x) = ∑_{k=0}^{n−1} d_k e^{i(k−n/2)2πx/L} = a₀ + a_{n/2} e^{−inπx/L}
       + 2 ∑_{k=1}^{n/2−1} Re(c_k) cos(k2πx/L) − 2 ∑_{k=1}^{n/2−1} Im(c_k) sin(k2πx/L)
     = a₀ + 2 ∑_{k=1}^{n/2−1} ( a_k cos(k2πx/L) + b_k sin(k2πx/L) ) + a_{n/2} ( cos(nπx/L) − i sin(nπx/L) ),

where the last step used (E.28),
proving t = Re r.
(b): (E.26) is an immediate consequence of (a). Furthermore, as a consequence of (E.22)
and (E.1b), we have, for each k ∈ {0, . . . , n},
n−1
1 X ijk2π
dk = zj (−1)j e− n . (E.29)
n j=0
a_k = (d_{n/2−k} + d_{n/2+k})/2   (by (E.25))
    = (1/(2n)) ∑_{j=0}^{n−1} z_j (−1)^j e^{−ijπ} ( e^{ijk2π/n} + e^{−ijk2π/n} )   (by (E.29), noting (−1)^j e^{−ijπ} = 1)
    = (1/n) ∑_{j=0}^{n−1} z_j cos(jk2π/n)   (by (E.7b)),

and

b_k = (d_{n/2−k} − d_{n/2+k})/(2i)   (by (E.25))
    = (1/(2in)) ∑_{j=0}^{n−1} z_j (−1)^j e^{−ijπ} ( e^{ijk2π/n} − e^{−ijk2π/n} )   (by (E.29))
    = (1/n) ∑_{j=0}^{n−1} z_j sin(jk2π/n)   (by (E.7c)),
(i) Choose n ∈ N even and compute the equidistant x_j := jL/n ∈ [0, L].

(ii) Compute the corresponding values z_j := f(x_j).

(iii) Compute d := F((−1)⁰ z₀, . . . , (−1)^{n−1} z_{n−1}) according to (E.22).

(iv) Compute the a_k and b_k from (E.25), i.e. from a_k = (d_{n/2−k} + d_{n/2+k})/2 and b_k = (d_{n/2−k} − d_{n/2+k})/(2i).

(v) Obtain the real-valued interpolating function t from (E.24).
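The steps above can be sketched in NumPy (the sample function `f` and L = 1 are illustrative assumptions; the text's DFT again corresponds to `np.fft.fft(...)/n`):

```python
import numpy as np

L, n = 1.0, 8                        # (i): n even
f = lambda x: np.exp(np.sin(2 * np.pi * x))   # hypothetical sample function
xj = np.arange(n) * L / n            # (i)
z = f(xj)                            # (ii)
d = np.fft.fft(((-1.0) ** np.arange(n)) * z) / n   # (iii): d = F((-1)^j z_j)

h = n // 2
a0, ah = d[h].real, d[0].real        # (iv): a_0 = d_{n/2}, a_{n/2} = d_0
k = np.arange(1, h)
ak = ((d[h - k] + d[h + k]) / 2).real
bk = ((d[h - k] - d[h + k]) / (2j)).real

def t(x):
    """(v): evaluate the real interpolant (E.24) at a point x."""
    return (a0
            + 2 * np.sum(ak * np.cos(k * 2 * np.pi * x / L)
                         + bk * np.sin(k * 2 * np.pi * x / L))
            + ah * np.cos(n * np.pi * x / L))

assert np.allclose([t(x) for x in xj], z)   # (E.26): t interpolates the z_j
print("ok")
```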
FFT is based on the following Th. E.17, which allows one to compute the DFT of a vector of length 2n from the DFTs of two vectors of length n. In preparation for Th. E.17, we introduce some notation.
Theorem E.17. Let n ∈ N. Then, for each k ∈ {0, . . . , n − 1} and each (f₀, . . . , f_{2n−1}) ∈ C^{2n}, it holds that

( F_k^{(n)}(f₀, f₁, . . . , f_{n−1}) + e^{−ikπ/n} F_k^{(n)}(f_n, f_{n+1}, . . . , f_{2n−1}) ) / 2
  = F_k^{(2n)}(f₀, f_n, f₁, f_{n+1}, . . . , f_{n−1}, f_{2n−1})   (E.30a)

and

( F_k^{(n)}(f₀, f₁, . . . , f_{n−1}) − e^{−ikπ/n} F_k^{(n)}(f_n, f_{n+1}, . . . , f_{2n−1}) ) / 2
  = F_{n+k}^{(2n)}(f₀, f_n, f₁, f_{n+1}, . . . , f_{n−1}, f_{2n−1}).   (E.30b)
= (1/(2n)) ( ∑_{j=0}^{n−1} f_j e^{−ijk2π/n} − e^{−ikπ/n} ∑_{j=0}^{n−1} f_{n+j} e^{−ijk2π/n} )
= ( F_k^{(n)}(f₀, f₁, . . . , f_{n−1}) − e^{−ikπ/n} F_k^{(n)}(f_n, f_{n+1}, . . . , f_{2n−1}) ) / 2,   (by (E.1b))

proving (E.30b).
Example E.18. (a) If n = 2^q, q ∈ N, then one can apply Th. E.17 q times to reduce an application of F^{(n)} to n applications of F^{(1)}. Moreover, we observe that F^{(1)} is merely the identity on C: According to (E.1),

F^{(1)}(f₀) = (1/1) · f₀ e^0 = f₀   for each f₀ ∈ C.
(b) For n = 8 = 23 , one can build the result of an application of F (n) from trivial
one-dimensional transforms in 3 steps as illustrated in the scheme in (E.31) below.
f0 f4 f2 f6 f1 f5 f3 f7
= = = = = = = =
F (1) (f0 ) F (1) (f4 ) F (1) (f2 ) F (1) (f6 ) F (1) (f1 ) F (1) (f5 ) F (1) (f3 ) F (1) (f7 )
ց ւ ց ւ ց ւ ց ւ
F (2) (f0 , f4 ) F (2) (f2 , f6 ) F (2) (f1 , f5 ) F (2) (f3 , f7 )
ց ւ ց ւ
F (4) (f0 , f2 , f4 , f6 ) F (4) (f1 , f3 , f5 , f7 )
ց ւ
F (8) (f0 , f1 , f2 , f3 , f4 , f5 , f6 , f7 ) (E.31)
The scheme in (E.31) also illustrates that getting all the indices right needs some
careful accounting. It turns out that this can be done quite elegantly and efficiently
by a trick known as bit reversal. Before studying bit reversal and its application to
FFT systematically, let us see how it works in the present example: In (E.31), we
want to compute F (8) (f0 , . . . , f7 ) and need to figure out the order of indices we need
in the first row of (E.31). Using the bit reversal algorithm we can do that directly,
without having to compute (and store) the intermediate lines of the scheme: We
start by writing the numbers 0, . . . , 7 in their binary representation instead of in
their usual decimal representation:
Then we reverse the bits of each number in (E.32) (and we also provide the resulting
decimal representation):
We note that, as if by magic, we have obtained the correct order needed in the first
row of (E.31). In the following, we will prove that this procedure works in general.
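The procedure is easily coded. A minimal sketch (the helper name `bit_reverse` is not from the text) reproduces the index order of the first row of (E.31):

```python
def bit_reverse(n, q):
    """Reverse the q-bit binary representation of n (the map rho_q of (E.34))."""
    r = 0
    for _ in range(q):
        r = (r << 1) | (n & 1)   # shift the lowest bit of n into r
        n >>= 1
    return r

# index order of the first row of scheme (E.31) for n = 8 = 2^3
print([bit_reverse(k, 3) for k in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```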
Definition E.19. For each q ∈ N₀, we denote M_q := Z_{2^q} = {0, . . . , 2^q − 1}. As every n ∈ M_q has a (unique) binary representation n = ∑_{j=0}^{q−1} b_j 2^j, b_j ∈ {0, 1}, we can define bit reversal (in q dimensions) as the map

ρ_q : M_q −→ M_q,   ρ_q( ∑_{j=0}^{q−1} b_j 2^j ) := ∑_{j=0}^{q−1} b_{q−1−j} 2^j.   (E.34)
Proof. (a): Since, for each j ∈ {0, . . . , q − 1}, one has q − 1 − (q − 1 − j) = j, this is
immediate from (E.34).
ρ_{q+1}(2n) = ρ_{q+1}( 2 ∑_{j=0}^{q−1} b_j 2^j ) = ρ_{q+1}( 0 · 2^0 + ∑_{j=1}^{q−1} b_{j−1} 2^j + b_{q−1} 2^q )
            = ∑_{j=0}^{q−1} b_{q−1−j} 2^j = ρ_q(n),
proving (b).
(c): If n ∈ M_q, then 0 ≤ n ≤ 2^q − 1, i.e. 0 ≤ 2n + 1 ≤ 2^{q+1} − 1 and 2n + 1 ∈ M_{q+1}. Moreover, for n = ∑_{j=0}^{q−1} b_j 2^j ∈ M_q, b_j ∈ {0, 1}, we compute

ρ_{q+1}(2n + 1) = ρ_{q+1}( 1 + 2 ∑_{j=0}^{q−1} b_j 2^j ) = ρ_{q+1}( 1 · 2^0 + ∑_{j=1}^{q−1} b_{j−1} 2^j + b_{q−1} 2^q )
               = ∑_{j=0}^{q−1} b_{q−1−j} 2^j + 1 · 2^q = 2^q + ρ_q(n),

proving (c).
Definition E.21. We define the fast Fourier transform (FFT) algorithm for the computation of the n-dimensional discrete Fourier transform F^{(n)} in the case n = 2^q, q ∈ N₀, by the following recursion: We start with the n = 2^q complex numbers z₀, . . . , z_{n−1} ∈ C (we will see below (cf. Th. E.22) that, to compute d := F^{(n)}(f), f ∈ Cⁿ, one has to choose z₀ := f_{ρ_q(0)}, z₁ := f_{ρ_q(1)}, . . . , z_{n−1} := f_{ρ_q(n−1)}):

d₀^{(0,0)} := z₀,   . . . ,   d₀^{(0,2^q−1)} := z_{2^q−1} = z_{n−1}.   (E.35a)

Then, for each 1 ≤ r ≤ q, one defines recursively the following 2^{q−r} vectors d^{(r,0)}, . . . , d^{(r,2^{q−r}−1)} ∈ C^{2^r} by

d_k^{(r,j)} := ( d_k^{(r−1,2j)} + e^{−ikπ/2^{r−1}} d_k^{(r−1,2j+1)} ) / 2,
d_{2^{r−1}+k}^{(r,j)} := ( d_k^{(r−1,2j)} − e^{−ikπ/2^{r−1}} d_k^{(r−1,2j+1)} ) / 2,   (E.35b)

for each k ∈ {0, . . . , 2^{r−1} − 1} and j ∈ {0, . . . , 2^{q−r} − 1} (note that the recursion ends after q steps with the single vector d^{(q,0)} ∈ Cⁿ).
Theorem E.22. Let n = 2^q, q ∈ N₀, and z = (z₀, . . . , z_{n−1}) ∈ Cⁿ. If, for r ∈ {0, . . . , q}, j ∈ {0, . . . , 2^{q−r} − 1}, the d^{(r,j)} are defined by the FFT algorithm of Def. E.21, then

d^{(r,j)} = F^{(2^r)}( z_{j2^r+ρ_r(0)}, z_{j2^r+ρ_r(1)}, . . . , z_{j2^r+ρ_r(2^r−1)} )   for each r ∈ {0, . . . , q}, j ∈ {0, . . . , 2^{q−r} − 1}.   (E.36)

In particular,

d^{(q,0)} = F^{(n)}( z_{ρ_q(0)}, z_{ρ_q(1)}, . . . , z_{ρ_q(2^q−1)} ),   (E.37)

i.e., to compute F^{(n)}(f₀, . . . , f_{n−1}) for a given f ∈ Cⁿ, one has to set

to obtain

d^{(q,0)} = F^{(n)}(f₀, . . . , f_{n−1}).   (E.39)
Proof. The proof of (E.36) is conducted via induction with respect to r ∈ {0, . . . , q}. For r = 0, (E.36) reads

d^{(0,j)} = F^{(1)}(z_j)   for each j ∈ {0, . . . , 2^q − 1},   (E.40)

where the equality at (∗) holds due to (E.30a) and Lem. E.20(b),(c), since, for each α ∈ {0, . . . , 2^{r−1} − 1},

j2^r + ρ_{r−1}(α) = j2^r + ρ_r(2α)   and   j2^r + 2^{r−1} + ρ_{r−1}(α) = j2^r + ρ_r(2α + 1).   (E.42)
Similarly,

d_{2^{r−1}+k}^{(r,j)} = ( d_k^{(r−1,2j)} − e^{−ikπ/2^{r−1}} d_k^{(r−1,2j+1)} ) / 2   (by (E.35b))
  = (1/2) F_k^{(2^{r−1})}( z_{j2^r+ρ_{r−1}(0)}, . . . , z_{j2^r+ρ_{r−1}(2^{r−1}−1)} )
    − (1/2) e^{−ikπ/2^{r−1}} F_k^{(2^{r−1})}( z_{j2^r+2^{r−1}+ρ_{r−1}(0)}, . . . , z_{j2^r+2^{r−1}+ρ_{r−1}(2^{r−1}−1)} )   (by induction hypothesis)
  = F_{2^{r−1}+k}^{(2^r)}( z_{j2^r+ρ_r(0)}, z_{j2^r+ρ_r(1)}, . . . , z_{j2^r+ρ_r(2^r−1)} ),   (by (E.30b) and (E.42))   (E.43)
completing the induction proof of (E.36).
Clearly, (E.36) turns into (E.37) for r = q, j = 0; and (E.39) follows from (E.37) and (E.38) as ρ_q ∘ ρ_q = Id.
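The splitting of Th. E.17 can be turned directly into code. The sketch below (written top-down with recursion instead of the iterative, bit-reversed bookkeeping of Def. E.21; helper names are not from the text) checks the result against the direct evaluation of (E.1b), including the 1/n normalization:

```python
import cmath

def fft_recursive(f):
    """FFT for the normalized DFT F(f)_k = (1/n) sum_j f_j e^{-2 pi i j k / n},
    n = 2^q, using the splitting of Th. E.17 / (E.30a), (E.30b)."""
    n = len(f)
    if n == 1:
        return [f[0]]                 # F^(1) is the identity on C
    even = fft_recursive(f[0::2])     # DFT of (f_0, f_2, ...)
    odd = fft_recursive(f[1::2])      # DFT of (f_1, f_3, ...)
    d = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)
        d[k] = (even[k] + w * odd[k]) / 2            # cf. (E.30a)
        d[n // 2 + k] = (even[k] - w * odd[k]) / 2   # cf. (E.30b)
    return d

def dft_direct(f):
    """Direct O(n^2) evaluation of (E.1b) for comparison."""
    n = len(f)
    return [sum(f[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n)) / n
            for k in range(n)]

f = [complex(j, -j) for j in range(8)]
assert all(abs(a - b) < 1e-12 for a, b in zip(fft_recursive(f), dft_direct(f)))
print("FFT matches direct DFT")
```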
Remark E.23. As mentioned at the beginning of the section, FFT reduces the number
of complex multiplications needed to compute an n-dimensional DFT from O(n2 ) (when
done by matrix multiplication) to O(n log2 n) (an actually quite significant improvement
– even for a moderate n like n = 8 = 2^3, one has n^2 = 64 versus n log₂ n = 24, i.e. one gains almost a factor of 3; the larger n gets, the more essential it becomes to use FFT).
Let us check the O(n log2 n) claim for FFT as computed by (E.35) for n = 2q , q ∈ N,
(where we need to estimate the number of multiplications involved in the executions of
(E.35b)):
(i) In a preparatory step, one computes the q − 1 factors e^{−iπ/2^1}, . . . , e^{−iπ/2^{q−1}} by starting with e^{−iπ/2^{q−1}} and using

e^{−iπ/2^r} = ( e^{−iπ/2^{r+1}} )^2   for each r ∈ {0, . . . , q − 2}.

Then (E.35b) has to be executed q times, namely once for each r = 1, . . . , q. We compute the number of multiplications needed to obtain the d^{(r,j)}:
Then (E.35b) has to be executed q times, namely once for each r = 1, . . . , q. We compute
the number of multiplications needed to obtain the d(r,j) :
(ii) Starting from the precomputed θ_r := e^{−iπ/2^{r−1}}, one uses successive multiplications with θ_r to obtain the factors θ_r^k = e^{−ikπ/2^{r−1}}, k = 1, . . . , 2^{r−1} − 1. Thus, the number of multiplications for this step is

N_{2,r} := 2^{r−1} − 2 < 2^{r−1}.
F THE WEIERSTRASS APPROXIMATION THEOREM 220
(iii) The computation of each d_k^{(r,j)} needs two multiplications (one by the e-factor and one by 1/2). As the range for k is {0, . . . , 2^r − 1} and the range for j is {0, . . . , 2^{q−r} − 1}, the total number of multiplications for this step is

N_{3,r} := 2 · 2^r · 2^{q−r} = 2^{q+1}.
Finally, summing all multiplications from steps (i) – (iii), one obtains

N = N₁ + ∑_{r=1}^{q} (N_{2,r} + N_{3,r}) < q + ∑_{r=1}^{q} (2^{r−1} + 2^{q+1}) = q + 2^q − 1 + q 2^{q+1}
  < log₂ n + n + 2 n log₂ n,   (E.44)
Theorem F.1 (Weierstrass Approximation Theorem). Let a, b ∈ R with a < b. For each continuous function f ∈ C[a, b] and each ǫ > 0, there exists a polynomial p : R −→ R such that ‖f − p↾_{[a,b]}‖_∞ < ǫ, where p↾_{[a,b]} denotes the restriction of p to [a, b].
Theorem F.1 will be a corollary of the fact that the Bernstein polynomials corresponding
to f ∈ C[0, 1] (see Def. F.2) converge uniformly to f on [0, 1] (see Th. F.3 below).
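The Bernstein approximation can be observed numerically. A minimal sketch (the evaluation helper `bernstein` and the sample function are illustrative, not from the text; q_{nν}(x) = C(n,ν) x^ν (1−x)^{n−ν} as in Def. F.2):

```python
from math import comb

def bernstein(f, n, x):
    """Evaluate the n-th Bernstein polynomial
    (B_n f)(x) = sum_v f(v/n) * C(n,v) * x^v * (1-x)^(n-v)  on [0, 1]."""
    return sum(f(v / n) * comb(n, v) * x**v * (1 - x)**(n - v) for v in range(n + 1))

f = lambda x: abs(x - 0.5)          # continuous but not differentiable at 1/2
for n in (10, 100, 1000):
    err = max(abs(f(x) - bernstein(f, n, x)) for x in [i / 200 for i in range(201)])
    print(n, err)                   # uniform error decreases as n grows
```

The slow decay of the error (roughly like 1/√n at the kink) illustrates that Bernstein polynomials converge uniformly but are not an efficient approximation scheme.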
Theorem F.3. For each f ∈ C[0, 1], the sequence of Bernstein polynomials (B_n f)_{n∈N} corresponding to f according to Def. F.2 converges uniformly to f on [0, 1], i.e.

and

|f(x) − (B_n f)(x)| ≤ ∑_{ν=0}^{n} |f(x) − f(ν/n)| q_{nν}(x)   for each x ∈ [0, 1], n ∈ N.   (F.6)
As f is continuous on the compact interval [0, 1], it is uniformly continuous, i.e. for each ǫ > 0, there exists δ > 0 such that, for each x ∈ [0, 1], n ∈ N, ν ∈ {0, . . . , n}:

|x − ν/n| < δ   ⇒   |f(x) − f(ν/n)| < ǫ/2.   (F.7)
For the moment, we fix x ∈ [0, 1] and n ∈ N and define

N₁ := { ν ∈ {0, . . . , n} : |x − ν/n| < δ },
N₂ := { ν ∈ {0, . . . , n} : |x − ν/n| ≥ δ }.
Then

∑_{ν∈N₁} |f(x) − f(ν/n)| q_{nν}(x) ≤ (ǫ/2) ∑_{ν∈N₁} q_{nν}(x) ≤ (ǫ/2) ∑_{ν=0}^{n} q_{nν}(x) = ǫ/2,   (F.8)

and, with M := ‖f‖_∞,

∑_{ν∈N₂} |f(x) − f(ν/n)| q_{nν}(x) ≤ ∑_{ν∈N₂} |f(x) − f(ν/n)| q_{nν}(x) (x − ν/n)²/δ²
  ≤ (2M/δ²) ∑_{ν=0}^{n} q_{nν}(x) (x − ν/n)².   (F.9)
as well as

∑_{ν=0}^{n} (ν/n)² \binom{n}{ν} x^ν (1 − x)^{n−ν}
  = (x/n) ∑_{ν=1}^{n} (ν − 1) \binom{n−1}{ν−1} x^{ν−1} (1 − x)^{(n−1)−(ν−1)} + x/n
  = ((n − 1) x²/n) ∑_{ν=2}^{n} \binom{n−2}{ν−2} x^{ν−2} (1 − x)^{(n−2)−(ν−2)} + x/n
  = x² (1 − 1/n) + x/n = x² + (x/n)(1 − x).   (F.12)
Thus, we obtain, using (F.10), (F.5), (F.11), (F.12),

∑_{ν=0}^{n} q_{nν}(x) (x − ν/n)² = x² − 2x · x + x² + (x/n)(1 − x) ≤ 1/(4n)   for each x ∈ [0, 1], n ∈ N,

and together with (F.9):

∑_{ν∈N₂} |f(x) − f(ν/n)| q_{nν}(x) ≤ (2M/δ²) · 1/(4n) < ǫ/2   for each x ∈ [0, 1], n > M/(δ²ǫ),   (F.13)
as needed.
then there exists k0 ∈ N such that Ak0 has nonempty interior: int Ak0 6= ∅.
Proof. Seeking a contradiction, we assume (G.1) holds with closed sets Ak such that
int Ak = ∅ for each k ∈ N. Then, for each nonempty open set O ⊆ X and each k ∈ N,
the set O \ A_k = O ∩ (X \ A_k) is open and nonempty (as O \ A_k = ∅ would imply O ⊆ A_k, so that the points of O would be interior points of A_k). Since O \ A_k is open and nonempty, we can choose x ∈ O \ A_k and 0 < ǫ < 1/k such that
B ǫ (x) ⊆ O \ Ak . (G.2)
As one can choose Bǫ (x) as a new O, starting with arbitrary x0 ∈ X and ǫ0 > 0, one can
inductively construct sequences of points xk ∈ X, numbers ǫk > 0, and corresponding
balls such that

B̄_{ǫ_k}(x_k) ⊆ B_{ǫ_{k−1}}(x_{k−1}) \ A_k,   ǫ_k ≤ 1/k,   for each k ∈ N.   (G.3)
G BAIRE CATEGORY AND THE UNIFORM BOUNDEDNESS PRINCIPLE 224
Then l ≥ k implies x_l ∈ B_{ǫ_k}(x_k) for each k, l ∈ N, such that (x_k)_{k∈N} constitutes a Cauchy sequence due to lim_{k→∞} ǫ_k = 0. The assumed completeness of X provides a limit x = lim_{k→∞} x_k ∈ X. The nested form (G.3) of the balls B_{ǫ_k}(x_k) implies x ∈ B̄_{ǫ_k}(x_k) for each k ∈ N on the one hand and x ∉ A_k for each k ∈ N on the other hand. However, this last conclusion is in contradiction to (G.1).
Remark G.2. The purpose of this remark is to explain the name Baire Category The-
orem in Th. G.1. To that end, we need to introduce some terminology that will not be
used again outside the present remark.
Let X be a metric space.
Recall that a subset D of X is called dense if, and only if, D̄ = X, or, equivalently, if, and only if, for each x ∈ X and each ǫ > 0, one has D ∩ B_ǫ(x) ≠ ∅. A subset E of X is called nowhere dense if, and only if, X \ Ē is dense.
A subset M of X is said to be of first category or meager if, and only if, M = ∪_{k=1}^{∞} E_k is a countable union of nowhere dense sets E_k, k ∈ N. A subset S of X is said to be of second category or nonmeager or fat if, and only if, S is not of first category. Caveat: These notions of category are completely different from the notion of category occurring in the more algebraic discipline called category theory.
Finally, the reason Th. G.1 carries the name Baire Category Theorem lies in the fact that it implies that every nonempty complete metric space X is of second category (considered as a subset of itself): If X were of first category, then X = ⋃_{k=1}^∞ E_k with nowhere dense sets E_k. As the E_k are nowhere dense, the complements X \ Ē_k are dense, i.e. each Ē_k has empty interior, and X = ⋃_{k=1}^∞ Ē_k would be a countable union of closed sets with empty interior, in contradiction to Th. G.1.
Theorem G.3 (Uniform Boundedness Principle). Let X be a nonempty complete metric
space and let Y be a normed vector space. Consider a set of continuous functions
F ⊆ C(X, Y ). If
M_x := sup{ ‖f(x)‖ : f ∈ F } < ∞   for each x ∈ X, (G.4)
then there exists x0 ∈ X and ǫ0 > 0 such that
sup{ M_x : x ∈ B̄_{ǫ_0}(x_0) } < ∞. (G.5)
In other words, if a collection of continuous functions from X into Y is bounded point-
wise in X, then it is uniformly bounded on an entire ball.
is a continuous inverse image of the closed set [0, k] and, hence, closed. Since arbitrary
intersections of closed sets are closed, so is
A_k := ⋂_{f ∈ F} A_{k,f} = { x ∈ X : ‖f(x)‖ ≤ k for each f ∈ F }. (G.7)
H Matrix Decomposition
Σ = AA∗ . (H.1)
In extension of the term for the positive definite case, we still call this decomposition a
Cholesky decomposition of Σ. It is recalled from Th. 5.7 that, if Σ is positive definite,
then the diagonal entries of A can be chosen to be all positive and this determines A
uniquely.
Proof. The positive definite case was proved in Th. 5.7. The general (positive semidefinite) case can be deduced from the positive definite case as follows: If Σ ∈ M(n, K) is Hermitian positive semidefinite, then each Σ_k := Σ + (1/k) Id ∈ M(n, K), k ∈ N, is positive definite:
x* Σ_k x = x* Σ x + (x* x)/k > 0   for each 0 ≠ x ∈ K^n. (H.2)
Thus, for each k ∈ N, there exists a lower triangular matrix A_k ∈ M(n, K) with positive diagonal entries such that Σ_k = A_k A_k*.
On the other hand, Σ_k converges to Σ with respect to each norm on K^{n²} (since all norms on K^{n²} are equivalent). In particular, lim_{k→∞} ‖Σ_k − Σ‖₂ = 0, which implies that the set K := {A_k : k ∈ N} is bounded with respect to ‖·‖₂. Thus, the closure of K in K^{n²} is compact, which implies (A_k)_{k∈N} has a convergent subsequence (A_{k_l})_{l∈N}, converging to some matrix A ∈ K^{n²}. As this convergence is with respect to the norm topology on K^{n²}, each entry of the A_{k_l} must converge (in K) to the respective entry of A. In particular, A is lower triangular with all diagonal entries nonnegative. It only remains to show AA* = Σ. However,
shows a Hermitian positive semidefinite matrix (which is not positive definite) can have
uncountably many different Cholesky decompositions.
i.e. if, and only if, the n(n + 1)/2 lower half entries of A constitute a solution to the
following (nonlinear) system of n(n + 1)/2 equations:
∑_{k=1}^{j} A_{ik} Ā_{jk} = σ_{ij},   (i, j) ∈ I. (H.9a)
Using the order on I introduced in the statement of the theorem, (H.9a) takes the form
|A_{11}|² = σ_{11},
A_{21} Ā_{11} = σ_{21},
   ⋮
A_{n1} Ā_{11} = σ_{n1},
|A_{21}|² + |A_{22}|² = σ_{22},                         (H.9b)
A_{31} Ā_{21} + A_{32} Ā_{22} = σ_{32},
   ⋮
|A_{n1}|² + · · · + |A_{nn}|² = σ_{nn}.
From Th. H.1, we know (H.9) must have at least one solution with A_{jj} ∈ R₀⁺ for each j ∈ {1, . . . , n}. In particular, σ_{jj} ∈ R₀⁺ for each j ∈ {1, . . . , n} (this is also immediate from Σ being positive semidefinite, since σ_{jj} = e_j* Σ e_j, where e_j denotes the jth standard unit vector of K^n). We need to show that (H.7) yields a solution to (H.9). The proof is carried out by induction on n. For n = 1, we have σ_{11} ∈ R₀⁺ and A_{11} = √σ_{11} ∈ R₀⁺, i.e. there is nothing to prove. Now let n > 1. If σ_{11} > 0, then
A_{11} = √σ_{11},   A_{21} = σ_{21}/A_{11},   . . . ,   A_{n1} = σ_{n1}/A_{11} (H.10a)
is the unique solution to the first n equations of (H.9b) satisfying A_{11} ∈ R⁺, and this solution is provided by (H.7). If σ_{11} = 0, then σ_{21} = · · · = σ_{n1} = 0: Otherwise, let s := (σ_{21}, . . . , σ_{n1})* ∈ K^{n−1} \ {0}, α ∈ R, and note

(α, s*) ( 0   s*      ) ( α )  =  (α, s*) ( s* s            )  =  2α ‖s‖₂² + s* Σ_{n−1} s < 0
        ( s   Σ_{n−1} ) ( s )             ( αs + Σ_{n−1} s )

for α < −s* Σ_{n−1} s / (2‖s‖₂²), in contradiction to Σ being positive semidefinite. Thus,
is a particular solution to the first n equations of (H.9b), and this solution is provided
by (H.7). We will now denote the solution to (H.9) given by Th. H.1 by B11 , . . . , Bnn
to distinguish it from the Aij constructed via (H.7).
In each case, A11 , . . . , An1 are given by (H.10), and, for (i, j) ∈ I with i, j ≥ 2, we define
τ_{ij} := σ_{ij} − A_{i1} Ā_{j1} ∈ K   for each (i, j) ∈ J := { (i, j) ∈ I : i, j ≥ 2 }. (H.11)
is positive semidefinite. If σ_{11} = 0, then (H.10b) implies τ_{ij} = σ_{ij} for each (i, j) ∈ J and T is positive semidefinite, as Σ being positive semidefinite implies

x* T x = x̃* Σ x̃ ≥ 0,   where x̃ := (0, x_2, . . . , x_n)^T ∈ K^n, (H.13)

for each x = (x_2, . . . , x_n)^T ∈ K^{n−1}. If σ_{11} > 0, then (H.10a) holds as well as B_{11} = A_{11}, . . . ,
Bn1 = An1 . Thus, (H.11) implies
τ_{ij} := σ_{ij} − A_{i1} Ā_{j1} = σ_{ij} − B_{i1} B̄_{j1}   for each (i, j) ∈ J. (H.14)
Then (H.9) with A replaced by B together with (H.14) implies
∑_{k=2}^{j} B_{ik} B̄_{jk} = τ_{ij}   for each (i, j) ∈ J, (H.15)
which, once again, establishes T to be positive semidefinite (since, for each x ∈ Kn−1 ,
x∗ T x = x∗ BB ∗ x = (B ∗ x)∗ (B ∗ x) ≥ 0).
By induction, we now know the algorithm of (H.7) yields a (possibly different from
(H.16) for σ11 > 0) decomposition of T :
        ( C_{22}              ) ( C̄_{22} · · · C̄_{n2} )   ( τ_{22} · · · τ̄_{n2} )
CC*  =  (   ⋮      ⋱          ) (          ⋱      ⋮    ) = (   ⋮     ⋱      ⋮    ) = T (H.17)
        ( C_{n2} · · · C_{nn} ) (               C̄_{nn} )   ( τ_{n2} · · · τ_{nn} )
or
∑_{k=2}^{j} C_{ik} C̄_{jk} = τ_{ij} = σ_{ij} − A_{i1} Ā_{j1}   for each (i, j) ∈ J, (H.18)
where
C_{22} := √τ_{22} = { √σ_{22}   for σ_{11} = 0,
                    { B_{22}    for σ_{11} > 0,                          (H.19a)
and, for (i, j) ∈ J \ {(2, 2)},
C_{ij} := { ( τ_{ij} − ∑_{k=2}^{j−1} C_{ik} C̄_{jk} ) / C_{jj}   for i > j and C_{jj} ≠ 0,
          { 0                                                    for i > j and C_{jj} = 0,   (H.19b)
          { √( τ_{ii} − ∑_{k=2}^{i−1} |C_{ik}|² )                for i = j.
Substituting τij = σij − Ai1 Aj1 from (H.11) into (H.19) and comparing with (H.7), an
induction over J with respect to the order introduced in the statement of the theorem
shows Aij = Cij for each (i, j) ∈ J. In particular, since all Cij are well-defined by
induction, all Aij are well-defined by (H.7) (i.e. all occurring square roots exist as real
numbers). It also follows that {Aij : (i, j) ∈ I} is a solution to (H.9): The first n
equations are satisfied according to (H.10); the remaining (n − 1) n/2 equations are
satisfied according to (H.18) combined with Cij = Aij . This concludes the proof that
(H.7) furnishes a solution to (H.9).
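The recursion (H.7)/(H.19) proved above translates directly into code. The following numpy sketch is our own illustration (the function name and the clipping of tiny negative radicands against rounding are our additions, not part of the notes); it handles the positive semidefinite case, including zero diagonal entries:

```python
import numpy as np

def semidefinite_cholesky(S):
    """Lower triangular A with A A* = S for a Hermitian positive
    semidefinite S, following the column-wise recursion of (H.7):
    a zero diagonal entry forces zeros below it in that column."""
    n = S.shape[0]
    A = np.zeros_like(S, dtype=complex)
    for j in range(n):
        d = S[j, j] - np.sum(np.abs(A[j, :j]) ** 2)
        A[j, j] = np.sqrt(max(d.real, 0.0))  # clip tiny negatives from rounding
        for i in range(j + 1, n):
            if A[j, j] != 0:
                A[i, j] = (S[i, j] - A[i, :j] @ A[j, :j].conj()) / A[j, j]
            else:
                A[i, j] = 0.0  # the case C_{jj} = 0 of (H.19b)
    return A

# Rank-deficient example: S is positive semidefinite but not definite.
B = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 0.0]])
S = B @ B.T
A = semidefinite_cholesky(S)
assert np.allclose(A @ A.conj().T, S)
```

Note that, in accordance with the remark following (H.1), this factorization need not be unique for singular S; the recursion merely selects one particular solution of (H.9).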
I Newton’s Method
A drawback of Theorems 6.3 and 6.5 is the assumption of the existence and (for Th. 6.5) location of the zero x*, which (especially for Th. 6.5) can be very difficult to verify in cases where one would like to apply Newton's method. In the following variant Th. I.1, the hypotheses are such that both the existence of a zero and the convergence of Newton's method to that zero can be proved.
Theorem I.1. Let m ∈ N and fix a norm ‖ · ‖ on R^m. Let A ⊆ R^m be open and convex. Moreover let f : A −→ R^m be differentiable, and assume Df to satisfy the Lipschitz condition

∀ x, y ∈ A :   ‖(Df)(x) − (Df)(y)‖ ≤ L ‖x − y‖,   L ∈ R₀⁺, (I.1)

where, once again, we consider the induced operator norm on the real m × m matrices.
If x0 ∈ A and r ∈ R+ are such that x1 ∈ B r (x0 ) ⊆ A (x1 according to (6.5)) and
α := ‖(Df(x_0))^{−1}(f(x_0))‖ (I.3)

satisfies

h := αβL/2 < 1 (I.4a)
and
r ≥ α/(1 − h), (I.4b)
then Newton’s method (6.5) is well-defined for x0 (i.e., for each n ∈ N0 , Df (xn ) is
invertible and xn+1 ∈ A) and converges to a zero of f : More precisely, one has
∀ n ∈ N₀ :   x_n ∈ B̄_r(x_0), (I.5)

lim_{n→∞} x_n = x* ∈ A, (I.6)

f(x*) = 0. (I.7)

One also has the error estimate

∀ n ∈ N :   ‖x_n − x*‖ ≤ α · h^{2^n − 1} / (1 − h^{2^n}). (I.8)
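For orientation, the underlying iteration (6.5) can be sketched in numpy as follows; the stopping rule and the example system are our own choices and not part of Th. I.1:

```python
import numpy as np

def newton(f, Df, x0, tol=1e-12, max_iter=50):
    """Plain Newton iteration (6.5): x_{n+1} = x_n - (Df(x_n))^{-1} f(x_n),
    solving a linear system per step instead of forming the inverse."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(Df(x), f(x))
        x = x - step
        if np.linalg.norm(step) < tol:  # our stopping rule, not from Th. I.1
            break
    return x

# Hypothetical example: intersect the circle x^2 + y^2 = 4 with y = x^2.
f = lambda v: np.array([v[0] ** 2 + v[1] ** 2 - 4.0, v[1] - v[0] ** 2])
Df = lambda v: np.array([[2.0 * v[0], 2.0 * v[1]], [-2.0 * v[0], 1.0]])
z = newton(f, Df, np.array([1.0, 1.0]))
assert np.linalg.norm(f(z)) < 1e-10
```

In the situation of Th. I.1, the quadratic error bound (I.8) explains why, in practice, only a handful of iterations are needed once the iterates are close to x*.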
For n = 0, x1 ∈ B r (x0 ) by hypothesis and kx1 − x0 k ≤ α by (6.5) and (I.3). Now let
n ≥ 1. From (6.5), we obtain
∀ n ∈ N :   Df(x_{n−1})(x_n) = Df(x_{n−1})(x_{n−1}) − f(x_{n−1})
completing the induction step for the second part of (I.10). Next, we use the second
part of (I.10) to estimate
‖x_{n+1} − x_0‖ ≤ ∑_{k=0}^{n} ‖x_{k+1} − x_k‖ ≤ α ∑_{k=0}^{n} h^{2^k − 1} ≤^{(I.4a)} α ∑_{k=0}^{∞} h^k = α/(1 − h) ≤^{(I.4b)} r, (I.13)
J Eigenvalues
b0 , . . . , bn−1 ∈ C, and
∀ k ∈ {0, . . . , n − 1} :   |a_k − b_k| < δ, (J.2c)
Proof. Let α ∈ R+ ∪ {∞} denote the minimal distance between distinct zeros of p, i.e.
α := { ∞                                              if p has only one distinct zero,
     { min{ |λ − λ̃| : p(λ) = p(λ̃) = 0, λ ≠ λ̃ }        otherwise.                          (J.4)
Without loss of generality, we may assume 0 < ǫ < α/2. Let λ̃_1, . . . , λ̃_l, where 1 ≤ l ≤ n, be the distinct zeros of p. For each k ∈ {1, . . . , l}, define the open disk D_k := B_ǫ(λ̃_k).
Since ǫ < α, p does not have any zeros on ∂Dk . Thus,
m_k := min{ |p(z)| : z ∈ ∂D_k } > 0. (J.5a)
Also define
M_k := max{ ∑_{j=0}^{n−1} |z|^j : z ∈ ∂D_k }   and note 1 ≤ M_k < ∞. (J.5b)
∀ k ∈ {1, . . . , l} :   M_k δ < m_k
and each q satisfying (J.2b) and (J.2c), the following holds true:
∀ k ∈ {1, . . . , l} ∀ z ∈ ∂D_k :
|q(z) − p(z)| ≤^{(J.2c)} ∑_{j=0}^{n−1} |a_j − b_j| |z|^j ≤ δ ∑_{j=0}^{n−1} |z|^j ≤ δ M_k < m_k ≤ |p(z)|, (J.6)
showing p and q satisfy hypothesis (J.1) of Rouché’s theorem. Thus, in each Dk , the
polynomials p and q must have precisely the same number of zeros (counted according to
their respective multiplicities). Since p and q both have precisely n zeros (not necessarily
distinct), the only (distinct) zero of p in Dk is λ̃k , and the radius of Dk is ǫ, the proof
of Th. J.2 is complete.
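The theorem can be illustrated numerically (our snippet, using numpy's companion-matrix root finder): perturbing the coefficients of p(z) = (z − 1)²(z + 2) = z³ − 3z + 2 moves each zero only slightly, the double zero splitting on the order of the square root of the perturbation:

```python
import numpy as np

# Perturb the coefficients of p(z) = (z - 1)^2 (z + 2) = z^3 - 3z + 2 and
# compare the zeros of p with those of the perturbed polynomial q.
p = np.array([1.0, 0.0, -3.0, 2.0])
q = p + 1e-8 * np.array([0.0, 1.0, -1.0, 1.0])  # leading coefficient kept fixed
zp = np.sort_complex(np.roots(p))
zq = np.sort_complex(np.roots(q))
assert np.all(np.abs(zp - zq) < 1e-3)  # each zero of q lies close to a zero of p
```

The square-root-sized splitting of the double zero also shows why δ in Th. J.2 must in general be chosen much smaller than ǫ.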
Corollary J.3. Let A ∈ M(n, C), n ∈ N, and let k · k denote some arbitrary norm
on M(n, C). Then, for each ǫ > 0, there exists δ > 0 such that, for each matrix
B ∈ M(n, C), satisfying
kA − Bk < δ, (J.7)
there exist enumerations λ1 , . . . , λn ∈ C and µ1 , . . . , µn ∈ C of the eigenvalues of A and
B, respectively, listing each eigenvalue according to its (algebraic) multiplicity, where
∀ k ∈ {0, . . . , n − 1} :   |a_k − b_k| < γ,
consideration to Hessenberg form (see Def. J.4). The complexity of the transformation to Hessenberg form is O(n³) (see Th. J.6(c) below). However, each step of the QR algorithm applied to a matrix in Hessenberg form is O(n²) (see Rem. J.12 below) as compared to O(n³) for a fully populated matrix (obtaining a QR decomposition via the Householder method of Def. 5.23 requires O(n) matrix multiplications, each matrix multiplication needing O(n²) multiplications).
Definition J.4. An n × n matrix A = (akl ), n ∈ N, is said to be in upper (resp. lower)
Hessenberg form if, and only if, it is almost in upper (resp. lower) triangular form,
except that there are allowed to be nonzero elements in the first subdiagonal (resp.
superdiagonal), i.e., if, and only if,
Thus, according to Def. J.4, the n × n matrix A = (akl ) is in upper Hessenberg form if,
and only if, it can be written as
      ( ∗  ∗  · · ·  · · ·  ∗ )
      ( ∗  ∗  ∗     · · ·   ∗ )
A  =  ( 0  ∗  ∗     ⋱       ⋮ )
      ( ⋮  ⋱  ⋱     ⋱       ⋮ )
      ( 0  · · ·  0  ∗      ∗ )
How can one transform an arbitrary matrix A ∈ M(n, K) into Hessenberg form without
changing its eigenvalues? The idea is to use several similarity transformations of the
form A 7→ HAH, using Householder matrices H = H −1 (cf. Def. 5.19). Householder
transformations are also used in Sec. 5.4.3 to obtain QR decompositions.
Algorithm J.5 (Transformation to Hessenberg Form). Given A ∈ M(n, K), n ∈ N,
n > 2 (for n = 1 and n = 2, A is always in Hessenberg form), we define the following
algorithm: Let A(1) := A. For k = 1, . . . , n − 2, A(k) is transformed into A(k+1) by
performing precisely one of the following actions:
(a) If a_{ik}^{(k)} = 0 for each i ∈ {k + 2, . . . , n}, then
and we set
A(k+1) := H (k) A(k) H (k) ,
where
H^{(k)} := ( Id_k   0                               )
           ( 0      Id_{n−k} − 2 u^{(k)} (u^{(k)})* ) ,   u^{(k)} := u(x^{(k)}), (J.11)

where the lower-right block is abbreviated as H_l^{(k)} := Id_{n−k} − 2 u^{(k)} (u^{(k)})*.
(a) Then B is in upper Hessenberg form and σ(B) = σ(A). Moreover, ma (λ, B) =
ma (λ, A) for each λ ∈ σ(A).
Proof. (a): Using (5.34c) and blockwise matrix multiplication, a straightforward induc-
tion shows that each A(k+1) has the form
A^{(k+1)} = H^{(k)} A^{(k)} H^{(k)} = ( A_1^{(k)}        A_2^{(k)} H_l^{(k)}       )
                                      ( (0  ξ_k e_1)     H_l^{(k)} A_3^{(k)} H_l^{(k)} ) , (J.12)

where A_1^{(k)} is in upper Hessenberg form. In particular, for k + 1 = n − 1, we obtain B
to be in upper Hessenberg form. Moreover, since each A(k+1) is obtained from A(k) via
a similarity transformation, both matrices always have the same eigenvalues with the
same multiplicities. In particular, by another induction, the same holds for A and B.
(b): Since each H (k) is Hermitian according to Lem. 5.20(a), if A(k) is Hermitian, then
(A(k+1) )∗ = (H (k) A(k) H (k) )∗ = A(k+1) , i.e. A(k+1) is Hermitian as well. Thus, another
straightforward induction shows all A(k) to be Hermitian.
(c): The claim is that the number N (n) of arithmetic operations needed to compute
B from A is O(n3 ) for n → ∞. For simplicity, we do not distinguish between different
arithmetic operations, considering addition, multiplication, division, complex conjuga-
tion, and even the taking of a square root as one arithmetic operation each, even though
the taking of a square root will tend to be orders of magnitude slower than an addi-
tion. Still, none of the operations depends on the size n of the matrix, which justifies
treating them as the same, if one is just interested in the asymptotics n → ∞. First,
if u, x ∈ Cm , H = Id −2uu∗ , then computing Hx = x − 2u(u∗ x) needs m conjugations,
3m multiplications, m additions, and m subtractions, yielding
N1 (m) = 6m.
N2 (m) = 7m
(the same as N1 (m) plus the additional m conjugations). The computation of the
norm kxk2 needs m conjugations, m multiplications, m additions, and 1 square root, a
scalar multiple ξx needs m multiplications. Thus, the computation of u(x) according to
(5.34a) needs 8m operations (2 norms, 2 scalar multiples) plus a constant number C of
operations (conjugations, multiplications, divisions, square roots) that does not depend
on m, the total being
N3 (m) = 8m + C.
Now we are in a position to compute the number N^k(n) of operations needed to obtain A^{(k+1)} from A^{(k)}: N_3(n − k) operations to obtain u^{(k)} according to (J.11), N_1(n − k) operations to obtain ξ_k e_1 according to (J.12), k · N_2(n − k) operations to obtain A_2^{(k)} H_l^{(k)} according to (J.12), and (n − k) · N_1(n − k) + (n − k) · N_2(n − k) operations to obtain H_l^{(k)} A_3^{(k)} H_l^{(k)} according to (J.12), the total being

N^k(n) = N_3(n − k) + N_1(n − k) + k · N_2(n − k) + (n − k) · N_1(n − k) + (n − k) · N_2(n − k)
       = C + 8(n − k) + 6(n − k) + 7k(n − k) + 6(n − k)² + 7(n − k)²
       = C + 14(n − k) + 7n(n − k) + 6(n − k)².
Finally, we obtain

N(n) = ∑_{k=1}^{n−2} N^k(n) = C(n − 2) + (14 + 7n) ∑_{k=2}^{n−1} k + 6 ∑_{k=2}^{n−1} k²
     = C(n − 2) + (14 + 7n) ( (n − 1)n/2 − 1 ) + 6 ( (n − 1)n(2n − 1)/6 − 1 ),
that means
              ( 1                     )
              (   ⋱                   )
              (     c  −s             )  ← row m
G(m, c, s) =  (     s   c             )  ← row m + 1
              (            ⋱          )
              (               1       )
(a) If |c|2 + |s|2 = 1, then G(m, c, s) is unitary (in this case, G(m, c, s) is a special case
of what is known as a Givens rotation).
c := a / √(|a|² + |b|²),   s := b / √(|a|² + |b|²),   a := a_{mm},   b := a_{m+1,m}, (J.14)
2
and B = (bjl ) := G(m, c, s)∗ A, then B is still in Hessenberg form, but with b21 =
· · · = bm,m−1 = bm+1,m = 0. Moreover, |c|2 + |s|2 = 1. Rows m + 2 through n are
identical in A and B.
b_{m+1,m} = −s a_{mm} + c a_{m+1,m} = (−ba + ab) / √(|a|² + |b|²) = 0,
The idea of each step of the QR method for matrices in Hessenberg form is to transform
A(k) into A(k+1) by a sequence of n − 1 unitary similarity transformations via Givens
rotations Gk1 , . . . , Gk,n−1 . The Gkm are chosen such that the sequence
Algorithm J.10 (QR Method for Matrices in Hessenberg Form). Given A ∈ M(n, K)
in Hessenberg form, n ∈ N, n ≥ 2, we define the following algorithm for the computation
of a sequence (A(k) )k∈N of matrices in M(n, K): Let A(1) := A. For each k ∈ N, A(k+1)
is obtained from A(k) by a sequence of n − 1 unitary similarity transformations:
where the G_{km}, m ∈ {1, . . . , n − 1}, are defined as in Lem. J.9. To obtain G_{km}, one needs to know a_{mm}^{(k,m)} and a_{m+1,m}^{(k,m)}. However, as it turns out, one does not actually have to compute the matrices A^{(k,m)} explicitly, since, letting
Thus, when computing B (k,m+1) for (J.16a), one computes C (k,m) first and stores the
elements in (J.16c) to be available to obtain Gk,m+1 .
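One complete step A^(k) → A^(k+1) can be sketched as follows for the real case (our sketch: instead of storing the matrices C^(k,m), we simply record the n − 1 rotation pairs (c, s) and apply the right multiplications afterwards):

```python
import numpy as np

def qr_step_hessenberg(A):
    """One QR step A -> RQ = Q* A Q for a real Hessenberg matrix,
    implemented with n-1 Givens rotations as in Alg. J.10; cost O(n^2)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    rotations = []
    for m in range(n - 1):  # left multiplications: produce upper triangular R
        a, b = A[m, m], A[m + 1, m]
        r = np.hypot(a, b)
        c, s = (1.0, 0.0) if r == 0 else (a / r, b / r)
        rows = A[[m, m + 1], m:]
        A[m, m:] = c * rows[0] + s * rows[1]       # row m of G* A
        A[m + 1, m:] = -s * rows[0] + c * rows[1]  # row m+1 (zeroes A[m+1, m])
        rotations.append((m, c, s))
    for m, c, s in rotations:  # right multiplications: A := R G_1 ... G_{n-1}
        cols = A[: m + 2, [m, m + 1]]
        A[: m + 2, m] = c * cols[:, 0] + s * cols[:, 1]
        A[: m + 2, m + 1] = -s * cols[:, 0] + c * cols[:, 1]
    return A

A = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]])
B = qr_step_hessenberg(A)
assert abs(B[2, 0]) < 1e-12  # the result is again in Hessenberg form
```

Each rotation touches only two rows (resp. two columns) of bounded length, which is the structural reason behind the O(n²) count of Rem. J.12 below.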
(a) The QR method for matrices in Hessenberg form as defined in Alg. J.10 is, indeed,
a QR method as defined in Alg. 7.18, i.e. (7.37) holds for the A(k) . In particular,
Alg. J.10 preserves the eigenvalues (with multiplicities) of A and Th. 7.22 applies
(also cf. Rem. 7.23(a)).
(c) Let m ∈ {1, . . . , n − 1}. Then all columns with indices m + 1 through n of the
matrices C (k,m) and A(k,m+1) are identical (in particular, (J.16c) holds true).
Proof. Step 1: Assuming A(k) to be in Hessenberg form, we show that each C (k,m) ,
m ∈ {1, . . . , n − 1}, is in Hessenberg form as well, (c) holds, and A(k+1) is in Hessenberg
form: Let m ∈ {1, . . . , n − 1}. According to the definition in Lem. J.9,
where
As Rk is upper triangular by Lem. J.9 and Qk is unitary (as it is the product of unitary
matrices), we have proved that (7.37) holds and, thereby, completed Step 2.
Finally, since A(1) is in Hessenberg form by hypothesis, using Steps 1 and 2 in an
induction on k concludes the proof of the theorem.
Remark J.12. In the situation of Alg. J.10 it takes O(n2 ) arithmetic operations to
transform A(k) into A(k+1) (see the proof of Th. J.6(c) for a list of admissible arithmetic
operations): One has to obtain the n − 1 Givens rotations Gk1 , . . . , Gk,n−1 using (J.14),
and one also needs the adjoint n−1 Givens rotations. As the number of nontrivial entries
in the Gkm is always 4 and, thus, independent of n, the total of arithmetic operations
to obtain the Gkm is NG (n) = O(n). Since each C (k,m) is in Hessenberg form, right
multiplication with Gkm only affects at most an (m + 2) × 2 submatrix, needing at most
C(m + 2) arithmetic operations, C ∈ N. The result differs from a Hessenberg matrix at
most in one element, namely the element with index (m + 2, m), i.e. left multiplication
of the result with G_{k,m+1}* only affects at most a 2 × (n − m) submatrix, needing at most
C(n − m) arithmetic operations (this is also consistent with the very first step, where
one computes G∗k1 A(k) with A(k) in Hessenberg form). Thus, the total is at most
K Minimization
that is worst-case optimal in the same way the bisection method is worst-case optimal for
root-finding problems. Let us emphasize that one-dimensional minimization problems
also often occur as subproblems, so-called line searches, in multidimensional minimization: One might use a derivative-based method to determine a descent direction u at
some point x for some objective function J : X −→ R, that means some u ∈ X such
that the auxiliary objective function
larger of the two subintervals for the next step, which would then have length bigger
than (bn − an )/2.
As we will see below, golden section search is the analogue of the bisection algorithm
for minimization problems, where the situation is slightly more subtle. First, we have
to address the problem of bracketing a minimum: For the bisection method, by the
intermediate value theorem, a zero of the continuous function f is bracketed if f (a) <
f (b) (or f (a) > f (b)). However, to bracket the min of a continuous function f , we need
three points a < b < c, where
Clearly, if f is continuous, then (K.3) implies f to have (at least one) global min in ]a, c[.
Thus, the problem of bracketing a minimum is to determine a minimization triplet for
f , where one has to be aware that the problem has no solution if f is strictly increasing
or strictly decreasing. Still, we can formulate the following algorithm:
Algorithm K.2 (Bracketing a Minimum). Let f : I −→ R be defined on a nontrivial
interval I ⊆ R, m := inf I, M := sup I. We define a finite sequence (a1 , a2 , . . . , aN ),
N ≥ 3, of distinct points in I via the following recursion: For the first two points choose
a < b in the interior of I. Set
a_1 := a,  a_2 := b,  σ := 1    if f(b) ≤ f(a),
a_1 := b,  a_2 := a,  σ := −1   if f(b) > f(a).      (K.4)
If the algorithm has not been stopped in the previous step (i.e. if 2 ≤ n < N), then set t_n := a_n + 2σ|a_n − a_{n−1}| and

a_{n+1} := { t_n            if t_n ∈ I,
           { M              if t_n > M, M ∈ I,
           { (a_n + M)/2    if t_n > M, M ∉ I,       (K.5)
           { m              if t_n < m, m ∈ I,
           { (m + a_n)/2    if t_n < m, m ∉ I.
Stop the iteration (and set N := n + 1) if at least one of the following conditions is satisfied: (i) f(a_{n+1}) ≥ f(a_n), (ii) a_{n+1} ∈ {m, M}, (iii) a_{n+1} ∉ [a_min, a_max] (where a_min, a_max ∈ I are some given bounds), (iv) n + 1 = n_max (where n_max ∈ N is some given bound).
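On I = R, where the boundary cases of (K.5) never trigger, Alg. K.2 reduces to the following sketch (the function name and the max_steps safeguard are ours):

```python
def bracket_minimum(f, a, b, max_steps=50):
    """Sketch of Alg. K.2 on I = R: starting from a < b, walk downhill with
    doubling step size until f increases, i.e. until stop condition (i)
    yields a minimization triplet."""
    if f(b) <= f(a):
        prev, cur, sigma = a, b, 1.0    # (K.4): walk to the right
    else:
        prev, cur, sigma = b, a, -1.0   # (K.4): walk to the left
    for _ in range(max_steps):
        nxt = cur + 2.0 * sigma * abs(cur - prev)   # t_n from (K.5)
        if f(nxt) >= f(cur):                        # stop condition (i)
            return (prev, cur, nxt) if sigma > 0 else (nxt, cur, prev)
        prev, cur = cur, nxt
    return None  # no triplet found: f may be strictly monotone

a, b, c = bracket_minimum(lambda x: (x - 3.0) ** 2, 0.0, 1.0)
assert a < b < c  # and, by construction, f(b) <= f(a) and f(b) < f(c)
```

The None return corresponds to the failure cases (ii) – (iv) of the full algorithm, cf. Remark K.3(c).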
Remark K.3. (a) Since (K.4) guarantees f (a1 ) ≥ f (a2 ), condition (i) plus an induc-
tion shows
f (a1 ) ≥ f (a2 ) > · · · > f (aN −1 ).
(b) If Alg. K.2 terminates due to condition (i), then (a) implies (aN −2 , aN −1 , aN ) (for
σ = 1) or (aN , aN −1 , aN −2 ) (for σ = −1) to be a minimization triplet for f .
(c) If Alg. K.2 terminates due to conditions (ii) – (iv) (and (i) is not satisfied), then the
method has failed to find a minimization triplet for f . One then has three options:
(a) abort the search (after all, a minimization triplet for f might not exist and aN
is a good candidate for a local min), (b) search on the other side of a1 , (c) continue
the original search with adjusted bounds (if possible).
—
Given a minimization triplet (a, b, c) for some function f , we now want to evaluate f at
x ∈ ]a, b[ or at x ∈ ]b, c[ to obtain a new minimization triplet (a′, b′, c′), where c′ − a′ < c − a.
Moreover, the strategy for choosing x should be worst-case optimal.
While Lem. K.4 shows the forming of new, smaller triplets to be easy, the question
remains of how to choose x. Given a minimization triplet (a, b, c), the location of b is a
fraction w of the way between a and c, i.e.
w := (b − a)/(c − a),   1 − w = (c − b)/(c − a). (K.7)
We will now assume w ≤ 1/2 – if w > 1/2, then one can make the analogous (symmetric) argument, starting from c rather than from a (we will come back to this point below).
The new point x will be an additional fraction z toward c,
z := (x − b)/(c − a). (K.8)
If one chooses the new minimization triplet according to (K.6), then its length l is
l = { w (c − a)               for a < x < b and f(b) ≥ f(x),
    { (1 − (w + z)) (c − a)   for a < x < b and f(b) ≤ f(x),      (K.9)
    { (w + z) (c − a)         for b < x < c and f(b) ≤ f(x),
    { (1 − w) (c − a)         for b < x < c and f(b) ≥ f(x).
For the moment, consider the case b < x < c. As we assume no prior knowledge about
f , we have no control over which of the two choices we will have to make to obtain the
new minimization triplet, i.e., if possible, we should make the last two factors in front
of c − a in (K.9) equal, leading to w + z = 1 − w or
z = 1 − 2w. (K.10)
If we were to choose x between a and b, then it could occur that l = (1 − (w + z))(c − a) > (1 − w)(c − a) (note z < 0 in this case), which means this choice is not worst-case optimal. Together with the symmetric argument for w > 1/2, this means we always need to put x in the larger of the two subintervals. While (K.10) provides a formula for determining z in
terms of w, we still have the freedom to choose w. If there is an optimal value for w,
then this value should remain constant throughout the algorithm – it should be the
same for the old triplet and for the new triplet. This leads to
and choose (a_{n+1}, b_{n+1}, c_{n+1}) according to (K.6) (in practice, one will stop the iteration if c_n − a_n < ǫ, where ǫ > 0 is some given bound).
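The resulting iteration can be sketched as follows; placing the new point a fraction α into the larger subinterval is equivalent to the placement rule discussed above, but the code is our sketch, not the notes' reference implementation:

```python
import math

def golden_section(f, a, b, c, tol=1e-8):
    """Golden section search on a minimization triplet (a, b, c): the new
    point is placed a fraction alpha = (3 - sqrt(5))/2 into the larger of
    the two subintervals, so the bracket shrinks by (at least, up to one
    exceptional step) the fixed factor q of (K.15)."""
    alpha = (3.0 - math.sqrt(5.0)) / 2.0
    while c - a > tol:
        if b - a > c - b:                # new point in the larger subinterval
            x = b - alpha * (b - a)
            if f(x) < f(b):
                b, c = x, b              # new triplet (a, x, b), cf. (K.6)
            else:
                a = x                    # new triplet (x, b, c)
        else:
            x = b + alpha * (c - b)
            if f(x) < f(b):
                a, b = b, x              # new triplet (b, x, c)
            else:
                c = x                    # new triplet (a, b, x)
    return b

xmin = golden_section(lambda x: (x - 2.0) ** 2, 0.0, 1.0, 5.0)
assert abs(xmin - 2.0) < 1e-6
```

Note that every branch shrinks the bracket by a factor strictly less than 1, so the loop terminates regardless of the comparison outcomes.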
(a) The value of f (bn ) decreases monotonically, i.e. f (bn+1 ) ≤ f (bn ) for each n ∈ N.
(b) With the possible exception of one step, one always has
c_{n+1} − a_{n+1} ≤ q (c_n − a_n),   where q := 1/2 + α/2 ≈ 0.69,   α := (3 − √5)/2. (K.15)
Proof. (a) clearly holds, since each new minimization triplet is chosen according to
(K.6).
(b): As, according to (K.14), x_n is always chosen in the larger subinterval, it suffices to consider the case w := (b_n − a_n)/(c_n − a_n) ≤ 1/2. If w = α, then (K.9) and (K.10) guarantee

(c_{n+1} − a_{n+1})/(c_n − a_n) = (x_n − a_n)/(c_n − a_n) = (b_n − a_n)/(c_n − a_n) + (x_n − b_n)/(c_n − a_n)
   = w + [(x_n − b_n)/(c_n − b_n)] · [(c_n − b_n)/(c_n − a_n)]
   = w + α(1 − w) = α + (1 − α)w ≤ α + (1 − α) · 1/2 = q,
where the exponent on the right-hand side is n − 1 instead of n due to the possible
exception in one step. Since q < 1, this means limn→∞ (cn − an ) = 0, proving (K.16).
Definition K.7. Let I ⊆ R be an interval, f : I −→ R. We call f locally monotone
if, and only if, for each x ∈ I, there exists ǫ > 0 such that f is monotone (increasing or
decreasing) on I∩]x − ǫ, x] and monotone on I ∩ [x, x + ǫ[.
Example K.8. The function

f : R −→ R,   f(x) := { x sin(1/x)   for x ≠ 0,
                      { 0            for x = 0,
Proof. The continuity of f implies f (x) = limn→∞ f (bn ). Then Lem. K.6(a) implies
If an < yn < x, then f (an ) > f (yn ) < f (x), showing f is not monotone in [an , x].
Likewise, if x < yn < cn , then f (x) > f (yn ) < f (cn ), showing f is not monotone
in [x, cn ]. Thus, there is no ǫ > 0 such that f is monotone on both I∩]x − ǫ, x] and
I ∩ [x, x + ǫ[, in contradiction to the assumption that f be locally monotone.
the amoeba method. The method has nothing to do with the simplex method of linear programming (except that both methods make use of the geometrical structure called simplex).
The Nelder-Mead method is a derivative-free method that is simple to implement. It is an example of a method that seems to perform quite well in practice, even though there are known examples where it fails and, in general, the related convergence theory available in the literature remains quite limited. Here, we will merely present the algorithm together with its heuristics, and provide some relevant references to the literature.
Recall that a (nondegenerate) simplex in Rn is the convex hull of n+1 points x1 , . . . , xn+1
∈ Rn such that the points x2 − x1 , . . . , xn+1 − x1 are linearly independent. If ∆ =
conv{x1 , . . . , xn+1 }, then x1 , . . . , xn+1 are called the vertices of ∆.
The idea of the Nelder-Mead method is to construct a sequence (∆k )k∈N ,
of simplices such that the size of the ∆k converges to 0 and the value of f at the vertices
of the ∆k decreases, at least on average. To this end, in each step, one finds the vertex
v, where f is largest and either replaces v by a new vertex (by moving it in the direction
of the centroid of the remaining vertices, see below) or one shrinks the simplex.
where the ei denote the standard unit vectors, and the δi > 0 are initialization param-
eters that one should try to adapt to the (conjectured) characteristic length scale (or
scales) of the considered problem.
Iteration Step: Given ∆k , k ∈ N, sort the vertices such that (K.19) holds with
According to the following rules, we will either replace (only) xkn+1 or shrink the simplex.
In preparation, we compute the centroid of the first n vertices, i.e.
x := (1/n) ∑_{i=1}^{n} x_i^k, (K.22a)
u := x − x_{n+1}^k. (K.22b)
One will now apply precisely one of the following rules (1) – (4) in decreasing priority
(i.e. rule (j) can only be used if none of the conditions for rules (1) – (j − 1) applied):
then replace xkn+1 with xr to obtain the new simplex ∆k+1 . Heuristics: As f is largest
at xkn+1 , one reflects the point through the hyperplane spanned by the remaining
vertices hoping to find smaller values on the other side. If one, indeed, finds a
sufficiently smaller value (better than the second worst), then one takes the new
point, except if the new point has a smaller value than all other vertices, in which
case, one tries to go even further in that direction by applying rule (2) below.
(2) Expansion: If
f(x_r) < f(x_1^k) (K.24a)
(i.e. (K.23c) does not hold and x_r is better than all x_i^k), then compute the expanded point
x_e := x + µ_e u = x_{n+1}^k + (µ_e + 1) u, (K.24b)
where µe > µr is the expansion parameter (µe = 2 being a common setting). If
then replace xkn+1 with xe to obtain the new simplex ∆k+1 . If (K.24c) does not hold,
then, as in (1), replace xkn+1 with xr to obtain ∆k+1 . If neither (K.23c) nor (K.24a)
holds, then continue with (3) below. Heuristics: If xr has the smallest value so far,
then one hopes to find even smaller values by going further in the same direction.
f(x_r) ≥ f(x_n^k). (K.25a)
where −1 < µ_c < 0 is the contraction parameter (µ_c = −1/2 being a common setting).
If
f (xc ) < f (xkn+1 ), (K.25c)
then replace xkn+1 with xc to obtain the new simplex ∆k+1 . Heuristics: If (K.25a)
holds, then xr would still be the worst vertex of the new simplex. If the new vertex
is to be still the worst, then it should at least be such that it reduces the volume of
the simplex. Thus, the contracted point is chosen between xkn+1 and x.
(4) Reduction: If (K.25a) holds and (K.25c) does not hold, then we have failed to find a
better vertex in the direction u. We then shrink the simplex around the best point
x_1^k by setting x_1^{k+1} := x_1^k and

∀ i ∈ {2, . . . , n + 1} :   x_i^{k+1} := x_1^k + µ_s (x_i^k − x_1^k), (K.26)
where 0 < µ_s < 1 is the shrink parameter (µ_s = 1/2 being a common setting).
Heuristics: If f increases by moving from the worst point in the direction of the
better points, then that indicates the simplex being too large, possibly containing
several local mins. Thus, we shrink the simplex around the best point, hoping to
thereby improve the situation.
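Rules (1) – (4) can be collected into the following sketch of Alg. K.10; the parameter names follow the text, while the fixed iteration count is our simplification (the notes do not prescribe a stopping rule here):

```python
import numpy as np

def nelder_mead(f, simplex, mu_r=1.0, mu_e=2.0, mu_c=-0.5, mu_s=0.5, iters=300):
    """Sketch of Alg. K.10: in each step, reflect the worst vertex through
    the centroid of the remaining ones, then expand, contract, or shrink
    according to rules (1)-(4)."""
    S = [np.asarray(v, dtype=float) for v in simplex]
    for _ in range(iters):
        S.sort(key=f)                    # f(x_1) <= ... <= f(x_{n+1})
        xbar = np.mean(S[:-1], axis=0)   # centroid of the n best vertices, (K.22a)
        u = xbar - S[-1]                 # direction (K.22b)
        xr = xbar + mu_r * u             # reflected point
        if f(S[0]) <= f(xr) < f(S[-2]):  # (1) reflection
            S[-1] = xr
        elif f(xr) < f(S[0]):            # (2) expansion
            xe = xbar + mu_e * u
            S[-1] = xe if f(xe) < f(xr) else xr
        else:                            # f(xr) >= f(x_n), i.e. (K.25a)
            xc = xbar + mu_c * u         # (3) contracted point
            if f(xc) < f(S[-1]):
                S[-1] = xc
            else:                        # (4) reduction around the best vertex
                S = [S[0] + mu_s * (v - S[0]) for v in S]
    return min(S, key=f)

x = nelder_mead(lambda v: (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2,
                [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
assert np.linalg.norm(x - np.array([1.0, -2.0])) < 1e-3
```

The acceptance test for the expanded point (f(x_e) < f(x_r)) is one common convention for the elided condition (K.24c); other variants exist in the literature.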
(b) (1) – (3) are designed such that the average value of f at the vertices decreases:
(1/(n + 1)) ∑_{i=1}^{n+1} f(x_i^{k+1}) < (1/(n + 1)) ∑_{i=1}^{n+1} f(x_i^k). (K.27)
(c) As mentioned above, even though Alg. K.10 seems to perform well in practice, one can construct examples where it will converge to points that are not local minima
(see, e.g., [Kel99, Sec. 8.1.3]). Some positive results regarding the convergence of
Alg. K.10 can be found in [Kel99, Sec. 8.1.2] and [LRWW98]. In general, multidi-
mensional minimization is not an easy task, and it is not uncommon for algorithms
to converge to wrong points. It is therefore recommended in [PTVF07, pp. 503-504]
to restart the algorithm at a point that it claimed to be a minimum (also see [Kel99,
Sec. 8.1.4] regarding restarts of Alg. K.10).
References
[Alt06] Hans Wilhelm Alt. Lineare Funktionalanalysis, 5th ed. Springer-Verlag,
Berlin, 2006 (German).
[Blu68] L.I. Bluestein. A linear filtering approach to the computation of the discrete Fourier transform. Northeast Electronics Research and Engineering Meeting Record 10 (1968), 218–219.
[Bos13] Siegfried Bosch. Algebra, 8th ed. Springer-Verlag, Berlin, 2013 (German).
[Heu08] Harro Heuser. Lehrbuch der Analysis Teil 2, 14th ed. Vieweg+Teubner,
Wiesbaden, Germany, 2008 (German).
[LRWW98] J.C. Lagarias, J.A. Reeds, M.H. Wright, and P.E. Wright. Convergence Properties of the Nelder-Mead Simplex Method in Low Dimensions. SIAM J. Optim. 9 (1998), No. 1, 112–147.
[NM65] J.A. Nelder and R. Mead. A simplex method for function minimization.
Comput. J. 7 (1965), 308–313.
[Phi16a] P. Philip. Analysis I: Calculus of One Real Variable. Lecture Notes, LMU
Munich, 2015/2016, AMS Open Math Notes Ref. # OMN:202109.111306,
available in PDF format at
https://www.ams.org/open-math-notes/omn-view-listing?listingId=111306.
[Phi16b] P. Philip. Analysis II: Topology and Differential Calculus of Several Variables. Lecture Notes, LMU Munich, 2016, AMS Open Math Notes Ref. #
OMN:202109.111307, available in PDF format at
https://www.ams.org/open-math-notes/omn-view-listing?listingId=111307.
[Phi17] P. Philip. Analysis III: Measure and Integration Theory of Several Variables. Lecture Notes, LMU Munich, 2016/2017, AMS Open Math Notes
Ref. # OMN:202109.111308, available in PDF format at
https://www.ams.org/open-math-notes/omn-view-listing?listingId=111308.
[Pla10] Robert Plato. Numerische Mathematik kompakt, 4th ed. Vieweg Verlag,
Wiesbaden, Germany, 2010 (German).
[PTVF07] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes. The Art of Scientific Computing, 3rd ed. Cambridge University Press, New York, USA, 2007.
[RSR69] L.R. Rabiner, R.W. Schafer, and C.M. Rader. The chirp z-transform
algorithm and its application. Bell Syst. Tech. J. 48 (1969), 1249–1292.
[Rud87] W. Rudin. Real and Complex Analysis, 3rd ed. McGraw-Hill Book Company, New York, 1987.
[Str08] Gernot Stroth. Lineare Algebra, 2nd ed. Berliner Studienreihe zur
Mathematik, Vol. 7, Heldermann Verlag, Lemgo, Germany, 2008 (German).