MA261
1 Introduction
2 Getting Started
2.1 Basic Concepts
2.1.1 Convergence rates
2.1.2 Conditioning
2.1.3 Floating point numbers (this subsection is not examinable)
2.2 Some ODE Basics
2.2.1 Solvability of the Initial Value Problem (this subsection is not examinable)
2.3 The Forward/Backward Euler
2.3.1 Convergence analysis
2.3.2 Stability analysis
Bibliography
Chapter 1
Introduction
Numerical analysis is the mathematical study of algorithms for solving problems arising in many
different areas, e.g., physics, engineering, biology, statistics, economics, and the social sciences. In general,
starting from some real world problem, the following steps have to be performed:
4. Numerical analysis: the discretization is studied again with respect to well-posedness, but
most importantly the error between the solution of the finite dimensional problem and the
mathematical model is estimated and convergence of the numerical solution is established.
5. Implementation: the finite dimensional problems are solved using a computer program.
This can be a cyclic procedure where, for example, the analysis in step 2 can influence the
modelling step, i.e., step 1 is refined. The numerical simulation can show that additional effects
have to be taken into account, so the modelling has to be refined, and so on.
This module will focus on all these points for some simple settings to make you familiar with
central underlying ideas. The modelling techniques and the numerical schemes presented are
an important building block used for solving more complex problems. We will be focusing on
problems described by ordinary differential equations so a lot of the material covered in MA133
will be helpful.
Example 1.1. Consider the problem of a steel rope of length L > 1 m clamped between two poles
which are 1 m apart so that the rope is almost taut. Now the position of the rope is required in the
case where an acrobat is standing in the middle. A sketch of the problem is shown in Figure 1.2.
1. Modeling: first we make the assumption that the rope can be represented as a function
y : [0, 1] → R. The shape of the rope is such that its bending energy E is minimal. For E
one finds (neglecting for example gravity) the formula
$$E(y) := \frac{c}{2} \int_0^1 \frac{y'(x)^2}{\sqrt{1 + y'(x)^2}} \, dx - \int_0^1 f(x)\, y(x) \, dx \, .$$
Figure 1.2: Sketch of the problem and a computed solution.
Here c depends on the material of the rope and f is the load (the acrobat) on the rope. So
we seek $y \in V := \{v \in C^2((0,1)) : v(0) = v(1) = 0\}$ minimizing the energy.
Both the function f and the constant c have to be determined by measurements and contain
data errors.
This is a very complex problem. So we make a simplification and assume that the displacement
of the rope is small, i.e., y' is small. Then we can replace E by
$$\bar{E}(y) := \frac{c}{2} \int_0^1 y'(x)^2 \, dx - \int_0^1 f(x)\, y(x) \, dx \, .$$
This problem has a unique solution. One can also show for example that y < 0 if f < 0, i.e.,
when the force is pointing downward, the displacement is also downwards along the whole
length of the rope.
3. Discretization: Instead of approximating the function y at all points in [0, 1] we compute
y at the points $x_i = ih$ for $i = 1, \dots, N-1$ with $h = \frac{1}{N}$. We can replace the second derivative
for $n \geq 1$. We compute this sequence up to $Y^P$, which is then our final approximation. It can
be seen that $Y^n \to Y$ ($n \to \infty$), but since we cannot compute an infinite number of iterates,
we have a further termination error caused by stopping the computation after P steps.
4. Numerical analysis: the matrix A is regular and thus the discrete problem has a unique
solution. For $v \in C^4((0,1))$ there exists a constant $M > 0$ so that for all $x \in (0,1)$ the
following error estimate holds:
$$\left| v''(x) - \frac{1}{h^2}\big(v(x+h) - 2v(x) + v(x-h)\big) \right| \leq M h^2 \, .$$
The same estimate holds for the discrete values Y:
$$\max_{i=1,\dots,N-1} \left| \bar{y}(x_i) - y_i \right| \leq C h^2 \, ,$$
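The quadratic convergence of the central second difference quotient is easy to observe numerically. The following sketch (the test function sin is our arbitrary choice, not from the notes) checks that halving h reduces the error by roughly a factor of four:

```python
import math

def second_difference(v, x, h):
    # central second difference quotient (1/h^2)(v(x+h) - 2 v(x) + v(x-h))
    return (v(x + h) - 2 * v(x) + v(x - h)) / h**2

# v(x) = sin(x) has v''(x) = -sin(x); the error should shrink like h^2
x0 = 0.7
errors = [abs(second_difference(math.sin, x0, h) + math.sin(x0))
          for h in (0.1, 0.05, 0.025)]
print(errors)  # each halving of h reduces the error by roughly a factor 4
```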
Example 1.2. A mass spring system with friction proportional to the velocity is modelled by the
second order ODE
$$\mu x''(t) + \beta x'(t) + \gamma x(t) = 0 \, .$$
Here x(t) is the position of the (point) mass at time t, thus x' is the velocity and x'' the acceleration
at time t. The three constants in the model are the mass µ > 0, the amount of friction β > 0, and
the restoring force of the spring γ > 0. To make the problem well posed the initial position of the
mass x(0) = x₀ and the initial velocity x'(0) = v₀ have to be prescribed.
There are different ways to derive this model, one of them is to start with Newton’s second law
µa = F (mass × acceleration = applied forces). We choose our coordinate system in such a way
that x = 0 corresponds to the rest position of the spring - so x > 0 means the spring is stretched.
The restoring forces are assumed to be proportional to the amount of stretching s, so Fr = −γs.
This is a modelling assumption; we could also have a nonlinear spring where the restoring force
depends nonlinearly on the stretching, e.g., F = −ks³. The force of friction is assumed to be
directly proportional to the velocity x' of the mass, so F_f = −βx'. As said before a = x'' and due
to our choice of coordinate system s = x, so:
$$\mu x'' = F_r + F_f = -\gamma x - \beta x' \, .$$
This model is linear and can be easily solved using the approach of characteristic polynomials
discussed in MA133:
$$x(t) = e^{-\frac{\beta}{2\mu} t}\big(A \cos(\omega t) + B \sin(\omega t)\big) \, , \qquad \omega = \frac{\sqrt{4\gamma\mu - \beta^2}}{2\mu} \, ,$$
where we made the assumption that β is small so that β² < 4γµ (the system is under damped).
The constants A and B are determined from the initial conditions.
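As a quick sanity check, one can verify numerically that the damped solution formula satisfies the ODE. The parameter values below are illustrative choices, not from the notes; the derivatives are approximated by central differences:

```python
import math

# illustrative parameter values: underdamped since beta^2 < 4*gamma*mu
mu, beta, gamma = 1.0, 0.1, 1.0
w = math.sqrt(4 * gamma * mu - beta**2) / (2 * mu)
A, B = 1.0, 0.5   # would normally be fixed by the initial conditions

def x(t):
    return math.exp(-beta / (2 * mu) * t) * (A * math.cos(w * t) + B * math.sin(w * t))

# residual of mu*x'' + beta*x' + gamma*x at t = 1, derivatives by central differences
h = 1e-3
xp = (x(1 + h) - x(1 - h)) / (2 * h)
xpp = (x(1 + h) - 2 * x(1) + x(1 - h)) / h**2
residual = mu * xpp + beta * xp + gamma * x(1)
print(residual)  # close to zero
```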
The problem seems to depend on three parameters, although we know from studying the exact
solution that the type of behaviour of the system depends on the ratio between β² and 4γµ, e.g., if
β²/(4γµ) < 1 the system oscillates while for β²/(4γµ) > 1 the system is over damped and will not oscillate
at all.
at all. One modelling technique is to non dimensionalize the model and in that step try to reduce
the number of parameters, isolating the parameters that mainly determine the behaviour of the
problem. To this end one first needs to fix the physical units of each part of the model, e.g., x could
be measured in meters (m), the velocity x' could then be in meters per second (m/s), and the acceleration
in m/s². The mass µ could be in kilograms (kg) and (to make things fit) we assume that β has
units kg/s and γ kg/s² (we will discuss this in detail later on). Now let us fix a typical time
scale T and a length scale L, introduce the scaled time τ = t/T, and rescale the position x(t) in the form
χ(τ) = x(Tτ)/L. Using the chain rule we can easily see that χ'(τ) = (T/L) x'(Tτ), χ''(τ) = (T²/L) x''(Tτ)
and thus
$$0 = \mu x''(T\tau) + \beta x'(T\tau) + \gamma x(T\tau) = \frac{\mu L}{T^2}\, \chi''(\tau) + \frac{\beta L}{T}\, \chi'(\tau) + \gamma L\, \chi(\tau) \, .$$
Note that χ, τ do not have any units (e.g. t, T both have some time unit like seconds and their
fraction is unitless). We can now divide through by µL and multiply by T² to arrive at
$$\chi'' + \frac{T\beta}{\mu}\, \chi' + \frac{T^2\gamma}{\mu}\, \chi = 0$$
and note that the two remaining constants Tβ/µ, T²γ/µ are also unitless. We now have many
different ways to choose T (note that the equation doesn't depend on our choice for L). We could
choose T to make Tβ/µ = 1 or alternatively T²γ/µ = 1, e.g., $T = \sqrt{\mu/\gamma}$, which leads to a coefficient
in front of the friction term of
$$\frac{T\beta}{\mu} = \sqrt{\frac{\mu}{\gamma}}\,\frac{\beta}{\mu} = \sqrt{\frac{\beta^2}{\gamma\mu}} = 2\sqrt{\frac{\beta^2}{4\gamma\mu}} =: 2\omega \, .$$
Our model thus reduces to
$$\chi'' + 2\omega\, \chi' + \chi = 0 \, .$$
We are only left with a single factor ω² = β²/(4γµ) and we can discuss the behaviour of the solution
to this model (or simulate it) depending on the size of this one parameter. The damping regime
now depends on ω² being less than or greater than one. After understanding the behaviour of this
non dimensionalized problem one can then look at the values of the parameters in the problem, e.g.,
µ, β, γ, to figure out which regime a given mass spring system belongs to. Of course in this simple
case we have not really learned anything new but the concept is more widely applicable.
Using Newton’s second law is one way of deriving the equations of motion for the mass. Another
approach is based on Hamiltonian dynamics which we will also briefly cover in this module. Let
us for now consider the frictionless case. Define the Hamiltonian
$$H(p, q) := \frac{\mu}{2}\, q^2 + \frac{\gamma}{2}\, p^2$$
and consider a particle moving such that H(x(t), x'(t)) is constant, in other words $\frac{d}{dt} H(x(t), x'(t)) = 0$.
Using the chain rule it is easy to see that
$$\frac{d}{dt} H(x(t), x'(t)) = \mu x'' x' + \gamma x' x = x' (\mu x'' + \gamma x)$$
so that H(x(t), x'(t)) is constant if and only if either x is stationary (i.e. x' = 0) or x solves the
second order problem
$$\mu x'' + \gamma x = 0 \, .$$
We looked a bit at the modelling aspects of this problem and did some analysis, we can now turn
to discretizing the problem. In this case we have an exact solution so looking at discretization
methods for this problem seems a bit pointless, but the fact that we know what the solution
should look like allows us to study the behaviour of a given method much more easily and we can
deduce something for more complicated cases where we do not know the exact solution. Of course
this only makes sense if we assume that the method we are studying is applicable to more general
problems. In this module we will focus on methods for solving first order nonlinear systems, i.e.,
ODEs of the form
$$y'(t) = f(y(t))$$
where y : [0, T] → R^m for m ≥ 1. We can easily rewrite our mass spring system in that form
by introducing the vector y(t) = (y₁(t), y₂(t)) = (x(t), x'(t)), so that y'(t) = (x'(t), x''(t)) =
(x'(t), −(γ/µ)x(t)) = (y₂(t), −(γ/µ)y₁(t)), which is of the right form if we define f(y) = (y₂, −(γ/µ)y₁).
A simple approach to discretize y(t) is to look for approximations $y_n \approx y(t_n)$ where $0 = t_0 < t_1 <
t_2 < \dots < t_N = T$ are some fixed points in time, for example $t_n = nh$ with $h = \frac{T}{N}$. The derivative
$y'(t_n)$ can be approximated using a finite difference quotient, for example
$$y'(t_n) \approx \frac{y(t_{n+1}) - y(t_n)}{h} \approx \frac{y_{n+1} - y_n}{h} \, .$$
Since $y'(t_n) = f(y(t_n)) \approx f(y_n)$ we arrive at the so called forward Euler method:
yn+1 = yn + hf (yn )
which is a very easy to implement method, since given the initial condition y0 we can directly
compute y1 = y0 + hf (y0 ) and then y2 = y1 + hf (y1 ) and so on up to yN = yN −1 + hf (yN −1 ).
Applying this to our mass spring problem we get
$$y_{n+1,1} = y_{n,1} + h\, y_{n,2} \, , \qquad y_{n+1,2} = y_{n,2} - h\, \frac{\gamma}{\mu}\, y_{n,1} \, .$$
In the following we set γ/µ = 1 and use as initial data y(0) = (x₀, v₀) = (1, 1), so that the exact
solution is simply
$$y(t) = (\cos(t) + \sin(t),\; -\sin(t) + \cos(t)) \, .$$
At T = 2π the solution should be back at (1, 1), so let us check what value y_N has for N = 2π/h for
different values of h: we compute a sequence of approximations to y(2π) using $h_i = 2\pi/(100 \cdot 2^i)$, i.e.,
we use $N_i = 100 \cdot 2^i$ steps for i = 0, 1, . . . , 5:
i N yN |y(T ) − yN |
0 101 [1.20766198 1.2277517 ] 3.08212e-01
1 201 [1.10139465 1.10595473] 1.46654e-01
2 401 [1.05003655 1.05112221] 7.15342e-02
3 801 [1.02484773 1.02511256] 3.53278e-02
4 1601 [1.01238062 1.01244602] 1.75552e-02
5 3201 [1.00617943 1.00619568] 8.75053e-03
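The experiment can be reproduced with a few lines of Python (a sketch, with our own variable names; the step counts match the table up to the convention of counting points versus steps):

```python
import math

def forward_euler_spring(h, steps):
    # forward Euler for y' = (y2, -y1), i.e. gamma/mu = 1, with y(0) = (1, 1)
    y1, y2 = 1.0, 1.0
    for _ in range(steps):
        y1, y2 = y1 + h * y2, y2 - h * y1
    return y1, y2

T = 2 * math.pi
for i in range(3):
    steps = 100 * 2**i
    y1, y2 = forward_euler_spring(T / steps, steps)
    err = math.hypot(y1 - 1.0, y2 - 1.0)
    print(steps, (y1, y2), err)
```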
We have used a very large number of points t_n for the final simulation and the solution is still not
all that accurate: the error has just dropped below 1%. Depending on the application this might or
might not be an acceptable level of error, and it may or may not be an acceptable computational effort
to reach this error. But it does seem worthwhile to investigate methods that achieve a smaller error
with the same computational cost, and we will study some approaches in this module. The results
seem to indicate that the error is going to zero with increasing N; in fact it looks like the error is
halving each time N is doubled, i.e., the error is proportional to 1/N ∼ h. We will see
later in this module that this is in fact the case. Computing only one period of the oscillation is
often not of interest; instead the long time behaviour is to be simulated, so let us redo the above
computation with T = 200π (which is actually not that long):
i N yN |y(T ) − yN |
0 10001 [-2.00707e+07,5.08076e+08] 5.08472e+08
1 20001 [1.48842e+04,2.27772e+04] 2.72078e+04
2 40001 [1.31599e+02,1.45952e+02] 1.95108e+02
3 80001 [1.16376e+01,1.19422e+01] 1.52607e+01
4 160001 [3.42277e+00,3.44495e+00] 3.44204e+00
5 320001 [1.85158e+00,1.85458e+00] 1.20644e+00
Not so good: the best that can be said is that it does seem to be converging, but the errors are
huge! The next simulation shows a trend here: instead of staying on a constant level curve of
H (i.e. H(y(t)) = H(y(0))) the value of H seems to be increasing, and to verify this we add H(y_N)
to our output (the expected value is H(1, 1) = 1). We also increase i a bit more:
So decreasing h (or increasing N) to compute the error at a fixed time does seem to work, although
the required work can be very high if the error is to be small or the time period somewhat longer.
Instead of changing h we will now fix h and increase T just to show that effect a bit more:
The time evolution of the discrete solution for different values of h is shown in the left figure in
the following plot (only every 15th approximate value is plotted). On the right you can see
a simulation with a larger value of T using the same value of h used for the curve with the
same colour on the left. The plots show the evolution of the system in phase space, i.e., the x-axis
represents the position of the mass and the y-axis its velocity. Another way of thinking of these
plots is in terms of the Hamiltonian H: H should be constant, i.e., the mass should remain on a
single level curve of H, which are circles around the origin.
We will see later on that the forward or explicit Euler method suffers from stability issues in the
case that h is too large (this is not the problem here...). Nevertheless, we can try a method that
we will later prove to be more stable: the backward or implicit Euler method. The approach to
derive the approximation is the same as for the forward Euler method, except that we use the
approximation at t = t_{n+1} instead of at t_n, i.e., $y'(t_{n+1}) \approx \frac{y(t_{n+1}) - y(t_n)}{h} \approx \frac{y_{n+1} - y_n}{h}$. Since
$y'(t_{n+1}) = f(y(t_{n+1})) \approx f(y_{n+1})$ we arrive at the so called backward Euler method:
$$y_{n+1} = y_n + h f(y_{n+1}) \, .$$
The method is in general not quite as easy to code up but since f is linear in our case it is still
fairly easy to do:
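For our linear system with γ/µ = 1 the implicit step (I − hJ)y_{n+1} = y_n, with J the rotation matrix ((0,1),(−1,0)), can be solved by hand, which gives the explicit update used in this sketch:

```python
import math

def backward_euler_spring(h, steps):
    # y_{n+1} = y_n + h f(y_{n+1}) with f(y) = (y2, -y1); solving the
    # resulting 2x2 linear system gives the update below
    y1, y2 = 1.0, 1.0
    d = 1.0 + h * h
    for _ in range(steps):
        y1, y2 = (y1 + h * y2) / d, (y2 - h * y1) / d
    return y1, y2

y1, y2 = backward_euler_spring(2 * math.pi / 1000, 1000)
print((y1, y2))  # close to (1, 1), but with a slightly reduced amplitude
```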
Note that now the mass is slowing down like it would if friction were added (recall that the x-axis
in these plots represents the position and the y-axis the velocity).
In summary: were we to use the forward Euler method to compute the orbit of a satellite around
Earth, the satellite would always be spinning off into space, preferable perhaps to the trajectory
predicted by the backward Euler method but still not correct... But also note that both methods
converge, i.e., if we fix a point in time and reduce h enough the error can (in theory) be made as
small as we want it to be. We have some experimental indication of this for the forward Euler
method; the following table indicates that the same is true for the backward Euler method:
Can we derive a method that converges and maintains the energy of the system, i.e., guarantees
that H(y_n) = H(y₀) for all n? The answer is yes and the method is just as simple to implement
as the forward Euler method. The method is often referred to as the symplectic Euler method and
you will need to look closely to see the difference to the forward Euler method described above:
$$y_{n+1,1} = y_{n,1} + h\, y_{n,2} \, , \qquad y_{n+1,2} = y_{n,2} - h\, \frac{\gamma}{\mu}\, y_{n+1,1} \, .$$
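In this linear case one can show that the iterates stay on an ellipse close to the exact level circle, so the computed energy remains in a small band around its initial value for arbitrarily long times. A sketch of a long time run (step size chosen for illustration):

```python
import math

def symplectic_euler_spring(h, steps):
    # position is updated with the old velocity, velocity with the NEW position
    y1, y2 = 1.0, 1.0
    h_min, h_max = 1.0, 1.0  # track min/max of H(y) = (y1^2 + y2^2)/2
    for _ in range(steps):
        y1 = y1 + h * y2
        y2 = y2 - h * y1
        H = 0.5 * (y1 * y1 + y2 * y2)
        h_min, h_max = min(h_min, H), max(h_max, H)
    return (y1, y2), h_min, h_max

# long time run, T = 200*pi with h = 2*pi/100
_, h_min, h_max = symplectic_euler_spring(2 * math.pi / 100, 100 * 100)
print(h_min, h_max)  # H stays close to H(y(0)) = 1 for all times
```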
For a fixed (small) time, can we derive a method that converges faster? Let's try ode45 (in Python
that is RK45):
i N yn |y(T ) − yN | H(yN ) rel error H
0 100 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
1 200 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
2 400 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
3 800 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
4 1600 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
5 3200 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
6 6400 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
7 12800 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
8 25600 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
Chapter 2
Getting Started
In this chapter we will introduce a few concepts without being too formal. The ideas will then be
expanded on in the following chapters.
We use the abbreviations $C^m(a,b)$ for $C^m((a,b))$ and $C^\infty(I) := \bigcap_{m \in \mathbb{N}} C^m(I)$. It follows that
$$C^\infty(I) \subset \dots \subset C^m(I) \subset \dots \subset C^0(I) \, .$$
Theorem 2.2 (Taylor Theorem). Let $f \in C^m(a,b)$ and $x_0 \in (a,b)$ be given. Then there exists a
function $\omega_m : \mathbb{R} \to \mathbb{R}$ with $\lim_{x \to x_0} \omega_m(x) = 0$, so that
$$f(x) = P_m(x) + \omega_m(x)\,(x - x_0)^m \, ,$$
where
$$P_m(x) = \sum_{k=0}^{m} \frac{1}{k!}\, f^{(k)}(x_0)\,(x - x_0)^k \, .$$
For the remainder term $R_m(x) := f(x) - P_m(x)$ there are different important representations:
1. Lagrange representation: For fixed $x \in (a,b)$ there is a $\xi$ between $x_0$ and $x$ so that
$$R_m(x) := \frac{1}{(m+1)!}\, f^{(m+1)}(\xi)\,(x - x_0)^{m+1} \, .$$
2. Integral representation:
$$R_m(x) := \frac{1}{m!} \int_{x_0}^{x} f^{(m+1)}(t)\,(x - t)^m \, dt \, .$$
Taylor expansion motivates the following definition:
Definition 2.3. A function $f \in C^1(x_0 - h_0, x_0 + h_0)$ is up to leading order equal to $f(x_0) + f'(x_0)h$
in an open set around $x_0$, i.e., there is a function $\bar\omega : (-h_0, h_0) \to \mathbb{R}$ with $|\bar\omega(h)|/|h| \to 0$ and
$f(x_0 + h) = f(x_0) + f'(x_0)h + \bar\omega(h)$. This means that we are neglecting all terms that converge
faster to zero than h.
Notation: $f(x_0 + h) \doteq f(x_0) + f'(x_0)h$.
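Definition 2.3 can be illustrated numerically; in this sketch (with f = exp as an arbitrary choice) the neglected remainder indeed vanishes faster than h:

```python
import math

# leading order approximation f(x0 + h) ~ f(x0) + f'(x0) h for f = exp at x0 = 0
def remainder(h):
    return math.exp(h) - (1.0 + h)   # the neglected term, of order h^2

# remainder(h)/h -> 0 as h -> 0, i.e. the neglected part vanishes faster than h
ratios = [abs(remainder(h)) / h for h in (1e-1, 1e-2, 1e-3)]
print(ratios)
```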
Remark (Vector valued case). Taylor expansion for a vector valued function $f \in (C^m(a,b))^p$ is
defined in the same way with a vector valued Taylor polynomial $P_m$; one can also consider this
the Taylor expansion of each component $f_i$ ($i = 1, \dots, p$) of f.
In the case that the argument of f is multidimensional, $f \in C^m(I)$ for $I \subset \mathbb{R}^q$ (or $f \in (C^m(I))^p$), the
Taylor expansion takes the same form with $f'$ the gradient (or Jacobian) of f, $f''$ the Hessian,
and so on.
For $x \in \mathbb{R}^q$ we will use $|x|$ to denote the Euclidean norm $|x| := \sqrt{\sum_{i=1}^q x_i^2}$.
(i) $g(t) = O(h(t))$ for $t \to 0$ iff there is a constant $C > 0$ and a $\delta > 0$ so that $|g(t)| \leq C\,|h(t)|$ for all $|t| < \delta$.
(ii) $g(t) = o(h(t))$ for $t \to 0$ iff there is a $\delta > 0$ and a function $c : (0,\delta) \to \mathbb{R}$ with $|g(t)| \leq c(|t|)\,|h(t)|$ for all $|t| < \delta$ and $c(t) \to 0$ for $t \to 0$.
Definition 2.8 (Experimental order of convergence (EOC)). The experimental order of convergence
of an approximation $z(h)$ is given by
$$p := \frac{\log \frac{|z(h_1) - z_0|}{|z(h_2) - z_0|}}{\log \frac{h_1}{h_2}} \, .$$
This can of course not be used to prove convergence, but we can get a good indication of the
convergence rate through experiments, and we can use a theoretically proven rate of convergence to
verify that an implementation is correct by comparing an experimental order of convergence with
the theoretical convergence rate. A major issue with this approach is that to compute the error one
requires knowledge of the exact solution z₀. In the example we discussed at the beginning of this
chapter, where we applied the forward Euler method to the linear mass spring problem, we know
the exact solution so could compute the error. In general we want to use approximations in complex
cases where no exact solution is available. In this case some other approach to determining the
order of convergence has to be used. In summary the EOC is a good tool to check the correctness
of an implementation, but to use it requires simplifying/modifying the problem to the point that
the exact solution is available.
Applying this to ODEs we will have z0 = Y (T ) and z(h) is the approximation at the final time
using a scheme with step size h.
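A small helper for computing the EOC from two runs might look as follows (a sketch; the function name and the synthetic test data are our own choices). Feeding it data with error exactly C·h² must return p = 2:

```python
import math

def eoc(z1, z2, z0, h1, h2):
    # experimental order of convergence from two approximations z(h1), z(h2)
    return math.log(abs(z1 - z0) / abs(z2 - z0)) / math.log(h1 / h2)

# artificial data with error exactly C*h^p for p = 2, so the EOC must be 2
z0, C, p = 1.0, 3.0, 2
h1, h2 = 0.1, 0.05
print(eoc(z0 + C * h1**p, z0 + C * h2**p, z0, h1, h2))
```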
Looking back at the errors computed for the mass spring problem we see that the error seems
to behave proportionally to h = 1/N; we already mentioned this previously. Using the concept of a
rate of convergence this means that the scheme converges linearly, i.e., with order p = 1. Let us
compute the EOC to confirm this:
i N yN |y(T ) − yN | EOC
0 101 [1.20766e+00,1.22775e+00] 3.08212e-01
1 201 [1.10139e+00,1.10595e+00] 1.46654e-01 1.07151e+00
2 401 [1.05004e+00,1.05112e+00] 7.15342e-02 1.03571e+00
3 801 [1.02485e+00,1.02511e+00] 3.53278e-02 1.01783e+00
4 1601 [1.01238e+00,1.01245e+00] 1.75552e-02 1.00891e+00
5 3201 [1.00618e+00,1.00620e+00] 8.75053e-03 1.00445e+00
2.1.2 Conditioning
We have seen a great many sources of error, and each of these has to be analysed and kept under
control, especially to avoid the accumulation of errors. Depending on the problem, the influence
of any of these errors can be severe.
Example 2.9. Consider the initial value problem for t > 0:
[Figure: sketch of solutions u(t) for t > 0.]
The next example also shows how even very small errors in some part of the algorithm can strongly
influence the error in the solution:
Example 2.10. Consider the linear system of equations
$$\begin{pmatrix} 1.2969 & 0.8648 \\ 0.2161 & 0.1441 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0.86419999 \\ 0.14400001 \end{pmatrix} =: b \, .$$
The exact solution is $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0.9911 \\ -0.4870 \end{pmatrix}$.
Due to some error, we obtain
$$\bar{b} = \begin{pmatrix} 0.8642 \\ 0.1440 \end{pmatrix}$$
instead of the exact right hand side. The relative errors in the first and second component are merely
$\frac{|0.86419999 - 0.8642|}{0.86419999} = 1.15 \cdot 10^{-8}$ and $\frac{|0.14400001 - 0.1440|}{0.14400001} = 6.94 \cdot 10^{-8}$. So the error is quite small.
But the solution to the new problem is $\begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \end{pmatrix} = \begin{pmatrix} 2 \\ -2 \end{pmatrix}$, which means that the error in the
solution to the linear system of equations is more than 100%.
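This amplification is easy to reproduce. The following sketch (pure Python, names our own) solves both systems with Cramer's rule and shows the tiny right hand side perturbation moving the solution from (0.9911, −0.4870) to (2, −2):

```python
def solve_2x2(a11, a12, a21, a22, b1, b2):
    # Cramer's rule; adequate here since we only want to illustrate conditioning
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

A = (1.2969, 0.8648, 0.2161, 0.1441)
x = solve_2x2(*A, 0.86419999, 0.14400001)  # exact right hand side b
xbar = solve_2x2(*A, 0.8642, 0.1440)       # perturbed right hand side
print(x)     # close to (0.9911, -0.4870)
print(xbar)  # close to (2, -2)
```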
The amplification of errors, as shown in the previous example, is characterized by the conditioning
of the problem.
Definition 2.11. A problem is said to be well conditioned if small errors in the data lead to small
errors in the solution, and badly conditioned otherwise. We will provide two different notions of
conditioning for the problem of computing the value f(x₀) for a given function f : U → Rⁿ and
given data x₀ ∈ U, where U is an open subset of R^m. We call this the problem (f, x₀).
Example 2.12. The solution to the linear system Ay = b is the problem (f, x₀) where $f(x) = A^{-1}x$
and x₀ = b. The problem given in Example 2.10 was apparently badly conditioned.
Theorem 2.13. Let $x_0 = (x_1, \dots, x_m) \in U$ and let $x_0 + \Delta x \in U$ be some perturbation of the
data with $|\Delta x| \ll 1$. If $f : U \to \mathbb{R}^n$ is once continuously differentiable then the error $\Delta f_i(x_0) =
f_i(x_0 + \Delta x) - f_i(x_0)$ ($i = 1, \dots, n$) in the evaluation of $f_i$ is up to leading order equal to
$$\sum_{j=1}^{m} \frac{\partial f_i}{\partial x_j}(x_0)\, \Delta x_j \, .$$
$$K_{rel} = \|f'(x_0)\| \, \frac{\|x_0\|}{\|f(x_0)\|} \, .$$
Example 2.19 (Conditioning for linear systems of equations). Consider the linear system Ax = b,
i.e., the problem (f, b) with $f(x) = A^{-1}x$. Thus we have $f'(x) = A^{-1}$ and
$$K_{abs} = \|A^{-1}\| := \sup_{y \neq 0} \frac{\|A^{-1}y\|}{\|y\|} \, .$$
Using the properties of the induced matrix norm $\|A^{-1}\|$ we compute
$$K_{rel} = \|A^{-1}\| \, \frac{\|b\|}{\|A^{-1}b\|} = \|A^{-1}\| \, \frac{\|Ax\|}{\|x\|} \leq \frac{\|A^{-1}\| \cdot \|A\| \cdot \|x\|}{\|x\|} = \|A^{-1}\| \cdot \|A\| \, .$$
Since there is an $x \in \mathbb{R}^m$ with $\|Ax\| = \|A\| \, \|x\|$, the number $\|A^{-1}\| \cdot \|A\|$ is a good estimate for
the condition number of the problem (f, b).
Consider the matrix A from Example 2.10. We can show that $\|A^{-1}\| \, \|A\| \approx 10^9$, which shows
that the problem is badly conditioned.
The following section describes in detail how numbers are represented on a computer and how
arithmetic operations are performed. That section also includes some more examples showing the
issue of cancellation.
Usually one uses the notation $\pm a = \underbrace{0.m_1 \dots m_r}_{\text{mantissa } M} \cdot\, b^{\pm E}$ with exponent
$E = e_{s-1} b^{s-1} + \dots + e_0 b^0$ and $m_i, e_i \in \{0, \dots, b-1\}$, $E \in \mathbb{N}$. For normalization purposes one assumes that $m_1 \neq 0$ if
$a \neq 0$.
For given (b, r, s) let the set A = A(b, r, s) denote all real numbers a ∈ R with the representation (∗).
To store a number $a \in D = [-a_{max}, -a_{min}] \cup \{0\} \cup [a_{min}, a_{max}]$ a mapping from D to A(b, r, s) is
defined: $fl : D \to A$ with $|fl(a) - a| = \min_{\hat{a} \in A} |\hat{a} - a|$.
Remark. The floating point representation allows storing real numbers of very different magnitude,
e.g., the speed of light $c \approx 0.29998 \cdot 10^9 \,\mathrm{m/s}$ or the electron mass $m_0 \approx 0.911 \cdot 10^{-30} \,\mathrm{kg}$.
We usually use the decimal system with b = 10, while on computers a binary representation is used
with b = 2. The constants r, s ∈ N depend on the architecture of the computer and the desired
accuracy.
Lemma 2.21. The set A(b, r, s) is finite. Its largest and smallest positive elements are
$a_{max} = (1 - b^{-r}) \cdot b^{b^s - 1}$ and $a_{min} = b^{-b^s}$, respectively.
Proof. Left as an exercise.
Remark. Usually for $a \in (-a_{min}, a_{min})$ one defines fl(a) = 0 (“underflow”). If $|a| > a_{max}$ (“overflow”)
many programs set a = NaN (not a number) and the computation has to be terminated.
Theorem 2.22 (Rounding errors). The absolute error is given by
$$|a - fl(a)| \leq \frac{1}{2}\, b^{-r} \cdot b^E \, ,$$
where E is the exponent of a. For the relative error caused by fl(a) for a ≠ 0 the estimate
$$\frac{|fl(a) - a|}{|a|} \leq \frac{1}{2}\, b^{-r+1}$$
holds.
Definition 2.23. The machine epsilon $\varepsilon_M := \frac{1}{2} b^{-r+1}$ is the difference between 1 and the next
larger representable number.
Defining $\varepsilon := \frac{fl(a) - a}{a}$, one has $fl(a) = a + \varepsilon a = a(1 + \varepsilon)$ and $|\varepsilon| \leq \varepsilon_M$.
Proof (Theorem 2.22). In the worst case fl(a) will differ from a by half a unit in the last position
of the mantissa of a: $|a - fl(a)| \leq \frac{1}{2} b^{-r} b^E$.
Since we are assuming a normalized representation ($m_1 \neq 0$) it follows that $|a| \geq b^{-1} b^E$ and
therefore
$$\frac{|fl(a) - a|}{|a|} \leq \frac{\frac{1}{2} b^{-r} b^E}{b^{-1} b^E} = \frac{1}{2}\, b^{-r+1} \, .$$
Example 2.24 (IEEE format). A usual format is the IEEE format. It provides standards for
single and double precision floating point numbers. A double precision number is stored using 64
bits (8 bytes):
$$x = \pm m\, 2^{c - 1022} \, .$$
One bit is used to store the sign. 52 bits are used for the mantissa $m = 2^{-1} + m_2 2^{-2} + \dots + m_{53} 2^{-53}$
(the first position is one due to normalization). The characteristic $c = c_0 2^0 + \dots + c_{10} 2^{10} \in [1, 2046]$
can be stored in the remaining 11 bits. Here $m_i, c_i \in \{0, 1\}$. By storing the exponent in the form
c − 1022, i.e., without a sign, the range of numbers is doubled. The two excluded cases c = 0
and c = 2047 are used to store x = 0 and NaN, respectively. We have $a_{max} = 2^{1024} \approx 1.8 \cdot 10^{308}$,
$a_{min} = 2^{-1022} \approx 2.2 \cdot 10^{-308}$, and $\varepsilon_M = \frac{1}{2}\, 2^{-52} \approx 10^{-16}$.
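The double precision parameters can be probed directly. Note that Python's `sys.float_info.epsilon` reports the gap 2⁻⁵² between 1 and the next representable number, which is twice the rounding bound ε_M defined above:

```python
import sys

# halve eps until adding it to 1.0 no longer changes the result
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
print(eps)  # 2.220446049250313e-16 == 2**-52
```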
Definition 2.25 (Machine operations). Each basic operation $\star \in \{+, -, \times, /\}$ is replaced by a
machine operation $\circledast$. In general
$$a \circledast b = fl(a \star b) = (a \star b)(1 + \varepsilon)$$
with $|\varepsilon| \leq \varepsilon_M$.
Remark. The operation $\circledast$ is not associative or distributive.
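The lost associativity shows up already with the simplest decimal fractions, none of which are exactly representable in binary:

```python
# (a + b) + c and a + (b + c) round differently in double precision
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left, right)  # 0.6000000000000001 vs 0.6
```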
Example 2.26 (Loss of significance). In this example we use b = 10 and r = 6. We study the
problem (f, x₀) with
$$f(x) = \sqrt{x+1} - \sqrt{x} \, , \qquad x_0 = 100 \, .$$
As x gets large, $\sqrt{x+1}$ and $\sqrt{x}$ are of very similar magnitude and subtracting the two values
is ill conditioned, as we already saw. Assume that we can compute the square roots also up to
six decimals: $fl(\sqrt{101}) = 0.100499 \cdot 10^2$ and $fl(\sqrt{101}) \ominus fl(\sqrt{100}) = 0.499000 \cdot 10^{-1}$ instead of
$fl(\sqrt{101} - \sqrt{100}) = 0.498756 \cdot 10^{-1}$. So we have lost 3 significant figures from the available 6.
Rewriting f in the form
$$f(x) = \frac{(\sqrt{x+1} - \sqrt{x})(\sqrt{x+1} + \sqrt{x})}{\sqrt{x+1} + \sqrt{x}} = \frac{1}{\sqrt{x+1} + \sqrt{x}}$$
removes the problem of loss of significance because adding $\sqrt{x+1}$ and $\sqrt{x}$ is well conditioned:
$fl(\sqrt{101}) \oplus fl(\sqrt{100}) = 0.200499 \cdot 10^2$ and $1 \oslash (0.200499 \cdot 10^2) = 0.498755 \cdot 10^{-1}$.
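The six-digit arithmetic can be mimicked with Python's decimal module (with correct rounding the last digit of the stable variant comes out as 0.0498756):

```python
from decimal import Decimal, getcontext

getcontext().prec = 6          # mimic b = 10, r = 6 from the example
s101 = Decimal(101).sqrt()     # 10.0499
s100 = Decimal(100).sqrt()     # 10 (exact)
naive = s101 - s100            # cancellation: only three figures survive
stable = 1 / (s101 + s100)     # rewritten form: no cancellation
print(naive, stable)
```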
Observe that:
$$fl\Big(\frac{z}{\varepsilon_M}\Big) = \frac{fl(z)}{fl(\varepsilon_M)} \leq \frac{z(1+\delta)}{\varepsilon_M} = z\, 2^{23} + \delta z\, 2^{23}$$
in single precision floating point. So the error term is magnified. It is possible to compensate for
this:
Example 2.27. Let
$$f(x) = \frac{\log(x+1)}{x}$$
and consider x ≈ 0. Then
$$\lim_{x \to 0} f(x) = 1 \, .$$
n 1 2 3 4 5 6 7 8
(II) 0.33333 0.11111 0.03704 0.01235 0.00412 0.00137 0.00046 0.00015
(I) 0.33333 0.1111 0.03699 0.01216 0.00337 −0.00161 −0.01147 −0.04755
The exact value is 0.00015 in the obtainable precision. Even with r = 8 we obtain 0.00015242
with (II) but 0.00010407 with (I); the exact value is 0.00015242. The corresponding relative
errors are approximately 0.27 · 10⁻⁶ and 0.31.
Overall stability means that errors in previous steps are not amplified.
Example 2.31 (Error amplification). We want to compute the integrals $I_k := \int_0^1 \frac{x^k}{x+5}\, dx$.
(A) Observe
$$I_0 = \ln(6) - \ln(5)$$
and
$$I_k + 5 I_{k-1} = \frac{1}{k} \quad (k \geq 1) \, , \quad \text{since} \quad \int_0^1 \frac{x^k}{x+5}\, dx + 5 \int_0^1 \frac{x^{k-1}}{x+5}\, dx = \int_0^1 x^{k-1}\, dx = \frac{1}{k} \, .$$
Here we use $\bar{I}_k$ to denote the computed value taking rounding errors into account. Obviously
$I_k$ is monotonically decreasing, and $I_k \searrow 0$ ($k \to \infty$), but this is not observed for the computed
values; we even have $\bar{I}_4 < 0$. On a standard PC we found: $\bar{I}_{21} = -0.158 \cdot 10^{-1}$ and
$\bar{I}_{39} = 8.960 \cdot 10^{10}$.
This is a typical example of error accumulation. In the scheme the error in $I_{k-1}$ is amplified
by the factor 5 to compute $I_k$.
(B) If one computes the values for $I_k$ exactly, one observes that $I_9 = I_{10}$ up to the first three
decimals. Using the backwards iteration $I_{k-1} = \frac{1}{5}\big(\frac{1}{k} - I_k\big)$ we obtain
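The two recursions can be compared in a few lines (a sketch; the starting index 60 for the backward recursion is an arbitrary choice, exploiting that the error of the crude guess I₆₀ ≈ 0 is damped by a factor 5 per step):

```python
import math

def forward_rec(n):
    # I_k = 1/k - 5*I_{k-1}, starting from I_0 = ln(6) - ln(5): errors grow by 5 each step
    I = math.log(6) - math.log(5)
    for k in range(1, n + 1):
        I = 1.0 / k - 5.0 * I
    return I

def backward_rec(n, start=60):
    # I_{k-1} = (1/k - I_k)/5, starting from the crude guess I_start = 0
    I = 0.0
    for k in range(start, n, -1):
        I = (1.0 / k - I) / 5.0
    return I

print(forward_rec(30))   # garbage: the rounding error has been amplified enormously
print(backward_rec(30))  # accurate small positive value
```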
Example 2.32 (Computing the solution to a quadratic equation). Consider the quadratic equation
$$y^2 - py + q = 0$$
for $p, q \in \mathbb{R}$ with $0 \neq q < \frac{p^2}{4}$. The two solutions are $y_{1,2} = y_{1,2}(p,q) = \frac{p}{2} \pm \sqrt{\frac{p^2}{4} - q}$. Also $p = y_1 + y_2$
and $q = y_1 y_2$. From this we can conclude that $\partial_p y_1 + \partial_p y_2 = 1$ and $y_2 \partial_p y_1 + y_1 \partial_p y_2 = 0$. Therefore,
$$\partial_p y_1 = \frac{y_1}{y_1 - y_2} \, , \qquad \partial_p y_2 = -\frac{y_2}{y_1 - y_2} \, .$$
From this we can also conclude that $\partial_q y_1 + \partial_q y_2 = 0$ and $y_2 \partial_q y_1 + y_1 \partial_q y_2 = 1$. Therefore,
$$\partial_q y_1 = \frac{1}{y_2 - y_1} \, , \qquad \partial_q y_2 = \frac{1}{y_1 - y_2} \, .$$
The relative condition numbers of $y_1(p,q)$ are hence
$$k_{1,p} = \frac{p}{y_1}\, \partial_p y_1 = \frac{1 + y_2/y_1}{1 - y_2/y_1} \, , \qquad k_{1,q} = \frac{q}{y_1}\, \partial_q y_1 = -\frac{y_2/y_1}{1 - y_2/y_1} \, .$$
Similar results can be obtained for the condition numbers $k_{2,p}$ and $k_{2,q}$ for $y_2(p,q)$. This shows
that the computation of the roots is badly conditioned if the two roots are close together, i.e., $\frac{y_2}{y_1}$
is close to one.
For $|\frac{y_2}{y_1}| \ll 1$ the problem is well conditioned. We could employ the following algorithm to compute
the results:
$$u = \frac{p^2}{4} \, , \qquad v = u - q \, , \qquad w = \sqrt{v} \, .$$
For p < 0 we should first compute $y_2 = \frac{p}{2} - w$ to avoid cancellation effects. For the second root
we can use different approaches:
$$\text{(I)} \quad y_1 = \frac{p}{2} + w \qquad \qquad \text{(II)} \quad y_1 = \frac{q}{y_2} \, .$$
2
For q p4 we have w ≈ p2 and (I) is prone to cancellation effects. Errors made in p and w are
carried over to y1 :
∆y1 • 1 ∆p 1 ∆w
≤
y1 1 + 2w/p p 1 + p/2w w .
+
2
Both factors are much greaten than one since q p4 . The method (II) is on the other hand stable:
∆y1 • ∆q ∆y2
≤
y1 q y2 .
+
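A sketch of the stable algorithm in double precision (the test values p = 10⁸, q = 1, giving roots of roughly 10⁸ and 10⁻⁸, are our own illustrative choice):

```python
import math

def quadratic_roots(p, q):
    # roots of y^2 - p*y + q = 0, assuming real roots (q < p*p/4)
    w = math.sqrt(p * p / 4.0 - q)
    # compute the larger-magnitude root first: no cancellation there ...
    y_big = p / 2.0 + w if p >= 0 else p / 2.0 - w
    # ... and recover the other root from q = y1 * y2 (variant (II))
    return y_big, q / y_big

p, q = 1e8, 1.0
naive = p / 2.0 - math.sqrt(p * p / 4.0 - q)   # variant (I): cancellation
y_big, y_small = quadratic_roots(p, q)
print(naive, y_small)  # naive is off by roughly 25%, the stable value is accurate
```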
and suitable right hand side it is possible to reduce this problem to a homogeneous, first order
ODE of the form
$$y'(t) = f(y(t))$$
and this is the type of problem we are going to study throughout this lecture (or its non-homogeneous
counterpart $y'(t) = f(y(t), t)$, although this is equivalent after introducing another
dependent variable satisfying the ODE $\tau' = 1$).
We will consequently assume that $y(t) = (y_i(t))_{i=1}^m$ for some given m ≥ 1. To make the problem
well posed we also need to provide an initial value, so we will assume that some $y_0 \in \mathbb{R}^m$ is given
and will look for solutions to the initial value problem
Theorem 2.33 (Picard’s Theorem). If f and $\frac{\partial f}{\partial y}$ are continuous in a closed rectangle
$$R = \{(t, y) \,|\, a_1 \leq t \leq a_2, \; b_1 \leq y \leq b_2\}$$
and if $(t_0, y_0)$ is an interior point of R, then (∗) has a unique solution y = g(t) which passes
through $(t_0, y_0)$.
Sketch of Proof. Requires complete metric spaces and the Banach fixed point theorem. By assumption,
$|f(t,y)| \leq K$ and $\left|\frac{\partial f}{\partial y}\right| \leq L$. It follows that
$$|f(t, y_1) - f(t, y_2)| \leq L\, |y_1 - y_2|$$
for all $(t, y_1)$ and $(t, y_2) \in R$, so we have the Lipschitz condition. Replace (∗) by an integral equation.
If y = g(t) satisfies (∗) then by integrating:
$$g(t) = y_0 + \int_{t_0}^{t} f(s, g(s)) \, ds \, .$$
Choose a > 0 such that La < 1 and |t − t0 | ≤ a and |y − y0 | ≤ Ka. Let X be the set of all
continuous functions y = g(t) on |t − t0 | ≤ a with |g(t) − y0 | ≤ Ka, and so X is a complete
metric space. Define a mapping T of X into itself by
Z t
T g = h, h(t) = y0 + f (s, g(s))ds (|h(t) − y0 | ≤ Ka)
t0
Furthermore,
Z t
|h1 (t) − h2 (t)| = [f (s, g1 (s)) − f (s, g2 (s))]ds ≤ La sup |g1 (t) − g2 (t)|.
t0
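The proof is constructive: iterating the integral operator T (Picard iteration) converges to the solution. A minimal sketch, assuming a scalar f and approximating the integral with the trapezoidal rule on a fixed grid (the grid size and iteration count are ad hoc choices):

```python
def picard(f, t0, y0, a, n=200, iterations=20):
    # Picard iteration g <- T g with (T g)(t) = y0 + int_{t0}^{t} f(s, g(s)) ds,
    # the integral evaluated with the trapezoidal rule on n subintervals
    h = a / n
    ts = [t0 + i * h for i in range(n + 1)]
    g = [y0] * (n + 1)          # initial guess: the constant function y0
    for _ in range(iterations):
        new, integral = [y0], 0.0
        for i in range(n):
            integral += 0.5 * h * (f(ts[i], g[i]) + f(ts[i + 1], g[i + 1]))
            new.append(y0 + integral)
        g = new
    return ts, g
```

For f(t, y) = y with y(0) = 1 on [0, 1] the final grid value approximates e, up to the quadrature error of the trapezoidal rule.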
for some τ ∈ [0, 1]. That this is a reasonable approximation can be seen using Taylor expansion, assuming Y ∈ C³:

Y(t+h) = Y(t+τh) + Y'(t+τh)(1−τ)h + ½ Y''(t+τh)((1−τ)h)² + ⅙ Y'''(χ_0)((1−τ)h)³

and

Y(t) = Y(t+τh) − Y'(t+τh)τh + ½ Y''(t+τh)(τh)² − ⅙ Y'''(χ_1)(τh)³ .

Therefore

(Y(t+h) − Y(t))/h = Y'(t+τh) + ½ Y''(t+τh) h (1−2τ) + O(h²) .
In the case that τ = ½, i.e., we are aiming at approximating the time derivative exactly in the middle of the interval [t, t+h], the Y'' term drops out and we are left with

Y'(t + h/2) = (Y(t+h) − Y(t))/h + O(h²) .

If we are not approximating in the middle of the interval, i.e., τ ≠ ½, the second order term is present and we end up with

Y'(t + τh) = (Y(t+h) − Y(t))/h + O(h) .
h
This type of superconvergence in some points is a typical property of many finite difference approximations to derivatives, i.e., they have a higher convergence rate in some isolated points than in the rest of the domain.
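The different convergence rates for τ = 0 and τ = ½ are easy to observe numerically. A small sketch using Y(t) = e^t as test function (an arbitrary smooth choice) and estimating the order from two step sizes:

```python
import math

def fd_error(tau, h):
    # error of the difference quotient (Y(t+h) - Y(t))/h as an
    # approximation of Y'(t + tau*h), for Y(t) = exp(t) at t = 0
    t = 0.0
    approx = (math.exp(t + h) - math.exp(t)) / h
    return abs(approx - math.exp(t + tau * h))

def observed_order(tau):
    # if the error behaves like C*h^r, halving h reveals r
    e1, e2 = fd_error(tau, 1e-2), fd_error(tau, 5e-3)
    return math.log(e1 / e2, 2)
```

Evaluating `observed_order(0.0)` gives a value close to 1 while `observed_order(0.5)` gives a value close to 2, matching the analysis above.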
A finite difference approximation can be used to compute an approximation y_n to the exact solution Y at a point in time t_{n+1} = t_n + h given approximations to Y at some earlier points in time. For example, taking τ = 0 and t = t_n,

f(t_n, Y(t_n)) = Y'(t_n) = (Y(t_{n+1}) − Y(t_n))/h + O(h) .

Replacing Y(t_n) by y_n and Y(t_{n+1}) by y_{n+1} and ignoring the O(h) term yields

(y_{n+1} − y_n)/h = f(t_n, y_n)

which provides an explicit formula to compute y_{n+1} given y_n:

y_{n+1} = y_n + h f(t_n, y_n) .
CHAPTER 2. GETTING STARTED 23
y_1 = y_0 + h f(t_0, y_0),
y_2 = y_1 + h f(t_1, y_1),
y_3 = y_2 + h f(t_2, y_2),
...
This is known as the forward or explicit Euler method:
Algorithm. (Forward or Explicit Euler method)

y_{n+1} = y_n + h f(t_n, y_n) .

Taking τ = 1 instead leads to an implicit method: y_{n+1} is determined as the root z of

F_n(z; y_n, t_n, h) = z − y_n − h f(t_n + h, z) ,

or, taking δ = z − y_n,

δ − h f(t_n + h, y_n + δ) = 0 .

This method is known as the backward Euler method:
Algorithm. (Backward or Implicit Euler method)

y_{n+1} = y_n + h f(t_{n+1}, y_{n+1}) .

Remark. We will see later that both the forward and backward Euler methods converge with order one. We will be discussing approaches to improve the accuracy and also why using a more complex implicit method can sometimes be a good idea.
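Both methods fit in a few lines of code. A sketch, assuming a scalar f; the implicit equation for δ is solved here by naive fixed-point iteration (which converges for h·L small) rather than the Newton method mentioned later:

```python
def forward_euler(f, t0, y0, h, n):
    # y_{k+1} = y_k + h f(t_k, y_k)
    t, y = t0, y0
    for _ in range(n):
        y = y + h * f(t, y)
        t += h
    return y

def backward_euler(f, t0, y0, h, n, fp_steps=8):
    # solve delta - h f(t+h, y+delta) = 0 by fixed-point iteration,
    # then update y_{k+1} = y_k + delta
    t, y = t0, y0
    for _ in range(n):
        delta = h * f(t + h, y)
        for _ in range(fp_steps):
            delta = h * f(t + h, y + delta)
        y = y + delta
        t += h
    return y
```

For the test problem y' = −y, y(0) = 1, both approximate e^{−1} ≈ 0.3679 at t = 1 with an O(h) error.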
Using the finite difference quotient in the middle of the interval to take advantage of the higher convergence rate is not so straightforward:

(y_{n+1} − y_n)/h = f(t_{n+1/2}, y_{n+1/2})

since y_{n+1/2} is not part of the sequence we are computing. We can either use the approximation on an interval 2h:

(y_{n+1} − y_{n−1})/(2h) = f(t_n, y_n)

which we can use to compute y_{n+1} assuming we know both y_n and y_{n−1}. This type of method is called a multistep method and is discussed later in the lecture. A second approach is to replace f(t_{n+1/2}, y_{n+1/2}) by an approximation, either

f(t_{n+1/2}, y_{n+1/2}) ≈ ½ (f(t_n, y_n) + f(t_{n+1}, y_{n+1}))

or

f(t_{n+1/2}, y_{n+1/2}) ≈ f(t_{n+1/2}, ½(y_n + y_{n+1})) .

The first approach is often called the Crank-Nicolson method while the second is called the implicit midpoint method. Both are implicit:
Algorithm. (Crank-Nicolson method)

y_{n+1} = y_n + (h/2) ( f(t_n, y_n) + f(t_{n+1}, y_{n+1}) ) .
The following algorithm provides an approximation to Y(T) given an h > 0:

t = t_0, y = y_0
While t < T
  Compute δ with δ − (h/2) f(t + h, y + δ) = 0 (e.g. using Newton’s method, discussed later)
  y = y + (h/2) f(t, y) + δ
  t = t + h

This method looks very similar to the backward Euler method but note that it does have a higher complexity, since more evaluations of f are required in each step and evaluation of f can be very expensive. On the other hand the higher convergence rate of the finite difference approximation at the midpoint could improve the overall convergence rate of the method, assuming we haven’t ruined things by approximating f(t_{n+1/2}, y_{n+1/2}).
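The time loop above can be sketched as follows, for a scalar f. As a simplification this version solves the equivalent update equation in δ = y_{n+1} − y_n by a naive fixed-point iteration instead of Newton's method, which is adequate when h·L is small:

```python
def crank_nicolson(f, t0, y0, h, n, fp_steps=30):
    # y_{k+1} = y_k + h/2 * (f(t_k, y_k) + f(t_{k+1}, y_{k+1})),
    # solved by fixed-point iteration in delta = y_{k+1} - y_k
    t, y = t0, y0
    for _ in range(n):
        fy = f(t, y)
        delta = h * fy          # forward Euler predictor as start value
        for _ in range(fp_steps):
            delta = 0.5 * h * (fy + f(t + h, y + delta))
        y += delta
        t += h
    return y
```

For y' = −y, y(0) = 1 with the fairly coarse step h = 0.1 the value at t = 1 is already within about 3·10⁻⁴ of e^{−1}, illustrating the second order accuracy.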
Remark (Alternative derivation). Starting from y'(t) = f(y(t)) we can use integration over the time interval [t_n, t_{n+1}] to derive an approximation:

y(t_{n+1}) − y(t_n) = ∫_{t_n}^{t_{n+1}} y'(t) dt = ∫_{t_n}^{t_{n+1}} f(y(t)) dt .

Now to get a numerical scheme we need to approximate the integral on the right. We will study more sophisticated methods later in the course but for now we can use an approximation based on a single point in the interval:

y(t_{n+1}) − y(t_n) = ∫_{t_n}^{t_{n+1}} f(y(t)) dt ≈ (t_{n+1} − t_n) f(y(t_n + τ(t_{n+1} − t_n))) = h f(y(t_n + τh)) ,

for some τ ∈ [0, 1]. As you will hopefully have noticed, we have rediscovered the forward Euler (τ = 0) and the backward Euler (τ = 1) methods.
E(h) := max_{0 ≤ k ≤ N} e_k .
Remark. The definition of the approximation error is not unique; or, more to the point, the way to measure the error is not unique. For example, instead of looking at the maximum error over the time interval we could study an average error or the error e_N at the final time only.
During the derivation of the schemes we replaced the time derivative with a finite difference quotient by dropping higher derivative terms in the Taylor expansion. Recall for example the formula used for the forward Euler method,

Y(t_{n+1}) = Y(t_n) + h f(t_n, Y(t_n)) + O(h²) ,

which indicates that we are introducing an error of the magnitude of h² when going from Y(t_n) to Y(t_{n+1}). Now define Ỹ(t_{n+1}) = Y(t_n) + h f(t_n, Y(t_n)) (so using the forward Euler formula but with the exact solution value at time t = t_n) and consider the solution Ỹ to the ODE with initial data Ỹ(t_{n+1}). The starting point for the Ỹ trajectory is therefore O(h²) off from the exact trajectory, which passes through (t_{n+1}, Y(t_{n+1})). Depending on the type of ODE we are considering, this gap will widen in time. Also, two sequences constructed via some numerical method applied to the two different initial values Y(t_{n+1}) (the exact value) and Ỹ(t_{n+1}) (the perturbed value) will diverge more and more from each other in each step. This is called error propagation, i.e., how errors made in previous steps influence the accuracy of the solution at later times. The stability of the numerical scheme and the error made in each step, the so called truncation error or consistency error, both play a central role in determining the approximation error of the scheme.
Forward Euler
For h > 0 let y_{n+1} = y_n + h f(y_n) be the sequence produced by the forward Euler method and t_n = nh the sequence of points in time. We also introduce the evaluation of the exact solution at these points in time, i.e., Y_n := Y(t_n), and finally we denote the error at each of these points in time by e_n := |y_n − Y_n|. Keep in mind that t_n, y_n, Y_n, e_n all depend on h. Next we introduce the local truncation error, which is conceptually the error introduced by inserting the exact solution into the numerical scheme:
Definition 2.35 (Local truncation error for forward Euler method).

τ_n := Y_{n+1} − Y_n − h f(Y_n)

Remark. From our definition we have Y_{n+1} = Y_n + h f(Y_n) + τ_n and our derivation based on Taylor expansion shows that the truncation error converges quadratically to 0, i.e., τ_n = O(h²).
Now assume that f is Lipschitz continuous (an assumption commonly used in the existence theory
of ODEs), i.e.,
|f (u) − f (v)| ≤ L|u − v| .
Using our definitions we now have

e_{n+1} = |y_n + h f(y_n) − (Y_n + h f(Y_n) + τ_n)| ≤ e_n + h|f(y_n) − f(Y_n)| + |τ_n| ≤ e_n + Lh e_n + |τ_n| = (1 + Lh) e_n + |τ_n| .
We thus conclude that going from step n to n + 1 leads to an amplification of the error e_n by 1 + Lh plus an additional O(h²) error coming from the truncation error.
We can now apply the same estimate to e_n to get

e_{n+1} ≤ (1+Lh) e_n + τ_n ≤ (1+Lh)² e_{n−1} + τ_n + (1+Lh) τ_{n−1} ≤ · · · ≤ (1+Lh)^{n+1} e_0 + Σ_{i=0}^{n} (1+Lh)^i τ_{n−i} .

Since e_0 = |y_0 − Y(0)| = |y_0 − y_0| = 0 and using τ_n ≤ Ch² where C depends neither on h nor on n (it depends on the second derivative of Y over the whole time interval) we get

e_{n+1} ≤ Ch² Σ_{i=0}^{n} (1+Lh)^i .

Using the geometric sum Σ_{i=0}^{n} (1+Lh)^i = ((1+Lh)^{n+1} − 1)/(Lh) together with (1+Lh)^{n+1} ≤ e^{L(n+1)h} ≤ e^{LT} this yields e_{n+1} ≤ Ch (e^{LT} − 1)/L, i.e., the error converges linearly to 0.
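The predicted first order convergence is easy to confirm experimentally. A sketch for the test problem Y' = Y, Y(0) = 1 on [0, 1] (an arbitrary choice with known exact solution Y(t) = e^t):

```python
import math

def forward_euler_max_error(h):
    # E(h) = max_k |y_k - Y(t_k)| for Y' = Y, Y(0) = 1 on [0, 1]
    n = round(1.0 / h)
    y, err = 1.0, 0.0
    for k in range(n):
        y += h * y                        # forward Euler step
        err = max(err, abs(y - math.exp((k + 1) * h)))
    return err

# halving h should roughly halve the error (observed order ~ 1)
e1 = forward_euler_max_error(0.01)
e2 = forward_euler_max_error(0.005)
rate = math.log(e1 / e2, 2)
```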
Backward Euler
The analysis for the backward Euler method is almost identical to the above. In this case y_{n+1} = y_n + h f(y_{n+1}) and τ_n := Y_{n+1} − h f(Y_{n+1}) − Y_n, so that

e_{n+1} ≤ e_n + h|f(y_{n+1}) − f(Y_{n+1})| + |τ_n| ≤ e_n + hL e_{n+1} + |τ_n| .
Since we are interested in h → 0 we can assume that hL ≤ 1 − ε for some ε ∈ (0, 1); then rearranging terms leads to

e_{n+1} ≤ e_n/(1 − hL) + τ_n/(1 − hL) ≤ · · · ≤ e_0/(1 − hL)^{n+1} + Σ_{i=0}^{n} τ_{n−i}/(1 − hL)^{i+1} ,

where β = 1/(1 − hL). We have β − 1 = hL/(1 − hL) and also

β = 1 + hL/(1 − hL) ≤ 1 + hL/ε

since we assumed that hL ≤ 1 − ε. Using that 1 + x ≤ e^x we finally have β ≤ e^{hL/ε} and thus, using again h = T/N and n ≤ N:

β^n ≤ e^{nhL/ε} ≤ e^{TL/ε} .
Putting it all together we have shown:
Theorem 2.37 (Convergence of the backward Euler method). If the exact solution Y ∈ C² and the right hand side f is Lipschitz continuous with Lipschitz constant L then the approximation error converges linearly to 0 for h → 0. For h ≤ (1 − ε)/L for some ε ∈ (0, 1) the error is bounded by

E(h) ≤ (max_{0 ≤ i ≤ N} τ_i) (1 − hL)/(hL) (exp(TL/ε) − 1) = O(h) .
For the linear problem y' = Ay the two Euler methods read

y_{n+1} = (I + hA) y_n   (forward Euler),
(I − hA) y_{n+1} = y_n   (backward Euler).

In the scalar case Y'(t) = λY(t) with λ ∈ C and Re λ < 0 the exact solution satisfies

|Y(t)| = e^{(Re λ) t} |Y_0| ,

and so
1. |Y(t)| is monotonically decreasing and converges to 0,
2. if λ ∈ R then Y(t) is strictly monotonically decreasing if Y_0 > 0 and strictly monotonically increasing if Y_0 < 0. Consequently, Y(t) has the same sign as Y_0 for all t.
Now one can ask the question: under which conditions does the sequence (y_n)_n behave in the same way?
Forward Euler
As we saw above,

y_{n+1} = (1 + hλ)^{n+1} Y_0

and thus |y_n| = |1 + hλ|^n |Y_0|. To get this to converge to zero requires |1 + hλ| < 1; since Re λ < 0 this will hold for h sufficiently small but not for too large values of h. In fact

1 > |1 + hλ| = √( (1 + h Re λ)² + h² (Im λ)² )

and thus

h < 2 |Re λ| / |λ|² ,

recalling that our assumption was that Re λ < 0. There are two interesting cases here:
1. Im λ = 0: the condition reduces to h < 2/|Re λ| or, perhaps easier to remember, h|λ| < 2.
2. Re λ → 0: in this case the upper bound for the step size tends to zero, so for purely imaginary λ the condition cannot be satisfied by any h > 0. On the other hand the exact solution satisfies |Y(t)| = |Y_0| (the origin is a center), so aiming for |y_n| → 0 does not make sense. But since |y_n| = |1 + hλ|^n |Y_0| the issue is not only that the discrete approximation does not converge to zero but in fact it grows monotonically without bounds. This is in fact the setting of our mass spring system from the introduction, where our experiments showed that the right long time behaviour is not achievable with the forward Euler method.
Backward Euler
Doing the same for the backward Euler scheme we arrive at (1 − hλ) y_{n+1} = y_n, or

y_{n+1} = Y_0 / (1 − hλ)^{n+1} .

We again focus on the stable case, i.e., Re λ < 0. Now |y_n| → 0 if and only if |1 − hλ| > 1, which holds for any value of h > 0, again using that Re λ < 0. Note that the condition is satisfied for any h in the case Re λ ≤ 0 (in fact even for quite a lot of the right half of the complex plane); in the purely imaginary case the condition is still satisfied for any h > 0, which means that |y_n| → 0 although the exact solution is a center with |Y(t)| = |Y_0|. But we already saw this in the introduction, where our simulations showed that for the mass spring system the implicit Euler method always leads to approximations that converge to zero.
A way to visualize the stability property of the two schemes is to plot the stability region in the complex plane, i.e., for the forward Euler method all complex values z = hλ such that |1 + z| < 1; for the implicit Euler method we shade the left half plane (although, as pointed out, most of the right half plane should also be shaded):
[Figure: stability regions in the complex z = hλ plane, real axis from −3 to 2 and imaginary axis from −2 to 2: the disk |1 + z| < 1 for the forward Euler method (left) and the shaded left half plane for the backward Euler method (right).]
Remark. The stability concept described so far is referred to as absolute stability. A more formal discussion will be carried out later in the lecture. Methods like the backward Euler method which are stable for all step sizes h are called unconditionally (absolutely) stable or simply A-stable, while methods like the forward Euler method which require h to be small enough for stability are called conditionally stable.
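The stability conditions reduce to checking the per-step amplification factors, which is easy to do numerically (a sketch; the λ values below are arbitrary test choices):

```python
def fe_factor(h, lam):
    # forward Euler amplification per step: |1 + h*lam|
    return abs(1 + h * lam)

def be_factor(h, lam):
    # backward Euler amplification per step: 1/|1 - h*lam|
    return abs(1 / (1 - h * lam))

lam = -10.0
# forward Euler is stable only for h < 2/|lam| = 0.2 here
assert fe_factor(0.1, lam) < 1 and fe_factor(0.3, lam) > 1
# backward Euler is stable for every step size
assert all(be_factor(h, lam) < 1 for h in (0.1, 0.3, 10.0))
# purely imaginary lam: forward Euler grows, backward Euler decays
z = complex(0.0, 5.0)
assert fe_factor(0.1, z) > 1 and be_factor(0.1, z) < 1
```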
We have so far focused on the limiting behaviour of y_n for n → ∞ in the case of linear problems with a stable fixed point at the origin. As pointed out above, if λ ∈ R and λ < 0 then the exact solution is monotonically decreasing or increasing depending on the sign of the initial condition Y_0 ∈ R. We will now look at conditions for h which guarantee that this behaviour carries over to the sequence (y_n)_n when using the forward or backward Euler methods.
Starting with the forward Euler method we have y_n = (1 − h|λ|)^n Y_0, where now λ is real and negative. We can assume for simplicity that Y_0 > 0; then we want to find a step size condition so that 0 < y_{n+1} < y_n, which is equivalent to 0 < 1 − h|λ| < 1, as can easily be seen. Since h|λ| > 0 the second condition is always satisfied, while the first condition requires h|λ| < 1. It is worth now comparing this to the condition for absolute stability derived previously, which in the current setting is |1 − h|λ|| < 1, equivalent to −1 < 1 − h|λ| < 1. So, as is to be expected, achieving monotonicity requires a harder restriction on the time step than absolute stability did: for |y_n| → 0 we require h < 2/|λ| while monotonicity requires h < 1/|λ|.
Turning our attention to the backward Euler method with y_n = Y_0/(1 + h|λ|)^n we see that monotonicity always holds since 0 < 1/(1 + h|λ|) < 1 for any step size h > 0. This behaviour is again unconditional, in line with the A-stability of the method.
Chapter 3
Some Aspects of Mathematical Modelling
In this chapter we first discuss different techniques of how mathematical models are arrived at, and
then discuss aspects of simplifying the models that make them easier to handle mathematically.
In the final part we provide some standard examples of mathematical models arising in different
areas of application.
These rate equations form a system of first order ODEs for the reactants A, B, C, D, . . . which have to be combined with initial values for each reactant A_0, B_0, . . . . There are the following elementary reactions to distinguish between:
• Constant supply: compound A is added to the system at a constant rate

∅ →[k] A   ⟹   (d/dt) A = k .

This looks like A is created out of thin air, which can’t really happen. What it means is that there is a bunch of stuff we are not modelling but that leads to a source of A.
• Decay: substance A transforms into waste at rate k (i.e. A decomposes and is removed from the system but the decay does not depend on any other reactants)

A →[k] ∅   ⟹   (d/dt) A = −kA .

Again this is extending the mathematical model to include something we are not accurately modelling.
CHAPTER 3. SOME ASPECTS OF MATHEMATICAL MODELLING 32
• Transformation: substance A transforms into substance B at rate k

A →[k] B   ⟹   (d/dt) A = −kA ,   (d/dt) B = kA .

• Reversible transformation: A transforms into B and vice versa. Such reactions should be explicitly expanded into separate forward and reverse reactions:

A ⇌[k_1, k_2] B   ⟹   (d/dt) A = −k_1 A + k_2 B ,   (d/dt) B = k_1 A − k_2 B .

This is equivalent to the two transformations A →[k_1] B, B →[k_2] A.
• Multiple products: A and B combine to form C

A + B →[k] C   ⟹   (d/dt) A = −kAB ,   (d/dt) B = −kAB ,   (d/dt) C = kAB .

Although this looks fine it does not tell the whole story: assume that A is equal to B, so the reaction is A + A →[k] C. The reaction speed is kA², but the correct ODEs can not simply be (d/dt) A = −kA², (d/dt) C = kA², since this implies that the amount of C being produced is the same as the amount of A being destroyed; but we need two units of A to produce one C, so we will need to redefine the ODE a bit.
Assume n units of A and m units of B react to produce p units of C and q units of D. The problem described above is now resolved by defining the ODE to be

nA + mB →[k] pC + qD   ⟹
(d/dt) A = −nkA^nB^m ,  (d/dt) B = −mkA^nB^m ,  (d/dt) C = pkA^nB^m ,  (d/dt) D = qkA^nB^m .

Now we would correctly get (d/dt) A = −2kA², (d/dt) C = kA² for the A + A →[k] C reaction.
Stoichiometry: the factors −n, −m, +p, +q are called stoichiometries of the reaction, and the actual factors in the ODE are always the product of the stoichiometry and the reaction speed (kA^nB^m in the above).
Remark: sometimes reactions like this are written as A + B →[k] AB. In the mathematical model one should use a new letter, e.g. C, to denote the compound AB, which is not A times B but a third reactant of the system!
(we are using i for the components and j for the reactions). The reaction network leads to the system of ODEs:

(d/dt) y_i = Σ_{j=1}^{n} k_j (b_{ij} − a_{ij}) Π_{l=1}^{m} y_l^{a_{lj}} ,   i = 1, · · · , m .
With this notation the system of ODEs can be written in the compact form

(d/dt) y = Γ w(y) .
Note that although this looks like a linear ODE, it is not one because in general w : Rm → Rn is
a nonlinear function.
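The compact form translates directly into code. A small sketch, using the reversible transformation A ⇌ B (with hypothetical rates k_1 = 2, k_2 = 1) as test network and forward Euler for the time stepping; note that the scheme conserves A + B up to rounding since the two rows of Γ sum to zero:

```python
def mass_action_rhs(gamma, w):
    # right hand side of y' = Gamma w(y) for a reaction network:
    # gamma is a list of rows (one per species), w maps y to the
    # vector of reaction speeds
    def rhs(y):
        rates = w(y)
        return [sum(g * r for g, r in zip(row, rates)) for row in gamma]
    return rhs

k1, k2 = 2.0, 1.0
gamma = [[-1.0, 1.0],   # row for A: -k1*A + k2*B
         [1.0, -1.0]]   # row for B: +k1*A - k2*B
rhs = mass_action_rhs(gamma, lambda y: [k1 * y[0], k2 * y[1]])

y, h = [1.0, 0.0], 0.001      # initial values A = 1, B = 0
for _ in range(5000):         # forward Euler up to t = 5
    f = rhs(y)
    y = [yi + h * fi for yi, fi in zip(y, f)]
```

By t = 5 the state is close to the equilibrium (A, B) = (1/3, 2/3) and the total A + B is still 1.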
Let’s verify that this form fits the ODEs for elementary reactions correctly:
• Constant supply: m = 1, n = 1, a_{11} = 0, b_{11} = 1, k_1 = k:

(d/dt) y_1 = k(1 − 0) y_1^0 = k
and

2AB →[k] B

and finally B is produced out of thin air:

∅ →[l] B .

First we will use C to denote AB and we remember to think of this as being four separate reactions, so m = 3, n = 4, i.e., we count the reversible reaction as two separate ones.
Starting with the equations for A: A is involved in two of the four reactions. In the first, 2A + B →[k_+] C, the stoichiometry is −2 and the speed is k_+A²B, while in the second the stoichiometry is 2 and the speed is k_−C:

A' = −2k_+A²B + 2k_−C .
Now B is involved in all four reactions: in the first we have stoichiometry −1 and speed k_+A²B, in the second stoichiometry 1 and speed k_−C, in the third stoichiometry 1 and speed kC², and finally in the last one stoichiometry 1 and speed l:

B' = −k_+A²B + k_−C + kC² + l .

Similarly, for C:

C' = k_+A²B − k_−C − 2kC² .
To simplify things we can write down the stoichiometry matrix (a good way to do this is column wise, i.e., collecting the entries for each reaction):

         R1   R2   R3   R4
    A    −2    2    0    0
Γ := B   −1    1    1    1
    C     1   −1   −2    0
Remark. The point about linear independence is that for c ∈ R^m with c ≠ 0 the function H(t) = Σ_{i=1}^{m} c_i y_i(t) is a conserved quantity, as is for example any scalar multiple of H; e.g., 2c also leads to a conserved quantity, which is equal to 2H. This does not provide any additional information and we therefore are only looking for conserved quantities H_1, . . . , H_d which are generated from linearly independent vectors c_1, . . . , c_d.
A conserved quantity H = Σ_{i=1}^{m} c_i y_i can be used to remove one component from the ODE. Since c ≠ 0 we must have at least one i_0 such that c_{i_0} ≠ 0 and therefore

y_{i_0}(t) = (1/c_{i_0}) H(t) − Σ_{i ≠ i_0} (c_i/c_{i_0}) y_i(t) = (1/c_{i_0}) H(0) − Σ_{i ≠ i_0} (c_i/c_{i_0}) y_i(t)
having used that H is constant and therefore H(t) = H(0). Since H(0) = Σ_{i=1}^{m} c_i y_{0,i}, where y_0 ∈ R^m are the initial conditions for y, we see that y_{i_0}(t) satisfies the algebraic equation

y_{i_0}(t) = (1/c_{i_0}) Σ_{i=1}^{m} c_i y_{0,i} − Σ_{i ≠ i_0} (c_i/c_{i_0}) y_i(t)

which can be substituted into the ODE to remove y_{i_0} from the system of ODEs.
Example 3.4. Recall the previous example with the stoichiometry matrix

         R1   R2   R3   R4
    A    −2    2    0    0
Γ := B   −1    1    1    1
    C     1   −1   −2    0

Solving Γᵀc = 0 we find that c = (c_1, −4c_1, −2c_1) is a solution and therefore there is at least one conserved quantity; e.g., taking c = (1, −4, −2) we find that H = A − 4B − 2C is conserved (note that this requires l = 0, since the constant supply reaction R4 would otherwise contribute −4l):

H' = − 2k_+A²B + 2k_−C
     + 4k_+A²B − 4k_−C − 4kC²
     − 2k_+A²B + 2k_−C + 4kC² = 0
It is easy to see that the dimension of Kernel(Γᵀ) is 1 and for example c_1 = (1, 1, 1)ᵀ satisfies Γᵀc_1 = 0, and thus H_1 = c_1 · (A_1, A_2, A_3) = A_1 + A_2 + A_3 is a conserved quantity. To check this note that the ODE system corresponding to the above reaction network is A_1' = −(k_1 + k_2)A_1, A_2' = k_1 A_1, A_3' = k_2 A_1 and consequently

(d/dt)(A_1 + A_2 + A_3) = −(k_1 + k_2)A_1 + k_1 A_1 + k_2 A_1 = 0 .

But in addition c_2 = (0, −k_2, k_1)ᵀ also leads to a conserved quantity H_2 = k_1 A_3 − k_2 A_2 since

H_2' = k_1 k_2 A_1 − k_2 k_1 A_1 = 0 ,
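These conservation properties can be checked numerically along a discrete trajectory. A sketch for the network A_1 → A_2 (rate k_1), A_1 → A_3 (rate k_2) with forward Euler (the rates and initial values are arbitrary choices); since both H_1 and H_2 are linear in the state, forward Euler preserves them up to rounding:

```python
# network A1 -> A2 (rate k1) and A1 -> A3 (rate k2); verify that
# H1 = A1 + A2 + A3 and H2 = k1*A3 - k2*A2 stay constant
k1, k2, h = 0.75, 0.25, 0.001
A1, A2, A3 = 1.0, 0.2, 0.1
H1_start = A1 + A2 + A3
H2_start = k1 * A3 - k2 * A2
for _ in range(4000):
    # one forward Euler step; the right hand sides use the old A1
    A1, A2, A3 = (A1 - h * (k1 + k2) * A1,
                  A2 + h * k1 * A1,
                  A3 + h * k2 * A1)
H1_end = A1 + A2 + A3
H2_end = k1 * A3 - k2 * A2
```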
Using the language of dynamical systems one says that the set [0, ∞)^m is invariant under the ODE, i.e., trajectories that start in this set remain in this set. We will not give a proof here that [0, ∞)^m is invariant for the ODEs derived from the mass action law. From a numerical point of view it is important to note that these ODE systems come with additional properties which should be respected by a numerical scheme used to solve them. From the above discussion we now know that:
• there might be additional conserved quantities H_l, related for example to mass conservation. Guaranteeing that the numerical scheme does not lead to a production/destruction of mass due to the approximation error could be important depending on the application.
• the solution to the ODE remains non negative if the initial conditions are non negative: again this is an important property of the underlying model and should be preserved by the approximation.
(d/dt) A = βA. In terms of a reaction network this could be expressed as A →[β] A + A (since A here appears on both sides of the reaction, the resulting ODE is of the form A' = −βA + 2βA = βA). Of course we can assume that β combines both birth rate and death rate (although then β might have to be chosen to be negative, which is a bit of an extension of the normal reaction kinetics framework). A second approach to include death (but keep β > 0) is to add the reaction A →[γ] ∅; then the ODE is A' = −βA + 2βA − γA = (β − γ)A.
The weakness of Malthus’ law, predicting unlimited growth of a population (or its unavoidable extinction), led to an improved model developed by Verhulst, usually called the logistic equation: (d/dt) A = (γ − βA)A. The effective growth rate in this model is γ − βA and thus depends on the current size of the population (modelling competition). In terms of reaction kinetics we can express this by A + B →[β] 2A, leading at first to the two ODEs A' = −βAB + 2βAB = βAB, B' = −βAB. Since A' + B' = 0 we have that A + B = A_0 + B_0, or B = A_0 + B_0 − A, so that we can eliminate B, leading to A' = βAB = βA(A_0 + B_0 − A) = (γ − βA)A with γ = β(A_0 + B_0). Given β, γ, A_0 this provides B_0: B_0 = γ/β − A_0, and since B(t) = A_0 + B_0 − A we find that B(t) = γ/β − A(t) for all t. In the logistic equation γ/β is the second fixed point (next to A = 0) and is called the carrying capacity.
Note that the stoichiometry matrix has the two rows (only one reaction) (−1 + 2) = (1) and (−1), which are clearly linearly dependent. A corresponding vector c = (1, 1) leads to the mass conservation A' + B' = 0 we stated above.
A more detailed way to describe the coupling of populations to their use of resources is to introduce additional rate equations to also describe the growth/decay of the resources. Consider a population of rabbits (R) and foxes (F). Rabbits reproduce based on Malthus’ law: R →[α] 2R. When foxes and rabbits meet up then the rabbit goes to a better place (one can only hope) and the fox reproduces (on the spot apparently): R + F →[β] F + δF. Finally, foxes die of old age: F →[γ] ∅. This leads to the following system of ODEs

R' = αR − βRF ,   F' = −γF − βRF + β(1 + δ)RF = βδRF − γF .

Using the approach of stoichiometry we get the following:

Γ = ( 1  −1   0 )
    ( 0   δ  −1 ) ,   w(R, F) = (αR, βRF, γF)ᵀ .

This system is referred to as the Lotka-Volterra predator-prey system. As far as I know Lotka was a chemist and Volterra wanted to model a fish population...
Remark. There are many extensions to this model that are not directly related to the mass action law. Learning is for example one effect that scientists like to take into account, e.g., the prey learns to avoid the predator. This could be modelled using a time dependent reaction rate β(t) to try to keep things within the framework of mass action kinetics, and many other extensions of mass action kinetics can be introduced as well. A second observation is that the assumption that predators kill prey whenever they meet up with one another is not realistic for most species: once a predator is full it stops eating. This can be modelled by introducing a nonlinearity into the system:

R' = αR − g(R)F ,   F' = δg(R)F − γF .

So instead of assuming that the rate of increase of foxes due to feeding is linear in R (the old term was βRF) one assumes that it has the form g(R)F, where for example g becomes constant if R is large, i.e., if there is an abundance of food, e.g., g(R) = βR/(1 + βR): for R small this behaves like βR while for R large g(R) tends to 1. This type of nonlinearity does not directly fit into the concepts of mass action kinetics.
SIR: epidemiology
In the context of the spread of diseases, simple models of epidemics divide the total population into sub-groups depending on whether individuals are infected (I(t)), susceptible to the disease (S(t)) or recovering from the disease (R(t)). The transitions between these states can be interpreted as reactions

I + S →[k] I + I ,   I →[γ] R

(an infected person meeting a susceptible one can lead to two infected people (depending on k), while infected people recover at a certain rate γ). Note that recovered people are assumed immune to the infection. The resulting system of ODEs is

(d/dt) S = −kSI ,   (d/dt) I = kSI − γI ,   (d/dt) R = γI .
This model is often referred to as the SIR model. One property of this model is that it conserves the population, i.e., (d/dt)(S + I + R) = 0. We should keep this in mind when designing numerical schemes for such a system: the discretization should not lead to a growing or shrinking population if the model doesn’t include births and deaths. Of course deaths by the infection can be included in the I equation, and one can also derive SIS models, where immunity to the infection is not possible, using a “reversible reaction”. One can also add additional groups to the population, e.g., different stages of the infection such as infected but not yet contagious. These models can be made to describe different types of transitions, always using the rate equations we discussed for chemical reactions.
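A sketch of the SIR system under forward Euler (the parameter values below are arbitrary). Since dS + dI + dR = 0 in each step, the scheme conserves the total population up to rounding, which is exactly the property discussed above:

```python
def sir_step(S, I, R, k, gamma, h):
    # one forward Euler step for the SIR system; note dS + dI + dR = 0,
    # so S + I + R is preserved by the scheme (up to rounding)
    dS = -k * S * I
    dI = k * S * I - gamma * I
    dR = gamma * I
    return S + h * dS, I + h * dI, R + h * dR

def run_sir(k=0.5, gamma=0.1, h=0.01, steps=10000):
    S, I, R = 0.99, 0.01, 0.0   # normalized population, one percent infected
    for _ in range(steps):
        S, I, R = sir_step(S, I, R, k, gamma, h)
    return S, I, R
```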
Michaelis-Menten kinetics
The kinetics of an enzymatic reaction is described by

S + E ⇌[k_f, k_r] SE →[k_c] E + P

where S is a substrate which reacts with an enzyme E to produce the complex SE. This is a reversible reaction with forward rate k_f and reverse rate k_r. The complex SE can in turn release a product P. The resulting system of ODEs is given by (denoting the complex with C)

S' = −k_f SE + k_r C ,   E' = −k_f SE + k_r C + k_c C ,   C' = k_f SE − k_r C − k_c C ,   P' = k_c C .

Note again the misuse of notation often found in the literature: in the ODE system SE refers to the product of the two functions S(t), E(t), while in the reaction network SE is a compound, the amount of which is described by C(t) in the system of ODEs.
An important property of this equation is that the Hamiltonian H = H(x, p) is constant in time along a given trajectory:
Lemma 3.7. Assume that x solves mx'' = −∇V(x) and define the momentum of the particle x by p = mx'; then

(d/dt) H(x(t), p(t)) = 0 ,

with H(x, p) := (1/2m)|p|² + V(x).
Proof.

(d/dt) H(x(t), p(t)) = ∂_x H(x(t), p(t)) x'(t) + ∂_p H(x(t), p(t)) p'(t)
                     = ∇V(x(t)) x'(t) + (1/m) p(t) p'(t)
                     = ∇V(x(t)) x'(t) + x'(t) m x''(t) = x'(t) (∇V(x(t)) − ∇V(x(t))) = 0

where we have used that p' = mx'' = −∇V(x(t)) since x solves the ODE mx'' = −∇V(x).
Assuming now that x solves the second order ODE mx'' = −∇V(x) we see that x'(t) = (1/m)p(t) and p'(t) = −∇V(x(t)). On the other hand ∂_x H(x, p) = ∇V(x) and ∂_p H(x, p) = (1/m)p, so that (x, p) solves the Hamiltonian system

x'(t) = ∂_p H(x(t), p(t)) ,   p'(t) = −∂_x H(x(t), p(t)) .

Remark. Note that this is a different system from the first order problems we have so far considered in place of second order equations.
The functions x are called generalized coordinates and p the momentum of the system. The functions (x(t), p(t)) are also referred to as the Hamiltonian flow given by H.
If H(x, p) is of the form H(x, p) = T(p) + V(x) then it is called separable. For example the Hamiltonian from the beginning of the section, H(x, p) := (1/2m)|p|² + V(x), is separable.
Remark. It is worth repeating that x is not required to consist of spatial coordinates; for example it could be the angle a pendulum’s rod makes with the vertical line.
Definition 3.9 (Invariant). A function f = f(x, p) is called an invariant under the Hamiltonian flow given by H if

(d/dt) f(x(t), p(t)) = 0

where (x, p) solves the Hamiltonian system given by H.
Lemma 3.2.1. For f to be an invariant under a Hamiltonian flow is equivalent to

f_x H_p − f_p H_x = 0 ,

x'(t) = bx + cp ,
p'(t) = −ax − bp ,

and is therefore a linear system of ODEs. We can rewrite this as (x, p)' = A(x, p)ᵀ with matrix

A = (  b   c )
    ( −a  −b ) .
Example 3.12 (Pendulum). Consider a particle with mass m attached to a rod of length l and let g be the gravitational acceleration. We assume that the pendulum can only swing in one plane so that we can describe the position of the pendulum over time in Cartesian coordinates (x, y). This is quite complex since the length of the rod restricts the position of the pendulum, which always has to lie on a circle of radius l around the pivot. It is much easier to use the generalized coordinate θ, which is the angle the rod makes with, for example, the vertical axis, where we choose θ = 0 in the case that the pendulum is pointing downwards. Once we have θ the actual position is given by (x, y) = l(sin(θ), − cos(θ)).
The momentum of this system is the angular momentum of the pendulum, p = ml²θ', while the potential energy is equal to the gravitational energy V(θ) = −mgl cos(θ). Therefore, the total energy is given by

H(θ, p) = p²/(2ml²) − mgl cos(θ) .

The Hamiltonian system is given by

θ' = p/(ml²) ,   p' = −mgl sin(θ) ,

or, written as a second order equation,

θ'' = p'/(ml²) = −(g/l) sin(θ) .
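A short numerical check ties this back to the stability discussion of Chapter 2: applying the forward Euler method to the pendulum system does not preserve H; the energy typically grows, just as |y_n| grew for the linear center. A sketch (the values for m, l, g, the step size and the initial angle are arbitrary):

```python
import math

def pendulum_energy_drift(theta0, steps=2000, h=0.001, m=1.0, l=1.0, g=9.81):
    # H(theta, p) = p^2/(2 m l^2) - m g l cos(theta)
    H = lambda th, p: p * p / (2 * m * l * l) - m * g * l * math.cos(th)
    theta, p = theta0, 0.0
    H_start = H(theta, p)
    for _ in range(steps):
        # forward Euler for theta' = p/(m l^2), p' = -m g l sin(theta);
        # both right hand sides use the old state
        theta, p = (theta + h * p / (m * l * l),
                    p - h * m * g * l * math.sin(theta))
    return H_start, H(theta, p)
```

Running this with theta0 = 0.5 shows a small but clearly positive drift of H over the simulated time span.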
Example 3.13 (Duffing oscillator). Let us consider the Hamiltonian for the Duffing oscillator with mass m = 1: H(x, p) = ½p² + (δ/2)x² + (β/4)x⁴ with β ∈ R and δ > 0. Note that β = 0 gives us the Hamiltonian for the harmonic oscillator with k² = δ/2. The resulting first order Hamiltonian system is given by

x'(t) = ∂_p H(x(t), p(t)) = p ,
p'(t) = −∂_x H(x(t), p(t)) = −δx − βx³ ,

with V(x) = (δ/2)x² + (β/4)x⁴. This is as expected from the discussion above and we could have written this down directly by looking at the Hamiltonian written in the form H(x, p) = ½p² + V(x).
First consider the β = 0 case: then the second order problem becomes the standard mass spring equation x'' + δx = 0 with mass m = 1. If β ≠ 0 the equation describes a spring system where the spring restoring force (given in the standard case by δx) is nonlinear: the stiffness of the spring does not exactly obey Hooke’s law. The restoring force provided by the nonlinear spring is then (δ + βx²)x. The case β > 0 is called a hardening spring while β < 0 results in a so called softening spring. Basically, if β > 0 the restoring force gets larger the more the spring is displaced from its resting position, i.e., the larger x² becomes, while for β < 0 the restoring force is decreased with increasing displacement and could even switch sign.
Example 3.14 (oscillators). Many linear or nonlinear oscillators are governed by Hamiltonian systems with
• Simple pendulum: H(x, p) = p²/(2l²) − gl cos(x).
• Harmonic oscillator: H(x, p) = ½p² + k²x².
Definition 3.15 (Principle of least action). The principle of least action states that given two points in time t_0 < t_1, the particle path given by mx'' = −∇V(x) is the path that minimizes the action

S(q) = ∫_{t_0}^{t_1} L(q(t), q'(t)) dt

under all paths q which go from position (t_0, x(t_0)) to (t_1, x(t_1)). In other words, x is a path such that

S(x) ≤ S(q) ,   for any path q : [t_0, t_1] → R^d for which q(t_0) = x(t_0), q(t_1) = x(t_1) .
Assume we had such a path and η = η(t) was a path with η(t_0) = η(t_1) = 0. Taking ε (small of course), y(t; ε) := x(t) + εη(t) is a path that satisfies y(t_0; ε) = x(t_0), y(t_1; ε) = x(t_1), so since x is extremal it must hold that S(x) ≤ S(y(·; ε)). Now we can define the function F : R → R with

F(ε) := S(y(·; ε)) = ∫_{t_0}^{t_1} L(y(t; ε), y'(t; ε)) dt

where y'(t; ε) is the derivative of y with respect to t, i.e., y'(t; ε) = x'(t) + εη'(t). So

F(ε) = ∫_{t_0}^{t_1} L(x + εη, x' + εη') dt .
Now the extremal condition translates to F(0) ≤ F(ε) for all ε, so 0 is an extremal point of the
scalar function F, which means that F'(0) = 0 must hold. Now we need to figure out how to
compute the variation of S, i.e., the derivative F'. We will not go into the mathematical details
of why the following is legal, but it turns out that (just think of the chain rule):
F'(ε) = d/dε ∫_{t0}^{t1} (m/2)|x'(t) + εη'(t)|² − V(x(t) + εη(t)) dt
      = ∫_{t0}^{t1} m(x'(t) + εη'(t)) η'(t) − ∇V(x(t) + εη(t)) η(t) dt .
To complete the argument we need to get rid of the derivative on the test function η which we
can do using integration by parts:
0 = F'(0) = [m x'(t)η(t)]_{t=t0}^{t1} + ∫_{t0}^{t1} −m x''(t)η(t) − ∇V(x(t))η(t) dt .
This has to hold for all paths η which vanish at t0 , t1 and if everything is smooth then the
fundamental theorem of variational calculus states that if
∫_{t0}^{t1} G(t)η(t) dt = 0
for all functions η with η(t0 ) = η(t1 ) = 0 then G(t) = 0 for all t ∈ (t0 , t1 ). Using this result (which
we do not prove here) we find that
m x''(t) = −∇V(x(t)) for all t ∈ (t0, t1).
Adding a damping force −αx' and a periodic external forcing δ cos(ωt) to a linear spring force −βx
gives, written as a first order system,
x' = v , m v' = −αv − βx + δ cos(ωt) .
CHAPTER 3. SOME ASPECTS OF MATHEMATICAL MODELLING 43
Let us assume that these additional forces are of the form f(x, v). Then the Lagrange–d'Alembert
principle states that the variation of the action including f leads to
∫_{t0}^{t1} m x'(t)η'(t) − ∇V(x(t))η(t) dt − ∫_{t0}^{t1} f(x(t), x'(t))η(t) dt ,
Now for the forward Euler method we used r = 1 and replaced Y'(tn) by f(tn, Y(tn)) using (∗).
We can also use (∗) to replace Y''(tn):
Y''(tn) = d/dt f(t, Y(t))|_{t=tn} = ∂t f(tn, Y(tn)) + ∂y f(tn, Y(tn)) Y'(tn) .
Using again Y'(tn) = f(tn, Y(tn)) we can replace Y''(tn) in the Taylor series by evaluations of f
and derivatives of f that only involve knowing Y(tn) but no derivatives of Y. So we can compute
the derivatives Y_n^k = Y^(k)(tn) with the following expressions:
Y_n^0 = Y(tn), Y_n^1 = f(tn, Y_n^0), Y_n^2 = ∂t f(tn, Y_n^0) + ∂y f(tn, Y_n^0) Y_n^1
In the same way we can arrive at formulas for Y_n^k = Y^(k)(tn) with k > 2 based only on Y_n^0, . . . , Y_n^{k−1}
and higher order derivatives of f. We then have
Y(t_{n+1}) = Σ_{k=0}^{r} (1/k!) h^k Y_n^k + O(h^{r+1}) .
CHAPTER 4. HIGHER ORDER ONE STEP METHODS 45
Now by ignoring the O(h^{r+1}) term and replacing Y_n^0 by the approximation yn of Y at time t = tn,
we arrive at an approximation y_{n+1} for Y at t = t_{n+1}:
y_{n+1} = Σ_{k=0}^{r} (1/k!) h^k y_n^k
with
y_n^0 = yn , y_n^1 = f(tn, y_n^0) , y_n^2 = ∂t f(tn, y_n^0) + ∂y f(tn, y_n^0) y_n^1 , . . .
If the ODE is vector valued we can use the same approach but we need to use the vector valued
form of the Taylor series, i.e., ∂y f (tn , yn0 ) is the Jacobian and ∂y f (tn , yn0 )yn1 is a matrix-vector
product and so on.
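For a concrete problem the required derivatives can be worked out by hand. The following is a minimal sketch (not part of the original lecture material; the test problem y' = y, for which ∂t f = 0 and ∂y f = 1, is an illustrative choice) of the resulting order-2 Taylor method:

```python
import math

def taylor2_step(t, y, h, f, df_dt, df_dy):
    # One step of the order-2 Taylor method:
    # y_{n+1} = y_n + h*y1 + (h^2/2)*y2 with
    # y1 = f(t, y) and y2 = f_t(t, y) + f_y(t, y) * f(t, y).
    y1 = f(t, y)
    y2 = df_dt(t, y) + df_dy(t, y) * y1
    return y + h * y1 + 0.5 * h**2 * y2

# Test problem y' = y, Y(0) = 1, exact solution Y(t) = exp(t);
# the partial derivatives of f are computed by hand.
f     = lambda t, y: y
df_dt = lambda t, y: 0.0
df_dy = lambda t, y: 1.0

y, t, h = 1.0, 0.0, 0.1
for n in range(10):
    y = taylor2_step(t, y, h, f, df_dt, df_dy)
    t += h
print(abs(y - math.exp(1.0)))  # error at t = 1 is O(h^2), here ~4e-3
```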
This method is difficult to implement for a specific problem since it requires computing high order
partial derivatives of f. These derivatives have to be recomputed for each new problem. In the
next example we will see how to construct higher order methods which only require evaluations
of f:
Example 4.2. (2 step Heun method) We again focus on a scalar ODE but the ideas carry over
to the vector valued case.
Starting with the first three terms of the Taylor expansion (r = 2):
Y(t_{n+1}) = Y(tn) + h f(tn, Y(tn)) + (1/2)h² (∂t f(tn, Y(tn)) + ∂y f(tn, Y(tn)) f(tn, Y(tn))) + O(h³)
          = Y(tn) + (1/2)h f(tn, Y(tn))
            + (1/2)h (f(tn, Y(tn)) + h ∂t f(tn, Y(tn)) + h ∂y f(tn, Y(tn)) f(tn, Y(tn))) + O(h³)
Now using Taylor expansion in two variables we find
f(tn + h, Y(tn) + h f(tn, Y(tn))) = f(tn, Y(tn)) + h ∂t f(tn, Y(tn)) + h ∂y f(tn, Y(tn)) f(tn, Y(tn)) + O(h²) .
Therefore
Y(t_{n+1}) = Y(tn) + (1/2)h f(tn, Y(tn)) + (1/2)h f(t_{n+1}, Y(tn) + h f(tn, Y(tn))) + O(h³) .
Thus given yn we can compute
y_{n+1} = yn + (1/2)(F1 + F2) , F1 = h f(tn, yn) , F2 = h f(t_{n+1}, yn + F1) .
Equivalently we can use the formula
ỹ_{n+1} = yn + h f(tn, yn) , ỹ_{n+2} = ỹ_{n+1} + h f(t_{n+1}, ỹ_{n+1}) , y_{n+1} = (1/2)(yn + ỹ_{n+2}) .
So we are averaging the starting value and the result of taking two forward Euler steps.
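This averaging form translates directly into code. A minimal sketch (the test problem y' = −y with exact solution e^{−t} is an illustrative choice, not part of the lecture material):

```python
import math

def heun_step(t, y, h, f):
    # One step of the 2-stage Heun method: average the starting value
    # with the result of two forward Euler steps.
    F1 = h * f(t, y)
    F2 = h * f(t + h, y + F1)
    return y + 0.5 * (F1 + F2)

# Test problem y' = -y, Y(0) = 1, exact solution Y(t) = exp(-t).
f = lambda t, y: -y
y, t, h = 1.0, 0.0, 0.1
for n in range(10):
    y = heun_step(t, y, h, f)
    t += h
print(abs(y - math.exp(-1.0)))  # second order: error ~7e-4 for h = 0.1
```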
Example 4.3. (general second order explicit methods) To obtain second order methods we can
try to find parameters b1, b2, a, b so that
Note that kl(t, y; h), ϕ(t, y; h) ∈ R^r. The coefficients αi, γi, βi,l ∈ R are independent of t, y, f and
h, so they are independent of the problem we are solving and of the time step used. It is usual to describe
RK methods by their Butcher tableau:

α1 | 0
α2 | β2,1  0
 ⋮  |  ⋮        ⋱
αm | βm,1 . . . βm,m−1  0
---+------------------------
   | γ1   . . . γm−1    γm

That is, by two vectors α ∈ R^m, γ ∈ R^m and a lower triangular matrix β ∈ R^{m×m}:

α | β
---+---
   | γ
An implicit m-stage Runge-Kutta method is defined by
y_{n+1} = yn + h ϕ(tn, yn; h) with ϕ(t, y; h) := Σ_{i=1}^{m} γi ki(t, y; h)
with
kj(t, y; h) = f(t + αj h, y + h Σ_{l=1}^{m} βj,l kl(t, y; h)) ∈ R^r , j = 1, . . . , m .
As before the method can be described by a Butcher tableau but this time the matrix β can have
entries on and above the diagonal.
Example 4.5. All the methods we had so far were of Runge-Kutta type:

1. Forward Euler method:
   0 | 0
   --+---
     | 1

2. Implicit Euler method:
   1 | 1
   --+---
     | 1

3. Midpoint method:
   1/2 | 1/2
   ----+----
       | 1

4. Modified Euler method:
   0   | 0   0
   1/2 | 1/2 0
   ----+--------
       | 0   1

5. 2-stage Heun method:
   0 | 0   0
   1 | 1   0
   --+---------
     | 1/2 1/2

6. Trapezoidal method:
   0 | 0   0
   1 | 1/2 1/2
   --+---------
     | 1/2 1/2

7. Classical 4th order RK method:
   0   | 0   0   0   0
   1/2 | 1/2 0   0   0
   1/2 | 0   1/2 0   0
   1   | 0   0   1   0
   ----+----------------
       | 1/6 1/3 1/3 1/6

8. Diagonally implicit two stage third order method:
   1/3 | 1/3 0
   1   | 1   0
   ----+--------
       | 3/4 1/4

9. Four-stage, 3rd order diagonally implicit method:
   1/2 | 1/2  0    0   0
   2/3 | 1/6  1/2  0   0
   1/2 | −1/2 1/2  1/2 0
   1   | 3/2  −3/2 1/2 1/2
   ----+--------------------
       | 3/2  −3/2 1/2 1/2

10. Fourth order Gauss-Legendre method:
   1/2 − √3/6 | 1/4        1/4 − √3/6
   1/2 + √3/6 | 1/4 + √3/6 1/4
   -----------+------------------------
              | 1/2        1/2
Note that the implicit Euler, midpoint and trapezoidal methods as well as the last three methods
are implicit RK methods while all the others are explicit.
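Any explicit tableau can be turned into a stepper by computing the stages in order. A minimal sketch (not part of the lecture material; the test problem y' = −y and the step size are illustrative choices), using the classical RK4 tableau from above:

```python
import numpy as np

def explicit_rk_step(f, t, y, h, alpha, beta, gamma):
    # One step of an explicit RK method given by its Butcher tableau
    # (beta strictly lower triangular, so each stage uses earlier ones only).
    m = len(gamma)
    k = np.zeros(m)
    for j in range(m):
        k[j] = f(t + alpha[j] * h, y + h * np.dot(beta[j, :j], k[:j]))
    return y + h * np.dot(gamma, k)

# Classical RK4 tableau (method 7 above):
alpha = np.array([0.0, 0.5, 0.5, 1.0])
beta  = np.array([[0.0, 0.0, 0.0, 0.0],
                  [0.5, 0.0, 0.0, 0.0],
                  [0.0, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
gamma = np.array([1/6, 1/3, 1/3, 1/6])

f = lambda t, y: -y          # test problem y' = -y, Y(0) = 1
y, t, h = 1.0, 0.0, 0.1
for n in range(10):
    y = explicit_rk_step(f, t, y, h, alpha, beta, gamma)
    t += h
print(abs(y - np.exp(-1.0)))  # fourth order: error well below 1e-6
```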
Remark. In addition to explicit and implicit there are other subclasses of RK methods; an
interesting class are the diagonally implicit Runge-Kutta methods where β is lower triangular
(including the diagonal). In this case we do not have to solve an m·r × m·r system of nonlinear
equations for k1, . . . , km but only m nonlinear equations of size r, first for k1, then for k2, up to km,
since the equation for ki does not depend on kl for l > i. In fact all the methods shown above,
with the exception of the last one, are diagonally implicit. Restricting the possible nonzero entries
in the Butcher tableau, i.e., requiring βi,l = 0 for l > i for a diagonally implicit method, reduces
the freedom one has to design a method with other desirable properties, e.g., a higher rate of
convergence, better stability, positivity of the solution etc. So one has less choice with an explicit
method than with a fully implicit method, and diagonally implicit methods are somewhere in
between the other two.
Some further remarks (stated without proof) and a summary of the above:
• Fully implicit m stage RK methods can be up to order 2m.
• Explicit m stage RK methods can be at most of order m but...
• order m is only possible for m ≤ 5 ...
• order m − 1 for m ≤ 7...
• order m − 2 for m ≤ 8.
• Implicit methods have better stability properties (larger h) but require the use of the (vector
valued) Newton method (see further down in this chapter) because...
• in general they require the solution of an m·r × m·r system of nonlinear equations for the
(k_i)_{i=1}^m ...
• Diagonally implicit methods (β is a lower triangular matrix) require only the solution of m
nonlinear equations of size r and can be a good compromise.
Remark. Runge-Kutta methods can often be defined in many equivalent ways. For example the
Butcher tableau above providing the backward Euler method leads to the scheme (assuming that
f is independent of t):
k = f(yn + h k) , y_{n+1} = yn + h k .
That this is in fact the backward Euler method discussed at the beginning of the lecture can be
easily seen using that
y_{n+1} = yn + h k = yn + h f(yn + h k) = yn + h f(y_{n+1}) .
We will discuss below how to find the root k of F(k) = k − f(yn + hk) but the same approach can
be used to find the root y_{n+1} of F(y) = y − hf(y) − yn. The difference is only in the initial guess
used: in the first case f(yn) makes sense while in the second case yn is a reasonable initial guess.
In the last part of this chapter we will not focus on how to construct RK methods but on how
to analyse an RK method given by its Butcher tableau. But first we discuss how to implement an
implicit RK method.
with
kj(t, y; h) = f(t + αj h, y + h Σ_{l=1}^{m} βj,l kl(t, y; h)) , j = 1, . . . , m .
Here r is the size of the ODE system (i.e., f : R^r → R^r) and m is the number of stages; the vector
κ ∈ R^{r·m} is of the form κ = (k11, . . . , k1r, k21, . . . , k2r, . . . , km1, . . . , kmr) and similarly we must
understand Fj = (Fj1, . . . , Fjr) ∈ R^r and F as the accumulated entries (Fji). So F is a very
high dimensional function and finding a root can be challenging.
Recall the definition of a diagonally implicit RK method in which case the computations for the
kj decouple, i.e., kj only depends on k1, . . . , k_{j−1} but not on ki with i > j:
kj(t, y; h) = f(t + αj h, y + h βj,j kj + h Σ_{l=1}^{j−1} βj,l kl(t, y; h)) , j = 1, . . . , m
Fj(k) = kj − f(t + αj h, y + h βj,j kj + h Σ_{l=1}^{j−1} βj,l kl) , j = 1, . . . , m .
So instead of finding one root in R^{rm} of a high dimensional function F we need to find m roots in
R^r which is far easier.
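A minimal sketch of this stage-by-stage approach for a scalar ODE (r = 1), solving each stage equation with a few Newton steps; the two stage third order DIRK tableau and the test problem y' = −y are illustrative choices, not part of the lecture material:

```python
import numpy as np

def scalar_newton(F, dF, x, tol=1e-12, maxit=50):
    # Basic Newton iteration for a scalar equation F(x) = 0.
    for _ in range(maxit):
        if abs(F(x)) <= tol:
            return x
        x = x - F(x) / dF(x)
    return x

def dirk_step(f, df_dy, t, y, h, alpha, beta, gamma):
    # One step of a diagonally implicit RK method for a scalar ODE:
    # each stage k_j requires solving one scalar nonlinear equation.
    m = len(gamma)
    k = np.zeros(m)
    for j in range(m):
        c = y + h * np.dot(beta[j, :j], k[:j])   # already computed stages
        F  = lambda kj: kj - f(t + alpha[j]*h, c + h*beta[j, j]*kj)
        dF = lambda kj: 1.0 - h*beta[j, j]*df_dy(t + alpha[j]*h, c + h*beta[j, j]*kj)
        k[j] = scalar_newton(F, dF, f(t, y))     # f(t, y) as initial guess
    return y + h * np.dot(gamma, k)

# Two stage third order DIRK (method 8 above), test problem y' = -y:
alpha = np.array([1/3, 1.0])
beta  = np.array([[1/3, 0.0], [1.0, 0.0]])
gamma = np.array([3/4, 1/4])
f, df = (lambda t, y: -y), (lambda t, y: -1.0)

y = 1.0
for n in range(10):
    y = dirk_step(f, df, n * 0.1, y, 0.1, alpha, beta, gamma)
print(abs(y - np.exp(-1.0)))  # third order accurate
```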
So in this section we will discuss some basic approaches to computing roots of a given vector
valued function. The problem is hard to solve as we will see and we can only scratch the surface.
Problem. Given a function F : Rn → Rn find a root x∗ , i.e.,
x∗ ∈ Rn with F (x∗ ) = 0 .
Example 4.6. With F(x) = Ax − b, A ∈ R^{n×n}, we have to solve a linear system of equations.
F(x) = ax² + bx + c: in this case there is a simple formula for computing all roots.
F(x) = cos(x): the problem could be to find a root x* ∈ [1, 2]. Then the solution is x* = π/2.
We will focus on the scalar case n = 1 and on finding the root of a smooth function F : [a, b] → R.
Proof. We have by construction that (ak)_{k∈N} is monotonically increasing and (bk)_{k∈N} is monotonically
decreasing and 0 < bk − ak = (1/2)(b_{k−1} − a_{k−1}) = 2^{−k}(b − a), ak < b, a < bk. Thus (ak)_{k∈N}, (bk)_{k∈N}
converge and lim_{k→∞} ak = lim_{k→∞} bk =: x*. Since F ∈ C⁰(a, b) we can conclude that F(x*) = lim_{k→∞} F(ak) =
lim_{k→∞} F(bk).
Again by construction F(ak)F(bk) < 0 always holds, and therefore
F(x*)² = lim_{k→∞} (F(ak)F(bk)) ≤ 0 ⟹ F(x*) = 0.
Furthermore
|x^(k) − x*| ≤ min{|x^(k) − ak|, |x^(k) − bk|} = (1/2)|bk − ak| ≤ 2^{−k−1}(b − a).
Remark. The method is very robust but also quite slow: it requires about three iterations to
gain one decimal place of accuracy.
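A minimal sketch of the nested interval (bisection) method; the tolerance and the test problem F = cos on [1, 2] are illustrative choices:

```python
import math

def bisection(F, a, b, tol=1e-10):
    # Nested interval method: requires a sign change F(a)*F(b) < 0.
    assert F(a) * F(b) < 0
    k = 0
    while b - a > tol:
        x = 0.5 * (a + b)
        if F(a) * F(x) <= 0:   # sign change in [a, x]: keep left half
            b = x
        else:                  # otherwise keep right half
            a = x
        k += 1
    return 0.5 * (a + b), k

root, iters = bisection(math.cos, 1.0, 2.0)
print(root, iters)  # root ~ pi/2 after ~34 iterations (about 3.3 per digit)
```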
x^(k+1) = x^(k) − F(x^(k)) / F'(x^(k)) , x^(0) given, k ≥ 0 ,
and if x^(k) → x* for some x^(0) ∈ U then since F is smooth we have x* = x* − F(x*)/F'(x*) and thus
F(x*) = 0.
Remark. (Geometric interpretation) Let lk(x) = ax + b be the linearization of F at x^(k), i.e.,
lk(x^(k)) = F(x^(k)), lk'(x^(k)) = F'(x^(k)). So lk(x) = F(x^(k)) + F'(x^(k))(x − x^(k)) is the tangent to
F at x^(k). Now it seems reasonable to approximate a root of F by the root x^(k+1) of lk:
0 = F(x^(k)) + F'(x^(k))(x^(k+1) − x^(k)).
This is the formula used in the Newton method.
Figure 4.1 (left) shows the idea of the construction. With this idea in mind, it is obvious that the
method does not always lead to the expected result or to any result at all. This is also shown in
Figure 4.1.
Algorithm. (Newton method)
Fx := F(x)
While |Fx| > TOL:
    x := x − Fx / F'(x)
    Fx := F(x)
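The algorithm above in runnable form (a sketch; the test problem F = cos with starting value 1.5 is an illustrative choice):

```python
import math

def newton(F, dF, x, tol=1e-12, maxit=100):
    # Newton's method: iterate x <- x - F(x)/F'(x) until |F(x)| <= tol.
    for k in range(maxit):
        if abs(F(x)) <= tol:
            return x, k
        x = x - F(x) / dF(x)
    return x, maxit

# Same problem as for bisection: root of cos in [1, 2].
root, iters = newton(math.cos, lambda x: -math.sin(x), 1.5)
print(root, iters)  # converges to pi/2 in a handful of iterations
```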
Theorem 4.8. (convergence of Newton's method) Consider F ∈ C²(a, b) so that there is an x* ∈
(a, b) with F(x*) = 0. Let m := min_{a≤x≤b} |F'(x)| > 0, M := max_{a≤x≤b} |F''(x)|, and choose ρ > 0 so that
Bρ(x*) := {x : |x − x*| < ρ} ⊂ [a, b] and q := (M/(2m)) ρ < 1. Then Newton's method converges with
rate 2 for any starting value x^(0) ∈ Bρ(x*).
The approximation satisfies the a-priori error estimate
Figure 4.1: Newton method: on the left the idea of Newton's method is sketched. The middle
figure shows a function with two roots where it is unclear which root will be approximated. On
the right we have a situation with x^(k) → ∞, i.e., where the Newton method fails.
(a) |x^(k) − x*| ≤ (M/(2m)) |x^(k−1) − x*|² ≤ (2m/M) q^{2^k} .
Remark. It follows from the assumptions made in the Theorem and from the mean value theorem
that
|F(x) − F(y)| / |x − y| = |F'(ξ)| ≥ m ∀ x, y ∈ Bρ(x*), x ≠ y ⟹ |x − y| ≤ (1/m)|F(x) − F(y)| .
Therefore x* is the only root in Bρ(x*) and x* is a simple root, i.e., F(x*) = 0 and F'(x*) ≠ 0.
Remark. The a-posteriori estimate can be used to determine the quality of the approximation
since the right hand side is computable. In contrast the a-priori estimate involves the unknown
quantity x* or overestimates the error considerably - but it establishes the quadratic convergence
rate. The convergence is very fast if x^(0) ∈ Bρ(x*). Assume for example q = 1/2; then after only 10
steps of the method we have
|x^(10) − x*| ≤ (2m/M) q^{1024} ∼ (2m/M) 10^{−308} .
Comparing that with the nested interval approach, we find using the same starting interval that
|b − a| = ρ = q (2m/M) = m/M, with q = 1/2. After 10 iterations of the nested interval method we get
|x^(10) − x*| ≤ 2^{−11} |b − a| = 2^{−11} ρ = 2^{−11} q (2m/M) ∼ (2m/M) 10^{−3} .
Proof. (convergence of Newton's method)
From Taylor expansion we find
(1) F(y) = F(x) + F'(x)(y − x) + R(y, x) with R(y, x) = (F''(ξ(y, x))/2) (y − x)² .
For all x, y ∈ Bρ(x*):
(2) |R(y, x)| ≤ (M/2) |y − x|² .
For x ∈ Bρ(x*) define Φ(x) := x − F(x)/F'(x). Then
|Φ(x) − x*| = |(x − x*) − F(x)/F'(x)| = |1/F'(x)| |F(x) + (x* − x)F'(x)|
            = |1/F'(x)| |R(x*, x)| ≤ (M/(2m)) |x − x*|²
using (1) with y = x* (so that F(x*) = 0) and (2). Setting ρk := (M/(2m)) |x^(k) − x*| we obtain
ρk = (M/(2m)) |Φ(x^(k−1)) − x*| ≤ ((M/(2m)) |x^(k−1) − x*|)² = (ρ_{k−1})² ≤ . . . ≤ (ρ0)^{2^k}
⟹ |x^(k) − x*| = (2m/M) ρk ≤ (2m/M) (ρ0)^{2^k} ≤ (2m/M) q^{2^k}
since ρ0 = (M/(2m)) |x^(0) − x*| ≤ (M/(2m)) ρ = q.
This proves the a-priori estimate. Since we have q < 1, it follows that q^{2^k} → 0 ⟹ x^(k) → x*.
For the a-posteriori estimate we use (1) with y = x^(k) and x = x^(k−1).
The Newton method can also be viewed as a fixed point iteration
x^(k+1) := Φ(x^(k)) = x^(k) − F(x^(k)) / F'(x^(k)) .
A fixed point of this iteration satisfies x = Φ(x) and thus has to be a root of F. Furthermore if x is
a fixed point (and therefore a root of F) we see that
|Φ'(x)| = |1 − (F'(x)² − F(x)F''(x)) / F'(x)²| = |F(x)F''(x) / F'(x)²| = 0 .
So all the fixed points are stable and consequently, for x^(0) close to a root x*, the sequence converges
to x*.
As we have seen the Newton method is not quite as robust as the nested interval method, but
if x^(0) is close to x* the Newton method is far more efficient. There are a few problems with
Newton's method, the most obvious being:
1. How to find x^(0) close enough to x*?
2. What happens if F'(x*) = 0?
3. Can one avoid having to compute F'?
There are modifications of Newton's method to handle these problems. For example one can combine
the nested interval method with Newton's method to produce a stable and efficient scheme:
consider a < b with F(a)F(b) < 0.
Define x := (1/2)(a + b), ã := a, b̃ := b, Fx := F(x), Fa := F(a).
While |Fx| > TOL:
    x := x − Fx / F'(x)                     (Newton step)
    F1 := F(x)
    If |F1| > |Fx| or x ∉ (a, b) then       (reject the Newton step, bisect instead)
        If Fa · Fx < 0 then b := x else (a := x; Fa := Fx)
        x := (1/2)(a + b), F1 := F(x)
    Fx := F1
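A runnable sketch of such a safeguarded Newton method (the exact acceptance test used here is one of several reasonable choices; the test problem F = cos on [1, 2] is illustrative):

```python
import math

def safe_newton(F, dF, a, b, tol=1e-12, maxit=100):
    # Newton's method safeguarded by bisection: if the Newton iterate
    # leaves the bracket [a, b] or fails to decrease |F|, bisect instead.
    assert F(a) * F(b) < 0
    x = 0.5 * (a + b)
    Fx = F(x)
    for _ in range(maxit):
        if abs(Fx) <= tol:
            return x
        x_new = x - Fx / dF(x)
        if not (a < x_new < b) or abs(F(x_new)) > abs(Fx):
            x_new = 0.5 * (a + b)          # bisection fallback
        F_new = F(x_new)
        # shrink the bracket so that a sign change is kept between a and b
        if F(a) * F_new < 0:
            b = x_new
        else:
            a = x_new
        x, Fx = x_new, F_new
    return x

x_star = safe_newton(math.cos, lambda x: -math.sin(x), 1.0, 2.0)
print(x_star)  # close to pi/2
```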
Remark. The basic idea (and the convergence theorem) can be extended to F : R^n → R^n.
The scheme is practically the same:
x^(k+1) = x^(k) − (DF(x^(k)))^{−1} F(x^(k))
where DF(x^(k)) now denotes the Jacobian of F at x^(k), i.e., DF(x^(k)) ∈ R^{n×n}. Thus a linear
system of equations has to be solved in each step of the Newton scheme. One iteration thus
consists of two steps: first one solves the linear system DF(x^(k))δk = F(x^(k)) for δk and then
performs the update x^(k+1) = x^(k) − δk. The first step can require finding the solution to a large
system of linear equations - a major topic of numerical linear algebra which is covered in a third
year module.
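A minimal numpy sketch of this solve-then-update iteration (the 2×2 example system - a circle intersected with a line - is an illustrative choice, not part of the lecture material):

```python
import numpy as np

def newton_system(F, DF, x, tol=1e-12, maxit=50):
    # Newton's method for F: R^n -> R^n. Each iteration solves the
    # linear system DF(x) delta = F(x) and updates x <- x - delta.
    for _ in range(maxit):
        Fx = F(x)
        if np.linalg.norm(Fx) <= tol:
            return x
        delta = np.linalg.solve(DF(x), Fx)
        x = x - delta
    return x

# Example: intersection of the circle x^2 + y^2 = 4 with the line y = x.
F  = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, v[1] - v[0]])
DF = lambda v: np.array([[2.0 * v[0], 2.0 * v[1]], [-1.0, 1.0]])
x = newton_system(F, DF, np.array([1.0, 2.0]))
print(x)  # converges to (sqrt(2), sqrt(2))
```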
We restrict the presentation to a fixed time step h but the results can easily be extended to variable
time steps t_{n+1} − tn = hn. Also we will set t0 = 0.
Example 4.10. All the explicit Runge-Kutta methods are explicit one step methods, e.g., for the
forward Euler method we have ϕ = f (tn , yn ).
We focus in the following on scalar ODE r = 1 but the results carry over directly to the vector
valued case.
We next consider the convergence of these methods. A method is convergent if the maximum error E(h) satisfies
lim_{h→0} E(h) = 0 .
The method has convergence order p, if for some constant M > 0, E(h) ≤ M h^p for any h > 0.
An important concept is the truncation error of a numerical scheme:
Definition 4.12. (Truncation error and consistency) The truncation error of a one-step method
at step n is defined as
τn (h) = Y (tn+1 ) − Y (tn ) − hϕ(tn , Y (tn ); h)
Thus the exact solution Y satisfies the perturbed equation
Y(t_{n+1}) = Y(tn) + h ϕ(tn, Y(tn); h) + τn(h)
for n = 0, 1, . . . .
A one step method is consistent if τn(h) = o(h) for all n. This means that for every ε > 0 there
exists an h0 such that |τn(h)| < εh for all h < h0.
It is consistent of order p if max_n |τn(h)| = O(h^{p+1}).
Note that the consistency order is one lower than the order of the truncation error.
Remark. The truncation error is defined by inserting the exact solution to the initial value problem
(∗) into the one step method. This gives a measure of how large the error in step n would be if
the method was started with the correct value Y (tn ).
The situation is sketched in Figure 4.2. The figure shows the exact solution Y and approximated
values y0, y1, y2, y3. At each point (tn, yn) the slope is given by (d/dt)Ȳn(tn) where Ȳn is the
solution to the ODE with initial condition Ȳn(tn) = yn. In yellow is the truncation error in each
step.
Example 4.13. For all the explicit methods described in the previous section which were based
on Taylor expansion, the truncation error is in O(h^{r+1}) if the term not taken into account in the
Taylor series is of order O(h^{r+1}). So the Taylor method from Example 4.1 given by
ϕ(tn, yn; h) = Σ_{k=1}^{r} (1/k!) h^{k−1} y_n^k
with
y_n^0 = yn , y_n^1 = f(tn, y_n^0) , y_n^2 = ∂t f(tn, y_n^0) + ∂y f(tn, y_n^0) y_n^1 , y_n^3 = . . .
has an order of consistency r since we showed that
Y(t_{n+1}) = Y(tn) + h Σ_{k=1}^{r} (1/k!) h^{k−1} Y_n^k + O(h^{r+1}) .
This example also demonstrates that there exist explicit one step methods of arbitrary order of
consistency.
The forward Euler method is consistent of order 1 and for all the two step methods satisfying the
conditions in Example 4.3 the order of consistency is two.
Remark. For Runge-Kutta methods one can formulate order conditions that allow one to determine the
order of consistency of the method based on the Butcher tableau. For an m stage Runge-Kutta
method with αi = Σ_{j=1}^{m} βij the following conditions have to hold for the method to be of at least order
p ≥ 1: Σ_{i=1}^{m} γi = 1
p ≥ 2: Σ_{i=1}^{m} αi γi = 1/2
p ≥ 3: Σ_{i=1}^{m} αi² γi = 1/3 and Σ_{i=1}^{m} Σ_{j=1}^{m} γi βi,j αj = 1/6
For larger p the number of conditions grows fast. The order conditions can be used for example
to show that the diagonally implicit RK method is at least of order three.
We would expect that the maximum error E(h) is roughly equal to the sum of the local truncation
errors. Since there are N = O(h^{−1}) steps, a method which is consistent of order p would thus
have an O(h^{p+1}) error in each step, which would then mean that the maximum error should be
roughly O(h^{−1})O(h^{p+1}) = O(h^p). This is in fact the case for h small enough as we will show in
the following. For this we need a discrete version of Gronwall's lemma:
Lemma 4.14. (discrete Gronwall lemma) Let zn ∈ R+ satisfy
zn+1 ≤ Czn + D, ∀n ≥ 0
Example 4.16. For the methods we derived in the previous section we have to verify that ϕ is
Lipschitz. This is always true if f is uniformly Lipschitz in y and we assume h ≤ h0. Take for
example an m stage Runge-Kutta method:
ϕ(t, y; h) = Σ_{i=1}^{m} γi ki(t, y; h)
ki(t, y; h) = f(t + αi h, y + h Σ_{l=1}^{i−1} βi,l kl(t, y; h)) , i = 1, . . . , m.
For k1 we find Lk1 = Lf, thus we can bound the Lipschitz constant for k2 by Lk2 = (1 + h0 β2,1)Lf
and Lk3 = (1 + h0 β3,1 + h0 β3,2 + h0² β3,2 β2,1)Lf. Each Lki will remain bounded by some constant
depending on h0 and the coefficients of the Runge-Kutta method.
Proof. The proof is almost the same as in the purely explicit case. Assume that τn(h) ≤ M h^{p+1} and
h ≤ 1/(2Lϕ̃). Then one obtains, as in the explicit case,
e_{n+1} ≤ C en + D
with C = (1 + hLϕ̃)/(1 − hLϕ̃) > 1 and D = M h^{p+1}/(1 − hLϕ̃) > 0 since 1 − hLϕ̃ > 1/2 > 0. Using Gronwall and
1/(1 − hLϕ̃) ≤ 2 the result follows.
Remark. In general the error estimates derived for the convergence proofs overestimate the error
considerably, since exp(LT) grows rapidly with increasing T - so the estimate is only reliable for
small T. Under additional assumptions on f more accurate estimates can be derived.
For the model problem y' = −λy with λ > 0 the exact solution decays monotonically to zero, so we
ask under which conditions the numerical solution satisfies
lim_{n→∞} yn = 0
and
yn > y_{n+1} .
Example 4.19. For the forward Euler method we have y_{n+1} = (1 − λh)yn so that yn = y0(1 − λh)^n.
Therefore lim_{n→∞} yn = 0 iff |1 − λh| < 1. Since λ > 0 is fixed, this means that the step size h has
to be small enough: h < 2/λ.
Again using y_{n+1} = (1 − λh)yn we see that y_{n+1} < yn can only hold if 1 − λh ∈ [0, 1). Since
1 − λh < 1 always holds, this requires that we choose h so that h ≤ 1/λ.
If we do the same analysis for the backward Euler method, y_{n+1} = yn − λh y_{n+1}, we find
y_{n+1} = yn/(1 + λh), i.e., yn = (1 + λh)^{−n} y0. Since lim_{n→∞}(1 + λh)^{−n} = 0 for all choices of h, the backward
Euler method will show the correct long time behavior independent of the step size h. To obtain
monotonicity, we need yn/(1 + λh) < yn which also holds for any h > 0 since we assumed that λ > 0.
The forward Euler method is said to be conditionally stable while the backward method is unconditionally
stable, i.e., there is no restriction on the time step.
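This step size restriction is easy to observe numerically. A small sketch (λ and h are arbitrary illustrative values, chosen with h > 2/λ):

```python
# Explicit vs implicit Euler for y' = -lam*y with a step size violating
# the forward Euler stability bound h < 2/lam:
lam, h, N = 10.0, 0.5, 20          # h = 0.5 > 2/lam = 0.2
y_fe = y_be = 1.0
for n in range(N):
    y_fe = (1.0 - lam * h) * y_fe  # forward Euler: factor 1 - lam*h = -4
    y_be = y_be / (1.0 + lam * h)  # backward Euler: factor 1/6, decays
print(abs(y_fe), abs(y_be))        # blows up vs. decays towards 0
```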
Lemma 4.20. Consider a general m stage Runge-Kutta method given by the coefficient vectors
α, γ ∈ R^m and a matrix β ∈ R^{m×m}. Applying this to the linear ODE y' = λy leads to a scheme
of the form
y_{n+1} = (P(λh)/Q(λh)) yn
where P, Q are polynomials of degree not more than m. If the method is explicit then Q ≡ 1.
The rational function R(ρ) = P(ρ)/Q(ρ) takes on the form
R(ρ) = 1 + ρ Σ_{i=1}^{m} Σ_{j=1}^{m} γi ((I − ρβ)^{−1})_{ij} = 1 + ρ Σ_{i=1}^{m} γi ((I − ρβ)^{−1} e)_i
where e = (1, . . . , 1)^T.
and
kj(t, y; h) = f(t + αj h, y + h Σ_{l=1}^{m} βj,l kl(t, y; h)) = λy + λh Σ_{l=1}^{m} βj,l kl(t, y; h) .
The corresponding matrix is I − ρβ and the right hand side is the vector ρ yn e with entries ρ yn.
Thus
κ = (I − ρβ)^{−1} e ρ yn .
With this we have shown
y_{n+1} = yn + Σ_{i=1}^{m} γi κi = (1 + ρ Σ_{i=1}^{m} γi ((I − ρβ)^{−1} e)_i) yn = R(ρ) yn .
Since κ1 = ρ yn for an explicit method, it follows by induction that each κj is a polynomial of degree j
and that concludes the proof.
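The formula for R just derived can be evaluated numerically from a Butcher tableau. A small numpy sketch, checked against the known stability functions R(z) = 1/(1 − z) of backward Euler and R(z) = 1 + z of forward Euler:

```python
import numpy as np

def stability_function(beta, gamma, z):
    # Evaluate R(z) = 1 + z * gamma^T (I - z*beta)^{-1} e for an RK method.
    m = len(gamma)
    e = np.ones(m)
    return 1.0 + z * (gamma @ np.linalg.solve(np.eye(m) - z * beta, e))

z = -3.0
# Backward Euler (beta = [[1]], gamma = [1]): R(z) = 1/(1 - z) = 0.25
print(stability_function(np.array([[1.0]]), np.array([1.0]), z))
# Forward Euler (beta = [[0]], gamma = [1]): R(z) = 1 + z = -2.0
print(stability_function(np.array([[0.0]]), np.array([1.0]), z))
```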
We can study the problems stated above:
Theorem 4.21. Consider a one step method which takes on the form
y_{n+1} = R(λh) yn
when applied to f(t, y) = λy. Then the method has the correct long time behavior if |R(λh)| < 1 and
is monotone if 0 < R(λh) < 1.
[Figure 4.3: Plot of the function R(z) for different RK methods (explicit and implicit Euler,
midpoint, 2 stage Heun, 2 stage Gauss, 2 stage diagonally implicit, classical RK, a 6th order
implicit method, and a three stage third order DIRK). Also shown is the approximation property
of R to the exponential and the behaviour for z → −∞.]

Remarks:
The above construction of R and subsequent arguments work in exactly the same way if we also
include complex valued λ with Re λ < 0. In this case we again get a stability region for each
method in which the complex value R(z) has modulus less than 1.
Definition 4.22. Consider a one step method which takes on the form
y_{n+1} = R(λh) yn
when applied to y' = λy. The stability region of the method is the set
S_R = {z ∈ C : |R(z)| < 1} .
Recall the two diagonally implicit third order methods from Example 4.5: the two stage method
1/3 | 1/3 0
1   | 1   0
----+--------
    | 3/4 1/4
and the four-stage, 3rd order diagonally implicit method
1/2 | 1/2  0    0   0
2/3 | 1/6  1/2  0   0
1/2 | −1/2 1/2  1/2 0
1   | 3/2  −3/2 1/2 1/2
----+--------------------
    | 3/2  −3/2 1/2 1/2
By checking the order conditions you can see that they are both at least third order - and in fact
they are not more than third. But the second one uses twice as many stages as the first, which
means it is at least twice as expensive. Looking at the two methods more closely one realizes that
the two stage method is only implicit in the first stage while it is explicit in the second. So per
step it requires exactly as many applications of Newton's method as the implicit Euler method
but is third order. The second method is implicit in each stage, so it requires four applications
of Newton's method, which makes it more expensive. So we have a first order method which is
hardly less expensive than the two stage third order method, which in turn is much cheaper than the four
stage third order method. So why not always use the two stage method? The answer is stability.
Both the four stage method and the backward Euler method are L-stable (check the previous plots).
For the two stage method on the other hand R(z) > 1 for z < −6, so it does not have nearly the
stability of the other two, requiring a reduced time step for really large negative λ. So depending on
the problem the four stage method could be a lot more efficient to use.
Here are the stability regions in the complex plane for the two third order diagonally implicit Runge-Kutta
(DIRK) methods. The third plot shows the stability region for the two stage fully
implicit Gauss method (see Butcher tableau at the beginning of this chapter). Here in each step a
more complicated nonlinear problem has to be solved.
[Figure: stability regions of the two DIRK methods and of the two stage Gauss method in the
complex plane.]
Note that the two right pictures show A-stable methods, so the whole left half of the complex plane is covered
in contour lines. In the middle case a lot of the right half plane is also covered. For the right method
it is unclear if the imaginary axis (or at least some part around the origin) is contained in the
stability region.
Here are the stability regions for the forward Euler method (again), for Heun’s method (compare to
what you know from the assignments), and the final example is the classical four stage Runge-Kutta
method:
[Figure: stability regions for the explicit Euler method, the Heun method, and the classical RK4
method, plotted on roughly [−3, 1] × [−2, 2].]
For the classical RK4 method it is again unclear if some part of the imaginary axis around the
origin is contained in the stability region. Note the difference in the x axis scaling between these
plots and the previous plots for the implicit methods.
Chapter 5
5.1 Linearization
The perhaps simplest approach to simplifying a given mathematical model is to linearize it around a
given ground state. The assumption is that the actual solution to the original problem is close to
this ground state at all times so that the linearized model is a good description.
Take an ODE model
y'(t) = f(t, y(t)) , y(0) = y0
and a known ground state Ŷ = Ŷ(t). Assume that the exact solution Y to the ODE is of the form
Y(t) = Ŷ(t) + Ỹ(t) where the perturbation Ỹ(t) around the ground state is assumed to be small
for all t. Then
Y'(t) = f(t, Y(t)) = f(t, Ŷ(t) + Ỹ(t)) = f(t, Ŷ(t)) + ∂y f(t, Ŷ(t))Ỹ(t) + O((Ỹ(t))²) .
Also Y'(t) = (d/dt)Ŷ(t) + (d/dt)Ỹ(t) so that we arrive at a linear ODE for Ỹ:
(d/dt)ỹ = Â(t)ỹ(t) + Ĉ(t)
with Â(t) = ∂y f(t, Ŷ(t)) and Ĉ(t) = f(t, Ŷ(t)) − (d/dt)Ŷ(t). Note that Ỹ solves this ODE up to the
neglected O((Ỹ(t))²) term which we assumed to be small.
Example 5.1 (Linearization around a fixed point). A special case is the linearization around a
fixed point for a homogeneous right hand side f(t, y) = f(y), i.e., taking Ŷ(t) such that (d/dt)Ŷ(t) =
0 (Ŷ is constant) and f(Ŷ(t)) = 0. In this case Ĉ(t) = 0. Then the linearized system is a
homogeneous linear ODE:
(d/dt)ỹ = ∂y f(Ŷ) ỹ(t) .
This is a special case of linearizing around the solution to the nonlinear problem: assume that a
solution Ŷ(t) is known, i.e.,
(d/dt)Ŷ(t) = f(t, Ŷ(t)) .
This again means that Ĉ ≡ 0. Now consider an initial condition y0 = Ŷ(0) + ỹ0 close to the initial
value of the ground state; then the linearized system turns into
(d/dt)ỹ = Â(t)ỹ(t) , ỹ(0) = ỹ0 .
CHAPTER 5. SIMPLIFYING MATHEMATICAL MODELS 63
Example 5.2. Consider the SIR model from epidemiology discussed in chapter 3:
(d/dt)S = −kSI , (d/dt)I = kSI − γI , (d/dt)R = γI .
Note that R decouples and can be computed once I is known. Consequently, we only need to consider
the first two equations. In this case f(S, I) = (−kSI, kSI − γI) is independent of t. The Jacobian
with respect to (S, I) is given by
∂y f(S, I) = ( −kI   −kS    )
             (  kI   kS − γ )
Linearizing around a fixed point (Ŝ, Î) thus leads to the linear system:
(d/dt) ( S̃ ) = ( −kÎ   −kŜ    ) ( S̃ )
       ( Ĩ )   (  kÎ   kŜ − γ ) ( Ĩ ) .
There is another way of arriving at this equation which does not use Taylor expansion and can
sometimes be easier, especially for large systems:
(d/dt)S̃ = (d/dt)S = −kSI = −k(Ŝ + S̃)(Î + Ĩ) = −kŜÎ − kŜĨ − kS̃Î − kS̃Ĩ .
Since we are assuming that S̃, Ĩ are small, the product of the two is even smaller and can be
neglected. Furthermore, since (Ŝ, Î) is a fixed point, −kŜÎ = 0. So we arrive at
(d/dt)S̃ = −kŜĨ − kÎS̃
which is the same as the first equation we derived using Taylor expansion. The same approach can
be used for the equation for I.
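The quality of such a linearization is easy to check numerically. A small numpy sketch (the parameter values k = 0.3, γ = 0.1 and the fixed point (Ŝ, Î) = (1, 0) are illustrative choices): the difference between the nonlinear right hand side and its linearization is of the order of the squared perturbation.

```python
import numpy as np

# Nonlinear SIR right hand side (R decoupled) and its Jacobian;
# k, gam and the fixed point are illustrative values:
k, gam = 0.3, 0.1
f = lambda S, I: np.array([-k * S * I, k * S * I - gam * I])
J = lambda S, I: np.array([[-k * I, -k * S], [k * I, k * S - gam]])

S_hat, I_hat = 1.0, 0.0      # fixed point: f(1, 0) = (0, 0)
A = J(S_hat, I_hat)          # matrix of the linearized system

eps = 1e-4                   # small perturbation (S~, I~)
pert = np.array([eps, eps])
exact  = f(S_hat + eps, I_hat + eps)
linear = f(S_hat, I_hat) + A @ pert
err = np.linalg.norm(exact - linear)
print(err)                   # O(eps^2): only the quadratic terms are neglected
```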
Example 5.3 (Pendulum). Consider again the equation for a pendulum discussed in Chapter 3:
θ'' = −(g/l) sin(θ) .
Assuming the angle θ is small we can linearize around the ground state θ̂ = 0. Since we assumed
θ̃ = θ − θ̂ = θ is small we can write sin θ ≈ θ̃ which leads to the linearized ODE
θ̃'' = −(g/l) θ̃ .
Of course we can arrive at the same equation using the above Taylor series approach:
(d/dt)θ̃ = Â(t)θ̃(t) + Ĉ(t)
with f(t, θ) = −(g/l) sin(θ) we have
Â(t) = ∂y f(t, θ̂(t)) = −(g/l) cos(θ̂) = −(g/l) .
Example 5.6 (Pendulum). Consider again the equation for a pendulum discussed in Chapter 3:
θ'' = −(g/l) sin(θ) .
As pointed out above θ is already dimensionless. We have [l] = [length] and [g] = [acceleration] =
[length/time²]. So the right hand side has dimension 1/time² which matches the dimension on the
left (each time derivative gives a 1/time as we saw above). So we only need to non-dimensionalize
time by prescribing some time scale T. Then with τ = t/T and y(τ) = θ(Tτ) (note y is already
dimensionless so we do not need a scale for that):
y''(τ) = T² θ''(Tτ) = −(T²g/l) sin(θ(Tτ)) = −(T²g/l) sin(y(τ)) = −Π1 sin(y(τ)) .
Using the arguments made above we already know that Π1 = T²g/l is dimensionless. Now it makes
sense to choose the time scale T in such a way that Π1 = 1, i.e., T = √(l/g). This is, up to a factor
of 2π, the period of the linearized pendulum.
Let us now work through a more complex problem which also demonstrates the flexibility (or
complexity) of this approach:
Example 5.7 (Projectile motion). Consider a projectile of mass M kilograms that is launched
vertically with initial speed V0 meters per second, from a position Y0 meters above the surface of
the Earth. Newton's law of gravitation coupled with the second law of motion then gives that the
height of the projectile Y(T) varies with time T according to the ODE
M d²Y/dT² = −G ME M/(RE + Y)² , Y(0) = Y0 , Y'(0) = V0 .
Here G is the gravitational constant G = 6.7 × 10⁻¹¹ m³/(s² kg) and the Earth's mass and radius
are given by ME = 6 × 10²⁴ kg, RE = 6.4 × 10⁶ m, respectively. Note the very different orders
of magnitude involved in the different parameters describing the system. First note that we can
reduce the number of parameters by introducing g = G ME/RE² ≈ 9.81 m/s² (which looks more
familiar):
M d²Y/dT² = −M g RE²/(RE + Y)² , Y(0) = Y0 , Y'(0) = V0 ,
or equivalently
d²Y/dT² = −g/(1 + Y/RE)² , Y(0) = Y0 , Y'(0) = V0 .
Let's now introduce some (as yet arbitrary) length scale L and time scale 𝕋 and write T = t𝕋
and Y(T) = y(T/𝕋)L. Consequently t, y(t) are dimensionless. The chosen scaling constants (here
L, 𝕋) are called the characteristic (length and time) scales. First we note that using the chain rule
Y'(T) = d/dT (y(T/𝕋)L) = y'(T/𝕋) L/𝕋 = (L/𝕋) y'(t)
so that our ODE becomes
(L/𝕋²) d²y/dt² = −g/(1 + (L/RE)y)² , L y(0) = Y0 , (L/𝕋) y'(0) = V0 ,
or, with the dimensionless combinations Π0 = Y0/L, Π1 = V0𝕋/L, Π2 = g𝕋²/L, Π3 = L/RE,
d²y/dt² = −Π2/(1 + Π3 y)² , y(0) = Π0 , y'(0) = Π1 .
In the original problem the solution Y(T) depended on the four parameters g, RE, Y0, V0 and on
the time T, while after fixing the characteristic scales the dimensionless solution y(t) only depends
on two dimensionless parameters π0, π1 and on t.
In dimensionless form it is often much easier to analyse certain limiting cases by considering
how the model simplifies if one of the dimensionless quantities Πi → 0 or Πi → ∞. Recall that
these constants describe the ratio of different competing effects, so these limits can be viewed as
modelling the situation where one of these effects dominates over the other.
Example 5.9 (continued). For example consider the case that Y0 is very small compared to RE, or in other words RE → ∞ while Y0 remains fixed. With this scaling π1 = 0 and our ODE is

    d²y/dt² = −1 ,   y(0) = 1 , y′(0) = π0 ,
which has the solution y(t) = 1 + π0·t − t²/2, or, reverting the non-dimensionalization,

    Y = Y0 + V0·T − (g/2)·T² .
Another choice of characteristic scales is to set Π1 = 1 and Π2 = 1, i.e., 𝒯 = V0/g and L = V0²/g, which leads to

    d²y/dt² = −1/(1 + π1·y)² ,   y(0) = π0 , y′(0) = 1 ,

with

    π0 = Y0·g/V0² ,   π1 = V0²/(g·RE) .
In the limit π0 → 0 (launch from the surface) and π1 → 0 (RE → ∞) this reduces to

    d²y/dt² = −1 ,   y(0) = 0 , y′(0) = 1 .
The solution to this in the original variables is

    Y = V0·T − (g/2)·T² .

Assuming V0 > 0, the time of flight of the projectile is 2V0/g = 2𝒯, while the maximum altitude reached is V0²/(2g) = L/2, so our choice of characteristic scales seems quite natural.
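The claimed time of flight and maximum altitude are easy to confirm numerically; a small Python sketch integrating the dimensionless problem y″ = −1, y(0) = 0, y′(0) = 1 step by step (the function name is mine):

```python
# Step the dimensionless projectile ODE y'' = -1, y(0)=0, y'(0)=1 forward
# in time and record the time of flight and the maximum height.
def integrate_projectile(dt=1e-4):
    t, y, v, y_max = 0.0, 0.0, 1.0, 0.0
    while True:
        # For constant acceleration -1 this Taylor update is exact per step:
        # y_{n+1} = y_n + v_n*dt - dt^2/2,  v_{n+1} = v_n - dt.
        y_new = y + v * dt - 0.5 * dt * dt
        v -= dt
        t += dt
        y_max = max(y_max, y_new)
        if y_new < 0.0:          # projectile has hit the ground
            return t, y_max
        y = y_new

t_flight, y_max = integrate_projectile()
print(t_flight, y_max)  # close to 2 and 1/2, as predicted
```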
Example 5.11 (continued). Our final choice will be to take Π1 = 1 and Π3 = 1, i.e., L = RE and 𝒯 = RE/V0; then we are left with π0 = Y0/RE and π2 = g·RE/V0². Using the values for g, RE given above we find π2 ≈ 6.27 · 10⁷ m²/s² / V0², so that ε := 1/π2 = V0²/(g·RE) ≈ V0² · 1.6 · 10⁻⁸ s²/m² is very small for suitable choices of V0. Again assuming that Y0 is far smaller than RE, i.e., π0 = 0, the ODE becomes

    ε·y″ = −1/(1 + y)² ,   y(0) = 0 , y′(0) = 1 .
As pointed out, ε is very small, but sending that constant to zero is problematic: we would then be left with an algebraic equation for y, which does not seem to make any sense. This is called a singular perturbation problem and will be discussed in the next section.
Of course, if we were modelling only the part of the ocean in the English Channel this might not be a reasonable assumption. Just choosing the parameter to be zero might lead to too crude a model, although it can sometimes lead to some useful insight into the problem. A simplified model retaining more details can be arrived at by perturbation methods.
Perturbation methods can be used for a wide range of problems involving a small scale parameter ε, e.g., for algebraic and (partial) differential equations. The idea is to assume that the solution can be written as a sum involving powers of the parameter ε, e.g., x(t) = x0(t) + εx1(t) + ε²x2(t) + . . . , and to derive a system of equations for the functions xi(t) by substituting the expansion into the model and matching equal powers of ε. Often a small number of functions xi is enough to get a model that is very close to the original. The mathematical justification of this approach is often referred to as asymptotic analysis.
Definition 5.12 (Asymptotically equivalent). Two functions f(ε), g(ε) are asymptotically equivalent for ε → 0, written f ∼ g, if

    lim_{ε→0} f(ε)/g(ε) = 1 .
Remark. f ∼ g does not only mean that the two functions have the same limit for ε → 0 but that they also approach that limit at the same rate. So for example sin(ε) ∼ exp(ε) − 1, but sin(ε) ∼ ε² does not hold, although both still converge to 0 for ε → 0.
Many functions are asymptotically equivalent to each other, e.g., cos(√ε) ∼ cos(ε) ∼ (1 − ε/2) ∼ exp(ε). In fact ∼ defines an equivalence relation on functions.
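These equivalences can be checked symbolically; a short sketch assuming SymPy is available (the script itself is not part of the notes):

```python
# Symbolic check of the asymptotic equivalences claimed above.
import sympy as sp

eps = sp.symbols('epsilon', positive=True)

# sin(eps) ~ exp(eps) - 1: the ratio tends to 1.
print(sp.limit(sp.sin(eps) / (sp.exp(eps) - 1), eps, 0))    # 1
# sin(eps) is NOT equivalent to eps**2: the ratio blows up.
print(sp.limit(sp.sin(eps) / eps**2, eps, 0))               # oo
# cos(sqrt(eps)) ~ exp(eps): both tend to 1 at the same rate.
print(sp.limit(sp.cos(sp.sqrt(eps)) / sp.exp(eps), eps, 0)) # 1
```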
Definition 5.13 (Asymptotic expansion). A function x(t) = x(t; ε) depending on a parameter ε has an asymptotic expansion in this parameter if

    x(t; ε) = Σ_{i=0}^∞ δi(ε) xi(t)

for small enough ε. Here it is assumed that all xi = O(1) w.r.t. ε and the so-called gauge functions δi are asymptotically ordered, i.e., δ_{i+1}(ε) = o(δi(ε)) for ε → 0. In most cases the gauge functions satisfy δi = O(ε^{ki}) with k0 < k1 < k2 < . . . .
The first term δ0 x0 is referred to as the leading order term and we have x ∼ δ0 x0 as ε → 0.
Remark. More generally we have:

    x(t; ε) ∼ δ0 x0 + δ1 x1 + δ2 x2 ,
    x(t; ε) = δ0 x0 + δ1 x1 + O(δ2) ,
    x(t; ε) = δ0 x0 + δ1 x1 + o(δ1) ,

and analogous relations hold for any partial sum of the expansion.
The most common example of an asymptotic expansion is given by the Taylor series.
Depending on the function x we can have three types of expansions:
• regular: with limε→0 x = x0, in which case δ0 = 1.
• vanishing: with limε→0 x = 0, in which case δ0 ≪ 1.
• singular: with limε→0 x = ∞, in which case δ0 ≫ 1.
Example 5.14. Consider the quadratic equation x² − x + ε/4 = 0 with the two solutions

    x+ = 1 − (1/4)ε − (1/16)ε² + · · · = O(1) ,
    x− = 0 + (1/4)ε + (1/16)ε² + · · · = O(ε) .
The problem with ε = 0, i.e., x² − x = 0, has the solutions x = 0, 1, which are the leading terms in the expansion of x±. So the ε term only leads to a slight change in the position of the roots, which are mostly determined by the balance of the x² and the x term in the equation. This is a regular problem where ε → 0 leads to well-defined limits.
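One can confirm numerically that the perturbed roots stay close to these expansions; a small Python check (the value of ε is my choice for illustration):

```python
import math

# Exact roots of x^2 - x + eps/4 = 0 versus the perturbation expansions
# x+ ~ 1 - eps/4 - eps^2/16 and x- ~ eps/4 + eps^2/16.
eps = 1e-2
x_plus = (1 + math.sqrt(1 - eps)) / 2
x_minus = (1 - math.sqrt(1 - eps)) / 2

approx_plus = 1 - eps / 4 - eps**2 / 16
approx_minus = eps / 4 + eps**2 / 16

print(abs(x_plus - approx_plus))   # O(eps^3)
print(abs(x_minus - approx_minus))
```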
In general we do not know what the solution of a problem will look like, so choosing the right gauge functions can be tricky and can require a good understanding of the characteristics of the problem (or a lot of trial and error). In this case we start by assuming that we have a regular problem, in which case an expansion of the form x = x0 + εx1 + ε²x2 + . . . is a good way to start. Inserting this into our quadratic equation and combining terms with equal powers of ε leads to

    (x0² − x0) + ε(2x1x0 − x1 + 1/4) + ε²(x1² + 2x0x2 − x2) + · · · = 0 .
If all xi are O(1) we arrive at a system of equations:

    O(1) :  x0² − x0 = 0            ⇒ x0 = 0 or x0 = 1
    O(ε) :  2x1x0 − x1 + 1/4 = 0    ⇒ x1 = 1/4 or x1 = −1/4
    O(ε²) : x1² + 2x0x2 − x2 = 0    ⇒ x2 = 1/16 or x2 = −1/16
and so on. Note that only the leading order equation is nonlinear; the other equations are linear in xi but depend on the previous xj (j < i). Comparing the computed expansions for x± with the ones based on the Taylor expansion of the exact solutions given above shows that we recovered the correct coefficients.
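This comparison with the Taylor expansion of the exact roots can be automated; a sketch assuming SymPy is available:

```python
import sympy as sp

# Taylor-expand the exact roots x± = (1 ± sqrt(1 - eps))/2 of
# x^2 - x + eps/4 = 0 and compare with the perturbation coefficients.
eps = sp.symbols('epsilon')
x_plus = (1 + sp.sqrt(1 - eps)) / 2
x_minus = (1 - sp.sqrt(1 - eps)) / 2

print(sp.series(x_plus, eps, 0, 3))   # 1 - eps/4 - eps**2/16 + O(eps**3)
print(sp.series(x_minus, eps, 0, 3))  # eps/4 + eps**2/16 + O(eps**3)
```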
In the next example we apply the same approach to an ODE problem:
Example 5.15. Consider the solution to the ODE

    d²x/dt² = −1/(1 + εx)² ,   x(0) = 1 , x′(0) = α ,
which is the equation for a projectile which we introduced previously. We are interested in the
solution for ε → 0. We start with the asymptotic expansion x(t) = x0 (t) + εx1 (t) + ε2 x2 (t) + . . .
which we insert into the ODE and the initial conditions. The first initial condition gives us 1 = x(0) = x0(0) + εx1(0) + . . . , so x0(0) = 1 and xi(0) = 0 for i > 0. The second initial condition gives us in the same way x0′(0) = α and xi′(0) = 0 for i > 0. For the left hand side of the ODE we of course simply get x″(t) = x0″(t) + εx1″(t) + ε²x2″(t) + . . . , while for the right hand side we first use the Taylor expansion of (1 + z)^{−2}, which gives us

    −1/(1 + εx)² = −1 + 2εx − 3ε²x² + 4ε³x³ − · · · ,
into which we finally insert our asymptotic expansion of x and combine terms with equal powers of ε:

    −1/(1 + εx)² = −1 + 2ε(x0 + εx1 + ε²x2) − 3ε²(x0 + εx1)² + 4ε³x0³ + O(ε⁴)
                 = −1 + 2ε(x0 + εx1 + ε²x2) − 3ε²(x0² + 2εx0x1) + 4ε³x0³ + O(ε⁴)
                 = −1 + 2εx0 + ε²(2x1 − 3x0²) + ε³(2x2 − 6x0x1 + 4x0³) + O(ε⁴) .
Now combining terms with the same order of ε leads to a system of ODEs:

    O(1) :  x0″ = −1 ,           x0(0) = 1 , x0′(0) = α ,
    O(ε) :  x1″ = 2x0 ,          x1(0) = 0 , x1′(0) = 0 ,
    O(ε²) : x2″ = 2x1 − 3x0² ,   x2(0) = 0 , x2′(0) = 0 ,

and so on. Now we can solve these problems in sequence since they are decoupled, i.e., the equation at scale O(εⁿ) only depends on the xi with i ≤ n but not on those with i > n. Also the ODE at scale O(εⁿ) is linear in xn, so it is easy to solve once the previous xi have been computed.
At leading order we get x0(t) = −t²/2 + αt + 1; then x1 solves x1″ = −t² + 2αt + 2 with zero initial conditions, which has the solution x1(t) = −t⁴/12 + (α/3)t³ + t². Our asymptotic expansion thus has the form

    x(t) = −t²/2 + αt + 1 + ε(−t⁴/12 + (α/3)t³ + t²) + O(ε²) .
Note that the expansion above requires |x0(t)| ≫ ε|x1(t)| to make sense (the gauge functions have to lead to a separation of the terms in the asymptotic expansion). This stops being the case for t = O(ε^{−1/2}). This effect is common in asymptotic expansions and indicates a change in the nature of the problem: a change in the scaling regime occurs, requiring the use of a different form for the expansion.
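Within its range of validity the two-term expansion can be checked against a direct numerical solution of the ODE; a self-contained RK4 sketch (the values of ε, α and the step count are my choices, not from the notes):

```python
# Compare the two-term expansion of x'' = -1/(1+eps*x)^2, x(0)=1, x'(0)=alpha
# with a direct RK4 solution of the same initial value problem.

def rhs(x, v, eps):
    """First-order system for x'' = -1/(1+eps*x)^2."""
    return v, -1.0 / (1.0 + eps * x) ** 2

def rk4_solve(eps, alpha, t_end, n):
    dt = t_end / n
    x, v = 1.0, alpha
    for _ in range(n):
        k1x, k1v = rhs(x, v, eps)
        k2x, k2v = rhs(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v, eps)
        k3x, k3v = rhs(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v, eps)
        k4x, k4v = rhs(x + dt * k3x, v + dt * k3v, eps)
        x += dt / 6.0 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v += dt / 6.0 * (k1v + 2 * k2v + 2 * k3v + k4v)
    return x

eps, alpha, t = 0.01, 1.0, 1.0
x_num = rk4_solve(eps, alpha, t, 1000)
x0 = -t**2 / 2 + alpha * t + 1             # leading order term
x1 = -t**4 / 12 + alpha * t**3 / 3 + t**2  # first correction
print(abs(x_num - (x0 + eps * x1)))        # difference is O(eps^2)
```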
Example 5.17. Consider the quadratic equation

    εx² − 2x + 1 = 0 ,

which has the two solutions x± = (1 ± √(1 − ε))/ε. We can expand this into x± ∼ (1/ε)·(1 ± (1 − ε/2 − ε²/8 + O(ε³))) (check), so that

    x+ ∼ 2/ε − 1/2 − ε/8 ,   x− ∼ 1/2 + ε/8 + ε²/16 .
Now take ε = 0 in the quadratic equation, i.e., study the leading order equation −2x + 1 = 0, which has only the single solution x⋆ = 1/2. So we only recover the leading order term of the x− solution and miss the second solution, which has a singular behaviour for ε → 0 and cannot be expressed as an expansion of the form x = x0 + εx1 + ε²x2 + . . . . Substituting the leading terms of the approximations for x± back into the quadratic equation εx² − 2x + 1 = 0 shows where the problem arises:

    ε·(4/ε²) − 2·(2/ε) + 1 = 4/ε − 4/ε + 1 = 1 ,   ε·(1/4) − 2·(1/2) + 1 = ε/4 .

The solutions are due to different terms in the quadratic equation balancing: for x+ the first two terms balance at O(1/ε) while the third term is O(1); for x− the second and third terms are both O(1) while the first term is O(ε). The latter is the regular solution obtained from the leading order equation.
We can extend the ideas from regular perturbation problems in the following way: instead of substituting an expansion U = u0 + εu1 + ε²u2 + . . . as done in the regular perturbation problem, substitute δ(ε)U. Then choose δ(ε) to obtain consistent dominant balances in the equation and make sure that all neglected terms are really subdominant. Different choices of δ(ε) will lead to different dominant balances; in the above example δ(ε) = 1/ε and δ(ε) = 1. For each choice of δ(ε), factoring out common ε factors will lead to a regular perturbation problem which can be used to determine subsystems for u0, u1, u2, . . . . This systematic approach will result in a full overview of the regular and singular solutions. Let us try this approach on the quadratic equation from our previous example:
Example 5.17 (cont.). We start by inserting δ(ε)X into the equation:

    ε·δ(ε)²·X² − 2·δ(ε)·X + 1 = 0 .

We compare the orders of magnitude of each of these terms for different choices of δ, keeping in mind that X is O(1):
1. Term (2) = Term (3): δ = 1. The neglected first term is in this case O(ε), so it is subdominant. The quadratic equation in this case is εX² − 2X + 1 = 0. Substituting a regular expansion into that equation, e.g., X = X0 + εX1 + ε²X2, leads to the following system of equations: −2X0 + 1 = 0, X0² − 2X1 = 0, 2X0X1 − 2X2 = 0. So we get X = 1/2 + (1/8)ε + (1/16)ε², which are the first terms of the regular solution x−.
2. Term (1) = Term (3): εδ² = 1 ⇒ δ = 1/√ε. The first and third terms are O(1) while the neglected second term is of order O(1/√ε) ≫ 1, so it is not subdominant. This choice is inconsistent.
3. Term (1) = Term (2): εδ² = δ ⇒ δ = 1/ε. The neglected third term is O(1) while the two other terms are O(1/ε) ≫ 1, so the neglected term is subdominant. The quadratic equation in this case is (1/ε)(X² − 2X + ε) = 0, or equivalently X² − 2X + ε = 0. Substituting a regular expansion into that equation, e.g., X = X0 + εX1 + ε²X2, leads to the system: X0² − 2X0 = 0, 2X0X1 − 2X1 + 1 = 0, 2X0X2 + X1² − 2X2 = 0. So X0 = 2 or X0 = 0. Note that X0 = 0 is not a consistent choice because then X is not O(1) but O(ε), which means that Term (1) and Term (2) would not balance anymore. So we need to take X0 = 2. With that choice X1 = −1/2 and X2 = −1/8, and we have recovered the first three terms of the singular solution x+.
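Both asymptotic solutions can be compared against the numerically computed roots of εx² − 2x + 1 = 0; a quick Python check (the value of ε is my choice):

```python
import math

# Exact roots of eps*x^2 - 2x + 1 = 0 versus the two asymptotic solutions:
# regular root x- ~ 1/2 + eps/8 + eps^2/16, singular root x+ ~ 2/eps - 1/2 - eps/8.
eps = 1e-3
sqrt_disc = math.sqrt(1 - eps)
x_plus = (1 + sqrt_disc) / eps
x_minus = (1 - sqrt_disc) / eps

print(abs(x_minus - (0.5 + eps / 8 + eps**2 / 16)))  # tiny
print(abs(x_plus - (2 / eps - 0.5 - eps / 8)))       # O(eps^2)
```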
In dynamic problems singular perturbations lead to boundary layers. These are parts of the time line where the solution behaves very differently from the solution in the rest of the domain. The term boundary layer comes from fluid dynamics, where it describes the layer close to a boundary generated by a fluid flowing over a rough surface: the flow is very different near the boundary compared to the rest of the domain. But this multiscale effect appears in many applications; think for example of a tornado, where high wind speeds coexist right next to large regions where the weather is completely calm. How narrow the boundary layer is depends on the size of some (non-dimensional) parameter in the model.
Example 5.18. Consider the ODE y′ = (1/ε)(sin t − y) with initial condition y(0) = 1, which has the exact solution

    y(t) = (1 + ε²)⁻¹·(sin t − ε cos t) + C·e^{−t/ε} ,

where C can be used to fix the initial condition. So the solution consists of two parts: the first is basically the curve sin t, which the solution reaches after a very short transition phase ("boundary layer") for t close to zero. So we first do a standard asymptotic expansion y ∼ y0 + εy1:
substituting this into εy′ + y = sin t gives y0 = sin t (ε⁰ terms) and y1 = −y0′ = −cos t (ε¹ terms). So the leading order term represents the slow manifold solution sin t, but note that there are no free parameters to choose and so no initial condition we can fix. This is a recurring feature of singularly perturbed problems, since the nature of the model changes: in this case from a first order ODE (which has one free parameter for the initial condition) to an algebraic equation (which has no free parameter).
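The two-term slow-manifold expansion can be verified symbolically; a sketch assuming SymPy is available (note the sign of the ε term):

```python
import sympy as sp

# Check that the two-term expansion y ~ sin(t) - eps*cos(t) satisfies
# eps*y' + y = sin(t) up to a residual of order eps^2.
t, eps = sp.symbols('t epsilon')
y = sp.sin(t) - eps * sp.cos(t)
residual = sp.expand(eps * sp.diff(y, t) + y - sp.sin(t))
print(residual)  # epsilon**2*sin(t)
```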
To find a solution in the initial boundary layer we rescale time to make the boundary layer have size O(1). We do not really know which size the boundary layer has, i.e., is it O(√ε), O(ε), or perhaps even O(ε²)? Finding the right scaling is crucial and can require some experimenting. To have a general scaling one can try rescaling time in the form τ = t/Φ(ε) where, for example, Φ(ε) = ε^α. Then τ = 1 corresponds to real time t being equal to ε^α. Defining a new solution function Y(τ) = y(Φ(ε)τ) and substituting into the ODE we arrive at
    Y′(τ) = Φ(ε)·y′(Φ(ε)τ) = (Φ(ε)/ε)·(sin(Φ(ε)τ) − y(Φ(ε)τ))

so that Y satisfies the ODE

    Y′(τ) = (Φ(ε)/ε)·(sin(Φ(ε)τ) − Y(τ)) ,   Y(0) = 1 .
Now we can again use an asymptotic expansion of the form Y(τ) ∼ Y0(τ) + εY1(τ); note that we will focus on the leading term (the one with ε⁰), so the exact form of the expansion is not of so much interest:

    (ε/Φ(ε))·(Y0′ + εY1′) = sin(Φ(ε)τ) − Y0(τ) − εY1(τ) = Φ(ε)τ − Y0(τ) − εY1(τ) + O(ε²) ,

where we used sin(ετ) ∼ ετ. Now taking Φ(ε) = ε seems a good choice here, which leads to the leading order ODE

    Y0′ = −Y0(τ) ,   Y0(0) = 1 ,

so that the leading order term is Y0(τ) = e^{−τ}. So now we have two solutions:
• Outer solution (away from the boundary layer): yO(t) = sin t
• Inner solution (inside the boundary layer): yI(t) = e^{−t/ε} (or YI(τ) = e^{−τ})
The question now is: is it possible to match these two functions in the intermediate range, e.g., for t = O(√ε)? We have no free parameters left, so either those two functions approximate the same function in the intermediate range or they do not (in the latter case we would have to reconsider the size of the boundary layer Φ). In other words, there are no free parameters to do any matching of the two solutions.
Note that limt→0 yO(t) = 0 = limτ→∞ YI(τ), which is a good sign. Also, taking t = η√ε one can show that both yO = O(√ε) and yI = O(√ε), so they match each other in this region (check). As an approximation to the exact solution we will thus use

    ỹ(t) := yO(t) + yI(t) = sin t + e^{−t/ε} .
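How good this composite approximation is can be checked against the exact solution from the beginning of the example; a Python sketch (the grid and the value of ε are my choices):

```python
import math

# Compare the exact solution of eps*y' + y = sin(t), y(0)=1, with the
# composite approximation sin(t) + exp(-t/eps) from matched asymptotics.
eps = 0.01
C = 1 + eps / (1 + eps**2)   # constant fixing y(0) = 1 in the exact solution

def y_exact(t):
    return (math.sin(t) - eps * math.cos(t)) / (1 + eps**2) + C * math.exp(-t / eps)

def y_composite(t):
    return math.sin(t) + math.exp(-t / eps)

# Maximum deviation on a grid over [0, 3]; it stays of size O(eps).
max_err = max(abs(y_exact(0.001 * i) - y_composite(0.001 * i)) for i in range(3001))
print(max_err)
```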
Example 5.19. For a slightly more complex example which is nicely worked out from modelling, through non-dimensionalization, to singular perturbation theory see: https://www.math.colostate.edu/~shipman/47/volume2a2010/Munoz-Alicea.pdf