MA261
1 Introduction
2 Getting Started
2.1 Basic Concepts
2.1.1 Convergence rates
2.1.2 Conditioning
2.1.3 Floating point numbers (this subsection is not examinable)
2.2 Some ODE Basics
2.2.1 Solvability of the Initial Value Problem (this subsection is not examinable)
2.3 The Forward/Backward Euler
2.3.1 Convergence analysis
2.3.2 Stability analysis
Bibliography
Chapter 1
Introduction
Numerical analysis is the mathematical study of algorithms for solving problems arising in many
different areas, e.g., physics, engineering, biology, statistics, economics, and the social sciences. In general,
starting from some real world problem, the following steps have to be performed:
4. Numerical analysis: the discretization is studied again with respect to well-posedness, but
most importantly the error between the solution of the finite dimensional problem and the
mathematical model is estimated and convergence of the numerical solution is established.
5. Implementation: the finite dimensional problems are solved using a computer program.
This can be a cyclic procedure where, for example, the analysis in step 2 can influence the
modelling step, i.e., step 1 is refined. The numerical simulation can show that additional effects
have to be taken into account, so the modelling has to be refined, and so on.
This module will focus on all these points for some simple settings to make you familiar with
central underlying ideas. The modelling techniques and the numerical schemes presented are
an important building block used for solving more complex problems. We will be focusing on
problems described by ordinary differential equations so a lot of the material covered in MA133
will be helpful.
Example 1.1. Consider the problem of a steel rope of length L > 1 m clamped between two poles
which are 1 m apart so that the rope is almost taut. Now the position of the rope is required in the
case where an acrobat is standing in the middle. A sketch of the problem is shown in Figure 1.2.
1. Modeling: first we make the assumption that the rope can be represented as a function
y : [0, 1] → R. The shape of the rope is such that its bending energy E is minimal. For E
one finds (neglecting for example gravity) the formula
$$E(y) := \frac{c}{2} \int_0^1 \frac{y'(x)^2}{\sqrt{1 + y'(x)^2}} \, dx - \int_0^1 f(x)\, y(x) \, dx \, .$$
Figure 1.2: Sketch of the problem and a computed solution.
Here c depends on the material of the rope and f is the load (the acrobat) on the rope. So
we seek $y \in V := \{v \in C^2((0,1)) : v(0) = v(1) = 0\}$ minimizing the energy.
Both the function f and the constant c have to be determined by measurements and contain
data errors.
This is a very complex problem. So we make a simplification and assume that the displacement
of the rope is small, i.e., y' is small. Then we can replace E by
$$\bar{E}(y) := \frac{c}{2} \int_0^1 y'(x)^2 \, dx - \int_0^1 f(x)\, y(x) \, dx \, .$$
This problem has a unique solution. One can also show for example that y < 0 if f < 0, i.e.,
when the force is pointing downward, the displacement is also downwards along the whole
length of the rope.
3. Discretization: Instead of approximating the function y at all points in [0, 1] we compute
y at the points $x_i = ih$ for $i = 1, \dots, N-1$ with $h = \frac{1}{N}$. We can replace the second derivative
for $n \geq 1$. We compute this sequence up to $Y^P$, which is then our final approximation. It can
be seen that $Y^n \to Y$ ($n \to \infty$), but since we cannot compute an infinite number of iterates,
we have a further termination error caused by stopping the computation after P steps.
4. Numerical analysis: the matrix A is regular and thus the discrete problem has a unique
solution. For $v \in C^4((0,1))$ there exists a constant $M > 0$ so that for all $x \in (0,1)$ the
following error estimate holds:
$$\left| v''(x) - \frac{1}{h^2}\big(v(x+h) - 2v(x) + v(x-h)\big) \right| \leq M h^2 \, .$$
The same estimate holds for the discrete values Y:
$$\max_{i=1,\dots,N-1} \left| \bar{y}(x_i) - y_i \right| \leq C h^2 \, ,$$
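The quadratic convergence of the central second difference quotient is easy to observe numerically. The following sketch (the test function sin is our arbitrary choice, not from the notes) checks that halving h reduces the error by roughly a factor of four:

```python
import math

def second_difference(v, x, h):
    # central second difference quotient (1/h^2)(v(x+h) - 2 v(x) + v(x-h))
    return (v(x + h) - 2 * v(x) + v(x - h)) / h**2

# v(x) = sin(x) has v''(x) = -sin(x); the error should shrink like h^2
x0 = 0.7
errors = [abs(second_difference(math.sin, x0, h) + math.sin(x0))
          for h in (0.1, 0.05, 0.025)]
print(errors)  # each halving of h reduces the error by roughly a factor 4
```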
Example 1.2. A mass spring system with friction proportional to the velocity is modelled by the
second order ODE
$$\mu x''(t) + \beta x'(t) + \gamma x(t) = 0 \, .$$
Here x(t) is the position of the (point) mass at time t, thus x' is the velocity and x'' the acceleration
at time t. The three constants in the model are the mass µ > 0, the amount of friction β > 0, and
the restoring force of the spring γ > 0. To make the problem well posed the initial position of the
mass x(0) = x₀ and the initial velocity x'(0) = v₀ have to be prescribed.
There are different ways to derive this model, one of them is to start with Newton’s second law
µa = F (mass × acceleration = applied forces). We choose our coordinate system in such a way
that x = 0 corresponds to the rest position of the spring - so x > 0 means the spring is stretched.
The restoring forces are assumed to be proportional to the amount of stretching s, so Fr = −γs.
This is a modelling assumption; we could also have a nonlinear spring where the restoring force
depends nonlinearly on the stretching, e.g., F = −ks³. The force of friction is assumed to be
directly proportional to the velocity x' of the mass, so F_f = −βx'. As said before a = x'' and due
to our choice of coordinate system s = x, so:
$$\mu x'' = F_r + F_f = -\gamma x - \beta x' \, .$$
This model is linear and can be easily solved using the approach of characteristic polynomials
discussed in MA133:
$$x(t) = e^{-\frac{\beta}{2\mu} t}\big(A \cos(\omega t) + B \sin(\omega t)\big) \, , \qquad \omega = \frac{\sqrt{4\gamma\mu - \beta^2}}{2\mu} \, ,$$
where we made the assumption that β is small so that β² < 4γµ (the system is under damped).
The constants A and B are determined from the initial conditions.
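As a quick sanity check, one can verify numerically that the damped solution formula satisfies the ODE. The parameter values below are illustrative choices, not from the notes; the derivatives are approximated by central differences:

```python
import math

# illustrative parameter values: underdamped since beta^2 < 4*gamma*mu
mu, beta, gamma = 1.0, 0.1, 1.0
w = math.sqrt(4 * gamma * mu - beta**2) / (2 * mu)
A, B = 1.0, 0.5   # would normally be fixed by the initial conditions

def x(t):
    return math.exp(-beta / (2 * mu) * t) * (A * math.cos(w * t) + B * math.sin(w * t))

# residual of mu*x'' + beta*x' + gamma*x at t = 1, derivatives by central differences
h = 1e-3
xp = (x(1 + h) - x(1 - h)) / (2 * h)
xpp = (x(1 + h) - 2 * x(1) + x(1 - h)) / h**2
residual = mu * xpp + beta * xp + gamma * x(1)
print(residual)  # close to zero
```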
The problem seems to depend on three parameters, although we know from studying the exact
solution that the type of behaviour of the system depends on the ratio between β² and 4γµ, e.g., if
β²/(4γµ) < 1 the system oscillates while for β²/(4γµ) > 1 the system is over damped and will not oscillate
at all.
at all. One modelling technique is to non dimensionalize the model and in that step try to reduce
the number of parameters, isolating the parameters that mainly determine the behaviour of the
problem. To this end one first needs to fix the physical units of each part of the model, e.g., x could
be measured in meters (m), the velocity x' could then be in meters per second (m/s), and the acceleration
in m/s². The mass µ could be in kilograms (kg) and (to make things fit) we assume that β has
units kg/s and γ kg/s² (we will discuss this in detail later on). Now let us fix a typical time
scale T and a length scale L, introduce the scaled time τ = t/T, and rescale the position x(t) in the form
χ(τ) = x(Tτ)/L. Using the chain rule we can easily see that χ'(τ) = (T/L) x'(Tτ), χ''(τ) = (T²/L) x''(Tτ)
and thus
$$0 = \mu x''(T\tau) + \beta x'(T\tau) + \gamma x(T\tau) = \frac{\mu L}{T^2}\, \chi''(\tau) + \frac{\beta L}{T}\, \chi'(\tau) + \gamma L\, \chi(\tau) \, .$$
Note that χ, τ do not have any units (e.g. t, T both have some time unit like seconds and their
fraction is unitless). We can now divide through by µL and multiply by T² to arrive at
$$\chi'' + \frac{T\beta}{\mu}\, \chi' + \frac{T^2\gamma}{\mu}\, \chi = 0$$
and note that the two remaining constants Tβ/µ, T²γ/µ are also unitless. We now have many
different ways to choose T (note that the equation doesn't depend on our choice for L). We could
choose T to make Tβ/µ = 1 or alternatively T²γ/µ = 1, e.g., $T = \sqrt{\mu/\gamma}$, which leads to a coefficient
in front of the friction term of
$$\frac{T\beta}{\mu} = \sqrt{\frac{\mu}{\gamma}}\,\frac{\beta}{\mu} = \sqrt{\frac{\beta^2}{\gamma\mu}} = 2\sqrt{\frac{\beta^2}{4\gamma\mu}} =: 2\omega \, .$$
Our model thus reduces to
$$\chi'' + 2\omega\, \chi' + \chi = 0 \, .$$
We are only left with a single factor ω² = β²/(4γµ) and we can discuss the behaviour of the solution
to this model (or simulate it) depending on the size of this one parameter. The damping regime
now depends on ω² being less than or greater than one. After understanding the behaviour of this
non dimensionalized problem one can then look at the values of the parameters in the problem, e.g.,
µ, β, γ, to figure out which regime a given mass spring system belongs to. Of course in this simple
case we have not really learned anything new but the concept is more widely applicable.
Using Newton’s second law is one way of deriving the equations of motion for the mass. Another
approach is based on Hamiltonian dynamics which we will also briefly cover in this module. Let
us for now consider the frictionless case. Define the Hamiltonian
$$H(p, q) := \frac{\mu}{2}\, q^2 + \frac{\gamma}{2}\, p^2$$
and consider a particle moving such that H(x(t), x'(t)) is constant, in other words $\frac{d}{dt} H(x(t), x'(t)) = 0$.
Using the chain rule it is easy to see that
$$\frac{d}{dt} H(x(t), x'(t)) = \mu x'' x' + \gamma x' x = x' (\mu x'' + \gamma x)$$
so that H(x(t), x'(t)) is constant if and only if either x is stationary (i.e. x' = 0) or x solves the
second order problem
$$\mu x'' + \gamma x = 0 \, .$$
We looked a bit at the modelling aspects of this problem and did some analysis, we can now turn
to discretizing the problem. In this case we have an exact solution so looking at discretization
methods for this problem seems a bit pointless, but the fact that we know what the solution
should look like allows us to study the behaviour of a given method much more easily and we can
deduce something for more complicated cases where we do not know the exact solution. Of course
this only makes sense if we assume that the method we are studying is applicable to more general
problems. In this module we will focus on methods for solving first order nonlinear systems, i.e.,
ODEs of the form
$$y'(t) = f(y(t))$$
where y : [0, T] → R^m for m ≥ 1. We can easily rewrite our mass spring system in that form
by introducing the vector y(t) = (y₁(t), y₂(t)) = (x(t), x'(t)), so that y'(t) = (x'(t), x''(t)) =
(x'(t), −(γ/µ)x(t)) = (y₂(t), −(γ/µ)y₁(t)), which is of the right form if we define f(y) = (y₂, −(γ/µ)y₁).
A simple approach to discretize y(t) is to look for approximations $y_n \approx y(t_n)$ where $0 = t_0 < t_1 <
t_2 < \dots < t_N = T$ are some fixed points in time, for example $t_n = nh$ with $h = \frac{T}{N}$. The derivative
$y'(t_n)$ can be approximated using a finite difference quotient, for example
$$y'(t_n) \approx \frac{y(t_{n+1}) - y(t_n)}{h} \approx \frac{y_{n+1} - y_n}{h} \, .$$
Since $y'(t_n) = f(y(t_n)) \approx f(y_n)$ we arrive at the so called forward Euler method:
yn+1 = yn + hf (yn )
which is a very easy to implement method, since given the initial condition y0 we can directly
compute y1 = y0 + hf (y0 ) and then y2 = y1 + hf (y1 ) and so on up to yN = yN −1 + hf (yN −1 ).
Applying this to our mass spring problem we get
$$y_{n+1,1} = y_{n,1} + h\, y_{n,2} \, , \qquad y_{n+1,2} = y_{n,2} - h\, \frac{\gamma}{\mu}\, y_{n,1} \, .$$
In the following we set γ/µ = 1 and use as initial data y(0) = (x₀, v₀) = (1, 1), so that the exact
solution is simply
$$y(t) = (\cos(t) + \sin(t),\; -\sin(t) + \cos(t)) \, .$$
At T = 2π the solution should be back at (1, 1), so let us check what value y_N has for N = 2π/h for
different values of h: we compute a sequence of approximations to y(2π) using $h_i = 2\pi/(100 \cdot 2^i)$, i.e.,
we use $N_i = 100 \cdot 2^i$ steps for i = 0, 1, . . . , 5:
i N yN |y(T ) − yN |
0 101 [1.20766198 1.2277517 ] 3.08212e-01
1 201 [1.10139465 1.10595473] 1.46654e-01
2 401 [1.05003655 1.05112221] 7.15342e-02
3 801 [1.02484773 1.02511256] 3.53278e-02
4 1601 [1.01238062 1.01244602] 1.75552e-02
5 3201 [1.00617943 1.00619568] 8.75053e-03
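The experiment can be reproduced with a few lines of Python (a sketch, with our own variable names; the step counts match the table up to the convention of counting points versus steps):

```python
import math

def forward_euler_spring(h, steps):
    # forward Euler for y' = (y2, -y1), i.e. gamma/mu = 1, with y(0) = (1, 1)
    y1, y2 = 1.0, 1.0
    for _ in range(steps):
        y1, y2 = y1 + h * y2, y2 - h * y1
    return y1, y2

T = 2 * math.pi
for i in range(3):
    steps = 100 * 2**i
    y1, y2 = forward_euler_spring(T / steps, steps)
    err = math.hypot(y1 - 1.0, y2 - 1.0)
    print(steps, (y1, y2), err)
```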
We have used a very large number of points t_n for the final simulation and the solution is still not
all that accurate: the error has just dropped below 1%. Depending on the application this might or
might not be an acceptable level of error, and it may or may not be an acceptable computational effort
to reach this error. But it does seem worthwhile to investigate methods that achieve a smaller error
with the same computational cost, and we will study some approaches in this module. The results
seem to indicate that the error is going to zero with increasing N; in fact it looks like the error is
halving each time N is doubled, i.e., the error is proportional to 1/N ∼ h. We will see
later in this module that this is in fact the case. Computing only one period of the oscillation is
often not of interest; instead the long time behaviour is to be simulated, so let us redo the above
computation with T = 200π (which is actually not that long):
i N yN |y(T ) − yN |
0 10001 [-2.00707e+07,5.08076e+08] 5.08472e+08
1 20001 [1.48842e+04,2.27772e+04] 2.72078e+04
2 40001 [1.31599e+02,1.45952e+02] 1.95108e+02
3 80001 [1.16376e+01,1.19422e+01] 1.52607e+01
4 160001 [3.42277e+00,3.44495e+00] 3.44204e+00
5 320001 [1.85158e+00,1.85458e+00] 1.20644e+00
Not so good: the best that can be said is that it does seem to be converging, but the errors are
huge! The next simulation shows a trend here: instead of staying on a constant level curve of
H (i.e. H(y(t)) = H(y(0))) the value of H seems to be increasing, and to verify this we add H(y_N)
to our output (the expected value is H(1, 1) = 1). We also increase i a bit more:
So decreasing h (or increasing N) to compute the error at a fixed time does seem to work, although
the required work can be very high if the error is to be small or the time period somewhat longer.
Instead of changing h we will now fix h and increase T just to show that effect a bit more:
The time evolution of the discrete solution for different values of h is shown in the left figure in
the following plot (only every 15th approximate value is plotted). On the right you can see
a simulation with a larger value of T using the same value of h used for the curve with the
same colour on the left. The plots show the evolution of the system in phase space, i.e., the x-axis
represents the position of the mass and the y-axis its velocity. Another way of thinking of these
plots is in terms of the Hamiltonian H: H should be constant, i.e., the mass should remain on a
single level curve of H, which are circles around the origin.
We will see later on that the forward or explicit Euler method suffers from stability issues in the
case that h is too large (this is not the problem here...). Nevertheless, we can try a method that
we will later prove to be more stable: the backward or implicit Euler method. The approach to
derive the approximation is the same as for the forward Euler method, except that we use the
approximation at t = t_{n+1} instead of at t_n, i.e., $y'(t_{n+1}) \approx \frac{y(t_{n+1}) - y(t_n)}{h} \approx \frac{y_{n+1} - y_n}{h}$. Since
$y'(t_{n+1}) = f(y(t_{n+1})) \approx f(y_{n+1})$ we arrive at the so called backward Euler method:
$$y_{n+1} = y_n + h f(y_{n+1}) \, .$$
The method is in general not quite as easy to code up but since f is linear in our case it is still
fairly easy to do:
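For our linear system with γ/µ = 1 the implicit step (I − hJ)y_{n+1} = y_n, with J the rotation matrix ((0,1),(−1,0)), can be solved by hand, which gives the explicit update used in this sketch:

```python
import math

def backward_euler_spring(h, steps):
    # y_{n+1} = y_n + h f(y_{n+1}) with f(y) = (y2, -y1); solving the
    # resulting 2x2 linear system gives the update below
    y1, y2 = 1.0, 1.0
    d = 1.0 + h * h
    for _ in range(steps):
        y1, y2 = (y1 + h * y2) / d, (y2 - h * y1) / d
    return y1, y2

y1, y2 = backward_euler_spring(2 * math.pi / 1000, 1000)
print((y1, y2))  # close to (1, 1), but with a slightly reduced amplitude
```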
Note that now the mass is slowing down like it would if friction were added (recall that the x-axis
in these plots represents the position and the y-axis the velocity).
In summary: were we to use the forward Euler method to compute the orbit of a satellite around
Earth, the satellite would always be spinning off into space, preferable perhaps to the trajectory
predicted by the backward Euler method but still not correct... But also note that both methods
converge, i.e., if we fix a point in time and reduce h enough the error can (in theory) be made as
small as we want it to be. We have some experimental indication of this for the forward Euler
method; the following table indicates that the same is true for the backward Euler method:
Can we derive a method that converges and maintains the energy of the system, i.e., guarantees
that H(y_n) = H(y₀) for all n? The answer is yes and the method is just as simple to implement
as the forward Euler method. The method is often referred to as the symplectic Euler method and
you will need to look closely to see the difference to the forward Euler method described above:
$$y_{n+1,1} = y_{n,1} + h\, y_{n,2} \, , \qquad y_{n+1,2} = y_{n,2} - h\, \frac{\gamma}{\mu}\, y_{n+1,1} \, .$$
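In this linear case one can show that the iterates stay on an ellipse close to the exact level circle, so the computed energy remains in a small band around its initial value for arbitrarily long times. A sketch of a long time run (step size chosen for illustration):

```python
import math

def symplectic_euler_spring(h, steps):
    # position is updated with the old velocity, velocity with the NEW position
    y1, y2 = 1.0, 1.0
    h_min, h_max = 1.0, 1.0  # track min/max of H(y) = (y1^2 + y2^2)/2
    for _ in range(steps):
        y1 = y1 + h * y2
        y2 = y2 - h * y1
        H = 0.5 * (y1 * y1 + y2 * y2)
        h_min, h_max = min(h_min, H), max(h_max, H)
    return (y1, y2), h_min, h_max

# long time run, T = 200*pi with h = 2*pi/100
_, h_min, h_max = symplectic_euler_spring(2 * math.pi / 100, 100 * 100)
print(h_min, h_max)  # H stays close to H(y(0)) = 1 for all times
```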
For a fixed (small) time, can we derive a method that converges faster? Let's try ode45 (in Python
that is RK45):
i N yn |y(T ) − yN | H(yN ) rel error H
0 100 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
1 200 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
2 400 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
3 800 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
4 1600 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
5 3200 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
6 6400 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
7 12800 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
8 25600 [1.00189e+00,9.98035e-01] 2.72479e-03 1.50000e+00 -1.05150e-08
Chapter 2
Getting Started
In this chapter we will introduce a few concepts without being too formal. The ideas will then be
expanded on in the following chapters.
We use the abbreviations $C^m(a,b)$ for $C^m((a,b))$ and $C^\infty(I) := \bigcap_{m \in \mathbb{N}} C^m(I)$. It follows that
$$C^\infty(I) \subset \dots \subset C^m(I) \subset \dots \subset C^0(I) \, .$$
Theorem 2.2 (Taylor Theorem). Let $f \in C^m(a,b)$ and $x_0 \in (a,b)$ be given. Then there exists a
function $\omega_m : \mathbb{R} \to \mathbb{R}$ with $\lim_{x \to x_0} \omega_m(x) = 0$, so that
$$f(x) = P_m(x) + \omega_m(x)\,(x - x_0)^m \, ,$$
where
$$P_m(x) = \sum_{k=0}^{m} \frac{1}{k!}\, f^{(k)}(x_0)\,(x - x_0)^k \, .$$
For the remainder term $R_m(x) := f(x) - P_m(x)$ there are different important representations:
1. Lagrange representation: For fixed $x \in (a,b)$ there is a $\xi$ between $x_0$ and $x$ so that
$$R_m(x) := \frac{1}{(m+1)!}\, f^{(m+1)}(\xi)\,(x - x_0)^{m+1} \, .$$
2. Integral representation:
$$R_m(x) := \frac{1}{m!} \int_{x_0}^{x} f^{(m+1)}(t)\,(x - t)^m \, dt \, .$$
Taylor expansion motivates the following definition:
Definition 2.3. A function $f \in C^1(x_0 - h_0, x_0 + h_0)$ is up to leading order equal to $f(x_0) + f'(x_0)h$
in an open set around $x_0$, i.e., there is a function $\bar\omega : (-h_0, h_0) \to \mathbb{R}$ with $|\bar\omega(h)|/|h| \to 0$ and
$f(x_0 + h) = f(x_0) + f'(x_0)h + \bar\omega(h)$. This means that we are neglecting all terms that converge
faster to zero than h.
Notation: $f(x_0 + h) \doteq f(x_0) + f'(x_0)h$.
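Definition 2.3 can be illustrated numerically; in this sketch (with f = exp as an arbitrary choice) the neglected remainder indeed vanishes faster than h:

```python
import math

# leading order approximation f(x0 + h) ~ f(x0) + f'(x0) h for f = exp at x0 = 0
def remainder(h):
    return math.exp(h) - (1.0 + h)   # the neglected term, of order h^2

# remainder(h)/h -> 0 as h -> 0, i.e. the neglected part vanishes faster than h
ratios = [abs(remainder(h)) / h for h in (1e-1, 1e-2, 1e-3)]
print(ratios)
```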
Remark (Vector valued case). Taylor expansion for a vector valued function $f \in (C^m(a,b))^p$ is
defined in the same way with a vector valued Taylor polynomial $P_m$; one can also consider this
the Taylor expansion of each component $f_i$ ($i = 1, \dots, p$) of f.
In the case that the argument of f is multidimensional, $f \in C^m(I)$ for $I \subset \mathbb{R}^q$ (or $f \in (C^m(I))^p$), the
Taylor expansion takes the same form with $f'$ the gradient (or Jacobian) of f, $f''$ the Hessian,
and so on.
For $x \in \mathbb{R}^q$ we will use $|x|$ to denote the Euclidean norm $|x| := \sqrt{\sum_{i=1}^q x_i^2}$.
(i) $g(t) = O(h(t))$ for $t \to 0$ iff there is a constant $C > 0$ and a $\delta > 0$ so that $|g(t)| \leq C\,|h(t)|$ for all $|t| < \delta$.
(ii) $g(t) = o(h(t))$ for $t \to 0$ iff there is a $\delta > 0$ and a function $c : (0,\delta) \to \mathbb{R}$ with $|g(t)| \leq c(|t|)\,|h(t)|$ for all $|t| < \delta$ and $c(t) \to 0$ for $t \to 0$.
Definition 2.8 (Experimental order of convergence (EOC)). The experimental order of convergence
of an approximation $z(h)$ is given by
$$p := \frac{\log \frac{|z(h_1) - z_0|}{|z(h_2) - z_0|}}{\log \frac{h_1}{h_2}} \, .$$
This can of course not be used to prove convergence, but we can get a good indication of the
convergence rate through experiments, and we can use a theoretically proven rate of convergence to
verify that an implementation is correct by comparing an experimental order of convergence with
the theoretical convergence rate. A major issue with this approach is that to compute the error one
requires knowledge of the exact solution z₀. In the example we discussed at the beginning of this
chapter, where we applied the forward Euler method to the linear mass spring problem, we know
the exact solution so could compute the error. In general we want to use approximations in complex
cases where no exact solution is available. In this case some other approach to determining the
order of convergence has to be used. In summary the EOC is a good tool to check the correctness
of an implementation, but to use it requires simplifying/modifying the problem to the point that
the exact solution is available.
Applying this to ODEs we will have z0 = Y (T ) and z(h) is the approximation at the final time
using a scheme with step size h.
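A small helper for computing the EOC from two runs might look as follows (a sketch; the function name and the synthetic test data are our own choices). Feeding it data with error exactly C·h² must return p = 2:

```python
import math

def eoc(z1, z2, z0, h1, h2):
    # experimental order of convergence from two approximations z(h1), z(h2)
    return math.log(abs(z1 - z0) / abs(z2 - z0)) / math.log(h1 / h2)

# artificial data with error exactly C*h^p for p = 2, so the EOC must be 2
z0, C, p = 1.0, 3.0, 2
h1, h2 = 0.1, 0.05
print(eoc(z0 + C * h1**p, z0 + C * h2**p, z0, h1, h2))
```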
Looking back at the errors computed for the mass spring problem we see that the error seems
to behave proportionally to h = 1/N; we already mentioned this previously. Using the concept of a
rate of convergence this means that the scheme converges linearly, i.e., with order p = 1. Let us
compute the EOC to confirm this:
i N yN |y(T ) − yN | EOC
0 101 [1.20766e+00,1.22775e+00] 3.08212e-01
1 201 [1.10139e+00,1.10595e+00] 1.46654e-01 1.07151e+00
2 401 [1.05004e+00,1.05112e+00] 7.15342e-02 1.03571e+00
3 801 [1.02485e+00,1.02511e+00] 3.53278e-02 1.01783e+00
4 1601 [1.01238e+00,1.01245e+00] 1.75552e-02 1.00891e+00
5 3201 [1.00618e+00,1.00620e+00] 8.75053e-03 1.00445e+00
2.1.2 Conditioning
We have seen a great many sources of error, and each of these has to be analysed and kept under
control, especially to avoid the accumulation of errors. Depending on the problem, the influence
of any of these errors can be severe.
Example 2.9. Consider the initial value problem for t > 0:
[Figure: sketch of solutions u(t) for t > 0.]
The next example also shows how even very small errors in some part of the algorithm can strongly
influence the error in the solution:
Example 2.10. Consider the linear system of equations
$$\begin{pmatrix} 1.2969 & 0.8648 \\ 0.2161 & 0.1441 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0.86419999 \\ 0.14400001 \end{pmatrix} =: b \, .$$
The exact solution is $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0.9911 \\ -0.4870 \end{pmatrix}$.
Due to some error, we obtain
$$\bar{b} = \begin{pmatrix} 0.8642 \\ 0.1440 \end{pmatrix}$$
instead of the exact right hand side. The relative errors in the first and second component are merely
$\frac{|0.86419999 - 0.8642|}{0.86419999} = 1.15 \cdot 10^{-8}$ and $\frac{|0.14400001 - 0.1440|}{0.14400001} = 6.94 \cdot 10^{-8}$. So the error is quite small.
But the solution to the new problem is $\begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \end{pmatrix} = \begin{pmatrix} 2 \\ -2 \end{pmatrix}$, which means that the error in the
solution to the linear system of equations is more than 100%.
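This amplification is easy to reproduce. The following sketch (pure Python, names our own) solves both systems with Cramer's rule and shows the tiny right hand side perturbation moving the solution from (0.9911, −0.4870) to (2, −2):

```python
def solve_2x2(a11, a12, a21, a22, b1, b2):
    # Cramer's rule; adequate here since we only want to illustrate conditioning
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

A = (1.2969, 0.8648, 0.2161, 0.1441)
x = solve_2x2(*A, 0.86419999, 0.14400001)  # exact right hand side b
xbar = solve_2x2(*A, 0.8642, 0.1440)       # perturbed right hand side
print(x)     # close to (0.9911, -0.4870)
print(xbar)  # close to (2, -2)
```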
The amplification of errors, as shown in the previous example, is characterized by the conditioning
of the problem.
Definition 2.11. A problem is said to be well conditioned if small errors in the data lead to small
errors in the solution, and badly conditioned otherwise. We will provide two different notions of
conditioning for the problem of computing the value f(x₀) for a given function f : U → Rⁿ and
given data x₀ ∈ U, where U is an open subset of R^m. We call this the problem (f, x₀).
Example 2.12. The solution to the linear system Ay = b is the problem (f, x₀) where $f(x) = A^{-1}x$
and x₀ = b. The problem given in Example 2.10 was apparently badly conditioned.
Theorem 2.13. Let $x_0 = (x_1, \dots, x_m) \in U$ and let $x_0 + \Delta x \in U$ be some perturbation of the
data with $|\Delta x| \ll 1$. If $f : U \to \mathbb{R}^n$ is once continuously differentiable then the error $\Delta f_i(x_0) =
f_i(x_0 + \Delta x) - f_i(x_0)$ ($i = 1, \dots, n$) in the evaluation of $f_i$ is up to leading order equal to
$$\sum_{j=1}^{m} \frac{\partial f_i}{\partial x_j}(x_0)\, \Delta x_j \, .$$
$$K_{rel} = \|f'(x_0)\| \, \frac{\|x_0\|}{\|f(x_0)\|} \, .$$
Example 2.19 (Conditioning for linear systems of equations). Consider the linear system Ax = b,
i.e., the problem (f, b) with $f(x) = A^{-1}x$. Thus we have $f'(x) = A^{-1}$ and
$$K_{abs} = \|A^{-1}\| := \sup_{y \neq 0} \frac{\|A^{-1}y\|}{\|y\|} \, .$$
Using the properties of the induced matrix norm $\|A^{-1}\|$ we compute
$$K_{rel} = \|A^{-1}\| \, \frac{\|b\|}{\|A^{-1}b\|} = \|A^{-1}\| \, \frac{\|Ax\|}{\|x\|} \leq \frac{\|A^{-1}\| \cdot \|A\| \cdot \|x\|}{\|x\|} = \|A^{-1}\| \cdot \|A\| \, .$$
Since there is an $x \in \mathbb{R}^m$ with $\|Ax\| = \|A\| \, \|x\|$, the number $\|A^{-1}\| \cdot \|A\|$ is a good estimate for
the condition number of the problem (f, b).
Consider the matrix A from Example 2.10. We can show that $\|A^{-1}\| \, \|A\| \approx 10^9$, which shows
that the problem is badly conditioned.
The following section describes in detail how numbers are represented on a computer and how
arithmetic operations are performed. That section also includes some more examples showing the
issue of cancellation.
Usually one uses the notation $\pm a = \underbrace{0.m_1 \dots m_r}_{\text{mantissa } M} \cdot\, b^{\pm E}$ with exponent
$E = e_{s-1} b^{s-1} + \dots + e_0 b^0$ and $m_i, e_i \in \{0, \dots, b-1\}$, $E \in \mathbb{N}$. For normalization purposes one assumes that $m_1 \neq 0$ if
$a \neq 0$.
For given (b, r, s) let the set A = A(b, r, s) denote all real numbers a ∈ R with the representation (∗).
To store a number $a \in D = [-a_{max}, -a_{min}] \cup \{0\} \cup [a_{min}, a_{max}]$ a mapping from D to A(b, r, s) is
defined: $fl : D \to A$ with $|fl(a) - a| = \min_{\hat{a} \in A} |\hat{a} - a|$.
Remark. The floating point representation allows storing real numbers of very different magnitude,
e.g., the speed of light $c \approx 0.29998 \cdot 10^9 \,\mathrm{m/s}$ or the electron mass $m_0 \approx 0.911 \cdot 10^{-30} \,\mathrm{kg}$.
We usually use the decimal system with b = 10, while on computers a binary representation is used
with b = 2. The constants r, s ∈ N depend on the architecture of the computer and the desired
accuracy.
Lemma 2.21. The set A(b, r, s) is finite. Its largest and smallest positive elements are
$a_{max} = (1 - b^{-r}) \cdot b^{b^s - 1}$ and $a_{min} = b^{-b^s}$, respectively.
Proof. Left as an exercise.
Remark. Usually for $a \in (-a_{min}, a_{min})$ one defines fl(a) = 0 (“underflow”). If $|a| > a_{max}$ (“overflow”)
many programs set a = NaN (not a number) and the computation has to be terminated.
Theorem 2.22 (Rounding errors). The absolute error is given by
$$|a - fl(a)| \leq \frac{1}{2}\, b^{-r} \cdot b^E \, ,$$
where E is the exponent of a. For the relative error caused by fl(a) for a ≠ 0 the estimate
$$\frac{|fl(a) - a|}{|a|} \leq \frac{1}{2}\, b^{-r+1}$$
holds.
Definition 2.23. The machine epsilon $\varepsilon_M := \frac{1}{2} b^{-r+1}$ is the difference between 1 and the next
larger representable number.
Defining $\varepsilon := \frac{fl(a) - a}{a}$, one has $fl(a) = a + \varepsilon a = a(1 + \varepsilon)$ and $|\varepsilon| \leq \varepsilon_M$.
Proof (Theorem 2.22). In the worst case fl(a) will differ from a by half a unit in the last position
of the mantissa of a: $|a - fl(a)| \leq \frac{1}{2} b^{-r} b^E$.
Since we are assuming a normalized representation ($m_1 \neq 0$) it follows that $|a| \geq b^{-1} b^E$ and
therefore
$$\frac{|fl(a) - a|}{|a|} \leq \frac{\frac{1}{2} b^{-r} b^E}{b^{-1} b^E} = \frac{1}{2}\, b^{-r+1} \, .$$
Example 2.24 (IEEE format). A usual format is the IEEE format. It provides standards for
single and double precision floating point numbers. A double precision number is stored using 64
bits (8 bytes):
$$x = \pm m\, 2^{c - 1022} \, .$$
One bit is used to store the sign. 52 bits are used for the mantissa $m = 2^{-1} + m_2 2^{-2} + \dots + m_{53} 2^{-53}$
(the first position is one due to normalization). The characteristic $c = c_0 2^0 + \dots + c_{10} 2^{10} \in [1, 2046]$
can be stored in the remaining 11 bits. Here $m_i, c_i \in \{0, 1\}$. By storing the exponent in the form
c − 1022, i.e., without a sign, the range of numbers is doubled. The two excluded cases c = 0
and c = 2047 are used to store x = 0 and NaN, respectively. We have $a_{max} = 2^{1024} \approx 1.8 \cdot 10^{308}$,
$a_{min} = 2^{-1022} \approx 2.2 \cdot 10^{-308}$, and $\varepsilon_M = \frac{1}{2}\, 2^{-52} \approx 10^{-16}$.
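The double precision parameters can be probed directly. Note that Python's `sys.float_info.epsilon` reports the gap 2⁻⁵² between 1 and the next representable number, which is twice the rounding bound ε_M defined above:

```python
import sys

# halve eps until adding it to 1.0 no longer changes the result
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
print(eps)  # 2.220446049250313e-16 == 2**-52
```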
Definition 2.25 (Machine operations). Each basic operation $\star \in \{+, -, \times, /\}$ is replaced by a
machine operation $\circledast$. In general
$$a \circledast b = fl(a \star b) = (a \star b)(1 + \varepsilon)$$
with $|\varepsilon| \leq \varepsilon_M$.
Remark. The operation $\circledast$ is not associative or distributive.
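The lost associativity shows up already with the simplest decimal fractions, none of which are exactly representable in binary:

```python
# (a + b) + c and a + (b + c) round differently in double precision
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left, right)  # 0.6000000000000001 vs 0.6
```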
Example 2.26 (Loss of significance). In this example we use b = 10 and r = 6. We study the
problem (f, x₀) with
$$f(x) = \sqrt{x+1} - \sqrt{x} \, , \qquad x_0 = 100 \, .$$
As x gets large, $\sqrt{x+1}$ and $\sqrt{x}$ are of very similar magnitude and subtracting the two values
is ill conditioned, as we already saw. Assume that we can compute the square roots also up to
six decimals: $fl(\sqrt{101}) = 0.100499 \cdot 10^2$ and $fl(\sqrt{101}) \ominus fl(\sqrt{100}) = 0.499000 \cdot 10^{-1}$ instead of
$fl(\sqrt{101} - \sqrt{100}) = 0.498756 \cdot 10^{-1}$. So we have lost 3 significant figures from the available 6.
Rewriting f in the form
$$f(x) = \frac{(\sqrt{x+1} - \sqrt{x})(\sqrt{x+1} + \sqrt{x})}{\sqrt{x+1} + \sqrt{x}} = \frac{1}{\sqrt{x+1} + \sqrt{x}}$$
removes the problem of loss of significance because adding $\sqrt{x+1}$ and $\sqrt{x}$ is well conditioned:
$fl(\sqrt{101}) \oplus fl(\sqrt{100}) = 0.200499 \cdot 10^2$ and $1 \oslash (0.200499 \cdot 10^2) = 0.498755 \cdot 10^{-1}$.
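The six-digit arithmetic can be mimicked with Python's decimal module (with correct rounding the last digit of the stable variant comes out as 0.0498756):

```python
from decimal import Decimal, getcontext

getcontext().prec = 6          # mimic b = 10, r = 6 from the example
s101 = Decimal(101).sqrt()     # 10.0499
s100 = Decimal(100).sqrt()     # 10 (exact)
naive = s101 - s100            # cancellation: only three figures survive
stable = 1 / (s101 + s100)     # rewritten form: no cancellation
print(naive, stable)
```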
Observe that:
$$fl\Big(\frac{z}{\varepsilon_M}\Big) = \frac{fl(z)}{fl(\varepsilon_M)} \leq \frac{z(1+\delta)}{\varepsilon_M} = z\, 2^{23} + \delta z\, 2^{23}$$
in single precision floating point. So the error term is magnified. It is possible to compensate for
this:
Example 2.27. Let
$$f(x) = \frac{\log(x+1)}{x}$$
and consider x ≈ 0. Then
$$\lim_{x \to 0} f(x) = 1 \, .$$
n 1 2 3 4 5 6 7 8
(II) 0.33333 0.11111 0.03704 0.01235 0.00412 0.00137 0.00046 0.00015
(I) 0.33333 0.1111 0.03699 0.01216 0.00337 −0.00161 −0.01147 −0.04755
The exact value is 0.00015 in the obtainable precision. Even with r = 8 we obtain 0.00015242
with (II) but 0.00010407 with (I); the exact value is 0.00015242. The corresponding relative
errors are approximately 0.27 · 10⁻⁶ and 0.31.
Overall stability means that errors in previous steps are not amplified.
Example 2.31 (Error amplification). We want to compute the integrals $I_k := \int_0^1 \frac{x^k}{x+5}\, dx$.
(A) Observe
$$I_0 = \ln(6) - \ln(5)$$
and
$$I_k + 5 I_{k-1} = \frac{1}{k} \quad (k \geq 1) \, , \quad \text{since} \quad \int_0^1 \frac{x^k}{x+5}\, dx + 5 \int_0^1 \frac{x^{k-1}}{x+5}\, dx = \int_0^1 x^{k-1}\, dx = \frac{1}{k} \, .$$
Here we use $\bar{I}_k$ to denote the computed value taking rounding errors into account. Obviously
$I_k$ is monotonically decreasing, and $I_k \searrow 0$ ($k \to \infty$), but this is not observed for the computed
values; we even have $\bar{I}_4 < 0$. On a standard PC we found: $\bar{I}_{21} = -0.158 \cdot 10^{-1}$ and
$\bar{I}_{39} = 8.960 \cdot 10^{10}$.
This is a typical example of error accumulation. In the scheme the error in $I_{k-1}$ is amplified
by the factor 5 to compute $I_k$.
(B) If one computes the values for $I_k$ exactly, one observes that $I_9 = I_{10}$ up to the first three
decimals. Using the backwards iteration $I_{k-1} = \frac{1}{5}\big(\frac{1}{k} - I_k\big)$ we obtain
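The two recursions can be compared in a few lines (a sketch; the starting index 60 for the backward recursion is an arbitrary choice, exploiting that the error of the crude guess I₆₀ ≈ 0 is damped by a factor 5 per step):

```python
import math

def forward_rec(n):
    # I_k = 1/k - 5*I_{k-1}, starting from I_0 = ln(6) - ln(5): errors grow by 5 each step
    I = math.log(6) - math.log(5)
    for k in range(1, n + 1):
        I = 1.0 / k - 5.0 * I
    return I

def backward_rec(n, start=60):
    # I_{k-1} = (1/k - I_k)/5, starting from the crude guess I_start = 0
    I = 0.0
    for k in range(start, n, -1):
        I = (1.0 / k - I) / 5.0
    return I

print(forward_rec(30))   # garbage: the rounding error has been amplified enormously
print(backward_rec(30))  # accurate small positive value
```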
Example 2.32 (Computing the solution to a quadratic equation). Consider the quadratic equation
$$y^2 - py + q = 0$$
for $p, q \in \mathbb{R}$ with $0 \neq q < \frac{p^2}{4}$. The two solutions are $y_{1,2} = y_{1,2}(p,q) = \frac{p}{2} \pm \sqrt{\frac{p^2}{4} - q}$. Also $p = y_1 + y_2$
and $q = y_1 y_2$. From this we can conclude that $\partial_p y_1 + \partial_p y_2 = 1$ and $y_2 \partial_p y_1 + y_1 \partial_p y_2 = 0$. Therefore,
$$\partial_p y_1 = \frac{y_1}{y_1 - y_2} \, , \qquad \partial_p y_2 = -\frac{y_2}{y_1 - y_2} \, .$$
From this we can also conclude that $\partial_q y_1 + \partial_q y_2 = 0$ and $y_2 \partial_q y_1 + y_1 \partial_q y_2 = 1$. Therefore,
$$\partial_q y_1 = \frac{1}{y_2 - y_1} \, , \qquad \partial_q y_2 = \frac{1}{y_1 - y_2} \, .$$
The relative condition numbers of $y_1(p,q)$ are hence
$$k_{1,p} = \frac{p}{y_1}\, \partial_p y_1 = \frac{1 + y_2/y_1}{1 - y_2/y_1} \, , \qquad k_{1,q} = \frac{q}{y_1}\, \partial_q y_1 = -\frac{y_2/y_1}{1 - y_2/y_1} \, .$$
Similar results can be obtained for the condition numbers $k_{2,p}$ and $k_{2,q}$ for $y_2(p,q)$. This shows
that the computation of the roots is badly conditioned if the two roots are close together, i.e., $\frac{y_2}{y_1}$
is close to one.
For $|\frac{y_2}{y_1}| \ll 1$ the problem is well conditioned. We could employ the following algorithm to compute
the results:
$$u = \frac{p^2}{4} \, , \qquad v = u - q \, , \qquad w = \sqrt{v} \, .$$
For p < 0 we should first compute $y_2 = \frac{p}{2} - w$ to avoid cancellation effects. For the second root
we can use different approaches:
$$\text{(I)} \quad y_1 = \frac{p}{2} + w \qquad \qquad \text{(II)} \quad y_1 = \frac{q}{y_2} \, .$$
2
For q p4 we have w ≈ p2 and (I) is prone to cancellation effects. Errors made in p and w are
carried over to y1 :
∆y1 • 1 ∆p 1 ∆w
≤
y1 1 + 2w/p p 1 + p/2w w .
+
2
Both factors are much greaten than one since q p4 . The method (II) is on the other hand stable:
∆y1 • ∆q ∆y2
≤
y1 q y2 .
+
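A sketch of the stable algorithm in double precision (the test values p = 10⁸, q = 1, giving roots of roughly 10⁸ and 10⁻⁸, are our own illustrative choice):

```python
import math

def quadratic_roots(p, q):
    # roots of y^2 - p*y + q = 0, assuming real roots (q < p*p/4)
    w = math.sqrt(p * p / 4.0 - q)
    # compute the larger-magnitude root first: no cancellation there ...
    y_big = p / 2.0 + w if p >= 0 else p / 2.0 - w
    # ... and recover the other root from q = y1 * y2 (variant (II))
    return y_big, q / y_big

p, q = 1e8, 1.0
naive = p / 2.0 - math.sqrt(p * p / 4.0 - q)   # variant (I): cancellation
y_big, y_small = quadratic_roots(p, q)
print(naive, y_small)  # naive is off by roughly 25%, the stable value is accurate
```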
and suitable right hand side it is possible to reduce this problem to a homogeneous, first order
ODE of the form
$$y'(t) = f(y(t))$$
and this is the type of problem we are going to study throughout this lecture (or its non-homogeneous
counterpart $y'(t) = f(y(t), t)$, although this is equivalent after introducing another
dependent variable satisfying the ODE $\tau' = 1$).
We will consequently assume that $y(t) = (y_i(t))_{i=1}^m$ for some given m ≥ 1. To make the problem
well posed we also need to provide an initial value, so we will assume that some $y_0 \in \mathbb{R}^m$ is given
and will look for solutions to the initial value problem
Theorem 2.33 (Picard’s Theorem). If f and $\frac{\partial f}{\partial y}$ are continuous in a closed rectangle
$$R = \{(t, y) \,|\, a_1 \leq t \leq a_2, \; b_1 \leq y \leq b_2\}$$
and if $(t_0, y_0)$ is an interior point of R, then (∗) has a unique solution y = g(t) which passes
through $(t_0, y_0)$.
Sketch of Proof. Requires complete metric spaces and the Banach fixed point theorem. By assumption,
$|f(t,y)| \leq K$ and $\left|\frac{\partial f}{\partial y}\right| \leq L$. It follows that
$$|f(t, y_1) - f(t, y_2)| \leq L\, |y_1 - y_2|$$
for all $(t, y_1)$ and $(t, y_2) \in R$, so we have the Lipschitz condition. Replace (∗) by an integral equation.
If y = g(t) satisfies (∗) then by integrating:
$$g(t) = y_0 + \int_{t_0}^{t} f(s, g(s)) \, ds \, .$$
Choose a > 0 such that La < 1 and |t − t0 | ≤ a and |y − y0 | ≤ Ka. Let X be the set of all
continuous functions y = g(t) on |t − t0 | ≤ a with |g(t) − y0 | ≤ Ka, and so X is a complete
metric space. Define a mapping T of X into itself by
Z t
T g = h, h(t) = y0 + f (s, g(s))ds (|h(t) − y0 | ≤ Ka)
t0
Furthermore,
Z t
|h1 (t) − h2 (t)| = [f (s, g1 (s)) − f (s, g2 (s))]ds ≤ La sup |g1 (t) − g2 (t)|.
t0
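The proof is constructive: iterating the integral operator T (Picard iteration) converges to the solution. A minimal sketch, assuming a scalar f and approximating the integral with the trapezoidal rule on a fixed grid (the grid size and iteration count are ad hoc choices):

```python
def picard(f, t0, y0, a, n=200, iterations=20):
    # Picard iteration g <- T g with (T g)(t) = y0 + int_{t0}^{t} f(s, g(s)) ds,
    # the integral evaluated with the trapezoidal rule on n subintervals
    h = a / n
    ts = [t0 + i * h for i in range(n + 1)]
    g = [y0] * (n + 1)          # initial guess: the constant function y0
    for _ in range(iterations):
        new, integral = [y0], 0.0
        for i in range(n):
            integral += 0.5 * h * (f(ts[i], g[i]) + f(ts[i + 1], g[i + 1]))
            new.append(y0 + integral)
        g = new
    return ts, g
```

For f(t, y) = y with y(0) = 1 on [0, 1] the final grid value approximates e, up to the quadrature error of the trapezoidal rule.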
for some τ ∈ [0, 1]. That this is a reasonable approximation can be seen using Taylor expansion, assuming Y ∈ C³:

Y(t+h) = Y(t+τh) + Y'(t+τh)(1−τ)h + ½ Y''(t+τh)((1−τ)h)² + ⅙ Y'''(χ_0)((1−τ)h)³

and

Y(t) = Y(t+τh) − Y'(t+τh)τh + ½ Y''(t+τh)(τh)² − ⅙ Y'''(χ_1)(τh)³ .

Therefore

(Y(t+h) − Y(t))/h = Y'(t+τh) + ½ Y''(t+τh) h (1−2τ) + O(h²) .
In the case that τ = ½, i.e., we are aiming at approximating the time derivative exactly in the middle of the interval [t, t+h], the Y'' term drops out and we are left with

Y'(t + h/2) = (Y(t+h) − Y(t))/h + O(h²) .

If we are not approximating in the middle of the interval, i.e., τ ≠ ½, the second order term is present and we end up with

Y'(t + τh) = (Y(t+h) − Y(t))/h + O(h) .
h
This type of superconvergence in some points is a typical property of many finite difference approximations to derivatives, i.e., they have a higher convergence rate in some isolated points than in the rest of the domain.
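The different convergence rates for τ = 0 and τ = ½ are easy to observe numerically. A small sketch using Y(t) = e^t as test function (an arbitrary smooth choice) and estimating the order from two step sizes:

```python
import math

def fd_error(tau, h):
    # error of the difference quotient (Y(t+h) - Y(t))/h as an
    # approximation of Y'(t + tau*h), for Y(t) = exp(t) at t = 0
    t = 0.0
    approx = (math.exp(t + h) - math.exp(t)) / h
    return abs(approx - math.exp(t + tau * h))

def observed_order(tau):
    # if the error behaves like C*h^r, halving h reveals r
    e1, e2 = fd_error(tau, 1e-2), fd_error(tau, 5e-3)
    return math.log(e1 / e2, 2)
```

Evaluating `observed_order(0.0)` gives a value close to 1 while `observed_order(0.5)` gives a value close to 2, matching the analysis above.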
A finite difference approximation can be used to compute an approximation y_n to the exact solution Y at a point in time t_{n+1} = t_n + h given approximations to Y at some earlier points in time. For example, taking τ = 0 and t = t_n,

f(t_n, Y(t_n)) = Y'(t_n) = (Y(t_{n+1}) − Y(t_n))/h + O(h) .

Replacing Y(t_n) by y_n and Y(t_{n+1}) by y_{n+1} and ignoring the O(h) term yields

(y_{n+1} − y_n)/h = f(t_n, y_n)

which provides an explicit formula to compute y_{n+1} given y_n:

y_{n+1} = y_n + h f(t_n, y_n) .
CHAPTER 2. GETTING STARTED 23
y_1 = y_0 + h f(t_0, y_0),
y_2 = y_1 + h f(t_1, y_1),
y_3 = y_2 + h f(t_2, y_2),
...
This is known as the forward or explicit Euler method:
Algorithm. (Forward or Explicit Euler method)

y_{n+1} = y_n + h f(t_n, y_n) .

Taking τ = 1 instead leads to an implicit method: y_{n+1} is determined as the root z of

F_n(z; y_n, t_n, h) = z − y_n − h f(t_n + h, z) ,

or, taking δ = z − y_n,

δ − h f(t_n + h, y_n + δ) = 0 .

This method is known as the backward Euler method:
Algorithm. (Backward or Implicit Euler method)

y_{n+1} = y_n + h f(t_{n+1}, y_{n+1}) .

Remark. We will see later that both the forward and backward Euler methods converge with order one. We will be discussing approaches to improve the accuracy and also why using a more complex implicit method can sometimes be a good idea.
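Both methods fit in a few lines of code. A sketch, assuming a scalar f; the implicit equation for δ is solved here by naive fixed-point iteration (which converges for h·L small) rather than the Newton method mentioned later:

```python
def forward_euler(f, t0, y0, h, n):
    # y_{k+1} = y_k + h f(t_k, y_k)
    t, y = t0, y0
    for _ in range(n):
        y = y + h * f(t, y)
        t += h
    return y

def backward_euler(f, t0, y0, h, n, fp_steps=8):
    # solve delta - h f(t+h, y+delta) = 0 by fixed-point iteration,
    # then update y_{k+1} = y_k + delta
    t, y = t0, y0
    for _ in range(n):
        delta = h * f(t + h, y)
        for _ in range(fp_steps):
            delta = h * f(t + h, y + delta)
        y = y + delta
        t += h
    return y
```

For the test problem y' = −y, y(0) = 1, both approximate e^{−1} ≈ 0.3679 at t = 1 with an O(h) error.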
Using the finite difference quotient in the middle of the interval to take advantage of the higher convergence rate is not so straightforward:

(y_{n+1} − y_n)/h = f(t_{n+1/2}, y_{n+1/2})

since y_{n+1/2} is not part of the sequence we are computing. We can either use the approximation on an interval 2h:

(y_{n+1} − y_{n−1})/(2h) = f(t_n, y_n)

which we can use to compute y_{n+1} assuming we know both y_n and y_{n−1}. This type of method is called a multistep method and is discussed later in the lecture. A second approach is to replace f(t_{n+1/2}, y_{n+1/2}) by an approximation, either

f(t_{n+1/2}, y_{n+1/2}) ≈ ½ (f(t_n, y_n) + f(t_{n+1}, y_{n+1}))

or

f(t_{n+1/2}, y_{n+1/2}) ≈ f(t_{n+1/2}, ½(y_n + y_{n+1})) .

The first approach is often called the Crank-Nicolson method while the second is called the implicit midpoint method. Both are implicit:
Algorithm. (Crank-Nicolson method)

y_{n+1} = y_n + (h/2) ( f(t_n, y_n) + f(t_{n+1}, y_{n+1}) ) .
The following algorithm provides an approximation to Y(T) given an h > 0:

t = t_0, y = y_0
While t < T
  Compute δ with δ − (h/2) f(t + h, y + δ) = 0 (e.g. using Newton’s method, discussed later)
  y = y + (h/2) f(t, y) + δ
  t = t + h

This method looks very similar to the backward Euler method but note that it does have a higher complexity, since more evaluations of f are required in each step and evaluation of f can be very expensive. On the other hand the higher convergence rate of the finite difference approximation at the midpoint could improve the overall convergence rate of the method, assuming we haven’t ruined things by approximating f(t_{n+1/2}, y_{n+1/2}).
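The time loop above can be sketched as follows, for a scalar f. As a simplification this version solves the equivalent update equation in δ = y_{n+1} − y_n by a naive fixed-point iteration instead of Newton's method, which is adequate when h·L is small:

```python
def crank_nicolson(f, t0, y0, h, n, fp_steps=30):
    # y_{k+1} = y_k + h/2 * (f(t_k, y_k) + f(t_{k+1}, y_{k+1})),
    # solved by fixed-point iteration in delta = y_{k+1} - y_k
    t, y = t0, y0
    for _ in range(n):
        fy = f(t, y)
        delta = h * fy          # forward Euler predictor as start value
        for _ in range(fp_steps):
            delta = 0.5 * h * (fy + f(t + h, y + delta))
        y += delta
        t += h
    return y
```

For y' = −y, y(0) = 1 with the fairly coarse step h = 0.1 the value at t = 1 is already within about 3·10⁻⁴ of e^{−1}, illustrating the second order accuracy.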
Remark (Alternative derivation). Starting from y'(t) = f(y(t)) we can use integration over the time interval [t_n, t_{n+1}] to derive an approximation:

y(t_{n+1}) − y(t_n) = ∫_{t_n}^{t_{n+1}} y'(t) dt = ∫_{t_n}^{t_{n+1}} f(y(t)) dt .

Now to get a numerical scheme we need to approximate the integral on the right. We will study more sophisticated methods later in the course but for now we can use an approximation based on a single point in the interval:

y(t_{n+1}) − y(t_n) = ∫_{t_n}^{t_{n+1}} f(y(t)) dt ≈ (t_{n+1} − t_n) f(y(t_n + τ(t_{n+1} − t_n))) = h f(y(t_n + τh)) ,

for some τ ∈ [0, 1]. As you will hopefully have noticed, we have rediscovered the forward Euler (τ = 0) and the backward Euler (τ = 1) methods.
E(h) := max_{0 ≤ k ≤ N} e_k .
Remark. The definition of the approximation error is not unique; or, more to the point, the way to measure the error is not unique. For example, instead of looking at the maximum error over the time interval we could study an average error or the error e_N at the final time only.
During the derivation of the schemes we replaced the time derivative with a finite difference quotient by dropping higher derivative terms in the Taylor expansion. Recall for example the formula used for the forward Euler method,

Y(t_{n+1}) = Y(t_n) + h f(t_n, Y(t_n)) + O(h²) ,

which indicates that we are introducing an error of the magnitude of h² when going from Y(t_n) to Y(t_{n+1}). Now define Ỹ(t_{n+1}) = Y(t_n) + h f(t_n, Y(t_n)) (so using the forward Euler formula but with the exact solution value at time t = t_n) and consider the solution Ỹ to the ODE with initial data Ỹ(t_{n+1}). The starting point for the Ỹ trajectory is therefore O(h²) off from the exact trajectory, which passes through (t_{n+1}, Y(t_{n+1})). Depending on the type of ODE we are considering, this gap will widen in time. Also, two sequences constructed via some numerical method applied to the two different initial values Y(t_{n+1}) (the exact value) and Ỹ(t_{n+1}) (the perturbed value) will diverge more and more from each other in each step. This is called error propagation, i.e., how errors made in previous steps influence the accuracy of the solution at later times. The stability of the numerical scheme and the error made in each step, the so called truncation error or consistency error, both play a central role in determining the approximation error of the scheme.
Forward Euler
For h > 0 let y_{n+1} = y_n + h f(y_n) be the sequence produced by the forward Euler method and t_n = nh the sequence of points in time. We also introduce the evaluation of the exact solution at these points in time, i.e., Y_n := Y(t_n), and finally we denote the error at each of these points in time by e_n := |y_n − Y_n|. Keep in mind that t_n, y_n, Y_n, e_n all depend on h. Next we introduce the local truncation error, which is conceptually the error introduced by inserting the exact solution into the numerical scheme:
Definition 2.35 (Local truncation error for forward Euler method).

τ_n := Y_{n+1} − Y_n − h f(Y_n)

Remark. From our definition we have Y_{n+1} = Y_n + h f(Y_n) + τ_n and our derivation based on Taylor expansion shows that the truncation error converges quadratically to 0, i.e., τ_n = O(h²).
Now assume that f is Lipschitz continuous (an assumption commonly used in the existence theory
of ODEs), i.e.,
|f (u) − f (v)| ≤ L|u − v| .
Using our definitions we now have

e_{n+1} = |y_n + h f(y_n) − (Y_n + h f(Y_n) + τ_n)| ≤ e_n + h|f(y_n) − f(Y_n)| + |τ_n| ≤ e_n + Lh e_n + |τ_n| = (1 + Lh) e_n + |τ_n| .
We thus conclude that going from step n to n + 1 leads to an amplification of the error e_n by 1 + Lh plus an additional O(h²) error coming from the truncation error.
We can now apply the same estimate to e_n to get

e_{n+1} ≤ (1+Lh) e_n + τ_n ≤ (1+Lh)² e_{n−1} + τ_n + (1+Lh) τ_{n−1} ≤ · · · ≤ (1+Lh)^{n+1} e_0 + Σ_{i=0}^{n} (1+Lh)^i τ_{n−i} .

Since e_0 = |y_0 − Y(0)| = |y_0 − y_0| = 0 and using τ_n ≤ Ch² where C depends neither on h nor on n (it depends on the second derivative of Y over the whole time interval) we get

e_{n+1} ≤ Ch² Σ_{i=0}^{n} (1+Lh)^i .

Using the geometric sum Σ_{i=0}^{n} (1+Lh)^i = ((1+Lh)^{n+1} − 1)/(Lh) together with (1+Lh)^{n+1} ≤ e^{L(n+1)h} ≤ e^{LT} this yields e_{n+1} ≤ Ch (e^{LT} − 1)/L, i.e., the error converges linearly to 0.
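The predicted first order convergence is easy to confirm experimentally. A sketch for the test problem Y' = Y, Y(0) = 1 on [0, 1] (an arbitrary choice with known exact solution Y(t) = e^t):

```python
import math

def forward_euler_max_error(h):
    # E(h) = max_k |y_k - Y(t_k)| for Y' = Y, Y(0) = 1 on [0, 1]
    n = round(1.0 / h)
    y, err = 1.0, 0.0
    for k in range(n):
        y += h * y                        # forward Euler step
        err = max(err, abs(y - math.exp((k + 1) * h)))
    return err

# halving h should roughly halve the error (observed order ~ 1)
e1 = forward_euler_max_error(0.01)
e2 = forward_euler_max_error(0.005)
rate = math.log(e1 / e2, 2)
```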
Backward Euler
The analysis for the backward Euler method is almost identical to the above. In this case y_{n+1} = y_n + h f(y_{n+1}) and τ_n := Y_{n+1} − h f(Y_{n+1}) − Y_n, so that

e_{n+1} ≤ e_n + h|f(y_{n+1}) − f(Y_{n+1})| + |τ_n| ≤ e_n + hL e_{n+1} + |τ_n| .
Since we are interested in h → 0 we can assume that hL ≤ 1 − ε for some ε ∈ (0, 1); then rearranging terms leads to

e_{n+1} ≤ e_n/(1 − hL) + τ_n/(1 − hL) ≤ · · · ≤ e_0/(1 − hL)^{n+1} + Σ_{i=0}^{n} τ_{n−i}/(1 − hL)^{i+1} ,

where β = 1/(1 − hL). We have β − 1 = hL/(1 − hL) and also

β = 1 + hL/(1 − hL) ≤ 1 + hL/ε

since we assumed that hL ≤ 1 − ε. Using that 1 + x ≤ e^x we finally have β ≤ e^{hL/ε} and thus, using again h = T/N and n ≤ N:

β^n ≤ e^{nhL/ε} ≤ e^{TL/ε} .
Putting it all together we have shown:
Theorem 2.37 (Convergence of the backward Euler method). If the exact solution Y ∈ C² and the right hand side f is Lipschitz continuous with Lipschitz constant L then the approximation error converges linearly to 0 for h → 0. For h ≤ (1 − ε)/L for some ε ∈ (0, 1) the error is bounded by

E(h) ≤ (max_{0 ≤ i ≤ N} τ_i) (1 − hL)/(hL) (exp(TL/ε) − 1) = O(h) .
For the linear problem y' = Ay the two Euler methods read

y_{n+1} = (I + hA) y_n   (forward Euler),
(I − hA) y_{n+1} = y_n   (backward Euler).

In the scalar case Y'(t) = λY(t) with λ ∈ C and Re λ < 0 the exact solution satisfies

|Y(t)| = e^{(Re λ) t} |Y_0| ,

and so
1. |Y(t)| is monotonically decreasing and converges to 0,
2. if λ ∈ R then Y(t) is strictly monotonically decreasing if Y_0 > 0 and strictly monotonically increasing if Y_0 < 0. Consequently, Y(t) has the same sign as Y_0 for all t.
Now one can ask the question: under which conditions does the sequence (y_n)_n behave in the same way?
Forward Euler
As we saw above,

y_{n+1} = (1 + hλ)^{n+1} Y_0

and thus |y_n| = |1 + hλ|^n |Y_0|. To get this to converge to zero requires |1 + hλ| < 1; since Re λ < 0 this will hold for h sufficiently small but not for too large values of h. In fact

1 > |1 + hλ| = √( (1 + h Re λ)² + h² (Im λ)² )

and thus

h < 2 |Re λ| / |λ|² ,

recalling that our assumption was that Re λ < 0. There are two interesting cases here:
1. Im λ = 0: the condition reduces to h < 2/|Re λ| or, perhaps easier to remember, h|λ| < 2.
2. Re λ → 0: in this case the upper bound for the step size tends to zero, so for purely imaginary λ the condition cannot be satisfied by any h > 0. On the other hand the exact solution satisfies |Y(t)| = |Y_0| (the origin is a center), so aiming for |y_n| → 0 does not make sense. But since |y_n| = |1 + hλ|^n |Y_0| the issue is not only that the discrete approximation does not converge to zero but in fact it grows monotonically without bounds. This is in fact the setting of our mass spring system from the introduction, where our experiments showed that the right long time behaviour is not achievable with the forward Euler method.
Backward Euler
Doing the same for the backward Euler scheme we arrive at (1 − hλ) y_{n+1} = y_n, or

y_{n+1} = Y_0 / (1 − hλ)^{n+1} .

We again focus on the stable case, i.e., Re λ < 0. Now |y_n| → 0 if and only if |1 − hλ| > 1, which holds for any value of h > 0, again using that Re λ < 0. Note that the condition is satisfied for any h in the case Re λ ≤ 0 (in fact even for quite a lot of the right half of the complex plane); in the purely imaginary case the condition is still satisfied for any h > 0, which means that |y_n| → 0 although the exact solution is a center with |Y(t)| = |Y_0|. But we already saw this in the introduction, where our simulations showed that for the mass spring system the implicit Euler method always leads to approximations that converge to zero.
A way to visualize the stability property of the two schemes is to plot the stability region in the complex plane, i.e., for the forward Euler method all complex values z = hλ such that |1 + z| < 1; for the implicit Euler method we shade the left half plane (although, as pointed out, most of the right half plane should also be shaded):
[Figure: stability regions in the complex z = hλ plane, real axis from −3 to 2 and imaginary axis from −2 to 2: the disk |1 + z| < 1 for the forward Euler method (left) and the shaded left half plane for the backward Euler method (right).]
Remark. The stability concept described so far is referred to as absolute stability. A more formal discussion will be carried out later in the lecture. Methods like the backward Euler method which are stable for all step sizes h are called unconditionally (absolutely) stable or simply A-stable, while methods like the forward Euler method which require h to be small enough for stability are called conditionally stable.
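The stability conditions reduce to checking the per-step amplification factors, which is easy to do numerically (a sketch; the λ values below are arbitrary test choices):

```python
def fe_factor(h, lam):
    # forward Euler amplification per step: |1 + h*lam|
    return abs(1 + h * lam)

def be_factor(h, lam):
    # backward Euler amplification per step: 1/|1 - h*lam|
    return abs(1 / (1 - h * lam))

lam = -10.0
# forward Euler is stable only for h < 2/|lam| = 0.2 here
assert fe_factor(0.1, lam) < 1 and fe_factor(0.3, lam) > 1
# backward Euler is stable for every step size
assert all(be_factor(h, lam) < 1 for h in (0.1, 0.3, 10.0))
# purely imaginary lam: forward Euler grows, backward Euler decays
z = complex(0.0, 5.0)
assert fe_factor(0.1, z) > 1 and be_factor(0.1, z) < 1
```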
We have so far focused on the limiting behaviour of y_n for n → ∞ in the case of linear problems with a stable fixed point at the origin. As pointed out above, if λ ∈ R and λ < 0 then the exact solution is monotonically decreasing or increasing depending on the sign of the initial condition Y_0 ∈ R. We will now look at conditions for h which guarantee that this behaviour carries over to the sequence (y_n)_n when using the forward or backward Euler methods.
Starting with the forward Euler method we have y_n = (1 − h|λ|)^n Y_0, where now λ is real and negative. We can assume for simplicity that Y_0 > 0; then we want to find a step size condition so that 0 < y_{n+1} < y_n, which is equivalent to 0 < 1 − h|λ| < 1, as can easily be seen. Since h|λ| > 0 the second condition is always satisfied, while the first condition requires h|λ| < 1. It is worth now comparing this to the condition for absolute stability derived previously, which in the current setting is |1 − h|λ|| < 1, equivalent to −1 < 1 − h|λ| < 1. So, as is to be expected, achieving monotonicity requires a harder restriction on the time step than absolute stability did: for |y_n| → 0 we require h < 2/|λ| while monotonicity requires h < 1/|λ|.
Turning our attention to the backward Euler method with y_n = Y_0/(1 + h|λ|)^n we see that monotonicity always holds since 0 < 1/(1 + h|λ|) < 1 for any step size h > 0. This behaviour is again unconditional, in line with the A-stability of the method.
Chapter 3
Some Aspects of Mathematical Modelling
In this chapter we first discuss different techniques of how mathematical models are arrived at, and
then discuss aspects of simplifying the models that make them easier to handle mathematically.
In the final part we provide some standard examples of mathematical models arising in different
areas of application.
These rate equations form a system of first order ODEs for the reactants A, B, C, D, . . . which have to be combined with initial values for each reactant A_0, B_0, . . . . There are the following elementary reactions to distinguish between:
• Constant supply: compound A is added to the system at a constant rate

∅ →[k] A   ⟹   (d/dt) A = k .

This looks like A is created out of thin air, which can’t really happen. What it means is that there is a bunch of stuff we are not modelling but that leads to a source of A.
• Decay: substance A transforms into waste at rate k (i.e. A decomposes and is removed from the system but the decay does not depend on any other reactants)

A →[k] ∅   ⟹   (d/dt) A = −kA .

Again this is extending the mathematical model to include something we are not accurately modelling.
CHAPTER 3. SOME ASPECTS OF MATHEMATICAL MODELLING 32
• Transformation: substance A transforms into substance B at rate k

A →[k] B   ⟹   (d/dt) A = −kA ,   (d/dt) B = kA .

• Reversible transformation: A transforms into B and vice versa. Such reactions should be explicitly expanded into separate forward and reverse reactions:

A ⇌[k_1, k_2] B   ⟹   (d/dt) A = −k_1 A + k_2 B ,   (d/dt) B = k_1 A − k_2 B .

This is equivalent to the two transformations A →[k_1] B, B →[k_2] A.
• Multiple products: A and B combine to form C

A + B →[k] C   ⟹   (d/dt) A = −kAB ,   (d/dt) B = −kAB ,   (d/dt) C = kAB .

Although this looks fine it does not tell the whole story: assume that A is equal to B, so the reaction is A + A →[k] C. The reaction speed is kA², but the correct ODEs can not simply be (d/dt) A = −kA², (d/dt) C = kA², since this implies that the amount of C being produced is the same as the amount of A being destroyed; but we need two units of A to produce one C, so we will need to redefine the ODE a bit.
Assume n units of A and m units of B react to produce p units of C and q units of D. The problem described above is now resolved by defining the ODE to be

nA + mB →[k] pC + qD   ⟹
(d/dt) A = −nkA^nB^m ,  (d/dt) B = −mkA^nB^m ,  (d/dt) C = pkA^nB^m ,  (d/dt) D = qkA^nB^m .

Now we would correctly get (d/dt) A = −2kA², (d/dt) C = kA² for the A + A →[k] C reaction.
Stoichiometry: the factors −n, −m, +p, +q are called stoichiometries of the reaction, and the actual factors in the ODE are always the product of the stoichiometry and the reaction speed (kA^nB^m in the above).
Remark: sometimes reactions like this are written as A + B →[k] AB. In the mathematical model one should use a new letter, e.g. C, to denote the compound AB, which is not A times B but a third reactant of the system!
(we are using i for the components and j for the reactions). The reaction network leads to the system of ODEs:

(d/dt) y_i = Σ_{j=1}^{n} k_j (b_{ij} − a_{ij}) Π_{l=1}^{m} y_l^{a_{lj}} ,   i = 1, · · · , m .
With this notation the system of ODEs can be written in the compact form

(d/dt) y = Γ w(y) .
Note that although this looks like a linear ODE, it is not one because in general w : Rm → Rn is
a nonlinear function.
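The compact form translates directly into code. A small sketch, using the reversible transformation A ⇌ B (with hypothetical rates k_1 = 2, k_2 = 1) as test network and forward Euler for the time stepping; note that the scheme conserves A + B up to rounding since the two rows of Γ sum to zero:

```python
def mass_action_rhs(gamma, w):
    # right hand side of y' = Gamma w(y) for a reaction network:
    # gamma is a list of rows (one per species), w maps y to the
    # vector of reaction speeds
    def rhs(y):
        rates = w(y)
        return [sum(g * r for g, r in zip(row, rates)) for row in gamma]
    return rhs

k1, k2 = 2.0, 1.0
gamma = [[-1.0, 1.0],   # row for A: -k1*A + k2*B
         [1.0, -1.0]]   # row for B: +k1*A - k2*B
rhs = mass_action_rhs(gamma, lambda y: [k1 * y[0], k2 * y[1]])

y, h = [1.0, 0.0], 0.001      # initial values A = 1, B = 0
for _ in range(5000):         # forward Euler up to t = 5
    f = rhs(y)
    y = [yi + h * fi for yi, fi in zip(y, f)]
```

By t = 5 the state is close to the equilibrium (A, B) = (1/3, 2/3) and the total A + B is still 1.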
Let’s verify that this form fits the ODEs for elementary reactions correctly:
• Constant supply: m = 1, n = 1, a_{11} = 0, b_{11} = 1, k_1 = k:

(d/dt) y_1 = k(1 − 0) y_1^0 = k
and

2AB →[k] B

and finally B is produced out of thin air:

∅ →[l] B .

First we will use C to denote AB and we remember to think of this as being four separate reactions, so m = 3, n = 4, i.e., we count the reversible reaction as two separate ones.
Starting with the equations for A: A is involved in two of the four reactions. In the first, 2A + B →[k_+] C, the stoichiometry is −2 and the speed is k_+A²B, while in the second the stoichiometry is 2 and the speed is k_−C:

A' = −2k_+A²B + 2k_−C .
Now B is involved in all four reactions: in the first we have stoichiometry −1 and speed k_+A²B, in the second stoichiometry 1 and speed k_−C, in the third stoichiometry 1 and speed kC², and finally in the last one stoichiometry 1 and speed l:

B' = −k_+A²B + k_−C + kC² + l .

Similarly, for C:

C' = k_+A²B − k_−C − 2kC² .
To simplify things we can write down the stoichiometry matrix (a good way to do this is column wise, i.e., collecting the entries for each reaction):

         R1   R2   R3   R4
    A    −2    2    0    0
Γ := B   −1    1    1    1
    C     1   −1   −2    0
Remark. The point about linear independence is that for c ∈ R^m with c ≠ 0 the function H(t) = Σ_{i=1}^{m} c_i y_i(t) is a conserved quantity, as is for example any scalar multiple of H; e.g., 2c also leads to a conserved quantity, which is equal to 2H. This does not provide any additional information and we therefore are only looking for conserved quantities H_1, . . . , H_d which are generated from linearly independent vectors c_1, . . . , c_d.
A conserved quantity H = Σ_{i=1}^{m} c_i y_i can be used to remove one component from the ODE. Since c ≠ 0 we must have at least one i_0 such that c_{i_0} ≠ 0 and therefore

y_{i_0}(t) = (1/c_{i_0}) H(t) − Σ_{i ≠ i_0} (c_i/c_{i_0}) y_i(t) = (1/c_{i_0}) H(0) − Σ_{i ≠ i_0} (c_i/c_{i_0}) y_i(t)
having used that H is constant and therefore H(t) = H(0). Since H(0) = Σ_{i=1}^{m} c_i y_{0,i}, where y_0 ∈ R^m are the initial conditions for y, we see that y_{i_0}(t) satisfies the algebraic equation

y_{i_0}(t) = (1/c_{i_0}) Σ_{i=1}^{m} c_i y_{0,i} − Σ_{i ≠ i_0} (c_i/c_{i_0}) y_i(t)

which can be substituted into the ODE to remove y_{i_0} from the system of ODEs.
Example 3.4. Recall the previous example with the stoichiometry matrix

         R1   R2   R3   R4
    A    −2    2    0    0
Γ := B   −1    1    1    1
    C     1   −1   −2    0

Solving Γᵀc = 0 we find that c = (c_1, −4c_1, −2c_1) is a solution and therefore there is at least one conserved quantity; e.g., taking c = (1, −4, −2) we find that H = A − 4B − 2C is conserved (note that this requires l = 0, since the constant supply reaction R4 would otherwise contribute −4l):

H' = − 2k_+A²B + 2k_−C
     + 4k_+A²B − 4k_−C − 4kC²
     − 2k_+A²B + 2k_−C + 4kC² = 0
It is easy to see that the dimension of Kernel(Γᵀ) is 1 and for example c_1 = (1, 1, 1)ᵀ satisfies Γᵀc_1 = 0, and thus H_1 = c_1 · (A_1, A_2, A_3) = A_1 + A_2 + A_3 is a conserved quantity. To check this note that the ODE system corresponding to the above reaction network is A_1' = −(k_1 + k_2)A_1, A_2' = k_1 A_1, A_3' = k_2 A_1 and consequently

(d/dt)(A_1 + A_2 + A_3) = −(k_1 + k_2)A_1 + k_1 A_1 + k_2 A_1 = 0 .

But in addition c_2 = (0, −k_2, k_1)ᵀ also leads to a conserved quantity H_2 = k_1 A_3 − k_2 A_2 since

H_2' = k_1 k_2 A_1 − k_2 k_1 A_1 = 0 ,
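These conservation properties can be checked numerically along a discrete trajectory. A sketch for the network A_1 → A_2 (rate k_1), A_1 → A_3 (rate k_2) with forward Euler (the rates and initial values are arbitrary choices); since both H_1 and H_2 are linear in the state, forward Euler preserves them up to rounding:

```python
# network A1 -> A2 (rate k1) and A1 -> A3 (rate k2); verify that
# H1 = A1 + A2 + A3 and H2 = k1*A3 - k2*A2 stay constant
k1, k2, h = 0.75, 0.25, 0.001
A1, A2, A3 = 1.0, 0.2, 0.1
H1_start = A1 + A2 + A3
H2_start = k1 * A3 - k2 * A2
for _ in range(4000):
    # one forward Euler step; the right hand sides use the old A1
    A1, A2, A3 = (A1 - h * (k1 + k2) * A1,
                  A2 + h * k1 * A1,
                  A3 + h * k2 * A1)
H1_end = A1 + A2 + A3
H2_end = k1 * A3 - k2 * A2
```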
Using the language of dynamical systems one says that the set [0, ∞)^m is invariant under the ODE, i.e., trajectories that start in this set remain in this set. We will not give a proof here that [0, ∞)^m is invariant for the ODEs derived from the mass action law. From a numerical point of view it is important to note that these ODE systems come with additional properties which should be respected by a numerical scheme used to solve them. From the above discussion we now know that:
• there might be additional conserved quantities H_l, related for example to mass conservation. Guaranteeing that the numerical scheme does not lead to a production/destruction of mass due to the approximation error could be important depending on the application.
• the solution to the ODE remains non negative if the initial conditions are non negative: again this is an important property of the underlying model and should be preserved by the approximation.
(d/dt) A = βA. In terms of a reaction network this could be expressed as A →[β] A + A (since A here appears on both sides of the reaction, the resulting ODE is of the form A' = −βA + 2βA = βA). Of course we can assume that β combines both birth rate and death rate (although then β might have to be chosen to be negative, which is a bit of an extension of the normal reaction kinetics framework). A second approach to include death (but keep β > 0) is to add the reaction A →[γ] ∅; then the ODE is A' = −βA + 2βA − γA = (β − γ)A.
The weakness of Malthus’ law, predicting unlimited growth of a population (or its unavoidable extinction), led to an improved model developed by Verhulst, usually called the logistic equation: (d/dt) A = (γ − βA)A. The effective growth rate in this model is γ − βA and thus depends on the current size of the population (modelling competition). In terms of reaction kinetics we can express this by A + B →[β] 2A, leading at first to the two ODEs A' = −βAB + 2βAB = βAB, B' = −βAB. Since A' + B' = 0 we have that A + B = A_0 + B_0, or B = A_0 + B_0 − A, so that we can eliminate B, leading to A' = βAB = βA(A_0 + B_0 − A) = (γ − βA)A with γ = β(A_0 + B_0). Given β, γ, A_0 this provides B_0: B_0 = γ/β − A_0, and since B(t) = A_0 + B_0 − A we find that B(t) = γ/β − A(t) for all t. In the logistic equation γ/β is the second fixed point (next to A = 0) and is called the carrying capacity.
Note that the stoichiometry matrix has the two rows (only one reaction) (−1 + 2) = (1) and (−1), which are clearly linearly dependent. A corresponding vector c = (1, 1) leads to the mass conservation A' + B' = 0 we stated above.
A more detailed way to describe the coupling of populations to their use of resources is to introduce additional rate equations to also describe the growth/decay of the resources. Consider a population of rabbits (R) and foxes (F). Rabbits reproduce based on Malthus’ law: R →[α] 2R. When foxes and rabbits meet up then the rabbit goes to a better place (one can only hope) and the fox reproduces (on the spot apparently): R + F →[β] F + δF. Finally, foxes die of old age: F →[γ] ∅. This leads to the following system of ODEs

R' = αR − βRF ,   F' = −γF − βRF + β(1 + δ)RF = βδRF − γF .

Using the approach of stoichiometry we get the following:

Γ = ( 1  −1   0 )
    ( 0   δ  −1 ) ,   w(R, F) = (αR, βRF, γF)ᵀ .

This system is referred to as the Lotka-Volterra predator-prey system. As far as I know Lotka was a chemist and Volterra wanted to model a fish population...
Remark. There are many extensions to this model that are not directly related to the mass action law. Learning is for example one effect that scientists like to take into account, e.g., the prey learns to avoid the predator. This could be modelled using a time dependent reaction rate β(t) to try to keep things within the framework of mass action kinetics, and many other extensions of mass action kinetics can be introduced as well. A second observation is that the assumption that predators kill prey whenever they meet up with one another is not realistic for most species: once a predator is full it stops eating. This can be modelled by introducing a nonlinearity into the system:

R' = αR − g(R)F ,   F' = δg(R)F − γF .

So instead of assuming that the rate of increase of foxes due to feeding is linear in R (the old term was βRF) one assumes that it has the form g(R)F, where for example g becomes constant if R is large, i.e., if there is an abundance of food, e.g., g(R) = βR/(1 + βR): for R small this behaves like βR while for R large g(R) tends to 1. This type of nonlinearity does not directly fit into the concepts of mass action kinetics.
SIR: epidemiology
In the context of the spread of diseases, simple models of epidemics divide the total population into sub-groups depending on whether individuals are infected (I(t)), susceptible to the disease (S(t)) or recovering from the disease (R(t)). The transitions between these states can be interpreted as reactions

I + S →[k] I + I ,   I →[γ] R

(an infected person meeting a susceptible one can lead to two infected people (depending on k), while infected people recover at a certain rate γ). Note that recovered people are assumed immune to the infection. The resulting system of ODEs is

(d/dt) S = −kSI ,   (d/dt) I = kSI − γI ,   (d/dt) R = γI .
This model is often referred to as the SIR model. One property of this model is that it conserves the population, i.e., (d/dt)(S + I + R) = 0. We should keep this in mind when designing numerical schemes for such a system: the discretization should not lead to a growing or shrinking population if the model doesn’t include births and deaths. Of course deaths by the infection can be included in the I equation, and one can also derive SIS models, where immunity to the infection is not possible, using a “reversible reaction”. One can also add additional groups to the population, e.g., different stages of the infection such as infected but not yet contagious. These models can be made to describe different types of transitions, always using the rate equations we discussed for chemical reactions.
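A sketch of the SIR system under forward Euler (the parameter values below are arbitrary). Since dS + dI + dR = 0 in each step, the scheme conserves the total population up to rounding, which is exactly the property discussed above:

```python
def sir_step(S, I, R, k, gamma, h):
    # one forward Euler step for the SIR system; note dS + dI + dR = 0,
    # so S + I + R is preserved by the scheme (up to rounding)
    dS = -k * S * I
    dI = k * S * I - gamma * I
    dR = gamma * I
    return S + h * dS, I + h * dI, R + h * dR

def run_sir(k=0.5, gamma=0.1, h=0.01, steps=10000):
    S, I, R = 0.99, 0.01, 0.0   # normalized population, one percent infected
    for _ in range(steps):
        S, I, R = sir_step(S, I, R, k, gamma, h)
    return S, I, R
```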
Michaelis-Menten kinetics
The kinetics of an enzymatic reaction is described by

S + E ⇌[k_f, k_r] SE →[k_c] E + P

where S is a substrate which reacts with an enzyme E to produce the complex SE. This is a reversible reaction with forward rate k_f and reverse rate k_r. The complex SE can in turn release a product P. The resulting system of ODEs is given by (denoting the complex with C)

S' = −k_f SE + k_r C ,   E' = −k_f SE + k_r C + k_c C ,   C' = k_f SE − k_r C − k_c C ,   P' = k_c C .

Note again the misuse of notation often found in the literature: in the ODE system SE refers to the product of the two functions S(t), E(t), while in the reaction network SE is a compound, the amount of which is described by C(t) in the system of ODEs.
An important property of this equation is that the Hamiltonian H = H(x, p) is constant in time along a given trajectory:
Lemma 3.7. Assume that x solves mx'' = −∇V(x) and define the momentum of the particle x by p = mx'; then

(d/dt) H(x(t), p(t)) = 0 ,

with H(x, p) := (1/2m)|p|² + V(x).
Proof.

(d/dt) H(x(t), p(t)) = ∂_x H(x(t), p(t)) x'(t) + ∂_p H(x(t), p(t)) p'(t)
                     = ∇V(x(t)) x'(t) + (1/m) p(t) p'(t)
                     = ∇V(x(t)) x'(t) + x'(t) m x''(t) = x'(t) (∇V(x(t)) − ∇V(x(t))) = 0

where we have used that p' = mx'' = −∇V(x(t)) since x solves the ODE mx'' = −∇V(x).
Assuming now that x solves the second order ODE mx'' = −∇V(x) we see that x'(t) = (1/m)p(t) and p'(t) = −∇V(x(t)). On the other hand ∂_x H(x, p) = ∇V(x) and ∂_p H(x, p) = (1/m)p, so that (x, p) solves the Hamiltonian system

x'(t) = ∂_p H(x(t), p(t)) ,   p'(t) = −∂_x H(x(t), p(t)) .

Remark. Note that this is a different system from the first order problems we have so far considered in place of second order equations.
The functions x are called generalized coordinates and p the momentum of the system. The functions (x(t), p(t)) are also referred to as the Hamiltonian flow given by H.
If H(x, p) is of the form H(x, p) = T(p) + V(x) then it is called separable. For example the Hamiltonian from the beginning of the section, H(x, p) := (1/2m)|p|² + V(x), is separable.
Remark. It is worth repeating that x is not required to consist of spatial coordinates; for example it could be the angle a pendulum’s rod makes with the vertical line.
Definition 3.9 (Invariant). A function f = f(x, p) is called an invariant under the Hamiltonian flow given by H if

(d/dt) f(x(t), p(t)) = 0

where (x, p) solves the Hamiltonian system given by H.
Lemma 3.2.1. For f to be an invariant under a Hamiltonian flow is equivalent to

f_x H_p − f_p H_x = 0 ,

x'(t) = bx + cp ,
p'(t) = −ax − bp ,

and is therefore a linear system of ODEs. We can rewrite this as (x, p)' = A(x, p)ᵀ with matrix

A = (  b   c )
    ( −a  −b ) .
Example 3.12 (Pendulum). Consider a particle with mass m attached to a rod of length l and let g be the gravitational acceleration. We assume that the pendulum can only swing in one plane so that we can describe the position of the pendulum over time in Cartesian coordinates (x, y). This is quite complex since the length of the rod restricts the position of the pendulum, which always has to lie on a circle of radius l around the pivot. It is much easier to use the generalized coordinate θ, which is the angle the rod makes with, for example, the vertical axis, where we choose θ = 0 in the case that the pendulum is pointing downwards. Once we have θ the actual position is given by (x, y) = l(sin(θ), − cos(θ)).
The momentum of this system is the angular momentum of the pendulum, p = ml²θ', while the potential energy is equal to the gravitational energy V(θ) = −mgl cos(θ). Therefore, the total energy is given by

H(θ, p) = p²/(2ml²) − mgl cos(θ) .

The Hamiltonian system is given by

θ' = p/(ml²) ,   p' = −mgl sin(θ) ,

or, written as a second order equation,

θ'' = p'/(ml²) = −(g/l) sin(θ) .
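A short numerical check ties this back to the stability discussion of Chapter 2: applying the forward Euler method to the pendulum system does not preserve H; the energy typically grows, just as |y_n| grew for the linear center. A sketch (the values for m, l, g, the step size and the initial angle are arbitrary):

```python
import math

def pendulum_energy_drift(theta0, steps=2000, h=0.001, m=1.0, l=1.0, g=9.81):
    # H(theta, p) = p^2/(2 m l^2) - m g l cos(theta)
    H = lambda th, p: p * p / (2 * m * l * l) - m * g * l * math.cos(th)
    theta, p = theta0, 0.0
    H_start = H(theta, p)
    for _ in range(steps):
        # forward Euler for theta' = p/(m l^2), p' = -m g l sin(theta);
        # both right hand sides use the old state
        theta, p = (theta + h * p / (m * l * l),
                    p - h * m * g * l * math.sin(theta))
    return H_start, H(theta, p)
```

Running this with theta0 = 0.5 shows a small but clearly positive drift of H over the simulated time span.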
Example 3.13 (Duffing oscillator). Let us consider the Hamiltonian for the Duffing oscillator with mass m = 1: H(x, p) = ½p² + (δ/2)x² + (β/4)x⁴ with β ∈ R and δ > 0. Note that β = 0 gives us the Hamiltonian for the harmonic oscillator with k² = δ/2. The resulting first order Hamiltonian system is given by

x'(t) = ∂_p H(x(t), p(t)) = p ,
p'(t) = −∂_x H(x(t), p(t)) = −δx − βx³ ,

with V(x) = (δ/2)x² + (β/4)x⁴. This is as expected from the discussion above and we could have written this down directly by looking at the Hamiltonian written in the form H(x, p) = ½p² + V(x).
First consider the β = 0 case: then the second order problem becomes the standard mass spring equation x'' + δx = 0 with mass m = 1. If β ≠ 0 the equation describes a spring system where the spring restoring force (given in the standard case by δx) is nonlinear: the stiffness of the spring does not exactly obey Hooke’s law. The restoring force provided by the nonlinear spring is then (δ + βx²)x. The case β > 0 is called a hardening spring while β < 0 results in a so called softening spring. Basically, if β > 0 the restoring force gets larger the more the spring is displaced from its resting position, i.e., the larger x² becomes, while for β < 0 the restoring force is decreased with increasing displacement and could even switch sign.
Example 3.14 (oscillators). Many linear or nonlinear oscillators are governed by Hamiltonian systems with
• Simple pendulum: H(x, p) = p²/(2l²) − gl cos(x).
• Harmonic oscillator: H(x, p) = ½p² + k²x².
Definition 3.15 (Principle of least action). The principle of least action states that given two points in time t_0 < t_1, the particle path given by mx'' = −∇V(x) is the path that minimizes the action

S(q) = ∫_{t_0}^{t_1} L(q(t), q'(t)) dt

under all paths q which go from position (t_0, x(t_0)) to (t_1, x(t_1)). In other words, x is a path such that

S(x) ≤ S(q) ,   for any path q : [t_0, t_1] → R^d for which q(t_0) = x(t_0), q(t_1) = x(t_1) .
Assume we had such a path and η = η(t) was a path with η(t_0) = η(t_1) = 0. Taking ε (small of course), y(t; ε) := x(t) + εη(t) is a path that satisfies y(t_0; ε) = x(t_0), y(t_1; ε) = x(t_1), so since x is extremal it must hold that S(x) ≤ S(y(·; ε)). Now we can define the function F : R → R with

F(ε) := S(y(·; ε)) = ∫_{t_0}^{t_1} L(y(t; ε), y'(t; ε)) dt

where y'(t; ε) is the derivative of y with respect to t, i.e., y'(t; ε) = x'(t) + εη'(t). So

F(ε) = ∫_{t_0}^{t_1} L(x + εη, x' + εη') dt .
Now the extremal condition translates to F(0) ≤ F(ε) for all ε, so 0 is an extremal point of the
scalar function F, which means that F'(0) = 0 must hold. Now we need to figure out how to
compute the variation of S, i.e., the derivative F'. We will not go into the mathematical details
of why the following is legal, but it turns out that (just think of the chain rule):
F'(ε) = d/dε ∫_{t0}^{t1} (m/2)|x'(t) + εη'(t)|² − V(x(t) + εη(t)) dt
      = ∫_{t0}^{t1} m(x'(t) + εη'(t)) η'(t) − ∇V(x(t) + εη(t)) η(t) dt .
To complete the argument we need to get rid of the derivative on the test function η which we
can do using integration by parts:
0 = F'(0) = [m x'(t)η(t)]_{t=t0}^{t1} + ∫_{t0}^{t1} −m x''(t)η(t) − ∇V(x(t))η(t) dt .
This has to hold for all paths η which vanish at t0 , t1 and if everything is smooth then the
fundamental theorem of variational calculus states that if
∫_{t0}^{t1} G(t)η(t) dt = 0
for all functions η with η(t0 ) = η(t1 ) = 0 then G(t) = 0 for all t ∈ (t0 , t1 ). Using this result (which
we do not prove here) we find that
m x''(t) = −∇V(x(t)) for all t ∈ (t0, t1).
Adding a damping force −αx' and a periodic external forcing δ cos(ωt) to a linear spring force −βx
gives, written as a first order system,
x' = v , m v' = −αv − βx + δ cos(ωt) .
CHAPTER 3. SOME ASPECTS OF MATHEMATICAL MODELLING 43
Let us assume that these additional forces are of the form f(x, v). Then the Lagrange–d'Alembert
principle states that the variation of the action including f leads to
∫_{t0}^{t1} m x'(t)η'(t) − ∇V(x(t))η(t) dt − ∫_{t0}^{t1} f(x(t), x'(t))η(t) dt ,
Now for the forward Euler method we used r = 1 and replaced Y'(tn) by f(tn, Y(tn)) using (∗).
We can also use (∗) to replace Y''(tn):
Y''(tn) = d/dt f(t, Y(t))|_{t=tn} = ∂t f(tn, Y(tn)) + ∂y f(tn, Y(tn)) Y'(tn) .
Using again Y'(tn) = f(tn, Y(tn)) we can replace Y''(tn) in the Taylor series by evaluations of f
and derivatives of f that only involve knowing Y(tn) but no derivatives of Y. So we can compute
the derivatives Y_n^k = Y^(k)(tn) with the following expressions:
Y_n^0 = Y(tn), Y_n^1 = f(tn, Y_n^0), Y_n^2 = ∂t f(tn, Y_n^0) + ∂y f(tn, Y_n^0) Y_n^1
In the same way we can arrive at formulas for Y_n^k = Y^(k)(tn) with k > 2 based only on Y_n^0, . . . , Y_n^{k−1}
and higher order derivatives of f. We then have
Y(t_{n+1}) = Σ_{k=0}^{r} (1/k!) h^k Y_n^k + O(h^{r+1}) .
CHAPTER 4. HIGHER ORDER ONE STEP METHODS 45
Now by ignoring the O(h^{r+1}) term and replacing Y_n^0 by the approximation yn of Y at time t = tn,
we arrive at an approximation y_{n+1} for Y at t = t_{n+1}:
y_{n+1} = Σ_{k=0}^{r} (1/k!) h^k y_n^k
with
y_n^0 = yn , y_n^1 = f(tn, y_n^0) , y_n^2 = ∂t f(tn, y_n^0) + ∂y f(tn, y_n^0) y_n^1 , . . .
If the ODE is vector valued we can use the same approach but we need to use the vector valued
form of the Taylor series, i.e., ∂y f (tn , yn0 ) is the Jacobian and ∂y f (tn , yn0 )yn1 is a matrix-vector
product and so on.
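For a concrete problem the required derivatives can be worked out by hand. The following is a minimal sketch (not part of the original lecture material; the test problem y' = y, for which ∂t f = 0 and ∂y f = 1, is an illustrative choice) of the resulting order-2 Taylor method:

```python
import math

def taylor2_step(t, y, h, f, df_dt, df_dy):
    # One step of the order-2 Taylor method:
    # y_{n+1} = y_n + h*y1 + (h^2/2)*y2 with
    # y1 = f(t, y) and y2 = f_t(t, y) + f_y(t, y) * f(t, y).
    y1 = f(t, y)
    y2 = df_dt(t, y) + df_dy(t, y) * y1
    return y + h * y1 + 0.5 * h**2 * y2

# Test problem y' = y, Y(0) = 1, exact solution Y(t) = exp(t);
# the partial derivatives of f are computed by hand.
f     = lambda t, y: y
df_dt = lambda t, y: 0.0
df_dy = lambda t, y: 1.0

y, t, h = 1.0, 0.0, 0.1
for n in range(10):
    y = taylor2_step(t, y, h, f, df_dt, df_dy)
    t += h
print(abs(y - math.exp(1.0)))  # error at t = 1 is O(h^2), here ~4e-3
```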
This method is difficult to implement for a specific problem since it requires computing high order
partial derivatives of f. These derivatives have to be recomputed for each new problem. In the
next example we will see how to construct higher order methods which only require evaluations
of f:
Example 4.2. (2 step Heun method) We again focus on a scalar ODE but the ideas carry over
to the vector valued case.
Starting with the first three terms of the Taylor expansion (r = 2):
Y(t_{n+1}) = Y(tn) + h f(tn, Y(tn)) + (1/2)h² (∂t f(tn, Y(tn)) + ∂y f(tn, Y(tn)) f(tn, Y(tn))) + O(h³)
          = Y(tn) + (1/2)h f(tn, Y(tn))
            + (1/2)h (f(tn, Y(tn)) + h ∂t f(tn, Y(tn)) + h ∂y f(tn, Y(tn)) f(tn, Y(tn))) + O(h³)
Now using Taylor expansion in two variables we find
f(tn + h, Y(tn) + h f(tn, Y(tn))) = f(tn, Y(tn)) + h ∂t f(tn, Y(tn)) + h ∂y f(tn, Y(tn)) f(tn, Y(tn)) + O(h²) .
Therefore
Y(t_{n+1}) = Y(tn) + (1/2)h f(tn, Y(tn)) + (1/2)h f(t_{n+1}, Y(tn) + h f(tn, Y(tn))) + O(h³) .
Thus given yn we can compute
y_{n+1} = yn + (1/2)(F1 + F2) , F1 = h f(tn, yn) , F2 = h f(t_{n+1}, yn + F1) .
Equivalently we can use the formula
ỹ_{n+1} = yn + h f(tn, yn) , ỹ_{n+2} = ỹ_{n+1} + h f(t_{n+1}, ỹ_{n+1}) , y_{n+1} = (1/2)(yn + ỹ_{n+2}) .
So we are averaging the starting value and the result of taking two forward Euler steps.
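This averaging form translates directly into code. A minimal sketch (the test problem y' = −y with exact solution e^{−t} is an illustrative choice, not part of the lecture material):

```python
import math

def heun_step(t, y, h, f):
    # One step of the 2-stage Heun method: average the starting value
    # with the result of two forward Euler steps.
    F1 = h * f(t, y)
    F2 = h * f(t + h, y + F1)
    return y + 0.5 * (F1 + F2)

# Test problem y' = -y, Y(0) = 1, exact solution Y(t) = exp(-t).
f = lambda t, y: -y
y, t, h = 1.0, 0.0, 0.1
for n in range(10):
    y = heun_step(t, y, h, f)
    t += h
print(abs(y - math.exp(-1.0)))  # second order: error ~7e-4 for h = 0.1
```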
Example 4.3. (general second order explicit methods) To obtain second order methods we can
try to find parameters b1, b2, a, b so that
Note that kl(t, y; h), ϕ(t, y; h) ∈ R^r. The coefficients αi, γi, βi,l ∈ R are independent of t, y, f and
h, so they are independent of the problem we are solving and of the time step used. It is usual to describe
RK methods by their Butcher tableau:

α1 | 0
α2 | β2,1  0
 ⋮  |  ⋮        ⋱
αm | βm,1 . . . βm,m−1  0
---+------------------------
   | γ1   . . . γm−1    γm

That is, by two vectors α ∈ R^m, γ ∈ R^m and a lower triangular matrix β ∈ R^{m×m}:

α | β
---+---
   | γ
An implicit m-stage Runge-Kutta method is defined by
y_{n+1} = yn + h ϕ(tn, yn; h) with ϕ(t, y; h) := Σ_{i=1}^{m} γi ki(t, y; h)
with
kj(t, y; h) = f(t + αj h, y + h Σ_{l=1}^{m} βj,l kl(t, y; h)) ∈ R^r , j = 1, . . . , m .
As before the method can be described by a Butcher tableau but this time the matrix β can have
entries on and above the diagonal.
Example 4.5. All the methods we had so far were of Runge-Kutta type:

1. Forward Euler method:
   0 | 0
   --+---
     | 1

2. Implicit Euler method:
   1 | 1
   --+---
     | 1

3. Midpoint method:
   1/2 | 1/2
   ----+----
       | 1

4. Modified Euler method:
   0   | 0   0
   1/2 | 1/2 0
   ----+--------
       | 0   1

5. 2-stage Heun method:
   0 | 0   0
   1 | 1   0
   --+---------
     | 1/2 1/2

6. Trapezoidal method:
   0 | 0   0
   1 | 1/2 1/2
   --+---------
     | 1/2 1/2

7. Classical 4th order RK method:
   0   | 0   0   0   0
   1/2 | 1/2 0   0   0
   1/2 | 0   1/2 0   0
   1   | 0   0   1   0
   ----+----------------
       | 1/6 1/3 1/3 1/6

8. Diagonally implicit two stage third order method:
   1/3 | 1/3 0
   1   | 1   0
   ----+--------
       | 3/4 1/4

9. Four-stage, 3rd order diagonally implicit method:
   1/2 | 1/2  0    0   0
   2/3 | 1/6  1/2  0   0
   1/2 | −1/2 1/2  1/2 0
   1   | 3/2  −3/2 1/2 1/2
   ----+--------------------
       | 3/2  −3/2 1/2 1/2

10. Fourth order Gauss-Legendre method:
   1/2 − √3/6 | 1/4        1/4 − √3/6
   1/2 + √3/6 | 1/4 + √3/6 1/4
   -----------+------------------------
              | 1/2        1/2
Note that the implicit Euler, midpoint and trapezoidal methods as well as the last three methods
are implicit RK methods while all the others are explicit.
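Any explicit tableau can be turned into a stepper by computing the stages in order. A minimal sketch (not part of the lecture material; the test problem y' = −y and the step size are illustrative choices), using the classical RK4 tableau from above:

```python
import numpy as np

def explicit_rk_step(f, t, y, h, alpha, beta, gamma):
    # One step of an explicit RK method given by its Butcher tableau
    # (beta strictly lower triangular, so each stage uses earlier ones only).
    m = len(gamma)
    k = np.zeros(m)
    for j in range(m):
        k[j] = f(t + alpha[j] * h, y + h * np.dot(beta[j, :j], k[:j]))
    return y + h * np.dot(gamma, k)

# Classical RK4 tableau (method 7 above):
alpha = np.array([0.0, 0.5, 0.5, 1.0])
beta  = np.array([[0.0, 0.0, 0.0, 0.0],
                  [0.5, 0.0, 0.0, 0.0],
                  [0.0, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
gamma = np.array([1/6, 1/3, 1/3, 1/6])

f = lambda t, y: -y          # test problem y' = -y, Y(0) = 1
y, t, h = 1.0, 0.0, 0.1
for n in range(10):
    y = explicit_rk_step(f, t, y, h, alpha, beta, gamma)
    t += h
print(abs(y - np.exp(-1.0)))  # fourth order: error well below 1e-6
```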
Remark. In addition to explicit and implicit there are other subclasses of RK methods; an
interesting class are the diagonally implicit Runge-Kutta methods where β is lower triangular
(including the diagonal). In this case we do not have to solve an m·r × m·r system of nonlinear
equations for k1, . . . , km but only m nonlinear equations of size r, first for k1, then for k2, up to km,
since the equation for ki does not depend on kl for l > i. In fact all the methods shown above,
with the exception of the last one, are diagonally implicit. Restricting the possible nonzero entries
in the Butcher tableau, i.e., requiring βi,l = 0 for l > i for a diagonally implicit method, reduces
the freedom one has to design a method with other desirable properties, e.g., a higher rate of
convergence, better stability, positivity of the solution etc. So one has less choice with an explicit
method than with a fully implicit method, and diagonally implicit methods are somewhere in
between the other two.
Some further remarks (stated without proof) and a summary of the above:
• Fully implicit m stage RK methods can be up to order 2m.
• Explicit m stage RK methods can be at most of order m but...
• order m is only possible for m ≤ 5 ...
• order m − 1 for m ≤ 7...
• order m − 2 for m ≤ 8.
• Implicit methods have better stability properties (larger h) but require the use of the (vector
valued) Newton method (see further down in this chapter) because...
• in general they require the solution of an m·r × m·r system of nonlinear equations for the
(k_i)_{i=1}^m ...
• Diagonally implicit methods (β is a lower triangular matrix) require only the solution of m
nonlinear equations of size r and can be a good compromise.
Remark. Runge-Kutta methods can often be defined in many equivalent ways. For example the
Butcher tableau above providing the backward Euler method leads to the scheme (assuming that
f is independent of t):
k = f(yn + h k) , y_{n+1} = yn + h k .
That this is in fact the backward Euler method discussed at the beginning of the lecture can be
easily seen using that
y_{n+1} = yn + h k = yn + h f(yn + h k) = yn + h f(y_{n+1}) .
We will discuss below how to find the root k of F(k) = k − f(yn + hk) but the same approach can
be used to find the root y_{n+1} of F(y) = y − hf(y) − yn. The difference is only in the initial guess
used: in the first case f(yn) makes sense while in the second case yn is a reasonable initial guess.
In the last part of this chapter we will not focus on how to construct RK methods but on how
to analyse an RK method given by its Butcher tableau. But first we discuss how to implement an
implicit RK method.
with
kj(t, y; h) = f(t + αj h, y + h Σ_{l=1}^{m} βj,l kl(t, y; h)) , j = 1, . . . , m .
Here r is the size of the ODE system (i.e., f : R^r → R^r) and m is the number of stages; the vector
κ ∈ R^{r·m} is of the form κ = (k11, . . . , k1r, k21, . . . , k2r, . . . , km1, . . . , kmr) and similarly we must
understand Fj = (Fj1, . . . , Fjr) ∈ R^r and F as the accumulated entries (Fji). So F is a very
high dimensional function and finding a root can be challenging.
Recall the definition of a diagonally implicit RK method in which case the computations for the
kj decouple, i.e., kj only depends on k1, . . . , k_{j−1} but not on ki with i > j:
kj(t, y; h) = f(t + αj h, y + h βj,j kj + h Σ_{l=1}^{j−1} βj,l kl(t, y; h)) , j = 1, . . . , m
Fj(k) = kj − f(t + αj h, y + h βj,j kj + h Σ_{l=1}^{j−1} βj,l kl) , j = 1, . . . , m .
So instead of finding one root in R^{rm} of a high dimensional function F we need to find m roots in
R^r which is far easier.
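A minimal sketch of this stage-by-stage approach for a scalar ODE (r = 1), solving each stage equation with a few Newton steps; the two stage third order DIRK tableau and the test problem y' = −y are illustrative choices, not part of the lecture material:

```python
import numpy as np

def scalar_newton(F, dF, x, tol=1e-12, maxit=50):
    # Basic Newton iteration for a scalar equation F(x) = 0.
    for _ in range(maxit):
        if abs(F(x)) <= tol:
            return x
        x = x - F(x) / dF(x)
    return x

def dirk_step(f, df_dy, t, y, h, alpha, beta, gamma):
    # One step of a diagonally implicit RK method for a scalar ODE:
    # each stage k_j requires solving one scalar nonlinear equation.
    m = len(gamma)
    k = np.zeros(m)
    for j in range(m):
        c = y + h * np.dot(beta[j, :j], k[:j])   # already computed stages
        F  = lambda kj: kj - f(t + alpha[j]*h, c + h*beta[j, j]*kj)
        dF = lambda kj: 1.0 - h*beta[j, j]*df_dy(t + alpha[j]*h, c + h*beta[j, j]*kj)
        k[j] = scalar_newton(F, dF, f(t, y))     # f(t, y) as initial guess
    return y + h * np.dot(gamma, k)

# Two stage third order DIRK (method 8 above), test problem y' = -y:
alpha = np.array([1/3, 1.0])
beta  = np.array([[1/3, 0.0], [1.0, 0.0]])
gamma = np.array([3/4, 1/4])
f, df = (lambda t, y: -y), (lambda t, y: -1.0)

y = 1.0
for n in range(10):
    y = dirk_step(f, df, n * 0.1, y, 0.1, alpha, beta, gamma)
print(abs(y - np.exp(-1.0)))  # third order accurate
```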
So in this section we will discuss some basic approaches to computing roots of a given vector
valued function. The problem is hard to solve as we will see and we can only scratch the surface.
Problem. Given a function F : Rn → Rn find a root x∗ , i.e.,
x∗ ∈ Rn with F (x∗ ) = 0 .
Example 4.6. With F(x) = Ax − b, A ∈ R^{n×n}, we have to solve a linear system of equations.
F(x) = ax² + bx + c: in this case there is a simple formula for computing all roots.
F(x) = cos(x): the problem could be to find a root x* ∈ [1, 2]. Then the solution is x* = π/2.
We will focus on the scalar case n = 1 and on finding the root of a smooth function F : [a, b] → R.
Proof. We have by construction that (ak)_{k∈N} is monotonically increasing and (bk)_{k∈N} is monotonically
decreasing and 0 < bk − ak = (1/2)(b_{k−1} − a_{k−1}) = 2^{−k}(b − a), ak < b, a < bk. Thus (ak)_{k∈N}, (bk)_{k∈N}
converge and lim_{k→∞} ak = lim_{k→∞} bk =: x*. Since F ∈ C⁰(a, b) we can conclude that F(x*) = lim_{k→∞} F(ak) =
lim_{k→∞} F(bk).
Again by construction F(ak)F(bk) < 0 always holds, and therefore
F(x*)² = lim_{k→∞} (F(ak)F(bk)) ≤ 0 ⟹ F(x*) = 0.
Furthermore
|x^(k) − x*| ≤ min{|x^(k) − ak|, |x^(k) − bk|} = (1/2)|bk − ak| ≤ 2^{−k−1}(b − a).
Remark. The method is very robust but also quite slow: it requires about three iterations to
gain one decimal place of accuracy.
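A minimal sketch of the nested interval (bisection) method; the tolerance and the test problem F = cos on [1, 2] are illustrative choices:

```python
import math

def bisection(F, a, b, tol=1e-10):
    # Nested interval method: requires a sign change F(a)*F(b) < 0.
    assert F(a) * F(b) < 0
    k = 0
    while b - a > tol:
        x = 0.5 * (a + b)
        if F(a) * F(x) <= 0:   # sign change in [a, x]: keep left half
            b = x
        else:                  # otherwise keep right half
            a = x
        k += 1
    return 0.5 * (a + b), k

root, iters = bisection(math.cos, 1.0, 2.0)
print(root, iters)  # root ~ pi/2 after ~34 iterations (about 3.3 per digit)
```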
x^(k+1) = x^(k) − F(x^(k)) / F'(x^(k)) , x^(0) given, k ≥ 0 ,
and if x^(k) → x* for some x^(0) ∈ U then since F is smooth we have x* = x* − F(x*)/F'(x*) and thus
F(x*) = 0.
Remark. (Geometric interpretation) Let lk(x) = ax + b be the linearization of F at x^(k), i.e.,
lk(x^(k)) = F(x^(k)), lk'(x^(k)) = F'(x^(k)). So lk(x) = F(x^(k)) + F'(x^(k))(x − x^(k)) is the tangent to
F at x^(k). Now it seems reasonable to approximate a root of F by the root x^(k+1) of lk:
0 = F(x^(k)) + F'(x^(k))(x^(k+1) − x^(k)).
This is the formula used in the Newton method.
Figure 4.1 (left) shows the idea of the construction. With this idea in mind, it is obvious that the
method does not always lead to the expected result or to any result at all. This is also shown in
Figure 4.1.
Algorithm. (Newton method)
Fx := F(x)
While |Fx| > TOL:
    x := x − Fx / F'(x)
    Fx := F(x)
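The algorithm above in runnable form (a sketch; the test problem F = cos with starting value 1.5 is an illustrative choice):

```python
import math

def newton(F, dF, x, tol=1e-12, maxit=100):
    # Newton's method: iterate x <- x - F(x)/F'(x) until |F(x)| <= tol.
    for k in range(maxit):
        if abs(F(x)) <= tol:
            return x, k
        x = x - F(x) / dF(x)
    return x, maxit

# Same problem as for bisection: root of cos in [1, 2].
root, iters = newton(math.cos, lambda x: -math.sin(x), 1.5)
print(root, iters)  # converges to pi/2 in a handful of iterations
```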
Theorem 4.8. (convergence of Newton's method) Consider F ∈ C²(a, b) so that there is an x* ∈
(a, b) with F(x*) = 0. Let m := min_{a≤x≤b} |F'(x)| > 0, M := max_{a≤x≤b} |F''(x)|, and choose ρ > 0 so that
Bρ(x*) := {x : |x − x*| < ρ} ⊂ [a, b] and q := (M/(2m)) ρ < 1. Then Newton's method converges with
rate 2 for any starting value x^(0) ∈ Bρ(x*).
The approximation satisfies the a-priori error estimate
Figure 4.1: Newton method: on the left the idea of Newton's method is sketched. The middle
figure shows a function with two roots where it is unclear which root will be approximated. On
the right we have a situation with x^(k) → ∞, i.e., where the Newton method fails.
(a) |x^(k) − x*| ≤ (M/(2m)) |x^(k−1) − x*|² ≤ (2m/M) q^{2^k} .
Remark. It follows from the assumptions made in the Theorem and from the mean value theorem
that
|F(x) − F(y)| / |x − y| = |F'(ξ)| ≥ m ∀ x, y ∈ Bρ(x*), x ≠ y ⟹ |x − y| ≤ (1/m)|F(x) − F(y)| .
Therefore x* is the only root in Bρ(x*) and x* is a simple root, i.e., F(x*) = 0 and F'(x*) ≠ 0.
Remark. The a-posteriori estimate can be used to determine the quality of the approximation
since the right hand side is computable. In contrast the a-priori estimate involves the unknown
quantity x* or overestimates the error considerably - but it establishes the quadratic convergence
rate. The convergence is very fast if x^(0) ∈ Bρ(x*). Assume for example q = 1/2; then after only 10
steps of the method we have
|x^(10) − x*| ≤ (2m/M) q^{1024} ∼ (2m/M) 10^{−308} .
Comparing that with the nested interval approach, we find using the same starting interval that
|b − a| = ρ = q (2m/M) = m/M, with q = 1/2. After 10 iterations of the nested interval method we get
|x^(10) − x*| ≤ 2^{−11} |b − a| = 2^{−11} ρ = 2^{−11} q (2m/M) ∼ (2m/M) 10^{−3} .
Proof. (convergence of Newton's method)
From Taylor expansion we find
(1) F(y) = F(x) + F'(x)(y − x) + R(y, x) with R(y, x) = (F''(ξ(y, x))/2) (y − x)² .
For all x, y ∈ Bρ(x*):
(2) |R(y, x)| ≤ (M/2) |y − x|² .
For x ∈ Bρ(x*) define Φ(x) := x − F(x)/F'(x). Then
|Φ(x) − x*| = |(x − x*) − F(x)/F'(x)| = |1/F'(x)| |F(x) + (x* − x)F'(x)|
            = |1/F'(x)| |R(x*, x)| ≤ (M/(2m)) |x − x*|²
using (1) with y = x* (so that F(x*) = 0) and (2). Setting ρk := (M/(2m)) |x^(k) − x*| we obtain
ρk = (M/(2m)) |Φ(x^(k−1)) − x*| ≤ ((M/(2m)) |x^(k−1) − x*|)² = (ρ_{k−1})² ≤ . . . ≤ (ρ0)^{2^k}
⟹ |x^(k) − x*| = (2m/M) ρk ≤ (2m/M) (ρ0)^{2^k} ≤ (2m/M) q^{2^k}
since ρ0 = (M/(2m)) |x^(0) − x*| ≤ (M/(2m)) ρ = q.
This proves the a-priori estimate. Since we have q < 1, it follows that q^{2^k} → 0 ⟹ x^(k) → x*.
For the a-posteriori estimate we use (1) with y = x^(k) and x = x^(k−1).
The Newton method can also be viewed as a fixed point iteration
x^(k+1) := Φ(x^(k)) = x^(k) − F(x^(k)) / F'(x^(k)) .
A fixed point of this iteration satisfies x = Φ(x) and thus has to be a root of F. Furthermore if x is
a fixed point (and therefore a root of F) we see that
|Φ'(x)| = |1 − (F'(x)² − F(x)F''(x)) / F'(x)²| = |F(x)F''(x) / F'(x)²| = 0 .
So all the fixed points are stable and consequently, for x^(0) close to a root x*, the sequence converges
to x*.
As we have seen the Newton method is not quite as robust as the nested interval method, but
if x^(0) is close to x* the Newton method is far more efficient. There are a few problems with
Newton's method, the most obvious being:
1. How to find x^(0) close enough to x*?
2. What happens if F'(x*) = 0?
3. Can one avoid having to compute F'?
There are modifications of Newton's method to handle these problems. For example one can combine
the nested interval method with Newton's method to produce a stable and efficient scheme:
consider a < b with F(a)F(b) < 0.
Define x := (1/2)(a + b), ã := a, b̃ := b, Fx := F(x), Fa := F(a).
While |Fx| > TOL:
    x := x − Fx / F'(x)                     (Newton step)
    F1 := F(x)
    If |F1| > |Fx| or x ∉ (a, b) then       (reject the Newton step, bisect instead)
        If Fa · Fx < 0 then b := x else (a := x; Fa := Fx)
        x := (1/2)(a + b), F1 := F(x)
    Fx := F1
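A runnable sketch of such a safeguarded Newton method (the exact acceptance test used here is one of several reasonable choices; the test problem F = cos on [1, 2] is illustrative):

```python
import math

def safe_newton(F, dF, a, b, tol=1e-12, maxit=100):
    # Newton's method safeguarded by bisection: if the Newton iterate
    # leaves the bracket [a, b] or fails to decrease |F|, bisect instead.
    assert F(a) * F(b) < 0
    x = 0.5 * (a + b)
    Fx = F(x)
    for _ in range(maxit):
        if abs(Fx) <= tol:
            return x
        x_new = x - Fx / dF(x)
        if not (a < x_new < b) or abs(F(x_new)) > abs(Fx):
            x_new = 0.5 * (a + b)          # bisection fallback
        F_new = F(x_new)
        # shrink the bracket so that a sign change is kept between a and b
        if F(a) * F_new < 0:
            b = x_new
        else:
            a = x_new
        x, Fx = x_new, F_new
    return x

x_star = safe_newton(math.cos, lambda x: -math.sin(x), 1.0, 2.0)
print(x_star)  # close to pi/2
```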
Remark. The basic idea (and the convergence theorem) can be extended to F : R^n → R^n.
The scheme is practically the same:
x^(k+1) = x^(k) − (DF(x^(k)))^{−1} F(x^(k))
where DF(x^(k)) now denotes the Jacobian of F at x^(k), i.e., DF(x^(k)) ∈ R^{n×n}. Thus a linear
system of equations has to be solved in each step of the Newton scheme. One iteration thus
consists of two steps: first one solves the linear system DF(x^(k))δk = F(x^(k)) for δk and then
performs the update x^(k+1) = x^(k) − δk. The first step can require finding the solution to a large
system of linear equations - a major topic of numerical linear algebra which is covered in a third
year module.
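A minimal numpy sketch of this solve-then-update iteration (the 2×2 example system - a circle intersected with a line - is an illustrative choice, not part of the lecture material):

```python
import numpy as np

def newton_system(F, DF, x, tol=1e-12, maxit=50):
    # Newton's method for F: R^n -> R^n. Each iteration solves the
    # linear system DF(x) delta = F(x) and updates x <- x - delta.
    for _ in range(maxit):
        Fx = F(x)
        if np.linalg.norm(Fx) <= tol:
            return x
        delta = np.linalg.solve(DF(x), Fx)
        x = x - delta
    return x

# Example: intersection of the circle x^2 + y^2 = 4 with the line y = x.
F  = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, v[1] - v[0]])
DF = lambda v: np.array([[2.0 * v[0], 2.0 * v[1]], [-1.0, 1.0]])
x = newton_system(F, DF, np.array([1.0, 2.0]))
print(x)  # converges to (sqrt(2), sqrt(2))
```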
We restrict the presentation to a fixed time step h but the results can easily be extended to variable
time steps t_{n+1} − tn = hn. Also we will set t0 = 0.
Example 4.10. All the explicit Runge-Kutta methods are explicit one step methods, e.g., for the
forward Euler method we have ϕ = f (tn , yn ).
We focus in the following on scalar ODE r = 1 but the results carry over directly to the vector
valued case.
We next consider the convergence of these methods. A method is convergent if the maximum error E(h) satisfies
lim_{h→0} E(h) = 0 .
The method has convergence order p, if for some constant M > 0, E(h) ≤ M h^p for any h > 0.
An important concept is the truncation error of a numerical scheme:
Definition 4.12. (Truncation error and consistency) The truncation error of a one-step method
at step n is defined as
τn (h) = Y (tn+1 ) − Y (tn ) − hϕ(tn , Y (tn ); h)
Thus the exact solution Y satisfies the perturbed equation
Y(t_{n+1}) = Y(tn) + h ϕ(tn, Y(tn); h) + τn(h)
for n = 0, 1, . . . .
A one step method is consistent if τn(h) = o(h) for all n. This means that for every ε > 0 there
exists an h0 such that |τn(h)| < εh for all h < h0.
It is consistent of order p if max_n |τn(h)| = O(h^{p+1}).
Note that the consistency order is one lower than the order of the truncation error.
Remark. The truncation error is defined by inserting the exact solution to the initial value problem
(∗) into the one step method. This gives a measure of how large the error in step n would be if
the method was started with the correct value Y (tn ).
The situation is sketched in Figure 4.2. The figure shows the exact solution Y and approximated
values y0, y1, y2, y3. At each point (tn, yn) the slope is given by (d/dt)Ȳn(tn) where Ȳn is the
solution to the ODE with initial condition Ȳn(tn) = yn. In yellow is the truncation error in each
step.
Example 4.13. For all the explicit methods described in the previous section which were based
on Taylor expansion, the truncation error is in O(h^{r+1}) if the term not taken into account in the
Taylor series is of order O(h^{r+1}). So the Taylor method from Example 4.1 given by
ϕ(tn, yn; h) = Σ_{k=1}^{r} (1/k!) h^{k−1} y_n^k
with
y_n^0 = yn , y_n^1 = f(tn, y_n^0) , y_n^2 = ∂t f(tn, y_n^0) + ∂y f(tn, y_n^0) y_n^1 , y_n^3 = . . .
has an order of consistency r since we showed that
Y(t_{n+1}) = Y(tn) + h Σ_{k=1}^{r} (1/k!) h^{k−1} Y_n^k + O(h^{r+1}) .
This example also demonstrates that there exist explicit one step methods of arbitrary order of
consistency.
The forward Euler method is consistent of order 1 and for all the two step methods satisfying the
conditions in Example 4.3 the order of consistency is two.
Remark. For Runge-Kutta methods one can formulate order conditions that allow one to determine the
order of consistency of the method based on the Butcher tableau. For an m stage Runge-Kutta
method with αi = Σ_{j=1}^{m} βij the following conditions have to hold for the method to be of at least order
p ≥ 1: Σ_{i=1}^{m} γi = 1
p ≥ 2: Σ_{i=1}^{m} αi γi = 1/2
p ≥ 3: Σ_{i=1}^{m} αi² γi = 1/3 and Σ_{i=1}^{m} Σ_{j=1}^{m} γi βi,j αj = 1/6
For larger p the number of conditions grows fast. The order conditions can be used for example
to show that the diagonally implicit RK method is at least of order three.
We would expect that the maximum error E(h) is roughly equal to the sum of the local truncation
errors. Since there are N = O(h^{−1}) steps, a method which is consistent of order p would thus
have an O(h^{p+1}) error in each step, which would then mean that the maximum error should be
roughly O(h^{−1})O(h^{p+1}) = O(h^p). This is in fact the case for h small enough as we will show in
the following. For this we need a discrete version of Gronwall's lemma:
Lemma 4.14. (discrete Gronwall lemma) Let zn ∈ R+ satisfy
zn+1 ≤ Czn + D, ∀n ≥ 0
Example 4.16. For the methods we derived in the previous section we have to verify that ϕ is
Lipschitz. This is always true if f is uniformly Lipschitz in y and we assume h ≤ h0. Take for
example an m stage Runge-Kutta method:
ϕ(t, y; h) = Σ_{i=1}^{m} γi ki(t, y; h)
ki(t, y; h) = f(t + αi h, y + h Σ_{l=1}^{i−1} βi,l kl(t, y; h)) , i = 1, . . . , m.
For k1 we find Lk1 = Lf, thus we can bound the Lipschitz constant for k2 by Lk2 = (1 + h0 β2,1)Lf
and Lk3 = (1 + h0 β3,1 + h0 β3,2 + h0² β3,2 β2,1)Lf. Each Lki will remain bounded by some constant
depending on h0 and the coefficients of the Runge-Kutta method.
Proof. The proof is almost the same as in the purely explicit case. Assume that τn(h) ≤ M h^{p+1} and
h ≤ 1/(2Lϕ̃). Then one obtains, as in the explicit case,
e_{n+1} ≤ C en + D
with C = (1 + hLϕ̃)/(1 − hLϕ̃) > 1 and D = M h^{p+1}/(1 − hLϕ̃) > 0 since 1 − hLϕ̃ > 1/2 > 0. Using Gronwall and
1/(1 − hLϕ̃) ≤ 2 the result follows.
Remark. In general the error estimates derived for the convergence proofs overestimate the error
considerably, since exp(LT) grows rapidly with increasing T - so the estimate is only reliable for
small T. Under additional assumptions on f more accurate estimates can be derived.
For the model problem y' = −λy with λ > 0 the exact solution decays monotonically to zero, so we
ask under which conditions the numerical solution satisfies
lim_{n→∞} yn = 0
and
yn > y_{n+1} .
Example 4.19. For the forward Euler method we have y_{n+1} = (1 − λh)yn so that yn = y0(1 − λh)^n.
Therefore lim_{n→∞} yn = 0 iff |1 − λh| < 1. Since λ > 0 is fixed, this means that the step size h has
to be small enough: h < 2/λ.
Again using y_{n+1} = (1 − λh)yn we see that y_{n+1} < yn can only hold if 1 − λh ∈ [0, 1). Since
1 − λh < 1 always holds, this requires that we choose h so that h ≤ 1/λ.
If we do the same analysis for the backward Euler method, y_{n+1} = yn − λh y_{n+1}, we find
y_{n+1} = yn/(1 + λh), i.e., yn = (1 + λh)^{−n} y0. Since lim_{n→∞}(1 + λh)^{−n} = 0 for all choices of h, the backward
Euler method will show the correct long time behavior independent of the step size h. To obtain
monotonicity, we need yn/(1 + λh) < yn which also holds for any h > 0 since we assumed that λ > 0.
The forward Euler method is said to be conditionally stable while the backward method is unconditionally
stable, i.e., there is no restriction on the time step.
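This step size restriction is easy to observe numerically. A small sketch (λ and h are arbitrary illustrative values, chosen with h > 2/λ):

```python
# Explicit vs implicit Euler for y' = -lam*y with a step size violating
# the forward Euler stability bound h < 2/lam:
lam, h, N = 10.0, 0.5, 20          # h = 0.5 > 2/lam = 0.2
y_fe = y_be = 1.0
for n in range(N):
    y_fe = (1.0 - lam * h) * y_fe  # forward Euler: factor 1 - lam*h = -4
    y_be = y_be / (1.0 + lam * h)  # backward Euler: factor 1/6, decays
print(abs(y_fe), abs(y_be))        # blows up vs. decays towards 0
```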
Lemma 4.20. Consider a general m stage Runge-Kutta method given by the coefficient vectors
α, γ ∈ R^m and a matrix β ∈ R^{m×m}. Applying this to the linear ODE y' = λy leads to a scheme
of the form
y_{n+1} = (P(λh)/Q(λh)) yn
where P, Q are polynomials of degree not more than m. If the method is explicit then Q ≡ 1.
The rational function R(ρ) = P(ρ)/Q(ρ) takes on the form
R(ρ) = 1 + ρ Σ_{i=1}^{m} Σ_{j=1}^{m} γi ((I − ρβ)^{−1})_{ij} = 1 + ρ Σ_{i=1}^{m} γi ((I − ρβ)^{−1} e)_i
where e = (1, . . . , 1)^T.
and
kj(t, y; h) = f(t + αj h, y + h Σ_{l=1}^{m} βj,l kl(t, y; h)) = λy + λh Σ_{l=1}^{m} βj,l kl(t, y; h) .
The corresponding matrix is I − ρβ and the right hand side is the vector ρ yn e with entries ρ yn.
Thus
κ = (I − ρβ)^{−1} e ρ yn .
With this we have shown
y_{n+1} = yn + Σ_{i=1}^{m} γi κi = (1 + ρ Σ_{i=1}^{m} γi ((I − ρβ)^{−1} e)_i) yn = R(ρ) yn .
Since κ1 = ρ yn for an explicit method, it follows by induction that each κj is a polynomial of degree j
and that concludes the proof.
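The formula for R just derived can be evaluated numerically from a Butcher tableau. A small numpy sketch, checked against the known stability functions R(z) = 1/(1 − z) of backward Euler and R(z) = 1 + z of forward Euler:

```python
import numpy as np

def stability_function(beta, gamma, z):
    # Evaluate R(z) = 1 + z * gamma^T (I - z*beta)^{-1} e for an RK method.
    m = len(gamma)
    e = np.ones(m)
    return 1.0 + z * (gamma @ np.linalg.solve(np.eye(m) - z * beta, e))

z = -3.0
# Backward Euler (beta = [[1]], gamma = [1]): R(z) = 1/(1 - z) = 0.25
print(stability_function(np.array([[1.0]]), np.array([1.0]), z))
# Forward Euler (beta = [[0]], gamma = [1]): R(z) = 1 + z = -2.0
print(stability_function(np.array([[0.0]]), np.array([1.0]), z))
```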
We can study the problems stated above:
Theorem 4.21. Consider a one step method which takes on the form
y_{n+1} = R(λh) yn
when applied to f(t, y) = λy. Then the method has the correct long time behavior if |R(λh)| < 1 and
is monotone if 0 < R(λh) < 1.
[Figure 4.3: Plot of the function R(z) for different RK methods (explicit and implicit Euler,
midpoint, 2 stage Heun, 2 stage Gauss, 2 stage diagonally implicit, classical RK, a 6th order
implicit method, and a three stage third order DIRK). Also shown is the approximation property
of R to the exponential and the behaviour for z → −∞.]

Remarks:
The above construction of R and subsequent arguments work in exactly the same way if we also
include complex valued λ with Re λ < 0. In this case we again get a stability region for each
method in which the complex value R(z) has modulus less than 1.
Definition 4.22. Consider a one step method which takes on the form
y_{n+1} = R(λh) yn
when applied to y' = λy. The stability region of the method is the set
S_R = {z ∈ C : |R(z)| < 1} .
Recall the two diagonally implicit third order methods from Example 4.5: the two stage method
1/3 | 1/3 0
1   | 1   0
----+--------
    | 3/4 1/4
and the four-stage, 3rd order diagonally implicit method
1/2 | 1/2  0    0   0
2/3 | 1/6  1/2  0   0
1/2 | −1/2 1/2  1/2 0
1   | 3/2  −3/2 1/2 1/2
----+--------------------
    | 3/2  −3/2 1/2 1/2
By checking the order conditions you can see that they are both at least third order - and in fact
they are not more than third. But the second one uses twice as many stages as the first, which
means it is at least twice as expensive. Looking at the two methods more closely one realizes that
the two stage method is only implicit in the first stage while it is explicit in the second. So per
step it requires exactly as many applications of Newton's method as the implicit Euler method
but is third order. The second method is implicit in each stage, so it requires four applications
of Newton's method, which makes it more expensive. So we have a first order method which is
hardly less expensive than the two stage third order method, which in turn is much cheaper than the four
stage third order method. So why not always use the two stage method? The answer is stability.
Both the four stage method and the backward Euler method are L-stable (check the previous plots).
For the two stage method on the other hand R(z) > 1 for z < −6, so it does not have nearly the
stability of the other two, requiring a reduced time step for really large negative λ. So depending on
the problem the four stage method could be a lot more efficient to use.
Here are the stability regions in the complex plane for the two third order diagonally implicit Runge-Kutta
(DIRK) methods. The third plot shows the stability region for the two stage fully
implicit Gauss method (see Butcher tableau at the beginning of this chapter). Here in each step a
more complicated nonlinear problem has to be solved.
[Figure: stability regions of the two DIRK methods and of the two stage Gauss method in the
complex plane.]
Note that the two right pictures show A-stable methods, so the whole left half of the complex plane is covered
in contour lines. In the middle case a lot of the right half plane is also covered. For the right method
it is unclear if the imaginary axis (or at least some part around the origin) is contained in the
stability region.
Here are the stability regions for the forward Euler method (again), for Heun’s method (compare to
what you know from the assignments), and the final example is the classical four stage Runge-Kutta
method:
[Figure: stability regions for the explicit Euler method, the Heun method, and the classical RK4
method, plotted on roughly [−3, 1] × [−2, 2].]
For the classical RK4 method it is again unclear if some part of the imaginary axis around the
origin is contained in the stability region. Note the difference in the x axis scaling between these
plots and the previous plots for the implicit methods.
Chapter 5
5.1 Linearization
The perhaps simplest approach to simplifying a given mathematical model is to linearize it around a
given ground state. The assumption is that the actual solution to the original problem is close to
this ground state at all times so that the linearized model is a good description.
Take an ODE model
y'(t) = f(t, y(t)) , y(0) = y0
and a known ground state Ŷ = Ŷ(t). Assume that the exact solution Y to the ODE is of the form
Y(t) = Ŷ(t) + Ỹ(t) where the perturbation Ỹ(t) around the ground state is assumed to be small
for all t. Then
Y'(t) = f(t, Y(t)) = f(t, Ŷ(t) + Ỹ(t)) = f(t, Ŷ(t)) + ∂y f(t, Ŷ(t))Ỹ(t) + O((Ỹ(t))²) .
Also Y'(t) = (d/dt)Ŷ(t) + (d/dt)Ỹ(t) so that we arrive at a linear ODE for Ỹ:
(d/dt)ỹ = Â(t)ỹ(t) + Ĉ(t)
with Â(t) = ∂y f(t, Ŷ(t)) and Ĉ(t) = f(t, Ŷ(t)) − (d/dt)Ŷ(t). Note that Ỹ solves this ODE up to the
neglected O((Ỹ(t))²) term which we assumed to be small.
Example 5.1 (Linearization around a fixed point). A special case is the linearization around a
fixed point for a homogeneous right hand side f(t, y) = f(y), i.e., taking Ŷ(t) such that (d/dt)Ŷ(t) =
0 (Ŷ is constant) and f(Ŷ(t)) = 0. In this case Ĉ(t) = 0. Then the linearized system is a
homogeneous linear ODE:
(d/dt)ỹ = ∂y f(Ŷ) ỹ(t) .
This is a special case of linearizing around the solution to the nonlinear problem: assume that a
solution Ŷ(t) is known, i.e.,
(d/dt)Ŷ(t) = f(t, Ŷ(t)) .
This again means that Ĉ ≡ 0. Now consider an initial condition y0 = Ŷ(0) + ỹ0 close to the initial
value of the ground state; then the linearized system turns into
(d/dt)ỹ = Â(t)ỹ(t) , ỹ(0) = ỹ0 .
CHAPTER 5. SIMPLIFYING MATHEMATICAL MODELS 63
Example 5.2. Consider the SIR model from epidemiology discussed in chapter 3:
(d/dt)S = −kSI , (d/dt)I = kSI − γI , (d/dt)R = γI .
Note that R decouples and can be computed once I is known. Consequently, we only need to consider
the first two equations. In this case f(S, I) = (−kSI, kSI − γI) is independent of t. The Jacobian
with respect to (S, I) is given by
∂y f(S, I) = ( −kI   −kS    )
             (  kI   kS − γ )
Linearizing around a fixed point (Ŝ, Î) thus leads to the linear system:
(d/dt) ( S̃ ) = ( −kÎ   −kŜ    ) ( S̃ )
       ( Ĩ )   (  kÎ   kŜ − γ ) ( Ĩ ) .
There is another way of arriving at this equation which does not use Taylor expansion and can
sometimes be easier, especially for large systems:
(d/dt)S̃ = (d/dt)S = −kSI = −k(Ŝ + S̃)(Î + Ĩ) = −kŜÎ − kŜĨ − kS̃Î − kS̃Ĩ .
Since we are assuming that S̃, Ĩ are small, the product of the two is even smaller and can be
neglected. Furthermore, since (Ŝ, Î) is a fixed point, −kŜÎ = 0. So we arrive at
(d/dt)S̃ = −kŜĨ − kÎS̃
which is the same as the first equation we derived using Taylor expansion. The same approach can
be used for the equation for I.
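The quality of such a linearization is easy to check numerically. A small numpy sketch (the parameter values k = 0.3, γ = 0.1 and the fixed point (Ŝ, Î) = (1, 0) are illustrative choices): the difference between the nonlinear right hand side and its linearization is of the order of the squared perturbation.

```python
import numpy as np

# Nonlinear SIR right hand side (R decoupled) and its Jacobian;
# k, gam and the fixed point are illustrative values:
k, gam = 0.3, 0.1
f = lambda S, I: np.array([-k * S * I, k * S * I - gam * I])
J = lambda S, I: np.array([[-k * I, -k * S], [k * I, k * S - gam]])

S_hat, I_hat = 1.0, 0.0      # fixed point: f(1, 0) = (0, 0)
A = J(S_hat, I_hat)          # matrix of the linearized system

eps = 1e-4                   # small perturbation (S~, I~)
pert = np.array([eps, eps])
exact  = f(S_hat + eps, I_hat + eps)
linear = f(S_hat, I_hat) + A @ pert
err = np.linalg.norm(exact - linear)
print(err)                   # O(eps^2): only the quadratic terms are neglected
```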
Example 5.3 (Pendulum). Consider again the equation for a pendulum discussed in Chapter 3:
θ'' = −(g/l) sin(θ) .
Assuming the angle θ is small we can linearize around the ground state θ̂ = 0. Since we assumed
θ̃ = θ − θ̂ = θ is small we can write sin θ ≈ θ̃ which leads to the linearized ODE
θ̃'' = −(g/l) θ̃ .
Of course we can arrive at the same equation using the above Taylor series approach:
(d/dt)θ̃ = Â(t)θ̃(t) + Ĉ(t)
with f(t, θ) = −(g/l) sin(θ) we have
Â(t) = ∂y f(t, θ̂(t)) = −(g/l) cos(θ̂) = −(g/l) .
Example 5.6 (Pendulum). Consider again the equation for a pendulum discussed in Chapter 3:
θ'' = −(g/l) sin(θ) .
As pointed out above θ is already dimensionless. We have [l] = [length] and [g] = [acceleration] =
[length/time²]. So the right hand side has dimension 1/time² which matches the dimension on the
left (each time derivative gives a 1/time as we saw above). So we only need to non-dimensionalize
time by prescribing some time scale T. Then with τ = t/T and y(τ) = θ(Tτ) (note y is already
dimensionless so we do not need a scale for that):
y''(τ) = T² θ''(Tτ) = −(T²g/l) sin(θ(Tτ)) = −(T²g/l) sin(y(τ)) = −Π1 sin(y(τ)) .
Using the arguments made above we already know that Π1 = T²g/l is dimensionless. Now it makes
sense to choose the time scale T in such a way that Π1 = 1, i.e., T = √(l/g). This is, up to a factor
of 2π, the period of the linearized pendulum.
Let us now work through a more complex problem which also demonstrates the flexibility (or
complexity) of this approach:
Example 5.7 (Projectile motion). Consider a projectile of mass M kilograms that is launched
vertically with initial speed V0 meters per second, from a position Y0 meters above the surface of
the Earth. Newton's law of gravitation coupled with the second law of motion then gives that the
height of the projectile Y(T) varies with time T according to the ODE
M d²Y/dT² = −G ME M/(RE + Y)² , Y(0) = Y0 , Y'(0) = V0 .
Here G is the gravitational constant G = 6.7 × 10⁻¹¹ m³/(s² kg) and the Earth's mass and radius
are given by ME = 6 × 10²⁴ kg, RE = 6.4 × 10⁶ m, respectively. Note the very different orders
of magnitude involved in the different parameters describing the system. First note that we can
reduce the number of parameters by introducing g = G ME/RE² ≈ 9.81 m/s² (which looks more
familiar):
M d²Y/dT² = −M g RE²/(RE + Y)² , Y(0) = Y0 , Y'(0) = V0 ,
or equivalently
d²Y/dT² = −g/(1 + Y/RE)² , Y(0) = Y0 , Y'(0) = V0 .
Let's now introduce some (as yet arbitrary) length scale L and time scale 𝕋 and write T = t𝕋
and Y(T) = y(T/𝕋)L. Consequently t, y(t) are dimensionless. The chosen scaling constants (here
L, 𝕋) are called the characteristic (length and time) scales. First we note that using the chain rule
Y'(T) = d/dT (y(T/𝕋)L) = y'(T/𝕋) L/𝕋 = (L/𝕋) y'(t)
so that our ODE becomes
(L/𝕋²) d²y/dt² = −g/(1 + (L/RE)y)² , L y(0) = Y0 , (L/𝕋) y'(0) = V0 ,
or, with the dimensionless combinations Π0 = Y0/L, Π1 = V0𝕋/L, Π2 = g𝕋²/L, Π3 = L/RE,
d²y/dt² = −Π2/(1 + Π3 y)² , y(0) = Π0 , y'(0) = Π1 .
In the original problem the solution Y(T) depended on the four parameters g, RE, Y0, V0 and on
the time T, while after fixing the characteristic scales the dimensionless solution y(t) only depends
on two dimensionless parameters π0, π1 and on t.
In dimensionless form it is often much easier to analyse certain limiting cases by considering
how the model simplifies if one of the dimensionless quantities Πi → 0 or Πi → ∞. Recall that
these constants describe the ratio of different competing effects, so these limits can be viewed as
modelling the situation where one of these effects dominates over the other.
Example 5.9 (continued). For example consider the case that Y0 is very small compared to RE, or in other words RE → ∞ while Y0 remains fixed. With this scaling π1 = 0 and our ODE is

    d²y/dt² = −1 ,   y(0) = 1 , y′(0) = π0 ,
which has the solution y(t) = 1 + π0·t − t²/2, or, reverting the non-dimensionalization,

    Y = Y0 + V0·T − (g/2)·T² .
Another choice of characteristic scales is to set Π1 = 1 and Π2 = 1, i.e., 𝒯 = V0/g and L = V0²/g, which leads to

    d²y/dt² = −1/(1 + π1·y)² ,   y(0) = π0 , y′(0) = 1 ,

with

    π0 = Y0·g/V0² ,   π1 = V0²/(g·RE) .
In the limit π0 → 0 (launch from the surface) and π1 → 0 (RE → ∞) this reduces to

    d²y/dt² = −1 ,   y(0) = 0 , y′(0) = 1 .
The solution to this in the original variables is

    Y = V0·T − (g/2)·T² .

Assuming V0 > 0, the time of flight of the projectile is 2V0/g = 2𝒯, while the maximum altitude reached is V0²/(2g) = L/2, so our choice of characteristic scales seems quite natural.
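The claimed time of flight and maximum altitude are easy to confirm numerically; a small Python sketch integrating the dimensionless problem y″ = −1, y(0) = 0, y′(0) = 1 step by step (the function name is mine):

```python
# Step the dimensionless projectile ODE y'' = -1, y(0)=0, y'(0)=1 forward
# in time and record the time of flight and the maximum height.
def integrate_projectile(dt=1e-4):
    t, y, v, y_max = 0.0, 0.0, 1.0, 0.0
    while True:
        # For constant acceleration -1 this Taylor update is exact per step:
        # y_{n+1} = y_n + v_n*dt - dt^2/2,  v_{n+1} = v_n - dt.
        y_new = y + v * dt - 0.5 * dt * dt
        v -= dt
        t += dt
        y_max = max(y_max, y_new)
        if y_new < 0.0:          # projectile has hit the ground
            return t, y_max
        y = y_new

t_flight, y_max = integrate_projectile()
print(t_flight, y_max)  # close to 2 and 1/2, as predicted
```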
Example 5.11 (continued). Our final choice will be to take Π1 = 1 and Π3 = 1, i.e., L = RE and 𝒯 = RE/V0; then we are left with π0 = Y0/RE and π2 = g·RE/V0². Using the values for g, RE given above we find π2 ≈ 6.27 · 10⁷ m²/s² / V0², so that ε := 1/π2 = V0²/(g·RE) ≈ V0² · 1.6 · 10⁻⁸ s²/m² is very small for suitable choices of V0. Again assuming that Y0 is far smaller than RE, i.e., π0 = 0, the ODE becomes

    ε·y″ = −1/(1 + y)² ,   y(0) = 0 , y′(0) = 1 .
As pointed out, ε is very small, but sending that constant to zero is problematic: we would then be left with an algebraic equation for y, which does not seem to make any sense. This is called a singular perturbation problem and will be discussed in the next section.
Of course, if we were modelling only the part of the ocean in the English Channel this might not be a reasonable assumption. Just choosing the parameter to be zero might lead to too crude a model, although it can sometimes lead to some useful insight into the problem. A simplified model retaining more details can be arrived at by perturbation methods.
Perturbation methods can be used for a wide range of problems involving a small scale parameter ε, e.g., for algebraic and (partial) differential equations. The idea is to assume that the solution can be written as a sum involving powers of the parameter ε, e.g., x(t) = x0(t) + εx1(t) + ε²x2(t) + . . . , and to derive a system of equations for the functions xi(t) by substituting the expansion into the model and matching equal powers of ε. Often a small number of functions xi is enough to get a model that is very close to the original. The mathematical justification of this approach is often referred to as asymptotic analysis.
Definition 5.12 (Asymptotically equivalent). Two functions f(ε), g(ε) are asymptotically equivalent for ε → 0, written f ∼ g, if

    lim_{ε→0} f(ε)/g(ε) = 1 .
Remark. f ∼ g does not only mean that the two functions have the same limit for ε → 0 but that they also approach that limit at the same rate. So for example sin(ε) ∼ exp(ε) − 1, but sin(ε) ∼ ε² does not hold, although both still converge to 0 for ε → 0.
Many functions are asymptotically equivalent to each other, e.g., cos(√ε) ∼ cos(ε) ∼ (1 − ε/2) ∼ exp(ε). In fact ∼ defines an equivalence relation on functions.
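These equivalences can be checked symbolically; a short sketch assuming SymPy is available (the script itself is not part of the notes):

```python
# Symbolic check of the asymptotic equivalences claimed above.
import sympy as sp

eps = sp.symbols('epsilon', positive=True)

# sin(eps) ~ exp(eps) - 1: the ratio tends to 1.
print(sp.limit(sp.sin(eps) / (sp.exp(eps) - 1), eps, 0))    # 1
# sin(eps) is NOT equivalent to eps**2: the ratio blows up.
print(sp.limit(sp.sin(eps) / eps**2, eps, 0))               # oo
# cos(sqrt(eps)) ~ exp(eps): both tend to 1 at the same rate.
print(sp.limit(sp.cos(sp.sqrt(eps)) / sp.exp(eps), eps, 0)) # 1
```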
Definition 5.13 (Asymptotic expansion). A function x(t) = x(t; ε) depending on a parameter ε has an asymptotic expansion in this parameter if

    x(t; ε) = Σ_{i=0}^∞ δi(ε) xi(t)

for small enough ε. Here it is assumed that all xi = O(1) w.r.t. ε and the so-called gauge functions δi are asymptotically ordered, i.e., δ_{i+1}(ε) = o(δi(ε)) for ε → 0. In most cases the gauge functions satisfy δi = O(ε^{ki}) with k0 < k1 < k2 < . . . .
The first term δ0 x0 is referred to as the leading order term and we have x ∼ δ0 x0 as ε → 0.
Remark. More generally we have:

    x(t; ε) ∼ δ0 x0 + δ1 x1 + δ2 x2 ,
    x(t; ε) = δ0 x0 + δ1 x1 + O(δ2) ,
    x(t; ε) = δ0 x0 + δ1 x1 + o(δ1) ,

and analogous relations hold for any partial sum of the expansion.
The most common example of an asymptotic expansion is given by the Taylor series.
Depending on the function x we can have three types of expansions:
• regular: with limε→0 x = x0, in which case δ0 = 1.
• vanishing: with limε→0 x = 0, in which case δ0 ≪ 1.
• singular: with limε→0 x = ∞, in which case δ0 ≫ 1.
Example 5.14. Consider the quadratic equation x² − x + ε/4 = 0 with the two solutions

    x+ = 1 − (1/4)ε − (1/16)ε² + · · · = O(1) ,
    x− = 0 + (1/4)ε + (1/16)ε² + · · · = O(ε) .
The problem with ε = 0, i.e., x² − x = 0, has the solutions x = 0, 1, which are the leading terms in the expansion of x±. So the ε term only leads to a slight change in the position of the roots, which are mostly determined by the balance of the x² and the x term in the equation. This is a regular problem where ε → 0 leads to well-defined limits.
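One can confirm numerically that the perturbed roots stay close to these expansions; a small Python check (the value of ε is my choice for illustration):

```python
import math

# Exact roots of x^2 - x + eps/4 = 0 versus the perturbation expansions
# x+ ~ 1 - eps/4 - eps^2/16 and x- ~ eps/4 + eps^2/16.
eps = 1e-2
x_plus = (1 + math.sqrt(1 - eps)) / 2
x_minus = (1 - math.sqrt(1 - eps)) / 2

approx_plus = 1 - eps / 4 - eps**2 / 16
approx_minus = eps / 4 + eps**2 / 16

print(abs(x_plus - approx_plus))   # O(eps^3)
print(abs(x_minus - approx_minus))
```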
In general we do not know what the solution of a problem will look like, so choosing the right gauge functions can be tricky and can require a good understanding of the characteristics of the problem (or a lot of trial and error). In this case we start by assuming that we have a regular problem, in which case an expansion of the form x = x0 + εx1 + ε²x2 + . . . is a good way to start. Inserting this into our quadratic equation and combining terms with equal powers of ε leads to

    (x0² − x0) + ε(2x1x0 − x1 + 1/4) + ε²(x1² + 2x0x2 − x2) + · · · = 0 .
If all xi are O(1) we arrive at a system of equations:

    O(1) :  x0² − x0 = 0            ⇒ x0 = 0 or x0 = 1
    O(ε) :  2x1x0 − x1 + 1/4 = 0    ⇒ x1 = 1/4 or x1 = −1/4
    O(ε²) : x1² + 2x0x2 − x2 = 0    ⇒ x2 = 1/16 or x2 = −1/16
and so on. Note that only the leading order equation is nonlinear; the other equations are linear in xi but depend on the previous xj (j < i). Comparing the computed expansions for x± with the ones based on the Taylor expansion of the exact solutions given above shows that we recovered the correct coefficients.
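This comparison with the Taylor expansion of the exact roots can be automated; a sketch assuming SymPy is available:

```python
import sympy as sp

# Taylor-expand the exact roots x± = (1 ± sqrt(1 - eps))/2 of
# x^2 - x + eps/4 = 0 and compare with the perturbation coefficients.
eps = sp.symbols('epsilon')
x_plus = (1 + sp.sqrt(1 - eps)) / 2
x_minus = (1 - sp.sqrt(1 - eps)) / 2

print(sp.series(x_plus, eps, 0, 3))   # 1 - eps/4 - eps**2/16 + O(eps**3)
print(sp.series(x_minus, eps, 0, 3))  # eps/4 + eps**2/16 + O(eps**3)
```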
In the next example we apply the same approach to an ODE problem:
Example 5.15. Consider the solution to the ODE

    d²x/dt² = −1/(1 + εx)² ,   x(0) = 1 , x′(0) = α ,
which is the equation for a projectile which we introduced previously. We are interested in the
solution for ε → 0. We start with the asymptotic expansion x(t) = x0 (t) + εx1 (t) + ε2 x2 (t) + . . .
which we insert into the ODE and the initial conditions. The first initial condition gives us 1 = x(0) = x0(0) + εx1(0) + . . . , so x0(0) = 1 and xi(0) = 0 for i > 0. The second initial condition gives us in the same way x0′(0) = α and xi′(0) = 0 for i > 0. For the left hand side of the ODE we of course simply get x″(t) = x0″(t) + εx1″(t) + ε²x2″(t) + . . . , while for the right hand side we first use the Taylor expansion of (1 + z)^{−2}, which gives us

    −1/(1 + εx)² = −1 + 2εx − 3ε²x² + 4ε³x³ − · · · ,
into which we finally insert our asymptotic expansion of x and combine terms with equal powers of ε:

    −1/(1 + εx)² = −1 + 2ε(x0 + εx1 + ε²x2) − 3ε²(x0 + εx1)² + 4ε³x0³ + O(ε⁴)
                 = −1 + 2ε(x0 + εx1 + ε²x2) − 3ε²(x0² + 2εx0x1) + 4ε³x0³ + O(ε⁴)
                 = −1 + 2εx0 + ε²(2x1 − 3x0²) + ε³(2x2 − 6x0x1 + 4x0³) + O(ε⁴) .
Now combining terms with the same order of ε leads to a system of ODEs:

    O(1) :  x0″ = −1 ,           x0(0) = 1 , x0′(0) = α ,
    O(ε) :  x1″ = 2x0 ,          x1(0) = 0 , x1′(0) = 0 ,
    O(ε²) : x2″ = 2x1 − 3x0² ,   x2(0) = 0 , x2′(0) = 0 ,

and so on. Now we can solve these problems in sequence since they are decoupled, i.e., the equation at scale O(εⁿ) only depends on the xi with i ≤ n but not on those with i > n. Also the ODE at scale O(εⁿ) is linear in xn, so it is easy to solve once the previous xi have been computed.
At leading order we get x0(t) = −t²/2 + αt + 1; then x1 solves x1″ = −t² + 2αt + 2 with zero initial conditions, which has the solution x1(t) = −t⁴/12 + (α/3)t³ + t². Our asymptotic expansion thus has the form

    x(t) = −t²/2 + αt + 1 + ε(−t⁴/12 + (α/3)t³ + t²) + O(ε²) .
Note that the expansion above requires |x0(t)| ≫ ε|x1(t)| to make sense (the gauge functions have to lead to a separation of the terms in the asymptotic expansion). This stops being the case for t = O(ε^{−1/2}). This effect is common in asymptotic expansions and indicates a change in the nature of the problem: a change in the scaling regime occurs, requiring the use of a different form for the expansion.
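Within its range of validity the two-term expansion can be checked against a direct numerical solution of the ODE; a self-contained RK4 sketch (the values of ε, α and the step count are my choices, not from the notes):

```python
# Compare the two-term expansion of x'' = -1/(1+eps*x)^2, x(0)=1, x'(0)=alpha
# with a direct RK4 solution of the same initial value problem.

def rhs(x, v, eps):
    """First-order system for x'' = -1/(1+eps*x)^2."""
    return v, -1.0 / (1.0 + eps * x) ** 2

def rk4_solve(eps, alpha, t_end, n):
    dt = t_end / n
    x, v = 1.0, alpha
    for _ in range(n):
        k1x, k1v = rhs(x, v, eps)
        k2x, k2v = rhs(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v, eps)
        k3x, k3v = rhs(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v, eps)
        k4x, k4v = rhs(x + dt * k3x, v + dt * k3v, eps)
        x += dt / 6.0 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v += dt / 6.0 * (k1v + 2 * k2v + 2 * k3v + k4v)
    return x

eps, alpha, t = 0.01, 1.0, 1.0
x_num = rk4_solve(eps, alpha, t, 1000)
x0 = -t**2 / 2 + alpha * t + 1             # leading order term
x1 = -t**4 / 12 + alpha * t**3 / 3 + t**2  # first correction
print(abs(x_num - (x0 + eps * x1)))        # difference is O(eps^2)
```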
Example 5.17. Consider the quadratic equation

    εx² − 2x + 1 = 0 ,

which has the two solutions x± = (1 ± √(1 − ε))/ε. We can expand this into x± ∼ (1/ε)·(1 ± (1 − ε/2 − ε²/8 + O(ε³))) (check), so that

    x+ ∼ 2/ε − 1/2 − ε/8 ,   x− ∼ 1/2 + ε/8 + ε²/16 .
Now take ε = 0 in the quadratic equation, i.e., study the leading order equation −2x + 1 = 0, which has only the single solution x⋆ = 1/2. So we only recover the leading order term of the x− solution and miss the second solution, which has a singular behaviour for ε → 0 and cannot be expressed as an expansion of the form x = x0 + εx1 + ε²x2 + . . . . Substituting the leading terms of the approximations for x± back into the quadratic equation εx² − 2x + 1 = 0 shows where the problem arises:

    ε·(4/ε²) − 2·(2/ε) + 1 = 4/ε − 4/ε + 1 = 1 ,   ε·(1/4) − 2·(1/2) + 1 = ε/4 .

The solutions are due to different terms in the quadratic equation balancing: for x+ the first two terms balance at O(1/ε) while the third term is O(1); for x− the second and third terms are both O(1) while the first term is O(ε). The latter is the regular solution obtained from the leading order equation.
We can extend the ideas from regular perturbation problems in the following way: instead of substituting an expansion U = u0 + εu1 + ε²u2 + . . . as done in the regular perturbation problem, substitute δ(ε)U. Then choose δ(ε) to obtain consistent dominant balances in the equation and make sure that all neglected terms are really subdominant. Different choices of δ(ε) will lead to different dominant balances; in the above example δ(ε) = 1/ε and δ(ε) = 1. For each choice of δ(ε), factoring out common ε factors will lead to a regular perturbation problem which can be used to determine subsystems for u0, u1, u2, . . . . This systematic approach will result in a full overview of the regular and singular solutions. Let us try this approach on the quadratic equation from our previous example:
Example 5.17 (cont.). We start by inserting δ(ε)X into the equation:

    ε·δ(ε)²·X² − 2·δ(ε)·X + 1 = 0 .

We compare the orders of magnitude of each of these terms for different choices of δ, keeping in mind that X is O(1):
1. Term (2) = Term (3): δ = 1. The neglected first term is in this case O(ε), so it is subdominant. The quadratic equation in this case is εX² − 2X + 1 = 0. Substituting a regular expansion into that equation, e.g., X = X0 + εX1 + ε²X2, leads to the following system of equations: −2X0 + 1 = 0, X0² − 2X1 = 0, 2X0X1 − 2X2 = 0. So we get X = 1/2 + (1/8)ε + (1/16)ε², which are the first terms of the regular solution x−.
2. Term (1) = Term (3): εδ² = 1 ⇒ δ = 1/√ε. The first and third terms are O(1) while the neglected second term is of order O(1/√ε) ≫ 1, so it is not subdominant. This choice is inconsistent.
3. Term (1) = Term (2): εδ² = δ ⇒ δ = 1/ε. The neglected third term is O(1) while the two other terms are O(1/ε) ≫ 1, so the neglected term is subdominant. The quadratic equation in this case is (1/ε)(X² − 2X + ε) = 0, or equivalently X² − 2X + ε = 0. Substituting a regular expansion into that equation, e.g., X = X0 + εX1 + ε²X2, leads to the system: X0² − 2X0 = 0, 2X0X1 − 2X1 + 1 = 0, 2X0X2 + X1² − 2X2 = 0. So X0 = 2 or X0 = 0. Note that X0 = 0 is not a consistent choice because then X is not O(1) but O(ε), which means that Term (1) and Term (2) would not balance anymore. So we need to take X0 = 2. With that choice X1 = −1/2 and X2 = −1/8, and we have recovered the first three terms of the singular solution x+.
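Both asymptotic solutions can be compared against the numerically computed roots of εx² − 2x + 1 = 0; a quick Python check (the value of ε is my choice):

```python
import math

# Exact roots of eps*x^2 - 2x + 1 = 0 versus the two asymptotic solutions:
# regular root x- ~ 1/2 + eps/8 + eps^2/16, singular root x+ ~ 2/eps - 1/2 - eps/8.
eps = 1e-3
sqrt_disc = math.sqrt(1 - eps)
x_plus = (1 + sqrt_disc) / eps
x_minus = (1 - sqrt_disc) / eps

print(abs(x_minus - (0.5 + eps / 8 + eps**2 / 16)))  # tiny
print(abs(x_plus - (2 / eps - 0.5 - eps / 8)))       # O(eps^2)
```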
In dynamic problems singular perturbations lead to boundary layers. These are parts of the time line where the solution behaves very differently from the solution in the rest of the domain. The term boundary layer comes from fluid dynamics, where it describes the layer close to a boundary generated by a fluid flowing over a rough surface: the flow is very different near the boundary compared to the rest of the domain. But this multiscale effect appears in many applications; think for example of a tornado, where high wind speeds coexist right next to large regions where the weather is completely calm. How narrow the boundary layer is depends on the size of some (non-dimensional) parameter in the model.
Example 5.18. Consider the ODE y′ = (1/ε)(sin t − y) with initial condition y(0) = 1, which has the exact solution

    y(t) = (1 + ε²)⁻¹·(sin t − ε cos t) + C·e^{−t/ε} ,

where C can be used to fix the initial condition. So the solution consists of two parts: the first is basically the curve sin t, which the solution reaches after a very short transition phase ("boundary layer") for t close to zero. So we first do a standard asymptotic expansion y ∼ y0 + εy1:
substituting this into εy′ + y = sin t gives y0 = sin t (ε⁰ terms) and y1 = −y0′ = −cos t (ε¹ terms). So the leading order term represents the slow manifold solution sin t, but note that there are no free parameters to choose and so no initial condition we can fix. This is a recurring feature of singularly perturbed problems, since the nature of the model changes: in this case from a first order ODE (which has one free parameter for the initial condition) to an algebraic equation (which has no free parameter).
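The two-term slow-manifold expansion can be verified symbolically; a sketch assuming SymPy is available (note the sign of the ε term):

```python
import sympy as sp

# Check that the two-term expansion y ~ sin(t) - eps*cos(t) satisfies
# eps*y' + y = sin(t) up to a residual of order eps^2.
t, eps = sp.symbols('t epsilon')
y = sp.sin(t) - eps * sp.cos(t)
residual = sp.expand(eps * sp.diff(y, t) + y - sp.sin(t))
print(residual)  # epsilon**2*sin(t)
```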
To find a solution in the initial boundary layer we rescale time to make the boundary layer have size O(1). We do not really know which size the boundary layer has, i.e., is it O(√ε), O(ε), or perhaps even O(ε²)? Finding the right scaling is crucial and can require some experimenting. To have a general scaling one can try rescaling time in the form τ = t/Φ(ε) where, for example, Φ(ε) = ε^α. Then τ = 1 corresponds to real time t being equal to ε^α. Defining a new solution function Y(τ) = y(Φ(ε)τ) and substituting into the ODE we arrive at
    Y′(τ) = Φ(ε)·y′(Φ(ε)τ) = (Φ(ε)/ε)·(sin(Φ(ε)τ) − y(Φ(ε)τ))

so that Y satisfies the ODE

    Y′(τ) = (Φ(ε)/ε)·(sin(Φ(ε)τ) − Y(τ)) ,   Y(0) = 1 .
Now we can again use an asymptotic expansion of the form Y(τ) ∼ Y0(τ) + εY1(τ); note that we will focus on the leading term (the one with ε⁰), so the exact form of the expansion is not of so much interest:

    (ε/Φ(ε))·(Y0′ + εY1′) = sin(Φ(ε)τ) − Y0(τ) − εY1(τ) = Φ(ε)τ − Y0(τ) − εY1(τ) + O(ε²) ,

where we used sin(ετ) ∼ ετ. Now taking Φ(ε) = ε seems a good choice here, which leads to the leading order ODE

    Y0′ = −Y0(τ) ,   Y0(0) = 1 ,

so that the leading order term is Y0(τ) = e^{−τ}. So now we have two solutions:
• Outer solution (away from the boundary layer): yO(t) = sin t
• Inner solution (inside the boundary layer): yI(t) = e^{−t/ε} (or YI(τ) = e^{−τ})
The question now is: is it possible to match these two functions in the intermediate range, e.g., for t = O(√ε)? We have no free parameters left, so either those two functions approximate the same function in the intermediate range or they do not (in the latter case we would have to reconsider the size of the boundary layer Φ). In other words, there are no free parameters to do any matching of the two solutions.
Note that limt→0 yO(t) = 0 = limτ→∞ YI(τ), which is a good sign. Also, taking t = η√ε one can show that both yO = O(√ε) and yI = O(√ε), so they match each other in this region (check). As an approximation to the exact solution we will thus use

    ỹ(t) := yO(t) + yI(t) = sin t + e^{−t/ε} .
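How good this composite approximation is can be checked against the exact solution from the beginning of the example; a Python sketch (the grid and the value of ε are my choices):

```python
import math

# Compare the exact solution of eps*y' + y = sin(t), y(0)=1, with the
# composite approximation sin(t) + exp(-t/eps) from matched asymptotics.
eps = 0.01
C = 1 + eps / (1 + eps**2)   # constant fixing y(0) = 1 in the exact solution

def y_exact(t):
    return (math.sin(t) - eps * math.cos(t)) / (1 + eps**2) + C * math.exp(-t / eps)

def y_composite(t):
    return math.sin(t) + math.exp(-t / eps)

# Maximum deviation on a grid over [0, 3]; it stays of size O(eps).
max_err = max(abs(y_exact(0.001 * i) - y_composite(0.001 * i)) for i in range(3001))
print(max_err)
```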
Example 5.19. For a slightly more complex example which is nicely worked out from modelling, through non-dimensionalization, to singular perturbation theory see: https://www.math.colostate.edu/~shipman/47/volume2a2010/Munoz-Alicea.pdf