
Spring Term 2019

MA261: Differential Equations: Modelling and Numerics


Part 1

Dr. Andreas Dedner


Contents

1 Introduction

2 Getting Started
2.1 Basic Concepts
2.1.1 Convergence rates
2.1.2 Conditioning (this subsection is not examinable)
2.1.3 Floating point numbers (this subsection is not examinable)
2.2 Some ODE Basics
2.2.1 Solvability of the Initial Value Problem (this subsection is not examinable)
2.3 Discrete Gronwall Lemma
2.4 The Forward/Backward Euler method
2.4.1 Convergence analysis
2.4.2 Stability analysis

Chapter 1

Introduction

Numerical analysis is the mathematical study of algorithms for solving problems arising in many
different areas, e.g., physics, engineering, biology, statistics, economics, social sciences. In general,
starting from some real-world problem, the following steps have to be performed:
1. Modelling: the problem is formulated in mathematical terms, e.g., as a differential equation.
In general the resulting problem cannot be solved analytically (without approximation).
2. Analysis: the mathematical model is analysed for example with respect to well-posedness,
e.g., existence and uniqueness of a solution, sensitivity to errors in the data. Also terms with
different importance in the model can be identified to possibly reduce the complexity of the
model.
3. Discretization: the problem is approximated by (a sequence of) finite dimensional prob-
lems. The discretization is chosen to maintain important properties of the analytical prob-
lem, e.g., that the density is positive.
4. Numerical analysis: the discretization is studied again with respect to well-posedness, but
most importantly the error between the solution of the finite dimensional problem and the
mathematical model is estimated and convergence of the numerical solution is established.
5. Implementation: the finite dimensional problems are solved using a computer program.
This can be a cyclic procedure where for example the Analysis in step two can influence the
modelling step, i.e., step one is refined. The numerical simulation can show that additional effects
have to be taken into account and so the modelling has to be refined and so on.
This module will focus on all these points for some simple settings to make you familiar with
central underlying ideas. The modelling techniques and the numerical schemes presented are an
important building block used for solving more complex problems. We will be focusing on problems
described by ordinary differential equations.
The following example demonstrates how the above steps are applied - you don’t need to under-
stand the mathematical details of each step!
Example 1. Consider the problem of a steel rope of length L > 1m clamped between two poles
which are 1m apart so that the rope is almost taut. Now the position of the rope is to be modelled
in the case where an acrobat is standing in the middle. A sketch of the problem is shown in
Figure 1.2.
1. Modelling: first we make the assumption that the rope can be represented as a function
y : [0, 1] → R. The shape of the rope is such that its bending energy E is minimal. For E
one finds (neglecting for example gravity) the formula
E(y) := (c/2) ∫₀¹ y'(x)² / √(1 + y'(x)²) dx − ∫₀¹ f(x) y(x) dx .


Figure 1.1: Applied mathematics

Figure 1.2: Sketch of the problem and a computed solution with f(x) = B(1/2), ε = 0.05.

Here c depends on the material of the rope and f is the load (the acrobat) on the rope. So
we seek y ∈ V := {v ∈ C²((0, 1)) : v(0) = v(1) = 0} so that

E(y) = inf_{v∈V} E(v) .

Both the function f and the constant c have to be determined by measurements and contain
data error.
This is a very complex problem. So we make a simplification and assume that the displacement
of the rope is small, e.g., y' is small. Then we can replace E by

Ē(y) := (c/2) ∫₀¹ y'(x)² dx − ∫₀¹ f(x) y(x) dx .

Now we seek ȳ ∈ V so that

Ē(ȳ) = inf_{v∈V} Ē(v) .

The model is now simpler but we have made a modelling error.


2. Analysis: The simplified problem is equivalent to solving

−c ȳ''(x) = f(x) for x ∈ (0, 1) ,   ȳ(0) = ȳ(1) = 0 .

This problem has a unique solution. One can also show for example that y < 0 if f < 0, i.e.,
when the force is pointing downward, the displacement is also downwards along the whole
length of the rope. This matches our intuition.
3. Discretization: Instead of approximating the function y at all points in [0, 1] we compute
y at the interior points xi = ih for i = 1, . . . , N − 1 with h = 1/N. We can replace the second derivative
of y at a point x by the difference quotient (covered in MA3H0):

y''(x) ≈ (1/h²) (y(x + h) − 2y(x) + y(x − h)) .
Thus an approximation yi to ȳ(xi) is given by

−(c/h²) (y_{i+1} − 2y_i + y_{i−1}) = f(xi) ,

for i = 1, . . . , N − 1 and taking y0 = yN = 0. Defining the vector F = (f(xi))_{i=1,...,N−1} ∈ R^{N−1} and
the tridiagonal matrix A ∈ R^{(N−1)×(N−1)} with 2 on the diagonal and −1 on the sub- and super-diagonal,
the unknown values Y = (yi)_{i=1,...,N−1} are the solution to the linear system (c/h²) A Y = F. In
replacing the derivatives of y by a finite difference quotient we have made an approximation
error.
We now have to solve a linear system of equations. There are many ways of doing this
(see MA398); a simple iterative scheme is based on the equivalence of (c/h²) A Y = F with
Y = Y − D⁻¹(A Y − (h²/c) F), where D = diag(2, . . . , 2) is the diagonal matrix containing the
diagonal of A. Note that D⁻¹ can be easily computed. Starting with some initial vector Y^0 we can
compute a sequence of vectors Y^n through

Y^n = Y^{n−1} − D⁻¹ (A Y^{n−1} − (h²/c) F)

for n ≥ 1. We compute this sequence up to Y^P which is then our final approximation. It can
be seen that Y^n → Y (n → ∞), but since we cannot compute an infinite number of iterates,
we have a further termination error caused by stopping the computation after P steps.
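A minimal Python/NumPy sketch of this Jacobi-type iteration is given below; the load f, the constant c
and the stopping tolerance are illustrative choices, not values from the notes:

import numpy as np

def rope_jacobi(f, c=1.0, N=100, steps=10000, tol=1e-10):
    # approximate -c*y'' = f on (0,1) with y(0)=y(1)=0 via (c/h^2) A Y = F
    h = 1.0 / N
    x = np.linspace(0.0, 1.0, N + 1)[1:-1]          # interior points x_1, ..., x_{N-1}
    F = f(x)                                        # right hand side vector
    A = (2 * np.eye(N - 1) - np.eye(N - 1, k=1)     # tridiagonal matrix: 2 on the
         - np.eye(N - 1, k=-1))                     # diagonal, -1 off the diagonal
    Dinv = 0.5                                      # D = diag(2,...,2), so D^{-1} = (1/2) I
    Y = np.zeros(N - 1)                             # initial guess Y^0
    for n in range(steps):
        residual = A @ Y - (h**2 / c) * F
        Y = Y - Dinv * residual                     # Y^n = Y^{n-1} - D^{-1}(A Y^{n-1} - h^2/c F)
        if np.linalg.norm(residual) < tol:          # stop after finitely many steps
            break
    return x, Y

# example: a downward load concentrated near the middle of the rope
x, Y = rope_jacobi(lambda x: -np.exp(-((x - 0.5) / 0.05)**2), N=50)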
4. Numerical analysis: the matrix A is regular and thus the discrete problem has a unique
solution. For v ∈ C⁴((0, 1)) there exists a constant M > 0 so that for all x ∈ (0, 1) the
following error estimate holds: |v''(x) − (1/h²)(v(x + h) − 2v(x) + v(x − h))| ≤ M h². The same
estimate holds for the discrete values Y:

max_{i=1,...,N−1} |ȳ(xi) − yi| ≤ C h² ,

for some constant C > 0. Furthermore yi < 0 if all fk < 0.
As already mentioned one can show that the iteration converges to the exact solution of the
linear system.
5. Implementation: we have to implement the iterative scheme

Y^n = Y^{n−1} − D⁻¹ (A Y^{n−1} − (h²/c) F)

on a computer. Here programming languages like C, C++, Fortran, Python, Julia, or MATLAB
are used. The choice of the environment and the form of the coding influences the efficiency
of the algorithm, i.e., how long one has to wait until the solution is computed - but also how
long the development of the program takes.
A final source of error is caused by the fact that a computer cannot store exact numbers but
only approximations; this leads to rounding errors.

The following example is again a taster of things to come. It demonstrates how some simple
manipulation can be used to simplify a model, considerably reducing the number of parameters it
depends on and thus making the model much easier to analyse:
Example 2. A mass spring system with friction proportional to the velocity is modelled by the
second order ODE

μ x''(t) + β x'(t) + γ x(t) = 0 .

Here x(t) is the position of the (point) mass at time t, thus x'(t) is the velocity and x''(t) the acceleration
at time t. The three constants in the model are the mass μ > 0, the amount of friction β > 0, and
the restoring force of the spring γ > 0. To make the problem well posed the initial position of the
mass x(0) = x0 and the initial velocity x'(0) = v0 have to be prescribed.
There are different ways to derive this model; one of them is to start with Newton's second law
μa = F (mass × acceleration = applied forces). We choose our coordinate system in such a way
that x = 0 corresponds to the rest position of the spring - so x > 0 means the spring is stretched.
The restoring force is assumed to be proportional to the amount of stretching s, so F_r = −γs.
This is a modelling assumption; we could also have a nonlinear spring where the restoring force
depends nonlinearly on the stretching, e.g., F = −ks³. The force of friction is assumed to be
directly proportional to the velocity of the mass x', so F_f = −βx'. As said before a = x'' and due
to our choice of coordinate system s = x, so:

μ x'' = F_r + F_f = −γ x − β x' .

This model is linear and can be easily solved using the approach of characteristic polynomials
discussed in MA133:

x(t) = e^{−βt/(2μ)} (A cos(wt) + B sin(wt)) ,   with w = √(4γμ − β²) / (2μ) ,

where we made the assumption that β is small so that β² < 4γμ (the system is underdamped).
The constants A and B are determined from the initial conditions.
The problem seems to depend on three parameters - although we know from studying the exact
solution that the type of behaviour of the system depends on the ratio between β² and 4γμ, e.g., if
β²/(4γμ) < 1 the system oscillates while for β²/(4γμ) > 1 the system is overdamped and will not oscillate
at all. One modelling technique is to non-dimensionalize the model and in that step try to reduce
the number of parameters and to isolate the parameters that mainly determine the behaviour of the
problem. To this end one first needs to fix the physical units of each part of the model, e.g., x could
be measured in meters (m), the velocity x' could then be meters per second (m/s), acceleration
is (m/s²). Mass μ could be in kilograms (kg) and (to make things fit) we assume that β has
units kg/s and γ kg/s² (we will discuss this in detail later on). Now let us fix a typical time
scale T, length scale L, introduce the scaled time τ = t/T, and rescale the position x(t) in the form
χ(τ) = x(Tτ)/L. Using the chain rule we can easily see that χ'(τ) = x'(Tτ) T/L, χ''(τ) = x''(Tτ) T²/L
and thus

0 = μx''(Tτ) + βx'(Tτ) + γx(Tτ) = (μL/T²) χ''(τ) + (βL/T) χ'(τ) + γL χ(τ) .

Note that χ, τ do not have any units (e.g. t, T both have some time unit like seconds, so their
ratio is unitless). We can now divide through by μL and multiply by T² to arrive at

χ'' + (Tβ/μ) χ' + (T²γ/μ) χ = 0

and note that the two remaining constants Tβ/μ, T²γ/μ are also unitless. We now have many
different ways to choose T (note that the equation doesn't depend on our choice for L). We could
choose T to make Tβ/μ = 1 or alternatively T²γ/μ = 1, e.g., T = √(μ/γ), which leads to a coefficient
in front of the friction term of Tβ/μ = √(β²μ/(γμ²)) = 2√(β²/(4γμ)) =: 2ω. Our model thus reduces to

ξ'' + 2ω ξ' + ξ = 0 .

We are only left with a single factor ω² = β²/(4γμ) and we can discuss the behaviour of the solution
to this model (or simulate it) depending on the size of this one parameter. The damping regime
now depends on ω² being less than or greater than one. After understanding the behaviour of this
non-dimensionalized problem one can then look at the values of the parameters in the problem, e.g.,
μ, β, γ, to figure out which regime a given spring mass system belongs to. Of course in this simple
case we have not really learned anything new but the concept is more widely applicable.

Using Newton’s second law is one way of deriving the equations of motion for the mass. Another
approach is based on Hamiltonian dynamics which we will also briefly cover in this module. Let
us for now consider the frictionless case. Define the Hamiltonian

H(p, q) := (μ/2) q² + (γ/2) p²

and consider a particle moving such that H(x(t), x'(t)) is constant, in other words (d/dt) H(x(t), x'(t)) = 0.
Using the chain rule it is easy to see that

(d/dt) H(x(t), x'(t)) = μx''x' + γx'x = x' (μx'' + γx)

so that H(x(t), x'(t)) is constant if and only if either x is stationary (i.e. x' = 0) or x solves the
second order problem

μx'' + γx = 0 .

We looked a bit at the modelling aspects of this problem and did some analysis, we can now turn
to discretizing the problem. In this case we have an exact solution so looking at discretization
methods for this problem seems a bit pointless but the circumstance that we know what the solution
should look like allows us to study the behaviour of a given method much more easily and we can
deduce something for more complicated cases where we do not know the exact solution. Of course
this only makes sense if we assume that the method we are studying is applicable to more general
problems. In this module we will focus on methods for solving first order nonlinear systems, i.e.,
ODEs of the form

y'(t) = f(t, y(t))

where y : [0, T] → R^m for m ≥ 1. We can easily rewrite our mass spring system in that form
by introducing the vector y(t) = (y1(t), y2(t)) = (x(t), x'(t)) so that y'(t) = (x'(t), x''(t)) =
(x'(t), −(γ/μ)x(t)) = (y2(t), −(γ/μ)y1(t)), which is of the right form if we define f(y) = (y2, −(γ/μ)y1).
A simple approach to discretize y(t) is to look for approximations yn ≈ y(tn) where t0 = 0 < t1 <
t2 < · · · < tN = T are some fixed points in time, for example tn = nh with h = T/N. The derivative
y'(tn) can be approximated using a finite difference quotient, for example

y'(tn) ≈ (y(tn+1) − y(tn))/h ≈ (yn+1 − yn)/h
(we will have to make these ≈ much more precise if we want to understand what is going on).
Since y'(tn) = f(y(tn)) ≈ f(yn) we arrive at the so-called forward Euler method:

yn+1 = yn + hf (yn )

which is a method that is very easy to implement, since given the initial condition y0 we can directly
compute y1 = y0 + hf(y0), then y2 = y1 + hf(y1), and so on up to yN = yN−1 + hf(yN−1).
Applying this to our mass spring problem we get

yn+1,1 = yn,1 + h yn,2 ,   yn+1,2 = yn,2 − h (γ/μ) yn,1 .

In the following we set γ/μ = 1 and use as initial data y(0) = (x0, v0) = (1, 1) so that the exact
solution is simply

y(t) = (cos(t) + sin(t), − sin(t) + cos(t)) .
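This experiment can be reproduced with a short Python sketch of the forward Euler method (the
function and variable names are illustrative); it should give errors of the same size as in the table below:

import numpy as np

def forward_euler(f, y0, T, N):
    # forward Euler: y_{n+1} = y_n + h f(y_n) with h = T/N
    h = T / N
    y = np.array(y0, dtype=float)
    for _ in range(N):
        y = y + h * f(y)
    return y

# mass spring system with gamma/mu = 1: y' = (y2, -y1)
f = lambda y: np.array([y[1], -y[0]])
T = 2 * np.pi
exact = np.array([np.cos(T) + np.sin(T), -np.sin(T) + np.cos(T)])
for i in range(6):
    N = 100 * 2**i
    yN = forward_euler(f, [1.0, 1.0], T, N)
    print(N, yN, np.linalg.norm(yN - exact))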


At T = 2π the solution should be back at (1, 1), so let us check what value yN has, where N = T/h, for
different values of h, e.g., h_i = 2π/(100 · 2^i), i.e., we use N_i = 100 · 2^i points for i = 0, 1, . . . , 5:

i N yN |y(T ) − yN |
0 101 [1.20766198 1.2277517 ] 3.08212e-01
1 201 [1.10139465 1.10595473] 1.46654e-01
2 401 [1.05003655 1.05112221] 7.15342e-02
3 801 [1.02484773 1.02511256] 3.53278e-02
4 1601 [1.01238062 1.01244602] 1.75552e-02
5 3201 [1.00617943 1.00619568] 8.75053e-03

We have used a very large number of points tn for the final simulation and the solution is still not
all that accurate - the error has just dropped below 1%. Depending on the application this might or
might not be an acceptable level of error and may or may not be an acceptable computation effort
to reach this error. But it does seem worth while to investigate methods that achieve a smaller
error with the same computational cost or the same error with less computational cost and we will
study some such approaches in this module. The results seem to indicate that the error is going
to zero with increasing N - in fact it looks like the error is halving each time N is doubled, i.e.,
the error is proportional to 1/N ∼ h. We will see later in this module that this is in fact the
case. Computing only one period of the oscillation is often not of interest but instead the long
time behaviour is to be simulated, so let us redo the above computation with T = 200π (which is
actually not that long):

i N yN |y(T ) − yN |
0 10001 [-2.00707e+07,5.08076e+08] 5.08472e+08
1 20001 [1.48842e+04,2.27772e+04] 2.72078e+04
2 40001 [1.31599e+02,1.45952e+02] 1.95108e+02
3 80001 [1.16376e+01,1.19422e+01] 1.52607e+01
4 160001 [3.42277e+00,3.44495e+00] 3.44204e+00
5 320001 [1.85158e+00,1.85458e+00] 1.20644e+00

Not so good - the best that can be said is that it does seem to be converging, but the errors
are huge! As the next simulation shows, instead of staying on a constant level curve of H (i.e.
H(y(t)) = H(y(0))) the value of H seems to be increasing; to verify this we add H(yN) to our
output (the expected value is H(1, 1) = 1). We also increase i a bit more:

i N yn |y(T ) − yN | H(yN ) rel error H


0 101 [1.20766e+00,1.22775e+00] 3.08212e-01 1.48291e+00 4.82911e-01
1 201 [1.10139e+00,1.10595e+00] 1.46654e-01 1.21810e+00 2.18103e-01
2 401 [1.05004e+00,1.05112e+00] 7.15342e-02 1.10372e+00 1.03717e-01
3 801 [1.02485e+00,1.02511e+00] 3.53278e-02 1.05058e+00 5.05843e-02
4 1601 [1.01238e+00,1.01245e+00] 1.75552e-02 1.02498e+00 2.49807e-02
5 3201 [1.00618e+00,1.00620e+00] 8.75053e-03 1.01241e+00 1.24134e-02
6 6401 [1.00309e+00,1.00309e+00] 4.36852e-03 1.00619e+00 6.18756e-03
7 12801 [1.00154e+00,1.00154e+00] 2.18258e-03 1.00309e+00 3.08901e-03
8 25601 [1.00077e+00,1.00077e+00] 1.09087e-03 1.00154e+00 1.54332e-03

So decreasing h (or increasing N ) to compute the error at a fixed time does seem to work - although
the required work can be very high if the error is to be small or the time period somewhat longer.
Instead of changing h we will now fix h and increase T just to show that effect a bit more:

i N T yn |y(T ) − yN | H(yN ) rel error H


0 401 6.28319e+00 [1.05004e+00,1.05112e+00] 7.15342e-02 1.10372e+00 1.03717e-01
1 4001 6.28319e+01 [1.62942e+00,1.64635e+00] 9.02186e-01 2.68274e+00 1.68274e+00

To show the time evolution of the discrete solution for different values of h see the left figure in
the following plot (only every 15th approximate value is plotted). On the right you can see the
a simulation with a larger value of T using the same value of h as used for the curve with the
same colour on the left. The plots show the evolution of the system in phase space, i.e., the x-axis
represents the position of the mass and the y-axis its velocity. Another way of thinking of these
plots is in terms of the Hamiltonian H - H should be constant, i.e., the mass should remain on a
single level curve of H which are circles around the origin.

[Phase space plots of the forward Euler approximations: left for T = 2π, right for a larger T; the
trajectories spiral outwards instead of staying on circles.]

We will see later on that the forward or explicit Euler method suffers from stability issues in the
case that h is too large (this is not the problem here...). Nevertheless, we can try a method that
we will later prove to be more stable: the backward or implicit Euler method. The approach to
derive the approximation is the same as for the forward Euler method, except that we use the
approximation at t = tn+1 instead of at t = tn, i.e., y'(tn+1) ≈ (y(tn+1) − y(tn))/h ≈ (yn+1 − yn)/h.
Using y'(tn+1) = f(y(tn+1)) ≈ f(yn+1) we arrive at the so-called backward Euler method:

yn+1 = yn + h f(yn+1) .

The method is in general not quite as easy to code up, but since f is linear in our case it is still
fairly easy to do:

y1,n+1 = y1,n + h y2,n+1 ,   y2,n+1 = y2,n − h y1,n+1 ,

which leads to a system of equations for the two components of yn+1:

y1,n+1 − h y2,n+1 = y1,n ,   h y1,n+1 + y2,n+1 = y2,n .

We can easily solve this system:

y1,n+1 = (y1,n + h y2,n)/(1 + h²) ,   y2,n+1 = (y2,n − h y1,n)/(1 + h²) ,
so that we can repeat the same experiments as in the previous plots:

[Phase space plots of the backward Euler approximations: the trajectories spiral inwards towards
the origin.]

Note that now the mass is slowing down like it would if friction were added (recall that the x-axis
in these plots represents the position and the y-axis the velocity).
In summary: were we to use the forward Euler method to compute the orbit of a satellite around
earth, the satellite would always be spinning off into space - preferable perhaps to the trajectory
predicted by the backward Euler method but still not correct... But also note that both methods
converge, i.e., if we fix a point in time and reduce h enough the error can (in theory) be made as
small as we want it to be. We have some experimental indication of this for the forward Euler
method, the following table indicates that the same is true for the backward Euler method:

i N yn |y(T ) − yN | H(yN ) rel error H


0 101 [8.14386e-01,8.27934e-01] 2.53100e-01 6.74349e-01 -3.25651e-01
1 201 [9.04188e-01,9.07932e-01] 1.32877e-01 8.20949e-01 -1.79051e-01
2 401 [9.51364e-01,9.52347e-01] 6.80902e-02 9.06029e-01 -9.39709e-02
3 801 [9.75503e-01,9.75755e-01] 3.44668e-02 9.51851e-01 -4.81487e-02
4 1601 [9.87707e-01,9.87771e-01] 1.73399e-02 9.75628e-01 -2.43719e-02
5 3201 [9.93842e-01,9.93859e-01] 8.69672e-03 9.87739e-01 -1.22612e-02
6 6401 [9.96918e-01,9.96923e-01] 4.35507e-03 9.93850e-01 -6.14951e-03
7 12801 [9.98459e-01,9.98460e-01] 2.17921e-03 9.96921e-01 -3.07950e-03
8 25601 [9.99229e-01,9.99229e-01] 1.09003e-03 9.98459e-01 -1.54094e-03

Can we derive a method that converges and maintains the energy of the system, i.e., guarantees
that H(yn ) = H(y0 ) for all n? The answer is yes and the method is just as simple to implement
as the forward Euler method. The method is often referred to as the symplectic Euler method and
you will need to look closely to see the difference to the forward Euler method described above:
yn+1,1 = yn,1 + h yn,2 ,   yn+1,2 = yn,2 − h (γ/μ) yn+1,1 .
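In code the difference is a single line: the velocity update uses the already updated position. A
hedged Python sketch (names are illustrative):

import numpy as np

def symplectic_euler(y0, T, N, gamma_over_mu=1.0):
    # position update uses the old velocity, velocity update uses the new position
    h = T / N
    y = np.array(y0, dtype=float)
    for _ in range(N):
        y[0] = y[0] + h * y[1]                  # y_{n+1,1} = y_{n,1} + h y_{n,2}
        y[1] = y[1] - h * gamma_over_mu * y[0]  # uses y_{n+1,1}, not y_{n,1}
    return y

print(symplectic_euler([1.0, 1.0], 2 * np.pi, 100))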

[Phase space plots of the symplectic Euler approximations: the trajectories stay on (nearly) the
same level curve of H.]

i N yn |y(T ) − yN | H(yN ) rel error H


0 101 [1.00107e+00,9.98932e-01] 1.50898e-03 1.00000e+00 6.93562e-08
1 201 [1.00026e+00,9.99737e-01] 3.71239e-04 1.00000e+00 2.13137e-09
2 401 [1.00007e+00,9.99935e-01] 9.20760e-05 1.00000e+00 6.60632e-11
3 801 [1.00002e+00,9.99984e-01] 2.29283e-05 1.00000e+00 2.05724e-12
4 1601 [1.00000e+00,9.99996e-01] 5.72080e-06 1.00000e+00 6.43929e-14
5 3201 [1.00000e+00,9.99999e-01] 1.42880e-06 1.00000e+00 1.55431e-15
6 6401 [1.00000e+00,1.00000e+00] 3.57023e-07 1.00000e+00 3.10862e-15
7 12801 [1.00000e+00,1.00000e+00] 8.92339e-08 1.00000e+00 -4.99600e-15
8 25601 [1.00000e+00,1.00000e+00] 2.23057e-08 1.00000e+00 7.10543e-15
Chapter 2

Getting Started

In this chapter we will introduce a few concepts without being too formal. The ideas will then be
expanded on in the following chapters.

2.1 Basic Concepts


2.1.1 Convergence rates
One of the main tools we will be using is Taylor series. The proofs of the following Theorems are
part of the analysis lectures.
Definition 1. With C⁰(I), I = (a, b), we denote the space of continuous functions on the interval
I, and the space of m times continuously differentiable functions is

C^m(I) := { f : I → R | f, f', f'', . . . , f^(m) exist and are continuous } .

We use the abbreviations C^m(a, b) for C^m((a, b)) and C^∞(I) := ∩_{m∈N} C^m(I). It follows that
C^∞(I) ⊂ . . . ⊂ C^m(I) ⊂ . . . ⊂ C⁰(I).
Theorem 1. (Taylor Theorem) Let f ∈ C^m(a, b) and x0 ∈ (a, b) be given. Then there exists a
function ωm : R → R with lim_{x→x0} ωm(x) = 0, so that

f(x) = Pm(x) + ωm(x)(x − x0)^m ,

where

Pm(x) = Σ_{k=0}^{m} (1/k!) f^(k)(x0) (x − x0)^k

is the m-th order Taylor polynomial (a polynomial of degree at most m).


For f ∈ C m+1 (a, b) the following holds

f (x) = Pm (x) + Rm (x) ,

where there are different important expressions for the remainder term:
1. Lagrange representation: For fixed x ∈ (a, b) there is a ξ between x0 and x so that

Rm(x) := (1/(m+1)!) f^(m+1)(ξ) (x − x0)^(m+1) .

2. Integral representation:

Rm(x) := (1/m!) ∫_{x0}^{x} f^(m+1)(t) (x − t)^m dt .

Taylor expansion motivates the following definition:


Definition 2. A function f ∈ C¹(x0 − h0, x0 + h0) is up to leading order equal to f(x0) + f'(x0)h
in an open set around x0, i.e., there is a function ω̄ : (−h0, h0) → R with |ω̄(h)|/|h| → 0 for h → 0 and
f(x0 + h) = f(x0) + f'(x0)h + ω̄(h). This means that we are neglecting all terms that converge
faster to zero than h.

Notation: f(x0 + h) ≐ f(x0) + f'(x0)h (where ≐ denotes equality up to leading order).
Remark (Vector valued case). Taylor expansion for a vector valued function f ∈ (C^m(a, b))^p is
defined in the same way with a vector valued Taylor polynomial Pm - one can also consider this
as the Taylor expansion of each component fi (i = 1, . . . , p) of f.
In the case that the argument of f is multidimensional, f ∈ C^m(I) for I ⊂ R^q (or f ∈ (C^m(I))^p), the
Taylor expansion takes the same form with f' being the gradient (or Jacobian) of f, f'' the Hessian,
and so on.
For x ∈ R^q we will use |x| to denote the Euclidean norm |x| := √(Σ_{i=1}^q x_i²).

Definition 3. (Landau Symbols) Let g, h : R → R^q. Then we write

(i) g(t) = O(h(t)) for t → 0 iff there is a constant C > 0 and a δ > 0, so that

|g(t)| ≤ C |h(t)|   ∀ |t| < δ .

(ii) g(t) = o(h(t)) for t → 0 iff there is a δ > 0 and a function c : (0, δ) → R with

|g(t)| ≤ c(|t|) |h(t)|   ∀ |t| < δ

and c(t) → 0 for t → 0.

Given sequences {an}, {bn} with bn > 0 ∀ n, we say:
(i) an = O(bn) if ∃ M > 0 and n0 such that |an| ≤ M bn ∀ n ≥ n0.
(ii) an = o(bn) if lim_{n→∞} an/bn = 0.

Example 3. 1. t^α = O(t^β) for t → 0 iff α ≥ β.
2. for c > 1 and any β: c^{−1/t} = O(t^β) for t → 0⁺ (exponential convergence is faster than polynomial
convergence). Note: if the aim is to get an error below some tolerance then polynomial convergence
might be better depending on the tolerance.
3. e^x = 1 + x + O(x²) but also e^x = O(1) (for x → 0).
4. f1 · f2 = O(g1 · g2) if fi = O(gi).
5. if f = o(g) then f = O(g) (o(g) is a stronger result than O(g)).
6. O(h^p) − O(h^p) = O(h^p), O(h^p) + O(h^q) = O(h^{min{p,q}}), and O(h^p)·O(h^q) = O(h^{p+q}).
Definition 4 (order of convergence). Suppose {xn} → x as n → ∞. If ∃ λ, p > 0 such that

lim_{n→∞} |xn+1 − x| / |xn − x|^p = λ ,

then we say {xn} converges to x with order p. The largest p with this property is said to be the
convergence rate or the rate of convergence of the sequence (xn)n.
Suppose z : [0, h0] → R^q and z(h) → z0 for h → 0. If there exists a p > 0 with

|z(h) − z0| = O(h^p)

then we say that the convergence is of order p. The largest p with this property is said to be the
convergence rate or rate of convergence of z(h) to z0.
If p = 1, this is called linear convergence. If p = 2 this is quadratic convergence.

Example 4. Let f ∈ C²(R) and x0 ∈ R be given. Define d1(h) = (f(x0 + h) − f(x0))/h. Then d1(h)
converges to f'(x0) for h → 0 linearly. The proof is simple using Taylor expansion (see also the
next example).
Remark: we used this finite difference approximation in the introduction to motivate the forward/backward
Euler method.
If f ∈ C⁴(R) then d2(h) = (f(x0 + h) − 2f(x0) + f(x0 − h))/h² converges to f''(x0) for h → 0 quadratically.
Again this is shown by Taylor expansion:

f(x0 + h) − f(x0) = f'(x0)h + (1/2)f''(x0)h² + (1/6)f^(3)(x0)h³ + O(h⁴) ,
f(x0 − h) − f(x0) = −f'(x0)h + (1/2)f''(x0)h² − (1/6)f^(3)(x0)h³ + O(h⁴) .

Adding these two equations gives the result.
Now take a > 1 and define x0 = a and xn+1 = (xn² + a)/(2xn) for n ≥ 0. One can show that xn > √a > 1
and since

0 ≤ xn+1 − √a = (xn² − 2xn√a + a)/(2xn) = (xn − √a)²/(2xn) ≤ (1/2)(xn − √a)²

we see that xn converges to √a quadratically.
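As a quick illustration, a small Python sketch (not from the notes) prints the errors of this
iteration for a = 2; each error is roughly the square of the previous one, as the estimate predicts:

import math

a = 2.0
x = a                                # x_0 = a
for n in range(6):
    print(n, x, x - math.sqrt(a))    # the error is roughly squared in every step
    x = (x * x + a) / (2 * x)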
Remark. Note that if z(h) converges to z0 with order p then it also converges to z0 with order q
for any q < p. That is why we are interested in the maximum rate of convergence. So while it is
easy to see that, for example, z(h) = h sin(h) converges to 0 with rate h - simply because sin(h)
is bounded - this result is not optimal. Since sin(h) = h − (1/6)h³ cos(χ) for some χ ∈ [0, h] it follows
that h sin(h) converges quadratically to 0 since

|h sin(h)| ≤ h² + (1/6)h⁴ = O(h²) .

So the convergence rate of z(h) to 0 for h → 0 is quadratic.
What is the optimal p, i.e., the largest p for which 1/(h + h²) = O(h^p) holds? We can note that
h²/(h + h²) ≤ C, i.e., 1/(h + h²) = O(h^{−2}), but that is not the optimal result. In fact h²/(h + h²) → 0
for h → 0, which indicates that we could choose a C that goes to 0 when h → 0 in the above
estimate. That means we can do better: h/(h + h²) → 1 for h → 0, so 1/(h + h²) ≤ C/h, or
1/(h + h²) = O(h^{−1}).

When computing an approximation z(h) to a problem with exact solution z0 it is possible to
determine the order of convergence experimentally. Assume that an approximation z(h) converges
to z0 with order p, i.e., |z(h) − z0| = O(h^p). This means that |z(h) − z0| ≤ Ch^p where C should
not depend on h. We will go a step further and assume that |z(h) − z0| = Ch^p. Then taking two
values of h, e.g., h2 ≠ h1, we can eliminate C by taking the quotient

|z(h1) − z0| / |z(h2) − z0| = (h1/h2)^p .

This motivates the following definition:
Definition 5 (Experimental order of convergence (EOC)). The experimental order of convergence
of an approximation z(h) to z0 is given by

p := log( |z(h1) − z0| / |z(h2) − z0| ) / log( h1/h2 ) .

This can of course not be used to prove convergence but we can get a good indication of the
convergence rate through experiments and we can use a theoretical proven rate of convergence to
verify that an implementation is correct by comparing an experimental order of convergence with
the theoretical convergence rate. A major issue with this approach is that to compute the error one
requires knowledge of the exact solution z0 . In the example we discussed at the beginning of this
chapter, where we applied the forward Euler method to the linear mass spring problem we know
the exact solution and so could compute the error. In general we want to use approximations in complex
cases where no exact solution is available. In this case some other approach to determining the
order of convergence has to be used. In summary the EOC is a good tool to check the correctness
of an implementation but to use it requires simplifying/modifying the problem to the point that
the exact solution is available.
A typical approach is the so-called method of manufactured solutions. For an ODE this would for
example entail picking an exact solution Y and then computing the right hand side f = f(t, y) so
that Y is the exact solution to the ODE y' = f(t, y). The trick is to choose Y so that f is not too
trivial, e.g., depends nonlinearly on y. For example Y(t) = sin(t) and f(t, y) = cos(t) would allow
us to compute errors but not really challenge the ODE solver. Instead we could take Y(t) = sin(t)
and f(t, y) = √(1 − y²), which would work at least for a restricted time interval.
Applying this to ODEs we will have z0 = Y(T) and z(h) is the approximation at the final time
using a scheme with step size h.
Looking back at the errors computed for the mass spring problem we see that the error seems
to behave proportionally to h = 1/N - we already mentioned this previously. Using the concept of a
rate of convergence this means that the scheme converges linearly, i.e., with order p = 1. Let us
compute the EOC to confirm this:
i N yN |y(T ) − yN | EOC
0 101 [1.20766e+00,1.22775e+00] 3.08212e-01
1 201 [1.10139e+00,1.10595e+00] 1.46654e-01 1.07151e+00
2 401 [1.05004e+00,1.05112e+00] 7.15342e-02 1.03571e+00
3 801 [1.02485e+00,1.02511e+00] 3.53278e-02 1.01783e+00
4 1601 [1.01238e+00,1.01245e+00] 1.75552e-02 1.00891e+00
5 3201 [1.00618e+00,1.00620e+00] 8.75053e-03 1.00445e+00
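Such an EOC column can be computed with a few lines of Python; the sketch below (illustrative,
using the forward Euler errors from the table, copied by hand) reproduces the values above:

import math

# errors |y(T) - y_N| for h_i = h_0 / 2^i (forward Euler, mass spring example)
errors = [3.08212e-01, 1.46654e-01, 7.15342e-02, 3.53278e-02, 1.75552e-02, 8.75053e-03]
for i in range(1, len(errors)):
    # EOC = log(e_{i-1}/e_i) / log(h_{i-1}/h_i) with h_{i-1}/h_i = 2
    print(i, math.log(errors[i - 1] / errors[i]) / math.log(2.0))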

2.1.2 Conditioning (this subsection is not examinable)


We have seen a great many sources for errors and each of these has to be analysed and kept under
control, especially to avoid the accumulation of errors. Depending on the problem, the influence
of any of these errors can be severe.
Example 5. Consider the initial value problem for t > 0:

u'(t) = (c − u(t))² ,   u(0) = 1 ,   c > 0 .



It can be easily verified that

u(t) = (1 + tc(c − 1)) / (1 + t(c − 1))

is the solution. We observe the following dependence of u on the constant c:
c = 1 : in this case u ≡ c.
c > 1 : in this case 1 + t(c − 1) > 0 for all t and lim_{t→∞} u(t) = c.
c < 1 : in this case 1 + t(c − 1) = 0 for t = t0 = 1/(1 − c) > 0, so that lim_{t→t0} u(t) = ∞.
It is easy to see how measurement errors or other forms of approximation errors can lead to
c = 1 + ε (which in this case would be okay) or c = 1 − ε (which would be a disaster). Numerical
schemes can thus lead to very different results if one does not keep the errors under control. A
qualitative picture is shown in Figure 2.1.

Figure 2.1: Effects of errors in the data.

The next example also shows how even very small errors in some part of the algorithm can strongly
influence the error in the solution:
Example 6. Consider the linear system of equations

[ 1.2969  0.8648 ] [ x1 ]   [ 0.86419999 ]
[ 0.2161  0.1441 ] [ x2 ] = [ 0.14400001 ] =: b .

The exact solution is (x1, x2) = (0.9911, −0.4870). Due to some error, we obtain

b̄ = (0.8642, 0.1440)

instead of the exact right hand side. The relative error in the first and second component is merely
|0.86419999 − 0.8642|/0.86419999 = 1.15·10⁻⁸ and |0.14400001 − 0.1440|/0.14400001 = 6.94·10⁻⁸. So the
error is quite small. But the solution to the new problem is (x̄1, x̄2) = (2, −2), which means that the
error in the solution to the linear system of equations is more than 100%.
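This can be checked directly, for instance with the following small NumPy sketch (illustrative, not
part of the notes):

import numpy as np

A = np.array([[1.2969, 0.8648],
              [0.2161, 0.1441]])
b = np.array([0.86419999, 0.14400001])   # exact right hand side
b_pert = np.array([0.8642, 0.1440])      # slightly perturbed right hand side

print(np.linalg.solve(A, b))       # approximately ( 0.9911, -0.4870)
print(np.linalg.solve(A, b_pert))  # approximately ( 2.0000, -2.0000)
print(np.linalg.cond(A))           # condition number of order 10^8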

The amplification of errors, as shown in the previous example, is characterized by the conditioning
of the problem.
Definition 6. A problem is said to be well conditioned if small errors in the data lead to small
errors in the solution, and badly conditioned otherwise. We will provide two different notions of
conditioning for the problem of computing the value f(x0) for a given function f : U → R^n and
given data x0 ∈ U where U is an open subset of R^m. We call this the problem (f, x0).

Example 7. The solution to the linear system Ay = b is the problem (f, x0 ) where f (x) = A−1 x
and x0 = b. The problem given in Example 6 was apparently badly conditioned.
Theorem 2. Let x0 = (x1, . . . , xm) ∈ U and let x0 + ∆x ∈ U be some perturbation of the data
with |∆x| ≪ 1. If f : U → R^n is once continuously differentiable then the error ∆fi(x0) =
fi(x0 + ∆x) − fi(x0) (i = 1, . . . , n) in the evaluation of fi is up to leading order equal to

Σ_{j=1}^{m} ∂fi/∂xj (x0) ∆xj .

For the relative error we have to leading order

∆fi(x0)/fi(x0) ≐ Σ_{j=1}^{m} ( ∂fi/∂xj (x0) · xj/fi(x0) ) · (∆xj/xj) .

Proof. The proof is a straightforward application of (multivariate) Taylor expansion

fi (x0 + ∆x) = fi (x0 ) + ∇fi (x0 ) · ∆x + ω (|∆x|)

with ω (|∆x|) = o (|∆x|).


Definition 7. (Condition number) We call the factors

kij := ∂fi/∂xj (x0) · xj / fi(x0)

the relative condition numbers for the problem (f, x0).


The factor kij describes how an error in xj is amplified or damped when applying fi .

Example 8. (Arithmetic operations)

1. f(x1, x2) = x1 x2:

k11 = ∂f/∂x1 (x1, x2) · x1/f(x1, x2) = x2 · x1/(x1 x2) = 1 .

Similarly we obtain k12 = 1, so that multiplication is well-conditioned.
2. Division is well-conditioned.
3. f(x1, x2) = x1 + x2:

k1j = xj / (x1 + x2) .

Thus k1j is arbitrarily large if x1 x2 < 0 and the absolute values of x1, x2 are similar - in
this case addition is badly conditioned. Otherwise it is well-conditioned.
4. Subtraction is badly conditioned if x1 x2 > 0 and the absolute values of x1, x2 are similar.
We see that subtracting two positive numbers of the same size is not well conditioned. This can
be a problem on a computer since non-exact arithmetic has to be used.

Example 9. Consider the case where numbers are rounded after the third decimal, i.e., instead
of x = 0.9995 and y = 0.9984, the approximations x̄ = 1 and ȳ = 0.998 have to be used. Then
instead of x − y = 0.0011 the computation x̄ − ȳ = 0.002 is performed. The relative error
in the data is 0.0005 while the relative error after evaluation is 0.82, so we have an amplification
of more than 1000. For the condition number we compute k1j ≈ 910. This problem is known as
cancellation or loss of significance.

For more complex problems we can use a slightly different concept.



Definition 8. (Well-posedness) The problem (f, x0 ) is well-posed in

Bδ (x0 ) := {x ∈ U | ||x − x0 || < δ}

if there is a constant Labs ≥ 0, with

||f (x) − f (x0 )|| ≤ Labs ||x − x0 || (∗)

for all x ∈ Bδ (x0 ).


In the following denote with Labs(δ) the smallest constant satisfying (∗) and with Lrel(δ) the
smallest number with

||f(x) − f(x0)|| / ||f(x0)|| ≤ Lrel(δ) · ||x − x0|| / ||x0|| .

Definition 9. (Condition number, vector valued case) We define the absolute condition number
for the problem (f, x0) to be Kabs := lim_{δ↘0} Labs(δ), while the relative condition number is
Krel := lim_{δ↘0} Lrel(δ).

Remark. If f is differentiable then

Krel = ||f'(x0)|| · ||x0|| / ||f(x0)|| .

Note that f'(x0) is a matrix and ||f'(x0)|| is the induced matrix norm ||f'(x0)|| := sup_{y≠0} ||f'(x0)y|| / ||y||.
Thus the value of Krel depends on the choice of the norm.
Example 10. (Conditioning for linear systems of equations) Consider the linear system Ax = b,
i.e., the problem (f, b) with f(x) = A⁻¹x. Thus we have f'(x) = A⁻¹ and

Kabs = ||A⁻¹|| := sup_{y≠0} ||A⁻¹y|| / ||y|| .

Using the properties of the induced matrix norm ||A⁻¹|| we compute

Krel = ||A⁻¹|| · ||b|| / ||A⁻¹b|| = ||A⁻¹|| · ||Ax|| / ||x|| ≤ ||A⁻¹|| · ||A|| · ||x|| / ||x|| = ||A⁻¹|| · ||A|| .

Since there is an x ∈ R^m with ||Ax|| = ||A|| ||x||, the number ||A⁻¹|| · ||A|| is a good estimate for
the condition number of the problem (f, b).
Consider the matrix A from Example 6. We can show that ||A⁻¹|| ||A|| ≈ 10⁹, which shows that
the problem is badly conditioned.
The following section describes in detail how numbers are represented on a computer and how
arithmetic operations are performed. That section also includes some more examples showing the
issue of cancellation.

2.1.3 Floating point numbers (this subsection is not examinable)


Definition 10 (Floating point number). A floating point number to the base b ∈ N is a number
a ∈ R of the form

a = ±(m1 b^{−1} + . . . + mr b^{−r}) · b^{±[e_{s−1} b^{s−1} + . . . + e_0 b^0]} .   (∗)

Usually one uses the notation ±a = (0.m1 . . . mr) · b^{±E}, where 0.m1 . . . mr is the mantissa M and
E = e_{s−1} b^{s−1} + . . . + e_0 b^0 the exponent, with mi, ei ∈ {0, . . . , b − 1} and E ∈ N. For normalization
purposes one assumes that m1 ≠ 0 if a ≠ 0.
For given (b, r, s) let the set A = A(b, r, s) denote all real numbers a ∈ R with the representation (∗).
To store a number a ∈ D = [−amax, −amin] ∪ {0} ∪ [amin, amax] a mapping from D to A(b, r, s) is
defined: fl : D → A with |fl(a) − a| = min_{â∈A} |â − a|.

Remark. The floating point representation allows one to store real numbers of very different magnitude,
e.g., the speed of light c ≈ 0.29998 · 10⁹ m/s or the electron mass m0 ≈ 0.911 · 10⁻³⁰ kg.
We usually use the decimal system with b = 10, while on computers a binary representation is used
with b = 2. The constants r, s ∈ N depend on the architecture of the computer and the desired
accuracy.
Lemma 1. The set A(b, r, s) is finite. Its largest and smallest positive elements are amax = (1 −
b^{−r}) · b^{b^s − 1} and amin = b^{−b^s}, respectively.
Proof. Left as an exercise.
Remark. Usually for a ∈ (−amin, amin) one defines fl(a) = 0 (“underflow”). If |a| > amax (“overflow”)
many programs set a = NaN (not a number) and the computation has to be terminated.
Theorem 3. (Rounding errors) The absolute error is bounded by

|a − fl(a)| ≤ (1/2) b^{−r} · b^E ,

where E is the exponent of a. For the relative error caused by fl(a) for a ≠ 0 the estimate

|fl(a) − a| / |a| ≤ (1/2) b^{−r+1}

holds.
Definition 11. The machine epsilon εM := (1/2) b^{−r+1} is half the difference between 1 and the
next larger representable number.
Defining ε := (fl(a) − a)/a, one has fl(a) = a + aε = a(1 + ε) and |ε| ≤ εM.
Proof. (Theorem 3) In the worst case fl(a) will differ from a by half a unit in the last position of
the mantissa of a: |a − fl(a)| ≤ (1/2) b^{−r} b^E.
Since we are assuming a normalized representation (m1 ≠ 0) it follows that |a| ≥ b^{−1} b^E and
therefore

|fl(a) − a| / |a| ≤ ((1/2) b^{−r} b^E) / (b^{−1} b^E) = (1/2) b^{−r+1} .
Example 11 (IEEE format). A common format is the IEEE format. It provides standards for
single and double precision floating point numbers. A double precision number is stored using 64
bits (8 bytes):

x = ±m · 2^{c−1022} .

One bit is used to store the sign. 52 bits are used for the mantissa m = 2⁻¹ + m2 2⁻² + · · · + m53 2⁻⁵³
(the first position is one due to normalization). The characteristic c = c0 2⁰ + · · · + c10 2¹⁰ ∈ [1, 2046]
can be stored in the remaining 11 bits. Here mi, ci ∈ {0, 1}. By storing the exponent in the form
c − 1022, i.e., without a sign, the range of numbers is doubled. The two excluded cases c = 0
and c = 2047 are used to store x = 0 and NaN, respectively. We have amax ≈ 2^{1024} ≈ 1.8 · 10^{308},
amin = 2^{−1022} ≈ 2.2 · 10^{−308}, and εM = (1/2) · 2^{−52} ≈ 10^{−16}.
Definition 12. (Machine operations) Each basic operation ⋆ ∈ {+, −, ×, /} is replaced by a machine
operation ⊛. In general

a ⊛ b = fl(a ⋆ b) = (a ⋆ b)(1 + ε)

with |ε| ≤ εM.

Remark. The operation ⊛ is in general neither associative nor distributive.

Example 12. (Loss of significance) In this example we use b = 10 and r = 6. We study the
problem (f, x0) with

f(x) = √(x + 1) − √x ,   x0 = 100 .

As x gets large, √(x + 1) and √x are of very similar magnitude and subtracting the two values
is ill conditioned, as we already saw. Assume that we can compute the square roots also up to
six decimals: fl(√101) = 0.100499 · 10² and fl(√101) ⊖ fl(√100) = 0.499000 · 10⁻¹ instead of
fl(√101 − √100) = 0.498756 · 10⁻¹. So we have lost 3 significant figures from the available 6.
Rewriting f in the form

f(x) = (√(x + 1) − √x)(√(x + 1) + √x) / (√(x + 1) + √x) = 1 / (√(x + 1) + √x)

removes the problem of loss of significance because adding √(x + 1) and √x is well conditioned:
fl(√(x + 1)) ⊕ fl(√x) = 0.200499 · 10² and 1 ⊘ (0.200499 · 10²) = 0.498755 · 10⁻¹.
Observe that:

fl( z/εM ) = fl(z)/fl(εM) ≤ z(1 + δ)/εM = z·2²³ + δz·2²³

in single precision floating point. So the error term is magnified. It is possible to compensate for
this:
Example 13. Let

f(x) = log(x + 1)/x

and consider x ≈ 0. Note that lim_{x→0} f(x) = 1.
To compensate for large errors, we have:

fl( log(x + 1)/x ) = fl(log(x + 1)) ⊘ fl((x ⊕ 1) ⊖ 1) .

Example 14. Let

f(x) = (1 − cos x)/x²

and consider x ≈ 0. By Taylor expansion:

cos x = 1 − x²/2! + x⁴/4! − · · ·

So we have:

(1 − cos x)/x² = 1/2! − x²/4! + x⁴/6! − (x⁶/8!) cos ξ ,

which can be evaluated near x = 0 without cancellation.
The following definition is closely related to the notion of well-posedness given previously.
Definition 13. An algorithm is called stable if small changes in the initial data produce only
small changes in the final results. Otherwise it is called unstable.
Let E0 > 0 be an initial error, and En be the error after n steps. Typically we can have
(i) Linear growth: En ≈ C · n · E0 for a constant C ∈ R.
(ii) Exponential growth: En ≈ Cⁿ · E0, C > 1. This occurs for example if En = C En−1. Exponential
growth is not acceptable.

Example 15. Consider the sequence defined by

xn+1 = (13/3) xn − (4/3) xn−1 ,   x0 = 1, x1 = 1/3 ,   n ≥ 1   (I)

This is equivalent to:

xn+1 = (1/3) xn ,   n ≥ 0, x0 = 1   (II)

In (I) previous errors are amplified by the factors 13/3 > 1 and 4/3; this is unstable, while in (II) an
error in xn is damped by the factor 1/3, which makes the algorithm stable.
Here is an example with r = 5:

n 1 2 3 4 5 6 7 8
(II) 0.33333 0.11111 0.03704 0.01235 0.00412 0.00137 0.00046 0.00015
(I) 0.33333 0.1111 0.03699 0.01216 0.00337 −0.00161 −0.01147 −0.04755

The exact value is 0.00015 in the obtainable precision. Even with r = 8 we obtain with (II)
0.00015242 and with (I) 0.00010407; the exact value is 0.00015242. The corresponding relative
errors are approximately 0.27 · 10−6 and 0.31.
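A small Python sketch of this experiment (decimal rounding to a fixed number of digits is emulated
with Python's round, which is an illustrative simplification):

def run_unstable(steps, digits):
    # recurrence (I): x_{n+1} = 13/3 x_n - 4/3 x_{n-1}, rounded after every step
    x_prev, x = 1.0, round(1.0 / 3.0, digits)
    for _ in range(steps - 1):
        x_prev, x = x, round(13.0 / 3.0 * x - 4.0 / 3.0 * x_prev, digits)
    return x

def run_stable(steps, digits):
    # recurrence (II): x_{n+1} = x_n / 3, rounded after every step
    x = 1.0
    for _ in range(steps):
        x = round(x / 3.0, digits)
    return x

print(run_stable(8, 5), run_unstable(8, 5))  # (II) stays close to (1/3)^8, (I) drifts away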
Overall stability means that errors in previous steps are not amplified.
Example 16. (Error amplification) We want to compute the integrals Ik := ∫₀¹ x^k/(x + 5) dx.

(A) Observe

I0 = ln(6) − ln(5)

and

Ik + 5 I_{k−1} = 1/k   (k ≥ 1), since

∫₀¹ ( x^k/(x + 5) + 5 x^{k−1}/(x + 5) ) dx = ∫₀¹ x^{k−1} dx = 1/k .

A computation with 3 decimals (r = 3, b = 10) leads to

Ī0 = 0.182 · 10⁰ ,  Ī1 = 0.900 · 10⁻¹ ,  Ī2 = 0.500 · 10⁻¹ ,  Ī3 = 0.833 · 10⁻¹ ,  Ī4 = −0.166 · 10⁰ .

Here we use Īk to denote the computed value taking rounding errors into account. Obviously
Ik is monotone decreasing and Ik ↘ 0 (k → ∞), but this is not observed for the computed
values; we even have Ī4 < 0. On a standard PC we found: Ī21 = −0.158 · 10⁻¹ and
Ī39 = 8.960 · 10¹⁰.

This is a typical example of error accumulation. In the scheme the error in I_{k−1} is amplified
by the factor 5 to compute Ik.
(B) If one computes the values for Ik exactly, one observes that I9 = I10 up to the first three
decimals. Using the backwards iteration I_{k−1} = (1/5)(1/k − Ik) we obtain

Ī4 = 0.343 · 10⁻¹ ,  Ī3 = 0.431 · 10⁻¹ ,  Ī2 = 0.500 · 10⁻¹ ,  Ī1 = 0.884 · 10⁻¹ ,  Ī0 = 0.182 · 10⁰ .

In this case we observe a damping of errors.
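The two recursions are easy to compare numerically; a hedged Python sketch (in double precision
rather than 3-digit arithmetic, so the forward recursion blows up somewhat later than in the table):

import math

def forward(kmax):
    # forward recursion I_k = 1/k - 5 I_{k-1}: errors are amplified by 5 in each step
    I = math.log(6.0) - math.log(5.0)
    for k in range(1, kmax + 1):
        I = 1.0 / k - 5.0 * I
    return I

def backward(kmax, kstart=50):
    # backward recursion I_{k-1} = (1/k - I_k)/5 from a crude starting guess:
    # errors are damped by 1/5 in each step
    I = 0.0
    for k in range(kstart, kmax, -1):
        I = (1.0 / k - I) / 5.0
    return I

print(forward(30), backward(30))  # the forward value is swamped by amplified rounding errors,
                                  # the backward value is accurate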



Example 17. (Computing the solution to a quadratic equation) Consider the quadratic equation

y² − py + q = 0

for p, q ∈ R and 0 ≠ q < p²/4. The two solutions are y_{1,2} = y_{1,2}(p, q) = p/2 ± √(p²/4 − q). Also p = y1 + y2
and q = y1 y2. From this we can conclude that ∂p y1 + ∂p y2 = 1 and y2 ∂p y1 + y1 ∂p y2 = 0. Therefore,

∂p y1 = y1/(y1 − y2) ,   ∂p y2 = y2/(y2 − y1) .

From this we can conclude that ∂q y1 + ∂q y2 = 0 and y2 ∂q y1 + y1 ∂q y2 = 1. Therefore,

∂q y1 = 1/(y2 − y1) ,   ∂q y2 = 1/(y1 − y2) .

The condition numbers for y1(p, q) are

k1,p = ∂p y1 · p/y1 = (1 + y2/y1)/(1 − y2/y1) ,   k1,q = ∂q y1 · q/y1 = −(y2/y1)/(1 − y2/y1) .

Similar results can be obtained for the condition numbers k2,p and k2,q for y2(p, q). This shows
that the computation of the roots is badly conditioned if the two roots are close together, i.e., y2/y1
is close to one.
For |y2/y1| ≪ 1 the problem is well conditioned. We could employ the following algorithm to compute
the results:

u = p²/4 ,   v = u − q ,   w = √v .

For p < 0 we should first compute y2 = p/2 − w to avoid cancellation effects. For the second root
we can use two different approaches:

(I)  y1 = p/2 + w     or     (II)  y1 = q/y2 .

For q ≪ p²/4 we have w ≈ |p|/2 and (I) is prone to cancellation effects. Errors made in p and w are
carried over to y1:

|∆y1/y1| ≤ |1/(1 + 2w/p)| · |∆p/p| + |1/(1 + p/(2w))| · |∆w/w|   (up to leading order).

Both factors are much greater than one since q ≪ p²/4. The method (II) is on the other hand stable:

|∆y1/y1| ≤ |∆q/q| + |∆y2/y2|   (up to leading order).
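A hedged Python sketch comparing the two variants for a case with q ≪ p²/4 (the concrete numbers
are illustrative):

import math

p, q = -1.0e8, 1.0        # the roots are approximately -1e8 and -1e-8
w = math.sqrt(p * p / 4.0 - q)
y2 = p / 2.0 - w          # root of large magnitude, no cancellation

y1_I = p / 2.0 + w        # variant (I): difference of two nearly equal numbers
y1_II = q / y2            # variant (II): stable, uses q = y1*y2

print(y1_I, y1_II)        # (I) has lost most of its significant digits, (II) is accurate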

2.2 Some ODE Basics


A general system of ODEs is given by

F(z^{(n)}(t), z^{(n−1)}(t), . . . , z'(t), z(t), t) = 0

where z maps the time interval [0, T] to R^m, F = (Fi)_{i=1,...,m}, and z^{(j)} denotes the vector of jth
derivatives of z(t) = (zi(t))_{i=1,...,m} (not the jth power of z). In many cases the ODE can be written
explicitly in the highest derivative, i.e.,

z^{(n)}(t) = F(z^{(n−1)}(t), . . . , z(t), t) ,

reusing the symbol F for the right hand side. By introducing an extended solution vector

y(t) = (z(t), z'(t), z''(t), . . . , z^{(n−1)}(t), τ) ∈ R^{nm+1}

and a suitable right hand side it is possible to reduce this problem to an autonomous, first order
ODE of the form

y'(t) = f(y(t))

and this is the type of problem we are going to study throughout this lecture - or its non-autonomous
counterpart y'(t) = f(y(t), t) - although the two are equivalent, since the time can be treated as an
additional dependent variable satisfying the ODE τ' = 1.
We will consequently assume that y(t) = (yi(t))_{i=1,...,m} for some given m ≥ 1. To make the problem
well posed we also need to provide an initial value, so we will assume that some y0 ∈ R^m is given
and will look for solutions to the initial value problem

y'(t) = f(t, y(t)) ,   y(0) = y0 .

2.2.1 Solvability of the Initial Value Problem (this subsection is not


examinable)
Consider the initial value problem:

y'(t) = f(t, y(t))   (t0 ≤ t ≤ T) ,   y(t0) = y0 .   (∗)

Theorem 4 (Picard's Theorem). If f and ∂f/∂y are continuous in a closed rectangle

R = {(t, y) | a1 ≤ t ≤ a2, b1 ≤ y ≤ b2}

and if (t0, y0) is an interior point of R, then (∗) has a unique solution y = g(t) which passes
through (t0, y0).
Sketch of Proof - requires complete metric spaces and the Banach fixed point theorem. By assumption,
|f(t, y)| ≤ K and |∂f/∂y| ≤ L on R. It follows that

|f(t, y1) − f(t, y2)| ≤ L|y1 − y2|


∀ (t, y1) and (t, y2) ∈ R, so we have the Lipschitz condition. Replace (∗) by an integral equation:
if y = g(t) satisfies (∗) then by integrating

g(t) = y0 + ∫_{t0}^{t} f(s, g(s)) ds .

Choose a > 0 such that La < 1 and |t − t0| ≤ a and |y − y0| ≤ Ka. Let X be the set of all
continuous functions y = g(t) on |t − t0| ≤ a with |g(t) − y0| ≤ Ka; then X is a complete
metric space. Define a mapping T of X into itself by

T g = h ,   h(t) = y0 + ∫_{t0}^{t} f(s, g(s)) ds   (|h(t) − y0| ≤ Ka) .

Furthermore,

|h1(t) − h2(t)| = | ∫_{t0}^{t} [f(s, g1(s)) − f(s, g2(s))] ds | ≤ La sup |g1(t) − g2(t)| .

Since La < 1, T is a contraction on X. We therefore conclude that T g = g has a unique
solution.
In the following we will denote with Y the exact solution of (∗), which we will always assume to
exist on the interval [t0, T], and if not otherwise stated we take t0 = 0.
In very few cases is it possible to provide the function Y in closed form, e.g., for some scalar
(m = 1) problems and for linear problems (f(y) = Ay with some matrix A). This was discussed in
MA133. So in most cases we need to either

1. study certain properties of the solution, e.g., long time behaviour by looking at the stability
of fixed points. This was discussed in MA133.
2. approximate the solution, e.g., simplify the model or use numerical methods. We will look
at both these aspects in this module.

2.3 Discrete Gronwall Lemma


The discrete version of Gronwall’s lemma is useful for analysing discretizations of ODEs - just
like the non discrete version is used to analyse the ODEs themselves. We will be using the
Lemma in the following chapters but not in this chapter where we prove error estimates for the
forward/backward Euler method in a more pedestrian way. But the Lemma is still stated here:

Lemma 2. (discrete Gronwall lemma) Let zn ∈ R₊ satisfy

zn+1 ≤ C zn + D ,   ∀ n ≥ 0

for some C ≥ 0, D ≥ 0 and C ≠ 1. Then

zn ≤ D (Cⁿ − 1)/(C − 1) + z0 Cⁿ .   (2.1)

Proof. The proof proceeds by induction on n. Setting n = 0 in (2.1) yields

z0 ≤ z0

which is obviously satisfied. We now assume that (2.1) holds for a fixed n and prove that it is true
for n + 1. We have

zn+1 ≤ C zn + D
     ≤ C ( D (Cⁿ − 1)/(C − 1) + z0 Cⁿ ) + D
     = D (C^{n+1} − C)/(C − 1) + z0 C^{n+1} + D
     = D ( (C^{n+1} − C)/(C − 1) + (C − 1)/(C − 1) ) + z0 C^{n+1}
     = D (C^{n+1} − 1)/(C − 1) + z0 C^{n+1}
and the induction is complete.

Remark. As mentioned this is a discrete version of the Gronwall Lemma, which states that if a
smooth enough function u satisfies the differential inequality u'(t) ≤ β(t)u(t) for t ≥ a then

u(t) ≤ u(a) e^{∫_a^t β(s) ds} .

In other words, u is bounded by the solution to the differential equation v'(t) = β(t)v(t), v(a) =
u(a). In the same way the right hand side in the bound of the discrete Gronwall lemma is the
solution to the difference equation ξn+1 = C ξn + D (with ξ0 = z0).

2.4 The Forward/Backward Euler method


Most methods to approximate the solution Y use a sequence of nodes t0, t1, t2, . . . , tN = T with
t0 < t1 < · · · < tN and provide ways of computing approximations y1, . . . , yN to the function
values Y(t1), . . . , Y(tN) (note that we assume Y(t0) = y0). In the simplest case the nodes are
chosen to be equidistant, i.e., ti = t0 + ih for some fixed h > 0. Note that such a numerical
method produces a sequence (yn)_{n=0,...,N} starting from the initial condition y0 of the ODE. This
sequence will depend on the time step size h, so that we should be writing (y_n^h)_n, but we will ignore
this dependency in the following - you should keep it in mind though.
The most straightforward approach for discretizing the ODE is based on the approximation

Y'(t + τh) ≈ (Y(t + h) − Y(t))/h ,

for some τ ∈ [0, 1]. That this is a reasonable approximation can be seen using Taylor expansion,
assuming Y ∈ C³ (see the first assignment). In the case that τ = 1/2, i.e., we are aiming at approximating
the time derivative exactly in the middle of the interval [t, t + h], we arrive at

Y'(t + h/2) = (Y(t + h) − Y(t))/h + O(h²) .

If we are not approximating in the middle of the interval, i.e., τ ≠ 1/2, we end up with

Y'(t + τh) = (Y(t + h) − Y(t))/h + O(h) .

This type of superconvergence in some points is a typical property of many finite difference approximations
to derivatives, i.e., they have a higher convergence rate in some isolated points
than in the rest of the domain.
A finite difference approximation can be used to compute an approximation yn+1 to the exact
solution Y at a point in time tn+1 = tn + h, given approximations to Y at some earlier points in
time. For example, taking τ = 0 and t = tn,

f(tn, Y(tn)) = Y'(tn) = (Y(tn+1) − Y(tn))/h + O(h) .

Replacing Y(tn) by yn and Y(tn+1) by yn+1 and ignoring the O(h) term yields

(yn+1 − yn)/h = f(tn, yn) ,

which provides an explicit formula to compute yn+1 given yn:

yn+1 = yn + h f(tn, yn) .

Starting at t0 , y(t0 ) =: y0 , we arrive at

y1 = y0 + hf (t0 , y0 ),

y2 = y1 + hf (t1 , y1 ),
y3 = y2 + hf (t2 , y2 ),
..
.
This is known as the forward or explicit Euler method :

Algorithm. (Forward or Explicit Euler method)

yn+1 = yn + hf (tn , yn ) .
T
The following algorithm provides an approximation to Y(T) given h = T/N for some N > 0:

t = t0, y = y0
While t < T
    y = y + h f(t, y)
    t = t + h
If we start by evaluating the finite difference approximation at t = tn, τ = 1, we arrive at

(yn+1 − yn)/h = f(tn+1, yn+1) .

This does not lead to an explicit formula for yn+1. The approximation at tn+1 has to be computed
by finding the root δn+1 of

Fn(δ; yn, tn, h) = δ − f(tn + h, yn + hδ)

and setting yn+1 = yn + hδn+1. This method is known as the backward Euler method:


Algorithm. (Backward or Implicit Euler method)

yn+1 = yn + hf (tn+1 , yn+1 ) .

The following algorithm provides an approximation to Y(T) given an h > 0:

t = t0, y = y0
While t < T
    Compute δ with δ − f(t + h, y + hδ) = 0 (discussed later)
    y = y + hδ
    t = t + h
There are many methods for finding a root of a nonlinear function. Newton’s method is probably
the most often used approach and we will analyse that method in more detail later in the lecture:
Algorithm. (Newton's method)

δ^{k+1} = δ^k − F(δ^k) / F'(δ^k) .

The above formula is for scalar functions; for vector valued functions F with Jacobian DF:

δ^{k+1} = δ^k − DF(δ^k)⁻¹ F(δ^k) .

The following algorithm provides an approximation to δ with F(δ) = 0 for a given initial guess δ0
and 0 < ε ≪ 1:

δ = δ0
While ||F(δ)|| > ε
    δ = δ − DF(δ)⁻¹ F(δ)
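A hedged Python sketch combining a backward Euler step with a scalar Newton iteration (the
derivative of f is passed explicitly; names, tolerances and the test problem are illustrative):

def backward_euler_step(f, dfdy, t, y, h, eps=1e-12, max_iter=50):
    # one backward Euler step: solve F(delta) = delta - f(t+h, y+h*delta) = 0
    # with Newton's method, then set y_{n+1} = y_n + h*delta
    delta = f(t, y)                                # initial guess: the forward Euler slope
    for _ in range(max_iter):
        F = delta - f(t + h, y + h * delta)
        if abs(F) < eps:
            break
        dF = 1.0 - h * dfdy(t + h, y + h * delta)  # F'(delta)
        delta = delta - F / dF
    return y + h * delta

# illustrative scalar test problem y' = -y with exact solution exp(-t)
f = lambda t, y: -y
dfdy = lambda t, y: -1.0
t, y, h = 0.0, 1.0, 0.1
for n in range(10):
    y = backward_euler_step(f, dfdy, t, y, h)
    t = t + h
print(t, y)   # y is about 0.386, the exact value exp(-1) is about 0.368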


Remark. We will see later that both the forward and backward Euler methods converge with order
one. We will be discussing approaches to improve the accuracy and also why using a more complex
implicit method can sometimes be a good idea.
Using the finite difference quotient in the middle of the interval, to take advantage of the higher
convergence rate, is not so straightforward:

(yn+1 − yn)/h = f(t_{n+1/2}, y_{n+1/2}) ,

since y_{n+1/2} is not part of the sequence we are computing. We can either use the approximation
on an interval of length 2h:

(yn+1 − yn−1)/(2h) = f(tn, yn) ,

which we can use to compute yn+1 assuming we know both yn and yn−1. This type of method
is called a multistep method and is discussed later in the lecture. A second approach is to replace
f(t_{n+1/2}, y_{n+1/2}) by an approximation - either

f(t_{n+1/2}, y_{n+1/2}) ≈ (1/2) (f(tn, yn) + f(tn+1, yn+1))

or

f(t_{n+1/2}, y_{n+1/2}) ≈ f(t_{n+1/2}, (1/2)(yn + yn+1)) .

The first approach is often called the Crank-Nicolson method while the second is called the implicit
midpoint method. Both are implicit: they look very similar to the backward Euler method but
note that they do have a higher complexity since more evaluations of f are required in each step
and evaluation of f can be very expensive. On the other hand the higher convergence rate of
the finite difference approximation at the midpoint could improve the overall convergence rate of
the method - assuming we haven't messed things up by approximating f(t_{n+1/2}, y_{n+1/2}).
Remark (Alternative derivation). Starting from y'(t) = f(y(t)) we can use integration over the
time interval [tn, tn+1] to derive an approximation:

y(tn+1) − y(tn) = ∫_{tn}^{tn+1} y'(t) dt = ∫_{tn}^{tn+1} f(y(t)) dt .

Now to get a numerical scheme we need to approximate the integral on the right. We will study
more sophisticated methods later in the course but for now we can use an approximation based
on a single point in the interval:

y(tn+1) − y(tn) = ∫_{tn}^{tn+1} f(y(t)) dt ≈ (tn+1 − tn) f(y(tn + τ(tn+1 − tn))) = h f(y(tn + τh)) ,

for some τ ∈ [0, 1]. As we will (hopefully) have noticed, we have rediscovered the forward Euler
(τ = 0) and the backward Euler (τ = 1) methods.

2.4.1 Convergence analysis


In this section we want to study the most important (at least mathematically speaking) question
concerning the forward and backward Euler method: do they converge, i.e., is yn in any way
an approximation of Y (tn ) if h is small enough? Also we would like to know what the rate of
convergence is (assuming it does converge).
Definition 14 (Approximation error). For fixed step size h > 0 the approximation error at a
given step n is defined by
en := |yn − Y (tn )| .
The value en will of course depend on h. The total approximation error is given by
    E(h) := max_{0≤k≤N} ek .

Remark. The definition of the approximation error is not unique or, more to the point, the way to
measure the error is not unique. For example instead of looking at the maximum error over the
time interval we could study an average error or the error eN at the final time only as we will do
in the assignment.
During the derivation of the schemes we replaced the time derivative with a finite difference
quotient by dropping higher derivative terms in the Taylor expansion. Recall for example the
formula used for the forward Euler method
    Y (tn+1 ) = Y (tn ) + hf (tn , Y (tn )) + O(h^2 )
which indicates that we are introducing an error of magnitude h^2 when going from Y (tn )
to Y (tn+1 ). So in each step we make an error of magnitude h^2 but we take N ∼ 1/h steps
to get to the final time. The question is: how do these errors in each step add up over all time
steps?
Forward Euler
For h > 0 let yn+1 = yn + hf (tn , yn ) be the sequence produced by the forward Euler method,
tn = nh the sequence of points in time. We also introduce the evaluation of the exact solution at
these points in time, i.e., Yn := Y (tn ) and finally we denote the error at each of these points in
time by en := |yn − Yn |. Keep in mind that tn , yn , Yn , en all depend on h. Next we introduce the
local truncation error which is conceptually the error introduced by inserting the exact solution
into the numerical scheme:
Definition 15 (Local truncation error for forward Euler method). For t ∈ [0, T − h] we define
the truncation error for the forward Euler method to be

τ (t; h) := Y (t + h) − Y (t) − hf (t, Y (t))

or, for a given step n, the local truncation error for the forward Euler method is

τn := Yn+1 − Yn − hf (tn , Yn )

Remark. From our definition we have Yn+1 = Yn + hf (tn , Yn ) + τn and our derivation based on Taylor
expansion shows that the truncation error converges quadratically to 0, i.e., τn = O(h^2 ).
We will now drop f 's dependency on t to simplify the notation and furthermore assume that f is
Lipschitz continuous (an assumption commonly used in the existence theory of ODEs), i.e.,

|f (u) − f (v)| ≤ L|u − v| .

Using our definitions we now have
    en+1 = |yn + hf (yn ) − (Yn + hf (Yn ) + τn )| ≤ en + h|f (yn ) − f (Yn )| + τn ≤ en + Lh en + τn ≤ (1 + Lh)en + τn .

We thus conclude that going from step n to n + 1 leads to an amplification of the error en by
1 + Lh plus an additional O(h2 ) error coming from the truncation error in that step.
We can now apply the same estimate to en to get
    en+1 ≤ (1 + Lh)en + τn ≤ (1 + Lh)^2 en−1 + (1 + Lh)τn−1 + τn ≤ · · · ≤ (1 + Lh)^{n+1} e0 + ∑_{i=0}^{n} (1 + Lh)^i τn−i

Since e0 = |y0 − Y (0)| = |y0 − y0 | = 0 and using τn ≤ Ch^2 where C depends neither on h nor on
n (it depends on the second derivative of Y over the whole time interval) we get
    en+1 ≤ Ch^2 ∑_{i=0}^{n} (1 + Lh)^i

for all 0 ≤ n < N where N = T /h or h = T /N . We still need to estimate the size of the final
geometric sum:
    ∑_{i=0}^{n} (1 + Lh)^i = ((1 + Lh)^{n+1} − 1)/((1 + Lh) − 1) = ((1 + Lh)^{n+1} − 1)/(Lh) .
Using n + 1 ≤ N = T /h and therefore h ≤ T /(n + 1), so that Lh ≤ LT /(n + 1), we obtain
    ∑_{i=0}^{n} (1 + Lh)^i ≤ (1/(Lh)) ((1 + LT /(n + 1))^{n+1} − 1) .
Using that (1 + a/i)^i ≤ e^a we conclude
    ∑_{i=0}^{n} (1 + Lh)^i ≤ (1/(Lh)) (exp(LT ) − 1) = O(h^{−1}) .

Putting everything together we get:



Theorem 5 (Convergence of the forward Euler method). If the exact solution is in C^2 and the
right hand side f is Lipschitz continuous with Lipschitz constant L then the approximation error
converges linearly to 0 for h → 0:
    E(h) ≤ (exp(LT ) − 1)/(Lh) · max_{0≤i≤N} τi = O(h) .

So we have proven (what we already guessed from our numerical experiments) that the forward
Euler method converges linearly to the exact solution. The constant in the error bound depends
on (i) the end time, (ii) the second derivative of the exact solution, i.e., max_{0≤t≤T} |Y ′′(t)|, and (iii)
the Lipschitz constant of the right hand side f .
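The linear convergence can also be observed numerically. The following Python sketch (not part of the notes; the test problem y' = −y with Y (t) = e^{−t} is chosen purely for illustration) computes E(h) for a sequence of halved step sizes and the experimental orders of convergence log2(E(h)/E(h/2)), which should approach 1:

    import numpy as np

    def forward_euler_error(N, T=1.0):
        # return E(h) = max_n |y_n - Y(t_n)| for y' = -y, y(0) = 1 on [0, T] with N steps
        h = T / N
        y, err = 1.0, 0.0
        for n in range(N):
            y = y + h * (-y)
            err = max(err, abs(y - np.exp(-(n + 1) * h)))
        return err

    errors = [forward_euler_error(N) for N in (20, 40, 80, 160, 320)]
    print(errors)
    print([np.log2(errors[i] / errors[i + 1]) for i in range(len(errors) - 1)])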
Remark. The result given above for the forward Euler method is very typical for many numerical
schemes used to solve ODEs where the error at a time step can often be expressed as the error at
the previous time step plus the truncation error at that step. Since these methods are often derived
by truncating a Taylor expansion, the convergence rate for the truncation error is straightforward
to obtain. Modifying the above argument, using that (i) e0 = 0 and (ii) the truncation errors add
up in the form of a geometric sum, one can derive the overall convergence rate from the rate of
the truncation error. Of course one needs to keep in mind that the solution to the ODE (or the
right hand side function f ) has to be smooth enough to carry out the truncated Taylor expansion
in the first place.

Backwards Euler
The analysis for the backward Euler method is almost identical to the above. In this case yn+1 =
yn + hf (yn+1 ) and τn := Yn+1 − hf (Yn+1 ) − Yn .

    en+1 ≤ en + h|f (yn+1 ) − f (Yn+1 )| + τn ≤ en + hL en+1 + τn

Since we are interested in h → 0 we can assume that hL ≤ 1 − ε for some ε ∈ (0, 1); then rearranging
terms leads to
    en+1 ≤ en /(1 − hL) + τn /(1 − hL) ≤ · · · ≤ e0 /(1 − hL)^{n+1} + ∑_{i=0}^{n} τn−i /(1 − hL)^i .

Again we can use that e0 = 0, τi ≤ Ch2 so that we are left with


    en ≤ Ch^2 ∑_{i=0}^{n−1} β^i = Ch^2 (β^n − 1)/(β − 1)

where β = 1/(1 − hL). We have β − 1 = hL/(1 − hL) and also

    β = 1 + hL/(1 − hL) ≤ 1 + hL/ε
since we assumed that hL ≤ 1 − ε. Using that 1 + x ≤ e^x we finally have β ≤ e^{hL/ε} and thus, using
again h = T /N and n ≤ N :
    β^n ≤ e^{nhL/ε} ≤ e^{TL/ε} .
Putting it all together we have shown that
Theorem 6 (Convergence of the backward Euler method). If the exact solution Y ∈ C^2 and the
right hand side f is Lipschitz continuous with Lipschitz constant L then the approximation error
converges linearly to 0 for h → 0. For h ≤ (1 − ε)/L for some ε ∈ (0, 1) the error is bounded by
    E(h) ≤ (1 − hL)/(hL) · max_{0≤i≤N} τi · (exp(TL/ε) − 1) = O(h) .

2.4.2 Stability analysis


In the previous section we studied the convergence of the forward and backward Euler method
- a part of the numerical analysis of a scheme that is concerned with the behaviour for h → 0.
During any practical computation one has to be content with some finite step size h > 0 so that
the behaviour of the discretization in that case is important. We already saw in the introduction
that although the forward/backward Euler methods converge they can show some very undesirable
behaviour.
The simplest way of understanding the issue of stability of a numerical scheme for solving ODEs
is that one is trying to analyse the influence of any error in the initial conditions on the resulting
sequence (yn )n . For example consider two initial conditions y0 and ỹ0 = y0 + δ and the two
resulting sequences using a discretization of the ODE (yn )n and (ỹn )n . A form of stability could
imply max0≤n≤N |yn − ỹn | ≤ C|y0 − ỹ0 | = C|δ| with a constant depending possibly on T = N h
and on the right hand side f of the ODE. Of course we can only expect to be able to prove this
continuous dependency on the initial data if the same holds for the exact solutions Y, Ỹ of the
ODE with initial conditions y0 , ỹ0 , respectively.
Let us assume that the ODE is linear, i.e., f (y) = Ay with some invertible matrix A ∈ R^{m×m} .
Then the above description of stability simplifies considerably: if Y, Ỹ are the exact solutions to
the ODE with initial data y0 , ỹ0 , respectively, then Ȳ = Y − Ỹ solves the same ODE (since it is
linear) with initial condition δ. So stability for a linear ODE with respect to errors in the initial
data is related to the question whether solutions with small initial data δ will stay near/converge to 0
for t → ∞, i.e., will stay near the solution with 0 as initial condition. This will (for example) be
the case if 0 is a stable fixed point or a centre, but not if 0 is unstable.
For a linear ODE the explicit Euler method turns into

yn+1 = (I + hA)yn

while the implicit Euler method reads

(I − hA)yn+1 = yn

where I ∈ R^{m×m} is the identity matrix.


Assume that we can diagonalize A in the form A = R^{−1} DR where D = diag(λ1 , . . . , λm ) is a
diagonal matrix with the (possibly complex) eigenvalues λi of A on the diagonal. Using this in
the forward Euler method we conclude
    yn+1 = (I + hA)yn = R^{−1} (I + hD)Ryn .

Defining zn = Ryn we find that


zn+1 = (I + hD)zn
or componentwise
    zn+1,i = (1 + hλi )zn,i .
That means that the system is decoupled but zn could be complex valued. Using z0 = Ry0 we
thus have for i = 1, . . . , m:
    zn+1,i = (1 + hλi )^{n+1} z0,i .
We can use the same decomposition for the original ODE as well: defining Z := RY we get
    d/dt Zi = λi Zi
so that
    Zi (t) = exp(λi t) z0,i .
Remark. Note the connection between the exponential function appearing as a factor in the exact
solution and the factor 1 + hλi in the forward Euler method - recall e^x ≈ 1 + x for x small and
h(n + 1) = tn+1 .
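The decoupling can also be checked numerically. The following sketch (the 2 × 2 matrix is a hypothetical example, essentially the mass spring system with eigenvalues ±i) compares N forward Euler steps for the full system with the decoupled update applied to z = Ry:

    import numpy as np

    A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # eigenvalues +/- i
    lam, V = np.linalg.eig(A)                 # A = V diag(lam) V^{-1}, so R = V^{-1}
    R = np.linalg.inv(V)

    h, N, y0 = 0.1, 50, np.array([1.0, 0.0])
    y = y0.copy()
    for _ in range(N):
        y = y + h * (A @ y)                   # forward Euler for the full system

    z = (1.0 + h * lam) ** N * (R @ y0)       # decoupled update z_{N,i} = (1 + h lam_i)^N z_{0,i}
    print(np.allclose(R @ y, z))              # True: both give the same result (up to round-off)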

The decoupling argument shows that it is enough to study stability for linear problems in the case
of complex valued scalar problems (those correspond to either real valued problems or to 2 × 2
systems). So in the following we again denote by Y the now possibly complex valued solution
which satisfies
    Y ′(t) = λY (t)
with a complex constant λ. Since we are interested in the setting where the origin is a
stable fixed point, we assume the real part of λ is less than zero: λ ∈ C, Realλ < 0. We then know
that the origin is a stable fixed point and thus |Y (t)| → 0 for t → ∞ independent of the initial
condition Y0 , in fact:

    Y (t) = e^{λt} Y0 = e^{Realλ t} ( cos(Imagλ t) + i sin(Imagλ t) ) Y0
and so
    |Y (t)| = e^{Realλ t} |Y0 | .
1. |Y (t)| is monotonically decreasing and converges to 0,
2. if λ ∈ R then Y (t) is strictly monotonically decreasing if Y0 > 0 and strictly monotonically
   increasing if Y0 < 0. Consequently, Y (t) has the same sign as Y0 for all t.
Now one can ask the question: under which conditions does the sequence (yn )n behave in the
same way?

Forward Euler
As we saw above
    yn+1 = (1 + hλ)^{n+1} Y0
and thus |yn | = |1 + hλ|^n |Y0 |. To get this to converge to zero requires |1 + hλ| < 1; since Realλ < 0
this will hold for h sufficiently small but not for too large values of h. In fact
    1 > |1 + hλ| = √((1 + hRealλ)^2 + h^2 (Imagλ)^2 )

so squaring both sides leads to the condition
    1 > 1 + 2hRealλ + h^2 ((Realλ)^2 + (Imagλ)^2 ) = 1 + 2hRealλ + h^2 |λ|^2
and thus
    h < 2|Realλ|/|λ|^2 ,
recalling that our assumption was that Realλ < 0. There are two interesting cases here:
1. Imagλ = 0: the condition reduces to h < 2/|Realλ| or, perhaps easier to remember, h|λ| < 2.

2. Realλ = 0 (purely imaginary λ): in this case the condition for convergence of the forward Euler
   scheme is h < 0, so it is not achievable. On the other hand the exact solution satisfies |Y (t)| = |Y0 |
   (the origin is a centre) so aiming for |yn | → 0 does not make sense. But since |yn | = |1 + hλ|^n |Y0 |
   the issue is not only that the discrete approximation does not converge to zero but in fact it grows
   monotonically without bound. This is in fact the setting of our mass spring system from
   the introduction where our experiments showed that the right long time behaviour is not
   achievable with the forward Euler method.
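A quick numerical check of the condition h|λ| < 2 in the real case (λ = −10 and the step sizes below are chosen only for illustration, they are not from the notes):

    def forward_euler_linear(lam, h, N, y0=1.0):
        # N forward Euler steps for y' = lam*y, i.e. y_{n+1} = (1 + h*lam) * y_n
        y = y0
        for _ in range(N):
            y = (1.0 + h * lam) * y
        return y

    lam = -10.0
    for h in (0.05, 0.15, 0.25):              # h*|lam| = 0.5, 1.5, 2.5
        print(h, forward_euler_linear(lam, h, 200))
    # for h*|lam| < 2 the iterates decay towards 0, for h*|lam| > 2 they blow up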

Backwards Euler
Doing the same for the backward Euler scheme we arrive at (1 − hλ)yn+1 = yn , or
    yn+1 = Y0 /(1 − hλ)^{n+1} .

We again focus on the stable case, i.e., Realλ < 0. Now |yn | → 0 if and only if |1 − hλ| > 1 or
    1 < (1 − hRealλ)^2 + h^2 (Imagλ)^2 = 1 + 2h|Realλ| + h^2 |λ|^2

which holds for any value of h, again using that Realλ < 0. Note that the condition is satisfied
for any h in the case that Realλ ≤ 0 (in fact even for quite a lot of the right half of the complex
plane). In the purely imaginary case the condition is still satisfied for any h > 0, which means that
|yn | → 0 although the exact solution is a centre with |Y (t)| = |Y0 | - but we already saw this in
the introduction where our simulations showed that for the mass spring system the implicit Euler
method always leads to approximations that converge to zero.
A way to visualize the stability property of the two schemes is to plot the stability region in the
complex plane, i.e., for the forward Euler method all complex values z = hλ such that |1 + z| < 1,
while for the implicit Euler method we shade the left half plane (although, as pointed out above,
most of the right half should also be shaded):

[Figure: stability regions in the complex z = hλ plane; left: forward Euler, the disc |1 + z| < 1; right: backward Euler, the (shaded) left half plane.]
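A plot of the two shaded regions can be produced with a few lines of matplotlib (a sketch, not the code used for the figure in the notes):

    import numpy as np
    import matplotlib.pyplot as plt

    x, y = np.linspace(-3, 2, 400), np.linspace(-2, 2, 400)
    X, Y = np.meshgrid(x, y)
    Z = X + 1j * Y                                   # z = h*lambda

    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    # forward Euler: stable where |1 + z| < 1
    axes[0].contourf(X, Y, (np.abs(1 + Z) < 1).astype(float), levels=[0.5, 1.5])
    axes[0].set_title("forward Euler")
    # backward Euler: stable where |1 - z| > 1 (contains the whole left half plane)
    axes[1].contourf(X, Y, (np.abs(1 - Z) > 1).astype(float), levels=[0.5, 1.5])
    axes[1].set_title("backward Euler")
    plt.show()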

Definition 16 (Absolute stability). The stability concept described so far is referred to as absolute
stability.
Methods like the backward Euler method which are stable for all step sizes h are called uncondi-
tionally (absolutely) stable or simply A-stable, while methods like the forward Euler method which
require h to be small enough for stability are called conditionally stable.
A more formal discussion will be carried out later in the lecture.
We have so far focused on the limiting behaviour of yn for n → ∞ in the case of linear problems
with a stable fixed point at the origin. As pointed out above, if λ ∈ R and λ < 0 then the
exact solution is monotone decreasing or increasing depending on the sign of the initial condition
Y0 ∈ R. We will now look at conditions for h which guarantee that this behaviour carries over to
the sequence (yn )n when using the forward or backward Euler methods.
Starting with the forward Euler method we have yn = (1 − h|λ|)^n Y0 where now λ is real and
negative. We can assume for simplicity that Y0 > 0; then we want to find a step size condition
so that 0 < yn+1 < yn , which is equivalent to 0 < 1 − h|λ| < 1 as can easily be seen. Since
h|λ| > 0 the second condition is always satisfied while the first condition requires h|λ| < 1. It is
worth now comparing this to the condition for absolute stability derived previously, which in the

current setting is |1 − h|λ|| < 1 which is equivalent to −1 < 1 − h|λ| < 1. So, as is to be expected,
achieving monotonicity requires a harder restriction on the time step than absolute stability did:
for |yn | → 0 we require h < 2/|λ| while monotonicity requires h < 1/|λ|.
Turning our attention to the backward Euler method with yn = Y0 /(1 + h|λ|)^n we see that mono-
tonicity always holds since 0 < 1/(1 + h|λ|) < 1 for any step size h > 0.
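The difference is easy to observe numerically; a small sketch with λ = −10 and h = 0.15, so that 1/|λ| < h < 2/|λ| (values chosen for illustration only):

    lam, h = -10.0, 0.15
    y_fe, y_be = 1.0, 1.0
    for n in range(5):
        y_fe = (1.0 + h * lam) * y_fe     # forward Euler: 1 + h*lam = -0.5, iterates oscillate in sign
        y_be = y_be / (1.0 - h * lam)     # backward Euler: factor 1/2.5, iterates stay positive and decrease
        print(n + 1, y_fe, y_be)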
