
MATH49111/69111: Scientific Computing

Lecture 5
5th October 2016

Dr Chris Johnson
Mathematical modelling

[Diagram: Real-world problem → (modelling error) → Equations → (numerical error) → Numerical solution → (interpretation error) → Problem solution]

Solving real-world problems is a three-stage process:

1. Formulate equations that describe (an approximation of) the
real-world problem
2. Find an (approximate) numerical solution to these equations
3. Interpret solutions in context of problem

. It is important to keep the stages separate

. We will focus on stage 2 (the easiest!)
Types of numerical error

Discretisation/truncation error
. Many problems must be approximated or discretised before
being solved numerically
. For example, we may approximate an infinite sum by a finite
sum of many terms
. The error introduced by approximating the problem is
truncation or discretisation error

Roundoff error
. Operations on floating-point (fractional) numbers are inexact
. Often we cannot solve even our approximated problem exactly
Measuring errors

. In order to quantify errors in our solutions we need to define a
measure for the error
. If $\phi^*$ is an approximation to a scalar quantity $\phi$ then the
absolute error is defined by

$$ |\phi - \phi^*| $$

. The relative error is defined by

$$ \frac{|\phi - \phi^*|}{|\phi|} \quad (\text{when } \phi \neq 0) $$
Measuring errors

. Often our solution will be a vector $v = (v_i)$, rather than a scalar,
with an approximation $v^\star$.
. Rather than list the error for every component, we define a
scalar error measure with a norm
. Three norms are commonly used for absolute errors:
. The $L_1$ norm, $\|v^\star - v\|_1 := \sum_i |v^\star_i - v_i|$,
. The $L_2$ norm, $\|v^\star - v\|_2 := \left( \sum_i (v^\star_i - v_i)^2 \right)^{1/2}$,
. The $L_\infty$ norm, $\|v^\star - v\|_\infty := \max_i |v^\star_i - v_i|$, the maximum absolute
error in any component

. Relative errors for an $L_p$ norm are defined by $\|v^\star - v\|_p / \|v\|_p$

. If v represents a (discretisation of a) function, use the norm of
the function not the norm of the vector.
Roundoff error

. When using floating-point fractional numbers (float and
double), we can only store around 7 decimal digits (for float)
or 15-16 decimal digits (for double).
. Truncation of digits beyond this (called roundoff error) means
that most floating-point calculations are inexact
. Roundoff error is particularly bad when subtracting two
numbers of similar size, or when adding numbers of very
dissimilar sizes.
. Errors may be magnified repeatedly under certain sequences
of calculations (this is called numerical instability)
. We can sometimes control this by altering the algorithm or
rearranging the order of operations
Example: roundoff error in sums
We wish to evaluate
$$ e^{-2\pi} = 1 - 2\pi + \frac{(2\pi)^2}{2!} - \frac{(2\pi)^3}{3!} + \frac{(2\pi)^4}{4!} - \cdots = 1.86744\ldots \times 10^{-3} $$
using float variables (∼ 7 decimal digits of precision)
. The sum has heavy cancellation
. the largest terms are ≈ ±80
. the final result is ≈ $10^{-3}$

. We estimate the error as arising from the largest terms:

roundoff error ≈ $10^{-7} \times 80 \approx 10^{-5}$

. Result obtained with float variables: $1.8714\ldots \times 10^{-3}$

. Accurate only to two digits (result = $O(10^{-3})$; error = $O(10^{-5})$)
Example: roundoff error in sums
How can we get a more accurate answer?
$$ e^{-2\pi} = \frac{1}{e^{2\pi}} = \frac{1}{1 + 2\pi + \frac{(2\pi)^2}{2!} + \frac{(2\pi)^3}{3!} + \cdots} $$

. The largest terms are ≈ 80, as before
⇒ the roundoff error from these terms is ≈ $10^{-5}$, as before
. The value of the sum is now $e^{2\pi} \approx 500$
. Expect ∼ 7-digit accuracy (result = $O(10^{2})$; error = $O(10^{-5})$)
. about as good as we could ever expect from float

. Final result obtained with float: $1.86744236\ldots \times 10^{-3}$

. Reordering sums can reduce the roundoff error [1]
(floating-point addition is not associative!)
[1] N. J. Higham, 1993. SIAM J. Sci. Comput. 14(4), 783–799
What will the output of this program be?
#include <iostream>

int main()
{
    // The same expression is stored in a float and in a double
    float f = 4.0/3.0;
    double d = 4.0/3.0;
    double difference = f - d;

    std::cout << difference << std::endl;
}


A. A number around $10^{-7}$
B. A number roughly between $-10^{-7}$ and $10^{-7}$
C. A number around $10^{-15}$
D. A number roughly between $-10^{-15}$ and $10^{-15}$
E. 0
Discretisation/truncation error
. Many problems are impossible to solve exactly on a computer
. Infinite expressions:

$$ \gamma = \sum_{k=1}^{\infty} \left[ \frac{1}{k} - \log\left(1 + \frac{1}{k}\right) \right] $$

(we cannot sum an infinite number of terms)

. Continuous problems:
$$ f''' + \tfrac{1}{2} f f'' = 0 $$
(we cannot represent f exactly, only an approximation)

. To solve these problems we must approximate them as a
discrete, finite problem that can be solved numerically.
. This approximation introduces error, often more significant
than roundoff error.
Numerical error: calculating a derivative
How do we calculate the derivative $f'(x)$ of a function $f(x)$?
We cannot directly evaluate the limit

$$ f'(x) = \lim_{h \to 0^+} \frac{f(x+h) - f(x)}{h}. $$

However, expanding f about x as a Taylor series,
$$ f(x+h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \ldots $$
we find
$$ \frac{f(x+h) - f(x)}{h} = f'(x) + \frac{h}{2} f''(x) + \ldots \approx f'(x) \quad \text{when } h \ll 2f'(x)/f''(x) $$

We approximate $f'(x)$ by choosing a small but finite h.

Numerical error: calculating a derivative
Our approximate derivative, calculated with double variables, is
$$ \hat{f}' := \frac{f(x+h) - f(x)}{h}, $$

$f(x) = \sin(x)$, $x = 1$, $f'(x) = \cos(1) = 0.5403\ldots$

[Figure: $\hat{f}'$ plotted against $h$, for $h$ from $10^{-16}$ to $10^{0}$]
Numerical error: calculating a derivative
Our numerical estimate $\hat{f}'$ is
. apparently accurate over a wide range of h < 1
. very inaccurate for h > $10^{-2}$ and h < $10^{-14}$
Numerical error: calculating a derivative

Truncation error
$$ \frac{f(x+h) - f(x)}{h} = f'(x) + \frac{h}{2} f''(x) + O(h^2) $$
Approximation of $f'$ has a truncation error of ≈ $(h/2) f''(x)$

Roundoff error
. $f(x)$ and $f(x+h)$ are accurate only to a relative accuracy $\epsilon$
. $\epsilon$ is the machine precision (≈ $10^{-7}$ for float, ≈ $10^{-16}$ for double)
. The absolute error in $f(x)$ and $f(x+h)$ is therefore $\delta \approx f(x)\epsilon$
. We are therefore calculating a value in the range

$$ \frac{f(x+h) - f(x) \pm 2\delta}{h} = \frac{f(x+h) - f(x)}{h} \pm \frac{2\delta}{h} $$
Numerical error: calculating a derivative
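The combined truncation and roundoff error is:
$$ \frac{h}{2} f''(x) + \frac{2\delta}{h} $$

[Figure: the error estimate and the actual error $|\hat{f}' - f'|$ plotted against $h$, for $h$ from $10^{-16}$ to $10^{0}$]

. Truncation error dominates for $h \gtrsim \sqrt{\delta} \approx 10^{-8}$

. Roundoff error dominates for $h \lesssim \sqrt{\delta} \approx 10^{-8}$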
Comparison of floating-point variables

. Values of float or double variables are rarely exact

. Comparison of values
. Avoid if (a == b) (fails unless values are identical)
. Prefer if(std::abs(a-b) < eps) for some eps ≪ a, b

. Floating-point loop indices

. Avoid: for (double d=0; d<=1; d+=0.1)
. Prefer:
for (int i=0; i<=10; i++) {
    double d=i*0.1;
    // ...
}
Numerical error: summary
Discretisation/truncation error
. Error from solving a discretised version of a continuous/infinite problem
. Minimised by:
. careful choice of discretisation
. ‘finer’ discretisation (often slower)

Roundoff error
. Error from finite precision of floating-point (float/double) variables
. Minimised by:
. careful choice of algorithm
. use of double rather than float
. high precision arithmetic (slower than either)
