
MATH49111/69111: Scientific Computing

Lecture 5
5th October 2016

Dr Chris Johnson
chris.johnson@manchester.ac.uk
Mathematical modelling

Real-world problem → (modelling error) → Equations → (numerical error)
→ Numerical solution → (interpretation error) → Problem solution

Solving real-world problems is a three-stage process:


1. Formulate equations that describe (an approximation of) the
real-world problem
2. Find an (approximate) numerical solution to these equations
3. Interpret the solutions in the context of the problem

. It is important to keep the stages separate


. We will focus on stage 2 (the easiest!)
Types of numerical error

Discretisation/truncation error
. Many problems must be approximated or discretised before
being solved numerically
. For example, we may approximate an infinite sum by a finite sum
of many terms
. The error introduced by approximating the problem is
truncation or discretisation error

Roundoff error
. Operations on floating-point (fractional) numbers are inexact
. Often we cannot solve even our approximated problem exactly
Measuring errors

. In order to quantify errors in our solutions we need to define a
  measure for the error
. If ϕ∗ is an approximation to a scalar quantity ϕ then the
  absolute error is defined by

      ∣ϕ − ϕ∗∣

. The relative error is defined by

      ∣ϕ − ϕ∗∣ / ∣ϕ∣    (when ϕ ≠ 0)
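For instance (an illustrative example, not from the slides): with ϕ = π and ϕ∗ = 3.14,
the absolute error is ∣π − 3.14∣ ≈ 1.59 × 10^{-3} and the relative error is
≈ 1.59 × 10^{-3} / π ≈ 5.1 × 10^{-4}.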
Measuring errors

. Often our solution will be a vector v = (vi ), rather than a scalar,
  with an approximation v⋆
. Rather than list the error for every component, we define a
  scalar error measure with a norm
. Three norms are commonly used for absolute errors:
  . The L1 norm, ∥v⋆ − v∥1 ∶= ∑i ∣v⋆i − vi ∣
  . The L2 norm, ∥v⋆ − v∥2 ∶= (∑i (v⋆i − vi )^2)^{1/2}
  . The L∞ norm, ∥v⋆ − v∥∞ ∶= maxi ∣v⋆i − vi ∣, the maximum absolute
    error

. Relative errors for an Lp norm are defined by ∥v⋆ − v∥p /∥v∥p


. If v represents a (discretisation of a) function, use the norm of
the function not the norm of the vector.
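As an illustration (not part of the original slides), a minimal C++ sketch computing
the three norms above for a made-up error vector:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

int main()
{
    // Hypothetical 'exact' solution v and approximation v_star (made-up values)
    std::vector<double> v      = {1.0, 2.0, 3.0, 4.0};
    std::vector<double> v_star = {1.1, 1.9, 3.05, 3.98};

    double l1 = 0.0, l2 = 0.0, linf = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        double diff = std::abs(v_star[i] - v[i]);
        l1  += diff;                   // L1: sum of absolute component errors
        l2  += diff * diff;            // L2: sum of squared errors (square root below)
        linf = std::max(linf, diff);   // L-infinity: largest component error
    }
    l2 = std::sqrt(l2);

    std::cout << "L1 = " << l1 << ", L2 = " << l2 << ", Linf = " << linf << std::endl;
}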
Roundoff error

. When using floating-point fractional numbers (float and


double), we can only store around 7 decimal digits (for float)
or 15-16 decimal digits (for double).
. Truncation of digits beyond this (called roundoff error) means
that most floating-point calculations are inexact
. Roundoff error is particularly bad when subtracting two
numbers of similar size, or when adding numbers of very
dissimilar sizes.
. Errors may be magnified repeatedly under certain sequences
of calculations (this is called numerical instability)
. We can sometimes control this by altering the algorithm or
rearranging the order of operations
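A minimal sketch (not from the slides) of the two problem cases just described, using float:

#include <iostream>

int main()
{
    // Adding numbers of very dissimilar size: the small term is lost entirely
    float big   = 1.0e8f;
    float small = 1.0f;
    std::cout << (big + small) - big << std::endl;   // typically prints 0, not 1

    // Subtracting numbers of similar size: the leading digits cancel, so the
    // result is dominated by the roundoff in the last stored digits
    float a = 1.2345678f;
    float b = 1.2345670f;
    std::cout << a - b << std::endl;   // nominally 8e-7, but only ~1 digit is reliable
}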
Example: roundoff error in sums
We wish to evaluate
      e^{-2π} = 1 − 2π + (2π)^2/2! − (2π)^3/3! + (2π)^4/4! − ⋯ = 1.86744… × 10^{-3}
using float variables (∼ 7 decimal digits of precision)
. The sum has heavy cancellation
  . the largest terms are ≈ ±80
  . the final result is ≈ 10^{-3}

. We estimate the error as arising from the largest terms:

      roundoff error ≈ 10^{-7} × 80 ≈ 10^{-5}

. Result obtained with float variables: 1.8714… × 10^{-3}
. Accurate only to two digits (result = O(10^{-3}); error = O(10^{-5}))
Example: roundoff error in sums
How can we get a more accurate answer?
      e^{-2π} = 1 / e^{2π} = 1 / (1 + 2π + (2π)^2/2! + (2π)^3/3! + ⋯)

. The largest terms are ≈ 80, as before
  ⇒ the roundoff error from these terms is ≈ 10^{-5}, as before
. The value of the sum is now e^{2π} ≈ 500
. Expect ∼ 7-digit accuracy (result = O(10^2); error = O(10^{-5}))
  . about as good as we could ever expect from float

. Final result obtained with float: 1.86744236… × 10^{-3}


. Reordering sums can reduce the roundoff error¹
  (floating-point addition is not associative!)

¹ N. J. Higham, 1993. SIAM J. Sci. Comput. 14(4), 783–799
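A minimal sketch (not part of the slides) comparing the two approaches above in float;
the exact digits printed depend on the compiler and platform:

#include <cmath>
#include <iostream>

int main()
{
    const float x = 2.0f * 3.14159265f;   // 2*pi

    float alternating = 0.0f;             // partial sum of the series for e^{-2 pi}
    float positive    = 0.0f;             // partial sum of the series for e^{+2 pi}
    float term_neg = 1.0f, term_pos = 1.0f;

    for (int k = 1; k <= 60; ++k)
    {
        alternating += term_neg;
        positive    += term_pos;
        term_neg *= -x / k;               // next term of the sum for e^{-x}
        term_pos *=  x / k;               // next term of the sum for e^{+x}
    }

    std::cout << "alternating sum:        " << alternating     << "\n"
              << "1 / (sum for e^{2pi}):  " << 1.0f / positive << "\n"
              << "std::exp(-2 pi):        " << std::exp(-x)    << std::endl;
}

The alternating sum typically loses several digits to cancellation, while the reciprocal
of the all-positive sum retains close to full float accuracy, as claimed above.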
Question
What will the output of this program be?
#include <iostream>

int main()
{
    float f = 4.0/3.0;
    double d = 4.0/3.0;
    double difference = f - d;

    std::cout << difference << std::endl;
}

A. A number around 10^{-7}
B. A number roughly between −10^{-7} and 10^{-7}
C. A number around 10^{-15}
D. A number roughly between −10^{-15} and 10^{-15}
E. 0
Discretisation/truncation error
. Many problems are impossible to solve exactly on a computer
. Infinite expressions:

      γ = ∑_{k=1}^{∞} [ 1/k − log(1 + 1/k) ]

(we cannot sum an infinite number of terms; see the sketch after this slide)


. Continuous problems:
      f′′′ + (1/2) f f′′ = 0
(we cannot represent f exactly, only an approximation)

. To solve these problems we must approximate them as a


discrete, finite problem that can be solved numerically.
. This approximation introduces error, often more significant
than roundoff error.
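As a sketch of truncation error (not from the slides), we can sum the series for γ given
above up to a finite N; the reference value of γ is assumed here for comparison:

#include <cmath>
#include <iostream>

int main()
{
    // Reference value of the Euler-Mascheroni constant (assumed for comparison)
    const double gamma = 0.5772156649015329;

    for (int N = 10; N <= 1000000; N *= 10)
    {
        // Truncate the infinite sum after N terms
        double sum = 0.0;
        for (int k = 1; k <= N; ++k)
            sum += 1.0 / k - std::log(1.0 + 1.0 / k);

        std::cout << "N = " << N << "   truncation error = " << gamma - sum << std::endl;
    }
}

The tail of the series behaves like ≈ 1/(2N), so each factor of 10 in N gains roughly one
more correct digit.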
Numerical error: calculating a derivative
How do we calculate the derivative f ′ (x) of a function f (x)?
We cannot directly evaluate the limit

      f′(x) = lim_{h→0+} [ f(x + h) − f(x) ] / h .

However, writing f as a Taylor series about x,

      f(x + h) = f(x) + h f′(x) + (h^2/2) f′′(x) + …

we find

      [ f(x + h) − f(x) ] / h = f′(x) + (h/2) f′′(x) + …
                              ≈ f′(x)   when h ≪ ∣2f′(x)/f′′(x)∣

We approximate f ′ (x) by choosing a small but finite h.
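A minimal sketch (not part of the slides) of this forward-difference approximation,
applied to the example used on the following slides, f(x) = sin(x) at x = 1:

#include <cmath>
#include <iostream>

int main()
{
    const double x     = 1.0;
    const double exact = std::cos(x);    // exact derivative of sin(x) at x = 1

    std::cout << std::scientific;
    for (int i = 0; i <= 16; ++i)
    {
        double h = std::pow(10.0, -i);
        // Forward-difference approximation to f'(x)
        double approx = (std::sin(x + h) - std::sin(x)) / h;
        std::cout << "h = " << h << "   error = " << std::abs(approx - exact) << std::endl;
    }
}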


Numerical error: calculating a derivative
Our approximate derivative, calculated with double variables, is

      fˆ′ ∶= [ f(x + h) − f(x) ] / h

Example:

f (x) = sin(x), x = 1, f ′ (x) = cos(1) = 0.5403 . . .

[Plot: fˆ′ against h, for h between 10^{-16} and 10^0]
Numerical error: calculating a derivative
f (x) = sin(x), x = 1, f ′ (x) = cos(1) = 0.5403 . . .

[Plot: fˆ′ against h, for h between 10^{-16} and 10^0]
Our numerical estimate fˆ′ is
. apparently accurate over a wide range of h < 1
. very inaccurate for h > 10^{-2} and h < 10^{-14}
Numerical error: calculating a derivative

Truncation error
      [ f(x + h) − f(x) ] / h = f′(x) + (h/2) f′′(x) + O(h^2)

The approximation of f′(x) has a truncation error of ≈ (h/2) f′′(x)

Roundoff error
. f (x) and f (x + h) are accurate only to a relative accuracy є
. є is the machine precision (≈ 10^{-7} for float, ≈ 10^{-16} for double)
. The absolute error in f (x) and f (x + h) is therefore δ ≈ f (x)є
. We are therefore calculating a value in the range

      [ f(x + h) − f(x) ± 2δ ] / h = [ f(x + h) − f(x) ] / h ± 2δ/h
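The machine precision є quoted above can be queried directly; a minimal sketch (not part
of the slides):

#include <iostream>
#include <limits>

int main()
{
    // Machine precision: the gap between 1 and the next representable value above 1
    std::cout << "float  epsilon: " << std::numeric_limits<float>::epsilon()  << "\n"
              << "double epsilon: " << std::numeric_limits<double>::epsilon() << std::endl;
    // Prints roughly 1.19e-07 and 2.22e-16
}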
Numerical error: calculating a derivative
The combined truncation and roundoff error is:
      (h/2) f′′(x) + 2δ/h
[Plot: error ∣fˆ′ − f′∣ against h, comparing the measured error with this estimate, for h between 10^{-16} and 10^0]

. Truncation error dominates for h ⪆ √δ ≈ 10^{-8}
. Roundoff error dominates for h ⪅ √δ ≈ 10^{-8}
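The crossover can be made explicit (a step not shown on the slide). Minimising the error
estimate over h,

      d/dh [ (h/2)∣f′′(x)∣ + 2δ/h ] = ∣f′′(x)∣/2 − 2δ/h^2 = 0   ⇒   h_opt = 2 √( δ/∣f′′(x)∣ )

With δ ≈ 10^{-16} and ∣f′′(1)∣ = sin(1) = O(1), this gives h_opt ≈ 10^{-8}, matching the plot.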
Comparison of floating-point variables

. Values of float or double variables are rarely exact


. Comparison of values
. Avoid if (a == b) (fails unless the values are bit-for-bit identical)
. Prefer if (std::abs(a-b) < eps) for some tolerance eps ≪ ∣a∣, ∣b∣

. Floating-point loop indices


. Avoid: for (double d=0; d<=1; d+=0.1)
. Prefer:
    for (int i=0; i<=10; i++)
    {
        double d = i*0.1;
        // ...
    }
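A minimal sketch (not from the slides) of why the floating-point loop index is risky, and
of the tolerance-based comparison:

#include <cmath>
#include <iostream>

int main()
{
    // Accumulating 0.1 ten times does not, in general, give exactly 1.0
    double d = 0.0;
    for (int i = 0; i < 10; ++i)
        d += 0.1;

    std::cout << (d == 1.0 ? "equal" : "not equal") << std::endl;   // typically "not equal"

    // Comparing with a tolerance instead
    const double eps = 1e-12;
    std::cout << (std::abs(d - 1.0) < eps ? "equal within eps" : "different") << std::endl;
}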
Numerical error: summary
Discretisation/truncation error
. Error from solving a discretised version of a continuous/infinite
problem
. Minimised by:
. careful choice of discretisation
. ‘finer’ discretisation (often slower)

Roundoff error
. Error from finite precision of floating-point (float/double)
numbers.
. Minimised by:
. careful choice of algorithm
. use of double rather than float
. higher-precision arithmetic (slower than either)
