
MA2213: Numerical Analysis I

Lecture 3: Numerical error

Zhenning Cai

August 26, 2019

As we have seen, computers can only deal with numbers with a limited number of digits.
Therefore the arithmetic done by computers is finite-digit arithmetic. For example, if x and
y are two single-precision floating-point numbers, in general, x × y is not a single-precision
floating-point number. However, the result generated by the computer is still a single-precision
floating-point number. We are going to understand the effect of the finite-digit arithmetic in
this lecture, through the following topics:
• Definition of errors
• Error of finite-digit arithmetic

1 Definition of errors
When we assign 0.23 to a single-precision floating-point variable, the value of the variable
actually differs from 0.23 by 4.17233 × 10^−9. To make the difference smaller, we can use
a double-precision floating-point variable, which has 64 bits (1 sign bit, 11 exponent bits, 52
mantissa bits). We say that a single-precision floating-point number has at most 24 significant
digits in the binary numeral system, and a double-precision floating-point number has at most
53 significant digits in the binary numeral system. The term significant digits in general means
the number of digits that appear to be accurate. A mathematical definition is as follows:
Definition 1. The number p∗ is said to approximate p to t significant digits (or figures) in
the numeral system with base b if t is the largest nonnegative integer for which

    |p − p∗| / |p| ⩽ 0.5 × b^{1−t}.

Example 1. Let fl(x) be the single-precision floating-point form of x obtained by rounding.
Then for any nonzero x between −(2 − 2^−23) × 2^127 and (2 − 2^−23) × 2^127, we have

    |x − fl(x)| / |x| ⩽ 0.5 × 2^{1−24} ≈ 5.96046 × 10^−8 < 0.5 × 10^−6,

which means a single-precision floating-point number has 7 significant digits in the decimal
numeral system. Since the number 5.96046 × 10^−8 is quite close to 0.5 × 10^−7, in some literature,
it is said that a single-precision floating-point number has 7 to 8 significant digits.
Exercise 1. How many significant digits does a double-precision floating-point number have
in the decimal numeral system?
The error caused by replacing x with f l(x) is called round-off error. There are generally
three ways to measure the error:

Definition 2. Suppose that p∗ is an approximation to p. The actual error is p − p∗, the
absolute error is |p − p∗|, and the relative error is |p − p∗|/|p|, provided that p ≠ 0.

The relative error is generally a better measure of accuracy than the absolute error. In
the definition of significant digits, the relative error is used.

Example 2. Since

    fl(0.23) = (1.11010111000010100011111 × 10^−11)_2 ≈ 0.23000000417232513,
    fl(1000000001) = (1.11011100110101100101 × 10^11101)_2 = 10^9,
    fl(1.23 × 10^−20) = (1.1101000010101110010011 × 10^−1000011)_2 ≈ 1.229999960966593 × 10^−20,

we can calculate the absolute errors as follows:

    |0.23 − fl(0.23)| ≈ 4.17233 × 10^−9,
    |1000000001 − fl(1000000001)| = 1,
    |1.23 × 10^−20 − fl(1.23 × 10^−20)| ≈ 3.90334 × 10^−28.

Due to the difference in the magnitudes of the three numbers, the absolute errors are signifi-
cantly different. However, their relative errors are quite similar:

    |0.23 − fl(0.23)| / |0.23| ≈ 1.81405 × 10^−8,
    |1000000001 − fl(1000000001)| / |1000000001| ≈ 1.00000 × 10^−9,
    |1.23 × 10^−20 − fl(1.23 × 10^−20)| / |1.23 × 10^−20| ≈ 3.17345 × 10^−8.

The relative error better describes the quality of the approximation since it takes into account
the size of the value.

2 Finite-digit arithmetic
As we have seen at the beginning of the previous lecture, the error comes not only from
the representation of numbers, but also from the arithmetic performed by the computer. To
study this type of error, we build the following model for finite-digit arithmetic (we use single-
precision floating-point numbers as an example):

    x ⊕ y = fl(fl(x) + fl(y)),    x ⊖ y = fl(fl(x) − fl(y)),
    x ⊗ y = fl(fl(x) × fl(y)),    x ⊘ y = fl(fl(x) ÷ fl(y)),

where x and y are two real numbers whose absolute values do not exceed the maximum single-
precision floating-point number. When the computer performs the operation x + y, the two
operands are actually fl(x) and fl(y). Under this condition, fl(fl(x) + fl(y)) is the best
approximation of x + y that the computer can provide using single-precision floating-point
numbers. The other three cases are similar.

Example 3. Let x = 5/7 and y = 1/3. Then

    fl(x) = (0.10110110110110110110111)_2,    fl(y) = (0.0101010101010101010101011)_2.

Therefore

    fl(x) + fl(y) = (1.0000110000110000110000111)_2,
    fl(x) − fl(y) = (0.0110000110000110000110001)_2,
    fl(x) × fl(y) = (0.001111001111001111001111011100111100111100111101)_2,
    fl(x) ÷ fl(y) = (10.00100100100100100100100011101101101101101101101110)_2.

Applying another operator fl, we get

    x ⊕ y = (1.00001100001100001100010)_2,
    x ⊖ y = (0.0110000110000110000110001)_2,
    x ⊗ y = (0.00111100111100111100111110)_2,
    x ⊘ y = (10.0010010010010010010010)_2.

This example shows that except for x ⊖ y, the outer fl introduces some additional error. Below
we tabulate the errors in decimal:
            Exact value   Approximate value    Absolute error   Relative error
    x       5/7           0.7142857313···      1.703 × 10^−8    2.384 × 10^−8
    y       1/3           0.3333333432···      9.934 × 10^−9    2.980 × 10^−8
    x + y   22/21         1.0476191043···      5.677 × 10^−8    5.419 × 10^−8
    x − y   8/21          0.3809523880···      7.096 × 10^−9    1.863 × 10^−8
    x × y   5/21          0.2380952537···      1.561 × 10^−8    6.557 × 10^−8
    x ÷ y   15/7          2.1428570747···      6.812 × 10^−8    3.179 × 10^−8
These results can be verified by the following C program:

#include <stdio.h>

int main()
{
    float x = 5./7., y = 1./3.;

    printf("x = %.20f\n", x);
    printf("y = %.20f\n", y);
    printf("x + y = %.20f\n", x + y);
    printf("x - y = %.20f\n", x - y);
    printf("x * y = %.20f\n", x * y);
    printf("x / y = %.20f\n", x / y);
    return 0;
}

In general, arithmetic may increase the round-off error.

Finite-digit arithmetic may not follow the basic rules of arithmetic. For example, (x ⊕ y) ⊕ z
may not equal x ⊕ (y ⊕ z); (x ⊕ y) ⊗ z may not equal (x ⊗ z) ⊕ (y ⊗ z).

Example 4. Let x = 3.14159, y = 2.71828, and z = 10^6. We compare (x ⊕ y) ⊕ z and
x ⊕ (y ⊕ z) by the following C program:

#include <stdio.h>

int main()
{
    float x = 3.14159, y = 2.71828, z = 1e6, xpy, ypz;

    xpy = x + y; ypz = y + z;
    printf("x + y = %f,\t\t(x + y) + z = %f\n", xpy, xpy + z);
    printf("y + z = %f,\t\tx + (y + z) = %f\n", ypz, x + ypz);
    return 0;
}

The output is

    x + y = 5.859870,		(x + y) + z = 1000005.875000
    y + z = 1000002.687500,		x + (y + z) = 1000005.812500

Since the exact value of x + y + z is 1000005.85987, the result of (x ⊕ y) ⊕ z is more accurate
than x ⊕ (y ⊕ z). The reason is that when computing x ⊕ y, only a small absolute error is
introduced, and therefore the error of (x ⊕ y) ⊕ z mainly comes from the second operator ⊕.
However, when computing x ⊕ (y ⊕ z), both operators introduce relatively large errors.

In general, when summing up three numbers, it is recommended to first sum the two with
closer magnitudes, and then add the third one to the result. A natural question is: if we
have a sequence of numbers x_1, · · · , x_n and want to take the sum, how can we reduce the
round-off error? To answer this question, we need to design an algorithm for such a computation.

3 Algorithms and numerical error


An algorithm is a set of unambiguous instructions to solve a class of problems or perform a
computation. For example, the following algorithm computes the sum of x_1, · · · , x_n:

Algorithm 1 Computation of x_1 + x_2 + · · · + x_n
1: sum ← 0
2: for i = 1, · · · , n do
3:   sum ← sum + x_i
4: end for
5: return sum

This algorithm is the simplest way to compute the sum. The above description of the
algorithm is called pseudocode. The first line and the third line are “assignments”, and the
combination “for ... do ... end for” denotes a loop. Without round-off error, this algorithm
generates exact results. However, when this algorithm is implemented using single-precision
floating-point numbers, round-off error may accumulate in the loop. For example, if n = 3
and x_1 = 10^6, x_2 = 2.71828, x_3 = 3.14159, the output will be 1000005.8125, while we know
that by changing the summation order, a better result can be obtained.
A better algorithm was proposed by W. Kahan,^1 whose pseudocode is

Algorithm 2 Computation of x_1 + x_2 + · · · + x_n
1: sum ← 0, c ← 0
2: for i = 1, · · · , n do
3:   y ← x_i − c, t ← sum + y
4:   c ← (t − sum) − y
5:   sum ← t
6: end for
7: return sum

^1 Kahan, William (1965), “Further remarks on reducing truncation errors”, Communications of the ACM,
8(1): 40.

In Algorithm 1, the round-off error accumulates linearly in n in the worst case, while in
Kahan’s algorithm, the worst-case error is independent of n.
Exercise 2. Suppose that Kahan’s algorithm is implemented using single-precision floating-
point numbers. Let n = 3 and x_1 = 10^6, x_2 = 2.71828, x_3 = 3.14159. What is the output of
Kahan’s algorithm?
Regardless of round-off error, an algorithm does not always generate exact results. For
example, to compute the integral

    g(x) = ∫_x^{+∞} t^{−1/2} e^{−t} dt,

we can use the following series expansion:

    g(x) = √π − ∑_{k=0}^{+∞} (−1)^k x^{k+1/2} / (k! (k + 1/2)).

However, we cannot have the computer take this infinite sum. Therefore we have to truncate
the series at k = N for some N. Suppose we take x = 2 and N = 10. The result of the
algorithm is 0.08064165 · · · , while the exact value of g(x) is 0.08064711 · · · . Therefore the
algorithm introduces an additional absolute error of approximately 5.46 × 10^−6.
Exercise 3. Write down the pseudocode of the above algorithm to compute g(x).
The function g(x) can also be computed by the following formula:

    g(x) = x^{1/2} e^{−x} /
           ( 1 − 1/2 + x + 1(1/2 − 1) /
           ( 3 − 1/2 + x + 2(1/2 − 2) /
           ( 5 − 1/2 + x + 3(1/2 − 3) /
           ( 7 − 1/2 + x + 4(1/2 − 4) /
           ( 9 − 1/2 + x + · · · )))))                    (1)

To implement this formula, we must also terminate the continued fraction at some point.
Assume that we keep only N fractions. The algorithm is

Algorithm 3 Computation of g(x) up to N fractions
1: P_{−1} ← 1, Q_{−1} ← 0
2: P_0 ← x + 1/2, Q_0 ← 1
3: for k = 1, · · · , N − 1 do
4:   P_k ← k(1/2 − k)P_{k−2} + (2k + 1/2 + x)P_{k−1}
5:   Q_k ← k(1/2 − k)Q_{k−2} + (2k + 1/2 + x)Q_{k−1}
6: end for
7: return Q_{N−1} x^{1/2} e^{−x} / P_{N−1}

Exercise 4. Explain why the above algorithm is an approximation of (1).


Using Algorithm 3, when N = 10, the output is 0.08064707 · · · , and the absolute error is
approximately 4.16 × 10^−8. This part of the error has nothing to do with finite-digit arithmetic,
and is usually called truncation error. In general, numerical error is the combined effect of
round-off error and truncation error.
