Computer Arithmetic: May 10, 2019 10:37 Ws-Book961x669 An Introduction To Numerical Computation 11388-Main
Chapter 1
Computer Arithmetic
by 45.134.241.2 on 04/04/21. Re-use and distribution is strictly not permitted, except for Open Access articles.
An Introduction to Numerical Computation Downloaded from www.worldscientific.com
1.1. Introduction
What are numerical methods? They are algorithms that compute approximate
solutions to a number of problems for which exact solutions are not available. These
include, but are not limited to:
• evaluating functions;
• computing the derivative of a function;
• computing the integral of a function;
• solving a non-linear equation;
• solving a system of linear equations;
• finding the maximum or minimum of a function;
• finding eigenvalues of an operator;
• finding solutions to an ODE (ordinary differential equation);
• finding solutions to a PDE (partial differential equation).
In this course, we will focus on:
• development of algorithms;
• implementation;
• a little bit of analysis, including error estimates, convergence, stability, etc.
Figure 1.1 illustrates the workflow of scientific computing: a physical model leads to a mathematical model, which is solved by numerical methods; the results are then presented and visualized, followed by verification and physical explanation of the results.
Historically, several different bases for representing numbers have been used by different civilizations.
In principle, one can use any integer (or even non-integer) β as the base. The value of a number written in base β is then

(am · · · a1 a0 . b1 b2 b3 · · ·)β = am β^m + · · · + a1 β + a0 + b1 β^(−1) + b2 β^(−2) + b3 β^(−3) + · · · .
In principle, one can convert the numbers between different bases. Here are
some examples.
(1)8 = (001)2
(2)8 = (010)2
(3)8 = (011)2
(4)8 = (100)2
(5)8 = (101)2
(6)8 = (110)2
(7)8 = (111)2
(10)8 = (1000)2 .
Then, the conversion between these two bases becomes much simpler. To convert from octal to binary, we convert each digit of the octal number into binary, using 3 binary digits for each octal digit. For example:
(5034)8 = (101 000 011 100)2 ,

where the groups 101, 000, 011, 100 correspond to the octal digits 5, 0, 3, 4.
To convert a binary number to an octal one, we group the binary digits in groups of 3 and write out the octal value of each group. For example:
(110 010 111 001)2 = (6271)8 ,

where the groups 110, 010, 111, 001 correspond to the octal digits 6, 2, 7, 1.
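The digit-grouping rule can be sketched in a few lines of code. The book's exercises use Matlab; here is a Python sketch for illustration (the function names are our own):

```python
# Convert an octal digit string to binary by expanding each octal
# digit into exactly 3 binary digits, as described above.
def octal_to_binary(octal_str):
    return "".join(format(int(d, 8), "03b") for d in octal_str)

# Convert a binary digit string to octal by grouping the digits in
# threes from the right and reading off the octal value of each group.
def binary_to_octal(binary_str):
    padded = binary_str.zfill(-(-len(binary_str) // 3) * 3)
    groups = [padded[i:i + 3] for i in range(0, len(padded), 3)]
    return "".join(str(int(g, 2)) for g in groups)

print(octal_to_binary("5034"))          # 101000011100
print(binary_to_octal("110010111001"))  # 6271
```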
Example 1.3. Conversion from decimal → binary. Write the decimal number
(12.45)10 in binary base.
This example is of particular interest. Since the computer uses binary base, what does the number (12.45)10 look like in binary? How does a computer store this number in its memory? The conversion takes two steps.
First, we treat the integer part. The procedure is to keep dividing by 2 and store the remainder at each step, until one cannot divide anymore. Then, collect the remainders in reverse order. We have
2 | 12    remainder 0
2 |  6    remainder 0
2 |  3    remainder 1
2 |  1    remainder 1
     0                  ⇒ (12)10 = (1100)2 .
Next, we treat the fractional part: we keep multiplying by 2, strip off the integer part of the product as the next binary digit, and continue with the remaining fraction. We have

0.45 × 2 = 0.90   →  digit 0
0.90 × 2 = 1.80   →  digit 1
0.80 × 2 = 1.60   →  digit 1
0.60 × 2 = 1.20   →  digit 1
0.20 × 2 = 0.40   →  digit 0
0.40 × 2 = 0.80   →  digit 0
0.80 × 2 = 1.60   →  digit 1
· · ·

and the pattern repeats from here on. Putting the two parts together, we get

(12.45)10 = (1100.01110011001100 · · · )2 .
It might be surprising that a simple finite length decimal number such as 12.45
corresponds to a fractional number with infinite length, when written in binary
form! This is very bad news! To store this number exactly in a computer, one
would need infinite memory space!
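The two-step procedure above (divide the integer part by 2, multiply the fractional part by 2) can be sketched in Python; the function name and the cap of 16 fractional digits are our choices for illustration:

```python
def dec_to_bin(x, frac_digits=16):
    """Convert a non-negative decimal number to a binary digit string.

    Integer part: repeated division by 2, remainders collected in reverse.
    Fractional part: repeated multiplication by 2, integer parts collected
    in order, truncated after `frac_digits` digits.
    """
    n = int(x)
    frac = x - n
    # Integer part: divide by 2, record remainders, reverse at the end.
    int_digits = []
    while n > 0:
        int_digits.append(str(n % 2))
        n //= 2
    int_part = "".join(reversed(int_digits)) or "0"
    # Fractional part: multiply by 2, strip off the integer part each step.
    frac_part = []
    for _ in range(frac_digits):
        frac *= 2
        d = int(frac)
        frac_part.append(str(d))
        frac -= d
    return int_part + "." + "".join(frac_part)

print(dec_to_bin(12.45))  # 1100.0111001100110011, truncated
```

Note the repeating block 0011 in the output, exactly as derived by hand above.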
How does a computer store such a number? Below we will learn the standard
representation used in modern computers.
Normalized scientific notation. Any nonzero number x can be written in the normalized form

x = ±r × β^n ,   1 ≤ r < β .

We observe that, once the base β is fixed, the important information to be stored for any number x includes: (i) the sign, (ii) the exponent n, and (iii) the value of r.
The computer uses the binary version of this number system, and it represents numbers with finite length. This means that it either rounds or chops after a certain fractional position. These numbers are called machine numbers.
The value r is called the normalized mantissa. For binary numbers, r lies between 1 and 2, so we have

r = 1.(fractional part).

Therefore, to save memory space, the computer only needs to store the fractional part of the number.

The integer n is called the exponent. If n > 0, then |x| > 1. If n < 0, then |x| < 1.
A bit is a memory space in a computer that can store the value either 0 or 1.
In a 32-bit computer, with single-precision, a number is stored as in Figure 1.2.
Fig. 1.2  Single-precision storage of a number: s (1 bit, the sign of the mantissa), c (8 bits, the biased exponent), and f (23 bits, the mantissa).
Note that the value c, which occupies 8 bits, is not the actual exponent but the biased exponent. An 8-bit field can store 2^8 = 256 distinct values, so c takes a value between 0 and 255. Since we want to represent positive and negative exponents n in an equal way, we use this range to represent exponents from −127 to 128. Therefore, the actual value of the exponent is n = c − 127.
To summarize, the actual value of the number in Figure 1.2 is:

(−1)^s × 2^(c−127) × (1.f)2 .

Here r = (1.f)2 is the value of the binary number, where f is the fractional part, which is stored in 23 bits.
This representation of numbers is called: single-precision IEEE standard
floating-point.
The smallest representable number in absolute value is obtained when the exponent is the smallest, i.e., n = −127, and we have

xmin = 2^(−127) ≈ 5.9 × 10^(−39) .
The largest representable number in absolute value is obtained when the exponent is the largest, i.e., n = 128, and we have

xmax = 2^(128) ≈ 3.4 × 10^(38) .
It is clear that computers can only handle numbers with absolute values between
xmin and xmax .
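We can inspect the three fields s, c, f directly. A Python sketch using the standard struct module (Python's own floats are double precision, so we explicitly pack the number into 32 bits first):

```python
import struct

def float32_fields(x):
    """Return (s, c, f) for the IEEE single-precision representation of x."""
    # Pack x as a 32-bit big-endian float, then reinterpret the same
    # 4 bytes as an unsigned 32-bit integer to expose the raw bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31            # 1 sign bit
    c = (bits >> 23) & 0xFF   # 8 bits of biased exponent
    f = bits & 0x7FFFFF       # 23 bits of mantissa (fractional part)
    return s, c, f

s, c, f = float32_fields(12.45)
print(s, c, c - 127)  # recall 12.45 = (1.10001110011...)_2 x 2^3, so n = 3
```

For 12.45 we get s = 0 and c = 3 + 127 = 130, consistent with n = c − 127.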
Error in the floating point representation. Let fl(x) denote the floating point
representation of the number x. In general it contains error, caused by roundoff or
chopping. Let δ denote the relative error. We have
fl(x) = x · (1 + δ)
with

relative error = δ = ( fl(x) − x ) / x ,
absolute error = fl(x) − x = δ · x .
We know that |δ| ≤ ε, where ε is called the machine epsilon: the smallest positive number such that fl(1 + ε) > 1. In a 32-bit computer, ε = 2^(−23). (Note that the number 23 is the size of the mantissa.)
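The machine epsilon can also be found experimentally, by halving a trial value until adding it to 1 no longer changes the stored result. A Python sketch (Python floats are double precision with a 52-bit mantissa, so the loop finds 2^(−52) rather than the single-precision 2^(−23)):

```python
# Find the machine epsilon of Python floats (IEEE double precision):
# the smallest power of 2 such that fl(1 + eps) > 1.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
print(eps)               # 2.220446049250313e-16
print(eps == 2 ** -52)   # True
```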
The relative error in representing a number depends on whether one uses rounding or chopping. We have

|δ| ≤ ε  (chopping),     |δ| ≤ ε/2  (rounding).
Error propagation in addition. Consider the computation z = x + y. The stored operands are fl(x) = x(1 + δx) and fl(y) = y(1 + δy), and the computed sum is fl(z) = (fl(x) + fl(y))(1 + δz). Here, δz is the round-off error in making the floating point representation for z. Then, dropping second-order terms such as δx δz, we have
absolute error = fl(z) − (x + y) = x · (δx + δz) + y · (δy + δz)
              = x · δx + y · δy  +  (x + y) · δz ,

where x · δx and y · δy are the absolute errors for x and y (together they form the propagated error), and (x + y) · δz is the round-off error,
and the relative error is

relative error = ( fl(z) − (x + y) ) / (x + y) = ( x · δx + y · δy ) / (x + y) + δz .
Would you be able to work out the error propagation for other arithmetic oper-
ations?
Loss of significance typically happens when we subtract two numbers that are very
close to each other. The result will then contain fewer significant digits.
To understand this fact, consider a number with 8 significant digits:

x = 0.d1 d2 d3 · · · d8 × 10^(−a) .

Here d1 is the most significant digit, and d8 is the least significant digit.

Let y = 0.b1 b2 b3 · · · b8 × 10^(−a) . We want to compute x − y. Assume that x ≈ y, so that b1 = d1 , b2 = d2 , b3 = d3 . Then we have

x − y = 0.000 c4 c5 c6 c7 c8 × 10^(−a) .
We see that we have only 5 significant digits in the answer! In fact we lost 3
significant digits, caused by the fact that the first 3 fractional positions for x − y are
all 0. This is called loss of significance, which will result in loss of accuracy. The
number x − y has a much larger relative error in its representation, and therefore
is not so accurate anymore.
In designing a numerical algorithm, one should always be sensitive to such a loss.
One can avoid such subtractions by performing other equivalent computations. We
will see some examples below.
Example 1.5. Find the roots of x^2 − 40x + 2 = 0. Use 4 significant digits in the computation.
Answer. The roots of the equation ax^2 + bx + c = 0 are
r1,2 = ( −b ± √(b^2 − 4ac) ) / (2a) .
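Note that for x^2 − 40x + 2 the smaller root 20 − √398 subtracts two nearly equal numbers, so with 4 significant digits most of the accuracy is lost. A standard remedy, sketched here in Python, is to compute the root that involves no cancellation and recover the other one from the product r1 · r2 = c/a (the same trick used in the smartquadroots problem below):

```python
import math

def quadroots_stable(a, b, c):
    """Roots of a*x^2 + b*x + c = 0, avoiding cancellation.

    Compute the root where -b and the square root have the same sign
    (so the two terms add, never cancel), then use r1 * r2 = c / a.
    """
    disc = math.sqrt(b * b - 4 * a * c)
    # copysign gives disc the sign of b, so -(b + copysign(disc, b))
    # adds two numbers of the same magnitude sign: no cancellation.
    q = -(b + math.copysign(disc, b)) / 2
    r1 = q / a
    r2 = c / q
    return r1, r2

print(quadroots_stable(1, -40, 2))  # roots 20 +/- sqrt(398)
```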
Example 1.8. We consider another example where loss of significant digits causes real trouble. Consider computing the function

p(x) = (x − 1)^5 .

Expanding the polynomial, we have the “equivalent” function

q(x) = x^5 − 5x^4 + 10x^3 − 10x^2 + 5x − 1 .
Fig. 1.3 Plots of p(x) (left) and q(x) (right), generated by Matlab.
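The effect is easy to reproduce. A Python sketch evaluating both forms near x = 1 (in double precision rather than single, but the phenomenon is the same):

```python
def p(x):
    return (x - 1) ** 5

def q(x):
    # The expanded, mathematically equivalent polynomial.
    return x**5 - 5*x**4 + 10*x**3 - 10*x**2 + 5*x - 1

x = 1.00001
print(p(x))  # ~1e-25: accurate to machine precision
print(q(x))  # dominated by rounding noise from cancellation

# In q, terms of size ~1 must cancel down to ~1e-25, so essentially
# all significant digits are lost; p suffers no cancellation at all.
```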
Taylor series and Taylor Theorem are important in the analysis of many of the
numerical methods in this course. Many error estimates and convergence results
are based on Taylor Theorem.
Assume that f(x) is a smooth function, so that all its derivatives exist. Then the Taylor expansion of f about the point x = c is:
f(x) = f(c) + f'(c)(x − c) + (1/2!) f''(c)(x − c)^2 + (1/3!) f'''(c)(x − c)^3 + · · · .   (1.5.1)
Using the summation sign, it can be written in a more compact way:

f(x) = Σ_{k=0}^{∞} (1/k!) f^(k)(c) (x − c)^k .   (1.5.2)
Some familiar examples are:

sin x = Σ_{k=0}^{∞} (−1)^k x^(2k+1)/(2k+1)! = x − x^3/3! + x^5/5! − x^7/7! + · · · ,   |x| < ∞ ,

cos x = Σ_{k=0}^{∞} (−1)^k x^(2k)/(2k)! = 1 − x^2/2! + x^4/4! − x^6/6! + · · · ,   |x| < ∞ ,

1/(1 − x) = Σ_{k=0}^{∞} x^k = 1 + x + x^2 + x^3 + x^4 + · · · ,   |x| < 1 .
Since the computer performs only arithmetic operations, these series are actually
how a computer calculates many “fancy” functions!
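As a sanity check, a few terms of the sine series above already reproduce sin x to near machine precision for moderate x. A Python sketch (the function name and term count are our choices):

```python
import math

def sin_series(x, n_terms=10):
    """Partial sum of the Taylor series of sin about 0."""
    total = 0.0
    for k in range(n_terms):
        # k-th term: (-1)^k x^(2k+1) / (2k+1)!
        total += (-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
    return total

print(sin_series(0.5))                      # close to sin(0.5)
print(abs(sin_series(0.5) - math.sin(0.5)))  # error near machine epsilon
```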
For example, the exponential function is calculated as
e^x ≈ SN = Σ_{k=0}^{N} x^k / k!

for some integer N. The integer N is chosen large enough so that the error is sufficiently small. The value SN is called the partial sum of the Taylor series.
Note that the above formula is rather “expensive” in computing time, since it involves many arithmetic operations. Compared to a single arithmetic operation,
evaluating a function requires a lot more computing time. This is why in many
algorithms we try to perform as few function evaluations as possible!
We now look at such an example: approximating e by evaluating the Taylor series of e^x at x = 1,

e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + · · · .

The successive terms are
1/2! = 0.5 ,
1/3! = 0.166667 ,
1/4! = 0.041667 ,
· · ·
1/9! = 0.0000027 ,
1/10! = 0.00000027 .
We see that we can stop here, since the term 1/10! has 6 zeros after the decimal point. We now have

e ≈ 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + · · · + 1/9! = 2.71828 .
One can easily implement this algorithm with a for-loop.
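A minimal Python version of that for-loop (the stopping rule, a fixed number of terms, is our choice for illustration):

```python
def approx_e(n_terms=9):
    """Approximate e by the partial sum 1 + 1 + 1/2! + ... + 1/n_terms!."""
    total = 1.0   # the k = 0 term
    term = 1.0
    for k in range(1, n_terms + 1):
        term /= k          # term is now 1/k!, built up incrementally
        total += term
    return total

print(round(approx_e(9), 5))   # 2.71828, as computed by hand above
```

Note that each term is obtained from the previous one by a single division, which is much cheaper than recomputing k! from scratch at every step.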
Let fn(x) denote the partial sum of the first n + 1 terms in the Taylor series. We shall use fn(x) to approximate f(x), with n sufficiently large.
The difference f(x) − fn(x) between the exact value of a function and its Taylor approximation is estimated by the famous Taylor Theorem:
En+1 := f(x) − fn(x) = Σ_{k=n+1}^{∞} ( f^(k)(c) / k! ) (x − c)^k = ( f^(n+1)(ξ) / (n + 1)! ) (x − c)^(n+1)   (1.5.4)
for some value ξ which lies between x and c. This indicates that, even though the error En+1 is the sum of infinitely many terms, its size is comparable to the first term in that sum.
We make the following observation: A Taylor series converges rapidly if x is
near c, and slowly (or not at all) if x is far away from c.
In the special case n = 0, the error estimate (1.5.4) reduces to the “Mean Value Theorem”, illustrated in Figure 1.4. Indeed, if f is smooth on the interval (a, b), then using the estimate (1.5.4) with c = a, x = b, we obtain

f(b) = f(a) + f'(ξ)(b − a) ,   a < ξ < b .

[Fig. 1.4 Illustration of the Mean Value Theorem: at some point ξ between a and b, the tangent of f is parallel to the chord through (a, f(a)) and (b, f(b)).]

This implies
f'(ξ) = ( f(b) − f(a) ) / (b − a) .
If a and b are very close to each other, this formula can be used to compute an approximation of f'.
We now introduce the finite difference approximations for derivatives. Given h > 0 sufficiently small, we have 3 ways of approximating f'(x):
(1) f'(x) ≈ ( f(x + h) − f(x) ) / h   (Forward Euler)

(2) f'(x) ≈ ( f(x) − f(x − h) ) / h   (Backward Euler)

(3) f'(x) ≈ ( f(x + h) − f(x − h) ) / (2h)   (Central Finite Difference)
See Figure 1.5 for an illustration. We observe that the forward Euler method takes information from the right side of x, the backward Euler method takes information from the left side, while the central finite difference takes information from both sides.
[Fig. 1.5 The three difference quotients (1)–(3) approximating the slope f'(x), built from the points x − h, x, and x + h.]
For the second derivative f'', we can use a central finite difference approximation:

f''(x) ≈ (1/h^2) ( f(x + h) − 2f(x) + f(x − h) ) .   (1.6.1)
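All four formulas can be tried directly. A Python sketch using f(x) = sin x, whose derivatives are known exactly:

```python
import math

# The three first-derivative approximations and formula (1.6.1),
# written as small helper functions.
def forward(f, x, h):
    return (f(x + h) - f(x)) / h              # Forward Euler

def backward(f, x, h):
    return (f(x) - f(x - h)) / h              # Backward Euler

def central(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)    # Central difference

def central2(f, x, h):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2  # (1.6.1)

x, h = 1.0, 1e-4
print(abs(forward(math.sin, x, h) - math.cos(x)))   # error of size ~h
print(abs(central(math.sin, x, h) - math.cos(x)))   # error of size ~h^2
print(abs(central2(math.sin, x, h) + math.sin(x)))  # sin''(x) = -sin(x)
```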
We now study the errors in these approximations. We start with another way
of writing the Taylor series. In (1.5.1), replacing c with x, x with x + h, and (x − c)
with h, we get
f(x + h) = Σ_{k=0}^{∞} (1/k!) f^(k)(x) h^k = Σ_{k=0}^{n} (1/k!) f^(k)(x) h^k + En+1   (1.6.2)
where the error term is
En+1 = Σ_{k=n+1}^{∞} (1/k!) f^(k)(x) h^k = ( 1 / (n + 1)! ) f^(n+1)(ξ) h^(n+1)   (1.6.3)
for some ξ that lies between x and x + h.
This form of the Taylor series will provide the basic tool for deriving error
estimates, in the remainder of this course!
and finally
( f(x + h) − 2f(x) + f(x − h) ) / h^2 = f''(x) + (1/12) h^2 f^(4)(x) + O(h^4)
                                      = f''(x) + O(h^2) .   (second order)
In conclusion, one-sided Euler approximations are first order, while central finite
difference approximations are second order.
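These orders can be checked numerically: shrinking h by a factor of 10 should shrink the forward Euler error by about 10 and the central difference error by about 100. A Python sketch:

```python
import math

# Error of the forward Euler and central difference approximations
# to f'(1) for f = sin, as a function of h.
def err_forward(h):
    return abs((math.sin(1 + h) - math.sin(1)) / h - math.cos(1))

def err_central(h):
    return abs((math.sin(1 + h) - math.sin(1 - h)) / (2 * h) - math.cos(1))

for h in [1e-1, 1e-2, 1e-3]:
    print(h, err_forward(h), err_central(h))
# Shrinking h by 10 shrinks the forward error by ~10 (first order)
# and the central error by ~100 (second order).
```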
Problem 1
(i). (100.01)10
(ii). (64.625)10
(iii). (25)10
(c). Write (64.625)10 in normalized scientific notation in binary. You may use the result from the previous part. Then determine how it would be stored in a 32-bit computer, using the single-precision floating point representation.
Perform a detailed study of the error propagation in the computation z = xy. Let fl(x) = x(1 + δx) and fl(y) = y(1 + δy), where fl(x) is the floating point representation of x. Find expressions for the absolute error and the relative error in the answer fl(z).
Problem 4
(a). Derive the following Taylor series for (1 + x)^n (this is also known as the binomial series):

(1 + x)^n = 1 + nx + ( n(n − 1) / 2! ) x^2 + ( n(n − 1)(n − 2) / 3! ) x^3 + · · ·   (x^2 < 1).

Write out its particular form when n = 2, n = 3, and n = 1/2. Then use the last form to compute √1.0001 correct to 15 decimal places (rounded).
(b). Use the answer in part (a) to obtain a series for (1 + x^2)^(−1).
To get started, you can read the first two chapters in “A Practical Introduction
to Matlab” by Gockenbach. You can find the introduction at the web-page:
http://www.math.mtu.edu/~msgocken/intro/intro.pdf
Read through the Chapter “Introduction” and “Simple calculations and graphs”.
How to do it:
Find a computer with Matlab installed in it. Start Matlab, by either clicking on
the appropriate icon (Windows or Mac), or by typing in matlab (unix or linux).
You should get a command window on the screen.
Go through the examples in Gockenbach’s notes.
Help!
As we will see later, Matlab has many built-in numerical functions. One of them is a function that does polynomial interpolation. But what is its name, and how do we use it? You can use lookfor to find the function, and help to get a description of how to use it.

lookfor keyword: search for “keyword” among the Matlab functions.
There is nothing to turn in here. Just have fun and get used to Matlab!
(Hint: Use the built-in function size to obtain the dimensions of a matrix.) You may call the function MatSum.
Then, test your function to compute the summation of all the elements of
A = [ 12.2  1.2  2.4
       2    3    4
       2.5  6.2  3.4 ] .
(b). Write a function for converting decimal to binary. You may call the function
DecToBin.
(Hint: Use the built-in function floor to obtain the integer part. The maximum length of the fractional binary part should be set to 16.)
Test the function by converting (12.625)10 and (21.45)10 to binary.
What to hand in: the source code for the functions, as well as the script file for
using the functions. You should also turn in the output results from Matlab.
In this problem we shall consider the effect of computer arithmetic in the evaluation
of the quadratic formula
r1,2 = ( −b ± √(b^2 − 4ac) ) / (2a)   (1.7.1)
for the roots of p(x) = ax^2 + bx + c.
(a). Write a Matlab function, call it quadroots, that takes a, b, c as input and
returns the two roots as output. The function may start like this:
Use the formula in (1.7.1) for the computation. Run it on a few test problems to
make sure that it works before you go ahead with the following problems.
• 2x^2 + 6x − 3
• x^2 − 14x + 49
• 3x^2 − 123454321x + 2
What do you get? Why are the results pretty bad for the last polynomial? Can
you explain?
(c). The product of the roots of ax^2 + bx + c is of course c/a. Use this fact to improve your program, call it smartquadroots, and see if you get better results.
Matlab has a command roots(p) which returns the roots of the polynomial with coefficients in p. The program roots is smarter than quadroots and will give accurate answers here. You can use it to check the results from your smartquadroots function.
What to hand in: Hand in two Matlab function m-files quadroots.m and
smartquadroots.m, and the results you get, together with your comments.