
May 10, 2019 10:37 ws-book961x669 An Introduction to Numerical Computation 11388-main page 1

Chapter 1

Computer Arithmetic
by 45.134.241.2 on 04/04/21. Re-use and distribution is strictly not permitted, except for Open Access articles.
An Introduction to Numerical Computation Downloaded from www.worldscientific.com

1.1. Introduction

What are numerical methods? They are algorithms that compute approximate
solutions to a number of problems for which exact solutions are not available. These
include, but are not limited to:

• evaluating functions;
• computing the derivative of a function;
• computing the integral of a function;
• solving a non-linear equation;
• solving a system of linear equations;
• finding the maximum or minimum of a function;
• finding eigenvalues of an operator;
• finding solutions to an ODE (ordinary differential equation);
• finding solutions to a PDE (partial differential equation).

Such algorithms can be implemented (i.e., programmed) on a computer.


Given a physical model, one first derives a mathematical model that describes
the physical phenomenon, with suitable assumptions. This process is called mathe-
matical modeling. The mathematical model is then analyzed and solved with suit-
able numerical methods, to obtain approximate solutions. The numerical results
may be plotted and compared to possible physical experiments, to verify the math-
ematical model and to gain insight for improvement. The diagram in Figure 1.1
shows how various aspects are related.
Keep in mind that a course on numerical methods is NOT about numbers. It is
about mathematical ideas and insights.
We shall study some basic issues:

• development of algorithms;
• implementation;
• a little bit of analysis, including error estimates, convergence, stability, etc.

Throughout the course we shall use Matlab for programming purposes.


[Figure 1.1 is a flowchart: physical model → mathematical model → solve the model with numerical methods → presentation of results, visualization → verification and physical explanation of the results → back to the physical model. The "solve" step is supported by mathematical theorems, numerical analysis, and computer programming.]

Fig. 1.1 The big picture.

1.2. Representation of Numbers in Different Bases

Historically, there have been several bases for representing numbers, used by differ-
ent civilizations. These include:

10: decimal, which has become standard nowadays;
2: binary, used by computers;
8: octal;
16: hexadecimal, used in ancient China and, nowadays, in computing;
20: vigesimal, used in ancient France (in French, the numbers 70 to 79 are counted as 60+10 to 60+19, and 80 is 4 × 20);
60: sexagesimal, used by the Babylonians.

In principle, one can use any integer (or even non-integer) β as the base. The
value of a number in base β is then written as

(a_n a_{n−1} · · · a_1 a_0 . b_1 b_2 b_3 · · ·)_β
    = a_n β^n + a_{n−1} β^{n−1} + · · · + a_1 β + a_0        (integer part)
    + b_1 β^{−1} + b_2 β^{−2} + b_3 β^{−3} + · · ·           (fractional part)

Taking β = 10, we obtain the standard decimal representation.


The above formula allows us to convert a number from any base β into decimal.
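This conversion formula is easy to turn into code. Below is a small Python sketch (the course itself uses Matlab; the function name to_decimal and its digit-list interface are our own choices for illustration):

```python
def to_decimal(int_digits, frac_digits, beta):
    """Value of (a_n ... a_0 . b_1 b_2 ...)_beta as a decimal number."""
    value = 0.0
    for a in int_digits:       # Horner form of a_n*beta^n + ... + a_1*beta + a_0
        value = value * beta + a
    p = 1.0
    for b in frac_digits:      # b_1*beta^(-1) + b_2*beta^(-2) + ...
        p /= beta
        value += b * p
    return value

# (45.12)_8, the octal example below:
print(to_decimal([4, 5], [1, 2], 8))   # 37.15625
```

The integer part is evaluated in Horner form, which needs one multiplication and one addition per digit instead of computing each power β^k separately.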

In principle, one can convert the numbers between different bases. Here are
some examples.

Example 1.1. Conversion from octal → decimal:


(45.12)8 = 4 × 81 + 5 × 80 + 1 × 8−1 + 2 × 8−2 = (37.15625)10 .

Example 1.2. Conversion of octal ↔ binary:


Observe that, since 8 = 23 , we have
(1)_8 = (001)_2
(2)_8 = (010)_2
(3)_8 = (011)_2
(4)_8 = (100)_2
(5)_8 = (101)_2
(6)_8 = (110)_2
(7)_8 = (111)_2
(10)_8 = (1000)_2.
The conversion between these two bases then becomes much simpler. To convert from octal to binary, we convert each octal digit into binary, using 3 binary digits for each octal digit. For example:
(5034)_8 = (101 000 011 100)_2 ,
where 5 → 101, 0 → 000, 3 → 011, 4 → 100.
To convert a binary number to an octal one, we group the binary digits in groups of 3 (starting from the radix point), and write out the octal value of each group. For example:
(110 010 111 001)_2 = (6271)_8 ,
where 110 → 6, 010 → 2, 111 → 7, 001 → 1.
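The digit-by-digit grouping above can be sketched in Python (hypothetical helper names, with digit strings as input):

```python
# Octal <-> binary via 3-bit groups, one digit at a time.
OCT_TO_BIN = {str(d): format(d, '03b') for d in range(8)}   # '5' -> '101', etc.

def octal_to_binary(octal_digits):
    return ''.join(OCT_TO_BIN[d] for d in octal_digits)

def binary_to_octal(bits):
    bits = bits.zfill(-(-len(bits) // 3) * 3)   # left-pad to a multiple of 3
    return ''.join(str(int(bits[i:i+3], 2)) for i in range(0, len(bits), 3))

print(octal_to_binary('5034'))          # 101000011100
print(binary_to_octal('110010111001')) # 6271
```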

Example 1.3. Conversion from decimal → binary. Write the decimal number
(12.45)10 in binary base.
This example is of particular interest. Since computers use the binary base, what does the number (12.45)10 look like in binary? How does a computer store this number in its memory? The conversion takes two steps.
First, we treat the integer part. The procedure is to keep dividing by 2 and store the remainder at each step, until the quotient reaches 0. Then, collect the remainders in reverse order. We have

12 ÷ 2 = 6, remainder 0
 6 ÷ 2 = 3, remainder 0
 3 ÷ 2 = 1, remainder 1      ⇒ (12)10 = (1100)2.
 1 ÷ 2 = 0, remainder 1

Could you figure out a simple proof for this procedure?


For the fractional part, we keep multiplying the fractional part by 2, and store the integer part of each result. This gives us:

0.45 × 2 = 0.90 → digit 0
0.90 × 2 = 1.80 → digit 1
0.80 × 2 = 1.60 → digit 1
0.60 × 2 = 1.20 → digit 1      ⇒ (0.45)10 = (0.01110011001100 · · ·)2.
0.20 × 2 = 0.40 → digit 0
0.40 × 2 = 0.80 → digit 0
0.80 × 2 = 1.60 → digit 1
· · ·

Again, would you be able to prove why this procedure works?


Putting them together, we get

(12.45)10 = (1100.01110011001100 · · · )2 .

It might be surprising that a simple finite-length decimal number such as 12.45 corresponds to a number with an infinitely long fractional part when written in binary form! This is very bad news: to store this number exactly, a computer would need infinite memory space!
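Both procedures can be combined into one Python sketch (the function name and the 16-bit cutoff for the fractional part are our own choices; because of the infinite expansion, any such routine must truncate somewhere):

```python
def decimal_to_binary(x, max_frac_bits=16):
    """Integer part by repeated division by 2, fractional part by
    repeated multiplication by 2, as in Example 1.3."""
    n, frac = int(x), x - int(x)
    int_bits = ''
    while n > 0:
        int_bits = str(n % 2) + int_bits   # remainders, collected in reverse
        n //= 2
    frac_bits = ''
    for _ in range(max_frac_bits):
        frac *= 2
        frac_bits += str(int(frac))        # record the integer part
        frac -= int(frac)                  # keep only the fractional part
    return (int_bits or '0') + '.' + frac_bits

print(decimal_to_binary(12.45))   # 1100.0111001100110011 (truncated)
```

Note that this operates on the double-precision value of 12.45, which is itself already a rounded binary approximation.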
How does a computer store such a number? Below we will learn the standard
representation used in modern computers.

1.3. Floating Point Representation

We recall the normalized scientific notation for a real number x:


Decimal: x = ±r × 10^n,   1 ≤ r < 10
Binary:  x = ±r × 2^n,    1 ≤ r < 2
Octal:   x = ±r × 8^n,    1 ≤ r < 8   (just as an example).
For example, we can normalize a decimal number as

2345 = 2.345 × 10^3.

In general, for any base β, we have

x = ±r × β^n,   1 ≤ r < β.

We observe that, once the base β is fixed, the important information to be stored
for any number x include: (i) the sign, (ii) the exponent n, and (iii) the value of r.
The computer uses the binary version of this representation, and it stores numbers with finite length. This means it either rounds or chops the mantissa after a certain number of digits. The resulting numbers are called machine numbers.

The value r is called the normalized mantissa. For binary numbers, r lies between 1 and 2, so we have

r = 1.(fractional part).

Therefore, to save memory space, the computer only needs to store the fractional part of the number.
The integer n is called the exponent. If n > 0, then |x| > 1; if n < 0, then |x| < 1.

A bit is a memory space in a computer that can store the value either 0 or 1.
In a 32-bit computer, with single-precision, a number is stored as in Figure 1.2.

[Figure 1.2 shows the 32-bit layout: 1 bit s (sign of mantissa), 8 bits c (biased exponent), the radix point, then 23 bits f (mantissa).]

Fig. 1.2 32-bit computer with single precision.

Note that the value c, which occupies the 8-bit space, is not the actual exponent but the biased exponent. The number of distinct values that can be stored in 8 bits is 2^8 = 256, so c takes a value between 0 and 255. Since we want to represent positive and negative exponents n in an equal way, we use this space to represent exponents from −127 to 128. Therefore, the actual exponent is n = c − 127.
To summarize, the value of the number in Figure 1.2 is

(−1)^s × 2^{c−127} × (1.f)_2.

Here r = (1.f)_2 is the normalized mantissa in binary, where f is the fractional part, stored in 23 bits.
This representation of numbers is called the single-precision IEEE standard floating-point format.
The smallest representable number in absolute value is obtained when the exponent is smallest, i.e., n = −127, giving

x_min = 2^{−127} ≈ 5.9 × 10^{−39}.

The largest representable number in absolute value is obtained when the exponent is largest, i.e., n = 128, giving

x_max = 2^{128} ≈ 3.4 × 10^{38}.
It is clear that computers can only handle numbers with absolute values between
xmin and xmax .
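For a concrete look at the three fields, one can unpack a single-precision number with Python's standard struct module. A caveat not covered by the simplified description above: the actual IEEE standard reserves the extreme values c = 0 and c = 255 for special cases (zero, subnormals, infinities, NaN), so the formula applies to the normal range of c:

```python
import struct

def decode_single(x):
    """Unpack a number into the (s, c, f) fields of Figure 1.2."""
    (bits,) = struct.unpack('>I', struct.pack('>f', x))  # 32-bit pattern of x
    s = bits >> 31                  # 1 sign bit
    c = (bits >> 23) & 0xFF         # 8-bit biased exponent
    f = bits & 0x7FFFFF             # 23-bit fractional part of the mantissa
    # value = (-1)^s * 2^(c-127) * (1.f)_2
    return s, c, f

print(decode_single(1.0))   # (0, 127, 0)
```

The test value 1.0 gives s = 0, c = 127, f = 0, consistent with 1.0 = (−1)^0 × 2^{127−127} × (1.0)_2.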

We say that x underflows if |x| < x_min; in this case, x is treated as 0.

We say that x overflows if |x| > x_max; in this case, |x| is treated as ∞.

Error in the floating point representation. Let fl(x) denote the floating point representation of the number x. In general it contains an error, caused by rounding or chopping. Let δ denote the relative error. We have

fl(x) = x · (1 + δ)

with

relative error = δ = (fl(x) − x) / x,
absolute error = fl(x) − x = δ · x.

We know that |δ| ≤ ε, where ε is called the machine epsilon: the smallest positive number the computer can detect, in the sense that fl(1 + ε) > 1.
In a 32-bit computer, ε = 2^{−23}. (Note that 23 is the size of the mantissa.)
The relative error in representing numbers depends on whether one rounds or chops. We have

• relative error in rounding: |δ| ≤ 0.5 × 2^{−23} ≈ 0.6 × 10^{−7};
• relative error in chopping: |δ| ≤ 2^{−23} ≈ 1.2 × 10^{−7}.
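The defining property fl(1 + ε) > 1 can be tested directly. Here is a Python sketch that finds ε by repeated halving (note: Python floats are 64-bit doubles with a 52-bit mantissa, so the experiment finds 2^{−52}, not the single-precision 2^{−23} quoted above):

```python
# Find machine epsilon: the smallest power of 2 such that
# fl(1 + eps) > 1 in the working precision (here: 64-bit doubles).
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
print(eps)   # 2.220446049250313e-16, i.e. 2^-52
```

In Matlab, the same quantity is available as the built-in constant eps.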

Error propagation (through arithmetic operations). When a computer performs basic arithmetic operations such as addition or multiplication, the errors in the representation of the numbers are carried along and appear in the result. One can imagine that, after a series of computations, the errors carried at each step may accumulate. This is called error propagation. We demonstrate it with an example of addition.

Example 1.4. Consider an addition, say z = x + y, done in a computer. How


would the errors be propagated?
To fix ideas, let x > 0, y > 0, and let fl(x), fl(y) be their floating point representations. Then we have

fl(x) = x(1 + δx),   fl(y) = y(1 + δy),

where δx and δy are the errors in the floating point representation of x and y, respectively. Then
fl(z) = fl (fl(x) + fl(y))
= (x(1 + δx ) + y(1 + δy )) (1 + δz )
= (x + y) + x · (δx + δz ) + y · (δy + δz ) + (xδx δz + yδy δz )
≈ (x + y) + x · (δx + δz ) + y · (δy + δz ).

Here, δz is the round-off error made when storing the sum in floating point. Then we have

absolute error = fl(z) − (x + y)
             = x·(δx + δz) + y·(δy + δz)
             = [x·δx + y·δy] + (x + y)·δz,

where x·δx and y·δy are the absolute errors for x and y (together, the propagated error), and (x + y)·δz is the round-off error. Similarly,

relative error = (fl(z) − (x + y)) / (x + y) = (x·δx + y·δy)/(x + y) + δz,

where the first term is the propagated error and the second term is the round-off error.

Would you be able to work out the error propagation for other arithmetic oper-
ations?

1.4. Loss of Significance

Loss of significance typically happens when we subtract two numbers that are very
close to each other. The result will then contain fewer significant digits.
To understand this fact, consider a number with 8 significant digits:

x = 0.d1 d2 d3 · · · d8 × 10^{−a}.

Here d1 is the most significant digit, and d8 is the least significant digit.
Let y = 0.b1 b2 b3 · · · b8 × 10^{−a}; we want to compute x − y.
Assume that x ≈ y, such that b1 = d1, b2 = d2, b3 = d3. Then we have

x − y = 0.000 c4 c5 c6 c7 c8 × 10^{−a}.
We see that we have only 5 significant digits in the answer! In fact we lost 3
significant digits, caused by the fact that the first 3 fractional positions for x − y are
all 0. This is called loss of significance, which will result in loss of accuracy. The
number x − y has a much larger relative error in its representation, and therefore
is not so accurate anymore.
In designing a numerical algorithm, one should always be sensitive to such a loss.
One can avoid such subtractions by performing other equivalent computations. We
will see some examples below.

Example 1.5. Find the roots of x2 − 40x + 2 = 0. Use 4 significant digits in the
computation.
Answer. The roots of the equation ax^2 + bx + c = 0 are

r_{1,2} = (−b ± √(b^2 − 4ac)) / (2a).

In our case, we have

x_{1,2} = 20 ± √398 ≈ 20.00 ± 19.95,

therefore

x1 ≈ 20 + 19.95 = 39.95   (good: we have 4 significant digits),
x2 ≈ 20 − 19.95 = 00.05   (bad: we lost 3 significant digits).
To avoid such a loss, we can change the algorithm. Observe that x1·x2 = c/a. Then x2 can be computed through a division instead:

x2 = c/(a·x1) = 2/(1 · 39.95) ≈ 0.05006.

In this way, we get back 4 significant digits in the result.
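This trick can be sketched in Python (a hypothetical smart_quadroots; the sign test picks whichever of −b ± √(b^2 − 4ac) avoids cancellation, and the other root comes from x1·x2 = c/a):

```python
from math import sqrt

def smart_quadroots(a, b, c):
    """Roots of ax^2+bx+c, avoiding the cancellation in -b + sqrt(b^2-4ac)."""
    d = sqrt(b * b - 4 * a * c)
    # Large-magnitude root: choose the sign that adds, never subtracts.
    x1 = (-b - d) / (2 * a) if b > 0 else (-b + d) / (2 * a)
    x2 = c / (a * x1)        # small root from the product x1*x2 = c/a
    return x1, x2

x1, x2 = smart_quadroots(1.0, -40.0, 2.0)   # x^2 - 40x + 2 from Example 1.5
print(x1, x2)   # x1 ≈ 39.9499, x2 ≈ 0.05006
```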

Example 1.6. We want to numerically compute the function

f(x) = 1 / (√(x^2 + 2x) − x − 1).
Explain what problem you might run into for certain values of x. Find a way to
overcome this difficulty.
Answer. We observe that

(√(x^2 + 2x))^2 = x^2 + 2x,   (x + 1)^2 = x^2 + 2x + 1.

We see that, for large values of x with x > 0, the values √(x^2 + 2x) and x + 1 are very close to each other. Therefore, in the subtraction we will lose many significant digits. To avoid this problem, we rewrite the function f(x) in an equivalent form that does not require any subtraction. This can be achieved by multiplying both the numerator and denominator by the conjugate of the denominator, namely

f(x) = (√(x^2 + 2x) + x + 1) / [(√(x^2 + 2x) − x − 1)(√(x^2 + 2x) + x + 1)]
     = (√(x^2 + 2x) + x + 1) / [x^2 + 2x − (x + 1)^2]
     = −(√(x^2 + 2x) + x + 1).
Example 1.7. Let

y = ((2 + x) − 2) / x.

We see that y = 1 for x ≠ 0. But when |x| is very small, the value (2 + x) is very close to 2. The subtraction in the numerator then loses many significant digits, causing large inaccuracy in the result.
We run the following code in Matlab (see Matlab tutorial, given in the homework
section):
>> x=10.^(-12:-1:-16); y=((2+x)-2)./x; format short e; [x;y]
The code displays the solutions for x and y:

x   1.0000e-12   1.0000e-13   1.0000e-14   1.0000e-15   1.0000e-16
y   1.0001e+00   9.9920e-01   1.0214e+00   8.8818e-01   0
We see that Matlab has serious trouble with the computation! For example, when x = 10^{−15}, it gives the value y ≈ 0.88818, a relative error of about 11%!

Example 1.8. We consider another example where loss of significant digits is a real troublemaker. Consider computing the function

p(x) = (x − 1)^5.

Expanding the polynomial, we have the "equivalent" function:

q(x) = x^5 − 5x^4 + 10x^3 − 10x^2 + 5x − 1.


However, q(x) is not so equivalent to p(x) when x ≈ 1, if computed in a computer with finitely many digits. The alternating signs lead to subtractions of two very close numbers, causing large errors due to loss of significant digits. You may try the following code in Matlab:

x=[1.0-10^(-10):10^(-12):1.0+10^(-10)];
p=(x-1).^5;
q=x.^5 - 5*x.^4 + 10*x.^3 - 10*x.^2 + 5*x - 1;
subplot(1,2,1); plot(x,p);
subplot(1,2,2); plot(x,q);
The code generates plots of p(x) and q(x), on the interval 1−10−10 ≤ x ≤ 1+10−10 ,
as computed in Matlab. The plots are shown in Figure 1.3. We see that in the plot
of q(x), the result is completely ruined by the large numerical error caused by the
loss of significant digits.

Fig. 1.3 Plots of p(x) (left) and q(x) (right), generated by Matlab.

1.5. Review of Taylor Series and Taylor Theorem

Taylor series and Taylor Theorem are important in the analysis of many of the
numerical methods in this course. Many error estimates and convergence results
are based on Taylor Theorem.
Assume that f(x) is a smooth function, so that all its derivatives exist. Then the Taylor expansion of f about the point x = c is:

f(x) = f(c) + f′(c)(x − c) + (1/2!) f″(c)(x − c)^2 + (1/3!) f‴(c)(x − c)^3 + · · · .   (1.5.1)

Using the summation sign, it can be written in the more compact way

f(x) = Σ_{k=0}^{∞} (1/k!) f^{(k)}(c) (x − c)^k.   (1.5.2)

This is the Taylor series of f at the point c.


In the special case where c = 0, it is called the Maclaurin series:

f(x) = f(0) + f′(0)x + (1/2!) f″(0)x^2 + (1/3!) f‴(0)x^3 + · · · = Σ_{k=0}^{∞} (1/k!) f^{(k)}(0) x^k.   (1.5.3)

You may be familiar with the following examples:

e^x = Σ_{k=0}^{∞} x^k/k! = 1 + x + x^2/2! + x^3/3! + · · · ,   |x| < ∞,

sin x = Σ_{k=0}^{∞} (−1)^k x^{2k+1}/(2k+1)! = x − x^3/3! + x^5/5! − x^7/7! + · · · ,   |x| < ∞,

cos x = Σ_{k=0}^{∞} (−1)^k x^{2k}/(2k)! = 1 − x^2/2! + x^4/4! − x^6/6! + · · · ,   |x| < ∞,

1/(1 − x) = Σ_{k=0}^{∞} x^k = 1 + x + x^2 + x^3 + x^4 + · · · ,   |x| < 1.

Since the computer performs only arithmetic operations, these series are actually
how a computer calculates many “fancy” functions!
For example, the exponential function is calculated as

e^x ≈ S_N = Σ_{k=0}^{N} x^k/k!

for some integer N, chosen large enough so that the error is sufficiently small. The value S_N is called a partial sum of the Taylor series.
Note that the above formula is rather "expensive" in computing time, since it involves many arithmetic operations. Compared to a single arithmetic operation, evaluating a function requires a lot more computing time. This is why in many algorithms we try to perform as few function evaluations as possible!
We now look at such an example.

Example 1.9. Compute the number e with 6 digit accuracy.


Answer. We have

e = e^1 = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + · · ·

and

1/2! = 0.5,
1/3! = 0.166667,
1/4! = 0.041667,
· · ·
1/9! = 0.0000027,
1/10! = 0.00000027.

We can stop here, since the term 1/10! has 6 zeros after the decimal point. We now have

e ≈ 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + · · · + 1/9! = 2.71828.
One can easily implement this algorithm with a for-loop.
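A minimal Python version of such a for-loop (summing through 1/10!, where the example stops):

```python
# Sum the series e = 1 + 1/1! + 1/2! + ..., keeping terms through 1/10!
# (the term with 6 zeros after the decimal point, as in Example 1.9).
e_approx, term = 1.0, 1.0
for k in range(1, 11):
    term /= k               # term is now 1/k!
    e_approx += term
print(round(e_approx, 5))   # 2.71828
```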

Error and convergence of partial sums of Taylor series: Assume that f^{(k)}(x) (0 ≤ k ≤ n + 1) are continuous functions. Call

f_n(x) = Σ_{k=0}^{n} (1/k!) f^{(k)}(c) (x − c)^k

the partial sum of the first n + 1 terms of the Taylor series. We shall use f_n(x) to approximate f(x), with n sufficiently large.
The difference f(x) − f_n(x) between the exact value of the function and its Taylor approximation is estimated by the famous Taylor Theorem:

E_{n+1} := f(x) − f_n(x) = Σ_{k=n+1}^{∞} (f^{(k)}(c)/k!) (x − c)^k = (f^{(n+1)}(ξ)/(n+1)!) (x − c)^{n+1}   (1.5.4)

for some value ξ which lies between x and c. This indicates that, even though the error E_{n+1} is the sum of infinitely many terms, its size is comparable to the first term of that sum.
We make the following observation: a Taylor series converges rapidly if x is near c, and slowly (or not at all) if x is far away from c.

Example 1.10. (Geometric meaning of a special case of Taylor Theorem.) If n = 0, the error estimate (1.5.4) is the same as the "Mean Value Theorem", illustrated in Figure 1.4. Indeed, if f is smooth on the interval (a, b), then using the estimate (1.5.4) with c = a, x = b, we obtain

f(b) − f(a) = (b − a) f′(ξ),   for some ξ in (a, b).

Fig. 1.4 Mean Value Theorem.

This implies

f′(ξ) = (f(b) − f(a)) / (b − a).

If a and b are very close to each other, this formula can be used to compute an approximation of f′.

1.6. Numerical Differentiations and Finite Differences

We now introduce the finite difference approximations for derivatives. Given h > 0
sufficiently small, we have 3 ways of approximating f 0 (x):

(1) f′(x) ≈ (f(x + h) − f(x)) / h             (Forward Euler)
(2) f′(x) ≈ (f(x) − f(x − h)) / h             (Backward Euler)
(3) f′(x) ≈ (f(x + h) − f(x − h)) / (2h)      (Central Finite Difference)

See Figure 1.5 for an illustration. We observe that the forward Euler method takes information from the right side of x, the backward Euler method takes information from the left side, while the central finite difference takes information from both sides.


Fig. 1.5 Finite differences to approximate derivatives.

For the second derivative f″, we can use a central finite difference approximation:

f″(x) ≈ (1/h^2) (f(x + h) − 2f(x) + f(x − h)).   (1.6.1)

The formula (1.6.1) is motivated as follows:

(f(x + h) − 2f(x) + f(x − h)) / h^2
  = (1/h) [ (f(x + h) − f(x))/h − (f(x) − f(x − h))/h ]
  ≈ (f′(x + h/2) − f′(x − h/2)) / h
  ≈ f″(x).

We now study the errors in these approximations. We start with another way of writing the Taylor series. In (1.5.1), replacing c with x, and x with x + h (so that x − c becomes h), we get

f(x + h) = Σ_{k=0}^{∞} (1/k!) f^{(k)}(x) h^k = Σ_{k=0}^{n} (1/k!) f^{(k)}(x) h^k + E_{n+1},   (1.6.2)

where the error term is

E_{n+1} = Σ_{k=n+1}^{∞} (1/k!) f^{(k)}(x) h^k = (1/(n+1)!) f^{(n+1)}(ξ) h^{n+1}   (1.6.3)
for some ξ that lies between x and x + h.
This form of the Taylor series will provide the basic tool for deriving error
estimates, in the remainder of this course!

Local Truncation Error. We have

f(x + h) = f(x) + h f′(x) + (1/2) h^2 f″(x) + (1/6) h^3 f‴(x) + O(h^4),
f(x − h) = f(x) − h f′(x) + (1/2) h^2 f″(x) − (1/6) h^3 f‴(x) + O(h^4).

Here the notation O(h^4) indicates a term whose size (in absolute value) is bounded by C h^4 for some constant C, i.e.,

|O(h^4)| ≤ C h^4.

A simple computation shows

(f(x + h) − f(x)) / h = f′(x) + (1/2) h f″(x) + O(h^2) = f′(x) + O(h^1).   (first order)

We call this approximation first order, because the error is bounded by O(h^1). Similar computations give

(f(x) − f(x − h)) / h = f′(x) − (1/2) h f″(x) + O(h^2) = f′(x) + O(h^1),   (first order)

and

(f(x + h) − f(x − h)) / (2h) = f′(x) + (1/6) h^2 f‴(x) + O(h^4) = f′(x) + O(h^2),   (second order)

and finally

(f(x + h) − 2f(x) + f(x − h)) / h^2 = f″(x) + (1/12) h^2 f^{(4)}(x) + O(h^4) = f″(x) + O(h^2).   (second order)
In conclusion, one-sided Euler approximations are first order, while central finite
difference approximations are second order.
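These orders are easy to check numerically. In the Python sketch below (our own test setup: f = sin at x = 1, compared against the exact derivatives), reducing h by a factor of 10 should reduce the one-sided errors by about 10 and both central errors by about 100:

```python
from math import sin, cos

def fd_errors(f, df, ddf, x, h):
    """Absolute errors of formulas (1)-(3) and of the central
    second-derivative formula (1.6.1), at step size h."""
    fwd = (f(x + h) - f(x)) / h                      # forward Euler
    bwd = (f(x) - f(x - h)) / h                      # backward Euler
    cen = (f(x + h) - f(x - h)) / (2 * h)            # central difference
    cen2 = (f(x + h) - 2 * f(x) + f(x - h)) / h**2   # central, 2nd derivative
    return (abs(fwd - df(x)), abs(bwd - df(x)),
            abs(cen - df(x)), abs(cen2 - ddf(x)))

for h in (1e-1, 1e-2, 1e-3):
    print(h, fd_errors(sin, cos, lambda t: -sin(t), 1.0, h))
```

For very small h, round-off error eventually takes over and the observed ratios degrade; this is the interplay between truncation error and the representation errors of Section 1.3.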

1.7. Homework Problems for Chapter 1

Problem 1

(a). Convert the binary numbers to decimal numbers.


(i). (110111001.101011101)2
(ii). (1001100101.01101)2
(iii). (101.01)2

(b). Convert the decimal numbers to binary. Keep 10 fractional digits.



(i). (100.01)10
(ii). (64.625)10
(iii). (25)10

(c). Write (64.625)10 in normalized scientific notation in binary. You may use the result of the previous problem. Then determine what it would look like in a 32-bit computer, using the single-precision floating point representation.

Problem 2: Error Propagation

Perform a detailed study of the error propagation in the computation z = xy. Let fl(x) = x(1 + δx) and fl(y) = y(1 + δy), where fl(x) is the floating point representation of x. Find expressions for the absolute error and the relative error in the answer fl(z).

Problem 3: Loss of Significance

(a). For some values of x, the function f(x) = √(x^2 + 1) − x cannot be computed accurately in a computer by using this formula. Explain why and demonstrate it with an example. Then find a way around the difficulty.

(b). Explain why the function

f(x) = 1 / (√(x + 2) − √x)

cannot be computed accurately in a computer when x is large (using the above formula). Find a way around the problem.

(c). Perform the following computations. You may use 1/3 ≈ 0.333333, 3/4 = 0.75 and 100/301 ≈ 0.332226.

(i). Compute 1/3 + 3/4 using five-significant-digit rounding arithmetic.
(ii). Compute 1/3 − 100/301 using five-significant-digit chopping arithmetic.

Problem 4
(a). Derive the following Taylor series for (1 + x)^n (this is also known as the binomial series):

(1 + x)^n = 1 + nx + (n(n − 1)/2!) x^2 + (n(n − 1)(n − 2)/3!) x^3 + · · ·   (x^2 < 1).

Write out its particular form when n = 2, n = 3, and n = 1/2. Then use the last form to compute √1.0001 correct to 15 decimal places (rounded).
(b). Use the answer in part (a) to obtain a series for (1 + x^2)^{−1}.

Problem 5: Matlab Exercises


The goal of this exercise is to get started with Matlab. You will go through:

• Matrices, vectors, solutions of systems of linear equations;


• Simple plots;
• Use of Matlab’s own help functions.

To get started, you can read the first two chapters in “A Practical Introduction
to Matlab” by Gockenbach. You can find the introduction at the web-page:
http://www.math.mtu.edu/~msgocken/intro/intro.pdf
Read through the chapters "Introduction" and "Simple calculations and graphs".

How to do it:
Find a computer with Matlab installed in it. Start Matlab, by either clicking on
the appropriate icon (Windows or Mac), or by typing in matlab (unix or linux).
You should get a command window on the screen.
Go through the examples in Gockenbach’s notes.

Help!
As we will see later, Matlab has many built-in numerical functions. One of them is a function that does polynomial interpolation. But what is its name, and how do you use it? You can use lookfor to find the function, and help to get a description of how to use it.
lookfor keyword: search for the "keyword" among Matlab functions.
There is nothing to turn in here. Just have fun and get used to Matlab!

Problem 6: Sharpening Your Matlab Skills


(a). Write a function for summing all the elements of an n × n matrix A, namely,

V = Σ_{i=1}^{n} Σ_{j=1}^{n} A_{ij}.

(Hint: Use the built-in function size to obtain the dimensions of a matrix.) You may call the function MatSum.
Then, test your function by computing the sum of all the elements of

A = [ 12.2  1.2  2.4
       2    3    4
       2.5  6.2  3.4 ].

(b). Write a function for converting decimal to binary. You may call the function
DecToBin.

(Hint: Use the built-in function floor to obtain the integer part. The maximum length of the fractional binary part should be set to 16.)
Test the function by converting (12.625)10 and (21.45)10 to binary.

What to hand in: the source code for the functions, as well as the script file for
using the functions. You should also turn in the output results from Matlab.

Problem 7: A Study on Loss of Significance

In this problem we shall consider the effect of computer arithmetic on the evaluation of the quadratic formula

r_{1,2} = (−b ± √(b^2 − 4ac)) / (2a)   (1.7.1)

for the roots of p(x) = ax^2 + bx + c.

(a). Write a Matlab function, call it quadroots, that takes a, b, c as input and
returns the two roots as output. The function may start like this:

function [r1, r2] = quadroots(a,b,c)


% input: a, b, c: coefficients for the polynomial ax^2+bx+c=0.
% output: r1, r2: The two roots for the polynomial.

Use the formula in (1.7.1) for the computation. Run it on a few test problems to
make sure that it works before you go ahead with the following problems.

(b). Test your quadroots for the following polynomials:

• 2x2 + 6x − 3
• x2 − 14x + 49
• 3x2 − 123454321x + 2

What do you get? Why are the results pretty bad for the last polynomial? Can
you explain?

(c). The product of the roots of ax^2 + bx + c is of course c/a. Use this fact to improve your program, calling it smartquadroots, to see if you get better results.
Matlab has a command roots(p) which returns the roots of the polynomial with coefficients in p. The program roots is smarter than quadroots and will give accurate answers here. You can use it to check the results from your smartquadroots function.

What to hand in: Hand in two Matlab function m-files, quadroots.m and smartquadroots.m, and the results you get, together with your comments.

Problem 8*. Binary–Decimal Conversion; a Proof


Give a proof for the algorithm used in Example 1.3, for both the integer and the
fractional parts.

Problem 9*. Computing π


We make a naive attempt to compute π. It is known that π can be computed by the Leibniz formula for π:

π/4 = Σ_{n=0}^{∞} ( 1/(4n + 1) − 1/(4n + 3) ) = Σ_{n=0}^{∞} 2 / ((4n + 1)(4n + 3)).

Write a Matlab program that computes an approximation of π by summing the terms from n = 0 to 50. Compare this with the value of π stored in Matlab and find the error.
