Computer Arithmetic: May 10, 2019 10:37 Ws-Book961x669 An Introduction To Numerical Computation 11388-Main
Chapter 1
Computer Arithmetic
by 45.134.241.2 on 04/04/21. Re-use and distribution is strictly not permitted, except for Open Access articles.
An Introduction to Numerical Computation Downloaded from www.worldscientific.com
1.1. Introduction
What are numerical methods? They are algorithms that compute approximate
solutions to a number of problems for which exact solutions are not available. These
include, but are not limited to:
• evaluating functions;
• computing the derivative of a function;
• computing the integral of a function;
• solving a non-linear equation;
• solving a system of linear equations;
• finding the maximum or minimum of a function;
• finding eigenvalues of an operator;
• finding solutions to an ODE (ordinary differential equation);
• finding solutions to a PDE (partial differential equation).
In this course, we will focus on:
• development of algorithms;
• implementation;
• a little bit of analysis, including error estimates, convergence, stability, etc.
Figure 1.1 illustrates the workflow of scientific computing: a physical model leads to a mathematical model, which is solved by numerical methods; the results are then presented and visualized, followed by verification and physical explanation of the results.
Historically, several different bases for representing numbers have been used by different civilizations.
In principle, one can use any integer (or even non-integer) β as the base. The value of a number written in base β is then

(am · · · a1 a0 . b1 b2 b3 · · ·)β = am β^m + · · · + a1 β + a0 + b1 β^(−1) + b2 β^(−2) + b3 β^(−3) + · · · .
In principle, one can convert the numbers between different bases. Here are
some examples.
(1)8 = (001)2
(2)8 = (010)2
(3)8 = (011)2
(4)8 = (100)2
(5)8 = (101)2
(6)8 = (110)2
(7)8 = (111)2
(10)8 = (1000)2 .
Then, the conversion between these two bases becomes much simpler. To convert from octal to binary, we convert each digit of the octal number into binary, using 3 binary digits for each octal digit. For example:
(5034)8 = (101 000 011 100)2 ,

where the groups 101, 000, 011, 100 correspond to the octal digits 5, 0, 3, 4.
To convert a binary number to an octal one, we group the binary digits in groups of 3 and write out the octal value of each group. For example:
(110 010 111 001)2 = (6271)8 ,

where the groups 110, 010, 111, 001 correspond to the octal digits 6, 2, 7, 1.
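The digit-grouping rule can be sketched in a few lines of code. The book's exercises use Matlab; here is a Python sketch for illustration (the function names are our own):

```python
# Convert an octal digit string to binary by expanding each octal
# digit into exactly 3 binary digits, as described above.
def octal_to_binary(octal_str):
    return "".join(format(int(d, 8), "03b") for d in octal_str)

# Convert a binary digit string to octal by grouping the digits in
# threes from the right and reading off the octal value of each group.
def binary_to_octal(binary_str):
    padded = binary_str.zfill(-(-len(binary_str) // 3) * 3)
    groups = [padded[i:i + 3] for i in range(0, len(padded), 3)]
    return "".join(str(int(g, 2)) for g in groups)

print(octal_to_binary("5034"))          # 101000011100
print(binary_to_octal("110010111001"))  # 6271
```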
Example 1.3. Conversion from decimal → binary. Write the decimal number
(12.45)10 in binary base.
This example is of particular interest. Since the computer uses binary base, what does the number (12.45)10 look like in binary? How does a computer store this number in its memory? The conversion takes two steps.
First, we treat the integer part. The procedure is to keep dividing by 2 and store the remainder at each step, until one cannot divide anymore. Then, collect the remainders in reverse order. We have
2 | 12    remainder 0
2 |  6    remainder 0
2 |  3    remainder 1
2 |  1    remainder 1
     0                  ⇒ (12)10 = (1100)2 .
Next, we treat the fractional part: we keep multiplying by 2, strip off the integer part of the product as the next binary digit, and continue with the remaining fraction. We have

0.45 × 2 = 0.90   →  digit 0
0.90 × 2 = 1.80   →  digit 1
0.80 × 2 = 1.60   →  digit 1
0.60 × 2 = 1.20   →  digit 1
0.20 × 2 = 0.40   →  digit 0
0.40 × 2 = 0.80   →  digit 0
0.80 × 2 = 1.60   →  digit 1
· · ·

and the pattern repeats from here on. Putting the two parts together, we get

(12.45)10 = (1100.01110011001100 · · · )2 .
It might be surprising that a simple finite length decimal number such as 12.45
corresponds to a fractional number with infinite length, when written in binary
form! This is very bad news! To store this number exactly in a computer, one
would need infinite memory space!
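The two-step procedure above (divide the integer part by 2, multiply the fractional part by 2) can be sketched in Python; the function name and the cap of 16 fractional digits are our choices for illustration:

```python
def dec_to_bin(x, frac_digits=16):
    """Convert a non-negative decimal number to a binary digit string.

    Integer part: repeated division by 2, remainders collected in reverse.
    Fractional part: repeated multiplication by 2, integer parts collected
    in order, truncated after `frac_digits` digits.
    """
    n = int(x)
    frac = x - n
    # Integer part: divide by 2, record remainders, reverse at the end.
    int_digits = []
    while n > 0:
        int_digits.append(str(n % 2))
        n //= 2
    int_part = "".join(reversed(int_digits)) or "0"
    # Fractional part: multiply by 2, strip off the integer part each step.
    frac_part = []
    for _ in range(frac_digits):
        frac *= 2
        d = int(frac)
        frac_part.append(str(d))
        frac -= d
    return int_part + "." + "".join(frac_part)

print(dec_to_bin(12.45))  # 1100.0111001100110011, truncated
```

Note the repeating block 0011 in the output, exactly as derived by hand above.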
How does a computer store such a number? Below we will learn the standard
representation used in modern computers.
Normalized scientific notation. Any nonzero number x can be written in the normalized form

x = ±r × β^n ,   1 ≤ r < β .

We observe that, once the base β is fixed, the important information to be stored for any number x includes: (i) the sign, (ii) the exponent n, and (iii) the value of r.
The computer uses the binary version of this number system, and it represents numbers with finite length. This means that it either rounds or chops after a certain fractional position. These numbers are called machine numbers.
The value r is called the normalized mantissa. For binary numbers, r lies between 1 and 2, so we have

r = 1.(fractional part).

Therefore, to save memory space, the computer only needs to store the fractional part of the number.

The integer n is called the exponent. If n > 0, then |x| > 1. If n < 0, then |x| < 1.
A bit is a memory space in a computer that can store the value either 0 or 1.
In a 32-bit computer, with single-precision, a number is stored as in Figure 1.2.
Fig. 1.2  Single-precision storage of a number: s (1 bit, the sign of the mantissa), c (8 bits, the biased exponent), and f (23 bits, the mantissa).
Note that the value c, which occupies 8 bits, is not the actual exponent but the biased exponent. An 8-bit field can store 2^8 = 256 distinct values, so c takes a value between 0 and 255. Since we want to represent positive and negative exponents n in an equal way, we use this range to represent exponents from −127 to 128. Therefore, the actual value of the exponent is n = c − 127.
To summarize, the actual value of the number in Figure 1.2 is:

(−1)^s × 2^(c−127) × (1.f)2 .

Here r = (1.f)2 is the value of the binary number, where f is the fractional part, which is stored in 23 bits.
This representation of numbers is called: single-precision IEEE standard
floating-point.
The smallest representable number in absolute value is obtained when the exponent is the smallest, i.e., n = −127, and we have

xmin = 2^(−127) ≈ 5.9 × 10^(−39) .
The largest representable number in absolute value is obtained when the exponent is the largest, i.e., n = 128, and we have

xmax = 2^(128) ≈ 3.4 × 10^(38) .
It is clear that computers can only handle numbers with absolute values between
xmin and xmax .
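We can inspect the three fields s, c, f directly. A Python sketch using the standard struct module (Python's own floats are double precision, so we explicitly pack the number into 32 bits first):

```python
import struct

def float32_fields(x):
    """Return (s, c, f) for the IEEE single-precision representation of x."""
    # Pack x as a 32-bit big-endian float, then reinterpret the same
    # 4 bytes as an unsigned 32-bit integer to expose the raw bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31            # 1 sign bit
    c = (bits >> 23) & 0xFF   # 8 bits of biased exponent
    f = bits & 0x7FFFFF       # 23 bits of mantissa (fractional part)
    return s, c, f

s, c, f = float32_fields(12.45)
print(s, c, c - 127)  # recall 12.45 = (1.10001110011...)_2 x 2^3, so n = 3
```

For 12.45 we get s = 0 and c = 3 + 127 = 130, consistent with n = c − 127.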
Error in the floating point representation. Let fl(x) denote the floating point
representation of the number x. In general it contains error, caused by roundoff or
chopping. Let δ denote the relative error. We have
fl(x) = x · (1 + δ)
with

relative error = δ = ( fl(x) − x ) / x ,
absolute error = fl(x) − x = δ · x .
We know that |δ| ≤ ε, where ε is called the machine epsilon: the smallest positive number such that fl(1 + ε) > 1. In a 32-bit computer, ε = 2^(−23). (Note that the number 23 is the size of the mantissa.)
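The machine epsilon can also be found experimentally, by halving a trial value until adding it to 1 no longer changes the stored result. A Python sketch (Python floats are double precision with a 52-bit mantissa, so the loop finds 2^(−52) rather than the single-precision 2^(−23)):

```python
# Find the machine epsilon of Python floats (IEEE double precision):
# the smallest power of 2 such that fl(1 + eps) > 1.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
print(eps)               # 2.220446049250313e-16
print(eps == 2 ** -52)   # True
```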
The relative error in representing a number depends on whether one uses rounding or chopping. We have

|δ| ≤ ε  (chopping),     |δ| ≤ ε/2  (rounding).
Error propagation in addition. Consider the computation z = x + y. The stored operands are fl(x) = x(1 + δx) and fl(y) = y(1 + δy), and the computed sum is fl(z) = (fl(x) + fl(y))(1 + δz). Here, δz is the round-off error in making the floating point representation for z. Then, dropping second-order terms such as δx δz, we have
absolute error = fl(z) − (x + y) = x · (δx + δz) + y · (δy + δz)
              = x · δx + y · δy  +  (x + y) · δz ,

where x · δx and y · δy are the absolute errors for x and y (together they form the propagated error), and (x + y) · δz is the round-off error,
and the relative error is

relative error = ( fl(z) − (x + y) ) / (x + y) = ( x · δx + y · δy ) / (x + y) + δz .
Would you be able to work out the error propagation for other arithmetic oper-
ations?
Loss of significance typically happens when we subtract two numbers that are very
close to each other. The result will then contain fewer significant digits.
To understand this fact, consider a number with 8 significant digits:

x = 0.d1 d2 d3 · · · d8 × 10^(−a) .

Here d1 is the most significant digit, and d8 is the least significant digit.

Let y = 0.b1 b2 b3 · · · b8 × 10^(−a) . We want to compute x − y. Assume that x ≈ y, so that b1 = d1 , b2 = d2 , b3 = d3 . Then we have

x − y = 0.000 c4 c5 c6 c7 c8 × 10^(−a) .
We see that we have only 5 significant digits in the answer! In fact we lost 3
significant digits, caused by the fact that the first 3 fractional positions for x − y are
all 0. This is called loss of significance, which will result in loss of accuracy. The
number x − y has a much larger relative error in its representation, and therefore
is not so accurate anymore.
In designing a numerical algorithm, one should always be sensitive to such a loss.
One can avoid such subtractions by performing other equivalent computations. We
will see some examples below.
Example 1.5. Find the roots of x^2 − 40x + 2 = 0. Use 4 significant digits in the computation.
Answer. The roots of the equation ax^2 + bx + c = 0 are
r1,2 = ( −b ± √(b^2 − 4ac) ) / (2a) .
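Note that for x^2 − 40x + 2 the smaller root 20 − √398 subtracts two nearly equal numbers, so with 4 significant digits most of the accuracy is lost. A standard remedy, sketched here in Python, is to compute the root that involves no cancellation and recover the other one from the product r1 · r2 = c/a (the same trick used in the smartquadroots problem below):

```python
import math

def quadroots_stable(a, b, c):
    """Roots of a*x^2 + b*x + c = 0, avoiding cancellation.

    Compute the root where -b and the square root have the same sign
    (so the two terms add, never cancel), then use r1 * r2 = c / a.
    """
    disc = math.sqrt(b * b - 4 * a * c)
    # copysign gives disc the sign of b, so -(b + copysign(disc, b))
    # adds two numbers of the same magnitude sign: no cancellation.
    q = -(b + math.copysign(disc, b)) / 2
    r1 = q / a
    r2 = c / q
    return r1, r2

print(quadroots_stable(1, -40, 2))  # roots 20 +/- sqrt(398)
```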
Example 1.8. We consider another example where loss of significant digits causes real trouble. Consider computing the function

p(x) = (x − 1)^5 .

Expanding the polynomial, we have the “equivalent” function

q(x) = x^5 − 5x^4 + 10x^3 − 10x^2 + 5x − 1 .
Fig. 1.3 Plots of p(x) (left) and q(x) (right), generated by Matlab.
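The effect is easy to reproduce. A Python sketch evaluating both forms near x = 1 (in double precision rather than single, but the phenomenon is the same):

```python
def p(x):
    return (x - 1) ** 5

def q(x):
    # The expanded, mathematically equivalent polynomial.
    return x**5 - 5*x**4 + 10*x**3 - 10*x**2 + 5*x - 1

x = 1.00001
print(p(x))  # ~1e-25: accurate to machine precision
print(q(x))  # dominated by rounding noise from cancellation

# In q, terms of size ~1 must cancel down to ~1e-25, so essentially
# all significant digits are lost; p suffers no cancellation at all.
```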
Taylor series and Taylor Theorem are important in the analysis of many of the
numerical methods in this course. Many error estimates and convergence results
are based on Taylor Theorem.
Assume that f(x) is a smooth function, so that all its derivatives exist. Then the Taylor expansion of f about the point x = c is:
f(x) = f(c) + f'(c)(x − c) + (1/2!) f''(c)(x − c)^2 + (1/3!) f'''(c)(x − c)^3 + · · · .   (1.5.1)
Using the summation sign, it can be written in a more compact way:

f(x) = Σ_{k=0}^{∞} (1/k!) f^(k)(c) (x − c)^k .   (1.5.2)
Some familiar examples are:

sin x = Σ_{k=0}^{∞} (−1)^k x^(2k+1)/(2k+1)! = x − x^3/3! + x^5/5! − x^7/7! + · · · ,   |x| < ∞ ,

cos x = Σ_{k=0}^{∞} (−1)^k x^(2k)/(2k)! = 1 − x^2/2! + x^4/4! − x^6/6! + · · · ,   |x| < ∞ ,

1/(1 − x) = Σ_{k=0}^{∞} x^k = 1 + x + x^2 + x^3 + x^4 + · · · ,   |x| < 1 .
Since the computer performs only arithmetic operations, these series are actually
how a computer calculates many “fancy” functions!
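As a sanity check, a few terms of the sine series above already reproduce sin x to near machine precision for moderate x. A Python sketch (the function name and term count are our choices):

```python
import math

def sin_series(x, n_terms=10):
    """Partial sum of the Taylor series of sin about 0."""
    total = 0.0
    for k in range(n_terms):
        # k-th term: (-1)^k x^(2k+1) / (2k+1)!
        total += (-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
    return total

print(sin_series(0.5))                      # close to sin(0.5)
print(abs(sin_series(0.5) - math.sin(0.5)))  # error near machine epsilon
```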
For example, the exponential function is calculated as
e^x ≈ SN = Σ_{k=0}^{N} x^k / k!

for some integer N. The integer N is chosen large enough so that the error is sufficiently small. The value SN is called the partial sum of the Taylor series.
Note that the above formula is rather “expensive” in computing time, since it involves many arithmetic operations. Compared to a single arithmetic operation,
evaluating a function requires a lot more computing time. This is why in many
algorithms we try to perform as few function evaluations as possible!
We now look at such an example: approximating e by evaluating the Taylor series of e^x at x = 1,

e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + · · · .

The successive terms are
1/2! = 0.5 ,
1/3! = 0.166667 ,
1/4! = 0.041667 ,
· · ·
1/9! = 0.0000027 ,
1/10! = 0.00000027 .
We see that we can stop here, since the term 1/10! has 6 zeros after the decimal point. We now have

e ≈ 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + · · · + 1/9! = 2.71828 .
One can easily implement this algorithm with a for-loop.
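A minimal Python version of that for-loop (the stopping rule, a fixed number of terms, is our choice for illustration):

```python
def approx_e(n_terms=9):
    """Approximate e by the partial sum 1 + 1 + 1/2! + ... + 1/n_terms!."""
    total = 1.0   # the k = 0 term
    term = 1.0
    for k in range(1, n_terms + 1):
        term /= k          # term is now 1/k!, built up incrementally
        total += term
    return total

print(round(approx_e(9), 5))   # 2.71828, as computed by hand above
```

Note that each term is obtained from the previous one by a single division, which is much cheaper than recomputing k! from scratch at every step.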
Let fn(x) denote the partial sum of the first n + 1 terms in the Taylor series. We shall use fn(x) to approximate f(x), with n sufficiently large.
The difference f(x) − fn(x) between the exact value of a function and its Taylor approximation is estimated by the famous Taylor Theorem:
En+1 := f(x) − fn(x) = Σ_{k=n+1}^{∞} ( f^(k)(c) / k! ) (x − c)^k = ( f^(n+1)(ξ) / (n + 1)! ) (x − c)^(n+1)   (1.5.4)
for some value ξ which lies between x and c. This indicates that, even though the error En+1 is the sum of infinitely many terms, its size is comparable to the first term in that sum.
We make the following observation: A Taylor series converges rapidly if x is
near c, and slowly (or not at all) if x is far away from c.
In the special case n = 0, the error estimate (1.5.4) reduces to the “Mean Value Theorem”, illustrated in Figure 1.4. Indeed, if f is smooth on the interval (a, b), then using the estimate (1.5.4) with c = a, x = b, we obtain

f(b) = f(a) + f'(ξ)(b − a) ,   a < ξ < b .

[Fig. 1.4 Illustration of the Mean Value Theorem: at some point ξ between a and b, the tangent of f is parallel to the chord through (a, f(a)) and (b, f(b)).]

This implies
f'(ξ) = ( f(b) − f(a) ) / (b − a) .
If a and b are very close to each other, this formula can be used to compute an approximation of f'.
We now introduce the finite difference approximations for derivatives. Given h > 0 sufficiently small, we have 3 ways of approximating f'(x):
(1) f'(x) ≈ ( f(x + h) − f(x) ) / h   (Forward Euler)

(2) f'(x) ≈ ( f(x) − f(x − h) ) / h   (Backward Euler)

(3) f'(x) ≈ ( f(x + h) − f(x − h) ) / (2h)   (Central Finite Difference)
See Figure 1.5 for an illustration. We observe that the forward Euler method takes information from the right side of x, the backward Euler method takes information from the left side, while the central finite difference takes information from both sides.
[Fig. 1.5 The three difference quotients (1)–(3) approximating the slope f'(x), built from the points x − h, x, and x + h.]
For the second derivative f'', we can use a central finite difference approximation:

f''(x) ≈ (1/h^2) ( f(x + h) − 2f(x) + f(x − h) ) .   (1.6.1)
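All four formulas can be tried directly. A Python sketch using f(x) = sin x, whose derivatives are known exactly:

```python
import math

# The three first-derivative approximations and formula (1.6.1),
# written as small helper functions.
def forward(f, x, h):
    return (f(x + h) - f(x)) / h              # Forward Euler

def backward(f, x, h):
    return (f(x) - f(x - h)) / h              # Backward Euler

def central(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)    # Central difference

def central2(f, x, h):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2  # (1.6.1)

x, h = 1.0, 1e-4
print(abs(forward(math.sin, x, h) - math.cos(x)))   # error of size ~h
print(abs(central(math.sin, x, h) - math.cos(x)))   # error of size ~h^2
print(abs(central2(math.sin, x, h) + math.sin(x)))  # sin''(x) = -sin(x)
```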
We now study the errors in these approximations. We start with another way
of writing the Taylor series. In (1.5.1), replacing c with x, x with x + h, and (x − c)
with h, we get
f(x + h) = Σ_{k=0}^{∞} (1/k!) f^(k)(x) h^k = Σ_{k=0}^{n} (1/k!) f^(k)(x) h^k + En+1   (1.6.2)
where the error term is
En+1 = Σ_{k=n+1}^{∞} (1/k!) f^(k)(x) h^k = ( 1 / (n + 1)! ) f^(n+1)(ξ) h^(n+1)   (1.6.3)
for some ξ that lies between x and x + h.
This form of the Taylor series will provide the basic tool for deriving error
estimates, in the remainder of this course!
and finally
( f(x + h) − 2f(x) + f(x − h) ) / h^2 = f''(x) + (1/12) h^2 f^(4)(x) + O(h^4)
                                      = f''(x) + O(h^2) .   (second order)
In conclusion, one-sided Euler approximations are first order, while central finite
difference approximations are second order.
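These orders can be checked numerically: shrinking h by a factor of 10 should shrink the forward Euler error by about 10 and the central difference error by about 100. A Python sketch:

```python
import math

# Error of the forward Euler and central difference approximations
# to f'(1) for f = sin, as a function of h.
def err_forward(h):
    return abs((math.sin(1 + h) - math.sin(1)) / h - math.cos(1))

def err_central(h):
    return abs((math.sin(1 + h) - math.sin(1 - h)) / (2 * h) - math.cos(1))

for h in [1e-1, 1e-2, 1e-3]:
    print(h, err_forward(h), err_central(h))
# Shrinking h by 10 shrinks the forward error by ~10 (first order)
# and the central error by ~100 (second order).
```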
Problem 1
(i). (100.01)10
(ii). (64.625)10
(iii). (25)10
(c). Write (64.625)10 in normalized scientific notation in binary. You may use the result from the previous part. Then determine how it would be stored in a 32-bit computer, using the single-precision floating point representation.
Perform a detailed study of the error propagation in the computation z = xy. Let fl(x) = x(1 + δx) and fl(y) = y(1 + δy), where fl(x) is the floating point representation of x. Find expressions for the absolute error and the relative error in the answer fl(z).
Problem 4
(a). Derive the following Taylor series for (1 + x)^n (this is also known as the binomial series):

(1 + x)^n = 1 + nx + ( n(n − 1) / 2! ) x^2 + ( n(n − 1)(n − 2) / 3! ) x^3 + · · ·   (x^2 < 1).

Write out its particular form when n = 2, n = 3, and n = 1/2. Then use the last form to compute √1.0001 correct to 15 decimal places (rounded).
(b). Use the answer in part (a) to obtain a series for (1 + x^2)^(−1).
To get started, you can read the first two chapters in “A Practical Introduction
to Matlab” by Gockenbach. You can find the introduction at the web-page:
http://www.math.mtu.edu/~msgocken/intro/intro.pdf
Read through the Chapter “Introduction” and “Simple calculations and graphs”.
How to do it:
Find a computer with Matlab installed in it. Start Matlab, by either clicking on
the appropriate icon (Windows or Mac), or by typing in matlab (unix or linux).
You should get a command window on the screen.
Go through the examples in Gockenbach’s notes.
Help!
As we will see later, Matlab has many built-in numerical functions. One of them is a function that does polynomial interpolation. But what is its name, and how do we use it? You can use lookfor to find the function, and help to get a description of how to use it.

lookfor keyword: search for “keyword” among the Matlab functions.
There is nothing to turn in here. Just have fun and get used to Matlab!
(Hint: Use the built-in function size to obtain the dimensions of a matrix.) You may call the function MatSum.
Then, test your function to compute the summation of all the elements of
A = [ 12.2  1.2  2.4
       2    3    4
       2.5  6.2  3.4 ] .
(b). Write a function for converting decimal to binary. You may call the function
DecToBin.
(Hint: Use the built-in function floor to obtain the integer part. The maximum length of the fractional binary part should be set to 16.)
Test the function by converting (12.625)10 and (21.45)10 to binary.
What to hand in: the source code for the functions, as well as the script file for
using the functions. You should also turn in the output results from Matlab.
In this problem we shall consider the effect of computer arithmetic in the evaluation
of the quadratic formula
r1,2 = ( −b ± √(b^2 − 4ac) ) / (2a)   (1.7.1)
for the roots of p(x) = ax^2 + bx + c.
(a). Write a Matlab function, call it quadroots, that takes a, b, c as input and
returns the two roots as output. The function may start like this:
Use the formula in (1.7.1) for the computation. Run it on a few test problems to
make sure that it works before you go ahead with the following problems.
• 2x^2 + 6x − 3
• x^2 − 14x + 49
• 3x^2 − 123454321x + 2
What do you get? Why are the results pretty bad for the last polynomial? Can
you explain?
(c). The product of the roots of ax^2 + bx + c is of course c/a. Use this fact to improve your program, call it smartquadroots, and see if you get better results.
Matlab has a command roots(p) which returns the roots of the polynomial with coefficients in p. The program roots is smarter than quadroots and will give accurate answers here. You can use it to check the results from your smartquadroots function.
What to hand in: Hand in two Matlab function m-files quadroots.m and
smartquadroots.m, and the results you get, together with your comments.