Chapter 1: Errors in Numerical Computations

1. Introduction

In this chapter, we begin with some preliminary mathematical theorems that are invariably useful in the study of numerical analysis. The chapter then presents the various kinds of errors that may occur in a problem and introduces the representation of numbers in computers. General results on the propagation of errors in numerical computation are also given. Finally, the concepts of stability and conditioning of problems and a brief idea of the convergence of numerical methods are introduced.

2. Mathematical preliminaries

Theorem 1.1. (Intermediate value theorem)

Let $f(x)$ be a real-valued continuous function on the finite interval $[a, b]$, and define
$$m = \inf_{a \le x \le b} f(x), \qquad M = \sup_{a \le x \le b} f(x).$$
Then for any number $\mu$ in $[m, M]$, there exists at least one point $\xi$ in $[a, b]$ for which $f(\xi) = \mu$. In particular, there are points $\xi_1$ and $\xi_2$ in $[a, b]$ for which $m = f(\xi_1)$ and $M = f(\xi_2)$.

Theorem 1.2. (Mean value theorem)

Let $f(x)$ be a real-valued continuous function on the finite interval $[a, b]$, differentiable in $(a, b)$. Then there exists at least one point $\xi$ in $(a, b)$ for which
$$f(b) - f(a) = (b - a) f'(\xi).$$


Theorem 1.3. (Integral mean value theorem)

Let $w(x)$ be nonnegative and integrable on $[a, b]$, and let $f(x)$ be continuous on $[a, b]$. Then
$$\int_a^b w(x) f(x)\,dx = f(\xi) \int_a^b w(x)\,dx, \quad \text{for some } \xi \in [a, b].$$

One of the most important and useful tools in numerical analysis for approximating a function $f(x)$ by polynomials is Taylor's theorem and the associated Taylor series. The polynomials obtained from truncated Taylor series are used extensively in numerical analysis.

Theorem 1.4. (Taylor's theorem)

Let $f(x)$ be a real-valued continuous function on the finite interval $[a, b]$ having $n + 1$ continuous derivatives on $[a, b]$ for some $n \ge 0$, and let $x, x_0 \in [a, b]$. Then
$$f(x) = P_n(x) + R_{n+1}(x),$$
where
$$P_n(x) = f(x_0) + \frac{(x - x_0)}{1!} f'(x_0) + \cdots + \frac{(x - x_0)^n}{n!} f^{(n)}(x_0),$$
$$R_{n+1}(x) = \frac{(x - x_0)^{n+1}}{(n+1)!} f^{(n+1)}(\xi), \qquad a < \xi < b.$$

• Taylor's theorem in two dimensions

Let $f(x, y)$ be a function of two independent variables $x$ and $y$, and suppose $f(x, y)$ possesses continuous partial derivatives of order $n$ in some neighbourhood $N$ of a point $(a, b)$ in its domain of definition. Let $(a + h, b + k) \in N$. Then there exists a positive number $\theta$ $(0 < \theta < 1)$ such that
$$f(a + h, b + k) = f(a, b) + \left( h \frac{\partial}{\partial x} + k \frac{\partial}{\partial y} \right) f(a, b) + \frac{1}{2!} \left( h \frac{\partial}{\partial x} + k \frac{\partial}{\partial y} \right)^2 f(a, b) + \cdots + \frac{1}{(n-1)!} \left( h \frac{\partial}{\partial x} + k \frac{\partial}{\partial y} \right)^{n-1} f(a, b) + R_n,$$
where
$$R_n = \frac{1}{n!} \left( h \frac{\partial}{\partial x} + k \frac{\partial}{\partial y} \right)^n f(a + \theta h, b + \theta k), \qquad 0 < \theta < 1.$$
$R_n$ is called the remainder after $n$ terms, and the theorem is called Taylor's theorem with remainder, or Taylor's expansion about the point $(a, b)$.

3. Approximate numbers and significant figures

Significant figures:

A significant figure is any one of the digits 1, 2, 3, …, 9 and 0. In the number 0.00134, the significant figures are 1, 3, 4; the zeros merely fix the decimal point and are therefore not significant. But in the number 0.1204, the significant figures are 1, 2, 0, 4. Similarly, 1.00317 has 6 significant figures.

• Rules of significant figures

Rule 1: All non-zero digits 1,2,…,9 are significant.

Rule 2: Zeros between non-zero digits are significant. For example, in reading the measurement 9.04 cm, the zero represents a measured quantity, just as the 9 and 4 do, and is therefore a significant figure. Similarly, the number 1005 has four significant figures.

Rule 3: Zeros to the left of the first non-zero digit in a number are not significant, e.g., 0.0026. Likewise, in the measurement 0.07 kg, the zeros merely locate the decimal point and are therefore not significant.

Rule 4: When a number ends in zeros that are to the right of the decimal point, those zeros are significant. For example, the number 0.0200 has three significant figures. Similarly, in reading the measurement 11.30 cm, the final zero is an estimate representing a measured quantity and is therefore significant. Thus, zeros to the right of the decimal point and at the end of a number are significant figures.

Rule 5: When a number ends in zeros that are not to the right of the decimal point, those zeros are not necessarily significant. For example, if a distance is reported as 1200 feet, one may assume two significant figures. Reporting measurements in scientific notation removes all doubt, since all digits written in scientific notation are considered significant:

1200 feet written as $1.2 \times 10^3$ feet: two significant figures
1200 feet written as $1.20 \times 10^3$ feet: three significant figures
1200 feet written as $1.200 \times 10^3$ feet: four significant figures

Thus, we may conclude that if a zero represents a measured quantity, it is a significant figure; if it merely locates the decimal point, it is not.

4. Rounding off numbers

Numbers are rounded off so as to cause the least possible error. The general rule for rounding off a number to $n$ significant digits is: discard all digits to the right of the $n$-th place. If the discarded part is less than half a unit in the $n$-th place, leave the $n$-th digit unchanged; if it is greater than half a unit in the $n$-th place, add 1 to the $n$-th digit. If the discarded part is exactly half a unit in the $n$-th place, leave the $n$-th digit unaltered if it is even, but increase it by 1 if it is odd.

When a number is rounded off according to this rule, it is said to be correct to $n$ significant digits.

To illustrate, the following numbers are corrected to 4 significant figures:

27.1345 becomes 27.13

27.8793 becomes 27.88

27.355 becomes 27.36

27.365 becomes 27.36
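
The rule applied in the last two examples is "round half to even" (banker's rounding). Here is a minimal Python sketch using the standard decimal module, whose ROUND_HALF_EVEN mode implements exactly this rule; working in Decimal avoids binary floating-point surprises, since numbers such as 27.355 have no exact binary representation:

```python
# A minimal sketch of the stated rule using Python's standard decimal module.
# ROUND_HALF_EVEN is exactly "round half to even"; Decimal keeps the digits
# exact (round(27.355, 2) on a binary float need not follow the rule).
from decimal import Decimal, ROUND_HALF_EVEN

for s in ["27.1345", "27.8793", "27.355", "27.365"]:
    # Quantizing to two decimal places keeps four significant figures here.
    r = Decimal(s).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
    print(s, "becomes", r)
# 27.1345 becomes 27.13; 27.8793 becomes 27.88;
# 27.355 becomes 27.36;  27.365 becomes 27.36
```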

We now proceed to classify the ways in which error enters a numerical computation, starting with some simple definitions of error.

(i) Absolute error:

Let $x_T$ be the exact or true value of a number and $x_A$ its approximate value. Then $|x_T - x_A|$ is called the absolute error $E_a$:
$$E_a \equiv |x_T - x_A|.$$

(ii) Relative and percentage errors:

The relative error is defined by
$$E_r \equiv \frac{|x_T - x_A|}{|x_T|}, \quad \text{provided } x_T \text{ is neither zero nor close to zero}.$$
The percentage relative error is defined by
$$E_p \equiv E_r \times 100 = \frac{|x_T - x_A|}{|x_T|} \times 100, \quad \text{provided } x_T \text{ is neither zero nor close to zero}.$$
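
These error measures are straightforward to compute. The following small Python sketch implements them; the function names are our own, chosen for illustration:

```python
# A small sketch of the three error measures defined above.
def absolute_error(x_true, x_approx):
    return abs(x_true - x_approx)

def relative_error(x_true, x_approx):
    # Meaningless when the true value is zero or close to zero, as noted above.
    return abs(x_true - x_approx) / abs(x_true)

def percentage_error(x_true, x_approx):
    return 100.0 * relative_error(x_true, x_approx)

print(absolute_error(1/3, 0.3333))    # ~3.33e-05
print(relative_error(1/3, 0.3333))    # ~1.00e-04
print(percentage_error(1/3, 0.3333))  # ~0.01 (percent)
```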

(iii) Inherent error:

The inherent error is the error already present in the statement of the problem before its solution. It arises either from simplifying assumptions in the mathematical formulation of the problem or from errors in the physical measurements of its parameters. Inherent error cannot be completely eliminated, but it can be minimized by selecting better data or by employing high-precision computer arithmetic.

(iv) Round-off and chopping errors:

A number $x$ is said to be rounded correct to a $d$-decimal-digit number $x^{(d)}$ if the error satisfies
$$|x - x^{(d)}| \le \frac{1}{2} \times 10^{-d}. \tag{4.1}$$
The error arising from rounding a number, as defined in eq. (4.1), is known as round-off error.

Let an arbitrarily given real number $x$ have the representation
$$x = .d_1 d_2 \ldots d_k d_{k+1} \ldots \times b^e, \tag{4.2}$$
where $b$ is the base, the digits $d_1 \neq 0$, $d_2, \ldots, d_k$ are integers satisfying $0 \le d_i \le b - 1$, and the exponent $e$ is such that $e_{\min} \le e \le e_{\max}$. The fractional part $.d_1 d_2 \ldots d_k d_{k+1} \ldots$ is called the mantissa; its magnitude is less than 1.

Now, the floating-point number $fl(x)$ in $k$-digit-mantissa standard form can be obtained in the following two ways:

(a) Chopping: we simply discard the digits $d_{k+1}, d_{k+2}, \ldots$ in eq. (4.2) and obtain
$$fl(x) = .d_1 d_2 \ldots d_k \times b^e. \tag{4.3}$$

(b) Rounding: $fl(x)$ is chosen as the normalized floating-point number nearest to $x$, together with the rule of symmetric rounding: if the truncated part is exactly half a unit in the $k$-th position, then the $k$-th digit is rounded up to the next (even) digit if it is odd, and left unchanged if it is even.

Thus, the relative error for the $k$-digit-mantissa standard form representation of $x$ satisfies
$$\frac{|x - fl(x)|}{|x|} \le \begin{cases} b^{1-k}, & \text{for chopping}, \\ \frac{1}{2}\, b^{1-k}, & \text{for rounding}. \end{cases} \tag{4.4}$$
Therefore, the bound on the relative error of a floating-point number is halved when rounding is used instead of chopping. For this reason, rounding is used on most computers.
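
A short Python sketch, again built on the decimal module so the digit manipulation is exact, can illustrate chopping versus rounding to a $k$-digit mantissa and confirm that both relative errors stay within the bounds of eq. (4.4); the helper name fl and its arguments are our own illustrative choices:

```python
# A sketch of k-digit chopping vs. rounding in base b = 10.
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def fl(x: Decimal, k: int, mode) -> Decimal:
    """Reduce x to a k-significant-digit mantissa with the given rule."""
    # adjusted() gives the exponent of the leading digit, so this quantum
    # leaves exactly k significant digits.
    quantum = Decimal(1).scaleb(x.adjusted() - k + 1)
    return x.quantize(quantum, rounding=mode)

x = Decimal("3.14159265")
for name, mode in [("chopping", ROUND_DOWN), ("rounding", ROUND_HALF_EVEN)]:
    v = fl(x, 5, mode)
    rel = abs((x - v) / x)
    print(f"{name}: fl(x) = {v}, relative error = {rel:.2E}")
# Both relative errors lie below b**(1-k) = 1e-4; the rounded one is
# well inside the halved bound 0.5e-4.
```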

Example 1:

Approximate values of $\frac{1}{6}$ and $\frac{1}{13}$, correct to 4 decimal places, are 0.1667 and 0.0769 respectively. Find the possible relative error and absolute error in the sum of 0.1667 and 0.0769.

Solution:

Let $x = 0.1667$ and $y = 0.0769$. The maximum absolute error in each of $x$ and $y$ is
$$\frac{1}{2} \times 10^{-4} = 0.00005.$$

(i) Relative error in $(x + y)_A$:
$$E_r[(x + y)_A] = \frac{|(x + y)_T - (x + y)_A|}{(x + y)_T} \le \frac{E_a(x)}{(x + y)_T} + \frac{E_a(y)}{(x + y)_T} \le \frac{0.00005}{0.1667} + \frac{0.00005}{0.0769} \le 0.000950135.$$

(ii) Absolute error in $(x + y)_A$:
$$E_a[(x + y)_A] = E_r(x + y) \cdot (x + y)_T \le 0.000950135 \times 0.2436 = 0.000231452886.$$
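
A quick Python check of this arithmetic (a sketch reproducing the bound used above, in which each absolute error is divided by its own value rather than by the sum, giving a valid but slightly looser bound):

```python
# Numerical check of Example 1's bound arithmetic.
x, y = 0.1667, 0.0769
ea = 0.5e-4                       # maximum absolute error in each value
# Dividing each error by its own value rather than by (x + y)
# enlarges each term, so the result is still a valid upper bound.
rel_bound = ea / x + ea / y
abs_bound = rel_bound * (x + y)
print(rel_bound)                  # ~0.000950135
print(abs_bound)                  # ~0.000231453
```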

Example 2:

If the number $\pi = 4 \tan^{-1}(1)$ is approximated using 5 significant digits, find the percentage relative error due to (i) chopping, (ii) rounding.

Solution:

(i) The percentage relative error due to chopping is
$$\frac{|\pi - 3.1415|}{\pi} \times 100 = 0.00294926\%.$$

(ii) The percentage relative error due to rounding is
$$\frac{|\pi - 3.1416|}{\pi} \times 100 = 0.000233843\%.$$

From these errors, it is easily observed that rounding reduces the error.
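
These percentages are easy to verify in Python (a minimal sketch):

```python
# Verifying Example 2: 5-significant-digit chopping vs. rounding of pi.
import math

pi = 4 * math.atan(1.0)
for label, approx in [("chopping", 3.1415), ("rounding", 3.1416)]:
    e_p = abs(pi - approx) / pi * 100
    print(f"{label}: {e_p:.6f} %")
# chopping: 0.002949 %   rounding: 0.000234 %
```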

5. Truncation errors

These are the errors due to the approximate formulae used in computations, which are generally based on truncated series. The study of this error is usually associated with the problem of convergence.

For example, suppose that a function $f(x)$ and all its higher-order derivatives with respect to the independent variable $x$ are known at a point, say $x = x_0$. To find the value of the function at a neighbouring point, say $x = x_0 + \Delta x$, one can use the Taylor series expansion of $f(x)$ about $x = x_0$:
$$f(x) = f(x_0) + (x - x_0) f'(x_0) + \frac{(x - x_0)^2}{2!} f''(x_0) + \cdots \tag{5.1}$$
The right-hand side of this equation is an infinite series, and one has to truncate it after some finite number of terms to calculate $f(x_0 + \Delta x)$, whether on a computer or by manual calculation.

If the series is truncated after $n$ terms, this is equivalent to approximating $f(x)$ by a polynomial of degree $n - 1$:
$$f(x) \approx P_{n-1}(x) = f(x_0) + (x - x_0) f'(x_0) + \frac{(x - x_0)^2}{2!} f''(x_0) + \cdots + \frac{(x - x_0)^{n-1}}{(n-1)!} f^{(n-1)}(x_0). \tag{5.2}$$
The truncation error is given by
$$E_T(x) = f(x) - P_{n-1}(x) = \frac{(x - x_0)^n}{n!} f^{(n)}(\xi). \tag{5.3}$$
Now, let
$$M_n(x) = \max_{a \le \xi \le x} \left| f^{(n)}(\xi) \right|. \tag{5.4}$$
Then the bound on the truncation error is given by
$$|E_T(x)| \le \frac{M_n(x)\, |x - x_0|^n}{n!}. \tag{5.5}$$
If $h = x - x_0$, the truncation error $E_T(x)$ is said to be of order $O(h^n)$. Hence, from eq. (5.2) an approximate value of $f(x_0 + \Delta x)$ can be obtained, with the truncation error estimate given in eq. (5.3).

Example 3:

Obtain a second-degree polynomial approximation to
$$f(x) = (1 + x)^{2/3}, \qquad x \in [0, 0.1],$$
using the Taylor series expansion about $x = 0$. Use the expansion to approximate $f(0.04)$ and find a bound on the truncation error.

Solution:

We have
$$f(x) = (1 + x)^{2/3}, \qquad f(0) = 1,$$
$$f'(x) = \frac{2}{3(1 + x)^{1/3}}, \qquad f'(0) = \frac{2}{3},$$
$$f''(x) = -\frac{2}{9(1 + x)^{4/3}}, \qquad f''(0) = -\frac{2}{9},$$
$$f'''(x) = \frac{8}{27(1 + x)^{7/3}}.$$
Thus, the Taylor series expansion with the remainder term is
$$(1 + x)^{2/3} = 1 + \frac{2}{3} x - \frac{x^2}{9} + \frac{4 x^3}{81 (1 + \xi)^{7/3}}, \qquad 0 < \xi < 0.1.$$
Therefore, the truncation error is
$$E_T(x) = (1 + x)^{2/3} - \left( 1 + \frac{2}{3} x - \frac{x^2}{9} \right) = \frac{4 x^3}{81 (1 + \xi)^{7/3}}.$$
The approximate value of $f(0.04)$ is
$$f(0.04) \approx 1 + \frac{2}{3}(0.04) - \frac{(0.04)^2}{9} = 1.026488888889,$$
correct to 12 decimal places. The truncation error bound on $x \in [0, 0.1]$ is
$$|E_T| \le \max_{0 \le x \le 0.1} \frac{4 (0.1)^3}{81 (1 + x)^{7/3}} \le \frac{4}{81} (0.1)^3 = 0.493827 \times 10^{-4}.$$
The exact value of $f(0.04)$, correct to 12 decimal places, is 1.026491977549.
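
The numbers above can be verified with a few lines of Python (a sketch of the quadratic approximation, the exact value, and the error bound):

```python
# Verifying Example 3.
x = 0.04
p2 = 1 + (2/3)*x - x**2/9        # second-degree Taylor polynomial about 0
exact = (1 + x)**(2/3)
bound = 4 * 0.1**3 / 81          # worst case of |E_T(x)| on [0, 0.1]
print(p2)                        # ~1.026488888889
print(exact)                     # ~1.026491977549
print(abs(exact - p2) <= bound)  # True: error ~3.1e-06, bound ~4.9e-05
```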

6. Floating point representation of numbers

A floating-point number is represented in the form
$$\pm .d_1 d_2 \ldots d_k \times b^e,$$
where $b$ is the base, the digits or bits $d_1 \neq 0$, $d_2, \ldots, d_k$ satisfy $0 \le d_i \le b - 1$, $k$ is the number of significant digits or bits, which indicates the precision of the number, and the exponent $e$ is such that $e_{\min} \le e \le e_{\max}$. The fractional part $.d_1 d_2 \ldots d_k$ is called the mantissa or significand; its magnitude is less than 1.

6.1 Computer representation of numbers

Nowadays, numerical computation is usually carried out on digital computers, most of which use the floating-point mode for storing real numbers.

The fundamental unit of data stored in a computer memory is called a computer word. The number of bits a word can hold is called the word length. The word length is fixed for a given computer, although it varies from computer to computer; typical word lengths are 16, 32, 64 or more bits. The largest number that can be stored in a computer depends on the word length. To store a number in floating-point representation, a computer word is divided into three fields: the first consists of one bit, called the sign bit; the next bits represent the exponent; and the remaining bits represent the mantissa. For example, in single-precision floating-point format, a 32-bit word is divided as follows: 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa. The 8-bit exponent field covers the range from −128 to 127 (in IEEE 754 it is stored in biased rather than signed form). In double-precision floating-point format, a 64-bit word is divided as follows: 1 bit for the sign, 11 bits for the exponent and 52 bits for the mantissa.

In normalized floating-point representation, the exponent is so adjusted that the bit $d_1$ immediately after the binary point is always 1. Formally, a nonzero floating-point number is in normalized floating-point form if
$$\frac{1}{b} \le |\text{mantissa}| < 1.$$
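
In Python, math.frexp exposes exactly this normalized form for base $b = 2$: it returns a mantissa $m$ with $1/2 \le |m| < 1$ and an integer exponent. A short sketch:

```python
# math.frexp returns x = m * 2**e with the mantissa m normalized so that
# 1/2 <= |m| < 1, exactly the form described above for base 2.
import math

for x in [0.1, 6.5, 1234.0]:
    m, e = math.frexp(x)
    print(f"{x} = {m} * 2**{e}")
# e.g., 6.5 = 0.8125 * 2**3, with 1/2 <= 0.8125 < 1
```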

The range of exponents that a typical computer can handle is very large. The following table shows the effective range of IEEE (Institute of Electrical and Electronics Engineers) floating-point numbers:

Single precision: $\pm (2 - 2^{-23}) \times 2^{127} \approx \pm 10^{38.53}$
Double precision: $\pm (2 - 2^{-52}) \times 2^{1023} \approx \pm 10^{308.25}$

Table 1: Effective floating-point range

If, in a numerical computation, a number falls outside this range, one of the following cases arises:

(a) Overflow: the number is larger in magnitude than the range specified in Table 1.

(b) Underflow: the number is smaller in magnitude than the range specified in Table 1.

In case of underflow, the number is usually set to zero and the computation continues. In case of overflow, execution typically halts with an error, although on IEEE-conforming hardware the result may instead be set to infinity.
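
A small Python sketch illustrates both cases for IEEE double precision; note that CPython returns inf on float-multiplication overflow rather than halting, which is one common alternative behaviour:

```python
# Overflow and underflow in IEEE double precision.
import sys

big = sys.float_info.max     # ~1.7976931348623157e+308
print(big * 10)              # inf  (overflow)

tiny = sys.float_info.min    # ~2.2250738585072014e-308
print(tiny / 1e300)          # 0.0  (underflow: result set to zero)
```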

7. Propagation of errors

In this section, we consider the effect of arithmetic operations which involve errors. Let, x A

and y A be the approximate numbers used in the calculations. Suppose they are in error with

the true values xT and yT respectively.

Thus we can write xT = x A + ε x and yT = y A + ε y . Now, we examine the propagated error in

some particular cases:

Case 1: (Multiplication)

For the error in the product $x_A y_A$, we have
$$x_T y_T - x_A y_A = x_T y_T - (x_T - \epsilon_x)(y_T - \epsilon_y) = x_T \epsilon_y + y_T \epsilon_x - \epsilon_x \epsilon_y.$$
Thus the relative error in $x_A y_A$ is
$$E_r(x_A y_A) = \frac{|x_T y_T - x_A y_A|}{|x_T y_T|} = \left| \frac{\epsilon_x}{x_T} + \frac{\epsilon_y}{y_T} - \frac{\epsilon_x}{x_T}\,\frac{\epsilon_y}{y_T} \right| \le E_r(x_A) + E_r(y_A), \quad \text{provided } E_r(x_A),\, E_r(y_A) \ll 1.$$

Case 2: (Division)

Proceeding with the same argument as for multiplication, we get
$$E_r(x_A / y_A) \le E_r(x_A) + E_r(y_A), \quad \text{provided } E_r(y_A) \ll 1.$$

Case 3: (Addition and subtraction)

For addition and subtraction, we have
$$(x_T \pm y_T) - (x_A \pm y_A) = (x_T - x_A) \pm (y_T - y_A) = \epsilon_x \pm \epsilon_y.$$
Thus the absolute error in $(x_A \pm y_A)$ satisfies
$$E_a(x_A \pm y_A) \le E_a(x_A) + E_a(y_A).$$

Notes:

(i) The relative error in a product is bounded by the sum of the relative errors of the multiplicands, and the relative error in a quotient is bounded by the sum of the relative errors of the dividend and divisor. Relative errors therefore do not propagate rapidly under multiplication or division.

(ii) The absolute error in the sum or difference of two numbers is bounded by the sum of the absolute values of the errors in the corresponding numbers.
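
The following Python sketch checks both notes on one arbitrarily chosen pair of values (the numbers are ours, purely for illustration):

```python
# Checking the propagation bounds on one concrete pair.
x_t, y_t = 2.0, 3.0              # "true" values, for illustration only
x_a, y_a = 1.998, 3.004          # approximations carrying errors
rel = lambda t, a: abs(t - a) / abs(t)

# (i) Product: relative error within the sum of the relative errors.
print(rel(x_t * y_t, x_a * y_a) <= rel(x_t, x_a) + rel(y_t, y_a))  # True

# (ii) Sum: absolute error within the sum of the absolute errors.
print(abs((x_t + y_t) - (x_a + y_a))
      <= abs(x_t - x_a) + abs(y_t - y_a))                          # True
```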

8. The general formula for errors

Let us consider the differentiable function
$$u = f(x_1, x_2, \ldots, x_n) \tag{8.1}$$
of several independent variables $x_1, x_2, \ldots, x_n$. Suppose that $\Delta x_i$ represents the error in each $x_i$, so that the perturbed value of $u$ is
$$u + \Delta u = f(x_1 + \Delta x_1, x_2 + \Delta x_2, \ldots, x_n + \Delta x_n). \tag{8.2}$$
Taylor series expansion of the right-hand side of eq. (8.2) gives
$$u + \Delta u = f(x_1, x_2, \ldots, x_n) + \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \Delta x_i + O(\Delta x_i^2). \tag{8.3}$$
If we assume that the errors $\Delta x_1, \Delta x_2, \ldots, \Delta x_n$ are relatively very small, we can neglect the second and higher powers of $\Delta x_i$. Thus, from eq. (8.3), we get
$$\Delta u \cong \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \Delta x_i = \frac{\partial f}{\partial x_1} \Delta x_1 + \frac{\partial f}{\partial x_2} \Delta x_2 + \cdots + \frac{\partial f}{\partial x_n} \Delta x_n. \tag{8.4}$$
This is the general formula for computing the error of a function $u = f(x_1, x_2, \ldots, x_n)$. The relative error $E_r$ is then given by
$$E_r = \frac{\Delta u}{u} \cong \frac{\partial f}{\partial x_1} \frac{\Delta x_1}{f} + \frac{\partial f}{\partial x_2} \frac{\Delta x_2}{f} + \cdots + \frac{\partial f}{\partial x_n} \frac{\Delta x_n}{f}.$$

Example 4:

If $u = xyz^3 + \frac{3}{2} x^2 y^3$ and the errors in $x$, $y$, $z$ are 0.005, 0.001, 0.005 respectively at $x = 2$, $y = 1$, $z = 1$, compute the maximum absolute and relative errors in evaluating $u$.

Solution:

Let
$$u = f(x, y, z) = xyz^3 + \frac{3}{2} x^2 y^3.$$
We have
$$\frac{\partial f}{\partial x} = yz^3 + 3xy^3, \qquad \frac{\partial f}{\partial y} = xz^3 + \frac{9}{2} x^2 y^2, \qquad \frac{\partial f}{\partial z} = 3xyz^2.$$
From eq. (8.4), we get
$$\Delta u \cong \frac{\partial f}{\partial x} \Delta x + \frac{\partial f}{\partial y} \Delta y + \frac{\partial f}{\partial z} \Delta z = (yz^3 + 3xy^3) \Delta x + \left( xz^3 + \frac{9}{2} x^2 y^2 \right) \Delta y + 3xyz^2 \Delta z.$$
Given $x = 2$, $y = 1$, $z = 1$, $\Delta x = 0.005$, $\Delta y = 0.001$, $\Delta z = 0.005$, we obtain
$$|\Delta u| \le \left| yz^3 + 3xy^3 \right| |\Delta x| + \left| xz^3 + \frac{9}{2} x^2 y^2 \right| |\Delta y| + \left| 3xyz^2 \right| |\Delta z| = 7 \times 0.005 + 20 \times 0.001 + 6 \times 0.005 = 0.085.$$
Hence the maximum absolute error in $u$ is 0.085. The maximum relative error in $u$ is
$$(E_r)_{\max} = \frac{|\Delta u|_{\max}}{u} \approx \frac{0.085}{8} = 0.010625.$$
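
This calculation is easy to check in Python with hand-coded partial derivatives (a sketch):

```python
# Verifying Example 4.
def f(x, y, z):
    return x * y * z**3 + 1.5 * x**2 * y**3

x, y, z = 2.0, 1.0, 1.0
dx, dy, dz = 0.005, 0.001, 0.005

f_x = y * z**3 + 3 * x * y**3        # df/dx = 7
f_y = x * z**3 + 4.5 * x**2 * y**2   # df/dy = 20
f_z = 3 * x * y * z**2               # df/dz = 6

du_max = abs(f_x) * dx + abs(f_y) * dy + abs(f_z) * dz
print(du_max)                # 0.085    (maximum absolute error)
print(du_max / f(x, y, z))   # 0.010625 (maximum relative error, u = 8)
```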
