Chapter 1: Errors in Numerical Computations

1. Introduction

In this chapter, we begin with some preliminary mathematical theorems that are invariably useful in the study of numerical analysis. The chapter then presents the various kinds of errors that may occur in a problem and introduces the representation of numbers in computers. General results on the propagation of errors in numerical computation are also given. Finally, the concepts of stability and conditioning of problems and a brief idea of the convergence of numerical methods are introduced.

2. Mathematical preliminaries

Theorem 1.1. (Intermediate value theorem)

Let $f(x)$ be a real-valued continuous function on the finite interval $[a, b]$, and define
$$m = \inf_{a \le x \le b} f(x), \qquad M = \sup_{a \le x \le b} f(x).$$
Then for any number $\mu$ in $[m, M]$, there exists at least one point $\xi$ in $[a, b]$ for which $f(\xi) = \mu$. In particular, there are points $\xi_1$ and $\xi_2$ in $[a, b]$ for which $m = f(\xi_1)$ and $M = f(\xi_2)$.

Theorem 1.2. (Mean value theorem)

Let $f(x)$ be a real-valued continuous function on the finite interval $[a, b]$, differentiable in $(a, b)$. Then there exists at least one point $\xi$ in $(a, b)$ for which
$$f(b) - f(a) = (b - a) f'(\xi).$$


Theorem 1.3. (Integral mean value theorem)

Let $w(x)$ be nonnegative and integrable on $[a, b]$, and let $f(x)$ be continuous on $[a, b]$. Then
$$\int_a^b w(x) f(x)\,dx = f(\xi) \int_a^b w(x)\,dx, \quad \text{for some } \xi \in [a, b].$$

One of the most important and useful tools in numerical analysis for approximating a function $f(x)$ by polynomials is Taylor's theorem and the associated Taylor series. The polynomials obtained from truncated Taylor series are used extensively in numerical analysis.

Theorem 1.4. (Taylor's theorem)

Let $f(x)$ be a real-valued continuous function on the finite interval $[a, b]$ having $n + 1$ continuous derivatives on $[a, b]$ for some $n \ge 0$, and let $x, x_0 \in [a, b]$. Then
$$f(x) = P_n(x) + R_{n+1}(x),$$
where
$$P_n(x) = f(x_0) + \frac{(x - x_0)}{1!} f'(x_0) + \cdots + \frac{(x - x_0)^n}{n!} f^{(n)}(x_0),$$
$$R_{n+1}(x) = \frac{(x - x_0)^{n+1}}{(n+1)!} f^{(n+1)}(\xi), \qquad a < \xi < b.$$

• Taylor's theorem in two dimensions

Let $f(x, y)$ be a function of two independent variables $x$ and $y$, and suppose $f(x, y)$ possesses continuous partial derivatives of order $n$ in some neighbourhood $N$ of a point $(a, b)$ in its domain of definition. Let $(a + h, b + k) \in N$. Then there exists a positive number $\theta$ $(0 < \theta < 1)$ such that
$$f(a + h, b + k) = f(a, b) + \left( h \frac{\partial}{\partial x} + k \frac{\partial}{\partial y} \right) f(a, b) + \frac{1}{2!} \left( h \frac{\partial}{\partial x} + k \frac{\partial}{\partial y} \right)^2 f(a, b) + \cdots + \frac{1}{(n-1)!} \left( h \frac{\partial}{\partial x} + k \frac{\partial}{\partial y} \right)^{n-1} f(a, b) + R_n,$$
where
$$R_n = \frac{1}{n!} \left( h \frac{\partial}{\partial x} + k \frac{\partial}{\partial y} \right)^n f(a + \theta h, b + \theta k), \qquad 0 < \theta < 1.$$
$R_n$ is called the remainder after $n$ terms, and the theorem is called Taylor's theorem with remainder, or Taylor's expansion about the point $(a, b)$.

3. Approximate numbers and significant figures

Significant figures:

A significant figure is any one of the digits 1, 2, 3, …, 9 and 0. In the number 0.00134, the significant figures are 1, 3, 4; the zeros merely fix the decimal point and are therefore not significant. But in the number 0.1204, the significant figures are 1, 2, 0, 4. Similarly, 1.00317 has 6 significant figures.

• Rules of significant figures

Rule 1: All non-zero digits 1,2,…,9 are significant.

Rule 2: Zeros between non-zero digits are significant. For example, in reading the measurement 9.04 cm, the zero represents a measured quantity, just as the 9 and 4 do, and is therefore a significant figure. Similarly, the number 1005 has four significant figures.

Rule 3: Zeros to the left of the first non-zero digit in a number are not significant, e.g., 0.0026. Likewise, in the measurement 0.07 kg, the zeros merely locate the decimal point and are therefore not significant.

Rule 4: When a number ends in zeros that are to the right of the decimal point, those zeros are significant. For example, the number 0.0200 has three significant figures. Similarly, in reading the measurement 11.30 cm, the final zero is an estimate representing a measured quantity and is therefore significant. Thus, zeros to the right of the decimal point and at the end of a number are significant figures.

Rule 5: When a number ends in zeros that are not to the right of the decimal point, those zeros are not necessarily significant. For example, if a distance is reported as 1200 feet, one may assume two significant figures. Reporting measurements in scientific notation removes all doubt, since all digits written in scientific notation are considered significant:

1200 feet written as $1.2 \times 10^3$ feet: two significant figures
1200 feet written as $1.20 \times 10^3$ feet: three significant figures
1200 feet written as $1.200 \times 10^3$ feet: four significant figures

Thus, we may conclude that if a zero represents a measured quantity, it is a significant figure; if it merely locates the decimal point, it is not.

4. Rounding off numbers

Numbers are rounded off so as to cause the least possible error. The general rule for rounding off a number to $n$ significant digits is: discard all digits to the right of the $n$-th place. If the discarded part is less than half a unit in the $n$-th place, leave the $n$-th digit unchanged; if it is greater than half a unit in the $n$-th place, add 1 to the $n$-th digit. If the discarded part is exactly half a unit in the $n$-th place, leave the $n$-th digit unaltered if it is even, but increase it by 1 if it is odd.

When a number is rounded off according to this rule, it is said to be correct to $n$ significant digits.

To illustrate, the following numbers are corrected to 4 significant figures:

27.1345 becomes 27.13

27.8793 becomes 27.88

27.355 becomes 27.36

27.365 becomes 27.36
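
The rule applied in the last two examples is "round half to even" (banker's rounding). Here is a minimal Python sketch using the standard decimal module, whose ROUND_HALF_EVEN mode implements exactly this rule; working in Decimal avoids binary floating-point surprises, since numbers such as 27.355 have no exact binary representation:

```python
# A minimal sketch of the stated rule using Python's standard decimal module.
# ROUND_HALF_EVEN is exactly "round half to even"; Decimal keeps the digits
# exact (round(27.355, 2) on a binary float need not follow the rule).
from decimal import Decimal, ROUND_HALF_EVEN

for s in ["27.1345", "27.8793", "27.355", "27.365"]:
    # Quantizing to two decimal places keeps four significant figures here.
    r = Decimal(s).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
    print(s, "becomes", r)
# 27.1345 becomes 27.13; 27.8793 becomes 27.88;
# 27.355 becomes 27.36;  27.365 becomes 27.36
```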

We now proceed to classify the ways in which error enters a numerical computation, starting with some simple definitions of error.

(i) Absolute error:

Let $x_T$ be the exact or true value of a number and $x_A$ its approximate value. Then $|x_T - x_A|$ is called the absolute error $E_a$:
$$E_a \equiv |x_T - x_A|.$$

(ii) Relative and percentage errors:

The relative error is defined by
$$E_r \equiv \frac{|x_T - x_A|}{|x_T|}, \quad \text{provided } x_T \text{ is neither zero nor close to zero}.$$
The percentage relative error is defined by
$$E_p \equiv E_r \times 100 = \frac{|x_T - x_A|}{|x_T|} \times 100, \quad \text{provided } x_T \text{ is neither zero nor close to zero}.$$
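
These error measures are straightforward to compute. The following small Python sketch implements them; the function names are our own, chosen for illustration:

```python
# A small sketch of the three error measures defined above.
def absolute_error(x_true, x_approx):
    return abs(x_true - x_approx)

def relative_error(x_true, x_approx):
    # Meaningless when the true value is zero or close to zero, as noted above.
    return abs(x_true - x_approx) / abs(x_true)

def percentage_error(x_true, x_approx):
    return 100.0 * relative_error(x_true, x_approx)

print(absolute_error(1/3, 0.3333))    # ~3.33e-05
print(relative_error(1/3, 0.3333))    # ~1.00e-04
print(percentage_error(1/3, 0.3333))  # ~0.01 (percent)
```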

(iii) Inherent error:

The inherent error is the error already present in the statement of the problem before its solution. It arises either from simplifying assumptions in the mathematical formulation of the problem or from errors in the physical measurements of its parameters. Inherent error cannot be completely eliminated, but it can be minimized by selecting better data or by employing high-precision computer arithmetic.

(iv) Round-off and chopping errors:

A number $x$ is said to be rounded correct to a $d$-decimal-digit number $x^{(d)}$ if the error satisfies
$$|x - x^{(d)}| \le \frac{1}{2} \times 10^{-d}. \tag{4.1}$$
The error arising from rounding a number, as defined in eq. (4.1), is known as round-off error.

Let an arbitrarily given real number $x$ have the representation
$$x = .d_1 d_2 \ldots d_k d_{k+1} \ldots \times b^e, \tag{4.2}$$
where $b$ is the base, the digits $d_1 \neq 0$, $d_2, \ldots, d_k$ are integers satisfying $0 \le d_i \le b - 1$, and the exponent $e$ is such that $e_{\min} \le e \le e_{\max}$. The fractional part $.d_1 d_2 \ldots d_k d_{k+1} \ldots$ is called the mantissa; its magnitude is less than 1.

Now, the floating-point number $fl(x)$ in $k$-digit-mantissa standard form can be obtained in the following two ways:

(a) Chopping: we simply discard the digits $d_{k+1}, d_{k+2}, \ldots$ in eq. (4.2) and obtain
$$fl(x) = .d_1 d_2 \ldots d_k \times b^e. \tag{4.3}$$

(b) Rounding: $fl(x)$ is chosen as the normalized floating-point number nearest to $x$, together with the rule of symmetric rounding: if the truncated part is exactly half a unit in the $k$-th position, then the $k$-th digit is rounded up to the next (even) digit if it is odd, and left unchanged if it is even.

Thus, the relative error for the $k$-digit-mantissa standard form representation of $x$ satisfies
$$\frac{|x - fl(x)|}{|x|} \le \begin{cases} b^{1-k}, & \text{for chopping}, \\ \frac{1}{2}\, b^{1-k}, & \text{for rounding}. \end{cases} \tag{4.4}$$
Therefore, the bound on the relative error of a floating-point number is halved when rounding is used instead of chopping. For this reason, rounding is used on most computers.
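
A short Python sketch, again built on the decimal module so the digit manipulation is exact, can illustrate chopping versus rounding to a $k$-digit mantissa and confirm that both relative errors stay within the bounds of eq. (4.4); the helper name fl and its arguments are our own illustrative choices:

```python
# A sketch of k-digit chopping vs. rounding in base b = 10.
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def fl(x: Decimal, k: int, mode) -> Decimal:
    """Reduce x to a k-significant-digit mantissa with the given rule."""
    # adjusted() gives the exponent of the leading digit, so this quantum
    # leaves exactly k significant digits.
    quantum = Decimal(1).scaleb(x.adjusted() - k + 1)
    return x.quantize(quantum, rounding=mode)

x = Decimal("3.14159265")
for name, mode in [("chopping", ROUND_DOWN), ("rounding", ROUND_HALF_EVEN)]:
    v = fl(x, 5, mode)
    rel = abs((x - v) / x)
    print(f"{name}: fl(x) = {v}, relative error = {rel:.2E}")
# Both relative errors lie below b**(1-k) = 1e-4; the rounded one is
# well inside the halved bound 0.5e-4.
```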

Example 1:

Approximate values of $\frac{1}{6}$ and $\frac{1}{13}$, correct to 4 decimal places, are 0.1667 and 0.0769 respectively. Find the possible relative error and absolute error in the sum of 0.1667 and 0.0769.

Solution:

Let $x = 0.1667$ and $y = 0.0769$. The maximum absolute error in each of $x$ and $y$ is
$$\frac{1}{2} \times 10^{-4} = 0.00005.$$

(i) Relative error in $(x + y)_A$:
$$E_r[(x + y)_A] = \frac{|(x + y)_T - (x + y)_A|}{(x + y)_T} \le \frac{E_a(x)}{(x + y)_T} + \frac{E_a(y)}{(x + y)_T} \le \frac{0.00005}{0.1667} + \frac{0.00005}{0.0769} \le 0.000950135.$$

(ii) Absolute error in $(x + y)_A$:
$$E_a[(x + y)_A] = E_r(x + y) \cdot (x + y)_T \le 0.000950135 \times 0.2436 = 0.000231452886.$$
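
A quick Python check of this arithmetic (a sketch reproducing the bound used above, in which each absolute error is divided by its own value rather than by the sum, giving a valid but slightly looser bound):

```python
# Numerical check of Example 1's bound arithmetic.
x, y = 0.1667, 0.0769
ea = 0.5e-4                       # maximum absolute error in each value
# Dividing each error by its own value rather than by (x + y)
# enlarges each term, so the result is still a valid upper bound.
rel_bound = ea / x + ea / y
abs_bound = rel_bound * (x + y)
print(rel_bound)                  # ~0.000950135
print(abs_bound)                  # ~0.000231453
```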

Example 2:

If the number $\pi = 4 \tan^{-1}(1)$ is approximated using 5 significant digits, find the percentage relative error due to (i) chopping, (ii) rounding.

Solution:

(i) The percentage relative error due to chopping is
$$\frac{|\pi - 3.1415|}{\pi} \times 100 = 0.00294926\%.$$

(ii) The percentage relative error due to rounding is
$$\frac{|\pi - 3.1416|}{\pi} \times 100 = 0.000233843\%.$$

From these errors, it is easily observed that rounding reduces the error.
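
These percentages are easy to verify in Python (a minimal sketch):

```python
# Verifying Example 2: 5-significant-digit chopping vs. rounding of pi.
import math

pi = 4 * math.atan(1.0)
for label, approx in [("chopping", 3.1415), ("rounding", 3.1416)]:
    e_p = abs(pi - approx) / pi * 100
    print(f"{label}: {e_p:.6f} %")
# chopping: 0.002949 %   rounding: 0.000234 %
```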

5. Truncation errors

These are the errors due to the approximate formulae used in computations, which are generally based on truncated series. The study of this error is usually associated with the problem of convergence.

For example, suppose that a function $f(x)$ and all its higher-order derivatives with respect to the independent variable $x$ are known at a point, say $x = x_0$. To find the value of the function at a neighbouring point, say $x = x_0 + \Delta x$, one can use the Taylor series expansion of $f(x)$ about $x = x_0$:
$$f(x) = f(x_0) + (x - x_0) f'(x_0) + \frac{(x - x_0)^2}{2!} f''(x_0) + \cdots \tag{5.1}$$
The right-hand side of this equation is an infinite series, and one has to truncate it after some finite number of terms to calculate $f(x_0 + \Delta x)$, whether on a computer or by manual calculation.

If the series is truncated after $n$ terms, this is equivalent to approximating $f(x)$ by a polynomial of degree $n - 1$:
$$f(x) \approx P_{n-1}(x) = f(x_0) + (x - x_0) f'(x_0) + \frac{(x - x_0)^2}{2!} f''(x_0) + \cdots + \frac{(x - x_0)^{n-1}}{(n-1)!} f^{(n-1)}(x_0). \tag{5.2}$$
The truncation error is given by
$$E_T(x) = f(x) - P_{n-1}(x) = \frac{(x - x_0)^n}{n!} f^{(n)}(\xi). \tag{5.3}$$
Now, let
$$M_n(x) = \max_{a \le \xi \le x} \left| f^{(n)}(\xi) \right|. \tag{5.4}$$
Then the bound on the truncation error is given by
$$|E_T(x)| \le \frac{M_n(x)\, |x - x_0|^n}{n!}. \tag{5.5}$$
If $h = x - x_0$, the truncation error $E_T(x)$ is said to be of order $O(h^n)$. Hence, from eq. (5.2) an approximate value of $f(x_0 + \Delta x)$ can be obtained, with the truncation error estimate given in eq. (5.3).

Example 3:

Obtain a second-degree polynomial approximation to
$$f(x) = (1 + x)^{2/3}, \qquad x \in [0, 0.1],$$
using the Taylor series expansion about $x = 0$. Use the expansion to approximate $f(0.04)$ and find a bound on the truncation error.

Solution:

We have
$$f(x) = (1 + x)^{2/3}, \qquad f(0) = 1,$$
$$f'(x) = \frac{2}{3(1 + x)^{1/3}}, \qquad f'(0) = \frac{2}{3},$$
$$f''(x) = -\frac{2}{9(1 + x)^{4/3}}, \qquad f''(0) = -\frac{2}{9},$$
$$f'''(x) = \frac{8}{27(1 + x)^{7/3}}.$$
Thus, the Taylor series expansion with the remainder term is
$$(1 + x)^{2/3} = 1 + \frac{2}{3} x - \frac{x^2}{9} + \frac{4 x^3}{81 (1 + \xi)^{7/3}}, \qquad 0 < \xi < 0.1.$$
Therefore, the truncation error is
$$E_T(x) = (1 + x)^{2/3} - \left( 1 + \frac{2}{3} x - \frac{x^2}{9} \right) = \frac{4 x^3}{81 (1 + \xi)^{7/3}}.$$
The approximate value of $f(0.04)$ is
$$f(0.04) \approx 1 + \frac{2}{3}(0.04) - \frac{(0.04)^2}{9} = 1.026488888889,$$
correct to 12 decimal places. The truncation error bound on $x \in [0, 0.1]$ is
$$|E_T| \le \max_{0 \le x \le 0.1} \frac{4 (0.1)^3}{81 (1 + x)^{7/3}} \le \frac{4}{81} (0.1)^3 = 0.493827 \times 10^{-4}.$$
The exact value of $f(0.04)$, correct to 12 decimal places, is 1.026491977549.
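
The numbers above can be verified with a few lines of Python (a sketch of the quadratic approximation, the exact value, and the error bound):

```python
# Verifying Example 3.
x = 0.04
p2 = 1 + (2/3)*x - x**2/9        # second-degree Taylor polynomial about 0
exact = (1 + x)**(2/3)
bound = 4 * 0.1**3 / 81          # worst case of |E_T(x)| on [0, 0.1]
print(p2)                        # ~1.026488888889
print(exact)                     # ~1.026491977549
print(abs(exact - p2) <= bound)  # True: error ~3.1e-06, bound ~4.9e-05
```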

6. Floating point representation of numbers

A floating-point number is represented in the form
$$\pm .d_1 d_2 \ldots d_k \times b^e,$$
where $b$ is the base, the digits or bits $d_1 \neq 0$, $d_2, \ldots, d_k$ satisfy $0 \le d_i \le b - 1$, $k$ is the number of significant digits or bits, which indicates the precision of the number, and the exponent $e$ is such that $e_{\min} \le e \le e_{\max}$. The fractional part $.d_1 d_2 \ldots d_k$ is called the mantissa or significand; its magnitude is less than 1.

6.1 Computer representation of numbers

Nowadays, numerical computation is usually carried out on digital computers, most of which use the floating-point mode for storing real numbers.

The fundamental unit of data stored in a computer memory is called a computer word. The number of bits a word can hold is called the word length. The word length is fixed for a given computer, although it varies from computer to computer; typical word lengths are 16, 32, 64 or more bits. The largest number that can be stored in a computer depends on the word length. To store a number in floating-point representation, a computer word is divided into three fields: the first consists of one bit, called the sign bit; the next bits represent the exponent; and the remaining bits represent the mantissa. For example, in single-precision floating-point format, a 32-bit word is divided as follows: 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa. The 8-bit exponent field covers the range from −128 to 127 (in IEEE 754 it is stored in biased rather than signed form). In double-precision floating-point format, a 64-bit word is divided as follows: 1 bit for the sign, 11 bits for the exponent and 52 bits for the mantissa.

In normalized floating-point representation, the exponent is so adjusted that the bit $d_1$ immediately after the binary point is always 1. Formally, a nonzero floating-point number is in normalized floating-point form if
$$\frac{1}{b} \le |\text{mantissa}| < 1.$$
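
In Python, math.frexp exposes exactly this normalized form for base $b = 2$: it returns a mantissa $m$ with $1/2 \le |m| < 1$ and an integer exponent. A short sketch:

```python
# math.frexp returns x = m * 2**e with the mantissa m normalized so that
# 1/2 <= |m| < 1, exactly the form described above for base 2.
import math

for x in [0.1, 6.5, 1234.0]:
    m, e = math.frexp(x)
    print(f"{x} = {m} * 2**{e}")
# e.g., 6.5 = 0.8125 * 2**3, with 1/2 <= 0.8125 < 1
```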

The range of exponents that a typical computer can handle is very large. The following table shows the effective range of IEEE (Institute of Electrical and Electronics Engineers) floating-point numbers:

Single precision: $\pm (2 - 2^{-23}) \times 2^{127} \approx \pm 10^{38.53}$
Double precision: $\pm (2 - 2^{-52}) \times 2^{1023} \approx \pm 10^{308.25}$

Table 1: Effective floating-point range

If, in a numerical computation, a number falls outside this range, one of the following cases arises:

(a) Overflow: the number is larger in magnitude than the range specified in Table 1.

(b) Underflow: the number is smaller in magnitude than the range specified in Table 1.

In case of underflow, the number is usually set to zero and the computation continues. In case of overflow, execution typically halts with an error, although on IEEE-conforming hardware the result may instead be set to infinity.
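
A small Python sketch illustrates both cases for IEEE double precision; note that CPython returns inf on float-multiplication overflow rather than halting, which is one common alternative behaviour:

```python
# Overflow and underflow in IEEE double precision.
import sys

big = sys.float_info.max     # ~1.7976931348623157e+308
print(big * 10)              # inf  (overflow)

tiny = sys.float_info.min    # ~2.2250738585072014e-308
print(tiny / 1e300)          # 0.0  (underflow: result set to zero)
```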

7. Propagation of errors

In this section, we consider the effect of arithmetic operations which involve errors. Let, x A

and y A be the approximate numbers used in the calculations. Suppose they are in error with

the true values xT and yT respectively.

Thus we can write xT = x A + ε x and yT = y A + ε y . Now, we examine the propagated error in

some particular cases:

Case 1: (Multiplication)

For the error in the product $x_A y_A$, we have
$$x_T y_T - x_A y_A = x_T y_T - (x_T - \epsilon_x)(y_T - \epsilon_y) = x_T \epsilon_y + y_T \epsilon_x - \epsilon_x \epsilon_y.$$
Thus the relative error in $x_A y_A$ is
$$E_r(x_A y_A) = \frac{|x_T y_T - x_A y_A|}{|x_T y_T|} = \left| \frac{\epsilon_x}{x_T} + \frac{\epsilon_y}{y_T} - \frac{\epsilon_x}{x_T}\,\frac{\epsilon_y}{y_T} \right| \le E_r(x_A) + E_r(y_A), \quad \text{provided } E_r(x_A),\, E_r(y_A) \ll 1.$$

Case 2: (Division)

Proceeding with the same argument as for multiplication, we get
$$E_r(x_A / y_A) \le E_r(x_A) + E_r(y_A), \quad \text{provided } E_r(y_A) \ll 1.$$

Case 3: (Addition and subtraction)

For addition and subtraction, we have
$$(x_T \pm y_T) - (x_A \pm y_A) = (x_T - x_A) \pm (y_T - y_A) = \epsilon_x \pm \epsilon_y.$$
Thus the absolute error in $(x_A \pm y_A)$ satisfies
$$E_a(x_A \pm y_A) \le E_a(x_A) + E_a(y_A).$$

Notes:

(i) The relative error in a product is bounded by the sum of the relative errors of the multiplicands, and the relative error in a quotient is bounded by the sum of the relative errors of the dividend and divisor. Relative errors therefore do not propagate rapidly under multiplication or division.

(ii) The absolute error in the sum or difference of two numbers is bounded by the sum of the absolute values of the errors in the corresponding numbers.
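
The following Python sketch checks both notes on one arbitrarily chosen pair of values (the numbers are ours, purely for illustration):

```python
# Checking the propagation bounds on one concrete pair.
x_t, y_t = 2.0, 3.0              # "true" values, for illustration only
x_a, y_a = 1.998, 3.004          # approximations carrying errors
rel = lambda t, a: abs(t - a) / abs(t)

# (i) Product: relative error within the sum of the relative errors.
print(rel(x_t * y_t, x_a * y_a) <= rel(x_t, x_a) + rel(y_t, y_a))  # True

# (ii) Sum: absolute error within the sum of the absolute errors.
print(abs((x_t + y_t) - (x_a + y_a))
      <= abs(x_t - x_a) + abs(y_t - y_a))                          # True
```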

8. The general formula for errors

Let us consider the differentiable function
$$u = f(x_1, x_2, \ldots, x_n) \tag{8.1}$$
of several independent variables $x_1, x_2, \ldots, x_n$. Suppose that $\Delta x_i$ represents the error in each $x_i$, so that the perturbed value of $u$ is
$$u + \Delta u = f(x_1 + \Delta x_1, x_2 + \Delta x_2, \ldots, x_n + \Delta x_n). \tag{8.2}$$
Taylor series expansion of the right-hand side of eq. (8.2) gives
$$u + \Delta u = f(x_1, x_2, \ldots, x_n) + \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \Delta x_i + O(\Delta x_i^2). \tag{8.3}$$
If we assume that the errors $\Delta x_1, \Delta x_2, \ldots, \Delta x_n$ are relatively very small, we can neglect the second and higher powers of $\Delta x_i$. Thus, from eq. (8.3), we get
$$\Delta u \cong \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \Delta x_i = \frac{\partial f}{\partial x_1} \Delta x_1 + \frac{\partial f}{\partial x_2} \Delta x_2 + \cdots + \frac{\partial f}{\partial x_n} \Delta x_n. \tag{8.4}$$
This is the general formula for computing the error of a function $u = f(x_1, x_2, \ldots, x_n)$. The relative error $E_r$ is then given by
$$E_r = \frac{\Delta u}{u} \cong \frac{\partial f}{\partial x_1} \frac{\Delta x_1}{f} + \frac{\partial f}{\partial x_2} \frac{\Delta x_2}{f} + \cdots + \frac{\partial f}{\partial x_n} \frac{\Delta x_n}{f}.$$

Example 4:

If $u = xyz^3 + \frac{3}{2} x^2 y^3$ and the errors in $x$, $y$, $z$ are 0.005, 0.001, 0.005 respectively at $x = 2$, $y = 1$, $z = 1$, compute the maximum absolute and relative errors in evaluating $u$.

Solution:

Let
$$u = f(x, y, z) = xyz^3 + \frac{3}{2} x^2 y^3.$$
We have
$$\frac{\partial f}{\partial x} = yz^3 + 3xy^3, \qquad \frac{\partial f}{\partial y} = xz^3 + \frac{9}{2} x^2 y^2, \qquad \frac{\partial f}{\partial z} = 3xyz^2.$$
From eq. (8.4), we get
$$\Delta u \cong \frac{\partial f}{\partial x} \Delta x + \frac{\partial f}{\partial y} \Delta y + \frac{\partial f}{\partial z} \Delta z = (yz^3 + 3xy^3) \Delta x + \left( xz^3 + \frac{9}{2} x^2 y^2 \right) \Delta y + 3xyz^2 \Delta z.$$
Given $x = 2$, $y = 1$, $z = 1$, $\Delta x = 0.005$, $\Delta y = 0.001$, $\Delta z = 0.005$, we obtain
$$|\Delta u| \le \left| yz^3 + 3xy^3 \right| |\Delta x| + \left| xz^3 + \frac{9}{2} x^2 y^2 \right| |\Delta y| + \left| 3xyz^2 \right| |\Delta z| = 7 \times 0.005 + 20 \times 0.001 + 6 \times 0.005 = 0.085.$$
Hence the maximum absolute error in $u$ is 0.085. The maximum relative error in $u$ is
$$(E_r)_{\max} = \frac{|\Delta u|_{\max}}{u} \approx \frac{0.085}{8} = 0.010625.$$
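
This calculation is easy to check in Python with hand-coded partial derivatives (a sketch):

```python
# Verifying Example 4.
def f(x, y, z):
    return x * y * z**3 + 1.5 * x**2 * y**3

x, y, z = 2.0, 1.0, 1.0
dx, dy, dz = 0.005, 0.001, 0.005

f_x = y * z**3 + 3 * x * y**3        # df/dx = 7
f_y = x * z**3 + 4.5 * x**2 * y**2   # df/dy = 20
f_z = 3 * x * y * z**2               # df/dz = 6

du_max = abs(f_x) * dx + abs(f_y) * dy + abs(f_z) * dz
print(du_max)                # 0.085    (maximum absolute error)
print(du_max / f(x, y, z))   # 0.010625 (maximum relative error, u = 8)
```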
