1st Machine Numbers Computer Arithmetic and Errors

Machine Numbers
MAT307 Numerical Analysis Errors and Computer Arithmetics

Stability
MAT307 Numerical Analysis
Machine Numbers, Computer Arithmetics and Errors
Dr. Mustafa Ağgül
Dr. Mustafa Ağgül Hacettepe University

Machine Numbers
Stability
Numerical analysis is concerned with solving problems

numerically; developing a sequence of numerical computations to
obtain a satisfactory result.
A significant part of this process is to figure out the errors that

arise in these computations. These errors are due to arithmetic
operations or other sources.

Machine Numbers
Stability
Definition (Decimal Floating-point representation)

Let us consider s 6= 0 written in decimal system. Then it can be written
uniquely as
s = σ · d · 10n
where
I σ = ±1 is the sign
I 1 ≤ d < 10, the significand or mantissa
I n is an integer, the exponent
Example
126.329 = 1.26329 · 102
σ = 1, the exponent n = 2, the mantissa s = 1.26329

Machine Numbers
Stability
Definition (Decimal Floating-point representation)

Let consider s 6= 0 written in decimal system. Then it can be written
uniquely as
s = σ · d · 10n
where
Example
−726.329 = −7.26329 · 102
σ = −1, the exponent n = 2, the mantissa s = 7.26329

Machine Numbers
Stability
Computers use binary arithmetic, representing each number as a binary

number: a finite sum of integer powers of 2.
Some numbers can be represented exactly, but others, such as 13 cannot.
Example
5.25 = 22 + 20 + 2−2
has an exact representation in finite binary (base 2) system. However,
1
= 2−2 + 2−4 + 2−6 + . . .
3
does not.
A floating-point number, the way a real number represented in
computers, approximates the real number.

Machine Numbers
Stability
Definition (Binary floating-point representation)

Let us consider s 6= 0 written in binary system. Then it can be written
uniquely as
s = σ · d · 2n
where
Example
(10101.0111)2 = (1.01010111)2 · 24
σ = +1, the exponent n = 4 = (100)2 , the mantissa s = (1.01010111)2

Machine Numbers
Stability
Definition (Binary floating-point representation)

Let us consider s 6= 0 written in binary system. Then it can be written
uniquely as
s = σ · d · 2n
where
Example
−(0.0101011)2 = −(1.01011)2 · 2−2
σ = −1, the exponent n = −2 = −(10)2 , the mantissa s = (1.01011)2

Machine Numbers
Stability
The IEEE floating-point arithmetic standard is the format for floating

point numbers used in almost all computers.
I The IEEE single precision (32-bit) floating-point representation of
s reserves 1 bit, for the sign, 8 bits, for the exponent, and 23 bits
for the mantissa:
x xxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx
sign(1) exponent (8) mantissa (23)
I The IEEE double precision (64-bit) floating-point representation of

s reserves 1 bit, for the sign, 11 bits, for the exponent, and 52 bits
for the mantissa:
x xxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx. . . x
sign(1) exponent (11) mantissa (52)

Machine Numbers
Stability
Single precision floating-point representation of s has a precision of 24

binary digits, and the exponent n limited by −126 ≤ n ≤ 127:
s = σ · (1.a1 a2 . . . a23 ) · 2n
σ and ai values are binary numbers already. The exponent field needs to
represent both positive and negative exponents. Exponents in the interval
[0, 126] have negative sing and those in [128, 255] have positive sign. A
bias (127) is added to the actual exponent in order to get the stored
exponent. Letting k be the number of reserved bits for the exponent, the
bias can be found by 2k−1 − 1. (28−1 − 1 = 127 in this particular case.)

Machine Numbers
Stability
Exponents (00000000)2 = 0 and (11111111)2 = 255 are both

reserved for special characters like ±0, ±∞, N aN 1 , etc. For this
reason, the greatest and least positive floating points are as in
table below.
0 11111110 11111111111111111111111 ≈ 3.40e + 38

0 00000001 00000000000000000000000 ≈ 1.18e − 38
1
Not a Number
Machine Numbers
Stability
Example
1= 00111111100000000000000000000000 = +1.0 × 2127−127
−2 = 11000000000000000000000000000000 = −1.0 × 2128−127
0.03515625 = 00111101000100000000000000000000 = +1.125 × 2122−127
264.03125 = 01000011100001000000010000000000 = +1.0313720703125 × 2135−127
I The first bit is 0 making the number positive.

I The exponent,
(01111010)2 − 127 = 1 · 21 + 1 · 23 + 1 · 24 + 1 · 25 + 1 · 26 − 127 = −5
I The mantissa,
(00100000000000000000000)2 + 1 = 1 · 2−3 + 1 = 1.125

Thus, the floating point number 0 01111010 00100000000000000000000
corresponds to the real +1.125 × 2−5 = 0.03515625

Machine Numbers
Stability
Infinitely many real numbers are represented by finitely many floating

points in a computer. Let say that a number s has a mantissa
d = 1.a1 a2 . . . an−1 an an+1 but the floating-point representation may
contain only n binary digits. Then s must be shortened when stored.
Definition
We denote the machine floating-point representation of s by f l(s).
I truncate or chop d to n binary digits, ignoring the remaining digits
I round d to n binary digits, based on the size of the part of d
following digit n:
a) if n + 1st digit is 0, chop d to n digits
b) if n + 1st digit is 1, chop d to n digits and add 1 to the last
digit of the result

Machine Numbers
Stability
It can be shown that

f l(s) = s · (1 + )
where is a small number depending of s.
a) If chopping is used
−2−n+1 ≤ ≤ 0
b) If rounding is used
−2−n ≤ ≤ 2−n
Due to chopping:
I the worst possible error is twice as large as when rounding is used
I the sign of the error s − f l(s) is the same as the sign of s. There is
no possibility of cancellation of errors.
Due to rounding:
I the worst possible error is only half as large as when chopping is used
I the error s − f l(s) is negative for only half the cases, which leads to
better error propagation behavior
Machine Numbers
Stability
Definition
I The error in an approximated quantity is defined as
E(sA ) = |sT − sA |
where sT = true value, sA = approximate value. This is also called

absolute error.
I The relative error Rel(sA ) is the ratio of the error and true value
error s − s
T A
Rel(sA ) = =

true value sT

Machine Numbers
Stability
The notion of relative error is a more intrinsic error measure.

Example
I The exact distance between 2 cities: d1T = 100km and the
measured distance is d1A = 99km
E(d1A ) = d1T − d1A = 1km
E(d1A )
Rel(d1A ) = = 0.01 = 1%
d1T
I The exact distance between 2 cities: d2T = 2km and the
measured distance is d2A = 1km
E(d2A ) = d2T − d2A = 1km
E(d2A )
Rel(d2A ) = = 0.50 = 50%
d2T

Machine Numbers
Stability
When doing calculations with numbers that contain an error, the result
will be effected by these errors. This phenomenon is called the
propagation of error. For example, let
x = f l(x) + E(x) and y = f l(y) + E(y).
|xy − f l(xy)| |xy − (x − E(x))(y − E(y))|

=
|xy| |xy|
(1.1)
|yE(x) + xE(y) + E(x)E(y)|
=
|xy|
If E(x) << 1 and E(y) << 1, then E(x)E(y) can be ignored. Therefore,
|xy − f l(xy)| |E(x)| |E(y)|

Rel(xy) = ≈ + = Rel(x) + Rel(y).
|xy| |x| |y|
Remark
Rel(xy) = Rel(x) + Rel(y).

Machine Numbers
Stability
Some numerical methods may tolerate the small fluctuations (errors) in

the input data or due to the computation itself; others might magnify
them to deviate the computed solution from the exact solution
exponentially fast. Calculations that can be proven not to magnify
approximation errors are called numerically stable.
Example
Finding the roots of ax2 + bx + c = 0 by quadratic formula
√ √
−b + b2 − 4ac −b − b2 − 4ac
x1 = and x2 =
2a 2a
is unstable if b2 >> |4ac|.
√
I if b ≤ 0, cancellation occurs in x2 since −b ≈ b2 − 4ac
√
I if b ≥ 0, cancellation occurs in x1 since b ≈ b2 − 4ac
in both cases b may be exact, but the square root introduces errors.

1st Machine Numbers Computer Arithmetic and Errors

Uploaded by

Copyright:

Available Formats

You might also like

1st Machine Numbers Computer Arithmetic and Errors

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1st Machine Numbers Computer Arithmetic and Errors

Uploaded by

Copyright:

Available Formats

Machine Numbers

MAT307 Numerical Analysis Errors and Computer Arithmetics

MAT307 Numerical Analysis

Machine Numbers, Computer Arithmetics and Errors

Dr. Mustafa Ağgül

Dr. Mustafa Ağgül Hacettepe University

Numerical analysis is concerned with solving problems

A significant part of this process is to figure out the errors that

Dr. Mustafa Ağgül Hacettepe University

Definition (Decimal Floating-point representation)

126.329 = 1.26329 · 102

σ = 1, the exponent n = 2, the mantissa s = 1.26329

Dr. Mustafa Ağgül Hacettepe University

Definition (Decimal Floating-point representation)

−726.329 = −7.26329 · 102

σ = −1, the exponent n = 2, the mantissa s = 7.26329

Dr. Mustafa Ağgül Hacettepe University

Computers use binary arithmetic, representing each number as a binary

Dr. Mustafa Ağgül Hacettepe University

Definition (Binary floating-point representation)

σ = +1, the exponent n = 4 = (100)2 , the mantissa s = (1.01010111)2

Dr. Mustafa Ağgül Hacettepe University

Definition (Binary floating-point representation)

−(0.0101011)2 = −(1.01011)2 · 2−2

σ = −1, the exponent n = −2 = −(10)2 , the mantissa s = (1.01011)2

Dr. Mustafa Ağgül Hacettepe University

The IEEE floating-point arithmetic standard is the format for floating

I The IEEE double precision (64-bit) floating-point representation of

Dr. Mustafa Ağgül Hacettepe University

Single precision floating-point representation of s has a precision of 24

Dr. Mustafa Ağgül Hacettepe University

Exponents (00000000)2 = 0 and (11111111)2 = 255 are both

0 11111110 11111111111111111111111 ≈ 3.40e + 38

I The first bit is 0 making the number positive.

(01111010)2 − 127 = 1 · 21 + 1 · 23 + 1 · 24 + 1 · 25 + 1 · 26 − 127 = −5

(00100000000000000000000)2 + 1 = 1 · 2−3 + 1 = 1.125

Dr. Mustafa Ağgül Hacettepe University

Infinitely many real numbers are represented by finitely many floating

Dr. Mustafa Ağgül Hacettepe University

It can be shown that

where sT = true value, sA = approximate value. This is also called

Dr. Mustafa Ağgül Hacettepe University

The notion of relative error is a more intrinsic error measure.

Dr. Mustafa Ağgül Hacettepe University

x = f l(x) + E(x) and y = f l(y) + E(y).

|xy − f l(xy)| |xy − (x − E(x))(y − E(y))|

|xy − f l(xy)| |E(x)| |E(y)|

Dr. Mustafa Ağgül Hacettepe University

Some numerical methods may tolerate the small fluctuations (errors) in

Dr. Mustafa Ağgül Hacettepe University

You might also like