Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 11

MATH/CMPSC 455

Introduction to
Numerical Analysis I

Floating Point Representation


of Real Numbers
FLOATING POINT REPRESENTATION OF
REAL NUMBERS

 This is about how computers represent and operate


real numbers.
 Helps us to understand rounding errors

We consider IEEE 754 Floating Point Standard

Representing binary numbers in computer:


1. format

2. machine representation
FLOATING POINT FORMAT
 Formats for decimal system

Standard Notation

Scientific Notation

Normalized Scientific Notation


FLOATING POINT FORMAT
 Format for floating point number (binary representation)

Normalized IEEE floating point standard:

o sign (+ or -)
o mantissa , which contains the significant bits. (N b’s)
o exponent (p, M-bit binary number)

… …
Precision sign Exponent Mantissa
(M) (N)
single 1 8 23
double 1 11 52
Long double 1 15 64

Definition (machine epsilon, ): It is the distance


between 1 and the smallest floating point number greater
than 1. Gives a bound on the relative error due to rounding.

For the IEEE double precision floating point standard:


ROUNDING
How do we fit a given binary number in a finite number of
bits?

IEEE Rounding to Nearest Rule:

For double precision, if the 53rd bit to the right of the binary point is 0, then
round down (truncate after the 52nd bit). If the 53rd bit is 1, then round up
(add 1 to 52 bit), unless all known bits to the right of the 1 are 0’s, in which
case 1 is added to bit 52 if and only if bit 52 is 1.
ROUNDING
Notation: Denote the IEEE double precision floating point
number associated to x, using the Rounding to the Nearest
Rule, by fl(x).

Definition (absolute error & relative error): Let be a


computed version of the exact quantity .
ROUNDING

Example:

Example:

Relative rounding error:


MACHINE REPRESENTATION
… …

• Sign: 1 bit, 0 for positive, 1 for negative;

• Mantissa: 52 bits, …
11
• Exponent: 11 bits so 0 < e < 2 -1 = 2047 and
p = e - 1023

• 1~2046  -1022 ~ 1023


• 2 values reserved for infinity / NaN and 0
• 2047  infinity if the mantissa is allzeros, NaN
otherwise;
• 0  small numbers including 0
ADDITION AND ROUNDING OF
FLOATING POINT NUMBERS

Step 1: line up the two numbers Double Precision

Step 2: add them Higher Precision

Step 3: store the result as a floating point number


Double Precision
Example :

Example :

You might also like