8.3 Floating Point Numbers

Integers are not enough

We've got integer representation figured out for most needs.

Great, but how do we represent
pi = 3.14159265…?
e = 2.71828…?
Or, given the restrictions of max/min values, and the overflow problem:
0.000000001 (seconds in a nanosecond)?
3155760000 (seconds in a century)?

With a fixed representation we'll always eventually encounter a number too large or too small to fit into it.
The obvious solution is to use something a bit like scientific notation
• Only in binary.

A bit like scientific notation

1.2 × 10² = 0.12 × 10³

sign  exponent  significand
 +       3         12

 +       7      01111000

0.01111000 × 2⁷

Which leaves these questions:
• How to represent the sign? (For now we'll use a sign bit.)
• How to represent the size of the exponent - including negative exponents - to give the best possible range?
• How to get the best possible precision from the significand?

Range, precision and accuracy

Range
• The difference between the largest and smallest numbers that can be represented
• Want this to be as large as possible.

Precision
• The number of significant figures
• Want as many useful digits as possible.
• For example, for pi: 3.183543 is more precise, but 3.141 is more accurate.

Accuracy
• How close the representation is to the actual value
• Want this to be as close as possible
• But we already know there will be compromises (as with any number-base system).

Bias

The exponent needs to represent positive and negative values.

We could use a signed representation
• But that would 'waste' a useful bit
• reducing the range
• requiring more special treatment.

We could use complement representation
• And it has been done before.

But today we use bias.

Bias

One way to do it:
• Using 8 bits we can represent 256 exponents
• To represent the exponents -127 to +128 we add a fixed bias value of 127 (or 'excess') to each exponent
• This is an excess-127 representation
• Because we must subtract 127 from the representation to get the real exponent
• It allows the use of simple integer circuits to compare floating point numbers.

Examples. With a bias of 127:
• the exponent 00000000 = 0, 0 - 127 = -127, so it represents 2⁻¹²⁷
• the exponent 11111111 = 255, 255 - 127 = 128, so it represents 2¹²⁸
• the exponent 10000100 = 132, 132 - 127 = 5, so it represents 2⁵
• the exponent 00111011 = 59, 59 - 127 = -68, so it represents 2⁻⁶⁸
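The excess-127 arithmetic above can be sketched in a couple of helper functions (an illustrative fragment, not from the slides; the function names are my own):

```python
# Illustrative sketch: encoding and decoding an excess-127 exponent,
# as used by IEEE-754 single precision.
BIAS = 127

def encode_exponent(true_exponent):
    """Add the bias so the exponent fits in 8 unsigned bits."""
    stored = true_exponent + BIAS
    if not 0 <= stored <= 255:
        raise ValueError("exponent out of range for 8 bits")
    return stored

def decode_exponent(stored):
    """Subtract the bias to recover the true exponent."""
    return stored - BIAS

print(decode_exponent(0b10000100))  # 132 - 127 = 5
print(decode_exponent(0b00111011))  # 59 - 127 = -68
```

Because the stored values run from 0 to 255 in order of the true exponents, ordinary unsigned integer comparison circuits can compare them directly.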

Normalization

As it stands this floating point representation does not provide a unique representation for each number
• Very bad for arithmetic comparisons.

Also, we'd like the significand to provide as much precision as possible.

Thankfully, normalization provides unique representations and a gain in precision
• By requiring that the leftmost bit of the significand must always be 1.

Normalization

Without normalization we could use:

0 0001 100100

or

0 0010 010010

or

0 0011 001001

To represent the same value.

With normalization we would only use:

0 0001 100100
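The shift-and-adjust step behind these equivalent forms can be sketched like this (an illustrative fragment, not from the slides; the function name is my own):

```python
# Illustrative sketch: normalize a significand (read as 0.bbbbbb binary)
# so its leftmost bit is 1, decreasing the exponent by one per left shift.
def normalize(significand, exponent):
    shift = significand.index('1')  # number of leading zeros to remove
    # shift the bits left, padding with zeros on the right
    return significand[shift:] + '0' * shift, exponent - shift

# 0.001001 x 2^3 becomes 0.100100 x 2^1
print(normalize('001001', 3))  # ('100100', 1)
```

Applying it to the three representations above reduces all of them to the same normalized form.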
Normalization

But where is the promised extra precision?

Rather neatly, if we know that the leftmost bit of the significand is always going to be a 1, then we don't need to physically represent it at all!

So, instead of
[0] [0001] [100110]

We can use
[0] [0001] [001100]

With the missing leftmost bit implied, generated by hardware or software.

Thus gaining a whole extra bit's precision!

But be careful to check if a particular representation uses an implied bit.
Don't forget the zero!

It took humanity long enough to realise that zero is an important concept
• And now we can't do without it.

But, when normalization is used, floating point doesn't have a natural representation for zero
• Because the significand is always supposed to begin with a 1.

So, by convention, all zeroes:
[0] [000…00] [000…00]
- represents the value zero in floating point.

While
[0/1] [111…11] [000…00]
- represent +/- infinity
• more on this later.

Floating Point: A brief historical overview

The earliest processors couldn't do floating point arithmetic
• Programmers had to implement it in software.

Co-processors for handling floating point emerged
• But they made the hardware more expensive.

Today's processors usually have a floating point unit (FPU) contained within the processor.

Somewhere in stage 2 above a number of proprietary floating point representations emerged
• Each with their own trade-off in the sizes of the exponent and significand
• This made transfer of programs between different computer models difficult
• Thankfully in 1985 a standard emerged: IEEE-754.

Floating Point: A brief historical overview

32-bit Single Precision Format
sign: 1 bit | exponent: 8 bits | significand: 23 bits

64-bit Double Precision Format
sign: 1 bit | exponent: 11 bits | significand: 52 bits

(both formats use an implied bit which appears before the binary point)

64-bit Double Precision Range
negative overflow | expressible negative numbers | negative underflow | 0 | positive underflow | expressible positive numbers | positive overflow
with the boundaries falling at approximately -1.0 × 10³⁰⁸, -1.0 × 10⁻³⁰⁸, 0, 1.0 × 10⁻³⁰⁸ and 1.0 × 10³⁰⁸
(none of this is to scale)

IEEE-754 Single Precision (32-bit) examples

Floating Point Number    Single Precision Representation
1                        0 01111111 00000000000000000000000
0.5                      0 01111110 00000000000000000000000
19.5                     0 10000011 00111000000000000000000
-3.75                    1 10000000 11100000000000000000000
Zero                     0 00000000 00000000000000000000000
+/- Infinity             0/1 11111111 00000000000000000000000
Not a Number (NaN)       0/1 11111111 any non-zero significand
Denormalized Number      0/1 00000000 any non-zero significand

Table 2.4 from Computer Organization and Architecture
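The table entries can be reproduced with a short script (an illustrative sketch using only Python's standard library, not part of the slides; the function name is my own):

```python
# Illustrative sketch: unpack a number into its IEEE-754 single
# precision sign, exponent and significand bit fields.
import struct

def single_precision_fields(x):
    bits = struct.unpack('>I', struct.pack('>f', x))[0]  # raw 32 bits
    sign = (bits >> 31) & 0x1
    exponent = (bits >> 23) & 0xFF
    significand = bits & 0x7FFFFF
    return f"{sign:01b} {exponent:08b} {significand:023b}"

print(single_precision_fields(19.5))   # 0 10000011 00111000000000000000000
print(single_precision_fields(-3.75))  # 1 10000000 11100000000000000000000
```

Note that the implied leading 1 does not appear in the stored significand: 19.5 is 1.00111 × 2⁴, but only the 00111 after the point is stored.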
Floating point errors

As mentioned earlier, accuracy is a problem
• Either because of the base we use
• Or because of the precision.

In fact, it's a real world problem
• And one that the programmer / end user must be aware of and handle.

And when it goes wrong, it can go wrong badly
• Patriot Missile
• Ariane 5

Floating point errors

It's obvious that very large or very small numbers won't fit into a particular representation.

For example, suppose this (pretend) 14-bit format is being used, with an implied bit:
• 1-bit sign
• 5-bit exponent, with a bias of 15
• 8-bit significand, with an implied bit

sign: 1 bit | exponent: 5 bits | significand: 8 bits

This cannot represent numbers such as 2²⁰ or 2⁻²⁰.

But other numbers more likely to be used can slip through the net too
• 128.25 cannot be precisely represented because in binary it is 10000000.01 which requires 10 bits
• And, whether or not we use an implied bit, the closest we can get to it is 128
• giving a relative error of (128.25 - 128) ÷ 128.25 ≈ 0.19%
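The relative error quoted above can be checked directly (an illustrative one-liner, not from the slides):

```python
# Illustrative sketch: relative error when 128.25 is stored as 128,
# the nearest value the pretend 14-bit format can represent.
exact, stored = 128.25, 128.0
relative_error = (exact - stored) / exact
print(f"{relative_error:.2%}")  # 0.19%
```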

Arithmetic issues

Floating point arithmetic is not guaranteed to be associative nor distributive.

In many cases:
• (a + b) + c ≠ a + (b + c)
• a × (b + c) ≠ ab + ac
And instead there is often a small discrepancy.

The implication for you as a programmer is that you should take care when using the equality operator with floating point variables
• And consider testing for 'closeness' instead.
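A concrete demonstration of the associativity failure (an illustrative fragment, not from the slides; any language with IEEE-754 doubles behaves the same way):

```python
# Illustrative sketch: associativity failing with 64-bit floats.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6
print(left == right) # False
```

Mathematically both groupings equal 0.6, but 0.1, 0.2 and 0.3 are not exactly representable in binary, so the rounding after each addition depends on the grouping.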

Arithmetic issues

Testing for closeness coding example:

If n is a floating point variable in your program:
• first create a variable e = 1.0 × 10⁻²⁰ (or similar)
• then replace code such as
  if ( n == 5 )
      #do something
• with
  if ( abs(n-5) < e )
      #do something
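A runnable version of the idea (an illustrative sketch, not from the slides; I've used a tolerance of 1e-9 rather than the 1.0 × 10⁻²⁰ above, since the latter is below double precision resolution for values of this size and would behave like exact equality):

```python
# Illustrative sketch: testing for 'closeness' instead of exact
# equality. The tolerance 1e-9 is my own choice for this demonstration.
e = 1e-9

n = 0.1 + 0.2            # mathematically 0.3, but stored inexactly
print(n == 0.3)          # False: n is 0.30000000000000004
print(abs(n - 0.3) < e)  # True: n is within tolerance of 0.3
```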

To conclude

We need to represent non-integers.

But a fixed point representation provides too small a range.

So allowing the point to move, or float, provides more range
• Representing numbers in a style similar to scientific notation
• But placing the first digit after the (binary) point.

Floating point comes at a price
• It is computationally far more expensive than using integers
• If unchecked by the programmer, loss of accuracy can cause incorrect behaviour of your program.

Reading and References

Make sure that you are comfortable both with what binary and hex represent, and how to convert them to and from decimal
• You are likely to encounter both representations regularly in your future careers.
Outline primarily based upon
• Chapter 2, Computer Organization & Architecture (3rd Edition), Null & Lobur (you don't need sight of this text)
Other material and concepts used from
• Chapter 2, Fundamentals of Computer Architecture (3rd Edition), Burrell (again, you don't need sight of this text)
• Set 9, CG066 Lecture Notes, Northumbria University, Harrison
Suggested learning activities
• Practice addition using floating point numbers
• Write small programs which cause the laws of association and distribution to fail when using floating point variables.