8.3 Floating Point Numbers

8.
3 Floating Point
Numbers
Integers are not enough
We've got integer representation gured out for most needs

Great, but how do we represent
pi = 3.14159265…
e = 2.71828…
Or, given the restrictions of max/min values, and the over ow problem
0.000000001 (seconds in a nanosecond)
3155760000 (seconds in a century)?
With a xed representation we'll always eventually encounter a number too large or too small to t into it
The obvious solution is to use something a bit like scienti c notation
• Only in binary.
2
fi
?
fi
?
fi
fl
.
fi
.
A bit like scienti c notation
1.2 x 102 = 0.12 × 103
+ 3 12
sign exponent signi cand
+ 7 01111000
0.01111000 × 27
Which leaves these questions

• How to represent the sign? (For now we’ll use a sign bit.
• How to represent the size of the exponent - including negative exponents - to
give the best possible range
• How to get the best possible precision from the signi cand?
3
fi
fi
?
fi
)
Range, precision and accuracy
Range
• The difference between the largest and smallest numbers that can be represented
• Want this to be as large as possible.
Precision
• The number of signi cant gures
• Want as many useful digits as possible.
For example: for pi
• 3.183543 is more precise
Accuracy
• But 3.141 is more accurate.
• How close the representation is to the actual value
• Want this to be as close as possible
• But we already know there will be compromises (as with
any number-base system).
4
:
fi
,
fi
.
Bias
The exponent needs to represent positive and negative values.
We could use a signed representation

• But that would 'waste' a useful bit
• reducing the range
• requiring more special treatment.
We could use complement representation

• And it has been done before.
But today we use bias.
5
,
Bias
One way to do it
• Using 8-bits we can represent 256 exponents
• To represent the exponents -127 to +128 we add a xed bias value of 127 (or ‘excess’) to
each exponent
• This is an excess-127 representation
• Because we must subtract 127 from the representation to get the real exponent
• It allows the use of simple integer circuits to compare oating point numbers.
Examples. With a bias of 127

• the exponent 00000000 = 0, 0 - 127 = -127, so it represents 2-127
• the exponent 11111111 = 255, 255 - 127 = 128, so it represents 2128
• the exponent 10000100 = 132, 132 - 127 = 5, so it represents 25
• the exponent 00111011 = 59, 59 - 127 = -68, so it represents 2-68
6
.
fi
fl
.
Normalization
As it stands this oating point representation does not provide a unique representation for each number
• Very bad for arithmetic comparisons.
Also, we'd like the signi cand to provide as much precision as possible.
Thankfully, normalization provides unique representations and a gain in precision

• By requiring that the leftmost bit of the signi cand must always be 1.
7
fl
fi
fi
.
Normalization
Without normalization we could use:
0 0001 100100
or
0 0010 010010
or
0 0011 001001
To represent the same value.
With normalization we would only use:
0 0001 100100
8
Normalization
But where is the promised extra precision

Rather neatly, if we know that the leftmost bit of the signi cand is always going to be a 1, then we don't
need to physically represent it at all!
So, instead of
[0] [0001] [100110]
We can use
[0] [0001] [001100]
With the missing leftmost bit implied generated by hardware or software
Thus gaining a whole extra bit's precision!
But be careful to check if a particular representation uses an implied bit.

9
:
fi
Don't forget the zero!
It took humanity long enough to realise that zero is an important concept

• And now we can't do without it.
But, when normalization is used, oating point doesn't have a natural

representation for zero
• Because the signi cand is always supposed to begin with a 1.
So, by convention, all zeroes

[0] [000…00] [000…00
- represents the value zero in oating point.
While
[0/1] [111…11] [000…00
- represent +/- in nity
• more on this later
10
:
fi
fi
.
fl
]
fl
]
Floating Point: A brief historical overview
The earliest processors couldn't do oating point arithmetic

• Programmers had to implement it in software.
Co-processors for handling oating point emerged

• But they made the hardware more expensive.
Today’s processors usually have a oating point unit FPU contained

within the processor.
Somewhere in stage 2 above a number of proprietary oating point representations emerged

• Each with their own trade off in the sizes of the exponent and signi cand
• This made transfer of programs between different computer models dif cult
• Thankfully in 1985 a standard emerged: IEEE-754.
11
fl
fl
fl
.
fl
.
fi
fi
.
Floating Point: A brief historical overview
32-bit Single Precision Format
1 bit 8 bits 23 bits

sign exponent signi cand these use an implied bit
which appears before
64-bit Double Precision Format the binary point
1 bit 11 bits 52 bits
64-bit Double Precision Range

expressible expressible
negativ negative positive positive
negative positive
over ow under ow under ow over ow
numbers numbers
-1.0 x 10308 -1.0 x 10-308 0 1.0 x 10-308 1.0 x 10308
(none of this is to scale)
12
fl
fl
fi
fl
fl
e
IEEE-754 Single Precision (32-bit) examples
Floating Point Number Single Precision Representation

1 0 01111111 00000000000000000000
0.5 0 01111110 00000000000000000000
19.5 0 10000011 00111000000000000000
-3.75 1 10000000 11100000000000000000
Zero 0 00000000 00000000000000000000
+/- In nity 0/1 11111111 00000000000000000000
Not a Number (NaN) 0/1 11111111 any non-zero significand
Denormalized Number 0/1 00000000 any non-zero significand
Table 2.4 from Computer Organization and Architecture
13
fi
Floating point errors
As mentioned earlier, accuracy is a problem

• Either because of the base we use
• Or because of the precision.
In fact, it's a real world problem

• And one that the programmer / end user must be aware of and handle.
And when it goes wrong, it can go wrong badly

• Patriot Missil
• Arianne 5
14
e
Floating point errors
It’s obvious that very large or very small numbers won’t t into a particular representation.
For example, suppose this (pretend) 14-bit format is being used, with an implied bit
• 1-bit sig
1 bit 5 bits 8 bits
• 5-bit exponent, with a bias of 1
• 8-bit signi cand, with an implied bi sign exponent signi cand
This cannot represent numbers such as 220 or 2-20.
But other numbers more likely to be used can slip through the net too
• 128.25 cannot be precisely represented because in binary it is 10000000.01 which requires 10 bits
And, whether or not we use an implied bit, the closest we can get to it is 12
• giving a relative error of (128.25-128)÷128.25 ≈ 0.19%
15
fi
n
fi
5
fi
:
Arithmetic issues
oating point arithmetic is not guaranteed to be associative nor distributive.
In many cases
• (a + b) + c ≠ a + (b + c
• a × (b + c) ≠ ab + a
And instead there is often a small discrepancy.
The implication for you as a programmer is that you should take care when
using the equality operator with oating point variables
• And consider testing for 'closeness' instead.
16
fl
:
fl
.
Arithmetic issues
Testing for closeness coding example:
If n is a oating point variable in your program

• rst create a variable e = 1.0 x 10-20 (or similar
• then replace code such as
if ( n == 5 )
#do somethin
• with
if ( abs(n-5) < e )
#do something
17
fi
:
fl
g
To conclude
We need to represent non-integers

But a xed point representation provides too small a range.
So allowing the point to move, or oat, provides more range

• Representing numbers in a style similar to scienti c notation
• But placing the rst digit after the (binary) point.
Floating point comes at a price

• It is computationally far more expensive than using integers
• If unchecked by the programmer, loss of accuracy can cause incorrect behaviour of your program.
18
fi
fi
.
fl
.
fi
.
Reading and References
Make sure that you are comfortable both with what binary and hex represent, and how to convert
them to and from decimal
• You are likely to encounter both representations regularly in your future careers
Outline primarily based upon
• Chapter 2, Computer Organization & Architecture (3rd Edition), Null & Lobu
(you don’t need sight of this text
Other material and concepts used from
• Chapter 2, Fundamentals of Computer Architecture (3rd Edition), Burrel
(again, you don’t need sight of this text
• Set 9, CG066 Lecture Notes, Northumbria University, Harriso
Suggested learning activities
• Practice addition using oating point numbers
• Write small programs which cause the laws of association and distribution to fail when using
oating point variables.
19
fl
fl
.

8.3 Floating Point Numbers

Uploaded by

Copyright:

Available Formats

You might also like

8.3 Floating Point Numbers

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

8.3 Floating Point Numbers

Uploaded by

Copyright:

Available Formats

8.

We've got integer representation gured out for most needs

A bit like scienti c notation

1.2 x 102 = 0.12 × 103

Which leaves these questions

Range, precision and accuracy

The exponent needs to represent positive and negative values.

We could use a signed representation

We could use complement representation

But today we use bias.

Examples. With a bias of 127

Thankfully, normalization provides unique representations and a gain in precision

Without normalization we could use:

To represent the same value.

With normalization we would only use:

But where is the promised extra precision

With the missing leftmost bit implied generated by hardware or software

Thus gaining a whole extra bit's precision!

But be careful to check if a particular representation uses an implied bit.

It took humanity long enough to realise that zero is an important concept

But, when normalization is used, oating point doesn't have a natural

So, by convention, all zeroes

Floating Point: A brief historical overview

The earliest processors couldn't do oating point arithmetic

Co-processors for handling oating point emerged

Today’s processors usually have a oating point unit FPU contained

Somewhere in stage 2 above a number of proprietary oating point representations emerged

Floating Point: A brief historical overview

32-bit Single Precision Format

1 bit 8 bits 23 bits

1 bit 11 bits 52 bits

64-bit Double Precision Range

IEEE-754 Single Precision (32-bit) examples

Floating Point Number Single Precision Representation

As mentioned earlier, accuracy is a problem

In fact, it's a real world problem

And when it goes wrong, it can go wrong badly

Floating point errors

This cannot represent numbers such as 220 or 2-20.

oating point arithmetic is not guaranteed to be associative nor distributive.

Testing for closeness coding example:

If n is a oating point variable in your program

We need to represent non-integers

So allowing the point to move, or oat, provides more range

Floating point comes at a price

Reading and References

You might also like