8.3 Floating Point Numbers
Integers are not enough
With a fixed representation we'll always eventually encounter a number too large or too small to fit into it.
The obvious solution is to use something a bit like scientific notation
• Only in binary.
+ 3 12
sign exponent significand
+ 7 01111000
0.01111000 × 2^7
Range
• The difference between the largest and smallest numbers that can be represented
• Want this to be as large as possible.
Precision
• The number of significant figures
• Want as many useful digits as possible.
Accuracy
• How close the representation is to the actual value
• Want this to be as close as possible
• But we already know there will be compromises (as with any number-base system).
For example, for pi:
• 3.183543 is more precise
• But 3.141 is more accurate.
Bias
One way to do it
• Using 8 bits we can represent 256 exponents
• To represent the exponents -127 to +128 we add a fixed bias value of 127 (or 'excess') to each exponent
• This is an excess-127 representation
• Because we must subtract 127 from the representation to get the real exponent
• It allows the use of simple integer circuits to compare floating point numbers.
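The excess-127 scheme above can be sketched in a few lines of Python. This is only an illustration: the function names `encode_exponent` and `decode_exponent` are made up for this example, and the code assumes the 8-bit field and bias of 127 described on the slide.

```python
BIAS = 127  # excess-127, as described above


def encode_exponent(true_exponent):
    """Return the unsigned 8-bit field that stores a true exponent."""
    stored = true_exponent + BIAS
    if not 0 <= stored <= 255:
        raise ValueError("exponent does not fit in 8 bits")
    return stored


def decode_exponent(stored):
    """Recover the true exponent by subtracting the bias."""
    return stored - BIAS


for e in (-127, -1, 0, 1, 128):
    print(f"{e:+4d} is stored as {encode_exponent(e):08b}")
```

Because the stored fields are plain unsigned integers, a larger true exponent always has a larger stored value, which is why ordinary integer comparison circuits can be reused.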
Normalization
As it stands this floating point representation does not provide a unique representation for each number
• Very bad for arithmetic comparisons.
Also, we'd like the significand to provide as much precision as possible.
0 0001 100100
or
0 0010 010010
or
0 0011 001001
0 0001 100100
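A small Python sketch can confirm that the three bit patterns above encode the same value. It makes two assumptions that the slide does not state: the significand is read as a binary fraction 0.ffffff, and the exponent field is read without a bias (a bias would scale all three values by the same factor, so the conclusion would not change). The helper name `toy_value` is hypothetical.

```python
def toy_value(sign, exponent_bits, significand_bits):
    """Value of a toy pattern: 0.significand x 2^exponent (no bias assumed)."""
    fraction = int(significand_bits, 2) / 2 ** len(significand_bits)
    exponent = int(exponent_bits, 2)
    value = fraction * 2 ** exponent
    return -value if sign else value


patterns = [
    (0, "0001", "100100"),
    (0, "0010", "010010"),
    (0, "0011", "001001"),
]

values = [toy_value(*p) for p in patterns]
for pattern, value in zip(patterns, values):
    print(pattern, "->", value)
```

All three patterns decode to the same number, which is the ambiguity that normalization removes by always choosing the form whose significand starts with a 1.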
So, instead of
[0] [0001] [100110]
We can use
[0] [0001] [001100]
Don't forget the zero!
While
[0/1] [111…11] [000…00]
• represents +/- infinity
• more on this later
Floating point errors
It's obvious that very large or very small numbers won't fit into a particular representation.
For example, suppose this (pretend) 14-bit format is being used, with an implied bit:
• 1-bit sign
• 5-bit exponent, with a bias of 15
• 8-bit significand, with an implied bit
sign exponent significand
1 bit 5 bits 8 bits
But other numbers more likely to be used can slip through the net too
• 128.25 cannot be precisely represented because in binary it is 10000000.01 which requires 10 bits
And, whether or not we use an implied bit, the closest we can get to it is 128
• giving a relative error of (128.25-128)÷128.25 ≈ 0.19%
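The 128.25 example can be checked with a short Python sketch. The helper `truncate_to_bits` is hypothetical and simply keeps the leading significant bits of a positive number; it illustrates the precision limit only and does not model the full 14-bit format.

```python
import math


def truncate_to_bits(x, bits):
    """Keep only the first `bits` significant binary digits of a positive x."""
    mantissa, exponent = math.frexp(x)        # x = mantissa * 2**exponent, 0.5 <= mantissa < 1
    kept = math.floor(mantissa * 2 ** bits)   # drop the bits that do not fit
    return math.ldexp(kept, exponent - bits)  # kept * 2**(exponent - bits)


x = 128.25
without_implied_bit = truncate_to_bits(x, 8)  # 8 stored bits only
with_implied_bit = truncate_to_bits(x, 9)     # 8 stored bits plus the implied bit
exact = truncate_to_bits(x, 10)               # the 10 bits the value actually needs

relative_error = (x - with_implied_bit) / x
print(without_implied_bit, with_implied_bit, exact)
print(f"relative error: {relative_error:.2%}")
```

With either 8 or 9 significant bits the value collapses to 128, and only 10 bits recover 128.25 exactly.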
Arithmetic issues
In many cases
• (a + b) + c ≠ a + (b + c)
• a × (b + c) ≠ ab + ac
And instead there is often a small discrepancy.
The implication for you as a programmer is that you should take care when
using the equality operator with floating point variables
• And consider testing for 'closeness' instead.
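As a minimal sketch of both points, the following Python snippet uses ordinary double-precision floats. The particular values (0.1, 0.2, 0.3 and 100.0) are chosen for illustration only, and `math.isclose` from the standard library stands in for a 'closeness' test.

```python
import math

a, b, c = 0.1, 0.2, 0.3

# Association: the grouping changes the rounded result.
left_assoc = (a + b) + c
right_assoc = a + (b + c)
print(left_assoc == right_assoc)   # exact equality fails for these values

# Distribution: multiplying out changes the rounded result too.
k = 100.0
factored = k * (a + b)
expanded = k * a + k * b
print(factored == expanded)        # exact equality fails for these values

# Testing for closeness instead of equality.
print(math.isclose(left_assoc, right_assoc))
print(math.isclose(factored, expanded))
```

`math.isclose` uses a relative tolerance of 1e-09 by default; the right tolerance depends on the application, so treat the default as a starting point rather than a rule.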
To conclude
Make sure that you are comfortable both with what binary and hex represent, and how to convert
them to and from decimal
• You are likely to encounter both representations regularly in your future careers
Outline primarily based upon
• Chapter 2, Computer Organization & Architecture (3rd Edition), Null & Lobur
(you don't need sight of this text)
Other material and concepts used from
• Chapter 2, Fundamentals of Computer Architecture (3rd Edition), Burrell
(again, you don't need sight of this text)
• Set 9, CG066 Lecture Notes, Northumbria University, Harriso
Suggested learning activities
• Practice addition using floating point numbers
• Write small programs which cause the laws of association and distribution to fail when using
floating point variables.