Professional Documents
Culture Documents
Lecture 10 (Temp)
Lecture 10 (Temp)
content S E M
sign exponent mantissa
size (bits) 1 8 23
Conversion: 𝑁 = −1 𝑆 × 2𝐸−127 × 1. 𝑀
implied 1
◦ Float format
IEEE-754 Standard
• 64-bit Extended/Double Precision Format
content S E M
sign exponent “mantissa"
size (bits) 1 11 52
Conversion: 𝑁 = −1 𝑆 × 2𝐸−1023 × 1. 𝑀
implied 1
◦ Double format
◦ To allow for extended range and precision
◦ Reduces round-off errors during intermediate calculations
IEEE -754 Standard
• Implied one is used to maximize the bits in encoding
◦ Since for a non-zero number, the first non-zero (binary) digit is
always one
• However, implied one is used only when E is between max(E) and
min(E)
◦ Max(E) = all 1’s
◦ Min(E) = all 0’s
• Exponent bias (i.e. 127 for single and 1023 for double) allows for the
representation of very small and very large numbers
◦ Max(E) and min(E) has special meanings – different interpretation
◦ Effective range of E
▪ -126 to 127 (single)
▪ -1022 to 1023 (double)
Special Numbers
• Zero
◦ 𝐸 = 0 (min(E) or all 0’s), 𝑀 = 0
◦ Can have positive and negative zero
• Subnormal/Denormal
◦ 𝐸 = 0 (min€ or all 0’s), 𝑀 ≠ 0
◦ No implied one (a zero is used)
◦ Resulting representation is not normalized
◦ Actual exponent is -126 instead of -127
◦ −1 𝑆 × 0. 𝑀 × 2−126
Special Numbers
• Infinities
◦ 𝐸 = max(𝐸) or all 1’s, 𝑀 = 0
◦ Also used as replacement for the result when overflow occurs
• Subtraction
◦ Subtraction of significands may require normalization (shifting to the
left)
▪ Resulting difference may have its first non-zero digit at the left of the radix point
◦ Result of subtraction may be negative – need to get 2’s complement and
update the sign bit of the result
▪ Do the 2’s complement before normalizing
Floating-Point Addition
+1.111 0010 0000 0000 0000 0010 x 24
+1.100 0000 0000 0001 1000 0101 x 22
• Perform 2’s complement (since the result is negative and we encode mag)
- 0.111 1011 1111 1101 1101 1101 10011 x 2-1
• Both subnormal
◦ Result: underflow
Floating-Point Multiplication
• Consider two normalized (positive) single precision numbers 𝑁1 and 𝑁2
𝑁1 × 𝑁2 = 1. 𝑀1 × 2𝐸1 −127 1. 𝑀2 × 2𝐸2 −127
𝑁1 × 𝑁2 = 1. 𝑀1 × 1. 𝑀2 2 𝐸1 +𝐸2 −127 −127
Possible Strategies:
1. Multiply as is (same as previous slide)
▪ Radix point of 48-bit product is always after the 2nd MSB
▪ First 1 of the 48-bit product can be “anywhere”
2. Move radix points of to the right of both mantissas
▪ Adjust exponent/s
▪ Operands will always look like 𝑋23 𝑋22 … 𝑋0 . and 𝑌23 𝑌22 … 𝑌0 .
▪ Radix point of 48-bit product is always after the LSB
▪ First 1 of the 48-bit product can be “anywhere”
Floating-Point Multiplication
General Case: 𝑁1 × 𝑁2 = 𝑏1 . 𝑀1 × 𝑏2 . 𝑀2 2 𝐸1 +𝐸2 −127 −127
Possible Strategies:
3. Normalize subnormal operands before multiplication
▪ Adjust exponent/s
▪ Will always be similar to having “1. 𝑀1 ” and “1. 𝑀2 ”
▪ Radix point of 48-bit product is always after the 2nd MSB
▪ First 1 of the 48-bit product can only be at the MSB (bit 47) or next
MSB (bit 46)
▫ Need only at most 1 shift to normalize (if result can be normal)
▫ Otherwise, shift right until exponent is -126 (for subnormal results)
▫ Allows the LSBs of the partial sum to be shifted out in the addition of
partial products (no need to store the whole 48 bits throughout)
Floating-Point Division
Operands
• Both normal
◦ Result: normal, subnormal, overflow, or underflow
• Both subnormal
◦ Result: normal
Floating-Point Division
• Consider two normalized (positive) single precision numbers 𝑁1 and 𝑁2
𝑁1 /𝑁2 = 1. 𝑀1 × 2𝐸1 −127 / 1. 𝑀2 × 2𝐸2 −127
𝑁1 /𝑁2 = 1. 𝑀1 /1. 𝑀2 2 𝐸1 −𝐸2 +127 −127
Possible Strategy:
• Proceed with the division process discussed (without remainder) with the
following changes
◦ Assume that the first resulting quotient bit is a 1 and that the result will
be a normal number
◦ Given this assumption, it will take 24 cycles of the division to complete
the significand of the quotient (to get a 23-bit mantissa)
▪ Quotient becomes 24-bits (with an MSB of 1 as assumed)
◦ For this to quotient, the radix point must be shifted to the left 23 times
▪ 23-bit LSB of the quotient must be used as result mantissa (𝑀)
▪ 23 must be added to the initial exponent
▫ Radix point moved from the end of the LSB to after the MSB of the 24-bit quotient
Floating-Point Division
General Case: 𝑁1 /𝑁2 = 𝑏1 𝑀1 /𝑏2 𝑀2 2 𝐸1 +𝐸2 −127 −127
Possible Strategy:
• Adjust the approach to cover all cases (first quotient bit is not a 1)
• To simplify things, let 𝑋 represent the numerator 𝑏1 𝑀1 and 𝑌 represent
the denominator 𝑏2 𝑀2
• Initially we assumed:
? × 2𝐸𝑖𝑛𝑖𝑡−127 1𝑥𝑥𝑥𝑥 … 𝑥. × 2𝐸𝑖𝑛𝑖𝑡−127
𝑏2 𝑀2 𝑏1 𝑀1 𝑌23 𝑌22 … 𝑌0 𝑋23 𝑋22 … 𝑋0
1. 𝑥𝑥𝑥𝑥 … 𝑥 × 2𝐸𝑖𝑛𝑖𝑡+23−127
Floating-Point Division
• What if the first 1 appears at the middle of the quotient?
01𝑥𝑥𝑥 … 𝑥. 𝑥 × 2𝐸𝑖𝑛𝑖𝑡−127 1. 𝑥𝑥𝑥𝑥 … 𝑥 × 2𝐸𝑖𝑛𝑖𝑡+22−127
𝑌23 𝑌22 … 𝑌0 𝑋23 𝑋22 … 𝑋0 . 0
• Further optimization?
◦ Identify the exponent
◦ Identify along the way if the result is subnormal?
Floating-Point Division
• Considering 1 bit at a time (from the MSB side) of the dividend is
like shifting in the bits into a “current dividend” register
◦ Recall: (non-)restoring division and division hardware
◦ When extending the dividend, if needed, simply shift in a zero
• A 24-bit quotient is needed where results are placed one bit at a
time (placing new results at the LSB) and shifting old results to the
left
◦ When a 1 is shifted (left) into the MSB of this quotient, then the
last division cycle must have completed the 24-bit result
◦ Indicator that the division process can be stopped
• However, the exponent can also be used to flag the division process
to stop (when the answer is a subnormal)
◦ How to do this?
Floating-Point Division
• From previous described process, every additional division
cycle (extension of the dividend) requires the exponent to be
increased by 1
◦ This process can be traced back from the initial division
without waiting for the 24-bit quotient to be completed!
• This time, assume that every new result from a division cycle is
the last bit of the mantissa even the from the first division
◦ Simplifies the process by just extracting always the LSBs (23
bits) of the current quotient
◦ Additional step: add 47 to the initial exponent
Floating-Point Division
• Revised process
1. Determine the sign of the result
2. Compute the initial exponent 𝐸𝑖𝑛𝑖𝑡
◦ Note the case (N/N, N/D, D/N, or D/D)
3. Add 47 to the 𝐸𝑖𝑛𝑖𝑡
4. Perform division of significands (see details next slide)
◦ Exponent, mantissa, and normalization is done on this step
5. Place into the final registers
Floating-Point Division
• Division of significands
1. Shift a bit of the dividend (from the MSB side) into a
current dividend (also “remainder”) register
2. Perform division as with restoring division
▪ Subtract, evaluate and shift in a new quotient bit, restore if needed
▪ Check if 23rd bit of quotient is already a 1
▫ If yes, then 24-bit significand is already complete (last cycle)
· Still perform the step below
3. Subtract 1 from the exponent
▪ Check if the actual exponent is -126
▫ If yes, then result is subnormal. Extract the 23 LSB of the quotient and save
as Mantissa.
4. Repeat if not yet done
Rounding off
• Intermediate values of significands may need to be
represented using more bits to minimize errors
• Guard bits – additional bits in the mantissa retained during
intermediate steps or operations
◦ Maintains additional accuracy in final results
• Three common-rounding off methods
◦ Truncation
◦ Von Neumann rounding
◦ Round to neares (even)
Rounding Off (Truncation)
Truncation
• Extra bits are simply discarded
• Biased approximation since error range is not symmetrical
around in between numbers
• “Rounding towards 0”
• Example: truncate from 6 bits to 3 bits
◦ 0.001000 → 0.001
◦ 0.010110 → 0.010
Rounding Off (Von Neumann)
Von Neumann rounding
• If bits to be removed are all 0, truncate
• If any of the bits to be removed is 1, set LSB of retained bits to 1
• Unbiased approximation (symmetric error)
• Larger error range
• Example: from 6 bits to 3 bits
◦ 0.001000 → 0.001
◦ 0.001001 → 0.001
◦ 0.010001 → 0.011
Rounding Off (Nearest Even)
Round to nearest (even)
• Requires 3 guard bits
◦ First two guard bits are part of the mantissa to be removed
◦ Third bit is the OR of all bits beyond the first two (at least
one 1)
• Achieves least range of error but is also most difficult to
implement because of addition and possible re-normalization
Rounding Off (Nearest Even)
Round to nearest (even)
• Do the following for each case of guard bits
◦ 0xx – truncate
◦ 100 – round to nearest (even)
▪ +1 to LSB if LSB is 1 (odd)
▪ Truncate if LSB is 0 (already even)
◦ 101, 110, 111 – round up (+1 to LSB)