03.2 Numbers, Floating Point

Representing Real Numbers
ECE 3340 – David Mayerich

Floating Point
Decimal floating point representations
Floating point standards and arithmetic
Precision loss
Floating Point
• A floating point value is represented mathematically as
$f = m \times b^{x}$
̶ 𝑚 is the mantissa
̶ 𝑏 is the base (𝑏 = 10 for decimal values, 𝑏 = 2 for binary values)
̶ 𝑥 is the exponent
• A fixed number of digits can be allocated to 𝑚 and 𝑥
• 𝑥 ∈ ℤ is an integer
• 𝑚 must be normalized: 1 ≤ 𝑚 < 𝑏
̶ the first digit of 𝑚 must be non-zero
̶ this is required to ensure that 𝑓 only has one representation in 𝑚 and 𝑥
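The normalization rule above can be sketched in a few lines of Python. The function name `normalize` and its signature are assumptions made for illustration, not part of the course material. It uses ordinary double-precision arithmetic, so the mantissa of a non-integer input may carry a small rounding error.

```python
import math

def normalize(f, b=10):
    """Return (m, x) with f == m * b**x and 1 <= |m| < b (f must be non-zero)."""
    if f == 0:
        raise ValueError("zero has no normalized representation")
    x = math.floor(math.log(abs(f), b))
    m = f / b**x
    # Guard against rounding in log() leaving m just outside [1, b).
    if abs(m) >= b:
        m, x = m / b, x + 1
    elif abs(m) < 1:
        m, x = m * b, x - 1
    return m, x

print(normalize(2330000, 10))  # (2.33, 6)
print(normalize(5, 2))         # (1.25, 2)
```

Because the first digit of the mantissa is forced to be non-zero, each non-zero value maps to exactly one pair (m, x).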
Floating Point, base 10
• For base-10 values, floating point is very similar to scientific notation
• Consider the following base-10 representation:
$f = m \times 10^{x}$
̶ 𝑚 is allocated 3 digits
̶ 𝑥 is allocated 2 digits
• 𝑚 must be normalized: 1 ≤ 𝑚 < 10
• $2.33 \times 10^{06} = 2{,}330{,}000$
• $0.0000233 = 2.33 \times 10^{-5}$
• 2334 = ?
̶ this value cannot be represented (not enough digits in the mantissa)
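As a rough check of which decimal values fit this format, the sketch below uses Python's `decimal` module. The helper name `to_toy_decimal` and the two constants are hypothetical, chosen only to mirror the 3-digit mantissa and 2-digit exponent of the example.

```python
from decimal import Decimal

MANTISSA_DIGITS = 3   # digits allocated to m (from the example above)
EXPONENT_DIGITS = 2   # digits allocated to x

def to_toy_decimal(text):
    """Return (m, x) if the decimal string fits the toy format exactly, else None."""
    d = Decimal(text)
    if not d.is_finite() or d == 0:
        return None                      # only finite, non-zero values are handled
    d = d.normalize()                    # strip trailing zeros
    sign, digits, exp = d.as_tuple()
    if len(digits) > MANTISSA_DIGITS:
        return None                      # too many significant digits for m
    x = exp + len(digits) - 1            # exponent once the point follows digit 1
    if abs(x) >= 10 ** EXPONENT_DIGITS:
        return None                      # exponent does not fit in 2 digits
    m = d.scaleb(-x)                     # 1 <= |m| < 10
    return m, x

print(to_toy_decimal("2330000"))    # (Decimal('2.33'), 6)
print(to_toy_decimal("0.0000233"))  # (Decimal('2.33'), -5)
print(to_toy_decimal("2334"))       # None
```

The value 2334 is rejected because it needs four significant digits, one more than the mantissa holds.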
Floating Point, base 10 [examples]
• What is the maximum base-10 value that can be represented with a
3-digit mantissa and a 2-digit exponent?
$f = 9.99 \times 10^{99}$
• What is the smallest magnitude non-zero value?
$f = 1.00 \times 10^{-99}$
• We can represent very small and very large numbers
• All numbers are limited to 3 significant digits
Floating Point – Spacing
• Since 𝑚 has a limited number of digits, there is a spacing between
representable values
[Figure: number lines of representable values]
$f = m \times 10^{2}$: $m$ = 1.1, 1.2, 1.3, 1.4 gives $f$ = 110, 120, 130, 140
$f = m \times 10^{3}$: $m$ = 1.1, 1.2, 1.3, 1.4 gives $f$ = 1100, 1200, 1300, 1400
Across the exponent boundary: $9.8 \times 10^{2} = 980$, $9.9 \times 10^{2} = 990$, $1.0 \times 10^{3} = 1000$, $1.1 \times 10^{3} = 1100$
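The spacing in these examples can be computed directly: with a fixed number of mantissa digits, neighbouring values that share an exponent differ by one unit in the last mantissa digit. The sketch below assumes the 2-digit mantissa used in the values above; the function name `spacing` and its parameters are illustrative only.

```python
from fractions import Fraction

def spacing(x, mantissa_digits=2, base=10):
    """Gap between neighbouring representable values that share exponent x."""
    return Fraction(base) ** (x - (mantissa_digits - 1))

print(spacing(2))   # 10  -> 110, 120, 130, ...
print(spacing(3))   # 100 -> 1100, 1200, 1300, ...
print(990 + spacing(2), 1000 + spacing(3))  # 1000 1100
```

The gap grows by a factor of the base each time the exponent increases by one, so the absolute spacing is larger for larger values.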
Floating Point in Digital Systems
• All computers represent values using binary numbers
• Digital floating point format:
$f = m \times 2^{x}$
̶ both the mantissa 𝑚 and exponent 𝑥 are allocated a number of
binary bits
Converting Binary Floating Point to Decimal
• Assume that a digital system uses a 3-bit mantissa and a 2-bit
exponent
• What is the decimal value if $m = 1.01_2$ and $x = 10_2$?
in decimal, $m = 1.25_{10}$ and $x = 2_{10}$, so $f = 1.25 \times 2^{2} = 5$
• Exercise 1: What is the spacing if $x = 11_2$?
$f_0 = 1.00_2 \times 2^{3} = 8$, $f_1 = 1.01_2 \times 2^{3} = 10$, $f_1 - f_0 = 2$
• Exercise 2: What is the largest value that can be represented?
$1.11_2 \times 2^{11_2} = 1.75 \times 2^{3} = 14$
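The conversions above can be checked with a short Python sketch. It assumes the same toy format (a 3-bit mantissa of the form 1.xx and a 2-bit unsigned exponent, with no bias); the helper name `decode` is an illustration, not a standard function.

```python
from itertools import product

def decode(m_bits, x_bits):
    """Decode a binary mantissa string such as '1.01' and exponent string such as '10'."""
    whole, frac = m_bits.split(".")
    m = int(whole, 2) + int(frac, 2) / 2 ** len(frac)
    x = int(x_bits, 2)
    return m * 2 ** x

# Every value in the toy format: mantissa 1.00..1.11, exponent 00..11.
values = sorted(
    decode(f"1.{a}{b}", f"{c}{d}") for a, b, c, d in product("01", repeat=4)
)

print(decode("1.01", "10"))                          # 5.0
print(decode("1.01", "11") - decode("1.00", "11"))   # 2.0 (Exercise 1)
print(max(values))                                   # 14.0 (Exercise 2)
```

Enumerating `values` shows all 16 numbers this format can hold, from 1.0 up to 14.0.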
Bias and Implied 1’s
• Real implementations make two simplifications for floating point
1) Implied 1’s
̶ in binary, we know that the first digit of the mantissa 𝑚 must be 1:
$1.m \times b^{x}$
̶ the actual precision of an 𝑛-bit mantissa is (𝑛 + 1) bits
2) Biased exponent – removes the need for signed exponents:

$1.m \times b^{x-c}$
̶ the value of 𝑐 is specified in the floating point standard
̶ 𝑥 is always an unsigned integer
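A small Python sketch of how the two simplifications combine when decoding a stored value. The function name `decode_biased`, the 2 stored mantissa bits, and the bias of 1 in the usage lines are hypothetical values chosen for illustration.

```python
def decode_biased(frac_bits, x, c):
    """Value of 1.m * 2**(x - c); frac_bits holds only the stored mantissa bits."""
    m = 1 + int(frac_bits, 2) / 2 ** len(frac_bits)   # prepend the implied leading 1
    return m * 2.0 ** (x - c)

print(decode_biased("01", 2, 1))   # 2.5   (1.01b x 2^(2-1))
print(decode_biased("01", 0, 1))   # 0.625 (1.01b x 2^(0-1))
```

With the bias subtracted at decode time, the stored exponent stays unsigned while the effective exponent can still be negative.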
Common Floating Point Formats
• IEEE 754 Standard
̶ IEEE 32-bit format – C/C++ float values
̶ 23-bit mantissa, 8-bit exponent, bias $c = 127$
̶ IEEE 64-bit format – C/C++ double values

̶ 52-bit mantissa, 11-bit exponent, bias $c = 1023$
• Other IEEE formats (not always supported):
̶ half-precision (16-bit), 10-bit mantissa, 5-bit exponent
̶ quadruple (128-bit), 112-bit mantissa, 15-bit exponent
̶ octuple (256-bit), 236-bit mantissa, 19-bit exponent
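To see the 32-bit layout in practice, the sketch below uses Python's standard `struct` module to pull apart the bits of a value stored as an IEEE 754 32-bit float. The function names `decode_float32` and `reconstruct` are illustrative, and the reconstruction covers normalized values only (not zero, subnormals, infinity, or NaN).

```python
import struct

def decode_float32(value):
    """Return (sign, stored exponent, stored mantissa bits) of value as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    x = (bits >> 23) & 0xFF      # 8-bit biased exponent, unsigned
    frac = bits & 0x7FFFFF       # 23 stored mantissa bits
    return sign, x, frac

def reconstruct(sign, x, frac, c=127):
    """Rebuild (-1)^s * 1.m * 2^(x - c) for a normalized 32-bit float."""
    m = 1 + frac / 2 ** 23       # implied leading 1
    return (-1) ** sign * m * 2.0 ** (x - c)

fields = decode_float32(-6.5)
print(fields)                 # (1, 129, 5242880)
print(reconstruct(*fields))   # -6.5
```

Here 6.5 is 1.101 in binary times 2^2, so the stored exponent is 2 + 127 = 129.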
Floating Point Multiplication
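$a = (-1)^{1} \times 1.27 \times 10^{1}$
$b = (-1)^{0} \times 3.01 \times 10^{0}$
$x = a \times b$
1) Add the sign bits (modulo 2): $s = 1 + 0 = 1$
2) Add the exponents: $c = 1 + 0 = 1$
3) Multiply the mantissas and chop to 3 digits: $m = 1.27 \times 3.01 = 3.8227 \rightarrow 3.82$
4) Adjust $c$ to normalize $m$: $c = 1$ (no change, since $1 \le 3.82 < 10$)
5) Final result: $x = (-1)^{1} \times 3.82 \times 10^{1}$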
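The five steps can be sketched in Python with the `decimal` module, which keeps the base-10 arithmetic exact. The triple layout (sign, mantissa, exponent) and the name `fp_multiply` are assumptions for illustration. Unlike the worked example, this sketch normalizes before chopping, which gives the same result here.

```python
from decimal import Decimal, ROUND_DOWN

def fp_multiply(a, b, digits=3):
    """Multiply two (sign, mantissa, exponent) triples in base 10, chopping to `digits` digits."""
    (sa, ma, xa), (sb, mb, xb) = a, b
    s = (sa + sb) % 2                       # step 1: add sign bits (modulo 2)
    x = xa + xb                             # step 2: add exponents
    m = Decimal(ma) * Decimal(mb)           # step 3: multiply mantissas exactly
    if m >= 10:                             # step 4: renormalize so 1 <= m < 10
        m, x = m / 10, x + 1
    quantum = Decimal(1).scaleb(-(digits - 1))      # 0.01 for 3 digits
    m = m.quantize(quantum, rounding=ROUND_DOWN)    # chop the extra digits
    return s, m, x

print(fp_multiply((1, "1.27", 1), (0, "3.01", 0)))  # (1, Decimal('3.82'), 1)
```

Mantissas are passed as strings so that `Decimal` receives the exact decimal digits rather than a binary approximation.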
Floating Point Addition
Add two floating point values assuming 3 digits of precision:
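$a = (-1)^{1} \times 1.27 \times 10^{1}$
$b = (-1)^{0} \times 3.01 \times 10^{0}$
$x = a + b$
1) Convert to the same exponent (and chop):
$a = (-1)^{1} \times 1.27 \times 10^{1}$
$b = (-1)^{0} \times 0.30 \times 10^{1}$
(note that, because the signs differ, this is actually a subtraction)
2) Add: $x = (-1)^{1} \times 0.97 \times 10^{1}$ (see sign digits)
3) Normalize: $x = (-1)^{1} \times 9.70 \times 10^{0}$

• What is the actual solution and the corresponding relative error?
$x = -9.69$, $E_r = \frac{|9.69 - 9.7|}{|9.69|} \approx 0.1\%$
• Note that the actual solution, $-9.69$, could be represented exactly with 3 digits; the error comes from chopping $b$ before the addition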
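The addition example and its relative error can be reproduced with Python's `decimal` module. This is a sketch of this one worked example, not a general floating point adder; the helper name `chop` and the variable names are illustrative.

```python
from decimal import Decimal, ROUND_DOWN

def chop(value, places=2):
    """Drop (do not round) digits beyond `places` decimal places."""
    return value.quantize(Decimal(1).scaleb(-places), rounding=ROUND_DOWN)

# Signed mantissas, both expressed with the larger exponent (10^1).
a_m = Decimal("-1.27")                    # a = -1.27 x 10^1
b_m = chop(Decimal("3.01").scaleb(-1))    # b = 0.301 x 10^1, chopped to 0.30
x_m = a_m + b_m                           # -0.97 (still x 10^1)
computed = x_m.scaleb(1)                  # -9.7 after applying the exponent

exact = Decimal("-12.7") + Decimal("3.01")        # -9.69
rel_error = abs(exact - computed) / abs(exact)    # about 0.00103, i.e. 0.1%

print(computed, exact)   # -9.7 -9.69
```

The digit lost when `b` is shifted to the larger exponent is what separates the computed result from the exact one.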
