Lecture 2: Representing Numbers in The Computer: Zhenning Cai August 19, 2019

MA2213: Numerical Analysis I
Lecture 2: Representing numbers in the computer
Zhenning Cai
August 19, 2019
The arithmetic performed by computers is often inexact. The following two examples will
show this phenomenon:
Example 1. The following C code sets two variables to 0.23 and 0.25, respectively, and then
prints both values to 10 decimal places.
    #include <stdio.h>

    int main(void)
    {
        float x = 0.23, y = 0.25;
        printf("%.10f\t%.10f\n", x, y);
        return 0;
    }
The output is 0.2300000042 and 0.2500000000. The former differs from 0.23 by $4.2 \times 10^{-9}$,
whereas the latter is exact.
Example 2. In exact arithmetic, we have $x = \sqrt[3]{x} \cdot \sqrt[3]{x} \cdot \sqrt[3]{x}$.
However, this identity may fail in computer arithmetic. The following C code sets a variable to 3,
takes its cube root, and then cubes the result. The difference between this result and the
original value is printed.
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        float x0 = 3, x;
        x = cbrtf(x0);        /* cube root in single precision */
        x = x * x * x;        /* cube the result again */
        printf("%e\n", x - x0);
        return 0;
    }
The output is -4.768372e-07, which means $-4.768372 \times 10^{-7}$. If we change the value of x0
to 3.375, then the result is exactly zero.
We are going to explain these results in this lecture, which will cover the following two
topics:
• Binary numbers and numeral systems

• Single-precision floating-point format
1 Binary numbers and numeral systems
The numeral system we use in everyday life is the decimal numeral system. For example,
the number 1234.5678 means
$$1 \times 10^{3} + 2 \times 10^{2} + 3 \times 10^{1} + 4 \times 10^{0} + 5 \times 10^{-1} + 6 \times 10^{-2} + 7 \times 10^{-3} + 8 \times 10^{-4},$$
where 10 means the next natural number following 9. In general, a decimal number has the
form
$$a_m a_{m-1} \cdots a_0 . b_1 b_2 \cdots b_n, \tag{1}$$
where each $a_k$ and $b_k$ is one of the decimal digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and the decimal
separator “.” separates the integer part $a_m a_{m-1} \cdots a_0$ from the fractional part
$b_1 b_2 \cdots b_n$. The meaning of the digit sequence (1) is
$$a_m \times 10^{m} + a_{m-1} \times 10^{m-1} + \cdots + a_0 \times 10^{0} + b_1 \times 10^{-1} + b_2 \times 10^{-2} + \cdots + b_n \times 10^{-n}.$$
Again, here 10 represents the next natural number following 9. The choice of ten as the base
is probably due to the fact that most people have ten fingers.
Unlike the decimal numeral system of everyday life, computers prefer the binary numeral
system. A general numeral system is defined in much the same way as the decimal numeral
system above, with the following two differences:
• The meaning of “10” can be changed to any natural number larger than 1. This number
is called the base of the numeral system.
• The number of digits equals the base, and these digits represent the first “10”
natural numbers, starting from zero. Note that here “10” does not necessarily mean “ten”.
Example 3. Let “10” be the next natural number following 7, which is “8” in the decimal
numeral system. Such a numeral system is called the octal numeral system. By convention, the
eight digits are still 0, 1, 2, 3, 4, 5, 6, 7, and the separator is still a dot “.”. For instance, an
octal number
$$123.456 = 1 \times 10^{2} + 2 \times 10^{1} + 3 \times 10^{0} + 4 \times 10^{-1} + 5 \times 10^{-2} + 6 \times 10^{-3}$$
can be translated to a decimal number by
$$1 \times 8^{2} + 2 \times 8^{1} + 3 \times 8^{0} + 4 \times 8^{-1} + 5 \times 8^{-2} + 6 \times 8^{-3} = 83.58984375.$$
This example involves two numeral systems, which can be confusing. A common notation
puts the base in a subscript, where the base itself is always written as a decimal number. A
number without a subscript is regarded as a decimal number (unless otherwise specified). For
instance, the above example shows that
$$123.456_8 = 83.58984375_{10} = 83.58984375 \quad \text{or} \quad (123.456)_8 = (83.58984375)_{10} = 83.58984375.$$
Some literature may use
$$123.456_{\mathrm{oct}} = 83.58984375_{\mathrm{dec}},$$
which has the same meaning.
Example 4. The base of the hexadecimal numeral system is sixteen, and the sixteen digits
are usually denoted by 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. For example,
$$(\mathrm{A1B.2C3})_{16} = (10 \times 16^{2} + 1 \times 16^{1} + 11 \times 16^{0} + 2 \times 16^{-1} + 12 \times 16^{-2} + 3 \times 16^{-3})_{10} = (2587.172607421875)_{10}.$$
Computer systems use the simplest numeral system, the binary numeral system, whose base
is two and whose digits are 0 and 1. Each binary digit is called a bit.
Exercise 1. Convert the binary number $(1011.0011)_2$ to the corresponding decimal number.
Exercise 2. Convert the binary number $(1011001011.00111101)_2$ to the corresponding octal
and hexadecimal numbers.
2 Single-precision floating-point format

Computers can only store a finite number of bits, which means that they can only represent
finitely many real numbers. For example, a single-precision floating-point number
(float in C) typically¹ has 32 bits. Therefore the number of real numbers that can be exactly
represented is at most $2^{32}$. If we want to represent a number $x$ outside this set,
we can only approximate it by a representable number close to $x$, which introduces round-off
error. In Example 1 at the beginning of this lecture, the number 0.23 does not belong to this
set, whereas $0.25 = (0.01)_2$ can be represented exactly.
To better understand how our computer represents real numbers, we use the following C
program to test every bit of a single-precision floating-point number:
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* View the same 32 bits either as an unsigned integer or as a float. */
        union { uint32_t d; float f; } number;
        uint32_t d = UINT32_C(1) << 31;   /* mask, starting at the highest bit b31 */

        if (sizeof(float) != sizeof(uint32_t)) {
            fprintf(stderr, "Error: a float is not 32 bits!\n");
            return 1;
        }
        printf("Please input a real number: ");
        if (scanf("%f", &number.f) != 1) {   /* also catches end of input */
            fprintf(stderr, "Invalid input!\n");
            return 2;
        }
        while (d) {
            putchar((d & number.d) ? '1' : '0');   /* print b31 down to b0 */
            d >>= 1;
        }
        putchar('\n');
        return 0;
    }
This program reads a float from the standard input and then prints its binary
representation. Here are some examples:
¹ “Single” was defined by an old version of the IEEE 754 standard. It is now officially called
“binary32”. See https://ieeexplore.ieee.org/document/8766229.
Input    Binary representation of the input    Output
1        1                                     00111111100000000000000000000000
-1       -1                                    10111111100000000000000000000000
0.5      0.1                                   00111111000000000000000000000000
0.25     0.01                                  00111110100000000000000000000000
0.125    0.001                                 00111110000000000000000000000000
1.5      1.1                                   00111111110000000000000000000000
1.25     1.01                                  00111111101000000000000000000000
1.125    1.001                                 00111111100100000000000000000000
Now let’s demystify the sequence of digits in the output. The output has three sections:
$$\underbrace{b_{31}}_{\text{sign}} \; \underbrace{b_{30} b_{29} \cdots b_{23}}_{\text{exponent (8 bits)}} \; \underbrace{b_{22} b_{21} \cdots b_{0}}_{\text{fraction/mantissa (23 bits)}}$$
where each bit $b_k$, $k = 0, 1, \cdots, 31$, can be either 0 or 1. Such a sequence represents the
binary number
$$(-1)^{b_{31}} \times 10^{\,b_{30} b_{29} \cdots b_{23} - 1111111} \times 1.b_{22} b_{21} \cdots b_{0}$$
or the decimal number
$$(-1)^{b_{31}} \times 2^{e-127} \times \left( 1 + \sum_{i=1}^{23} b_{23-i} 2^{-i} \right), \tag{2}$$
where
$$e = \sum_{i=0}^{7} b_{23+i} 2^{i}.$$
Example 5. The real number 9.15625 can be represented exactly in the single-precision
floating-point format. Since
$$(9.15625)_{10} = (1001.00101)_2 = (1.00100101 \times 10^{11})_2,$$
the mantissa is 00100101000000000000000 and the exponent is
$$(11 + 1111111)_2 = (10000010)_2.$$
Thus all the bits are 0 10000010 00100101000000000000000.
There are some exceptions, which can also be found with the bit-printing test program above:
Input    Output
0        00000000000000000000000000000000
-0       10000000000000000000000000000000
inf      01111111100000000000000000000000
-inf     11111111100000000000000000000000
nan      01111111110000000000000000000000
-nan     11111111110000000000000000000000
In fact, when the exponent is 11111111 and the mantissa is not zero, the result is always
interpreted as NaN or −NaN, which means “Not-a-Number”; when the mantissa is zero, the
result is inf or −inf, as in the table above. Therefore, the largest number that can be
represented exactly in the single-precision floating-point format is given by the digit
sequence 0 11111110 11111111111111111111111. The corresponding decimal number is
$$(2 - 2^{-23}) \times 2^{127} \approx 3.4028235 \times 10^{38}.$$
Another special case is a sequence whose exponent is 00000000. In this case,
the sequence denotes a denormal number, whose formula is different from (2). The binary
expression for denormal numbers is
$$(-1)^{b_{31}} \times 10^{-1111110} \times 0.b_{22} b_{21} \cdots b_{0},$$
and the corresponding decimal number is
$$(-1)^{b_{31}} \times 2^{-126} \times \sum_{i=1}^{23} b_{23-i} 2^{-i}. \tag{3}$$
Exercise 3. Show that the sequence 0 11111110 11111111111111111111111 represents the
decimal number $(2 - 2^{-23}) \times 2^{127}$.
Exercise 4. What is the nonzero number with the smallest absolute value that can be
represented in the single-precision floating-point format?
Now we consider the representation of 0.23. The first step is to determine the sign bit,
which is clearly $b_{31} = 0$. To determine the exponent, we need to find an integer $e$ such that
$$1 \times 2^{e-127} \leqslant 0.23 < 2 \times 2^{e-127}.$$
The solution is
$$e = \lfloor \log_2 0.23 \rfloor + 127 = 124 = (1111100)_2.$$
The mantissa must be an approximation of
$$\frac{0.23}{2^{-3}} = 1.84.$$
Therefore, to find the fraction, we just need to represent 1.84 in the binary numeral system.
This can be done by the following steps:
1. The integer part is 1 since $1_2 < 1.84 < (10)_2 \, (= 2)$.
2. The first fractional digit is 1 since $(1.1)_2 \, (= 1.5) < 1.84$.
3. The second fractional digit is 1 since $(1.11)_2 \, (= 1.75) < 1.84$.
4. The third fractional digit is 0 since $(1.110)_2 \, (= 1.75) < 1.84 < (1.111)_2 \, (= 1.875)$.
··· ··· ···
24. The 23rd fractional digit is 0 since
$$(1.11010111000010100011110)_2 \, (\approx 1.83999991) < 1.84 < (1.11010111000010100011111)_2 \, (\approx 1.84000003).$$
One possible approximation is to discard all the remaining digits. Such a method is called
chopping. Thus in the single-precision floating-point format, the digit sequence is
0 01111100 11010111000010100011110. (4)
However, if we enter 0.23 into the bit-printing C program given earlier in this section, the
output is
00111110011010111000010100011111. (5)
The last digit differs from our prediction (4). In fact, the number (5) is closer to 0.23 than
(4):
$$(1.11010111000010100011110 \times 10^{-11})_2 - 0.23 \approx -1.07288 \times 10^{-8},$$
$$(1.11010111000010100011111 \times 10^{-11})_2 - 0.23 \approx 4.17233 \times 10^{-9}.$$
The method that approximates a given number by the closest number in the single-precision
floating-point format is called rounding. In the above case, rounding for single precision
can be understood in two different ways:
1. We add $2^{-24}$ to the number 1.84 and then chop the result.
2. We look at the 24th fractional digit of the number 1.84. If it is 1, we add $2^{-23}$ to the
result of chopping; otherwise, we just take the result of chopping.
