IEEE 754 Standards - "float" and "double"

C# Tutorials - Herong's Tutorial Examples

This section describes how the IEEE 754 standards represent floating-point values of 'float' and 'double' in binary format.

In order to answer some of the questions raised in the previous sections, let's take a close look at how real numbers are represented in a computer system. In C#, the IEEE 754 single-precision and double-precision standards are used to represent "float" and "double" type values respectively. We will discuss the storage representation of "decimal" later in this book.

Since the IEEE 754 standards are widely used in the computer industry, there are many publications available that discuss them. In this section, I will explain only some basic rules of the standards.

Rule 1: The binary representation is divided into 3 components, with a different number of bits assigned to each component:

                   Sign   Exponent   Fraction   Total
Single-Precision      1          8         23      32
Double-Precision      1         11         52      64

Since the single-precision and double-precision formats differ only in the number of bits used in the exponent and fraction components (and, as a result, in the exponent bias: 127 for single-precision and 1023 for double-precision), we will use only the single-precision format in the discussion from now on.
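To see the 3 components of Rule 1 on a real value, here is a minimal C# sketch. The class name FloatBits and its helper methods are my own illustrative names, not part of .NET; only BitConverter and Convert are library calls. It reinterprets a "float" value as a 32-bit pattern and cuts out the 1-bit sign, 8-bit exponent and 23-bit fraction components with shifts and masks:

```csharp
using System;

public static class FloatBits
{
    // Returns the raw 32-bit pattern of a float as an unsigned integer.
    public static uint ToBits(float x)
    {
        return BitConverter.ToUInt32(BitConverter.GetBytes(x), 0);
    }

    // Bit 31: the sign component.
    public static uint SignBits(float x) { return ToBits(x) >> 31; }

    // Bits 30..23: the 8-bit exponent component.
    public static uint ExponentBits(float x) { return (ToBits(x) >> 23) & 0xFFu; }

    // Bits 22..0: the 23-bit fraction component.
    public static uint FractionBits(float x) { return ToBits(x) & 0x7FFFFFu; }

    public static void Main()
    {
        float x = -0.75f;
        Console.WriteLine("sign     = " + Convert.ToString((long)SignBits(x), 2));
        Console.WriteLine("exponent = "
            + Convert.ToString((long)ExponentBits(x), 2).PadLeft(8, '0'));
        Console.WriteLine("fraction = "
            + Convert.ToString((long)FractionBits(x), 2).PadLeft(23, '0'));
    }
}
```

For -0.75f, this should print a sign bit of 1, an exponent pattern of 01111110, and a fraction pattern of 1 followed by 22 zeros.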

Rule 2: The single bit in the sign component represents a sign value (s) through the binary sign convention:

          Sign     (s)
   Bit Pattern   Value
             0       1
             1      -1

Rule 3: The 8-bit combination in the exponent component represents an exponent value (e) through the binary integral number convention:

      Exponent     (e)
   Bit Pattern   Value
      00000001       1 <-- 2^0
      00000010       2 <-- 2^1
      00000011       3 <-- 2^1 + 2^0
           ...     ...
      11111110     254

Note that the bit patterns of all 0s and all 1s are not listed here, because they are reserved for special purposes, which will be discussed later.
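As a quick check of Rule 3, the sketch below prints the exponent component of a few "float" values. ExponentField and ExponentBits are hypothetical names I picked for this illustration. The last two lines only show that the two reserved patterns do appear for special values such as 0 and infinity, without going into their meaning here:

```csharp
using System;

public static class ExponentField
{
    // Returns the 8-bit exponent component of a float as an integer (0..255).
    public static int ExponentBits(float x)
    {
        uint bits = BitConverter.ToUInt32(BitConverter.GetBytes(x), 0);
        return (int)((bits >> 23) & 0xFFu);
    }

    public static void Main()
    {
        float[] samples = { 0.5f, 1.0f, 2.0f, 3.0f, float.MaxValue };
        foreach (float x in samples)
        {
            int e = ExponentBits(x);
            Console.WriteLine(x + ": "
                + Convert.ToString(e, 2).PadLeft(8, '0') + " = " + e);
        }

        // The reserved patterns: all 0s and all 1s.
        Console.WriteLine("0.0f      : " + ExponentBits(0.0f));
        Console.WriteLine("+Infinity : " + ExponentBits(float.PositiveInfinity));
    }
}
```

The stored values for 0.5f, 1.0f and 2.0f should be 126, 127 and 128, and the largest "float" value should use 254, the last pattern listed in the table above.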

Rule 4: The 23-bit combination in the fraction component represents a fraction value through the binary fractional number convention:

                    Fraction   (f)
                 Bit Pattern   Value
   0000000 00000000 00000000   .0
   1000000 00000000 00000000   .5   <-- 2^(-1)
   0100000 00000000 00000000   .25  <-- 2^(-2)
   1100000 00000000 00000000   .75  <-- 2^(-1) + 2^(-2)
                         ...   ...
   1111111 11111111 11111111   .?
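The fraction values in this table can be recomputed with a small C# sketch. FractionField and FractionValue are illustrative names of my own. The method simply adds up 2^(-1), 2^(-2), ..., 2^(-23) for every bit that is set, so it also gives a concrete number for the last row marked with ".?":

```csharp
using System;

public static class FractionField
{
    // Interprets a 23-bit pattern as a binary fraction 0.f:
    // the leftmost bit is worth 2^(-1), the rightmost bit 2^(-23).
    public static double FractionValue(uint fractionBits)
    {
        double f = 0.0;
        double weight = 0.5;                      // starts at 2^(-1)
        for (int i = 22; i >= 0; i--)
        {
            if (((fractionBits >> i) & 1u) == 1u)
                f += weight;
            weight /= 2.0;                        // next bit is worth half
        }
        return f;
    }

    public static void Main()
    {
        Console.WriteLine(FractionValue(0x000000u));   // table row .0
        Console.WriteLine(FractionValue(0x400000u));   // table row .5
        Console.WriteLine(FractionValue(0x200000u));   // table row .25
        Console.WriteLine(FractionValue(0x600000u));   // table row .75
        Console.WriteLine(FractionValue(0x7FFFFFu));   // all 1s: 1 - 2^(-23)
    }
}
```

The all-1s pattern works out to 1 - 2^(-23), which is approximately .99999988. The exact digits printed depend on the .NET version and culture settings.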

Rule 5: Putting all 3 components together, the single-precision 32-bit pattern represents a real number (r) through the following expression:

   r = s * 1.f * 2^(e-127)

Note that a number expressed in this form is called a normalized number, because its binary point has been normalized to the position right after the leading 1. Also note that the leading 1 is not stored anywhere.
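Rule 5 can be verified with a short C# sketch that takes the bit pattern of a "float" value apart and rebuilds the value with the expression above. NormalizedDecoder, Decode and PowerOfTwo are hypothetical names used only for this illustration. I compute 2^(e-127) with a simple loop rather than Math.Pow(), so that every step is exact in "double" arithmetic:

```csharp
using System;

public static class NormalizedDecoder
{
    // Computes 2^n exactly by repeated multiplication or division by 2.
    static double PowerOfTwo(int n)
    {
        double p = 1.0;
        for (int i = 0; i < Math.Abs(n); i++)
            p = n > 0 ? p * 2.0 : p / 2.0;
        return p;
    }

    // Applies r = s * 1.f * 2^(e-127) to the bit pattern of x.
    public static double Decode(float x)
    {
        uint bits = BitConverter.ToUInt32(BitConverter.GetBytes(x), 0);
        int s = (bits >> 31) == 0u ? 1 : -1;           // Rule 2
        int e = (int)((bits >> 23) & 0xFFu);           // Rule 3
        double f = (bits & 0x7FFFFFu) / 8388608.0;     // Rule 4: bits / 2^23

        // All 0s and all 1s in the exponent are reserved patterns.
        if (e == 0 || e == 255)
            throw new ArgumentException("Not a normalized number.");

        return s * (1.0 + f) * PowerOfTwo(e - 127);    // leading 1 is implied
    }

    public static void Main()
    {
        float[] samples = { 1.0f, -0.75f, 3.14f, 1.0e-10f };
        foreach (float x in samples)
        {
            double r = Decode(x);
            Console.WriteLine(x + " -> " + r + ", match: " + (r == (double)x));
        }
    }
}
```

Each sample should report a match, since a normalized "float" value and the expression s * 1.f * 2^(e-127) describe exactly the same real number.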

Now let's look at the range of positive normalized numbers:

                                     s   e  f    r
0 00000001 0000000 00000000 00000000 1   1 .0 -> 1 * 1.0 * 2^(1-127)
0 00000001 1000000 00000000 00000000 1   1 .5 -> 1 * 1.5 * 2^(1-127)
...
0 01111111 0000000 00000000 00000000 1 127 .0 -> 1 * 1.0 * 2^(127-127)
0 01111111 1000000 00000000 00000000 1 127 .5 -> 1 * 1.5 * 2^(127-127)
...
0 11111110 0000000 00000000 00000000 1 254 .0 -> 1 * 1.0 * 2^(254-127)
0 11111110 1000000 00000000 00000000 1 254 .5 -> 1 * 1.5 * 2^(254-127)
...
0 11111110 1111111 11111111 11111111 1 254 .? -> 1 * 1.? * 2^(254-127)
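Some rows of this table can be reproduced by going the other way: building a "float" value directly from a bit pattern. The sketch below is only an illustration; NormalizedRange and FromBits are names I made up, and the hexadecimal constants are the table's bit patterns written in base 16:

```csharp
using System;

public static class NormalizedRange
{
    // Builds a float from a raw 32-bit pattern.
    public static float FromBits(uint bits)
    {
        return BitConverter.ToSingle(BitConverter.GetBytes(bits), 0);
    }

    public static void Main()
    {
        // 0 00000001 0000000 00000000 00000000: 1.0 * 2^(1-127)
        Console.WriteLine(FromBits(0x00800000u));
        // 0 01111111 0000000 00000000 00000000: 1.0 * 2^(127-127)
        Console.WriteLine(FromBits(0x3F800000u));
        // 0 01111111 1000000 00000000 00000000: 1.5 * 2^(127-127)
        Console.WriteLine(FromBits(0x3FC00000u));
        // 0 11111110 0000000 00000000 00000000: 1.0 * 2^(254-127)
        Console.WriteLine(FromBits(0x7F000000u));
        // 0 11111110 1111111 11111111 11111111: 1.? * 2^(254-127)
        Console.WriteLine(FromBits(0x7F7FFFFFu));

        // The last pattern should be the largest finite float value.
        Console.WriteLine(FromBits(0x7F7FFFFFu) == float.MaxValue);
    }
}
```

The first pattern gives the smallest positive normalized number, roughly 1.18E-38, and the last pattern gives float.MaxValue, roughly 3.40E+38.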

Rule 6: When the exponent component stores all 0s, the 3 components are put together to represent a denormalized number through the following expression:

   r = s * 0.f * 2^(1-127)

This expression is designed to extend the range of normalized numbers toward 0, filling the gap between 0 and the smallest positive normalized number, 1.0 * 2^(1-127):

                                     s e  f     r
0 00000000 0000000 00000000 00000000 1 0 .0 -> 1 * 0.0 * 2^(1-127)
0 00000000 1000000 00000000 00000000 1 0 .5 -> 1 * 0.5 * 2^(1-127)
...
0 00000000 1111111 11111111 11111111 1 0 .? -> 1 * 0.? * 2^(1-127)
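The denormalized expression of Rule 6 can be checked in the same way. In the sketch below, DenormalizedDemo, FromBits and DecodeDenormalized are hypothetical names for illustration. I also assume that the runtime and hardware do not flush denormalized values to zero, which is the usual behavior for .NET but not something this sketch can guarantee:

```csharp
using System;

public static class DenormalizedDemo
{
    // Builds a float from a raw 32-bit pattern.
    public static float FromBits(uint bits)
    {
        return BitConverter.ToSingle(BitConverter.GetBytes(bits), 0);
    }

    // Applies r = s * 0.f * 2^(1-127) to a pattern whose exponent bits are all 0s.
    public static double DecodeDenormalized(uint bits)
    {
        if (((bits >> 23) & 0xFFu) != 0u)
            throw new ArgumentException("Exponent bits must be all 0s.");

        int s = (bits >> 31) == 0u ? 1 : -1;
        double f = (bits & 0x7FFFFFu) / 8388608.0;     // 23-bit fraction / 2^23

        double scale = 1.0;
        for (int i = 0; i < 126; i++)
            scale /= 2.0;                              // 2^(1-127)

        return s * f * scale;                          // no implied leading 1
    }

    public static void Main()
    {
        uint[] patterns = { 0x00000000u, 0x00000001u, 0x00400000u, 0x007FFFFFu };
        foreach (uint p in patterns)
        {
            double r = DecodeDenormalized(p);
            Console.WriteLine(p.ToString("X8") + " -> " + r
                + ", match: " + (r == (double)FromBits(p)));
        }

        // The smallest positive denormalized number is float.Epsilon.
        Console.WriteLine(FromBits(0x00000001u) == float.Epsilon);
    }
}
```

The pattern with only the last fraction bit set represents 2^(-23) * 2^(1-127), which is the value C# exposes as float.Epsilon, and the pattern with all fraction bits set lands just below the smallest positive normalized number.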