Floating Point Formats

There follows a description of IEEE and IBM 370 floating point formats. This information was more relevant in the days of the Hitachi S3600 at the HPCF, and is not even guaranteed to be correct, although corrections are welcome.

General comments

Floating point numbers are stored in a very similar fashion to the "scientific" or "exponential" notation in routine use. There is a sign, a mantissa, and an exponent.

- 5.1276 * 10^3 = -5127.6
|   |         |
|   |         exponent
|   mantissa
sign

Differences arise in the choice of base (here 10), the range of the mantissa (here 1 <= m < 10), and the number of figures given to the mantissa and exponent. Note that in binary the corresponding choice of mantissa, 1 <= m < 10 (base 2, i.e. 1 <= m < 2 in decimal), guarantees that the first digit of the mantissa is 1. Therefore this 1 is often suppressed when storing the mantissa.
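As an illustration of the sign / mantissa / exponent split in base 2, here is a minimal Python sketch. The helper name `decompose` is made up for this example; it relies on the standard `math.frexp`, which returns a mantissa in the range 0.5 <= m < 1, so the result is rescaled to 1 <= m < 2.

```python
import math

def decompose(x):
    """Split a finite, non-zero float into (sign, mantissa, exponent)
    such that x == sign * mantissa * 2**exponent and 1 <= mantissa < 2."""
    sign = -1 if math.copysign(1.0, x) < 0 else 1
    m, e = math.frexp(abs(x))    # frexp gives 0.5 <= m < 1
    return sign, m * 2, e - 1    # rescale so that 1 <= mantissa < 2

print(decompose(3.0))            # (1, 1.5, 1)
print(decompose(-5127.6))
```

The first digit of the binary mantissa returned here is always 1, which is why a storage format can leave it out.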

IEEE

As used by babbage, and (almost) every other current computer, excluding IBM mainframes, VAXes, the Hitachi S3600, Crays (e.g. the XMP and C90) and Intel native mode floating-point.

                            Sign  exponent  mantissa  base exp offset

32 bit single precision      0     1-8       9-31      2     127
64 bit double precision      0     1-11     12-63      2    1023

Bit 0 is the most significant bit. The leading 1 of the mantissa is suppressed.
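The layout in the table can be checked by pulling the three fields out of a number's bit pattern. The following is a sketch only: the function name `ieee_fields` is an assumption of this example, and the standard `struct` module is used to obtain the bits in big-endian order.

```python
import struct

def ieee_fields(x, double=False):
    """Return (sign, stored exponent, stored mantissa) as integers."""
    if double:
        (bits,) = struct.unpack(">Q", struct.pack(">d", x))
        exp_bits, man_bits = 11, 52
    else:
        (bits,) = struct.unpack(">I", struct.pack(">f", x))
        exp_bits, man_bits = 8, 23
    sign = bits >> (exp_bits + man_bits)
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    return sign, exponent, mantissa

print(ieee_fields(3.0))               # (0, 128, 4194304)
print(ieee_fields(3.0, double=True))
```

The stored exponent includes the offset, and the stored mantissa excludes the suppressed leading 1.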

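Various "special" numbers are also representable in IEEE format. These include the infinities, NaNs, the largest finite values (OVER) and the smallest normalised values (UNDER):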

            double             single

+INF     7FF0000000000000     7F800000
-INF     FFF0000000000000     FF800000
NaN      7FF0000000000001     7F800001
               to               to
         7FFFFFFFFFFFFFFF     7FFFFFFF
               and              and
         FFF0000000000001     FF800001
               to               to
         FFFFFFFFFFFFFFFF     FFFFFFFF
+OVER    7FEFFFFFFFFFFFFF     7F7FFFFF
-OVER    FFEFFFFFFFFFFFFF     FF7FFFFF
+UNDER   0010000000000000     00800000
-UNDER   8010000000000000     80800000
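The hexadecimal patterns in the table can be turned back into values with a few lines of Python. This is a sketch; the helper names `single_from_hex` and `double_from_hex` are invented for the example, and the patterns are read in big-endian order as written above.

```python
import math
import struct

def single_from_hex(h):
    """Interpret 8 hex digits as a big-endian IEEE single."""
    return struct.unpack(">f", bytes.fromhex(h))[0]

def double_from_hex(h):
    """Interpret 16 hex digits as a big-endian IEEE double."""
    return struct.unpack(">d", bytes.fromhex(h))[0]

print(single_from_hex("7F800000"))                # inf
print(double_from_hex("FFF0000000000000"))        # -inf
print(math.isnan(single_from_hex("7F800001")))    # True
print(double_from_hex("7FEFFFFFFFFFFFFF"))        # largest finite double
print(single_from_hex("00800000") == 2.0 ** -126) # True
```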

Example (single precision):
3 (base 10) = 1.5 * 2^1  (base 10)
            = 1.1 * 10^1 (base 2)

Store: sign bit as 0                            (+)
       exponent as 1000 0000                    (=128, as offset of 127 added)
       mantissa as 100 0000 0000 0000 0000 0000 (suppress leading 1)
Giving:
       0100 0000 0100 0000 0000 0000 0000 0000
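The same encoding of 3 can be assembled by hand and compared with what the machine produces. A small sketch, assuming the big-endian byte order used in the text; the variable names are only illustrative.

```python
import struct

sign = 0
exponent = 1 + 127            # true exponent 1 plus the offset of 127
mantissa = 0b100 << 20        # binary .100..., leading 1 suppressed, 23 bits
bits = (sign << 31) | (exponent << 23) | mantissa

print(f"{bits:032b}")         # 01000000010000000000000000000000
print(f"{bits:08X}")          # 40400000
print(struct.pack(">f", 3.0) == bits.to_bytes(4, "big"))   # True
```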

The single precision range for normalised numbers is approximately 1.2e-38 (2^-126) to 3.4e38 (just under 2^128), and the double precision range approximately 2.2e-308 (2^-1022) to 1.8e308 (just under 2^1024).
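These limits follow directly from the exponent offsets and mantissa widths, and can be recomputed as a check. A sketch only: `to_single` is a hypothetical helper that rounds a Python float (a double) to single precision and back.

```python
import struct
import sys

def to_single(x):
    """Round a Python float (a double) to single precision and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

single_min = 2.0 ** -126                        # smallest normalised single
single_max = (2.0 - 2.0 ** -23) * 2.0 ** 127    # largest finite single
double_min = 2.0 ** -1022                       # smallest normalised double
double_max = (2.0 - 2.0 ** -52) * 2.0 ** 1023   # largest finite double

print(f"{single_min:.2e} {single_max:.2e}")     # 1.18e-38 3.40e+38
print(f"{double_min:.2e} {double_max:.2e}")     # 2.23e-308 1.80e+308
print(double_max == sys.float_info.max)         # True
```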

Denormals

The smallest stored exponent is one, representing -126 in single precision, or -1022 in double precision. Numbers with a stored exponent of zero are said to be denormalised. The exponent is considered to be -126 / -1022, and the mantissa is interpreted with a suppressed leading zero (alternatively, the exponent is considered to be -127 / -1023, and the mantissa is fully stored).

Some machines flush denormals to zero, others calculate with them, but note that precision will be lost, as (many) leading digits of the mantissa will be zero.

The minimum denormals are 2^-149 (single precision, approx 1.4e-45) and 2^-1074 (double precision, approx 5e-324).
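The first interpretation given above (exponent fixed at -126, suppressed leading zero) can be written down directly for single precision. This is a sketch with an invented helper name, `single_denormal`, which takes the 23-bit stored mantissa as an integer.

```python
import math
import struct

def single_denormal(mantissa):
    """Value of a positive single with stored exponent 0 and the given
    23-bit stored mantissa: 0.mantissa (binary) * 2**-126."""
    return math.ldexp(mantissa, -126 - 23)

smallest = struct.unpack(">f", bytes.fromhex("00000001"))[0]
print(single_denormal(1) == smallest)       # True
print(single_denormal(1) == 2.0 ** -149)    # True
print(2.0 ** -1074)                         # 5e-324
```

The smallest denormal has only one significant bit, which is the loss of precision mentioned above.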

IBM 370

As used by turing (Hitachi S3600), and, of course, the IBM 370.

                            Sign  exponent  mantissa  base exp offset

32 bit single precision      0     1-7       8-31      16     64
64 bit double precision      0     1-7       8-63      16     64

The mantissa may have up to 3 leading zero bits, even when normalised, because normalisation only requires the leading hexadecimal digit to be non-zero. Denormalised values are valid.

Example (single precision):
3 (base 10) = 0.1875 * 16^1 (base 10)
            = 0.0011 * 10000^1 (base 2, exponent base 16)

Store: sign bit  0                             (+)
       exponent  100 0001                      (=65, as offset of 64 added)
       mantissa  0011 0000 0000 0000 0000 0000
Giving:
       0100 0001 0011 0000 0000 0000 0000 0000
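A 32 bit pattern in this format can be converted to a native number by following the table above. A minimal sketch, assuming the pattern is supplied as an integer; the function name `ibm32_to_float` is made up for this example.

```python
def ibm32_to_float(bits):
    """Convert a 32 bit IBM 370 single precision pattern to a Python float."""
    sign = -1.0 if bits >> 31 else 1.0
    exponent = (bits >> 24) & 0x7F                  # 7 bits, offset by 64
    mantissa = (bits & 0xFFFFFF) / float(1 << 24)   # fraction, 0 <= m < 1
    return sign * mantissa * 16.0 ** (exponent - 64)

print(ibm32_to_float(0x41300000))   # 3.0
print(ibm32_to_float(0xC1300000))   # -3.0
```

Unlike IEEE, there is no suppressed digit: all 24 mantissa bits are stored and the mantissa is a pure fraction.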

Ranges

Approximate ranges are as follows:

                 Range              Precision
                            (binary digits)  (decimal digits)

IEEE 32 bit      10^38           23                7
IEEE 64 bit      10^308          52               15
IBM  32 bit      10^75          21-24             6-7
IBM  64 bit      10^75          53-56            15-16

The IEEE figures count stored mantissa bits; the suppressed leading 1 adds one further bit of precision.

Please note also that the IBM arithmetic "rounds" by truncation, whereas IEEE normally rounds to the closest value. Errors can build up rapidly under the former scheme.
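The effect of truncation can be illustrated with a toy model. The sketch below is not IBM arithmetic: it simply keeps 24 significant binary digits after every addition, either by truncating or by rounding to nearest, and the helper names `quantize` and `running_sum` are assumptions of this example.

```python
import math

def quantize(x, bits, truncate):
    """Keep `bits` significant binary digits of a positive float."""
    m, e = math.frexp(x)            # x == m * 2**e with 0.5 <= m < 1
    scaled = m * (1 << bits)        # exact: scaling by a power of two
    n = math.floor(scaled) if truncate else round(scaled)
    return math.ldexp(n, e - bits)

def running_sum(value, count, bits, truncate):
    """Add `value` to itself `count` times, quantizing after every step."""
    step = quantize(value, bits, truncate)
    total = 0.0
    for _ in range(count):
        total = quantize(total + step, bits, truncate)
    return total

print(running_sum(0.1, 1000, 24, truncate=True))
print(running_sum(0.1, 1000, 24, truncate=False))
```

With truncation every error has the same sign, so the truncated sum can never exceed the exact answer of 100 and never exceeds the rounded sum either; rounding to nearest allows errors of both signs, which can partly cancel.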

Intel native

This is mostly IEEE, but is 80 bits wide, with a 64 bit mantissa whose leading one is stored explicitly, and a 15 bit exponent offset by 16383. Denormals have an explicit leading zero as well as a stored exponent of zero, and the stored exponent cannot be zero with an explicit leading one in the mantissa(?!).
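The 80 bit layout can be decoded in the same way as the other formats. This is a sketch under stated assumptions: the 10 bytes are given most significant byte first (x86 memory holds them in the opposite order), infinities and NaNs are not handled, and the result is converted to a double, so very large or very small values will overflow or lose precision. The name `x87_to_float` is invented here.

```python
import math
import struct

def x87_to_float(raw):
    """Decode 10 big-endian bytes: sign + 15 bit exponent, then a
    64 bit mantissa with an explicit leading digit."""
    sign_exp, mantissa = struct.unpack(">HQ", raw)
    sign = -1.0 if sign_exp >> 15 else 1.0
    exponent = sign_exp & 0x7FFF
    if exponent == 0:
        exponent = 1    # denormals use the smallest normal exponent
    # The mantissa is an integer here, so scale by a further 2**-63.
    return sign * math.ldexp(float(mantissa), exponent - 16383 - 63)

print(x87_to_float(bytes.fromhex("4000C000000000000000")))   # 3.0
print(x87_to_float(bytes.fromhex("BFFF8000000000000000")))   # -1.0
```

In the pattern for 3, the mantissa C000... begins with the bits 11, the first of which is the explicit leading one that IEEE single and double precision would suppress.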