Standards for Floating-Point Arithmetic
We will now look at the details of the single and double precision formats.
Although you can get by without knowing the internals, this information can
help you understand some of the nuances of working with floating-point numbers.
In order to keep the discussion clear, we will at first only consider the
single precision format. The double precision format is similar, and will be
summarized later.
Normalized numbers
A typical binary floating-point number has the form $s \times (m / 2^{N-1}) \times 2^{e}$,
where s is either -1 or +1, m and e are the mantissa (or significand) and
exponent mentioned earlier, and N is the number of bits in the significand,
which is a constant for a specific number format. For single-precision
numbers, N = 24. The numbers s, m and e are packed into 32 bits. The layout
is shown below:
| part | sign | exponent | fraction |
|---|---|---|---|
| bit # | 31 | 23-30 | 0-22 |
The sign s is stored in the most significant bit. A value of 0
indicates a positive value, while 1 indicates a negative value.
The exponent field is an 8-bit unsigned integer called the biased exponent.
It is equal to the exponent e plus a constant called the bias,
which has a value of 127 for single-precision numbers. This means that, for
example, an exponent of -44 is stored as -44 + 127 = 83, or 01010011 in binary. There are two
reserved exponent values: 0 and 255. The reason for this will be explained
shortly. As a result, the smallest actual exponent is -126, and the largest is
+127.
The number format appears to be ambiguous: You can multiply m by two
and subtract 1 from e and get the same number. This ambiguity is
resolved by minimizing the exponent and maximizing the size of the significand.
This process is called normalization. As a result, the significand m
always has 24 bits, with the leading bit always equal to 1. Since we know it is
always equal to 1, we don't have to store this bit, and so we end up with the
significand taking up only 23 bits instead of 24.
Put another way, normalization means that the number $m / 2^{N-1}$
is always at least 1 and less than 2. The 23 stored bits are also what comes after the
binary point when the significand is divided by $2^{N-1}$. For
this reason, these bits are sometimes called the fraction.
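The packing described above can be checked with a short sketch. It assumes Python's standard `struct` module is used to reinterpret the four bytes of a single-precision value; the helper names `decode_float32` and `normalized_value` are made up for this illustration.

```python
import struct

def decode_float32(x):
    """Split x (rounded to single precision) into its three bit fields."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # bit 31
    biased_exp = (bits >> 23) & 0xFF   # bits 23-30
    fraction = bits & 0x7FFFFF         # bits 0-22
    return sign, biased_exp, fraction

def normalized_value(sign, biased_exp, fraction):
    """Rebuild a normalized number as s * (m / 2^23) * 2^e."""
    if not 1 <= biased_exp <= 254:
        raise ValueError("biased exponents 0 and 255 are reserved")
    s = -1 if sign else 1
    m = (1 << 23) | fraction           # restore the implicit leading 1 bit
    e = biased_exp - 127               # remove the bias
    return s * (m / 2**23) * 2.0**e

sign, biased_exp, fraction = decode_float32(-6.5)
print(sign, format(biased_exp, "08b"), format(fraction, "023b"))
print(normalized_value(sign, biased_exp, fraction))
```

Since -6.5 = -1.625 × 2^2, the biased exponent comes out as 129 and the fraction holds the bits of 0.625 followed by zeros.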
Zero and subnormal numbers
At this point, you may wonder how the number zero is stored. After all, neither m
nor s can be zero, and so their product cannot be zero either. The
answer is that 0 is a special number with a special representation. In fact, it
has two representations!
The numbers we have been describing so far, whose significand has maximum
length, are called normalized numbers. They make up the vast
majority of the values that the floating-point format can represent. The smallest
positive normalized value is $2^{23} \cdot 2^{-126+1-24} = 2^{-126} \approx$ 1.1755e-38. The
largest value is $(2^{24}-1) \cdot 2^{127+1-24} \approx$ 3.4028e+38.
Recall that the biased exponent has two reserved values. The biased exponent 0
is used to represent the number zero as well as subnormal or denormalized
numbers. These are numbers whose significand is not normalized and has a
maximum length of 23 bits. The actual exponent is fixed at -126, so the
significand is scaled by $2^{-126+1-24} = 2^{-149}$,
resulting in a smallest positive number of $2^{-149} \approx$ 1.4013e-45.
When both the biased exponent and the significand are zero, the resulting value
is equal to 0. Changing the sign of zero does not change its value, so we have
two possible representations of zero: one with a positive sign, and one with a
negative sign. As it turns out, it is meaningful to have a 'negative
zero' value. Although its value equals the value of normal 'positive zero,' it
behaves differently in some situations, which we will get into shortly.
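The limits quoted in this section can be reproduced from their bit patterns. The following is a minimal sketch, assuming Python's `struct` module; `float32_from_bits` is a hypothetical helper written for this example.

```python
import math
import struct

def float32_from_bits(bits):
    """Interpret a 32-bit unsigned integer as a single-precision number."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

smallest_normal = float32_from_bits(0x00800000)     # biased exponent 1, fraction 0
largest_normal = float32_from_bits(0x7F7FFFFF)      # biased exponent 254, fraction all ones
smallest_subnormal = float32_from_bits(0x00000001)  # biased exponent 0, fraction 1
negative_zero = float32_from_bits(0x80000000)       # sign bit set, all other bits 0

print(smallest_normal == 2.0**-126)                 # True
print(largest_normal == (2**24 - 1) * 2.0**104)     # True
print(smallest_subnormal == 2.0**-149)              # True

# Negative zero compares equal to positive zero but keeps its sign.
print(negative_zero == 0.0)                         # True
print(math.copysign(1.0, negative_zero))            # -1.0
```

Python floats are double precision, so every single-precision value converts without loss, which is why the exact comparisons hold.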
Infinities and Not-a-Number
We still need to explain the use of the other reserved biased exponent value of
255. This exponent is used to represent infinities and Not-a-Number values.
If the biased exponent is all 1's (i.e. equal to 255) and the significand is all
0's, then the number represents infinity. The sign bit indicates whether we're
dealing with positive or negative infinity. These numbers are returned for
operations that either do not have a finite value (e.g. 1/0) or are too large
to be represented by a normalized number (e.g. $2^{1,000,000,000}$).
The sign of the result of a division by zero depends on the sign of both the numerator and the
denominator. If you divide +1 by negative zero, the result is negative
infinity. The same rule works in the other direction: if you divide -1 by
positive infinity, the result is negative zero.
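The infinity encodings and the sign rule can be illustrated as follows. Note one assumption about the language: Python raises `ZeroDivisionError` for a floating-point division by zero instead of returning an infinity, so this sketch builds the infinities from their bit patterns and only demonstrates the division by infinity. The helper `float32_from_bits` is made up for the example.

```python
import math
import struct

def float32_from_bits(bits):
    """Interpret a 32-bit unsigned integer as a single-precision number."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

# Biased exponent 255 with an all-zero fraction encodes infinity.
pos_inf = float32_from_bits(0x7F800000)   # sign bit 0
neg_inf = float32_from_bits(0xFF800000)   # sign bit 1
print(pos_inf, neg_inf)                   # inf -inf

# Dividing -1 by positive infinity gives zero with a negative sign.
quotient = -1.0 / pos_inf
print(quotient == 0.0)                    # True
print(math.copysign(1.0, quotient))       # -1.0
```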
If the significand is different from 0, the value represents a Not-a-Number
value or NaN. NaN's come in two flavors: signaling and non-signaling (or quiet).
On most current hardware, and as recommended by the 2008 revision of the
standard, they correspond to the leading bit in the stored significand being 0 and 1,
respectively. This distinction is rarely important in practice.
NaN's are produced when the result of a calculation does not exist (e.g. Math.Sqrt(-1)
is not a real number) or cannot be determined (infinity / infinity). One of the
peculiarities of NaN's is that all arithmetic operations involving NaN's return
a NaN, except when the result would be the same regardless of the value. For
example, the function hypot(x, y) = Math.Sqrt(x*x+y*y) with x
infinite always equals positive infinity, regardless of the value of y.
As a result, hypot(infinity, NaN) = infinity.
Also, any comparison of a NaN with any other number including NaN returns false.
The one exception is the inequality operator, which always returns true even if
the value being compared is also NaN!
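These rules are easy to observe directly. The sketch below uses Python's `math` module, whose `hypot` function follows the special-case behavior described above; the variable names are chosen only for this example.

```python
import math

nan = math.nan

# Arithmetic involving a NaN yields a NaN.
total = nan + 5.3
print(math.isnan(total))          # True

# hypot ignores the NaN when the other argument is infinite.
distance = math.hypot(math.inf, nan)
print(distance)                   # inf

# Every ordered comparison and the equality test return False.
print(nan == nan, nan < 1.0, nan > 1.0)   # False False False

# The inequality operator is the exception.
print(nan != nan)                 # True
```

A practical consequence is that `x != x` is true only when x is a NaN, although `math.isnan(x)` states the intent more clearly.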
The significand bits of a NaN can be set to an arbitrary value, sometimes called
the payload. The IEC 60559 standard specifies that the payload
should propagate through calculations. For example, when a NaN is added to a
normal number, say 5.3, then the result is a NaN with the same payload as the
first operand. When both operands are NaN's, then the resulting NaN
carries the payload of either one of the operands. This leaves the possibility
to pass on potentially useful information in NaN values. Unfortunately, this
feature is hardly ever used.
Some examples
Let's look at some numbers and their corresponding bit patterns.
| Number | Sign | Exponent | Fraction |
|---|---|---|---|
| 0 | 0 | 00000000 | 00000000000000000000000 |
| -0 | 1 | 00000000 | 00000000000000000000000 |
| 1 | 0 | 01111111 | 00000000000000000000000 |
| +Infinity | 0 | 11111111 | 00000000000000000000000 |
| NaN | 1 | 11111111 | 10000000000000000000000 |
| 3.141593 | 0 | 10000000 | 10010010000111111011100 |
| -3.141593 | 1 | 10000000 | 10010010000111111011100 |
| 100000 | 0 | 10001111 | 10000110101000000000000 |
| 0.00001 | 0 | 01101110 | 01001111100010110101100 |
| 1/3 | 0 | 01111101 | 01010101010101010101011 |
| 4/3 | 0 | 01111111 | 01010101010101010101011 |
| $2^{-144}$ | 0 | 00000000 | 00000000000000000100000 |
Notice the exponent field for 1 and 4/3. Both these numbers are between 1 and 2,
and so their unbiased exponent is zero. The biased exponent is therefore equal
to the bias, which is 127, or 01111111 in binary. Numbers equal to or larger than 2 have
biased exponents greater than 127. Numbers smaller than 1 have biased exponents
smaller than 127.
The last number in the table ($2^{-144}$) is denormalized. The biased
exponent is zero, and since $2^{-144} = 32 \cdot 2^{-149}$, the fraction
is $32 = 2^{5}$.
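Rows of the table can be regenerated with a few lines of code. This is a sketch that assumes Python's `struct` module rounds each value to single precision when packing; `bit_fields` is a hypothetical helper, and the sample values are taken from the table.

```python
import struct

def bit_fields(x):
    """Return the sign, exponent and fraction fields of x as bit strings."""
    # Pack as single precision, then read the same bytes as a 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    text = format(bits, "032b")
    return text[0], text[1:9], text[9:]

for label, value in [("1", 1.0), ("1/3", 1 / 3), ("4/3", 4 / 3),
                     ("100000", 100000.0), ("2^-144", 2.0**-144)]:
    sign, exponent, fraction = bit_fields(value)
    print(f"{label:>8} | {sign} | {exponent} | {fraction}")
```

The last line printed has an all-zero exponent field and a fraction equal to 32, matching the denormalized entry.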