Some Terminology
Before we do anything, we should define the words that are commonly used
in numerical computing.
Number Formats
A computer program is a model of something in the real world. Many things in the
real world are represented by numbers. Those numbers need a representation in
our computer program. This is where number formats come in.
From a programmer's point of view, a number format is a collection of numbers.
99.9% of the time, the binary or internal representation is not important. It
may be important when we represent non-numeric data, as with bit fields, but
that doesn't concern us here. What counts is only that it can represent the
numbers from the real world objects we are modeling. Some number
formats include certain special values to indicate invalid values or values
that are outside the range of the number format.
Integers
Most numbers are integers, which are easy to represent. Almost any integer
you'll encounter will fit into a "32-bit signed integer," which is a number in
the range -2,147,483,648 to 2,147,483,647. For some applications, like counting
the number of people in the world, you need the next wider format: 64-bit
integers. Their range is wide enough to count every tenth of a microsecond over
many millennia. (This is how a DateTime value is represented internally.)
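As a small illustration of these ranges, the following sketch prints the limits of the 32-bit and 64-bit signed integer types and estimates how many years a 64-bit tick counter can cover. The class and method names are made up for this example, and the year length of 365.25 days is an assumption used only for the estimate.

```csharp
using System;

public static class IntegerRanges
{
    // Approximate number of years that fit in a long when counting
    // ticks of 100 nanoseconds (a tenth of a microsecond).
    public static double YearsCoveredByTicks()
    {
        double seconds = (double)long.MaxValue / TimeSpan.TicksPerSecond;
        return seconds / (365.25 * 24 * 3600);   // assumes 365.25-day years
    }

    public static void Main()
    {
        // Limits of the 32-bit and 64-bit signed integer types.
        Console.WriteLine($"Int32: {int.MinValue} to {int.MaxValue}");
        Console.WriteLine($"Int64: {long.MinValue} to {long.MaxValue}");

        // DateTime and TimeSpan count ticks; there are ten million per second.
        Console.WriteLine($"Ticks per second: {TimeSpan.TicksPerSecond}");

        // Roughly 29,000 years of ticks fit in a signed 64-bit integer.
        Console.WriteLine($"Years covered: {YearsCoveredByTicks():F0}");
    }
}
```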
Many other numbers, like measurements, prices, and percentages, are real numbers
with digits after the decimal point. There are essentially two ways to
represent real numbers: fixed point and floating point.
Fixed-point formats
A fixed-point number is formed by multiplying an integer (the significand)
by some small scale factor, most often a negative power of 10 or 2. The name
derives from the fact that the decimal point is in a fixed position when the
number is written out. An example of a fixed-point format is the Currency type
in pre-.NET Visual Basic and the money type in SQL Server. These types have a
range of roughly +/-922 trillion with four digits after the decimal point. The
multiplier is 0.0001 and every multiple of 0.0001 within the defined range is
represented by this number format. Another example is found in the Network Time
Protocol (NTP), where time offsets are returned as 32-bit and 64-bit fixed-point
values with the 'binary' point at 16 and 32 bits, respectively.
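To make the idea concrete, here is a minimal sketch of a fixed-point type with a scale factor of 0.0001, stored as a 64-bit integer. The type Fixed4 and its members are hypothetical names invented for this illustration; this is not how Currency or money is actually implemented.

```csharp
using System;
using System.Globalization;

// Hypothetical fixed-point type: value = Raw * 0.0001.
public struct Fixed4
{
    private const long Scale = 10000;   // 1 / 0.0001

    // The integer significand that is actually stored.
    public long Raw { get; }

    private Fixed4(long raw) { Raw = raw; }

    // Converts a decimal to the nearest multiple of 0.0001.
    public static Fixed4 FromDecimal(decimal value)
    {
        return new Fixed4((long)Math.Round(value * Scale));
    }

    // Addition is plain integer addition of the significands.
    public static Fixed4 operator +(Fixed4 a, Fixed4 b)
    {
        return new Fixed4(a.Raw + b.Raw);
    }

    public decimal ToDecimal() { return (decimal)Raw / Scale; }

    public override string ToString()
    {
        return ToDecimal().ToString("0.0000", CultureInfo.InvariantCulture);
    }
}

public static class FixedPointDemo
{
    public static void Main()
    {
        Fixed4 price = Fixed4.FromDecimal(19.99m);
        Fixed4 fee = Fixed4.FromDecimal(0.1m);
        Fixed4 total = price + fee;

        Console.WriteLine(total.Raw);   // 200900
        Console.WriteLine(total);       // 20.0900
    }
}
```

Because only the integer is stored, the position of the decimal point never changes, which is exactly what gives the format its name.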
Fixed point works well for many applications. For financial calculations, it has
the added benefit that numbers such as 0.1 and 0.01 can be represented exactly
with a suitable choice of multiplier. However, it is not suited for many other
applications where a greater range is needed. Particle physicists commonly use
numbers smaller than 10^-20, while cosmologists estimate the number
of particles in the universe at around 10^85. It would be impractical
to represent numbers in this range in fixed point format. To cover the whole
range, a single number would take up at least 50 bytes!
Floating-point formats
This problem is solved with a floating point format. Floating-point numbers have
a variable scale factor, which is specified as the exponent of a power of a
small number called the base, which is usually 2 or 10. The .NET
Framework defines three floating-point types: Single, Double and Decimal.
That's right: the Decimal type does not use the fixed point format of the
Currency or money type. It uses a decimal floating-point format.
A floating-point number has three parts: a sign, a significand and an exponent.
The magnitude of the number equals the significand times the base raised to the
exponent. Actual storage formats vary. By reserving certain values of the
exponent, it is possible to define special values such as infinity and invalid
results. Integer and fixed point formats usually do not contain any special
values.
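The three parts can be inspected directly for the Double type. The sketch below assumes the IEEE 754 double-precision layout (1 sign bit, 11 exponent bits, 52 fraction bits), which goes beyond what the text above has introduced so far. The helper name Decompose is made up for this example.

```csharp
using System;

public static class DoubleParts
{
    // Splits a double into its sign bit, raw (biased) exponent field,
    // and 52-bit fraction field, assuming the IEEE 754 binary64 layout.
    public static (int Sign, int Exponent, long Fraction) Decompose(double value)
    {
        long bits = BitConverter.DoubleToInt64Bits(value);
        int sign = (int)((bits >> 63) & 1);
        int exponent = (int)((bits >> 52) & 0x7FF);
        long fraction = bits & 0xFFFFFFFFFFFFFL;
        return (sign, exponent, fraction);
    }

    public static void Main()
    {
        double[] samples = { 1.0, -2.5, double.PositiveInfinity, double.NaN };
        foreach (double d in samples)
        {
            var parts = Decompose(d);
            Console.WriteLine(
                $"{d}: sign={parts.Sign} exponent={parts.Exponent} fraction=0x{parts.Fraction:X13}");
        }
    }
}
```

In this layout, an exponent field with all bits set (2047) is the reserved value: a zero fraction then means infinity, and a nonzero fraction means an invalid result (NaN).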
Before we go into the details of real life formats, we need to define some more
terms.
Range, Precision and Accuracy
The range of a number format is the interval from the smallest number in the
format to the largest. The range of 16-bit signed integers is -32768 to 32767.
The range of double-precision floating-point numbers is (roughly) -1.8e+308 to
1.8e+308. Numbers outside a format's range cannot be represented directly.
Numbers within the range may not exist in the number format - infinitely many
don't. But at least there is always a number in the format that is fairly close
to our number.
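A short sketch of both situations follows: a value that falls outside the range of Double, and a value inside the range that is not itself a member of the format. The class and method names are illustrative only.

```csharp
using System;
using System.Globalization;

public static class RangeDemo
{
    // Doubles a value; used to push a number past the end of the range.
    public static double Twice(double x)
    {
        return x * 2.0;
    }

    public static void Main()
    {
        CultureInfo inv = CultureInfo.InvariantCulture;

        // The range of 16-bit signed integers.
        Console.WriteLine(short.MinValue + " to " + short.MaxValue);

        // The largest finite Double.
        Console.WriteLine(double.MaxValue.ToString("E3", inv));   // 1.798E+308

        // Going past the range yields the special value infinity.
        double tooBig = Twice(double.MaxValue);
        Console.WriteLine(double.IsPositiveInfinity(tooBig));      // True

        // 0.1 is well within the range but is not a Double itself;
        // the closest Double is stored instead.
        double tenth = 0.1;
        Console.WriteLine(tenth.ToString("G17", inv));             // 0.10000000000000001
    }
}
```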
Accuracy and precision are terms that are often confused, even though they have
significantly different meanings.
Precision is a property of a number format and refers to the amount of
information used to represent a number. Better or higher precision means more
numbers can be represented, and also means a better resolution: the numbers
that are represented by a higher precision format are closer together. 1.3330 is
a number represented with a precision of five decimal digits: one before and
four after the decimal point. 1.333000 is the same number represented
with 7-digit precision.
Precision can be absolute or relative. Integer types have an absolute precision
of 1. Every integer within the type's range is represented. Fixed point types,
like the Currency type in earlier versions of Visual Basic, also have an
absolute precision. For the Currency type, it is 0.0001, which means that every
multiple of 0.0001 within the type's range is represented.
Floating point formats use relative precision. This means that the precision is
constant relative to the size of the number. For example, 1.3331, 1.3331e+5 =
133310, and 1.3331e-3 = 0.0013331 all have 5 decimal digits of relative
precision.
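One way to see relative precision at work is to measure the gap between a Double and its nearest larger neighbor at different magnitudes. The sketch below relies on the fact that, for positive finite doubles, adding 1 to the bit pattern gives the next representable value. The helper name GapAbove is invented for this example.

```csharp
using System;

public static class SpacingDemo
{
    // Distance from a positive finite double (below double.MaxValue)
    // to the next larger representable double.
    public static double GapAbove(double x)
    {
        long bits = BitConverter.DoubleToInt64Bits(x);
        double next = BitConverter.Int64BitsToDouble(bits + 1);
        return next - x;
    }

    public static void Main()
    {
        double[] samples = { 1e-3, 1.0, 1e5, 1e15 };
        foreach (double x in samples)
        {
            double gap = GapAbove(x);
            // The absolute gap grows with x, but gap / x stays near 1e-16 to 2e-16.
            Console.WriteLine($"x={x:E1}  gap={gap:E2}  gap/x={gap / x:E2}");
        }
    }
}
```

The absolute spacing changes by many orders of magnitude across the samples, while the spacing relative to the number itself stays nearly constant.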
Precision is also a property of a calculation. Here, it refers to the number of
digits used in the calculation, and in particular also the precision used for
intermediate results. As an example, we calculate a simple expression with one
and two digit precision:
Using one digit precision:

  0.4 * 0.6 + 0.6 * 0.4
    = 0.24 + 0.24    Calculate products
    = 0.2 + 0.2      Round to 1 digit
    = 0.4            Final result

Using two digit precision:

  0.4 * 0.6 + 0.6 * 0.4
    = 0.24 + 0.24    Calculate products
    = 0.24 + 0.24    Keep the 2 digits
    = 0.48           Calculate sum
    = 0.5            Round to 1 digit

Comparing to the exact result (0.48), we see that using one digit precision gives
a result that is off by 0.08, while using two digit precision gives a result
that is off by only 0.02. One lesson learnt from this example is that it is
useful to use extra precision for intermediate calculations if that option is
available.
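The same calculation can be replayed in code. This sketch uses the Decimal type and Math.Round to imitate a calculation carried out with a limited number of digits after the decimal point. The method name Evaluate is made up for this illustration.

```csharp
using System;
using System.Globalization;

public static class IntermediatePrecision
{
    // Evaluates 0.4 * 0.6 + 0.6 * 0.4, rounding each product to the given
    // number of decimal places, then rounds the sum to one decimal place.
    public static decimal Evaluate(int intermediateDigits)
    {
        decimal p1 = Math.Round(0.4m * 0.6m, intermediateDigits);
        decimal p2 = Math.Round(0.6m * 0.4m, intermediateDigits);
        return Math.Round(p1 + p2, 1);
    }

    public static void Main()
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        Console.WriteLine(Evaluate(1).ToString(inv));   // 0.4
        Console.WriteLine(Evaluate(2).ToString(inv));   // 0.5
    }
}
```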
Accuracy is a property of a number in a specific context. It indicates
how close a number is to its true value in that context. Without the context,
accuracy is meaningless, in much the same way that "John is 25 years old" has
no meaning if you don't know which John you are talking about.
Accuracy is closely related to error. Absolute error is the difference between
the value you obtained and the actual value for some quantity. Relative error
roughly equals the absolute error divided by the actual value, and is usually
expressed in the number of significant digits. Higher accuracy means smaller
error.
Accuracy and precision are related, but only indirectly. A number stored with
very low precision can be exactly accurate. For example:
Byte n0 = 0x03;
Int16 n1 = 0x0003;
Int32 n2 = 0x00000003;
Single n3 = 3.000000f;
Double n4 = 3.000000000000000;
Each of these five variables represents the number 3 exactly. The
variables are stored with different precisions, using from 8 to 64 bits. For
the sake of clarity, the precision of the numbers is shown explicitly, but the
precision does not have any impact on the accuracy.
Now look at the same number 3 as an approximation for pi, the ratio of the
circumference of a circle to its diameter. 3 is only accurate to one significant
digit, no matter what the precision. The Double value uses 8 times
as much storage as the Byte value, but it is no more accurate.
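The absolute and relative error of 3 as an approximation for pi can be computed with a few lines of code. In this sketch, Math.PI stands in for the true value (it is itself only the closest Double to pi), and the helper names are invented for the example.

```csharp
using System;

public static class AccuracyDemo
{
    // Absolute error: distance between the approximation and the actual value.
    public static double AbsoluteError(double approx, double actual)
    {
        return Math.Abs(approx - actual);
    }

    // Relative error: absolute error divided by the size of the actual value.
    public static double RelativeError(double approx, double actual)
    {
        return Math.Abs(approx - actual) / Math.Abs(actual);
    }

    public static void Main()
    {
        byte asByte = 3;        // 8 bits of storage
        double asDouble = 3.0;  // 64 bits of storage

        // Both lines print the same error of about 0.1416: the extra
        // precision of the Double does not make 3 any closer to pi.
        Console.WriteLine(AbsoluteError(asByte, Math.PI));
        Console.WriteLine(AbsoluteError(asDouble, Math.PI));

        // Relative error of about 4.5 percent.
        Console.WriteLine(RelativeError(asDouble, Math.PI));
    }
}
```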
Round-off error
Let's say you have a non-integer number from the real world that you want to use
in your program. Most likely you are faced with a problem. Unless your number
has some special form, it cannot be represented by any of the number formats
that are available to you. Your only solution is to find the number that is
represented by a number format that is closest to your number. Throughout the
lifetime of the program, you will use this approximation to your 'real' number
in calculations. Instead of using the exact value a, the program will
use a value a+e, with e a very small number which
can be positive or negative. This number e is called the round-off
error.
It's bad enough that you are forced to use an approximation of your number. But
it gets worse. In almost every arithmetic operation in your program, the result
of that operation will once again not be represented in the number format. On
top of the initial round-off error, almost every arithmetic operation
introduces a further error e_i. For example, adding two
numbers, a and b, results in the number (a + b) + (e_a + e_b + e_sum),
where e_a, e_b, and e_sum are the round-off errors of a, b, and the result,
respectively. Round-off error propagates and is very often amplified
by calculations. Fortunately, the round-off errors tend to cancel each other
out to some degree, but rarely do they cancel out completely. Some calculations
may also be affected more than others.
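A classic way to observe round-off error is to add 0.1 repeatedly using the Double type. Neither 0.1 nor most of the intermediate sums are exactly representable, so small errors accumulate. The sketch below is only an illustration, and the method name SumTenths is made up.

```csharp
using System;
using System.Globalization;

public static class RoundOffDemo
{
    // Adds 0.1 to itself the given number of times using double arithmetic.
    public static double SumTenths(int count)
    {
        double sum = 0.0;
        for (int i = 0; i < count; i++)
        {
            sum += 0.1;
        }
        return sum;
    }

    public static void Main()
    {
        CultureInfo inv = CultureInfo.InvariantCulture;

        double a = 0.1, b = 0.2;
        Console.WriteLine(a + b == 0.3);                    // False
        Console.WriteLine((a + b).ToString("G17", inv));    // 0.30000000000000004

        // Ten additions of 0.1 do not give exactly 1.
        double total = SumTenths(10);
        Console.WriteLine(total == 1.0);                    // False
        Console.WriteLine((1.0 - total).ToString("E2", inv));
    }
}
```

The accumulated error here is tiny, on the order of 1e-16, but it is enough to make an exact equality test fail.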
Part two of this series will have a lot more to say about round-off error and
how to minimize its adverse effects.