Optimization
A reasonably skilled programmer will not write a grossly inefficient program.
At least not deliberately. Optimization is what you do when the performance is
insufficient. Sometimes the optimizations are easy, sometimes they are hard.
Sometimes they fly in the face of your original design, and sometimes they require
that you grossly violate your beautiful abstractions in your class system. But
always, and I repeat, always, my experience has been that no programmer
has ever been able to predict or analyze where performance bottlenecks
are without data. No matter where you think the time is going, you will
be surprised to discover that it is going somewhere else.
You optimize because you have a problem in performance. Sometimes it is computational
optimization: your bitmap manipulation is just too slow. Sometimes it is data
access optimization: it just takes too long to get the data into the machine.
And sometimes it is algorithmic optimization: you're doing it wrong. If you don't
understand the difference between an n2 sort and an nlogn sort, you're probably already in trouble, but that knowledge
alone is not useful.
Some years ago, I was working on a complex program which had to perform a semantic
cross-check between the "statements" of the program and the "declarations"
(it was actually a 4GL constraint equation system, but the details don't matter).
I discovered that the operation was of n3 complexity (well,
actually m * n2, but m and n would be of comparable
size most of the time). There are three paths that you can follow here:
- The naive path. You don't even realize you've got an n3
problem. You're probably in trouble, because if it is the bottleneck,
you didn't know it was there.
- The formal academic path. You realize you've got an n3
problem, and know it is intrinsically evil, and rewrite your algorithms.
- The engineering path. You realize you've got an n3 problem,
but you instrument the system to discover its actual impact on the system.
The only valid path to optimization is the engineering path. I measured the
performance, and on the largest "real" example we had, I discovered
that n was almost always 1, sometimes 2, rarely 3, and had exactly one
instance of 4. This was too small to matter. Sure, the algorithm was n3,
but with n that small, there was no need to rewrite the code. Rewriting
the code would have been incredibly complex, delayed the whole project for a
couple weeks, and used up a couple pointers in each node of the tree in an already-tight
minicomputer address space.
I also wrote the storage allocator that everyone used. I spent a lot of hours
tweaking its performance to be the fastest allocator of its class. These adventures
are detailed in the book IDL: The Language and its Implementation, now,
alas, out of print. (Nestor, Newcomer, Gianinni and Stone, Prentice-Hall, 1990).
One group that used it used a "PC sampling" performance tool on Unix.
This is a performance tool that samples where the program counter (PC) is executing,
and over a long execution derives a "density histogram" of where the
program is spending its time. It clearly showed that a massive amount of time
was being spent in the storage allocator. This made no sense to me, but fingers
were being pointed in my direction.
So I wrote a little hook in the allocator that counted the number of times
it was called. It turns out it was called over 4,000,000 times. No call took
longer than the minimum measurement interval of 10µs (approximately ten instructions
on our 1-MIPS machine), but 40,000,000 microseconds is 40 seconds. Actually,
it was more than that because there were 4,000,000 free operations as well, which
were even faster, but still the resulting time was more than 50% of the total
execution time.
Why did this happen? Because, unknown to the programmers, a critical function
they were calling in the inner loop of several algorithms would actually allocate
a 5-to-10 byte working buffer, do its thing, and release it. When we changed
this to be a 10-byte stack local, the time spent in the storage allocator dropped
to about 3% of the total program time.
Without the data, we would not have known why the allocator accounted for so
much time. PC-sampling performance tools are a very weak class of tools and their
results are intrinsically suspect. See my article "Profiling for Performance",
in Dr.
Dobb's Journal (18,1) January, 1993, pp.80-87.