There's a very influential platform called the AVM, which stands for Algorithmic Virtual Machine. That's the imaginary device people use as their mental model of a computer. In particular, it's used by many people working on algorithms where performance matters. Performance matters in many different contexts, ranging from huge clusters processing astronomic amounts of data to modest applications running on pathetically weak hardware. However, I believe that the core architecture of the AVM is basically the same everywhere.
AVM application development is done using the ubiquitous AVM SDK – a whiteboard and a couple of hands for handwaving. An AVM application consists of a set of operations your algorithm needs executed. Each operation has a cost (typically one cycle, sometimes more). You can then estimate the run time of your algorithm by the clever technique of summing the cost of all operations.
These estimations are never close enough to the real run time. The definition of "close enough" varies; the quality of estimations, by and large, doesn't. That is, I claim that your handwavy AVM-derived estimation will fail to meet your precision requirements no matter what those requirements are. Apparently our tolerance for errors grows with the lack of understanding of the problem, but it never grows enough. But I'm not really sure about this theory; I'm only sure about AVM-estimations-suck part. Here's why.
The AVM is basically this imaginary machine that runs "operations". Here are some things that real machines must do, but the AVM doesn't:
- Fetching instructions
- Fetching operands
- Testing for conditions
- Storing results
Basically, the Algorithmic Virtual Machine developers concentrate on "operations" and ignore addressing, branches, caches, buses, registers, pipelines, and all those other gadgets which are needed in order to dispatch the operation. In fact, that's how I currently distinguish between people who write software to get a job done and people who think of software as their job. "People who program" are into operations (algebra, networking, AI); "programmers" are into dispatching (programming languages, operating systems, OO). This is about mental focus rather than aptitude. I haven't noticed that people of either group are inherently less productive than the other kind.
When they're after performance, the "operations" people will naturally look for a way to reduce the number of operations. Sometimes, they'll find an algorithm with a better asymptotic complexity – O(N+M) instead of O(N*M). At other times, they'll come up with a way to perform 4*N*M operations instead of 16*N*M. Both results are very significant – if M and N are the only variables. The trouble is that you can't see all the variables if you just look at the math (as in "we want to multiply and sum all these and then compare to that"). That way, you assume that you run on the AVM and leave out all the dispatching-related variables and get the wrong answer.
Is there a way to take the cost of dispatching into account? Not really, not without implementing your algorithm and measuring its performance. However, families of machines do have related sets of heuristics that can be used to guess the cost of running on them. For example, here are a couple of heuristics that I use for SIMD machines (they are relevant elsewhere, but their relative importance may drop):
- Bandwidth is costly.
- Addressing is costly.
These heuristics are vague, and I don't see a very good way to make them formal. Perhaps there isn't any. To show that my points have any formal significance, I'd have to formally prove that there's unavoidable intrinsic cost to some things no matter how you build your hardware. And I don't know how to go about that. So what I'll do is I'll give some examples to show what I mean, and leave it there.
Consider two "algorithms" (probably too fancy a name in this context): computing dot product, and computing its partial sums (Matlab:
sum(a .* b) and
cumsum(a .* b)). Exactly the same amount of "operations" – N multiplications and N additions. Many people with BA, MSc and PhD degrees in CS assume that the run time is going to be the same, too. It won't, because sum only produces one output, and cumsum produces N outputs. Worse, if the input vector elements are 8-bit integers, we probably need at least 32 bits for each output element. So we generate N*4 bytes of output from N*2 input bytes.
At this point, some people will say "Yeah, memory. Processors are fast, memories are slow, sure, memory is a problem". But it isn't just about the memory; memory bandwidth is just one kind of bandwidth. Let's look at the non-memory problems of the partial-sums-of-dot-product algorithm. On the way, I'll try to show how the "bandwidth costs" heuristic can be used to guess what your hardware can do and what the performance will be.
Consider a machine with a SIMD instruction set. Most likely, the machine has registers of fixed width (say, 16 bytes), and each instruction gets 2 inputs and produces 1 output. Why? Well, the hardware ought to support 2 inputs and 1 output to do basic math. Now, if it also wants to have an instruction that produces, say, 4 outputs, then it needs to have 3 additional output buses from the data processing units to the register file. It also needs a multiplexer so that each of the 4 outputs can be routed to each of its N registers (N can be 16 or 32 or even 128). The cost of multiplexers is, roughly, O(M*N), where M is the number of inputs and N is the number of outputs. That's awfully costly. Bandwidth costs. So they probably use 2 inputs and 1 output everywhere.
Now, suppose the machine has 16 multipliers, which is quite likely – 1 multiplier for each register byte, so we can multiply 16 pairs of bytes simultaneously. Does this mean that we can then take those 16 products and compute 16 new partial sums, all in the same cycle? Nope, because, among other things, we'd need a command producing 16×4 bytes to do that, and that's too much bandwidth. Are we likely to have a command that updates less than 16 accumulators? Yes, because that would speed up dot products, and dot products are very important; let's look at the manual.
You're likely to find a command updating – guess how many? – 4 accumulators (32 bits times 4 equals 16 bytes, that's exactly one machine register). If the register size is 8 bytes, you'll probably get a command updating 2 accumulators, and so on. Sometimes the machine uses "register pairs" for output; that doubles the register size for output bandwidth calculation purposes. The bottom line is that instruction set extensions can speed up dot product to an extent impossible for its partial sums. You might have noticed another problem here, that of the dependency of a partial sum on the previous partial sum. Removing this dependency doesn't solve the bandwidth problem. For example, consider the vertical projection of point-wise multiplication of 2 8-bit images, which has the same not-enough-accumulators problem.
There is little you can do about the bandwidth problem in the partial sums case – the algorithm is I/O bound. Some algorithms aren't, so you can optimize them to minimize the cost of bandwidth. For example, matrix multiplication is essentially lots of dot products. If you do those dot products straightforwardly, you'll have a loop spending 2 commands for loading the matrix elements into registers, and one command for multiplying and accumulating (MAC). 2 loads per MAC means an overhead of 200%.
However, you can work on blocks – 4 rows of matrix A and 4 columns of matrix B, and compute the 4×4=16 dot products in your loop. That's 4+4=8 loads per 16 MACs; the overhead dropped to 50%. If you have enough registers to do this. And it's still quite impressive overhead, isn't it? Your typical AVM user would be very disappointed. (Yes, some machines can parallelize the loads and the MACs, but some can't, and it's a toy example, and stop nitpicking). BTW, blocking can be used to save loads from main memory to cache just like we've used it to save loads from cache to registers.
OK. With partial sums of dot product, the bandwidth problem kills performance, and with matrix multiplication, it doesn't. What about convolution, which is about as basic as our previous examples? Gee, I really don't know. It's tricky, because with convolution, you need to store intermediate results somewhere, and it's unclear how many of them you're going to need. The optimal implementation depends on the quirks of the data processing units, the I/O, and the filter size. If you come across a benchmark showing the performance of convolution on some machine, you'll probably find interesting variations caused by the filter size.
So we have a bread-and-butter algorithm, and non-trivial & non-portable performance characteristics. I think it's one indication that your own less straightforward algorithm will also perform somewhat unpredictably. Unless you know an exact reason for the opposite.
Bandwidth is one problem with fetching operands and storing results. Another problem is figuring out where they go. In the case of registers, we have costly multiplexers for selecting the source and destination registers of instructions. In the case of memory, we have addresses. Computing addresses has a cost. Reading data from those addresses also has a cost. Some address sequences are costlier than others from one of these perspectives, or both.
The dumbest example is the misalignment problem. People who learned C on x86 are sometimes annoyed when they meet a PowerPC or an ARM or almost any other processor since it won't read a 32-bit integer from a misaligned address. So when you read a binary buffer from a file or a socket, you can't just cast the char* to an int* and expect it to work. Isn't it nice of x86 to properly handle these cases?
Maybe it's nice, maybe it isn't (at least if it failed, the code would be fixed to become legal C), but it sure is costly. The fact that it's "in the hardware" doesn't make it a single-cycle operation. If your address is misaligned, the 32 bits may reside in two different memory words (no matter what the word size is). The hardware will have to read the low word, and then read the high word, and then take the high bits of the low word and the low bits of the high word and make a single 32-bit value out of them. Because in one cycle, memories can only fetch one word from an aligned address.
Does it matter outside of I/O-related code using illegal pointer-casting? Consider the prosaic algorithm of computing the first derivative of a vector, spelled
v(2:end)-v(1:end-1) in Matlab. If we run on a SIMD machine, we could execute several subtractions simultaneously. In order to do that, we need to fetch a word containing v…v and a word containing v…v (both zero-based). But the second word is misaligned. The handling of misalignment will have a cost, whether it's done in hardware or in software.
Well, at least the operands of subtraction live in subsequent addresses – 0,1,2…15 and 1,2,3…16. That's how data processing units like them: you read a pack of numbers from memory and feed them right into the array of adders, ready to crunch them. It's not always like that. Consider scaling:
a(x) = b(s*x+t). This can be used to resize images (handy), or to play records at a different speed the way you'd do with a tape recorder (less handy, unless you like squeaky or growly voices).
Now, if s isn't integral (say, s=0.6), you'd have to fetch data from places such as
s*x+t = 1.3, 1.9, 2.5, 3.1, 3.7... Suppose you want to use linear interpolation to approximate a(1.3) as
a(1)*0.7+a(2)*0.3. So now we need to multiply the vector of "low" elements –
a([1,1,2...]) – by the vector of weights –
[0.7,0.1,0.5...] – and add the result to the similar product
a([2,2,3...])*[0.3,0.9,0.5...]. The multiplications and the additions map nicely to SIMD instruction sets; the indexing doesn't, because you have those weird jumpy indexes. So this time, the addressing can become a real bottleneck because it can prevent you from using SIMD instructions altogether and serialize your entire computation.
Well, at least we access adjacent elements. This means that most memory accesses will hit the cache. When you bump into an element that isn't cached yet, the machine will bring a whole cache line (say, 32 bytes), and then you'll read the other elements in that cache line, so it will pay off. You can even issue cache prefetching instructions so that while you're working on the current cache line, the machine will read the next one in the background. That way, you'll hit the cache all the time, instead of having your processor repeatedly surprised (hey, I don't have a(32) in the cache!.. hey, I don't have a(64) in the cache!.. hey, I don't have…). Avoiding the regularly scheduled surprise can be really beneficial, although cache prefetching is truly disgusting (it's basically a very finicky kind of cooperative multi-tasking – you ought to stuff the prefetching commands into the exactly right spots in your code).
a(x) = b(f(x)) – a generic transformation of an input vector given a function for computing the input coordinate from the output coordinate. We have no idea what the next address is going to be, do we? If the transformation is complicated enough, we're going to miss the cache a lot. By the way, if the transformation is in fact simple, and the compiler knows the transformation at compile time, the compiler is still very unlikely to generate optimal cache prefetching commands. Which is one of the gazillion differences between C++ templates and "machine-optimal" code.
DVMs and TVMs
My bandwidth and addressing heuristics don't model a real machine; they only model an upgrade to the AVM for SIMD machines. Multi-box computing is one example of an entire universe of considerations they fail to model. So what we got is a DVM – Domain-specific Virtual Machine.
Now, in order to estimate performance without measuring (which is necessary when you choose your optimizations – you just can't try all the different options), I recommend a TVM (Target-specific Virtual Machine). You get one as follows. You start with the AVM. This gives overly optimistic performance estimations. You then add the features needed to get a DVM. This gives overly pessimistic estimations.
Then, you ask some low-level-loving person: "What are the coolest features of this machine that other machines don't have?" This will give you the capabilities that the real processor has but its DVM doesn't have. For example, PowerPC with AltiVec extensions is basically a standard SIMD DVM plus
vec_perm. I won't talk about
vec_perm very much, but if you ever need to optimize for AltiVec, this is the one instruction you want to remember. It solves the indexing problem in the scaling example above, among other things. Using a SIMD DVM and forgetting about
vec_perm would make AltiVec look worse than it really is, and some algorithms much more costly than they really are.
And this is how you get a TVM for your platform. The resulting mental model gives you a fairly realistic picture, second only to reading the entire manual and understanding the interactions of all the features (not that easy). And it definitely beats the AVM by… how do you estimate the quality of handwaving? OK, it beats the AVM by the factor of 5, on average. What, you want a proof? Just watch the hands go.