Hardware macroarchitecture vs microarchitecture

May 11th, 2012

The comp-arch.net wiki defines "computer architecture" as the union of two things:

  1. Macroarchitecture – the parts of the hardware visible to software: the hw/sw contract that programs rely on.
  2. Microarchitecture – the implementation details invisible to software, which hardware can change without breaking that contract.

I think this distinction is very interesting for three reasons:

  1. It pops up everywhere. That is, many hardware problems can be addressed at the macro level or the micro level, explicitly or implicitly.
  2. The choice of macro vs micro is rarely trivial – for most problems, there are real-world examples of both kinds of solutions.
  3. The choice has common consequences across problems. The benefits and drawbacks of macro and micro are frequently similar.

I'll use examples from diverse types of hardware – CPUs, DSPs, GPUs, FPGAs, CAPPs, and even DRAM controllers. We'll discuss some example problems and how they can be solved at the macro or the micro level. I'll leave the discussion of the resulting trade-offs to separate write-ups. Here, we'll go through examples to see what practical macro and micro solutions to different problems look like.

Our examples are:

  1. Data parallelism: SIMD vs SIMT
  2. Multiple issue: VLIW vs superscalar/OOO
  3. Running ahead: exposed latencies vs OOO
  4. Local storage: local memories vs caches
  5. Data streaming: DMA vs hardware prefetchers
  6. Data processing: logic synthesis vs instruction sets
  7. Local communication: routing vs register addresses
  8. Avoiding starvation: pressure signals vs request aging

Data parallelism: SIMD vs SIMT

Suppose you want to have a data-parallel machine: software issues one instruction that processes multiple data items.

The common macro approach is wide registers and SIMD opcodes. To use the feature, software must explicitly break up its data into 16-byte chunks, and use special opcodes like "add_16_bytes" to process the chunks.
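
For instance, here's what the macro approach looks like at the source level – a minimal sketch using x86 SSE2 intrinsics (one concrete SIMD flavor; the explicit 16-byte chunking and the special "add 16 bytes" opcode are the point, not the particular instruction set):

    #include <emmintrin.h>  /* SSE2 intrinsics - one concrete case of "wide registers + SIMD opcodes" */

    /* Add two 16-byte chunks with a single SIMD instruction.
       Software explicitly packs the data into 128-bit registers. */
    void add_16_bytes(const unsigned char* a, const unsigned char* b, unsigned char* out)
    {
        __m128i va = _mm_loadu_si128((const __m128i*)a);
        __m128i vb = _mm_loadu_si128((const __m128i*)b);
        _mm_storeu_si128((__m128i*)out, _mm_add_epi8(va, vb));  /* the "add_16_bytes" opcode */
    }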

One micro approach is what NVIDIA marketing calls SIMT. The instruction set remains scalar. However, hw runs multiple scalar threads at once, with simultaneously running threads all executing the same instruction. That way, 16 pairs of values are added in a single cycle using scalar instructions.
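
At the source level, the micro approach keeps the code scalar – roughly like the sketch below (plain C standing in for a CUDA-style kernel; the thread index tid would be supplied by the hardware/runtime, and is an assumption of this sketch):

    /* Per-thread body on a SIMT machine: ordinary scalar code.
       Hardware runs many instances simultaneously, all executing
       the same instruction; no vector types or wide opcodes appear. */
    void add_one_element(const int* a, const int* b, int* out, int tid)
    {
        out[tid] = a[tid] + b[tid];  /* a plain scalar add */
    }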

(If you're interested in SIMT, a detailed comparison with SIMD as well as SMT – the more general simultaneous multithreading model – is here.)

Multiple issue: VLIW vs superscalar/OOO

Suppose you want to have a multiple issue machine. You want to simultaneously issue multiple instructions from a single thread.

The macro approach is VLIW, which stands for "very long instruction word". The idea is, those multiple instructions you issue become "one (very long) instruction", because software explicitly asks to run them together: "ADD R0, R1, R2 and MUL R3, R0, R5". Note that ADD and MUL "see" the same value of R0: MUL gets R0's value before it's modified by ADD.

VLIW also lets software choose to say, "ADD R0, R1, R2; afterwards, MUL R3, R0, R5" – that's two separate instructions yielding vanilla serial execution. This is not only slower (2 cycles instead of 1), but has a different meaning. This way, MUL does see ADD's change to R0. Either way, you get what you explicitly asked for.
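
The semantic difference between the two spellings can be sketched in C (a hypothetical 2-issue VLIW modeled with plain variables – in a parallel bundle, both slots read the register file as it was before the bundle executed):

    int R[8];  /* register file of our hypothetical VLIW machine */

    /* "ADD R0, R1, R2 and MUL R3, R0, R5" - one bundle, issued together */
    void parallel_bundle(void)
    {
        int old_r0 = R[0];       /* both slots see the pre-bundle value of R0 */
        R[0] = R[1] + R[2];
        R[3] = old_r0 * R[5];    /* MUL gets R0 before the ADD modified it */
    }

    /* "ADD R0, R1, R2; afterwards, MUL R3, R0, R5" - two serial instructions */
    void serial_instructions(void)
    {
        R[0] = R[1] + R[2];
        R[3] = R[0] * R[5];      /* MUL does see ADD's change to R0 */
    }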

(If you're interested in VLIW, an explanation of how programs map to this rather strangely-looking architecture is here.)

The micro approach, called superscalar execution, is having the hardware analyze the instructions and run them in parallel – when that doesn't change the hw/sw contract (the serial semantics). For example, ADD R0, R1, R2 can run in parallel with MUL R3, R1, R2 – but not with MUL R3, R0, R5 where MUL's input, R0, depends on ADD. Software remains unaware of instruction-level parallelism – or at least it can remain unaware and still run correctly.
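
A toy model of the check the hardware performs might look like this (heavily simplified – real machines also rename registers, track memory dependencies and so on; the structure here is made up for illustration):

    /* One decoded instruction: destination and source register numbers. */
    typedef struct { int dst, src1, src2; } insn;

    /* May the second instruction issue in the same cycle as the first?
       Only if it neither reads nor writes the first one's destination. */
    int can_dual_issue(insn first, insn second)
    {
        int read_after_write  = (second.src1 == first.dst) || (second.src2 == first.dst);
        int write_after_write = (second.dst == first.dst);
        return !read_after_write && !write_after_write;
    }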

Running ahead: exposed latencies vs OOO

We've just discussed issuing multiple instructions simultaneously. A related topic is issuing instructions before a previous instruction completes. Here, the macro approach is to, well, simply go ahead and issue instructions. It's the duty of software to make sure those instructions don't depend on results that are not yet available.

For example, a LOAD instruction can have a 4-cycle latency. Then if you load to R0 from R1 and at the next cycle, add R0 and R2, you will have used the old value of R0. If you want the new value, you must explicitly wait for 4 cycles, hopefully issuing some other useful instructions in the meanwhile.
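
Here's a toy model of that contract in C (the 4-cycle machine is hypothetical; the point is only that reading R0 too early yields its old value, and counting cycles is software's job):

    #include <stdio.h>

    int R0 = 111, R2 = 1000;   /* R0 still holds its old value */
    int loaded_value = 222;    /* the value the LOAD will eventually deliver */
    int cycles_left = 4;       /* exposed 4-cycle load latency */

    void tick(void)            /* one cycle passes */
    {
        if (cycles_left > 0 && --cycles_left == 0)
            R0 = loaded_value; /* the load finally lands in R0 */
    }

    int main(void)
    {
        tick();                              /* ADD issued one cycle after the LOAD... */
        printf("early: %d\n", R0 + R2);      /* ...uses the OLD R0: prints 1111 */
        tick(); tick(); tick();              /* explicitly wait out the latency */
        printf("late:  %d\n", R0 + R2);      /* now uses the NEW R0: prints 1222 */
        return 0;
    }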

The micro approach to handling latency is called OOO (out-of-order execution). Suppose you load to R0, then add R0 and R2, and then multiply R3 and R4. An OOO processor will notice that the addition's input is not yet available, proceed to the multiplication because its inputs are ready, and execute the addition once R0 is loaded (in our example, after 4 cycles). The hw/sw contract is unaffected by the fact that hardware issues instructions before a previous instruction completes.

Local storage: local memories vs caches

Suppose you want to have some RAM local to your processor, so that many of the memory operations use this fast RAM rather than the external RAM, which is increasingly becoming a bottleneck.

The macro approach is, you just add local RAM. There can be special load/store opcodes to access this RAM, or a special address range mapped to it. Either way, when software wants to use local RAM, it must explicitly ask for it – as in, char* p = (char*)0x54000, which says, "I'll use a pointer pointing to this magic address, 0x54000, which is the base address of my local RAM".

This is done on many embedded DSPs and even CPUs – for example, ARM calls this "tightly-coupled memory" and MIPS calls this "scratch pad memory".
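
In source code, using the local RAM might look like the sketch below (the 0x54000 base address is the made-up magic number from above, and memcpy stands in for whatever copy mechanism the platform actually provides):

    #include <string.h>

    #define LOCAL_RAM ((char*)0x54000)   /* hypothetical base address of the local RAM */

    void process(const char* external_buf, int n)
    {
        memcpy(LOCAL_RAM, external_buf, n);   /* explicitly stage the data in fast RAM */
        /* ... work on LOCAL_RAM[0..n-1] instead of external memory ... */
    }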

The micro approach is caches. Software doesn't ask to use a cache – it loads from an external address as if the cache didn't exist. It's up to hardware to:

  1. Figure out whether the address is already in the cache.
  2. Fetch the data from external memory into the cache when it isn't.
  3. Decide which older data to evict to make room.

The hardware changes greatly, the hw/sw contract does not.

Data streaming: DMA vs hardware prefetchers

Suppose you want to support efficient "streaming transfers". DRAM is actually a fairly poor random access memory – there's a big latency you pay per address. However, it has excellent throughput if you load a large contiguous chunk of data. To utilize this, a processor must issue loads without waiting for results of previous loads. Load, wait, load, wait... is slow; load, load, load, load... is fast.

The macro approach is, sw tells hw that it wants to load an array. For example, a DMA – direct memory access – engine can have control registers telling it the base address and the size of an array to load. Software explicitly programs these registers and says, "load".

DMA starts loading and eventually says, "done" – for example, by setting a bit. In the meanwhile, sw does some unrelated stuff until it needs the loaded data. At this point, sw waits until the "done" bit is set, and then uses the data.
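
A sketch of what this looks like to software (the register layout and the base address are invented for illustration – real DMA engines differ, but the explicit program/start/poll pattern is the same):

    /* Hypothetical memory-mapped DMA engine. */
    typedef struct {
        volatile unsigned src_addr;   /* base address of the array to load */
        volatile unsigned size;       /* number of bytes */
        volatile unsigned start;      /* write 1 to kick off the transfer */
        volatile unsigned done;       /* hardware sets this when finished */
    } dma_regs;

    #define DMA ((dma_regs*)0x40000000)   /* made-up base address */

    void stream_in(const char* base, unsigned n)
    {
        DMA->src_addr = (unsigned)(unsigned long)base;
        DMA->size     = n;
        DMA->start    = 1;            /* "load" */
        /* ... do unrelated work ... */
        while (!DMA->done)            /* wait for the "done" bit */
            ;
        /* ... use the loaded data ... */
    }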

The micro approach is, software simply loads the array "as usual". Naturally, it loads from the base address, p, then from p+1, then p+2, p+3, etc. At some point, a hardware prefetcher quietly inspecting all the loads realizes that a contiguous array is being loaded. It then speculatively fetches ahead – loads large chunks beyond p+3 (hopefully not too large – we don't want to load too much unneeded data past the end of our array).

When software is about to ask for, say, p+7, its request is suddenly satisfied very quickly because the data is already in the cache. This keeps working nicely with p+8 and so on.
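
On the software side, nothing special is needed – a plain sequential loop like the sketch below is all the prefetcher ever sees:

    /* Software just streams through the array "as usual"; a hardware
       prefetcher watching the access stream fetches ahead on its own. */
    int sum(const int* p, int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += p[i];   /* by the time we ask for p[i], it's often already in the cache */
        return s;
    }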

Data processing: logic synthesis vs instruction sets

Let's get back to basics. Suppose we want to add a bunch of numbers. How does software tell hardware to add numbers?

The micro approach is so much more common that it's the only one that springs to mind. Why, of course hardware has an ADD command, and it's implemented in hardware by some sort of circuit. There are degrees here (should there be a DIV opcode, or should sw do division?), but the upshot is, there are opcodes.

However, there are architectures where software explicitly constructs data processing operations out of bit-level boolean primitives. This is famously done on FPGAs and is called "logic synthesis" – effectively software gets to build its own circuits. (This programming model is so uncommon that it isn't even called "software", but it certainly is.)

Less famously, this is also what effectively happens on associative memory processors (CAPPs/APAs) – addition is implemented as a series of bit-level masked compare & write operations. (The CAPP way results in awfully long latencies, which you're supposed to make up with throughput by processing thousands of elements in parallel. If you're interested in CAPPs, an overview is available here.)

Of course, you can simulate multiplication using bitwise operations on conventional opcode-based processors. But that would leave much of the hardware unused. On FPGAs and CAPPs, on the other hand, "building things out of bits" is how you're supposed to utilize hardware resources. You get a big heap of exposed computational primitives, and you map your design to them.
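
To make "building things out of bits" concrete, here's a 4-bit ripple-carry adder expressed purely in bit-level boolean primitives – sketched in C for readability; on an FPGA the same structure would be spelled in an HDL and mapped to the logic fabric, and on a CAPP it would become a series of masked compare & write operations:

    /* A 4-bit adder built only from AND, OR and XOR on single bits. */
    unsigned add4(unsigned a, unsigned b)
    {
        unsigned sum = 0, carry = 0;
        for (int i = 0; i < 4; i++) {
            unsigned ai = (a >> i) & 1, bi = (b >> i) & 1;
            unsigned s  = ai ^ bi ^ carry;                  /* full-adder sum bit */
            carry       = (ai & bi) | (carry & (ai ^ bi));  /* full-adder carry-out */
            sum        |= s << i;
        }
        return sum;
    }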

Local communication: routing vs register addresses

Another problem as basic as data processing operations is local communication: how does an operation pass its results to the next? We multiply and then add – how does the addition get the output of multiplication?

Again, the micro approach is by far the better known one. The idea is, you have registers, which are numbered somehow. We ask MUL to output to the register R5 (encoded as, say, 5). Then we ask ADD to use R5 as an input.

This actually doesn't sound very "micro" – what's implicit about it? We asked for R5 very explicitly. However, there are two sorts of "implicit" things going on here:

How does the output of MUL travel to R5 and then to the input port of the adder? There are wires connecting these things, and multiplexers selecting between the various options. On most machines, there are also data forwarding mechanisms sending the output of MUL directly to the adder, in parallel to writing it into R5, so that ADD doesn't have to wait until R5 is written and then read back. But even on machines with explicit forwarding (and there aren't many), software doesn't see the wires and muxes – these are opaque hardware resources.

The macro approach to routing is what FPGAs do. The various computational units are connected to nearby configurable switches. By properly configuring those switches, you can send the output of one unit to another using a path going through several switches.

Of course this uses up the wires connecting the switches, and longer paths result in larger latencies. So it's not easy to efficiently connect all the units that you want using the switches and the wires that you have. In FPGAs, mapping operations to computational units and connecting between them is called "placement and routing". The "place & route" tools can run for a couple of hours given a large design.

This example, as well as the previous one, illustrates micro vs macro at the extreme – a hardware resource that looks "all-important" in one architecture is invisible in another, to the point where we forget it exists. The point is that the resource is equally important in both cases – the only question is who manages it, hardware or software.

Avoiding starvation: pressure signals vs request aging

One thing DRAM controllers do is accept requests from several different processors, put them in a queue, and reorder them. Reordering helps to better utilize DRAM, which, as previously mentioned, isn't that good at random access and prefers streaming access to consecutive locations.

So if two processors, A and B, run in parallel, and each accesses a different region, it's frequently better to group requests together – A, A, A, A, B, B, B, B – than to process them in the order in which they arrive – say, A, B, A, A, B, A, B, B.

In fact, as long as A keeps issuing requests, it's frequently better to keep processing them until they're over, and keep B waiting. Better, that is, for throughput, as well as for A's latency – but worse for B's latency. If we don't know when to stop, serving A and starving B could make the system unusable.

When to stop? One macro solution is, the DRAM controller has incoming pressure signals, and both A and B can complain when starved by raising the pressure. Actually, this is "macro" only as far as the DRAM controller is concerned – it gives outside components explicit control over its behavior. The extent of software control over the generation of the pressure signal depends on the processors A and B.

One micro solution is to use request aging. Older requests are automatically considered more urgent. This method is implemented in many DRAM controllers – for instance, Denali's. The macro approach is implemented in the Arteris DRAM scheduler.
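
A toy version of an aging policy might look like this (the threshold, the request structure, and the row-hit preference are all made up for illustration; real controllers track banks, read/write turnarounds and more – the idea is just that waiting long enough automatically makes a request win):

    #define AGE_LIMIT 100   /* made-up starvation threshold, in cycles */

    typedef struct { int row; int age; } request;

    /* Prefer requests hitting the currently open DRAM row (good for throughput),
       unless some request has aged past the limit - then it wins regardless,
       so no processor can be starved indefinitely. */
    int pick_next(const request* q, int n, int open_row)
    {
        int best = 0;
        for (int i = 1; i < n; i++) {
            int i_urgent = q[i].age > AGE_LIMIT, best_urgent = q[best].age > AGE_LIMIT;
            int i_hit    = q[i].row == open_row, best_hit    = q[best].row == open_row;
            if (i_urgent != best_urgent) { if (i_urgent) best = i; }
            else if (i_hit != best_hit)  { if (i_hit) best = i; }
            else if (q[i].age > q[best].age) best = i;
        }
        return best;
    }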

The micro approach is safer – the controller itself takes care to prevent starvation, whereas in the macro option, a non-cooperative processor can starve others. It also uses a simpler bus protocol, making compatibility easier for processors. However, it can cost throughput – for instance, if B is a peripheral device with a large FIFO for incoming data, it could afford to wait for a long time before the FIFO overflows, but automatic aging will raise the priority of its requests much sooner, cutting A's streaming short.

Whatever the benefits and drawbacks – and here, we aren't going to discuss benefits and drawbacks in any depth – this last example is meant to illustrate that macro vs micro is relevant not only to "core"/"processor" design, but extends to "non-computational" hardware as well.

Blurred boundaries

Micro vs macro is more of a continuum than a strictly binary distinction. That is, we can't always label a hardware feature as "visible" or "invisible" to programmers – rather, we can talk about the extent of its visibility.

There are basically two cases of "boundary blurring":

  1. Features that are technically micro – invisible to program semantics – but macro in spirit.
  2. Features that are technically macro – visible in the instruction set – but micro in spirit.

Let's briefly look at examples of both kinds of "blurring".

Technically micro but macro in spirit

A good example is memory banking. The point of banking is increasing the number of addresses that can be accessed per cycle. A single 32K memory bank lets you access a single address per cycle. 2 16K banks let you access 2 addresses, 4 8K banks let you access 4 addresses, and so on.

So basically "more is better". What limits the number of banks is the overhead you pay per bank, the overhead of logic figuring out the bank an address belongs to, and the fact that there's no point in accessing more data than you can process.

Now if we look at banking as implemented in NVIDIA GPU memory, TI DSP caches and superscalar CPU caches, then at first glance, they're all "micro solutions". These machines seem to mostly differ in their mapping of address to bank – for instance, NVIDIA GPUs switch banks every 4 bytes, while TI DSPs switch banks every few kilobytes.
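
The difference between such mapping schemes boils down to a line or two (the numbers below are illustrative, not the exact parameters of any particular device):

    /* Address-to-bank mapping for a 4-bank memory. */
    unsigned bank_fine(unsigned addr)   { return (addr / 4) % 4; }     /* switch banks every 4 bytes (GPU-style) */
    unsigned bank_coarse(unsigned addr) { return (addr / 8192) % 4; }  /* switch banks every 8KB (DSP-cache-style) */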

But on all these machines, software can remain unaware of banking and run correctly. If two addresses accessed in the same cycle map to the same bank, the access takes two cycles instead of one – but no fault is reported and results aren't affected. Semantically, banking is invisible.

However, I'd call GPUs' and DSPs' banking "macroish", and superscalar CPUs' banking "microish". Why?

GPUs and DSPs "advertise" banking, and commit to a consistent address mapping scheme and consistent performance implications across different devices. Vendors encourage you to know about banking so that you allocate data in ways minimizing contentions.

CPUs don't advertise banking very much, and different CPUs running the same instruction set have different banking schemes which result in different performance. Moreover, those CPU variants differ in their ability to access multiple addresses in parallel in the first place: a low-end CPU might access at most one address but a high-end CPU might access two.

GPUs and DSPs, on the other hand, have explicit multiple load-store units (a macro feature). So software knows when it attempts to access many addresses in parallel – one reason to "advertise" which addresses can actually be accessed in parallel.

This shows why hardware features that don't affect program semantics aren't "completely invisible to programmers" – rather, there are "degrees of visibility". A feature only affecting performance is "quite visible" if vendors and users consider it an important part of the hw/sw contract.

Technically macro but micro in spirit

SIMD and VLIW are both visible in assembly programs/binary instruction streams. However, SIMD is "much more macro in spirit" than VLIW. That's because for many programmers, the hw/sw contract isn't the semantics of assembly, which they never touch, but the semantics of their source language.

At the source code level, the effect of SIMD tends to be very visible. Automatic vectorization rarely works, so you end up using intrinsic functions and short vector data types. The effect of VLIW on source code can be close to zero. Compilers are great at automatic scheduling, and better than humans, so there's no reason to litter the code with any sort of annotations to help them. Hence, SIMD is "more macro" – more visible.

Moreover, there's "macroish VLIW" and "microish VLIW" – just like there's "macroish banking" and "microish banking" – and, again, the difference isn't in the hardware feature itself, but in the way it's treated by vendors and users.

An extreme example of "microish VLIW" is Transmeta – the native binary instruction encoding was VLIW, but the only software that was supposed to be aware of that were the vendor-supplied binary translators from x86 or other bytecode formats. VLIW was visible at the hardware level but still completely hidden from programmers by software tools.

An opposite, "macro-centric" example is TI's C6000 family. There's not one, but two "human-writable assembly languages". There's parallel assembly, where you get to manually schedule instructions. There's also linear assembly, which schedules instructions for you, but you still get to explicitly say which execution unit each instruction will use (well, almost; let's ignore the A-side/B-side issues here.)

Why provide such a "linear assembly" language? Josh Fisher, the inventor of VLIW, didn't approve of the concept in his book "Embedded Computing: a VLIW Approach".

That's because originally, one of the supposed benefits of VLIW was precisely being "micro in spirit" – the ability to hide VLIW behind an optimizing compiler meant that you could speed up existing code just by recompiling it. Not as easy as simply running old binaries on a stronger new out-of-order processor, but easy enough in many cases – and much easier to support at the hardware end.

Linear assembly basically throws these benefits out the window. You spell things in terms of C6000's execution units and opcodes, so the code can't be cross-platform. Worse, TI can't decide to add or remove execution units or registers from some of the C6000 variants and let the compiler reschedule instructions to fit the new variant. Linear assembly refers to units and registers explicitly enough to not support this flexibility – for instance, there's no silent spill code generation. Remove some of the resources, and much of the code will stop compiling.

Then why is linear assembly shipped by TI, and often recommended as the best source language for optimized code? The reason is that the code is more "readable" – if one of the things the reader is after is performance implications. The same silent spill code generation that makes C more portable makes it "less readable", performance-wise – you can never tell whether your data fits into registers or not, and similarly, it's hard to tell how many operations of each execution unit are used.

The beauty of linear assembly is that it hides the stuff humans particularly hate to do and compilers excel at – such as register allocation and instruction scheduling – but it doesn't hide things making it easy to estimate performance – such as instruction selection and the distinction between stack and register variables. (In my opinion, the only problem with linear assembly is that it still hides a bit too much – and that people can choose to not use it. They often do – and preserve stunning unawareness of how the C6000 works for years and years.)

Personally, I believe that, contrary to original expectations, VLIW works better in "macro-centric" platforms than "micro-centric" ones – a viewpoint consistent with the fate of Transmeta's chips compared to the success of VLIW DSPs. Whether this view is right or wrong, the point here is that hardware features "visible" to software can be more or less visible to programmers – depending on how visible the software stack in its entirety makes them.

Implications

We've seen that "macro vs micro" is a trade-off appearing in a great many contexts in hardware design, and that typically, both types of solutions can be found in practical architectures – so it's not clear which is "better".

If there's no clear winner, what are the benefits and drawbacks of these two options? I believe that these benefits and drawbacks are similar across the many contexts where the trade-off occurs. Some of the implications were briefly mentioned in the discussion on VLIW's "extent of visibility" – roughly:

  1. Macro tends to be easier to support at the hardware end – software manages the resources explicitly, so the hardware can stay simpler.
  2. Micro tends to be easier on software – old code keeps running, and often keeps speeding up, on new hardware, because the managed resources aren't part of the hw/sw contract.

There are other common implications – for example, macro is harder to context-switch (I like this one because, while it's not very surprising once you think about it, it doesn't immediately spring to mind).

I plan to discuss the implications in detail sometime soon. I intend to focus, not as much on how things could be in theory, but on how they actually tend to come out and why.