Entries Tagged 'hardware'

The habitat of hardware bugs

I wrote a post on embeddedrelated.com about hardware bugs – places where they're rarely found, places they inhabit in large quantities, and why these insects flourish in some places more than others.

It's one of those things I wish I'd been told when I started to fiddle with this shit – that while a chip is monolithic from a manufacturing point of view, from the logical spec angle it's a hodgepodge of components made and sold by different parties with very different goals and habits, tied together well enough to be marketable, but not well enough to be coherent from a low-level programmer's point of view. In fact, it's the job of the few low-level programmers to hide the idiosyncratic and buggy parts so as to present a coherent picture to the many higher-level programmers – the ones whose mental well-being is an economically significant goal.

 

The overblown frequency vs cost efficiency trade-off

I've often read arguments that computing circuitry running at a high frequency is inefficient, power-wise or silicon area-wise or both. So roughly, 100 MHz is more efficient, in that you get more work done per unit of energy or area spent. And CPUs go for 1 GHz or 3 GHz because serial performance sells regardless of efficiency. But accelerators like GPUs or embedded DSPs or ISPs or codecs implemented in hardware etc. etc. – these don't need to run at a high frequency.

And I think this argument is less common now that, say, GPUs have caught up, and an embedded GPU might run at the same frequency as an embedded CPU. But still, I've just seen someone peddling a "neuromorphic chip" or some such, and there it was – "you need to run conventional machines at 1 GHz and it's terribly inefficient."

AFAIK the real story here is pretty simple, namely:

  1. As you increase frequency, you GAIN efficiency up to a point;
  2. From that point on, you do start LOSING efficiency;
  3. That inflection point, for well-designed circuits, is much higher than people think (close to a CPU's frequency in the given manufacturing process, certainly not 10x less as people often claim);
  4. …and what fueled the myth is that accelerator makers used to be much worse at designing for high frequency than CPU makers. So marketeers, together with "underdog sympathizers", have blown the frequency vs efficiency trade-off completely out of proportion.

And below I'll detail these points; if you notice oversimplifications, please correct me (there are many conflicting goals in circuit implementation, and these goals are different across markets, so my experience might be too narrow.)

Frequency improves efficiency up to a point

What's the cost of a circuit, and how is it affected by frequency? (This section shows the happy part of the answer – the sad part is in the next section.)

  1. Silicon area. The higher the clock frequency, the more things the same circuit occupying this area does per unit of time – so you win!
  2. Leakage power – just powering up the circuit and doing nothing, not even toggling the clock signal, costs you a certain amount of energy per unit of time. Here again, the higher the frequency, the more work gets done in exchange for the same leakage power – again you win!
  3. Switching power – every time the clock signal changes its value from 0 to 1 and back, this triggers a bunch of changes to the values of other signals as dictated by the interconnection of the logic gates, flip-flops – everything making up the circuit. All this switching from 0 to 1 and back costs energy (and NOT switching does not; measure the power dissipated by a loop multiplying zeros vs a loop multiplying random data, and you'll see what I mean. This has implications for the role of software in conserving energy, but this is outside our scope here.) What's the impact of frequency on cost here? It turns out that frequency is neutral – the cost in energy is directly proportional to the clock frequency, but so is the amount of work done.

Overall, higher frequency means spending less area and power per unit of work – the opposite of the peanut gallery's conventional wisdom.

Frequency degrades efficiency from some point

At some point, however, higher frequency does start to increase the cost of the circuit per unit of work. The reasons boil down to having to build your circuit out of physically larger elements that leak more power. Even further down the frequency-chasing path come other problems, such as having to break down your work to many more pipeline stages, spending area and power on storage for the intermediate results of these stages; and needing expensive cooling solutions for heat dissipation. So actually there are several points along the road, with the cost of extra MHz growing at each point – until you reach the physically impossible frequency for a given manufacturing process.

How do you find the point where an extra MHz isn't worth it? For a synthesizable design (one created in a high-level language like Verilog or VHDL), you can synthesize it for different frequencies, measure the cost in area and power, and plot the results. My confidence about where the inflection point should be comes from looking at such plots. Of course the plot will depend on the design, bringing us to the next point.
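
To make the shape of such a plot concrete, here's a toy model in Python. The knee frequency, coefficients and growth exponents are all invented for illustration – the real curve comes from synthesizing your actual design in your actual process:

# Toy model (made-up numbers, not from any real process) of energy per unit
# of work vs clock frequency.  Below some knee frequency, leakage is simply
# amortized over more work; above it, we pretend the tools must use larger,
# leakier cells, so both terms grow super-linearly.
def energy_per_op(f_mhz, f_knee=800.0, e_switch=1.0, p_leak=200.0):
    if f_mhz <= f_knee:
        switch, leak = e_switch, p_leak
    else:
        over = f_mhz / f_knee
        switch, leak = e_switch * over**2, p_leak * over**3
    return switch + leak / f_mhz  # leakage is paid per unit of time, not per op

for f in (100, 200, 400, 800, 1000, 1200, 1600):
    print("%5d MHz: %6.2f energy/op" % (f, energy_per_op(f)))

The point of the sketch is just the shape: cost per unit of work keeps falling all the way up to the knee, and only then starts climbing.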

Better-designed circuits' optimal frequency is higher

One hard part of circuit design is, you're basically making a hugely parallel system, where many parts do different things. Each part doing the same thing would be easy – they all take the same time, duh, so no bottleneck. Conversely, each part doing something else makes it really easy to create a bottleneck – and really hard to balance the parts (it's hard to tell exactly how much time a piece of work takes without trying, and there are a lot of options you could try, each breaking the work into different parts.)

You need to break the harder things into smaller pipeline stages (yes, a cost in itself as we've just said – but usually a small cost unless you target really high frequencies and so have to break everything into umpteen stages.) Pipelining is hard to get right when the pipeline stages are not truly independent, and people often recoil from it (a hardware bug is on average more likely to be catastrophically costly than somewhat crummier performance.) Simpler designs also shorten schedules, which may be better than reaching a higher frequency later.

So for CPUs – competing for a huge market on serial performance and (stupidly) on advertised frequency, while implementing a comparatively stable instruction set – the effort to overcome these obstacles was justified. (Sometimes to the detriment of consumers, arguably, as with the Pentium 4 – high frequency, but low serial performance due to too much pipelining.)

Accelerators are different. You can to some extent compensate for poor serial performance by throwing money at the problem - add more cores. Sometimes you don't care about extra performance – if you can decode video at the peak required rate and resolution, extra performance might not win more business. Between frequency improvements and architecture improvements/implementing a huge new standard, the latter might be more worthwhile. And then the budgets are generally smaller, so you tend to design more conservatively.

So AFAIK this is why so many embedded accelerators had crummy frequencies when they started out (and they also had apologists explaining why it was a good thing). And that's why some of the accelerators caught up – basically it was never a technical limitation but an economic problem of where to spend effort, and changing circumstances caused effort to be invested into improving frequency. And that's why if you're making an accelerator core which is 3 times slower than the CPU in the same chip, my first guess is your design isn't stellar at this stage, though it might improve – if it ever has to.

P.S. I'll say it again – my perspective can be skewed; someone with different experience might point out some oversimplifications. Different process nodes and different implementation constraints mean that what's decisive in one's experience is of marginal importance in another's experience. So please do correct me if I'm wrong in your experience.

P.P.S. Theoretically, a design running at 1 GHz might be doing the exact same amount of work as a 2 GHz design – if the pipeline is 2x shorter and each stage in the 1 GHz thing does the work of 2 stages in the 2 GHz thing. In practice, the 1 GHz design will have stages doing less work, so they complete in less than the full 1 ns cycle (1/1 GHz) and are idle during the rest of it. And this is why you want to invest some effort to up the frequency in that design – to not have mostly-idle circuitry leaking power and using up area. But the theoretically possible, perfectly balanced 1 GHz design is a valid counter-argument to all of the above – I just don't think that's what hides behind most crummy frequencies.

Update: here's an interesting complication – Norman Yarvin's comment points to an article about near-threshold voltage research by Intel, from which it turns out that a Pentium implementation designed to operate at near-threshold voltage (at a near-2x cost in area) achieves its best energy efficiency at 100 MHz – 10x slower than its peak frequency but spending 47x less energy. The trouble is, if you want that 10x performance back, you'd need 10 such cores for an overall area increase of 20x, in return for overall energy savings of 4.7x. Other points on the graph will be less extreme (less area spent, less energy saved.)

So this makes sense when silicon area is tremendously cheaper than energy, or when there's a hard limit on how much energy you can spend but a much laxer limit on area. This is not the case most of the time, AFAIK (silicon costs a lot and then it simply takes physical space, which also costs), but it can be the case some of the time. NTV can also make sense if voltage is adjusted dynamically based on workload: if you don't need high performance most of the time, the 2x area cost of your peak performance may matter less to you than the ability to conserve energy tremendously when the performance isn't needed.

Anyway, it goes to show that it's more complicated than I stated, even if I'm right for the average design made under today's typical constraints.

Working simultaneously vs waiting simultaneously

"Multiprocessing", "multi-threading", "parallelism", "concurrency" etc. etc. can give you two kinds of benefits:

  • Doing many things at once – 1000 multiplications every cycle.
  • Waiting for many things at once – wait for 1000 HTTP requests just issued.

Some systems help with one of these but not the other, so you want to know which one – and if it's the one you need.

For instance, CPython has the infamous GIL – global interpreter lock. To what extent does the GIL render CPython "useless on multiple cores"?

  • Indeed you can hardly do many things at once – not in a single pure Python process. One thread doing something takes the GIL and the other thread waits.
  • You can however wait for many things at once just fine – for example, using the multiprocessing module (pool.map), or you could spawn your own thread pool to do the same. Many Python threads can concurrently issue system calls that wait for data – reading from TCP sockets, etc. Then instead of 1000 request-wait, request-wait steps, you issue 1000 requests and wait for them all simultaneously. Could be close to a 1000x speed-up for long waits (with a 1000-thread worker pool; more on that below). Works like a charm.
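
Here's a minimal sketch of that pattern using a standard-library thread pool (the URLs are placeholders):

# Simultaneous waiting despite the GIL: each worker thread spends nearly all
# its time blocked in a system call, during which CPython releases the GIL,
# so the 100 requests overlap instead of running request-wait, request-wait.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = ["http://example.com/item/%d" % i for i in range(100)]  # placeholders

def fetch(url):
    with urlopen(url, timeout=10) as response:  # blocks here, GIL released
        return len(response.read())

with ThreadPoolExecutor(max_workers=100) as pool:  # one thread per request
    sizes = list(pool.map(fetch, URLS))            # issue all, wait for all

print(sum(sizes), "bytes fetched")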

So GIL is not a problem for "simultaneous waiting" for I/O. Is GIL a problem for simultaneous processing? If you ask me – no, because:

  • If you want performance, it's kinda funny to use pure Python and then mourn the fact that you can't run, on 8 cores, Python code that's 30-50x slower than C to begin with.
  • On the other hand, if you use C bindings, then the C code could use multiple threads actually running on multiple cores just fine; numpy does it if properly configured, for instance. Numpy also uses SIMD/vector instructions (SSE etc.) – another kind of "doing many things at once" that pure Python can't do regardless of the GIL.

So IMO Python doesn't have as bad a story in this department as it's reputed to have – and if it does look bad to you, you probably can't tolerate Python's slowness doing one thing at a time in the first place.

So Python – or C, for that matter – is OK for simultaneous waiting, but is it great? Probably not as great as Go or Erlang – which let you wait in parallel for millions of things. How do they do it? Cheap context management.

Context management is a big challenge of waiting for many things at once. If you wait for a million things, you need a million sets of variables keeping track of what exactly you're waiting for (has the header arrived? then I'm waiting for the query. has it arrived? then I ask the database and wait for it etc. etc.)

If those variables are thread-local variables in a million threads, then you run into one of the problems with C – and hence OS-supported threads designed to run C. The problem is that C has no idea how much stack it's gonna need (because of the halting problem, so you can't blame C); and C has no mechanism to detect that it ran out of stack space at runtime and allocate some more (because that's how its ABIs have evolved; in theory C could do this, but it doesn't.)

So the best thing a Unixy OS could do is, give C one page for the stack (say 4K), and make, say, the next 1-2M of the virtual address space inaccessible (with 64b pointers, address space is cheap). When C page-faults upon stack overflow, give it more physical memory – say another 4K. This method means at least 4K of allocated physical memory per thread, or 4G for a million threads – rather wasteful. (I think in practice it's usually way worse.) All regardless of us often needing a fraction of that memory for the actual state.

And that's before we got to the cost of context switching – which can be made smaller if we use setjmp/longjmp-based coroutines or something similar, but that wouldn't help much with stack space. C's lax approach to stack management – which is the way it is to shave a few cycles off the function call cost – can thus make C terribly inefficient in terms of memory footprint (speed vs space is generally a common trade-off – it's just a bad one in the specific use case of "massive waiting" in C).

So Go/Erlang don't rely on the C-ish OS threads but roll their own – based on their stack management, which doesn't require a contiguous block of addresses. And AFAIK you really can't get readable and efficient "massive waiting" code in any other way – your alternatives, apart from the readable but inefficient threads, are:

  • Manual state machine management – yuck
  • Layered state machines as in Twisted – better, but you still have callbacks looking at state variables
  • Continuation passing as in Node.js – perhaps nicer still, but still far from the smoothness of threads/processes/coroutines

The old Node.js slides say that "green threads/coroutines can improve the situation dramatically, but there is still machinery involved". I'm not sure how that machinery – the machinery in Go or Erlang – is any worse than the machinery involved in continuation passing and event loops (unless the argument is about compatibility more than efficiency – in which case machinery seems to me a surprising choice of words.)

Millions of cheap threads or whatever you call them are exciting if you wait for many events. Are they exciting if you do many things at once? No; C threads are just fine – and C is faster to begin with. You likely don't want to use threads directly – it's ugly – but you can multiplex tasks onto threads easily enough.

A "task" doesn't need to have its own context – it's just a function bound to its arguments. When a worker thread is out of work, it grabs the task out of a queue and runs it to completion. Because the machine works - rather than waits - you don't have the problems with stack management created by waiting. You only wait when there's no more work, but never in the middle of things.

So a thread pool running millions of tasks doesn't need a million threads. It can be a thread per core, maybe more if you have some waiting – say, if you wait for stuff offloaded to a GPU/DSP.
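
Here's a minimal sketch of that task-queue pattern in Python (for pure-Python number crunching you'd use processes rather than threads because of the GIL, but the structure is the same either way):

# A "task" is just a function bound to its arguments; a handful of worker
# threads - one per core - pull tasks from a queue and run each to completion.
# No per-task stack, no mid-task waiting: a million tasks need no million threads.
import os, queue, threading

tasks = queue.Queue()
DONE = object()  # sentinel telling a worker to exit

def worker():
    while True:
        item = tasks.get()
        if item is DONE:
            break
        func, args = item
        func(*args)  # run to completion, then the thread is reused

def handle(chunk_id):
    pass  # placeholder for the actual number crunching on one chunk

workers = [threading.Thread(target=worker) for _ in range(os.cpu_count() or 1)]
for w in workers:
    w.start()
for i in range(1000000):  # a million tasks, a handful of threads
    tasks.put((handle, (i,)))
for _ in workers:
    tasks.put(DONE)
for w in workers:
    w.join()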

I really don't understand how Joe Armstrong could say Erlang is faster than C on multiple cores, or things to that effect, with examples involving image processing – instead of event handling which is where Erlang can be said to be more efficient.

Finally, a hardware-level example – which kind of hardware is good at simultaneous working, and which is good at simultaneous waiting?

If your goal is parallelizing work, eventually you'll deteriorate to SIMD. SIMD is great because there's just one "manager" – instruction sequencer – for many "workers" – ALUs. CPUs, DSPs and GPUs all have SIMD. NVIDIA calls its ALUs "cores" and 16-32 ALUs running the same instruction "threads", but that's just shameless marketing. A "thread" implies, to everyone but a marketeer, independent control flow, while GPU "threads" march in lockstep.

In practice, SIMD is hard despite thousands of man-years having been invested into better languages and libraries – because telling a bunch of dumb soldiers marching in lockstep what to do is just harder than running a bunch of self-motivating threads each doing its own thing.

(Harder in one way, easier in another: marching in lockstep precludes races – non-deterministic, once-in-a-blue-moon, scary races. But races of the kind arising between worker threads can be almost completely remedied with tools. Managing the dumb ALUs cannot be made easier with tools and libraries to the same extent – not even close. Where I work, there's roughly an entire team responsible for SIMD programming, while threading is mostly automatic and bugs are weeded out by automated testing.)

If, however, you expect to be waiting much of the time – for memory or for high-latency floating point operations, for instance – then hordes of hardware threads lacking their own ALUs, as in barrel threading or hyper-threading, can be a great idea, while SIMD might do nothing for you. Similarly, a bunch of weaker cores can be better than a smaller number of stronger cores. The point being, what you really need here is a cheap way to keep context and switch between contexts, while actually doing a lot at once is unlikely to be possible in the first place.

Conclusions

  • Doing little or nothing while waiting for many things is both surprisingly useful and surprisingly hard (which took me way too long to internalize, both in my hardware-related and in my software/server-related work). It motivates things looking rather strange, such as "green threads" and hardware threads without their own ALUs.
  • Actually doing many things in parallel – to me the more "obviously useful" thing – is difficult in an entirely different way. It tends to drag in ugly languages, intrinsics, libraries etc. about as much as having to do one single thing quickly. The "parallelism" part is actually the simplest (few threads so easy context management; races either non-existent [SIMD] or very easy to weed out [worker pool running tasks])
  • People doing servers (which wait a lot) and people doing number-crunching (work) think very differently about these things. Transplanting experience/advice from one area to the other can lead to nonsensical conclusions.

See also

Parallelism and concurrency need different tools – expands on the reasons for races being easy to find in computational code, but impossible to even uniformly define for most event handling code.

How FPGAs work, and why you'll buy one

Update (June 21): this article has been published at embeddedrelated.com, where I hope to publish a follow-up soon.

Today, pretty much everyone has a CPU, a DSP and a GPU, buried somewhere in their PC, phone, car, etc. Most don't know or care that they bought any of these, but they did.

Will everyone, at some future point, also buy an FPGA? The market size of FPGAs today is about 1% of the annual global semiconductor sales (~$3B vs ~$300B). Will FPGA eventually become a must-have, or will its volume remain relatively low?

We'll try to answer this question below. In order to see how popular FPGAs could become, we'll need to discuss what FPGAs are. FPGAs are a programmable platform, but one designed by EEs for EEs rather than for programmers. So for many programmers, FPGAs are exciting yet mysterious; I hope our discussion will help demystify them.

We'll start with a common explanation of FPGAs' relatively low popularity. We'll see why that explanation is wrong – and why, if we take a closer look, we actually come to expect FPGAs to blow the competition out of the water!

This will conclude today's installment, "Why you'll buy an FPGA". A sequel is in the making, titled "Why you won't buy an FPGA". There, we'll see some of the major obstacles standing between FPGAs and world domination.

The oft-repeated wrong answer

…to the question of "why aren't FPGAs more popular?" is, "FPGA is a poor man's alternative to making chips. You can implement any circuit design in an FPGA, but less efficiently than you could in an ASIC or a custom design. So it's great for prototyping, and for low-volume products where you can't afford to make your own chips. But it makes no sense for the highest-volume devices – which happen to add up to 99% of sales, leaving 1% to FPGAs."

This is wrong because programmability is a feature, not just a tax on efficiency.

Of course a Verilog program doing convolution on an FPGA would run faster if you made a chip that runs just that program. But you typically don't want to do this, even for the highest-volume products, any more than you want to convert your C programs running on CPUs into dedicated hardware! Because you want to change your code, run other programs, etc. etc.

When programmability is required – which is extremely often – then the right thing to compare FPGAs to is another programmable platform: a DSP, a GPU, etc. And, just like FPGAs, all of these necessarily introduce some overhead for programmability. So we can no longer assume, a priori, that any one option is more efficient than another – as we did when comparing FPGAs to single-purpose ASICs.

We need benchmarks – and FPGAs' performance appears very competitive in some benchmarks. Here's what BDTI's report from 2007 says:

…we estimated that high-end FPGAs implementing demanding DSP applications … consume on the order of 10 watts, while high-end DSPs consume roughly 2-3 watts. Our benchmark results have shown that high-end FPGAs can support roughly 10 to 100 times more channels on this benchmark than high-end DSPs…

So for that benchmark, FPGAs offer 10x-100x the runtime performance, and 2x-30x the energy efficiency of DSPs – quite impressive!

But wait – how are they so efficient?

FPGAs are no longer FPGAs

Aren't FPGAs Field-Programmable Gate Arrays?

Programmable gate arrays can't multiply as efficiently as dedicated multipliers, can they? A dedicated multiplier is a bunch of gates connected with wires – the specific gates that you need for multiplying, connected specifically to the right other gates as required for multiplication.

A programmable gate array is when your gates are generic. They index into a truth table (called a look-up table or LUT) with their inputs, and fetch the answer. With a 2-input LUT, you get an OR gate or an AND gate or whatever, depending on the truth table you programmed. With 3-input LUTs, you can have a single gate computing, say, (a&b)|c, but the principle is the same:

This absolutely must be bigger and slower than just an OR gate or an AND gate!
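
To make the truth-table idea concrete, here's a toy software model of a 3-input LUT in Python – purely an illustration of the indexing, not of how the configuration bits are actually stored in silicon:

# A generic 3-input "gate": 8 configuration bits, indexed by the inputs.
def make_lut3(truth_table):           # truth_table: 8 bits, one per input combo
    def gate(a, b, c):
        return (truth_table >> ((a << 2) | (b << 1) | c)) & 1
    return gate

# The same generic LUT configured as two different gates:
and_or = make_lut3(0b11101010)        # computes (a & b) | c
nand3  = make_lut3(0b01111111)        # computes ~(a & b & c)

assert and_or(1, 1, 0) == 1 and and_or(0, 0, 1) == 1 and and_or(0, 1, 0) == 0
assert nand3(1, 1, 1) == 0 and nand3(1, 0, 1) == 1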

Likewise, wires go through programmable switch boxes, which connect wires as instructed by programmable bits:

There are several switch box topologies determining which wires can be connected to which. But whatever the topology, this must be bigger and slower than wires going directly to the right gates.

All this is indeed true, and a "bare" FPGA having nothing but programmable gates and routers cannot compete with a DSP. However, today's FPGAs come with DSP slices – specialized hardware blocks placed amidst the gates and routers, which do things like multiply-accumulate in "hard", dedicated gates.

So that's how FPGAs compete with DSPs – they have DSP hardware in them! Cheating, isn't it?

Well, yes and no.

It's "cheating" in the sense that FPGAs aren't really FPGAs any more – instead, they're arrays of programmable gates plus all that other stuff. A "true FPGA" would look like this:

Instead, a high-end modern FPGA looks like this:

To be competitive in DSP applications, FPGAs need DSP slices – ALUs doing things like multiply-accumulates.

To be competitive in applications needing a CPU – which is most of them – today's FPGAs have more than just specialized ALUs. They have full-blown ARM cores implemented using "hard", non-programmable gates!

So you've been "cheated" if you thought of FPGAs as "clean slates" suitable for any design. In reality, FPGAs have specialized hardware to make them competitive in specific areas.

And you can sometimes guess where they're less competitive by observing which specializations they lack. For instance, there are no "GPU slices", and indeed I don't believe FPGAs can compete with GPUs in their own domain as they compete with DSPs. (Why not simply add GPU slices then? Here the plot thickens, as we'll see in the follow-up article.)

But of course having DSP slices is more than just "cheating" – because look at just how many DSP slices FPGAs have. The cheapest FPGAs can do hundreds of multiply-accumulates simultaneously! (My drawing above has the wrong scale – imagine hundreds of small DSP slices near a couple of much larger CPUs.)

And hundreds of MACs is a big deal, because while anyone can cram a load of multipliers into a chip, the hard part is to connect it all together, letting a meaningful program actually use these multipliers in parallel.

For instance, TI's C64 DSPs can do 8 MACs per cycle – but only if it's a dot product. TI's C66 DSPs can do 32 MACs/cycle – but only if you're multiplying complex numbers. You only get the highest throughput for very specific data flows.

To the extent that the FPGA architecture lets you actually use an order of magnitude more resources at a time, and do that in more real-life examples, it is a rather unique achievement. And this is how they actually beat dedicated DSPs with their DSP slices, not just reach the same performance.

FPGA as a programmable accelerator architecture

So what makes FPGAs such an efficient architecture? There's no simple answer, but here are some things that FPGAs can use to their advantage:

  • No need for full-blown ALUs for simple operations: a 2-bit adder doesn't need to be mapped to a large, "hard" DSP slice – it can fit comfortably in a small piece of "soft" logic. With most processors, you'd "burn" a full-blown ALU to do the simplest thing.
  • No need for a full cycle for simple operations: on FPGAs, you don't have to sacrifice a full cycle to do a simple operation, like an OR, which has a delay much shorter than a full cycle. Instead, you can feed OR's output immediately to the next operation, say, AND, without going through registers. You can chain quite a few of these, as long as their delays add up to less than a cycle. With most processors, you'd end up "burning" a full cycle on each of these operations.
  • Distributed operand routing: most processors have their ALUs communicate through register files. With all the ALUs connected to all the registers, there's a bottleneck – this interconnect grows as the product of the number of ALUs and registers, so you can't have too many of either. FPGAs spread ALUs and registers throughout the chip, and you can connect them in ways not creating such bottlenecks – say, as a long chain, as a tree, and in many other ways. Of course you can also route everything through a bottleneck, and then your design will run at a low frequency – but you don't have to. CPUs and DSPs run at a high frequency because the number of ALUs and registers was limited to make that frequency possible. But in FPGAs you can get both high frequencies and a lot of resources used in parallel.
  • Distributed command dispatching: a 2-issue or a 6-issue processor is common, but 100-issue processors are virtually unheard of. Partly it's because of the above-mentioned operand routing, and partly it's because of command dispatching – you'd have to fetch all those commands from memory, another bottleneck. In FPGAs, you can implement command-generating logic in simple state machines residing near your ALUs – and in the simplest case, commands are constants kept in registers residing near ALUs. This lets you easily issue 100 parallel instructions.

This "distributed" business is easier to appreciate by looking at an example. Here's a schematic implementation of a 1D convolution on an FPGA – you convolve a long vector v with an N-coefficient filter f, computing, at every i, f0*v[i] + f1*v[i-1] + f2*v[i-2] + … + fN-1*v[i-N-1]:

In this drawing, N=8, but it scales easily to arbitrary N, producing results at a slightly larger latency – the summation tree depth being log(N).

The orange boxes are registers; commands like + and * are stored in registers, as are inputs and outputs. (I'm feeding the output of * to + directly without going through a register to save screen space.) Every clock cycle, inputs are fed to ALUs, and the outputs become the new register values.

Orange boxes (registers) spread amongst green boxes (ALUs) illustrate "distributed operand and command routing". If you wonder what it all looks like in code, Verilog source code corresponding to this drawing appears near the end of the article.
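
If you want to pin down the math the pipeline implements, here's a plain-Python reference model of the same convolution (just a golden model for checking results – not how you'd compute it on the FPGA):

# out[i] = f[0]*v[i] + f[1]*v[i-1] + ... + f[N-1]*v[i-(N-1)]
def conv(v, f):
    N = len(f)
    return [sum(f[k] * v[i - k] for k in range(N)) for i in range(N - 1, len(v))]

assert conv([1, 2, 3, 4, 5], [1, 10]) == [12, 23, 34, 45]  # v[i] + 10*v[i-1]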

And here's a linear pipeline without a summation tree:

This is a little trickier, at least to me (I had a bug in my first drawing, hopefully it's fixed). The idea is, every pair of ALUs computes a product of fk with v[i-k], adds it to the partial sum accumulated thus far, and sends the updated partial sum downstream to the next pair of ALUs.

The trick is this. The elements of v are also moving downstream, together with the sums. But after v[i] got multiplied by f0, you don't want to multiply it by f1 in the next cycle. Instead, you want to multiply v[i-1] by f1 – that's the product that we need for the convolution at index i. And then you do want to multiply v[i] by f1 one cycle later – for the convolution at index i+1. I hope that my sampling of v[i] to an intermediate register, which delays its downstream motion, does the trick.

So these two examples show how FPGA programming is different from programming most kinds of processors – and how it can be more efficient. More efficient, because you can use a lot of ALUs simultaneously with little overhead spent on dispatching commands and moving inputs and outputs between ALUs. An argument can be made that:

  • FPGAs are more flexible than SIMD/SIMT. You can give different instructions to different ALUs, and you can route operands from different places. Contrast this with SIMD instructions like add_16_bytes, with byte i always coming from offset i inside a wide register.
  • FPGAs scale better than VLIW/superscalar. More instructions can be issued simultaneously, because there's no routing bottleneck near the register file, and no instruction memory bandwidth bottleneck.
  • FPGAs are more efficient than multiple cores. Multiple cores are flexible and can scale well. But you pay much more overhead per ALU. Each core would come with its own register files and memories, and then there are communication overheads.

This gives us a new perspective on LUTs and switch boxes. Yes, they can be an inefficient, cheaper-to-manufacture alternative to dedicated gates and wires. But they are also a mechanism for utilizing the "hard" components spread in between them – sometimes better than any other mechanism.

And this is how FPGAs beating DSPs with the help of DSP slices isn't "cheating". (In fact, mature DSPs "cheat" much more by having ugly, specialized instructions. Far more specialized than FPGAs' multiply-accumulate, dot product instructions being among the least ugly. And the reason they need such instructions is they don't have the flexibility of FPGAs, so what FPGAs effectively do in software, they must do in hardware in order to optimize very specific data flows.)

I/O applications

But wait – there's more! In addition to being a hardware prototyping platform and an accelerator architecture, FPGAs are also uniquely suited for software-defined I/O.

"Software-defined I/O" is the opposite of "hardware-defined I/O" – the common state of things, where you have, for instance, an Ethernet controller implementing some share of TCP or UDP in hardware. Software-defined I/O is when you have some programmable hardware instead of dedicated hardware, and you implement the protocols in software.

What makes FPGAs good at software-defined I/O?

  • Timing control: Verilog and other hardware description languages give you more precise control over timing than perhaps any other language. If you program it to take 4 cycles, it takes 4 cycles – no cache misses or interrupts or whatever will get in your way unexpectedly. And you can do a whole lot in these 4 cycles – FPGAs are good at issuing plenty of instructions in parallel as we've seen. This means you don't have to account for runtime variability by buffering incoming data, etc. – you know that every 4 cycles, you get a new byte/pixel/etc., and in 4 cycles, you're done with it. This is particularly valuable where "deep" buffering is unacceptable because the latency it introduces is intolerable – say, in a DRAM controller. You can also do things like generating a clock signal at a desired frequency, or deal with incoming clock signal at a different frequency than yours.
  • Fine-grained resource allocation: you "burn" a share of FPGA resources to handle some peripheral device – and that's what you've spent. With other processor cores, you'll burn an entire core – "this DSP handles WiFi" – even if the core is idle much of the time. (The FPGA resources are also burnt that way – but you'll often spend less resources than a full processor core takes.) Alternatively, you can time-share that DSP core – but it's often gnarly. Many kinds of cores expose a lot of resources that must be manually context-switched at an intolerably high latency. Core asymmetry gets in the way of thread migration. And with two I/O tasks, often none can tolerate being suspended for a long while, so you'll definitely burn two cores. (One solution is hardware multithreading.)

The upshot is that relatively few processors other than FPGAs are suitable for software-defined I/O. The heavily multi-threaded XMOS is claimed to be one exception. Then there are communication processors such as the hardware-threaded Qualcomm Hexagon DSP and the CEVA-XC DSPs. But these are fairly specialized; you couldn't use them to implement a memory controller or an LVDS-to-parallel video bridge, both of which you could do with an FPGA.

And of course, FPGA's I/O capabilities can be combined with computation acceleration – get pixels and enhance the image color on the fly, get IP packets with stock info and decide which stocks to trade on the fly.

Programmable, efficient, and versatile, FPGAs are starting to sound like a great delivery platform.

Summary

There are several points that I tried to make. Some are well-known truisms, and others are my own way of looking at things, which others might find debatable or at least unusually put.

  • While FPGAs are a great small-scale circuit delivery platform, they can also be a large-scale software delivery platform. You can think of FPGAs as "inefficiently simulating circuits". But in other contexts, you can also think of them as "efficiently executing programs"!
  • Instead of fixed-function gates and wires connecting specific gates to each other, FPGAs use programmable gates – configured by setting a truth table of choice – and programmable switch boxes, where incoming wires are connected to some of the other wires based on configuration bits. By itself, it's very inefficient compared to a "direct" implementation of a circuit.
  • Then how can FPGAs beat, not just CPUs, but specialized accelerators like DSPs in their own game? The trick is, they're no longer FPGAs – gate arrays. Instead, they're also arrays of RAMs and DSP slices. And then they have full-blown CPUs, Ethernet controllers, etc. implemented in fixed-function hardware, just like any other chip.
  • In such modern FPGAs, the sea of LUTs and switch boxes can be used not instead of fixed-function circuits, but as a force multiplier letting you make full use of your fixed-function circuits. LUTs and switch boxes give two things no other processor architecture has. First, the ability to use less than a full-blown ALU for simple things – and less than a full clock cycle. Second, distributed routing of commands and operands – arguably more flexible than SIMD, more scalable than superscalar execution, and more efficient than multiple instruction streams.
  • FPGAs are the ultimate platform for software-defined I/O because of their timing control (if I said 4 cycles, it takes 4 cycles) and fine-grained resource allocation (spend so many registers and ALUs per asynchronous task instead of dedicating a full core or having to time-share it).

With all these advantages, why just 1% of the global semiconductor sales? One reasonable answer is that it took FPGAs a long time to evolve into their current state. Things FPGAs have today that they didn't have in the past include:

  • Fixed-function hardware essential for performance – this gradually progressed from RAM to DSP slices to complete CPUs.
  • Quick runtime reconfiguration, so that you can run convolution and then replace it with FFT – which you can't, and shouldn't be able to do, if you're thinking of FPGA as simulating one circuit.
  • Practically usable C-to-Verilog compilers, letting programmers – at least reasonably hardcore ones who nonetheless aren't circuit designers – approach FPGA programming easily enough.

All of these things cater to programmers as much or more than they cater to circuit designers. This shows that FPGAs are gunning for a position in the large-scale software delivery market, outside their traditional small-scale circuit implementation niche. (Marketing material by FPGA vendors confirms their intentions more directly.)

So from this angle, FPGAs evolved from a circuit implementation platform into a software delivery platform. Being a strong programmable architecture, they're expected to rise greatly in popularity, and, like other programmable architectures, end up everywhere.

Unanswered questions

As a teaser for the sequel, I'll conclude with some questions which our discussion left unanswered.

Why do FPGAs have DSP slices and full-blown "hard" CPUs? Why not the other way around – full-blown DSP cores, and some sort of smaller "CPU slices"? Where are the GPU slices? And if rationing individual gates, flip-flops and picoseconds instead of full ALUs, registers and clock cycles is so great, why doesn't everyone else do it? Why do they all break up resources into those larger chunks and only give software control over that?

Stay tuned for the sequel – "How FPGAs work, and why you won't buy one".

P.S. Programmable – how?

So how do you program the programmable gate array? Talk is cheap, and so are Microsoft Paint drawings. Show me the code!

The native programming interface is a hardware description language like Verilog. Here's an implementation of the tree-like convolution pipeline in Verilog – first the drawing and then the code:

module conv8(clk, in_v, out_conv);
  //inputs & outputs:
  input clk; //clock
  input [7:0] in_v; //1 8-bit vector element
  output reg [18:0] out_conv; //1 19-bit result

  //internal state:
  reg [7:0] f[0:7]; //8 8-bit coefficients
  reg [7:0] v[0:7]; //8 8-bit vector elements
  reg [15:0] prod[0:7]; //8 16-bit products
  reg [16:0] sum0[0:3]; //4 17-bit level 0 sums
  reg [17:0] sum1[0:1]; //2 18-bit level 1 sums

  integer i; //index for loops unrolled at compile time

  always @(posedge clk) begin //when clk goes from 0 to 1
    v[0] <= in_v;
    for(i=1; i<8; i=i+1)
      v[i] <= v[i-1];
    for(i=0; i<8; i=i+1)
      prod[i] <= f[i] * v[i];
    for(i=0; i<4; i=i+1)
      sum0[i] <= prod[i*2] + prod[i*2+1];
    for(i=0; i<2; i=i+1)
      sum1[i] <= sum0[i*2] + sum0[i*2+1];
    out_conv <= sum1[0] + sum1[1];
  end
endmodule

This example shows how "distributed routing" actually looks in code – and the fine-grained control over resources, defining things like 17-bit registers.

And it's fairly readable, isn't it? Definitely prettier than a SIMD program spelled with intrinsics – and more portable (you can target FPGAs by different vendors as well as an ASIC implementation using the same source code; it's not trivial, but not hopeless unlike with SIMD intrinsics, and probably not harder than writing actually portable OpenCL kernels.)

Incidentally, Verilog is perhaps the quintessential object-oriented language – everything is an object, as in a physical object: a register, a wire, a gate, or a collection of simpler objects. A module is like a class, except you can't create objects (called instantiations) dynamically – all objects are known at compile time and mapped to physical resources.

Verilog insists on encapsulation as strictly as it possibly could: there's simply no way to set an object's internal state. Because how could you affect that state, physically, without a wire going in? Actually, there is a way to do that – the usual instance.member syntax; hardware hackers call this "an antenna", because it's "wireless" communication with the object's innards. But it doesn't synthesize – that is, you can do it in a simulation, but not in an actual circuit.

Which means that our example module is busted, because we can't initialize the filter coefficients, f. In simulations, we can use antennas. But on an FPGA, we'd need to add, say, an init_f input, and then when it's set to 1, we could read the coefficients from the same port we normally use to read v's elements. (BTW, not that it adds much efficiency here, but the "if" test below is an example of an operation taking less than a cycle.)

always @(posedge clk) begin
  if(init_f) begin
    f[0] <= in_v;
    for(i=1; i<8; i=i+1)
      f[i] <= f[i-1];
  end
end

A triumph of encapsulation, it's also a bit of a pity, because there are now actual wires and some control logic sitting near our coefficient registers, enlarging the circuit, only to be used upon initialization. We're used to class constructors "burning" a few memory bits; who cares – the bits are quickly swapped out from the instruction cache, so you haven't wasted resources of your computational core. But Verilog module initialization "burns" LUTs and wires, and it's not nearly as easy to reuse them for something else. We'll elaborate on this point in the upcoming sequel.

Not only is Verilog object-oriented, but it's also the quintessential language for event-driven programming: things are either entirely static (these here bits go into this OR gate), or triggered by events (changes of signals, very commonly a clock signal which oscillates between 0 and 1 at some frequency). "always @(event-list)" is how you say what events should cause your statements to execute.

Finally, Verilog is a parallel language. The "static" processes, like bits going into OR gates, as well as "event-driven processes", like statements executing when the clock goes from 0 to 1, all happen in parallel. Within a list of statements, "A <= B; C <= A;" are non-blocking assignments. They happen in parallel, so that A is assigned the value of B, and C is simultaneously assigned the (old) value of A.

So, for example, prod[i]<=f[i]*v[i] sets the new value of prod, and in parallel, sums are computed from the old values of prod, making it a pipeline and not a serial computation. (Alternatively, we could use blocking assignments, "=" instead of "<=", to do it all serially. But then it would take more time to execute our series of statements, lowering our frequency, as clk couldn't switch from 0 to 1 again until the whole serial thing completes. Synthesis tools tell you the maximal frequency of your design when they're done compiling it.)

On top of its object-oriented, event-based, parallel core, Verilog delivers a ton of sweet, sweet syntactic sugar. You can write + and * instead of having to instantiate modules with "adder myadd(a,b)" or "multiplier mymul(a,b)" – though + and * are ultimately compiled down to module instances (on FPGAs, these are often DSP slice instances). You can use if statements and array indexing operators instead of instantiating multiplexors. And you can write loops to be unrolled by the compiler, generate instantiations using loop syntax, parameterize your designs so that constants can be configured by whoever instantiates them, etc. etc.

If all this doesn't excite you and you'd rather program in C, you can, sort of. There have been loads of "high-level synthesis tools" – basically C to Verilog compilers – and their quality has increased over the years.

You'd be using a weird C dialect – no function pointers or recursion, extensions to specify the exact number of bits in your integers, etc. You'd have to use various #pragmas to guide the compilation process. And you'd have things like array[index++] not actually working with a memory array – and index++ not actually doing anything – because you're getting values, not from memory, but from a FIFO, or directly from the output of another module (just like in_v in our Verilog code doesn't have to come from memory, and out_conv doesn't have to go to memory.)

But you can use C, sort of – or Verilog, for real. Either way, you can write fairly readable FPGA programs.

The bright side of dark silicon

It's been a decade or so since the end of frequency scaling, and multicore has become ubiquitous, there being no other means to increase a chip's performance.

Some multicore systems are symmetric – all cores are identical, so you can easily move work from one core to another. Others are asymmetric – as in CPU cores and GPU cores, where it's harder to move work between different types of cores.

Which is better – symmetric or asymmetric multicore?

Why symmetric is better

Three main reasons that I see:

  • Better load balancing
  • Less work for everyone
  • More redundancy

Better load balancing

Asymmetric multicore makes load balancing harder, because a GPU can't easily yank a job from a queue shared with a CPU and run that job. That's because some of those jobs are simply impossible to run on a GPU. Others run so badly that it's not worth the trouble.

And those CPU codes that could run OK on GPUs would have to be compiled twice – for the CPU and the GPU – and even then you can't make things like function pointers and vtables work (though I can imagine a hardware workaround for the latter – a translation table of sorts; maybe I should patent it. Anyway, we're very far from that being our biggest problem.)

And then you need a shared queue between the CPU and the GPU – how does that work? – or you partition the work statically (each of the 4 CPUs processes 10% of the pixels, the remaining 60% of the pixels go to the GPU cores).

But static partitioning, often quite lousy even with symmetric multicore, is awful with asymmetric multicore because how do you choose the percentages? You need to know the relative strength of the cores at each task. How do you do that – dynamically figure out the first time your program runs on a new device?

So this is all close to insane. What people actually do instead is task parallelism – they look at their different jobs, and they figure out which should run on each type of core, and optimize each task for the respective core.

But task parallelism never load-balances very well. Let's say you look for faces in an image on the GPU and then try to figure out whose faces these are on the CPUs. Then sometimes the GPU finds a lot of faces and sometimes just a few, taking roughly the same time to do so. But the CPU then has either a lot of work or just a little. So one of them will tend to be the bottleneck.

Less work for everyone

We actually touched on that above. If you wanted to do data parallelism, running the same task on all your cores but on different subsets of the data, one problem would be to optimize your code for each type of core. That's more work. Someone at the OS/system level would also need to help you with sharing task queues and vtables – still more work.

Generally, more types of core means more hardware design, more compilers, assemblers, linkers and debuggers, more manuals, and more integration work from bus protocols to program loaders, etc. etc. And, for programmers, not only more optimization work but more portability problems.

More redundancy

That's a bit futuristic, but I actually heard this argument from respectable people. The idea is, chip manufacturing yields will significantly drop at, say, 8nm processes. And then your chance to get a chip without a microscopic defect somewhere will become so low that throwing away every defective chip will be uneconomical.

Well, with symmetric multicore you don't have to throw away the chip. If the testing equipment identifies the core that is no longer useable and marks the chip accordingly using fuses or some such (which is easy to do), an OS can then run jobs on all cores but the bad one.

Nifty, isn't it?

With asymmetric multicore, you can't do that, because some type of work will have no core on which it can run.

Why asymmetric is inevitable

In two words – dark silicon.

"Dark silicon" is a buzzword used to describe the growing gap between how many transistors you can cram into a chip with each advancement in lithography vs how many transistors you can actually use simultaneously given your power budget – the gap between area gains and power gains.

It's been a couple of years since the "dark silicon" paper which predicted "the end of multicore scaling" – a sad follow-up to the end of frequency scaling.

The idea is, you can have 2x more cores with each lithography shrink, but your energy efficiency grows only by a square root of 2. So 4 shrinks mean 16x more cores – but within a fixed power budget, you can only actually use 4. So progress slows down, so to speak. These numbers aren't very precise – you have to know your specific process to make a budget for your chip – but they're actually not bad as a tool to think about this.
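
Here's that back-of-the-envelope arithmetic spelled out, using the same rough numbers as in the paragraph above:

# Per lithography shrink: 2x the cores fit, but energy per switch only
# improves by sqrt(2).  After 4 shrinks, at a fixed power budget:
shrinks = 4
cores = 2 ** shrinks                        # 16x more cores fit in the area
power_per_core = (1 / 2 ** 0.5) ** shrinks  # each core now needs 1/4 the power
usable = round(1 / power_per_core)          # the budget lights up only 4x more cores
print(cores, usable)                        # 16 4 -> most of what fits stays dark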

With 16x more area but just 4x more power, can anything be done to avoid having that other 4x untapped?

It appears that the only route is specialization – spend a large fraction of the area on specialized cores which are much faster at some useful tasks than the other cores you have.

Can you then use them all in parallel? No – symmetric or asymmetric, keeping all cores busy is outside your power budget.

But, if much of the runtime is spent running code on specialized cores doing the job N times faster than the next best core, then you'll have regained much of your 4x – or even gained more than 4x.

Gaining more than 4x has always been possible with specialized cores, of course; dark silicon is just a compelling reason to do it, because it robs you of the much easier alternative.

What about load balancing? Oh, aren't we "lucky"! It's OK that things don't load-balance very well on these asymmetric systems – because if they did, all cores would be busy all the time. And we can't afford that – we must keep some of the silicon "dark" (not working) anyway!

And what about redundancy? I dunno – if the yield problem materializes, the increasingly asymmetric designs of today are in trouble. Or are they? If you have 4 CPUs and 4 GPU clusters, you lose 25% of the performance, worse than if you had 12 CPUs; but the asymmetric system outperforms the symmetric one by more than 25%, or so we hope.

So the bright side of dark silicon is that it forces us to develop new core architectures – because to fully reap the benefits of lithography shrinks, we can't just cram more of the same cores into a same-sized chip. Which, BTW, has been getting boring, boring, boring for a long time. CPU architecture has stabilized to a rather great extent; accelerator architecture, not nearly so.

GPUs are the tip of the iceberg, really – the most widely known and easily accessible accelerator, but there are loads of them coming in endless shapes and colors. And as time goes by and as long as transistors keep shrinking but their power efficiency lags behind, we'll need more and more kinds of accelerators.

(I have a lot of fun working on accelerator architecture, in part due to the above-mentioned factors, and I can't help wondering why it appears to be a rather marginal part of "computer architecture" which largely focuses on CPUs; I think it has to do with CPUs being a much better topic for quantitative research, but that's a subject for a separate discussion.)

And this is why the CPU will likely occupy an increasingly small share of the chip area, continuing the trend that you can see in chip photos from ChipWorks et al.

P.S.

I work on switching-limited chip designs: most of the energy is spent on switching transistors. So you don't have to power down the cores between tasks – you can keep them in an idle state and they'll consume almost no energy, because there's no switching – zeros stay zeros, and ones stay ones.

Chips which run at higher frequencies and which are not designed to operate at high temperatures (where high leakage would become intolerably high – leakage grows non-linearly with temperature) are often leakage-limited. This means that you must actually power down a core or else it keeps using much of the energy it uses when doing work.

Sometimes powering down is natural, as in standby mode. Powering down midway through realtime processing is harder though, because it takes time to power things down and then to power them back up and reinitialize their pesky little bits such as cache line tags, etc.

So in a leakage-limited design, asymmetric multicore is at some point no better than symmetric multicore – if the gaps between your tasks are sufficiently short, you can't power down anything, and then your silicon is never dark, so either you make smaller chips or programs burn them.

But powering up and down isn't that slow, so a lot of workloads should be far from this sad point.

P.P.S.

I know about GreenDroid, a project by people who make the "dark silicon leads to specialization" argument quite eloquently; I don't think their specialization is the right kind – I think cores should be programmable – but that again is a subject for a separate discussion.

P.P.P.S.

Of course there's one thing you can always do with extra area which is conceptually much easier than adding new types of cores – namely, add more memory, typically L2/L3 cache. Memory is a perfect fit for the dark silicon age, because it essentially is dark silicon – its switching energy consumption is roughly proportionate to the number of bytes you access per cycle but is largely independent of the number of bytes you keep in there. And as to leakage, it's easier to minimize for memories than most other kinds of things.

Another "lucky" coincidence is that you really need caches these days because external DRAM response latency has been 100 ns for a long time while processor clocks tend to 50-200x shorter, so missing all the caches really hurts.

So it's natural to expect memories to grow first and then the accelerator zoo; again consistent with recent chip photos where, say, ARM's caches are considerably bigger than the ARM cores themselves.

(Itanium famously spent 85% or so of the chip area on caches, but that was more of "cheating" – a way to show off performance relative to x86 when in fact the advantage wasn't there – than anything else; at least that's how Bob Colwell quoted his conversation with Andy Grove. These days, however, it has become one of the few ways to actually use the extra area.)

Amdahl's law in reverse: the wimpy core advantage

Once a chip’s single-core performance lags by more than a factor of two or so behind the higher end of current-generation commodity processors, making a business case for switching to the wimpy system becomes increasingly difficult.

– Google's Urs Hölzle, in "Brawny cores still beat wimpy cores, most of the time"

Google sure knows its own business, so they're probably right when they claim that they need high serial performance. However, different businesses are different, and there are advantages to "wimpy cores" beyond power savings which are omitted from the "brawny/wimpy" paper and which are worth mentioning.

It is a commonplace assumption that a single 3 GHz processor is always better than 4 processors at 750 MHz "because of Amdahl's law". Specifically:

  • Some of your tasks can be parallelized to run on 4 cores – these will take the same time to complete on the two systems.
  • Other tasks can only run on one core – these will take 4x more time to complete on the 750 MHz wimpy-core system.
  • Some tasks are in-between – say, a task using 2 cores will take 2x more time to complete on the 750 MHz system.

Overall, the 3 GHz system never completes later and sometimes completes much earlier.

However, this assumes that a 3 GHz processor is consistently 4x faster than a 750 MHz processor. This is true in those rare cherished moments when both processors are running at their peak throughput. It's not true if both are stuck waiting for a memory request they issued upon a cache miss. For years, memory latency has been lagging behind memory throughput, and the 4x faster processor does not get a 4x smaller memory latency.

Assuming the latency is the same on the two systems – which is often very close to the truth – and that half of the time of the 750 MHz system is spent waiting for memory when running a serial task, the 3 GHz CPU will give you a speed-up of just 1.6x, not 4x: the compute half shrinks 4x, but the memory-wait half doesn't shrink at all.

What if the task can run in parallel on 4 cores? Now the 750 MHz system gets a 4x speed-up and completes more than 2x faster than the 3 GHz system!

(This assumes that the 4 750 MHz cores are not slowed down due to them all accessing the same memory – which, again, is very close to the truth. Memory bandwidth is usually high enough to serve 4 streams of requests – it's latency which is the problem. So having several cores waiting in parallel is faster than having one core waiting serially.)
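To make the arithmetic explicit, here's a tiny C sketch – the 50/50 compute/wait split and the 4x clock ratio are just the assumptions from above, not measurements:

#include <stdio.h>

int main(void) {
  double compute = 0.5, wait = 0.5;  /* assumed serial-task time split on the 750 MHz core */

  /* serial task: the 3 GHz core shrinks the compute half 4x; the memory wait stays put */
  double brawny_serial = compute / 4 + wait;                                   /* 0.625 */
  printf("brawny speed-up on the serial task: %.2fx\n", 1.0 / brawny_serial);  /* 1.60x */

  /* parallel task: 4 wimpy cores split the work 4 ways, waiting for memory in parallel */
  double wimpy_parallel = (compute + wait) / 4;                                /* 0.25 */
  printf("wimpy 4-core vs brawny serial: %.2fx\n", brawny_serial / wimpy_parallel); /* 2.50x */
  return 0;
}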

A slow, parallel system outperforming a fast, serial system – Amdahl's law in reverse!

Is memory latency unique?

In a way it isn't – there are many cases where faster systems fail to get to their peak throughput because of some latency or other. For instance, the cost of mispredicted branches will generally be higher on faster systems.

However, most other latencies of these kinds are much smaller – and are handled rather well by the mechanisms making "brawny" cores larger and deserving of their name. For example, hardware branch predictors and speculative execution can go a long way in reducing the ultimate cost of mispredicted branches.

"Brawny" speculative hardware is useful enough to deal with memory latency as well, of course. Today's high-end cores all have speculative memory prefetching. When you read a contiguous array of memory, the core speculates that you'll keep reading, fetches data ahead of the currently processed elements, and when time comes to process the next elements, they're already right there in the cache.

The problem is that this breaks down once you try to use RAM as RAM - a random access memory where you don't go from index 0 to N but actually "jump around" randomly. (Theoretically this isn't different from branch prediction; in practice, large, non-contiguous data structures not fitting into caches and "unpredictable" for prefetchers are much more common than unpredictable branches or overflowing the branch history table.)

Hence we have pledges like this computer architect's kind request to stop using linked lists:

Why would anyone use a linked-list instead of arrays? I argue that linked lists have become irrelevant in the context of a modern computer system:

1. They reduce the benefit of out-of-order execution.

2. They throw off hardware prefetching.

3. They reduce DRAM and TLB locality.

4. They cannot leverage SIMD.

5. They are harder to send to GPUs.

Note that problems 1, 2 and 4 "disappear" on wimpy cores – that is, such cores don't have out-of-order execution, hardware prefetchers or SIMD instructions.  So a linked list doesn't result in as many missed performance opportunities as it does on brawny cores.

"Why would anyone use linked lists"? It's not just lists, for starters – it's arrays of pointers as well. "Why array of pointers" seems exceedingly obvious – you want to keep references instead of values to save space as well as to update something and actually get the thing updated, not its copy. Also you can have elements of different types – very common with OO code keeping arrays of base class pointers.

And then using arrays and indexing into them instead of using pointers still leaves you with much of the "access randomness" problem. A graph edge can be a pointer or an index; and while mallocing individual nodes might result in somewhat poorer locality than keeping them in one big array, you still bump into problems 1, 2 and 4 for big enough node arrays – because your accesses won't go from 0 to N.
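Here's a small C sketch of that "access randomness" – the shuffled order[] array stands in for graph edges or arrays of pointers, and the function names are mine, just for illustration:

int sum_sequential(const int* node_data, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += node_data[i];          /* goes from 0 to N-1: prefetchers and caches love this */
  return sum;
}

int sum_indirect(const int* node_data, const int* order, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += node_data[order[i]];   /* order[] is, say, a shuffled permutation: */
  return sum;                     /* the same work, but the accesses "jump around" */
}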

Of course you could argue that memory indirection is "poor design" in the modern world and that programmers should mangle their designs until they fit modern hardware's capabilities. But then much of the "brawny vs wimpy" paper's argument is that brawny cores mean smaller development costs – that you get high performance for little effort. That's no longer true once the advice becomes to fight every instance of memory indirection.

It's also somewhat ironic from a historical perspective, because in the first place, we got to where we are because of not wanting to change anything in our software, and still wishing to get performance gains. The whole point of brawny speculative hardware is, "your old code is fine! Just give me your old binaries and I'll magically speed them up with my latest and greatest chip!"

The upshot is that you can't have it both ways. Either keep speculating (what, not brawny/brainy enough to guess where the next pointer is going to point to?), or admit that speculation has reached its limits and you can no longer deliver on the promises of high serial performance with low development costs.

Formalizing "reverse Amdahl's law"

…is not something I'd want to do.

Amdahl's law has a formal version where you take numbers representing your workload and you get a number quantifying the benefits a parallel system might give you.

A similar formalization could be done for "reverse Amdahl's law", taking into account, not just limits to parallelization due to serial bottlenecks (what Amdahl's law does), but also "limits to serialization"/the advantage to parallelization due to various kinds of latency which is better tolerated by parallel systems.

But I think the point here is that simple formal models fail to capture important factors – not that they should be made more complex with more input numbers to be pulled out of, erm, let's call it thin air. You just can't sensibly model real programs and real processors with a reasonably small bunch of numbers.

Speaking of numbers: is time spent waiting actually that large?

I don't know; depends on the system. Are there theoretically awesome CPUs stuck waiting for various things? Absolutely; here's that article about 6% CPU utilization in data servers (it also says that this is viewed as normal by some – as long as RAM and networking resources are utilized. Of course, by itself the 6% figure doesn't mean that brawny cores aren't better – if those 6% are all short, serial sprints letting you complete requests quickly, then maybe you need brawny cores. The question is how particular tasks end up being executed and why.)

Could a "wimpy-core" system improve things for you? It depends. I'm just pointing out why it could.

From big.LITTLE to LITTLE.big, or something

There's a trend of putting a single wimpy core alongside several brawny ones, the wimpy core being responsible for tasks where energy consumption is important, and the brawny ones being used when speed is required.

An interesting alternative is to put one brawny core and many wimpy ones – the brawny core could run the serial parts and the wimpy cores could run the parallel parts. If the task is serial, then a brawny core can only be better – perhaps a lot better, perhaps a little better, but better. If the task is parallel, many wimpy cores might be better than few brawny ones.

(I think both use cases fall under ARM's creatively capitalized "big.LITTLE concept"; if the latter case doesn't, perhaps LITTLE.big could be a catchy name for this – or we could try other capitalization options.)

Hyperthreading/Barrel threading/SIMT

…is also something that helps hide latency using coarse-grain, thread-level parallelism; it doesn't have to be "whole cores".

Parallelizing imperative programs

…is/can be much easier and safer than is commonly believed. But that's a subject for a separate series of posts.

Summary

  • Faster processors are exactly as slow as slower processors when they're stalled – say, because of waiting for memory;
  • Many slow processors are actually faster than few fast ones when stalled, because they're waiting in parallel;
  • All this on top of area & power savings of many wimpy cores compared to few brawny ones.

Will OpenCL help displace GPGPU? Parallella, P2012, …

OpenCL is usually perceived as a C dialect for GPGPU programming – doing "general-purpose" computations (not necessarily graphics) on GPU hardware. "It's like Nvidia's CUDA C, but portable".

However, OpenCL the language is not really tied to the GPU architecture. That is, hardware could run OpenCL programs and have an architecture very different from a GPU, resulting in a very different performance profile.

OpenCL is possibly the first programming language promising portability across accelerators: "OpenCL is for accelerators what C is for CPUs". Portability is disruptive. When hardware vendor A displaces vendor B, portable software usually helps a great deal.

Will OpenCL – "the GPGPU language" – eventually help displace GPGPU, by facilitating "GP-something-else" – "general-purpose" accelerators which aren't like GPUs?

We'll discuss this question on general grounds, and consider two specific examples of recent OpenCL accelerators: Adapteva's Parallella and ST's P2012.

Why displace GPGPU?

First of all, whether GPGPU is likely to be displaced or not – what could "GP-something-else" possibly give us that GPGPU doesn't?

There are two directions from which benefits could come – you could call them two opposite directions:

  1. Let software (ab)use more types of special-purpose accelerators. GPGPU lets you utilize (abuse?) your GPU for general-purpose stuff. It could be nice to have "GPDSP" to utilize the DSPs in your phone, "GPISP" to utilize the ISP, "GPCVP" to utilize computer vision accelerators likely to emerge in the future, etc. From GPGPU to GP-everything.
  2. Give software accelerators which are more general-purpose to begin with. GPGPU means doing your general-purpose stuff under the constraints imposed by the GPU architecture. An OpenCL accelerator lifting some of these constraints could be very welcome.

Could OpenCL help us get benefits from any of the directions (1) and (2)?

(1) is about making use of anal-retentive, efficiency-obsessed, weird, incompatible hardware. It's rather hard, for OpenCL or for any other portable, reasonably "pretty" language.

OpenCL does provide constructs more or less directly mapping to some of the "ugly" features common to many accelerators – a small kernel sketch follows the list – for example:

  • Explicitly addressed local memory (as opposed to cache)
  • DMA (bulk memory transfers)
  • Short vector data types to make use of SIMD opcodes
  • Light-weight threads and barriers
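Here's a hedged sketch of a kernel touching each of these features; it assumes a work-group size of 64, and the buffer names are made up:

__kernel void scale(__global const float4* in, __global float4* out) {
  __local float4 tile[64];                      /* explicitly addressed local memory */
  event_t e = async_work_group_copy(tile, in + get_group_id(0) * 64, 64, 0);
  wait_group_events(1, &e);                     /* DMA-style bulk transfer into local memory */

  int i = get_local_id(0);                      /* one light-weight thread per element */
  tile[i] *= 2.0f;                              /* float4: maps to SIMD opcodes where available */
  barrier(CLK_LOCAL_MEM_FENCE);                 /* cheap work-group barrier */

  e = async_work_group_copy(out + get_group_id(0) * 64, tile, 64, 0);
  wait_group_events(1, &e);                     /* bulk transfer back to global memory */
}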

But even with GPUs, OpenCL can't target all of the GPU's resources. There's the subset of the GPU accessible to GPGPU programs – and then there are the more idiosyncratic and less flexible parts used for actual graphics processing.

With accelerators such as DSPs and ISPs, my guess is that today most of their value – acceleration ability – is in the idiosyncratic features that can't be made accessible to OpenCL programs. They could evolve, but it's a bit far-fetched and we won't dwell on it now. At their current state, OpenCL is too portable and too "pretty" to map to most accelerators.

What about direction (2)? (2) is about making something that's more efficient than CPUs, but as nice and flexible as possible, and more flexible than GPUs.

As a whole, (2) isn't easy, for various reasons we'll discuss. But if we look, in isolation, at OpenCL the language, then it looks like a great language for targeting "faster-than-CPUs-and-more-flexible-than-GPUs" kind of accelerator.

What could such an accelerator give us that GPUs don't?

One important feature is divergent flow. GPUs are SIMD or SIMT hardware; either way, they can't efficiently support something like this:

if(cond(i)) {
  out[i] = f(i);
}
else {
  out[i] = g(i);
}

What they'll end up doing is, essentially, compute f(i) and g(i) for all values of i, and then throw away some of the results. For deeply nested conditionals, the cost of wasted computations can make the entire exercise of porting to a GPU pointless.
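In effect, the hardware (or the compiler) turns the branch into something like the following – a sketch of predication in C-like terms, with assumed element types, not any particular GPU's mechanism:

int take_f = cond(i);
float fi = f(i);              /* computed even where cond(i) is false */
float gi = g(i);              /* computed even where cond(i) is true  */
out[i] = take_f ? fi : gi;    /* a per-lane select keeps one result and throws the other away */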

We'll now have a look at two OpenCL-compatible accelerators which promise to efficiently support divergent threads – or outright independent threads doing something completely unrelated. We'll briefly compare them, and then discuss some of their common benefits as well as common obstacles to their adoption.

Adapteva's Parallella

Actually, the chip's name is Epiphany – Parallella is the recently publicized name of Adapteva's planned platform based on Epiphany; anyway.

Adapteva's architecture is a 2D grid of processors with a mesh interconnect. To scale, you can have a chip with more cores – or you can have a 2D grid of chips with some of the inter-core communication seamlessly crossing chip boundaries. Each of the (scalar) processors executes its own instruction stream – no "marching in lockstep", fittingly for divergent flow.

There are no caches; a memory address can map to your local memory, or the local memory of some other processor in the grid, or to external memory. Access latency will vary accordingly; access to local memories of close neighbors is quicker than access to far neighbors. All memory access can be done using either load/store instructions or DMA.

(Note that you can reach far neighbors – unlike some more "fundamentalist" proposals for "2D scalability" where you can only talk to immediate neighbors, period. I think that's over the top; if you want to run something other than the game of life, it's awfully handy to have long communication paths – as do most computers ranging from FPGAs to neurons, some of which have really long axons.)

Stats:

  • 32K memory per core (unified – both instructions and data)
  • 4 banks that can help avoid contentions between loads/stores, instruction fetching and DMA traffic
  • 2-issue cores (one floating point operation and one integer or load/store operation)
  • 800 MHz at 28nm using a low-power process, as opposed to a high-speed one (my bet is that it's hard to top 800 MHz at 28nm LP – any evidence to the contrary?)
  • ~25mW per core, 2W peak power for a full chip with 64 cores
  • 0.128 mm^2 per core

Sources: BDTI's overview and various documentation from adapteva.com.

ST's Platform 2012

The P2012 architecture is also, at the top level, a grid of processors with a mesh interconnect. One stated motivation is the intra-die variability in future process nodes: some cores will come out slower than others, and some will be unusable.

It is thus claimed that a non-uniform architecture (like the ones we have today – CPUs and a boatload of different accelerators) will become a bad idea. If a core happens to come out badly, and it's not like your other cores, you have to throw away the entire chip. Whereas if cores are all alike, you leave the bad ones unused, and you may still have enough good ones to use the chip.

Interestingly, despite this stated motivation, the P2012 is less uniform and has higher granularity than Epiphany. Firstly, there's a provision for special-purpose accelerators in the grid. Secondly, the top-level mesh connects, not individual cores, but clusters of 16 rather tightly-coupled cores (each with its own flow of control, however – again, good for divergence).

Similarly to Epiphany, data is kept in explicitly addressed local memory rather than cache, and you can access data outside the cluster using load/store instructions or DMA, but you'll pay a price depending on the distance.

However, within a cluster, data access is uniform: the 16 cores share 256K of local data memory. This can be convenient for large working sets. Instructions are private to a core – but they're kept in a cache, not a local memory, conveniently for large programs.

Stats:

  • 32K per core (16 cores with 16K I-cache per core and 32 8K data memory banks)
  • 1MB L2 cache in a 4-cluster, "69-core" chip (presumably, (16+1)x4+1 – one extra core per cluster and per chip)
  • 2-issue cores (I failed to find which instructions can be issued in parallel)
  • 600 MHz at 28nm (process details unclear)
  • 2W for the 69-core chip
  • 0.217 mm^2 per core (3.7 mm^2 per (16+1)-core cluster), not accounting for L2 cache

Source: slides, slides, slides.

Parallella vs P2012: a quick comparison

Each of the architectures can have many different implementations and configurations. It seems fair to compare a 28nm 64-core Epiphany chip with a 28nm 69-core P2012 chip (or at least fair as far as these things go). Each system has its own incompatible native programming interface, but both can also be programmed in OpenCL.

Here's how Epiphany compares to P2012:

  • Power: 1x (2W)
  • Core issue width: 1x (2-issue)
  • Local memory size: 1x (32K per core)
  • Frequency: 1.33x (800/600)
  • Core area efficiency: 1.7x (0.217/0.128)

I think it's a fine achievement for Adapteva – a 5-person company (ST has about 50000 employees – of course not all of them work on the P2012, but still; Chuck Moore's GreenArrays is 18 people – and he's considered the ultimate minimalist, and develops a much more minimalistic product which, for instance, certainly won't run OpenCL programs).

This is not to say that these numbers are sufficient to compare the architectures. For starters, we assume that the power is the same, but we can't know without benchmarking. Energy consumption varies widely across programs – low power process brings leakage down to about zero at room temperature, so you're left with switching power which depends on what code you run, and on what data (multiplying zeros costs almost nothing compared to multiplying noise, for instance).

Then there are programming model differences, ranging from the extent of compliance of floating point to the IEEE standard to the rather different memory models. In the memory department, the ability of P2012 cores to access larger working sets should somewhat negate Epiphany's raw performance advantage on some workloads (though Epiphany cores might have lower latency when accessing their own banks). But then two different 2-issue cores will generally perform differently – you need thorough benchmarking to compare.

So what are these numbers good for? Just for a very rough, ballpark estimation of the cost of this type of core. That is, a core which is flexible enough to run its own instruction stream – but "low-end" enough to burden the programmer with local memory management, and lacking much of the other amenities of full-blown CPUs (speculative prefetching, out-of-order execution, etc.)

Our two examples both point to the same order of magnitude of performance. Let's look at a third system, KALRAY's MPPA – looking more like P2012 than Epiphany, with 16-core clusters and cores sharing memory.

At 28nm, 256 cores are reported to typically consume 5W at 400 MHz (Adapteva and ST claim to give worst-case numbers). That's 20mW per core compared to Epiphany's 25mW – but Epiphany runs at 2x the frequency. If we normalize for frequency, Epiphany comes out 1.6x more power-efficient – and that's when we compare its worst-case power to MPPA's typical power.

MPPA doesn't support OpenCL at the moment, and I found few details about the architecture; our quick glance is only good to show that these "low-end multicore" machines have the same order of magnitude of efficiency.

So will OpenCL displace GPGPU?

The accelerators of the kind we discussed above – and they're accelerators, not CPUs, because they're horrible at running large programs as opposed to hand-optimized kernels – these accelerators have some nice properties:

  • You get scalar threads which can diverge freely and efficiently – this is a lot of extra flexibility compared to SIMT or SIMD GPUs.
  • For GPGPU workloads that don't need divergence, these accelerators probably aren't much worse than GPUs. You lose some power efficiency because of reading the same instructions from many program memories, but it should be way less than a 2x loss, I'd guess.
  • And there's a programming model ready for these accelerators – OpenCL. They can be programmed in other C dialects, but OpenCL is a widespread, standard one that can be used, and it lets you use features like explicitly managed local memory and DMA in a portable way.

From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?

There are several roadblocks, however. First of all, there's no killer app for this type of thing – by definition. That is, for any killer app, almost certainly a much more efficient accelerator can be built for that domain. Generic OpenCL accelerators are good at accelerating the widest range of things, but they don't excel at accelerating anything.

There is, of course, at least one thriving platform which is, according to the common perception, "good at everything but excels at nothing" – FPGA. (I think it's more complicated than that but I'll leave it for another time.)

FPGAs are great for small to medium scale product delivery. The volume is too small to afford your own chip – but there may be too many things to accelerate which are too different from what an existing chip is good at accelerating. Flexible OpenCL accelerator chips could rival FPGAs here.

What about integrating these accelerators into high-volume chips such as application processors so they could compete with GPUs? Without a killer app, there's a real estate problem. At 100-150 mm^2, today's application processors are already rather large. And the new OpenCL accelerators aren't exactly small – they're bigger than any domain-specific accelerator.

Few chips are likely to include a large accelerator "just in case", without a killer app. Area is considered to be getting increasingly cheap. But we're far from the point where it's "virtually free", and the trend might not continue forever: GlobalFoundries' 14nm is a "low-shrink" node. Today, area is not free.

Of course, a new OpenCL accelerator does give some immediate benefit, so it isn't a purely speculative investment: you could speed up existing OpenCL applications. But for existing code which is careful to avoid divergence, the accelerator would be somewhat less efficient than a GPU, and it wouldn't do graphics nearly as well as the GPU – so it'd be a rather speculative addition indeed.

What would make one add hardware for speculative reasons? A long life cycle. If you believe that your chip will have to accelerate important stuff many years after it's designed, then you'll doubt your ability to predict exactly what this stuff is going to be, and you'll want the most general-purpose accelerator.

Conversely, if you make new chips all the time, quickly sell a load of them, and then move on to market your next design, then you're less inclined to speculate. Anything that doesn't result in a visibly better product today is not worth the cost.

So generic OpenCL accelerators have a better shot at domains with long life cycles, which seem to be a minority. And even when you find a vendor with a focus on the long term, you still have the problem of performance portability.

Let's say platform vendor A does add the new accelerator to their chip. Awesome – except you probably also want to support vendor B's chips, which don't have such accelerators. And so efficient divergence is of no use to you, because it's not portable. Unless vendor A accounts for a very large share of the market – or if it's a dedicated device and you write a dedicated program and you don't care about portability.

OpenCL programs are portable, but their performance is not portable. For instance, if you use vector data types and the target platform doesn't have SIMD, the code will be compiled to scalar instructions, and so on.
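For example, this trivial kernel (mine, just for illustration) becomes one 4-wide add per work-item on hardware with float SIMD, and 4 scalar adds where there's no SIMD – same source, very different performance profile:

__kernel void add4(__global const float4* a, __global const float4* b,
                   __global float4* out) {
  int i = get_global_id(0);
  out[i] = a[i] + b[i];   /* one SIMD add, or 4 scalar adds, depending on the target */
}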

What this means in practice is that one or several OpenCL subsets will emerge, containing features that people count on to be supported well. For instance, a relatively good scenario is, there's a subset that GPU programmers use on all GPUs. A worse scenario is, there's the desktop GPU subset and the mobile GPU subset. A still worse scenario is, there's the NVIDIA subset, the AMD subset, the Imagination subset, etc.

It's an evolving type of thing that's never codified anywhere but has more power than the actual standard.

Standards tend to materialize partially. For example, the C++ standard (since C++11) optionally supports garbage collection, but real implementations usually don't collect garbage, and many real programs would not run correctly on a standard-compliant, garbage-collecting implementation. Someone knowing C++ would predict this outcome, and would not trust the standard to change it.

So with efficient divergence, the question is, will this feature make it into a widely used OpenCL subset, even though it's not a part of any such subset today. If it doesn't, widespread hardware is unlikely to support it.

Personally, I like accelerators with efficient divergence. I'll say it again:

"From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?"

From an evolutionary standpoint though, it's quite the uphill battle. The CPU + GPU combination is considered "good enough" very widely. It's not impossible to grow in the shadow of a "good enough", established competitor. x86 was good enough and ARM got big, gcc was good enough and LLVM got big, etc.

It's just hard, especially if you can't replace the competitor anywhere and you aren't a must-have. A CPU is a must-have and ARM replaces x86 where it wins. A compiler is a must-have and LLVM replaces gcc where it wins. An OpenCL accelerator with efficient divergence – or any other kind, really – is not a must-have and it will replace neither the CPU nor the GPU. So it's quite a challenge to convince someone to spend on it.

Conclusion

I doubt that general-purpose OpenCL accelerators will displace GPGPU, even though it could be a nice outcome. The accelerators probably will find their uses though. The following properties seem favorable to them (all or a subset may be present in a given domain):

  • Small to medium scale, similarly to FPGAs
  • Long life cycles encouraging speculative investment in hardware
  • Device-specific software that can live without performance portability

In other words, there can be "life between the CPU and the GPU", though not necessarily in the highest volume devices.

Good luck to Adapteva with their Kickstarter project – a computer with 64 independent OpenCL cores for $99.

"It's done in hardware so it's cheap"

It isn't.

This is one of these things that look very obvious to me, to the point where it seems not worth discussing. However, I've heard the idea that "hardware magically makes things cheap" from several PhDs over the years. So apparently, if you aren't into hardware, it's not obvious at all.

So why doesn't "hardware support" automatically translate to "low cost"/"efficiency"? The short answer is, hardware is an electric circuit, and you can't do magic with a circuit – there are rules. So what are the rules? We know that hardware support does help at times. When does it, and when doesn't it?

To see the limitations of hardware support, let's first look at what hardware can do to speed things up. Roughly, you can really only do two things:

  1. Specialization - save dispatching costs in speed and energy.
  2. Parallelization - save time, but not energy, by throwing more hardware at the job.

Let's briefly look at examples of these two speed-up methods – and then some examples where hardware support does nothing for you, because none of the two methods helps. We'll only consider run time and energy per operation, ignoring silicon area (considering a third variable just makes it too hairy).

I'll also discuss the difference between real costs of operations and the price you pay for these operations, and argue that in the long run, costs are more stable and more important than prices.

Specialization: cheaper dispatching

If you want to extract bits 3 to 7 of a 32-bit word and then multiply them by 13 – let's say an encryption algorithm requires this – you can have an instruction doing just that. That will be faster and use less energy than, say, using bitwise AND, shift & multiplication instructions.
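In C terms, the generic version is the one-liner below; a specialized instruction would do all of it in one step (the intrinsic name is hypothetical):

unsigned generic(unsigned x) { return ((x >> 3) & 0x1F) * 13; }    /* shift, AND, multiply */
/* unsigned special(unsigned x) { return __extract_3_7_mul13(x); }    one made-up opcode */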

Why – what costs were cut out? The costs of dispatching individual operations – circuitry controlling which operation is executed, where the inputs come from and where the outputs go.

Specialization can be taken to an extreme. For instance, if you want a piece of hardware doing nothing but JPEG decoding, you can bring dispatching costs close to zero by having a single "instruction" – "decode a JPEG image". Then you have no flexibility – and none of the "overhead" circuitry found in more flexible machines (memory for storing instructions, logic for decoding these instructions, multiplexers choosing the registers that inputs come from based on these instructions, etc.)

Before moving on, let's look a little closer at why we won here:

  • We got a speed-up because the operations were fast to begin with – so dispatching costs dominated. With specialization, we need 5 wires connected directly to bits 3 to 7 that have tiny physical delay – just the time it takes the signal to travel to a nearby multiplier-by-13. Without specialization, we'd use a shifter shifting by a configurable number of bits – 3 in our case but not always – which is a bunch of gates introducing a much larger delay. On top of that, since we'd be using several such circuits communicating through registers (let's say we're on a RISC CPU), we'd have delays due to reading and writing registers, delays due to selecting registers from a large register file, etc. With all this taken out by having a specialized instruction, no wonder we're seeing a big speed-up.
  • Likewise, we'll see lower energy consumption because the operations didn't require a lot of energy to begin with. Roughly, most of the energy is consumed when a signal value changes from 1 to 0 or back. When we use general-purpose instructions, most of the gate inputs & outputs and most flip-flops changing their values are those implementing the dispatching. When we use a specialized instruction, most of the switching is gone.

This means that, unsurprisingly, there's a limit to efficiency – the fundamental cost of the operations we need to do, which can't be cut.

When the operations themselves are costly enough – for instance, memory access or floating point operations – then their cost dominates the cost of dispatching. So specialized instructions that cut dispatching costs will give us little or nothing.

Parallelization: throwing more hardware at the job

What to do when specialization doesn't help? We can simply have N processors instead of one. For the parts that can be parallelized, we'll cut the run time by N – but spend the same amount of energy. So things got faster but not necessarily cheaper. A fixed power budget limits parallelization – as does a fixed budget of, well, money (the price of a 1000-CPU rack is still not trivial today).

[Why have multicore chips if it saves no energy? Because a multicore chip is cheaper than many single core ones, and because, above a certain frequency, many low-frequency cores use less energy than few high-frequency ones.]

We can combine parallelization with specialization – in fact it's done very frequently. The JPEG decoder mentioned above would do just that – a lot of its specialized circuits would execute in parallel.

Another example is how SIMD or SIMT processors broadcast a single instruction to multiple execution units. This way, we get only a speed-up, but no energy savings at the execution unit level: instead of one floating point ALU, we now have 4, or 32, etc. We do, however, get energy savings at the dispatching level – we save on program memory and decoding logic. As always with specialization, we pay in flexibility – we can't have our ALUs do different things at the same time, as some programs might want to.

Why do we see more single-precision floating point SIMD than double-precision SIMD? Because the higher the raw cost of operations, the less we save by specialization, and SIMD is a sort of specialization. If we have to pay for double-precision ALUs, why not put each in a full-blown CPU core? That way, at least we get the most flexibility, which means more opportunities to actually use the hardware rather than keeping it idle.

(It's really more complicated than that because SIMD can actually be a more useful programming model than multiple threads or processes in some cases, but we won't dwell on that.)

What can't be done

Now that we know what can be done – and there really isn't anything else – we basically already also know what can't be done. Let's look at some examples.

Precision costs are forever

8-bit integers are fundamentally more efficient than 32-bit floating point, and no hardware support for any sort of floating point operations can change this.

For one thing, multiplier circuit size (and energy consumption) is roughly quadratic in the size of the inputs. IEEE 32b floating point numbers have 23b mantissas – 24 significant bits with the implicit leading 1 – so multiplying them means a roughly (24/8)^2 = 9x larger circuit than an 8×8-bit multiplier with the same throughput. Another cost, linear in size, is that you need more memory, flip-flops and wires to store and transfer a float than an int8.

(People are more often aware of this one because SIMD instruction sets usually have fixed-sized registers which can be used to keep, say, 4 floats or 16 uint8s. However, this makes people underestimate the overhead of floating point as 4x – when it's more like 9x if you look at multiplying mantissas, not to mention handling exponents. Even int16 is 4x more costly to multiply than int8, not 2x as the storage space difference makes one guess.)

We design our own chips, and occasionally people say that it'd be nice to have a chip with, say, 256 floating point ALUs. This sounds economically nonsensical – sure it's nice and it's also quite obvious, so if nobody makes such chips at a budget similar to ours, it must be impossible, so why ask?

But actually it's a rather sensible suggestion, in that you can make a chip with 256 ALUs that is more efficient than anything on the market for what you do, but not flexible enough to be marketed as a general-purpose computer. That's precisely what specialization does.

However, specialization only helps with operations which are cheap enough to begin with compared to the cost of dispatching. So this can work with low-precision ALUs, but not with high-precision ALUs. With high-precision ALUs, the raw cost of operations would exceed our power budget, even if dispatching costs were zero.

Memory indirection costs are forever

I mentioned this in my old needlessly combative write-up about "high-level CPUs". There's this idea that we can have a machine that makes "high-level languages" run fast, and that they're really only slow because we're running on "C machines" as opposed to Lisp machines/Ruby machines/etc.

Leaving aside the question of what "high-level language" means (I really don't find it obvious at all, but never mind), object-orientation and dynamic typing frequently result in indirection: pointers instead of values and pointers to pointers instead of pointers. Sometimes it's done for no apparent reason – for instance, Erlang strings that are kept as linked lists of ints. (Why do people even like linked lists as "the" data structure and head/tail recursion as "the" control structure? But I digress.)

This kind of thing can never be sped up by specialization, because memory access fundamentally takes quite a lot of time and energy, and when you do p->a, you need one such access, and when you do p->q->a, you need two, hence you'll spend twice the time. Having a single "LOAD_LOAD" instruction instead of two – LOAD followed by a LOAD – does nothing for you.

All you can do is parallelization – throw more hardware at the problem, N processors instead of one. You can, alternatively, combine parallelization with specialization, similarly to N-way floating point SIMD that's somewhat cheaper than having N full-blown processors. For example, you could have several load-store units, several cache banks and a multiple-issue processor. Then if you had to run p1->q1->a and, somewhere near that, p2->q2->b, and the pointers point into different banks, some of the 4 LOADs would end up running in parallel, without having several processors.

But, similarly to low-precision math being cheaper whatever the merits of floating point SIMD, one memory access always costs half as much as two, despite the merits of cache banking and multiple issue. Specifically, doubling the memory access throughput roughly doubles the energy cost. This can sometimes be better than simply using two processors, but it's a non-trivial cost and always will be.

A note about latency

We could discuss other examples but these two are among the most popular – floating point support is a favorite among math geeks, and memory indirection support is a favorite among language geeks. So we'll move on to a general conclusion – but first, we should mention the difference between latency costs and throughput costs.

In our two examples, we only discussed throughput costs. A floating point ALU with a given throughput uses more energy than an int8 ALU. Two memory banks with a given throughput use about twice the energy of a single memory bank with half the throughput. This, together with the relatively high costs of these operations compared to the costs of dispatching them, made us conclude that there's nothing much we can do.

In reality, the high latency of such heavyweight operations can be the bigger problem than our inability to increase their throughput without paying a high price in energy. For example, consider the instruction sequence:

c = FIRST(a,b)
e = SECOND(c,d)

If FIRST has a low latency, then we'll quickly proceed to SECOND. If FIRST has a high latency, then SECOND will have to wait for that amount of time, even if FIRST has excellent throughput. Say, if FIRST is a LOAD, being able to issue a LOAD every cycle doesn't help if SECOND depends on the result of that LOAD and the LOAD latency is 5 cycles.
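In C-ish terms, using the example's numbers:

int dependent_pair(const int* p, int d) {
  int c = *p;      /* FIRST: a load whose result arrives after, say, 5 cycles */
  int e = c + d;   /* SECOND: stalls on c even though the core could issue more loads */
  return e;
}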

A large part of computer architecture is various methods for dealing with these latencies – VLIW, out-of-order, barrel processors/SIMT, etc. These are all forms of parallelization – finding something to do in parallel with the high-latency instruction. A barrel processor helps when you have many threads. An out-of-order processor helps when you have nearby independent instructions in the same thread. And so on.

Just like having N processors, all these parallelization methods don't lower dispatching costs - in fact, they raise them (more registers, higher issue bandwidth, tricky dispatching logic, etc.) The processor doesn't become more energy efficient - you get more done per unit of time but not per unit of energy. A simple processor would be stuck at the FIRST instruction, while a more clever one would find something to do – and spend the energy to do it.

So latency is a very important problem with fundamentally heavyweight operations, and machinery for hiding this latency is extremely consequential for execution speed. But fighting latency using any of the available methods is just a special case of parallelization, and in this sense not fundamentally different from simply having many cores in terms of energy consumed.

The upshot is that parallelization, whether it's having many cores or having single-core latency-hiding circuitry, can help you with execution speed – throughput per cycle – but not with energy efficiency – throughput per watt.

The latency of heavyweight stuff is important and not hopeless; its throughput is important and hopeless.

Cost vs price

"But on my GPU, floating point operations are actually as fast as int8 operations! How about that?"

Well, a bus ticket can be cheaper than the price of getting to the same place in a taxi. The bus ticket will be cheaper even if you're the only passenger, in which case the real cost of getting from A to B in a bus is surely higher than the cost of getting from A to B in a taxi. Moreover, a bus might take you there more quickly if there are lanes reserved for buses that taxis are not allowed to use.

It's basically a cost vs price thing – math and physics vs economics and marketing. The fundamentals only say that a hardware vendor always can make int8 cheaper than float – but they can have good reasons not to. It's not that they made floats as cheap as int8 – actually, they made int8 as expensive as floats in terms of real costs.

Just like you going alone in a bus designed to carry dozens of people is an inefficient use of a bus, using float ALUs to process what could be int8 numbers is an inefficient use of float ALUs. Similarly, just like transport regulations can make lanes available for buses but not cars, an instruction set can make fetching a float easy but make fetching a single byte hard (no load byte/load byte with sign extension instructions). But cars could use those lanes – and loading bytes could be made easy.

As a passenger, of course you will use the bus and not the taxi, because economics and/or marketing and/or regulations made it the cheaper option in terms of price. Perhaps it's so because the bus is cheaper overall, with all the passengers it carries during rush hours. Perhaps it's so because the bus is a part of the contract with your employer – it's a bus carrying employees towards a nearby something. And perhaps it's so because the bus is subsidized by the government. Whatever the reason, you go ahead and use the cheaper bus.

Likewise, as a programmer, if you're handed a platform where floating point is not more expensive or even cheaper than int8, it is perhaps wise to use floating point everywhere. The only things to note are, the vendor could have given you better int8 performance; and, at some point, a platform might emerge that you want to target and where int8 is much more efficient than float.

The upshot is that it's possible to lower the price of floating point relative to int8, but not the cost.

What's more "important" – prices or costs?

Prices have many nice properties that real costs don't have. For instance, all prices can be compared – just convert them all to your currency of choice. Real costs are hard to compare without prices: is 2x less time for 3x more energy better or worse?

In any discussion about "fundamental real costs", there tend to be hidden assumptions about prices. For example, I chose to ignore area in this discussion under the assumption that area is usually less important than power. What makes this assumption true – or false – is the prices fabs charge for silicon production, the sort of cooling solutions that are marketable today (a desktop fan could be used to cool a cell phone but you couldn't sell that phone), etc. It's really hard to separate costs from prices.

Here's a computer architect's argument to the effect of "look at prices, not costs":

While technical metrics like performance, power, and programmer effort make up for nice fuzzy debates, it is pivotal for every computer guy to understand that “Dollar” is the one metric that rules them all. The other metrics are just sub-metrics derived from the dollar: Performance matters because that’s what customers pay for; power matters because it allows OEMs to put cheaper, smaller batteries and reduce people’s electricity bills; and programmer effort matters because it reduces the cost of making software.

I have two objections: that prices are the effect, not the cause, and that prices are too volatile to commit to memory as a "fundamental".

Prices are the effect in the sense that, customers pay for performance because it matters, not "performance matters because customers pay for it". Or, more precisely – customers pay for performance because it matters to them. As a result – because customers pay for it – performance matters to vendors. Ultimately, the first cause is that performance matters, not that it sells.

The other thing about prices is that they're rather jittery. Even a price index designed for stability such as S&P 500 is jumping up and down like crazy. In a changing world, knowledge about costs has a longer shelf life than knowledge about prices.

For instance, power is considered cheap for desktops but expensive for servers and really expensive for mobile devices. In reality, desktops likely consume more power than servers, there being more desktops than servers. So the real costs are not like the prices – and prices change; the rise of mobile computing means rising prices for power-hungry architectures.

It seems to me that, taking the long view, the following makes sense:

  • It's best to reason in costs and project them to the relevant prices – not forget the underlying costs and "think in prices", so as to not get into habits that will become outdated when prices change.
  • If you see a high real cost "hidden" by contemporary prices, it's a good bet to assume that at some point in the future, prices will shift so that the real cost will rear its ugly head.

For example, any RISC architecture – ARM, MIPS, PowerPC, etc. – is fundamentally cheaper than, specifically, x86, in at least two ways: hardware costs – area & power – and the costs of developing said hardware. [At least so I believe; let's say it's not as significant in my view as my other, more basic examples, and I might be wrong and I'm only using this as an illustration.]

In the long run, this spells doom for the x86, whatever momentum it otherwise has at any point in time – software compatibility costs, Intel's manufacturing capabilities vs competitors' capabilities, etc. Mathematically or physically fundamental costs will, in the long run, trump everything else.

In the long run, there is no x86, no ARM, no Windows, no iPhone, etc. There are just ideas. We remember ideas originating in ancient Greece and Rome, but no products. Every product is eventually outsold by another product. Old software is forgotten and old fabs rot. But fundamentals are forever. An idea that is sufficiently more costly fundamentally than a competing idea can not survive.

This is why I disagree with the following quote by Bob Colwell – the chief architect of the Pentium Pro (BTW, I love the interview and intend to publish a summary of the entire 160-something page document):

…you might say that CISC only stayed viable because Intel was able to throw a lot of money and people at it, and die size, bigger chips and so on.

In that sense, RISC still was better, which is what was claimed all along. And I said you know, there's point to be made there. I agree with you that Intel had more to do to stay competitive. They were starting a race from far behind the start line. But if you can throw money at a problem then, it's not really so fundamental technologically, is it? We look for more deep things than that, so if all the RISC/CISC thing amounted to was, you had a slight advantage economically, well, that's not as profound as it seemed back in the 80s was it?

Well, here's my counter-argument and it's not technical. The technical argument would be, CISC is worse, to the point where Intel's 32nm Medfield performs about as well as ARM-based 40nm chips in a space where power matters. Which can be countered with an economical argument – so what, Intel does have a better manufacturing ability so who cares, they still compete.

But my non-technical argument is, sure, you can be extremely savvy business-wise, and perhaps, if Intel had realized early on how big mobile was going to be, they'd have made a good enough x86-based offering back then, everyone else would have been locked out due to software compatibility issues, and they'd reign as they reign in the desktop market.

But you can't do that forever. Every company is going to lose to some company at some point or other, because you only need one big mistake and eventually you'll make it – you'll ignore a single emerging market, and that will be the end. Or someone will outperform you technically – build a better fab, etc. If an idea is only ("only"?) being dragged into the future kicking and screaming by a very business-savvy and technically excellent company, then the idea has no chance.

The idea that will win is the idea that every new product will use. New products always beat old products – always have.

And nobody, nobody at all has made a new CISC architecture in ages. Intel will lose to a company or companies making RISC CPUs because nobody makes anything else – and it has to lose to someone. Right now it seems like it's ARM but it doesn't matter how it comes out in this round. It will happen at some point or other.

And if ARM beats x86, it won't be, straightforwardly, "because RISC is better" – x86 will have lost for business reasons, and it could have gone the other way for business reasons. But the fact that it will have lost to a RISC – that will be because RISC is technically better. That's why there's no CISC competitor to lose to.

Or, if you dismiss this with the sensible "in the long run, we're all dead" – then, well, if you're alive right now and you're designing hardware, you are not making a CISC processor, are you? QED, not?

Getting back to our subject – based on the assumption that real costs matter, I believe that ugly, specialized hardware is forever. It doesn't matter how much money is poured into general-purpose computing, by whom and why. You will always have sufficiently important tasks that can be accomplished 10x or 100x more cheaply by using fundamentally cheap operations, and it will pay off for someone to make the ugly hardware and write the ugly, low-level code doing low-precision arithmetic to make it work.

And, on the other hand, the market for general-purpose hardware is always going to be huge, in particular, because there are so many things that must be done where specialization fundamentally doesn't help at all.

Conclusion

Hardware can only deliver "efficiency miracles" for operations that are fundamentally cheap to begin with. This is done by lowering dispatching costs and so increasing throughput per unit of energy. The price paid is reduced flexibility.

Some operations, such as high-precision arithmetic and memory access, are fundamentally expensive in terms of energy consumed to reach a given throughput. With these, hardware can still give you more speed through parallelization, but at an energy cost that may be prohibitive.

Hardware macroarchitecture vs microarchitecture

The comp-arch.net wiki defines "computer architecture" as the union of two things:

  • Macroarchitecture - the "visible" parts, the contract between hardware and software. For example, branch instructions.
  • Microarchitecture - the "invisible" parts, the implementation strategy which affects performance but not semantics. For example, branch prediction.

I think this distinction is very interesting for three reasons:

  1. It pops up everywhere. That is, many hardware problems can be addressed at the macro level or the micro level, explicitly or implicitly.
  2. The choice of macro vs micro is rarely trivial – for most problems, there are real-world examples of both kinds of solutions.
  3. The choice has common consequences across problems. The benefits and drawbacks of macro and micro are frequently similar.

I'll use examples from diverse types of hardware – CPUs, DSPs, GPUs, FPGAs, CAPPs, and even DRAM controllers. We'll discuss some example problems and how they can be solved at the macro or micro level. I'll leave the discussion of the resulting trade-offs to separate write-ups. Here, we'll go through examples to see what practical macro and micro solutions to different problems look like.

Our examples are:

  • Data parallelism: SIMD vs SIMT
  • Multiple issue: VLIW vs superscalar
  • Running ahead: exposed latencies vs OOO
  • Local storage: bare RAM vs cache
  • Streaming access: DMA vs hardware prefetchers
  • Data processing: logic synthesis vs instruction sets
  • Local communication: routing vs register addresses
  • Avoiding starvation: pressure signals vs request aging

Data parallelism: SIMD vs SIMT

Suppose you want to have a data-parallel machine: software issues one instruction that processes multiple data items.

The common macro approach is wide registers and SIMD opcodes. To use the feature, software must explicitly break up its data into 16-byte chunks, and use special opcodes like "add_16_bytes" to process the chunks.

One micro approach is what NVIDIA marketing calls SIMT. The instruction set remains scalar. However, hw runs multiple scalar threads at once, with simultaneously running threads all executing the same instruction. That way, 16 pairs of values are added in a single cycle using scalar instructions.
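A hedged contrast in code – the "macro" version uses x86 SSE2 intrinsics as one concrete instance of wide registers and SIMD opcodes; the "micro" (SIMT) version is just the ordinary scalar code each thread runs:

#include <emmintrin.h>

/* macro approach: software explicitly processes 16-byte chunks (n assumed a multiple of 16) */
void add_simd(const char* a, const char* b, char* out, int n) {
  for (int i = 0; i < n; i += 16) {
    __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
    __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
    _mm_storeu_si128((__m128i*)(out + i), _mm_add_epi8(va, vb));
  }
}

/* micro (SIMT) approach: thread i runs plain scalar code, and the hardware happens to run
   16 (or 32, ...) such threads in lockstep:  out[i] = a[i] + b[i];  */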

(If you're interested in SIMT, a detailed comparison with SIMD as well as SMT – the more general simultaneous multithreading model – is here.)

Multiple issue: VLIW vs superscalar/OOO

Suppose you want to have a multiple issue machine. You want to simultaneously issue multiple instructions from a single thread.

The macro approach is VLIW, which stands for "very long instruction word". The idea is, those multiple instructions you issue become "one (very long) instruction", because software explicitly asks to run them together: "ADD R0, R1, R2 and MUL R3, R0, R5". Note that ADD and MUL "see" the same value of R0: MUL gets R0's value before it's modified by ADD.

VLIW also lets software choose to say, "ADD R0, R1, R2; afterwards, MUL R3, R0, R5" – that's two separate instructions yielding vanilla serial execution. This is not only slower (2 cycles instead of 1), but has a different meaning. This way, MUL does see ADD's change to R0. Either way, you get what you explicitly asked for.

(If you're interested in VLIW, an explanation of how programs map to this rather strangely-looking architecture is here.)

The micro approach, called superscalar execution, is having the hardware analyze the instructions and run them in parallel – when that doesn't change the hw/sw contract (the serial semantics). For example, ADD R0, R1, R2 can run in parallel with MUL R3, R1, R2 – but not with MUL R3, R0, R5 where MUL's input, R0, depends on ADD. Software remains unaware of instruction-level parallelism – or at least it can remain unaware and still run correctly.

Running ahead: exposed latencies vs OOO

We've just discussed issuing multiple instructions simultaneously. A related topic is issuing instructions before a previous instruction completes. Here, the macro approach is to, well, simply go ahead and issue instructions. It's the duty of software to make sure those instructions don't depend on results that are not yet available.

For example, a LOAD instruction can have a 4-cycle latency. Then if you load to R0 from R1 and at the next cycle, add R0 and R2, you will have used the old value of R0. If you want the new value, you must explicitly wait for 4 cycles, hopefully issuing some other useful instructions in the meanwhile.

The micro approach to handling latency is called OOO (out-of-order execution). Suppose you load to R0, then add R0 and R2, and then multiply R3 and R4. An OOO processor will notice that the addition's input is not yet available, proceed to the multiplication because its inputs are ready, and execute the addition once R0 is loaded (in our example, after 4 cycles). The hw/sw contract is unaffected by the fact that hardware issues instructions before a previous instruction completes.
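In code, with the 4-cycle load from the example:

int demo(const int* p, int x, int y, int z) {
  int a = *p;       /* a load: the result is ready after ~4 cycles */
  int b = a + x;    /* depends on the load - can't execute yet */
  int c = y * z;    /* independent - an OOO core executes this while the load is in flight, */
  return b + c;     /* then does the addition once a arrives */
}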

Local storage: local memories vs caches

Suppose you want to have some RAM local to your processor, so that much of the memory traffic goes to this fast RAM rather than to external RAM, which is increasingly becoming a bottleneck.

The macro approach is, you just add local RAM. There can be special load/store opcodes to access this RAM, or a special address range mapped to it. Either way, when software wants to use local RAM, it must explicitly ask for it – as in, char* p = (char*)0x54000, which says, "I'll use a pointer pointing to this magic address, 0x54000, which is the base address of my local RAM".

This is done on many embedded DSPs and even CPUs – for example, ARM calls this "tightly-coupled memory" and MIPS calls this "scratch pad memory".
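In code, the macro approach is as mundane as it sounds. A minimal sketch – the base address below is a made-up, device-specific constant, and n is assumed to fit in the local RAM:

#include <string.h>

#define LOCAL_RAM_BASE 0x54000 //assumption: device-specific base address of the local RAM

static unsigned char* const local_ram = (unsigned char*)LOCAL_RAM_BASE;

//stage a block in the fast local RAM, process it there, copy the result out
void process_block(const unsigned char* src, unsigned char* dst, int n) {
  memcpy(local_ram, src, n);  //explicitly "load into local RAM"
  for(int i=0; i<n; ++i)
    local_ram[i] = (unsigned char)(local_ram[i]*2);
  memcpy(dst, local_ram, n);  //explicitly "store back to external RAM"
}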

The micro approach is caches. Software doesn't ask to use a cache – it loads from an external address as if the cache didn't exist. It's up to hardware to:

  • Check if the data is already in the cache
  • If it isn't, load it to the cache
  • Decide which cached data will be overwritten with the new data
  • If the overwritten cached data was modified, write it back to external memory before "forgetting" it

The hardware changes greatly, the hw/sw contract does not.
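To make the bookkeeping in those bullets concrete, here's a toy software model of a direct-mapped, write-back cache. It's only a sketch – real caches differ in associativity, line size and write policy, and all the sizes below are made up:

#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64
#define NUM_LINES 1024 //a 64K direct-mapped cache

typedef struct {
  uint32_t tag;
  int valid, dirty;
  uint8_t data[LINE_SIZE];
} cache_line;

static cache_line cache[NUM_LINES];
static uint8_t external_mem[1<<20]; //stands in for the external RAM

uint8_t cached_load(uint32_t addr) {
  uint32_t offset = addr % LINE_SIZE;
  uint32_t index  = (addr / LINE_SIZE) % NUM_LINES;
  uint32_t tag    = addr / LINE_SIZE / NUM_LINES;
  cache_line* line = &cache[index];
  if(!line->valid || line->tag != tag) { //miss
    if(line->valid && line->dirty)       //write back modified data before "forgetting" it
      memcpy(&external_mem[(line->tag*NUM_LINES + index)*LINE_SIZE], line->data, LINE_SIZE);
    memcpy(line->data, &external_mem[addr - offset], LINE_SIZE); //load the new line
    line->tag = tag; line->valid = 1; line->dirty = 0;
  }
  return line->data[offset]; //hit (now)
}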

Data streaming: DMA vs hardware prefetchers

Suppose you want to support efficient "streaming transfers". DRAM is actually a fairly poor random access memory – there's a big latency you pay per address. However, it has excellent throughput if you load a large contiguous chunk of data. To utilize this, a processor must issue loads without waiting for results of previous loads. Load, wait, load, wait… is slow; load, load, load, load… is fast.

The macro approach is, sw tells hw that it wants to load an array. For example, a DMA – direct memory access – engine can have control registers telling it the base address and the size of an array to load. Software explicitly programs these registers and says, "load".

DMA starts loading and eventually says, "done" – for example, by setting a bit. In the meanwhile, sw does some unrelated stuff until it needs the loaded data. At this point, sw waits until the "done" bit is set, and then uses the data.
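A sketch of this macro flow – the register addresses and bit layout below are entirely made up, but the structure is typical:

#include <stdint.h>

//assumption: made-up memory-mapped registers of a simple DMA engine
#define DMA_SRC   (*(volatile uint32_t*)0x40001000) //source address in external memory
#define DMA_SIZE  (*(volatile uint32_t*)0x40001004) //transfer size in bytes
#define DMA_CTRL  (*(volatile uint32_t*)0x40001008) //write 1 to say "load"
#define DMA_STAT  (*(volatile uint32_t*)0x4000100C) //bit 0 is the "done" bit

static void do_unrelated_work(void) { /* ...stuff that doesn't need the data... */ }

void stream_in(uint32_t ext_addr, uint32_t nbytes) {
  DMA_SRC  = ext_addr;     //base address of the array
  DMA_SIZE = nbytes;       //how much to load
  DMA_CTRL = 1;            //"load"
  do_unrelated_work();     //overlap the transfer with other work
  while(!(DMA_STAT & 1)) ; //wait for the "done" bit
  //the data is now in local memory and can be used
}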

The micro approach is, software simply loads the array "as usual". Naturally, it loads from the base address, p, then from p+1, then p+2, p+3, etc. At some point, a hardware prefetcher quietly inspecting all the loads realizes that a contiguous array is being loaded. It then speculatively fetches ahead – loads large chunks beyond p+3 (hopefully not too large – we don't want to load too much unneeded data past the end of our array).

When software is about to ask for, say, p+7, its request is suddenly satisfied very quickly because the data is already in the cache. This keeps working nicely with p+8 and so on.

Data processing: logic synthesis vs instruction sets

Let's get back to basics. Suppose we want to add a bunch of numbers. How does software tell hardware to add numbers?

The micro approach is so much more common that it's the only one that springs to mind. Why, of course hardware has an ADD command, and it's implemented in hardware by some sort of circuit. There are degrees here (should there be a DIV opcode or should sw do division?) But the upshot is, there are opcodes.

However, there are architectures where software explicitly constructs data processing operations out of bit-level boolean primitives. This is famously done on FPGAs and is called "logic synthesis" – effectively software gets to build its own circuits. (This programming model is so uncommon that it isn't even called "software", but it certainly is.)

Less famously, this is also what effectively happens on associative memory processors (CAPPs/APAs) – addition is implemented as a series of bit-level masked compare & write operations. (The CAPP way results in awfully long latencies, which you're supposed to make up with throughput by processing thousands of elements in parallel. If you're interested in CAPPs, an overview is available here.)

Of course, you can simulate multiplication using bitwise operations on conventional opcode-based processors. But that would leave much of the hardware unused. On FPGAs and CAPPs, on the other hand, "building things out of bits" is how you're supposed to utilize hardware resources. You get a big heap of exposed computational primitives, and you map your design to them.
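To make "building things out of bits" concrete, here's addition decomposed into boolean primitives. It's written in C only for readability – a synthesis tool would structure the adder differently, but a decomposition of this kind is what actually gets mapped to FPGA or CAPP resources:

#include <stdint.h>

//add two numbers using only AND, XOR and shifts
uint32_t add_from_gates(uint32_t a, uint32_t b) {
  while(b) {
    uint32_t carry = a & b; //carry generated at each bit position
    a = a ^ b;              //sum without the carries
    b = carry << 1;         //feed the carries back in, shifted up
  }
  return a;
}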

Local communication: routing vs register addresses

Another problem as basic as data processing operations is local communication: how does an operation pass its results to the next? We multiply and then add – how does the addition get the output of multiplication?

Again, the micro approach is by far the better known one. The idea is, you have registers, which are numbered somehow. We ask MUL to output to the register R5 (encoded as, say, 5). Then we ask ADD to use R5 as an input.

This actually doesn't sound all that "micro" – what's implicit about it? We asked for R5 very explicitly. However, there are two sorts of "implicit" things going on here:

  • The numbers don't necessarily refer to physical registers – they don't on machines with register renaming.
  • More fundamentally, even when numbers do refer to physical registers, the routing is implicit.

How does the output of MUL travel to R5 and then to the input port of the adder? There are wires connecting these things, and multiplexers selecting between the various options. On most machines, there are also data forwarding mechanisms sending the output of MUL directly to the adder, in parallel to writing it into R5, so that ADD doesn't have to wait until R5 is written and then read back. But even on machines with explicit forwarding (and there aren't many), software doesn't see the wires and muxes – these are opaque hardware resources.

The macro approach to routing is what FPGAs do. The various computational units are connected to nearby configurable switches. By properly configuring those switches, you can send the output of one unit to another using a path going through several switches.

Of course this uses up the wires connecting the switches, and longer paths result in larger latencies. So it's not easy to efficiently connect all the units that you want using the switches and the wires that you have. In FPGAs, mapping operations to computational units and connecting between them is called "placement and routing". The "place & route" tools can run for a couple of hours given a large design.

This example, as well as the previous one, illustrates micro vs macro at the extreme – a hardware resource that looks "all-important" in one architecture is invisible in another, to the point where we forget it exists. The point is that the resource is equally important in both – the only question is who manages it, hardware or software.

Avoiding starvation: pressure signals vs request aging

One thing DRAM controllers do is accept requests from several different processors, put them in a queue, and reorder them. Reordering helps to better utilize DRAM, which, as previously mentioned, isn't that good at random access and prefers streaming access to consecutive locations.

So if two processors, A and B, run in parallel, and each accesses a different region, it's frequently better to group requests together – A, A, A, A, B, B, B, B – than to process them in the order in which they arrive – say, A, B, A, A, B, A, B, B.

In fact, as long as A keeps issuing requests, it's frequently better to keep processing them until they're over, and keep B waiting. Better, that is, for throughput, as well as for A's latency – but worse for B's latency. If we don't know when to stop, serving A and starving B could make the system unusable.

When to stop? One macro solution is, the DRAM controller has incoming pressure signals, and both A and B can complain when starved by raising the pressure. Actually, this is "macro" only as far as the DRAM controller is concerned – it gives outside components explicit control over its behavior. The extent of software control over the generation of the pressure signal depends on the processors A and B.

One micro solution is to use request aging. Older requests are automatically considered more urgent. This method is implemented in many DRAM controllers – for instance, Denali's. The macro approach is implemented in the Arteris DRAM scheduler.
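A toy model of request aging – the structures, policy details and threshold below are all made up, but they show the basic idea: keep streaming from the current source, unless some request has waited long enough to become urgent:

#include <stddef.h>

typedef struct { int source; unsigned addr; unsigned age; } request;

#define AGE_LIMIT 64 //assumption: after waiting this long, a request becomes urgent

//pick the next request to serve out of a queue of n pending requests
int pick_next(const request* q, size_t n, int current_source) {
  size_t best = 0;
  //1. a request that has waited too long wins unconditionally
  for(size_t i=0; i<n; ++i)
    if(q[i].age >= AGE_LIMIT) return (int)i;
  //2. otherwise, keep streaming from the source we're already serving
  for(size_t i=0; i<n; ++i)
    if(q[i].source == current_source) return (int)i;
  //3. otherwise, pick the oldest request
  for(size_t i=1; i<n; ++i)
    if(q[i].age > q[best].age) best = i;
  return (int)best;
}

//called every cycle: all waiting requests grow older
void age_requests(request* q, size_t n) {
  for(size_t i=0; i<n; ++i) q[i].age++;
}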

The micro approach is safer – the controller itself takes care to prevent starvation, whereas with the macro option, a non-cooperative processor can starve others. It also uses a simpler bus protocol, making it easier for processors to be compatible with it. However, it can cost throughput – for instance, if B is a peripheral device with a large FIFO for incoming data, it can afford to wait until the FIFO nearly overflows, but request aging will interrupt A's streaming on B's behalf long before that; with pressure signals, B would simply not complain until it really had to.

Whatever the benefits and drawbacks – and here, we aren't going to discuss benefits and drawbacks in any depth – this last example is supposed to illustrate that macro vs micro is relevant not only to "core"/"processor" design, but extends to "non-computational" hardware as well.

Blurred boundaries

Micro vs macro is more of a continuum than a strictly binary distinction. That is, we can't always label a hardware feature as "visible" or "invisible" to programmers – rather, we can talk about the extent of its visibility.

There are basically two cases of "boundary blurring":

  • Hardware features "quite visible" even though they don't affect program semantics. These are "technically micro" but "macro in spirit".
  • Hardware features "quite invisible" even though they do affect program semantics. These are "technically macro" but "micro in spirit".

Let's briefly look at examples of both kinds of "blurring".

Technically micro but macro in spirit

A good example is memory banking. The point of banking is increasing the number of addresses that can be accessed per cycle. A single 32K memory bank lets you access a single address per cycle. 2 16K banks let you access 2 addresses, 4 8K banks let you access 4 addresses, and so on.

So basically "more is better". What limits the number of banks is the overhead you pay per bank, the overhead of logic figuring out the bank an address belongs to, and the fact that there's no point in accessing more data than you can process.

Now if we look at banking as implemented in NVIDIA GPU memory, TI DSP caches and superscalar CPU caches, then at first glance, they're all "micro solutions". These machines seem to mostly differ in their mapping of address to bank – for instance, NVIDIA GPUs switch banks every 4 bytes, while TI DSPs switch banks every few kilobytes.
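The mapping from address to bank is trivial either way – something like the following, with made-up bank counts and interleaving granularities:

#include <stdint.h>

#define NBANKS 4 //made-up number of banks

int bank_fine(uint32_t addr)   { return (addr/4) % NBANKS; }        //switch banks every 4 bytes
int bank_coarse(uint32_t addr) { return (addr/(8*1024)) % NBANKS; } //switch banks every 8K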

But on all these machines, software can remain unaware of banking and run correctly. If two addresses accessed at the same cycle map to the same bank, the access takes two cycles instead of one – but no fault is reported and results aren't affected. Semantically, banking is invisible.

However, I'd call GPUs' and DSPs' banking "macroish", and superscalar CPUs' banking "microish". Why?

GPUs and DSPs "advertise" banking, and commit to a consistent address mapping scheme and consistent performance implications across different devices. Vendors encourage you to know about banking so that you allocate data in ways minimizing contentions.

CPUs don't advertise banking very much, and different CPUs running the same instruction set have different banking schemes which result in different performance. Moreover, those CPU variants differ in their ability to access multiple addresses in parallel in the first place: a low-end CPU might access at most one address but a high-end CPU might access two.

GPUs and DSPs, on the other hand, explicitly expose multiple load-store units (a macro feature). So software knows when it attempts to access many addresses in parallel – one reason to "advertise" which addresses can actually be accessed in parallel.

This shows why hardware features that don't affect program semantics aren't "completely invisible to programmers" – rather, there are "degrees of visibility". A feature only affecting performance is "quite visible" if vendors and users consider it an important part of the hw/sw contract.

Technically macro but micro in spirit

SIMD and VLIW are both visible in assembly programs/binary instruction streams. However, SIMD is "much more macro in spirit" than VLIW. That's because for many programmers, the hw/sw contract isn't the semantics of assembly, which they never touch, but the semantics of their source language.

At the source code level, the effect of SIMD tends to be very visible. Automatic vectorization rarely works, so you end up using intrinsic functions and short vector data types. The effect of VLIW on source code can be close to zero. Compilers are great at automatic scheduling, and better than humans, so there's no reason to litter the code with any sort of annotations to help them. Hence, SIMD is "more macro" – more visible.

Moreover, there's "macroish VLIW" and "microish VLIW" – just like there's "macroish banking" and "microish banking" – and, again, the difference isn't in the hardware feature itself, but in the way it's treated by vendors and users.

An extreme example of "microish VLIW" is Transmeta – the native binary instruction encoding was VLIW, but the only software that was supposed to be aware of that were the vendor-supplied binary translators from x86 or other bytecode formats. VLIW was visible at the hardware level but still completely hidden from programmers by software tools.

An opposite, "macro-centric" example is TI's C6000 family. There's not one, but two "human-writable assembly languages". There's parallel assembly, where you get to manually schedule instructions. There's also linear assembly, which schedules instructions for you, but you still get to explicitly say which execution unit each instruction will use (well, almost; let's ignore the A-side/B-side issues here.)

Why provide such a "linear assembly" language? Josh Fisher, the inventor of VLIW, didn't approve of the concept in his book "Embedded Computing: a VLIW Approach".

That's because originally, one of the supposed benefits of VLIW was precisely being "micro in spirit" – the ability to hide VLIW behind an optimizing compiler meant that you could speed up existing code just by recompiling it. Not as easy as simply running old binaries on a stronger new out-of-order processor, but easy enough in many cases – and much easier to support at the hardware end.

Linear assembly basically throws these benefits out the window. You spell things in terms of C6000's execution units and opcodes, so the code can't be cross-platform. Worse, TI can't decide to add or remove execution units or registers from some of the C6000 variants and let the compiler reschedule instructions to fit the new variant. Linear assembly refers to units and registers explicitly enough to not support this flexibility – for instance, there's no silent spill code generation. Remove some of the resources, and much of the code will stop compiling.

Then why is linear assembly shipped by TI, and often recommended as the best source language for optimized code? The reason is that the code is more "readable" – if one of the things the reader is after is performance implications. The same silent spill code generation that makes C more portable makes it "less readable", performance-wise – you can never tell whether your data fits into registers or not; similarly, it's hard to know how many operations each execution unit ends up running.

The beauty of linear assembly is that it hides the stuff humans particularly hate to do and compilers excel at – such as register allocation and instruction scheduling – but it doesn't hide things making it easy to estimate performance – such as instruction selection and the distinction between stack and register variables. (In my opinion, the only problem with linear assembly is that it still hides a bit too much – and that people can choose to not use it. They often do – and preserve stunning unawareness of how the C6000 works for years and years.)

Personally, I believe that, contrary to original expectations, VLIW works better in "macro-centric" platforms than "micro-centric" – a viewpoint consistent with the relative success of Transmeta's chips and VLIW DSPs. Whether this view is right or wrong, the point here is that hardware features "visible" to software can be more or less visible to programmers – depending on how visible the software stack in its entirety makes them.

Implications

We've seen that "macro vs micro" is a trade-off appearing in a great many contexts in hardware design, and that typically, both types of solutions can be found in practical architectures – so it's not clear which is "better".

If there's no clear winner, what are the benefits and drawbacks of these two options? I believe that these benefits and drawbacks are similar across the many contexts where the trade-off occurs. Some of the implications were briefly mentioned in the discussion on VLIW's "extent of visibility" – roughly:

  • Micro is more compatible
  • Macro is more efficient
  • Macro is easier to implement

There are other common implications – for example, macro is harder to context-switch (I like this one because, while it's not very surprising once you think about it, it doesn't immediately spring to mind).

I plan to discuss the implications in detail sometime soon. I intend to focus, not as much on how things could be in theory, but on how they actually tend to come out and why.

SIMD < SIMT < SMT: parallelism in NVIDIA GPUs

Programmable NVIDIA GPUs are very inspiring to hardware geeks, proving that processors with an original, incompatible programming model can become widely used.

NVIDIA call their parallel programming model SIMT - "Single Instruction, Multiple Threads". Two other different, but related parallel programming models are SIMD - "Single Instruction, Multiple Data", and SMT - "Simultaneous Multithreading". Each model exploits a different source of parallelism:

  • In SIMD, elements of short vectors are processed in parallel.
  • In SMT, instructions of several threads are run in parallel.
  • SIMT is somewhere in between – an interesting hybrid between vector processing and hardware threading.

My presentation of SIMT is focused on hardware architecture and its implications on the trade-off between flexibility and efficiency. I'll describe how SIMT is different from SIMD and SMT, and why – what is gained (and lost) through these differences.

From a hardware design perspective, NVIDIA GPUs are at first glance really strange. The question I'll try to answer is "why would you want to build a processor that way?" I won't attempt to write a GPU programming tutorial, or quantitatively compare GPUs to other processors.

SIMD < SIMT < SMT

It can be said that SIMT is a more flexible SIMD, and SMT is in turn a more flexible SIMT. Less flexible models are generally more efficient – except when their lack of flexibility makes them useless for the task.

So in terms of flexibility, SIMD < SIMT < SMT. In terms of performance, SIMD > SIMT > SMT, but only when the models in question are flexible enough for your workload.

SIMT vs SIMD

SIMT and SIMD both approach parallelism through broadcasting the same instruction to multiple execution units. This way, you replicate the execution units, but they all share the same fetch/decode hardware.

If so, what's the difference between "single instruction, multiple data" and "single instruction, multiple threads"? In NVIDIA's model, there are 3 key features that SIMD doesn't have:

  1. Single instruction, multiple register sets
  2. Single instruction, multiple addresses
  3. Single instruction, multiple flow paths

We'll see how this lifts restrictions on the set of programs that are possible to parallelize, and at what cost.

Single instruction, multiple register sets

Suppose you want to add two vectors of numbers. There are many ways to spell this. C uses a loop spelling:

for(i=0;i<n;++i) a[i]=b[i]+c[i];

Matlab uses a vector spelling:

a=b+c;

SIMD uses a "short vector" spelling – the worst of both worlds. You break your data into short vectors, and your loop processes them using instructions with ugly names. An example using C intrinsic functions mapping to ARM NEON SIMD instructions:

#include <arm_neon.h> //NEON types & intrinsics

void add(uint32_t *a, uint32_t *b, uint32_t *c, int n) {
  for(int i=0; i<n; i+=4) {
    //compute a[i], a[i+1], a[i+2], a[i+3]
    uint32x4_t b4 = vld1q_u32(b+i);
    uint32x4_t c4 = vld1q_u32(c+i);
    uint32x4_t a4 = vaddq_u32(b4,c4);
    vst1q_u32(a+i,a4);
  }
}

SIMT uses a "scalar" spelling:

__global__ void add(float *a, float *b, float *c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  a[i]=b[i]+c[i]; //no loop!
}

The weird __global__ keyword says that add() is a GPU thread entry point. blockIdx, blockDim and threadIdx are built-in thread-local variables keeping the thread's ID. We'll see later why a thread ID isn't just a single number; however, in this example we in fact convert it to a single number, and use it as the element index.

The idea is that the CPU spawns a thread per element, and the GPU then executes those threads. Not all of the thousands or millions of threads actually run in parallel, but many do. Specifically, an NVIDIA GPU contains several largely independent processors called "Streaming Multiprocessors" (SMs), each SM hosts several "cores", and each "core" runs a thread. For instance, Fermi has up to 16 SMs with 32 cores per SM – so up to 512 threads can run in parallel.
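For completeness, here's roughly what launching add() looks like on the host side – a sketch assuming dev_a, dev_b and dev_c were already allocated on the device (say, with cudaMalloc), and that n is a multiple of the block size, since the kernel above has no bounds check:

void launch_add(float* dev_a, float* dev_b, float* dev_c, int n) {
  int threads_per_block = 256;        //a tuning choice
  int blocks = n / threads_per_block; //one thread per element
  add<<<blocks, threads_per_block>>>(dev_a, dev_b, dev_c);
  cudaDeviceSynchronize();            //wait for all the threads to finish
}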

All threads running on the cores of an SM at a given cycle are executing the same instruction – hence Single Instruction, Multiple Threads. However, each thread has its own registers, so these instructions process different data.

Benefits

"Scalar spelling", where you write the code of a single thread using standard arithmetic operators, is arguably a better interface than SIMD loops with ugly assembly-like opcodes.

Syntax considerations aside, is this spelling more expressive – can it do things SIMD can't? Not by itself, but it dovetails nicely with other features that do make SIMT more expressive. We'll discuss those features shortly; in theory, they could be bolted on the SIMD model, but they never are.

Costs

From a hardware resources perspective, there are two costs to the SIMT way:

  • Registers spent to keep redundant data items. In our example, the pointers a, b, and c have the same value in all threads. The values of i are different across threads, but in a trivial way – for instance, 128, 129, 130… In SIMD, a, b, and c would be kept once in "scalar" registers – only short vectors such as a[i:i+4] would be kept in "vector" registers. The index i would also be kept once – several neighbor elements starting from i would be accessed without actually computing i+1, i+2, etc. Redundant computations both waste registers and needlessly consume power. Note, however, that a combination of compiler & hardware optimizations could eliminate the physical replication of redundant values. I don't know the extent to which it's done in reality.
  • Narrow data types are as costly as wide data types. A SIMD vector register keeping 4 32b integers can typically also be used to keep 8 16b integers, or 16 8b ones. Similarly, the same ALU hardware can be used for many narrow additions or fewer wide ones – so 16 byte pairs can be added in one cycle, or 4 32b integer pairs. In SIMT, a thread adds two items at a time, no matter what their width, wasting register bits and ALU circuitry.

It should be noted that SIMT can be easily amended with SIMD extensions for narrow types, so that each thread processes 4 bytes at a time using ugly assembly-like opcodes. AFAIK, NVIDIA refrains from this, presumably assuming that the ugliness is not worth the gain, with 32b float being the most popular data type in graphics anyway.
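For contrast, here's what SIMD buys you with narrow types – one instruction adding 16 byte pairs, in the same NEON intrinsics style as the add() example above:

#include <arm_neon.h>
#include <stdint.h>

void add_bytes(uint8_t *a, uint8_t *b, uint8_t *c, int n) {
  for(int i=0; i<n; i+=16) {
    uint8x16_t b16 = vld1q_u8(b+i);
    uint8x16_t c16 = vld1q_u8(c+i);
    vst1q_u8(a+i, vaddq_u8(b16,c16)); //16 narrow additions at once
  }
}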

Single instruction, multiple addresses

Let's apply a function approximated by a look-up table to the elements of a vector:

__global__ void apply(short* a, short* b, short* lut) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  a[i] = lut[b[i]]; //indirect memory access
}

Here, i is again "redundant" in the sense that in parallel threads, the values of i are consecutive. However, each thread then accesses the address lut+b[i] – and these addresses are not consecutive.

Roughly, such parallel random access works for both loads and stores. Logically, stores are the trickier part because of conflicts. What if two or more threads attempt to, say, increment the same bin in a histogram? Different NVIDIA GPU generations provide different solutions that we won't dwell on.

Benefits

This feature lets you parallelize many programs you can't with SIMD. Some form of parallel random access is usually available on SIMD machines under the names "permutation", "shuffling", "table look-up", etc. However, it always works with registers, not with memory, so it's just not the same scale. You index into a table of 8, 16 or 32 elements, but not 16K.

As previously mentioned, in theory this feature can be bolted on the SIMD model: just compute your addresses (say, lut+b[i]) in vector registers, and add a rand_vec_load instruction. However, such an instruction would have a fairly high latency. As we'll see, the SIMT model naturally absorbs high latencies without hurting throughput; SIMD much less so.

Costs

GPU has many kinds of memory: external DRAM, L2 cache, texture memory, constant memory, shared memory… We'll discuss the cost of random access in the context of two memories "at the extremes": DRAM and shared memory. DRAM is the farthest from the GPU cores, sitting outside the chip. Shared memory is the closest to the cores – it's local to an SM, and the cores of an SM can use it to share results with each other, or for their own temporary data.

  • With DRAM memory, random access is never efficient. In fact, the GPU hardware looks at all memory addresses that the running threads want to access at a given cycle, and attempts to coalesce them into a single DRAM access – in case they are not random. Effectively the contiguous range from i to i+#threads is reverse-engineered from the explicitly computed i,i+1,i+2… – another cost of replicating the index in the first place. If the indexes are in fact random and can not be coalesced, the performance loss depends on "the degree of randomness". This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it – similarly to any other processor.
  • With shared memory, random access is slowed down by bank contentions. Generally, a hardware memory module will only service one access at a time. So shared memory is organized in independent banks; the number of banks for NVIDIA GPUs is 16. If x is a variable in shared memory, then it resides in bank number (&x/4)%16. In other words, if you traverse an array, the bank you hit changes every 4 bytes. Access throughput peaks if all addresses are in different banks – hardware contention detection logic always costs latency, but only actual contentions cost throughput. If there's a bank hosting 2 of the addresses, the throughput is 1/2 of the peak; if there's a bank pointed to by 3 addresses, the throughput is 1/3 of the peak, etc., the worst slowdown being 1/16.


In theory, different mappings between banks and addresses are possible, each with its own shortcomings. For instance, with NVIDIA's mapping, accessing a contiguous array of floats gives peak throughput, but a contiguous array of bytes gives 1/4 of the throughput (since banks change every 4 bytes). Many of the GPU programming tricks aim at avoiding contentions.

For instance, with a byte array, you can frequently work with bytes at distance 4 from each other at every given step. Instead of accessing a[i] in your code, you access a[i*4], a[i*4+1], a[i*4+2] and a[i*4+3] – more code, but fewer contentions.

This sounds convoluted, but it's a relatively cheap way for the hardware to provide efficient random access. It also supports some very complicated access patterns with good average efficiency – by handling the frequent case of few contentions quickly, and the infrequent case of many contentions correctly.
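Here's the trick as a hypothetical CUDA kernel. The byte array is staged in dynamically-allocated shared memory (its size is supplied at launch), and each thread owns 4 consecutive bytes, so that at every step the threads touch addresses 4 bytes apart – that is, different banks:

__global__ void brighten(const unsigned char* in, unsigned char* out, int n) {
  extern __shared__ unsigned char a[]; //n bytes of dynamic shared memory
  int i = threadIdx.x;
  //instead of touching a[i], each thread touches a[i*4], a[i*4+1], a[i*4+2], a[i*4+3]
  for(int step=0; step<4; ++step) {
    int idx = i*4 + step; //at a given step, consecutive threads hit different banks
    if(idx < n) a[idx] = in[idx];
  }
  __syncthreads();
  for(int step=0; step<4; ++step) {
    int idx = i*4 + step;
    if(idx < n) out[idx] = a[idx] + 1; //conflict-free shared memory reads
  }
}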

Single instruction, multiple flow paths

Let's find the indexes of non-zero elements in a vector. This time, each thread will work on several elements instead of just one:

__global__ void find(int* vec, int len,
                     int* ind, int* nfound,
                     int nthreads) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int last = 0;
  int* myind = ind + tid*len;
  for(int i=tid; i<len; i+=nthreads) {
    if(vec[i]) { //flow divergence
      myind[last] = i;
      last++;
    }
  }
  nfound[tid] = last;
}

Each thread processes len/nthreads elements, starting at the index equal to its ID with a step of nthreads. We could make each thread responsible for a more natural contiguous range using a step of 1. The way we did it is better in that accesses to vec[i] by concurrent threads address neighbor elements, so we get coalescing.

The interesting bit is if(vec[i]) – depending on vec[i], some threads execute the code saving the index, some don't. The control flow of different threads can thus diverge.

Benefits

Support for divergence further expands the set of parallelizable programs, especially when used together with parallel random access. SIMD has some support for conditional execution through vector "select" operations: select(flag,a,b) = if flag then a else b. However, select can't be used to suppress updates to values – the way myind[last] is never written by threads where vec[i] is 0.

SIMD instructions such as stores could be extended to suppress updates based on a boolean vector register. For this to be really useful, the machine probably also needs parallel random access (for instance, the find() example wouldn't work otherwise). Unless what seems like an unrealistically smart compiler arrives, this also gets more and more ugly, whereas the SIMT spelling remains clean and natural.

Costs

  • Only one flow path is executed at a time, and threads not running it must wait. Ultimately SIMT executes a single instruction in all the multiple threads it runs – threads share program memory and fetch / decode / execute logic. When the threads have the same flow – all run the if, nobody runs the else, for example – then they just all run the code in if at full throughput. However, when one or more threads have to run the else, they'll wait for the if threads. When the if threads are done, they'll in turn wait for the else threads. Divergence is handled correctly, but slowly. Deeply nested control structures effectively serialize execution and are not recommended.
  • Divergence can further slow things down through "randomizing" memory access. In our example, all threads read vec[i], and the indexing is tweaked to avoid contentions. However, when myind[last] is written, different threads will have incremented the last counter a different number of times, depending on the flow. This might lead to contentions which also serialize execution to some extent. Whether the whole parallelization exercise is worth the trouble depends on the flow of the algorithm as well as the input data.

We've seen the differences between SIMT and its less flexible relative, SIMD. We'll now compare SIMT to SMT – the other related model, this time the more flexible one.

SIMT vs SMT

SIMT and SMT both use threads as a way to improve throughput despite high latencies. The problem they tackle is that any single thread can get stalled running a high-latency instruction. This leaves the processor with idle execution hardware.

One way around this is switching to another thread – which (hopefully) has an instruction ready to be executed – and then switching back. For this to work, context switching has to be instantaneous. To achieve that, you replicate register files so that each thread has its own registers, and they all share the same execution hardware.

But wait, doesn't SIMT already replicate registers, as a way to have a single instruction operate on different data items? It does – here, we're talking about a "second dimension" of register replication:

  1. Several threads – a "warp" in NVIDIA terminology – run simultaneously. So each thread needs its own registers.
  2. Several warps, making up a "block", are mapped to an SM, and an SM instantaneously switches between the warps of a block. So each warp needs separate registers for each of its threads.


With this "two-dimensional" replication, how many registers we end up with? Well, a lot. A block can have up to 512 threads. And the registers of those threads can keep up to 16K of data.

How many register sets does a typical SMT processor have? Er, 2, sometimes 4…

Why so few? One reason is diminishing returns. When you replicate registers, you pay a significant price, in the hope of being able to better occupy your execution hardware. However, with every thread you add, the chance rises that the execution hardware is already kept busy by the other threads. Soon, the small throughput gain just isn't worth the price.

If SMT CPU designers stop at 2 or 4 threads, why did SIMT GPU designers go for 512?

With enough threads, high throughput is easy

SMT is an afterthought – an attempt to use idle time on a machine originally designed to not have a lot of idle time to begin with. The basic CPU design aims, first and foremost, to run a single thread fast. Splitting a process into several independent threads is not always possible. When it is possible, it's usually gnarly.

Even in server workloads, where there's naturally a lot of independent processing, single-threaded latency still matters. So few expensive, low-latency cores outperform many cheap, high-latency cores. As Google's Urs Hölzle put it, "brawny cores beat wimpy cores". Serial code has to run fast.

Running a single thread fast means being able to issue instructions from the current thread as often as possible. To do that, CPU hardware works around every one of the many reasons to wait. Such diverse techniques as:

  • superscalar execution
  • out-of-order execution
  • register renaming
  • branch prediction
  • speculative execution
  • cache hierarchy
  • speculative prefetching
  • etc. etc.

…are all there for the same basic purpose. They maximize the chances of an instruction being issued without having to switch to another thread.

SMT is the last line of defense, attempting to fill stalls after all these other measures failed. And even that is considered a bad idea when it hurts the precious single-threaded performance. Which it usually does: each of the 2 threads will typically complete later than it would if it didn't have to share the hardware with the other. This is a key reason to keep the number of hardware threads low, even if there are still throughput gains to be made by adding threads.

However, for GPUs, the target use case is one where you do have enough data parallelism to make use of plenty of threads. If so, why not build things the other way around? Threading could be our first stall-mitigating measure. If we have enough threads, we just keep switching between them, and the hardware always has something to do.


This saves a lot of hardware, and a lot of design effort, because you don't need most of the other methods anymore. Even caches and hardware prefetching are not used in much of the GPU memory traffic – rather, you access external memory directly. Why bother with caching and prefetching, if you don't have to sit idly until the data arrives from main memory – but instead just switch to a different warp? No heuristics, no speculation, no hurry – just keep yourself busy when waiting.

Furthermore, even the basic arithmetic pipeline is designed for a high latency, high throughput scenario. According to the paper "Demystifying GPU Microarchitecture through Microbenchmarking", no operation takes less than 24 cycles to complete. However, the throughput of many operations is single-cycle.

The upshot is that counting on the availability of many threads allows the GPU to sustain a high throughput without having to sweat for low latencies. Hardware becomes simpler and cheaper in many areas as a result.

When latencies are high, registers are cheap

So plentiful threads make it easier to build high-throughput hardware. What about having to replicate all those registers though? 16K sounds like an insane amount of registers – how is this even affordable?

Well, it depends on what "register" means. The original meaning is the kind of storage with the smallest access latency. In CPUs, access to registers is faster than access to L1 caches, which in turn is faster than L2, etc. Small access latency means an expensive implementation, therefore there must be few registers.

However, in GPUs, access to "registers" can be quite slow. Because the GPU is always switching between warps, many cycles pass between two subsequent instructions of one warp. The reason registers must be fast in CPUs is because subsequent instructions communicate through them. In GPUs, they also communicate through registers – but the higher latency means there's no rush.

Therefore, GPU registers are only analogous to CPU registers in terms of instruction encoding. In machine code, "registers" are a small set of temporary variables that can be referenced with just a few bits of encoding – unlike memory, where you need a longer address to refer to a variable. In this and other ways, "registers" and "memory" are semantically different elements of encoding – both on CPUs and GPUs.

However, in terms of hardware implementation, GPU registers are actually more like memory than CPU registers. [Disclaimer: NVIDIA doesn't disclose implementation details, and I'm grossly oversimplifying, ignoring things like data forwarding, multiple access ports, and synthesizable vs custom design]. 16K of local RAM is a perfectly affordable amount. So while in a CPU, registers have to be expensive, they can be cheap in a high-latency, high-throughput design.

It's still a waste if 512 threads keep the same values in some of their registers – such as array base pointers in our examples above. However, many of the registers keep different values in different threads. In many cases register replication is not a waste at all – any processor would have to keep those values somewhere. So functionally, the plentiful GPU registers can be seen as a sort of a data cache.

Drawbacks

We've seen that:

  • Many threads enable cheap high-throughput, high-latency design
  • A high-throughput, high-latency design in turn enables a cheap implementation of threads' registers

This leads to a surprising conclusion that SIMT with its massive threading can actually be cheaper than SMT-style threading added to a classic CPU design. Not unexpectedly, these cost savings come at a price of reduced flexibility:

  1. Low occupancy greatly reduces performance
  2. Flow divergence greatly reduces performance
  3. Synchronization options are very limited

Occupancy

"Occupancy" is NVIDIA's term for the utilization of threading. The more threads an SM runs, the higher its occupancy. Low occupancy obviously leads to low performance – without enough warps to switch between, the GPU won't be able to hide its high latencies. The whole point of massive threading is refusing to target anything but massively parallel workloads. SMT requires much less parallelism to be efficient.

Divergence

We've seen that flow divergence is handled correctly, but inefficiently in SIMT. SMT doesn't have this problem – it works quite well given unrelated threads with unrelated control flow.

There are two reasons why unrelated threads can't work well with SIMT:

  • SIMD-style instruction broadcasting – unrelated threads within a warp can't run fast.
  • More massive threading than SMT – unrelated warps would compete for shared resources such as instruction cache space. SMT also has this problem, but it's tolerable when you have few threads.

So both of SIMT's key ideas – SIMD-style instruction broadcasting and SMT-style massive threading – are incompatible with unrelated threads.

Related threads – those sharing code and some of the data – could work well with massive threading by itself despite divergence. It's instruction broadcasting that they fail to utilize, leaving execution hardware idle.

However, it seems that much of the time, related threads actually tend to have the same flow and no divergence. If this is true, a machine with massive threading but without instruction broadcasting would miss a lot of opportunities to execute its workload more efficiently.

Synchronization

In terms of programming model, SMT is an extension to a single-threaded, time-shared CPU. The same fairly rich set of inter-thread (and inter-device) synchronization and communication options is available with SMT as with "classic" single-threaded CPUs. This includes interrupts, message queues, events, semaphores, blocking and non-blocking system calls, etc. The underlying assumptions are:

  • There are quite many threads
  • Typically, each thread is doing something quite different from other threads
  • At any moment, most threads are waiting for an event, and a small subset can actually run

SMT stays within this basic time-sharing framework, adding an option to have more than one actually running thread. With SMT, as with a "classic" CPU, a thread will very typically be put "on hold" in order to wait for an event. This is implemented using context switching – saving registers to memory and, if a ready thread is found, restoring its registers from memory so that it can run.

SIMT doesn't like to put threads on hold, for several reasons:

  • Typically, there are many running, related threads. It would make the most sense to put them all on hold, so that another, unrelated, equally large group of threads can run. However, switching 16K of context is not affordable. In this sense, "registers" are expensive after all, even if they are actually memory.
  • SIMT performance depends greatly on there being many running threads. There's no point in supporting the case where most threads are waiting, because SIMT wouldn't run such workloads very well anyway. From the use case angle, a lot of waiting threads arise in "system"/"controller" kind of software, where threads wait for files, sockets, etc. SIMT is purely computational hardware that doesn't support such OS services. So the situation is both awkward for SIMT and shouldn't happen in its target workloads anyway.
  • Roughly, SIMT supports data parallelism – same code, different data. Data parallelism usually doesn't need complicated synchronization – all threads have to synchronize once a processing stage is done, and otherwise, they're independent. What requires complicated synchronization, where some threads run and some are put on hold due to data dependencies, is task parallelism – different code, different data. However, task parallelism implies divergence, and SIMT isn't good at that anyway – so why bother with complicated synchronization?

Therefore, SIMT roughly supports just one synchronization primitive – __syncthreads(). This creates a synchronization point for all the threads of a block. You know that if a thread runs code past this point, no thread runs code before this point. This way, threads can safely share results with each other. For instance, with matrix multiplication:

  1. Each thread, based on its ID – x,y – reads 2 elements, A(x,y) and B(x,y), from external memory to the on-chip shared memory. (Of course large A and B won't fit into shared memory, so this will be done block-wise.)
  2. Threads sync – all threads can now safely access all of A and B.
  3. Each thread, depending on the ID, multiplies row y in A by column x in B.
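//A, B are __shared__ float arrays holding the block's tile; extA, extB point to
//external memory; stride, ext_stride are the row strides; N is the length of the
//multiplied row/column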
//1. load A(x,y) and B(x,y)
int x = threadIdx.x;
int y = threadIdx.y;
A[stride*y + x] = extA[ext_stride*y + x];
B[stride*y + x] = extB[ext_stride*y + x];
//2. sync
__syncthreads();
//3. multiply row y in A by column x in B
float prod = 0;
for(int i=0; i<N; ++i) {
  prod += A[stride*y + i] * B[stride*i + x];
}

It's an incomplete example (we look at just one block and ignore blockIdx, among other things), but it shows the point of syncing – and the point of these weird "multi-dimensional" thread IDs (IDs are x,y,z coordinates rather than just integers). It's just natural, with 2D and 3D arrays, to map threads and blocks to coordinates and sub-ranges of these arrays.

Summary of differences between SIMD, SIMT and SMT

SIMT is more flexible than SIMD in three areas:

  1. Single instruction, multiple register sets
  2. Single instruction, multiple addresses
  3. Single instruction, multiple flow paths

SIMT is less flexible than SMT in three areas:

  1. Low occupancy greatly reduces performance
  2. Flow divergence greatly reduces performance
  3. Synchronization options are very limited

The effect of flexibility on costs was discussed above. A wise programmer probably doesn't care about these costs – rather, he uses the most flexible (and easily accessible) device until he runs out of cycles, then moves on to utilize the next most flexible device. Costs are just a limit on how flexible a device that is available in a given situation can be.

"SIMT" should catch on

It's a beautiful idea, questioning many of the usual assumptions about hardware design, and arriving at an internally consistent answer with the different parts naturally complementing each other.

I don't know the history well enough to tell which parts are innovations by NVIDIA and which are borrowed from previous designs. However, I believe the term SIMT was coined by NVIDIA, and perhaps it's a shame that it (apparently) didn't catch on, because the architecture "deserves a name" – not necessarily true of every "new paradigm" announced by marketing.

One person who took note is Andy Glew, one of Intel's P6 architects – in his awesome computer architecture wiki, as well as in his presentation regarding further development of the SIMT model.

The presentation talks about neat optimizations – "rejiggering threads", per-thread loop buffers/time pipelining – and generally praises SIMT superiority over SIMD. Some things I disagree with – such as the "vector lane crossing" issue – and some are very interesting, such as everything about improving utilization of divergent threads.

I think the presentation should be understandable after reading my oversimplified overview – and will show where my overview oversimplifies, among other things.

Peddling fairness

Throughout this overview, there's this recurring "fairness" idea: you can trade flexibility for performance, and you can choose your trade-off.

It makes for a good narrative, but I don't necessarily believe it. You might very well be able to get both flexibility and performance. More generally, you might get both A and B, where A vs B intuitively feels like "an inherent trade-off".

What you can't do is get A and B in a reasonably simple, straightforward way. This means that you can trade simplicity for almost everything – though you don't necessarily want to; perhaps it's a good subject for a separate post.