The overblown frequency vs cost efficiency trade-off

January 31st, 2016

I've often read arguments that computing circuitry running at a high frequency is inefficient, power-wise or silicon area-wise or both. So roughly, 100 MHz is more efficient, in that you get more work done per unit of energy or area spent. And CPUs go for 1 GHz or 3 GHz because serial performance sells regardless of efficiency. But accelerators like GPUs or embedded DSPs or ISPs or codecs implemented in hardware etc. etc. – these don't need to run at a high frequency.

And I think this argument is less common now when say GPUs have caught up, and an embedded GPU might run at the same frequency as an embedded CPU. But still, I've just seen someone peddling a "neuromorphic chip" or some such, and there it was – "you need to run conventional machines at 1 GHz and it's terribly inefficient."

AFAIK the real story here is pretty simple, namely:

  1. As you increase frequency, you GAIN efficiency up to a point;
  2. From that point on, you do start LOSING efficiency;
  3. That inflection point, for well-designed circuits, is much higher than people think (close to a CPU's frequency in the given manufacturing process, certainly not 10x less as people often claim);
  4. ...and what fueled the myth is, accelerator makers used to be much worse at designing for high frequency than CPU makers. So marketeers together with "underdog sympathizers" have overblown the frequency vs efficiency trade-off completely out of proportion.

And below I'll detail these points; if you notice oversimplifications, please correct me (there are many conflicting goals in circuit implementation, and these goals are different across markets, so my experience might be too narrow.)

Frequency improves efficiency up to a point

What's the cost of a circuit, and how is it affected by frequency? (This section shows the happy part of the answer – the sad part is in the next section.)

  1. Silicon area. The higher the clock frequency, the more things the same circuit occupying this area does per unit of time – so you win!
  2. Leakage power – just powering up the circuit and doing nothing, not even toggling the clock signal, costs you a certain amount of energy per unit of time. Here again, the higher the frequency, the more work gets done in exchange for the same leakage power – again you win!
  3. Switching power – every time the clock signal changes its value from 0 to 1 and back, this triggers a bunch of changes to the values of other signals as dictated by the interconnection of the logic gates, flip-flops – everything making up the circuit. All this switching from 0 to 1 and back costs energy (and NOT switching does not; measure the power dissipated by a loop multiplying zeros vs a loop multiplying random data, and you'll see what I mean. This has implications for the role of software in conserving energy, but this is outside our scope here.) What's the impact of frequency on cost here? It turns out that frequency is neutral - the energy spent per unit of time is directly proportional to the clock frequency, but so is the amount of work done.

Overall, higher frequency means spending less area and power per unit of work – the opposite of the peanut gallery's conventional wisdom.
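
The three cost components above can be folded into a toy model. This is only a sketch under assumptions of mine, not anything measured: area, leakage power and switching energy per cycle are treated as constants that don't change with frequency (which only holds up to the point discussed in the next section), and one clock cycle is taken to do one unit of work. The function name and all the numbers are made up for illustration.

```python
def cost_per_unit_of_work(freq_hz, area_mm2, leakage_w, switching_j_per_cycle):
    """Toy model: one clock cycle does one unit of work.

    All parameters are hypothetical constants; in particular they are
    assumed not to change with frequency, which only holds up to a point.
    """
    work_per_second = freq_hz
    area_cost = area_mm2 / work_per_second        # area per (unit of work / second)
    leakage_energy = leakage_w / work_per_second  # joules leaked per unit of work
    switching_energy = switching_j_per_cycle      # joules per unit of work, frequency-neutral
    return area_cost, leakage_energy + switching_energy


# Same hypothetical circuit clocked at 100 MHz and at 1 GHz.
slow = cost_per_unit_of_work(100e6, area_mm2=1.0, leakage_w=0.01, switching_j_per_cycle=1e-10)
fast = cost_per_unit_of_work(1e9, area_mm2=1.0, leakage_w=0.01, switching_j_per_cycle=1e-10)
print("100 MHz:", slow)
print("1 GHz:  ", fast)
```

In this model the area and leakage terms shrink as frequency grows while the switching term stays put, so the total cost per unit of work can only go down. What the model leaves out is exactly what the next section is about.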

Frequency degrades efficiency from some point

At some point, however, higher frequency does start to increase the cost of the circuit per unit of work. The reasons boil down to having to build your circuit out of physically larger elements that leak more power. Even further down the frequency-chasing path come other problems, such as having to break down your work into many more pipeline stages, spending area and power on storage for the intermediate results of these stages; and needing expensive cooling solutions for heat dissipation. So actually there are several points along the road, with the cost of extra MHz growing at each point – until you reach the physically impossible frequency for a given manufacturing process.

How do you find the point where an extra MHz isn't worth it? For a synthesizable design (one created in a hardware description language like Verilog or VHDL), you can synthesize it for different target frequencies, measure the cost in area and power, and plot the results. My confidence about where the inflection point should be comes from looking at these plots. Of course the plot will depend on the design, bringing us to the next point.
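
One way to read such a sweep is to divide each result by its target frequency and look for the minimum. A minimal sketch, assuming each synthesis run yields a (frequency, area, power) triple and that work done per second is proportional to frequency; the `SynthResult` type, the `most_efficient` helper and the numbers in the sweep are invented placeholders, not output from any real tool or design.

```python
from dataclasses import dataclass


@dataclass
class SynthResult:
    freq_mhz: float   # target frequency the design was synthesized for
    area_mm2: float   # reported area
    power_mw: float   # reported total power at that frequency


def most_efficient(results, cost):
    """Return the result minimizing cost per unit of work.

    Work per second is assumed proportional to frequency, so cost per
    unit of work is simply cost divided by frequency.
    """
    return min(results, key=lambda r: cost(r) / r.freq_mhz)


# Placeholder numbers for illustration only, shaped like the story in the
# text: cost is nearly flat at first, then climbs steeply near the limit.
sweep = [
    SynthResult(100, 1.00, 12.0),
    SynthResult(400, 1.02, 42.0),
    SynthResult(800, 1.10, 82.0),
    SynthResult(1000, 1.60, 140.0),
    SynthResult(1200, 2.40, 260.0),
]

best_area = most_efficient(sweep, cost=lambda r: r.area_mm2)
best_energy = most_efficient(sweep, cost=lambda r: r.power_mw)
print(best_area.freq_mhz, best_energy.freq_mhz)
```

Area and energy don't have to agree on the best frequency; with real data you'd look at both curves and weigh them according to what costs more in your market.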

Better-designed circuits' optimal frequency is higher

One hard part of circuit design is, you're basically making a hugely parallel system, where many parts do different things. Each part doing the same thing would be easy – they all take the same time, duh, so no bottleneck. Conversely, each part doing something else makes it really easy to create a bottleneck – and really hard to balance the parts (it's hard to tell exactly how much time a piece of work takes without trying, and there are a lot of options you could try, each breaking the work into different parts.)

You need to break the harder things into smaller pipeline stages (yes, a cost in itself as we've just said – but usually a small cost unless you target really high frequencies and so have to break everything into umpteen stages.) Pipelining is hard to get right when the pipeline stages are not truly independent, and people often recoil from it (a hardware bug is on average more likely to be catastrophically costly than somewhat crummier performance.) Simpler designs also shorten schedules, which may be better than reaching a higher frequency later.

So CPUs competing for a huge market on serial performance and (stupidly) advertised frequency, implementing a comparatively stable instruction set, justified the effort to overcome these obstacles. (Sometimes to the detriment of consumers, arguably, as say with Pentium 4 – namely, high frequency, low serial performance due to too much pipelining.)

Accelerators are different. You can to some extent compensate for poor serial performance by throwing money at the problem - add more cores. Sometimes you don't care about extra performance – if you can decode video at the peak required rate and resolution, extra performance might not win more business. Between frequency improvements and architecture improvements/implementing a huge new standard, the latter might be more worthwhile. And then the budgets are generally smaller, so you tend to design more conservatively.

So AFAIK this is why so many embedded accelerators had crummy frequencies when they started out (and they also had apologists explaining why it was a good thing). And that's why some of the accelerators caught up – basically it was never a technical limitation but an economic problem of where to spend effort, and changing circumstances caused effort to be invested into improving frequency. And that's why if you're making an accelerator core which is 3 times slower than the CPU in the same chip, my first guess is your design isn't stellar at this stage, though it might improve – if it ever has to.

P.S. I'll say it again – my perspective can be skewed; someone with different experience might point out some oversimplifications. Different process nodes and different implementation constraints mean that what's decisive in one's experience is of marginal importance in another's experience. So please do correct me if I'm wrong in your experience.

P.P.S. Theoretically, a design running at 1 GHz might be doing the exact same amount of work as a 2 GHz design – if the pipeline is 2x shorter and each stage in the 1 GHz thing does the work of 2 stages in the 2 GHz thing. In practice, the 1 GHz design will have stages doing less work, so they complete in less than 1 nanosecond (1/1GHz) and are idle during much of the cycle. And this is why you want to invest some effort to up the frequency in that design – to not have mostly-idle circuitry leaking power and using up area. But the theoretically possible perfectly balanced 1 GHz design is a valid counter-argument to all of the above, I just don't think that's what most crummy frequencies hide behind them.
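
The idle-time argument can be put in numbers. A rough sketch, assuming a pipeline is described just by the logic delay of each stage (the delays below are hypothetical) and ignoring flip-flop setup time and clock skew: the fastest usable clock is set by the slowest stage, and at any given clock each stage is busy only for its own delay.

```python
def max_frequency_ghz(stage_delays_ns):
    """The clock period cannot be shorter than the slowest stage."""
    return 1.0 / max(stage_delays_ns)


def idle_fraction(stage_delays_ns, clock_ghz):
    """Average share of a cycle during which a stage has already finished."""
    period_ns = 1.0 / clock_ghz
    if max(stage_delays_ns) > period_ns:
        raise ValueError("clock is too fast for the slowest stage")
    average_busy_ns = sum(stage_delays_ns) / len(stage_delays_ns)
    return 1.0 - average_busy_ns / period_ns


# Hypothetical 1 GHz design whose stages all finish well within 1 ns.
stages = [0.5, 0.4, 0.5, 0.3]
print(max_frequency_ghz(stages))    # how fast this pipeline could be clocked
print(idle_fraction(stages, 1.0))   # share of the cycle spent idle at 1 GHz
print(idle_fraction(stages, 2.0))   # same pipeline clocked at its limit
```

The perfectly balanced 1 GHz design from the counter-argument would be the case where every stage delay is exactly 1 ns and the idle fraction is zero.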

Update: here's an interesting complication – Norman Yarvin's comment points to an article about near-threshold voltage research by Intel, from which it turns out that a Pentium implementation designed to operate at near-threshold voltage (at a near-2x cost in area) achieves its best energy efficiency at 100 MHz – 10x slower than its peak frequency but spending 47x less energy per unit of time. The trouble is, if you want that 10x performance back, you'd need 10 such cores for an overall area increase of 20x, in return for overall energy savings of 4.7x. Other points on the graph will be less extreme (less area spent, less energy saved.)

So this makes sense when silicon area is tremendously cheaper than energy, or when there's a hard limit on how much energy you can spend but a much laxer limit on area. This is not the case most of the time, AFAIK (silicon costs a lot and then it simply takes physical space, which also costs), but it can be the case some of the time. NTV can also make sense if voltage is adjusted dynamically based on workload, and you don't need high performance most of the time, and you don't care that your peak performance is achieved at a 2x area cost as much as you're happy to be able to conserve energy tremendously when not needing the performance.
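
The arithmetic behind the 20x area and 4.7x energy figures is simple enough to spell out. A small sketch using only the numbers quoted above; it reads the 47x figure as a reduction in energy spent per unit of time, and it assumes the workload spreads perfectly across cores, which is the best case for the slow design. The function name is mine.

```python
def replicate_to_match(speed_loss, power_reduction, area_overhead):
    """Replicate slow, efficient cores until throughput matches one fast core.

    Assumes perfect parallel scaling across the replicated cores.
    Area and savings are relative to the single fast core.
    """
    cores = speed_loss                       # 10x slower -> 10 cores
    total_area = cores * area_overhead       # each core costs ~2x the area
    energy_saving = power_reduction / cores  # saving at equal throughput
    return cores, total_area, energy_saving


cores, area, saving = replicate_to_match(speed_loss=10, power_reduction=47, area_overhead=2)
print(cores, area, saving)
```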

Anyway, it goes to show that it's more complicated than I stated, even if I'm right for the average design made under today's typical constraints.