The overblown frequency vs cost efficiency trade-off

January 31st, 2016

I've often read arguments that computing circuitry running at a high frequency is inefficient, power-wise or silicon area-wise or both. So roughly, 100 MHz is more efficient, in that you get more work done per unit of energy or area spent. And CPUs go for 1 GHz or 3 GHz because serial performance sells regardless of efficiency. But accelerators like GPUs or embedded DSPs or ISPs or codecs implemented in hardware etc. etc. – these don't need to run at a high frequency.

And I think this argument is less common now that, say, GPUs have caught up, and an embedded GPU might run at the same frequency as an embedded CPU. But still, I've just seen someone peddling a "neuromorphic chip" or some such, and there it was – "you need to run conventional machines at 1 GHz and it's terribly inefficient."

AFAIK the real story here is pretty simple, namely:

  1. As you increase frequency, you GAIN efficiency up to a point;
  2. From that point on, you do start LOSING efficiency;
  3. That inflection point, for well-designed circuits, is much higher than people think (close to a CPU's frequency in the given manufacturing process, certainly not 10x less as people often claim);
  4. ...and what fueled the myth is that accelerator makers used to be much worse at designing for high frequency than CPU makers. So marketeers together with "underdog sympathizers" have blown the frequency vs efficiency trade-off completely out of proportion.

And below I'll detail these points; if you notice oversimplifications, please correct me (there are many conflicting goals in circuit implementation, and these goals are different across markets, so my experience might be too narrow.)

Frequency improves efficiency up to a point

What's the cost of a circuit, and how is it affected by frequency? (This section shows the happy part of the answer – the sad part is in the next section.)

  1. Silicon area. The higher the clock frequency, the more things the same circuit occupying this area does per unit of time – so you win!
  2. Leakage power – just powering up the circuit and doing nothing, not even toggling the clock signal, costs you a certain amount of energy per unit of time. Here again, the higher the frequency, the more work gets done in exchange for the same leakage power – again you win!
  3. Switching power – every time the clock signal changes its value from 0 to 1 and back, this triggers a bunch of changes to the values of other signals as dictated by the interconnection of the logic gates, flip-flops – everything making up the circuit. All this switching from 0 to 1 and back costs energy (and NOT switching does not; measure the power dissipated by a loop multiplying zeros vs a loop multiplying random data, and you'll see what I mean. This has implications for the role of software in conserving energy, but that's outside our scope here.) What's the impact of frequency on cost here? It turns out that frequency is neutral – the cost in energy is directly proportional to the clock frequency, but so is the amount of work done.

Overall, higher frequency means spending less area and power per unit of work – the opposite of the peanut gallery's conventional wisdom.
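
To put rough numbers on the happy part, here's a back-of-envelope sketch in Python. The constants are made up purely for illustration (they're not measurements of any real process); only the way the per-operation costs scale with frequency matters:

```python
# Toy cost model: one circuit doing one operation per clock cycle.
# All constants are invented; only how the per-op costs scale with frequency matters.

AREA_MM2 = 1.0          # silicon area of the circuit (doesn't change with the clock here)
LEAKAGE_W = 0.05        # static power: paid per second whether or not any work gets done
SWITCH_J_PER_OP = 1e-9  # dynamic energy per operation (~alpha*C*V^2), fixed while V is fixed

def per_op_costs(freq_hz):
    """Area-time and energy charged to each unit of work at a given clock frequency."""
    ops_per_sec = freq_hz               # one op per cycle
    area_time = AREA_MM2 / ops_per_sec  # mm^2-seconds of silicon tied up per op
    leakage = LEAKAGE_W / ops_per_sec   # leakage energy amortized over more ops as f grows
    switching = SWITCH_J_PER_OP         # frequency-neutral while the voltage stays fixed
    return area_time, leakage + switching

for f in (100e6, 1e9):
    area_time, energy = per_op_costs(f)
    print(f"{f / 1e6:6.0f} MHz: {area_time:.1e} mm^2*s per op, {energy:.2e} J per op")
```

At 1 GHz both the area-time and the leakage charged to each operation are 10x smaller than at 100 MHz, while the switching energy per operation stays flat – which is the "you win" part above.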

Frequency degrades efficiency from some point

At some point, however, higher frequency does start to increase the cost of the circuit per unit of work. The reasons boil down to having to build your circuit out of physically larger elements that leak more power. Even further down the frequency-chasing path come other problems, such as having to break your work down into many more pipeline stages, spending area and power on storage for the intermediate results of these stages; and needing expensive cooling solutions for heat dissipation. So actually there are several points along the road, with the cost of an extra MHz growing at each point – until you reach the frequency that's physically impossible for a given manufacturing process.

How do you find the point where an extra MHz isn't worth it? For a synthesizable design (one described in a hardware description language like Verilog or VHDL), you can synthesize it for different target frequencies, measure the cost in area and power, and plot the results. My confidence about where the inflection point lies comes from looking at such plots. Of course the plot will depend on the design, bringing us to the next point.
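
Lacking an actual tool flow here, a toy stand-in for such a sweep might look like the sketch below. The cost model inside fake_synthesis_report is invented (area and power rise gently with the target frequency at first, then steeply), so treat it as an illustration of the shape of the plot rather than as data:

```python
# Toy stand-in for sweeping synthesis target frequencies and plotting the cost
# per unit of work. In real life you'd rewrite the clock constraint, rerun
# synthesis, and parse the area/power reports; here a made-up model fakes that.

def fake_synthesis_report(target_mhz, knee_mhz=800.0):
    """Pretend tool output: (area_mm2, power_mw) for a design hitting this target clock."""
    stress = (target_mhz / knee_mhz) ** 4   # superlinear penalty past the knee:
    area_mm2 = 1.0 * (1.0 + stress)         # bigger cells, more pipeline registers...
    leak_mw = 50.0 * (1.0 + stress)         # ...which also leak more
    switch_mw = 1.0 * target_mhz            # dynamic power roughly proportional to f
    return area_mm2, leak_mw + switch_mw

print(f"{'MHz':>6} {'mm2 per Gop/s':>14} {'nJ per op':>10}")
for mhz in (100, 200, 400, 800, 1200, 1600):
    area_mm2, power_mw = fake_synthesis_report(mhz)
    gops = mhz / 1e3                        # one op per cycle
    nj_per_op = power_mw * 1e-3 / (mhz * 1e6) * 1e9
    print(f"{mhz:6d} {area_mm2 / gops:14.2f} {nj_per_op:10.2f}")
# Both columns fall while leakage and area get amortized, then turn back up once
# the superlinear penalty dominates – that turning point is the knee to look for.
```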

Better-designed circuits' optimal frequency is higher

One hard part of circuit design is, you're basically making a hugely parallel system, where many parts do different things. Each part doing the same thing would be easy – they all take the same time, duh, so no bottleneck. Conversely, each part doing something else makes it really easy to create a bottleneck – and really hard to balance the parts (it's hard to tell exactly how much time a piece of work takes without trying, and there are a lot of options you could try, each breaking the work into different parts.)

You need to break the harder things into smaller pipeline stages (yes, a cost in itself as we've just said – but usually a small cost unless you target really high frequencies and so have to break everything into umpteen stages.) Pipelining is hard to get right when the pipeline stages are not truly independent, and people often recoil from it (a hardware bug is on average more likely to be catastrophically costly than somewhat crummier performance.) Simpler designs also shorten schedules, which may be better than reaching a higher frequency later.
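
As a minimal illustration of the balancing problem (the stage delays and register overhead below are made up, not taken from any real design):

```python
# The clock can only run as fast as the slowest pipeline stage allows,
# plus the fixed overhead (setup time, clock-to-q) of the stage registers.
REG_OVERHEAD_NS = 0.1  # register overhead per stage, made up

def max_freq_mhz(stage_delays_ns):
    period_ns = max(stage_delays_ns) + REG_OVERHEAD_NS
    return 1e3 / period_ns

unbalanced = [0.4, 0.4, 1.6]       # one slow stage dominates the whole pipeline
rebalanced = [0.4, 0.4, 0.8, 0.8]  # slow stage split in two: an extra set of registers

print(f"{max_freq_mhz(unbalanced):.0f} MHz")  # ~588 MHz, limited by the 1.6 ns stage
print(f"{max_freq_mhz(rebalanced):.0f} MHz")  # ~1111 MHz, paid for in flops and verification effort
```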

So for CPUs – competing for a huge market on serial performance and (stupidly) on advertised frequency, while implementing a comparatively stable instruction set – the effort to overcome these obstacles was justified. (Sometimes to the detriment of consumers, arguably, as, say, with the Pentium 4 – namely, high frequency but low serial performance due to too much pipelining.)

Accelerators are different. You can to some extent compensate for poor serial performance by throwing money at the problem - add more cores. Sometimes you don't care about extra performance – if you can decode video at the peak required rate and resolution, extra performance might not win more business. Between frequency improvements and architecture improvements/implementing a huge new standard, the latter might be more worthwhile. And then the budgets are generally smaller, so you tend to design more conservatively.

So AFAIK this is why so many embedded accelerators had crummy frequencies when they started out (and they also had apologists explaining why it was a good thing). And that's why some of the accelerators caught up – basically it was never a technical limitation but an economic problem of where to spend effort, and changing circumstances caused effort to be invested into improving frequency. And that's why if you're making an accelerator core which is 3 times slower than the CPU in the same chip, my first guess is your design isn't stellar at this stage, though it might improve – if it ever has to.

P.S. I'll say it again – my perspective can be skewed; someone with different experience might point out some oversimplifications. Different process nodes and different implementation constraints mean that what's decisive in one's experience is of marginal importance in another's experience. So please do correct me if I'm wrong in your experience.

P.P.S. Theoretically, a design running at 1 GHz might be doing the exact same amount of work as a 2 GHz design – if the pipeline is 2x shorter and each stage in the 1 GHz thing does the work of 2 stages in the 2 GHz thing. In practice, the 1 GHz design will have stages doing less work, so they complete in less than 1 nanosecond (1/1 GHz) and are idle during much of the cycle. And this is why you want to invest some effort to up the frequency in that design – to not have mostly-idle circuitry leaking power and using up area. But the theoretically possible, perfectly balanced 1 GHz design is a valid counter-argument to all of the above; I just don't think that's what hides behind most crummy frequencies.
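
To put a number on that slack (with a made-up stage delay): if a stage's logic needs, say, 0.6 ns but the clock hands it a full nanosecond, the circuit sits there idle – but still leaking – for the rest of every cycle:

```python
period_ns = 1.0       # 1 GHz clock
logic_delay_ns = 0.6  # what the stage's logic actually needs (made up)

idle_fraction = 1.0 - logic_delay_ns / period_ns
print(f"idle (and leaking) for {idle_fraction:.0%} of every cycle")  # 40%
```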

Update: here's an interesting complication – Norman Yarvin's comment points to an article about near-threshold voltage research by Intel, from which it turns out that a Pentium implementation designed to operate at near-threshold voltage (at a near-2x cost in area) achieves its best energy efficiency at 100 MHz – 10x slower than its peak frequency, while drawing 47x less power. The trouble is, if you want that 10x performance back, you'd need 10 such cores for an overall area increase of 20x, in return for overall energy savings of 4.7x. Other points on the graph will be less extreme (less area spent, less energy saved.)
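
Spelling out the arithmetic behind that last sentence (the 2x, 10x and 47x figures are the ones quoted above; the rest follows from them):

```python
# Matching the full-speed core's throughput with near-threshold (NTV) cores.
ntv_area_per_core = 2.0  # NTV-capable implementation costs ~2x the area
ntv_slowdown = 10.0      # best-efficiency point runs 10x below the peak frequency
ntv_power_saving = 47.0  # ...while drawing 47x less power than at peak

cores = ntv_slowdown                       # 10 slow cores to match one fast core
area_increase = cores * ntv_area_per_core  # 20x the area
energy_saving = ntv_power_saving / cores   # the 10 cores' power adds up: 47/10 = 4.7x
print(f"{area_increase:.0f}x area, {energy_saving:.1f}x energy savings")
```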

So this makes sense when silicon area is tremendously cheaper than energy, or when there's a hard limit on how much energy you can spend but a much laxer limit on area. This is not the case most of the time, AFAIK (silicon costs a lot and then it simply takes physical space, which also costs), but it can be the case some of the time. NTV can also make sense if the voltage is adjusted dynamically based on workload: if you don't need high performance most of the time, then the 2x area cost of achieving peak performance matters less than being able to conserve energy tremendously when you don't need that performance.

Anyway, it goes to show that it's more complicated than I stated, even if I'm right for the average design made under today's typical constraints.

1. Norman Yarvin, Jan 31, 2016

To get the circuit to work at a higher frequency, you often have to increase the voltage. That's where the increased switching losses come from; for those, power goes as voltage squared. Increasing the voltage also increases leakage losses, but I'm not sure how those scale.

Many CPUs these days do actually change their voltage as they change their frequency, and for exactly this reason. Transmeta, I believe, pioneered this; although they're defunct, others have picked it up.

2. Yossi Kreinin, Jan 31, 2016

I guess you mean that's where some of the super-linear (so cost-inefficient) increased switching losses come from (if you're increasing frequency and keeping the voltage, switching costs per unit of time also increase, but they increase proportionately to the amount of work done per unit of time so it's neutral efficiency-wise.)

And still – (1) at what frequency does it typically become necessary to increase the voltage, and (2) how much less cost-efficient is the circuit because of being able to reach a higher frequency at a higher voltage? AFAIK the answer to (1) is "pretty high" and the answer to (2) is "not much." Even when the answer to (1) is "pretty low", it means that you could beneficially make your circuit work at a higher frequency for those times when it's needed without losing much cost efficiency, and you chose not to do it because there wasn't much to gain by speeding up those rare/non-existent bursts of extraordinarily intensive, urgent work. So my main point would remain, namely, if your peak supported frequency is pretty low, it's not because supporting a higher peak frequency would result in a worse design, but because it was uneconomical given your schedule, development budget and use case. If design effort were free and everything else were kept constant, you'd probably do it.

But it is interesting how with all that said, essentially to the extent that you can lower power dissipation by lowering the frequency and voltage, you're trading silicon area for power and these are two pretty different variables (they're costs paid at different times and circumstances.) So I wonder how pronounced this effect is if you plot it – how low can you go frequency-wise and still gain something (I never experimented very much with it for various reasons – I probably would if I were in the cellphone processor market, for instance.)

3. Yossi Kreinin, Jan 31, 2016

One more thing is, if you're feeding off a battery and/or have trouble dissipating heat, it's beneficial to lower your frequency as much as you can lower it without the throughput falling below the threshold of acceptability – even if you can't also lower the voltage. That way, you get linear gains in switching power instead of super-linear, but in absolute terms, battery life is up and heat is down. This wouldn't be so if processors were powered down every time they finish the current bulk of work, but they aren't – in practice, waiting for the user involves a lot of non-productive switching activity and you save energy by doing this stuff slower.

The upshot is that we should see some processors in the field lowering their frequency to a much lower level than they would if all they pursued was a lower voltage.

4. Dan Luu, Jan 31, 2016

> So AFAIK this is why so many embedded accelerators had crummy frequencies when they started out (and they also had apologists explaining why it was a good thing). And that's why some of the accelerators caught up – basically it was never a technical limitation but an economic problem of where to spend effort, and changing circumstances caused effort to be invested into improving frequency.

This also matches my experience with non-embedded accelerators. If you're looking at (just for example) a 100x speedup, it's not so bad to target a less aggressive clock rate and take a 50x speedup with v1, which sharply reduces risk and eases schedule pressure. If that works out, then pull out all the stops for v2 or even v3.

5. Yossi Kreinin, Jan 31, 2016

Yeah – maybe I should have said plainly that accelerators accelerate, even if it's 50x instead of 100x; that's kinda what I meant by my vague "other architectural improvements." That's why it makes sense to leave that last, hard 2x for the next time.

6. Norman Yarvin, Jan 31, 2016

Yes, "super-linear" was what I meant — or, well, I took it for granted that the question was switching losses per amount of work done, in which case it's a simple increase. As for hard numbers, I didn't have any in my head, but a search finds this report on some explorations that Intel did where they were able to make a Pentium that could run at as little as 2 milliwatts (though only at 3 MHz; the optimum was at more like 17 milliwatts and 100 MHz):

http://www.realworldtech.com/near-threshold-voltage/

7. Yossi Kreinin, Feb 1, 2016

Interesting! I updated the article. (I hope I got it right; I find it's really easy to be stupid about the simple things – forget a 2x here or a 10x there...)

8. Johan Ouwerkerk, Feb 6, 2016

There's also the fact that a lot of this hardware tends to start out as a simple 'slave' device to a master CPU. So the bottleneck is going to be I/O between the two "domains", and a 'naive' version of your faster accelerator mostly burns these extra cycles waiting for I/O to complete.

Also, there's the fact that powering things down to lower clock speed/sleep mode and back up is not a free lunch either. So your higher clock speeds must be so much higher that this overhead in current draw and time is compensated for by the correspondingly greater time spent in low(er) power mode(s).

9. Yossi Kreinin, Feb 6, 2016

Both of these are true to some extent, though the SoCs of the last decade have far fewer communication overheads than say the CPU/GPU desktop setup which is always mentioned in these cases, and powering up/down probably doesn't take much more than ~1ms (but then of course some state might be destroyed by it that needs reinitialization, and there might be other costs.)

10. Alex Orange, Oct 27, 2017

You seem to be confusing the best rate to run a given circuit at with the most efficient circuit. If you want to get from point A to point B with a car and your choices are a Honda Civic or a McLaren F1, the Civic is certainly going to get you there with less gas, but it can't go as fast as the F1. The Civic will have an optimal speed, and, like your argument about circuits, up to a certain point higher speed will give you higher efficiency. The F1's maximum-efficiency speed will likely be higher than the Civic's, but its efficiency will almost certainly be lower due to it having a much larger engine than the Civic.

Similarly with circuits, a simple ripple carry adder is going to be excruciatingly slow, but also likely the lowest energy per add. A Kogge-Stone adder is going to be several times faster but will take up something like 5-6x the area and 5-6x the energy per operation. This is all talking about the architecture of the circuit (where to use an AND/NOR/NOT/XOR/etc gate). If you change the circuit type to something like dynamic gates you can speed up some more, but again at the cost of more energy. Almost universally, anything that you do in a given process to speed up an operation will burn more energy unless the original circuit was horribly designed (which they aren't).

By horribly designed I mean absolute mistakes like not using minimum-length gates or building very area-inefficient gates. The differences between what's going on inside a CPU and a GPU, other than process, are going to be architecture and circuit type, not layout. Likely both are going to use custom layouts. The reason GPUs are "slower" is that their computations are MUCH more parallel than a CPU's. Therefore they measure their performance in total GFLOPs, whereas a CPU measures its performance in serial GFLOPs, or more often serial IOPs. CPU arithmetic circuits are therefore larger even taking speed into account, whereas GPUs are tuned to fit as many operations/second as possible into a given piece of area.

So, in conclusion, your statement of "I've often read arguments that computing circuitry running at a high frequency is inefficient, power-wise or silicon area-wise or both." would be better phrased as "...computing circuitry ***capable of*** running at a high frequency..." In which case the statement that such circuits are power and area inefficient is absolutely true.

11. Alex Orange, Oct 27, 2017

P.S. By IOPs I meant integer operations/second. Just realized IOPs is I/O not integer ops/second.


