The bright side of dark silicon

June 4th, 2013

It's been a decade or so since the end of frequency scaling, and multicore has become ubiquitous, there being no other means to increase a chip's performance.

Some multicore systems are symmetric – all cores are identical, so you can easily move work from one core to another. Others are asymmetric – as in CPU cores and GPU cores, where it's harder to move work between different types of cores.

Which is better – symmetric or asymmetric multicore?

Why symmetric is better

Three main reasons that I see:

Better load balancing

Asymmetric multicore makes load balancing harder, because a GPU can't easily yank a job from a queue shared with a CPU and run that job. That's because some of those jobs are simply impossible to run on a GPU. Others run so badly that it's not worth the trouble.
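For contrast, here's what the symmetric case buys you: any free core grabs the next job from one shared queue, with no per-core-type tuning. A minimal sketch, with Python threads standing in for cores and squaring a number standing in for real work:

```
# Symmetric load balancing: identical workers pull jobs from one shared
# queue, so whichever core frees up first takes the next job.
import queue
import threading

jobs = queue.Queue()
for i in range(100):
    jobs.put(i)

def worker():
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return          # queue drained, this worker is done
        _ = job * job       # stand-in for the real work

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```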

And those CPU codes that could run OK on GPUs would have to be compiled twice – for the CPU and the GPU – and even then you can't make things like function pointers and vtables work (though I can imagine a hardware workaround for the latter – a translation table of sorts; maybe I should patent it. Anyway, we're very far from that being our biggest problem.)

And then you need a shared queue between the CPU and the GPU – how does that work? – or you partition the work statically (each of the 4 CPUs processes 10% of the pixels, the remaining 60% of the pixels go to the GPU cores).

But static partitioning, often quite lousy even with symmetric multicore, is awful with asymmetric multicore because how do you choose the percentages? You need to know the relative strength of the cores at each task. How do you do that – figure it out dynamically the first time your program runs on a new device?
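If you did want to try, the calibration would look something like this – a sketch where assumed throughputs stand in for real measurements, picked to reproduce the 10%-per-CPU / 60%-GPU split from the example above:

```
# A hypothetical calibration pass for static partitioning: measure each
# core type's throughput on a small sample of the work, then split the
# pixels in proportion. The rates below are assumed, not measured.
cpu_rate = 50e6    # pixels/second per CPU core (assumed measurement)
gpu_rate = 300e6   # pixels/second for the GPU (assumed measurement)
n_cpus = 4

total_rate = n_cpus * cpu_rate + gpu_rate
cpu_share = cpu_rate / total_rate
gpu_share = gpu_rate / total_rate
print(f"each CPU: {cpu_share:.0%} of pixels, GPU: {gpu_share:.0%}")
# -> each CPU: 10% of pixels, GPU: 60% - valid only for this task on
#    this device, which is exactly the problem.
```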

So this is all close to insane. What people actually do instead is task parallelism – they look at their different jobs, and they figure out which should run on each type of core, and optimize each task for the respective core.

But task parallelism never load-balances very well. Let's say you look for faces in an image on the GPU and then try to figure out whose faces these are on the CPUs. Then sometimes the GPU finds a lot of faces and sometimes just a few, taking roughly the same time to do so. But the CPU then has either a lot of work or just a little. So one of them will tend to be the bottleneck.
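A toy model of that pipeline shows the effect; all the per-frame numbers below are made up:

```
# The GPU detects faces in roughly constant time; the CPUs'
# identification time grows with the number of faces found.
gpu_detect_ms = 10.0
cpu_identify_ms_per_face = 4.0

for faces in (1, 3, 10):
    cpu_ms = faces * cpu_identify_ms_per_face
    frame_ms = max(gpu_detect_ms, cpu_ms)  # slower stage sets the pace
    print(f"{faces:2d} faces: frame every {frame_ms:5.1f} ms, "
          f"GPU busy {gpu_detect_ms / frame_ms:4.0%}, "
          f"CPU busy {cpu_ms / frame_ms:4.0%}")
# 1 face:   the GPU is the bottleneck, the CPUs sit 60% idle.
# 10 faces: the CPUs are the bottleneck, the GPU sits 75% idle.
```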

Less work for everyone

We actually touched on that above. If you wanted to do data parallelism, running the same task on all your cores but on different subsets of the data, one problem would be to optimize your code for each type of core. That's more work. Someone at the OS/system level would also need to help you with sharing task queues and vtables – still more work.

Generally, more types of core means more hardware design, more compilers, assemblers, linkers and debuggers, more manuals, and more integration work from bus protocols to program loaders, etc. etc. And, for programmers, not only more optimization work but more portability problems.

More redundancy

That's a bit futuristic, but I actually heard this argument from respectable people. The idea is, chip manufacturing yields will significantly drop at, say, 8nm processes. And then your chance to get a chip without a microscopic defect somewhere will become so low that throwing away every defective chip will be uneconomical.

Well, with symmetric multicore you don't have to throw away the chip. If the testing equipment identifies the core that is no longer usable and marks the chip accordingly using fuses or some such (which is easy to do), an OS can then run jobs on all cores but the bad one.
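A sketch of what that could look like from the OS's side – the fuse readout format here is entirely made up, but pinning a process to a set of cores is real (os.sched_setaffinity on Linux):

```
# Scheduling around a fused-off core, assuming a hypothetical fuse
# register encoding a bitmask of usable cores.
import os

FUSE_MASK = 0b1101  # hypothetical fuse readout: core 1 failed the test

usable = {i for i in range(4) if FUSE_MASK & (1 << i)}
usable &= os.sched_getaffinity(0)  # stay within cores we actually have
print("scheduling on cores:", usable)
os.sched_setaffinity(0, usable)    # this process now avoids the bad core
```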

Nifty, isn't it?

With asymmetric multicore, you can't do that, because some type of work will have no core on which it can run.

Why asymmetric is inevitable

In two words – dark silicon.

"Dark silicon" is a buzzword used to describe the growing gap between how many transistors you can cram into a chip with each advancement in lithography vs how many transistors you can actually use simultaneously given your power budget – the gap between area gains and power gains.

It's been a couple of years since the "dark silicon" paper which predicted "the end of multicore scaling" – a sad follow-up to the end of frequency scaling.

The idea is, you can have 2x more cores with each lithography shrink, but your energy efficiency grows only by a factor of the square root of 2. So 4 shrinks mean 16x more cores – but within a fixed power budget, you can only actually use 4x. So progress slows down, so to speak. These numbers aren't very precise – you have to know your specific process to make a budget for your chip – but they're actually not bad as a tool to think about this.
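The back-of-the-envelope version, in code:

```
# Per shrink: 2x the cores, but only sqrt(2) better energy efficiency,
# under a fixed power budget.
from math import sqrt

cores, power_per_core = 1.0, 1.0
for shrink in range(1, 5):
    cores *= 2
    power_per_core /= sqrt(2)
    usable = 1.0 / power_per_core  # cores we can afford to light up
    print(f"shrink {shrink}: {cores:4.0f}x cores, "
          f"{usable:4.1f}x usable, {usable / cores:4.0%} lit")
# shrink 4: 16x cores, 4.0x usable, 25% lit - the rest is dark silicon.
```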

With 16x more area but just 4x more power, can anything be done to avoid having that other 4x untapped?

It appears that the only route is specialization – spend a large fraction of the area on specialized cores which are much faster at some useful tasks than the other cores you have.

Can you then use them all in parallel? No – symmetric or asymmetric, keeping all cores busy is outside your power budget.

But, if much of the runtime is spent running code on specialized cores doing the job N times faster than the next best core, then you'll have regained much of your 4x – or even gained more than 4x.

Gaining more than 4x has always been possible with specialized cores, of course; dark silicon is just a compelling reason to do it, because it robs you of the much easier alternative.
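The arithmetic behind "regaining your 4x" is Amdahl's law with the accelerated fraction as the parameter; the f and N values below are illustrative, not from any real chip:

```
# If a fraction f of the runtime runs n times faster on a specialized
# core, overall speedup is 1 / ((1 - f) + f / n).
def speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

print(f"{speedup(0.8, 10):.1f}x")  # 80% of runtime made 10x faster -> 3.6x
print(f"{speedup(0.9, 20):.1f}x")  # 90% made 20x faster -> 6.9x, beating
                                   # the 4x that mere shrinking affords
```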

What about load balancing? Oh, aren't we "lucky"! It's OK that things don't load-balance very well on these asymmetric systems – because if they did, all cores would be busy all the time. And we can't afford that – we must keep some of the silicon "dark" (not working) anyway!

And what about redundancy? I dunno – if the yield problem materializes, the increasingly asymmetric designs of today are in trouble. Or are they? If you have 4 CPUs and 4 GPU clusters and one core goes bad, you lose 25% of that core type's performance – worse than losing one of 12 identical CPUs, which costs you only about 8%; but then the asymmetric system outperforms the symmetric one by more than 25%, or so we hope.

So the bright side of dark silicon is that it forces us to develop new core architectures – because to fully reap the benefits of lithography shrinks, we can't just cram more of the same cores into a same-sized chip. Which, BTW, has been getting boring, boring, boring for a long time. CPU architecture has stabilized to a rather great extent; accelerator architecture, not nearly so.

GPUs are the tip of the iceberg, really – the most widely known and easily accessible accelerator, but there are loads of them coming in endless shapes and colors. And as time goes by and as long as transistors keep shrinking but their power efficiency lags behind, we'll need more and more kinds of accelerators.

(I have a lot of fun working on accelerator architecture, in part due to the above-mentioned factors, and I can't help wondering why it appears to be a rather marginal part of "computer architecture" which largely focuses on CPUs; I think it has to do with CPUs being a much better topic for quantitative research, but that's a subject for a separate discussion.)

And this is why the CPU will likely occupy an increasingly small share of the chip area, continuing the trend that you can see in chip photos from ChipWorks et al.

P.S.

I work on switching-limited chip designs: most of the energy is spent on switching transistors. So you don't have to power down the cores between tasks – you can keep them in an idle state and they'll consume almost no energy, because there's no switching – zeros stay zeros, and ones stay ones.

Chips which run at higher frequencies and which are not designed to operate at high temperatures (where leakage would become intolerably high – leakage grows non-linearly with temperature) are often leakage-limited. This means that you must actually power down a core, or else it keeps using much of the energy it uses when doing work.

Sometimes powering down is natural, as in standby mode. Powering down midway through realtime processing is harder though, because it takes time to power things down and then to power them back up and reinitialize their pesky little bits such as cache line tags, etc.

So in a leakage-limited design, asymmetric multicore is at some point no better than symmetric multicore – if the gaps between your tasks are sufficiently short, you can't power down anything, and then your silicon is never dark, so either you make smaller chips or programs burn them.

But powering up and down isn't that slow, so a lot of workloads should be far from this sad point.
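A rough break-even model for that decision – all numbers assumed:

```
# Power gating pays off only when the idle gap is long enough to
# amortize the cost of powering down and back up (state save, cache
# reinit, etc.).
leak_w = 0.2            # W leaked by an idle-but-powered core (assumed)
gate_overhead_j = 5e-5  # J per power-down/power-up cycle (assumed)

break_even_s = gate_overhead_j / leak_w  # 0.25 ms for these numbers
for gap_ms in (0.1, 1.0, 10.0):
    action = "gate it" if gap_ms / 1e3 > break_even_s else "keep it on"
    print(f"{gap_ms:5.1f} ms idle gap: {action}")
```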

P.P.S.

I know about GreenDroid, a project by people who make the "dark silicon leads to specialization" argument quite eloquently; I don't think their specialization is the right kind – I think cores should be programmable – but that again is a subject for a separate discussion.

P.P.P.S.

Of course there's one thing you can always do with extra area which is conceptually much easier than adding new types of cores – namely, add more memory, typically L2/L3 cache. Memory is a perfect fit for the dark silicon age, because it essentially is dark silicon – its switching energy consumption is roughly proportional to the number of bytes you access per cycle but is largely independent of the number of bytes you keep in there. And as to leakage, it's easier to minimize for memories than most other kinds of things.

Another "lucky" coincidence is that you really need caches these days because external DRAM response latency has been 100 ns for a long time while processor clocks tend to 50-200x shorter, so missing all the caches really hurts.

So it's natural to expect memories to grow first and then the accelerator zoo; again consistent with recent chip photos where, say, ARM's caches are considerably bigger than the ARM cores themselves.

(Itanium famously spent some 85% of the chip area on caches, but that was more of "cheating" – a way to show off performance relative to x86 when in fact the advantage wasn't there – than anything else; at least that's how Bob Colwell quoted his conversation with Andy Grove. These days, however, it has become one of the few ways to actually use the extra area.)