Will OpenCL help displace GPGPU? Parallella, P2012, ...

October 13th, 2012

OpenCL is usually perceived as a C dialect for GPGPU programming – doing "general-purpose" computations (not necessarily graphics) on GPU hardware. "It's like Nvidia's CUDA C, but portable".

However, OpenCL the language is not really tied to the GPU architecture. That is, hardware could run OpenCL programs and have an architecture very different from a GPU, resulting in a very different performance profile.

OpenCL is possibly the first programming language promising portability across accelerators: "OpenCL is for accelerators what C is for CPUs". Portability is disruptive. When hardware vendor A displaces vendor B, portable software usually helps a great deal.

Will OpenCL – "the GPGPU language" – eventually help displace GPGPU, by facilitating "GP-something-else" – "general-purpose" accelerators which aren't like GPUs?

We'll discuss this question on general grounds, and consider two specific examples of recent OpenCL accelerators: Adapteva's Parallella and ST's P2012.

Why displace GPGPU?

First of all, whether GPGPU is likely to be displaced or not – what could "GP-something-else" possibly give us that GPGPU doesn't?

There are two directions from which benefits could come – you could call them two opposite directions:

Let software (ab)use more types of special-purpose accelerators. GPGPU lets you utilize (abuse?) your GPU for general-purpose stuff. It could be nice to have "GPDSP" to utilize the DSPs in your phone, "GPISP" to utilize the ISP, "GPCVP" to utilize computer vision accelerators likely to emerge in the future, etc. From GPGPU to GP-everything.
Give software accelerators which are more general-purpose to begin with. GPGPU means doing your general-purpose stuff under the constraints imposed by the GPU architecture. An OpenCL accelerator lifting some of these constraints could be very welcome.

Could OpenCL help us get benefits from any of the directions (1) and (2)?

(1) is about making use of anal-retentive, efficiency-obsessed, weird, incompatible hardware. It's rather hard, for OpenCL or for any other portable, reasonably "pretty" language.

OpenCL does provide constructs more or less directly mapping to some of the "ugly" features common to many accelerators, for example:

Explicitly addressed local memory (as opposed to cache)
DMA (bulk memory transfers)
Short vector data types to make use of SIMD opcodes
Light-weight threads and barriers

But even with GPUs, OpenCL can't target all of the GPU's resources. There's the subset of the GPU accessible to GPGPU programs – and then there are the more idiosyncratic and less flexible parts used for actual graphics processing.

With accelerators such as DSPs and ISPs, my guess is that today most of their value – acceleration ability – is in the idiosyncratic features that can't be made accessible to OpenCL programs. They could evolve, but it's a bit far-fetched and we won't dwell on it now. At their current state, OpenCL is too portable and too "pretty" to map to most accelerators.

What about direction (2)? (2) is about making something that's more efficient than CPUs, but as nice and flexible as possible, and more flexible than GPUs.

As a whole, (2) isn't easy, for various reasons we'll discuss. But if we look, in isolation, at OpenCL the language, then it looks like a great language for targeting "faster-than-CPUs-and-more-flexible-than-GPUs" kind of accelerator.

What could such an accelerator give us that GPUs don't?

One important feature is divergent flow. GPUs are SIMD or SIMT hardware; either way, they can't efficiently support something like this:

if(cond(i)) {
  out[i] = f(i);
}
else {
  out[i] = g(i);
}

What they'll end up doing is, essentially, compute f(i) and g(i) for all values of i, and then throw away some of the results. For deeply nested conditionals, the cost of wasted computations can make the entire exercise of porting to a GPU pointless.

We'll now have a look at two OpenCL-compatible accelerators which promise to efficiently support divergent threads – or outright independent threads doing something completely unrelated. We'll briefly compare them, and then discuss some of their common benefits as well as common obstacles to their adoption.

Adapteva's Parallella

Actually, the chip's name is Epiphany – Parallella is the recently publicized name of Adapteva's planned platform based on Epiphany; anyway.

Adapteva's architecture is a 2D grid of processors with a mesh interconnect. To scale, you can have a chip with more cores – or you can have a 2D grid of chips with some of the inter-core communication seamlessly crossing chip boundaries. Each of the (scalar) processors executes its own instruction stream – no "marching in lockstep", fittingly for divergent flow.

There are no caches; a memory address can map to your local memory, or the local memory of some other processor in the grid, or to external memory. Access latency will vary accordingly; access to local memories of close neighbors is quicker than access to far neighbors. All memory access can be done using either load/store instructions or DMA.

(Note that you can reach far neighbors – unlike some more "fundamentalist" proposals for "2D scalability" where you can only talk to immediate neighbors, period. I think that's over the top; if you want to run something other than the game of life, it's awfully handy to have long communication paths – as do most computers ranging from FPGAs to neurons, some of which have really long axons.)

Stats:

32K memory per core (unified – both instructions and data)
4 banks that can help avoid contentions between loads/stores, instruction fetching and DMA traffic
2-issue cores (one floating point operation and one integer or load/store operation)
800 MHz at 28nm using low-power process, as opposed to high speed (my bet that it's hard to top 800 MHz at 28nm LP – any evidence to the contrary?)
~25mW per core, 2W peak power for a full chip with 64 cores
0.128 mm^2 per core

Sources: BDTI's overview and various documentation from adapteva.com.

ST's Platform 2012

The P2012 architecture is also, at the top level, a grid of processors with a mesh interconnect. One stated motivation is the intra-die variability in future process nodes: some cores will come out slower than others, and some will be unusable.

It is thus claimed that a non-uniform architecture (like the ones we have today – CPUs and a boatload of different accelerators) will become a bad idea. If a core happens to come out badly, and it's not like your other cores, you have to throw away the entire chip. Whereas if cores are all alike, you leave the bad ones unused, and you may still have enough good ones to use the chip.

Interestingly, despite this stated motivation, the P2012 is less uniform and has higher granularity than Epiphany. Firstly, there's a provision for special-purpose accelerators in the grid. Secondly, the top-level mesh connects, not individual cores, but clusters of 16 rather tightly-coupled cores (each with its own flow of control, however – again, good for divergence).

Similarly to Epiphany, data is kept in explicitly addressed local memory rather than cache, and you can access data outside the cluster using load/store instructions or DMA, but you'll pay a price depending on the distance.

However, within a cluster, data access is uniform: the 16 cores share 256K of local data memory. This can be convenient for large working sets. Instructions are private to a core – but they're kept in a cache, not a local memory, conveniently for large programs.

Stats:

32K per core (16 cores with 16K I-cache per core and 32 8K data memory banks)
1MB L2 cache in a 4-cluster, "69-core" chip (presumably, (16+1)x4+1 – one extra core per cluster and per chip)
2-issue cores (I failed to find which instructions can be issued in parallel)
600 MHz at 28nm (process details unclear)
2W for the 69-core chip
0.217 mm^3 per core (3.7 mm^2 per (16+1)-core cluster), not accounting for L2 cache

Source: slides, slides, slides.

Parallela vs P2012: a quick comparison

Each of the architectures can have many different implementations and configurations. It seems fair to compare a 28nm 64-core Epiphany chip with a 28nm 69-core P2012 chip (or at least fair as far as these things go). Each system has its own incompatible native programming interface, but both can also be programmed in OpenCL.

Here's how Epiphany compares to P2012:

Power: 1x (2W)
Core issue width: 1x (2-issue)
Local memory size: 1x (32K per core)
Frequency: 1.33x (800/600)
Core area efficiency: 1.7x (0.217/0.128)

I think it's a fine achievement for Adapteva – a 5-people company (ST has about 50000 employees – of course not all of them work on the P2012, but still; Chuck Moore's GreenArrays is 18 people – and he's considered the ultimate minimalist, and develops a much more minimalistic product which, for instance, certainly won't run OpenCL programs).

This is not to say that these numbers are sufficient to compare the architectures. For starters, we assume that the power is the same, but we can't know without benchmarking. Energy consumption varies widely across programs – low power process brings leakage down to about zero at room temperature, so you're left with switching power which depends on what code you run, and on what data (multiplying zeros costs almost nothing compared to multiplying noise, for instance).

Then there are programming model differences, ranging from the extent of compliance of floating point to the IEEE standard to the rather different memory models. In the memory department, the ability of P2012 cores to access larger working sets should somewhat negate Epiphany's raw performance advantage on some workloads (though Epiphany cores might have lower latency when accessing their own banks). But then two different 2-issue cores will generally perform differently – you need thorough benchmarking to compare.

So what are these numbers good for? Just for a very rough, ballpark estimation of the cost of this type of core. That is, a core which is flexible enough to run its own instruction stream – but "low-end" enough to burden the programmer with local memory management, and lacking much of the other amenities of full-blown CPUs (speculative prefetching, out-of-order execution, etc.)

Our two examples both point to the same order of magnitude of performance. Let's look at a third system, KALRAY's MPPA – looking more like P2012 than Epiphany, with 16-core clusters and cores sharing memory.

At 28nm, 256 cores are reported to typically consume 5W at 400 MHz. (Adapteva and ST claim to give worst case numbers). That's 20mW per core compared to Epiphany's 25mW – but Epiphany runs at 2x the frequency. If we normalized for frequency, Epiphany comes out 1.6x more power-efficient – and that's when we compare it's worst case power to MPPA's typical power.

MPPA doesn't support OpenCL at the moment, and I found few details about the architecture; our quick glance is only good to show that these "low-end multicore" machines have the same order of magnitude of efficiency.

So will OpenCL displace GPGPU?

The accelerators of the kind we discussed above – and they're accelerators, not CPUs, because they're horrible at running large programs as opposed to hand-optimized kernels – these accelerators have some nice properties:

You get scalar threads which can diverge freely and efficiently – this is a lot of extra flexibility compared to SIMT or SIMD GPUs.
For GPGPU workloads that don't need divergence, these accelerators probably aren't much worse than GPUs. You lose some power efficiency because of reading the same instructions from many program memories, but it should be way less than a 2x loss, I'd guess.
And there's a programming model ready for these accelerators – OpenCL. They can be programmed in other C dialects, but OpenCL is a widespread, standard one that can be used, and it lets you use features like explicitly managed local memory and DMA in a portable way.

From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?

There are several roadblocks, however. First of all, there's no killer app for this type of thing – by definition. That is, for any killer app, almost certainly a much more efficient accelerator can be built for that domain. Generic OpenCL accelerators are good at accelerating the widest range of things, but they don't excel at accelerating anything.

There is, of course, at least one thriving platform which is, according to the common perception, "good at everything but excels at nothing" – FPGA. (I think it's more complicated than that but I'll leave it for another time.)

FPGAs are great for small to medium scale product delivery. The volume is too small to afford your own chip – but there may be too many things to accelerate which are too different from what an existing chip is good at accelerating. Flexible OpenCL accelerator chips could rival FPGAs here.

What about integrating these accelerators into high-volume chips such as application processors so they could compete with GPUs? Without a killer app, there's a real estate problem. At 100-150 mm^2, today's application processors are already rather large. And the new OpenGL accelerators aren't exactly small – they're bigger than any domain-specific accelerator.

Few chips are likely to include a large accelerator "just in case", without a killer app. Area is considered to be getting increasingly cheap. But we're far from the point where it's "virtually free", and the trend might not continue forever: GlobalFoundries' 14nm is a "low-shrink" node. Today, area is not free.

Of course, a new OpenCL accelerator does give some immediate benefit and so it isn't a purely speculative investment. That's because you could speed up existing OpenCL applications. But for existing code which is careful to avoid divergence, the accelerator would be somewhat less efficient than a GPU, and it wouldn't do graphics nearly as good as the GPU – so it'd be a rather speculative addition indeed.

What would make one add hardware for speculative reasons? A long life cycle. If you believe that your chip will have to accelerate important stuff many years after it's designed, then you'll doubt your ability to predict exactly what this stuff is going to be, and you'll want the most general-purpose accelerator.

Conversely, if you make new chips all the time, quickly sell a load of them, and then move on to market your next design, then you're less inclined to speculate. Anything that doesn't result in a visibly better product today is not worth the cost.

So generic OpenCL accelerators have a better shot at domains with long life cycles, which seem to be a minority. And then even when you found a vendor with a focus on the long term, you have the problem of performance portability.

Let's say platform vendor A does add the new accelerator to their chip. Awesome – except you probably also want to support vendor B's chips, which don't have such accelerators. And so efficient divergence is of no use to you, because it's not portable. Unless vendor A accounts for a very large share of the market – or if it's a dedicated device and you write a dedicated program and you don't care about portability.

OpenCL programs are portable, but their performance is not portable. For instance, if you use vector data types and the target platform doesn't have SIMD, the code will be compiled to scalar instructions, and so on.

What this means in practice is that one or several OpenCL subsets will emerge, containing features that people count on to be supported well. For instance, a relatively good scenario is, there's a subset that GPU programmers use on all GPUs. A worse scenario is, there's the desktop GPU subset and the mobile GPU subset. A still worse scenario is, there's the NVIDIA subset, the AMD subset, the Imagination subset, etc.

It's an evolving type of thing that's never codified anywhere but has more power than the actual standard.

Standards tend to materialize partially. For example, the C standard supports garbage collection, but real C implementations usually don't, and many real C programs will not run correctly on a standard-compliant, garbage-collecting implementation. Someone knowing C would predict this outcome, and would not trust the standard to change it.

So with efficient divergence, the question is, will this feature make it into a widely used OpenCL subset, even though it's not a part of any such subset today. If it doesn't, widespread hardware is unlikely to support it.

Personally, I like accelerators with efficient divergence. I'll say it again:

"From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?"

From an evolutionary standpoint though, it's quite the uphill battle. The CPU + GPU combination is considered "good enough" very widely. It's not impossible to grow in the shadow of a "good enough", established competitor. x86 was good enough and ARM got big, gcc was good enough and LLVM got big, etc.

It's just hard, especially if you can't replace the competitor anywhere and you aren't a must-have. A CPU is a must-have and ARM replaces x86 where it wins. A compiler is a must-have and LLVM replaces gcc where it wins. An OpenCL accelerator with efficient divergence – or any other kind, really – is not a must-have and it will replace neither the CPU nor the GPU. So it's quite a challenge to convince someone to spend on it.

Conclusion

I doubt that general-purpose OpenCL accelerators will displace GPGPU, even though it could be a nice outcome. The accelerators probably will find their uses though. The following properties seem favorable to them (all or a subset may be present in a given domain):

Small to medium scale, similarly to FPGAs
Long life cycles encouraging speculative investment in hardware
Device-specific software that can live without performance portability

In other words, there can be "life between the CPU and the GPU", though not necessarily in the highest volume devices.

Good luck to Adapteva with their Kickstarter project – a computer with 64 independent OpenCL cores for $99.