Will OpenCL help displace GPGPU? Parallella, P2012, …

OpenCL is usually perceived as a C dialect for GPGPU programming – doing "general-purpose" computations (not necessarily graphics) on GPU hardware. "It's like Nvidia's CUDA C, but portable".

However, OpenCL the language is not really tied to the GPU architecture. That is, hardware could run OpenCL programs and have an architecture very different from a GPU, resulting in a very different performance profile.

OpenCL is possibly the first programming language promising portability across accelerators: "OpenCL is for accelerators what C is for CPUs". Portability is disruptive. When hardware vendor A displaces vendor B, portable software usually helps a great deal.

Will OpenCL – "the GPGPU language" – eventually help displace GPGPU, by facilitating "GP-something-else" – "general-purpose" accelerators which aren't like GPUs?

We'll discuss this question on general grounds, and consider two specific examples of recent OpenCL accelerators: Adapteva's Parallella and ST's P2012.

Why displace GPGPU?

First of all, whether GPGPU is likely to be displaced or not – what could "GP-something-else" possibly give us that GPGPU doesn't?

There are two directions from which benefits could come – you could call them two opposite directions:

  1. Let software (ab)use more types of special-purpose accelerators. GPGPU lets you utilize (abuse?) your GPU for general-purpose stuff. It could be nice to have "GPDSP" to utilize the DSPs in your phone, "GPISP" to utilize the ISP, "GPCVP" to utilize computer vision accelerators likely to emerge in the future, etc. From GPGPU to GP-everything.
  2. Give software accelerators which are more general-purpose to begin with. GPGPU means doing your general-purpose stuff under the constraints imposed by the GPU architecture. An OpenCL accelerator lifting some of these constraints could be very welcome.

Could OpenCL help us get benefits from any of the directions (1) and (2)?

(1) is about making use of anal-retentive, efficiency-obsessed, weird, incompatible hardware. It's rather hard, for OpenCL or for any other portable, reasonably "pretty" language.

OpenCL does provide constructs more or less directly mapping to some of the "ugly" features common to many accelerators, for example:

  • Explicitly addressed local memory (as opposed to cache)
  • DMA (bulk memory transfers)
  • Short vector data types to make use of SIMD opcodes
  • Light-weight threads and barriers
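
As a sketch of how these constructs look in OpenCL C (the kernel below is illustrative only – the names are made up, and it assumes host-side setup that isn't shown):

```c
// OpenCL C device code; needs host-side buffers and work-group setup to run.
__kernel void sum4(__global const float4 *in,  // short vector type -> SIMD opcodes
                   __global float *out,
                   __local float *tile)        // explicitly addressed local memory
{
    size_t lid = get_local_id(0);

    // Each light-weight thread (work-item) reduces one float4 into local memory.
    float4 v = in[get_global_id(0)];
    tile[lid] = v.x + v.y + v.z + v.w;

    // Barrier: wait for the whole work-group before reading others' results.
    barrier(CLK_LOCAL_MEM_FENCE);

    if (lid == 0) {
        float sum = 0.0f;
        for (size_t i = 0; i < get_local_size(0); ++i)
            sum += tile[i];
        out[get_group_id(0)] = sum;
    }
}
```

(Bulk transfers map to async_work_group_copy, OpenCL's portable stand-in for DMA.)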

But even with GPUs, OpenCL can't target all of the GPU's resources. There's the subset of the GPU accessible to GPGPU programs – and then there are the more idiosyncratic and less flexible parts used for actual graphics processing.

With accelerators such as DSPs and ISPs, my guess is that today, most of their value – acceleration ability – is in the idiosyncratic features that can't be made accessible to OpenCL programs. They could evolve, but that's a bit far-fetched and we won't dwell on it now. As things stand, OpenCL is too portable and too "pretty" to map to most accelerators.

What about direction (2)? (2) is about making something that's more efficient than CPUs, but as nice and flexible as possible, and more flexible than GPUs.

As a whole, (2) isn't easy, for various reasons we'll discuss. But if we look, in isolation, at OpenCL the language, then it looks like a great language for targeting "faster-than-CPUs-and-more-flexible-than-GPUs" kind of accelerator.

What could such an accelerator give us that GPUs don't?

One important feature is divergent flow. GPUs are SIMD or SIMT hardware; either way, they can't efficiently support something like this:

if(cond(i)) {
  out[i] = f(i);
}
else {
  out[i] = g(i);
}

What they'll end up doing is, essentially, compute f(i) and g(i) for all values of i, and then throw away some of the results. For deeply nested conditionals, the cost of wasted computations can make the entire exercise of porting to a GPU pointless.
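
To make the cost concrete, here's a plain-C sketch of what SIMD/SIMT hardware effectively does with that branch – f, g and cond are made-up stand-ins:

```c
/* Made-up stand-ins for the f, g and cond in the snippet above. */
static int f(int i)    { return i * 2; }
static int g(int i)    { return i + 100; }
static int cond(int i) { return i % 2; }

/* What the GPU effectively does with the branch: evaluate BOTH sides
   for every i, then select one result per lane. The other computation
   is wasted work - and the waste compounds for nested conditionals. */
static void predicated_map(int *out, int n)
{
    for (int i = 0; i < n; ++i) {
        int then_val = f(i);                     /* computed for all lanes */
        int else_val = g(i);                     /* also computed for all lanes */
        out[i] = cond(i) ? then_val : else_val;  /* per-lane select, not a branch */
    }
}
```

A CPU – or an accelerator with truly divergent threads – would run only one of f(i) or g(i) per element; here every element pays for both.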

We'll now have a look at two OpenCL-compatible accelerators which promise to efficiently support divergent threads – or outright independent threads doing something completely unrelated. We'll briefly compare them, and then discuss some of their common benefits as well as common obstacles to their adoption.

Adapteva's Parallella

Actually, the chip's name is Epiphany – Parallella is the recently publicized name of Adapteva's planned platform based on Epiphany; anyway.

Adapteva's architecture is a 2D grid of processors with a mesh interconnect. To scale, you can have a chip with more cores – or you can have a 2D grid of chips with some of the inter-core communication seamlessly crossing chip boundaries. Each of the (scalar) processors executes its own instruction stream – no "marching in lockstep", fittingly for divergent flow.

There are no caches; a memory address can map to your local memory, or the local memory of some other processor in the grid, or to external memory. Access latency will vary accordingly; access to local memories of close neighbors is quicker than access to far neighbors. All memory access can be done using either load/store instructions or DMA.

(Note that you can reach far neighbors – unlike some more "fundamentalist" proposals for "2D scalability" where you can only talk to immediate neighbors, period. I think that's over the top; if you want to run something other than the game of life, it's awfully handy to have long communication paths – as do most computers ranging from FPGAs to neurons, some of which have really long axons.)

Stats:

  • 32K memory per core (unified – both instructions and data)
  • 4 banks that can help avoid contentions between loads/stores, instruction fetching and DMA traffic
  • 2-issue cores (one floating point operation and one integer or load/store operation)
  • 800 MHz at 28nm using a low-power process, as opposed to high speed (my bet is that it's hard to top 800 MHz at 28nm LP – any evidence to the contrary?)
  • ~25mW per core, 2W peak power for a full chip with 64 cores
  • 0.128 mm^2 per core

Sources: BDTI's overview and various documentation from adapteva.com.

ST's Platform 2012

The P2012 architecture is also, at the top level, a grid of processors with a mesh interconnect. One stated motivation is the intra-die variability in future process nodes: some cores will come out slower than others, and some will be unusable.

It is thus claimed that a non-uniform architecture (like the ones we have today – CPUs and a boatload of different accelerators) will become a bad idea. If a core happens to come out badly, and it's not like your other cores, you have to throw away the entire chip. Whereas if cores are all alike, you leave the bad ones unused, and you may still have enough good ones to use the chip.

Interestingly, despite this stated motivation, the P2012 is less uniform and has higher granularity than Epiphany. Firstly, there's a provision for special-purpose accelerators in the grid. Secondly, the top-level mesh connects, not individual cores, but clusters of 16 rather tightly-coupled cores (each with its own flow of control, however – again, good for divergence).

Similarly to Epiphany, data is kept in explicitly addressed local memory rather than cache, and you can access data outside the cluster using load/store instructions or DMA, but you'll pay a price depending on the distance.

However, within a cluster, data access is uniform: the 16 cores share 256K of local data memory. This can be convenient for large working sets. Instructions are private to a core – but they're kept in a cache, not a local memory, conveniently for large programs.

Stats:

  • 32K per core (16 cores with 16K I-cache per core and 32 8K data memory banks)
  • 1MB L2 cache in a 4-cluster, "69-core" chip (presumably, (16+1)x4+1 – one extra core per cluster and per chip)
  • 2-issue cores (I failed to find which instructions can be issued in parallel)
  • 600 MHz at 28nm (process details unclear)
  • 2W for the 69-core chip
  • 0.217 mm^2 per core (3.7 mm^2 per (16+1)-core cluster), not accounting for L2 cache

Source: slides, slides, slides.

Parallella vs. P2012: a quick comparison

Each of the architectures can have many different implementations and configurations. It seems fair to compare a 28nm 64-core Epiphany chip with a 28nm 69-core P2012 chip (or at least fair as far as these things go). Each system has its own incompatible native programming interface, but both can also be programmed in OpenCL.

Here's how Epiphany compares to P2012:

  • Power: 1x (2W)
  • Core issue width: 1x (2-issue)
  • Local memory size: 1x (32K per core)
  • Frequency: 1.33x (800/600)
  • Core area efficiency: 1.7x (0.217/0.128)

I think it's a fine achievement for Adapteva – a 5-person company (ST has about 50,000 employees – of course not all of them work on the P2012, but still; Chuck Moore's GreenArrays is 18 people – and he's considered the ultimate minimalist, and develops a much more minimalistic product which, for instance, certainly won't run OpenCL programs).

This is not to say that these numbers are sufficient to compare the architectures. For starters, we assume that the power is the same, but we can't know without benchmarking. Energy consumption varies widely across programs – low power process brings leakage down to about zero at room temperature, so you're left with switching power which depends on what code you run, and on what data (multiplying zeros costs almost nothing compared to multiplying noise, for instance).

Then there are programming model differences, ranging from the extent of compliance of floating point to the IEEE standard to the rather different memory models. In the memory department, the ability of P2012 cores to access larger working sets should somewhat negate Epiphany's raw performance advantage on some workloads (though Epiphany cores might have lower latency when accessing their own banks). But then two different 2-issue cores will generally perform differently – you need thorough benchmarking to compare.

So what are these numbers good for? Just for a very rough, ballpark estimation of the cost of this type of core. That is, a core which is flexible enough to run its own instruction stream – but "low-end" enough to burden the programmer with local memory management, and lacking much of the other amenities of full-blown CPUs (speculative prefetching, out-of-order execution, etc.)

Our two examples both point to the same order of magnitude of performance. Let's look at a third system, KALRAY's MPPA – looking more like P2012 than Epiphany, with 16-core clusters and cores sharing memory.

At 28nm, 256 cores are reported to typically consume 5W at 400 MHz (Adapteva and ST claim to give worst-case numbers). That's ~20mW per core compared to Epiphany's 25mW – but Epiphany runs at 2x the frequency. If we normalize for frequency, Epiphany comes out 1.6x more power-efficient – and that's comparing its worst-case power to MPPA's typical power.
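
The normalization, spelled out (using the 5W/256-core figure for MPPA and Adapteva's per-core figure):

```c
/* Frequency-normalized power efficiency: mW per core per MHz. */
static double mppa_vs_epiphany(void)
{
    double mppa_mw_per_mhz     = (5000.0 / 256.0) / 400.0; /* ~19.5mW/core at 400 MHz */
    double epiphany_mw_per_mhz = 25.0 / 800.0;             /* 25mW/core at 800 MHz */
    return mppa_mw_per_mhz / epiphany_mw_per_mhz;          /* ~1.6 */
}
```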

MPPA doesn't support OpenCL at the moment, and I found few details about the architecture; our quick glance is only good to show that these "low-end multicore" machines have the same order of magnitude of efficiency.

So will OpenCL displace GPGPU?

The accelerators of the kind we discussed above – and they're accelerators, not CPUs, because they're horrible at running large programs as opposed to hand-optimized kernels – these accelerators have some nice properties:

  • You get scalar threads which can diverge freely and efficiently – this is a lot of extra flexibility compared to SIMT or SIMD GPUs.
  • For GPGPU workloads that don't need divergence, these accelerators probably aren't much worse than GPUs. You lose some power efficiency because of reading the same instructions from many program memories, but it should be way less than a 2x loss, I'd guess.
  • And there's a programming model ready for these accelerators – OpenCL. They can be programmed in other C dialects, but OpenCL is a widespread, standard one that can be used, and it lets you use features like explicitly managed local memory and DMA in a portable way.

From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?

There are several roadblocks, however. First of all, there's no killer app for this type of thing – by definition. That is, for any killer app, almost certainly a much more efficient accelerator can be built for that domain. Generic OpenCL accelerators are good at accelerating the widest range of things, but they don't excel at accelerating anything.

There is, of course, at least one thriving platform which is, according to the common perception, "good at everything but excels at nothing" – FPGA. (I think it's more complicated than that but I'll leave it for another time.)

FPGAs are great for small to medium scale product delivery. The volume is too small to afford your own chip – but there may be too many things to accelerate which are too different from what an existing chip is good at accelerating. Flexible OpenCL accelerator chips could rival FPGAs here.

What about integrating these accelerators into high-volume chips such as application processors so they could compete with GPUs? Without a killer app, there's a real estate problem. At 100-150 mm^2, today's application processors are already rather large. And the new OpenCL accelerators aren't exactly small – they're bigger than any domain-specific accelerator.

Few chips are likely to include a large accelerator "just in case", without a killer app. Area is considered to be getting increasingly cheap. But we're far from the point where it's "virtually free", and the trend might not continue forever: GlobalFoundries' 14nm is a "low-shrink" node. Today, area is not free.

Of course, a new OpenCL accelerator does give some immediate benefit and so it isn't a purely speculative investment. That's because you could speed up existing OpenCL applications. But for existing code which is careful to avoid divergence, the accelerator would be somewhat less efficient than a GPU, and it wouldn't do graphics nearly as well as the GPU – so it'd be a rather speculative addition indeed.

What would make one add hardware for speculative reasons? A long life cycle. If you believe that your chip will have to accelerate important stuff many years after it's designed, then you'll doubt your ability to predict exactly what this stuff is going to be, and you'll want the most general-purpose accelerator.

Conversely, if you make new chips all the time, quickly sell a load of them, and then move on to market your next design, then you're less inclined to speculate. Anything that doesn't result in a visibly better product today is not worth the cost.

So generic OpenCL accelerators have a better shot in domains with long life cycles, which seem to be a minority. And even when you find a vendor with a focus on the long term, you have the problem of performance portability.

Let's say platform vendor A does add the new accelerator to their chip. Awesome – except you probably also want to support vendor B's chips, which don't have such accelerators. And so efficient divergence is of no use to you, because it's not portable. Unless vendor A accounts for a very large share of the market – or it's a dedicated device and you write a dedicated program and don't care about portability.

OpenCL programs are portable, but their performance is not portable. For instance, if you use vector data types and the target platform doesn't have SIMD, the code will be compiled to scalar instructions, and so on.
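
As a plain-C sketch of the semantics (float4 here is a hand-rolled stand-in for OpenCL's built-in type):

```c
typedef struct { float x, y, z, w; } float4;  /* stand-in for OpenCL's float4 */

/* On SIMD hardware a float4 add is one vector instruction; on a scalar
   target the same source lowers to four independent adds - identical
   results, but none of the hoped-for 4x speedup. */
static float4 vadd(float4 a, float4 b)
{
    float4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}
```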

What this means in practice is that one or several OpenCL subsets will emerge, containing features that people count on to be supported well. For instance, a relatively good scenario is, there's a subset that GPU programmers use on all GPUs. A worse scenario is, there's the desktop GPU subset and the mobile GPU subset. A still worse scenario is, there's the NVIDIA subset, the AMD subset, the Imagination subset, etc.

It's an evolving type of thing that's never codified anywhere but has more power than the actual standard.

Standards tend to materialize partially. For example, the C standard supports garbage collection, but real C implementations usually don't, and many real C programs will not run correctly on a standard-compliant, garbage-collecting implementation. Someone knowing C would predict this outcome, and would not trust the standard to change it.

So with efficient divergence, the question is whether this feature will make it into a widely used OpenCL subset, even though it's not a part of any such subset today. If it doesn't, widespread hardware is unlikely to support it.

Personally, I like accelerators with efficient divergence. I'll say it again:

"From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?"

From an evolutionary standpoint though, it's quite the uphill battle. The CPU + GPU combination is considered "good enough" very widely. It's not impossible to grow in the shadow of a "good enough", established competitor. x86 was good enough and ARM got big, gcc was good enough and LLVM got big, etc.

It's just hard, especially if you can't replace the competitor anywhere and you aren't a must-have. A CPU is a must-have and ARM replaces x86 where it wins. A compiler is a must-have and LLVM replaces gcc where it wins. An OpenCL accelerator with efficient divergence – or any other kind, really – is not a must-have and it will replace neither the CPU nor the GPU. So it's quite a challenge to convince someone to spend on it.

Conclusion

I doubt that general-purpose OpenCL accelerators will displace GPGPU, even though it could be a nice outcome. The accelerators probably will find their uses though. The following properties seem favorable to them (all or a subset may be present in a given domain):

  • Small to medium scale, similarly to FPGAs
  • Long life cycles encouraging speculative investment in hardware
  • Device-specific software that can live without performance portability

In other words, there can be "life between the CPU and the GPU", though not necessarily in the highest volume devices.

Good luck to Adapteva with their Kickstarter project – a computer with 16 independent OpenCL cores for $99 (and a 64-core version for $199).

Comments

#1 Ivan Tikhonov on 10.13.12 at 11:19 am

What about already available GA144 chip?

#2 Yossi Kreinin on 10.13.12 at 12:51 pm

Erm… I mentioned GreenArrays; let's say that programming is challenging and it's definitely not my cup of tea, but if you know how to put it to good use, you could.

BTW all the stuff I mentioned here is also available as physical chips (perhaps less easy to purchase at this point but definitely available physically already).

#3 David on 10.13.12 at 1:52 pm

Just one nitpick: It's usually better to use

out[i] = f[i] + cond[i]*(g[i]-f[i])

than the branching in your code snippet: You compute everything – just as in the "standard" pattern – and you have another multiply-and-add, but you have only one commit to memory. It's sometimes better even in nested conditions.

#4 Ivanassen on 10.13.12 at 2:54 pm

Bullets two and three apply to game consoles. Sony in particular have been crazy enough in the past to ship bold, weird architectures.

#5 yossi kreinin on 10.13.12 at 3:13 pm

@david: that's why i said f(i) – function, not array access… an if is better at some point

#6 Patrick Melody on 10.14.12 at 11:20 am

I disagree with the premise that "most of their value – acceleration ability – is in the idiosyncratic features that can't be made accessible to OpenCL programs."

The idiosyncratic features that aren't exported to OpenCL are the rasterizer (vertex interpolator) and framebuffer compositor (z-buffer tests, blending, a few other things). These are valuable and graphics-centric but account for a small amount of silicon. (The other specialized silicon that's part of the "GPU special sauce" are the filtered texture units, which *are* exported by OpenCL — they compute arbitrary expensive function approximations cheaply by table lookup.)

Roughly speaking, the rest of the chip is a lot of vector cores. The thing about these vector cores is we can have orders of magnitude more of them than CPUs do because we aren't spending any silicon on branch prediction, speculative execution, or out-of-order execution. Note also that GPU languages don't support pointers. Put that all together and the resulting restrictions allow for very high throughput for data-intensive calculations. (Also we have many more threads than cores, so threads blocked on memory latencies are swapped out for threads that have loaded data ready to go.)

The value GPUs now deliver is the massive throughput we get from massive programmable parallelism, which we get by throwing away CPU features that aren't actually necessary for the graphics-specific task of processing streams of data packets independently. What's interesting is that a surprising number of jobs can be cast into this format. (E.g. see the classic Google MapReduce paper for a few examples.) Even if this format isn't the most "efficient" way to structure the problem, the brute-force strength of massive parallelism available now in GPUs can make it a performance win.

#7 Tom Peterson on 10.14.12 at 12:34 pm

What role does the HSA Foundation play in this? Isn't HSA supposed to sort of replace OpenCL eventually?

#8 yossi kreinin on 10.15.12 at 11:13 am

@patrick: i meant other accelerators, not gpus

#9 charles griffiths on 10.16.12 at 1:30 am

The "64 core" version of Parallella is supposed to cost $199, the $99 Parallella is based on the 16 core chip. In either case, the board will have a dual core ARM cpu making the total cores 18 (for $99) or 66 (for $199).

#10 yossi kreinin on 10.16.12 at 7:14 am

@charles: oops – i'll correct it when i get a better connection…

#11 Bob H on 10.20.12 at 4:28 am

I've long wondered if someone could use the ZMS-08 from ZiiLabs as a GPGPU. OK, it is targeted at mobile devices, it isn't 64-bit and doesn't have a fancy interconnect, but it does have a programmable interface bus. Also it has 4x 12-bit video IOs which could be used as general purpose interfaces.

Probably a case of trying to make a square peg fit in a round hole, but they are substantially cheaper than $99.

#12 Florian on 10.20.12 at 4:46 am

It's worth noting a point of GPU limitation you've not illustrated clearly. GPUs operate on a bus and a tiered cache architecture. They are highly optimized to work on sequential streams of data (vertices) and output to a fixed result buffer more or less sequential (backbuffer). GPUs are very bad at random access data crunching and random write data storage. That's the main reason GPUs are very bad for instance at n-body simulations and raytracing. Mid-end GPUs are so bad at these tasks that you can routinely write algorithms in interpreted languages that are hundreds of times slower than C and still outperform those GPUs on the CPU.

What's so promising about the new chips is that they don't look like the sequential stream processors that GPUs are. They look like they could perform much better at random read/write than GPUs, which matters a lot.

#13 Yossi Kreinin on 10.20.12 at 8:01 am

@Florian: CUDA devices are fine at rather random access – for example, parallel random access to data small enough to fit in, and explicitly allocated at, what Nvidia calls "shared memory". You get problems if you want random access to swaths of data (though I think highest-end devices come with caches for that, so they'd be tied with CPUs at some point perhaps).

In this sense, the accelerators that I mentioned are not necessarily better. You get fine random access when you hit the per-core/per-cluster explicitly managed local memory, worse latency for nearby local memories, and no data cache at all to speed up access to DRAM.

So I didn't mention this difference because I'm not sure about its significance (perhaps there's some advantage in being able to access neighbors; but for "sufficiently random access", you still need to manually plan all your accesses and in this sense a CPU is much better.)

#14 Ofer on 10.25.12 at 12:38 pm

I am not sure about your conclusion: "Long life cycles encouraging speculative investment in hardware".

If you mean generic HW, you are probably right, but if you already have the generic HW (the GPGPU) and you want to replace it, taking a speculative investment is actually worse with a long life cycle, since it will take you more time to fix if you failed. Isn't it?

#15 Ofer on 10.25.12 at 1:13 pm

Actually, we did see an overthrowing of the leading GP accelerator for image processing – the GPU took over from the DSP.

The GPU was not planned to overthrow the DSP, but rather to do better graphics processing. Generalization was a tool to reach a better graphics processor, not the goal.

The GPU did not find its way into the chip to become a general-purpose accelerator but rather to perform a certain function.

Since it was already there and had a certain level of generalization – people started using it. This led to improving the generalization for other applications.

My guess is that this is the way GPGPU will be overthrown – HW more suitable for image processing that already has a footprint in processors will become generic as part of its evolution.

#16 Yossi Kreinin on 10.26.12 at 10:18 am

If you are the Ofer that I think you are, we better discuss this elsewhere for a variety of reasons :) One thing though – DSP had a footprint in APs for a very long time, and it was way more generic/programmable than GPUs for a very long time. The fact that GPUs are now more widely targeted means that it's not just a question of footprint but what your evolution leads to and this has to do with your base architecture. Different architectures evolve in different directions; a GPU will not evolve into a DSP and vice versa. So something with a footprint will evolve, but it can only evolve into a certain range of things, and this range determines the range of applications where it can find practical uses.

#17 Uri W on 12.05.12 at 2:21 am

Thanks Yossi – the most interesting reading I had lately.

#18 Yossi Kreinin on 12.05.12 at 2:29 am

@Uri W: you're shitting me.

Check out Parallella's here. These people sure can execute.

#19 A on 12.10.12 at 10:32 am

Nitpick: "Someone knowing C would predict this outcome, and would not trust the standard to change it." — I'm sure you mean
s/not trust the standard/trust the standard not/

#20 Yossi Kreinin on 12.11.12 at 12:31 am

I think I meant what I wrote, actually. "not trust that the standard will change it", rather than "trust that the standard will not change it".

#21 Mike Chambers on 02.17.13 at 12:13 pm

Very interesting write-up, Yossi.

I was looking at GPU acceleration of petabyte-scale seismic data processing about 5 years ago. The incumbent approach at the time was to use several thousand dual-die dual-core blades at 2.2GHz managed via MPI.

While this is the kind of problem that multiple GPUs should handle well, even the later introduction of CUDA made it difficult for the programmer to achieve the rated execution capacity of the devices.

I see great promise in Epiphany and I suspect that Parallella (particularly the 64 core — or 4×16 as I prefer to look at it) will open the doors to a wide spectrum of applications that the power, heat, and programming model complexity of GPUs make impractical today.

I suspect — as has been the case many times — that one technology will not displace the other; rather, each will find its use within the domain of its intrinsic strength. Many computationally intensive applications are also graphically intensive. Consider for example the challenges involved in a complex SCADA master control station with forward-looking simulation. Real-time visualization of several hundred or thousand RTU inputs is enough to tax a GPU. Even though the data is small, it comes from a wide range of sources and it is continually changing; moreover, processing the delta data in real time and generating simulation scenarios does not lend itself to a highly linear instruction stream. The ability to "asynchronize" (or temporally decouple) simulation computation from visualization makes for a very robust architecture.

Likewise I see great promise in the potential for real-time optimization of diagnostic imaging in which tightly architected parallel pattern recognition can detect anomalies and change imaging granularity in response.

In work more familiar to many (h.264 encoding), both the challenges of existing MP/MC architectures and the limited practical success in applied GPU acceleration underscore the need for a simpler parallel processing coding model with a flat memory model. That need would seem one that Epiphany may be suited to satisfy.

#22 Yossi Kreinin on 02.18.13 at 4:34 am

Epiphany's memory model is not flat – in fact it's somewhat less flat than, say, CUDA's. It's nominally flat because you can use pointers to access anything using the same code, but performance gets worse the farther the memory actually is. In CUDA you have multiple execution cores working with many "equal" banks for close ("shared") memory. Epiphany's model is only flat if you don't care about performance – in which case so is CUDA's (just put everything in DRAM), and which isn't a sensible use case.

The flexibility advantage of Epiphany over CUDA or other GPU programming model is efficient divergence, not the memory model, at least as I see it; the memory model is different and you can argue for or against it, but none of the models is a straightforward flat thing.

#23 Kelvin N. on 07.23.13 at 11:07 am

Consumer Robotics:
Multiple kinds of computation on single purpose hardware √
Highly parallel √
Lots of inputs/outputs √
Divergent flow √
Benefits greatly from power efficiency √
No dominant hardware implementation, benefits from portability √
Mainstream horizon: 10-15 years √

#24 Craig Mustard on 08.21.13 at 11:05 am

Great writeup. My group is researching how to improve programmability on systems such as P2012, Adapteva, etc. We're calling systems that don't have cache and instead use software with DMA transfers to move data around 'explicitly managed systems'. Check out my site for a paper on our work.

You may be interested to know about Intel's Runnemede, a research project along the same lines as Epiphany and P2012 – http://iacoma.cs.uiuc.edu/iacoma-papers/hpca13_1.pdf

I agree that OpenCL can significantly help with programming for accelerators. From my point of view it helps the programmer spawn lots of kernel as well as explicitly declare the global/local accesses of each kernel. However, there is definitely room for improvement. The boilerplate code required to use OpenCL is ridiculously high and a serious impediment to productivity. There's a lot of room to improve the OpenCL API on the host-side.

#25 Yossi Kreinin on 08.21.13 at 11:10 am

I'll definitely have a look, thanks!

#26 hmijail on 08.07.15 at 1:46 pm

Probably I'm missing something since I'm kinda out of my depth, but also a bit surprised to not see mention of Grand Central Dispatch (by Apple but open sourced and ported to FreeBSD). It dynamically recompiles (using LLVM's JIT facilities) the OpenCL kernels to make them run on whatever "processor" is available – so I guess any "processor" targetable by LLVM could be used?

Docs at https://developer.apple.com/library/mac/documentation/Performance/Conceptual/OpenCL_MacProgGuide/UsingGCDwOpenCL/UsingGCDwOpenCL.html#//apple_ref/doc/uid/TP40008312-CH13-SW1

#27 Yossi Kreinin on 08.07.15 at 4:09 pm

"Make them run" is one thing, "make them run fast" is another… OpenCL is portable but its performance is very non-portable, unlike the performance of portable general-purpose programming languages. What I was saying was, if an OpenCL target got popular and another OpenCL target with a significantly different architecture showed up, then that latter target would get very limited leverage from being able to run existing OpenCL software, because said software would have been optimized for the former, older target and perform badly on the new one.

#28 hmijail on 08.10.15 at 2:43 pm

But I guess that is the target scenario for having a JIT (with an autovectorizer and all)?

#29 Yossi Kreinin on 08.10.15 at 7:43 pm

Well, autovectorization helps but doesn't always work, and then you have issues with things like how much local memory you have, what the performance characteristics of accessing it are, how thread occupancy impacts performance, how external memory is accessed and what the performance characteristics are there, etc.

If having a JIT gave OpenCL performance portability, then, say, Google wouldn't shy away from exposing OpenCL drivers in Android. The reason they don't want people to use OpenCL very much is that they fear one target will overtake the market and other targets won't be able to compete because their architecture is too different; at least that's how I look at it… (Perhaps things have changed here since the last time I checked, and Google now promotes OpenCL vigorously with lively competition taking place between hardware vendors; if so, my point of view is outdated.)
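The occupancy factor above is one a JIT can't paper over, and a toy calculation shows why. In the spirit of GPU occupancy calculators (all resource figures below are invented for illustration; real targets publish their own limits), the number of resident work-groups per core is capped by whichever resource runs out first:

```python
# Toy occupancy estimate: resident work-groups per core are limited by
# register file size, local memory size, and thread slots. Figures are
# hypothetical, chosen only to show how the same kernel fares differently
# on two targets.

def occupancy(max_threads, regs_per_core, lmem_per_core,
              regs_per_thread, lmem_per_group, group_size):
    groups_by_regs = regs_per_core // (regs_per_thread * group_size)
    groups_by_lmem = lmem_per_core // lmem_per_group
    groups_by_slots = max_threads // group_size
    groups = min(groups_by_regs, groups_by_lmem, groups_by_slots)
    return groups * group_size / max_threads

# A kernel tuned for a roomy target (each group uses 16 KB of local memory)...
roomy = occupancy(2048, 65536, 49152, 64, 16384, 256)    # 0.375
# ...run unchanged on a hypothetical target with half the local memory:
cramped = occupancy(2048, 65536, 24576, 64, 16384, 256)  # 0.125
print(roomy, cramped)
```

The kernel source is identical in both cases; only the hardware budget changed, and occupancy (hence the machine's ability to hide latency) dropped threefold. No recompilation fixes that – the work-group sizing decision was baked into the algorithm for the older target.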
