OpenCL is usually perceived as a C dialect for GPGPU programming – doing "general-purpose" computations (not necessarily graphics) on GPU hardware. "It's like Nvidia's CUDA C, but portable".
However, OpenCL the language is not really tied to the GPU architecture. That is, hardware could run OpenCL programs and have an architecture very different from a GPU, resulting in a very different performance profile.
OpenCL is possibly the first programming language promising portability across accelerators: "OpenCL is for accelerators what C is for CPUs". Portability is disruptive. When hardware vendor A displaces vendor B, portable software usually helps a great deal.
Will OpenCL – "the GPGPU language" – eventually help displace GPGPU, by facilitating "GP-something-else" – "general-purpose" accelerators which aren't like GPUs?
We'll discuss this question on general grounds, and consider two specific examples of recent OpenCL accelerators: Adapteva's Parallella and ST's P2012.
Why displace GPGPU?
First of all, whether GPGPU is likely to be displaced or not – what could "GP-something-else" possibly give us that GPGPU doesn't?
There are two directions from which benefits could come – you could call them two opposite directions:
1. Let software (ab)use more types of special-purpose accelerators. GPGPU lets you utilize (abuse?) your GPU for general-purpose stuff. It could be nice to have "GPDSP" to utilize the DSPs in your phone, "GPISP" to utilize the ISP, "GPCVP" to utilize computer vision accelerators likely to emerge in the future, etc. From GPGPU to GP-everything.
2. Give software accelerators which are more general-purpose to begin with. GPGPU means doing your general-purpose stuff under the constraints imposed by the GPU architecture. An OpenCL accelerator lifting some of these constraints could be very welcome.
Could OpenCL help us get benefits from any of the directions (1) and (2)?
(1) is about making use of anal-retentive, efficiency-obsessed, weird, incompatible hardware. It's rather hard, for OpenCL or for any other portable, reasonably "pretty" language.
OpenCL does provide constructs mapping more or less directly to some of the "ugly" features common to many accelerators – for example (all four appear in the kernel sketch after this list):
- Explicitly addressed local memory (as opposed to cache)
- DMA (bulk memory transfers)
- Short vector data types to make use of SIMD opcodes
- Light-weight threads and barriers
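For concreteness, here's a minimal OpenCL C kernel sketch using all four features. It's my own illustration, not from any vendor's SDK; the tile size and the assumption that the work-group size equals TILE are mine:

    #define TILE 64

    __kernel void scale_tile(__global const float4* src,
                             __global float4* dst,
                             float k)
    {
        __local float4 tile[TILE];             /* explicitly addressed local memory */
        event_t e = async_work_group_copy(     /* bulk transfer - maps to DMA */
            tile, src + get_group_id(0) * TILE, TILE, 0);
        wait_group_events(1, &e);

        tile[get_local_id(0)] *= k;            /* float4 - a short vector type */
        barrier(CLK_LOCAL_MEM_FENCE);          /* light-weight barrier across the
                                                  work-group's threads */
        e = async_work_group_copy(dst + get_group_id(0) * TILE, tile, TILE, 0);
        wait_group_events(1, &e);
    }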
But even with GPUs, OpenCL can't target all of the GPU's resources. There's the subset of the GPU accessible to GPGPU programs – and then there are the more idiosyncratic and less flexible parts used for actual graphics processing.
With accelerators such as DSPs and ISPs, my guess is that today most of their value – acceleration ability – lies in the idiosyncratic features that can't be made accessible to OpenCL programs. They could evolve, but that's a bit far-fetched and we won't dwell on it now. In their current state, OpenCL is too portable and too "pretty" to map to most accelerators.
What about direction (2)? (2) is about making something that's more efficient than CPUs, but as nice and flexible as possible, and more flexible than GPUs.
As a whole, (2) isn't easy, for various reasons we'll discuss. But if we look, in isolation, at OpenCL the language, then it looks like a great language for targeting a "faster-than-CPUs-and-more-flexible-than-GPUs" kind of accelerator.
What could such an accelerator give us that GPUs don't?
One important feature is divergent flow. GPUs are SIMD or SIMT hardware; either way, they can't efficiently support something like this:
if(cond(i)) { out[i] = f(i); } else { out[i] = g(i); }
What they'll end up doing, essentially, is computing f(i) and g(i) for all values of i, and then throwing away some of the results. For deeply nested conditionals, the cost of wasted computations can make the entire exercise of porting to a GPU pointless.
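In other words, the branch becomes predicated execution – roughly the following, where f, g and cond are the functions from the snippet above (a sketch of what the hardware effectively does, not literal compiler output):

    float take_f = f(i);                 /* computed for every i... */
    float take_g = g(i);                 /* ...and so is this */
    out[i] = cond(i) ? take_f : take_g;  /* a per-lane select, not a jump */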
We'll now have a look at two OpenCL-compatible accelerators which promise to efficiently support divergent threads – or outright independent threads doing something completely unrelated. We'll briefly compare them, and then discuss some of their common benefits as well as common obstacles to their adoption.
Adapteva's Parallella
Actually, the chip's name is Epiphany – Parallella is the recently publicized name of Adapteva's planned platform based on Epiphany; anyway.
Adapteva's architecture is a 2D grid of processors with a mesh interconnect. To scale, you can have a chip with more cores – or you can have a 2D grid of chips with some of the inter-core communication seamlessly crossing chip boundaries. Each of the (scalar) processors executes its own instruction stream – no "marching in lockstep", fittingly for divergent flow.
There are no caches; a memory address can map to your local memory, or the local memory of some other processor in the grid, or to external memory. Access latency will vary accordingly; access to local memories of close neighbors is quicker than access to far neighbors. All memory access can be done using either load/store instructions or DMA.
(Note that you can reach far neighbors – unlike some more "fundamentalist" proposals for "2D scalability" where you can only talk to immediate neighbors, period. I think that's over the top; if you want to run something other than the game of life, it's awfully handy to have long communication paths – as do most computers ranging from FPGAs to neurons, some of which have really long axons.)
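To make "a memory address can map to the local memory of some other processor" concrete: in Epiphany, a flat 32-bit address carries the mesh coordinates of the owning core in its upper bits. The sketch below reflects my reading of Adapteva's documentation; treat the exact bit layout as an assumption:

    #include <stdint.h>

    /* Compose a flat address from mesh coordinates: the top 12 bits hold
       the core ID (6-bit row, 6-bit column); an ID of 0 means "my own
       local memory". Plain loads/stores to the result reach the remote
       core's memory across the mesh. */
    uint32_t mesh_addr(uint32_t row, uint32_t col, uint32_t offset)
    {
        return (row << 26) | (col << 20) | (offset & 0xFFFFF);
    }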
Stats:
- 32K memory per core (unified – both instructions and data)
- 4 banks that can help avoid contentions between loads/stores, instruction fetching and DMA traffic
- 2-issue cores (one floating point operation and one integer or load/store operation)
- 800 MHz at 28nm using a low-power process, as opposed to high speed (my bet is that it's hard to top 800 MHz at 28nm LP – any evidence to the contrary?)
- ~25mW per core, 2W peak power for a full chip with 64 cores
- 0.128 mm^2 per core
Sources: BDTI's overview and various documentation from adapteva.com.
ST's Platform 2012
The P2012 architecture is also, at the top level, a grid of processors with a mesh interconnect. One stated motivation is the intra-die variability in future process nodes: some cores will come out slower than others, and some will be unusable.
It is thus claimed that a non-uniform architecture (like the ones we have today – CPUs and a boatload of different accelerators) will become a bad idea. If a core happens to come out badly, and it's not like your other cores, you have to throw away the entire chip. Whereas if cores are all alike, you leave the bad ones unused, and you may still have enough good ones to use the chip.
Interestingly, despite this stated motivation, the P2012 is less uniform and has higher granularity than Epiphany. Firstly, there's a provision for special-purpose accelerators in the grid. Secondly, the top-level mesh connects, not individual cores, but clusters of 16 rather tightly-coupled cores (each with its own flow of control, however – again, good for divergence).
Similarly to Epiphany, data is kept in explicitly addressed local memory rather than cache, and you can access data outside the cluster using load/store instructions or DMA, but you'll pay a price depending on the distance.
However, within a cluster, data access is uniform: the 16 cores share 256K of local data memory. This can be convenient for large working sets. Instructions are private to a core – but they're kept in a cache, not a local memory, conveniently for large programs.
Stats:
- 32K per core (16 cores, with a 16K I-cache per core and 32 banks of 8K data memory)
- 1MB L2 cache in a 4-cluster, "69-core" chip (presumably, (16+1)x4+1 – one extra core per cluster and per chip)
- 2-issue cores (I failed to find which instructions can be issued in parallel)
- 600 MHz at 28nm (process details unclear)
- 2W for the 69-core chip
- 0.217 mm^2 per core (3.7 mm^2 per (16+1)-core cluster), not accounting for L2 cache
Source: slides, slides, slides.
Parallella vs P2012: a quick comparison
Each of the architectures can have many different implementations and configurations. It seems fair to compare a 28nm 64-core Epiphany chip with a 28nm 69-core P2012 chip (or at least fair as far as these things go). Each system has its own incompatible native programming interface, but both can also be programmed in OpenCL.
Here's how Epiphany compares to P2012:
- Power: 1x (2W)
- Core issue width: 1x (2-issue)
- Local memory size: 1x (32K per core)
- Frequency: 1.33x (800/600)
- Core area efficiency: 1.7x (0.217/0.128)
I think it's a fine achievement for Adapteva – a 5-person company (ST has about 50,000 employees – of course not all of them work on the P2012, but still; Chuck Moore's GreenArrays is 18 people – and he's considered the ultimate minimalist, developing a much more minimalistic product which, for instance, certainly won't run OpenCL programs).
This is not to say that these numbers are sufficient to compare the architectures. For starters, we assume that the power is the same, but we can't know without benchmarking. Energy consumption varies widely across programs – low power process brings leakage down to about zero at room temperature, so you're left with switching power which depends on what code you run, and on what data (multiplying zeros costs almost nothing compared to multiplying noise, for instance).
Then there are programming model differences, ranging from the extent of floating point compliance with the IEEE standard to the rather different memory models. In the memory department, the ability of P2012 cores to access larger working sets should somewhat negate Epiphany's raw performance advantage on some workloads (though Epiphany cores might have lower latency when accessing their own banks). But then, two different 2-issue cores will generally perform differently – you need thorough benchmarking to compare.
So what are these numbers good for? Just for a very rough, ballpark estimation of the cost of this type of core. That is, a core which is flexible enough to run its own instruction stream – but "low-end" enough to burden the programmer with local memory management, and lacking much of the other amenities of full-blown CPUs (speculative prefetching, out-of-order execution, etc.)
Our two examples both point to the same order of magnitude of performance. Let's look at a third system, KALRAY's MPPA – looking more like P2012 than Epiphany, with 16-core clusters and cores sharing memory.
At 28nm, 256 cores are reported to typically consume 5W at 400 MHz (Adapteva and ST claim to give worst case numbers). That's 20mW per core compared to Epiphany's 25mW – but Epiphany runs at 2x the frequency. If we normalize for frequency, Epiphany comes out 1.6x more power-efficient – and that's when we compare its worst case power to MPPA's typical power.
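Spelled out, the normalization is just energy per cycle = power / frequency:

    MPPA:     20 mW / 400 MHz = 50 pJ per core per cycle
    Epiphany: 25 mW / 800 MHz ≈ 31 pJ per core per cycle  –  50/31 ≈ 1.6x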
MPPA doesn't support OpenCL at the moment, and I found few details about the architecture; our quick glance is only good to show that these "low-end multicore" machines have the same order of magnitude of efficiency.
So will OpenCL displace GPGPU?
The accelerators of the kind we discussed above – and they're accelerators, not CPUs, because they're horrible at running large programs as opposed to hand-optimized kernels – these accelerators have some nice properties:
- You get scalar threads which can diverge freely and efficiently – this is a lot of extra flexibility compared to SIMT or SIMD GPUs.
- For GPGPU workloads that don't need divergence, these accelerators probably aren't much worse than GPUs. You lose some power efficiency because of reading the same instructions from many program memories, but it should be way less than a 2x loss, I'd guess.
- And there's a programming model ready for these accelerators – OpenCL. They can be programmed in other C dialects, but OpenCL is a widespread, standard one that can be used, and it lets you use features like explicitly managed local memory and DMA in a portable way.
From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?
There are several roadblocks, however. First of all, there's no killer app for this type of thing – by definition. That is, for any killer app, almost certainly a much more efficient accelerator can be built for that domain. Generic OpenCL accelerators are good at accelerating the widest range of things, but they don't excel at accelerating anything.
There is, of course, at least one thriving platform which is, according to the common perception, "good at everything but excels at nothing" – FPGA. (I think it's more complicated than that but I'll leave it for another time.)
FPGAs are great for small to medium scale product delivery. The volume is too small to afford your own chip – but there may be too many things to accelerate which are too different from what an existing chip is good at accelerating. Flexible OpenCL accelerator chips could rival FPGAs here.
What about integrating these accelerators into high-volume chips such as application processors, so they could compete with GPUs? Without a killer app, there's a real estate problem. At 100-150 mm^2, today's application processors are already rather large. And the new OpenCL accelerators aren't exactly small – they're bigger than any domain-specific accelerator.
Few chips are likely to include a large accelerator "just in case", without a killer app. Area is considered to be getting increasingly cheap. But we're far from the point where it's "virtually free", and the trend might not continue forever: GlobalFoundries' 14nm is a "low-shrink" node. Today, area is not free.
Of course, a new OpenCL accelerator does give some immediate benefit, so it isn't a purely speculative investment: it could speed up existing OpenCL applications. But for existing code which is careful to avoid divergence, the accelerator would be somewhat less efficient than a GPU, and it wouldn't do graphics nearly as well as the GPU – so it'd be a rather speculative addition indeed.
What would make one add hardware for speculative reasons? A long life cycle. If you believe that your chip will have to accelerate important stuff many years after it's designed, then you'll doubt your ability to predict exactly what this stuff is going to be, and you'll want the most general-purpose accelerator.
Conversely, if you make new chips all the time, quickly sell a load of them, and then move on to market your next design, then you're less inclined to speculate. Anything that doesn't result in a visibly better product today is not worth the cost.
So generic OpenCL accelerators have a better shot in domains with long life cycles, which seem to be a minority. And even when you've found a vendor with a focus on the long term, you have the problem of performance portability.
Let's say platform vendor A does add the new accelerator to their chip. Awesome – except you probably also want to support vendor B's chips, which don't have such accelerators. And so efficient divergence is of no use to you, because it's not portable. Unless vendor A accounts for a very large share of the market – or if it's a dedicated device and you write a dedicated program and you don't care about portability.
OpenCL programs are portable, but their performance is not portable. For instance, if you use vector data types and the target platform doesn't have SIMD, the code will be compiled to scalar instructions, and so on.
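For instance (my own sketch, not any vendor's code), the float4 version below is a win on hardware with SIMD opcodes and buys nothing on scalar hardware:

    __kernel void saxpy4(__global float4* y,
                         __global const float4* x,
                         float a)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];   /* maps to vector opcodes given SIMD;
                                     compiled to 4 scalar operations without */
    }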
What this means in practice is that one or several OpenCL subsets will emerge, containing features that people count on to be supported well. For instance, a relatively good scenario is, there's a subset that GPU programmers use on all GPUs. A worse scenario is, there's the desktop GPU subset and the mobile GPU subset. A still worse scenario is, there's the NVIDIA subset, the AMD subset, the Imagination subset, etc.
It's an evolving type of thing that's never codified anywhere but has more power than the actual standard.
Standards tend to materialize partially. For example, the C standard supports garbage collection, but real C implementations usually don't, and many real C programs will not run correctly on a standard-compliant, garbage-collecting implementation. Someone knowing C would predict this outcome, and would not trust the standard to change it.
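To make this concrete, here's a sketch of an ordinary C idiom that a conforming garbage-collecting implementation could break – "hiding" a pointer defeats the collector's reachability scan (the XOR mask is illustrative):

    #include <stdint.h>
    #include <stdlib.h>

    int main(void)
    {
        char* p = malloc(100);
        uintptr_t hidden = (uintptr_t)p ^ 0xDEADBEEF; /* disguise the pointer */
        p = NULL;              /* the buffer now looks unreachable, so a
                                  collector is free to reclaim it... */
        char* q = (char*)(hidden ^ 0xDEADBEEF);
        q[0] = 'x';            /* ...making this a use-after-free */
        return 0;
    }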
So with efficient divergence, the question is: will this feature make it into a widely used OpenCL subset, even though it's not a part of any such subset today? If it doesn't, widespread hardware is unlikely to support it.
Personally, I like accelerators with efficient divergence. I'll say it again:
"From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?"
From an evolutionary standpoint though, it's quite the uphill battle. The CPU + GPU combination is considered "good enough" very widely. It's not impossible to grow in the shadow of a "good enough", established competitor. x86 was good enough and ARM got big, gcc was good enough and LLVM got big, etc.
It's just hard, especially if you can't replace the competitor anywhere and you aren't a must-have. A CPU is a must-have and ARM replaces x86 where it wins. A compiler is a must-have and LLVM replaces gcc where it wins. An OpenCL accelerator with efficient divergence – or any other kind, really – is not a must-have and it will replace neither the CPU nor the GPU. So it's quite a challenge to convince someone to spend on it.
Conclusion
I doubt that general-purpose OpenCL accelerators will displace GPGPU, even though it could be a nice outcome. The accelerators probably will find their uses though. The following properties seem favorable to them (all or a subset may be present in a given domain):
- Small to medium scale, similarly to FPGAs
- Long life cycles encouraging speculative investment in hardware
- Device-specific software that can live without performance portability
In other words, there can be "life between the CPU and the GPU", though not necessarily in the highest volume devices.
Good luck to Adapteva with their Kickstarter project – a computer with 64 independent OpenCL cores for $99.
Comments
What about the already available GA144 chip?
Erm… I mentioned GreenArrays; let's say that programming it is challenging and it's definitely not my cup of tea, but if you know how to put it to good use, you could.
BTW all the stuff I mentioned here is also available as physical chips (perhaps less easy to purchase at this point but definitely available physically already).
Just one nitpick: It's usually better to use
out[i] = f[i] + cond[i]*(g[i]-f[i])
than the branching in your code snippet: you compute everything – just as in the "standard" pattern – and you have another multiply-and-add, but you have only one commit to memory. It's sometimes better even in nested conditions.
Bullets two and three apply to game consoles. Sony in particular have been crazy enough in the past to ship bold, weird architectures.
@david: that's why i said f(i) – function, not array access… an if is better at some point
I disagree with the premise that "most of their value – acceleration ability – is in the idiosyncratic features that can't be made accessible to OpenCL programs."
The idiosyncratic features that aren't exported to OpenCL are the rasterizer (vertex interpolator) and the framebuffer compositor (z-buffer tests, blending, a few other things). These are valuable and graphics-centric but account for a small amount of silicon. (The other specialized silicon that's part of the "GPU special sauce" is the filtered texture units, which *are* exported by OpenCL – they compute arbitrary expensive function approximations cheaply by table lookup.)
Roughly speaking, the rest of the chip is a lot of vector cores. The thing about these vector cores is that we can have orders of magnitude more of them than CPUs do, because we aren't spending any silicon on branch prediction, speculative execution, or out-of-order execution. Note also that GPU languages don't support pointers. Put that all together and the resulting restrictions allow for very high throughput for data-intensive calculations. (Also, we have many more threads than cores, so threads blocked on memory latencies are swapped out for threads that have loaded data ready to go.)
The value GPUs now deliver is the massive throughput we get from massive programmable parallelism, which we get by throwing away features from CPUs that aren't actually necessary for the graphics-specific task of processing streams of data packets independently. What's interesting is that a surprising number of jobs can be cast into this format. (E.g., see the classic Google MapReduce paper for a few examples.) Even if this format isn't the most "efficient" way to structure the problem, the brute-force strength of massive parallelism available now in GPUs can make it a performance win.
What role does the HSA Foundation play in this? Isn't HSA supposed to sort of replace OpenCL eventually?
@patrick: i meant other accelerators, not gpus
The "64 core" version of Parallella is supposed to cost $199, the $99 Parallella is based on the 16 core chip. In either case, the board will have a dual core ARM cpu making the total cores 18 (for $99) or 66 (for $199).
@charles: oops – i'll correct it when i get a better connection…
I've long wondered if someone could use the ZMS-08 from ZiiLabs as a GPGPU. OK, it is targeted at mobile devices, it isn't 64-bit and doesn't have a fancy interconnect, but it does have a programmable interface bus. Also, it has 4x 12-bit video IOs which could be used as general purpose interfaces.
Probably a case of trying to make a square peg fit in a round hole, but they are substantially cheaper than $99.
It's worth noting a point of GPU limitation you've not illustrated clearly. GPUs operate on a bus and a tiered cache architecture. They are highly optimized to work on sequential streams of data (vertices) and output to a more or less sequential fixed result buffer (backbuffer). GPUs are very bad at random-access data crunching and random-write data storage. That's the main reason GPUs are very bad, for instance, at n-body simulations and raytracing. Mid-end GPUs are so bad at these tasks that you can routinely write algorithms in interpreted languages that are hundreds of times slower than C and still outperform those GPUs on the CPU.
What's so promising about the new chips is that they don't look like the sequential stream processors that GPUs are. They look like they could perform much better at random read/write than GPUs, which matters a lot.
@Florian: CUDA devices are fine at rather random access – for example, parallel random access to data small enough to fit in, and explicitly allocated at, what Nvidia calls "shared memory". You get problems if you want random access to swaths of data (though I think highest-end devices come with caches for that, so they'd be tied with CPUs at some point perhaps).
In this sense, the accelerators that I mentioned are not necessarily better. You get fine random access when you hit the per-core/per-cluster explicitly managed local memory, worse latency for nearby local memories, and no data cache at all to speed up access to DRAM.
So I didn't mention this difference because I'm not sure about its significance (perhaps there's some advantage in being able to access neighbors; but for "sufficiently random access", you still need to manually plan all your accesses and in this sense a CPU is much better.)
I am not sure about your conclusion: "Long life cycles encouraging speculative investment in hardware".
If you mean generic HW, you are probably right, but if you already have the generic HW (the GPGPU) and you want to replace it, taking a speculative investment is actually worse in a long life cycle, since it will take you more time to fix if you failed. Isn't it?
Actually, we did see an overthrowing of the leading GP accelerator for image processing – the GPU took over from the DSP.
The GPU was not planned to overthrow the DSP, but rather to do better graphics processing. Generalization was a tool to reach a better graphics processor, not the goal.
The GPU did not find its way into the chip to become a general purpose accelerator but rather to have a certain function.
Since it was already there and had a certain level of generalization, people started using it. This led to improving the generalization for other applications.
My guess is that this is the way the GPGPU will be overthrown – HW more suitable for image processing that already has a footprint in processors will become generic as part of its evolution.
If you are the Ofer that I think you are, we better discuss this elsewhere for a variety of reasons :) One thing though – DSP had a footprint in APs for a very long time, and it was way more generic/programmable than GPUs for a very long time. The fact that GPUs are now more widely targeted means that it's not just a question of footprint but what your evolution leads to and this has to do with your base architecture. Different architectures evolve in different directions; a GPU will not evolve into a DSP and vice versa. So something with a footprint will evolve, but it can only evolve into a certain range of things, and this range determines the range of applications where it can find practical uses.
Thanks Yossi – the most interesting reading I had lately.
@Uri W: you're shitting me.
Check out Parallella's here. These people sure can execute.
Nitpick: "Someone knowing C would predict this outcome, and would not trust the standard to change it." — I'm sure you mean
s/not trust the standard/trust the standard not/
I think I meant what I wrote, actually. "not trust that the standard will change it", rather than "trust that the standard will not change it".
Very interesting write-up, Yossi.
I was looking at GPU acceleration of petabyte-scale seismic data processing about 5 years ago. The incumbent approach then was to use several thousand dual-die, dual-core blades at 2.2GHz managed via MPI.
While this is the kind of problem that multiple GPUs should handle well, even with the later introduction of CUDA it was difficult for the programmer to achieve the rated execution capacity of the devices.
I see great promise in Epiphany and I suspect that Parallella (particularly the 64 core – or 4×16 as I prefer to look at it) will open the doors to a wide spectrum of applications that the power, heat, and programming model complexity of GPUs make impractical today.
I suspect – as has been the case many times – that one technology will not displace the other; rather, each will find its use within the domain of its intrinsic strength. Many computationally intensive applications are also graphically intensive. Consider for example the challenges involved in a complex SCADA master control station with forward-looking simulation. Optimal real-time visualization of several hundred or thousand RTU inputs is enough to tax a GPU. Even though the data is small, it comes from a wide range of sources and it is continually changing; moreover, processing the delta data in real time and generating simulation scenarios does not lend itself to a highly linear instruction stream. The ability to "asynchronize" (or temporally decouple) simulation computation from visualization makes for a very robust architecture.
Likewise I see great promise in the potential for real-time optimization of diagnostic imaging in which tightly architected parallel pattern recognition can detect anomalies and change imaging granularity in response.
In work more familiar to many (H.264 encoding), both the challenges of existing MP/MC architectures and the limited practical success in applied GPU acceleration underscore the need for a simpler parallel processing coding model with a flat memory model. That need would seem one that Epiphany may be suited to satisfy.
Epiphany's memory model is not flat – in fact it's somewhat less flat than, say, CUDA's. It's nominally flat because you can use pointers to access anything using the same code, but performance gets worse the farther the memory actually is. In CUDA you have multiple execution cores working with many "equal" banks for close ("shared") memory. Epiphany's model is only flat if you don't care about performance – in which case so is CUDA's (just put everything in DRAM), and which isn't a sensible use case.
The flexibility advantage of Epiphany over CUDA or other GPU programming model is efficient divergence, not the memory model, at least as I see it; the memory model is different and you can argue for or against it, but none of the models is a straightforward flat thing.
Consumer Robotics:
Multiple kinds of computation on single purpose hardware √
Highly parallel √
Lots of inputs/outputs √
Divergent flow √
Benefits greatly from power efficiency √
No dominant hardware implementation, benefits from portability √
Mainstream horizon: 10-15 years √
Great writeup. My group is researching how to improve programmability on systems such as P2012, Adapteva, etc. We're calling systems that don't have caches and instead use software with DMA transfers to move data around 'explicitly managed systems'. Check out my site for a paper on our work.
You may be interested to know about Intel's Runnemede, a research project along the same lines as Epiphany and P2012 – http://iacoma.cs.uiuc.edu/iacoma-papers/hpca13_1.pdf
I agree that OpenCL can significantly help with programming for accelerators. From my point of view it helps the programmer spawn lots of kernels as well as explicitly declare the global/local accesses of each kernel. However, there is definitely room for improvement. The amount of boilerplate code required to use OpenCL is ridiculous and a serious impediment to productivity. There's a lot of room to improve the OpenCL API on the host side.
I'll definitely have a look, thanks!
Probably I'm missing something since I'm kinda out of my depth, but I'm also a bit surprised to not see mention of Grand Central Dispatch (by Apple but open sourced and ported to FreeBSD). It dynamically recompiles (using LLVM's JIT facilities) the OpenCL kernels to make them run on whatever "processor" is available – so I guess any "processor" targetable by LLVM could be used?
Docs at https://developer.apple.com/library/mac/documentation/Performance/Conceptual/OpenCL_MacProgGuide/UsingGCDwOpenCL/UsingGCDwOpenCL.html#//apple_ref/doc/uid/TP40008312-CH13-SW1
"Make them run" is one thing, "make them run fast" is another… OpenCL is portable but its performance is very non-portable, unlike the performance of portable general-purpose programming languages. What I was saying was, if an OpenCL target got popular and another OpenCL target with a significantly different architecture showed up, then that latter target would get very limited leverage from being able to run existing OpenCL software, because said software would have been optimized for the former, older target and perform badly on the new one.
But I guess that that is the target scenario for having a JIT? (with autovectorizer and all)
Well, autovectorization helps but doesn't always work, and then you have issues with things like how much local memory you have, what's the performance characteristics of accessing it, how thread occupancy impacts performance, how external memory is accessed and what's the perf characteristics there etc.
If having a JIT would give OpenCL performance portability, then say Google wouldn't shy away from exposing OpenCL drivers in Android. The reason they don't want people to use OpenCL very much is because they fear that one target will overtake the market and other targets won't be able to compete because their arch is too different; at least that's how I look at it… (Perhaps things changed here since the last time I checked and Google now promotes OpenCL vigorously with lively competition taking place between hardware vendors… if so my point of view is outdated.)