Humans and compilers need each other: the VLIW SIMD case

The state of the art in optimizing compilers today is such that for optimizing code, you need (1) a strong optimizing compiler and (2) a strong optimizing human. My rule of thumb is that (1) alone will yield 2x to 10x slower code. This is also what a person selling a (great) compiler "giving 80% of the optimal performance with no manual intervention" once admitted off the record to a roomful of programmers who had pressed him into a corner, elevating my rule of thumb to a nobler plane of anecdotal evidence.

Now, I claim that this situation will persist, and in this post I'll try to close the fairly large gap between this claim and the mere acknowledgment of what the state of the art is today. The gap is particularly large for the believer in the possibility of strong AI – and while my position is a bit different, I do believe in fairly strong AI (can I say that? people keep telling me that I can't say "nearly context-free". oh well.)

I realize that many people experienced in optimization feel that, on the contrary, there's in fact no gap large enough to justify an attempt as boringly rigorous (for a pop tech blog) at proving what they think is obvious as will shortly follow. But I think that many language geek discussions could benefit from a stronger bound on the power of a Sufficiently Smart Compiler than can be derived from (necessarily vague) doubts on the power of AI, and in this post I'll try to supply such a bound. I actually think a lot of (mainly domain-specific) things could be achieved by AI-ish work on compilation – closer to "identify bubble-sort and convert to quick-sort" than to traditional "analyze when variables are alive and assign them to registers" – and this is why it's useful to have a feeling when not to go there.

So, consider chess, where the state of the art is apparently quite similar to that in optimization: a strong human player using a strong computer program will take out both a human and a computer playing alone. However, it is conceivable that a program can be developed that doesn't need the help of a human, being capable of completely simulating human thought processes, or of using alternative processes which are consistently superior. Why can't it be the same with optimizing compilers?

(Chess and optimization are similar in another respect – few care about them; I readily acknowledge the insignificance of a 10x speed-up in a continuously expanding set of circumstances, I just happen to work in an area where it does count as a checkmate.)

I'll try to show that optimization is a fundamentally different game from chess, quite aside from the formal differences such as decidability. I'll use optimizing for VLIW SIMD processors to show where compilers outperform humans and vice versa. I'll be quoting a book by the inventor of VLIW called "Embedded Computing: A VLIW Approach" to support my position on the relative strength of humans and compilers in these cases. I'll then try to show that my examples are significant outside the peculiarities of current hardware, and attempt to state the general reason why humans are indispensable in optimization.

VLIW SIMD

First, we'll do the acronym expansion; skip it if you've been through it.

VLIW stands for "Very Long Instruction Word". What it really means is that your target processor can be told to execute several instructions in parallel. For example: R0=Add R1,R2 and R3=Mul R0,R1 and R1=Shift R5,R6. For this to work, the processor ought to be able to add, multiply and shift in parallel, that is, its execution hardware must be packed into several units, each getting distinct inputs. The units can be completely symmetric (all supporting the same operations); more often, different units support different instruction sets (so, for example, only one unit in a processor can multiply, but two of them can add, etc.) A stinky thing to note about VLIW instructions is the register semantics. In the example instruction above, R0 is mentioned both as an input and as an output. When it's mentioned as an input of Mul its old value is meant, and not the value computed by Add. This is somewhat natural since the whole point is to run Add and Mul in parallel so you don't want Mul to wait for Add; but it's confusing nonetheless. We'll come back to this shortly.
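The register semantics can be sketched in a few lines of Python. This is only a toy model of the imaginary machine used in this post: the function name `run_bundle`, the initial register values, and the choice of a left shift for `Shift` are all assumptions made for illustration.

```python
def run_bundle(regs, bundle):
    """Execute one VLIW instruction on a toy machine.

    Every operation reads the OLD register values; all results are
    written back together only after every operation has been evaluated.
    """
    ops = {
        "Add": lambda x, y: x + y,
        "Mul": lambda x, y: x * y,
        "Shift": lambda x, y: x << y,  # assumption: Shift is a left shift
    }
    # Phase 1: compute every result from the unmodified register file.
    results = {dst: ops[op](regs[a], regs[b]) for dst, op, a, b in bundle}
    # Phase 2: commit all writes at once.
    new_regs = dict(regs)
    new_regs.update(results)
    return new_regs


# Hypothetical initial register contents.
regs = {"R0": 100, "R1": 2, "R2": 3, "R3": 0, "R5": 1, "R6": 4}

# R0=Add R1,R2 and R3=Mul R0,R1 and R1=Shift R5,R6
bundle = [
    ("R0", "Add", "R1", "R2"),
    ("R3", "Mul", "R0", "R1"),
    ("R1", "Shift", "R5", "R6"),
]
after = run_bundle(regs, bundle)
```

Here `after["R3"]` is computed from the old R0 (100) and the old R1 (2), not from the sum that Add writes to R0 or the value Shift writes to R1 in the same instruction.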

SIMD stands for "Single Instruction, Multiple Data" and is known much more widely than VLIW, being available in desktop and server processor architectures like x86 and PowerPC (VLIW reigns in the quieter embedded DSP domain, the most commercially significant design probably being TI's C6000 family.) SIMD means that you have commands like R0=Add8 R1,R2, which does 8 additions given 2 inputs. The registers are thus treated as vectors of numbers – for example, uint8[16], or uint16[8], or uint32[4], assuming 16-byte (128b) registers. This establishes a preference for lower-precision numbers since you can pack more of them into a register and thus process more of them at a time: with uint16, you use Add8, but with uint8, you get to use the 2x faster Add16. We'll come back to this, too.
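A rough Python model of such lane-wise additions, assuming a 128-bit register (which is what uint8[16] implies) and wrap-around arithmetic within each lane. The helper names `pack`, `unpack` and `simd_add` are made up for this sketch; real instruction sets often also offer saturating variants, which are not modeled here.

```python
def pack(values, lane_bits):
    """Pack unsigned lane values into one integer standing in for a register."""
    mask = (1 << lane_bits) - 1
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & mask) << (i * lane_bits)
    return reg


def unpack(reg, lane_bits, lanes):
    """Split a register back into its lanes."""
    mask = (1 << lane_bits) - 1
    return [(reg >> (i * lane_bits)) & mask for i in range(lanes)]


def simd_add(r1, r2, lane_bits, reg_bits=128):
    """Lane-wise wrapping add: carries never cross a lane boundary."""
    lanes = reg_bits // lane_bits
    mask = (1 << lane_bits) - 1
    a = unpack(r1, lane_bits, lanes)
    b = unpack(r2, lane_bits, lanes)
    return pack([(x + y) & mask for x, y in zip(a, b)], lane_bits)


# "Add8": 8 lanes of uint16; the sum 260 fits in each lane.
add8 = unpack(simd_add(pack([250] * 8, 16), pack([10] * 8, 16), 16), 16, 8)

# "Add16": 16 lanes of uint8; twice the work per instruction,
# but 250 + 10 wraps around to 4 in every lane.
add16 = unpack(simd_add(pack([250] * 16, 8), pack([10] * 16, 8), 8), 8, 16)
```

The second call does twice as many additions per instruction, and also shows the price: the narrower lanes cannot hold the true sum.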

Optimizing for VLIW targets

The basic thing at which VLIW shines is the efficient implementation of "flat" loops (where most programs spend most time); by "flat", I mean that there are no nested if/elses or loops. The technique for implementing loops on VLIW machines is called modulo scheduling. The same technique is used on superscalar machines like modern x86 implementations (the difference from VLIWs being the instruction encoding semantics).

Since I couldn't find a good introductory page to link to, we'll run through a basic example of modulo scheduling right here. The idea is pretty simple, although when I first saw hardware designers doing it manually in a casual manner, I was deeply shocked (they do it for designing new hardware rather than programming existing hardware but it's the same principle).

Suppose you want to compute a[i]=b[i]*c+d on a VLIW processor with 4 units, 2 of them capable of load/store operations, 1 with an adder and 1 with a multiplier. All units have single-cycle latency (that is, their output is available to the next instruction; real VLIW units can have larger latencies, so that several instructions will execute before the result reaches the output register.) Let's assume that Load and Store increment the pointer, and ignore the need to test for the exit condition through the loop. Then a trivial assembly implementation of a[i]=b[i]*c+d looks like this:

LOOP:
R0=Load b++
R1=Mul R0,c
R2=Add R1,d
Store a++,R2

This takes 4 cycles per iteration, and utilizes none of the processor's parallelism as each instruction only uses 1 of the 4 execution units. Presumably we could do better; in fact the upper bound on our performance is 1 cycle per iteration, since no unit has to be used more than once to implement a[i]=b[i]*c+d (if we had two multiplications, for example, then with only 1 multiplying unit the upper bound would be 2 cycles/iteration.)
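The upper bound argument above is just a count of how many times each kind of unit is needed per iteration. A minimal sketch, where the function name and the unit labels ("mem", "add", "mul") are invented for illustration and dependencies and latencies are ignored:

```python
import math


def resource_bound(op_counts, unit_counts):
    """Fewest cycles per iteration permitted by the execution units alone.

    op_counts:   how many operations of each kind one loop iteration needs
    unit_counts: how many units of each kind the machine has
    """
    return max(math.ceil(n / unit_counts[kind]) for kind, n in op_counts.items())


# The 4-unit machine: 2 load/store units, 1 adder, 1 multiplier.
machine = {"mem": 2, "add": 1, "mul": 1}

# a[i]=b[i]*c+d needs one Load and one Store, one Mul and one Add.
one_mul = resource_bound({"mem": 2, "add": 1, "mul": 1}, machine)

# A hypothetical loop body with two multiplications instead of one.
two_muls = resource_bound({"mem": 2, "add": 1, "mul": 2}, machine)
```

With one multiplication the bound is 1 cycle per iteration; with two, the single multiplier forces 2.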

What we'll do now is blithely schedule all of the work to a single instruction, reaching the throughput suggested by our upper bound:

LOOP:
R0=Load b++ and R1=Mul R0,c and R2=Add R1,d and Store a++,R2

Let's look at what this code is doing at iteration N:

  • b[N] is loaded
  • b[N-1] (loaded at the previous iteration into R0) is multiplied by c
  • b[N-2]*c (computed at the previous iteration from the old value of R0 and saved to R1) is added to d
  • b[N-3]*c+d is saved to a[N]

This shows why our naive implementation doesn't work (it would be quite surprising if it did) – at iteration 0, b[N-1] to b[N-3] are undefined, so it makes no sense to do things depending on these values. However, starting at N=3, our (single-instruction) loop body seems to be doing its job just fine (except for storing the result to the wrong place – b ran away during the first 3 iterations). We'll take care of the first iterations by adding a loop header – instructions which implement the first 3 iterations, only doing the stuff that makes sense in those iterations:

R0=Load b++
R0=Load b++ and R1=Mul R0,c
R0=Load b++ and R1=Mul R0,c and R2=Add R1,d
LOOP:
R0=Load b++ and R1=Mul R0,c and R2=Add R1,d and Store a++,R2
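To check that the header plus the single-instruction loop really computes a[i]=b[i]*c+d, here is a small Python simulation. Python's tuple assignment evaluates the whole right-hand side before writing anything, which happens to match the "read old values, then write" register semantics. The function name and the zero padding of b are assumptions of this sketch: the padding stands in for the reads past the end of b that the loop performs while the last results drain out.

```python
def pipelined_loop(b, c, d):
    """Simulate the 3-instruction header followed by the 1-instruction loop."""
    padded = list(b) + [0, 0, 0]  # the loop reads 3 elements past the end of b
    a = []
    bp = 0
    r0 = r1 = r2 = 0

    # Header: fill the pipeline, one more operation per instruction.
    r0, bp = padded[bp], bp + 1                          # R0=Load b++
    r0, r1, bp = padded[bp], r0 * c, bp + 1              # ... and R1=Mul R0,c
    r0, r1, r2, bp = padded[bp], r0 * c, r1 + d, bp + 1  # ... and R2=Add R1,d

    # Loop: one instruction per iteration; Store sees the OLD value of R2.
    for _ in range(len(b)):
        a.append(r2)                                     # Store a++,R2
        r0, r1, r2, bp = padded[bp], r0 * c, r1 + d, bp + 1
    return a


result = pipelined_loop([1, 2, 3, 4, 5], 3, 7)
expected = [x * 3 + 7 for x in [1, 2, 3, 4, 5]]
```

`result` and `expected` agree, with each element leaving the pipeline three instructions after it was loaded.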

For similar reasons, we need a loop trailer – unless we don't mind loading 3 elements past the end of b[], but I reckon you get the idea. So we'll skip the trailer part, and move to the more interesting case – what happens when the loop body won't fit into a single instruction. To show that, I can add more work to be done in the loop so it won't fit into the units, or I can use a weaker imaginary target machine to do the same work which will no longer fit into the (fewer) units. The former requires more imaginary assembly code, so I chose the latter. Let's imagine a target machine with just 2 units, 1 with Load/Store and one with Add/Mul. Then our upper bound on performance is 2 cycles per iteration. The loop body will look like this:

LOOP:
R0=Load b++ and R2=Add R1,d
R1=Mul R0,c and Store a++,R2

Compared to the single-instruction case, which was still readable ("Load and Mul and Add and Store"), this piece looks garbled. However, we can still trace its execution and find that it works correctly at iteration N (assuming we added proper header code):

  • At instruction 1 of iteration N, b[N] is loaded
  • At instruction 2 of iteration N, b[N] (loaded to R0 by instr 1 of iter N) is multiplied by c
  • At instruction 1 of iteration N, b[N-1]*c (computed in R1 by instr 2 of iter N-1) is added to d
  • At instruction 2 of iteration N, b[N-1]*c+d (computed in R2 by instr 1 of iter N) is stored to a[N-1]
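The trace above can be replayed in Python. The header used here (one Load followed by one Mul) is my guess at the "proper header code" the text assumes, and the single zero appended to b stands in for the one read past the end that the loop makes; both are assumptions of this sketch.

```python
def two_unit_loop(b, c, d):
    """Simulate the 2-instruction loop body on the 2-unit machine."""
    padded = list(b) + [0]  # the loop reads 1 element past the end of b
    a = []
    bp = 0

    # Assumed header: start the first element on its way.
    r0, bp = padded[bp], bp + 1   # R0=Load b++
    r1 = r0 * c                   # R1=Mul R0,c

    for _ in range(len(b)):
        # Instruction 1: R0=Load b++ and R2=Add R1,d
        r0, r2, bp = padded[bp], r1 + d, bp + 1
        # Instruction 2: R1=Mul R0,c and Store a++,R2
        a.append(r2)
        r1 = r0 * c
    return a


result = two_unit_loop([1, 2, 3, 4, 5], 3, 7)
```

Each pass through the loop finishes the element loaded on the previous pass, so the result is again b[i]*3+7 for every i.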

In common VLIW terminology, the number of instructions in the loop body, known to the rest of humanity as "throughput", is called "initiation interval". "Modulo scheduling" is presumably so named because the instructions implementing a loop body are scheduled "modulo initiation interval". In our second example, the operations in the sequence Load, Mul, Add, Store go to instructions 0,1,0,1 = 0%2,1%2,2%2,3%2. In our first example, everything goes to i%1=0 – which is why I needed an example with at least 2 instructions in a loop, "modulo 1" being a poor way to illustrate "modulo".
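The "wrapping around" is small enough to spell out. A sketch, assuming a linear schedule with one operation per cycle and single-cycle latencies, and ignoring resource conflicts (the function name is made up):

```python
def modulo_slots(linear_schedule, initiation_interval):
    """Place the operation at linear cycle t into instruction t % II."""
    slots = [[] for _ in range(initiation_interval)]
    for t, op in enumerate(linear_schedule):
        slots[t % initiation_interval].append(op)
    return slots


ops = ["Load", "Mul", "Add", "Store"]
two_cycles = modulo_slots(ops, 2)  # the 2-unit machine
one_cycle = modulo_slots(ops, 1)   # the 4-unit machine
```

For an initiation interval of 2 this gives Load and Add in instruction 0 and Mul and Store in instruction 1, which is the loop body shown earlier; for 1, everything lands in the same instruction.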

In practice, "modulo scheduling" grows more hairy than simply computing the initiation interval, creating a linear schedule for your program and then "wrapping it around" the initiation interval using %. For example, if for whatever reason we couldn't issue Mul and Store at the same cycle, we could still implement the loop at the 2 cycles/iteration throughput, but we'd have to move the Mul forward in our schedule, and adjust the rest accordingly.

I've done this kind of thing manually for some time, and let me assure you that fun it was not. An initiation interval of 3 with 10-15 temporary variables was on the border of my mental capacity. Compilers, on the other hand, are good at this, because you can treat your input program as a uniform graph of operations and their dependencies, and a legal schedule preserving its semantics is relatively easy to define. You have a few annoyances like pointer aliasing which precludes reordering, but it's a reasonably small and closed set of annoyances. Quoting "Embedded Computing: A VLIW Approach" (3.2.1, p. 92): "All of these problems have been solved, although some have more satisfyingly closed-form solution than others." Which is why some people with years of experience on VLIW targets know almost nothing about modulo scheduling – a compiler does a fine job without their help.

The book goes on to say that "Using a VLIW approach without a good compiler is not recommended" – in other words, a human without a compiler will not perform very well. Based on my experience of hand-coding assembly for a VLIW, I second that. I did reach about 95% of the performance of a compiler that was developed later, but the time it took meant that many optimizations just wouldn't fit into a practical release schedule.

Optimizing for SIMD targets

I will try to show that humans optimize well for SIMD targets and compilers don't. I'll quote "Embedded Computing: A VLIW Approach" more extensively in this section. A book on VLIW may not sound like the best source for insight on SIMD; however, I somewhat naturally haven't heard of a book on SIMD stressing how compilers aren't good at optimizing for it. But then I haven't heard of a book stressing the opposite, either, and the success claimed by the papers I saw on automatic vectorization was modest. Furthermore, the particular VLIW book I quote is in fact focusing on embedded DSP where SIMD is ubiquitous, and its central theme is the importance of designing processors in ways making them good targets for optimizing compilers. It sounds like a good place to look for tips on designing compilers to work well with SIMD and vice versa; and if they say they have no such tips, it's telling.

And in fact the bottom line of the discussion on SIMD (which they call "micro-SIMD") is fairly grim: "The ability of compilers to automatically extract micro-SIMD without hints (and in particular, without pointer alignment information) is still unproven, and manual code restructuring is still necessary to exploit micro-SIMD parallelism" (4.1.4, p. 143). This statement from 2005 is consistent with what (AFAIK) compilers can do today. No SIMD-targeted programming environment I know relieves you of the need to use intrinsics in your C code as in "a = Add8(b,c)", where Add8 is a built-in function-looking operator translated to a SIMD instruction.

What I find fascinating though is the way they singled out pointer alignment as a particularly interesting factor necessitating "hints". Sure, most newbies to SIMD are appalled when they find out about the need to align pointers to 16 bytes if you want to use instructions accessing 16 bytes at a time. But how much of a show-stopper can that be if we are to look at the costs and benefits more closely? Aligning pointers is easy, producing run time errors when they aren't is easier, telling a compiler that they are can't be hard (say, gcc has a __vector type modifier telling that), and alternatively generating two pieces of code – optimized for the aligned case and non-optimized for the misaligned case – isn't hard, either (the book itself mentions yet another option – generating a non-optimized loop header and trailer for the misaligned sections of an array).
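The last option, a non-optimized header and trailer around an aligned middle, comes down to a little address arithmetic. A sketch, with a made-up function name and hypothetical addresses, assuming 16-byte SIMD accesses:

```python
def split_for_alignment(addr, n_bytes, width=16):
    """Split n_bytes starting at addr into (head, body, tail) byte counts.

    head: bytes before the first width-aligned address (scalar code)
    body: a whole number of aligned width-byte chunks (SIMD code)
    tail: leftover bytes after the last full chunk (scalar code)
    """
    head = min(n_bytes, (-addr) % width)
    body = (n_bytes - head) // width * width
    tail = n_bytes - head - body
    return head, body, tail


misaligned = split_for_alignment(0x1004, 100)  # starts 4 bytes past a boundary
aligned = split_for_alignment(0x1000, 64)      # already aligned, whole chunks
tiny = split_for_alignment(0x1001, 5)          # too short to reach a boundary
```

In the misaligned case, 12 bytes go to the header, 80 to the SIMD loop and 8 to the trailer; none of this changes what the program computes, which is why a compiler can do it by itself.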

There ought to be more significant reasons for people to be uglifying their code with non-portable intrinsics, and in fact there are. The book even discusses them in the pages preceding the conclusion – but why doesn't it mention the more serious reasons in the conclusion? To me this is revealing of the difference between a programmer's perspective and a compiler writer's perspective, which is related to the difference between optimization and chess: in chess, there are rules.

For an optimizing programmer, SIMD instructions are a resource from which most benefit must be squeezed at any reasonable cost, including tweaking the behavior of the program. For an optimizing compiler, SIMD instructions are something that can be used to implement a piece of source code, in fact the preferable way to implement it – as long as its semantics are preserved. This means that a compiler obeys rules a programmer doesn't, making winning impossible. A typical reaction of a compiler writer is to think of this as not his problem – his problem ending where program transformations preserving the semantics are exhausted. I think this is what explains the focus on things like pointer alignment (which a compiler can in fact solve with a few hints and without affecting the results of the program) at the expense of the substantive issues (which it can't).

In the context of SIMD optimizations, the most significant example of rules obeyed by just one of the contestants has to do with precision, which the book mentions right after alignment in its detailed discussion of the problems with SIMD. "Even when we manipulate byte-sized quantities (as in the case of most pixel-based images, for example), the precision requirements of the majority of manipulation algorithms require keeping a few extra bits around (9, 12, and 16 are common choices) for the intermediate stages of an algorithm. …this forces us up to the next practical size of sub-word … reducing the potential parallelism by a factor of two up front." They go on to say that a 32b register will end up keeping just 2 16b numbers, giving a 2x speed-up – modest considering all the cases when you won't get even that due to other obstacles.

This argument shows the problems precision creates for the hardware implementation of SIMD. However, the precision of intermediate results isn't as hard a problem as this presentation makes it sound, because intermediate results are typically kept in registers, not in memory. So to keep the extra bits in intermediate results, you can either use large registers for SIMD operations and not "general-purpose" 32b ones, or you can keep intermediate results in pairs of registers – as long as you have enough processing units to generate and further process these intermediate results. Both things are done by actual SIMD hardware.

However, the significant problems created by precision lie at the software side: the compiler doesn't know how many bits it will need for intermediate results, nor when precision can be traded for performance. In C, the type of the intermediate results in the expression (a[i]*3+b[i]*c[i])>>d is int (roughly, 32b), even if a, b and c are arrays of 8b numbers, and the parenthesized expression can in fact exceed 16b. The programmer may know that b[i]*c[i] never exceeds, say, 20000 so the whole thing will fit in 16b. That C has no way of specifying precise ranges of values a variable can hold (as opposed to Lisp, of all rivals to the title of the most aggressively optimizing environment) doesn't by itself make an argument since a way could be added, just like gcc added __vector, not to mention the option of using a different language. Specifying the ranges of b[i] and c[i] wouldn't always suffice and we would have to further uglify the code to specify the range of the product (in case both b[i] and c[i] can be large by themselves but never together), but it could be done.
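To make the overflow concrete, here is a Python sketch that emulates a 16-bit lane by masking the intermediate result. The input values and the shift amount d=4 are arbitrary choices for illustration, and the two function names are made up.

```python
def full_precision(a, b, c, d):
    """(a*3 + b*c) >> d with int-sized intermediates, as C would compute it."""
    return (a * 3 + b * c) >> d


def in_16_bit_lane(a, b, c, d):
    """The same expression with the parenthesized sum kept in only 16 bits."""
    return ((a * 3 + b * c) & 0xFFFF) >> d


# Worst case for 8-bit inputs: 765 + 65025 = 65790 does not fit in 16 bits.
wide_ok = full_precision(255, 255, 255, 4)
narrow_wrong = in_16_bit_lane(255, 255, 255, 4)

# If the programmer knows b*c never exceeds 20000, 16 bits are enough.
bounded_wide = full_precision(255, 100, 200, 4)
bounded_narrow = in_16_bit_lane(255, 100, 200, 4)
```

The two versions disagree badly on the worst-case inputs and agree on the bounded ones, but nothing in the source expression tells a compiler which case it is in.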

The real problem with having to specify such information to the compiler isn't the lack of a standard way of spelling it, but that a programmer doesn't know when to do it. If it's me who is responsible for the low-level aspects of optimization, I'll notice the trouble with an intermediate result requiring too many bits to represent. I will then choose whether to handle it by investigating the ranges of b[i] and c[i] and restricting them if needed, by moving the shift by d into the expression as in (a[i]*3>>d)+(b[i]*c[i]>>d) so intermediate results never exceed 16b, or in some other way. But if it's the compiler who's responsible, chances are that I won't know that this problem exists at all.
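Moving the shift into the expression is a good example of a "small" change of semantics. A quick Python check, assuming unsigned 8-bit inputs and d=4 (the function names and the sampling of inputs are arbitrary):

```python
def shift_outside(a, b, c, d):
    """(a*3 + b*c) >> d: the original expression."""
    return (a * 3 + b * c) >> d


def shift_inside(a, b, c, d):
    """(a*3 >> d) + (b*c >> d): each term is shifted before the add."""
    return ((a * 3) >> d) + ((b * c) >> d)


d = 4

# The largest possible value of the rewritten sum for 8-bit inputs.
largest = shift_inside(255, 255, 255, d)

# How far the rewritten expression can drift from the original,
# over a sample of the input space.
errors = {
    shift_outside(a, b, c, d) - shift_inside(a, b, c, d)
    for a in range(0, 256, 5)
    for b in range(0, 256, 7)
    for c in range(0, 256, 11)
}
```

The rewritten sum stays far below 65536, and in exchange the result is sometimes smaller by 1 than the original (for example with a=5, b=1, c=1). Whether that off-by-one matters is exactly the kind of question only someone who knows what the program is for can answer.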

There's a trade-off between performance gains, precision losses and the effort needed to obtain more knowledge about the problem. A person can make these trade-offs because the person knows "what the program really does", and the semantics of the source code are just a rendering of that informal spec from one possible perspective. It's even worse than that – a person actually doesn't know what the program really does until an attempt to optimize it, so even strong AI capable of understanding an informal spec in English wouldn't be a substitute for a person.

A person can say, "Oh, we run out of bits here. OK, so let's drop the precision of the coefficients." Theoretically, and importantly for my claim, strong AI can also say that – but only if it operates as a person and not as a machine. I don't claim that we'll never reach a point where we have a machine powerful enough to join our team as a programmer, just that (1) we probably wouldn't want to and (2) if we would, it wouldn't be called a compiler, it would be called a software developer. That is, you wouldn't press a button and expect to get object code from your source code, you'd expect a conversation: "Hey, why do you need so many bits here – it's just a smoothing filter, do you really think anyone will notice the difference? Do you realize that this generates 4x slower code?" And then someone, perhaps another machine, would answer that yes, perhaps we should drop some of the bits, but let's not overdo it because there are artifacts, and I know you couldn't care less because your job ends here but those artifacts are amplified when we compute the gradient, etc.

This is how persons optimize, and while a machine could in theory act as a person, it would thereby no longer be a compiler. BTW, we have a compiler at work that actually does converse with you – it says that it will only optimize a piece of code if you specify that the minimal number of iterations executed is such and such; I think it was me who proposed to handle that case using conversation. So this discussion isn't pure rhetoric. I really wish compilers had a -warn-about-missed-optimization-opportunities switch that would give advice of this kind; it would help in a bunch of interesting cases. I just think that in some cases, precision being one of them, the amount and complexity of interactions needed to make headway like that exceeds the threshold separating aggressive optimization from aggressive lunacy.

To be sure, there are optimization problems that could be addressed by strong AI. In the case of SIMD, the book mentions one such area – they call it "Pack, Unpack, and Mix". "Some programs require rearranging the sub-words within a container to deal with the different sub-word layouts. From time to time, the ordering of the sub-words within a word (for example, coming from loading a word from memory) does not line up with the parallelism in the code… The only solution is to rearrange the sub-words within the containers through a set of permutation or copying operations (for example, the MIX operation in the HP PA-RISC MAX-2 extension)."

An example of this reordering problem is warping: computing a[i]=b[i*step+shift]. This is impossible to do in SIMD without a permutation instruction of the kind they mention (PowerPC's AltiVec has vec_perm, and AFAIK x86's SSE has nothing so you can't warp very efficiently). However, even if an instruction is available, compilers are AFAIK unable to exploit it. I see no reason why sufficiently strong AI couldn't manage to do such things with few hints in some interesting cases. I wouldn't bet my money on it – I side with Mitch Kapor on the Turing Test bet, but it is conceivable like the invincible chess playing program, and unlike transformations requiring "small" changes of the semantics.
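As a rough picture of what such a permutation instruction buys you, here is a toy Python model of a two-source permute, loosely inspired by the idea of selecting lanes from a pair of registers. It is a simplification, not the exact semantics of vec_perm or any other real instruction, and the function names are invented.

```python
def permute(lo, hi, control):
    """Pick output lanes from the concatenation of two source registers."""
    both = lo + hi
    return [both[i] for i in control]


def warp_control(lanes, step, shift):
    """Control vector for a[i]=b[i*step+shift] over one output register."""
    return [i * step + shift for i in range(lanes)]


# Hypothetical data: 32 consecutive elements of b, i.e. two 16-lane registers.
b = list(range(100, 132))
lo, hi = b[:16], b[16:]

# One permute produces 16 output elements for step=2, shift=1.
a = permute(lo, hi, warp_control(16, 2, 1))
```

The control vector depends only on step and shift, so it can be computed once outside the loop; spotting that, and choosing which source registers to load, is the part a programmer currently does by hand.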

Significance

There are areas of optimization that are very significant commercially but hardly interesting in a theoretical discussion (and this here's a distinctively theoretical discussion as is any discussion where the possibility of strong AI is supposed to be taken into account).

For example, register allocation for the x86 is exceedingly gnarly and perhaps an interesting argument could be made to defend the need for human intervention in this process in extreme cases (I wouldn't know since I never seriously optimized for the x86). However, a general claim that register allocation makes compiler optimization hard wouldn't follow from such an argument: on a machine with plentiful and reasonably uniform registers, it's hard to imagine what a human can do that a compiler can't do better, and almost everybody would agree that the single reason for not making hardware that way is a commercial one – to make an x86-compatible processor.

Now, I believe that both SIMD and VLIW instruction encodings don't have this accidental nature, and more likely are part of the Right Way of designing high-performance processors (assuming that it makes no sense to move cost from software to hardware and call that a "performance gain", that is, assuming that performance is measured per square millimeter of silicon). One argument of rigor worthy of a pop tech blog is that most high-end processors have converged to SIMD VLIW: they have instructions processing short vectors and they can issue multiple instructions in parallel; some do the latter in the "superscalar" way of having the hardware analyze dependencies between instructions at run time and others do it in the "actual VLIW" way of having the lack of dependencies proven and marked by the compiler, but you end up doing modulo scheduling anyway.

However, this can of course indicate uninformed consumer preference rather than actual utility (I type this on a noisy Core 2 Duo box running Firefox on top of XP, a job better handled by a cheaper, silent single-core – and I'm definitely a consumer who should have known better). So my main reasons for believing VLIW and SIMD are "right" are abstract considerations on building von Neumann machines:

  • You typically have lots of distinct execution hardware: a multiplier has little in common with a load/store unit. Up to a point, it will therefore make sense to support parallel execution of instructions on the different execution hardware. The cost of supporting it will be more I/O ports connecting the execution units with the register file – quite serious because of the multiplexers selecting the registers to read/write. However, the cost of not supporting it will be more execution hardware left unused for more time. So the optimum is unlikely to be "no parallel execution", it's likely "judicious parallel execution".
  • It is cheaper to have few wide registers and wide buses between the register file and the execution units than it is to have many narrow registers and buses. That's because the cost of the register file is proportional to the product of #registers and #buses to the execution units. It is thus significantly cheaper to have 1 unit with 4 8bx8b multipliers and 2 32b buses for the inputs than it is to have 4 units with 1 8bx8b multiplier in each and 8 8b buses for the inputs. It's also cheaper to keep 4 bytes in 1 32b register than in 4 8b registers. Likewise, it is cheaper to have 4 multipliers in 1 processor than to have 4 full-blown processor cores, because each core would have, say, its own fetch and decode logic and instruction cache – which are in fact pure overhead. So if you have a von Neumann machine with registers and buses and instruction cache, it makes sense (up to a point) to add SIMD to make the best of that investment, and this is why commercial VLIWs have SIMD, although the VLIW theory recommends more units instead.

Since I believe that both VLIW and SIMD are essential for maximizing hardware performance, I also tend to think that optimizations needed to utilize these features are "mainstream" enough to support a broad claim about optimization in general. And the main point of my claim is that compilers can't win in the optimization game, because part of that game is the ability to change the rules once you start losing.

Humans faced with a program they fail to optimize change the program, sometimes a little, sometimes a lot – I heard of 5×5 filters made 4×4 to run on a DSP. But even if we exclude the truly shameless cheating of the latter kind, the gentler cheating going into every serious optimization effort still requires negotiating and taking responsibility in a way that a person – human or artificial – can, but a tool like a compiler cannot.

Modulo scheduling is an example of the kinds of optimizations which in fact are best left to a compiler – the ones where rules are fixed: once the code is spelled, few can be added by further annotations by the author and hence the game can be won without much negotiation with the author; although sometimes a little interrogation can't hurt.

100 comments ↓

#1 Pierre R. Mai on 07.28.09 at 3:09 am

I've always thought that the whole discussion of compiler vs. programmer was a bit pointless, since ideally you always want both, exactly for the reasons you state. Furthermore an optimizing compiler can trim the "low-hanging fat" of code in non-critical sections, which lets the programmer concentrate on hot-spots, while still often giving enough savings in space or time cumulatively to make meeting performance/space goals easier.

W.r.t. the compiler giving better information on missed optimizations, allowing the programmer to iteratively revise his algorithms and/or inform the compiler better about possible optimizations: This is one of the major benefits of the SBCL/CMUCL optimizing Common Lisp compiler (called Python), which gives fairly informative optimization notes/hints, explaining exactly why it can't perform certain optimizations, including (naive) severity information. Given that the Python compiler was introduced in 1990, I find it sad that similar optimization advice hasn't been integrated into more compilers.

#2 Ryan on 07.28.09 at 7:58 am

Great article. I love seeing some more depth on some topics that were sort of "glossed over" in my undergrad education. My current job doesn't really flex any low-level muscles, so thanks yosefk for giving me a little mental workout.

#3 wm on 07.30.09 at 3:35 pm

SSE has PSHUFB that does permutation of any kind. And AFAIR is fast.

#4 Yossi Kreinin on 07.31.09 at 10:55 am

@wm: interesting! Available starting from SSSE3 and I only got to work against SSE2 because of compatibility reasons, but I should have paid attention.

Here they claim that it's only fast in the newest cores; I also remember reading a document by Apple saying how there wasn't any substitute for vec_perm to whoever ports from PowerPC-based Macs to x86-based ones. I guess this problem is becoming increasingly irrelevant then.

#5 Josh Fisher on 08.03.09 at 11:24 am

I love this stuff. There's much too little of this kind of discussion around. My colleagues and I have written a couple of papers on the general problem of trying to abstract code improvement (or even architecture improvement) from the application and from the expert. People wildly overestimate the possibility of complete independence.

A fun area for an attempt at Strong AI, but a tall, tall hill to climb.

#6 Yossi Kreinin on 08.03.09 at 11:15 pm

Glad you liked it!

There actually are a bunch of commercial products which let you play with improving code and architecture together at different levels of automation – Stretch definitely did that and I think Tensilica, CoWare, Target etc. offer that, too. I came to think that the bottleneck for such tools is the scarcity of people cunning enough to usefully apply them. (There's another thing I think of as a bottleneck, namely, my conviction that at some point of squeezing performance you need to go beyond tricky datapath operations into "non-conventional" load/store and fairly hard-wired state machines; I'd love to discuss examples but it would be a somewhat lengthy discussion).

#7 Nadav Rotem on 07.17.10 at 1:39 am

Hi Yossi,
My name is Nadav and I am a PhD candidate researching compilers. In my research I also consider Modulo Scheduling for SIMD-like architectures (such as GPUs). It would be interesting for me to read your ideas about GPUs and how they should be programmed. From a compiler perspective, they raise interesting challenges.

#8 Yossi Kreinin on 07.17.10 at 8:17 am

Hi Nadav,

I have no practical experience in (GP)GPU, and the only GPU architecture I'm roughly familiar with is Nvidia's machines targeted by CUDA. My impression was that on those machines you got way closer to C (without intrinsics) than in traditional SIMD because of "graceful performance degradation" in their SIMT model – handling memory contentions and branches automatically, etc.; thus I thought you got fewer challenges in the parts traditionally handled by compilers, with most of the challenges being memory management – something C pretty much has you doing manually on most machines, the only question being how tough it is on a particular machine. Therefore it should be interesting to me to read your thoughts on the challenges of programming/compiling for GPGPU… (I also see you've been dealing with memory management in your work).

As to compilers in general – real compiler writers work on implementations for existing specifications, dropouts work on new specifications. I'm a dropout; my reaction to a weird architecture or problem, if/when I get a chance to attack it from the compilation angle, is to make language extensions for modeling it rather than to improve the handling of that problem in an existing language. (One sad fact about the PL world is that the good compiler writers end up spending lifetimes of work attempting to produce reasonable implementations of hopelessly awful specs – compile C++ reasonably fast, run Python or JS reasonably fast, etc.)

#9 Jeff on 05.30.13 at 12:42 pm

Good stuff. Lately MPPA (Massively Parallel Processing Array) type processors have been introduced. It's basically MIMD, Multiple Instruction Multiple Data. Some say MIMD is better than VLIW+SIMD since each core has its own control mechanism (branching), which possibly helps reduce the waste caused by predicated instructions when there is divergence between multiple data elements. Do you have any thoughts on that?

#10 Yossi Kreinin on 05.30.13 at 1:00 pm

Well, there are many kinds of systems which can be called "MPPAs"; some of these systems are covered in this writeup: http://www.yosefk.com/blog/will-opencl-help-displace-gpgpu-parallella-p2012.html

One general question which is answered very differently by the different "many-core" systems is how the cores cooperate – how the sharing of memory and synchronization works. PicoChip-style interconnect is very different from Adapteva's Epiphany interconnect, which in turn is very different from ST's P2012 interconnect, etc.

If we don't talk about these details, then the most general thing we can ask is, can many threads be better than one? My answer is, yes they can; however on a boatload of workloads, VLIW SIMD is just vastly more efficient in terms of energy/silicon area spent per task than any kind of "MIMD" if MIMD is understood as a bunch of threads, each with its own program counter; this I'm very, very sure about.

