SIMD < SIMT < SMT: parallelism in NVIDIA GPUs

Programmable NVIDIA GPUs are very inspiring to hardware geeks, proving that processors with an original, incompatible programming model can become widely used.

NVIDIA call their parallel programming model SIMT - "Single Instruction, Multiple Threads". Two other different, but related parallel programming models are SIMD - "Single Instruction, Multiple Data", and SMT - "Simultaneous Multithreading". Each model exploits a different source of parallelism:

  • In SIMD, elements of short vectors are processed in parallel.
  • In SMT, instructions of several threads are run in parallel.
  • SIMT is somewhere in between – an interesting hybrid between vector processing and hardware threading.

My presentation of SIMT is focused on hardware architecture and its implications on the trade-off between flexibility and efficiency. I'll describe how SIMT is different from SIMD and SMT, and why – what is gained (and lost) through these differences.

From a hardware design perspective, NVIDIA GPUs are at first glance really strange. The question I'll try to answer is "why would you want to build a processor that way?" I won't attempt to write a GPU programming tutorial, or quantitatively compare GPUs to other processors.

SIMD < SIMT < SMT

It can be said that SIMT is a more flexible SIMD, and SMT is in turn a more flexible SIMT. Less flexible models are generally more efficient – except when their lack of flexibility makes them useless for the task.

So in terms of flexibility, SIMD < SIMT < SMT. In terms of performance, SIMD > SIMT > SMT, but only when the models in question are flexible enough for your workload.

SIMT vs SIMD

SIMT and SIMD both approach parallelism through broadcasting the same instruction to multiple execution units. This way, you replicate the execution units, but they all share the same fetch/decode hardware.

If so, what's the difference between "single instruction, multiple data", and single instruction, multiple threads"? In NVIDIA's model, there are 3 key features that SIMD doesn't have:

  1. Single instruction, multiple register sets
  2. Single instruction, multiple addresses
  3. Single instruction, multiple flow paths

We'll see how this lifts restrictions on the set of programs that are possible to parallelize, and at what cost.

Single instruction, multiple register sets

Suppose you want to add two vectors of numbers. There are many ways to spell this. C uses a loop spelling:

for(i=0;i<n;++i) a[i]=b[i]+c[i];

Matlab uses a vector spelling:

a=b+c;

SIMD uses a "short vector" spelling – the worst of both worlds. You break your data into short vectors, and your loop processes them using instructions with ugly names. An example using C intrinsic functions mapping to ARM NEON SIMD instructions:

void add(uint32_t *a, uint32_t *b, uint32_t *c, int n) {
  for(int i=0; i<n; i+=4) {
    //compute c[i], c[i+1], c[i+2], c[i+3]
    uint32x4_t a4 = vld1q_u32(a+i);
    uint32x4_t b4 = vld1q_u32(b+i);
    uint32x4_t c4 = vaddq_u32(a4,b4);
    vst1q_u32(c+i,c4);
  }
}

SIMT uses a "scalar" spelling:

__global__ void add(float *a, float *b, float *c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  a[i]=b[i]+c[i]; //no loop!
}

The weird __global__ keyword says that add() is a GPU thread entry point. blockIdx, blockDim and threadIdx are built-in thread-local variables keeping the thread's ID. We'll see later why a thread ID isn't just a single number; however, in this example we in fact convert it to a single number, and use it as the element index.

The idea is that the CPU spawns a thread per element, and the GPU then executes those threads. Not all of the thousands or millions of threads actually run in parallel, but many do. Specifically, an NVIDIA GPU contains several largely independent processors called "Streaming Multiprocessors" (SMs), each SM hosts several "cores", and each "core" runs a thread. For instance, Fermi has up to 16 SMs with 32 cores per SM – so up to 512 threads can run in parallel.

All threads running on the cores of an SM at a given cycle are executing the same instruction – hence Single Instruction, Multiple Threads. However, each thread has its own registers, so these instructions process different data.

Benefits

"Scalar spelling", where you write the code of a single thread using standard arithmetic operators, is arguably a better interface than SIMD loops with ugly assembly-like opcodes.

Syntax considerations aside, is this spelling more expressive – can it do things SIMD can't? Not by itself, but it dovetails nicely with other features that do make SIMT more expressive. We'll discuss those features shortly; in theory, they could be bolted on the SIMD model, but they never are.

Costs

From a hardware resources perspective, there are two costs to the SIMT way:

  • Registers spent to keep redundant data items. SIMT registersIn our example, the pointers a, b, and c have the same value in all threads. The values of i are different across threads, but in a trivial way – for instance, 128, 129, 130… In SIMD, a, b, and c would be kept once in “scalar” registers – only short vectors such as a[i:i+4] would be kept in “vector” registers. The index i would also be kept once – several neighbor elements starting from i would be accessed without actually computing i+1, i+2, etc. Redundant computations both waste registers and needlessly consume power. Note, however, that a combination of compiler & hardware optimizations could eliminate the physical replication of redundant values. I don't know the extent to which it's done in reality.
  • Narrow data types are as costly as wide data types. SIMD registersA SIMD vector register keeping 4 32b integers can typically also be used to keep 8 16b integers, or 16 8b ones. Similarly, the same ALU hardware can be used for many narrow additions or fewer wide ones – so 16 byte pairs can be added in one cycle, or 4 32b integer pairs. In SIMT, a thread adds two items at a time, no matter what their width, wasting register bits and ALU circuitry.

It should be noted that SIMT can be easily amended with SIMD extensions for narrow types, so that each thread processes 4 bytes at a time using ugly assembly-like opcodes. AFAIK, NVIDIA refrains from this, presumably assuming that the ugliness is not worth the gain, with 32b float being the most popular data type in graphics anyway.

Single instruction, multiple addresses

Let's apply a function approximated by a look-up table to the elements of a vector:

__global__ void apply(short* a, short* b, short* lut) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  a[i] = lut[b[i]]; //indirect memory access
}

Here, i is again "redundant" in the sense that in parallel threads, the values of i are consecutive. However, each thread then accesses the address lut+b[i] – and these addresses are not consecutive.

Roughly, such parallel random access works for both loads and stores. Logically, stores are the trickier part because of conflicts. What if two or more threads attempt to, say, increment the same bin in a histogram? Different NVIDIA GPU generations provide different solutions that we won't dwell on.

Benefits

This feature lets you parallelize many programs you can't with SIMD. Some form of parallel random access is usually available on SIMD machines under the names "permutation", "shuffling", "table look-up", etc. However, it always works with registers, not with memory, so it's just not the same scale. You index into a table of 8, 16 or 32 elements, but not 16K.

As previously mentioned, in theory this feature can be bolted on the SIMD model: just compute your addresses (say, lut+b[i]) in vector registers, and add a rand_vec_load instruction. However, such an instruction would have a fairly high latency. As we'll see, the SIMT model naturally absorbs high latencies without hurting throughput; SIMD much less so.

Costs

GPU has many kinds of memory: external DRAM, L2 cache, texture memory, constant memory, shared memory… We'll discuss the cost of random access in the context of two memories "at the extremes": DRAM and shared memory. DRAM is the farthest from the GPU cores, sitting outside the chip. Shared memory is the closest to the cores – it's local to an SM, and the cores of an SM can use it to share results with each other, or for their own temporary data.

  • With DRAM memory, random access is never efficient. In fact, the GPU hardware looks at all memory addresses that the running threads want to access at a given cycle, and attempts to coalesce them into a single DRAM access – in case they are not random. Effectively the contiguous range from i to i+#threads is reverse-engineered from the explicitly computed i,i+1,i+2… – another cost of replicating the index in the first place. If the indexes are in fact random and can not be coalesced, the performance loss depends on "the degree of randomness". This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it – similarly to any other processor.
  • With shared memory, random access is slowed down by bank contentions. Generally, a hardware memory module will only service one access at a time. So shared memory is organized in independent banks; the number of banks for NVIDIA GPUs is 16. If x is a variable in shared memory, then it resides in bank number (&x/4)%16. In other words, if you traverse an array, the bank you hit changes every 4 bytes. Access throughput peaks if all addresses are in different banks – hardware contention detection logic always costs latency, but only actual contentions cost throughput. If there's a bank hosting 2 of the addresses, the throughput is 1/2 of the peak; if there's a bank pointed by 3 addresses, the throughput is 1/3 of the peak, etc., the worst slowdown being 1/16.

SIMT bank contentions

In theory, different mappings between banks and addresses are possible, each with its own shortcomings. For instance, with NVIDIA's mapping, accessing a contiguous array of floats gives peak throughput, but a contiguous array of bytes gives 1/4 of the throughput (since banks change every 4 bytes). Many of the GPU programming tricks aim at avoiding contentions.

For instance, with a byte array, you can frequently work with bytes at distance 4 from each other at every given step. Instead of accessing a[i] in your code, you access a[i*4], a[i*4+1], a[i*4+2] and a[i*4+3] – more code, but less contentions.

This sounds convoluted, but it's a relatively cheap way for the hardware to provide efficient random access. It also supports some very complicated access patterns with good average efficiency – by handling the frequent case of few contentions quickly, and the infrequent case of many contentions correctly.

Single instruction, multiple flow paths

Let's find the indexes of non-zero elements in a vector. This time, each thread will work on several elements instead of just one:

__global__ void find(int* vec, int len,
                     int* ind, int* nfound,
                     int nthreads) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int last = 0;
  int* myind = ind + tid*len;
  for(int i=tid; i<len; i+=nthreads) {
    if(vec[i]) { //flow divergence
      myind[last] = i;
      last++;
    }
  }
  nfound[tid] = last;
}

Each thread processes len/nthreads elements, starting at the index equal to its ID with a step of nthreads. We could make each thread responsible for a more natural contiguous range using a step of 1. The way we did it is better in that accesses to vec[i] by concurrent threads address neighbor elements, so we get coalescing.

The interesting bit is if(vec[i]) – depending on vec[i], some threads execute the code saving the index, some don't. The control flow of different threads can thus diverge.

Benefits

Support for divergence further expands the set of parallelizable programs, especially when used together with parallel random access. SIMD has some support for conditional execution though vector "select" operations: select(flag,a,b) = if flag then a else b. However, select can't be used to suppress updates to values – the way myind[last] is never written by threads where vec[i] is 0.

SIMD instructions such as stores could be extended to suppress updates based on a boolean vector register. For this to be really useful, the machine probably also needs parallel random access (for instance, the find() example wouldn't work otherwise). Unless what seems like an unrealistically smart compiler arrives, this also gets more and more ugly, whereas the SIMT spelling remains clean and natural.

Costs

  • Only one flow path is executed at a time, and threads not running it must wait.SIMT divergence Ultimately SIMT executes a single instruction in all the multiple threads it runs – threads share program memory and fetch / decode / execute logic. When the threads have the same flow – all run the if, nobody runs the else, for example – then they just all run the code in if at full throughput. However, when one or more threads have to run the else, they'll wait for the if threads. When the if threads are done, they'll in turn wait for the else threads. Divergence is handled correctly, but slowly. Deeply nested control structures effectively serialize execution and are not recommended.
  • Divergence can further slow things down through "randomizing" memory access. In our example, all threads read vec[i], and the indexing is tweaked to avoid contentions. However, when myind[last] is written, different threads will have incremented the last counter a different number of times, depending on the flow. This might lead to contentions which also serialize execution to some extent. Whether the whole parallelization exercise is worth the trouble depends on the flow of the algorithm as well as the input data.

We've seen the differences between SIMT and its less flexible relative, SIMD. We'll now compare SIMT to SMT – the other related model, this time the more flexible one.

SIMT vs SMT

SIMT and SMT both use threads as a way to improve throughput despite high latencies. The problem they tackle is that any single thread can get stalled running a high-latency instruction. This leaves the processor with idle execution hardware.

One way around this is switching to another thread – which (hopefully) has an instruction ready to be executed – and then switch back. For this to work, context switching has to be instantaneous. To achieve that, you replicate register files so that each thread has its own registers, and they all share the same execution hardware.

But wait, doesn't SIMT already replicate registers, as a way to have a single instruction operate on different data items? It does – here, we're talking about a "second dimension" of register replication:

  1. Several threads – a "warp" in NVIDIA terminology – run simultaneously. So each thread needs its own registers.
  2. Several warps, making up a "block", are mapped to an SM, and an SM instantaneously switches between the warps of a block. So each warp needs separate registers for each of its threads.

SIMT 2D replication

With this "two-dimensional" replication, how many registers we end up with? Well, a lot. A block can have up to 512 threads. And the registers of those threads can keep up to 16K of data.

How many register sets does a typical SMT processor have? Er, 2, sometimes 4…

Why so few? One reason is diminishing returns. When you replicate registers, you pay a significant price, in the hope of being able to better occupy your execution hardware. However, with every thread you add, the chance of it being already occupied by all the other threads rises. Soon, the small throughput gain just isn't worth the price.

If SMT CPU designers stop at 2 or 4 threads, why did SIMT GPU designers go for 512?

With enough threads, high throughput is easy

SMT is an afterthought – an attempt to use idle time on a machine originally designed to not have a lot of idle time to begin with. The basic CPU design aims, first and foremost, to run a single thread fast. Splitting a process to several independent threads is not always possible. When it is possible, it's usually gnarly.

Even in server workloads, where there's naturally a lot of independent processing, single-threaded latency still matters. So few expensive, low-latency cores outperform many cheap, high-latency cores. As Google's Urs Hölzle put it, "brawny cores beat wimpy cores". Serial code has to run fast.

Running a single thread fast means being able to issue instructions from the current thread as often as possible. To do that, CPU hardware works around every one of the many reasons to wait. Such diverse techniques as:

  • superscalar execution
  • out-of-order execution
  • register renaming
  • branch prediction
  • speculative execution
  • cache hierarchy
  • speculative prefetching
  • etc. etc.

…are all there for the same basic purpose. They maximize the chances of an instruction to be issued without having to switch to another thread.

SMT is the last line of defense, attempting to fill stalls after all these other measures failed. And even that is considered a bad idea when it hurts the precious single-threaded performance. Which it usually does: each of the 2 threads will typically complete later then it would if it didn't have to share the hardware with the other. This is a key reason to keep the number of hardware threads low, even if there are still throughput gains to be made by adding threads.

However, for the GPUs, the use case is when you do have enough data parallelism to make use of plenty of threads. If so, why not build things the other way around? Threading could be our first stall-mitigating measure. If we have enough threads, we just keep switching between them, and the hardware always has something to do.

SIMT/SMT latency

This saves a lot of hardware, and a lot of design effort, because you don't need most of the other methods anymore. Even caches and hardware prefetching are not used in much of the GPU memory traffic – rather, you access external memory directly. Why bother with caching and prefetching, if you don't have to sit idly until the data arrives from main memory – but instead just switch to a different warp? No heuristics, no speculation, no hurry – just keep yourself busy when waiting.

Furthermore, even the basic arithmetic pipeline is designed for a high latency, high throughput scenario. According to the paper "Demystifying GPU architecture through microbenchmarking", no operation takes less than 24 cycles to complete. However, the throughput of many operations is single-cycle.

The upshot is that counting on the availability of many threads allows the GPU to sustain a high throughput without having to sweat for low latencies. Hardware becomes simpler and cheaper in many areas as a result.

When latencies are high, registers are cheap

So plentiful threads make it easier to build high-throughput hardware. What about having to replicate all those registers though? 16K sounds like an insane amount of registers – how is this even affordable?

Well, it depends on what "register" means. The original meaning is the kind of storage with the smallest access latency. In CPUs, access to registers is faster than access to L1 caches, which in turn is faster than L2, etc. Small access latency means an expensive implementation, therefore there must be few registers.

However, in GPUs, access to "registers" can be quite slow. Because the GPU is always switching between warps, many cycles pass between two subsequent instructions of one warp. The reason registers must be fast in CPUs is because subsequent instructions communicate through them. In GPUs, they also communicate through registers – but the higher latency means there's no rush.

Therefore, GPU registers are only analogous to CPU registers in terms of instruction encoding. In machine code, "registers" are a small set of temporary variables that can be referenced with just a few bits of encoding – unlike memory, where you need a longer address to refer to a variable. In this and other ways, "registers" and "memory" are semantically different elements of encoding – both on CPUs and GPUs.

However, in terms of hardware implementation, GPU registers are actually more like memory than CPU registers. [Disclaimer: NVIDIA doesn't disclose implementation details, and I'm grossly oversimplifying, ignoring things like data forwarding, multiple access ports, and synthesizable vs custom design]. 16K of local RAM is a perfectly affordable amount. So while in a CPU, registers have to be expensive, they can be cheap in a high-latency, high-throughput design.

It's still a waste if 512 threads keep the same values in some of their registers – such as array base pointers in our examples above. However, many of the registers keep different values in different threads. In many cases register replication is not a waste at all – any processor would have to keep those values somewhere. So functionally, the plentiful GPU registers can be seen as a sort of a data cache.

Drawbacks

We've seen that:

  • Many threads enable cheap high-throughput, high-latency design
  • A high-throughput, high-latency design in turn enables a cheap implementation of threads' registers

This leads to a surprising conclusion that SIMT with its massive threading can actually be cheaper than SMT-style threading added to a classic CPU design. Not unexpectedly, these cost savings come at a price of reduced flexibility:

  1. Low occupancy greatly reduces performance
  2. Flow divergence greatly reduces performance
  3. Synchronization options are very limited

Occupancy

"Occupancy" is NVIDIA's term for the utilization of threading. The more threads an SM runs, the higher its occupancy. Low occupancy obviously leads to low performance – without enough warps to switch between, the GPU won't be able to hide its high latencies. The whole point of massive threading is refusing to target anything but massively parallel workloads. SMT requires much less parallelism to be efficient.

Divergence

We've seen that flow divergence is handled correctly, but inefficiently in SIMT. SMT doesn't have this problem – it works quite well given unrelated threads with unrelated control flow.

There are two reasons why unrelated threads can't work well with SIMT:

  • SIMD-style instruction broadcasting – unrelated threads within a warp can't run fast.
  • More massive threading than SMT – unrelated wraps would compete for shared resources such as instruction cache space. SMT also has this problem, but it's tolerable when you have few threads.

So both of SIMT's key ideas – SIMD-style instruction broadcasting and SMT-style massive threading – are incompatible with unrelated threads.

Related threads – those sharing code and some of the data – could work well with massive threading by itself despite divergence. It's instruction broadcasting that they fail to utilize, leaving execution hardware in idle state.

However, it seems that much of the time, related threads actually tend to have the same flow and no divergence. If this is true, a machine with massive threading but without instruction broadcasting would miss a lot of opportunities to execute its workload more efficiently.

Synchronization

In terms of programming model, SMT is an extension to a single-threaded, time-shared CPU. The same fairly rich set of inter-thread (and inter-device) synchronization and communication options is available with SMT as with "classic" single-threaded CPUs. This includes interrupts, message queues, events, semaphores, blocking and non-blocking system calls, etc. The underlying assumptions are:

  • There are quite many threads
  • Typically, each thread is doing something quite different from other threads
  • At any moment, most threads are waiting for an event, and a small subset can actually run

SMT stays within this basic time-sharing framework, adding an option to have more than one actually running threads. With SMT, as with a "classic" CPU, a thread will be very typically put "on hold" in order to wait for an event. This is implemented using context switching – saving registers to memory and, if a ready thread is found, restoring its registers from memory so that it can run.

SIMT doesn't like to put threads on hold, for several reasons:

  • Typically, there are many running, related threads. It would make the most sense to put them all on hold, so that another, unrelated, equally large group of threads can run. However, switching 16K of context is not affordable. In this sense, "registers" are expensive after all, even if they are actually memory.
  • SIMT performance depends greatly on there being many running threads. There's no point in supporting the case where most threads are waiting, because SIMT wouldn't run such workloads very well anyway. From the use case angle, a lot of waiting threads arise in "system"/"controller" kind of software, where threads wait for files, sockets, etc. SIMT is purely computational hardware that doesn't support such OS services. So the situation is both awkward for SIMT and shouldn't happen in its target workloads anyway.
  • Roughly, SIMT supports data parallelism – same code, different data. Data parallelism usually doesn't need complicated synchronization – all threads have to synchronize once a processing stage is done, and otherwise, they're independent. What requires complicated synchronization, where some threads run and some are put on hold due to data dependencies, is task parallelism – different code, different data. However, task parallelism implies divergence, and SIMT isn't good at that anyway – so why bother with complicated synchronization?

Therefore, SIMT roughly supports just one synchronization primitive – __syncthreads(). This creates a synchronization point for all the threads of a block. You know that if a thread runs code past this point, no thread runs code before this point. This way, threads can safely share results with each other. For instance, with matrix multiplication:

  1. Each thread, based on its ID – x,y – reads 2 elements, A(x,y) and B(x,y), from external memory to the on-chip shared memory. (Of course large A and B won't fit into shared memory, so this will be done block-wise.)
  2. Threads sync – all threads can now safely access all of A and B.
  3. Each thread, depending on the ID, multiplies row y in A by column x in B.
//1. load A(x,y) and B(x,y)
int x = threadIdx.x;
int y = threadIdx.y;
A[stride*y + x] = extA[ext_stride*y + x];
B[stride*y + x] = extB[ext_stride*y + x];
//2. sync
__syncthreads();
//3. multiply row y in A by column x in B
float prod = 0;
for(int i=0; i<N; ++i) {
  prod += A[stride*y + i] * B[stride*i + x];
}

It's an incomplete example (we look at just one block and ignore blockIdx, among other thing), but it shows the point of syncing – and the point of these weird "multi-dimensional" thread IDs (IDs are x,y,z coordinates rather than just integers). It's just natural, with 2D and 3D arrays, to map threads and blocks to coordinates and sub-ranges of these arrays.

Summary of differences between SIMD, SIMT and SMT

SIMT is more flexible in SIMD in three areas:

  1. Single instruction, multiple register sets
  2. Single instruction, multiple addresses
  3. Single instruction, multiple flow paths

SIMT is less flexible than SMT in three areas:

  1. Low occupancy greatly reduces performance
  2. Flow divergence greatly reduces performance
  3. Synchronization options are very limited

The effect of flexibility on costs was discussed above. A wise programmer probably doesn't care about these costs – rather, he uses the most flexible (and easily accessible) device until he runs out of cycles, then moves on to utilize the next most flexible device. Costs are just a limit on how flexible a device that is available in a given situation can be.

"SIMT" should catch on

It's a beautiful idea, questioning many of the usual assumptions about hardware design, and arriving at an internally consistent answer with the different parts naturally complementing each other.

I don't know the history well enough to tell which parts are innovations by NVIDIA and which are borrowed from previous designs. However, I believe the term SIMT was coined by NVIDIA, and perhaps it's a shame that it (apparently) didn't catch, because the architecture "deserves a name" – not necessarily true of every "new paradigm" announced by marketing.

One person who took note is Andy Glew, one of Intel's P6 architects – in his awesome computer architecture wiki, as well as in his presentation regarding further development of the SIMT model.

The presentation talks about neat optimizations – "rejiggering threads", per-thread loop buffers/time pipelining – and generally praises SIMT superiority over SIMD. Some things I disagree with – such as the "vector lane crossing" issue – and some are very interesting, such as everything about improving utilization of divergent threads.

I think the presentation should be understandable after reading my oversimplified overview – and will show where my overview oversimplifies, among other things.

Peddling fairness

Throughout this overview, there's this recurring "fairness" idea: you can trade flexibility for performance, and you can choose your trade-off.

It makes for a good narrative, but I don't necessarily believe it. You might very well be able to get both flexibility and performance. More generally, you might get both A and B, where A vs B intuitively feels like "an inherent trade-off".

What you can't do is get A and B in a reasonably simple, straightforward way. This means that you can trade simplicity for almost everything – though you don't necessarily want to; perhaps it's a good subject for a separate post.

27 comments ↓

#1 Sergey on 11.10.11 at 11:39 pm

Great article as always. Thanks a bunch!

#2 Yossi Kreinin on 11.11.11 at 1:26 am

Glad you liked it.

#3 Matt Pharr on 11.12.11 at 8:50 pm

This is a really nice writeup; I think it explains a lot of the details of this stuff very well. A few comments/thoughts..

First, to argue a detail, toward the end you make the connection between SIMT and massive numbers of threads to hide latency. I think that these two things are actually orthogonal; one could build a SIMD processor that also did the same trick for latency hiding, for example. It so happens that NV's SIMT processors today hide latency with many threads, but strictly speaking I think the two things are basically independent.

Another general issue is that one can provide a SIMT model to the user on a variety of underlying h/w architectures. For example, the ispc compiler (http://ispc.github.com) provides a ~SIMT model on the CPU (that turns out to be a really good way to use the CPU's vector units). In that case, there is more work to do in the compiler and generated code to deal with control flow divergence, doing scatters/gathers to memory efficiently, etc., but in the end, it's just a different set of trade-offs of what's done in the hardware and what's done by the compiler.

(One nice thing about SIMT on the CPU is that some of the issues you mention about redundant storage of data items with the same value aren't a problem there–ispc for example provides a 'uniform' data type qualifier that indicates that a variable has the same value for all of the SIMD lanes, which directly addresses the issue you bring up here.)

From what I've seen of AMD's GPU ISA, they are at a middle ground between today's CPUs and NV's GPUs in this respect, where their documented ISA has some SIMD-isms and some SIMT-isms, but the compiler seems to do more work to make everything work than is needed on NV GPUs.

#4 Yossi Kreinin on 11.13.11 at 12:01 am

@Matt: thanks; my knowledge of the subject is comparatively scarce, glad to know the thing made sense to you.

Regarding a processor hiding latency using an SIMD model – you mean plenty of SIMD threads instead of plenty of warps, or a single thread with logically very wide SIMD instructions that take multiple stages to execute physically, or something else? Anyway, one thing regarding this write-up is that it more or less ignores the difference between interface and implementation – basically the assumption is that the execution model is implemented relatively straightforwardly.

Similarly, providing a SIMT interface on a SIMD CPU seems very hard, on two levels – the compiler back-end work and the programmer's ability to predict the extent of optimization his code will undergo (though without doubt the portability and readability of the code compared to hand-optimizing using intrinsics is a great benefit).

On reason I ignore the possibility of doing such a thing is because I personally lack the resources – including cranial horsepower – to do it, or at least to count on being able to do it well before I invested a lot of effort into trying. So when I think about processor features I'd like to implement, I evaluate them based on a straightforward usage model and not a sophisticated usage model – for SIMD, that'd be intrinsics and not SPMD code.

In this light, as well as in the more general context of hardware-software co-evolution – one thing that would be really interesting to hear is if you have suggestions for extending the SIMD model in ways making SPMD compilation more straightforward, or possible in more cases. I'd assume there would be much rejoicing if SIMD could be reliably targeted by portable, readable SPMD code some day in the future, and I'd assume hardware could help.

Regarding the "uniform" qualifier – interesting that you have that; I guess it indicates that it's valuable in the sense that it's not necessarily trivially recoverable info for the compiler and/or programmer based on the program flow. As to CPU vs GPU – I wonder how much optimization happens there in order to keep redundant values once, or at least #threads-in-warp times and not #threads-in-block times.

#5 Yossi Kreinin on 11.13.11 at 11:35 am

From the ISPC manual:

The code generated for 8 and 16-bit integer types is generally not as efficient as the code generated for 32-bit integer types. It is generally worthwhile to use 32-bit integer types for intermediate computations, even if the final result will be stored in a smaller integer type.

…this is similar to the situation with NVIDIA GPUs, where using narrow types doesn't buy you performance. This might not be a problem for graphics or scientific computing code, but it's a major problem for image processing/computer vision code, and it's inherently wasteful at the hardware level (adding 4 pairs of bytes is as cheap or cheaper than adding 2 32b integer, multiplying 2 integers is actually ~4x more expensive in terms of circuitry than multiplying 4 pairs of bytes). I don't quite see how SIMT/SPMD can handle this problem apart from re-introducing intrinsics (single SIMD instruction, multiple threads – SIMD SIMT or SSIMDIMT).

#6 Matt Pharr on 11.14.11 at 8:27 pm

Hi, Yossi–

Regarding latency hiding, I'm just saying that any HW architecture, be it a conventional CPU or a modern GPU can use the trick of "have 8-16 of HW threads and be able to switch between them with very little overhead" to hide off-chip memory latency. No CPU today does that (and I'm not sure if it makes sense to do that on the CPU in general), but the point is just that I argue that latency hiding isn't a characteristic of the SIMT model per se.

One other small comment I forgot in my first reply regarding "SIMD instructions such as stores could be extended to suppress updates based on a boolean vector register": the latest versions of the AVX instruction set have masked load and masked store instructions that do just this.

Regarding "Similarly, providing a SIMT interface on a SIMD CPU seems very hard, on two levels – the compiler back-end work and the programmer’s ability to predict the extent of optimization his code will undergo (though without doubt the portability and readability of the code compared to hand-optimizing using intrinsics is a great benefit)."

The ispc compiler does this already–check it out! It is a decent amount of code to implement, but on the other hand, it is implementable. The question about how well the programmer can reason about the code is a good one. One thing that I do think is good about SIMT/SPMD on all types of processors (both CPU and GPU) is that the main things to reason about for performance are control flow convergence/divergence and memory access convergence/divergence.

The performance characteristics of the control flow part are similar on both CPU and GPU–the more converged the better, but there's a smooth perf. fall-off as it diverges more. Memory is more tricky: on the GPU, one benefits based on how coherent/coalesced memory accesses are at runtime, whereas on CPU (until there are gather and scatter instructions), it depends on what the compiler can figure out at compile-time. In general of course, there are lots of cases where memory locations accessed are data-dependent and thus can't be reasoned about at compile time. But on the other hand, it's surprising how efficient well-coded scatters and gathers are on the CPU, even without those available as native instructions.

> "In this light, as well as in the more general context of hardware-software co-evolution – one thing that would be really interesting to hear is if you have suggestions for extending the SIMD model in ways making SPMD compilation more straightforward, or possible in more cases. "

It's always possible, it's just a question of how many instructions it takes to do a particular operaion. :-). In general, I've been surprised at how efficiently CPU SIMD instructions can run SPMD code, even if not originally designed for it. New features like the masked loads and stores in AVX (and the gather instruction coming in AVX2) help with this a lot as well.

#7 Yossi Kreinin on 11.15.11 at 4:49 am

I understood that ispc already did this – that is, that you already did this :) Regarding ispc, I'm very curious about two things:

1. Its model is less explicit than intrinsics, but more explicit than auto-vectorization. How far is it from auto-vectorization – what work do auto-vectorizers have to do that ispc doesn't already do?

2. Is there a plan for ports to non-Intel SIMD architectures? (It's an open project, so perhaps someone outside Intel figured they'd want such a port?)

Regarding the hardware features that could be helpful for SPMD – so it's mostly masked side effects and scatter/gather? Is there anything that you'd like done to make operations working on narrow types (8b, 16b) more usable?

(The two problems that come to mind with narrow types are that the number of operands processed in parallel varies with the operand width, which doesn't trivially fit into the SPMD one-operation-per-lane model – not sure how hardware could help here; and that the C type promotion rules can be inconsistent with the input/output types of narrow-type SIMD operations – perhaps here, adapting hardware features and language features to meet somewhere in the middle could help?)

Regarding a latency-hiding non-SIMT architecture – one could make a heavily multi-threaded SIMD machine, though it would be more of a GPU than a CPU in the sense of not supporting single-threaded workloads or many unrelated threads very well. I agree that massive threading is a feature of the SIMT model conceptually separate from its other features, it just seems to be useful in about the same situations as its other features.

#8 Matt Pharr on 11.17.11 at 5:53 am

The narrower types are tricky; to be honest I haven't thought carefully about them. I do believe that it's important to support them well–the issues you bring up above are good ones.

The way I'd compare/contrast to auto-vectorization is that an auto-vectorizer has to take serial code and prove to itself that the code can be run in parallel (no loop-carried dependencies, etc.) This is tricky in practice for a few reasons. First, many auto-vectorizers aren't able to handle complex data-dependent control flow inside the loops–i.e. if you're looping over an array of values and then have a while() loop inside your for() loop that does a number of iterations based on the value loaded from the array. This is "just" an implementation issue, though.

The bigger issue is that there are many cases where an auto-vectorizer just can't prove loop iteration independence–consider calling a function within a loop that was defined in another source file. The compiler has no visibility into what that function will do, so it has to assume the worst case and not auto-vectorize.

Conversely, a SPMD/SIMT/whatever compiler has a fundamentally parallel execution model; the programmer is implicitly promising the compiler that it's safe to run multiple instances of the program in parallel (across vector lanes and or cores)–either because there are no dependencies between program instances, or because appropriate synchronization calls have been added to the code. Thus the compiler doesn't have to worry about proving independence at all–it just goes parallel and doesn't worry about it.

For legacy code and for programmers who are willing to trade-off performance for safety, auto-vectorizing technology is great. But I think that there is a class of highly performance-oriented programmers (people who program GPUs today, write intrinsics by hand, etc.), who are quite happy to make that guarantee to the compiler in exchange for guaranteed good utilization of the SIMD unit…

#9 Yossi Kreinin on 11.18.11 at 5:04 am

This is interesting and admittedly unexpected to me – if I understand correctly, you're saying that the biggest trouble of an auto-vectorizer is proving that its use would preserve the semantics of the code, right?

If so, and supposing we're targeting people who're willing to tweak code for performance and to stray from standard C, but who also generally prefer to keep as much of the code as close to the standard as possible – couldn't something like #pragma independent_iterations in front of a loop be sufficient?

One way in which this would seem to be insufficient to me is things like vector shuffling – in ispc, "inter-lane cross-talk" is handled with the intrinsic-like calls such as broadcast() and shuffle(), which works because the code is basically spelled so that work is explicitly distributed across lanes, so code can meaningfully refer to these lanes. If the code used a "normal" loop and incremented the index by a single step and not by #lanes steps every time, shuffling would become a sort of inter-iteration dependence – and the auto-vectorizer would have to deal with it.

Is this the kind of thing that makes something along the lines of #pragma trust_me a not good enough way to make auto-vectorization practical – or would you label that case with "non-fundamental", "just" implementation issues?

(If the single fundamental issue is data and control flow analysis, I'd expect reasonable performance-aware programmers to be very willing to add restrict to every pointer they pass to kernel code, and to make sure kernel code never relies on separate compilation and function pointers – which presumably makes flow analysis feasible; if the unwillingness of people to do this is the only truly insurmountable obstacle to the adoption of auto-vectorizers, it's very surprising.)

#10 Alejandro on 11.19.11 at 8:08 am

In the context of CUDA, SM is streaming multiprocessor, not streaming module.

#11 Yossi Kreinin on 11.19.11 at 9:01 am

Thanks! Fixed.

#12 Alejandro on 11.19.11 at 12:49 pm

I wasn't aware of the 4-byte distribution of shared memory. Do you have an Nvidia source to cite for that? I implemented it and did see a speed-up, but I'd like to be able to cite it.

#13 Yossi Kreinin on 11.19.11 at 11:10 pm

One place it's mentioned is NVIDIA CUDA Programming Guide, page 90: "In the case of the shared memory space, the banks are organized such that
successive 32-bit words are assigned to successive banks and each bank has a
bandwidth of 32 bits per two clock cycles." … "there are bank conflicts if an array
of char is accessed the following way:
__shared__ char shared[32];
char data = shared[BaseIndex + tid];

because shared[0], shared[1], shared[2], and shared[3], for example,
belong to the same bank. There are no bank conflicts however, if the same array is
accessed the following way:
char data = shared[BaseIndex + 4 * tid];"

(They also mention the special case of broadcasting – when N threads access the same address in the same bank, the broadcasting of the single accessed value to them all is handled efficiently and does not cause slowdown. I decided to not mention this special case for brevity, though perhaps it is important to support in hardware for getting straightforwardly written programs to run fast.)

#14 Alejandro on 11.20.11 at 10:14 pm

In my particular scenario, I have four 1-byte fields per thread, originally in four separate arrays (e.g., field_a[], field_b[], field_c[], field_d[]). This is obviously bad because the threads would hit the same bank of memory in groups of four. I changed my implementation to something like you describe:

__shared__ char thread_data[THREADS_PER_BLOCK * 4];
char field_a = thread_data[4 * threadIdx.x + 0];
char field_b = thread_data[4 * threadIdx.x + 1];
char field_c = thread_data[4 * threadIdx.x + 2];
char field_d = thread_data[4 * threadIdx.x + 3];

This definitely improved my memory access pattern, effectively giving each thread (in groups of 16, of course) it's own bank of memory. I don't think I can improve it any further at this point.

Unfortunately, while implementing this I happened to notice a place where I wasn't freeing a significant block of memory (oops!). This is good, since I'd been have issues with large data sets, but it tacked on a ms or two to my cycle time so I can no longer claim this optimization gave me a 20% speed boost ;)

Thanks for the reference as well as the article, I enjoyed reading it.

#15 Alejandro on 11.20.11 at 10:26 pm

Also, it seems that the latest version of the Programming Guide (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf) has more information about shared memory conflicts in section F.3.3.

#16 Yossi Kreinin on 11.20.11 at 10:55 pm

Glad to have helped; I have to say that my own hands-on experience with CUDA is minimal…

#17 Carl Friedrich Bolz on 11.25.11 at 1:01 am

I really liked the post, was a great introduction to SIMT to me.

Also I found it cool (if maybe not totally surprising) that the meta-lesson "you can trade simplicity for almost anything" reappears in this context as well. The same is definitely true for the area I'm working on, which is VMs for (dynamic) languages.

#18 Yossi Kreinin on 11.25.11 at 1:24 am

Glad you liked it! As to "simplicity for almost anything" (sounds like VMs are the classic example) – actually, it's an interesting question how applicable it is to hardware design, because one thing that you usually lose with simplicity is size. In hardware, code size translates to hardware size – but not straightforwardly, that is, some things can be very complicated in code but not very large; and then size translates to power consumption, but also not straightforwardly.

I never really explored this with hardware, in the sense that I try to stay away from complexity in that area as much as possible, mainly to avoid bugs but also to have performance characteristics that are humanly comprehensible. If we look at CPUs (a kind of hardware that rewards complexity more than other kinds), we'll see that, on the one hand, the very complicated x86 designs are losing in the mobile space because ultimately that complexity does translate to too much area & power consumption; but on the other hand, embedded CPUs – ARM, MIPS, etc. – are moving towards super-scalar, hyper-threaded designs, which are far from the most straightforward CPU design.

In one way, CPUs are not unlike VMs (on different levels, they both try to efficiently run programs by applying non-straightforward analysis – typically without having the luxury of forcing the program authors to apply this analysis themselves and spell programs such that it's obvious how to run them efficiently) – so it seems that both spaces would reward complexity to some extent. I do believe that accelerator hardware like GPUs or DSPs is more "complexity-hostile" in the sense that due to price/performance considerations, it's a better idea to push much of the complexity onto software and move it out of the hardware.

#19 Ray McConnell on 04.16.12 at 2:55 am

Nice article. Your comment re complexity. The research shows that SIMT can indeed produce good utilisation of hardware resources. However, those of us who are in the thick of maximising performance vs energy realise that you must increase the bang per memory operation if you are to increase the power efficiency. SIMT starts to get a little harder when you consider more computational depth in the pipelines. Also modern SIMT needs to be generalised, not just for graphics and both the power efficiency and generality will move SIMT into more complexity, particularly in the numbers of sequencial instructions that need to skip references to the register file.

#20 Yossi Kreinin on 04.16.12 at 10:06 pm

…Because the register file is actually memory and you'd like more bang per memory operation, I guess. Great point.

As to utilization research – it ought to depend a great deal on your workload; for instance, SIMT is a terrible idea for most of my workloads simply because there's no way to sensibly spawn and occupy that many threads cooperating the way SIMT threads do.

#21 Ray McConnell on 04.24.12 at 6:15 am

Indeed. However, even partial system utilisation (good single block utilisation) may mean substantial power savings over a CPU + SIMD, as long as the SoC has good partial powerdown or clock gating.

#22 Yossi Kreinin on 04.24.12 at 7:32 am

Block – you mean a single "SM" in NVIDIA-speak, or a single "core"? I'm guessing SM – if so, keeping an SM occupied is already challenging for many workloads – 16/32 threads times latency of 28 translates to ~500-900 threads to keep an SM busy (I think these are realistic numbers).

#23 Patricio on 06.10.12 at 1:33 pm

Nice article! I've never read about this nor CUDA, but have studied about SIMD, so I could understand this :D

#24 Reader on 11.10.12 at 12:38 am

Thx. For weeks I had this article open and finally found the time to read it. It was definitly worth the time :-)

#25 Yossi Kreinin on 11.10.12 at 2:46 am

Thanks!

#26 djmips on 07.11.13 at 7:28 pm

I think you are confusing people by conflating SIMD and SWAR (SIMD Within A Register). Having a program counter (PC) per core and other features is still fundamentally SIMD. AMD calls their Graphics Core Next technology SIMD and they have an 80 bit PC per core.

#27 Yossi Kreinin on 07.11.13 at 11:35 pm

You're technically correct, however, SIMD is most often used to mean "SIMD within a register", and SWAR is almost never used.

Leave a Comment