SIMD < SIMT < SMT: parallelism in NVIDIA GPUs

Programmable NVIDIA GPUs are very inspiring to hardware geeks, proving that processors with an original, incompatible programming model can become widely used.

NVIDIA call their parallel programming model SIMT - "Single Instruction, Multiple Threads". Two other different, but related parallel programming models are SIMD - "Single Instruction, Multiple Data", and SMT - "Simultaneous Multithreading". Each model exploits a different source of parallelism:

  • In SIMD, elements of short vectors are processed in parallel.
  • In SMT, instructions of several threads are run in parallel.
  • SIMT is somewhere in between – an interesting hybrid between vector processing and hardware threading.

My presentation of SIMT is focused on hardware architecture and its implications on the trade-off between flexibility and efficiency. I'll describe how SIMT is different from SIMD and SMT, and why – what is gained (and lost) through these differences.

From a hardware design perspective, NVIDIA GPUs are at first glance really strange. The question I'll try to answer is "why would you want to build a processor that way?" I won't attempt to write a GPU programming tutorial, or quantitatively compare GPUs to other processors.

SIMD < SIMT < SMT

It can be said that SIMT is a more flexible SIMD, and SMT is in turn a more flexible SIMT. Less flexible models are generally more efficient – except when their lack of flexibility makes them useless for the task.

So in terms of flexibility, SIMD < SIMT < SMT. In terms of performance, SIMD > SIMT > SMT, but only when the models in question are flexible enough for your workload.

SIMT vs SIMD

SIMT and SIMD both approach parallelism through broadcasting the same instruction to multiple execution units. This way, you replicate the execution units, but they all share the same fetch/decode hardware.

If so, what's the difference between "single instruction, multiple data", and single instruction, multiple threads"? In NVIDIA's model, there are 3 key features that SIMD doesn't have:

  1. Single instruction, multiple register sets
  2. Single instruction, multiple addresses
  3. Single instruction, multiple flow paths

We'll see how this lifts restrictions on the set of programs that are possible to parallelize, and at what cost.

Single instruction, multiple register sets

Suppose you want to add two vectors of numbers. There are many ways to spell this. C uses a loop spelling:

for(i=0;i<n;++i) a[i]=b[i]+c[i];

Matlab uses a vector spelling:

a=b+c;

SIMD uses a "short vector" spelling – the worst of both worlds. You break your data into short vectors, and your loop processes them using instructions with ugly names. An example using C intrinsic functions mapping to ARM NEON SIMD instructions:

void add(uint32_t *a, uint32_t *b, uint32_t *c, int n) {
  for(int i=0; i<n; i+=4) {
    //compute c[i], c[i+1], c[i+2], c[i+3]
    uint32x4_t a4 = vld1q_u32(a+i);
    uint32x4_t b4 = vld1q_u32(b+i);
    uint32x4_t c4 = vaddq_u32(a4,b4);
    vst1q_u32(c+i,c4);
  }
}

SIMT uses a "scalar" spelling:

__global__ void add(float *a, float *b, float *c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  a[i]=b[i]+c[i]; //no loop!
}

The weird __global__ keyword says that add() is a GPU thread entry point. blockIdx, blockDim and threadIdx are built-in thread-local variables keeping the thread's ID. We'll see later why a thread ID isn't just a single number; however, in this example we in fact convert it to a single number, and use it as the element index.

The idea is that the CPU spawns a thread per element, and the GPU then executes those threads. Not all of the thousands or millions of threads actually run in parallel, but many do. Specifically, an NVIDIA GPU contains several largely independent processors called "Streaming Multiprocessors" (SMs), each SM hosts several "cores", and each "core" runs a thread. For instance, Fermi has up to 16 SMs with 32 cores per SM – so up to 512 threads can run in parallel.

All threads running on the cores of an SM at a given cycle are executing the same instruction – hence Single Instruction, Multiple Threads. However, each thread has its own registers, so these instructions process different data.

Benefits

"Scalar spelling", where you write the code of a single thread using standard arithmetic operators, is arguably a better interface than SIMD loops with ugly assembly-like opcodes.

Syntax considerations aside, is this spelling more expressive – can it do things SIMD can't? Not by itself, but it dovetails nicely with other features that do make SIMT more expressive. We'll discuss those features shortly; in theory, they could be bolted on the SIMD model, but they never are.

Costs

From a hardware resources perspective, there are two costs to the SIMT way:

  • Registers spent to keep redundant data items. SIMT registersIn our example, the pointers a, b, and c have the same value in all threads. The values of i are different across threads, but in a trivial way – for instance, 128, 129, 130… In SIMD, a, b, and c would be kept once in “scalar” registers – only short vectors such as a[i:i+4] would be kept in “vector” registers. The index i would also be kept once – several neighbor elements starting from i would be accessed without actually computing i+1, i+2, etc. Redundant computations both waste registers and needlessly consume power. Note, however, that a combination of compiler & hardware optimizations could eliminate the physical replication of redundant values. I don't know the extent to which it's done in reality.
  • Narrow data types are as costly as wide data types. SIMD registersA SIMD vector register keeping 4 32b integers can typically also be used to keep 8 16b integers, or 16 8b ones. Similarly, the same ALU hardware can be used for many narrow additions or fewer wide ones – so 16 byte pairs can be added in one cycle, or 4 32b integer pairs. In SIMT, a thread adds two items at a time, no matter what their width, wasting register bits and ALU circuitry.

It should be noted that SIMT can be easily amended with SIMD extensions for narrow types, so that each thread processes 4 bytes at a time using ugly assembly-like opcodes. AFAIK, NVIDIA refrains from this, presumably assuming that the ugliness is not worth the gain, with 32b float being the most popular data type in graphics anyway.

Single instruction, multiple addresses

Let's apply a function approximated by a look-up table to the elements of a vector:

__global__ void apply(short* a, short* b, short* lut) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  a[i] = lut[b[i]]; //indirect memory access
}

Here, i is again "redundant" in the sense that in parallel threads, the values of i are consecutive. However, each thread then accesses the address lut+b[i] – and these addresses are not consecutive.

Roughly, such parallel random access works for both loads and stores. Logically, stores are the trickier part because of conflicts. What if two or more threads attempt to, say, increment the same bin in a histogram? Different NVIDIA GPU generations provide different solutions that we won't dwell on.

Benefits

This feature lets you parallelize many programs you can't with SIMD. Some form of parallel random access is usually available on SIMD machines under the names "permutation", "shuffling", "table look-up", etc. However, it always works with registers, not with memory, so it's just not the same scale. You index into a table of 8, 16 or 32 elements, but not 16K.

As previously mentioned, in theory this feature can be bolted on the SIMD model: just compute your addresses (say, lut+b[i]) in vector registers, and add a rand_vec_load instruction. However, such an instruction would have a fairly high latency. As we'll see, the SIMT model naturally absorbs high latencies without hurting throughput; SIMD much less so.

Costs

GPU has many kinds of memory: external DRAM, L2 cache, texture memory, constant memory, shared memory… We'll discuss the cost of random access in the context of two memories "at the extremes": DRAM and shared memory. DRAM is the farthest from the GPU cores, sitting outside the chip. Shared memory is the closest to the cores – it's local to an SM, and the cores of an SM can use it to share results with each other, or for their own temporary data.

  • With DRAM memory, random access is never efficient. In fact, the GPU hardware looks at all memory addresses that the running threads want to access at a given cycle, and attempts to coalesce them into a single DRAM access – in case they are not random. Effectively the contiguous range from i to i+#threads is reverse-engineered from the explicitly computed i,i+1,i+2… – another cost of replicating the index in the first place. If the indexes are in fact random and can not be coalesced, the performance loss depends on "the degree of randomness". This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it – similarly to any other processor.
  • With shared memory, random access is slowed down by bank contentions. Generally, a hardware memory module will only service one access at a time. So shared memory is organized in independent banks; the number of banks for NVIDIA GPUs is 16. If x is a variable in shared memory, then it resides in bank number (&x/4)%16. In other words, if you traverse an array, the bank you hit changes every 4 bytes. Access throughput peaks if all addresses are in different banks – hardware contention detection logic always costs latency, but only actual contentions cost throughput. If there's a bank hosting 2 of the addresses, the throughput is 1/2 of the peak; if there's a bank pointed by 3 addresses, the throughput is 1/3 of the peak, etc., the worst slowdown being 1/16.

SIMT bank contentions

In theory, different mappings between banks and addresses are possible, each with its own shortcomings. For instance, with NVIDIA's mapping, accessing a contiguous array of floats gives peak throughput, but a contiguous array of bytes gives 1/4 of the throughput (since banks change every 4 bytes). Many of the GPU programming tricks aim at avoiding contentions.

For instance, with a byte array, you can frequently work with bytes at distance 4 from each other at every given step. Instead of accessing a[i] in your code, you access a[i*4], a[i*4+1], a[i*4+2] and a[i*4+3] – more code, but less contentions.

This sounds convoluted, but it's a relatively cheap way for the hardware to provide efficient random access. It also supports some very complicated access patterns with good average efficiency – by handling the frequent case of few contentions quickly, and the infrequent case of many contentions correctly.

Single instruction, multiple flow paths

Let's find the indexes of non-zero elements in a vector. This time, each thread will work on several elements instead of just one:

__global__ void find(int* vec, int len,
                     int* ind, int* nfound,
                     int nthreads) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int last = 0;
  int* myind = ind + tid*len;
  for(int i=tid; i<len; i+=nthreads) {
    if(vec[i]) { //flow divergence
      myind[last] = i;
      last++;
    }
  }
  nfound[tid] = last;
}

Each thread processes len/nthreads elements, starting at the index equal to its ID with a step of nthreads. We could make each thread responsible for a more natural contiguous range using a step of 1. The way we did it is better in that accesses to vec[i] by concurrent threads address neighbor elements, so we get coalescing.

The interesting bit is if(vec[i]) – depending on vec[i], some threads execute the code saving the index, some don't. The control flow of different threads can thus diverge.

Benefits

Support for divergence further expands the set of parallelizable programs, especially when used together with parallel random access. SIMD has some support for conditional execution though vector "select" operations: select(flag,a,b) = if flag then a else b. However, select can't be used to suppress updates to values – the way myind[last] is never written by threads where vec[i] is 0.

SIMD instructions such as stores could be extended to suppress updates based on a boolean vector register. For this to be really useful, the machine probably also needs parallel random access (for instance, the find() example wouldn't work otherwise). Unless what seems like an unrealistically smart compiler arrives, this also gets more and more ugly, whereas the SIMT spelling remains clean and natural.

Costs

  • Only one flow path is executed at a time, and threads not running it must wait.SIMT divergence Ultimately SIMT executes a single instruction in all the multiple threads it runs – threads share program memory and fetch / decode / execute logic. When the threads have the same flow – all run the if, nobody runs the else, for example – then they just all run the code in if at full throughput. However, when one or more threads have to run the else, they'll wait for the if threads. When the if threads are done, they'll in turn wait for the else threads. Divergence is handled correctly, but slowly. Deeply nested control structures effectively serialize execution and are not recommended.
  • Divergence can further slow things down through "randomizing" memory access. In our example, all threads read vec[i], and the indexing is tweaked to avoid contentions. However, when myind[last] is written, different threads will have incremented the last counter a different number of times, depending on the flow. This might lead to contentions which also serialize execution to some extent. Whether the whole parallelization exercise is worth the trouble depends on the flow of the algorithm as well as the input data.

We've seen the differences between SIMT and its less flexible relative, SIMD. We'll now compare SIMT to SMT – the other related model, this time the more flexible one.

SIMT vs SMT

SIMT and SMT both use threads as a way to improve throughput despite high latencies. The problem they tackle is that any single thread can get stalled running a high-latency instruction. This leaves the processor with idle execution hardware.

One way around this is switching to another thread – which (hopefully) has an instruction ready to be executed – and then switch back. For this to work, context switching has to be instantaneous. To achieve that, you replicate register files so that each thread has its own registers, and they all share the same execution hardware.

But wait, doesn't SIMT already replicate registers, as a way to have a single instruction operate on different data items? It does – here, we're talking about a "second dimension" of register replication:

  1. Several threads – a "warp" in NVIDIA terminology – run simultaneously. So each thread needs its own registers.
  2. Several warps, making up a "block", are mapped to an SM, and an SM instantaneously switches between the warps of a block. So each warp needs separate registers for each of its threads.

SIMT 2D replication

With this "two-dimensional" replication, how many registers we end up with? Well, a lot. A block can have up to 512 threads. And the registers of those threads can keep up to 16K of data.

How many register sets does a typical SMT processor have? Er, 2, sometimes 4…

Why so few? One reason is diminishing returns. When you replicate registers, you pay a significant price, in the hope of being able to better occupy your execution hardware. However, with every thread you add, the chance of it being already occupied by all the other threads rises. Soon, the small throughput gain just isn't worth the price.

If SMT CPU designers stop at 2 or 4 threads, why did SIMT GPU designers go for 512?

With enough threads, high throughput is easy

SMT is an afterthought – an attempt to use idle time on a machine originally designed to not have a lot of idle time to begin with. The basic CPU design aims, first and foremost, to run a single thread fast. Splitting a process to several independent threads is not always possible. When it is possible, it's usually gnarly.

Even in server workloads, where there's naturally a lot of independent processing, single-threaded latency still matters. So few expensive, low-latency cores outperform many cheap, high-latency cores. As Google's Urs Hölzle put it, "brawny cores beat wimpy cores". Serial code has to run fast.

Running a single thread fast means being able to issue instructions from the current thread as often as possible. To do that, CPU hardware works around every one of the many reasons to wait. Such diverse techniques as:

  • superscalar execution
  • out-of-order execution
  • register renaming
  • branch prediction
  • speculative execution
  • cache hierarchy
  • speculative prefetching
  • etc. etc.

…are all there for the same basic purpose. They maximize the chances of an instruction to be issued without having to switch to another thread.

SMT is the last line of defense, attempting to fill stalls after all these other measures failed. And even that is considered a bad idea when it hurts the precious single-threaded performance. Which it usually does: each of the 2 threads will typically complete later then it would if it didn't have to share the hardware with the other. This is a key reason to keep the number of hardware threads low, even if there are still throughput gains to be made by adding threads.

However, for the GPUs, the use case is when you do have enough data parallelism to make use of plenty of threads. If so, why not build things the other way around? Threading could be our first stall-mitigating measure. If we have enough threads, we just keep switching between them, and the hardware always has something to do.

SIMT/SMT latency

This saves a lot of hardware, and a lot of design effort, because you don't need most of the other methods anymore. Even caches and hardware prefetching are not used in much of the GPU memory traffic – rather, you access external memory directly. Why bother with caching and prefetching, if you don't have to sit idly until the data arrives from main memory – but instead just switch to a different warp? No heuristics, no speculation, no hurry – just keep yourself busy when waiting.

Furthermore, even the basic arithmetic pipeline is designed for a high latency, high throughput scenario. According to the paper "Demystifying GPU architecture through microbenchmarking", no operation takes less than 24 cycles to complete. However, the throughput of many operations is single-cycle.

The upshot is that counting on the availability of many threads allows the GPU to sustain a high throughput without having to sweat for low latencies. Hardware becomes simpler and cheaper in many areas as a result.

When latencies are high, registers are cheap

So plentiful threads make it easier to build high-throughput hardware. What about having to replicate all those registers though? 16K sounds like an insane amount of registers – how is this even affordable?

Well, it depends on what "register" means. The original meaning is the kind of storage with the smallest access latency. In CPUs, access to registers is faster than access to L1 caches, which in turn is faster than L2, etc. Small access latency means an expensive implementation, therefore there must be few registers.

However, in GPUs, access to "registers" can be quite slow. Because the GPU is always switching between warps, many cycles pass between two subsequent instructions of one warp. The reason registers must be fast in CPUs is because subsequent instructions communicate through them. In GPUs, they also communicate through registers – but the higher latency means there's no rush.

Therefore, GPU registers are only analogous to CPU registers in terms of instruction encoding. In machine code, "registers" are a small set of temporary variables that can be referenced with just a few bits of encoding – unlike memory, where you need a longer address to refer to a variable. In this and other ways, "registers" and "memory" are semantically different elements of encoding – both on CPUs and GPUs.

However, in terms of hardware implementation, GPU registers are actually more like memory than CPU registers. [Disclaimer: NVIDIA doesn't disclose implementation details, and I'm grossly oversimplifying, ignoring things like data forwarding, multiple access ports, and synthesizable vs custom design]. 16K of local RAM is a perfectly affordable amount. So while in a CPU, registers have to be expensive, they can be cheap in a high-latency, high-throughput design.

It's still a waste if 512 threads keep the same values in some of their registers – such as array base pointers in our examples above. However, many of the registers keep different values in different threads. In many cases register replication is not a waste at all – any processor would have to keep those values somewhere. So functionally, the plentiful GPU registers can be seen as a sort of a data cache.

Drawbacks

We've seen that:

  • Many threads enable cheap high-throughput, high-latency design
  • A high-throughput, high-latency design in turn enables a cheap implementation of threads' registers

This leads to a surprising conclusion that SIMT with its massive threading can actually be cheaper than SMT-style threading added to a classic CPU design. Not unexpectedly, these cost savings come at a price of reduced flexibility:

  1. Low occupancy greatly reduces performance
  2. Flow divergence greatly reduces performance
  3. Synchronization options are very limited

Occupancy

"Occupancy" is NVIDIA's term for the utilization of threading. The more threads an SM runs, the higher its occupancy. Low occupancy obviously leads to low performance – without enough warps to switch between, the GPU won't be able to hide its high latencies. The whole point of massive threading is refusing to target anything but massively parallel workloads. SMT requires much less parallelism to be efficient.

Divergence

We've seen that flow divergence is handled correctly, but inefficiently in SIMT. SMT doesn't have this problem – it works quite well given unrelated threads with unrelated control flow.

There are two reasons why unrelated threads can't work well with SIMT:

  • SIMD-style instruction broadcasting – unrelated threads within a warp can't run fast.
  • More massive threading than SMT – unrelated wraps would compete for shared resources such as instruction cache space. SMT also has this problem, but it's tolerable when you have few threads.

So both of SIMT's key ideas – SIMD-style instruction broadcasting and SMT-style massive threading – are incompatible with unrelated threads.

Related threads – those sharing code and some of the data – could work well with massive threading by itself despite divergence. It's instruction broadcasting that they fail to utilize, leaving execution hardware in idle state.

However, it seems that much of the time, related threads actually tend to have the same flow and no divergence. If this is true, a machine with massive threading but without instruction broadcasting would miss a lot of opportunities to execute its workload more efficiently.

Synchronization

In terms of programming model, SMT is an extension to a single-threaded, time-shared CPU. The same fairly rich set of inter-thread (and inter-device) synchronization and communication options is available with SMT as with "classic" single-threaded CPUs. This includes interrupts, message queues, events, semaphores, blocking and non-blocking system calls, etc. The underlying assumptions are:

  • There are quite many threads
  • Typically, each thread is doing something quite different from other threads
  • At any moment, most threads are waiting for an event, and a small subset can actually run

SMT stays within this basic time-sharing framework, adding an option to have more than one actually running threads. With SMT, as with a "classic" CPU, a thread will be very typically put "on hold" in order to wait for an event. This is implemented using context switching – saving registers to memory and, if a ready thread is found, restoring its registers from memory so that it can run.

SIMT doesn't like to put threads on hold, for several reasons:

  • Typically, there are many running, related threads. It would make the most sense to put them all on hold, so that another, unrelated, equally large group of threads can run. However, switching 16K of context is not affordable. In this sense, "registers" are expensive after all, even if they are actually memory.
  • SIMT performance depends greatly on there being many running threads. There's no point in supporting the case where most threads are waiting, because SIMT wouldn't run such workloads very well anyway. From the use case angle, a lot of waiting threads arise in "system"/"controller" kind of software, where threads wait for files, sockets, etc. SIMT is purely computational hardware that doesn't support such OS services. So the situation is both awkward for SIMT and shouldn't happen in its target workloads anyway.
  • Roughly, SIMT supports data parallelism – same code, different data. Data parallelism usually doesn't need complicated synchronization – all threads have to synchronize once a processing stage is done, and otherwise, they're independent. What requires complicated synchronization, where some threads run and some are put on hold due to data dependencies, is task parallelism – different code, different data. However, task parallelism implies divergence, and SIMT isn't good at that anyway – so why bother with complicated synchronization?

Therefore, SIMT roughly supports just one synchronization primitive – __syncthreads(). This creates a synchronization point for all the threads of a block. You know that if a thread runs code past this point, no thread runs code before this point. This way, threads can safely share results with each other. For instance, with matrix multiplication:

  1. Each thread, based on its ID – x,y – reads 2 elements, A(x,y) and B(x,y), from external memory to the on-chip shared memory. (Of course large A and B won't fit into shared memory, so this will be done block-wise.)
  2. Threads sync – all threads can now safely access all of A and B.
  3. Each thread, depending on the ID, multiplies row y in A by column x in B.
//1. load A(x,y) and B(x,y)
int x = threadIdx.x;
int y = threadIdx.y;
A[stride*y + x] = extA[ext_stride*y + x];
B[stride*y + x] = extB[ext_stride*y + x];
//2. sync
__syncthreads();
//3. multiply row y in A by column x in B
float prod = 0;
for(int i=0; i<N; ++i) {
  prod += A[stride*y + i] * B[stride*i + x];
}

It's an incomplete example (we look at just one block and ignore blockIdx, among other thing), but it shows the point of syncing – and the point of these weird "multi-dimensional" thread IDs (IDs are x,y,z coordinates rather than just integers). It's just natural, with 2D and 3D arrays, to map threads and blocks to coordinates and sub-ranges of these arrays.

Summary of differences between SIMD, SIMT and SMT

SIMT is more flexible in SIMD in three areas:

  1. Single instruction, multiple register sets
  2. Single instruction, multiple addresses
  3. Single instruction, multiple flow paths

SIMT is less flexible than SMT in three areas:

  1. Low occupancy greatly reduces performance
  2. Flow divergence greatly reduces performance
  3. Synchronization options are very limited

The effect of flexibility on costs was discussed above. A wise programmer probably doesn't care about these costs – rather, he uses the most flexible (and easily accessible) device until he runs out of cycles, then moves on to utilize the next most flexible device. Costs are just a limit on how flexible a device that is available in a given situation can be.

"SIMT" should catch on

It's a beautiful idea, questioning many of the usual assumptions about hardware design, and arriving at an internally consistent answer with the different parts naturally complementing each other.

I don't know the history well enough to tell which parts are innovations by NVIDIA and which are borrowed from previous designs. However, I believe the term SIMT was coined by NVIDIA, and perhaps it's a shame that it (apparently) didn't catch, because the architecture "deserves a name" – not necessarily true of every "new paradigm" announced by marketing.

One person who took note is Andy Glew, one of Intel's P6 architects – in his awesome computer architecture wiki, as well as in his presentation regarding further development of the SIMT model.

The presentation talks about neat optimizations – "rejiggering threads", per-thread loop buffers/time pipelining – and generally praises SIMT superiority over SIMD. Some things I disagree with – such as the "vector lane crossing" issue – and some are very interesting, such as everything about improving utilization of divergent threads.

I think the presentation should be understandable after reading my oversimplified overview – and will show where my overview oversimplifies, among other things.

Peddling fairness

Throughout this overview, there's this recurring "fairness" idea: you can trade flexibility for performance, and you can choose your trade-off.

It makes for a good narrative, but I don't necessarily believe it. You might very well be able to get both flexibility and performance. More generally, you might get both A and B, where A vs B intuitively feels like "an inherent trade-off".

What you can't do is get A and B in a reasonably simple, straightforward way. This means that you can trade simplicity for almost everything – though you don't necessarily want to; perhaps it's a good subject for a separate post.

98 comments ↓

#1 Sergey on 11.10.11 at 11:39 pm

Great article as always. Thanks a bunch!

#2 Yossi Kreinin on 11.11.11 at 1:26 am

Glad you liked it.

#3 Matt Pharr on 11.12.11 at 8:50 pm

This is a really nice writeup; I think it explains a lot of the details of this stuff very well. A few comments/thoughts..

First, to argue a detail, toward the end you make the connection between SIMT and massive numbers of threads to hide latency. I think that these two things are actually orthogonal; one could build a SIMD processor that also did the same trick for latency hiding, for example. It so happens that NV's SIMT processors today hide latency with many threads, but strictly speaking I think the two things are basically independent.

Another general issue is that one can provide a SIMT model to the user on a variety of underlying h/w architectures. For example, the ispc compiler (http://ispc.github.com) provides a ~SIMT model on the CPU (that turns out to be a really good way to use the CPU's vector units). In that case, there is more work to do in the compiler and generated code to deal with control flow divergence, doing scatters/gathers to memory efficiently, etc., but in the end, it's just a different set of trade-offs of what's done in the hardware and what's done by the compiler.

(One nice thing about SIMT on the CPU is that some of the issues you mention about redundant storage of data items with the same value aren't a problem there–ispc for example provides a 'uniform' data type qualifier that indicates that a variable has the same value for all of the SIMD lanes, which directly addresses the issue you bring up here.)

From what I've seen of AMD's GPU ISA, they are at a middle ground between today's CPUs and NV's GPUs in this respect, where their documented ISA has some SIMD-isms and some SIMT-isms, but the compiler seems to do more work to make everything work than is needed on NV GPUs.

#4 Yossi Kreinin on 11.13.11 at 12:01 am

@Matt: thanks; my knowledge of the subject is comparatively scarce, glad to know the thing made sense to you.

Regarding a processor hiding latency using an SIMD model – you mean plenty of SIMD threads instead of plenty of warps, or a single thread with logically very wide SIMD instructions that take multiple stages to execute physically, or something else? Anyway, one thing regarding this write-up is that it more or less ignores the difference between interface and implementation – basically the assumption is that the execution model is implemented relatively straightforwardly.

Similarly, providing a SIMT interface on a SIMD CPU seems very hard, on two levels – the compiler back-end work and the programmer's ability to predict the extent of optimization his code will undergo (though without doubt the portability and readability of the code compared to hand-optimizing using intrinsics is a great benefit).

On reason I ignore the possibility of doing such a thing is because I personally lack the resources – including cranial horsepower – to do it, or at least to count on being able to do it well before I invested a lot of effort into trying. So when I think about processor features I'd like to implement, I evaluate them based on a straightforward usage model and not a sophisticated usage model – for SIMD, that'd be intrinsics and not SPMD code.

In this light, as well as in the more general context of hardware-software co-evolution – one thing that would be really interesting to hear is if you have suggestions for extending the SIMD model in ways making SPMD compilation more straightforward, or possible in more cases. I'd assume there would be much rejoicing if SIMD could be reliably targeted by portable, readable SPMD code some day in the future, and I'd assume hardware could help.

Regarding the "uniform" qualifier – interesting that you have that; I guess it indicates that it's valuable in the sense that it's not necessarily trivially recoverable info for the compiler and/or programmer based on the program flow. As to CPU vs GPU – I wonder how much optimization happens there in order to keep redundant values once, or at least #threads-in-warp times and not #threads-in-block times.

#5 Yossi Kreinin on 11.13.11 at 11:35 am

From the ISPC manual:

The code generated for 8 and 16-bit integer types is generally not as efficient as the code generated for 32-bit integer types. It is generally worthwhile to use 32-bit integer types for intermediate computations, even if the final result will be stored in a smaller integer type.

…this is similar to the situation with NVIDIA GPUs, where using narrow types doesn't buy you performance. This might not be a problem for graphics or scientific computing code, but it's a major problem for image processing/computer vision code, and it's inherently wasteful at the hardware level (adding 4 pairs of bytes is as cheap or cheaper than adding 2 32b integer, multiplying 2 integers is actually ~4x more expensive in terms of circuitry than multiplying 4 pairs of bytes). I don't quite see how SIMT/SPMD can handle this problem apart from re-introducing intrinsics (single SIMD instruction, multiple threads – SIMD SIMT or SSIMDIMT).

#6 Matt Pharr on 11.14.11 at 8:27 pm

Hi, Yossi–

Regarding latency hiding, I'm just saying that any HW architecture, be it a conventional CPU or a modern GPU can use the trick of "have 8-16 of HW threads and be able to switch between them with very little overhead" to hide off-chip memory latency. No CPU today does that (and I'm not sure if it makes sense to do that on the CPU in general), but the point is just that I argue that latency hiding isn't a characteristic of the SIMT model per se.

One other small comment I forgot in my first reply regarding "SIMD instructions such as stores could be extended to suppress updates based on a boolean vector register": the latest versions of the AVX instruction set have masked load and masked store instructions that do just this.

Regarding "Similarly, providing a SIMT interface on a SIMD CPU seems very hard, on two levels – the compiler back-end work and the programmer’s ability to predict the extent of optimization his code will undergo (though without doubt the portability and readability of the code compared to hand-optimizing using intrinsics is a great benefit)."

The ispc compiler does this already–check it out! It is a decent amount of code to implement, but on the other hand, it is implementable. The question about how well the programmer can reason about the code is a good one. One thing that I do think is good about SIMT/SPMD on all types of processors (both CPU and GPU) is that the main things to reason about for performance are control flow convergence/divergence and memory access convergence/divergence.

The performance characteristics of the control flow part are similar on both CPU and GPU–the more converged the better, but there's a smooth perf. fall-off as it diverges more. Memory is more tricky: on the GPU, one benefits based on how coherent/coalesced memory accesses are at runtime, whereas on CPU (until there are gather and scatter instructions), it depends on what the compiler can figure out at compile-time. In general of course, there are lots of cases where memory locations accessed are data-dependent and thus can't be reasoned about at compile time. But on the other hand, it's surprising how efficient well-coded scatters and gathers are on the CPU, even without those available as native instructions.

> "In this light, as well as in the more general context of hardware-software co-evolution – one thing that would be really interesting to hear is if you have suggestions for extending the SIMD model in ways making SPMD compilation more straightforward, or possible in more cases. "

It's always possible, it's just a question of how many instructions it takes to do a particular operaion. :-). In general, I've been surprised at how efficiently CPU SIMD instructions can run SPMD code, even if not originally designed for it. New features like the masked loads and stores in AVX (and the gather instruction coming in AVX2) help with this a lot as well.

#7 Yossi Kreinin on 11.15.11 at 4:49 am

I understood that ispc already did this – that is, that you already did this :) Regarding ispc, I'm very curious about two things:

1. Its model is less explicit than intrinsics, but more explicit than auto-vectorization. How far is it from auto-vectorization – what work do auto-vectorizers have to do that ispc doesn't already do?

2. Is there a plan for ports to non-Intel SIMD architectures? (It's an open project, so perhaps someone outside Intel figured they'd want such a port?)

Regarding the hardware features that could be helpful for SPMD – so it's mostly masked side effects and scatter/gather? Is there anything that you'd like done to make operations working on narrow types (8b, 16b) more usable?

(The two problems that come to mind with narrow types are that the number of operands processed in parallel varies with the operand width, which doesn't trivially fit into the SPMD one-operation-per-lane model – not sure how hardware could help here; and that the C type promotion rules can be inconsistent with the input/output types of narrow-type SIMD operations – perhaps here, adapting hardware features and language features to meet somewhere in the middle could help?)

Regarding a latency-hiding non-SIMT architecture – one could make a heavily multi-threaded SIMD machine, though it would be more of a GPU than a CPU in the sense of not supporting single-threaded workloads or many unrelated threads very well. I agree that massive threading is a feature of the SIMT model conceptually separate from its other features, it just seems to be useful in about the same situations as its other features.

#8 Matt Pharr on 11.17.11 at 5:53 am

The narrower types are tricky; to be honest I haven't thought carefully about them. I do believe that it's important to support them well–the issues you bring up above are good ones.

The way I'd compare/contrast to auto-vectorization is that an auto-vectorizer has to take serial code and prove to itself that the code can be run in parallel (no loop-carried dependencies, etc.) This is tricky in practice for a few reasons. First, many auto-vectorizers aren't able to handle complex data-dependent control flow inside the loops–i.e. if you're looping over an array of values and then have a while() loop inside your for() loop that does a number of iterations based on the value loaded from the array. This is "just" an implementation issue, though.

The bigger issue is that there are many cases where an auto-vectorizer just can't prove loop iteration independence–consider calling a function within a loop that was defined in another source file. The compiler has no visibility into what that function will do, so it has to assume the worst case and not auto-vectorize.

Conversely, a SPMD/SIMT/whatever compiler has a fundamentally parallel execution model; the programmer is implicitly promising the compiler that it's safe to run multiple instances of the program in parallel (across vector lanes and or cores)–either because there are no dependencies between program instances, or because appropriate synchronization calls have been added to the code. Thus the compiler doesn't have to worry about proving independence at all–it just goes parallel and doesn't worry about it.

For legacy code and for programmers who are willing to trade-off performance for safety, auto-vectorizing technology is great. But I think that there is a class of highly performance-oriented programmers (people who program GPUs today, write intrinsics by hand, etc.), who are quite happy to make that guarantee to the compiler in exchange for guaranteed good utilization of the SIMD unit…

#9 Yossi Kreinin on 11.18.11 at 5:04 am

This is interesting and admittedly unexpected to me – if I understand correctly, you're saying that the biggest trouble of an auto-vectorizer is proving that its use would preserve the semantics of the code, right?

If so, and supposing we're targeting people who're willing to tweak code for performance and to stray from standard C, but who also generally prefer to keep as much of the code as close to the standard as possible – couldn't something like #pragma independent_iterations in front of a loop be sufficient?

One way in which this would seem to be insufficient to me is things like vector shuffling – in ispc, "inter-lane cross-talk" is handled with the intrinsic-like calls such as broadcast() and shuffle(), which works because the code is basically spelled so that work is explicitly distributed across lanes, so code can meaningfully refer to these lanes. If the code used a "normal" loop and incremented the index by a single step and not by #lanes steps every time, shuffling would become a sort of inter-iteration dependence – and the auto-vectorizer would have to deal with it.

Is this the kind of thing that makes something along the lines of #pragma trust_me a not good enough way to make auto-vectorization practical – or would you label that case with "non-fundamental", "just" implementation issues?

(If the single fundamental issue is data and control flow analysis, I'd expect reasonable performance-aware programmers to be very willing to add restrict to every pointer they pass to kernel code, and to make sure kernel code never relies on separate compilation and function pointers – which presumably makes flow analysis feasible; if the unwillingness of people to do this is the only truly insurmountable obstacle to the adoption of auto-vectorizers, it's very surprising.)

#10 Alejandro on 11.19.11 at 8:08 am

In the context of CUDA, SM is streaming multiprocessor, not streaming module.

#11 Yossi Kreinin on 11.19.11 at 9:01 am

Thanks! Fixed.

#12 Alejandro on 11.19.11 at 12:49 pm

I wasn't aware of the 4-byte distribution of shared memory. Do you have an Nvidia source to cite for that? I implemented it and did see a speed-up, but I'd like to be able to cite it.

#13 Yossi Kreinin on 11.19.11 at 11:10 pm

One place it's mentioned is NVIDIA CUDA Programming Guide, page 90: "In the case of the shared memory space, the banks are organized such that
successive 32-bit words are assigned to successive banks and each bank has a
bandwidth of 32 bits per two clock cycles." … "there are bank conflicts if an array
of char is accessed the following way:
__shared__ char shared[32];
char data = shared[BaseIndex + tid];

because shared[0], shared[1], shared[2], and shared[3], for example,
belong to the same bank. There are no bank conflicts however, if the same array is
accessed the following way:
char data = shared[BaseIndex + 4 * tid];"

(They also mention the special case of broadcasting – when N threads access the same address in the same bank, the broadcasting of the single accessed value to them all is handled efficiently and does not cause slowdown. I decided to not mention this special case for brevity, though perhaps it is important to support in hardware for getting straightforwardly written programs to run fast.)

#14 Alejandro on 11.20.11 at 10:14 pm

In my particular scenario, I have four 1-byte fields per thread, originally in four separate arrays (e.g., field_a[], field_b[], field_c[], field_d[]). This is obviously bad because the threads would hit the same bank of memory in groups of four. I changed my implementation to something like you describe:

__shared__ char thread_data[THREADS_PER_BLOCK * 4];
char field_a = thread_data[4 * threadIdx.x + 0];
char field_b = thread_data[4 * threadIdx.x + 1];
char field_c = thread_data[4 * threadIdx.x + 2];
char field_d = thread_data[4 * threadIdx.x + 3];

This definitely improved my memory access pattern, effectively giving each thread (in groups of 16, of course) it's own bank of memory. I don't think I can improve it any further at this point.

Unfortunately, while implementing this I happened to notice a place where I wasn't freeing a significant block of memory (oops!). This is good, since I'd been have issues with large data sets, but it tacked on a ms or two to my cycle time so I can no longer claim this optimization gave me a 20% speed boost ;)

Thanks for the reference as well as the article, I enjoyed reading it.

#15 Alejandro on 11.20.11 at 10:26 pm

Also, it seems that the latest version of the Programming Guide (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf) has more information about shared memory conflicts in section F.3.3.

#16 Yossi Kreinin on 11.20.11 at 10:55 pm

Glad to have helped; I have to say that my own hands-on experience with CUDA is minimal…

#17 Carl Friedrich Bolz on 11.25.11 at 1:01 am

I really liked the post, was a great introduction to SIMT to me.

Also I found it cool (if maybe not totally surprising) that the meta-lesson "you can trade simplicity for almost anything" reappears in this context as well. The same is definitely true for the area I'm working on, which is VMs for (dynamic) languages.

#18 Yossi Kreinin on 11.25.11 at 1:24 am

Glad you liked it! As to "simplicity for almost anything" (sounds like VMs are the classic example) – actually, it's an interesting question how applicable it is to hardware design, because one thing that you usually lose with simplicity is size. In hardware, code size translates to hardware size – but not straightforwardly, that is, some things can be very complicated in code but not very large; and then size translates to power consumption, but also not straightforwardly.

I never really explored this with hardware, in the sense that I try to stay away from complexity in that area as much as possible, mainly to avoid bugs but also to have performance characteristics that are humanly comprehensible. If we look at CPUs (a kind of hardware that rewards complexity more than other kinds), we'll see that, on the one hand, the very complicated x86 designs are losing in the mobile space because ultimately that complexity does translate to too much area & power consumption; but on the other hand, embedded CPUs – ARM, MIPS, etc. – are moving towards super-scalar, hyper-threaded designs, which are far from the most straightforward CPU design.

In one way, CPUs are not unlike VMs (on different levels, they both try to efficiently run programs by applying non-straightforward analysis – typically without having the luxury of forcing the program authors to apply this analysis themselves and spell programs such that it's obvious how to run them efficiently) – so it seems that both spaces would reward complexity to some extent. I do believe that accelerator hardware like GPUs or DSPs is more "complexity-hostile" in the sense that due to price/performance considerations, it's a better idea to push much of the complexity onto software and move it out of the hardware.

#19 Ray McConnell on 04.16.12 at 2:55 am

Nice article. Your comment re complexity. The research shows that SIMT can indeed produce good utilisation of hardware resources. However, those of us who are in the thick of maximising performance vs energy realise that you must increase the bang per memory operation if you are to increase the power efficiency. SIMT starts to get a little harder when you consider more computational depth in the pipelines. Also modern SIMT needs to be generalised, not just for graphics and both the power efficiency and generality will move SIMT into more complexity, particularly in the numbers of sequencial instructions that need to skip references to the register file.

#20 Yossi Kreinin on 04.16.12 at 10:06 pm

…Because the register file is actually memory and you'd like more bang per memory operation, I guess. Great point.

As to utilization research – it ought to depend a great deal on your workload; for instance, SIMT is a terrible idea for most of my workloads simply because there's no way to sensibly spawn and occupy that many threads cooperating the way SIMT threads do.

#21 Ray McConnell on 04.24.12 at 6:15 am

Indeed. However, even partial system utilisation (good single block utilisation) may mean substantial power savings over a CPU + SIMD, as long as the SoC has good partial powerdown or clock gating.

#22 Yossi Kreinin on 04.24.12 at 7:32 am

Block – you mean a single "SM" in NVIDIA-speak, or a single "core"? I'm guessing SM – if so, keeping an SM occupied is already challenging for many workloads – 16/32 threads times latency of 28 translates to ~500-900 threads to keep an SM busy (I think these are realistic numbers).

#23 Patricio on 06.10.12 at 1:33 pm

Nice article! I've never read about this nor CUDA, but have studied about SIMD, so I could understand this :D

#24 Reader on 11.10.12 at 12:38 am

Thx. For weeks I had this article open and finally found the time to read it. It was definitly worth the time :-)

#25 Yossi Kreinin on 11.10.12 at 2:46 am

Thanks!

#26 djmips on 07.11.13 at 7:28 pm

I think you are confusing people by conflating SIMD and SWAR (SIMD Within A Register). Having a program counter (PC) per core and other features is still fundamentally SIMD. AMD calls their Graphics Core Next technology SIMD and they have an 80 bit PC per core.

#27 Yossi Kreinin on 07.11.13 at 11:35 pm

You're technically correct, however, SIMD is most often used to mean "SIMD within a register", and SWAR is almost never used.

#28 Shridhar on 10.15.14 at 6:31 pm

I come with a background in SIMD and this is easily the best article I've read that comprehensively covers GPU architecture details in an intuitive manner. Thank you so much Yossi!

#29 Yossi Kreinin on 10.15.14 at 6:34 pm

Glad you liked it!

#30 rowan194 on 08.13.15 at 7:27 pm

Thanks for this article. I've been trying to figure out whether my genetic algorithm categorising code can be shoehorned into a SIMD/SIMT type structure (say on a Kepler K1 board), or whether it really needs multiple independent cores (eg Parallella 16 core, or a small cluster of ARM 4/8 core boards). Still undecided, but your article has certainly encouraged me to explore this further.

#31 lailai51587 on 06.08.16 at 12:46 pm

Hi Yossi. This is a good article to distinguish SIMD, SMT, and SIMT. I am confused about SIMD or MIMD. Are they computer architecture or programming model? It seems that the architecture and programming model can be mixed. Do you have any books or papers which can well explain these?

#32 zee 5 mod on 03.30.19 at 12:34 am

As you obtain more adept at using magic, you will have to learn how to cope with runes.
This is probably not familiar with secure the validity associated
with progression of thoughts due to the inescapable fact it absolutely was your head itself that set the footwork for that theory.
This flying series flight simulator in Canada
is one strategy to experience the best way to chance a plane and
visit various areas of the world.

#33 Dominoqq on 03.30.19 at 8:57 pm

Hi to every body, it's my first visit of this weblog; this blog includes remarkable and actually excellent material in support of visitors.

#34 ir key 5.0.12 full crack on 03.31.19 at 7:55 pm

Therefore, it is important to restore this error with your Pc since
although it gets risky, all those unseen process
computer file are able to use up the stored data slowly.
Working inside a state-of-the-art lab, these experts concentrate on conducting persistent and thorough research, to make certain their methods stop at the
very forefront of digital and computer forensics. Furthermore, all this handles those
Windows xp dll computer files and hang down them into Microsoft windows memory.

#35 youclerks.com on 04.01.19 at 12:05 pm

Hello, I enjoy reading through your article post. I like to write a little
comment to support you.

#36 situs poker online terpopuler on 04.03.19 at 9:10 am

naturally like your web-site but you have to take a look at the spelling on quite a few of your posts.
Several of them are rife with spelling issues and I in finding
it very troublesome to tell the truth on the other hand I'll surely come back again.

#37 נערות ליווי במרכז on 04.09.19 at 4:32 pm

We're having coffee at Nylon Coffee Roasters on Everton Park in Singapore.
I'm having black coffee, he's possessing a cappuccino.
They're handsome. Brown hair slicked back, glasses that are great for
his face, hazel eyes and the most wonderful lips I've seen. They're well made, with incredible arms along
with a chest that stands out within this sweater. We're standing in front of
one another discussing our way of life, what we would like for the future, what we're
looking for on another person. He starts telling me
that he's got been rejected loads of times.

‘Why Andrew? You're so handsome. I'd never reject you ',
I have faith that He smiles at me, biting his lip.

‘Oh, I would not know. Everything happens for a good reason right.
But let me know, make use of reject me, would you Ana?' He said.

‘No, how could I?' , I replied

"So, utilize mind if I kissed you at the moment?' he explained as I buy more detailed him and kiss him.

‘Next time don't ask, function it.' I reply.

‘I love the way you think.' , he said.

While waiting, I start scrubbing my rearfoot as part of his leg, massaging it slowly. ‘What can you enjoy girls? And, Andrew, don't spare me the details.' I ask.

‘I adore determined women. Someone who knows what they have to want. Someone that won't say yes just because I said yes. Someone who's not scared of attempting a new challenge,' he says. ‘I'm never afraid of attempting something totally new, especially on the subject of making new things in bed ', I intimate ‘And I love ladies who are direct, who cut through the chase, like you just did. To be
honest, which is a huge turn on.

#38 נערות ליווי בבת ים on 04.09.19 at 5:08 pm

Yes. This is usually a 1st choice. I can carry on my study and turn an authorized psychologist.
There are various options because field. The curriculum vitae is definitely solid if I actually wanted to get a new job.

#39 http://bacsitathihongduyen.webflow.io/ on 04.10.19 at 9:21 pm

Hello, always i used to check web site posts here early in the daylight,
since i enjoy to gain knowledge of more and more.

#40 עיצוב אתר on 04.10.19 at 10:55 pm

A web site is an essential business tool — and every
business uses its site differently. Some use it to generate instant revenue through ecommerce sales while others use
it to generate leads, phone calls or physical location visits.
There is one thing that every business wants to accomplish with
its website: leveraging it to generate more growth.
There are many ways to boost your leads, sales and revenue
without investing in a complete redesign and rebuild.
Here are 10 hacks that you should think about trying — while
simple, they can potentially help your organization grow significantly.

1. Perform a conversion audit. Are you currently positive your
website is designed to convert traffic? The simple truth is, a lot of web design companies are great at creating appealing websites, nevertheless they aren't conversion rate
experts. Having a full-blown conversion audit performed is
well worth the tiny out-of-pocket expense. Related: 5 Tools to Help You
Audit Your Web Content When you can identify problems and make changes
to fix them just before launching marketing campaigns it wil
dramatically reduce wasted advertising spend and provide you with a stronger base
to begin with

#41 https://Penzu.com/ on 04.11.19 at 11:26 am

I really like it when individuals get together and
share opinions. Great blog, keep it up!

#42 pożyczki online on 04.18.19 at 6:25 am

Hey! I know this is kinda off topic but I was wondering which blog platform are you using for this site?

I'm getting tired of WordPress because I've had problems with hackers and I'm looking at options for another platform.
I would be fantastic if you could point me in the direction of a good platform.

#43 dominoqq link on 04.21.19 at 10:53 am

I blog frequently and I genuinely thank you for your content.
Your article has really peaked my interest. I'm going to bookmark
your blog and keep checking for new information about once a week.
I opted in for your Feed too.

#44 Aly Chiman on 04.23.19 at 11:21 am

Hello there,

My name is Aly and I would like to know if you would have any interest to have your website here at yosefk.com promoted as a resource on our blog alychidesign.com ?

We are in the midst of updating our broken link resources to include current and up to date resources for our readers. Our resource links are manually approved allowing us to mark a link as a do-follow link as well
.
If you may be interested please in being included as a resource on our blog, please let me know.

Thanks,
Aly

#45 Heidi on 04.26.19 at 2:38 pm

Hi there! Someone in my Myspace group shared this site with us
so I came to give it a look. I'm definitely loving the information.
I'm bookmarking and will be tweeting this to my followers!

Superb blog and amazing style and design.

#46 נערות ליווי בראשון לציון on 04.27.19 at 12:14 pm

Yes. It is a initial choice. I can keep on my own analyze and become an authorized psychologist.
There are many alternatives for the reason that field. My cv is usually solid in the
event that I anticipated to get an alternative job.

#47 חדרים לפי שעה בראשון on 04.28.19 at 5:08 pm

Index Search Villas and lofts rented, search by region, find during first minutes a villa for rental by city, various

#48 Situs Poker Online on 05.02.19 at 12:29 am

Amazing! This blog looks just like my old one! It's on a completely different
subject but it has pretty much the same page layout and design. Excellent choice of
colors!

#49 roletonlineuangasli.gamerlaunch.com on 05.02.19 at 5:08 am

Nonetheless, agency officials say that if the pension agency
fails to meet its obligation, the government would come beneath intense political strain to step in. The board can be
too small to meet primary requirements of corporate governance, in keeping with
an analysis by the federal government Accountability Office.
The company's action has also been questioned by the
government Accountability Office, the investigative arm
of Congress, which concluded that the strategy "will doubtless carry more risk" than projected by the agency.

#50 administration services breda on 05.05.19 at 9:19 am

You actually make it seem so easy with your presentation but I find this topic to be actually something which I think I would never understand.
It seems too complex and very broad for me. I am looking forward
for your next post, I'll try to get the hang of it!

#51 Poker Online on 05.05.19 at 2:45 pm

Spot on with this write-up, I seriously think this
web site needs a lot more attention. I'll probably be back again to see more, thanks for the info!

#52 http://ytexadan.com on 05.05.19 at 8:19 pm

Thank you for another informative web site.

Where else may just I get that kind of info written in such an ideal way?
I have a challenge that I'm just now operating on,
and I have been on the glance out for such info.

#53 Life Experience Degrees on 05.07.19 at 8:41 pm

I think that is an fascinating point, it made me think a bit. Thanks for sparking my thinking cap. Sometimes I get so much in a rut that I just believe like a record.

#54 Life Experience Degree on 05.07.19 at 8:42 pm

When someone writes an post he/she keeps the imageof a user in his/her brain that how a user can know it. Thereforethat’s why this post is amazing. Thanks!

#55 INSTAGRAM Algorithm on 05.08.19 at 12:53 pm

Way cool! Some very valid points! I appreciate you penning this article
and also the rest of the website is also very good.

#56 חדרים לפי שעה בתל אביב on 05.10.19 at 1:55 pm

Index Search Villas and lofts to book, search by region, find during first minutes a villa to
book by city, various rooms lofts and villas. Be impressed by photographs
and data that the site has to offer you. The site is a center for all
of you the ads inside

#57 http://lixty.net on 05.11.19 at 12:01 pm

Hello, I log on to your new stuff daily. Your story-telling style is awesome,
keep doing what you're doing!

#58 Alva Macdougald on 05.12.19 at 5:50 pm

I am often to blogging and i also truly appreciate your site content. The content has really peaks my interest. I am about to bookmark your blog and maintain checking for new details.

#59 Dominoqq Online on 05.13.19 at 5:41 am

Greate post. Keep posting such kind of info on your page.
Im really impressed by it.
Hello there, You have performed a great job. I'll certainly digg it and individually suggest to my friends.
I am confident they'll be benefited from this website.

#60 מתקין מזגנים on 05.13.19 at 7:02 am

I have visited your website several times, and found it to be very informative

#61 zajrzyj do nas on 05.13.19 at 9:53 am

Hello, i think that i saw you visited my website thus i came to “return the favor”.I'm attempting to find things to enhance my web site!I suppose its ok to use some
of your ideas!!

#62 לופטים בתל אביב on 05.14.19 at 4:50 am

Index Search Villas and lofts rented, search by region, find in minutes a villa rented by city, several different rooms lofts and villas.
Be afraid of the images and data that they have to provide you.

The website is a center for everybody the ads inside the field, bachelorette party?
Use a friend who leaves Israel? Regardless of the reason why you will
need to rent a villa for a potential event or simply a group
recreation suitable for any age. The website is also the
middle of rooms through the hour, which is definitely
another subject, for lovers who are seeking
an expensive room equipped for discreet entertainment that has a spouse or lover.
Whatever you are interested in, the 0LOFT website makes a hunt for you to find rentals for loft villas and rooms throughout Israel, North South and Gush Dan.

#63 Apex Legends Season 1 on 05.14.19 at 10:17 pm

This website really has all the info I needed concerning this subject and didn’t know who to ask.

#64 Jarod Ahlbrecht on 05.15.19 at 6:19 pm

Hello I will be thus excited I discovered the webpage, I seriously found you unintentionally, while I had been exploring upon Google regarding another thing, Anyhow I will be here today and would likely want to point out cheers to get a amazing post plus a all-round thrilling weblog (I additionally really like the actual theme/design), I don’t have enough time in order to browse all of it in the second but I possess saved that and also added your own Bottles, when I have time I am returning to examine far more, Please do maintain the truly amazing job.

#65 istripper crack on 05.15.19 at 8:04 pm

Enjoyed reading through this, very good stuff, thankyou .

#66 fortnite aimbot download on 05.16.19 at 1:20 pm

I like this page, because so much useful stuff on here : D.

#67 fortnite aimbot download on 05.16.19 at 5:14 pm

Intresting, will come back here more often.

#68 Sabrina Iwata on 05.16.19 at 6:19 pm

5/16/2019 yosefk.com does it again! Quite a thoughtful site and a thought-provoking post. Keep up the good work!

#69 nonsense diamond 1.9 on 05.17.19 at 7:30 am

I really enjoy examining on this page , it has got cool goodies .

#70 dadu sicbo on 05.17.19 at 8:49 am

I am really loving the theme/design of your site.
Do you ever run into any web browser compatibility problems?
A couple of my blog audience have complained about my
site not working correctly in Explorer but looks great in Firefox.

Do you have any advice to help fix this issue?

#71 Mybookmark.stream on 05.17.19 at 9:20 am

Oh my goodness! Incredible article dude! Thank you, However I am
experiencing problems with your RSS. I don't understand the reason why I am unable to subscribe to
it. Is there anybody having similar RSS issues? Anyone who knows the answer can you kindly respond?
Thanks!!

#72 fallout 76 cheats on 05.17.19 at 10:54 am

Great, google took me stright here. thanks btw for this. Cheers!

#73 red dead redemption 2 digital key resale on 05.17.19 at 4:03 pm

This is awesome!

#74 redline v3.0 on 05.17.19 at 7:08 pm

I love reading through and I believe this website got some genuinely utilitarian stuff on it! .

#75 badoo superpowers free on 05.18.19 at 8:34 am

Me enjoying, will read more. Thanks!

#76 forza horizon 4 license key on 05.18.19 at 3:25 pm

I like this site because so much useful stuff on here : D.

#77 mining simulator codes 2019 on 05.19.19 at 7:27 am

Some truly wonderful article on this web site , appreciate it for contribution.

#78 daftar casino online terpercaya on 05.19.19 at 8:16 am

Rainbow Six: Siege via @YouTube @qwertyJaayy
is hilarious!!

#79 smutstone on 05.20.19 at 12:06 pm

google got me here. Cheers!

#80 redline v3.0 on 05.21.19 at 7:37 am

Me enjoying, will read more. Thanks!

#81 free fire hack version unlimited diamond on 05.21.19 at 4:56 pm

Your post has proven useful to me.

#82 nonsense diamond on 05.22.19 at 6:46 pm

I’m impressed, I have to admit. Genuinely rarely should i encounter a weblog that’s both educative and entertaining, and let me tell you, you may have hit the nail about the head. Your idea is outstanding; the problem is an element that insufficient persons are speaking intelligently about. I am delighted we came across this during my look for something with this.

#83 krunker aimbot on 05.23.19 at 7:05 am

This is good. Thanks!

#84 חדרים ברמת גן on 05.23.19 at 8:12 am

thanks a lot a whole lot this excellent website is elegant along with casual

#85 bitcoin adder v.1.3.00 free download on 05.23.19 at 10:44 am

Cheers, great stuff, Me like.

#86 חדרים לפי שעה ברמת השרון on 05.23.19 at 7:00 pm

appreciate it lots this website is definitely formal and also relaxed

#87 vn hax on 05.23.19 at 7:27 pm

I am glad to be one of the visitors on this great website (:, appreciate it for posting .

#88 eternity.cc v9 on 05.24.19 at 8:15 am

This is good. Cheers!

#89 ispoofer pogo activate seriale on 05.24.19 at 6:48 pm

Cheers, great stuff, Me enjoying.

#90 לופטים בחיפה on 05.24.19 at 8:01 pm

Index Search Villas and lofts for rent, search by region, find during first minutes a villa to rent by city,
several different rooms lofts and villas. Be stunned at the photos and
information that the site has to make available you.
The site is a center for everyone the ads while in the field, bachelorette party?
Like someone who leaves Israel? It doesn't matter what the key reason why you
should rent a villa for a forthcoming event or merely an organization recreation ideal for any age.
The website is also center of rooms by the hour, which has already been another subject, for lovers who are trying to
find a lavish room equipped for discreet entertainment by using
a spouse or lover. Regardless of you are looking for, the 0LOFT
website makes a search for you to find rentals for
loft villas and rooms throughout Israel, North
South and Gush Dan.

#91 Renato Knaus on 05.24.19 at 8:47 pm

Great beat ! I wish to apprentice at the same time as you amend your web site, how could i subscribe for a blog website? The account helped me a appropriate deal. I had been a little bit familiar of this your broadcast offered brilliant transparent concept|

#92 poker online cc on 05.24.19 at 9:15 pm

Hi there, You've done an incredible job. I'll certainly digg it and personally recommend to my
friends. I am confident they'll be benefited from this web site.

#93 daftar judi online on 05.25.19 at 12:11 am

Hi there everybody, here every person is sharing such knowledge, therefore it's nice to read this blog,
and I used to go to see this webpage everyday.

#94 cheats for hempire game on 05.26.19 at 6:52 am

This does interest me

#95 iobit uninstaller 7.5 key on 05.26.19 at 9:37 am

Hello, i really think i will be back to your page

#96 smart defrag 6.2 serial key on 05.26.19 at 4:02 pm

Deference to op , some superb selective information .

#97 resetter epson l1110 on 05.26.19 at 6:49 pm

Appreciate it for this howling post, I am glad I observed this internet site on yahoo.

#98 sims 4 seasons free code on 05.27.19 at 8:07 am

Found this on bing and I’m happy I did. Well written website.

Leave a Comment