Amdahl's law in reverse: the wimpy core advantage

Once a chip’s single-core performance lags by more than a factor of two or so behind the higher end of current-generation commodity processors, making a business case for switching to the wimpy system becomes increasingly difficult.

– Google's Urs Hölzle, in "Brawny cores still beat wimpy cores, most of the time"

Google sure knows its own business, so they're probably right when they claim that they need high serial performance. However, different businesses are different, and there are advantages to "wimpy cores" beyond power savings which are omitted from the "brawny/wimpy" paper and which are worth mentioning.

It is a commonplace assumption that a single 3 GHz processor is always better than 4 processors at 750 MHz "because of Amdahl's law". Specifically:

  • Some of your tasks can be parallelized to run on 4 cores – these will take the same time to complete on the two systems.
  • Other tasks can only run on one core – these will take 4x more time to complete on the 750 MHz wimpy-core system.
  • Some tasks are in-between – say, a task using 2 cores will take 2x more time to complete on the 750 MHz system.

Overall, the 3 GHz system never completes later and sometimes completes much earlier.

However, this assumes that a 3 GHz processor is consistently 4x faster than a 750 MHz processor. This is true in those rare cherished moments when both processors are running at their peak throughput. It's not true if both are stuck waiting for a memory request they issued upon a cache miss. For years, memory latency has been lagging behind memory throughput, and the 4x faster processor does not get a 4x smaller memory latency.

Assuming the latency is the same on the two systems – which is often very close to the truth – and that half of the time of the 750 MHz system is spent waiting for memory when running a serial task, the 3 GHz CPU will give you a speed-up of less than 2x.

What if the task can run in parallel on 4 cores? Now the 750 MHz system gets a 4x speed-up and completes more than 2x faster than the 3 GHz system!

(This assumes that the 4 750 MHz cores are not slowed down due to them all accessing the same memory – which, again, is very close to the truth. Memory bandwidth is usually high enough to serve 4 streams of requests – it's latency which is the problem. So having several cores waiting in parallel is faster than having one core waiting serially.)
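
To put rough numbers on it – a back-of-the-envelope sketch of the scenario above (my numbers, following the "half the time spent waiting" assumption):

#include <stdio.h>

int main() {
  double C = 1.0;                      /* compute time of the serial task on a 750 MHz core */
  double M = 1.0;                      /* memory stall time – the same on both systems */
  double wimpy_serial   = C + M;       /* 2.0 */
  double brawny_serial  = C/4 + M;     /* 1.25 – only the compute part shrinks 4x */
  double wimpy_parallel = (C + M) / 4; /* 0.5 – 4 cores, each waiting in parallel */
  printf("brawny vs wimpy, serial task: %.2fx\n", wimpy_serial / brawny_serial);      /* 1.60x, not 4x */
  printf("wimpy parallel vs brawny serial: %.2fx\n", brawny_serial / wimpy_parallel); /* 2.50x */
  return 0;
}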

A slow, parallel system outperforming a fast, serial system – Amdahl's law in reverse!

Is memory latency unique?

In a way it isn't – there are many cases where faster systems fail to get to their peak throughput because of some latency or other. For instance, the cost of mispredicted branches will generally be higher on faster systems.

However, most other latencies of these kinds are much smaller – and are handled rather well by the mechanisms making "brawny" cores larger and deserving of their name. For example, hardware branch predictors and speculative execution can go a long way in reducing the ultimate cost of mispredicted branches.

"Brawny" speculative hardware is useful enough to deal with memory latency as well, of course. Today's high-end cores all have speculative memory prefetching. When you read a contiguous array of memory, the core speculates that you'll keep reading, fetches data ahead of the currently processed elements, and when time comes to process the next elements, they're already right there in the cache.

The problem is that this breaks down once you try to use RAM as RAM - a random access memory where you don't go from index 0 to N but actually "jump around" randomly. (Theoretically this isn't different from branch prediction; in practice, large, non-contiguous data structures not fitting into caches and "unpredictable" for prefetchers are much more common than unpredictable branches or overflowing the branch history table.)
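
To make the contrast concrete – a minimal sketch, not a benchmark: the first loop below has the streaming access pattern that prefetchers handle well, the second is the "use RAM as RAM" pattern that defeats them:

#include <stddef.h>

/* Sequential scan – addresses go 0, 1, 2, ..., so the prefetcher can run ahead of us. */
long sum_array(const int *a, size_t n) {
  long s = 0;
  for (size_t i = 0; i < n; ++i)
    s += a[i];
  return s;
}

/* Pointer chasing – the next address is only known once the current load completes,
   so every step can cost a full memory round-trip. */
struct node { int value; struct node *next; };
long sum_list(const struct node *p) {
  long s = 0;
  for (; p; p = p->next)
    s += p->value;
  return s;
}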

Hence we have pledges like this computer architect's kind request to stop using linked lists:

Why would anyone use a linked-list instead of arrays? I argue that linked lists have become irrelevant in the context of a modern computer system:

1. They reduce the benefit of out-of-order execution.

2. They throw off hardware prefetching.

3. They reduce DRAM and TLB locality.

4. They cannot leverage SIMD.

5. They are harder to send to GPUs.

Note that problems 1, 2 and 4 "disappear" on wimpy cores – that is, such cores don't have out-of-order execution, hardware prefetchers or SIMD instructions.  So a linked list doesn't result in as many missed performance opportunities as it does on brawny cores.

"Why would anyone use linked lists"? It's not just lists, for starters – it's arrays of pointers as well. "Why array of pointers" seems exceedingly obvious – you want to keep references instead of values to save space as well as to update something and actually get the thing updated, not its copy. Also you can have elements of different types – very common with OO code keeping arrays of base class pointers.

And then using arrays and indexing into them instead of using pointers still leaves you with much of the "access randomness" problem. A graph edge can be a pointer or an index; and while mallocing individual nodes might result in somewhat poorer locality than keeping them in one big array, you still bump into problems 1, 2 and 4 for big enough node arrays – because your accesses won't go from 0 to N.

Of course you could argue that memory indirection is "poor design" in the modern world and that programmers should mangle their designs until they fit modern hardware's capabilities. But then much of the "brawny vs wimpy" paper's argument is that brawny cores mean smaller development costs – that you get high performance for little effort. That's no longer true once the advice becomes to fight every instance of memory indirection.

It's also somewhat ironic from a historical perspective, because in the first place, we got to where we are because of not wanting to change anything in our software, and still wishing to get performance gains. The whole point of brawny speculative hardware is, "your old code is fine! Just give me your old binaries and I'll magically speed them up with my latest and greatest chip!"

The upshot is that you can't have it both ways. Either keep speculating (what, not brawny/brainy enough to guess where the next pointer is going to point to?), or admit that speculation has reached its limits and you can no longer deliver on the promises of high serial performance with low development costs.

Formalizing "reverse Amdahl's law"

…is not something I'd want to do.

Amdahl's law has a formal version where you take numbers representing your workload and you get a number quantifying the benefits a parallel system might give you.
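
For reference, the usual form – with p the fraction of the work that can be parallelized and s the speed-up of that fraction – is small enough to write as a one-liner:

/* Amdahl's law: overall speed-up when a fraction p of the work is sped up by a factor s */
double amdahl_speedup(double p, double s) {
  return 1.0 / ((1.0 - p) + p / s);
}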

A similar formalization could be done for "reverse Amdahl's law", taking into account, not just limits to parallelization due to serial bottlenecks (what Amdahl's law does), but also "limits to serialization"/the advantage to parallelization due to various kinds of latency which is better tolerated by parallel systems.

But I think the point here is that simple formal models fail to capture important factors – not that they should be made more complex with more input numbers to be pulled out of, erm, let's call it thin air. You just can't sensibly model real programs and real processors with a reasonably small bunch of numbers.

Speaking of numbers: is time spent waiting actually that large?

I don't know; depends on the system. Are there theoretically awesome CPUs stuck waiting for various things? Absolutely; here's that article about 6% CPU utilization in data servers (it also says that this is viewed as normal by some – as long as RAM and networking resources are utilized. Of course, by itself the 6% figure doesn't mean that brawny cores aren't better – if those 6% are all short, serial sprints letting you complete requests quickly, then maybe you need brawny cores. The question is how particular tasks end up being executed and why.)

Could a "wimpy-core" system improve things for you? It depends. I'm just pointing out why it could.

From big.LITTLE to LITTLE.big, or something

There's a trend of putting a single wimpy core alongside several brawny ones, the wimpy core being responsible for tasks where energy consumption is important, and the brawny ones being used when speed is required.

An interesting alternative is to put one brawny core and many wimpy ones – the brawny core could run the serial parts and the wimpy cores could run the parallel parts. If the task is serial, then a brawny core can only be better – perhaps a lot better, perhaps a little better, but better. If the task is parallel, many wimpy cores might be better than few brawny ones.

(I think both use cases fall under ARM's creatively capitalized "big.LITTLE concept"; if the latter case doesn't, perhaps LITTLE.big could be a catchy name for this – or we could try other capitalization options.)

Hyperthreading/Barrel threading/SIMT

…is also something that helps hide latency using coarse-grain, thread-level parallelism; it doesn't have to be "whole cores".

Parallelizing imperative programs

…is/can be much easier and safer than is commonly believed. But that's a subject for a separate series of posts.

Summary

  • Faster processors are exactly as slow as slower processors when they're stalled – say, because of waiting for memory;
  • Many slow processors are actually faster than few fast ones when stalled, because they're waiting in parallel;
  • All this on top of area & power savings of many wimpy cores compared to few brawny ones.

Is program speed less important than X?

Is program speed less important than safety? Sometimes it is – and sometimes speed is safety. A slow autopilot is a dangerous autopilot. That's why so much safety-critical software is written in the least safe programming languages.

A lot of programs aren't like autopilots – a slower, safer transaction processor is usually better. Google's early credit card charging disaster is my favorite example (it would never happen with a bounds-checked, garbage-collected language runtime).

However, there's the "counter-anecdote" of a cell phone company switching to a fancy new billing system which took more time to charge customers for calls than it took customers to make the calls. It fell behind to the point where they had to forgo charging customers for a month's worth of calls because they couldn't compute the right amounts. So sometimes trusting one's money to a slow program is rather unsafe.

And then there are high-frequency trading algorithms. Speaking of which: is speed less important than program correctness? Sometimes it is – and sometimes speed is correctness. A slower chess program playing under time control will settle for a worse move – a less correct move. Slower project scheduling software will come up with a worse schedule.

Life in general is a game played under time control. Often, "slower" means "being able to process less information in less ways" – in other words, dumber, further away from "correct".

What about time to market – isn't program speed less important than time to market? Sometimes it is – and sometimes higher speed is shorter time to market. A breathtaking game or special effect on today's hardware that others can only pull off on tomorrow's hardware means that the game or the special effect made it first to the market.

("Performance tricks" sound more relevant to the real-time world of games than to the offline rendering of movies; a great counter-example is procedural generation of outdoor landscape in Brave by Inigo Quilez.)

But time to market is also affected by development time; is program speed less important than development time? Sometimes it is – and sometimes higher speed is less development time. A developer waiting for slow programs develops more slowly – and developers often wait for their own programs (tools searching for stuff or summarizing stuff, build systems, tests, machine learning algorithms, …).

Another point is that a developer whose code tends to be too slow will waste time looking for a faster, fancier, buggier algorithm, sometimes sifting through many options, each of which could be fast enough if coded by someone else. A developer whose code tends to be fast the first time will move on to the next thing more quickly.

Is program speed less important than X? Sometimes it is – but sometimes speed is inseparable from X.

Efficiency is fundamentally at odds with elegance

Q: In retrospect, wasn't the decision to trade off programmer efficiency, security, and software reliability in exchange for runtime performance a fundamental mistake?

A: Well, I don’t think I made such a tradeoff. I want elegant and efficient code. Sometimes I get it. The efficiency vs. correctness, efficiency vs. programmer time, efficiency vs. high level, etc. dichotomies are largely bogus.

– An interview with Bjarne Stroustrup

Unlimited-precision symbolic computation is more elegant than floating point numbers. You simply never have any numerical stability problems. Anything algebraically correct – a valid way to solve for x given the equations involving x – is also computationally correct. You don't need to know all the quirks of floating point, and you won't need "numerical recipes" which are basically ways to deal with these quirks.

Symbolic computation is rather widely available – say, in Mathematica, Matlab/Maple, etc. – but it's not used nearly as much as floating point. That's because floating point is much more efficient, and a whole lot of things cannot be done in a reasonable amount of time and space with symbolic computation.

It is undeniable that this efficiency comes at a cost of correctness (in terms of increased likelihood of bugs), programmer time, and elegance. There are plenty of algebraically elegant solutions which just don't work in floating point. If you don't notice, you have a bug; if you do notice, you spend your (programmer) time looking for an alternative, and said alternative may be less elegant.
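
A tiny example of the kind of algebraic identity that silently stops holding: (a + b) - a should equal b, but in double precision it doesn't once a is large enough.

#include <stdio.h>

int main() {
  double a = 1e16, b = 1;
  printf("%g\n", (a + b) - a);  /* prints 0, not 1 – a+b rounds back to a */
  return 0;
}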

Floating point is not the lowest level we can sink to. In many cases the quest for still more efficiency brings us to the dark quagmires of fixed point arithmetic. That's when you have an integer and you implicitly assume it's divided by 2^N – the exponent is statically known so the point isn't floating at run time, so to say. So to implement a+b you just use integer addition, and a*b is an integer multiplication followed by a right shift by N. Or there are midway scenarios where you have many integers with a common, dynamically computed exponent because they all have roughly the same range (FFT is one case where this is often done.)
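
A minimal sketch of the scheme just described – the choice of 14 fractional bits is arbitrary here:

#include <stdint.h>

#define FRAC_BITS 14                  /* the integer is implicitly divided by 2^14 */
typedef int32_t fixed;

fixed fixed_from_double(double x) { return (fixed)(x * (1 << FRAC_BITS)); }
double fixed_to_double(fixed x)   { return (double)x / (1 << FRAC_BITS); }

fixed fixed_add(fixed a, fixed b) { return a + b; }   /* plain integer addition */

fixed fixed_mul(fixed a, fixed b) {
  /* multiply in 64 bits, then shift right by N to restore the scale; the overflow
     risk and the low bits discarded by the shift are exactly the troubles below */
  return (fixed)(((int64_t)a * b) >> FRAC_BITS);
}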

Fixed point is so ugly that there's not even a recipes book that I'm aware of; it just doesn't come out tasty in the slightest. The biggest trouble isn't even the very likely overflow but the loss of precision: floating point guarantees a certain number of significant bits in the mantissa, while fixed point doesn't – unless you explicitly normalize the number at some point using CLZ or similar (making it more of a floating point emulation than "true" fixed point where exponents aren't represented at runtime). You think you have a 32-bit mantissa and an implicit exponent so the number is rather precise – but those 32 bits can have most of the high bits set to zero and then it's not precise at all.

However, a mixture of 8-bit, 16-bit and 32-bit fixed point operations often beats the performance of floating point by a large margin, especially if you have SIMD instructions: you can always fit 4x more 8-bit numbers into a register than 32-bit numbers, and a single-precision floating point multiplier is ~10x more costly in hardware than an 8-bit integer multiplier, so you have fewer of them.

Again, this efficiency comes at a cost of correctness (in terms of increased likelihood of bugs), programmer time, and elegance.

In computer vision, you're often looking for objects of a certain class, and you have a classifier taking a rectangular image region and telling whether this region contains an object of that class. A simple and elegant object detection algorithm is to apply this classifier to every possible rectangle in the image, and then remove rectangles which mostly overlap (as in, if there are 15 similar rectangles saying there's a face in roughly the same place, make one rectangle out of them all).

This elegant algorithm is never used, because there are too many possible rectangles (every coordinate times every size). A common optimization is to use a cascade of classifiers. That is, apply a very cheap classifier with a lot of false positives but hopefully almost no false negatives to every region. The purpose is to throw away most of the rectangles so that the remaining smaller set still contains all the true positives – and a lot of false positives, of course, but far fewer.

This is repeated with many (possibly increasingly expensive) classifiers processing increasingly smaller sets of rectangles. The most widely deployed classifier cascade is probably the Viola-Jones face detector, currently available in most digital cameras displaying little squares around faces. As you may have noticed, it often misses a face, which is to be expected with all the little classifiers hurrying to throw rectangles away. And which is OK for a consumer application where a success rate of 90-95% is perfectly fine and an extra 1% of detection rate is not worth a $0.01 increase in price. The point is that the error rate is undeniably increased by stricter efficiency requirements.
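
A sketch of the cascade structure – the classifier functions here are hypothetical placeholders, not any real detector:

#include <stddef.h>

typedef struct { int x, y, w, h; } rect;

/* hypothetical classifiers: the first is cheap with almost no false negatives,
   the second is expensive but accurate */
int cheap_might_be_face(const unsigned char *img, rect r);
int expensive_is_face(const unsigned char *img, rect r);

size_t detect_faces(const unsigned char *img, const rect *cand, size_t n, rect *out) {
  size_t kept = 0;
  for (size_t i = 0; i < n; ++i) {
    if (!cheap_might_be_face(img, cand[i]))  /* stage 1: quickly discard most rectangles */
      continue;
    if (expensive_is_face(img, cand[i]))     /* stage 2: the costly decision on the survivors */
      out[kept++] = cand[i];
  }
  return kept;
}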

The upshot is that object detection provides a broad family of examples where, again, runtime efficiency comes at a cost of correctness (in terms of increased likelihood of bugs – there's much more code to write – as well as the ultimate detection rate), programmer time, and elegance.

(Even the smallish sub-problem of merging overlapping rectangles provides an example where efficiency has to be bought with all those other things including elegance. A short, readable, elegant solution could use an O(N^2) nested loop where each rectangle is intersected with every other rectangle. One optimization is some sort of spatial data structure where you don't look at rectangles if they don't fall into the same bin of the spatial subdivision because then they can't intersect. That's faster, more buggy and less readable.)
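
For what it's worth, the elegant quadratic version can be as short as this (a sketch, with "mostly overlap" reduced to a plain intersection test):

#include <stddef.h>

typedef struct { int x, y, w, h; } rect;

int overlap(rect a, rect b) {
  return a.x < b.x + b.w && b.x < a.x + a.w &&
         a.y < b.y + b.h && b.y < a.y + a.h;
}

/* assign each rectangle to the group of the first earlier rectangle it overlaps, if any */
void merge_overlapping(const rect *r, size_t n, size_t *group) {
  for (size_t i = 0; i < n; ++i) {
    group[i] = i;
    for (size_t j = 0; j < i; ++j)
      if (overlap(r[i], r[j])) { group[i] = group[j]; break; }
  }
}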

Does this have anything to do with the quote by Stroustrup though? His implied point was how the std::sort template is more elegant as well as more efficient than C's qsort function fiddling with void pointers, right? Or ostream vs printf? Whereas these are all examples of "algorithmic efficiency" – is that even related to language design?

Well, the thing is, "algorithms" and "languages/code" is a continuum:

Think of all the psychic energy expended in seeking a fundamental distinction between "algorithm" and "program". — Alan Perlis

Given that it's a continuum, it is doubtful that a statement which is profoundly wrong in an "algorithmic" context could be true in a "programming" context. If the tradeoff between runtime efficiency and programmer efficiency/"elegance" is fundamental from an algorithmic point of view, then it's likely fundamental in computing in general.

For a concrete example of how blurry the line between "algorithmic efficiency" and "code efficiency" is, let's discuss corner detection. The FAST corner detector is a decision tree looking at pixels surrounding the central pixel and comparing the image intensity of the center to its surroundings. Similarly to other classifier cascades, "not a corner" is a quick decision, while "yes, a corner" is decided after all the checks are done.

The decision tree is implemented in several thousands of auto-generated C code lines with gotos. (That's one addition to the recent discussion about the utility of gotos in systems programming; add computer vision to the list of goto applications, I guess.)

Is it possible to implement the decision tree in a more elegant and readable way? Of course – but at the cost of efficiency; not asymptotic efficiency since it'd be the same decision tree, but efficiency nonetheless.

Is this goto business an "algorithmic" optimization or a "program" optimization? Consider that FAST's entire raison d'etre is being faster than, say, the Harris corner detector. Constants matter for high-resolution images processed in real time.

Consider furthermore that both FAST and Harris are O(#pixels) since they look at a finite, small number of pixels around each coordinate and execute a finite, small number of operations. Consider that which is more efficient depends on the platform – SIMD helps speed up Harris but not FAST, and different SIMD instruction sets speed it up by very different factors. (This is also true for linear classifiers vs Viola-Jones and for other cases.) And consider the fact that algorithmically, they're wildly different – Harris looks at eigenvectors whereas FAST is an intensity-based decision tree, they have tunable decision thresholds with different meaning, and different sets of false positives and false negatives.

So is FAST a work in the area of "algorithms" or "programming", and is the auto-generated mountain of code essential to make it efficient an "algorithm" or a "program"? My answer is that it's both, in the sense that you can't really draw the line.

But what about std::sort and C++'s combination of efficiency and elegance? Well, C++ rather obviously does pay with programmer efficiency for runtime efficiency, without an option to opt out of the deal. Every allocation can leak, every reference can be dangling, every buffer can overflow, etc. etc. etc.

This blindingly obvious fact doesn't surprise those who realize the fundamental tradeoff between efficiency and a whole lot of other things, some of which can be collectively called "elegance". Whereas those refusing to believe in such a tradeoff manage to not even notice the consequences. For example:

The relatively small size of the C++ standard library – primarily reflecting the lack of resources in the ISO C++ standards committee – compared with the huge corporate libraries can be a real disadvantage for C++ developers compared to users of proprietary languages.

…So why do languages without corporate backing which are 2 to 3 times younger than C++, such as Perl, Python, and Ruby, have so many more libraries, both standard and non-standard but widely used?

The best uses of C++ involve deliberate design. You design classes to represent the notions of your application, you organize those classes into hierarchies, you express your algorithms precisely and abstractly (no, that “and” is not a mistake), you use libraries, you build libraries, you devise error handling and resource management strategies and express them in code. The result can be beautiful, efficient, maintainable, etc. However, it’s not just sitting down and writing a series of statements directly manipulating characters, integers, and floating point numbers.

The thing is that actually doing something useful involves a whole lot of "direct manipulations" of characters, integers and floating point numbers – and strings, arrays, hash tables, files, sockets, windows, matrices, etc. Languages which let you "just sit down and write the series of statements" give programmers the extra productivity which results in all those extra libraries getting written.

However, equally undeniably it does cost you runtime efficiency, because you pay an overhead for built-in resource management strategies such as garbage collection, built-in error detection strategies such as bounds checking, and a whole lot of other things.

It's not surprising that Stroustrup sees the problem in the fact that corporations "with the resources" invest them in what he thinks is the wrong thing, presumably because of their self-interested profit motives. Alex Stepanov, who designed the STL, made similar statements, and so did Alan Kay and every other perfectionist technologist. If you seek perfection to the point of denying the existence of most obvious tradeoffs – and tradeoffs are a pesky thing for a perfectionist because they imply that perfection is unattainable – then you're also likely to somewhat resent corporations, markets, etc. For a discussion of that, see my take on Worse Is Better vs The Right Thing.

(Of course there are plenty of perfectionists who, instead of rationalizing C++'s productivity problems, spend their time denying that Python is slow, or keep waiting for Python to become fast. It will not become fast. Also, all its combinations with C/C++ designed to remedy this inefficiency will forever be ugly. We had psyco, PyPy, pyrex, Cython, Unladen Swallow, CPython extension modules, Boost.Python, and who knows what else. Python is not designed to be efficient; it's designed for productivity and for extensibility through a necessarily ugly C FFI. The tradeoff is fundamental. Python is slow forever. Python bindings are ugly forever.)

So if the tradeoff is fundamental, should we give up on efficient resource utilization? No – if the elegant thing is to load the database table into RAM, it doesn't mean that we have enough RAM. Should we give up on programmer productivity? No – inline assembly or lock-free code which isn't obviously bug-free doesn't belong in our cold paths.

We should, however, give up on perfection. Some code will be slower than we want because we don't have time to optimize it, and some code will be uglier than we want because we have no choice but to optimize it.

A hope to defeat a fundamental tradeoff is nothing but a source of frustration, and it's a bliss to have lost such a hope.

How profilers lie: the cases of gprof and KCachegrind

We'll see how gprof and KCachegrind lie to us and why they do so, discuss the limits within which we can trust them nonetheless, and attempt to draw more general conclusions about profilers and profile visualization tools.

(In case your programming vocabulary is different from mine – I use "lying" in its dispassionate meaning of "communicating falsehoods" and not to convey negative judgement; on the contrary, I'm indebted to the authors of profiling tools both as a user and a developer of such tools.)

So, consider a program with two parts – an easy part and a hard part. Both parts do similar work but one part does much more work than the other:

void work(int n) {
  volatile int i=0; //don't optimize away
  while(i++ < n);
}
void easy() { work(1000); }
void hard() { work(1000*1000*1000); }
int main() { easy(); hard(); }

Here, work() is a do-nothing loop. easy() executes a thousand iterations of that loop and hard() executes a billion iterations. We therefore expect main() to spend most of its time in hard() and only a tiny fraction in easy().

Now let's profile the program with gprof:

gcc -o try try.c -pg
./try # saves stats to gmon.out
gprof try gmon.out

On my machine, this prints the following info:

self            self    total
seconds  calls  s/call  s/call  name
   3.84      2    1.92    1.92  work
   0.00      1    0.00    1.92  easy
   0.00      1    0.00    1.92  hard

gprof's lie is in the "total s/call" column: it says easy() and hard() took the same amount of time, instead of hard() taking a million times more to run.

What happened? Can we trust anything that gprof says? Which parts of its output are entirely wrong like this "easy() is the same as hard()" business and which parts are roughly correct, give or take a measurement error? To answer this, we need to briefly discuss how gprof works.

Roughly, gprof's two sources of information are profil() and mcount():

  • profil() – a cousin of creat() in that it could have been spelled with an "e" as well – updates an instruction address histogram every 10 milliseconds. That is, 100 times a second the OS looks at which instruction the program is executing, and increments a counter corresponding to that instruction. So the share of increments corresponding to a function's body is proportionate to the share of time the program spent in the function.
  • mcount() is a function called by assembly code generated by gcc -pg. Specifically, when a function is entered, it calls mcount() to record a call to itself from the caller (the caller is generally easy to identify because it necessarily passes a return address to your function and that address points right into the caller's body.) So if f() calls g() 152 times, mcount(f,g) will be called 152 times.

With this in mind, we can roughly tell what gprof knows. Specifically, it knows that:

  • easy() and hard() were both called once; work(), called from each, ran twice. This info is from mcount() and it's 100% reliable.
  • The program spent almost no time in the code of easy() and hard(), and most of its time in the code of work(). This info is from profil() and it's rather reliable – because the program ran for >3 seconds, which means we had >300 increments in our instruction histogram. If almost all of these increments are in work(), that's significant enough.

What about the share of time easy() spent in its call to work(), and the share of time hard() spent in work()? By now we know that gprof knows absolutely nothing about this. So it guesses, by taking 3.84 – seconds spent in work(), a reliable number – and dividing it between easy() and hard() equally, because each called work() once (based on mcount(), a reliable source) – and we get 1.92. This shows how bad results can be produced from perfectly good measurements, if passed to the wrong algorithm.
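
Schematically, gprof's guess amounts to this (my paraphrase, not gprof's actual code):

/* split a function's measured self time between its callers in proportion to call
   counts – a sensible guess only if every call costs roughly the same */
double attributed_seconds(double callee_self_seconds, int calls_from_caller, int total_calls) {
  /* for work(): 3.84 * 1/2 = 1.92 attributed to each of easy() and hard() */
  return callee_self_seconds * calls_from_caller / total_calls;
}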

More generally, gprof's output falls into the following categories, listed in decreasing order of reliability:

  • Number of calls: 100% reliable in all parts of the report. (I think; please correct me if I'm wrong.)
  • Self seconds in the "Flat profile" (time spent in a given function not including children): reliable to the extent that 100 samples/second is statistically significant given the number of hot spots and the total runtime.
  • Seconds attributed to call graph edges (contribution of children to parents, total runtime spent in self+children, etc.): possibly dead wrong. Only trust it if there's zero code reuse in a given case (that is, f() is only called by g()), or if the function in question is known to take about the same time regardless of the call site (for example, rand()).

BTW, the fact that gprof lies doesn't mean that its documentation does; on the contrary, `man gprof` says, in the BUGS section:

The granularity of the sampling is shown, but remains statistical at best. [this refers to the limited reliability of profil's histogram.] We assume that the time for each execution of a function can be expressed by the total time for the function divided by the number of times the function is called. Thus the time propagated along the call graph arcs to the function's parents is directly proportional to the number of times that  arc is traversed. [this refers to the absolutely unreliable way in which profil's "self time" data is combined with mcount's call graph data.]

Unfortunately, users tend to read tools' output without reading documentation. (The ability of users who aren't into profiling tools to understand the implications of this passage is a separate question.)

The man page also refers to papers from 1982 and 1983. An age of over three decades is a good reason to cut a program some slack. In a way, gprof's age is not only a source of its limitations, such as only 100 samples per second, but also a source of its strengths, such as fast execution of profiled code and wide availability.

Now let's look at a more modern profiler called callgrind – a valgrind plugin. Being more modern, callgrind has a few advantages over gprof – such as not lying in its call graph (though some would debate that as we'll see), and coming with a GUI program to visualize its output called KCachegrind.

KCachegrind the viewer (as opposed to callgrind the measurements collector) does lie in its call tree as opposed to call graph as we'll shortly observe. But first let's have a look at its truthful reporting of the situation with easy() being easier than hard():

[Screenshot: KCachegrind's call graph – true]

As you can see, easy() isn't even shown in the graph (KCachegrind hides things with insignificant cost); you can, however, see the cost of main's call to easy() and hard() in the source view – indeed easy() is ~1000x faster than hard().

Why 1000x and not 1000000x? Because I changed hard() to run a million iterations instead of a billion, bringing the difference down to 1000x. Why did I do that? Because callgrind is slow – it's based on Valgrind which is essentially a processor simulator. This means that you don't measure time - you measure things like instructions fetched and cache misses (which are interesting in their own right), and you get an estimation of the time the program should take given these numbers and your processor model.

It also means callgrind is slow. Is it slower than gprof? Not necessarily. With gprof, code runs at near-native speed, but you only get 100 data points per second. With callgrind you get many more data points per second. So for a hairy program, with callgrind you get statistically significant data more quickly – so effectively callgrind is faster.

But for a simple program with just a couple of hot spots, callgrind is slower because if the program has a costly part 1 and then a costly part 2, it'll take callgrind more time to even get to part 2, whereas gprof, with its near-native speed, will give you good enough data from its fast run.

So much about speed; now let's look at a case where KCachegrind lies to us, and then we'll discuss why it happens.

To expose the lie, we'll need a more complicated program. We'll achieve the required complexity by having two worker functions instead of one, and then adding a manager – a function that does nothing except calling the two workers with the number of iterations to run.

How does the manager decide the number of iterations each worker should run? Based on the project requirements, of course. Our "projects" will be two more functions, each calling the manager with its own pair of iteration numbers for the two workers.

void worker1(int n) {
  volatile int i=0;
  while(i++<n);
}
void worker2(int n) {
  volatile int i=0;
  while(i++<n);
}
void manager(int n1, int n2) {
  worker1(n1);
  worker2(n2);
}
void project1() {
  manager(1000, 1000000);
}
void project2() {
  manager(1000000, 1000);
}
int main() {
  project1();
  project2();
}

As you can see, both workers work on both projects, but each project is mostly done by one of the workers, the other contributing 1000x less work. Now let's see what KCachegrind says about this; we need to run callgrind, which can be done without special compilation flags:

gcc -o try2 try2.c
valgrind --tool=callgrind ./try2
kcachegrind `ls -tr callgrind.out.* | tail -1`

Here's what we'll see:

[Screenshot: KCachegrind call tree – false]

The bottom part of the screen shows us truths, but the top part shows falsehoods.

The truth is that each project called the manager once; the manager called each worker twice; and each worker did half the work in the program – all shown at the call graph at the bottom. However, at the top, each of the project functions occupies half the window and shows that worker1 and worker2 each did half of the work in each project - which couldn't be further from the truth.

This falsehood is reported by the call tree (or "callee map" as KCachegrind calls it) – a view which is supposed to show, for each function, the share of work done in each of its callees relative to the work done by all those callees together (as opposed to the call graph which only links to the called functions and tells how many times they were called by that caller – and their share of work in the entire program.)

Why does the call tree tell a falsehood? Again, because it doesn't know the truth. KCachegrind visualizes callgrind's measurements, which include the number of times f() called g() and the time spent in calls from f() to g().

This is more than gprof's information – way more. gprof only knows how much time was spent in f() and g(), and how many times f() called g(). Callgrind also measures how much time was spent in g() specifically when it was called from f(). In particular, this means that KCachegrind's source view gives a perfectly reliable measurement of the time spent in f plus all its callees – something which users take for granted and something which gprof only guesses, often wildly wrongly.

However, this information is not enough for the call tree to show the truth. Specifically, you only know that manager() spent the same time in calls to worker1() and worker2() overall; you have no idea how much time it spent in each worker when called from project1() and project2(). So you can't reliably plot the share of time worker1() and worker2() spent inside project1() or project2(); you can only guess, often wildly wrongly.

(In fact, you can't tell if manager() even called worker1() when called from project1(); perhaps it didn't – all you know is that manager called worker1() in some context. Some people conclude that the call graph is "incorrect"; in fact it is correct, the question is if you understand what you see the way you're supposed to – you aren't supposed to think that every path through the graph actually happened, only every edge. Another question is how upset you are when you find out (someone with a lot of "manager" functions doing nothing but dispatching might be very upset.) This example certainly broadens the gray area between "truth" and "lies" in profilers' output.)

Callgrind has something which appears like a possible workaround: --separate-callers=N. What this does is it measures things per call stack of size N instead of per call "arc". That is, instead of having a measurement for the arc from manager() to worker1() and a measurement for manager()->worker2(), it measures separately for project1()->manager()->worker1(), project1()->manager()->worker2(), project2()->manager()->worker1(), etc.

Unfortunately, KCachegrind didn't manage to open the resulting output file on my machine, nor did it help when I replaced the ticks (') separating the function names (which get concatenated together) with underscores (_). However, the textual data produced by callgrind indeed shows the truth:

fn=(726) worker2'manager'project1
5 3
+1 1
+1 7000008
+1 2

fn=(736) worker2'manager'project2
5 3
+1 1
+1 7008
+1 2

This shows that worker2() did 1000x more work when called (through manager()) from project1() than it did when called from project2() – as we know it did.

Having looked into the details of two particular examples, we'll proceed to a more general discussion of profilers and profile visualization tools.

Using no profiler at all is better than hoping it'll save the day

I know a few people who like to optimize code and think optimization is important, and who mostly ignore profilers. I know a few other people who claim that a good profiler is the necessary and sufficient prerequisite for optimization. More often than not, such people are not particularly fond of optimization, and their code will be slower than the code of above-mentioned profiler-bashers, before as well as after optimization.

The examples above supposedly show a part of the reason why "a good profiler" is not at all trivial to use.

Basically among the many opinions, there are two extreme ones sounding along the lines of:

  • You don't know where your bottlenecks are going to be, therefore don't bother to optimize before measuring.
  • You don't know where your bottlenecks are going to be – nor will you be able to measure them because adequate measurement and analysis tools plus test scenarios covering the relevant possibilities are hard to come by. Therefore, conserve resources if there's even a slight chance that the code will be a bottleneck.

In my experience, the first viewpoint results in slower code and is less consistent internally. That is, for someone who's not into optimization, a profiler is just not the force multiplier that he thinks it is. That's because he won't put the mental effort required to make an effective use of the profiler for the same reasons making him write slow code in the first place: he just doesn't like all this performance stuff.

The trouble with this being that profilers are not things magically telling you what to do without concentration on your behalf.

You need to understand how a profiler works to use it well

When I first realized how gprof lies in its call graph, I was so offended that I stopped using it for a long while. The problem with profilers is that not all the data is gross lies, but some is, and without knowing which data is likely to be wrong, you might trust things that you shouldn't, scratch your head to the point of hair loss, and then abandon the tool altogether once you realize how it routinely betrays your trust.

With KCachegrind, I came up with the example where it lies based on my knowledge of the callgrind output format – something I know because we (specifically, GK) added support for that format to our in-house profiling tools at work. I wouldn't guess that the Callee Map view is unreliable otherwise. Like most users, I rarely read the docs unless something clearly refuses to work altogether. The stats in the call graph as well as the source view are perfectly reliable. How would I know some other stats aren't?

The extent to which you warn the user about possible implications of assumptions in your software is a tough question for all programmers. Should KCachegrind have a big red warning "I might be misleading you" at its "map" views? A "true power user" doesn't need the warning because they know how the tool works anyway. A "truly careless user" wouldn't read the explanation linked to from the red warning and would just be scared away. Only a "middling user" irresponsible enough to not read the docs but responsible enough to read text linked to from big red warnings might benefit from such design. But how many such users are there compared to the rest?

My own experience here is depressing – nobody, not even the smartest folks, is willing to read anything unless they came here to read. As in, when you're a tutorial that people intend to read, you can tell things to people and they listen. When you're an error message, people read you to the extent necessary to make you go away. And when you're a warning, people ignore you. It sucks to be a warning.

So I'm pessimistic about big red warnings. As to tutorials – people don't expect profilers to be complicated enough to warrant a tutorial, so they probably won't allocate time specifically to read one. They're wrong, but they won't.

Choosing a profiler is hard

There's no perfect profiler – it's a tradeoff, or rather a bunch of tradeoffs. Let's see how complicated those tradeoffs are by comparing gprof, callgrind and Google's CPU profiler:

  • gprof is fast, requires a special compilation, gives you "self time" based on 100 instruction address samples per second, precise call counts, and often bogus "children time" information.
  • callgrind is slow, requires no special compilation, gives time estimations based on event counting, several orders of magnitude more data points per second (making it "effectively faster" for some use cases but not others), precise call counts as well as precise "events counted during a call to each child" information, and comes with a viewer giving correct though possibly misleading call graph and often bogus "map" views.
  • Google's CPU profiler is fast, requires no special compilation unless you're on a 64b platform without a working unwind library, uses a configurable number of samples per second (default: the measly 100, I wonder why), lacks precise call count information, logs full call stacks unlike gprof and callgrind without --separate-callers, but then converts data to many viewing formats where the extra info is lost (such as the callgrind format). You do get a more informative view of the profiling data in some cases.

So basically each of these profilers is better than the rest in some way and worse in some other way (for instance, gprof, at first glance the awful, ancient profiler, is actually your shortest path to precise call counts, which may well be the thing you need the most). And that's before discussing things like the ease of getting a working version for your platform.

As it often is with complicated things, the choice is made harder by the fact that you don't realize many of the implications until after you gained some experience with the tool. I don't think I know what questions to ask about a new profiler, even though I'm relatively savvy. I do realize that I want to understand how it works in a lot of detail.

Not all errors are "noise"

If you're listening to an analogue recording and there's a "shhhh" sound, it's "noise". If, however, someone is yelling near you, then this louder noise isn't "noise" in the mathematical sense – it's another signal. And if someone has overwritten your original recording, then there's only another signal left and none of yours. There's also noise on top of that other signal, but that noise is the least of your troubles.

Not everything standing between you and your signal is "noise"; sometimes it's an error making you look at the wrong signal.

With profilers, people intuitively expect "measurement noise" – a profiler measures time (or some other cost) and no measurement device is perfect. People are willing to accept that – but they do expect fidelity in the assignment of costs to context. "Context" is source lines, functions, and call sequences; people (correctly) think of context as something logical where the concept of "measurement errors" doesn't apply; it's either correct or not.

Unfortunately, people are right, but the conclusion is the opposite of the natural expectation: "context" is indeed a logical concept and not a continuous quantity – and therefore, when the tools give you wrong information, it's really wrong, as in completely detached from reality, and not something "roughly right" give or take a measurement error.

(Our examples focusing on call sequences are rather benign, in a way; for real fun, we could discuss the fidelity of assigning costs to source code lines in optimized programs. Even with a nice assembly listing linking to the various source files from which code was inlined, like the listing provided by KCachegrind, it's… let's say less than easy.)

The upshot is that many errors in profilers' output are not something that can be fixed by running the program more times to get more statistically significant results. Rather, you have to know where the results are only distorted by noise and where there can also be logical errors rendering the numbers meaningless.

Presentation errors can exist without measurement errors

…As evidenced by KCachegrind's false presentation of callgrind's or Google profiler's correct results, or by gprof's false conclusions based on reasonable numbers reported by profil() and perfect numbers from mcount().

To add to the previous point, measurement errors are more likely to be "noise" than presentation errors, which are more likely to tell something unrelated to the truth.

Conclusion

Profiling is trickier than we tend to assume, and as someone developing profilers, I understand programmers who're good at optimization and who mostly ignore profilers. A profiler could help users get way more mileage if it found a way to convince them to actually read a thorough, memorable explanation of its strengths and weaknesses.

It's "locking" if it's blocking

Given a piece of code, how do you know if it's lock-based or lock-free, and why do you care?

Sounds like a trivial question – it's lock-based if it says, "lock a mutex", which is usually an OS service. If it doesn't say "lock" or "unlock", then it's lock-free. And we care because OS services like mutexes are expensive and if we don't use them, code runs faster.

Which is all true, in a way, but it doesn't really answer the question. "The OS" isn't magic – it's code. You can have code implementing userspace locks outside the OS. Or you can have an OS where there's just one address space, and then that OS's locks are "userspace" code like any other. And so the question remains: which piece of code implements "locking" and which doesn't?

It's all code, right? A piece of code is a sort if it sorts. When is code locking? What is locking, as opposed to "lock-free" code? We'll try to answer that question, and then the answer will tell us why (and when) the "locking"/"lock-free" distinction actually matters.

We'll start with two examples – atomic addition and spin locks. The code for these can be surprisingly similar, and this similarity will help highlight the one thing that really sets the two apart.

Consider an atomic addition – something like gcc's __sync_fetch_and_add(), which increments a number at some memory location and returns the old value "atomically".

"Atomic" (only) means that nobody will mess up the state midway through the operation. "Atomic" doesn't mean, for example, that a single instruction is used – in fact, often it's not the case. Nor does "atomic" imply "lock-free". An implementation of atomic addition could legitimately use:

  1. A single machine instruction
  2. A loop observing that someone messed up our state, and retrying
  3. A mutex that's locked while the incrementing is done

Let's say it's a loop, something along the lines of:

do {
  val = *addr;
}
while(!compare_and_swap(addr, val, val+inc));

This is atomic, because if someone modifies addr before we manage to write val+inc, then compare_and_swap (CAS – a typical primitive) will observe that addr no longer holds val, fail, and make us retry. Note that we can get stuck at the loop for an indefinite period of time.

Now consider a spinlock – a loop doing something like:

while(!compare_and_swap(addr, UNLOCKED_VAL, LOCKED_VAL));

This will wait until addr holds UNLOCKED_VAL, and will modify it to keep LOCKED_VAL – unless someone else comes ahead of us, in which case they will write LOCKED_VAL first – and we're back to calling our CAS primitive in a loop. Note that we can get stuck at the loop for an indefinite period of time.

So now you see the difference between "locking" and "lock-free": our atomic addition loop is lock-free, and our spin lock loop implements locking.

Wait, what?

They're both loops, and very similarly-looking ones. Moreover, we can get stuck at both loops for an indefinite period of time. How come they're at the opposite sides of the locking/lock-free distinction?! Where's the difference?

The difference is in whether we get stuck if another thread gets suspended.

The first loop – atomic addition – never gets stuck because of someone else being suspended. On the contrary, it will finish faster. It gets stuck if others modify addr all the time and it keeps retrying. If they're suspended, they can't modify addr, and it succeeds more quickly.

The second loop – the spin lock – will very much get stuck if another thread obtains the lock and then gets suspended before releasing it. And it will keep running until that thread gets the opportunity to finish whatever it did with the lock taken, and releases the lock.

That's why we care – and that's why, with locking, we tend to involve the OS in the first place! Having the OS involved could be a vast improvement over our spin lock loop, because the OS could notice that our thread is stuck because a suspended thread holds a lock. And if that thread is ready to run, the OS could let it run right now, knowing that otherwise, our thread won't make progress anyway.

Moreover, if our thread has a high priority, and the suspended "locker" thread has a low priority, the OS could raise the "locker's" priority until it releases the lock – because releasing the lock that a high-priority thread wants should itself have high priority. (This stuff is known as dealing with "priority inversion" – the situation where they are less important than we are, and yet they block us.)

And for all these good things to happen, the OS needs to know that a lock is taken – which it doesn't with our spin lock loop. For the OS, it's just some loop that could be doing anything. The OS would only know what happens if we used its own locking functions. (BTW, there's another option to get the OS involved: explicitly fiddle with threads and priorities inside the spin lock loop if it takes too long. It can get ugly but it can be the right thing to do.)
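
For illustration, one crude form of such fiddling – a hedged sketch of a spin-then-yield lock (the spin limit is arbitrary, and a real solution would rather use futex-like OS primitives or priority-inheritance mutexes):

#include <sched.h>

#define UNLOCKED_VAL 0
#define LOCKED_VAL   1
#define SPIN_LIMIT   1000   /* arbitrary */

void spin_then_yield_lock(int *addr) {
  int spins = 0;
  while (!__sync_bool_compare_and_swap(addr, UNLOCKED_VAL, LOCKED_VAL)) {
    if (++spins >= SPIN_LIMIT) {
      sched_yield();        /* give the thread holding the lock a chance to run and release it */
      spins = 0;
    }
  }
}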

Of course having the OS involved will cost us, and we don't want locks in our code because of that, but that's a consequence. The root cause is that threads that lock and get suspended block other threads that lock, and this is why spin locks aren't always a great way to get rid of the pesky OS overhead.

This also shows when spin locks or similar are fine – efficient and just as good as "lock-free code". For example, if you're using hardware threads which are never suspended, such as hardware threads in XMOS processors, then locks are fine. We'd certainly see problems if suspension was possible, but it isn't.

There are other, perhaps more common situations, when locking is fine. For instance, two highest-priority threads running on two distinct physical processors can communicate using spin locks very nicely, because they can't get suspended while holding a lock (they could be if interrupts have still higher priority, but we don't care if interrupt latency is small enough – and perhaps they are interrupt handlers locking some common resource.) Such two threads don't need any help from the OS.

Or, perhaps there's a high-priority thread that sets two variables, A and B, for a low-priority thread. A is written to and then B. After getting notified that A was written, the low-priority thread reads B in a loop – as long as B is NULL, it wasn't written to yet, so keep polling.

Effectively, this is locking – if the high-priority thread gets suspended, then the low-priority thread will get stuck in the loop. But, assuming the high-priority thread has the highest priority, it won't get suspended – and again we're fine without any help from the OS.
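
A sketch of that protocol – volatile stands in for proper memory ordering here, and the notification call is a hypothetical placeholder:

#include <stddef.h>

volatile int A;              /* written first by the high-priority thread */
void * volatile B = NULL;    /* written second; NULL means "not ready yet" */

void notify_A_written(void); /* hypothetical notification mechanism */

/* high-priority thread */
void produce(void *b_value) {
  A = 1;
  notify_A_written();
  B = b_value;
}

/* low-priority thread, after being told that A was written */
void *consume(void) {
  while (B == NULL)
    ;                        /* effectively locking: if the producer is suspended here, we spin */
  return B;
}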

These examples show that "locking" isn't defined by "using OS locking functions", and it can take many different forms. You don't spot it in code by looking for certain library calls or for loops having some known look. You spot locking by asking, can I get stuck at this point if someone gets suspended? (And sometimes you answer, "yes, so it's locking; but they won't get suspended, so I don't care.")

Likewise, "lock-free" isn't defined by the use of CAS or LL/SC (load linked/store conditional – two commonly available instructions that can be used to implement compare-and-swap), or by a specific sort of loops.

For instance, our atomic addition loop could be modified to check whether the value exceeds a certain size, quitting the loop without modifying the value if it does:

do {
  val = *addr;
  if(val+inc >= size) break;
}
while(!compare_and_swap(addr, val, val+inc));

This is still atomic addition – if someone modifies addr, it doesn't break our code. We just keep retrying until addr holds something that can be incremented without exceeding a given size, and we manage to actually increment it without someone beating us to it. This can be used to implement a sort of lock-free queue.

And this code isn't "lock-free" because CAS is used in a loop – that much is true for our spin lock as well – it's lock-free because we aren't blocked when threads are suspended (in fact, it helps us if they get suspended, since there's less interference).

And this is why we never need help from the OS with lock-free code.

Before concluding, it's worth noting a way in which lock-free code isn't better than lock-based code. We've already seen that "lock-free" isn't (necessarily) better in any way when suspension doesn't threaten us. But regardless of suspensions, "lock-free" is never better in a particular way – specifically, it isn't better if you're worried about correctness.

For instance, suppose you have a counter keeping an account balance. There's a bug where money is transferred between accounts, and the value of the source account is checked before decrementing it:

if(balance - transfer > 0) {
  balance -= transfer;
}

But several transfers can do the check before reaching the decrementing statement. Then they will all "succeed" – each will decrement the source account balance. As a result, the account balance may reach a negative value, and transfers will be authorized that shouldn't have been.

Here, it doesn't matter if the balance counter is locked or updated using any sort of lock-free addition. Both ways are better than nothing in the sense that all decrements will eventually happen and none will be lost. Both ways are still busted because the check and the decrement should be done atomically, and not just the decrement.
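
For completeness, here's a sketch of a fix in the lock-free style – the check and the decrement become a single atomic step (a lock-based fix would simply hold the lock across both); the function name and the use of C++11 atomics are mine:

#include <atomic>

bool try_transfer(std::atomic<long>& balance, long transfer) {
  long cur = balance.load();
  do {
    if(cur - transfer <= 0)   /* the check fails... */
      return false;           /* ...atomically with the update */
  } while(!balance.compare_exchange_weak(cur, cur - transfer));
  /* compare_exchange_weak reloads cur on failure, so every retry
     redoes the check against the latest balance */
  return true;
}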

A program using a lock-free counter, or container, or any sort of data structure is thus exactly as correct or buggy as a lock-based version of the same program (provided that the locking and the lock-free updates are both implemented correctly, of course). That is, you can make (or avoid) exactly the same mistakes having to do with what orders of events you allow and how you handle them.

In other words, "locking" and "lock-free" are two equivalent ways to order events, the difference being that with locking, you have to deal with suspension, while with lock-free updates, you have to deal with the effects of concurrent interferences.

(You could say that locking tends to be less buggy and potentially less efficient because a uniform approach to coding is possible: just use the OS's locks. But as we've seen, you don't always have to or want to – and on the other hand, a uniform approach to coding is possible with lock-free code as well: just use off-the-shelf lock-free data structure libraries. And just like it's not automatically clear that locks are easier to develop with, it's not automatically clear that lock-free code is more efficient; both questions depend on the details.)

So that's it; I find it interesting, though it could be rather trivial for you – and if you're one of those people, you might have spotted mistakes, in which case I'll be grateful if you point them out. (I'm not that good at this stuff; what I'm comparatively good at is getting things done without this stuff – that is, sweeping it under the rug and getting imperative, concurrent, bug-free programs. But that doesn't generalize to all types of programs and it's a subject for a separate discussion.)

P.S. According to Wikipedia, it's actually a bit subtler. Specifically, the property of not being blocked by suspended threads makes an algorithm "non-blocking", but not "lock-free". "Lock-freedom" is a stronger property – a non-blocking algorithm is lock-free if there's guaranteed progress (as opposed to a situation where threads are busy undoing the effects of each other's interferences, and make no "real" progress). And there's a still stronger property, "wait-freedom", where progress is guaranteed in a bounded number of steps.

I allowed myself to ignore the distinction between non-blocking and lock-free because it's not a key part of what distinguishes lock-based synchronization from all the other kinds, which are often collectively called "lock-free" in informal speech. Of course in practice, a non-blocking algorithm that can get stuck in an interference-undoing endless loop is useless unless there's some sort of a workaround. Presumably that's the reason why "non-blocking" used to be a synonym for "lock-free" until about 2003 – it's not easy to see how a non-blocking algorithm that can get stuck forever could be made usable enough for this sort of thing to deserve a name of its own.

C++ template fuckwittery

You're kidding, right?

(gdb) bt
#0  0xaf88f2e0 in std::lround<Fixed<long, 14> > (__x=51.35198974609375) at /usr/include/c++/4.5/tr1_impl/cmath:710
#1  0xaf88f2e8 in std::lround<Fixed<long, 14> > (__x=51.35198974609375) at /usr/include/c++/4.5/tr1_impl/cmath:710
#2  0xaf88f2e8 in std::lround<Fixed<long, 14> > (__x=51.35198974609375) at /usr/include/c++/4.5/tr1_impl/cmath:710
#3  0xaf88f2e8 in std::lround<Fixed<long, 14> > (__x=51.35198974609375) at /usr/include/c++/4.5/tr1_impl/cmath:710
#4  0xaf88f2e8 in std::lround<Fixed<long, 14> > (__x=51.35198974609375) at /usr/include/c++/4.5/tr1_impl/cmath:710
#5  0xaf88f2e8 in std::lround<Fixed<long, 14> > (__x=51.35198974609375) at /usr/include/c++/4.5/tr1_impl/cmath:710

This goes on for a few tens of thousands of stack frames. Time to open /usr/include/c++/4.5/tr1_impl/cmath:710, which has this little gem:

  template<typename _Tp>
    inline long
    lround(_Tp __x)
    {
      typedef typename __gnu_cxx::__promote<_Tp>::__type __type;
      return lround(__type(__x));
    }

What happened? Fucked if I know (it's a bit hard to get to the bottom of the problem without being able to get to the bottom of the call stack, for starters; I ought to figure out a better way than hitting Enter screen after screen, with gdb asking if I really want to proceed).

Did someone define an std::lround? (A quick grep didn't show signs of that fuckwittery; though I found a couple of std::mins and std::maxes, leading to colorful consequences.)

Did someone define a template lround and do a using namespace std?

Did someone define an implicit casting operator that used to be called here, before this lround template appeared and became a better match for the argument, whatever that means?

Fucked if I know.

I'm sure this lround business wasn't ever supposed to call itself, though it rather obviously can, depending on what __gnu_cxx::__promote<_Tp>::__type ends up being.
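
For the record, here's a simplified sketch of how that recursion can happen – not the real libstdc++ code, and the details may well differ – assuming __promote promotes integer types to double and leaves class types unchanged:

#include <type_traits>

// stand-in for the C library's lround(double) - a non-template overload
// which is the better match for arithmetic arguments
long lround_sketch(double x) { return (long)x; }

template<typename T, bool = std::is_integral<T>::value>
struct promote_sketch { typedef double type; };      // integer types: promote to double

template<typename T>
struct promote_sketch<T, false> { typedef T type; }; // everything else: left unchanged

template<typename T>
long lround_sketch(T x) {
  typedef typename promote_sketch<T>::type type;
  // for int, type is double and the non-template overload above is called;
  // for a class type like Fixed<long, 14>, type is T itself, so this calls
  // the very same instantiation - recursing until the stack blows up
  return lround_sketch(type(x));
}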

All that from trying to upgrade from g++ 4.2 to g++ 4.5 (and to gnu++0x – a C++0x flavor brought to you by GNU, enriched by GNU extensions such as strdup.) Oh the joys and safety of static binding – statically changing the meaning of your code with every compiler upgrade!

It's a good thing I rarely get to deal with C++ these days.

(Why upgrade to C++0x, a.k.a. C++11, a.k.a. C++0xb? Lambdas, for one thing. Also future job interviews. Embrace C++11, or die trying.)

Why custom allocators/pools are hard

In languages with manually managed memory, such as C and C++, as well as in garbage-collected languages, you sometimes want to roll your own memory allocator. Some common reasons are:

  • Speed: return &pool[last++] is faster than malloc. (A real pool would usually be slower than that, but still faster than malloc; especially since your "free", ready-to-be-allocated objects in the pool could have a lot of state initialized already since the last time they were used, unlike a malloc'd buffer – in OO terms, you don't need to call the constructor after allocating).
  • Predictability: people usually refer to "the pool advantage" as "lower fragmentation" and hence a lower chance of running out of memory due to "sudden" fragmentation in unexpected circumstances. Actually, fragmentation is higher with pools: a pool of 100 objects of type A cannot be used to allocate objects of type B, even if you're using just one (or zero) A objects right now – so your memory is very much fragmented. However, it's fragmented predictably, leading to predictable allocation times.
  • Stability: another thing that higher fragmentation buys. Pools let you allocate B objects from the predictably available "B fragment" (pool) even after running out of A objects. This means you can actually handle out-of-memory conditions if you can live without another A object. A malloc-based program "runs out of everything" when it runs out of memory, so it's very unlikely to survive.

How hard are pools? Algorithmically, they're misleadingly easy, unlike malloc which is rather clearly hard. Malloc must be able to allocate chunks of many different sizes in an unpredictable order. A fast algorithm tending to have low fragmentation – or implementing garbage collection and heap defragmentation – is between hard and impossible (where "impossible" means that you'll always have pathological workloads leading to horrible behavior, and you're trying to select an algorithm such that real-life workloads are well-supported and the pathological ones remain theoretical).

Pools, however, usually allocate same-sized chunks. How hard can it be? You keep an array of free pointers and a last index; allocate() pops from this free pointer stack, and free() pushes a pointer back into it, and you're done.
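
For instance, a minimal fixed-size pool could look something like this (a sketch: no thread safety, no debugging aids, and the names are made up):

template<class T, int N>
class Pool {
  T objects[N];
  T* free_stack[N];   // stack of pointers to free objects
  int num_free;
public:
  Pool() : num_free(N) {
    for(int i = 0; i < N; ++i)
      free_stack[i] = &objects[i];
  }
  T* allocate() {
    return num_free ? free_stack[--num_free] : 0;  // pop, or report exhaustion
  }
  void free(T* obj) {
    free_stack[num_free++] = obj;                  // push back
  }
};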

The problem with pools isn't the allocation algorithm but the fact that a new memory management interface has been created. Memory management is fundamental to programming. Built-in memory management interfaces come with a lot of tool support which is rather hard for a custom interface to match.

Consider a garbage-collected language. Most often, such languages provide you a rather strong correctness guarantee: as long as an object is referenced, it will be kept alive and won't be reclaimed and reused for something else. In other words, you're promised to never have problems with dangling references (of course, a subset of these will turn into memory leak problems – too many objects that are no longer "really" needed but referenced – but these are generally way easier to debug).

However, if you implement a pool in a language with GC, that guarantee is gone. The language runtime doesn't know that pool.free(obj) "frees" something – as far as it can tell, the object is very much alive. If someone frees an object and then accesses it, it may very well be that the object has since been reused for something else, and now you have a nasty dangling reference problem.

Your only guarantee now is that you'll only get the "type-safe" variant of dangling references – you'd be fiddling with someone else's object of the same type as yours – but this doesn't necessarily make debugging easier (because changes to the wrong object of the right type may look "too sensible" to provoke the suspicion that they deserve).

Can you tell the runtime, "pool.free actually frees, and I want you to call it instead of your normal reclaiming procedure when the object is no longer referenced?" Perhaps some GC languages have this; it's certainly not a trivial thing to support, because part of the point of pools is to keep hairy, already-constructed objects in them, which point to other objects, some of which might be themselves allocated from pools and some not.

What about languages with manually managed memory? At first glance, the problem seems irrelevant to these because of their "advantage" of not providing any guarantees anyway. You very much can have dangling references with malloc, and pools don't change this.

However, there are tools such as Valgrind which flag a large share of these conditions, by marking chunks passed to free as "inaccessible", and chunks returned by malloc as "undefined" (inaccessible for reading until the first write which initializes the data). The trouble with pools is that, again, Valgrind doesn't know that pool.free frees, and hence it can't flag accesses through dangling references any more.

Is there a workaround? The answer depends on your situation and disposition:

  • Valgrind has a client request mechanism which lets you mark memory regions as "inaccessible" or "undefined", and your pools can issue these requests using Valgrind macros (see the sketch after this list).
  • However, this isn't something that can be done in the pool implementation if the pool keeps constructed objects rather than plain memory chunks. You'll need a per-object-type function marking some of the memory as inaccessible/undefined – but not all of it. For instance, if the object keeps a pointer to a pre-allocated buffer, then maybe the buffer data become undefined when the object is freed and then reallocated, but the pointer to the buffer is defined, because it's already valid. For hairy objects, this can mean a lot of code for making Valgrind work as well as with malloc, and this code can have bugs, marking the wrong things as "defined".
  • If you're using tools other than Valgrind, you'll need to find an equivalent mechanism for these. If you use several tools, then you need to support several mechanisms. There's no standard interface for custom allocators (there could be – there is, in many languages, a standard interface for specifying custom operators, so it's not like there can't be standard ways for doing custom things; there just isn't for pools, at least there isn't in many real languages).
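
Here's roughly what the client requests look like for a pool handing out raw, same-sized chunks (RawPool and its fields are made up for the sketch; as discussed above, a pool of constructed objects would have to be more selective about what it marks):

#include <cstddef>
#include <valgrind/memcheck.h>

struct RawPool {        // a hypothetical raw-chunk pool
  void** free_stack;
  int num_free;
  std::size_t chunk_size;
};

void* debug_pool_alloc(RawPool* pool) {
  void* chunk = pool->free_stack[--pool->num_free];
  // like a freshly malloc'd buffer: reading it before writing gets flagged
  VALGRIND_MAKE_MEM_UNDEFINED(chunk, pool->chunk_size);
  return chunk;
}

void debug_pool_free(RawPool* pool, void* chunk) {
  // like a freed buffer: any further access gets flagged
  VALGRIND_MAKE_MEM_NOACCESS(chunk, pool->chunk_size);
  pool->free_stack[pool->num_free++] = chunk;
}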

The main point I'm trying to make is, don't have every developer roll their own pool, unless it's for a specific type of objects used briefly and locally in some "private" bit of code. If you need pools for many different kinds of objects and these objects have long, non-trivial lifecycles and are accessed in many different contexts, standardize on the pools.

In a whole lot of cases, code reuse actually isn't worth the trouble, and it's fine for people to write their own slightly different versions of something that could have become a common library – because making it common would take too much effort and coordination, and invite misunderstandings during that coordination.

Pools aren't one of those cases. Their algorithmic simplicity actually makes it easy to standardize on a few common variants (what variant can one desire that others don't also need?) – and their non-algorithmic complications make standardization very worthwhile.

There are a bunch of other non-algorithmic problems you can have with pools besides having to describe your live objects to tools – for example:

  • Thread safety is another potentially non-portable aspect of memory allocation which is already handled by the language's built-in allocator and will become a headache for a custom one. You could use OS locks, or spinlocks, or a combination, or you could have a per-thread arena to avoid locking if it's too slow, in which case you'll need to handle deallocation by a thread different from the allocating one. Or perhaps you could do lock-free allocation if, say, there's an atomic increment and it's sufficient.
  • Placement new is something you may need in C++, and one that rather many C++ programmers aren't aware of. If you want to have your pool initialize objects in a memory chunk that's passed to it from outside, and you intend to use the pool with classes having non-empty constructors and destructors, then you'll want to do something like for(i=0;i<n;++i) new (buf+i*sizeof(T)) T(args) or what-not, and call ~T directly when the pool shuts down (see the sketch right after this list). If everyone rolls their own pools, a lot will do this bit wrong.
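
A sketch of that bit, so it's clear what doing it right involves (the alignment of buf is glossed over here; a real pool would have to get that right, too):

#include <new>

template<class T>
void construct_all(void* buf, int n) {
  for(int i = 0; i < n; ++i)
    new((char*)buf + i * sizeof(T)) T();        // placement new: run the constructor, allocate nothing
}

template<class T>
void destroy_all(void* buf, int n) {
  for(int i = 0; i < n; ++i)
    ((T*)((char*)buf + i * sizeof(T)))->~T();   // explicit destructor call at pool shutdown
}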

The upshot is that pools are surprisingly gnarly, and are really best avoided; there's a very good reason to build memory allocation into a programming language. To the extent that circumstances dictate the use of pools, it's a very good idea to standardize on a few common implementations, debug them, and leave them alone (though unfortunately a closed implementation likely cannot deal with live-object tracking on its own, and bugs will appear in the user-implemented parts doing that).

The algorithmic simplicity of a pool, prompting people to just declare an object array and a pointer array and use these as a quick-and-dirty pool, is really quite deceiving.

"Value", the irksome euphemism

An economic value is the worth of a good or service as determined by the market.

Wikipedia

You keep using that word. I do not think it means what you think it means.

Inigo Montoya

If people pay you $1, then the economic value of your good or service has been determined by the market to be $1.

"Creating value" is thus a euphemism for "getting people to pay you money" – which has nothing to do with the usual meaning of "value".

Why is "value" an irksome euphemism? Because heroin dealers "create value", as determined by the market.

In the context of my own profession, all of the following are examples of value creation:

  • An office suite using undocumented and constantly changing formats. Value to users having no other way to access their own and others' documents: $100-$500, depending.
  • A distribution channel allowing developers to deploy software on popular devices, for which no legal alternative exists. Value to developers: 30% of revenue.
  • A social network with a billion private and corporate users who signed up for free, with a new charge to reach a given share of one's audience. Value created: $200 per post.

The more money I've extracted from you, the more value I've created, haven't I?

I'm not picking on Microsoft, Apple or Facebook. I can imagine working for any of them. My conscience is as flexible as the next guy's.

(A particularly inflexible conscience is a horrible condition. Feet to which no mass-produced shoes fit are merely inconvenient. A conscience incompatible with mass-produced social arrangements is a huge burden – not just on its owner, but on his friends and family.)

All I'm saying is that goods and services are distinct from bads and disservices, though both "create value".

Moreover, some sort of disservice tends to be essential to "value creation", a.k.a. the extraction of money. People are attached to their money, and will only part with it when given little choice. Microsoft, Apple and Facebook constantly hone their methods of limiting users' choice. Who doesn't?

Business is what it is. It's not that consumers (us) are any better than producers (us). Nor is it impossible for something "free" – as in speech, beer, rider, whatever – to be a disservice to its users.

I just don't think "value" is the right word.

Will OpenCL help displace GPGPU? Parallella, P2012, …

OpenCL is usually perceived as a C dialect for GPGPU programming – doing "general-purpose" computations (not necessarily graphics) on GPU hardware. "It's like Nvidia's CUDA C, but portable".

However, OpenCL the language is not really tied to the GPU architecture. That is, hardware could run OpenCL programs and have an architecture very different from a GPU, resulting in a very different performance profile.

OpenCL is possibly the first programming language promising portability across accelerators: "OpenCL is for accelerators what C is for CPUs". Portability is disruptive. When hardware vendor A displaces vendor B, portable software usually helps a great deal.

Will OpenCL – "the GPGPU language" – eventually help displace GPGPU, by facilitating "GP-something-else" – "general-purpose" accelerators which aren't like GPUs?

We'll discuss this question on general grounds, and consider two specific examples of recent OpenCL accelerators: Adapteva's Parallella and ST's P2012.

Why displace GPGPU?

First of all, whether GPGPU is likely to be displaced or not – what could "GP-something-else" possibly give us that GPGPU doesn't?

There are two directions from which benefits could come – you could call them two opposite directions:

  1. Let software (ab)use more types of special-purpose accelerators. GPGPU lets you utilize (abuse?) your GPU for general-purpose stuff. It could be nice to have "GPDSP" to utilize the DSPs in your phone, "GPISP" to utilize the ISP, "GPCVP" to utilize computer vision accelerators likely to emerge in the future, etc. From GPGPU to GP-everything.
  2. Give software accelerators which are more general-purpose to begin with. GPGPU means doing your general-purpose stuff under the constraints imposed by the GPU architecture. An OpenCL accelerator lifting some of these constraints could be very welcome.

Could OpenCL help us get benefits from any of the directions (1) and (2)?

(1) is about making use of anal-retentive, efficiency-obsessed, weird, incompatible hardware. It's rather hard, for OpenCL or for any other portable, reasonably "pretty" language.

OpenCL does provide constructs more or less directly mapping to some of the "ugly" features common to many accelerators, for example:

  • Explicitly addressed local memory (as opposed to cache)
  • DMA (bulk memory transfers)
  • Short vector data types to make use of SIMD opcodes
  • Light-weight threads and barriers

But even with GPUs, OpenCL can't target all of the GPU's resources. There's the subset of the GPU accessible to GPGPU programs – and then there are the more idiosyncratic and less flexible parts used for actual graphics processing.

With accelerators such as DSPs and ISPs, my guess is that today most of their value – acceleration ability – is in the idiosyncratic features that can't be made accessible to OpenCL programs. They could evolve, but it's a bit far-fetched and we won't dwell on it now. In their current state, OpenCL is too portable and too "pretty" to map to most accelerators.

What about direction (2)? (2) is about making something that's more efficient than CPUs, but as nice and flexible as possible, and more flexible than GPUs.

As a whole, (2) isn't easy, for various reasons we'll discuss. But if we look, in isolation, at OpenCL the language, then it looks like a great language for targeting "faster-than-CPUs-and-more-flexible-than-GPUs" kind of accelerator.

What could such an accelerator give us that GPUs don't?

One important feature is divergent flow. GPUs are SIMD or SIMT hardware; either way, they can't efficiently support something like this:

if(cond(i)) {
  out[i] = f(i);
}
else {
  out[i] = g(i);
}

What they'll end up doing is, essentially, compute f(i) and g(i) for all values of i, and then throw away some of the results. For deeply nested conditionals, the cost of wasted computations can make the entire exercise of porting to a GPU pointless.
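
In scalar terms, that's roughly equivalent to the following (the float element type and N are stand-ins for the sketch):

for(int i = 0; i < N; ++i) {
  float f_val = f(i);                 // computed for every i...
  float g_val = g(i);                 // ...and so is g(i)
  out[i] = cond(i) ? f_val : g_val;   // one of the two results is thrown away
}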

We'll now have a look at two OpenCL-compatible accelerators which promise to efficiently support divergent threads – or outright independent threads doing something completely unrelated. We'll briefly compare them, and then discuss some of their common benefits as well as common obstacles to their adoption.

Adapteva's Parallella

Actually, the chip's name is Epiphany – Parallella is the recently publicized name of Adapteva's planned platform based on Epiphany; anyway.

Adapteva's architecture is a 2D grid of processors with a mesh interconnect. To scale, you can have a chip with more cores – or you can have a 2D grid of chips with some of the inter-core communication seamlessly crossing chip boundaries. Each of the (scalar) processors executes its own instruction stream – no "marching in lockstep", fittingly for divergent flow.

There are no caches; a memory address can map to your local memory, or the local memory of some other processor in the grid, or to external memory. Access latency will vary accordingly; access to local memories of close neighbors is quicker than access to far neighbors. All memory access can be done using either load/store instructions or DMA.

(Note that you can reach far neighbors – unlike some more "fundamentalist" proposals for "2D scalability" where you can only talk to immediate neighbors, period. I think that's over the top; if you want to run something other than the game of life, it's awfully handy to have long communication paths – as do most computers ranging from FPGAs to neurons, some of which have really long axons.)

Stats:

  • 32K memory per core (unified – both instructions and data)
  • 4 banks that can help avoid contentions between loads/stores, instruction fetching and DMA traffic
  • 2-issue cores (one floating point operation and one integer or load/store operation)
  • 800 MHz at 28nm using a low-power process, as opposed to a high-speed one (my bet is that it's hard to top 800 MHz at 28nm LP – any evidence to the contrary?)
  • ~25mW per core, 2W peak power for a full chip with 64 cores
  • 0.128 mm^2 per core

Sources: BDTI's overview and various documentation from adapteva.com.

ST's Platform 2012

The P2012 architecture is also, at the top level, a grid of processors with a mesh interconnect. One stated motivation is the intra-die variability in future process nodes: some cores will come out slower than others, and some will be unusable.

It is thus claimed that a non-uniform architecture (like the ones we have today – CPUs and a boatload of different accelerators) will become a bad idea. If a core happens to come out badly, and it's not like your other cores, you have to throw away the entire chip. Whereas if cores are all alike, you leave the bad ones unused, and you may still have enough good ones to use the chip.

Interestingly, despite this stated motivation, the P2012 is less uniform and has higher granularity than Epiphany. Firstly, there's a provision for special-purpose accelerators in the grid. Secondly, the top-level mesh connects, not individual cores, but clusters of 16 rather tightly-coupled cores (each with its own flow of control, however – again, good for divergence).

Similarly to Epiphany, data is kept in explicitly addressed local memory rather than cache, and you can access data outside the cluster using load/store instructions or DMA, but you'll pay a price depending on the distance.

However, within a cluster, data access is uniform: the 16 cores share 256K of local data memory. This can be convenient for large working sets. Instructions are private to a core – but they're kept in a cache, not a local memory, conveniently for large programs.

Stats:

  • 32K per core (16 cores with 16K I-cache per core and 32 8K data memory banks)
  • 1MB L2 cache in a 4-cluster, "69-core" chip (presumably, (16+1)x4+1 – one extra core per cluster and per chip)
  • 2-issue cores (I failed to find which instructions can be issued in parallel)
  • 600 MHz at 28nm (process details unclear)
  • 2W for the 69-core chip
  • 0.217 mm^2 per core (3.7 mm^2 per (16+1)-core cluster), not accounting for L2 cache

Source: slides, slides, slides.

Parallella vs P2012: a quick comparison

Each of the architectures can have many different implementations and configurations. It seems fair to compare a 28nm 64-core Epiphany chip with a 28nm 69-core P2012 chip (or at least fair as far as these things go). Each system has its own incompatible native programming interface, but both can also be programmed in OpenCL.

Here's how Epiphany compares to P2012:

  • Power: 1x (2W)
  • Core issue width: 1x (2-issue)
  • Local memory size: 1x (32K per core)
  • Frequency: 1.33x (800/600)
  • Core area efficiency: 1.7x (0.217/0.128)

I think it's a fine achievement for Adapteva – a 5-person company (ST has about 50000 employees – of course not all of them work on the P2012, but still; Chuck Moore's GreenArrays is 18 people – and he's considered the ultimate minimalist, and develops a much more minimalistic product which, for instance, certainly won't run OpenCL programs).

This is not to say that these numbers are sufficient to compare the architectures. For starters, we assume that the power is the same, but we can't know without benchmarking. Energy consumption varies widely across programs – low power process brings leakage down to about zero at room temperature, so you're left with switching power which depends on what code you run, and on what data (multiplying zeros costs almost nothing compared to multiplying noise, for instance).

Then there are programming model differences, ranging from the extent of compliance of floating point to the IEEE standard to the rather different memory models. In the memory department, the ability of P2012 cores to access larger working sets should somewhat negate Epiphany's raw performance advantage on some workloads (though Epiphany cores might have lower latency when accessing their own banks). But then two different 2-issue cores will generally perform differently – you need thorough benchmarking to compare.

So what are these numbers good for? Just for a very rough, ballpark estimation of the cost of this type of core. That is, a core which is flexible enough to run its own instruction stream – but "low-end" enough to burden the programmer with local memory management, and lacking much of the other amenities of full-blown CPUs (speculative prefetching, out-of-order execution, etc.)

Our two examples both point to the same order of magnitude of performance. Let's look at a third system, KALRAY's MPPA – looking more like P2012 than Epiphany, with 16-core clusters and cores sharing memory.

At 28nm, 256 cores are reported to typically consume 5W at 400 MHz (Adapteva and ST claim to give worst-case numbers). That's 20mW per core compared to Epiphany's 25mW – but Epiphany runs at 2x the frequency. If we normalize for frequency – 20/400 = 0.05 mW per MHz vs 25/800 ≈ 0.031 – Epiphany comes out 1.6x more power-efficient, and that's when we compare its worst-case power to MPPA's typical power.

MPPA doesn't support OpenCL at the moment, and I found few details about the architecture; our quick glance is only good to show that these "low-end multicore" machines have the same order of magnitude of efficiency.

So will OpenCL displace GPGPU?

The accelerators of the kind we discussed above – and they're accelerators, not CPUs, because they're horrible at running large programs as opposed to hand-optimized kernels – these accelerators have some nice properties:

  • You get scalar threads which can diverge freely and efficiently – this is a lot of extra flexibility compared to SIMT or SIMD GPUs.
  • For GPGPU workloads that don't need divergence, these accelerators probably aren't much worse than GPUs. You lose some power efficiency because of reading the same instructions from many program memories, but it should be way less than a 2x loss, I'd guess.
  • And there's a programming model ready for these accelerators – OpenCL. They can be programmed in other C dialects, but OpenCL is a widespread, standard one that can be used, and it lets you use features like explicitly managed local memory and DMA in a portable way.

From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?

There are several roadblocks, however. First of all, there's no killer app for this type of thing – by definition. That is, for any killer app, almost certainly a much more efficient accelerator can be built for that domain. Generic OpenCL accelerators are good at accelerating the widest range of things, but they don't excel at accelerating anything.

There is, of course, at least one thriving platform which is, according to the common perception, "good at everything but excels at nothing" – FPGA. (I think it's more complicated than that but I'll leave it for another time.)

FPGAs are great for small to medium scale product delivery. The volume is too small to afford your own chip – but there may be too many things to accelerate which are too different from what an existing chip is good at accelerating. Flexible OpenCL accelerator chips could rival FPGAs here.

What about integrating these accelerators into high-volume chips such as application processors so they could compete with GPUs? Without a killer app, there's a real estate problem. At 100-150 mm^2, today's application processors are already rather large. And the new OpenCL accelerators aren't exactly small – they're bigger than any domain-specific accelerator.

Few chips are likely to include a large accelerator "just in case", without a killer app. Area is considered to be getting increasingly cheap. But we're far from the point where it's "virtually free", and the trend might not continue forever: GlobalFoundries' 14nm is a "low-shrink" node. Today, area is not free.

Of course, a new OpenCL accelerator does give some immediate benefit, so it isn't a purely speculative investment: you could speed up existing OpenCL applications. But for existing code which is careful to avoid divergence, the accelerator would be somewhat less efficient than a GPU, and it wouldn't do graphics nearly as well as the GPU – so it'd be a rather speculative addition indeed.

What would make one add hardware for speculative reasons? A long life cycle. If you believe that your chip will have to accelerate important stuff many years after it's designed, then you'll doubt your ability to predict exactly what this stuff is going to be, and you'll want the most general-purpose accelerator.

Conversely, if you make new chips all the time, quickly sell a load of them, and then move on to market your next design, then you're less inclined to speculate. Anything that doesn't result in a visibly better product today is not worth the cost.

So generic OpenCL accelerators have a better shot at domains with long life cycles, which seem to be a minority. And then even when you found a vendor with a focus on the long term, you have the problem of performance portability.

Let's say platform vendor A does add the new accelerator to their chip. Awesome – except you probably also want to support vendor B's chips, which don't have such accelerators. And so efficient divergence is of no use to you, because it's not portable. Unless vendor A accounts for a very large share of the market – or if it's a dedicated device and you write a dedicated program and you don't care about portability.

OpenCL programs are portable, but their performance is not portable. For instance, if you use vector data types and the target platform doesn't have SIMD, the code will be compiled to scalar instructions, and so on.

What this means in practice is that one or several OpenCL subsets will emerge, containing features that people count on to be supported well. For instance, a relatively good scenario is, there's a subset that GPU programmers use on all GPUs. A worse scenario is, there's the desktop GPU subset and the mobile GPU subset. A still worse scenario is, there's the NVIDIA subset, the AMD subset, the Imagination subset, etc.

It's an evolving type of thing that's never codified anywhere but has more power than the actual standard.

Standards tend to materialize partially. For example, the C standard supports garbage collection, but real C implementations usually don't, and many real C programs will not run correctly on a standard-compliant, garbage-collecting implementation. Someone knowing C would predict this outcome, and would not trust the standard to change it.

So with efficient divergence, the question is, will this feature make it into a widely used OpenCL subset, even though it's not a part of any such subset today? If it doesn't, widespread hardware is unlikely to support it.

Personally, I like accelerators with efficient divergence. I'll say it again:

"From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?"

From an evolutionary standpoint though, it's quite the uphill battle. The CPU + GPU combination is considered "good enough" very widely. It's not impossible to grow in the shadow of a "good enough", established competitor. x86 was good enough and ARM got big, gcc was good enough and LLVM got big, etc.

It's just hard, especially if you can't replace the competitor anywhere and you aren't a must-have. A CPU is a must-have and ARM replaces x86 where it wins. A compiler is a must-have and LLVM replaces gcc where it wins. An OpenCL accelerator with efficient divergence – or any other kind, really – is not a must-have and it will replace neither the CPU nor the GPU. So it's quite a challenge to convince someone to spend on it.

Conclusion

I doubt that general-purpose OpenCL accelerators will displace GPGPU, even though it could be a nice outcome. The accelerators probably will find their uses though. The following properties seem favorable to them (all or a subset may be present in a given domain):

  • Small to medium scale, similarly to FPGAs
  • Long life cycles encouraging speculative investment in hardware
  • Device-specific software that can live without performance portability

In other words, there can be "life between the CPU and the GPU", though not necessarily in the highest volume devices.

Good luck to Adapteva with their Kickstarter project – a computer with 64 independent OpenCL cores for $99.

Do you really want to be making this much money when you're 50?

"Do You Really Want to be Doing This When You're 50?"

Well, I didn't really want to be doing this when I was 20. I'm in it for the money. As long as there's money in programming, I'll stay for the money, in all likelihood.

What else do you want to be doing when you're 50? Give me a profession remotely close to programming in the following ways:

  • Little or no required education
  • Good compensation, even for mediocre performers
  • Millions of jobs
  • No physical effort
  • No health or legal risks

Programming is money for nothing. Programming is very easy to enter and extremely hard to quit. What would you do instead?

I work with three former lawyers – two became programmers and one became a PM. I haven't met programmers who became lawyers. I do know an engineer – not a programmer – who became a patent attorney (reported reason: "at some point, you resent your manager being the age of your kids"). Would you like to become a patent attorney when you're 50?

I had a manager who decided he'd rather be a school teacher, thinking that this line of work is more beneficial to society. He quit after 8 months, saying in his parting interview to a mainstream newspaper: "Sometimes I just want to enter the classroom with a machine gun and open fire". He's with Samsung now; he feels that his contribution to smartphone imagers benefits society substantially enough.

One of my roommates at work has been studying a bunch of things for a while now. He's got a degree in psychology and in something called Visual Theater. He's been programming part-time all the while, which is how he financed his studies. He's programming as a part of his visual performances (there's computer music involved). He'll likely be programming to finance his art work. I'm not sure he plans to quit programming at any defined point.

I've seen a lot of people "quitting" to study anything from physics to philosophy, and then going back to programming. The money is addictive. There are many other sources of satisfaction, of course – which is why I run this blog for free – but much of this satisfaction has to do with demand, directly or indirectly, and is thus very much related to money. "Building something useful" and "making money" are close relatives.

You could, of course, become independently wealthy. But you probably won't, and then programming is your plan B. There's also a thing about material wealth – it's easily taken away. I'm from Soviet Russia, so I tend to exaggerate the likelihood of that – but really, property is easily confiscated, and paper money can become paper overnight. It's not just a USSR thing; the US confiscated gold from its citizens at about the same time as the USSR. Professional ability, however, can't be confiscated. The prudent (paranoid?) independently wealthy programmer will thus make some effort to stay in good shape.

There's the argument that professional programming is stressful. Again – compared to what? A doctor's work? A lawyer's work? Answering calls by irate customers while your responses are recorded for later inspection?

What stress? Programmers who can program at all – as in, print out a binary tree correctly – are very scarce. This scarcity makes it rather hard to push programmers around. You can try to bully them into doing unpaid overtime, but they quickly learn that it's a seller's market, and that you're basically bluffing. You have nobody to replace them with.

With demand outstripping supply, there's enough space in programming for everyone. This makes for a not-so-competitive environment, compared to, say, finance/investment banking type of jobs. Programmers are also typically shielded from customers and senior management – the kind of people who're always right, a trait making communication somewhat tiresome.

Deadlines? Sure, we have them, just like everybody else. Let's admit it though – we tend to miss them, and it's not very stressful to us unless we want it to be. If you're given an impossible schedule, and you do your best, and you miss the deadline, you can suffer deeply or you can maintain mental peace. The fact is that your material well-being is rarely in jeopardy because of a missed deadline, so your reaction is fully up to you.

There's the argument that programmers can't fully understand what's going on, what with all the APIs and layers and stuff. And if you don't understand your own environment, that's stressful and not fun. Fair enough; but again – who understands his environment better than a programmer? A doctor digging into a patient's guts? A lawyer sifting through legal documents? An investor trading financial derivatives? A manager overseeing 10 or 20 programmers? With all the self-inflicted complexity, we're still in better shape than most.

The fact is that there are relatively few programmers in their fifties around. Does it mean people don't survive in programming though? More likely, it is simply a result of growth. There were few 20 year old programmers 30 years ago – compared to 10 years ago. Therefore, there are fewer 50 year old programmers today than 30 year old programmers. To the extent that the growth in programming slows down, things will be different 20 years down the road.

So I'm not planning to quit programming, not because it's such a great source of joy by itself, but because it looks so good compared to just about anything else. Maybe not the most "passionate" statement – but passion burns out, whereas greed is sustainable. And if you plan to quit programming, I wonder what your alternative is, and I won't be surprised if you come back to programming in a few years.