Is program speed less important than X?

Is program speed less important than safety? Sometimes it is – and sometimes speed is safety. A slow autopilot is a dangerous autopilot. That's why so much safety-critical software is written in the least safe programming languages.

A lot of programs aren't like autopilots – a slower, safer transaction processor is usually better. Google's early credit card charging disaster is my favorite example (it would never happen with a bounds-checked, garbage-collected language runtime).

However, there's the "counter-anecdote" of a cell phone company switching to a fancy new billing system which took more time to charge customers for calls than it took customers to make calls. It fell behind to the point where they had to avoid charging customers for a month's worth of calls because they couldn't compute the right amounts. So sometimes trusting one's money to a slow program is rather unsafe.

And then there are high-frequency trading algorithms. Speaking of which: is speed less important than program correctness? Sometimes it is – and sometimes speed is correctness. A slower chess program playing under time control will settle for a worse move – a less correct move. Slower project scheduling software will come up with a worse schedule.

Life in general is a game played under time control. Often, "slower" means "being able to process less information in fewer ways" – in other words, dumber, further away from "correct".

What about time to market – isn't program speed less important than time to market? Sometimes it is – and sometimes higher speed is shorter time to market. A breathtaking game or special effect on today's hardware that others can only pull off on tomorrow's hardware means that the game or the special effect made it first to the market.

("Performance tricks" sound more relevant to the real-time world of games than to the offline rendering of movies; a great counter-example is procedural generation of outdoor landscape in Brave by Inigo Quilez.)

But time to market is also affected by development time; is program speed less important than development time? Sometimes it is – and sometimes higher speed is less development time. A developer waiting for slow programs develops more slowly – and developers often wait for their own programs (tools searching for stuff or summarizing stuff, build systems, tests, machine learning algorithms, …).

Another point is that a developer whose code tends to be too slow will waste time looking for a faster, fancier, buggier algorithm, sometimes sifting through many options, each of which could be fast enough if coded by someone else. A developer whose code tends to be fast the first time will move on to the next thing more quickly.

Is program speed less important than X? Sometimes it is – but sometimes speed is inseparable from X.

Efficiency is fundamentally at odds with elegance

Q: In retrospect, wasn't the decision to trade off programmer efficiency, security, and software reliability in exchange for runtime performance a fundamental mistake?

A: Well, I don’t think I made such a tradeoff. I want elegant and efficient code. Sometimes I get it. The efficiency vs. correctness, efficiency vs. programmer time, efficiency vs. high level, etc. dichotomies are largely bogus.

An interview with Bjarne Stroustrup

Unlimited-precision symbolic computation is more elegant than floating point numbers. You simply never have any numerical stability problems. Anything algebraically correct – a valid way to solve for x given the equations involving x – is also computationally correct. You don't need to know all the quirks of floating point, and you won't need "numerical recipes" which are basically ways to deal with these quirks.

Symbolic computation is rather widely available – say, in Mathematica, Matlab/Maple, etc. – but it's not used nearly as much as floating point. That's because floating point is much more efficient, and a whole lot of things cannot be done in a reasonable amount of time and space with symbolic computation.

It is undeniable that this efficiency comes at a cost of correctness (in terms of increased likelihood of bugs), programmer time, and elegance. There are plenty of algebraically elegant solutions which just don't work in floating point. If you don't notice, you have a bug; if you do notice, you spend your (programmer) time looking for an alternative, and said alternative may be less elegant.
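For a tiny illustration of the kind of quirk you have to know about – a standard example of floating-point absorption, not taken from any particular "numerical recipe" – consider that algebraically, (a + b) - a is simply b, while in single precision the small addend can vanish entirely:

#include <stdio.h>

int main() {
  float a = 1e8f;  /* needs more precision than a float's 24-bit mantissa provides near 1e8 */
  float b = 1.0f;
  /* Algebraically this is b; computationally the 1.0 is absorbed by the 1e8
     during the addition, so this prints 0.000000 instead of 1.000000. */
  printf("%f\n", (a + b) - a);
  return 0;
}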

Floating point is not the lowest level we can sink to. In many cases the quest for still more efficiency brings us to the dark quagmires of fixed point arithmetic. That's when you have an integer and you implicitly assume it's divided by 2^N – the exponent is statically known so the point isn't floating at run time, so to say. So to implement a+b you just use integer addition, and a*b is an integer multiplication followed by a right shift by N. Or there are midway scenarios where you have many integers with a common, dynamically computed exponent because they all have roughly the same range (FFT is one case where this is often done.)
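A minimal sketch of what this looks like in C – assuming a signed 32-bit representation with 14 fractional bits (the N above), and ignoring overflow and rounding entirely:

#include <stdint.h>

#define FRAC_BITS 14  /* the implicit exponent: value = raw / 2^14 */

typedef int32_t fixed;

fixed fixed_from_int(int x)       { return (fixed)x << FRAC_BITS; }
fixed fixed_add(fixed a, fixed b) { return a + b; }  /* plain integer addition */
fixed fixed_mul(fixed a, fixed b) {
  /* widen to 64 bits so the product doesn't overflow before the shift */
  return (fixed)(((int64_t)a * b) >> FRAC_BITS);
}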

Fixed point is so ugly that there's not even a recipes book that I'm aware of; it just doesn't come out tasty in the slightest. The biggest trouble isn't even the very likely overflow but the loss of precision: floating point guarantees a certain number of significant bits in the mantissa, while fixed point doesn't – unless you explicitly normalize the number at some point using CLZ or similar (making it more of a floating point emulation than "true" fixed point where exponents aren't represented at runtime). You think you have a 32-bit mantissa and an implicit exponent so the number is rather precise – but those 32 bits can have most of the high bits set to zero and then it's not precise at all.

However, a mixture of 8-bit, 16-bit and 32-bit fixed point operations often beats the performance of floating point by a large margin; especially if you have SIMD instructions, because you can always fit 4x more 8-bit numbers into a register than 32-bit numbers, and a single-precision floating point multiplier is ~10x more costly in hardware than an 8-bit integer multiplier so you have fewer of them.

Again, this efficiency comes at a cost of correctness (in terms of increased likelihood of bugs), programmer time, and elegance.

In computer vision, you're often looking for objects of a certain class, and you have a classifier taking a rectangular image region and telling whether this region contains an object of that class. A simple and elegant object detection algorithm is to apply this classifier to every possible rectangle in the image, and then remove rectangles which mostly overlap (as in, if there are 15 similar rectangles saying there's a face in roughly the same place, make one rectangle out of them all).

This elegant algorithm is never used, because there are too many possible rectangles (every coordinate times every size). A common optimization is to use a cascade of classifiers. That is, apply a very cheap classifier with a lot of false positives but hopefully almost no false negatives to every region. The purpose is to throw away most of the rectangles so that the remaining smaller set still contains all the true positives – and a lot of false positives, of course, but far fewer.

This is repeated with many (possibly increasingly expensive) classifiers processing increasingly smaller sets of rectangles. The most widely deployed classifier cascade is probably the Viola-Jones face detector, currently available in most digital cameras displaying little squares around faces. As you may have noticed, it often misses a face, which is to be expected with all the little classifiers hurrying to throw rectangles away. And which is OK for a consumer application where a success rate of 90-95% is perfectly fine and an extra 1% of detection rate is not worth a $0.01 increase in price. The point is that the error rate is undeniably increased by stricter efficiency requirements.
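Schematically, a cascade looks something like the sketch below. The two classifier functions are made-up stand-ins (a real detector such as Viola-Jones uses many trained stages); the control flow is the point – each stage only runs on what survived the cheaper stages before it:

typedef struct { int x, y, w, h; } Rect;

/* Made-up stand-in classifiers, for illustration only. */
static int cheap_classifier(const Rect* r)     { return r->w > 8 && r->h > 8; }
static int expensive_classifier(const Rect* r) { return r->w == r->h; }

/* Keep only the candidates surviving every stage. */
static int run_cascade(const Rect* candidates, int n, Rect* out) {
  int kept = 0;
  for (int i = 0; i < n; ++i) {
    if (!cheap_classifier(&candidates[i]))
      continue;  /* most rectangles are thrown away here, cheaply */
    if (!expensive_classifier(&candidates[i]))
      continue;  /* only the few survivors pay for the expensive stage */
    out[kept++] = candidates[i];
  }
  return kept;
}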

The upshot is that object detection provides a broad family of examples where, again, runtime efficiency comes at a cost of correctness (in terms of increased likelihood of bugs – there's much more code to write – as well as the ultimate detection rate), programmer time, and elegance.

(Even the smallish sub-problem of merging overlapping rectangles provides an example where efficiency has to be bought with all those other things including elegance. A short, readable, elegant solution could use an O(N^2) nested loop where each rectangle is intersected with every other rectangle. One optimization is some sort of spatial data structure where you don't look at rectangles if they don't fall into the same bin of the spatial subdivision because then they can't intersect. That's faster, more buggy and less readable.)
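Here's roughly what the short, elegant O(N^2) version could look like; the overlap test is a simplified stand-in (a real detector would typically use an area-overlap threshold rather than "any intersection"):

typedef struct { int x0, y0, x1, y1; } Box;

static int overlaps(const Box* a, const Box* b) {
  return a->x0 < b->x1 && b->x0 < a->x1 && a->y0 < b->y1 && b->y0 < a->y1;
}

/* Check every box against every other box; fold overlapping ones into
   their common bounding box. Returns the new number of boxes. */
static int merge_boxes(Box* r, int n) {
  for (int i = 0; i < n; ++i) {
    for (int j = i + 1; j < n; ) {
      if (overlaps(&r[i], &r[j])) {
        if (r[j].x0 < r[i].x0) r[i].x0 = r[j].x0;
        if (r[j].y0 < r[i].y0) r[i].y0 = r[j].y0;
        if (r[j].x1 > r[i].x1) r[i].x1 = r[j].x1;
        if (r[j].y1 > r[i].y1) r[i].y1 = r[j].y1;
        r[j] = r[--n];  /* drop box j, re-examine the one swapped into its place */
      } else {
        ++j;
      }
    }
  }
  return n;
}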

Does this have anything to do with the quote by Stroustrup though? His implied point was how the std::sort template is more elegant as well as more efficient than C's qsort function fiddling with void pointers, right? Or ostream vs printf? Whereas these are all examples of "algorithmic efficiency" – is that even related to language design?

Well, the thing is, "algorithms" and "languages/code" is a continuum:

Think of all the psychic energy expended in seeking a fundamental distinction between "algorithm" and "program". — Alan Perlis

Given that it's a continuum, it is doubtful that a statement which is profoundly wrong in an "algorithmic" context could be true in a "programming" context. If the tradeoff between runtime efficiency and programmer efficiency/"elegance" is fundamental from an algorithmic point of view, then it's likely fundamental in computing in general.

For a concrete example of how blurry the line between "algorithmic efficiency" and "code efficiency" is, let's discuss corner detection. The FAST corner detector is a decision tree looking at pixels surrounding the central pixel and comparing the image intensity of the center to its surroundings. Similarly to other classifier cascades, "not a corner" is a quick decision, while "yes, a corner" is decided after all the checks are done.

The decision tree is implemented in several thousand lines of auto-generated C code with gotos. (That's one addition to the recent discussion about the utility of gotos in systems programming; add computer vision to the list of goto applications, I guess.)

Is it possible to implement the decision tree in a more elegant and readable way? Of course – but at the cost of efficiency; not asymptotic efficiency since it'd be the same decision tree, but efficiency nonetheless.

Is this goto business an "algorithmic" optimization or a "program" optimization? Consider that FAST's entire raison d'etre is being faster than, say, the Harris corner detector. Constants matter for high-resolution images processed in real time.

Consider furthermore that both FAST and Harris are O(#pixels) since they look at a finite, small number of pixels around each coordinate and execute a finite, small number of operations. Consider that which is more efficient depends on the platform – SIMD helps speed up Harris but not FAST, and different SIMD instruction sets speed it up by very different factors. (This is also true for linear classifiers vs Viola Jones and for other cases.) And consider the fact that algorithmically, they're wildly different – Harris looks at eigenvectors whereas FAST is an intensity-based decision tree, they have tunable decision thresholds with different meaning, and different sets of false positives and false negatives.

So is FAST a work in the area of "algorithms" or "programming", and is the auto-generated mountain of code essential to make it efficient an "algorithm" or a "program"? My answer is that it's both, in the sense that you can't really draw the line.

But what about std::sort and C++'s combination of efficiency and elegance? Well, C++ rather obviously does pay with programmer efficiency for runtime efficiency, without an option to opt out of the deal. Every allocation can leak, every reference can be dangling, every buffer can overflow, etc. etc. etc.

This blindingly obvious fact doesn't surprise those who realize the fundamental tradeoff between efficiency and a whole lot of other things, some of which can be collectively called "elegance". Whereas those refusing to believe in such a tradeoff manage to not even notice the consequences. For example:

The relatively small size of the C++ standard library – primarily reflecting the lack of resources in the ISO C++ standards committee – compared with the huge corporate libraries can be a real disadvantage for C++ developers compared to users of proprietary languages.

…So why do languages without corporate backing which are 2 to 3 times younger than C++, such as Perl, Python, and Ruby, have so many more libraries, both standard and non-standard, but widely used?

The best uses of C++ involve deliberate design. You design classes to represent the notions of your application, you organize those classes into hierarchies, you express your algorithms precisely and abstractly (no, that “and” is not a mistake), you use libraries, you build libraries, you devise error handling and resource management strategies and express them in code. The result can be beautiful, efficient, maintainable, etc. However, it’s not just sitting down and writing a series of statements directly manipulating characters, integers, and floating point numbers.

The thing is that actually doing something useful involves a whole lot of "direct manipulations" of characters, integers and floating point numbers – and strings, arrays, hash tables, files, sockets, windows, matrices, etc. Languages which let you "just sit down and write the series of statements" give programmers the extra productivity which results in all those extra libraries getting written.

However, equally undeniably it does cost you runtime efficiency, because you pay an overhead for built-in resource management strategies such as garbage collection, built-in error detection strategies such as bounds checking, and a whole lot of other things.

It's not surprising that Stroustrup sees the problem in the fact that corporations "with the resources" invest them in what he thinks is the wrong thing, presumably because of their self-interested profit motives. Alex Stepanov who designed the STL expressed similar statements, and so did Alan Kay and every other perfectionist technologist. If you seek perfection to the point of denying the existence of most obvious tradeoffs – and tradeoffs are a pesky thing for a perfectionist because they imply that perfection is unattainable – then you're also likely to somewhat resent corporations, markets, etc. For a discussion of that, see my take on Worse Is Better vs The Right Thing.

(Of course there are plenty of perfectionists who, instead of rationalizing C++'s productivity problems, spend their time denying that Python is slow, or keep waiting for Python to become fast. It will not become fast. Also, all its combinations with C/C++ designed to remedy this inefficiency will forever be ugly. We had psyco, PyPy, pyrex, Cython, Unladen Swallow, CPython extension modules, Boost.Python, and who knows what else. Python is not designed to be efficient; it's designed for productivity and for extensibility through a necessarily ugly C FFI. The tradeoff is fundamental. Python is slow forever. Python bindings are ugly forever.)

So if the tradeoff is fundamental, should we give up on efficient resource utilization? No – if the elegant thing is to load the database table into RAM, it doesn't mean that we have enough RAM. Should we give up on programmer productivity? No – inline assembly or lock-free code which isn't obviously bug-free doesn't belong in our cold paths.

We should, however, give up on perfection. Some code will be slower than we want because we don't have time to optimize it, and some code will be uglier than we want because we have no choice but to optimize it.

A hope to defeat a fundamental tradeoff is nothing but a source of frustration, and it's a bliss to have lost such a hope.

How profilers lie: the cases of gprof and KCachegrind

We'll see how gprof and KCachegrind lie to us and why they do so, discuss the limits within which we can trust them nonetheless, and attempt to draw more general conclusions about profilers and profile visualization tools.

(In case your programming vocabulary is different from mine – I use "lying" in its dispassionate meaning of "communicating falsehoods" and not to convey negative judgement; on the contrary, I'm indebted to the authors of profiling tools both as a user and a developer of such tools.)

So, consider a program with two parts – an easy part and a hard part. Both parts do similar work but one part does much more work than the other:

void work(int n) {
  volatile int i=0; //don't optimize away
  while(i++ < n);
}
void easy() { work(1000); }
void hard() { work(1000*1000*1000); }
int main() { easy(); hard(); }

Here, work() is a do-nothing loop. easy() executes a thousand iterations of that loop and hard() executes a billion iterations. We therefore expect main() to spend most of its time in hard() and only a tiny fraction in easy().

Now let's profile the program with gprof:

gcc -o try try.c -pg
./try # saves stats to gmon.out
gprof try gmon.out

On my machine, this prints the following info:

self            self    total
seconds  calls  s/call  s/call  name
   3.84      2    1.92    1.92  work
   0.00      1    0.00    1.92  easy
   0.00      1    0.00    1.92  hard

gprof's lie is in the "total s/call" column: it says easy() and hard() took the same amount of time, instead of hard() taking a million times more to run.

What happened? Can we trust anything that gprof says? Which parts of its output are entirely wrong like this "easy() is the same as hard()" business and which parts are roughly correct, give or take a measurement error? To answer this, we need to briefly discuss how gprof works.

Roughly, gprof's two sources of information are profil() and mcount():

  • profil() – a cousin of creat() in that it could have been spelled with an "e" as well – updates an instruction address histogram every 10 milliseconds. That is, 100 times a second the OS looks at which instruction the program is executing, and increments a counter corresponding to that instruction. So the share of increments corresponding to a function's body is proportionate to the share of time the program spent in the function.
  • mcount() is a function called by assembly code generated by gcc -pg. Specifically, when a function is entered, it calls mcount() to record a call to itself from the caller (the caller is generally easy to identify because it necessarily passes a return address to your function and that address points right into the caller's body.) So if f() calls g() 152 times, mcount(f,g) will be called 152 times.

With this in mind, we can roughly tell what gprof knows. Specifically, it knows that:

  • easy() and hard() were both called once; work(), called from each, ran twice. This info is from mcount() and it's 100% reliable.
  • The program spent almost no time in the code of easy() and hard(), and most of its time in the code of work(). This info is from profil() and it's rather reliable – because the program ran for >3 seconds, which means we had >300 increments in our instruction histogram. If almost all of these increments are in work(), that's significant enough.

What about the share of time easy() spent in its call to work(), and the share of time hard() spent in work()? By now we know that gprof knows absolutely nothing about this. So it guesses, by taking 3.84 – the seconds spent in work(), a reliable number – and dividing it between easy() and hard() equally, because each called work() once (based on mcount(), a reliable source) – and we get 1.92. This shows how bad results can be produced from perfectly good measurements, if passed to the wrong algorithm.

More generally, gprof's output falls into the following categories, listed in decreasing order of reliability:

  • Number of calls: 100% reliable in all parts of the report. (I think; please correct me if I'm wrong.)
  • Self seconds in the "Flat profile" (time spent in a given function not including children): reliable to the extent that 100 samples/second is statistically significant given the number of hot spots and the total runtime.
  • Seconds attributed to call graph edges (contribution of children to parents, total runtime spent in self+children, etc.): possibly dead wrong. Only trust it if there's zero code reuse in a given case (that is, f() is only called by g()), or if the function in question is known to take about the same time regardless of the call site (for example, rand()).

BTW, the fact that gprof lies doesn't mean that its documentation does; on the contrary, `man gprof` says, in the BUGS section:

The granularity of the sampling is shown, but remains statistical at best. [this refers to the limited reliability of profil's histogram.] We assume that the time for each execution of a function can be expressed by the total time for the function divided by the number of times the function is called. Thus the time propagated along the call graph arcs to the function's parents is directly proportional to the number of times that arc is traversed. [this refers to the absolutely unreliable way in which profil's "self time" data is combined with mcount's call graph data.]

Unfortunately, users tend to read tools' output without reading documentation. (The ability of users who aren't into profiling tools to understand the implications of this passage is a separate question.)

The man page also refers to papers from 1982 and 1983. An age of over three decades is a good reason to cut a program some slack. In a way, gprof's age is not only a source of its limitations, such as only 100 samples per second, but also a source of its strengths, such as fast execution of profiled code and wide availability.

Now let's look at a more modern profiler called callgrind – a valgrind plugin. Being more modern, callgrind has a few advantages over gprof – such as not lying in its call graph (though some would debate that as we'll see), and coming with a GUI program to visualize its output called KCachegrind.

KCachegrind the viewer (as opposed to callgrind the measurements collector) does lie in its call tree as opposed to call graph as we'll shortly observe. But first let's have a look at its truthful reporting of the situation with easy() being easier than hard():

(Screenshot: KCachegrind's call graph – showing the truth.)

As you can see, easy() isn't even shown in the graph (KCachegrind hides things with insignificant cost); you can, however, see the cost of main's calls to easy() and hard() in the source view – indeed easy() is ~1000x faster than hard().

Why 1000x and not 1000000x? Because I changed hard() to run a million iterations instead of a billion, bringing the difference down to 1000x. Why did I do that? Because callgrind is slow – it's based on Valgrind which is essentially a processor simulator. This means that you don't measure time – you measure things like instructions fetched and cache misses (which are interesting in their own right), and you get an estimation of the time the program should take given these numbers and your processor model.

It also means callgrind is slow. Is it slower than gprof? Not necessarily. With gprof, code runs at near-native speed, but you only get 100 data points per second. With callgrind you get many more data points per second. So for a hairy program, with callgrind you get statistically significant data more quickly – so effectively callgrind is faster.

But for a simple program with just a couple of hot spots, callgrind is slower because if the program has a costly part 1 and then a costly part 2, it'll take callgrind more time to even get to part 2, whereas gprof, with its near-native speed, will give you good enough data from its fast run.

So much for speed; now let's look at a case where KCachegrind lies to us, and then we'll discuss why it happens.

To expose the lie, we'll need a more complicated program. We'll achieve the required complexity by having two worker functions instead of one, and then adding a manager – a function that does nothing except calling the two workers with the number of iterations to run.

How does the manager decide the number of iterations each worker should run? Based on the project requirements, of course. Our "projects" will be two more functions, each calling the manager with its own pair of iteration numbers for the two workers.

void worker1(int n) {
  volatile int i=0;
  while(i++<n);
}
void worker2(int n) {
  volatile int i=0;
  while(i++<n);
}
void manager(int n1, int n2) {
  worker1(n1);
  worker2(n2);
}
void project1() {
  manager(1000, 1000000);
}
void project2() {
  manager(1000000, 1000);
}
int main() {
  project1();
  project2();
}
As you can see, both workers work on both projects, but each project is mostly done by one of the workers, the other contributing 1000x less work. Now let's see what KCachegrind says about this; we need to run callgrind, which can be done without special compilation flags:

gcc -o try2 try2.c
valgrind --tool=callgrind ./try2
kcachegrind `ls -tr callgrind.out.* | tail -1`

Here's what we'll see:

(Screenshot: KCachegrind's call tree – showing a falsehood.)

The bottom part of the screen shows us truths, but the top part shows falsehoods.

The truth is that each project called the manager once; the manager called each worker twice; and each worker did half the work in the program – all shown in the call graph at the bottom. However, at the top, each of the project functions occupies half the window and shows that worker1 and worker2 each did half of the work in each project – which couldn't be further from the truth.

This falsehood is reported by the call tree (or "callee map" as KCachegrind calls it) – a view which is supposed to show, for each function, the share of work done in each of its callees relative to the work done by all those callees together (as opposed to the call graph which only links to the called functions and tells how many times they were called by that caller – and their share of work in the entire program.)

Why does the call tree tell a falsehood? Again, because it doesn't know the truth. KCachegrind visualizes callgrind's measurements, which include the number of times f() called g() and the time spent in calls from f() to g().

This is more than gprof's information – way more. gprof only knows how much time was spent in f() and g(), and how many times f() called g(). Callgrind also measures how much time was spent in g() specifically when it was called from f(). In particular, this means that KCachegrind's source view gives a perfectly reliable measurement of the time spent in f() plus all its callees – something which users take for granted and something which gprof only guesses, often wildly wrongly.

However, this information is not enough to know what the call tree needs knowing to show the truth. Specifically, you only know that manager() spent the same time in calls to worker1() and worker2() overall; you have no idea how much time it spent in each worker when called from project1() and project2(). So you can't reliably plot the share of time worker1() and worker2() spent inside project1() or project2(); you can only guess, often wildly wrongly.

(In fact, you can't tell if manager() even called worker1() when called from project1(); perhaps it didn't – all you know is that manager called worker1() in some context. Some people conclude that the call graph is "incorrect"; in fact it is correct, the question is if you understand what you see the way you're supposed to – you aren't supposed to think that every path through the graph actually happened, only every edge. Another question is how upset you are when you find out (someone with a lot of "manager" functions doing nothing but dispatching might be very upset.) This example certainly broadens the gray area between "truth" and "lies" in profilers' output.)

Callgrind has something which appears like a possible workaround: --separate-callers=N. What this does is it measures things per call stack of size N instead of per call "arc". That is, instead of having a measurement for the arc from manager() to worker1() and a measurement for manager()->worker2(), it measures separately for project1()->manager()->worker1(), project1()->manager()->worker2(), project2()->manager()->worker1(), etc.
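For our example, that would be something like the following (2 meaning "separate contexts by up to two levels of callers"):

valgrind --tool=callgrind --separate-callers=2 ./try2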

Unfortunately, KCachegrind didn't manage to open the resulting output file on my machine, nor did it help when I replaced the ticks (') separating the function names (which get concatenated together) with underscores (_). However, the textual data produced by callgrind indeed shows the truth:

fn=(726) worker2'manager'project1
5 3
+1 1
+1 7000008
+1 2

fn=(736) worker2'manager'project2
5 3
+1 1
+1 7008
+1 2

This shows that worker2() did 1000x more work when called (through manager()) from project1() than it did when called from project2() – as we know it did.

Having looked into the details of two particular examples, we'll proceed to a more general discussion of profilers and profile visualization tools.

Using no profiler at all is better than hoping it'll save the day

I know a few people who like to optimize code and think optimization is important, and who mostly ignore profilers. I know a few other people who claim that a good profiler is the necessary and sufficient prerequisite for optimization. More often than not, such people are not particularly fond of optimization, and their code will be slower than the code of the above-mentioned profiler-bashers, before as well as after optimization.

The examples above supposedly show a part of the reason why "a good profiler" is not at all trivial to use.

Basically, among the many opinions, there are two extreme ones that sound along the lines of:

  • You don't know where your bottlenecks are going to be, therefore don't bother to optimize before measuring.
  • You don't know where your bottlenecks are going to be – nor will you be able to measure them because adequate measurement and analysis tools plus test scenarios covering the relevant possibilities are hard to come by. Therefore, conserve resources if there's even a slight chance that the code will be a bottleneck.

In my experience, the first viewpoint results in slower code and is less consistent internally. That is, for someone who's not into optimization, a profiler is just not the force multiplier that he thinks it is. That's because he won't put in the mental effort required to make effective use of the profiler, for the same reasons that make him write slow code in the first place: he just doesn't like all this performance stuff.

The trouble with this is that profilers are not things that magically tell you what to do without concentration on your behalf.

You need to understand how a profiler works to use it well

When I first realized how gprof lies in its call graph, I was so offended that I stopped using it for a long while. The problem with profilers is that not all the data is gross lies, but some is, and without knowing which data is likely to be wrong, you might trust things that you shouldn't, scratch your head to the point of hair loss, and then abandon the tool altogether once you realize how it routinely betrays your trust.

With KCachegrind, I came up with the example where it lies based on my knowledge of the callgrind output format – something I know because we (specifically, GK) added support for that format to our in-house profiling tools at work. I wouldn't guess that the Callee Map view is unreliable otherwise. Like most users, I rarely read the docs unless something clearly refuses to work altogether. The stats in the call graph as well as the source view are perfectly reliable. How would I know some other stats aren't?

The extent to which you warn the user about possible implications of assumptions in your software is a tough question for all programmers. Should KCachegrind have a big red warning "I might be misleading you" at its "map" views? A "true power user" doesn't need the warning because they know how the tool works anyway. A "truly careless user" wouldn't read the explanation linked to from the red warning and would just be scared away. Only a "middling user" irresponsible enough to not read the docs but responsible enough to read text linked to from big red warnings might benefit from such design. But how many such users are there compared to the rest?

My own experience here is depressing – nobody, not even the smartest folks, is willing to read anything unless they came here to read. As in, when you're a tutorial that people intend to read, you can tell things to people and they listen. When you're an error message, people read you to the extent necessary to make you go away. And when you're a warning, people ignore you. It sucks to be a warning.

So I'm pessimistic about big red warnings. As to tutorials – people don't expect profilers to be complicated enough to warrant a tutorial, so they probably won't allocate time specifically to read one. They're wrong, but they won't.

Choosing a profiler is hard

There's no perfect profiler – it's a tradeoff, or rather a bunch of tradeoffs. Let's see how complicated those tradeoffs are by comparing gprof, callgrind and Google's CPU profiler:

  • gprof is fast, requires a special compilation, gives you "self time" based on 100 instruction address samples per second, precise call counts, and often bogus "children time" information.
  • callgrind is slow, requires no special compilation, gives time estimations based on event counting, several orders of magnitude more data points per second (making it "effectively faster" for some use cases but not others), precise call counts as well as precise "events counted during a call to each child" information, and comes with a viewer giving a correct though possibly misleading call graph and often bogus "map" views.
  • Google's CPU profiler is fast, requires no special compilation unless you're on a 64b platform without a working unwind library, uses a configurable amount of samples per second (default: the measly 100, I wonder why), lacks precise call count information, logs full call stacks unlike gprof and callgrind without --separate-callers, but then converts data to many viewing formats where the extra info is lost (such as the callgrind format). You do get a more informative view of the profiling data in some cases (a typical usage sketch follows this list).
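For reference, a typical session with Google's profiler (gperftools) looks roughly like this – assuming gperftools is installed and its pprof script is on your path; details vary between setups:

gcc -o try try.c -lprofiler
CPUPROFILE=try.prof CPUPROFILE_FREQUENCY=1000 ./try
pprof --text ./try try.prof                 # flat profile in the terminal
pprof --callgrind ./try try.prof > try.cg   # convert for KCachegrind; the full stacks are lost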

So basically each of these profilers is better than the rest in some way and worse in some other way (for instance, gprof, at first glance the awful, ancient profiler, is actually your shortest path to precise call counts, which may well be the thing you need the most). And that's before discussing things like the ease of getting a working version for your platform.

As it often is with complicated things, the choice is made harder by the fact that you don't realize many of the implications until after you gained some experience with the tool. I don't think I know what questions to ask about a new profiler, even though I'm relatively savvy. I do realize that I want to understand how it works in a lot of detail.

Not all errors are "noise"

If you're listening to an analogue recording and there's a "shhhh" sound, it's "noise". If, however, someone is yelling near you, then this louder noise isn't "noise" in the mathematical sense – it's another signal. And if someone has overwritten your original recording, then there's only another signal left and none of yours. There's also noise on top of that other signal, but that noise is the least of your troubles.

Not everything standing between you and your signal is "noise"; sometimes it's an error making you look at the wrong signal.

With profilers, people intuitively expect "measurement noise" – a profiler measures time (or some other cost) and no measurement device is perfect. People are willing to accept that – but they do expect fidelity in the assignment of costs to context. "Context" is source lines, functions, and call sequences; people (correctly) think of context as something logical where the concept of "measurement errors" doesn't apply; it's either correct or not.

Unfortunately, people are right, but the conclusion is the opposite of the natural expectation: "context" is indeed a logical concept and not a continuous quantity – and therefore, when the tools give you wrong information, it's really wrong, as in completely detached from reality, and not something "roughly right" give or take a measurement error.

(Our examples focusing on call sequences are rather benign, in a way; for real fun, we could discuss the fidelity of assigning costs to source code lines in optimized programs. Even with a nice assembly listing linking to the various source files from which code was inlined, like the listing provided by KCachegrind, it's… let's say less than easy.)

The upshot is that many errors in profilers' output are not something that can be fixed by running the program more times to get more statistically significant results. Rather, you have to know where the results are only distorted by noise and where there can also be logical errors rendering the numbers meaningless.

Presentation errors can exist without measurement errors

…As evidenced by KCachegrind's false presentation of callgrind's or Google profiler's correct results, or by gprof's false conclusions based on reasonable numbers reported by profil() and perfect numbers from mcount().

To add to the previous point, measurement errors are more likely to be "noise" than presentation errors, which are more likely to tell something unrelated to the truth.

Conclusion

Profiling is trickier than we tend to assume, and as someone developing profilers, I understand programmers who're good at optimization and who mostly ignore profilers. A profiler could help users get way more mileage if it found a way to convince them to actually read a thorough, memorable explanation of its strengths and weaknesses.

It's "locking" if it's blocking

Given a piece of code, how do you know if it's lock-based or lock-free, and why do you care?

Sounds like a trivial question – it's lock-based if it says, "lock a mutex", which is usually an OS service. If it doesn't say "lock" or "unlock", then it's lock-free. And we care because OS services like mutexes are expensive and if we don't use them, code runs faster.

Which is all true, in a way, but it doesn't really answer the question. "The OS" isn't magic – it's code. You can have code implementing userspace locks outside the OS. Or you can have an OS where there's just one address space, and then that OS's locks are "userspace" code like any other. And so the question remains: which piece of code implements "locking" and which doesn't?

It's all code, right? A piece of code is a sort if it sorts. When is code locking? What is locking, as opposed to "lock-free" code? We'll try to answer that question, and then the answer will tell us why (and when) the "locking"/"lock-free" distinction actually matters.

We'll start with two examples – atomic addition and spin locks. The code for these can be surprisingly similar, and this similarity will help highlight the one thing that really sets the two apart.

Consider an atomic addition – something like gcc's __sync_fetch_and_add(), which increments a number at some memory location and returns the old value "atomically".

"Atomic" (only) means that nobody will mess up the state midway through the operation. "Atomic" doesn't mean, for example, that a single instruction is used – in fact, often it's not the case. Nor does "atomic" imply "lock-free". An implementation of atomic addition could legitimately use:

  1. A single machine instruction
  2. A loop observing that someone messed up our state, and retrying
  3. A mutex that's locked while the incrementing is done

Let's say it's a loop, something along the lines of:

do {
  val = *addr;
}
while(!compare_and_swap(addr, val, val+inc));

This is atomic, because if someone modifies addr before we manage to write val+inc, then compare_and_swap (CAS – a typical primitive) will observe that addr no longer holds val, fail, and make us retry. Note that we can get stuck at the loop for an indefinite period of time.

Now consider a spinlock – a loop doing something like:

while(!compare_and_swap(addr, UNLOCKED_VAL, LOCKED_VAL));

This will wait until addr holds UNLOCKED_VAL, and will modify it to hold LOCKED_VAL – unless someone else comes ahead of us, in which case they will write LOCKED_VAL first – and we're back to calling our CAS primitive in a loop. Note that we can get stuck at the loop for an indefinite period of time.
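To make the similarity concrete, here's a compilable sketch of both loops using gcc's __sync builtins (the same family as __sync_fetch_and_add above) – an illustration of the two patterns, not production-quality synchronization code:

/* Lock-free atomic add: retry as long as someone else modified *addr under us. */
int atomic_add(volatile int* addr, int inc) {
  int val;
  do {
    val = *addr;
  } while (!__sync_bool_compare_and_swap(addr, val, val + inc));
  return val;  /* the old value, like __sync_fetch_and_add */
}

#define UNLOCKED_VAL 0
#define LOCKED_VAL   1

/* Spin lock: retry as long as someone else holds the lock. */
void spin_lock(volatile int* addr) {
  while (!__sync_bool_compare_and_swap(addr, UNLOCKED_VAL, LOCKED_VAL))
    ;  /* stuck here for as long as the current owner stays suspended */
}

void spin_unlock(volatile int* addr) {
  __sync_lock_release(addr);  /* stores 0 (UNLOCKED_VAL) with release semantics */
}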

So now you see the difference between "locking" and "lock-free": our atomic addition loop is lock-free, and our spin lock loop implements locking.

Wait, what?

They're both loops, and very similarly-looking ones. Moreover, we can get stuck at both loops for an indefinite period of time. How come they're at the opposite sides of the locking/lock-free distinction?! Where's the difference?

The difference is in whether we get stuck if another thread gets suspended.

The first loop – atomic addition – never gets stuck because of someone else being suspended. On the contrary, it will finish faster. It gets stuck if others modify addr all the time and it keeps retrying. If they're suspended, they can't modify addr, and it succeeds more quickly.

The second loop – the spin lock – will very much get stuck if another thread obtains the lock and then gets suspended before releasing it. And it will keep running until that thread gets the opportunity to finish whatever it did with the lock taken, and releases the lock.

That's why we care – and that's why, with locking, we tend to involve the OS in the first place! Having the OS involved could be a vast improvement over our spin lock loop, because the OS could notice that our thread is stuck because a suspended thread holds a lock. And if that thread is ready to run, the OS could let it run right now, knowing that otherwise, our thread won't make progress anyway.

Moreover, if our thread has a high priority, and the suspended "locker" thread has a low priority, the OS could raise the "locker's" priority until it releases the lock – because releasing the lock that a high-priority thread wants should itself have high priority. (This stuff is known as dealing with "priority inversion" – the situation where they are less important than we are, and yet they block us.)

And for all these good things to happen, the OS needs to know that a lock is taken – which it doesn't with our spin lock loop. For the OS, it's just some loop that could be doing anything. The OS would only know what happens if we used its own locking functions. (BTW, there's another option to get the OS involved: explicitly fiddle with threads and priorities inside the spin lock loop if it takes too long. It can get ugly but it can be the right thing to do.)

Of course having the OS involved will cost us, and we don't want locks in our code because of that, but that's a consequence. The root cause is that threads that lock and get suspended block other threads that lock, and this is why spin locks aren't always a great way to get rid of the pesky OS overhead.

This also shows when spin locks or similar are fine – efficient and just as good as "lock-free code". For example, if you're using hardware threads which are never suspended, such as hardware threads in XMOS processors, then locks are fine. We'd certainly see problems if suspension was possible, but it isn't.

There are other, perhaps more common situations when locking is fine. For instance, two highest-priority threads running on two distinct physical processors can communicate using spin locks very nicely, because they can't get suspended while holding a lock (they could be suspended if interrupts have still higher priority, but we don't care as long as interrupt latency is small enough – and perhaps they are interrupt handlers locking some common resource.) Such two threads don't need any help from the OS.

Or, perhaps there's a high-priority thread that sets two variables, A and B, for a low-priority thread. A is written to and then B. After getting notified that A was written, the low-priority thread reads B in a loop – as long as B is NULL, it wasn't written to yet, so keep polling.

Effectively, this is locking – if the high-priority thread gets suspended, then the low-priority thread will get stuck in the loop. But, assuming the high-priority thread has the highest priority, it won't get suspended – and again we're fine without any help from the OS.

These examples show that "locking" isn't defined by "using OS locking functions", and it can take many different forms. You don't spot it in code by looking for certain library calls or for loops having some known look. You spot locking by asking, can I get stuck at this point if someone gets suspended? (And sometimes you answer, "yes, so it's locking; but they won't get suspended, so I don't care.")

Likewise, "lock-free" isn't defined by the use of CAS or LL/SC (load linked/store conditional – two commonly available instructions that can be used to implement compare-and-swap), or by a specific sort of loops.

For instance, our atomic addition loop could be modified to check whether the value exceeded a certain size, quitting the loop without modifying the value if it did:

do {
  val = *addr;
  if(val+inc >= size) break;
}
while(!compare_and_swap(addr, val, val+inc));

This is still atomic addition – if someone modifies addr, it doesn't break our code. We just keep retrying until addr holds something that can be incremented without exceeding a given size, and we manage to actually increment it without someone beating us to it. This can be used to implement a sort of lock-free queue.

And this code isn't "lock-free" because CAS is used in a loop – that is true for our spin lock as well – but because we aren't blocked when threads are suspended (in fact, it helps us if they get suspended, since there is less interference).

And this is why we never need help from the OS with lock-free code.

Before concluding, it's worth noting a way in which lock-free code isn't better than lock-based code. We've already seen that "lock-free" isn't (necessarily) better in any way when suspension doesn't threaten us. But regardless of suspensions, "lock-free" is never better in a particular way – specifically, it isn't better if you're worried about correctness.

For instance, suppose you have a counter keeping an account balance. There's a bug where money is transferred between accounts, and the value of the source account is checked before decrementing it:

if(balance - transfer > 0) {
  balance -= transfer;
}

But several transfers can do the check before reaching the decrementing statement. Then they will all "succeed" – each will decrement the source account balance. As a result, the account balance may reach a negative value, and transfers will be authorized that shouldn't have been.

Here, it doesn't matter if the balance counter is locked or updated using any sort of lock-free addition. Both ways are better than nothing in the sense that all decrements will eventually happen and none will be lost. Both ways are still busted because the check and the decrement should be done atomically, and not just the decrement.
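To make the check and the decrement atomic at this level, you could either put a lock around both, or use a CAS loop in the spirit of the bounded addition above – a sketch of the latter, using the same gcc builtins as before:

/* Atomically withdraw 'transfer' from *balance only if the original check
   passes; returns 1 on success, 0 if there isn't enough money. */
int try_withdraw(volatile int* balance, int transfer) {
  int val;
  do {
    val = *balance;
    if (val - transfer <= 0)
      return 0;  /* the check... */
  } while (!__sync_bool_compare_and_swap(balance, val, val - transfer));
  return 1;      /* ...and the decrement now happen as one atomic step */
}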

A program using a lock-free counter, or container, or any sort of data structure is thus exactly as correct or buggy as a lock-based version of the same program (provided that the locking and the lock-free updates are both implemented correctly, of course). That is, you can make (or avoid) exactly the same mistakes having to do with what orders of events you allow and how you handle them.

In other words, "locking" and "lock-free" are two equivalent ways to order events, the difference being that with locking, you have to deal with suspension, while with lock-free updates, you have to deal with the effects of concurrent interferences.

(You could say that locking tends to be less buggy and potentially less efficient because a uniform approach to coding is possible: just use the OS's locks. But as we've seen, you don't always have to or want to – and on the other hand, a uniform approach to coding is possible with lock-free code as well: just use off-the-shelf lock-free data structure libraries. And just like it's not automatically clear that locks are easier to develop with, it's not automatically clear that lock-free code is more efficient; both questions depend on the details.)

So that's it; I find it interesting though it could be rather trivial for you – and if you're one of these people, you might have spotted mistakes, in which case I'll be grateful for pointing them out. (I'm not that good at this stuff; what I'm comparatively good at is getting things done without this stuff – that is, sweeping it under the rug and getting imperative, concurrent, bug-free programs. But that doesn't generalize to all types of programs and it's a subject for a separate discussion.)

P.S. according to Wikipedia, it's actually a bit subtler. Specifically, the property of not being blocked by suspended threads makes an algorithm "non-blocking", but not "lock-free". "Lock-freedom" is a stronger property – a non-blocking algorithm is lock-free if there's guaranteed progress (as opposed to a situation where threads are busy undoing the effects of each other's interferences, and make no "real" progress). And there's a still stronger property, "wait-freedom", where progress is guaranteed in a bounded number of steps.

I allowed myself to ignore the distinction between non-blocking and lock-free because it's not a key part of what distinguishes lock-based synchronization from all the other kinds, which are often collectively called "lock-free" in informal speech. Of course in practice, a non-blocking algorithm that can get stuck in an interference-undoing endless loop is useless unless there's some sort of a workaround. Presumably that's the reason why "non-blocking" used to be a synonym for "lock-free" until about 2003 – it's not easy to think how a non-blocking algorithm that can get stuck forever can be made useable enough for this sort of thing to deserve a name of its own.

C++ template fuckwittery

You're kidding, right?

(gdb) bt
#0  0xaf88f2e0 in std::lround<Fixed<long, 14> > (__x=51.35198974609375) at /usr/include/c++/4.5/tr1_impl/cmath:710
#1  0xaf88f2e8 in std::lround<Fixed<long, 14> > (__x=51.35198974609375) at /usr/include/c++/4.5/tr1_impl/cmath:710
#2  0xaf88f2e8 in std::lround<Fixed<long, 14> > (__x=51.35198974609375) at /usr/include/c++/4.5/tr1_impl/cmath:710
#3  0xaf88f2e8 in std::lround<Fixed<long, 14> > (__x=51.35198974609375) at /usr/include/c++/4.5/tr1_impl/cmath:710
#4  0xaf88f2e8 in std::lround<Fixed<long, 14> > (__x=51.35198974609375) at /usr/include/c++/4.5/tr1_impl/cmath:710
#5  0xaf88f2e8 in std::lround<Fixed<long, 14> > (__x=51.35198974609375) at /usr/include/c++/4.5/tr1_impl/cmath:710

This goes on for a few tens of thousands of stack frames. Time to open /usr/include/c++/4.5/tr1_impl/cmath:710, which has this little gem:

  template<typename _Tp>
    inline long
    lround(_Tp __x)
    {
      typedef typename __gnu_cxx::__promote<_Tp>::__type __type;
      return lround(__type(__x));
    }

What happened? Fucked if I know (it's a bit hard to get to the bottom of the problem without being able to get to the bottom of the call stack, for starters; ought to figure out a better way than hitting Enter screen after screen, with gdb asking if I really want to proceed).

Did someone define an std::lround? (A quick grep didn't show signs of that fuckwittery; though I found a couple of std::mins and std::maxes, leading to colorful consequences.)

Did someone define a template lround and did a using namespace std?

Did someone define an implicit casting operator that used to be called here, before this lround template appeared, and became a better match for the argument, whatever that means?

Fucked if I know.

I'm sure this lround business wasn't ever supposed to call itself, though it rather obviously can, depending on what the __gnu_cxx::__promote<_Tp>::__type does.

All that from trying to upgrade from g++ 4.2 to g++ 4.5 (and to gnu++0x – a C++0x flavor brought to you by GNU, enriched by GNU extensions such as strdup.) Oh the joys and safety of static binding – statically changing the meaning of your code with every compiler upgrade!

It's a good thing I rarely get to deal with C++ these days.

(Why upgrade to C++0x, a.k.a. C++11, a.k.a C++0xb? Lambdas, for one thing. Also future job interviews. Embrace C++11, or die trying.)

Why custom allocators/pools are hard

In languages with manually managed memory such as C and C++ as well as in garbage-collected languages, you sometimes want to roll your own memory allocator. Some common reasons are:

  • Speed: return &pool[last++] is faster than malloc. (A real pool would usually be slower than that, but still faster than malloc; especially since your "free", ready-to-be-allocated objects in the pool could have a lot of state initialized already since the last time they were used, unlike a malloc'd buffer – in OO terms, you don't need to call the constructor after allocating).
  • Predictability: people usually refer to "the pool advantage" as "lower fragmentation" and hence less chance of running out of memory due to "sudden" fragmentation in unexpected circumstances. Actually, fragmentation is higher with pools: a pool of 100 objects of type A cannot be used to allocate objects of type B, even if you're using just one (or zero) A objects right now – so your memory is very much fragmented. However, it's fragmented predictably, leading to predictable allocation times.
  • Stability: another thing which higher fragmentation buys. Pools let you allocate B objects after running out of A objects, from the predictably available "B fragment" (pool). This means you can actually handle out-of-memory conditions if you can live without another A object. A malloc-based program "runs out of everything" when it runs out of memory, so it's very unlikely to survive.

How hard are pools? Algorithmically, they're misleadingly easy, unlike malloc which is rather clearly hard. Malloc must be able to allocate chunks of many different sizes in an unpredictable order. A fast algorithm tending to have low fragmentation – or implementing garbage collection and heap defragmentation – is between hard and impossible (where "impossible" means that you'll always have pathological workloads leading to horrible behavior, and you're trying to select an algorithm such that real-life workloads are well-supported and the pathological ones remain theoretical).

Pools, however, usually allocate same-sized chunks. How hard can it be? You keep an array of free pointers and a last index; allocate() pops from this free pointer stack, and free() pushes a pointer back into it, and you're done.
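A minimal sketch of that free-pointer-stack pool, for some hypothetical object type Obj – fixed capacity, no thread safety, and allocation failure reported by returning NULL:

#include <stddef.h>

#define POOL_SIZE 100

typedef struct { int payload; } Obj;  /* whatever the pool holds */

static Obj  pool[POOL_SIZE];
static Obj* free_stack[POOL_SIZE];
static int  last;  /* number of free pointers currently on the stack */

void pool_init(void) {
  for (last = 0; last < POOL_SIZE; ++last)
    free_stack[last] = &pool[last];
}

Obj* pool_alloc(void) {
  return last ? free_stack[--last] : NULL;  /* pop a free pointer */
}

void pool_free(Obj* obj) {
  free_stack[last++] = obj;  /* push it back */
}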

The problem with pools isn't the allocation algorithm but the fact that a new memory management interface has been created. Memory management is fundamental to programming. Built-in memory management interfaces come with a lot of tool support which is rather hard for a custom interface to match.

Consider a garbage-collected language. Most often, such languages provide you a rather strong correctness guarantee: as long as an object is referenced, it will be kept alive and won't be reclaimed and reused for something else. In other words, you're promised to never have problems with dangling references (of course, a subset of these will turn into memory leak problems – too many objects that are no longer "really" needed but referenced – but these are generally way easier to debug).

However, if you implement a pool in a language with GC, that guarantee is gone. The language runtime doesn't know that pool.free(obj) "frees" something – as far as it can tell, the object is very much alive. If someone frees an object and then accesses it, it may very well be that the object has since been reused for something else, and now you have a nasty dangling reference problem.

Your only guarantee now is that you'll only get the "type-safe" variant of dangling references – you'd be fiddling with someone else's object of the same type as yours – but this doesn't necessarily make debugging easier (because changes to the wrong object of the right type may look "too sensible" to provoke the suspicion that they deserve).

Can you tell the runtime, "pool.free actually frees, and I want you to call it instead of your normal reclaiming procedure when the object is no longer referenced?" Perhaps some GC languages have this; it's certainly not a trivial thing to support, because part of the point of pools is to keep hairy, already-constructed objects in them, which point to other objects, some of which might be themselves allocated from pools and some not.

What about languages with manually managed memory? At first glance, the problem seems irrelevant to these because of their "advantage" of not providing any guarantees anyway. You very much can have dangling references with malloc, and pools don't change this.

However, there are tools such as Valgrind which flag a large share of these conditions, by marking chunks passed to free as "inaccessible", and chunks returned by malloc as "undefined" (inaccessible for reading until the first write which initializes the data). The trouble with pools is that, again, Valgrind doesn't know that pool.free frees, and hence it can't flag accesses through dangling references any more.

Is there a workaround? The answer depends on your situation and disposition:

  • Valgrind has a client request mechanism which lets you mark memory regions as "inaccessible" or "undefined", and your pools can issue these requests using Valgrind macros (a sketch follows this list).
  • However, this isn't something that can be done in the pool implementation alone if the pool keeps constructed objects rather than plain memory chunks. You'll need a per-object-type function marking some of the memory as inaccessible/undefined – but not all of it. For instance, if the object keeps a pointer to a pre-allocated buffer, then the buffer data should become undefined when the object is freed and later reallocated, but the pointer to the buffer should stay defined, because it's still valid. For hairy objects, this can mean a lot of code just to make Valgrind work as well as it does with malloc, and this code can have bugs of its own, marking the wrong things as "defined".
  • If you're using tools other than Valgrind, you'll need to find an equivalent mechanism for these. If you use several tools, then you need to support several mechanisms. There's no standard interface for custom allocators (there could be – there is, in many languages, a standard interface for specifying custom operators, so it's not like there can't be standard ways for doing custom things; there just isn't for pools, at least there isn't in many real languages).
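
For illustration, here's how the hypothetical free-pointer-stack pool sketched above could issue these requests for plain memory chunks (constructed objects, as discussed above, need extra per-type code); the VALGRIND_MAKE_MEM_* macros come from Valgrind's memcheck.h:

#include <valgrind/memcheck.h>

void* pool_alloc_checked(pool* p) {
  void* chunk = p->free_list[--p->num_free];
  /* the chunk is allocated again, but its contents are stale -
     treat it like a freshly malloc'd buffer */
  VALGRIND_MAKE_MEM_UNDEFINED(chunk, p->chunk_size);
  return chunk;
}

void pool_free_checked(pool* p, void* chunk) {
  p->free_list[p->num_free++] = chunk;
  /* from now on, accesses through dangling references get flagged */
  VALGRIND_MAKE_MEM_NOACCESS(chunk, p->chunk_size);
}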

The main point I'm trying to make is, don't have every developer roll their own pool, unless it's for a specific type of objects used briefly and locally in some "private" bit of code. If you need pools for many different kinds of objects and these objects have long, non-trivial lifecycles and are accessed in many different contexts, standardize on the pools.

In a whole lot of cases, code reuse actually isn't worth the trouble, and it's fine for people to write their own slightly different versions of something which could become a common library – because making it common would take too much effort and coordination, and too many misunderstandings during that coordination.

Pools aren't one of these places. Their algorithmic simplicity actually makes it easy to standardize on a few common variants (what variant can one desire that others don't also need?) – and their non-algorithmic complications make standardization very worthwhile.

There are a bunch of other non-algorithmic problems you can have with pools besides having to describe your live objects to tools – for example:

  • Thread safety is another potentially non-portable aspect of memory allocation which is already handled by the language's built-in allocator and will become a headache for a custom one. You could use OS locks, or spinlocks, or a combination, or you could have a per-thread arena to avoid locking if it's too slow, in which case you'll need to handle deallocation by a thread different from the allocating one. Or perhaps you could do lock-free allocation if, say, there's an atomic increment and it's sufficient.
  • Placement new is something you'll want to use in C++ here – and something rather many C++ programmers aren't aware of. If you want your pool to initialize objects in a memory chunk that's passed to it from outside, and you intend to use the pool with classes having non-empty constructors and destructors, then you'll want to do something like for(i=0;i<n;++i) new (buf+i*sizeof(T)) T(args) or what-not, and call ~T directly when the pool shuts down. If everyone rolls their own pools, a lot of them will get this bit wrong.

The upshot is that pools are surprisingly gnarly, and are really best avoided; there's a very good reason to build memory allocation into a programming language. To the extent that circumstances dictate the use of pools, it's a very good idea to standardize on a few common implementations, debug them, and leave them alone (though unfortunately a closed implementation likely cannot handle live-object tracking by itself, and bugs will appear in the user-implemented parts that do).

The algorithmic simplicity of a pool, prompting people to just declare an object array and a pointer array and use these as a quick-and-dirty pool, is really quite deceiving.

"Value", the irksome euphemism

An economic value is the worth of a good or service as determined by the market.

Wikipedia

You keep using that word. I do not think it means what you think it means.

Inigo Montoya

If people pay you $1, then the economic value of your good or service has been determined by the market to be $1.

"Creating value" is thus a euphemism for "getting people to pay you money" – which has nothing to do with the usual meaning of "value".

Why is "value" an irksome euphemism? Because heroin dealers "create value", as determined by the market.

In the context of my own profession, all of the following are examples of value creation:

  • An office suite using undocumented and constantly changing formats. Value to users having no other way to access their own and others' documents: $100-$500, depending.
  • A distribution channel allowing developers to deploy software on popular devices, for which no legal alternative exists. Value to developers: 30% of revenue.
  • A social network with a billion private and corporate users who signed up for free, with a new charge to reach a given share of one's audience. Value created: $200 per post.

The more money I've extracted from you, the more value I've created, haven't I?

I'm not picking on Microsoft, Apple or Facebook. I can imagine working for any of them. My conscience is as flexible as the next guy's.

(A particularly inflexible conscience is a horrible condition. Feet to which no mass-produced shoes fit are merely inconvenient. A conscience incompatible with mass-produced social arrangements is a huge burden – not just on its owner, but on his friends and family.)

All I'm saying is that goods and services are distinct from bads and disservices, though both "create value".

Moreover, some sort of disservice tends to be essential to "value creation", a.k.a the extraction of money. People are attached to their money, and will only part with it when given little choice. Microsoft, Apple and Facebook constantly hone their methods of limiting users' choice. Who doesn't?

Business is what it is. It's not that consumers (us) are any better than producers (us). Nor is it impossible for something "free" – as in speech, beer, rider, whatever – to be a disservice to its users.

I just don't think "value" is the right word.

Will OpenCL help displace GPGPU? Parallella, P2012, …

OpenCL is usually perceived as a C dialect for GPGPU programming – doing "general-purpose" computations (not necessarily graphics) on GPU hardware. "It's like Nvidia's CUDA C, but portable".

However, OpenCL the language is not really tied to the GPU architecture. That is, hardware could run OpenCL programs and have an architecture very different from a GPU, resulting in a very different performance profile.

OpenCL is possibly the first programming language promising portability across accelerators: "OpenCL is for accelerators what C is for CPUs". Portability is disruptive. When hardware vendor A displaces vendor B, portable software usually helps a great deal.

Will OpenCL – "the GPGPU language" – eventually help displace GPGPU, by facilitating "GP-something-else" – "general-purpose" accelerators which aren't like GPUs?

We'll discuss this question on general grounds, and consider two specific examples of recent OpenCL accelerators: Adapteva's Parallella and ST's P2012.

Why displace GPGPU?

First of all, whether GPGPU is likely to be displaced or not – what could "GP-something-else" possibly give us that GPGPU doesn't?

There are two directions from which benefits could come – you could call them two opposite directions:

  1. Let software (ab)use more types of special-purpose accelerators. GPGPU lets you utilize (abuse?) your GPU for general-purpose stuff. It could be nice to have "GPDSP" to utilize the DSPs in your phone, "GPISP" to utilize the ISP, "GPCVP" to utilize computer vision accelerators likely to emerge in the future, etc. From GPGPU to GP-everything.
  2. Give software accelerators which are more general-purpose to begin with. GPGPU means doing your general-purpose stuff under the constraints imposed by the GPU architecture. An OpenCL accelerator lifting some of these constraints could be very welcome.

Could OpenCL help us get benefits from any of the directions (1) and (2)?

(1) is about making use of anal-retentive, efficiency-obsessed, weird, incompatible hardware. It's rather hard, for OpenCL or for any other portable, reasonably "pretty" language.

OpenCL does provide constructs more or less directly mapping to some of the "ugly" features common to many accelerators, for example:

  • Explicitly addressed local memory (as opposed to cache)
  • DMA (bulk memory transfers)
  • Short vector data types to make use of SIMD opcodes
  • Light-weight threads and barriers

But even with GPUs, OpenCL can't target all of the GPU's resources. There's the subset of the GPU accessible to GPGPU programs – and then there are the more idiosyncratic and less flexible parts used for actual graphics processing.

With accelerators such as DSPs and ISPs, my guess is that today most of their value – acceleration ability – is in the idiosyncratic features that can't be made accessible to OpenCL programs. They could evolve, but it's a bit far-fetched and we won't dwell on it now. At their current state, OpenCL is too portable and too "pretty" to map to most accelerators.

What about direction (2)? (2) is about making something that's more efficient than CPUs, but as nice and flexible as possible, and more flexible than GPUs.

As a whole, (2) isn't easy, for various reasons we'll discuss. But if we look, in isolation, at OpenCL the language, then it looks like a great language for targeting "faster-than-CPUs-and-more-flexible-than-GPUs" kind of accelerator.

What could such an accelerator give us that GPUs don't?

One important feature is divergent flow. GPUs are SIMD or SIMT hardware; either way, they can't efficiently support something like this:

if(cond(i)) {
  out[i] = f(i);
}
else {
  out[i] = g(i);
}

What they'll end up doing is, essentially, compute f(i) and g(i) for all values of i, and then throw away some of the results. For deeply nested conditionals, the cost of wasted computations can make the entire exercise of porting to a GPU pointless.
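
To make the cost concrete, here's a scalar sketch of what effectively happens per element – both sides are evaluated, and a select keeps one result (cond, f, g and out are the hypothetical names from the snippet above; real hardware uses execution masks rather than an explicit select, but the wasted work is the same):

/* what the GPU effectively executes for every i: */
float f_val = f(i);                /* computed whether cond(i) holds or not */
float g_val = g(i);                /* likewise computed for every i */
out[i] = cond(i) ? f_val : g_val;  /* one of the two results is discarded */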

We'll now have a look at two OpenCL-compatible accelerators which promise to efficiently support divergent threads – or outright independent threads doing something completely unrelated. We'll briefly compare them, and then discuss some of their common benefits as well as common obstacles to their adoption.

Adapteva's Parallella

Actually, the chip's name is Epiphany – Parallella is the recently publicized name of Adapteva's planned platform based on Epiphany; anyway.

Adapteva's architecture is a 2D grid of processors with a mesh interconnect. To scale, you can have a chip with more cores – or you can have a 2D grid of chips with some of the inter-core communication seamlessly crossing chip boundaries. Each of the (scalar) processors executes its own instruction stream – no "marching in lockstep", fittingly for divergent flow.

There are no caches; a memory address can map to your local memory, or the local memory of some other processor in the grid, or to external memory. Access latency will vary accordingly; access to local memories of close neighbors is quicker than access to far neighbors. All memory access can be done using either load/store instructions or DMA.

(Note that you can reach far neighbors – unlike some more "fundamentalist" proposals for "2D scalability" where you can only talk to immediate neighbors, period. I think that's over the top; if you want to run something other than the game of life, it's awfully handy to have long communication paths – as do most computers ranging from FPGAs to neurons, some of which have really long axons.)

Stats:

  • 32K memory per core (unified – both instructions and data)
  • 4 banks that can help avoid contentions between loads/stores, instruction fetching and DMA traffic
  • 2-issue cores (one floating point operation and one integer or load/store operation)
  • 800 MHz at 28nm using a low-power process, as opposed to a high-speed one (my bet is that it's hard to top 800 MHz at 28nm LP – any evidence to the contrary?)
  • ~25mW per core, 2W peak power for a full chip with 64 cores
  • 0.128 mm^2 per core

Sources: BDTI's overview and various documentation from adapteva.com.

ST's Platform 2012

The P2012 architecture is also, at the top level, a grid of processors with a mesh interconnect. One stated motivation is the intra-die variability in future process nodes: some cores will come out slower than others, and some will be unusable.

It is thus claimed that a non-uniform architecture (like the ones we have today – CPUs and a boatload of different accelerators) will become a bad idea. If a core happens to come out badly, and it's not like your other cores, you have to throw away the entire chip. Whereas if cores are all alike, you leave the bad ones unused, and you may still have enough good ones to use the chip.

Interestingly, despite this stated motivation, the P2012 is less uniform and has higher granularity than Epiphany. Firstly, there's a provision for special-purpose accelerators in the grid. Secondly, the top-level mesh connects, not individual cores, but clusters of 16 rather tightly-coupled cores (each with its own flow of control, however – again, good for divergence).

Similarly to Epiphany, data is kept in explicitly addressed local memory rather than cache, and you can access data outside the cluster using load/store instructions or DMA, but you'll pay a price depending on the distance.

However, within a cluster, data access is uniform: the 16 cores share 256K of local data memory. This can be convenient for large working sets. Instructions are private to a core – but they're kept in a cache, not a local memory, conveniently for large programs.

Stats:

  • 32K per core (16 cores with 16K I-cache per core and 32 8K data memory banks)
  • 1MB L2 cache in a 4-cluster, "69-core" chip (presumably, (16+1)x4+1 – one extra core per cluster and per chip)
  • 2-issue cores (I failed to find which instructions can be issued in parallel)
  • 600 MHz at 28nm (process details unclear)
  • 2W for the 69-core chip
  • 0.217 mm^2 per core (3.7 mm^2 per (16+1)-core cluster), not accounting for L2 cache

Source: slides, slides, slides.

Parallella vs P2012: a quick comparison

Each of the architectures can have many different implementations and configurations. It seems fair to compare a 28nm 64-core Epiphany chip with a 28nm 69-core P2012 chip (or at least fair as far as these things go). Each system has its own incompatible native programming interface, but both can also be programmed in OpenCL.

Here's how Epiphany compares to P2012:

  • Power: 1x (2W)
  • Core issue width: 1x (2-issue)
  • Local memory size: 1x (32K per core)
  • Frequency: 1.33x (800/600)
  • Core area efficiency: 1.7x (0.217/0.128)

I think it's a fine achievement for Adapteva – a 5-person company (ST has about 50000 employees – of course not all of them work on the P2012, but still; Chuck Moore's GreenArrays is 18 people – and he's considered the ultimate minimalist, and develops a much more minimalistic product which, for instance, certainly won't run OpenCL programs).

This is not to say that these numbers are sufficient to compare the architectures. For starters, we assume that the power is the same, but we can't know without benchmarking. Energy consumption varies widely across programs – low power process brings leakage down to about zero at room temperature, so you're left with switching power which depends on what code you run, and on what data (multiplying zeros costs almost nothing compared to multiplying noise, for instance).

Then there are programming model differences, ranging from the extent of compliance of floating point to the IEEE standard to the rather different memory models. In the memory department, the ability of P2012 cores to access larger working sets should somewhat negate Epiphany's raw performance advantage on some workloads (though Epiphany cores might have lower latency when accessing their own banks). But then two different 2-issue cores will generally perform differently – you need thorough benchmarking to compare.

So what are these numbers good for? Just for a very rough, ballpark estimation of the cost of this type of core. That is, a core which is flexible enough to run its own instruction stream – but "low-end" enough to burden the programmer with local memory management, and lacking much of the other amenities of full-blown CPUs (speculative prefetching, out-of-order execution, etc.)

Our two examples both point to the same order of magnitude of performance. Let's look at a third system, KALRAY's MPPA – looking more like P2012 than Epiphany, with 16-core clusters and cores sharing memory.

At 28nm, 256 cores are reported to typically consume 5W at 400 MHz. (Adapteva and ST claim to give worst-case numbers.) That's 20mW per core compared to Epiphany's 25mW – but Epiphany runs at 2x the frequency. If we normalize for frequency, Epiphany comes out 1.6x more power-efficient – and that's when we compare its worst-case power to MPPA's typical power.

MPPA doesn't support OpenCL at the moment, and I found few details about the architecture; our quick glance is only good to show that these "low-end multicore" machines have the same order of magnitude of efficiency.

So will OpenCL displace GPGPU?

The accelerators of the kind we discussed above – and they're accelerators, not CPUs, because they're horrible at running large programs as opposed to hand-optimized kernels – these accelerators have some nice properties:

  • You get scalar threads which can diverge freely and efficiently – this is a lot of extra flexibility compared to SIMT or SIMD GPUs.
  • For GPGPU workloads that don't need divergence, these accelerators probably aren't much worse than GPUs. You lose some power efficiency because of reading the same instructions from many program memories, but it should be way less than a 2x loss, I'd guess.
  • And there's a programming model ready for these accelerators – OpenCL. They can be programmed in other C dialects, but OpenCL is a widespread, standard one that can be used, and it lets you use features like explicitly managed local memory and DMA in a portable way.

From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?

There are several roadblocks, however. First of all, there's no killer app for this type of thing – by definition. That is, for any killer app, almost certainly a much more efficient accelerator can be built for that domain. Generic OpenCL accelerators are good at accelerating the widest range of things, but they don't excel at accelerating anything.

There is, of course, at least one thriving platform which is, according to the common perception, "good at everything but excels at nothing" – FPGA. (I think it's more complicated than that but I'll leave it for another time.)

FPGAs are great for small to medium scale product delivery. The volume is too small to afford your own chip – but there may be too many things to accelerate which are too different from what an existing chip is good at accelerating. Flexible OpenCL accelerator chips could rival FPGAs here.

What about integrating these accelerators into high-volume chips such as application processors, so that they could compete with GPUs? Without a killer app, there's a real estate problem. At 100-150 mm^2, today's application processors are already rather large. And the new OpenCL accelerators aren't exactly small – they're bigger than any domain-specific accelerator.

Few chips are likely to include a large accelerator "just in case", without a killer app. Area is considered to be getting increasingly cheap. But we're far from the point where it's "virtually free", and the trend might not continue forever: GlobalFoundries' 14nm is a "low-shrink" node. Today, area is not free.

Of course, a new OpenCL accelerator does give some immediate benefit, so it isn't a purely speculative investment: it could speed up existing OpenCL applications. But for existing code which is careful to avoid divergence, the accelerator would be somewhat less efficient than a GPU, and it wouldn't do graphics nearly as well as the GPU – so it'd be a rather speculative addition indeed.

What would make one add hardware for speculative reasons? A long life cycle. If you believe that your chip will have to accelerate important stuff many years after it's designed, then you'll doubt your ability to predict exactly what this stuff is going to be, and you'll want the most general-purpose accelerator.

Conversely, if you make new chips all the time, quickly sell a load of them, and then move on to market your next design, then you're less inclined to speculate. Anything that doesn't result in a visibly better product today is not worth the cost.

So generic OpenCL accelerators have a better shot at domains with long life cycles, which seem to be a minority. And even when you find a vendor with a focus on the long term, you have the problem of performance portability.

Let's say platform vendor A does add the new accelerator to their chip. Awesome – except you probably also want to support vendor B's chips, which don't have such accelerators. And so efficient divergence is of no use to you, because it's not portable. Unless vendor A accounts for a very large share of the market – or if it's a dedicated device and you write a dedicated program and you don't care about portability.

OpenCL programs are portable, but their performance is not portable. For instance, if you use vector data types and the target platform doesn't have SIMD, the code will be compiled to scalar instructions, and so on.

What this means in practice is that one or several OpenCL subsets will emerge, containing features that people count on to be supported well. For instance, a relatively good scenario is, there's a subset that GPU programmers use on all GPUs. A worse scenario is, there's the desktop GPU subset and the mobile GPU subset. A still worse scenario is, there's the NVIDIA subset, the AMD subset, the Imagination subset, etc.

It's an evolving type of thing that's never codified anywhere but has more power than the actual standard.

Standards tend to materialize partially. For example, the C standard permits garbage-collecting implementations, but real C implementations usually don't collect garbage, and many real C programs will not run correctly on a standard-compliant, garbage-collecting implementation. Someone knowing C would predict this outcome, and would not trust the standard to change it.

So with efficient divergence, the question is whether this feature will make it into a widely used OpenCL subset, even though it's not part of any such subset today. If it doesn't, widespread hardware is unlikely to support it.

Personally, I like accelerators with efficient divergence. I'll say it again:

"From a programmer's perspective – bring them on! Why not have something with a standard programming interface, more efficient than CPUs, more flexible than GPUs – and running existing GPGPU programs almost as well as GPUs?"

From an evolutionary standpoint though, it's quite the uphill battle. The CPU + GPU combination is considered "good enough" very widely. It's not impossible to grow in the shadow of a "good enough", established competitor. x86 was good enough and ARM got big, gcc was good enough and LLVM got big, etc.

It's just hard, especially if you can't replace the competitor anywhere and you aren't a must-have. A CPU is a must-have and ARM replaces x86 where it wins. A compiler is a must-have and LLVM replaces gcc where it wins. An OpenCL accelerator with efficient divergence – or any other kind, really – is not a must-have and it will replace neither the CPU nor the GPU. So it's quite a challenge to convince someone to spend on it.

Conclusion

I doubt that general-purpose OpenCL accelerators will displace GPGPU, even though it could be a nice outcome. The accelerators probably will find their uses though. The following properties seem favorable to them (all or a subset may be present in a given domain):

  • Small to medium scale, similarly to FPGAs
  • Long life cycles encouraging speculative investment in hardware
  • Device-specific software that can live without performance portability

In other words, there can be "life between the CPU and the GPU", though not necessarily in the highest volume devices.

Good luck to Adapteva with their Kickstarter project – a computer with 64 independent OpenCL cores for $99.

Do you really want to be making this much money when you're 50?

"Do You Really Want to be Doing This When You're 50?"

Well, I didn't really want to be doing this when I was 20. I'm in it for the money. As long as there's money in programming, I'll stay for the money, in all likelihood.

What else do you want to be doing when you're 50? Give me a profession remotely close to programming in the following ways:

  • Little or no required education
  • Good compensation, even for mediocre performers
  • Millions of jobs
  • No physical effort
  • No health or legal risks

Programming is money for nothing. Programming is very easy to enter and extremely hard to quit. What would you do instead?

I work with three lawyers – two became programmers and one became a PM. I haven't met programmers who became lawyers. I do know an engineer – not a programmer – who became a patent attorney (reported reason: "at some point, you resent your manager being the age of your kids"). Would you like to become a patent attorney when you're 50?

I had a manager who decided he'd rather be a school teacher, thinking that this line of work is more beneficial to society. He quit after 8 months, saying in his parting interview to a mainstream newspaper: "Sometimes I just want to enter the classroom with a machine gun and open fire". He's with Samsung now; he feels that his contribution to smartphone imagers benefits society substantially enough.

One of my roommates at work has been studying a bunch of things for a while now. He's got a degree in psychology and in something called Visual Theater. He's been programming part-time all the while, which is how he financed his studies. He's programming as a part of his visual performances (there's computer music involved). He'll likely be programming to finance his art work. I'm not sure he plans to quit programming at any defined point.

I've seen a lot of people "quitting" to study anything from physics to philosophy, and then going back to programming. The money is addictive. There are many other sources of satisfaction, of course – which is why I run this blog for free – but much of this satisfaction has to do with demand, directly or indirectly, and is thus very much related to money. "Building something useful" and "making money" are close relatives.

You could, of course, become independently wealthy. But you probably won't, and then programming is your plan B. There's also a thing about material wealth – it's easily taken away. I'm from Soviet Russia, so I tend to exaggerate the likelihood of that – but really, property is easily confiscated, and paper money can become paper overnight. It's not just a USSR thing; the US confiscated gold from its citizens at about the same time as the USSR. Professional ability, however, can't be confiscated. The prudent (paranoid?) independently wealthy programmer will thus make some effort to stay in a good shape.

There's the argument that professional programming is stressful. Again – compared to what? A doctor's work? A lawyer's work? Answering calls by irate customers while your responses are recorded for later inspection?

What stress? Programmers who can program at all – as in, print out a binary tree correctly – are very scarce. This scarcity makes it rather hard to push programmers around. You can try to bully them into doing unpaid overtime, but they quickly learn that it's a seller's market, and that you're basically bluffing. You have nobody to replace them with.

With demand outstripping supply, there's enough space in programming for everyone. This makes for a not-so-competitive environment, compared to, say, finance/investment banking type of jobs. Programmers are also typically shielded from customers and senior management – the kind of people who're always right, a trait making communication somewhat tiresome.

Deadlines? Sure, we have them, just like everybody else. Let's admit it though – we tend to miss them, and it's not very stressful to us unless we want it to be. If you're given an impossible schedule, and you do your best, and you miss the deadline, you can suffer deeply or you can maintain mental peace. The fact is that your material well-being is rarely in jeopardy because of a missed deadline, so your reaction is fully up to you.

There's the argument that programmers can't fully understand what's going on, what with all the APIs and layers and stuff. And if you don't understand your own environment, that's stressful and that's not fun. Fair enough; but again – who does understand his environment more than a programmer? A doctor digging into a patient's guts? A lawyer sifting through legal documents? An investor trading financial derivatives? A manager overseeing 10 or 20 programmers? With all the self-inflicted complexity, we're still in a better shape than most.

The fact is that there are relatively few programmers in their fifties around. Does it mean people don't survive in programming though? More likely, it is simply a result of growth. There were few 20-year-old programmers 30 years ago – compared to 10 years ago. Therefore, there are fewer 50-year-old programmers today than 30-year-old programmers. To the extent that the growth in programming slows down, things will be different 20 years down the road.

So I'm not planning to quit programming, not because it's such a great source of joy by itself, but because it looks so good compared to just about anything else. Maybe not the most "passionate" statement – but passion burns out, whereas greed is sustainable. And if you plan to quit programming, I wonder what your alternative is, and I won't be surprised if you come back to programming in a few years.

C as an intermediate language

Here's a Forth program debugged in KDevelop – a graphical debugger without Forth support:

KDevelop used to debug Forth code

Cool stuff, not? The syntax highlighting for Forth files – that's someone else's work that comes with the standard KDevelop installation. But the rest – being able to run Forth under KDevelop, place breakpoints, and look at program state – all that stuff is something we'll develop below.

I'll show all the required code; we don't have to do very much, because we get a lot for free by using C as an intermediate language.

A high-level intermediate language is not unusual. A lot of compilers target an existing high-level platform instead of generating native code – for instance, by generating JVM bytecode or JavaScript source code. Why? Because of all the things you get for free that way:

  • Portability
  • Optimization
  • Some degree of library interoperability
  • Some degree of tools interoperability (IDEs, debuggers, etc.)

A few languages targeting high-level platforms are rather well-known: Scala and Clojure with compilers targeting the JVM, CoffeeScript and Dart which are compiled to JavaScript. (Then there's Java, which Google famously compiles to JavaScript – though that remains somewhat offbeat.)

Which languages of this kind are the most successful? Without doubt, today the answer is C++ and Objective-C – languages whose first compilers emitted C code.

I think C is an awesome intermediate language for a compiler to emit. It's extremely portable, it compiles in a snap, optimizes nicely, and you get interoperability with loads of stuff.

When I wanted to make a compiler for an interpreted language we developed internally, I actually thought about targeting a VM, not a source language. I planned to emit LLVM IR. It was GD who talked me out of it; and really, why LLVM IR?

After all, LLVM IR is less readable than C, less stable than the C standard – and less portable than C. It will likely always be, almost by definition.

Even if LLVM runs on every hardware platform, LLVM IR will only be supported by the LLVM tools – but not, say, the GNU tools, or Visual Studio. Whereas generating C code gives you great support by LLVM tools – and GNU, and Visual Studio. Debugging a program generated from LLVM IR in Visual Studio will probably always be inferior to debugging auto-generated C code compiled by the Visual Studio compiler.

"C as an intermediate language" is one of those things I wanted to write about for years. What prevented me was, I'd like to walk through an example – including some of the extra work that may be required for better debugging support. But I couldn't think of a blog-scale example ("web-scale" seems to be the new synonym for "loads of data"; I propose "blog-scale" to mean "small enough to fully fit into a blog post".)

Then it dawned on me: Forth! The minimalist language I've fallen out of love with that still has a warm place in my heart.

So, I'll do a Forth-to-C compiler. That'll fit in a blog post – at least a small Forth subset will – and Forth is different enough from C to be interesting. Because my point is, it doesn't have to be C with extensions like C++ or Objective-C. It can be something rather alien to C and you'll still get a lot of mileage out of the C tools supplied by your platform.

Without further ado, let's implement a toy Forth-to-C compiler. We shall: define just enough Forth to not be dangerous, pick a "compilation strategy", write the compiler and a tiny runtime, test the result, debug and profile it with standard C tools, and finally add custom data display via gdb pretty-printers.


Enough Forth to not be dangerous

To be dangerous, we'd have to support CREATE/DOES> or COMPILE or POSTPONE or something of the sort. We won't – we'll only support enough Forth to implement Euclid's GCD.

So here's our Forth subset – you can skip this if you know Forth:

  • Forth has a data stack.
  • Integers are pushed onto the stack. When you say 2 3, 2 is pushed and then 3.
  • Arithmetic operators pop operands from the stack and push the result. 6 2 / pops 6 and 2 and pushes 6/2=3. 2 3 = pushes 0 (false), because 2 is not equal to 3.
  • Stack manipulation words, well, manipulate the stack. DUP duplicates the top of the stack: 2 DUP is the same as 2 2. Swap: 2 3 SWAP is the same as 3 2. Tuck: 2 3 TUCK is the same as… errm… 3 2 3. As you can already imagine, code is more readable with fewer of these words.
  • New words are defined with : MYNAME …code… ; Then if you say MYNAME, you'll execute the code, and return to the point of call when you reach the semicolon. No "function arguments" are declared – rather, code pops arguments from the stack, and pushes results. Say, : SQUARE DUP * ; defines a squaring word; now 3 SQUARE is the same as 3 DUP * – it pops 3 and pushes 9.
  • Loops: BEGIN … cond UNTIL is like do { … } while(!cond), with cond popped by the UNTIL. BEGIN … cond WHILE … REPEAT is like while(1) { … if(!cond) break; … }, with cond popped by the WHILE.
  • Printing: 2 . prints "2 ". CR prints a newline.
  • Forth is case-insensitive.

That's all – enough to implement GCD, and to scare and startle your infix-accustomed friends.


"Compilation strategy"

…A bit too dumb to be called a strategy; anyway, our blog-scale, blog-strength compiler will work as follows:

  • The stack is an array of "data"; data is a typedef for long.
  • Each user-defined word is compiled to a C function getting a data* pointing to the top of stack (TOS), and returning the new TOS pointer.
  • Where possible, built-in words can be used similarly to user-defined words: s=WORD(s).
  • Some words can't work that way: + can't become s=+(s) because that's not valid C. For such words, we do some trivial translation. 2 becomes PUSH(2), + becomes OP2(+). Similarly, control flow words are translated to do/while, if and break.
  • The stack grows downwards and shrinks upwards: if s[0] is the TOS, s[1] is the element below it, etc; s++ shrinks the stack by popping the TOS.

That's all; for instance, the gcd example from here:

: gcd begin dup while tuck mod repeat drop ;

…compiles to the C code below. As you can see, every word is translated in isolation – the compiler has almost no comprehension of "context" or "grammar":

data* gcd(data* s) { //: gcd
  do {               //begin
    s=dup(s);        //dup
    if(!*s++) break; //while
    s=tuck(s);       //tuck
    OP2(%);          //mod
  } while(1);        //repeat
  s=drop(s);         //drop
  return s;          //;
}

Here's the full Python source of the compiler (forth2c.py) – a bit redundant after all the explanations:

#!/usr/bin/python
import sys
# "special" words - can't emit s=WORD(s)
special = {
  ':':'data* ',
  ';':'return s; }',
  '.':'PRINT();',
  'begin':'do {',
  'until':'} while(!*s++);',
  'repeat':'} while(1);',
  'while':'if(!*s++) break;',
}
# binary operators
binops='+ - * / = < > mod'.split()
op2c={'=':'==','mod':'%'}

def forth2c(inf,out):
  n = 0
  for line in inf:
    n += 1
    # emit line info for C tools (debuggers, etc.)
    # - a nice option C gives us
    print >> out,'\n#line',n,'"%s"'%infile
    for token in line.lower().strip().split():
      if token in special:
        print >> out,special[token],
      else:
        try:
          num = int(token)
          print >> out, 'PUSH(%d);'%num,
        except ValueError:
          if token in binops:
            print >> out,'OP2(%s);'%op2c.get(token,token),
          else:
            if defining:
              print >> out,token+'(data* s) {',
            else: # call
              print >> out,'s=%s(s);'%token,
      defining = token == ':'

out = open('forth_program.c','w')
print >> out, '#include "forth_runtime.h"'
for infile in sys.argv[1:]:
  forth2c(open(infile),out)

And here's the "runtime library", if you can call it that (forth_runtime.h). It's the kind of stack-fiddling you'd expect: s++, s--, s[0]=something.

#include <stdio.h>
typedef long data;

#define PUSH(item) (--s, *s = (item))
#define OP2(op) (s[1] = s[1] op s[0], ++s)
#define PRINT() (printf("%ld ", s[0]), ++s)
#define cr(s) (printf("\n"), s)
#define drop(s) (s+1)
#define dup(s) (--s, s[0]=s[1], s)
#define tuck(s) (--s, s[0]=s[1], s[1]=s[2], s[2]=s[0], s)

#define MAX_DEPTH 4096 //arbitrary
data stack[MAX_DEPTH];
data* forth_main(data*);
int main() { forth_main(stack+MAX_DEPTH); return 0; }

Note that our Forth dialect has an entry point word – forth_main – that is passed a pointer to an empty stack by C's main. Real Forth doesn't have an entry point – it's more like a scripting language that way; but the forth_main hack will do for our example.

That's all! A Forth dialect in under 60 LOC. By far the shortest compiler I ever wrote, if not the most useful.

A test

Let's test our forth2c using a few example source files. countdown.4th (from here):

: COUNTDOWN
  BEGIN
    DUP .
    1 -
    DUP 0 =
  UNTIL
  DROP
;

gcd.4th (a multi-line version – to make single-stepping easier):

: gcd
  begin
    dup
  while
    tuck
    mod
  repeat
  drop
;

main.4th:

: forth_main
  5 countdown
  cr
  10 6 gcd .
  35 75 gcd .
  12856 3248 gcd .
  cr
;

We can run the program with the shell script:

./forth2c.py countdown.4th gcd.4th main.4th
gcc -g -static -Wall -o forth_program forth_program.c
./forth_program

This should print:

5 4 3 2 1
2 5 8

So countdown manages to count down from 5 to 1, and gcd finds the gcd. Good for them.


Debugging

Let's try our program with the trusty gdb:

$ gdb forth_program
There is NO WARRANTY!! etc. etc.
(gdb) b countdown # place breakpoint
Breakpoint 1: file countdown.4th, line 3.
(gdb) r # run
Breakpoint 1, countdown (s=0x804e03c) at countdown.4th:3
(gdb) bt # backtrace
#0  countdown (s=0x804e03c) at countdown.4th:3
#1  forth_main (s=0x804e03c) at main.4th:2
#2  main () at forth_runtime.h:14
(gdb) l # list source code
1	: COUNTDOWN
2	  BEGIN
3	    DUP .
4	    1 -
5	    DUP 0 =
6	  UNTIL
7	  DROP
8	;
(gdb)

Yay!!

Seriously, yay. We placed a breakpoint. We got a call stack – forth_main called countdown.  And we've seen the Forth source code of the function we debug. All that in a debugger that has no idea about Forth.

Partly it's due to compiling to high-level primitives, such as functions, that are supported by tools – we'd get that in every language. And partly it's due to #line – something we don't get everywhere.

#line is brilliant – all languages should have it. That's how we tell debuggers where our assembly code is really coming from – not the intermediate C code, but the real source.
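
For instance, here's roughly what forth2c emits for the first lines of countdown.4th (whitespace tidied up a little) – each Forth source line is preceded by a #line directive pointing back at the .4th file, so C tools attribute the generated code to the Forth source:

#line 1 "countdown.4th"
data* countdown(data* s) {
#line 2 "countdown.4th"
do {
#line 3 "countdown.4th"
s=dup(s); PRINT();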


Profilers and other tools

It's not just debuggers that understand #line – it's also profilers, program checkers, etc.

Here's KCachegrind showing the profile of our Forth program, obtained using Valgrind as follows:

valgrind --tool=callgrind ./forth_program
kcachegrind `ls -t callgrind* | head -1`

KCachegrind profiling Forth code

Now isn't that spiffy?

C compilers are also aware of #line – which we could try to abuse by pushing error reporting onto the C compiler. Say, if we use an undefined word do_something, we get an error at just the right source code line:

main.4th:3: implicit declaration of function ‘do_something’

A very sensible error message – and we didn't have to do anything! Now let's try a BEGIN without an UNTIL – let's delete the UNTIL from countdown.4th:

gcd.4th:1: error: expected ‘while’ before ‘data’

Um, not a very good message. Perhaps this isn't such a good idea after all. If we want quality error reporting, we have to do the error checking at the source language level.


Custom data display: gdb pretty-printers

Things look rather nice. C tools – gdb, Valgrind, KCachegrind – are doing our bidding, aren't they?

Except for the data stack. Instead of the stack elements, gdb shows us the TOS pointer in hexadecimal: countdown (s=0x804e03c), forth_main (s=0x804e03c), which looks a bit lame.

It's not that the local variable s is useless – on the contrary, that's what we use to look at the data stack. This can be done using gdb commands such as p s[0] (prints the TOS), p s[1] (prints the element below TOS), etc. But we'd much rather look at the whole stack at a time, so that we can see how TUCK tucks our numbers into the stack as we single-step our code.

Can it be done?

  • The good news is that it's easy enough with gdb pretty-printers. (To me it became easy after Noam told me about the pretty-printers.)
  • The bad news is that, unlike the case with the universally supported #line, it's impossible to do custom pretty-printing in a uniform way for all tools. There's no "#line for pretty-printing" – quite a pity.
  • But then the good news is that gdb front-ends such as KDevelop or Eclipse support gdb pretty-printers – to an extent. (KDevelop apparently does a better job than Eclipse.) So you don't have to write a GUI plugin or something equally horrible.

Here's a gdb pretty printer for displaying our Forth data stack. It prints all the elements, from bottom to top (as the Forth stack contents is usually spelled). The TOS pointer is the data* you pass for printing – typically, some function's local variable s. The bottom of the stack is known statically – it's the pointer past the end of the global stack[] array. (If we had threads, that array would be thread-local, but anyway, you get the idea.)

So here's gdb_forth.py:

class StackPrettyPrinter(object):
  bottom = None
  bot_expr = 'stack + sizeof(stack)/sizeof(stack[0])'
  def __init__(self,s):
    self.s = s
  def to_string(self):
    if not self.bottom:
      self.bottom = gdb.parse_and_eval(self.bot_expr)
    s = self.s
    i = 0
    words = []
    while s[i].address != self.bottom:
      words.append(str(s[i]))
      i += 1
    return ' '.join(reversed(words))

def stack_lookup_func(val):
  if str(val.type) == 'data *':
    return StackPrettyPrinter(val)

gdb.pretty_printers.append(stack_lookup_func)
print 'loaded Forth extensions'

gdb.pretty_printers is a list of functions which may decide to wrap gdb.Value they're passed in a pretty-printer class object. Typically they decide based on the value's type. gdb.parse_and_eval returns a gdb.Value representing the value of the given C expression. A gdb.Value has a type, an address, etc. If the gdb.Value is a pointer, you can index it as a Python list: val[ind], and if it's a struct, you can use it as a dict: val['member_name'] (we don't have an example of that one here).

If you're interested in gdb pretty printers – a couple of notes:

  • Getting values from other values is much faster than parse_and_eval which just grinds the CPU to a halt; so learning the not-that-Pythonic-while-also-not-that-C-like syntax of gdb.Value access is very worthwhile.
  • You can return "maps" or "arrays" instead of plain strings. Pretty-printers for STL maps and vectors work that way. This can be used to pretty-print high-level data structures such as dynamic objects (if you're lucky and your design matches gdb's view of things – I found a lot of devil in the details.)

Let's test-drive our pretty printer – but first, let's register it in our ~/.gdbinit:

python execfile('/your/path/to/gdb_forth.py')

And now:

$ gdb forth_program
There is NO WARRANTY!!
loaded Forth extensions # aha, that's our printout!
(gdb) b gcd
(gdb) r
Breakpoint 1, gcd (s=10 6) # and that's the stack
(gdb) python for i in range(4): gdb.execute('c')
# continue 4 times
Breakpoint 1, gcd (s=6 4)
Breakpoint 1, gcd (s=4 2)
Breakpoint 1, gcd (s=2 0)
# these were the steps of gcd(10,6)
Breakpoint 1, gcd (s=35 75)
3           dup
# new inputs - gcd(35,75). let's single-step:
(gdb) s # step (execute DUP)
4         while
(gdb) p s # print s
$1 = 35 75 75 # so DUP duplicated the 75.
(gdb) s # step (execute WHILE)
5           tuck
(gdb) p s
$2 = 35 75 # ...and WHILE popped that 75.
(gdb) s # and now we execute the mighty TUCK:
6           mod
(gdb) p s
$3 = 75 35 75 # YES! Tucked the 75 right in!
(gdb) bt # and how's main's data stack looking?
#0  gcd (s=75 35 75) at gcd.4th:6
#1  forth_main (s=75 35) at main.4th:5
# well, it looks shorter - somewhat sensibly.
# (main TOS pointer is unchanged until gcd returns.)

I think it's rather nice. 10 6, 6 4, 4 2, 2 0 – a concise enough walk through Euclid's algorithm in Forth.


KDevelop with the stack display

And, finally, here's how it looks in KDevelop (where the stack displayed in the GUI changes upon every step, as one would expect from a good debugger). The stack is in the variables view on the left.

There is nothing special to be done to make things work in KDevelop – except for the standard procedure of convincing it to debug a user-supplied executable, as you might do with a C program.

Features that don't map naturally to C

Some features of a source language map more naturally to C than others. If we tried to tackle a language different from Forth – or to tackle all of Forth instead of a small subset – some features would pose hard problems.

Some such features can be handled straightforwardly with another intermediate language. For example, dynamic data attributes map to JavaScript much more nicely than they do to C.

But in many cases you don't pick your intermediate language because it's the most suitable one for your source language. If you want to target browsers, then it pretty much has to be JavaScript, even if it's a poor fit for your source language. Similarly, in other situations, it has to be C – or JVM, or .NET, etc.

In fact, many language designers made it their key constraint to play nicely with their platform – the intermediate language. Some advertise the harmonious integration with the platform in the language name: C++, Objective-C, CoffeeScript, TypeScript. They know that many users are more likely to choose their language because of its platform than for any other reason, because they're locked into the platform.

With that said, here are a few features that don't map straightforwardly to C, and possible ways to handle them.

  • Dynamic binding: easy enough. Generate something like call(object,method_id), or call(object,"method_name"), multidispatch(method,obj1,obj2) – whatever you want. Should work nicely.
  • Dynamic code generation – eval, etc.: C doesn't have eval – but it does have fopen("tmp.c"), fwrite, system("gcc -o tmp.so tmp.c …"), dlopen("tmp.so"), and dlsym. This can give you something a lot like eval. You need non-portable code to make it work, but not much of it. Alternatively, if a relatively slow eval with relatively poor debugger support is OK (often it is), you can implement eval using an interpreter, which can reuse the parser of your static compiler. We use both methods in a few places and it works fine.
  • Dynamic data: Code generation is easy – access(object,"attr_name") or whatever. The ugly part is viewing these objects in debuggers. One approach is pretty-printers or similar debugger scripting. Another approach – for "static enough data" – is to generate C struct definitions at runtime, compile them to a shared library with debug info, and load the library, as you'd do to implement eval.
  • Exceptions: I use setjmp/longjmp for that; I think that's what cfront did (a sketch follows this list).
  • Garbage collection: AFAIK it can work fine, funnily enough. There are garbage collectors for C, such as the Boehm collector. Do they work for C programs? Not really – they can mistake a byte pattern that happens to look like a pointer for an actual pointer, causing a leak. But you don't have that problem – because your C program is generated from code that your compiler can analyze and then tell the collector where to look for pointers. So while gc doesn't truly work for C, it can work fine for languages implemented on top of C!
  • Generics, overloading, etc.: most of this is handled by your compiler, the only problem being name mangling. You can mangle names in a way that you find least ugly, or you can mangle them according to the C++ mangling convention, so that debuggers then demangle your names back into collection<item> or some such. (The C++ mangling convention is not portable so you might need to support several.)
  • Multiple return values: we pass pointers to all return values except the first; I think returning a struct might make things look nicer in debuggers. Of course the calling convention will be less efficient compared to a compiler emitting assembly rather than C. But it will still be more efficient than your actual competition, such as a Python tuple.
  • Program reduction (as in, transform (+ (* 2 3) 5) to (+ 6 5), then to 11): ouch. I guess it could make sense to write an interpreter to do that in C, but you wouldn't get that much mileage out of the C optimizer, and no mileage at all from C debuggers, profilers, etc. The basic model of computation has to be close enough to C to gain something from the C toolchain besides the portability of your C code implementing your language. (I'm not saying that you absolutely have to create a memory representation of (+ 6 5) at runtime – just that if you want to create it, then C tools won't understand your program.)
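
As promised above, here's a minimal sketch of the setjmp/longjmp scheme – a stack of jump buffers maintained by hypothetical runtime helpers that the compiler would emit calls to; real generated code would also have to run cleanup code while unwinding:

#include <setjmp.h>
#include <stdio.h>

enum { MAX_TRY_DEPTH = 64 };
static jmp_buf try_stack[MAX_TRY_DEPTH]; /* one jump buffer per active "try" */
static int try_top;

static void throw_exception(int code) {
  longjmp(try_stack[--try_top], code); /* unwind to the innermost "try" */
}

static void risky(void) { throw_exception(42); }

/* what generated code for "try { risky(); } catch(e) { ... }" could look like: */
int main(void) {
  int code = setjmp(try_stack[try_top++]);
  if(code == 0) {   /* "try" body */
    risky();
    --try_top;      /* no exception - pop the handler */
  }
  else {            /* "catch" body */
    printf("caught exception %d\n", code);
  }
  return 0;
}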

What about using C++ instead of C?

You could, and I did. I don't recommend it. You get exceptions for free instead of fiddling with setjmp/longjmp. Then they don't work when you throw them from one shared library and catch in another (I think RTTI has similar problems with shared libraries). You get templates for free. Then your auto-generated code compiles for ages.

Eli Bendersky writes about switching from C to C++ to implement a language VM. He says it's easier to use C++ features when they match your VM's needs than to reimplement them in C.

Here's how it worked out for me. I once implemented a VM and an interpreter in C++. Then we wrote a compiler for the language. Because the VM was in C++, it was much easier for the compiler to emit C++ code interfacing with the VM than C code. And then with this auto-generated C++ code, we bumped into shared-library-related problems, and build speed problems, etc. – and we have them to this day.

But of course, YMMV. One reason to use C++ is that debuggers get increasingly better at displaying STL collections and base class object pointers (which they know to downcast to the real runtime type and show the derived class members). If your language constructs map to these C++ features, you can get better debug-time data display in a portable way.

Anything you'd like to add?

If you have experience with using C as an intermediate language, I'll gladly add your comments. I'm more experienced with some things than others in this department, and so a lot of the potential issues would likely never occur to me.

Conclusion

C is a great intermediate language, and I'd be thrilled to see more new languages compiling to C. I don't live in either Java or JavaScript these days; surely I'm not alone. A language compiling to C could be fast, portable, have nice tools and library support – and immediately usable to many in the land of C, C++ and Objective-C.