Consistency: how to defeat the purpose of IEEE floating point

I don't know much about the design of IEEE floating point, except for the fact that a lot of knowledge and what they call "intellectual effort" went into it. I don't even know the requirements, and I suspect those were pretty detailed and complex (for example, the benefits of having a separate representation for +0 and -0 seem hard to grasp unless you know about the very specific and hairy examples in the complex plane). So I don't trust my own summary of the requirements very much. That said, here's the summary: the basic purpose of IEEE floating point is to give you results of the highest practically possible precision at each step of your computation.

I'm not going to claim this requirement is misguided, because I don't feel like arguing with people two orders of magnitude more competent than myself who have likely faced much tougher numerical problems than I've ever seen. What I will claim is that differences in numerical needs divide programmers into roughly three camps, and the highest-possible-precision approach hurts one of them really badly, and so has to be worked around in ways we'll discuss. The camps are:

  1. The huge camp of people who do businessy accounting. Those should work with integral types to get complete, deterministic, portable control over rounding and all that. Many of the clueless people in this camp represent 1 dollar and 10 cents as the floating point number 1.1. While they are likely a major driving force behind economic growth, I still think they deserve all the trouble they bring upon themselves.
  2. The tiny camp doing high-end scientific computing. Those are the people who can really appreciate the design of IEEE floating point and use its full power. It's great that humanity accidentally satisfied the needs of this small but really cool group, making great floating point hardware available everywhere through blind market forces. It's like having a built-in Stradivari in each home appliance. Yes, perhaps I exaggerate; I get that a lot.
  3. The sizable camp that deals with low-end to mid-range semi-scientific computing. You know, programs that have some geometry or physics or algebra in them. 99.99% of the code snippets in that realm work great with 64b floating point, without the author having invested any thought at all into "numerical analysis". 99% of the code works with 32b floats. When someone stumbles upon a piece of code in the remaining 1% and actually witnesses fatal precision loss, everybody gathers to have a look as if they heard about a beautiful rainbow or smoke suggesting a forest fire near the horizon.

The majority of people who use and actually should be using floating point are thus in camp 3. Those people don't care about precision anywhere near as much as camp 2 does, nor do they know how to make the best of IEEE floating point in the very unlikely circumstances where their naive approach will actually fail to work. What they do care about though is consistency. It's important that things compute the same on all platforms. Perhaps more importantly for most, they should compute the same under different build settings, most notably debug and release mode, because otherwise you can't reproduce problems.

Side note: I don't believe in build modes; I usually debug production code in release mode. It's not just floating point that's inconsistent across modes – it's code snippets with behavior undefined by the language, buggy dependence on timing, optimizer bugs, conditional compilation, etc. Many other cans of worms. But the fact is that most people have trouble debugging optimized code, and nobody likes it, so it's nice to have the option to debug in debug mode, and to do that, you need things to reproduce there.

Also, comparing results of different build modes is one way to find worms from those other cans, like undefined behavior and optimizer bugs. Also, many changes you make are optimizations or refactorings, and you can check their sanity by making sure they didn't change the results of the previous version. As we'll see, IEEE FP won't give you even that, regardless of platforms and build modes. The bottom line is that if you're in camp 3, you want consistency, while the "only" things you can expect from IEEE FP are precision and speed. Sure, "only" should be put in quotes because it's a lot to get, it's just a real pity that fulfilling the smaller and more popular wish for consistency is somewhere between hard and impossible.

Some numerical analysts seem annoyed by the camp 3 whiners. To them I say: look, if IEEE FP wasn't the huge success that it is in the precision and speed departments, you wouldn't be hearing from us because we'd be busy struggling with those problems. What we're saying is the exact opposite of "IEEE FP sucks". It's "IEEE FP is so damn precise and fast that I'm happy with ALL of its many answers – the one in optimized x86 build, the one in debug PowerPC build, the one before I added a couple of local variables to that function and the one I got after that change. I just wish I consistently got ONE of these answers, any of them, but the same one." I think it's more flattering than insulting.

I've accumulated quite some experience in defeating the purpose of IEEE floating point and getting consistency at the (tiny, IMO) cost of precision and speed. I want to share this knowledge with humanity, with the hope of getting rewarded in the comments. The reward I'm after is a refutation of my current theory that you can only eliminate 95%-99% of the pain automatically and have to solve the rest manually each time it raises its ugly head.

The pain breakdown

I know three main sources of floating point inconsistency pain:

  1. Algebraic compiler optimizations
  2. "Complex" instructions like multiply-accumulate or sine
  3. x86-specific pain not available on any other platform; not that ~100% of non-embedded devices is a small market share for a pain.

The good news is that most pain comes from item 3, which can be more or less solved automatically. For the purpose of decision making ("should we invest energy into FP consistency or is it futile?"), I'd say that it's not futile, and if you can cite actual benefits you'd get from consistency, then it's worth the (continuous) effort.

Disclaimer: I only discuss problems I know and to the extent of my understanding. I have no solid evidence that this understanding is complete. Perhaps the pain breakdown list should have item 4, and perhaps items 1 to 3 have more meat than I think. And while I tried to get the legal stuff right – which behavior conforms to IEEE 754, which conforms to C99, and which conforms to nothing but is still out there – I'm generally a weak tech lawyer and can be wrong. I can only give you the "worked on my 4 families of machines" kind of warranty.

Algebraic compiler optimizations

Compilers, or more specifically buggy optimization passes, assume that floating point numbers can be treated as a field – you know, associativity, distributivity, the works. This means that a+b+c can be computed in either the order implied by (a+b)+c or the one implied by a+(b+c). Adding actual parentheses in source code doesn't help you one bit. The compiler assumes associativity and may implement the computation in the order implied by regrouping your parentheses. Since each floating point operation loses precision on some of the possible inputs, the code generated by different optimizers or under different optimization settings may produce different results.
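
Here's a tiny illustration of why regrouping isn't harmless. The values are cherry-picked to make the effect blatant, but milder versions of the same thing happen with ordinary data:

#include <stdio.h>
int main() {
  double a = 1e20, b = -1e20, c = 1;
  /* mathematically, both lines print 1 */
  printf("(a+b)+c = %g\n", (a+b)+c); /* 1: a+b is exactly 0, then 0+1 */
  printf("a+(b+c) = %g\n", a+(b+c)); /* 0: c is absorbed into b, then a+b cancels */
  return 0;
}

So when an optimizer regroups behind your back, debug and release may disagree in the last bit – or, as above, in every bit.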

This could be extremely intimidating because you can't trust any FP expression with more than 2 inputs. However, I think that programming languages in general don't allow optimizers to make these assumptions, and in particular, the C standard doesn't (C99 § #13, didn't read it in the document but saw it cited in two sources). So this sort of optimization is actually a bug that will likely be fixed if reported, and at any given time, the number of these bugs in a particular compiler should be small.

I only recall one recurring algebraic optimization problem. Specifically, a*(b+1) tended to be compiled to a*b+a in release mode. The reason is that floating-point literal values like 1 are expensive; 1 becomes a hairy hexadecimal value that has to be loaded from a constant address at run time. So the optimizer was probably happy to optimize a literal away. I was always able to solve this problem by changing the spelling in the source code to a*b+a – the optimizer simply had less work to do, while the debug build saw no reason to make me miserable by undoing my regrouping back into a*(b+1).

This implies a general way of solving this sort of problem: find what the optimizer does by looking at the generated assembly, and do it yourself in the source code. This almost certainly guarantees that debug and release will work the same. With different compilers and platforms, the guarantee is less certain. The second optimizer may think that the optimization you copied from the first optimizer into your source code is brain-dead, and undo it and do a different optimization. However, that means you target two radically different optimizers, both of which are buggy and can't be fixed in the near future; how unlucky can you get?

The bottom line is that you rarely have to deal with this problem, and when it can't be solved with a bug report, you can look at the assembly and do the optimization in the source code yourself. If that fails because you have to use two very different and buggy compilers, use the shotgun described in the next item.

"Complex" instructions

Your target hardware can have instructions computing "non-trivial" expressions beyond a*b or a+b, such as a+=b*c or sin(x). The precision of the intermediate result b*c in a+=b*c may be higher than the size of an FP register would allow, had that result been actually stored in a register. IEEE and the C standard think it's great, because the single instruction generated from a+=b*c is both faster and more precise than the 2 instructions implementing it as d=b*c, a=a+d. Camp 3 people like myself don't think it's so great, because it happens in some build modes but not others, and across platforms the availability of these instructions varies, as does their precision.
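
You can see what that extra intermediate precision does by forcing both variants explicitly with C99's fma() (link with -lm); whether the compiler contracts a+=b*c on its own depends on the target and the compiler flags:

#include <stdio.h>
#include <math.h>
int main() {
  double b = 1.0 + 1e-8, c = 1.0 + 1e-8;
  /* b*c rounded to a double, then subtracted from itself: exactly 0 - unless the
     compiler contracts this line too, which is exactly the kind of surprise discussed here */
  double separate = b*c - b*c;
  /* inside fma, b*c is kept at full precision, so what pops out is the rounding error of b*c */
  double contracted = fma(b, c, -(b*c));
  printf("%g %g\n", separate, contracted);
  return 0;
}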

AFAIK the "contraction" of a+=b*c is permitted by both the IEEE FP standard (which defines FP + and *) and the C standard (which defines FP types that can map to standards other than IEEE). On the other hand, sin(x), which also gets implemented in hardware these days, isn't addressed by either standard – to the same effect of making the optimization perfectly legitimate. So you can't solve this by reporting a bug the way you could with algebraic optimizations. The other way in which this is tougher is that tweaking the code according to the optimizer's wishes doesn't help much. AFAIK what can help is one of these two things:

  1. Ask the compiler to not generate these instructions. Sometimes there's an exact compiler option for that, like gcc's platform-specific -mno-fused-madd flag, or there's (a defined and actually implemented) pragma directive such as #pragma STDC FP_CONTRACT. Sometimes you don't have any of that, but you can lie to the compiler that you're using an older (compatible) revision of the processor architecture without the "complex" instructions. The latter is an all-or-nothing thing enabling/disabling lots of stuff, so it can degrade performance in many different ways; you have to check the cost.
  2. If compiler flags can't help, there's the shotgun approach I promised to discuss above, also useful for hypothetical cases of hard-to-work-around algebraic optimizations. Instead of helping the optimizer, we get in its way and make optimization impossible using separate compilation. For example, we can convert a+=b*c to a+=multiply_dammit(b,c); multiply_dammit is defined in a separate file. This makes it impossible for most optimizers to see the expression a+=b*c, and forces them to implement multiplication and addition separately. Modern compilers support link-time inlining and then they do optimize the result as a whole. But you can disable that option, and as a side effect speed up linkage a great deal; if that seriously hurts performance, your program is insane and you're a team of scary ravioli coders.
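
In its minimal form, the shotgun looks like this (with link-time inlining off, no optimizer can fuse the multiply and the add back together):

/* multiply_dammit.c - a separate translation unit on purpose */
double multiply_dammit(double b, double c) { return b * c; }

/* user of the shotgun, in another file */
double multiply_dammit(double b, double c);
void accumulate(double* a, double b, double c) {
  *a += multiply_dammit(b, c); /* necessarily a rounded multiply followed by an add */
}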

The trouble with the shotgun approach, aside from its ugliness, is that you can't afford to shoot at the performance-critical parts of your code that way. Let us hope that you'll never really have to choose between FP consistency and performance, as I've never had to date.

x86

Intel is the birthplace of IEEE floating point, and the manufacturer of the most camp-3-painful and otherwise convoluted FP hardware. The pain comes, somewhat understandably, from a unique commitment to the IEEE FP philosophy – intermediate results should be as precise as possible; more on that in a moment. The "convoluted" part is consistent with the general insanity of the x86 instruction set. Specifically, the "old" (a.k.a "x87") floating point unit uses a stack architecture for addressing FP operands, which is pretty much the exact opposite of the compiler writer's dream target, but so is the rest of x86. The "new" floating point instructions in SSE don't have these problems, at the cost of creating the aesthetic/psychiatric problem of actually having two FP ISAs in the same processor.

Now, in our context we don't care about the FP stack thingie and all that, the only thing that matters is the consistency of precision. The "old" FP unit handles precision thusly. Precision of stuff written to memory is according to the number of bits of the variable, 'cause what else can it be. Precision of intermediate results in the "registers" (or the "FP stack" or whatever you call it) is defined according to the FPU control & status register, globally for all intermediate results in your program.

By default, it's 80 bits. This means that when you compute a*b+c*d and a,b,c,d are 32b floats, a*b and c*d are computed in 80b precision, and then their 80b sum is converted to a 32b result in memory (if a*b+c*d is indeed written to memory and isn't itself an "intermediate" result). Indeed, what's "intermediate" in the sense of not being written to memory and what isn't? That depends on:

  1. Debug/release build. If we have "float e=a*b+c*d", and e is only used once right in the next line, the optimizer probably won't see a point in writing it to memory. However, in a debug build there's a good reason to allocate it in memory, because if you single-step your program and you're already past the line that used e, you still might want to look at the value of e, so it's good that the compiler kept a copy of it for the debugger to find.
  2. The code "near" e=a*b+c*d according to the compiler's notion of proximity also affects its decisions. There are only so many registers, and sometimes you run out of them and have to store things in memory. This means that if you add or remove code near the line or in inline functions called near the line, the allocation of intermediate results may change.
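
Here's a sketch of the kind of code where all this shows up; the volatile stores imitate what a debug build (or gcc's -ffloat-store) tends to do. On x87 with the default 80b setting, the two functions may return different results for the same inputs, while on SSE or most non-x86 FPUs they're identical:

float dot2(float a, float b, float c, float d) {
  return a*b + c*d;       /* products and sum may live in 80b registers until the final store */
}

float dot2_stored(float a, float b, float c, float d) {
  volatile float p = a*b; /* forced out to memory, rounded to 32b */
  volatile float q = c*d;
  return p + q;
}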

Compilers could have an option asking them to hide this mess and give us consistent results. The problems with this are that (1) if you care about cross-platform/compiler consistency, then the availability of cross-mode consistency options in one compiler doesn't help with the other compiler and (2) for some reason, compilers apparently don't offer this option in a very good way. For example, MS C++ used to have a /fltconsistency switch but seems to have abandoned it in favor of an insane special-casing of the syntax float(a*b)+float(c*d) – that spelling forces consistency (although the C++ standard doesn't assign it a special meaning not included in the plain and sane a*b+c*d).

I'd guess they changed it because of the speed penalty it implies rather than the precision penalty as they say. I haven't heard about someone caring both about consistency and that level of precision, but I did hear that gcc's consistency-forcing -ffloat-store flag caused notable slowdowns. And the reason it did is implied by its name – AFAIK the only way to implement x86 FP consistency at compile time is to generate code storing FP values to memory to get rid of the extra precision bits. And -ffloat-store only affects named variables, not unnamed intermediate results (annoying, isn't it?), so /fltconsistency, assuming it actually gave you consistency of all results, should have been much slower. Anyway, the bottom line seems to be that you can't get much help from compilers here; deal with it yourself. Even Java gave up on its initial intent of getting consistent results on the x87 FPU and retreated to a cowardly strictfp scheme.

And the thing is, you never have to deal with it outside of x86 – all floating point units I've come across, including the ones specified by Intel's SSE and SSE2, simply compute 32b results from 32b inputs. People who decided to do it that way and rob us of quite some bits of precision have my deepest gratitude, because there's absolutely no good way to work around the generosity of the original x86 FPU designers and get consistent results. Here's what you can do:

  1. Leave the FP CSR configured to 80b precision. 32b and 64b intermediate results aren't really 32b and 64b. The fact that it's the default means that if you care about FP result consistency, intensive hair pulling during your first debugging sessions is an almost inevitable rite of passage.
  2. Set the FP CSR to 64b precision. If you only use 64b variables, debug==release and you're all set. If you have 32b floats though, then intermediate 32b results aren't really 32b. And usually you do have 32b floats.
  3. Set the FP CSR to 32b precision. debug==release, but you're far from "all set" because now your 64b results, intermediate or otherwise, are really 32b. Not only is this a stupid waste of bits, it is not unlikely to fail, too, because 32b isn't sufficient even for some of the problems encountered by camp 3. And of course it's not compatible with other platforms.
  4. Set the FP CSR to 64b precision during most of the program run, and temporarily set it to 32b in the parts of the program using 32b floats. I wouldn't go down that path; option 4 isn't really an option. I doubt that you use both 32b and 64b variables in a very thoughtful way and manage to have a clear boundary between them. If you depend on the ability of everyone to correctly maintain the CSR, then it sucks sucks sucks.

Side note: I sure as hell don't believe in "very special" "testing" build/running modes. For example, you could say that you have a special mode where you use option (3) and get 32b results, and use that mode to test debug==release or something. I think it's completely self-defeating, because the point of consistency is being able to reproduce a phenomenon X that happens in a mode which is actually important, in another mode where reproducing X is actually useful. Therefore, who needs consistency across inherently useless modes? We'd be defeating the purpose of defeating the purpose of IEEE floating point.

Therefore, if you don't have SSE, the only option is (2) – set the FP CSR to 64b and try to avoid 32b floats. On Linux, you can do it with:

#include <fpu_control.h>
fpu_control_t cw;
_FPU_GETCW(cw);
cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
_FPU_SETCW(cw);

Do it first thing in main(). If you use C++, you should do it first thing before main(), because people can use FP in constructors of global variables. This can be achieved by figuring out the compiler-specific translation unit initialization order, compiling your own C/C++ start-up library, overriding the entry point of a compiled start-up library using stuff like LD_PRELOAD, overwriting it in a statically linked program right there in the binary image, having a coding convention forcing everyone to call FloatingPointSingleton::instance() before using FP, or shooting the people who like to do things before main(). It's a trade-off.
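
For what it's worth, here's a sketch of the coding-convention option (hypothetical names); it only helps if every global constructor that touches FP really does call it first:

#include <fpu_control.h>

struct FloatingPointSettings {
  FloatingPointSettings() {
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
    _FPU_SETCW(cw);
  }
};

FloatingPointSettings& fp_singleton() {
  static FloatingPointSettings instance; /* configured on first use */
  return instance;
}

struct SomeGlobalThatComputes {
  SomeGlobalThatComputes() {
    fp_singleton(); /* the convention: call this before any FP work in a constructor */
    /* ... FP-heavy initialization ... */
  }
};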

The situation is really even worse because the FPU CSR setting only affects mantissa precision but not the exponent range, so you never work with "real" 64b or 32b floats there. This matters in cases of huge numbers (overflow) and tiny numbers (double rounding of subnormals). But it's bad enough already, and camp 3 people don't really care about the extra horror; if you want those Halloween stories, you can find them here. The good news is that today, you are quite likely to have SSE2 and very likely to have SSE on your machine. So you can automatically sanitize all the mess as follows:

  1. If you have SSE2, use it and live happily ever after. SSE2 supports both 32b and 64b operations and the intermediate results are of the size of the operands. BTW, mixed expressions like a+b where a is float and b is double don't create consistency problems on any platform because the C standard specifies the rules for promotion precisely and portably (a will be promoted to double). The gcc way of using SSE2 for FP is -mfpmath=sse -msse2.
  2. If you only have SSE, use it for 32b floats which it does support (gcc: -mfpmath=sse -msse). 64b floats will go to the old FP unit, so you'll have to configure it to 64b intermediate results. Everything will work, the only annoying things being (1) the retained necessity to shoot the people having fun before main and (2) the slight differences in the semantics of control flags in the old FP and the SSE FP CSR, so if you configure your own policy, floats and doubles will not behave exactly the same. Neither problem is a very big deal.

Interestingly, SSE with its support for SIMD FP commands actually can make things worse in the standard-violating-algebraic-optimizations department. Specifically, Intel's compiler reportedly has (had?) an optimization which unrolls FP accumulation loops and reorders additions in order to utilize SIMD FP commands (gcc 4 does that, too, but only if you explicitly ask for trouble with -funsafe-math-optimizations or similar). But I wouldn't conclude anything from it, except that automatic vectorization, which is known to work only on the simplest of code snippets, actually doesn't work even on them.

Summary: use SSE2 or SSE, and if you can't, configure the FP CSR to use 64b intermediates and avoid 32b floats. Even the latter solution works passably in practice, as long as everybody is aware of it.

I think I covered everything I know except for things like long double, FP exceptions, etc. – and if you need that, you're not in camp 3; go away and hang out with your ivory tower numerical analyst friends. If you know a way to automate away more pain, I'll be grateful for every FP inconsistency debugging afternoon your advice will save me.

Happy Halloween!

Off topic

  1. To comment, you no longer need to register, just type "y" to confirm you're a human. Thanks to Aristotle Pagaltzis for pointing out that the previous arrangement sucked.
  2. I've started another blog, mostly hosting images; a few examples appear below.

I originally intended to have one blog for everything, but since you've probably subscribed for the technobabble, I'll reserve the channel for that.


[Images from the photo blog: Eye 1, Eye 2, a duck taking off from the water, and a fish that sits on its tail and thinks; full-resolution versions are on that blog.]

I want a struct linker

Here's a problem I've seen a lot (it's probably classified as an "Antipattern" or a "Code Smell" and as such has a name in the appropriate circles, but I wouldn't know, so I'll leave it nameless).

You have some kind of data structure that you pass around a lot. Soon, the most valuable thing about the structure isn't the data it keeps, but the fact that it's available all the way through some hairy flow of control. If you want to have your data flow through all those pipes, just add it to The Data Structure. (To antipattern classification enthusiasts: I don't think we have a god object yet because we really want to pass our data through that flow and it's right to have one kind of structure for that and not, say, propagating N+1 function parameters.)

Now suppose the structure holds an open set of data. For example, a spam filter could have a data structure to which various passes add various cues they extract from the message, and other passes can access those cues. We don't want the structure to know what passes exist and what cues they extract, so that you can add a pass without changing the structure.

I don't think there's a good way to do it in a 3GL. In C or C++, you can:

  • Aggregate the cue structures by value (which means you have to recompile everything once you change/add/remove a member from any of them)
  • Keep pointers to the cue structures and use forward declarations to avoid recompilation (a bit slower, and you still have to recompile when you add/remove a whole cue structure)
  • Keep an array of void* or base class objects (not debugger-friendly, and requires a registration procedure to resize the arrays according to the number of passes and hand out dynamically computed indexes to the cues to all who wish to access them)
  • Keep a key -> void* map (increasingly slow and debugger-unfriendly, and you need registration to compute the keys from cue names, or use the C substitute for interning – use pointers to global variables with names like &g_my_cue_key as keys)
  • Keep a string -> void* map (no registration or pseudo-interning, but really slow)
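
To make the trade-offs concrete, here's roughly what the last option (the string -> void* map) looks like; the names come from the spam filter example above, nothing more:

#include <map>
#include <string>

struct SpamCues {
  std::map<std::string, void*> cues;               /* cue name -> cue object, type forgotten */
  void set(const std::string& name, void* cue) { cues[name] = cue; }
  template<class T>
  T* get(const std::string& name) const {
    std::map<std::string, void*>::const_iterator it = cues.find(name);
    return it == cues.end() ? 0 : (T*)it->second;  /* the caller must know the real type */
  }
};

/* the Medication pass - SpamCues doesn't know it exists */
struct ViagraCue { int mentions; };
void medication_pass(SpamCues& c) { c.set("viagra", new ViagraCue()); }
/* a later pass: ViagraCue* v = c.get<ViagraCue>("viagra"); */

Every access pays for string comparisons, and the debugger shows you a map full of void* – the slowness and opaqueness complained about above.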

On top of JVM or .NET, you have pretty much the same options, plus the option to generate the cue container structure dynamically. Each cue would define an interface and the container structure would implement all those interfaces. The debugger would display containers nicely, and the code accessing them wouldn't depend on the container class. I'd guess nobody does that though because the class generation part is likely somewhat gnarly.

In a 4GL, you can add attributes to class objects at run time. This is similar to keeping a key->pointer map in a 3GL, except the name interning is handled by the system as it should, and you don't confuse debuggers because you're using a standard feature of the object system. Which solves everything except for the speed issue, which is of course dwarfed by other 4GL speed issues.

Now, I used to think of it as one of the usual speed vs convenience trade-offs, but I no longer think it is, because a struct linker could solve it.

Suppose you could have "distributed" struct/class definitions in an offset-based language; you could write "dstruct SpamCues { ViagraCue viagra; CialisCue cialis; }" in the Medication spam filter module, and "dstruct SpamCues { FallicSymbolsCue fallic; SizeDescriptionsCue size; }" in the Penis Enlargement module. The structure is thus defined by all modules linked into the application.

When someone gets a SpamCues structure and accesses cues.viagra, the compiler generates a load-from-pointer-with-offset instruction – for example, in MIPS assembly it's spelled "lw offset(ptrreg)". However, the offset would be left for the linker to resolve, just the way it's done today for pointers in "move reg, globalobjectlabel" and "jump globalfunclabel".

This way, access to "distributed" structures would be as fast as "normal" structures. And you would preserve most optimizations related to adjacent offsets. For example, if your machine supports multiple loads, so a rectangle structure with 4 int members can be loaded to 4 registers with "ldm rectptrreg,{R0-R3}" or something, it could still be done because the compiler would know that the 4 members are adjacent; the only unknown thing would be the offset of the rectangle inside the larger struct.

One issue the linker could have on some architectures is handling very large offsets that don't fit into the instruction encoding of load-from-pointer-with-offset forms. Well, I'd live even with the dumbest solution where you always waste an instruction to increment a pointer in case the offset is too large. And then you could probably do better than that, similarly to the way "far calls" (calls to functions at addresses too far from the point of call for the offset to fit into 28 bits or whatever the branch offset encoding size is on your machine) are handled today.

The whole thing can fail in presence of dynamic loading during program run as in dlopen/LoadLibrary; if you already have objects of the structure, and your language doesn't support relocation because of using native pointers, then the dynamically loaded module won't be able to add members to a dstruct since it can't update the existing objects. Well, I can live with that limitation.

If the language generates native object files, there's the problem of maintaining compatibility with the object file format. I think this could "almost" be done, by mapping a distributed structure to a custom section .dstruct.SpamCues, and implementing members (viagra, cialis, fallic, size) as global objects in that section. Then if an equivalent of a linker script says that the base address of .dstruct.SpamCues is 0, then &viagra will resolve to the offset of the member inside the structure. The change to automatically map sections named .dstruct.* to 0 surely isn't more complicated than the handling of stuff like .gnu.linkonce, inflicted upon us by the idiocy of C++ templates and the likes of them.
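
An untested sketch of that emulation, assuming GCC/ELF and a linker script fragment along the lines of ".dstruct.SpamCues 0 : { *(.dstruct.SpamCues) }" – member "addresses" then resolve to offsets within the distributed struct:

/* medication.c - this module contributes its members */
struct ViagraCue { int mentions; };
__attribute__((section(".dstruct.SpamCues"))) struct ViagraCue viagra;

/* enlargement.c - another module adds members without touching the first one */
struct SizeDescriptionsCue { float claimed_size; };
__attribute__((section(".dstruct.SpamCues"))) struct SizeDescriptionsCue size;

/* anywhere: &viagra is a link-time constant offset, so access is base pointer + offset */
#include <stddef.h>
#define DMEMBER(base, member) \
  (*(__typeof__(member)*)((char*)(base) + (size_t)&(member)))
/* given a buffer as large as the whole section: DMEMBER(cues, viagra).mentions++; */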

And here's why I'll probably never see a struct linker:

  • If the language uses a native linker, a small change must be done to that linker in order to handle encodings of load/store instructions in ways it previously didn't (currently it only has to deal with resolving pointers, not offsets). Since it's platform-specific, the small change is actually quite large.
  • You could compromise and avoid that change by generating less efficient code which uses the already available linker ability to resolve the "address" of the viagra object in the zero-based .dstruct.SpamCues section – the code can add that "address" (offset, really) to &cues. But that could still force changes in the compiler back-end because now it has to generate assembly code adding what looks like 2 addresses, which makes no sense today and might be unsupported if the back-end preserves type information.
  • The previous items assume that the portable "front-end" work to support something like dstruct isn't a big deal. However, I'd guess that not enough people would benefit from it/realize they'd benefit from it to make it appear in a mainstream language and its front-ends.
  • I could roll my own compiler to a language similar to a mainstream one, with a bunch of additions like this struct linker thingie. Two problems with this. One – it's too hard to parse all the crud in a mainstream language (even if it isn't C++) to make it worth the trouble, unless your compiler does something really grand; a bunch of nice features probably aren't worth it. Two – most programmers take a losing approach towards their career where they want to put mainstream languages on their resume so that losers at the other end can scan their resumes for those languages; if your code is spelled in a dialect, you'll scare off the losers forming the backbone of our industry.

It still amazes me how what really matters isn't what can be done, but what's already done. It's easier to change goddamn hardware than it is to change "software infrastructure" like languages, software tools, APIs, protocols and all kinds of that shit. I mean, here we have something that's possible and even easy to do, and yet completely impractical.

Guess I'll have to roll my own yet-another-distributed-reflective-registration bullshit. Oh well.

The cardinal programming jokes

I'm depressed. What I'll do is I'll tell you the 3 cardinal programming jokes. And if it helps cheer me up, I'll consider my job well done.

I must warn you about those jokes. Firstly, they are translated from Russian and Hebrew by yours truly, which may cause them to lose some of their charm. Secondly, I'm not sure they came with that much charm to begin with, because my taste in jokes (or otherwise) can be politely characterized as "lowbrow". In particular, all 3 jokes are based on the sewer/plumber metaphor. I didn't consciously collect them based on this criterion, it just turns out that I can't think of a better metaphor for programming.

By the way, I was recently told by a very strong programmer that of all things, he wanted to become a plumber as a kid. 'Cause it was very interesting to him, the tools, the pipes, how you make the whole thing work. And then he felt he understood enough of it, so he figured he'd become a programmer instead. And now he is, and he has enough (virtual) pipes full of (virtual) shit to keep him curious about how to make it work for the rest of his life. By which I mean to say, hey, it's not just my bad taste, it is a good metaphor, see?

So, the jokes. Lowbrow, depressing stuff. You have been warned.

Expanding your skill set

A very important thing. You should be learning stuff. Yada yada.

With many things though, people have this strange tendency to avoid knowing them, and instead ask someone else unfortunate enough to already know them. Say, Makefiles. Is it just my experience or do people worldwide pretend to be incapable of dealing with a hairy Makefile, and leave its regularly scheduled tweaking to a small set of knowledgeable victims?

Or debugging of the lowest kind, with race conditions and creative memory corruption. People like to give up on that, as long as someone else can take over. "I just don't know how to proceed". Right.

Sometimes I wish I could put this claim to a test. Check if they'd say this at gunpoint. Or, more humanely and therefore much less cheaply, offer them $1M if they do know how to proceed. I bet they'd think a bit harder. If you're working on AI, specifically on preparing it for the Turing test, don't forget to teach it this principle, or else it has no chance of passing for a human.

I find that the following describes the double-edged sword that is skill set expansion quite well:

A plumber and his apprentice pay a visit to a manhole requiring their attention. The plumber goes down the manhole, and the apprentice stays above with the toolbox. The plumber asks for wrench #3, and the apprentice puts the wrench into his hand. 2 minutes pass. "Wrench #5!" The apprentice finds the wrench and passes it to the plumber. 5 more minutes. "Wrench #6!" The plumber is given that, takes a couple more minutes and finally comes out.

The next scene should really be a small piece of pantomime, but I'll have to get by with words alone. Not unexpectedly for this type of joke, the plumber comes out with his arms covered with excrement. He slowly sweeps his right hand over his left arm, then the left hand over the right arm, shakes his hands and reaches for something to wipe them with. And to the apprentice he says:

"Watch and learn, son, or you'll be passing wrenches for the rest of your life".

Really, you should learn things. Expand your skill set. Who wants to be passing wrenches?

Layers of abstraction

Abstraction is good. Or should I say legitimate. Or should I say inevitable. I mean, you have to count on something. Something has to work, because you can't build things on top of nothing.

Except it won't work. That something you build things on top of won't work.

What's that? "Whining"? Yep, definitely. This here is whining.

Whining is good. Or should I say legitimate. Or should I say inevitable. Because if you aren't allowed to whine about frigging data channels which drop chunks of data and duplicate chunks of data because some fucking hardware subcontractor couldn't be bothered to implement arbitration for shared data access, if you aren't allowed to whine about that…

If you aren't allowed to whine about that, you should be allowed to whine about memory, which flips bits, and zeros bytes, and it does so once per hour for some weird sequence of accesses having nothing to do with the address where data actually changes. Fuck that, OK? Fuck DDR2. Fuck its controllers and the zillions of their configuration parameters.

A plumber climbs out of a manhole, this time without a preamble, and his arms are covered with – guess what? – excrement! A beautiful little girl in a beautiful white dress happens to pass by. The plumber seizes the opportunity and (another piece of pantomime) quickly, but firmly sweeps his hands over the girl's white dress.

Little girl (appalled): AAAH!!

Plumber (outraged): Oh yeah? I bet you love to take a shit though.

Yep. You love to allocate objects in memory, don't you? Megabytes of them. And then a board designer decides to wipe his filthy hands with your beautiful white huge software system. Debug that, you perverted memory-addicted individual.

Taking pride in your work

And still, I actually like my work, on a level. Why? It feels inherently cool to design stuff that becomes this bunch of tiny parts, transistors and all, switching hundreds of millions of times a second, and then to write code that manages all the flying circus.

I know people who feel the same about computer vision. People for whom it's a personal priority to work on computer vision, where they are given images and they look for stuff in them. Who wants to be doing that? Who wants to be responsible for the solution of a problem that can't even be precisely defined? Me, I wanna be doing hardware.

What do I actually do most of the time though? I eat hexadecimal. I sit near a debugger, and I keep hitting Page Up in a memory view window, to find the beginning of the array that overwrote this piece of data (I guess the element size from the repetitive patterns and such), and along comes a computer vision geek and he says, "damn it, man, you got out of the Matrix!"

Well, I dunno, I find it much easier to guess what buggy code did to my memory than to find out "why" an algorithm thinks this here is a person when in fact it's a shade of a tree. Because if you look closely at the pixels, the shade kinda looks like a person, but of course we could reject it based on its motion, but of course that would mean we'd approve these reflections over here based on their motion, but, but, but…

What my bogus example is saying is that you have lots and lots of cues but each can work both for you and against you, and now how do you weigh all that, without even a formal spec? I'd rather eat hexadecimal, thank you very much.

And we look at each other, and sincerely think that our jobs are pretty nifty, but the other guy's job is awful and how can he be doing it. And I suspect that if one looks at this from the outside, one might wonder where the actual fun is, because there is actual fun in here, or so all the participants testify. And I think I know the answer.

An airplane lands, and passengers come out. One of them notices a guy underneath the airplane. As you'd guess, the guy is a plumber. The plumber touches some lock, and immediately gets covered by excrement streaming from an opening at the bottom of the plane.

The pantomime cleanup routine follows, and then comes the turn of the dialog.

Passenger (appalled): What on Earth makes you keep this job?

Plumber (proudly): Hey, I'm in the aerospace business!

The aerospace effect happens to different people with different things. With some, it's "Hey, I'm making real hardware!" With others, it's "Hey, I'm finding real objects in real images!" It's a good thing people are different, because so are the currents of excrement, and someone ought to swim in each. We can't all be passing wrenches.

I love globals, or Google Core Dump

The entire discussion only applies to unsafe languages, the ones that dump core. By which I mean, C. Or C++, if you're really out of luck.

If it can dump core, it will dump core, unless it decides to silently corrupt its data instead. Trust my experience of working in a multi-processor, multi-threaded, multi-programmer, multi-nightmare environment. Some C++ FQA Lite readers claimed that the fact that I deal with lots of crashes in C++ indicates that I'm a crappy programmer surrounded by similar people. Luckily, you don't really need to trust my experience, because you can trust Google's. Do this:

  1. Find a Google office near you.
  2. Visit a Google toilet.
  3. You'll find a page about software testing, with the subtitle "Debugging sucks. Testing rocks." Read it.
  4. Recover from the trauma.
  5. Realize that the chances of you being better at eliminating bugs than Google are low.
  6. Read about the AdWords multi-threaded billing server nightmare.
  7. The server was written in C++. The bug couldn't happen in a safe language. Meditate on it.
  8. Consider yourself enlightened.

This isn't the reason why this post has "Google core dump" in its title, but hopefully it's a reason for us to agree that your C/C++ program will crash, too.

I love globals

What happens when we face a core dump? Well, we need the same things you'd expect to look for in any investigation: names and addresses. Names of objects looking at which may explain what happened, their addresses to actually look at them, and type information to sensibly display them.

In C and C++, we have 3 kinds of addresses: stack, heap and global. Let's see who lives there.

Except the stack is overwritten, because it can be. Don't count on being able to see the function calls leading to the point of crash, nor the parameters and local variables of those functions. In fact, don't even count on being able to see the point of crash itself: the program counter, the link register, the frame pointer, all that stuff can contain garbage.

And the heap is overwritten, too, nearly as badly. The typical data structure used by C/C++ allocators (for example, dlmalloc) is a kind of linked list, where each memory block is prefixed with its size so you can jump to the next one. Overwrite one of these size values and you will have lost the boundaries of the chunks above that address. That's a loss of 50% of the heap objects on average, assuming uniform distribution of memory overwriting bugs across the address space of the heap.
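
Schematically, the walk looks like this (real dlmalloc chunk headers carry more than a size, but the principle is the same):

#include <stddef.h>

struct chunk { size_t size; /* user data follows the header */ };

const char* next_chunk(const char* p) {
  const struct chunk* c = (const struct chunk*)p;
  return p + c->size; /* one smashed size field, and this chunk - and every chunk after it - is lost */
}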

So don't count on the stack or the heap. Your only hope is that someone has ignored the Best Practices and the finger-pointing by the more proficient colleagues, and allocated a global object. Possibly under the clever disguise of a "Singleton". Not a bad thing after all, that moronic "design pattern", because it ultimately allowed one to counter the cargo cult programmers' accusation of "globals are evil" with the equally powerful cargo cult argument of "it's a design pattern". So people could allocate globals again.

Which is good, because a global always has an accurate name-to-address mapping, no matter what atrocity was committed by the bulk of unsafe code running on the loose. Can't overwrite a symbol table. And it has accurate type information, too. As opposed to objects you find through void*, or a base class pointer where the base class lacks virtual functions or the object vptr was overwritten, etc.

Which is why I frequently start debugging by firing an object view window on a global, or running debugger macros which read globals, etc. Of course you can fuck up a global variable to make debugging unpleasant. For example, if the variable is "static" in the C sense, you need to open the right file or function to display it, and you need the debugger front-end to understand the context, which will be especially challenging if it's a static variable in a template function (one of the best things in C++ is how neatly its new features interact with C's old ones).

Or you can stuff the global into a class or a namespace. I was never able to display globals by their qualified C++ name in, say, gdb 5. But no matter; nm <program> | grep <global> followed by p *(TypeOfGlobal*)addr always does the trick, and no attempts at obfuscating the symbol table will stop it. I still say make it a real, unashamed global to make debugging easier. If you're lucky, you'll get to piss off a couple of cargo cult followers as a nice side-effect.

Google Core Dump

A core dump is a web. Its sites are objects. Its hyperlinks are pointers. Its PageRank is a TypeRank: what's the type of this object according to the votes of the pointers stored in other objects? The spamdexing is done by pointer-like bit patterns stored in unused memory slots. The global variables are the major sites with high availability you can use as roots for the crawling.

What utilities would we like to have for this web? The usual stuff.

  • Browsers. Debugger object view window is the Firefox, and the memory view window is the Lynx. The core dump Lynx usually sucks in that it doesn't make it easy to follow pointers – can't click on a word and have the browser follow the pointer (by jumping to the memory pointed by it). No back button, either. Oh well.
  • DNS. The ability to translate variable names to raw addresses. Works very reliably for globals and passably otherwise. Works reliably for all objects in safe languages.
  • Reverse DNS. Given an address, tell me the object name. Problematic for dynamically allocated objects, although you could list the names of pointer variables leading to it (Google bombing). Works reliably for global functions and variables. For some reason, the standard addr2line program only supports functions though. Which is why I have an addr2sym program. It so happened that I have several of them, in fact. You can download one here. "Reverse DNS" is particularly useful when you find pointers somewhere in registers or memory and wonder what they could point to. In safe languages, you don't have that problem because everything is typed and so you can simply display the pointed object.
  • Google Core Dump, similar to Google Desktop or Google for the WWW. Crawl a core dump, figure out the object boundaries and types by parsing the heap linked list and the stack and looking at pointers' "votes", create an index, and allow me to query that index. Lots of work, that, some of it heuristical. And in order to get type information in C or C++, you'll have to either parse the source code (good luck doing it with C++), or parse the non-portable debug information format. But it's doable; in fact, we have it, for our particular target/debugger/allocator combo. Of course it has its glitches. Quirky and obscure enough to make open sourcing it not worth the trouble.

I really wish there was a reasonably portable and reliable Google Core Dump kind of thing. But it doesn't look like that many people care about debugging crashes at all. Most core dumps at customer sites seem to go to /dev/null, and those that can't be easily deciphered are apparently given up on until the bug manifests itself in some other way or its cause is guessed by someone.

Am I coming from a particularly weird niche where the code size is large enough and the development rapid enough to make crashes almost unavoidable, but crashes in the final product version are almost intolerable? Or do most good projects allocate everything on the stack and the heap, so with those smashed they're doomed no matter what? Or is the problem simply stinky enough to make it unattractive for a hobby project while lacking revenue potential to make a good commercial project?

Would you like this sort of thing? If you would, drop me a line. In the meanwhile, I satisfy my wish for a Google Core Dump with my perfect implementation for an embedded co-processor, the one I've poked at with Tcl commands. With 128K of memory, no dynamic allocation, and local variables effectively implemented as globals, perfect decoding is easy. I'm telling ya, globals rule.

As to my "reverse DNS" implementation:

  • I could make it more portable by parsing the output of nm --print-size. But just running nm on a 20M symbol table takes about 2 seconds. I want instantaneous output, 'cause I'm very impatient when I debug.
  • Alternatively, I could make it more portable by using a library such as bfd. But that would drag in a library such as bfd, and I had trouble with what looked like library/compiler version mismatches with bfd, whereas my ELF parsing code never had any trouble. Also, an implementation parsing ELF is more interesting as sample code because you get to see how easy to parse these formats are. So it's elfaddr2sym, not addr2sym. (It's really 32-bit-ELF-with-host-endianness-addr2sym, because I'm lazy and it covers all my targets.)
  • There's a ton of addr2sym code out there, and maybe a good addr2sym program. I just didn't find it. I have an acknowledged weakness in the wheel reinventing department.
  • Of course I don't demangle the ugly C++ names; piping to c++filt does.
  • The program is in D, because of the "instantaneous" bit, and because D is one of the best choices available today if you care about both speed and brevity. Look at this: lowerBound!("a.st_value <= b")(ssyms, addr) does a binary search for addr in the sorted ssyms array. As brief as it gets out of the box with any language and standard library, isn't it? The string is compiled statically into the instantiation of the lowerBound template; a & b are the arguments of the anonymous function represented by the string. Readable. Short. Fast. Easy to use – garbage-collected array outputs in functions like filter(), error messages to the point – that's why a decent grammar is a good thing even if you aren't the compiler writer. Looks a lot like C++, braces, static typing, everything. Thus easy to pimp in a 3GL environment, in particular, a C++ environment. You can download the Digital Mars D compiler for Linux, or wait for C++0x to solve 15% of the problems with <algorithm> by introducing worse problems.

By the way, the std.algorithm module, the one with the sort, filter, lowerBound and similar functions, is by Andrei Alexandrescu, of Modern C++ Design fame. How is it possible that his stuff in D is so yummy while his implementation of similar things in C++ is equally icky? Because C++ is to D what proper fixation is to anaesthesia. There, I bet you saw it coming.

What does "global" mean?

For the sake of completeness, I'd like to bore you with a discussion of the various aspects of globalhood, in the vanishingly small hope of this being useful in a battle against a cargo cult follower authoring a coding convention or such. In C++, "global" can mean at least 6 things:

  • Number of instances per process. A "global" is everything that's instantiated once.
  • Life cycle. A "global" is constructed before main and destroyed after main. A static variable inside a function is not "global" in this sense.
  • "Scope" in the "namespace" sense (as opposed to the life cycle sense). We have C-style file scope, class scope, function scope, and "the true global scope". And we have namespaces.
  • Storage. A "global" is assigned a link time address and stored there. In a singleton implementation calling new and assigning its output to a global pointer, the pointer is "global" in this sense but the object is not.
  • Access control. If it's in a class scope, it may be private or protected, which makes it less of a global in this fifth sense.
  • Responsibility. A global can be accessible from everywhere but only actually accessed from a couple of places. For example, you can allocate a large object instantiating lots of members near your main function and then call object methods which aren't aware that the stuff is allocated globally.

So when I share my love of globals with you, the question is which aspect of globality I mean. What I mean is this:

  1. I like global storage – link-time addresses – for everything which can be handled that way. A global pointer is better than nothing, but it can be overwritten and you will have lost the object; better allocate the entire thing globally.
  2. I like global scope, no classes, namespaces and access control keywords attached, to make symbol table look-up easier, thus making use of the global allocation.
  3. I like global life cycle – no Meyers' singletons and lazy initialization. In fact, I like trivial constructors/destructors, leaving the actual work to init/close functions called by main(). This way, you can actually control the order in which things are done and know what the dependencies are. With Meyers' singletons, the order of destruction is uncontrollable (it's the reverse order of initialization, which doesn't necessarily work). Solutions have been proposed to this problem, all so dreadful that I'm not going to discuss them. Just grow up, design the damned init/close sequence and be in control again. Why do people think that all major operations should be explicit except for initialization which should happen automagically when you least expect it? (A minimal sketch of the explicit style follows this list.)
  4. "Globals" in the sense of "touched by every piece of code" is the trademark style of a filthy swine. There are plenty of good reasons to use "globals"; none of them has anything to do with "globals" as in "variables nobody/everybody is responsible for".
  5. I think that everything that's instantiated once per process is a "global", and when you wrap it with scope, access control, and design patterns, you shouldn't stop calling it a global (and instead insist on "singleton", "static class member", etc.). It's still a global, and its wrapping should be evaluated by its practical virtues. Currently, I see no point in wrapping globals in anything – plain old global variables are the thing best supported by all software tools I know.
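
The explicit init/close style from item 3 above, in its minimal form (module names made up):

struct Logger  { /* trivial constructor */ } g_logger;
struct Network { /* trivial constructor */ } g_network;

void logger_init()   { /* open files, etc. */ }
void logger_close()  {}
void network_init()  { /* may use g_logger - it's already initialized, and visibly so */ }
void network_close() {}

int main() {
  logger_init();
  network_init();   /* the dependency order is written down once, right here */
  /* ... actual work ... */
  network_close();  /* teardown order is a decision, not whatever static destruction happens to do */
  logger_close();
  return 0;
}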

I think this can be used as "rationale" in a coding guideline, maybe in the part allowing the use of globals as an "exception". But I keep my hopes low.