Consistency: how to defeat the purpose of IEEE floating point

I don't know much about the design of IEEE floating point, except for the fact that a lot of knowledge and what they call "intellectual effort" went into it. I don't even know the requirements, and I suspect those were pretty detailed and complex (for example, the benefits of having a separate representation for +0 and -0 seem hard to grasp unless you know about the very specific and hairy examples in the complex plane). So I don't trust my own summary of the requirements very much. That said, here's the summary: the basic purpose of IEEE floating point is to give you results of the highest practically possible precision at each step of your computation.

I'm not going to claim this requirement is misguided, because I don't feel like arguing with people two orders of magnitude more competent than myself who have likely faced much tougher numerical problems than I've ever seen. What I will claim is that differences in numerical needs divide programmers into roughly three camps, and the highest-possible-precision approach hurts one of them really badly, and so has to be worked around in ways we'll discuss. The camps are:

  1. The huge camp of people who do businessy accounting. Those should work with integral types to get complete, deterministic, portable control over rounding and all that. Many of the clueless people in this camp represent 1 dollar and 10 cents as the floating point number 1.1. While they are likely a major driving force behind economical growth, I still think they deserve all the trouble they bring upon themselves.
  2. The tiny camp doing high-end scientific computing. Those are the people who can really appreciate the design of IEEE floating point and use its full power. It's great that humanity accidentally satisfied the needs of this small but really cool group, making great floating point hardware available everywhere through blind market forces. It's like having a built-in Stradivari in each home appliance. Yes, perhaps I exaggerate; I get that a lot.
  3. The sizable camp that deals with low-end to mid-range semi-scientific computing. You know, programs that have some geometry or physics or algebra in them. 99.99% of the code snippets in that realm work great with 64b floating point, without the author having invested any thought at all into "numerical analysis". 99% of the code works with 32b floats. When someone stumbles upon a piece of code in the 1% and actually witnesses fatal precision loss, everybody gathers to have a look as if they heard about a beautiful rainbow or a smoke suggesting a forest fire near the horizon.

The majority of people who use and actually should be using floating point are thus in camp 3. Those people don't care about precision anywhere near camp 2, nor do they know how to make the best of IEEE floating point in the very unlikely circumstances where their naive approach will actually fail to work. What they do care about though is consistency. It's important that things compute the same on all platforms. Perhaps more importantly for most, they should compute the same under different build settings, most notably debug and release mode, because otherwise you can't reproduce problems.

Side note: I don't believe in build modes; I usually debug production code in release mode. It's not just floating point that's inconsistent across modes – it's code snippets with behavior undefined by the language, buggy dependence on timing, optimizer bugs, conditional compilation, etc. Many other cans of worms. But the fact is that most people have trouble debugging optimized code, and nobody likes it, so it's nice to have the option to debug in debug mode, and to do that, you need things to reproduce there.

Also, comparing results of different build modes is one way to find worms from those other cans, like undefined behavior and optimizer bugs. Also, many changes you make are optimizations or refaptorings and you can check their sanity by making sure they didn't change the results of the previous version. As we'll see, IEEE FP won't give you even that, regardless of platforms and build modes. The bottom line is that if you're in camp 3, you want consistency, while the "only" things you can expect from IEEE FP is precision and speed. Sure, "only" should be put in quotes because it's a lot to get, it's just a real pity that fulfilling the smaller and more popular wish for consistency is somewhere between hard and impossible.

Some numerical analysts seem annoyed by the camp 3 whiners. To them I say: look, if IEEE FP wasn't the huge success that it is in the precision and speed departments, you wouldn't be hearing from us because we'd be busy struggling with those problems. What we're saying is the exact opposite of "IEEE FP sucks". It's "IEEE FP is so damn precise and fast that I'm happy with ALL of its many answers – the one in optimized x86 build, the one in debug PowerPC build, the one before I added a couple of local variables to that function and the one I got after that change. I just wish I consistently got ONE of these answers, any of them, but the same one." I think it's more flattering than insulting.

I've accumulated quite some experience in defeating the purpose of IEEE floating point and getting consistency at the (tiny, IMO) cost of precision and speed. I want to share this knowledge with humanity, with the hope of getting rewarded in the comments. The reward I'm after is a refutation of my current theory that you can only eliminate 95%-99% of the pain automatically and have to solve the rest manually each time it raises its ugly head.

The pain breakdown

I know three main sources of floating point inconsistency pain:

  1. Algebraic compiler optimizations
  2. "Complex" instructions like multiply-accumulate or sine
  3. x86-specific pain not available on any other platform; not that ~100% of non-embedded devices is a small market share for a pain.

The good news is that most pain comes from item 3 which can be more or less solved automatically. For the purpose of decision making ("should we invest energy into FP consistency or is it futile?"), I'd say that it's not futile and if you can cite actual benefits you'd get from consistency, than it's worth the (continuous) effort.

Disclaimer: I only discuss problems I know and to the extent of my understanding. I have no solid evidence that this understanding is complete. Perhaps the pain breakdown list should have item 4, and perhaps items 1 to 3 have more meat than I think. And while I tried to get the legal stuff right – which behavior conforms to IEEE 754, which conforms to C99, and which conforms to nothing but is still out there – I'm generally a weak tech lawyer and can be wrong. I can only give you the "worked on my 4 families of machines" kind of warranty.

Algebraic compiler optimizations

Compilers, or more specifically buggy optimization passes, assume that floating point numbers can be treated as a field – you know, associativity, distributivity, the works. This means that a+b+c can be computed in either the order implied by (a+b)+c or the one implied by a+(b+c). Adding actual parentheses in source code doesn't help you one bit. The compiler assumes associativity and may implement the computation in the order implied by regrouping your parentheses. Since each floating point operation loses precision on some of the possible inputs, the code generated by different optimizers or under different optimization settings may produce different results.

This could be extremely intimidating because you can't trust any FP expression with more than 2 inputs. However, I think that programming languages in general don't allow optimizers to make these assumptions, and in particular, the C standard doesn't (C99 §5.1.2.3 #13, didn't read it in the document but saw it cited in two sources). So this sort of optimization is actually a bug that will likely be fixed if reported, and at any given time, the number of these bugs in a particular compiler should be small.

I only recall one recurring algebraic optimization problem. Specifically, a*(b+1) tended to be compiled to a*b+a in release mode. The reason is that floating-point literal values like 1 are expensive; 1 becomes a hairy hexadecimal value that has to be loaded from a constant address at run time. So the optimizer was probably happy to optimize a literal away. I was always able to solve this problem by changing the spelling in the source code to a*b+a – the optimizer simply had less work to do, while the debug build saw no reason to make me miserable by undoing my regrouping back into a*(b+1).

This implies a general way of solving this sort of problem: find what the optimizer does by looking at the generated assembly, and do it yourself in the source code. This almost certainly guarantees that debug and release will work the same. With different compilers and platforms, the guarantee is less certain. The second optimizer may think that the optimization you copied from the first optimizer into your source code is brain-dead, and undo it and do a different optimization. However, that means you target two radically different optimizers, both of which are buggy and can't be fixed in the near future; how unlucky can you get?

The bottom line is that you rarely have to deal with this problem, and when it can't be solved with a bug report, you can look at the assembly and do the optimization in the source code yourself. If that fails because you have to use two very different and buggy compilers, use the shotgun described in the next item.

"Complex" instructions

Your target hardware can have instructions computing "non-trivial" expressions beyond a*b or a+b, such as a+=b*c or sin(x). The precision of the intermediate result b*c in a+=b*c may be higher than the size of an FP register would allow, had that result been actually stored in a register. IEEE and the C standard think it's great, because the single instruction generated from a+=b*c is both faster and more precise than the 2 instructions implementing it as d=b*c, a=a+d. Camp 3 people like myself don't think it's so great, because it happens in some build modes but not others, and across platforms the availability of these instruction varies, as does their precision.

AFAIK the "contraction" of a+=b*c is permitted by both the IEEE FP standard (which defines FP + and *) and the C standard (which defines FP types that can map to standards other than IEEE). On the other hand, sin(x), which also gets implemented in hardware these days, isn't addressed by either standard – to the same effect of making the optimization perfectly legitimate. So you can't solve this by reporting a bug the way you could with algebraic optimizations. The other way in which this is tougher is that tweaking the code according to the optimizer's wishes doesn't help much. AFAIK what can help is one of these two things:

  1. Ask the compiler to not generate these instructions. Sometimes there's an exact compiler option for that, like gcc's platform-specific -mno-fused-madd flag, or there's (a defined and actually implemented) pragma directive such as #pragma STDC FP_CONTRACT. Sometimes you don't have any of that, but you can lie to the compiler that you're using an older (compatible) revision of the processor architecture without the "complex" instructions. The latter is an all-or-nothing thing enabling/disabling lots of stuff, so it can degrade performance in many different ways; you have to check the cost.
  2. If compiler flags can't help, there's the shotgun approach I promised to discuss above, also useful for hypothetical cases of hard-to-work-around algebraic optimizations. Instead of helping the optimizer, we get in its way and make optimization impossible using separate compilation. For example, we can convert a+=b*c to a+=multiply_dammit(b,c); multiply_dammit is defined in a separate file. This makes it impossible for most optimizers to see the expression a+=b*c, and forces them to implement multiplication and addition separately. Modern compilers support link-time inlining and then they do optimize the result as a whole. But you can disable that option, and as a side effect speed up linkage a great deal; if that seriously hurts performance, your program is insane and you're a team of scary ravioli coders.

The trouble with the shotgun approach, aside from its ugliness, is that you can't afford to shoot at the performance-critical parts of your code that way. Let us hope that you'll never really have to choose between FP consistency and performance, as I've never had to date.

x86

Intel is the birthplace of IEEE floating point, and the manufacturer of the most camp-3-painful and otherwise convoluted FP hardware. The pain comes, somewhat understandably, from a unique commitment to the IEEE FP philosophy – intermediate results should be as precise as possible; more on that in a moment. The "convoluted" part is consistent with the general insanity of the x86 instruction set. Specifically, the "old" (a.k.a "x87") floating point unit uses a stack architecture for addressing FP operands, which is pretty much the exact opposite of the compiler writer's dream target, but so is the rest of x86. The "new" floating point instructions in SSE don't have these problems, at the cost of creating the aesthetic/psychiatric problem of actually having two FP ISAs in the same processor.

Now, in our context we don't care about the FP stack thingie and all that, the only thing that matters is the consistency of precision. The "old" FP unit handles precision thusly. Precision of stuff written to memory is according to the number of bits of the variable, 'cause what else can it be. Precision of intermediate results in the "registers" (or the "FP stack" or whatever you call it) is defined according to the FPU control & status register, globally for all intermediate results in your program.

By default, it's 80 bits. This means that when you compute a*b+c*d and a,b,c,d are 32b floats, a*b and c*d are computed in 80b precision, and then their 80b sum is converted to a 32b result in memory (if a*b+c*d is indeed written to memory and isn't itself an "intermediate" result). Indeed, what's "intermediate" in the sense of not being written to memory and what isn't? That depends on:

  1. Debug/release build. If we have "float e=a*b+c*d", and e is only used once right in the next line, the optimizer probably won't see a point in writing it to memory. However, in a debug build there's a good reason to allocate it in memory, because if you single-step your program and you're already past the line that used e, you still might want to look at the value of e, so it's good that the compiler kept a copy of it for the debugger to find.
  2. The code "near" e=a*b+c*d according to the compiler's notion of proximity also affects its decisions. There are only so many registers, and sometimes you run out of them and have to store things in memory. This means that if you add or remove code near the line or in inline functions called near the line, the allocation of intermediate results may change.

Compilers could have an option asking them to hide this mess and give us consistent results. The problems with this are that (1) if you care about cross-platform/compiler consistency, then the availability of cross-mode consistency options in one compiler doesn't help with the other compiler and (2) for some reason, compilers apparently don't offer this option in a very good way. For example, MS C++ used to have a /fltconsistency switch but seems to have abandoned it in favor of an insane special-casing of the syntax float(a*b)+float(c*d) – that spelling forces consistency (although the C++ standard doesn't assign it a special meaning not included in the plain and sane a*b+c*d).

I'd guess they changed it because of the speed penalty it implies rather than the precision penalty as they say. I haven't heard about someone caring both about consistency and that level of precision, but I did hear that gcc's consistency-forcing -ffloat-store flag caused notable slowdowns. And the reason it did is implied by its name – AFAIK the only way to implement x86 FP consistency at compile time is to generate code storing FP values to memory to get rid of the extra precision bits. And -ffloat-store only affects named variables, not unnamed intermediate results (annoying, isn't it?), so /fltconsistency, assuming it actually gave you consistency of all results, should have been much slower. Anyway, the bottom line seems to be that you can't get much help from compilers here; deal with it yourself. Even Java gave up on its initial intent of getting consistent results on the x87 FPU and retreated to a cowardly strictfp scheme.

And the thing is, you never have to deal with it outside of x86 – all floating point units I've come across, including the ones specified by Intel's SSE and SSE2, simply compute 32b results from 32b inputs. People who decided to do it that way and rob us of quite some bits of precision have my deepest gratitude, because there's absolutely no good way to work around the generosity of the original x86 FPU designers and get consistent results. Here's what you can do:

  1. Leave the FP CSR configured to 80b precision. 32b and 64b intermediate results aren't really 32b and 64b. The fact that it's the default means that if you care about FP result consistency, intensive hair pulling during your first debugging sessions is an almost inevitable rite of passage.
  2. Set the FP CSR to 64b precision. If you only use 64b variables, debug==release and you're all set. If you have 32b floats though, then intermediate 32b results aren't really 32b. And usually you do have 32b floats.
  3. Set the FP CSR to 32b precision. debug==release, but you're far from "all set" because now your 64b results, intermediate or otherwise, are really 32b. Not only is this a stupid waste of bits, it is not unlikely to fail, too, because 32b isn't sufficient even for some of the problems encountered by camp 3. And of course it's not compatible with other platforms.
  4. Set the FP CSR to 64b precision during most of the program run, and temporarily set it to 32b in the parts of the program using 32b floats. I wouldn't go down that path; option 4 isn't really an option. I doubt that you use both 32b and 64b variables in a very thoughtful way and manage to have a clear boundary between them. If you depend on the ability of everyone to correctly maintain the CSR, then it sucks sucks sucks.

Side note: I sure as hell don't believe in "very special" "testing" build/running modes. For example, you could say that you have a special mode where you use option (3) and get 32b results, and use that mode to test debug==release or something. I think it's completely self-defeating, because the point of consistency is being able to reproduce a phenomenon X that happens in a mode which is actually important, in another mode where reproducing X is actually useful. Therefore, who needs consistency across inherently useless modes? We'd be defeating the purpose of defeating the purpose of IEEE floating point.

Therefore, if you don't have SSE, the only option is (2) – set the FP CSR to 64b and try to avoid 32b floats. On Linux, you can do it with:

#include <fpu_control.h>
fpu_control_t cw;
_FPU_GETCW(cw);
cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
_FPU_SETCW(cw);

Do it first thing in main(). If you use C++, you should do it first thing before main(), because people can use FP in constructors of global variables. This can be achieved by figuring out the compiler-specific translation unit initialization order, compiling your own C/C++ start-up library, overriding the entry point of a compiled start-up library using stuff like LD_PRELOAD, overwriting it in a statically linked program right there in the binary image, having a coding convention forcing to call FloatingPointSingleton::instance() before using FP, or shooting the people who like to do things before main(). It's a trade-off.

The situation is really even worse because the FPU CSR setting only affects mantissa precision but not the exponent range, so you never work with "real" 64b or 32b floats there. This matters in cases of huge numbers (overflow) and tiny numbers (double rounding of subnormals). But it's bad enough already, and camp 3 people don't really care about the extra horror; if you want those Halloween stories, you can find them here. The good news are that today, you are quite likely to have SSE2 and very likely to have SSE on your machine. So you can automatically sanitize all the mess as follows:

  1. If you have SSE2, use it and live happily ever after. SSE2 supports both 32b and 64b operations and the intermediate results are of the size of the operands. BTW, mixed expressions like a+b where a is float and b is double don't create consistency problems on any platform because the C standard specifies the rules for promotion precisely and portably (a will be promoted to double). The gcc way of using SSE2 for FP is -mfpmath=sse -msse2.
  2. If you only have SSE, use it for 32b floats which it does support (gcc: -mfpmath=sse -msse). 64b floats will go to the old FP unit, so you'll have to configure it to 64b intermediate results. Everything will work, the only annoying things being (1) the retained necessity to shoot the people having fun before main and (2) the slight differences in the semantics of control flags in the old FP and the SSE FP CSR, so if you configure your own policy, floats and doubles will not behave exactly the same. Neither problem is a very big deal.

Interestingly, SSE with its support for SIMD FP commands actually can make things worse in the standard-violating-algebraic-optimizations department. Specifically, Intel's compiler reportedly has (had?) an optimization which unrolls FP accumulation loops and reorders additions in order to utilize SIMD FP commands (gcc 4 does that, too, but only if you explicitly ask for trouble with -funsafe-math-optimizations or similar). But I wouldn't conclude anything from it, except that automatic vectorization, which is known to work only on the simplest of code snippets, actually doesn't work even on them.

Summary: use SSE2 or SSE, and if you can't, configure the FP CSR to use 64b intermediates and avoid 32b floats. Even the latter solution works passably in practice, as long as everybody is aware of it.

I think I covered everything I know except for things like long double, FP exceptions, etc. – and if you need that, you're not in camp 3; go away and hang out with your ivory tower numerical analyst friends. If you know a way to automate away more pain, I'll be grateful for every FP inconsistency debugging afternoon your advice will save me.

Happy Halloween!

38 comments ↓

#1 ljbirdwatcher on 10.31.08 at 2:31 pm

Beautiful and thought provoking, as usual. But… I assume consistency means getting identical bit patterns in FP operation results across different compilers and platforms. Now why is that goal desirable? Why care about a binary representation of a double being one way or another?

#2 Yossi Kreinin on 10.31.08 at 2:48 pm

Thanks…. "Bit pattern" is a loaded term because it may refer to the represented number or it may look beyond that at the actual bit string stored in memory (in which case you start talking about endianness, etc.). I assume we're talking about the represented number, so for example, the question is what's wrong with having computed x=1.0001 on one platform and 0.999 on another platform (build mode, compiler).

The problem is ultimately that x will look differently when printed, will yield a different value if converted to an int, and if it's compared with 1.0 in a conditional statement, a different code path will execute. This complicates testing because you can't compare outputs of different build settings or program versions. It also complicates debugging because you can't reproduce problems found in a hard-to-debug environment in an easy-to-debug environment.

Now, there's another problem – that of theoretical program correctness. That is, if you assume that (int)x is always between 1 and 10 and once in a lifetime x actually computes as 0.999 and you have an out-of-bounds access, than you're really screwed way beyond ease of testing and debugging.

However, this isn't the problem of camp 3. It's the problem of camp 2 who compute precise things using precise techniques. A camp 3 person should add a max(min((int)x,10),1) and be done with it. That's because a camp 3 person can't/doesn't want to/shouldn't be busy thinking about stuff like the behavior for subnormal numbers. The inherent precision of his data and algorithms is so lame that this effort is completely misplaced; it won't yield good results, and it costs much more than braindead sanitizing like the min/max thing.

The bottom line is that I want things to be the same because identical output eases testing and debugging, and not because I care about the exact bit pattern that will be computed identically.

#3 Anonymous on 10.31.08 at 3:15 pm

hur hur .. refaptoring!

#4 Yossi Kreinin on 10.31.08 at 3:28 pm

According to Google, the term was coined by Steve Yegge, and still has only one Google hit. This post is supposed to raise that number by 100%.

#5 anonymous on 10.31.08 at 3:34 pm

Google up math professor William Kahan at UC Berkeley. He was the guy who invented IEEE 754 arithmetic. You might ask him?

#6 ljbirdwatcher on 10.31.08 at 4:23 pm

I'd argue if you assume (int)x is always between 1 and 10 and once in a lifetime it actually computes as 0.999, you are screwed no matter if you have crosscompiler consistency in x, or not. Granted, it would not be a bad thing to have it, but even if you do, you are still screwed until you do the max(min) thing, or assume x is between 1-e and 10+e instead.

#7 Doug Merritt on 10.31.08 at 6:36 pm

I understand your desire for consistency, but there are some problems with your claims and assumptions.

Anonymous above recommend you check out Kahan, the father of IEEE floating point, and for anyone who has enough interest to read a blog entry like this, let alone write one, Kahan is in fact required reading — potentially up to reading everything on his site:

http://www.cs.berkeley.edu/~wkahan/

I read/re-read his publications every few years, and happened to have bookmarked some that I liked last month:

"Why Can I Debug some Numerical Programs that You Can't?"
http://www.cs.berkeley.edu/~wkahan/Stnfrd50.pdf

"A Logarithm Too Clever by Half" http://www.cs.berkeley.edu/~wkahan/LOG10HAF.TXT

Also the must-read edited reprint of the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg
http://docs.sun.com/source/806-3568/ncg_goldberg.html

Back to what I consider important misstatements, certainly IEEE tries to avoid unnecessary imprecision, and multiple things in its design show that, but that is different than what I consider to be an incorrect statement, that "to give you results of the highest practically possible precision at each step of your computation"

Nay nay. Your desire for consistency is similar to the goals of an expert numerical analyst like Kahan, but to a first approximation, *accuracy* is far more important than precision, and the two are not at all equivalent, yet programmers in general tend not to think about the difference.

Finally, most of the (interesting and valuable) concerns, approaches, and advice you present doesn't have anything to do with IEEE floating point. Almost all of your points have to do with with idiosyncracies of languages, compilers, and optimizers.

…possibly aside from your ending comments about fpu_control, which I merely skimmed and can't comment on due to the press of time.

Despite those critiques, thanks for your article. Any discussion that makes programmers more aware of any of the zillions of issues relating to floating point is a good public service.

#8 Doug Merritt on 10.31.08 at 6:50 pm

P.S. I forgot to mention…when I said "Almost all of your points have to do with with idiosyncracies of languages, compilers, and optimizers.", I meant to say that you would have similar issues with non-IEEE floating point, and pre-IEEE standard, there were many.

And when I mentioned the concern of experts like Kahan to have consistency, I meant to mention the (unfortunately not all that well-know fact) work by Kahan and his students and successors to create a C libm (best known as fdlibm — Freely Distributable Libm), which was "exactly" accurate, meaning that each function produced a result that differed from an infinite-precision result by no more than 1/2 of the final bit of precision.

In other words, Kahan with others under him produced a libm (first distributed with BSD Unix, IIRC) that was both as accurate as possible and had maximum precision possible, simultaneously.

What a tour de force!

Furthermore, they released that library for multiple floating point standards, not just IEEE, but also VAX and PDP 11 etc (IIRC).

That should make anyone stop and say "hmm!"

#9 Anonymous.1415926... on 10.31.08 at 7:40 pm

You lost me as soon as I found out your parentheses weren't done in Ada.

#10 Dan on 10.31.08 at 8:00 pm

The solution that has always worked for me when running tests on numerical code is to compare the results as numbers (rather than strings). If the difference between the expected result and the actual result is less than, say, 0.00001% then I consider it a pass.
This sounds a lot easier than messing with the code or worrying about rounding.

If floating-point error has any significant effect on the result, it would tend to indicate a poorly-conditioned numerical algorithm that should not be used anyway.

#11 Mattthew on 10.31.08 at 9:50 pm

You mispell "worms" as "warms" in two places.

#12 M on 10.31.08 at 10:27 pm

Actually, group two (high-end scientific calculations) also require consistency. Calculations need to be able to be repeated to obtain the same results. This inconsistency that you talk about is not an issue with IEEE floating points, it is an issue with the compilers you are using.

#13 Yossi Kreinin on 10.31.08 at 11:02 pm

To people saying that my problem is with software tools and not the FP standard: for those on x86, the biggest problem before SSE used to be the x87 FPU, which is almost impossible to get consistent results out of, and AFAIK IEEE 754 allows it to be built that way because it doesn't mind extra precision. Also, things like contraction are AFAIK explicitly supported by IEEE, without requiring a mode where contraction is forced to give results consistent to non-optimized mul-then-add implementation. Also, I think the 2008 revision of IEEE 754 has a reproducibility clause which discusses the way for hw/sw systems to provide the option of writing code generating reproducible results. So they do acknowledge this to be in their territory. I didn't talk about it because I tend to care more about de-facto standards than upcoming standards, but I thought it's worth mentioning now that you blame me for misattributing blame.

To people saying that algorithms shouldn't "depend" on small precision errors (where "depend" means "produce radically different results by taking a different code path or indexing into a different table entry", etc.) – for many, many problems it doesn't work that way and there's no reason to think of it like that. Think about problems without a "right" answer – for example, face recognition. Where does the region with the face start – at pixel 397 or 398? When a person in a crowd is obstructed by another person – should we lose the "target" at frame 55 or 56? And what if the vision algorithm is imperfect, as they all are, and you only find 97% of the things you should? "Noise" and algorithmic imperfection can make different examples appear in the false-negative 3%, so it's no longer +-1 in the pixel or frame number. Sure, you can and should implement automated tests measuring the quality of your code by defining tolerance criteria capable of saying "you're about that good" without relying on exact pixel numbers. However, many times what you care about isn't how good you are, but why you crash on input so and so in release build. Now, if debug build is only statistically similar to the release build, the crash will probably not reproduce on that input. Net result: FP inconsistency makes things hard to debug.

Now, the whole theme of "you're walking on a borderline and a slight numerical push can make you cross it" – it happens all the time, with examples much simpler than face recognition. It's wishful thinking to assume that "user-visible" program results shouldn't be affected by numerical noise or else the algorithm is "buggy".

Regarding W. Kahan, for whom I, um, Googled a lot: the fact that you can develop an insanely precise library on top of IEEE, and that you can fight and win the battle for consistent results on top of IEEE doesn't contradict what I say. What I say is that you can't do it without an extremely detailed and accurate numerical analysis. If you compute a trig function and you manage to reach the maximal possible precision your datatype supports, sure, you're precise and consistent. This the kind of thing camp 2 people can do and happen to be doing, although I'm not sure you'll always end up like that – if you could always reach the maximal possible datatype precision with IEEE, for every algorithm, it would be real numbers, and they aren't. But the thing is, camp 3 people deal with data and algorithms which aren't worth the effort, because you'll get lousy precision anyway. Those are the people who want consistency without the, um, enormous cost of detailed numerical analysis.

#14 Yossi Kreinin on 10.31.08 at 11:33 pm

To Doug Merritt: I only noticed your first comment now since WP cleverly put it into the moderation queue because of the links.

I'd be grateful if you explained me your argument regarding precision and accuracy in more detail, because I think the general issue of their difference is orthogonal to what I'm saying. In terms of this dichotomy, "accuracy" is related to algorithm design (an approximation of sine should actually approximate sine on all inputs – give a result close to the real value), and "precision" is related to algorithm implementation (the hw/sw combo running a sine approximation should compute the same thing each time you try it whether it's computing an accurate approximation or a crummy one; although it's not that trivial because one thing crummy algorithms do is overflow and the like, so the requirement of their designers to get precisely the same wrong answer across many systems may be "illegitimate" – that's why I think that the inability to control the exponent range on the x87 isn't a big deal).

Anyway, in terms of accuracy/precision, when I ask for "consistency", it's synonymous for "precision", actually the maximal possible precision – that is, "gimme the same answer everywhere". I do that based on the assumption that the accuracy of the basic operations of IEEE FP is good enough to be able to design decently accurate algorithms with very little effort in most cases (and that is true whether I use 64b regs or 80b regs). But in general, the question of analyzing the accuracy of algorithms implemented on top of IEEE 754 is out of my scope – it's the business of people designing them, and it was the business of the people designing IEEE 754 who cared about the ability to design specific algorithms accurately on top of the basic operations.

Now, the fact that I use "consistency of results" where you'd use "precision", and that I use "precision of operations" where you'd say "accuracy" is consistent with your claim that most programmers don't think in terms of accuracy/precision; I'm writing for these people, after all. I think my claims are still valid and readable though.

#15 undebugger on 11.01.08 at 1:31 am

refaptoring!!! it's a brillliant meme, didn't hear it before

(and the article is fine as always)

#16 Yossi Kreinin on 11.01.08 at 2:38 am

thanks, but as noted in a comment above – it's Steve Y's (there 1 Google hit for it right now.)

#17 Dan on 11.01.08 at 7:37 am

Yossi,
Your comment #13 finally made the point of this clear in my mind. The 'off-by-one-pixel' example in image processing is a great example of why one would need consistency.

Although, I would still argue that the 'relying-on-rounding-is-a-bug' argument still almost always applies to continuous problem spaces.

Thanks for the great article.

#18 Gabe on 11.01.08 at 11:18 am

It's definitely a pain point, but what's the solution? How are you going to standardize hardware and compiler implementations? If you managed to do that lots of people would get left out in the cold. Wishing for consistency in computers is a bit more realistic, but still in the same camp as biologists wishing for genetic equivalence in all members of a species. Sure it would solve a boat load of problems but it ain't gonna happen, and even if it did, the perfection would only exist until the first bug was found.

#19 Marcel Popescu on 11.01.08 at 11:25 am

Precision = how many digits you have in a number.
Accuracy = how close you are to the "real" value.

3.55555 is a more precise approximation of Pi than 3.1, but a less accurate one.

#20 Yossi Kreinin on 11.01.08 at 11:35 am

@Gabe: as I said in the article, AFAIK the specific problem of FP consistency is "almost solvable" on the hardware available today, so the state of things isn't bad, it's just that you don't get consistency by default.

@Marcel: the term "precision" is used to name many different things. "Precision" means "#digits" in the context of cout<<setprecision(20), it means "accuracy" as you defined it in a lot of informal discussions (including this article except for the comments mentioning the word "accuracy"), and in the context of "the difference between precision and accuracy", it means the distance between measurements/computations obtained from different trials: http://en.wikipedia.org/wiki/Precision_and_accuracy

So I don't think any of this is about the number of digits, which is a base-dependent and hardly interesting property.

#21 Marcel Popescu on 11.01.08 at 11:46 am

I was clarifying Doug Merritt's comment, which you seemed not to understand ("I’d be grateful if you explained me your argument regarding precision and accuracy in more detail…"). It seems obvious to me that he's using the same definition I am.

#22 Brooks Moses on 12.04.08 at 12:29 pm

A comment on this quote about compiler optimizations that alter precision: "So this sort of optimization is actually a bug that will likely be fixed if reported, and at any given time, the number of these bugs in a particular compiler should be small."

In many cases, this is not in fact a bug, but a case of using an optimization flag that tells the compiler to introduce these optimizations in the interest of greater speed of computation. In GCC, the flag is -ffast-math. I believe that in some commercial compilers, this sort of flag is on by default at higher levels of optimization.

Thus, if you rely on these optimizations being absent, you should read your compiler documentation carefully!

#23 Yossi Kreinin on 12.04.08 at 12:39 pm

I shouldn't; standard-compliant behavior ought to be the default.

Thanks for the info on -ffast-math though; I remember seeing it in some make output, but I haven't looked it up. Now I'll be sure to tear it out of every compiler option list I'll ever stumble upon.

Regarding the a*(b+1) => a*b+a transformation – I'm positive it was a bug; nobody would have stuffed -ffast-math into that Makefile.

#24 Yossi Kreinin on 12.04.08 at 12:41 pm

…Liked your (?) page at LtU.

#25 Websites tagged "embedded" on Postsaver on 12.26.08 at 8:17 am

[...] my channel flippin’ mind saved by Yaksah2008-12-14 – inspiration saved by frogfish2008-12-13 – Consistency: how to defeat the purpose of IEEE floating point saved by Jelloman182008-12-13 – as soon as the incarnational analogy is made, one’s indulgences [...]

#26 Simon on 09.24.09 at 5:24 am

Consistency is important. Code that is compiled on one system may not produce a consistent result on another system. I am talking about code that involves a large number of accumulated computations. If the program is run once, it produces a result, which we average over the length of the run. The second time we run the same code a new set of results emerge and the average can differ by 10%. The differences start out at the 15th significant figure and accumulate past the 8th and right down to the first figure. That means that the code behaves differently each time it is run, dependent on things like memory and stack. If the program is run on the system it is compiled on, it should always produce consistent results (but it will not). When it is run on the client's computer it definitely will not produce the same numbers – and that is serious if it is a weather prediction program.

#27 Glenn Fiedler on 02.25.10 at 11:09 pm

Brilliant article. And for those wondering why consistent floating point calculations are so important – many, MANY videogames rely on complete determinism and sending only inputs per-simulation from to keep players in sync, eg. most RTS, even some physics simulation games like "Little Big Planet", basically any situation in which the state of the system is too large to practically send over the network, determinism is key!

#28 Glenn Fiedler on 02.25.10 at 11:12 pm

Brilliant article. And for those wondering why consistent floating point calculations are so important – many, MANY videogames rely on complete determinism and sending only inputs per-simulation frame to keep each player's game simulation in sync. For example most RTS games use this technique (eg. "Starcraft", "Age of Empires") also some physics simulation games like “Little Big Planet” do too, and many sports games you play online as well . Basically, any situation where the state of the system is too large to practically send over the network, you'll find that determinism and reproducibility are key!

#29 Yossi Kreinin on 02.26.10 at 4:50 am

@Glenn: thanks; I hope my practical advice is, um, practical. These things are a bit hard to fully research (at least for me) – hard to know how many warts one left uncovered.

#30 Christer Ericson on 05.08.10 at 1:26 am

"Adding actual parentheses in source code doesn’t help you one bit. The compiler assumes associativity and may implement the computation in the order implied by regrouping your parentheses."

Utter and totally wrong! Read up on the C/C++ standards. A standard-complying compiler (and in this particular issue, they all are, unless you've opted out using some 'fast-math' compiler option) will not change the meaning of floating-point code.

E.g. see 5.1.2.3 section 14 of the C (draft) standard. ("Rearrangement for floating-point expressions is often restricted because of limitations in
precision as well as range. The implementation cannot generally apply the mathematical associative rules
for addition or multiplication, nor the distributive rule, because of roundoff error, even in the absence of
overflow and underflow.")

See Annex F for even more detail. Same applies for C++.

#31 Yossi Kreinin on 05.08.10 at 11:13 pm

Um, I quote §5.1.2.3 in the paragraph right after the one you mention.

#32 Christer Ericson on 05.20.10 at 7:59 pm

Huh? I have no idea what you are trying to say with that reply.

Your original statement "The compiler assumes associativity" is unequivocally wrong. Floating-point arithmetic is NOT associative! Therefore no compiler "assumes associativity".

The bit where you say "Adding actual parentheses in source code doesn’t help you one bit." is also total nonsense. If I have explicitly added parenthesis to a floating-point expression to control the evaluation order, the compiler cannot remove these parentheses unless removing them leaves the expression exactly equivalent *under evaluation as a floating-point expression*. E.g. a compiler cannot rewrite "((x – y) + y)" into "x". So adding parenthesis DO matter!

You said "Compilers, or more specifically buggy optimization passes, assume that floating point numbers can be treated as a field – you know, associativity, distributivity, the works. This means that a+b+c can be computed in either the order implied by (a+b)+c or the one implied by a+(b+c)."

This too is complete nonsense. Floating-point numbers do not have associativity, and therefore cannot be treated as a field. The two floating-point expressions "(a+b)+c" and "a+(b+c)" are NOT equivalent!

I think you need to brush up on floating-point arithmetic a bit. See the Accuracy Problems section of the wikipedia page

http://en.wikipedia.org/wiki/Floating-point

for an example of how floating-point numbers are not associative.

#33 Yossi Kreinin on 05.22.10 at 11:10 am

I said "buggy optimization passes". If you've never run into a compiler bug of that sort, that's fine, and the fact that you know that such behavior is illegitimate and therefore is indicative of a compiler bug is also fine, but both add little to the discussion. The fact that I knew everything you just said is completely obvious from TFA to everyone who bothered to R it.

#34 JC on 11.05.12 at 4:35 pm

I know this article is 4 years old, but it takes me back 25 years when I worked on Fortran compilers for early 386/387s. We would dread the Gold* support call from a research scientist who would report his program worked perfectly in debug mode, but crashed in release mode. At Gold* support we knew we were about to do a full numerical analysis of the code to find where an 80 bit number had gone down one particular "if" branch of doom when its 64 bit debug alter-ego had gone down the "else" path of salvation.

25 years later I am the research scientist. I have the advantage of being forwarned and forearmed, so write code and tests that are tollarent of such eccentic behaviour. This is the general solution, as I often write complex/fast/precise algortithms which can I compare to easy/slow/naive versions. As the number and sequence of FP operations are quite different the answers will never be exactly the same even if the IEEE fp and the compiler were friendlier, and I don't care provided they match within my calculated tollerence.

#35 Yossi Kreinin on 11.05.12 at 10:05 pm

@JC: well, someone like you who is in the tiny minority of people who actually know what they're doing when they reach for floating point probably doesn't care about things like consistent behavior between debug and release…

#36 Amateur Gamedev on 04.27.13 at 12:31 pm

It is highly worth mentioning that this is an incredibly annoying problem for game developers. Eg. multiplayer real-time strategy games often depend on their simulations working precisely the same on all machines. If they don't, it is a synchronization error and the ongoing game is cancelled.

This means that for a multiplatform game, one is practically forced to use software floating or fixed point for everything that can affect the outcome of the game, diminishing the available performance an order of magnitude lower than a predictable hardware implementation would allow.

#37 Yossi Kreinin on 04.27.13 at 1:19 pm

I wrote the article a long time ago, but I think that if you're on x86, then you can ask for double precision instead of extended and then ask for SSE to be used for single precision and basically you get what Sun calls "single/double system" instead of an "extended-based system" in its manuals and it's deterministic enough; and you ask on all platforms that the compilers doesn't use fused MACs and such, and you don't use library functions like log or exp, and you're basically all set.

I'm not sure this covers it all but it sounds as if this is better than using emulation; maybe I miss something big making the use of hardware floating point hopeless.

Incidentally, I'm planning on writing a follow-up on this one; I've studied the issues in more detail, and I think I'm in a reasonably good position to argue with some of the ideas behind IEEE. I think this portability business is just botched up, for one thing (ironically this is pointed out in an appendix to the paper one of the commenters above linked to, the piece by Goldberg), and then nans, infs, denormals and rounding modes all have their very notable drawbacks which I think are worth discussing.

#38 Christian Plesner Hansen on 11.13.13 at 2:54 am

Great article. I've been looking into IEEE 754, trying to figure out how to support it in sane way in a high-level language. I've had a lot of the same thoughts along the way but this goes into much more detail. Looking forward to the follow-up.

Leave a Comment