Fun with UB in C: returning uninitialized floats

June 30th, 2015

The average C/C++ programmer's intuition says that uninitialized variables are fine as long as you don't depend on their values.

A more experienced programmer probably suspects that uninitialized variables are fine as long as you don't access them. That is, computing c=a+b where b is uninitialized is not harmless even if you never use c. That's because the compiler could, say, optimize away the entire block of code surrounding c=a+b under the assumption that c=a+b, where b is proven to be always uninitialized, is always undefined behavior (UB). And if it's UB, the only way for the program to be correct is for this code to be unreachable anyway. And if it's unreachable, why waste instructions translating any of it?

However, the following code looks like it could potentially be OK, doesn't it?

float get(obj* v, bool* ok) {
  float c;
  if(v->valid) {
    *ok = true;
    c = v->a + v->b;
  }
  else {
   *ok = false; //not ok, so don't expect anything from c
  }
  return c;
}

Here you return an uninitialized c, which the caller shouldn't touch because *ok is false. As long as the caller doesn't, all is well, right?

Well, it turns out that even if the caller does nothing at all with the return value – ever, regardless of *ok – the program might bomb. That's because c could be initialized to a singaling NaN, and then say on x86, when the fstp instruction is used to basically just get rid of the return value, you get an exception. In release mode but not in debug mode, some of the time but not all the time. This gives you this warm, fuzzy WTF feeling when you stare at the disassembled code. "Why is there even a float here in the first place?!"

How much uninitialized data is shuffled around by real-world C programs? A lot, I wager – likely closer to 95% than to 5% of programs do this. Otherwise Valgrind would not go to all the trouble to not barf on uninitialized data until the last possible moment (that moment being when a branch is taken based on uninitialized data, or when it's passed to a system call; to not barf then would require some sort of a multiverse simulation approach for which there are not enough computing resources.)

Needless to say, most programs enjoying ("enjoying"?) Valgrind's (or rather memcheck's) conservative approach to error reporting were written neither in assembly which few use, nor in, I dunno, Java, which won't let you do this. They were written in C and C++, and most likely they invoke UB.

(Can you touch uninitialized data in C without triggering UB? I seriously don't know, I'm not a language lawyer. Being able to do this is actually occasionally useful for optimization. Integral types for instance don't have anything like signaling NaNs so at the assembly language level you should be fine. But at the C level the compiler might get needlessly clever if it manages to prove that the data is uninitialized. My own intuition is it can never prove squat about data passed by pointer because of aliasing and so I kinda assume that if I get a buffer pointing to some data and some of it is uninitialized I can do everything to it that I could in assembly. But I'm not sure.)

What a way to make a living.