The C++ Sucks Series: the quest for the entry point

December 7th, 2008

Suppose you run on the x86 and you don't like its default FPU settings. For example, you want your programs to dump core when they divide by zero or compute a NaN, having noticed that on average, these events aren't artifacts of clever numerical algorithm design, but rather indications that somebody has been using uninitialized memory. It's not necessarily a good idea for production code, but for debugging, you can tweak the x86 FPU thusly:

//this is a Linux (glibc) header using GNU inline asm
#include <fpu_control.h>
void fpu_setup() {
  fpu_control_t cw; //the control word type declared by the header
  _FPU_GETCW(cw);
  cw &= ~_FPU_MASK_ZM; //unmask (trap on) divide by zero
  cw &= ~_FPU_MASK_IM; //unmask (trap on) invalid operation, such as 0/0
  _FPU_SETCW(cw);
}

So you call this function somewhere during your program's initialization sequence, and sure enough, computations producing NaN after the call to fpu_setup result in core dumps. Then one day someone computes a NaN before the call to fpu_setup, and you get a core dump the first time you try to use the FPU after that point. Because that's how x86 maintains its "invalid operation" flags and that's how it uses them to signal exceptions: a masked exception just sets a sticky flag in the status word, and once you unmask it, the next FPU instruction to come along finds the flag set and traps.
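
To see the sticky flag for yourself, here's a minimal sketch, assuming x86 and GCC-style inline asm. The helper names (x87_status_word and friends) are made up for illustration; fnstsw and fnclex are the x87 instructions for reading the status word and clearing its exception flags. The division is done in long double because that's the type GCC typically still computes on the x87 when targeting x86-64 (an assumption about your compiler and flags).

```cpp
#include <cassert>
#include <fpu_control.h>

// Hypothetical helper: read the x87 status word. Its low bits are the
// exception flags, in the same bit positions as the _FPU_MASK_* mask bits.
static unsigned short x87_status_word() {
  unsigned short sw;
  __asm__ volatile ("fnstsw %0" : "=m" (sw));
  return sw;
}

// Hypothetical helper: has an invalid operation been recorded and not cleared?
bool x87_invalid_pending() {
  return (x87_status_word() & _FPU_MASK_IM) != 0;
}

// Hypothetical helper: clear all pending x87 exception flags.
void x87_clear_flags() {
  __asm__ volatile ("fnclex");
}

// Compute 0/0 on the x87. volatile keeps the compiler from folding the
// division away at compile time.
void compute_nan() {
  volatile long double zero = 0.0L;
  volatile long double result = zero / zero;
  (void)result;
}

// With the invalid-operation exception masked (the default), computing a NaN
// raises no signal, but the flag stays set until somebody clears it.
void demo_sticky_flag() {
  x87_clear_flags();
  assert(!x87_invalid_pending());
  compute_nan();
  assert(x87_invalid_pending());
}
```

Calling something like x87_invalid_pending() at the top of fpu_setup would at least tell you that somebody computed a NaN earlier, though not who.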

The call stack you got is pretty worthless as you're after the context that computed the NaN, not the context that got the exception because it happened to be the first one to use the FPU after the call to fpu_setup. So you move the call to fpu_setup to the beginning of main(), but help it does not. That's because the offending computation happens before main, somewhere in the global object construction sequence. The order in which global objects from different translation units get constructed is left unspecified by the C++ standard. So if you'll kindly excuse my phrasing – where should we shove the call to fpu_setup?

If you have enough confidence in your understanding of the things going on (as opposed to entering hair-pulling mode), what you start looking for is the REAL entry point. C++ is free to suck and execute parts of your program in "undefined" (random) order, but a computer still executes instructions in a defined order, and whatever that order is, some instructions ought to come first. Since main() isn't the real entry point in the sense that stuff happens before main, there ought to be another function which does come first.

One thing that could work is to add a global object to each C++ translation unit, and have its constructor call fpu_setup(); one of those calls ought to come before the offending global constructor – assuming that global objects defined in the same translation unit will be constructed one after another, in the order of their definitions (they will; see correction 1 at the end of the post). However, this can get gnarly for systems with a non-trivial build process and/or decomposition into shared libraries. Another problem is that the toolchain can "optimize away" (throw away together with the side effects, actually) constructors of global objects which aren't "used" (mentioned by name); the classic case is a linker dropping an object file from a static library because nothing references its symbols. You can work around that by generating code "using" all the dummy objects from all the translation units and calling that "using" code from, say, main. Good luck with that.
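
For a single translation unit, the dummy-object trick can be sketched like this. It's only an illustration under assumptions: FpuSetupAtStartup, LaterGlobal, fpu_traps_enabled and setup_ran_first are hypothetical names, fpu_setup is the function from the beginning of the post (written here with the header's fpu_control_t type), and the sketch leans on objects within one translation unit being constructed in definition order.

```cpp
#include <cassert>
#include <fpu_control.h>

// fpu_setup from the beginning of the post.
void fpu_setup() {
  fpu_control_t cw;
  _FPU_GETCW(cw);
  cw &= ~_FPU_MASK_ZM;  // trap on divide by zero
  cw &= ~_FPU_MASK_IM;  // trap on invalid operation
  _FPU_SETCW(cw);
}

// Hypothetical helper: are both exceptions currently unmasked?
bool fpu_traps_enabled() {
  fpu_control_t cw;
  _FPU_GETCW(cw);
  return (cw & (_FPU_MASK_ZM | _FPU_MASK_IM)) == 0;
}

namespace {

// Hypothetical dummy type whose only job is to run fpu_setup
// during global construction.
struct FpuSetupAtStartup {
  FpuSetupAtStartup() { fpu_setup(); }
};

// Define this before any other global object in the translation unit.
FpuSetupAtStartup fpu_setup_at_startup;

// Stand-in for a "real" global object defined later in the same file.
struct LaterGlobal {
  bool saw_traps_enabled;
  LaterGlobal() : saw_traps_enabled(fpu_traps_enabled()) {}
};

LaterGlobal later_global;

}  // namespace

// True if the dummy object's constructor ran before the later global's.
bool setup_ran_first() {
  return later_global.saw_traps_enabled;
}
```

You'd have to repeat the dummy object at the top of every translation unit (say, via a header that each file includes first), which is where the build-system pain begins.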

The way I find much easier is to not try to solve this "portably" by working against the semantics prescribed by the C++ standard, but instead rely on the actual implementation, which usually has a defined entry point, and a bunch of functions known to be called by the entry point before main. For example, the GNU libc uses a function called __libc_start_main, which is eventually called by the code at _start (the "true" entry point containing the first executed instruction, AFAIK; I suck at GNU/Linux and only know what was enough to get by until now.) In general, running `objdump -T <program> | grep start` (which looks for symbols from shared libraries – "nm <program>" will miss those) is likely to turn up some interesting function. In these situations, some people prefer to find out from the documentation, others prefer to crawl under a table and die of depression; the grepping individuals of my sort are somewhere in between.

Now, instead of building (correctly configure-ing and make-ing) our own version of libc with __libc_start_main calling the dreaded fpu_setup, we can use $LD_PRELOAD – an env var telling the loader to load our library first. If we trick the loader into loading a shared library containing the symbol __libc_start_main, it will override libc's function with the same name. (I'm not very good at dynamic loading, but the sad fact is that it's totally broken, under both Windows and Unix, in the simple sense that where a static linker would give you a function redefinition error, the dynamic loader will pick a random function of the two sharing a name, or it will call one of them from some contexts and the other one from other contexts, etc. But if you ever played with dynamic loading, you already know that, so enough with that.)

Here's a __libc_start_main function calling fpu_setup and then the actual libc's __libc_start_main:

#include <dlfcn.h>

//fpu_setup from above goes into the same file (or gets declared like this)
void fpu_setup();

//the type of glibc's __libc_start_main
typedef int (*fcn)(int (*main) (int, char * *, char * *),
                   int argc,
                   char * * ubp_av,
                   void (*init) (void),
                   void (*fini) (void),
                   void (*rtld_fini) (void),
                   void (* stack_end));
int __libc_start_main(int (*main) (int, char * *, char * *),
                      int argc,
                      char * * ubp_av,
                      void (*init) (void),
                      void (*fini) (void),
                      void (*rtld_fini) (void),
                      void (* stack_end))
{
  fpu_setup();
  //the path is specific to my box; check handle for NULL in real code
  void* handle = dlopen("/lib/libc.so.6", RTLD_LAZY | RTLD_GLOBAL);
  fcn start = (fcn)dlsym(handle, "__libc_start_main");
  //hand over to the real thing, passing its result back to the caller
  return (*start)(main, argc, ubp_av, init, fini, rtld_fini, stack_end);
}

Pretty, isn't it? Most of the characters are spent on spelling the arguments of this monstrosity – not really interesting since we simply propagate whatever args turned up by grepping/googling for "__libc_start_main" to the "real" libc's __libc_start_main. dlopen and dlsym give us access to that real __libc_start_main, and /lib/libc.so.6 is where my Linux box keeps its libc (I found out using `ldd <program> | grep libc`).

If you save this to a fplib.c file, you can use it thusly:

gcc -shared -fPIC -o fplib.so fplib.c -ldl
env LD_PRELOAD=./fplib.so <program>

And now your program should finally dump core at the point in the global construction sequence where NaN is computed.

This approach has the nice side-effect of enabling you to "instrument" unsuspecting programs without recompiling them, so that they run with a reconfigured FPU (and crash if they compute NaNs, unless of course they explicitly configure the FPU themselves instead of relying on what they get from the system). But there are niftier applications of dynamic preloading, such as valgrind on Linux and .NET on Windows (BTW, I don't know how to trick Windows into preloading, just that you can). What I wanted to illustrate wasn't how great preloading is, but the extent to which C++, the language forcing you to sink that low just to execute something at the beginning of your program, SUCKS.

Barf.

Corrections - thanks to the respective commenters for these:

1. Section 3.6.2/1 of the ISO C++ standard states that “dynamically initialized [objects] shall be initialized in the order in which their definition appears in the translation unit”. So at least you have that out of your way if you want to deal with the problem at the source code level.

2. Instead of hard-coding the path to libc.so, you can pass RTLD_NEXT to dlsym in place of the handle returned by dlopen.
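
Correction 2 could look roughly like the sketch below. It's an illustration under assumptions rather than a drop-in replacement: find_next_start_main and start_main_fn are made-up names, and RTLD_NEXT is a GNU extension, so glibc's <dlfcn.h> only declares it when _GNU_SOURCE is defined (g++ normally defines it by default; in a C file you'd define it yourself before the first #include).

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for RTLD_NEXT; g++ normally defines it already
#endif
#include <cassert>
#include <dlfcn.h>

// Hypothetical alias for the type of __libc_start_main, with the same
// parameter list as the interposer in the post.
typedef int (*start_main_fn)(int (*main)(int, char **, char **),
                             int argc,
                             char **ubp_av,
                             void (*init)(void),
                             void (*fini)(void),
                             void (*rtld_fini)(void),
                             void *stack_end);

// Hypothetical helper: find the next definition of __libc_start_main in the
// loader's search order, after the object containing this code. Called from
// a preloaded library, that should be libc's; no path and no dlopen needed.
start_main_fn find_next_start_main() {
  void *sym = dlsym(RTLD_NEXT, "__libc_start_main");
  // Fail loudly here rather than jump through a null pointer later.
  assert(sym != nullptr);
  return reinterpret_cast<start_main_fn>(sym);
}
```

Inside the interposer, the pointer returned by find_next_start_main() takes the place of the one obtained from the dlopen/dlsym pair. If you build the interposer itself as C++, declare __libc_start_main as extern "C" so that its name isn't mangled, and with older glibc versions link with -ldl.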