The C++ Sucks Series: petrifying functions

September 4th, 2009

Your IT department intends to upgrade your OS and gives a developer a new image to play with. The developer is generally satisfied, except there's this one program that mysteriously dumps core. Someone thoughtfully blames differences in system libraries.

Alternative prelude: you have this program and you're working on a new version. Being generally satisfied with the updates, you send the code overseas. They build it and it mysteriously dumps core. Someone thoughtfully blames differences in the compiler version.

Whatever the prelude, you open the core dump with `gdb app core` and gdb says:

#0  0x080484c9 in main (argc=Cannot access memory at address 0xbf3e7a8c) at main.cpp:4
4    int main(int argc, char** argv)
(gdb)

Check out the garbage near "argc=" – if it ain't printing garbage, it ain't a C++ debugger. Anyway, it looks like the program didn't even enter main. An alert C++ hater will immediately suspect that the flying circus happening in C++ before main could be at fault, but in this case it isn't. In fact, a program can be similarly petrified by the prospect of entering any function, not necessarily main. It's main where it crashes in our example only because the example is small; here's the source code:

#include <stdio.h>
#include "app.h"

int main(int argc, char** argv)
{
  if(argc != 2) {
    printf("please specify a profile\n");
    return 1;
  }
  const char* profile = argv[1];
  Application app(profile);
  app.mainLoop();
}

On your machine, you run the program without any arguments and sure enough, it says "please specify a profile"; on this other machine, it just dumps core. Hmmm.

Now, I won't argue that C++ isn't a high-level object-oriented programming language since every book on the subject is careful to point out the opposite. Instead I'll argue that you can't get a first-rate user experience with this high-level object-oriented programming language if you don't also know assembly. And with the first-rate experience being the living hell that it is, few would willingly opt for a second-rate option.

For example, nothing at the source code level can explain how a program is so shocked by the necessity of running main that it dumps a core in its pants. On the other hand, here's what we get at the assembly level:

(gdb) p $pc
$1 = (void (*)(void)) 0x80484c9 <main+20>
(gdb) disass $pc
Dump of assembler code for function main:
0x080484b5 <main+0>:    lea    0x4(%esp),%ecx
0x080484b9 <main+4>:    and    $0xfffffff0,%esp
0x080484bc <main+7>:    pushl  -0x4(%ecx)
0x080484bf <main+10>:    push   %ebp
0x080484c0 <main+11>:    mov    %esp,%ebp
0x080484c2 <main+13>:    push   %ecx
0x080484c3 <main+14>:    sub    $0xa00024,%esp
0x080484c9 <main+20>:    mov    %ecx,-0xa0001c(%ebp)
# we don't care about code past $pc -
# a screenful of assembly elided

What this says is that the offending instruction is at the address main+20. As you'd expect with a Segmentation fault or a Bus error core dump, this points to an instruction accessing memory, specifically, the stack.

BTW I don't really know x86 assembly, but I can still read it thusly: "mov" can't just mean the tame RISC "move between registers" thing because then we wouldn't crash, so one operand must spell a memory address. Without remembering the source/destination order of the GNU assembler (which AFAIK is the opposite of the usual), I can tell that it's the second operand that is the memory operand, because there's an integer constant which must mean an offset or something – and why would you need a constant to specify a register operand? Furthermore, I happen to remember that %ebp is the frame pointer register, which means that it points into the stack; however, I could also figure it out from a previous instruction at main+11, which moves %esp [ought to be the stack pointer] to %ebp (or vice versa, as you could think without knowing the GNU operand ordering – but it would still mean that %ebp points into the stack.)

Which goes to show that you can read assembly while operating from a knowledge base that is not very dense, a way of saying "without really knowing what you're doing" – try that with C++ library code; but I digress. Now, why would we fail to access the stack? Could it have to do with the fact that we apparently access it with the offset -0xa0001c, which ought to be unusually large? Let's have a look at the local variables, hoping that we can figure out the size of the stack main needs from their sizes. (Of course if the function used a Matrix class of the sort where the matrix is kept by value right there in a flat member array, looking at the named local variables mentioned in the program wouldn't be enough, since the temporaries returned by overloaded operators would also have to be taken into account; luckily this isn't the case.)

(gdb) info locals
# if it ain't printing garbage, it ain't a C++ debugger:
profile = 0xb7fd9870 "U211?WVS??207"
app = Cannot access memory at address 0xbf3e7a98

We got two local variables; at least one must be huge then. (It can be worse in real life, main functions being perhaps the worst offenders, as many people are too arrogant to start with an Application class. Instead they have an InputParser and an OutputProducer and a Processor, which they proudly use in a neat 5-line main function – why wrap that in a class, which means 2 more files in C++-land? Then they add an InputValidator, an OutputFormatConfigurator and a ProfileLoader, then less sophisticated people gradually add 20 to 100 locals for doing things right there in main, and then nobody wants to refactor the mess because of all the local variables you'd have to pass around; whereas an Application class with two hundred members, while disgusting, at least makes helper functions easy. But I digress again.)

(gdb) p sizeof profile
$2 = 4
(gdb) p sizeof app
$3 = 10485768

"10485768". The trouble with C++ debuggers is that they routinely print so much garbage due to memory corruption, debug information inadequacy and plain stupidity that their users are accustomed to automatically ignore most of their output without giving it much thought. In particular, large numbers with no apparent regularity in their digits are to a C++ programmer what "viagra" is to a spam filter: a sure clue that something was overwritten somewhere and the number shouldn't be trusted (I rarely do pair programming but I do lots of pair debugging and people explicitly shared this spam filtering heuristic with me).

However, in this case overwriting is unlikely since a sizeof is a compile time constant stored in the debug information and not in the program memory. We can see that the number will "make more sense" in hexadecimal (which is why hex is generally a good thing to look at before ignoring "garbage"):

(gdb) p /x sizeof app
$4 = 0xa00008

...Which is similar to our offset value, and confirms that we've been debugging a plain and simple stack overflow. Which would be easy to see in the case of a recursive function, or if the program crashed, say, in an attempt to access a large local array. However, in C++ it will crash near the beginning of a function, long before the offending local variable is even declared, in an attempt to push the frame pointer or some such; I think I also saw it crash in innocent-looking places further down the road, but I can't reproduce it.
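Here's about the smallest thing that reproduces the effect – a sketch, not taken from the program above; with a stack limit of, say, 8M (a common default) and no optimization, this tends to crash upon entering f, before the first printf ever runs, because the compiler reserves the whole frame up front and the first touch of the newly reserved stack faults:

#include <stdio.h>

void f()
{
  printf("entered f\n");      // typically never printed - the first stack
                              // access after the huge frame is reserved faults
  char buf[16*1024*1024];     // 16M of stack, declared after the printf
  buf[0] = 'x';
  printf("%c\n", buf[0]);
}

int main()
{
  f();
  return 0;
}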

Now we must find out which member of the Application class is the huge one, which is lots of fun when members are plentiful and deeply nested, which, with a typical Application class, they are. Some languages have reflection, which would let us traverse the member tree automatically; incidentally, most of those languages don't dump core, though. Anyway, in our case finding the problem is easy because I've made the example small.

(I also tried to make it ridiculous – do you tend to ridicule pedestrian code, including your own, sometimes as you type? Few do and the scarcity makes them very dear to me.)

class Application
{
 public:
  Application(const char* profile);
  void mainLoop();
 private:
  static const int MAX_BUF_SIZE = 1024;
  static const int MAX_PROF = 1024*10;
  const char* _profPath;
  char _parseBuf[MAX_BUF_SIZE][MAX_PROF];
  Profile* _profile;
};
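Incidentally, the numbers add up (assuming no padding, and 4-byte pointers – which is what this 32-bit build has, as the sizeof profile above showed):

sizeof(_parseBuf)    = 1024 * (1024*10) = 10485760 = 0xa00000
_profPath + _profile = 2 pointers * 4   =        8
sizeof(Application)  =                    10485768 = 0xa00008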

This shows that it's _parseBuf that's causing the problem. This also answers the objection of an alert C++ apologist that all of the above isn't special to C++ but is just as relevant to C (when faced with a usability problem, C++ apologists like to ignore it and instead concentrate on assigning blame; if a problem reproduces in C, it's not C++'s fault according to their warped value systems.) Well, while one could write equivalent C code causing a similar problem, one is unlikely to do so, because C doesn't have a private keyword – which to a first approximation does nothing, but is advertised as an "encapsulation mechanism".

In other words, an average C programmer would have a createApplication function which would malloc an Application struct and all would be well since the huge _parseBuf wouldn't land on the stack. Of course an average C++ programmer, assuming he found someone to decipher the core dump for him as opposed to giving up on the OS upgrade or the overseas code upgrade, could also allocate the Application class dynamically, which would force him to change an unknown number of lines in the client code. Or he could change _parseBuf's type to std::vector, which would force him to change an unknown number of lines in the implementation code, depending on the nesting of function calls from Application. Alternatively the average C++ programmer could change _parseBuf to be a reference, new it in the constructor(s) and delete it in the destructor, assuming he can find someone who explains to him how to declare references to 2D arrays.
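For the record, the vector route could look roughly like this – a sketch, not a drop-in patch, since every _parseBuf[i][j] in the implementation now has to become index arithmetic (the "unknown number of lines" mentioned above):

#include <vector>

class Profile;

class Application
{
 public:
  Application(const char* profile);
  void mainLoop();
 private:
  static const int MAX_BUF_SIZE = 1024;
  static const int MAX_PROF = 1024*10;
  const char* _profPath;
  std::vector<char> _parseBuf;  // was: char _parseBuf[MAX_BUF_SIZE][MAX_PROF]
  Profile* _profile;
};

// in the constructor: _parseBuf.resize((size_t)MAX_BUF_SIZE * MAX_PROF);
// in the implementation: _parseBuf[i][j] becomes _parseBuf[(size_t)i*MAX_PROF + j]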

However, suppose you don't want to change code but instead would like to make old code run on the new machine – a perfectly legitimate desire independently of the quality of the code and its source language. The way to do it under Linux/tcsh is:

unlimit stacksize

Once this is done, the program should no longer dump core. `limit stacksize` would show you the original limit, which AFAIK will differ across Linux installations and sometimes will depend on the user (say, if you ssh to someone's desktop, you can get a smaller default stacksize limit and won't be able to run the wretched program). For example, on my wubi installation (Ubuntu for technophobes like myself who have a Windows machine, want a Linux, and hate the idea of fiddling with partitions), `limit stacksize` reports a value of 8M.

Which, as we've just seen, is tiny.
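(In bash, by the way, the equivalent of the tcsh command above is `ulimit -s unlimited`.) And if you'd rather not depend on everyone's shell settings, the limit can be raised from code before the huge frames are ever entered; here's a sketch of a hypothetical launcher – not part of the program above – that bumps RLIMIT_STACK to the hard maximum and then execs the real program, assuming the hard limit is actually big enough, which on typical Linux boxes it is:

// raise the stack soft limit to the hard limit, then run the real program;
// resource limits survive the exec, so the child gets the bigger stack.
#include <sys/resource.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char** argv)
{
  if(argc < 2) {
    fprintf(stderr, "usage: %s /path/to/program [args...]\n", argv[0]);
    return 1;
  }
  rlimit rl;
  if(getrlimit(RLIMIT_STACK, &rl) == 0) {
    rl.rlim_cur = rl.rlim_max;     // soft limit up to the hard limit
    setrlimit(RLIMIT_STACK, &rl);  // if this fails, we can't do any better anyway
  }
  execv(argv[1], argv + 1);        // only returns on failure
  perror("execv");
  return 1;
}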