Getting the call stack without a frame pointer

October 9th, 2009

Everything I know about getting the current call stack of C or C++ programs, including ones compiled with -fomit-frame-pointer or an equivalent, with or without a debugger. Hardly entertaining except for those specifically wishing to lay their hands on call stacks.

We'll start with trivia people sharing my unhealthy interest in call stacks probably know. There are two contexts where you might want to get a call stack:

Look at the call stack in a debugger to see what the program is doing.
Get a representation of the call stack inside the program itself. For example, a memory profiler might want to attach the call stack identifying the context where the allocation is made to each memory block to see who allocates the most.

Sometimes (2) can be implemented using (1) by running a debugger programmatically and asking it for the current call stack, and sometimes it can't (too much communication overhead, or no debugger available – for example, a program stripped of debugging information and running on a standalone embedded board).

The straightforward way of getting the current call stack, used both by debuggers and by programs curious about their own stacks, relies on the frame pointer. The idea is that a machine register is reserved for keeping a pointer into the stack, called the frame pointer, and every function is compiled to do the following in its prologue:

Push the return address to the stack
Push the frame pointer to the stack
Save the address of the resulting two-pointer structure to the frame pointer register

This creates a linked list on the stack, with every node keeping a return address – this list is the call stack (and this is why debuggers show the points of return from function calls and not the points of call, a bit annoyingly for function calls spanning many source code lines – in fact we get the return stack and not the call stack). Here's how you get this list from within a program:

struct stack_frame {
  struct stack_frame* next;
  void* ret;
};
int get_call_stack(void** retaddrs, int max_size) {
  /* x86/gcc-specific: this tells gcc that the fp
     variable should be an alias to the %ebp register
     which keeps the frame pointer */
  register struct stack_frame* fp asm("ebp");
  /* the rest just walks through the linked list */
  struct stack_frame* frame = fp;
  int i = 0;
  while(frame) {
    if(i < max_size) {
      retaddrs[i++] = frame->ret;
    }
    frame = frame->next;
  }
  return i;
}

The code for getting the list head pointer depends on the platform, the list structure itself is common to many machines and compilers. The return addresses may be converted to function names and source line numbers with addr2line -f or similar. When the program can't access its own debug info during its execution as in the case of embedded devices, the translation from addresses to names will be a separate offline step.

The whole frame pointer business is fairly widely documented, with a bunch of source code available centered around getting the call stack that way. I think the GNU backtrace function and the Windows StackWalk64 function, which these days are probably a better alternative than code snippets like the one above, also use this linked list when available.

Now, what happens if the compiler is told to avoid generating the code maintaining the frame pointer (-fomit-frame-pointer for gcc, /Oy for MSVC), or not told to override its default behavior of not generating such code (-ga for Green Hills C++)?

Admittedly it doesn't appear to be such a good idea to make debugging that much harder in order to save a few instructions. However, there are reasons to do so. For example, one common consequence of Software Design is lots of functions doing little or nothing except for delegating their work to another function. Without frame pointer maintenance, such a call is just a jump instruction – your callee function will return directly to the address saved by your caller. With frame pointers, you need a whole prologue and epilogue here. Anyway, we won't discuss the benefits of frame pointer omission since measuring the overhead for your particular code will be more reliable than such a discussion anyway.

Compiling without frame pointers hits code trying to obtain its own context harder than it hits debuggers, because debuggers don't really need frame pointers and only (sometimes) rely on them for simplicity of implementation. Given a return address, a debugger can tell:

Which function it belongs to (unlike the program itself, a debugger is necessarily supposed to have access to the symbol table)
Where that function keeps the return address (the compiler knows that, so it can tell the debugger)
The amount by which the function decrements the stack pointer, assuming the stack grows downwards – again something the compiler knows. Now that the debugger knows the previous return address and the previous stack pointer, it can go back to step 1.

So a debugger can do just fine without frame pointers as long as the compiler gives it enough information about the layout of the stack. I've been debugging without frame pointers for a long time with the Green Hills MULTI debugger which uses a proprietary debug info format. More recently the DWARF format, gcc and gdb seem to have caught up and now programs compiled with -fasynchronous-unwind-tables -fomit-frame-pointer are debuggable with gdb. The information generated by -fasynchronous-unwind-tables seems to go to a separate ELF section called .eh_frame_hdr.

Not only will gdb use .eh_frame_hdr, but the GNU backtrace function appears to be using it as well (it doesn't work under -fomit-frame-pointer but apparently does work when you add -fasynchronous-unwind-tables – although the docs explicitly say: "frame pointer elimination will stop backtrace from interpreting the stack contents correctly"). Nor is this section stripped from the program – it's not implemented as a "normal" debug information section but as an allocated data section, so it's always available to a program (in particular, to the backtrace function).

So under gcc, all call stack problems seem to be solved – unless you trust the docs (!?), or unless some code isn't compiled with the right flags because of not being up to date or someone being too greedy to allocate space for a debug info section. Outside gcc, or more precisely DWARF, I don't think a stripped program can access such debug info.

Is there a way to get a call stack without a frame pointer, without a debugger and without debug info?

For years I was sure that the answer was "no", hence some things will only work under a separate build mode – just like the release build but with frame pointers. Then one time the Green Hills debugger failed to list the call stack for some reason as it sometimes does, but this time we really wanted to decipher it. And we figured that we can in fact do the same thing the debugger does, except we'll understand from the assembly code what the debugger usually understood from debug information.

Specifically, to understand where the return address is kept and by what amount the stack pointer is decremented, you need to find the instructions doing (or undoing) these things in the prologue (or the epilogue) of the function. This worked, but due to either inertia or stupidity it took me months to realize that you can write code doing this. Anyway, here's how it works on a 32b MIPS processor under the Green Hills compiler. The prologue code of a function will contain instructions like these:

main+0: 27bdffe8 addiu sp, sp, -0x18
main+4: afbf0014 sw    r31, 0x14(sp)

The add immediate instruction decrements the stack pointer, and the store word instruction saves the return address from the register where it's saved by the caller to some place on the stack. The high 16 bits of these instructions don't depend on the function, encoding the "addui sp, sp" and the "sw, r31 ...(sp)" parts. The low 16 bits encode a signed offset. So we can obtain the call stack from our code disassembling it thusly:

/* get previous stack pointer and return address
   given the current ones */
int get_prev_sp_ra(void** prev_sp, void** prev_ra,
                   void* sp, void* ra) {
  unsigned* wra = (unsigned*)ra;
  int spofft;
  /* scan towards the beginning of the function -
     addui sp,sp,spofft should be the first command */
  while((*wra >> 16) != 0x27bd) {
    /* test for "scanned too much" elided */
    wra--;
  }
  spofft = ((int)*wra << 16) >> 16; /* sign-extend */
  *prev_sp = (char*)sp - spofft;
  /* now scan forward for sw r31,raofft(sp) */
  while(wra < (unsigned*)ra) {
    if((*wra >> 16) == 0xafbf) {
      int raofft = ((int)*wra << 16) >> 16; /* sign */
      *prev_ra = *(void**)((char*)sp + raofft);
      return 1;
    }
    wra++;
  }
  return 0; /* failed to find where ra is saved */
}

The call stack will then be produced by the following loop:

int get_call_stack_no_fp(void** retaddrs, int max_size) {
  void* sp = get_sp(); /* stack pointer register */
  void* ra = get_ra(); /* return address register */
  /* adjust sp by the offset by which this function
     has just decremented it */
  int* funcbase = (int*)(int)&get_call_stack_no_fp;
  /* funcbase points to an addiu sp,sp,spofft command */
  int spofft = (*funcbase << 16) >> 16; /* 16 LSBs */
  int i=0;
  sp = (char*)sp-spofft;
  do {
    if(i < max_size) {
      retaddrs[i++] = ra;
    }
  }
  while(get_prev_sp_ra(&sp,&ra,sp,ra));
  return i; /* stack size */
}

get_sp and get_ra access registers so they must be assembly macros, which in the case of Green Hills can be spelled like this:

asm void* get_ra() {
  move $v0, $ra
}
asm void* get_sp() {
  move $v0, $sp
}

Under MIPS32 and Green Hills, this code seems to be giving decent call stacks except for the inevitable omission of function calls done without saving the return address; the most common case of those – simple delegating functions – was already mentioned above. If f calls g which does (almost) nothing except for calling h, and g doesn't bother to save the return address to the stack, having h return directly to f, then the call stack will contain f and h but not g. Not much of a problem and sometimes even an advantage as far as I'm concerned, since g is rarely interesting. Also, you can get this problem irregardless of the lack of a frame pointer – for example, gcc -O3 on x86 will not maintain the call stack accurately even without -fomit-frame-pointer, generating the following ridiculous code:

pushl   %ebp
movl    %esp, %ebp
popl    %ebp
; so what was the point of setting up and
; immediately destroying a stack frame?
jmp     print_trace

Now, although code like get_prev_sp_ra looking for 0x27bd admittedly is a heuristic relying on undocumented platform-specific behavior, it looks like a passable way of getting a call stack on a RISC machine like MIPS, ARM or PowerPC. What about the x86 though? Effectively we have our code partially disassembling itself here, which is not nearly as easy with the x86 (in particular, I don't think there's a way to scan backwards because of the variable encoding length; although we could look at epilogues instead of prologues just as well).

Instead of dragging in a disassembler, we can use an external program, such as, well, a debugger. This obviously defeats the purpose of being able to get the call stack without a debugger. But this purpose isn't very interesting on the x86 in the first place because there you're rarely stuck in situations where a program can't run a debugger.

The only point of the disassembling business on the x86 thus remains to deal with programs compiled without a frame pointer and without debug information making it possible to get the call stack nonetheless. I don't know if anybody has such a problem these days, now that gcc has -fasynchronous-unwind-tables – perhaps someone uses compilers which can't do this or binaries compiled without this, and perhaps the problem is extinct on the x86. For what it's worth, here's a script getting the call stack from a core file without relying on gdb's bt command but relying on its disassemble command. Usage: python bt <program> <core>. No warranty, ...or FITNESS FOR A PARTICULAR PURPOSE.

And this is all I know about getting the call stack in C or C++, something users of other languages can do, in the (unlikely) absence of a library function doing just that, simply by throwing an exception, immediately catching it and using its getStackTrace method or some such.