Getting the call stack without a frame pointer

October 9th, 2009

Everything I know about getting the current call stack of C or C++ programs, including ones compiled with -fomit-frame-pointer or an equivalent, with or without a debugger. Hardly entertaining except for those specifically wishing to lay their hands on call stacks.

We'll start with trivia people sharing my unhealthy interest in call stacks probably know. There are two contexts where you might want to get a call stack:

Look at the call stack in a debugger to see what the program is doing.
Get a representation of the call stack inside the program itself. For example, a memory profiler might want to attach the call stack identifying the context where the allocation is made to each memory block to see who allocates the most.

Sometimes (2) can be implemented using (1) by running a debugger programmatically and asking it for the current call stack, and sometimes it can't (too much communication overhead, or no debugger available – for example, a program stripped of debugging information and running on a standalone embedded board).

The straightforward way of getting the current call stack, used both by debuggers and by programs curious about their own stacks, relies on the frame pointer. The idea is that a machine register is reserved for keeping a pointer into the stack, called the frame pointer, and every function is compiled to do the following in its prologue:

Push the return address to the stack
Push the frame pointer to the stack
Save the address of the resulting two-pointer structure to the frame pointer register

This creates a linked list on the stack, with every node keeping a return address – this list is the call stack (and this is why debuggers show the points of return from function calls and not the points of call, a bit annoyingly for function calls spanning many source code lines – in fact we get the return stack and not the call stack). Here's how you get this list from within a program:

struct stack_frame {
  struct stack_frame* next;
  void* ret;
};
int get_call_stack(void** retaddrs, int max_size) {
  /* x86/gcc-specific: this tells gcc that the fp
     variable should be an alias to the %ebp register
     which keeps the frame pointer */
  register struct stack_frame* fp asm("ebp");
  /* the rest just walks through the linked list */
  struct stack_frame* frame = fp;
  int i = 0;
  while(frame) {
    if(i < max_size) {
      retaddrs[i++] = frame->ret;
    }
    frame = frame->next;
  }
  return i;
}

The code for getting the list head pointer depends on the platform, the list structure itself is common to many machines and compilers. The return addresses may be converted to function names and source line numbers with addr2line -f or similar. When the program can't access its own debug info during its execution as in the case of embedded devices, the translation from addresses to names will be a separate offline step.

The whole frame pointer business is fairly widely documented, with a bunch of source code available centered around getting the call stack that way. I think the GNU backtrace function and the Windows StackWalk64 function, which these days are probably a better alternative than code snippets like the one above, also use this linked list when available.

Now, what happens if the compiler is told to avoid generating the code maintaining the frame pointer (-fomit-frame-pointer for gcc, /Oy for MSVC), or not told to override its default behavior of not generating such code (-ga for Green Hills C++)?

Admittedly it doesn't appear to be such a good idea to make debugging that much harder in order to save a few instructions. However, there are reasons to do so. For example, one common consequence of Software Design is lots of functions doing little or nothing except for delegating their work to another function. Without frame pointer maintenance, such a call is just a jump instruction – your callee function will return directly to the address saved by your caller. With frame pointers, you need a whole prologue and epilogue here. Anyway, we won't discuss the benefits of frame pointer omission since measuring the overhead for your particular code will be more reliable than such a discussion anyway.

Compiling without frame pointers hits code trying to obtain its own context harder than it hits debuggers, because debuggers don't really need frame pointers and only (sometimes) rely on them for simplicity of implementation. Given a return address, a debugger can tell:

Which function it belongs to (unlike the program itself, a debugger is necessarily supposed to have access to the symbol table)
Where that function keeps the return address (the compiler knows that, so it can tell the debugger)
The amount by which the function decrements the stack pointer, assuming the stack grows downwards – again something the compiler knows. Now that the debugger knows the previous return address and the previous stack pointer, it can go back to step 1.

So a debugger can do just fine without frame pointers as long as the compiler gives it enough information about the layout of the stack. I've been debugging without frame pointers for a long time with the Green Hills MULTI debugger which uses a proprietary debug info format. More recently the DWARF format, gcc and gdb seem to have caught up and now programs compiled with -fasynchronous-unwind-tables -fomit-frame-pointer are debuggable with gdb. The information generated by -fasynchronous-unwind-tables seems to go to a separate ELF section called .eh_frame_hdr.

Not only will gdb use .eh_frame_hdr, but the GNU backtrace function appears to be using it as well (it doesn't work under -fomit-frame-pointer but apparently does work when you add -fasynchronous-unwind-tables – although the docs explicitly say: "frame pointer elimination will stop backtrace from interpreting the stack contents correctly"). Nor is this section stripped from the program – it's not implemented as a "normal" debug information section but as an allocated data section, so it's always available to a program (in particular, to the backtrace function).

So under gcc, all call stack problems seem to be solved – unless you trust the docs (!?), or unless some code isn't compiled with the right flags because of not being up to date or someone being too greedy to allocate space for a debug info section. Outside gcc, or more precisely DWARF, I don't think a stripped program can access such debug info.

Is there a way to get a call stack without a frame pointer, without a debugger and without debug info?

For years I was sure that the answer was "no", hence some things will only work under a separate build mode – just like the release build but with frame pointers. Then one time the Green Hills debugger failed to list the call stack for some reason as it sometimes does, but this time we really wanted to decipher it. And we figured that we can in fact do the same thing the debugger does, except we'll understand from the assembly code what the debugger usually understood from debug information.

Specifically, to understand where the return address is kept and by what amount the stack pointer is decremented, you need to find the instructions doing (or undoing) these things in the prologue (or the epilogue) of the function. This worked, but due to either inertia or stupidity it took me months to realize that you can write code doing this. Anyway, here's how it works on a 32b MIPS processor under the Green Hills compiler. The prologue code of a function will contain instructions like these:

main+0: 27bdffe8 addiu sp, sp, -0x18
main+4: afbf0014 sw    r31, 0x14(sp)

The add immediate instruction decrements the stack pointer, and the store word instruction saves the return address from the register where it's saved by the caller to some place on the stack. The high 16 bits of these instructions don't depend on the function, encoding the "addui sp, sp" and the "sw, r31 ...(sp)" parts. The low 16 bits encode a signed offset. So we can obtain the call stack from our code disassembling it thusly:

/* get previous stack pointer and return address
   given the current ones */
int get_prev_sp_ra(void** prev_sp, void** prev_ra,
                   void* sp, void* ra) {
  unsigned* wra = (unsigned*)ra;
  int spofft;
  /* scan towards the beginning of the function -
     addui sp,sp,spofft should be the first command */
  while((*wra >> 16) != 0x27bd) {
    /* test for "scanned too much" elided */
    wra--;
  }
  spofft = ((int)*wra << 16) >> 16; /* sign-extend */
  *prev_sp = (char*)sp - spofft;
  /* now scan forward for sw r31,raofft(sp) */
  while(wra < (unsigned*)ra) {
    if((*wra >> 16) == 0xafbf) {
      int raofft = ((int)*wra << 16) >> 16; /* sign */
      *prev_ra = *(void**)((char*)sp + raofft);
      return 1;
    }
    wra++;
  }
  return 0; /* failed to find where ra is saved */
}

The call stack will then be produced by the following loop:

int get_call_stack_no_fp(void** retaddrs, int max_size) {
  void* sp = get_sp(); /* stack pointer register */
  void* ra = get_ra(); /* return address register */
  /* adjust sp by the offset by which this function
     has just decremented it */
  int* funcbase = (int*)(int)&get_call_stack_no_fp;
  /* funcbase points to an addiu sp,sp,spofft command */
  int spofft = (*funcbase << 16) >> 16; /* 16 LSBs */
  int i=0;
  sp = (char*)sp-spofft;
  do {
    if(i < max_size) {
      retaddrs[i++] = ra;
    }
  }
  while(get_prev_sp_ra(&sp,&ra,sp,ra));
  return i; /* stack size */
}

get_sp and get_ra access registers so they must be assembly macros, which in the case of Green Hills can be spelled like this:

asm void* get_ra() {
  move $v0, $ra
}
asm void* get_sp() {
  move $v0, $sp
}

Under MIPS32 and Green Hills, this code seems to be giving decent call stacks except for the inevitable omission of function calls done without saving the return address; the most common case of those – simple delegating functions – was already mentioned above. If f calls g which does (almost) nothing except for calling h, and g doesn't bother to save the return address to the stack, having h return directly to f, then the call stack will contain f and h but not g. Not much of a problem and sometimes even an advantage as far as I'm concerned, since g is rarely interesting. Also, you can get this problem irregardless of the lack of a frame pointer – for example, gcc -O3 on x86 will not maintain the call stack accurately even without -fomit-frame-pointer, generating the following ridiculous code:

pushl   %ebp
movl    %esp, %ebp
popl    %ebp
; so what was the point of setting up and
; immediately destroying a stack frame?
jmp     print_trace

Now, although code like get_prev_sp_ra looking for 0x27bd admittedly is a heuristic relying on undocumented platform-specific behavior, it looks like a passable way of getting a call stack on a RISC machine like MIPS, ARM or PowerPC. What about the x86 though? Effectively we have our code partially disassembling itself here, which is not nearly as easy with the x86 (in particular, I don't think there's a way to scan backwards because of the variable encoding length; although we could look at epilogues instead of prologues just as well).

Instead of dragging in a disassembler, we can use an external program, such as, well, a debugger. This obviously defeats the purpose of being able to get the call stack without a debugger. But this purpose isn't very interesting on the x86 in the first place because there you're rarely stuck in situations where a program can't run a debugger.

The only point of the disassembling business on the x86 thus remains to deal with programs compiled without a frame pointer and without debug information making it possible to get the call stack nonetheless. I don't know if anybody has such a problem these days, now that gcc has -fasynchronous-unwind-tables – perhaps someone uses compilers which can't do this or binaries compiled without this, and perhaps the problem is extinct on the x86. For what it's worth, here's a script getting the call stack from a core file without relying on gdb's bt command but relying on its disassemble command. Usage: python bt <program> <core>. No warranty, ...or FITNESS FOR A PARTICULAR PURPOSE.

And this is all I know about getting the call stack in C or C++, something users of other languages can do, in the (unlikely) absence of a library function doing just that, simply by throwing an exception, immediately catching it and using its getStackTrace method or some such.

1. WayneOct 12, 2009

Just a note that x86_64 is more interesting than x86 as the ABI allows them to omit the frame pointer from code and requires a special DWARF section (perhaps the same one you mention.) to allow the debugger to unwind code.

That means your first loop that walks ebp (actually rbp) won't be reliable on x86_64 even if -fomit-frame-pointer is not passed to the compiler.

2. AlexOct 20, 2009

Well, just use Google Breakpad for this stuff.

3. DanOct 22, 2009

The last place I worked, we were using gcc/gdb on MIPS and wrote functions to do this, too. They looked very similar to what you've posted here, but were a little more complicated because we found that some functions made more than one adjustment to the stack frame.

Anyway, we had to create our own mini coredumps for this embedded system because the OS (Nucleus) didn't do it for us. These stack back-tracing functions were the only tool we had for generating a stack trace for the core dumps. It worked surprisingly well. I don't recall ever getting a core dump that didn't include a valid stack trace (save the rare ones that had severe stack corruption... those were usually bad days for me).

4. Yossi KreininOct 23, 2009

@Dan: mini core dumps as in subsets of the program & chip state as opposed to complete snapshots? Painful stuff, I hate to decide what goes in and what stays out – it's a good thing to insist that an embedded system is always hooked to something capable of getting a full core dump during development time.

The thing I hate the most about stack corruption is that if the return address is smashed and you jump to nowhere, you don't know where you jumped from – at least MIPS is careful to first set $pc to $ra, then produce an exception at that address upon jr $ra into the chasm.

5. Kragen Javier SitakerNov 8, 2009

Thanks, Yossi! I haven't laughed out loud at a segment of assembly code in a while.

It would be convenient for debugging (in the stack corruption case) if ret exchanged $pc and $ra instead of just copying one to the other. It's not like you're going to return to that same address again, are you?

6. Yossi KreininNov 8, 2009

I don't think x86, which has a ret instruction, has a return address register – ret pops an address from the stack symmetrically to call which pushes one onto the stack, isn't it so?

On MIPS and AFAIK most RISC machines, there is no special ret instruction – there's jr (jump register) or similar; it's the same instruction used for virtual function calls so I guess clobbering the register in the general case would be wrong as it would suck through a loop calling a virtual function, for example, and special-casing $ra under the assumption that it's really used for a return address isn't The RISC Way.

Which of course does not prevent them from stuffing $pc where the last executed jr instruction resided to some kind of register – I'd happily settle for another addition to the zillion user-inaccessible coprocessor registers already available. The reason they don't do it IMO is that Unix/C, and consequently RISC which is the Unix/C back-end of choice, couldn't care less about debugging – they're proud when they correctly record the point where an exception occurred so you can at least meaningfully restart from there.

7. Kragen Javier SitakerNov 11, 2009

You're right about x86, and your MIPS comment is also correct for SPARC, ARM, and HP-PA, but not PowerPC, which has a special "link register" LR with magic instructions to access it, blech. But since it's generally used for either function return or virtual function calls, and I doubt the ABI makes it "callee"-saved in either of those cases, changing it from a mov to a swap wouldn't pose much risc to efficiency. It would clearly break programs, though.

(Actually the only case I can think of where it would matter would be when you're implementing direct-threaded Forth...)

Another extra register would be a good idea, but of course to work under an OS, the OS would have to know to save the value on every context-switch, just in case you might need to debug the program. If we can have that, can we also have a ring buffer of the last 64 jumps the program executed? Because that would be really handy too...

8. DanNov 23, 2009

@Yossi: Mini coredumps are sometimes the only way to get them from customer equipment, when there is limited flash space (not enough to save a real core file). It also happens that on these systems, the only reliable way to get data out of the box in the face of a core dump was through a 38400kbps serial port. It would have taken up to 31 hours to dump core on one of these, if I did my math right (memory maxed at 512MB, all one monolithic image, single address space, yaddah yaddah). We got *really* creative at coming up with ways to include the most amount of useful information in the smallest amount of data.

9. Yossi KreininNov 23, 2009

@Dan: sure, been there. Just saying that sometimes beating up a PM can bring in more bandwidth and end up being the cheaper solution. Sometimes.

10. GruntledEmbeddedDesignerFeb 27, 2012

Hi, I really like your article. I'm using it to get some stack dumps in the development phase of our product. I have found that sometimes the return address register in the MIPS 24K is not 32bit aligned. And this causes your example code to throw a CPU exception when dereferencing a non aligned pointer. Just a heads up.
thanks for the effort to document this stuff. Its extremely useful.

11. Yossi KreininFeb 27, 2012

Seriously? I'd think the 24K couldn't return (jr $ra) to a misaligned address. Is it a real return address or something located where the code expects a return address that actually isn't?

12. GruntledEmbeddedDesignerFeb 28, 2012

Seriously looks like that anyway. The stack dump example starts to break down when compiler optimization level is turned up AND the code is compiled with the MIPS16 option.

I'm getting return addresses like the following: 0x9e032d1f

And these optimized functions don't do a
addiu sp,sp,spofft

So now i'm trying to figure out how to get the call stack for functions being jumped to where the calling function uses the jalx instruction.

I'm probably in over my head but this is for extra credit in my down time anyway.

13. Yossi KreininFeb 28, 2012

MIPS16 I don't know anything about, although a 0xf sounds too misaligned even for that; anyway, unfortunately it's not surprising that my code above can't handle your case...

14. GruntledEmbeddedDesignerFeb 29, 2012

yes, i agree 0xF sounds like a very strange offset??? but taking a look at the .map file, all the base address of the functions compiled under MIPS16 are on byte aligned addresses.

15. ProtozorqJun 27, 2012

Nice article!

I was hoping to find a method to receive the frame pointer in a function when omit-frame-pointer is enabled. I've found out, that GCC provides a builtin function to receive the frame pointer:

void * __builtin_frame_address (unsigned int level);

Unfortunately, usage of that function seems to turn of omit-frame-pointer optimization for the given function.

I don't even know, since when this builtin function exists .. so, it might not be a very portable method.

16. Yossi KreininJun 27, 2012

I'd guess that if the callers' code is compiled with frame pointer omission, then you can't get the callstack even if you (the callee) do set up the frame pointer and have an intrinsic to get that address.

17. AaronSep 26, 2013

This is pure gold! I am working on MIPS as well and I had often found myself wondering about this same topic. Thank you so much.

18. Yossi KreininSep 26, 2013

Thanks, and I hope it actually works for you...

19. MarsOct 17, 2013

Thanks, it does help!
But in my MIPS compiler this code will retrieve wrong spofft in the line below.
/* funcbase points to an addiu sp,sp,spofft command */
int spofft = (*funcbase <> 16; /* 16 LSBs */

It's because the first few asm lines deal with $gp first like below:
00400b04 :
400b04: 3c1c0fc0 lui gp,0xfc0
400b08: 279c754c addiu gp,gp,30028
400b0c: 0399e021 addu gp,gp,t9
400b10: 27bdffe0 addiu sp,sp,-32

so I modified your code a little to work around it:
wra = (unsigned*)funcbase;
/* scan from the beginning of the function -
addui sp,sp,spofft may not be the first command */
while((*wra >> 16) != 0x27bd) {
/* test for "scanned too much" elided */
wra++;
}
spofft = ((int)*wra <> 16; /* sign-extend */

Just to share this.

20. Yossi KreininOct 19, 2013

Interesting; does this $gp fiddling happen at every function? Something's off; $gp is usually initialized to a constant (specifically the middle of the sda, "small data area") and then left alone.

21. MarsOct 21, 2013

Yes, it's in every function. I don't why this happened. I'm using uclibc, and the compile flags is "mipsel-linux-gcc -g -O -mtune=4kc -mips32 "

Getting the call stack without a frame pointer

Post a comment