How to make a heap profiler

May 23rd, 2014

We'll see how to make a heap profiler. Example code for this post makes up heapprof, a working 250-line heap profiler for programs using malloc/free.

It works out of the box on Linux (tested on "real" programs like gdb and python). The main point though is being easy to port and modify to suit your needs. The code, build and test scripts are on GitHub.

Why roll your own heap profiler?

It's easy! And fun, if you're that sort of person. What, not reasons enough? OK, how about...
Your platform doesn't have one - a common case on embedded systems lacking an OS, standard APIs to output data, etc.
Your real-time program overwrites the heap after a few hours. You'd like to know which buffer overflows. Valgrind doesn't run on the device/is too slow. How does a custom heap profiler help here? Read on!
You want to present stats differently from the way your profiler does.
You only want to instrument malloc some of the time to minimize the slowdown.
...

What a heap profiler needs from the platform

You can't write a heap profiler in portable, standard C. You need a few things that most platforms have, but C doesn't specify an interface for. What you need are ways to:

Intercept calls to malloc, free, calloc and realloc
Get the current call stack at run time
Dump the contents of memory (as in core dump)
Match instruction addresses to source code lines

Given that, we can associate every allocated block with a call stack. Then we cluster allocations by call stack. Finally, we sort the call stacks by the total amount of allocated memory.

To illustrate the idea, here's a rough sketch of how heap memory normally looks like (each line is one memory word - 32/64b):

size (used by malloc/free)
user data (malloc returns this address - size is "invisible" to the user)
...
size (of the next block)
user data
...

And here's how heap memory looks like with heapprof's malloc:

size (used by the underlying malloc/free)
"HeaP" (magic string; underlying malloc gives us this address)
user size (user's request - without heapprof's overhead)
caller 0 (the address malloc returns to)
caller 1 (the address caller 0 returns to)
...
"ProF" (magic string)
user data (our malloc returns this address to the user)
...
size
"HeaP"
user size
caller 0
caller 1
...
"ProF"
user data
...

A program reading a core dump containing this heap can simply look for blocks enclosed in "HeaP"..."ProF". Thus it will find the sizes of all live blocks - and the call stacks responsible for each allocation.

While we're at it - why store metadata at the beginning of chunks and not at the end?

Most often, buffers overflow due to large positive array indexes, not negative indexes. Let's say someone's array overflows, messing up the heap and dumping core. Then if we ran under heapprof, we'll see who allocated the block right before the point of corruption. This will narrow down our search considerably.

You can tell that I'm writing from experience, can't you?.. Ah, safety-critical software in zero-safety languages... what a way to make a living. Anyway, the point is, a heap profiler includes a "heap annotator", which is a handy debugging tool in its own right. Because it's much easier to make sense of a heap with call stacks attached to blocks.

So how do we do all that - intercept malloc, save the call stack, dump the memory and match return addresses to source lines? Let's answer these one by one.

Intercepting malloc and friends

gcc -shared -fPIC -o heapprof.soheapprof.c ...
env LD_PRELOAD=heapprof.so program args ...

That's all - add to $LD_PRELOAD a shared object redefining malloc, calloc, free and realloc. (Be careful to not redefine anything else - use static to hide symbols.)

Our redefined malloc will allocate the bytes for its caller plus a few more for the call stack. Allocate - how? The simplest way is to call the original malloc (it's similar with free):

typedef void* (*malloc_func)(size_t);
static malloc_func g_malloc;
//at init time:
g_malloc = (malloc_func)dlsym(RTLD_NEXT, "malloc");
//upon malloc:
void* chunk = g_malloc(size+EXTRA);

In statically linked binaries common on embedded systems, just adding your malloc, etc. to the build is typically enough to override the standard functions. Calling the original functions is hard or impossible though. I'd pull in an open source malloc - like Doug Lea's dlmalloc - rename the functions to real_malloc or whatever and call them from my own versions.

Getting the current call stack

GNU C has the wonderful backtrace() function which just does the work. Nifty!

void** p = (void**)chunk;

//fill the metadata
p[START_INDEX] = START_MAGIC;
backtrace(p+SIZE_INDEX, nframes+1); //+1 for &malloc
p[SIZE_INDEX] = (void*)size; // overwrite &malloc
p[END_INDEX] = END_MAGIC;

//give the user a pointer past the metadata
return (char*)p + EXTRA;

Unfortunately, not all systems have backtrace - not even all GNU C ports (say, there's no backtrace on MIPS, AFAIK). Without backtrace, getting the call stack yourself is still relatively easy, though it can get a bit ghastly. If you care, you can read a bunch about it here.

Dumping core

If there's one thing C is a good at, it's dumping core:

int*p=0;*p=0;

Segmentation fault (core dumped)

(Is there a more succinct way? int*p=*p comes to mind, but it might accidentally not crash if p is un-initialized to a legitimate pointer. *(int*)0=0? Any other suggestions for shaving off characters?..)

What if these barbaric means don't suit your ends? gdb lets you place a breakpoint at your function some_func, and dump core thusly:

gdb program -ex "b some_func" -ex r -ex "gcore my.core" -ex q

You can do this multiple times in the same process, getting several heap state snapshots.

Or let's say you're a Python process with C modules you want to profile:

os.kill(os.getpid(), 11)

...or there's kill -SEGV process-id etc. etc.

On an embedded system, you can dump memory with any method available - using a JTAG probe or having the program send memory over some communication channel, etc. It won't be a "real" core dump in a format recognized by debuggers. But as we'll see in the next section, it's probably enough.

Matching return addresses to source code

Now our offline heap stats analyzer, heapprof.py, searches for block metadata enclosed in "HeaP...ProF" and finds block sizes and stacks:

class Block:
 def __init__(self, metadata):
  # 'I',4 for 32b, 'Q',8 for 64b machines
  self.size = struct.unpack('I', metadata[0:4])[0]
  self.stack = struct.unpack('%d'%(len(metadata)/4 - 1)+'I', metadata[4:])

So now block.stack is a list of return addresses, and

{addr for block in blocks for addr in block.stack}

...is the set of all return addresses in our core dump. How do we match them to source lines and function names?

We can pipe the addresses to addr2line -f -e program:

from subprocess import *
addr2line = Popen('addr2line -f -e'.split()+[exe],
                  stdin=PIPE, stdout=PIPE)
for addr in addrs:
  addr2line.stdin.write('0x%x\n'%addr)
  addr2line.stdin.flush()
  func = addr2line.stdout.readline().strip()
  line = addr2line.stdout.readline().strip()

However, this doesn't work for return addresses from shared libraries - addr2line doesn't know to which addresses they got loaded.

What does work is gdb - if it's given the core dump telling it where shared libraries got loaded:

gdb = Popen(['gdb',prog,core], stderr=STDOUT,
             stdin=PIPE, stdout=PIPE)
for addr in addrs:
  gdb.stdin.write('info symbol 0x%x\n'%addr)
  gdb.stdin.write('list *0x%x\n'%addr)
  gdb.stdin.write('printf "\\ndone\\n"\n')
  gdb.stdin.flush()
  s = ''
  while s != 'done':
    s = gdb.stdout.readline().strip()
    if 'is in' in s: line = s.split('is in ')[1]
    if 'in section' in s: func = s.split('(gdb) ')[1]

The script looks kinda ugly, but the upshot is that info symbol 0xwhatever tells you the function name (and then some), while list *0xwhatever tells you the source line number (and then some).

So who needs addr2line when we have gdb? On embedded systems, often there's just one static binary, so addr2line is sufficient. And on the other hand, there are no core dumps coming out in a standard format. Maybe all you have is a JTAG probe and you dump your memory in one big chunk. So you can't use gdb program core.

In that case, heapprof.py will work just fine if you setenv HEAPPROF_ADDR2LINE. It doesn't care if it's a "real" core dump or just a raw memory dump - searching for "HeaP...ProF" is equally easy. Only gdb cares, and $HEAPPROF_ADDR2LINE avoids using gdb.

If you use a proprietary compiler, then maybe it doesn't have addr2line. Bummer. If the executable file format is standard (ELF/COFF/whatever), then gdb's info symbol command will work (but list won't - not unless DWARF debug info is available.) Also, proprietary compilers are bad for business in embedded systems. But that's a rant for another time.

Clustering and sorting

Clustering blocks by their allocating stack, and sorting the stacks by sum of block sizes is easy:

stack2sizes = {}
for block in blocks:
  stack2sizes.setdefault(block.stack,list()).append(block.size)
total = sorted([(sum(sizes), stack) for stack, sizes in stack2sizes.iteritems()])

Now we can simply print out the sorted stacks using the symbolic information obtained above from addr2line/gdb.

Everybody likes to malloc

Why did heapprof take me several hours to write instead of just one - several hours spread over several days, so here I am programming in my spare time after switching to a part-time job specifically to program less? What's the root cause apart from me being a complete dork?

The problem is that everybody mallocs. dlsym mallocs. backtrace mallocs. pthread - which I only need because those others malloc - also mallocs. Sheesh!

So what happens is, you're inside malloc. You want to log the call stack, so you call backtrace. Backtrace calls malloc. Ought to avoid infinite recursion, which we do with a global variable. Now that global variable needs to be protected with a mutex. Which we have to initialize before the first call to malloc - and that initialization mallocs. Also we need dlsym initially to get the original &malloc, but dlsym also mallocs.

So we need to be able to malloc without &malloc, initially. So I use sbrk for that, and I need free to not use &free to try and free that sbrk'd stuff 'cause that will fail miserably. And so on

It's all in heapprof.c if you want to have a look. I don't think it's very interesting; it does make a heap profiler that much harder to write, but it all still fits into 111 sloc, so it's no big deal, really. The really silly thing is it pulls threading into this, because of the global variable guarding against backtrace's malloc calls.

I suspect that backtrace only mallocs upon initialization and maybe not at all in statically linked binaries. So maybe if you're porting to an embedded system, you don't need to worry about threading issues after all. I just wanted to write something "robust" for "the general case".

Porting

If you want to port heapprof or bits of it to your platform, the issues you might need to deal with are described here. Basically you might need to tweak the platform-specific things we've discussed above, plus a couple others like alignment and endianness.

Conclusions

A heap profiler is a really simple tool to write
A heap profiler annotates heap blocks with metadata at run time - this in itself can a be a great debugging tool
I'm a hopeless dork desperately needing to program less