Machine code monkey patching

April 29th, 2011

A monkey patch is a way to extend or modify the runtime code of dynamic languages (e.g. Smalltalk, JavaScript, Objective-C, Ruby, Perl, Python, Groovy, etc.) without altering the original source code.

Wikipedia

For example, the Python code:

# someone else's class
class TheirClass:
    def their_method(self):
        print("them")
obj = TheirClass()
obj.their_method()

# our function
def our_function(self):
    print("us")

# the monkey patch
TheirClass.their_method = our_function
obj.their_method()

...will print:

them
us

...showing that we have changed the behavior of TheirClass objects, including those we didn't create ourselves, something that can't be done with more civilized techniques like inheritance.

Here's how you can monkey patch machine code, assuming the machine architecture is 32-bit ARM (with the patched code running in ARM rather than Thumb state):

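#include <stdint.h>
#include <stdlib.h>

typedef void (*funcptr)();

void monkey_patch(funcptr their_func, funcptr our_func) {
  //LDR PC,[PC,-4]: jump to the address stored in the next word
  ((uint32_t*)their_func)[0] = 0xe51ff004;
  ((uint32_t*)their_func)[1] = (uint32_t)(uintptr_t)our_func;
}
void* our_malloc(size_t size); //our replacements, defined elsewhere
void our_free(void* ptr);
//monkey patching the memory allocator, e.g. early at startup:
void patch_allocator(void) {
  monkey_patch((funcptr)&malloc, (funcptr)&our_malloc);
  monkey_patch((funcptr)&free, (funcptr)&our_free);
}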

This overwrites the first instruction (32-bit word) of their_func with 0xe51ff004, the ARM machine code for the assembly instruction LDR PC,[PC,-4]. In C-like pseudocode, that means PC = *(PC+4), or "jump to the address stored in the word following the current instruction".

(Why is the byte address PC+4 spelled in assembly as PC-4? I recall that it's because an ARM instruction at address X actually gets the value X+8 when referencing PC. Presumably that is, or at some point was, the most convenient semantics for pipelined hardware to implement: the classic ARM pipeline has three stages (fetch, decode, execute), so while the instruction at X executes, the one at X+4 is being decoded and the one at X+8 fetched. So the physical PC register could very well keep the value X+8.)

So the first word of their_func is overwritten with "jump to where the second word points". The second word is then overwritten with the address of our_func, and we're all set.
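To double-check the arithmetic above, here is a small C sketch (not from the original code) that takes 0xe51ff004 apart into its fields and computes which address the load reads. It only handles the pre-indexed, immediate-offset LDR form, and the names decode_ldr and pc_relative_source are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an ARM-state LDR/STR with a 12-bit immediate offset. */
typedef struct {
  unsigned cond;   /* bits 31..28; 0xE means "always" */
  unsigned up;     /* bit 23 (U): 1 = add the offset, 0 = subtract it */
  unsigned load;   /* bit 20 (L): 1 = LDR, 0 = STR */
  unsigned rn;     /* bits 19..16: base register; 15 is PC */
  unsigned rd;     /* bits 15..12: destination register; 15 is PC */
  unsigned offset; /* bits 11..0: offset in bytes */
} ldr_fields;

static ldr_fields decode_ldr(uint32_t insn) {
  ldr_fields f;
  /* bits 27..24 == 0101: load/store, immediate offset (I=0), pre-indexed (P=1) */
  assert(((insn >> 24) & 0xFu) == 0x5u);
  f.cond = insn >> 28;
  f.up = (insn >> 23) & 1u;
  f.load = (insn >> 20) & 1u;
  f.rn = (insn >> 16) & 0xFu;
  f.rd = (insn >> 12) & 0xFu;
  f.offset = insn & 0xFFFu;
  return f;
}

/* Address read by a PC-based load (rn == 15) located at insn_addr:
   in ARM state, reading PC yields insn_addr + 8. */
static uint32_t pc_relative_source(uint32_t insn_addr, ldr_fields f) {
  uint32_t pc = insn_addr + 8u;
  assert(f.rn == 15);
  return f.up ? pc + f.offset : pc - f.offset;
}
```

Decoding 0xe51ff004 gives cond 0xE, L=1, U=0, Rn=Rd=15 and offset 4, so for an instruction at address X the load reads from X+8-4 = X+4: exactly the word right after the jump.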

Purpose

I actually did this in production code, on a bare metal target (no OS – just a boot loader that runs a massive single binary). I monkey patched the memory allocator – malloc, free, calloc, realloc – and the Unix-like I/O functions underlying that particular compiler's <stdio.h> and <iostream> implementation – read, write, open, close, creat. The memory allocator had to be changed to work on the target dual-core chip. The I/O functions had to be changed to use our drivers, so that we could write stuff to the Flash or USB using FILE* or ofstream.

A more civilized approach, if you want to override functions in a dynamic library, is to preload another library at run time with LD_PRELOAD or an equivalent mechanism. And if the code is linked statically, as it was in my case, you can override the functions at link time. The trouble is that the linker could refuse to cooperate.

(And in my case, we shipped libraries, the customer linked the program, and the guy who talked to the customer refused to cooperate – that is, to help them override functions at link time. He was an old-school embedded developer, the kind that don't need no stinking malloc and printf. The project had a million lines of code very much in need of malloc and printf. He said, clean it up. Don't call malloc on the second CPU. So I went away and monkey patched malloc anyway.

In such a case, the civilized approach is to keep trying to talk the guy into it, and then have him persuade the (even more hardcore) embedded devs on the customer's side. What I did was what managers call "an attempt at a technical solution when a social solution is needed". Or as programmers call it, "avoiding a couple of months of pointless discussions". Being neither a full-time programmer nor a full-time manager, I don't have a clear opinion on which viewpoint is right. I guess it depends on how long and pointless the discussions are going to be, versus how long and pointless the code working around the "social" problem will be.)

In theory, machine code monkey patching could be used in a bunch of "legitimate" cases, such as logging or debugging. In practice, this ugly thing is probably only justified in inherently ugly situations – as is kinda true of monkey patching in general.

Pitfalls

My example implementation for the ARM assumes that a function has at least 2 instructions. An empty ARM assembly function can have just one (a return, that is, a jump to the link register). In that case, the first instruction of the next function will be overwritten. A more sophisticated version of monkey_patch() could stash the target address someplace else (within reach of LDR's 12-bit immediate offset, about 4 KB either way) and use an LDR PC,[PC,clever_offset] instruction instead of the constant LDR PC,[PC,-4].
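One way the "clever offset" variant might look, as a sketch rather than a tested implementation: compute the PC-relative offset from the function to a spare word stored elsewhere, and overwrite only the function's first instruction. To keep it checkable on any host, code[] stands in for memory starting at a made-up base address; how to find a spare word within reach of the function is left open, and all names here (encode_ldr_pc, patch_via_slot) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode ARM-state "LDR PC,[PC,off]" for an instruction at insn_addr so
   that it loads PC from the word at literal_addr. Returns 0 as an error
   sentinel if either address is misaligned or the offset is out of range. */
static uint32_t encode_ldr_pc(uint32_t insn_addr, uint32_t literal_addr) {
  /* reading PC in ARM state yields the instruction's address + 8 */
  int64_t off = (int64_t)literal_addr - ((int64_t)insn_addr + 8);
  if (((insn_addr | literal_addr) & 3u) != 0) return 0; /* must be word-aligned */
  if (off > 4095 || off < -4095) return 0;              /* 12-bit immediate */
  /* cond=AL, P=1, L=1, Rn=PC, Rd=PC; U (bit 23) is 1 to add, 0 to subtract */
  return off >= 0 ? 0xE59FF000u | (uint32_t)off
                  : 0xE51FF000u | (uint32_t)(-off);
}

/* Patch the function whose first word is code[func] so that it jumps to
   target, keeping target in the spare word code[slot] instead of in the
   function's second word. code[0] is assumed to live at base_addr. */
static int patch_via_slot(uint32_t* code, uint32_t base_addr,
                          size_t func, size_t slot, uint32_t target) {
  uint32_t insn = encode_ldr_pc(base_addr + 4u * (uint32_t)func,
                                base_addr + 4u * (uint32_t)slot);
  assert(slot != func);
  if (insn == 0) return -1;
  code[slot] = target; /* store the address first, then the jump that uses it */
  code[func] = insn;   /* only the function's first word is overwritten */
  return 0;
}
```

For instance, with code[] modelling memory at 0x8000, patching word 0 through a slot at word 6 (address 0x8018) writes 0xE59FF010, since PC reads as 0x8008 and the slot is 16 bytes past that; word 1, which could be the first instruction of the next function, is left alone. As a sanity check, encode_ldr_pc(X, X+4) reproduces the constant 0xe51ff004 used by the simple version. On real hardware the cache and memory protection issues below still apply.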

Overwriting machine code instructions breaks code that reads (as opposed to "executes") those instructions, counting on the original bit patterns to be stored there. This isn't very likely to be a problem with actual ARM code, unless it was written by Mel.

On any machine with separate and unsynchronized instruction and data caches, overwriting instructions will modify the contents of the data cache but not the instruction cache. If the instructions happen to be loaded into the instruction cache at the time of overwriting, subsequent calls to the monkey-patched function might call the original function until the instruction cache line holding the original code happens to be evicted (which isn't guaranteed to ever happen).

If your luck is particularly bad and the two overwritten instructions map to two adjacent cache lines, only one of which is loaded into the instruction cache at the time of overwriting, a call to the monkey-patched function might crash, since it'll see one original instruction word and one new one. In any case, on machines where caches don't sync automatically, one should sync them explicitly to implement self-modifying code correctly (I'll spare you my ARM9 code doing this).
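For the cache problem, GCC and Clang provide __builtin___clear_cache(begin, end), which makes the instruction cache consistent with memory over a range on targets that need it and does nothing on targets that don't (such as x86). The sketch below is not the ARM9 code mentioned above; it assumes a toolchain whose builtin actually handles your CPU (on a bare-metal target you may still need your own cache maintenance routine), and write_jump_and_sync is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Overwrite the first two words of func with the ARM-state jump
   "LDR PC,[PC,-4]; .word target", then make the instruction cache
   consistent with the new bytes. This does not make the two-word write
   atomic for a thread that is executing func at the same moment. */
static void write_jump_and_sync(void* func, uint32_t target) {
  const uint32_t words[2] = { 0xe51ff004u, target };
  assert(((uintptr_t)func & 3u) == 0); /* ARM instructions are word-aligned */
  memcpy(func, words, sizeof words);
  /* GCC/Clang builtin: a no-op where caches are coherent (e.g. x86),
     cache maintenance over [begin, end) where the target requires it */
  __builtin___clear_cache((char*)func, (char*)func + sizeof words);
}
```

Syncing right after the write narrows the window for stale instructions, but code already running inside the patched function on another core is a separate problem the sketch doesn't address.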

If your OS places instructions in read-only memory pages, overwriting them will not work unless you convince the OS to grant you write permission (on POSIX systems, typically with mprotect, as in the C++ example below).
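On POSIX systems, mprotect works on whole pages, and a two-word patch can straddle a page boundary, so a helper for this should round the range out to page boundaries on both ends. A sketch, assuming the OS permits pages that are writable and executable at once (hardened systems may refuse); make_code_writable is a hypothetical name.

```c
#define _DEFAULT_SOURCE /* glibc: expose POSIX/BSD declarations under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Make [addr, addr + len) writable while keeping it readable and
   executable. The start is rounded down and the end rounded up to page
   boundaries, so a patch straddling two pages unprotects both.
   Returns 0 on success, -1 with errno set on failure. */
static int make_code_writable(void* addr, size_t len) {
  uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
  uintptr_t start = (uintptr_t)addr & ~(page - 1);
  uintptr_t end = ((uintptr_t)addr + len + page - 1) & ~(page - 1);
  assert(len > 0 && (page & (page - 1)) == 0); /* page size is a power of two */
  return mprotect((void*)start, end - start,
                  PROT_READ | PROT_WRITE | PROT_EXEC);
}
```

Usage would be make_code_writable((void*)their_func, 8) before monkey_patch(); you may want to restore read-and-execute afterwards with a second mprotect call.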

C++

C++ virtual functions can be monkey patched in a way closer to the typical dynamic-language approach: instead of modifying instructions, we can overwrite the virtual function table (vtable).

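Advantages:

- No machine code to encode for a particular CPU, and since a vtable is data rather than instructions, there is no instruction cache to sync.
- Function size doesn't matter, so the one-instruction-function pitfall doesn't apply.

Disadvantages:

- Only virtual functions can be patched, and only calls actually dispatched through the vtable are affected; calls the compiler resolves statically (as with the stack-allocated object in the example below) still reach the original function.
- The vtable layout, and hence the index of each method, is compiler-specific. Derived classes have vtables of their own, which aren't affected.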

The need to fiddle with OS memory protection is likely to remain since vtables are treated as constant data and as such are typically placed in write-protected sections.
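To see why overwriting a table entry behaves like the Python patch at the top, here is a hand-rolled C analogue of a vtable: one shared table of function pointers, reached through a pointer stored in each object. This illustrates the idea rather than what any particular C++ compiler emits; the table is left non-const precisely so that patching it needs no mprotect, and all names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

struct their_class;

/* A hand-rolled "vtable": one table of function pointers per class. */
typedef struct {
  const char* (*some_method)(struct their_class* self);  /* index 0 */
  const char* (*their_method)(struct their_class* self); /* index 1 */
} their_vtable;

/* Every object carries a pointer to its class's table, like a vptr. */
typedef struct their_class {
  their_vtable* vptr;
} their_class;

static const char* some_method_impl(their_class* self) { (void)self; return "some"; }
static const char* their_method_impl(their_class* self) { (void)self; return "them"; }
static const char* our_function(their_class* self) { (void)self; return "us"; }

/* Deliberately not const, so it sits in writable data. A C++ compiler's
   vtable is typically read-only, hence the mprotect in the C++ example. */
static their_vtable their_class_vtable = { some_method_impl, their_method_impl };

static void their_class_init(their_class* obj) { obj->vptr = &their_class_vtable; }

/* Dynamic dispatch: the method is looked up in the table at call time. */
static const char* call_their_method(their_class* obj) {
  assert(obj->vptr != NULL); /* object must be initialized first */
  return obj->vptr->their_method(obj);
}

/* The monkey patch: one store into the shared table affects every object,
   including those initialized before the patch. */
static void monkey_patch_their_method(void) {
  their_class_vtable.their_method = our_function;
}
```

An object initialized before monkey_patch_their_method() runs dispatches to "them" before the patch and to "us" after it, just like obj in the Python example.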

Example C++ code (g++/Linux, tested with g++ 4.2.4 on Ubuntu 8.04):

#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

template<class T, class F>
void monkey_patch(int their_ind, F our_func) {
  T obj; //can we get the vptr without making an object?
  void** vptr = *(void***)&obj; //g++ keeps the vptr in the object's first word
  //align the slot's address (not the vtable's start) to page size,
  //since the vtable may span more than one page:
  void** slot = &vptr[their_ind];
  void* page = (void*)((uintptr_t)slot & ~(uintptr_t)(getpagesize()-1));
  //make the page with the vtable slot writable
  if(mprotect(page, getpagesize(), PROT_WRITE|PROT_READ|PROT_EXEC))
    perror("mprotect");
  *slot = (void*)our_func;
}

class TheirClass {
 public:
  virtual void some_method() {}
  virtual void their_method() { printf("them\n"); }
};
//called through the vtable like a member function, with "this" as the first argument
void our_function(TheirClass* /*self*/) { printf("us\n"); }
int main() {
  TheirClass* obj = new TheirClass;
  //with a stack-allocated object, gcc calls the method directly, bypassing the vtable
  //(with optimization on, newer compilers may devirtualize this call as well)
  obj->their_method(); //prints "them"

  monkey_patch<TheirClass, void(*)(TheirClass*)>(1, our_function);
  //some_method is at index 0, their_method is at 1
  //we could instead try to non-portably get the index
  //out of &TheirClass::their_method

  obj->their_method(); //prints "us"
  delete obj;
}

Conclusion

Let's drink to never having to do any of this (despite the fact that yes, some of us do enjoy it in a perverted way and feel nostalgic blogging about it).