Machine code monkey patching

A monkey patch is a way to extend or modify the runtime code of dynamic languages (e.g. Smalltalk, JavaScript, Objective-C, Ruby, Perl, Python, Groovy, etc.) without altering the original source code.

Wikipedia

For example, the Python code:

# someone else's class
class TheirClass:
    def their_method(self):
        print("them")
obj = TheirClass()
obj.their_method()

# our function
def our_function(self):
    print("us")

# the monkey patch
TheirClass.their_method = our_function
obj.their_method()

…will print:

them
us

…showing that we have changed the behavior of TheirClass objects, including those we didn't create ourselves. Which can't be done with more civilized techniques like inheritance.

Here's how you can monkey patch machine code, assuming the machine architecture is ARM:

typedef void (*funcptr)();

void monkey_patch(funcptr their_func, funcptr our_func) {
  //word 0: LDR PC,[PC,-4], i.e. jump to the address stored in word 1
  ((int*)their_func)[0] = 0xe51ff004;
  //word 1: the jump target (pointers are 32 bits wide on this ARM target)
  ((int*)their_func)[1] = (int)our_func;
}
//monkey patching the memory allocator:
monkey_patch((funcptr)&malloc, (funcptr)&our_malloc);
monkey_patch((funcptr)&free, (funcptr)&our_free);

This overwrites the first instruction (32-bit word) of their_func with 0xe51ff004, which is the ARM machine code corresponding to the assembly instruction LDR PC,[PC,-4] – which means, in C-like pseudocode, PC = *(PC+4), or "jump to the program location pointed to by the next word after the current program location".

(Why is the byte address PC+4 spelled in assembly as PC-4? I recall that it's because an ARM instruction at address X actually gets the value X+8 when referencing PC. Presumably because it is – or at some point was – the most convenient semantics for pipelined hardware to implement:

  • when the instruction at address X executes,
  • the instruction at address X+4 is decoded, and
  • the instruction at address X+8 is fetched

- so the physical PC register could very well keep the value X+8.)

So the first word of their_func is overwritten with, "jump to where the second word points". The second word is then overwritten with our_func, and we're all set.
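To make the encoding less magical, here is a small sketch that picks 0xe51ff004 apart into the fields of an ARM load with a 12-bit immediate offset, and computes which address the load reads from. The names LdrFields, decode_ldr and literal_address are made up for this illustration, and the decoder only handles this one instruction form.

```cpp
#include <cassert>
#include <cstdint>

// Fields of a 32-bit ARM single-word load/store with an immediate offset.
struct LdrFields {
    unsigned cond;    // bits 31..28: condition code (0xE = always)
    bool     add;     // bit 23 (U): 1 = base + offset, 0 = base - offset
    bool     load;    // bit 20 (L): 1 = LDR, 0 = STR
    unsigned rn;      // bits 19..16: base register (15 = PC)
    unsigned rd;      // bits 15..12: destination register (15 = PC)
    unsigned offset;  // bits 11..0: unsigned byte offset
};

LdrFields decode_ldr(std::uint32_t insn) {
    // Only the "load/store, immediate offset" form is handled here:
    // bits 27..26 must be 01 and bit 25 (register offset) must be 0.
    assert(((insn >> 26) & 3) == 1 && ((insn >> 25) & 1) == 0);
    LdrFields f;
    f.cond   = insn >> 28;
    f.add    = (insn >> 23) & 1;
    f.load   = (insn >> 20) & 1;
    f.rn     = (insn >> 16) & 0xF;
    f.rd     = (insn >> 12) & 0xF;
    f.offset = insn & 0xFFF;
    return f;
}

// Address the load reads from when the base register is PC:
// an instruction at address x sees PC == x + 8.
std::uint32_t literal_address(std::uint32_t x, const LdrFields& f) {
    std::uint32_t pc = x + 8;
    return f.add ? pc + f.offset : pc - f.offset;
}
```

For decode_ldr(0xe51ff004) this gives cond 0xE, a load, subtract, base and destination both register 15, offset 4. For an instruction at address 0x1000, literal_address returns 0x1004, which is the word right after the instruction.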

Purpose

I actually did this in production code, on a bare metal target (no OS – just a boot loader that runs a massive single binary). I monkey patched the memory allocator – malloc, free, calloc, realloc – and the Unix-like I/O functions underlying that particular compiler's <stdio.h> and <iostream> implementation – read, write, open, close, creat. The memory allocator had to be changed to work on the target dual-core chip. The I/O functions had to be changed to use our drivers, so that we could write stuff to the Flash or USB using FILE* or ofstream.

A more civilized approach, if you want to override functions in a dynamic library, is passing another library at run time with LD_PRELOAD or equivalent. And if the code is linked statically as it was in my case, you can override the functions at link time. The trouble is that the linker could refuse to cooperate.

(And in my case, we shipped libraries, the customer linked the program, and the guy who talked to the customer refused to cooperate – that is, to help them override functions at link time. He was an old-school embedded developer, the kind that don't need no stinking malloc and printf. The project had a million lines of code very much in need of malloc and printf. He said, clean it up. Don't call malloc on the second CPU. So I went away and monkey patched malloc anyway.

In such a case, the civilized approach is to keep trying to talk the guy into it, and then have him persuade the (even more hardcore) embedded devs at the customer's side. What I did was what managers call "an attempt at a technical solution when a social solution is needed". Or as programmers call it, "avoiding a couple of months of pointless discussions". Being neither a full-time programmer nor a full-time manager, I don't have a clear opinion on which viewpoint is right. I guess it depends on how long and pointless the discussions are going to be, versus how long and pointless the code working around the "social" problem will be.)

In theory, machine code monkey patching could be used in a bunch of "legitimate" cases, such as logging or debugging. In practice, this ugly thing is probably only justified in inherently ugly situations – as is kinda true of monkey patching in general.

Pitfalls

My example implementation for the ARM assumes that a function has at least 2 instructions. An empty ARM assembly function can have just one (jump to link register). In that case, the first instruction of the next function will be overwritten. A more sophisticated version of monkey_patch() could stash the target address someplace else, and use an LDR PC,[PC,clever_offset] instruction instead of a constant LDR PC,[PC,-4] instruction.
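As a rough sketch of what computing that offset could look like: the function below builds the LDR PC,[PC,offset] word for an instruction at one address that loads its target from a slot at another address. The name encode_ldr_pc and its signature are invented for this example. It assumes the classic 32-bit ARM encoding with a 12-bit immediate, so the slot has to sit within 4095 bytes of PC.

```cpp
#include <cassert>
#include <cstdint>

// Encodes "LDR PC,[PC,#offset]" for an instruction located at insn_addr
// that should load its jump target from the word stored at slot_addr.
// Returns false if the slot is too far away for a 12-bit offset.
bool encode_ldr_pc(std::uint32_t insn_addr, std::uint32_t slot_addr,
                   std::uint32_t* insn_out) {
    assert(insn_addr % 4 == 0 && slot_addr % 4 == 0);
    // When the instruction executes, PC reads as insn_addr + 8.
    std::int64_t offset =
        std::int64_t(slot_addr) - (std::int64_t(insn_addr) + 8);
    bool add = offset >= 0;
    std::int64_t magnitude = add ? offset : -offset;
    if (magnitude > 0xFFF)
        return false;
    // 0xe51ff000 is LDR PC,[PC,#-0]; setting bit 23 (U) adds instead.
    std::uint32_t insn = 0xe51ff000u | std::uint32_t(magnitude);
    if (add)
        insn |= 1u << 23;
    *insn_out = insn;
    return true;
}
```

With the slot right after the instruction (say, instruction at 0x1000 and slot at 0x1004) this reproduces the constant 0xe51ff004. A patcher built on it would overwrite only one word of the target function, at the price of finding a free word within range for each patched function.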

Overwriting machine code instructions breaks code that reads (as opposed to "executes") those instructions, counting on the original bit patterns to be stored there. This isn't very likely to be a problem with actual ARM code, unless it was written by Mel.

On any machine with separate and unsynchronized instruction and data caches, overwriting instructions will modify the contents of the data cache but not the instruction cache. If the instructions happen to be loaded to the instruction cache at the time of overwriting, subsequent calls to the monkey-patched function might call the original function, until the instruction cache line keeping the original code happens to be evicted (which isn't guaranteed to ever happen).

If your luck is particularly bad and the two overwritten instructions map to two adjacent cache lines, only one of which is loaded to the instruction cache at the time of overwriting, a call to the monkey-patched function might crash (since it'll see one original instruction word and one new one). In any case, on machines where caches won't sync automatically, one should sync them explicitly to implement self-modifying code correctly (I'll spare you my ARM9 code doing this).
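For what it's worth, GCC and Clang expose a builtin for this, __builtin___clear_cache(begin, end). The sketch below is not my ARM9 code; it only shows the general shape of "write the words, then sync", with write_code_and_sync being a made-up name. What the builtin expands to depends on the target: it may be nothing at all on machines with coherent caches, and on a bare metal ARM9 you would most likely still need explicit cache maintenance operations of your own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copies n_words 32-bit words over dst, then asks the compiler's runtime
// support to make instruction fetches see what was just written.
void write_code_and_sync(std::uint32_t* dst, const std::uint32_t* src,
                         std::size_t n_words) {
    assert(n_words > 0);
    for (std::size_t i = 0; i < n_words; ++i)
        dst[i] = src[i];
    // The builtin takes the start and the one-past-the-end address
    // of the modified range.
    char* begin = reinterpret_cast<char*>(dst);
    char* end = reinterpret_cast<char*>(dst + n_words);
    __builtin___clear_cache(begin, end);
}
```

Syncing the whole range after both words are written also addresses the two-cache-lines scenario, as long as nobody calls the function while it is being patched.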

If your OS places instructions in read-only memory pages, overwriting them will not work unless you convince the OS to grant you permissions to do so.

C++

C++ virtual functions can be monkey patched more similarly to the typical dynamic language way. Instead of modifying instructions, we can overwrite the virtual function table.

Advantages:

  • more portable across machine architectures – the vtable layout doesn't depend on the machine
  • no cache syncing problems
  • no encoding-related corner cases like very short functions or instructions used as data

Disadvantages:

  • less portable across compilers – ARM machine code is the same with all compilers, vtable layout is not
  • fewer calls could be redirected – some compilers avoid the vtable indirection when they know the object's type at compile time (of course inlined calls won't be redirected with either technique)
  • only virtual functions can be redirected – typically a minority of C++ class member functions

The need to fiddle with OS memory protection is likely to remain since vtables are treated as constant data and as such are typically placed in write-protected sections.

Example C++ code (g++/Linux, tested with g++ 4.2.4 on Ubuntu 8.04):

#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
#include <stdint.h>

template<class T, class F>
void monkey_patch(int their_ind, F our_func) {
  T obj; //can we get the vptr without making an object?
  //vtable slots are pointer-sized, so treat the table as an array of void*
  void** vptr = *(void***)&obj;
  //align to page size:
  void* page = (void*)((uintptr_t)vptr & ~(uintptr_t)(getpagesize()-1));
  //make the page with the vtable writable
  if(mprotect(page, getpagesize(), PROT_WRITE|PROT_READ|PROT_EXEC))
    perror("mprotect");
  vptr[their_ind] = (void*)our_func;
}

class TheirClass {
 public:
  virtual void some_method() {}
  virtual void their_method() { printf("them\n"); }
};
void our_function() { printf("us\n"); }
int main() {
  TheirClass* obj = new TheirClass;
  //gcc ignores the vtable with a stack-allocated object
  obj->their_method(); //prints "them"

  monkey_patch<TheirClass, void(*)()>(1, our_function);
  //some_method is at index 0, their_method is at 1
  //we could instead try to non-portably get the index
  //out of &TheirClass::their_method

  obj->their_method(); //prints "us"
}
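As for getting the index out of a member function pointer instead of hard-coding it, here is one non-portable way it might be done. It is a sketch that assumes the Itanium C++ ABI used by g++ and clang on Linux, where a pointer to a virtual member function stores 1 plus the byte offset into the vtable (the ARM variant of that ABI keeps the flag in the low bit of the adjustment field instead). MemFnRep, vtable_index and Probe are names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Layout of a pointer to member function under the Itanium C++ ABI.
struct MemFnRep {
    std::uintptr_t ptr;  // function address, or encoded vtable offset
    std::ptrdiff_t adj;  // this-pointer adjustment
};

// Returns the vtable slot index of a virtual member function, or -1 if
// the pointer does not look like it refers to a virtual function.
template<class T, class R, class... Args>
long vtable_index(R (T::*mf)(Args...)) {
    static_assert(sizeof(mf) == sizeof(MemFnRep),
                  "unexpected member function pointer layout");
    MemFnRep rep;
    std::memcpy(&rep, &mf, sizeof rep);
    if (rep.ptr & 1)  // generic scheme: ptr = 1 + byte offset in the vtable
        return long((rep.ptr - 1) / sizeof(void*));
    if (rep.adj & 1)  // ARM scheme: low bit of adj flags a virtual function
        return long(rep.ptr / sizeof(void*));
    return -1;
}

// Same shape as TheirClass above.
struct Probe {
    virtual void some_method() {}
    virtual void their_method() {}
};
```

With this, vtable_index(&Probe::their_method) should come out as 1 and vtable_index(&Probe::some_method) as 0 on such compilers, so the index passed to monkey_patch no longer has to be counted by hand. It still breaks the moment the compiler uses a different member pointer representation.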

Conclusion

Let's drink to never having to do any of this (despite the fact that yes, some of us do enjoy it in a perverted way and feel nostalgic blogging about it).

16 comments ↓

#1 Z.T. on 04.29.11 at 4:34 am

The calling conventions for "void our_function(void)" and "void TheirClass::their_method(void)" are not the same – the member function has an implicit *this argument. Depending on architecture and compiler, overriding one with the other might not be safe (suppose caller pushes *this and expects callee to pop *this and your function doesn't).

#2 Yossi Kreinin on 04.29.11 at 4:47 am

@Z.T.: agreed; I thought about passing a TheirClass* to our_function, but then figured that I'd either want it to use the pointer (and then I'd also want to change the Python example to do so), or to explain why I passed it, and decided to leave it that way for brevity.

You're right that this could break on some platforms – although the typical caller/callee responsibilities will usually allow this sort of thing. At that level of discussion, another thing that is not guaranteed is that a class method calling convention is the same as the global function calling convention, so I'd have to mention that, too if I tried to more or less exhaustively list the ways for this to break.

#3 Mattia on 04.29.11 at 5:29 am

Just a typo: Theirs.their_method = our_function should be TheirClass.their_method = our_function

Ciao,

Mattia

#4 Yossi Kreinin on 04.29.11 at 5:43 am

Oh yeah. Fixed. That's what happens when you edit without trying in the REPL…

#5 Nemo on 04.29.11 at 7:56 am

I guess it depends on how long and pointless the discussions are going to be, versus how long and pointless the code working around the “social” problem will be.)

Actually, it mostly depends on how long the code will be around. (Ever notice how the answer is usually "longer than I ever imagined"?)

Pity the engineer(s) who had to maintain / debug / enhance that garbage after you left. A few hours of somebody new having to untangle your spaghetti, integrated over years or decades… Maybe a few conversations up front would not have been completely pointless?

This does not sound like a "manager" versus an "engineer" kind of question. It sounds more like a "good engineer" versus "bad engineer" kind of question…

#6 Yossi Kreinin on 04.29.11 at 8:23 am

Well, for starters, I haven't left yet :) And as for spaghetti – um, this isn't really an example of that, is it? And assuming that sometimes you do need to override a function in a statically linked program – there really isn't a terribly readable way to do it (in the "clean" case you'll need to read the build system files to figure it out – and that is arguably harder than reading an ugly, but documented runtime patching function). And even if it's clearly stated in some place that a function is overridden, it doesn't look like so at the point of call (a global C function call looks like it calls exactly what it says to any unsuspecting programmer knowing C). And then most people never care about the implementation of malloc() or write() – and those who do are probably capable of figuring out what I did, too, without much trouble.

Now regarding a few conversations up front – from the fact that I have one with you, despite the tone of your message, you can infer that I also tried a few conversations with him. I believe my judgment to stop at some point was correct, but perhaps you, who knows little about me and nothing about him, do know better nonetheless…

#7 Nemo on 04.29.11 at 9:01 am

I am just trying to imagine debugging code where "malloc()" does not call malloc(), "open()" does not call open(), etc. By the time I figured out what was going on and found the bug in the custom library function, I am pretty sure I would be cursing somebody's name. :-)

Although if you absolutely must do this kind of patching, I suppose it is a toss-up whether to do it statically at link time or dynamically like this. (I would still favor the static approach, since it would be fairly portable across compilers and CPU architectures… And when not, the failure would be far more likely to happen at compile time.)

But my main argument was readability, and you are right that static vs. dynamic patching is equally bad in that regard.

Incidentally, I believe your C++ approach will break on non-trivial classes (specifically, anything that uses covariant return types, like this).

My point is that this sort of thing is almost always an even worse idea than it appears. But then, I usually have an allergic reaction to "clever" code…

#8 Zeppo on 04.29.11 at 6:44 pm

Yeah, I agree with above post. Good luck getting a third party company to move to a malloc that is thread safe and a whole new basic library and requires overhauling their whole system. That just won't happen, and even if it did it would take forever to get the bugs out with that much code.

The link to Mel reminds me of when I learned to program, messing with the code to make the addresses come out right for everything was such a pain using dos debug. A sort of famous computer guy actually took pity on me and bought me an assembler when he heard of it.

#9 Yossi Kreinin on 04.30.11 at 1:04 am

@Nemo: Sure it'll break for more complicated cases, although apparently it could be extended to do the right pointer arithmetic, both for finding the vtables and for adjusting the object pointers – it just wasn't my point to illustrate the extent of complexity of the C++ calling conventions, only that in principle you could do it.

"The static approach" is more portable across CPUs, but less portable across compilers since linker scripts/flags/behavior aren't the same with all compilers. For starters, nothing in the C or C++ language definition says that you can legally do this sort of thing statically, so it's necessarily a compiler-specific hack. And the failure is not likely to manifest itself at compile time at all – very easily your function will just be ignored and the standard one taken, without any warning (for instance, if someone tweaked the build system and your overriding was incidentally disabled).

Regarding debugging – our malloc simply called the fairly standard and well-debugged dlmalloc, only using the right arena based on the calling CPU ID, and our I/O functions called our drivers – just the way it'd happen on Unix or Windows if we had to roll our own drivers, that is, you never know which code is called by those functions anyway, it's just that the mechanism for registering drivers differs across targets. So the bottom line is that the chance of having to debug problems in our custom allocator or I/O functions themselves was never very high.

I sort of figured your allergy… :) Well, I'm happy to tell that a few projects with this code bundled in them are in production, working and selling fine :)

#10 Yossi Kreinin on 04.30.11 at 1:06 am

@Zeppo: you learned to program in hexadecimal, since there wasn't a free assembler/disassembler around? Impressive.

#11 Mihnea on 05.06.11 at 10:47 pm

I almost shipped this kind of stuff in a PC program. I had to intercept user actions in Autodesk Maya (from a plug-in) and either record them to be played back later, or stream them live to other Maya instances. It was supposed to be a better alternative to instructional videos, but it never got released.

Since in Maya almost every user action results in running a script command, all I had to do was intercept those commands and record or relay them. Unfortunately the official mechanism for this doesn't catch everything because they cut some corners, but after some time spent in the fine company of ollydbg I found that it always ends up calling one of about 3 or 4 functions whenever the user does something, and I could grab the command string from their arguments. I located and patched those functions when my plug-in was loaded (which was fun, as they differed slightly between the 3 releases of Maya I had to support). I also had to run the original function after my hook was done, so I saved the instructions which were overwritten by the jmp opcode, ran them at the end of my function, then jumped back to the original function, after those instructions.

#12 Yossi Kreinin on 05.07.11 at 9:40 am

@Mihnea: tricky stuff. I also saved the original instructions – so that they could be temporarily restored before code that computed the checksum of .text ran, and in software reset cycles (awfully enough, the original system allocator was used to allocate a large share of stuff before ours took control – because there was no way for me to overwrite the functions before global constructors, etc. happened, again because that would require the linker script guy's cooperation; and before software reset, since code wasn't reloaded to the RAM, you had to restore the system allocator since ours wasn't used to running before main()). So there was something like a monkey_unpatch function. But I didn't have to use it at the time of the function call, so it was easier; I wonder if you could monkey patch a function in a case like yours, where you need to be calling the original version, in a general way (without checking the overwritten bytecode/call sequence/etc. yourself and figuring out how to do it).

#13 Mihnea on 05.08.11 at 12:01 am

There is one special case where this can be done painlessly: when the target function is exported from a shared object. You just patch the import table and you're done. Obviously, you won't catch calls from within the SO which contains the function, but you may not need to. There are a number of OpenGL and Direct3D debuggers which do this to hook API calls and track resources, state etc. When your function is not a SO export, creative programming to the rescue!

On a machine with fixed instruction size I suppose you can just end your hook function with some placeholder bytes followed by a jump after the overwritten bytes in the original function. Copy the original bytes in the placeholder and as long as your hook leaves the machine state the same way it found it, it should work.

On messier machines like the PC, where you would need to write a mini-disassembler to know how many bytes to copy, something like this may work:

- patch the function with a jmp to your hook, saving the bytes you've overwritten

- at the end of your function, put the original bytes back, save the return address somewhere, change it to a "rehook" function, jmp to the start of the original function

- when the original function finishes, it will return to your "rehook" function, which will reinsert the hooking jmp instruction and jmp to the saved return address

It should be noted that writing a bit of code which understands the instruction set just enough to figure out where the next instruction boundary is after the inserted jmp opcode is not too difficult (at least on PC). You can even steal it from something like kkrunchy, and it's probably going to take less time to implement than this rehooking business.

Or, assuming your CPU has a one-byte software breakpoint instruction and supports single-stepping:

- stick a software breakpoint at the top of the function

- do whatever you need to do inside the breakpoint handler (however horrid that sounds)

- put the original byte back, turn on the single-step flag, resume the function

- the single step handler will be called after the first instruction in the function, saving you the trouble of figuring out how many bytes it was. Reinsert the software break now to rearm the hook.

- man up and use printf debugging from now on, because this will confuse the hell out of any debugger. Also, "speedy" is not a word that can be used to describe the performance of this solution. :)

I guess this thing can be done even if you don't have one-byte breakpoint opcodes by repeatedly single stepping until you are past the overwritten bytes, but I sincerely hope no one has ever made such a CPU.

Or, even more repulsive stuff:

- if you have an MMU, turn off the execute bit for the page containing the start of the function and run the hook from inside the fault handler when the IP points to the location you care about.

- if you have hardware breakpoints, you can use those. The PC has 4 breakpoint registers, so with this you can hook at most 4 functions.

This is just a brain dump, I never tried any of it (except for import table patching for Windows DLLs).

#14 Semi Essessi on 05.15.11 at 2:15 am

Incidentally this is how break points and 'edit and continue' work in MS visual studio… Also, in the case of linked in calls you can patch the address rather than the code, either in the executable data or at runtime. Useful for debugging if you need the original function left intact. :)

#15 Anonymous on 06.03.11 at 12:25 pm

Wow, it's funny how paragraph formatting makes the Story of Mel look so much more… boring. Here's the version I think of as canonical:
http://www.catb.org/jargon/html/story-of-mel.html

#16 Anonymous on 05.13.14 at 12:52 am

About the C++ thing: If you only need to modify the behavior of existing objects that you did not create yourself (as in the python example at the start of the article), there should be a simpler way to do it:

Derive from TheirClass and override their_method as desired. Then overwrite the vptr of obj with the one from an object of your new derived class.

This should be more portable as you don't need to mess around with protected memory and vtable indices. Of course you still have to assume the location of the vptr and worry about compiler optimizations.
