Machine code monkey patching
A monkey patch is a way to extend or modify the runtime code of dynamic languages (e.g. Smalltalk, JavaScript,
Objective-C, Ruby, Perl, Python, Groovy, etc.) without altering the original source code.
- Wikipedia
For example, the Python code:
    # someone else's class
    class TheirClass:
        def their_method(self):
            print("them")

    obj = TheirClass()
    obj.their_method()

    # our function
    def our_function(self):
        print("us")

    # the monkey patch
    TheirClass.their_method = our_function
    obj.their_method()
...will print:
them
us
...showing that we have changed the behavior of TheirClass objects, including those we didn't create ourselves, which can't
be done with more civilized techniques like inheritance.
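To make the contrast with inheritance concrete, here is a minimal sketch (Python 3; OurSubclass is a made-up name used only for illustration). A subclass changes the behavior of objects created from the subclass, while assigning to the class attribute also redirects an object that already exists:

```python
class TheirClass:
    def their_method(self):
        return "them"

# an object created by "their" code before we get involved
existing = TheirClass()

# inheritance: only instances of the subclass get the new behavior
class OurSubclass(TheirClass):
    def their_method(self):
        return "us"

via_subclass = OurSubclass().their_method()
untouched = existing.their_method()      # still the original method

# monkey patch: the existing object is redirected as well
def our_function(self):
    return "us"

TheirClass.their_method = our_function
patched = existing.their_method()
```

Here via_subclass and patched hold "us", while untouched holds "them": the subclass never reached the object we didn't create.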
Here's how you can monkey patch machine code, assuming the machine architecture is ARM:
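    #include <stdint.h>
    #include <stdlib.h>

    typedef void (*funcptr)();

    //overwrites the first two words of their_func (32-bit ARM assumed)
    void monkey_patch(funcptr their_func, funcptr our_func) {
      uint32_t* code = (uint32_t*)their_func;
      code[0] = 0xe51ff004;         //LDR PC,[PC,-4]
      code[1] = (uint32_t)our_func; //the address that the LDR loads into PC
    }

    //monkey patching the memory allocator
    //(our_malloc and our_free are defined elsewhere):
    monkey_patch((funcptr)&malloc, (funcptr)&our_malloc);
    monkey_patch((funcptr)&free, (funcptr)&our_free);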
This overwrites the first instruction (32-bit word) of their_func with 0xe51ff004, the ARM machine code for the
assembly instruction LDR PC,[PC,-4]. In C-like pseudocode, this means PC = *(PC+4), or
"jump to the program location pointed to by the next word after the current program location".
(Why is the byte address PC+4 spelled in assembly as PC-4? I recall that it's because an ARM instruction at address X
actually gets the value X+8 when referencing PC, presumably because it is (or at some point was) the most convenient semantics
for pipelined hardware to implement:
- when the instruction at address X executes,
- the instruction at address X+4 is decoded, and
- the instruction at address X+8 is fetched
- so the physical PC register could very well keep the value X+8.)
So the first word of their_func is overwritten with, "jump to where the second word points". The second word is then
overwritten with our_func, and we're all set.
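The encoding can be checked with a small model. The following is a sketch in Python rather than anything that runs on the target: a bytearray stands in for memory, addresses are offsets into it, and little-endian word order is an assumption. The field layout is the standard ARM single data transfer format with an immediate offset; the helper names decode_ldr and jump_target are invented for this example.

```python
import struct

LDR_PC_PC_MINUS_4 = 0xE51FF004

def decode_ldr(word):
    """Split an ARM LDR/STR word (immediate-offset form) into its fields."""
    return {
        "cond": (word >> 28) & 0xF,   # 0xE = always execute
        "up": (word >> 23) & 1,       # 1 = add the offset, 0 = subtract it
        "load": (word >> 20) & 1,     # 1 = LDR, 0 = STR
        "rn": (word >> 16) & 0xF,     # base register, 15 = PC
        "rd": (word >> 12) & 0xF,     # destination register, 15 = PC
        "imm12": word & 0xFFF,        # unsigned byte offset
    }

def monkey_patch(memory, their_func, our_func):
    """Overwrite two little-endian words at offset their_func."""
    struct.pack_into("<II", memory, their_func, LDR_PC_PC_MINUS_4, our_func)

def jump_target(memory, insn_addr):
    """Address that the patched instruction at insn_addr loads into PC."""
    fields = decode_ldr(struct.unpack_from("<I", memory, insn_addr)[0])
    assert fields["load"] == 1 and fields["rn"] == 15 and fields["rd"] == 15
    pc_as_read = insn_addr + 8        # PC reads as X+8 on ARM
    offset = fields["imm12"] if fields["up"] else -fields["imm12"]
    return struct.unpack_from("<I", memory, pc_as_read + offset)[0]

memory = bytearray(64)                # pretend address space
monkey_patch(memory, 16, 0x8000)      # "their_func" at 16, "our_func" at 0x8000
fields = decode_ldr(LDR_PC_PC_MINUS_4)
target = jump_target(memory, 16)
```

The decoded fields come out as base PC, destination PC, subtract, offset 4, and target is 0x8000, read from offset 20, that is, the word right after the patched instruction.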
Purpose
I actually did this in production code, on a bare metal target (no OS, just a boot loader that runs a massive single
binary). I monkey patched the memory allocator (malloc, free, calloc, realloc) and the Unix-like I/O functions underlying that
particular compiler's <stdio.h> and <iostream> implementation (read, write, open, close, creat). The memory
allocator had to be changed to work on the target dual-core chip. The I/O functions had to be changed to use our drivers, so
that we could write stuff to the Flash or USB using FILE* or ofstream.
A more civilized approach, if you want to override functions in a dynamic library, is passing another library at run time
with LD_PRELOAD or equivalent. And if the code is linked statically as it was in my case, you can override the functions at link
time. The trouble is that the linker could refuse to cooperate.
(And in my case, we shipped libraries, the customer linked the program, and the guy who talked to the customer
refused to cooperate, that is, to help them override functions at link time. He was an old-school embedded developer, the kind
that don't need no stinking malloc and printf. The project had a million lines of code very much in need of malloc and printf.
He said, clean it up. Don't call malloc on the second CPU. So I went away and monkey patched malloc anyway.
In such a case, the civilized approach is to keep trying to talk the guy into it, and then have him persuade the (even more
hardcore) embedded devs at the customer's side. What I did was what managers call "an attempt at a technical solution when a
social solution is needed". Or as programmers call it, "avoiding a couple of months of pointless discussions". Being neither a
full-time programmer nor a full-time manager, I don't have a clear opinion which viewpoint is right. I guess it depends on how
long and pointless the discussions are going to be, versus how long and pointless the code working around the "social" problem
will be.)
In theory, machine code monkey patching could be used in a bunch of "legitimate" cases, such as logging or debugging. In
practice, this ugly thing is probably only justified in inherently ugly situations, as is kinda true of monkey patching in
general.
Pitfalls
My example implementation for the ARM assumes that a function has at least 2 instructions. An empty ARM assembly function can
have just one (jump to link register). In that case, the first instruction of the next function will be overwritten. A more
sophisticated version of monkey_patch() could stash the target address someplace else, and use an LDR
PC,[PC,clever_offset] instruction instead of a constant LDR PC,[PC,-4] instruction.
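One way to compute such an offset is sketched below in Python. The function name encode_ldr_pc is made up for this example; the arithmetic assumes the standard ARM immediate-offset form, where the offset is a 12-bit magnitude plus an add/subtract bit, so the stashed address has to live within 4095 bytes of the value PC reads as.

```python
def encode_ldr_pc(insn_addr, literal_addr):
    """Encode LDR PC,[PC,offset] for an instruction placed at insn_addr
    that loads the jump target from the word stored at literal_addr."""
    offset = literal_addr - (insn_addr + 8)   # PC reads as insn_addr + 8
    if abs(offset) > 0xFFF:
        raise ValueError("literal is outside the 12-bit offset range")
    up = 1 if offset >= 0 else 0              # U bit: add or subtract
    return 0xE51FF000 | (up << 23) | abs(offset)

same_as_constant = encode_ldr_pc(0x1000, 0x1004)   # word right after the LDR
farther_away = encode_ldr_pc(0x1000, 0x2000)       # literal stashed elsewhere
```

With the literal in the very next word this reproduces 0xe51ff004; with the literal elsewhere only one word of the patched function is overwritten.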
Overwriting machine code instructions breaks code that reads (as opposed to "executes") those instructions, counting
on the original bit patterns to be stored there. This isn't very likely to be a problem with actual ARM code, unless it was
written by Mel.
On any machine with separate and unsynchronized instruction and data caches, overwriting instructions will modify the
contents of the data cache but not the instruction cache. If the instructions happen to be loaded to the instruction cache at
the time of overwriting, subsequent calls to the monkey-patched function might call the original function, until the instruction
cache line keeping the original code happens to be evicted (which isn't guaranteed to ever happen).
If your luck is particularly bad and the two overwritten instructions map to two adjacent cache lines, only one of which is
loaded to the instruction cache at the time of overwriting, a call to the monkey-patched function might crash (since it'll see
one original instruction word and one new one). In any case, on machines where caches won't sync automatically, one should sync
them explicitly to implement self-modifying code correctly (I'll spare you my ARM9 code doing this).
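Whether a given patch is exposed to the two-cache-lines case is a matter of simple arithmetic. A small sketch in Python, where the 32-byte line size is only an assumption (check the cache geometry of the actual core) and the function names are invented for the example:

```python
def cache_lines_touched(addr, size, line_size=32):
    """Indices of the cache lines covered by [addr, addr + size)."""
    first = addr // line_size
    last = (addr + size - 1) // line_size
    return list(range(first, last + 1))

def patch_straddles_lines(func_addr, patch_size=8, line_size=32):
    """True if the two patched words land in two different cache lines."""
    return len(cache_lines_touched(func_addr, patch_size, line_size)) > 1

aligned = patch_straddles_lines(0x1000)     # both words in one line
unlucky = patch_straddles_lines(0x101C)     # second word starts the next line
```

Functions are word-aligned rather than line-aligned, so the unlucky case is a function whose first instruction is the last word of a cache line.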
If your OS places instructions in read-only memory pages, overwriting them will not work unless you convince the OS to grant
you permissions to do so.
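On POSIX systems the usual tool for this is mprotect, which works on whole pages, so the address has to be rounded down to a page boundary and the length has to cover every page that the patch touches. The rounding is sketched below in Python; pages_covering is a hypothetical helper, and the page size defaults to whatever the machine running the sketch uses.

```python
import mmap

def pages_covering(addr, size, page_size=mmap.PAGESIZE):
    """Page-aligned (start, length) covering the bytes [addr, addr + size)."""
    start = addr & ~(page_size - 1)
    end = (addr + size + page_size - 1) & ~(page_size - 1)
    return start, end - start

one_page = pages_covering(0x12345, 8, page_size=4096)
two_pages = pages_covering(0x12FFC, 8, page_size=4096)
```

An 8-byte patch that starts 4 bytes before a page boundary needs two pages made writable, not one.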
C++
C++ virtual functions can be monkey patched more similarly to the typical dynamic language way. Instead of modifying
instructions, we can overwrite the virtual function table.
Advantages:
- more portable across machine architectures: the vtable layout doesn't depend on the machine
- no cache syncing problems
- no encoding-related corner cases like very short functions or instructions used as data
Disadvantages:
- less portable across compilers: ARM machine code is the same with all compilers, vtable layout is not
- fewer calls could be redirected β some compilers avoid the vtable indirection when they know the object's type at compile
time (of course inlined calls won't be redirected with either technique)
- only virtual functions can be redirected, typically a minority of C++ class member functions
The need to fiddle with OS memory protection is likely to remain since vtables are treated as constant data and as such are
typically placed in write-protected sections.
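The mechanism, and the "fewer calls could be redirected" caveat, can be illustrated with a toy model. This is a sketch in Python, not how any real compiler lays things out: a list plays the role of the per-class vtable, each object holds a reference to it, and all names are invented for the example.

```python
def some_method(self):
    return "some"

def their_method(self):
    return "them"

def our_function(self):
    return "us"

# one table per class, shared by all objects of that class
THEIR_VTABLE = [some_method, their_method]

class Obj:
    def __init__(self, vtable):
        self.vptr = vtable            # each object only holds a reference

def virtual_call(obj, index):
    return obj.vptr[index](obj)       # indirect call through the table

a, b = Obj(THEIR_VTABLE), Obj(THEIR_VTABLE)
before = virtual_call(a, 1)
THEIR_VTABLE[1] = our_function        # the monkey patch: one slot, one write
after = [virtual_call(a, 1), virtual_call(b, 1)]
direct = their_method(a)              # a call that bypasses the table
```

A single write redirects every object sharing the table, while a call that the compiler resolved without going through the table (modeled by direct) still reaches the original function.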
Example C++ code (g++/Linux; the original version was tested with g++ 4.2.4 on Ubuntu 8.04, and this listing uses
pointer-sized integers so that it also compiles on 64-bit targets):

    #include <sys/mman.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdint.h>

    template<class T, class F>
    void monkey_patch(int their_ind, F our_func) {
      T obj; //can we get the vptr without making an object?
      uintptr_t* vptr = *(uintptr_t**)&obj;
      //align the address of the patched slot to page size:
      uintptr_t mask = ~(uintptr_t)(getpagesize()-1);
      void* page = (void*)((uintptr_t)&vptr[their_ind] & mask);
      //make the page with the vtable slot writable
      if(mprotect(page, getpagesize(), PROT_WRITE|PROT_READ|PROT_EXEC))
        perror("mprotect");
      vptr[their_ind] = (uintptr_t)our_func;
    }

    class TheirClass {
     public:
      virtual void some_method() {}
      virtual void their_method() { printf("them\n"); }
    };

    void our_function() { printf("us\n"); }

    int main() {
      TheirClass* obj = new TheirClass;
      //gcc ignores the vtable with a stack-allocated object
      obj->their_method(); //prints "them"

      monkey_patch<TheirClass, void(*)()>(1, our_function);
      //some_method is at index 0, their_method is at 1
      //we could instead try to non-portably get the index
      //out of &TheirClass::their_method
      obj->their_method(); //prints "us"
    }
Conclusion
Let's drink to never having to do any of this (despite the fact that yes, some of us do enjoy it in a perverted way and feel
nostalgic blogging about it).
The calling conventions for "void our_function(void)" and "void
TheirClass::their_method(void)" are not the same: the member function
has an implicit *this argument. Depending on architecture and compiler,
overriding one with the other might not be safe (suppose caller pushes
*this and expects callee to pop *this and your function doesn't).
@Z.T.: agreed; I thought about passing a TheirClass* to our_function,
but then figured that I'd either want it to use the pointer (and then
I'd also want to change the Python example to do so), or to explain why
I passed it, and decided to leave it that way for brevity.
You're right that this could break on some platforms, although the
typical caller/callee responsibilities will usually allow this sort of
thing. At that level of discussion, another thing that is not guaranteed
is that a class method calling convention is the same as the global
function calling convention, so I'd have to mention that, too if I tried
to more or less exhaustively list the ways for this to break.
Just a typo: Theirs.their_method = our_function should be
TheirClass.their_method = our_function
Ciao,
Mattia
Oh yeah. Fixed. That's what happens when you edit without trying in
the REPL...
I guess it depends on how long and pointless the discussions are
going to be, versus how long and pointless the code working around the
"social" problem will be.)
Actually, it mostly depends on how long the code will be around.
(Ever notice how the answer is usually "longer than I ever
imagined"?)
Pity the engineer(s) who had to maintain / debug / enhance that
garbage after you left. A few hours of somebody new having to untangle
your spaghetti, integrated over years or decades... Maybe a few
conversations up front would not have been completely pointless?
This does not sound like a "manager" versus an "engineer" kind of
question. It sounds more like a "good engineer" versus "bad engineer"
kind of question...
Well, for starters, I haven't left yet :) And as for spaghetti, um,
this isn't really an example of that, is it? And assuming that
sometimes you do need to override a function in a statically linked
program, there really isn't a terribly readable way to do it (in the
"clean" case you'll need to read the build system files to figure it out,
and that is arguably harder than reading an ugly, but documented
runtime patching function). And even if it's clearly stated in some
place that a function is overridden, it doesn't look like so at the
point of call (a global C function call looks like it calls exactly what
it says to any unsuspecting programmer knowing C). And then most people
never care about the implementation of malloc() or write(), and those
who do are probably capable of figuring out what I did, too, without
much trouble.
Now regarding a few conversations up front: from the fact that I
have one with you, despite the tone of your message, you can infer that
I also tried a few conversations with him. I believe my judgment to stop
at some point was correct, but perhaps you, who knows little about me
and nothing about him, do know better nonetheless...
I am just trying to imagine debugging code where "malloc()" does not
call malloc(), "open()" does not call open(), etc. By the time I figured
out what was going on and found the bug in the custom library function,
I am pretty sure I would be cursing somebody's name. :-)
Although if you absolutely must do this kind of patching, I suppose
it is a toss-up whether to do it statically at link time or dynamically
like this. (I would still favor the static approach, since it would be
fairly portable across compilers and CPU architectures... And when not,
the failure would be far more likely to happen at compile time.)
But my main argument was readability, and you are right that static
vs. dynamic patching is equally bad in that regard.
Incidentally, I believe your C++ approach will break on non-trivial
classes (specifically, anything that uses covariant return types, like
this).
My point is that this sort of thing is almost always an even worse
idea than it appears. But then, I usually have an allergic reaction to
"clever" code...
Yeah, I agree with above post. Good luck getting a third party
company to move to a malloc that is thread safe and a whole new basic
library and requires overhauling their whole system. That just won't
happen, and even if it did it would take forever to get the bugs out
with that much code.
The link to Mel reminds me of when I learned to program, messing with
the code to make the addresses come out right for everything was such a
pain using dos debug. A sort of famous computer guy actually took pity
on me and bought me an assembler when he heard of it.
@Nemo: Sure it'll break for more complicated cases, although
apparently it could be extended to do the right pointer arithmetics,
both for finding the vtables and for adjusting the object pointers; it
just wasn't my point to illustrate the extent of complexity of the C++
calling conventions, only that in principle you could do it.
"The static approach" is more portable across CPUs, but less portable
across compilers since linker scripts/flags/behavior aren't the same
with all compilers. For starters, nothing in the C or C++ language
definition says that you can legally do this sort of thing statically,
so it's necessarily a compiler-specific hack. And the failure is not
likely to manifest itself at compile time at all: very easily your
function will just be ignored and the standard one taken, without any
warning (for instance, if someone tweaked the build system and your
overriding was incidentally disabled).
Regarding debugging β our malloc simply called the fairly standard
and well-debugged dlmalloc, only using the right arena based on the
calling CPU ID, and our I/O functions called our drivers, just the way
it'd happen on Unix or Windows if we had to roll our own drivers, that
is, you never know which code is called by those functions anyway, it's
just that the mechanism for registering drivers differs across targets.
So the bottom line is that the chances of having to debug problems in our
custom allocator or I/O functions themselves were never very high.
I sort of figured your allergy... :) Well, I'm happy to tell that a
few projects with this code bundled in them are in production, working
and selling fine :)
@Zeppo: you learned to program in hexadecimal, since there wasn't a
free assembler/disassembler around? Impressive.
I almost shipped this kind of stuff in a PC program. I had to
intercept user actions in Autodesk Maya (from a plug-in) and either
record them to be played back later, or stream them live to other Maya
instances. It was supposed to be a better alternative to instructional
videos, but it never got released.
Since in Maya almost every user action results in running a script
command, all I had to do was intercept those commands and record or
relay them. Unfortunately the official mechanism for this doesn't catch
everything because they cut some corners, but after some time spent in
the fine company of ollydbg I found that it always ends up calling one
of about 3 or 4 functions whenever the user does something, and I could
grab the command string from their arguments. I located and patched
those functions when my plug-in was loaded (which was fun, as they
differed slightly between the 3 releases of Maya I had to support). I
also had to run the original function after my hook was done, so I saved
the instructions which were overwritten by the jmp opcode, ran them at
the end of my function, then jumped back to the original function, after
those instructions.
@Mihnea: tricky stuff. I also saved the original instructions, so
that they could be temporarily restored before code that computed the
checksum of .text ran, and in software reset cycles (awfully enough, the
original system allocator was used to allocate a large share of stuff
before ours took control, because there was no way for me to overwrite
the functions before global constructors, etc. happened, again because
that would require the linker script guy's cooperation; and before
software reset, since code wasn't reloaded to the RAM, you had to
restore the system allocator since ours wasn't used to running before
main()). So there was something like a monkey_unpatch function. But I
didn't have to use it at the time of the function call, so it was
easier; I wonder if you could monkey patch a function in a case like
yours, where you need to be calling the original version, in a general
way (without checking the overwritten bytecode/call sequence/etc.
yourself and figuring out how to do it).
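For the simple save-and-restore case, the pairing could look like the following sketch. It is a Python model on a bytearray standing in for memory, with little-endian words assumed; the function names mirror the ones in the post, but this is a hypothetical illustration and not the production code.

```python
import struct

LDR_PC_PC_MINUS_4 = 0xE51FF004

def monkey_patch(memory, their_func, our_func):
    """Patch two words at their_func and return the bytes they replaced."""
    saved = bytes(memory[their_func:their_func + 8])
    struct.pack_into("<II", memory, their_func, LDR_PC_PC_MINUS_4, our_func)
    return saved

def monkey_unpatch(memory, their_func, saved):
    """Put the original two words back."""
    memory[their_func:their_func + 8] = saved

memory = bytearray(range(32))         # arbitrary "original code"
original = bytes(memory)
saved = monkey_patch(memory, 8, 0x8000)
patched = bytes(memory)
monkey_unpatch(memory, 8, saved)
restored = bytes(memory)
```

After monkey_unpatch the image is byte-for-byte identical to the original, which is what a checksum over the code needs to see.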
There is one special case where this can be done painlessly: when the
target function is exported from a shared object. You just patch the
import table and you're done. Obviously, you won't catch calls from
within the SO which contains the function, but you may not need to.
There are a number of OpenGL and Direct3D debuggers which do this to
hook API calls and track resources, state etc. When your function is not
a SO export, creative programming to the rescue!
On a machine with fixed instruction size I suppose you can just end
your hook function with some placeholder bytes followed by a jump after
the overwritten bytes in the original function. Copy the original bytes
in the placeholder and as long as your hook leaves the machine state the
same way it found it, it should work.
On messier machines like the PC, where you would need to write a
mini-disassembler to know how many bytes to copy, something like this
may work:
- patch the function with a jmp to your hook, saving the bytes you've
overwritten
- at the end of your function, put the original bytes back, save the
return address somewhere, change it to a "rehook" function, jmp to the
start of the original function
- when the original function finishes, it will return to your
"rehook" function, which will reinsert the hooking jmp instruction and
jmp to the saved return address
It should be noted that writing a bit of code which understands the
instruction set just enough to figure out where the next instruction
boundary is after the inserted jmp opcode is not too difficult (at least
on PC). You can even steal it from something like kkrunchy, and it's
probably going to take less time to implement than this rehooking
business.
Or, assuming your CPU has a one-byte software breakpoint instruction
and supports single-stepping:
- stick a software breakpoint at the top of the function
- do whatever you need to do inside the breakpoint handler (however
horrid that sounds)
- put the original byte back, turn on the single-step flag, resume
the function
- the single step handler will be called after the first instruction
in the function, saving you the trouble of figuring out how many bytes
it was. Reinsert the software break now to rearm the hook.
- man up and use printf debugging from now on, because this will
confuse the hell out of any debugger. Also, "speedy" is not a word that
can be used to describe the performance of this solution. :)
I guess this thing can be done even if you don't have one-byte
breakpoint opcodes by repeatedly single stepping until you are past the
overwritten bytes, but I sincerely hope no one has ever made such a
CPU.
Or, even more repulsive stuff:
- if you have an MMU, turn off the execute bit for the page
containing the start of the function and run the hook from inside the
fault handler when the IP points to the location you care about.
- if you have hardware breakpoints, you can use those. The PC has 4
breakpoint registers, so with this you can hook at most 4 functions.
This is just a brain dump, I never tried any of it (except for import
table patching for Windows DLLs).
Incidentally this is how break points and 'edit and continue' work in
MS visual studio... Also, in the case of linked in calls you can patch
the address rather than the code, either in the executable data or at
runtime. Useful for debugging if you need the original function left
intact. :)
Wow, it's funny how paragraph formatting makes the Story of Mel look
so much more... boring. Here's the version I think of as
canonical:
http://www.catb.org/jargon/html/story-of-mel.html
About the C++ thing: If you only need to modify the behavior of
existing objects that you did not create yourself (as in the python
example at the start of the article), there should be a simpler way to
do it:
Derive from TheirClass and override their_method as desired. Then
overwrite the vptr of obj with the one from an object of your new
derived class.
This should be more portable as you don't need to mess around with
protected memory and vtable indices. Of course you still have to assume
the location of the vptr and worry about compiler optimizations.
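The closest Python analog of swapping an object's vptr is assigning to its __class__ attribute, which Python permits for ordinary classes with compatible layouts. A minimal sketch (OurClass is a made-up name):

```python
class TheirClass:
    def their_method(self):
        return "them"

class OurClass(TheirClass):
    def their_method(self):
        return "us"

obj = TheirClass()                    # an object we did not create as OurClass
other = TheirClass()

obj.__class__ = OurClass              # re-point this one object only

swapped = obj.their_method()
unaffected = other.their_method()
```

Unlike patching the class or the vtable itself, this changes only the objects that are explicitly re-pointed: swapped is "us" while unaffected is still "them".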