There's a simple way to make your builds all of the following:
- Reproducible/deterministic - same binaries always built from the same source, so you can cache build outputs across users
- Debuggable - gdb, sanitizers, Valgrind, KCachegrind, etc. find your source code effortlessly
- Fast - the build time overhead is negligible, even compared to a blazing fast linker like mold
What makes it really fast is a small Rust program called refix that post-processes your build outputs (if you don't want to compile from source, here's a static Linux binary.) Both the program and this document are written for the context of C/C++ source code compiled to native binaries. But this can work with other languages and binary formats, too, and it should be easy to support them in refix. (In fact, it mostly supports them already... you'll see.)
This "one weird trick" isn't already popular - not because the solution is hard, nor because the problem isn't painful. Rather, it's unpopular because people widely consider it impossible for builds to be both debuggable and reproducible, and standardize on workarounds instead. Since "established practices" are sticky, and especially so in the darker corners like build systems1, we'll need to discuss not only how to solve the problem, but also why solve it at all.
The curious case of the disappearing source files
Why are people so willing to give up their birthright - the effortless access to the source code of a debugged program? I mean, build a "Hello, world" cmake project, and everything just works: gdb finds your source code, assert prints a path you can just open in an editor, etc. "Source path" isn't even a thing.
Later on, the system grows, and the build slows down. So someone implements build artifact caching, in one of several ways:
- A general-purpose distributed build cache, like Bazel's
- Something for caching specific kinds of artifacts, like ccache
- An entirely home-grown system - like running the build of user X in a build directory left previously by user Y at the build server's local disk (and hoping that their source code is similar enough, so most object files needn't be rebuilt2)
In any case, now that you need caching, you also need reproducible builds. Otherwise, you'd cache object files built by different users, and you'd get different file paths and other stuff depending on which user built each object file. And we can all agree that build caches are important, and pretty much force you to put relative paths into debug information and the value of __FILE__ (and some meaningless garbage into __TIME__, etc.)
But we can also agree that the final binaries which users actually run should have full source paths, right? I mean, I know there are workarounds for finding the source files. We'll talk about them later; I'd say they don't really work. Of course, the workarounds would be tolerable if they were inevitable. But they aren't.
Why not fix the binary coming out of the build cache, so it points to the absolute path of the source files? (The build system made an effort to detach the binary from the full source path, so that it can be cached. But now that the binary has left the cache, we should "refix" it back to the source path of the version where it belongs.)
We'll look at 3 ways of refixing the binary to the source path - a thesis, an anti-thesis and a synthesis, as it were.
Thesis: debugedit - civilized, standard and format-aware
A standard tool for this is debugedit. The man page example does exactly the "refixing" we're looking for:
```
debugedit -b `pwd` -d /usr/lib/debug files...
```

Rewrites the path compiled into the binary from the current directory to /usr/lib/debug.
Some Linux distributions use debugedit when building from source files in some arbitrary location, to make the debug info point to wherever the source files are installed when someone downloads them to debug the program.
If debugedit works for you, problem solved. It works perfectly when it does. However, when I tried it on a 3GB shared object compiled from a C++ code base3, it ran for 30 seconds, and then crashed. If you, too, find debugedit either slow or buggy for your needs, read on.
Anti-thesis: sed - nasty, brutish, and short
Why is debugedit's job hard (slow and bug-prone)? Mainly because it needs to grow or shrink the space reserved for each replaced string. When you do such things, you need to move a lot of data (slow), and adjust who-knows-which offset fields in the file (bug-prone.)
But what if the strings had the same length? Then we don't need to move or adjust anything, and we could, erm, we could replace them with sed.
Here, then, is our nasty, brutish, and short recipe:
- Run gcc with these flags:

  ```
  -fdebug-prefix-map==MAGIC  # for DWARF
  -ffile-prefix-map==MAGIC   # for __FILE__
  ```

- Make MAGIC long enough for any source path prefix you're willing to support.
- Why the == in the flag? This invocation assumes that file paths are relative, so it remaps the empty string to MAGIC, meaning dir/file.c becomes MAGICdir/file.c. You can also pass =/prefix/to/remap=MAGIC, if your build system uses absolute paths.
- Use sed to replace MAGIC with your actual source path in the binary outputted by the build system.
- If the source path is shorter than the length of MAGIC, pad it with forward slashes: /////home/user/src/. If the source path is too long, the post-link step should truncate it, warn, and eventually be changed to outright fail. You don't really need to support giant paths.
Our post-link step thus becomes:

```
sed -i 's/MAGIC/\/\/\/...\/user\/src\//g' binary
```
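To make the recipe concrete, here's a minimal sketch of the whole flow; the directory names, the MAGIC value and its length are made up, and your build system will of course run the compiler for you:

```bash
# The placeholder prefix; make it as long as the longest source path you'll support.
MAGIC=MAGICMAGICMAGICMAGICMAGICMAGICMAGICMAGICMAGIC   # hypothetical: 45 bytes

# Compile with relative paths remapped to $MAGIC (for DWARF and for __FILE__):
gcc -g -fdebug-prefix-map==$MAGIC -ffile-prefix-map==$MAGIC -c dir/file.c -o file.o
gcc -g file.o -o binary

# Post-link: left-pad the real source prefix with '/' up to the length of $MAGIC
# (a real script should warn and truncate if the prefix is longer), then substitute in place.
SRC=/home/user/src/
PADDED=$(printf '%*s' "${#MAGIC}" "$SRC" | tr ' ' '/')
sed -i "s|$MAGIC|$PADDED|g" binary
```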
The downside, on top of the source path length limit, is a trace of the brutishness making it into the output file. Namely, you're going to see these extra forward slashes in some situations. We can't pad a prefix with an invisible character... luckily, we can pad it with a character not changing the meaning of the path.
On the upside, compared to debugedit, the method using sed is:
- More widely applicable - it, erm, "supports" all executable and debug information formats, as well as archives and object files.
- More robust - not affected by input format complexity
- Faster - 10 seconds to process the 3GB binary (about the time it takes mold to link that binary... yes, it's that good!)
Is this fast enough? Depends on your binary sizes. If yours are big and you don't want to effectively double the link time, our next and last method is for you.
Synthesis: refix - nasty, brutish, and somewhat format-aware
Can we go faster than sed? We have two reasons to hope so:
- sed is unlikely to be optimized specifically for replacing strings of equal size; it's not that common a use case.
- We don't actually need to go through the entire file. File paths only appear in some of the sections - .rodata where strings are kept, and debug info sections. If we know enough about the file format to find the sections (which takes very little knowledge), we can avoid touching most of the bytes in the file.
But wait, isn't the giant binary built from C++ mostly giant because of the debug info? Yes, but it turns out that most of the debug info sections don't contain file paths; only .debug_line and .debug_str do, and these are only about 10% of our giant file.
So the refix program works as follows:
- It mmaps the file, since it knows it never needs to move the data and can just overwrite the strings in place.
- For ELF files, it finds .rodata, .debug_line and .debug_str, and searches & replaces only within these. This handles executables, shared libraries (*.so) and object files (*.o).
- For ar archives, it finds the ELFs within the archive, then the sections it cares about within each ELF, and searches & replaces within these. This handles lib*.a.
- For files which are neither ELFs nor archives of ELFs, refix just replaces everywhere as sed would, but still faster because it's optimized for the same-sized source & destination strings case.
Thus, refix is:
- Very fast - 50 ms on the 3GB binary, and 250 ms on the same binary in "sed mode" (meaning, if we remove the ELF magic number, so refix is forced to replace everywhere and not just in the relevant sections.)
- Widely applicable - works on any file format where the file prefix isn't compressed and is otherwise "laid bare"
- Robust - while it knows a bit about the binary file format, it's very, very little (enough to find the sections it's interested in); it's hundreds of lines of code vs debugedit's thousands. And you can always make it run even less code by falling back to "sed mode."
...with the sole downside being that, same as with sed, you might occasionally see the leading slashes in pathnames.
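For completeness, here's what the post-link step might look like with refix instead of sed. I'm assuming the basic invocation has the same shape as the --section example shown further down - refix file old-prefix new-prefix - and that, as with sed, you pass the destination prefix already padded; the file names and prefixes below are made up, so check the tool's documentation for the details:

```bash
# Hypothetical invocations; $MAGIC and $PADDED as in the sed recipe above.
refix binary "$MAGIC" "$PADDED"      # executable or shared library
refix libfoo.a "$MAGIC" "$PADDED"    # ar archive: each ELF member gets refixed
```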
That's it, right? We're done? Well, maybe, but it's not always how it goes. People have questions. So here we go.
Q & A
Why do this? We already have a system for finding the source code.
First of all, it is worth saying that you shouldn't have any "system" for finding source code, because the tired, stressed developer who was sent a core dump to urgently look at is entitled to having at least this part work entirely effortlessly4.
But also, whatever system you do have is bound to have issues:
- If you do not modify the cacheable, reproducible binaries coming out of the build system, then by definition your way to find source code must rely on something inherent to a given source version, independent of who built it and where. Since you're not going to embed the entire source code into the executable, you must rely on some sort of version information. What if the program had uncommitted changes, which happens in debugging scenarios (someone built a version to test and someone else sent a core dump from this version?)
- "Well you're not supposed to get core dumps from versions with uncommitted changes, unless it's your local version that you haven't given to anyone but are testing locally, so you know which version it is. You should only release versions externally thru CI" - so giving anything to anyone to test is now considered "releasing externally" and must necessarily go thru CI, and having trouble finding the source code is now a punishment for straying from proper procedure? How did this discussion, which started at how build caches speed up the build, deteriorate to the point where we're telling developers to change how they work, in ways which will slow them down?
- But OK, let's say I didn't "release" anything - instead I have 5 local versions I'm working with and they go thru test flows and dump core - I'm now supposed to guess which core comes from which version, or develop my own "system" to know? (Some people actually assume this won't happen because you can't run tests outside CI anyway, so you will submit a merge request in order to run them. And they assume that because they use some testing infra intertwined with CI infra and most of their tests technically can't run outside CI. And perhaps they don't even have machines to run on that aren't managed by Jenkins or some such to begin with. But that is a horror story for another time. Here I'll just assume that we agree that it's good to be able to test changes locally and debug easily.)
- In the cases where the version info actually enables you to find the right code, the process can be made more tolerable by developing a gdb Python extension that automatically tells gdb where the source code is based on the embedded version info. Do you have this extension and a team maintaining it together with the build system?
Python extension that automatically tells gdb where the source code is based on the embedded version info. Do you have this extension and a team maintaining it together with the build system? - Do you also have this automated for all the other tools (sanitizers, Valgrind, KCachegrind, VTune, whatever)? Do they all even have a way to tell them where to look for source code? Is there a team handling this for all users, for every new tool used by developers?
I realize that these pain points aren't equally relevant to all organizations, and the extent of their relevance depends a lot on the proverbial software lifecycle. (They also aren't equally relevant to everyone in a given organization. I claim that the people suffering the most from this are the people doing the most debugging, and they are quite often very far removed from any team that could ameliorate their suffering by improving "the system for finding source code" - so they're bound to suffer for a long time.)
My main point, however, is that you needn't have any of these pain points at all. There's no tradeoff or price to pay: your build is still reproducible and fast. Just make it debuggable with this one weird trick!
(Wow, I've been quite composed and civil here. I'm very proud of myself. Not that it's easy. I have strong feelings about this stuff, folks!)
What about non-reproducible info other than source path (time, build host, etc)?
I'm glad you asked! You can put all the stuff changing with every build into a separate section, reserved at build time and filled after link time. You make the section with:
```c
char ver[SIZE] __attribute__((section(".ver"))) = {1};
```
This reserves SIZE bytes in a section called .ver. It's non-const deliberately, since if it's const, the OS will exclude it from core dumps (why save data to disk when it's guaranteed to be exactly the same as the contents of the section in the binary?) But you might actually very much want to look at the content of this section in a core dump, perhaps before looking at anything else. For instance, the content of this section can help you find the path of the executable that dumped this core!5
(How do you find the section in the core dump without having an executable which the debugger could use to tell you the address of ver? Like so: strings core | grep MagicOnlyFoundInVer. Nasty, brutish, and short. The point is, having the executable path in the core dump is an additional and often major improvement on top of having full source paths in the executable... because you need to find the executable before you can find the source!)
Additionally, our ver variable is deliberately initialized with one 1 followed by zeros, since if it's all zeros, then .ver will be a "bss" section, the kind zeroed by the loader and without space reserved for it in the binary. So you'd have nowhere to write your actual, "non-reproducible" version info at a post-link step.
After the linker is done, you can use objcopy to replace the content of .ver. But if you're using refix, which already mmaps the file, you can pass it more arguments to replace ELF sections:

```
refix exe old-prefix new-prefix --section .ver file
```

refix will put the content of file into .ver, or fail if the file doesn't have the right length. (We don't move stuff around in the ELF, only replace.)
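To make the post-link fill concrete, here's a hypothetical sketch; the file names and the 512-byte size are made up (the size must match SIZE in the declaration above), and the contents are just an example of what you might record:

```bash
# Write the non-reproducible info, starting with a magic string you can grep for in core dumps:
printf 'MagicOnlyFoundInVer exe=%s host=%s time=%s' "$(pwd)/exe" "$(hostname)" "$(date)" > ver.bin
truncate -s 512 ver.bin   # pad with zeros to the reserved size (must equal SIZE)

# Either with stock binutils...
objcopy --update-section .ver=ver.bin exe
# ...or let refix do it while it's already refixing the source prefix:
refix exe old-prefix new-prefix --section .ver ver.bin
```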
What about compressed debug sections?
What about them? I'm not sure why people use them, to be honest. I mean, who has so many executable files which they don't want to compress as a whole (because they need to run them often, I presume), but they do want to compress the debug sections to save space? Like, in what scenario is this your way to save enough space to even worry about it?
But, they could be supported rather nicely, I think, if you really care. You wouldn't be able to just blithely mmap a file and replace inside it without updating any offset field in the file, but I think you could come close, or rather stay very far away from the kind of seriously heavy lifting that makes this slow and bug-prone. Let's chat if you're interested in this.
(I think maybe one problem is that some build caches have a file size limit? Like, something Bazel-related tops out at 2GB since it's the maximal value of the Java int type?.. Let's talk about something else, this is making me very sad.)
It's 250 ms on generic data. And you still did the ELF/ar thing to get to 50 ms. Are you insane?
Well, it's 250 ms on a fast machine with a fast SSD. Some people have files on NAS, which can slow down the file access a lot. In such cases, accessing 10x less of the mmapped data will mitigate most of the NAS slowdown. You don't really want to produce linker output on NAS, but it can be very hard to make the build system stop doing that, and I want people stuck in this situation to at least have debuggable binaries without waiting even more for the build. So refix is optimized for a slow filesystem.
But also, if it's not too much work, I like things to be fast. Insane or not, the people who make fast things are usually the people who like fast things, by themselves and not due to some compelling reason, and I'm not sure I'm ashamed of maybe going overboard a bit; better safe than sorry. Like, I don't parse most of the ELF file, which means I don't use the Elf::parse method from the goblin library, but instead I wrote a 30-line function to parse just what I need.
This saves 300-350 ms, which, is it a lot? - maybe not. Will it become much more than that on a slower file system? I don't know; it takes less time to optimize the problem away than to answer this question. Did I think of slow file systems when doing it? - not as much as I was just annoyed that my original C++ program, which the Rust program is a "clean room" open source implementation of, takes 150 ms while the Rust one took about 400 ms. Am I happy now that I got it down to 50 ms? Indeed!
(Why is Rust faster? Not sure; I think, firstly, GNU memmem is slower than memchr::memmem::Finder, and secondly, I didn't use TBB in C++ but did use Rayon in Rust, because the speedup is marginal - you bottleneck on I/O - and I don't want to complicate the build for small gains, but in Rust it's not complicated - just cargo add rayon.)
It often takes less time to just do the efficient thing than it takes to argue about the amount it would save relative to the inefficient thing. (But it's still more time than just going ahead and doing the inefficient thing without arguing. But even that is not always the case. But most people who make fast things will usually just go for the efficient thing when they see it, regardless of whether it's the case, I think. IME the people who always argue about whether optimizations are worth it make big and slow things in the end.)
I'm as crazy as you, and I want this speedup for non-ELF executable formats.
Let's chat. The goblin library probably supports your format - shouldn't take more than 100-150 LOC to handle this in refix.
Which binaries should I run this stuff on?
Anything delivered "outside the build system" for the use of people (who run programs / load shared libraries) or other build systems (which link code against static libraries / object files.) And nothing "inside the build system", or it will ruin caching.
I hope for your sake that you have a monolithic build where you build everything from source. But I wouldn't count on it; quite often team A builds libraries for team B, which gets them from Artifactory or something wonderful like that. In that case, you might start out with a bug where some libraries are shipped with the MAGIC as their source prefix instead of the real thing. This is easy to fix though, and someone might even remind you with "what's this weird MAGIC stuff?"
(Somehow nobody used to ask "what's /local/clone-87fg12eb/src", when that was the prefix instead of MAGIC. Note that even if you have this bug and keep MAGIC in some library files, nobody is worse off than previously, when it was /local/clone-87fg12eb/src. And once you fix it, they'll be better off.)
CI removes the source after building it. What should the destination source prefix be?..
And here I was, thinking that it's the build cache not liking absolute paths that was the problem... It turns out that we have a bigger problem: the source is just nowhere to be found! /local/clone-87fg12eb/src is gone forever!
But actually, it makes sense for CI to build on the local disk in a temporary directory. In parallel with building, CI can export the code to a globally accessible NAS directory. And at the end of the build, CI can refix the binaries to that NAS directory. It's not good to build from NAS (or to NAS) - it's not only slow, but fails in the worst ways under load - which is why a temporary local directory makes sense. But NAS is a great place for debugging tools to get source from - widely accessible with no effort for the user.
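Here's a sketch of what such a post-build CI step could look like; the directory layout, variable names and build command are all made up, and the padding logic is the same as in the sed recipe above:

```bash
# Hypothetical CI step: build locally, export the source to NAS, refix the deliverables.
GIT_SHA=$(git rev-parse HEAD)
SRC_EXPORT=/nas/sources/myrepo/$GIT_SHA/

# Export the source to a globally accessible NAS path, in parallel with the local build:
rsync -a --exclude=.git ./ "$SRC_EXPORT" &
make -j"$(nproc)"          # the build itself compiles with the MAGIC prefix flags
wait

# Point the deliverables at the exported source; $MAGIC is the placeholder used at compile time,
# and $SRC_EXPORT gets left-padded with '/' to $MAGIC's length, as before:
PADDED=$(printf '%*s' "${#MAGIC}" "$SRC_EXPORT" | tr ' ' '/')
refix build/myprogram "$MAGIC" "$PADDED"
```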
Many organizations decide against NAS source exports, because it would be too easy for developers. Instead you're supposed to download the source via HTTP, which is much more scalable than NAS, thus solving an important problem you don't have; plus, you can make yourself some coffee while the entire source code (of which you'll only need the handful of files you'll actually open in the debugger) is downloaded and decompressed.
In that case, your destination source prefix should be wherever the user downloads the files to. Decide on any local path independent of the user name, and with version information encoded in it, so multiple versions can be downloaded concurrently. Have a nice cup of coffee!
What should the root path length limit be?
100 bytes.
Our CI produces output in /filer/kubernetes/docker/gitlab/jenkins/pre-commit/department/team/developer/branch-name/test-suite-name/repo/, which is 110 bytes.
Great! Now you have a reason to ask them to shorten it. I'm sure they'll get to it in a quarter or two, if you keep reminding.
Our CEO's preschooler works as a developer, insists on a 200 byte prefix, and won't tolerate the build failing.
Then truncate the path without failing the build. He won't find the source code easily, but he can't find it easily today, either. If there's one thing fixing the problem won't do, it's making anyone worse off. It can't make you worse off, since the current situation leaves it nowhere worse to take you. It could only possibly take you from never being able to easily find the source to sometimes, if not always, being able to find it.
Conclusion
Use refix, sed or debugedit to make your fast, reproducible builds also effortlessly debuggable, so that it's trivial to find the source given an executable - and the executable given a core dump.
And please don't tell me it's OK for developers to roam the Earth looking for source code instead. It hurts my feelings!
Thanks to Dan Luu for reviewing a draft of this post.