<?xml version='1.0' encoding='UTF-8'?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0">
  <channel>
    <title>Yossi Kreinin</title>
    <link>https://yosefk.com/blog</link>
    <description>Worse is better</description>
    <docs>http://www.rssboard.org/rss-specification</docs>
    <generator>Yossi Kreinin's ugly publishing software</generator>
    <image>
      <url>https://yosefk.com/blog/self.jpg</url>
      <title>Yossi Kreinin</title>
      <link>https://yosefk.com/blog</link>
      <width>144</width>
      <height>144</height>
    </image>
    <language>en</language>
    <lastBuildDate>Wed, 06 May 2026 16:00:01 +0000</lastBuildDate>
    <item>
      <title>A layman's view of the economy</title>
      <link>https://yosefk.com/blog/a-laymans-view-of-the-economy.html</link>
      <description><![CDATA[<p>First of all, I proudly present a 2-minute short that I animated!</p>
<div class="video_16_9" style="position: relative;overflow: hidden;width: 100%;padding-top: 56.25%;">
<iframe class="responsive_iframe" allowfullscreen="allowfullscreen" src="https://player.vimeo.com/video/171368757" style="position: absolute;border: 0;top: 0;left: 0;bottom: 0;right: 0;width: 100%;height: 100%;">
</iframe>
</div>
<p>...And the same thing on YouTube, in case one&nbsp;loads better&nbsp;than the other:</p>
<div class="video_16_9" style="position: relative;overflow: hidden;width: 100%;padding-top: 56.25%;">
<iframe class="responsive_iframe" allowfullscreen="allowfullscreen" src="//www.youtube.com/embed/c-cOrPOHi7E" style="position: absolute;border: 0;top: 0;left: 0;bottom: 0;right: 0;width: 100%;height: 100%;">
</iframe>
</div>
<p>One thing I learned making the film is that&nbsp;my&nbsp;Russian accent colors not only my words, but any noise coming out of my mouth.
So I'm not the most&nbsp;versatile voice actor.</p>
<p>Anyway, we certainly have a <a href="https://en.wikipedia.org/wiki/Debt_crisis">debt crisis</a>, and easy credit policies
keep&nbsp;producing still&nbsp;more debt. I don't think interest rates have ever stayed so low for so long, everywhere.</p>
<p>Economists argue both for and against debt expansion [1], as they argue&nbsp;about&nbsp;everything.</p>
<p>My own take is as simple as my sparse knowledge ought to make it:</p>
<ul>
<li>Unprecedented conditions produce unprecedented outcomes.</li>
<li>Booms are usually&nbsp;gradual, and busts&nbsp;are sudden.</li>
</ul>
<p>No unusual boom has gradually arisen&nbsp;from unusual monetary policy, and it's been a while. <strong>But something unusual ought
to happen in unusual conditions!</strong> Thus one expects a sudden, unusual bust&nbsp;down the road.</p>
<p>That's it. It's&nbsp;like a physicist's&nbsp;proof [2] that one's attractiveness peaks at some distance from the observer. At the
distances of zero and infinity, visual attractiveness is zero (you can't see anything.) Therefore, attractiveness as a function
of distance has a maximum&nbsp;<em>somewhere</em> in between. True, kinda, and it didn't take a lot of insight into the nature of
attractiveness – much like my peak debt proof doesn't require an&nbsp;understanding of the economy [3].</p>
<p>Will&nbsp;today's "Brexit" trigger&nbsp;the global&nbsp;downturn predicted by Yossi Kreinin's&nbsp;Rule of Unprecedented Conditions? Probably&nbsp;not
by itself. I think it's a symptom more than a cause [4], and&nbsp;the big bad thing comes&nbsp;later.</p>
<p>In the meantime, here's hoping that my little film (started when "Grexit" was a thing, completed just in time for Brexit)
was funnier than the average <a href="https://www.reddit.com/r/forwardsfromgrandma/">forward from grandma</a>&nbsp;[5].</p>
<p>Happy Brexit! And if you follow people on Twitter, there's a strong case for <a href="https://twitter.com/YossiKreinin">following me</a> as well.</p>
<p>[1] Bibliography:&nbsp;<a href="http://www.nytimes.com/2015/02/09/opinion/paul-krugman-nobody-understands-debt.html">Nobody
Understands Debt</a>&nbsp;except Krugman;&nbsp;<a href="https://www.ced.org/blog/entry/does-krugman-understand-debt">Does Krugman
Understand Debt?</a></p>
<p>[2] I&nbsp;think a&nbsp;particular famous&nbsp;physicist said it, but I forget who.</p>
<p>[3] ...and I can't say I&nbsp;have any understanding of the economy. That said, I've owed and paid off a lot of debt, and got to
negotiate with many bankers. And I can tell you that "debt is money we owe to ourselves", Krugman's catchphrase, feels
unconvincing to&nbsp;creditors – as&nbsp;many people and whole nations have&nbsp;found out.</p>
<p>[4] In fact, I just got an email from&nbsp;an&nbsp;asset manager saying that&nbsp;it's good <em>for the UK</em> in the&nbsp;longer run, elevating
Brexit from a symptom to a cure. But he didn't say "good <em>for everyone</em>", and anyway I'm not sure his crystal ball is
better than yours or mine.</p>
<p>[5] I linked to /r/forwardsfromgrandma since,&nbsp;regardless of the&nbsp;politics of either its members or their grandmas, I ought to
give credit for&nbsp;the brilliant term – it's definitely funny because it's true. I've watched many relatives acquire the habit
of&nbsp;forwarding&nbsp;various wingnut stuff&nbsp;as they age. Most&nbsp;frighteningly, my own urge to email such things&nbsp;gets harder to resist
every year. I can sense&nbsp;my own ongoing grandmafication; between you and me, an animated short about debt might be a part of
"it." Scary, scary stuff.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/a-laymans-view-of-the-economy#comments</comments>
      <pubDate>Fri, 24 Jun 2016 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/a-laymans-view-of-the-economy.feed</wfw:commentRss>
    </item>
    <item>
      <title>Looking for senior IT/DevOps people</title>
      <link>https://yosefk.com/blog/looking-for-senior-itdevops-people.html</link>
<description><![CDATA[<p>I wouldn't spam you with these job offers if they didn't work :-) So, we're looking for senior IT people to work at our Jerusalem
offices – <strong>managers and hands-on people alike</strong>.&nbsp;We have&nbsp;rapid growth, "Big Data" (it definitely <a href="https://twitter.com/devops_borat/status/288698056470315008?lang=en">is crash Excel</a>&nbsp;-&nbsp;in fact,&nbsp;at one point it was
close to physically crashing through the floor&nbsp;due to the storage servers'&nbsp;weight, but luckily that's been handled), "HPC"
(biggish server farms, distributed build &amp; tests, etc.),&nbsp;and many other buzzwords [1].&nbsp;I don't know where IT ends and DevOps
starts but I guess a good candidate could have&nbsp;either in their CV, so there.</p>
<p>If you have qualified friends looking for a&nbsp;challenging, well-paying job at a fun place, send their CVs, the sooner the
better – we're in a hurry (rapid growth!), so early birds are&nbsp;more likely to get the can of worms. As always, "challenging" is a
downside as much as an&nbsp;upside&nbsp;(a place where IT means Exchange, SAP and little else might pay very well for a more predictable
and less demanding job.)</p>
<p>We value experience in building and maintaining non-trivial systems, and technical reasoning (X happens because of Y, Z is
most&nbsp;efficient if you use it to do W, etc.) We also value&nbsp;experience in higher-level areas such as management and purchasing,
and business reasoning (don't hook X and&nbsp;Y together since their vendors compete&nbsp;and will sabotage the project,&nbsp;Z beats W in
terms of total cost of ownership, etc.) We do kinda lean towards thinking of technical&nbsp;aptitude as a cornerstone on top of which
solid&nbsp;higher-level expertise is built. (We've seen managers snowed by vendors, reports, etc., which&nbsp;is a perennial problem in
tech at large and isn't restricted to IT.)</p>
<p>If you'd like to hear more details, please email Yossi.Kreinin@gmail.com</p>
<p>[1] What we don't have is a heavy-duty web site/application, which might make the position less relevant for some.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/looking-for-senior-itdevops-people#comments</comments>
      <pubDate>Thu, 30 Jun 2016 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/looking-for-senior-itdevops-people.feed</wfw:commentRss>
    </item>
    <item>
      <title>The habitat of hardware bugs</title>
      <link>https://yosefk.com/blog/the-habitat-of-hardware-bugs.html</link>
      <description><![CDATA[<p>The&nbsp;Moscow apartment which little me called home was also home to many other creatures, from smallish cockroaches to biggish
rats. But of course we rarely met them face to face. Evolution has weeded out those animals imprudent enough to crash
your dinner. However, when we moved a cupboard one time, we had the pleasure of meeting a few hundred fabulously evolved
cockroaches.</p>
<p>In this sense, logical bugs aren't different from actual insects. You won't find bugs&nbsp;under the spotlight, because they get
fixed under the spotlight, crushed like a cockroach on the dinner table.&nbsp;But in darker nooks and crannies, bugs&nbsp;thrive and
multiply.</p>
<p>When hardware malfunctions in a single, specific way, software running on it usually&nbsp;fails in several different, seemingly
random ways, so it sucks to debug it. Homing in on the cause is easier if you can guess&nbsp;which parts of the system are more
likely to be buggy.</p>
<p>When hardware fails, nobody wants a programmer treating it&nbsp;as a lawyer or a mathematician (the&nbsp;hardware broke the contract!
only working hardware&nbsp;lets us&nbsp;reason about software!) Instead, the key to success is approaching it as&nbsp;a pragmatic&nbsp;entomologist
knowing where bugs live.</p>
<p>Note that I'm mostly talking about design bugs, not random&nbsp;manufacturing defects. Manufacturing defects can occur absolutely
anywhere. If you're in an industry where you can't toss a faulty&nbsp;unit into the garbage can, but instead must find the specific
manufacturing defect in every reported bad&nbsp;unit, I probably can't tell you anything new, but I can offer you my deepest
sympathy.</p>
<h2 id="cpus">CPUs</h2>
<p>CPUs are the perfect illustration of the "spotlight vs nooks and crannies" principle. In CPUs, the spotlight, where it's hard
to find&nbsp;bugs, is functionality accessible to userspace programs - data processing,&nbsp;memory access and control
flow&nbsp;instructions.</p>
<p>Bugs are more likely in those parts of the CPU only accessible to operating systems and drivers - and used more by OS
kernels than drivers. Stuff like memory protection, interrupt handling, and other privileged instructions. You can sell a buggy
CPU if it doesn't break too many commercially significant, hard-to-patch programs - and there aren't many important OS kernels,
so a lot of scenarios are never triggered by them.</p>
<p>A new OS kernel might bump into the bug, of course, but at that point, it's&nbsp;the programmer's problem. A friend who wrote a
small real-time operating system had to&nbsp;familiarize&nbsp;himself with several errata items, and was the first to report some of these
items.</p>
<p>It should be noted that an x86 CPU ought to be way less buggy in the privileged areas than the average embedded CPU. That's
because it's more <em>compatible</em> in the privileged areas than almost any other CPU. AFAIK, today's x86 CPUs will still run
unmodified&nbsp;OS binaries from the 80s and 90s.</p>
<p>Other CPUs are not like that. I recall that ARM has 2 instructions, MCR and MRC (Move Register to/from Co-processor), and the
meaning of those instructions depends on their several constant arguments. An MCR can flush the cache or program the memory
protection unit or do other things - a bit like a hypothetical CALC instruction where CALC 0 does addition, CALC 1 subtracts,
CALC 2 multiplies, etc. My point isn't that MCR and MRC look cryptic in assembly code, but that <em>the meaning changes between
ARM generations</em>. MIPS is similar, except they're called MFC0 and MTC0, Move From/To Coprocessor 0.</p>
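<p>To make this concrete, here's a minimal sketch, assuming GCC inline assembly and an ARMv5-class core: the mnemonic tells you nothing - the coprocessor number and the constant operands carry the entire meaning, and that meaning can shift between cores.</p>
<pre><code>/* Invalidate the entire instruction cache on an ARMv5-class core:
   coprocessor 15, registers c7/c5, opcodes 0/0. Nothing in the "mcr"
   mnemonic hints at this - and another ARM generation may assign the
   same constants a different meaning. */
static inline void icache_invalidate_all(void)
{
    __asm__ volatile("mcr p15, 0, %0, c7, c5, 0" : : "r"(0) : "memory");
}
</code></pre>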
<p>These incompatibilities do&nbsp;not break userspace programs, which can't execute any of these instructions -&nbsp;but the OS needs to
be tweaked to support a new core. If a new core introduces a bug in a privileged instruction, <em>that doesn't break old OS code
any more than it's already broken</em> by ISA incompatibilities. Updating OS code is the perfect opportunity to also work around
fresh hardware bugs.</p>
<p>x86 chips also run more OSes than chips based on most other architectures. For instance, a now-defunct&nbsp;team making a fairly
widespread ARM-based&nbsp;application processor had to port about 3 versions of Linux (is there a chip maker who likes Linux with its
endless versions and having to port it themselves? Or do they secretly wish they could tell Linus Torvalds what he publicly said
to NVIDIA, namely, "fuck you"?) They also supported OS vendors in the porting of Windows and QNX. Overall, the chip probably
only ever ran 5 full-blown OSes. x86 chips need to run endless OS builds - often built from very similar source code, but still.</p>
<p>The same principle applies to all hardware. <strong>It's bug-free if and only if&nbsp;they can't sell it with bugs</strong>. If
they can sell it with bugs and make it your problem, they very well&nbsp;might.</p>
<h2 id="memory">Memory</h2>
<p>$100 says your DRAM chip works. The DRAM chip is a mindless slave implementing precise commands from the DRAM controller on the
master chip,&nbsp;without any feedback - there are no retries, no negotiation,&nbsp;no way&nbsp;to say you're sorry. And no software will run
properly on faulty DRAM. Faulty DRAM isn't a marketable product.</p>
<p>Your board is definitely buggy. They told you they checked&nbsp;signal integrity, but they lied. If DRAM malfunctions, it's
probably&nbsp;the board, or the boot code&nbsp;programming DRAM-related components in a way that doesn't work on this board.</p>
<p>In the middle, there's the DRAM controller and the PHY. You'll only see bugs there if you're a chip maker - a chip is not
marketable unless such bugs are already worked around somehow. If you are indeed a chip maker, this is when you find out why
fabless chip companies are worth so much more than the equally fabless vendors of "IPs" such as CPUs and DRAM controllers. The
short answer is that chip makers are&nbsp;exposed to most of the risk. And in your case,&nbsp;some of this risk has&nbsp;just
been&nbsp;realized.</p>
<p>A DRAM controller bug can be very damaging&nbsp;to&nbsp;the chip maker, whose engineering samples might not work and whose production
schedule might be delayed. For the DRAM controller vendor - no big deal, "we have 3 more customers affected by this bug, we must
say you're taking it unusually passionately!" This is an actual quote. I want to add something here, something describing&nbsp;what
we chip makers think of&nbsp;these people,&nbsp;but words fail me. The point is, they fix the bug and ship the fixed version to their next
customers. You get to figure out how to make your engineering samples kinda work (often lowering the DRAM frequency helps), and
perhaps how to fix the design&nbsp;without too many changes&nbsp;to&nbsp;the wafer&nbsp;masks.</p>
<p>Bottom line is, DRAM controllers and PHYs can have bugs, usually it's the chip&nbsp;maker's problem, managing this risk is not
fun.</p>
<p>The bus interconnect&nbsp;between your processors and the&nbsp;DRAM controller probably&nbsp;doesn't have bugs - not correctness bugs, at
least. That's because today it's usually produced by a code generator, and such a code generator is really hard to market if it
has bugs, because they'll manifest in so many different ways and places. I found a bug in an interconnect once, and I was very
proud of my tests, but that&nbsp;was a preliminary version, and they found the bug independently. Real, supported versions always
worked fine.</p>
<p><em>Performance</em> bugs around memory access are legion, of course, because you can totally sell products with performance
issues, at least&nbsp;up to a point. A chip can have 2&nbsp;processors with 8-byte buses each, going to a DRAM giving you&nbsp;16 bytes per
cycle, through a shared&nbsp;8-byte-per-cycle bottleneck. This&nbsp;interconnect is the handiwork of some time-starved dude on the chip
maker's team,&nbsp;armed with an interconnect-generating tool. Even such an idiotic&nbsp;issue will manifest on some benchmarks but not
others, and might not get caught at&nbsp;design time. And if you think <em>that</em> is stupid, I've heard of&nbsp;a level 2 cache which
never actually cached anything, and this fact&nbsp;happily went&nbsp;unnoticed for a few months. (Of course, <em>this</em> not being
caught at design time is when the team should start looking for a new career.)</p>
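<p>The arithmetic behind that bottleneck, as a runnable back-of-envelope sketch (the numbers come from the example above, not from any real chip):</p>
<pre><code>#include &lt;stdio.h&gt;

/* Effective per-CPU bandwidth is capped by the narrowest shared link,
   not by the CPU's own bus width and not by the DRAM. */
int main(void)
{
    int cpus = 2;
    int cpu_bus = 8;      /* bytes/cycle per CPU */
    int dram = 16;        /* bytes/cycle */
    int interconnect = 8; /* bytes/cycle, shared - the bottleneck */

    int demand = cpus * cpu_bus;                            /* 16 */
    int supply = interconnect &lt; dram ? interconnect : dram; /*  8 */
    printf("demand %d, supply %d: %d bytes/cycle per CPU\n",
           demand, supply, supply / cpus);                  /*  4 */
    return 0;
}
</code></pre>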
<p>Similarly, DRAM schedulers, supposedly very clever about optimizing DRAM performance, can in practice be really stupid, etc.
In fact, performance issues&nbsp;are among the&nbsp;hardest to pinpoint, and so are found in the&nbsp;greatest abundance in hardware and
software alike. But in a way, they aren't bugs.</p>
<h2 id="peripheral-devices">Peripheral devices</h2>
<p>Expect peripheral device controllers to be pretty shitty. There really is no reason to make them particularly good. Only
device drivers access these things, so it all concerns just a handful of programmers, and besides, working around a hardware bug
here is easier than almost anywhere else.</p>
<p>A device driver has the device all to itself, nothing can touch the hardware concurrently unless the driver lets it, and the
code can fiddle with the hardware controller all it likes, perhaps emulating some of the functionality on the CPU if necessary,
and&nbsp;doing arbitrarily complex things to work around bugs. Starting with simpler things like reading memory-mapped&nbsp;registers
twice - a workaround for&nbsp;a real old&nbsp;bug from a real vendor in the automotive space, one who huffs and puffs a lot about
reliability and safety.</p>
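<p>A minimal sketch of that "read twice" workaround (the register address is made up, and the real erratum's details differ): on the buggy part, the first read may return stale data, so you throw it away.</p>
<pre><code>#include &lt;stdint.h&gt;

#define DEV_STATUS (*(volatile uint32_t *)0x4000f000u) /* hypothetical MMIO register */

static uint32_t dev_read_status(void)
{
    (void)DEV_STATUS;  /* dummy read - may return stale data on the buggy part */
    return DEV_STATUS; /* the second read is the one we trust */
}
</code></pre>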
<p>And a lot of peripheral devices also allow some room for error at the protocol level - you can drop packets, retransmit
packets, checksums tell you if ultimately all the data was transferred correctly, you can negotiate on the protocol features,
etc. etc. All that helps work around hardware bugs, reducing the pressure to ship correct hardware.</p>
<p>Also, since few people read the spec, there's no reason to make it very clear, or detailed, or up-to-date, or fully correct,
or maintain errata properly. This is not to say that nobody does it right, just that many don't, and this shit still sells.
Nobody cares that driver programmers suffer.</p>
<p>(By the way, I'm not necessarily condemning people on the hardware side here. Some low-level programmers like to complain
about how bad hardware is, but it's not obvious how much should be invested to make the driver writer's job easy, even from a
purely economic point of view of optimally using society's resources, regardless of anyone's bottom line. If a chip is shipped
earlier at the cost of including a couple of peripheral controllers which are annoying to write drivers for, maybe it's the
right trade-off. I'm not saying that bugs should be exterminated at all cost, I'm just telling you where they live.)</p>
<h2 id="miscommunication">Miscommunication</h2>
<p>As a programmer, do not expect every device to follow protocols correctly. What will work is the CPU accessing memory, in any
way the CPU can access memory -&nbsp;with or without caching (the two ways generate vastly different bus commands.)&nbsp;But if the path
between the device doing the access and the device handling the access is a less traveled one, then you might need to do the
access in a very specific way.</p>
<p>For instance, bus protocols might mandate that access to an unmapped&nbsp;address will result in an error response. But an
unmapped address might fall into a large region which the interconnect associates with a hardware module written by some bloke.
So it routes your request to the bloke's module. The&nbsp;bloke can and will write hardware description code that checks for every
address mapped within his range,&nbsp;and then&nbsp;returns a response - but the code&nbsp;does&nbsp;<em>nothing</em> when the address is
unmapped.&nbsp;Then reading from this address will cause the CPU to hang forever, and not only a debugger running on the chip,
but&nbsp;even a JTAG probe will not tell you where it's stuck.</p>
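<p>A minimal sketch of that failure mode (the address is made up): the load goes out on the bus, the bloke's module never responds, and the CPU stalls on the load forever - no bus error, no exception, nothing for the debugger to show.</p>
<pre><code>#include &lt;stdint.h&gt;

int main(void)
{
    /* Hypothetical address: inside the region routed to the module,
       but not one of the addresses its code checks for. */
    volatile uint32_t *unmapped = (volatile uint32_t *)0x40001000u;
    uint32_t v = *unmapped; /* the bus transaction never completes - hangs here */
    return (int)v;          /* never reached */
}
</code></pre>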
<p>There are many issues of this sort - a&nbsp;commonly unsupported thing is byte access as opposed to full word access (the hardware
bloke didn't want to look at low address bits or byte masks), etc. etc. A bus protocol lawyer might be able to prove that the
hardware is buggy in the sense of not following the&nbsp;protocol properly. A programmer must call it a feature and live with it.</p>
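<p>When you hit the byte-access variant of this "feature," the usual workaround is a read-modify-write of the full word - a sketch with a hypothetical register, assuming it tolerates being read back and rewritten:</p>
<pre><code>#include &lt;stdint.h&gt;

#define DEV_CTRL (*(volatile uint32_t *)0x50000000u) /* hypothetical word-only register */

static void dev_write_low_byte(uint8_t b)
{
    uint32_t w = DEV_CTRL; /* full-word read - a byte access would misbehave */
    w = (w &amp; ~0xffu) | b;  /* replace byte 0 only */
    DEV_CTRL = w;          /* full-word write */
}
</code></pre>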
<p>As a chip maker, there's the additional trouble of hooking up two working devices whose vendors lied about the protocol subset
they support, so that the devices will not work together. For instance, a DMA engine and a cache might both "support out of order bus
responses." But the cache will return the response data interleaved at the word level, while the DMA might require responses to
be interleaved at the burst level, where the burst size is defined by the DMA's read commands.</p>
<p>The chip maker is rather unlikely to ship hardware with this sort of a bug, so by itself it's rarely a programmer's problem.
But they might make you set a bit in the DMA controller that disables the kind of requests producing out of order bus responses
when accessing certain addresses. Again you can argue whether it's a bug or a feature, but either way, if you don't set the bit,
interesting things will transpire.</p>
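<p>In code, this kind of workaround tends to be a single magic line in the driver's init path - a sketch with hypothetical register and bit names:</p>
<pre><code>#include &lt;stdint.h&gt;

#define DMA_CFG        (*(volatile uint32_t *)0x60000010u) /* hypothetical config register */
#define DMA_CFG_NO_OOO (1u &lt;&lt; 3)                           /* hypothetical "no out-of-order" bit */

static void dma_init(void)
{
    /* Forbid the request type that produces out-of-order responses;
       without this, transfers from certain address ranges misbehave. */
    DMA_CFG |= DMA_CFG_NO_OOO;
}
</code></pre>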
<h2 id="summary">Summary</h2>
<ul>
<li>Don't trust freshly designed boards</li>
<li>Don't trust peripheral controllers</li>
<li>Trust CPUs in userspace &amp;&nbsp;DRAM chips (almost&nbsp;always), and everything&nbsp;between the two (unless the chip is new &amp;
untested)</li>
<li>Expect to bump into unsupported bus&nbsp;protocol features if you do anything except accessing memory from a CPU</li>
<li>If you write your own OS, be prepared to work around CPU bugs (except perhaps on the PC)</li>
</ul>
<p><a href="low-level-is-easy.html">I wrote a long time ago</a>, and I still believe it, that lower-level programming is made
relatively&nbsp;easy by the fact that you're less likely to have bugs in your dependencies. That's because low-level bugs both&nbsp;hurt
more users and are harder to fix, and therefore people try harder to avoid them in the first place. However, this isn't equally
true for all the&nbsp;different low-level things you depend on.</p>
<p>I've described the state of things with hardware as it is in my experience, and attempted to trace&nbsp;the differences to the
different&nbsp;costs of bugs to different&nbsp;vendors. The same reasoning applies to software components -&nbsp;for instance, compilers&nbsp;are
more likely to have bugs than&nbsp;OS kernels -&nbsp;because, by definition, compiler bugs cannot break existing binaries,&nbsp;but kernel bugs
will do that. So I think it's a generally&nbsp;useful angle to look at things from.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/the-habitat-of-hardware-bugs#comments</comments>
      <pubDate>Wed, 13 Jul 2016 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/the-habitat-of-hardware-bugs.feed</wfw:commentRss>
    </item>
    <item>
      <title>Fun won't get it done</title>
      <link>https://yosefk.com/blog/fun-wont-get-it-done.html</link>
      <description><![CDATA[<p>OK, published at 3:30 AM. That's a first!</p>
<p>So. Got something you want to do over the course of a year? Here's a motivation woefully insufficient to pull it off:</p>
<ul>
<li>It's fun!</li>
</ul>
<p>What could&nbsp;give you enough drive to finish the job? Anything with a reward <em>in the future, once you're done</em>:</p>
<ul>
<li>Millions of fans&nbsp;<strong>will</strong> adore me.</li>
<li>It <strong>will</strong> be the ugliest thing on the planet.</li>
<li>I <strong>will</strong> finally understand quantum neural rockets.</li>
<li>We <strong>will</strong> see who the loser is, Todd!</li>
<li>I <strong>will</strong> help humanity.</li>
<li>I <strong>will</strong>&nbsp;destroy humanity.</li>
</ul>
<p>It doesn't matter how noble or ignoble your&nbsp;goal is. What matters is <strong>delaying gratification</strong>. Because even
your&nbsp;favorite thing in the&nbsp;world will have&nbsp;shitty bits if you chew on&nbsp;a big enough chunk of it. A few months or years worth of
work are <em>always</em> a big enough chunk, so there <em>will</em> be shitty bits. Unfortunately, it's also the minimum-sized
chunk to do anything of significance.</p>
<p>This is where many brilliant talents drown. Once you've known the joy of true inspiration, it's hard to settle for less - which you
<em>must</em> do to have any impact. Meanwhile, their thicker peers happily butcher task after task. Before you know it, these
tasks add up to an impactful result.</p>
<p>In hindsight, I was really&nbsp;lucky in that I chose a profession for money instead of love.&nbsp;Why? <strong>Stamina</strong>. Money
is a reward in the future that lets you ignore the shittier bits of the present.</p>
<p>Loving every moment of it, on the other hand, carries you until that moment&nbsp;which you <em>hate</em>, and then you need a new
sort of fuel. Believe me, I know. I love drawing and animation, and you won't believe how many times I started and stopped doing
it.</p>
<p>But the animation teacher who taught me 3D said he was happy to put textures on toilet seat models when he started out.
<em>That's</em> the kind of appetite you need – and very few people&nbsp;naturally feel that sort of attraction to toilet seats. You
need a&nbsp;big reward in the future, like "I'm going to become a pro," to pull it off.</p>
<p>But I don't want to become a pro. I don't want to work in the Israeli animation market where there's scarcely a feature
film&nbsp;made. I don't even want to work for a big overseas animation studio. I want to make something, erm, something beautiful
that I love, <strong>which is a piece of shit of a goal</strong>.</p>
<p>Because you know where I made most progress picking up actual skills? In an evening animation school, where I had a&nbsp;perfectly
good goal: survive. It's good because it's a simple, binary thing which doesn't give a rat's ass about your mood. You either
drop out or you don't. But "something I love" is fluid, and depends a lot on the mood. And&nbsp;when you hate this thing you're
making, as you sometimes will, it's hard to imagine loving it later.</p>
<p>Conversely, imagining how I don't drop&nbsp;out is easy. This is what I was imagining when sculpting this bust, which 90% of the
time I hated with a passion because it looked like crap. But I thought, "I'm not quitting, I'm not quitting, I'm not quitting,
hey, I&nbsp;get the point of re-topology in Mudbox, I'm not quitting, I'm not quitting, hey, I guess I see what&nbsp;the specular map
does, I'm not quitting... Guess I'm done!"</p>
<div class="video_16_9" style="position: relative;overflow: hidden;width: 100%;padding-top: 56.25%;">
<iframe class="responsive_iframe" allowfullscreen="allowfullscreen" src="https://player.vimeo.com/video/171365263" style="position: absolute;border: 0;top: 0;left: 0;bottom: 0;right: 0;width: 100%;height: 100%;">
</iframe>
</div>
<p>And now let's talk about beauty for a moment.</p>
<p>I'm a programmer. I like to think that I'm not the thickest, butcherest programmer, in that I understand the role of beauty
in it. To the trained eye, programs can be beautiful as much as math, physics or chess, and a beautiful program is better
<em>for business</em> than a needlessly uglier one. (Ever tried pitching the value of beauty to someone businessy? Loads
of fun.)</p>
<p>But you know why beauty is your enemy? Because it sucks the fun out of things. How? Because you're making this thing and
chances are, <strong>it's not beautiful according to your own standard</strong>. The trap is, your&nbsp;taste for beauty is usually
ahead of your&nbsp;creative ability. In any area, and then in any sub-area of that area, ad infinitum, you can tell ugly from
beautiful long before you can make something beautiful yourself. And&nbsp;even if&nbsp;you can satisfy your own taste,&nbsp;often&nbsp;the final
thing is beautiful, but not the states it goes through.</p>
<p>So&nbsp;the passionate, sensitive soul is hit twice:</p>
<ol>
<li>You're driven by fun and inspiration because you've experienced them once and now you covet them.</li>
<li>Your sense of beauty, frustrated by the state of your creation, kills&nbsp;all the fun – that very fun which&nbsp;you insist must be
your only fuel.</li>
</ol>
<p>Life is easier if you want a yacht. I think you can buy a decent one for $300K, and certainly for $1M. Now all you need to do
is make that money – it doesn't matter how – and imagining that yacht will help you do <em>anything</em> well! If you want
beauty, however, I do not envy you.</p>
<p>How do I cope with my desire for beauty?&nbsp;The first step is acknowledging&nbsp;the problem, which I do. The fact is that my worst
failures in programming came when I insisted on beauty the most. The second step is shunning beauty as a <em>goal</em>, and
making it&nbsp;into a <em>means</em> and a <em>side-effect</em>.</p>
<p>I need a program doing at least X, taking at most Y seconds, at a date not later than Z.&nbsp;I'll keep ugliness to a minimum
because ugly programs work badly. And if it comes out particularly nicely, that's great. But beauty is&nbsp;not a goal, and&nbsp;enjoying
the beauty of this program as I write it is not why I write it.</p>
<p>And if you think it's true for commercial work but not open source software, look at, I dunno, Linux. Read some <a href="http://www.h-online.com/open/features/Interview-Linus-Torvalds-I-don-t-read-code-any-more-1748462.html">Torvalds</a>:</p>
<blockquote>
<p>Realistically, every single release, most of it is just driver work. Which is <strong>kind of boring in the sense there is
nothing fundamentally interesting in a driver</strong>, it's just support for yet another chipset or something, and at the same
time that's kind of the bread and butter of the kernel. More than half of the kernel is just drivers, and so <strong>all the big
exciting smart things we do, in the end it pales</strong> when compared to all the work we just do to support new hardware.</p>
</blockquote>
<p>Boring bits. Boring bits that&nbsp;must be done to make something of value.</p>
<p>Does this&nbsp;transfer to art or poetry or any of those things&nbsp;whose whole point is beauty? Well, yeah, I think it does, because
no,&nbsp;beauty is not the whole point:</p>
<ul>
<li>The most important thing about a drawing is that it's done. Now it exists, and people can see it, and you can make
<em>another one</em>. Practice. They will not come out very well if they don't come out.</li>
<li>Often people like your&nbsp;subject.&nbsp;There's a continuum between "it's beautiful in a way that words cannot convey" and "I love
how this song&nbsp;expresses&nbsp;my favorite political philosophy." To the extent that a work of art tells a story, or even sets up&nbsp;a
mood, its beauty <em>does</em> become a means to an end.</li>
<li>Just because the end result is beautiful to the observer, and even if that's the only point, doesn't mean every step making
it was an orgy of beauty for whomever made it. Part of what goes into it is boring, technical work.</li>
</ul>
<p>So here, too, I'm trying to make beauty a non-goal. Instead my goals are "make a point" and "keep going," and I try to add
beauty, or remove ugliness, as I go.</p>
<p>For example,&nbsp;I didn't do a graduation project in the evening school, but I&nbsp;animated a short on my own in the same timeframe,
and I published it, even though it's not the beautiful thing I always dreamed about making. And&nbsp;I'm not sure anyone gets the
joke except me. (I'm not sure I get it anymore, either.)</p>
<div class="video_16_9" style="position: relative;overflow: hidden;width: 100%;padding-top: 56.25%;">
<iframe class="responsive_iframe" allowfullscreen="allowfullscreen" src="https://player.vimeo.com/video/171368757" style="position: absolute;border: 0;top: 0;left: 0;bottom: 0;right: 0;width: 100%;height: 100%;">
</iframe>
</div>
<p>Now my goal is "make another one." It's a good goal, because it's easy to imagine making another one. It's proper&nbsp;delayed
gratification.</p>
<p>And if you enjoyed programming 20 years ago and are trying to reignite the passion, I suggest that you find a goal as
worthy to you as "fun" or "beauty", but as clear and binary as a yacht. And you can settle for less worthy, but not for less
clear and binary. Because everything they told you about "extrinsic motivation" being inferior to "intrinsic motivation" is one
big lie. And this lie will&nbsp;fall apart the moment you sink your teeth into a bunch of shit, as will always happen if you're
trying to accomplish anything.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/fun-wont-get-it-done#comments</comments>
      <pubDate>Mon, 01 Aug 2016 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/fun-wont-get-it-done.feed</wfw:commentRss>
    </item>
    <item>
      <title>Hiring (self-driving algos, HLL compiler research)</title>
      <link>https://yosefk.com/blog/hiring-self-driving-algos-hll-compiler-research.html</link>
      <description><![CDATA[<p>OK, so 2 things:</p>
<p>1. If you send me a CV and its owner is hired to work on self-driving algos – machine vision/learning/mapping/navigation – I'll pay
you a shitton of money. (Details over email.) These teams want a CS/math/physics/similar degree with great grades, and they want
programming ability. They'll hire quite a lot of people.</p>
<p>2. The position below is for my team and if you refer a CV, I cannot pay you a shitton of money. But:</p>
<p><strong>We're developing an array language that we want to efficiently compile to our in-house accelerators (multiple target
architectures, you can think of it as "compiling to a DSP/GPU/FPGA.")</strong></p>
<p>Of recent public efforts, perhaps <a href="http://halide-lang.org/">Halide</a> is the closest relative (we're compiling AOT
instead of processing a graph of C++ objects constructed at run time, but I'm guessing the work done at the back-end is somewhat
similar.) What we have now is already beating hand-optimized code in our C dialects on some programs, but it's still a "blue
sky" effort in that we're not sure exactly how far it will go (in terms of the share of production programs where it can replace
our C dialects.)</p>
<p>As usual, we aren't looking for someone with experience in exactly this sort of thing (here especially it'd be hopeless since
there are few compiler writers and most of them work on lower-level languages.) Historically, the people who enjoy this kind of
work have a background in what I broadly call (mislabel?) "discrete math" - formal methods, theory of computation, board game
AI, even cryptography, basically anywhere you have clever algorithms in a discrete space that can be shown to work every
time. (Heavyweight counter-examples, each missing one of "clever", "discrete" or "every time" – OSes, rendering, or NNs,
respectively. This of course is not to say that experience in any of these is disqualifying, just that they're different.)</p>
<p>I think of it as a gig combining depth that people expect from academic work with compensation that people expect from
industry work. If you're interested, email me (Yossi.Kreinin@gmail.com).</p>
<p>All positions are in Jerusalem.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/hiring-self-driving-algos-hll-compiler-research#comments</comments>
      <pubDate>Sun, 11 Sep 2016 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/hiring-self-driving-algos-hll-compiler-research.feed</wfw:commentRss>
    </item>
    <item>
      <title>Things want to work, not punish errors</title>
      <link>https://yosefk.com/blog/things-want-to-work-not-punish-errors.html</link>
      <description><![CDATA[<p>For better or worse, things want to work.</p>
<p>Consider&nbsp;driving at night on unlit, curvy mountain roads, at a speed about twice the limit, zigzagging between cars,
including oncoming ones. Obviously dangerous, and yet many&nbsp;do this, and survive. How?</p>
<ul>
<li>Roads and cars are built with big safety margins</li>
<li>Other drivers don't want to die and help you get through</li>
<li>Practice makes perfect, so you get good at this bad thing</li>
</ul>
<p>The road, the car, you, other drivers, and their cars all want this to work. So for a long while, it does, until it finally
doesn't. I know 3-4 people who&nbsp;drive like this habitually. At least 2 of them totaled cars. All think they're excellent drivers.
All have high IQs, making you wonder just what this renowned benchmark of human brains really tells us.</p>
<p>Now consider a terribly managed project with an insane deadline, and a team and budget too small. All too often, this too
works out. How?</p>
<ul>
<li>Unless it physically cannot exist, a&nbsp;solution <strong>wants</strong> you to find it. You carve out a piece and the next
piece suggests itself. Even if&nbsp;management fails&nbsp;to think how the pieces&nbsp;fit together, the pieces often come out such that&nbsp;they
<em>can</em> be made to fit with modest extra effort.</li>
<li>And then the people who make the pieces <strong>want</strong> them to fit. Even if&nbsp;the process is totally mismanaged, many
people will talk to each other and find out&nbsp;what to do to make parts work together.</li>
<li>The project was approved because a customer was persuaded. At this point, the customer&nbsp;<strong>wants</strong>&nbsp;the project to
succeed. A little bit of schedule slippage will not make them change their minds, nor will a somewhat less impressive result.
More slack for you.</li>
<li>The vendor, too, <strong>wants</strong> the project to succeed, and will tolerate a little bit of budget overrun. More
slack.</li>
<li>Most often, when things fail, they fail visibly. It's as if things <strong>wanted</strong> you to see that they fail, so
that you fix them.</li>
</ul>
<p>The fact is that by cutting features, having a few non-terminal bugs,&nbsp;and being somewhat late and over budget, most projects
can be salvaged. In fact, when they say that "most projects fail," the PMI <a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a> defines "failure" as being a bit late or over budget. If "failure" is defined as outright
cancellation, I conjecture that most projects "succeed."</p>
<p>Which projects are least likely to be canceled? In other words, where is being late, over budget and off the original spec
most tolerable? Obviously, <em>when the overall delivered value is the highest</em>, both in absolute terms and relative to
the cost. In other words, <strong>reality punishes bad management the least in the most impactful cases</strong>.</p>
<p>What is the biggest problem with bad management?&nbsp;Same as crazy driving: risk.&nbsp;The problem in both cases is you risk
high-cost, low-probability events. It's terrible things that tend not to happen. And&nbsp;people are pretty bad at learning from
mistakes they&nbsp;never had to pay for.</p>
<p>Wannabe racecar drivers fail to learn from driving into risky situations which their own eyes tell them are risky. For
managers, learning is harder – the risks accumulated through bad management are abstract, instead of viscerally scary. In fact,
a lot of the risks are never understood by management, or even fully reported. Too much risk gets swept under various
rugs for all of it to become ingrained in institutional memory.</p>
<p>In fact, it's even worse, because risk-taking is actually <strong>rewarding</strong> as long as the downside doesn't
materialize. The crazy driver gets there 10 minutes earlier. Similarly, non-obviously hazardous management often delivers at an
obviously small cost. And while driving is&nbsp;not actually&nbsp;competitive, except in the inflamed minds of the zigzagging few, most
projects are delivered in very competitive environments indeed. And competition can make even small rewards for risk decisive –
as it can with any other smallish factor&nbsp;large enough to make a difference between victory and defeat.</p>
<p>Things want to work more than they want to punish us for our errors. The punishment may be very cruel and unusual alright,
but it's rare. It seems that the universe, at least The Universe of Deliverables, is Beckerian: it hands out punishments that are
severe but improbable – the optimal kind for deterring rational agents who correctly estimate probabilities. Sadly, humans are bad at probability.</p>
<p>And thus crazy drivers and bad managers alike (often the same people, BTW) march from one insane adventure to the next,
gaining more and more confidence in their brilliance.</p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>PMI&nbsp;(The Project Management Institute) is&nbsp;a con, where they sell you "PMBOK" (Project Management Body of
Knowledge, a thick book you can use as a monitor stand) and "PMP" (Project Management Professional, a certification required by
PMI's conscious or unwitting accomplices in dark corners of the industry.) A variety of more elaborate cons targeted at narrower
audiences incorporate PMI's core body of cargo cult practices.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/things-want-to-work-not-punish-errors#comments</comments>
      <pubDate>Mon, 27 Feb 2017 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/things-want-to-work-not-punish-errors.feed</wfw:commentRss>
    </item>
    <item>
      <title>Patents: how and why to get them</title>
      <link>https://yosefk.com/blog/patents-how-and-why-to-get-them.html</link>
      <description><![CDATA[<p>I'm going to discuss 3 very basic things about patents:</p>
<ul>
<li>Why it's good for you to get them;</li>
<li>Why it might be bad for your employer (and why they don't care);</li>
<li>How to get a patent for your idea (doesn't matter which.)</li>
</ul>
<p>Some of my points are a bit naughty. But I maintain that they're based in fact and fairly widely known. So well-known, in
fact, that I'm surprised never to have read them anywhere else.</p>
<p>My explanation is that the hatred of patents in the tech world is such that nothing except "HATE! HATE! HATE!" can be said on
the subject in polite society. In this atmosphere, "Patents: how and why to get them" reads like "Humans: how and why to cook
them."</p>
<p>If you can make yourself read this human-cooking manual, however, I think you'll find both amusing and useful things. I have
more experience with patents than I've ever asked for, having worked on this stuff with lawyers from the smallest law firms to
the largest ones, including lawyers who personally handled the most famous lawsuits for the most famous tech clients. I'm not an
authority on patents, but I have good stories.</p>
<h2 id="what-patents-give-you">What patents give you</h2>
<p>Some companies pay you money per patent. But it's rarely enough to make it worth your while, unless it's all you're doing.
Patents look good on your CV, but reactions might be negative as well (you might appear "overqualified," "an expert in an
unrelated field," etc.)</p>
<p>What's the one thing a patent undeniably buys you? <em>A right to legally and publicly discuss your work –</em> which you
often can't get in any other way. This is not a side-effect of patent law, but its whole stated point. Patent law prompts
companies to <em>publish</em> their ideas, in exchange for a time-limited monopoly right to use the ideas.</p>
<p>Note that publishing ideas in patents is easy, and the benefit <em>for the author</em> is certain. But getting and enforcing
a monopoly for said ideas is <em>not</em> easy, so the benefit <em>for the proprietor</em> is not at all certain. Here's
why.</p>
<h2 id="what-patents-give-and-dont-give-your-employer">What patents give (and don't give) your employer</h2>
<p>Some problems with patents are so obvious that even patent lawyers will honestly discuss them with their clients:</p>
<ul>
<li>When you submit a patent application, it becomes public forever, <em>even if it's rejected.</em> You will have paid legal
fees with the end result of granting competitors access to your ideas.</li>
<li>If you sue for patent infringement, your patent might be <em>invalidated</em> as a result. It's like a rejected patent
application, but with at least $1 million more in legal fees.</li>
</ul>
<p>But there's another, potentially far bigger problem, that patent lawyers will rarely mention, let alone admit its extent:</p>
<ul>
<li>You don't get monopoly rights to everything you file in the patent application. You publish a "spec" and "claims." The
monopoly is granted <em>only for the claims</em> – perhaps in a form reduced relative to the original patent application, due
to feedback from the examiner. Yet <em>the entire spec</em>, much of it not covered by the claims, becomes public.</li>
</ul>
<p>So what's the big deal, you might ask? The spec describes some device or method. The claims describe the supposedly new ideas
used in this device or method. All you have to do is write a spec such that nothing of value is disclosed that is not covered by
the claims.</p>
<p>However, in reality, the published spec is often quite close to <em>the actual spec used by engineers</em>, with all the
details. That's simply the path of least resistance:</p>
<ul>
<li>Patent lawyers don't know which claims will be rejected by the examiner. (If they knew, you wouldn't have a heap of rejected
applications, nor patents invalidated in courts.) They file relatively broad claims, and then change the claims to address
challenges by the examiner, until a patent is granted. The catch is that <em>you can only base your new claims on details
included in the originally filed spec</em> – the spec can never be altered. Thus a detailed, complete spec maximizes the chances
to get <em>some</em> patent out of the filing – covering 90% or 10% of the spec, depending.</li>
<li>More prosaically, if we don't file the actual spec but instead write a new one tailored to the claims, who's gonna do it?
Neither the engineer nor the lawyer necessarily has the ability to do it, and surely neither has any interest in doing it. Much
better to take existing documents and do the minimal necessary translation from English to legalese.</li>
</ul>
<p><strong><em>Ultimately, there's a conflict of interest between your employer and their patent lawyer, and a surprisingly
perfect alignment of interests between the lawyer and yourself</em></strong>:</p>
<ul>
<li><strong>The lawyer wants to publish as many details as possible</strong> – to maximize the chance of getting a patent, and
to avoid extra work;</li>
<li><strong>The engineer also wants to publish as much as possible –</strong> to make his ideas known to the fullest extent, and
to avoid extra work;</li>
<li><strong>The employer/shareholder wants to publish as <em>little</em> as possible –</strong> but has no simple, reliable way
to incentivize anyone to push in this direction (though of course some are much better at this than others.)</li>
</ul>
<p>Funnily enough, this too is largely in line with the lawmaker's stated intent – prompting companies to publish ideas instead
of keeping them secret. But why do companies file patents?</p>
<p>The answer is that patents are never read – they're counted. More precisely, a company's goal is to acquire enough patents so
that they can <em>only</em> be counted – but not read and understood in a reasonable amount of time.</p>
<p>If you have too many patents to read and understand (hundreds, thousands or more), then investors and competitors alike
assume you "own your domain" – you can counter-attack if sued. You're as well-defended legally as you can possibly be. But if
you have few patents, someone might read and understand most of them – and create a narrative about some legal weakness. Such
narratives are bad for the stock price.</p>
<p>This situation must be avoided. And that's all there is to it – at least in the computing industry. And I know it might sound
too dismissive to be convincing. But the fact is that the content of patents is just too complex to drive business decisions.
The feasible thing for a decision-maker is to pick the bucket to put you in, out of "no patents, some patents, a shitton of
patents." For more information, see the seminal work "<a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow">Pulling
Decisions out of One's Ass: Fast and Slow,</a>" keeping in mind that decision-makers have a lot of decisions to make, so they
must be Fast.</p>
<h2 id="why-filing-patents-isnt-a-crime-on-par-with-cannibalism">Why filing patents isn't a crime on par with cannibalism</h2>
<p>Considering the above, I don't think that <em>a product company employee</em> filing patents pollutes the tech environment as
badly as people believe.</p>
<p>Product companies file patents largely for self-defense. Some occasionally attack startups, but how many startups were
destroyed by a patent lawsuit vs the number of those destroyed by a badly managed acquisition (with the original investors doing
just fine)? And there are examples of big companies buying startups already attacked by a lawsuit filed by a bigger product
company, confident that between two big companies, the legal result will be a stalemate. Thus for a big company genuinely
fearing your product, it's much safer to buy you than sue you and have you bought by a big competitor.</p>
<p>The real trouble is patent trolls, who cannot be counter-attacked. But the only way a product company's patents will land in
a troll's hands is if the company goes bankrupt and sells the patents. Well, guess what – in these cases, other product
companies are eager to outbid the trolls. For example, when MIPS Technologies was sold to Imagination for ~$60 million many
years ago, its patents, sold separately, fetched ~$500 million from some CPU cartel involving various big name CPU companies.
Alternatively, a failing company can turn into a troll and sue a successful product company (MicroUnity comes to mind.)</p>
<p>Thus patents of failing product companies result in a weird form of socialism, where profit is spread more evenly between
investors, with losers getting a chunk of the winners' profits. I don't think this chunk is nearly large enough on average to
substantially reduce the incentive to work hard for the win, which is supposedly "the" trouble with laws subsidizing losers.</p>
<p>My point is that patent trolls and product companies seem to live in largely parallel universes. There are patents filed with
the intention to be used by a patent troll, and there are patents filed by product companies, and the latter cause far less
damage.</p>
<h2 id="how-to-get-a-patent">How to get a patent</h2>
<p>I've lost count of the number of times I've heard the words "The Black Swan." It rather aggrieves me, but you gotta hand it
to Taleb. Everyone is trying to pollute our language by needlessly coining catchphrases in a quest to be memorable, but he
succeeded more than most. Surely I wouldn't hear this nonsense as often if he called the book "The Unforeseen Event."</p>
<p>Getting patents is a lot like branding. The trick is to call old things new names.</p>
<p>Why does it take a patent lawsuit and at least $1 million in legal fees to find out if a patent <em>really</em> is a patent –
or to see it invalidated by the court? Because searching prior art is hard. "Prior art" includes everything published prior to
the patent – older patents, academic papers, and everything else, really. Strictly speaking, you never know if you're done.</p>
<p>How does the patent office examiner examine prior art, at a cost much lower than $1 million? Some equivalent of quick
googling. The input of search engines is words and short phrases. If you use words and phrases which are uncommon in your
domain, the search will come up blank, or it will find things so obviously unrelated to your work that even a patent examiner
will get it.</p>
<p>If you're extending the concept of a thread, don't call the result "extended threads", call it "hypercontexts." If you're
calculating a histogram, call it a "distribution estimator." And so on. Again, I know this sounds too dismissive of the system
to be believable. Well, try it. File a patent application full of "distribution estimators" and another one written in plain
English. See which gets approved more smoothly.</p>
<p>Note that you might be tempted to conduct a prior art search yourself before filing the patent application, as a matter of
due diligence. Yet some lawyers actually recommend against it, since if you do find prior art, you're now willfully infringing
on it, and should cease and desist. My advice is to come up with a bunch of Black Swans/Distribution Estimators describing your
idea, and pick the ones with the fewest Google results (patent search and otherwise).</p>
<p>And don't actually <em>read</em> any patent you accidentally find – don't willfully infringe, it's illegal. Just count them.
Patents are never read, only counted – sounds familiar?</p>
<p>The other very important thing – which you mostly get to worry about in smaller companies – is to get the right kind of
lawyer. Patent lawyers are fallen engineers, with engineering degrees, and sometimes actual engineering experience.
<strong><em>The underlying engineer who's morphed into a lawyer ought to have specialized in your domain. No compromise is
acceptable here</em>.</strong> If you're doing optics, don't work with a guy who did chip design, and if you do chip design,
don't work with a guy who did optics.</p>
<p>It doesn't matter if the lawyer is a Partner ($900/hour), an Associate ($450/hour), or some lesser life form in the law firm.
It doesn't matter whether the firm is the biggest name in the industry or completely unknown. What matters is engineering
knowledge. Don't expect a patent lawyer to honestly tell you he doesn't know your domain. He'll <em>always</em> accept the work,
and you'll pay $truckload/hour trying to explain the most basic things to him, and failing. You need to actively ask about his
education and experience.</p>
<h2 id="summary">Summary</h2>
<p>Like most annoying things in life, patents aren't evil as much as they're absurd. Use them to your advantage.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/patents-how-and-why-to-get-them#comments</comments>
      <pubDate>Sat, 02 Jun 2018 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/patents-how-and-why-to-get-them.feed</wfw:commentRss>
    </item>
    <item>
      <title>Don't ask if a monorepo is good for you – ask if you're good enough for a monorepo</title>
      <link>https://yosefk.com/blog/dont-ask-if-a-monorepo-is-good-for-you-ask-if-youre-good-enough-for-a-monorepo.html</link>
      <description><![CDATA[<p>This is inspired by <a href="https://danluu.com/monorepo/">Dan Luu's post</a> on the advantages of a single big repository
over many small ones. That post is fairly old, and I confess that I'm hardly up to date on the state of tooling, both for
managing multiple repos and for dealing with one big one. But I'm going to make an argument which I think mostly works
regardless of the state of tooling on any given day:</p>
<ul>
<li>Monorepo is great if you're really good, but absolutely terrible if you're not that good.</li>
<li>Multiple repos, on the other hand, are passable for everyone – they're never great, but they're never truly terrible,
either.</li>
</ul>
<p>If you agree with the above, the choice is up to your personal philosophy. To me, for instance, it's a no-brainer – I'll
choose the passable thing which successfully withstands contact with apathetic mediocrity over the greater thing which falls
apart upon such contact in a heartbeat.</p>
<p>You might be different – you might believe in Good – and then you'll choose a monorepo, like Google, the ultimate force for
Good in technology (which is why they safeguard your personal data; you wouldn't want someone evil to have it – luckily, Google
can do no evil.) And I'm almost not kidding: the superpower which lets Google maintain the grassroots bureaucracy which I find
necessary to make monorepos work well is actually the same trait making you sufficiently delusional to chant, or at least to
have chanted "Don't Be Evil" entirely seriously. I don't have that. I am, to a first approximation, evil. <a href="https://yosefk.com/blog/what-worse-is-better-vs-the-right-thing-is-really-about.html">Worse is Better</a>.</p>
<p>But that's me – I'm not saying <em>you/your org</em> are Not So Good, or Evil. I'm only saying that <em>you should be open to
the possibility,</em> and that I don't see the implications of being Not So Good discussed as much as they deserve.</p>
<p>Why are monorepos terrible if you're not that good? Three reasons:</p>
<ol>
<li>Branching in</li>
<li>Modularity out</li>
<li>Tooling strained</li>
</ol>
<p>Let's discuss them in some detail.</p>
<h2 id="branching-getting-forked-by-your-worst-programmer">Branching: getting forked by your worst programmer</h2>
<p>In a Good team, you don't have multiple concurrent branches from which actual product deliveries are produced, and/or where
most people get to maintain these branches simultaneously for a long time. And you certainly can't have branching due to
outright atrocities, like someone adding a feature by killing a feature – for example, making the app work on Android, but
destroying the ability to build for iOS in the process.</p>
<p>But in a not-so-good team... you get the idea.</p>
<p>What do you do when you have a branch working on Android and another branch working on iOS and you have deliveries on both
platforms? You postpone the merge, and keep the fork. For how long do you postpone the merge? For as long as is necessary for
the dumbass who caused the fork to fix their handiwork, in parallel with delivering more features (which likely results in
digging a deeper hole to climb out of afterwards.) And the dumbass might take months, years, or forever.</p>
<p>The question then becomes, <em>what was forked</em>?</p>
<p>In a multi-repo world, the repo maintained by the team with the dumbass on it got forked. In a monorepo world, <em>the entire
code base got forked, and the entire org is now held hostage by the dumbass.</em> And you might think that this will result in a
lot of pressure to fix the problem, and you'd be wrong, for the same reasons that high murder rates don't cure themselves by
people putting pressure on whomever to lower them to some equilibrium level common to all human societies.</p>
<p>Some places have higher than average murder rates, and some places have higher than average fork rates. And I argue that
a lot of places have fork rates which combine into a complete disaster with a monorepo. And you might not even realize how bad
the fork rate is at your place, because multiple repos largely shield you from the consequences. Or, more tragically, you might
not realize how bad your fork rate is because your monorepo is in its first couple of years, and you're sowing what you'll reap
in its next couple of years, when you'll have more code, more deliveries and more dumbasses.</p>
<p>With multiple repos, if <em>you</em> have your shit under control, and <em>your</em> repos have a single release branch with
a single timeline, all you have to do is to test against both of the dumbass's branches. But with a monorepo, you need to
maintain your code in 2 branches, with a growing share of everybody else's code morphing incompatibly in those branches, simply
because they exist. And very soon it will be more than 2 because there's more than a single dumbass, and good luck to you.</p>
<h2 id="modularity-demoted-from-a-norm-to-an-ideal">Modularity: demoted from a norm to an ideal</h2>
<p>Norms are mundane, but they are what is. Ideals are lofty, but they are merely what should be (and typically isn't.) If you
want to actually <em>have</em> something, you don't want it to be an ideal, like altruism – you want it to be a norm, like
wiping one's ass. If something is demoted from ass-wiping to altruism, that something will scarcely be found in the wild.</p>
<p>With multiple repos, modularity is the norm. It's not a <em>must</em> - you technically can have a repo depending on umpteen
other repos. But your teammates <em>expect</em> to be able to work with their repo with a minimal set of dependencies. They
<em>don't like</em> to have to clone lots of other repos, and to then worry about their versions <em>(in part because the
tooling support for this might be less than great)</em>.</p>
<p>In fact, a common multi-repo failure mode is that people expect <em>too few</em> dependencies and make <em>too many repos
which are too small</em> to host a <em>useful</em> self-contained system. Note that this failure mode is not lethal. It kinda
sucks to have this over-modularity with benefits of independence which turn out to be imaginary upon a closer look, and to have
people treat what essentially are internal APIs with way too much reverence, just because two modules which are extremely
tightly coupled <em>conceptually</em> are independent <em>technically,</em> in terms of cloning/building/testing. But it doesn't
kill you.</p>
<p>With a monorepo, modularity is a mere ideal. Everybody clones the whole thing. You're not supposed to add gratuitous
dependencies, but it's very easy to add such a dependency in terms of cloning, building and versioning, and nobody objects to
the dependency being added the way they would if they needed to clone more repos.</p>
<p>Of course, in a Good team, needless dependencies would be weeded out in code reviews, and a Culture of avoiding them would
evolve over time. In a not-so-good team, your monorepo will grow into a single giant ball of circular
dependencies. Note that adding dependencies is infinitely easier than untangling them, much like forking is easier than merging,
with the difference that the gut-felt urgency to merge ("I can't maintain all your damned branches any longer!!") is far greater
and far more backed by simple self-interest than the urgency to improve the dependency structure.</p>
<h2 id="tooling-is-yours-better-than-the-standard">Tooling: is yours better than the standard?</h2>
<p>This part might age worse than the others, and might not be particularly up to date even now – what "standard" tools are
capable of changes over time. But generally speaking, a growing monorepo is likely to outgrow the standard version management
tools and methods, as well as <em>other</em> tools and methods dealing with your revision controlled code.</p>
<p>Google used to have a FUSE driver to avoid copying hundreds of millions of source lines at a time, instead fetching the
files on demand when a directory is cd'd into. Facebook used to hack on hg to make it fast on its large monorepo. Maybe already
today, or some day, a growing number of off-the-shelf tools will scale to infinite monorepos without such investments. But it
sounds reasonable that there will always be tools and workflows which you will struggle to make work with a large monorepo
(starting with some script doing find/grep.)</p>
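<p>(As one illustration of "standard" capabilities changing over time, today's stock git can already clone without most file
contents and check out just a subtree – a sketch, with a made-up URL:)</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># clone without blobs, fetching file contents lazily as needed
git clone --filter=blob:none --sparse https://example.com/monorepo.git
cd monorepo
# materialize only the directories your team actually works on
git sparse-checkout set my/team/dir</code></pre>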
<p>With a bunch of small repos, you work with a small overall number of source files in your working directory, so you don't
need to tell your tools, "don't try to deal with the whole thing – instead only search this subset, or use this index etc. etc."
And you have tools these days which kinda sorta let you manage the revisions of multiple repositories (for instance, there's
Google's Repo.) And I think the result is very, very far from a great experience <em>potentially</em> afforded by a large
monorepo. But it also <em>never</em> breaks down as badly as a large monorepo outgrowing the abilities of tools, as well as the
ability of your local toolsmiths to find creative workarounds for these growth pains.</p>
<h2 id="summary">Summary</h2>
<p>Don't ask if a monorepo is good for you – ask if you're good enough for a monorepo. Personally, I don't have the guts to bet
on the supply of Goodness in a given org to remain sufficiently large over time to consistently avert the potential disasters of
monorepos. But that's just my personal outlook; if you want to compliment me, don't call me "smart," and definitely don't call
me "good" – I know my limits in these areas, and I take far more pride in knowing these limits than in the limits themselves;
so, to compliment me, call me "pragmatic." Yet a culture worthy of a monorepo absolutely can exist – just make sure yours
actually is one of those, and don't mistake your ideals for your norms.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/dont-ask-if-a-monorepo-is-good-for-you-ask-if-youre-good-enough-for-a-monorepo#comments</comments>
      <pubDate>Tue, 30 Jul 2019 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/dont-ask-if-a-monorepo-is-good-for-you-ask-if-youre-good-enough-for-a-monorepo.feed</wfw:commentRss>
    </item>
    <item>
      <title>I have risen</title>
      <link>https://yosefk.com/blog/i-have-risen.html</link>
      <description><![CDATA[<p>Hello to the readers still using RSS! I've moved the blog off WordPress to my own ugly publishing software, and will be
grateful if you report any glitches you see (posts or comments look bad on device X or feed reader Y, that sort of thing.)</p>
<p>This blog slowed down a lot in 2017, when I switched from a part-time programming position to a full-time senior management
position. Between the comment spam flood and the ancient pre-mobile design, it would take some doing to get the blog back into
shape; and between work and non-work stuff I had going, I didn't find time for said doing.</p>
<p>But I went back to programming a couple of years ago, and back to part-time a few months ago, and now the doing is done.
And do I have things to tell you!</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/i-have-risen#comments</comments>
      <pubDate>Tue, 12 Mar 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/i-have-risen.feed</wfw:commentRss>
    </item>
    <item>
      <title>refix: fast, debuggable, reproducible builds</title>
      <link>https://yosefk.com/blog/refix-fast-debuggable-reproducible-builds.html</link>
      <description><![CDATA[<p>There's a simple way to make your builds all of the following:</p>
<ul>
<li><strong>Reproducible</strong>/deterministic - same binaries always built from the same source, so you can cache build
outputs across users</li>
<li><strong>Debuggable</strong> - gdb, sanitizers, Valgrind, KCachegrind, etc. find your source code effortlessly</li>
<li><strong>Fast</strong> - the build time overhead is negligible, even compared to a blazing fast linker like <a href="https://github.com/rui314/mold">mold</a></li>
</ul>
<p>What makes it really fast is a small Rust program called <a href="https://github.com/yosefk/refix">refix</a> that
post-processes your build outputs (if you don't want to compile from source, <a href="https://yosefk.com/refix/">here's a static
Linux binary</a>.) Both the program and this document are written for the context of C/C++ source code compiled to native
binaries. But this can work with other languages and binary formats, too, and it should be easy to support them in
<code>refix</code>. (<em>In fact, it mostly supports them already...</em> you'll see.)</p>
<p>This "one weird trick" isn't already popular, not because the solution is hard, nor because the problem isn't painful.
Rather, it's not already popular because people widely consider it impossible for builds to be both debuggable and reproducible,
and standardize on workarounds instead. Since "established practices" are sticky, and especially so in the darker corners like
build systems<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a>, we'll need to discuss not only
how to solve the problem, but also why solve it at all.</p>
<h2 id="the-curious-case-of-the-disappearing-source-files">The curious case of the disappearing source files</h2>
<p>Why are people so willing to give up their birthright - the effortless access to the source code of a debugged program? I
mean, build a "Hello, world" cmake project, and everything just works: gdb finds your source code, <code>assert</code> prints a
path you can just open in an editor, etc. "Source path" isn't even a thing.</p>
<p>Later on, the system grows, and the build slows down. So someone implements build artifact caching, in one of several
ways:</p>
<ul>
<li>A general-purpose distributed build cache, like Bazel's</li>
<li>Something for caching specific kinds of artifacts, like ccache (a minimal setup is sketched right after this list)</li>
<li>An entirely home-grown system - like running the build of user X in a build directory left previously by user Y at the build
server's local disk (and hoping that their source code is similar enough, so most object files needn't be rebuilt<a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>)</li>
</ul>
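<p>For instance, a minimal ccache setup might look like the sketch below; the cmake launcher flags are just one way to hook it
in, and <code>CCACHE_BASEDIR</code> is what rewrites absolute paths under it into relative ones, so objects built in different
checkouts can hit the same cache entries:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># make paths under the checkout relative, for cross-user cache hits
export CCACHE_BASEDIR=$PWD
cmake -DCMAKE_C_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache ..
make</code></pre>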
<p>In any case, now that you need caching, you also need reproducible builds. Otherwise, you'd cache object files built by
different users, and you'd get different file paths and other stuff depending on which user built each object file. And we can
all agree that build caches are important, and pretty much force you to put relative paths into debug information and the value
of <code>__FILE__</code> (and some meaningless garbage into <code>__TIME__</code>, etc.)</p>
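<p>(A quick sanity check of "reproducible" in this sense – assuming no other sources of non-determinism – is to build the same
file in two different directories with the prefix remapped, and compare hashes:)</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># identical source in two build directories, bit-identical objects out
(cd /tmp/a; gcc -g -c -ffile-prefix-map=/tmp/a=. foo.c)
(cd /tmp/b; gcc -g -c -ffile-prefix-map=/tmp/b=. foo.c)
sha256sum /tmp/a/foo.o /tmp/b/foo.o  # expect two identical hashes</code></pre>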
<p>But we can <em>also</em> agree that the <em>final binaries</em> which users actually run should have full source paths,
right? I mean, I know there are workarounds for finding the source files. We'll talk about them later; I'd say they don't really
work. Of course, the workarounds would be tolerable if they were inevitable. But they aren't.</p>
<p><strong>Why not fix the binary coming out of the build cache, so it points to the absolute path of the source files?</strong>
(The build system made an effort to detach the binary from the full source path, so that it can be cached. But now that the
binary has left the cache, we should "refix" it back to the source path of the version where it belongs.)</p>
<p>We'll look at 3 ways of refixing the binary to the source path - a thesis, an anti-thesis and a synthesis, as it were.</p>
<h2 id="thesis-debugedit---civilized-standard-and-format-aware">Thesis: <code>debugedit</code> - civilized, standard and
format-aware</h2>
<p>A standard tool for this is <a href="https://sourceware.org/debugedit/">debugedit</a>. The man page example does exactly the
"refixing" we're looking for:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>debugedit -b `pwd` -d /usr/lib/debug files...
    Rewrites path compiled into binary
    from current directory to /usr/lib/debug.</code></pre>
<p>Some Linux distributions build from source files in some arbitrary location, and then use <code>debugedit</code> to make the
debug info point to wherever the source files are installed when someone downloads them to debug the program.</p>
<p>If debugedit works for you, problem solved. It works perfectly when it does. However, when I tried it on a 3GB shared object
compiled from a C++ code base<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>, it ran for 30
seconds, and then crashed. If you, too, find debugedit either slow or buggy for your needs, read on.</p>
<h2 id="anti-thesis-sed---nasty-brutish-and-short">Anti-thesis: <code>sed</code> - nasty, brutish, and short</h2>
<p>Why is debugedit's job hard (slow and bug-prone)? Mainly because it needs to grow or shrink the space reserved for each
replaced string. When you do such things, you need to move a lot of data (slow), and adjust who-knows-which offset fields in the
file (bug-prone.)</p>
<p>But what if the strings had the same length? Then we don't need to move or adjust anything, and we could, erm, we could
replace them with <code>sed</code>.</p>
<p>Here, then, is our nasty, brutish, and short recipe:</p>
<ul>
<li>Run <code>gcc</code> with these flags:
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">-fdebug-prefix-map==MAGIC <i># for DWARF</i>
-ffile-prefix-map==MAGIC  <i># for __FILE__</i>
</pre></li>
<li>Make MAGIC long enough for any source path prefix you're willing to support.</li>
<li>Why the <code>==</code> in the flag? This invocation assumes that file paths are relative, so it remaps <em>the empty
string</em> to MAGIC, meaning, <code>dir/file.c</code> becomes <code>MAGICdir/file.c</code>. You can also pass
<code>-fdebug-prefix-map=/prefix/to/remap=MAGIC</code>, if your build system uses absolute paths.</li>
<li>Use <code>sed</code> to replace MAGIC with your actual source path in the binary outputted by the build system.</li>
<li>If the source path is shorter than the length of MAGIC, pad it with forward slashes: <code>/////home/user/src/</code>. If
the source path is too long, the post-link step should truncate it, warn, and eventually be changed to outright fail. You don't
<em>really</em> need to support giant paths.</li>
</ul>
<p>Our post-link step thus becomes:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>sed -i 's/MAGIC/\/\/\/...\/user\/src\//g' binary</code></pre>
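<p>Concretely, the padding logic might look like this – a bash sketch, where MAGIC and the paths are placeholders for your own
values:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># pad the real source root with leading slashes to MAGIC's exact length
MAGIC=MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM  # the build-time constant
SRC=/home/user/src/                             # where the code really is
PAD=$(( ${#MAGIC} - ${#SRC} ))
if [ "$PAD" -lt 0 ]; then
  echo "warning: source prefix longer than MAGIC, truncating" >&amp;2
  PAD=0
fi
PREFIX=$(printf '%*s' "$PAD" '' | tr ' ' '/')$SRC  # ///...//home/user/src/
sed -i "s|$MAGIC|${PREFIX:0:${#MAGIC}}|g" binary   # same-length replacement</code></pre>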
<p>The downside, on top of the source path length limit, is a trace of the brutishness making it into the output file. Namely,
you're going to see these extra forward slashes in some situations. We can't pad a prefix with an invisible character...
luckily, we can pad it with a character not changing the meaning of the path.</p>
<p>On the upside, compared to <code>debugedit</code>, the method using <code>sed</code> is:</p>
<ul>
<li><strong>More widely applicable</strong> - it, erm, "supports" all executable and debug information formats, as well as
archives and object files.</li>
<li><strong>More robust</strong> - not affected by input format complexity</li>
<li><strong>Faster</strong> - 10 seconds to process the 3GB binary (about the time it takes <code>mold</code> to link that
binary... yes, it's that good!)</li>
</ul>
<p>Is this fast enough? Depends on your binary sizes. If yours are big and you don't want to effectively double the link time,
our next and last method is for you.</p>
<h2 id="synthesis-refix---nasty-brutish-and-somewhat-format-aware">Synthesis: <code>refix</code> - nasty, brutish, and somewhat
format-aware</h2>
<p>Can we go faster than <code>sed</code>? We have two reasons to hope so:</p>
<ul>
<li><code>sed</code> is unlikely to be optimized specifically for replacing strings of equal size; it's not that common a use
case.</li>
<li>We don't actually need to go through the entire file. File paths only appear in some of the sections - <code>.rodata</code>
where strings are kept, and debug info sections. If we know enough about the file format to find the sections (which takes very
little knowledge), we can avoid touching most of the bytes in the file.</li>
</ul>
<p>But wait, isn't the giant binary built from C++ mostly giant because of the debug info? <em>Yes</em>, but it turns out that
most of the debug info sections <em>don't contain file paths</em>; only <code>.debug_line</code> and <code>.debug_str</code> do,
and these are only about 10% of our giant file.</p>
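<p>You can check this ratio on your own binaries with standard binutils (section sizes in the <code>readelf</code> output are in
hex):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># sizes of the sections refix touches, vs. the size of the whole file
readelf -S --wide binary | grep -E '\.rodata|\.debug_line|\.debug_str'
ls -l binary</code></pre>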
<p>So the <code>refix</code> program works as follows:</p>
<ul>
<li>It <code>mmap</code>s the file, since it knows it never needs to move the data and can just overwrite the strings in
place.</li>
<li>For ELF files, it finds <code>.rodata</code>, <code>.debug_line</code> and <code>.debug_str</code>, and searches &amp;
replaces only within these. This handles executables, shared libraries (<code>*.so</code>) and object files
(<code>*.o</code>).</li>
<li>For <code>ar</code> archives, it finds the ELFs within the archive, then the sections it cares about within each ELF, and
searches &amp; replaces within these. This handles <code>lib*.a</code>.</li>
<li>For files which are neither ELFs nor archives of ELFs, <code>refix</code> just replaces everywhere as <code>sed</code>
would, but still faster because it's optimized for the same-sized source &amp; destination strings case.</li>
</ul>
<p>Thus, <code>refix</code> is:</p>
<ul>
<li><strong>Very fast</strong> - 50 ms on the 3GB binary, and 250 ms on the same binary in "sed mode" (meaning, if we remove the
ELF magic number, so <code>refix</code> is forced to replace everywhere and not just in the relevant sections.)</li>
<li><strong>Widely applicable</strong> - works on any file format where the source path prefix isn't compressed and is otherwise
"laid bare"</li>
<li><strong>Robust</strong> - while it knows a bit about the binary file format, it's very, very little (enough to find the
sections it's interested in); it's hundreds of lines of code vs <code>debugedit</code>'s thousands. And you can always make it
run even less code by falling back to "sed mode."</li>
</ul>
<p>...with the sole downside being that, same as with sed, you might occasionally see the leading slashes in pathnames.</p>
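<p>Judging by the fuller <code>--section</code> invocation shown in the Q&amp;A below, basic usage looks like this – treat it as
a sketch, and consult the <code>refix</code> README for the authoritative command line:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># replace the magic prefix with the padded real one, in place
refix exe MAGIC /////home/user/src/
# spot-check: no leftover magic strings in the binary
strings -a exe | grep -c MAGIC  # expect 0</code></pre>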
<p>That's it, right? We're done? Well, maybe, but it's not always how it goes. People have questions. So here we go.</p>
<h2 id="q-a">Q &amp; A</h2>
<h3 id="why-do-this-we-already-have-a-system-for-finding-the-source-code.">Why do this? We already have a system for finding the
source code.</h3>
<p>First of all, it is worth saying that you <em>shouldn't</em> have any "system" for finding source code, because the tired,
stressed developer who was sent a core dump to urgently look at is entitled to having at least <em>this</em> part work entirely
effortlessly<a class="footnote-ref" role="doc-noteref" href="#fn4" id="fnref4"><sup>4</sup></a>.</p>
<p>But also, whatever system you do have ought to have issues:</p>
<ul>
<li>If you do not modify the cacheable, reproducible binaries coming out of the build system, then by definition your way to
find source code must rely on something inherent to a given source version, independent of who built it and where. Since you're
not going to embed the entire source code into the executable, you must rely on some sort of version information. What if the
program had uncommitted changes, which happens in debugging scenarios (someone built a version to test and someone else sent a
core dump from this version?)</li>
<li>"Well you're not supposed to get core dumps from versions with uncommitted changes, unless it's your local version that you
haven't given to anyone but are testing locally, so you know which version it is. You should only release versions externally
thru CI" - so giving anything to anyone to test is now considered "releasing externally" and must necessarily go thru CI, and
having trouble finding the source code is now a punishment for straying from proper procedure? How did this discussion, which
started at how build caches <em>speed up</em> the build, deteriorate to the point where we're telling developers to change how
they work, in ways which will <em>slow them down?</em></li>
<li>But OK, let's say I didn't "release" anything - instead I have 5 local versions I'm working with and they go thru test flows
and dump core - I'm now supposed to guess which core comes from which version, or develop my own "system" to know? (Some people
actually assume this won't happen because you can't run tests outside CI anyway, so you will submit a merge request in order to
run them. And they assume that because they use some testing infra intertwined with CI infra and most of their tests technically
can't run outside CI. And perhaps they don't even have machines to run on that aren't managed by Jenkins or some such to begin
with. But that is a horror story for another time. Here I'll just assume that we agree that it's good to be able to test changes
locally and debug easily.)</li>
<li>In the cases where the version info actually enables you to find the right code, the process can be made more tolerable by
developing a <code>gdb</code> Python extension that automatically tells gdb where the source code is based on the embedded
version info (the manual equivalent of such an extension is sketched right after this list). Do you have this extension and a
team maintaining it together with the build system?</li>
<li>Do you also have this automated for all the other tools (sanitizers, Valgrind, KCachegrind, VTune, whatever)? Do they all
even have a way to tell them where to look for source code? Is there a team handling this for all users, for every new tool used
by developers?</li>
</ul>
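<p>For reference, here is the manual equivalent of what such an extension would automate, typed into gdb with illustrative
paths:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># map the compile-time prefix to where the code actually is now
(gdb) set substitute-path /build/placeholder/src /home/user/src
# or simply add another directory to search for sources
(gdb) directory /home/user/src</code></pre>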
<p>I realize that these pain points aren't equally relevant to all organizations, and the extent of their relevance depends a
lot on the proverbial software lifecycle. (They also aren't equally relevant to everyone in a given organization. I claim that
the people suffering the most from this are the people doing the most debugging, and they are quite often very far removed from
any team that could ameliorate their suffering by improving "the system for finding source code" - so they're bound to suffer
for a long time.)</p>
<p>My main point, however, is that you needn't have any of these pain points <em>at all</em>. There's no tradeoff or price to
pay: your build is still reproducible and fast. Just make it debuggable with this one weird trick!</p>
<p>(Wow, I've been quite composed and civil here. I'm very proud of myself. Not that it's easy. I have strong <em>feelings</em>
about this stuff, folks!)</p>
<h3 id="what-about-non-reproducible-info-other-than-source-path-time-build-host-etc">What about non-reproducible info other than
source path (time, build host, etc)?</h3>
<p>I'm glad you asked! You can put all the stuff changing with every build into a separate section, reserved at build time and
filled after link time. You make the section with:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>char ver[SIZE] __attribute__((section(".ver"))) = {1};</code></pre>
<p>This reserves <code>SIZE</code> bytes in a section called <code>.ver</code>. It's non-<code>const</code> deliberately, since
if it's <code>const</code>, the OS will exclude it from core dumps (why save data to disk when it's guaranteed to be exactly the
same as the contents of the section in the binary?) But you might actually very much want to look at the content of this section
in a core dump, perhaps before looking at anything else. <strong>For instance, the content of this section can help you find the
path of the executable that dumped this core!</strong><a class="footnote-ref" role="doc-noteref" href="#fn5" id="fnref5"><sup>5</sup></a></p>
<p>(How do you find the section in the core dump without having an executable which the debugger could use to tell you the
address of <code>ver</code>? Like so: <code>strings core | grep MagicOnlyFoundInVer</code>. Nasty, brutish, and short. The point
is, having the executable path <em>in the core dump</em> is an additional and often major improvement on top of having full
source paths <em>in the executable...</em> because you need to find the executable before you can find the source!)</p>
<p>Additionally, our <code>ver</code> variable is deliberately initialized with one <code>1</code> followed by zeros, since if
it's all zeros, then <code>.ver</code> will be a "bss" section, the kind zeroed by the loader and without space reserved for it
in the binary. So you'd have nowhere to write your actual, "non-reproducible" version info at a post-link step.</p>
<p>After the linker is done, you can use <code>objcopy</code> to replace the content of <code>.ver</code>. But if you're using
<code>refix</code>, which already mmaps the file, you can pass it more arguments to replace ELF sections:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>refix exe old-prefix new-prefix --section .ver file</code></pre>
<p><code>refix</code> will put the content of <code>file</code> into <code>.ver</code>, or fail if the file doesn't have the
right length. (We don't move stuff around in the ELF, only replace.)</p>
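<p>For reference, the plain <code>objcopy</code> route mentioned above might look like this, with <code>ver.bin</code> as a
hypothetical file holding the new contents; note that unlike <code>refix</code>, <code>--update-section</code> will resize the
section if the lengths differ:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>objcopy --update-section .ver=ver.bin binary</code></pre>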
<h3 id="what-about-compressed-debug-sections">What about compressed debug sections?</h3>
<p>What about them? I'm not sure why people use them, to be honest. I mean, who has <em>so many</em> executable files which they
don't want to compress as a whole (because they need to run them often, I presume), but they do want to compress the debug
sections to save space? Like, in what scenario is <em>this</em> your way of saving enough space to even worry about it?</p>
<p>But, they could be supported rather nicely, I think, if you really care. You wouldn't be able to just blithely
<code>mmap</code> a file and replace inside it without updating any offset field in the file, but I think you could come close,
or at least stay very far away from the kind of heavy lifting that makes this slow and bug-prone. Let's chat if you're interested
in this.</p>
<p>(I think maybe one problem is that some build caches have a file size limit? Like, something Bazel-related tops out at 2GB
since it's the maximal value of the Java int type?.. Let's talk about something else, this is making me very sad.)</p>
<h3 id="its-250-ms-on-generic-data.-and-you-still-did-the-elfar-thing-to-get-to-50-ms.-are-you-insane">It's 250 ms on generic
data. And you still did the ELF/ar thing to get to 50 ms. Are you insane?</h3>
<p>Well, it's 250 ms on a fast machine with a fast SSD. Some people have files on NAS, which can slow down the file access a
lot. In such cases, accessing 10x less of the <code>mmap</code>ed data will mitigate most of the NAS slowdown. You don't really
want to produce linker output on NAS, but it can be very hard to make the build system stop doing that, and I want people stuck
in this situation to at least have debuggable binaries without waiting even more for the build. So <code>refix</code> is
optimized for a slow filesystem.</p>
<p>But also, if it's not too much work, I like things to be fast. <a href="https://yosefk.com/blog/people-can-read-their-managers-mind.html">Insane or
not</a>, the people who make fast things are usually the people who like fast things, by themselves and not due to some
compelling reason, and I'm not sure I'm ashamed of maybe going overboard a bit; better safe than sorry. Like, I don't parse most
of the ELF file, which means I don't use the <code>Elf::parse</code> method from the <code>goblin</code> library, but instead I
wrote a 30 line function to parse just what I need.</p>
<p>This saves 300-350 ms, which, is it a lot? - maybe not. Will it become much more than that on a slower file system? I don't
know; it takes less time to optimize the problem away than to answer this question. Did I think of slow file systems when doing it?
- not as much as I was just annoyed that my original C++ program, which the Rust program is a "clean room" open source
implementation of, takes 150 ms and the Rust one takes about 400 ms. Am I happy now that I got it down to 50 ms? Indeed!</p>
<p>(Why is Rust faster? Not sure; I think, firstly, GNU <code>memmem</code> is slower than <code>memchr::memmem::Finder</code>,
and secondly, I didn't use TBB in C++ but did use Rayon in Rust, because the speedup is marginal - you bottleneck on I/O - and I
don't want to complicate the build for small gains, but in Rust it's not complicated - just <code>cargo add rayon</code>.)</p>
<p>It often takes less time to just do the efficient thing than it takes to argue about the amount it would save relatively to
the inefficient thing. (But it's still more time than just going ahead and doing the inefficient thing without arguing. But even
that is not always the case. But most people who make fast things will usually just go for the efficient thing when they see it
regardless of whether it's the case, I think. IME the people who always argue about whether optimizations are worth it make big and slow
things in the end.)</p>
<h3 id="im-as-crazy-as-you-and-i-want-this-speedup-for-non-elf-executable-formats.">I'm as crazy as you, and I want this speedup
for non-ELF executable formats.</h3>
<p>Let's chat. The <code>goblin</code> library probably supports your format - shouldn't take more than 100-150 LOC to handle
this in <code>refix</code>.</p>
<h3 id="which-binaries-should-i-run-this-stuff-on">Which binaries should I run this stuff on?</h3>
<p>Anything delivered "outside the build system" for the use of people (who run programs / load shared libraries) or other build
systems (which link code against static libraries / object files.) And nothing "inside the build system", or it will ruin
caching.</p>
<p>I hope for your sake that you have a monolithic build where you build everything from source. But I wouldn't count on it;
quite often team A builds libraries for team B, which gets them from Artifactory or something wonderful like that. In that case,
you might start out with a bug where some libraries are shipped with the MAGIC as their source prefix instead of the real thing.
This is easy to fix though, and someone might even remind you with "what's this weird MAGIC stuff?"</p>
<p>(Somehow nobody used to ask "what's <code>/local/clone-87fg12eb/src</code>", when <em>that</em> was the prefix instead of
MAGIC. Note that even if you have this bug and keep MAGIC in some library files, <em>nobody is worse off</em> than previously
when it was <code>/local/clone-87fg12eb/src</code>. And once you fix it, they'll be <em>better</em> off.)</p>
<h3 id="ci-removes-the-source-after-building-it.-what-should-the-destination-source-prefix-be..">CI removes the source after
building it. What should the destination source prefix be?..</h3>
<p>And here I was, thinking that it's the build cache not liking absolute paths that was the problem... It turns out that we
have a bigger problem: the source is just nowhere to be found! <code>/local/clone-87fg12eb/src</code> is gone forever!</p>
<p>But actually, it makes sense for CI to build on the local disk in a temporary directory. In parallel with building, CI can
export the code to a globally accessible NAS directory. And at the end of the build, CI can refix the binaries to that NAS
directory. It's not good to <em>build</em> from NAS (or to NAS) - it's not only slow, but fails in the worst ways under load -
which is why a temporary local directory makes sense. But NAS is a great place for <em>debugging tools</em> to get source from -
widely accessible with no effort for the user.</p>
<p>Many organizations decide against NAS source exports, because it would be too easy for developers. Instead you're supposed to
download the source via HTTP, which is much more scalable than NAS, thus solving an important problem you don't have; plus, you
can make yourself some coffee while the entire source code (of which you'll only need the handful of files you'll actually open
in the debugger) is downloaded and decompressed.</p>
<p>In that case, your destination source prefix should be wherever the user downloads the files to. Decide on any local path
independent of the user name, and with version information encoded in it, so multiple versions can be downloaded concurrently.
Have a nice cup of coffee!</p>
<h3 id="what-should-the-root-path-length-limit-be">What should the root path length limit be?</h3>
<p>100 bytes.</p>
<h3 id="our-ci-produces-output-in-filerkubernetesdockergitlabjenkinspre-commitdepartmentteamdeveloperbranch-nametest-suite-namerepo-which-is-110-bytes.">Our
CI produces output in
<code>/filer/kubernetes/docker/gitlab/jenkins/pre-commit/department/team/developer/branch-name/test-suite-name/repo/</code>,
which is 110 bytes.</h3>
<p>Great! Now you have a reason to ask them to shorten it. I'm sure they'll get to it in a quarter or two, if you keep
reminding.</p>
<h3 id="our-ceos-preschooler-works-as-a-developer-insists-on-a-200-byte-prefix-and-wont-tolerate-the-build-failing.">Our CEO's
preschooler works as a developer, insists on a 200 byte prefix, and won't tolerate the build failing.</h3>
<p>Then truncate the path without failing the build. He won't find the source code easily, but he can't find it <em>already
today.</em> <strong>If there's one thing fixing the problem won't do, it's making anyone worse off.</strong> It <em>can't</em>
make you worse off, since the current situation leaves it nowhere worse to take you. It could only possibly take you from
<em>never</em> being able to easily find the source to <em>sometimes</em>, if not always, being able to find it.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Use <code>refix</code>, <code>sed</code> or <code>debugedit</code> to make your fast, reproducible builds also effortlessly
debuggable, so that it's trivial to find the source given an executable - and the executable given a core dump.</p>
<p>And please don't tell me it's OK for developers to roam the Earth looking for source code instead. It hurts my feelings!</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>I don't mean "dark corners" in a bad way. I managed a build system team for a few years and am otherwise
interested in build systems, as evidenced by my writing this whole thing up. By "dark corners" I simply mean places invisible to
most of the organization unless something bad happens, so the risk of breaking things is larger than the reward for improving
them. It's quite understandable for such circumstances to beget a somewhat conservative approach.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>I've known more than one infrastructure team that did this "cross-user build directory reuse" without ever
hearing about each other. This method, while quite terrible in terms of optimization potential left on the table, owes its
depressing popularity to its high resilience to the terribleness of everything else (eg it doesn't mind poor network bandwidth
or even network instability, or the build flow not knowing where processes get their inputs and put their outputs; thus you can
use this approach with an almost arbitrarily bad build system and IT infrastructure.)<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>Yes, a 3GB shared object compiled from a C++ code base. Firstly, shame on you, it's not nice to laugh at people
with problems. Secondly, no, it's not stupid to have large binaries. It's much more stupid to split everything into many tiny
shared objects, actually. It was always <em>theoretically</em> stupid, but now mold made it <em>actually</em> stupid, because
linkage used to be theoretically very fast, and now it's actually very fast. And nothing good happens from splitting things to
umpteen tiny .so's... but that's a topic for another time.<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
<li id="fn4"><p>I've been told, in all seriousness and by an extremely capable programmer involved in a build system at the
time, that "debugging should NOT be made easy; we should incentivize more design-time effort to cut on the debugging effort." To
this I replied that Dijkstra would have been very proud of him, same as <a href="https://lispy.wordpress.com/2008/10/24/lisp50-notes-part-v-interlisp-parc-and-the-common-lisp-consolidation-wars/">he was
very angry with Warren Teitelman</a>, whom he confronted for the crime of presenting a debugger with "how many bugs are we going
to tolerate," getting the immortal reply "7." And I said that we should make debugging easy for those first and only 7 bugs
we're going to tolerate.<a class="footnote-back" role="doc-backlink" href="#fnref4">↩︎</a></p></li>
<li id="fn5"><p>But what if this info gets overwritten, seeing how it's not <code>const</code>?.. if you're really worried about
this section getting overwritten, of all things, you can align its base address and size to the OS page size, and
<code>mprotect</code> it at init time. This is exactly what <code>const</code> would achieve, but without excluding the section
from core dumps.<a class="footnote-back" role="doc-backlink" href="#fnref5">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/refix-fast-debuggable-reproducible-builds#comments</comments>
      <pubDate>Tue, 19 Mar 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/refix-fast-debuggable-reproducible-builds.feed</wfw:commentRss>
    </item>
    <item>
      <title>Managers have no human rights</title>
      <link>https://yosefk.com/blog/managers-have-no-human-rights.html</link>
      <description><![CDATA[<p>Here are some thoughts which are often basically correct:</p>
<ul>
<li>Every time I try to do the right thing here, it's like the place actively resists it. Actually, forget "the right thing" -
it's whenever I try to do pretty much <em>anything.</em></li>
<li>Yesterday's "strategic" thing I toiled over just got tossed into the dumpster. And they expect me to be all excited about
the new "strategy" they coughed up?</li>
<li>My "colleagues" attack me overtly and covertly all the time, all the while maintaining a "professional" and even cheerful
demeanor - in effect, a gaslighting tactic. And in this sewage lagoon, I'm supposed to get work done?</li>
<li>The deadline is impossible, and everybody knows it. Why are we all pretending that we're trying to meet it, when we're
actually busy rehearsing our speeches blaming the inevitable failure on each other?</li>
</ul>
<p>I'm sure you can add a few similar observations of your own, which at various times &amp; places were fairly accurate. My
point in this writeup is that <strong>a manager doesn't get to whine about any of this</strong>, any more than a boxer gets to
whine about a broken nose. A <em>normal</em> person very much <em>does</em> get to whine about a broken nose, and it isn't
whining - it's grounds for a lawsuit in any reasonable jurisdiction. But when a boxer steps into the boxing ring, he obviously
forfeits the basic human right of not getting punched in the face.</p>
<p>Similarly, a <em>normal</em> person - the so-called "individual contributor", which I guess is what we call workers in the
age where mice and keyboards replaced hammers and sickles - a normal person can reasonably expect some basics from the
workplace:</p>
<ul>
<li>The place should help me get work done, and provide various physical, informational and social infrastructure for this
purpose.</li>
<li>The place should articulate a strategy which my work meaningfully fits into, and manage changes to this strategy carefully
and thoughtfully.</li>
<li>I am entitled to healthy human relationships in the workplace, and to management fostering an environment conducive to
healthy relationships forming, rather than abusive and adversarial ones.</li>
<li>The schedule should not demand the impossible, and certainly when I work hard to meet whatever deadline was set, I should
not worry about being blamed for the team missing the deadline in the end.</li>
</ul>
<p>The condition meeting the full set of these lofty requirements is sometimes called "psychological safety." So in short, the
individual contributor is entitled to psychological safety - in hammer &amp; sickle terms, <b><em>workers</em> should be able to
focus on work</b>.</p>
<p>And by "should," I don't mean it always actually happens. I just mean that when it doesn't, you can reasonably expect to
resolve the situation by quitting <em>the team</em> you're on, without having to find <em>a different line of work.</em> That,
as opposed to a manager, who can <em>only</em> resolve this situation by finding a different line of work.</p>
<p>Why isn't the manager entitled to the same human rights as everyone else? Well, first of all, he just <em>isn't</em>,
meaning, if you see a manager who complains about said human rights of his being violated a lot, you can be certain that he's
not going to stay in management for long; he'll either have the common sense to quit or he'll be "demoted" from this position
(whether it's actually a demotion, and which way in a hierarchy is up is a question in itself; some theoreticians postulate that
there's no up, only out - but in any case, the manager will stop being one.)</p>
<p>Now, if I do try to explain this fact, which managers usually get at the gut level without needing an explanation, the
analogies that come to mind are the minister of foreign affairs and the plumber. You cannot, as a minister of foreign affairs,
be sad and shocked over countries plotting against you, and maybe even preparing to attack you - nobody wants a perpetually
shocked minister of foreign affairs. And similarly, you cannot be a plumber if you're appalled by the sight, smell and tactile
properties of shit.</p>
<p>"Individual contributors" can be fairly non-competitive, certainly in an industry like computing where demand for workers
outstrips supply, there's enough work for everyone, and where you know a ton of trivia about your area that anyone else would
need lots of time to learn if they had to step into your shoes. It's not only desirable but very possible to find a place where
no colleague is going to fight you in order to add your area of responsibility to theirs, and thus get promoted.</p>
<p>Managers, on the other hand, are <em>always</em> basically low-grade<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a> fighting each other, in the same way as countries always have conflicting interests, even if
they maintain what looks like cordial relationships. I mean it not as a statement about the character of managers, but as a
description of their condition. This condition follows, not from their character, but from various unfortunate facts of life -
for example, the fact that managers are assumed to be fungible and are hopelessly underinformed, and it gets worse with
rank.</p>
<p>The fungibility assumption means that a manager is always under a threat of losing "territory" to another manager, a reorg
making him a report of someone undesirable, or a straightforward replacement, much more than an IC, which creates a very
competitive environment. And the theoretical impossibility of managers being truly informed on the subjects falling under their
responsibility guarantees that their never-ending competition involves a lot of so-called "<a href="https://en.wikipedia.org/wiki/Misinformation">misinformation</a>, <a href="https://en.wikipedia.org/wiki/Disinformation">disinformation</a> and <a href="https://en.wikipedia.org/wiki/Malinformation">malinformation</a>." Of course, the manager's condition of eternal
competition fought on such wonderful terms <em>does</em> filter for character - and not in a way making a manager's day spent
with colleagues particularly pleasant.</p>
<p>In fact, all the unfortunate circumstances above - like the difficulty getting things done across the proverbial
(organizational) boundaries, deranged convulsions around "strategy," schedule chicken, and of course the scheming &amp; the
gaslighting - all this shit flows first and foremost from the competition between managers as well as the organization competing
in the external world.</p>
<p>ICs are entitled to managers shielding them from this shit - rarely fully, but quite often very considerably. Managers are
not entitled to this, because even if a 2nd, 3rd or Nth level manager would <em>like</em> to shield lower-ranking managers from
this (a rare, if laudable, desire), it's <em>not possible</em>, because there's just too much of it going on at the same time.
Of course it gets worse at higher ranks, but even a 1st level manager expecting <em>a positive atmosphere</em>, the kind that
ICs take for granted - "wow, cool stuff you've made there!" - will be sorely disappointed to learn that "please," "thank you,"
and "sorry" are gone from his day, replaced by "our requirement," "your commitment" and "customer escalation."</p>
<p>"If you want to make people happy, don't be a leader, sell ice cream," said a first-rate CEO and first-rate asshole Steve
Jobs. To this we might add, "If you want <em>people</em> to make <em>you</em> happy, consider selling ice cream, too." Or it
could be any sort of work which isn't management. A manager needs to be seriously driven by something other than having a nice
day, because that's just not gonna happen - the perfect drive <em>for you and your higher management</em> is "getting promoted,"
and the perfect drive for you to have <em>from the shareholders' POV</em> is "getting shit done." But motivation is a story for
another time.</p>
<h2 id="infrequently-raised-objections">Infrequently Raised Objections</h2>
<h3 id="i-am-a-manager-and-my-days-are-nice.">I am a manager, and my days are nice.</h3>
<p>Congratulations! You're either a great liar, including to yourself (all great liars start with themselves), or you're
completely indifferent to constant struggle and maybe you even enjoy it, or you're leading a very capable organization which
overdelivers often and underdelivers never (<em>how big is it?</em> A double digit number, tops?..), so you're enjoying what's
known as "peace through strength."</p>
<p>Rest assured that this condition is not fundamentally permanent (all strength is finite and always only goes so far), but do
enjoy it while it lasts, which can be for quite a while. Watch out for large reorgs, changes in the market / technology and
wider strategy (as I'm sure you do; only a competent manager gets to enjoy any duration of peace through strength.)</p>
<p>Or you're lucky.</p>
<h3 id="i-am-a-manager-and-my-manager-shields-me-from-this.">I am a manager, and my manager shields me from this.</h3>
<p>You're either effectively an IC, like "the leader of a team of 2 people under someone who makes every 3 people into a team,"
or you're working on something self-contained nobody needs and it will be soon canceled, or you have some infernal bond sealed
in goat's blood with your manager (or your manager has it with his), and when this thing explodes under external pressure, it
will be really ugly.</p>
<p>Or you're lucky.</p>
<h3 id="there-exist-organizations-free-from-the-dysfunction-you-describe.">There exist organizations free from the dysfunction
you describe.</h3>
<p>Like I said, "...or you're lucky." Sure, they exist. They're just rare, and usually don't remain that way as they grow (ever
heard "we're only hiring the best people?" Well, when you're hiring <em>a lot</em> of people, you're hiring <em>typical</em>
people, because there aren't this many "best" people readily available. Therefore, growth tends to bring about a regression to
the mean in all areas, including this one.) And most places are dysfunctional this way from day one, which by itself doesn't
prevent them from succeeding and growing; nor does a lack of this dysfunction guarantee success.</p>
<p>Speaking of which, I never quite understood the meaning of "dysfunctional" in "dysfunctional organizations." These
organizations definitely <em>function</em>; they generate trillions worth of world GDP. That they aren't fun to work at in a
managerial role might be true, but it is <em>not</em> their function to make it fun to work there in a managerial role. It is
the function of <em>you</em> to choose roles you can enjoy, and I hope the above can help some people with this.</p>
<h3 id="but-we-foster-a-non-hierarchical-culture-of-openness-and-curiosity.">But we foster a non-hierarchical culture of
openness and curiosity.</h3>
<p>If you're looking to decorate your office space, I have a suitcase full of hammers and sickles I brought from Soviet Russia.
I kept them to remind me of the old country, but your company sounds so awesome that I'll gladly send them to you.</p>
<p>Most deliberate attempts to improve upon the baseline outcome of people being people make things worse. You'll do everyone a
favor by going straight to the standard thing without going through a tedious cult phase first.</p>
<h3 id="calmly-accepting-dysfunction-does-not-a-good-manager-make.">Calmly accepting dysfunction does not a good manager
make.</h3>
<p>I didn't mean to imply that accepting and having a strategy for handling "dysfunction," or should I say the inevitable
consequences of the job description, is sufficient for being a good manager, whatever that means. I'm only saying that it's
<em>necessary</em> for being a manager <em>at all,</em> for any reasonable period of time and with a reasonable level of job
satisfaction.</p>
<h3 id="this-acceptance-is-not-a-binary-thing.-it-depends-on-how-bad-it-gets.">This "acceptance" is not a binary thing. It
depends on how bad it gets.</h3>
<p>It's binary in some people and not in others. There are 3 types of people: people who binarily can't accept it; people who
binarily can, regardless of the depths of depravity reached; and people on whom it starts taking a toll at a certain level. Your
type can be predicted pretty well based on what motivates you, and we'll discuss it in an upcoming, very motivational piece on
motivation.</p>
<p><em>Thanks to Dan Luu and Tim Pote for reviewing a draft of this post.</em></p>
<p><strong>P.S.</strong> There exists a breed of "individual contributor" with a fancy title, such as a Principal Engineer, a
Fellow and other such. The desirability of the existence of these titles is a subject in its own right; in our context, their
relevance is that they largely strip you of human rights as much as management titles do. One hint of why this is so is their
visibility as a status marker and the consequences of this visibility - their scarcity and the resulting competition, in many
places fiercer than the fight for management titles. An exception to the rule is if you're The Guy Who Did X for some
serious-ass value of X, and you got your fancy title thanks to that value of X, regardless of the "technical track" promotion
politics.</p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>Hopefully<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/managers-have-no-human-rights#comments</comments>
      <pubDate>Sun, 31 Mar 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/managers-have-no-human-rights.feed</wfw:commentRss>
    </item>
    <item>
      <title>The state of AI for hand-drawn animation inbetweening</title>
      <link>https://yosefk.com/blog/the-state-of-ai-for-hand-drawn-animation-inbetweening.html</link>
      <description><![CDATA[<p>There are many potential ways to use AI<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a>
(and computers in general) for 2D animation. I’m currently interested in a seemingly conservative goal: <strong>to improve the
productivity of a traditional hand-drawn full animation workflow by AI assuming responsibilities similar to those of a human
assistant.</strong></p>
<p>As a “sub-goal” of that larger goal, we’ll take a look at two recently published papers on animation “inbetweening” – the
automatic generation of intermediate frames between given keyframes. AFAIK these papers represent the current state of the art.
We’ll see how these papers and a commercial frame interpolation tool perform on some test sequences. We’ll then briefly discuss
the future of the broad family of techniques in these papers versus some substantially different emerging approaches.</p>
<p>There’s a lot of other relevant research to look into, which I’m trying to do - this is just the start. I should say that I’m
not “an AI guy” - or rather I am if you’re building an inference chip, but not if you’re training a neural net. I’m interested
in this as a programmer who could incorporate the latest tech into an animation program, and as an animator who could use that
program. But I’m no expert on this, and so <strong>I’ll be very happy to get feedback/suggestions</strong> through <a href="mailto:Yossi.Kreinin@gmail.com">email</a> or <a href="https://yosefk.com/cgi-bin/comments.cgi?post=blog/the-state-of-ai-for-hand-drawn-animation-inbetweening#comments">comments</a>.</p>
<p>I’ve been into animation tech since forever, and what’s possible these days is exciting. Specifically with inbetweening tech,
I think we’re still “not there yet”, and I think you’ll agree after seeing the results below. But we might well get there within
a decade, and maybe much sooner.</p>
<p>I think this stuff is very, very interesting! If you think so, too, we should get in touch. Doubly so if you want to work on
this. I am going to work on exactly this!</p>
<h2 id="motivation-and-scope">Motivation and scope</h2>
<p>Why is it interesting to make AI a 2D animator’s assistant, of all the things we could have it do (text to video, image to
video, image style transfer onto a video, etc.)?</p>
<ul>
<li><strong>An animator is an actor</strong>. The motion of a character reflects the implied physical and mental state of that
character. If the motion of a character, even one designed by a human, is fully machine-generated, it means that human control
over acting is limited; the machine is now the actor, and the human’s influence is limited to “directing” at best. It is
interesting to develop AI-assisted workflows where the human is still the actor.</li>
<li><strong>To control motion, the animator needs to draw several keyframes</strong> (or perhaps edit a machine-generated draft
- but with the option of erasing and redrawing it fully.) The range of ways to do “a sad walk” or “an angry, surprised head turn”
and the range of character traits influencing the acting are too wide for acting to be controlled via cues other than actually
drawing the pose.</li>
<li><strong>If a human is to be in control, “moving line art” is the necessary basis for any innovations in the appearance of
the characters.</strong> That’s because humans use a “light table”, aka “onion skin”, to draw moving characters, where you see
several frames overlaid on top of each other (like the frames of the bouncing ball sequence below). And it’s basically not humanly
possible to “read” a light table unless the frames have the sharp edges of line art (believe me, I spent more time trying than I
should have.) Any workflow with human animators in control of motion needs to have line art at its basis, even if the final
rendered film looks very different from the traditional line art style.</li>
</ul>
<p><img alt="ball-light-table.png" height="237" src="https://yosefk.com/img/inbetweening/ball-light-table.png" title="several frames on a light table" width="576" style="max-width: 100%;height: auto;"></p>
<ul>
<li><strong>The above gives the human a role similar to a traditional key animator, so it’s natural to give the machine the
roles of assistants.</strong> It could be that AI can additionally do some of the key animator’s work, so that fewer keyframes
are provided in some cases than you’d have to give a human assistant (and one reason for this could be your ability to quickly
get the AI to complete your work in 10-20 possible ways, and choose the best option, which is impractical with a human
assistant.) But the basic role of the human as a key animator would remain, and so the first thing to explore is the machine
taking over the assistant’s role.</li>
</ul>
<p>So I’m not saying that we can’t improve productivity beyond the “machine as the assistant” arrangement, nor that we must
limit ourselves to the traditional appearance of hand-drawn animation. I’m just saying that <strong>our conservative scope is
likely the right <em>starting point</em>, even if our final goals are more ambitious - at least as long as we want the human to
remain the actor.</strong></p>
<p>What would the machine do in an assistant’s role? Traditionally, assistants’ jobs include:</p>
<ul>
<li>Inbetweening (drawing frames between the key frames)</li>
<li>Cleanup (taking rough “pencil” sketches and “inking” them)</li>
<li>Coloring (“should” be trivial with a paint bucket tool, but surprisingly annoying around small gaps in the lines - see the toy sketch right after this list)</li>
</ul>
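<p>As a toy illustration of the coloring annoyance - my own sketch, nothing to do with the papers below - one crude way to keep the paint bucket from leaking through small gaps is to thicken the lines before computing connected regions. This assumes SciPy and a grayscale raster with dark lines on a white background:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np
from scipy import ndimage

def bucket_fill(line_art, seed_yx, color, gap=2, threshold=128):
    # toy sketch: thicken the lines before labeling connected regions,
    # so gaps smaller than about 2*gap pixels can't leak paint
    lines = line_art &lt; threshold
    thick = ndimage.binary_dilation(lines, iterations=gap)
    regions, _ = ndimage.label(~thick)
    mask = regions == regions[seed_yx]  # region containing the seed pixel
    out = line_art.copy()
    out[mask] = color
    # a real tool would also fill the thickened band back to the line edge
    return out
</pre>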
<p><strong>Our scope here is narrowed further by focusing exclusively on inbetweening</strong>. There’s no deep reason for this
beyond having to start somewhere, and inbetweening being the most “animation-y” assistant’s job, because it’s about movement. So
focusing our search on inbetweening is most likely to give results relevant to animation and not just “still” line art.</p>
<p><strong>Finally, in this installment, we’re going to focus on papers which <em>call themselves</em> “AI for animation
inbetweening” papers. </strong>It’s <em>not</em> obvious that any relevant “killer technique” has to come from a paper focusing
on this problem explicitly. We could end up borrowing ideas from papers on video frame interpolation, or video/animation
generation not designed for inbetweening, etc. In fact, I’m looking at some things like this. But again, let’s start
somewhere.</p>
<h2 id="preamble-testing-runway">Preamble: testing Runway</h2>
<p>Before looking at papers for the latest ideas, let’s check out <a href="https://runwayml.com/ai-tools/frame-interpolation/">Runway Frame Interpolation</a>. Together with Stability AI and the
CompVis group, Runway researchers were behind <a href="https://en.wikipedia.org/wiki/Stable_Diffusion">Stable Diffusion</a>, and
Runway is at the forefront of deploying generative AI for video.</p>
<p>Let’s test frame interpolation on a sneaky cartoony rabbit sequence. It’s good as a test sequence because it has both
fast/large and slower/smaller movement (so both harder and easier parts.) It also has both “flat 2D” body movement and “3D” head
rotation - one might say too much rotation… But rotation is good to test because it’s a big reason for doing full hand-drawn
animation. Absent rotation, you can split your character into “cut-out” parts, and animate it by <a href="https://duik.rxlab.guide/Angela/index.html">moving and stretching these parts</a>.</p>
<p><img alt="rabbit.gif" height="540" src="https://yosefk.com/img/inbetweening/rabbit.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>We throw away every second frame, ask Runway to interpolate the sequence, and after some conversions and a frame rate
adjustment (don’t ask), we get something like this:</p>
<p><img alt="rabbit-runway.gif" height="540" src="https://yosefk.com/img/inbetweening/rabbit-runway.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>This tool definitely isn’t currently optimized for cartoony motion. Here’s an example inbetween:</p>
<p><img alt="rabbit-runway-inbetween.png" height="540" src="https://yosefk.com/img/inbetweening/rabbit-runway-inbetween.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>Now let’s try a similar sequence with a sneaky me instead of a sneaky rabbit. Incidentally, this is one of several styles I’m
interested in - something between live action and Looney Tunes, with this self-portrait taking live action maybe 15% towards
Looney Tunes:</p>
<p><img alt="myself.gif" height="540" src="https://yosefk.com/img/inbetweening/myself.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>Frame interpolation looks somewhat better here, but it’s still more <em>morphing</em> than <em>moving</em> from pose to
pose:</p>
<p><img alt="myself-runway.gif" height="540" src="https://yosefk.com/img/inbetweening/myself-runway.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>An example inbetween:</p>
<p><img alt="myself-runway-inbetween.png" height="540" src="https://yosefk.com/img/inbetweening/myself-runway-inbetween.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>While the Frame Interpolation tool currently doesn’t work for this use case, I’d bet that Runway could solve the problem
quicker and better than most if they wanted to. Whether there’s a large enough market for this is another question, and it might
depend on the exact definition of “this.” Personally, I believe that a lot of good things in life cannot be “monetized”, a lot
of art-related things are in this unfortunate category, and I’m very prepared to invest time and effort into this without clear,
or even any prospects of making money.</p>
<p>In any case, we’ve got our test sequences, and we’ve got our motivation to look for better performance in recent papers.</p>
<h2 id="raster-frame-representation">Raster frame representation</h2>
<p>There’s a lot of work on AI for image processing/computer vision. It’s natural to borrow techniques from this deeply
researched space and apply them to line art represented as raster images.</p>
<p>There are a few papers doing this; AFAIK the state of the art with this approach is currently <a href="https://arxiv.org/abs/2111.12792">Improving the Perceptual Quality of 2D Animation Interpolation</a> (2022). Their <a href="https://github.com/ShuhongChen/eisai-anime-interpolator">EISAI GitHub repo</a> points to a colab demo and a Docker image
for running locally, which I did, and things basically Just Worked.</p>
<p>That this can even happen blows my mind. I remember how things worked 25 years ago, when you rarely had the code published,
and people implementing computer vision papers would occasionally swear that the paper is outright lying, because the described
algorithms don’t do and couldn’t possibly do what the paper says.</p>
<p>The sequence below shows <em>just</em> inbetweens produced by EISAI. Meaning, frame N is produced from the original frames
N-1 and N+1; there’s not a single original frame here. So this sequence isn’t directly comparable to Runway’s output.</p>
<p><img alt="myself-eisai.gif" height="540" src="https://yosefk.com/img/inbetweening/myself-eisai.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>I couldn’t quite produce the same output with Runway as with the papers (don’t ask.) If you care, this sequence is closer to
being comparable to Runway’s, if not fully apples to apples:</p>
<p><img alt="myself-eisai-comparable-to-runway.gif" height="540" src="https://yosefk.com/img/inbetweening/myself-eisai-comparable-to-runway.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>If you look at individual inbetweens, you’ll see that EISAI and Runway have similar difficulties - big changes between
frames, occlusion and deformation, and both do their best and worst in about the same places. One of the best inbetweens by
EISAI:</p>
<p><img alt="myself-eisai-best.png" height="540" src="https://yosefk.com/img/inbetweening/myself-eisai-best.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>One of the worst:</p>
<p><img alt="myself-eisai-worst.png" height="540" src="https://yosefk.com/img/inbetweening/myself-eisai-worst.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>The inbetweens are produced by <strong>forward-warping based on bidirectional flow estimation</strong>. “Flow estimation”
means computing, per pixel or region in the first keyframe, its most likely corresponding location in the other keyframe -
“finding where it went to” in the other image (if you have “two images of mostly the same thing,” you can hope to find parts
from one in the other.) “Warping” means transforming pixel data - for example, scaling, translating and rotating a region.
“Forward-warping by bidirectional flow estimation” means taking regions from both keyframes and warping them to put them “where
they belong” in the inbetween - which is halfway between a region’s position in the source image, and the position in the other
image that the flow estimation says this region corresponds to.</p>
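<p>To make this concrete, here’s a toy numpy sketch of forward-warping by bidirectional flow - my own illustration, not the paper’s code. It assumes RGB frames of shape (H, W, 3) and flows of shape (H, W, 2) estimated elsewhere, with flow[..., 0] being the x displacement:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np

def forward_warp_midpoint(frame_a, frame_b, flow_ab, flow_ba):
    # splat every pixel of each keyframe halfway along its flow vector,
    # and average the two warped keyframes into the inbetween
    h, w = frame_a.shape[:2]
    out = np.zeros((h, w, 3), np.float32)
    weight = np.zeros((h, w), np.float32)
    for frame, flow in ((frame_a, flow_ab), (frame_b, flow_ba)):
        ys, xs = np.mgrid[0:h, 0:w]
        tx = np.clip((xs + 0.5*flow[..., 0]).round().astype(int), 0, w - 1)
        ty = np.clip((ys + 0.5*flow[..., 1]).round().astype(int), 0, h - 1)
        np.add.at(out, (ty, tx), frame.astype(np.float32))
        np.add.at(weight, (ty, tx), 1.0)
    covered = weight > 0  # pixels nobody warped into stay empty - the "holes"
    out[covered] /= weight[covered][:, None]
    return out.astype(frame_a.dtype), covered
</pre>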
<p>Warping by flow explains the occasional 3-4 arms and legs and 2 heads (it warps a left hand from both input images into two
far-away places in the output image, since the flow estimator found a wrong match, instead of matching the hands to each other.)
This also explains “empty space” patches of various sizes in the otherwise flat background.</p>
<p>Notably, warping by flow “gives up” on cases of occlusion up front (I mean cases where something is visible in one frame and
not in the other due to rotation or any other reason.) If your problem formulation is “let’s find parts of one image in the
other image, and warp each part to the middle position between where it was in the first and where we found it in the second” -
then the <em>correct</em> answer to “where did the occluded part move?” is “I don’t know; I can’t track something that isn’t
there.”</p>
<p>(Note that the system being an “AI” has no impact on this point. You could have a “traditional,” “hardcoded” system for
warping based on optical flow, or a differentiable one with trainable parameters (“AI”.) Let’s say we believe the trainable one
is likely to achieve better results. But training does not sidestep the question of <em>what it is</em> whose parameters are being
trained, and what the model can, or can’t, possibly do once trained.)</p>
<p>When the optical flow matches “large parts” between images correctly, you still have occasional issues due to both images
being warped into the result, with “ghosting” of details of fingers or noses or what-not (meaning, you see two slightly
different drawings of a hand at roughly the same place, and you see one drawing through the other, as if that other drawing was
a semi-transparent “ghost”.) A dumb question coming to my mind is whether this could be improved through brute force, by “increasing
the resolution of the image” / having a “higher-resolution flow estimation,” so you have a larger number of smaller patches
capable of representing the deformations of details, because each patch is tracked and warped separately.</p>
<p>An interesting thing in this paper is <strong>the use of <a href="https://en.wikipedia.org/wiki/Distance_transform">distance
transform</a> to “create” texture for convolutional neural networks to work with for feature extraction.</strong> The distance
transform replaces every pixel value with the distance from that pixel’s coordinates to the closest black pixel. If you
interpret distances as black &amp; white pixel values, this gives “texture” to your line art in a way. The paper cites “Optical
flow based line drawing frame interpolation using distance transform to support inbetweenings” (2019) which also used distance
transform for this purpose.</p>
<p><img alt="distance-transform.png" height="509" src="https://yosefk.com/img/inbetweening/distance-transform.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>If you’re dealing with 2D animation and you’re borrowing image processing/computer vision neural networks (hyperparameters
and maybe even pretrained weights, as this paper does with a few layers of ResNet), you will have the problem of “lack of
texture” - you have these large flat-color regions, and the output of every convolution on each pixel within the region is
obviously exactly the same. Distance transform gives some texture for the convolutions to “respond” to.</p>
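<p>If you want to see the effect yourself, here’s a small sketch using SciPy’s distance transform (the papers use their own variants, but the idea is the same), assuming a grayscale raster with dark lines on a white background:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np
from scipy.ndimage import distance_transform_edt

def texturize_line_art(line_art, threshold=128):
    # per background pixel, compute the distance to the nearest line pixel;
    # line pixels themselves get distance 0
    is_background = line_art > threshold
    dist = distance_transform_edt(is_background)
    # rescale into 0..255 so the result can be viewed as an image
    return (255*dist/max(dist.max(), 1)).astype(np.uint8)
</pre>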
<p>This amuses me in a “machine learning inside joke” sort of way. “But they told me that <em>manual feature engineering</em>
was over in the era of Deep Learning!” I mean, sure, a lot of it is over - you won’t see a paper on “the next <a href="https://en.wikipedia.org/wiki/Scale-invariant_feature_transform">SIFT</a> or <a href="https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients">HOG</a>.” But, apart from the “hyperparameters” (a name
for, basically, the entire network architecture) being manually engineered, and the various manual <a href="https://en.wikipedia.org/wiki/Data_augmentation">data augmentation</a> and what-not, what’s <a href="https://github.com/kornia/kornia">Kornia</a>, if not “a tool for manual feature engineering in a differentiable
programming context”? And I’m not implying that there’s anything wrong with it - quite the contrary, my point is that people
still do this because it works, or at least makes some things work better.</p>
<p>Before we move on to other approaches, let’s check how EISAI does on the rabbit sequence. I don’t care for the rabbit
sequence; I’m selfishly interested in the me sequence. But since unlike Runway, EISAI was trained on animation data, it seems
fair to feed it something more like the training data:</p>
<p><img alt="rabbit-eisai.gif" height="540" src="https://yosefk.com/img/inbetweening/rabbit-eisai.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>Both Runway and EISAI do worse on the rabbit, which has more change in hands and ears and walks a bit faster. It seems that
large movements, deformations and rotations affect performance more than “similarity to training data,” or at least similarity
in a naive sense.</p>
<h2 id="vector-frame-representation">Vector frame representation</h2>
<p>Instead of treating the input as images, you could work on a vector representation of the lines. AFAIK the most recent paper
in this category is <a href="https://arxiv.org/abs/2309.16643">Deep Geometrized Cartoon Line Inbetweening</a> (2023). Their <a href="https://github.com/lisiyao21/AnimeInbet">AnimeInbet GitHub repo</a> lets you reproduce the paper’s results. To run on your
own data, you need to hack the code a bit (at least I didn’t manage without some code changes.) More importantly, you need to
vectorize your input data somehow.</p>
<p>The paper doesn’t come with its own input drawing vectorization system, and arguably shouldn’t, since vector drawing programs
exist, and vectorizing raster drawings is a problem in its own right and outside the paper’s scope. The code in the paper has no
trouble getting input data in a vector representation because their line art dataset is produced from their dataset of moving 3D
characters, rendered with a “toon shader” or whatever the thing rendering lines instead of shaded surfaces is called. And since
the 2D points/lines come from 3D vertices/edges, you’re basically projecting a 3D vector representation into a 2D space and it’s
still a vector representation.</p>
<p><img alt="animeinbet-dataset-characters.png" height="288" src="https://yosefk.com/img/inbetweening/animeinbet-dataset-characters.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>What’s more, <strong>this data set provides a kind of ground truth that you don’t get from 2D animation data sets - namely,
detailed correspondence between the points in both input frames and the ground truth inbetween frame</strong>. If your ground
truth is a frame from an animated movie, you only know that this frame is “the inbetween you expect between the previous frame
and the next.” But here, you know where every 3D vertex ended up in every image!</p>
<p>This correspondence information is used at training time - and omitted at inference time, or it would be cheating. So if you
want to feed data into AnimeInbet, you only need to vectorize this data into points connected by straight lines, without
worrying about vertex correspondence. The paper itself cites <a href="https://github.com/MarkMoHR/virtual_sketching">Virtual
Sketching</a>, itself a deep learning based system, as the vectorization tool they used for their own experiments in one of the
“ablation studies” (I know it’s idiomatic scientific language, but can I just say that I love this expression? “Please don’t
contribute to the project during the next month. We’re performing an ablation study of individual productivity. If the study
proves successful, you shall be ablated from the company by the end of the month.”)</p>
<p>There are comments in the AnimeInbet repo about issues using Virtual Sketching; mine was that some lines partially
disappeared (could be my fault for not using it properly.) I ended up writing some neanderthal-style image processing code <a href="https://en.wikipedia.org/wiki/Topological_skeleton">skeletonizing</a> the raster lines, and then <a href="https://en.wikipedia.org/wiki/Flood_fill">flood-filling</a> the skeleton and connecting the points while flood-filling.
I’d explain this at more length if it was more than a one-off hack; for what it’s worth, I <em>think</em> it’s reasonably
correct for present purposes. (My “testing” is that when I render my vertices and the lines connecting them and eyeball the
result, no obviously stupid line connecting unrelated things appears, and no big thing from the input raster image is clearly
missing.)</p>
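<p>For reference, here’s a minimal sketch of the skeletonization step using scikit-image - not my exact hack (that one is in the repo linked below), and the flood-fill part connecting the points into lines is omitted:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np
from skimage.morphology import skeletonize

def skeleton_points(line_art, threshold=128):
    # line_art: 2D uint8 raster, dark lines on a white background
    lines = line_art &lt; threshold   # True on the (thick) line pixels
    skeleton = skeletonize(lines)   # thin every line to 1 pixel wide
    ys, xs = np.nonzero(skeleton)
    # these are the vertices; connecting them into polylines is the
    # flood-fill part, omitted here
    return np.stack([xs, ys], axis=1)
</pre>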
<p>This hacky “vectorization” code (might need more hacking to actually use) is in the <a href="https://github.com/yosefk/AnimationPapers">Animation Papers GitHub repo</a>, together with other code you might use to run
AnimeInbet on your data.</p>
<p>Results on our test sequences:</p>
<p><img alt="myself-animeinbet.gif" height="576" src="https://yosefk.com/img/inbetweening/myself-animeinbet.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>The rabbit is harder for AnimeInbet, just as it was for the other systems. For example, the ears are completely destroyed by the head
turn, as usual:</p>
<p><img alt="rabbit-animeinbet.gif" height="576" src="https://yosefk.com/img/inbetweening/rabbit-animeinbet.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>The worst and the best inbetweens occur in pretty much the same frames:</p>
<p><img alt="myself-animeinbet-worst.png" height="576" src="https://yosefk.com/img/inbetweening/myself-animeinbet-worst.png" width="576" style="max-width: 100%;height: auto;"></p>
<p><img alt="myself-animeinbet-best.png" height="576" src="https://yosefk.com/img/inbetweening/myself-animeinbet-best.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>Visually notable aspects of AnimeInbet’s output compared to the previous systems we’ve seen:</p>
<ul>
<li><strong>AnimeInbet doesn’t blur lines</strong>. It might <em>shred</em> lines on occasion, but you don’t <em>blur</em>
vector lines like you blur pixels. (You very much <em>can</em> put a bunch of garbage lines into the output, and AnimeInbet is
pretty good at <em>not</em> doing that, but this capability belongs to our next item. Here we’ll just note that raster-based
systems didn’t quite “learn” to avoid line blurring, which this system avoids by design.)</li>
<li><strong>AnimeInbet seems quite good at matching small details and avoiding ghosting/copying the same thing twice from both
images.</strong> This is not something that can salvage bad inbetweens, but it makes good inbetweens better; in the one above,
the pants and the hands are examples where small detail is matched better than in the raster systems.</li>
<li><strong>For every part, AnimeInbet either finds a match or removes it from the output.</strong> The paper formulates
inbetweening as a graph matching problem (where the line art’s points are the graph’s nodes and the lines connecting them are its
edges.) Parts without a match are marked as invisible - a toy sketch of this “match or drop” logic follows this list. This
doesn’t “solve” occlusion or rotation, but it tends to keep you from putting stuff into the
output that the animator needs to erase and redraw afterwards. This makes good inbetweens marginally better; for bad inbetweens,
it makes them “less funny” but probably not much more usable (you get 2 legs instead of 4, but they’re often <em>not the right
legs;</em> and you can still get a head with two foreheads as in the bad inbetween above.)</li>
</ul>
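<p>Here’s the promised toy sketch of the “match or drop” logic - nothing like the paper’s learned graph matching, just the simplest possible version using mutual nearest neighbors, to show what “parts without a match are removed” means:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np
from scipy.spatial import cKDTree

def match_or_drop(points_a, points_b, max_dist=20.0):
    # points_a, points_b: (N,2) and (M,2) vertex coordinates of two keyframes
    dist, nearest_in_b = cKDTree(points_b).query(points_a)
    _, nearest_in_a = cKDTree(points_a).query(points_b)
    inbetween = []
    for i, j in enumerate(nearest_in_b):
        # accept mutual nearest neighbors that are close enough
        if dist[i] &lt;= max_dist and nearest_in_a[j] == i:
            inbetween.append((points_a[i] + points_b[j]) / 2.0)
    # unmatched vertices are simply absent from the output,
    # like AnimeInbet's "invisible" vertices
    return np.array(inbetween)
</pre>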
<p>The AnimeInbet paper has a comprehensive evaluation of their system vs. other systems (EISAI and VFIformer, as well as FILM and RIFE -
video interpolation rather than specifically animation inbetweening systems.) According to their methodology (where they use
their own test dataset), their system comes out ahead by a large margin. In my extremely small-scale and qualitative testing,
I’d say that it looks better, too, though perhaps less dramatically.</p>
<p>Here we have deep learning with a model and input data set tailored carefully to the problem - something I think you won’t
see as often as papers reusing one or several pretrained networks, and combining them with various adaptations to apply to the
problem at hand. My emotional reaction to this approach appearing to do better than ideas borrowed from “general image/video AI
research” is mixed.</p>
<p>I like “being right” (well, vaguely) about AI <em>not</em> being “general artificial intelligence” but a set of techniques
that you need to apply carefully to build a system for your needs, instead of just throwing data into some giant general-purpose
black box - this is something I like going on about, maybe more than I should given my level of understanding. As a prospective
user/implementer looking for “the next breakthrough paper,” however, it would be better for me if ideas borrowed from “general
video research” worked great, because there’s so many of them compared to the volume of “animation-focused research.”</p>
<p>I mean, Disney already fired its hand-drawn animation department years ago. If the medium is to be revived (and people even
caring about it aren’t getting any younger), it’s less likely to happen through direct investment into animation than as a
byproduct of other, more profitable things. I guess we’ll see how it goes.</p>
<h2 id="applicability-of-2d-feature-matching-techniques">Applicability of “2D feature matching” techniques</h2>
<p>No future improvement of the techniques in these papers can possibly take care of “all of inbetweening,” because occlusion and
rotation happen a lot, and do not fit these papers’ basic approach of <strong>matching 2D features</strong> in the input frames.
And even the best inbetweens aren’t quite usable as is. But they could be used with some editing, and it could be easier to edit
them than draw the whole thing from scratch.</p>
<p>An encouraging observation is that <strong>machines struggle with big changes and people struggle with small changes, so they
can complement each other well</strong>. A human is better at (and less bored by) drawing an inbetween between two keyframes
which look very different than drawing something very close to both input frames and putting every line at juuuuust the right
place. If machines can help handle the latter kind of work, even with some editing required, that’s great!</p>
<p>It’s very interesting to look into approaches that <em>can</em> in fact handle more change between input frames. For example,
check out the middle frame below, generated from the frames on its left and right:</p>
<p><img alt="yoga-mat.png" height="158" src="https://yosefk.com/img/inbetweening/yoga-mat.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>This is from <a href="https://time-reversal.github.io/">Explorative Inbetweening of Time and Space</a> (2024); they say the
code is coming soon. It does have some problems with occlusion (look at the right arm in the middle image.) But it seems to only
struggle when showing something that is occluded <em>in both input frames</em> (for example, the right leg is fine, though it’s
largely occluded in the image on the left.) This is a big improvement over what we’ve seen above, or right below (this is one
frame of Runway’s output, where one right leg slowly merges into the left leg, while another right leg is growing):</p>
<p><img alt="yoga-mat-runway.png" height="321" src="https://yosefk.com/img/inbetweening/yoga-mat-runway.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>But what’s even more impressive - extremely impressive - is that the system decided that <em>the body would go up before
going back down</em> between these two poses! (Which is why it’s having trouble with the right arm in the first place! A feature
matching system wouldn’t have this problem, because it wouldn’t realize that in the middle position, the body would go up, and
the right arm would have to be somewhere. Struggling with things not visible in either input keyframe is a good problem to have
- it’s evidence of knowing these things <em>exist,</em> which demonstrates quite the capabilities!)</p>
<p>This system clearly learned a lot about three-dimensional real-world movement behind the 2D images it’s asked to interpolate
between. Let’s call approaches going in this direction “<strong>3D motion reconstruction</strong>” techniques (and I apologize
if there’s better, standard terminology / taxonomy; I’d use it if I knew it.)</p>
<p>My point here, beyond eagerly waiting for the code in this paper, is that feature matching techniques might remain
interesting in the long term, <em>precisely because “they don’t understand what’s going on in the scene.”</em> Sure, they
clearly don’t learn “how a figure moves or looks like.” But this gives some hope that what they <em>can</em> do - handling small
changes - will work <em>on more kinds of inputs</em>. Meaning, a system that “learned human movement” might be less useful for
an octopus sequence than a system that “learned to match patches of pixels, or graphs of points connected by lines.” So falling
back on 2D feature matching could remain useful for a long time, even once 3D motion reconstruction works great on the kinds of
characters it was trained on.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I think we can agree that animation inbetweening doesn’t quite work at the moment, though it might already be useful for
inbetweening small movements, which is otherwise a painstaking process for a human. I think we can also agree that it’s
reasonable to hope it will be production-ready quite soon, and emerging inbetweening systems which “understand and reconstruct
movement,” beyond “matching image features,” are one reason to be hopeful.</p>
<p>In future installments, I hope to look into more techniques for inbetweening, and the closely related question of what
animators need to control inbetweening, beyond just giving the system two keyframes. <strong>Human inbetweeners certainly get
more input than pairs of keyframes.</strong> This makes me believe that it’s not just the <em>plausibility</em> of the
inbetweens you produce, but their <em>controllability</em> which is going to determine “the winning technique.”</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>I miss the time when they called it machine learning rather than artificial intelligence, and the milder, calmer
economic conditions which were a moderating influence on terminology (in the end, whether it’s called ML or AI is an investors’
preference.) But I’m giving up and calling it AI, since at this point calling it ML is more a readability issue than anything
else.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/the-state-of-ai-for-hand-drawn-animation-inbetweening#comments</comments>
      <pubDate>Wed, 17 Apr 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/the-state-of-ai-for-hand-drawn-animation-inbetweening.feed</wfw:commentRss>
    </item>
    <item>
      <title>A 100x speedup with unsafe Python</title>
      <link>https://yosefk.com/blog/a-100x-speedup-with-unsafe-python.html</link>
      <description><![CDATA[<p>We're going to speed up some numpy code by 100x using "unsafe Python." Which is not quite the same as unsafe Rust, but it's a
bit similar, and I'm not sure what else to call it... you'll see. It's not something you'd use in most Python code, but it's
handy on occasion, and I think it shows "the nature of Python" from an interesting angle.</p>
<p>So let's say you use <a href="https://pyga.me/">pygame</a> to write a simple game in Python.</p>
<p>(Is pygame the way to go today? I'm not the guy to ask; to its credit, it has very simple screen / mouse / keyboard APIs,
and is quite portable because it's built on top of <a href="https://www.libsdl.org/">SDL</a>. It runs on the major desktop
platforms, and with a bit of fiddling, you can run it on Android using <a href="https://buildozer.readthedocs.io/en/latest/">Buildozer</a>. In any case, pygame is just one real-life example where a
problem arises that "unsafe Python" can solve.)</p>
<p>Let us furthermore assume that you're resizing images a lot, so you want to optimize this. And so you discover the somewhat
unsurprising fact that <a href="https://opencv.org/">OpenCV</a>'s resizing is faster than pygame's, as measured by the following
simple microbenchmark:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">from contextlib import contextmanager
import time

@contextmanager
def Timer(name):
    start = time.time()
    yield
    finish = time.time()
    print(<i><b>f'{name} took {finish-start:.4f} sec'</b></i>)

import pygame as pg
import numpy as np
import cv2

IW = 1920
IH = 1080
OW = IW // 2
OH = IH // 2

repeat = 100

isurf = pg.Surface((IW,IH), pg.SRCALPHA)
with Timer(<i><b>'pg.Surface with smoothscale'</b></i>):
    for i in range(repeat):
        <b>pg.transform.smoothscale</b>(isurf, (OW,OH))

def cv2_resize(image):
    return cv2.resize(image, (OH,OW), interpolation=cv2.INTER_AREA)

i1 = np.zeros((IW,IH,3), np.uint8)
with Timer(<i><b>'np.zeros with cv2'</b></i>):
    for i in range(repeat):
        o1 = <b>cv2_resize</b>(i1)
</pre>
<p>This prints:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">pg.Surface with smoothscale took <b>0.2002</b> sec
np.zeros with cv2 took <b>0.0145</b> sec
</pre>
<p>Tempted by the nice 13x speedup reported by the microbenchmark, you go back to your game, and use
<code>pygame.surfarray.pixels3d</code> to get zero-copy access to the pixels as a numpy array. Full of hope, you pass this array
to <code>cv2.resize</code>, and observe that everything got much <em>slower</em>. Dammit! "Caching," you think, "or something.
Never trust a microbenchmark!"</p>
<p>Anyway, just in case, you call cv2.resize on the pixels3d array in your microbenchmark. Perhaps the slowdown will
reproduce?..</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">i2 = <b>pg.surfarray.pixels3d</b>(isurf)
with Timer(<b><i>'pixels3d with cv2'</i></b>):
    for i in range(repeat):
        o2 = cv2_resize(i2)
</pre>
<p>Sure enough, this is very slow, just like you saw in your larger program:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">pixels3d with cv2 took <b>1.3625</b> sec
</pre>
<p>So 7x slower than smoothscale - and more shockingly, almost <strong>100x</strong> slower than cv2.resize called with
numpy.zeros! What gives?! Like, we have two zero-initialized numpy arrays <strong>of the same shape and datatype.</strong> And
of course the resized output arrays have the same shape &amp; datatype, too:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">print(<b><i>'i1==i2 is'</i></b>, np.all(i1==i2))
print(<b><i>'o1==o2 is'</i></b>, np.all(o1==o2))
print(<b><i>'input shapes'</i></b>, i1.shape,i2.shape)
print(<b><i>'input types'</i></b>, i1.dtype,i2.dtype)
print(<b><i>'output shapes'</i></b>, o1.shape,o2.shape)
print(<b><i>'output types'</i></b>, o1.dtype,o2.dtype)
</pre>
<p>Just like you'd expect, this prints that everything is the same:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>i1==i2 is True
o1==o2 is True
input shapes (1920, 1080, 3) (1920, 1080, 3)
input types uint8 uint8
output shapes (960, 540, 3) (960, 540, 3)
output types uint8 uint8</code></pre>
<p>How could a function run 100x more slowly on one array relative to the other, seemingly identical array?.. I mean, you
would hope SDL <em>wouldn't</em> allocate pixels in some particularly slow-to-access RAM area - even though it theoretically
<em>could</em> do stuff like that, with a little help from the kernel (like, create a non-cachable memory area or something.) Or
is the surface stored in GPU memory and we're going thru PCI to get every pixel?!.. It doesn't work this way, <em>does it?</em>
- is there some horrible memory coherence protocol for these things that I missed?.. And if not - if it's the same kind of
memory of the same shape and size with both arrays - what's different that costs us a 100x slowdown?..</p>
<p>It turns out... And I confess that I only found out by accident, after giving up on this<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a> and moving on to something else. Entirely incidentally, that other thing
involved passing numpy data to C code, so I had to learn what this data looks like from C. So, it turns out that the shape and
datatype aren't all there is to a numpy array:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">print(<b><i>'input strides'</i></b>,i1.strides,i2.strides)
print(<b><i>'output strides'</i></b>,o1.strides,o2.strides)
</pre>
<p>Ah, <em>strides.</em> Same in the output arrays, but very different in the input arrays:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">input strides <b>(3240, 3, 1) (4, 7680, -1)</b>
output strides (1620, 3, 1) (1620, 3, 1)
</pre>
<p>As we'll see, this difference between the strides does in fact fully account for the 100x slowdown. Can we fix this? We can,
but first, the post itself will need to seriously slow down to explain these strides, because they're so weird. And then we'll
snatch our 100x right back from these damned strides.</p>
<h2 id="numpy-array-memory-layout">numpy array memory layout</h2>
<p>So, what's a "stride"? A stride tells you how many bytes you have to, well, stride from one pixel to the next. For instance,
let's say we have a 3D array like an RGB image. Then given the array base pointer and the 3 strides, the address of
<code>array[x,y,z]</code> will be <code>base + x*xstride + y*ystride + z*zstride</code> (where with images, z is one of 0, 1 or
2, for the 3 channels of an RGB image.)</p>
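<p>A tiny sanity check of this formula, if you want to play with it (a standalone demo, not part of the benchmark):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np

arr = np.zeros((4, 3, 3), np.uint8)
print(arr.strides)  # (9, 3, 1) - numpy's default layout for this shape

x, y, z = 2, 1, 2
xs, ys, zs = arr.strides
# address of arr[x,y,z] per the formula above...
addr = arr.ctypes.data + x*xs + y*ys + z*zs
# ...matches the data pointer of a view starting at that element
assert addr == arr[x:, y:, z:].ctypes.data
</pre>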
<p>In other words, <strong>the strides define the layout of the array in memory</strong>. And for better or worse, <strong>numpy
is very flexible with respect to what this layout might be</strong>, because it supports many different stride values for a
given array shape &amp; datatype.</p>
<p>The two layouts at hand - numpy's default layout, and SDL's - are... well, I don't even know which of the two offends me
more. As you can see from the stride values, the layout numpy uses by default for a 3D array is
<code>base + x*3*height + y*3 + z</code>.</p>
<p><img alt="numpy-layout.png" height="178" src="https://yosefk.com/img/numpy-perf/numpy-layout.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>This means that the RGB values of a pixel are stored in 3 adjacent bytes, and the pixels of a <em>column</em> are stored
contiguously in memory - a <a href="https://en.wikipedia.org/wiki/Row-_and_column-major_order">column-major order</a>. And I,
for one, find this <em>offensive</em>, because images are <em>traditionally</em> stored in a row-major order, in particular,
image sensors send them this way (and <em>capture</em> them this way, as you can see from the <a href="https://en.wikipedia.org/wiki/Rolling_shutter">rolling shutter</a> - every <em>row</em> is captured at a slightly
different time, not every <em>column</em>.)</p>
<p>"Why, we <em>do</em> follow that respected tradition as well," say popular numpy-based image libraries. "See for yourself -
save an array of shape <code>(1920, 1080)</code> to a PNG file, and you'll get a 1080x1920 image." Which is true, and of course
makes it even worse: if you index with <code>arr[x,y]</code>, then x, aka dimension zero, actually corresponds to <em>the
vertical dimension</em> in the corresponding PNG file, and y, aka dimension one, corresponds to <em>the horizontal
dimension.</em> And thus numpy array columns correspond to PNG image rows. Which makes the numpy image layout "row-major" in
some sense, at the cost of x and y having the opposite of their usual meaning.</p>
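<p>You can convince yourself with a throwaway snippet (assuming imageio, which isn't otherwise used in this post):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np
import imageio

a = np.zeros((1920, 1080), np.uint8)
a[0, :] = 255  # arr[x=0, y] - "dimension zero"
imageio.imwrite('stripe.png', a)
# the result is a 1080x1920 PNG, and the white stripe runs along its
# top edge: numpy's dimension zero became the PNG's vertical dimension
</pre>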
<p>...Unless you got your numpy array from a pygame Surface object, in which case x actually <em>does</em> index into the
horizontal dimension. And so saving <code>pixels3d(surface)</code> with, say, imageio will produce a <em>transposed</em> PNG
relative to the PNG created by <code>pygame.image.save(surface)</code>. And in case adding <em>that</em> insult to the injury
wasn't enough, cv2.resize gets a <code>(width, height)</code> tuple as the destination size, producing an output array of shape
<code>(height, width)</code>.</p>
<p>Against the backdrop of these insults and injuries, SDL has an inviting, civilized-looking layout where x is x, y is y, and
the data is stored in an honest row-major order, for all the meanings of "row." But then upon closer look, the layout just
tramples all over my feelings: <code>base + x*4 + y*4*width - z</code>.</p>
<p>Like, the part where we have 4 in the strides instead of 3 as expected for an RGB image - I can get that part. We did ask for
an <em>RGBA</em> image, with an alpha channel, when we passed <code>SRCALPHA</code> to the Surface constructor. So I guess it
keeps the alpha channel together with the RGB pixels, and the 4 in the strides is needed to skip the As in RGBA. But then why,
may I ask, are there separate <code>pixels3d</code> and <code>pixels_alpha</code> functions? It's always annoying to have to
deal with RGB and alphas separately when using numpy with pygame surfaces. Why not a single <code>pixels4d</code>
function?..</p>
<p>But OK, the 4 instead of the 3 I could live with. But a zstride of -1? MINUS ONE? You start at the address of your Red pixel,
and to get to Green, you walk back one byte?! Now you're just fucking with me.</p>
<p>It turns out that SDL supports both RGB and BGR layout (in particular, apparently surfaces loaded from files are RGB, and
those created in memory are BGR?.. or is it even hairier than this?..) And if you use pygame's APIs, you needn't worry about RGB
vs BGR, the APIs handle it transparently. If you use <code>pixels3d</code> for numpy interop, you <em>also</em> needn't worry
about RGB vs BGR, because numpy's flexibility with strides lets pygame give you an array that <em>looks</em> like RGB despite it
being BGR in memory. For that, z stride is set to -1, and the base pointer of the array points to the first pixel's red value -
two pixels ahead of where the array memory starts, which is where the first pixel's <em>blue</em> value is.</p>
<p><img alt="SDL-layout.png" height="178" src="https://yosefk.com/img/numpy-perf/SDL-layout.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>Wait a minute... <strong><em>now</em> I get why we have pixels3d and pixels_alpha but no pixels4d!!</strong> Because SDL has
RGBA and BGRA images - <em>BGRA, not ABGR</em> - and you can't make BGRA data look like an RGBA numpy array, no matter what
weird values you use for strides. There's a limit to layout flexibility... or rather, there really isn't any limit beyond the
limits of computability, but thankfully numpy stops at configurable strides and doesn't let you specify a generic callback
function <code>addr(base, x, y, z)</code> for a fully programmable layout<a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>.</p>
<p>So to support RGBA and BGRA transparently, pygame is forced to give us 2 numpy arrays - one for RGB (or BGR, depending on the
surface), and another for the alpha. And these numpy arrays have the right <em>shape</em>, and let us access the right
<em>data</em>, but their <em>layouts</em> are very different from normal arrays of their shape.</p>
<p>And <strong>different memory layout can definitely explain major differences in performance</strong>. We could try to figure
out exactly why the performance difference is almost 100x. But when possible, I prefer to just get rid of garbage, rather than
study it in detail. So instead of understanding this in depth, we'll simply show that the layout difference indeed accounts for
the 100x - and then get rid of the slowdown <em>without</em> changing the layout, which is where "unsafe Python" finally comes
in.</p>
<p>How can we show that the layout alone, and not some other property of the pygame Surface data (like the memory it's allocated
in) explains the slowdown? We can benchmark cv2.resize on a numpy array we create ourselves, with the same layout as
<code>pixels3d</code> gives us:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><i># create a byte array of zeros, and attach
# a numpy array with pygame-like strides
# to this byte array using the buffer argument.</i>
i3 = np.ndarray((IW, IH, 3), np.uint8,
        <b>strides=(4, IW*4, -1),</b>
        buffer=b'\0'*(IW*IH*4),
        offset=2) <i># start at the "R" of BGR</i>

with Timer(<b><i>'pixels3d-like layout with cv2'</i></b>):
    for i in range(repeat):
        o2 = <b>cv2_resize</b>(i3)
</pre>
<p>Indeed this is about as slow as we measured on pygame Surface data:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">pixels3d-like layout with cv2 took <b>1.3829</b> sec
</pre>
<p>Out of curiosity, we can check what happens if we merely copy data between these layouts:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">i4 = np.empty(i2.shape, i2.dtype)
with Timer(<b><i>'pixels3d-like copied to same-shape array'</i></b>):
    for i in range(repeat):
        <b>i4[:] = i2</b>

with Timer(<b><i>'pixels3d-like to same-shape array, copyto'</i></b>):
    for i in range(repeat):
        <b>np.copyto</b>(i4, i2)
</pre>
<p>Both the assignment operator and <code>copyto</code> are very slow, almost as slow as cv2.resize:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">pixels3d-like copied to same-shape array took <b>1.2681</b> sec
pixels3d-like to same-shape array, copyto took <b>1.2702</b> sec
</pre>
<h2 id="fooling-the-code-into-running-faster">Fooling the code into running faster</h2>
<p>What can we do about this? We can't change the layout of pygame Surface data. And we <em>seriously</em> don't want to copy
the C++ code of cv2.resize, with its various platform-specific optimizations, to see if we can adapt it to the Surface layout
without losing performance. <strong>What we <em>can</em> do is feed Surface data to cv2.resize using an array <em>with numpy's
default layout</em></strong> (instead of straightforwardly passing the array object returned by pixels3d.)</p>
<p>Not that this would actually <em>work</em> with any given function, mind you. But it <em>will</em> work specifically with
resizing, because it doesn't really care about certain aspects of the data, which we're incidentally going to blatantly
misrepresent:</p>
<ul>
<li>Resizing code doesn't care if a given channel represents red or blue. (Unlike, for instance, converting RGB to greyscale,
which <em>would</em> care.) If you give it BGR data and lie that it's RGB, the code will produce the same result as it would
given actual RGB data.</li>
<li>Similarly, it doesn't matter for resizing which array dimension represents width, and which is height.</li>
</ul>
<p>Now, let's take another look at the memory representation of pygame's BGRA array of shape <code>(width, height)</code>.</p>
<p><img alt="SDL-layout.png" height="178" src="https://yosefk.com/img/numpy-perf/SDL-layout.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>This representation is actually the same as an RGBA array of shape <code>(height, width)</code> with numpy's default strides!
I mean, not really - if we reinterpret this data as an RGBA array, we're treating red channel values as blue and vice versa.
Likewise, if we reinterpret this data as a <code>(height, width)</code> array with numpy's default strides, we're implicitly
transposing the image. But resizing wouldn't care!</p>
<p>And, as an added bonus, we'd get a single RGBA array, and resize it with one call to cv2.resize, instead of resizing pixels3d
and pixels_alpha separately. Yay!</p>
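<p>Before the real code, here's a tiny standalone demonstration of this equivalence on fake data - the same bytes read through pygame-style strides and through numpy's default strides:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np

w, h = 3, 2
raw = np.arange(w*h*4, dtype=np.uint8).tobytes()  # fake BGRA pixel data

# pygame-style view: shape (w, h, 3), BGR presented as RGB via zstride -1
bgr_as_rgb = np.ndarray((w, h, 3), np.uint8, buffer=raw,
                        strides=(4, w*4, -1), offset=2)
# default-strides view of the very same bytes: shape (h, w, 4)
rgba = np.ndarray((h, w, 4), np.uint8, buffer=raw)

# channel 0 of the "RGBA" view really holds blue - it matches the
# RGB view's channel 2, transposed (and channel 2 matches red)
assert np.array_equal(rgba[:, :, 0].T, bgr_as_rgb[:, :, 2])
assert np.array_equal(rgba[:, :, 2].T, bgr_as_rgb[:, :, 0])
</pre>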
<p>Here's code taking a pygame surface and returning the base pointer of the underlying RGBA or BGRA array, and a flag telling
whether it's BGR or RGB:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import ctypes

def arr_params(surface):
    pixels = pg.surfarray.pixels3d(surface)
    width, height, depth = pixels.shape
    assert depth == 3
    xstride, ystride, zstride = pixels.strides
    oft = 0
    bgr = 0
    if <b>zstride == -1</b>: <i># BGR image - walk back
        # 2 bytes to get to the first blue pixel</i>
        <b>oft = -2</b>
        zstride = 1
        bgr = 1
    <i># validate our assumptions about the data layout</i>
    assert xstride == 4
    assert zstride == 1
    assert ystride == width*4
    base = <b>pixels.ctypes.data_as</b>(ctypes.c_void_p)
    ptr = ctypes.c_void_p(base.value + oft)
    return ptr, width, height, bgr
</pre>
<p>Now that we have the underlying C pointer to the pixel data, we can wrap it in a numpy array with the default strides,
implicitly transposing the image and swapping the R &amp; B channels. <strong>And once we "attach" a numpy array with default
strides to both the input and the output data, our call to cv2.resize will run 100x faster!</strong></p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">def rgba_buffer(p, w, h):
    <i># attach a WxHx4 buffer to the base pointer</i>
    type = ctypes.c_uint8 * (w * h * 4)
    return <b>ctypes.cast(p, ctypes.POINTER(type)).contents</b>

def <b>cv2_resize_surface</b>(src, dst):
    iptr, iw, ih, ibgr = arr_params(src)
    optr, ow, oh, obgr = arr_params(dst)

    <i># our trick only works if both surfaces are BGR,
    # or they're both RGB. if their layout doesn't match,
    # our code would actually swap R &amp; B channels</i>
    <b>assert ibgr == obgr</b>

    ibuf = rgba_buffer(iptr, iw, ih)

    <i># numpy's default strides for shape (h,w,4) are w*4, 4, 1</i>
    iarr = np.ndarray(<b>(ih,iw,4)</b>, np.uint8, <b>buffer=ibuf</b>)
    
    obuf = rgba_buffer(optr, ow, oh)

    oarr = np.ndarray(<b>(oh,ow,4)</b>, np.uint8, <b>buffer=obuf</b>)

    <b>cv2.resize</b>(iarr, (ow,oh), oarr, interpolation=cv2.INTER_AREA)
</pre>
<p>Sure enough, we finally get a speedup instead of a slowdown from using cv2.resize on Surface data, and we're as fast as
resizing an RGBA numpy.zeros array (where originally we benchmarked an <em>RGB</em> array, not RGBA):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">osurf = pg.Surface((OW,OH), pg.SRCALPHA)
with Timer(<b><i>'attached RGBA with cv2'</i></b>):
    for i in range(repeat):
        <b>cv2_resize_surface</b>(isurf, osurf)

i6 = np.zeros((IW,IH,4), np.uint8)
with Timer(<b><i>'np.zeros RGBA with cv2'</i></b>):
    for i in range(repeat):
        o6 = <b>cv2_resize</b>(i6) 
</pre>
<p>The benchmark says we got our 100x back:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">attached RGBA with cv2 took <b>0.0097</b> sec
np.zeros RGBA with cv2 took <b>0.0066</b> sec
</pre>
<p>All of the ugly code above is <a href="https://github.com/yosefk/BlogCodeSamples/blob/main/numpy-perf.py">on GitHub</a>.
Since this code is ugly, you can't be sure it actually resizes the image correctly, so there's some more code over there that
tests resizing on non-zero images. If you run it, you will get the following gorgeous output image:</p>
<p><img alt="resized.png" height="540" src="https://yosefk.com/img/numpy-perf/resized.png" width="960" style="max-width: 100%;height: auto;"></p>
<p>Did we really get a 100x speedup? It depends on how you count. We got cv2.resize to run 100x faster relative to calling it
straightforwardly with the pixels3d array. But specifically for resizing, pygame has smoothscale, and our speedup relative to
it is 13-15x. There are some more benchmarks on GitHub for functions other than resize, some of which don't have a corresponding
pygame API:</p>
<ul>
<li>Copying with <code>dst[:] = src</code>: <strong>28x</strong></li>
<li>Inverting with <code>dst[:] = 255 - src</code>: <strong>24x</strong></li>
<li><code>cv2.warpAffine</code>: <strong>12x</strong></li>
<li><code>cv2.medianBlur</code>: <strong>15x</strong></li>
<li><code>cv2.GaussianBlur</code>: <strong>200x</strong></li>
</ul>
<p>So not "exactly" 100x, though I feel it's fair enough to call it "100x" for short.</p>
<p>In any case, I'd be surprised if enough people use SDL from Python for this specific issue to be broadly relevant. But I'd
guess that numpy arrays with weird layouts come up in other places, too, so this kind of trick might be relevant elsewhere.</p>
<h2 id="unsafe-python">"Unsafe Python"</h2>
<p>The code above uses "the C kind of knowledge" to get a speedup (Python generally hides data layout from you, whereas C
proudly exposes it.) It also, unfortunately, has the memory (un)safety of C - we get a C base pointer to the pixel data, and
from that point on, if we mess up the pointer arithmetic, or use the data after it was already freed, we're going to crash or
corrupt data. And yet we wrote no C code - it's all Python.</p>
<p><a href="https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html">Rust has an "unsafe" keyword</a> where the compiler forces
you to realize that you're calling an API which voids the normal safety guarantees. But the Rust compiler <em>doesn't</em> make
you mark your function as "unsafe" just because you have an unsafe block in that function. Rather, it trusts <em>you</em> to
decide whether your function is itself unsafe or not.</p>
<p>(In our example, <code>cv2_resize_surface</code> is a safe API, assuming I don't have a bug, because none of the horror
escapes into the outside world - outside, we just see that the output surface was filled with the output data. But
<code>arr_params</code> is a completely unsafe API, since it returns a C pointer that you can do anything with. And
<code>rgba_buffer</code> is <em>also</em> unsafe - although we return a numpy array, a "safe" object, nothing prevents you from
using it after the data was freed, for example. In the general case, no static analysis can tell whether you've built something
safe from unsafe building blocks or not.)</p>
<p>Python doesn't have an <code>unsafe</code> keyword - which is in character for a dynamic language with sparse static
annotation. But otherwise, Python + <code>ctypes</code> + C libraries is sort of similar in spirit to Rust with
<code>unsafe</code>. The language is safe by default, but you have your escape hatch when you need it.</p>
<p>"Unsafe Python" exemplifies a general principle: <strong>there's <em>a lot</em> of C in Python</strong>. C is Python's evil
twin, or, in chronological order, Python is C's good-natured twin. C gives you performance, and doesn't care about usability or
safety; if any of the footguns go off, tell it to your healthcare provider, C isn't interested. Python on the other hand gives
you safety, and it's based on <a href="https://en.wikipedia.org/wiki/ABC_(programming_language)">a decade's worth of
research</a> into usability for beginners. It doesn't, however, care about performance. They're both optimized aggressively for
two opposite goals, at the cost of ignoring the other's goals.</p>
<p>But on top of that, Python was built with C extensions in mind from the start. Today, from my point of view, <em>Python
functions as a packaging system</em> for popular C/C++ libraries. I have way less appetite for downloading and building OpenCV
to use it from C++ than <code>pip install</code>ing OpenCV binaries and using them from Python, because C++ doesn't have a
standard package management system, and Python does. There are a lot of high-performance libraries (for instance in scientific
computing and deep learning) with more code calling them in Python than in C/C++. And on the other hand, if you want seriously
optimized Python code and a small deployment footprint / low startup time, you'd use <a href="https://cython.org/">Cython</a> to
produce an extension "as if written in C" to spare the overhead of an otherwise "more Pythonic" JIT-based system like <a href="https://numba.pydata.org/">numba</a>.</p>
<p>Not only is there a lot of C in Python, but, being opposites of sorts, they complement each other fairly well. A good way to
make Python code fast is using C libraries in the right ways. Conversely, a good way to use C safely is to write the core in C
and a lot of the logic on top of it in Python. The Python &amp; C/C++/Rust mix - either a C program with a massive Python
extension API, or a Python program with all of the heavy lifting done in C - seems quite dominant in high-performance, numeric,
desktop / server areas. And while I'm not sure this fact is very inspiring, I think it's a fact<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>, and things will stay this way for a long time.</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>This is what happens when you do stuff for fun, or just in a small team. If I was getting paid to work on this,
I'd keep looking into it until figuring it out, at least if the team was large enough to not have to worry that this would delay
more critical work too much. Makes one think, though I'm not sure <em>what</em> I think about this, all things considered.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>Thankfully, because the existing layout flexibility "only" gives us a 100x slowdown, where with a callback, it
could easily go to 10000x.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>I'm not that good in this particular area, and I'd be happy to hear the thoughts of more experienced people on
what to use these days to implement something like Krita or Blender. I sort of lean towards "a Python program with C/C++/Rust
libraries" rather than "a C++/Rust program with a Python extension API," because, funnily enough, C++ is <em>too unsafe</em> and
Rust is <em>too safe</em> for quickly iterating on a large, complicated code base - so I'd want to keep most of the code doing
lots of little things in Python, and use C/C++/Rust for optimized production code doing well-understood heavy lifting kinds of
stuff. But this way of structuring your program is at most moderately popular, and I wonder if I'm missing something.<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/a-100x-speedup-with-unsafe-python#comments</comments>
      <pubDate>Sat, 04 May 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/a-100x-speedup-with-unsafe-python.feed</wfw:commentRss>
    </item>
    <item>
      <title>Profiling with Ctrl-C</title>
      <link>https://yosefk.com/blog/profiling-with-ctrl-c.html</link>
      <description><![CDATA[<p>I once wrote about <a href="https://yosefk.com/blog/how-profilers-lie-the-cases-of-gprof-and-kcachegrind.html">how profiler
output can be misleading</a>. Someone <a href="https://yosefk.com/cgi-bin/comments.cgi?post=blog/how-profilers-lie-the-cases-of-gprof-and-kcachegrind#comment-1847">commented</a>
that you don’t need profilers - just Ctrl-C your program in a debugger instead, and you’ll see the call stack where your program
probably spends most of its time. I admit that I sneered at the idea at the time, because, despite those comments’ almost
aggressive enthusiasm, this method doesn’t actually work on the hard problems. But as my outlook on life worsened with age, I
came to think that Ctrl-C profiling deserves a shout-out, because it’s very effective against stupid problems encountered by
lazy people operating in unfriendly environments.</p>
<p>I mean, I’ve tended to dismiss the stupid problems and focus on the hard ones, but is this a good approach in the real world?
Today I’m quite ready to accept that most of life is stupid problems encountered by lazy people operating in unfriendly
environments. Certainly, one learning experience was becoming such a person myself, by stepping into a senior management role<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a> and then going back to programming after a few
years. Now I’m lazy because I got used to not doing anything myself, and I’m in an environment which is unfriendly to me,
because I forgot how things work, or they no longer work the way they used to. And while I’m a bit ashamed to admit this as
someone who’s developed several profilers himself, I’m often not really in the mood to figure out how to use a profiler in a
given setting.</p>
<p>But, here’s a program taking a minute to start up. Well, only in the debug build; this must be why nobody fixed it, but we
really should, it sucks to wait for a full minute every time you rebuild &amp; rerun. So I Ctrl-C the thing, and what do you
know, there’s one billion stack frames from the <a href="https://github.com/nlohmann/json">nlohmann JSON parser</a>, I guess it
all gets inlined in the release build; must be what they call “zero-cost abstraction”<a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>. Another Ctrl-C, another call stack, coming from a different place but again
ending up parsing JSON. And I don’t know what the fix was - a different JSON parser, or compiling some code with optimizations
even in the debug build - but someone fixed it after my Ctrl-C-based report.</p>
<p>Or let’s say I’m trying to switch to the LLD linker from gold, to speed up the linking. Why not the even faster <a href="https://github.com/rui314/mold">mold</a>? - because I’m on MIPS, and mold doesn’t support MIPS. But LLD is pretty fast,
too; the core was written by <a href="https://github.com/rui314">the same person</a>, after all. And then I open a core dump
from a binary linked with LLD, and gdb is <em>really</em> slow. Hmm. It should have been <em>faster</em>, actually, because I’ve
also added <code>--gdb-index</code>, which tells the linker to create, I guess, some index for gdb, making gdb faster than its
slow default behavior, which is reserved for the unfortunate people who don’t know the cool flags. But I’m not seeing faster,
I’m seeing slower. What gives?</p>
<p>So, I run gdb under gdb, and Ctrl-C it while it’s struggling with the core dump. There’s some callstack with
<code>dwarf_decode_macro_bytes</code>. Google quickly brings up some relevant issues, such as “<a href="https://sourceware.org/bugzilla/show_bug.cgi?id=24624">Using -ggdb3 and linking with ld.lld leads to cpu/memory hog in
gdb</a>” (Status: <strong>UNCONFIRMED</strong>) and “<a href="https://bugs.llvm.org/show_bug.cgi?id=42030">lld doesn't generate
DW_MACRO_import like ld.bfd does</a>” (Status: <strong>RESOLVED WONTFIX</strong>.)</p>
<p>Apparently gcc generates some DWARF data that gdb is slow to handle. The GNU linker fixes this data, so that gdb doesn’t end
up handling it slowly. LLD refuses to emulate this behavior of the GNU linker, because it’s gcc’s fault for producing that
DWARF data in the first place. And gdb refuses to handle LLD’s output efficiently, because it’s LLD’s fault for not handling
gcc’s output the way the GNU linker does. So I just remove <code>-ggdb3</code> - it gives you a bit richer debug info, but it’s
not worth the slower linking with gold instead of LLD, nor the slowdown in gdb that you get with LLD. And everyone links happily
ever after.</p>
<p>Which goes to show that <strong>Ctrl-C profiling is often enough to solve a simple problem, and it’s usually much easier than
learning how to use a profiler and how to properly read its output</strong>. You can connect a debugger to almost anything, all
the way down to some chip with nothing like a standard OS that could work with a standard profiler. You can connect a debugger
to almost anything especially if it’s slow - for example, maybe it’s hard to actually invoke the program under gdb because its
invocation is buried somewhere very deep, but if it’s slow, you can <code>gdb /proc/$pid/exe $pid</code> after it was
started.</p>
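<p>And if pressing Ctrl-C by hand feels too artisanal, the whole technique fits in a few lines around gdb’s batch mode - a
sketch, assuming you fill in the PID of whatever slow process you’re after:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>import subprocess, time

pid = 12345  # the process you'd otherwise Ctrl-C by hand
for _ in range(5):
    # attach, print the stack, detach
    stack = subprocess.run(
        ["gdb", "--batch", "-p", str(pid), "-ex", "bt"],
        capture_output=True, text=True,
    ).stdout
    print(stack)
    time.sleep(2)  # the "sampling frequency," Ctrl-C style</code></pre>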
<p>A debugger also needs less to work with than a profiler. Unlike <a href="https://perf.wiki.kernel.org/index.php/Tutorial">perf</a>, gdb will give you a callstack even if <a href="https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html">the program was compiled without frame
pointer support</a>. And you certainly don’t need a special build, like <a href="https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_mono/gprof.html">gprof</a>’s <code>-pg</code>, or to run on a slow
simulator, like <a href="https://valgrind.org/docs/manual/cl-manual.html">callgrind</a> / <a href="https://kcachegrind.github.io/html/Home.html">KCachegrind</a>. And then the output of a profiler might be easy to
misinterpret - and I’ve only scratched the surface <a href="https://yosefk.com/blog/how-profilers-lie-the-cases-of-gprof-and-kcachegrind.html">the last time I wrote about it</a>.
Eyeballing a few callstacks is more straightforward.</p>
<p>Why then do we need profilers at all? Here is a very partial list of reasons, in no particular order.</p>
<p>Let’s say, completely hypothetically, that you’ve switched to the LLD linker, and your program is now 2-3% slower. If you
Ctrl-C it, you’ll see the same callstacks as with the version linked with gold. But if you have a profiler running on a
simulator, similarly to callgrind, then you can find the functions with the most slowdown - and they might not be the ones
taking the most time overall, they just have the most slowdown relatively to the old version - and then <strong>you can look at
the assembly listings and see how much time was spent running each instruction</strong>. And then you’ll see that the new
version has branch-to-address-from-register instructions where the old version had branch-to-constant-offset instructions.</p>
<p>Then you will learn about MIPS “relocation relaxation” (used also in RISC-V AFAIK.) The compiler “assumes the worst” and
generates code loading a function address into a register, and then jumping to the address stored in that register. Then, if
you’re lucky, the linker realizes that it has actually placed the function close enough to the caller for that caller to branch
to the function using a constant offset. (Fixed-sized RISC branch instructions cannot encode constant offsets larger than a
certain value, so the function needs to be close enough to the caller for the distance to fit into the offset encoding.) And
then the linker “relaxes” the expensive branch-from-register instruction into a cheaper branch-to-constant-offset instruction.
And it turns out that the LLD version you’re using doesn’t implement relocation relaxation.</p>
<p>Of course you, or should I say me, wouldn’t need that very, very fancy simulator-based profiler if you weren’t the idiot
using LLD 9 when LLD 14 was already available, with relocation relaxation implemented back in LLD 10. (I wish I’d saved the
discussion in the mailing list around this patch; now I can’t find it anywhere. There was nobody confident enough in their MIPS
knowledge to review the patch, but you don’t merge patches without a review, do you? There was even a message saying “Happy
anniversary to the relocation relaxation patch!” a year after it was submitted without having been merged. Eventually someone
said something like “we have to either merge or reject it, or we’re being rude” and someone else said “well, the patch author
knows MIPS better than any of us, so let’s just merge it.”)</p>
<p>But, despite having been an idiot here, I maintain that you don’t have to be an idiot to have this sort of problem, which a
profiler will help solve, and Ctrl-C profiling will not.</p>
<p>The broader issue is that <strong>Ctrl-C is essentially a sampling profiler</strong> - one with an unusually low sampling
frequency, but a sampling profiler nonetheless. Very small changes spread across a program are obviously invisible to a sampling
profiler. Also, <strong><a href="https://danluu.com/perf-tracing/">sampling profilers are bad at tail latency</a></strong> - if
something is usually fast but occasionally slow, you won’t be there to Ctrl-C it when it’s slow. (Of course, if “slow” means 100
ms instead of the usual 25 ms, you wouldn’t manage to Ctrl-C it in time even if you were there - that low sampling frequency
comes with some downsides.)</p>
<p>Systems involving many threads, processes or machines… our esteemed “random pausing” technique, aka Ctrl-C profiling, is
often not great to use with these. And at this point I feel that the idea of replacing all of the various profilers with Ctrl-C
is too ridiculous to bother with more counterarguments.</p>
<p><strong>But, there are many various kinds of profilers, making it a question which kind to use, and how much legwork finding
the problem will take on top of using it</strong>. Simulation-based profilers don’t have the problem of losing data to a low
sampling frequency - they analyze full instruction traces - but they’re too slow for anything like a production environment. So
you might need some measurements that you can run in production, and then a way to rerun the program on the simulator using
inputs that were observed to cause a slowdown in production based on these measurements. Tracing profilers like <a href="https://www.kernel.org/doc/html/v5.0/trace/ftrace.html">ftrace</a> / <a href="https://kernelshark.org/Documentation.html">KernelShark</a> are great for looking at examples of tail latency, but they
will not reliably take you to the places in the code where the time is spent. Sampling profilers can run in production and take
you to the right place in the code, but they’re a poor match for code that runs slowly but only occasionally, and even worse for
code that occasionally gets stuck waiting for something. And most of these tools have a bunch of non-trivial prerequisites,
config knobs and likely ways to misread their output.</p>
<p>Conversely, Ctrl-C in a debugger is easy, makes you look very effective when it actually works, and costs almost nothing to
try even when it doesn’t really help in the end. What’s not to like?</p>
<p>I often find myself recommending something primitive or ugly, which <a href="https://yosefk.com/blog/refix-fast-debuggable-reproducible-builds.html#anti-thesis-sed---nasty-brutish-and-short">might
actually do better than the “proper” approach</a>, or it might have <a href="https://yosefk.com/blog/dont-ask-if-a-monorepo-is-good-for-you-ask-if-youre-good-enough-for-a-monorepo.html">less risky
failure modes in the hands of typical users</a>, or it might be <a href="https://yosefk.com/blog/how-to-make-a-heap-profiler.html">easier to tailor to your needs than a more elaborate
solution</a>. “Profile with Ctrl-C” fits right in - certainly very primitive, yet often compares surprisingly favorably with
more sophisticated alternatives. And therefore, I must give Ctrl-C profiling my warmest endorsement!</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>In my Russian-speaking mind, “stepping into” is strongly associated with “stepping into shit.” I’m not sure
there’s an idiomatic English synonym for stepping into something strongly implying that this something is shit; there should be
- it’s a very useful thing to have in a language.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>“Zero-cost abstraction” is a figure of speech popular with people who don’t consider time spent compiling,
deciphering compiler errors, debugging, or running the debug build as a “cost.” It would be more accurate to call it “zero cost
in production machine resources,” though even that is quite often incorrect.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/profiling-with-ctrl-c#comments</comments>
      <pubDate>Tue, 25 Jun 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/profiling-with-ctrl-c.feed</wfw:commentRss>
    </item>
    <item>
      <title>Advantages of incompetent management</title>
      <link>https://yosefk.com/blog/advantages-of-incompetent-management.html</link>
      <description><![CDATA[<p>What constitutes managerial competence? As a vague starting point for an answer, we could say that competent management sets
achievable objectives and then achieves them, by organizing and incentivizing the necessary work.</p>
<p>It turns out that even this near-tautological banality is enough to see why competent management puts many desirable things
out of reach. This becomes apparent when looking at examples where incompetent management does better than most well-run places
can hope for.</p>
<h2 id="efficiency">Efficiency</h2>
<p>Improving efficiency tends to be against the interest of most people in an org, because it’s equivalent to shrinking your
budget. Here’s what I’m told is a true story about how things work with actual budgets. A relatively inexperienced VP attends a
meeting where senior management is asked to shrink their budgets due to the adverse economic climate brought about by the
coronavirus pandemic. He eagerly cuts his equipment budget from $10 million to $6 million - over the loud and desperate
objections of his team (whom the VP nearly accuses of lacking patriotism, loyalty to the company and commitment to the common
good.)</p>
<p>Next year, the coronavirus mutates some more, and profits go back up. Our VP submits a $10 million equipment budget to the
finance department, where they cheerfully inform him that the extra $4 million will not go over well with the CEO. Why, a 66%
increase over last year’s $6 million!</p>
<p>Wait a minute, thinks the VP, a sensation running through his whole body of rapidly gaining that invaluable experience which
he so sorely lacked. I voluntarily cut 40% of my budget - a share way larger than anyone else - due to an unforeseen,
extraordinary emergency. And now I’m rewarded with this cut becoming permanent?.. I see. Well, I’m always eager to learn.</p>
<p>This year being already lost, he quietly resubmits a $6 million budget (approved more swiftly by the CEO than any other,
thanks to zero YoY growth.) Next year, he uses some real or perceived crisis to increase this budget to $20 million. And now he
has learned how to operate in a well-run company.</p>
<p>Of course you could say that this is a <em>badly</em> run company, and to avoid arguing what that means, let’s stick to the
definition of managerial competence as the ability to set and achieve objectives. <strong>Whatever objective you are expected to
achieve, a bigger budget makes it easier</strong>. And while asking for more resources gets you yelled at, the yelling is for
show, and ends once the budget increase is approved (or <em>isn’t</em> approved; but it never really hurts to try.) But if you
fail to achieve your objective, the yelling will be for realsies, go on and on, be followed by career setbacks, and continue
long afterwards, quite possibly with no way to resuscitate said career.</p>
<p>Set objectives create a simple zero-sum game over resources - you want more resources to do what they asked, and they want
you to do the same things with less. <strong>Optimization, budget cuts or relinquishing resources under any other name simply
registers as losing a round in this game</strong>. It’s awfully sweet to save company resources, but expecting it to do you any
good just means that you don’t understand the game.</p>
<p>I mean, what do you expect to happen? That we'll ask you to do less, or forgive you for doing less? No way, we asked you to
do those things because they must be done. Then maybe you expect to be given more resources? Obviously ridiculous, you just had
resources, there’s no sense in hoping to get them again as a reward for giving them back when you already had them?.. Maybe
you’d like a stock grant for being such a good citizen? No, if we do that, everyone will inflate their budget, and then cut it
to get a stock grant. Could that be what you did here?.. Like, why was the budget so large to begin with?..</p>
<p>But wait, seriously though, what’s the math here? What are we maximizing? Revenue minus cost? Revenue divided by cost? I
mean, shrinking the cost has got to be helping with these?.. Well, sure it’s helping, but it’s not helping <em>you,</em> because
you don’t bring any revenue <em>by yourself,</em> unlike cost, which you very much do incur all by yourself. The math with
<em>you</em> is, <strong>we tell you to do something if the cost is below a threshold</strong>. If you won’t commit to doing it
cheaply enough, we’ll find someone who will, and if we can’t, we won’t do the thing, or reconsider the options in some other
way. But exactly what the cost below the threshold is changes nothing in any math related to you, except for a lower cost making
your job harder, since you have the same objectives to achieve. The firm’s bottom line - sure, lower costs help there. But the
impact on the firm’s “revenue - cost” doesn’t trickle down to your “cost &lt; threshold,” because you have no revenue<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a>.</p>
<p>Things work the same with any resource, not just actual money - it could be team size, or processor cycles and memory bytes.
If you free up 200 ms of CPU cycles and 500 megs of RAM, someone else can deploy their functionality using these newly available
resources, and then <em>you</em> won’t be able to. In fact, a mature, well-run CI system will measure everyone’s resource
footprint after each commit, and will not let you exceed your budget, which was frozen at some point based on how much you were
using at the time (hope it was a lot! - always spend like crazy before the baseline is established!) Is it any wonder that
people learn to never optimize their code - unless <em>they</em> want to deploy something new themselves, and only after asking
for more resources to deploy it and not getting any?</p>
<p>I like it when people ask “why is this code so slow? Why don't we optimize it?” And it still makes me sad when people ask
instead, “how much CPU time do I have for running this code?” when it's obviously 5-10x slower than it could be, and they're
asking to reserve 2-3x more CPU time than they're already wasting. But that's what happens when people have worked at well-run
places and aren’t stupid.</p>
<p>What happens in a badly run place? In a badly run place, management is bad at setting objectives, so you have people
aimlessly wandering about, lacking clear goals, and just doing stuff because they want to. They see an optimization opportunity
and they gladly pursue it - it’s interesting, it’s fun, it’s good for the company, it’s what they’re here for. If a patch must
be submitted to a team, that team might gladly accept it - they don’t mind shrinking their resource footprint, because nobody
monitors the resource budget properly, nor presses them to meet any targets very hard - which is also why they don’t really mind
spending some time on something not helping them achieve any such target. In fact, they might get interested enough to actively
help whoever found the problem to fix it.</p>
<p>Your legs don’t fight your heart, brain and each other for the oxygen budget; every organ only uses what it needs, and is
optimized for efficiency. The selfish corporation is yet to make its parts behave as selflessly as our body parts sharing our
supposedly selfish genes. Yet people do have a tendency to do the right thing regardless of incentives - no doubt because they
mistake their corporation for their tribe, thinking their coworkers share more of their genes than they do<a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>. But <strong>if there’s a reliable &amp; scalable way for
vigorous, systematic management to reward the spontaneous human drive towards efficiency instead of punishing it, I am yet to
see it</strong>. Certainly honest people working for the trillion-dollar heavyweight champions of the industry testify that this
problem is far from solved.</p>
<p>It’s an exercise both fun and depressing to come up with ways to “manage for efficiency.” For example, we could reward people
for performance savings, right? Great idea - now I can commit some CPU or memory hog, then you can fix it, and we’ll split the
reward. Or, more realistically, first we all go on a crazy resource spending spree to meet a deadline. And then later on, we
optimize away the lower-hanging fruit in the crazy-inefficient system and get a reward - with not-so-low-hanging fruit from that
spending spree probably left hanging forever.</p>
<p>(Of course, we probably won’t tell ourselves that we’re deliberately overspending more than is actually helpful for meeting
the deadline to game the system. Rather, the culture will just kinda shift in that direction. People are very good at doing
fucked-up things without admitting it to themselves - which would make them sad and less energized to do the fucked-up thing
they have compelling reasons to do.)</p>
<p>Perverse incentives always appear wherever incentives are deployed, because the very notion of an incentive is fundamentally
perverse. But a competent manager is forced to use incentives, instructions, and incentives to follow instructions, because what
else could he use?</p>
<h2 id="sprawl">Sprawl</h2>
<p>“These teams are like bulldozers with no brakes,” mused my acquaintance, who’d managed a team in a poorly-run company and had
recently become a director in a much better-run one. “You only have a steering wheel, and you need to be steering all the time,
or this thing is going to dig a giant hole in the ground, or raze a building or something. If you don’t tell them exactly what
to do, they’re still always going to do <em>something</em>, and then it’ll be too late.”</p>
<p>You see, he was used to people doing pretty much nothing when left unmolested. Of course, from the employer’s point of view,
this habit is straightforwardly wasteful, because you’re still paying their salaries. To weed out such do-nothing people,
competent management sets up a performance evaluation process, so that we always know what every person has done for us every
year, and who should get outsized rewards and who should get fired.</p>
<p>This system leaves people very worried if they don’t have clear goals to work towards. However, even a competent organization
cannot set actually useful goals for everyone at all times, just like you generally need your legs, but you don’t really have a
use for them at every moment. And thus, <strong>you have people with spare bandwidth making up their own goals, so that they
have something to show in the performance review.</strong></p>
<p>If we now revisit the situation from the employer’s point of view, it is no longer <em>trivially</em> wasteful, because
everyone is always busy. However, it’s likely more wasteful than before, because people are building stuff you didn’t really
need, and yet you almost certainly need <em>now,</em> because actually productive activities are hopelessly intertwined with
this stuff.</p>
<p>This is a big reason why successful software companies end up with mountains of code. The cycle repeats and branches out
exponentially, as every team who’s built the once-needless and now-necessary thing asks for more headcount, gets it, and
inevitably ends up with some of it idle some of the time. Then these new people invent more goals to pursue, persuade everyone
that these fake goals are actual sub-goals of the real goals, and entangle existing systems with their new systems.</p>
<p>And now figuring out where the waste is will be much harder than just spotting idle people, since all the needless work was
done for no other purpose than looking very important, and people are pretty good at making the right impression when they’re
trying. And of course when people lie, they lie first and foremost to themselves - we’re all natural-born <a href="https://en.wikipedia.org/wiki/Method_acting">Method actors</a> - so if you spot a decoy and try to cancel the work on that
system, not only will the people working on it fight this with all their might, but they'll be genuinely heartbroken if you do
cancel it. And by the time you’ve actually dealt with one of these weeds, if you’re a weird manager actually trying, two more
will have sprouted in another part of the org.</p>
<p>If you’re used to such sprawl, you’d be surprised how effective sleepy HR practices are at preventing it. Suppose you always
get a standard, shitty raise at the end of the year by default, unless you bargain loudly, which works rarely and only if you’ve
really made an impression throughout the year. There is no defined budget for raises; every significant raise is hard to get,
and you never get it proactively without bargaining, but there’s no formal system to avoid spending too much on raises except
for the reluctant, reactive approach to giving them. There’s also no system for firing low performers, and it’s only very rarely
that you see anyone fired - like that crazy fuck who went on and on about how your source control sucked and should be
completely different, and then used a single dot character, “.”, as the commit message when he finally committed something.</p>
<p>A similar system is used for managing other resources: for example, every team gets to grow at some low annual rate, no
department is ever cut, and it’s very hard to grow your department faster than the base rate even if you get more
responsibilities.</p>
<p>A place like this evolves the healthy laziness that keeps animals from moving their body parts all the time, needlessly
burning calories, in order for the claws, wings and tails to get a good performance review from their head at the end of the
year. Sure, many people do nothing much of the time, and you need some effort to make them do something when it becomes
necessary; “the hedgehog is too proud a bird to fly without a kick,” as the wise Russian proverb goes. But on the upside, nobody
doing anything unless it’s really necessary means you don’t have all this unnecessary stuff.</p>
<p><strong>Healthy laziness begets agility - you have way less code, less systems, less everything, and therefore way more
ability to maneuver and actually change things with a small number of motivated people </strong>- and there’s always a small
number of motivated people in any place, and this place might even keep them, if they learn to bargain for raises. And you also
don’t need to grow as much, because you don’t need to be adding people to take care of all these sprawling systems that you
quickly come to depend on.</p>
<h2 id="bugs">Bugs</h2>
<p>Bug fixes work a lot like efficiency improvements, the main difference being that competent management makes things much
worse. You can’t make fixing bugs into a “goal,” same as you can’t make optimization into a goal, because people will just add
more bugs up front and then fix some of them. But at least with optimization, you can have teams doing it across the
organization, and it claws back some of the performance lost in the first place.</p>
<p>A team optimizing others’ systems cannot hunt down the tens of thousands of little performance hogs created by everyone else.
But it can often find tens or hundreds of relatively small changes with a fairly big performance impact. They’re probably not
“fully incentivized” for this outsized impact, because with rewards anywhere close to how much money this is worth to the
business, the incentive is quite likely to become extremely perverse. But you definitely can make “everyone else’s performance”
a team’s job description, combine it with your venerable performance evaluation &amp; promotion process, and get
<em>something</em> - often a big something.</p>
<p>Another kind of team with some form of “someone else’s efficiency” in its job description works on compiler, language
runtime or kernel optimizations, custom compute or networking accelerators, and other such things. Such teams could be inefficient
<em>in their own work</em> for the same reasons mentioned above, but they might still be increasing <em>others’</em> efficiency,
because it’s legitimately an example of a goal that their competent management is good at setting and achieving.</p>
<p>The problem with bugs is that <strong>you can’t have people solve others’ bugs as much as you can have them improve others’
efficiency</strong>. It is generally much easier for a relative outsider to see where a system spends its resources than where
its bugs are. That’s because all systems spend similar kinds of resources, but what constitutes a bug varies from system to
system, and there’s almost never a machine-readable, formal, or even just a reasonably complete and written-down definition of
correctness. The few exceptions are things like programming language semantics, and indeed this is where a lot of progress has
been made - think sanitizers, <a href="https://yosefk.com/blog/checkedthreads-bug-free-shared-memory-parallelism.html">race
detectors</a>, etc.</p>
<p><strong>Another problem with bug fixes which you don’t have as much with optimizations is that it’s harder to measure the
impact</strong>. With efficiency improvements you can usually give a ballpark number of how much resources it would save -
perhaps a range of possibilities rather than one high-confidence number, but you’ll have something. With bugs, well, you could
A/B test them to try to quantify the impact on some metric management cares about, but who does that?</p>
<p>With performance, you deal in resources to begin with, and you have <em>some</em> number speaking of resource savings by
definition, or you couldn’t call it an optimization. And now there might be an argument of what multipliers to apply to this
number to arrive at a cost estimation, but at least you have a starting point. With a bug fix, you have the bug and the fix, and
you’re seriously going to suggest A/B testing the impact for no benefit to the employer except your ability to claim this impact
is worthy of a promotion? This is a great plan especially for internal systems without A/B testing infrastructure or any
preconditions for it, but it’s a great plan in general, employers love this.</p>
<p>(And also, most bugs you fix tend to come from your own team, and then all high impact proves is that you messed up big time
when you put the bug in. You’re not supposed to have bugs in the first place, punk.)</p>
<p>“I have this potential employer who says they’re interested in performance and correctness,” said another acquaintance. “I
told them that I can work on performance anywhere in the industry, so I can probably find an offer better in other respects
elsewhere. But correctness sounds interesting. I don’t know anywhere caring about correctness!”</p>
<p>Well, it’s not like they don’t care, as much as they don’t have a mechanism for caring or even registering it. Correctness is
not a goal in itself that management can set for the teams without perverse side-effects. Of course, you have to fix
“showstopper bugs” or you haven’t achieved your goal. Any further bug-fixing takes resources from achieving your nominal goals,
and is avoided - not outright, which would look bad, but through slow-walking and other acceptable forms of sabotage.</p>
<p>It’s true that Microsoft Teams (to take one example that all too many of us are familiar with) can get away with bugs because it’s bundled
with Outlook and other stuff, and because whoever pays for it doesn’t use it that much, but rather foists it upon helpless
internal users. But it’s also true that fixing those bugs would be money very well-spent for Microsoft, because it would almost
certainly improve their reputation and increase sales at the margins and more than offset the cost of the work. The problem is
that it’s hard for a well-run place to get people to fix non-showstopper bugs.</p>
<p>(One way to work on correctness, if you're into this, is to go to areas where more bugs are showstoppers, so fixing them
becomes a part of the nominal goals. If you’re a hardware developer, FPGAs, where you can fix bugs very cheaply, are a worse
context for this than ASICs, where you cannot, making you eager to find and fix them proactively. And hardware running lots of
software, which can't be patched to work around hardware bugs, like a CPU, <a href="https://yosefk.com/blog/the-habitat-of-hardware-bugs.html">will face more pressure to be correct</a> than something like a
peripheral device controller, which is only touched by comparatively little code written at the company making the hardware,
where it’s “easy” for software developers to add workarounds to this code<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>. If you’re a software developer, you could try an industry with high reliability
requirements, where many more bugs are defined as showstoppers.)</p>
<p>Of course, having more sprawl means having more bugs (and more performance issues, and more machine resources spent on
running all that sprawling code), and even defining what “correct” means becomes harder when the system is larger. The
unfortunate side effects of competent management compound each other.</p>
<h2 id="the-problem-with-incompetent-management">The problem with incompetent management</h2>
<p>The main disadvantage of incompetent management is its definitional inability to set and achieve key goals, which can
endanger the survival of the organization. Incompetent management can only thrive in situations where basic survival and even
growth are somehow taken care of, and any major changes in that situation create an existential risk.</p>
<p>It is theoretically possible for management to respond to an external crisis by “changing gears” from a sleepy indifference
to what’s going on in the organization to a vigorous push to get something huge done, as required by the new external situation.
The hope is to kick the suddenly awakened, terrified hedgehog into the stratosphere, and then go back to the sleepy ways of old
once it’s orbiting the Earth.</p>
<p>In practice, the risk is high for this attempt to fail - a place not used to the mobilized state of subordinating all efforts
to top-down goals will need time to learn, where “learning” might involve firing or otherwise replacing key people (which is a
big part of what “learning” means for organizations, and what people mean when they say such learning is “hard.”)</p>
<p>If the war effort does succeed, there’s quite likely no going back - the hedgehog will have been thoroughly transformed and
militarized by the ordeal. It will be the usual mix of competent management and cargo cult management from now on.</p>
<h2 id="cargo-cult-management-vs-straightforward-incompetence">Cargo cult management vs straightforward incompetence</h2>
<p>Speaking of which - a most unfortunate side effect of competent management is the widespread desire to emulate its look and
feel, which contaminates the wonderful natural incompetence of so many managers, robbing us of its many advantages.</p>
<p>Mostly incompetent management which is very bad at setting and achieving goals is perfectly capable and all too likely to
cargo-cult effective management by setting up an elaborate bureaucracy for assigning work and tracking its status, thus
preventing work from happening spontaneously. This has all the downsides of actually competent management without any of the
benefits.</p>
<p>Things work much better when incompetent managers embrace their laziness and do close to nothing. This is possible if there's
a culture where a manager gets to look good through means other than appearing to be on top of plans and status - for example,
by presenting shiny things the team is working on (regardless of their exact impact on the bottom line or even chances to be
deployed in production.)</p>
<h2 id="what-is-to-be-done">What is to be done?</h2>
<p>“What is to be done?” is <a href="https://en.wikipedia.org/wiki/What_Is_to_Be_Done%3F">a pamphlet by Lenin</a>, who proposed
some things to be done, and went on to do them and then some, with results most charitably described as mixed.</p>
<p>I don't know how it ever happened to me, but I somehow got infected with the absurd idea that there's always a good way for
things to work in an organization, and furthermore, somehow this good way always makes the org more effective than the commonly
observed not-so-good alternatives. I was brought up with natural immunity to <a href="https://en.wikipedia.org/wiki/The_Internationale#Anthem_of_the_Soviet_Union">the Soviet strain</a> of this Panglossian
optimism with respect to our ability to shape organizations in the all-around optimal way:</p>
<blockquote>
<p>We shall wholly destroy the world of oppression<br> Down to the foundations, and then<br> We'll build a new world of our
own.</p>
</blockquote>
<p>But it turned out that my Soviet antibodies don’t automatically work against <a href="https://paulgraham.com/good.html">the
Western strain</a>:</p>
<blockquote>
<p>…the most important advantage of being good is that it acts as a compass. … you have so many choices. How do you decide?</p>
<p>Here's the answer: Do whatever's best for your users. You can hold onto this like a rope in a hurricane, and it will save you
if anything can. Follow it and it will take you through everything you need to do.<a class="footnote-ref" role="doc-noteref" href="#fn4" id="fnref4"><sup>4</sup></a></p>
</blockquote>
<p>I mean, I guess I’ve always had antibodies good enough to protect against severe illness; I never imagined that companies
mostly succeeded by holding onto their goodness like a rope in a hurricane. But if you pressed me, I’d say they probably
<em>could</em>, and they would be better off if they did.</p>
<p>Which, if you think about it, why on Earth would this have to be correct? Few people would say that you can always make your
code faster without making it uglier, and those who say it tend to be a bit insane, in a professional sense. So why would making
a company more effective always make it better instead of worse, according to, well, any definition, really?.. Just because the
opposite thought is depressing? Well, the thought of <a href="https://yosefk.com/blog/efficiency-is-fundamentally-at-odds-with-elegance.html">faster code tending to be uglier</a> isn’t
a very happy one, either.</p>
<p>So, now that I have immunity strong enough to prevent infection and transmission of the effective goodness virus, I don’t
think you have to find a solution to an organizational problem just because you happen to observe it<a class="footnote-ref" role="doc-noteref" href="#fn5" id="fnref5"><sup>5</sup></a>. From an individual’s POV, the environment is nearly
impossible to change, hard to truly understand, and fairly easy to fit into for most values of “environment,” and companies are
no different. You probably can’t make most well-run places “truly care” about efficiency or correctness, but you can make a
great living optimizing stuff, and even debugging or seriously testing, if you find the right place for it.</p>
<p>Of course, if you put a gun to my head, I could add a few paragraphs on “combining the best of both worlds” and how it’s been
known to happen in small teams over short periods of time, and so on. And, not gonna lie, I almost did put a gun to my own head
to write these paragraphs - old habits die hard. But I came to my senses and deleted them. It’s more likely to make you feel sad
than happy, and most of all, it’s likely to make you bored.</p>
<p>(<strong>Update</strong> - see <a href="https://news.ycombinator.com/item?id=40893945">a comment on this write-up describing
the rise and fall of Creo</a>, a company that they say did start out combining the best of both worlds, then regressed to the
mean after acquiring another company with a lot of people and a "standard" culture, and went downhill both as a place to work
and a viable business. Like I said, these "best of both worlds" stories will make you more sad than happy.)</p>
<h2 id="conclusion">Conclusion</h2>
<p>Competent management sets goals to achieve. Whatever can’t be made into a goal cannot be achieved by definition. Whether this
sounds trivial or absurd, it has many surprising undesirable consequences which are surprisingly hard to avoid.</p>
<p>A company’s board is unlikely to raise the need for less competent management in their annual meeting, and for good reasons.
A prospective employee is another matter. If someone invites you to work for a company that’s run very badly, there might well
be a good story there - this is far from guaranteed, but you might want to hear the details. And by “a good story”, I don’t mean
“yay, here’s a place to slack off at,” but “maybe I can finally get some work done that I hardly ever get the chance to do.”</p>
<h2 id="see-also">See also</h2>
<p>It's very possible to make sure you only hire people who can answer algorithms questions in an interview. But don't expect
these carefully filtered employees to then <a href="https://danluu.com/algorithms-interviews/">actually solve, rather than
create</a>, problems equivalent to basic phone screen questions on the job, for reasons related to the above.</p>
<p><em>Thanks to Dan Luu and Tim Pote for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>If you’re a general manager of a business unit, aka a P&amp;L (profit &amp; loss) unit, then of course you
<em>do</em> have revenue, and things are different. For the purpose of our current discussion, I treat a BU as a separate
company, and the discussion applies to its employees, not its GM who for our purposes can be treated as a CEO.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>Either that, or it’s a genetic defect in people, or something about group selection. It is a scientific fact
that everything in life is either the result of DNA molecules evolving to copy themselves more efficiently, or their occasional
failure to do so, and no other mechanism nor information encoded in any other form is of any consequence.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>It’s also easy for a device driver programmer to beat up a hardware device designer, especially when sneaking
from behind, but it’s been made artificially hard by various legal mechanisms.<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
<li id="fn4"><p>I did notice that it says “good for users” and not “good for employees” or from other points of view. But you
and I both know what the answer would have been had someone in the audience raised their hand and asked, “...it’s about being
good for users, right? - you do often have to make it terrible for employees, or terrible in some other sense, in order to
succeed?”<a class="footnote-back" role="doc-backlink" href="#fnref4">↩︎</a></p></li>
<li id="fn5"><p>The virus is bad enough to actually make you think, “since I don’t know how to solve this problem, I probably
don’t really understand it - my analysis of my observations must be incorrect,” as if thinking that the discrete knapsack
problem is NP-complete is a symptom of not understanding the discrete knapsack problem.<a class="footnote-back" role="doc-backlink" href="#fnref5">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/advantages-of-incompetent-management#comments</comments>
      <pubDate>Thu, 04 Jul 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/advantages-of-incompetent-management.feed</wfw:commentRss>
    </item>
    <item>
      <title>Profiling in production with function call traces</title>
      <link>https://yosefk.com/blog/profiling-in-production-with-function-call-traces.html</link>
      <description><![CDATA[<p><strong>A timeline showing function call and return events is a great way to debug performance problems, especially in
production</strong>. In particular, it's often much more effective than traditional sampling profilers, for reasons we’ll
discuss. However, the adoption of function tracing in the industry remains uneven because of a chicken-and-egg problem.</p>
<p>To best use a tracing profiler, you need some adaptations to your code and your workflow (as opposed to sampling profilers,
which “just work” with your code.) So to <em>make </em>a tracing profiler, one needs people wishing to change their code &amp;
workflow in order to use it. That said, as we’ll see, <strong>it’s gotten fairly easy to develop a tracing profiler
today,</strong> and integrating it into your work is very doable as well – which I hope might encourage people to both make and
use tracing profilers.</p>
<p>“Our main contribution,” as they say in papers, is a new function tracing profiler for C++ imaginatively named <a href="https://github.com/yosefk/funtrace">funtrace</a>. It’s ready for serious use – it works out of the box on large,
complicated programs. For example, here it is peeking into the inner workings of the awesome painting program <a href="https://krita.org/">Krita</a>, showing how the call stack changes over time, the thread state transitions, and the source
code of some selected function:</p>
<p><img alt="image8.png" height="776" src="https://yosefk.com/img/funtrace/image8.png" title="a trace of Krita made by funtrace &amp; displayed by vizviewer" width="576" style="max-width: 100%;height: auto;"></p>
<p>Funtrace has the following attractive qualities, which I hope I am not overselling:</p>
<ul>
<li>AFAIK, has <strong>the lowest-overhead tracing</strong> (&lt;10 ns per instrumented call or return in my measurements)</li>
<li>Supports <strong>threads, shared libraries and exceptions</strong></li>
<li>Supports <a href="https://www.kernel.org/doc/html/v5.0/trace/events.html">ftrace events</a>, showing <strong>thread
scheduling states</strong> alongside function calls &amp; returns, so you see when time is spent waiting as opposed to
computing</li>
<li>Works with <strong>stock gcc or clang</strong> - no custom compilers or compiler passes</li>
<li>Easy to integrate into a build system, and even easier to <strong>try <em>without</em> touching the build system</strong>
using <a href="https://github.com/yosefk/funtrace/tree/master/compiler-wrappers">tiny compiler-wrapping scripts</a> “passing all
the right flags”</li>
<li><strong>Small</strong> (just <strong>~1K LOC</strong> for the runtime) and thus:
<ul>
<li>easy to port (currently <strong>x86/Linux</strong> is supported)</li>
<li>easy to extend (say, to support some variant of “green threads”/fibers)</li>
<li>easy to audit in case you’re reluctant to add something intrusive like this into your system without understanding it well
(as I personally would be!)</li>
</ul></li>
<li><strong>Relatively comprehensive</strong> – it comes with its own tool for finding and cutting instrumentation overhead in
test runs too large to fully trace; support for remapping file paths to locate debug information and source code; a way to
extract trace data from core dumps; and other such “ways to address real-world concerns.”</li>
</ul>
<p>I’ve worked on several kinds of profilers during the last 20 years, and this one is easily my favorite. Perhaps it’s the
colorful stalactites?.. Anyway, that’s “our main contribution”; we’ll see how funtrace works, and how you could use similar
methods and existing components to build your own tracing profiler.</p>
<p>But we’ll cover more than the proverbial design and implementation of funtrace; in fact, we’ll leave a few of its
particularly “hardcore” bits for their own followup. We’ll start with <a href="https://github.com/gaogaotiantian/viztracer">viztracer</a>, a great tracing profiler for Python, and discuss how to
introduce a profiler like that into your workflow. We’ll also talk about <a href="https://llvm.org/docs/XRay.html">LLVM
XRay</a>, AFAIK “the” open-source function tracing C++ profiler today, and how its design is influenced by what the workflow
needs to be in a place like Google, where XRay comes from.</p>
<p>We’ll also get to the awesome <a href="https://github.com/janestreet/magic-trace">magic-trace</a>, a non-intrusive tracing
profiler built on top of <a href="https://perfwiki.github.io/main/">Intel Performance Trace</a>. Unfortunately, its authors say
that magic-trace is <a href="https://github.com/janestreet/magic-trace/wiki/How-could-magic-trace-be-made-to-work-on...">hard or
impossible to port to many popular platforms</a>. This will bring us to <strong>what CPU makers could do to make
hardware-accelerated function tracing very cheap</strong> in <em>both </em>hardware and software – and usable much more widely
than today, including in dynamic and JITted languages.</p>
<ul>
<li><a href="#viztracer-how-to-use-a-great-tracing-profiler">viztracer: how to use a great tracing profiler</a></li>
<li><a href="#funtrace-making-a-tracing-profiler-for-native-code">Funtrace: making a tracing profiler for native code</a>
<ul>
<li><a href="#compiler-instrumentation">Compiler instrumentation</a></li>
<li><a href="#runtime-code">Runtime code</a></li>
<li><a href="#decoded-trace-viewer">Decoded trace viewer</a></li>
<li><a href="#offline-trace-decoder">Offline trace decoder</a></li>
<li><a href="#ftrace-tracing-thread-state-changes">ftrace: tracing thread state changes</a></li>
<li><a href="#getting-traces-from-core-dumps">Getting traces from core dumps</a></li>
<li><a href="#funcount-culling-overhead"><code>funcount</code>: culling overhead</a></li>
</ul></li>
<li><a href="#antithesis-llvm-xray">Antithesis: LLVM XRay</a></li>
<li><a href="#synthesis-funtrace-with-xray-characteristics">Synthesis: funtrace with XRay characteristics</a></li>
<li><a href="#hardware-assisted-tracing">Hardware-assisted tracing</a></li>
<li><a href="#conclusion-and-future-work">Conclusion and future work</a></li>
</ul>
<h2 id="viztracer-how-to-use-a-great-tracing-profiler">viztracer: how to use a great tracing profiler</h2>
<p>I’m working on a small animation program, and I’ve found out that there’s such a thing as insufficient <a href="https://en.wiktionary.org/wiki/yak_shaving">yak shaving</a> - it turns out that the care and feeding of an unshaved yak
gets tiresome quickly. For example, I’ve avoided figuring out how to use a profiler for way too long, and in hindsight, wasted a
lot of time manually putting timers into the code.</p>
<p>I mean, you start putting timers into your code, and the first thing you print out is the average runtimes of things. The
averages tell you which code the program spends most of its time running, helping you to speed up some of this code. But then
you run into the occasional lags, and of course, <strong><a href="https://danluu.com/perf-tracing/">averages can’t explain why
you have unusual lags</a> - because by definition, <em>on average</em>, you don’t have <em>unusual</em> lags.</strong> So
whatever is taking time when you <em>do</em> have unusual lags doesn’t move the averages.</p>
<p>(Incidentally, this is why sampling profilers like <a href="https://perfwiki.github.io/main/">perf</a>, which periodically
check what the program is doing and then show you what it was doing most of the time, can help with saving CPU cycles, but
mostly <strong>can’t help with worst case latency</strong>.)</p>
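<p>A quick illustration with made-up numbers of why the averages stay quiet:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>times_ms = [10] * 999 + [500]  # 999 normal events, one bad lag

print(sum(times_ms) / len(times_ms))  # ~10.5 - the lag barely moves the average
print(max(times_ms))                  # 500 - what the user actually felt</code></pre>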
<p>So you print out worst case runtimes, but of course the worst case runtimes of each function don’t help you <em>by
themselves.</em> What you’re really after is how long <em>each part</em> of your code took when <em>an entire flow, </em>like
your mouse-down event handling, was unusually slow. That slowness wasn’t due to every function taking the most time ever, but
due to <em>some</em> of them taking a lot of time, and the sum of <em>everything</em> running <em>in this flow</em> taking too
much.</p>
<p>So you start printing some sort of tables with timer values - a row for every time some flow ran, and a table like that for
every flow, and then you look at the rows with the most total cycles. The tables grow, and now you want something easier to look
at than tables of numbers. So you start having thoughts like "I could decorate my functions with <code>@trace</code> or something, to trace calls and returns, and if only there was a nice way to display this
trace with the function calls nested inside each other...”</p>
<p>And at this point you say – you know, <em>I would be building a tracing profiler!</em> There has got to be a tracing profiler
for Python - I should find one and use it! Where’s my yak shaving machine?! Should have reached out for that long ago, before
getting all tangled up in all this yak fur!</p>
<p>Then you discover viztracer, which looks great, and works like a charm if you run it on some script with
<code>viztracer ./script.py</code>. Works fine for a short program run: you get the last 10 million function calls at the end of
the run. But your program is an interactive GUI, which means an “infinite” program run. You don’t want to quit the program to
get the trace. Well, you can Ctrl-C the program - more accurately, you can Ctrl-C viztracer which is running the program - to
get the last 10M calls before you Ctrl-C’d. But <a href="https://yosefk.com/blog/profiling-with-ctrl-c.html">how do you know when
to Ctrl-C</a>?</p>
<p><strong>This right here is the major problem with tracing profilers: <em>you need to figure out what to trace.</em>
</strong>With sampling profilers like perf, no such problem: you sample <em>all the time</em>, and then you get a summary – of a
size <em>not dependent on how long the program ran. </em>And this is important not only because there are only so many terabytes
in a disk, but first and foremost because <em>there’s only so much time to scroll horizontally</em>, trying to find the part of
a giant timeline that you care about.</p>
<p><strong>Therefore, a tracing profiler usually comes with an API for triggering tracing, which the program must call</strong>.
And this is our chicken-and-egg problem: when perf came out, it was immediately usable for all the natively compiled programs
out there – and everyone looking into performance could use it, and wanted to make it better. But with a tracing profiler, most
programs must be changed for it to be usable, if only a little bit, and who wants to risk developing a tool that nobody can use
on day one?</p>
<p>(There is, of course, <em>the other </em>major problem with tracing profilers - their larger overhead compared to sampling
profilers. This IMO explains why <strong>dynamic languages are more likely to have a tracing profiler than static languages at
this time</strong>. Not only are these languages designed for things like intercepting function calls without the language maker
having to add support for this, making things easier for “community tool makers,” but <strong>the execution is so slow to begin
with that the overhead of a tracing profiler is <em>relatively</em> smaller than in static languages, </strong>and thus doesn’t
deter tool makers and users alike as much. We’ll talk about the overhead later, when we get to compiled languages.)</p>
<p>Anyway, you use the tracing API:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>import viztracer

tracer = viztracer.VizTracer()
tracer.start()

init_flow()

tracer.stop()
tracer.save('trace.json')</code></pre>
<p>You start with your init flow, if only because there’s just one such flow, as opposed to the many runtime event handling
flows. You get your trace.json file, and you run <code>vizviewer trace.json</code>. A browser tab pops up, and in that tab, you
see this message:</p>
<p><img alt="image1.png" height="198" src="https://yosefk.com/img/funtrace/image1.png" title="Oops, something went wrong. Please file a bug." width="582" style="max-width: 100%;height: auto;"></p>
<p>At this point, I hope you brought your big yak shaving machine. If, like me, you’ve only brought the small one, this is where
it breaks (“great, I knew it was a waste of time to look for a tracing profiler, of course this stuff comes broken out of the
box”) and you go back to looking at tables of numbers for a while, until this irritates you enough to try again. Then you find
out that your init flow was long enough to trigger a known bug that <a href="https://github.com/gaogaotiantian/viztracer/issues/139">likely won’t be fixed</a>, but there’s a fine workaround –
<strong>all you need to do if vizviewer gives you “Error: RPC framing error” is to reopen trace.json from the web UI<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a>.</strong></p>
<p>Happy happy, joy joy, you think to yourself, and get busy profiling your runtime flows. You open the traces – wow, so much
better than tables of numbers!</p>
<p><img alt="image2.png" height="349" src="https://yosefk.com/img/funtrace/image2.png" title="Example viztracer trace" width="576" style="max-width: 100%;height: auto;"></p>
<p>There it is, my <a href="https://yosefk.com/blog/a-100x-speedup-with-unsafe-python.html">wonderfully optimized resizing
function in “unsafe Python”</a> - and to its left, a bunch of short calls, looks like it’s drawing lots of buttons in a loop -
maybe it’ll be faster to draw them as one larger image?.. <em>Way</em> nicer than the tables!</p>
<p>The colorful stalactites are reminiscent of <a href="https://github.com/brendangregg/FlameGraph">flamegraphs</a><a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>, though they <em>aren’t </em>flamegraphs - they
represent <em>the execution timeline, </em>not <em>the share of time spent per callstack</em>. Vizviewer can show actual
flamegraphs, too - pass <code>--flamegraph</code>. In our example, instead of the many little calls on the left in the
screenshot above, you will get the following succinct summary (with the functions colored differently – done by different code,
I guess?..):</p>
<p><img alt="image3.png" height="183" src="https://yosefk.com/img/funtrace/image3.png" title="Example viztracer flamegraph" width="583" style="max-width: 100%;height: auto;"></p>
<p>Note that this is the <em>exact</em> flamegraph of a <em>short</em> period of time captured in a trace – while a sampling
profiler shows you an <em>approximate </em>flamegraph of a <em>long </em>period of time, a very different thing.</p>
<p>In any case, now that tracing basically works, you have a simple playbook:</p>
<ul>
<li>When you start handling an event, create a tracer object, and <strong>start()</strong> tracing</li>
<li>When you’re done, <strong>stop()</strong> tracing, and check how long it took to handle the event. Keep only the tracer
objects from the slowest measurements</li>
<li>Eventually, <strong>save()</strong> the traces you kept</li>
</ul>
<p>As you follow this playbook, you run into some issues:</p>
<ul>
<li>After creating the 1022nd VizTracer object (whether the previous 1021 were destroyed or not), the process terminates with
the somewhat paradoxical error message <code>Failed to create Tss_Key: Success</code>. Some resource must be leaking - so let’s
keep a pool of VizTracer objects and reuse them.</li>
<li>One of the flows you trace invokes another flow you trace, and then you get a
<code>Warning! Overwrite tracer! You should not have two VizTracer recording at the same time!</code> So you stop tracing when
entering a nested flow, and restart it when that flow is done.</li>
</ul>
<p>But, it’s not that many issues. <strong>You do need to write some code around a tracing profiler to make it work for you –
but not a lot, and it’s well worth your trouble</strong>, certainly with viztracer, which is absolutely great<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>. <a href="https://github.com/yosefk/Tinymation/blob/master/pyside/trace.py">Here’s my code for this</a>, presented mainly to show
that it’s &lt;150 LOC – there’s not much to it.</p>
<p>And now that we’ve seen how useful they are, let’s make our own tracing profiler!</p>
<h2 id="funtrace-making-a-tracing-profiler-for-native-code">Funtrace: making a tracing profiler for native code</h2>
<p>To trace function calls in a compiled language, you need 4 main things:</p>
<ul>
<li><strong>Compiler instrumentation</strong> for running code upon function entry &amp; exit <a class="footnote-ref" role="doc-noteref" href="#fn4" id="fnref4"><sup>4</sup></a></li>
<li><strong>Runtime code</strong> collecting some sort of function IDs and timestamps when functions are called and return</li>
<li><strong>Offline trace decoder</strong> for converting the traced function IDs into symbolic names, and producing some format
for the…</li>
<li>…<strong>Decoded trace viewer </strong>– a UI for looking at the traced timeline</li>
</ul>
<p>Actually, you also need a 5th thing, which one might call the first – namely, <strong>assumptions about the user’s
workflow</strong>: what the user needs to do, is willing to do, and is <em>not </em>willing to do. For funtrace, these
assumptions are:</p>
<ul>
<li><strong>The user either <em>wants </em>or <em>agrees </em>to trace <span style="text-decoration:underline;">in
production</span>.</strong> The user might <em>want</em> this because performance problems happen in production, can be hard to
reproduce, and you want to debug them. If the user isn’t interested in debugging problems in production, we still hope they
<em>agree </em>to trace in production, or at least in acceptance tests run before releasing production versions. That’s because
<em>tracing overhead can become unacceptable unless continuously monitored and culled as needed</em>. Tracing during acceptance
testing guarantees that the overhead is acceptable, by the definition of “acceptance testing.” And then you can look at trace
data any time you want, without worrying that this data is irrelevant due to high tracing overhead. Conversely, <em>not
</em>tracing during acceptance testing virtually guarantees that <em>when you actually enable tracing, the overhead will be
unacceptable, and you won’t have time to cull it,</em> making tracing unusable when you need it.</li>
<li>Therefore, as a corollary of tracing in production, <strong>the user agrees to continuously monitor and cull tracing
overhead</strong> by manually specifying things like “never trace these several functions,” “don’t trace when we’re loading
files - we know this flow is always slow, and tracing it does nothing except making it even slower,” etc.</li>
<li><strong>The user knows when to collect trace data and which collected data to save</strong>, and will use our API to do it.
For example, “start tracing when event handling begins, keep the trace of the slowest processing of every type of event, and
save all these slow traces upon program exit.”</li>
</ul>
<p>I like these assumptions for two reasons: <strong>this is the workflow I want as a user</strong>, and <strong>you get a small
and fast runtime with these assumptions</strong>. However, this is not the only possible set of sensible assumptions, and we’ll
see below how very different assumptions influence the design of LLVM XRay.</p>
<p>And now with our workflow assumptions in mind, let’s think step by step, as we tell LLMs when we want to guide their
boundless creativity away from complete bullshit, and work our way through the list of key components in a tracer.</p>
<h3 id="compiler-instrumentation">Compiler instrumentation</h3>
<p>With any native language using LLVM, you can write a compiler pass calling some event handlers upon function entry &amp; exit
– and with GCC as well, though most would prefer LLVM’s APIs for this.</p>
<p>Specifically in funtrace, however, my goal was <strong>to use existing compiler flags for instrumentation</strong>. With C++, g++ and clang++ make this possible, and compiler flags are more consistent across compiler versions than
the internal APIs for writing a compiler pass. And even if my pass supported multiple compiler versions, who’d want to build it
for their specific compiler version so as to try funtrace?..</p>
<p>g++ and clang++ have the following flags, <strong>all supported by funtrace</strong>, and each having its own pros and
cons:</p>
<ul>
<li>Both compilers support <code>-finstrument-functions</code>, which makes the compiler generate calls to
__cyg_profile_func_enter/exit when functions are, well, entered and exited. Very clean – you can write your tracing handlers in
portable C code. For better or worse, this instruments functions <em>before inlining,</em> and C++ code is famously full of tiny
inline functions<a class="footnote-ref" role="doc-noteref" href="#fn5" id="fnref5"><sup>5</sup></a>. It can be neat to trace a
program with -finstrument-functions to inspect the flow – it might be easier to follow than in either a debugger or an IDE – but
it’s impractical for tracing in production. Can we lower the overhead?
<ul>
<li>With g++ you can pass something like <code>-finstrument-functions-exclude-file-list=.h,.hpp,/usr/include</code> to ignore
all the functions in the header files – close to, but not quite what I’d want.</li>
<li>With clang++ you can simply use <code>-finstrument-functions-after-inlining</code>, often exactly what you want in
production.</li>
</ul></li>
<li>g++ supports the not-too-obvious flag combination <code>-pg -mfentry -minstrument-return=call</code>. This is similar to
clang’s -finstrument-functions-after-inlining in that it, well, instruments functions after inlining. But this doesn’t call
__cyg_profile_func_enter/exit – it calls __fentry__ and __return__, and it calls them in a different way, pretty much forcing
them to be assembly functions – though this is actually a plus, since it lowers the overhead somewhat – bringing us to the
runtime code implementing all these functions called by compiler instrumentation.</li>
</ul>
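<p>Before moving on, here’s a minimal sketch of the portable C hooks that -finstrument-functions generates calls to – funtrace’s real handlers do what the next section describes, but the signatures are these, and the attribute keeping the hooks themselves from being instrumented (which would recurse) is essential:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#define NOINSTR __attribute__((no_instrument_function))

<i>//the compiler emits a call to the first hook at every function entry,
//and to the second at every return; NOINSTR prevents infinite recursion</i>
extern "C" NOINSTR void __cyg_profile_func_enter(void* func, void* call_site)
{
    <i>//log a "call" event for func - see trace() below</i>
}

extern "C" NOINSTR void __cyg_profile_func_exit(void* func, void* call_site)
{
    <i>//log a "return" event for func</i>
}
</pre>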
<h3 id="runtime-code">Runtime code</h3>
<p>Our trace entries keep a timestamp, and a pointer into the code of a function. The highest bit of the code pointer is used to
mark an entry as “call” or “return” (no machine actually uses all the 64 bits of a pointer in userspace.)</p>
<p>Getting a cycle-accurate timestamp is fairly cheap; x86 has the so-called TSC (timestamp counter), which you read with the
RDTSC instruction. We’ll discuss alternatives to TSC on x86 and elsewhere in our “Hardcore Followup.”</p>
<p>We keep thread-local cyclic buffers of these entries. The user can dump them in full in one of 2 ways:</p>
<ul>
<li><strong>Call <code>funtrace_pause_and_write_current_snapshot()</code>.</strong> We pause tracing while writing the snapshot,
to not overwrite the data with events logged after the call.</li>
<li><strong>Run <code>kill -SIGTRAP &lt;pid&gt;</code></strong> (similarly to viztracer’s and magic-trace’s Ctrl-C/SIGINT; I
prefer SIGTRAP, since many programs handle SIGINT themselves.)</li>
</ul>
<p>As we’ve seen above, "suddenly dumping the whole trace" like this works for peeking into programs you know nothing about, but
it’s not great. Dumping the full content of the buffers is costly in time &amp; space, and then <em>looking </em>at this data
gets annoying, too. For example, some threads are idler than others, and keep very old events in their buffers thanks to this
idleness. These old events cause the timeline UI to zoom out so much that you can’t see anything.</p>
<p>You can pass flags to the funtrace decoder asking to ignore some threads, or to ignore events past a certain age. But it’s
actually much easier to know <em>at the time you’re taking the snapshot </em>what that age should be:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">uint64_t start_time = <b>funtrace_time</b>(); <i>//wraps RDTSC</i>

do_stuff();

uint64_t latency = <b>funtrace_time</b>() - start_time;

if(latency &gt; _slowest) {
  _slowest = latency;

  <b>funtrace_free_snapshot</b>(_snapshot);
  _snapshot = <b>funtrace_pause_and_get_snapshot_starting_at_time</b>(start_time);

  <i>//eventually we’ll write _snapshot out with
  //funtrace_write_snapshot()</i>
}
<i>//else (if latency &lt;= _slowest, the typical case),
//the only overhead added by our tracing logic
//is the 2 funtrace_time() calls</i>
</pre>
<p>Now the snapshot only keeps “the interesting part” – small, and easy to look at. Tracing is always on, so the program can
decide at any moment that “something interesting” happened, and save the recent events according to the appropriate definition
of “recent.”<a class="footnote-ref" role="doc-noteref" href="#fn6" id="fnref6"><sup>6</sup></a></p>
<p>Appending to a cyclic buffer is easy; the ANDing of the current <em>pointer </em>(not <em>index</em>) with a mask is the only
slightly tricky thing (see the Hardcore Followup.)</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">void trace(uint64_t code_ptr, uint64_t flags)
{
    <i>//trace_buf is declared thread_local</i>
    uint64_t buf_ptr = (uint64_t)trace_buf.pos;
    buf_ptr &amp;= trace_buf.wraparound_mask;
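    <i>//a zero wraparound_mask (tracing paused or disabled) makes entry null</i>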
    event* entry = (event*)buf_ptr;
    if(!entry) {
        return;
    }
    entry-&gt;func = code_ptr | flags;
    entry-&gt;cycle = <b>__rdtsc</b>();
    trace_buf.pos = entry + 1;
}
</pre>
<p>The flags argument is 0 for calls and 1&lt;&lt;63 for returns<a class="footnote-ref" role="doc-noteref" href="#fn7" id="fnref7"><sup>7</sup></a>. To pause tracing, we set wraparound_mask to 0. Since we must have the “if(!entry)” for
pausing, we might as well also set the mask to 0 to support <strong>disabling tracing at runtime, which cuts ~85% of the
overhead in my tests</strong>.</p>
<p>(Note that <strong>the code above is racy</strong> - it might take some time until a thread reads the zero from
wraparound_mask which another thread pausing the tracing wrote; we don’t particularly mind - at worst, we’ll overwrite a few old
events. Likewise, some recent writes might not be visible to the snapshotting code reading the buffers - we don’t particularly
mind, either<a class="footnote-ref" role="doc-noteref" href="#fn8" id="fnref8"><sup>8</sup></a>.)</p>
<p>The “nice” C callbacks __cyg_profile_func_enter/exit simply call the trace() function above. Things are harder for the
__fentry__ &amp; __return__ callbacks. Firstly, they aren’t passed the address of the function calling them as an argument - but
we could get that with __builtin_return_address(0)<a class="footnote-ref" role="doc-noteref" href="#fn9" id="fnref9"><sup>9</sup></a>. More importantly, <strong>they aren’t called according to the C calling convention</strong>
- the compiler “just calls them,” without bothering to save registers where their caller’s arguments might be kept, for
example.</p>
<p>I don’t know how to tell gcc or clang, “please don’t use registers where arguments are passed - please only use temporary
caller-saved registers.” And if you implement __fentry__ in C, and the compiler clobbers a register where arguments are passed,
and __fentry__ was called from a function which gets arguments, that function’s argument will have been clobbered.</p>
<p>So I wrote these functions in x86 assembly, basically by taking what the compiler produces from
<code>trace(__builtin_return_address(0), flags)</code> and then changing the code to only use those registers that I am allowed
to in this “non-standard calling convention” – or saving those registers that I can’t help using, but shouldn’t clobber.</p>
<p>(RDX and RAX are the annoying ones. RDTSC is hardwired to clobber them, but they’re also used for an argument and the return
value, respectively, so it works out to __fentry__ having to save RDX, and __return__ having to save both. Many more appetising
details of this sort await in the Hardcore Followup; generally, <strong>writing a tracing profiler today involves figuring out
many small and simple, if somewhat arcane things, but not a lot of code</strong> - the perfect job for myself, I find.)</p>
<p>Of course, just the content of the buffers is not enough to make sense of the snapshot once it was dumped. We also need to
know:</p>
<ul>
<li><strong>Where the code was loaded to</strong> - the executable and each shared library get loaded to a different offset,
potentially in every run if <a href="https://en.wikipedia.org/wiki/Address_space_layout_randomization">ASLR</a> is enabled, and
we have to subtract this offset to convert code pointers to function names using symbol table lookup. We can get this info from
<code>/proc/self/maps</code>, but it’s faster to get it from <a href="https://man7.org/linux/man-pages/man3/dl_iterate_phdr.3.html">dl_iterate_phdr</a> (and <em>maybe </em>more portable, eg
Android reportedly won’t let you access files under /proc)</li>
<li><strong>The TSC frequency</strong> to convert cycles to nanoseconds. There are at least 3 methods to find it out - using the
CPUID instruction (specifically “leaf 15H” - don’t ask), grepping the output of <code>dmesg</code>, and simply sleeping for some
time and checking by how much TSC was incremented during that time. Funtrace tries all these methods, in that order, in case the
first two fail (the 3rd kinda can’t fail, but it’s not very accurate.)</li>
</ul>
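<p>For the first item, a minimal sketch of getting the load addresses with dl_iterate_phdr – funtrace records more per module, but this is the gist:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;link.h&gt;
#include &lt;stdio.h&gt;

<i>//called once per loaded module - the executable and every shared library</i>
static int dump_module(struct dl_phdr_info* info, size_t size, void* data)
{
    printf("%s loaded at %p\n",
           info-&gt;dlpi_name[0] ? info-&gt;dlpi_name : "(main executable)",
           (void*)info-&gt;dlpi_addr);
    return 0; <i>//0 means "keep iterating"</i>
}

void dump_load_addresses() { dl_iterate_phdr(dump_module, NULL); }
</pre>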
<p>With that, we can decode the code pointers and the timestamps, respectively. One last nice thing to save is <strong>thread
names</strong>. The OS spends lavishly to let us userspace peasants name our threads - 15 (!!) characters per thread (16 with
the null byte), and funtrace reads these names with <code>pthread_getname_np</code><a class="footnote-ref" role="doc-noteref" href="#fn10" id="fnref10"><sup>10</sup></a>.</p>
<p>That’s it - we have our snapshot.</p>
<h3 id="decoded-trace-viewer">Decoded trace viewer</h3>
<p>We need to decode the trace before viewing it. But first, we need to decide what the viewer will be, to make our decoder emit
the trace in the viewer’s format. Vizviewer, for example, is based on <strong><a href="https://ui.perfetto.dev/">Perfetto</a></strong>. In fact, it turns out that <strong>every tracing profiler mentioned in
this post (viztracer, magic-trace, XRay) uses Perfetto for the viewer</strong>.</p>
<p>I assume that Perfetto owes its popularity to its quick rendering of very large traces at arbitrary zoom, its beautiful look
&amp; feel, and its <a href="https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/">simple JSON trace
format</a> - just a bunch of records with a name, start timestamp, duration, and thread ID:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">{
"traceEvents": [
{"args":{"name":"<b>my_program</b>"},"name":"process_name",
    "tid":1,"pid":1,"ph":"M"},
{"args":{"name":"<b>main_thread</b>"},"name":"thread_name",
    "tid":1,"pid":1,"ph":"M"},
{"name":"<b>run</b>()","ts":2,"dur":10,"tid":1,"pid":1,"ph":"X"},
{"name":"<b>parse</b>()","ts":5,"dur":5,"tid":1,"pid":1,"ph":"X"},
{"name":"<b>open</b>()","ts":6,"dur":3,"tid":1,"pid":1,"ph":"X"},
{"args":{"name":"<b>worker_thread</b>"},"name":"thread_name",
    "tid":2,"pid":1,"ph":"M"},
{"name":"<b>config</b>()","ts":3,"dur":4,"tid":2,"pid":1,"ph":"X"},
{"name":"<b>open</b>()","ts":4,"dur":2,"tid":2,"pid":1,"ph":"X"}
]
}
</pre>
<p>Note how we didn’t tell which function called which - Perfetto just finds which time ranges are nested within other time
ranges, and stacks them accordingly:</p>
<p><img alt="image6.png" height="207" src="https://yosefk.com/img/funtrace/image6.png" title="Example Perfetto JSON" width="575" style="max-width: 100%;height: auto;"></p>
<p>Today, however, I don’t recommend following vizviewer’s example and using Perfetto. Much better to do what vizviewer couldn’t
do - <strong>use vizviewer itself!</strong></p>
<p>A big reason is that <strong>vizviewer extends the JSON format to include the source code of functions</strong> (and unlike
<em>every damned debugging and profiling tool</em>, here’s a program finally doing the right thing and putting <em>the source
code </em>into the JSON instead of <em>file names </em>– so that you actually look at the source code <em>that was traced</em>,
and not the code appearing in those files <em>right now,</em> possibly with some newer changes! And you can also send these JSON
files to someone and they’re self-contained, and they’ll open on their machine – unlike typical tool reports referencing source
code.)</p>
<p>So <code>funtrace2viz</code>, our trace decoder, simply produces a vizviewer JSON; to view it, install viztracer with
<code>pip install viztracer</code>, and you’ll get vizviewer in your $PATH. And now all we need is an…</p>
<h3 id="offline-trace-decoder">Offline trace decoder</h3>
<p>Our trace entries have 2 fields – a code pointer and a cycle – so decoding involves 2 jobs:</p>
<ul>
<li>Converting code pointers to function names and source line numbers.</li>
<li>Converting cycles to microseconds.</li>
</ul>
<p>Therefore, we should do this in Rust, which has excellent libraries for both tasks.</p>
<p><em>Libraries</em> for <em>both</em> tasks, you think; the 2nd task being multiplication of numbers. I guess this guy’s rabid
hatred of C++ metastasized into rabid Rust fandom, you think; sad, if unsurprising – a textbook example of mental illness.</p>
<p>Well, ackchyually, I’m a less rabid Rust fan than I’d like; I’m afraid that if you’re into stuff involving a mix of GUI and
number crunching, a combination of C++ and Python is your best bet today, if only because these are the two widely popular
languages which people use for this stuff, and where most of the libraries and tools are<a class="footnote-ref" role="doc-noteref" href="#fn11" id="fnref11"><sup>11</sup></a>.</p>
<p>That said, yes, converting cycles to microseconds <em>is </em>”a task,” as evidenced by the following comment in XRay’s
code:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">// Chrome trace event format always wants data in micros.
// CyclesPerMicro = CycleHertz / 10^6
// TSC / CyclesPerMicro == TSC * 10^6 / CycleHertz == MicroTimestamp
//
// <b>Could lose some precision here</b> by converting the TSC to 
// a double to multiply by the period in micros. 52 bit
// mantissa is a good start though.
//
// <b>TODO</b>: Make feature request to Chrome Trace viewer to
// accept ticks and a frequency or <b>do some more involved
// calculation</b> to avoid dangers of conversion.
</pre>
<p>You see, TSC is a 64b number, and if your machine has been running for a while, it will have more than 52 significant bits,
and you will start losing the low bits, because they won’t fit into a double’s mantissa. Now, in Rust, all I had to do to avoid
this precision loss was `cargo add <a href="https://docs.rs/num/latest/num/">num</a>`, and then use
<code>Ratio&lt;BigInt&gt;</code> for the conversion.</p>
<p>But if it was C++, while you can find a library for this, you would want to avoid the dependency – because without a standard
build &amp; packaging system, dependencies are a major PITA. So I’d just leave a TODO like they did in XRay.</p>
<p><strong>Rust is the fastest popular language with a standard package manager</strong>. <em>This alone</em> will make you
extremely productive in areas it has good libraries for, if you’re looking to minimize the product of machine time x developer
time. <strong>Mature support and widespread use of binary packages for C++ code would greatly boost Rust’s
applicability!</strong> For instance, <code>pip install</code>ing Python bindings wrapping C++ libraries is way easier than
managing these libraries as source dependencies. If Rust could become “a packaging system for C++” like Python effectively is,
it would immediately become very tempting to use just for this reason!<a class="footnote-ref" role="doc-noteref" href="#fn12" id="fnref12"><sup>12</sup></a></p>
<p>Of course, our bigger task is parsing ELF (with <a href="https://docs.rs/goblin/latest/goblin/elf/index.html">goblin::elf</a>) and DWARF (with <a href="https://docs.rs/addr2line/latest/addr2line/">addr2line</a>) - and we need to parse both. Only DWARF has line info, but
only ELF has symbols containing some of the code pointers – for example, gcc doesn’t bother to produce DWARF debug info for
“thunks” it generates. What’s a thunk? Well, there are “virtual thunks” and “non-virtual thunks” according to the C++ demangler
(cargo add <a href="https://docs.rs/cpp_demangle/latest/cpp_demangle/">cpp_demangle</a>); I’m sure this means something, but I
don’t care exactly what it is – I just want some name at least somewhat related to the source code instead of bare hex
garbage.</p>
<p>Which reminds me – and I’m sure experienced programmers have seen it coming – we actually have <em>three </em>jobs:
converting code pointers to names, converting cycles to microseconds, and dealing with random shit. Examples of the latter:</p>
<ul>
<li>Detecting and ignoring “virtual override thunks,” which for some reason call __return__ but not __fentry__</li>
<li>Handling “orphan returns” (when the function call was overwritten in the cyclic buffer and we only see the return
event)</li>
<li>Handling exceptions – a subject with its own section in the Hardcore Followup</li>
<li>Handling “strange missing returns” when f called g which called h, but h returns straight to f (eg because setjmp/longjmp
were used)</li>
<li>Remapping pathnames according to substitution rules provided by the user, in case the files aren’t where the debug info says
they should be (which happens way more often than it should - see <a href="https://yosefk.com/blog/refix-fast-debuggable-reproducible-builds.html">an easy way to avoid such issues</a>)</li>
</ul>
<p>But, that’s about it. I dwell on the details in part to show that <strong>it’s not that much work</strong>, even if you’re
aiming to cover enough ground for “serious uses” - threads, exceptions, shared objects, multiple compiler instrumentation
options, etc. etc. <strong>The decoding is about 1K LOC</strong>, same as the runtime (but with way more library
dependencies!)</p>
<p>One last item to file under “random shit” is converting <em>ftrace timestamps </em>(similar to, but uglier than converting
our trace entry timestamps), bringing us to…</p>
<h3 id="ftrace-tracing-thread-state-changes">ftrace: tracing thread state changes</h3>
<p>A function tracer traces function calls, and since we just did, we could use this tautology to declare that our job is done.
However, you’ll wonder if a function taking a long time was actually <em>computing</em> something or <em>waiting </em>for
something – so you need to know whether the thread was in a running state or not.</p>
<p>Linux can trace many kernel events, including scheduling events. A userspace peasant with the right permissions (eg
<code>sudo chown -R $USER /sys/kernel/tracing</code>) can configure ftrace to log thread scheduling events. You get the latest
events from the kernel buffer with <code>cat /sys/kernel/tracing/trace</code> (and no, you don’t need to actually understand the
example log below to follow what’s next - just showing it for those curious about tracing on Linux):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#                   _-----=&gt; irqs-off
#                  / _----=&gt; need-resched
#                 | / _---=&gt; hardirq/softirq
#                 || / _--=&gt; preempt-depth
#                 ||| / _-=&gt; migrate-disable
#                 |||| /     delay
#  TASK-PID CPU#  |||||  TIMESTAMP  FUNCTION
#     | |     |   |||||     |         |
  &lt;...&gt;-78  [<b>003</b>] ..... 30625460: <b>task_newtask:</b> pid=81 comm=main clone_flags=3d0f00 oom_score_adj=0
 &lt;idle&gt;-0   [<b>004</b>] d.... 30701644: <b>sched_switch:</b> prev_comm=swapper/4 prev_pid=0 prev_prio=120 prev_state=R ==&gt; next_comm=main next_pid=81 next_prio=120
  &lt;...&gt;-78  [<b>003</b>] d.... 30747413: <b>sched_switch:</b> prev_comm=main prev_pid=78 prev_prio=120 prev_state=D ==&gt; next_comm=swapper/3 next_pid=0 next_prio=120
  &lt;...&gt;-81  [<b>004</b>] d.... 30750940: <b>sched_waking:</b> comm=main pid=78 prio=120 target_cpu=003
 &lt;idle&gt;-0   [<b>003</b>] d.... 30780026: <b>sched_switch:</b> prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==&gt; next_comm=main next_pid=78 next_prio=120
  &lt;...&gt;-81  [<b>004</b>] ..... 30810996: <b>task_rename: </b>pid=81 oldcomm=main newcomm=worker oom_score_adj=0
  &lt;...&gt;-78  [<b>003</b>] d.s.. 37939679: <b>sched_waking:</b> comm=rcu_sched pid=15 prio=120 target_cpu=035
  &lt;...&gt;-81  [<b>004</b>] d.h.. 38466542: <b>sched_waking:</b> comm=code pid=9974 prio=120 target_cpu=004
</pre>
<p>It turns out that we don’t even need to parse this format – <strong>Perfetto can simply read ftrace data from a
<code>systemTraceEvents</code> JSON key, and has special support for visualizing scheduling events.</strong> Here’s how it
looks:</p>
<p><img alt="image7.png" height="434" src="https://yosefk.com/img/funtrace/image7.png" title="Perfetto ftrace support" width="576" style="max-width: 100%;height: auto;"></p>
<p>Perfetto shows us which thread each CPU core is running at any given moment (where “swapper” is Linux-speak for “nothing”.)
To each thread, Perfetto adds a special lane showing whether its state is Running, Runnable (light green), or waiting for
something (white/blank.)</p>
<p>So all we have to do is collect and log ftrace events (~200 lines of funtrace’s ~1200 LOC runtime.) We configure
<code>trace_clock</code> to <code>x86-tsc</code> to synchronize timestamps with our function call/return events. We read the
scheduling events supported by Perfetto from <code>trace_pipe</code> (only listening to events from our process and its
children, or we’ll be flooded with data, including too many CPU lanes to vertically scroll through.) And we maintain a circular
buffer of events, so that we can get a snapshot of all events after some time threshold at any moment.</p>
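<p>Concretely, the configuration boils down to writing a few files under /sys/kernel/tracing – a sketch of the sort of thing the runtime does (error handling omitted, and the exact set of options funtrace configures may differ):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

static void echo(const char* path, const char* value)
{
    FILE* f = fopen(path, "w");
    if(f) { fputs(value, f); fclose(f); }
}

void setup_sched_tracing()
{
    char pid[32];
    snprintf(pid, sizeof pid, "%d", (int)getpid());
    echo("/sys/kernel/tracing/trace_clock", "x86-tsc");
    echo("/sys/kernel/tracing/set_event_pid", pid); <i>//only our process...</i>
    echo("/sys/kernel/tracing/options/event-fork", "1"); <i>//...and its children</i>
    echo("/sys/kernel/tracing/events/sched/sched_switch/enable", "1");
    echo("/sys/kernel/tracing/events/sched/sched_waking/enable", "1");
    <i>//...then keep reading /sys/kernel/tracing/trace_pipe into a cyclic buffer</i>
}
</pre>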
<p>That’s it - the offline decoder converts ftrace timestamps from TSC to milliseconds (a bit ugly to have to massage text like
this, but no biggie; weird that you have to do this - the JSON evidently wasn’t designed for TSC, the fastest timestamping
method, but whatever.) And we’re done.</p>
<p>Note that there’s WAY more info that we could get from ftrace. For example, we could show how threads wait for each other
because of taking the same locks, etc. etc. I just picked the low-hanging fruit, which was enough for a certain “completeness” –
you can know both what functions the CPU was running and when it was waiting.</p>
<p>I don’t know why other function tracers using Perfetto don’t collect ftrace data, with Perfetto making it so easy. I think
some come from a larger system having another part doing this – and perhaps others don’t bother because users have trouble
getting permissions to access ftrace?!.. <strong>Why do I need permissions to know when my own threads were
scheduled?..</strong> (I confess up front that if there’s an answer, I might refuse to understand it. I hate permissions!)</p>
<h3 id="getting-traces-from-core-dumps">Getting traces from core dumps</h3>
<p>I firmly believe that in addition to compile-time and runtime support, proper developer tools come with <em>coretime
support</em>. I have a morbid fascination with <strong>coredump-oriented design: make your data structures easy to extract from
core dumps, and write the code to do so</strong>.</p>
<p>Function traces might come in handy when performing an autopsy on a core dump – it helps to see what the program was doing
before it crashed. Looking at when your threads were doing what might help you understand a race condition. A null dereference
might become obvious once you see that the wrong flow was running. And then some core dumps might be due to performance being
too bad (eg a real time program purposely crashing during test cycles), and traces are perfect for that.</p>
<p>So funtrace comes with a gdb.Python extension command, imaginatively named <code>funtrace</code>. It reads the thread-local
trace buffers, as well as the ftrace events<a class="footnote-ref" role="doc-noteref" href="#fn13" id="fnref13"><sup>13</sup></a>. We also save the addresses where shared libraries were loaded to, which we get from gdb’s
<code>info proc mappings</code>. You get a funtrace.raw file in the same format you’d get from
<code>funtrace_save_snapshot()</code>, and you can decode it with <code>funtrace2viz</code>. At <strong>~120 LOC</strong>, our
“coretime” costs us just 10% of our runtime in terms of lines of code.</p>
<h3 id="funcount-culling-overhead">funcount: culling overhead</h3>
<p>In a microbenchmark, it takes 8-9 ns to log one trace entry, and every function call takes 2 entries. This can be
a lot or a little, depending on how much work a function does on average. From my experience, <strong>you’ll probably want to
disable tracing for a bunch of short functions</strong> to get the overall overhead down to single-digit percentage points –
what I’d call a fair price for being able to debug performance issues (it <em>costs </em>money because it <em>saves </em>money,
like a plumber said in some movie; up to a point, slower code that you can optimize thanks to the extra visibility ends up being
<em>faster, </em>because you will have used more optimization opportunities.)</p>
<p>You can exclude a function from tracing using the <code>NOFUNTRACE</code> macro, which adds an <code>__attribute__</code> to
the function disabling compiler instrumentation. The question is, <strong>which functions to exclude?</strong> You could check
some traces and find often-called short functions. But this brings back the first problem with tracing profilers: what to trace?
Traces are short - perfect for understanding interesting outlier events, but not for understanding the overhead <em>on average
</em>like we want here.</p>
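<p>As for the mechanics of NOFUNTRACE, usage looks like this – a sketch, with a made-up <code>lerp()</code> standing in for whatever short, hot function you’ve decided isn’t worth two trace entries per call, and assuming the macro comes from funtrace’s header:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include "funtrace.h" <i>//defines NOFUNTRACE</i>

<i>//short and called all the time - tracing it costs more than running it</i>
NOFUNTRACE static inline float lerp(float a, float b, float t)
{
    return a + (b - a) * t;
}
</pre>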
<p>The obvious approach is, <strong>let’s sort the functions by the number of times they’re called, and consider disabling
tracing on those called most often</strong>. However, what is actually not obvious is <em>how to count function calls</em>. A
sampling profiler can’t tell the difference between a long function called once and a shorter function called many times<a class="footnote-ref" role="doc-noteref" href="#fn14" id="fnref14"><sup>14</sup></a>. gprof collects accurate call counts – for
single-threaded programs; in multithreaded programs, the call counts are garbage. And callgrind is slow, and won’t run in many
environments.</p>
<p>So, funtrace ships its own tool for counting calls, <code>funcount</code> – which is nice, in particular, because <strong>it
counts the exact same calls that funtrace would instrument</strong> under whatever compiler flags you chose. It works like
this:</p>
<ul>
<li><strong>The function entry hook increments an atomic counter in a 2-level page table</strong> (see the sketch after this
list). We assume that 48 bits out of a pointer’s 64 are actually used; given an address, we increment the counter at
<code>pages[high_bits][mid_bits][low_bits]</code>, where each index is made of 16 out of these 48 bits, and the arrays are
sparse (most of the page pointers are null.)</li>
<li>We could allocate pages thread-safely on demand using some compare-and-swappery, but it slows things down A LOT. Instead,
<strong>we use dl_iterate_phdr upon program start and upon calls to dlopen</strong> to find executable address space segments,
and allocate pages only for those segments<a class="footnote-ref" role="doc-noteref" href="#fn15" id="fnref15"><sup>15</sup></a>.</li>
<li><strong>We print the non-zero counts at the end of the program run</strong>. The resulting report, funcount.txt, can be
decoded to function names &amp; source line numbers using funcount2sym. You can then sort by call count with <code>sort</code>,
and combine reports from multiple runs with awk or such<a class="footnote-ref" role="doc-noteref" href="#fn16" id="fnref16"><sup>16</sup></a>.</li>
</ul>
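<p>The promised sketch of the counting hook, under the assumptions above – not funcount’s literal code (for one thing, the real hook is an assembly stub like funtrace’s), but it shows the data structure:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;atomic&gt;
#include &lt;stdint.h&gt;

typedef std::atomic&lt;uint64_t&gt; counter;

<i>//top-level table, indexed by address bits 47:32; mid tables and counter
//pages are preallocated for executable segments, so the hook never allocates</i>
static counter** page_table[1 &lt;&lt; 16];

void count_call(uint64_t func_addr) <i>//called upon function entry</i>
{
    counter** mid = page_table[(func_addr &gt;&gt; 32) &amp; 0xffff];
    if(!mid) return; <i>//not in a known executable segment</i>
    counter* page = mid[(func_addr &gt;&gt; 16) &amp; 0xffff];
    if(!page) return;
    page[func_addr &amp; 0xffff].fetch_add(1, std::memory_order_relaxed);
}
</pre>
<p>The preallocation is what keeps the hook this short – two loads and a relaxed atomic increment.</p>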
<p>There’s also a knob to tune a time/space tradeoff: if you compile with <code>-DFUNCOUNT_PAGE_TABLES=16</code>, 16 page tables
will be kept instead of one, with each thread indexing into page table number <code>cpu_core_number % 16</code> (we get the core number from
RDTSCP.) If you have many threads calling the same functions, more page tables means less fighting for the cache lines keeping
these functions’ counters – at the cost of using more memory. The final report is the sum of the counts from all the page
tables.</p>
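<p>Getting the core number from RDTSCP could look like this – a sketch, assuming the Linux convention of keeping the CPU number in the low 12 bits of IA32_TSC_AUX:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;x86intrin.h&gt;

#ifndef FUNCOUNT_PAGE_TABLES
#define FUNCOUNT_PAGE_TABLES 16
#endif

unsigned page_table_index()
{
    unsigned aux;
    __rdtscp(&amp;aux); <i>//fills aux from IA32_TSC_AUX alongside reading the TSC</i>
    return (aux &amp; 0xfff) % FUNCOUNT_PAGE_TABLES; <i>//low 12 bits = CPU number on Linux</i>
}
</pre>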
<p>This thing is about as fast as funtrace, modest in memory use, takes <strong>~250 LOC for the runtime + ~60 LOC for the
offline decoding</strong> – and like funtrace itself, is easy to understand and port.</p>
<p>While we’re on the subject of tracing overhead, one might ask – isn’t <em>opt-in</em> tracing better? Instead of tracing by
default and using tools to find when to opt out of tracing, why not let people trace whatever they want, without putting tracing
in behind their backs?</p>
<p>This approach can work well for some teams. I believe that tracing by default is better for most projects, if only because
<strong>the code with the most surprising performance artifacts is both what you want traced the most <em>in hindsight, </em>and
what was most likely <em>not </em>traced satisfactorily up front </strong>in an opt-in tracing regime, since nobody expected the
surprise by definition.</p>
<p>Another point is that an “opt-in tracing backlog,” where many people just <em>don’t </em>opt in for a long time, can easily
grow so much that you will lose all hope to use tracing when you need it. The “opting-out” backlog <em>cannot </em>grow so
badly, because too much tracing results in performance problems that you will <em>have </em>to fix long before they’ll get too
daunting to even try. This is similar to the argument for tracing in production vs flipping a switch when you need tracing –
<strong>a switch that’s off by default is unlikely to really work when you need it</strong>.</p>
<p>Last but not least, <strong>opting out is typically less work than opting in, making tracing cost less in development
time</strong>. You can definitely create a culture where people carefully put tracing statements into their code; the endlessly
growing “tracing backlog” is likely, but it’s not destiny. But then changing code becomes more costly, so you’ll tend to avoid
it more often – a problem in its own right. The same thing happens with other “good things,” like documentation and tests – they
make your system better, but you take more time making it, and then balk at making major changes. The nice thing about tracing
is that unlike documentation and tests, you can have it done mostly automatically with a bit of manual intervention.</p>
<h2 id="antithesis-llvm-xray">Antithesis: LLVM XRay</h2>
<p>The thesis behind funtrace is that the user:</p>
<ul>
<li>traces in production,</li>
<li>monitors &amp; culls tracing overhead, and</li>
<li>decides when to obtain and save trace data.</li>
</ul>
<p>Now let’s look at XRay, built on assumptions we could call the antithesis of ours:</p>
<ul>
<li>We’re a huge company (in the specific case of XRay’s developers, it’s called Google.)</li>
<li>We have a million programmers maintaining a billion lines of code who can’t be bothered to integrate a tracing profiler into
their code. These programmers aren’t our users.</li>
<li>We have, however, a performance team looking for issues across our millions of servers. <em>These </em>are our users.</li>
<li>If this performance team looks for issues completely randomly in a small share of our machines, it’s still a ton of
machines, so they’ll accidentally bump into some unusual slowdowns.</li>
<li>All of our code is compiled by the same build system. We can change the way code is compiled, as long as the overhead <em>as
experienced by the teams owning the code </em>is not large.</li>
</ul>
<p>In short, the programmers who wrote the traced code are patients signing a consent form for periodic checkups – and the X-Ray
machine operators are trained doctors on a separate team. <a href="https://yosefk.com/blog/advantages-of-incompetent-management.html">For a big company, these are very sensible
assumptions</a>. With this in mind, let’s look at how XRay works:</p>
<ul>
<li><strong>XRay instrumentation inserts NOP instruction sequences upon function entry/exit</strong>. The XRay runtime can then
change these instructions to make the functions jump to function entry/exit handlers <em>while the program is running</em>, and
this patching is done without pausing execution - quite the engineering feat. This makes the overhead very low <em>when tracing
is disabled, </em>which is what code owners care about. It also lets the performance team pick subsets of code where tracing is
enabled at runtime.</li>
<li><strong>XRay culls the overhead of tracing automatically</strong>, so that code owners needn’t bother. In fact, if you run
XRay in a small test, you might get an almost empty trace and assume it didn’t work, unless you lower the following thresholds:
<ul>
<li><code>-fxray-instruction-threshold=&lt;N&gt;</code> (default: 200) - the compiler only instruments functions with at least
this many instructions; it strays from this rule for functions with loops, since they’re likely to take a long time regardless of instruction count.</li>
<li><code>XRAY_BASIC_OPTIONS=func_duration_threshold_us=&lt;N&gt;</code> (default: 5 microseconds; undocumented at the time of
writing) - a function call that took less than the value set by this env var is omitted from the trace<a class="footnote-ref" role="doc-noteref" href="#fn17" id="fnref17"><sup>17</sup></a>.</li>
</ul></li>
<li><strong>XRay makes an effort to compress trace events</strong> - instead of logging 64b function pointers, the compiler
produces 28b integer IDs for the functions, and the runtime stores 32b TSC deltas instead of 64b TSC values. (If the delta
doesn’t fit into 32 bits, the runtime logs a special record with the full 64b TSC value.)</li>
</ul>
<p><strong>Thus an XRay trace entry is half the size of funtrace’s </strong>(and it’s furthermore very easy to keep short calls
out of the trace)<strong>, but XRay’s runtime overhead per instrumented function call is 6 times as large </strong>(as measured
in a microbenchmark, FWIW.)</p>
<p>This is <em>the sensible tradeoff</em> for a big company with a big code base, a big performance team outside the teams
owning the code, and a big tooling team implementing the tracing profiler:</p>
<ul>
<li><strong>It’s extremely important to minimize overhead with tracing disabled</strong>, or people will push back against
compiling with instrumentation.</li>
<li><strong>Always-on function tracing is prohibitively expensive</strong>. With a giant fleet of machines, even a 3% overhead
costs a fortune.</li>
<li><strong>…But the overhead of <em>occasionally </em>enabled tracing is no big deal</strong>. A rare 20-40% slowdown (the
number reported in <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45287.pdf">the XRay
whitepaper</a>) experienced by some service when the performance team is tracing it won’t register as a problem.</li>
<li><strong>Traced data size must be minimal</strong>. Since nobody tells us when something interesting happens, we must collect
data covering A LOT of time to find interesting things ourselves – so we better filter the data well.</li>
<li><strong>The complexity of the tool is not an issue</strong> – the cost of developing and maintaining it is dwarfed by the
cost savings it can provide across a giant server fleet.</li>
</ul>
<p>Thus it is the right thing, for a big company, to invest effort into thread-safe runtime code patching, a system for creating
and decoding small function IDs, and runtime mechanisms for filtering the trace – mechanisms which slow tracing down and make
traces harder to understand, but the performance team will manage, and this is how you get that team data covering a
sufficiently large time range.</p>
<p>The adverse effects of this approach manifest in “smaller” contexts:</p>
<ul>
<li><strong>You can’t afford a high runtime overhead in many cases,</strong> eg a realtime application, or even just a GUI on an
end-user device. If you have a huge server fleet, slowing down some services considerably for a few seconds is fine, in part
because in any case there are network lags all the time, and the end user can’t tell the difference. This isn’t as true when
you’re running on a small “edge device” rather than a big server farm. Overhead is also worse for a <em>smaller </em>server farm
– the more machines you have, the smaller share of them you can slow down for tracing (and still find stuff), so the
<em>relative </em>cost shrinks.</li>
<li><strong>Runtime code patching is impossible in some small, “embedded” environments,</strong> either because you’re running
from ROM, or because you’re running from a read-only image in RAM, and have no room for copies of the code pages changed by the
runtime patching.</li>
<li>The open source XRay version (not what Google uses internally AFAIK) <strong>still can’t decode symbols from shared
libraries.</strong> Actually, it didn’t even <em>log</em> function calls made by shared libraries for almost a decade, since the
runtime patching didn’t cover DSOs. But the latest LLVM version <em>does</em> log them thanks to a patch submitted in 2024 by <a href="https://arxiv.org/pdf/2303.11110">people using XRay for academic work</a>. They <a href="https://github.com/tudasc/CaPI/issues/2">plan to make decoding work, too, but the current implementation only supports the
“basic logging format”</a> which is so bloated that it’s unusable in production - the runtime overhead will be
<strong>15-18</strong> times as large as funtrace’s… Eventually, the OSS XRay might fully support shared libraries, but it will
have taken at least a decade after the initial version was released.</li>
</ul>
<p>A big tooling team can deal with the complexity of runtime patching and function ID decoding in the presence of DSOs<a class="footnote-ref" role="doc-noteref" href="#fn18" id="fnref18"><sup>18</sup></a>; evidently it’s harder for a smaller
project. Using code pointers as IDs spares you these complications, but you pay in trace entry size – which is OK for a smaller
software system, where you can put in logic for tracing just what you want.</p>
<p><strong>Bigness begets bigness and works well with bigness, and vice versa</strong>. Neither big nor small are “better;” both
are fine as long as you “go big” or small consistently, and when the environment calls for it.</p>
<h2 id="synthesis-funtrace-with-xray-characteristics">Synthesis: funtrace with XRay characteristics</h2>
<p>You can use funtrace with XRay instrumentation – a straightforward kind of synthesis. This uses almost the same assembly
call/return handlers that we have for gcc with the -pg… instrumentation flags – these get called instead of XRay’s
callbacks.</p>
<p>Why would you use <code>-fxray-instrument</code> instead of <code>-finstrument-functions-after-inlining</code>, the other way
to use funtrace under clang?</p>
<ul>
<li>You might like to be able to automatically exclude short functions by tuning the threshold set by
-fxray-instruction-threshold=N.</li>
<li>You can patch the code at runtime to enable tracing and disable it again, lowering the overhead further relative to
funtrace’s way of disabling tracing (which still has your code jumping to its handlers, which do less than they do when tracing –
but a bit more than XRay-generated code not patched to jump to tracing callbacks.)</li>
</ul>
<p>Note that you need a recent LLVM to be able to trace inside DSOs with XRay instrumentation – specifically, a version having
the <code>-fxray-shared</code> flag.</p>
<p>Note as well that <strong>support for exceptions under -fxray-instrument has some limitations (same as with gcc under
-pg)</strong>, though it’s pretty good and a big step up from XRay’s not supporting exceptions at all (the Google style guide
bans C++ exceptions, and I must say that I fully share their distaste for the feature – but many programs use exceptions, so
funtrace makes an effort to support them, as we’ll see in the followup.)</p>
<p>Now, what would be a deeper form of synthesis than just combining XRay instrumentation with funtrace runtime? What does a
<strong>synthesis of assumptions</strong> look like?</p>
<p>Let’s say we assume the developer is “the user” of tracing and “owns” it, rather than relying on a performance team. We can
still ask ourselves, when is the developer the closest to the position of such a performance team? The answer is, <strong>when
adding tracing to their program for the first time!</strong></p>
<p>I mean, if you started out with tracing from day one, then you’re never in that position. But if you already have a biggish
system and you’re adding tracing to it, then it’s very tedious to manually exclude lots of small functions from tracing.
<strong>How can we make this easier?</strong></p>
<ul>
<li><strong>Filtering by function size</strong>: funtrace adds a compile time flag, <code>-funtrace-instr-thresh=N</code>, which
works a lot like -fxray-instruction-threshold=N; it excludes short functions from tracing unless they have loops (though you can
pass <code>-funtrace-ignore-loops</code> to have them excluded anyway)</li>
<li><strong>Filtering using a list of mangled function names</strong>: let’s say you want to disable the tracing of lots of
functions reported by funcount as “frequent callees”, and check what this does to tracing overhead, and to the traces you get.
Going thru 100 source files to add NOFUNTRACE to each function gets old quickly – especially if you want to try many different
experiments, excluding different subsets of functions every time. Instead, you can use <code>-funtrace-no-trace=file</code> to
exclude the functions listed in that file – way quicker than editing each function.</li>
</ul>
<p>Sounds great, but you might be wondering, <em>how does funtrace “add a compile time flag”</em>, if it uses stock gcc or clang
with no changes or compiler passes?.. If you had a feeling of something fishy coming up, you were very right. Funtrace adds
these trace filtering flags by <strong>post-processing the assembly code generated by the compiler.</strong> Some implications
of this:</p>
<ul>
<li><strong>Assembly post-processing removes most, but not all of the instrumentation overhead.</strong> We’re removing
instructions calling the function entry/exit hooks, but the code will still have been variously “scarred” by having those calls
put in by the compiler in the first place. It shouldn’t cost more than 1ns per function, but it can add up.</li>
<li><strong>Assembly post-processing, while tested on large programs, is less solid than the rest of funtrace.</strong> It’s
easy for a compiler to generate assembly code that breaks this filtering – both the loop detection (which simply looks for
branches to labels defined before the branch within a function) and the removal of calls to the hooks. It’s simple text
processing making assumptions about how the text of the .s files looks. It “works”, but these assumptions aren’t backed by any
specification.</li>
</ul>
<p>So one might decide that this assembly post-processing is more suitable for initial experimentation than a long-term
production deployment. It’s there; it’s your call in what scope to use it, and people will have different preferences for good
reasons. I personally don’t mind the risk of incorrect code generation that much, because I’m good at debugging such things, and
I count on tests to uncover it quickly. But this is the opposite of the right approach for many teams, so I’ve given the exact
opposite advice on some occasions.</p>
<p>Whether you want it for production or not, assembly post-processing should make first-time experiments with funtrace easier,
by providing features for “an encounter with a system for which tracing is alien” – the situation XRay is designed around.</p>
<h2 id="hardware-assisted-tracing">Hardware-assisted tracing</h2>
<p>Most CPUs have some hardware tracing facilities, but the most basic &amp; common of these are designed for people debugging
the hardware itself or very low-level software like kernels and boot loaders using something like a JTAG probe. For example,
when a branch is taken, the instruction address or a delta might be sent to a probe like that over a dedicated channel, or it
might be saved into a tiny circular SRAM buffer inside the chip that you can then read with the probe. This doesn’t help most
people debugging large systems though.</p>
<p>An exception to this is Intel Processor Trace, which lets you trace native code with zero instrumentation. The awesome
magic-trace is built on top of it. You can try it with <code>magic-trace run ls</code> (for example); it works out of the box,
no recompilation required, and the overhead is low (they say 2-10%.)</p>
<p><strong>I’m unironically in shock that people deploying on x86, certainly to environments they control, don’t all insist on
using Intel hardware to always run the code under magic-trace in production.</strong> (You can trigger tracing programmatically,
and with these traces, you’ll be able to debug any latency issue you could have in production.) Did Google develop the
high-overhead XRay when Intel Processor Trace already existed?.. (Not sure about the exact timeline; perhaps the two
matured at about the same time?) How can it be that a decade after Intel Processor Trace was made, AMD still hasn’t caught up,
and other platforms also lack equivalent features?</p>
<p>Seriously, it’s as if you had hardware floating point units for a decade and only a select few were using them, or were
even aware of them, with a few more stuck on software emulation, and most just <a href="https://yosefk.com/blog/10x-more-selective.html">using scaled integers</a>, as in representing 1.05 with 105. How do we
explain this?</p>
<ul>
<li><strong>Programmers aren’t demanding tracing profilers</strong>; they use sampling profilers, because it’s the tradition, it
helps with the average case (and unlike with the big O notation, nobody drills it into programmers to look for the worst case
when profiling, or how to do it), and it requires <em>zero </em>work, unlike a tracing profiler where triggering tracing
requires <em>slightly more than zero work.</em></li>
<li><strong>The hardware solution is fairly complicated.</strong> Many teams will not do it given weak demand; a lightweight
feature with weak or uncertain demand might get greenlit, but a full-blown control flow trace like Intel’s isn’t a lightweight
feature.</li>
<li><strong>Machine instruction-level control flow tracing is useless for interpreters and hard-to-use for JITters,</strong> and
most people with control of their environment are server people running a lot of Java, JavaScript, Python, PHP etc. For an
interpreter all you’d see is the interpreter’s loop spinning, with no clue what code it’s interpreting. For a JITter, <em>every
JIT runtime</em> would need to dump metadata telling a tool like magic-trace what source code each machine code snippet was
generated from – you get this from compiler-generated symbol tables for statically compiled native code.</li>
</ul>
<p>I would be pleasantly surprised if this writeup would cause programmers’ demand for tracing profilers to outright explode,
but I’m not counting on it. However, I have a suggestion for tracing support in the CPU hardware which is <strong>simple enough
to risk implementing despite weak demand – and simple enough for interpreters and JITters to use.</strong> So you can
realistically hope to put this thing into your CPU, have interpreters and JITters use it, and then programmers will love your
hardware for its tracing features.</p>
<p>Here’s how it could work:</p>
<ul>
<li><strong>You add an instruction, say TRACE</strong>, which traces a 64b register value from a general-purpose register (or
the instruction pointer as a special case). It also traces a few bits of static metadata from its encoding (so we can tag events
as “function call,” “return,” “context switch” etc.)</li>
<li><strong>TRACE sends the data above via a special port to a trace writer module.</strong> TRACE is <em>not </em>another store
instruction; it doesn’t go thru caches. Instead, data is sent to a module where it could be timestamped, compressed, and written
from the module’s SRAM to a cyclic buffer in DRAM.</li>
</ul>
<p><img alt="image5.png" height="336" src="https://yosefk.com/img/funtrace/image5.png" title="A trace writer with a DMA bypassing the CPU cache system" width="556" style="max-width: 100%;height: auto;"></p>
<p>Thus we still rely on software to issue tracing instructions, but it’s just one instruction per event, we get timestamping,
compression and cyclic buffer management for free, and the only cost is another instruction and the DRAM bandwidth spent on
writing to the cyclic buffer (and we can write at a low priority - we don’t mind buffering these writes as long as we haven’t
run out of SRAM in the module; we would also lose very little bandwidth to precharging/activating DRAM rows, since our writes
are the opposite of random access.) We can also turn off the writing, and then the overhead shrinks to just fetching &amp;
decoding a NOP - “like XRay with tracing disabled.”</p>
<p>Two refinements of this idea in two opposite directions:</p>
<ul>
<li><strong>If you’re “a perfectionist CPU maker”</strong>, you can further lower the overhead to near-zero for statically
generated code. For example, x86 already has the ENDBR64 instruction, which every recently compiled function starts with, in
pursuit of the venerable goal of “<a href="https://en.wikipedia.org/wiki/Control-flow_integrity">control flow integrity</a>”<a class="footnote-ref" role="doc-noteref" href="#fn19" id="fnref19"><sup>19</sup></a>. This instruction could be an implicit TRACE
PC, and we could have an ENDBR64_UNTRACED for excluding functions from tracing to save bandwidth. The same thing could be done
with RET.</li>
<li><strong>If you’re “a pragmatic chip maker”</strong> and the CPU IP vendor won’t add a TRACE instruction, you can instead
have software store to a memory-mapped device your chip makes available at some address. You will spend more instructions to
send the data, and you will interfere with the load-store unit, but you will still save the overhead of timestamping and cyclic
buffer management, you won’t pollute caches with trace data, you will save space &amp; bandwidth with compression, and you will
get a single buffer for all the trace data instead of many thread-local buffers which probably take more memory than you need
(for this, you will want to encode the thread ID or the CPU ID when storing trace data, either in the data itself or the address
bits - eg each CPU can store to a different memory-mapped address, and the OS can trace context switches so you know which
thread is currently running on each CPU.) A sketch of this memory-mapped variant follows the list.</li>
</ul>
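<p>Here’s a minimal sketch of the second, memory-mapped option – the address, the event encoding and the helper names are all made up for illustration:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;cstdint&gt;

// hypothetical MMIO address where the chip exposes the trace writer; the OS
// would map this region as uncacheable or write-combining, so trace data
// never pollutes the caches
static volatile uint64_t* const TRACE_PORT = (volatile uint64_t*)0xF0000000;

enum TraceTag : uint64_t { TAG_CALL = 1, TAG_RETURN = 2, TAG_SWITCH = 3 };

// one store per event; the trace writer module timestamps, compresses and
// DMAs the data into a cyclic DRAM buffer
inline void trace_event(const void* code_ptr, TraceTag tag) {
    // assuming functions are 16-byte aligned, the pointer's low bits
    // are free to carry the event tag
    *TRACE_PORT = (uint64_t)code_ptr | tag;
}</pre>
<p>Per-CPU port addresses (or encoding the CPU ID in the stored value) would let the decoder pull the single shared buffer apart into per-thread timelines.</p>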
<p>This scheme is very easy on the hardware; a “tracing device” with a RAM and a DMA is invisibly small in today’s hardware
designs, and this doesn’t interfere with the CPU logic and doesn’t risk bugs, either those breaking the CPU or those breaking
the trace.</p>
<p>As an added bonus, <strong>an interpreter can use TRACE and pass it some data which isn’t a machine instruction address but
rather the ID of a function in the interpreted language</strong>. And a JITter can emit TRACE instructions into its code. (I
feel that this would be bigger news for interpreters than JITters, but I think most JITters would benefit from it relative to
Intel Processor Trace-like hardware-assisted tracing – would be happy to hear the thoughts of people understanding
JITters.)</p>
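<p>To make the interpreter case concrete, here’s a sketch – the <code>TRACE</code> mnemonic, its operands and the tag values are pure hypothesis, since no CPU I know of has this instruction:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;cstdint&gt;

// tag values baked into the hypothetical TRACE instruction's encoding;
// the tag is an immediate, hence the template parameter below
constexpr int TAG_CALL = 0, TAG_RETURN = 1;

template&lt;int Tag&gt;
inline void trace(uint64_t value) {
    asm volatile("trace %0, %1" :: "r"(value), "i"(Tag));
}

// an interpreter traces the IDs of *interpreted* functions - data that
// machine instruction-level tracing can never recover on its own
void call_interpreted(uint64_t func_id) {
    trace&lt;TAG_CALL&gt;(func_id);
    // ...run the function's bytecode...
    trace&lt;TAG_RETURN&gt;(func_id);
}</pre>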
<p>I think it would be great to start seeing TRACE instructions in CPUs!</p>
<h2 id="conclusion-and-future-work">Conclusion and future work</h2>
<p>We (it’s always “we” in papers, isn’t it?) have presented a comprehensive solution for C++ function tracing, ready for
production use on x86/Linux and easy to port to many other platforms. We have also used the opportunity to discuss how to use a
function tracer in your workflow, how to implement your own function tracer for native code, and which existing tools can help
with the heavy lifting. Finally, we’ve seen how hardware could help make tracing more efficient and usable for both statically
and dynamically compiled languages, in a relatively cheap &amp; simple way.</p>
<p>Here are some things “we” could add to funtrace (more likely <em>I </em>than <em>we –</em> though I’d be happy to work with
you on this!):</p>
<ul>
<li><strong>Porting to other native languages.</strong> I’d expect the trace file format and the decoder to need no changes in
most cases; the runtime you might well want to rewrite (eg a Rust runtime is easier to distribute with cargo than a C++ runtime,
right?.. Though the C++ one would be functionally adequate in this case?..) For compile-time support, you could use existing
LLVM features where relevant, or you could write an LLVM pass - or write code changing LLVM IR files or even compiler-generated
assembly (the <code>-funtrace</code> compile time flags do this, and I've gone way further with that… in this context. See the
hardcore followup!)</li>
<li><strong>Support for performance counters</strong>. We could log the output of RDPMC instead of, or in addition to RDTSC.
This might be useful (you could learn things like which functions missed the cache a lot), though RDPMC is a PITA, as we’ll see
in that followup of ours.</li>
<li><strong>Support for custom events</strong>. You might want to log things like “the resolution of the images we were
processing was 1920x1080”. A “<a href="https://yosefk.com/blog/delayed-printf-for-real-time-logging.html">delayed printf</a>”
approach could be a great fit for this, seeing how we need the executable files to decode the code pointers anyway; so we could
easily extract format strings from the executables given the logged pointers while we’re at it, and do the formatting at
decoding time (see the sketch after this list).</li>
<li><strong>Support for goroutines / async / other forms of “non-OS threads.”</strong> Note that the relatively easy part is to
decode each such “green thread” into its own Perfetto JSON thread. The harder part is to figure out how not to drown in all
those threads when viewing them; these runtimes brag about millions of threads – that’s a lot of vertical scrolling. Ofc you can
limit tracing to a subset of these; you could also show a lane per CPU instead of a lane per thread, but then the call stacks
changing from one thread to another get very noisy (ask me how I know) and you might want to adapt the viewer to deal with this
somehow.</li>
<li><strong>Support for existing tracing frameworks</strong> – for example, integrating with the <a href="https://github.com/wolfpld/tracy">tracy</a> profiler. Note that any tracing system using the same timestamps as funtrace
can theoretically coexist with it without any need to share buffers or data formats; what you’d want is to then view all this
traced data in a single viewer.</li>
<li><strong>Clever trace filtering</strong> a-la XRay – eg detect at runtime that a function was called 1000 times and took less
than a microsecond every time, and modify its code to no longer call the entry/return hooks. This is risky (what if this
function locks something and will wait for a long time in some future call?), but only somewhat (we’ll still trace its caller
and get some idea of where we waited), and it removes the overhead of RDTSC, which is big on x86.</li>
</ul>
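<p>For the custom events item above, here’s a minimal sketch of the “delayed printf” idea – the struct layout and names are mine, not funtrace’s. The hot path stores a format string <em>pointer</em> and raw argument values; the decoder, which needs the executable anyway, maps the pointer back to the string and formats offline:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;cstdint&gt;

struct LogEvent {
    const char* fmt;   // points into the executable's .rodata
    uint64_t args[2];  // raw values, formatted at decoding time
};

// hypothetical per-thread cursor into a trace buffer (wraparound elided)
extern thread_local LogEvent* g_log_cursor;

inline void delayed_printf(const char* fmt, uint64_t a0, uint64_t a1) {
    // no formatting on the hot path - just a few stores
    *g_log_cursor++ = LogEvent{fmt, {a0, a1}};
}

// usage: delayed_printf("resolution was %dx%d", width, height);</pre>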
<p>That’s it – I hope you liked it! And if you’re really into this kind of stuff (evidently, I’m <em>really </em>into this
stuff!), give <a href="https://github.com/yosefk/funtrace">funtrace</a> a try, and stay tuned for the Hardcore Followup!</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>
<h2 id="see-also">See also</h2>
<p><a href="https://danluu.com/perf-tracing/">Sampling vs tracing</a> was the greatest inspiration for this work. It discusses
tracing in a distributed computing environment, and mentions some techniques different from what we’ve seen above – such as
having your function entry/exit handlers write to a “current code pointer” global variable, read in a busy loop by a CPU core
dedicated for tracing (YMMV, but this could lower the impact of tracing on latency at the cost of “burning” a CPU core, thus
trading some of the machine’s throughput for the latency gain.)</p>
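<p>A minimal sketch of that last technique, under my own assumptions (a single traced thread; appending to the trace buffer elided):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;atomic&gt;
#include &lt;cstdint&gt;
#include &lt;x86intrin.h&gt;  // __rdtsc

// the traced thread publishes its current function with one relaxed store...
std::atomic&lt;uint64_t&gt; g_current_func{0};

inline void on_func_entry(void* fn) {
    g_current_func.store((uint64_t)fn, std::memory_order_relaxed);
}

// ...and a dedicated core polls for changes and timestamps them
void tracer_core_loop() {
    uint64_t last = 0;
    for (;;) {
        uint64_t cur = g_current_func.load(std::memory_order_relaxed);
        if (cur != last) {
            uint64_t ts = __rdtsc();
            // append (ts, cur) to a trace buffer here
            last = cur;
        }
    }
}</pre>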

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>Note that, for every new trace.json, you want to run <code>vizviwer trace.json</code>, <em>even if you fully
expect to have to then open trace.json again from the Web UI, </em>to work around the “RPC” thing<em>. </em>That’s because
vizviewer reads the source code from the JSON at the server side. So if you just load the JSON from the web UI in a tab
previously opened by vizviewer, you won’t get the source code – you’ll get “No source code found” messages.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>Traditionally, flamegraphs are displayed as stalagmites, growing upwards rather than downwards, but, like,
whatever.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>I list the issues I ran into <em>to encourage </em>readers to use viztracer. If I just wrote how great it was,
and then they tried it and ran into issues I didn’t mention, it would <em>discourage</em> them<em>. </em>In fact, I start
twitching when I look for something and only find posts showing the happy path with nothing going wrong, and telling how awesome
the thing is – it is a strong predictor of running into issues with no help in sight. But if I see the rare writeup listing a
few rakes the author stepped on – and their exact whereabouts to keep me from stepping on them myself – <em>that </em>makes me
very optimistic and eager to give the thing a try!<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
<li id="fn4"><p>Strictly speaking, it’s enough to instrument function entry points; the callback can then change the return
address to jump into an epilogue which traces the return, and then jumps to the original return address. For funtrace on
x86/Linux, “this is neither good nor necessary,” to quote Capablanca, but it makes sense in some situations.<a class="footnote-back" role="doc-backlink" href="#fnref4">↩︎</a></p></li>
<li id="fn5"><p>Zero overhead abstractions strike again! The billion-inline-wrappers style famously interferes with debugging,
and profiling is a special case of that. But who cares? That’s overhead for developers, not machines. We’re here to serve
machines! I mean, look how much we did for machines in quality and quantity in the last 80 years, compare that to the modest
improvements they created for us, and weep.<a class="footnote-back" role="doc-backlink" href="#fnref5">↩︎</a></p></li>
<li id="fn6"><p>Pausing can “spoil” some snapshots, but very rarely. Specifically, if each of 2 threads notices an unusual
event, and they take snapshots concurrently, the 1st thread will get a fine snapshot, but the 2nd might see a “hole” in its
snapshot, because the 1st one paused tracing while snapshotting. But 2 <em>overlapping</em>, <em>unusual</em> events are, well,
<em>very</em> unusual, and the 1st is still fully captured.<a class="footnote-back" role="doc-backlink" href="#fnref6">↩︎</a></p></li>
<li id="fn7"><p>Actually, it's <a href="https://github.com/yosefk/funtrace/blob/master/funtrace_flags.h">worse</a> or it would
have been called <em>flag</em> rather than <em>flags</em>, but that stuff definitely belongs in the hardcore followup.<a class="footnote-back" role="doc-backlink" href="#fnref7">↩︎</a></p></li>
<li id="fn8"><p>The races in our tracing code can probably be fixed without having a memory barrier upon every event (like XRay
does, for example.) I <em>think </em>we only need a barrier when we notice that tracing was paused, and we want whoever paused
it to read up-to-date data. I felt that we could leave the races as is, and spare ourselves a bunch of extra complexity, on a rather naive
theory of the memory system. Specifically, I believe that since we have one writer for the trace data (the thing written without
locking – wraparound_mask is <em>read </em>without locking, but is written with a lock held), the worst that can happen is that
a few recent writes will get stuck in the write buffer of the core running the traced thread. However, the bulk of the data will
be in dirty cache lines, and the reader will get up-to-date data from these cache lines, because that’s how cache coherence
works. I’m very willing to hear why things are worse than I think, especially with the funky binary search that the reader is
running on the trace buffers – which will be discussed in The Hardcore Followup, but until then, feedback on the code is most
welcome, too. I’m OK at this stuff but far from great, and much better at avoiding it than doing clever things with it.<a class="footnote-back" role="doc-backlink" href="#fnref8">↩︎</a></p></li>
<li id="fn9"><p>Actually, __cyg_profile_func_enter gives you the function pointer for the caller, while __builtin_return_address
would give you an address of some instruction inside the caller. But both are fine for symbol table lookup – symbol table maps a
function name to an address range containing both pointers. In fact, I don’t understand why __cyg_profile_func_enter bothers to
give you the function pointer – __builtin_return_address would give you an equally useful address, and would save instructions
at the instrumented functions.<a class="footnote-back" role="doc-backlink" href="#fnref9">↩︎</a></p></li>
<li id="fn10"><p>It should be a truth universally acknowledged that objects should have names – very useful for debugging.
Awesome that threads have a whopping 15 characters for the name – can we also get this for locks, for example?.. This is a pet
peeve of mine; I don’t understand why naming objects isn’t more common – the overhead is most often dwarfed by the utility. The
overhead would be smaller still if this was something people cared about – eg you could variously assign 64b IDs mapped to
names, instead of copying strings, if there was a standard way to manage these IDs.<a class="footnote-back" role="doc-backlink" href="#fnref10">↩︎</a></p></li>
<li id="fn11"><p>There’s often a feedback loop where there are some basic reasons for a language not being that great for some
area, and then it becomes a feedback cycle of, “nobody uses the language for this, so language developers don’t care to further
hurt this use case, so they do, and so people have even less reason to use the language for that use case.” I’m not sure this is
the case with Rust &amp; numerics; I haven’t thought about it deeply. It seems that numerics isn’t a focus area for the
language, but there’s also a lot of serious “numerics-adjacent work” happening in Rust. So I guess we’ll see where this goes.<a class="footnote-back" role="doc-backlink" href="#fnref11">↩︎</a></p></li>
<li id="fn12"><p>In fact, to take our example, Rust’s Ratio&lt;BigInt&gt; is pretty slow – enough for me to have noticed &amp;
tried ratios with 64b numerators and denominators, which didn’t work for what I was doing, so back to BigInt I was. I’m pretty
sure that MPFR would have been faster, but it’d also be a more painful dependency to deal with. So, Rust managing prebuilt
packages a-la pip would have been great!<a class="footnote-back" role="doc-backlink" href="#fnref12">↩︎</a></p></li>
<li id="fn13"><p>A nice side effect of our thread reading from the kernel ftrace buffer into our own userspace buffer is that
you can then get the events from a core dump. You could of course save ftrace together with the core using some funky
/proc/sys/kernel/core_pattern instead – but who’s ever gonna do it, and then propagate the trace together with the core through
whatever scary path it goes through until reaching the developer?<a class="footnote-back" role="doc-backlink" href="#fnref13">↩︎</a></p></li>
<li id="fn14"><p>Actually, if we run the program <em>with funtrace instrumentation</em>, then a sampling profiler <em>could
</em>tell us where our __fentry__ callback or whatever gets called from, since a good sampling profiler samples <em>call
stacks</em>, and so it could show us the callstacks with __fentry__ sorted by call count. But this suffers from artifacts due to
low sampling rate, and then it’s just an unpleasant, heavyweight and flaky workflow IME – I tried running
<code>perf record -g</code> and then running <code>hotspot</code> and going to the <em>Bottom Up</em> or <em>Caller/Callee</em>
tabs, and it sorta works, but I can’t recommend it. The nice thing about funcount is that it does this one job it’s named after,
it provides a correct and straightforward answer to “the call count question,” and the reports are small &amp; very easy to
understand, combine and manipulate in whichever way you want.<a class="footnote-back" role="doc-backlink" href="#fnref14">↩︎</a></p></li>
<li id="fn15"><p>Actually our approach misses function calls in constructors – eg if you interpose dlopen, libc’s dlopen will
map executable segments, call the constructors, and only then return to you, giving you your first chance to run dl_iterate_phdr
with the new segments mapped. In most programs, however, <em>for the purposes of lowering tracing overhead</em>, missing
constructor calls has got to be a non-problem.<a class="footnote-back" role="doc-backlink" href="#fnref15">↩︎</a></p></li>
<li id="fn16"><p>Decode the reports before combining them - function pointers are <em>not </em>the same across runs thanks to
ASLR and what-not.<a class="footnote-back" role="doc-backlink" href="#fnref16">↩︎</a></p></li>
<li id="fn17"><p>Actually, the so-called “basic logging format” (too slow for serious use, I’d say) can be said to be filtering
functions by func_duration_threshold_us; it keeps a callstack at runtime, and so it knows the time a function call took, so it
can compare to the threshold. “FDR (flight data recorder) logging” also uses this threshold, but I don’t think it’s as
straightforward as “filtering by function duration” and I didn’t dig enough to tell you exactly what it does.<a class="footnote-back" role="doc-backlink" href="#fnref17">↩︎</a></p></li>
<li id="fn18"><p>It could be that Google’s internal version of XRay doesn’t support shared libraries, either, because Google
links everything statically on the server – not sure they do, but if they do, I am here to say “you see, just like I told you –
shared libraries are no good!” Anyway, my point is only that XRay’s effort to assign function IDs is a “big company thing” for
various reasons I mentioned, and that it creates difficulties with shared library support that within a big company would be
trivial, but evidently aren’t trivial outside it.<a class="footnote-back" role="doc-backlink" href="#fnref18">↩︎</a></p></li>
<li id="fn19"><p>We’ve all heard something about “every ending is a new beginning,” and since we all know that A being B is a
transitive relationship, we could all extrapolate from this that every new beginning is an ending. But I think few of us were
emotionally prepared by this logic to see every compiler-generated function <em>beginning </em>with an instruction called
<em>END</em>_WHATEVER. I’m just saying, at some point, you have got to start wondering, “am I maybe doing this whole thing
wrong, and should I do some bigger changes than the crazy incremental ones I’m finding myself doing?” And if it requires a
Worldwide Software Czar to shake things up in the right direction, I just want you all to know that I could be available, if the
price is right.<a class="footnote-back" role="doc-backlink" href="#fnref19">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/profiling-in-production-with-function-call-traces#comments</comments>
      <pubDate>Wed, 05 Feb 2025 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/profiling-in-production-with-function-call-traces.feed</wfw:commentRss>
    </item>
    <item>
      <title>0+0 &gt; 0: C++ thread-local storage performance</title>
      <link>https://yosefk.com/blog/cxx-thread-local-storage-performance.html</link>
      <description><![CDATA[<p>We'll discuss how to make sure that your access to TLS (thread-local storage) is fast. If you’re interested strictly in TLS
performance guidelines and don't care about the details, <a href="#summary-of-performance-guidelines">skip right to the end</a>
— but be aware that you’ll be missing out on assembly listings of profound emotional depth, which can shake even a cynical,
battle-hardened programmer. If you don’t want to miss out on that — and who would?! — read on, and you shall learn the
computer-scientific insight behind the intriguing inequality 0+0 &gt; 0.</p>
<p>I’ve recently <a href="https://yosefk.com/blog/profiling-in-production-with-function-call-traces.html">published</a> a new
C++ profiler, <a href="https://github.com/yosefk/funtrace">funtrace</a>, which traces function calls &amp; returns as well as
thread state changes, showing an execution timeline like this (the screenshot is from <a href="https://krita.org/">Krita</a>, a
“real-world,” complicated drawing program):</p>
<p><img alt="image8.png" height="776" src="https://yosefk.com/img/funtrace/image8.png" title="a trace of Krita made by funtrace &amp; displayed by vizviewer" width="576" style="max-width: 100%;height: auto;"></p>
<p>One thing a software-based tracing profiler needs is a per-thread buffer for traced data. Actually it would waste less memory
for all threads to share the same buffer, and this is how things “should” work in a system with some fairly minimal <a href="https://yosefk.com/blog/profiling-in-production-with-function-call-traces.html#hardware-assisted-tracing">hardware support
for tracing, which I suggested in the funtrace writeup</a>, and which would look roughly like this:</p>
<p><img alt="image5.png" height="336" src="https://yosefk.com/img/funtrace/image5.png" title="A trace writer with a DMA bypassing the CPU cache system" width="556" style="max-width: 100%;height: auto;"></p>
<p>But absent such trace data writing hardware, the data must be written using store instructions through the caches<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a>. So many CPUs sharing a trace buffer results in
them constantly yanking lines from each other’s caches in order to append to the buffer, with a spectacular slowdown. And then
you'd need to synchronize updates to the current write position — still more slowdown. A shared buffer can be fine for <a href="https://yosefk.com/blog/delayed-printf-for-real-time-logging.html">user-initiated printing</a>, but it’s too slow for
tracing every call and return.</p>
<p>So per-thread buffers it is — bringing us to C++’s <code>thread_local</code> keyword, which gives each thread its own copy of
a variable in the global scope — perfect for our trace buffers, it would seem. But it turns out that <strong>we need to be
careful with exactly how we use <code>thread_local</code> to keep our variable access time from exploding</strong>, as explained
in the rest of this document.</p>
<p>The C toolchain — not the C++ compiler front-end, but assemblers, linkers and such — is generally quite ossified, with <a href="https://stackoverflow.com/questions/76925604/why-is-the-constructor-of-a-global-variable-not-called-in-a-library">decades-old
linker bugs enshrined as a standard</a><a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>. TLS
is an interesting case when this toolchain was actually given quite the facelift to support a new feature — with the result of
simple, convenient syntax potentially hiding fairly high overhead (contrary to the more typical case of inconvenient syntax, no
new work in the toolchain, and resource use being fairly explicit.)</p>
<p>At first glance, TLS looks wonderfully efficient, with a whole machine register dedicated to making access to these exotic
variables fast, and a whole scheme set up in the linker to use this register. Let’s take this code accessing a
<code>thread_local</code> object named <code>tls_obj</code>:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>int get_first() {
  return tls_obj.first_member;
}</code></pre>
<p>This compiles to the following assembly code:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">  <b><span style="color: #6aa84f">movl  %fs:<i>tls_obj</i>@tpoff, %eax</span></b>
</pre>
<p>This loads data from the address of <code>tls_obj</code> into the <code>%eax</code> register where the return value should
go. The address of tls_obj is computed by adding the value of the register <code>%fs</code> and the constant offset
<code>tls_obj@tpoff</code>. Here, <code>%fs</code> is the TLS base address register on x86; other machines similarly reserve a
register for this. <code>tls_obj@tpoff</code> is an offset from the base address of the TLS area allocated per thread, and it’s
assigned by the linker such that room is reserved within the TLS area for every <code>thread_local</code> object in the linked
binary. Is this awesome or what?!</p>
<h2 id="constructors">Constructors</h2>
<p>If instead we access a <code>thread_local</code> object with a constructor — let's call it <code>tls_with_ctor</code> — we
get assembly code like this (and this is with <code>-O3</code> – you really don’t want to see the unoptimized version of
this):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">  <span style="color: #cc0000">cmpb  $0, %fs:<b><i>__tls_guard</i></b>@tpoff
  je    .slow_path</span>
  <b><span style="color: #6aa84f">movl  %fs:<i>tls_with_ctor</i>@tpoff, %eax</span></b>
  ret
<b>.slow_path:</b>
  // inlined call to __tls_init, which constructs
  // <b>all</b> the TLS variables in this translation unit…
  pushq %rbx
  movq  %fs:0, %rbx
  movb  $1, %fs:<b><i>__tls_guard</i></b>@tpoff
  leaq  <b><i>tls_with_ctor</i></b>@tpoff(%rbx), %rdi
  call  <b><i>Class::Class()</i></b>
  leaq  <b><i>tls_with_ctor2</i></b>@tpoff(%rbx), %rdi
  call  <b><i>Class2::Class2()</i></b>
  // …followed by our function’s code
  <b><span style="color: #6aa84f">movl    %fs:<i>tls_with_ctor</i>@tpoff, %eax</span></b>
  popq  %rbx
  ret
</pre>
<p>Our simple access to a register plus offset has evolved to first check a thread-local “guard variable”, and if it’s not yet
set to 1, it now calls the constructors for all of the thread-local objects in the translation unit. (<code>__tls_guard</code>
is an implicitly generated <code>static</code>, per-translation-unit boolean.)</p>
<p>While funtrace’s call/return hooks, which get their trace buffer pointer from TLS, are called all the time, access to
<code>thread_local</code>s should be rarer in “normal” code — so I’m not sure it’s fair to brand this <code>__tls_guard</code>
approach as having “unacceptable overhead.” Of course, the inlining only happens if your thread_local is defined in the same
translation unit where you access it; <strong>accessing an <code>extern thread_local</code> with a constructor involves a
function call</strong>, with the function testing the guard variable of the translation unit where the thread_local is defined.
But with inlining, the fast path is quite fast on a good processor (I come from an embedded background where you usually have
<em>cheap </em>CPUs rather than <em>good</em>, so an extra load and a branch depending on the loaded value shock me more than
they should; a superscalar out-of-order branch-predicting speculatively-executing CPU will handle this just fine.)</p>
<p>What I don’t understand is why. Like, <em>why.</em> Generating this code must have taken a bunch of compiler work; it didn’t
“just happen for free.” Furthermore, the <code>varname@tpoff</code> thing must have involved some linker work; it’s not like
keeping the linker unchanged was a constraint. Why not arrange for the <code>__tls_init</code> function of every translation
unit (the one that got inlined into the slow path above) to be called before a thread’s entry point is called? Because it would
require a little bit of libc or libpthread work?..</p>
<p>I mean, this was done for <em>global constructors</em>. You don’t check whether you called the global constructors of a
translation unit before accessing a global with a constructor (and sure, <em>that </em>would have been even slower than the TLS
init code checking <code>__tls_guard</code>, because it would need to have been a <em>thread-safe</em> guard variable access;
though even <em>this </em>was implemented for calling the constructors of <em>static variables declared inside functions,
</em>see also <code>-fno-threadsafe-statics</code>.) It’s not really harder to do this for TLS constructors than for global
constructors, except that we need <code>pthread_create</code> to call this code, which, why not?..</p>
<p>Is this a deliberate performance tradeoff, benefitting code with lots of thread_locals and starting threads constantly, with
each thread using few of the thread_locals, and some thread_locals having slow constructors<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>? But such code isn't great to begin with?.. Anyway, I don’t really get why the
ugly thing above is generated from <code>thread_local</code>s’ constructors. The way I handled it in my case is,
<strong>funtrace sidesteps the TLS constructor problem by <a href="https://docs.oracle.com/cd/E19683-01/816-1386/chapter3-26/index.html">interposing</a>
<code>pthread_create</code></strong>, and initializing its <code>thread_local</code>s in its pthread_create wrapper.</p>
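<p>A minimal sketch of this kind of interposition – the details differ from funtrace’s actual runtime, and <code>init_thread_trace_buffer</code> is a made-up stand-in for “construct our thread_locals up front” (link with <code>-ldl</code> on older glibc):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;dlfcn.h&gt;
#include &lt;pthread.h&gt;

void init_thread_trace_buffer();  // hypothetical: touches/constructs our TLS

namespace {
struct StartArgs { void* (*fn)(void*); void* arg; };

void* start_wrapper(void* p) {
    StartArgs a = *static_cast&lt;StartArgs*&gt;(p);
    delete static_cast&lt;StartArgs*&gt;(p);
    init_thread_trace_buffer();  // pay the TLS construction cost once, here
    return a.fn(a.arg);
}
}

// our definition shadows libpthread's; dlsym(RTLD_NEXT) finds the real one
extern "C" int pthread_create(pthread_t* t, const pthread_attr_t* attr,
                              void* (*fn)(void*), void* arg) {
    using create_t = int (*)(pthread_t*, const pthread_attr_t*,
                             void* (*)(void*), void*);
    static create_t real =
        reinterpret_cast&lt;create_t&gt;(dlsym(RTLD_NEXT, "pthread_create"));
    return real(t, attr, start_wrapper, new StartArgs{fn, arg});
}</pre>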
<h2 id="shared-libraries">Shared libraries</h2>
<p>And now let’s see what happens when we put our thread-local variable, the one without a constructor, into a shared library
(compiling with <code>-fPIC</code> and linking with <code>-shared</code>):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><b><span style="color: #e69138">push %rbp
mov  %rsp,%rbp
<span style="color: #cc0000">data16 lea <i>tls_obj</i>(%rip),%rdi
data16 data16 callq <i>__tls_get_addr</i>@plt</span>
mov  (%rax),%eax
pop  %rbp</span></b>
retq 
</pre>
<p>All this colorful code is generated instead of what used to be a single <b><span style="color: #6aa84f">movl
%fs:<i>tls_obj</i><span class="citation" data-cites="tpoff">@tpoff</span>, %eax</span></b>. More code was generated than before,
forcing us to <strong><span style="color: #e69138">spill and restore registers</span></strong>. But the worst part is that our
TLS access now requires <strong><span style="color: #cc0000">a function call</span></strong> — we need
<code>__tls_get_addr</code> to find the TLS area of the currently running shared library.</p>
<p><strong>Why don’t we just use the same code as before — the <code>movl</code> instruction — with the dynamic linker
substituting the right value for <code>tls_obj@tpoff</code>? </strong>This is an honest question; I don’t understand why this
isn’t a job for the dynamic linker like any other kind of dynamic relocation. Is this to save work in libc again?.. Like, for
<code>tls_obj@tpoff</code> to be an offset <em>from the same base address</em> no matter which shared library
<code>tls_obj</code> was linked into, you would need the TLS areas of all the shared libraries to be allocated contiguously:</p>
<ul>
<li>main executable at offset 0</li>
<li>the first loaded .so at the offset <code>sizeof(main TLS)</code></li>
<li>the next one at the offset <code>sizeof(main TLS) + sizeof(first.so TLS)</code></li>
<li>…</li>
</ul>
<p>But for this, libc would need to do this contiguous allocation, and of course you can’t move the TLS data once you’ve
allocated it, since someone might be keeping pointers into it<a class="footnote-ref" role="doc-noteref" href="#fn4" id="fnref4"><sup>4</sup></a>. So you need to carve out a chunk of the memory space — no biggie with a 64-bit or even
“just” a 48-bit address space, right?.. — and you need to put the executable’s TLS at some magic address with <code>mmap</code>
and then you keep <code>mmap</code>ing the TLS areas of newly loaded .so’s one next to another.</p>
<p>But this now becomes a part of the ABI (“these addresses are reserved for TLS”), and I guess nobody wanted to soil the ABI
this way “just” to make TLS fast for shared libraries?.. In any case, looks like TLS areas are allocated non-contiguously and so
you need a different base address every time and you can’t use an offset… but <em>still</em>, couldn’t the dynamic linker bake
this address into the code, instead of calling a function to get it?.. Feels to me that this was doable but deemed not worth the
trouble, more than it being impossible, though maybe I’m missing something.</p>
<p>A curious bit is those <code>data16</code><a class="footnote-ref" role="doc-noteref" href="#fn5" id="fnref5"><sup>5</sup></a>
in the code:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><b><span style="color: #cc0000">data16 lea <i>tls_obj</i>(%rip),%rdi
data16 data16 callq <i>__tls_get_addr</i>@plt</span></b>
</pre>
<p>What is this for?.. Actually, the <code>data16</code> prefix does nothing in this context except pad the instructions to
take more space, making things slightly slower still, though it’s peanuts compared to the function call. Why does the compiler
put this padding in? Because if you compile with <code>-fPIC</code> but then link the code into an executable, without the
<code>-shared</code>, the function call gets replaced with faster code:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><b><span style="color: #e69138">push %rbp
mov  %rsp,%rbp
<span style="color: #6aa84f">mov  %fs:0x0,%rax
lea  -0x4(%rax),%rax</span>
mov  (%rax),%eax
pop  %rbp</span></b>
retq
</pre>
<p>The generated code is still scarred with the <strong><span style="color: #e69138">register spilling</span></strong> and
what-not, and we don’t get our simple <b><span style="color: #6aa84f">movl %fs:<i>tls_obj</i><span class="citation" data-cites="tpoff">@tpoff</span>, %eax</span></b> back, but still, we have to be very thankful for the compiler &amp; linker
work here, done for the benefit of the <em>many </em>people whose build system compiles everything with <code>-fPIC</code>,
including code that is then linked without <code>-shared</code> (because who knows if the .o will be linked into a shared
library or an executable? It’s not like the build system knows <em>the entire graph of build dependencies — </em>wait, it
actually <em>does — </em>but still, it obviously shouldn’t be <em>bothered </em>to find out if -fPIC is needed — this type of
mundane concern would just distract it from its noble goal of Scheduling a Graph of Completely Generic Tasks. Seriously, no C++
build system out there stoops to this - not one, and goodness knows there are A LOT of them.)</p>
<p>In any case, the <code>data16</code>s are generated by the compiler to make the red instructions take enough space for the
green instructions to fit into, in case we link without <code>-shared</code> after all.</p>
<h2 id="constructors-in-shared-libraries">Constructors in shared libraries</h2>
<p>And now let’s see what happens if we put (1) a thread_local object with (2) a constructor into a shared library, for a fine
example of how 2 of C++’s famously “zero-overhead” features compose. We’ve all heard how “the whole is greater than the sum of
its parts,” occasionally expressed by the peppier HRy people as “1 + 1 = 3.” I suggest a similarly inspiring expression “0 + 0
&gt; 0”, which quite often applies to “zero overhead”:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">sub  $0x8,%rsp
<b><span style="color: #cc0000">callq</span> <i>TLS init function for tls_with_ctor</i></b>@plt
data16 lea <b><i>tls_with_ctor</i></b>(%rip),%rdi
data16 data16 <b><span style="color: #cc0000">callq</span> <i>__tls_get_addr</i></b>@plt
mov  (%rax),%eax
add  $0x8,%rsp
retq
</pre>
<p>So, now we have 2 function calls — one for calling the constructor in case it wasn’t called yet, and another to get the
address of the <code>thread_local</code> variable from its ID. Makes sense, except that I recall that under <code>-O3</code>,
this “TLS init function” business was inlined, and now it no longer is? Say, I wonder what code got generated for this “TLS init
function”?..</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">  subq  $8, %rsp
  leaq  <b><i>__tls_guard</i></b>@tlsld(%rip), %rdi
  <b><span style="color: #cc0000">call</span>  <i>__tls_get_addr</i></b>@PLT
  cmpb  $0, <b><i>__tls_guard@dtpoff</i></b>(%rax)
  je    .slow_path
  addq  $8, %rsp
  ret
<b>.slow_path:</b>
  movb  $1, <b><i>__tls_guard</i></b>@dtpoff(%rax)
  data16  leaq  <b><i>tls_with_ctor</i></b>@tlsgd(%rip), %rdi
  data16 data16 <b><span style="color: #cc0000">call</span>  <i>__tls_get_addr</i></b>@PLT
  movq  %rax, %rdi
  call  <b><i>Class::Class</i></b>()@PLT
  data16  leaq  <b><i>tls_with_ctor2</i></b>@tlsgd(%rip), %rdi
  data16 data16 <b><span style="color: #cc0000">call</span>  <i>__tls_get_addr</i></b>@PLT
  addq  $8, %rsp
  movq  %rax, %rdi
  jmp   <b><i>Class2::Class2</i></b>()@PLT
</pre>
<p>Oh boy. So not only doesn’t this thing get inlined, but it calls <code>__tls_get_addr</code> <em>again, <strong>even on the
fast path</strong>. </em>And then you have the slow path, which calls __tls_get_addr <em>again and again</em>…not that we care,
it runs just once, but it kinda shows that this __tls_get_addr business doesn’t optimize very well. I mean, it’s not just the
slow path of the init code — here’s what a function accessing 2 thread_local objects with constructors looks like:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">pushq   %rbx
<b><span style="color: #cc0000">call</span></b>    TLS init function for tls_with_ctor@PLT
data16 leaq tls_with_ctor@tlsgd(%rip), %rdi
data16 data16 <b><span style="color: #cc0000">call</span></b> __tls_get_addr@PLT
movl    (%rax), %ebx
<b><span style="color: #cc0000">call</span></b>    TLS init function for tls_with_ctor2@PLT
data16 leaq tls_with_ctor2@tlsgd(%rip), %rdi
data16 data16 <b><span style="color: #cc0000">call</span></b> __tls_get_addr@PLT
addl    (%rax), %ebx
movl    %ebx, %eax
popq    %rbx
</pre>
<p>Like… man. This calls __tls_get_addr <span style="color: #cc0000"><strong><em>4 times</em></strong></span>, twice per
accessed thread_local (once directly, and once from the “TLS init functions”).</p>
<p>Why do we call <em>2 </em>“TLS init function for whatever” when <em>both </em>do the same thing — check the guard variable
and run the constructors of <em>all </em>objects in the translation unit (and in this case the two objects are defined in the
same translation unit, the same one where the function is defined)? Is it because in the general case, the two objects come from
2 different translation units?</p>
<p>And what about the __tls_get_addr calls to get the addresses of the objects themselves? Why call <em>that </em>twice? Why not
call something just once that gives you the base address of the module’s TLS, and then add offsets to it? Is it because in the
general case, the two objects could come from 2 different shared libraries?</p>
<p>And BTW, with clang 20 (the latest version ATM), it’s seemingly enough for <em>one </em>thread-local object in a translation
unit to have a constructor for the compiler to generate a “TLS init function” for <em>every </em>thread-local object, and call
it when the object is accessed… so, seriously, <strong>don’t </strong>use <code>thread_local</code> with constructors, even if
you don’t care about the overhead, as long as there’s even one thread_local object where you <em>do </em>care about access
time.</p>
<p>On the other hand, clang has an optimization where access to several thread_locals <em>with hidden visibility<a class="footnote-ref" role="doc-noteref" href="#fn6" id="fnref6"><sup>6</sup></a></em> calls
<code>__tls_get_addr</code> only once (instead of twice the number of accessed thread_locals), and then we add a
per-variable offset to access each thread_local. It turns out that a big part of the answer to the question "why call
__tls_get_addr per variable?" is that <strong>with the default visibility, variables could be interposed at runtime,</strong>
and so the compiler can't assume that they're defined by the same shared library, even if it's compiling a .cpp file that
defines all of the accessed variables.</p>
<p>Of course, the other part of the answer is that it takes work to implement this optimization; <a href="https://lobste.rs/s/b5dnjh/0_0_0_c_thread_local_storage_performance#c_dkzbw3">according to the comment that I learned this
from</a>, this optimization is not available on all platforms in clang, and I'm not seeing it in g++ on x86. A smaller problem
is that as you can see in the code below, with the current code generation, there's lots of register spilling and restoring
going on which I can't really explain (even if I look at the slow path which I elided in the assembly listing below, since it's
hairy enough as it is):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">pushq   %rbp
pushq   %r15
pushq   %r14
pushq   %rbx
pushq   %rax
leaq    <b><i>__tls_guard</i></b>@TLSLD(%rip), %rdi
<b><span style="color: #cc0000">callq</span></b>   <b><i>__tls_get_addr</i></b>@PLT
movq    %rax, %rbx
cmpb    $0, <b><i>__tls_guard</i></b>@DTPOFF(%rax)
je      .slow_path
movl    <b><i>tls_with_ctor</i></b>@DTPOFF(%rbx), %ebp
addl    <b><i>tls_with_ctor2</i></b>@DTPOFF(%rbx), %ebp
movl    %ebp, %eax
addq    $8, %rsp
popq    %rbx
popq    %r14
popq    %r15
popq    %rbp
retq
</pre>
<p>Note that if you compile with <code>-fPIC</code> but then link without <code>-shared</code>, even the single call to
<code>__tls_get_addr</code> gets replaced with the much faster, if quite colorful instruction
<code>data16 data16 data16 mov %fs:0x0,%rax</code>. All in all, an impressive effort by clang to optimize TLS access from shared
objects; yet on balance, I think it's fair to recommend putting data into a smaller number of thread_locals and avoiding
constructors, rather than counting on visibility to improve the code generation.</p>
<p>So what does that famous <code>__tls_get_addr</code> function do? Here’s the fast path:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">mov  %fs:DTV_OFFSET, %RDX_LP
mov  GL_TLS_GENERATION_OFFSET+_rtld_local(%rip), %RAX_LP
cmp  %RAX_LP, (%rdx)
jne  .slow_path
mov  TI_MODULE_OFFSET(%rdi), %RAX_LP
salq $4, %rax
movq (%rdx,%rax), %rax
cmp  $-1, %RAX_LP
je   .slow_path
add  TI_OFFSET_OFFSET(%rdi), %RAX_LP
ret
</pre>
<p>These 11 instructions on the fast path enable lazy allocation of a shared library’s TLS — every thread only allocates a TLS
for a given shared library upon its first attempt to access one of its thread-local variables. (Each “variable ID” passed to
<code>__tls_get_addr</code> is a pointer to a struct with module ID and an offset within that module’s TLS;
<code>__tls_get_addr</code> checks whether TLS was allocated for the module, and if it wasn’t, calls
<code>__tls_get_addr_slow</code> in order to allocate it.)</p>
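<p>For reference, the “variable ID” looks roughly like this on x86-64 glibc – field names per the ELF TLS ABI, but treat this as a sketch rather than a copy of the headers:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">// what the compiler passes in %rdi at every __tls_get_addr call site
typedef struct {
    unsigned long ti_module;  // which module's (executable's or .so's) TLS block
    unsigned long ti_offset;  // the variable's offset within that block
} tls_index;

extern "C" void* __tls_get_addr(tls_index* ti);  // returns this thread's &amp;variable</pre>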
<p>Is this lazy allocation the answer to why the whole thing is so slow? Do we <em>really</em> want to only call constructors
for thread-local variables upon first use, and ideally to even allocate memory for them upon first use? Note that we allocate
memory <em>for all the thread_locals <strong>in a shared library</strong> </em>upon the first use of even one; but we call
constructors <em>for all the thread_locals <strong>in a translation unit </strong></em>upon the first use of even one; which is
a bit random for the C++ standard to prescribe, not to mention that it doesn’t really concern itself with dynamic loading? So
it’s more, the standard gave implementations room to do this, rather than prescribed them to do this?.. I don’t know about you,
but I’d prefer a contiguous allocation for all the TLS areas of all the modules in all the threads, and fast access to the
variables over this lazy allocation and initialization; I wonder if this was a deliberate tradeoff or “just how things ended up
being.”</p>
<h2 id="summary-of-performance-guidelines">Summary of performance guidelines</h2>
<ul>
<li>Access to <code>thread_local</code> objects without constructors linked into an executable is <em>very</em> efficient</li>
<li>Constructors make this slower…</li>
<li>Especially if you access an <code>extern thread_local</code> from another translation unit…</li>
<li>Separately from constructors, compiling with <code>-fPIC</code> also makes TLS access slower…</li>
<li>…and linking code compiled with <code>-fPIC</code> with the <code>-shared</code> flag makes it <em>seriously </em>slower,
worse than either constructors or compiling with <code>-fPIC</code>...</li>
<li>…but constructors together with <code>-fPIC -shared</code> <em>really </em>takes the cake and is the slowest by far!</li>
<li>…and actually, a thread_local variable x having a constructor might slow down access to a thread_local variable y in the
same translation unit</li>
<li>Prefer putting the data into one thread_local object rather than several when you can (true for globals, too, BTW.) It can’t
hurt, and it can probably help a lot, by having fewer calls to <code>__tls_get_addr</code> if your code is linked into a shared
library.</li>
<li>Define your thread_locals as having hidden visibility - it won't always help if they're compiled into a shared library, but
sometimes it'll help a lot, and it can't hurt.</li>
</ul>
<h2 id="future-work">Future work</h2>
<p>It annoys me to no end that the funtrace runtime has to be linked into the executable to avoid the price of
<code>__tls_get_addr</code>. (This also means that funtrace must export its runtime functions from the executable, which
precludes shared libraries using the funtrace runtime API (for taking trace snapshots) from linking with
<code>-Wl,--no-undefined</code>.)</p>
<p>I just want a tiny thread-local struct. It can’t be that I can’t do that efficiently without modifying the executable, so
that for instance a Python extension module can be traced without recompiling the Python executable. Seriously, there’s a limit
to how idiotic things should be able to get.</p>
<p>I’m sure there’s some dirty trick or other, based on knowing the guts of libc and other such, which, while dirty, is going to
work for a long time, and where you can reasonably safely detect if it stopped working and upgrade it for whatever changes the
guts of libc will have undergone. If you have an idea, please share it! If not, I guess I’ll get to it one day; I released
funtrace before getting around to this bit, but generally, working around a large number of stupid things like this is a big
chunk of what I do.</p>
<h2 id="knowing-what-you-shouldnt-know">Knowing what you shouldn’t know</h2>
<p>If I manage to stay out of trouble, it’s rarely because of knowing that much, but more because I’m relatively good at 2 other
things: knowing what I don’t know, and knowing what I shouldn’t know. To look at our example, you could argue that the above
explanations are shallower than they could be — I ask why something was done instead of looking up the history, and I only
briefly touch on what <code>TI_MODULE_OFFSET</code> and <code>TI_OFFSET_OFFSET</code> (yes, TI_OFFSET_OFFSET) are, and I don’t
say a word about GL_TLS_GENERATION_OFFSET, for example, and I <em>could.</em></p>
<p>I claim that the kind of things we saw around __tls_get_addr is an immediate red flag along the lines of, yes I am looking
into low-level stuff, but no, nothing good will come out of knowing this particular bit very well in the context that I’m in
right now; maybe I’ll be forced to learn it sometime, but right now this looks exactly like stuff I should avoid rather than
stuff I should learn.</p>
<p>I don’t know how to generalize the principle to make it explicit and easy to follow. All I can say right now is that the next
section has examples substantiating this feeling; you mainly want to avoid <code>__tls_get_addr</code>, because even people who
know it very well – the people maintaining it and everything related to it – run into problems with it.</p>
<p>I’ve recently been seeing the expression “anti-intellectualism” used by people criticizing arguments along the lines of “this
is too complex for me to understand, so this can’t be good.” While I agree that we want some more concrete argument about why
something isn’t worth understanding than “I don’t get it, and I <em>would </em>get it if it was any good,” I implore not to call
this “anti-intellectualism,” lest we implicitly crown ourselves as “intellectuals” over the fact that we understand what
TI_OFFSET_OFFSET is. It’s ridiculous enough that we’re called “knowledge workers,” when the “knowledge” referred to in this
expression is the knowledge of what TI_OFFSET_OFFSET is.</p>
<h2 id="workarounds-for-shared-libraries">Workarounds for shared libraries</h2>
<p>Like I said, it annoys me to no end that TLS access is slow for variables defined in shared libraries. Readers suggested
quite a few workarounds, "dirty" to varying degrees:</p>
<h3 id="inlining-pthread_getspecific">"Inlining" <code>pthread_getspecific</code></h3>
<p>There's a pthreads API for allocating "thread-specific keys" which is a form of TLS. Calling <code>pthread_getspecific</code>
upon every TLS access isn't any better than calling <code>__tls_get_addr</code>. But <a href="https://x.com/pskocik/status/1891494684863680663">we can "inline" the code of glibc's implementation</a>, and if we can
make sure that our key is the first one allocated, it will take just a couple of assembly instructions (loading a pointer from
<code>%fs</code> with a constant offset, and then loading our data from that pointer):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#define tlsReg_ (__extension__( \
  { char*r; __asm ("mov %%fs:0x10,%0":"=r"(r)); r; }))

inline void *pxTlsGetLt32_m(pthread_key_t Pk){
  assert(Pk&lt;32);
  return *(void**)(tlsReg_+0x310+sizeof(void*[2])*Pk+8);
}
void* getKey0(void) {
  return pxTlsGetLt32_m(0);
}
</pre>
<p><code>getKey0</code> compiles to:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">  mov  %fs:0x10,%rax
  mov  0x318(%rax),%rax
</pre>
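<p>Note that the trick depends on our key being key 0 (or at least on knowing its index). One way to try to grab key 0 – assuming glibc hands out keys in allocation order – is to allocate it as early as possible:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;pthread.h&gt;

static pthread_key_t g_key;

// constructor priorities 0-100 are reserved for the implementation;
// 101 is the earliest a user constructor can ask for
__attribute__((constructor(101)))
static void alloc_first_key(void) {
    pthread_key_create(&amp;g_key, nullptr);
    // worth asserting g_key == 0 in debug builds: if some earlier-loaded
    // library grabbed key 0 first, the hardcoded offsets above are wrong
}</pre>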
<h3 id="compiling-with--ftls-modelinitial-exec">Compiling with <code>-ftls-model=initial-exec</code></h3>
<p>It <a href="https://news.ycombinator.com/item?id=43078859">turns out</a> that there's something called the "<a href="https://maskray.me/blog/2021-02-14-all-about-thread-local-storage#initial-exec-tls-model-executable-preemptible">initial
exec TLS model</a>", where a TLS access costs you 2 instructions and no function calls:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">movq <b><i>tls_obj</i></b>@GOTTPOFF(%rip), %rax
movl %fs:(%rax), %eax
</pre>
<p>You can also make just some variables use this model with <code>__attribute((tls_model("initial-exec")))</code>, instead of
compiling everything with <code>-ftls-model=initial-exec</code>, which might be very useful since the space for such variables
is a scarce resource as we'll see shortly.</p>
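<p>For example, to opt in a single hot variable:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">__attribute__((tls_model("initial-exec"))) thread_local int tls_fast;</pre>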
<p>This method is great if you can <code>LD_PRELOAD</code> your library, or link the executable against it so that it becomes
<code>DT_NEEDED</code>. Otherwise, this may or may not work at runtime:</p>
<blockquote>
<p>the shared object generally needs to be an immediately loaded shared object. The linker sets the DF_STATIC_TLS flag to
annotate a shared object with initial-exec TLS relocations.</p>
<p>glibc ld.so reserves some space in static TLS blocks and allows dlopen on such a shared object if its TLS size is small.
There could be an obscure reason for using such an attribute: general dynamic and local dynamic TLS models are not
async-signal-safe in glibc. However, other libc implementations may not reserve additional TLS space for dlopen'ed initial-exec
shared objects, e.g. musl will error.</p>
</blockquote>
<h3 id="faster-__tls_get_addr-with--mtls-dialectgnu2">Faster <code>__tls_get_addr</code> with
<code>-mtls-dialect=gnu2</code></h3>
<p>It turns out there's a faster <code>__tls_get_addr</code> which you can opt into using. This is still too much code for my
taste; but if you're interested in the horrible details, you can read <a href="https://news.ycombinator.com/item?id=43079061">the comment where I found out about this</a>.</p>
<h2 id="see-also">See also</h2>
<p>Various compiler and runtime issues make this slow stuff even slower, and it takes a while to get them fixed. If you stay
within the guidelines above, you should avoid such problems; if you don’t, you might have more problems than described above —
including both performance and correctness issues:</p>
<ul>
<li><a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81501">mulitple calls to __tls_get_addr() with -fPIC</a> (reported in
2017, status: NEW as of 2025). Some highlights from 2022:
<ul>
<li>“We recently upgraded our toolchain from GCC9 to GCC11, and <strong>we're seeing <code>__tls_get_addr</code> take up to 10%
of total runtime</strong> under some workloads, where it was 1-2% before. It seems that some changes to the optimization passes
in 10 or 11 have significantly increased the impact of this problem.”</li>
<li>“I've shown a workaround I used, which might be useful until GCC handle <code>__tls_get_addr()</code> as returning a
constant addresses that doesn't need to be looked up multiple times in a function.“</li>
<li>“Thanks for the patch! I wonder if it would handle coroutines correctly. <strong>Clang has this open bug <a href="https://github.com/llvm/llvm-project/issues/47179">"Compiler incorrectly caches thread_local address across
suspend-points"</a> that is related to this optimization</strong>.”</li>
</ul></li>
<li><a href="https://sourceware.org/bugzilla/show_bug.cgi?id=19924">TLS performance degradation after dlopen</a> (reported in
2016; fixed in libc 2.39 in 2023, backported to older libcs up to 2.34 in 2025):
<ul>
<li>“we have noticed a performance degradation of TLS access in shared libraries. <b>If another shared library that uses TLS is
loaded via dlopen, __tls_get_addr takes significant more time</b>. Once that shared library accesses it's TLS, the performance
normalizes. We do have a use-case where this is actually really significant.”</li>
<li>“elf: Fix slow tls access after dlopen [BZ #19924] In short: __tls_get_addr checks the global generation counter and if the
current dtv is older then _dl_update_slotinfo updates dtv up to the generation of the accessed module. So if the global
generation is newer than generation of the module then __tls_get_addr keeps hitting the slow dtv update path. The dtv update
path includes a number of checks to see if any update is needed and this already causes measurable tls access slow down after
dlopen. It may be possible to detect up-to-date dtv faster. But if there are many modules loaded (&gt; TLS_SLOTINFO_SURPLUS)
then this requires at least walking the slotinfo list. This patch tries to update the dtv to the global generation instead, so
after a dlopen the tls access slow path is only hit once. The modules with larger generation than the accessed one were not
necessarily synchronized before, so additional synchronization is needed.”</li>
<li>“the fix for <a href="https://sourceware.org/bugzilla/show_bug.cgi?id=19924">bug 19924</a> was to update DTV on tls access
up to the global gen count so after an independent dlopen the next tls access updates the DTV gen count instead of falling into
a slow code path over and over again. <strong>this introduced some issues</strong>: update happens now even if the accessed tls
is in an early loaded library that use static tls (l_tls_offset is set), so such access is no longer as-safe and may alloc. some
of this was mitigated by an ugly workaround: “elf: Support recursive use of dynamic TLS in interposed malloc.” a possible better
approach is to expose the gen count of the accessed module directly in the tls_get_addr argument: this is possible on 64bit
targets if we compress modid and offset into one GOT entry and use the other for the gen count when processing DTPMOD and DTPREL
relocs. (then the original logic before the 19924 fix would not slow down after a global gencount bump: we can compare the DTV
gen count to the accessed module gen count. btw we do this with TLSDESC today and thus aarch64 was imho not affected by the
malloc interposition issue.) however i feel this is dancing around a bad design to use the generation count to deal with dlclose
and reused modids. so here is a better approach…”</li>
</ul></li>
</ul>
<p>If you’re not quite following some of the above, this sort of makes my point about <code>__tls_get_addr</code> being
undesirable, though I am not sure how to defend this way of making a point in the general case.</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>You could instead write the data straight to DRAM, by putting your trace buffer into memory mapped with the
“uncached” attribute in the processor’s page table. But operating systems don’t make it very easy to allocate such memory, and
for good reason — this can save you cache-related overheads, but writing 64-bit words to DRAM wastes 7/8ths of the bandwidth,
since DRAM writes work in “bursts” of 64 bytes. That’s why CPUs use DRAM bandwidth efficiently when writing back full cache
lines, but not when writing individual words to uncached memory. And of course none of this would help with maintaining a next
write position variable shared between threads.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>The nature of C++ combines with human nature in such a way that people often want to have some Registry of
Plugins, with some libmyplugin.a implementing a MyPlugin derived from Plugin and calling Registry::add(this) in the constructor
of MyPlugin g_plugin. Of course this runs into the following wonderful provision in the standard: “It is implementation-defined
whether initialization happens before main is entered or is delayed until the first non-initialization odr-use of a non-inline
definition in the same translation unit as the one containing the variable definition.” Which of course is a bit of a
chicken-and-egg problem, as nothing from libmyplugin.a ever gets executed unless the plugin is added to the registry — but it’s
not going to be added to the registry “until something from the plugin gets executed” (though in practice it will happen
earlier, before main — what actually happens is that the linker notices that a symbol from the translation unit gets used and
then <em>doesn’t </em>drop all the code from the translation unit defining the global constructor; the language in the standard
just gives implementations the maximal possible leeway to make your life miserable.) I refer to the standard accepting this
nonsense (where your code works if it’s kept in a .cpp that’s compiled into a .o and then linked, but not if the .o is put into
a lib.a and then the lib.a is linked) and filing it under legitimate “implementation-defined” behavior as “enshrining
decades-old bugs as a standard,” though you could say that it’s not a bug as much as the linker not bothering to implement an
obviously correct behavior that was not a part of its original specification. I had other examples from my time of transitioning
from gold to lld and mold, but thankfully they have largely faded from my memory.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>Apparently we can say that it's trade-off made <em>by the implementation</em> since the standard <a href="https://stackoverflow.com/questions/24253584/when-is-a-thread-local-global-variable-initialized">permits</a>
implementations to choose any time for calling the constructor of a thread_local before the first use, same as it does for
"normal" global variables, where implementations normally choose to call the constructor before main unconditionally, without
having a guard variable for lazy initialization: "3.7.2/2 [basic.stc.thread]: A variable with thread storage duration shall be
initialized before its first odr-use (3.2) and, if constructed, shall be destroyed on thread exit."<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
<li id="fn4"><p>You also can’t really deallocate a TLS area when a shared library is unloaded if you allocate them all
contiguously — you could theoretically reuse that area for the TLS of another shared library, but only if it fits into that
address range; you could unmap the physical pages and have the OS use them for other allocation requests, but you might be stuck
with the virtual address space range of the TLS area, with no uses for it in sight. Is this very unlikely case — lots of
dlopen/dlclose and no dlopening of the same previously dlclosed library where it would fit perfectly into the “TLS address range
gap” — one reason this kind of design would be rejected?.. Like, I’d still totally do it, and I did a bunch of work of this
sort, but maybe it offends the sensibilities of most people doing these things?..<a class="footnote-back" role="doc-backlink" href="#fnref4">↩︎</a></p></li>
<li id="fn5"><p>Actually two <code>data16</code> prefixes are so pointless that while gdb <em>disassembles</em> this instruction
sequence just fine, the GNU assembler presumably refuses to <em>assemble</em> it, so the compiler generates the delightfully
self-explanatory code <code>.value 0x6666</code> (0x66 being the encoding of <code>data16</code>) followed by <code>rex64</code>
and then the <code>call</code>. I don’t know why gdb’s disassembler doesn’t show us the <code>rex64</code>; I’m not by any
measure an x86 guy. I’m just someone who thinks that this stuff has to work somehow and there must be a way to make it do the
thing you want, and I try to learn enough to seem to be able to do it when I need it.<a class="footnote-back" role="doc-backlink" href="#fnref5">↩︎</a></p></li>
<li id="fn6"><p>You set the visibility to "hidden" with <code>-fvisibility-hidden</code>,
<code>__attribute__((visibility ("hidden")))</code>, or <code>[[gnu::visibility("hidden")]]</code><a class="footnote-back" role="doc-backlink" href="#fnref6">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/cxx-thread-local-storage-performance#comments</comments>
      <pubDate>Mon, 17 Feb 2025 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/cxx-thread-local-storage-performance.feed</wfw:commentRss>
    </item>
    <item>
      <title>LLMs aren’t world models</title>
      <link>https://yosefk.com/blog/llms-arent-world-models.html</link>
      <description><![CDATA[<p>I believe that language models aren’t world models. It’s a weak claim — I’m not saying they’re useless, or that we’re done
milking them. It’s also a fuzzy-sounding claim — with its trillion weights, who can prove that there’s something an LLM isn't a
model of? But I hope to make my claim clear and persuasive enough with some examples.</p>
<p>A friend who plays better chess than me — and knows more math &amp; CS than me — said that he played some moves against a
newly released LLM, and it must be at least as good as him. I said, no way, I’m going to cRRRush it, in my best Russian accent.
I make a few moves – but unlike him, I don't make good moves<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a>, which would be opening book moves it has seen a million times; I make weak moves, which it
hasn't <a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>. The thing makes decent moves in
response, with cheerful commentary about how we're attacking this and developing that — until about move 10, when it tries to
move a knight which isn't there, and loses in a few more moves. This was a year or two ago; I’ve just tried this again, and it
lost track of the board state by move 9.</p>
<p>When I’m saying that LLMs have no world model, I don’t mean that they haven't seen enough photos of chess knights, or held a
knight in their greasy fingers; I don’t mean the physical world, necessarily. And I obviously don’t mean that a machine can’t
learn a model of chess, when all leading chess engines use machine learning. I only mean that, <strong>having read a trillion
chess games, <em>LLMs</em>, specifically, have not learned that to make legal moves, you need to know where the pieces are on
the board</strong>. Why would they? For predicting the moves or commentary in chess games, which is what they’re optimized for,
this would help very marginally, if at all.</p>
<p>Of course, nobody uses LLMs as chess engines — so whatever they did learn about chess, they learned entirely “by accident”,
without any effort by developers to improve the process for this kind of data. And we could say that the whole argument that
LLMs learn about the world is that they <em>have </em>to understand the world <em>as a side effect of modeling the distribution
of text</em> — which is soundly refuted by them literally failing to learn the first thing about chess. But maybe we could
charitably assume that LLMs fail this badly with chess for silly reasons you could easily fix, but nobody bothered. So let’s
look at something virtual enough to learn a model of without having greasy fingers to touch it with, but also relevant enough
for developers to try to make it work.</p>
<p>So, for my second example, we will consider the so-called “normal blending mode” in image editors like <a href="https://krita.org/">Krita</a> — what happens when you put a layer with some partially transparent pixels on top of another
layer? What’s the mathematical formula for blending 2 layers? An LLM replied roughly like so:</p>
<blockquote>
<p>In Krita Normal blending mode, colors are <strong>not blended using a mathematical formula</strong>. The "Normal" mode
<strong>simply</strong> displays the upper layer's color, <strong>potentially affected</strong> by its transparency,
<strong>without any interaction or calculation</strong> with the base layer's color. <em>(It then said how other blending modes
were different and involved mathematical formulas.)</em></p>
</blockquote>
<p>This answer tells us the LLM doesn't know things such as:</p>
<ul>
<li>Computers work with numbers. A color is represented by a number in a computer.</li>
<li>Therefore, a color cannot be blended by something other than a mathematical formula — nor can it be “affected” without a
“calculation” by transparency, another number.</li>
<li>“Transparency” is when you can see through something.</li>
<li>“Seeing” works by sampling the color at various points, and processing that signal.</li>
<li>Therefore, if you can see something through something, like, say, a base layer through an upper layer, then by definition,
the color you will see is affected not only by the color of the upper layer and its degree of transparency, but also by the
color of the base layer — or you wouldn’t be <em>seeing</em> the base layer, which means that the upper layer is <em>not at all
</em>transparent, because you’re not seeing through it.</li>
</ul>
<p>I mean, it sounds stupid to break it down like that, but I’m not wrong, am I? It really doesn’t know any of these things,
does it.</p>
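<p>For the record, the formula whose existence the LLM denied is one line. A minimal sketch, assuming an opaque base layer and
an alpha value in [0, 1]:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">// a minimal sketch of "normal" blending, per color channel, assuming
// an opaque base layer: the result interpolates between the two colors
float blend_normal(float upper, float base, float alpha) {
    return alpha * upper + (1.0f - alpha) * base;
}
</pre>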
<p>Can you prompt the LLM to explain <a href="https://en.wikipedia.org/wiki/Alpha_compositing">alpha blending</a> properly?
Sure. But that just shows the LLM knows to put the words explaining it after the words asking the question. This capability does
not make the answer above into lesser evidence of the LLM not knowing <em>the things </em>as opposed to <em>the words.</em></p>
<p>And of course people can be like that, too - eg <a href="https://danluu.com/algorithms-interviews/">much better at the big O
notation and complexity analysis in interviews than on the job</a>. But I guarantee you that <a href="https://yosefk.com/blog/the-cardinal-programming-jokes.html#expanding-your-skill-set">if you put a gun to their head or
offer them a million dollar bonus for getting it right</a>, they will do well enough on the job, too. And with 200 billion
thrown at LLM hardware last year, the thing can't complain that it wasn't incentivized to perform<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>.</p>
<p>Of course, these are simple examples. An LLM triumphalist will observe that they often stop reproducing; an LLM denialist
will assume they stopped reproducing through some conspiracy, like a chess engine tool having been given to the LLM, or it
having been drenched with synthetic data similar to your question. (I used to ask LLMs to prove 2+2=4; they'd very pompously
enumerate various notable properties of 2 and 4, and proudly declare that 2+2 must equal 4 based on these properties, and I had
a good laugh. Then LLMs were flogged to become “good at math,” and now they might say something about “Peano axioms,” and some
total garbage about set theory — but they emit enough S(S(2)) and such that it probably counts as a proof, though I am yet to
see the simple “2+2 = 2+(1+1) = (2+1)+1 = 3+1 = 4” which I’d expect from an entity understanding the question.)</p>
<p>For a more complex example, we can take associativity (which, as we’ve seen in 2+2=4, LLMs understand vaguely at best),
combine it with alpha blending and transparency (which apparently they don’t understand at all), and see how well LLMs do. I’ve
had an exchange with an LLM asking whether alpha blending, as implemented in commonly used libraries, is associative, or whether
it isn’t due to precision loss or whatever — and if it’s not associative, how does caching work in drawing programs (where the
program must be precomputing the blending of the layers above and below the currently edited one, to avoid recomputing the
blending of 10 or 100 layers upon every brush stroke.)</p>
<p>Sure enough, it said that alpha blending wasn’t associative — probably because I suggested that it might not be — and that
this is “solved with caching instead of mathematical elegance” — probably because I suggested that caching was involved. And
then I ask, but how can caching work if blending is not associative? If layer 6 is selected, and you blend the cached blending
of {1…5}, the selected layer 6, and the cached blending of {7…10}, you would get different results from blending {1…4}, 5, and
{6…10}, if blending is not in fact associative? And then if you selected layer 5 in the program, you would see a different
picture compared to selecting layer 6 - but in practice you see the same picture?</p>
<p>“You got me,” says the LLM, more or less. So their not knowing what any of the words actually mean very much does extend to
complex examples.</p>
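<p>(For the record, "over" blending on premultiplied pixels <em>is</em> associative, up to floating-point rounding, which is why
the caching scheme works. A minimal numeric sketch, reduced to one color channel for brevity:)</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;cstdio&gt;

// premultiplied-alpha "over": out = top + (1 - top.a) * bottom
struct Px { float c, a; };  // one color channel plus alpha, for brevity

Px over(Px top, Px bot) {
    return { top.c + (1 - top.a) * bot.c, top.a + (1 - top.a) * bot.a };
}

int main() {
    Px a{0.30f, 0.50f}, b{0.20f, 0.40f}, c{0.60f, 0.80f};
    Px l = over(over(a, b), c), r = over(a, over(b, c));
    std::printf("%f %f vs %f %f\n", l.c, l.a, r.c, r.a);  // equal up to rounding
}
</pre>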
<p>You could say that the LLM was a victim of its agreeableness<a class="footnote-ref" role="doc-noteref" href="#fn4" id="fnref4"><sup>4</sup></a>, since it might have been influenced by my contradictory implications that blending might
not be associative, yet caching must be implemented that counts on it being associative. I could say that, well, my whole
question was about which parts of my suspicions are incorrect, and saying they’re all correct is an abject failure — but let’s
assume it could be a character flaw more than an intellectual weakness. So in our last example, we’ll see the LLM having its own
opinion and sticking to it, despite being told repeatedly that it can’t be true.</p>
<p>I ask it about the thread safety of appending to a Python list from multiple threads, and whether I can tell the number of
times append was called with len(myList), and whether it will work once the <a href="https://en.wikipedia.org/wiki/Global_interpreter_lock">GIL</a> is <a href="https://peps.python.org/pep-0703/">removed</a>.
It says that without the GIL, the program could corrupt memory. I say, no way, this is not C, it must be more like Java? And it
goes, no, <em>CPython is a C program</em>, and without the GIL your racy code can crash like C does. Java is different, <em>it
has a memory model</em>, and look at these crash reports from GIL-less Python. And I’m like, but these are bug reports, it’s not
<em>by design</em>, is there evidence that this is by design? — and it goes, it’s too early for the kind of evidence you’re
looking for to exist, no-GIL is too new, but here’s how a C program could crash in such scenarios… and on and on and on.</p>
<p>It does not know that <a href="https://yosefk.com/blog/a-100x-speedup-with-unsafe-python.html">(pure) Python is a memory-safe
language</a>, and that no suggestion making it memory-unsafe would ever be accepted, and I found no way of persuading it to take
this notion into account — or to acknowledge that the evidence it’s citing in support of its claims is more like evidence to the
contrary (if all the crashes upon races you find are bug reports, it points to the requirement being that races don’t lead to
crashes.)</p>
<p>So it can be either kinda agreeable or very stubborn — and it might obviously not know what it just said in both modes.</p>
<p><strong>Can this be quantified?</strong></p>
<p>I don't see how.</p>
<p>I mean, I wish it could be. It's clear that LLMs do learn <em>some</em> things about the world. For instance, even just the
token embeddings contain the representation of the concept of gender learned without any specific effort to teach the model what
gender is, as evidenced by “king - man + woman ~= queen” in the embedding space.</p>
<p>Ideally, you would want to quantify "how much of the world LLMs model." But even if you resolve the difficulty of defining
what this means, you'll run into the ease with which LLMs memorize answers to specific questions, so the vendor can celebrate
the new bar having been cleared.</p>
<p>All I can confidently claim is that they don't learn a world model except by accident, and there's neither a theoretical
reason nor empirical evidence for your being able to count on this accident in any defined and broad set of circumstances.</p>
<p><strong>So-called conclusions<a class="footnote-ref" role="doc-noteref" href="#fn5" id="fnref5"><sup>5</sup></a></strong></p>
<p>A guy who made $100 million from being an early employee of some startup came to give a lecture for that startup, and said “a
fundamentally incorrect approach to a problem can be taken very far in practice with sufficient engineering effort.” (He then
cheered up his listeners, most of whom had $100 million less than him, with the addendum “That's what I think is happening in
this company!”)</p>
<p>It is therefore not one of my conclusions that you can’t take LLMs very, very far just because they demonstrably do not learn
a model of the many worlds described by the words they’re trained on (which, BTW, is exactly as it says on the tin; nobody ever
called them LWMs.) I will, however, predict a few things — something you shouldn’t do if you don’t want to look stupid in the
future, but here goes.</p>
<p><strong>There will be at least one big breakthrough in machine learning around “world models”</strong>. I have no idea what
this breakthrough will look like; I predict that it will happen because some important kinds of thinking cannot be done without
it, and I trust the Church-Turing thesis when it comes to these kinds of thinking, and I think someone will figure this out,
same as people have come up with deep learning, convnets and transformers. And of course you already have “world models”, such
as systems recovering object classes and positions from images — by a breakthrough, I mean a “generic” ability to build models
of “novel worlds” (even if the model isn’t as good as a specially tailored one), much like you throw any text into an LLM and
have it learn “something” without much tuning for this kind of text.</p>
<p>(In fact, I would guess there will be at least 2 more breakthroughs, the other one being around needing far less training
data — again, not because I know how machines could use less training data, but because I know you and I get by with less.
Feeding these algorithms gobs of data is another example of how an approach that must be fundamentally incorrect at least in
some sense, as evidenced by how data-hungry it is, can be taken very far by engineering efforts — as long as something is useful
enough to fund such efforts and isn’t outcompeted by a new idea, it can persist.)</p>
<p><strong>LLMs are not by themselves sufficient as a path to general machine intelligence; in some sense they are a distraction
because of how far you can take them despite the approach being fundamentally incorrect</strong>. This should make “AI risk”
people happy; but “AI risk” is its own hilarity best left to another time.</p>
<p><strong>LLMs will never<a class="footnote-ref" role="doc-noteref" href="#fn6" id="fnref6"><sup>6</sup></a> manage to deal
with large code bases “autonomously”</strong>, because they would need to have a model of the program, and they don’t even learn
to track chess pieces having read everything there is to read about chess.</p>
<p><strong>LLMs will never reliably know what they don’t know, or stop making things up. </strong>You need some sort of a world
model to have notions of knowledge, truth and falsehood. Any mechanism that is supposed to make LLMs “safe”, trustworthy or
other such is a mix of snake oil and honest efforts to somehow steer it away from text that spooks users — which can be done,
since users are spooked by form more than substance. For example, when it says some politically related nonsense, people drag it
for having the wrong politics, and you can “fix” it by making its output less politically charged — without making it less
nonsensical, which there’s no way to reliably achieve.</p>
<p><strong>LLMs will always be able to teach a student complex (standard) curriculum, answer an expert’s question with a useful
(known) insight, and yet fail at basic (novel) questions on the same subject, all at the same time</strong>. This is not
surprising — this is exactly what you would expect from a language model that isn’t a world model. This fuzzy insight is more
cute than useful, however, since it’s hard to know what is and is not novel — in part because you come to the LLM in the first
place with things you don’t already know everything about.</p>
<p>(Some sources, such as <a href="https://mindmatters.ai/2025/01/some-lessons-from-deepseek-compared-with-other-chatbots/">this
wonderful writeup about LLMs not knowing that rotating a tic-tac-toe board doesn’t change the game</a>, take this point to its
logical conclusion: “<em>If you know the answer, you don’t need to ask an LLM; if you don’t know the answer, you can’t trust an
LLM.</em>” But this conveys a true insight into LLMs together with unwarranted pessimism about their utility. In fact, sometimes
you know the answer, but it’s quicker to proofread an LLM’s output than to type it out; sometimes you don’t know the answer, but
know to check if the LLM’s answer is correct, etc. etc.)</p>
<p><strong>LLM-style language processing is definitely a part of how human intelligence works — and how human stupidity works.
</strong>I agree with Dijkstra that “can machines think?” and “can submarines swim?” are poor questions to ask, and I hate it
when people say that neural networks “work like the brain” and such. But I can’t help feeling that LLMs are a mirror into a part
of how people such as myself think — and I don’t like what I’m seeing in that mirror. “Thinking” by guessing what words to say
next based on words we’ve previously heard might actually help find a good idea — and it’s also how know-nothings get through
work meetings, and how people come to think they know stuff they really don’t, and how they internalize the stupidest notions. I
am starting to think that in today’s environment, high cognitive skills are an actual risk factor for stupidity, and that
learning words without learning a model of what they refer to is one big part of the problem.</p>
<p><strong>P.S.</strong> I wish I could say something about how to best use LLMs for programming — something I would like to be
qualified to speak about, and that I am kinda supposed to learn enough to be qualified to speak about. I don’t think I am; I can
only say that I tried Cursor and it failed every time, including at replacing f(obj) and g(obj) with obj.f() and obj.g() (it
would occasionally mix up f and g, and I got tired of reviewing its output), and I went back to simply copying code into and out
of chat windows. I would say that I use LLMs like I use SIMD — sometimes it’s a good fit for leaf functions whose behavior is
relatively easy to specify and test, and it has no business being anywhere else.</p>
<p>I have conflicting theories about why some people do great things with “agentic AI” while I think it’s hopelessly useless for
me; I am waiting for someone to write something crisp and well-researched about this to teach me the truth, or a useful
approximation. I console myself with the idea that I can’t be missing out on too much, given how terrible the output I’m getting
from LLMs often is.</p>
<p><strong>P.P.S. </strong>Here’s a somewhat rough Soviet-era joke about a school for kids with special needs that illustrates
my point. Rachel says that a Russian joke isn’t a joke, but a story of pain. To this I reply that some people like them.</p>
<blockquote>
<p>An inspector comes to a school for kids with developmental issues. He asks a kid riding a wooden horsie his name, and the kid
says “MMMM.” He says, what do you want to be when you grow up? — and the kid says “MMMM.” The inspector turns to the principal
and says, “you’re doing nothing for these kids. I’ll be back in a month — if there’s no improvement, we’ll close the
school.”</p>
<p>He comes back a month later and finds the kid swinging on the wooden horsie, same as last time; if you want to tell this
masterpiece of Soviet humor at parties — the perfect conversation starter — you should be swinging wildly when saying the kid’s
lines:</p>
<p>— What’s your name?<br> — MMMMikey!!<br> — Mikey? Nice to meet you, Mikey. What do you want to be when you grow up,
Mikey?<br> — MMMMastrounaut!! <br> — An astronaut? Good stuff, good stuff! And how old are you? <br> — MMMMikey!! <br></p>
</blockquote>
<p>The moral of the story being that you can learn to predict the next word without learning much about the world — at least up
to a point.</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>It helps that I don’t know the good opening moves; I can’t be bothered to learn any opening theory. The fact
that my poorer chess knowledge makes it easier for me to see how bad the LLM is at chess is an interesting case study. It turns
out that you can get good answers out of LLMs by asking very well-phrased questions sounding like someone else’s well-phrased
questions answered in its training data; whereas if you ask simpler questions which are perfectly valid but not commonly asked,
they will fall apart.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>Funnily or tragically, my system of tripping up the opponent with weak moves it hasn’t memorized a response to
is conceptually similar to grandmaster play of today, where grandmasters memorize chess engine lines, and a “novelty” is a
relatively weak move your opponent has never analyzed with the engine, whereas you did and you remember all the strong moves
after this weak one. Of course my strategy would not work against a grandmaster, because I don’t come prepared with memorized
engine lines, and the grandmaster would find much better moves than I would over the board. Still, this 21st century concept of
“chess novelty” is tangentially related, and funny, or tragic, as the case might be.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>People can also give the wrong answer because they’re drunk. I don’t think the LLM was drunk. My point is that a
person who gave this answer would get zero points for this question on a test, and that the LLM is constantly under test because
it’s a machine serving no purpose other than answering these questions, and I don’t see why it should not get zero points here,
even though people might eg fail to answer some logic puzzle phrased in one way but succeed when it is phrased in another way,
etc. etc. — I don’t see how the cognitive weaknesses of people provide an excuse for the machine in this specific case.<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
<li id="fn4"><p>Actually, in this case, it was agreeable in substance but snarky in tone — it gave me an answer that confirmed
all my different suspicions, contradictory as they were, and at the same time it was saying something like “don’t expect the
world to be pretty or simple, man, the world is messy, man.” Generally I don’t think that LLMs’ “personality,” “style,”
“politics” and other anthropomorphic characteristics are the main thing about them; I think the main thing is what they model
(text) and what they don’t model except by accident (the thing the text is about.)<a class="footnote-back" role="doc-backlink" href="#fnref4">↩︎</a></p></li>
<li id="fn5"><p>It’s hard to call them “conclusions” when they’re fuzzy statements supposedly following from my fuzzy claim. In
fact this is what bugs me about LLMs in general: the thing is fuzzy — you can’t say it does something, because sometimes it
fails to do it; you can’t say it doesn’t do something, because sometimes it succeeds; and you can’t discuss the rates of
real-life success and failure, because who’s keeping score? This is why it’s hard for me to write about LLMs — I don’t like it
when things get this fuzzy, certainly when it comes to long-form writing; I’m reduced by the very nature of the subject to
shitposting about this on Twitter, along the lines of “<a href="https://x.com/YossiKreinin/status/1946153421817397567">Computers
used to provide cheap, reliable automation; then AI came along.</a>”<a class="footnote-back" role="doc-backlink" href="#fnref5">↩︎</a></p></li>
<li id="fn6"><p>When I’m saying that an LLM will never be able to do something, I mean it in the sense of “y = ax + b will never
represent a parabola” rather than in the sense of “the points residing on a curve rather than a straight line can never be
represented by an equation.” <em>Machine learning </em>might do what <em>LLMs </em>can’t do. Of course this could be used for a
No True Scotsman defense — “if it clearly learns a model of the world, it’s not a true LLM.” I’m assuming that when a big
breakthrough is achieved, we’ll know enough about it to be able to settle the question whether it’s still an LLM, as long as
we’re arguing in good faith — same as we don’t know all the details of how commercial LLMs work, but we know about transformers,
tokenization, encoders, decoders, next token prediction, whole-text synthesis, etc., and this is enough for “LLM” to have a
somewhat technical meaning — not as precise as “y=ax+b,” but not nearly as vague as, say, “AI.”<a class="footnote-back" role="doc-backlink" href="#fnref6">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/llms-arent-world-models#comments</comments>
      <pubDate>Sun, 10 Aug 2025 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/llms-arent-world-models.feed</wfw:commentRss>
    </item>
    <item>
      <title>"Enabling" C threads in a Python / Wasm environment</title>
      <link>https://yosefk.com/blog/enabling-c-threads-in-a-python-wasm-environment.html</link>
      <description><![CDATA[<p>Scarred by bare metal programming during my formative years, I consider the speedup from multithreading worth pursuing no
matter how limited a form of it you’ll get, and no matter how hideous the hacks you’ll need to make it work. In today’s quest,
we shall discover the various ways in which threads don’t work in a Python, Wasm, and especially a Python on Wasm environment,
and then do something about it — even when that something could get us shunned from polite society. In the end, <strong>we’ll
arrive at a working setup for limited yet performant multithreading, usable for soft real time programs caring about
sub-millisecond overheads</strong> which we’ll attempt to minimize or eliminate (GitHub links: <a href="https://github.com/yosefk/pyodide-pthread">Python running C threads on Wasm</a>; <a href="https://github.com/yosefk/WarmPool">a simple C++ thread pool for Wasm</a>.)</p>
<p>What this does is shown in the screenshot below<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a> — a browser thread running Python and calling a C function sending work to a pool of C
threads, one of which (“em-pthread”) is shown at the bottom.</p>
<p><img alt="the Python thread and a pthread C worker" height="755" src="https://yosefk.com/img/pyodide/2-threads.png" width="661" style="max-width: 100%;height: auto;"></p>
<p>I knew nothing about WebAssembly, and very little about JS and the rest of it, when I got into this. I am still shocked by
what I’ve discovered. The world’s most successful secure VM for running untrusted code turns out to be a complete hack wrapped
in a shortcut inside a workaround<a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a> — though I
must admit that I kinda like it that way?!.. This is not to say that I really learned the platform — please correct me when I’m
wrong, and please provide me emotional support when I’m right, as I’m muttering things like “so pthread_create fails because
someone put API.tests=Tests into some JS file?!..” or “so the wrong C function gets called because dynamic linking is quietly
working incorrectly?!..”</p>
<h2 id="plain-pre-no-gil-python">Plain pre-no-GIL Python</h2>
<p>At first glance, there’s nothing to talk about here — everybody knows that Python is about to lose the global interpreter
lock, but that work isn’t finished yet, so right now you won’t get a speedup from threading, unless your threads wait on I/O a
lot. That a C library used from Python can spawn its own threads to mitigate this problem is not news, either.</p>
<p>However, a couple of things below are appreciated less widely. I bring these things to your attention lest you discover them
on your own, and repeat my mistake of actually trying to use them. Incidentally, <em>not </em>using these things will turn out
to be particularly beneficial on WebAssembly.</p>
<p>The first thing is that the GIL is released when a function is called via ctypes<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>. So maybe we could write serial C functions and call them from Python worker
threads? Well, it’s fine for work taking tens of milliseconds, but <strong>for very short tasks, the overhead of Python thread
pool management is significant</strong>.</p>
<p>The second thing is that ctypes can pass a Python callback to a C function. This is expected of a C FFI package, but ctypes
generates machine code at runtime to make it work, which is impressive<a class="footnote-ref" role="doc-noteref" href="#fn4" id="fnref4"><sup>4</sup></a>. Anyway, seems like we could export a C parallel_for function to Python, and run Python
callbacks in a C thread pool?</p>
<p>This is fine for letting Python code use C threads, to avoid a second pool of Python threads. But <strong>even the thinnest
Python callback that calls back into C adds overhead too large for short tasks</strong>. You can (sorta) see this if you zoom
into this VTune screenshot, with the first three quarters of the timeline occupied by Python runtime functions — BTW, <em>mostly
functions seemingly unrelated to the GIL</em>:</p>
<p><img alt="a ctypes call from a worker thread" height="333" src="https://yosefk.com/img/pyodide/vtune.png" width="810" style="max-width: 100%;height: auto;"></p>
<p>The upshot is that for speeding up a sequence of relatively short tasks (like real time input handling), <strong>the way to
go is a serial Python flow calling C functions sending tasks to C worker threads, with zero Python code running in the
workers.</strong></p>
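<p>A minimal sketch of that style, with made-up names (a real pool would keep its workers alive between calls instead of
spawning threads per call, which is exactly the overhead we're trying to avoid):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;thread&gt;
#include &lt;vector&gt;

// hypothetical example: a C entry point that Python calls once via ctypes;
// the work is split between C++ threads and no Python runs in the workers
extern "C" void scale_buffer(float* data, int n, float k) {
    int nthreads = (int)std::thread::hardware_concurrency();
    if (nthreads &lt; 1) nthreads = 1;
    std::vector&lt;std::thread&gt; workers;
    for (int t = 0; t &lt; nthreads; t++)
        workers.emplace_back([=] {
            for (int i = t; i &lt; n; i += nthreads) data[i] *= k;
        });
    for (auto&amp; w : workers) w.join();
}
</pre>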
<p>It turns out that on top of its performance advantage, this style has the added benefit of being the easiest to port to
WebAssembly by far. In fact, despite getting <strong>no threading whatsoever under Pyodide, the Python Wasm runtime, out of the
box</strong>, you might think that this style would “just work” — since we’re not using Python threads, right? We’re just
loading a C library which uses threads, and C threads work on Wasm, right?</p>
<p>Wrong. “The easiest way” doesn’t mean <em>easy </em>— you do not <em>just </em>load a C library using threads on Wasm.</p>
<h2 id="life-inside-an-array">Life inside an array</h2>
<p>Wasm started out, in the days of asm.js, as a way to compile C <em>to a subset of JavaScript</em> that a JS engine can
optimize well, by emitting code like (x|0) which persuades JS engines to treat x as an integer, rather than something with a
statically unknowable type. Later, a proper intermediate instruction set representation was designed for the compiler to emit,
so the x|0 craziness went away.</p>
<p>However, it’s still a C program running inside a JavaScript environment. How does this even work? Well, <em>the program’s
data is stored inside a JavaScript array</em><a class="footnote-ref" role="doc-noteref" href="#fn5" id="fnref5"><sup>5</sup></a>. When malloc runs out of space in this array, it calls the JavaScript function Memory.grow
to enlarge the array, the way it would call sbrk on Linux. (In general, the JS runtime is the OS of the Wasm program — its way
to interact with the world outside the JS array is to call an “imported” function implemented in JavaScript<a class="footnote-ref" role="doc-noteref" href="#fn6" id="fnref6"><sup>6</sup></a>.) C pointers are compiled to indexes into this array. To
access a C array from JS, you actually use JS array views with names like HEAPU8 (for uint8 data), indexing into entries
starting from the integer value of the C array base pointer.</p>
<p>So how does Python run in a JS environment? Well, CPython is a C program which was compiled to run inside a JavaScript array,
and ported to use the “OS” (actually, JS) APIs available on the web and/or Node.js. <a href="https://pyodide.org/en/stable/">Pyodide</a> is that port — a very impressive feat.</p>
<p>So, we’ll build our shared library for Wasm (where programs are called “main modules” and dynamic libraries are called “side
modules” — odd names underscoring the fact that you can have <em>several </em>main modules within one OS process, each running
inside its own JS array.) And we’ll load our side module from Pyodide using ctypes.CDLL as we would anywhere else, and it will
happily spawn threads.</p>
<p>Except it won’t load; it will complain that you’re trying to run it inside the wrong kind of JS array. You see, there’s
ArrayBuffer, an array which <em>you can’t share between threads, </em>and there’s SharedArrayBuffer, which you can. (“I’m afraid
I can’t do that” due to a mix of security considerations and historical baggage is a recurring theme on the web platform.) Since
your side module was built with -pthread, it needs to run inside a SharedArrayBuffer, but Pyodide was built to run inside a
plain ArrayBuffer.</p>
<p>My first thought was, “big deal — I’ll give Pyodide, this compiled blob of Wasm, a SharedArrayBuffer instead of an
ArrayBuffer to run in” — it’s not like it “feels” what it’s running inside, right? It’s the same load/store instructions in the
end?</p>
<p>Well, yes, but. Long story short, you must <strong>build Pyodide from source with -pthread</strong> to load your side module.
There are many reasons for this, such as side modules getting the pthread runtime from the main module, and needing
emscripten-generated JS for this runtime to work (did I mention that emscripten produces a giant .js output file alongside the
.wasm file — about 60K LOC for Pyodide?)</p>
<p>Luckily, Pyodide provides a Docker container for the build, and $EXTRA_C/LDFLAGS env vars that you can set to -pthread; it’s
rather nice. <strong>To then build your side module, use emscripten from the emsdk/ directory</strong> produced by the Pyodide
build process, or things will silently fail<a class="footnote-ref" role="doc-noteref" href="#fn7" id="fnref7"><sup>7</sup></a>.</p>
<p>You will also need to serve your web page with a couple of HTTP headers (Cross-Origin-Opener-Policy same-origin, and
Cross-Origin-Embedder-Policy require-corp<a class="footnote-ref" role="doc-noteref" href="#fn8" id="fnref8"><sup>8</sup></a>) —
absent which the browser will <em>refuse to give you a SharedArrayBuffer</em> (I’m afraid I can’t do that.)</p>
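<p>Concretely, every response serving the page and its scripts needs these two header lines:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
</pre>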
<p>And you might want to build Pyodide to use mimalloc instead of dlmalloc to avoid a global lock in malloc/free, though I’m not
currently doing it. Mimalloc uses more space, and even with dlmalloc I’m seeing tabs with Pyodide where Python’s tracemalloc
reports tens of megs, but the Chrome tab tooltip reports hundreds of megs<a class="footnote-ref" role="doc-noteref" href="#fn9" id="fnref9"><sup>9</sup></a>. A more efficient if less pleasant approach is to <strong>avoid malloc in parallelized
code</strong>.</p>
<p>While we’re at it, it’s notable that <strong>dynamic allocation of large chunks is space-inefficient when “living inside an
array” without mmap.</strong> And funnily enough, <em>this is exactly how things work on the bare metal </em>— if you allocate
and free lots of large &amp; small chunks, you’ll get heap fragmentation to the point of running out of memory, so you learn to
<strong>malloc the larger chunks up front</strong>.</p>
<p>(This doesn’t happen on fancier systems with virtual memory, where malloc mmaps large chunks instead of carving them out of
the one big array grown &amp; shrunk by sbrk. Then free calls munmap, and since a contiguous virtual address range can be backed
by <em>non-contiguous</em> physical pages, you can always reuse the pages for another large chunk later — you don’t suffer from
fragmentation. By forcing malloc to look for large contiguous slices inside a SharedArrayBuffer, the bleeding edge web VM is
sending us to knuckle-walk down the road traveled by embedded computing troglodytes.)</p>
<h2 id="fragile-handle-with-care">Fragile – handle with care</h2>
<p>Up until now, nothing really gross — so you need to rebuild from source and use a specific compiler version, big deal. On the
other hand, up until now, we didn’t actually spawn any threads, either. Our side module will now load just fine — but it will
promptly get stuck once it spawns a thread and tries to join it.</p>
<p>One splendid thing on the web is how “you can’t block the main thread” — “I’m afraid I can’t do that.” Of course you <em>can
</em>block the main thread by doing some slow work in JS, but you’re not allowed to block by <em>waiting</em> (you should
instead yield to the event loop — all that async business.) Therefore, using pthread APIs from the main thread tends to be a bad
idea. You can actually get emscripten to busy-wait in the main thread, instead of the disallowed proper waiting. But the workers
talk to the main thread to access many kinds of global state, so waiting might deadlock.</p>
<p>So the first order of business is to <strong>move your code using Pyodide to a web worker</strong><a class="footnote-ref" role="doc-noteref" href="#fn10" id="fnref10"><sup>10</sup></a>. Having done that, you might observe that you’re stuck at
the same point, because of <a href="https://emscripten.org/docs/porting/pthreads.html#special-considerations">the
following</a>:</p>
<blockquote>
<p>When pthread_create() is called, if we need to create a new Web Worker, then that requires returning to the main event loop.
That is, <strong>you cannot call pthread_create and then keep running code synchronously that expects the worker to start
running</strong> - it will only run after you return to the event loop. This is a violation of POSIX behavior and will break
common code…</p>
</blockquote>
<p>So we yield to the event loop — yet we still remain stuck. That’s because the emscripten-produced JS runtime code making
threads kinda sorta work is:</p>
<ol>
<li><em>broken by </em>Pyodide customizations; but also</li>
<li><em>missing </em>Pyodide customizations that any module would need to support threads; and finally is</li>
<li>broken, period, (nearly) regardless of Pyodide.</li>
</ol>
<p>Here are the specific problems and their workarounds, in increasing order of ugliness.</p>
<p>As a warmup, you’ll find that <em>you run the wrong JS code when spawning your C thread</em>. In a JS environment, the
so-called pthread runs in a JS “web worker” thread, the entry point of which is <em>a JS file</em> it gets in its constructor
and runs from top to bottom. Your C thread entry point is eventually called from that JS file. Emscripten produces a single JS
file, pyodide.asm.js in our case, which wraps the module both in the main thread and in the workers. So to implement
pthread_create, the file pyodide.asm.js needs to initialize a JS worker <em>with itself</em>. However, like any JS file,
pyodide.asm.js doesn’t know its own URL. Its attempt to find out instead produces the name of <em>your Pyodide-using
worker’s</em> JS file, and hilarity ensues.</p>
<p>Now if Pyodide was designed to support threads, its loadPyodide function, <em>which wraps the emscripten-generated wrapper
</em>pyodide.asm.js, would probably accept the file URL as a parameter — and pass it to pyodide.asm.js in the
"mainScriptUrlOrBlob" key that the generated code seems to expect. Since my workarounds are too incomplete to upstream to
Pyodide anyway, I didn’t bother to fix this properly, and passed the name thru some global variable.</p>
<p>A more curious class of issues you’ll discover is code (smack in the middle of pyodide.asm.js’s 60K LOC) doing things like
API.tests=Tests, and failing when running in the pthread worker. Grepping will reveal that the offending code comes from
Pyodide, rather than being generated by emscripten. How does it get into pyodide.asm.js? Not sure, but I think a part of the
thorny path is, the .js files get concatenated by make, then grabbed by a C file using the new #embed preprocessor directive,
and then pulled out from the object file into pyodide.asm.js by emscripten. In any case, you just need to strategically surround
the offending bits with if(!ENVIRONMENT_IS_PTHREAD) — that’s a global variable set in the generated code <em>by checking if the
thread’s name is em-pthread </em>(the check is done differently on the web and under Node, of course.) And with enough of those
ifs, you’re past this hurdle.</p>
<p>This is starting to stink, but you’ve probably smelled things worse than this. This is definitely the first case I’ve ever
seen of <strong>a</strong> <strong>threading implementation fragile enough to require effort from higher-level code (in this
case Pyodide) both to support it and to avoid breaking it</strong>. But zooming out a bit, most of us have seen worse examples
of difficulties making code do things it wasn’t meant to.</p>
<p>No, the real horror hides behind this succinct assertion in the Pyodide docs:</p>
<blockquote>
<p>“The interaction between pthreads and dynamic linking is slow and buggy, more work upstream would be required to support them
together.”<a class="footnote-ref" role="doc-noteref" href="#fn11" id="fnref11"><sup>11</sup></a></p>
</blockquote>
<p>We’re about to discover what this means.</p>
<h2 id="how-threads-do-dynamic-linking">How “threads” do “dynamic linking”</h2>
<p>I sort of dismissed the stark warning in the docs because of <a href="https://github.com/emscripten-core/emscripten/issues/3494">the optimistic comment from 2022</a> from an emscripten
maintainer (on the issue “<a href="https://github.com/emscripten-core/emscripten/issues/3494#top">Add support for simultaneously
using dynamic linking + pthreads”, from 2015</a>):</p>
<blockquote>
<p>I think this is largely complete. We still warn about it being experimental, but it should work % bugs. We can open specific
bugs if/when they are found.</p>
</blockquote>
<p>Another reason I wasn’t worried was my limited ambitions. I don’t dlopen from threads or whatever — I just spawn a pool of
threads from a side module, how can it <em>not </em>work? Of course this question reveals a lack of imagination — the
<em>real</em> reason I wasn’t worried. Like, what do threads even have to do with dynamic linking? <a href="https://yosefk.com/blog/cxx-thread-local-storage-performance.html">Thread-local storage, which does get awfully ugly</a>
in the presence of dynamic linking? Well, I don’t use TLS. What could go wrong?</p>
<p>This thinking is sensible with <em>real </em>threads doing <em>real </em>dynamic linking. The basic part of mapping new code
and data into the process address space is sort of orthogonal to threading — all threads “see” the new code and data simply
because <em>they share the address space</em>.</p>
<p>What about Wasm threads — don’t they share address space, too, that SharedArrayBuffer? Well, that’s for <em>data</em>, not
<em>code</em>. Remember how C data pointers become indexes into the SharedArrayBuffer? Well, function pointers become indexes
into wasmTable, <em>a different array.</em> Here, each index corresponds to <em>a whole function</em>, not an instruction. This
is nice in that you can’t jump into the middle of a function (“I’m afraid I can’t do that”) — a part of Wasm’s control flow
integrity features (of course <em>data integrity </em>inside the [Shared]ArrayBuffer is no better than in C — but the
considerable extent of CFI is still a good thing.)</p>
<p>What you’ll find out when your thread finally runs, and <em>the wrong function is called instead of your entry point,</em> is
that wasmTable is <strong>private to each thread</strong>. Which means that <strong>Wasm “threads” are more like processes with
data in shared memory</strong>. The code can be completely out of sync — as in, thread A has function f at index 47, and thread
B has f at 46.</p>
<p>Of course, the emscripten runtime — the C and JS code working together — makes an effort to keep the tables in sync. When a
pthread starts, it gets a list of dynamic libraries to load in the same order its parent did. And when a pthread runs and
dlopens a library, it “publishes” this fact thru a shared queue, and waits for the other threads to notice and dlopen the
library, too. It’s impressive how hard Emscripten tries to make dlopen and pthreads work despite the offbeat sharing model —
though notably, <a href="https://emscripten.org/docs/compiling/Dynamic-Linking.html#pthreads-support">this is another potential
source of Wasm-specific threading deadlocks</a>:</p>
<blockquote>
<p>In order to make this synchronization as seamless as possible, we hook into the low level primitives of emscripten_futex_wait
and emscripten_yield. … If your applications busy waits, or directly uses the atomic.waitXX instructions (or the clang
__builtin_wasm_memory_atomic_waitXX builtins) you maybe need to switch it to use emscripten_futex_wait or order avoid deadlocks.
<strong>If you don’t use emscripten_futex_wait while you block, you could potentially block other threads that are calling
dlopen and/or dlsym</strong>.</p>
</blockquote>
<p>In particular, you will deadlock not only if you wait for a child to start without yielding to the event loop, but also if
you dlopen before yielding and letting the child start — so don’t. But to go back to our nastier problem — with all this runtime
work, how does a child thread get f at index 46, when the parent has it at 47? Ah, that’s because <strong>code can add entries
to wasmTable —</strong> and you don’t have to be “the dynamic loader,” some JS runtime code, to do it! It can just as well be
some <em>other </em>JS runtime code! I <em>think </em>that there’s something Pyodide-specific doing this, but not sure; you
might pass a dynamically created callback as a function pointer that way, for instance.</p>
<p>In any case, for init-time dynamic linking, the current JS runtime simply has the parent send a list of modules to load, and
the child loads them in that order. This doesn’t account for the stuff put in <em>between modules </em>in the parent. My
workaround is to modify the emscripten-generated JS code to keep the offsets to which modules were loaded in the parent (that’s
simply wasmTable.length at the time when loading happened.) I then send them to the child, which grows the table to recreate the
parent’s gaps between the modules. (I don’t care what’s in these gaps, on the theory that C code not interacting with Python
code won’t need it.)</p>
<p>I did not patch the runtime enough to handle the case where a pthread returns, and the JS worker is then reused for another
pthread; you will probably get exceptions in the dynamic loading code in the new pthread. I also didn’t bother with offsets of
stuff dlopen’d later at runtime. I think that the mismatch between the Unix process model and the
Web-workers-with-a-SharedArrayBuffer model is quite large, and my chances of bridging it are smaller than those of the
actual emscripten team, and they probably gave up on fully bridging it for a reason. In fact, there’s a <a href="https://github.com/WebAssembly/shared-everything-threads/blob/main/proposals/shared-everything-threads/Overview.md">shared-everything-threads
proposal</a> which aims to share wasmTable, among other things, to make dynamic linking not as “slow and buggy” as the Pyodide
docs mentioned in passing, and as we’ve now learned thru bitter experience.</p>
<p>Another reason I didn’t bother fixing issues outside my use case is that even said use case, a pool of pthreads which never
exit or dlopen and only run parallel_fors, doesn’t actually work with off-the-shelf code, as we’ll see shortly. We’re clearly
not enabling threads here — we’re “enabling” “threads”, meaning, we’re trying to use platform features which by design don’t
add up to “real” threads. Best to focus on a narrow use case and make sure it actually works.</p>
<h2 id="a-small-warm-pool">A small, warm pool</h2>
<p>With these workarounds<a class="footnote-ref" role="doc-noteref" href="#fn12" id="fnref12"><sup>12</sup></a>, we can compile
an off-the-shelf framework like TaskFlow, and run a parallel_for example. Once we work around needing to yield to the event loop
before the thread pool becomes usable, it turns out that <strong>TaskFlow occasionally gets stuck in a parallel_for</strong> —
all threads are stuck waiting:</p>
<p><img alt="TaskFlow stuck on Wasm in a side module loaded from Pyodide" height="732" src="https://yosefk.com/img/pyodide/tf-stuck.png" width="604" style="max-width: 100%;height: auto;"></p>
<p>I have no idea why, and whether it’s a “real” Wasm compatibility issue in TaskFlow, or evidence of my attempt at enabling
threads in Pyodide being incomplete (as we’ve seen, on Wasm, threading is easily broken by the code you integrate threading
into.) I decided that my adventure debugging hairy code I know nothing about stops here — my needs are much more modest than
TaskFlow’s feature set, so I am just rolling my own thread pool.</p>
<p>I only do parallel_fors, with nested or concurrently submitted parallel_fors serialized. To avoid things that “are supposed
to work” in C++ but turn out to be broken or non-trivial to enable on Wasm, <strong>I’m only using synchronization primitives
compiling straightforwardly to Wasm builtins</strong>. Those builtins are:</p>
<ul>
<li>The usual <strong>atomic memory operations</strong> (load/store, read-modify-write, compare-and-swap, fence.) These are Wasm
instructions compiled to native CPU instructions or short instruction sequences; it’s hard to imagine how the Wasm “JS OS”
business can break any of these. It’s also hard to imagine how a C++ implementation of std::atomic&lt;T&gt; can mess things up,
so I just use std::atomic rather than emscripten builtins.</li>
<li>The OS-assisted <strong>notify/wait</strong> — these map to <a href="https://en.wikipedia.org/wiki/Futex">futexes</a> (or
their equivalent on not-Linux.) How much hides behind the word “map” on the Wasm VM side I don’t know. I <em>do </em>know
that the C++ standard library std::atomic&lt;T&gt;::notify/wait methods do quite a bit on top of the builtins (check out the
screenful of C++ garbage below) — so I use emscripten_futex_wake/wait instead<a class="footnote-ref" role="doc-noteref" href="#fn13" id="fnref13"><sup>13</sup></a>; you can switch to the std versions by commenting out a #define. (A minimal sketch of these primitives follows the screenshot below.)</li>
</ul>
<p><img alt="C++ callstack when using std::atomic::wait" height="260" src="https://yosefk.com/img/pyodide/cxx-stack.png" width="584" style="max-width: 100%;height: auto;"></p>
<p>In case futex performance is an issue — it certainly is in my tests, but YMMV — I also have <strong>a “warm pool” feature
where the threads busy-wait for work instead of waiting on a futex</strong>. (This warm pool business is one reason to roll my
own pool — off-the-shelf load balancers don’t have this, though some of them busy-wait a little bit before waiting on a mutex,
which is similar.)</p>
<p>Of course, such busy-waiting makes fans spin and drains batteries<a class="footnote-ref" role="doc-noteref" href="#fn14" id="fnref14"><sup>14</sup></a>. One way to mitigate this is to keep the pool warm only between the pointer-down
and pointer-up events; you could go further and only “warm” the pool once you start dropping pointer-move events. A “cool”
pool waits on a futex, which is pretty low-overhead — but the overhead can add up with many short parallel_fors submitted
in quick succession.</p>
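<p>The warm-versus-cool distinction, as a hedged sketch (names are illustrative rather than the actual worker_pool.h interface):</p>
<pre><code>// Spin for a bounded number of iterations ("warm"), then cool down to
// the futex; the cooled path is cheap but adds wakeup latency.
#include &lt;emscripten/threading.h&gt;
#include &lt;atomic&gt;
#include &lt;cstdint&gt;
#include &lt;cmath&gt;

constexpr int kSpinIters = 100000;  // the configurable "warmth" budget

void wait_for_work(std::atomic&lt;std::uint32_t&gt;&amp; work) {
    for (int i = 0; i &lt; kSpinIters; ++i)
        if (work.load(std::memory_order_acquire)) return;  // warm hit
    // Cool down: block instead of burning CPU; wakes when the word
    // changes from 0, or on an explicit futex_wake.
    while (work.load(std::memory_order_acquire) == 0)
        emscripten_futex_wait(&amp;work, 0, INFINITY);
}</code></pre>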
<p>A final wrinkle is that <strong>parallel_fors are serialized until all threads are ready to execute work</strong>. This
avoids an init-time slowdown where someone uses a parallel_for and ends up waiting for hundreds of ms for Wasm threads to start
— or deadlocks because they won’t start without yielding to the event loop<a class="footnote-ref" role="doc-noteref" href="#fn15" id="fnref15"><sup>15</sup></a>. In production, it’s also a better fallback when some environment issue prevents the
threads from running at all — better than getting stuck forever waiting for them.</p>
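<p>The gate itself can be as simple as this sketch (hypothetical names; each worker increments a counter once it is actually running):</p>
<pre><code>// Fall back to serial execution until every worker has checked in;
// this avoids both the init-time stall and the wait-forever failure.
#include &lt;atomic&gt;
#include &lt;cstddef&gt;
#include &lt;functional&gt;

std::atomic&lt;int&gt; threads_running{0};  // each worker increments on entry
constexpr int kNumThreads = 4;

void parallel_for(std::size_t n,
                  const std::function&lt;void(std::size_t)&gt;&amp; body) {
    if (threads_running.load(std::memory_order_acquire) &lt; kNumThreads) {
        // Workers aren't all up yet, or never will be on a broken
        // environment: just run the loop serially.
        for (std::size_t i = 0; i &lt; n; ++i) body(i);
        return;
    }
    // ... otherwise dispatch index ranges to the worker threads ...
}</code></pre>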
<p>This warm, limited, but fast, small, and seemingly robust pool is <a href="https://github.com/yosefk/BlogCodeSamples/blob/main/worker_pool.h">available on GitHub</a>. It’s more “works on my
machine” than “battle-tested” at the moment, but you could “come to trust it through understanding” more easily than most such
code; it passed some TSan testing (which found a bug), it’s portable, and it works in the browser, which apparently isn’t
trivial for a load balancer<a class="footnote-ref" role="doc-noteref" href="#fn16" id="fnref16"><sup>16</sup></a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>A Pyodide fork you can build with the “threading support” described above is available <a href="https://github.com/yosefk/pyodide-pthread">here</a>.</p>
<p>We shall conclude with a brief attempt to defend both our ends and our means against the accusation of unacceptable barbarism
— an accusation which I acknowledge to be quite understandable. I mean, the above should be enough to get why Pyodide doesn’t
ship with threading support — Python dlopens a lot, and Pyodide is supposed to work out of the box for a large number of Python
users, who will just give up on this Python in the browser business if things break in the ways described above. Shouldn’t we avoid
half-baked hacks, and produce solid code built on a solid basis?</p>
<p>My reply to this is that features matter, performance is a feature, a hack is fine if you know what you’re doing, and a lot
of good stuff comes about through a series of hacks:</p>
<ul>
<li>By “features matter,” I mean that between an ugly implementation of something and <em>no implementation at all</em>, ugly is
better. “Let’s do it properly” instead of a quick hack I can get behind. But “let’s not do it at all” is going overboard —
getting things done is kind of the point, and you don’t give up on doing stuff out of a sense of aesthetics.</li>
<li>By “performance is a feature,” I mean that, say, threading makes things several times faster, and this can be the difference
between a feature that works fine, and one that is unusable.</li>
<li>The trouble with a hack is how often it breaks and what happens when it does. A hack is fine if it’s unlikely to break, or
if it tends to break “loudly” during testing, rather than breaking quietly in production, etc. <strong>A lot of the risk can be
mitigated by aiming low</strong> — if your system is simple and doesn’t count on too much, not much will break, and you can
figure it out when it does. <strong>A simple system has an easier time leveraging ugly hacks to get tangible
benefits.</strong></li>
<li>The WebAssembly platform is quite hacky, and this is still visible in 2025; looks like it was worse earlier on. <strong>I
don’t think it’s a coincidence that Wasm is becoming the most successful VM for running untrusted code in a wide variety of
languages</strong>. The way it’s made is not only hacky but <em>makes it easy to further hack on</em>, which means you can
integrate it into some odd environment with very reasonable effort. For an example of a massive hack, look at the “<a href="https://emscripten.org/docs/porting/asyncify.html">asyncify</a>” feature where normal C++ code is made to yield to the
event loop by emscripten. This implementation style is what helped C spread in the first place — and the opposite of what I
think somewhat limits the JVM, which is a serious thing you can’t just bolt a bunch of hacks onto in order to adapt it to a
new environment.</li>
</ul>
<p>I can’t say that I love ugly, hacky code — I’ve met bigger fans of “<a href="https://www-formal.stanford.edu/jmc/history/lisp/node3.html">pornographic programming</a>”, as McCarthy called it. But
ugly and hacky is much better than solid, well-designed, and too rigid to work around the limitations of. For what it’s worth
given my limited experience, I hereby recommend WebAssembly as fairly hackable.</p>
<h2 id="see-also">See also</h2>
<p>“This one is admittedly a hack, and a dirty one,” says <a href="https://ppuzio.medium.com/multithreading-in-the-browser-with-emscripten-and-boost-asio-1e4d8484e155">a Wasm-related
writeup</a> about one of its sed invocations modifying Emscripten-generated JS. I’m linking to it primarily as an attempt to
present my methods of dealing with this stuff as completely normal.</p>
<h2 id="p.s.">P.S.</h2>
<p>“Printf debugging” JS<a class="footnote-ref" role="doc-noteref" href="#fn17" id="fnref17"><sup>17</sup></a> was harder than
usual. Apparently, console.log is aggressively buffered, and if your threads print in some short test on Node, you might never
see the prints. There, fs.appendFileSync helps, and you can install a process.on('uncaughtException') hook that does the same.
In the browser, pausing in the debugger seems to flush console.log prints.</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>On instructions from our blogging ethics department, we hereby inform you that our screenshots are doctored for
legibility.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>Technically, it’s not the VM itself, but the compile-time features and C runtime libraries utilizing it “from
within”, and the JavaScript runtime code wrapping it “from the outside” which are full of hacks. The VM is presumably quite a
bit cleaner. But as Stallman used to argue with respect to his preferred Gah-noo/Linux naming, an OS is not just the kernel,
and arguably Wasm is not just the VM.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>I sort of take it for granted that if you want to call C from Python, you do it via ctypes, which keeps the
Python extension API (and any of the convoluted templated wrappers that make it worse) out of your C code. Maybe it deserves a
separate post, seeing how often people take the other route, despite it being uglier and harder.<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
<li id="fn4"><p>The function pointer called from C needs to find the Python context — eg object.method is a legitimate callback,
and there’s no statically compiled C function whose address you could possibly pass to identify it. Since the C function
prototype, such as int (<em>binary_op)(int,int) or whatever, doesn’t reserve a void</em> for ctypes to pass the context pointer
in, ctypes has to generate machine instructions with the Python context pointer hardcoded into them (as if compiling a function
int object_method(int,int) that knows the address of the object.)<a class="footnote-back" role="doc-backlink" href="#fnref4">↩︎</a></p></li>
<li id="fn5"><p>Today's Wasm has something more clever than "just another array" behind these JavaScript arrays, but it still
looks like just an array from JS.<a class="footnote-back" role="doc-backlink" href="#fnref5">↩︎</a></p></li>
<li id="fn6"><p>Of course you could embed a Wasm runtime into something other than a JS environment — see WASI — but a lot of
real-life code compiling to WebAssembly assumes a JS environment, including Pyodide. In this writeup, we’re ignoring non-JS Wasm
environments.<a class="footnote-back" role="doc-backlink" href="#fnref6">↩︎</a></p></li>
<li id="fn7"><p>ABI compatibility — or should I say ABJSI, for Application Binary &amp; JavaScript Interface? — seems rather
loose in these parts; which might be a good thing?.. — eg Google complained at one point that ABI compatibility in native Linux
C++ builds costs them 10% of performance across their giant fleet of machines, since it prevented many standard library
optimizations over the years. One nice thing about Wasm is that you’re likely to be building a small system fully from source,
since bloat costs more than elsewhere, which I guess makes ABI compatibility less of a problem.<a class="footnote-back" role="doc-backlink" href="#fnref7">↩︎</a></p></li>
<li id="fn8"><p>LLMs are good at configuring the web server to send these headers. I’ll spare you my explanation of what they
prevent your page from doing since there’s lots of better sources online.<a class="footnote-back" role="doc-backlink" href="#fnref8">↩︎</a></p></li>
<li id="fn9"><p>In general, “the” reason to avoid Python in the browser unless a Python interpreter is a user-visible feature is
footprint — RAM at runtime and download size at load time; speed you can take care of using methods like the ones we’re
discussing here.<a class="footnote-back" role="doc-backlink" href="#fnref9">↩︎</a></p></li>
<li id="fn10"><p>A lot of code using Pyodide seems to run in the main thread, and some of the APIs will not work in Web workers.
This is part of the beauty of the web platform: on one hand, you’re not allowed to block the main thread; on the other hand,
lots of APIs are inaccessible to worker threads. So you’re supposed to move heavy work to workers, and then have them talk to
the main thread for accessing the DOM or IndexedDB etc. etc. — so that main thread which we really don’t want to disturb is
guaranteed to be busy. Of course, Python suffers more from this than JS, in that in JS, you “merely” need to split your work
between threads, whereas with Pyodide, you must decide where the Python interpreter lives — you can’t use it from the main
thread as well as a web worker — and then it either won’t be able to access some APIs (you’ll need to proxy to JS code running
in the main thread instead), or it won’t be able to execute long-running tasks [regardless of the issues coming with spawning C
threads which mainline Pyodide won’t do anyway.] My opinion is that you use Python for some numeric stuff which might be
long-running, not as your preferred way to access web APIs which you might as well do in JS, so the sensible thing [to the
extent that Python in the browser is sensible] is to put Pyodide into a web worker. Another argument — for those doing something
quick rather than something slow in Python — is that the main thread drops input events upon the smallest lag, whereas with a
worker, you can collect all the input events in the main thread, forward them to the worker, and have that worker decide what to
do with them whichever way you want.<a class="footnote-back" role="doc-backlink" href="#fnref10">↩︎</a></p></li>
<li id="fn11"><p>One thing you could do is to build the Python interpreter statically with all the necessary packages;
code-size-wise it would be great, and you’d avoid the pthread/dlopen issues. I think it might hurt, however — AFAIK Python is
not designed for this, and your build flow will get way uglier. I find the dynamic linking issues much easier to deal with on
net balance.<a class="footnote-back" role="doc-backlink" href="#fnref11">↩︎</a></p></li>
<li id="fn12"><p>Actually there’s another workaround, where Pyodide adds a dependency on some “sentinel” functions that I
forcefully resolve to meaningless stubs in pyodide.asm.js; this didn’t feel interesting enough to
discuss.<a class="footnote-back" role="doc-backlink" href="#fnref12">↩︎</a></p></li>
<li id="fn13"><p>Of course the <em>actual </em>low-level builtin is __builtin_wasm_memory_atomic_waitXX, but of course using
<em>that </em>one will deadlock if a thread dlopens, as described by the quote in the docs above. I didn’t dig into whether
making the pool code brittle/risky that way pays off in some speedup, or what other problems you’re inviting by using the
lowest-level thing.<a class="footnote-back" role="doc-backlink" href="#fnref13">↩︎</a></p></li>
<li id="fn14"><p>Of course on wasm, busy-waiting also deadlocks if another thread dlopens, see previous footnote. In any case, a
pool should not be kept warm forever, and ours will “cool itself down” if you forget to do it after a configurable number of
spinning iterations, and will wait on a futex using the deadlock-preventing builtin.<a class="footnote-back" role="doc-backlink" href="#fnref14">↩︎</a></p></li>
<li id="fn15"><p>Of course if you create a pool and then dlopen before yielding back to the event loop, the dlopen will deadlock
— see the 2 previous footnotes.<a class="footnote-back" role="doc-backlink" href="#fnref15">↩︎</a></p></li>
<li id="fn16"><p>Of the more “serious”/feature-rich and standard pools, oneTBB failed some of its own tests (I used <a href="https://github.com/uxlfoundation/oneTBB/blob/master/WASM_Support.md">their instructions for building and testing on
Wasm</a>) and I didn’t manage to get it to load under Pyodide — same as with OpenMP, though I didn’t try my hardest; outside
Pyodide people report some success with these. The one pool that compiled, loaded and ran just fine was <a href="https://github.com/alugowski/poolSTL">poolSTL</a> — but in my tests, it was much slower than the pool presented here.<a class="footnote-back" role="doc-backlink" href="#fnref16">↩︎</a></p></li>
<li id="fn17"><p>There’s also TypeScript, which I am yet to learn to debug <em>at build time.</em> Pyodide provides TypeScript
definitions which I personally don’t need, and which I gave up on adapting to the -pthread build flow (SharedArrayBuffer isn’t
ArrayBuffer, and this causes type errors.) Say what you will about C++ type errors — and I’ve said <a href="https://yosefk.com/c++fqa/defective.html#defect-7">quite a bit</a> over the years — you mostly see, if not without
difficulty, where in your code they’re coming from. TypeScript managed to stun me with errors coming out of the standard library
with no reference whatsoever to “my” code (more accurately, Pyodide’s code which isn’t bundled with TypeScript.) I haven’t
seriously used Python’s type annotations, let alone TypeScript given my near-zero JS experience, so I haven’t earned the right
to an opinion. All I can say is that I’m not excited to get into <em>non-load-bearing </em>type systems retrofitted into
languages which ignore the types when running or generating code. My experience is that <a href="https://yosefk.com/blog/the-habitat-of-hardware-bugs.html">if things can suck, they do</a>. And a type system which does
not affect the execution of code — the thing that <em>can’t</em> suck because machines do not recover from or mitigate
compiler/interpreter errors — such a type system inherently can get away with quite a bit (at worst, the user will skip type
checking — there are flags for this — which eg C++ cannot have, since you can’t generate code without the types.)<a class="footnote-back" role="doc-backlink" href="#fnref17">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/enabling-c-threads-in-a-python-wasm-environment#comments</comments>
      <pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/enabling-c-threads-in-a-python-wasm-environment.feed</wfw:commentRss>
    </item>
    <item>
      <title>All means are fair except solving the problem</title>
      <link>https://yosefk.com/blog/all-means-are-fair-except-solving-the-problem.html</link>
<description><![CDATA[<p>An industry veteran in my circles has recently made the rookie mistake<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a> of printing a warning from his code upon misuse. To the surprise of nobody experienced,
critical workflows soon came to a screeching halt.</p>
<p>It turned out that a program using his code prints something like “yay, done” upon exit, and scripts expect it to be the last
thing it says. But now those warnings occasionally got printed from destructors or such, <em>after</em> the “yay, done”, making
the scripts think the program failed.</p>
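<p>A minimal sketch of the mechanism (hypothetical code, not the actual program):</p>
<pre><code>// Why the warning lands after "yay, done": globals are destroyed after
// main() returns, so anything they print comes after main's last line.
#include &lt;cstdio&gt;

struct Resource {
    ~Resource() {
        // Misuse detected during teardown; by now main() has already
        // announced success.
        std::fprintf(stderr, "warning: handle leaked by caller\n");
    }
};

Resource global_resource;  // destroyed after main() returns

int main() {
    std::printf("yay, done\n");  // scripts treat this as the last line
    return 0;                    // global_resource's destructor runs now
}</code></pre>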
<p>One might think that this prompted people to fix the reported misuse, and that thought would be another rookie mistake.
Instead, they were quick to point out that it’s hard to know where these warnings could come from, and we cannot risk all those
critical workflows failing when some case of misuse surfaces in a new context.</p>
<p>I mean, you could grep to get an upper bound, and if you did, not that many places would come up. But one could then say, as
some in fact <em>did</em>, that maybe you haven’t grepped everywhere you should have, and even the cases you did find are owned
by many different teams, so we won’t get the fixes quickly enough, etc.</p>
<p>Several solutions were suggested by helpful high-ranking people:</p>
<ul>
<li>You could add a destructor printing “yay, done” <em>again</em> if a warning was printed during the destruction sequence
(opening an interesting technical debate about the differences between a destructor, __attribute__((destructor)), an atexit
handler and other unspeakable horrors). In fact, our industry veteran would later learn, and I swear that I’m not making this
up, that <em>this was already implemented by someone else</em> who printed something during the program termination sequence,
and had to appease the scripts.</li>
<li>You could suppress the warnings by default, and enable them upon request (opening a debate about the runtime method to
enable them, and the appropriate circumstances to do this).</li>
<li>You could write those warnings to their own file, and…</li>
</ul>
<p>When I was done scrolling his work chat with these helpful suggestions, our unfortunate industry veteran put on a melancholy
smile and summarized the situation: “All means are fair except solving the problem.”<a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>Our protagonist happens to be somewhat of an idealist, and since his condition is too acute to be treated by
experience, he’s bound to make what pragmatists call “rookie mistakes.” But this particular story could happen to most of us.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p><a href="https://www.hyrumslaw.com/">Hyrum’s law</a> arguably diagnoses this particular problem more
specifically from a technical point of view. However, our melancholy veteran’s phrase hints at the broader social condition from
which the technical problem derives its significance. And by “social condition,” I mean that in Hyrum’s law, “all observable
behaviors of your system will be depended on by somebody” is implicitly amended with “...somebody who can’t be bothered to fix
their code, and there’s nothing you can do about it” — and it’s this quiet part which makes it into a “law.”<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/all-means-are-fair-except-solving-the-problem#comments</comments>
      <pubDate>Wed, 06 May 2026 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/all-means-are-fair-except-solving-the-problem.feed</wfw:commentRss>
    </item>
  </channel>
</rss>
