<?xml version='1.0' encoding='UTF-8'?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0">
  <channel>
    <title>Yossi Kreinin</title>
    <link>https://yosefk.com/blog</link>
    <description>Worse is better</description>
    <docs>http://www.rssboard.org/rss-specification</docs>
    <generator>Yossi Kreinin's ugly publishing software</generator>
    <image>
      <url>https://yosefk.com/blog/self.jpg</url>
      <title>Yossi Kreinin</title>
      <link>https://yosefk.com/blog</link>
      <width>144</width>
      <height>144</height>
    </image>
    <language>en</language>
    <lastBuildDate>Wed, 06 May 2026 16:00:01 +0000</lastBuildDate>
    <item>
      <title>A layman's view of the economy</title>
      <link>https://yosefk.com/blog/a-laymans-view-of-the-economy.html</link>
      <description><![CDATA[<p>First of all, I proudly present a 2-minute short that I animated!</p>
<div class="video_16_9" style="position: relative;overflow: hidden;width: 100%;padding-top: 56.25%;">
<iframe class="responsive_iframe" allowfullscreen="allowfullscreen" src="https://player.vimeo.com/video/171368757" style="position: absolute;border: 0;top: 0;left: 0;bottom: 0;right: 0;width: 100%;height: 100%;">
</iframe>
</div>
<p>...And the same thing on YouTube, in case one&nbsp;loads better&nbsp;than the other:</p>
<div class="video_16_9" style="position: relative;overflow: hidden;width: 100%;padding-top: 56.25%;">
<iframe class="responsive_iframe" allowfullscreen="allowfullscreen" src="//www.youtube.com/embed/c-cOrPOHi7E" style="position: absolute;border: 0;top: 0;left: 0;bottom: 0;right: 0;width: 100%;height: 100%;">
</iframe>
</div>
<p>One thing I learned making the film is that&nbsp;my&nbsp;Russian accent colors not only my words, but any noise coming out of my mouth.
So I'm not the most&nbsp;versatile voice actor.</p>
<p>Anyway, we certainly have a <a href="https://en.wikipedia.org/wiki/Debt_crisis">debt crisis</a>, and easy credit policies
keep&nbsp;producing still&nbsp;more debt. I don't think interest rates have ever stayed so low for so long, everywhere.</p>
<p>Economists argue both for and against debt expansion [1], as they argue&nbsp;about&nbsp;everything.</p>
<p>My own take is as simple as my sparse knowledge ought to make it:</p>
<ul>
<li>Unprecedented conditions produce unprecedented outcomes.</li>
<li>Booms are usually&nbsp;gradual, and busts&nbsp;are sudden.</li>
</ul>
<p>No unusual boom has gradually arisen&nbsp;from unusual monetary policy, and it's been a while. <strong>But something unusual ought
to happen in unusual conditions!</strong> Thus one expects a sudden, unusual bust&nbsp;down the road.</p>
<p>That's it. It's&nbsp;like a physicist's&nbsp;proof [2] that one's attractiveness peaks at some distance from the observer. At the
distances of zero and infinity, visual attractiveness is zero (you can't see anything.) Therefore, attractiveness as a function
of distance has a maximum&nbsp;<em>somewhere</em> in between. True, kinda, and it didn't take a lot of insight into the nature of
attractiveness – much like my peak debt proof doesn't require an&nbsp;understanding of the economy [3].</p>
<p>Will&nbsp;today's "Brexit" trigger&nbsp;the global&nbsp;downturn predicted by Yossi Kreinin's&nbsp;Rule of Unprecedented Conditions? Probably&nbsp;not
by itself. I think it's a symptom more than a cause [4], and&nbsp;the big bad thing comes&nbsp;later.</p>
<p>In the meantime, here's hoping that my little film (started when "Grexit" was a thing, completed just in time for Brexit)
was funnier than the average <a href="https://www.reddit.com/r/forwardsfromgrandma/">forward from grandma</a>&nbsp;[5].</p>
<p>Happy Brexit! And if you follow people on Twitter, there's a strong case for <a href="https://twitter.com/YossiKreinin">following me</a> as well.</p>
<p>[1] Bibliography:&nbsp;<a href="http://www.nytimes.com/2015/02/09/opinion/paul-krugman-nobody-understands-debt.html">Nobody
Understands Debt</a>&nbsp;except Krugman;&nbsp;<a href="https://www.ced.org/blog/entry/does-krugman-understand-debt">Does Krugman
Understand Debt?</a></p>
<p>[2] I&nbsp;think a&nbsp;particular famous&nbsp;physicist said it, but I forget who.</p>
<p>[3] ...and I can't say I&nbsp;have any understanding of the economy. That said, I've owed and paid off a lot of debt, and got to
negotiate with many bankers. And I can tell you that "debt is money we owe to ourselves", Krugman's catchphrase, feels
unconvincing to&nbsp;creditors – as&nbsp;many people and whole nations have&nbsp;found out.</p>
<p>[4] In fact, I just got an email from&nbsp;an&nbsp;asset manager saying that&nbsp;it's good <em>for the UK</em> in the&nbsp;longer run, elevating
Brexit from a symptom to a cure. But he didn't say "good <em>for everyone</em>", and anyway I'm not sure his crystal ball is
better than yours or mine.</p>
<p>[5] I linked to /r/forwardsfromgrandma since,&nbsp;regardless of the&nbsp;politics of either its members or their grandmas, I ought to
give credit for&nbsp;the brilliant term – it's definitely funny because it's true. I've watched many relatives acquire the habit
of&nbsp;forwarding&nbsp;various wingnut stuff&nbsp;as they age. Most&nbsp;frighteningly, my own urge to email such things&nbsp;gets harder to resist
every year. I can sense&nbsp;my own ongoing grandmafication; between you and me, an animated short about debt might be a part of
"it." Scary, scary stuff.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/a-laymans-view-of-the-economy#comments</comments>
      <pubDate>Fri, 24 Jun 2016 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/a-laymans-view-of-the-economy.feed</wfw:commentRss>
    </item>
    <item>
      <title>Looking for senior IT/DevOps people</title>
      <link>https://yosefk.com/blog/looking-for-senior-itdevops-people.html</link>
<description><![CDATA[<p>I wouldn't spam you with these job offers if they didn't work :-) So, we're looking for senior IT people to work at our Jerusalem
offices – <strong>managers and hands-on people alike</strong>.&nbsp;We have&nbsp;rapid growth, "Big Data" (it definitely <a href="https://twitter.com/devops_borat/status/288698056470315008?lang=en">is crash Excel</a>&nbsp;-&nbsp;in fact,&nbsp;at one point it was
close to physically crashing through the floor&nbsp;due to the storage servers'&nbsp;weight, but luckily that's been handled), "HPC"
(biggish server farms, distributed build &amp; tests, etc.),&nbsp;and many other buzzwords [1].&nbsp;I don't know where IT ends and DevOps
starts but I guess a good candidate could have&nbsp;either in their CV, so there.</p>
<p>If you have qualified friends looking for a&nbsp;challenging, well-paying job at a fun place, send their CVs, the sooner the
better – we're in a hurry (rapid growth!), so early birds are&nbsp;more likely to get the can of worms. As always, "challenging" is a
downside as much as an&nbsp;upside&nbsp;(a place where IT means Exchange, SAP and little else might pay very well for a more predictable
and less demanding job.)</p>
<p>We value experience in building and maintaining non-trivial systems, and technical reasoning (X happens because of Y, Z is
most&nbsp;efficient if you use it to do W, etc.) We also value&nbsp;experience in higher-level areas such as management and purchasing,
and business reasoning (don't hook X and&nbsp;Y together since their vendors compete&nbsp;and will sabotage the project,&nbsp;Z beats W in
terms of total cost of ownership, etc.) We do kinda lean towards thinking of technical&nbsp;aptitude as a cornerstone on top of which
solid&nbsp;higher-level expertise is built. (We've seen managers snowed by vendors, reports, etc., which&nbsp;is a perennial problem in
tech at large and isn't restricted to IT.)</p>
<p>If you'd like to hear more details, please email Yossi.Kreinin@gmail.com</p>
<p>[1] What we don't have is a heavy-duty web site/application, which might make the position less relevant for some.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/looking-for-senior-itdevops-people#comments</comments>
      <pubDate>Thu, 30 Jun 2016 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/looking-for-senior-itdevops-people.feed</wfw:commentRss>
    </item>
    <item>
      <title>The habitat of hardware bugs</title>
      <link>https://yosefk.com/blog/the-habitat-of-hardware-bugs.html</link>
      <description><![CDATA[<p>The&nbsp;Moscow apartment which little me called home was also home to many other creatures, from smallish cockroaches to biggish
rats. But of course we rarely met them face to face. Evolution has weeded out those animals imprudent enough to crash
your dinner. However, when we moved a cupboard one time, we had the pleasure of meeting a few hundred fabulously evolved
cockroaches.</p>
<p>In this sense, logical bugs aren't different from actual insects. You won't find bugs&nbsp;under the spotlight, because they get
fixed under the spotlight, crushed like a cockroach on the dinner table.&nbsp;But in darker nooks and crannies, bugs&nbsp;thrive and
multiply.</p>
<p>When hardware malfunctions in a single, specific way, software running on it usually&nbsp;fails in several different, seemingly
random ways, so it sucks to debug it. Homing in on the cause is easier if you can guess&nbsp;which parts of the system are more
likely to be buggy.</p>
<p>When hardware fails, nobody wants a programmer treating it&nbsp;as a lawyer or a mathematician (the&nbsp;hardware broke the contract!
only working hardware&nbsp;lets us&nbsp;reason about software!) Instead, the key to success is approaching it as&nbsp;a pragmatic&nbsp;entomologist
knowing where bugs live.</p>
<p>Note that I'm mostly talking about design bugs, not random&nbsp;manufacturing defects. Manufacturing defects can occur absolutely
anywhere. If you're in an industry where you can't toss a faulty&nbsp;unit into the garbage can, but instead must find the specific
manufacturing defect in every reported bad&nbsp;unit, I probably can't tell you anything new, but I can offer you my deepest
sympathy.</p>
<h2 id="cpus">CPUs</h2>
<p>CPUs are the perfect illustration of the "spotlight vs nooks and crannies" principle. In CPUs, the spotlight, where it's hard
to find&nbsp;bugs, is functionality accessible to userspace programs - data processing,&nbsp;memory access and control
flow&nbsp;instructions.</p>
<p>Bugs are more likely in those parts of the CPU only accessible to operating systems and drivers - and used more by OS
kernels than drivers. Stuff like memory protection, interrupt handling, and other privileged instructions. You can sell a buggy
CPU if it doesn't break too many commercially significant, hard-to-patch programs - and there aren't many important OS kernels,
so a lot of scenarios are never triggered by them.</p>
<p>A new OS kernel might bump into the bug, of course, but at that point, it's&nbsp;the programmer's problem. A friend who wrote a
small real-time operating system had to&nbsp;familiarize&nbsp;himself with several errata items, and was the first to report some of these
items.</p>
<p>It should be noted that an x86 CPU ought to be way less buggy in the privileged areas than the average embedded CPU. That's
because it's more <em>compatible</em> in the privileged areas than almost any other CPU. AFAIK, today's x86 CPUs will still run
unmodified&nbsp;OS binaries from the 80s and 90s.</p>
<p>Other CPUs are not like that. I recall that ARM has 2 instructions, MCR and MRC (Move Register to/from Co-processor), and the
meaning of those instructions depends on their several constant arguments. An MCR can flush the cache or program the memory
protection unit or do other things - a bit like a hypothetical CALC instruction where CALC 0 does addition, CALC 1 subtracts,
CALC 2 multiplies, etc. My point isn't that MCR and MRC look cryptic in assembly code, but that <em>the meaning changes between
ARM generations</em>. MIPS is similar, except they're called MFC0 and MTC0, Move From/To Coprocessor 0.</p>
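<p>To make this concrete, here's a minimal sketch, assuming GCC inline assembly and an ARMv5-class core: the mnemonic tells you nothing - the coprocessor number and the constant operands carry the entire meaning, and that meaning can shift between cores.</p>
<pre><code>/* Invalidate the entire instruction cache on an ARMv5-class core:
   coprocessor 15, registers c7/c5, opcodes 0/0. Nothing in the "mcr"
   mnemonic hints at this - and another ARM generation may assign the
   same constants a different meaning. */
static inline void icache_invalidate_all(void)
{
    __asm__ volatile("mcr p15, 0, %0, c7, c5, 0" : : "r"(0) : "memory");
}
</code></pre>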
<p>These incompatibilities do&nbsp;not break userspace programs, which can't execute any of these instructions -&nbsp;but the OS needs to
be tweaked to support a new core. If a new core introduces a bug in a privileged instruction, <em>that doesn't break old OS code
any more than it's already broken</em> by ISA incompatibilities. Updating OS code is the perfect opportunity to also work around
fresh hardware bugs.</p>
<p>x86 chips also run more OSes than chips based on most other architectures. For instance, a now-defunct&nbsp;team making a fairly
widespread ARM-based&nbsp;application processor had to port about 3 versions of Linux (is there a chip maker who likes Linux with its
endless versions and having to port it themselves? Or do they secretly wish they could tell Linus Torvalds what he publicly said
to NVIDIA, namely, "fuck you"?) They also supported OS vendors in the porting of Windows and QNX. Overall, the chip probably
only ever ran 5 full-blown OSes. x86 chips need to run endless OS builds - often built from very similar source code, but still.</p>
<p>The same principle applies to all hardware. <strong>It's bug-free if and only if&nbsp;they can't sell it with bugs</strong>. If
they can sell it with bugs and make it your problem, they very well&nbsp;might.</p>
<h2 id="memory">Memory</h2>
<p>$100 says your DRAM chip works. The DRAM chip is a mindless slave implementing precise commands from the DRAM controller on the
master chip,&nbsp;without any feedback - there are no retries, no negotiation,&nbsp;no way&nbsp;to say you're sorry. And no software will run
properly on faulty DRAM. Faulty DRAM isn't a marketable product.</p>
<p>Your board is definitely buggy. They told you they checked&nbsp;signal integrity, but they lied. If DRAM malfunctions, it's
probably&nbsp;the board, or the boot code&nbsp;programming DRAM-related components in a way that doesn't work on this board.</p>
<p>In the middle, there's the DRAM controller and the PHY. You'll only see bugs there if you're a chip maker - a chip is not
marketable unless such bugs are already worked around somehow. If you are indeed a chip maker, this is when you find out why
fabless chip companies are worth so much more than the equally fabless vendors of "IPs" such as CPUs and DRAM controllers. The
short answer is that chip makers are&nbsp;exposed to most of the risk. And in your case,&nbsp;some of this risk has&nbsp;just
been&nbsp;realized.</p>
<p>A DRAM controller bug can be very damaging&nbsp;to&nbsp;the chip maker, whose engineering samples might not work and whose production
schedule might be delayed. For the DRAM controller vendor - no big deal, "we have 3 more customers affected by this bug, we must
say you're taking it unusually passionately!" This is an actual quote. I want to add something here, something describing&nbsp;what
we chip makers think of&nbsp;these people,&nbsp;but words fail me. The point is, they fix the bug and ship the fixed version to their next
customers. You get to figure out how to make your engineering samples kinda work (often lowering the DRAM frequency helps), and
perhaps how to fix the design&nbsp;without too many changes&nbsp;to&nbsp;the wafer&nbsp;masks.</p>
<p>Bottom line is, DRAM controllers and PHYs can have bugs, usually it's the chip&nbsp;maker's problem, managing this risk is not
fun.</p>
<p>The bus interconnect&nbsp;between your processors and the&nbsp;DRAM controller probably&nbsp;doesn't have bugs - not correctness bugs, at
least. That's because today it's usually produced by a code generator, and such a code generator is really hard to market if it
has bugs, because they'll manifest in so many different ways and places. I found a bug in an interconnect once, and I was very
proud of my tests, but that&nbsp;was a preliminary version, and they found the bug independently. Real, supported versions always
worked fine.</p>
<p><em>Performance</em> bugs around memory access are legion, of course, because you can totally sell products with performance
issues, at least&nbsp;up to a point. A chip can have 2&nbsp;processors with 8-byte buses each, going to a DRAM giving you&nbsp;16 bytes per
cycle, through a shared&nbsp;8-byte-per-cycle bottleneck. This&nbsp;interconnect is the handiwork of some time-starved dude on the chip
maker's team,&nbsp;armed with an interconnect-generating tool. Even such an idiotic&nbsp;issue will manifest on some benchmarks but not
others, and might not get caught at&nbsp;design time. And if you think <em>that</em> is stupid, I've heard of&nbsp;a level 2 cache which
never actually cached anything, and this fact&nbsp;happily went&nbsp;unnoticed for a few months. (Of course, <em>this</em> not being
caught at design time is when the team should start looking for a new career.)</p>
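<p>The arithmetic behind that bottleneck, as a runnable back-of-envelope sketch (the numbers come from the example above, not from any real chip):</p>
<pre><code>#include &lt;stdio.h&gt;

/* Effective per-CPU bandwidth is capped by the narrowest shared link,
   not by the CPU's own bus width and not by the DRAM. */
int main(void)
{
    int cpus = 2;
    int cpu_bus = 8;      /* bytes/cycle per CPU */
    int dram = 16;        /* bytes/cycle */
    int interconnect = 8; /* bytes/cycle, shared - the bottleneck */

    int demand = cpus * cpu_bus;                            /* 16 */
    int supply = interconnect &lt; dram ? interconnect : dram; /*  8 */
    printf("demand %d, supply %d: %d bytes/cycle per CPU\n",
           demand, supply, supply / cpus);                  /*  4 */
    return 0;
}
</code></pre>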
<p>Similarly, DRAM schedulers, supposedly very clever about optimizing DRAM performance, can in practice be really stupid, etc.
In fact, performance issues&nbsp;are among the&nbsp;hardest to pinpoint, and so are found in the&nbsp;greatest abundance in hardware and
software alike. But in a way, they aren't bugs.</p>
<h2 id="peripheral-devices">Peripheral devices</h2>
<p>Expect peripheral device controllers to be pretty shitty. There really is no reason to make them particularly good. Only
device drivers access these things, so it all concerns just a handful of programmers, and besides, working around a hardware bug
here is easier than almost anywhere else.</p>
<p>A device driver has the device all to itself, nothing can touch the hardware concurrently unless the driver lets it, and the
code can fiddle with the hardware controller all it likes, perhaps emulating some of the functionality on the CPU if necessary,
and&nbsp;doing arbitrarily complex things to work around bugs. Starting with simpler things like reading memory-mapped&nbsp;registers
twice - a workaround for&nbsp;a real old&nbsp;bug from a real vendor in the automotive space, one who huffs and puffs a lot about
reliability and safety.</p>
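<p>A minimal sketch of that "read twice" workaround (the register address is made up, and the real erratum's details differ): on the buggy part, the first read may return stale data, so you throw it away.</p>
<pre><code>#include &lt;stdint.h&gt;

#define DEV_STATUS (*(volatile uint32_t *)0x4000f000u) /* hypothetical MMIO register */

static uint32_t dev_read_status(void)
{
    (void)DEV_STATUS;  /* dummy read - may return stale data on the buggy part */
    return DEV_STATUS; /* the second read is the one we trust */
}
</code></pre>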
<p>And a lot of peripheral devices also allow some room for error at the protocol level - you can drop packets, retransmit
packets, checksums tell you if ultimately all the data was transferred correctly, you can negotiate on the protocol features,
etc. etc. All that helps work around hardware bugs, reducing the pressure to ship correct hardware.</p>
<p>Also, since few people read the spec, there's no reason to make it very clear, or detailed, or up-to-date, or fully correct,
or maintain errata properly. This is not to say that nobody does it right, just that many don't, and this shit still sells.
Nobody cares that driver programmers suffer.</p>
<p>(By the way, I'm not necessarily condemning people on the hardware side here. Some low-level programmers like to complain
about how bad hardware is, but it's not obvious how much should be invested to make the driver writer's job easy, even from a
purely economic point of view of optimally using society's resources, regardless of anyone's bottom line. If a chip is shipped
earlier at the cost of including a couple of peripheral controllers which are annoying to write drivers for, maybe it's the
right trade-off. I'm not saying that bugs should be exterminated at all cost, I'm just telling you where they live.)</p>
<h2 id="miscommunication">Miscommunication</h2>
<p>As a programmer, do not expect every device to follow protocols correctly. What will work is the CPU accessing memory, in any
way the CPU can access memory -&nbsp;with or without caching (the two ways generate vastly different bus commands.)&nbsp;But if the path
between the device doing the access and the device handling the access is a less traveled one, then you might need to do the
access in a very specific way.</p>
<p>For instance, bus protocols might mandate that access to an unmapped&nbsp;address will result in an error response. But an
unmapped address might fall into a large region which the interconnect associates with a hardware module written by some bloke.
So it routes your request to the bloke's module. The&nbsp;bloke can and will write hardware description code that checks for every
address mapped within his range,&nbsp;and then&nbsp;returns a response - but the code&nbsp;does&nbsp;<em>nothing</em> when the address is
unmapped.&nbsp;Then reading from this address will cause the CPU to hang forever, and not only a debugger running on the chip,
but&nbsp;even a JTAG probe will not tell you where it's stuck.</p>
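<p>A minimal sketch of that failure mode (the address is made up): the load goes out on the bus, the bloke's module never responds, and the CPU stalls on the load forever - no bus error, no exception, nothing for the debugger to show.</p>
<pre><code>#include &lt;stdint.h&gt;

int main(void)
{
    /* Hypothetical address: inside the region routed to the module,
       but not one of the addresses its code checks for. */
    volatile uint32_t *unmapped = (volatile uint32_t *)0x40001000u;
    uint32_t v = *unmapped; /* the bus transaction never completes - hangs here */
    return (int)v;          /* never reached */
}
</code></pre>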
<p>There are many issues of this sort - a&nbsp;commonly unsupported thing is byte access as opposed to full word access (the hardware
bloke didn't want to look at low address bits or byte masks), etc. etc. A bus protocol lawyer might be able to prove that the
hardware is buggy in the sense of not following the&nbsp;protocol properly. A programmer must call it a feature and live with it.</p>
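<p>When you hit the byte-access variant of this "feature," the usual workaround is a read-modify-write of the full word - a sketch with a hypothetical register, assuming it tolerates being read back and rewritten:</p>
<pre><code>#include &lt;stdint.h&gt;

#define DEV_CTRL (*(volatile uint32_t *)0x50000000u) /* hypothetical word-only register */

static void dev_write_low_byte(uint8_t b)
{
    uint32_t w = DEV_CTRL; /* full-word read - a byte access would misbehave */
    w = (w &amp; ~0xffu) | b;  /* replace byte 0 only */
    DEV_CTRL = w;          /* full-word write */
}
</code></pre>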
<p>As a chip maker, there's the additional trouble of hooking up two working devices whose vendors lied about the protocol subset
they support, so that the devices will not work together. For instance, a DMA engine and a cache might both "support out of order bus
responses." But the cache will return the response data interleaved at the word level, while the DMA might require responses to
be interleaved at the burst level, where the burst size is defined by the DMA's read commands.</p>
<p>The chip maker is rather unlikely to ship hardware with this sort of a bug, so by itself it's rarely a programmer's problem.
But they might make you set a bit in the DMA controller that disables the kind of requests producing out of order bus responses
when accessing certain addresses. Again you can argue whether it's a bug or a feature, but either way, if you don't set the bit,
interesting things will transpire.</p>
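<p>In code, this kind of workaround tends to be a single magic line in the driver's init path - a sketch with hypothetical register and bit names:</p>
<pre><code>#include &lt;stdint.h&gt;

#define DMA_CFG        (*(volatile uint32_t *)0x60000010u) /* hypothetical config register */
#define DMA_CFG_NO_OOO (1u &lt;&lt; 3)                           /* hypothetical "no out-of-order" bit */

static void dma_init(void)
{
    /* Forbid the request type that produces out-of-order responses;
       without this, transfers from certain address ranges misbehave. */
    DMA_CFG |= DMA_CFG_NO_OOO;
}
</code></pre>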
<h2 id="summary">Summary</h2>
<ul>
<li>Don't trust freshly designed boards</li>
<li>Don't trust peripheral controllers</li>
<li>Trust CPUs in userspace &amp;&nbsp;DRAM chips (almost&nbsp;always), and everything&nbsp;between the two (unless the chip is new &amp;
untested)</li>
<li>Expect to bump into unsupported bus&nbsp;protocol features if you do anything except accessing memory from a CPU</li>
<li>If you write your own OS, be prepared to work around CPU bugs (except perhaps on the PC)</li>
</ul>
<p><a href="low-level-is-easy.html">I wrote a long time ago</a>, and I still believe it, that lower-level programming is made
relatively&nbsp;easy by the fact that you're less likely to have bugs in your dependencies. That's because low-level bugs both&nbsp;hurt
more users and are harder to fix, and therefore people try harder to avoid them in the first place. However, this isn't equally
true for all the&nbsp;different low-level things you depend on.</p>
<p>I've described the state of things with hardware as it is in my experience, and attempted to trace&nbsp;the differences to the
different&nbsp;costs of bugs to different&nbsp;vendors. The same reasoning applies to software components -&nbsp;for instance, compilers&nbsp;are
more likely to have bugs than&nbsp;OS kernels -&nbsp;because, by definition, compiler bugs cannot break existing binaries,&nbsp;but kernel bugs
will do that. So I think it's a generally&nbsp;useful angle to look at things from.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/the-habitat-of-hardware-bugs#comments</comments>
      <pubDate>Wed, 13 Jul 2016 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/the-habitat-of-hardware-bugs.feed</wfw:commentRss>
    </item>
    <item>
      <title>Fun won't get it done</title>
      <link>https://yosefk.com/blog/fun-wont-get-it-done.html</link>
      <description><![CDATA[<p>OK, published at 3:30 AM. That's a first!</p>
<p>So. Got something you want to do over the course of a year? Here's a motivation woefully insufficient to pull it off:</p>
<ul>
<li>It's fun!</li>
</ul>
<p>What could&nbsp;give you enough drive to finish the job? Anything with a reward <em>in the future, once you're done</em>:</p>
<ul>
<li>Millions of fans&nbsp;<strong>will</strong> adore me.</li>
<li>It <strong>will</strong> be the ugliest thing on the planet.</li>
<li>I <strong>will</strong> finally understand quantum neural rockets.</li>
<li>We <strong>will</strong> see who the loser is, Todd!</li>
<li>I <strong>will</strong> help humanity.</li>
<li>I <strong>will</strong>&nbsp;destroy humanity.</li>
</ul>
<p>It doesn't matter how noble or ignoble your&nbsp;goal is. What matters is <strong>delaying gratification</strong>. Because even
your&nbsp;favorite thing in the&nbsp;world will have&nbsp;shitty bits if you chew on&nbsp;a big enough chunk of it. A few months or years worth of
work are <em>always</em> a big enough chunk, so there <em>will</em> be shitty bits. Unfortunately, it's also the minimum-sized
chunk to do anything of significance.</p>
<p>This is where many brilliant talents drown. Once you've known the joy of true inspiration, it's hard to settle for less - which you
<em>must</em> do to have any impact. Meanwhile, their thicker peers happily butcher task after task. Before you know it, these
tasks add up to an impactful result.</p>
<p>In hindsight, I was really&nbsp;lucky in that I chose a profession for money instead of love.&nbsp;Why? <strong>Stamina</strong>. Money
is a reward in the future that lets you ignore the shittier bits of the present.</p>
<p>Loving every moment of it, on the other hand, carries you until that moment&nbsp;which you <em>hate</em>, and then you need a new
sort of fuel. Believe me, I know. I love drawing and animation, and you won't believe how many times I started and stopped doing
it.</p>
<p>But the animation teacher who taught me 3D said he was happy to put textures on toilet seat models when he started out.
<em>That's</em> the kind of appetite you need – and very few people&nbsp;naturally feel that sort of attraction to toilet seats. You
need a&nbsp;big reward in the future, like "I'm going to become a pro," to pull it off.</p>
<p>But I don't want to become a pro. I don't want to work in the Israeli animation market where there's scarcely a feature
film&nbsp;made. I don't even want to work for a big overseas animation studio. I want to make something, erm, something beautiful
that I love, <strong>which is a piece of shit of a goal</strong>.</p>
<p>Because you know where I made most progress picking up actual skills? In an evening animation school, where I had a&nbsp;perfectly
good goal: survive. It's good because it's a simple, binary thing which doesn't give a rat's ass about your mood. You either
drop out or you don't. But "something I love" is fluid, and depends a lot on the mood. And&nbsp;when you hate this thing you're
making, as you sometimes will, it's hard to imagine loving it later.</p>
<p>Conversely, imagining how I don't drop&nbsp;out is easy. This is what I was imagining when sculpting this bust, which 90% of the
time I hated with a passion because it looked like crap. But I thought, "I'm not quitting, I'm not quitting, I'm not quitting,
hey, I&nbsp;get the point of re-topology in Mudbox, I'm not quitting, I'm not quitting, hey, I guess I see what&nbsp;the specular map
does, I'm not quitting... Guess I'm done!"</p>
<div class="video_16_9" style="position: relative;overflow: hidden;width: 100%;padding-top: 56.25%;">
<iframe class="responsive_iframe" allowfullscreen="allowfullscreen" src="https://player.vimeo.com/video/171365263" style="position: absolute;border: 0;top: 0;left: 0;bottom: 0;right: 0;width: 100%;height: 100%;">
</iframe>
</div>
<p>And now let's talk about beauty for a moment.</p>
<p>I'm a programmer. I like to think that I'm not the thickest, butcherest programmer, in that I understand the role of beauty
in it. To the trained eye, programs can be beautiful as much as math, physics or chess, and a beautiful program is better
<em>for business</em> than a needlessly uglier one. (Ever tried pitching the value of beauty to someone businessy? Loads
of fun.)</p>
<p>But you know why beauty is your enemy? Because it sucks the fun out of things. How? Because you're making this thing and
chances are, <strong>it's not beautiful according to your own standard</strong>. The trap is, your&nbsp;taste for beauty is usually
ahead of your&nbsp;creative ability. In any area, and then in any sub-area of that area, ad infinitum, you can tell ugly from
beautiful long before you can make something beautiful yourself. And&nbsp;even if&nbsp;you can satisfy your own taste,&nbsp;often&nbsp;the final
thing is beautiful, but not the states it goes through.</p>
<p>So&nbsp;the passionate, sensitive soul is hit twice:</p>
<ol>
<li>You're driven by fun and inspiration because you've experienced them once and now you covet them.</li>
<li>Your sense of beauty, frustrated by the state of your creation, kills&nbsp;all the fun – that very fun which&nbsp;you insist must be
your only fuel.</li>
</ol>
<p>Life is easier if you want a yacht. I think you can buy a decent one for $300K, and certainly for $1M. Now all you need to do
is make that money – it doesn't matter how – and imagining that yacht will help you do <em>anything</em> well! If you want
beauty, however, I do not envy you.</p>
<p>How do I cope with my desire for beauty?&nbsp;The first step is acknowledging&nbsp;the problem, which I do. The fact is that my worst
failures in programming came when I insisted on beauty the most. The second step is shunning beauty as a <em>goal</em>, and
making it&nbsp;into a <em>means</em> and a <em>side-effect</em>.</p>
<p>I need a program doing at least X, taking at most Y seconds, at a date not later than Z.&nbsp;I'll keep ugliness to a minimum
because ugly programs work badly. And if it comes out particularly nicely, that's great. But beauty is&nbsp;not a goal, and&nbsp;enjoying
the beauty of this program as I write it is not why I write it.</p>
<p>And if you think it's true for commercial work but not open source software, look at, I dunno, Linux. Read some <a href="http://www.h-online.com/open/features/Interview-Linus-Torvalds-I-don-t-read-code-any-more-1748462.html">Torvalds</a>:</p>
<blockquote>
<p>Realistically, every single release, most of it is just driver work. Which is <strong>kind of boring in the sense there is
nothing fundamentally interesting in a driver</strong>, it's just support for yet another chipset or something, and at the same
time that's kind of the bread and butter of the kernel. More than half of the kernel is just drivers, and so <strong>all the big
exciting smart things we do, in the end it pales</strong> when compared to all the work we just do to support new hardware.</p>
</blockquote>
<p>Boring bits. Boring bits that&nbsp;must be done to make something of value.</p>
<p>Does this&nbsp;transfer to art or poetry or any of those things&nbsp;whose whole point is beauty? Well, yeah, I think it does, because
no,&nbsp;beauty is not the whole point:</p>
<ul>
<li>The most important thing about a drawing is that it's done. Now it exists, and people can see it, and you can make
<em>another one</em>. Practice. They will not come out very well if they don't come out.</li>
<li>Often people like your&nbsp;subject.&nbsp;There's a continuum between "it's beautiful in a way that words cannot convey" and "I love
how this song&nbsp;expresses&nbsp;my favorite political philosophy." To the extent that a work of art tells a story, or even sets up&nbsp;a
mood, its beauty <em>does</em> become a means to an end.</li>
<li>Just because the end result is beautiful to the observer, and even if that's the only point, doesn't mean every step making
it was an orgy of beauty for whomever made it. Part of what goes into it is boring, technical work.</li>
</ul>
<p>So here, too, I'm trying to make beauty a non-goal. Instead my goals are "make a point" and "keep going," and I try to add
beauty, or remove ugliness, as I go.</p>
<p>For example,&nbsp;I didn't do a graduation project in the evening school, but I&nbsp;animated a short on my own in the same timeframe,
and I published it, even though it's not the beautiful thing I always dreamed about making. And&nbsp;I'm not sure anyone gets the
joke except me. (I'm not sure I get it anymore, either.)</p>
<div class="video_16_9" style="position: relative;overflow: hidden;width: 100%;padding-top: 56.25%;">
<iframe class="responsive_iframe" allowfullscreen="allowfullscreen" src="https://player.vimeo.com/video/171368757" style="position: absolute;border: 0;top: 0;left: 0;bottom: 0;right: 0;width: 100%;height: 100%;">
</iframe>
</div>
<p>Now my goal is "make another one." It's a good goal, because it's easy to imagine making another one. It's proper&nbsp;delayed
gratification.</p>
<p>And if you enjoyed programming 20 years ago and are trying to reignite the passion, I suggest that you find a goal as
worthy to you as "fun" or "beauty", but as clear and binary as a yacht. And you can settle for less worthy, but not for less
clear and binary. Because everything they told you about "extrinsic motivation" being inferior to "intrinsic motivation" is one
big lie. And this lie will&nbsp;fall apart the moment you sink your teeth into a bunch of shit, as will always happen if you're
trying to accomplish anything.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/fun-wont-get-it-done#comments</comments>
      <pubDate>Mon, 01 Aug 2016 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/fun-wont-get-it-done.feed</wfw:commentRss>
    </item>
    <item>
      <title>Hiring (self-driving algos, HLL compiler research)</title>
      <link>https://yosefk.com/blog/hiring-self-driving-algos-hll-compiler-research.html</link>
      <description><![CDATA[<p>OK, so 2 things:</p>
<p>1. If you send me a CV and its owner is hired to work on self-driving algos – machine vision/learning/mapping/navigation – I'll pay
you a shitton of money. (Details over email.) These teams want a CS/math/physics/similar degree with great grades, and they want
programming ability. They'll hire quite a lot of people.</p>
<p>2. The position below is for my team and if you refer a CV, I cannot pay you a shitton of money. But:</p>
<p><strong>We're developing an array language that we want to efficiently compile to our in-house accelerators (multiple target
architectures, you can think of it as "compiling to a DSP/GPU/FPGA.")</strong></p>
<p>Of recent public efforts, perhaps <a href="http://halide-lang.org/">Halide</a> is the closest relative (we're compiling AOT
instead of processing a graph of C++ objects constructed at run time, but I'm guessing the work done at the back-end is somewhat
similar.) What we have now is already beating hand-optimized code in our C dialects on some programs, but it's still a "blue
sky" effort in that we're not sure exactly how far it will go (in terms of the share of production programs where it can replace
our C dialects.)</p>
<p>As usual, we aren't looking for someone with experience in exactly this sort of thing (here especially it'd be hopeless since
there are few compiler writers and most of them work on lower-level languages.) Historically, the people who enjoy this kind of
work have a background in what I broadly call (mislabel?) "discrete math" - formal methods, theory of computation, board game
AI, even cryptography, basically anywhere you have clever algorithms in a discrete space that can be shown to work every
time. (Heavyweight counter-examples, each missing one of "clever", "discrete" or "every time" – OSes, rendering, or NNs,
respectively. This of course is not to say that experience in any of these is disqualifying, just that they're different.)</p>
<p>I think of it as a gig combining depth that people expect from academic work with compensation that people expect from
industry work. If you're interested, email me (Yossi.Kreinin@gmail.com).</p>
<p>All positions are in Jerusalem.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/hiring-self-driving-algos-hll-compiler-research#comments</comments>
      <pubDate>Sun, 11 Sep 2016 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/hiring-self-driving-algos-hll-compiler-research.feed</wfw:commentRss>
    </item>
    <item>
      <title>Things want to work, not punish errors</title>
      <link>https://yosefk.com/blog/things-want-to-work-not-punish-errors.html</link>
      <description><![CDATA[<p>For better or worse, things want to work.</p>
<p>Consider&nbsp;driving at night on unlit, curvy mountain roads, at a speed about twice the limit, zigzagging between cars,
including oncoming ones. Obviously dangerous, and yet many&nbsp;do this, and survive. How?</p>
<ul>
<li>Roads and cars are built with big safety margins</li>
<li>Other drivers don't want to die and help you get through</li>
<li>Practice makes perfect, so you get good at this bad thing</li>
</ul>
<p>The road, the car, you, other drivers, and their cars all want this to work. So for a long while, it does, until it finally
doesn't. I know 3-4 people who&nbsp;drive like this habitually. At least 2 of them totaled cars. All think they're excellent drivers.
All have high IQs, making you wonder just what this renowned benchmark of human brains really tells us.</p>
<p>Now consider a terribly managed project with an insane deadline, and a team and budget too small. All too often, this too
works out. How?</p>
<ul>
<li>Unless it physically cannot exist, a&nbsp;solution <strong>wants</strong> you to find it. You carve out a piece and the next
piece suggests itself. Even if&nbsp;management fails&nbsp;to think how the pieces&nbsp;fit together, the pieces often come out such that&nbsp;they
<em>can</em> be made to fit with modest extra effort.</li>
<li>And then the people who make the pieces <strong>want</strong> them to fit. Even if&nbsp;the process is totally mismanaged, many
people will talk to each other and find out&nbsp;what to do to make parts work together.</li>
<li>The project was approved because a customer was persuaded. At this point, the customer&nbsp;<strong>wants</strong>&nbsp;the project to
succeed. A little bit of schedule slippage will not make them change their minds, nor will a somewhat less impressive result.
More slack for you.</li>
<li>The vendor, too, <strong>wants</strong> the project to succeed, and will tolerate a little bit of budget overrun. More
slack.</li>
<li>Most often, when things fail, they fail visibly. It's as if things <strong>wanted</strong> you to see that they fail, so
that you fix them.</li>
</ul>
<p>The fact is that by cutting features, having a few non-terminal bugs,&nbsp;and being somewhat late and over budget, most projects
can be salvaged. In fact, when they say that "most projects fail," the PMI <a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a> defines "failure" as being a bit late or over budget. If "failure" is defined as outright
cancellation, I conjecture that most projects "succeed."</p>
<p>Which projects are least likely to be canceled? In other words, where is being late, over budget and off the original spec
most tolerable? Obviously, <em>when the overall delivered value is the highest</em>, both in absolute terms and relative to
the cost. In other words, <strong>reality punishes bad management the least in the most impactful cases</strong>.</p>
<p>What is the biggest problem with bad management?&nbsp;Same as crazy driving: risk.&nbsp;The problem in both cases is you risk
high-cost, low-probability events. It's terrible things that tend not to happen. And&nbsp;people are pretty bad at learning from
mistakes they&nbsp;never had to pay for.</p>
<p>Wannabe racecar drivers fail to learn from driving into risky situations which their own eyes tell them are risky. For
managers, learning is harder – the risks accumulated through bad management are abstract, instead of viscerally scary. In fact,
a lot of the risks are never understood by management, or even fully reported. Too much risk gets swept under various
rugs for all of it to become ingrained in institutional memory.</p>
<p>In fact, it's even worse, because risk-taking is actually <strong>rewarding</strong> as long as the downside doesn't
materialize. The crazy driver gets there 10 minutes earlier. Similarly, non-obviously hazardous management often delivers at an
obviously small cost. And while driving is&nbsp;not actually&nbsp;competitive, except in the inflamed minds of the zigzagging few, most
projects are delivered in very competitive environments indeed. And competition can make even small rewards for risk decisive –
as it can with any other smallish factor&nbsp;large enough to make a difference between victory and defeat.</p>
<p>Things want to work more than they want to punish us for our errors. The punishment may be very cruel and unusual alright,
but it's rare. It seems that the universe, at least The Universe of Deliverables, is Beckerian: it hands out punishments that are
severe but improbable – the optimal kind for deterring rational agents who correctly estimate probabilities. Sadly, humans are bad at probability.</p>
<p>And thus crazy drivers and bad managers alike (often the same people, BTW) march from one insane adventure to the next,
gaining more and more confidence in their brilliance.</p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>PMI&nbsp;(The Project Management Institute) is&nbsp;a con, where they sell you "PMBOK" (Project Management Body of
Knowledge, a thick book you can use as a monitor stand) and "PMP" (Project Management Professional, a certification required by
PMI's conscious or unwitting accomplices in dark corners of the industry.) A variety of more elaborate cons targeted at narrower
audiences incorporate PMI's core body of cargo cult practices.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/things-want-to-work-not-punish-errors#comments</comments>
      <pubDate>Mon, 27 Feb 2017 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/things-want-to-work-not-punish-errors.feed</wfw:commentRss>
    </item>
    <item>
      <title>Patents: how and why to get them</title>
      <link>https://yosefk.com/blog/patents-how-and-why-to-get-them.html</link>
      <description><![CDATA[<p>I'm going to discuss 3 very basic things about patents:</p>
<ul>
<li>Why it's good for you to get them;</li>
<li>Why it might be bad for your employer (and why they don't care);</li>
<li>How to get a patent for your idea (doesn't matter which.)</li>
</ul>
<p>Some of my points are a bit naughty. But I maintain that they're based in fact and fairly widely known. So well-known, in
fact, that I'm surprised never to have read them anywhere else.</p>
<p>My explanation is that the hatred of patents in the tech world is such that nothing except "HATE! HATE! HATE!" can be said on
the subject in polite society. In this atmosphere, "Patents: how and why to get them" reads like "Humans: how and why to cook
them."</p>
<p>If you can make yourself read this human-cooking manual, however, I think you'll find both amusing and useful things. I have
more experience with patents than I've ever asked for, having worked on this stuff with lawyers from the smallest law firms to
the largest ones, including lawyers who personally handled the most famous lawsuits for the most famous tech clients. I'm not an
authority on patents, but I have good stories.</p>
<h2 id="what-patents-give-you">What patents give you</h2>
<p>Some companies pay you money per patent. But it's rarely enough to make it worth your while, unless it's all you're doing.
Patents look good on your CV, but reactions might be negative as well (you might appear "overqualified," "an expert in an
unrelated field," etc.)</p>
<p>What's the one thing a patent undeniably buys you? <em>A right to legally and publicly discuss your work –</em> which you
often can't get in any other way. This is not a side-effect of patent law, but its whole stated point. Patent law prompts
companies to <em>publish</em> their ideas, in exchange for a time-limited monopoly right to use the ideas.</p>
<p>Note that publishing ideas in patents is easy, and the benefit <em>for the author</em> is certain. But getting and enforcing
a monopoly for said ideas is <em>not</em> easy, so the benefit <em>for the proprietor</em> is not at all certain. Here's
why.</p>
<h2 id="what-patents-give-and-dont-give-your-employer">What patents give (and don't give) your employer</h2>
<p>Some problems with patents are so obvious that even patent lawyers will honestly discuss them with their clients:</p>
<ul>
<li>When you submit a patent application, it becomes public forever, <em>even if it's rejected.</em> You will have paid legal
fees with the end result of granting competitors access to your ideas.</li>
<li>If you sue for patent infringement, your patent might be <em>invalidated</em> as a result. It's like a rejected patent
application, but with at least $1 million more in legal fees.</li>
</ul>
<p>But there's another, potentially far bigger problem, that patent lawyers will rarely mention, let alone admit its extent:</p>
<ul>
<li>You don't get monopoly rights to everything you file in the patent application. You publish a "spec" and "claims." The
monopoly is granted <em>only for the claims</em> – perhaps in a form reduced relative to the original patent application, due
to feedback from the examiner. Yet <em>the entire spec</em>, much of it not covered by the claims, becomes public.</li>
</ul>
<p>So what's the big deal, you might ask? The spec describes some device or method. The claims describe the supposedly new ideas
used in this device or method. All you have to do is write a spec such that nothing of value is disclosed that is not covered by
the claims.</p>
<p>However, in reality, the published spec is often quite close to <em>the actual spec used by engineers</em>, with all the
details. That's simply the path of least resistance:</p>
<ul>
<li>Patent lawyers don't know which claims will be rejected by the examiner. (If they knew, you wouldn't have a heap of rejected
applications, nor patents invalidated in courts.) They file relatively broad claims, and then change the claims to address
challenges by the examiner, until a patent is granted. The catch is that <em>you can only base your new claims on details
included in the originally filed spec</em> – the spec can never be altered. Thus a detailed, complete spec maximizes the chances
to get <em>some</em> patent out of the filing – covering 90% or 10% of the spec, depending.</li>
<li>More prosaically, if we don't file the actual spec but instead write a new one tailored to the claims, who's gonna do it?
Neither the engineer nor the lawyer necessarily has the ability to do it, and surely neither has any interest in doing it. Much
better to take existing documents and do the minimal necessary translation from English to legalese.</li>
</ul>
<p><strong><em>Ultimately, there's a conflict of interest between your employer and their patent lawyer, and a surprisingly
perfect alignment of interests between the lawyer and yourself</em></strong>:</p>
<ul>
<li><strong>The lawyer wants to publish as many details as possible</strong> – to maximize the chance of getting a patent, and
to avoid extra work;</li>
<li><strong>The engineer also wants to publish as much as possible –</strong> to make his ideas known to the fullest extent, and
to avoid extra work;</li>
<li><strong>The employer/shareholder wants to publish as <em>little</em> as possible –</strong> but has no simple, reliable way
to incentivize anyone to push in this direction (though of course some are much better at this than others.)</li>
</ul>
<p>Funnily enough, this too is largely in line with the lawmaker's stated intent – prompting companies to publish ideas instead
of keeping them secret. But why do companies file patents?</p>
<p>The answer is that patents are never read – they're counted. More precisely, a company's goal is to acquire enough patents so
that they can <em>only</em> be counted – but not read and understood in a reasonable amount of time.</p>
<p>If you have too many patents to read and understand (hundreds, thousands or more), then investors and competitors alike
assume you "own your domain" – you can counter-attack if sued. You're as well-defended legally as you can possibly be. But if
you have few patents, someone might read and understand most of them – and create a narrative about some legal weakness. Such
narratives are bad for the stock price.</p>
<p>This situation must be avoided. And that's all there is to it – at least in the computing industry. And I know it might sound
too dismissive to be convincing. But the fact is that the content of patents is just too complex to drive business decisions.
The feasible thing for a decision-maker is to pick the bucket to put you in, out of "no patents, some patents, a shitton of
patents." For more information, see the seminal work "<a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow">Pulling
Decisions out of One's Ass: Fast and Slow,</a>" keeping in mind that decision-makers have a lot of decisions to make, so they
must be Fast.</p>
<h2 id="why-filing-patents-isnt-a-crime-on-par-with-cannibalism">Why filing patents isn't a crime on par with cannibalism</h2>
<p>Considering the above, I don't think that <em>a product company employee</em> filing patents pollutes the tech environment as
badly as people believe.</p>
<p>Product companies file patents largely for self-defense. Some occasionally attack startups, but how many startups were
destroyed by a patent lawsuit vs the number of those destroyed by a badly managed acquisition (with the original investors doing
just fine)? And there are examples of big companies buying startups already attacked by a lawsuit filed by a bigger product
company, confident that between two big companies, the legal result will be a stalemate. Thus for a big company genuinely
fearing your product, it's much safer to buy you than sue you and have you bought by a big competitor.</p>
<p>The real trouble is patent trolls, who cannot be counter-attacked. But the only way a product company's patents will land in
a troll's hands is if the company goes bankrupt and sells the patents. Well, guess what – in these cases, other product
companies are eager to outbid the trolls. For example, when MIPS Technologies was sold to Imagination for ~$60 million many
years ago, its patents, sold separately, fetched ~$500 million from some CPU cartel involving various big name CPU companies.
Alternatively, a failing company can turn into a troll and sue a successful product company (MicroUnity comes to mind.)</p>
<p>Thus patents of failing product companies result in a weird form of socialism, where profit is spread more evenly between
investors, with losers getting a chunk of the winners' profits. I don't think this chunk is nearly large enough on average to
substantially reduce the incentive to work hard for the win, which is supposedly "the" trouble with laws subsidizing losers.</p>
<p>My point is that patent trolls and product companies seem to live in largely parallel universes. There are patents filed with
the intention to be used by a patent troll, and there are patents filed by product companies, and the latter cause far less
damage.</p>
<h2 id="how-to-get-a-patent">How to get a patent</h2>
<p>I've lost count of the number of times I've heard the words "The Black Swan." It rather aggrieves me, but you gotta hand it
to Taleb. Everyone is trying to pollute our language by needlessly coining catchphrases in a quest to be memorable, but he
succeeded more than most. Surely I wouldn't hear this nonsense as often if he called the book "The Unforeseen Event."</p>
<p>Getting patents is a lot like branding. The trick is to call old things new names.</p>
<p>Why does it take a patent lawsuit and at least $1 million in legal fees to find out if a patent <em>really</em> is a patent –
or to see it invalidated by the court? Because searching prior art is hard. "Prior art" includes everything published prior to
the patent – older patents, academic papers, and everything else, really. Strictly speaking, you never know if you're done.</p>
<p>How does the patent office examiner examine prior art, at a cost much lower than $1 million? Some equivalent of quick
googling. The input of search engines is words and short phrases. If you use words and phrases which are uncommon in your
domain, the search will come up blank, or it will find things so obviously unrelated to your work that even a patent examiner
will get it.</p>
<p>If you're extending the concept of a thread, don't call the result "extended threads", call it "hypercontexts." If you're
calculating a histogram, call it a "distribution estimator." And so on. Again, I know this sounds too dismissive of the system
to be believable. Well, try it. File a patent application full of "distribution estimators" and another one written in plain
English. See which gets approved more smoothly.</p>
<p>Note that you might be tempted to conduct a prior art search yourself before filing the patent application, as a matter of
due diligence. Yet some lawyers actually recommend against it, since if you do find prior art, you're now willfully infringing
on it, and should cease and desist. My advice is to come up with a bunch of Black Swans/Distribution Estimators describing your
idea, and pick the ones with the fewest Google results (patent search and otherwise).</p>
<p>And don't actually <em>read</em> any patent you accidentally find – don't willfully infringe, it's illegal. Just count them.
Patents are never read, only counted – sounds familiar?</p>
<p>The other very important thing – which you mostly get to worry about in smaller companies – is to get the right kind of
lawyer. Patent lawyers are fallen engineers, with engineering degrees, and sometimes actual engineering experience.
<strong><em>The underlying engineer who's morphed into a lawyer ought to have specialized in your domain. No compromise is
acceptable here</em>.</strong> If you're doing optics, don't work with a guy who did chip design, and if you do chip design,
don't work with a guy who did optics.</p>
<p>It doesn't matter if the lawyer is a Partner ($900/hour), an Associate ($450/hour), or some lesser life form in the law firm.
It doesn't matter whether the firm is the biggest name in the industry or completely unknown. What matters is engineering
knowledge. Don't expect a patent lawyer to honestly tell you he doesn't know your domain. He'll <em>always</em> accept the work,
and you'll pay $truckload/hour trying to explain the most basic things to him, and failing. You need to actively ask about his
education and experience.</p>
<h2 id="summary">Summary</h2>
<p>Like most annoying things in life, patents aren't evil as much as they're absurd. Use them to your advantage.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/patents-how-and-why-to-get-them#comments</comments>
      <pubDate>Sat, 02 Jun 2018 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/patents-how-and-why-to-get-them.feed</wfw:commentRss>
    </item>
    <item>
      <title>Don't ask if a monorepo is good for you – ask if you're good enough for a monorepo</title>
      <link>https://yosefk.com/blog/dont-ask-if-a-monorepo-is-good-for-you-ask-if-youre-good-enough-for-a-monorepo.html</link>
      <description><![CDATA[<p>This is inspired by <a href="https://danluu.com/monorepo/">Dan Luu's post</a> on the advantages of a single big repository
over many small ones. That post is fairly old, and I confess that I'm hardly up to date on the state of tooling, both for
managing multiple repos and for dealing with one big one. But I'm going to make an argument which I think mostly works
regardless of the state of tooling on any given day:</p>
<ul>
<li>Monorepo is great if you're really good, but absolutely terrible if you're not that good.</li>
<li>Multiple repos, on the other hand, are passable for everyone – they're never great, but they're never truly terrible,
either.</li>
</ul>
<p>If you agree with the above, the choice is up to your personal philosophy. To me, for instance, it's a no-brainer – I'll
choose the passable thing which successfully withstands contact with apathetic mediocrity over the greater thing which falls
apart upon such contact in a heartbeat.</p>
<p>You might be different – you might believe in Good – and then you'll choose a monorepo, like Google, the ultimate force for
Good in technology (which is why they safeguard your personal data; you wouldn't want someone evil to have it – luckily, Google
can do no evil.) And I'm almost not kidding: the superpower which lets Google maintain the grassroots bureaucracy which I find
necessary to make monorepos work well is actually the same trait making you sufficiently delusional to chant, or at least to
have chanted "Don't Be Evil" entirely seriously. I don't have that. I am, to a first approximation, evil. <a href="https://yosefk.com/blog/what-worse-is-better-vs-the-right-thing-is-really-about.html">Worse is Better</a>.</p>
<p>But that's me – I'm not saying <em>you/your org</em> are Not So Good, or Evil. I'm only saying that <em>you should be open to
the possibility,</em> and that I don't see the implications of being Not So Good discussed as much as they deserve.</p>
<p>Why are monorepos terrible if you're not that good? Three reasons:</p>
<ol>
<li>Branching in</li>
<li>Modularity out</li>
<li>Tooling strained</li>
</ol>
<p>Let's discuss them in some detail.</p>
<h2 id="branching-getting-forked-by-your-worst-programmer">Branching: getting forked by your worst programmer</h2>
<p>In a Good team, you don't have multiple concurrent branches from which actual product deliveries are produced, and/or where
most people get to maintain these branches simultaneously for a long time. And you certainly can't have branching due to
outright atrocities, like someone adding a feature by killing a feature – for example, making the app work on Android, but
destroying the ability to build for iOS in the process.</p>
<p>But in a not-so-good team... you get the idea.</p>
<p>What do you do when you have a branch working on Android and another branch working on iOS and you have deliveries on both
platforms? You postpone the merge, and keep the fork. For how long do you postpone the merge? For as long as is necessary for
the dumbass who caused the fork to fix their handiwork, in parallel with delivering more features (which likely results in
digging a deeper hole to climb out of afterwards.) And the dumbass might take months, years, or forever.</p>
<p>The question then becomes, <em>what was forked</em>?</p>
<p>In a multi-repo world, the repo maintained by the team with the dumbass on it got forked. In a monorepo world, <em>the entire
code base got forked, and the entire org is now held hostage by the dumbass.</em> And you might think that this will result in a
lot of pressure to fix the problem, and you'd be wrong, for the same reasons that high murder rates don't cure themselves by
people putting pressure on whomever to lower them to some equilibrium level common to all human societies.</p>
<p>Some places have higher than average murder rates, and some places have higher than average fork rates. And I argue that
a lot of places have fork rates which combine into a complete disaster with a monorepo. And you might not even realize how bad
the fork rate is at your place, because multiple repos largely shield you from the consequences. Or, more tragically, you might
not realize how bad your fork rate is because your monorepo is in its first couple of years, and you're sowing what you'll reap
in its next couple of years, when you'll have more code, more deliveries and more dumbasses.</p>
<p>With multiple repos, if <em>you</em> have your shit under control, and <em>your</em> repos have a single release branch with
a single timeline, all you have to do is to test against both of the dumbass's branches. But with a monorepo, you need to
maintain your code in 2 branches, with a growing share of everybody else's code morphing incompatibly in those branches, simply
because they exist. And very soon it will be more than 2 because there's more than a single dumbass, and good luck to you.</p>
<h2 id="modularity-demoted-from-a-norm-to-an-ideal">Modularity: demoted from a norm to an ideal</h2>
<p>Norms are mundane, but they are what is. Ideals are lofty, but they are merely what should be (and typically isn't.) If you
want to actually <em>have</em> something, you don't want it to be an ideal, like altruism – you want it to be a norm, like
wiping one's ass. If something is demoted from ass-wiping to altruism, that something will scarcely be found in the wild.</p>
<p>With multiple repos, modularity is the norm. It's not a <em>must</em> - you technically can have a repo depending on umpteen
other repos. But your teammates <em>expect</em> to be able to work with their repo with a minimal set of dependencies. They
<em>don't like</em> to have to clone lots of other repos, and to then worry about their versions <em>(in part because the
tooling support for this might be less than great)</em>.</p>
<p>In fact, a common multi-repo failure mode is that people expect <em>too few</em> dependencies and make <em>too many repos
which are too small</em> to host a <em>useful</em> self-contained system. Note that this failure mode is not lethal. It kinda
sucks to have this over-modularity with benefits of independence which turn out to be imaginary upon a closer look, and to have
people treat what essentially are internal APIs with way too much reverence, just because two modules which are extremely
tightly coupled <em>conceptually</em> are independent <em>technically,</em> in terms of cloning/building/testing. But it doesn't
kill you.</p>
<p>With a monorepo, modularity is a mere ideal. Everybody clones the whole thing. You're not supposed to add gratuitous
dependencies, but it's very easy to add such a dependency in terms of cloning, building and versioning, and nobody objects to
the dependency being added the way they would if they needed to clone more repos.</p>
<p>Of course, in a Good team, needless dependencies would be weeded out in code reviews, and a Culture of avoiding them would
evolve over time. In a not-so-good team, your monorepo will grow into a single giant ball of circular
dependencies. Note that adding dependencies is infinitely easier than untangling them, much like forking is easier than merging,
with the difference that the gut-felt urgency to merge ("I can't maintain all your damned branches any longer!!") is far greater
and far more backed by simple self-interest than the urgency to improve the dependency structure.</p>
<h2 id="tooling-is-yours-better-than-the-standard">Tooling: is yours better than the standard?</h2>
<p>This part might age worse than the others, and might not be particularly up to date even now – what "standard" tools are
capable of changes over time. But generally speaking, a growing monorepo is likely to outgrow the standard version management
tools and methods, as well as <em>other</em> tools and methods dealing with your revision controlled code.</p>
<p>Google used to have a FUSE driver to avoid copying hundreds of millions of source lines at a time, instead fetching the
files on demand when a directory is cd'd into. Facebook used to hack on hg to make it fast on its large monorepo. Maybe already
today, or some day, a growing number of off-the-shelf tools will scale to infinite monorepos without such investments. But it
sounds reasonable that there will always be tools and workflows which you will struggle to make work with a large monorepo
(starting with some script doing find/grep.)</p>
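<p>(As one illustration of "standard" capabilities changing over time, today's stock git can already clone without most file
contents and check out just a subtree – a sketch, with a made-up URL:)</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># clone without blobs, fetching file contents lazily as needed
git clone --filter=blob:none --sparse https://example.com/monorepo.git
cd monorepo
# materialize only the directories your team actually works on
git sparse-checkout set my/team/dir</code></pre>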
<p>With a bunch of small repos, you work with a small overall number of source files in your working directory, so you don't
need to tell your tools, "don't try to deal with the whole thing – instead only search this subset, or use this index etc. etc."
And you have tools these days which kinda sorta let you manage the revisions of multiple repositories (for instance, there's
Google's Repo.) And I think the result is very, very far from a great experience <em>potentially</em> afforded by a large
monorepo. But it also <em>never</em> breaks down as badly as a large monorepo outgrowing the abilities of tools, as well as the
ability of your local toolsmiths to find creative workarounds for these growth pains.</p>
<h2 id="summary">Summary</h2>
<p>Don't ask if a monorepo is good for you – ask if you're good enough for a monorepo. Personally, I don't have the guts to bet
on the supply of Goodness in a given org to remain sufficiently large over time to consistently avert the potential disasters of
monorepos. But that's just my personal outlook; if you want to compliment me, don't call me "smart," and definitely don't call
me "good" – I know my limits in these areas, and I take far more pride in knowing these limits than in the limits themselves;
so, to compliment me, call me "pragmatic." Yet a culture worthy of a monorepo absolutely can exist – just make sure yours
actually is one of those, and don't mistake your ideals for your norms.</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/dont-ask-if-a-monorepo-is-good-for-you-ask-if-youre-good-enough-for-a-monorepo#comments</comments>
      <pubDate>Tue, 30 Jul 2019 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/dont-ask-if-a-monorepo-is-good-for-you-ask-if-youre-good-enough-for-a-monorepo.feed</wfw:commentRss>
    </item>
    <item>
      <title>I have risen</title>
      <link>https://yosefk.com/blog/i-have-risen.html</link>
      <description><![CDATA[<p>Hello to the readers still using RSS! I've moved the blog off WordPress to my own ugly publishing software, and will be
grateful if you report any glitches you see (posts or comments look bad on device X or feed reader Y, that sort of thing.)</p>
<p>This blog slowed down a lot in 2017, when I switched from a part-time programming position to a full-time senior management
position. Between the comment spam flood and the ancient pre-mobile design, it would take some doing to get the blog back into
shape; and between work and non-work stuff I had going, I didn't find time for said doing.</p>
<p>But I went back to programming a couple of years ago, and back to part-time a few months ago, and now the doing is done.
And do I have things to tell you!</p>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/i-have-risen#comments</comments>
      <pubDate>Tue, 12 Mar 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/i-have-risen.feed</wfw:commentRss>
    </item>
    <item>
      <title>refix: fast, debuggable, reproducible builds</title>
      <link>https://yosefk.com/blog/refix-fast-debuggable-reproducible-builds.html</link>
      <description><![CDATA[<p>There's a simple way to make your builds all of the following:</p>
<ul>
<li><strong>Reproducible</strong>/deterministic - same binaries always built from the same source, so you can cache build
outputs across users</li>
<li><strong>Debuggable</strong> - gdb, sanitizers, Valgrind, KCachegrind, etc. find your source code effortlessly</li>
<li><strong>Fast</strong> - the build time overhead is negligible, even compared to a blazing fast linker like <a href="https://github.com/rui314/mold">mold</a></li>
</ul>
<p>What makes it really fast is a small Rust program called <a href="https://github.com/yosefk/refix">refix</a> that
post-processes your build outputs (if you don't want to compile from source, <a href="https://yosefk.com/refix/">here's a static
Linux binary</a>.) Both the program and this document are written for the context of C/C++ source code compiled to native
binaries. But this can work with other languages and binary formats, too, and it should be easy to support them in
<code>refix</code>. (<em>In fact, it mostly supports them already...</em> you'll see.)</p>
<p>This "one weird trick" isn't already popular, not because the solution is hard, nor because the problem isn't painful.
Rather, it's not already popular because people widely consider it impossible for builds to be both debuggable and reproducible,
and standardize on workarounds instead. Since "established practices" are sticky, and especially so in the darker corners like
build systems<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a>, we'll need to discuss not only
how to solve the problem, but also why solve it at all.</p>
<h2 id="the-curious-case-of-the-disappearing-source-files">The curious case of the disappearing source files</h2>
<p>Why are people so willing to give up their birthright - the effortless access to the source code of a debugged program? I
mean, build a "Hello, world" cmake project, and everything just works: gdb finds your source code, <code>assert</code> prints a
path you can just open in an editor, etc. "Source path" isn't even a thing.</p>
<p>Later on, the system grows, and the build slows down. So someone implements build artifact caching, in one of several
ways:</p>
<ul>
<li>A general-purpose distributed build cache, like Bazel's</li>
<li>Something for caching specific kinds of artifacts, like ccache (a minimal setup is sketched right after this list)</li>
<li>An entirely home-grown system - like running the build of user X in a build directory left previously by user Y at the build
server's local disk (and hoping that their source code is similar enough, so most object files needn't be rebuilt<a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>)</li>
</ul>
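<p>For instance, a minimal ccache setup might look like the sketch below; the cmake launcher flags are just one way to hook it
in, and <code>CCACHE_BASEDIR</code> is what rewrites absolute paths under it into relative ones, so objects built in different
checkouts can hit the same cache entries:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># make paths under the checkout relative, for cross-user cache hits
export CCACHE_BASEDIR=$PWD
cmake -DCMAKE_C_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache ..
make</code></pre>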
<p>In any case, now that you need caching, you also need reproducible builds. Otherwise, you'd cache object files built by
different users, and you'd get different file paths and other stuff depending on which user built each object file. And we can
all agree that build caches are important, and pretty much force you to put relative paths into debug information and the value
of <code>__FILE__</code> (and some meaningless garbage into <code>__TIME__</code>, etc.)</p>
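<p>(A quick sanity check of "reproducible" in this sense – assuming no other sources of non-determinism – is to build the same
file in two different directories with the prefix remapped, and compare hashes:)</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># identical source in two build directories, bit-identical objects out
(cd /tmp/a; gcc -g -c -ffile-prefix-map=/tmp/a=. foo.c)
(cd /tmp/b; gcc -g -c -ffile-prefix-map=/tmp/b=. foo.c)
sha256sum /tmp/a/foo.o /tmp/b/foo.o  # expect two identical hashes</code></pre>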
<p>But we can <em>also</em> agree that the <em>final binaries</em> which users actually run should have full source paths,
right? I mean, I know there are workarounds for finding the source files. We'll talk about them later; I'd say they don't really
work. Of course, the workarounds would be tolerable if they were inevitable. But they aren't.</p>
<p><strong>Why not fix the binary coming out of the build cache, so it points to the absolute path of the source files?</strong>
(The build system made an effort to detach the binary from the full source path, so that it can be cached. But now that the
binary has left the cache, we should "refix" it back to the source path of the version where it belongs.)</p>
<p>We'll look at 3 ways of refixing the binary to the source path - a thesis, an anti-thesis and a synthesis, as it were.</p>
<h2 id="thesis-debugedit---civilized-standard-and-format-aware">Thesis: <code>debugedit</code> - civilized, standard and
format-aware</h2>
<p>A standard tool for this is <a href="https://sourceware.org/debugedit/">debugedit</a>. The man page example does exactly the
"refixing" we're looking for:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>debugedit -b `pwd` -d /usr/lib/debug files...
    Rewrites path compiled into binary
    from current directory to /usr/lib/debug.</code></pre>
<p>Some Linux distributions build from source files in some arbitrary location, and then use <code>debugedit</code> to make the
debug info point to wherever the source files are installed when someone downloads them to debug the program.</p>
<p>If debugedit works for you, problem solved. It works perfectly when it does. However, when I tried it on a 3GB shared object
compiled from a C++ code base<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>, it ran for 30
seconds, and then crashed. If you, too, find debugedit either slow or buggy for your needs, read on.</p>
<h2 id="anti-thesis-sed---nasty-brutish-and-short">Anti-thesis: <code>sed</code> - nasty, brutish, and short</h2>
<p>Why is debugedit's job hard (slow and bug-prone)? Mainly because it needs to grow or shrink the space reserved for each
replaced string. When you do such things, you need to move a lot of data (slow), and adjust who-knows-which offset fields in the
file (bug-prone.)</p>
<p>But what if the strings had the same length? Then we don't need to move or adjust anything, and we could, erm, we could
replace them with <code>sed</code>.</p>
<p>Here, then, is our nasty, brutish, and short recipe:</p>
<ul>
<li>Run <code>gcc</code> with these flags:
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">-fdebug-prefix-map==MAGIC <i># for DWARF</i>
-ffile-prefix-map==MAGIC  <i># for __FILE__</i>
</pre></li>
<li>Make MAGIC long enough for any source path prefix you're willing to support.</li>
<li>Why the <code>==</code> in the flag? This invocation assumes that file paths are relative, so it remaps <em>the empty
string</em> to MAGIC, meaning, <code>dir/file.c</code> becomes <code>MAGICdir/file.c</code>. You can also pass
<code>-fdebug-prefix-map=/prefix/to/remap=MAGIC</code>, if your build system uses absolute paths.</li>
<li>Use <code>sed</code> to replace MAGIC with your actual source path in the binary outputted by the build system.</li>
<li>If the source path is shorter than the length of MAGIC, pad it with forward slashes: <code>/////home/user/src/</code>. If
the source path is too long, the post-link step should truncate it, warn, and eventually be changed to outright fail. You don't
<em>really</em> need to support giant paths.</li>
</ul>
<p>Our post-link step thus becomes:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>sed -i 's/MAGIC/\/\/\/...\/user\/src\//g' binary</code></pre>
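<p>Concretely, the padding logic might look like this – a bash sketch, where MAGIC and the paths are placeholders for your own
values:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># pad the real source root with leading slashes to MAGIC's exact length
MAGIC=MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM  # the build-time constant
SRC=/home/user/src/                             # where the code really is
PAD=$(( ${#MAGIC} - ${#SRC} ))
if [ "$PAD" -lt 0 ]; then
  echo "warning: source prefix longer than MAGIC, truncating" >&amp;2
  PAD=0
fi
PREFIX=$(printf '%*s' "$PAD" '' | tr ' ' '/')$SRC  # ///...//home/user/src/
sed -i "s|$MAGIC|${PREFIX:0:${#MAGIC}}|g" binary   # same-length replacement</code></pre>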
<p>The downside, on top of the source path length limit, is a trace of the brutishness making it into the output file. Namely,
you're going to see these extra forward slashes in some situations. We can't pad a prefix with an invisible character...
luckily, we can pad it with a character not changing the meaning of the path.</p>
<p>On the upside, compared to <code>debugedit</code>, the method using <code>sed</code> is:</p>
<ul>
<li><strong>More widely applicable</strong> - it, erm, "supports" all executable and debug information formats, as well as
archives and object files.</li>
<li><strong>More robust</strong> - not affected by input format complexity</li>
<li><strong>Faster</strong> - 10 seconds to process the 3GB binary (about the time it takes <code>mold</code> to link that
binary... yes, it's that good!)</li>
</ul>
<p>Is this fast enough? Depends on your binary sizes. If yours are big and you don't want to effectively double the link time,
our next and last method is for you.</p>
<h2 id="synthesis-refix---nasty-brutish-and-somewhat-format-aware">Synthesis: <code>refix</code> - nasty, brutish, and somewhat
format-aware</h2>
<p>Can we go faster than <code>sed</code>? We have two reasons to hope so:</p>
<ul>
<li><code>sed</code> is unlikely to be optimized specifically for replacing strings of equal size; it's not that common a use
case.</li>
<li>We don't actually need to go through the entire file. File paths only appear in some of the sections - <code>.rodata</code>
where strings are kept, and debug info sections. If we know enough about the file format to find the sections (which takes very
little knowledge), we can avoid touching most of the bytes in the file.</li>
</ul>
<p>But wait, isn't the giant binary built from C++ mostly giant because of the debug info? <em>Yes</em>, but it turns out that
most of the debug info sections <em>don't contain file paths</em>; only <code>.debug_line</code> and <code>.debug_str</code> do,
and these are only about 10% of our giant file.</p>
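<p>You can check this ratio on your own binaries with standard binutils (section sizes in the <code>readelf</code> output are in
hex):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># sizes of the sections refix touches, vs. the size of the whole file
readelf -S --wide binary | grep -E '\.rodata|\.debug_line|\.debug_str'
ls -l binary</code></pre>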
<p>So the <code>refix</code> program works as follows:</p>
<ul>
<li>It <code>mmap</code>s the file, since it knows it never needs to move the data and can just overwrite the strings in
place.</li>
<li>For ELF files, it finds <code>.rodata</code>, <code>.debug_line</code> and <code>.debug_str</code>, and searches &amp;
replaces only within these. This handles executables, shared libraries (<code>*.so</code>) and object files
(<code>*.o</code>).</li>
<li>For <code>ar</code> archives, it finds the ELFs within the archive, then the sections it cares about within each ELF, and
searches &amp; replaces within these. This handles <code>lib*.a</code>.</li>
<li>For files which are neither ELFs nor archives of ELFs, <code>refix</code> just replaces everywhere as <code>sed</code>
would, but still faster because it's optimized for the same-sized source &amp; destination strings case.</li>
</ul>
<p>Thus, <code>refix</code> is:</p>
<ul>
<li><strong>Very fast</strong> - 50 ms on the 3GB binary, and 250 ms on the same binary in "sed mode" (meaning, if we remove the
ELF magic number, so <code>refix</code> is forced to replace everywhere and not just in the relevant sections.)</li>
<li><strong>Widely applicable</strong> - works on any file format where the source path prefix isn't compressed and is otherwise
"laid bare"</li>
<li><strong>Robust</strong> - while it knows a bit about the binary file format, it's very, very little (enough to find the
sections it's interested in); it's hundreds of lines of code vs <code>debugedit</code>'s thousands. And you can always make it
run even less code by falling back to "sed mode."</li>
</ul>
<p>...with the sole downside being that, same as with sed, you might occasionally see the leading slashes in pathnames.</p>
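<p>Judging by the fuller <code>--section</code> invocation shown in the Q&amp;A below, basic usage looks like this – treat it as
a sketch, and consult the <code>refix</code> README for the authoritative command line:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># replace the magic prefix with the padded real one, in place
refix exe MAGIC /////home/user/src/
# spot-check: no leftover magic strings in the binary
strings -a exe | grep -c MAGIC  # expect 0</code></pre>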
<p>That's it, right? We're done? Well, maybe, but it's not always how it goes. People have questions. So here we go.</p>
<h2 id="q-a">Q &amp; A</h2>
<h3 id="why-do-this-we-already-have-a-system-for-finding-the-source-code.">Why do this? We already have a system for finding the
source code.</h3>
<p>First of all, it is worth saying that you <em>shouldn't</em> have any "system" for finding source code, because the tired,
stressed developer who was sent a core dump to urgently look at is entitled to having at least <em>this</em> part work entirely
effortlessly<a class="footnote-ref" role="doc-noteref" href="#fn4" id="fnref4"><sup>4</sup></a>.</p>
<p>But also, whatever system you do have ought to have issues:</p>
<ul>
<li>If you do not modify the cacheable, reproducible binaries coming out of the build system, then by definition your way to
find source code must rely on something inherent to a given source version, independent of who built it and where. Since you're
not going to embed the entire source code into the executable, you must rely on some sort of version information. What if the
program had uncommitted changes, which happens in debugging scenarios (someone built a version to test and someone else sent a
core dump from this version?)</li>
<li>"Well you're not supposed to get core dumps from versions with uncommitted changes, unless it's your local version that you
haven't given to anyone but are testing locally, so you know which version it is. You should only release versions externally
thru CI" - so giving anything to anyone to test is now considered "releasing externally" and must necessarily go thru CI, and
having trouble finding the source code is now a punishment for straying from proper procedure? How did this discussion, which
started at how build caches <em>speed up</em> the build, deteriorate to the point where we're telling developers to change how
they work, in ways which will <em>slow them down?</em></li>
<li>But OK, let's say I didn't "release" anything - instead I have 5 local versions I'm working with and they go thru test flows
and dump core - I'm now supposed to guess which core comes from which version, or develop my own "system" to know? (Some people
actually assume this won't happen because you can't run tests outside CI anyway, so you will submit a merge request in order to
run them. And they assume that because they use some testing infra intertwined with CI infra and most of their tests technically
can't run outside CI. And perhaps they don't even have machines to run on that aren't managed by Jenkins or some such to begin
with. But that is a horror story for another time. Here I'll just assume that we agree that it's good to be able to test changes
locally and debug easily.)</li>
<li>In the cases where the version info actually enables you to find the right code, the process can be made more tolerable by
developing a <code>gdb</code> Python extension that automatically tells gdb where the source code is based on the embedded
version info (the manual equivalent of such an extension is sketched right after this list). Do you have this extension and a
team maintaining it together with the build system?</li>
<li>Do you also have this automated for all the other tools (sanitizers, Valgrind, KCachegrind, VTune, whatever)? Do they all
even have a way to tell them where to look for source code? Is there a team handling this for all users, for every new tool used
by developers?</li>
</ul>
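<p>For reference, here is the manual equivalent of what such an extension would automate, typed into gdb with illustrative
paths:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code># map the compile-time prefix to where the code actually is now
(gdb) set substitute-path /build/placeholder/src /home/user/src
# or simply add another directory to search for sources
(gdb) directory /home/user/src</code></pre>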
<p>I realize that these pain points aren't equally relevant to all organizations, and the extent of their relevance depends a
lot on the proverbial software lifecycle. (They also aren't equally relevant to everyone in a given organization. I claim that
the people suffering the most from this are the people doing the most debugging, and they are quite often very far removed from
any team that could ameliorate their suffering by improving "the system for finding source code" - so they're bound to suffer
for a long time.)</p>
<p>My main point, however, is that you needn't have any of these pain points <em>at all</em>. There's no tradeoff or price to
pay: your build is still reproducible and fast. Just make it debuggable with this one weird trick!</p>
<p>(Wow, I've been quite composed and civil here. I'm very proud of myself. Not that it's easy. I have strong <em>feelings</em>
about this stuff, folks!)</p>
<h3 id="what-about-non-reproducible-info-other-than-source-path-time-build-host-etc">What about non-reproducible info other than
source path (time, build host, etc)?</h3>
<p>I'm glad you asked! You can put all the stuff changing with every build into a separate section, reserved at build time and
filled after link time. You make the section with:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>char ver[SIZE] __attribute__((section(".ver"))) = {1};</code></pre>
<p>This reserves <code>SIZE</code> bytes in a section called <code>.ver</code>. It's non-<code>const</code> deliberately, since
if it's <code>const</code>, the OS will exclude it from core dumps (why save data to disk when it's guaranteed to be exactly the
same as the contents of the section in the binary?) But you might actually very much want to look at the content of this section
in a core dump, perhaps before looking at anything else. <strong>For instance, the content of this section can help you find the
path of the executable that dumped this core!</strong><a class="footnote-ref" role="doc-noteref" href="#fn5" id="fnref5"><sup>5</sup></a></p>
<p>(How do you find the section in the core dump without having an executable which the debugger could use to tell you the
address of <code>ver</code>? Like so: <code>strings core | grep MagicOnlyFoundInVer</code>. Nasty, brutish, and short. The point
is, having the executable path <em>in the core dump</em> is an additional and often major improvement on top of having full
source paths <em>in the executable...</em> because you need to find the executable before you can find the source!)</p>
<p>Additionally, our <code>ver</code> variable is deliberately initialized with one <code>1</code> followed by zeros, since if
it's all zeros, then <code>.ver</code> will be a "bss" section, the kind zeroed by the loader and without space reserved for it
in the binary. So you'd have nowhere to write your actual, "non-reproducible" version info at a post-link step.</p>
<p>After the linker is done, you can use <code>objcopy</code> to replace the content of <code>.ver</code>. But if you're using
<code>refix</code>, which already mmaps the file, you can pass it more arguments to replace ELF sections:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>refix exe old-prefix new-prefix --section .ver file</code></pre>
<p><code>refix</code> will put the content of <code>file</code> into <code>.ver</code>, or fail if the file doesn't have the
right length. (We don't move stuff around in the ELF, only replace.)</p>
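<p>For reference, the plain <code>objcopy</code> route mentioned above might look like this, with <code>ver.bin</code> as a
hypothetical file holding the new contents; note that unlike <code>refix</code>, <code>--update-section</code> will resize the
section if the lengths differ:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>objcopy --update-section .ver=ver.bin binary</code></pre>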
<h3 id="what-about-compressed-debug-sections">What about compressed debug sections?</h3>
<p>What about them? I'm not sure why people use them, to be honest. I mean, who has <em>so many</em> executable files which they
don't want to compress as a whole (because they need to run them often, I presume), but they do want to compress the debug
sections to save space? Like, in what scenario is <em>this</em> your way of saving enough space to even worry about it?</p>
<p>But, they could be supported rather nicely, I think, if you really care. You wouldn't be able to just blithely
<code>mmap</code> a file and replace inside it without updating any offset field in the file, but I think you could come close,
or at least stay very far away from the kind of heavy lifting that makes this slow and bug-prone. Let's chat if you're interested
in this.</p>
<p>(I think maybe one problem is that some build caches have a file size limit? Like, something Bazel-related tops out at 2GB
since it's the maximal value of the Java int type?.. Let's talk about something else, this is making me very sad.)</p>
<h3 id="its-250-ms-on-generic-data.-and-you-still-did-the-elfar-thing-to-get-to-50-ms.-are-you-insane">It's 250 ms on generic
data. And you still did the ELF/ar thing to get to 50 ms. Are you insane?</h3>
<p>Well, it's 250 ms on a fast machine with a fast SSD. Some people have files on NAS, which can slow down the file access a
lot. In such cases, accessing 10x less of the <code>mmap</code>ed data will mitigate most of the NAS slowdown. You don't really
want to produce linker output on NAS, but it can be very hard to make the build system stop doing that, and I want people stuck
in this situation to at least have debuggable binaries without waiting even more for the build. So <code>refix</code> is
optimized for a slow filesystem.</p>
<p>But also, if it's not too much work, I like things to be fast. <a href="https://yosefk.com/blog/people-can-read-their-managers-mind.html">Insane or
not</a>, the people who make fast things are usually the people who like fast things, by themselves and not due to some
compelling reason, and I'm not sure I'm ashamed of maybe going overboard a bit; better safe than sorry. Like, I don't parse most
of the ELF file, which means I don't use the <code>Elf::parse</code> method from the <code>goblin</code> library, but instead I
wrote a 30 line function to parse just what I need.</p>
<p>This saves 300-350 ms, which, is it a lot? - maybe not. Will it become much more than that on a slower file system? I don't
know; it takes less time to optimize the problem away than to answer this question. Did I think of slow file systems when doing it?
- not as much as I was just annoyed that my original C++ program, which the Rust program is a "clean room" open source
implementation of, takes 150 ms and the Rust one takes about 400 ms. Am I happy now that I got it down to 50 ms? Indeed!</p>
<p>(Why is Rust faster? Not sure; I think, firstly, GNU <code>memmem</code> is slower than <code>memchr::memmem::Finder</code>,
and secondly, I didn't use TBB in C++ but did use Rayon in Rust, because the speedup is marginal - you bottleneck on I/O - and I
don't want to complicate the build for small gains, but in Rust it's not complicated - just <code>cargo add rayon</code>.)</p>
<p>It often takes less time to just do the efficient thing than it takes to argue about the amount it would save relatively to
the inefficient thing. (But it's still more time than just going ahead and doing the inefficient thing without arguing. But even
that is not always the case. But most people who make fast things will usually just go for the efficient thing when they see it
regardless of whether it's the case, I think. IME the people who always argue about whether optimizations are worth it make big and slow
things in the end.)</p>
<h3 id="im-as-crazy-as-you-and-i-want-this-speedup-for-non-elf-executable-formats.">I'm as crazy as you, and I want this speedup
for non-ELF executable formats.</h3>
<p>Let's chat. The <code>goblin</code> library probably supports your format - shouldn't take more than 100-150 LOC to handle
this in <code>refix</code>.</p>
<h3 id="which-binaries-should-i-run-this-stuff-on">Which binaries should I run this stuff on?</h3>
<p>Anything delivered "outside the build system" for the use of people (who run programs / load shared libraries) or other build
systems (which link code against static libraries / object files.) And nothing "inside the build system", or it will ruin
caching.</p>
<p>I hope for your sake that you have a monolithic build where you build everything from source. But I wouldn't count on it;
quite often team A builds libraries for team B, which gets them from Artifactory or something wonderful like that. In that case,
you might start out with a bug where some libraries are shipped with the MAGIC as their source prefix instead of the real thing.
This is easy to fix though, and someone might even remind you with "what's this weird MAGIC stuff?"</p>
<p>(Somehow nobody used to ask "what's <code>/local/clone-87fg12eb/src</code>", when <em>that</em> was the prefix instead of
MAGIC. Note that even if you have this bug and keep MAGIC in some library files, <em>nobody is worse off</em> than previously
when it was <code>/local/clone-87fg12eb/src</code>. And once you fix it, they'll be <em>better</em> off.)</p>
<h3 id="ci-removes-the-source-after-building-it.-what-should-the-destination-source-prefix-be..">CI removes the source after
building it. What should the destination source prefix be?..</h3>
<p>And here I was, thinking that it's the build cache not liking absolute paths that was the problem... It turns out that we
have a bigger problem: the source is just nowhere to be found! <code>/local/clone-87fg12eb/src</code> is gone forever!</p>
<p>But actually, it makes sense for CI to build on the local disk in a temporary directory. In parallel with building, CI can
export the code to a globally accessible NAS directory. And at the end of the build, CI can refix the binaries to that NAS
directory. It's not good to <em>build</em> from NAS (or to NAS) - it's not only slow, but fails in the worst ways under load -
which is why a temporary local directory makes sense. But NAS is a great place for <em>debugging tools</em> to get source from -
widely accessible with no effort for the user.</p>
<p>Many organizations decide against NAS source exports, because it would be too easy for developers. Instead you're supposed to
download the source via HTTP, which is much more scalable than NAS, thus solving an important problem you don't have; plus, you
can make yourself some coffee while the entire source code (of which you'll only need the handful of files you'll actually open
in the debugger) is downloaded and decompressed.</p>
<p>In that case, your destination source prefix should be wherever the user downloads the files to. Decide on any local path
independent of the user name, and with version information encoded in it, so multiple versions can be downloaded concurrently.
Have a nice cup of coffee!</p>
<h3 id="what-should-the-root-path-length-limit-be">What should the root path length limit be?</h3>
<p>100 bytes.</p>
<h3 id="our-ci-produces-output-in-filerkubernetesdockergitlabjenkinspre-commitdepartmentteamdeveloperbranch-nametest-suite-namerepo-which-is-110-bytes.">Our
CI produces output in
<code>/filer/kubernetes/docker/gitlab/jenkins/pre-commit/department/team/developer/branch-name/test-suite-name/repo/</code>,
which is 110 bytes.</h3>
<p>Great! Now you have a reason to ask them to shorten it. I'm sure they'll get to it in a quarter or two, if you keep
reminding.</p>
<h3 id="our-ceos-preschooler-works-as-a-developer-insists-on-a-200-byte-prefix-and-wont-tolerate-the-build-failing.">Our CEO's
preschooler works as a developer, insists on a 200 byte prefix, and won't tolerate the build failing.</h3>
<p>Then truncate the path without failing the build. He won't find the source code easily, but he can't find it <em>already
today.</em> <strong>If there's one thing fixing the problem won't do, it's making anyone worse off.</strong> It <em>can't</em>
make you worse off, since the current situation leaves it nowhere worse to take you. It could only possibly take you from
<em>never</em> being able to easily find the source to <em>sometimes</em>, if not always, being able to find it.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Use <code>refix</code>, <code>sed</code> or <code>debugedit</code> to make your fast, reproducible builds also effortlessly
debuggable, so that it's trivial to find the source given an executable - and the executable given a core dump.</p>
<p>And please don't tell me it's OK for developers to roam the Earth looking for source code instead. It hurts my feelings!</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>I don't mean "dark corners" in a bad way. I managed a build system team for a few years and am otherwise
interested in build systems, as evidenced by my writing this whole thing up. By "dark corners" I simply mean places invisible to
most of the organization unless something bad happens, so the risk of breaking things is larger than the reward for improving
them. It's quite understandable for such circumstances to beget a somewhat conservative approach.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>I've known more than one infrastructure team that did this "cross-user build directory reuse" without ever
hearing about each other. This method, while quite terrible in terms of optimization potential left on the table, owes its
depressing popularity to its high resilience to the terribleness of everything else (eg it doesn't mind poor network bandwidth
or even network instability, or the build flow not knowing where processes get their inputs and put their outputs; thus you can
use this approach with an almost arbitrarily bad build system and IT infrastructure.)<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>Yes, a 3GB shared object compiled from a C++ code base. Firstly, shame on you, it's not nice to laugh at people
with problems. Secondly, no, it's not stupid to have large binaries. It's much more stupid to split everything into many tiny
shared objects, actually. It was always <em>theoretically</em> stupid, but now mold made it <em>actually</em> stupid, because
linkage used to be theoretically very fast, and now it's actually very fast. And nothing good happens from splitting things to
umpteen tiny .so's... but that's a topic for another time.<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
<li id="fn4"><p>I've been told, in all seriousness and by an extremely capable programmer involved in a build system at the
time, that "debugging should NOT be made easy; we should incentivize more design-time effort to cut on the debugging effort." To
this I replied that Dijkstra would have been very proud of him, same as <a href="https://lispy.wordpress.com/2008/10/24/lisp50-notes-part-v-interlisp-parc-and-the-common-lisp-consolidation-wars/">he was
very angry with Warren Teitelman</a>, whom he confronted for the crime of presenting a debugger with "how many bugs are we going
to tolerate," getting the immortal reply "7." And I said that we should make debugging easy for those first and only 7 bugs
we're going to tolerate.<a class="footnote-back" role="doc-backlink" href="#fnref4">↩︎</a></p></li>
<li id="fn5"><p>But what if this info gets overwritten, seeing how it's not <code>const</code>?.. if you're really worried about
this section getting overwritten, of all things, you can align its base address and size to the OS page size, and
<code>mprotect</code> it at init time. This is exactly what <code>const</code> would achieve, but without excluding the section
from core dumps.<a class="footnote-back" role="doc-backlink" href="#fnref5">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/refix-fast-debuggable-reproducible-builds#comments</comments>
      <pubDate>Tue, 19 Mar 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/refix-fast-debuggable-reproducible-builds.feed</wfw:commentRss>
    </item>
    <item>
      <title>Managers have no human rights</title>
      <link>https://yosefk.com/blog/managers-have-no-human-rights.html</link>
      <description><![CDATA[<p>Here are some thoughts which are often basically correct:</p>
<ul>
<li>Every time I try to do the right thing here, it's like the place actively resists it. Actually, forget "the right thing" -
it's whenever I try to do pretty much <em>anything.</em></li>
<li>Yesterday's "strategic" thing I toiled over just got tossed into the dumpster. And they expect me to be all excited about
the new "strategy" they coughed up?</li>
<li>My "colleagues" attack me overtly and covertly all the time, all the while maintaining a "professional" and even cheerful
demeanor - in effect, a gaslighting tactic. And in this sewage lagoon, I'm supposed to get work done?</li>
<li>The deadline is impossible, and everybody knows it. Why are we all pretending that we're trying to meet it, when we're
actually busy rehearsing our speeches blaming the inevitable failure on each other?</li>
</ul>
<p>I'm sure you can add a few similar observations of your own, which at various times &amp; places were fairly accurate. My
point in this writeup is that <strong>a manager doesn't get to whine about any of this</strong>, any more than a boxer gets to
whine about a broken nose. A <em>normal</em> person very much <em>does</em> get to whine about a broken nose, and it isn't
whining - it's grounds for a lawsuit in any reasonable jurisdiction. But when a boxer steps into the boxing ring, he obviously
forfeits the basic human right of not getting punched in the face.</p>
<p>Similarly, a <em>normal</em> person - the so-called "individual contributor", which I guess is what we call workers in the
age where mice and keyboards replaced hammers and sickles - a normal person can reasonably expect some basics from the
workplace:</p>
<ul>
<li>The place should help me get work done, and provide various physical, informational and social infrastructure for this
purpose.</li>
<li>The place should articulate a strategy which my work meaningfully fits into, and manage changes to this strategy carefully
and thoughtfully.</li>
<li>I am entitled to healthy human relationships in the workplace, and to management fostering an environment conducive to
healthy relationships forming, rather than abusive and adversarial ones.</li>
<li>The schedule should not demand the impossible, and certainly when I work hard to meet whatever deadline was set, I should
not worry about being blamed for the team missing the deadline in the end.</li>
</ul>
<p>The condition meeting the full set of these lofty requirements is sometimes called "psychological safety." So in short, the
individual contributor is entitled to psychological safety - in hammer &amp; sickle terms, <b><em>workers</em> should be able to
focus on work</b>.</p>
<p>And by "should," I don't mean it always actually happens. I just mean that when it doesn't, you can reasonably expect to
resolve the situation by quitting <em>the team</em> you're on, without having to find <em>a different line of work.</em> That,
as opposed to a manager, who can <em>only</em> resolve this situation by finding a different line of work.</p>
<p>Why isn't the manager entitled to the same human rights as everyone else? Well, first of all, he just <em>isn't</em>,
meaning, if you see a manager who complains about said human rights of his being violated a lot, you can be certain that he's
not going to stay in management for long; he'll either have the common sense to quit or he'll be "demoted" from this position
(whether it's actually a demotion, and which way in a hierarchy is up is a question in itself; some theoreticians postulate that
there's no up, only out - but in any case, the manager will stop being one.)</p>
<p>Now, if I do try to explain this fact, which managers usually get at the gut level without needing an explanation, the
analogies that come to mind are the minister of foreign affairs and the plumber. You cannot, as a minister of foreign affairs,
be sad and shocked over countries plotting against you, and maybe even preparing to attack you - nobody wants a perpetually
shocked minister of foreign affairs. And similarly, you cannot be a plumber if you're appalled by the sight, smell and tactile
properties of shit.</p>
<p>"Individual contributors" can be fairly non-competitive, certainly in an industry like computing where demand for workers
outstrips supply, there's enough work for everyone, and where you know a ton of trivia about your area that anyone else would
need lots of time to learn if they had to step into your shoes. It's not only desirable but very possible to find a place where
no colleague is going to fight you in order to add your area of responsibility to theirs, and thus get promoted.</p>
<p>Managers, on the other hand, are <em>always</em> basically low-grade<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a> fighting each other, in the same way as countries always have conflicting interests, even if
they maintain what looks like cordial relationships. I mean it not as a statement about the character of managers, but as a
description of their condition. This condition follows, not from their character, but from various unfortunate facts of life -
for example, the fact that managers are assumed to be fungible and are hopelessly underinformed, and it gets worse with
rank.</p>
<p>The fungibility assumption means that a manager is always under a threat of losing "territory" to another manager, a reorg
making him a report of someone undesirable, or a straightforward replacement, much more than an IC, which creates a very
competitive environment. And the theoretical impossibility of managers being truly informed on the subjects falling under their
responsibility guarantees that their never-ending competition involves a lot of so-called "<a href="https://en.wikipedia.org/wiki/Misinformation">misinformation</a>, <a href="https://en.wikipedia.org/wiki/Disinformation">disinformation</a> and <a href="https://en.wikipedia.org/wiki/Malinformation">malinformation</a>." Of course, the manager's condition of eternal
competition fought on such wonderful terms <em>does</em> filter for character - and not in a way making a manager's day spent
with colleagues particularly pleasant.</p>
<p>In fact, all the unfortunate circumstances above - like the difficulty getting things done across the proverbial
(organizational) boundaries, deranged convulsions around "strategy," schedule chicken, and of course the scheming &amp; the
gaslighting - all this shit flows first and foremost from the competition between managers as well as the organization competing
in the external world.</p>
<p>ICs are entitled to managers shielding them from this shit - rarely fully, but quite often very considerably. Managers are
not entitled to this, because even if a 2nd, 3rd or Nth level manager would <em>like</em> to shield lower-ranking managers from
this (a rare, if laudable, desire), it's <em>not possible</em>, because there's just too much of it going on at the same time.
Of course it gets worse at higher ranks, but even a 1st level manager expecting <em>a positive atmosphere</em>, the kind that
ICs take for granted - "wow, cool stuff you've made there!" - will be sorely disappointed to learn that "please," "thank you,"
and "sorry" are gone from his day, replaced by "our requirement," "your commitment" and "customer escalation."</p>
<p>"If you want to make people happy, don't be a leader, sell ice cream," said a first-rate CEO and first-rate asshole Steve
Jobs. To this we might add, "If you want <em>people</em> to make <em>you</em> happy, consider selling ice cream, too." Or it
could be any sort of work which isn't management. A manager needs to be seriously driven by something other than having a nice
day, because that's just not gonna happen - the perfect drive <em>for you and your higher management</em> is "getting promoted,"
and the perfect drive for you to have <em>from the shareholders' POV</em> is "getting shit done." But motivation is a story for
another time.</p>
<h2 id="infrequently-raised-objections">Infrequently Raised Objections</h2>
<h3 id="i-am-a-manager-and-my-days-are-nice.">I am a manager, and my days are nice.</h3>
<p>Congratulations! You're either a great liar, including to yourself (all great liars start with themselves), or you're
completely indifferent to constant struggle and maybe you even enjoy it, or you're leading a very capable organization which
overdelivers often and underdelivers never (<em>how big is it?</em> A double digit number, tops?..), so you're enjoying what's
known as "peace through strength."</p>
<p>Rest assured that this condition is not fundamentally permanent (all strength is finite and always only goes so far), but do
enjoy it while it lasts, which can be for quite a while. Watch out for large reorgs, changes in the market / technology and
wider strategy (as I'm sure you do; only a competent manager gets to enjoy any duration of peace through strength.)</p>
<p>Or you're lucky.</p>
<h3 id="i-am-a-manager-and-my-manager-shields-me-from-this.">I am a manager, and my manager shields me from this.</h3>
<p>You're either effectively an IC, like "the leader of a team of 2 people under someone who makes every 3 people into a team,"
or you're working on something self-contained nobody needs and it will be soon canceled, or you have some infernal bond sealed
in goat's blood with your manager (or your manager has it with his), and when this thing explodes under external pressure, it
will be really ugly.</p>
<p>Or you're lucky.</p>
<h3 id="there-exist-organizations-free-from-the-dysfunction-you-describe.">There exist organizations free from the dysfunction
you describe.</h3>
<p>Like I said, "...or you're lucky." Sure, they exist. They're just rare, and usually don't remain that way as they grow (ever
heard "we're only hiring the best people?" Well, when you're hiring <em>a lot</em> of people, you're hiring <em>typical</em>
people, because there aren't this many "best" people readily available. Therefore, growth tends to bring about a regression to
the mean in all areas, including this one.) And most places are dysfunctional this way from day one, which by itself doesn't
prevent them from succeeding and growing; nor does a lack of this dysfunction guarantee success.</p>
<p>Speaking of which, I never quite understood the meaning of "dysfunctional" in "dysfunctional organizations." These
organizations definitely <em>function</em>; they generate trillions worth of world GDP. That they aren't fun to work at in a
managerial role might be true, but it is <em>not</em> their function to make it fun to work there in a managerial role. It is
the function of <em>you</em> to choose roles you can enjoy, and I hope the above can help some people with this.</p>
<h3 id="but-we-foster-a-non-hierarchical-culture-of-openness-and-curiosity.">But we foster a non-hierarchical culture of
openness and curiosity.</h3>
<p>If you're looking to decorate your office space, I have a suitcase full of hammers and sickles I brought from Soviet Russia.
I kept them to remind me of the old country, but your company sounds so awesome that I'll gladly send them to you.</p>
<p>Most deliberate attempts to improve upon the baseline outcome of people being people make things worse. You'll do everyone a
favor by going straight to the standard thing without going through a tedious cult phase first.</p>
<h3 id="calmly-accepting-dysfunction-does-not-a-good-manager-make.">Calmly accepting dysfunction does not a good manager
make.</h3>
<p>I didn't mean to imply that accepting and having a strategy for handling "dysfunction," or should I say the inevitable
consequences of the job description, is sufficient for being a good manager, whatever that means. I'm only saying that it's
<em>necessary</em> for being a manager <em>at all,</em> for any reasonable period of time and with a reasonable level of job
satisfaction.</p>
<h3 id="this-acceptance-is-not-a-binary-thing.-it-depends-on-how-bad-it-gets.">This "acceptance" is not a binary thing. It
depends on how bad it gets.</h3>
<p>It's binary in some people and not in others. There are 3 types of people: people who binarily can't accept it; people who
binarily can, regardless of the depths of depravity reached; and people on whom it starts taking a toll at a certain level. Your
type can be predicted pretty well based on what motivates you, and we'll discuss it in an upcoming, very motivational piece on
motivation.</p>
<p><em>Thanks to Dan Luu and Tim Pote for reviewing a draft of this post.</em></p>
<p><strong>P.S.</strong> There exists a breed of "individual contributor" with a fancy title, such as a Principal Engineer, a
Fellow and other such. The desirability of the existence of these titles is a subject in its own right; in our context, their
relevance is that they largely strip you of human rights as much as management titles do. One hint of why this is so is their
visibility as a status marker and the consequences of this visibility - their scarcity and the resulting competition, in many
places fiercer than the fight for management titles. An exception to the rule is if you're The Guy Who Did X for some
serious-ass value of X, and you got your fancy title thanks to that value of X, regardless of the "technical track" promotion
politics.</p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>Hopefully<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/managers-have-no-human-rights#comments</comments>
      <pubDate>Sun, 31 Mar 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/managers-have-no-human-rights.feed</wfw:commentRss>
    </item>
    <item>
      <title>The state of AI for hand-drawn animation inbetweening</title>
      <link>https://yosefk.com/blog/the-state-of-ai-for-hand-drawn-animation-inbetweening.html</link>
      <description><![CDATA[<p>There are many potential ways to use AI<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a>
(and computers in general) for 2D animation. I’m currently interested in a seemingly conservative goal: <strong>to improve the
productivity of a traditional hand-drawn full animation workflow by AI assuming responsibilities similar to those of a human
assistant.</strong></p>
<p>As a “sub-goal” of that larger goal, we’ll take a look at two recently published papers on animation “inbetweening” – the
automatic generation of intermediate frames between given keyframes. AFAIK these papers represent the current state of the art.
We’ll see how these papers and a commercial frame interpolation tool perform on some test sequences. We’ll then briefly discuss
the future of the broad family of techniques in these papers versus some substantially different emerging approaches.</p>
<p>There’s a lot of other relevant research to look into, which I’m trying to do - this is just the start. I should say that I’m
not “an AI guy” - or rather I am if you’re building an inference chip, but not if you’re training a neural net. I’m interested
in this as a programmer who could incorporate the latest tech into an animation program, and as an animator who could use that
program. But I’m no expert on this, and so <strong>I’ll be very happy to get feedback/suggestions</strong> through <a href="mailto:Yossi.Kreinin@gmail.com">email</a> or <a href="https://yosefk.com/cgi-bin/comments.cgi?post=blog/the-state-of-ai-for-hand-drawn-animation-inbetweening#comments">comments</a>.</p>
<p>I’ve been into animation tech since forever, and what’s possible these days is exciting. Specifically with inbetweening tech,
I think we’re still “not there yet”, and I think you’ll agree after seeing the results below. But we might well get there within
a decade, and maybe much sooner.</p>
<p>I think this stuff is very, very interesting! If you think so, too, we should get in touch. Doubly so if you want to work on
this. I am going to work on exactly this!</p>
<h2 id="motivation-and-scope">Motivation and scope</h2>
<p>Why is it interesting to make AI a 2D animator’s assistant, of all the things we could have it do (text to video, image to
video, image style transfer onto a video, etc.)?</p>
<ul>
<li><strong>An animator is an actor</strong>. The motion of a character reflects the implied physical and mental state of that
character. If the motion of a character, even one designed by a human, is fully machine-generated, it means that human control
over acting is limited; the machine is now the actor, and the human’s influence is limited to “directing” at best. It is
interesting to develop AI-assisted workflows where the human is still the actor.</li>
<li><strong>To control motion, the animator needs to draw several keyframes</strong> (or perhaps edit a machine-generated draft
- but with the option of erasing and redrawing it fully.) The range of ways to do “a sad walk” or “an angry, surprised head turn”
and the range of character traits influencing the acting are too wide for acting to be controlled via cues other than actually
drawing the pose.</li>
<li><strong>If a human is to be in control, “moving line art” is the necessary basis for any innovations in the appearance of
the characters.</strong> That’s because humans use a “light table”, aka “onion skin”, to draw moving characters, where you see
several frames overlaid on top of each other (like the frames of the bouncing ball sequence below). And it’s basically not humanly
possible to “read” a light table unless the frames have the sharp edges of line art (believe me, I spent more time trying than I
should have.) Any workflow with human animators in control of motion needs to have line art at its basis, even if the final
rendered film looks very different from the traditional line art style.</li>
</ul>
<p><img alt="ball-light-table.png" height="237" src="https://yosefk.com/img/inbetweening/ball-light-table.png" title="several frames on a light table" width="576" style="max-width: 100%;height: auto;"></p>
<ul>
<li><strong>The above gives the human a role similar to a traditional key animator, so it’s natural to give the machine the
roles of assistants.</strong> It could be that AI can additionally do some of the key animator’s work, so that fewer keyframes
are provided in some cases than you’d have to give a human assistant (and one reason for this could be your ability to quickly
get the AI to complete your work in 10-20 possible ways, and choose the best option, which is impractical with a human
assistant.) But the basic role of the human as a key animator would remain, and so the first thing to explore is the machine
taking over the assistant’s role.</li>
</ul>
<p>So I’m not saying that we can’t improve productivity beyond the “machine as the assistant” arrangement, nor that we must
limit ourselves to the traditional appearance of hand-drawn animation. I’m just saying that <strong>our conservative scope is
likely the right <em>starting point</em>, even if our final goals are more ambitious - at least as long as we want the human to
remain the actor.</strong></p>
<p>What would the machine do in an assistant’s role? Traditionally, assistants’ jobs include:</p>
<ul>
<li>Inbetweening (drawing frames between the key frames)</li>
<li>Cleanup (taking rough “pencil” sketches and “inking” them)</li>
<li>Coloring (“should” be trivial with a paint bucket tool, but surprisingly annoying around small gaps in the lines - see the toy sketch right after this list)</li>
</ul>
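<p>As a toy illustration of the coloring annoyance - my own sketch, nothing to do with the papers below - one crude way to keep the paint bucket from leaking through small gaps is to thicken the lines before computing connected regions. This assumes SciPy and a grayscale raster with dark lines on a white background:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np
from scipy import ndimage

def bucket_fill(line_art, seed_yx, color, gap=2, threshold=128):
    # toy sketch: thicken the lines before labeling connected regions,
    # so gaps smaller than about 2*gap pixels can't leak paint
    lines = line_art &lt; threshold
    thick = ndimage.binary_dilation(lines, iterations=gap)
    regions, _ = ndimage.label(~thick)
    mask = regions == regions[seed_yx]  # region containing the seed pixel
    out = line_art.copy()
    out[mask] = color
    # a real tool would also fill the thickened band back to the line edge
    return out
</pre>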
<p><strong>Our scope here is narrowed further by focusing exclusively on inbetweening</strong>. There’s no deep reason for this
beyond having to start somewhere, and inbetweening being the most “animation-y” assistant’s job, because it’s about movement. So
focusing our search on inbetweening is most likely to give results relevant to animation and not just “still” line art.</p>
<p><strong>Finally, in this installment, we’re going to focus on papers which <em>call themselves</em> “AI for animation
inbetweening” papers. </strong>It’s <em>not</em> obvious that any relevant “killer technique” has to come from a paper focusing
on this problem explicitly. We could end up borrowing ideas from papers on video frame interpolation, or video/animation
generation not designed for inbetweening, etc. In fact, I’m looking at some things like this. But again, let’s start
somewhere.</p>
<h2 id="preamble-testing-runway">Preamble: testing Runway</h2>
<p>Before looking at papers for the latest ideas, let’s check out <a href="https://runwayml.com/ai-tools/frame-interpolation/">Runway Frame Interpolation</a>. Together with Stability AI and the
CompVis group, Runway researchers were behind <a href="https://en.wikipedia.org/wiki/Stable_Diffusion">Stable Diffusion</a>, and
Runway is at the forefront of deploying generative AI for video.</p>
<p>Let’s test frame interpolation on a sneaky cartoony rabbit sequence. It’s good as a test sequence because it has both
fast/large and slower/smaller movement (so both harder and easier parts.) It also has both “flat 2D” body movement and “3D” head
rotation - one might say too much rotation… But rotation is good to test because it’s a big reason for doing full hand-drawn
animation. Absent rotation, you can split your character into “cut-out” parts, and animate it by <a href="https://duik.rxlab.guide/Angela/index.html">moving and stretching these parts</a>.</p>
<p><img alt="rabbit.gif" height="540" src="https://yosefk.com/img/inbetweening/rabbit.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>We throw away every second frame, ask Runway to interpolate the sequence, and after some conversions and a frame rate
adjustment (don’t ask), we get something like this:</p>
<p><img alt="rabbit-runway.gif" height="540" src="https://yosefk.com/img/inbetweening/rabbit-runway.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>This tool definitely isn’t currently optimized for cartoony motion. Here’s an example inbetween:</p>
<p><img alt="rabbit-runway-inbetween.png" height="540" src="https://yosefk.com/img/inbetweening/rabbit-runway-inbetween.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>Now let’s try a similar sequence with a sneaky me instead of a sneaky rabbit. Incidentally, this is one of several styles I’m
interested in - something between live action and Looney Tunes, with this self-portrait taking live action maybe 15% towards
Looney Tunes:</p>
<p><img alt="myself.gif" height="540" src="https://yosefk.com/img/inbetweening/myself.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>Frame interpolation looks somewhat better here, but it’s still more <em>morphing</em> than <em>moving</em> from pose to
pose:</p>
<p><img alt="myself-runway.gif" height="540" src="https://yosefk.com/img/inbetweening/myself-runway.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>An example inbetween:</p>
<p><img alt="myself-runway-inbetween.png" height="540" src="https://yosefk.com/img/inbetweening/myself-runway-inbetween.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>While the Frame Interpolation tool currently doesn’t work for this use case, I’d bet that Runway could solve the problem
quicker and better than most if they wanted to. Whether there’s a large enough market for this is another question, and it might
depend on the exact definition of “this.” Personally, I believe that a lot of good things in life cannot be “monetized”, a lot
of art-related things are in this unfortunate category, and I’m very prepared to invest time and effort into this without clear,
or even any prospects of making money.</p>
<p>In any case, we’ve got our test sequences, and we’ve got our motivation to look for better performance in recent papers.</p>
<h2 id="raster-frame-representation">Raster frame representation</h2>
<p>There’s a lot of work on AI for image processing/computer vision. It’s natural to borrow techniques from this deeply
researched space and apply them to line art represented as raster images.</p>
<p>There are a few papers doing this; AFAIK the state of the art with this approach is currently <a href="https://arxiv.org/abs/2111.12792">Improving the Perceptual Quality of 2D Animation Interpolation</a> (2022). Their <a href="https://github.com/ShuhongChen/eisai-anime-interpolator">EISAI GitHub repo</a> points to a colab demo and a Docker image
for running locally, which I did, and things basically Just Worked.</p>
<p>That this can even happen blows my mind. I remember how things worked 25 years ago, when you rarely had the code published,
and people implementing computer vision papers would occasionally swear that the paper is outright lying, because the described
algorithms don’t do and couldn’t possibly do what the paper says.</p>
<p>The sequence below shows <em>just</em> inbetweens produced by EISAI. Meaning, frame N is produced from the original frames
N-1 and N+1; there’s not a single original frame here. So this sequence isn’t directly comparable to Runway’s output.</p>
<p><img alt="myself-eisai.gif" height="540" src="https://yosefk.com/img/inbetweening/myself-eisai.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>I couldn’t quite produce the same output with Runway as with the papers (don’t ask.) If you care, this sequence is closer to
being comparable to Runway’s, if not fully apples to apples:</p>
<p><img alt="myself-eisai-comparable-to-runway.gif" height="540" src="https://yosefk.com/img/inbetweening/myself-eisai-comparable-to-runway.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>If you look at individual inbetweens, you’ll see that EISAI and Runway have similar difficulties - big changes between
frames, occlusion and deformation, and both do their best and worst in about the same places. One of the best inbetweens by
EISAI:</p>
<p><img alt="myself-eisai-best.png" height="540" src="https://yosefk.com/img/inbetweening/myself-eisai-best.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>One of the worst:</p>
<p><img alt="myself-eisai-worst.png" height="540" src="https://yosefk.com/img/inbetweening/myself-eisai-worst.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>The inbetweens are produced by <strong>forward-warping based on bidirectional flow estimation</strong>. “Flow estimation”
means computing, per pixel or region in the first keyframe, its most likely corresponding location in the other keyframe -
“finding where it went to” in the other image (if you have “two images of mostly the same thing,” you can hope to find parts
from one in the other.) “Warping” means transforming pixel data - for example, scaling, translating and rotating a region.
“Forward-warping by bidirectional flow estimation” means taking regions from both keyframes and warping them to put them “where
they belong” in the inbetween - which is halfway between a region’s position in the source image, and the position in the other
image that the flow estimation says this region corresponds to.</p>
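<p>To make this concrete, here’s a toy numpy sketch of forward-warping by bidirectional flow - my own illustration, not the paper’s code. It assumes RGB frames of shape (H, W, 3) and flows of shape (H, W, 2) estimated elsewhere, with flow[..., 0] being the x displacement:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np

def forward_warp_midpoint(frame_a, frame_b, flow_ab, flow_ba):
    # splat every pixel of each keyframe halfway along its flow vector,
    # and average the two warped keyframes into the inbetween
    h, w = frame_a.shape[:2]
    out = np.zeros((h, w, 3), np.float32)
    weight = np.zeros((h, w), np.float32)
    for frame, flow in ((frame_a, flow_ab), (frame_b, flow_ba)):
        ys, xs = np.mgrid[0:h, 0:w]
        tx = np.clip((xs + 0.5*flow[..., 0]).round().astype(int), 0, w - 1)
        ty = np.clip((ys + 0.5*flow[..., 1]).round().astype(int), 0, h - 1)
        np.add.at(out, (ty, tx), frame.astype(np.float32))
        np.add.at(weight, (ty, tx), 1.0)
    covered = weight > 0  # pixels nobody warped into stay empty - the "holes"
    out[covered] /= weight[covered][:, None]
    return out.astype(frame_a.dtype), covered
</pre>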
<p>Warping by flow explains the occasional 3-4 arms and legs and 2 heads (it warps a left hand from both input images into two
far-away places in the output image, since the flow estimator found a wrong match, instead of matching the hands to each other.)
This also explains “empty space” patches of various sizes in the otherwise flat background.</p>
<p>Notably, warping by flow “gives up” on cases of occlusion up front (I mean cases where something is visible in one frame and
not in the other due to rotation or any other reason.) If your problem formulation is “let’s find parts of one image in the
other image, and warp each part to the middle position between where it was in the first and where we found it in the second” -
then the <em>correct</em> answer to “where did the occluded part move?” is “I don’t know; I can’t track something that isn’t
there.”</p>
<p>(Note that the system being an “AI” has no impact on this point. You could have a “traditional,” “hardcoded” system for
warping based on optical flow, or a differentiable one with trainable parameters (“AI”.) Let’s say we believe the trainable one
is likely to achieve better results. But training does not sidestep the question of <em>what it is</em> whose parameters are being
trained, and what the model can, or can’t, possibly do once trained.)</p>
<p>When the optical flow matches “large parts” between images correctly, you still have occasional issues due to both images
being warped into the result, with “ghosting” of details of fingers or noses or what-not (meaning, you see two slightly
different drawings of a hand at roughly the same place, and you see one drawing through the other, as if that other drawing was
a semi-transparent “ghost”.) A dumb question coming to my mind is whether this could be improved through brute force, by “increasing
the resolution of the image” / having a “higher-resolution flow estimation,” so you have a larger number of smaller patches
capable of representing the deformations of details, because each patch is tracked and warped separately.</p>
<p>An interesting thing in this paper is <strong>the use of <a href="https://en.wikipedia.org/wiki/Distance_transform">distance
transform</a> to “create” texture for convolutional neural networks to work with for feature extraction.</strong> The distance
transform replaces every pixel value with the distance from that pixel’s coordinates to the closest black pixel. If you
interpret distances as black &amp; white pixel values, this gives “texture” to your line art in a way. The paper cites “Optical
flow based line drawing frame interpolation using distance transform to support inbetweenings” (2019) which also used distance
transform for this purpose.</p>
<p><img alt="distance-transform.png" height="509" src="https://yosefk.com/img/inbetweening/distance-transform.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>If you’re dealing with 2D animation and you’re borrowing image processing/computer vision neural networks (hyperparameters
and maybe even pretrained weights, as this paper does with a few layers of ResNet), you will have the problem of “lack of
texture” - you have these large flat-color regions, and the output of every convolution on each pixel within the region is
obviously exactly the same. Distance transform gives some texture for the convolutions to “respond” to.</p>
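<p>If you want to see the effect yourself, here’s a small sketch using SciPy’s distance transform (the papers use their own variants, but the idea is the same), assuming a grayscale raster with dark lines on a white background:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np
from scipy.ndimage import distance_transform_edt

def texturize_line_art(line_art, threshold=128):
    # per background pixel, compute the distance to the nearest line pixel;
    # line pixels themselves get distance 0
    is_background = line_art > threshold
    dist = distance_transform_edt(is_background)
    # rescale into 0..255 so the result can be viewed as an image
    return (255*dist/max(dist.max(), 1)).astype(np.uint8)
</pre>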
<p>This amuses me in a “machine learning inside joke” sort of way. “But they told me that <em>manual feature engineering</em>
was over in the era of Deep Learning!” I mean, sure, a lot of it is over - you won’t see a paper on “the next <a href="https://en.wikipedia.org/wiki/Scale-invariant_feature_transform">SIFT</a> or <a href="https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients">HOG</a>.” But, apart from the “hyperparameters” (a name
for, basically, the entire network architecture) being manually engineered, and the various manual <a href="https://en.wikipedia.org/wiki/Data_augmentation">data augmentation</a> and what-not, what’s <a href="https://github.com/kornia/kornia">Kornia</a>, if not “a tool for manual feature engineering in a differentiable
programming context”? And I’m not implying that there’s anything wrong with it - quite the contrary, my point is that people
still do this because it works, or at least makes some things work better.</p>
<p>Before we move on to other approaches, let’s check how EISAI does on the rabbit sequence. I don’t care for the rabbit
sequence; I’m selfishly interested in the me sequence. But since unlike Runway, EISAI was trained on animation data, it seems
fair to feed it something more like the training data:</p>
<p><img alt="rabbit-eisai.gif" height="540" src="https://yosefk.com/img/inbetweening/rabbit-eisai.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>Both Runway and EISAI do worse on the rabbit, which has more change in hands and ears and walks a bit faster. It seems that
large movements, deformations and rotations affect performance more than “similarity to training data,” or at least similarity
in a naive sense.</p>
<h2 id="vector-frame-representation">Vector frame representation</h2>
<p>Instead of treating the input as images, you could work on a vector representation of the lines. AFAIK the most recent paper
in this category is <a href="https://arxiv.org/abs/2309.16643">Deep Geometrized Cartoon Line Inbetweening</a> (2023). Their <a href="https://github.com/lisiyao21/AnimeInbet">AnimeInbet GitHub repo</a> lets you reproduce the paper’s results. To run on your
own data, you need to hack the code a bit (at least I didn’t manage without some code changes.) More importantly, you need to
vectorize your input data somehow.</p>
<p>The paper doesn’t come with its own input drawing vectorization system, and arguably shouldn’t, since vector drawing programs
exist, and vectorizing raster drawings is a problem in its own right and outside the paper’s scope. The code in the paper has no
trouble getting input data in a vector representation because their line art dataset is produced from their dataset of moving 3D
characters, rendered with a “toon shader” or whatever the thing rendering lines instead of shaded surfaces is called. And since
the 2D points/lines come from 3D vertices/edges, you’re basically projecting a 3D vector representation into a 2D space and it’s
still a vector representation.</p>
<p><img alt="animeinbet-dataset-characters.png" height="288" src="https://yosefk.com/img/inbetweening/animeinbet-dataset-characters.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>What’s more, <strong>this data set provides a kind of ground truth that you don’t get from 2D animation data sets - namely,
detailed correspondence between the points in both input frames and the ground truth inbetween frame</strong>. If your ground
truth is a frame from an animated movie, you only know that this frame is “the inbetween you expect between the previous frame
and the next.” But here, you know where every 3D vertex ended up in every image!</p>
<p>This correspondence information is used at training time - and omitted at inference time, or it would be cheating. So if you
want to feed data into AnimeInbet, you only need to vectorize this data into points connected by straight lines, without
worrying about vertex correspondence. The paper itself cites <a href="https://github.com/MarkMoHR/virtual_sketching">Virtual
Sketching</a>, itself a deep learning based system, as the vectorization tool they used for their own experiments in one of the
“ablation studies” (I know it’s idiomatic scientific language, but can I just say that I love this expression? “Please don’t
contribute to the project during the next month. We’re performing an ablation study of individual productivity. If the study
proves successful, you shall be ablated from the company by the end of the month.”)</p>
<p>There are comments in the AnimeInbet repo about issues using Virtual Sketching; mine was that some lines partially
disappeared (could be my fault for not using it properly.) I ended up writing some neanderthal-style image processing code <a href="https://en.wikipedia.org/wiki/Topological_skeleton">skeletonizing</a> the raster lines, and then <a href="https://en.wikipedia.org/wiki/Flood_fill">flood-filling</a> the skeleton and connecting the points while flood-filling.
I’d explain this at more length if it was more than a one-off hack; for what it’s worth, I <em>think</em> it’s reasonably
correct for present purposes. (My “testing” is that when I render my vertices and the lines connecting them and eyeball the
result, no obviously stupid line connecting unrelated things appears, and no big thing from the input raster image is clearly
missing.)</p>
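<p>For reference, here’s a minimal sketch of the skeletonization step using scikit-image - not my exact hack (that one is in the repo linked below), and the flood-fill part connecting the points into lines is omitted:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np
from skimage.morphology import skeletonize

def skeleton_points(line_art, threshold=128):
    # line_art: 2D uint8 raster, dark lines on a white background
    lines = line_art &lt; threshold   # True on the (thick) line pixels
    skeleton = skeletonize(lines)   # thin every line to 1 pixel wide
    ys, xs = np.nonzero(skeleton)
    # these are the vertices; connecting them into polylines is the
    # flood-fill part, omitted here
    return np.stack([xs, ys], axis=1)
</pre>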
<p>This hacky “vectorization” code (might need more hacking to actually use) is in the <a href="https://github.com/yosefk/AnimationPapers">Animation Papers GitHub repo</a>, together with other code you might use to run
AnimeInbet on your data.</p>
<p>Results on our test sequences:</p>
<p><img alt="myself-animeinbet.gif" height="576" src="https://yosefk.com/img/inbetweening/myself-animeinbet.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>The rabbit is harder for AnimeInbet, just as it was for the other systems. For example, the ears are completely destroyed by the head
turn, as usual:</p>
<p><img alt="rabbit-animeinbet.gif" height="576" src="https://yosefk.com/img/inbetweening/rabbit-animeinbet.gif" width="576" style="max-width: 100%;height: auto;"></p>
<p>The worst and the best inbetweens occur in pretty much the same frames:</p>
<p><img alt="myself-animeinbet-worst.png" height="576" src="https://yosefk.com/img/inbetweening/myself-animeinbet-worst.png" width="576" style="max-width: 100%;height: auto;"></p>
<p><img alt="myself-animeinbet-best.png" height="576" src="https://yosefk.com/img/inbetweening/myself-animeinbet-best.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>Visually notable aspects of AnimeInbet’s output compared to the previous systems we’ve seen:</p>
<ul>
<li><strong>AnimeInbet doesn’t blur lines</strong>. It might <em>shred</em> lines on occasion, but you don’t <em>blur</em>
vector lines like you blur pixels. (You very much <em>can</em> put a bunch of garbage lines into the output, and AnimeInbet is
pretty good at <em>not</em> doing that, but this capability belongs to our next item. Here we’ll just note that raster-based
systems didn’t quite “learn” to avoid line blurring, which this system avoids by design.)</li>
<li><strong>AnimeInbet seems quite good at matching small details and avoiding ghosting/copying the same thing twice from both
images.</strong> This is not something that can salvage bad inbetweens, but it makes good inbetweens better; in the one above,
the pants and the hands are examples where small detail is matched better than in the raster systems.</li>
<li><strong>For every part, AnimeInbet either finds a match or removes it from the output.</strong> The paper formulates
inbetweening as a graph matching problem (where the line art’s points are the graph’s nodes and the lines connecting them are its
edges.) Parts without a match are marked as invisible - a toy sketch of this “match or drop” logic follows this list. This
doesn’t “solve” occlusion or rotation, but it tends to keep you from putting stuff into the
output that the animator needs to erase and redraw afterwards. This makes good inbetweens marginally better; for bad inbetweens,
it makes them “less funny” but probably not much more usable (you get 2 legs instead of 4, but they’re often <em>not the right
legs;</em> and you can still get a head with two foreheads as in the bad inbetween above.)</li>
</ul>
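<p>Here’s the promised toy sketch of the “match or drop” logic - nothing like the paper’s learned graph matching, just the simplest possible version using mutual nearest neighbors, to show what “parts without a match are removed” means:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np
from scipy.spatial import cKDTree

def match_or_drop(points_a, points_b, max_dist=20.0):
    # points_a, points_b: (N,2) and (M,2) vertex coordinates of two keyframes
    dist, nearest_in_b = cKDTree(points_b).query(points_a)
    _, nearest_in_a = cKDTree(points_a).query(points_b)
    inbetween = []
    for i, j in enumerate(nearest_in_b):
        # accept mutual nearest neighbors that are close enough
        if dist[i] &lt;= max_dist and nearest_in_a[j] == i:
            inbetween.append((points_a[i] + points_b[j]) / 2.0)
    # unmatched vertices are simply absent from the output,
    # like AnimeInbet's "invisible" vertices
    return np.array(inbetween)
</pre>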
<p>The AnimeInbet paper has a comprehensive evaluation of their system vs. other systems (EISAI and VFIformer, as well as FILM and RIFE -
video interpolation rather than specifically animation inbetweening systems.) According to their methodology (where they use
their own test dataset), their system comes out ahead by a large margin. In my extremely small-scale and qualitative testing,
I’d say that it looks better, too, though perhaps less dramatically.</p>
<p>Here we have deep learning with a model and input data set tailored carefully to the problem - something I think you won’t
see as often as papers reusing one or several pretrained networks, and combining them with various adaptations to apply to the
problem at hand. My emotional reaction to this approach appearing to do better than ideas borrowed from “general image/video AI
research” is mixed.</p>
<p>I like “being right” (well, vaguely) about AI <em>not</em> being “general artificial intelligence” but a set of techniques
that you need to apply carefully to build a system for your needs, instead of just throwing data into some giant general-purpose
black box - this is something I like going on about, maybe more than I should given my level of understanding. As a prospective
user/implementer looking for “the next breakthrough paper,” however, it would be better for me if ideas borrowed from “general
video research” worked great, because there’s so many of them compared to the volume of “animation-focused research.”</p>
<p>I mean, Disney already fired its hand-drawn animation department years ago. If the medium is to be revived (and people even
caring about it aren’t getting any younger), it’s less likely to happen through direct investment into animation than as a
byproduct of other, more profitable things. I guess we’ll see how it goes.</p>
<h2 id="applicability-of-2d-feature-matching-techniques">Applicability of “2D feature matching” techniques</h2>
<p>No future improvement of the techniques in these papers can possibly take care of “all of inbetweening,” because occlusion and
rotation happen a lot, and do not fit these papers’ basic approach of <strong>matching 2D features</strong> in the input frames.
And even the best inbetweens aren’t quite usable as is. But they could be used with some editing, and it could be easier to edit
them than draw the whole thing from scratch.</p>
<p>An encouraging observation is that <strong>machines struggle with big changes and people struggle with small changes, so they
can complement each other well</strong>. A human is better at (and less bored by) drawing an inbetween between two keyframes
which look very different than drawing something very close to both input frames and putting every line at juuuuust the right
place. If machines can help handle the latter kind of work, even with some editing required, that’s great!</p>
<p>It’s very interesting to look into approaches that <em>can</em> in fact handle more change between input frames. For example,
check out the middle frame below, generated from the frames on its left and right:</p>
<p><img alt="yoga-mat.png" height="158" src="https://yosefk.com/img/inbetweening/yoga-mat.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>This is from <a href="https://time-reversal.github.io/">Explorative Inbetweening of Time and Space</a> (2024); they say the
code is coming soon. It does have some problems with occlusion (look at the right arm in the middle image.) But it seems to only
struggle when showing something that is occluded <em>in both input frames</em> (for example, the right leg is fine, though it’s
largely occluded in the image on the left.) This is a big improvement over what we’ve seen above, or right below (this is one
frame of Runway’s output, where one right leg slowly merges into the left leg, while another right leg is growing):</p>
<p><img alt="yoga-mat-runway.png" height="321" src="https://yosefk.com/img/inbetweening/yoga-mat-runway.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>But what’s even more impressive - extremely impressive - is that the system decided that <em>the body would go up before
going back down</em> between these two poses! (Which is why it’s having trouble with the right arm in the first place! A feature
matching system wouldn’t have this problem, because it wouldn’t realize that in the middle position, the body would go up, and
the right arm would have to be somewhere. Struggling with things not visible in either input keyframe is a good problem to have
- it’s evidence of knowing these things <em>exist,</em> which demonstrates quite the capabilities!)</p>
<p>This system clearly learned a lot about three-dimensional real-world movement behind the 2D images it’s asked to interpolate
between. Let’s call approaches going in this direction “<strong>3D motion reconstruction</strong>” techniques (and I apologize
if there’s better, standard terminology / taxonomy; I’d use it if I knew it.)</p>
<p>My point here, beyond eagerly waiting for the code in this paper, is that feature matching techniques might remain
interesting in the long term, <em>precisely because “they don’t understand what’s going on in the scene.”</em> Sure, they
clearly don’t learn “how a figure moves or looks like.” But this gives some hope that what they <em>can</em> do - handling small
changes - will work <em>on more kinds of inputs</em>. Meaning, a system that “learned human movement” might be less useful for
an octopus sequence than a system that “learned to match patches of pixels, or graphs of points connected by lines.” So falling
back on 2D feature matching could remain useful for a long time, even once 3D motion reconstruction works great on the kinds of
characters it was trained on.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I think we can agree that animation inbetweening doesn’t quite work at the moment, though it might already be useful for
inbetweening small movements, which is otherwise a painstaking process for a human. I think we can also agree that it’s
reasonable to hope it will be production-ready quite soon, and emerging inbetweening systems which “understand and reconstruct
movement,” beyond “matching image features,” are one reason to be hopeful.</p>
<p>In future installments, I hope to look into more techniques for inbetweening, and the closely related question of what
animators need to control inbetweening, beyond just giving the system two keyframes. <strong>Human inbetweeners certainly get
more input than pairs of keyframes.</strong> This makes me believe that it’s not just the <em>plausibility</em> of the
inbetweens you produce, but their <em>controllability</em> which is going to determine “the winning technique.”</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>I miss the time when they called it machine learning rather than artificial intelligence, and the milder, calmer
economic conditions which were a moderating influence on terminology (in the end, whether it’s called ML or AI is an investors’
preference.) But I’m giving up and calling it AI, since at this point calling it ML is more a readability issue than anything
else.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/the-state-of-ai-for-hand-drawn-animation-inbetweening#comments</comments>
      <pubDate>Wed, 17 Apr 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/the-state-of-ai-for-hand-drawn-animation-inbetweening.feed</wfw:commentRss>
    </item>
    <item>
      <title>A 100x speedup with unsafe Python</title>
      <link>https://yosefk.com/blog/a-100x-speedup-with-unsafe-python.html</link>
      <description><![CDATA[<p>We're going to speed up some numpy code by 100x using "unsafe Python." Which is not quite the same as unsafe Rust, but it's a
bit similar, and I'm not sure what else to call it... you'll see. It's not something you'd use in most Python code, but it's
handy on occasion, and I think it shows "the nature of Python" from an interesting angle.</p>
<p>So let's say you use <a href="https://pyga.me/">pygame</a> to write a simple game in Python.</p>
<p>(Is pygame the way to go today? I'm not the guy to ask; to its credit, it has very simple screen / mouse / keyboard APIs,
and is quite portable because it's built on top of <a href="https://www.libsdl.org/">SDL</a>. It runs on the major desktop
platforms, and with a bit of fiddling, you can run it on Android using <a href="https://buildozer.readthedocs.io/en/latest/">Buildozer</a>. In any case, pygame is just one real-life example where a
problem arises that "unsafe Python" can solve.)</p>
<p>Let us furthermore assume that you're resizing images a lot, so you want to optimize this. And so you discover the somewhat
unsurprising fact that <a href="https://opencv.org/">OpenCV</a>'s resizing is faster than pygame's, as measured by the following
simple microbenchmark:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">from contextlib import contextmanager
import time

@contextmanager
def Timer(name):
    start = time.time()
    yield
    finish = time.time()
    print(<i><b>f'{name} took {finish-start:.4f} sec'</b></i>)

import pygame as pg
import numpy as np
import cv2

IW = 1920
IH = 1080
OW = IW // 2
OH = IH // 2

repeat = 100

isurf = pg.Surface((IW,IH), pg.SRCALPHA)
with Timer(<i><b>'pg.Surface with smoothscale'</b></i>):
    for i in range(repeat):
        <b>pg.transform.smoothscale</b>(isurf, (OW,OH))

def cv2_resize(image):
    return cv2.resize(image, (OH,OW), interpolation=cv2.INTER_AREA)

i1 = np.zeros((IW,IH,3), np.uint8)
with Timer(<i><b>'np.zeros with cv2'</b></i>):
    for i in range(repeat):
        o1 = <b>cv2_resize</b>(i1)
</pre>
<p>This prints:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">pg.Surface with smoothscale took <b>0.2002</b> sec
np.zeros with cv2 took <b>0.0145</b> sec
</pre>
<p>Tempted by the nice 13x speedup reported by the microbenchmark, you go back to your game, and use
<code>pygame.surfarray.pixels3d</code> to get zero-copy access to the pixels as a numpy array. Full of hope, you pass this array
to <code>cv2.resize</code>, and observe that everything got much <em>slower</em>. Dammit! "Caching," you think, "or something.
Never trust a microbenchmark!"</p>
<p>Anyway, just in case, you call cv2.resize on the pixels3d array in your microbenchmark. Perhaps the slowdown will
reproduce?..</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">i2 = <b>pg.surfarray.pixels3d</b>(isurf)
with Timer(<b><i>'pixels3d with cv2'</i></b>):
    for i in range(repeat):
        o2 = cv2_resize(i2)
</pre>
<p>Sure enough, this is very slow, just like you saw in your larger program:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">pixels3d with cv2 took <b>1.3625</b> sec
</pre>
<p>So 7x slower than smoothscale - and more shockingly, almost <strong>100x</strong> slower than cv2.resize called with
numpy.zeros! What gives?! Like, we have two zero-initialized numpy arrays <strong>of the same shape and datatype.</strong> And
of course the resized output arrays have the same shape &amp; datatype, too:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">print(<b><i>'i1==i2 is'</i></b>, np.all(i1==i2))
print(<b><i>'o1==o2 is'</i></b>, np.all(o1==o2))
print(<b><i>'input shapes'</i></b>, i1.shape,i2.shape)
print(<b><i>'input types'</i></b>, i1.dtype,i2.dtype)
print(<b><i>'output shapes'</i></b>, o1.shape,o2.shape)
print(<b><i>'output types'</i></b>, o1.dtype,o2.dtype)
</pre>
<p>Just like you'd expect, this prints that everything is the same:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>i1==i2 is True
o1==o2 is True
input shapes (1920, 1080, 3) (1920, 1080, 3)
input types uint8 uint8
output shapes (960, 540, 3) (960, 540, 3)
output types uint8 uint8</code></pre>
<p>How could a function run 100x more slowly on one array relative to the other, seemingly identical array?.. I mean, you
would hope SDL <em>wouldn't</em> allocate pixels in some particularly slow-to-access RAM area - even though it theoretically
<em>could</em> do stuff like that, with a little help from the kernel (like, create a non-cachable memory area or something.) Or
is the surface stored in GPU memory and we're going thru PCI to get every pixel?!.. It doesn't work this way, <em>does it?</em>
- is there some horrible memory coherence protocol for these things that I missed?.. And if not - if it's the same kind of
memory of the same shape and size with both arrays - what's different that costs us a 100x slowdown?..</p>
<p>It turns out... And I confess that I only found out by accident, after giving up on this<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a> and moving on to something else. Entirely incidentally, that other thing
involved passing numpy data to C code, so I had to learn what this data looks like from C. So, it turns out that the shape and
datatype aren't all there is to a numpy array:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">print(<b><i>'input strides'</i></b>,i1.strides,i2.strides)
print(<b><i>'output strides'</i></b>,o1.strides,o2.strides)
</pre>
<p>Ah, <em>strides.</em> Same in the output arrays, but very different in the input arrays:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">input strides <b>(3240, 3, 1) (4, 7680, -1)</b>
output strides (1620, 3, 1) (1620, 3, 1)
</pre>
<p>As we'll see, this difference between the strides does in fact fully account for the 100x slowdown. Can we fix this? We can,
but first, the post itself will need to seriously slow down to explain these strides, because they're so weird. And then we'll
snatch our 100x right back from these damned strides.</p>
<h2 id="numpy-array-memory-layout">numpy array memory layout</h2>
<p>So, what's a "stride"? A stride tells you how many bytes you have to, well, stride from one pixel to the next. For instance,
let's say we have a 3D array like an RGB image. Then given the array base pointer and the 3 strides, the address of
<code>array[x,y,z]</code> will be <code>base + x*xstride + y*ystride + z*zstride</code> (where with images, z is one of 0, 1 or
2, for the 3 channels of an RGB image.)</p>
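<p>A tiny sanity check of this formula, if you want to play with it (a standalone demo, not part of the benchmark):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np

arr = np.zeros((4, 3, 3), np.uint8)
print(arr.strides)  # (9, 3, 1) - numpy's default layout for this shape

x, y, z = 2, 1, 2
xs, ys, zs = arr.strides
# address of arr[x,y,z] per the formula above...
addr = arr.ctypes.data + x*xs + y*ys + z*zs
# ...matches the data pointer of a view starting at that element
assert addr == arr[x:, y:, z:].ctypes.data
</pre>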
<p>In other words, <strong>the strides define the layout of the array in memory</strong>. And for better or worse, <strong>numpy
is very flexible with respect to what this layout might be</strong>, because it supports many different stride values for a
given array shape &amp; datatype.</p>
<p>The two layouts at hand - numpy's default layout, and SDL's - are... well, I don't even know which of the two offends me
more. As you can see from the stride values, the layout numpy uses by default for a 3D array is
<code>base + x*3*height + y*3 + z</code>.</p>
<p><img alt="numpy-layout.png" height="178" src="https://yosefk.com/img/numpy-perf/numpy-layout.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>This means that the RGB values of a pixel are stored in 3 adjacent bytes, and the pixels of a <em>column</em> are stored
contiguously in memory - a <a href="https://en.wikipedia.org/wiki/Row-_and_column-major_order">column-major order</a>. And I,
for one, find this <em>offensive</em>, because images are <em>traditionally</em> stored in a row-major order, in particular,
image sensors send them this way (and <em>capture</em> them this way, as you can see from the <a href="https://en.wikipedia.org/wiki/Rolling_shutter">rolling shutter</a> - every <em>row</em> is captured at a slightly
different time, not every <em>column</em>.)</p>
<p>"Why, we <em>do</em> follow that respected tradition as well," say popular numpy-based image libraries. "See for yourself -
save an array of shape <code>(1920, 1080)</code> to a PNG file, and you'll get a 1080x1920 image." Which is true, and of course
makes it even worse: if you index with <code>arr[x,y]</code>, then x, aka dimension zero, actually corresponds to <em>the
vertical dimension</em> in the corresponding PNG file, and y, aka dimension one, corresponds to <em>the horizontal
dimension.</em> And thus numpy array columns correspond to PNG image rows. Which makes the numpy image layout "row-major" in
some sense, at the cost of x and y having the opposite of their usual meaning.</p>
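<p>You can convince yourself with a throwaway snippet (assuming imageio, which isn't otherwise used in this post):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np
import imageio

a = np.zeros((1920, 1080), np.uint8)
a[0, :] = 255  # arr[x=0, y] - "dimension zero"
imageio.imwrite('stripe.png', a)
# the result is a 1080x1920 PNG, and the white stripe runs along its
# top edge: numpy's dimension zero became the PNG's vertical dimension
</pre>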
<p>...Unless you got your numpy array from a pygame Surface object, in which case x actually <em>does</em> index into the
horizontal dimension. And so saving <code>pixels3d(surface)</code> with, say, imageio will produce a <em>transposed</em> PNG
relative to the PNG created by <code>pygame.image.save(surface)</code>. And in case adding <em>that</em> insult to the injury
wasn't enough, cv2.resize gets a <code>(width, height)</code> tuple as the destination size, producing an output array of shape
<code>(height, width)</code>.</p>
<p>Against the backdrop of these insults and injuries, SDL has an inviting, civilized-looking layout where x is x, y is y, and
the data is stored in an honest row-major order, for all the meanings of "row." But then upon closer look, the layout just
tramples all over my feelings: <code>base + x*4 + y*4*width - z</code>.</p>
<p>Like, the part where we have 4 in the strides instead of 3 as expected for an RGB image - I can get that part. We did ask for
an <em>RGBA</em> image, with an alpha channel, when we passed <code>SRCALPHA</code> to the Surface constructor. So I guess it
keeps the alpha channel together with the RGB pixels, and the 4 in the strides is needed to skip the As in RGBA. But then why,
may I ask, are there separate <code>pixels3d</code> and <code>pixels_alpha</code> functions? It's always annoying to have to
deal with RGB and alphas separately when using numpy with pygame surfaces. Why not a single <code>pixels4d</code>
function?..</p>
<p>But OK, the 4 instead of the 3 I could live with. But a zstride of -1? MINUS ONE? You start at the address of your Red pixel,
and to get to Green, you walk back one byte?! Now you're just fucking with me.</p>
<p>It turns out that SDL supports both RGB and BGR layout (in particular, apparently surfaces loaded from files are RGB, and
those created in memory are BGR?.. or is it even hairier than this?..) And if you use pygame's APIs, you needn't worry about RGB
vs BGR, the APIs handle it transparently. If you use <code>pixels3d</code> for numpy interop, you <em>also</em> needn't worry
about RGB vs BGR, because numpy's flexibility with strides lets pygame give you an array that <em>looks</em> like RGB despite it
being BGR in memory. For that, z stride is set to -1, and the base pointer of the array points to the first pixel's red value -
two pixels ahead of where the array memory starts, which is where the first pixel's <em>blue</em> value is.</p>
<p><img alt="SDL-layout.png" height="178" src="https://yosefk.com/img/numpy-perf/SDL-layout.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>Wait a minute... <strong><em>now</em> I get why we have pixels3d and pixels_alpha but no pixels4d!!</strong> Because SDL has
RGBA and BGRA images - <em>BGRA, not ABGR</em> - and you can't make BGRA data look like an RGBA numpy array, no matter what
weird values you use for strides. There's a limit to layout flexibility... or rather, there really isn't any limit beyond the
limits of computability, but thankfully numpy stops at configurable strides and doesn't let you specify a generic callback
function <code>addr(base, x, y, z)</code> for a fully programmable layout<a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>.</p>
<p>So to support RGBA and BGRA transparently, pygame is forced to give us 2 numpy arrays - one for RGB (or BGR, depending on the
surface), and another for the alpha. And these numpy arrays have the right <em>shape</em>, and let us access the right
<em>data</em>, but their <em>layouts</em> are very different from normal arrays of their shape.</p>
<p>And <strong>different memory layout can definitely explain major differences in performance</strong>. We could try to figure
out exactly why the performance difference is almost 100x. But when possible, I prefer to just get rid of garbage, rather than
study it in detail. So instead of understanding this in depth, we'll simply show that the layout difference indeed accounts for
the 100x - and then get rid of the slowdown <em>without</em> changing the layout, which is where "unsafe Python" finally comes
in.</p>
<p>How can we show that the layout alone, and not some other property of the pygame Surface data (like the memory it's allocated
in) explains the slowdown? We can benchmark cv2.resize on a numpy array we create ourselves, with the same layout as
<code>pixels3d</code> gives us:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><i># create a byte array of zeros, and attach
# a numpy array with pygame-like strides
# to this byte array using the buffer argument.</i>
i3 = np.ndarray((IW, IH, 3), np.uint8,
        <b>strides=(4, IW*4, -1),</b>
        buffer=b'\0'*(IW*IH*4),
        offset=2) <i># start at the "R" of BGR</i>

with Timer(<b><i>'pixels3d-like layout with cv2'</i></b>):
    for i in range(repeat):
        o2 = <b>cv2_resize</b>(i3)
</pre>
<p>Indeed this is about as slow as we measured on pygame Surface data:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">pixels3d-like layout with cv2 took <b>1.3829</b> sec
</pre>
<p>Out of curiosity, we can check what happens if we merely copy data between these layouts:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">i4 = np.empty(i2.shape, i2.dtype)
with Timer(<b><i>'pixels3d-like copied to same-shape array'</i></b>):
    for i in range(repeat):
        <b>i4[:] = i2</b>

with Timer(<b><i>'pixels3d-like to same-shape array, copyto'</i></b>):
    for i in range(repeat):
        <b>np.copyto</b>(i4, i2)
</pre>
<p>Both the assignment operator and <code>copyto</code> are very slow, almost as slow as cv2.resize:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">pixels3d-like copied to same-shape array took <b>1.2681</b> sec
pixels3d-like to same-shape array, copyto took <b>1.2702</b> sec
</pre>
<h2 id="fooling-the-code-into-running-faster">Fooling the code into running faster</h2>
<p>What can we do about this? We can't change the layout of pygame Surface data. And we <em>seriously</em> don't want to copy
the C++ code of cv2.resize, with its various platform-specific optimizations, to see if we can adapt it to the Surface layout
without losing performance. <strong>What we <em>can</em> do is feed Surface data to cv2.resize using an array <em>with numpy's
default layout</em></strong> (instead of straightforwardly passing the array object returned by pixels3d.)</p>
<p>Not that this would actually <em>work</em> with any given function, mind you. But it <em>will</em> work specifically with
resizing, because it doesn't really care about certain aspects of the data, which we're incidentally going to blatantly
misrepresent:</p>
<ul>
<li>Resizing code doesn't care if a given channel represents red or blue. (Unlike, for instance, converting RGB to greyscale,
which <em>would</em> care.) If you give it BGR data and lie that it's RGB, the code will produce the same result as it would
given actual RGB data.</li>
<li>Similarly, it doesn't matter for resizing which array dimension represents width, and which is height.</li>
</ul>
<p>Now, let's take another look at the memory representation of pygame's BGRA array of shape <code>(width, height)</code>.</p>
<p><img alt="SDL-layout.png" height="178" src="https://yosefk.com/img/numpy-perf/SDL-layout.png" width="576" style="max-width: 100%;height: auto;"></p>
<p>This representation is actually the same as an RGBA array of shape <code>(height, width)</code> with numpy's default strides!
I mean, not really - if we reinterpret this data as an RGBA array, we're treating red channel values as blue and vice versa.
Likewise, if we reinterpret this data as a <code>(height, width)</code> array with numpy's default strides, we're implicitly
transposing the image. But resizing wouldn't care!</p>
<p>And, as an added bonus, we'd get a single RGBA array, and resize it with one call to cv2.resize, instead of resizing pixels3d
and pixels_alpha separately. Yay!</p>
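<p>Before the real code, here's a tiny standalone demonstration of this equivalence on fake data - the same bytes read through pygame-style strides and through numpy's default strides:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import numpy as np

w, h = 3, 2
raw = np.arange(w*h*4, dtype=np.uint8).tobytes()  # fake BGRA pixel data

# pygame-style view: shape (w, h, 3), BGR presented as RGB via zstride -1
bgr_as_rgb = np.ndarray((w, h, 3), np.uint8, buffer=raw,
                        strides=(4, w*4, -1), offset=2)
# default-strides view of the very same bytes: shape (h, w, 4)
rgba = np.ndarray((h, w, 4), np.uint8, buffer=raw)

# channel 0 of the "RGBA" view really holds blue - it matches the
# RGB view's channel 2, transposed (and channel 2 matches red)
assert np.array_equal(rgba[:, :, 0].T, bgr_as_rgb[:, :, 2])
assert np.array_equal(rgba[:, :, 2].T, bgr_as_rgb[:, :, 0])
</pre>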
<p>Here's code taking a pygame surface and returning the base pointer of the underlying RGBA or BGRA array, and a flag telling
whether it's BGR or RGB:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">import ctypes

def arr_params(surface):
    pixels = pg.surfarray.pixels3d(surface)
    width, height, depth = pixels.shape
    assert depth == 3
    xstride, ystride, zstride = pixels.strides
    oft = 0
    bgr = 0
    if <b>zstride == -1</b>: <i># BGR image - walk back
        # 2 bytes to get to the first blue pixel</i>
        <b>oft = -2</b>
        zstride = 1
        bgr = 1
    <i># validate our assumptions about the data layout</i>
    assert xstride == 4
    assert zstride == 1
    assert ystride == width*4
    base = <b>pixels.ctypes.data_as</b>(ctypes.c_void_p)
    ptr = ctypes.c_void_p(base.value + oft)
    return ptr, width, height, bgr
</pre>
<p>Now that we have the underlying C pointer to the pixel data, we can wrap it in a numpy array with the default strides,
implicitly transposing the image and swapping the R &amp; B channels. <strong>And once we "attach" a numpy array with default
strides to both the input and the output data, our call to cv2.resize will run 100x faster!</strong></p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">def rgba_buffer(p, w, h):
    <i># attach a WxHx4 buffer to the base pointer</i>
    type = ctypes.c_uint8 * (w * h * 4)
    return <b>ctypes.cast(p, ctypes.POINTER(type)).contents</b>

def <b>cv2_resize_surface</b>(src, dst):
    iptr, iw, ih, ibgr = arr_params(src)
    optr, ow, oh, obgr = arr_params(dst)

    <i># our trick only works if both surfaces are BGR,
    # or they're both RGB. if their layout doesn't match,
    # our code would actually swap R &amp; B channels</i>
    <b>assert ibgr == obgr</b>

    ibuf = rgba_buffer(iptr, iw, ih)

    <i># numpy's default strides for shape (h,w,4) are w*4, 4, 1</i>
    iarr = np.ndarray(<b>(ih,iw,4)</b>, np.uint8, <b>buffer=ibuf</b>)
    
    obuf = rgba_buffer(optr, ow, oh)

    oarr = np.ndarray(<b>(oh,ow,4)</b>, np.uint8, <b>buffer=obuf</b>)

    <b>cv2.resize</b>(iarr, (ow,oh), oarr, interpolation=cv2.INTER_AREA)
</pre>
<p>Sure enough, we finally get a speedup instead of a slowdown from using cv2.resize on Surface data, and we're as fast as
resizing an RGBA numpy.zeros array (where originally we benchmarked an <em>RGB</em> array, not RGBA):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">osurf = pg.Surface((OW,OH), pg.SRCALPHA)
with Timer(<b><i>'attached RGBA with cv2'</i></b>):
    for i in range(repeat):
        <b>cv2_resize_surface</b>(isurf, osurf)

i6 = np.zeros((IW,IH,4), np.uint8)
with Timer(<b><i>'np.zeros RGBA with cv2'</i></b>):
    for i in range(repeat):
        o6 = <b>cv2_resize</b>(i6) 
</pre>
<p>The benchmark says we got our 100x back:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">attached RGBA with cv2 took <b>0.0097</b> sec
np.zeros RGBA with cv2 took <b>0.0066</b> sec
</pre>
<p>All of the ugly code above is <a href="https://github.com/yosefk/BlogCodeSamples/blob/main/numpy-perf.py">on GitHub</a>.
Since this code is ugly, you can't be sure it actually resizes the image correctly, so there's some more code over there that
tests resizing on non-zero images. If you run it, you will get the following gorgeous output image:</p>
<p><img alt="resized.png" height="540" src="https://yosefk.com/img/numpy-perf/resized.png" width="960" style="max-width: 100%;height: auto;"></p>
<p>Did we really get a 100x speedup? It depends on how you count. We got cv2.resize to run 100x faster relative to calling it
straightforwardly with the pixels3d array. But specifically for resizing, pygame has smoothscale, and our speedup relative to
it is 13-15x. There are some more benchmarks on GitHub for functions other than resize, some of which don't have a corresponding
pygame API:</p>
<ul>
<li>Copying with <code>dst[:] = src</code>: <strong>28x</strong></li>
<li>Inverting with <code>dst[:] = 255 - src</code>: <strong>24x</strong></li>
<li><code>cv2.warpAffine</code>: <strong>12x</strong></li>
<li><code>cv2.medianBlur</code>: <strong>15x</strong></li>
<li><code>cv2.GaussianBlur</code>: <strong>200x</strong></li>
</ul>
<p>So not "exactly" 100x, though I feel it's fair enough to call it "100x" for short.</p>
<p>In any case, I'd be surprised if enough people use SDL from Python for this specific issue to be broadly relevant. But I'd
guess that numpy arrays with weird layouts come up in other places, too, so this kind of trick might be relevant elsewhere.</p>
<h2 id="unsafe-python">"Unsafe Python"</h2>
<p>The code above uses "the C kind of knowledge" to get a speedup (Python generally hides data layout from you, whereas C
proudly exposes it.) It also, unfortunately, has the memory (un)safety of C - we get a C base pointer to the pixel data, and
from that point on, if we mess up the pointer arithmetic, or use the data after it was already freed, we're going to crash or
corrupt data. And yet we wrote no C code - it's all Python.</p>
<p><a href="https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html">Rust has an "unsafe" keyword</a> where the compiler forces
you to realize that you're calling an API which voids the normal safety guarantees. But the Rust compiler <em>doesn't</em> make
you mark your function as "unsafe" just because you have an unsafe block in that function. Rather, it trusts <em>you</em> to
decide whether your function is itself unsafe or not.</p>
<p>(In our example, <code>cv2_resize_surface</code> is a safe API, assuming I don't have a bug, because none of the horror
escapes into the outside world - outside, we just see that the output surface was filled with the output data. But
<code>arr_params</code> is a completely unsafe API, since it returns a C pointer that you can do anything with. And
<code>rgba_buffer</code> is <em>also</em> unsafe - although we return a numpy array, a "safe" object, nothing prevents you from
using it after the data was freed, for example. In the general case, no static analysis can tell whether you've built something
safe from unsafe building blocks or not.)</p>
<p>Python doesn't have an <code>unsafe</code> keyword - which is in character for a dynamic language with sparse static
annotation. But otherwise, Python + <code>ctypes</code> + C libraries is sort of similar in spirit to Rust with
<code>unsafe</code>. The language is safe by default, but you have your escape hatch when you need it.</p>
<p>"Unsafe Python" exemplifies a general principle: <strong>there's <em>a lot</em> of C in Python</strong>. C is Python's evil
twin, or, in chronological order, Python is C's good-natured twin. C gives you performance, and doesn't care about usability or
safety; if any of the footguns go off, tell it to your healthcare provider, C isn't interested. Python on the other hand gives
you safety, and it's based on <a href="https://en.wikipedia.org/wiki/ABC_(programming_language)">a decade's worth of
research</a> into usability for beginners. It doesn't, however, care about performance. They're both optimized aggressively for
two opposite goals, at the cost of ignoring the other's goals.</p>
<p>But on top of that, Python was built with C extensions in mind from the start. Today, from my point of view, <em>Python
functions as a packaging system</em> for popular C/C++ libraries. I have way less appetite for downloading and building OpenCV
to use it from C++ than <code>pip install</code>ing OpenCV binaries and using them from Python, because C++ doesn't have a
standard package management system, and Python does. There are a lot of high-performance libraries (for instance in scientific
computing and deep learning) with more code calling them in Python than in C/C++. And on the other hand, if you want seriously
optimized Python code and a small deployment footprint / low startup time, you'd use <a href="https://cython.org/">Cython</a> to
produce an extension "as if written in C" to spare the overhead of an otherwise "more Pythonic" JIT-based system like <a href="https://numba.pydata.org/">numba</a>.</p>
<p>Not only is there a lot of C in Python, but, being opposites of sorts, they complement each other fairly well. A good way to
make Python code fast is using C libraries in the right ways. Conversely, a good way to use C safely is to write the core in C
and a lot of the logic on top of it in Python. The Python &amp; C/C++/Rust mix - either a C program with a massive Python
extension API, or a Python program with all of the heavy lifting done in C - seems quite dominant in high-performance, numeric,
desktop / server areas. And while I'm not sure this fact is very inspiring, I think it's a fact<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>, and things will stay this way for a long time.</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>This is what happens when you do stuff for fun, or just in a small team. If I was getting paid to work on this,
I'd keep looking into it until figuring it out, at least if the team was large enough to not have to worry that this would delay
more critical work too much. Makes one think, though I'm not sure <em>what</em> I think about this, all things considered.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>Thankfully, because the existing layout flexibility "only" gives us a 100x slowdown, where with a callback, it
could easily go to 10000x.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>I'm not that good in this particular area, and I'd be happy to hear the thoughts of more experienced people on
what to use these days to implement something like Krita or Blender. I sort of lean towards "a Python program with C/C++/Rust
libraries" rather than "a C++/Rust program with a Python extension API," because, funnily enough, C++ is <em>too unsafe</em> and
Rust is <em>too safe</em> for quickly iterating on a large, complicated code base - so I'd want to keep most of the code doing
lots of little things in Python, and use C/C++/Rust for optimized production code doing well-understood heavy lifting kinds of
stuff. But this way of structuring your program is at most moderately popular, and I wonder if I'm missing something.<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/a-100x-speedup-with-unsafe-python#comments</comments>
      <pubDate>Sat, 04 May 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/a-100x-speedup-with-unsafe-python.feed</wfw:commentRss>
    </item>
    <item>
      <title>Profiling with Ctrl-C</title>
      <link>https://yosefk.com/blog/profiling-with-ctrl-c.html</link>
      <description><![CDATA[<p>I once wrote about <a href="https://yosefk.com/blog/how-profilers-lie-the-cases-of-gprof-and-kcachegrind.html">how profiler
output can be misleading</a>. Someone <a href="https://yosefk.com/cgi-bin/comments.cgi?post=blog/how-profilers-lie-the-cases-of-gprof-and-kcachegrind#comment-1847">commented</a>
that you don’t need profilers - just Ctrl-C your program in a debugger instead, and you’ll see the call stack where your program
probably spends most of its time. I admit that I sneered at the idea at the time, because, despite those comments’ almost
aggressive enthusiasm, this method doesn’t actually work on the hard problems. But as my outlook on life worsened with age, I
came to think that Ctrl-C profiling deserves a shout-out, because it’s very effective against stupid problems encountered by
lazy people operating in unfriendly environments.</p>
<p>I mean, I’ve tended to dismiss the stupid problems and focus on the hard ones, but is this a good approach in the real world?
Today I’m quite ready to accept that most of life is stupid problems encountered by lazy people operating in unfriendly
environments. Certainly, one learning experience was becoming such a person myself, by stepping into a senior management role<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a> and then going back to programming after a few
years. Now I’m lazy because I got used to not doing anything myself, and I’m in an environment which is unfriendly to me,
because I forgot how things work, or they no longer work the way they used to. And while I’m a bit ashamed to admit this as
someone who’s developed several profilers himself, I’m often not really in the mood to figure out how to use a profiler in a
given setting.</p>
<p>But, here’s a program taking a minute to start up. Well, only in the debug build; this must be why nobody fixed it, but we
really should, it sucks to wait for a full minute every time you rebuild &amp; rerun. So I Ctrl-C the thing, and what do you
know, there’s one billion stack frames from the <a href="https://github.com/nlohmann/json">nlohmann JSON parser</a>, I guess it
all gets inlined in the release build; must be what they call “zero-cost abstraction”<a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>. Another Ctrl-C, another call stack, coming from a different place but again
ending up parsing JSON. And I don’t know what the fix was - a different JSON parser, or compiling some code with optimizations
even in the debug build - but someone fixed it after my Ctrl-C-based report.</p>
<p>Or let’s say I’m trying to switch to the LLD linker from gold, to speed up the linking. Why not the even faster <a href="https://github.com/rui314/mold">mold</a>? - because I’m on MIPS, and mold doesn’t support MIPS. But LLD is pretty fast,
too; the core was written by <a href="https://github.com/rui314">the same person</a>, after all. And then I open a core dump
from a binary linked with LLD, and gdb is <em>really</em> slow. Hmm. It should have been <em>faster</em>, actually, because I’ve
also added <code>--gdb-index</code>, which tells the linker to create, I guess, some index for gdb, making gdb faster than its
slow default behavior, which is reserved for the unfortunate people who don’t know the cool flags. But I’m not seeing faster,
I’m seeing slower. What gives?</p>
<p>So, I run gdb under gdb, and Ctrl-C it while it’s struggling with the core dump. There’s some callstack with
<code>dwarf_decode_macro_bytes</code>. Google quickly brings up some relevant issues, such as “<a href="https://sourceware.org/bugzilla/show_bug.cgi?id=24624">Using -ggdb3 and linking with ld.lld leads to cpu/memory hog in
gdb</a>” (Status: <strong>UNCONFIRMED</strong>) and “<a href="https://bugs.llvm.org/show_bug.cgi?id=42030">lld doesn't generate
DW_MACRO_import like ld.bfd does</a>” (Status: <strong>RESOLVED WONTFIX</strong>.)</p>
<p>Apparently gcc generates some DWARF data that gdb is slow to handle. The GNU linker fixes this data, so that gdb doesn’t end
up handling it slowly. LLD refuses to emulate this behavior of the GNU linker, because it’s gcc’s fault for producing that
DWARF data in the first place. And gdb refuses to handle LLD’s output efficiently, because it’s LLD’s fault for not handling
gcc’s output the way the GNU linker does. So I just remove <code>-ggdb3</code> - it gives you a bit richer debug info, but it’s
not worth the slower linking with gold instead of LLD, nor the slowdown in gdb that you get with LLD. And everyone links happily
ever after.</p>
<p>Which goes to show that <strong>Ctrl-C profiling is often enough to solve a simple problem, and it’s usually much easier than
learning how to use a profiler and how to properly read its output</strong>. You can connect a debugger to almost anything, all
the way down to some chip with nothing like a standard OS that could work with a standard profiler. You can connect a debugger
to almost anything especially if it’s slow - for example, maybe it’s hard to actually invoke the program under gdb because its
invocation is buried somewhere very deep, but if it’s slow, you can <code>gdb /proc/$pid/exe $pid</code> after it was
started.</p>
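<p>And if pressing Ctrl-C by hand feels too artisanal, the whole technique fits in a few lines around gdb’s batch mode - a
sketch, assuming you fill in the PID of whatever slow process you’re after:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>import subprocess, time

pid = 12345  # the process you'd otherwise Ctrl-C by hand
for _ in range(5):
    # attach, print the stack, detach
    stack = subprocess.run(
        ["gdb", "--batch", "-p", str(pid), "-ex", "bt"],
        capture_output=True, text=True,
    ).stdout
    print(stack)
    time.sleep(2)  # the "sampling frequency," Ctrl-C style</code></pre>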
<p>A debugger also needs less to work with than a profiler. Unlike <a href="https://perf.wiki.kernel.org/index.php/Tutorial">perf</a>, gdb will give you a callstack even if <a href="https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html">the program was compiled without frame
pointer support</a>. And you certainly don’t need a special build, like <a href="https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_mono/gprof.html">gprof</a>’s <code>-pg</code>, or to run on a slow
simulator, like <a href="https://valgrind.org/docs/manual/cl-manual.html">callgrind</a> / <a href="https://kcachegrind.github.io/html/Home.html">KCachegrind</a>. And then the output of a profiler might be easy to
misinterpret - and I’ve only scratched the surface <a href="https://yosefk.com/blog/how-profilers-lie-the-cases-of-gprof-and-kcachegrind.html">the last time I wrote about it</a>.
Eyeballing a few callstacks is more straightforward.</p>
<p>Why then do we need profilers at all? Here is a very partial list of reasons, in no particular order.</p>
<p>Let’s say, completely hypothetically, that you’ve switched to the LLD linker, and your program is now 2-3% slower. If you
Ctrl-C it, you’ll see the same callstacks as with the version linked with gold. But if you have a profiler running on a
simulator, similarly to callgrind, then you can find the functions with the most slowdown - and they might not be the ones
taking the most time overall, they just have the most slowdown relatively to the old version - and then <strong>you can look at
the assembly listings and see how much time was spent running each instruction</strong>. And then you’ll see that the new
version has branch-to-address-from-register instructions where the old version had branch-to-constant-offset instructions.</p>
<p>Then you will learn about MIPS “relocation relaxation” (used also in RISC-V AFAIK.) The compiler “assumes the worst” and
generates code loading a function address into a register, and then jumping to the address stored in that register. Then, if
you’re lucky, the linker realizes that it has actually placed the function close enough to the caller for that caller to branch
to the function using a constant offset. (Fixed-sized RISC branch instructions cannot encode constant offsets larger than a
certain value, so the function needs to be close enough to the caller for the distance to fit into the offset encoding.) And
then the linker “relaxes” the expensive branch-from-register instruction into a cheaper branch-to-constant-offset instruction.
And it turns out that the LLD version you’re using doesn’t implement relocation relaxation.</p>
<p>Of course you, or should I say me, wouldn’t need that very, very fancy simulator-based profiler if you weren’t the idiot
using LLD 9 when LLD 14 was already available, with relocation relaxation implemented back in LLD 10. (I wish I’d saved the
discussion in the mailing list around this patch; now I can’t find it anywhere. There was nobody confident enough in their MIPS
knowledge to review the patch, but you don’t merge patches without a review, do you? There was even a message saying “Happy
anniversary to the relocation relaxation patch!” a year after it was submitted without having been merged. Eventually someone
said something like “we have to either merge or reject it, or we’re being rude” and someone else said “well, the patch author
knows MIPS better than any of us, so let’s just merge it.”)</p>
<p>But, despite having been an idiot here, I maintain that you don’t have to be an idiot to have this sort of problem, which a
profiler will help solve, and Ctrl-C profiling will not.</p>
<p>The broader issue is that <strong>Ctrl-C is essentially a sampling profiler</strong> - one with an unusually low sampling
frequency, but a sampling profiler nonetheless. Very small changes spread across a program are obviously invisible to a sampling
profiler. Also, <strong><a href="https://danluu.com/perf-tracing/">sampling profilers are bad at tail latency</a></strong> - if
something is usually fast but occasionally slow, you won’t be there to Ctrl-C it when it’s slow. (Of course, if “slow” means 100
ms instead of the usual 25 ms, you wouldn’t manage to Ctrl-C it in time even if you were there - that low sampling frequency
comes with some downsides.)</p>
<p>Systems involving many threads, processes or machines… our esteemed “random pausing” technique, aka Ctrl-C profiling, is
often not great to use with these. And at this point I feel that the idea of replacing all of the various profilers with Ctrl-C
is too ridiculous to bother with more counterarguments.</p>
<p><strong>But, there are many various kinds of profilers, making it a question which kind to use, and how much legwork finding
the problem will take on top of using it</strong>. Simulation-based profilers don’t have the problem of losing data to a low
sampling frequency - they analyze full instruction traces - but they’re too slow for anything like a production environment. So
you might need some measurements that you can run in production, and then a way to rerun the program on the simulator using
inputs that were observed to cause a slowdown in production based on these measurements. Tracing profilers like <a href="https://www.kernel.org/doc/html/v5.0/trace/ftrace.html">ftrace</a> / <a href="https://kernelshark.org/Documentation.html">KernelShark</a> are great for looking at examples of tail latency, but they
will not reliably take you to the places in the code where the time is spent. Sampling profilers can run in production and take
you to the right place in the code, but they’re a poor match for code that runs slowly but only occasionally, and even worse for
code that occasionally gets stuck waiting for something. And most of these tools have a bunch of non-trivial prerequisites,
config knobs and likely ways to misread their output.</p>
<p>Conversely, Ctrl-C in a debugger is easy, makes you look very effective when it actually works, and costs almost nothing to
try even when it doesn’t really help in the end. What’s not to like?</p>
<p>I often find myself recommending something primitive or ugly, which <a href="https://yosefk.com/blog/refix-fast-debuggable-reproducible-builds.html#anti-thesis-sed---nasty-brutish-and-short">might
actually do better than the “proper” approach</a>, or it might have <a href="https://yosefk.com/blog/dont-ask-if-a-monorepo-is-good-for-you-ask-if-youre-good-enough-for-a-monorepo.html">less risky
failure modes in the hands of typical users</a>, or it might be <a href="https://yosefk.com/blog/how-to-make-a-heap-profiler.html">easier to tailor to your needs than a more elaborate
solution</a>. “Profile with Ctrl-C” fits right in - certainly very primitive, yet often compares surprisingly favorably with
more sophisticated alternatives. And therefore, I must give Ctrl-C profiling my warmest endorsement!</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>In my Russian-speaking mind, “stepping into” is strongly associated with “stepping into shit.” I’m not sure
there’s an idiomatic English synonym for stepping into something strongly implying that this something is shit; there should be
- it’s a very useful thing to have in a language.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>“Zero-cost abstraction” is a figure of speech popular with people who don’t consider time spent compiling,
deciphering compiler errors, debugging, or running the debug build as a “cost.” It would be more accurate to call it “zero cost
in production machine resources,” though even that is quite often incorrect.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/profiling-with-ctrl-c#comments</comments>
      <pubDate>Tue, 25 Jun 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/profiling-with-ctrl-c.feed</wfw:commentRss>
    </item>
    <item>
      <title>Advantages of incompetent management</title>
      <link>https://yosefk.com/blog/advantages-of-incompetent-management.html</link>
      <description><![CDATA[<p>What constitutes managerial competence? As a vague starting point for an answer, we could say that competent management sets
achievable objectives and then achieves them, by organizing and incentivizing the necessary work.</p>
<p>It turns out that even this near-tautological banality is enough to see why competent management puts many desirable things
out of reach. This becomes apparent when looking at examples where incompetent management does better than most well-run places
can hope for.</p>
<h2 id="efficiency">Efficiency</h2>
<p>Improving efficiency tends to be against the interest of most people in an org, because it’s equivalent to shrinking your
budget. Here’s what I’m told is a true story about how things work with actual budgets. A relatively inexperienced VP attends a
meeting where senior management is asked to shrink their budgets due to the adverse economic climate brought about by the
coronavirus pandemic. He eagerly cuts his equipment budget from $10 million to $6 million - over the loud and desperate
objections of his team (whom the VP nearly accuses of lacking patriotism, loyalty to the company and commitment to the common
good.)</p>
<p>Next year, the coronavirus mutates some more, and profits go back up. Our VP submits a $10 million equipment budget to the
finance department, where they cheerfully inform him that the extra $4 million will not go over well with the CEO. Why, a 66%
increase over last year’s $6 million!</p>
<p>Wait a minute, thinks the VP, a sensation running through his whole body of rapidly gaining that invaluable experience which
he so sorely lacked. I voluntarily cut 40% of my budget - a share way larger than anyone else - due to an unforeseen,
extraordinary emergency. And now I’m rewarded with this cut becoming permanent?.. I see. Well, I’m always eager to learn.</p>
<p>This year being already lost, he quietly resubmits a $6 million budget (approved more swiftly by the CEO than any other,
thanks to zero YoY growth.) Next year, he uses some real or perceived crisis to increase this budget to $20 million. And now he
has learned how to operate in a well-run company.</p>
<p>Of course you could say that this is a <em>badly</em> run company, and to avoid arguing what that means, let’s stick to the
definition of managerial competence as the ability to set and achieve objectives. <strong>Whatever objective you are expected to
achieve, a bigger budget makes it easier</strong>. And while asking for more resources gets you yelled at, the yelling is for
show, and ends once the budget increase is approved (or <em>isn’t</em> approved; but it never really hurts to try.) But if you
fail to achieve your objective, the yelling will be for realsies, go on and on, be followed by career setbacks, and continue
long afterwards, quite possibly with no way to resuscitate said career.</p>
<p>Set objectives create a simple zero-sum game over resources - you want more resources to do what they asked, and they want
you to do the same things with less. <strong>Optimization, budget cuts or relinquishing resources under any other name simply
registers as losing a round in this game</strong>. It’s awfully sweet to save company resources, but expecting it to do you any
good just means that you don’t understand the game.</p>
<p>I mean, what do you expect to happen? That we'll ask you to do less, or forgive you for doing less? No way, we asked you to
do those things because they must be done. Then maybe you expect to be given more resources? Obviously ridiculous, you just had
resources, there’s no sense in hoping to get them again as a reward for giving them back when you already had them?.. Maybe
you’d like a stock grant for being such a good citizen? No, if we do that, everyone will inflate their budget, and then cut it
to get a stock grant. Could that be what you did here?.. Like, why was the budget so large to begin with?..</p>
<p>But wait, seriously though, what’s the math here? What are we maximizing? Revenue minus cost? Revenue divided by cost? I
mean, shrinking the cost has got to be helping with these?.. Well, sure it’s helping, but it’s not helping <em>you,</em> because
you don’t bring any revenue <em>by yourself,</em> unlike cost, which you very much do incur all by yourself. The math with
<em>you</em> is, <strong>we tell you to do something if the cost is below a threshold</strong>. If you won’t commit to doing it
cheaply enough, we’ll find someone who will, and if we can’t, we won’t do the thing, or reconsider the options in some other
way. But exactly what the cost below the threshold is changes nothing in any math related to you, except for a lower cost making
your job harder, since you have the same objectives to achieve. The firm’s bottom line - sure, lower costs help there. But the
impact on the firm’s “revenue - cost” doesn’t trickle down to your “cost &lt; threshold,” because you have no revenue<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a>.</p>
<p>Things work the same with any resource, not just actual money - it could be team size, or processor cycles and memory bytes.
If you free up 200 ms of CPU cycles and 500 megs of RAM, someone else can deploy their functionality using these newly available
resources, and then <em>you</em> won’t be able to. In fact, a mature, well-run CI system will measure everyone’s resource
footprint after each commit, and will not let you exceed your budget, which was frozen at some point based on how much you were
using at the time (hope it was a lot! - always spend like crazy before the baseline is established!) Is it any wonder that
people learn to never optimize their code - unless <em>they</em> want to deploy something new themselves, and only after asking
for more resources to deploy it and not getting any?</p>
<p>I like it when people ask “why is this code so slow? Why don't we optimize it?” And it still makes me sad when people ask
instead, “how much CPU time do I have for running this code?” when it's obviously 5-10x slower than it could be, and they're
asking to reserve 2-3x more CPU time than they're already wasting. But that's what happens when people have worked at well-run
places and aren’t stupid.</p>
<p>What happens in a badly run place? In a badly run place, management is bad at setting objectives, so you have people
aimlessly wandering about, lacking clear goals, and just doing stuff because they want to. They see an optimization opportunity
and they gladly pursue it - it’s interesting, it’s fun, it’s good for the company, it’s what they’re here for. If a patch must
be submitted to a team, that team might gladly accept it - they don’t mind shrinking their resource footprint, because nobody
monitors the resource budget properly, nor presses them to meet any targets very hard - which is also why they don’t really mind
spending some time on something not helping them achieve any such target. In fact, they might get interested enough to actively
help whoever found the problem to fix it.</p>
<p>Your legs don’t fight your heart, brain and each other for the oxygen budget; every organ only uses what it needs, and is
optimized for efficiency. The selfish corporation is yet to make its parts behave as selflessly as our body parts sharing our
supposedly selfish genes. Yet people do have a tendency to do the right thing regardless of incentives - no doubt because they
mistake their corporation for their tribe, thinking their coworkers share more of their genes than they do<a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>. But <strong>if there’s a reliable &amp; scalable way for
vigorous, systematic management to reward the spontaneous human drive towards efficiency instead of punishing it, I am yet to
see it</strong>. Certainly honest people working for the trillion-dollar heavyweight champions of the industry testify that this
problem is far from solved.</p>
<p>It’s an exercise both fun and depressing to come up with ways to “manage for efficiency.” For example, we could reward people
for performance savings, right? Great idea - now I can commit some CPU or memory hog, then you can fix it, and we’ll split the
reward. Or, more realistically, first we all go on a crazy resource spending spree to meet a deadline. And then later on, we
optimize away the lower-hanging fruit in the crazy-inefficient system and get a reward - with not-so-low-hanging fruit from that
spending spree probably left hanging forever.</p>
<p>(Of course, we probably won’t tell ourselves that we’re deliberately overspending more than is actually helpful for meeting
the deadline to game the system. Rather, the culture will just kinda shift in that direction. People are very good at doing
fucked-up things without admitting it to themselves - which would make them sad and less energized to do the fucked-up thing
they have compelling reasons to do.)</p>
<p>Perverse incentives always appear wherever incentives are deployed, because the very notion of an incentive is fundamentally
perverse. But a competent manager is forced to use incentives, instructions, and incentives to follow instructions, because what
else could he use?</p>
<h2 id="sprawl">Sprawl</h2>
<p>“These teams are like bulldozers with no brakes,” mused my acquaintance, who’d managed a team in a poorly-run company and had
recently become a director in a much better-run one. “You only have a steering wheel, and you need to be steering all the time,
or this thing is going to dig a giant hole in the ground, or raze a building or something. If you don’t tell them exactly what
to do, they’re still always going to do <em>something</em>, and then it’ll be too late.”</p>
<p>You see, he was used to people doing pretty much nothing when left unmolested. Of course, from the employer’s point of view,
this habit is straightforwardly wasteful, because you’re still paying their salaries. To weed out such do-nothing people,
competent management sets up a performance evaluation process, so that we always know what every person has done for us every
year, and who should get outsized rewards and who should get fired.</p>
<p>This system leaves people very worried if they don’t have clear goals to work towards. However, even a competent organization
cannot set actually useful goals for everyone at all times, just like you generally need your legs, but you don’t really have a
use for them at every moment. And thus, <strong>you have people with spare bandwidth making up their own goals, so that they
have something to show in the performance review.</strong></p>
<p>If we now revisit the situation from the employer’s point of view, it is no longer <em>trivially</em> wasteful, because
everyone is always busy. However, it’s likely more wasteful than before, because people are building stuff you didn’t really
need, and yet you almost certainly need <em>now,</em> because actually productive activities are hopelessly intertwined with
this stuff.</p>
<p>This is a big reason why successful software companies end up with mountains of code. The cycle repeats and branches out
exponentially, as every team who’s built the once-needless and now-necessary thing asks for more headcount, gets it, and
inevitably ends up with some of it idle some of the time. Then these new people invent more goals to pursue, persuade everyone
that these fake goals are actual sub-goals of the real goals, and entangle existing systems with their new systems.</p>
<p>And now figuring out where the waste is will be much harder than just spotting idle people, since all the needless work was
done for no other purpose than looking very important, and people are pretty good at making the right impression when they’re
trying. And of course when people lie, they lie first and foremost to themselves - we’re all natural-born <a href="https://en.wikipedia.org/wiki/Method_acting">Method actors</a> - so if you spot a decoy and try to cancel the work on that
system, not only will the people working on it fight this with all their might, but they'll be genuinely heartbroken if you do
cancel it. And by the time you’ve actually dealt with one of these weeds, if you’re a weird manager actually trying, two more
will have sprouted in another part of the org.</p>
<p>If you’re used to such sprawl, you’d be surprised how effective sleepy HR practices are at preventing it. Suppose you always
get a standard, shitty raise at the end of the year by default, unless you bargain loudly, which works rarely and only if you’ve
really made an impression throughout the year. There is no defined budget for raises; every significant raise is hard to get,
and you never get it proactively without bargaining, but there’s no formal system to avoid spending too much on raises except
for the reluctant, reactive approach to giving them. There’s also no system for firing low performers, and it’s only very rarely
that you see anyone fired - like that crazy fuck who went on and on about how your source control sucked and should be
completely different, and then used a single dot character, “.”, as the commit message when he finally committed something.</p>
<p>A similar system is used for managing other resources: for example, every team gets to grow at some low annual rate, no
department is ever cut, and it’s very hard to grow your department faster than the base rate even if you get more
responsibilities.</p>
<p>A place like this evolves the healthy laziness that keeps animals from moving their body parts all the time, needlessly
burning calories, in order for the claws, wings and tails to get a good performance review from their head at the end of the
year. Sure, many people do nothing much of the time, and you need some effort to make them do something when it becomes
necessary; “the hedgehog is too proud a bird to fly without a kick,” as the wise Russian proverb goes. But on the upside, nobody
doing anything unless it’s really necessary means you don’t have all this unnecessary stuff.</p>
<p><strong>Healthy laziness begets agility - you have way less code, less systems, less everything, and therefore way more
ability to maneuver and actually change things with a small number of motivated people </strong>- and there’s always a small
number of motivated people in any place, and this place might even keep them, if they learn to bargain for raises. And you also
don’t need to grow as much, because you don’t need to be adding people to take care of all these sprawling systems that you
quickly come to depend on.</p>
<h2 id="bugs">Bugs</h2>
<p>Bug fixes work a lot like efficiency improvements, the main difference being that competent management makes things much
worse. You can’t make fixing bugs into a “goal,” same as you can’t make optimization into a goal, because people will just add
more bugs up front and then fix some of them. But at least with optimization, you can have teams doing it across the
organization, and it claws back some of the performance lost in the first place.</p>
<p>A team optimizing others’ systems cannot hunt down the tens of thousands of little performance hogs created by everyone else.
But it can often find tens or hundreds of relatively small changes with a fairly big performance impact. They’re probably not
“fully incentivized” for this outsized impact, because with rewards anywhere close to how much money this is worth to the
business, the incentive is quite likely to become extremely perverse. But you definitely can make “everyone else’s performance”
a team’s job description, combine it with your venerable performance evaluation &amp; promotion process, and get
<em>something</em> - often a big something.</p>
<p>Another kind of team with some form of “someone else’s efficiency” in its job description works on compiler, language
runtime or kernel optimizations, custom compute or networking accelerators, and other such things. Such teams could be inefficient
<em>in their own work</em> for the same reasons mentioned above, but they might still be increasing <em>others’</em> efficiency,
because it’s legitimately an example of a goal that their competent management is good at setting and achieving.</p>
<p>The problem with bugs is that <strong>you can’t have people solve others’ bugs as much as you can have them improve others’
efficiency</strong>. It is generally much easier for a relative outsider to see where a system spends its resources than where
its bugs are. That’s because all systems spend similar kinds of resources, but what constitutes a bug varies from system to
system, and there’s almost never a machine-readable, formal, or even just a reasonably complete and written-down definition of
correctness. The few exceptions are things like programming language semantics, and indeed this is where a lot of progress has
been made - think sanitizers, <a href="https://yosefk.com/blog/checkedthreads-bug-free-shared-memory-parallelism.html">race
detectors</a>, etc.</p>
<p><strong>Another problem with bug fixes which you don’t have as much with optimizations is that it’s harder to measure the
impact</strong>. With efficiency improvements you can usually give a ballpark number of how much resources it would save -
perhaps a range of possibilities rather than one high-confidence number, but you’ll have something. With bugs, well, you could
A/B test them to try to quantify the impact on some metric management cares about, but who does that?</p>
<p>With performance, you deal in resources to begin with, and you have <em>some</em> number speaking of resource savings by
definition, or you couldn’t call it an optimization. And now there might be an argument of what multipliers to apply to this
number to arrive at a cost estimation, but at least you have a starting point. With a bug fix, you have the bug and the fix, and
you’re seriously going to suggest A/B testing the impact for no benefit to the employer except your ability to claim this impact
is worthy of a promotion? This is a great plan especially for internal systems without A/B testing infrastructure or any
preconditions for it, but it’s a great plan in general, employers love this.</p>
<p>(And also, most bugs you fix tend to come from your own team, and then all high impact proves is that you messed up big time
when you put the bug in. You’re not supposed to have bugs in the first place, punk.)</p>
<p>“I have this potential employer who says they’re interested in performance and correctness,” said another acquaintance. “I
told them that I can work on performance anywhere in the industry, so I can probably find an offer better in other respects
elsewhere. But correctness sounds interesting. I don’t know anywhere caring about correctness!”</p>
<p>Well, it’s not like they don’t care, as much as they don’t have a mechanism for caring or even registering it. Correctness is
not a goal in itself that management can set for the teams without perverse side-effects. Of course, you have to fix
“showstopper bugs” or you haven’t achieved your goal. Any further bug-fixing takes resources from achieving your nominal goals,
and is avoided - not outright, which would look bad, but through slow-walking and other acceptable forms of sabotage.</p>
<p>It’s true that Microsoft Teams (to take one example that all too many of us are familiar with) can get away with bugs because it’s bundled
with Outlook and other stuff, and because whoever pays for it doesn’t use it that much, but rather foists it upon helpless
internal users. But it’s also true that fixing those bugs would be money very well-spent for Microsoft, because it would almost
certainly improve their reputation and increase sales at the margins and more than offset the cost of the work. The problem is
that it’s hard for a well-run place to get people to fix non-showstopper bugs.</p>
<p>(One way to work on correctness, if you're into this, is to go to areas where more bugs are showstoppers, so fixing them
becomes a part of the nominal goals. If you’re a hardware developer, FPGAs, where you can fix bugs very cheaply, are a worse
context for this than ASICs, where you cannot, making you eager to find and fix them proactively. And hardware running lots of
software, which can't be patched to work around hardware bugs, like a CPU, <a href="https://yosefk.com/blog/the-habitat-of-hardware-bugs.html">will face more pressure to be correct</a> than something like a
peripheral device controller, which is only touched by comparatively little code written at the company making the hardware,
where it’s “easy” for software developers to add workarounds to this code<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>. If you’re a software developer, you could try an industry with high reliability
requirements, where many more bugs are defined as showstoppers.)</p>
<p>Of course, having more sprawl means having more bugs (and more performance issues, and more machine resources spent on
running all that sprawling code), and even defining what “correct” means becomes harder when the system is larger. The
unfortunate side effects of competent management compound each other.</p>
<h2 id="the-problem-with-incompetent-management">The problem with incompetent management</h2>
<p>The main disadvantage of incompetent management is its definitional inability to set and achieve key goals, which can
endanger the survival of the organization. Incompetent management can only thrive in situations where basic survival and even
growth are somehow taken care of, and any major changes in that situation create an existential risk.</p>
<p>It is theoretically possible for management to respond to an external crisis by “changing gears” from a sleepy indifference
to what’s going on in the organization to a vigorous push to get something huge done, as required by the new external situation.
The hope is to kick the suddenly awakened, terrified hedgehog into the stratosphere, and then go back to the sleepy ways of old
once it’s orbiting the Earth.</p>
<p>In practice, the risk is high for this attempt to fail - a place not used to the mobilized state of subordinating all efforts
to top-down goals will need time to learn, where “learning” might involve firing or otherwise replacing key people (which is a
big part of what “learning” means for organizations, and what people mean when they say such learning is “hard.”)</p>
<p>If the war effort does succeed, there’s quite likely no going back - the hedgehog will have been thoroughly transformed and
militarized by the ordeal. It will be the usual mix of competent management and cargo cult management from now on.</p>
<h2 id="cargo-cult-management-vs-straightforward-incompetence">Cargo cult management vs straightforward incompetence</h2>
<p>Speaking of which - a most unfortunate side effect of competent management is the widespread desire to emulate its look and
feel, which contaminates the wonderful natural incompetence of so many managers, robbing us of its many advantages.</p>
<p>Mostly incompetent management which is very bad at setting and achieving goals is perfectly capable and all too likely to
cargo-cult effective management by setting up an elaborate bureaucracy for assigning work and tracking its status, thus
preventing work from happening spontaneously. This has all the downsides of actually competent management without any of the
benefits.</p>
<p>Things work much better when incompetent managers embrace their laziness and do close to nothing. This is possible if there's
a culture where a manager gets to look good through means other than appearing to be on top of plans and status - for example,
by presenting shiny things the team is working on (regardless of their exact impact on the bottom line or even chances to be
deployed in production.)</p>
<h2 id="what-is-to-be-done">What is to be done?</h2>
<p>“What is to be done?” is <a href="https://en.wikipedia.org/wiki/What_Is_to_Be_Done%3F">a pamphlet by Lenin</a>, who proposed
some things to be done, and went on to do them and then some, with results most charitably described as mixed.</p>
<p>I don't know how it ever happened to me, but I somehow got infected with the absurd idea that there's always a good way for
things to work in an organization, and furthermore, somehow this good way always makes the org more effective than the commonly
observed not-so-good alternatives. I was brought up with natural immunity to <a href="https://en.wikipedia.org/wiki/The_Internationale#Anthem_of_the_Soviet_Union">the Soviet strain</a> of this Panglossian
optimism with respect to our ability to shape organizations in the all-around optimal way:</p>
<blockquote>
<p>We shall wholly destroy the world of oppression<br> Down to the foundations, and then<br> We'll build a new world of our
own.</p>
</blockquote>
<p>But it turned out that my Soviet antibodies don’t automatically work against <a href="https://paulgraham.com/good.html">the
Western strain</a>:</p>
<blockquote>
<p>…the most important advantage of being good is that it acts as a compass. … you have so many choices. How do you decide?</p>
<p>Here's the answer: Do whatever's best for your users. You can hold onto this like a rope in a hurricane, and it will save you
if anything can. Follow it and it will take you through everything you need to do.<a class="footnote-ref" role="doc-noteref" href="#fn4" id="fnref4"><sup>4</sup></a></p>
</blockquote>
<p>I mean, I guess I’ve always had antibodies good enough to protect against severe illness; I never imagined that companies
mostly succeeded by holding onto their goodness like a rope in a hurricane. But if you pressed me, I’d say they probably
<em>could</em>, and they would be better off if they did.</p>
<p>Which, if you think about it, why on Earth would this have to be correct? Few people would say that you can always make your
code faster without making it uglier, and those who say it tend to be a bit insane, in a professional sense. So why would making
a company more effective always make it better instead of worse, according to, well, any definition, really?.. Just because the
opposite thought is depressing? Well, the thought of <a href="https://yosefk.com/blog/efficiency-is-fundamentally-at-odds-with-elegance.html">faster code tending to be uglier</a> isn’t
a very happy one, either.</p>
<p>So, now that I have immunity strong enough to prevent infection and transmission of the effective goodness virus, I don’t
think you have to find a solution to an organizational problem just because you happen to observe it<a class="footnote-ref" role="doc-noteref" href="#fn5" id="fnref5"><sup>5</sup></a>. From an individual’s POV, the environment is nearly
impossible to change, hard to truly understand, and fairly easy to fit into for most values of “environment,” and companies are
no different. You probably can’t make most well-run places “truly care” about efficiency or correctness, but you can make a
great living optimizing stuff, and even debugging or seriously testing, if you find the right place for it.</p>
<p>Of course, if you put a gun to my head, I could add a few paragraphs on “combining the best of both worlds” and how it’s been
known to happen in small teams over short periods of time, and so on. And, not gonna lie, I almost did put a gun to my own head
to write these paragraphs - old habits die hard. But I came to my senses and deleted them. It’s more likely to make you feel sad
than happy, and most of all, it’s likely to make you bored.</p>
<p>(<strong>Update</strong> - see <a href="https://news.ycombinator.com/item?id=40893945">a comment on this write-up describing
the rise and fall of Creo</a>, a company that they say did start out combining the best of both worlds, then regressed to the
mean after acquiring another company with a lot of people and a "standard" culture, and went downhill both as a place to work
and a viable business. Like I said, these "best of both worlds" stories will make you more sad than happy.)</p>
<h2 id="conclusion">Conclusion</h2>
<p>Competent management sets goals to achieve. Whatever can’t be made into a goal cannot be achieved by definition. Whether this
sounds trivial or absurd, it has many surprising undesirable consequences which are surprisingly hard to avoid.</p>
<p>A company’s board is unlikely to raise the need for less competent management in their annual meeting, and for good reasons.
A prospective employee is another matter. If someone invites you to work for a company that’s run very badly, there might well
be a good story there - this is far from guaranteed, but you might want to hear the details. And by “a good story”, I don’t mean
“yay, here’s a place to slack off at,” but “maybe I can finally get some work done that I hardly ever get the chance to do.”</p>
<h2 id="see-also">See also</h2>
<p>It's very possible to make sure you only hire people who can answer algorithms questions in an interview. But don't expect
these carefully filtered employees to then <a href="https://danluu.com/algorithms-interviews/">actually solve, rather than
create</a>, problems equivalent to basic phone screen questions on the job, for reasons related to the above.</p>
<p><em>Thanks to Dan Luu and Tim Pote for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>If you’re a general manager of a business unit, aka a P&amp;L (profit &amp; loss) unit, then of course you
<em>do</em> have revenue, and things are different. For the purpose of our current discussion, I treat a BU as a separate
company, and the discussion applies to its employees, not its GM who for our purposes can be treated as a CEO.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>Either that, or it’s a genetic defect in people, or something about group selection. It is a scientific fact
that everything in life is either the result of DNA molecules evolving to copy themselves more efficiently, or their occasional
failure to do so, and no other mechanism nor information encoded in any other form is of any consequence.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>It’s also easy for a device driver programmer to beat up a hardware device designer, especially when sneaking
from behind, but it’s been made artificially hard by various legal mechanisms.<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
<li id="fn4"><p>I did notice that it says “good for users” and not “good for employees” or from other points of view. But you
and I both know what the answer would have been had someone in the audience raised their hand and asked, “...it’s about being
good for users, right? - you do often have to make it terrible for employees, or terrible in some other sense, in order to
succeed?”<a class="footnote-back" role="doc-backlink" href="#fnref4">↩︎</a></p></li>
<li id="fn5"><p>The virus is bad enough to actually make you think, “since I don’t know how to solve this problem, I probably
don’t really understand it - my analysis of my observations must be incorrect,” as if thinking that the discrete knapsack
problem is NP-complete is a symptom of not understanding the discrete knapsack problem.<a class="footnote-back" role="doc-backlink" href="#fnref5">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/advantages-of-incompetent-management#comments</comments>
      <pubDate>Thu, 04 Jul 2024 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/advantages-of-incompetent-management.feed</wfw:commentRss>
    </item>
    <item>
      <title>Profiling in production with function call traces</title>
      <link>https://yosefk.com/blog/profiling-in-production-with-function-call-traces.html</link>
      <description><![CDATA[<p><strong>A timeline showing function call and return events is a great way to debug performance problems, especially in
production</strong>. In particular, it's often much more effective than traditional sampling profilers, for reasons we’ll
discuss. However, the adoption of function tracing in the industry remains uneven because of a chicken-and-egg problem.</p>
<p>To best use a tracing profiler, you need some adaptations to your code and your workflow (as opposed to sampling profilers,
which “just work” with your code.) So to <em>make </em>a tracing profiler, one needs people wishing to change their code &amp;
workflow in order to use it. That said, as we’ll see, <strong>it’s gotten fairly easy to develop a tracing profiler
today,</strong> and integrating it into your work is very doable as well – which I hope might encourage people to both make and
use tracing profilers.</p>
<p>“Our main contribution,” as they say in papers, is a new function tracing profiler for C++ imaginatively named <a href="https://github.com/yosefk/funtrace">funtrace</a>. It’s ready for serious use – it works out of the box on large,
complicated programs. For example, here it is peeking into the inner workings of the awesome painting program <a href="https://krita.org/">Krita</a>, showing how the call stack changes over time, the thread state transitions, and the source
code of some selected function:</p>
<p><img alt="image8.png" height="776" src="https://yosefk.com/img/funtrace/image8.png" title="a trace of Krita made by funtrace &amp; displayed by vizviewer" width="576" style="max-width: 100%;height: auto;"></p>
<p>Funtrace has the following attractive qualities, which I hope I am not overselling:</p>
<ul>
<li>AFAIK, has <strong>the lowest-overhead tracing</strong> (&lt;10 ns per instrumented call or return in my measurements)</li>
<li>Supports <strong>threads, shared libraries and exceptions</strong></li>
<li>Supports <a href="https://www.kernel.org/doc/html/v5.0/trace/events.html">ftrace events</a>, showing <strong>thread
scheduling states</strong> alongside function calls &amp; returns, so you see when time is spent waiting as opposed to
computing</li>
<li>Works with <strong>stock gcc or clang</strong> - no custom compilers or compiler passes</li>
<li>Easy to integrate into a build system, and even easier to <strong>try <em>without</em> touching the build system</strong>
using <a href="https://github.com/yosefk/funtrace/tree/master/compiler-wrappers">tiny compiler-wrapping scripts</a> “passing all
the right flags”</li>
<li><strong>Small</strong> (just <strong>~1K LOC</strong> for the runtime) and thus:
<ul>
<li>easy to port (currently <strong>x86/Linux</strong> is supported)</li>
<li>easy to extend (say, to support some variant of “green threads”/fibers)</li>
<li>easy to audit in case you’re reluctant to add something intrusive like this into your system without understanding it well
(as I personally would be!)</li>
</ul></li>
<li><strong>Relatively comprehensive</strong> – it comes with its own tool for finding and cutting instrumentation overhead in
test runs too large to fully trace; support for remapping file paths to locate debug information and source code; a way to
extract trace data from core dumps; and other such “ways to address real-world concerns.”</li>
</ul>
<p>I’ve worked on several kinds of profilers during the last 20 years, and this one is easily my favorite. Perhaps it’s the
colorful stalactites?.. Anyway, that’s “our main contribution”; we’ll see how funtrace works, and how you could use similar
methods and existing components to build your own tracing profiler.</p>
<p>But we’ll cover more than the proverbial design and implementation of funtrace; in fact, we’ll leave a few of its
particularly “hardcore” bits for their own followup. We’ll start with <a href="https://github.com/gaogaotiantian/viztracer">viztracer</a>, a great tracing profiler for Python, and discuss how to
introduce a profiler like that into your workflow. We’ll also talk about <a href="https://llvm.org/docs/XRay.html">LLVM
XRay</a>, AFAIK “the” open-source function tracing C++ profiler today, and how its design is influenced by what the workflow
needs to be in a place like Google, where XRay comes from.</p>
<p>We’ll also get to the awesome <a href="https://github.com/janestreet/magic-trace">magic-trace</a>, a non-intrusive tracing
profiler built on top of <a href="https://perfwiki.github.io/main/">Intel Performance Trace</a>. Unfortunately, its authors say
that magic-trace is <a href="https://github.com/janestreet/magic-trace/wiki/How-could-magic-trace-be-made-to-work-on...">hard or
impossible to port to many popular platforms</a>. This will bring us to <strong>what CPU makers could do to make
hardware-accelerated function tracing very cheap</strong> in <em>both </em>hardware and software – and usable much more widely
than today, including in dynamic and JITted languages.</p>
<ul>
<li><a href="#viztracer-how-to-use-a-great-tracing-profiler">viztracer: how to use a great tracing profiler</a></li>
<li><a href="#funtrace-making-a-tracing-profiler-for-native-code">Funtrace: making a tracing profiler for native code</a>
<ul>
<li><a href="#compiler-instrumentation">Compiler instrumentation</a></li>
<li><a href="#runtime-code">Runtime code</a></li>
<li><a href="#decoded-trace-viewer">Decoded trace viewer</a></li>
<li><a href="#offline-trace-decoder">Offline trace decoder</a></li>
<li><a href="#ftrace-tracing-thread-state-changes">ftrace: tracing thread state changes</a></li>
<li><a href="#getting-traces-from-core-dumps">Getting traces from core dumps</a></li>
<li><a href="#funcount-culling-overhead"><code>funcount</code>: culling overhead</a></li>
</ul></li>
<li><a href="#antithesis-llvm-xray">Antithesis: LLVM XRay</a></li>
<li><a href="#synthesis-funtrace-with-xray-characteristics">Synthesis: funtrace with XRay characteristics</a></li>
<li><a href="#hardware-assisted-tracing">Hardware-assisted tracing</a></li>
<li><a href="#conclusion-and-future-work">Conclusion and future work</a></li>
</ul>
<h2 id="viztracer-how-to-use-a-great-tracing-profiler">viztracer: how to use a great tracing profiler</h2>
<p>I’m working on a small animation program, and I’ve found out that there’s such a thing as insufficient <a href="https://en.wiktionary.org/wiki/yak_shaving">yak shaving</a> - it turns out that the care and feeding of an unshaved yak
gets tiresome quickly. For example, I’ve avoided figuring out how to use a profiler for way too long, and in hindsight, wasted a
lot of time manually putting timers into the code.</p>
<p>I mean, you start putting timers into your code, and the first thing you print out is the average runtimes of things. The
averages tell you which code the program spends most of its time running, helping you to speed up some of this code. But then
you run into the occasional lags, and of course, <strong><a href="https://danluu.com/perf-tracing/">averages can’t explain why
you have unusual lags</a> - because by definition, <em>on average</em>, you don’t have <em>unusual</em> lags.</strong> So
whatever is taking time when you <em>do</em> have unusual lags doesn’t move the averages.</p>
<p>(Incidentally, this is why sampling profilers like <a href="https://perfwiki.github.io/main/">perf</a>, which periodically
check what the program is doing and then show you what it was doing most of the time, can help with saving CPU cycles, but
mostly <strong>can’t help with worst case latency</strong>.)</p>
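<p>A quick illustration with made-up numbers of why the averages stay quiet:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>times_ms = [10] * 999 + [500]  # 999 normal events, one bad lag

print(sum(times_ms) / len(times_ms))  # ~10.5 - the lag barely moves the average
print(max(times_ms))                  # 500 - what the user actually felt</code></pre>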
<p>So you print out worst case runtimes, but of course the worst case runtimes of each function don’t help you <em>by
themselves.</em> What you’re really after is how long <em>each part</em> of your code took when <em>an entire flow, </em>like
your mouse-down event handling, was unusually slow. That slowness wasn’t due to every function taking the most time ever, but
due to <em>some</em> of them taking a lot of time, and the sum of <em>everything</em> running <em>in this flow</em> taking too
much.</p>
<p>So you start printing some sort of tables with timer values - a row for every time some flow ran, and a table like that for
every flow, and then you look at the rows with the most total cycles. The tables grow, and now you want something easier to look
at than tables of numbers. So you start having thoughts like "I could decorate my functions with <code>@trace</code> or something, to trace calls and returns, and if only there was a nice way to display this
trace with the function calls nested inside each other...”</p>
<p>And at this point you say – you know, <em>I would be building a tracing profiler!</em> There has got to be a tracing profiler
for Python - I should find one and use it! Where’s my yak shaving machine?! Should have reached out for that long ago, before
getting all tangled up in all this yak fur!</p>
<p>Then you discover viztracer, which looks great, and works like a charm if you run it on some script with
<code>viztracer ./script.py</code>. Works fine for a short program run: you get the last 10 million function calls at the end of
the run. But your program is an interactive GUI, which means an “infinite” program run. You don’t want to quit the program to
get the trace. Well, you can Ctrl-C the program - more accurately, you can Ctrl-C viztracer which is running the program - to
get the last 10M calls before you Ctrl-C’d. But <a href="https://yosefk.com/blog/profiling-with-ctrl-c.html">how do you know when
to Ctrl-C</a>?</p>
<p><strong>This right here is the major problem with tracing profilers: <em>you need to figure out what to trace.</em>
</strong>With sampling profilers like perf, no such problem: you sample <em>all the time</em>, and then you get a summary – of a
size <em>not dependent on how long the program ran. </em>And this is important not only because there are only so many terabytes
in a disk, but first and foremost because <em>there’s only so much time to scroll horizontally</em>, trying to find the part of
a giant timeline that you care about.</p>
<p><strong>Therefore, a tracing profiler usually comes with an API for triggering tracing, which the program must call</strong>.
And this is our chicken-and-egg problem: when perf came out, it was immediately usable for all the natively compiled programs
out there – and everyone looking into performance could use it, and wanted to make it better. But with a tracing profiler, most
programs must be changed for it to be usable, if only a little bit, and who wants to risk developing a tool that nobody can use
on day one?</p>
<p>(There is, of course, <em>the other </em>major problem with tracing profilers - their larger overhead compared to sampling
profilers. This IMO explains why <strong>dynamic languages are more likely to have a tracing profiler than static languages at
this time</strong>. Not only are these languages designed for things like intercepting function calls without the language maker
having to add support for this, making things easier for “community tool makers,” but <strong>the execution is so slow to begin
with that the overhead of a tracing profiler is <em>relatively</em> smaller than in static languages, </strong>and thus doesn’t
deter tool makers and users alike as much. We’ll talk about the overhead later, when we get to compiled languages.)</p>
<p>Anyway, you use the tracing API:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>import viztracer

tracer = viztracer.VizTracer()
tracer.start()

init_flow()

tracer.stop()
tracer.save('trace.json')</code></pre>
<p>You start with your init flow, if only because there’s just one such flow, as opposed to the many runtime event handling
flows. You get your trace.json file, and you run <code>vizviewer trace.json</code>. A browser tab pops up, and in that tab, you
see this message:</p>
<p><img alt="image1.png" height="198" src="https://yosefk.com/img/funtrace/image1.png" title="Oops, something went wrong. Please file a bug." width="582" style="max-width: 100%;height: auto;"></p>
<p>At this point, I hope you brought your big yak shaving machine. If, like me, you’ve only brought the small one, this is where
it breaks (“great, I knew it was a waste of time to look for a tracing profiler, of course this stuff comes broken out of the
box”) and you go back to looking at tables of numbers for a while, until this irritates you enough to try again. Then you find
out that your init flow was long enough to trigger a known bug that <a href="https://github.com/gaogaotiantian/viztracer/issues/139">likely won’t be fixed</a>, but there’s a fine workaround –
<strong>all you need to do if vizviewer gives you “Error: RPC framing error” is to reopen trace.json from the web UI<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a>.</strong></p>
<p>Happy happy, joy joy, you think to yourself, and get busy profiling your runtime flows. You open the traces – wow, so much
better than tables of numbers!</p>
<p><img alt="image2.png" height="349" src="https://yosefk.com/img/funtrace/image2.png" title="Example viztracer trace" width="576" style="max-width: 100%;height: auto;"></p>
<p>There it is, my <a href="https://yosefk.com/blog/a-100x-speedup-with-unsafe-python.html">wonderfully optimized resizing
function in “unsafe Python”</a> - and to its left, a bunch of short calls, looks like it’s drawing lots of buttons in a loop -
maybe it’ll be faster to draw them as one larger image?.. <em>Way</em> nicer than the tables!</p>
<p>The colorful stalactites are reminiscent of <a href="https://github.com/brendangregg/FlameGraph">flamegraphs</a><a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>, though they <em>aren’t </em>flamegraphs - they
represent <em>the execution timeline, </em>not <em>the share of time spent per callstack</em>. Vizviewer can show actual
flamegraphs, too - pass <code>--flamegraph</code>. In our example, instead of the many little calls on the left in the
screenshot above, you will get the following succinct summary (with the functions colored differently – done by different code,
I guess?..):</p>
<p><img alt="image3.png" height="183" src="https://yosefk.com/img/funtrace/image3.png" title="Example viztracer flamegraph" width="583" style="max-width: 100%;height: auto;"></p>
<p>Note that this is the <em>exact</em> flamegraph of a <em>short</em> period of time captured in a trace – while a sampling
profiler shows you an <em>approximate </em>flamegraph of a <em>long </em>period of time, a very different thing.</p>
<p>In any case, now that tracing basically works, you have a simple playbook:</p>
<ul>
<li>When you start handling an event, create a tracer object, and <strong>start()</strong> tracing</li>
<li>When you’re done, <strong>stop()</strong> tracing, and check how long it took to handle the event. Keep only the tracer
objects from the slowest measurements</li>
<li>Eventually, <strong>save()</strong> the traces you kept</li>
</ul>
<p>As you follow this playbook, you run into some issues:</p>
<ul>
<li>After creating the 1022nd VizTracer object (whether the previous 1021 were destroyed or not), the process terminates with
the somewhat paradoxical error message <code>Failed to create Tss_Key: Success</code>. Some resource must be leaking - so let’s
keep a pool of VizTracer objects and reuse them.</li>
<li>One of the flows you trace invokes another flow you trace, and then you get a
<code>Warning! Overwrite tracer! You should not have two VizTracer recording at the same time!</code> So you stop tracing when
entering a nested flow, and restart it when that flow is done.</li>
</ul>
<p>But, it’s not that many issues. <strong>You do need to write some code around a tracing profiler to make it work for you –
but not a lot, and it’s well worth your trouble</strong>, certainly with viztracer, which is absolutely great<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>. <a href="https://github.com/yosefk/Tinymation/blob/master/pyside/trace.py">Here’s my code for this</a>, presented mainly to show
that it’s &lt;150 LOC – there’s not much to it.</p>
<p>And now that we’ve seen how useful they are, let’s make our own tracing profiler!</p>
<h2 id="funtrace-making-a-tracing-profiler-for-native-code">Funtrace: making a tracing profiler for native code</h2>
<p>To trace function calls in a compiled language, you need 4 main things:</p>
<ul>
<li><strong>Compiler instrumentation</strong> for running code upon function entry &amp; exit <a class="footnote-ref" role="doc-noteref" href="#fn4" id="fnref4"><sup>4</sup></a></li>
<li><strong>Runtime code</strong> collecting some sort of function IDs and timestamps when functions are called and return</li>
<li><strong>Offline trace decoder</strong> for converting the traced function IDs into symbolic names, and producing some format
for the…</li>
<li>…<strong>Decoded trace viewer </strong>– a UI for looking at the traced timeline</li>
</ul>
<p>Actually, you also need a 5th thing, which one might call the first – namely, <strong>assumptions about the user’s
workflow</strong>: what the user needs to do, is willing to do, and is <em>not </em>willing to do. For funtrace, these
assumptions are:</p>
<ul>
<li><strong>The user either <em>wants </em>or <em>agrees </em>to trace <span style="text-decoration:underline;">in
production</span>.</strong> The user might <em>want</em> this because performance problems happen in production, can be hard to
reproduce, and you want to debug them. If the user isn’t interested in debugging problems in production, we still hope they
<em>agree </em>to trace in production, or at least in acceptance tests run before releasing production versions. That’s because
<em>tracing overhead can become unacceptable unless continuously monitored and culled as needed</em>. Tracing during acceptance
testing guarantees that the overhead is acceptable, by the definition of “acceptance testing.” And then you can look at trace
data any time you want, without worrying that this data is irrelevant due to high tracing overhead. Conversely, <em>not
</em>tracing during acceptance testing virtually guarantees that <em>when you actually enable tracing, the overhead will be
unacceptable, and you won’t have time to cull it,</em> making tracing unusable when you need it.</li>
<li>Therefore, as a corollary of tracing in production, <strong>the user agrees to continuously monitor and cull tracing
overhead</strong> by manually specifying things like “never trace these several functions,” “don’t trace when we’re loading
files - we know this flow is always slow, and tracing it does nothing except making it even slower,” etc.</li>
<li><strong>The user knows when to collect trace data and which collected data to save</strong>, and will use our API to do it.
For example, “start tracing when event handling begins, keep the trace of the slowest processing of every type of event, and
save all these slow traces upon program exit.”</li>
</ul>
<p>I like these assumptions for two reasons: <strong>this is the workflow I want as a user</strong>, and <strong>you get a small
and fast runtime with these assumptions</strong>. However, this is not the only possible set of sensible assumptions, and we’ll
see below how very different assumptions influence the design of LLVM XRay.</p>
<p>And now with our workflow assumptions in mind, let’s think step by step, as we tell LLMs when we want to guide their
boundless creativity away from complete bullshit, and work our way through the list of key components in a tracer.</p>
<h3 id="compiler-instrumentation">Compiler instrumentation</h3>
<p>With any native language using LLVM, you can write a compiler pass calling some event handlers upon function entry &amp; exit
– and with GCC as well, though most would prefer LLVM’s APIs for this.</p>
<p>Specifically in funtrace, however, my goal was <strong>to use existing compiler flags for instrumentation</strong>. With C++, g++ and clang++ make this possible, and compiler flags are more consistent across compiler versions than
the internal APIs for writing a compiler pass. And even if my pass supported multiple compiler versions, who’d want to build it
for their specific compiler version so as to try funtrace?..</p>
<p>g++ and clang++ have the following flags, <strong>all supported by funtrace</strong>, and each having its own pros and
cons:</p>
<ul>
<li>Both compilers support <code>-finstrument-functions</code>, which makes the compiler generate calls to
__cyg_profile_func_enter/exit when functions are, well, entered and exited. Very clean – you can write your tracing handlers in
portable C code. For better or worse, this instruments functions <em>before inlining,</em> and C++ code is famously full of tiny
inline functions<a class="footnote-ref" role="doc-noteref" href="#fn5" id="fnref5"><sup>5</sup></a>. It can be neat to trace a
program with -finstrument-functions to inspect the flow – it might be easier to follow than in either a debugger or an IDE – but
it’s impractical for tracing in production. Can we lower the overhead?
<ul>
<li>With g++ you can pass something like <code>-finstrument-functions-exclude-file-list=.h,.hpp,/usr/include</code> to ignore
all the functions in the header files – close to, but not quite what I’d want.</li>
<li>With clang++ you can simply use <code>-finstrument-functions-after-inlining</code>, often exactly what you want in
production.</li>
</ul></li>
<li>g++ supports the not-too-obvious flag combination <code>-pg -mfentry -minstrument-return=call</code>. This is similar to
clang’s -finstrument-functions-after-inlining in that it, well, instruments functions after inlining. But this doesn’t call
__cyg_profile_func_enter/exit – it calls __fentry__ and __return__, and it calls them in a different way, pretty much forcing
them to be assembly functions – though this is actually a plus, since it lowers the overhead somewhat – bringing us to the
runtime code implementing all these functions called by compiler instrumentation.</li>
</ul>
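<p>Before moving on, here’s a minimal sketch of the portable C hooks that -finstrument-functions generates calls to – funtrace’s real handlers do what the next section describes, but the signatures are these, and the attribute keeping the hooks themselves from being instrumented (which would recurse) is essential:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#define NOINSTR __attribute__((no_instrument_function))

<i>//the compiler emits a call to the first hook at every function entry,
//and to the second at every return; NOINSTR prevents infinite recursion</i>
extern "C" NOINSTR void __cyg_profile_func_enter(void* func, void* call_site)
{
    <i>//log a "call" event for func - see trace() below</i>
}

extern "C" NOINSTR void __cyg_profile_func_exit(void* func, void* call_site)
{
    <i>//log a "return" event for func</i>
}
</pre>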
<h3 id="runtime-code">Runtime code</h3>
<p>Our trace entries keep a timestamp, and a pointer into the code of a function. The highest bit of the code pointer is used to
mark an entry as “call” or “return” (no machine actually uses all the 64 bits of a pointer in userspace.)</p>
<p>Getting a cycle-accurate timestamp is fairly cheap; x86 has the so-called TSC (timestamp counter), which you read with the
RDTSC instruction. We’ll discuss alternatives to TSC on x86 and elsewhere in our “Hardcore Followup.”</p>
<p>We keep thread-local cyclic buffers of these entries. The user can dump them in full in one of 2 ways:</p>
<ul>
<li><strong>Call <code>funtrace_pause_and_write_current_snapshot()</code>.</strong> We pause tracing while writing the snapshot,
to not overwrite the data with events logged after the call.</li>
<li><strong>Run <code>kill -SIGTRAP &lt;pid&gt;</code></strong> (similarly to viztracer’s and magic-trace’s Ctrl-C/SIGINT; I
prefer SIGTRAP, since many programs handle SIGINT themselves.)</li>
</ul>
<p>As we’ve seen above, "suddenly dumping the whole trace" like this works for peeking into programs you know nothing about, but
it’s not great. Dumping the full content of the buffers is costly in time &amp; space, and then <em>looking </em>at this data
gets annoying, too. For example, some threads are idler than others, and keep very old events in their buffers thanks to this
idleness. These old events cause the timeline UI to zoom out so much that you can’t see anything.</p>
<p>You can pass flags to the funtrace decoder asking to ignore some threads, or to ignore events past a certain age. But it’s
actually much easier to know <em>at the time you’re taking the snapshot </em>what that age should be:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">uint64_t start_time = <b>funtrace_time</b>(); <i>//wraps RDTSC</i>

do_stuff();

uint64_t latency = <b>funtrace_time</b>() - start_time;

if(latency &gt; _slowest) {
  _slowest = latency;

  <b>funtrace_free_snapshot</b>(_snapshot);
  _snapshot = <b>funtrace_pause_and_get_snapshot_starting_at_time</b>(start_time);

  <i>//eventually we’ll write _snapshot out with
  //funtrace_write_snapshot()</i>
}
<i>//else (if latency &lt;= _slowest, the typical case),
//the only overhead added by our tracing logic
//is the 2 funtrace_time() calls</i>
</pre>
<p>Now the snapshot only keeps “the interesting part” – small, and easy to look at. Tracing is always on, so the program can
decide at any moment that “something interesting” happened, and save the recent events according to the appropriate definition
of “recent.”<a class="footnote-ref" role="doc-noteref" href="#fn6" id="fnref6"><sup>6</sup></a></p>
<p>Appending to a cyclic buffer is easy; the ANDing of the current <em>pointer </em>(not <em>index</em>) with a mask is the only
slightly tricky thing (see the Hardcore Followup.)</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">void trace(uint64_t code_ptr, uint64_t flags)
{
    <i>//trace_buf is declared thread_local</i>
    uint64_t buf_ptr = (uint64_t)trace_buf.pos;
    buf_ptr &amp;= trace_buf.wraparound_mask;
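    <i>//a zero wraparound_mask (tracing paused or disabled) makes entry null</i>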
    event* entry = (event*)buf_ptr;
    if(!entry) {
        return;
    }
    entry-&gt;func = code_ptr | flags;
    entry-&gt;cycle = <b>__rdtsc</b>();
    trace_buf.pos = entry + 1;
}
</pre>
<p>The flags argument is 0 for calls and 1&lt;&lt;63 for returns<a class="footnote-ref" role="doc-noteref" href="#fn7" id="fnref7"><sup>7</sup></a>. To pause tracing, we set wraparound_mask to 0. Since we must have the “if(!entry)” for
pausing, we might as well also set the mask to 0 to support <strong>disabling tracing at runtime, which cuts ~85% of the
overhead in my tests</strong>.</p>
<p>(Note that <strong>the code above is racy</strong> - it might take some time until a thread reads the zero from
wraparound_mask which another thread pausing the tracing wrote; we don’t particularly mind - at worst, we’ll overwrite a few old
events. Likewise, some recent writes might not be visible to the snapshotting code reading the buffers - we don’t particularly
mind, either<a class="footnote-ref" role="doc-noteref" href="#fn8" id="fnref8"><sup>8</sup></a>.)</p>
<p>The “nice” C callbacks __cyg_profile_func_enter/exit simply call the trace() function above. Things are harder for the
__fentry__ &amp; __return__ callbacks. Firstly, they aren’t passed the address of the function calling them as an argument - but
we could get that with __builtin_return_address(0)<a class="footnote-ref" role="doc-noteref" href="#fn9" id="fnref9"><sup>9</sup></a>. More importantly, <strong>they aren’t called according to the C calling convention</strong>
- the compiler “just calls them,” without bothering to save registers where their caller’s arguments might be kept, for
example.</p>
<p>I don’t know how to tell gcc or clang, “please don’t use registers where arguments are passed - please only use temporary
caller-saved registers.” And if you implement __fentry__ in C, and the compiler clobbers a register where arguments are passed,
and __fentry__ was called from a function which gets arguments, that function’s argument will have been clobbered.</p>
<p>So I wrote these functions in x86 assembly, basically by taking what the compiler produces from
<code>trace(__builtin_return_address(0), flags)</code> and then changing the code to only use those registers that I am allowed
to in this “non-standard calling convention” – or saving those registers that I can’t help using, but shouldn’t clobber.</p>
<p>(RDX and RAX are the annoying ones. RDTSC is hardwired to clobber them, but they’re also used for an argument and the return
value, respectively, so it works out to __fentry__ having to save RDX, and __return__ having to save both. Many more appetising
details of this sort await in the Hardcore Followup; generally, <strong>writing a tracing profiler today involves figuring out
many small and simple, if somewhat arcane things, but not a lot of code</strong> - the perfect job for myself, I find.)</p>
<p>Of course, just the content of the buffers is not enough to make sense of the snapshot once it was dumped. We also need to
know:</p>
<ul>
<li><strong>Where the code was loaded to</strong> - the executable and each shared library get loaded to a different offset,
potentially in every run if <a href="https://en.wikipedia.org/wiki/Address_space_layout_randomization">ASLR</a> is enabled, and
we have to subtract this offset to convert code pointers to function names using symbol table lookup. We can get this info from
<code>/proc/self/maps</code>, but it’s faster to get it from <a href="https://man7.org/linux/man-pages/man3/dl_iterate_phdr.3.html">dl_iterate_phdr</a> (and <em>maybe </em>more portable, eg
Android reportedly won’t let you access files under /proc)</li>
<li><strong>The TSC frequency</strong> to convert cycles to nanoseconds. There are at least 3 methods to find it out - using the
CPUID instruction (specifically “leaf 15H” - don’t ask), grepping the output of <code>dmesg</code>, and simply sleeping for some
time and checking by how much TSC was incremented during that time. Funtrace tries all these methods, in that order, in case the
first two fail (the 3rd kinda can’t fail, but it’s not very accurate.)</li>
</ul>
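<p>For the first item, a minimal sketch of getting the load addresses with dl_iterate_phdr – funtrace records more per module, but this is the gist:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;link.h&gt;
#include &lt;stdio.h&gt;

<i>//called once per loaded module - the executable and every shared library</i>
static int dump_module(struct dl_phdr_info* info, size_t size, void* data)
{
    printf("%s loaded at %p\n",
           info-&gt;dlpi_name[0] ? info-&gt;dlpi_name : "(main executable)",
           (void*)info-&gt;dlpi_addr);
    return 0; <i>//0 means "keep iterating"</i>
}

void dump_load_addresses() { dl_iterate_phdr(dump_module, NULL); }
</pre>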
<p>With that, we can decode the code pointers and the timestamps, respectively. One last nice thing to save is <strong>thread
names</strong>. The OS spends lavishly to let us userspace peasants name our threads - 15 (!!) characters per thread (16 with
the null byte), and funtrace reads these names with <code>pthread_getname_np</code><a class="footnote-ref" role="doc-noteref" href="#fn10" id="fnref10"><sup>10</sup></a>.</p>
<p>That’s it - we have our snapshot.</p>
<h3 id="decoded-trace-viewer">Decoded trace viewer</h3>
<p>We need to decode the trace before viewing it. But first, we need to decide what the viewer will be, to make our decoder emit
the trace in the viewer’s format. Vizviewer, for example, is based on <strong><a href="https://ui.perfetto.dev/">Perfetto</a></strong>. In fact, it turns out that <strong>every tracing profiler mentioned in
this post (viztracer, magic-trace, XRay) uses Perfetto for the viewer</strong>.</p>
<p>I assume that Perfetto owes its popularity to its quick rendering of very large traces at arbitrary zoom, its beautiful look
&amp; feel, and its <a href="https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/">simple JSON trace
format</a> - just a bunch of records with a name, start timestamp, duration, and thread ID:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">{
"traceEvents": [
{"args":{"name":"<b>my_program</b>"},"name":"process_name",
    "tid":1,"pid":1,"ph":"M"},
{"args":{"name":"<b>main_thread</b>"},"name":"thread_name",
    "tid":1,"pid":1,"ph":"M"},
{"name":"<b>run</b>()","ts":2,"dur":10,"tid":1,"pid":1,"ph":"X"},
{"name":"<b>parse</b>()","ts":5,"dur":5,"tid":1,"pid":1,"ph":"X"},
{"name":"<b>open</b>()","ts":6,"dur":3,"tid":1,"pid":1,"ph":"X"},
{"args":{"name":"<b>worker_thread</b>"},"name":"thread_name",
    "tid":2,"pid":1,"ph":"M"},
{"name":"<b>config</b>()","ts":3,"dur":4,"tid":2,"pid":1,"ph":"X"},
{"name":"<b>open</b>()","ts":4,"dur":2,"tid":2,"pid":1,"ph":"X"}
]
}
</pre>
<p>Note how we didn’t tell which function called which - Perfetto just finds which time ranges are nested within other time
ranges, and stacks them accordingly:</p>
<p><img alt="image6.png" height="207" src="https://yosefk.com/img/funtrace/image6.png" title="Example Perfetto JSON" width="575" style="max-width: 100%;height: auto;"></p>
<p>Today, however, I don’t recommend following vizviewer’s example and using Perfetto. Much better to do what vizviewer couldn’t
do - <strong>use vizviewer itself!</strong></p>
<p>A big reason is that <strong>vizviewer extends the JSON format to include the source code of functions</strong> (and unlike
<em>every damned debugging and profiling tool</em>, here’s a program finally doing the right thing and putting <em>the source
code </em>into the JSON instead of <em>file names </em>– so that you actually look at the source code <em>that was traced</em>,
and not the code appearing in those files <em>right now,</em> possibly with some newer changes! And you can also send these JSON
files to someone and they’re self-contained, and they’ll open on their machine – unlike typical tool reports referencing source
code.)</p>
<p>So <code>funtrace2viz</code>, our trace decoder, simply produces a vizviewer JSON; to view it, install viztracer with
<code>pip install viztracer</code>, and you’ll get vizviewer in your $PATH. And now all we need is an…</p>
<h3 id="offline-trace-decoder">Offline trace decoder</h3>
<p>Our trace entries have 2 fields – a code pointer and a cycle – so decoding involves 2 jobs:</p>
<ul>
<li>Converting code pointers to function names and source line numbers.</li>
<li>Converting cycles to microseconds.</li>
</ul>
<p>Therefore, we should do this in Rust, which has excellent libraries for both tasks.</p>
<p><em>Libraries</em> for <em>both</em> tasks, you think; the 2nd task being multiplication of numbers. I guess this guy’s rabid
hatred of C++ metastasized into rabid Rust fandom, you think; sad, if unsurprising – a textbook example of mental illness.</p>
<p>Well, ackchyually, I’m a less rabid Rust fan than I’d like; I’m afraid that if you’re into stuff involving a mix of GUI and
number crunching, a combination of C++ and Python is your best bet today, if only because these are the two widely popular
languages which people use for this stuff, and where most of the libraries and tools are<a class="footnote-ref" role="doc-noteref" href="#fn11" id="fnref11"><sup>11</sup></a>.</p>
<p>That said, yes, converting cycles to microseconds <em>is </em>”a task,” as evidenced by the following comment in XRay’s
code:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">// Chrome trace event format always wants data in micros.
// CyclesPerMicro = CycleHertz / 10^6
// TSC / CyclesPerMicro == TSC * 10^6 / CycleHertz == MicroTimestamp
//
// <b>Could lose some precision here</b> by converting the TSC to 
// a double to multiply by the period in micros. 52 bit
// mantissa is a good start though.
//
// <b>TODO</b>: Make feature request to Chrome Trace viewer to
// accept ticks and a frequency or <b>do some more involved
// calculation</b> to avoid dangers of conversion.
</pre>
<p>You see, TSC is a 64b number, and if your machine has been running for a while, it will have more than 52 significant bits,
and you will start losing the low bits, because they won’t fit into a double’s mantissa. Now, in Rust, all I had to do to avoid
this precision loss was `cargo add <a href="https://docs.rs/num/latest/num/">num</a>`, and then use
<code>Ratio&lt;BigInt&gt;</code> for the conversion.</p>
<p>But if it was C++, while you can find a library for this, you would want to avoid the dependency – because without a standard
build &amp; packaging system, dependencies are a major PITA. So I’d just leave a TODO like they did in XRay.</p>
<p><strong>Rust is the fastest popular language with a standard package manager</strong>. <em>This alone</em> will make you
extremely productive in areas it has good libraries for, if you’re looking to minimize the product of machine time x developer
time. <strong>Mature support and widespread use of binary packages for C++ code would greatly boost Rust’s
applicability!</strong> For instance, <code>pip install</code>ing Python bindings wrapping C++ libraries is way easier than
managing these libraries as source dependencies. If Rust could become “a packaging system for C++” like Python effectively is,
it would immediately become very tempting to use just for this reason!<a class="footnote-ref" role="doc-noteref" href="#fn12" id="fnref12"><sup>12</sup></a></p>
<p>Of course, our bigger task is parsing ELF (with <a href="https://docs.rs/goblin/latest/goblin/elf/index.html">goblin::elf</a>) and DWARF (with <a href="https://docs.rs/addr2line/latest/addr2line/">addr2line</a>) - and we need to parse both. Only DWARF has line info, but
only ELF has symbols containing some of the code pointers – for example, gcc doesn’t bother to produce DWARF debug info for
“thunks” it generates. What’s a thunk? Well, there are “virtual thunks” and “non-virtual thunks” according to the C++ demangler
(cargo add <a href="https://docs.rs/cpp_demangle/latest/cpp_demangle/">cpp_demangle</a>); I’m sure this means something, but I
don’t care exactly what it is – I just want some name at least somewhat related to the source code instead of bare hex
garbage.</p>
<p>Which reminds me – and I’m sure experienced programmers have seen it coming – we actually have <em>three </em>jobs:
converting code pointers to names, converting cycles to microseconds, and dealing with random shit. Examples of the latter:</p>
<ul>
<li>Detecting and ignoring “virtual override thunks,” which for some reason call __return__ but not __fentry__</li>
<li>Handling “orphan returns” (when the function call was overwritten in the cyclic buffer and we only see the return
event)</li>
<li>Handling exceptions – a subject with its own section in the Hardcore Followup</li>
<li>Handling “strange missing returns” when f called g which called h, but h returns straight to f (eg because setjmp/longjmp
were used)</li>
<li>Remapping pathnames according to substitution rules provided by the user, in case the files aren’t where the debug info says
they should be (which happens way more often than it should - see <a href="https://yosefk.com/blog/refix-fast-debuggable-reproducible-builds.html">an easy way to avoid such issues</a>)</li>
</ul>
<p>But, that’s about it. I dwell on the details in part to show that <strong>it’s not that much work</strong>, even if you’re
aiming to cover enough ground for “serious uses” - threads, exceptions, shared objects, multiple compiler instrumentation
options, etc. etc. <strong>The decoding is about 1K LOC</strong>, same as the runtime (but with way more library
dependencies!)</p>
<p>One last item to file under “random shit” is converting <em>ftrace timestamps </em>(similar to, but uglier than converting
our trace entry timestamps), bringing us to…</p>
<h3 id="ftrace-tracing-thread-state-changes">ftrace: tracing thread state changes</h3>
<p>A function tracer traces function calls, and since we just did, we could use this tautology to declare that our job is done.
However, you’ll wonder if a function taking a long time was actually <em>computing</em> something or <em>waiting </em>for
something – so you need to know whether the thread was in a running state or not.</p>
<p>Linux can trace many kernel events, including scheduling events. A userspace peasant with the right permissions (eg
<code>sudo chown -R $USER /sys/kernel/tracing</code>) can configure ftrace to log thread scheduling events. You get the latest
events from the kernel buffer with <code>cat /sys/kernel/tracing/trace</code> (and no, you don’t need to actually understand the
example log below to follow what’s next - just showing it for those curious about tracing on Linux):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#                   _-----=&gt; irqs-off
#                  / _----=&gt; need-resched
#                 | / _---=&gt; hardirq/softirq
#                 || / _--=&gt; preempt-depth
#                 ||| / _-=&gt; migrate-disable
#                 |||| /     delay
#  TASK-PID CPU#  |||||  TIMESTAMP  FUNCTION
#     | |     |   |||||     |         |
  &lt;...&gt;-78  [<b>003</b>] ..... 30625460: <b>task_newtask:</b> pid=81 comm=main clone_flags=3d0f00 oom_score_adj=0
 &lt;idle&gt;-0   [<b>004</b>] d.... 30701644: <b>sched_switch:</b> prev_comm=swapper/4 prev_pid=0 prev_prio=120 prev_state=R ==&gt; next_comm=main next_pid=81 next_prio=120
  &lt;...&gt;-78  [<b>003</b>] d.... 30747413: <b>sched_switch:</b> prev_comm=main prev_pid=78 prev_prio=120 prev_state=D ==&gt; next_comm=swapper/3 next_pid=0 next_prio=120
  &lt;...&gt;-81  [<b>004</b>] d.... 30750940: <b>sched_waking:</b> comm=main pid=78 prio=120 target_cpu=003
 &lt;idle&gt;-0   [<b>003</b>] d.... 30780026: <b>sched_switch:</b> prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==&gt; next_comm=main next_pid=78 next_prio=120
  &lt;...&gt;-81  [<b>004</b>] ..... 30810996: <b>task_rename: </b>pid=81 oldcomm=main newcomm=worker oom_score_adj=0
  &lt;...&gt;-78  [<b>003</b>] d.s.. 37939679: <b>sched_waking:</b> comm=rcu_sched pid=15 prio=120 target_cpu=035
  &lt;...&gt;-81  [<b>004</b>] d.h.. 38466542: <b>sched_waking:</b> comm=code pid=9974 prio=120 target_cpu=004
</pre>
<p>It turns out that we don’t even need to parse this format – <strong>Perfetto can simply read ftrace data from a
<code>systemTraceEvents</code> JSON key, and has special support for visualizing scheduling events.</strong> Here’s how it
looks:</p>
<p><img alt="image7.png" height="434" src="https://yosefk.com/img/funtrace/image7.png" title="Perfetto ftrace support" width="576" style="max-width: 100%;height: auto;"></p>
<p>Perfetto shows us which thread each CPU core is running at any given moment (where “swapper” is Linux-speak for “nothing”.)
To each thread, Perfetto adds a special lane showing whether its state is Running, Runnable (light green), or waiting for
something (white/blank.)</p>
<p>So all we have to do is collect and log ftrace events (~200 lines of funtrace’s ~1200 LOC runtime.) We configure
<code>trace_clock</code> to <code>x86-tsc</code> to synchronize timestamps with our function call/return events. We read the
scheduling events supported by Perfetto from <code>trace_pipe</code> (only listening to events from our process and its
children, or we’ll be flooded with data, including too many CPU lanes to vertically scroll through.) And we maintain a circular
buffer of events, so that we can get a snapshot of all events after some time threshold at any moment.</p>
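<p>Concretely, the configuration boils down to writing a few files under /sys/kernel/tracing – a sketch of the sort of thing the runtime does (error handling omitted, and the exact set of options funtrace configures may differ):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

static void echo(const char* path, const char* value)
{
    FILE* f = fopen(path, "w");
    if(f) { fputs(value, f); fclose(f); }
}

void setup_sched_tracing()
{
    char pid[32];
    snprintf(pid, sizeof pid, "%d", (int)getpid());
    echo("/sys/kernel/tracing/trace_clock", "x86-tsc");
    echo("/sys/kernel/tracing/set_event_pid", pid); <i>//only our process...</i>
    echo("/sys/kernel/tracing/options/event-fork", "1"); <i>//...and its children</i>
    echo("/sys/kernel/tracing/events/sched/sched_switch/enable", "1");
    echo("/sys/kernel/tracing/events/sched/sched_waking/enable", "1");
    <i>//...then keep reading /sys/kernel/tracing/trace_pipe into a cyclic buffer</i>
}
</pre>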
<p>That’s it - the offline decoder converts ftrace timestamps from TSC to milliseconds (a bit ugly to have to massage text like
this, but no biggie; weird that you have to do this - the JSON evidently wasn’t designed for TSC, the fastest timestamping
method, but whatever.) And we’re done.</p>
<p>Note that there’s WAY more info that we could get from ftrace. For example, we could show how threads wait for each other
because of taking the same locks, etc. etc. I just picked the low-hanging fruit, which was enough for a certain “completeness” –
you can know both what functions the CPU was running and when it was waiting.</p>
<p>I don’t know why other function tracers using Perfetto don’t collect ftrace data, with Perfetto making it so easy. I think
some come from a larger system having another part doing this – and perhaps others don’t bother because users have trouble
getting permissions to access ftrace?!.. <strong>Why do I need permissions to know when my own threads were
scheduled?..</strong> (I confess up front that if there’s an answer, I might refuse to understand it. I hate permissions!)</p>
<h3 id="getting-traces-from-core-dumps">Getting traces from core dumps</h3>
<p>I firmly believe that in addition to compile-time and runtime support, proper developer tools come with <em>coretime
support</em>. I have a morbid fascination with <strong>coredump-oriented design: make your data structures easy to extract from
core dumps, and write the code to do so</strong>.</p>
<p>Function traces might come in handy when performing an autopsy on a core dump – it helps to see what the program was doing
before it crashed. Looking at when your threads were doing what might help you understand a race condition. A null dereference
might become obvious once you see that the wrong flow was running. And then some core dumps might be due to performance being
too bad (eg a real time program purposely crashing during test cycles), and traces are perfect for that.</p>
<p>So funtrace comes with a gdb.Python extension command, imaginatively named <code>funtrace</code>. It reads the thread-local
trace buffers, as well as the ftrace events<a class="footnote-ref" role="doc-noteref" href="#fn13" id="fnref13"><sup>13</sup></a>. We also save the addresses where shared libraries were loaded to, which we get from gdb’s
<code>info proc mappings</code>. You get a funtrace.raw file in the same format you’d get from
<code>funtrace_save_snapshot()</code>, and you can decode it with <code>funtrace2viz</code>. At <strong>~120 LOC</strong>, our
“coretime” costs us just 10% of our runtime in terms of lines of code.</p>
<h3 id="funcount-culling-overhead">funcount: culling overhead</h3>
<p>In a microbenchmark, it takes 8-9 ns to log one trace entry, and every function call takes 2 entries. This can be
a lot or a little, depending on how much work a function does on average. From my experience, <strong>you’ll probably want to
disable tracing for a bunch of short functions</strong> to get the overall overhead down to single-digit percentage points –
what I’d call a fair price for being able to debug performance issues (it <em>costs </em>money because it <em>saves </em>money,
like a plumber said in some movie; up to a point, slower code that you can optimize thanks to the extra visibility ends up being
<em>faster, </em>because you will have used more optimization opportunities.)</p>
<p>You can exclude a function from tracing using the <code>NOFUNTRACE</code> macro, which adds an <code>__attribute__</code> to
the function disabling compiler instrumentation. The question is, <strong>which functions to exclude?</strong> You could check
some traces and find often-called short functions. But this brings back the first problem with tracing profilers: what to trace?
Traces are short - perfect for understanding interesting outlier events, but not for understanding the overhead <em>on average
</em>like we want here.</p>
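<p>As for the mechanics of NOFUNTRACE, usage looks like this – a sketch, with a made-up <code>lerp()</code> standing in for whatever short, hot function you’ve decided isn’t worth two trace entries per call, and assuming the macro comes from funtrace’s header:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include "funtrace.h" <i>//defines NOFUNTRACE</i>

<i>//short and called all the time - tracing it costs more than running it</i>
NOFUNTRACE static inline float lerp(float a, float b, float t)
{
    return a + (b - a) * t;
}
</pre>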
<p>The obvious approach is, <strong>let’s sort the functions by the number of times they’re called, and consider disabling
tracing on those called most often</strong>. However, what is actually not obvious is <em>how to count function calls</em>. A
sampling profiler can’t tell the difference between a long function called once and a shorter function called many times<a class="footnote-ref" role="doc-noteref" href="#fn14" id="fnref14"><sup>14</sup></a>. gprof collects accurate call counts – for
single-threaded programs; in multithreaded programs, the call counts are garbage. And callgrind is slow, and won’t run in many
environments.</p>
<p>So, funtrace ships its own tool for counting calls, <code>funcount</code> – which is nice, in particular, because <strong>it
counts the exact same calls that funtrace would instrument</strong> under whatever compiler flags you chose. It works like
this:</p>
<ul>
<li><strong>The function entry hook increments an atomic counter in a 2-level page table</strong> (see the sketch after this
list). We assume that 48 bits out of a pointer’s 64 are actually used; given an address, we increment the counter at
<code>pages[high_bits][mid_bits][low_bits]</code>, where each index is made of 16 out of these 48 bits, and the arrays are
sparse (most of the page pointers are null.)</li>
<li>We could allocate pages thread-safely on demand using some compare-and-swappery, but it slows things down A LOT. Instead,
<strong>we use dl_iterate_phdr upon program start and upon calls to dlopen</strong> to find executable address space segments,
and allocate pages only for those segments<a class="footnote-ref" role="doc-noteref" href="#fn15" id="fnref15"><sup>15</sup></a>.</li>
<li><strong>We print the non-zero counts at the end of the program run</strong>. The resulting report, funcount.txt, can be
decoded to function names &amp; source line numbers using funcount2sym. You can then sort by call count with <code>sort</code>,
and combine reports from multiple runs with awk or such<a class="footnote-ref" role="doc-noteref" href="#fn16" id="fnref16"><sup>16</sup></a>.</li>
</ul>
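<p>The promised sketch of the counting hook, under the assumptions above – not funcount’s literal code (for one thing, the real hook is an assembly stub like funtrace’s), but it shows the data structure:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;atomic&gt;
#include &lt;stdint.h&gt;

typedef std::atomic&lt;uint64_t&gt; counter;

<i>//top-level table, indexed by address bits 47:32; mid tables and counter
//pages are preallocated for executable segments, so the hook never allocates</i>
static counter** page_table[1 &lt;&lt; 16];

void count_call(uint64_t func_addr) <i>//called upon function entry</i>
{
    counter** mid = page_table[(func_addr &gt;&gt; 32) &amp; 0xffff];
    if(!mid) return; <i>//not in a known executable segment</i>
    counter* page = mid[(func_addr &gt;&gt; 16) &amp; 0xffff];
    if(!page) return;
    page[func_addr &amp; 0xffff].fetch_add(1, std::memory_order_relaxed);
}
</pre>
<p>The preallocation is what keeps the hook this short – two loads and a relaxed atomic increment.</p>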
<p>There’s also a knob to tune a time/space tradeoff: if you compile with <code>-DFUNCOUNT_PAGE_TABLES=16</code>, 16 page tables
will be kept instead of one, with each thread indexing into page table number <code>cpu_core_number % 16</code> (we get the core number from
RDTSCP.) If you have many threads calling the same functions, more page tables means less fighting for the cache lines keeping
these functions’ counters – at the cost of using more memory. The final report is the sum of the counts from all the page
tables.</p>
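<p>Getting the core number from RDTSCP could look like this – a sketch, assuming the Linux convention of keeping the CPU number in the low 12 bits of IA32_TSC_AUX:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;x86intrin.h&gt;

#ifndef FUNCOUNT_PAGE_TABLES
#define FUNCOUNT_PAGE_TABLES 16
#endif

unsigned page_table_index()
{
    unsigned aux;
    __rdtscp(&amp;aux); <i>//fills aux from IA32_TSC_AUX alongside reading the TSC</i>
    return (aux &amp; 0xfff) % FUNCOUNT_PAGE_TABLES; <i>//low 12 bits = CPU number on Linux</i>
}
</pre>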
<p>This thing is about as fast as funtrace, modest in memory use, takes <strong>~250 LOC for the runtime + ~60 LOC for the
offline decoding</strong> – and like funtrace itself, is easy to understand and port.</p>
<p>While we’re on the subject of tracing overhead, one might ask – isn’t <em>opt-in</em> tracing better? Instead of tracing by
default and using tools to find when to opt out of tracing, why not let people trace whatever they want, without putting tracing
in behind their backs?</p>
<p>This approach can work well for some teams. I believe that tracing by default is better for most projects, if only because
<strong>the code with the most surprising performance artifacts is both what you want traced the most <em>in hindsight, </em>and
what was most likely <em>not </em>traced satisfactorily up front </strong>in an opt-in tracing regime, since nobody expected the
surprise by definition.</p>
<p>Another point is that an “opt-in tracing backlog,” where many people just <em>don’t </em>opt in for a long time, can easily
grow so much that you will lose all hope to use tracing when you need it. The “opting-out” backlog <em>cannot </em>grow so
badly, because too much tracing results in performance problems that you will <em>have </em>to fix long before they’ll get too
daunting to even try. This is similar to the argument for tracing in production vs flipping a switch when you need tracing –
<strong>a switch that’s off by default is unlikely to really work when you need it</strong>.</p>
<p>Last but not least, <strong>opting out is typically less work than opting in, making tracing cost less in development
time</strong>. You can definitely create a culture where people carefully put tracing statements into their code; the endlessly
growing “tracing backlog” is likely, but it’s not destiny. But then changing code becomes more costly, so you’ll tend to avoid
it more often – a problem in its own right. The same thing happens with other “good things,” like documentation and tests – they
make your system better, but you take more time making it, and then balk at making major changes. The nice thing about tracing
is that unlike documentation and tests, you can have it done mostly automatically with a bit of manual intervention.</p>
<h2 id="antithesis-llvm-xray">Antithesis: LLVM XRay</h2>
<p>The thesis behind funtrace is that the user:</p>
<ul>
<li>traces in production,</li>
<li>monitors &amp; culls tracing overhead, and</li>
<li>decides when to obtain and save trace data.</li>
</ul>
<p>Now let’s look at XRay, built on assumptions we could call the antithesis of ours:</p>
<ul>
<li>We’re a huge company (in the specific case of XRay’s developers, it’s called Google.)</li>
<li>We have a million programmers maintaining a billion lines of code who can’t be bothered to integrate a tracing profiler into
their code. These programmers aren’t our users.</li>
<li>We have, however, a performance team looking for issues across our millions of servers. <em>These </em>are our users.</li>
<li>If this performance team looks for issues completely randomly in a small share of our machines, it’s still a ton of
machines, so they’ll accidentally bump into some unusual slowdowns.</li>
<li>All of our code is compiled by the same build system. We can change the way code is compiled, as long as the overhead <em>as
experienced by the teams owning the code </em>is not large.</li>
</ul>
<p>In short, the programmers who wrote the traced code are patients signing a consent form for periodic checkups – and the X-Ray
machine operators are trained doctors on a separate team. <a href="https://yosefk.com/blog/advantages-of-incompetent-management.html">For a big company, these are very sensible
assumptions</a>. With this in mind, let’s look at how XRay works:</p>
<ul>
<li><strong>XRay instrumentation inserts NOP instruction sequences upon function entry/exit</strong>. The XRay runtime can then
change these instructions to make the functions jump to function entry/exit handlers <em>while the program is running</em>, and
this patching is done without pausing execution - quite the engineering feat. This makes the overhead very low <em>when tracing
is disabled, </em>which is what code owners care about. It also lets the performance team pick subsets of code where tracing is
enabled at runtime.</li>
<li><strong>XRay culls the overhead of tracing automatically</strong>, so that code owners needn’t bother. In fact, if you run
XRay in a small test, you might get an almost empty trace and assume it didn’t work, unless you lower the following thresholds:
<ul>
<li><code>-fxray-instruction-threshold=&lt;N&gt;</code> (default: 200) - the compiler only instruments functions with at least
this many instructions; it strays from this rule for functions with loops, since they’re likely to take a long time regardless of instruction count.</li>
<li><code>XRAY_BASIC_OPTIONS=func_duration_threshold_us=&lt;N&gt;</code> (default: 5 microseconds; undocumented at the time of
writing) - a function call that took less than the value set by this env var is omitted from the trace<a class="footnote-ref" role="doc-noteref" href="#fn17" id="fnref17"><sup>17</sup></a>.</li>
</ul></li>
<li><strong>XRay makes an effort to compress trace events</strong> - instead of logging 64b function pointers, the compiler
produces 28b integer IDs for the functions, and the runtime stores 32b TSC deltas instead of 64b TSC values. (If the delta
doesn’t fit into 32 bits, the runtime logs a special record with the full 64b TSC value.)</li>
</ul>
<p><strong>Thus an XRay trace entry is half the size of funtrace’s </strong>(and it’s furthermore very easy to keep short calls
out of the trace)<strong>, but XRay’s runtime overhead per instrumented function call is 6 times as large </strong>(as measured
in a microbenchmark, FWIW.)</p>
<p>This is <em>the sensible tradeoff</em> for a big company with a big code base, a big performance team outside the teams
owning the code, and a big tooling team implementing the tracing profiler:</p>
<ul>
<li><strong>It’s extremely important to minimize overhead with tracing disabled</strong>, or people will push back against
compiling with instrumentation.</li>
<li><strong>Always-on function tracing is prohibitively expensive</strong>. With a giant fleet of machines, even a 3% overhead
costs a fortune.</li>
<li><strong>…But the overhead of <em>occasionally </em>enabled tracing is no big deal</strong>. A rare 20-40% slowdown (the
number reported in <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45287.pdf">the XRay
whitepaper</a>) experienced by some service when the performance team is tracing it won’t register as a problem.</li>
<li><strong>Traced data size must be minimal</strong>. Since nobody tells us when something interesting happens, we must collect
data covering A LOT of time to find interesting things ourselves – so we better filter the data well.</li>
<li><strong>The complexity of the tool is not an issue</strong> – the cost of developing and maintaining it is dwarfed by the
cost savings it can provide across a giant server fleet.</li>
</ul>
<p>Thus it is the right thing, for a big company, to invest effort into thread-safe runtime code patching, a system for creating
and decoding small function IDs, and runtime mechanisms for filtering the trace – mechanisms which slow tracing down and make
traces harder to understand, but the performance team will manage, and this is how you get that team data covering a
sufficiently large time range.</p>
<p>The adverse effects of this approach manifest in “smaller” contexts:</p>
<ul>
<li><strong>You can’t afford a high runtime overhead in many cases,</strong> eg a realtime application, or even just a GUI on an
end-user device. If you have a huge server fleet, slowing down some services considerably for a few seconds is fine, in part
because in any case there are network lags all the time, and the end user can’t tell the difference. This isn’t as true when
you’re running on a small “edge device” rather than a big server farm. Overhead is also worse for a <em>smaller </em>server farm
– the more machines you have, the smaller share of them you can slow down for tracing (and still find stuff), so the
<em>relative </em>cost shrinks.</li>
<li><strong>Runtime code patching is impossible in some small, “embedded” environments,</strong> either because you’re running
from ROM, or because you’re running from a read-only image in RAM, and have no room for copies of the code pages changed by the
runtime patching.</li>
<li>The open source XRay version (not what Google uses internally AFAIK) <strong>still can’t decode symbols from shared
libraries.</strong> Actually, it didn’t even <em>log</em> function calls made by shared libraries for almost a decade, since the
runtime patching didn’t cover DSOs. But the latest LLVM version <em>does</em> log them thanks to a patch submitted in 2024 by <a href="https://arxiv.org/pdf/2303.11110">people using XRay for academic work</a>. They <a href="https://github.com/tudasc/CaPI/issues/2">plan to make decoding work, too, but the current implementation only supports the
“basic logging format”</a> which is so bloated that it’s unusable in production - the runtime overhead will be
<strong>15-18</strong> times as large as funtrace’s… Eventually, the OSS XRay might fully support shared libraries, but it will
have taken at least a decade after the initial version was released.</li>
</ul>
<p>A big tooling team can deal with the complexity of runtime patching and function ID decoding in the presence of DSOs<a class="footnote-ref" role="doc-noteref" href="#fn18" id="fnref18"><sup>18</sup></a>; evidently it’s harder for a smaller
project. Using code pointers as IDs spares you these complications, but you pay in trace entry size – which is OK for a smaller
software system, where you can put in logic for tracing just what you want.</p>
<p><strong>Bigness begets bigness and works well with bigness, and vice versa</strong>. Neither big nor small are “better;” both
are fine as long as you “go big” or small consistently, and when the environment calls for it.</p>
<h2 id="synthesis-funtrace-with-xray-characteristics">Synthesis: funtrace with XRay characteristics</h2>
<p>You can use funtrace with XRay instrumentation – a straightforward kind of synthesis. This uses almost the same assembly
call/return handlers that we have for gcc with the -pg… instrumentation flags – these get called instead of XRay’s
callbacks.</p>
<p>Why would you use <code>-fxray-instrument</code> instead of <code>-finstrument-functions-after-inlining</code>, the other way
to use funtrace under clang?</p>
<ul>
<li>You might like to be able to automatically exclude short functions by tuning the threshold set by
-fxray-instruction-threshold=N.</li>
<li>You can patch the code at runtime to enable tracing and disable it again, lowering the overhead further relative to
funtrace’s way of disabling tracing (which still has your code jumping to its handlers, which do less than they do when tracing –
but a bit more than XRay-generated code not patched to jump to tracing callbacks.)</li>
</ul>
<p>Note that you need a recent LLVM to be able to trace inside DSOs with XRay instrumentation – specifically, a version having
the <code>-fxray-shared</code> flag.</p>
<p>Note as well that <strong>support for exceptions under -fxray-instrument has some limitations (same as with gcc under
-pg)</strong>, though it’s pretty good and a big step up from XRay’s not supporting exceptions at all (the Google style guide
bans C++ exceptions, and I must say that I fully share their distaste for the feature – but many programs use exceptions, so
funtrace makes an effort to support them, as we’ll see in the followup.)</p>
<p>Now, what would be a deeper form of synthesis than just combining XRay instrumentation with funtrace runtime? What does a
<strong>synthesis of assumptions</strong> look like?</p>
<p>Let’s say we assume the developer is “the user” of tracing and “owns” it, rather than relying on a performance team. We can
still ask ourselves, when is the developer the closest to the position of such a performance team? The answer is, <strong>when
adding tracing to their program for the first time!</strong></p>
<p>I mean, if you started out with tracing from day one, then you’re never in that position. But if you already have a biggish
system and you’re adding tracing to it, then it’s very tedious to manually exclude lots of small functions from tracing.
<strong>How can we make this easier?</strong></p>
<ul>
<li><strong>Filtering by function size</strong>: funtrace adds a compile time flag, <code>-funtrace-instr-thresh=N</code>, which
works a lot like -fxray-instruction-threshold=N; it excludes short functions from tracing unless they have loops (though you can
pass <code>-funtrace-ignore-loops</code> to have them excluded anyway)</li>
<li><strong>Filtering using a list of mangled function names</strong>: let’s say you want to disable the tracing of lots of
functions reported by funcount as “frequent callees”, and check what this does to tracing overhead, and to the traces you get.
Going thru 100 source files to add NOFUNTRACE to each function gets old quickly – especially if you want to try many different
experiments, excluding different subsets of functions every time. Instead, you can use <code>-funtrace-no-trace=file</code> to
exclude the functions listed in that file – way quicker than editing each function.</li>
</ul>
<p>Sounds great, but you might be wondering, <em>how does funtrace “add a compile time flag”</em>, if it uses stock gcc or clang
with no changes or compiler passes?.. If you had a feeling of something fishy coming up, you were very right. Funtrace adds
these trace filtering flags by <strong>post-processing the assembly code generated by the compiler.</strong> Some implications
of this:</p>
<ul>
<li><strong>Assembly post-processing removes most, but not all of the instrumentation overhead.</strong> We’re removing
instructions calling the function entry/exit hooks, but the code will still have been variously “scarred” by having those calls
put in by the compiler in the first place. It shouldn’t cost more than 1ns per function, but it can add up.</li>
<li><strong>Assembly post-processing, while tested on large programs, is less solid than the rest of funtrace.</strong> It’s
easy for a compiler to generate assembly code that breaks this filtering – both the loop detection (which simply looks for
branches to labels defined before the branch within a function) and the removal of calls to the hooks. It’s simple text
processing making assumptions about how the text of the .s files looks. It “works”, but these assumptions aren’t backed by any
specification.</li>
</ul>
<p>So one might decide that this assembly post-processing is more suitable for initial experimentation than a long-term
production deployment. It’s there; it’s your call in what scope to use it, and people will have different preferences for good
reasons. I personally don’t mind the risk of incorrect code generation that much, because I’m good at debugging such things, and
I count on tests to uncover it quickly. But this is the opposite of the right approach for many teams, so I’ve given the exact
opposite advice on some occasions.</p>
<p>Whether you want it for production or not, assembly post-processing should make first-time experiments with funtrace easier,
by providing features for “an encounter with a system for which tracing is alien” – the situation XRay is designed around.</p>
<h2 id="hardware-assisted-tracing">Hardware-assisted tracing</h2>
<p>Most CPUs have some hardware tracing facilities, but the most basic &amp; common of these are designed for people debugging
the hardware itself or very low-level software like kernels and boot loaders using something like a JTAG probe. For example,
when a branch is taken, the instruction address or a delta might be sent to a probe like that over a dedicated channel, or it
might be saved into a tiny circular SRAM buffer inside the chip that you can then read with the probe. This doesn’t help most
people debugging large systems though.</p>
<p>An exception to this is Intel Processor Trace, which lets you trace native code with zero instrumentation. The awesome
magic-trace is built on top of it. You can try it with <code>magic-trace run ls</code> (for example); it works out of the box,
no recompilation required, and the overhead is low (they say 2-10%.)</p>
<p><strong>I’m unironically in shock that people deploying on x86, certainly to environments they control, don’t all insist on
using Intel hardware to always run the code under magic-trace in production.</strong> (You can trigger tracing programmatically,
and with these traces, you’ll be able to debug any latency issue you could have in production.) Did Google develop the
high-overhead XRay when Intel Processor Trace already existed?.. (Not sure about the exact timeline; perhaps the two
matured at about the same time?) How can it be that a decade after Intel Processor Trace was made, AMD still hasn’t caught up,
and other platforms also lack equivalent features?</p>
<p>Seriously, it’s as if you had hardware floating point units for a decade and only a select few were using them, or were
even aware of them, with a few more stuck on software emulation, and most just <a href="https://yosefk.com/blog/10x-more-selective.html">using scaled integers</a>, as in representing 1.05 with 105. How do we
explain this?</p>
<ul>
<li><strong>Programmers aren’t demanding tracing profilers</strong>; they use sampling profilers, because it’s the tradition, it
helps with the average case (and unlike with the big O notation, nobody drills it into programmers to look for the worst case
when profiling, or how to do it), and it requires <em>zero </em>work, unlike a tracing profiler where triggering tracing
requires <em>slightly more than zero work.</em></li>
<li><strong>The hardware solution is fairly complicated.</strong> Many teams will not do it given weak demand; a lightweight
feature with weak or uncertain demand might get greenlit, but a full-blown control flow trace like Intel’s isn’t a lightweight
feature.</li>
<li><strong>Machine instruction-level control flow tracing is useless for interpreters and hard-to-use for JITters,</strong> and
most people with control of their environment are server people running a lot of Java, JavaScript, Python, PHP etc. For an
interpreter all you’d see is the interpreter’s loop spinning, with no clue what code it’s interpreting. For a JITter, <em>every
JIT runtime</em> would need to dump metadata telling a tool like magic-trace what source code each machine code snippet was
generated from – you get this from compiler-generated symbol tables for statically compiled native code.</li>
</ul>
<p>I would be pleasantly surprised if this writeup would cause programmers’ demand for tracing profilers to outright explode,
but I’m not counting on it. However, I have a suggestion for tracing support in the CPU hardware which is <strong>simple enough
to risk implementing despite weak demand – and simple enough for interpreters and JITters to use.</strong> So you can
realistically hope to put this thing into your CPU, have interpreters and JITters use it, and then programmers will love your
hardware for its tracing features.</p>
<p>Here’s how it could work:</p>
<ul>
<li><strong>You add an instruction, say TRACE</strong>, which traces a 64b register value from a general-purpose register (or
the instruction pointer as a special case). It also traces a few bits of static metadata from its encoding (so we can tag events
as “function call,” “return,” “context switch” etc.)</li>
<li><strong>TRACE sends the data above via a special port to a trace writer module.</strong> TRACE is <em>not </em>another store
instruction; it doesn’t go thru caches. Instead, data is sent to a module where it could be timestamped, compressed, and written
from the module’s SRAM to a cyclic buffer in DRAM.</li>
</ul>
<p><img alt="image5.png" height="336" src="https://yosefk.com/img/funtrace/image5.png" title="A trace writer with a DMA bypassing the CPU cache system" width="556" style="max-width: 100%;height: auto;"></p>
<p>Thus we still rely on software to issue tracing instructions, but it’s just one instruction per event, we get timestamping,
compression and cyclic buffer management for free, and the only cost is another instruction and the DRAM bandwidth spent on
writing to the cyclic buffer (and we can write at a low priority - we don’t mind buffering these writes as long as we haven’t
run out of SRAM in the module; we would also lose very little bandwidth to precharging/activating DRAM rows, since our writes
are the opposite of random access.) We can also turn off the writing, and then the overhead shrinks to just fetching &amp;
decoding a NOP - “like XRay with tracing disabled.”</p>
<p>Two refinements of this idea in two opposite directions:</p>
<ul>
<li><strong>If you’re “a perfectionist CPU maker”</strong>, you can further lower the overhead to near-zero for statically
generated code. For example, x86 already has the ENDBR64 instruction, which every recently compiled function starts with, in
pursuit of the venerable goal of “<a href="https://en.wikipedia.org/wiki/Control-flow_integrity">control flow integrity</a>”<a class="footnote-ref" role="doc-noteref" href="#fn19" id="fnref19"><sup>19</sup></a>. This instruction could be an implicit TRACE
PC, and we could have an ENDBR64_UNTRACED for excluding functions from tracing to save bandwidth. The same thing could be done
with RET.</li>
<li><strong>If you’re “a pragmatic chip maker”</strong> and the CPU IP vendor won’t add a TRACE instruction, you can instead
have software store to a memory-mapped device your chip makes available at some address. You will spend more instructions to
send the data, and you will interfere with the load-store unit, but you will still save the overhead of timestamping and cyclic
buffer management, you won’t pollute caches with trace data, you will save space &amp; bandwidth with compression, and you will
get a single buffer for all the trace data instead of many thread-local buffers which probably take more memory than you need
(for this, you will want to encode the thread ID or the CPU ID when storing trace data, either in the data itself or the address
bits - eg each CPU can store to a different memory-mapped address, and the OS can trace context switches so you know which
thread is currently running on each CPU.) A sketch of this memory-mapped variant follows the list.</li>
</ul>
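<p>Here’s a minimal sketch of the second, memory-mapped option – the address, the event encoding and the helper names are all made up for illustration:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;cstdint&gt;

// hypothetical MMIO address where the chip exposes the trace writer; the OS
// would map this region as uncacheable or write-combining, so trace data
// never pollutes the caches
static volatile uint64_t* const TRACE_PORT = (volatile uint64_t*)0xF0000000;

enum TraceTag : uint64_t { TAG_CALL = 1, TAG_RETURN = 2, TAG_SWITCH = 3 };

// one store per event; the trace writer module timestamps, compresses and
// DMAs the data into a cyclic DRAM buffer
inline void trace_event(const void* code_ptr, TraceTag tag) {
    // assuming functions are 16-byte aligned, the pointer's low bits
    // are free to carry the event tag
    *TRACE_PORT = (uint64_t)code_ptr | tag;
}</pre>
<p>Per-CPU port addresses (or encoding the CPU ID in the stored value) would let the decoder pull the single shared buffer apart into per-thread timelines.</p>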
<p>This scheme is very easy on the hardware; a “tracing device” with a RAM and a DMA is invisibly small in today’s hardware
designs, and this doesn’t interfere with the CPU logic and doesn’t risk bugs, either those breaking the CPU or those breaking
the trace.</p>
<p>As an added bonus, <strong>an interpreter can use TRACE and pass it some data which isn’t a machine instruction address but
rather the ID of a function in the interpreted language</strong>. And a JITter can emit TRACE instructions into its code. (I
feel that this would be bigger news for interpreters than JITters, but I think most JITters would benefit from it relative to
Intel Processor Trace-like hardware-assisted tracing – would be happy to hear the thoughts of people understanding
JITters.)</p>
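<p>To make the interpreter case concrete, here’s a sketch – the <code>TRACE</code> mnemonic, its operands and the tag values are pure hypothesis, since no CPU I know of has this instruction:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;cstdint&gt;

// tag values baked into the hypothetical TRACE instruction's encoding;
// the tag is an immediate, hence the template parameter below
constexpr int TAG_CALL = 0, TAG_RETURN = 1;

template&lt;int Tag&gt;
inline void trace(uint64_t value) {
    asm volatile("trace %0, %1" :: "r"(value), "i"(Tag));
}

// an interpreter traces the IDs of *interpreted* functions - data that
// machine instruction-level tracing can never recover on its own
void call_interpreted(uint64_t func_id) {
    trace&lt;TAG_CALL&gt;(func_id);
    // ...run the function's bytecode...
    trace&lt;TAG_RETURN&gt;(func_id);
}</pre>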
<p>I think it would be great to start seeing TRACE instructions in CPUs!</p>
<h2 id="conclusion-and-future-work">Conclusion and future work</h2>
<p>We (it’s always “we” in papers, isn’t it?) have presented a comprehensive solution for C++ function tracing, ready for
production use on x86/Linux and easy to port to many other platforms. We have also used the opportunity to discuss how to use a
function tracer in your workflow, how to implement your own function tracer for native code, and which existing tools can help
with the heavy lifting. Finally, we’ve seen how hardware could help make tracing more efficient and usable for both statically
and dynamically compiled languages, in a relatively cheap &amp; simple way.</p>
<p>Here are some things “we” could add to funtrace (more likely <em>I </em>than <em>we –</em> though I’d be happy to work with
you on this!):</p>
<ul>
<li><strong>Porting to other native languages.</strong> I’d expect the trace file format and the decoder to need no changes in
most cases; the runtime you might well want to rewrite (eg a Rust runtime is easier to distribute with cargo than a C++ runtime,
right?.. Though the C++ one would be functionally adequate in this case?..) For compile-time support, you could use existing
LLVM features where relevant, or you could write an LLVM pass - or write code changing LLVM IR files or even compiler-generated
assembly (the <code>-funtrace</code> compile time flags do this, and I've gone way further with that… in this context. See the
hardcore followup!)</li>
<li><strong>Support for performance counters</strong>. We could log the output of RDPMC instead of, or in addition to RDTSC.
This might be useful (you could learn things like which functions missed the cache a lot), though RDPMC is a PITA, as we’ll see
in that followup of ours.</li>
<li><strong>Support for custom events</strong>. You might want to log things like “the resolution of the images we were
processing was 1920x1080”. A “<a href="https://yosefk.com/blog/delayed-printf-for-real-time-logging.html">delayed printf</a>”
approach could be a great fit for this, seeing how we need the executable files to decode the code pointers anyway; so we could
easily extract format strings from the executables given the logged pointers while we’re at it, and do the formatting at
decoding time (see the sketch after this list).</li>
<li><strong>Support for goroutines / async / other forms of “non-OS threads.”</strong> Note that the relatively easy part is to
decode each such “green thread” into its own Perfetto JSON thread. The harder part is to figure out how not to drown in all
those threads when viewing them; these runtimes brag about millions of threads – that’s a lot of vertical scrolling. Ofc you can
limit tracing to a subset of these; you could also show a lane per CPU instead of a lane per thread, but then the call stacks
changing from one thread to another get very noisy (ask me how I know) and you might want to adapt the viewer to deal with this
somehow.</li>
<li><strong>Support for existing tracing frameworks</strong> – for example, integrating with the <a href="https://github.com/wolfpld/tracy">tracy</a> profiler. Note that any tracing system using the same timestamps as funtrace
can theoretically coexist with it without any need to share buffers or data formats; what you’d want is to then view all this
traced data in a single viewer.</li>
<li><strong>Clever trace filtering</strong> a-la XRay – eg detect at runtime that a function was called 1000 times and took less
than a microsecond every time, and modify its code to no longer call the entry/return hooks. This is risky (what if this
function locks something and will wait for a long time in some future call?), but only somewhat (we’ll still trace its caller
and get some idea of where we waited), and it removes the overhead of RDTSC, which is big on x86.</li>
</ul>
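<p>For the custom events item above, here’s a minimal sketch of the “delayed printf” idea – the struct layout and names are mine, not funtrace’s. The hot path stores a format string <em>pointer</em> and raw argument values; the decoder, which needs the executable anyway, maps the pointer back to the string and formats offline:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;cstdint&gt;

struct LogEvent {
    const char* fmt;   // points into the executable's .rodata
    uint64_t args[2];  // raw values, formatted at decoding time
};

// hypothetical per-thread cursor into a trace buffer (wraparound elided)
extern thread_local LogEvent* g_log_cursor;

inline void delayed_printf(const char* fmt, uint64_t a0, uint64_t a1) {
    // no formatting on the hot path - just a few stores
    *g_log_cursor++ = LogEvent{fmt, {a0, a1}};
}

// usage: delayed_printf("resolution was %dx%d", width, height);</pre>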
<p>That’s it – I hope you liked it! And if you’re really into this kind of stuff (evidently, I’m <em>really </em>into this
stuff!), give <a href="https://github.com/yosefk/funtrace">funtrace</a> a try, and stay tuned for the Hardcore Followup!</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>
<h2 id="see-also">See also</h2>
<p><a href="https://danluu.com/perf-tracing/">Sampling vs tracing</a> was the greatest inspiration for this work. It discusses
tracing in a distributed computing environment, and mentions some techniques different from what we’ve seen above – such as
having your function entry/exit handlers write to a “current code pointer” global variable, read in a busy loop by a CPU core
dedicated for tracing (YMMV, but this could lower the impact of tracing on latency at the cost of “burning” a CPU core, thus
trading some of the machine’s throughput for the latency gain.)</p>
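<p>A minimal sketch of that last technique, under my own assumptions (a single traced thread; appending to the trace buffer elided):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;atomic&gt;
#include &lt;cstdint&gt;
#include &lt;x86intrin.h&gt;  // __rdtsc

// the traced thread publishes its current function with one relaxed store...
std::atomic&lt;uint64_t&gt; g_current_func{0};

inline void on_func_entry(void* fn) {
    g_current_func.store((uint64_t)fn, std::memory_order_relaxed);
}

// ...and a dedicated core polls for changes and timestamps them
void tracer_core_loop() {
    uint64_t last = 0;
    for (;;) {
        uint64_t cur = g_current_func.load(std::memory_order_relaxed);
        if (cur != last) {
            uint64_t ts = __rdtsc();
            // append (ts, cur) to a trace buffer here
            last = cur;
        }
    }
}</pre>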

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>Note that, for every new trace.json, you want to run <code>vizviwer trace.json</code>, <em>even if you fully
expect to have to then open trace.json again from the Web UI, </em>to work around the “RPC” thing<em>. </em>That’s because
vizviewer reads the source code from the JSON at the server side. So if you just load the JSON from the web UI in a tab
previously opened by vizviewer, you won’t get the source code – you’ll get “No source code found” messages.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>Traditionally, flamegraphs are displayed as stalagmites, growing upwards rather than downwards, but, like,
whatever.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>I list the issues I ran into <em>to encourage </em>readers to use viztracer. If I just wrote how great it was,
and then they tried it and ran into issues I didn’t mention, it would <em>discourage</em> them<em>. </em>In fact, I start
twitching when I look for something and only find posts showing the happy path with nothing going wrong, and telling how awesome
the thing is – it is a strong predictor of running into issues with no help in sight. But if I see the rare writeup listing a
few rakes the author stepped on – and their exact whereabouts to keep me from stepping on them myself – <em>that </em>makes me
very optimistic and eager to give the thing a try!<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
<li id="fn4"><p>Strictly speaking, it’s enough to instrument function entry points; the callback can then change the return
address to jump into an epilogue which traces the return, and then jumps to the original return address. For funtrace on
x86/Linux, “this is neither good nor necessary,” to quote Capablanca, but it makes sense in some situations.<a class="footnote-back" role="doc-backlink" href="#fnref4">↩︎</a></p></li>
<li id="fn5"><p>Zero overhead abstractions strike again! The billion-inline-wrappers style famously interferes with debugging,
and profiling is a special case of that. But who cares? That’s overhead for developers, not machines. We’re here to serve
machines! I mean, look how much we did for machines in quality and quantity in the last 80 years, compare that to the modest
improvements they created for us, and weep.<a class="footnote-back" role="doc-backlink" href="#fnref5">↩︎</a></p></li>
<li id="fn6"><p>Pausing can “spoil” some snapshots, but very rarely. Specifically, if each of 2 threads notices an unusual
event, and they take snapshots concurrently, the 1st thread will get a fine snapshot, but the 2nd might see a “hole” in its
snapshot, because the 1st one paused tracing while snapshotting. But 2 <em>overlapping</em>, <em>unusual</em> events are, well,
<em>very</em> unusual, and the 1st is still fully captured.<a class="footnote-back" role="doc-backlink" href="#fnref6">↩︎</a></p></li>
<li id="fn7"><p>Actually, it's <a href="https://github.com/yosefk/funtrace/blob/master/funtrace_flags.h">worse</a> or it would
have been called <em>flag</em> rather than <em>flags</em>, but that stuff definitely belongs in the hardcore followup.<a class="footnote-back" role="doc-backlink" href="#fnref7">↩︎</a></p></li>
<li id="fn8"><p>The races in our tracing code can probably be fixed without having a memory barrier upon every event (like XRay
does, for example.) I <em>think </em>we only need a barrier when we notice that tracing was paused, and we want whoever paused
it to read up-to-date data. I felt that we could leave the races as is, and spare ourselves a bunch of extra complexity, on a rather naive
theory of the memory system. Specifically, I believe that since we have one writer for the trace data (the thing written without
locking – wraparound_mask is <em>read </em>without locking, but is written with a lock held), the worst that can happen is that
a few recent writes will get stuck in the write buffer of the core running the traced thread. However, the bulk of the data will
be in dirty cache lines, and the reader will get up-to-date data from these cache lines, because that’s how cache coherence
works. I’m very willing to hear why things are worse than I think, especially with the funky binary search that the reader is
running on the trace buffers – which will be discussed in The Hardcore Followup, but until then, feedback on the code is most
welcome, too. I’m OK at this stuff but far from great, and much better at avoiding it than doing clever things with it.<a class="footnote-back" role="doc-backlink" href="#fnref8">↩︎</a></p></li>
<li id="fn9"><p>Actually, __cyg_profile_func_enter gives you the function pointer for the caller, while __builtin_return_address
would give you an address of some instruction inside the caller. But both are fine for symbol table lookup – symbol table maps a
function name to an address range containing both pointers. In fact, I don’t understand why __cyg_profile_func_enter bothers to
give you the function pointer – __builtin_return_address would give you an equally useful address, and would save instructions
at the instrumented functions.<a class="footnote-back" role="doc-backlink" href="#fnref9">↩︎</a></p></li>
<li id="fn10"><p>It should be a truth universally acknowledged that objects should have names – very useful for debugging.
Awesome that threads have a whopping 15 characters for the name – can we also get this for locks, for example?.. This is a pet
peeve of mine; I don’t understand why naming objects isn’t more common – the overhead is most often dwarfed by the utility. The
overhead would be smaller still if this was something people cared about – eg you could variously assign 64b IDs mapped to
names, instead of copying strings, if there was a standard way to manage these IDs.<a class="footnote-back" role="doc-backlink" href="#fnref10">↩︎</a></p></li>
<li id="fn11"><p>There’s often a feedback loop where there are some basic reasons for a language not being that great for some
area, and then it becomes a feedback cycle of, “nobody uses the language for this, so language developers don’t care to further
hurt this use case, so they do, and so people have even less reason to use the language for that use case.” I’m not sure this is
the case with Rust &amp; numerics; I haven’t thought about it deeply. It seems that numerics isn’t a focus area for the
language, but there’s also a lot of serious “numerics-adjacent work” happening in Rust. So I guess we’ll see where this goes.<a class="footnote-back" role="doc-backlink" href="#fnref11">↩︎</a></p></li>
<li id="fn12"><p>In fact, to take our example, Rust’s Ratio&lt;BigInt&gt; is pretty slow – enough for me to have noticed &amp;
tried ratios with 64b numerators and denominators, which didn’t work for what I was doing, so back to BigInt I was. I’m pretty
sure that MPFR would have been faster, but it’d also be a more painful dependency to deal with. So, Rust managing prebuilt
packages a-la pip would have been great!<a class="footnote-back" role="doc-backlink" href="#fnref12">↩︎</a></p></li>
<li id="fn13"><p>A nice side effect of our thread reading from the kernel ftrace buffer into our own userspace buffer is that
you can then get the events from a core dump. You could of course save ftrace together with the core using some funky
/proc/sys/kernel/core_pattern instead – but who’s ever gonna do it, and then propagate the trace together with the core through
whatever scary path it goes through until reaching the developer?<a class="footnote-back" role="doc-backlink" href="#fnref13">↩︎</a></p></li>
<li id="fn14"><p>Actually, if we run the program <em>with funtrace instrumentation</em>, then a sampling profiler <em>could
</em>tell us where our __fentry__ callback or whatever gets called from, since a good sampling profiler samples <em>call
stacks</em>, and so it could show us the callstacks with __fentry__ sorted by call count. But this suffers from artifacts due to
low sampling rate, and then it’s just an unpleasant, heavyweight and flaky workflow IME – I tried running
<code>perf record -g</code> and then running <code>hotspot</code> and going to the <em>Bottom Up</em> or <em>Caller/Callee</em>
tabs, and it sorta works, but I can’t recommend it. The nice thing about funcount is that it does this one job it’s named after,
it provides a correct and straightforward answer to “the call count question,” and the reports are small &amp; very easy to
understand, combine and manipulate in whichever way you want.<a class="footnote-back" role="doc-backlink" href="#fnref14">↩︎</a></p></li>
<li id="fn15"><p>Actually our approach misses function calls in constructors – eg if you interpose dlopen, libc’s dlopen will
map executable segments, call the constructors, and only then return to you, giving you your first chance to run dl_iterate_phdr
with the new segments mapped. In most programs, however, <em>for the purposes of lowering tracing overhead</em>, missing
constructor calls has got to be a non-problem.<a class="footnote-back" role="doc-backlink" href="#fnref15">↩︎</a></p></li>
<li id="fn16"><p>Decode the reports before combining them - function pointers are <em>not </em>the same across runs thanks to
ASLR and what-not.<a class="footnote-back" role="doc-backlink" href="#fnref16">↩︎</a></p></li>
<li id="fn17"><p>Actually, the so-called “basic logging format” (too slow for serious use, I’d say) can be said to be filtering
functions by func_duration_threshold_us; it keeps a callstack at runtime, and so it knows the time a function call took, so it
can compare to the threshold. “FDR (flight data recorder) logging” also uses this threshold, but I don’t think it’s as
straightforward as “filtering by function duration” and I didn’t dig enough to tell you exactly what it does.<a class="footnote-back" role="doc-backlink" href="#fnref17">↩︎</a></p></li>
<li id="fn18"><p>It could be that Google’s internal version of XRay doesn’t support shared libraries, either, because Google
links everything statically on the server – not sure they do, but if they do, I am here to say “you see, just like I told you –
shared libraries are no good!” Anyway, my point is only that XRay’s effort to assign function IDs is a “big company thing” for
various reasons I mentioned, and that it creates difficulties with shared library support that within a big company would be
trivial, but evidently aren’t trivial outside it.<a class="footnote-back" role="doc-backlink" href="#fnref18">↩︎</a></p></li>
<li id="fn19"><p>We’ve all heard something about “every ending is a new beginning,” and since we all know that A being B is a
transitive relationship, we could all extrapolate from this that every new beginning is an ending. But I think few of us were
emotionally prepared by this logic to see every compiler-generated function <em>beginning </em>with an instruction called
<em>END</em>_WHATEVER. I’m just saying, at some point, you have got to start wondering, “am I maybe doing this whole thing
wrong, and should I do some bigger changes than the crazy incremental ones I’m finding myself doing?” And if it requires a
Worldwide Software Czar to shake things up in the right direction, I just want you all to know that I could be available, if the
price is right.<a class="footnote-back" role="doc-backlink" href="#fnref19">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/profiling-in-production-with-function-call-traces#comments</comments>
      <pubDate>Wed, 05 Feb 2025 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/profiling-in-production-with-function-call-traces.feed</wfw:commentRss>
    </item>
    <item>
      <title>0+0 &gt; 0: C++ thread-local storage performance</title>
      <link>https://yosefk.com/blog/cxx-thread-local-storage-performance.html</link>
      <description><![CDATA[<p>We'll discuss how to make sure that your access to TLS (thread-local storage) is fast. If you’re interested strictly in TLS
performance guidelines and don't care about the details, <a href="#summary-of-performance-guidelines">skip right to the end</a>
— but be aware that you’ll be missing out on assembly listings of profound emotional depth, which can shake even a cynical,
battle-hardened programmer. If you don’t want to miss out on that — and who would?! — read on, and you shall learn the
computer-scientific insight behind the intriguing inequality 0+0 &gt; 0.</p>
<p>I’ve recently <a href="https://yosefk.com/blog/profiling-in-production-with-function-call-traces.html">published</a> a new
C++ profiler, <a href="https://github.com/yosefk/funtrace">funtrace</a>, which traces function calls &amp; returns as well as
thread state changes, showing an execution timeline like this (the screenshot is from <a href="https://krita.org/">Krita</a>, a
“real-world,” complicated drawing program):</p>
<p><img alt="image8.png" height="776" src="https://yosefk.com/img/funtrace/image8.png" title="a trace of Krita made by funtrace &amp; displayed by vizviewer" width="576" style="max-width: 100%;height: auto;"></p>
<p>One thing a software-based tracing profiler needs is a per-thread buffer for traced data. Actually it would waste less memory
for all threads to share the same buffer, and this is how things “should” work in a system with some fairly minimal <a href="https://yosefk.com/blog/profiling-in-production-with-function-call-traces.html#hardware-assisted-tracing">hardware support
for tracing, which I suggested in the funtrace writeup</a>, and which would look roughly like this:</p>
<p><img alt="image5.png" height="336" src="https://yosefk.com/img/funtrace/image5.png" title="A trace writer with a DMA bypassing the CPU cache system" width="556" style="max-width: 100%;height: auto;"></p>
<p>But absent such trace data writing hardware, the data must be written using store instructions through the caches<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a>. So many CPUs sharing a trace buffer results in
them constantly yanking lines from each other’s caches in order to append to the buffer, with a spectacular slowdown. And then
you'd need to synchronize updates to the current write position — still more slowdown. A shared buffer can be fine for <a href="https://yosefk.com/blog/delayed-printf-for-real-time-logging.html">user-initiated printing</a>, but it’s too slow for
tracing every call and return.</p>
<p>So per-thread buffers it is — bringing us to C++’s <code>thread_local</code> keyword, which gives each thread its own copy of
a variable in the global scope — perfect for our trace buffers, it would seem. But it turns out that <strong>we need to be
careful with exactly how we use <code>thread_local</code> to keep our variable access time from exploding</strong>, as explained
in the rest of this document.</p>
<p>The C toolchain — not the C++ compiler front-end, but assemblers, linkers and such — is generally quite ossified, with <a href="https://stackoverflow.com/questions/76925604/why-is-the-constructor-of-a-global-variable-not-called-in-a-library">decades-old
linker bugs enshrined as a standard</a><a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>. TLS
is an interesting case when this toolchain was actually given quite the facelift to support a new feature — with the result of
simple, convenient syntax potentially hiding fairly high overhead (contrary to the more typical case of inconvenient syntax, no
new work in the toolchain, and resource use being fairly explicit.)</p>
<p>At first glance, TLS looks wonderfully efficient, with a whole machine register dedicated to making access to these exotic
variables fast, and a whole scheme set up in the linker to use this register. Let’s take this code accessing a
<code>thread_local</code> object named <code>tls_obj</code>:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><code>int get_first() {
  return tls_obj.first_member;
}</code></pre>
<p>This compiles to the following assembly code:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">  <b><span style="color: #6aa84f">movl  %fs:<i>tls_obj</i>@tpoff, %eax</span></b>
</pre>
<p>This loads data from the address of <code>tls_obj</code> into the <code>%eax</code> register where the return value should
go. The address of tls_obj is computed by adding the value of the register <code>%fs</code> and the constant offset
<code>tls_obj@tpoff</code>. Here, <code>%fs</code> is the TLS base address register on x86; other machines similarly reserve a
register for this. <code>tls_obj@tpoff</code> is an offset from the base address of the TLS area allocated per thread, and it’s
assigned by the linker such that room is reserved within the TLS area for every <code>thread_local</code> object in the linked
binary. Is this awesome or what?!</p>
<h2 id="constructors">Constructors</h2>
<p>If instead we access a <code>thread_local</code> object with a constructor — let's call it <code>tls_with_ctor</code> — we
get assembly code like this (and this is with <code>-O3</code> – you really don’t want to see the unoptimized version of
this):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">  <span style="color: #cc0000">cmpb  $0, %fs:<b><i>__tls_guard</i></b>@tpoff
  je    .slow_path</span>
  <b><span style="color: #6aa84f">movl  %fs:<i>tls_with_ctor</i>@tpoff, %eax</span></b>
  ret
<b>.slow_path:</b>
  // inlined call to __tls_init, which constructs
  // <b>all</b> the TLS variables in this translation unit…
  pushq %rbx
  movq  %fs:0, %rbx
  movb  $1, %fs:<b><i>__tls_guard</i></b>@tpoff
  leaq  <b><i>tls_with_ctor</i></b>@tpoff(%rbx), %rdi
  call  <b><i>Class::Class()</i></b>
  leaq  <b><i>tls_with_ctor2</i></b>@tpoff(%rbx), %rdi
  call  <b><i>Class2::Class2()</i></b>
  // …followed by our function’s code
  <b><span style="color: #6aa84f">movl    %fs:<i>tls_with_ctor</i>@tpoff, %eax</span></b>
  popq  %rbx
  ret
</pre>
<p>Our simple access to a register plus offset has evolved to first check a thread-local “guard variable”, and if it’s not yet
set to 1, it now calls the constructors for all of the thread-local objects in the translation unit. (<code>__tls_guard</code>
is an implicitly generated <code>static</code>, per-translation-unit boolean.)</p>
<p>While funtrace’s call/return hooks, which get their trace buffer pointer from TLS, are called all the time, access to
<code>thread_local</code>s should be rarer in “normal” code — so I’m not sure it’s fair to brand this <code>__tls_guard</code>
approach as having “unacceptable overhead.” Of course, the inlining only happens if your thread_local is defined in the same
translation unit where you access it; <strong>accessing an <code>extern thread_local</code> with a constructor involves a
function call</strong>, with the function testing the guard variable of the translation unit where the thread_local is defined.
But with inlining, the fast path is quite fast on a good processor (I come from an embedded background where you usually have
<em>cheap </em>CPUs rather than <em>good</em>, so an extra load and a branch depending on the loaded value shock me more than
they should; a superscalar out-of-order branch-predicting speculatively-executing CPU will handle this just fine.)</p>
<p>What I don’t understand is why. Like, <em>why.</em> Generating this code must have taken a bunch of compiler work; it didn’t
“just happen for free.” Furthermore, the <code>varname@tpoff</code> thing must have involved some linker work; it’s not like
keeping the linker unchanged was a constraint. Why not arrange for the <code>__tls_init</code> function of every translation
unit (the one that got inlined into the slow path above) to be called before a thread’s entry point is called? Because it would
require a little bit of libc or libpthread work?..</p>
<p>I mean, this was done for <em>global constructors</em>. You don’t check whether you called the global constructors of a
translation unit before accessing a global with a constructor (and sure, <em>that </em>would have been even slower than the TLS
init code checking <code>__tls_guard</code>, because it would need to have been a <em>thread-safe</em> guard variable access;
though even <em>this </em>was implemented for calling the constructors of <em>static variables declared inside functions,
</em>see also <code>-fno-threadsafe-statics</code>.) It’s not really harder to do this for TLS constructors than for global
constructors, except that we need <code>pthread_create</code> to call this code, which, why not?..</p>
<p>Is this a deliberate performance tradeoff, benefitting code with lots of thread_locals and starting threads constantly, with
each thread using few of the thread_locals, and some thread_locals having slow constructors<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>? But such code isn't great to begin with?.. Anyway, I don’t really get why the
ugly thing above is generated from <code>thread_local</code>s’ constructors. The way I handled it in my case is,
<strong>funtrace sidesteps the TLS constructor problem by <a href="https://docs.oracle.com/cd/E19683-01/816-1386/chapter3-26/index.html">interposing</a>
<code>pthread_create</code></strong>, and initializing its <code>thread_local</code>s in its pthread_create wrapper.</p>
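<p>A minimal sketch of this kind of interposition – the details differ from funtrace’s actual runtime, and <code>init_thread_trace_buffer</code> is a made-up stand-in for “construct our thread_locals up front” (link with <code>-ldl</code> on older glibc):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;dlfcn.h&gt;
#include &lt;pthread.h&gt;

void init_thread_trace_buffer();  // hypothetical: touches/constructs our TLS

namespace {
struct StartArgs { void* (*fn)(void*); void* arg; };

void* start_wrapper(void* p) {
    StartArgs a = *static_cast&lt;StartArgs*&gt;(p);
    delete static_cast&lt;StartArgs*&gt;(p);
    init_thread_trace_buffer();  // pay the TLS construction cost once, here
    return a.fn(a.arg);
}
}

// our definition shadows libpthread's; dlsym(RTLD_NEXT) finds the real one
extern "C" int pthread_create(pthread_t* t, const pthread_attr_t* attr,
                              void* (*fn)(void*), void* arg) {
    using create_t = int (*)(pthread_t*, const pthread_attr_t*,
                             void* (*)(void*), void*);
    static create_t real =
        reinterpret_cast&lt;create_t&gt;(dlsym(RTLD_NEXT, "pthread_create"));
    return real(t, attr, start_wrapper, new StartArgs{fn, arg});
}</pre>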
<h2 id="shared-libraries">Shared libraries</h2>
<p>And now let’s see what happens when we put our thread-local variable, the one without a constructor, into a shared library
(compiling with <code>-fPIC</code> and linking with <code>-shared</code>):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><b><span style="color: #e69138">push %rbp
mov  %rsp,%rbp
<span style="color: #cc0000">data16 lea <i>tls_obj</i>(%rip),%rdi
data16 data16 callq <i>__tls_get_addr</i>@plt</span>
mov  (%rax),%eax
pop  %rbp</span></b>
retq 
</pre>
<p>All this colorful code is generated instead of what used to be a single <b><span style="color: #6aa84f">movl
%fs:<i>tls_obj</i><span class="citation" data-cites="tpoff">@tpoff</span>, %eax</span></b>. More code was generated than before,
forcing us to <strong><span style="color: #e69138">spill and restore registers</span></strong>. But the worst part is that our
TLS access now requires <strong><span style="color: #cc0000">a function call</span></strong> — we need
<code>__tls_get_addr</code> to find the TLS area of the currently running shared library.</p>
<p><strong>Why don’t we just use the same code as before — the <code>movl</code> instruction — with the dynamic linker
substituting the right value for <code>tls_obj@tpoff</code>? </strong>This is an honest question; I don’t understand why this
isn’t a job for the dynamic linker like any other kind of dynamic relocation. Is this to save work in libc again?.. Like, for
<code>tls_obj@tpoff</code> to be an offset <em>from the same base address</em> no matter which shared library
<code>tls_obj</code> was linked into, you would need the TLS areas of all the shared libraries to be allocated contiguously:</p>
<ul>
<li>main executable at offset 0</li>
<li>the first loaded .so at the offset <code>sizeof(main TLS)</code></li>
<li>the next one at the offset <code>sizeof(main TLS) + sizeof(first.so TLS)</code></li>
<li>…</li>
</ul>
<p>But for this, libc would need to do this contiguous allocation, and of course you can’t move the TLS data once you’ve
allocated it, since someone might be keeping pointers into it<a class="footnote-ref" role="doc-noteref" href="#fn4" id="fnref4"><sup>4</sup></a>. So you need to carve out a chunk of the memory space — no biggie with a 64-bit or even
“just” a 48-bit address space, right?.. — and you need to put the executable’s TLS at some magic address with <code>mmap</code>
and then you keep <code>mmap</code>ing the TLS areas of newly loaded .so’s one next to another.</p>
<p>But this now becomes a part of the ABI (“these addresses are reserved for TLS”), and I guess nobody wanted to soil the ABI
this way “just” to make TLS fast for shared libraries?.. In any case, looks like TLS areas are allocated non-contiguously and so
you need a different base address every time and you can’t use an offset… but <em>still</em>, couldn’t the dynamic linker bake
this address into the code, instead of calling a function to get it?.. Feels to me that this was doable but deemed not worth the
trouble, more than it being impossible, though maybe I’m missing something.</p>
<p>A curious bit is those <code>data16</code><a class="footnote-ref" role="doc-noteref" href="#fn5" id="fnref5"><sup>5</sup></a>
in the code:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><b><span style="color: #cc0000">data16 lea <i>tls_obj</i>(%rip),%rdi
data16 data16 callq <i>__tls_get_addr</i>@plt</span></b>
</pre>
<p>What is this for?.. Actually, the <code>data16</code> prefix does nothing in this context except pad the instructions to
take more space, making things slightly slower still, though it’s peanuts compared to the function call. Why does the compiler
put this padding in? Because if you compile with <code>-fPIC</code> but then link the code into an executable, without the
<code>-shared</code>, the function call gets replaced with faster code:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;"><b><span style="color: #e69138">push %rbp
mov  %rsp,%rbp
<span style="color: #6aa84f">mov  %fs:0x0,%rax
lea  -0x4(%rax),%rax</span>
mov  (%rax),%eax
pop  %rbp</span></b>
retq
</pre>
<p>The generated code is still scarred with the <strong><span style="color: #e69138">register spilling</span></strong> and
what-not, and we don’t get our simple <b><span style="color: #6aa84f">movl %fs:<i>tls_obj</i><span class="citation" data-cites="tpoff">@tpoff</span>, %eax</span></b> back, but still, we have to be very thankful for the compiler &amp; linker
work here, done for the benefit of the <em>many </em>people whose build system compiles everything with <code>-fPIC</code>,
including code that is then linked without <code>-shared</code> (because who knows if the .o will be linked into a shared
library or an executable? It’s not like the build system knows <em>the entire graph of build dependencies — </em>wait, it
actually <em>does — </em>but still, it obviously shouldn’t be <em>bothered </em>to find out if -fPIC is needed — this type of
mundane concern would just distract it from its noble goal of Scheduling a Graph of Completely Generic Tasks. Seriously, no C++
build system out there stoops to this - not one, and goodness knows there are A LOT of them.)</p>
<p>In any case, the <code>data16</code>s are generated by the compiler to make the red instructions take enough space for the
green instructions to fit into, in case we link without <code>-shared</code> after all.</p>
<h2 id="constructors-in-shared-libraries">Constructors in shared libraries</h2>
<p>And now let’s see what happens if we put (1) a thread_local object with (2) a constructor into a shared library, for a fine
example of how 2 of C++’s famously “zero-overhead” features compose. We’ve all heard how “the whole is greater than the sum of
its parts,” occasionally expressed by the peppier HRy people as “1 + 1 = 3.” I suggest a similarly inspiring expression “0 + 0
&gt; 0”, which quite often applies to “zero overhead”:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">sub  $0x8,%rsp
<b><span style="color: #cc0000">callq</span> <i>TLS init function for tls_with_ctor</i></b>@plt
data16 lea <b><i>tls_with_ctor</i></b>(%rip),%rdi
data16 data16 <b><span style="color: #cc0000">callq</span> <i>__tls_get_addr</i></b>@plt
mov  (%rax),%eax
add  $0x8,%rsp
retq
</pre>
<p>So, now we have 2 function calls — one for calling the constructor in case it wasn’t called yet, and another to get the
address of the <code>thread_local</code> variable from its ID. Makes sense, except that I recall that under <code>-O3</code>,
this “TLS init function” business was inlined, and now it no longer is? Say, I wonder what code got generated for this “TLS init
function”?..</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">  subq  $8, %rsp
  leaq  <b><i>__tls_guard</i></b>@tlsld(%rip), %rdi
  <b><span style="color: #cc0000">call</span>  <i>__tls_get_addr</i></b>@PLT
  cmpb  $0, <b><i>__tls_guard@dtpoff</i></b>(%rax)
  je    .slow_path
  addq  $8, %rsp
  ret
<b>.slow_path:</b>
  movb  $1, <b><i>__tls_guard</i></b>@dtpoff(%rax)
  data16  leaq  <b><i>tls_with_ctor</i></b>@tlsgd(%rip), %rdi
  data16 data16 <b><span style="color: #cc0000">call</span>  <i>__tls_get_addr</i></b>@PLT
  movq  %rax, %rdi
  call  <b><i>Class::Class</i></b>()@PLT
  data16  leaq  <b><i>tls_with_ctor2</i></b>@tlsgd(%rip), %rdi
  data16 data16 <b><span style="color: #cc0000">call</span>  <i>__tls_get_addr</i></b>@PLT
  addq  $8, %rsp
  movq  %rax, %rdi
  jmp   <b><i>Class2::Class2</i></b>()@PLT
</pre>
<p>Oh boy. So not only doesn’t this thing get inlined, but it calls <code>__tls_get_addr</code> <em>again, <strong>even on the
fast path</strong>. </em>And then you have the slow path, which calls __tls_get_addr <em>again and again</em>…not that we care,
it runs just once, but it kinda shows that this __tls_get_addr business doesn’t optimize very well. I mean, it’s not just the
slow path of the init code — here’s what a function accessing 2 thread_local objects with constructors looks like:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">pushq   %rbx
<b><span style="color: #cc0000">call</span></b>    TLS init function for tls_with_ctor@PLT
data16 leaq tls_with_ctor@tlsgd(%rip), %rdi
data16 data16 <b><span style="color: #cc0000">call</span></b> __tls_get_addr@PLT
movl    (%rax), %ebx
<b><span style="color: #cc0000">call</span></b>    TLS init function for tls_with_ctor2@PLT
data16 leaq tls_with_ctor2@tlsgd(%rip), %rdi
data16 data16 <b><span style="color: #cc0000">call</span></b> __tls_get_addr@PLT
addl    (%rax), %ebx
movl    %ebx, %eax
popq    %rbx
</pre>
<p>Like… man. This calls __tls_get_addr <span style="color: #cc0000"><strong><em>4 times</em></strong></span>, twice per
accessed thread_local (once directly, and once from the “TLS init functions”).</p>
<p>Why do we call <em>2 </em>“TLS init function for whatever” when <em>both </em>do the same thing — check the guard variable
and run the constructors of <em>all </em>objects in the translation unit (and in this case the two objects are defined in the
same translation unit, the same one where the function is defined)? Is it because in the general case, the two objects come from
2 different translation units?</p>
<p>And what about the __tls_get_addr calls to get the addresses of the objects themselves? Why call <em>that </em>twice? Why not
call something just once that gives you the base address of the module’s TLS, and then add offsets to it? Is it because in the
general case, the two objects could come from 2 different shared libraries?</p>
<p>And BTW, with clang 20 (the latest version ATM), it’s seemingly enough for <em>one </em>thread-local object in a translation
unit to have a constructor for the compiler to generate a “TLS init function” for <em>every </em>thread-local object, and call
it when the object is accessed… so, seriously, <strong>don’t </strong>use <code>thread_local</code> with constructors, even if
you don’t care about the overhead, as long as there’s even one thread_local object where you <em>do </em>care about access
time.</p>
<p>On the other hand, clang has an optimization where access to several thread_locals <em>with hidden visibility<a class="footnote-ref" role="doc-noteref" href="#fn6" id="fnref6"><sup>6</sup></a></em> calls
<code>__tls_get_addr</code> only once (instead of twice the number of accessed thread_locals), and then we add a
per-variable offset to access each thread_local. It turns out that a big part of the answer to the question "why call
__tls_get_addr per variable?" is that <strong>with the default visibility, variables could be interposed at runtime,</strong>
and so the compiler can't assume that they're defined by the same shared library, even if it's compiling a .cpp file that
defines all of the accessed variables.</p>
<p>Of course, the other part of the answer is that it takes work to implement this optimization; <a href="https://lobste.rs/s/b5dnjh/0_0_0_c_thread_local_storage_performance#c_dkzbw3">according to the comment that I learned this
from</a>, this optimization is not available on all platforms in clang, and I'm not seeing it in g++ on x86. A smaller problem
is that as you can see in the code below, with the current code generation, there's lots of register spilling and restoring
going on which I can't really explain (even if I look at the slow path which I elided in the assembly listing below, since it's
hairy enough as it is):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">pushq   %rbp
pushq   %r15
pushq   %r14
pushq   %rbx
pushq   %rax
leaq    <b><i>__tls_guard</i></b>@TLSLD(%rip), %rdi
<b><span style="color: #cc0000">callq</span></b>   <b><i>__tls_get_addr</i></b>@PLT
movq    %rax, %rbx
cmpb    $0, <b><i>__tls_guard</i></b>@DTPOFF(%rax)
je      .slow_path
movl    <b><i>tls_with_ctor</i></b>@DTPOFF(%rbx), %ebp
addl    <b><i>tls_with_ctor2</i></b>@DTPOFF(%rbx), %ebp
movl    %ebp, %eax
addq    $8, %rsp
popq    %rbx
popq    %r14
popq    %r15
popq    %rbp
retq
</pre>
<p>Note that if you compile with <code>-fPIC</code> but then link without <code>-shared</code>, even the single call to
<code>__tls_get_addr</code> gets replaced with the much faster, if quite colorful instruction
<code>data16 data16 data16 mov %fs:0x0,%rax</code>. All in all, an impressive effort by clang to optimize TLS access from shared
objects; yet on balance, I think it's fair to recommend putting data into a smaller number of thread_locals and avoiding
constructors, rather than counting on visibility to improve the code generation.</p>
<p>So what does that famous <code>__tls_get_addr</code> function do? Here’s the fast path:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">mov  %fs:DTV_OFFSET, %RDX_LP
mov  GL_TLS_GENERATION_OFFSET+_rtld_local(%rip), %RAX_LP
cmp  %RAX_LP, (%rdx)
jne  .slow_path
mov  TI_MODULE_OFFSET(%rdi), %RAX_LP
salq $4, %rax
movq (%rdx,%rax), %rax
cmp  $-1, %RAX_LP
je   .slow_path
add  TI_OFFSET_OFFSET(%rdi), %RAX_LP
ret
</pre>
<p>These 11 instructions on the fast path enable lazy allocation of a shared library’s TLS — every thread only allocates a TLS
for a given shared library upon its first attempt to access one of its thread-local variables. (Each “variable ID” passed to
<code>__tls_get_addr</code> is a pointer to a struct with module ID and an offset within that module’s TLS;
<code>__tls_get_addr</code> checks whether TLS was allocated for the module, and if it wasn’t, calls
<code>__tls_get_addr_slow</code> in order to allocate it.)</p>
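<p>For reference, the “variable ID” looks roughly like this on x86-64 glibc – field names per the ELF TLS ABI, but treat this as a sketch rather than a copy of the headers:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">// what the compiler passes in %rdi at every __tls_get_addr call site
typedef struct {
    unsigned long ti_module;  // which module's (executable's or .so's) TLS block
    unsigned long ti_offset;  // the variable's offset within that block
} tls_index;

extern "C" void* __tls_get_addr(tls_index* ti);  // returns this thread's &amp;variable</pre>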
<p>Is this lazy allocation the answer to why the whole thing is so slow? Do we <em>really</em> want to only call constructors
for thread-local variables upon first use, and ideally to even allocate memory for them upon first use? Note that we allocate
memory <em>for all the thread_locals <strong>in a shared library</strong> </em>upon the first use of even one; but we call
constructors <em>for all the thread_locals <strong>in a translation unit </strong></em>upon the first use of even one; which is
a bit random for the C++ standard to prescribe, not to mention that it doesn’t really concern itself with dynamic loading? So
it’s more, the standard gave implementations room to do this, rather than prescribed them to do this?.. I don’t know about you,
but I’d prefer a contiguous allocation for all the TLS areas of all the modules in all the threads, and fast access to the
variables over this lazy allocation and initialization; I wonder if this was a deliberate tradeoff or “just how things ended up
being.”</p>
<h2 id="summary-of-performance-guidelines">Summary of performance guidelines</h2>
<ul>
<li>Access to <code>thread_local</code> objects without constructors linked into an executable is <em>very</em> efficient</li>
<li>Constructors make this slower…</li>
<li>Especially if you access an <code>extern thread_local</code> from another translation unit…</li>
<li>Separately from constructors, compiling with <code>-fPIC</code> also makes TLS access slower…</li>
<li>…and linking code compiled with <code>-fPIC</code> with the <code>-shared</code> flag makes it <em>seriously </em>slower,
worse than either constructors or compiling with <code>-fPIC</code>...</li>
<li>…but constructors together with <code>-fPIC -shared</code> <em>really </em>takes the cake and is the slowest by far!</li>
<li>…and actually, a thread_local variable x having a constructor might slow down access to a thread_local variable y in the
same translation unit</li>
<li>Prefer putting the data into one thread_local object rather than several when you can (true for globals, too, BTW.) It can’t
hurt, and it can probably help a lot, by having fewer calls to <code>__tls_get_addr</code> if your code is linked into a shared
library.</li>
<li>Define your thread_locals as having hidden visibility - it won't always help if they're compiled into a shared library, but
sometimes it'll help a lot, and it can't hurt.</li>
</ul>
<h2 id="future-work">Future work</h2>
<p>It annoys me to no end that the funtrace runtime has to be linked into the executable to avoid the price of
<code>__tls_get_addr</code>. (This also means that funtrace must export its runtime functions from the executable, which
precludes shared libraries using the funtrace runtime API (for taking trace snapshots) from linking with
<code>-Wl,--no-undefined</code>.)</p>
<p>I just want a tiny thread-local struct. It can’t be that I can’t do that efficiently without modifying the executable, so
that for instance a Python extension module can be traced without recompiling the Python executable. Seriously, there’s a limit
to how idiotic things should be able to get.</p>
<p>I’m sure there’s some dirty trick or other, based on knowing the guts of libc and other such, which, while dirty, is going to
work for a long time, and where you can reasonably safely detect if it stopped working and upgrade it for whatever changes the
guts of libc will have undergone. If you have an idea, please share it! If not, I guess I’ll get to it one day; I released
funtrace before getting around to this bit, but generally, working around a large number of stupid things like this is a big
chunk of what I do.</p>
<h2 id="knowing-what-you-shouldnt-know">Knowing what you shouldn’t know</h2>
<p>If I manage to stay out of trouble, it’s rarely because of knowing that much, but more because I’m relatively good at 2 other
things: knowing what I don’t know, and knowing what I shouldn’t know. To look at our example, you could argue that the above
explanations are shallower than they could be — I ask why something was done instead of looking up the history, and I only
briefly touch on what <code>TI_MODULE_OFFSET</code> and <code>TI_OFFSET_OFFSET</code> (yes, TI_OFFSET_OFFSET) are, and I don’t
say a word about GL_TLS_GENERATION_OFFSET, for example, and I <em>could.</em></p>
<p>I claim that the kind of things we saw around __tls_get_addr is an immediate red flag along the lines of, yes I am looking
into low-level stuff, but no, nothing good will come out of knowing this particular bit very well in the context that I’m in
right now; maybe I’ll be forced to learn it sometime, but right now this looks exactly like stuff I should avoid rather than
stuff I should learn.</p>
<p>I don’t know how to generalize the principle to make it explicit and easy to follow. All I can say right now is that the next
section has examples substantiating this feeling; you mainly want to avoid <code>__tls_get_addr</code>, because even people who
know it very well – the people maintaining it and everything related to it – run into problems with it.</p>
<p>I’ve recently been seeing the expression “anti-intellectualism” used by people criticizing arguments along the lines of “this
is too complex for me to understand, so this can’t be good.” While I agree that we want some more concrete argument about why
something isn’t worth understanding than “I don’t get it, and I <em>would </em>get it if it was any good,” I implore not to call
this “anti-intellectualism,” lest we implicitly crown ourselves as “intellectuals” over the fact that we understand what
TI_OFFSET_OFFSET is. It’s ridiculous enough that we’re called “knowledge workers,” when the “knowledge” referred to in this
expression is the knowledge of what TI_OFFSET_OFFSET is.</p>
<h2 id="workarounds-for-shared-libraries">Workarounds for shared libraries</h2>
<p>Like I said, it annoys me to no end that TLS access is slow for variables defined in shared libraries. Readers suggested
quite a few workarounds, "dirty" to varying degrees:</p>
<h3 id="inlining-pthread_getspecific">"Inlining" <code>pthread_getspecific</code></h3>
<p>There's a pthreads API for allocating "thread-specific keys" which is a form of TLS. Calling <code>pthread_getspecific</code>
upon every TLS access isn't any better than calling <code>__tls_get_addr</code>. But <a href="https://x.com/pskocik/status/1891494684863680663">we can "inline" the code of glibc's implementation</a>, and if we can
make sure that our key is the first one allocated, it will take just a couple of assembly instructions (loading a pointer from
<code>%fs</code> with a constant offset, and then loading our data from that pointer):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#define tlsReg_ (__extension__( \
  { char*r; __asm ("mov %%fs:0x10,%0":"=r"(r)); r; }))

inline void *pxTlsGetLt32_m(pthread_key_t Pk){
  assert(Pk&lt;32);
  return *(void**)(tlsReg_+0x310+sizeof(void*[2])*Pk+8);
}
void* getKey0(void) {
  return pxTlsGetLt32_m(0);
}
</pre>
<p><code>getKey0</code> compiles to:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">  mov  %fs:0x10,%rax
  mov  0x318(%rax),%rax
</pre>
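<p>Note that the trick depends on our key being key 0 (or at least on knowing its index). One way to try to grab key 0 – assuming glibc hands out keys in allocation order – is to allocate it as early as possible:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;pthread.h&gt;

static pthread_key_t g_key;

// constructor priorities 0-100 are reserved for the implementation;
// 101 is the earliest a user constructor can ask for
__attribute__((constructor(101)))
static void alloc_first_key(void) {
    pthread_key_create(&amp;g_key, nullptr);
    // worth asserting g_key == 0 in debug builds: if some earlier-loaded
    // library grabbed key 0 first, the hardcoded offsets above are wrong
}</pre>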
<h3 id="compiling-with--ftls-modelinitial-exec">Compiling with <code>-ftls-model=initial-exec</code></h3>
<p>It <a href="https://news.ycombinator.com/item?id=43078859">turns out</a> that there's something called the "<a href="https://maskray.me/blog/2021-02-14-all-about-thread-local-storage#initial-exec-tls-model-executable-preemptible">initial
exec TLS model</a>", where a TLS access costs you 2 instructions and no function calls:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">movq <b><i>tls_obj</i></b>@GOTTPOFF(%rip), %rax
movl %fs:(%rax), %eax
</pre>
<p>You can also make just some variables use this model with <code>__attribute((tls_model("initial-exec")))</code>, instead of
compiling everything with <code>-ftls-model=initial-exec</code>, which might be very useful since the space for such variables
is a scarce resource as we'll see shortly.</p>
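<p>For example, to opt in a single hot variable:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">__attribute__((tls_model("initial-exec"))) thread_local int tls_fast;</pre>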
<p>This method is great if you can <code>LD_PRELOAD</code> your library, or link the executable against it so that it becomes
<code>DT_NEEDED</code>. Otherwise, this may or may not work at runtime:</p>
<blockquote>
<p>the shared object generally needs to be an immediately loaded shared object. The linker sets the DF_STATIC_TLS flag to
annotate a shared object with initial-exec TLS relocations.</p>
<p>glibc ld.so reserves some space in static TLS blocks and allows dlopen on such a shared object if its TLS size is small.
There could be an obscure reason for using such an attribute: general dynamic and local dynamic TLS models are not
async-signal-safe in glibc. However, other libc implementations may not reserve additional TLS space for dlopen'ed initial-exec
shared objects, e.g. musl will error.</p>
</blockquote>
<h3 id="faster-__tls_get_addr-with--mtls-dialectgnu2">Faster <code>__tls_get_addr</code> with
<code>-mtls-dialect=gnu2</code></h3>
<p>It turns out there's a faster <code>__tls_get_addr</code> which you can opt into using. This is still too much code for my
taste; but if you're interested in the horrible details, you can read <a href="https://news.ycombinator.com/item?id=43079061">the comment where I found out about this</a>.</p>
<h2 id="see-also">See also</h2>
<p>Various compiler and runtime issues make this slow stuff even slower, and it takes a while to get them fixed. If you stay
within the guidelines above, you should avoid such problems; if you don’t, you might have more problems than described above —
including both performance and correctness issues:</p>
<ul>
<li><a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81501">mulitple calls to __tls_get_addr() with -fPIC</a> (reported in
2017, status: NEW as of 2025). Some highlights from 2022:
<ul>
<li>“We recently upgraded our toolchain from GCC9 to GCC11, and <strong>we're seeing <code>__tls_get_addr</code> take up to 10%
of total runtime</strong> under some workloads, where it was 1-2% before. It seems that some changes to the optimization passes
in 10 or 11 have significantly increased the impact of this problem.”</li>
<li>“I've shown a workaround I used, which might be useful until GCC handle <code>__tls_get_addr()</code> as returning a
constant addresses that doesn't need to be looked up multiple times in a function.“</li>
<li>“Thanks for the patch! I wonder if it would handle coroutines correctly. <strong>Clang has this open bug <a href="https://github.com/llvm/llvm-project/issues/47179">"Compiler incorrectly caches thread_local address across
suspend-points"</a> that is related to this optimization</strong>.”</li>
</ul></li>
<li><a href="https://sourceware.org/bugzilla/show_bug.cgi?id=19924">TLS performance degradation after dlopen</a> (reported in
2016; fixed in libc 2.39 in 2023, backported to older libcs up to 2.34 in 2025):
<ul>
<li>“we have noticed a performance degradation of TLS access in shared libraries. <b>If another shared library that uses TLS is
loaded via dlopen, __tls_get_addr takes significant more time</b>. Once that shared library accesses it's TLS, the performance
normalizes. We do have a use-case where this is actually really significant.”</li>
<li>“elf: Fix slow tls access after dlopen [BZ #19924] In short: __tls_get_addr checks the global generation counter and if the
current dtv is older then _dl_update_slotinfo updates dtv up to the generation of the accessed module. So if the global
generation is newer than generation of the module then __tls_get_addr keeps hitting the slow dtv update path. The dtv update
path includes a number of checks to see if any update is needed and this already causes measurable tls access slow down after
dlopen. It may be possible to detect up-to-date dtv faster. But if there are many modules loaded (&gt; TLS_SLOTINFO_SURPLUS)
then this requires at least walking the slotinfo list. This patch tries to update the dtv to the global generation instead, so
after a dlopen the tls access slow path is only hit once. The modules with larger generation than the accessed one were not
necessarily synchronized before, so additional synchronization is needed.”</li>
<li>“the fix for <a href="https://sourceware.org/bugzilla/show_bug.cgi?id=19924">bug 19924</a> was to update DTV on tls access
up to the global gen count so after an independent dlopen the next tls access updates the DTV gen count instead of falling into
a slow code path over and over again. <strong>this introduced some issues</strong>: update happens now even if the accessed tls
is in an early loaded library that use static tls (l_tls_offset is set), so such access is no longer as-safe and may alloc. some
of this was mitigated by an ugly workaround: “elf: Support recursive use of dynamic TLS in interposed malloc.” a possible better
approach is to expose the gen count of the accessed module directly in the tls_get_addr argument: this is possible on 64bit
targets if we compress modid and offset into one GOT entry and use the other for the gen count when processing DTPMOD and DTPREL
relocs. (then the original logic before the 19924 fix would not slow down after a global gencount bump: we can compare the DTV
gen count to the accessed module gen count. btw we do this with TLSDESC today and thus aarch64 was imho not affected by the
malloc interposition issue.) however i feel this is dancing around a bad design to use the generation count to deal with dlclose
and reused modids. so here is a better approach…”</li>
</ul></li>
</ul>
<p>If you’re not quite following some of the above, this sort of makes my point about <code>__tls_get_addr</code> being
undesirable, though I am not sure how to defend this way of making a point in the general case.</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>You could instead write the data straight to DRAM, by putting your trace buffer into memory mapped with the
“uncached” attribute in the processor’s page table. But operating systems don’t make it very easy to allocate such memory, and
for good reason — this can save you cache-related overheads, but writing 64-bit words to DRAM wastes 7/8ths of the bandwidth,
since DRAM writes work in “bursts” of 64 bytes. That’s why CPUs use DRAM bandwidth efficiently when writing back full cache
lines, but not when writing individual words to uncached memory. And of course none of this would help with maintaining a next
write position variable shared between threads.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>The nature of C++ combines with human nature in such a way that people often want to have some Registry of
Plugins, with some libmyplugin.a implementing a MyPlugin derived from Plugin and calling Registry::add(this) in the constructor
of MyPlugin g_plugin. Of course this runs into the following wonderful provision in the standard: “It is implementation-defined
whether initialization happens before main is entered or is delayed until the first non-initialization odr-use of a non-inline
definition in the same translation unit as the one containing the variable definition.” Which of course is a bit of a
chicken-and-egg problem, as nothing from libmyplugin.a ever gets executed unless the plugin is added to the registry — but it’s
not going to be added to the registry “until something from the plugin gets executed” (though in practice it will happen
earlier, before main — what actually happens is that the linker notices that a symbol from the translation unit gets used and
then <em>doesn’t </em>drop all the code from the translation unit defining the global constructor; the language in the standard
just gives implementations the maximal possible leeway to make your life miserable.) I refer to the standard accepting this
nonsense (where your code works if it’s kept in a .cpp that’s compiled into a .o and then linked, but not if the .o is put into
a lib.a and then the lib.a is linked) and filing it under legitimate “implementation-defined” behavior as “enshrining
decades-old bugs as a standard,” though you could say that it’s not a bug as much as the linker not bothering to implement an
obviously correct behavior that was not a part of its original specification. I had other examples from my time of transitioning
from gold to lld and mold, but thankfully they have largely faded from my memory.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>Apparently we can say that it's trade-off made <em>by the implementation</em> since the standard <a href="https://stackoverflow.com/questions/24253584/when-is-a-thread-local-global-variable-initialized">permits</a>
implementations to choose any time for calling the constructor of a thread_local before the first use, same as it does for
"normal" global variables, where implementations normally choose to call the constructor before main unconditionally, without
having a guard variable for lazy initialization: "3.7.2/2 [basic.stc.thread]: A variable with thread storage duration shall be
initialized before its first odr-use (3.2) and, if constructed, shall be destroyed on thread exit."<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
<li id="fn4"><p>You also can’t really deallocate a TLS area when a shared library is unloaded if you allocate them all
contiguously — you could theoretically reuse that area for the TLS of another shared library, but only if it fits into that
address range; you could unmap the physical pages and have the OS use them for other allocation requests, but you might be stuck
with the virtual address space range of the TLS area, with no uses for it in sight. Is this very unlikely case — lots of
dlopen/dlclose and no dlopening of the same previously dlclosed library where it would fit perfectly into the “TLS address range
gap” — one reason this kind of design would be rejected?.. Like, I’d still totally do it, and I did a bunch of work of this
sort, but maybe it offends the sensibilities of most people doing these things?..<a class="footnote-back" role="doc-backlink" href="#fnref4">↩︎</a></p></li>
<li id="fn5"><p>Actually two <code>data16</code> prefixes are so pointless that while gdb <em>disassembles</em> this instruction
sequence just fine, the GNU assembler presumably refuses to <em>assemble</em> it, so the compiler generates the delightfully
self-explanatory code <code>.value 0x6666</code> (0x66 being the encoding of <code>data16</code>) followed by <code>rex64</code>
and then the <code>call</code>. I don’t know why gdb’s disassembler doesn’t show us the <code>rex64</code>; I’m not by any
measure an x86 guy. I’m just someone who thinks that this stuff has to work somehow and there must be a way to make it do the
thing you want, and I try to learn enough to seem to be able to do it when I need it.<a class="footnote-back" role="doc-backlink" href="#fnref5">↩︎</a></p></li>
<li id="fn6"><p>You set the visibility to "hidden" with <code>-fvisibility-hidden</code>,
<code>__attribute__((visibility ("hidden")))</code>, or <code>[[gnu::visibility("hidden")]]</code><a class="footnote-back" role="doc-backlink" href="#fnref6">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/cxx-thread-local-storage-performance#comments</comments>
      <pubDate>Mon, 17 Feb 2025 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/cxx-thread-local-storage-performance.feed</wfw:commentRss>
    </item>
    <item>
      <title>LLMs aren’t world models</title>
      <link>https://yosefk.com/blog/llms-arent-world-models.html</link>
      <description><![CDATA[<p>I believe that language models aren’t world models. It’s a weak claim — I’m not saying they’re useless, or that we’re done
milking them. It’s also a fuzzy-sounding claim — with its trillion weights, who can prove that there’s something an LLM isn't a
model of? But I hope to make my claim clear and persuasive enough with some examples.</p>
<p>A friend who plays better chess than me — and knows more math &amp; CS than me — said that he played some moves against a
newly released LLM, and it must be at least as good as him. I said, no way, I’m going to cRRRush it, in my best Russian accent.
I make a few moves – but unlike him, I don't make good moves<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a>, which would be opening book moves it has seen a million times; I make weak moves, which it
hasn't <a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a>. The thing makes decent moves in
response, with cheerful commentary about how we're attacking this and developing that — until about move 10, when it tries to
move a knight which isn't there, and loses in a few more moves. This was a year or two ago; I’ve just tried this again, and it
lost track of the board state by move 9.</p>
<p>When I’m saying that LLMs have no world model, I don’t mean that they haven't seen enough photos of chess knights, or held a
knight in their greasy fingers; I don’t mean the physical world, necessarily. And I obviously don’t mean that a machine can’t
learn a model of chess, when all leading chess engines use machine learning. I only mean that, <strong>having read a trillion
chess games, <em>LLMs</em>, specifically, have not learned that to make legal moves, you need to know where the pieces are on
the board</strong>. Why would they? For predicting the moves or commentary in chess games, which is what they’re optimized for,
this would help very marginally, if at all.</p>
<p>Of course, nobody uses LLMs as chess engines — so whatever they did learn about chess, they learned entirely “by accident”,
without any effort by developers to improve the process for this kind of data. And we could say that the whole argument that
LLMs learn about the world is that they <em>have </em>to understand the world <em>as a side effect of modeling the distribution
of text</em> — which is soundly refuted by them literally failing to learn the first thing about chess. But maybe we could
charitably assume that LLMs fail this badly with chess for silly reasons you could easily fix, but nobody bothered. So let’s
look at something virtual enough to learn a model of without having greasy fingers to touch it with, but also relevant enough
for developers to try to make it work.</p>
<p>So, for my second example, we will consider the so-called “normal blending mode” in image editors like <a href="https://krita.org/">Krita</a> — what happens when you put a layer with some partially transparent pixels on top of another
layer? What’s the mathematical formula for blending 2 layers? An LLM replied roughly like so:</p>
<blockquote>
<p>In Krita Normal blending mode, colors are <strong>not blended using a mathematical formula</strong>. The "Normal" mode
<strong>simply</strong> displays the upper layer's color, <strong>potentially affected</strong> by its transparency,
<strong>without any interaction or calculation</strong> with the base layer's color. <em>(It then said how other blending modes
were different and involved mathematical formulas.)</em></p>
</blockquote>
<p>This answer tells us the LLM doesn't know things such as:</p>
<ul>
<li>Computers work with numbers. A color is represented by a number in a computer.</li>
<li>Therefore, a color cannot be blended by something other than a mathematical formula — nor can it be “affected” without a
“calculation” by transparency, another number.</li>
<li>“Transparency” is when you can see through something.</li>
<li>“Seeing” works by sampling the color at various points, and processing that signal.</li>
<li>Therefore, if you can see something through something, like, say, a base layer through an upper layer, then by definition,
the color you will see is affected not only by the color of the upper layer and its degree of transparency, but also by the
color of the base layer — or you wouldn’t be <em>seeing</em> the base layer, which means that the upper layer is <em>not at all
</em>transparent, because you’re not seeing through it.</li>
</ul>
<p>I mean, it sounds stupid to break it down like that, but I’m not wrong, am I? It really doesn’t know any of these things,
does it.</p>
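<p>For the record, the formula whose existence the LLM denied is one line. A minimal sketch, assuming an opaque base layer and
an alpha value in [0, 1]:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">// a minimal sketch of "normal" blending, per color channel, assuming
// an opaque base layer: the result interpolates between the two colors
float blend_normal(float upper, float base, float alpha) {
    return alpha * upper + (1.0f - alpha) * base;
}
</pre>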
<p>Can you prompt the LLM to explain <a href="https://en.wikipedia.org/wiki/Alpha_compositing">alpha blending</a> properly?
Sure. But that just shows the LLM knows to put the words explaining it after the words asking the question. This capability does
not make the answer above into lesser evidence of the LLM not knowing <em>the things </em>as opposed to <em>the words.</em></p>
<p>And of course people can be like that, too - eg <a href="https://danluu.com/algorithms-interviews/">much better at the big O
notation and complexity analysis in interviews than on the job</a>. But I guarantee you that <a href="https://yosefk.com/blog/the-cardinal-programming-jokes.html#expanding-your-skill-set">if you put a gun to their head or
offer them a million dollar bonus for getting it right</a>, they will do well enough on the job, too. And with 200 billion
thrown at LLM hardware last year, the thing can't complain that it wasn't incentivized to perform<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>.</p>
<p>Of course, these are simple examples. An LLM triumphalist will observe that they often stop reproducing; an LLM denialist
will assume they stopped reproducing through some conspiracy, like a chess engine tool having been given to the LLM, or it
having been drenched with synthetic data similar to your question. (I used to ask LLMs to prove 2+2=4; they'd very pompously
enumerate various notable properties of 2 and 4, and proudly declare that 2+2 must equal 4 based on these properties, and I had
a good laugh. Then LLMs were flogged to become “good at math,” and now they might say something about “Peano axioms,” and some
total garbage about set theory — but they emit enough S(S(2)) and such that it probably counts as a proof, though I am yet to
see the simple “2+2 = 2+(1+1) = (2+1)+1 = 3+1 = 4” which I’d expect from an entity understanding the question.)</p>
<p>For a more complex example, we can take associativity (which, as we’ve seen in 2+2=4, LLMs understand vaguely at best),
combine it with alpha blending and transparency (which apparently they don’t understand at all), and see how well LLMs do. I’ve
had an exchange with an LLM asking whether alpha blending, as implemented in commonly used libraries, is associative, or whether
it isn’t due to precision loss or whatever — and if it’s not associative, how does caching work in drawing programs (where the
program must be precomputing the blending of the layers above and below the currently edited one, to avoid recomputing the
blending of 10 or 100 layers upon every brush stroke.)</p>
<p>Sure enough, it said that alpha blending wasn’t associative — probably because I suggested that it might not be — and that
this is “solved with caching instead of mathematical elegance” — probably because I suggested that caching was involved. And
then I ask, but how can caching work if blending is not associative? If layer 6 is selected, and you blend the cached blending
of {1…5}, the selected layer 6, and the cached blending of {7…10}, you would get different results from blending {1…4}, 5, and
{6…10}, if blending is not in fact associative? And then if you selected layer 5 in the program, you would see a different
picture compared to selecting layer 6 - but in practice you see the same picture?</p>
<p>“You got me,” says the LLM, more or less. So their not knowing what any of the words actually mean very much does extend to
complex examples.</p>
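<p>(For the record, "over" blending on premultiplied pixels <em>is</em> associative, up to floating-point rounding, which is why
the caching scheme works. A minimal numeric sketch, reduced to one color channel for brevity:)</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;cstdio&gt;

// premultiplied-alpha "over": out = top + (1 - top.a) * bottom
struct Px { float c, a; };  // one color channel plus alpha, for brevity

Px over(Px top, Px bot) {
    return { top.c + (1 - top.a) * bot.c, top.a + (1 - top.a) * bot.a };
}

int main() {
    Px a{0.30f, 0.50f}, b{0.20f, 0.40f}, c{0.60f, 0.80f};
    Px l = over(over(a, b), c), r = over(a, over(b, c));
    std::printf("%f %f vs %f %f\n", l.c, l.a, r.c, r.a);  // equal up to rounding
}
</pre>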
<p>You could say that the LLM was a victim of its agreeableness<a class="footnote-ref" role="doc-noteref" href="#fn4" id="fnref4"><sup>4</sup></a>, since it might have been influenced by my contradictory implications that blending might
not be associative, yet caching must be implemented that counts on it being associative. I could say that, well, my whole
question was about which parts of my suspicions are incorrect, and saying they’re all correct is an abject failure — but let’s
assume it could be a character flaw more than an intellectual weakness. So in our last example, we’ll see the LLM having its own
opinion and sticking to it, despite being told repeatedly that it can’t be true.</p>
<p>I ask it about the thread safety of appending to a Python list from multiple threads, and whether I can tell the number of
times append was called with len(myList), and whether it will work once the <a href="https://en.wikipedia.org/wiki/Global_interpreter_lock">GIL</a> is <a href="https://peps.python.org/pep-0703/">removed</a>.
It says that without the GIL, the program could corrupt memory. I say, no way, this is not C, it must be more like Java? And it
goes, no, <em>CPython is a C program</em>, and without the GIL your racy code can crash like C does. Java is different, <em>it
has a memory model</em>, and look at these crash reports from GIL-less Python. And I’m like, but these are bug reports, it’s not
<em>by design</em>, is there evidence that this is by design? — and it goes, it’s too early for the kind of evidence you’re
looking for to exist, no-GIL is too new, but here’s how a C program could crash in such scenarios… and on and on and on.</p>
<p>It does not know that <a href="https://yosefk.com/blog/a-100x-speedup-with-unsafe-python.html">(pure) Python is a memory-safe
language</a>, and that no suggestion making it memory-unsafe would ever be accepted, and I found no way of persuading it to take
this notion into account — or to acknowledge that the evidence it’s citing in support of its claims is more like evidence to the
contrary (if all the crashes upon races you find are bug reports, it points to the requirement being that races don’t lead to
crashes.)</p>
<p>So it can be either kinda agreeable or very stubborn — and it might obviously not know what it just said in both modes.</p>
<p><strong>Can this be quantified?</strong></p>
<p>I don't see how.</p>
<p>I mean, I wish it could be. It's clear that LLMs do learn <em>some</em> things about the world. For instance, even just the
token embeddings contain the representation of the concept of gender learned without any specific effort to teach the model what
gender is, as evidenced by “king - man + woman ~= queen” in the embedding space.</p>
<p>Ideally, you would want to quantify "how much of the world LLMs model." But even if you resolve the difficulty of defining
what this means, you'll run into the ease with which LLMs memorize answers to specific questions, so the vendor can celebrate
the new bar having been cleared.</p>
<p>All I can confidently claim is that they don't learn a world model except by accident, and there's neither a theoretical
reason nor empirical evidence for your being able to count on this accident in any defined and broad set of circumstances.</p>
<p><strong>So-called conclusions<a class="footnote-ref" role="doc-noteref" href="#fn5" id="fnref5"><sup>5</sup></a></strong></p>
<p>A guy who made $100 million from being an early employee of some startup came to give a lecture for that startup, and said “a
fundamentally incorrect approach to a problem can be taken very far in practice with sufficient engineering effort.” (He then
cheered up his listeners, most of whom had $100 million less than him, with the addendum “That's what I think is happening in
this company!”)</p>
<p>It is therefore not one of my conclusions that you can’t take LLMs very, very far just because they demonstrably do not learn
a model of the many worlds described by the words they’re trained on (which, BTW, is exactly as it says on the tin; nobody ever
called them LWMs.) I will, however, predict a few things — something you shouldn’t do if you don’t want to look stupid in the
future, but here goes.</p>
<p><strong>There will be at least one big breakthrough in machine learning around “world models”</strong>. I have no idea what
this breakthrough will look like; I predict that it will happen because some important kinds of thinking cannot be done without
it, and I trust the Church-Turing thesis when it comes to these kinds of thinking, and I think someone will figure this out,
same as people have come up with deep learning, convnets and transformers. And of course you already have “world models”, such
as systems recovering object classes and positions from images — by a breakthrough, I mean a “generic” ability to build models
of “novel worlds” (even if the model isn’t as good as a specially tailored one), much like you throw any text into an LLM and
have it learn “something” without much tuning for this kind of text.</p>
<p>(In fact, I would guess there will be at least 2 more breakthroughs, the other one being around needing far less training
data — again, not because I know how machines could use less training data, but because I know you and I get by with less.
Feeding these algorithms gobs of data is another example of how an approach that must be fundamentally incorrect at least in
some sense, as evidenced by how data-hungry it is, can be taken very far by engineering efforts — as long as something is useful
enough to fund such efforts and isn’t outcompeted by a new idea, it can persist.)</p>
<p><strong>LLMs are not by themselves sufficient as a path to general machine intelligence; in some sense they are a distraction
because of how far you can take them despite the approach being fundamentally incorrect</strong>. This should make “AI risk”
people happy; but “AI risk” is its own hilarity best left to another time.</p>
<p><strong>LLMs will never<a class="footnote-ref" role="doc-noteref" href="#fn6" id="fnref6"><sup>6</sup></a> manage to deal
with large code bases “autonomously”</strong>, because they would need to have a model of the program, and they don’t even learn
to track chess pieces having read everything there is to read about chess.</p>
<p><strong>LLMs will never reliably know what they don’t know, or stop making things up. </strong>You need some sort of a world
model to have notions of knowledge, truth and falsehood. Any mechanism that is supposed to make LLMs “safe”, trustworthy or
other such is a mix of snake oil and honest efforts to somehow steer it away from text that spooks users — which can be done,
since users are spooked by form more than substance. For example, when it says some politically related nonsense, people drag it
for having the wrong politics, and you can “fix” it by making its output less politically charged — without making it less
nonsensical, which there’s no way to reliably achieve.</p>
<p><strong>LLMs will always be able to teach a student complex (standard) curriculum, answer an expert’s question with a useful
(known) insight, and yet fail at basic (novel) questions on the same subject, all at the same time</strong>. This is not
surprising — this is exactly what you would expect from a language model that isn’t a world model. This fuzzy insight is more
cute than useful, however, since it’s hard to know what is and is not novel — in part because you come to the LLM in the first
place with things you don’t already know everything about.</p>
<p>(Some sources, such as <a href="https://mindmatters.ai/2025/01/some-lessons-from-deepseek-compared-with-other-chatbots/">this
wonderful writeup about LLMs not knowing that rotating a tic-tac-toe board doesn’t change the game</a>, take this point to its
logical conclusion: “<em>If you know the answer, you don’t need to ask an LLM; if you don’t know the answer, you can’t trust an
LLM.</em>” But this conveys a true insight into LLMs together with unwarranted pessimism about their utility. In fact, sometimes
you know the answer, but it’s quicker to proofread an LLM’s output than to type it out; sometimes you don’t know the answer, but
know to check if the LLM’s answer is correct, etc. etc.)</p>
<p><strong>LLM-style language processing is definitely a part of how human intelligence works — and how human stupidity works.
</strong>I agree with Dijkstra that “can machines think?” and “can submarines swim?” are poor questions to ask, and I hate it
when people say that neural networks “work like the brain” and such. But I can’t help feeling that LLMs are a mirror into a part
of how people such as myself think — and I don’t like what I’m seeing in that mirror. “Thinking” by guessing what words to say
next based on words we’ve previously heard might actually help find a good idea — and it’s also how know-nothings get through
work meetings, and how people come to think they know stuff they really don’t, and how they internalize the stupidest notions. I
am starting to think that in today’s environment, high cognitive skills are an actual risk factor for stupidity, and that
learning words without learning a model of what they refer to is one big part of the problem.</p>
<p><strong>P.S.</strong> I wish I could say something about how to best use LLMs for programming — something I would like to be
qualified to speak about, and that I am kinda supposed to learn enough to be qualified to speak about. I don’t think I am; I can
only say that I tried Cursor and it failed every time, including at replacing f(obj) and g(obj) with obj.f() and obj.g() (it
would occasionally mix up f and g, and I got tired of reviewing its output), and I went back to simply copying code into and out
of chat windows. I would say that I use LLMs like I use SIMD — sometimes it’s a good fit for leaf functions whose behavior is
relatively easy to specify and test, and it has no business being anywhere else.</p>
<p>I have conflicting theories about why some people do great things with “agentic AI” while I think it’s hopelessly useless for
me; I am waiting for someone to write something crisp and well-researched about this to teach me the truth, or a useful
approximation. I console myself with the idea that I can’t be missing out on too much, given how terrible the output I’m getting
from LLMs often is.</p>
<p><strong>P.P.S. </strong>Here’s a somewhat rough Soviet-era joke about a school for kids with special needs that illustrates
my point. Rachel says that a Russian joke isn’t a joke, but a story of pain. To this I reply that some people like them.</p>
<blockquote>
<p>An inspector comes to a school for kids with developmental issues. He asks a kid riding a wooden horsie his name, and the kid
says “MMMM.” He says, what do you want to be when you grow up? — and the kid says “MMMM.” The inspector turns to the principal
and says, “you’re doing nothing for these kids. I’ll be back in a month — if there’s no improvement, we’ll close the
school.”</p>
<p>He comes back a month later and finds the kid swinging on the wooden horsie, same as last time; if you want to tell this
masterpiece of Soviet humor at parties — the perfect conversation starter — you should be swinging wildly when saying the kid’s
lines:</p>
<p>— What’s your name?<br> — MMMMikey!!<br> — Mikey? Nice to meet you, Mikey. What do you want to be when you grow up,
Mikey?<br> — MMMMastrounaut!! <br> — An astronaut? Good stuff, good stuff! And how old are you? <br> — MMMMikey!! <br></p>
</blockquote>
<p>The moral of the story being that you can learn to predict the next word without learning much about the world — at least up
to a point.</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>It helps that I don’t know the good opening moves; I can’t be bothered to learn any opening theory. The fact
that my poorer chess knowledge makes it easier for me to see how bad the LLM is at chess is an interesting case study. It turns
out that you can get good answers out of LLMs by asking very well-phrased questions sounding like someone else’s well-phrased
questions answered in its training data; whereas if you ask simpler questions which are perfectly valid but not commonly asked,
they will fall apart.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>Funnily or tragically, my system of tripping up the opponent with weak moves it hasn’t memorized a response to
is conceptually similar to grandmaster play of today, where grandmasters memorize chess engine lines, and a “novelty” is a
relatively weak move your opponent has never analyzed with the engine, whereas you did and you remember all the strong moves
after this weak one. Of course my strategy would not work against a grandmaster, because I don’t come prepared with memorized
engine lines, and the grandmaster would find much better moves than I would over the board. Still, this 21st century concept of
“chess novelty” is tangentially related, and funny, or tragic, as the case might be.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>People can also give the wrong answer because they’re drunk. I don’t think the LLM was drunk. My point is that a
person who gave this answer would get zero points for this question on a test, and that the LLM is constantly under test because
it’s a machine serving no purpose other than answering these questions, and I don’t see why it should not get zero points here,
even though people might eg fail to answer some logic puzzle phrased in one way but succeed when it is phrased in another way,
etc. etc. — I don’t see how the cognitive weaknesses of people provide an excuse for the machine in this specific case.<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
<li id="fn4"><p>Actually, in this case, it was agreeable in substance but snarky in tone — it gave me an answer that confirmed
all my different suspicions, contradictory as they were, and at the same time it was saying something like “don’t expect the
world to be pretty or simple, man, the world is messy, man.” Generally I don’t think that LLMs’ “personality,” “style,”
“politics” and other anthropomorphic characteristics are the main thing about them; I think the main thing is what they model
(text) and what they don’t model except by accident (the thing the text is about.)<a class="footnote-back" role="doc-backlink" href="#fnref4">↩︎</a></p></li>
<li id="fn5"><p>It’s hard to call them “conclusions” when they’re fuzzy statements supposedly following from my fuzzy claim. In
fact this is what bugs me about LLMs in general: the thing is fuzzy — you can’t say it does something, because sometimes it
fails to do it; you can’t say it doesn’t do something, because sometimes it succeeds; and you can’t discuss the rates of
real-life success and failure, because who’s keeping score? This is why it’s hard for me to write about LLMs — I don’t like it
when things get this fuzzy, certainly when it comes to long-form writing; I’m reduced by the very nature of the subject to
shitposting about this on Twitter, along the lines of “<a href="https://x.com/YossiKreinin/status/1946153421817397567">Computers
used to provide cheap, reliable automation; then AI came along.</a>”<a class="footnote-back" role="doc-backlink" href="#fnref5">↩︎</a></p></li>
<li id="fn6"><p>When I’m saying that an LLM will never be able to do something, I mean it in the sense of “y = ax + b will never
represent a parabola” rather than in the sense of “the points residing on a curve rather than a straight line can never be
represented by an equation.” <em>Machine learning </em>might do what <em>LLMs </em>can’t do. Of course this could be used for a
No True Scotsman defense — “if it clearly learns a model of the world, it’s not a true LLM.” I’m assuming that when a big
breakthrough is achieved, we’ll know enough about it to be able to settle the question whether it’s still an LLM, as long as
we’re arguing in good faith — same as we don’t know all the details of how commercial LLMs work, but we know about transformers,
tokenization, encoders, decoders, next token prediction, whole-text synthesis, etc., and this is enough for “LLM” to have a
somewhat technical meaning — not as precise as “y=ax+b,” but not nearly as vague as, say, “AI.”<a class="footnote-back" role="doc-backlink" href="#fnref6">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/llms-arent-world-models#comments</comments>
      <pubDate>Sun, 10 Aug 2025 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/llms-arent-world-models.feed</wfw:commentRss>
    </item>
    <item>
      <title>"Enabling" C threads in a Python / Wasm environment</title>
      <link>https://yosefk.com/blog/enabling-c-threads-in-a-python-wasm-environment.html</link>
      <description><![CDATA[<p>Scarred by bare metal programming during my formative years, I consider the speedup from multithreading worth pursuing no
matter how limited a form of it you’ll get, and no matter how hideous the hacks you’ll need to make it work. In today’s quest,
we shall discover the various ways in which threads don’t work in a Python, Wasm, and especially a Python on Wasm environment,
and then do something about it — even when that something could get us shunned from polite society. In the end, <strong>we’ll
arrive at a working setup for limited yet performant multithreading, usable for soft real time programs caring about
sub-millisecond overheads</strong> which we’ll attempt to minimize or eliminate (GitHub links: <a href="https://github.com/yosefk/pyodide-pthread">Python running C threads on Wasm</a>; <a href="https://github.com/yosefk/WarmPool">a simple C++ thread pool for Wasm</a>.)</p>
<p>What this does is shown in the screenshot below<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a> — a browser thread running Python and calling a C function sending work to a pool of C
threads, one of which (“em-pthread”) is shown at the bottom.</p>
<p><img alt="the Python thread and a pthread C worker" height="755" src="https://yosefk.com/img/pyodide/2-threads.png" width="661" style="max-width: 100%;height: auto;"></p>
<p>I knew nothing about WebAssembly, and very little about JS and the rest of it, when I got into this. I am still shocked by
what I’ve discovered. The world’s most successful secure VM for running untrusted code turns out to be a complete hack wrapped
in a shortcut inside a workaround<a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a> — though I
must admit that I kinda like it that way?!.. This is not to say that I really learned the platform — please correct me when I’m
wrong, and please provide me emotional support when I’m right, as I’m muttering things like “so pthread_create fails because
someone put API.tests=Tests into some JS file?!..” or “so the wrong C function gets called because dynamic linking is quietly
working incorrectly?!..”</p>
<h2 id="plain-pre-no-gil-python">Plain pre-no-GIL Python</h2>
<p>At first glance, there’s nothing to talk about here — everybody knows that Python is about to lose the global interpreter
lock, but that work isn’t finished yet, so right now you won’t get a speedup from threading, unless your threads wait on I/O a
lot. That a C library used from Python can spawn its own threads to mitigate this problem is not news, either.</p>
<p>However, a couple of things below are appreciated less widely. I bring these things to your attention lest you discover them
on your own, and repeat my mistake of actually trying to use them. Incidentally, <em>not </em>using these things will turn out
to be particularly beneficial on WebAssembly.</p>
<p>The first thing is that the GIL is released when a function is called via ctypes<a class="footnote-ref" role="doc-noteref" href="#fn3" id="fnref3"><sup>3</sup></a>. So maybe we could write serial C functions and call them from Python worker
threads? Well, it’s fine for work taking tens of milliseconds, but <strong>for very short tasks, the overhead of Python thread
pool management is significant</strong>.</p>
<p>The second thing is that ctypes can pass a Python callback to a C function. This is expected of a C FFI package, but ctypes
generates machine code at runtime to make it work, which is impressive<a class="footnote-ref" role="doc-noteref" href="#fn4" id="fnref4"><sup>4</sup></a>. Anyway, seems like we could export a C parallel_for function to Python, and run Python
callbacks in a C thread pool?</p>
<p>This is fine for letting Python code use C threads, to avoid a second pool of Python threads. But <strong>even the thinnest
Python callback that calls back into C adds overhead too large for short tasks</strong>. You can (sorta) see this if you zoom
into this VTune screenshot, with the first three quarters of the timeline occupied by Python runtime functions — BTW, <em>mostly
functions seemingly unrelated to the GIL</em>:</p>
<p><img alt="a ctypes call from a worker thread" height="333" src="https://yosefk.com/img/pyodide/vtune.png" width="810" style="max-width: 100%;height: auto;"></p>
<p>The upshot is that for speeding up a sequence of relatively short tasks (like real time input handling), <strong>the way to
go is a serial Python flow calling C functions sending tasks to C worker threads, with zero Python code running in the
workers.</strong></p>
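<p>A minimal sketch of that style, with made-up names (a real pool would keep its workers alive between calls instead of
spawning threads per call, which is exactly the overhead we're trying to avoid):</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">#include &lt;thread&gt;
#include &lt;vector&gt;

// hypothetical example: a C entry point that Python calls once via ctypes;
// the work is split between C++ threads and no Python runs in the workers
extern "C" void scale_buffer(float* data, int n, float k) {
    int nthreads = (int)std::thread::hardware_concurrency();
    if (nthreads &lt; 1) nthreads = 1;
    std::vector&lt;std::thread&gt; workers;
    for (int t = 0; t &lt; nthreads; t++)
        workers.emplace_back([=] {
            for (int i = t; i &lt; n; i += nthreads) data[i] *= k;
        });
    for (auto&amp; w : workers) w.join();
}
</pre>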
<p>It turns out that on top of its performance advantage, this style has the added benefit of being the easiest to port to
WebAssembly by far. In fact, despite getting <strong>no threading whatsoever under Pyodide, the Python Wasm runtime, out of the
box</strong>, you might think that this style would “just work” — since we’re not using Python threads, right? We’re just
loading a C library which uses threads, and C threads work on Wasm, right?</p>
<p>Wrong. “The easiest way” doesn’t mean <em>easy </em>— you do not <em>just </em>load a C library using threads on Wasm.</p>
<h2 id="life-inside-an-array">Life inside an array</h2>
<p>Wasm started out, in the days of asm.js, as a way to compile C <em>to a subset of JavaScript</em> that a JS engine can
optimize well, by emitting code like (x|0) which persuades JS engines to treat x as an integer, rather than something with a
statically unknowable type. Later, a proper intermediate instruction set representation was designed for the compiler to emit,
so the x|0 craziness went away.</p>
<p>However, it’s still a C program running inside a JavaScript environment. How does this even work? Well, <em>the program’s
data is stored inside a JavaScript array</em><a class="footnote-ref" role="doc-noteref" href="#fn5" id="fnref5"><sup>5</sup></a>. When malloc runs out of space in this array, it calls the JavaScript function Memory.grow
to enlarge the array, the way it would call sbrk on Linux. (In general, the JS runtime is the OS of the Wasm program — its way
to interact with the world outside the JS array is to call an “imported” function implemented in JavaScript<a class="footnote-ref" role="doc-noteref" href="#fn6" id="fnref6"><sup>6</sup></a>.) C pointers are compiled to indexes into this array. To
access a C array from JS, you actually use JS array views with names like HEAPU8 (for uint8 data), indexing into entries
starting from the integer value of the C array base pointer.</p>
<p>So how does Python run in a JS environment? Well, CPython is a C program which was compiled to run inside a JavaScript array,
and ported to use the “OS” (actually, JS) APIs available on the web and/or Node.js. <a href="https://pyodide.org/en/stable/">Pyodide</a> is that port — a very impressive feat.</p>
<p>So, we’ll build our shared library for Wasm (where programs are called “main modules” and dynamic libraries are called “side
modules” — odd names underscoring the fact that you can have <em>several </em>main modules within one OS process, each running
inside its own JS array.) And we’ll load our side module from Pyodide using ctypes.CDLL as we would anywhere else, and it will
happily spawn threads.</p>
<p>Except it won’t load; it will complain that you’re trying to run it inside the wrong kind of JS array. You see, there’s
ArrayBuffer, an array which <em>you can’t share between threads, </em>and there’s SharedArrayBuffer, which you can. (“I’m afraid
I can’t do that” due to a mix of security considerations and historical baggage is a recurring theme on the web platform.) Since
your side module was built with -pthread, it needs to run inside a SharedArrayBuffer, but Pyodide was built to run inside a
plain ArrayBuffer.</p>
<p>My first thought was, “big deal — I’ll give Pyodide, this compiled blob of Wasm, a SharedArrayBuffer instead of an
ArrayBuffer to run in” — it’s not like it “feels” what it’s running inside, right? It’s the same load/store instructions in the
end?</p>
<p>Well, yes, but. Long story short, you must <strong>build Pyodide from source with -pthread</strong> to load your side module.
There are many reasons for this, such as side modules getting the pthread runtime from the main module, and needing
emscripten-generated JS for this runtime to work (did I mention that emscripten produces a giant .js output file alongside the
.wasm file — about 60K LOC for Pyodide?)</p>
<p>Luckily, Pyodide provides a Docker container for the build, and $EXTRA_C/LDFLAGS env vars that you can set to -pthread; it’s
rather nice. <strong>To then build your side module, use emscripten from the emsdk/ directory</strong> produced by the Pyodide
build process, or things will silently fail<a class="footnote-ref" role="doc-noteref" href="#fn7" id="fnref7"><sup>7</sup></a>.</p>
<p>You will also need to serve your web page with a couple of HTTP headers (Cross-Origin-Opener-Policy same-origin, and
Cross-Origin-Embedder-Policy require-corp<a class="footnote-ref" role="doc-noteref" href="#fn8" id="fnref8"><sup>8</sup></a>) —
absent which the browser will <em>refuse to give you a SharedArrayBuffer</em> (I’m afraid I can’t do that.)</p>
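<p>Concretely, every response serving the page and its scripts needs these two header lines:</p>
<pre style="background-color: #eeeeee;color: #222;overflow: auto;margin: 0 0 1.5385em 0;padding: 0.7692em;">Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
</pre>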
<p>And you might want to build Pyodide to use mimalloc instead of dlmalloc to avoid a global lock in malloc/free, though I’m not
currently doing it. Mimalloc uses more space, and even with dlmalloc I’m seeing tabs with Pyodide where Python’s tracemalloc
reports tens of megs, but the Chrome tab tooltip reports hundreds of megs<a class="footnote-ref" role="doc-noteref" href="#fn9" id="fnref9"><sup>9</sup></a>. A more efficient if less pleasant approach is to <strong>avoid malloc in parallelized
code</strong>.</p>
<p>While we’re at it, it’s notable that <strong>dynamic allocation of large chunks is space-inefficient when “living inside an
array” without mmap.</strong> And funnily enough, <em>this is exactly how things work on the bare metal </em>— if you allocate
and free lots of large &amp; small chunks, you’ll get heap fragmentation to the point of running out of memory, so you learn to
<strong>malloc the larger chunks up front</strong>.</p>
<p>(This doesn’t happen on fancier systems with virtual memory, where malloc mmaps large chunks instead of carving them out of
the one big array grown &amp; shrunk by sbrk. Then free calls munmap, and since a contiguous virtual address range can be backed
by <em>non-contiguous</em> physical pages, you can always reuse the pages for another large chunk later — you don’t suffer from
fragmentation. By forcing malloc to look for large contiguous slices inside a SharedArrayBuffer, the bleeding edge web VM is
sending us to knuckle-walk down the road traveled by embedded computing troglodytes.)</p>
<h2 id="fragile-handle-with-care">Fragile – handle with care</h2>
<p>Up until now, nothing really gross — so you need to rebuild from source and use a specific compiler version, big deal. On the
other hand, up until now, we didn’t actually spawn any threads, either. Our side module will now load just fine — but it will
promptly get stuck once it spawns a thread and tries to join it.</p>
<p>One splendid thing on the web is how “you can’t block the main thread” — “I’m afraid I can’t do that.” Of course you <em>can
</em>block the main thread by doing some slow work in JS, but you’re not allowed to block by <em>waiting</em> (you should
instead yield to the event loop — all that async business.) Therefore, using pthread APIs from the main thread tends to be a bad
idea. You can actually get emscripten to busy-wait in the main thread, instead of the disallowed proper waiting. But the workers
talk to the main thread to access many kinds of global state, so waiting might deadlock.</p>
<p>So the first order of business is to <strong>move your code using Pyodide to a web worker</strong><a class="footnote-ref" role="doc-noteref" href="#fn10" id="fnref10"><sup>10</sup></a>. Having done that, you might observe that you’re stuck at
the same point, because of <a href="https://emscripten.org/docs/porting/pthreads.html#special-considerations">the
following</a>:</p>
<blockquote>
<p>When pthread_create() is called, if we need to create a new Web Worker, then that requires returning to the main event loop.
That is, <strong>you cannot call pthread_create and then keep running code synchronously that expects the worker to start
running</strong> - it will only run after you return to the event loop. This is a violation of POSIX behavior and will break
common code…</p>
</blockquote>
<p>So we yield to the event loop — yet we still remain stuck. That’s because the emscripten-produced JS runtime code making
threads kinda sorta work is:</p>
<ol>
<li><em>broken by </em>Pyodide customizations; but also</li>
<li><em>missing </em>Pyodide customizations that any module would need to support threads; and finally is</li>
<li>broken, period, (nearly) regardless of Pyodide.</li>
</ol>
<p>Here are the specific problems and their workarounds, in increasing order of ugliness.</p>
<p>As a warmup, you’ll find that <em>you run the wrong JS code when spawning your C thread</em>. In a JS environment, the
so-called pthread runs in a JS “web worker” thread, the entry point of which is <em>a JS file</em> it gets in its constructor
and runs from top to bottom. Your C thread entry point is eventually called from that JS file. Emscripten produces a single JS
file, pyodide.asm.js in our case, which wraps the module both in the main thread and in the workers. So to implement
pthread_create, the file pyodide.asm.js needs to initialize a JS worker <em>with itself</em>. However, like any JS file,
pyodide.asm.js doesn’t know its own URL. Its attempt to find out instead produces the name of <em>your Pyodide-using
worker’s</em> JS file, and hilarity ensues.</p>
<p>Now if Pyodide was designed to support threads, its loadPyodide function, <em>which wraps the emscripten-generated wrapper
</em>pyodide.asm.js, would probably accept the file URL as a parameter — and pass it to pyodide.asm.js in the
"mainScriptUrlOrBlob" key that the generated code seems to expect. Since my workarounds are too incomplete to upstream to
Pyodide anyway, I didn’t bother to fix this properly, and passed the name thru some global variable.</p>
<p>A more curious class of issues you’ll discover is code (smack in the middle of pyodide.asm.js’s 60K LOC) doing things like
API.tests=Tests, and failing when running in the pthread worker. Grepping will reveal that the offending code comes from
Pyodide, rather than being generated by emscripten. How does it get into pyodide.asm.js? Not sure, but I think a part of the
thorny path is, the .js files get concatenated by make, then grabbed by a C file using the new #embed preprocessor directive,
and then pulled out from the object file into pyodide.asm.js by emscripten. In any case, you just need to strategically surround
the offending bits with if(!ENVIRONMENT_IS_PTHREAD) — that’s a global variable set in the generated code <em>by checking if the
thread’s name is em-pthread </em>(the check is done differently on the web and under Node, of course.) And with enough of those
ifs, you’re past this hurdle.</p>
<p>This is starting to stink, but you’ve probably smelled things worse than this. This is definitely the first case I’ve ever
seen of <strong>a</strong> <strong>threading implementation fragile enough to require effort from higher-level code (in this
case Pyodide) both to support it and to avoid breaking it</strong>. But zooming out a bit, most of us have seen worse examples
of difficulties making code do things it wasn’t meant to.</p>
<p>No, the real horror hides behind this succinct assertion in the Pyodide docs:</p>
<blockquote>
<p>“The interaction between pthreads and dynamic linking is slow and buggy, more work upstream would be required to support them
together.”<a class="footnote-ref" role="doc-noteref" href="#fn11" id="fnref11"><sup>11</sup></a></p>
</blockquote>
<p>We’re about to discover what this means.</p>
<h2 id="how-threads-do-dynamic-linking">How “threads” do “dynamic linking”</h2>
<p>I sort of dismissed the stark warning in the docs because of <a href="https://github.com/emscripten-core/emscripten/issues/3494">the optimistic comment from 2022</a> from an emscripten
maintainer (on the issue “<a href="https://github.com/emscripten-core/emscripten/issues/3494#top">Add support for simultaneously
using dynamic linking + pthreads”, from 2015</a>):</p>
<blockquote>
<p>I think this is largely complete. We still warn about it being experimental, but it should work % bugs. We can open specific
bugs if/when they are found.</p>
</blockquote>
<p>Another reason I wasn’t worried was my limited ambitions. I don’t dlopen from threads or whatever — I just spawn a pool of
threads from a side module, how can it <em>not </em>work? Of course this question reveals a lack of imagination — the
<em>real</em> reason I wasn’t worried. Like, what do threads even have to do with dynamic linking? <a href="https://yosefk.com/blog/cxx-thread-local-storage-performance.html">Thread-local storage, which does get awfully ugly</a>
in the presence of dynamic linking? Well, I don’t use TLS. What could go wrong?</p>
<p>This thinking is sensible with <em>real </em>threads doing <em>real </em>dynamic linking. The basic part of mapping new code
and data into the process address space is sort of orthogonal to threading — all threads “see” the new code and data simply
because <em>they share the address space</em>.</p>
<p>What about Wasm threads — don’t they share address space, too, that SharedArrayBuffer? Well, that’s for <em>data</em>, not
<em>code</em>. Remember how C data pointers become indexes into the SharedArrayBuffer? Well, function pointers become indexes
into wasmTable, <em>a different array.</em> Here, each index corresponds to <em>a whole function</em>, not an instruction. This
is nice in that you can’t jump into the middle of a function (“I’m afraid I can’t do that”) — a part of Wasm’s control flow
integrity features (of course <em>data integrity </em>inside the [Shared]ArrayBuffer is no better than in C — but the
considerable extent of CFI is still a good thing.)</p>
<p>What you’ll find out when your thread finally runs, and <em>the wrong function is called instead of your entry point,</em> is
that wasmTable is <strong>private to each thread</strong>. Which means that <strong>Wasm “threads” are more like processes with
data in shared memory</strong>. The code can be completely out of sync — as in, thread A has function f at index 47, and thread
B has f at 46.</p>
<p>Of course, the emscripten runtime — the C and JS code working together — makes an effort to keep the tables in sync. When a
pthread starts, it gets a list of dynamic libraries to load in the same order its parent did. And when a pthread runs and
dlopens a library, it “publishes” this fact thru a shared queue, and waits for the other threads to notice and dlopen the
library, too. It’s impressive how hard Emscripten tries to make dlopen and pthreads work despite the offbeat sharing model —
though notably, <a href="https://emscripten.org/docs/compiling/Dynamic-Linking.html#pthreads-support">this is another potential
source of Wasm-specific threading deadlocks</a>:</p>
<blockquote>
<p>In order to make this synchronization as seamless as possible, we hook into the low level primitives of emscripten_futex_wait
and emscripten_yield. … If your applications busy waits, or directly uses the atomic.waitXX instructions (or the clang
__builtin_wasm_memory_atomic_waitXX builtins) you maybe need to switch it to use emscripten_futex_wait or order avoid deadlocks.
<strong>If you don’t use emscripten_futex_wait while you block, you could potentially block other threads that are calling
dlopen and/or dlsym</strong>.</p>
</blockquote>
<p>In particular, you will deadlock not only if you wait for a child to start without yielding to the event loop, but also if
you dlopen before yielding and letting the child start — so don’t. But to go back to our nastier problem — with all this runtime
work, how does a child thread get f at index 46, when the parent has it at 47? Ah, that’s because <strong>code can add entries
to wasmTable —</strong> and you don’t have to be “the dynamic loader,” some JS runtime code, to do it! It can just as well be
some <em>other </em>JS runtime code! I <em>think </em>that there’s something Pyodide-specific doing this, but not sure; you
might pass a dynamically created callback as a function pointer that way, for instance.</p>
<p>In any case, for init-time dynamic linking, the current JS runtime simply has the parent send a list of modules to load, and
the child loads them in that order. This doesn’t account for the stuff put in <em>between modules </em>in the parent. My
workaround is to modify the emscripten-generated JS code to keep the offsets to which modules were loaded in the parent (that’s
simply wasmTable.length at the time when loading happened.) I then send them to the child, which grows the table to recreate the
parent’s gaps between the modules. (I don’t care what’s in these gaps, on the theory that C code not interacting with Python
code won’t need it.)</p>
<p>I did not patch the runtime enough to handle the case where a pthread returns, and the JS worker is then reused for another
pthread; you will probably get exceptions in the dynamic loading code in the new pthread. I also didn’t bother with offsets of
stuff dlopen’d later at runtime. I think that the mismatch between the Unix process model and the
Web-workers-with-a-SharedArrayBuffer model is quite large, and my chances of bridging it are smaller than those of the
actual emscripten team, and they probably gave up on fully bridging it for a reason. In fact, there’s a <a href="https://github.com/WebAssembly/shared-everything-threads/blob/main/proposals/shared-everything-threads/Overview.md">shared-everything-threads
proposal</a> which aims to share wasmTable, among other things, to make dynamic linking not as “slow and buggy” as the Pyodide
docs mentioned in passing, and as we’ve now learned thru bitter experience.</p>
<p>Another reason I didn’t bother fixing issues outside my use case is that even said use case, a pool of pthreads which never
exit or dlopen and only run parallel_fors, doesn’t actually work with off-the-shelf code, as we’ll see shortly. We’re clearly
not enabling threads here — we’re “enabling” “threads”, meaning, we’re trying to use platform features which by design don’t
add up to “real” threads. Best to focus on a narrow use case and make sure it actually works.</p>
<h2 id="a-small-warm-pool">A small, warm pool</h2>
<p>With these workarounds<a class="footnote-ref" role="doc-noteref" href="#fn12" id="fnref12"><sup>12</sup></a>, we can compile
an off-the-shelf framework like TaskFlow, and run a parallel_for example. Once we work around needing to yield to the event loop
before the thread pool becomes usable, it turns out that <strong>TaskFlow occasionally gets stuck in a parallel_for</strong> —
all threads are stuck waiting:</p>
<p><img alt="TaskFlow stuck on Wasm in a side module loaded from Pyodide" height="732" src="https://yosefk.com/img/pyodide/tf-stuck.png" width="604" style="max-width: 100%;height: auto;"></p>
<p>I have no idea why, and whether it’s a “real” Wasm compatibility issue in TaskFlow, or evidence of my attempt at enabling
threads in Pyodide being incomplete (as we’ve seen, on Wasm, threading is easily broken by the code you integrate threading
into.) I decided that my adventure debugging hairy code I know nothing about stops here — my needs are much more modest than
TaskFlow’s feature set, so I am just rolling my own thread pool.</p>
<p>I only do parallel_fors, with nested or concurrently submitted parallel_fors serialized. To avoid things that “are supposed
to work” in C++ but turn out to be broken or non-trivial to enable on Wasm, <strong>I’m only using synchronization primitives
compiling straightforwardly to Wasm builtins</strong>. Those builtins are:</p>
<ul>
<li>The usual <strong>atomic memory operations</strong> (load/store, read-modify-write, compare-and-swap, fence.) These are Wasm
instructions compiled to native CPU instructions or short instruction sequences; it’s hard to imagine how the Wasm “JS OS”
business can break any of these. It’s also hard to imagine how a C++ implementation of std::atomic&lt;T&gt; can mess things up,
so I just use std::atomic rather than emscripten builtins.</li>
<li>The OS-assisted <strong>notify/wait</strong> — these map to <a href="https://en.wikipedia.org/wiki/Futex">futexes</a> (or
their equivalent on not-Linux.) How much hides behind the word “map” on the Wasm VM side I don’t know. I <em>do </em>know
that the C++ standard library std::atomic&lt;T&gt;::notify/wait methods do quite a bit on top of the builtins (check out the
screenful of C++ garbage below) — so I use emscripten_futex_wake/wait instead<a class="footnote-ref" role="doc-noteref" href="#fn13" id="fnref13"><sup>13</sup></a>; you can switch to the std versions by commenting out a #define. (A minimal sketch of these primitives follows the screenshot below.)</li>
</ul>
<p><img alt="C++ callstack when using std::atomic::wait" height="260" src="https://yosefk.com/img/pyodide/cxx-stack.png" width="584" style="max-width: 100%;height: auto;"></p>
<p>In case futex performance is an issue — it certainly is in my tests, but YMMV — I also have <strong>a “warm pool” feature
where the threads busy-wait for work instead of waiting on a futex</strong>. (This warm pool business is one reason to roll my
own pool — off-the-shelf load balancers don’t have this, though some of them busy-wait a little bit before waiting on a mutex,
which is similar.)</p>
<p>Of course, such busy-waiting makes fans spin and drains batteries<a class="footnote-ref" role="doc-noteref" href="#fn14" id="fnref14"><sup>14</sup></a>. One way to mitigate this is to keep the pool warm only between the pointer-down
and pointer-up events; you could go further and only “warm” the pool once you start dropping pointer-move events. A “cool”
pool waits on a futex, which is pretty low-overhead — but the overhead can add up with many short parallel_fors submitted
in quick succession.</p>
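<p>The warm-versus-cool distinction, as a hedged sketch (names are illustrative rather than the actual worker_pool.h interface):</p>
<pre><code>// Spin for a bounded number of iterations ("warm"), then cool down to
// the futex; the cooled path is cheap but adds wakeup latency.
#include &lt;emscripten/threading.h&gt;
#include &lt;atomic&gt;
#include &lt;cstdint&gt;
#include &lt;cmath&gt;

constexpr int kSpinIters = 100000;  // the configurable "warmth" budget

void wait_for_work(std::atomic&lt;std::uint32_t&gt;&amp; work) {
    for (int i = 0; i &lt; kSpinIters; ++i)
        if (work.load(std::memory_order_acquire)) return;  // warm hit
    // Cool down: block instead of burning CPU; wakes when the word
    // changes from 0, or on an explicit futex_wake.
    while (work.load(std::memory_order_acquire) == 0)
        emscripten_futex_wait(&amp;work, 0, INFINITY);
}</code></pre>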
<p>A final wrinkle is that <strong>parallel_fors are serialized until all threads are ready to execute work</strong>. This
avoids an init-time slowdown where someone uses a parallel_for and ends up waiting for hundreds of ms for Wasm threads to start
— or deadlocks because they won’t start without yielding to the event loop<a class="footnote-ref" role="doc-noteref" href="#fn15" id="fnref15"><sup>15</sup></a>. In production, it’s also a better fallback when some environment issue prevents the
threads from running at all — better than getting stuck forever waiting for them.</p>
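<p>The gate itself can be as simple as this sketch (hypothetical names; each worker increments a counter once it is actually running):</p>
<pre><code>// Fall back to serial execution until every worker has checked in;
// this avoids both the init-time stall and the wait-forever failure.
#include &lt;atomic&gt;
#include &lt;cstddef&gt;
#include &lt;functional&gt;

std::atomic&lt;int&gt; threads_running{0};  // each worker increments on entry
constexpr int kNumThreads = 4;

void parallel_for(std::size_t n,
                  const std::function&lt;void(std::size_t)&gt;&amp; body) {
    if (threads_running.load(std::memory_order_acquire) &lt; kNumThreads) {
        // Workers aren't all up yet, or never will be on a broken
        // environment: just run the loop serially.
        for (std::size_t i = 0; i &lt; n; ++i) body(i);
        return;
    }
    // ... otherwise dispatch index ranges to the worker threads ...
}</code></pre>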
<p>This warm, limited, but fast, small, and seemingly robust pool is <a href="https://github.com/yosefk/BlogCodeSamples/blob/main/worker_pool.h">available on GitHub</a>. It’s more “works on my
machine” than “battle-tested” at the moment, but you could “come to trust it through understanding” more easily than most such
code; it passed some TSan testing (which found a bug), it’s portable, and it works in the browser, which apparently isn’t
trivial for a load balancer<a class="footnote-ref" role="doc-noteref" href="#fn16" id="fnref16"><sup>16</sup></a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>A Pyodide fork you can build with the “threading support” described above is available <a href="https://github.com/yosefk/pyodide-pthread">here</a>.</p>
<p>We shall conclude with a brief attempt to defend both our ends and our means against the accusation of unacceptable barbarism
— an accusation which I acknowledge to be quite understandable. I mean, the above should be enough to get why Pyodide doesn’t
ship with threading support — Python dlopens a lot, and Pyodide is supposed to work out of the box for a large number of Python
users, who will just give up on this Python in the browser business if things break in the ways described above. Shouldn’t we avoid
half-baked hacks, and produce solid code built on a solid basis?</p>
<p>My reply to this is that features matter, performance is a feature, a hack is fine if you know what you’re doing, and a lot
of good stuff comes about through a series of hacks:</p>
<ul>
<li>By “features matter,” I mean that between an ugly implementation of something and <em>no implementation at all</em>, ugly is
better. “Let’s do it properly” instead of a quick hack I can get behind. But “let’s not do it at all” is going overboard —
getting things done is kind of the point, and you don’t give up on doing stuff out of a sense of aesthetics.</li>
<li>By “performance is a feature,” I mean that, say, threading makes things several times faster, and this can be the difference
between a feature that works fine, and one that is unusable.</li>
<li>The trouble with a hack is how often it breaks and what happens when it does. A hack is fine if it’s unlikely to break, or
if it tends to break “loudly” during testing, rather than breaking quietly in production, etc. <strong>A lot of the risk can be
mitigated by aiming low</strong> — if your system is simple and doesn’t count on too much, not much will break, and you can
figure it out when it does. <strong>A simple system has an easier time leveraging ugly hacks to get tangible
benefits.</strong></li>
<li>The WebAssembly platform is quite hacky, and this is still visible in 2025; looks like it was worse earlier on. <strong>I
don’t think it’s a coincidence that Wasm is becoming the most successful VM for running untrusted code in a wide variety of
languages</strong>. The way it’s made is not only hacky but <em>makes it easy to further hack on</em>, which means you can
integrate it into some odd environment with very reasonable effort. For an example of a massive hack, look at the “<a href="https://emscripten.org/docs/porting/asyncify.html">asyncify</a>” feature where normal C++ code is made to yield to the
event loop by emscripten. This implementation style is what helped C spread in the first place — and the opposite of what I
think somewhat limits the JVM, which is a serious thing you can’t just bolt a bunch of hacks onto in order to adapt it to a
new environment.</li>
</ul>
<p>I can’t say that I love ugly, hacky code — I’ve met bigger fans of “<a href="https://www-formal.stanford.edu/jmc/history/lisp/node3.html">pornographic programming</a>”, as McCarthy called it. But
ugly and hacky is much better than solid, well-designed, and too rigid to work around the limitations of. For what it’s worth
given my limited experience, I hereby recommend WebAssembly as fairly hackable.</p>
<h2 id="see-also">See also</h2>
<p>“This one is admittedly a hack, and a dirty one,” says <a href="https://ppuzio.medium.com/multithreading-in-the-browser-with-emscripten-and-boost-asio-1e4d8484e155">a Wasm-related
writeup</a> about one of its sed invocations modifying Emscripten-generated JS. I’m linking to it primarily as an attempt to
present my methods of dealing with this stuff as completely normal.</p>
<h2 id="p.s.">P.S.</h2>
<p>“Printf debugging” JS<a class="footnote-ref" role="doc-noteref" href="#fn17" id="fnref17"><sup>17</sup></a> was harder than
usual. Apparently, console.log is aggressively buffered, and if your threads print in some short test on Node, you might never
see the prints. There, fs.appendFileSync helps, and you can install a process.on('uncaughtException') hook that does the same.
In the browser, pausing in the debugger seems to flush console.log prints.</p>
<p><em>Thanks to Dan Luu for reviewing a draft of this post.</em></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>On instructions from our blogging ethics department, we hereby inform you that our screenshots are doctored for
legibility.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p>Technically, it’s not the VM itself, but the compile-time features and C runtime libraries utilizing it “from
within”, and the JavaScript runtime code wrapping it “from the outside” which are full of hacks. The VM is presumably quite a
bit cleaner. But as Stallman used to argue with respect to his preferred Gah-noo/Linux naming, an OS is not just the kernel,
and arguably Wasm is not just the VM.<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
<li id="fn3"><p>I sort of take it for granted that if you want to call C from Python, you do it via ctypes, which keeps the
Python extension API (and any of the convoluted templated wrappers that make it worse) out of your C code. Maybe it deserves a
separate post, seeing how often people take the other route, despite it being uglier and harder.<a class="footnote-back" role="doc-backlink" href="#fnref3">↩︎</a></p></li>
<li id="fn4"><p>The function pointer called from C needs to find the Python context — eg object.method is a legitimate callback,
and there’s no statically compiled C function whose address you could possibly pass to identify it. Since the C function
prototype, such as int (<em>binary_op)(int,int) or whatever, doesn’t reserve a void</em> for ctypes to pass the context pointer
in, ctypes has to generate machine instructions with the Python context pointer hardcoded into them (as if compiling a function
int object_method(int,int) that knows the address of the object.)<a class="footnote-back" role="doc-backlink" href="#fnref4">↩︎</a></p></li>
<li id="fn5"><p>Today's Wasm has something more clever than "just another array" behind these JavaScript arrays, but it still
looks like just an array from JS.<a class="footnote-back" role="doc-backlink" href="#fnref5">↩︎</a></p></li>
<li id="fn6"><p>Of course you could embed a Wasm runtime into something other than a JS environment — see WASI — but a lot of
real-life code compiling to WebAssembly assumes a JS environment, including Pyodide. In this writeup, we’re ignoring non-JS Wasm
environments.<a class="footnote-back" role="doc-backlink" href="#fnref6">↩︎</a></p></li>
<li id="fn7"><p>ABI compatibility — or should I say ABJSI, for Application Binary &amp; JavaScript Interface? — seems rather
loose in these parts; which might be a good thing?.. — eg Google complained at one point that ABI compatibility in native Linux
C++ builds costs them 10% of performance across their giant fleet of machines, since it prevented many standard library
optimizations over the years. One nice thing about Wasm is that you’re likely to be building a small system fully from source,
since bloat costs more than elsewhere, which I guess makes ABI compatibility less of a problem.<a class="footnote-back" role="doc-backlink" href="#fnref7">↩︎</a></p></li>
<li id="fn8"><p>LLMs are good at configuring the web server to send these headers. I’ll spare you my explanation of what they
prevent your page from doing since there’s lots of better sources online.<a class="footnote-back" role="doc-backlink" href="#fnref8">↩︎</a></p></li>
<li id="fn9"><p>In general, “the” reason to avoid Python in the browser unless a Python interpreter is a user-visible feature is
footprint — RAM at runtime and download size at load time; speed you can take care of using methods like the ones we’re
discussing here.<a class="footnote-back" role="doc-backlink" href="#fnref9">↩︎</a></p></li>
<li id="fn10"><p>A lot of code using Pyodide seems to run in the main thread, and some of the APIs will not work in Web workers.
This is part of the beauty of the web platform: on one hand, you’re not allowed to block the main thread; on the other hand,
lots of APIs are inaccessible to worker threads. So you’re supposed to move heavy work to workers, and then have them talk to
the main thread for accessing the DOM or IndexedDB etc. etc. — so that main thread which we really don’t want to disturb is
guaranteed to be busy. Of course, Python suffers more from this than JS, in that in JS, you “merely” need to split your work
between threads, whereas with Pyodide, you must decide where the Python interpreter lives — you can’t use it from the main
thread as well as a web worker — and then it either won’t be able to access some APIs (you’ll need to proxy to JS code running
in the main thread instead), or it won’t be able to execute long-running tasks [regardless of the issues coming with spawning C
threads which mainline Pyodide won’t do anyway.] My opinion is that you use Python for some numeric stuff which might be
long-running, not as your preferred way to access web APIs which you might as well do in JS, so the sensible thing [to the
extent that Python in the browser is sensible] is to put Pyodide into a web worker. Another argument — for those doing something
quick rather than something slow in Python — is that the main thread drops input events upon the smallest lag, whereas with a
worker, you can collect all the input events in the main thread, forward them to the worker, and have that worker decide what to
do with them whichever way you want.<a class="footnote-back" role="doc-backlink" href="#fnref10">↩︎</a></p></li>
<li id="fn11"><p>One thing you could do is to build the Python interpreter statically with all the necessary packages;
code-size-wise it would be great, and you’d avoid the pthread/dlopen issues. I think it might hurt, however — AFAIK Python is
not designed for this, and your build flow will get way uglier. I find the dynamic linking issues much easier to deal with on
net balance.<a class="footnote-back" role="doc-backlink" href="#fnref11">↩︎</a></p></li>
<li id="fn12"><p>Actually there’s another workaround, where Pyodide adds a dependency on some “sentinel” functions that I
forcefully resolve to meaningless stubs in pyodide.asm.js; this didn’t feel interesting enough to
discuss.<a class="footnote-back" role="doc-backlink" href="#fnref12">↩︎</a></p></li>
<li id="fn13"><p>Of course the <em>actual </em>low-level builtin is __builtin_wasm_memory_atomic_waitXX, but of course using
<em>that </em>one will deadlock if a thread dlopens, as described by the quote in the docs above. I didn’t dig into whether
making the pool code brittle/risky that way pays off in some speedup, or what other problems you’re inviting by using the
lowest-level thing.<a class="footnote-back" role="doc-backlink" href="#fnref13">↩︎</a></p></li>
<li id="fn14"><p>Of course on wasm, busy-waiting also deadlocks if another thread dlopens, see previous footnote. In any case, a
pool should not be kept warm forever, and ours will “cool itself down” if you forget to do it after a configurable number of
spinning iterations, and will wait on a futex using the deadlock-preventing builtin.<a class="footnote-back" role="doc-backlink" href="#fnref14">↩︎</a></p></li>
<li id="fn15"><p>Of course if you create a pool and then dlopen before yielding back to the event loop, the dlopen will deadlock
— see the 2 previous footnotes.<a class="footnote-back" role="doc-backlink" href="#fnref15">↩︎</a></p></li>
<li id="fn16"><p>Of the more “serious”/feature-rich and standard pools, oneTBB failed some of its own tests (I used <a href="https://github.com/uxlfoundation/oneTBB/blob/master/WASM_Support.md">their instructions for building and testing on
Wasm</a>) and I didn’t manage to get it to load under Pyodide — same as with OpenMP, though I didn’t try my hardest; outside
Pyodide people report some success with these. The one pool that compiled, loaded and ran just fine was <a href="https://github.com/alugowski/poolSTL">poolSTL</a> — but in my tests, it was much slower than the pool presented here.<a class="footnote-back" role="doc-backlink" href="#fnref16">↩︎</a></p></li>
<li id="fn17"><p>There’s also TypeScript, which I am yet to learn to debug <em>at build time.</em> Pyodide provides TypeScript
definitions which I personally don’t need, and which I gave up on adapting to the -pthread build flow (SharedArrayBuffer isn’t
ArrayBuffer, and this causes type errors.) Say what you will about C++ type errors — and I’ve said <a href="https://yosefk.com/c++fqa/defective.html#defect-7">quite a bit</a> over the years — you mostly see, if not without
difficulty, where in your code they’re coming from. TypeScript managed to stun me with errors coming out of the standard library
with no reference whatsoever to “my” code (more accurately, Pyodide’s code which isn’t bundled with TypeScript.) I haven’t
seriously used Python’s type annotations, let alone TypeScript given my near-zero JS experience, so I haven’t earned the right
to an opinion. All I can say is that I’m not excited to get into <em>non-load-bearing </em>type systems retrofitted into
languages which ignore the types when running or generating code. My experience is that <a href="https://yosefk.com/blog/the-habitat-of-hardware-bugs.html">if things can suck, they do</a>. And a type system which does
not affect the execution of code — the thing that <em>can’t</em> suck because machines do not recover from or mitigate
compiler/interpreter errors — such a type system inherently can get away with quite a bit (at worst, the user will skip type
checking — there are flags for this — which eg C++ cannot have, since you can’t generate code without the types.)<a class="footnote-back" role="doc-backlink" href="#fnref17">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/enabling-c-threads-in-a-python-wasm-environment#comments</comments>
      <pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/enabling-c-threads-in-a-python-wasm-environment.feed</wfw:commentRss>
    </item>
    <item>
      <title>All means are fair except solving the problem</title>
      <link>https://yosefk.com/blog/all-means-are-fair-except-solving-the-problem.html</link>
<description><![CDATA[<p>An industry veteran in my circles has recently made the rookie mistake<a class="footnote-ref" role="doc-noteref" href="#fn1" id="fnref1"><sup>1</sup></a> of printing a warning from his code upon misuse. To the surprise of nobody experienced,
critical workflows soon came to a screeching halt.</p>
<p>It turned out that a program using his code prints something like “yay, done” upon exit, and scripts expect it to be the last
thing it says. But now those warnings occasionally got printed from destructors or such, <em>after</em> the “yay, done”, making
the scripts think the program failed.</p>
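<p>A minimal sketch of the mechanism (hypothetical code, not the actual program):</p>
<pre><code>// Why the warning lands after "yay, done": globals are destroyed after
// main() returns, so anything they print comes after main's last line.
#include &lt;cstdio&gt;

struct Resource {
    ~Resource() {
        // Misuse detected during teardown; by now main() has already
        // announced success.
        std::fprintf(stderr, "warning: handle leaked by caller\n");
    }
};

Resource global_resource;  // destroyed after main() returns

int main() {
    std::printf("yay, done\n");  // scripts treat this as the last line
    return 0;                    // global_resource's destructor runs now
}</code></pre>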
<p>One might think that this prompted people to fix the reported misuse, and that thought would be another rookie mistake.
Instead, they were quick to point out that it’s hard to know where these warnings could come from, and we cannot risk all those
critical workflows failing when some case of misuse surfaces in a new context.</p>
<p>I mean, you could grep to get an upper bound, and if you did, not that many places would come up. But one could then say, as
some in fact <em>did</em>, that maybe you haven’t grepped everywhere you should have, and even the cases you did find are owned
by many different teams, so we won’t get the fixes quickly enough, etc.</p>
<p>Several solutions were suggested by helpful high-ranking people:</p>
<ul>
<li>You could add a destructor printing “yay, done” <em>again</em> if a warning was printed during the destruction sequence
(opening an interesting technical debate about the differences between a destructor, __attribute__((destructor)), an atexit
handler and other unspeakable horrors). In fact, our industry veteran would later learn, and I swear that I’m not making this
up, that <em>this was already implemented by someone else</em> who printed something during the program termination sequence,
and had to appease the scripts.</li>
<li>You could suppress the warnings by default, and enable them upon request (opening a debate about the runtime method to
enable them, and the appropriate circumstances to do this).</li>
<li>You could write those warnings to their own file, and…</li>
</ul>
<p>When I was done scrolling his work chat with these helpful suggestions, our unfortunate industry veteran put on a melancholy
smile and summarized the situation: “All means are fair except solving the problem.”<a class="footnote-ref" role="doc-noteref" href="#fn2" id="fnref2"><sup>2</sup></a></p>

<section class="footnotes footnotes-end-of-document" role="doc-endnotes" id="footnotes">
<hr>
<ol>
<li id="fn1"><p>Our protagonist happens to be somewhat of an idealist, and since his condition is too acute to be treated by
experience, he’s bound to make what pragmatists call “rookie mistakes.” But this particular story could happen to most of us.<a class="footnote-back" role="doc-backlink" href="#fnref1">↩︎</a></p></li>
<li id="fn2"><p><a href="https://www.hyrumslaw.com/">Hyrum’s law</a> arguably diagnoses this particular problem more
specifically from a technical point of view. However, our melancholy veteran’s phrase hints at the broader social condition from
which the technical problem derives its significance. And by “social condition,” I mean that in Hyrum’s law, “all observable
behaviors of your system will be depended on by somebody” is implicitly amended with “...somebody who can’t be bothered to fix
their code, and there’s nothing you can do about it” — and it’s this quiet part which makes it into a “law.”<a class="footnote-back" role="doc-backlink" href="#fnref2">↩︎</a></p></li>
</ol></section>]]></description>
      <comments>https://yosefk.com/cgi-bin/comments.cgi?post=blog/all-means-are-fair-except-solving-the-problem#comments</comments>
      <pubDate>Wed, 06 May 2026 00:00:00 +0000</pubDate>
      <dc:creator>Yossi Kreinin</dc:creator>
      <wfw:commentRss>https://yosefk.com/blog/all-means-are-fair-except-solving-the-problem.feed</wfw:commentRss>
    </item>
  </channel>
</rss>
