Do call yourself a programmer, and other career advice

This is a (very late) reply to Patrick McKenzie's "Don't Call Yourself A Programmer, And Other Career Advice". I find much of his advice very sensible, and it might be very helpful to someone in the beginning of their career – assuming they can act upon it (and I really don't know whether my 20-year-old self could actually use the advice to improve his negotiation skills, for example).

A few things in the article I disagree with, however. Here I'll mostly focus on those few things, recommending you to read the original article so that you don't miss the rest of it.

"Disagree" is not necessarily the right word – a more precise way to put it would be "it's different in my experience". Which is to be expected because both of us are speaking based on our own careers, which have been rather different. Patrick McKenzie is a small business owner running Bingo Card Creator and a successful consultant. I'm a lead chip architect at a billion-dollar company. Both of us have thus traveled some distance away from "purely programming" (whatever that means), but in rather different directions.

What company are you going to work for?

Patrick McKenzie says 90% of the jobs involve things like implementing an internal travel expense reporting form, rather than a product shipped to external customers. He advises you to get used to the idea, even though such software is "soul-crushingly boring" as he puts it.

How bad is it, and is it really 90% of the jobs? Spolsky thinks it's maybe 80% – and that it's bad enough to "drain the life out of you". He goes on to elaborate why it "sucks to be an in-house programmer":

  • There's rarely a business reason to improve in-house software past the point of "barely good enough". "Forget any pride of craftsmanship – you're going to churn out embarrassing junk".
  • At software companies, what you do is more directly related to the way the company makes money, so you're more likely to be respected. "A programmer is never going to rise to become CEO of Viacom, but you might well rise to become CEO of a tech company." "…no matter how critical it was for Viacom to get this internet thing right, when it came time to assign people to desks, the in-house programmers were stuck with 3 people per cubicle in a dark part of the office".

Note that McKenzie and Spolsky are in almost complete agreement over these points. But then Spolsky says you should be gunning for a position in a software company – the environment where creatures of your kind naturally thrive. Conversely, McKenzie explains how to prosper as a programmer outside software companies – moving in the opposite direction of where things go by default (being stuck in a dark part of the office while they're trying to outsource your job.)

So the question is which path you prefer. "Not so fast", you say: one of these jobs is way easier to land – 80-90% of the chances are you're not getting inside a software company – so it's not just a question of preference.

Here I disagree: even if only 10-20% of programmers work in software companies (where are the stats?..), and even if they're "the best" (according to what metric?), McKenzie himself says in that same article:

You radically overestimate the average skill of the competition because of the crowd you hang around with:  Many people already successfully employed as senior engineers cannot actually implement FizzBuzz.

But if competition is relatively unskilled on average, you probably can land a job in the 10-20% of the sector that you want – as did most people who graduated around the time I did. So I rather firmly believe that it's a matter of choice: do you want to work on in-house software or one-off businessy projects of that kind, or do you prefer a software company?

Let's proceed to McKenzie's advice to in-house programmers – which should in itself help one make that choice.

How to call yourself

One such advice is:

Don't call yourself a programmer. “Programmer” sounds like “anomalously high-cost peon who types some mumbo-jumbo into some other mumbo-jumbo.” Instead, describe yourself by what you have accomplished for previous employers vis-a-vis increasing revenues or reducing costs.

Sure – an in-house programmer is likely doing some type of expensive mumbo-jumbo in the eyes of his non-technical MBA-wielding manager.

To me, however, a programmer is who I'm looking for, while a resume full of revenue increases and cost reductions sounds like an "anomalously high-cost parasite who types some mumbo-jumbo into Excel and PowerPoint, claiming credit for others' work".

McKenzie says a software company looks at this just like a company hiring internal programmers, essentially. His example is "the guy who wrote the backend billing code that 97% of Google’s revenue passes through – he’s now an angel investor". The guy apparently got rich by being near a "profit center" rather than through his unusual skills.

The thing is, in this case I believe he's talking about Ron Garret, the PhD from NASA's Jet Propulsion Laboratory. Do you think they hired him because he described his work at the JPL in terms of revenues and costs? (BTW he didn't like working on the billing code, bought his stock options and quit, instead of choosing a career at the company's biggest "profit center".)

Did any unusual skills go into the billing code? Ron Garret says:

I did end up writing the credit card billing and accounting system, which is a nontrivial thing to get right. Fortunately for me, just before coming to Google I had taken some time to study computer security and cryptography, so I was actually well prepared for that particular task. …I designed the billing system to be secure against even a dishonest employee with root access (which is not such an easy thing to do). I have no idea if they are still using my system, but if they are then I'd feel pretty confident that my credit card number was not going to get stolen.

Sounds to me that his technical knowledge and programming ability was the bulk of his contribution, whereas deep thoughts such as realizing that there will be some "cost reduction" due to not having credit card numbers stolen is not something an employer needs to hire anyone for.

So if I ever send out a resume as a chip architect, I will focus on my technical role in transitioning from fixed-function hardware accelerators to programmable processors, more than the manpower this saved and the business we won as a result (which I think were real outcomes of our work, but which is rather hard to quantify – as these things often are unless you're a business-friendly-sounding liar.)

Incidentally, I'm not sure when I'll send out that resume, which brings us to the next point.

On job hopping, backstabbing, and the lack thereof

Co-workers and bosses are not usually your friends: You will spend a lot of time with co-workers.  You may eventually become close friends with some of them, but in general, you will move on in three years…

<your boss will> attempt to do things that none of your actual friends would ever do, like try to talk you down several thousand dollars in salary or guilt-trip you into spending more time with the company when you could be spending time with your actual friends.  You will have other coworkers who — affably and ethically — will suggest things which go against your interests…

There is a certain internal consistency to a view that your coworkers are not your friends, because you will move on in 3 years. In fact, it's a bit circular. They aren't your friends – because you'll move on. And why will you move on? Well, I dunno, maybe for a 10% salary increase. What's there to lose? Relationships with coworkers? But coworkers aren't your friends!

Again, I don't disagree, but rather offer an alternative view, equally internally consistent. I have stayed at one job for more than a decade, in large part because I'm rather attached to the people I work with. To be sure, I got raises, and I was ready to quit over employment terms – but it'd take much more than 10%.

Isn't it just a quantitative difference in preferences – a 10% raise not being fundamentally different than, say, 100%? Well, sufficiently large quantitative changes add up to qualitative changes, as Marxian dialectics or some other Soviet philosophy thingie that my parents sometimes quote taught us. What's going on is that both approaches can lead to career advancement, but they do so very differently.

If you're willing to change jobs over a small raise, you'll be changing them frequently. You won't get attached to people, or to the work you're doing together. You will be very good at finding jobs and you will know what's generally going on in the industry and what's in demand. You will not know that many things specific to any of your employers. You and your employer will become very useful to each other fairly quickly, but you'll also be somewhat expendable for each other.

Alternatively, you can keep a job as long as it's a fun environment, requiring a significant raise once in a while. Your relationships with people combined with your long-term outlook can let you do things together that you otherwise couldn't plan or execute, and learn things you wouldn't have learned.

Much of my knowledge about chip design comes from ASIC hackers I worked with, and their willingness to develop their biggest ideas together with me came from trust that necessarily took time to build. It takes time to learn that none of you is in the habit of "suggesting things going against the other's interest", or pulling other unfriendly shenanigans.

Incidentally, if you stay at one place for a long while, then your worth to the employer grows to the point where you can get the significant raise that you'd quit over without actually quitting. Your worth can also grow well above what employers are willing to pay to experienced new hires, so there's no longer a point in switching jobs. This is somewhat analogous to becoming a consultant after having switched a whole lot of jobs and now making more than the next job hop could give you.

Both approaches work, though I don't have stats showing which tends to be more effective. I do believe that the long-term approach is more fun. I could never land the kind of gig that I have now through job hopping. More importantly, I wouldn't have the relationships that I have at work.

"More importantly", because all means to reach our ends often fail, and then all we're left with is our means. You can't count on any career strategy to give you either a dream job or a load of money; it'll work to some extent or other but who knows. What you can count on is your lifestyle being affected rather predictably by your career choices. The impact of these choices on relationships could thus be weighted as more important than the impact on career advancement because it's more predictable.

The part about bosses is the only one I very much agree with. (I had enough bosses to be able to plausibly deny that I'm thinking about any particular one here.) Yes, some of them will want you to work more time for less money (by itself a natural desire for an employer) while attempting to look like your friends (which is where it becomes a tad irksome). This just means that you should guard your own interests (as always) – and perhaps not judge people too harshly before spending time in their shoes.

How to value an equity grant

McKenzie says you shouldn't value equity very much, and he doesn't spend many of words to say it. I'll talk about stock options, which are worse than an actual equity grant and which is the only thing I've ever been offered.

My basic outlook is again long-term. I work at a private company whose value rose almost tenfold over the decade I've been there. And it's still a private company, so there's never been an easy venue to make money off most of the stock options.

From a long-term view, stock options look worse – and better.

Worse, because having stock options ties your hands behind your back. You usually can't afford to buy them when you quit, or at least buying them is a significant risk that you might be reluctant to take. If the company survives for a long while, then you may start to dislike the place but the hope of making money off your stock options now makes it harder to quit. If you generally like the place, options make it harder to negotiate a raise, since they know you can't quit.

So in the long term, options can effectively be a liability.

On the other hand, as the company matures, its stock options tend to get undervalued by employees, and for no good reason. People intuitively think along the lines of, "it's already expensive – how much can a price rise from now on?" It's a natural thought if the price has went up threefold or tenfold already.

But what this misses is that you don't get paid in percentage points – you get dollars. A $100 share going up 20% to $120 means you make $20 per share. A $5 share going up 100% to $15 means you only make $10 per share. Stock options of a mature company whose price is still rising can thus be even nicer than stock options of a young company which rises more quickly but which is still cheap – and is more likely to go bust overnight.

The upshot is that people overvalue stock options early on – but they also often undervalue them later on.

Note that if you don't intend to stay for more than 3 years, than stock options are most certainly a liability because they make it harder to quit – while the chances that the company makes it big in that span of time are very low.

Working at a startup

McKenzie lists valid reasons not to. In terms of job satisfaction, he says you can work on many exciting things in large corporations, not just startups.

Here's one thing in favor of startups. A large corporation usually doesn't have huge gaping holes that it doesn't know how to deal with or doesn't even notice. A startup often does have many such gaping holes, because, well, nothing is established yet, they don't even understand what they're doing, and most importantly, they are severely understaffed.

This means that you can grab pretty much any responsibility that you want to. There will be areas that people are competing to work on everywhere, but in a startup doing something hard enough, there will be a ton of hard problems nobody is competing to solve because there's not enough time or people for everything. You can be the person pointing out that problem and grabbing that responsibility.

As companies mature, being able to just work on whatever you want gets harder. My metaphor for it is nomadic programmers moving from problem to problem vs settlers with states and national borders where even visiting your neighbor's code may involve a visa.

This isn't a recommendation to work for startups, just one thing worth pointing out. The counterpoint is that if you're an orderly person who wants an orderly process, then a larger company known for its development culture is probably a better idea.

Impact of career on life happiness

At the end of the day, your life happiness will not be dominated by your career.

In one way, I agree wholeheartedly; whatever the merits of a job, it's a job, and I actually noticed my productivity fall at times of treating it as more important than that. The healthy way of looking at it is "just a job, at the end of the day".

On the other hand, we do spend quite some time at work. The question is, to what extent does it make sense to separate "work" from "life" – and to what extent it's one part of life among many, to be treated similarly to those other parts of life?

I argue that the "work/life" separation shouldn't be strong enough to separate "coworkers" into a distinct category of human beings with whom relationships are formed fundamentally differently – nor is it necessarily great to be emotionally detached from the workplace to be always ready to abandon it and "move on".

(I'm not arguing that McKenzie's intent was to say the exact opposite of what I'm saying, BTW. I'm just commenting on some quotes and the general atmosphere of the text as I perceived it. A lot of things simply have different meaning when heard by different people; a simple advice like "be wary of others' intentions" is great for someone overly trusting, but not for someone already verging on paranoia. Some people need to hear that coworkers aren't friends; today I'm writing for the other people.)

Summary

When I introduce myself, I usually call myself a programmer, regardless of my current work on chip architecture and management and stuff. I got into programming for the money, so it's not like I'm overflowing with pride when uttering "programmer". I just think programming is a great career and the right thing to call myself for me.

There's an alternative approach where you program, but you don't call it that, and you use programming as a starting point from which you transition to some form of being involved in business as directly as possible.

It sounds a bit roundabout to me – why not just get an MBA instead? – but maybe it's the right path for some (especially considering that some prestigious MBA programs want you to have industry experience before you can even enroll.)

The important thing is to choose the path that suits your preferences, follow it consistently, and realize where your approach is most likely to succeed. Because where I work, someone applying for a programming position and not calling himself a programmer will not make a good impression.

I agree emphatically with many of the points in McKenzie's article – my favorite point is the importance of communication skills – and I very much recommend it.

How FPGAs work, and why you'll buy one

Update (June 21): this article has been published at embeddedrelated.com, where I hope to publish a follow-up soon.

Today, pretty much everyone has a CPU, a DSP and a GPU, buried somewhere in their PC, phone, car, etc. Most don't know or care that they bought any of these, but they did.

Will everyone, at some future point, also buy an FPGA? The market size of FPGAs today is about 1% of the annual global semiconductor sales (~$3B vs ~$300B). Will FPGA eventually become a must-have, or will its volume remain relatively low?

We'll try to answer this question below. In order to see how popular FPGAs could become, we'll need to discuss what FPGAs are. FPGAs are a programmable platform, but one designed by EEs for EEs rather than for programmers. So for many programmers, FPGAs are exciting yet mysterious; I hope our discussion will help demystify them.

We'll start with a common explanation of FPGAs' relatively low popularity. We'll see why that explanation is wrong – and why, if we take a closer look, we actually come to expect FPGAs to blow the competition out of the water!

This will conclude today's installment, "Why you'll buy an FPGA". A sequel is in the making, titled "Why you won't buy an FPGA". There, we'll see some of the major obstacles standing between FPGAs and world domination.

The oft-repeated wrong answer

…to the question of "why aren't FPGAs more popular?" is, "FPGA is a poor man's alternative to making chips. You can implement any circuit design in an FPGA, but less efficiently than you could in an ASIC or a custom design. So it's great for prototyping, and for low-volume products where you can't afford to make your own chips. But it makes no sense for the highest-volume devices – which happen to add up to 99% of sales, leaving 1% to FPGAs."

This is wrong because programmability is a feature, not just a tax on efficiency.

Of course a Verilog program doing convolution on an FPGA would run faster if you made a chip that runs just that program. But you typically don't want to do this, even for the highest-volume products, any more than you want to convert your C programs running on CPUs into dedicated hardware! Because you want to change your code, run other programs, etc. etc.

When programmability is required – which is extremely often – then the right thing to compare FPGAs to is another programmable platform: a DSP, a GPU, etc. And, just like FPGAs, all of these necessarily introduce some overhead for programmability. So we can no longer assume, a priori, that any one option is more efficient than another – as we did when comparing FPGAs to single-purpose ASICs.

We need benchmarks – and FPGAs' performance appears very competitive in some benchmarks. Here's what BDTI's report from 2007 says:

…we estimated that high-end FPGAs implementing demanding DSP applications … consume on the order of 10 watts, while high-end DSPs consume roughly 2-3 watts. Our benchmark results have shown that high-end FPGAs can support roughly 10 to 100 times more channels on this benchmark than high-end DSPs…

So for that benchmark, FPGAs offer 10x-100x the runtime performance, and 2x-30x the energy efficiency of DSPs – quite impressive!

But wait – how are they so efficient?

FPGAs are no longer FPGAs

Aren't FPGAs Field-Programmable Gate Arrays?

Programmable gate arrays can't multiply as efficiently as dedicated multipliers, can they? A dedicated multiplier is a bunch of gates connected with wires – the specific gates that you need for multiplying, connected specifically to the right other gates as required for multiplication.

A programmable gate array is when your gates are generic. They index into a truth table (called a look-up table or LUT) with their inputs, and fetch the answer. With a 2-input LUT, you get an OR gate or an AND gate or whatever, depending on the truth table you programmed. With 3-input LUTs, you can have a single gate computing, say, (a&b)|c, but the principle is the same:

This absolutely must be bigger and slower than just an OR gate or an AND gate!

Likewise, wires go through programmable switch boxes, which connect wires as instructed by programmable bits:

There are several switch box topologies determining which wires can be connected to which. But whatever the topology, this must be bigger and slower than wires going directly to the right gates.

All this is indeed true, and a "bare" FPGA having nothing but programmable gates and routers cannot compete with a DSP. However, today's FPGAs come with DSP slices – specialized hardware blocks placed amidst the gates and routers, which do things like multiply-accumulate in "hard", dedicated gates.

So that's how FPGAs compete with DSPs – they have DSP hardware in them! Cheating, isn't it?

Well, yes and no.

It's "cheating" in the sense that FPGAs aren't really FPGAs any more – instead, they're arrays of programmable gates plus all that other stuff. A "true FPGA" would look like this:

Instead, a high-end modern FPGA looks like this:

To be competitive in DSP applications, FPGAs need DSP slices – ALUs doing things like multiply-accumulates.

To be competitive in applications needing a CPU – which is most of them – today's FPGAs have more than just specialized ALUs. They have full-blown ARM cores implemented using "hard", non-programmable gates!

So you've been "cheated" if you thought of FPGAs as "clean slates" suitable for any design. In reality, FPGAs have specialized hardware to make them competitive in specific areas.

And you can sometimes guess where they're less competitive by observing which specializations they lack. For instance, there are no "GPU slices", and indeed I don't believe FPGAs can compete with GPUs in their own domain as they compete with DSPs. (Why not simply add GPU slices then? Here the plot thickens, as we'll see in the follow-up article.)

But of course having DSP slices is more than just "cheating" – because look at just how many DSP slices FPGAs have. The cheapest FPGAs can do hundreds of mutliply-accumulates simultaneously! (My drawing above has the wrong scale – imagine hundreds of small DSP slices near a couple of much larger CPUs.)

And hundreds of MACs is a big deal, because while anyone can cram a load of multipliers into a chip, the hard part is to connect it all together, letting a meaningful program actually use these multipliers in parallel.

For instance, TI's C64 DSPs can do 8 MACs per cycle – but only if it's a dot product. TI's C66 DSPs can do 32 MACs/cycle – but only if you're multiplying complex numbers. You only get the highest throughput for very specific data flows.

To the extent that the FPGA architecture lets you actually use an order of magnitude more resources at a time, and do that in more real-life examples, it is a rather unique achievement. And this is how they actually beat dedicated DSPs with their DSP slices, not just reach the same performance.

FPGA as a programmable accelerator architecture

So what makes FPGAs such an efficient architecture? There's no simple answer, but here are some things that FPGAs can use to their advantage:

  • No need for full-blown ALUs for simple operations: a 2-bit adder doesn't need to be mapped to a large, "hard" DSP slice – it can fit comfortably in a small piece of "soft" logic. With most processors, you'd "burn" a full-blown ALU to do the simplest thing.
  • No need for a full cycle for simple operations: on FPGAs, you don't have to sacrifice a full cycle to do a simple operation, like an OR, which has a delay much shorter than a full cycle. Instead, you can feed OR's output immediately to the next operation, say, AND, without going through registers. You can chain quite a few of these, as long as their delays add up to less than a cycle. With most processors, you'd end up "burning" a full cycle on each of these operations.
  • Distributed operand routing: most processors have their ALUs communicate through register files. With all the ALUs connected to all the registers, there's a bottleneck – this interconnect grows as the product of the number of ALUs and registers, so you can't have too many of either. FPGAs spread ALUs and registers throughout the chip, and you can connect them in ways not creating such bottlenecks – say, as a long chain, as a tree, and in many other ways. Of course you can also route everything through a bottleneck, and then your design will run at a low frequency – but you don't have to. With CPUs or DSPs, they run at a high frequency – because the amount of ALUs and registers was limited to make that frequency possible. But in FPGAs you can get both high frequencies and a lot of resources used in parallel.
  • Distributed command dispatching: a 2-issue or a 6-issue processor is common, but 100-issue processors are virtually unheard of. Partly it's because of the above-mentioned operand routing, and partly it's because of command dispatching – you'd have to fetch all those commands from memory, another bottleneck. In FPGAs, you can implement command-generating logic in simple state machines residing near your ALUs – and in the simplest case, commands are constants kept in registers residing near ALUs. This lets you easily issue 100 parallel instructions.

This "distributed" business is easier to appreciate by looking at an example. Here's a schematic implementation of a 1D convolution on an FPGA – you convolve a long vector v with an N-coefficient filter f, computing, at every i, f0*v[i] + f1*v[i-1] + f2*v[i-2] + … + fN-1*v[i-N-1]:

In this drawing, N=8, but it scales easily to arbitrary N, producing results at a slightly larger latency – the summation tree depth being log(N).

The orange boxes are registers; commands like + and * are stored in registers, as are inputs and outputs. (I'm feeding the output of * to + directly without going through a register to save screen space.) Every clock cycle, inputs are fed to ALUs, and the outputs become the new register values.

Orange boxes (registers) spread amongst green boxes (ALUs) illustrate "distributed operand and command routing". If you wonder how it all looks like in code, Verilog source code corresponding to this drawing appears near the end of the article.

And here's a linear pipeline without a summation tree:

This is a little trickier, at least to me (I had a bug in my first drawing, hopefully it's fixed). The idea is, every pair of ALUs computes a product of fk with v[i-k], adds it to the partial sum accumulated thus far, and sends the updated partial sum downstream to the next pair of ALUs.

The trick is this. The elements of v are also moving downstream, together with the sums. But after v[i] got multiplied by f0, you don't want to multiply it by f1 in the next cycle. Instead, you want to multiply v[i-1] by f1 – that's the product that we need for the convolution at index i. And then you do want to multiply v[i] by f1 once cycle later – for the convolution at index i+1. I hope that my sampling of v[i] to an intermediate register, which delays its downstream motion, does the trick.

So these two examples show how FPGA programming is different from programming most kinds of processors – and how it can be more efficient. More efficient, because you can use a lot of ALUs simultaneously with little overhead spent on dispatching commands and moving inputs and outputs between ALUs. An argument can be made that:

  • FPGAs are more flexible than SIMD/SIMT. You can give different instructions to different ALUs, and you can route operands from different places. Contrast this with SIMD instructions like add_16_bytes, with byte i always coming from offset i inside a wide register.
  • FPGAs scale better than VLIW/superscalar. More instructions can be issued simultaneously, because there's no routing bottleneck near the register file, and no instruction memory bandwidth bottleneck.
  • FPGAs are more efficient than multiple cores. Multiple cores are flexible and can scale well. But you pay much more overhead per ALU. Each core would come with its own register files and memories, and then there are communication overheads.

This gives us a new perspective on LUTs and switch boxes. Yes, they can be an inefficient, cheaper-to-manufacture alternative to dedicated gates and wires. But they are also a mechanism for utilizing the "hard" components spread in between them – sometimes better than any other mechanism.

And this is how FPGAs beating DSPs with the help of DSP slices isn't "cheating". (In fact, mature DSPs "cheat" much more by having ugly, specialized instructions. Far more specialized than FPGAs' multiply-accumulate, dot product instructions being among the least ugly. And the reason they need such instructions is they don't have the flexibility of FPGAs, so what FPGAs effectively do in software, they must do in hardware in order to optimize very specific data flows.)

I/O applications

But wait – there's more! In addition to being a hardware prototyping platform and an accelerator architecture, FPGAs are also uniquely suited for software-defined I/O.

"Software-defined I/O" is the opposite of "hardware-defined I/O" – the common state of things, where you have, for instance, an Ethernet controller implementing some share of TCP or UDP in hardware. Software-defined I/O is when you have some programmable hardware instead of dedicated hardware, and you implement the protocols in software.

What makes FPGAs good at software-defined I/O?

  • Timing control: Verilog and other hardware description languages give you more precise control over timing than perhaps any other language. If you program it to take 4 cycles, it takes 4 cycles – no cache misses or interrupts or whatever will get in your way unexpectedly. And you can do a whole lot in these 4 cycles – FPGAs are good at issuing plenty of instructions in parallel as we've seen. This means you don't have to account for runtime variability by buffering incoming data, etc. – you know that every 4 cycles, you get a new byte/pixel/etc., and in 4 cycles, you're done with it. This is particularly valuable where "deep" buffering is unacceptable because the latency it introduces is intolerable – say, in a DRAM controller. You can also do things like generating a clock signal at a desired frequency, or deal with incoming clock signal at a different frequency than yours.
  • Fine-grained resource allocation: you "burn" a share of FPGA resources to handle some peripheral device – and that's what you've spent. With other processor cores, you'll burn an entire core – "this DSP handles WiFi" – even if the core is idle much of the time. (The FPGA resources are also burnt that way – but you'll often spend less resources than a full processor core takes.) Alternatively, you can time-share that DSP core – but it's often gnarly. Many kinds of cores expose a lot of resources that must be manually context-switched at an intolerably high latency. Core asymmetry gets in the way of thread migration. And with two I/O tasks, often none can tolerate being suspended for a long while, so you'll definitely burn two cores. (One solution is hardware multithreading.)

The upshot is that relatively few processors other than FPGAs are suitable for software-defined I/O. The heavily multi-threaded XMOS is claimed to be one exception. Then there are communication processors such as the hardware-threaded Qualcomm Hexagon DSP and the CEVA-XC DSPs. But these are fairly specialized; you couldn't use them to implement a memory controller or an LVDS-to-parallel video bridge, both of which you could do with an FPGA.

And of course, FPGA's I/O capabilities can be combined with computation acceleration – get pixels and enhance the image color on the fly, get IP packets with stock info and decide which stocks to trade on the fly.

Programmable, efficient, and versatile, FPGAs are starting to sound like a great delivery platform.

Summary

There are several points that I tried to make. Some are well-known truisms, and others are my own way of looking at things, which others might find debatable or at least unusually put.

  • While FPGA are a great small-scale circuit delivery platform, they can also be a large-scale software delivery platform. You can think of FPGAs as "inefficiently simulating circuits". But in other contexts, you can also think of them as "efficiently executing programs"!
  • Instead of fixed-function gates and wires connecting specific gates to each other, FPGAs use programmable gates – configured by setting a truth table of choice – and programmable switch boxes, where incoming wires are connected to some of the other wires based on configuration bits. By itself, it's very inefficient compared to a "direct" implementation of a circuit.
  • Then how can FPGAs beat, not just CPUs, but specialized accelerators like DSPs in their own game? The trick is, they're no longer FPGAs – gate arrays. Instead, they're also arrays of RAMs and DSP slices. And then they have full-blown CPUs, Ethernet controllers, etc. implemented in fixed-function hardware, just like any other chip.
  • In such modern FPGAs, the sea of LUTs and switch boxes can be used not instead of fixed-function circuits, but as a force multiplier letting you make full use of your fixed-function circuits. LUTs and switch boxes give two things no other processor architecture has. First, the ability to use less than a full-blown ALU for simple things – and less than a full clock cycle. Second, distributed routing of commands and operands – arguably more flexible than SIMD, more scalable than superscalar execution, and more efficient than multiple instruction streams.
  • FPGAs are the ultimate platform for software-defined I/O because of their timing control (if I said 4 cycles, it takes 4 cycles) and fine-grained resource allocation (spend so many registers and ALUs per asynchronous task instead of dedicating a full core or having to time-share it).

With all these advantages, why just 1% of the global semiconductor sales? One reasonable answer is that it took FPGAs a long time to evolve into their current state. Things FPGAs have today that they didn't have in the past include:

  • Fixed-function hardware essential for performance – this gradually progressed from RAM to DSP slices to complete CPUs.
  • Quick runtime reconfiguration, so that you can run convolution and then replace it with FFT – which you can't, and shouldn't be able to do, if you're thinking of FPGA as simulating one circuit.
  • Practically useable C-to-Verilog compilers, letting programmers, at least reasonably hardcore ones, who nonetheless aren't circuit designers, to approach FPGA programming easily enough.

All of these things cater to programmers as much or more than they cater to circuit designers. This shows that FPGAs are gunning for a position in the large-scale software delivery market, outside their traditional small-scale circuit implementation niche. (Marketing material by FPGA vendors confirms their intentions more directly.)

So from this angle, FPGAs evolved from a circuit implementation platform into a software delivery platform. Being a strong programmable architecture, they're expected to rise greatly in popularity, and, like other programmable architectures, end up everywhere.

Unanswered questions

As a teaser for the sequel, I'll conclude with some questions which our discussion left unanswered.

Why do FPGAs have DSP slices and full-blown "hard" CPUs? Why not the other way around – full-blown DSP cores, and some sort of smaller "CPU slices"? Where are the GPU slices? And if rationing individual gates, flip-flops and picoseconds instead of full ALUs, registers and clock cycles is so great, why doesn't everyone else do it? Why do they all break up resources into those larger chunks and only give software control over that?

Stay tuned for the sequel – "How FPGAs work, and why you won't buy one".

P.S. Programmable – how?

So how do you program the programmable gate array? Talk is cheap, and so are Microsoft Paint drawings. Show me the code!

The native programming interface is a hardware description language like Verilog. Here's an implementation of the tree-like convolution pipeline in Verilog – first the drawing and then the code:

module conv8(clk, in_v, out_conv);
  //inputs & outputs:
  input clk; //clock
  input [7:0] in_v; //1 8-bit vector element
  output reg [18:0] out_conv; //1 19-bit result

  //internal state:
  reg [7:0] f[0:7]; //8 8-bit coefficients
  reg [7:0] v[0:7]; //8 8-bit vector elements
  reg [15:0] prod[0:7]; //8 16-bit products
  reg [16:0] sum0[0:3]; //4 17-bit level 0 sums
  reg [17:0] sum1[0:1]; //2 18-bit level 1 sums

  integer i; //index for loops unrolled at compile time

  always @(posedge clk) begin //when clk goes from 0 to 1
    v[0] <= in_v;
    for(i=1; i<8; i=i+1)
      v[i] <= v[i-1];
    for(i=0; i<8; i=i+1)
      prod[i] <= f[i] * v[i];
    for(i=0; i<4; i=i+1)
      sum0[i] <= prod[i*2] + prod[i*2+1];
    for(i=0; i<2; i=i+1)
      sum1[i] <= sum0[i*2] + sum0[i*2+1];
    out_conv <= sum1[0] + sum1[1];
  end
endmodule

This example shows how "distributed routing" actually looks in code – and the fine-grained control over resources, defining things like 17-bit registers.

And it's fairly readable, isn't it? Definitely prettier than a SIMD program spelled with intrinsics – and more portable (you can target FPGAs by different vendors as well as an ASIC implementation using the same source code; it's not trivial, but not hopeless unlike with SIMD intrinsics, and probably not harder than writing actually portable OpenCL kernels.)

Incidentally, Verilog is perhaps the quintessential object-oriented language – everything is an object, as in a physical object: a register, a wire, a gate, or a collection of simpler objects. A module is like a class, except you can't create objects (called instantiations) dynamically – all objects are known at compile time and mapped to physical resources.

Verilog insists on encapsulation as strictly as it possibly could: there's simply no way to set an object's internal state. Because how could you affect that state, physically, without a wire going in? Actually, there is a way to do that – the usual instance.member syntax; hardware hackers call this "an antenna", because it's "wireless" communication with the object's innards. But it doesn't synthesize – that is, you can do it in a simulation, but not in an actual circuit.

Which means that our example module is busted, because we can't initialize the filter coefficients, f. In simulations, we can use antennas. But on an FPGA, we'd need to add, say, an init_f input, and then when it's set to 1, we could read the coefficients from the same port we normally use to read v's elements. (BTW, not that it adds much efficiency here, but the "if" test below is an example of an operation taking less than a cycle.)

always @(posedge clk) begin
  if(init_f) begin
    f[0] <= in_v;
    for(i=1; i<8; i=i+1)
      f[i] <= f[i-1];
  end
end

A triumph of encapsulation, it's also a bit of a pity, because there are now actual wires and some control logic sitting near our coefficient registers, enlarging the circuit, only to be used upon initialization. We're used to class constructors "burning" a few memory bits; who cares – the bits are quickly swapped out from the instruction cache, so you haven't wasted resources of your computational core. But Verilog module initialization "burns" LUTs and wires, and it's not nearly as easy to reuse them for something else. We'll elaborate on this point in the upcoming sequel.

Not only is Verilog object-oriented, but it's also the quintessential language for event-driven programming: things are either entirely static (these here bits go into this OR gate), or triggered by events (changes of signals, very commonly a clock signal which oscillates between 0 and 1 at some frequency). "always @(event-list)" is how you say what events should cause your statements to execute.

Finally, Verilog is a parallel language. The "static" processes, like bits going into OR gates, as well as "event-driven processes", like statements executing when the clock goes from 0 to 1, all happen in parallel. Within a list of statements, "A <= B; C <= A;" are non-blocking assignments. They happen in parallel, so that A is assigned the value of B, and C is simultaneously assigned the (old) value of A.

So, for example, prod[i]<=f[i]*v[i] sets the new value of prod, and in parallel, sums are computed from the old values of prod, making it a pipeline and not a serial computation. (Alternatively, we could use blocking assignments, "=" instead of "<=", to do it all serially. But then it would take more time to execute our series of statements, lowering our frequency, as clk couldn't switch from 0 to 1 again until the whole serial thing completes. Synthesis tools tell you the maximal frequency of your design when they're done compiling it.)

On top of its object-oriented, event-based, parallel core, Verilog delivers a ton of sweet, sweet syntactic sugar. You can write + and * instead of having to instantiate modules with "adder myadd(a,b)" or "multiplier mymul(a,b)" – though + and * are ultimately compiled down to module instances (on FPGAs, these are often DSP slice instances). You can use if statements and array indexing operators instead of instantiating multiplexors. And you can write loops to be unrolled by the compiler, generate instantiations using loop syntax, parameterize your designs so that constants can be configured by whoever instantiates them, etc. etc.

If all this doesn't excite you and you'd rather program in C, you can, sort of. There's been loads of "high-level synthesis tools" – basically C to Verilog compilers – and their quality increased over the years.

You'd be using a weird C dialect – no function pointers or recursion, extensions to specify the exact number of bits in your integers, etc. You'd have to use various #pragmas to guide the compilation process. And you'd have things like array[index++] not actually working with a memory array – and index++ not actually doing anything – because you're getting values, not from memory, but from a FIFO, or directly from the output of another module (just like in_v in our Verilog code doesn't have to come from memory, and out_conv doesn't have to go to memory.)

But you can use C, sort of – or Verilog, for real. Either way, you can write fairly readable FPGA programs.

The bright side of dark silicon

It's been a decade or so since the end of frequency scaling, and multicore has become ubiquitous, there being no other means to increase a chip's performance.

Some multicore systems are symmetric – all cores are identical, so you can easily move work from one core to another. Others are asymmetric – as in CPU cores and GPU cores, where it's harder to move work between different types of cores.

Which is better – symmetric or asymmetric multicore?

Why symmetric is better

Three main reasons that I see:

  • Better load balancing
  • Less work for everyone
  • More redundancy

Better load balancing

Asymmetric multicore makes load balancing harder, because a GPU can't easily yank a job from a queue shared with a CPU and run that job. That's because some of those jobs are simply impossible to run on a GPU. Others run so badly that it's not worth the trouble.

And those CPU codes that could run OK on GPUs would have to be compiled twice – for the CPU and the GPU – and even then you can't make things like function pointers and vtables work (though I can imagine a hardware workaround for the latter – a translation table of sorts; maybe I should patent it. Anyway, we're very far from that being our biggest problem.)

And then you need a shared queue between the CPU and the GPU – how does that work? – or you partition the work statically (each of the 4 CPUs processes 10% of the pixels, the remaining 60% of the pixels go to the GPU cores).

But static partitioning, often quite lousy even with symmetric multicore, is awful with asymmetric multicore because how do you choose the percentages? You need to know the relative strength of the cores at each task. How do you do that – dynamically figure out the first time your program runs on a new device?

So this is all close to insane. What people actually do instead is task parallelism – they look at their different jobs, and they figure out which should run on each type of core, and optimize each task for the respective core.

But task parallelism never load-balances very well. Let's say you look for faces in an image on the GPU and then try to figure out whose faces these are on the CPUs. Then sometimes the GPU finds a lot of faces and sometimes just a few, taking roughly the same time to do so. But the CPU then has either a lot of work or just a little. So one of them will tend to be the bottleneck.

Less work for everyone

We actually touched on that above. If you wanted to do data parallelism, running the same task on all your cores but on different subsets of the data, one problem would be to optimize your code for each type of core. That's more work. Someone at the OS/system level would also need to help you with sharing task queues and vtables – still more work.

Generally, more types of core means more hardware design, more compilers, assemblers, linkers and debuggers, more manuals, and more integration work from bus protocols to program loaders, etc. etc. And, for programmers, not only more optimization work but more portability problems.

More redundancy

That's a bit futuristic, but I actually heard this argument from respectable people. The idea is, chip manufacturing yields will significantly drop at, say, 8nm processes. And then your chance to get a chip without a microscopic defect somewhere will become so low that throwing away every defective chip will be uneconomical.

Well, with symmetric multicore you don't have to throw away the chip. If the testing equipment identifies the core that is no longer useable and marks the chip accordingly using fuses or some such (which is easy to do), an OS can then run jobs on all cores but the bad one.

Nifty, isn't it?

With asymmetric multicore, you can't do that, because some type of work will have no core on which it can run.

Why asymmetric is inevitable

In two words – dark silicon.

"Dark silicon" is a buzzword used to describe the growing gap between how many transistors you can cram into a chip with each advancement in lithography vs how many transistors you can actually use simultaneously given your power budget – the gap between area gains and power gains.

It's been a couple of years since the "dark silicon" paper which predicted "the end of multicore scaling" – a sad follow-up to the end of frequency scaling.

The idea is, you can have 2x more cores with each lithography shrink, but your energy efficiency grows only by a square root of 2. So 4 shrinks mean 16x more cores – but within a fixed power budget, you can only actually use 4. So progress slows down, so to speak. These numbers aren't very precise – you have to know your specific process to make a budget for your chip – but they're actually not bad as a tool to think about this.

With 16x more area but just 4x more power, can anything be done to avoid having that other 4x untapped?

It appears that the only route is specialization – spend a large fraction of the area on specialized cores which are much faster at some useful tasks than the other cores you have.

Can you then use them all in parallel? No – symmetric or asymmetric, keeping all cores busy is outside your power budget.

But, if much of the runtime is spent running code on specialized cores doing the job N times faster than the next best core, then you'll have regained much of your 4x – or even gained more than 4x.

Gaining more than 4x has always been possible with specialized cores, of course; dark silicon is just a compelling reason to do it, because it robs you of the much easier alternative.

What about load balancing? Oh, aren't we "lucky"! It's OK that things don't load-balance very well on these asymmetric systems – because if they did, all cores would be busy all the time. And we can't afford that – we must keep some of the silicon "dark" (not working) anyway!

And what about redundancy? I dunno – if the yield problem materializes, the increasingly asymmetric designs of today are in trouble. Or are they? If you have 4 CPUs and 4 GPU clusters, you lose 25% of the performance, worse than if you had 12 CPUs; but the asymmetric system outperforms the symmetric one by more than 25%, or so we hope.

So the bright side of dark silicon is that it forces us to develop new core architectures – because to fully reap the benefits of lithography shrinks, we can't just cram more of the same cores into a same-sized chip. Which, BTW, has been getting boring, boring, boring for a long time. CPU architecture has stabilized to a rather great extent; accelerator architecture, not nearly so.

GPUs are the tip of the iceberg, really – the most widely known and easily accessible accelerator, but there are loads of them coming in endless shapes and colors. And as time goes by and as long as transistors keep shrinking but their power efficiency lags behind, we'll need more and more kinds of accelerators.

(I have a lot of fun working on accelerator architecture, in part due to the above-mentioned factors, and I can't help wondering why it appears to be a rather marginal part of "computer architecture" which largely focuses on CPUs; I think it has to do with CPUs being a much better topic for quantitative research, but that's a subject for a separate discussion.)

And this is why the CPU will likely occupy an increasingly small share of the chip area, continuing the trend that you can see in chip photos from ChipWorks et al.

P.S.

I work on switching-limited chip designs: most of the energy is spent on switching transistors. So you don't have to power down the cores between tasks – you can keep them in an idle state and they'll consume almost no energy, because there's no switching – zeros stay zeros, and ones stay ones.

Chips which run at higher frequencies and which are not designed to operate at high temperatures (where high leakage would become intolerably high – leakage grows non-linearly with temperature) are often leakage-limited. This means that you must actually power down a core or else it keeps using much of the energy it uses when doing work.

Sometimes powering down is natural, as in standby mode. Powering down midway through realtime processing is harder though, because it takes time to power things down and then to power them back up and reinitialize their pesky little bits such as cache line tags, etc.

So in a leakage-limited design, asymmetric multicore is at some point no better than symmetric multicore – if the gaps between your tasks are sufficiently short, you can't power down anything, and then your silicon is never dark, so either you make smaller chips or programs burn them.

But powering up and down isn't that slow, so a lot of workloads should be far from this sad point.

P.P.S.

I know about GreenDroid, a project by people who make the "dark silicon leads to specialization" argument quite eloquently; I don't think their specialization is the right kind – I think cores should be programmable – but that again is a subject for a separate discussion.

P.P.P.S.

Of course there's one thing you can always do with extra area which is conceptually much easier than adding new types of cores – namely, add more memory, typically L2/L3 cache. Memory is a perfect fit for the dark silicon age, because it essentially is dark silicon – its switching energy consumption is roughly proportionate to the number of bytes you access per cycle but is largely independent of the number of bytes you keep in there. And as to leakage, it's easier to minimize for memories than most other kinds of things.

Another "lucky" coincidence is that you really need caches these days because external DRAM response latency has been 100 ns for a long time while processor clocks tend to 50-200x shorter, so missing all the caches really hurts.

So it's natural to expect memories to grow first and then the accelerator zoo; again consistently with recent chip photos where, say, ARM's caches are considerably bigger the ARM cores themselves.

(Itanium famously spent 85% percent of the chip area or so on caches, but that was more of "cheating" – a way to show off performance relative to x86 when in fact the advantage wasn't there – than anything else; at least that's how Bob Colwell quoted his conversation with Andy Grove. These days however it has become one of the few ways to actually use the extra area.)

Parallelism and concurrency need different tools

concurrent (noun): Archaic. a rival or competitor.

dictionary.com

Two lines that do not intersect are called parallel lines.

Wikipedia

In this piece, I disagree with Joe Armstrong and Rob Pike, basing my argument on the differences between vending machines and gift boxes (illustrated with drawings carefully prepared in Microsoft Paint).

Parallelism and concurrency are both very fashionable notions. Lots of languages and tools are advertised as good at these things – often at both things.

I believe that concurrency and parallelism call for very different tools, and each tool can be really good at either one or the other. To oversimplify:

It's possible for both kinds of tools to co-exist in a single language or system. For example, Haskell appears to have good tools for both concurrency and parallelism. But it's still two different sets of tools, and the Haskell wiki explains that you shouldn't use the concurrency tools if you're after parallelism:

Rule of thumb: use Pure Parallelism if you can, Concurrency otherwise.

The people behind Haskell realize that one tool for both isn't good enough. It's a similar story with the new language ParaSail – it offers both kinds of tools, and advises to avoid its concurrency features in parallel, non-concurrent programs.

This is in stark contrast to the opinion of some people primarily interested in concurrency, who think that their tools are a great fit for parallelism. Rob Pike says Go makes parallelism easy because it's a concurrent language:

Concurrency makes parallelism (and scaling and everything else) easy.

Likewise, Joe Armstrong claims that Erlang is great for parallelism because it's a concurrent language. He goes as far as to say that seeking to parallelize "legacy" code written in non-concurrent languages is "solving the wrong problem".

What causes this difference in opinions? Why do Haskell and ParaSail people think you need separate tools for concurrency and parallelism, while Go and Erlang people think that concurrent languages handle parallelism just fine?

I think people reach these different conclusions because they work on different problems, and develop a different perception of the basic distinction between concurrency and parallelism. Joe Armstrong has drawn a picture to explain how he sees that distinction. I'll draw you a different picture – but first, here's a reproduction of his:

Concurrency-centric view of concurrency vs parallelism

Actually many aspects of concurrency are present with just one queue, but I reproduced the 2 queues from Armstrong's original drawing. So what does this picture say?

  • "Parallel" means that you serve Coke faster.
  • "Parallel" doesn't mean much else – either way, it's the same concurrent problem.
  • Who gets Coke first depends on who gets in line first.
  • In a way it doesn't matter who gets Coke first – they all get a can of Coke one way or another.
  • …except perhaps for the few last ones, if the machine runs out of Coke – but that's life, man. Somebody has to be the last.

Indeed, that's all there is to it with a vending machine. What about giving gifts to a bunch of kids? Is there a difference between coke cans and presents?

In fact there is. The vending machine is an event handling system: people come at unpredictable moments, and find an unpredictable amount of cans in the machine. The gift boxes is a computational system: you know the kids, you know what you bought, and you decide who gets what and how.

In the gift boxes scenario, "concurrent vs parallel" looks very different:

Parallelism-centric view of concurrency vs parallelism

How is "concurrent" different from "parallel" in this example?

  • Concurrent present-dealing is a lot like a vending machine: who gets what depends on who got there first.
  • Parallel present-dealing is not like that: each present is labeled, so you know who gets what.
  • In the concurrent case, a queue is necessary – that's how you avoid two kids fighting over a present/two tasks corrupting a shared data structure.
  • In the parallel case, a queue is not necessary. No matter who gets where first, they all end up with the right present.

Queueing in front of a gift heap is concurrent in that archaic sense of "rivalry, competition": you want to be there first, so that it's you who gets the Lego set. In Russian, "concurrent" (pronounced kon-koo-rent) is the contemporary word for "competitor".

With labeled gifts, there's no competition. Logically, labels "connect" each kid with his gift, and these are parallel, non-intersecting, non-conflicting lines. (Why did I draw the arrows so that they do intersect? We'll get back to it later; a good way to think of it is, kids'/processors' paths do cross as they search for their gifts/data – but those intersections are not conflicts over who gets what.)

Computation vs event handling

With event handling systems such as vending machines, telephony, web servers and banks, concurrency is inherent to the problem – you must resolve inevitable conflicts between unpredictable requests. Parallelism is a part of the solution - it speeds things up, but the root of the problem is concurrency.

With computational systems such as gift boxes, graphics, computer vision and scientific computing, concurrency is not a part of the problem – you compute an output from inputs known in advance, without any external events. Parallelism is where the problems start – it speeds things up, but it can introduce bugs.

Let's proceed to discuss the differences between computational systems and event handling systems. We'll start with obvious things, like determinism, but there are many more subtle consequences.

Determinism: desirable vs impossible

In computational systems, determinism is desirable because it makes life easier in many ways. For example, it's nice to test optimizations and refactorings by observing that results didn't change – which you can only do with deterministic programs.

Determinism is often not strictly required – you may genuinely not care which 100 images you find as long as they all have kittens in them, or exactly what approximation of PI you compute as long as it's somewhere between 3 and 4. Determinism is merely very nice – and possible.

In event handling systems, however, determinism is impossible, in the sense that it is a requirement for different orders of events to produce different results. If I got there first, I should get the last Coke can, the last dollar in the shared bank account, etc. If you beat me to it by a millisecond, then it should go to you.

Of course, for basic debugging, you can always run the system completely serially, handling one event a time. But once you make it a bit more realistic and start handling events without first finishing everything that you were doing, you can no longer expect a specific result.

Even in a debugging environment, if you replay an event log with two users racing to withdraw money from a shared bank account, you'll run the thing twice and it might reach two different final states. The number of possible final states is exponential in the number of conflicts.

Ouch.

Sign of parallel safety: determinism vs correctness

How do you know that you have no bugs due to the different ordering of events?

In computational systems, if you get the same result every time, you probably have no parallelism bugs – even if the result is busted. Dog pictures instead of cats means you have bugs – but not parallelism bugs if it's the same dogs every time.

In event handling systems, the only sure sign of not having parallelism bugs is if you always get correct results.

With two users racing to withdraw money, you can't expect to always reach the same result. What can you expect, assuming the bank isn't buggy? Many things. You (presumably) can never have a negative account balance, you (presumably) can't drain a bank account twice, in effect creating money, etc.

If none of these wrong things ever happen, then you got yourself a functioning bank, otherwise you have a bug. With a telephony system, it's similar – except that there's a different set of requirements defining "correct results".

There's no general way to tell if there are timing-related bugs without understanding the aspects of the system having nothing to do with timing.

Ouch!

Parallelism bugs: easy to pinpoint vs impossible to define

With labeled gift boxes, parallelism bugs are trivial to spot – even if the labels are in Chinese, and you can't read Chinese:

Even if you can't read the labels, knowing that they exist is enough. If two kids fight over a box, then something is wrong and you call an adult who can read the labels.

If you're an automated debugging tool, and you don't know which task is supposed to process which data, it's enough to know that a task shouldn't access data modified by another task. If that happens, you tell the programmer, so that he properly assigns data to tasks without conflicts.

Such tools aren't hypothetical – they're out there. Cilk comes with a tool that does that. Checkedthreads has a Valgrind-based tool that does that.

Parallel Haskell doesn't have to do it – because you have no side effects to begin with, guaranteeing zero conflicts when things are evaluated in parallel. But even with dynamic checking instead of static guarantees, your parallelism bugs are basically just gone.

The upshot is that in computational systems, you don't need to know much to flag bugs and to point to the exact problem they cause. "Here's the box they were fighting over".

In event handling systems, a responsible adult must be much more knowledgeable to maintain discipline.

To be sure, there still is a weaker, but universal rule which always applies: you can't touch anything if someone is already busy with it. You must wait your turn in a queue. An adult can make sure this rule is followed, which is better than nothing. But it's not enough to prevent many common kinds of mischief:

Someone approaching the presents without queuing is clearly wrong – bug detected, order restored.

But someone unpacking a present, leaving the queue, and coming back to find that the present was taken by someone else could be wrong – or it could be OK. After all, that other kid waited his turn, so the only universal rule of event handling systems was followed – but not necessarily the specific rules of how we give out presents around here.

Here's how these problems look in code. The following buggy money transfer implementation can be flagged as buggy by an automated debugging tool:

src.balance -= amount
dst.balance += amount

Here we have no synchronization at all – src.balance can be modified by two processes without one of them waiting for the other to be done with it. So you could lose some of the decrements or what-not. A data race detector like Helgrind will spot this bug by monitoring memory accesses and observing the lack of synchronization – just as easily as Cilk's or checkedthreads' checker.

However, the version below is probably still buggy, but it would look fine to data race detectors:

atomic { src.balance -= amount }
atomic { dst.balance += amount }

Here, "atomic" means that everyone has to wait before modifying the balances in some sort of queue – possibly a very fancy sort of queue, but a queue nonetheless; more on that later. Orderly queuing makes data race detectors happy – but is there still a bug?

A thread could be suspended after decrementing src.balance but before incrementing dst.balance, exposing an intermediate state where money is "temporarily lost". Is this a problem? Perhaps. Do you understand banking? I don't, not really – and Helgrind surely doesn't.

Here's a more obviously wrong program:

if src.balance - amount > 0:
  atomic { src.balance -= amount }
  atomic { dst.balance += amount }

Here, a thread could test that src.balance has enough money to withdraw a given amount, and go to pee. Another thread arrives, makes a similar test and gets busy withdrawing the money. The first thread comes back, is put into some queue, waits his turn and withdraws its own amount – which could have become illegitimate, the first thread having withdrawn too much already.

This is a lot like coming back and realizing that another kid has walked away with your Apple iPhone®. It's a race condition which isn't a data race – that is, it happens despite everyone's civil and polite queuing.

What's a race condition? It depends on the application. I'm sure the last snippet is buggy, but I'm less sure about the previous snippet, and it all depends on how the bank works. And if we can't even define "race conditions" without knowing what the program is trying to do, we can't hope to automatically detect them.

Of course you can also not have race conditions. All you have to do is implement the entire withdrawal atomically, and of course all approaches to concurrency let you do it.

The trouble with race conditions though, as opposed to, say, null pointer exceptions, is that you can feed the program bug-triggering inputs again and again, and the thing will only reproduce once in a blue moon. Here's where deterministic, automated debugging could be really handy, flagging the bug every time you feed those inputs. Unfortunately, it's impossible with event handling systems.

"Ouch" again.

Queues: implementation detail vs part of the interface

With labeled gifts, you don't need a queue. Everyone can just go and take their present.

Or maybe not, not really. Thousands of kids getting thousands of labeled presents will need a queue or several, or else they'd be running into each other. How those queues work doesn't matter in the sense that everyone ends up with the same present anyway. The choice of a queuing method does affect how much time you wait for your present though.

That's why I drew the "logically parallel" lines connecting everyone to their labeled gifts so that they intersect – even though these aren't "actual conflicts" over who gets what. (I'm very careful with my metaphors and my Microsoft Paint art – very careful.)

4 processors accessing unrelated data items through a single memory bus is when "logically parallel lines technically cross", and in effect processors will have to wait in some hardware-level queue to make it work. 1000 logically independent tasks distributed to these 4 processors through a load balancing scheduler will again need queues to work.

There are almost always many queues even in a fully parallel, conflict-free system – but they're an implementation detail and they shouldn't affect results. You get the same thing no matter what your place was in any of the queues:

Conversely, in a concurrent system, the queue is right there in the interface:

  • A semaphore has a queue of threads waiting to lock it, and who gets there first will affect results.
  • An Erlang process has a queue of messages, and who sends a message first will affect results.
  • A goroutine listens to a channel, and the order of writes to a channel affects results.
  • With transactional memory, everyone failing to commit a transaction is queuing.
  • With lock-free containers, everyone failing to update the container is queuing.

With event systems, you can have simple queues or fancy queues, but they are all explicit queues in the sense that order often affects results, and that's OK – or it could be a race condition, go figure.

With computational, parallel, conflict-free systems, order shouldn't affect results – and you want tools to make sure the system is indeed conflict-free.

Rob Pike shows in his slides how easy it is to build a load balancer in Go. Sure it's easy. Go is for concurrency and concurrency is all about queues, and so is load balancing. Not that load balancers are very hard in any language, but yeah it's particularly easy in concurrent languages.

But that's one part of the parallelism story; the next thing you want is either static guarantees of not having conflicts, or dynamic checking. I'm not saying it can't be done in Go – sure it can be done – just that without having it, you underserve computational systems. And once you do have it, it's a different set of interfaces and tools from channels and goroutines – even if underneath it's implemented with channels and goroutines.

Importance of preemption

So conflict prevention/detection is something computational systems need that concurrency tools don't provide. Are there things that concurrency tools do provide that computational systems don't need?

Sure – explicit queues, for starters. Not only aren't they needed, but explicit queues get in the way of computational systems, because as we've seen, queues lead to race conditions that aren't data races and you can't pinpoint those.

Another thing computational systems don't really need is very cheap preemptable threads/processes.

With event handling systems, you sometimes want to react to many different events very quickly, which takes many preemptable processes. You want to have 10000 processes stuck in the middle of something, and process number 10001 can still get activated immediately when an event arrives. This can't work unless preemptable processes are very cheap.

With computational systems, it's nice to have cheap tasks that you can map to the relatively expensive OS threads – but tasks are not as powerful as processes. You don't activate tasks upon events. What tasks do is they wait in various queues, and get processed when a worker thread becomes idle to run them. Unlike the case with goroutines or the like, you can't have more tasks in flight than you have OS threads – and you don't need to.

Tasks can be done nicely with fairly traditional runtimes – as opposed to full-blown cheap processes/goroutines/whatever, which seem to me to need more work at the low-level runtime system side of things.

This shows how platforms aiming at concurrency not only under-serve computational systems, but over-serve them as well.

(To be fair, it is theoretically conceivable to gain something from preemption in a computational system – namely, it could improve throughput if it lets freshly created tasks which are a part of the critical path preempt in-flight tasks which are not. However, in my long and sad experience, a scheduler actually knowing what the critical paths are is little more than a theoretical possibility. And a dumb greedy scheduler has no use for preemption.)

Differences among tools from the same family

Tools aimed at concurrent event systems are not all alike, nor are tools aimed at parallel computational systems. They're from the same family, but there are substantial differences.

  • Erlang will not let processes share memory at all. This means you never have data races, which doesn't particularly impress me because as we've seen, data races are easy to find automatically – and race conditions are not at all eliminated by not sharing memory. The nice thing though is you can seamlessly scale to multiple boxes, not just multiple cores in the same chip.
  • Rust won't let you share memory unless it's immutable. No easy multi-box scaling, but better performance on a single box, and no need for data race detectors, which can have false negatives because of poor test coverage. (Actually it's not quite like that – here's a correction, also explaining that there are plans to add parallelism tools to Rust in addition to its existing concurrency facilities.)
  • Go will let you share everything, which gives the most performance at the price of what I think is a tolerable verification burden. For data races, Go has a data race detector, and race conditions will just happen in event systems anyway.
  • STM Haskell will let you share immutable data freely, and mutable data if you explicitly ask for it. It also provides a transactional memory interface, which I think is a cool thing to have and sometimes quite a pain to emulate with other methods. Haskell has other concurrency tools as well – there are channels, and if you want to have Erlang's scalability to multiple machines, apparently Cloud Haskell does the trick.

Of course the other big difference is that you'd be programming in Erlang, Rust, Go and Haskell, respectively. And now to computational systems:

  • Parallel Haskell will only parallelize pure code. A static guarantee of no parallelism bugs at the cost of having no side effects.
  • ParaSail allows side effects, but disallows many other things, such as pointers, and as a result it will only evaluate things in parallel if they don't share mutable data (for example, you can process two array slices in parallel if the compiler is convinced that they don't overlap). Similarly to Haskell, ParaSail has some concurrency support – namely "concurrent objects" which indeed can be shared and mutable – and documentation stresses the benefits of not using concurrency tools when all you need is parallelism.
  • Flow appears to be based on a pure functional core, which is restricted even further to let the compiler fully comprehend the data flow in the program, and allowing it to target platforms as diverse as Hadoop and CUDA. Things syntactically looking like side effects, parallel reduction, etc. are supposed to be a layer of sugar atop this core. At least that's my impression from reading the Manifesto, which admittedly I didn't fully understand ("if a morphism is surjective and injective, then it is a bijection and therefore it is invertible" is more obvious to some of us than others.)
  • Cilk is C with keywords for parallel loops and function calls. It will let you share mutable data and shoot yourself in the foot, but it comes with tools that deterministically find those bugs, if they could ever happen on your test inputs. What makes uninhibited shared mutable data very useful is when you don't shoot yourself in the feet – when a parallel loop computes stuff with whatever task-local side-effect-based optimizations, and then the loop ends and now everyone can use the stuff. Like children unpacking their Lego sets, each building stuff from them and then all playing together – no side effects is sometimes a lesser Lego. (The Proper Fixation blog – over-extending metaphors since 2008.)
  • checkedthreads is a lot like Cilk; it doesn't rely on language extensions, and all of it is free and open – not just the interface and the runtime but the bug-finding tools as well.

I wrote checkedthreads, so this is the advertisement part; checkedthreads is portable, free, safe and available today in the very mainstream languages C and C++, unlike many systems requiring new languages or language extensions.

With Cilk, there's an effort to standardize it in C++1y or some such, but Cilk wants to add keywords and the C++ people don't like adding keywords. Cilk is available in branches of gcc and LLVM, but it won't run on all platforms at the moment (it extends the ABI) and it hasn't been merged into the mainline for a while. Some newer Cilk features are patented. Not all of it is freely available, etc. etc.

Cilk's big advantage, however, is that it's backed up by Intel, whereas checkedthreads is backed up by yours truly. And if Cilk suits you and you decide to use it as a result of my checkedthreads-related blogging, I will have achieved my goal of automated debugging of parallel programs getting some much-deserved attention.

The upshot is that not all concurrency tools are alike, and different parallelism tools are likewise different – I don't even claim to have pointed out the most important differences between my examples; it's hairy stuff. Still, they're two different families, and the first thing to do is to pick the right family.

Conclusion

We've discussed differences between parallel, computational systems and concurrent, event handling systems. Areas of differences include:

  • Determinism: desirable vs impossible
  • Sign of parallel safety: same results vs correct results
  • Parallelism bugs: easy to pinpoint vs impossible to define
  • Queues: implementation detail vs part of the interface
  • Preemption: nearly useless vs nearly essential

For event handling systems, concurrency is their essence and parallelism is a part of some solutions – typically good ones (two vending machines is better than one). For computational systems, parallelism is their essence and concurrency is a part of some solutions – typically bad ones (a heap of gifts is usually worse than labeled gifts).

I hope to have gained more clarity than confusion by occasionally conflating "parallelism/concurrency" with "computation/event handling". I also hope I didn't paint too dark a picture of event handling systems – perhaps there are automated verification strategies that I'm not aware of. However it came out, I don't claim that my viewpoint and terminology usage is "the" way of looking at this.

There's value in the angle preferred by some people interested in event handling systems – "concurrency is dealing with several things at a time, parallelism is doing several things at a time". From this angle, parallelism is an implementation detail, while concurrency is in the structure of the program.

I believe there's value in my perspective as well – namely, "concurrency is dealing with inevitable timing-related conflicts, parallelism is avoiding unnecessary conflicts" – "vending machines vs labeled gifts". Here's how the two look like – now with parallel arrows disentangled, as they logically are:

The most important takeaway is that computational code, as opposed to event handling code, can be made virtually bug-free rather easily, using either automated debugging tools or static guarantees.

Handling parallelism using its own tools is nothing new. Rob Pike's work on Sawzall, a specialized parallel language where code is always free from parallelism bugs, predates his work on the concurrent language Go.

However, tools for concurrency appear "louder" than tools for parallelism at the moment – and they can handle parallelism, albeit relatively badly. Loud and bad often crowds out the less visible better. It'll be a pity if better support for parallelism won't be developed as a side effect of "loud concurrency" – or if such support atrophies where already available.

I'd go as far as replying to Armstrong's "parallelizing serial code is solving the wrong problem" with "using 'bare' concurrency tools for computational code is applying your solution to the wrong problem". The simple fact is that C parallelized with the help of the right tools is faster and safer than Erlang.

So here's to "using the right tool for the job", and not having anyone walk away with your Apple iPhone®.

10x more selective

There's this common notion of "10x programmers" who are 10x more productive than the average programmer. We can't quantify productivity so we don't know if it's true. But definitely, enough people appear unusually productive to sustain the "10x programmer" notion.

How do they do it?

People often assume that 10x more productivity results from 10x more aptitude or 10x more knowledge. I don't think so. Now I'm not saying aptitude and knowledge don't help. But what I've noticed over the years is that the number one factor is 10x more selectivity. The trick is to consistently avoid shit work.

And by shit work, I don't necessarily mean "intellectually unrewarding". Rather, the definition of shit work is that its output goes down the toilet.

I've done quite a lot of shit work myself, especially when I was inexperienced and gullible. (One of the big advantages of experience is that one becomes less gullible that way – which more than compensates for much of the school knowledge having faded from memory.)

Let me supply you with a textbook example of hard, stimulating, down-the-toilet-going work: my decade-old adventures with fixed point.

You know what "fixed point arithmetic" is? I'll tell you. It's when you work with integers and pretend they're fractions, by implicitly assuming that your integer x actually represents x/2^N for some value of N.

So to add two numbers, you just do x+y. To multiply, you need to do x*y>>N, because plain x*y would represent x*y/2^2N, right? You also need to be careful so that this shit doesn't overflow, deal with different Ns in the same expression, etc.

Now in the early noughties, I was porting software to an in-house chip which was under development. It wasn't supposed to have hardware floating point units – "we'll do everything in fixed point".

Here's a selection of things that I did:

  • There was a half-assed C++ template class called InteliFixed<N> (there still is; I kid you not). I put a lot of effort into making it, erm, full-assed (what's the opposite of half-assed?) This included things like making operator+ commutative when it gets two fixed point numbers of different types (what's the type of the result?); making sure the dreadful inline assembly implementing 64-bit intermediate multiplications inlines well; etc. etc.
  • My boss told me to keep two versions of the code – one using floating point, for the noble algorithm developers, and one using fixed point, for us grunt workers fiddling with production code. So I manually kept the two in sync.
  • My boss also told me to think of a way to run some of the code in float, some not, to help find precision bugs. So I wrote a heuristic C++ parser that automatically merged the two versions into one. It took some functions from the "float" version and others from the "fixed" version, based on a header-file-like input telling it what should come from which version.
  • Of course this merged shit would not run or even compile just like that, would it? So I implemented macros where you'd pass to functions, instead of vector<float>&, a REFERENCE(vector<float>), and a horrendous bulk of code making this work at runtime when you actually passed a vector<InteliFixed> (which the code inside the function then tried to treat as a vector<float>.)
  • And apart from all that meta-programming, there was the programming itself of course. For example, solving 5×5 equation systems to fit polynomials to noisy data points, in fixed point. I managed to get this to work using hideous normalization tricks and assembly code using something like 96 bits of integer precision. My code even worked better than single-precision floating point without normalization! Yay!

For months and months, I worked as hard as ever, cranking out as much complicated, working code as ever.

And here's what I should have done:

  • Convince management to put the damned hardware floating point unit into the damned chip. It didn't cost that many square millimeters of silicon – I should have insisted on finding out how many. (FPUs were added in the next chip generation.)
  • Failing that, lay my hands on the chip simulator, measure the cost of floating point emulation, and use it wherever it was affordable. (This is what we ended up doing in many places.)
  • Tell my boss that maintaining two versions in sync like he wanted isn't going to work – they're going to diverge completely, so that no tool in hell will be able to partially merge them and run the result. (Of course this is exactly what happened.)

Why did this end up in many months of shit work instead of doing the right thing? Because I didn't know what's what, because I didn't think I could argue with my management, and because the work was challenging and interesting. It then promptly went down the toilet.

The hardest part of "managing" these 10x folks – people widely known as extremely productive – is actually convincing them to work on something. (The rest of managing them tends to be easy – they know what's what; once they decide to do something, it's done.)

You'd expect the opposite, kind of, right? I mean if you're so productive, why do you care? You work quickly; the worst thing that happens is, nothing comes out of it – then you'll just do the next thing quickly, right? I mean it's the slow, less productive folks that ought to be picky – they're slower and so get less shots at new stuff to work on to begin with, right?

But that's the optical illusion at work: the more productive folks aren't that much quicker – not 10x quicker. The reason they appear 10x quicker is that almost nothing they do is thrown away – unlike a whole lot of stuff that other people do.

And you don't count that thrown-away stuff as productivity. You think of a person as "the guy who did X" where X was famously useful – and forget all the Ys which weren't that useful, despite the effort and talent going into those Ys. Even if something else was "at fault", like the manager, or the timing, or whatever.

To pick famous examples, you remember Ken Thompson for C and Unix – but not for Plan 9, not really, and not for Go, not yet – on the contrary, Go gets your attention because it's a language by those Unix guys. You remember Linus Torvalds even though Linux is a Unix clone and git is a BitKeeper clone – in fact because they're clones of successful products which therefore had great chances to succeed due to good timing.

The first thing you care about is not how original something is or how hard it was to write or how good it is along any dimension: you care about its uses.

The 10x programmer will typically fight very hard to not work on something that is likely enough to not get used.

One of these wise guys asked me the other day about checkedthreads which I've just finished, "so is anyone using that?" with that trademark irony. I said I didn't know; there was a comment on HN saying that maybe someone will give it a try.

I mean it's a great thing; it's going to find all of your threading bugs, basically. But it's not a drop-in replacement for pthreads or such, you need to write the code using its interfaces – nice, simple interfaces, but not the ones you're already using. So there's a good chance few people will bother; whereas Helgrind or the thread sanitizer, which have tons of false negatives and false positives, at least work with the interfaces that people use today.

Why did I bother then? Because the first version took an afternoon to write (that was before I decided I want to have parallel nested loops and stuff), and I figured I had a chance because I'd blog about it (as I do, for example, right now). If I wrote a few posts explaining how you could actually hunt down bugs in old-school shared-memory parallel C code even easier than with Rust/Go/Erlang, maybe people would notice.

But there's already too much chances of a flop here for most of the 10x crowd I personally know to bother trying. Even though we use something like checkedthreads internally and it's a runaway success. In fact the ironic question came from the guy who put a lot of work in that internal version – because internally, it was very likely to be used.

See? Not working on potential flops – that's productivity.

How to pick what to work on? There are a lot of things one can look at:

  • Is there an alternative already available? How bad is it? If it's passable, then don't do it – it's hard to improve on a good thing and even harder to convince that improvements are worth the switch.
  • How "optional" is this thing? Will nothing work without it, or is it a bell/whistle type of thing that can easily go unnoticed?
  • How much work do users need to put in to get benefits? Does it work with their existing code or data? Do they need to learn new tricks or can they keep working as usual?
  • How many people must know about the thing for it to get distributed, let alone used? Will users mostly run the code unknowingly because it gets bundled together with code already distributed to them, or do they need to actively install something? (Getting the feature automatically and then having to learn things in order to use it is often better than having to install something and then working as usual. Think of how many people end up using a new Excel feature vs how many people use software running backups in the background.)
  • How much code to deliver how much value? Optimizing the hell out of a small kernel doing mpeg decompression sounds better than going over a million lines of code to get a 1.2x overall speed-up (even though the latter may be worth it by itself; it just necessarily requires 10x the programmers, not one "10x programmer").
  • Does it have teeth?If users do something wrong (or "wrong"), does it silently become useless to them (like static code analysis when it no longer understands a program), or does it halt their progress until they fix the error (like a bounds-checked array)?

You could easily expand this list; the basic underlying question is, what are the chances of me finishing this thing and then it being actually used? This applies recursively to every feature, sub-feature and line of code: does it contribute to the larger thing being used? And is there something else I could do with the time that would contribute more?

Of course it's more complicated than that; some useful things are held in higher regard than others for various reasons. Which is where Richard Stallman enters and requires us to call Linux "GNU/Linux" because GNU provided much of the original userspace stuff. And while I'm not going to call it "Gah-noo Lee-nux", there's sadly some merit to the argument, in the sense that yeah, unfortunately some hard, important work is less noticed than other hard, important work.

But how fair things are is beside the point. After all, it's not like 10x the perceived productivity is very likely to give you 10x the compensation. So there's not a whole lot of reasons to "cheat" and appear more productive than you are. The main reason to be productive is because there's fire raging up one's arse, more than any tangible benefit.

The point I do want to make is, to get more done, you don't need to succeed more quickly (although that helps) as much as you need to fail less often. And not all failures are due to lack of knowledge or skill; most of them are due to quitting before something is actually usable – or due to there being few chances for it to be used in the first place.

So I believe, having authored a lot of code that went down the toilet, that you don't get productive by working as much as by not working – not on stuff that is likely to get thrown away.

checkedthreads: bug-free shared memory parallelism

If you aren’t deeply frightened about all the issues raised by concurrency, you aren’t thinking about it hard enough.

John Carmack

This post introduces checkedthreads – a free framework for parallelizing C and C++ code, and for automatically finding every race condition that could potentially manifest on given program inputs. It comes with:

  • a Valgrind-based checker for thorough verification
  • an event-reordering scheduler for fast verification

The code in checkedthreads is fresh and wasn't used in production yet. However, tools using the same approach have been successfully used for years in automotive safety software containing millions of lines of code written by many dozens of developers.

A nice use case for checkedthreads is, you have a complex, serial program that you want to parallelize. With checkedthreads, you'll be able to run your test suite and be sure that you have no parallelism bugs – as sure as you'd be about, say, memory leaks. And if your parallelization does introduce bugs, the Valgrind checker will pinpoint them for you, so that you can quickly fix them.

In what follows, I explain how race detection works in checkedthreads, and then briefly discuss its other features and how to get started with it.

***

Why are threads so error-prone? We're accustomed to see the root of the problem in mutable shared memory. Commonly proposed alternatives to threads avoid mutable shared memory. For example, pure functional code gets rid of the "mutable" part, and process-based parallelism gets rid of the "shared" part".

While processes and pure FP have their virtues, I believe that mutable shared memory is not, by itself, what makes threads bug-prone. You can keep mutable shared memory and eliminate the bugs. Moreover, as we'll see, it's perfectly possible to eliminate shared memory – and keep the bugs.

What, then, is the root of the problem? I believe what we need is to find the right interfaces. Threads and locks are great low-level primitives, but a bad interface to use directly in source code.

In this sense, threads are not unlike goto. Goto is a horrible interface for human programmers. But it's fine as a machine instruction underlying higher-level interfaces ranging from loops and function calls to exceptions and coroutines.

What higher-level interfaces do we have on top of threads? One such interface is fork/join parallelism – parallel loops and function calls. ("Loops and function calls" is notably very similar to the most popular interface on top of goto.) "Fork/join", because starting a parallel loop logically forks a thread per iteration, and those threads are joined back when the loop ends:

Ordering constraints of fork-join code

Examples of fork/join interfaces include TBB's/PPL's parallel_for and parallel_invoke, and OpenMP's #pragma omp parallel for. Checkedthreads provides something similar as well:

ctx_for(_objs.size(), [&](int i) {
  process(_objs[i]); //in parallel for all i's
});

Fork/join parallelism is well-known, and appreciated for its automation of synchronization and load balancing. But how does a fork/join interface help with program correctness compared to raw threads and locks?

We'll see how fork/join helps by looking at two methods to verify parallel code – event reordering and memory access monitoring. Both methods are implemented by checkedthreads and together, they guarantee near-100% freedom from parallelism bugs.

We'll discuss why these methods are so effective with fork/join code – and why they are comparatively ineffective with raw threads.

Since parallel function calls can be implemented using parallel loops, we'll only explicitly mention parallel loops. And we'll only discuss "pure" fork/join programs – programs where forking and joining is the only synchronization mechanism.

That is, code inside a parallel loop can access stuff written by whoever spawned the loop – and code after the loop can access stuff written in the loop. But you can't access shared data using semaphores, atomic counters, lock-free containers, etc. – such accesses will be flagged as bugs. To synchronize threads, you must fork or join.

Lastly, we'll assume that you can't proceed until the loop you spawned completes – and you can't wait for anything but a loop you spawned yourself. (This is an obvious property of code spelled using some kind of a "parallel for", but not of other fork/join interfaces; a recent paper calls this property "strict fork/join parallelism".)

We're thus assuming both "purity" and "strictness"… which sounds more restrictive than it is – but that is a separate topic. And now, with all the assumptions spelled out:

Why fork/join code is verifiable – and why raw threads aren't

First, let's consider event reordering – a cheap and effective verification method.

Suppose thread A writes to address X, and thread B concurrently reads from X. Then B might see A's write or it might not, depending on timing, so it's a bug. A cheap way to find this bug is, don't run the program with the production thread scheduler, where the order of events depends on timing.

Instead, use a debugging scheduler which purposefully and deterministically reorders events. Make it schedule things so that in one run, A's write precedes B's read – and in another run, B's read comes first. Then compare the results of the two runs – if they differ, there's a bug.

How many orders does it take to reorder every two instructions that could ever run in parallel? (Actually, this requirement is too weak to find all the bugs – we'll plug that hole later – but it's usually enough to find most of the bugs.)

In a fork/join program, you need just two orders:

  1. run all loops from 0 to N.
  2. run them all backwards, from N to 0.

Here's an illustration – consider this example pseudo-code with nested parallel loops:

parallel for i=0:2
  foo(i)
  parallel for j=0:i
    bar(i,j)
  baz(i)

The ordering constraints – which code block can run in parallel with which – make up the following DAG:

Ordering constraints of fork-join code

And here are the two schedules – the 0 to N one and the backwards one:

The two event reordering schedules

Pick any two instructions that could possibly run in parallel and you'll see that they are reordered in these two schedules.

What happens if you use raw threads and locks?

Then the ordering constraints don't have to look like a simple fork/join DAG. All we can say is that the ordering constraints form some sort of a partial order over the set of serial code blocks (whose instructions are fully ordered).

With N code blocks, an upper bound on the number of orders you might need is N*2. (For every code segment, schedule it to run as early as possible, and then schedule it to run as late as possible – that's N*2 schedules reordering every two independent instructions.)

But N*2 is a lot of orders – N can be rather large. Can we do better, as we did in the fork/join case – by having less schedules which reorder more things?

Perhaps we can – but finding out the lower bound on the number of orders is an NP-hard problem (its standard name is finding the poset dimension). And heuristics analyzing partial orders in an attempt to produce fewer than N*2 orders take minutes to run with N in the few thousands, in my experience.

The upshot is that reordering every pair of independent instructions is trivial with fork/join code and very hard with raw threads.

Now let's consider an improvement upon simple event reordering – memory access monitoring based on something like program instrumentation or a compiler pass.

Plain event reordering has two main drawbacks:

  • Some bugs are missed. Consider updates to accumulators. Whether we run sum+=a[5] before or after sum+=a[6], sum will reach the same value – whereas in a parallel run, it may not. (Finer-grained reordering – trying every way to interleave instructions that could run in parallel, and not just reordering every pair of independent instructions – would catch the bug. But that's obviously infeasible.)
  • Bugs aren't pinpointed. Reordering gives evidence that you have a bug by demonstrating that results differ under different schedules. But how do you find the bug?

Here's how we can improve upon plain reordering. Let's intercept memory accesses and record the ID of the thread which was the last to write to each and every location – the location's owner. (Logically, each loop index could run in a separate thread – so ideally, we'd map each index to a separate thread ID. In practice, we might want IDs to be small, so we'd use low bits of the index or some such.)

Then if a location is accessed by someone whose ID is different from the owner's ID, we have a bug, and we've pinpointed it – all we need to do is print the current call stack:

checkedthreads: error - thread 56 accessed 0x7FF000340 [0x7FF000340,4], owned by 55
==2919==    at 0x40202C: std::_Function_handler...::_M_invoke (bug.cpp:16)
==2919==    by 0x403293: ct_valgrind_for_loop (valgrind_imp.c:62)
==2919==    by 0x4031C8: ct_valgrind_for (valgrind_imp.c:82)
==2919==    by 0x40283C: ct_for (ct_api.c:177)
==2919==    by 0x401E9D: main (bug.cpp:20)

It's a tad more complicated with nested loops but not much more complicated; the details aren't worth discussing here. The point is, this works regardless of whether there's an effect on results that can be reproduced in a serial run. We pinpoint bugs involving accumulators and shared temporary buffers and what-not, even though results look fine in serial runs.

Note that we still rely on event reordering. That's because we catch the bug if an address was written first and then read – but not if it was read first and then written:

Memory access monitoring

This is OK – if you have a reordering scheduler guaranteeing that in one of your runs, the address is actually written first and then read. With such a scheduler and access monitoring, you'll never miss a bug if it could ever happen with your input data – and you'll always have it pinpointed.

Why doesn't it work nearly as well with raw threads and semaphores?

The first problem is, you need a reordering scheduler, and there are too many schedules to cover, as we discussed above. I've shown in the past an example where Helgrind, a memory access monitoring Valgrind tool for debugging pthread applications, misses a bug – because the bug is masked by the order in which things happen to run.

But there's another problem with raw threads and locks, preventing memory access monitoring from pinpointing some of the bugs when they do reproduce.

Specifically, if two threads access a location, and they use a lock to synchronize their accesses, then you can't flag the access as a bug, right? However, it very well may be a bug. It's certainly not a data race (unsynchronized access to memory) – but it can be a race condition (a bug due to event ordering).

As an example, consider a supposedly deterministic simulation (not an interactive game – though the screenshot below is from a game, Lock 'n' Chase). In the simulation, agents run around a maze, picking up coins. Whoever made it first to pick up a coin has won a race – literally!

Lock and Chase

Suppose each agent is simulated by its own thread. Then it might be that the simulated speed of an agent in the maze depends on the amount of CPU time its thread used compared to others. And then results depend on timing and it's a bug. (A thread per maze region would be a better strategy than a thread per agent.)

This is a race condition – because timing affects results. However, as long as agents lock the coins they pick up, it's not a data race – because access to shared memory is synchronized. And memory access monitoring can only pinpoint data races, not race conditions.

You can simulate a maze, notice that memory where you keep a coin representation is accessed concurrently without locking, and print the offending call stack.

But you can't simulate a maze, notice that with different event ordering, different agents pick up the (properly locked) coin, and "print the offending call stack". There simply is no single offending call stack – only the fact that results end up differing.

John Regehr has a great discussion on the difference between race conditions and data races. Check out his example with a bank account, where all account balances are locked "nicely" – but the bank still doesn't properly transfer money between accounts.

The thing that matters in our context is that with pure fork/join code, all race conditions are data races. In the general case of raw threads, it's no longer true – which gets in the way of pinpointing bugs.

Note that using something like Go's channels or even Erlang's processes doesn't solve the maze race problem – in the sense that you still can't pinpoint bugs. Instead of locking coins, you might have agent processes or goroutines or what-not – and one or several processes keeping coins. Agents would send requests to pick up coins to these processes – and who'd make it first isn't deterministic, and there's no single place in the code "where the bug is".

This is one example of eliminating shared memory and keeping the bugs – as opposed to fork/join code's ability to keep shared memory while (automatically) eliminating the bugs.

(This is not to say that Erlang-style lightweight processes aren't a good idea – they can be great in certain contexts. I do believe that parallelizing computational code is a somewhat different problem from handling concurrent events – and that fork/join parallelism is close to "the right thing" for computational code.)

To summarize the entire discussion of verification:

  • With fork/join code, every bug which could possibly manifest with the given inputs can be deterministically reproduced and pinpointed.
  • Conversely, with raw threads and locks, it's computationally hard to deterministically reproduce bugs. Furthermore, even reproducible bugs can not always be automatically pinpointed.

It is also notable that with raw threads, you get to worry about deadlocks. With pure fork/join code, there simply can never be a deadlock.

(The analysis above mostly isn't rigorous and you're welcome to correct me.)

Some features of checkedthreads

Noteworthy features of checkedthreads are discussed here. A short summary:

  • Guaranteed bug detection as discussed above.
  • Integration with other frameworks: if you already use OpenMP or TBB, you can configure checkedthreads to use their scheduler (and their thread pool) instead of fighting over the machine with the other framework.
  • Dynamic load balancing: work gets done as soon as a thread is available to do it. Load balancing automatically takes into account variability between the different tasks – as well as whatever unexpected load the CPUs might be handling while running your code.
  • A C89 and a C++11 API are available.
  • Free as in "do what you want with it" (FreeBSD license).
  • Easily portable (in theory).
  • Small and simple (at the moment).
  • Custom schedulers are very easy to add (though it's more of a recreational activity than a necessity, or so I hope).

Downloading, building, installing and using checkedthreads

I recommend to read the build instructions, and then, possibly, download precompiled binaries. (The build instructions are useful, in particular, to know what you're getting in the binary archive.) The source code of version 1.0 is archived here; git@github.com:yosefk/checkedthreads.git keeps the latest source code.

Before actually using checkedthreads, I recommend to read the rather short documentation of the API, the environment variables, and runtime verification.

If you run into any issues with checkedthreads, drop me a line at Yossi.Kreinin@gmail.com.

Conclusion

To quote the same piece by John Carmack again:

…if you have a large enough codebase, any class of error that is syntactically legal probably exists there.

In other words, "correct by design" is an almost non-existent property; correctness is either demonstrated automatically, or it is absent. While nobody has to make the simple errors we discussed in our toy examples, analogous errors will necessarily creep into large parallel programs.

Checkedthreads shows that for one significant family of parallel imperative programs – fork/join code – it's possible to automatically, deterministically reproduce and pinpoint every bug.

I hope it to be a convincing example of what I think is a more general truth – namely, that parallel imperative programs don't have to be "deeply frightening", any more than serial imperative programs ought to be a scary nest of computed gotos. What we need is the right higher-level interfaces on top of raw threads and locks.

There generally appears to be a necessary trade-off between flexibility and correctness. A lot of widely known approaches to parallelism maximize one at the expense of the other. Raw threads – and many of the higher-level frameworks out there – are very flexible, but it's very hard to know that your code is correct. With pure functional code, you have statically guaranteed determinism – but "no side effects" is a very severe restriction.

Checkedthreads attempts to offer a different balance: determinism guaranteed by testing instead of statically – and an imperative language to program in, albeit with less synchronization options than those available with raw threads.

I hope you like it – and don't hesitate to email me if you need any sort of support :-) (In particular, I'll gladly port the thing to other platforms/languages if people are interested.)

Amdahl's law in reverse: the wimpy core advantage

Once a chip’s single-core performance lags by more than a factor to two or so behind the higher end of current-generation commodity processors, making a business case for switching to the wimpy system becomes increasingly difficult.

– Google's Urs Hölzle, in "Brawny cores still beat wimpy cores, most of the time"

Google sure knows its own business, so they're probably right when they claim that they need high serial performance. However, different businesses are different, and there are advantages to "wimpy cores" beyond power savings which are omitted from the "brawny/wimpy" paper and which are worth mentioning.

It is a commonplace assumption that a single 3 GHz processor is always better than 4 processors at 750 MHz "because of Amdahl's law". Specifically:

  • Some of your tasks can be parallelized to run on 4 cores – these will take the same time to complete on the two systems.
  • Other tasks can only run on one core – these will take 4x more time to complete on the 750 MHz wimpy-core system.
  • Some tasks are in-between – say, a task using 2 cores will take 2x more time to complete on the 750 MHz system.

Overall, the 3 GHz system never completes later and sometimes completes much earlier.

However, this assumes that a 3 GHz processor is consistently 4x faster than a 750 MHz processor. This is true in those rare cherished moments when both processors are running at their peak throughput. It's not true if both are stuck waiting for a memory request they issued upon a cache miss. For years, memory latency has been lagging behind memory throughput, and the 4x faster processor does not get a 4x smaller memory latency.

Assuming the latency is the same on the two systems – which is often very close to the truth – and that half of the time of the 750 MHz system is spent waiting for memory when running a serial task, the 3 GHz CPU will give you a speed-up of less than 2x.

What if the task can run in parallel on 4 cores? Now the 750 MHz system gets a 4x speed-up and completes more than 2x faster than the 3 GHz system!

(This assumes that the 4 750 MHz cores are not slowed down due to them all accessing the same memory – which, again, is very close to the truth. Memory bandwidth is usually high enough to serve 4 streams of requests – it's latency which is the problem. So having several cores waiting in parallel is faster than having one core waiting serially.)

A slow, parallel system outperforming a fast, serial system – Amdahl's law in reverse!

Is memory latency unique?

In a way it isn't – there are many cases where faster systems fail to get to their peak throughput because of some latency or other. For instance, the cost of mispredicted branches will generally be higher on faster systems.

However, most other latencies of these kinds are much smaller – and are handled rather well by the mechanisms making "brawny" cores larger and deserving of their name. For example, hardware branch predictors and speculative execution can go a long way in reducing the ultimate cost of mispredicted branches.

"Brawny" speculative hardware is useful enough to deal with memory latency as well, of course. Today's high-end cores all have speculative memory prefetching. When you read a contiguous array of memory, the core speculates that you'll keep reading, fetches data ahead of the currently processed elements, and when time comes to process the next elements, they're already right there in the cache.

The problem is that this breaks down once you try to use RAM as RAM - a random access memory where you don't go from index 0 to N but actually "jump around" randomly. (Theoretically this isn't different from branch prediction; in practice, large, non-contiguous data structures not fitting into caches and "unpredictable" for prefetchers are much more common than unpredictable branches or overflowing the branch history table.)

Hence we have pledges like this computer architect's kind request to stop using linked lists:

Why would anyone use a linked-list instead of arrays? I argue that linked lists have become irrelevant in the context of a modern computer system:

1. They reduce the benefit of out-of-order execution.

2. They throw off hardware prefetching.

3. They reduce DRAM and TLB locality.

4. They cannot leverage SIMD.

5. They are harder to send to GPUs.

Note that problems 1, 2 and 4 "disappear" on wimpy cores – that is, such cores don't have out-of-order execution, hardware prefetchers or SIMD instructions.  So a linked list doesn't result in as many missed performance opportunities as it does on brawny cores.

"Why would anyone use linked lists"? It's not just lists, for starters – it's arrays of pointers as well. "Why array of pointers" seems exceedingly obvious – you want to keep references instead of values to save space as well as to update something and actually get the thing updated, not its copy. Also you can have elements of different types – very common with OO code keeping arrays of base class pointers.

And then using arrays and indexing into them instead of using pointers still leaves you with much of the "access randomness" problem. A graph edge can be a pointer or an index; and while mallocing individual nodes might result in somewhat poorer locality than keeping them in one big array, you still bump into problems 1, 2 and 4 for big enough node arrays – because your accesses won't go from 0 to N.

Of course you could argue that memory indirection is "poor design" in the modern world and that programmers should mangle their designs until they fit modern hardware's capabilities. But then much of the "brawny vs wimpy" paper's argument is that brawny cores mean smaller development costs – that you get high performance for little effort. That's no longer true once the advice becomes to fight every instance of memory indirection.

It's also somewhat ironic from a historical perspective, because in the first place, we got to where we are because of not wanting to change anything in our software, and still wishing to get performance gains. The whole point of brawny speculative hardware is, "your old code is fine! Just give me your old binaries and I'll magically speed them up with my latest and greatest chip!"

The upshot is that you can't have it both ways. Either keep speculating (what, not brawny/brainy enough to guess where the next pointer is going to point to?), or admit that speculation has reached its limits and you can no longer deliver on the promises of high serial performance with low development costs.

Formalizing "reverse Amdahl's law"

…is not something I'd want to do.

Amdahl's law has a formal version where you take numbers representing your workload and you get a number quantifying the benefits a parallel system might give you.

A similar formalization could be done for "reverse Amdahl's law", taking into account, not just limits to parallelization due to serial bottlenecks (what Amdahl's law does), but also "limits to serialization"/the advantage to parallelization due to various kinds of latency which is better tolerated by parallel systems.

But I think the point here is that simple formal models fail to capture important factors – not that they should be made more complex with more input numbers to be pulled out of, erm, let's call it thin air. You just can't sensibly model real programs and real processors with a reasonably small bunch of numbers.

Speaking of numbers: is time spent waiting actually that large?

I don't know; depends on the system. Are there theoretically awesome CPUs stuck waiting for various things? Absolutely; here's that article about 6% CPU utilization in data servers (it also says that this is viewed as normal by some – as long as RAM and networking resources are utilized. Of course, by itself the 6% figure doesn't mean that brawny cores aren't better – if those 6% are all short, serial sprints letting you complete requests quickly, then maybe you need brawny cores. The question is how particular tasks end up being executed and why.)

Could a "wimpy-core" system improve things for you? It depends. I'm just pointing out why it could.

From big.LITTLE to LITTLE.big, or something

There's a trend of putting a single wimpy core alongside several brawny ones, the wimpy core being responsible for tasks where energy consumption is important, and the brawny ones being used when speed is required.

An interesting alternative is to put one brawny core and many wimpy ones – the brawny core could run the serial parts and the wimpy cores could run the parallel parts. If the task is serial, then a brawny core can only be better – perhaps a lot better, perhaps a little better, but better. If the task is parallel, many wimpy cores might be better than few brawny ones.

(I think both use cases fall under ARM's creatively capitalized "big.LITTLE concept"; if the latter case doesn't, perhaps LITTLE.big could be a catchy name for this – or we could try other capitalization options.)

Hyperthreading/Barrel threading/SIMT

…is also something that helps hide latency using coarse-grain, thread-level parallelism; it doesn't have to be "whole cores".

Parallelizing imperative programs

…is/can be much easier and safer than is commonly believed. But that's a subject for a separate series of posts.

Summary

  • Faster processors are exactly as slow as slower processors when they're stalled – say, because of waiting for memory;
  • Many slow processors are actually faster than few fast ones when stalled, because they're waiting in parallel;
  • All this on top of area & power savings of many wimpy cores compared to few brawny ones.

Is program speed less important than X?

Is program speed less important than safety? Sometimes it is – and sometimes speed is safety. A slow autopilot is a dangerous autopilot. That's why so much safety-critical software is written in the least safe programming languages.

A lot of programs aren't like autopilots – a slower, safer transaction processor is usually better. Google's early credit card charging disaster is my favorite example (it would never happen with a bounds-checked, garbage-collected language runtime).

However, there's the "counter-anecdote" of a cell phone company switching to a fancy new billing system which took more time to charge customers for calls than it took customers to make calls. It fell behind to the point where they had to avoid charging customers for a month worth of calls because they couldn't compute the right amounts. So sometimes trusting one's money to a slow program is rather unsafe.

And then there are high-frequency trading algorithms. Speaking of which: is speed less important than program correctness? Sometimes it is – and sometimes speed is correctness. A slower chess program playing under time control will settle for a worse move – a less correct move. Slower project scheduling software will come up with a worse schedule.

Life in general is a game played under time control. Often, "slower" means "being able to process less information in less ways" – in other words, dumber, further away from "correct".

What about time to market – isn't program speed less important than time to market? Sometimes it is – and sometimes higher speed is shorter time to market. A breathtaking game or special effect on today's hardware that others can only pull off on tomorrow's hardware means that the game or the special effect made it first to the market.

("Performance tricks" sound more relevant to the real-time world of games than to the offline rendering of movies; a great counter-example is procedural generation of outdoor landscape in Brave by Inigo Quilez.)

But time to market is also affected by development time; is program speed less important than development time? Sometimes it is – and sometimes higher speed is less development time. A developer waiting for slow programs develops more slowly – and developers often wait for their own programs (tools searching for stuff or summarizing stuff, build systems, tests, machine learning algorithms, …).

Another point is that a developer whose code tends to be too slow will waste time looking for a faster, fancier, buggier algorithm, sometimes sifting through many options, each of which could be fast enough if coded by someone else. A developer whose code tends to be fast the first time will move on to the next thing more quickly.

Is program speed less important than X? Sometimes it is – but sometimes speed is inseparable from X.

Efficiency is fundamentally at odds with elegance

Q: In retrospect, wasn't the decision to trade off programmer efficiency, security, and software reliability in exchange for runtime performance a fundamental mistake?

A: Well, I don’t think I made such a tradeoff. I want elegant and efficient code. Sometimes I get it. The efficiency vs. correctness, efficiency vs. programmer time, efficiency vs. high level, etc. dichotomies are largely bogus.

An interview with Bjarne Stroustrup

Unlimited-precision symbolic computation is more elegant than floating point numbers. You simply never have any numerical stability problems. Anything algebraically correct – a valid way to solve for x given the equations involving x – is also computationally correct. You don't need to know all the quirks of floating point, and you won't need "numerical recipes" which are basically ways to deal with these quirks.

Symbolic computation is rather widely available – say, in Mathematica, Matlab/maple, etc. – but it's not used nearly as much as floating point. That's because floating point is much more efficient, and a whole lot of things can not be done in a reasonable amount of time and space with symbolic computation.

It is undeniable that this efficiency comes at a cost of correctness (in terms of increased likelihood of bugs), programmer time, and elegance. There are plenty of algebraically elegant solutions which just don't work in floating point. If you don't notice, you have a bug; if you do notice, you spend your (programmer) time looking for an alternative, and said alternative may be less elegant.

Floating point is not the lowest level we can sink to. In many cases the quest for still more efficiency brings us to the dark quagmires of fixed point arithmetic. That's when you have an integer and you implicitly assume it's divided by 2^N – the exponent is statically known so the point isn't floating at run time, so to say. So to implement a+b you just use integer addition, and a*b is an integer multiplication followed by a right shift by N. Or there are midway scenarios where you have many integers with a common, dynamically computed exponent because they all have roughly the same range (FFT is one case where this is often done.)

Fixed point is so ugly that there's not even a recipes book that I'm aware of; it just doesn't come out tasty in the slightest. The biggest trouble isn't even the very likely overflow but the loss of precision: floating point guarantees a certain number of significant bits in the mantissa, while fixed point doesn't – unless you explicitly normalize the number at some point using CLZ or similar (making it more of a floating point emulation than "true" fixed point where exponents aren't represented at runtime). You think you have a 32-bit mantissa and an implicit exponent so the number is rather precise – but those 32 bits can have most of the high bits set to zero and then it's not precise at all.

However, a mixture of 8-bit, 16-bit and 32-bit fixed point operations often beats the performance of floating point by a large margin; especially if you have SIMD instructions, because you can always fit 4x more 8-bit numbers into a register than 32-bit numbers, and a single-precision floating point multiplier is ~10x more costly in hardware than an 8-bit integer multiplier so you have less of them.

Again, this efficiency comes at a cost of correctness (in terms of increased likelihood of bugs), programmer time, and elegance.

In computer vision, you're often looking for objects of a certain class, and you have a classifier taking a rectangular image region and telling whether this region contains an object of that class. A simple and elegant object detection algorithm is to apply this classifier to every possible rectangle in the image, and then remove rectangles which mostly overlap (as in, if there are 15 similar rectangles saying there's a face in roughly the same place, make one rectangle out of them all).

This elegant algorithm is never used, because there are too many possible rectangles (every coordinate times every size). A common optimization is to use a cascade of classifiers. That is, apply a very cheap classifier with a lot of false positives but hopefully almost no false negatives to every region. The purpose is to throw away most of the rectangles so that the remaining smaller set still contains all the true positives – and a lot of false positives, of course, but much less.

This is repeated with many (possibly increasingly expensive) classifiers processing increasingly smaller sets of rectangles. The most widely deployed classifier cascade is probably the Viola-Jones face detector, currently available in most digital cameras displaying little squares around faces. As you could have noticed, it often misses a face, which is to be expected with all the little classifiers hurrying to throw rectangles away. And which is OK for a consumer application where a success rate of 90-95% is perfectly fine and an extra 1% of detection rate is not worth a $0.01 increase in price. The point is that the error rate is undeniably increased by stricter efficiency requirements.

The upshot is that object detection provides a broad family of examples where, again, runtime efficiency comes at a cost of correctness (in terms of increased likelihood of bugs – there's much more code to write – as well as the ultimate detection rate), programmer time, and elegance.

(Even the smallish sub-problem of merging overlapping rectangles provides an example where efficiency has to be bought with all those other things including elegance. A short, readable, elegant solution could use an O(N^2) nested loop where each rectangle is intersected with every other rectangle. One optimization is some sort of spatial data structure where you don't look at rectangles if they don't fall into the same bin of the spatial subdivision because then they can't intersect. That's faster, more buggy and less readable.)

Does this have anything to do with the quote by Stroustrup though? His implied point was how the std::sort template is more elegant as well as more efficient than C's qsort function fiddling with void pointers, right? Or ostream vs printf? Whereas these are all examples of "algorithmic efficiency" – is that even related to language design?

Well, the thing is, "algorithms" and "languages/code" is a continuum:

Think of all the psychic energy expended in seeking a fundamental distinction between "algorithm" and "program". — Alan Perlis

Given that it's a continuum, it is doubtful that a statement which is profoundly wrong in an "algorithmic" context could be true in a "programming" context. If the tradeoff between runtime efficiency and programmer efficiency/"elegance" is fundamental from an algorithmic point of view, then it's likely fundamental in computing in general.

For a concrete example of how blurry the line between "algorithmic efficiency" and "code efficiency" is, let's discuss corner detection. The FAST corner detector is a decision tree looking at pixels surrounding the central pixel and comparing the image intensity of the center to its surroundings. Similarly to other classifier cascades, "not a corner" is a quick decision, while "yes, a corner" is decided after all the checks are done.

The decision tree is implemented in several thousands of auto-generated C code lines with gotos. (That's one addition to the recent discussion about the utility of gotos in systems programming; add computer vision to the list of goto applications, I guess.)

Is it possible to implement the decision tree in a more elegant and readable way? Of course – but at the cost of efficiency; not asymptotic efficiency since it'd be the same decision tree, but efficiency nonetheless.

Is this goto business an "algorithmic" optimization or a "program" optimization? Consider that FAST's entire raison d'etre is being faster than, say, the Harris corner detector. Constants matter for high-resolution images processed in real time.

Consider furthermore that both FAST and Harris are O(#pixels) since they look at a finite, small number of pixels around each coordinate and execute a finite, small number of operations. Consider that which is more efficient depends on the platform – SIMD helps speed up Harris but not FAST, and different SIMD instruction sets speed it up by very different factors. (This is also true for linear classifiers vs Viola Jones and for other cases.) And consider the fact that algorithmically, they're wildly different – Harris looks at eigenvectors whereas FAST is an intensity-based decision tree, they have tunable decision thresholds with different meaning, and different sets of false positives and false negatives.

So is FAST a work in the area of "algorithms" or "programming", and is the auto-generated mountain of code essential to make it efficient an "algorithm" or a "program"? My answer is that it's both, in the sense that you can't really draw the line.

But what about std::sort and C++'s combination of efficiency and elegance? Well, C++ rather obviously does pay with programmer efficiency for runtime efficiency, without an option to opt out of the deal. Every allocation can leak, every reference can be dangling, every buffer can overflow, etc. etc. etc.

This blindingly obvious fact doesn't surprise those who realize the fundamental tradeoff between efficiency and a whole lot of other things, some of which can be collectively called "elegance". Whereas those refusing to believe in such a tradeoff manage to not even notice the consequences. For example:

The relatively small size of the C++ standard library – primarily reflecting the lack of resources in the ISO C++ standards committee – compared with the huge corporate libraries can be a real disadvantage for C++ developers compared to users of proprietary languages.

…So why do languages without corporate backing which are 2 to 3 times younger than C++, such as Perl, Python, and Ruby, have so much more libraries, both standard and non-standard, but widely used?

The best uses of C++ involve deliberate design. You design classes to represent the notions of your application, you organize those classes into hierarchies, you express your algorithms precisely and abstractly (no, that “and” is not a mistake), you use libraries, you build libraries, you devise error handling and resource management strategies and express them in code. The result can be beautiful, efficient, maintainable, etc. However, it’s not just sitting down and writing a series of statements directly manipulating characters, integers, and floating point numbers.

The thing is that actually doing something useful involves a whole lot of "direct manipulations" of characters, integers and floating point numbers – and strings, arrays, hash tables, files, sockets, windows, matrices, etc. Languages which let you "just sit down and write the series of statements" give programmers the extra productivity which results in all those extra libraries getting written.

However, equally undeniably it does cost you runtime efficiency, because you pay an overhead for built-in resource management strategies such as garbage collection, built-in error detection strategies such bounds checking, and a whole lot of other things.

It's not surprising that Stroustrup sees the problem in the fact that corporations "with the resources" invest them in what he thinks is the wrong thing, presumably because of their self-interested profit motives. Alex Stepanov who designed the STL expressed similar statements, and so did Alan Kay and every other perfectionist technologist. If you seek perfection to the point of denying the existence of most obvious tradeoffs – and tradeoffs are a pesky thing for a perfectionist because they imply that perfection is unattainable – then you're also likely to somewhat resent corporations, markets, etc. For a discussion of that, see my take on Worse Is Better vs The Right Thing.

(Of course there are plenty of perfectionists who, instead of rationalizing C++'s productivity problems, spend their time denying that Python is slow, or keep waiting for Python to become fast. It will not become fast. Also, all its combinations with C/C++ designed to remedy this inefficiency will forever be ugly. We had psyco, PyPy, pyrex, Cython, Unladen Swallow, CPython extension modules, Boost.Python, and who knows what else. Python is not designed to be efficient; it's designed for productivity and for extensibility through a necessarily ugly C FFI. The tradeoff is fundamental. Python is slow forever. Python bindings are ugly forever.)

So if the tradeoff is fundamental, should we give up on efficient resource utilization? No – if the elegant thing is to load the database table into RAM, it doesn't mean that we have enough RAM. Should we give up on programmer productivity? No – inline assembly or lock-free code which isn't obviously bug-free doesn't belong in our cold paths.

We should, however, give up on perfection. Some code will be slower than we want because we don't have time to optimize it, and some code will be uglier than we want because we have no choice but to optimize it.

A hope to defeat a fundamental tradeoff is nothing but a source of frustration, and it's a bliss to have lost such a hope.

How profilers lie: the cases of gprof and KCachegrind

We'll see how gprof and KCachegrind lie to us and why they do so, discuss the limits within which we can trust them nonetheless, and attempt to draw more general conclusions about profilers and profile visualization tools.

(In case your programming vocabulary is different from mine – I use "lying" in its dispassionate meaning of "communicating falsehoods" and not to convey negative judgement; on the contrary, I'm indebted to the authors of profiling tools both as a user and a developer of such tools.)

So, consider a program with two parts – an easy part and a hard part. Both parts do similar work but one part does much more work than the other:

void work(int n) {
  volatile int i=0; //don't optimize away
  while(i++ < n);
}
void easy() { work(1000); }
void hard() { work(1000*1000*1000); }
int main() { easy(); hard(); }

Here, work() is a do-nothing loop. easy() executes a thousand iterations of that loop and hard() executes a billion iterations. We therefore expect main() to spend most of its time in hard() and only a tiny fraction in easy().

Now let's profile the program with gprof:

gcc -o try try.c -pg
./try # saves stats to gmon.out
gprof try gmon.out

On my machine, this prints the following info:

self            self    total
seconds  calls  s/call  s/call  name
   3.84      2    1.92    1.92  work
   0.00      1    0.00    1.92  easy
   0.00      1    0.00    1.92  hard

gprof's lie is marked in red; it says easy() and hard() took the same amount of time, instead of hard() taking a million times more to run.

What happened? Can we trust anything that gprof says? Which parts of its output are entirely wrong like this "easy() is the same as hard()" business and which parts are roughly correct, give or take a measurement error? To answer this, we need to briefly discuss how gprof works.

Roughly, gprof's two sources of information are profil() and mcount():

  • profil() – a cousin of creat() in that it could have been spelled with an "e" as well – updates an instruction address histogram every 10 milliseconds. That is, 100 times a second the OS looks which instruction the program is executing, and increments a counter corresponding to that instruction. So the share of increments corresponding to a function's body is proportionate to the share of time the program spent in the function.
  • mcount() is a function called by assembly code generated by gcc -pg. Specifically, when a function is entered, it calls mcount() to record a call to itself from the caller (the caller is generally easy to identify because it necessarily passes a return address to your function and that address points right into the caller's body.) So if f() calls g() 152 times, mcount(f,g) will be called 152 times.

With this in mind, we can roughly tell what gprof knows. Specifically, it knows that:

  • easy() and hard() were both called once; work(), called from each, ran twice. This info is from mcount() and it's 100% reliable.
  • The program spent almost no time in the code of easy() and hard(), and most of its time in the code of work(). This info is from profil() and it's rather reliable – because the program ran for >3 seconds, which means we had >300 increments in our instruction histogram. If almost all of these increments are in work(), that's significant enough.

What about the share of time easy() spent in its call to work(), and the share of time hard() spent in work()? By now we know that gprof knows absolutely nothing about this. So it guesses, by taking 3.84 – seconds spent in work(), a reliable number – and divides it between easy() and hard() equally because each called work() once (based on mcount(), a reliable source) – and we get 1.92. This shows how bad results can be produced from perfectly good measurements, if passed to the wrong algorithm.

More generally, gprof's output falls into the following categories, listed in decreasing order of reliability:

  • Number of calls: 100% reliable in all parts of the report. (I think; please correct me if I'm wrong.)
  • Self seconds in the "Flat profile" (time spent in a given function not including children): reliable to the extent that 100 samples/second is statistically significant given the number of hot spots and the total runtime.
  • Seconds attributed to call graph edges (contribution of children to parents, total runtime spent in self+children, etc.): possibly dead wrong. Only trust it if there's zero code reuse in a given case (that is, f() is only called by g()), or if the function in question is known to take about the same time regardless of the call site (for example, rand()).

BTW, the fact that gprof lies doesn't mean that its documentation does; on the contrary, `man gprof` says, in the BUGS section:

The granularity of the sampling is shown, but remains statistical at best. [this refers to the limited reliability of profil's histogram.] We assume that the time for each execution of a function can be expressed by the total time for the function divided by the number of times the function is called. Thus the time propagated along the call graph arcs to the function's parents is directly proportional to the number of times that  arc is traversed. [this refers to the absolutely unreliable way in which profil's "self time" data is combined with mcount's call graph data.]

Unfortunately, users tend to read tools' output without reading documentation. (The ability of users who aren't into profiling tools to understand the implications of this passage is a separate question.)

The man page also refers to papers from 1982 and 1983. An age of over three decades is a good reason to cut a program some slack. In a way, gprof's age is not only a source of its limitations, such as only 100 samples per second, but also a source of its strengths, such as fast execution of profiled code and wide availability.

Now let's look at a more modern profiler called callgrind – a valgrind plugin. Being more modern, callgrind has a few advantages over gprof – such as not lying in its call graph (though some would debate that as we'll see), and coming with a GUI program to visualize its output called KCachegrind.

KCachegrind the viewer (as opposed to callgrind the measurements collector) does lie in its call tree as opposed to call graph as we'll shortly observe. But first let's have a look at its truthful reporting of the situation with easy() being easier than hard():

KCachegrind's call graph - true

As you can see, easy() isn't even shown at the graph (KCachegrind hides things with insignificant cost); you can, however, see the cost of main's call to easy() and hard() at the source view – indeed easy() is ~1000x faster than hard().

Why 1000x and not 1000000x? Because I changed hard() to run a million iterations instead of a billion, bringing the difference down to 1000x. Why did I do that? Because callgrind is slow – it's based on Valgrind which is essentially a processor simulator. This means that you don't measure time - you measure things like instructions fetched and cache misses (which are interesting in their own right), and you get an estimation of the time the program should take given these numbers and your processor model.

It also means callgrind is slow. Is it slower than gprof? Not necessarily. With gprof, code runs at near-native speed, but you only get 100 data points per second. With callgrind you get much more data points per second. So for a hairy program, with callgrind you get statistically significant data more quickly – so effectively callgrind is faster.

But for a simple program with just a couple of hot spots, callgrind is slower because if the program has a costly part 1 and then a costly part 2, it'll take callgrind more time to even get to part 2, whereas gprof, with its near-native speed, will give you good enough data from its fast run.

So much about speed; now let's look at a case where KCachegrind lies to us, and then we'll discuss why it happens.

To expose the lie, we'll need a more complicated program. We'll achieve the required complexity by having two worker functions instead of one, and then adding a manager – a function that does nothing except calling the two workers with the number of iterations to run.

How does the manager decide the number of iterations each worker should run? Based on the project requirements, of course. Our "projects" will be two more functions, each calling the manager with its own pair of iteration numbers for the two workers.

void worker1(int n) {
  volatile int i=0;
  while(i++<n);
}
void worker2(int n) {
  volatile int i=0;
  while(i++<n);
}
void manager(int n1, int n2) {
  worker1(n1);
  worker2(n2);
}
void project1() {
  manager(1000, 1000000);
}
void project2() {
  manager(1000000, 1000);
}
int main() {
  project1();
  project2();
}
As you can see, both workers work on both projects, but each project is mostly done by one of the workers, the other contributing 1000x less work. Now let's see what KCachegrind says about this; we need to run callgrind, which can be done without special compilation flags:
gcc -o try2 try2.c
valgrind --tool=callgrind ./try2
kcachegrind `ls -tr callgrind.out.* | tail -1`
Here's what we'll see:
KCachegrind call tree - false

The bottom part of the screen shows us truths, but the top part shows falsehoods.

The truth is that each project called the manager once; the manager called each worker twice; and each worker did half the work in the program – all shown at the call graph at the bottom. However, at the top, each of the project functions occupies half the window and shows that worker1 and worker2 each did half of the work in each project - which couldn't be further from the truth.

This falsehood is reported by the call tree (or "callee map" as KCachegrind calls it) – a view which is supposed to show, for each function, the share of work done in each of its callees relative to the work done by all those callees together (as opposed to the call graph which only links to the called functions and tells how many times they were called by that caller – and their share of work in the entire program.)

Why does the call tree tell a falsehood? Again, because it doesn't know the truth. KCachegrind visualizes callgrind's measurements, which include the number of times f() called g() and the time spent in calls from f() to g().

This is more that gprof's information – way more. gprof only knows how much time was spent in f() and g(), and how many times f() called g(). Callgrind also measures how much time was spent in g() specifically when it was called from f(). In particular, this means that KCachegrind's source view gives a perfectly reliable measurement of the time spent in f plus all its callees – something which users take for granted and something which gprof only guesses, often wildly wrongly.

However, this information is not enough to know what the call tree needs knowing to show the truth. Specifically, you only know that manager() spent the same time in calls to worker1() and worker2() overall; you have no idea how much time it spent in each worker when called from project1() and project2(). So you can't reliably plot the share of time worker1() and worker2() spent inside project1() or project2(); you can only guess, often wildly wrongly.

(In fact, you can't tell if manager() even called worker1() when called from project1(); perhaps it didn't – all you know is that manager called worker1() in some context. Some people conclude that the call graph is "incorrect"; in fact it is correct, the question is if you understand what you see the way you're supposed to – you aren't supposed to think that every path through the graph actually happened, only every edge. Another question is how upset you are when you find out (someone with a lot of "manager" functions doing nothing but dispatching might be very upset.) This example certainly broadens the gray area between "truth" and "lies" in profilers' output.)

Callgrind has something which appears like a possible workaround: --separate-callers=N. What this does is it measures things per call stack of size N instead of per call "arc". That is, instead of having a measurement for the arc from manager() to worker1() and a measurement for manager()->worker2(), it measures separately for project1()->manager()->worker1(), project1()->manager()->worker2(), project2()->manager()->worker1(), etc.

Unfortunately, KCachegrind didn't manage to open the resulting output file on my machine, nor did it help when I replaced the ticks (') separating the function names (which get concatenated together) with underscores (_). However, the textual data produced by callgrind indeed shows the truth:

fn=(726) worker2'manager'project1
5 3
+1 1
+1 7000008
+1 2

fn=(736) worker2'manager'project2
5 3
+1 1
+1 7008
+1 2

This shows that worker2() did 1000x more work when called (through manager()) from project1() than it did when called from project2() – as we know it did.

Having looked into the details of two particular examples, we'll proceed to a more general discussion of profilers and profile visualization tools.

Using no profiler at all is better than hoping it'll save the day

I know a few people who like to optimize code and think optimization is important, and who mostly ignore profilers. I know a few other people who claim that a good profiler is the necessary and sufficient prerequisite for optimization. More often than not, such people are not particularly fond of optimization, and their code will be slower than the code of above-mentioned profiler-bashers, before as well as after optimization.

The examples above supposedly show a part of the reason why "a good profiler" is not at all trivial to use.

Basically among the many opinions, there are two extreme ones sounding along the lines of:

  • You don't know where your bottlenecks are going to be, therefore don't bother to optimize before measuring.
  • You don't know where your bottlenecks are going to be – nor will you be able to measure them because adequate measurement and analysis tools plus test scenarios covering the relevant possibilities are hard to come by. Therefore, conserve resources if there's even a slight chance that the code will be a bottleneck.

In my experience, the first viewpoint results in slower code and is less consistent internally. That is, for someone who's not into optimization, a profiler is just not the force multiplier that he thinks it is. That's because he won't put the mental effort required to make an effective use of the profiler for the same reasons making him write slow code in the first place: he just doesn't like all this performance stuff.

The trouble with this being that profilers are not things magically telling you what to do without concentration on your behalf.

You need to understand how a profiler works to use it well

When I first realized how gprof lies in its call graph, I was so offended that I stopped using it for a long while. The problem with profilers is that not all the data is gross lies, but some is, and without knowing which data is likely to be wrong, you might trust things that you shouldn't, scratch your head to the point of hair loss, and then abandon the tool altogether once you realize how it routinely betrays your trust.

With KCachegrind, I came up with the example where it lies based on my knowledge of the callgrind output format – something I know because we (specifically, GK) added support for that format to our in-house profiling tools at work. I wouldn't guess that the Callee Map view is unreliable otherwise. Like most users, I rarely read the docs unless something clearly refuses to work altogether. The stats in the call graph as well as the source view are perfectly reliable. How would I know some other stats aren't?

The extent to which you warn the user about possible implications of assumptions in your software is a tough question for all programmers. Should KCachegrind have a big red warning "I might be misleading you" at its "map" views? A "true power user" doesn't need the warning because they know how the tool works anyway. A "truly careless user" wouldn't read the explanation linked to from the red warning and would just be scared away. Only a "middling user" irresponsible enough to not read the docs but responsible enough to read text linked to from big red warnings might benefit from such design. But how many such users are there compared to the rest?

My own experience here is depressing – nobody, not even the smartest folks, is willing to read anything unless they came here to read. As in, when you're a tutorial that people intend to read, you can tell things to people and they listen. When you're an error message, people read you to the extent necessary to make you go away. And when you're a warning, people ignore you. It sucks to be a warning.

So I'm pessimistic about big read warnings. As to tutorials – people don't expect profilers to be complicated enough to warrant a tutorial, so they probably won't allocate time specifically to read one. They're wrong, but they won't.

Choosing a profiler is hard

There's no perfect profiler – it's a tradeoff, or rather a bunch of tradeoffs. Let's see how complicated those tradeoffs are by comparing gprof, callgrind and Google's CPU profiler:

  • gprof is fast, requires a special compilation, gives you "self time" based on 100 instruction address samples per second, precise call counts, and often bogus "children time" information.
  • callgrind is slow, requires no special compilation, gives time estimations based on event counting, several order of magnitudes more data points per second (making it "effectively faster" for some use cases but not others), precise call counts as well as precise "events counted during a call to each child" information, and comes with a viewer giving correct though possibly misleading call graph and often bogus "map" views.
  • Google's CPU profiler is fast, requires no special compilation unless you're on a 64b platform without a working unwind library, uses a configurable amount of samples per second (default: the measly 100, I wonder why), lacks precise call count information, logs full call stacks unlike gprof and callgrind without –separate-callers, but then converts data to many viewing formats where the extra info is lost (such as the callgrind format). You do get more informative view of the profiling data in some cases.

So basically each of these profilers is better than the rest in some way and worse in some other way (for instance, gprof, at first glance the awful, ancient profiler, is actually your shortest path to precise call counts, which may well be the thing you need the most). And that's before discussing things like the ease of getting a working version for your platform.

As it often is with complicated things, the choice is made harder by the fact that you don't realize many of the implications until after you gained some experience with the tool. I don't think I know what questions to ask about a new profiler, even though I'm relatively savvy. I do realize that I want to understand how it works in a lot of detail.

Not all errors are "noise"

If you're listening to an analogue recording and there's a "shhhh" sound, it's "noise". If, however, someone is yelling near you, then this louder noise isn't "noise" in the mathematical sense – it's another signal. And if someone has overwritten your original recording, then there's only another signal left and none of yours. There's also noise on top of that other signal, but that noise is the least of your troubles.

Not everything standing between you and your signal is "noise"; sometimes it's an error making you look at the wrong signal.

With profilers, people intuitively expect "measurement noise" – a profiler measures time (or some other cost) and no measurement device is perfect. People are willing to accept that – but they do expect fidelity in the assignment of costs to context. "Context" is source lines, functions, and call sequences; people (correctly) think of context as something logical where the concept of "measurement errors" doesn't apply; it's either correct or not.

Unfortunately, people are right, but the conclusion is the opposite of the natural expectation: "context" is indeed a logical concept and not a continuous quantity – and therefore, when the tools give you wrong information, it's really wrong, as in completely detached from reality, and not something "roughly right" give or take a measurement error.

(Our examples focusing on call sequences are rather benign, in a way; for real fun, we could discuss the fidelity of assigning costs to source code lines in optimized programs. Even with a nice assembly listing linking to the various source files from which code was inlined, like the listing provided by KCachegrind, it's… let's say less than easy.)

The upshot is that many errors in profilers' output are not something that can be fixed by running the program more times to get more statistically significant results. Rather, you have to know where the results are only distorted by noise and where there can also be logical errors rendering the numbers meaningless.

Presentation errors can exist without measurement errors

…As evidenced by KCachegrind's false presentation of callgrind's or Google profiler's correct results, or by gprof's false conclusions based on reasonable numbers reported by profil() and perfect numbers from mcount().

To add to the previous point, measurement errors are more likely to be "noise" than presentation errors, which are more likely to tell something unrelated to the truth.

Conclusion

Profiling is trickier than we tend to assume, and as someone developing profilers, I understand programmers who're good at optimization and who mostly ignore profilers. A profiler could help users get way more mileage if it found a way to convince them to actually read a thorough, memorable explanation of its strengths and weaknesses.