Passing shell script arguments to a subprocess

So I want to create a shell script that ultimately exec's a command, after doing something like, say, setting an environment variable first:

#!/bin/sh
export MY_VAR=MY_VAL
exec my_command $*

(The point of using `exec my_command` rather than plain `my_command` is to not leave behind a /bin/sh process waiting for my_command – one that shows up in pstree, makes my_command not see its "real" parent process, and so on.)

Simple enough, except it doesn't work. If you run the script with `script "a b" c`, my_command's arguments will be a b c (that's three arguments instead of the original two).
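
An easy way to see what my_command actually receives is to stand in a tiny argument printer for it (show_args is a made-up name):

#!/bin/sh
# show_args: print each argument on its own line, bracketed, so the splitting is visible
for arg in "$@"
do
  echo "[$arg]"
done

With `exec ./show_args $*` in the wrapper, `script "a b" c` prints [a], [b] and [c] – three lines where the original two arguments should have produced [a b] and [c].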

(Update – as pointed out in the comments, "$@" instead of $* works fine, and is perfectly sufficient for my example where all the script does is setting an env var. "$@" isn't enough if you need to fiddle with the arguments – if you need to do that, read on, otherwise just use "$@".)
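
For completeness, the "$@" version of the script is simply:

#!/bin/sh
export MY_VAR=MY_VAL
exec my_command "$@"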

A common workaround seems to be, you iterate over the arguments and you quote them and then eval:

#!/bin/sh
export MY_VAR=MY_VAL
args=
for arg in "$@";
do
  args="$args '$arg'"
done
eval exec my_command $args

Not so simple, but works better: "a b" c will indeed be passed to my_command as "a b" c.

However, it doesn't work if the arguments contain single quotes. If you pass "'a'" (that's double quote, single quote, the character a, single quote, double quote), my_command will get plain a. If you pass "'a b'" (double, single, a, space, b, single, double), my_command will get two arguments, a b, instead of one, 'a b'.
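
To see why, here's a rough trace of the "'a b'" case (a sketch of the expansions, not literal shell output):

# script invoked as:         script "'a b'"      – one argument: 'a b'
# the loop builds:           args=" ''a b''"
# so eval ends up running:   exec my_command ''a b''
# which the shell parses as  my_command with the two arguments  a  and  b  – the quotes are gone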

What to do? One potential workaround is escaping quotes: replacing ' with \', etc. Perhaps someone with a sufficiently thorough understanding of the Unix shell could pull it off; personally, I wouldn't trust myself to take care of all the special cases, or even to fully enumerate them.

So instead, what works for me (or so I hope) is, instead of creating a string of values by concatenating arguments, I make a string of references to the arguments using the variables $1, $2, etc. How many of those are needed – is the last one $3 or $7? Ah, we can use $# to figure that out:

#!/bin/sh
export MY_VAR=MY_VAL
nargs=$#
args=
while [ $nargs -gt 0 ]
do
  args=""$$nargs" $args"
  nargs=`expr $nargs - 1`
done
eval exec my_command $args

This handsome code generates, given three arguments, the string "$1" "$2" "$3", and then evals it to get the three argument values, which apparently can't cause quoting complications dependent on the actual argument values. With five arguments, the string "$1" "$2" "$3" "$4" "$5" is generated, and so on. (You'd use more code if you wanted to mangle some of the arguments, which is the sole reason to do this sort of thing as opposed to using "$@".)
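
To illustrate the "more code" part: here's a sketch of the same scheme with one bit of argument fiddling – dropping "$1" before the exec (the dropping is just a made-up example of mangling):

#!/bin/sh
export MY_VAR=MY_VAL
nargs=$#
args=
while [ $nargs -gt 1 ]   # stop at 2, leaving "$1" out
do
  args="\"\$$nargs\" $args"
  nargs=`expr $nargs - 1`
done
eval exec my_command $args

Given the arguments a, "b c" and d, this generates the string "$2" "$3", and my_command sees b c and d.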

If you're good at Unix and you know a less "\"\$$ugly\" $way" to do this, and/or a more correct one, do tell.

(Why am I writing shell scripts in the first place, you ask? There are reasons for this too, and reasons for the reasons; there always are.)

Update 2: according to a comment, $1-$9 work in sh, but $10 and up do not; they do work in bash, which is what you actually get when you ask for /bin/sh on some systems but not others.

I really ought to try harder to stay away from shell scripting. I mean, I know I shouldn't, but I keep coming back. I'm like those tribesmen around the world who can't resist the urge to drink alcohol, and who have genes that make alcohol particularly addictive and bad for their health. I clearly don't have the genetic makeup that would make *sh reasonably harmless for me.

Why programming isn't for everyone

Today I learned about HyperCard, a system where you could implement a basic calculator in a few easy steps, one of them involving the following impressively English-like snippet:

on mouseUp
  get name of me
  put the value of the last word of it after card field "lcd"
end mouseUp

The article depicts HyperCard as a system making programming accessible to people who aren't professional developers. It is claimed that Apple likely killed off the product because it's inconsistent with its business model (roughly, devices bought to consume rather than to create).

I sympathize with the sentiment – I very much like stuff you can tinker with, and dislike business models discouraging tinkering. However, I don't think businesses have the power to prevent anything that works well for many people from happening. A conspiracy of typewriter manufacturers could never stop the PC.

This seems especially true with software, where huge systems can be built by volunteers in their spare time. If an idea works, if a software system wants to be built around it, it will be built.

Of course it may be the case that the time hasn't come for a programming system for non-developers. It's just my opinion that it never will come, not really. Why?

Not because you need much education to program. Very useful stuff can be built without knowing why optimal sorting is O(n*log(n)), or even what big O means.

Not because programming languages must have, or typically have, arcane syntax. As a kid, I found Pascal's somewhat English-like "begin" and "end" off-putting, and was greatly relieved to discover Algolish braces. How close to natural language programming syntax can get, and whether it's beneficial to go there at all, is IMO an irrelevant question. The fact is that programming languages can be very readable to people.

The main reason is that development leads to maintenance, and maintenance leads to suffering.

For example, if your program stores persistent data, and you want to change it, your changes to the program must be made so as to preserve the meaning of existing data. This part of development causes major pain everywhere, from video recording to financial databases to compiler construction. No amount of knowledge and no amount of support from the tools makes this fun.

There are many other things. Everything in your program's environment is unstable and you must constantly update the program to keep up. Your program gets cluttered with options and you forget what does what. There are cases you didn't test – spaces in the names, empty data fields, reverse order of operations.

As a result, maintenance means dealing with misbehaving programs that eat data, send trash around, or simply make you wait for an hour and then watch them produce garbage.

This never ends and quickly stops being fun. When something useful cannot be done quickly and isn't the average person's idea of fun, it becomes the business of professionals – or hardcore hobbyists indistinguishable from professionals. As a counter-example, many people like cooking in their spare time without getting anywhere near the level of a chef or spending that much time on it. Con Kolivas, on the other hand, could technically be called a "hobbyist", but he could just as well be called a "professional".

Maybe I'm wrong, maybe there are plenty of places where a sprinkle of logic – in textual form or graphical form or whatever form – can be figured out quickly, left alone and be useful ever after. It's just that it's usually the opposite with me. Every time I have a nice little idea it takes me 10x the time it "should" take to implement, and most things keep biting me once in a while for a long time.

Programming isn't for everyone because it is not fun to maintain what was fun to program.

Compensation, rationality and the project/person fit

To negotiate a compensation, you need to compare to something. There are two principally different things people compare compensation to:

  1. Available alternatives. Employee: "I could get twice as much at Microsoft." Employer: "We can hire Bob for a half of your salary."
  2. Peers' compensation. Employee: "Jeff gets twice as much and he's not better than me." Employer: "John gets half your salary and you're not better than him."

I believe the second approach – comparing "ability" and having a common level of compensation for people "at the same level of ability" – is the worse approach. Its main drawbacks are:

  • People look into each other's pockets too much that way
  • It is, in a basic economic sense, an "irrational" approach
  • It ignores the project/person fit

I'll discuss all of these drawbacks, mostly focusing on ignoring the project/person fit – in my opinion, the worst part.

Looking into each other's pockets

Even if management doesn't disclose the way people are labeled and what compensation corresponds to each label, people have an incentive to find out all about this. This means that everyone will know how much everyone else gets, and how one must be labeled to earn a given amount.

People looking into each other's pockets is bad for everyone:

  • Invariably people will find others' compensation unjust, which doesn't improve team spirit.
  • You get Piterian situations where, say, a strong developer's only way to get a raise is to become a manager, at which he might very well suck, etc.
  • Sometimes the employer does want to set exceptional conditions for someone – pay someone significantly more or less than someone else with the same title. However, if everyone tends to find out about everyone else's compensation, it becomes hard to make these exceptions as it is guaranteed to upset people.

Others' compensation is one of those things that are better left unknown. It's a pity if you tempt people to find it out.

Comparing to imaginary alternatives is "irrational"

If I'm working on X, Jeff works on Y and John works on Z, it makes no sense to compare my compensation to theirs. Whoever is unhappy with the current arrangements and threatens to terminate them – that is, whether I quit or get fired – neither Jeff nor John will replace me, nor will I replace them.

Jeff and John usually have to keep working on Y and Z, so they can't work on X if I quit. Nor will I work on Y and Z – even if I quit, not the company, but just my team, and join their team in the same company. They're already there working on Y and Z – so I won't work on Y or Z, but on W.

Therefore, the employer should compare my compensation to what he'd pay someone else to do X, including the cost of training him. I should compare to what I'd be paid to do W, including the cost of having to learn to do it.

Why should we compare to these things and not others? Because these are our actual alternatives. Jeff's and John's compensation has nothing to do with our actual alternatives.

To which someone can legitimately still reply: why? Someone can say, I still want to compare to Jeff's and John's compensation. So what if you're saying that it's "economically irrational" to consider things unrelated to the real alternatives in a price negotiation? It's my price negotiation, I can compare to whatever I want!

That someone would be right, in a way. It's not like there's a monopoly on the definition of "economic rationality" – one could certainly find an economist claiming that looking at your peers is the rational thing to do, or at least the natural thing to do.

Say, Robert Frank – "Choosing the Right Pond". You know, evolutionary considerations – you're trying to impress a potential mate with your salary, the mate compares within the "pond", an unusually high salary is an externality, etc.

Basically it's partners that you compete for, and it's your peers who you compete against, so it's their compensation that you should care about. (Does this sound just like your workplace? I hope not…)

As an aside, I don't understand evolutionary definitions of "rationality", not really. I mean, if the ultimate goal is to pass your genes, shouldn't you become a serial rapist targeting nuns or someone else who isn't likely to use abortion? If you aren't doing this, and you advocate the evolutionary view of rationality, aren't you proving your own irrationality by your own actions? And if you are irrational, then why should irrational people like you be trusted to define rationality in the first place?

But the fact that I don't like the "evolutionary" view of rationality and prefer, in this context, the "classical economics" definition is just my opinion. An employer can have his own – just like a friend who kept trying to sell his car, for a long time, until he found someone willing to pay the high price.

Another friend said, when they discussed markets, "what you did is irrational – markets don't behave that way – in a market, you lower the price if you don't have a buyer". To which the seller responded – "first, I did sell high eventually; second – you can't tell me how markets behave – I am the market!"

So yeah, if you're an employer or an employee and you want to compare compensations regardless of what alternatives are actually available – you can of course do this. You are the market – economists, bloggers or anyone else can try to describe your behavior and predict its outcomes, but they aren't entitled to label it "rational" or "irrational", not really.

All that can be said is that considering imaginary alternatives instead of the real ones can very well make you face the real ones.

That is, suppose you say to an employee, "John gets half your salary and you're not better than him." Suppose the employee replies, "I could get twice as much at Microsoft." His alternative is real – he quits. Your alternative is not real – John is not available to replace the guy who quit. Now you're facing your real alternatives – which can be much worse than raising the guy's salary would have been.

Isn't it a better idea to consider your real alternatives during the negotiations?

To which one could reply – how bad those alternatives can be, really? I mean, we hired John, right? And he's just as good. So we can always hire this sort of person for this sort of price, right? Yeah, there are the training costs, but that's all there is to it, not?

I believe that there's more to it than training costs. The big thing is the project/person fit.

The project/person fit

It's magical. If a person wants to do something, I'm so much in favor of letting them, whatever other things they'd have to stop doing. I mean, there are things which nobody will ever do except the one person – or maybe one of two or three people – to whom it's important.

Or someone could do it, but not nearly as well. And not because he's "worse" – he may be "better" on all the common benchmarks (IQ, grades, reputation, whatever). He's not "worse" in any quantifiable way, but it just doesn't click – the project is not a good fit for him.

It's a depressing thought for a manager – a part of a manager's helplessness. A manager can't do anything himself – the most helpless creature around. He's always responsible for what other people do. He can pick the people, talk to people, negotiate with people, reshuffle people. But that is all he can do – and not a single bit of real work that must be done to make his project succeed.

This means an extreme dependence on other people, which is stressful. The project/person fit makes this much worse. You're basically constrained to not move people away from projects when there's this magical click. They're irreplaceable, so you depend on them tremendously – not very comforting. So it's natural to argue that this magic business doesn't really exist – everyone is replaceable.

Now, I'm not saying that people actually "can't be replaced" – far from it. That thought would make me lose sleep as a team leader – and it would offend me as a programmer.

I mean, if our processors are "universal computing machines", then surely programmers ought to be universal as well, right? I much prefer to think of myself as a "replaceable cog" – but a universal cog – than as an irreplaceable part of the peculiar machinery of my current workplace, obviously useless outside it because of my extreme specialization.

So actually I'm at the other extreme on this one, most likely – I don't think very much of "relevant experience", and I'll be the first to say that a person new to something will cope with it very well, don't worry. Everyone is replaceable, because everyone can deal with everything.

For instance, in our recent round of work on hardware verification, we had a tough deadline, so there was a single hardware module that 5 programmers worked on. Of them, 3 had no experience in hardware verification at all, so they had to learn about hardware simulators and waveform viewers and stuff.

Normally, just one person would do that work, but it'd take longer and we couldn't afford the latency. We also had to swap people in and out to do other things, and they had to continue where the previous person left off. And it worked, basically. So I think I'm very much at the other extreme – programmers are universal, and they'll deal.

What do I mean by this "project/person fit" then?

What I mean is that there's still a 10x productivity difference between a person struggling with this important stuff that you dumped on them but they kinda don't understand or care about very much, and a person who wants the thing done.

Actually it's more than 10x – you can't quantify it, it's qualitatively different. A bird doesn't just move faster than a snail. You can't express the difference between crawling and flying in a single number, even if your HR policy mandates this sort of quantification.

People have their own priorities

A manager classifies things as important and unimportant, and he might be tempted to think that somebody gives a damn about his view of these matters.

But they don't give a damn. They classify things as "stuff the manager wants" and "stuff that they want". Stuff that's only important to them because you said so crawls. Stuff that they feel is important and interesting flies.

Managers might think that work gets done because they want it done. It's true – but the best work gets done because people who do it want it done.

And people are amazing in the diversity of their tastes. Taste depends on many things – personal talents and interests, personal history that makes some problems closer to your heart than others, and so on – but no matter what the reasons are, the result is that tastes are wildly different.

Consider the following things, all among the stuff our team works on:

  • A distributed build & run server.
  • A debugger agent – porting gdb to custom hardware and OS.
  • A graphical editor for tagging objects in video clips.
  • A static memory manager built around C language extensions and a constraint solver.

I think all of them are important, and all of them are interesting. As a programmer, I'd work on any of them. I mean, does any of this sound like boring grunt work? Certainly they're all nicer than verifying a hardware module that you didn't specify under time pressure, at least if you ask me.

However, in my team, there's just one, two, sometimes ~1.5 people who actually want to work on each of these things. Moreover, most of them have an aversion to most other things on the list.

Now, if it was strictly necessary, any of them would work on any of these projects. And they'd do a good job even if they got the one they hated the most. But it'd be uninspired, and nobody could blame them.

How easy is it to find someone who'd love to do a project? I'll tell you – real damn hard. I mean, I'm a language geek; in my opinion, everyone wants to work on programming language extensions. And you know what? They don't. Not really. Most don't want to at all. Then many like the idea, in principle – but not that kind of language, or not that kind of extensions. There's no spark in their eyes – until the right person shows up.

Similarly with the other things. You'd think that a person who likes the debugger agent would also like the distributed build server, not? I'd expect that, definitely – but she doesn't. And you can't make someone like something. Usually you can't even pay them to like it. They just won't.

Some projects are optional. With these, I will wait for years until the right person shows up. I feel guilty – people are asking for it, it'd be great if we had it, it would become an enabler for things now unthinkable. But who am I kidding? Nobody wants to do this now, not really. Better wait until he shows up.

When he shows up, what do I say? I say, keep him. Really. Don't let the thing turn into a wasteland just because programmers (actually) are universal, replaceable cogs!

Some projects are not optional – you must do them no matter what. When there's no right person to do such a thing – watch years of work, tears and sweat produce a mountain of code dripping with hate and depression. I'm serious – sometimes I can actually look at code and see how nobody ever loved it.

I've seen brilliant people produce disgusting code nobody wants to touch. Certainly I couldn't help it myself – my sense of duty did not help. I did it on time, it worked, and it was a toxic waste.

It doesn't help that the manager thinks it's important. It doesn't help that I agree with him. If I don't like it, I won't do it very well.

Sometimes – many times – the right person arrives years after the wrong people – the wrong people for this project – have been spitting blood trying to make it work. It takes a few months and the scenery is transformed. Mountains of hate are gone. You have a working system. People who lost hope for this particular area to ever become habitable, to stop smelling of fail, suddenly smile.

Would you let that person go, just because John is "just as good" and you pay him less? There is no way that John is going to take over this thing. Even if he's available. He isn't interested. He couldn't care less. He could take over just like anyone else, but it'd be toxic waste all over again. Come on!

Sometimes a programmer will be moved away from a project – or not be allowed to do it – because of his already high compensation. "We can find someone cheaper to do this". Yes – but not someone who wants to do this! This just brings tears to my eyes.

But if he loves the project, he won't quit, right?

Good thinking. People who can be replaced with someone like John should therefore be compared to John. People who can't be replaced with someone like John can still be compared to him – they're the ones who love their work, so they likely won't quit, and then we can sensibly compare everyone to everyone in a reasonable manner.

They'll quit.

They'll quit even if it's "irrational" for them. People can quit a project they love over compensation, and then spend years until they find something nice to work on. Often they feel it wasn't worth it, or at least are unhappy with their working situation.

But it doesn't help that you were right and that they should have stayed, settling for the fair compensation level of John and working on their favorite stuff. It doesn't help because the loss is yours as well.

Why do people behave in this "irrational" way, apart from having too high expectations about their alternatives? The economist David Friedman gives an evolutionary explanation, if you like that sort of thing:

…human beings regard the usual terms of exchange as right and any deviation from those terms that makes them worse off as a presumptively wicked act by the other party. This feature resulted in human beings that possessed it getting better terms in bilateral monopoly bargains in the environment in which we evolved…

"Bilateral monopoly" is basically the situation you and your employee find yourselves in once a project "clicks" with him. It's hard for you to replace him – and it's hard for him to replace you. This may tempt you to lower the price you're willing to pay. The response Mother Nature had equipped us with for these cases is that the employee thinks you're wicked, and he quits.

This reaction is "irrational" – in the sense that he's now worse off. But it's very much "rational", in the sense that the threat of "irrational" quitting should improve his terms – if you know that the threat is real, despite the fact that actually quitting would make him worse off.

Well, in my experience, the threat is very real alright. Worth taking into account.

Why management likes to set standard compensation levels

I suspect the benefit is that it makes decision-making easier on the scale of a large company. It works reasonably well and is very easy to implement. It's a bit like using a simple heuristic in code because it's just 5 lines of code and it sort of works.

"Bounded rationality", if you like (…isn't "bounded rationality" what used to be called "stupidity"? Aren't "the cognitive limitations of the mind" mentioned in the article also called "stupidity"? I'm not mocking stupidity – I'm certainly equipped with a high degree of stupidity myself, and you can trace its influence on my decision-making. I'm just wondering why invent new terms when we already have perfectly good ones.)

Anyway, if you know why standard compensation levels are a good idea – a rational argument for them in an unbounded way – let me know in the comments. Puzzles me plenty.

Cycles, memory, fuel and parking

In high-performance, resource-constrained projects, you're not likely to suddenly run out of cycles – but you're quite likely to suddenly run out of memory. I think it's a bit similar to how it's easy to buy fuel for your car – but sometimes you just can't find a parking spot.

I think the difference comes from pricing. Processor cycles are priced similarly to fuel, whereas memory bytes are priced similarly to parking spots. I think I know the problem but not the solution – and will be glad to hear suggestions for a solution.

Cycles: gradual price adjustment

If you work on a processing-intensive project – rendering, simulation, machine learning – then, roughly, every time someone adds a feature, the program becomes a bit slower. Every slowdown makes the program a bit "worse" – marginally less useable and more annoying.

What this means is that every slowdown is frowned upon, and the slower the program becomes, the more a new slowdown is frowned upon. From "it got slower" to "it's annoyingly slow" to "we don't want to ship it this way" to "listen, we just can't ship it this way" – effectively, a developer slowing things down pays an increasingly high price. Not money, but a real price nonetheless – organizational pressure to optimize is proportionate to the amount of cycles spent.

Therefore, you can't "suddenly" run out of cycles - long before you really can't ship the program, there will be a growing pressure to optimize.

This is a bit similar to fuel prices – we can't "suddenly" run out of fuel. Rather, fuel prices will rise long before there'll actually be no fossil fuels left to dig out of the ground. (I'm not saying prices will rise "early enough to readjust", whatever "enough" means and whatever the options to "readjust" are –  just that prices will rise much earlier in absolute terms, at least 5-10 years earlier).

This also means that there can be no fuel shortages. When prices rise, less is purchased, but there's always (expensive) fuel waiting for those willing to pay the price. Similarly, when cycles become scarce, everyone spends more effort optimizing (pays a higher price), and some features become too costly to add (less is purchased) – but when you really need cycles, you can get them.

Memory: price jumps from zero to infinity

When there's enough memory, the cost of an allocated byte is zero. Nobody notices the memory footprint – roughly, RAM truly is RAM, the cost of memory access is the same no matter where objects are located and how much memory they occupy together. So who cares?

However, there comes a moment when the process won't fit into RAM anymore. If there's no swap space (embedded devices), the cost of an allocated byte immediately jumps to infinity – the program won't run. Even if swapping is supported, once your working set doesn't fit into memory, things get very slow. So going past that limit is very costly – whereas getting near it costs nothing.

Since nobody cares about memory before you hit some arbitrary limit, this moment can be very sudden: without warning, suddenly you can't allocate anything.

This is a bit similar to a parking lot, where the first vehicle is as cheap to park as the next and the last – and then you can't park at all. Actually, it's even worse - memory is more similar to an unmarked parking lot, where people park any way they like, leaving much unused space. Then when a new car arrives, it can't be parked unless every other car is moved – but the drivers are not there.

(Actually, an unmarked parking lot is analogous to fragmented memory, which is solved by heap compaction at the cost of some runtime latency. But the biggest problem with real memory is that people allocate many big chunks where a few small ones could be used – and probably would be used if the cost of memory were anything above zero. Can you think of a real-world analogy for that?..)

Why not price memory the way we price cycles?

I'd very much like to find a way to price memory – both instructions and data - the way we naturally price cycles. It'd be nice to have organizational pressure mount proportionately to the amount of memory spent.

But I just don't quite see how to do it, except in environments where it happens naturally. For instance, on a server farm, larger memory footprint can mean that you need more servers – pressure naturally mounts to reduce the footprint. Not so on a dedicated PC or an embedded device.

Why isn't parking like fuel, for that matter? Why are there so many places where you'd expect to find huge underground parking lots – everybody wants to park there – but instead find parking shortages? Why doesn't the price of parking spots rise as spots become taken, at least where I live?

Well, basically, fuel is not parking – you can transport fuel but not parking spots, for example, so it's a different kind of competition – and then we treat them differently for whatever social reason. I'm not going to dwell on fuel vs parking – it's my analogy, not my subject. But, just as an example, it's perfectly possible to establish fuel price controls and get fuel shortages, and then fuel becomes like parking, in a bad way. Likewise, you could implement dynamic pricing of parking spots – more easily with today's technology than, say, 50 years ago.

Back to cycles vs memory – you could, in theory, "start worrying" long before you're out of memory, seeing that memory consumption increases. It's just not how worrying works, though. If you have 1G of memory, everybody knows that you can ship the program when it consumes 950M as easily as when it consumes 250M. Developers just shrug and move along. With speed, you genuinely start worrying when it starts dropping, because both you and the users notice the drop – even if the program is "still usable".

It's pretty hard to create "artificial worries". Maybe it's a cultural thing – maybe some organizations more easily believe in goals set by management than others. If a manager says, "reduce memory consumption", do you say "Yes, sir!" – or do you say, "Are you serious? We have 100M available – but features X, Y and Z are not implemented and users want them!"

Do you seriously fight to achieve nominal goals, or do you only care about the ultimate goals of the project? Does management reward you for achieving nominal goals, or does it ultimately only care about real goals?

If the organization believes in nominal goals, then it can actually start optimizing memory consumption long before it runs out of memory – but sincerely believing in nominal goals is dangerous. There's something very healthy in a culture skeptical about anything that sounds good but clearly isn't the most important and urgent thing to do. Without that skepticism, it's easy to get off track.

How would you go about creating a "memory-consumption-aware culture"? I can think of nothing except paying per byte saved - but, while it sounds like a good direction with parking spots, with developers it could become quite a perverse incentive…

Could SOPA give us back a decentralized Internet?

I don't think SOPA will fly, ultimately. It benefits content companies at the expense of technology companies which by now seem to have deeper pockets. Technology companies will find a way to undo SOPA if it passes.

But suppose it passes and is consistently enforced. This threatens sites "enabling or facilitating copyright infringement" – what are those?

Standalone personal sites probably aren't threatened. You know what you publish, and if you publish copyrighted content, you can easily remove it. Gmail probably isn't threatened because data isn't publicly available. SOPA does threaten Wikipedia, because you're supposed to not link to "infringing sites" (which could be anything) – but it probably doesn't threaten them through the content actually on the site, since they're very careful not to use copyrighted content.

Which sites are threatened the most? Facebook, YouTube, blogging and social networking sites. Plenty of copyrighted content gets uploaded to these. If SOPA is trimmed to exclude links to "infringing sites", then it is mostly "social" sites which are targeted.

Are these sites a good development in the Internet world? It's definitely not how the Internet was supposed to look. Instead of many individual sites, we now have a few huge sites keeping most of the published data, together with much personal information, and with very few obligations to users. "They trust me – dumb fucks", as the Facebook CEO put it.

Wouldn't it be great if instead of big social sites, we had big hosting companies and many independent individual sites? Wouldn't it be great if the many independent sites were all using public protocols to exchange data – using the Internet network and not the Facebook network? Wouldn't it be great if no "social engineer" could oversee our communication?

Couldn't SOPA do just that – make it unaffordable to manage a proprietary network like Facebook on top of the Internet, giving us back a decentralized Internet? Facebook convinced hundreds of millions of users that it's fun to be on the Internet, read stuff, write stuff. Couldn't SOPA then force people out of Facebook and bring them to the actual Internet?

Hosting companies that make publishing easy – on your site, under your domain, with data under your full control and responsibility – could use the opportunity. It's well past time that running an actual site became feasible on this fabulous Internet network. With all these proprietary networks on top, what normal person runs a site today, or even knows what it means? Wouldn't it be great if they finally started?

And yeah, I realize it's not going to be like that. Facebook will manage to shoot this legislation down. If it doesn't, then it'll manage to work around its enforcement. And if it doesn't, any site with a link to any other site is probably threatened – definitely Wikipedia, Reddit, HN…

So yeah, it's going to be much worse. But I can dream, can't I?

(And couldn't you think of a way to distribute the hosting of user-generated content – like news links or Wikipedia articles – and present a unified view on the client side? Then one couldn't target "the Wikipedia site" – there wouldn't be one – only a specific portion. Wouldn't it be better for users, in some ways?)

Coding standards: is consistency prettier than freedom?

Different projects have different coding standards, and some have none at all. How does it affect the quality of code and developers' well-being? What results can we reasonably expect from a style guide?

Let's have a look at the effect of style guides in the real world. Here's what Jerusalem looks like:

These similarly looking buildings are near the city center. Here's a shot of the suburbs:

Same stuff, pretty much. White buildings, red roof tiles – or plain flat roofs.

And now for something completely different:

This is Tel Aviv. Buildings don't look similar to each other here. Nor do different parts of the city:

As you can see, in Tel Aviv, there's no style guide – everyone builds stuff to suit their own taste.

In Jerusalem, on the other hand, buildings have to be covered with Jerusalem stone, giving them their trademark off-white color. Jerusalem owes its visual consistency to a century-old style guide enforced by municipal laws.

Here are a few observations – relevant to most style guides, I think:

  • Consistent style is either enforced or lacking. Whatever virtues freedom may have, consistency of style is not one of them.
  • Consistent style is functionally inconsequential. Buildings in Jerusalem are about as safe and comfortable as buildings in Tel Aviv.
  • Psychologically, style does matter. Many people hate visiting Tel Aviv because it's ugly.
  • Whether consistent style is more beautiful is debatable. Many other people hate visiting Jerusalem because it's ugly.
  • People will defeat stylistic consistency despite the style guide. Here's an example from one of Jerusalem's suburbs, Ramot Polin – "a housing project for honeybees":

This is consistent with the style guide, but not very consistent with the actual look of other buildings – nor does it look very comfortable. Leading to my last observation:

  • A common style can be codified and enforced, but common sense can't be. Municipal law mentioned "off-white", but who would have thought to mention "rectangular"?

A sensible style guide is your one and only way to achieve consistent style – and not much else.

What if a style guide is not sensible? Here's a building from Tirana, Albania:

Here's another one:

Yep, that's the style guide over there – bright colors over ugly buildings. And there's nowhere to hide from the consistent style.

Maybe you actually love this style, and hate Jerusalem's uniform off-white. My point is that either way, the consistent style of one of those cities leaves no place for you to like.

Tel Aviv, on the other hand, has a place to like for both Tirana lovers and Jerusalem lovers. Off-white houses with red rooftops? Neve Tzedek has what you want:

Buildings painted in primary colors? Here's a hotel for you:

Personally, I still prefer Jerusalem though. Consistent style is better – if you like that particular style.

Requiring limestone vs banning asbestos

Can coding standards be described as style guides, or are they more than that?

The Google C++ Style Guide suggests that it is in fact more than a style guide:

The term Style is a bit of a misnomer, since these conventions cover far more than just source file formatting.

The document goes on to say that, apart from "enforcing consistency", it also "constrains, or even bans, use of certain features" – "to avoid the various common errors and problems that these features can cause".

"Enforcing consistency" does sound similar to requiring limestone – there's no direct functional impact. But "banning features to avoid problems" sounds more like banning asbestos – very much because of its functional impact, which can include cancer.

However, language features are different from building materials. Asbestos was discovered, not designed, and nobody could know it'd cause cancer. C++ RTTI was designed and approved as a standard by strong programmers, who had in mind some cases where they thought it'd be useful.

RTTI is banned by the Google Style Guide, not the way asbestos is banned by regulations, but the way some sculptors forbid their students to use their fingers when shaping the fine details of clay. Learn to use proper sculpting tools – then do use fingers if necessary:

A query of type during run-time typically means a design problem. …you may use RTTI. But think twice about it. :-) Then think twice again.

Think four times and you'll be allowed to use RTTI. Think 1024 times and you're still not allowed to use asbestos in a housing project. That's because construction standards include functional considerations, but coding standards ultimately discuss style and style alone.

That's why the strictest coding standards allow exceptions. And that's why every banned feature is sometimes better than the proposed alternative.

Readability through inconsistency

Style guides enforce consistency. In the real world, we've seen that consistent style matters psychologically. In programming, people also advocate consistency for psychological reasons:

It is very important that any programmer be able to look at another's code and quickly understand it. Creating common, required idioms and patterns makes code much easier to understand.

Psychological reasons are important – but there are symmetric psychological arguments for inconsistency.

For example, required idioms can in fact make code easier to understand – or harder. Let's look at idioms actually required in some style guides:

  • The use of C++ "algorithms" such as std::for_each and std::transform instead of decade-old "patterns" called loops. I expect the idea to become widespread again, together with C++11 lambdas. Here's TheRegister's take on the impact on readability.
  • Yoda conditions: if(5 == num). This page – first Google hit for "Yoda conditions" at the time of writing – lists only benefits and no drawbacks, and proposes to add this to your style guide. Will code become more readable though? They're Yoda conditions! "If num is five" is how you always say it in English (and Hebrew, and Russian). If five is num, read as natural your code will not.

Of course my opinion on the readability of these patterns is debatable – which is precisely my point. Once a style guide is chosen, some people will experience the joy of fluent reading every time they hit if(5==num). Others will experience the pain of a mental roadblock – also every time.

A style guide will have something to dislike for everyone. When tastes are sufficiently different, the average amount of cringes per person stays the same under consistent style – and the variance rises (someone will hate a particularly popular mandatory pattern).

It's like keeping wealth constant and increasing inequality – something not even a political party would advocate. How is this psychologically a win?

But let's go ahead and assume that the "required idioms" suit everyone's taste, and, by themselves, actually make code easier to understand. Still:

  • External libraries will not follow your style guide. They follow style guides of their own. And this inconsistency can improve readability. Code using the library stands out, and the library's style can match the accepted domain-specific conventions better than your local style. In computer vision, X is the real world coordinate, x is the pixel coordinate – contrary to many software style guides.
  • You can't count on stylistic conventions, because there are exceptions. Google's code orders parameters such that inputs come first, but memcpy & snprintf don't. You either have to look out for exceptions or risk misunderstandings by blindly assuming consistency.
  • Different people think differently, even if their code looks the same. I find it easier to understand programmers' intents through their unique style. When they're all forced to write superficially similarly, I can't tell who wrote what, and what the subtext of the code is.

I'll illustrate the last point with a couple of examples. I knew O.M. before I ever saw him and before I even knew his name. To me, he was the programmer with the two spaces before the trailing const:

inline int x()  const;

I knew him through his code: mathematically elegant, obsessive about fine details of type-based binding and modeling. I could guess what he left out with an intent to maybe add it later. I understood him.

Likewise, I can always spot G.D.'s code by the right-leaning asterisk:

int *p,*q=arr+i;

G.D. certainly couldn't care less about types – similarly to most people with this asterisk alignment. I know his code: terse, efficient, to the point. I know what to expect.

Who wrote this code?

camelCaseName = longerCamelCaseName-camelCaseName;

I dunno, the collective unconscious wrote it. Anyone could write it – or several people patching after each other. I fail to identify with the author and guess his intent – it could be anyone. The code has no smell or taste to me.

This can sound a bit crazy – "does he actually advocate that everyone develop as uncommon a style as possible, as a way to mark his trail?" Of course I don't mean that.

I think what I'm saying can make more sense if you think of style and taste in code as analogous to the taste of food. Of course it's ridiculous to expect every restaurant to make food with a unique taste. Many people like pizza, many people know how to make pizza – so expect much similarly-tasting pizza around.

But we wouldn't like to always eat food cooked to the same spec by restaurants in a single franchise. If someone knows how to make food with a unique taste, we welcome it.

And the unique taste of your food doesn't indicate that it's bad for your health. Moreover, food with a familiar taste can be very unhealthy – much of it is. Code in a familiar style has a comforting look – sometimes misleadingly.

I don't think requiring the same taste everywhere is how you improve the health of a code base.

Popular demand

I'm naturally inclined to argue against coding standards, because it feels like bureaucracy unthinkingly imposed on the work process from above. However, what if it's not imposed from above – what if the programmers themselves want it?

Many certainly do. Many good ones do.

Incidentally, while I was writing this, I stumbled upon an article arguing for consistency – for the very reasons I use to argue against it:

If code isn’t written in a consistent style in your team, whenever you come across code with the spacing a bit wrong, the first thing your head’s going to process is "I didn’t write this."

…Exactly why I like personal style! I really didn't write it – it's his code, I want to understand him, and his style helps greatly.

This is a natural feeling, and as we all know coders have a hard [time] to restrain impulse to rewrite any piece of code they didn’t write.

I understand that impulse very well. Personally, I hate new food more than anyone I know – I eat the same thing every day, for months, for years. I do prefer Jerusalem to Tel Aviv. And I like consistent style – especially my own.

But why should others be force-fed my favorite cabbage salad? Is consistency that much prettier than freedom?

Team spirit

I don't argue with people who favor a consistent style – and I can't. The article above is nostalgic about a team that followed a style guide "religiously". How can nostalgia be refuted? Clearly, consistent style can create a unique team spirit that to some is valuable and memorable.

All I can and do argue is that a lack of a style guide can also create a good atmosphere that is preferred by some other people.

In fact I believe it is exactly team spirit that a style guide – or lack thereof – actually affects. All other effects come indirectly through the impact on atmosphere. Here's my attempt at taking the social effect into account:

  • I won't break established conventions. Following conventions is a great way to say, "I respect the local tradition and the wisdom it embodies." And I'll thoroughly enjoy the limestone covering our code – I like consistency. Yay, a beautiful code base! And if the convention is really bad – I hopefully won't need to join the team in the first place.
  • However, I won't establish and enforce conventions when I'm the one starting with a clean slate. Having no conventions is a great way to say, "join my project to express yourself without artificial constraints".

From team spirit to grassroots bureaucracy

Did you know Ken Thompson is not allowed to check in code at Google? He said so in his Coders at Work interview:

Seibel: I know Google has a policy where every new employee has to get checked out on languages before they're allowed to check code in. Which means you had to get checked out on C.

Thompson: Yeah, I haven't been.

Seibel: You haven't been! You're not allowed to check in code?

Thompson: I'm not allowed to check in code, no.

The programmer who co-created Unix is not allowed to check in code. If this isn't a bureaucracy, what is? But it's inevitable – with rules, you always paint yourself into a corner that way. Allow some to break the conventions, and you will have offended everyone else. Use the same rules for everyone – and some won't contribute.

Of course, Thompson does contribute to Google. Well, Google has plenty of ways to motivate him to do that. And, apparently, they have good programmers willing to write production code based on his uncommitted prototype code.

But we're not all like Google that way. We can't all afford the inevitable laughable outcomes of bureaucracies. If you come across an original programmer with an off-beat style, do you want him to join your project or to move on?

Grassroots bureaucracy is still bureaucracy. I wouldn't object to an established bureaucracy that people claim to value. But I wouldn't establish a new one, either.

Conclusion

A style guide can make code look prettier to some – and uglier to others – but not tangibly better, except if programmers enjoy it so much as to produce better code.

This is equally true for the lack of a style guide. The results of freedom look prettier to some and uglier to others.

Personally, I believe that one rule too few is better than one rule too many, so I don't bother to enforce a common style.

P.S.: when I become a neat-freak

Generally, I tend to enforce little to no conventions. Here are the exceptional situations where I actually get into the ridiculous position of telling people how to write their code.

Interfaces should be consistent

I don't care if an interface looksLikeThis or like_this. But if it uses both, then I'll ask the author to change it to one of the styles – the one which came first. For users to feel confident, an interface should look well thought-out – which implies an illusion of a single author, which implies a consistent style.

By "interface", I only mean the outermost stuff called by module users. Internal functions, classes, etc. can look like Tel-Aviv as far as I'm concerned. For instance, in a simple server, the "interface" to me is just the protocol, and nothing in the code itself.

Warnings should be errors

I hate the concept of compiler warnings – I link it to the concept of guilt. "Fine, be that way, do this evil implicit conversion thing – but if something happens, I will have told you so." Why should we put up with such manipulative behavior? Pick a position – refuse to compile, or compile silently.

However, in practice, compatibility issues turn what ought to be errors into warnings. If it compiled 30 years ago, it has to keep compiling, even if nobody wants it to compile in new code – even if compilers could always have proved the code wrong, but initially didn't bother, and it happens to work in old programs.

So, to the dismay of freedom-lovers, I turn warnings into errors where I can (as in -Werror) – even if I can't cherry-pick the "right" warnings. Mainly since when a file generates 10 (false) warnings, the eleventh (truly useful) warning goes unnoticed.

Another reason is that warnings cause guilt – they are evil, and must be destroyed. If I didn't destroy them by turning them into errors, I'd have to destroy them by disabling them. Then I'd never get that eleventh useful warning.
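
Concretely, in a gcc- or clang-based build this is just a matter of adding -Werror next to whatever warnings are enabled (module.c is a placeholder name):

cc -Wall -Werror -c module.c   # any warning in module.c now stops the build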

But except for these two pet peeves – interfaces and warnings – I do think grown-up programmers should be left alone.

P.P.S.: Greetings from the Overextended Metaphor Parrot

Originally, I had a few references to Bauhaus architecture in the text – how Tel Aviv has buildings in the Bauhaus style and Jerusalem doesn't because of its limestone requirement, and how Ken Thompson's inability to commit code at Google is analogous to that. However, as a commenter pointed out, there are buildings in the Bauhaus style in Jerusalem – in my own neighborhood, actually, so I obviously walked past them plenty of times.

I guess this goes to show that good architects have no trouble complying with a style guide – and that I shouldn't overextend metaphors in areas where I'm not minimally competent.

Graham & Coase: when big companies are a good idea

Paul Graham was once asked the following RAQ (rarely asked question):

How can I avoid turning into a pointy-haired boss?

His answer:

The pointy-haired boss is a manager who doesn't program. So the surest way to avoid becoming him is to stay a programmer. What tempts programmers to become managers are companies where the only way to advance is to go into management. So avoid such companies and work for (or start) startups.

Why be a manager when you could be a founder or early employee at a startup?

Why?! Oh wow. I could fill a book explaining why. But many of my reasons are my own, and aren't relevant to you unless you're much like me. So I'll focus on the general answer to the question implied by Paul Graham: Why do large firms exist?

The question was addressed by the economist Ronald Coase in his article "The Nature of the Firm". This article, together with his work on externalities (the Coase Theorem), earned him a Nobel Prize in economics – which is some evidence that the question is interesting and far from trivial.

Suppose there are no good answers to Paul Graham's rhetorical question. That is, it's always objectively better to start or join a small firm than to be a manager in a large one. You'll always get more work done, or will be more satisfied, or both. Well, if so, competition should eventually drive large firms out of business. So why are they still around?

For starters, clearly there are problems best solved by small groups of people armed with off-the-shelf tools. For instance, two iconic YC startups funded by Paul Graham, Reddit and Dropbox, each solve a problem with the help of a few programmers and a bunch of commodity servers running a commodity software stack. A larger company could hardly improve on what they do.

Note that off-the-shelf products are key to being small (or at least starting small). Reddit or Dropbox could never build those servers from scratch. A small group of people can not erect a $5G chip fabrication facility. Building and operating a fab – or a search engine – requires lots of custom development, so you need a lot of people.

Or do you?

Of course the total number of people involved has to be very large. But it doesn't follow that they should be organized as big companies. Instead, the work could be done by many small organizations, each contracting out most of the work to others.

You're big because you hire. Why hire if you can buy, contract out – and stay small?

Indeed, this seems to make perfect sense. To quote Wikipedia's summary of The Nature of the Firm (1937):

The traditional economic theory of the time suggested that, because the market is "efficient" (that is, those who are best at providing each good or service most cheaply are already doing so), it should always be cheaper to contract out than to hire.

Then why do most people prefer employment to self-employment, as evidenced by their actions (and an economist never trusts anything but actions as a tool to reveal someone's preferences)? Why do I hate the idea of running a small firm?

Either the "traditional economic theory" is right – one should run a small firm, and I'm a freak of nature destined to extinction due to economic evolutionary pressure, together with much of the population – or the theory is lacking, and there should be a concept formalizing my aversion to self-employment.

And in fact, at this point, Coase introduces the term – transaction costs:

Coase noted, however, that there are a number of transaction costs to using the market; the cost of obtaining a good or service via the market is actually more than just the price of the good.

Oh, yeah – MUCH more if you ask me.

Other costs, including search and information costs, bargaining costs, keeping trade secrets, and policing and enforcement costs, can all potentially add to the cost of procuring something via the market.

YES! Here's a Nobel Prize-winning economist from the notoriously "pro-free market" Chicago school who UNDERSTANDS ME. He knows why I hate markets. ("Pro-market" doesn't mean you love markets, just that you think governments are even worse.)

This suggests that firms will arise when they can arrange to produce what they need internally and somehow avoid these costs.

Avoiding these costs can enable work that just can't happen outside the context of a big company.

For instance, I work on chips for embedded computer vision, at a company that's now fairly large. This is an example where a lot of people need to cooperate in a custom development effort (as opposed to fewer people using off-the-shelf products).

In theory, I could start a computer vision hardware startup instead of it being an internal project. In practice, it wouldn't work, because:

  • I wouldn't know what to build. Hardware accelerates algorithms – what algorithms? I only know because I'm in the same company with developers of very effective unpublished algorithms. Without that knowledge, what could I build – an OpenCV accelerator? Good luck selling that.
  • I couldn't build it nearly as efficiently. A great source of efficiency is fitting hardware to the specific workload. But if we were not a part of the company but a vendor, the company would make sure there are competing vendors to keep prices low. This means that we, no longer having a guaranteed customer, would have to support as many different workloads as possible, to increase the pool of potential customers. As a rule, more generic hardware is less efficient.
  • I couldn't explain how to program it. Once you've given away your programming model to the customer – as you have to if you want them to, well, program your processors – only very strong patents can prevent them from cloning your hardware (possibly with the help of your competitor). A big company that, among other things, designs its own hardware doesn't have to explain it to the outside world. And even if its hardware ends up cloned – it's just one part of the secret knowledge behind the product. But if you're a small company only making hardware and it's cloned, you're busted. You shouldn't even start before making sure your ideas are "sufficiently patentable" – which you can't know before you've developed those ideas.

Of course, the number one real reason I couldn't run a hardware startup is that I'm no businessman. But the problems above are also very real, and frequently insurmountable for people who can do business. Not all custom development is impossible to successfully outsource, but much is. The problems result from economic fundamentals.

In econ-speak, such problems are collectively known as "search and information costs, bargaining costs, keeping trade secrets, and policing and enforcement costs". Indeed, all these problems were featured in my example. In plain English, a simple way to sum up all those problems is trust – or more precisely, the lack thereof:

  • A company can't trust a vendor, so a vendor can't know its algorithms.
  • A company can't trust a vendor to keep quality high and prices low if it guarantees to remain its customer…
  • …So a vendor can't trust a company to remain its customer, so it can't invest too much in a solution just to that company's specific needs.
  • A vendor can't trust a company to keep buying from it if enough knowledge is given away so that the product can be cloned instead – so some products are not worth building.

When you work for a big company, you deal with coworkers, and you're all playing for the same team. The smaller the company, the more you deal with customers and vendors, which means playing against them. There's no such word as "co-customer" or "co-vendor" for a good reason.

At least that's how things are framed by the rules. The rules say that all employees are agents acting towards a common goal, "to promote the company's interests" – whereas different companies have different bottom lines and different interests.

Of course, reality is never like the rules – in reality, everyone in the company plays by their own rules, attempting to promote the interest of any of the following – or a combination:

  • Shareholders
  • Customers
  • Employees
  • His team
  • His manager
  • His friends
  • Himself

So in reality, of course there's a lot of chaos in a big company. And it doesn't help that the bigger it is, the harder it is to make sense of what's going on:

…There is a natural limit to what can be produced internally, however. Coase notices "decreasing returns to the entrepreneur function", including increasing overhead costs and increasing propensity for an overwhelmed manager to make mistakes in resource allocation. This is a countervailing cost to the use of the firm.

…Which explains why we aren't all employed by a single all-encompassing huge company.

But at least the rules of a large company frame things right – as cooperation more than competition. (Competition generally isn't an end – it's a means to ultimately force people to cooperate, and, as Coase points out, it only gets you this far.)

Of course, corporate rules also create competition – employees compete for raises, etc. But in practice, most would agree that, overall, it's much safer to trust coworkers than customers or vendors.

Why be a manager when you could be a founder or early employee at a startup? Here's the part of my answer that is based on economic fundamentals.

I specialize in areas requiring custom development by many people. Many people can only tightly cooperate under rules implying trust. Therefore they must not be customers and vendors, but coworkers, which leads to large firms. Such is The Nature of the Firm.

Of course there are problems that can be solved by a small group of people with mutual trust, without tightly-coupled, joint development with others – for example, the problems solved by Reddit and Dropbox. One reason I personally never looked that way is my aversion to business. Such is my own nature.

It just so happens that the nature of the firm suits my nature nicely – because there are situations where big companies are a good idea. When you can't buy and have to build, trust is fundamental to getting the job done.

UPDATE (December 9, 2011): just found an interesting analogy between company size and program size. Doing many things in one big program can be easier than using many small programs because of "transaction costs" – the cost of exchanging data between the programs.

Engineers vs managers: economics vs business

When chased by a bear, engineers want to run faster than the bear, managers want to run faster than you. This is known as "the best vs the good enough", and is a very common theme.

For instance, company A releases a good enough technology, company B releases the best technology on the market. B fails and A succeeds, because A releases earlier, or because A's technology is more compatible with the status quo, etc. Engineers will commonly feel sympathy for B, managers will applaud the shrewdness of A.

It's a common story and an interesting angle, but the "best vs good enough" formulation misses something. It sounds as if there's a road towards "the best" – towards the 100%. Engineers want to keep going until they actually reach 100%. And managers force them to quit at 70%:

There comes a time in the life of every project where the right thing to do is shoot the engineers and ship the fucker.

However, frequently the road towards "the best" looks completely different from the road to "the good enough" from the very beginning. The different goals of engineers and managers make their thinking work in different directions. A simple example will illustrate this difference.

Suppose there's a bunch of computers where people can run stuff. Some system is needed to decide who runs what, when and where. What to do?

  • An engineer will want to keep as many computers occupied at every moment as possible – otherwise they're wasted.
  • A manager will want to give each team as few computers as absolutely necessary – otherwise they're wasted.

These directions aren't just opposite – "as many as possible" vs "as few as necessary". They focus on different things. The engineer imagines idle machines longing for work, and he wants to feed them with tasks. The manager thinks of irate users longing for machines, and he wants to feed them with enough machines to be quiet. Their definitions of "success" are barely related, as are their definitions of "waste".

An engineer's solution is to have everybody submit their jobs to a queue managed by a central server. The Condor software is an example implementation; there are many others, with many subtle issues and differences – or you can roll your own. The upshot is that once a machine becomes idle, it can immediately yank the next job from the server's queue. As long as there's anything left to do, no machine is idle. This is the best possible situation.

A manager's solution is to give the ASIC team 12 servers: asic01, asic02, … asic12. QA gets 20 machines, qa01 … qa20, and so on. If you want to work on a machine, you ssh to it and you work. If you're an ASIC engineer and you want to use a QA machine, you can't log in. If users fight over their team's machines, they can go to their manager. If their manager decides the team needs more machines, he goes to upper management. Good enough.

The "good enough" is not 70% of "the best" – it's not even in the same direction. In fact, it's more like -20%: once the "good enough" solution is deployed, the road towards "the best" gets harder. You restrict access to machines, and you get people used to the ssh session interface, which "the best" solution will not provide.

Which solution is actually better? Tough question.

  • The manager's solution requires no programming or installation, and trivial administration.
  • The manager's solution requires no changes of habits (ssh is the standard).
  • The engineer's solution yields ~100% utilization, the manager's 50%, 30%, or 10%, depending.
  • The manager's solution will cause managers less headache, on average. Once in a while, you buy a team some more machines, and they shut up. The engineer's solution requires setting priorities at times of overload, and people will constantly argue with managers about priorities.
  • The engineer's solution doesn't run things on machines that are already busy anyway. Users armed with ssh will tend to do this, possibly with horrible slowdowns due to swapping, etc.
  • The engineer's solution can provide the lowest latencies.
  • The engineer's solution makes it trivial to utilize new machines – no need to set up sessions, just submit jobs as previously. This is good – and bad: people will demand new machines instead of reducing the load.

So there are many conflicting considerations. Their importance varies between cases, and is very hard to measure.

"The best" solution is not necessarily the best – rather, it's what the mindset of looking for the best yields. Likewise, the "good enough" solution is not necessarily good enough – it's just what you come up with when you look for a good enough approach.

Obviously, portraying "engineers" and "managers" this way is a gross oversimplification, and most real people can look at things from both angles. I like this oversimplification for two reasons:

  1. It does help to understand and predict many common arguments and reactions.
  2. A person's perspective frequently does coincide with the title: engineers aim at the best, managers look for the good enough. The title alone is thus a good basis for prediction.

How does the title affect judgement? There are arguments to the effect of "aiming at the good enough is wiser, and managers are in a better position to achieve wisdom", or vice versa. However, I don't believe either viewpoint is more correct than the other. Rather, they are biases. People in different positions acquire different biases, because they have different incentives and constraints.

The difference in incentives is that engineers are rewarded for success, while managers are rewarded for non-failure.

An exceptional engineer is unique and valuable, like a great chef – being an OK engineer is more like cooking for McDonald's. An engineer needs unusual success to be noticed. One reason such success is achievable is because he works within his area of expertise, on things that he understands. There's rarely a formal quality metric, but his deep understanding creates in him a sense of beauty that serves as his metric.

A manager is fine as long as the project doesn't fail. Most projects fail – if yours didn't, it's quite an achievement. One reason they fail is that there are a lot of ways to fail. Forget just one item in a lengthy checklist, and it won't matter how well the other 99 items are handled. People may not buy a car because there's no convenient place for a resting arm. A manager looks after a whole lot of items, usually without being able to understand most of them in any depth. He's inclined to look for simple pass/fail criteria.

If your success function is a continuous metric, you'll aim at the best. If it's a long list of Boolean values – V or X – you'll want "good enough" (V) at every entry. "The best" is just a costlier way to get a V, at a higher risk to end up with an X. The manager's outlook leads to binarization.

Another difference is that engineers and managers control different things – "One man's constant is another man's variable". In our example, the engineer doesn't decide how many machines to purchase or how to allocate them. If the number of machines is constant, what remains is to make the most of them. The manager, on the other hand, doesn't directly control utilization, and frequently doesn't know how to improve it. If utilization is constant, what remains is to properly ration it.

More generally, engineers tend to optimize within constraints set by management, in part because they can't change those constraints. As to managers, they tend to settle for "good enough", in part because they can't optimize.

Which biases lead to more sensible solutions depends on the situation. However, there's a subtle reason to like the engineer's devotion to excellence, despite the manager's pressure for mediocrity: it's precisely when the manager is right – when excellence will not contribute to sales – that the engineer contributes the most to users.

If an engineer's job is done so poorly as to make the product non-marketable, the product will fail and won't be used. When the quality along all dimensions passes the adoption threshold, every improvement along any dimension is effectively a gift to the users – who'd be using the product anyway. This is false where improvements contribute significantly to popularity, but true where they don't.

The improvements are inconsequential for business, but consequential economically. Economically, anything that helps people is good. For business, anything that creates profit is good. Often these definitions of "good" coincide, but sometimes they don't. In these cases, people who aim at excellence for excellence's sake help bridge the gap.

The manager's checklist approach – the business angle – ensures that the product is marketable, which is great because otherwise it won't get used. But the engineer's urge to optimize is closer to the really important goal: making the best of what we've got; this is the economics angle. So while it can lead to problems just like any other bias, it's likeable that way.

SIMD < SIMT < SMT: parallelism in NVIDIA GPUs

Programmable NVIDIA GPUs are very inspiring to hardware geeks, proving that processors with an original, incompatible programming model can become widely used.

NVIDIA call their parallel programming model SIMT - "Single Instruction, Multiple Threads". Two other different, but related parallel programming models are SIMD - "Single Instruction, Multiple Data", and SMT - "Simultaneous Multithreading". Each model exploits a different source of parallelism:

  • In SIMD, elements of short vectors are processed in parallel.
  • In SMT, instructions of several threads are run in parallel.
  • SIMT is somewhere in between – an interesting hybrid between vector processing and hardware threading.

My presentation of SIMT is focused on hardware architecture and its implications on the trade-off between flexibility and efficiency. I'll describe how SIMT is different from SIMD and SMT, and why – what is gained (and lost) through these differences.

From a hardware design perspective, NVIDIA GPUs are at first glance really strange. The question I'll try to answer is "why would you want to build a processor that way?" I won't attempt to write a GPU programming tutorial, or quantitatively compare GPUs to other processors.

SIMD < SIMT < SMT

It can be said that SIMT is a more flexible SIMD, and SMT is in turn a more flexible SIMT. Less flexible models are generally more efficient – except when their lack of flexibility makes them useless for the task.

So in terms of flexibility, SIMD < SIMT < SMT. In terms of performance, SIMD > SIMT > SMT, but only when the models in question are flexible enough for your workload.

SIMT vs SIMD

SIMT and SIMD both approach parallelism through broadcasting the same instruction to multiple execution units. This way, you replicate the execution units, but they all share the same fetch/decode hardware.

If so, what's the difference between "single instruction, multiple data" and "single instruction, multiple threads"? In NVIDIA's model, there are 3 key features that SIMD doesn't have:

  1. Single instruction, multiple register sets
  2. Single instruction, multiple addresses
  3. Single instruction, multiple flow paths

We'll see how this lifts restrictions on the set of programs that are possible to parallelize, and at what cost.

Single instruction, multiple register sets

Suppose you want to add two vectors of numbers. There are many ways to spell this. C uses a loop spelling:

for(i=0;i<n;++i) a[i]=b[i]+c[i];

Matlab uses a vector spelling:

a=b+c;

SIMD uses a "short vector" spelling – the worst of both worlds. You break your data into short vectors, and your loop processes them using instructions with ugly names. An example using C intrinsic functions mapping to ARM NEON SIMD instructions:

#include <arm_neon.h> //declares uint32x4_t and the vld1q/vaddq/vst1q intrinsics

void add(uint32_t *a, uint32_t *b, uint32_t *c, int n) {
  for(int i=0; i<n; i+=4) { //n assumed to be a multiple of 4
    //compute c[i], c[i+1], c[i+2], c[i+3]
    uint32x4_t a4 = vld1q_u32(a+i);
    uint32x4_t b4 = vld1q_u32(b+i);
    uint32x4_t c4 = vaddq_u32(a4,b4);
    vst1q_u32(c+i,c4);
  }
}

SIMT uses a "scalar" spelling:

__global__ void add(float *a, float *b, float *c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  a[i]=b[i]+c[i]; //no loop!
}

The weird __global__ keyword says that add() is a GPU thread entry point. blockIdx, blockDim and threadIdx are built-in thread-local variables keeping the thread's ID. We'll see later why a thread ID isn't just a single number; however, in this example we in fact convert it to a single number, and use it as the element index.

The idea is that the CPU spawns a thread per element, and the GPU then executes those threads. Not all of the thousands or millions of threads actually run in parallel, but many do. Specifically, an NVIDIA GPU contains several largely independent processors called "Streaming Multiprocessors" (SMs), each SM hosts several "cores", and each "core" runs a thread. For instance, Fermi has up to 16 SMs with 32 cores per SM – so up to 512 threads can run in parallel.
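
For concreteness, here's a minimal sketch of what the host-side launch of add() could look like. The 256-thread block size is my arbitrary choice, n is assumed to be a multiple of it, and a, b and c are assumed to already be device pointers – none of this is implied by the kernel above:

int threadsPerBlock = 256; //an assumption, not a requirement
int blocks = n / threadsPerBlock; //one thread per element
add<<<blocks, threadsPerBlock>>>(a, b, c); //spawn blocks*256 GPU threads
cudaDeviceSynchronize(); //wait until all the spawned threads have finished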

All threads running on the cores of an SM at a given cycle are executing the same instruction – hence Single Instruction, Multiple Threads. However, each thread has its own registers, so these instructions process different data.

Benefits

"Scalar spelling", where you write the code of a single thread using standard arithmetic operators, is arguably a better interface than SIMD loops with ugly assembly-like opcodes.

Syntax considerations aside, is this spelling more expressive – can it do things SIMD can't? Not by itself, but it dovetails nicely with other features that do make SIMT more expressive. We'll discuss those features shortly; in theory, they could be bolted on the SIMD model, but they never are.

Costs

From a hardware resources perspective, there are two costs to the SIMT way:

  • Registers spent to keep redundant data items. In our example, the pointers a, b, and c have the same value in all threads. The values of i are different across threads, but in a trivial way – for instance, 128, 129, 130… In SIMD, a, b, and c would be kept once in “scalar” registers – only short vectors such as a[i:i+4] would be kept in “vector” registers. The index i would also be kept once – several neighbor elements starting from i would be accessed without actually computing i+1, i+2, etc. Redundant computations both waste registers and needlessly consume power. Note, however, that a combination of compiler & hardware optimizations could eliminate the physical replication of redundant values. I don't know the extent to which it's done in reality.
  • Narrow data types are as costly as wide data types. A SIMD vector register keeping 4 32b integers can typically also be used to keep 8 16b integers, or 16 8b ones. Similarly, the same ALU hardware can be used for many narrow additions or fewer wide ones – so 16 byte pairs can be added in one cycle, or 4 32b integer pairs. In SIMT, a thread adds two items at a time, no matter what their width, wasting register bits and ALU circuitry (see the NEON sketch right after this list).
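
Here's what that looks like in NEON – my illustration of the SIMD side, not anything SIMT provides: the same 128-bit register and the same ALU width yield either 16 narrow additions or 4 wide ones per instruction:

#include <arm_neon.h>

//16 additions of 8b elements in one instruction...
uint8x16_t add_bytes(uint8x16_t x, uint8x16_t y) { return vaddq_u8(x, y); }
//...or 4 additions of 32b elements, using the same register and ALU width
uint32x4_t add_words(uint32x4_t x, uint32x4_t y) { return vaddq_u32(x, y); }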

It should be noted that SIMT can be easily amended with SIMD extensions for narrow types, so that each thread processes 4 bytes at a time using ugly assembly-like opcodes. AFAIK, NVIDIA refrains from this, presumably assuming that the ugliness is not worth the gain, with 32b float being the most popular data type in graphics anyway.

Single instruction, multiple addresses

Let's apply a function approximated by a look-up table to the elements of a vector:

__global__ void apply(short* a, short* b, short* lut) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  a[i] = lut[b[i]]; //indirect memory access
}

Here, i is again "redundant" in the sense that in parallel threads, the values of i are consecutive. However, each thread then accesses the address lut+b[i] – and these addresses are not consecutive.

Roughly, such parallel random access works for both loads and stores. Logically, stores are the trickier part because of conflicts. What if two or more threads attempt to, say, increment the same bin in a histogram? Different NVIDIA GPU generations provide different solutions that we won't dwell on.

Benefits

This feature lets you parallelize many programs you can't with SIMD. Some form of parallel random access is usually available on SIMD machines under the names "permutation", "shuffling", "table look-up", etc. However, it always works with registers, not with memory, so it's just not the same scale. You index into a table of 8, 16 or 32 elements, but not 16K.
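
For contrast, here's what register-scale "parallel random access" looks like with NEON's table look-up (my example): each of the 8 index bytes selects one of just 8 table bytes held in a register – nothing like indexing a 16K look-up table in memory:

#include <arm_neon.h>

//tbl holds an 8-byte table in a register; each lane of idx picks one of its bytes
//(out-of-range indexes yield 0)
uint8x8_t lookup8(uint8x8_t tbl, uint8x8_t idx) {
  return vtbl1_u8(tbl, idx);
}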

As previously mentioned, in theory this feature can be bolted on the SIMD model: just compute your addresses (say, lut+b[i]) in vector registers, and add a rand_vec_load instruction. However, such an instruction would have a fairly high latency. As we'll see, the SIMT model naturally absorbs high latencies without hurting throughput; SIMD much less so.

Costs

A GPU has many kinds of memory: external DRAM, L2 cache, texture memory, constant memory, shared memory… We'll discuss the cost of random access in the context of two memories "at the extremes": DRAM and shared memory. DRAM is the farthest from the GPU cores, sitting outside the chip. Shared memory is the closest to the cores – it's local to an SM, and the cores of an SM can use it to share results with each other, or for their own temporary data.

  • With DRAM memory, random access is never efficient. In fact, the GPU hardware looks at all memory addresses that the running threads want to access at a given cycle, and attempts to coalesce them into a single DRAM access – in case they are not random. Effectively the contiguous range from i to i+#threads is reverse-engineered from the explicitly computed i,i+1,i+2… – another cost of replicating the index in the first place. If the indexes are in fact random and can not be coalesced, the performance loss depends on "the degree of randomness". This loss results from the DRAM architecture quite directly, the GPU being unable to do much about it – similarly to any other processor.
  • With shared memory, random access is slowed down by bank contentions. Generally, a hardware memory module will only service one access at a time. So shared memory is organized in independent banks; the number of banks for NVIDIA GPUs is 16. If x is a variable in shared memory, then it resides in bank number (&x/4)%16. In other words, if you traverse an array, the bank you hit changes every 4 bytes. Access throughput peaks if all addresses are in different banks – hardware contention detection logic always costs latency, but only actual contentions cost throughput. If there's a bank hosting 2 of the addresses, the throughput is 1/2 of the peak; if there's a bank pointed to by 3 addresses, the throughput is 1/3 of the peak, etc., the worst slowdown being 1/16.

In theory, different mappings between banks and addresses are possible, each with its own shortcomings. For instance, with NVIDIA's mapping, accessing a contiguous array of floats gives peak throughput, but a contiguous array of bytes gives 1/4 of the throughput (since banks change every 4 bytes). Many of the GPU programming tricks aim at avoiding contentions.

For instance, with a byte array, you can frequently work with bytes at distance 4 from each other at every given step. Instead of accessing a[i] in your code, you access a[i*4], a[i*4+1], a[i*4+2] and a[i*4+3] – more code, but fewer contentions.
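
A CUDA sketch of that trick (the kernel, the 64-thread block size and the array size are mine, and filling the shared array is omitted): thread i owns the 4 neighboring bytes a[i*4..i*4+3], so at each of the 4 steps below, 16 concurrent threads hit 16 different banks – whereas with the naive a[i] access, threads 0..3 would all hit bank 0:

__global__ void bytes_by_4(unsigned char* out) {
  __shared__ unsigned char a[64*4]; //filling a is omitted in this sketch
  int i = threadIdx.x; //0..63 - the sketch assumes 64 threads per block
  //bank of byte address addr is (addr/4)%16, so a[i*4+k] lives in bank i%16
  unsigned char x0 = a[i*4];
  unsigned char x1 = a[i*4+1];
  unsigned char x2 = a[i*4+2];
  unsigned char x3 = a[i*4+3];
  out[i] = x0 + x1 + x2 + x3; //use the bytes so the loads aren't optimized away
}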

This sounds convoluted, but it's a relatively cheap way for the hardware to provide efficient random access. It also supports some very complicated access patterns with good average efficiency – by handling the frequent case of few contentions quickly, and the infrequent case of many contentions correctly.

Single instruction, multiple flow paths

Let's find the indexes of non-zero elements in a vector. This time, each thread will work on several elements instead of just one:

__global__ void find(int* vec, int len,
                     int* ind, int* nfound,
                     int nthreads) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int last = 0;
  int* myind = ind + tid*len;
  for(int i=tid; i<len; i+=nthreads) {
    if(vec[i]) { //flow divergence
      myind[last] = i;
      last++;
    }
  }
  nfound[tid] = last;
}

Each thread processes len/nthreads elements, starting at the index equal to its ID with a step of nthreads. We could make each thread responsible for a more natural contiguous range using a step of 1. The way we did it is better in that accesses to vec[i] by concurrent threads address neighbor elements, so we get coalescing.

The interesting bit is if(vec[i]) – depending on vec[i], some threads execute the code saving the index, some don't. The control flow of different threads can thus diverge.

Benefits

Support for divergence further expands the set of parallelizable programs, especially when used together with parallel random access. SIMD has some support for conditional execution through vector "select" operations: select(flag,a,b) = if flag then a else b. However, select can't be used to suppress updates to values – the way myind[last] is never written by threads where vec[i] is 0.
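
For reference, here's what a vector select looks like in NEON (my example). It picks a value for every lane, so every lane still ends up with some result – it can't leave a lane untouched the way a divergent SIMT thread simply skips the write:

#include <arm_neon.h>

//per-lane select: where mask bits are set, take the bits of a, elsewhere take b
uint32x4_t select4(uint32x4_t mask, uint32x4_t a, uint32x4_t b) {
  return vbslq_u32(mask, a, b);
}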

SIMD instructions such as stores could be extended to suppress updates based on a boolean vector register. For this to be really useful, the machine probably also needs parallel random access (for instance, the find() example wouldn't work otherwise). Unless what seems like an unrealistically smart compiler arrives, this also gets more and more ugly, whereas the SIMT spelling remains clean and natural.

Costs

  • Only one flow path is executed at a time, and threads not running it must wait. Ultimately SIMT executes a single instruction in all the multiple threads it runs – threads share program memory and fetch / decode / execute logic. When the threads have the same flow – all run the if, nobody runs the else, for example – then they just all run the code in if at full throughput. However, when one or more threads have to run the else, they'll wait for the if threads. When the if threads are done, they'll in turn wait for the else threads. Divergence is handled correctly, but slowly. Deeply nested control structures effectively serialize execution and are not recommended.
  • Divergence can further slow things down through "randomizing" memory access. In our example, all threads read vec[i], and the indexing is tweaked to avoid contentions. However, when myind[last] is written, different threads will have incremented the last counter a different number of times, depending on the flow. This might lead to contentions which also serialize execution to some extent. Whether the whole parallelization exercise is worth the trouble depends on the flow of the algorithm as well as the input data.

We've seen the differences between SIMT and its less flexible relative, SIMD. We'll now compare SIMT to SMT – the other related model, this time the more flexible one.

SIMT vs SMT

SIMT and SMT both use threads as a way to improve throughput despite high latencies. The problem they tackle is that any single thread can get stalled running a high-latency instruction. This leaves the processor with idle execution hardware.

One way around this is switching to another thread – which (hopefully) has an instruction ready to be executed – and then switching back. For this to work, context switching has to be instantaneous. To achieve that, you replicate register files so that each thread has its own registers, and they all share the same execution hardware.

But wait, doesn't SIMT already replicate registers, as a way to have a single instruction operate on different data items? It does – here, we're talking about a "second dimension" of register replication:

  1. Several threads – a "warp" in NVIDIA terminology – run simultaneously. So each thread needs its own registers.
  2. Several warps, making up a "block", are mapped to an SM, and an SM instantaneously switches between the warps of a block. So each warp needs separate registers for each of its threads.

With this "two-dimensional" replication, how many registers we end up with? Well, a lot. A block can have up to 512 threads. And the registers of those threads can keep up to 16K of data.

How many register sets does a typical SMT processor have? Er, 2, sometimes 4…

Why so few? One reason is diminishing returns. When you replicate registers, you pay a significant price, in the hope of being able to better occupy your execution hardware. However, with every thread you add, the chance that the execution hardware is already occupied by the other threads rises. Soon, the small throughput gain just isn't worth the price.

If SMT CPU designers stop at 2 or 4 threads, why did SIMT GPU designers go for 512?

With enough threads, high throughput is easy

SMT is an afterthought – an attempt to use idle time on a machine originally designed to not have a lot of idle time to begin with. The basic CPU design aims, first and foremost, to run a single thread fast. Splitting a process to several independent threads is not always possible. When it is possible, it's usually gnarly.

Even in server workloads, where there's naturally a lot of independent processing, single-threaded latency still matters. So few expensive, low-latency cores outperform many cheap, high-latency cores. As Google's Urs Hölzle put it, "brawny cores beat wimpy cores". Serial code has to run fast.

Running a single thread fast means being able to issue instructions from the current thread as often as possible. To do that, CPU hardware works around every one of the many reasons to wait. Such diverse techniques as:

  • superscalar execution
  • out-of-order execution
  • register renaming
  • branch prediction
  • speculative execution
  • cache hierarchy
  • speculative prefetching
  • etc. etc.

…are all there for the same basic purpose. They maximize the chances of an instruction being issued without having to switch to another thread.

SMT is the last line of defense, attempting to fill stalls after all these other measures failed. And even that is considered a bad idea when it hurts the precious single-threaded performance. Which it usually does: each of the 2 threads will typically complete later than it would if it didn't have to share the hardware with the other. This is a key reason to keep the number of hardware threads low, even if there are still throughput gains to be made by adding threads.

However, for the GPUs, the use case is when you do have enough data parallelism to make use of plenty of threads. If so, why not build things the other way around? Threading could be our first stall-mitigating measure. If we have enough threads, we just keep switching between them, and the hardware always has something to do.

This saves a lot of hardware, and a lot of design effort, because you don't need most of the other methods anymore. Even caches and hardware prefetching are not used in much of the GPU memory traffic – rather, you access external memory directly. Why bother with caching and prefetching, if you don't have to sit idly until the data arrives from main memory – but instead just switch to a different warp? No heuristics, no speculation, no hurry – just keep yourself busy when waiting.

Furthermore, even the basic arithmetic pipeline is designed for a high latency, high throughput scenario. According to the paper "Demystifying GPU architecture through microbenchmarking", no operation takes less than 24 cycles to complete. However, the throughput of many operations is single-cycle.

The upshot is that counting on the availability of many threads allows the GPU to sustain a high throughput without having to sweat for low latencies. Hardware becomes simpler and cheaper in many areas as a result.

When latencies are high, registers are cheap

So plentiful threads make it easier to build high-throughput hardware. What about having to replicate all those registers though? 16K sounds like an insane amount of registers – how is this even affordable?

Well, it depends on what "register" means. The original meaning is the kind of storage with the smallest access latency. In CPUs, access to registers is faster than access to L1 caches, which in turn is faster than L2, etc. Small access latency means an expensive implementation, therefore there must be few registers.

However, in GPUs, access to "registers" can be quite slow. Because the GPU is always switching between warps, many cycles pass between two subsequent instructions of one warp. The reason registers must be fast in CPUs is because subsequent instructions communicate through them. In GPUs, they also communicate through registers – but the higher latency means there's no rush.

Therefore, GPU registers are only analogous to CPU registers in terms of instruction encoding. In machine code, "registers" are a small set of temporary variables that can be referenced with just a few bits of encoding – unlike memory, where you need a longer address to refer to a variable. In this and other ways, "registers" and "memory" are semantically different elements of encoding – both on CPUs and GPUs.

However, in terms of hardware implementation, GPU registers are actually more like memory than CPU registers. [Disclaimer: NVIDIA doesn't disclose implementation details, and I'm grossly oversimplifying, ignoring things like data forwarding, multiple access ports, and synthesizable vs custom design]. 16K of local RAM is a perfectly affordable amount. So while in a CPU, registers have to be expensive, they can be cheap in a high-latency, high-throughput design.

It's still a waste if 512 threads keep the same values in some of their registers – such as array base pointers in our examples above. However, many of the registers keep different values in different threads. In many cases register replication is not a waste at all – any processor would have to keep those values somewhere. So functionally, the plentiful GPU registers can be seen as a sort of a data cache.

Drawbacks

We've seen that:

  • Many threads enable cheap high-throughput, high-latency design
  • A high-throughput, high-latency design in turn enables a cheap implementation of threads' registers

This leads to a surprising conclusion that SIMT with its massive threading can actually be cheaper than SMT-style threading added to a classic CPU design. Not unexpectedly, these cost savings come at a price of reduced flexibility:

  1. Low occupancy greatly reduces performance
  2. Flow divergence greatly reduces performance
  3. Synchronization options are very limited

Occupancy

"Occupancy" is NVIDIA's term for the utilization of threading. The more threads an SM runs, the higher its occupancy. Low occupancy obviously leads to low performance – without enough warps to switch between, the GPU won't be able to hide its high latencies. The whole point of massive threading is refusing to target anything but massively parallel workloads. SMT requires much less parallelism to be efficient.

Divergence

We've seen that flow divergence is handled correctly, but inefficiently in SIMT. SMT doesn't have this problem – it works quite well given unrelated threads with unrelated control flow.

There are two reasons why unrelated threads can't work well with SIMT:

  • SIMD-style instruction broadcasting – unrelated threads within a warp can't run fast.
  • More massive threading than SMT – unrelated warps would compete for shared resources such as instruction cache space. SMT also has this problem, but it's tolerable when you have few threads.

So both of SIMT's key ideas – SIMD-style instruction broadcasting and SMT-style massive threading – are incompatible with unrelated threads.

Related threads – those sharing code and some of the data – could work well with massive threading by itself despite divergence. It's instruction broadcasting that they fail to utilize, leaving execution hardware in idle state.

However, it seems that much of the time, related threads actually tend to have the same flow and no divergence. If this is true, a machine with massive threading but without instruction broadcasting would miss a lot of opportunities to execute its workload more efficiently.

Synchronization

In terms of programming model, SMT is an extension to a single-threaded, time-shared CPU. The same fairly rich set of inter-thread (and inter-device) synchronization and communication options is available with SMT as with "classic" single-threaded CPUs. This includes interrupts, message queues, events, semaphores, blocking and non-blocking system calls, etc. The underlying assumptions are:

  • There are quite a lot of threads
  • Typically, each thread is doing something quite different from other threads
  • At any moment, most threads are waiting for an event, and a small subset can actually run

SMT stays within this basic time-sharing framework, adding an option to have more than one actually running thread. With SMT, as with a "classic" CPU, a thread will very typically be put "on hold" in order to wait for an event. This is implemented using context switching – saving registers to memory and, if a ready thread is found, restoring its registers from memory so that it can run.

SIMT doesn't like to put threads on hold, for several reasons:

  • Typically, there are many running, related threads. It would make the most sense to put them all on hold, so that another, unrelated, equally large group of threads can run. However, switching 16K of context is not affordable. In this sense, "registers" are expensive after all, even if they are actually memory.
  • SIMT performance depends greatly on there being many running threads. There's no point in supporting the case where most threads are waiting, because SIMT wouldn't run such workloads very well anyway. From the use case angle, a lot of waiting threads arise in "system"/"controller" kind of software, where threads wait for files, sockets, etc. SIMT is purely computational hardware that doesn't support such OS services. So the situation is both awkward for SIMT and shouldn't happen in its target workloads anyway.
  • Roughly, SIMT supports data parallelism – same code, different data. Data parallelism usually doesn't need complicated synchronization – all threads have to synchronize once a processing stage is done, and otherwise, they're independent. What requires complicated synchronization, where some threads run and some are put on hold due to data dependencies, is task parallelism – different code, different data. However, task parallelism implies divergence, and SIMT isn't good at that anyway – so why bother with complicated synchronization?

Therefore, SIMT roughly supports just one synchronization primitive – __syncthreads(). This creates a synchronization point for all the threads of a block. You know that if a thread runs code past this point, no thread runs code before this point. This way, threads can safely share results with each other. For instance, with matrix multiplication:

  1. Each thread, based on its ID – x,y – reads 2 elements, A(x,y) and B(x,y), from external memory to the on-chip shared memory. (Of course large A and B won't fit into shared memory, so this will be done block-wise.)
  2. Threads sync – all threads can now safely access all of A and B.
  3. Each thread, depending on the ID, multiplies row y in A by column x in B.

//a sketch of what each thread in one block runs; A and B are tiles in on-chip shared memory
//1. load A(x,y) and B(x,y)
int x = threadIdx.x;
int y = threadIdx.y;
A[stride*y + x] = extA[ext_stride*y + x];
B[stride*y + x] = extB[ext_stride*y + x];
//2. sync
__syncthreads();
//3. multiply row y in A by column x in B
float prod = 0;
for(int i=0; i<N; ++i) {
  prod += A[stride*y + i] * B[stride*i + x];
}

It's an incomplete example (we look at just one block and ignore blockIdx, among other things), but it shows the point of syncing – and the point of these weird "multi-dimensional" thread IDs (IDs are x,y,z coordinates rather than just integers). It's just natural, with 2D and 3D arrays, to map threads and blocks to coordinates and sub-ranges of these arrays.
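
For instance, a launch of a kernel built around the sketch above could hand out those 2D IDs like this (the kernel name matmul, its arguments and the 16x16 block size are placeholders of mine):

dim3 block(16, 16); //threadIdx.x and threadIdx.y range over 0..15
dim3 grid(width/16, height/16); //blockIdx.x, blockIdx.y; dimensions assumed to divide evenly
matmul<<<grid, block>>>(extA, extB, out);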

Summary of differences between SIMD, SIMT and SMT

SIMT is more flexible than SIMD in three areas:

  1. Single instruction, multiple register sets
  2. Single instruction, multiple addresses
  3. Single instruction, multiple flow paths

SIMT is less flexible than SMT in three areas:

  1. Low occupancy greatly reduces performance
  2. Flow divergence greatly reduces performance
  3. Synchronization options are very limited

The effect of flexibility on costs was discussed above. A wise programmer probably doesn't care about these costs – rather, he uses the most flexible (and easily accessible) device until he runs out of cycles, then moves on to utilize the next most flexible device. Costs merely limit how flexible a device can be and still be available in a given situation.

"SIMT" should catch on

It's a beautiful idea, questioning many of the usual assumptions about hardware design, and arriving at an internally consistent answer with the different parts naturally complementing each other.

I don't know the history well enough to tell which parts are innovations by NVIDIA and which are borrowed from previous designs. However, I believe the term SIMT was coined by NVIDIA, and perhaps it's a shame that it (apparently) didn't catch on, because the architecture "deserves a name" – not necessarily true of every "new paradigm" announced by marketing.

One person who took note is Andy Glew, one of Intel's P6 architects – in his awesome computer architecture wiki, as well as in his presentation regarding further development of the SIMT model.

The presentation talks about neat optimizations – "rejiggering threads", per-thread loop buffers/time pipelining – and generally praises SIMT superiority over SIMD. Some things I disagree with – such as the "vector lane crossing" issue – and some are very interesting, such as everything about improving utilization of divergent threads.

I think the presentation should be understandable after reading my oversimplified overview – and will show where my overview oversimplifies, among other things.

Peddling fairness

Throughout this overview, there's this recurring "fairness" idea: you can trade flexibility for performance, and you can choose your trade-off.

It makes for a good narrative, but I don't necessarily believe it. You might very well be able to get both flexibility and performance. More generally, you might get both A and B, where A vs B intuitively feels like "an inherent trade-off".

What you can't do is get A and B in a reasonably simple, straightforward way. This means that you can trade simplicity for almost everything – though you don't necessarily want to; perhaps it's a good subject for a separate post.

An unusual hardware architecture: APA (Associative Processing Array)

We're living in the golden age of hardware design – it probably won't get any cheaper to make a new chip. Yes, there's the downside that it's harder to be wildly incompatible with everything than in the old days – there's plenty of standard interfaces to support. On the other hand, it's a benefit, not just a drawback – make a chip with a CPU that runs C, an Ethernet controller and a DRAM interface, and it's now usable with plenty of software and hardware developed by others.

And then for the wilder, "incompatible" innovations – if you really want to invent interfaces and not just implementations – there's the "accelerator" realm: an MPEG decoder, a DSP, a GPU. A decade ago, they said that "standard" microprocessors killed all high-performance architectures. Today, a system that kept the "fastest computer in the world" title for almost a year is based on GPUs – and "non-standard"/"incompatible" accelerators are everywhere (not to mention SIMD instruction set extensions right inside the otherwise "compatible"/"commodity" CPUs).

Still, while it indeed became way cheaper to make your own chips since you don't have to erect your own fab to do that, "cheap" still means costs somewhere in the millions – not quite what the word "cheap" brings to mind. And mistakes still can't be corrected, not really. Which results, among other things, in an understandable risk aversion with respect to design.

A language geek, who's naturally curious about programming languages that don't look mainstream (C with classes), is going to meet many such languages, and plenty of implementations to play with. A hardware geek, who's equally curious about designs that don't look mainstream (RISC/VLIW with bells and whistles), is going to find few such designs and very few implementations. Risk aversion is innovation aversion.

APA thus combines two traits that are rare – it gets at least 9 out of 10 in the "non-mainstream" category, and implementations were actually manufactured and shipped, specifically, in early smartphones by NeoMagic (BDTI and EETimes articles are some of the architecture overviews). The phones apparently weren't very successful, but it seems a poor measure of the architecture's merit in this case, just because there are so many ways to fail regardless of your accelerator quality. So the platform's demise gives us not evidence but a lack of evidence (the platform wasn't widely targeted by 3rd party developers whose opinions could be interesting).

Overview

APA stands for Associative Processing Array, a kind of content-addressable memory. Just memory. How do you compute using just memory? The operations are done right there, near the cells where your bits are kept. No need to read your data word by word into a processor, then write it back – rather, your operations run in parallel, inside the memory, on every data word! SIMD operations on steroids.

What operations? Bitwise operations – masked compare and masked write:

  • Compare: tag[i] = (word[i]&mask)==(compare_data&mask)
  • Write: if(tag[i]) word[i]=(word[i]&mask) | (write_data&~mask)

…where a tag bit is kept for every data word (row), and mask & compare/write data are broadcasted to all rows. You can also move words – either to the adjacent rows or to rows at distance 8 (or, say, 16, depending on the hardware configuration – but not at an arbitrary distance). And you can move the data in or out, sequentially.
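
To make the two primitives concrete, here's a toy, sequential software model of them (the names, the 512 rows and the 32-bit row width are all mine – a real APA row is much wider, and all rows are updated in parallel rather than in a loop):

#include <stdint.h>

enum { ROWS = 512 };
uint32_t word[ROWS]; //one (truncated) row of the array per entry
uint8_t  tag[ROWS];  //one tag bit per row

//Compare: tag[i] = (word[i]&mask)==(compare_data&mask)
void apa_compare(uint32_t mask, uint32_t compare_data) {
  for(int i=0; i<ROWS; ++i)
    tag[i] = (word[i] & mask) == (compare_data & mask);
}

//Write: if(tag[i]) word[i]=(word[i]&mask) | (write_data&~mask)
void apa_write(uint32_t mask, uint32_t write_data) {
  for(int i=0; i<ROWS; ++i)
    if(tag[i]) word[i] = (word[i] & mask) | (write_data & ~mask);
}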

Where's the code – who decides what to write, compare and move in what order? One option is to use a low-end CPU running "normal" code to control the contents of the compare/write data & mask registers, to issue APA operations and to interact with the host system (the "real" CPU and external memory). This little processor isn't "accelerating" computations by itself, just controls the APA accelerator hardware.

That's it. No add, no multiply, no nothing. How's that for non-mainstream?

So, how do you actually do anything with this SIMD-on-steroids machine – say, add two vectors of numbers? Well, you need to have both vectors in the array – each element pair occupying some of the bits of a row (for instance, the NeoMagic 512-row APA had 160 bits per row, so for 16 bit vectors, you'd use 32 bits out of 160). Then you perform the addition bit by bit, using masked compare and write operations.

That's right, addition done entirely in software. Latency: 48 cycles for 16 bits, throughput: >10 16b additions/cycle for NeoMagic's 512-row APA. More latency for 32 bits, less for 8 bits, still less for 3 bits. Which means optimization opportunities you're not used to having in software (you gain no speed-up using only 3 bits of a 32b CPU register for addition). On the other hand, it means large penalties for high precision (floating point takes thousands of cycles).
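
To show the principle, here's a deliberately naive bit-serial addition built on the apa_compare/apa_write toy model above. The field layout (operand A in bits 0..7 of each row, operand B in bits 8..15, the carry in bit 16, the sum overwriting A) is my own, and it spends far more operations per bit than the 48-cycles-per-16-bits figure implies – the real sequences are cleverer – but it shows how compare/write passes add numbers:

//add the 8-bit field B into the 8-bit field A in every row, bit by bit
void apa_add8(void) {
  uint32_t c = 1u << 16;   //the carry bit of each row
  apa_compare(0, 0);       //a zero mask matches everything: tag all rows...
  apa_write(~c, 0);        //...and clear their carry bits
  for(int k=0; k<8; ++k) {
    uint32_t a = 1u << k, b = 1u << (8+k);
    uint32_t m = a | b | c; //the three bits examined by each compare
    //of the 8 (a,b,carry) combinations, only 4 change any bits; they're ordered
    //so that no write creates a pattern matched by a later compare in this step
    apa_compare(m, c);   apa_write(~(a|c), a); //(0,0,1): sum=1, carry=0
    apa_compare(m, a|b); apa_write(~(a|c), c); //(1,1,0): sum=0, carry=1
    apa_compare(m, a|c); apa_write(~(a|c), c); //(1,0,1): sum=0, carry=1
    apa_compare(m, b);   apa_write(~(a|c), a); //(0,1,0): sum=1, carry=0
  }
}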

Tools and libraries would be supplied with an APA implementation, but the exciting thought for a programmer is to be able to bypass them where needed and get down to bit-level manipulation! (The exciting thought for a decision maker would be to always use the tools, never pay someone for the bit fiddling; well, different people get their excitement from different things.)

Pros & cons

In a way, it's an exceedingly generic and elegant architecture, and a very impressive one. One "gets it" like one doesn't get any other. Just try to describe any other sort of programmable machine to a level of detail sufficient for accurate predictions about performance.

Surprisingly – or, perhaps, unsurprisingly given just how unusual this machine is – I simultaneously also don't get it like any other. I'm not used to thinking of computations that way, both in terms of precision and in terms of data access. I have no idea what share of my 8-bit numbers really only have 5 or 2 significant bits, and it can matter a lot here.

Likewise, even for algorithms that don't require random access – and APA is simply not good for random access – I have little idea of just how non-random my access is. That is, similarly to the precision case where constants I'm not used to caring about matter, there's a difference between accessing a row at distance 1 and a row at distance 5, and I don't know how frequent different distances between adjacent things are in a code base I care about.

So while in the abstract, a shorter than ever architecture description is sufficient here for accurate performance predictions, I lack intuition and experience that would make such predictions easy.

A great thing about the APA design is its "hardware friendliness". The typical processor design involves a truckload of convoluted circuits that are likeable for their function, but unimaginable to the human mind as physical objects – so cumbersome, infinitely configurable, finicky tools are used to translate a functional description of said circuits to actual electric circuits occupying actual space.

The typical hardware design is not hardware friendly (sounds strange – but the average piece of software isn't very considerate towards its target processor and software stack, either). The APA, on the other hand, has a natural spatial mapping with its 2 dimensions (row bit width and the number of rows) and regular connections between just the adjacent or relatively close rows.

Again, surprisingly – or unsurprisingly given how unusual the APA is – this is also a drawback, in the sense that the standard hardware design tools will do a very poor job implementing APA. For that matter, standard tools will likely do an even worse job implementing plain RAM – which also has a natural spatial mapping and regular structure.

RAM is analog design, not digital: rather than describing it on a functional level, someone implements it as a physical circuit description. Basically, RAM is a library for digital tools, like flip-flops or simple gates – a library that can't be implemented using the tool itself.

I really appreciate the von Neumann architecture, but it's scary to realize just how strong the von Neumann grip really is. RAM is basically the only interface to a large array of bits that is relatively easily accessible to a hardware designer. (Not that it's very easily accessible, mind you – there is no portable interface for RAM, believe it or not, but no matter how you build your chips, you'll be able to get some sort of RAM).

What happens if you want your own bag of bits instead of RAM? You can do what they call full custom design – make your own physical circuit description. This isn't "portable", where "porting" means changing the manufacturing process – either to move from 65nm to 40nm, or from one manufacturer to another. It's also a longer and more complicated process. But it's certainly doable, especially if you outsource it to the inventors, as you'd have to do anyway because the thing is patented.

The hostility to hardware design tools (despite the friendliness to the basic constraints of hardware) and patent protection make APA more of a possible off-the-shelf solution than inspiring source of ideas, and so does the design simplicity. Unlike, say, VLIW, which is a design style with basic ideas in the public domain and countless possible variations and extensions (starting with the "let's add this one spiffy instruction" variety), APA is more or less complete. There are important constants to tweak – row width and the distance to easily accessible neighbors – but while it could be a hard choice to make, it's not very creative. (I do believe it could be a source of ideas if only through expanding one's horizons, which is why I write about it.)

If we attempt to discuss the efficiency of APA qualitatively, in terms of hardware resource utilization rather than throughput per mm^2 or per mW for a given app, three things come to mind:

  • Benefit - a perennial bottleneck of accessing memory through a bus very narrow compared to the amount of bits stored is eliminated. This "ought to be good" for algorithms where access is "far from random" since these gain nothing from conventional RAM's flexibility but pay the full price of its low throughput.
  • Drawback - the cycle is "wasted": values are read from flip-flops, undergo very few transformations and are written back. This "ought to be bad" because it won't translate to high frequency (you can't read then write a flip-flop very fast – and doing this with them all at once dissipates power, so in fact low frequencies enabled by parallelism, not high frequencies enabled by circuit simplicity are APA's potential advantage in the frequency department). A competing design running at a similar frequency, however, will do much more per cycle (like actual addition) because circuits implementing combinatorial logic are fast. The question thus becomes how big those circuits are compared to flip-flops: if enough APA rows can be packed instead, the throughput will still be competitive despite the abysmal latency – provided that you simultaneously process sufficiently large amounts of data.
  • Benefit - ability to use DRAM cells. This "ought to be good" since DRAM cells are much smaller than normal flip-flops (which is achieved through having them leak their charge so they have to be periodically refreshed). AFAIK, nobody uses them for computation directly because they don't fit in traditional hardware models – neither as registers (that's plain dumb) nor as local RAM (they have poor latency and imply a cumbersome controller to access as a RAM). An APA, on the other hand, could potentially run well on DRAM cells if the required simple circuitry were implemented near the cells. One problem with this is that custom chips may be easy to make these days, but not custom DRAM chips. In particular, Andy Glew, formerly from Intel then AMD, said at some place that it's "hard to influence DRAM makers", and if it's hard for Intel or AMD, well, it's probably hard in general.

Overall, I have a lot of reservations here – not just because of the way originality by itself seems to complicate matters here (I'd hate to admit it as the only reason…), but because it's a step in the opposite direction to what successful SIMD systems are doing. That is, you get high throughput given unusually low precision and unusually restricted data access patterns. A DSP from the late 90s would attempt to process numbers with more bits and would let you fetch them from more places. A GPU from the 2000s still more so – floating point numbers, parallel random access with transparent contention handling (at least in CUDA GPUs). It seems that the direction is removing restrictions on the set of things you run fast, not very high throughput for a very restricted set.

On the other hand, there exist algorithms with low precision and high locality. And it's exciting to see an architecture which is not RAM-based, naturally represented in space, etc. – all those things which get people excited in hardware discussions – but practical enough for a real world delivery, and with some things connecting it to more usual programming models (for instance, SIMD commands broadcasted from a CPU instead of clever local rules you have no idea how to come up with as in cellular automata). So it's definitely very interesting.

Update: as a better informed commenter pointed out, APA is also known as CAPP – Content Addressable Parallel Processor, and implementations date back to as early as 1972 (which makes you wonder what could legitimately remain outside the public domain by now). This seems to be evidence of having failed the test of time – or of having some really deep trouble with commercialization. I'd be curious to hear a software developer's experience with this – BDTI's article from 2003 talked about the development experience in hypothetical terms and entirely in the future tense, whereas some past evidence has to be available.