Entries Tagged 'software'

Error codes vs exceptions: critical code vs typical code

Error codes or exceptions – which is better? Here's my answer:

  1. They have the same worst case – a human error can lead to a complete disaster.
  2. Exceptions are far safer for most code.
  3. Error codes are far safer for well-reviewed, critical code.

(As you can see from 2 and 3, I believe that most code is not critical and/or poorly reviewed; I think most people will agree on that one.)

Worst case: disaster

Here's a disaster with error codes (based on an example from a critique of Go's error handling):

seal_presidential_bunker()
trigger_doomsday_device()

If we fail to seal_presidential_bunker, we still trigger_doomsday_device, because the programmer forgot to check the error code. Human error has led to a disaster.

(The original article doesn't specify exactly what the disaster is. One problem is that the presidential staff is not safe. Another problem is that the doomsday device got triggered – which wouldn't happen if an exception were thrown and left uncaught. Which of the two problems is the bigger part of the disaster depends on your worldview.)

Here's a disaster with exceptions.

open_the_gate()
wait_for_our_men_to_come_in()
close_the_gate()

If wait_for_our_men_to_come_in throws an exception, then we'll never close_the_gate, and the enemy will sneak in. Again – human error, disaster.

So in theory, exceptions and error codes are equally bad.

Exceptions are safer for most code

Most code neither triggers doomsday devices nor deals with lethal enemies at the gates. When most code messes up, garbage appears on the screen or in log files, and a programmer shows up to debug the problem.

With exceptions, it's easier for the programmer to figure out why this garbage appeared, because the failure occurs closer to the point of the error.

f=open_users_file()
print_users_list(f)

If open_users_file() throws an exception, then the programmer will see a "No such file or directory" with a call stack and think, "why couldn't this idiot [possibly, me] bother to check if the file is there?" Then he fixes the bug and all is well again.

If open_users_file() returns an invalid file object (similarly to, for example, C++'s ifstream), then print_users_list (which doesn't check errors, either) might print an empty user list. The error might then become "No such user", or "Permission denied", etc. The program will fail further from the point of error – the file opening code – and you'll need to go back and figure out where the error is.

For production code, failing early isn't necessarily better. Failing early is what leaves the gate open for the enemies in the above example. Failing early due to a floating point error – instead of trying further just in case – was reportedly the root cause of the explosion of Ariane 5, costing $.5G.

But for most code, which:

  • doesn't lead to such large damages
  • is written rather hastily
  • is expected to have bugs
  • has programmers constantly attending to it and fixing those bugs

…for most code, failing early is better simply because it always makes debugging easier – even if it doesn't make the impact of the error smaller.

Error codes have another horrible anti-debugging quality: loss of information. Even if the program fails early with error codes, you usually only get the code of the topmost layer without all the details from below.

With an exception, you get a call stack, and an error string from the bottom layer. With a perror(), you get just an error string from the bottom layer ("No such file or directory" – which file? Who wants it?). With error codes, you get something like "User list management error" – the fact that it was a file opening error gets "swallowed" by layers of code converting low-level error codes to high-level ones.

It's possible to collect an "error code call stack" with all the information, but it's almost never done. Whereas an exception does it automatically for the laziest of programmers. Another win for whoever gets to debug the code.
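To make the loss of information concrete, here's a toy Python sketch (the path, the function names and the error code are all made up for illustration). The error-code version can only ever report the topmost layer's code, while the exception version hands whoever gets to debug it the whole story for free:

import os

USER_LIST_ERROR = -3  # the only thing the caller will ever see

def open_users_file_codes(path):
  if not os.path.exists(path):
    return None  # which file? why? the details are dropped right here
  return open(path)

def print_users_list_codes(path):
  f = open_users_file_codes(path)
  if f is None:
    return USER_LIST_ERROR  # "User list management error"
  print(f.read())
  return 0

def print_users_list_exceptions(path):
  with open(path) as f:  # raises an error carrying errno, the file name and a traceback
    print(f.read())

print(print_users_list_codes("/no/such/file"))  # prints -3; good luck
print_users_list_exceptions("/no/such/file")    # traceback names the file and the call site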

Error codes are safer for well-reviewed code

Code reviews are generally easier with error codes than exceptions. Error codes mean that you must carefully look at function calls to see if the programmer handled the possible errors. Exceptions mean that you must imagine what happens if an exception is thrown anywhere in the flow.

Making sure that the opening of every gate is exception-safe – that the gate gets closed when an exception is thrown – is hard. C++ has RAII for the gate-closing (and Python has with and C# has using), and Java has checked exceptions for the exception-hunting.
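For instance, here's roughly what the gate example looks like with Python's with (a sketch; the gate functions are the made-up ones from the example above, stubbed out so that the thing actually runs):

from contextlib import contextmanager

def open_the_gate(): print("gate open")
def close_the_gate(): print("gate closed")
def wait_for_our_men_to_come_in(): raise RuntimeError("ambushed")

@contextmanager
def gate_open():
  open_the_gate()
  try:
    yield
  finally:
    close_the_gate()  # runs even when an exception is thrown inside the with block

with gate_open():
  wait_for_our_men_to_come_in()  # throws - but the gate still gets closed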

But even if you have both and then some, it still seems hard. A program has a lot of intermediate states, and some of them don't make sense. An exception can leave you in this intermediate state. And it's not easy to wrap every entrance into an intermediate state using whatever exception-safety-wrapper your language gives you.

So I think it makes sense for Go – a language for writing critical production code – to shun exceptions. We use C++ with -fno-exceptions for serious production code and I think it's equally sensible.

It just doesn't make sense to write most of your code that way. In most of my code, I want to always fail early to make debugging easier and to see the full context of the error – and I want that to happen without putting much thought into error handling.

And this is why I think exceptions should be embraced by all lazy programmers writing low-quality code like myself.

Aren't side effects fundamental in complexity analysis?

I'm not that good at functional programming and the related discussion of the benefits and drawbacks of side effects. Please forgive me if I'm utterly wrong – or if I'm trivially right because I'm saying something elementary.

So the thing is, very basic complexity analysis that we all use assumes a von Neumann machine with a random access memory. Random access means that you can read an arbitrary array element, and that you can write it.

Suppose we throw away the ability to write array elements, which is a side effect. Instead, we replace it with the "functional" ability to create a new array having the same elements as the old one at all indexes except for one – the one we'd write if we had side effects. So we create an array like this: {old,…,old,new,old,…,old}.

In this sense, aren't we getting a fundamentally inefficient system? We took an O(1) operation and replaced it with an O(N) operation. We don't have to use the new O(N) operation every time we previously used the old O(1) one, of course – we can use an entirely different algorithm instead. But we must be losing something.
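In Python terms, the pure "update" looks something like this (a sketch, with tuples standing in for immutable arrays):

def pure_update(arr, i, new):
  # build (old,...,old,new,old,...,old) - touches all N elements
  return arr[:i] + (new,) + arr[i+1:]

a = (0,) * 8
b = pure_update(a, 3, 42)  # O(N) copying, vs a[3] = 42 which is O(1)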

For instance, how do you compute a histogram? With side effects, you have a loop over your elements doing histogram[array[i]]++ – this is O(N), where N is the number of elements.

Without side effects, we can create a new histogram at every step – this is O(N*K) where K is the number of histogram bins. Or we can sort the array – O(N*log(N)) – and then have an O(N) pass over the sorted array, consing every time a new element is found. Both options are asymptotically slower, not just programming-language-shootout slower. Did I miss an O(N) option?
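In code, the contrast looks roughly like this (a sketch; Python will happily let you mutate, the second version merely mimics the shape of the side-effect-free computation):

from itertools import groupby

def histogram_mutating(array, K):
  hist = [0] * K
  for x in array:  # O(N): one in-place increment per element
    hist[x] += 1
  return hist

def histogram_sorted(array):
  # no in-place updates: sort - O(N*log(N)) - then one O(N) pass
  # consing up (value, count) pairs for runs of equal elements
  return [(value, sum(1 for _ in run)) for value, run in groupby(sorted(array))]

print(histogram_mutating([1, 0, 1, 3], 4))  # [1, 2, 0, 1]
print(histogram_sorted([1, 0, 1, 3]))       # [(0, 1), (1, 2), (3, 1)]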

I assume that the people most likely to point out the drawbacks of side effects are also likely to care about asymptotic complexity, because both questions are very basic CS questions. What's the standard workaround for the loss of asymptotic efficiency, then?

One answer could be a smart compiler – smart enough to implement the logical creation of an updated histogram as an update to the memory keeping the old histogram.

But it seems that if we had a compiler that smart, then side effects are not a problem in the first place. An imperative program full of side effects can be trivially implemented as a recursive function taking a state (the entire state of the memory – one large array), creating a new state, and calling itself with this state.

The reason people object to side effects seems related to the fact that such a program is "functional" only in a pointlessly theoretical sense. The nominal lack of side effects gives you no actual added ability to analyze the data flow of such programs.

I haven't formally proved it by this example, of course, but it seems to me that a nominally functional operation of creating a new array just like the old one with a single modified element is not a very good one. It can sometimes be hard or impossible to elide – as in the "imperative program emulation" example – and when it can't be, the need to fall back to actually copying is just too disastrous.

I guess this is why functional languages have cons – make a new list from a new element, the head, and an old list, the tail – but don't have a similar "array consing" primitive for inserting a new element into a random position of an old array.

Now in the absence of a compiler smart enough for efficient "array consing", people seem to be using mutable data structures, such as Haskell's mutable arrays. This seems to forfeit some of the correctness guarantees gained by not having side effects. For example:

The freezeArray action converts a mutable array into an immutable one by copying, whereas unsafeFreezeArray returns an immutable array that is effectively just the type cast version of the mutable array. Should you write to the mutable array after it has been (unsafely) frozen, you'll side-effect the immutable array in the process. Please don't :-)

Bugs introduced by side effects coupled with aliasing – you modified something at the wrong time and someone else then references that something – are back.

So what's the point of shunning side effects if you're forced to have them anyway to do some of your computations asymptotically efficiently? Is the point having a clear separation between "pure" code and side-effecting code?

From a "social" more than a "formalistic" standpoint, such a separation indeed makes a lot of sense when "pure" code computes and side-effecting code does I/O. It's awesome to be able to tell pure computations from I/O, and it's actually a pity that imperative language typically allow people to litter computational code with I/O as easily as they do.

But if side effects are also necessary for computational code, then, again from a "social" standpoint, the pure/impure separation doesn't seem like a stable thing that is likely to correspond to how people think of boundaries between parts of the system. A piece of code that doesn't use histograms today might use them tomorrow, and then its purity will be gone.

Generally it seems like a whole lot of algorithms require writable RAM to be efficient. Hough transforms are "glorified histograms". A bunch of algorithms processing graphs or matrices update "rather random" indexes/counters iteratively, with no very clear way to somehow sort these updates so that they are sufficiently non-random to become list consing or similar.

I'm not saying pure functional languages aren't fun or productive for you – a very valid "social" argument in favor of purity. I just wonder how far purity can take you from a complexity analysis standpoint.

If it's not very far, or if it's not very clear how far, then why not have side effects "by default", and use data & control flow analysis to identify places where there are no side effects after all, or where their impact is limited enough (as in, i++ is a nominal side effect but a trivial one)?

In particular, how can purity be "the" way to safe parallelism if it robs you of asymptotic performance, and you're after parallelism to get performance?

[I'm not at all sure I get this right, so I'm asking a "real" question, not a rhetorical one.]

Passing shell script arguments to a subprocess

So I want to create a shell script that ultimately exec's a command, after doing something like, say, setting an environment variable first:

#!/bin/sh
export MY_VAR=MY_VAL
exec my_command $*

(The point of using `exec my_command` rather than plain `my_command` is to not leave a /bin/sh process waiting for my_command that shows up in pstree and makes my_command not see its "real" parent process and so on.)

Simple enough, except it doesn't work. If you run the script with `script "a b" c`, my_command's arguments will be a b c (that's three arguments instead of the original two).

(Update – as pointed out in the comments, "$@" instead of $* works fine, and is perfectly sufficient for my example where all the script does is setting an env var. "$@" isn't enough if you need to fiddle with the arguments – if you need to do that, read on, otherwise just use "$@".)

A common workaround seems to be, you iterate over the arguments and you quote them and then eval:

#!/bin/sh
export MY_VAR=MY_VAL
args=
for arg in "$@";
do
  args="$args '$arg'"
done
eval exec my_command $args

Not so simple, but works better: "a b" c will indeed be passed to my_command as "a b" c.

However, it doesn't work if the arguments contain single quotes. If you pass "'a'" (that's double quote, single quote, the character a, single quote, double quote), my_command will get plain a. If you pass "'a b'" (double, single, a, space, b, single, double), my_command will get two arguments, a b, instead of one, 'a b'.

What to do? One potential workaround is escaping quotes: replacing ' with \', etc. Perhaps someone with a sufficiently thorough understanding of the Unix shell could pull it off; personally, I wouldn't trust myself to take care of all the special cases, or even to fully enumerate them.

So instead, what works for me (or so I hope) is, instead of creating a string of values by concatenating arguments, I make a string of references to the arguments using the variables $1, $2, etc. How many of those are needed – is the last one $3 or $7? Ah, we can use $# to figure that out:

#!/bin/sh
export MY_VAR=MY_VAL
nargs=$#
args=
while [ $nargs -gt 0 ]
do
  args=""$$nargs" $args"
  nargs=`expr $nargs - 1`
done
eval exec my_command $args

This handsome code generates, given three arguments, the string "$1" "$2" "$3", and then evals it to get the three argument values, which apparently can't cause quoting complications that are dependent on the actual argument values. With five arguments, the string "$1" "$2" "$3" "$4" "$5" is generated, and so on. (You'd use more code if you wanted to mangle some of the arguments, which is the sole reason to do this sort of things as opposed to using "$@".)

If you're good at Unix and you know a less "\"\$$ugly\" $way" to do this, and/or a more correct one, do tell.

(Why am I writing shell scripts in the first place, you ask? There are reasons for this too, and reasons for the reasons; there always are.)

Update 2: according to a comment, $1-$9 work in sh, but $10 and up do not; they do work in bash, which is what you actually get when you ask for /bin/sh on some systems but not others.
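(For comparison – and this is swapping the tool rather than fixing the shell – the same wrapper in Python passes the arguments along as a list, so there's nothing to re-quote. A sketch, assuming my_command is on the PATH:)

#!/usr/bin/env python
import os, sys
os.environ['MY_VAR'] = 'MY_VAL'
# argv is passed as a list - "a b" stays one argument, quotes stay quotes
os.execvp('my_command', ['my_command'] + sys.argv[1:])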

I really ought to try harder to stay away from shell scripting. I mean, I know I shouldn't, but I keep coming back. I'm like those tribesmen around the world who can't resist the urge to drink alcohol despite having genes that make alcohol particularly addictive and bad for their health. I clearly don't have the genetic makeup that would make *sh reasonably harmless for me.

Why programming isn't for everyone

Today I learned about HyperCard, a system where you could implement a basic calculator in a few easy steps, one of them involving the following impressively English-like snippet:

on mouseUp
  get name of me
  put the value of the last word of it after card field "lcd"
end mouseUp

The article depicts HyperCard as a system making programming accessible to people who aren't professional developers. It is claimed that Apple likely killed off the product because it's inconsistent with its business model (roughly, devices bought to consume rather than to create).

I sympathize with the sentiment – I very much like stuff you can tinker with, and dislike business models discouraging tinkering. However, I don't think businesses have the power to prevent anything that works well for many people from happening. A conspiracy of typewriter manufacturers could never stop the PC.

This seems especially true with software, where huge systems can be built by volunteers in their spare time. If an idea works, if a software system wants to be built around it, it will be built.

Of course it may be the case that the time hasn't come for a programming system for non-developers. It's just my opinion that it never will come, not really. Why?

Not because you need much education to program. Very useful stuff can be built without knowing why optimal sorting is O(n*log(n)), or even what big O means.

Not because programming languages must have, or typically have, arcane syntax. As a kid, I found Pascal's somewhat English-like "begin" and "end" off-putting, and was greatly relieved to discover Algolish braces. How close to natural language programming syntax can get, and whether it's beneficial to go there at all, is IMO an irrelevant question. The fact is that programming languages can be very readable to people.

The main reason is that development leads to maintenance, and maintenance leads to suffering.

For example, if your program stores persistent data, and you want to change it, your changes to the program must be made so as to preserve the meaning of existing data. This part of development causes major pain everywhere, from video recording to financial databases to compiler construction. No amount of knowledge and no amount of support from the tools make this fun.

There are many other things. Everything in your program's environment is unstable and you must constantly update the program to keep up. Your program gets cluttered with options and you forget what does what. There are cases you didn't test – spaces in the names, empty data fields, reverse order of operations.

As a result, maintenance means dealing with misbehaving programs that eat data, send trash around, or simply make you wait for an hour and then watch them produce garbage.

This never ends and quickly stops being fun. When something useful can not be done quickly and isn't the average person's idea of fun, it becomes the business of professionals – or hardcore hobbyists indistinguishable from professionals. As a counter-example, many people like cooking in their spare time without necessarily getting close to the level of a chef or spending that much time cooking. Con Kolivas, on the other hand, could technically be called a "hobbyist", but he could be called a "professional" as well.

Maybe I'm wrong, maybe there are plenty of places where a sprinkle of logic – in textual form or graphical form or whatever form – can be figured out quickly, left alone and be useful ever after. It's just that it's usually the opposite with me. Every time I have a nice little idea it takes me 10x the time it "should" take to implement, and most things keep biting me once in a while for a long time.

Programming isn't for everyone because it is not fun to maintain what was fun to program.

Cycles, memory, fuel and parking

In high-performance, resource-constrained projects, you're not likely to suddenly run out of cycles – but you're quite likely to suddenly run out of memory. I think it's a bit similar to how it's easy to buy fuel for your car – but sometimes you just can't find a parking spot.

I think the difference comes from pricing. Processor cycles are priced similarly to fuel, whereas memory bytes are priced similarly to parking spots. I think I know the problem but not the solution – and will be glad to hear suggestions for a solution.

Cycles: gradual price adjustment

If you work on a processing-intensive project – rendering, simulation, machine learning – then, roughly, every time someone adds a feature, the program becomes a bit slower. Every slowdown makes the program a bit "worse" – marginally less useable and more annoying.

What this means is that every slowdown is frowned upon, and the slower the program becomes, the more a new slowdown is frowned upon. From "it got slower" to "it's annoyingly slow" to "we don't want to ship it this way" to "listen, we just can't ship it this way" – effectively, a developer slowing things down pays an increasingly high price. Not money, but a real price nonetheless – organizational pressure to optimize is proportionate to the amount of cycles spent.

Therefore, you can't "suddenly" run out of cycles - long before you really can't ship the program, there will be a growing pressure to optimize.

This is a bit similar to fuel prices – we can't "suddenly" run out of fuel. Rather, fuel prices will rise long before there'll actually be no fossil fuels left to dig out of the ground. (I'm not saying prices will rise "early enough to readjust", whatever "enough" means and whatever the options to "readjust" are –  just that prices will rise much earlier in absolute terms, at least 5-10 years earlier).

This also means that there can be no fuel shortages. When prices rise, less is purchased, but there's always (expensive) fuel waiting for those willing to pay the price. Similarly, when cycles become scarce, everyone spends more effort optimizing (pays a higher price), and some features become too costly to add (less is purchased) – but when you really need cycles, you can get them.

Memory: price jumps from zero to infinity

When there's enough memory, the cost of an allocated byte is zero. Nobody notices the memory footprint – roughly, RAM truly is RAM, the cost of memory access is the same no matter where objects are located and how much memory they occupy together. So who cares?

However, there comes a moment when the process won't fit into RAM anymore. If there's no swap space (embedded devices), the cost of an allocated byte immediately jumps to infinity – the program won't run. Even if swapping is supported, once your working set doesn't fit into memory, things get very slow. So going past that limit is very costly – whereas getting near it costs nothing.

Since nobody cares about memory before you hit some arbitrary limit, this moment can be very sudden: without warning, suddenly you can't allocate anything.

This is a bit similar to a parking lot, where the first vehicle is as cheap to park as the next and the last – and then you can't park at all. Actually, it's even worse - memory is more similar to an unmarked parking lot, where people park any way they like, leaving much unused space. Then when a new car arrives, it can't be parked unless every other car is moved – but the drivers are not there.

(Actually, an unmarked parking lot is analogous to fragmented memory, which is solved by heap compaction – at the cost of some runtime latency. But the biggest problem with real memory is that people allocate many big chunks where a few small ones could be used, and probably would be used if memory cost anything above zero. Can you think of a real-world analogy for that?..)

Why not price memory the way we price cycles?

I'd very much like to find a way to price memory – both instructions and data – the way we naturally price cycles. It'd be nice to have organizational pressure mount proportionately to the amount of memory spent.

But I just don't quite see how to do it, except in environments where it happens naturally. For instance, on a server farm, larger memory footprint can mean that you need more servers – pressure naturally mounts to reduce the footprint. Not so on a dedicated PC or an embedded device.

Why isn't parking like fuel, for that matter? Why are there so many places where you'd expect to find huge underground parking lots – everybody wants to park there – but instead find parking shortages? Why doesn't the price of parking spots rise as spots become taken, at least where I live?

Well, basically, fuel is not parking – you can transport fuel but not parking spots, for example, so it's a different kind of competition – and then we treat them differently for whatever social reason. I'm not going to dwell on fuel vs parking – it's my analogy, not my subject. But, just as an example, it's perfectly possible to establish fuel price controls and get fuel shortages, and then fuel becomes like parking, in a bad way. Likewise, you could implement dynamic pricing of parking spots – more easily with today's technology than, say, 50 years ago.

Back to cycles vs memory – you could, in theory, "start worrying" long before you're out of memory, seeing that memory consumption increases. It's just not how worrying works, though. If you have 1G of memory, everybody knows that you can ship the program when it consumes 950M as easily as when it consumes 250M. Developers just shrug and move along. With speed, you genuinely start worrying when it starts dropping, because both you and the users notice the drop – even if the program is "still usable".

It's pretty hard to create "artificial worries". Maybe it's a cultural thing – maybe some organizations more easily believe in goals set by management than others. If a manager says, "reduce memory consumption", do you say "Yes, sir!" – or do you say, "Are you serious? We have 100M available – but features X, Y and Z are not implemented and users want them!"

Do you seriously fight to achieve nominal goals, or do you only care about the ultimate goals of the project? Does management reward you for achieving nominal goals, or does it ultimately only care about real goals?

If the organization believes in nominal goals, then it can actually start optimizing memory consumption long before it runs out of memory – but sincerely believing in nominal goals is dangerous. There's something very healthy in a culture skeptical about anything that sounds good but clearly isn't the most important and urgent thing to do. Without that skepticism, it's easy to get off track.

How would you go about creating a "memory-consumption-aware culture"? I can think of nothing except paying per byte saved - but, while it sounds like a good direction with parking spots, with developers it could become quite a perverse incentive…

Coding standards: is consistency prettier than freedom?

Different projects have different coding standards, and some have none at all. How does it affect the quality of code and developers' well-being? What results can we reasonably expect from a style guide?

Let's have a look at the effect of style guides in the real world. Here's what Jerusalem looks like:

These similar-looking buildings are near the city center. Here's a shot of the suburbs:

Same stuff, pretty much. White buildings, red roof tiles – or plain flat roofs.

And now for something completely different:

This is Tel Aviv. Buildings don't look similar to each other here. Nor do different parts of the city:

As you can see, in Tel Aviv, there's no style guide – everyone builds stuff to suit their own taste.

In Jerusalem, on the other hand, buildings have to be covered with Jerusalem stone, giving them their trademark off-white color. Jerusalem owes its visual consistency to a century-old style guide enforced by municipal laws.

Here are a few observations – relevant to most style guides, I think:

  • Consistent style is either enforced or lacking. Whatever virtues freedom may have, consistency of style is not one of them.
  • Consistent style is functionally inconsequential. Buildings in Jerusalem are about as safe and comfortable as buildings in Tel Aviv.
  • Psychologically, style does matter. Many people hate visiting Tel Aviv because it's ugly.
  • Whether consistent style is more beautiful is debatable. Many other people hate visiting Jerusalem because it's ugly.
  • People will defeat stylistic consistency despite the style guide. Here's an example from one of Jerusalem's suburbs, Ramot Polin – "a housing project for honeybees":

This is consistent with the style guide, but not very consistent with the actual look of other buildings – nor does it look very comfortable. Leading to my last observation:

  • A common style can be codified and enforced, but common sense can't be. Municipal law mentioned "off-white", but who would have thought to mention "rectangular"?

A sensible style guide is your one and only way to achieve consistent style – and not much else.

What if a style guide is not sensible? Here's a building from Tirana, Albania:

Here's another one:

Yep, that's the style guide over there – bright colors over ugly buildings. And there's nowhere to hide from the consistent style.

Maybe you actually love this style, and hate Jerusalem's uniform off-white. My point is that either way, the consistent style of one of those cities leaves no place for you to like.

Tel Aviv, on the other hand, has a place to like for both Tirana lovers and Jerusalem lovers. Off-white houses with red rooftops? Neve Tzedek has what you want:

Buildings painted in primary colors? Here's a hotel for you:

Personally, I still prefer Jerusalem though. Consistent style is better – if you like that particular style.

Requiring limestone vs banning asbestos

Can coding standards be described as style guides, or are they more than that?

The Google C++ Style Guide suggests that it is in fact more than a style guide:

The term Style is a bit of a misnomer, since these conventions cover far more than just source file formatting.

The document goes on to say that, apart from "enforcing consistency", it also "constrains, or even bans, use of certain features" – "to avoid the various common errors and problems that these features can cause".

"Enforcing consistency" does sound similar to requiring limestone – there's no direct functional impact. But "banning features to avoid problems" sounds more like banning asbestos – very much because of its functional impact, which can include cancer.

However, language features are different from building materials. Asbestos was discovered, not designed, and they couldn't know it'd cause cancer. C++ RTTI was designed and approved as a standard by strong programmers, who had in mind some cases where they thought it'd be useful.

RTTI is banned by the Google Style Guide, not the way asbestos is banned by regulations, but the way some sculptors forbid their students from using their fingers when shaping the fine details of clay. Learn to use proper sculpting tools – then do use fingers if necessary:

A query of type during run-time typically means a design problem. …you may use RTTI. But think twice about it. :-) Then think twice again.

Think four times and you'll be allowed to use RTTI. Think 1024 times and you're still not allowed to use asbestos in a housing project. That's because construction standards include functional considerations, but coding standards ultimately discuss style and style alone.

That's why the strictest coding standards allow exceptions. And that's why every banned feature is sometimes better than the proposed alternative.

Readability through inconsistency

Style guides enforce consistency. In the real world, we've seen that consistent style matters psychologically. In programming, people also advocate consistency for psychological reasons:

It is very important that any programmer be able to look at another's code and quickly understand it. Creating common, required idioms and patterns makes code much easier to understand.

Psychological reasons are important – but there are symmetric psychological arguments for inconsistency.

For example, required idioms can in fact make code easier to understand – or harder. Let's look at idioms actually required in some style guides:

  • The use of C++ "algorithms" such as std::for_each and std::transform instead of decade-old "patterns" called loops. I expect the idea to become widespread again, together with C++11 lambdas. Here's TheRegister's take on the impact on readability.
  • Yoda conditions: if(5 == num). This page – first Google hit for "Yoda conditions" at the time of writing – lists only benefits and no drawbacks, and proposes to add this to your style guide. Will code become more readable though? They're Yoda conditions! "If num is five" is how you always say it in English (and Hebrew, and Russian). If five is num, read as natural your code will not.

Of course my opinion on the readability of these patterns is debatable – which is precisely my point. Once a style guide is chosen, some people will experience the joy of fluent reading every time they hit if(5==num). Others will experience the pain of a mental roadblock – also every time.

A style guide will have something to dislike for everyone. When tastes are sufficiently different, the average amount of cringes per person stays the same under consistent style – and the variance rises (someone will hate a particularly popular mandatory pattern).

It's like keeping wealth constant and increasing inequality – something not even a political party would advocate. How is this psychologically a win?

But let's go ahead and assume that the "required idioms" suit everyone's taste, and, by themselves, actually make code easier to understand. Still:

  • External libraries will not follow your style guide. They follow style guides of their own. And this inconsistency can improve readability. Code using the library stands out, and the library's style can match the accepted domain-specific conventions better than your local style. In computer vision, X is the real world coordinate, x is the pixel coordinate – contrary to many software style guides.
  • You can't count on stylistic conventions, because there are exceptions. Google's code orders parameters such that inputs come first, but memcpy & snprintf don't. You either have to look out for exceptions or risk misunderstandings by blindly assuming consistency.
  • Different people think differently, even if their code looks the same. I find it easier to understand programmers' intents through their unique style. When they're all forced to write superficially similarly, I can't tell who wrote what, and what the subtext of the code is.

I'll illustrate the last point with a couple of examples. I knew O.M. before I ever saw him and before I even knew his name. To me, he was the programmer with the two spaces before the trailing const:

inline int x()  const;

I knew him through his code: mathematically elegant, obsessive about fine details of type-based binding and modeling. I could guess what he left out with an intent to maybe add it later. I understood him.

Likewise, I can always spot G.D.'s code by the right-leaning asterisk:

int *p,*q=arr+i;

G.D. certainly couldn't care less about types – similarly to most people with this asterisk alignment. I know his code: terse, efficient, to the point. I know what to expect.

Who wrote this code?

camelCaseName = longerCamelCaseName-camelCaseName;

I dunno, the collective unconscious wrote it. Anyone could write it – or several people patching after each other. I fail to identify with the author and guess his intent – it could be anyone. The code has no smell or taste to me.

This can sound a bit crazy – "does he actually advocate that everyone develop as uncommon a style as possible as a way to mark his trail"? Of course I don't mean that.

I think what I'm saying can make more sense if you think of style and taste in code as analogous to the taste of food. Of course it's ridiculous to expect every restaurant to make food with a unique taste. Many people like pizza, many people know how to make pizza – so expect much similarly-tasting pizza around.

But we wouldn't like to always eat food cooked to the same spec by restaurants in a single franchise. If someone knows how to make food with a unique taste, we welcome it.

And unique taste of your food doesn't indicate that it's bad for your health. Moreover, food with a familiar taste can be very unhealthy – much of it is.  Code in a familiar style has a comforting look – sometimes misleadingly.

I don't think requiring the same taste everywhere is how you improve the health of a code base.

Popular demand

I'm naturally inclined to argue against coding standards, because it feels like bureaucracy unthinkingly imposed on the work process from above. However, what if it's not imposed from above – what if the programmers themselves want it?

Many certainly do. Many good ones do.

Incidentally, while I was writing this, I stumbled upon an article arguing for consistency – for the very reasons I use to argue against it:

If code isn’t written in a consistent style in your team, whenever you come across code with the spacing a bit wrong, the first thing your head’s going to process is "I didn’t write this."

…Exactly why I like personal style! I really didn't write it – it's his code, I want to understand him, and his style helps greatly.

This is a natural feeling, and as we all know coders have a hard [time] to restrain impulse to rewrite any piece of code they didn’t write.

I understand that impulse very well. Personally, I hate new food more than anyone I know – I eat the same thing every day, for months, for years. I do prefer Jerusalem to Tel Aviv. And I like consistent style – especially my own.

But why should others be force-fed my favorite cabbage salad? Is consistency that much prettier than freedom?

Team spirit

I don't argue with people who favor a consistent style – and I can't. The article above is nostalgic about a team that followed a style guide "religiously". How can nostalgia be refuted? Clearly, consistent style can create a unique team spirit that to some is valuable and memorable.

All I can and do argue is that a lack of a style guide can also create a good atmosphere that is preferred by some other people.

In fact I believe it is exactly team spirit that a style guide – or lack thereof – actually affects. All other effects come indirectly through the impact on atmosphere. Here's my attempt at taking the social effect into account:

  • I won't break established conventions. Following conventions is a great way to say, "I respect the local tradition and the wisdom it embodies." And I'll thoroughly enjoy the limestone covering our code – I like consistency. Yay, a beautiful code base! And if the convention is really bad – I hopefully won't need to join the team in the first place.
  • However, I won't establish and enforce conventions when I'm the one starting with a clean slate. Having no conventions is a great way to say, "join my project to express yourself without artificial constraints".

From team spirit to grassroots bureaucracy

Did you know Ken Thompson is not allowed to check in code at Google? He said so in his Coders at Work interview:

Seibel: I know Google has a policy where every new employee has to get checked out on languages before they're allowed to check code in. Which means you had to get checked out on C.

Thompson: Yeah, I haven't been.

Seibel: You haven't been! You're not allowed to check in code?

Thompson: I'm not allowed to check in code, no.

The programmer who co-created Unix is not allowed to check in code. If this isn't a bureaucracy, what is? But it's inevitable – with rules, you always paint yourself into a corner that way. Allow some to break the conventions, and you will have offended everyone else. Use the same rules for everyone – and some won't contribute.

Of course, Thompson does contribute to Google. Well, Google has plenty of ways to motivate him to do that. And, apparently, they have good programmers willing to write production code based on his uncommitted prototype code.

But we're not all like Google that way. We can't all afford the inevitable laughable outcomes of bureaucracies. If you come across an original programmer with an off-beat style, do you want him to join your project or to move on?

Grassroots bureaucracy is still bureaucracy. I wouldn't object to an established bureaucracy that people claim to value. But I wouldn't establish a new one, either.

Conclusion

A style guide can make code look prettier to some – and uglier to others – but not tangibly better, except if programmers enjoy it so much as to produce better code.

This is equally true for the lack of a style guide. The results of freedom look prettier to some and uglier to others.

Personally, I believe that one rule too few is better than one rule too many, so I don't bother to enforce a common style.

P.S.: when I become a neat-freak

Generally, I tend to enforce little to no conventions. Here are the exceptional situations where I actually enter the ridiculous position of telling people how to write their code.

Interfaces should be consistent

I don't care if an interface looksLikeThis or like_this. But if it uses both, then I'll ask the author to change it to one of the styles – the one which came first. For users to feel confident, an interface should look well thought-out – which implies an illusion of a single author, which implies a consistent style.

By "interface", I only mean the outermost stuff called by module users. Internal functions, classes, etc. can look like Tel-Aviv as far as I'm concerned. For instance, in a simple server, the "interface" to me is just the protocol, and nothing in the code itself.

Warnings should be errors

I hate the concept of compiler warnings – I link it to the concept of guilt. "Fine, be that way, do this evil implicit conversion thing – but if something happens, I will have told you so." Why should we put up with such manipulative behavior? Pick a position – refuse to compile, or compile silently.

However, in practice, compatibility issues make what has to be errors into warnings. If it compiled 30 years ago, it has to keep compiling, even if nobody wants it to compile in new code. Even if compilers could always prove the code wrong, but initially didn't bother, and it happens to work in old programs.

So, to the dismay of freedom-lovers, I turn warnings into errors where I can (as in -Werror) – even if I can't cherry-pick the "right" warnings. Mainly because when a file generates 10 (false) warnings, the eleventh (truly useful) warning goes unnoticed.

Another reason is that warnings cause guilt – they are evil, and must be destroyed. If I didn't destroy them by turning them into errors, I'd have to destroy them by disabling them. Then I'd never get that eleventh useful warning.

But except for these two pet peeves – interfaces and warnings – I do think grown-up programmers should be left alone.

P.P.S.: Greetings from the Overextended Metaphor Parrot

Originally, I had a few references to Bauhaus architecture in the text – how Tel Aviv has buildings in the Bauhaus style and Jerusalem doesn't because of its limestone requirement, and how Ken Thompson's inability to commit code at Google is analogous to that. However, as a commenter pointed out, there are buildings in the Bauhaus style in Jerusalem – in my own neighborhood, actually, so I obviously walked past them plenty of times.

I guess this goes to show that good architects have no trouble complying with a style guide – and that I shouldn't overextend metaphors in areas where I'm not minimally competent.

Machine code monkey patching

A monkey patch is a way to extend or modify the runtime code of dynamic languages (e.g. Smalltalk, JavaScript, Objective-C, Ruby, Perl, Python, Groovy, etc.) without altering the original source code.

Wikipedia

For example, the Python code:

# someone else's class
class TheirClass:
 def their_method(self):
  print "them"
obj = TheirClass()
obj.their_method()

# our function
def our_function(self):
 print "us"

# the monkey patch
TheirClass.their_method = our_function
obj.their_method()

…will print:

them
us

…showing that we have changed the behavior of TheirClass objects, including those we didn't create ourselves. Which can't be done with more civilized techniques like inheritance.

Here's how you can monkey patch machine code, assuming the machine architecture is ARM:

typedef void (*funcptr)();

void monkey_patch(funcptr their_func, funcptr our_func) {
  ((int*)their_func)[0] = 0xe51ff004;
  ((int*)their_func)[1] = (int)our_func;
}
//monkey patching the memory allocator:
monkey_patch((funcptr)&malloc, (funcptr)&our_malloc);
monkey_patch((funcptr)&free, (funcptr)&our_free);

This overwrites the first instruction (32-bit word) of their_func with 0xe51ff004, which is the ARM machine code corresponding to the assembly instruction LDR PC,[PC,-4] – which means, in C-like pseudocode, PC = *(PC+4), or "jump to the program location pointed by the next word after the current program location".

(Why is the byte address PC+4 spelled in assembly as PC-4? I recall that it's because an ARM instruction at address X actually gets the value X+8 when referencing PC. Presumably because that is – or at some point was – the most convenient semantics for pipelined hardware to implement:

  • when the instruction at address X executes,
  • the instruction at address X+4 is decoded, and
  • the instruction at address X+8 is fetched

- so the physical PC register could very well keep the value X+8.)

So the first word of their_func is overwritten with, "jump to where the second word points". The second word is then overwritten with our_func, and we're all set.

Purpose

I actually did this in production code, on a bare metal target (no OS – just a boot loader that runs a massive single binary). I monkey patched the memory allocator – malloc, free, calloc, realloc – and the Unix-like I/O functions underlying that particular compiler's <stdio.h> and <iostream> implementation – read, write, open, close, creat. The memory allocator had to be changed to work on the target dual-core chip. The I/O functions had to be changed to use our drivers, so that we could write stuff to the Flash or USB using FILE* or ofstream.

A more civilized approach, if you want to override functions in a dynamic library, is passing another library at run time with LD_PRELOAD or equivalent. And if the code is linked statically as it was in my case, you can override the functions at link time. The trouble is that the linker could refuse to cooperate.

(And in my case, we shipped libraries, the customer linked the program, and the guy who talked to the customer refused to cooperate – that is, to help them override functions at link time. He was an old-school embedded developer, the kind that don't need no stinking malloc and printf. The project had a million lines of code very much in need of malloc and printf. He said, clean it up. Don't call malloc on the second CPU. So I went away and monkey patched malloc anyway.

In such a case, the civilized approach is to keep trying to talk the guy into it, and then have him persuade the (even more hardcore) embedded devs at the customer's side. What I did was what managers call "an attempt at a technical solution when a social solution is needed". Or as programmers call it, "avoiding a couple of months of pointless discussions". Being neither a full-time programmer nor a full-time manager, I don't have a clear opinion which viewpoint is right. I guess it depends on how long and pointless the discussions are going to be, versus how long and pointless the code working around the "social" problem will be.)

In theory, machine code monkey patching could be used in a bunch of "legitimate" cases, such as logging or debugging. In practice, this ugly thing is probably only justified in inherently ugly situations – as is kinda true of monkey patching in general.

Pitfalls

My example implementation for the ARM assumes that a function has at least 2 instructions. An empty ARM assembly function can have just one (jump to link register). In that case, the first instruction of the next function will be overwritten. A more sophisticated version of monkey_patch() could stash the target address someplace else, and use a LDR PC,[PC,clever_offset] command instead of a constant LDR PC,[PC,-4] command.

Overwriting machine code instructions breaks code that reads (as opposed to "executes") those instructions, counting on the original bit patterns to be stored there. This isn't very likely to be a problem with actual ARM code, unless it was written by Mel.

On any machine with separate and unsynchronized instruction and data caches, overwriting instructions will modify the contents of the data cache but not the instruction cache. If the instructions happen to be loaded to the instruction cache at the time of overwriting, subsequent calls to the monkey-patched function might call the original function, until the instruction cache line keeping the original code happens to be evicted (which isn't guaranteed to ever happen).

If your luck is particularly bad and the two overwritten instructions map to two adjacent cache lines, only one of which is loaded to the instruction cache at the time of overwriting, a call to the monkey-patched function might crash (since it'll see one original instruction word and one new one). In any case, on machines where caches won't sync automatically, one should sync them explicitly to implement self-modifying code correctly (I'll spare you my ARM9 code doing this).

If your OS places instructions in read-only memory pages, overwriting them will not work unless you convince the OS to grant you permission to do so.

C++

C++ virtual functions can be monkey patched more similarly to the typical dynamic language way. Instead of modifying instructions, we can overwrite the virtual function table.

Advantages:

  • more portable across machine architectures – the vtable layout doesn't depend on the machine
  • no cache syncing problems
  • no encoding-related corner cases like very short functions or instructions used as data

Disadvantages:

  • less portable across compilers – ARM machine code is the same with all compilers, vtable layout is not
  • fewer calls could be redirected – some compilers avoid the vtable indirection when they know the object's type at compile time (of course inlined calls won't be redirected with either technique)
  • only virtual functions can be redirected – typically a minority of C++ class member functions

The need to fiddle with OS memory protection is likely to remain since vtables are treated as constant data and as such are typically placed in write-protected sections.

Example C++ code (g++/Linux, tested with g++ 4.2.4 on Ubuntu 8.04):

#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

template<class T, class F>
void monkey_patch(int their_ind, F our_func) {
  T obj; //can we get the vptr without making an object?
  int* vptr = *(int**)&obj;
  //align to page size:
  void* page = (void*)(int(vptr) & ~(getpagesize()-1));
  //make the page with the vtable writable
  if(mprotect(page, getpagesize(), PROT_WRITE|PROT_READ|PROT_EXEC))
    perror("mprotect");
  vptr[their_ind] = (int)our_func;
}

class TheirClass {
 public:
  virtual void some_method() {}
  virtual void their_method() { printf("them\n"); }
};
void our_function() { printf("us\n"); }
int main() {
  TheirClass* obj = new TheirClass;
  //gcc ignores the vtable with a stack-allocated object
  obj->their_method(); //prints "them"

  monkey_patch<TheirClass, void(*)()>(1, our_function);
  //some_method is at index 0, their_method is at 1
  //we could instead try to non-portably get the index
  //out of &TheirClass::their_method

  obj->their_method(); //prints "us"
}

Conclusion

Let's drink to never having to do any of this (despite the fact that yes, some of us do enjoy it in a perverted way and feel nostalgic blogging about it).

Making data races manifest themselves

Or, synchronization, valgrind and poset dimension.

This story is perhaps more technical than I ever allowed myself to get online. If I may say so – it may be worth following; I can say that it really changed my perspective on what data races are.

I'll need to explain a bit about the context:

  • how we structure our parallel apps for multi-core targets
  • how we catch synchronization bugs using a custom valgrind tool
  • which of the bugs were hard to catch that way

Then I can explain how we now catch them all (at least those having any chance to happen on our test inputs; you'll see what I mean).

I'll also discuss valgrind's standard race detection tool, helgrind. I'll try to show that the interface to parallelism it works against – pthreads – poses a much harder problem, and how races can go undetected because of that.

I hope this can be interesting outside of my specific work context because "catching all the races" seems to be a claim rarely made. Also we got to stumble upon a well-known math problem during mundane pointer-chasing – a cliche of sorts, but amusing when it actually happens.

So, here goes.

Our parallelism framework

…known in its current incarnation as GSF – "Goal/Graph Scheduling Framework". It's fairly simple – 4 paragraphs and 2 pseudo-code listings should be a reasonable introduction.

The idea is, your app is a graph of goals. The nodes, goals, are things to do – typically function pointers (could be function objects or closures). The edges are dependencies – A depends on B, so A must be executed after B (denoted with B -> A in the diagrams). The main loop does this:

goal = first_goal # a dummy everything depends on
while goal != last_goal: # a dummy, depends on everything
  update_dependent_goals_status(goal) # find READY goals
  for chosen in sched.choose_goals(): # scheduler
    chosen.resource.enqueue(chosen) # outgoing message Qs
  goal = dequeue_finished_goal() # incoming message Q

This runs all the goals in a graph, and can itself be invoked in a loop – in our case, we run all the goals for every grabbed video frame.

Whatever the scheduling policy, a correct execution order will be chosen. Only READY goals can run, and a goal only becomes READY when everything it depends on becomes DONE. This is implemented in update_dependent_goals_status:

goal.status = DONE
for dep in goal.dependent_goals:
  dep.num_predecessors -= 1
  if dep.num_predecessors == 0:
    dep.status = READY

The scheduling policy is implemented partly in sched.choose_goals – the scheduler, and partly in resource.enqueue – the queues. The scheduler decides who is enqueued first; some queues may reorder waiting goals by priority, or even suspend the running goal(s) when a more important one arrives. While the scheduling policy can be made complicated almost without bound, that's all there is to the framework itself.

To summarize in standard terms, it's a kind of a message passing framework, without explicit threads and without locking, but with (massive) data sharing. "Goal scheduled"/"goal finished" notifications are the only messages, and only used by the framework – the goals themselves always communicate through shared data. Dependencies are the only synchronization primitive. OS locks are never used, except in the guts of various memory allocators.

Drawbacks

Users, or "application programmers", usually hate this framework with a passion, because code is hard to read and modify.

To break your app into the sort of graph described above, you need to specify the goals and the dependencies. Currently we do both in ugly ways. Goals are specified as global functions, with context passed through global variables: void goal_A() { g_ctxt… }. The hairy dependency graph is constructed by a Python script running at build time, with ifs, loops, and function calls – if do_A: make_goal("A",dep="B C"). This adds ugliness to C++ code, and then makes you follow ugly Python code to figure out the execution order of the ugly C++ code.

Some of it was even uglier in the past. Some of it could become less ugly in the future. At least one serious limitation is irremediable – the inability to create goals at run time. For example, you can't say "create a goal per processed data item" – rather, you need to explicitly, statically create a goal per CPU and distribute the data between the goals. You can't say "if X happens, create that goal" – you need to always have that goal, and have it do nothing unless X happens. People don't like to look at their flow that way.

Motivation

Efficiency and correctness, basically. Efficiency is tangential to our story, so I won't discuss it (and frankly, we aren't sure how much efficiency we gain, compared to other approaches). As to correctness – the hateful thing indeed "scales", which is to say, it doesn't break down. We've managed to ship working code for several years, given:

  • 8 to 10 heterogeneous target cores
  • Hundreds of goals
  • Dozens of applications
  • Dozens of developers

…the worst thing being "dozens of developers" – most unaware of parallelism, as I think they should be. Not that it proves we couldn't have done it differently – I can only give my theory as to why it would be harder.

The biggest problem with parallel imperative programs is synchronizing access to shared data. Tasks/threads can, and are encouraged to communicate through messages, but eliminating sharing is hard:

  • People frequently don't even realize they're sharing data

  • If tools make implicit use of shared data impossible, things can get painful (think of explicitly passing around a particularly "popular" data item, or copying large/intertwined objects)

Sharing data is natural and efficient in imperative programs, and outlawing it is hard. And locks are problematic since

  • You can have deadlocks, and
  • You can forget to lock.

Now, if you schedule a statically known flow graph rather than an unconstrained threaded app, it turns out that you solve both of these problems:

  • Deadlocks are trivially prevented – at build time, check that the dependency graph has no cycles (see the sketch right after this list)

  • Undeclared dependencies – sharing data without proper synchronization – can be found using program instrumentation
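
The first bit really is trivial – here's a sketch of what such a build-time check could look like (in Python, with made-up names rather than our actual build script):

def check_no_cycles(dependencies):
  # dependencies: goal name -> list of goals it depends on;
  # raises if the graph has a cycle (which would deadlock the scheduler)
  WHITE, GRAY, BLACK = 0, 1, 2 # unvisited / on the current path / done
  color = {}

  def visit(goal, path):
    color[goal] = GRAY
    for dep in dependencies.get(goal, []):
      if color.get(dep, WHITE) == GRAY:
        raise ValueError("dependency cycle: " +
                         " -> ".join(path + [goal, dep]))
      if color.get(dep, WHITE) == WHITE:
        visit(dep, path + [goal])
    color[goal] = BLACK

  for goal in dependencies:
    if color.get(goal, WHITE) == WHITE:
      visit(goal, [])

# check_no_cycles({"A": ["B", "C"], "B": ["C"], "C": []}) # fine
# check_no_cycles({"A": ["B"], "B": ["A"]})               # raises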

Our subject is this second bit – using program instrumentation to find race conditions (which makes everything up to this point an unusually long introduction).

We use a custom valgrind plug-in ("tool" as they call it) for program instrumentation. It works somewhat similarly to valgrind's helgrind tool, though helgrind's implementation is much more complicated.

However, helgrind has false negatives – it will consistently miss some of the data races, as I believe will any tool lacking the knowledge of the overall app structure. A simple example of a data race unreported by helgrind appears after the discussion on race condition detection, when the problem should become more clear.

Race conditions in a static graph

In an imperative program, goals communicate through memory. A memory access can thus be thought of as a proof of dependence. If goal A accesses memory at address X, and goal B was the last to modify X, then A depends on B. If there is no path from A to B in the dependency graph, the program is buggy. A could run before B or after it, so the result of accessing X is undefined.

[Figure: write then read – the simplest dependence proof]

Suppose we could intercept all load/store instructions in the program. Then we could maintain, for every byte, its current "owner" – a goal ID. A store to a location would update its owner to the ID of the currently running goal. Loads and stores could then check whether the currently running goal depends on the owner of the accessed location – all it takes is accessing a 2D array, the path matrix. Upon error, print the call stack, and the names of the 2 goals.
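
Here's that bookkeeping modeled in a few lines of Python (the real thing is valgrind instrumentation in C, and it prints a call stack rather than just goal IDs; path is the precomputed path matrix, with path[a][b] true if goal a depends, directly or transitively, on goal b):

class RaceChecker:
  def __init__(self, path):
    self.path = path         # 2D matrix: path[a][b] == "a depends on b"
    self.owner = {}          # address -> ID of the goal that last stored there
    self.current_goal = None # set by the framework when a goal starts

  def _ok(self, address):
    own = self.owner.get(address)
    return own is None or own == self.current_goal \
           or self.path[self.current_goal][own]

  def on_store(self, address):
    if not self._ok(address):
      self._report(address)
    self.owner[address] = self.current_goal

  def on_load(self, address):
    if not self._ok(address):
      self._report(address)

  def _report(self, address):
    print("race: goal %s accesses 0x%x last written by goal %s"
          % (self.current_goal, address, self.owner[address]))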

This sort of thing is surprisingly easy to implement on top of valgrind, almost as easy as implementing an interface with abstract onLoad and onStore methods. I thought of posting sample code but it looks like the example tool shipped with valgrind, "lackey", comes with a load/store interception example these days.

As to the "shadow memory" you need for keeping owner IDs – you can do that with a page table, along the lines of a virtual memory implementation in an OS. The high bits of a pointer select a page and the low bits are the offset within the page, with pages allocated upon first access. Standard valgrind tools do it this way, too.

Our valgrind tool is called shmemcheck for "shared memory checker". A lame pun on memcheck, valgrind's most famous tool, the one reporting uninitialized memory access. "Memcheck-shmemcheck". Probably the last time I used this sort of name – funny for the first few times, embarrassing for the next few thousands of times.

Despite the embarrassment, when we got an implementation good enough to be systematically used, it was really great – data races were decimated. Valgrind is awesome, unbelievably so. It managed to unimaginably expand what can be done to a natively compiled program, decades after the tools for building native programs were cast in stone.

The trouble with this version of, cough, shmemcheck, is that it doesn't really work. That is, sometimes you have false negatives, so you still get to dive into core dumps. Why?

What about the opposite case?

[Figure: read then write – another shadow cell?]

If A loads from X that was last modified by B, A depends on B, and we detect it alright. What if A writes to X that was previously read by B? This also proves that A should depend on B. Otherwise B will sometimes run after A, reading its update to X, and sometimes before A, reading the previous value. In order to detect this dependency, we have to remember that X was read by B until A stores to X.

…What if, in addition to B, X was read by C, D, and E, all before A's update?

[Figure: reads then write – too many shadow cells]

Every location always has one "owner" but can have many "users". We can afford keeping an ID – 2 bytes in our case – for every byte. Keeping – and checking – a list of users that could grow to tens or even hundreds of IDs per location sounds impractical.

We used all sorts of application-specific arguments to convince ourselves that this problem is confined to a few particular scenarios. We had a few hacks removing some of the false negatives, at the cost of adding some false positives, and lived with that.

Choosing the order

We'd gotten used to thinking that the question was: how do you know who reads X before A? But then it dawned upon us that the right question was: why do they read X before A?!

[Figure: read then write – why not reverse the order?]

And really – if B, C and D run before A because it is stated in the dependency graph that A depends on B, C and D – then there's no bug to worry about. But if there's no dependency declared between, say, A and B, then A could run before B just as well – and we'd detect the bug. So we only missed the race condition because of a randomly chosen order. And when we do find races, we don't know if B depends on A or vice versa – only that we were lucky to have A run first.

What if we choose different random orders and run the same app many times? Then in some run, A will run before B and we'll find the bug – with just one shadow cell keeping the owner.

…Or will it? We have in our dependency graph many "A, B" pairs of independent goals. If we pick random orders, how likely are these pairs to "flip" to "B, A" in some orders but not others?

The first conjecture was – very likely. Just randomize schedules by having sched.choose_goals pick a random goal among the READY ones. "There seems to be no bias in the selection; A and B are independent so the chance for A to come before B is 1/2; therefore, with N orders, the chance for A to always follow B is 1/2^N – small enough."

Interestingly, it still sounds reasonable to me, though I found a counter-example – I went looking for one just out of fear that it's too good to be true (what, years of core dumps are now behind me?) The counter-example: suppose you have two long processes, P and Q, each made up of 100 goals, P1…P100 and Q1…Q100. P and Q are declared independent – Pi never depends on Qj or vice versa. However, Pi+1 depends on Pi, and similarly for Q, so the goals within each process can run in just one order.

Just two schedules would "flip" all independent pairs: (1) Run P, then Q and (2) Run Q, then P. However, sched.choose_goals has a snowflake's chance in hell to pick these schedules. The set of READY goals will contain, at all times, 2 goals: one from P and one from Q. sched.choose_goals must either choose the one from P 100 times in a row, or the one from Q 100 times in a row. Since that's a 1 in a 2^100 event, P1 will always run before Q100. If P1 loads from X and Q100 stores to X, we'll never find the undeclared dependency.
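
A tiny simulation of the P & Q example shows the problem; this models the randomized sched.choose_goals, not our actual code:

import random

def random_schedule(n=100):
  # pick uniformly among the READY goals of two independent chains
  # P1..Pn and Q1..Qn; return the resulting execution order
  next_p, next_q, order = 1, 1, []
  while next_p <= n or next_q <= n:
    ready = []
    if next_p <= n: ready.append("P")
    if next_q <= n: ready.append("Q")
    if random.choice(ready) == "P":
      order.append("P%d" % next_p); next_p += 1
    else:
      order.append("Q%d" % next_q); next_q += 1
  return order

# P1 precedes Q100 unless Q is picked 100 times in a row - a 1 in 2^100
# event - so this prints True in every run you'll ever see:
order = random_schedule()
print(order.index("P1") < order.index("Q100"))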

One could say that pairs which aren't likely to flip in such random trials are unlikely to flip in reality – and that would be 100% wrong. In reality, execution order depends on real run times, and those create a very biased – and unpredictably biased – sampling of the distribution of schedules.

Now, with N goals, N*2 orders would clearly be enough to flip all pairs – for every goal, create an order where it precedes everything it doesn't depend on, and another order where it follows everything it doesn't depend on. But our N is above 1000, so N*2, though not entirely impractical, is a bit too much.

Poset dimension

Our pessimism resulting from the collapse of the first conjecture led to an outburst of optimism yielding a second conjecture – just 2 orders are enough to flip all pairs. Specifically, the orders we thought of were a DFS postorder and its "reversal", where you reverse the order in which you traverse the children of nodes.

Works on my P & Q example indeed, but there's still a counter-example, just a bit more elaborate. We kept looking at it until it turned out that we shouldn't. The dependency graph makes the goals a partially ordered set or poset. Finding the minimal number of schedules needed to flip all independent pairs would find the order dimension of that poset. Which is a known problem, and, like all the good things in life, is NP-hard.

Curious as this discovery was, I was crushed by this NP-hardship. However, there still could be a good practical heuristic yielding a reasonably small number of orders.

We tried a straightforward one – along the lines of the N*2 upper bound, "flip them one by one", but greedier. Simply try to flip as many pairs as you can in the same schedule:

  1. Start with a random schedule, and make a list of all the pairs that can be flipped.
  2. Starting with the real dependencies, add fake dependencies forcing pairs from the list to flip, until you can't add any more without creating a cycle.
  3. Create a schedule given those extended dependencies.
  4. Remove the pairs you flipped from the list. If it isn't empty, go to step 2.
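
Spelled out in Python, that's something like the following sketch – with a plain topological sort standing in for the real scheduler, and a quadratic enumeration of pairs, affordable in a build-time tool:

import random

def topo_sort(goals, deps):
  # deps: goal -> set of goals it depends on; returns one valid order,
  # picking randomly among READY goals (like sched.choose_goals would)
  remaining = {g: set(deps.get(g, ())) for g in goals}
  order = []
  while remaining:
    ready = [g for g, d in remaining.items() if not d]
    g = random.choice(ready)
    order.append(g)
    del remaining[g]
    for d in remaining.values():
      d.discard(g)
  return order

def depends_on(deps, a, b):
  # is there a path from a to b, i.e. does a (transitively) depend on b?
  stack, seen = list(deps.get(a, ())), set()
  while stack:
    g = stack.pop()
    if g == b:
      return True
    if g not in seen:
      seen.add(g)
      stack.extend(deps.get(g, ()))
  return False

def flipping_schedules(goals, deps):
  # greedily generate schedules so that every independent pair of goals
  # appears in both orders in some schedule (steps 1-4 above)
  first = topo_sort(goals, deps)
  pos = {g: i for i, g in enumerate(first)}
  # pairs (a, b) where a ran before b and no dependency forces that:
  to_flip = [(a, b) for a in goals for b in goals if pos[a] < pos[b]
             and not depends_on(deps, a, b) and not depends_on(deps, b, a)]
  schedules = [first]
  while to_flip:
    fake = {g: set(deps.get(g, ())) for g in goals}
    flipped = []
    for a, b in to_flip:
      if not depends_on(fake, b, a): # adding "a depends on b" stays acyclic...
        fake[a].add(b)               # ...and forces b to run before a
        flipped.append((a, b))
    schedules.append(topo_sort(goals, fake))
    to_flip = [p for p in to_flip if p not in flipped]
  return schedules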

This tended to produce only about 10 schedules for more than 1000 goals, and there was much rejoicing.

Massive testing

Instrumentation slows things down plenty, so you can't run many tests. This is a pity because instrumentation only catches data races that actually occurred – it's actual loads by A from location X owned by B that prove that A depends on B. But if you can't run the program on a lot of data, you may not reach the flow path in A that uses X.

However, if you can, with just 10 schedules, flip all "A, B" pairs, you don't need instrumentation in the first place. Just run the app with different orders on the same data, and see if you get the same results. If you don't, there's a race condition – then run under instrumentation to have it pinpointed for you. The order is "purposefully diversified", but deterministic, so problems will reproduce in the second, instrumented run. Thus massive runs can be uninstrumented, and therefore, more massive.
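
In pseudo-code, the massive-testing loop is something like the following, where run_app and run_instrumented are hypothetical hooks standing for "run the app with this goal ordering" and "run it under shmemcheck":

def massive_test(test_inputs, schedules, run_app, run_instrumented):
  # uninstrumented runs under all the precomputed schedules; only inputs
  # whose results disagree go through the slow, instrumented run
  for inp in test_inputs:
    results = [run_app(inp, schedule) for schedule in schedules]
    if any(r != results[0] for r in results[1:]):
      # the schedules are deterministic, so the race that changed the
      # result will reproduce under instrumentation and get pinpointed
      run_instrumented(inp, schedules)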

So not only does this cheap poset dimension upper bound business solve the false negatives problem with instrumentation – it also greatly increases coverage. Of course there can still be a race that never occurred in any test runs. But now the quality of testing is defined just by the quality of the test data, never by chance. With races, whether you detect them or not is thought of as something inherently having to do with chance. It turns out that it doesn't have to.


Helgrind's problem

Helgrind – or any tool lacking knowledge about the app structure – cannot meaningfully reorder things this way in order to "make races manifest themselves". A race can be detected if A, which stores to X, is moved to happen before B, which loads from X – but in order to do that, you have to know where A and B are.

Helgrind's only markers breaking the app into meaningful parts are synchronization calls, like mutex locks/unlocks. But it doesn't know where they'll happen in advance. So it can't steer the execution so that "this lock happens as early as possible" – there's no "this lock" until a lock actually happens. By then it's too late to make it happen before anything else that already happened. (In theory, the app structure could be inferred dynamically from the flow, I just don't see how I'd go about it.)

Therefore, helgrind seems to have the same problem our first versions of shmemcheck had. You can keep the single owner of a location but not its many readers; it appears that helgrind keeps one reader – the last. Helgrind's "owner/reader" is a lock ID rather than a goal ID, but I think the principle is very similar. I haven't studied helgrind's internals so I can only speculate about how it works, but I easily managed to create an example where it gives a false negative:

  1. In thread A, read X without locking, as many times as you want (…or don't want – in a real program it would be a bug, not an achievement…)
  2. After that, read X once with proper locking.
  3. In thread B, modify X, again with proper locking.

If, during testing, thread B happens to modify X after all of thread A's reads from X, the bug in A – unsynchronized access to X – will go unnoticed. It won't go unnoticed if thread A never locks properly – since helgrind will remember and report the last of its unsynchronized reads:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
int shared=1;
int unused=0;
typedef struct { int read_unlocked, read_locked; } sopt;

void* child_fn(void* arg) {
  sopt* opt = (sopt*)arg;
  if(opt->read_unlocked) {
    unused += shared;
  }
  if(opt->read_locked) {
    pthread_mutex_lock(&mutex);
    unused += shared;
    pthread_mutex_unlock(&mutex);
  }
  return 0;
}

void test_helgrind(int read_unlocked, int read_locked,
                   const char* expected) {
  fprintf(stderr,"expected behavior: %sn", expected);
  //fork
  sopt opt = { read_unlocked, read_locked };
  pthread_t child;
  pthread_create(&child, 0, child_fn, &opt);
  //lock & modify shared data
  pthread_mutex_lock(&mutex);
  shared = 2;
  pthread_mutex_unlock(&mutex);
  //join
  pthread_join(child, 0);
}

int main(void) {
 testhelgrind(0,1,"no errors reported [true negative]");
 testhelgrind(1,0,"data race reported [true positive]");
 testhelgrind(1,1,"no errors reported [false negative]");
}

Which, under valgrind --tool=helgrind, prints:

expected behavior: no errors reported [true negative]
expected behavior: data race reported [true positive]

...
==15041== Possible data race during write of size 4 at 0x804a024 by thread #1
==15041==    at 0x8048613: test_helgrind (helgrind_example.c:33)
==15041==    by 0x8048686: main (helgrind_example.c:41)
==15041==  This conflicts with a previous read of size 4 by thread #3
==15041==    at 0x804856E: child_fn (helgrind_example.c:14)
...

expected behavior: no errors reported [false negative]

…or at least it works that way on my system, where the child thread runs first.

So if helgrind does report a data race – be sure to look carefully. If you have several buggy reads, and only fix the last one and run again, helgrind will very likely not report the previous ones. Your new synchronization code around the last read now "masks" the previous reads.

The happy ending

We still have false negatives in our data race detection, but these are just easy-to-fix low-level loose ends – or so I think/hope. Things like tracking memory writes by devices other than CPUs, or handling custom memory allocators. I'm looking forward to having these false negatives eliminated, very sincerely. Few things are uglier than debugging data races.

I never thought this through enough to tell whether this purposeful reordering business can be utilized by a framework more restrictive than "use raw threads & locks", but less restrictive than "your app is a static graph". If it can be (is already?) utilized elsewhere, I'd be very curious to know.

Special thanks to Yonatan Bilu for his help with the math.

The Iron Fist Coding Standard

I've developed the following coding standard during the years when I've been responsible, on and off, for reviewing code by other programmers. It's suitable for most text-based programming languages and data specification formats.

Rules

Indent upon nesting

Where control structures or data objects nest, indentation helps a reader to keep track of the nesting level.

Indentation should render properly in all relevant IDE/editor configurations. Nobody should see stuff in the inner scope closer to the left than the outer scope. Seriously, spend 10 minutes to understand tabs vs spaces issues.

Use profanity judiciously

Profanity makes code easier to write, but harder to read and modify. Profanity signals a potential hazard, but distracts attention from the details – the opposite of a good warning. With profanity, less is more.

Be careful with comments, more careful with identifiers, and still more careful with the ones going into symbol tables. Except when mandated by an explicit requirement, under no circumstances should profanity appear in a program's output, documentation or configuration files, or be a required part of its input.

Clean up

Bits of code that can't possibly serve a purpose are a sign of neglect, scaring some readers and depressing others. Delete.

Semi-commented-out stuff spread around in moments of panic can be hard to clean up during the bad mood that breeds it. But that bad mood will haunt you as long as you keep bumping into that stuff. So.

Deviations

Concise nested statements or data object specifications can take one line and so need no indentation: if(error) quit.

Indenting upon nesting is neither mandated nor recommended in languages and formats where it is not customary. For example, code following a branch instruction or the <html> opening tag.

It's nice when generated code is indented, but sometimes the white space uses up bandwidth and sometimes you get fed up with propagating the nesting level through the code generator. Generated code is mostly read by whoever wrote the generator so it's up to you.

For profanity to appear in a program's interaction with the user, no explicit requirement is needed when it is a common practice in the given branch of the industry.

Some code won't be very tidy when you're really in a hurry – so be it.

Conclusion

Adopting the Iron Fist Coding Standard will give you the benefits of a mature, field-tested system of coding guidelines while saving the cost of creating one from scratch.

My history with Forth & stack machines

My VLSI tools take a chip from conception through testing. Perhaps 500 lines of source code. Cadence, Mentor Graphics do the same, more or less. With how much source/object code?

Chuck Moore, the inventor of Forth

This is a personal account of my experience implementing and using the Forth programming language and the stack machine architecture. "Implementing and using" – in that order, pretty much; a somewhat typical order, as will become apparent.

It will also become clear why, having defined the instruction set of a processor designed to run Forth that went into production, I don't consider myself a competent Forth programmer (now is the time to warn that my understanding of Forth is just that – my own understanding; wouldn't count on it too much.)

Why the epigraph about Chuck Moore's VLSI tools? Because Forth is very radical. Black Square kind of radical. An approach to programming seemingly leaving out most if not all of programming:

…Forth does it differently. There is no syntax, no redundancy, no typing. There are no errors that can be detected. …there are no parentheses. No indentation. No hooks, no compatibility. …No files. No operating system.

Black Square by Kazimir Malevich

I've never been a huge fan of suprematism or modernism in general. However, a particular modernist can easily get my attention if he's a genius in a traditional sense, with superpowers. Say, he memorizes note sheets upon the first brief glance like Shostakovich did.

Now, I've seen chip design tools by the likes of Cadence and Mentor Graphics. Astronomically costly licenses. Geological run times. And nobody quite knows what they do. To me, VLSI tools in 500 lines qualify as a superpower, enough to grab my attention.

So, here goes.

***

I was intrigued with Forth ever since I read about it in Bruce Eckel's book on C++, a 198-something edition; he said there that "extensibility got a bad reputation due to languages like Forth, where a programmer could change everything and effectively create a programming language of his own". WANT!

A couple of years later, I looked for info on the net, which seemed somewhat scarce. An unusual-looking language. Parameters and results passed implicitly on a stack. 2 3 + instead of 2+3. Case-insensitive. Nothing about the extensibility business though.

I thought of nothing better than to dive into the source of an implementation, pForth – and I didn't need anything better, as my mind was immediately blown away by the following passage right at the top of system.fth, the part of pForth implemented in Forth on top of the C interpreter:

: (   41 word drop ; immediate
( That was the definition for the comment word. )
( Now we can add comments to what we are doing! )

Now. we. can. add. comments. to. what. we. are. doing.

What this does is define a word (Forth's name for a function) called "(". "(" is executed at compile time (as directed by IMMEDIATE). It tells the compiler to read bytes from the source file (that's what the word called, um, WORD is doing), until a ")" – ASCII 41 – is found. Those bytes are then ignored (the pointer to them is removed from the stack with DROP). So effectively, everything inside "( … )" becomes a comment.

Wow. Yeah, you definitely can't do that in C++. (You can in Lisp but they don't teach you those parts at school. They teach the pure functional parts, where you can't do things that you can in C++. Bastards.)

Read some more and…

 conditional primitives
: IF     ( -- f orig )  ?comp compile 0branch  conditional_key >mark     ; immediate
: THEN   ( f orig -- )  swap ?condition  >resolve   ; immediate
: BEGIN  ( -- f dest )  ?comp conditional_key <mark   ; immediate
: AGAIN  ( f dest -- )  compile branch  swap ?condition  <resolve  ; immediate
: UNTIL  ( f dest -- )  compile 0branch swap ?condition  <resolve  ; immediate
: AHEAD  ( -- f orig )  compile branch   conditional_key >mark     ; immediate

Conditional primitives?! Looks like conditional primitives aren't – they define them here. This COMPILE BRANCH business modifies the code of a function that uses IF or THEN, at compile time. THEN – one part of the conditional – writes (RESOLVEs) a branch offset to a point in code saved (MARKed) by IF, the other part of the conditional.

It's as if a conventional program modified the assembly instructions generated from it at compile time. What? How? Who? How do I wrap my mind around this?

Shocked, I read the source of pForth.

Sort of understood how Forth code was represented and interpreted. Code is this array of "execution tokens" – function pointers, numbers and a few built-ins like branches, basically. A Forth interpreter keeps an instruction pointer into this array (ip), a data stack (ds), and a return stack (rs), and does this:

while(true) {
 switch(*ip) {
  //arithmetics (+,-,*...):
  case PLUS: ds.push(ds.pop() + ds.pop()); ++ip; break;
  //stack manipulation (drop,swap,rot...):
  case DROP: ds.pop(); ++ip; break;
  //literal numbers (1,2,3...):
  case LITERAL: ds.push(ip[1]); ip+=2; break;
  //control flow:
  case COND_BRANCH: if(!ds.pop()) ip+=ip[1]; else ip+=2; break;
  case RETURN: ip = rs.pop(); break;
  //user-defined words: save return address & jump
  default: rs.push(ip+1); ip = *ip;
 }
}

That's it, pretty much. Similar, say, to the virtual stack machine used to implement Java. One difference is that compiling a Forth program is basically writing to the code array in a WYSIWYG fashion. COMPILE SOMETHING simply appends the address of the word SOMETHING to the end of the code array. So does plain SOMETHING when Forth is compiling rather than interpreting, as it is between a colon and a semicolon, that is, when a word is defined.

So

: DRAW-RECTANGLE 2DUP UP RIGHT DOWN LEFT ;

simply appends {&2dup,&up,&right,&down,&left,RETURN} to the code array. Very straightforward. There are no parameters or declaration/expression syntax as in…

void drawRectangle(int width, int height) {
  up(height);
  right(width);
  down(height);
  left(width);
}

…to make it less than absolutely clear how the source code maps to executable code. "C maps straightforwardly to assembly"? Ha! Forth maps straightforwardly to assembly. Well, to the assembly language of a virtual stack machine, but still. So one can understand how self-modifying code like IF and THEN works.

On the other hand, compared to drawRectangle, it is somewhat unclear what DRAW-RECTANGLE does. What are those 2 values on the top of the stack that 2DUP duplicates before meaningful English names appear in DRAW-RECTANGLE's definition? This is supposed to be ameliorated by stack comments:

: DRAW-RECTANGLE ( width height -- ) ... ;

…tells us that DRAW-RECTANGLE expects to find height at the top of the stack, and width right below it.

I went on to sort of understand CREATE/DOES> – a further extension of this compile-time self-modifying code business that you use to "define defining words" (say, CONSTANT, VARIABLE, or CLASS). The CREATE part says what should be done when words (say, class names) are defined by your new defining word. The DOES> part says what should be done when those words are used. For example:

: CONSTANT
   CREATE ,
   DOES> @
;
\ usage example:
7 CONSTANT DAYS-IN-WEEK
DAYS-IN-WEEK 2 + . \ should print 9

CREATE means that every time CONSTANT is called, a name is read from the source file (similarly to what WORD would have done). Then a new word is created with that name (as a colon would have done). This word records the value of HERE – something like sbrk(0), a pointer past the last allocated data item. When the word is executed, it pushes the saved address onto the data stack, then calls the code after DOES>. The code after CREATE can put some data after HERE, making it available later to the DOES> part.

With CONSTANT, the CREATE part just saves its input (in our example, 7) – the comma word does this: *HERE++ = ds.pop(); The DOES> part then fetches the saved number – the @ sign is the fetch word: ds.push( *ds.pop() );
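
Here's a toy model of that data flow in Python – my own analogy, not how a Forth system is implemented: a list stands for the data space (so HERE is just its length), a dict stands for the dictionary, and CONSTANT's CREATE part becomes a closure remembering HERE:

data_space = [] # "HERE" is len(data_space)
words = {}      # the dictionary: word name -> function
ds = []         # the data stack

def comma(): # , : *HERE++ = ds.pop()
  data_space.append(ds.pop())

def fetch(): # @ : ds.push(*ds.pop())
  ds.append(data_space[ds.pop()])

def constant(name): # : CONSTANT  CREATE ,  DOES> @ ;
  here = len(data_space) # CREATE: the new word remembers HERE...
  comma()                # ...and , stores the value there
  def created_word():    # DOES> part: push the saved address, then @
    ds.append(here)
    fetch()
  words[name] = created_word

# 7 CONSTANT DAYS-IN-WEEK   DAYS-IN-WEEK 2 + .
ds.append(7)
constant("DAYS-IN-WEEK")
words["DAYS-IN-WEEK"]()
print(ds.pop() + 2) # prints 9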

CONSTANT works somewhat similarly to a class, CREATE defining its constructor and DOES> its single method:

class Constant
  def initialize(x) @x=x end
  def does() @x end
end
daysInWeek = Constant.new(7)
print daysInWeek.does() + 2

…But it's much more compact on all levels.

Another example is defining C-like structs. Stripped down to their bare essentials (and in Forth things tend to be stripped down to their bare essentials), you can say that:

struct Rectangle {
  int width;
  int height;
};

…simply gives 8 (the structure size) a new name Rectangle, and gives 0 and 4 (the members' offsets) new names, width and height. Here's one way to implement structs in Forth:

struct
  cell field width
  cell field height
constant rectangle

\ usage example:
\ here CREATE is used just for allocation
create r1 rectangle allot \ r1=HERE; HERE+=8
2 r1 width !
3 r1 height !
: area dup width @ swap height @ * ;
r1 area . \ should print 6

CELL is the size of a word; we could say "4 field width" instead of "cell field width" on 32b machines. Here's the definition of FIELD:

 : field ( struct-size field-size -- new-struct-size )
    create over , +
    does> @ +
 ;

Again, pretty compact. The CREATE part stores the offset, a.k.a current struct size (OVER does ds.push(ds[1]), comma does *HERE++=ds.pop()), then adds the field size to the struct size, updating it for the next call to FIELD. The DOES> part fetches the offset, and adds it to the top of the stack, supposedly containing the object base pointer, so that "rect width" or "rect height" compute &rect.width or &rect.height, respectively. Then you can access this address with @ or ! (fetch/store). STRUCT simply pushes 0 to the top of the data stack (initial size value), and at the end, CONSTANT consumes the struct size:

struct \ data stack: 0
  cell ( ds: 0 4 ) field width  ( ds: 4 )
  cell ( ds: 4 4 ) field height ( ds: 8 )
constant rectangle ( ds: as before STRUCT )

You can further extend this to support polymorphic methods – METHOD would work similarly to FIELD, fetching a function pointer ("execution token") through a vtable pointer and an offset kept in the CREATEd part. A basic object system in Forth can thus be implemented in one screen (a Forth code size unit – 16 lines x 64 characters).

To this day, I find it shocking that you can define defining words like CONSTANT, FIELD, CLASS, METHOD – something reserved to built-in keywords and syntactic conventions in most languages – and you can do it so compactly using such crude facilities so trivial to implement. Back when I first saw this, I didn't know about DEFMACRO and how it could be used to implement the defining words of CLOS such as DEFCLASS and DEFMETHOD (another thing about Lisp they don't teach in schools). So Forth was completely mind-blowing.

And then I put Forth aside.

It seemed more suited for number crunching/"systems programming" than text processing/"scripting", whereas it is scripting that is the best trojan horse for pushing a language into an organization. Scripting is usually mission-critical without being acknowledged as such, and many scripts are small and standalone. Look how many popular "scripting languages" there are as opposed to "systems programming languages". Then normalize it by the amount of corporate backing a language got on its way to popularity. Clearly scripting is the best trojan horse.

In short, there were few opportunities to play with Forth at work, so I didn't. I fiddled with the interpreter and with the metaprogramming and then left it at that without doing any real programming.

Here's what Jeff Fox, a prominent member of the Forth community who worked with Chuck Moore for years, has to say about people like me:

Forth seems to mean programming applications to some and porting Forth or dissecting Forth to others. And these groups don't seem to have much in common.

…One learns one set of things about frogs from studying them in their natural environment or by getting a doctorate in zoology and specializing in frogs. And people who spend an hour dissecting a dead frog in a pan of formaldehyde in a biology class learn something else about frogs.

…One of my favorite examples was that one notable colorforth [a Forth dialect] enthusiast who had spent years studying it, disassembling it, reassembling it and modifying it, and made a lot of public comments about it, but had never bothered running it and in two years of 'study' had not been able to figure out how to do something in colorforth as simple as:

1 dup +

…[such Forth users] seem to have little interest in what it does, how it is used, or what people using it do with it. But some spend years doing an autopsy on dead code that they don't even run.

Live frogs are just very different than dead frogs.

Ouch. Quite an assault not just on a fraction of a particular community, but on language geeks in general.

I guess I feel that I could say that if it isn't solving a significant real problem in the real world it isn't really Forth.

True, I guess, and equally true from the viewpoint of someone extensively using any non-mainstream language and claiming enormous productivity gains for experts. Especially true for the core (hard core?) of the Forth community, Forth being their only weapon. They actually live in Forth; it's DIY taken to the extreme, something probably unparalleled in the history of computing, except, perhaps, the case of Lisp environments and Lisp machines (again).

Code running on Forth chips. Chips designed with Forth CAD tools. Tools developed in a Forth environment running on the bare metal of the desktop machine. No standard OS, file system or editor. All in recent years when absolutely nobody else would attempt anything like it. They claim to be 10x to 100x more productive than C programmers (a generic pejorative term for non-Forth programmers; Jeff Fox is careful to put "C" in quotes, presumably either to make the term more generic or more pejorative).

…people water down the Forth they do by not exercising most of the freedom it offers… by using Forth only as debugger or a yet another inefficient scripting language to be used 1% of the time.

Forth is about the freedom to change the language, the compiler, the OS or even the hardware design and is very different than programming languages that are about fitting things to a fixed language syntax in a narrow work context.

What can be said of this? If, in order to "really" enter a programming culture, I need to both "be solving a significant real problem in the real world" and exercising "the freedom to change the language, the compiler, the OS or even the hardware design", then there are very few options for entering this culture indeed. The requirement for "real world work" is almost by definition incompatible with "the freedom to change the language, the compiler, the OS and the hardware design".

And then it so happened that I started working on a real-world project about as close to Forth-level DIY as possible. It was our own hardware, with our own OS, our own compilers, designed to be running our own application. We did use standard CAD tools, desktop operating systems and editors, and had standard RISC cores in the chip and standard C++ cross compilers for them. Well, everyone has weaknesses. Still, the system was custom-tailored, embedded, optimized, standalone, with lots of freedom to exercise – pretty close to the Forth way, in one way.

One part of the system was an image processing co-processor, a variation on the VLIW theme. Its memory access and control flow was weird and limited, to the effect that you could neither access nor jump to an arbitrary memory address. It worked fine for the processing-intensive parts of our image processing programs.

We actually intended to glue those parts together with a few "control instructions" setting up the plentiful control registers of this machine. When I tried, it quickly turned out, as was to be expected, that those "control instructions" must be able to do, well, everything – arithmetic, conditions, loops. In short, we needed a CPU.

We thought about buying a CPU, but it was unclear how we could use an off-the-shelf product. We needed to dispatch VLIW instructions from the same instruction stream. We also needed a weird mixture of features. No caches, no interrupts, no need for more than 16 address bits, but for accessing 64 data bits, and 32-bit arithmetic.

We thought about making our own CPU. The person with the overall responsibility for the hardware design gently told me that I was out of my mind. CPUs have register files and pipeline and pipeline stalls and dependency detection to avoid those stalls and it's too complicated.

And then I asked, how about a stack machine? No register file. Just a 3-stage pipeline – fetch, decode, execute. No problem with register dependencies, always pop inputs from the top of the stack, push the result.

He said it sounded easy enough alright, we could do that. "It's just like my RPN calculator. How would you program it?" "In Forth!"

I defined the instruction set in a couple of hours. It mapped to Forth words as straightforwardly as possible, plus it had a few things Forth doesn't have that C might need, as a kind of insurance (say, access to 16-bit values in memory).

This got approved and implemented; not that it became the schedule bottleneck, but it was harder than we thought. Presumably that was partly the result of not reading "Stack Computers: the new wave", and not studying the chip designs of Forth's creator Chuck Moore, either. I have a feeling that knowledgeable people would have sneered at this machine: it was trivial to compile Forth to it, but at the cost of complicating the hardware.

But I was satisfied – I got a general-purpose CPU for setting up my config regs at various times through my programs, and as a side effect, I got a Forth target. And even if it wasn't the most cost-effective Forth target imaginable, it was definitely a time to start using Forth at work.

(Another area of prior art on stack machines that I failed to study in depth was 4stack – an actual VLIW stack machine, with 4 data stacks as suggested by its name. I was very interested in it, especially during the time when we feared implementation problems with the multi-port register file feeding our multiple execution units. I didn't quite figure out how programs would map to 4stack and what the efficiency drop would be when one had to spill stuff from the data stacks to other memory because of data flow complications. So we just went for a standard register file and it worked out.)

The first thing I did was write a Forth cross-compiler for the machine – a 700-line C++ file (and for reasons unknown, the slowest-compiling C++ code that I have ever seen).

I left out all of the metaprogramming stuff. For instance, none of the Forth examples above, the ones that drove me to Forth, could be made to work in my own Forth. No WORD, no COMPILE, no IMMEDIATE, no CREATE/DOES>, no nothing. Just colon definitions, RPN syntax, flow control words built into the compiler. "Optimizations" – trivial constant folding so that 1 2 + becomes 3, and inlining – :INLINE 1 + ; works just like : 1 + ; but is inlined into the code of the caller. (I was working on the bottlenecks so saving a CALL and a RETURN was a big deal.) So I had that, plus inline assembly for the VLIW instructions. Pretty basic.

I figured I didn't need the more interesting metaprogramming stuff for my first prototype programs, and I could add it later if it turned out that I was wrong. It was weird to throw away everything I originally liked the most, but I was all set to start writing real programs. Solving real problems in the real world.

It was among the most painful programming experiences in my life.

All kinds of attempts at libraries and small test programs aside, my biggest program was about 700 lines long (that's 1 line of compiler code for 1 line of application code). Here's a sample function:

: mean_std ( sum2 sum inv_len -- mean std )
  \ precise_mean = sum * inv_len;
  tuck u* \ sum2 inv_len precise_mean
  \ mean = precise_mean >> FRAC;
  dup FRAC rshift -rot3 \ mean sum2 inv_len precise_mean
  \ var = (((unsigned long long)sum2 * inv_len) >> FRAC) - (precise_mean * precise_mean >> (FRAC*2));
  dup um* nip FRAC 2 * 32 - rshift -rot \ mean precise_mean^2 sum2 inv_len
  um* 32 FRAC - lshift swap FRAC rshift or \ mean precise_mean^2 sum*inv_len
  swap - isqrt \ mean std
;

Tuck u*.

This computes the mean and the standard deviation of a vector given the sum of its elements, the sum of their squares, and the inverse of its length. It uses scaled integer arithmetic: inv_len is an integer keeping (1<<FRAC)/length. How it arranges the data on the stack is beyond me. It was beyond me at the time when I wrote this function, as indicated by the plentiful comments documenting the stack state, amended by wimpy C-like comments ("C"-like comments) explaining the meaning of the postfix expressions.

This nip/tuck business in the code? Rather than a reference to the drama series on plastic surgery, these are names of Forth stack manipulation words. You can look them up in the standard. I forgot what they do, but it's, like, ds.insert(2,ds.top()), ds.remove(1), this kind of thing.

Good Forth programmers reportedly don't use much of those. Good Forth programmers arrange things so that they flow on the stack. Or they use local variables. My DRAW-RECTANGLE definition above, with a 2DUP, was reasonably flowing by my standards: you get width and height, duplicate both, and have all 4 data items – width,height,width,height – consumed by the next 4 words. Compact, efficient – little stack manipulation. Alternatively we could write:

: DRAW-RECTANGLE { width height }
  height UP
  width RIGHT
  height DOWN
  width LEFT
;

Less compact, but very readable – not really, if you think about it, since nobody knows how much stuff UP leaves on the stack and what share of that stuff RIGHT consumes, but readable enough if you assume the obvious. One reason not to use locals is that Chuck Moore hates them:

I remain adamant that local variables are not only useless, they are harmful.

If you are writing code that needs them you are writing non-optimal code. Don't use local variables. Don't come up with new syntax for describing them and new schemes for implementing them. You can make local variables very efficient especially if you have local registers to store them in, but don't. It's bad. It's wrong.

It is necessary to have [global] variables. … I don't see any use for [local] variables which are accessed instantaneously.

Another reason not to use locals is that it takes time to store and fetch them. If you have two items on a data stack on a hardware stack machine, + will add them in one cycle. If you use a local, then it will take a cycle to store its value with { local_name }, and a cycle to fetch its value every time you mention local_name. On the first version of our machine, it was worse as fetching took 2 cycles. So when I wrote my Forth code, I had to make it "flow" for it to be fast.

The abundance of DUP, SWAP, -ROT and -ROT3 in my code shows that making it flow wasn't very easy. One problem is that every stack manipulation instruction also costs a cycle, so I started wondering whether I was already past the point where I had a net gain. The other problem was that I couldn't quite follow this flow.

Another feature of good Forth code, which supposedly helps achieve the first good feature ("flow" on the stack), is factoring. Many small definitions.

Forth is highly factored code. I don't know anything else to say except that Forth is definitions. If you have a lot of small definitions you are writing Forth. In order to write a lot of small definitions you have to have a stack.

In order to have really small definitions, you do need a stack, I guess – or some other implicit way of passing parameters around; if you do that explicitly, definitions get bigger, right? That's how you can get somewhat Forth-y with Perl – passing things through the implicit variable $_: call chop without arguments, and you will have chopped $_.

Anyway, I tried many small definitions:

:inline num_rects params @ ;
:inline sum  3 lshift gray_sums + ;
:inline sum2 3 lshift gray_sums 4 + + ;
:inline rect_offset 4 lshift ;
:inline inv_area rect_offset rects 8 + + @ ;
:inline mean_std_stat ( lo hi -- stat )
  FRAC lshift swap 32 FRAC - rshift or
;
: mean_std_loop
 \ inv_global_std = (1LL << 32) / MAX(global_std, 1);
 dup 1 max 1 swap u/mod-fx32 drop \ 32 frac bits

 num_rects \ start countdown
 begin
  1 - \ rects--
  dup sum2 @
  over sum @
  pick2 inv_area
  mean_std \ global_mean global_std inv_global_std rectind mean std
  rot dup { rectind } 2 NUM_STATS * * stats_arr OFT 2 * + + { stats }
  \ stats[OFT+0] = (short)( ((mean - global_mean) * inv_global_std) >> (32 - FRAC) );
  \ stats[OFT+1] = (short)( std * inv_global_std >> (32 - FRAC) );
  pick2       um* mean_std_stat stats 2 + h! \ global_mean global_std inv_global_std mean
  pick3 - over m* mean_std_stat stats h!
  rectind ?dup 0 = \ quit at rect 0
 until
 drop 2drop
;

I had a bunch of those short definitions, and yet I couldn't get rid of heavy functions with DUP and OVER and PICK and "C" comments to make any sense of it. This stack business just wasn't for me.

Stacks are not popular. It's strange to me that they are not. There is just a lot of pressure from vested interests that don't like stacks, they like registers.

But I actually had a vested interest in stacks, and I began to like registers more and more. The thing is, expression trees map perfectly to stacks: (a+b)*(c-d) becomes a b + c d - *. Expression graphs, however, start to get messy: (a+b)*a becomes a dup b + *, and this dup cluttering things up is a moderate example. And an "expression graph" simply means that you use something more than once. How come this clutters up my code? This is reuse. A kind of factoring, if you like. Isn't factoring good?
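
Here's the point in a toy Python sketch: compiling an expression tree to stack code is just a postorder walk; the moment a value is used twice – a graph – the dup business begins:

def to_stack_code(expr):
  # expr is either a leaf (a string) or a tuple (op, left, right);
  # a postorder walk of a tree yields RPN directly
  if isinstance(expr, str):
    return [expr]
  op, left, right = expr
  return to_stack_code(left) + to_stack_code(right) + [op]

# (a+b)*(c-d) - a tree, every value used once, no stack juggling:
print(" ".join(to_stack_code(("*", ("+", "a", "b"), ("-", "c", "d")))))
# prints: a b + c d - *

# (a+b)*a - a graph, 'a' is used twice; this naive compiler just pushes it
# again, while hand-written Forth would keep it around with a dup: a dup b + *
print(" ".join(to_stack_code(("*", ("+", "a", "b"), "a"))))
# prints: a b + a *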

In fact, now that I thought of it, I didn't understand why stacks were so popular. Vested interests, perhaps? Why is the JVM bytecode and the .NET bytecode and even CPython's bytecode all target stack VMs? Why not use registers the way LLVM does?

Speaking of which. I started to miss a C compiler. I downloaded LLVM. (7000 files plus a huge precompiled gcc binary. 1 hour to build from scratch. So?) I wrote a working back-end for the stack machine within a week. Generating horrible code. Someone else wrote an optimizing back-end in about two months.

After a while, the optimizing back-end's code wasn't any worse than my hand-coded Forth. Its job was somewhat easier than mine since by the time it arrived, it only took 1 cycle to load a local. On the other hand, loads were fast as long as they weren't interleaved with stores – some pipeline thing. So the back-end was careful to reorder things so that huge sequences of loads went first and then huge sequences of stores. Would be a pity to have to do that manually in Forth.

You have no idea how much fun it is to just splatter named variables all over the place, use them in expressions in whatever order you want, and have the compiler schedule things. Although you do it all day. And that was pretty much the end of Forth on that machine; we wrote everything in C.

What does this say about Forth? Not much except that it isn't for me. Take Prolog. I know few things more insulting than having to code in Prolog. Whereas Armstrong developed Erlang in Prolog and liked it much better than reimplementing Erlang in C for speed. I can't imagine how this could be, but this is how it was. People are different.

Would a good Forth programmer do better than me? Yes, but not just at the level of writing the code differently. Rather, at the level of doing everything differently. Remember the freedom quote? "Forth is about the freedom to change the language, the compiler, the OS or even the hardware design".

…And the freedom to change the problem.

Those computations I was doing? In Forth, they wouldn't just write it differently. They wouldn't implement them at all. In fact, we didn't implement them after all, either. The algorithms which made it into production code were very different – in our case, more complicated. In the Forth case, they would have been less complicated. Much less.

Would less complicated algorithms work? I don't know. Probably. Depends on the problem though. Depends on how you define "work", too.

The tiny VLSI toolchain from the epigraph? I showed Chuck Moore's description of that to an ASIC hacker. He said it was very simplistic – no way you could do with that what people are doing with standard tools.

But Chuck Moore isn't doing that, under the assumption that you need not to. Look at the chips he's making. 144-core, but the cores (nodes) are tiny – why would you want them big, if you feel that you can do anything with almost no resources? And they use 18-bit words. Presumably under the assumption that 18 bits is a good quantity, not too small, not too large. Then they write an application note about implementing the MD5 hash function:

MD5 presents a few problems for programming a Green Arrays device. For one thing it depends on modulo 32 bit addition and rotation. Green Arrays chips deal in 18 bit quantities. For another, MD5 is complicated enough that neither the code nor the set of constants required to implement the algorithm will fit into one or even two or three nodes of a Green Arrays computer.

Then they solve these problems by manually implementing 32b addition and splitting the code across nodes. But if MD5 weren't a standard, you could implement your own hash function without going to all this trouble.

In his chip design tools, Chuck Moore naturally did not use the standard equations:

Chuck showed me the equations he was using for transistor models in OKAD and compared them to the SPICE equations that required solving several differential equations. He also showed how he scaled the values to simplify the calculation. It is pretty obvious that he has sped up the inner loop a hundred times by simplifying the calculation. He adds that his calculation is not only faster but more accurate than the standard SPICE equation. … He said, "I originally chose mV for internal units. But using 6400 mV = 4096 units replaces a divide with a shift and requires only 2 multiplies per transistor. … Even the multiplies are optimized to only step through as many bits of precision as needed.

This is Forth. Seriously. Forth is not the language. Forth the language captures nothing, it's a moving target. Chuck Moore constantly tweaks the language and largely dismisses the ANS standard as rooted in the past and bloated. Forth is the approach to engineering aiming to produce as small, simple and optimal a system as possible, by shaving off as many requirements of every imaginable kind as you can.

That's why its metaprogramming is so amazingly compact. It's similar to Lisp's metaprogramming in much the same way bacterial genetic code is similar to that of humans – both reproduce. Humans also do many other things that bacteria can't (…No compatibility. No files. No operating system). And have a ton of useless junk in their DNA, their bodies and their habitat.

Bacteria have no junk in their DNA. Junk slows down the copying of the DNA which creates a reproduction bottleneck so junk mutations can't compete. If it can be eliminated, it should. Bacteria are small, simple, optimal systems, with as many requirements shaved off as possible. They won't conquer space, but they'll survive a nuclear war.

This stack business? Just a tiny aspect of the matter. You have complicated expression graphs? Why do you have complicated expression graphs? The reason Forth the language doesn't have variables is because you can eliminate them, therefore they are junk, therefore you should eliminate them. What about those expressions in your Forth program? Junk, most likely. Delete!

I can't do that.

I can't face people and tell them that they have to use 18b words. In fact I take pride in the support for all the data types people are used to from C in our VLIW machine. You can add signed bytes, and unsigned shorts, and you even have instructions adding bytes to shorts. Why? Do I believe that people actually need all those combinations? Do I believe that they can't force their 16b unsigned shorts into 15b signed shorts to save hardware the trouble?

OF COURSE NOT.

They just don't want to. They want their 16 bits. They whine about their 16th bit. Why do they want 16 and not 18? Because they grew up on C. "C". It's completely ridiculous, but nevertheless, people are like that. And I'm not going to fight that, because I am not responsible for algorithms, other people are, and I want them happy, at least to a reasonable extent, and if they can be made happier at a reasonable cost, I gladly pay it. (I'm not saying you can't market a machine with limited data type support – I'm just using this as an example of the kind of junk I'm willing to carry, and that Forth recommends you don't.)

Why pay this cost? Because I don't do algorithms, other people do, so I have to trust them and respect their judgment to a large extent. Because you need superhuman abilities to work without layers. My minimal stack of layers is – problem, software, hardware. People working on the problem (algorithms, UI, whatever) can't do software, not really. People doing software can't do hardware, not really. And people doing hardware can't do software, etc.

The Forth way of focusing on just the problem you need to solve seems to more or less require that the same person or a very tightly united group focus on all three of these things, and pick the right algorithms, the right computer architecture, the right language, the right word size, etc. I don't know how to make this work.

My experience is, you try to compress the 3 absolutely necessary layers to 2, you get a disaster. Have your algorithms people talk directly to your hardware people, without going through software people, and you'll get a disaster. Because neither understands software very well, and you'll end up with an unusable machine. Something with elaborate computational capabilities that can't be put together into anything meaningful. Because gluing it together, dispatching, that's the software part.

So you need at least 3 teams, or people, or hats, that are to an extent ignorant about each other's work. Even if you're doing everything in-house, which, according to Jeff Fox, was essentially a precondition to "doing Forth". So there's another precondition – having people able to do what at least 3 people in their respective areas normally do, and concentrating on those 3 things at the same time. Doing the cross-layer global optimization.

It's not how I work. I don't have the brain nor the knowledge nor the love for prolonged meditation. I compensate with, um, people skills. I find out what people need, that is, what they think they need, and I negotiate, and I find reasonable compromises, and I include my narrow understanding of my part – software tools and such – into those compromises. This drags junk in. I live with that.

I wish I knew what to tell you that would lead you to write good Forth. I can demonstrate. I have demonstrated in the past, ad nauseam, applications where I can reduce the amount of code by 90% and in some cases 99%. It can be done, but in a case by case basis. The general principle still eludes me.

And I think he can, especially when compatibility isn't a must. But not me.

I still find Forth amazing, and I'd gladly hack on it upon any opportunity. It still gives you the most bang for the buck – it implements the most functionality in the least space. So still a great fit for tiny targets, and unlikely to be surpassed. Both because it's optimized so well and because the times when only bacteria survived in the amounts of RAM available are largely gone so there's little competition.

As to using Forth as a source of ideas on programming and language design in general – not me. I find that those ideas grow out of an approach to problem solving that I could never apply.

Update (July 2014): Jeff Fox's "dead frog dissector" explained his view of the matter in a comment to this article, telling us why the frog (colorForth) died in his hands in the first place… A rather enlightening incident, this little story.