An unusual hardware architecture: APA (Associative Processing Array)

We're living in the golden age of hardware design – it probably won't get any cheaper to make a new chip. Yes, there's the downside that it's harder to be wildly incompatible with everything than in the old days – there's plenty of standard interfaces to support. On the other hand, it's a benefit, not just a drawback – make a chip with a CPU that runs C, an Ethernet controller and a DRAM interface, and it's now usable with plenty of software and hardware developed by others.

And then for the wilder, "incompatible" innovations – if you really want to invent interfaces and not just implementations – there's the "accelerator" realm: an MPEG decoder, a DSP, a GPU. A decade ago, they said that "standard" microprocessors killed all high-performance architectures. Today, a system that kept the "fastest computer in the world" title for almost a year is based on GPUs – and "non-standard"/"incompatible" accelerators are everywhere (not to mention SIMD instruction set extensions right inside the otherwise "compatible"/"commodity" CPUs).

Still, while it indeed became way cheaper to make your own chips since you don't have to erect your own fab to do that, "cheap" still means costs somewhere in the millions – not quite what the word "cheap" brings to mind. And mistakes still can't be corrected, not really. Which results, among other things, in an understandable risk aversion with respect to design.

A language geek, who's naturally curious about programming languages that don't look mainstream (C with classes), is going to meet many such languages, and plenty of implementations to play with. A hardware geek, who's equally curious about designs that don't look mainstream (RISC/VLIW with bells and whistles), is going to find few such designs and very few implementations. Risk aversion is innovation aversion.

APA thus combines two traits that are rare – it gets at least 9 out of 10 in the "non-mainstream" category, and implementations were actually manufactured and shipped, specifically, in early smartphones by NeoMagic (BDTI and EETimes articles are some of the architecture overviews). The phones apparently weren't very successful, but it seems a poor measure of the architecture's merit in this case, just because there are so many ways to fail regardless of your accelerator quality. So the platform's demise gives us not evidence but a lack of evidence (the platform wasn't widely targeted by 3rd party developers whose opinions could be interesting).


APA stands for Associative Processing Array, a kind of content-addressable memory. Just memory. How do you compute using just memory? The operations are done right there, near the cells where your bits are kept. No need to read your data word by word into a processor, then write it back – rather, your operations run in parallel, inside the memory, on every data word! SIMD operations on steroids.

What operations? Bitwise operations – masked compare and masked write:

  • Compare: tag[i] = (word[i]&mask)==(compare_data&mask)
  • Write: if(tag[i]) word[i]=(word[i]&mask) | (write_data&~mask)

…where a tag bit is kept for every data word (row), and mask & compare/write data are broadcasted to all rows. You can also move words – either to the adjacent rows or to rows at distance 8 (or, say, 16, depending on the hardware configuration – but not at an arbitrary distance). And you can move the data in or out, sequentially.
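To make the two primitives concrete, here's a tiny Python model of them, with rows held as integers. This is my own toy code, not NeoMagic's interface; note that here the mask selects the bits being written, the complement of the convention used in the write formula above:

```python
# Toy model of APA's two primitives; rows are Python ints.

def compare(rows, mask, data):
    """Masked compare: one tag bit per row."""
    return [(w & mask) == (data & mask) for w in rows]

def write(rows, tags, mask, data):
    """Masked write: in every tagged row, overwrite the masked bits."""
    for i, t in enumerate(tags):
        if t:
            rows[i] = (rows[i] & ~mask) | (data & mask)

rows = [0b1010, 0b1110, 0b0010]
tags = compare(rows, mask=0b0110, data=0b0010)  # tag rows whose bits 2:1 equal 01
write(rows, tags, mask=0b0001, data=0b0001)     # set bit 0 in every tagged row
print([f"{w:04b}" for w in rows])  # -> ['1011', '1110', '0011']
```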

Where's the code – who decides what to write, compare and move in what order? One option is to use a low-end CPU running "normal" code to control the contents of the compare/write data & mask registers, to issue APA operations and to interact with the host system (the "real" CPU and external memory). This little processor isn't "accelerating" computations by itself, just controls the APA accelerator hardware.

That's it. No add, no multiply, no nothing. How's that for non-mainstream?

So, how do you actually do anything with this SIMD-on-steroids machine – say, add two vectors of numbers? Well, you need to have both vectors in the array – each element pair occupying some of the bits of a row (for instance, the NeoMagic 512-row APA had 160 bits per row, so for 16 bit vectors, you'd use 32 bits out of 160). Then you perform the addition bit by bit, using masked compare and write operations.

That's right, addition done entirely in software. Latency: 48 cycles for 16 bits, throughput: >10 16b additions/cycle for NeoMagic's 512-row APA. More latency for 32 bits, less for 8 bits, still less for 3 bits. Which means optimization opportunities you're not used to having in software (you gain no speed-up using only 3 bits of a 32b CPU register for addition). On the other hand, it means large penalties for high precision (floating point takes thousands of cycles).
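To see what "addition done entirely in software" means here, below is a toy bit-serial adder built from nothing but masked compare and masked write, simulated in Python. The field layout, the scratch carry bit and all the names are mine – a sketch of the idea, not NeoMagic's actual operation sequence (and in this model, the write mask selects the bits being written):

```python
# Bit-serial add of two fields in every row at once, using only
# masked compare and masked write (toy model, illustrative layout).

def compare(rows, mask, data):
    return [(w & mask) == (data & mask) for w in rows]

def write(rows, tags, mask, data):
    for i, t in enumerate(tags):
        if t:
            rows[i] = (rows[i] & ~mask) | (data & mask)

def add_fields(rows, a_off, b_off, r_off, c_bit, t_bit, nbits):
    everyone = [True] * len(rows)
    write(rows, everyone, (1 << c_bit) | (1 << t_bit), 0)  # clear carry & scratch
    for i in range(nbits):
        a, b = 1 << (a_off + i), 1 << (b_off + i)
        c, t, r = 1 << c_bit, 1 << t_bit, 1 << (r_off + i)
        for pat in range(8):  # all (a_i, b_i, carry-in) combinations
            s = (pat & 1) + ((pat >> 1) & 1) + ((pat >> 2) & 1)  # bit sum, 0..3
            tags = compare(rows, a | b | c,
                           (a if pat & 1 else 0) |
                           (b if pat & 2 else 0) |
                           (c if pat & 4 else 0))
            # write the sum bit, and the carry-out into a scratch bit, so
            # later compares in this round still see the old carry
            write(rows, tags, r | t, (r if s & 1 else 0) | (t if s & 2 else 0))
        tags = compare(rows, t, t)     # copy scratch carry into the carry bit
        write(rows, everyone, c, 0)
        write(rows, tags, c, c)

rows = [3 | (5 << 8), 200 | (55 << 8)]  # field a at bit 0, field b at bit 8
add_fields(rows, a_off=0, b_off=8, r_off=16, c_bit=24, t_bit=25, nbits=8)
print([(w >> 16) & 0xff for w in rows])  # -> [8, 255]
```

Counting the operations gives a feel for the latency figures quoted above: each bit position costs a constant number of compare/write steps, so latency grows linearly with precision, while every row is processed simultaneously regardless of how many rows there are.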

Tools and libraries would be supplied with an APA implementation, but the exciting thought for a programmer is to be able to bypass them where needed and get down to bit-level manipulation! (The exciting thought for a decision maker would be to always use the tools, never pay someone for the bit fiddling; well, different people get their excitement from different things.)

Pros & cons

In a way, it's an exceedingly generic and elegant architecture, and a very impressive one. One "gets it" like one doesn't get any other. Just try to describe any other sort of programmable machine to a level of detail sufficient for accurate predictions about performance.

Surprisingly – or, perhaps, unsurprisingly given just how unusual this machine is – I simultaneously also don't get it like any other. I'm not used to thinking of computations that way, both in terms of precision and in terms of data access. I have no idea what share of my 8-bit numbers really have just 5 or 2 significant bits – and it can matter a lot here.

Likewise, even for algorithms that don't require random access – and APA is simply not good for random access – I have little idea of just how non-random my access is. That is, similarly to the precision case where constants I'm not used to caring about matter, there's a difference between accessing a row at distance 1 and a row at distance 5, and I don't know how frequent different distances between adjacent things are in a code base I care about.

So while in the abstract, a shorter than ever architecture description is sufficient here for accurate performance predictions, I lack intuition and experience that would make such predictions easy.

A great thing about the APA design is its "hardware friendliness". The typical processor design involves a truckload of convoluted circuits that are likeable for their function, but unimaginable to the human mind as physical objects – so cumbersome, infinitely configurable, finicky tools are used to translate a functional description of said circuits to actual electric circuits occupying actual space.

The typical hardware design is not hardware friendly (sounds strange – but the average piece of software isn't very considerate towards its target processor and software stack, either). The APA, on the other hand, has a natural spatial mapping with its 2 dimensions (row bit width and the number of rows) and regular connections between just the adjacent or relatively close rows.

Again, surprisingly – or unsurprisingly given how unusual the APA is – this is also a drawback, in the sense that the standard hardware design tools will do a very poor job implementing APA. For that matter, standard tools will likely do an even worse job implementing plain RAM – which also has a natural spatial mapping and regular structure.

RAM is analog design, not digital: rather than describing it on a functional level, someone implements it as a physical circuit description. Basically, RAM is a library for digital tools, like flip-flops or simple gates – a library that can't be implemented using the tool itself.

I really appreciate the von Neumann architecture, but it's scary to realize just how strong the von Neumann grip really is. RAM is basically the only interface to a large array of bits that is relatively easily accessible to a hardware designer. (Not that it's very easily accessible, mind you – there is no portable interface for RAM, believe it or not, but no matter how you build your chips, you'll be able to get some sort of RAM).

What happens if you want your own bag of bits instead of RAM? You can do what they call full custom design – make your own physical circuit description. This isn't "portable", where "porting" means changing the manufacturing process – either to move from 65nm to 40nm, or from one manufacturer to another. It's also a longer and more complicated process. But it's certainly doable, especially if you outsource it to the inventors, as you'd have to do anyway because the thing is patented.

The hostility to hardware design tools (despite the friendliness to the basic constraints of hardware) and patent protection make APA more of a possible off-the-shelf solution than an inspiring source of ideas, and so does the design simplicity. Unlike, say, VLIW, which is a design style with basic ideas in the public domain and countless possible variations and extensions (starting with the "let's add this one spiffy instruction" variety), APA is more or less complete. There are important constants to tweak – row width and the distance to easily accessible neighbors – but while it could be a hard choice to make, it's not very creative. (I do believe it could be a source of ideas if only through expanding one's horizons, which is why I write about it.)

If we attempt to discuss the efficiency of APA qualitatively, in terms of hardware resource utilization rather than throughput per mm^2 or per mW for a given app, three things come to mind:

  • Benefit - a perennial bottleneck of accessing memory through a bus very narrow compared to the amount of bits stored is eliminated. This "ought to be good" for algorithms where access is "far from random" since these gain nothing from conventional RAM's flexibility but pay the full price of its low throughput.
  • Drawback - the cycle is "wasted": values are read from flip-flops, undergo very few transformations and are written back. This "ought to be bad" because it won't translate to high frequency (you can't read then write a flip-flop very fast – and doing this with them all at once dissipates power, so in fact low frequencies enabled by parallelism, not high frequencies enabled by circuit simplicity are APA's potential advantage in the frequency department). A competing design running at a similar frequency, however, will do much more per cycle (like actual addition) because circuits implementing combinatorial logic are fast. The question thus becomes how big those circuits are compared to flip-flops: if enough APA rows can be packed instead, the throughput will still be competitive despite the abysmal latency – provided that you simultaneously process sufficiently large amounts of data.
  • Benefit - ability to use DRAM cells. This "ought to be good" since DRAM cells are much smaller than normal flip-flops (which is achieved through having them leak their charge so they have to be periodically refreshed). AFAIK, nobody uses them for computation directly because they don't fit in traditional hardware models – neither as registers (that's plain dumb) nor as local RAM (they have poor latency and imply a cumbersome controller to access as a RAM). An APA, on the other hand, could potentially run well on DRAM cells if the required simple circuitry were implemented near the cells. One problem with this is that custom chips may be easy to make these days, but not custom DRAM chips. In particular, Andy Glew, formerly of Intel and then AMD, said somewhere that it's "hard to influence DRAM makers" – and if it's hard for Intel or AMD, it's probably hard in general.

Overall, I have a lot of reservations here – not just because of the way originality by itself seems to complicate matters here (I'd hate to admit it as the only reason…), but because it's a step in the opposite direction to what successful SIMD systems are doing. That is, you get high throughput given unusually low precision and unusually restricted data access patterns. A DSP from the late 90s would attempt to process numbers with more bits and would let you fetch them from more places. A GPU from the 2000s still more so – floating point numbers, parallel random access with transparent contention handling (at least in CUDA GPUs). It seems that the direction is removing restrictions on the set of things you run fast, not very high throughput for a very restricted set.

On the other hand, there exist algorithms with low precision and high locality. And it's exciting to see an architecture which is not RAM-based, naturally represented in space, etc. – all those things which get people excited in hardware discussions – but practical enough for a real world delivery, and with some things connecting it to more usual programming models (for instance, SIMD commands broadcasted from a CPU instead of clever local rules you have no idea how to come up with as in cellular automata). So it's definitely very interesting.

Update: as a better informed commenter pointed out, APA is also known as CAPP – Content Addressable Parallel Processor – and implementations date as far back as 1972 (which makes you wonder what could legitimately remain outside the public domain by now). This seems like evidence of having failed the test of time – or of having some really deep trouble with commercialization. I'd be curious to hear a software developer's experience with this – BDTI's article from 2003 talked about the development experience in hypothetical terms and entirely in the future tense, whereas some past evidence has to be available.

We're hiring

Or, we'll hire if we find the right person.

If you live in Israel and are into things like:

  • hardware/processor/programming language/library design
  • compilers, debuggers, profilers, simulators, instrumentation
  • number crunching/optimization
  • parallelism/multi-core/distributed computing

…we might have a bunch of stuff to interest you. Benefits/drawbacks/features:

  • Most projects are intended primarily for internal use – "infrastructure" if you like
  • Lots of custom things, ranging from our own chip to our own distributed build server
  • Plenty of production projects depending on your work
  • Ability to do grand things singlehandedly guaranteed through understaffing
  • Bleeding edge, "first/best" system in several categories
  • Little management, especially if you don't really need any
  • A "Worse is Better" spirit (reasonably pragmatic perfectionists are welcome though)
  • Many long-term projects
  • No downsizing during previous bubble burst
  • Experience in any specific relevant area appreciated but not required
  • Mostly C/C++, Python, and our own languages; if it matters, you can use any other language
  • Mostly Linux, a bit of Windows, Eclipse usage share eclipsed by emacs & vi
  • Nice people
  • Jerusalem

A great place to work, if you ask me. If you're interested, send mail to

Machine code monkey patching

A monkey patch is a way to extend or modify the runtime code of dynamic languages (e.g. Smalltalk, JavaScript, Objective-C, Ruby, Perl, Python, Groovy, etc.) without altering the original source code.


For example, the Python code:

# someone else's class
class TheirClass:
 def their_method(self):
  print "them"
obj = TheirClass()
obj.their_method()

# our function
def our_function(self):
 print "us"

# the monkey patch
TheirClass.their_method = our_function
obj.their_method()

…will print:

them
us
…showing that we have changed the behavior of TheirClass objects, including those we didn't create ourselves. Which can't be done with more civilized techniques like inheritance.

Here's how you can monkey patch machine code, assuming the machine architecture is ARM:

typedef void (*funcptr)();

void monkey_patch(funcptr their_func, funcptr our_func) {
  ((int*)their_func)[0] = 0xe51ff004;
  ((int*)their_func)[1] = (int)our_func;
}

//monkey patching the memory allocator:
monkey_patch((funcptr)&malloc, (funcptr)&our_malloc);
monkey_patch((funcptr)&free, (funcptr)&our_free);

This overwrites the first instruction (32-bit word) of their_func with 0xe51ff004, which is the ARM machine code corresponding to the assembly instruction LDR PC,[PC,-4] – which means, in C-like pseudocode, PC = *(PC+4), or "jump to the program location pointed by the next word after the current program location".
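For the skeptical, the claim can be checked by pulling the word apart – a quick sanity check in Python against the standard ARM single-data-transfer encoding (cond | 01 | I P U B W L | Rn | Rd | offset12):

```python
# Decode 0xe51ff004 and verify it really is LDR PC,[PC,#-4]:
instr = 0xe51ff004
assert instr >> 28 == 0xE           # condition: AL (always execute)
assert (instr >> 26) & 3 == 1       # class: single data transfer (LDR/STR)
assert (instr >> 25) & 1 == 0       # I=0: immediate offset
assert (instr >> 24) & 1 == 1       # P=1: pre-indexed (offset applies to base)
assert (instr >> 23) & 1 == 0       # U=0: subtract the offset
assert (instr >> 20) & 1 == 1       # L=1: load (not store)
assert (instr >> 16) & 0xF == 15    # base register Rn: r15, the PC
assert (instr >> 12) & 0xF == 15    # destination Rd: also the PC
assert instr & 0xFFF == 4           # offset: 4 -> LDR PC,[PC,#-4]
print("ok")
```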

(Why is the byte address PC+4 spelled in assembly as PC-4? I recall that it's because an ARM instruction at address X actually gets the value X+8 when referencing PC. Presumably because it is – or at some point was – the most convenient semantics for pipelined hardware to implement:

  • when the instruction at address X executes,
  • the instruction at address X+4 is decoded, and
  • the instruction at address X+8 is fetched

- so the physical PC register could very well keep the value X+8.)

So the first word of their_func is overwritten with, "jump to where the second word points". The second word is then overwritten with our_func, and we're all set.


I actually did this in production code, on a bare metal target (no OS – just a boot loader that runs a massive single binary). I monkey patched the memory allocator – malloc, free, calloc, realloc – and the Unix-like I/O functions underlying that particular compiler's <stdio.h> and <iostream> implementation – read, write, open, close, creat. The memory allocator had to be changed to work on the target dual-core chip. The I/O functions had to be changed to use our drivers, so that we could write stuff to the Flash or USB using FILE* or ofstream.

A more civilized approach, if you want to override functions in a dynamic library, is passing another library at run time with LD_PRELOAD or equivalent. And if the code is linked statically as it was in my case, you can override the functions at link time. The trouble is that the linker could refuse to cooperate.

(And in my case, we shipped libraries, the customer linked the program, and the guy who talked to the customer refused to cooperate – that is, to help them override functions at link time. He was an old-school embedded developer, the kind that don't need no stinking malloc and printf. The project had a million lines of code very much in need of malloc and printf. He said, clean it up. Don't call malloc on the second CPU. So I went away and monkey patched malloc anyway.

In such a case, the civilized approach is to keep trying to talk the guy into it, and then have him persuade the (even more hardcore) embedded devs at the customer's side. What I did was what managers call "an attempt at a technical solution when a social solution is needed". Or as programmers call it, "avoiding a couple of months of pointless discussions". Being neither a full-time programmer nor a full-time manager, I don't have a clear opinion which viewpoint is right. I guess it depends on how long and pointless the discussions are going to be, versus how long and pointless the code working around the "social" problem will be.)

In theory, machine code monkey patching could be used in a bunch of "legitimate" cases, such as logging or debugging. In practice, this ugly thing is probably only justified in inherently ugly situations – as is kinda true of monkey patching in general.


My example implementation for the ARM assumes that a function has at least 2 instructions. An empty ARM assembly function can have just one (jump to link register). In that case, the first instruction of the next function will be overwritten. A more sophisticated version of monkey_patch() could stash the target address someplace else, and use a LDR PC,[PC,clever_offset] command instead of a constant LDR PC,[PC,-4] command.

Overwriting machine code instructions breaks code that reads (as opposed to "executes") those instructions, counting on the original bit patterns to be stored there. This isn't very likely to be a problem with actual ARM code, unless it was written by Mel.

On any machine with separate and unsynchronized instruction and data caches, overwriting instructions will modify the contents of the data cache but not the instruction cache. If the instructions happen to be loaded to the instruction cache at the time of overwriting, subsequent calls to the monkey-patched function might call the original function, until the instruction cache line keeping the original code happens to be evicted (which isn't guaranteed to ever happen).

If your luck is particularly bad and the two overwritten instructions map to two adjacent cache lines, only one of which is loaded to the instruction cache at the time of overwriting, a call to the monkey-patched function might crash (since it'll see one original instruction word and one new one). In any case, on machines where caches won't sync automatically, one should sync them explicitly to implement self-modifying code correctly (I'll spare you my ARM9 code doing this).

If your OS places instructions in read-only memory pages, overwriting them will not work unless you convince the OS to grant you write permission.


C++ virtual functions can be monkey patched in a way much closer to the typical dynamic language way. Instead of modifying instructions, we can overwrite the virtual function table.

Pros:

  • more portable across machine architectures – the vtable layout doesn't depend on the machine
  • no cache syncing problems
  • no encoding-related corner cases like very short functions or instructions used as data

Cons:

  • less portable across compilers – ARM machine code is the same with all compilers, vtable layout is not
  • fewer calls could be redirected – some compilers avoid the vtable indirection when they know the object's type at compile time (of course inlined calls won't be redirected with either technique)
  • only virtual functions can be redirected – typically a minority of C++ class member functions

The need to fiddle with OS memory protection is likely to remain since vtables are treated as constant data and as such are typically placed in write-protected sections.

Example C++ code (g++/Linux, tested with g++ 4.2.4 on Ubuntu 8.04):

#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

template<class T, class F>
void monkey_patch(int their_ind, F our_func) {
  T obj; //can we get the vptr without making an object?
  int* vptr = *(int**)&obj;
  //align to page size:
  void* page = (void*)(int(vptr) & ~(getpagesize()-1));
  //make the page with the vtable writable
  if(mprotect(page, getpagesize(), PROT_WRITE|PROT_READ|PROT_EXEC))
    perror("mprotect"), exit(1);
  vptr[their_ind] = (int)our_func;
}

class TheirClass {
public:
  virtual void some_method() {}
  virtual void their_method() { printf("them\n"); }
};

void our_function() { printf("us\n"); }

int main() {
  TheirClass* obj = new TheirClass;
  //gcc ignores the vtable with a stack-allocated object
  obj->their_method(); //prints "them"

  monkey_patch<TheirClass, void(*)()>(1, our_function);
  //some_method is at index 0, their_method is at 1
  //we could instead try to non-portably get the index
  //out of &TheirClass::their_method

  obj->their_method(); //prints "us"
}


Let's drink to never having to do any of this (despite the fact that yes, some of us do enjoy it in a perverted way and feel nostalgic blogging about it).

Making data races manifest themselves

Or, synchronization, valgrind and poset dimension.

This story is perhaps more technical than I ever allowed myself to get online. If I may say so – it may be worth following; I can say that it really changed my perspective on what data races are.

I'll need to explain a bit about the context:

  • how we structure our parallel apps for multi-core targets
  • how we catch synchronization bugs using a custom valgrind tool
  • which of the bugs were hard to catch that way

Then I can explain how we now catch them all (at least those having any chance to happen on our test inputs; you'll see what I mean).

I'll also discuss valgrind's standard race detection tool, helgrind. I'll try to show that the interface to parallelism it works against – pthreads – poses a much harder problem, and how races can go undetected because of that.

I hope this can be interesting outside of my specific work context because "catching all the races" seems to be a claim rarely made. Also, we stumbled upon a well-known math problem during mundane pointer-chasing – a cliché of sorts, but amusing when it actually happens.

So, here goes.

Our parallelism framework

…known in its current incarnation as GSF – "Goal/Graph Scheduling Framework". It's fairly simple – 4 paragraphs and 2 pseudo-code listings should be a reasonable introduction.

The idea is, your app is a graph of goals. The nodes, goals, are things to do – typically function pointers (could be function objects or closures). The edges are dependencies – A depends on B, so A must be executed after B (denoted B → A in the diagrams). The main loop does this:

goal = first_goal # a dummy everything depends on
while goal != last_goal: # a dummy, depends on everything
  update_dependent_goals_status(goal) # find READY goals
  for chosen in sched.choose_goals(): # scheduler
    chosen.resource.enqueue(chosen) # outgoing message Qs
  goal = dequeue_finished_goal() # incoming message Q

This runs all the goals in a graph, and can itself be invoked in a loop – in our case, we run all the goals for every grabbed video frame.

Whatever the scheduling policy, a correct execution order will be chosen. Only READY goals can run, and a goal only becomes READY when everything it depends on becomes DONE. This is implemented in update_dependent_goals_status:

goal.status = DONE
for dep in goal.dependent_goals:
  dep.num_predecessors -= 1
  if dep.num_predecessors == 0:
    dep.status = READY

The scheduling policy is implemented partly in sched.choose_goals – the scheduler, and partly in resource.enqueue – the queues. The scheduler decides who is enqueued first; some queues may reorder waiting goals by priority, or even suspend the running goal(s) when a more important one arrives. While the scheduling policy can be made complicated almost without bound, that's all there is to the framework itself.
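The two listings above can be fleshed out into a minimal runnable model – single-threaded, with one FIFO standing in for the scheduler and the per-resource queues; the names are illustrative, not GSF's actual API:

```python
# Minimal GSF-style goal graph runner (toy model, illustrative names).
from collections import deque

class Goal:
    def __init__(self, name, func=lambda: None):
        self.name, self.func = name, func
        self.dependent_goals = []   # goals that depend on us
        self.num_predecessors = 0   # unfinished goals we depend on

def depends(a, b):                  # A depends on B: B -> A
    b.dependent_goals.append(a)
    a.num_predecessors += 1

def run(goals):
    ready = deque(g for g in goals if g.num_predecessors == 0)
    done = []
    while ready:
        goal = ready.popleft()      # stands in for the scheduler + queues
        goal.func()
        done.append(goal.name)
        for dep in goal.dependent_goals:  # update_dependent_goals_status
            dep.num_predecessors -= 1
            if dep.num_predecessors == 0:
                ready.append(dep)
    return done

A, B, C = Goal("A"), Goal("B"), Goal("C")
depends(A, B)  # B -> A
depends(A, C)  # C -> A
print(run([A, B, C]))  # A runs last, whatever the queue order
```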

To summarize in standard terms, it's a kind of a message passing framework, without explicit threads and without locking, but with (massive) data sharing. "Goal scheduled"/"goal finished" notifications are the only messages, and only used by the framework – the goals themselves always communicate through shared data. Dependencies are the only synchronization primitive. OS locks are never used, except in the guts of various memory allocators.


Users, or "application programmers", usually hate this framework with a passion, because code is hard to read and modify.

To break your app into the sort of graph described above, you need to specify the goals and the dependencies. Currently we do both in ugly ways. Goals are specified as global functions, with context passed through global variables: void goal_A() { g_ctxt… }. The hairy dependency graph is constructed by a Python script running at build time, with ifs, loops, and function calls – if do_A: make_goal("A",dep="B C"). This adds ugliness to C++ code, and then makes you follow ugly Python code to figure out the execution order of the ugly C++ code.

Some of it was even uglier in the past. Some of it could become less ugly in the future. At least one serious limitation is irremediable – the inability to create goals at run time. For example, you can't say "create a goal per processed data item" – rather, you need to explicitly, statically create a goal per CPU and distribute the data between the goals. You can't say "if X happens, create that goal" – you need to always have that goal, and have it do nothing unless X happens. People don't like to look at their flow that way.


Why put up with it, then? Efficiency and correctness, basically. Efficiency is tangential to our story, so I won't discuss it (and frankly, we aren't sure how much efficiency we gain, compared to other approaches). As to correctness – the hateful thing indeed "scales", which is to say, it doesn't break down. We've managed to ship working code for several years, given:

  • 8 to 10 heterogeneous target cores
  • Hundreds of goals
  • Dozens of applications
  • Dozens of developers

…the worst thing being "dozens of developers" – most unaware of parallelism, as I think they should be. Not that it proves we couldn't have done it differently – I can only give my theory as to why it would be harder.

The biggest problem with parallel imperative programs is synchronizing access to shared data. Tasks/threads can, and are encouraged to communicate through messages, but eliminating sharing is hard:

  • People frequently don't even realize they're sharing data

  • If tools make implicit use of shared data impossible, things can get painful (think of explicitly passing around a particularly "popular" data item, or copying large/intertwined objects)

Sharing data is natural and efficient in imperative programs, and outlawing it is hard. And locks are problematic since

  • You can have deadlocks, and
  • You can forget to lock.

Now, if you schedule a statically known flow graph rather than an unconstrained threaded app, it turns out that you solve both of these problems:

  • Deadlocks are trivially prevented - at build time, check that the dependency graph has no cycles

  • Undeclared dependencies – sharing data without proper synchronization – can be found using program instrumentation

Our subject is this second bit – using program instrumentation to find race conditions (which makes everything up to this point an unusually long introduction).

We use a custom valgrind plug-in ("tool" as they call it) for program instrumentation. It works somewhat similarly to valgrind's helgrind tool, though helgrind's implementation is much more complicated.

However, helgrind has false negatives – it will consistently miss some of the data races, as I believe will any tool lacking the knowledge of the overall app structure. A simple example of a data race unreported by helgrind appears after the discussion on race condition detection, when the problem should become more clear.

Race conditions in a static graph

In an imperative program, goals communicate through memory. A memory access can thus be thought of as a proof of dependence. If goal A accesses memory at address X, and goal B was the last to modify X, then A depends on B. If there is no path from A to B in the dependency graph, the program is buggy: A could run before B or after it, so the result of accessing X is undefined.

(Diagram: write then read – the simplest dependence proof.)

Suppose we could intercept all load/store instructions in the program. Then we could maintain, for every byte, its current "owner" – a goal ID. A store to a location would update its owner to the ID of the currently running goal. Loads and stores could then check whether the currently running goal depends on the owner of the accessed location – all it takes is accessing a 2D array, the path matrix. Upon error, print the call stack, and the names of the 2 goals.
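Here's a toy Python model of this owner-tracking scheme – a dictionary standing in for the shadow memory and a nested dict for the path matrix; the structure and names are mine, for illustration:

```python
# Toy owner-tracking race check: every store records the running goal
# as the location's owner; every access checks that the current goal
# can reach the owner in the dependency graph.
class RaceChecker:
    def __init__(self, path_matrix):
        self.path = path_matrix     # path[a][b]: is there a path b -> a?
        self.owner = {}             # address -> goal that last stored there
        self.errors = []

    def check(self, goal, addr):
        owner = self.owner.get(addr)
        if owner is not None and owner != goal and not self.path[goal][owner]:
            self.errors.append((goal, owner, addr))

    def on_load(self, goal, addr):
        self.check(goal, addr)

    def on_store(self, goal, addr):
        self.check(goal, addr)
        self.owner[addr] = goal

# B -> A (A depends on B); C is independent of both
path = {"A": {"A": True, "B": True, "C": False},
        "B": {"A": False, "B": True, "C": False},
        "C": {"A": False, "B": False, "C": True}}
rc = RaceChecker(path)
rc.on_store("B", 0x1000)
rc.on_load("A", 0x1000)   # fine: A depends on B
rc.on_load("C", 0x1000)   # race: no path between C and B
print(rc.errors)          # -> [('C', 'B', 4096)]
```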

This sort of thing is surprisingly easy to implement on top of valgrind, almost as easy as implementing an interface with abstract onLoad and onStore methods. I thought of posting sample code but it looks like the example tool shipped with valgrind, "lackey", comes with a load/store interception example these days.

As to the "shadow memory" you need for keeping owner IDs – you can do that with a page table, along the lines of a virtual memory implementation in an OS. The high bits of a pointer select a page and the low bits are the offset within the page, with pages allocated upon first access. Standard valgrind tools do it this way, too.
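The page-table idea can be sketched like so – Python standing in for the C you'd actually write inside a valgrind tool, with an illustrative page size:

```python
# Toy page-table shadow memory: the high pointer bits select a page of
# owner IDs, the low bits index into it; pages are allocated lazily.
PAGE_BITS = 12
PAGE_SIZE = 1 << PAGE_BITS

class ShadowMemory:
    def __init__(self):
        self.pages = {}             # page number -> list of owner IDs

    def _page(self, addr):
        num = addr >> PAGE_BITS
        if num not in self.pages:   # allocate upon first access
            self.pages[num] = [0] * PAGE_SIZE
        return self.pages[num]

    def get_owner(self, addr):
        return self._page(addr)[addr & (PAGE_SIZE - 1)]

    def set_owner(self, addr, goal_id):
        self._page(addr)[addr & (PAGE_SIZE - 1)] = goal_id

shadow = ShadowMemory()
shadow.set_owner(0xdeadbeef, 42)
print(shadow.get_owner(0xdeadbeef), len(shadow.pages))  # -> 42 1
```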

Our valgrind tool is called shmemcheck for "shared memory checker". A lame pun on memcheck, valgrind's most famous tool, the one reporting uninitialized memory access. "Memcheck-shmemcheck". Probably the last time I used this sort of name – funny for the first few times, embarrassing for the next few thousands of times.

Despite the embarrassment, when we got an implementation good enough to be systematically used, it was really great – data races were decimated. Valgrind is awesome, unbelievably so. It managed to unimaginably expand what can be done to a natively compiled program, decades after the tools for building native programs were cast in stone.

The trouble with this version of, cough, shmemcheck, is that it doesn't really work. That is, sometimes you have false negatives, so you still get to dive into core dumps. Why?

What about the opposite case?

Read then write: another shadow cell?

If A loads from X that was last modified by B, A depends on B, and we detect it alright. What if A writes to X that was previously read by B? This also proves that A should depend on B. Otherwise B will sometimes run after A, reading its update to X, and sometimes before A, reading the previous value. In order to detect this dependency, we have to remember that X was read by B until A stores to X.

…What if, in addition to B, X was read by C, D, and E, all before A's update?

Reads then write: too many shadow cells

Every location always has one "owner" but can have many "users". We can afford keeping an ID – 2 bytes in our case – for every byte. Keeping – and checking – a list of users that could grow to tens or even hundreds of IDs per location sounds impractical.

We used all sorts of application-specific arguments to convince ourselves that this problem is confined to a few particular scenarios. We had a few hacks removing some of the false negatives, at the cost of adding some false positives, and lived with that.

Choosing the order

We had gotten used to thinking that the question was: how do you know who reads X before A? But then it dawned on us that the right question was: why do they read X before A?!

Read then write: why not reverse the order?

And really – if B, C and D run before A because it is stated in the dependency graph that A depends on B, C and D – then there's no bug to worry about. But if there's no dependency declared between, say, A and B, then A could run before B just as well – and we'd detect the bug. So we only missed the race condition because of a randomly chosen order. And when we do find races, we don't know if B depends on A or vice versa – only that we were lucky to have A run first.

What if we choose different random orders and run the same app many times? Then in some run, A will run before B and we'll find the bug – with just one shadow cell keeping the owner.

…Or will it? We have in our dependency graph many "A, B" pairs of independent goals. If we pick random orders, how likely are these pairs to "flip" to "B, A" in some orders but not others?

The first conjecture was – very likely. Just randomize schedules by having sched.choose_goals pick a random goal among the READY ones. "There seems to be no bias in the selection; A and B are independent so the chance for A to come before B is 1/2; therefore, with N orders, the chance for A to always follow B is 1/2^N – small enough."

Interestingly, it still sounds reasonable to me, though I found a counter-example, just out of fear that it's too good to be true (what, years of core dumps are now behind me?). The counter-example is: suppose you have two long processes, P and Q, each made up of 100 goals, P1…P100 and Q1…Q100. P and Q are declared independent – Pi never depends on Qj or vice versa. However, Pi+1 depends on Pi, and similarly for Q, so the goals within each process can run in just one order.

Just two schedules would "flip" all independent pairs: (1) Run P, then Q and (2) Run Q, then P. However, sched.choose_goals has a snowflake's chance in hell to pick these schedules. The set of READY goals will contain, at all times, 2 goals: one from P and one from Q. sched.choose_goals must either choose the one from P 100 times in a row, or the one from Q 100 times in a row. Since that's a 1 in a 2^100 event, P1 will always run before Q100. If P1 loads from X and Q100 stores to X, we'll never find the undeclared dependency.

One could say that things that aren't likely to flip in such random trials are unlikely to flip in reality – and it'd be 100% wrong. In reality, execution order depends on real run times, and those create a very biased, and unpredictably biased, sampling of the distribution of schedules.

Now, with N goals, N*2 orders would clearly be enough to flip all pairs – for every goal, create an order where it precedes everything it doesn't depend on, and another order where it follows everything it doesn't depend on. But our N is above 1000, so N*2, though not entirely impractical, is a bit too much.

Poset dimension

Our pessimism resulting from the collapse of the first conjecture led to an outburst of optimism yielding a second conjecture – just 2 orders are enough to flip all pairs. Specifically, the orders we thought of were a DFS postorder and its "reversal", where you reverse the order in which you traverse the children of nodes.

Works on my P & Q example indeed, but there's still a counter-example, just a bit more elaborate. We kept looking at it until it turned out that we shouldn't. The dependency graph makes the goals a partially ordered set or poset. Finding the minimal number of schedules needed to flip all independent pairs would find the order dimension of that poset. Which is a known problem, and, like all the good things in life, is NP-hard.

Curious as this discovery was, I was crushed by this NP-hardship. However, there still could be a good practical heuristic yielding a reasonably small number of orders.

We tried a straightforward one – along the lines of the N*2 upper bound, "flip them one by one", but greedier. Simply try to flip as many pairs as you can in the same schedule:

  1. Start with a random schedule, and make a list of all the pairs that can be flipped.
  2. Starting with the real dependencies, add fake dependencies forcing pairs from the list to flip, until you can't add any more without creating a cycle.
  3. Create a schedule given those extended dependencies.
  4. Remove the pairs you flipped from the list. If it isn't empty, go to step 2.
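The steps above can be sketched in Python, under the simplifying assumptions (mine) that a schedule is just a topological order of the dependency DAG and that the initial schedule is deterministic rather than random:

```python
# Sketch of the greedy "flip as many pairs as you can" heuristic above.
# Simplifying assumptions (mine): a schedule is any topological order of
# the dependency DAG, and deps[g] is the set of goals g depends on.

def topo_order(n, deps):
    order, done = [], set()
    def visit(g):
        if g in done:
            return
        for d in sorted(deps[g]):
            visit(d)
        done.add(g)
        order.append(g)
    for g in range(n):
        visit(g)
    return order

def reaches(deps, a, b):
    # True if a transitively depends on b
    seen, stack = set(), [a]
    while stack:
        g = stack.pop()
        if g == b:
            return True
        for d in deps[g]:
            if d not in seen:
                seen.add(d)
                stack.append(d)
    return False

def greedy_schedules(n, deps):
    first = topo_order(n, deps)                 # step 1: initial schedule
    pos = {g: i for i, g in enumerate(first)}
    to_flip = {(a, b) for a in range(n) for b in range(n)
               if pos[a] < pos[b]               # a ran before b...
               and not reaches(deps, a, b)      # ...with no dependency
               and not reaches(deps, b, a)}     # in either direction
    schedules = [first]
    while to_flip:                              # step 4: until list is empty
        fake = {g: set(deps[g]) for g in range(n)}
        flipped_now = set()
        for (a, b) in sorted(to_flip):          # step 2: add fake deps
            if not reaches(fake, b, a):         # unless it'd create a cycle
                fake[a].add(b)                  # force a to run after b
                flipped_now.add((a, b))
        schedules.append(topo_order(n, fake))   # step 3: extended-deps schedule
        to_flip -= flipped_now
    return schedules

# miniature P & Q: goals 0,1 are chain P, goals 2,3 are chain Q
deps = {0: set(), 1: {0}, 2: set(), 3: {2}}
schedules = greedy_schedules(4, deps)
```

On this miniature P & Q, two schedules suffice: the second one runs all of Q before all of P.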

This tended to produce only about 10 schedules for more than 1000 goals, and there was much rejoicing.

Massive testing

Instrumentation slows things down plenty, so you can't run many tests. This is a pity because instrumentation only catches data races that actually occurred – it's actual loads by A from location X owned by B that prove that A depends on B. But if you can't run the program on a lot of data, you may not reach the flow path in A that uses X.

However, if you can, with just 10 schedules, flip all "A, B" pairs, you don't need instrumentation in the first place. Just run the app with different orders on the same data, and see if you get the same results. If you don't, there's a race condition – then run under instrumentation to have it pinpointed for you. The order is "purposefully diversified", but deterministic, so problems will reproduce in the second, instrumented run. Thus massive runs can be uninstrumented, and therefore, more massive.

So not only does this cheap poset dimension upper bound business solve the false negatives problem with instrumentation – it also greatly increases coverage. Of course there can still be a race that never occurred in any test runs. But now the quality of testing is defined just by the quality of the test data, never by chance. With races, whether you detect them or not is thought of as something inherently having to do with chance. It turns out that it doesn't have to.

Helgrind's problem

Helgrind – or any tool lacking knowledge about the app structure – cannot meaningfully reorder things this way, in order to "make races manifest themselves". A race can be detected if A, which stores to X, is moved to happen before B, which loads from X – but in order to do that, you have to know where A and B are.

Helgrind's only markers breaking the app into meaningful parts are synchronization calls, like mutex locks/unlocks. But it doesn't know where they'll happen in advance. So it can't steer the execution so that "this lock happens as early as possible" – there's no "this lock" until a lock actually happens. By then it's too late to make it happen before anything else that already happened. (In theory, the app structure could be inferred dynamically from the flow, I just don't see how I'd go about it.)

Therefore, helgrind seems to have the same problem our first versions of shmemcheck had. You can keep the single owner of a location but not its many readers; it appears that helgrind keeps one reader – the last. Helgrind's "owner/reader" is a lock ID rather than a goal ID, but I think the principle is very similar. I haven't studied helgrind's internals, so I only speculate about how it works, but I easily managed to create an example where it gives a false negative:

  1. In thread A, read X without locking, as many times as you want (…or don't want – in a real program it would be a bug, not an achievement…)
  2. After that, read X once with proper locking.
  3. In thread B, modify X, again with proper locking.

If, during testing, thread B happens to modify X after all of thread A's reads from X, the bug in A – unsynchronized access to X – will go unnoticed. It won't go unnoticed if thread A never locks properly – since helgrind will remember and report the last of its unsynchronized reads:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
int shared=1;
int unused=0;
typedef struct { int read_unlocked, read_locked; } sopt;

void* child_fn(void* arg) {
  sopt* opt = (sopt*)arg;
  if(opt->read_unlocked) {
    unused += shared;
  }
  if(opt->read_locked) {
    pthread_mutex_lock(&mutex);
    unused += shared;
    pthread_mutex_unlock(&mutex);
  }
  return 0;
}

void testhelgrind(int read_unlocked, int read_locked,
                  const char* expected) {
  fprintf(stderr,"expected behavior: %s\n", expected);
  sopt opt = { read_unlocked, read_locked };
  pthread_t child;
  pthread_create(&child, 0, child_fn, &opt);
  //lock & modify shared data
  pthread_mutex_lock(&mutex);
  shared = 2;
  pthread_mutex_unlock(&mutex);
  pthread_join(child, 0);
}

int main(void) {
  testhelgrind(0,1,"no errors reported [true negative]");
  testhelgrind(1,0,"data race reported [true positive]");
  testhelgrind(1,1,"no errors reported [false negative]");
  return 0;
}
Which, under valgrind --tool=helgrind, prints:

expected behavior: no errors reported [true negative]
expected behavior: data race reported [true positive]

==15041== Possible data race during write of size 4 at 0x804a024 by thread #1
==15041==    at 0x8048613: test_helgrind (helgrind_example.c:33)
==15041==    by 0x8048686: main (helgrind_example.c:41)
==15041==  This conflicts with a previous read of size 4 by thread #3
==15041==    at 0x804856E: child_fn (helgrind_example.c:14)

expected behavior: no errors reported [false negative]

…or at least it works that way on my system, where the child thread runs first.

So if helgrind does report a data race – be sure to look carefully. If you have several buggy reads, and you only fix the last one and run again, helgrind will very likely not report the previous one. Your new synchronization code around the last read now "masks" the previous read.

The happy ending

We still have false negatives in our data race detection, but these are just easy-to-fix low-level loose ends – or so I think/hope. Things like tracking memory writes by devices other than CPUs, or handling custom memory allocators. I'm looking forward to having these false negatives eliminated, very sincerely. Few things are uglier than debugging data races.

I never thought this through enough to tell whether this purposeful reordering business can be utilized by a framework more restrictive than "use raw threads & locks", but less restrictive than "your app is a static graph". If it can be (is already?) utilized elsewhere, I'd be very curious to know.

Special thanks to Yonatan Bilu for his help with the math.

Leaf (upside down)


The Iron Fist Coding Standard

I've developed the following coding standard during the years when I've been responsible, on and off, for reviewing code by other programmers. It's suitable for most text-based programming languages and data specification formats.


Indent upon nesting

Where control structures or data objects nest, indentation helps a reader to keep track of the nesting level.

Indentation should render properly in all relevant IDE/editor configurations. Nobody should see stuff in the inner scope closer to the left than the outer scope. Seriously, spend 10 minutes to understand tabs vs spaces issues.

Use profanity judiciously

Profanity makes code easier to write, but harder to read and modify. Profanity signals a potential hazard, but distracts attention from the details – the opposite of a good warning. With profanity, less is more.

Be careful with comments, more careful with identifiers, and still more careful with the ones going into symbol tables. Except when mandated by an explicit requirement, under no circumstances should profanity appear in a program's output, documentation or configuration files, or be a required part of its input.

Clean up

Bits of code that can't possibly serve a purpose are a sign of neglect, scaring some readers and depressing others. Delete.

Semi-commented-out stuff spread around in moments of panic can be hard to clean up during the bad mood that breeds it. But that bad mood will haunt you as long as you keep bumping into that stuff. So.


Concise nested statements or data object specifications can take one line and so need no indentation: if(error) quit.

Indenting upon nesting is neither mandated nor recommended in languages and formats where it is not customary. For example, code following a branch instruction or the <html> opening tag.

It's nice when generated code is indented, but sometimes the white space uses up bandwidth and sometimes you get fed up with propagating the nesting level through the code generator. Generated code is mostly read by whoever wrote the generator so it's up to you.

For profanity to appear in a program's interaction with the user, no explicit requirement is needed when it is a common practice in the given branch of the industry.

Some code won't be very tidy when you're really in a hurry – so be it.


Adopting the Iron Fist Coding Standard will give you the benefits of a mature, field-tested system of coding guidelines while saving the cost of creating one from scratch.

My history with Forth & stack machines

My VLSI tools take a chip from conception through testing. Perhaps 500 lines of source code. Cadence, Mentor Graphics do the same, more or less. With how much source/object code?

Chuck Moore, the inventor of Forth

This is a personal account of my experience implementing and using the Forth programming language and the stack machine architecture. "Implementing and using" – in that order, pretty much; a somewhat typical order, as will become apparent.

It will also become clear why, having defined the instruction set of a processor designed to run Forth that went into production, I don't consider myself a competent Forth programmer. (Now is the time to warn that my understanding of Forth is just that – my own understanding; I wouldn't count on it too much.)

Why the epigraph about Chuck Moore's VLSI tools? Because Forth is very radical. Black Square kind of radical. An approach to programming seemingly leaving out most if not all of programming:

…Forth does it differently. There is no syntax, no redundancy, no typing. There are no errors that can be detected. …there are no parentheses. No indentation. No hooks, no compatibility. …No files. No operating system.

Black Square by Kazimir Malevich

I've never been a huge fan of suprematism or modernism in general. However, a particular modernist can easily get my attention if he's a genius in a traditional sense, with superpowers. Say, he memorizes note sheets upon the first brief glance like Shostakovich did.

Now, I've seen chip design tools by the likes of Cadence and Mentor Graphics. Astronomically costly licenses. Geological run times. And nobody quite knows what they do. To me, VLSI tools in 500 lines qualify as a superpower, enough to grab my attention.

So, here goes.


I was intrigued with Forth ever since I read about it in Bruce Eckel's book on C++, a 198-something edition; he said there that "extensibility got a bad reputation due to languages like Forth, where a programmer could change everything and effectively create a programming language of his own". WANT!

A couple of years later, I looked for info on the net, which seemed somewhat scarce. An unusual-looking language. Parameters and results passed implicitly on a stack. 2 3 + instead of 2+3. Case-insensitive. Nothing about the extensibility business, though.

I thought of nothing better than to dive into the source of an implementation, pForth – and I didn't need anything better, as my mind was immediately blown away by the following passage right at the top of system.fth, the part of pForth implemented in Forth on top of the C interpreter:

: (   41 word drop ; immediate
( That was the definition for the comment word. )
( Now we can add comments to what we are doing! )

Now. we. can. add. comments. to. what. we. are. doing.

What this does is define a word (Forth's name for a function) called "(". "(" is executed at compile time (as directed by IMMEDIATE). It tells the compiler to read bytes from the source file (that's what the word called, um, WORD is doing), until a ")" – ASCII 41 – is found. Those bytes are then ignored (the pointer to them is removed from the stack with DROP). So effectively, everything inside "( … )" becomes a comment.

Wow. Yeah, you definitely can't do that in C++. (You can in Lisp but they don't teach you those parts at school. They teach the pure functional parts, where you can't do things that you can in C++. Bastards.)

Read some more and…

\ conditional primitives
: IF     ( -- f orig )  ?comp compile 0branch  conditional_key >mark     ; immediate
: THEN   ( f orig -- )  swap ?condition  >resolve   ; immediate
: BEGIN  ( -- f dest )  ?comp conditional_key <mark   ; immediate
: AGAIN  ( f dest -- )  compile branch  swap ?condition  <resolve  ; immediate
: UNTIL  ( f dest -- )  compile 0branch swap ?condition  <resolve  ; immediate
: AHEAD  ( -- f orig )  compile branch   conditional_key >mark     ; immediate

Conditional primitives?! Looks like conditional primitives aren't – they define them here. This COMPILE BRANCH business modifies the code of a function that uses IF or THEN, at compile time. THEN – one part of the conditional – writes (RESOLVEs) a branch offset to a point in code saved (MARKed) by IF, the other part of the conditional.

It's as if a conventional program modified the assembly instructions generated from it at compile time. What? How? Who? How do I wrap my mind around this?

Shocked, I read the source of pForth.

Sort of understood how Forth code was represented and interpreted. Code is this array of "execution tokens" – function pointers, numbers and a few built-ins like branches, basically. A Forth interpreter keeps an instruction pointer into this array (ip), a data stack (ds), and a return stack (rs), and does this:

while(true) {
 switch(*ip) {
  //arithmetics (+,-,*...):
  case PLUS: ds.push(ds.pop() + ds.pop()); ++ip; break;
  //stack manipulation (drop,swap,rot...):
  case DROP: ds.pop(); ++ip; break;
  //literal numbers (1,2,3...):
  case LITERAL: ds.push(ip[1]); ip+=2; break;
  //control flow:
  case COND_BRANCH: if(!ds.pop()) ip+=ip[1]; else ip+=2; break;
  case RETURN: ip = rs.pop(); break;
  //user-defined words: save return address & jump
  default: rs.push(ip+1); ip = *ip; break;
 }
}

That's it, pretty much. Similar, say, to the virtual stack machine used to implement Java. One difference is that compiling a Forth program is basically writing to the code array in a WYSIWYG fashion. COMPILE SOMETHING simply appends the address of the word SOMETHING to the end of the code array. So does plain SOMETHING when Forth is compiling rather than interpreting, as it is between a colon and a semicolon, that is, when a word is defined. So a definition like

: DRAW-RECTANGLE 2dup up right down left ;

…simply appends {&2dup,&up,&right,&down,&left,RETURN} to the code array. Very straightforward. There are no parameters or declaration/expression syntax as in…

void drawRectangle(int width, int height) {

…to make it less than absolutely clear how the source code maps to executable code. "C maps straightforwardly to assembly"? Ha! Forth maps straightforwardly to assembly. Well, to the assembly language of a virtual stack machine, but still. So one can understand how self-modifying code like IF and THEN works.
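To make the "WYSIWYG" point concrete, here's a toy model in Python (mine, not pForth's actual representation), where a word's body is just a list of execution tokens and compiling a call simply appends a token:

```python
# Toy model of "WYSIWYG" compilation (my sketch, not pForth's actual
# representation): a word's body is a list of execution tokens (here,
# Python callables), and compiling a word simply appends its token.

ds = []  # the data stack

def plus():
    b, a = ds.pop(), ds.pop()
    ds.append(a + b)

def dup():
    ds.append(ds[-1])

dictionary = {'+': plus, 'dup': dup}

def compile_word(name, source):
    body = [dictionary[token] for token in source.split()]
    def run():                # the "inner interpreter" for this word
        for xt in body:       # xt = "execution token"
            xt()
    dictionary[name] = run

# : DOUBLE dup + ;  compiles to the token array [&dup, &plus]
compile_word('double', 'dup +')
ds.append(21)
dictionary['double']()        # leaves 42 on the data stack
```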

On the other hand, compared to drawRectangle, it is somewhat unclear what DRAW-RECTANGLE does. What are those 2 values on the top of the stack that 2DUP duplicates before meaningful English names appear in DRAW-RECTANGLE's definition? This is supposed to be ameliorated by stack comments:

: DRAW-RECTANGLE ( width height -- ) ... ;

…tells us that DRAW-RECTANGLE expects to find height at the top of the stack, and width right below it.

I went on to sort of understand CREATE/DOES> – a further extension of this compile-time self-modifying code business that you use to "define defining words" (say, CONSTANT, VARIABLE, or CLASS). The CREATE part says what should be done when words (say, class names) are defined by your new defining word. The DOES> part says what should be done when those words are used. For example:

: CONSTANT CREATE ,
   DOES> @ ;
\ usage example:
7 CONSTANT DAYS-IN-WEEK
DAYS-IN-WEEK 2 + . \ should print 9

CREATE means that every time CONSTANT is called, a name is read from the source file (similarly to what WORD would have done). Then a new word is created with that name (as a colon would have done). This word records the value of HERE – something like sbrk(0), a pointer past the last allocated data item. When the word is executed, it pushes the saved address onto the data stack, then calls the code after DOES>. The code after CREATE can put some data after HERE, making it available later to the DOES> part.

With CONSTANT, the CREATE part just saves its input (in our example, 7) – the comma word does this: *HERE++ = ds.pop(); The DOES> part then fetches the saved number – the @ sign is the fetch word: ds.push( *ds.pop() );

CONSTANT works somewhat similarly to a class, CREATE defining its constructor and DOES> its single method:

class Constant
  def initialize(x) @x=x end
  def does() @x end
end
daysInWeek = Constant.new(7)
print daysInWeek.does() + 2

…But it's much more compact on all levels.

Another example is defining C-like structs. Stripped down to their bare essentials (and in Forth things tend to be stripped down to their bare essentials), you can say that:

struct Rectangle {
  int width;
  int height;
};

…simply gives 8 (the structure size) a new name Rectangle, and gives 0 and 4 (the members' offsets) new names, width and height. Here's one way to implement structs in Forth:

struct
  cell field width
  cell field height
constant rectangle

\ usage example:
\ here CREATE is used just for allocation
create r1 rectangle allot \ r1=HERE; HERE+=8
2 r1 width !
3 r1 height !
: area dup width @ swap height @ * ;
r1 area . \ should print 6

CELL is the size of a word; we could say "4 field width" instead of "cell field width" on 32b machines. Here's the definition of FIELD:

: field ( struct-size field-size -- new-struct-size )
    create over , +
    does> @ + ;

Again, pretty compact. The CREATE part stores the offset, a.k.a current struct size (OVER does ds.push(ds[1]), comma does *HERE++=ds.pop()), then adds the field size to the struct size, updating it for the next call to FIELD. The DOES> part fetches the offset, and adds it to the top of the stack, supposedly containing the object base pointer, so that "rect width" or "rect height" compute &rect.width or &rect.height, respectively. Then you can access this address with @ or ! (fetch/store). STRUCT simply pushes 0 to the top of the data stack (initial size value), and at the end, CONSTANT consumes the struct size:

struct \ data stack: 0
  cell ( ds: 0 4 ) field width  ( ds: 4 )
  cell ( ds: 4 4 ) field height ( ds: 8 )
constant rectangle ( ds: as before STRUCT )

You can further extend this to support polymorphic methods – METHOD would work similarly to FIELD, fetching a function pointer ("execution token") through a vtable pointer and an offset kept in the CREATEd part. A basic object system in Forth can thus be implemented in one screen (a Forth code size unit – 16 lines x 64 characters).
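The CREATE/DOES> shape of such a METHOD can be illustrated with a Python analogue (entirely my own sketch – the memory layout, names and example vtable are made up):

```python
# Python analogue of a CREATE/DOES>-style METHOD (my sketch, not real
# Forth): the CREATEd part remembers a vtable offset; the DOES> part
# fetches the object's vtable pointer and executes the token there.

memory = {}   # toy memory: object address -> vtable pointer (kept first)

class MethodWord:
    def __init__(self, offset):       # "CREATE": record the offset
        self.offset = offset
    def __call__(self, obj_addr):     # "DOES>": vtable fetch + dispatch
        vtable = memory[obj_addr]
        return vtable[self.offset]()  # fetch & execute the token

area = MethodWord(0)
perimeter = MethodWord(1)

# a "class" is just a table of execution tokens; a 2x3 rectangle object:
rect_vtable = (lambda: 2 * 3, lambda: 2 * (2 + 3))
memory[0x100] = rect_vtable
```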

To this day, I find it shocking that you can define defining words like CONSTANT, FIELD, CLASS, METHOD – something reserved to built-in keywords and syntactic conventions in most languages – and you can do it so compactly using such crude facilities so trivial to implement. Back when I first saw this, I didn't know about DEFMACRO and how it could be used to implement the defining words of CLOS such as DEFCLASS and DEFMETHOD (another thing about Lisp they don't teach in schools). So Forth was completely mind-blowing.

And then I put Forth aside.

It seemed more suited for number crunching/"systems programming" than text processing/"scripting", whereas it is scripting that is the best trojan horse for pushing a language into an organization. Scripting is usually mission-critical without being acknowledged as such, and many scripts are small and standalone. Look how many popular "scripting languages" there are as opposed to "systems programming languages". Then normalize it by the amount of corporate backing a language got on its way to popularity. Clearly scripting is the best trojan horse.

In short, there were few opportunities to play with Forth at work, so I didn't. I fiddled with the interpreter and with the metaprogramming and then left it at that without doing any real programming.

Here's what Jeff Fox, a prominent member of the Forth community who worked with Chuck Moore for years, has to say about people like me:

Forth seems to mean programming applications to some and porting Forth or dissecting Forth to others. And these groups don't seem to have much in common.

…One learns one set of things about frogs from studying them in their natural environment or by getting a doctorate in zoology and specializing in frogs. And people who spend an hour dissecting a dead frog in a pan of formaldehyde in a biology class learn something else about frogs.

…One of my favorite examples was that one notable colorforth [a Forth dialect] enthusiast who had spent years studying it, disassembling it, reassembling it and modifying it, and made a lot of public comments about it, but had never bothered running it and in two years of 'study' had not been able to figure out how to do something in colorforth as simple as:

1 dup +

…[such Forth users] seem to have little interest in what it does, how it is used, or what people using it do with it. But some spend years doing an autopsy on dead code that they don't even run.

Live frogs are just very different than dead frogs.

Ouch. Quite an assault not just on a fraction of a particular community, but on language geeks in general.

I guess I feel that I could say that if it isn't solving a significant real problem in the real world it isn't really Forth.

True, I guess, and equally true from the viewpoint of someone extensively using any non-mainstream language and claiming enormous productivity gains for experts. Especially true for the core (hard core?) of the Forth community, Forth being their only weapon. They actually live in Forth; it's DIY taken to the extreme, something probably unparalleled in the history of computing, except, perhaps, the case of Lisp environments and Lisp machines (again).

Code running on Forth chips. Chips designed with Forth CAD tools. Tools developed in a Forth environment running on the bare metal of the desktop machine. No standard OS, file system or editor. All in recent years when absolutely nobody else would attempt anything like it. They claim to be 10x to 100x more productive than C programmers (a generic pejorative term for non-Forth programmers; Jeff Fox is careful to put "C" in quotes, presumably either to make the term more generic or more pejorative).

…people water down the Forth they do by not exercising most of the freedom it offers… by using Forth only as debugger or a yet another inefficient scripting language to be used 1% of the time.

Forth is about the freedom to change the language, the compiler, the OS or even the hardware design and is very different than programming languages that are about fitting things to a fixed language syntax in a narrow work context.

What can be said of this? If, in order to "really" enter a programming culture, I need to both "be solving a significant real problem in the real world" and exercising "the freedom to change the language, the compiler, the OS or even the hardware design", then there are very few options for entering this culture indeed. The requirement for "real world work" is almost by definition incompatible with "the freedom to change the language, the compiler, the OS and the hardware design".

And then it so happened that I started working on a real-world project about as close to Forth-level DIY as possible. It was our own hardware, with our own OS, our own compilers, designed to be running our own application. We did use standard CAD tools, desktop operating systems and editors, and had standard RISC cores in the chip and standard C++ cross compilers for them. Well, everyone has weaknesses. Still, the system was custom-tailored, embedded, optimized, standalone, with lots of freedom to exercise – pretty close to the Forth way, in one way.

One part of the system was an image processing co-processor, a variation on the VLIW theme. Its memory access and control flow was weird and limited, to the effect that you could neither access nor jump to an arbitrary memory address. It worked fine for the processing-intensive parts of our image processing programs.

We actually intended to glue those parts together with a few "control instructions" setting up the plentiful control registers of this machine. When I tried, it quickly turned out, as was to be expected, that those "control instructions" must be able to do, well, everything – arithmetic, conditions, loops. In short, we needed a CPU.

We thought about buying a CPU, but it was unclear how we could use an off-the-shelf product. We needed to dispatch VLIW instructions from the same instruction stream. We also needed a weird mixture of features. No caches, no interrupts, no need for more than 16 address bits, but for accessing 64 data bits, and 32-bit arithmetic.

We thought about making our own CPU. The person with the overall responsibility for the hardware design gently told me that I was out of my mind. CPUs have register files and pipelines and pipeline stalls and dependency detection to avoid those stalls, and it's too complicated.

And then I asked, how about a stack machine? No register file. Just a 3-stage pipeline – fetch, decode, execute. No problem with register dependencies, always pop inputs from the top of the stack, push the result.

He said it sounded easy enough alright, we could do that. "It's just like my RPN calculator. How would you program it?" "In Forth!"

I defined the instruction set in a couple of hours. It mapped to Forth words as straightforwardly as possible, plus it had a few things Forth doesn't have that C might need, as a kind of insurance (say, access to 16-bit values in memory).

This got approved and implemented; not that it became the schedule bottleneck, but it was harder than we thought. Presumably that was partly the result of not reading "Stack Computers: the new wave", and not studying the chip designs of Forth's creator Chuck Moore, either. I have a feeling that knowledgeable people would have sneered at this machine: it was trivial to compile Forth to it, but at the cost of complicating the hardware.

But I was satisfied – I got a general-purpose CPU for setting up my config regs at various times through my programs, and as a side effect, I got a Forth target. And even if it wasn't the most cost-effective Forth target imaginable, it was definitely time to start using Forth at work.

(Another area of prior art on stack machines that I failed to study in depth was 4stack – an actual VLIW stack machine, with 4 data stacks as suggested by its name. I was very interested in it, especially during the time when we feared implementation problems with the multi-port register file feeding our multiple execution units. I didn't quite figure out how programs would map to 4stack and what the efficiency drop would be when one had to spill stuff from the data stacks to other memory because of data flow complications. So we just went for a standard register file and it worked out.)

The first thing I did was write a Forth cross-compiler for the machine – a 700-line C++ file (and for reasons unknown, the slowest-compiling C++ code that I have ever seen).

I left out all of the metaprogramming stuff. For instance, none of the Forth examples above, the ones that drove me to Forth, could be made to work in my own Forth. No WORD, no COMPILE, no IMMEDIATE, no CREATE/DOES>, no nothing. Just colon definitions, RPN syntax, flow control words built into the compiler. "Optimizations" – trivial constant folding so that 1 2 + becomes 3, and inlining – :INLINE 1 + ; works just like : 1 + ; but is inlined into the code of the caller. (I was working on the bottlenecks so saving a CALL and a RETURN was a big deal.) So I had that, plus inline assembly for the VLIW instructions. Pretty basic.

I figured I didn't need the more interesting metaprogramming stuff for my first prototype programs, and I could add it later if it turned out that I was wrong. It was weird to throw away everything I originally liked the most, but I was all set to start writing real programs. Solving real problems in the real world.

It was among the most painful programming experiences in my life.

All kinds of attempts at libraries and small test programs aside, my biggest program was about 700 lines long (that's 1 line of compiler code for 1 line of application code). Here's a sample function:

: mean_std ( sum2 sum inv_len -- mean std )
  \ precise_mean = sum * inv_len;
  tuck u* \ sum2 inv_len precise_mean
  \ mean = precise_mean >> FRAC;
  dup FRAC rshift -rot3 \ mean sum2 inv_len precise_mean
  \ var = (((unsigned long long)sum2 * inv_len) >> FRAC) - (precise_mean * precise_mean >> (FRAC*2));
  dup um* nip FRAC 2 * 32 - rshift -rot \ mean precise_mean^2 sum2 inv_len
  um* 32 FRAC - lshift swap FRAC rshift or \ mean precise_mean^2 sum*inv_len
  swap - isqrt \ mean std
;

Tuck u*.

This computes the mean and the standard deviation of a vector given the sum of its elements, the sum of their squares, and the inverse of its length. It uses scaled integer arithmetic: inv_len is an integer keeping (1<<FRAC)/length. How it arranges the data on the stack is beyond me. It was beyond me at the time when I wrote this function, as indicated by the plentiful comments documenting the stack state, amended by wimpy C-like comments ("C"-like comments) explaining the meaning of the postfix expressions.
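In infix notation with named variables, the same computation is straightforward – here's a Python rendering of those C-like comments (FRAC = 16 is an assumed scaling; the original value isn't given):

```python
import math

FRAC = 16  # assumed number of fraction bits; illustrative, not the original value

def mean_std(sum2, sum_, inv_len):
    # inv_len is a scaled integer keeping (1 << FRAC) / length
    precise_mean = sum_ * inv_len                      # FRAC fraction bits
    mean = precise_mean >> FRAC
    var = ((sum2 * inv_len) >> FRAC) \
        - ((precise_mean * precise_mean) >> (FRAC * 2))
    return mean, math.isqrt(max(var, 0))

# e.g. for the vector [3, 4, 5]: sum=12, sum2=50, inv_len=(1<<FRAC)//3
```

Truncation makes the results slightly off (the mean of [3, 4, 5] comes out 3, not 4), which is the usual price of scaled integer arithmetic.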

This nip/tuck business in the code? Rather than a reference to the drama series on plastic surgery, these are names of Forth stack manipulation words. You can look them up in the standard. I forgot what they do, but it's, like, ds.insert(2, ds.remove(1)) – this kind of thing.
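For the record, here's what a few of those words do, modeled on a Python list whose end is the top of the stack (my own illustration of the standard definitions):

```python
# Forth stack words as operations on a Python list (end of list = top of stack).
def dup(s):  s.append(s[-1])              # ( a -- a a )
def swap(s): s[-1], s[-2] = s[-2], s[-1]  # ( a b -- b a )
def over(s): s.append(s[-2])              # ( a b -- a b a )
def nip(s):  del s[-2]                    # ( a b -- b )
def tuck(s): s.insert(-2, s[-1])          # ( a b -- b a b )

s = [1, 2]
tuck(s)        # s is now [2, 1, 2]
```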

Good Forth programmers reportedly don't use much of those. Good Forth programmers arrange things so that they flow on the stack. Or they use local variables. My DRAW-RECTANGLE definition above, with a 2DUP, was reasonably flowing by my standards: you get width and height, duplicate both, and have all 4 data items – width,height,width,height – consumed by the next 4 words. Compact, efficient – little stack manipulation. Alternatively we could write:

: DRAW-RECTANGLE { width height }
  height UP
  width RIGHT
  height DOWN
  width LEFT
;

Less compact, but very readable – not really, if you think about it, since nobody knows how much stuff UP leaves on the stack and what share of that stuff RIGHT consumes, but readable enough if you assume the obvious. One reason not to use locals is that Chuck Moore hates them:

I remain adamant that local variables are not only useless, they are harmful.

If you are writing code that needs them you are writing non-optimal code. Don't use local variables. Don't come up with new syntax for describing them and new schemes for implementing them. You can make local variables very efficient especially if you have local registers to store them in, but don't. It's bad. It's wrong.

It is necessary to have [global] variables. … I don't see any use for [local] variables which are accessed instantaneously.

Another reason not to use locals is that it takes time to store and fetch them. If you have two items on a data stack on a hardware stack machine, + will add them in one cycle. If you use a local, then it will take a cycle to store its value with { local_name }, and a cycle to fetch its value every time you mention local_name. On the first version of our machine, it was worse as fetching took 2 cycles. So when I wrote my Forth code, I had to make it "flow" for it to be fast.

The abundance of DUP, SWAP, -ROT and -ROT3 in my code shows that making it flow wasn't very easy. One problem is that every stack manipulation instruction also costs a cycle, so I started wondering whether I was already past the point where I had a net gain. The other problem was that I couldn't quite follow this flow.

Another feature of good Forth code, which supposedly helps achieve the first good feature ("flow" on the stack), is factoring. Many small definitions.

Forth is highly factored code. I don't know anything else to say except that Forth is definitions. If you have a lot of small definitions you are writing Forth. In order to write a lot of small definitions you have to have a stack.

In order to have really small definitions, you do need a stack, I guess – or some other implicit way of passing parameters around; if you do that explicitly, definitions get bigger, right? That's how you can get somewhat Forth-y with Perl – passing things through the implicit variable $_: call chop without arguments, and you will have chopped $_.

Anyway, I tried many small definitions:

:inline num_rects params @ ;
:inline sum  3 lshift gray_sums + ;
:inline sum2 3 lshift gray_sums 4 + + ;
:inline rect_offset 4 lshift ;
:inline inv_area rect_offset rects 8 + + @ ;
:inline mean_std_stat ( lo hi -- stat )
  FRAC lshift swap 32 FRAC - rshift or
;
: mean_std_loop
 \ inv_global_std = (1LL << 32) / MAX(global_std, 1);
 dup 1 max 1 swap u/mod-fx32 drop \ 32 frac bits

 num_rects begin \ start countdown
  1 - \ rects--
  dup sum2 @
  over sum @
  pick2 inv_area
  mean_std \ global_mean global_std inv_global_std rectind mean std
  rot dup { rectind } 2 NUM_STATS * * stats_arr OFT 2 * + + { stats }
  \ stats[OFT+0] = (short)( ((mean - global_mean) * inv_global_std) >> (32 - FRAC) );
  \ stats[OFT+1] = (short)( std * inv_global_std >> (32 - FRAC) );
  pick2       um* mean_std_stat stats 2 + h! \ global_mean global_std inv_global_std mean
  pick3 - over m* mean_std_stat stats h!
  rectind ?dup 0 = \ quit at rect 0
 until
 drop 2drop
;

I had a bunch of those short definitions, and yet I couldn't get rid of heavy functions with DUP and OVER and PICK and "C" comments to make any sense of it. This stack business just wasn't for me.

Stacks are not popular. It's strange to me that they are not. There is just a lot of pressure from vested interests that don't like stacks, they like registers.

But I actually had a vested interest in stacks, and I began to like registers more and more. The thing is, expression trees map perfectly to stacks: (a+b)*(c-d) becomes a b + c d - *. Expression graphs, however, start to get messy: (a+b)*a becomes a dup b + *, and this dup cluttering things up is a moderate example. And an "expression graph" simply means that you use something more than once. How come this clutters up my code? This is reuse. A kind of factoring, if you like. Isn't factoring good?
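The pleasant half of this is easy to demonstrate (a toy flattener I'm making up for illustration): a post-order walk of an expression tree is exactly stack code.

```python
# Flatten an expression tree to stack code via post-order traversal.
def to_stack_code(node):
    if isinstance(node, tuple):            # an operator node: (op, left, right)
        op, left, right = node
        return to_stack_code(left) + to_stack_code(right) + [op]
    return [node]                          # a leaf: just push it

tree = ('*', ('+', 'a', 'b'), ('-', 'c', 'd'))   # (a+b)*(c-d)
to_stack_code(tree)    # ['a', 'b', '+', 'c', 'd', '-', '*']

# A graph -- say (a+b)*a, where `a` is used twice -- has no such clean
# flattening: you need dup/over/pick to keep the shared value around.
```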

In fact, now that I thought of it, I didn't understand why stacks were so popular. Vested interests, perhaps? Why do the JVM bytecode, the .NET bytecode, and even CPython's bytecode all target stack VMs? Why not use registers the way LLVM does?

Speaking of which. I started to miss a C compiler. I downloaded LLVM. (7000 files plus a huge precompiled gcc binary. 1 hour to build from scratch. So?) I wrote a working back-end for the stack machine within a week. Generating horrible code. Someone else wrote an optimizing back-end in about two months.

After a while, the optimizing back-end's code wasn't any worse than my hand-coded Forth. Its job was somewhat easier than mine since by the time it arrived, it only took 1 cycle to load a local. On the other hand, loads were fast as long as they weren't interleaved with stores – some pipeline thing. So the back-end was careful to reorder things so that huge sequences of loads went first and then huge sequences of stores. Would be a pity to have to do that manually in Forth.

You have no idea how much fun it is to just splatter named variables all over the place, use them in expressions in whatever order you want, and have the compiler schedule things. Although you do it all day. And that was pretty much the end of Forth on that machine; we wrote everything in C.

What does this say about Forth? Not much except that it isn't for me. Take Prolog. I know few things more insulting than having to code in Prolog. Whereas Armstrong developed Erlang in Prolog and liked it much better than reimplementing Erlang in C for speed. I can't imagine how this could be, but this is how it was. People are different.

Would a good Forth programmer do better than me? Yes, but not just at the level of writing the code differently. Rather, at the level of doing everything differently. Remember the freedom quote? "Forth is about the freedom to change the language, the compiler, the OS or even the hardware design".

…And the freedom to change the problem.

Those computations I was doing? In Forth, they wouldn't just write it differently. They wouldn't implement them at all. In fact, we didn't implement them after all, either. The algorithms which made it into production code were very different – in our case, more complicated. In the Forth case, they would have been less complicated. Much less.

Would less complicated algorithms work? I don't know. Probably. Depends on the problem though. Depends on how you define "work", too.

The tiny VLSI toolchain from the epigraph? I showed Chuck Moore's description of that to an ASIC hacker. He said it was very simplistic – no way you could do with that what people are doing with standard tools.

But Chuck Moore isn't doing that, under the assumption that you don't need to. Look at the chips he's making. 144-core, but the cores (nodes) are tiny – why would you want them big, if you feel that you can do anything with almost no resources? And they use 18-bit words. Presumably under the assumption that 18 bits is a good quantity, not too small, not too large. Then they write an application note about implementing the MD5 hash function:

MD5 presents a few problems for programming a Green Arrays device. For one thing it depends on modulo 32 bit addition and rotation. Green Arrays chips deal in 18 bit quantities. For another, MD5 is complicated enough that neither the code nor the set of constants required to implement the algorithm will fit into one or even two or three nodes of a Green Arrays computer.

Then they solve these problems by manually implementing 32b addition and splitting the code across nodes. But if MD5 weren't a standard, you could implement your own hash function without going to all this trouble.
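The flavor of that manual 32-bit addition can be sketched in Python (my guess at the general technique, not GreenArrays' actual code): keep a 32-bit value as two 16-bit halves, each in an 18-bit cell, and let the spare bits hold the carry.

```python
# 32-bit addition out of 18-bit cells: store a 32-bit value as two
# 16-bit halves; the 2 spare bits per cell absorb the carry.
def add32(a, b):
    lo = a[0] + b[0]                          # may overflow into bit 16
    hi = (a[1] + b[1] + (lo >> 16)) & 0xFFFF  # propagate the carry
    return (lo & 0xFFFF, hi)

# 0x0001FFFF + 0x00000001 = 0x00020000
add32((0xFFFF, 0x0001), (0x0001, 0x0000))   # -> (0x0000, 0x0002)
```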

In his chip design tools, Chuck Moore naturally did not use the standard equations:

Chuck showed me the equations he was using for transistor models in OKAD and compared them to the SPICE equations that required solving several differential equations. He also showed how he scaled the values to simplify the calculation. It is pretty obvious that he has sped up the inner loop a hundred times by simplifying the calculation. He adds that his calculation is not only faster but more accurate than the standard SPICE equation. … He said, "I originally chose mV for internal units. But using 6400 mV = 4096 units replaces a divide with a shift and requires only 2 multiplies per transistor. … Even the multiplies are optimized to only step through as many bits of precision as needed."

This is Forth. Seriously. Forth is not the language. Forth the language captures nothing, it's a moving target. Chuck Moore constantly tweaks the language and largely dismisses the ANS standard as rooted in the past and bloated. Forth is the approach to engineering aiming to produce as small, simple and optimal a system as possible, by shaving off as many requirements of every imaginable kind as you can.

That's why its metaprogramming is so amazingly compact. It's similar to Lisp's metaprogramming in much the same way bacterial genetic code is similar to that of humans – both reproduce. Humans also do many other things that bacteria can't (…No compatibility. No files. No operating system). And have a ton of useless junk in their DNA, their bodies and their habitat.

Bacteria have no junk in their DNA. Junk slows down the copying of the DNA which creates a reproduction bottleneck so junk mutations can't compete. If it can be eliminated, it should. Bacteria are small, simple, optimal systems, with as many requirements shaved off as possible. They won't conquer space, but they'll survive a nuclear war.

This stack business? Just a tiny aspect of the matter. You have complicated expression graphs? Why do you have complicated expression graphs? The reason Forth the language doesn't have variables is because you can eliminate them, therefore they are junk, therefore you should eliminate them. What about those expressions in your Forth program? Junk, most likely. Delete!

I can't do that.

I can't face people and tell them that they have to use 18b words. In fact I take pride in the support for all the data types people are used to from C in our VLIW machine. You can add signed bytes, and unsigned shorts, and you even have instructions adding bytes to shorts. Why? Do I believe that people actually need all those combinations? Do I believe that they can't force their 16b unsigned shorts into 15b signed shorts to save hardware the trouble?


They just don't want to. They want their 16 bits. They whine about their 16th bit. Why do they want 16 and not 18? Because they grew up on C. "C". It's completely ridiculous, but nevertheless, people are like that. And I'm not going to fight that, because I am not responsible for algorithms, other people are, and I want them happy, at least to a reasonable extent, and if they can be made happier at a reasonable cost, I gladly pay it. (I'm not saying you can't market a machine with a limited data type support, just using this as an example of the kind of junk I'm willing to carry that in Forth it is not recommended to carry.)

Why pay this cost? Because I don't do algorithms, other people do, so I have to trust them and respect their judgment to a large extent. Because you need superhuman abilities to work without layers. My minimal stack of layers is – problem, software, hardware. People working on the problem (algorithms, UI, whatever) can't do software, not really. People doing software can't do hardware, not really. And people doing hardware can't do software, etc.

The Forth way of focusing on just the problem you need to solve seems to more or less require that the same person or a very tightly united group focus on all three of these things, and pick the right algorithms, the right computer architecture, the right language, the right word size, etc. I don't know how to make this work.

My experience is, you try to compress the 3 absolutely necessary layers to 2, you get a disaster. Have your algorithms people talk directly to your hardware people, without going through software people, and you'll get a disaster. Because neither understands software very well, and you'll end up with an unusable machine. Something with elaborate computational capabilities that can't be put together into anything meaningful. Because gluing it together, dispatching, that's the software part.

So you need at least 3 teams, or people, or hats, that are to an extent ignorant about each other's work. Even if you're doing everything in-house, which, according to Jeff Fox, was essentially a precondition to "doing Forth". So there's another precondition – having people able to do what at least 3 people in their respective areas normally do, and concentrating on those 3 things at the same time. Doing the cross-layer global optimization.

It's not how I work. I don't have the brain nor the knowledge nor the love for prolonged meditation. I compensate with, um, people skills. I find out what people need, that is, what they think they need, and I negotiate, and I find reasonable compromises, and I include my narrow understanding of my part – software tools and such – into those compromises. This drags junk in. I live with that.

I wish I knew what to tell you that would lead you to write good Forth. I can demonstrate. I have demonstrated in the past, ad nauseam, applications where I can reduce the amount of code by 90% and in some cases 99%. It can be done, but on a case by case basis. The general principle still eludes me.

And I think he can, especially when compatibility isn't a must. But not me.

I still find Forth amazing, and I'd gladly hack on it upon any opportunity. It still gives you the most bang for the buck – it implements the most functionality in the least space. So still a great fit for tiny targets, and unlikely to be surpassed. Both because it's optimized so well and because the times when only bacteria survived in the amounts of RAM available are largely gone so there's little competition.

As to using Forth as a source of ideas on programming and language design in general – not me. I find that those ideas grow out of an approach to problem solving that I could never apply.

Update (July 2014): Jeff Fox's "dead frog dissector" explained his view of the matter in a comment to this article, telling us why the frog (colorForth) died in his hands in the first place… A rather enlightening incident, this little story.

The Internet age/reputation paradox

A person's reputation tends to rise with age. The older one is, the more opportunities one had to do notable things, and to meet people who could appreciate those things and tell others about them. So this makes sense.

An online document's reputation also tends to rise with age. The older the document, the more documents link to it, and the more documents in turn link to those documents, raising the old document's PageRank. So this makes sense, too.

The paradox is that the older documents are written by the younger people. That is, it is one's younger version that wrote one's older documents. So the documents with the most reputation will tend to be written by people (or more precisely snapshots of people) with the least reputation; one's dumb young stuff may well pop up first in a Google search.

(Not that there aren't any counter-tendencies to cancel this effect at times; my old anxious, moronic report of an imaginary bug in ALL CAPS no longer shows up in my egosearches. So no, I'm not bitter. In fact, Google loves me more than I deserve – for instance, my review of Extreme Programming Explained has appeared in search results right after the Amazon entry for the book ever since I published it, and I've only skimmed through the thing. The only thing that bothers me in the SEO department is that the search for "C++ FQA" gets corrected to "C++ FAQ" – didn't expect that once the query got past the spell check barrier. I hope my colleague's riskily named DreamPie project will not experience a similar setback.)

If a tree falls in a forest, it kills Schrödinger's cat

Schrödinger used to have this quantum cat which was alive and dead at the same time as long as nobody opened the box, and it was the very act of looking at the cat that made it either alive or dead. Now, I'm not sure about this quantum stuff, but if you ask me you'd always find a dead cat upon opening the box, killed by the act of not looking. In fact, if you open any random box nobody was looking into, chances are you'll find a dead cat there. Let me give an example.

I recently chatted with a former employee of a late company I'll call Excellence (the real name was even worse). Excellence was a company with offices right across the street that kept shrinking until the recent financial crisis. It then had to simultaneously fire almost all of its remaining employees, carefully selected as their best during the previous years when other employees were continuously fired at a lower rate. Giving us a whole lot of great hires, including MV, the co-worker in this story (though he was among those who guessed where things were going and crossed the street a bit earlier).

Rumors tell that to the extent possible, Excellence used to live up to expectations created by its name. In particular, without being encouraged or forced by customers to comply with any Software Development Methodology such as the mighty and dreadful CMM, they had (as CMM puts it) not only Established, but rigorously Maintained an elaborate design, documentation and review process which preceded every coding effort. Other than praise, MV had little to say about the process, except perhaps that occasionally someone would invent something awfully complicated that made code incomprehensible, having gone just fine through the review process because of sounding too smart for anyone to object.

Now, in our latest conversation about how things were at Excellence, MV told how he once had to debug a problem in a core module of theirs, to which no changes had been made in years. There, he stumbled upon a loop searching for a value. He noticed that when the value was found, the loop wouldn't terminate – a continue instead of a break kind of thing. Since the right value tended to be found pretty early through the loop, and because it was at such a strategic place, test cases everyone was running took minutes instead of seconds to finish. Here's a dead cat in a gold-plated box for ya, and one buried quite deeply.
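The bug, as described, would look something like this (a reconstruction in Python, not their actual code). Note that the function still returns the right answer when the value is unique, so every functional test passes – it just burns time scanning the rest of the loop:

```python
def find_value(values, target):
    found = None
    for i, v in enumerate(values):
        if v == target:
            found = i
            continue   # bug: should be `break` -- the loop keeps scanning
    return found       # correct answer, found early, returned late
```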

My own professional evolution shaped my mind in such a way that it didn't surprise me in the slightest that this slipped past the reviewer(s). What surprised me was how this slipped past the jittery bodybuilder. You see, we have this Integration Manager, one of his hobbies being bodybuilding (a detail unrelated, though not entirely, to his success in his work), and one thing he does after integrating changes is he looks at the frame rate. When the frame rate seems low, he pops up a window with the execution profile, where the program is split into about 1000 parts. If your part looks heavier than usual, or if it's something new that looks heavy compared to the rest, it'll set him off.

So I asked MV how come that the cat, long before it got dead and buried, didn't set off the jittery bodybuilder. He said they didn't have one for it to set off. They were translating between the formats of different programs. Not that performance didn't matter – they worked on quite large data sets. But to the end user, automatic translation taking hours had about the same value as automatic translation taking seconds – the alternative was manual translation taking weeks or months. So they took large-scale performance implications of their decisions into account during design reviews. Then once the code was done and tested, it was done right, so if it took N cycles to run, it was because it took N cycles to do whatever it did right.

And really, what is the chance that the code does everything right according to a detailed spec it is tested against, but there's a silly bug causing it to do it awfully slowly? If you ask me – the chance is very high, and more generally:

  • Though not looking at performance followed from a reasonable assessment of the situation,
  • Performance was bad, and bad enough to become an issue (though an unacknowledged one), when it wasn't looked at,
  • Although the system in general was certainly "looked at", apparently by more eyes and from more angles than "an average system", it didn't help,
  • So either you have a jittery bodybuilder specifically and continuously eyeballing something, or that something sucks.

Of course you can save effort using jittery automated test programs. For example, we've been running a regression testing system for about a year. I recently decided to look at what's running through it, beyond the stuff it reports as errors that ought to be fixed (in this system we try to avoid false positives to the point of tolerating some of the false negatives, so it doesn't complain loudly about every possible problem). I found that:

  • It frequently ran out of disk space. It was OK for it to run out of disk space at times, but it did it way too often now. That's because its way of finding out the free space on the various partitions was obsoleted by the move of the relevant directories to network-attached storage.
  • On some of the machines, it failed to get licenses to one of the compilers it needed – perhaps because the env vars were set to good values with most users but not all, perhaps because of a compiler upgrade it didn't take into account. [It was OK for it to occasionally fail to get a license (those are scarce) – then it should have retried, and in the worst case reported a license error. However, the compiler error messages it got were new to it, so it thought something just didn't compile. It then ignored the problem on otherwise good grounds.]
  • Its way of figuring out file names from module names failed for one module which was partially renamed recently (don't ask). Through an elaborate path this resulted in tolerating false negatives it really shouldn't.

And I'm not through with this thing yet, which to me exemplifies the sad truth that while you can have a cat looking at other cats to prevent them from dying, a human still has to look at that supervisor cat, or else it dies, leading to the eventual demise of the others.

If you don't look at a program, it rots. If you open a box, there's a dead cat in it. And if a tree falls in a forest and no one is around to hear it, it sucks.

Applied mathematics in business consulting

As a part of my continuous moral degradation and the resulting increasing alignment with the forces of Evil, I'm sharing an apartment with a gal who used to work in HR assessment. She recently got me acquainted with a friend of hers, BC, who works as a business consultant (names have been changed to protect the guilty).

BC's primary educational background is in applied mathematics. Having put the math they teach in CS departments to relatively few uses as a working programmer, I asked her about the uses of applied mathematics in business consulting. BC cited the following two examples.

The first example involves compensation and its dependence on key performance indicators, affectionately known as KPI and estimated by HR assessors. One way of looking at this dependence is to consider how it affects compensation over time as an employee's competence increases.

A psychological discussion is then possible on the relative merits of the different graphs plotting the compensation functions f(KPI). If f is linear (has a constant derivative), we make people struggle equally hard at every step. If f's derivative increases over time (for instance, when f is exponential), we make elevation hard at first and then increasingly easy. If f's derivative decreases over time (for example, if f is logarithmic), we make the last mile the hardest. And so on.

Through a psychological discussion of this sort, someone in the consulting company decided that what was really needed in some case or other was an S-shaped curve. The problem was that you couldn't just draw an S-shaped curve – the plotting had to be done in Excel according to a formula; an S-shaped curve which just blithely goes through arbitrary points doesn't cut it when you deliver a Compensation Model. But how do you make a formula go up slowly, then fast, then slowly again? Exponents don't work. Logarithms don't work. A sine does look like an S, but it's a wrong kind of S, somehow. What to do?

Enter BC with 6 years of math studies under her belt. A compact yet impressive formula is spelled out, and – presto! – Excel renders an S-shaped curve. (I guess she used the sigmoid function but I didn't check.) The formula brought delight to management and fame to BC, and compensation payments issued according to its verdict keep adding up to scary numbers (BC's agency works with some really big companies).
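A logistic sigmoid is indeed the standard way to get such a curve out of a formula. Assuming KPI is scaled to [0, 1] (my assumption for illustration – I don't know BC's actual formula any more than the author does), it looks like:

```python
import math

def s_curve(kpi, steepness=10.0, midpoint=0.5):
    # Logistic sigmoid: rises slowly, then fast, then slowly saturates --
    # the S shape that neither exponents, logarithms nor sines deliver.
    return 1.0 / (1.0 + math.exp(-steepness * (kpi - midpoint)))

# s_curve(0.0) ~ 0.007, s_curve(0.5) = 0.5, s_curve(1.0) ~ 0.993
```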

The second example involves the compensation of managers. Naturally, a good manager near the bottom is worth less to the firm than a bad manager near the top, and therefore the compensation function should now depend on the manager's level in the hierarchy as well as his KPI (or better). Equally naturally, the numbers coming out of the compensation spreadsheet will under no circumstances arise through an externally conducted study of their psychological implications or any similarly unpredictable device. The numbers will result from nothing but the deep understanding of the organization possessed by the top management.

The development process of the managerial compensation function is thus complementary to that of the employee compensation function. Instead of producing numbers from a beautiful spreadsheet, what is needed here is to produce a beautiful spreadsheet from the numbers specified by the top management. The spreadsheet then promptly generates back these exact numbers from the input parameters.

The purpose of the spreadsheet is to relieve the top managers from the need to justify the numbers to their underlings. In order to guarantee that they are relieved from this need, the formula should not contain terms such as 1/(level^2), which could raise questions such as why not use 1/level, why not use 1/log(level) and other questions along these lines. Rather, the formula should contain terms which could raise no questions at all simply due to their size and shape.

BC faced this problem at an early stage of her career, and despite the natural stress, came up with an interesting Compensation Model, its key term being e raised to the power of something unspeakably grand, combining the trademark Gaussian look and feel with an obvious ability to deter the skeptics from asking questions. The only problem with that term was the very source of its utility, namely, the fact that it evaluated to 0 for all values of KPI and hierarchy level.

The deadline being close, BC told the manager of the consulting project in question about the status of her work and expressed her doubts regarding the delivery of the Compensation Model. The manager told her that she just doesn't get it, does she, it's great, the right numbers come out and that's all there is to it and we should send it right away. And so they did, to everyone's complete satisfaction.

Her command of applied mathematics aside, BC is generally quite powerful.

For instance, she once got invited to consult some government agency about a project of theirs, while being on vacation and without it being explained to her that she was about to attend a formal meeting with the whole team. In order to maintain the reputation of the guy who somewhat clumsily brought her in, she had to improvise.

The project, worthy of a government agency, was related to some sort of war on corruption, the unusual thing being that they wanted to fight the corruption of other governments. Their weapon of choice was the training of representatives of another government, financed by the other government, in their supposedly superior methods of governance. While the general concept was impressive on many levels, the details were unclear.

BC had to speak, and she spoke based on a principle appearing in a book by some McKinsey alumni (she didn't tell me its name, nor did she particularly recommend it): whatever you tell people, it should contain 3 main points. Possibly 4. But preferably 3. More is overwhelming and less is boring. So she said: "At its core, your project is about teaching people. It is therefore essential to clearly understand three things:

  • Whom you're teaching,
  • What you're teaching them,
  • And how you're teaching it."

And they started writing it down.

I asked BC whether there was some way to unleash her on the company employing me so that she grinds a few bullet points into them (a handsome Compensation Model being as good a place to start as any). She said something to the effect of "it only works on the weak-minded"; it was apparent, she said, that the government agency in question had little previous exposure to consulting.

BC says she (still) believes that business consulting is meaningful and valuable, which sounds paradoxical at this point. But, looked at from another angle, it really isn't. Don't view her stories as ones undermining the credibility of business consulting but rather as ones building her own credibility as a person aware of the actual meaning of things and willing to sincerely share that understanding (how many people would instead say that they Developed Cutting-Edge Compensation Models?) If she says there's meaning to it, perhaps there is.