My history with Forth & stack machines

September 10th, 2010

My VLSI tools take a chip from conception through testing. Perhaps 500 lines of source code. Cadence, Mentor Graphics do the same, more or less. With how much source/object code?

– Chuck Moore, the inventor of Forth

This is a personal account of my experience implementing and using the Forth programming language and the stack machine architecture. "Implementing and using" – in that order, pretty much; a somewhat typical order, as will become apparent.

It will also become clear why, having defined the instruction set of a processor designed to run Forth that went into production, I don't consider myself a competent Forth programmer (now is the time to warn that my understanding of Forth is just that – my own understanding; wouldn't count on it too much.)

Why the epigraph about Chuck Moore's VLSI tools? Because Forth is very radical. Black Square kind of radical. An approach to programming seemingly leaving out most if not all of programming:

...Forth does it differently. There is no syntax, no redundancy, no typing. There are no errors that can be detected. ...there are no parentheses. No indentation. No hooks, no compatibility. ...No files. No operating system.

Black Square by Kazimir Malevich

I've never been a huge fan of suprematism or modernism in general. However, a particular modernist can easily get my attention if he's a genius in a traditional sense, with superpowers. Say, he memorizes note sheets upon the first brief glance like Shostakovich did.

Now, I've seen chip design tools by the likes of Cadence and Mentor Graphics. Astronomically costly licenses. Geological run times. And nobody quite knows what they do. To me, VLSI tools in 500 lines qualify as a superpower, enough to grab my attention.

So, here goes.

***

I was intrigued with Forth ever since I read about it in Bruce Eckel's book on C++, a 198-something edition; he said there that "extensibility got a bad reputation due to languages like Forth, where a programmer could change everything and effectively create a programming language of his own". WANT!

A couple of years later, I looked for info on the net, which seemed somewhat scarce. An unusually looking language. Parameters and results passed implicitly on a stack. 2 3 + instead of 2+3. Case-insensitive. Nothing about the extensibility business though.

I thought of nothing better than to dive into the source of an implementation, pForth – and I didn't need anything better, as my mind was immediately blown away by the following passage right at the top of system.fth, the part of pForth implemented in Forth on top of the C interpreter:

: (   41 word drop ; immediate
( That was the definition for the comment word. )
( Now we can add comments to what we are doing! )

Now. we. can. add. comments. to. what. we. are. doing.

What this does is define a word (Forth's name for a function) called "(". "(" is executed at compile time (as directed by IMMEDIATE). It tells the compiler to read bytes from the source file (that's what the word called, um, WORD is doing), until a ")" – ASCII 41 – is found. Those bytes are then ignored (the pointer to them is removed from the stack with DROP). So effectively, everything inside "( ... )" becomes a comment.

Wow. Yeah, you definitely can't do that in C++. (You can in Lisp but they don't teach you those parts at school. They teach the pure functional parts, where you can't do things that you can in C++. Bastards.)

Read some more and...

 conditional primitives
: IF     ( -- f orig )  ?comp compile 0branch  conditional_key >mark     ; immediate
: THEN   ( f orig -- )  swap ?condition  >resolve   ; immediate
: BEGIN  ( -- f dest )  ?comp conditional_key <mark   ; immediate
: AGAIN  ( f dest -- )  compile branch  swap ?condition  <resolve  ; immediate
: UNTIL  ( f dest -- )  compile 0branch swap ?condition  <resolve  ; immediate
: AHEAD  ( -- f orig )  compile branch   conditional_key >mark     ; immediate

Conditional primitives?! Looks like conditional primitives aren't – they define them here. This COMPILE BRANCH business modifies the code of a function that uses IF or THEN, at compile time. THEN – one part of the conditional – writes (RESOLVEs) a branch offset to a point in code saved (MARKed) by IF, the other part of the conditional.

It's as if a conventional program modified the assembly instructions generated from it at compile time. What? How? Who? How do I wrap my mind around this?

Shocked, I read the source of pForth.

Sort of understood how Forth code was represented and interpreted. Code is this array of "execution tokens" – function pointers, numbers and a few built-ins like branches, basically. A Forth interpreter keeps an instruction pointer into this array (ip), a data stack (ds), and a return stack (rs), and does this:

while(true) {
 switch(*ip) {
  //arithmetics (+,-,*...):
  case PLUS: ds.push(ds.pop() + ds.pop()); ++ip;
  //stack manipulation (drop,swap,rot...):
  case DROP: ds.pop(); ++ip;
  //literal numbers (1,2,3...):
  case LITERAL: ds.push(ip[1]); ip+=2;
  //control flow:
  case COND_BRANCH: if(!ds.pop()) ip+=ip[1]; else ip+=2;
  case RETURN: ip = rs.pop();
  //user-defined words: save return address & jump
  default: rs.push(ip+1); ip = *ip;
 }
}

That's it, pretty much. Similar, say, to the virtual stack machine used to implement Java. One difference is that compiling a Forth program is basically writing to the code array in a WYSIWYG fashion. COMPILE SOMETHING simply appends the address of the word SOMETHING to the end of the code array. So does plain SOMETHING when Forth is compiling rather than interpreting, as it is between a colon and a semicolon, that is, when a word is defined.

: DRAW-RECTANGLE 2DUP UP RIGHT DOWN LEFT ;

simply appends {&2dup,&up,&right,&down,&left,RETURN} to the code array. Very straightforward. There are no parameters or declaration/expression syntax as in...

void drawRectangle(int width, int height) {
  up(height);
  right(width);
  down(height);
  left(width);
}

...to make it less than absolutely clear how the source code maps to executable code. "C maps straightforwardly to assembly"? Ha! Forth maps straightforwardly to assembly. Well, to the assembly language of a virtual stack machine, but still. So one can understand how self-modifying code like IF and THEN works.

On the other hand, compared to drawRectangle, it is somewhat unclear what DRAW-RECTANGLE does. What are those 2 values on the top of the stack that 2DUP duplicates before meaningful English names appear in DRAW-RECTANGLE's definition? This is supposed to be ameliorated by stack comments:

: DRAW-RECTANGLE ( width height -- ) ... ;

...tells us that DRAW-RECTANGLE expects to find height at the top of the stack, and width right below it.

I went on to sort of understand CREATE/DOES> – a further extension of this compile-time self-modifying code business that you use to "define defining words" (say, CONSTANT, VARIABLE, or CLASS). The CREATE part says what should be done when words (say, class names) are defined by your new defining word. The DOES> part says what should be done when those words are used. For example:

: CONSTANT
   CREATE ,
   DOES> @
;
\ usage example:
7 CONSTANT DAYS-IN-WEEK
DAYS-IN-WEEK 2 + . \ should print 9

CREATE means that every time CONSTANT is called, a name is read from the source file (similarly to what WORD would have done). Then a new word is created with that name (as a colon would have done). This word records the value of HERE – something like sbrk(0), a pointer past the last allocated data item. When the word is executed, it pushes the saved address onto the data stack, then calls the code after DOES>. The code after CREATE can put some data after HERE, making it available later to the DOES> part.

With CONSTANT, the CREATE part just saves its input (in our example, 7) – the comma word does this: *HERE++ = ds.pop(); The DOES> part then fetches the saved number – the @ sign is the fetch word: ds.push( *ds.pop() );

CONSTANT works somewhat similarly to a class, CREATE defining its constructor and DOES> its single method:

class Constant
  def initialize(x) @x=x end
  def does() @x end
end
daysInWeek = Constant.new(7)
print daysInWeek.does() + 2

...But it's much more compact on all levels.

Another example is defining C-like structs. Stripped down to their bare essentials (and in Forth things tend to be stripped down to their bare essentials), you can say that:

struct Rectangle {
  int width;
  int height;
};

...simply gives 8 (the structure size) a new name Rectangle, and gives 0 and 4 (the members' offsets) new names, width and height. Here's one way to implement structs in Forth:

struct
  cell field width
  cell field height
constant rectangle

\ usage example:
\ here CREATE is used just for allocation
create r1 rectangle allot \ r1=HERE; HERE+=8
2 r1 width !
3 r1 height !
: area dup width @ swap height @ * ;
r1 area . \ should print 6

CELL is the size of a word; we could say "4 field width" instead of "cell field width" on 32b machines. Here's the definition of FIELD:

 : field ( struct-size field-size -- new-struct-size )
    create over , +
    does> @ +
 ;

Again, pretty compact. The CREATE part stores the offset, a.k.a current struct size (OVER does ds.push(ds[1]), comma does *HERE++=ds.pop()), then adds the field size to the struct size, updating it for the next call to FIELD. The DOES> part fetches the offset, and adds it to the top of the stack, supposedly containing the object base pointer, so that "rect width" or "rect height" compute &rect.width or &rect.height, respectively. Then you can access this address with @ or ! (fetch/store). STRUCT simply pushes 0 to the top of the data stack (initial size value), and at the end, CONSTANT consumes the struct size:

struct \ data stack: 0
  cell ( ds: 0 4 ) field width  ( ds: 4 )
  cell ( ds: 4 4 ) field height ( ds: 8 )
constant rectangle ( ds: as before STRUCT )

You can further extend this to support polymorphic methods – METHOD would work similarly to FIELD, fetching a function pointer ("execution token") through a vtable pointer and an offset kept in the CREATEd part. A basic object system in Forth can thus be implemented in one screen (a Forth code size unit – 16 lines x 64 characters).

To this day, I find it shocking that you can define defining words like CONSTANT, FIELD, CLASS, METHOD – something reserved to built-in keywords and syntactic conventions in most languages – and you can do it so compactly using such crude facilities so trivial to implement. Back when I first saw this, I didn't know about DEFMACRO and how it could be used to implement the defining words of CLOS such as DEFCLASS and DEFMETHOD (another thing about Lisp they don't teach in schools). So Forth was completely mind-blowing.

And then I put Forth aside.

It seemed more suited for number crunching/"systems programming" than text processing/"scripting", whereas it is scripting that is the best trojan horse for pushing a language into an organization. Scripting is usually mission-critical without being acknowledged as such, and many scripts are small and standalone. Look how many popular "scripting languages" there are as opposed to "systems programming languages". Then normalize it by the amount of corporate backing a language got on its way to popularity. Clearly scripting is the best trojan horse.

In short, there were few opportunities to play with Forth at work, so I didn't. I fiddled with the interpreter and with the metaprogramming and then left it at that without doing any real programming.

Here's what Jeff Fox, a prominent member of the Forth community who've worked with Chuck Moore for years, has to say about people like me:

Forth seems to mean programming applications to some and porting Forth or dissecting Forth to others. And these groups don't seem to have much in common.

...One learns one set of things about frogs from studying them in their natural environment or by getting a doctorate in zoology and specializing in frogs. And people who spend an hour dissecting a dead frog in a pan of formaldehyde in a biology class learn something else about frogs.

...One of my favorite examples was that one notable colorforth [a Forth dialect] enthusiast who had spent years studying it, disassembling it, reassembling it and modifying it, and made a lot of public comments about it, but had never bothered running it and in two years of 'study' had not been able to figure out how to do something in colorforth as simple as:
1 dup +
...[such Forth users] seem to have little interest in what it does, how it is used, or what people using it do with it. But some spend years doing an autopsy on dead code that they don't even run.

Live frogs are just very different than dead frogs.

Ouch. Quite an assault not just on a fraction of a particular community, but on language geeks in general.

I guess I feel that I could say that if it isn't solving a significant real problem in the real world it isn't really Forth.

True, I guess, and equally true from the viewpoint of someone extensively using any non-mainstream language and claiming enormous productivity gains for experts. Especially true for the core (hard core?) of the Forth community, Forth being their only weapon. They actually live in Forth; it's DIY taken to the extreme, something probably unparalleled in the history of computing, except, perhaps, the case of Lisp environments and Lisp machines (again).

Code running on Forth chips. Chips designed with Forth CAD tools. Tools developed in a Forth environment running on the bare metal of the desktop machine. No standard OS, file system or editor. All in recent years when absolutely nobody else would attempt anything like it. They claim to be 10x to 100x more productive than C programmers (a generic pejorative term for non-Forth programmers; Jeff Fox is careful to put "C" in quotes, presumably either to make the term more generic or more pejorative).

...people water down the Forth they do by not exercising most of the freedom it offers... by using Forth only as debugger or a yet another inefficient scripting language to be used 1% of the time.

Forth is about the freedom to change the language, the compiler, the OS or even the hardware design and is very different than programming languages that are about fitting things to a fixed language syntax in a narrow work context.

What can be said of this? If, in order to "really" enter a programming culture, I need to both "be solving a significant real problem in the real world" and exercising "the freedom to change the language, the compiler, the OS or even the hardware design", then there are very few options for entering this culture indeed. The requirement for "real world work" is almost by definition incompatible with "the freedom to change the language, the compiler, the OS and the hardware design".

And then it so happened that I started working on a real-world project about as close to Forth-level DIY as possible. It was our own hardware, with our own OS, our own compilers, designed to be running our own application. We did use standard CAD tools, desktop operating systems and editors, and had standard RISC cores in the chip and standard C++ cross compilers for them. Well, everyone has weaknesses. Still, the system was custom-tailored, embedded, optimized, standalone, with lots of freedom to exercise – pretty close to the Forth way, in one way.

One part of the system was an image processing co-processor, a variation on the VLIW theme. Its memory access and control flow was weird and limited, to the effect that you could neither access nor jump to an arbitrary memory address. It worked fine for the processing-intensive parts of our image processing programs.

We actually intended to glue those parts together with a few "control instructions" setting up the plentiful control registers of this machine. When I tried, it quickly turned out, as was to be expected, that those "control instructions" must be able to do, well, everything – arithmetic, conditions, loops. In short, we needed a CPU.

We thought about buying a CPU, but it was unclear how we could use an off-the-shelf product. We needed to dispatch VLIW instructions from the same instruction stream. We also needed a weird mixture of features. No caches, no interrupts, no need for more than 16 address bits, but for accessing 64 data bits, and 32-bit arithmetic.

We thought about making our own CPU. The person with the overall responsibility for the hardware design gently told me that I was out of my mind. CPUs have register files and pipeline and pipeline stalls and dependency detection to avoid those stalls and it's too complicated.

And then I asked, how about a stack machine? No register file. Just a 3-stage pipeline – fetch, decode, execute. No problem with register dependencies, always pop inputs from the top of the stack, push the result.

He said it sounded easy enough alright, we could do that. "It's just like my RPN calculator. How would you program it?" "In Forth!"

I defined the instruction set in a couple of hours. It mapped to Forth words as straightforwardly as possible, plus it had a few things Forth doesn't have that C might need, as a kind of insurance (say, access to 16-bit values in memory).

This got approved and implemented; not that it became the schedule bottleneck, but it was harder than we thought. Presumably that was partly the result of not reading "Stack Computers: the new wave", and not studying the chip designs of Forth's creator Chuck Moore, either. I have a feeling that knowledgable people would have sneered at this machine: it was trivial to compile Forth to it, but at the cost of complicating the hardware.

But I was satisfied – I got a general-purpose CPU for setting up my config regs at various times through my programs, and as a side effect, I got a Forth target. And even if it wasn't the most cost-effective Forth target imaginable, it was definitely a time to start using Forth at work.

(Another area of prior art on stack machines that I failed to study in depth was 4stack – an actual VLIW stack machine, with 4 data stacks as suggested by its name. I was very interested in it, especially during the time when we feared implementation problems with the multi-port register file feeding our multiple execution units. I didn't quite figure out how programs would map to 4stack and what the efficiency drop would be when one had to spill stuff from the data stacks to other memory because of data flow complications. So we just went for a standard register file and it worked out.)

The first thing I did was write a Forth cross-compiler for the machine – a 700-line C++ file (and for reasons unknown, the slowest-compiling C++ code that I have ever seen).

I left out all of the metaprogramming stuff. For instance, none of the Forth examples above, the ones that drove me to Forth, could be made to work in my own Forth. No WORD, no COMPILE, no IMMEDIATE, no CREATE/DOES>, no nothing. Just colon definitions, RPN syntax, flow control words built into the compiler. "Optimizations" – trivial constant folding so that 1 2 + becomes 3, and inlining – :INLINE 1 + ; works just like : 1 + ; but is inlined into the code of the caller. (I was working on the bottlenecks so saving a CALL and a RETURN was a big deal.) So I had that, plus inline assembly for the VLIW instructions. Pretty basic.

I figured I didn't need the more interesting metaprogramming stuff for my first prototype programs, and I could add it later if it turned out that I was wrong. It was wierd to throw away everything I originally liked the most, but I was all set to start writing real programs. Solving real problems in the real world.

It was among the most painful programming experiences in my life.

All kinds of attempts at libraries and small test programs aside, my biggest program was about 700 lines long (that's 1 line of compiler code for 1 line of application code). Here's a sample function:

: mean_std ( sum2 sum inv_len -- mean std )
  \ precise_mean = sum * inv_len;
  tuck u* \ sum2 inv_len precise_mean
  \ mean = precise_mean >> FRAC;
  dup FRAC rshift -rot3 \ mean sum2 inv_len precise_mean
  \ var = (((unsigned long long)sum2 * inv_len) >> FRAC) - (precise_mean * precise_mean >> (FRAC*2));
  dup um* nip FRAC 2 * 32 - rshift -rot \ mean precise_mean^2 sum2 inv_len
  um* 32 FRAC - lshift swap FRAC rshift or \ mean precise_mean^2 sum*inv_len
  swap - isqrt \ mean std
;

Tuck u*.

This computes the mean and the standard deviation of a vector given the sum of its elements, the sum of their squares, and the inverse of its length. It uses scaled integer arithmetic: inv_len is an integer keeping (1<<FRAC)/length. How it arranges the data on the stack is beyond me. It was beyond me at the time when I wrote this function, as indicated by the plentiful comments documenting the stack state, amended by wimpy C-like comments ("C"-like comments) explaining the meaning of the postfix expressions.

This nip/tuck business in the code? Rather than a reference to the drama series on plastic surgery, these are names of Forth stack manipulation words. You can look them up in the standard. I forgot what they do, but it's, like, ds.insert(2,ds.top()), ds.remove(1), this kind of thing.

Good Forth programmers reportedly don't use much of those. Good Forth programmers arrange things so that they flow on the stack. Or they use local variables. My DRAW-RECTANGLE definition above, with a 2DUP, was reasonably flowing by my standards: you get width and height, duplicate both, and have all 4 data items – width,height,width,height – consumed by the next 4 words. Compact, efficient – little stack manipulation. Alternatively we could write:

: DRAW-RECTANGLE { width height }
  height UP
  width RIGHT
  height DOWN
  width LEFT
;

Less compact, but very readable – not really, if you think about it, since nobody knows how much stuff UP leaves on the stack and what share of that stuff RIGHT consumes, but readable enough if you assume the obvious. One reason not to use locals is that Chuck Moore hates them:

I remain adamant that local variables are not only useless, they are harmful.

If you are writing code that needs them you are writing non-optimal code. Don't use local variables. Don't come up with new syntax for describing them and new schemes for implementing them. You can make local variables very efficient especially if you have local registers to store them in, but don't. It's bad. It's wrong.

It is necessary to have [global] variables. ... I don't see any use for [local] variables which are accessed instantaneously.

Another reason not to use locals is that it takes time to store and fetch them. If you have two items on a data stack on a hardware stack machine, + will add them in one cycle. If you use a local, then it will take a cycle to store its value with { local_name }, and a cycle to fetch its value every time you mention local_name. On the first version of our machine, it was worse as fetching took 2 cycles. So when I wrote my Forth code, I had to make it "flow" for it to be fast.

The abundance of DUP, SWAP, -ROT and -ROT3 in my code shows that making it flow wasn't very easy. One problem is that every stack manipulation instruction also costs a cycle, so I started wondering whether I was already past the point where I had a net gain. The other problem was that I couldn't quite follow this flow.

Another feature of good Forth code, which supposedly helps achieve the first good feature ("flow" on the stack), is factoring. Many small definitions.

Forth is highly factored code. I don't know anything else to say except that Forth is definitions. If you have a lot of small definitions you are writing Forth. In order to write a lot of small definitions you have to have a stack.

In order to have really small definitions, you do need a stack, I guess – or some other implicit way of passing parameters around; if you do that explicitly, definitions get bigger, right? That's how you can get somewhat Forth-y with Perl – passing things through the implicit variable $_: call chop without arguments, and you will have chopped $_.

Anyway, I tried many small definitions:

:inline num_rects params @ ;
:inline sum  3 lshift gray_sums + ;
:inline sum2 3 lshift gray_sums 4 + + ;
:inline rect_offset 4 lshift ;
:inline inv_area rect_offset rects 8 + + @ ;
:inline mean_std_stat ( lo hi -- stat )
  FRAC lshift swap 32 FRAC - rshift or
;
: mean_std_loop
 \ inv_global_std = (1LL << 32) / MAX(global_std, 1);
 dup 1 max 1 swap u/mod-fx32 drop \ 32 frac bits

 num_rects \ start countdown
 begin
  1 - \ rects--
  dup sum2 @
  over sum @
  pick2 inv_area
  mean_std \ global_mean global_std inv_global_std rectind mean std
  rot dup { rectind } 2 NUM_STATS * * stats_arr OFT 2 * + + { stats }
  \ stats[OFT+0] = (short)( ((mean - global_mean) * inv_global_std) >> (32 - FRAC) );
  \ stats[OFT+1] = (short)( std * inv_global_std >> (32 - FRAC) );
  pick2       um* mean_std_stat stats 2 + h! \ global_mean global_std inv_global_std mean
  pick3 - over m* mean_std_stat stats h!
  rectind ?dup 0 = \ quit at rect 0
 until
 drop 2drop
;

I had a bunch of those short definitions, and yet I couldn't get rid of heavy functions with DUP and OVER and PICK and "C" comments to make any sense of it. This stack business just wasn't for me.

Stacks are not popular. It's strange to me that they are not. There is just a lot of pressure from vested interests that don't like stacks, they like registers.

But I actually had a vested interest in stacks, and I began to like registers more and more. The thing is, expression trees map perfectly to stacks: (a+b)*(c-d) becomes a b + c d – *. Expression graphs, however, start to get messy: (a+b)*a becomes a dup b + *, and this dup cluttering things up is a moderate example. And an "expression graph" simply means that you use something more than once. How come this clutters up my code? This is reuse. A kind of factoring, if you like. Isn't factoring good?

In fact, now that I thought of it, I didn't understand why stacks were so popular. Vested interests, perhaps? Why is the JVM bytecode and the .NET bytecode and even CPython's bytecode all target stack VMs? Why not use registers the way LLVM does?

Speaking of which. I started to miss a C compiler. I downloaded LLVM. (7000 files plus a huge precompiled gcc binary. 1 hour to build from scratch. So?) I wrote a working back-end for the stack machine within a week. Generating horrible code. Someone else wrote an optimizing back-end in about two months.

After a while, the optimizing back-end's code wasn't any worse than my hand-coded Forth. Its job was somewhat easier than mine since by the time it arrived, it only took 1 cycle to load a local. On the other hand, loads were fast as long as they weren't interleaved with stores – some pipeline thing. So the back-end was careful to reorder things so that huge sequences of loads went first and then huge sequences of stores. Would be a pity to have to do that manually in Forth.

You have no idea how much fun it is to just splatter named variables all over the place, use them in expressions in whatever order you want, and have the compiler schedule things. Although you do it all day. And that was pretty much the end of Forth on that machine; we wrote everything in C.

What does this say about Forth? Not much except that it isn't for me. Take Prolog. I know few things more insulting than having to code in Prolog. Whereas Armstrong developed Erlang in Prolog and liked it much better than reimplementing Erlang in C for speed. I can't imagine how this could be, but this is how it was. People are different.

Would a good Forth programmer do better than me? Yes, but not just at the level of writing the code differently. Rather, at the level of doing everything differently. Remember the freedom quote? "Forth is about the freedom to change the language, the compiler, the OS or even the hardware design".

...And the freedom to change the problem.

Those computations I was doing? In Forth, they wouldn't just write it differently. They wouldn't implement them at all. In fact, we didn't implement them after all, either. The algorithms which made it into production code were very different – in our case, more complicated. In the Forth case, they would have been less complicated. Much less.

Would less complicated algorithms work? I don't know. Probably. Depends on the problem though. Depends on how you define "work", too.

The tiny VLSI toolchain from the epigraph? I showed Chuck Moore's description of that to an ASIC hacker. He said it was very simplistic – no way you could do with that what people are doing with standard tools.

But Chuck Moore isn't doing that, under the assumption that you need not to. Look at the chips he's making. 144-core, but the cores (nodes) are tiny – why would you want them big, if you feel that you can do anything with almost no resources? And they use 18-bit words. Presumably under the assumption that 18 bits is a good quantity, not too small, not too large. Then they write an application note about imlpementing the MD5 hash function:

MD5 presents a few problems for programming a Green Arrays device. For one thing it depends on modulo 32 bit addition and rotation. Green Arrays chips deal in 18 bit quantities. For another, MD5 is complicated enough that neither the code nor the set of constants required to implement the algorithm will fit into one or even two or three nodes of a Green Arrays computer.

Then they solve these problems by manually implementing 32b addition and splitting the code across nodes. But if MD5 weren't a standard, you could implement your own hash function without going to all this trouble.

In his chip design tools, Chuck Moore naturally did not use the standard equations:

Chuck showed me the equations he was using for transistor models in OKAD and compared them to the SPICE equations that required solving several differential equations. He also showed how he scaled the values to simplify the calculation. It is pretty obvious that he has sped up the inner loop a hundred times by simplifying the calculation. He adds that his calculation is not only faster but more accurate than the standard SPICE equation. ... He said, "I originally chose mV for internal units. But using 6400 mV = 4096 units replaces a divide with a shift and requires only 2 multiplies per transistor. ... Even the multiplies are optimized to only step through as many bits of precision as needed.

This is Forth. Seriously. Forth is not the language. Forth the language captures nothing, it's a moving target. Chuck Moore constantly tweaks the language and largely dismisses the ANS standard as rooted in the past and bloated. Forth is the approach to engineering aiming to produce as small, simple and optimal system as possible, by shaving off as many requirements of every imaginable kind as you can.

That's why its metaprogramming is so amazingly compact. It's similar to Lisp's metaprogramming in much the same way bacterial genetic code is similar to that of humans – both reproduce. Humans also do many other things that bacteria can't (...No compatibility. No files. No operating system). And have a ton of useless junk in their DNA, their bodies and their habitat.

Bacteria have no junk in their DNA. Junk slows down the copying of the DNA which creates a reproduction bottleneck so junk mutations can't compete. If it can be eliminated, it should. Bacteria are small, simple, optimal systems, with as many requirements shaved off as possible. They won't conquer space, but they'll survive a nuclear war.

This stack business? Just a tiny aspect of the matter. You have complicated expression graphs? Why do you have complicated expression graphs? The reason Forth the language doesn't have variables is because you can eliminate them, therefore they are junk, therefore you should eliminate them. What about those expressions in your Forth program? Junk, most likely. Delete!

I can't do that.

I can't face people and tell them that they have to use 18b words. In fact I take pride in the support for all the data types people are used to from C in our VLIW machine. You can add signed bytes, and unsigned shorts, and you even have instructions adding bytes to shorts. Why? Do I believe that people actually need all those combinations? Do I believe that they can't force their 16b unsigned shorts into 15b signed shorts to save hardware the trouble?

OF COURSE NOT.

They just don't want to. They want their 16 bits. They whine about their 16th bit. Why do they want 16 and not 18? Because they grew up on C. "C". It's completely ridiculous, but nevertheless, people are like that. And I'm not going to fight that, because I am not responsible for algorithms, other people are, and I want them happy, at least to a reasonable extent, and if they can be made happier at a reasonable cost, I gladly pay it. (I'm not saying you can't market a machine with a limited data type support, just using this as an example of the kind of junk I'm willing to carry that in Forth it is not recommended to carry.)

Why pay this cost? Because I don't do algorithms, other people do, so I have to trust them and respect their judgment to a large extent. Because you need superhuman abilities to work without layers. My minimal stack of layers is – problem, software, hardware. People working on the problem (algorithms, UI, whatever) can't do software, not really. People doing software can't do hardware, not really. And people doing hardware can't do software, etc.

The Forth way of focusing on just the problem you need to solve seems to more or less require that the same person or a very tightly united group focus on all three of these things, and pick the right algorithms, the right computer architecture, the right language, the right word size, etc. I don't know how to make this work.

My experience is, you try to compress the 3 absolutely necessary layers to 2, you get a disaster. Have your algorithms people talk directly to your hardware people, without going through software people, and you'll get a disaster. Because neither understands software very well, and you'll end up with an unusable machine. Something with elaborate computational capabilities that can't be put together into anything meaningful. Because gluing it together, dispatching, that's the software part.

So you need at least 3 teams, or people, or hats, that are to an extent ignorant about each other's work. Even if you're doing everything in-house, which, according to Jeff Fox, was essentially a precondition to "doing Forth". So there's another precondtion – having people being able to do what at least 3 people in their respective areas normally do, and concentrating on those 3 things at the same time. Doing the cross-layer global optimization.

It's not how I work. I don't have the brain nor the knowledge nor the love for prolonged meditation. I compensate with, um, people skills. I find out what people need, that is, what they think they need, and I negotiate, and I find reasonable compromises, and I include my narrow understanding of my part – software tools and such – into those compromises. This drags junk in. I live with that.

I wish I knew what to tell you that would lead you to write good Forth. I can demonstrate. I have demonstrated in the past, ad nauseam, applications where I can reduce the amount of code by 90% and in some cases 99%. It can be done, but in a case by case basis. The general principle still eludes me.

And I think he can, especially when compatibility isn't a must. But not me.

I still find Forth amazing, and I'd gladly hack on it upon any opportunity. It still gives you the most bang for the buck – it implements the most functionality in the least space. So still a great fit for tiny targets, and unlikely to be surpassed. Both because it's optimized so well and because the times when only bacteria survived in the amounts of RAM available are largely gone so there's little competition.

As to using Forth as a source of ideas on programming and language design in general – not me. I find that those ideas grow out of an approach to problem solving that I could never apply.

Update (July 2014): Jeff Fox's "dead frog dissector" explained his view of the matter in a comment to this article, telling us why the frog (colorForth) died in his hands in the first place... A rather enlightening incident, this little story.