"High-level CPU": follow-up

This is a follow-up on the previous entry, the "high-level CPU" challenge. I'll try to summarize the replies and my opinion on the various proposals. But first, a summary of my original points:

  1. "Very" high-level languages have a cost. Attributing this cost to the underlying hardware architecture is wrong. You could move the cost from software to hardware, but that wouldn't eliminate it. I primarily referred to languages characterized by indirection levels and late binding of user-defined operations, such as Lisp and Python, and to a lesser extent/confidence to side-effect-free languages like Haskell. I didn't mean to say that high-level languages should not be used, in fact I think that their cost is wildly overestimated by many. However, denying the existence of any intrinsic cost guarantees that people will keep overestimating it, because if it weren't that high a cost, why would you lie to them? I mean it very seriously; horrible tech marketing is responsible for the death (or coma) of many great things.
  2. Of all systems with similar cost and features, the one that has the least stuff implemented in hardware is the best, because you can change more things. The idea that moving things to hardware is a sure way to make them efficient is a misconception. Hardware can't do "anything in one cycle"; there are many constraints involved. Therefore, it's better to let the software explicitly control a set of low-level components than build hardware logic implementing high-level interfaces to them. For example, to add 2 numbers on a RISC machine, you load them to registers, then add. You could have a command adding operands from memory; it wouldn't run faster, because the hardware would have to spend cycles on loading operands to (implicit) registers. Hardware doesn't have to be a RISC machine, but it's always better to move as much control to software as possible under the given system cost constraints.
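To make point 1 concrete, here's a minimal sketch, in Python and entirely made up for illustration (no real interpreter works exactly like this), of what "indirection levels and late binding" mean for something as simple as adding two values in a dynamic language: every operand lives behind a pointer with a type tag, and the operation itself is looked up at run time.

    # A toy model (not any real interpreter) of dynamic addition.
    class Boxed:
        """A heap-allocated value carrying a run-time type tag."""
        def __init__(self, type_name, payload):
            self.type_name = type_name      # the tag consulted on every operation
            self.payload = payload          # the actual bits

    # Per-type operation table, bound at run time; user code may replace entries.
    DISPATCH = {
        ('int', 'add'): lambda a, b: Boxed('int', a.payload + b.payload),
        ('str', 'add'): lambda a, b: Boxed('str', a.payload + b.payload),
    }

    def dynamic_add(a, b):
        # 1. load each operand's tag (extra memory accesses),
        # 2. look up the operation (late binding),
        # 3. allocate a fresh box for the result.
        if a.type_name != b.type_name:
            raise TypeError("mismatched tags")
        op = DISPATCH[(a.type_name, 'add')]
        return op(a, b)

    print(dynamic_add(Boxed('int', 2), Boxed('int', 3)).payload)   # 5

The statically-typed equivalent of that last line is a single machine add: no tag loads, no table lookup, no allocation. That gap is the cost being discussed; it can be shrunk by clever compilers or clever hardware, but it has to be paid somewhere.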

I basically asked people to refute point 1 ("HLLs are costly"). What follows describes the attempts people made at it.

Computers you can't program

Several readers managed to ignore my references to specific high-level languages and used the opportunity to pimp hardware architectures that can't run those languages. Or any other programming languages designed for human beings, for that matter. Example architectures: cellular automata and neural networks.

It is my opinion that the fans of this family of hardware/vaporware, consistent advocates of The New Age of Computing, have serious AI problems. Here's a sample quote on cellular automata: "I guess they really are like us." Well, if you want to build a computing device in order to have a relationship with it, maybe a cellular automaton will do the trick. Although I'd recommend first checking the fine selection of Homo Sapiens we have here on Planet Earth. Because those come with lots of features you'd like in a friend, a foe, a spouse or an employee already built-in, while computer hardware has a certain gap to fill in this department.

Me, I want to build machines to do stuff that someone "like us" wouldn't want to do, for any of several reasons (the job is hard/boring/stinky/whatever). And once I've built them, I want people to be able to use them. Please note this last point. People and other "nature's computers", like animals and fungi, aren't supposed to be "used". In fact, all those systems spend a huge amount of resources to avoid being used. Machines aren't supposed to be like that. Machines are supposed to do what you want. Which means that both the designer and the user need to control them. Now, a computer that can't even be tricked into parsing HTML in a straightforward way doesn't look like it's built to be controlled, does it?

Let me supply you with an example: Prolog. Prolog is an order of magnitude more tame than a neural net (and two orders of magnitude compared to a cellular automaton) when it comes to "control" – you can implement HTML parsing with it. But Prolog does show alarming signs of independence – it spends most of its time in its inference engine, an elaborate mechanism running lengthy non-trivial loops, which sometimes turn out to be infinite. You aren't supposed to single-step those loops; you're supposed to specify truths about your world, and Prolog will derive more truths for you. Prolog was supposed to be the wave of the future about 25 years ago. I think it can be safely called dead by now, despite the fair amount of money poured into it. I think it died because it's extremely frustrating to use – you just can't tell why the hell it worked that way in each particular case. I've never seen anything remotely as annoying as Prolog, with the notable exception of Makefiles, running on top of a wonderful inference engine of their own.
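To make the "elaborate mechanism running lengthy loops" bit concrete, here's a toy sketch in Python – nowhere near a real Prolog engine, which unifies terms containing variables; the rules and query below are made up – showing how a depth-first proof search over declared rules turns into exactly the kind of loop you can't meaningfully single-step, and how one left-recursive rule makes it infinite:

    # Propositional Horn clauses: head -> list of alternative bodies (conjunctions).
    RULES = {
        'path_a_c': [['edge_a_b', 'path_b_c']],
        'path_b_c': [['path_b_c', 'edge_c_c'],   # left-recursive alternative
                     ['edge_b_c']],
        'edge_a_b': [[]],                        # facts have empty bodies
        'edge_b_c': [[]],
    }

    def prove(goal, depth=0, max_depth=25):
        """Depth-first, left-to-right proof search, the way Prolog resolves goals."""
        if depth > max_depth:
            raise RecursionError("runaway inference")
        for body in RULES.get(goal, []):
            if all(prove(g, depth + 1, max_depth) for g in body):
                return True
        return False

    try:
        print(prove('path_a_c'))
    except RecursionError:
        # The first alternative for path_b_c calls path_b_c before making any
        # progress, so the search never terminates -- and figuring out why it
        # "worked that way" is left to the rule author.
        print("query diverged")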

My current opinion is that neural networks rarely deserve a special hardware implementation – if you need them, build a more traditional computer and run them on top of that; and cellular automata are just stillborn. I might be wrong in the sense that a hardware implementation of these models is the optimal solution for some problem, hence we'll see those beasts in some corner of a successful real-world system. But the vast majority of computing, including AI apps, will run on machines that support basic bread-and-butter programmer things simply and straightforwardly. Here's a Computing Technology Acceptance Lower Bound for ya: if you can't parse a frigging log file with it, you can't do anything with it.

Self-assembly computers

Our next contestant is a machine that you surely can program, once you've built it from the pieces which came in the box. Some people mentioned "FPGA", others failed to call it by its name (one comment mentioned a "giant hypercube of gates", for example). In this part, I'm talking about the suggestions to use an FPGA without further advice on exactly how it should be used; that is, FPGA as the architecture, not FPGA used to prototype an architecture.

Maybe people think that with an FPGA, "everything is possible", so in particular, you could easily build a processor efficiently implementing a HLL. Well, FPGA is just a way to implement hardware allowing you to trade NRE (non-recurring engineering cost) for unit cost. And with hardware, some things are possible and some aren't, or so I claim – for example, you can't magically make the cost of HLLs go away. If you can't think of a way to reduce the overhead HLLs impose on the system cost, citing FPGA doesn't make your argument look any better. On the contrary – you've saved NRE, but you've raised the cost of the hardware by a factor of 5.

Another angle: can you build a compiler? Probably so. Would you like to start your project with building a compiler? Probably not. Now, what makes people think that they want to build hardware themselves? I really don't know. Building hardware is gnarly, FPGA or not – there are lots of constraints you have to think about to make the thing efficient, and it's extremely easy to err on the side of not having enough flexibility. The latter typically happens because you try to implement overly high-level interfaces; it then turns out that you need the same low-level components to do something slightly different.

And changing hardware isn't quite as easy as changing software, even with FPGA, because hardware description code, with its massive parallelism and underlying synthesis constraints, is fairly tricky. FPGA is a perfectly legitimate platform for hardware vendors, but an awful interface for application programmers. If you deliver FPGAs, make it your implementation detail; giving it to application programmers isn't very likely to make them happy in the long run.

At the other end of the spectrum, there's the kind of "self-assembly computer" that reassembles itself automatically, "adapting to the user's needs". Even if it made any sense (and it doesn't), it still wouldn't answer the question: how should this magical hardware adapt to handle HLLs, for example, indirect memory access?

Actual computers designed to run HLLs

Some people mentioned actual hardware which was built to run HLLs, including Reduceron, Tcl on Board, Lisp Machines, Rekursiv, and ARM's Jazelle instruction set. For some reason, nobody mentioned Intel's 432, an object-oriented microprocessor which was supposed to replace x86, but was, among other things, too slow. This illustrates that the existence of a "high-level processor" doesn't mean that it was a good idea (of course it doesn't mean the opposite, either).

I'll now talk about these machines in increasing order of my confidence that the architecture doesn't remove the overhead posed by the HLL it's supposed to run.

  • Reduceron is designed to run Haskell, and focuses on an optimization problem I wasn't even aware of, that of graph reduction (a toy illustration of graph reduction appears right after this list). One of the primary ideas seems to be that graph reduction doesn't suffer from dependency problems which could inhibit parallelization, but still can't be parallelized on stock CPUs. That's because a lot of memory access is involved, and there's typically little load/store bandwidth available to a CPU compared to its data processing capability. Well, I agree with this completely in the sense that memory access is the number one area where custom hardware design can help; more on that later. However, I'm not sure that the right way to go about it is to build a "Haskell Machine"; building a lower-level processor with lots of bandwidth available to it could be better. Then again, it could be worse, and my confidence level in this area is extremely low, which is why I list the Reduceron before the others: I think I'll look into this whole business some more. Pure functional languages are a weak spot of mine; for now, I can only say three things for sure: (1) side effects are a huge source of bugs, (2) although they get in the way of optimizers, side effects are a poor man's number one source of optimizations, so living without them isn't easy, and (3) the Reduceron is a pretty cool project.
  • Tcl on Board was built to run a Tcl dialect. Tcl doesn't pose the optimization problems that languages like Lisp or Python do – it's largely a procedural language grinding flat objects. And there's another thing I ought to tell you: I don't like Tcl. However, I think that this Tcl chip is kind of insightful, because it's designed for low-end applications. And the single biggest win of having a "high-level" instruction set is to save space on program encoding. Several people mentioned it as a big deal; I don't think of it as a big deal, because instruction caches always worked great for me (~90% hits without any particular optimizations). However, for really small systems of the low-end embedded kind, program encoding is a real issue. I'm not saying that Tcl on Board is a good (or a bad) idea by itself; I know nothing about these things. I'm just saying that while I think high-level hardware will fail to deliver speed gains, it might give you space gains, so it may be the way to go for really small systems which aren't supposed to scale. Not that I know much about those systems, except that if I had to build one, I'd seriously consider Forth…
  • Lisp Machines ran Lisp, and Rekursiv ran LINGO, which apparently was somewhat similar to Smalltalk. This I know. What I don't know is how the hardware support for the high-level features would eliminate the cost overhead of the HLLs involved; that's because I don't know the architecture, and nobody gave much detail. I don't see a way to solve the fundamental problems. I mean, if I want to support arrays of bytes, then each byte must be tagged, mustn't it? And if I only support fixnums larger than bytes, then I'd waste space, right? And just what could the LispM do about the hairy binding done by CLOS behind the scenes? Again, this doesn't mean these machines weren't a good idea; in fact I wish my desktop hardware were more expensive and more secure, and tagged architectures could help. All I'm saying is that it would be more expensive. I think. I'd like to hear more about LispM, simply because most people who used it seem to be very fond of it – I know just one exception.
  • Jazelle is supposed to run Java. Java is significantly lower-level than Lisp or Smalltalk. It still is a beautiful example, because the hardware support in this case yields little performance benefit. In fact MIPS reported that a software implementation of the JVM running on a MIPS core outperformed a JVM using Jazelle by a factor of about 2. I've never seen a refutation of that.
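Since graph reduction came up in the Reduceron item above, here's a minimal sketch of the idea in Python – plain SKI combinators, no sharing of reduced subgraphs and no supercombinators, so it's a caricature of what the Reduceron actually does – that at least shows why the workload is all about chasing and rewriting heap nodes rather than crunching numbers:

    class App:
        """An application node in the program graph: (fun arg)."""
        def __init__(self, fun, arg):
            self.fun, self.arg = fun, arg

    S, K, I = 'S', 'K', 'I'

    def apply_all(f, args):
        for a in args:
            f = App(f, a)
        return f

    def whnf(node):
        """Reduce a graph to weak head normal form by rewriting the spine."""
        while isinstance(node, App):
            # Unwind the spine: walk down the .fun links, collecting App nodes.
            spine, head = [], node
            while isinstance(head, App):
                spine.append(head)
                head = head.fun
            args = [a.arg for a in reversed(spine)]   # innermost argument first
            if head == I and len(args) >= 1:          # I x     -> x
                node = apply_all(args[0], args[1:])
            elif head == K and len(args) >= 2:        # K x y   -> x
                node = apply_all(args[0], args[2:])
            elif head == S and len(args) >= 3:        # S x y z -> x z (y z)
                x, y, z = args[:3]
                node = apply_all(App(App(x, z), App(y, z)), args[3:])
            else:
                return node                           # no redex at the head
        return node

    # S K K behaves as the identity function: (S K K) 'x' reduces to 'x'.
    print(whnf(App(App(App(S, K), K), 'x')))          # x

Every step above is a couple of pointer loads, a tag check and a few stores, which is why a machine with wide access to its heap can hope to win here, and why raw ALU throughput doesn't help much.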

Stock computers with bells and whistles

Finally, there was a bunch of suggestions to add specific features to traditional processors.

  • Content-addressable memory is supposed to speed up associative array look-ups. There's a well-known aphorism by Alan Perlis – "A language that doesn't affect the way you think about programming is not worth knowing". Here's my attempt at an aphorism: "A processor that doesn't affect the way you access memory is not worth building". This makes the wide variety of tools designed to help you build a SIMD VLIW machine with your own data processing instructions uninteresting to me, and on the other hand, makes CAM quite appealing. I came to believe that your biggest problem isn't processing the data, it's fetching the data. I might talk about it some time; the Reduceron, essentially designed to solve a memory access problem preventing the optimization of a "perfectly parallelizable" algorithm, is one example of this. However, CAM goes way beyond providing more bandwidth or helping with the addressing – it adds comparison logic to each memory word (a toy model of this appears right after this list). While it sounds impractical to replace all of your RAM with CAM, stashing a CAM array somewhere inside your system could help with some problems. Then again, it won't necessarily pay off – it depends on the exact details of what you're doing. All I can say at this point is that it's a Worthy Idea, which, for some reason, I keep forgetting about, and I shouldn't.
  • GC/reference counting optimizations. Maybe I'm wildly wrong, but I don't think the garbage is a big deal, 'cause how much time do you spend on garbage collection compared to plain malloc/free? The way I see it, the problem isn't so much with the overhead of garbage collection as it is with the amount of small objects allocated by the system and, most importantly, the amount of indirect memory accesses. I learned that some Lisp compilers can do object inlining with varying amounts of user intervention; well, when it works out, it removes the need for special hardware support. The thing is, I think the main battle here is to flatten objects, not to efficiently get rid of them (a tiny illustration of flattening appears after this list). And I think that it's quite clearly software that should fight that battle.
  • Regular expression and string functions in hardware: I don't think it's worth the trouble, because how much time do you spend in regex matching anyway? Maybe it's because I don't process massive volumes of text, but when I do process the moderate amounts of text I bump into, there's the part where you store your findings in data structures, and I think it might be the bottleneck. And then a huge amount of data comes from places like RDBMSes where you don't have to parse much. You'd end up with idle silicon, quietly leaking power.
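On CAM, here's a rough functional model in Python – names and sizes are made up, and sequential Python stands in for what would be a single-cycle parallel compare in silicon – just to pin down the interface being discussed: present a key, every stored word compares itself against it at once, and the matching entry comes back without any hashing or probing loop.

    class CAM:
        """A content-addressable memory: lookup by value, not by address."""
        def __init__(self, size):
            self.keys = [None] * size     # one stored key per word
            self.values = [None] * size   # payload RAM selected by the match line

        def write(self, line, key, value):
            self.keys[line] = key
            self.values[line] = value

        def lookup(self, key):
            # In hardware, every word runs this comparison in the same cycle.
            for line, stored in enumerate(self.keys):
                if stored == key:
                    return self.values[line]
            return None                   # no match line asserted

    cam = CAM(size=8)
    cam.write(0, key="errors", value=42)
    cam.write(1, key="warnings", value=7)
    print(cam.lookup("warnings"))         # 7, from the "compare everywhere" step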
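And on flattening objects, here's a tiny Python illustration (the Point class and the numbers are made up) of the difference between a heap of small objects reached through pointers and the same data "inlined" into one flat buffer – the latter is what a good compiler, rather than special GC hardware, buys you:

    import array

    class Point:                  # boxed representation: one small allocation per point
        def __init__(self, x, y):
            self.x, self.y = x, y

    boxed = [Point(1, 2), Point(3, 4), Point(5, 6)]
    sum_boxed = sum(p.x for p in boxed)              # a pointer chase per element

    # Flattened representation: x0, y0, x1, y1, ... in one contiguous block.
    flat = array.array('i', [1, 2, 3, 4, 5, 6])
    sum_flat = sum(flat[i] for i in range(0, len(flat), 2))

    assert sum_boxed == sum_flat == 9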

The good stuff

At the bottom line, there were two hardware-related things which captured my intoxicated imagination: the Reduceron and content-addressable memories. If anything ever materializes around this, I'll send out some samples. In the meanwhile – thanks!

20 comments

#1 uaf1989 on 02.03.08 at 11:36 am

Please allow me to present a subclass of problems that would be amenable to this approach. If you had a language that could compile a purely logical design, say a logic table for traffic-light controllers, it could produce code that would be able to configure a logic gate array in the most efficient manner, à la Karnaugh mapping. Then the gate array would be able to execute the logical functions in the most efficient way possible. Scale this up for encryption, perhaps. I have often thought it would be nice to have a programmable array that could be referenced in software after it had been preprogrammed to perform some complex operation.

#2 uaf1989 on 02.03.08 at 11:39 am

I would add that the gate array itself could consist entirely of NAND gates.

#3 bjorn.lindberg on 02.03.08 at 5:15 pm

One application that actually uses specialized hardware successfully is graphics cards – typically drawing a lot of polygons really fast. I suppose you have to find some key function that is used a lot before you could justify some special hardware-implemented instruction.

#4 macoovacany on 02.03.08 at 8:39 pm

scheme-79 chip:
ftp://publications.ai.mit.edu/ai-publications/pdf/AIM-559.pdf

LISP Machine :
ftp://publications.ai.mit.edu/ai-publications/500-999/AIM-514.ps

[ps|pdf].

Timbo

#5 hotmichel on 02.04.08 at 3:19 am

It is possible to make a Java CPU which greatly outperforms traditional CPUs (Intel, AMD, …). Look at http://www.azulsystems.com/ – they have been marketing Java CPUs for some time now. Their Vega 2 processor seems to be the first 48-core chip designed and optimized for Java, and achieves much greater performance than general-purpose CPUs.

#6 Yossi Kreinin on 02.04.08 at 1:49 pm

Vega 2: (1) Java is fairly low-level compared to the languages I cited, (2) Vega 2 is supposed to run a gazillion threads; when you have trivially parallelizable workloads (say, in servers), life is good.

Programmable gate arrays: actual FPGAs are actually more flexible than that, they're just gnarly to program.

Graphics cards: I know close to nothing about those; I was shocked to hear that Michael Abrash basically plans to make them obsolete: http://www.radgametools.com/pixomain.htm

The thing is, the less software-configurable a piece of hardware is, the less chances it has to survive, because people want features, and ought to be able to tweak things.

#7 yahoolian on 02.05.08 at 3:56 pm

Side effects limit optimization. If the code must be run in a certain order, the optimizer cannot reorder it to make it faster. Does C have a cost compared to assembly? I doubt humans could consistently outperform a C compiler by coding in assembly. There are too many things to keep track of. There is no intrinsic cost to using C vs assembly, and there is no intrinsic cost to using Haskell vs C. It just depends on which implementation has better optimizations implemented.

By the way, RISC uses more memory bandwidth for the instructions than x86. This results in slower performance, as memory is usually the bottleneck. The Reduceron is faster because it can access memory blocks and compute in parallel, instead of being forced to run sequentially on a von Neumann architecture. A von Neumann processor runs sequentially. By the way, all processors are solving optimization problems. Yes, you could go lower level than Haskell and compile directly to FPGA to get parallel access to memory, but that is much more work, and you'd have to reconfigure the FPGA every time you want to run a different program, which would be quite slow.

Here's an interesting chip, based on the Parallel Random Access Machine (PRAM) model.
http://www.umiacs.umd.edu/~vishkin/XMT/index.shtml

And with respect to memory management, a compiler can automatically detect where memory is deallocated and allocated. A deallocation followed by an allocation can simply reuse the memory. This compile-time garbage collection is implemented in Mercury. http://www.cs.mu.oz.au/research/mercury/information/papers.html#iclp2001_ctgc

Pixomatic only gives DX7 features, and doesn't come anywhere close to current GPU performance. The goal of Pixomatic is not to make GPUs obsolete.

#8 kragen on 04.23.08 at 11:00 pm

The old machine Kay was talking about as being fast for Smalltalk was a Dorado, not a B5000; the B5000 came out something like 15 years earlier, as you know but maybe not all your readers do. The Dorado's approach was the microcode thingy that lost out to RISC.

As far as I can tell, Kay is basically wrong about the Moore's Law thing; I wrote about it in this kragen-tol post: http://lists.canonical.org/pipermail/kragen-tol/2007-March/000850.html

In theory that tells you what benchmarks Kay was probably thinking about.

Basically he was wrong, first, because the Dorado was built out of 3000 extremely expensive, high-performance ECL chips. The appropriate comparison is not to a laptop but to a Cray; that accounts for two of the three "missing" orders of magnitude in performance.

Second, he was wrong because he's talking about Squeak's performance, and Squeak is a bytecode interpreter. If you compile your bytecodes down to machine code with some PICs, you get back the other "missing" order of magnitude.

So here are the approaches that have been suggested either in Yossi's original post or in the comments:
- data word tagging: Yosef doesn't like this because it doesn't work for tiny data (say, 8 bits).
- FPGAs for specialized coprocessors. I think this is a fantastic idea, especially for things like image-processing, and maybe you can use it for things like Reduceron-style combinator-graph reduction too. The Reduceron is a perfect example of how FPGAs do more than allow you to trade NRE for unit cost: you have to reprogram the gate array to make it run a different Haskell program, because that Haskell program is embodied in the set of supercombinators it knows how to reduce.
- CAMs for hash tables. I don't know if that will work.
- reference counting hardware. Maybe a good idea, I don't know.
- lightweight parallelism. Good idea, but more of a design goal than an architectural design feature.
- integer-only cores (maybe "core diversity" is a better term?) probably will result in better overall efficiency, but might be harder to program. Seems like a step backwards if what you want is to reduce the penalty to run high-level languages; what you really need to do is come up with a software system that would hide this complexity from the programmer.
- private per-core memory. Same comments as integer-only cores.
- cellular automata. For what it's worth, ERIM's Cytocomputer — a pipelined raster-scanning cellular-automaton machine on custom silicon — was pretty damn fast for certain kinds of image processing. (You used the CA matrix to perform bulk nearest-neighbor operations SIMD-style on, in effect, the whole image at once.) But, again, not helpful for high-level languages. I don't think anyone ever used a Cytocomputer or a CAM-8 to parse a frigging log file.
- combinator-graph reduction machines. One of the first ones of these was the SKIM ("S-K-I-Machine") back in I think the 1970s; the Reduceron is a supercombinator-based modern equivalent.
- zeroing newly allocated cache lines without reading from DRAM. Modern CPUs can do this already.
- the Reduceron. Yup, awesome.
- the Tera MTA, which was the 128-register-set machine.

I'm going to tell the story of what I think happened to the Tera. Contrary to another commenter's assertion, the problem with the MTA was not that software wasn't available to take advantage of the machine; Tera ported LS-DYNA3D to it, supported all the usual HPC stuff in C and Fortran, and got some really impressive performance numbers IIRC. The problem seemed to be that they were competing on performance with Intel, the Digital Alpha team, and the StrongARM team, all of whom had enormous market volumes and could afford to spend orders of magnitude more money on their CPU designs. I think they only ever shipped two or three generations of their hardware over the course of ten years or so; the last one was a CMOS design actually designed by engineers at Cadence. Fortunately they got enough money that they were able to buy Cray, took the Cray name, and eventually de-emphasized the MTA and even the old Cray vector line in favor of huge clusters of commodity microprocessors.

There's an important lesson here for would-be higher-performance CPUs. It's not enough that the CPU be a lot faster than the other CPUs at the time that you get the idea to build it; it also needs to be a lot faster than the other CPUs when it actually ships. The Itanic and Symbolics can also tell this sad tale.

Yossi also said, about Jamie Zawinski's assertion that Lisp can be fast on stock hardware:

> If you use Lisp in the Lispy way that makes it so attractive in the first place, how on Earth can you optimize out the dynamic type checks and binding? You’d have to solve undecidable problems to make sense of the data flow.

The answer is that there are decidable approximations that give you good speedups. Olin Shivers's dissertation was a starting point for a lot of this research, but it has continued since then. Also, you can take the specializing-JIT approach Java has taken, although I don't think anyone has.

Remember that Jamie got his start at Lucid, whose claim to fame was precisely that they proved you didn't need specialized hardware to run Lisp fast.

I don't really know what the Lisp Machines had in order to help them run Lisp fast. I know they tagged each word, and had tag-checking instructions that checked the tags in parallel with doing the most likely operation, and I'm not confident that you can actually do that in software even on a modern out-of-order CPU.

A couple of times the "tag every byte" thing has come up. In my experience, whenever I'm dealing with large arrays of small numbers (1, 8, 16, or 32 bits) in a high-level language, it's a lot easier to put all of those numbers into a single tagged blob object, rather than putting tagged versions of all those numbers into a potentially-heterogeneous vector. Then you have SIMD instructions that add, or multiply, or subtract, corresponding members, or whatever. This makes interpretation overhead pretty much irrelevant. I've done real-time audio and video synthesis in Python this way, on my 700MHz laptop. I've done OLAP in Python this way. It's a very popular approach — it's how Matlab works — and it dates back to APL360, so it's not new either.
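Concretely, with NumPy (the modern descendant of the Numeric library mentioned below; the sizes and values in this snippet are arbitrary), one tagged blob holds a million untagged bytes and a single interpreted call does all the element-wise work:

    import numpy as np

    a = np.zeros(1_000_000, dtype=np.uint8)   # one boxed object wrapping a million raw bytes
    a[:5] = [1, 2, 3, 4, 5]
    b = a + 3                                 # one dynamic dispatch, then a compiled loop
    print(b[:5])                              # [4 5 6 7 8]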

So if you had a computer with all kinds of crazy architectural features to execute high-level languages quickly, you probably still wouldn't want to use them most of the time to grovel over images pixel by pixel. You probably want to use something like Numeric, or Matlab, or Sisal, or APL, or J, or A+, or Lush, or the Perl Data Language, or PV-WAVE IDL. The speed of your high-level-language code is basically irrelevant here.

Anyway, I'm no expert. I've never even built a CPU of my own, let alone designed one, let alone designed a cutting-edge world-beating fast CPU, and I've only ever written one compiler for a high-level language, and it was a toy. (It's called Ur-Scheme; it compiles itself.) But here are some approaches that I think show some promise:

- a conditional call instruction, or a "previous PC" register that always points to the previous instruction executed, or a "PC before last jump", or something. It would be really great if handling an unexpected operand type could be just a type test followed by a conditional jump, rather than conditionally jumping around a call instruction so that you know where the type fault happened. (This is useful both for safety and for dynamic dispatch, e.g. PICs.)

- stack computers. If you're serious about packing more cores into less area, well, they're much smaller in silicon area than register machines, and often with less function-call overhead. IntellaSys's new SeaForth chip sounds really interesting, although limited in memory, and I imagine you could synthesize something similar that took only 10x as much silicon, and then it wouldn't take you ten years to ship a working chip.

- associative polymorphic caches. Modern CPUs have branch target buffers, or BTBs, which speed up register-indirect jumps and calls by caching the most likely destination of the jump. This is great for cutting down the interpretation penalty of a bytecode interpreter (as long as you put an indirect jump at the end of each bytecode instead of jumping to a central dispatcher) but I have the impression that they still leave a bunch of pipeline stalls in late-bound method calls. Which is why the JIT and similar machines implement, essentially, a BTB in software in the form of a PIC.

- user-level memory fault handling. It's really fast to allocate, say, 12 bytes of memory if you have a copying collector: you just copy your heap pointer register to another register and add 12 to the heap pointer. Except that then you have to do a compare and conditional jump to check for heap overflow, which is a lot more expensive. You can use virtual-memory hardware to trap your heap overflow cheaply, but that typically involves context-switching into the kernel and back. There's no reason it has to. (A sketch of this fast path follows the list.)

- memory protection between objects, as in Erlang and KeyKOS. The idea is that you divide state into "domains" or "processes" that share nothing and communicate only by sending each other messages, each with its own event loop to handle those messages. KeyKOS used the IBM 370 virtual memory architecture to do this, so the objects could be written in whatever language you wanted, with an expected object granularity of about a memory page and maybe 16 "keys" or references to other objects per object. Erlang uses a virtual machine instead. IIRC the Intel 432 was built more or less with this in mind, but there were a bunch of "capability hardware" machines back in the 1960s with the same idea. I think the AS/400 ('scuse me, iSeries) still works this way, and has from the beginning. But you don't have to totally redesign your CPU architecture to support memory protection between objects better; just support a smaller page size and have enough MMU contexts that you don't have to flush a TLB every time you context-switch from one object to another.

- tag-checking instructions as on the SPARC, maybe.
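Here's the allocation fast path in question as a minimal Python sketch (the class and numbers are invented for illustration; a real allocator's fast path compiles down to a few inline machine instructions, not a method call):

    class BumpAllocator:
        """Allocation with a copying collector: bump a pointer, check for overflow."""
        def __init__(self, heap_size):
            self.heap_ptr = 0
            self.heap_end = heap_size

        def alloc(self, n):
            result = self.heap_ptr            # copy the heap pointer
            new_ptr = result + n              # add n
            if new_ptr > self.heap_end:       # the compare-and-branch that a
                self.collect()                # user-level memory trap could replace
                result, new_ptr = self.heap_ptr, self.heap_ptr + n
            self.heap_ptr = new_ptr
            return result

        def collect(self):
            # A real copying collector evacuates live objects to a fresh
            # semispace here; the sketch just resets the heap.
            self.heap_ptr = 0

    heap = BumpAllocator(heap_size=1 << 20)
    print(heap.alloc(12), heap.alloc(12))     # 0 12 -- two cheap allocations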

Mark Lentczner recommends reading David Ungar's dissertation about what you'd want in a CPU to make it run Smalltalk faster.

#9 kragen on 04.23.08 at 11:05 pm

Oh, and I don't know if HLLs are expensive. Ur-Scheme compiles code to run surprisingly fast, like only 5x slower than GCC, despite being totally type-checked and bounds-checked at run-time and not doing anything you could call "optimization". But Scheme is a long way from Python, and 5x is still 4 Moore-years. On the other hand, Chambers's dissertation explained how they got Self within a factor of 2 of C on the SPARC, and I don't think there are any languages more dynamic than Self at the moment. (Well, maybe Bicicleta, sort of.)

#10 Yossi Kreinin on 04.24.08 at 11:37 am

Regarding http://lists.canonical.org/pipermail/kragen-tol/2007-March/000850.html – nice piece of research. I wonder what really happened with all that benchmark business. Likely, Alan Kay cited it casually without giving it much thought, and then we all came along with our nitpicking… The numbers turned out to be waaaay too impressive.

"Ur-Scheme compiles code to run surprisingly fast, like only 5x slower than GCC"

Not on image processing code I'd guess :) Measuring efficiency is damn tricky.

Regarding cellular automaton for image processing: I'd love to see a competitor build one of those. Down would they go.

#11 kragen on 04.24.08 at 9:55 pm

The 5x figure was from (define (fib n) (if (< n 2) 1 (+ (fib (1- n)) (fib (- n 2))))) and its C equivalent, so indeed it might not generalize to more realistic programs. On the other hand, that program consists entirely of integer operations and function calls, and Ur-Scheme is particularly bad at both of those (it has to check and fix up type tags for integer operations, and function calls indirect through a global variable, pass and check the number of arguments every time, and its function prologues and epilogues are horrible crap) so the gap might be less rather than more. (On the other hand, the basic blocks are so short that there's not that much optimizing GCC can do.)

However, since it's an MFTL, it doesn't implement anything that isn't needed to compile itself. And it doesn't use vectors, so it doesn't have vectors. It would clearly need to have those!

Does GCC SSA-vectorize image-processing-type loops yet? If not, it seems like the quality of your library of Matlab-like primitives would matter a lot more than the code generated by your compiler. And I'm going to argue that for image-processing code, code written with Matlab-like or even K-like primitives is "higher level" than the equivalent nest of loops in a language like Python (without Numeric) or C, in the sense that they more closely approximate the terminology and concepts of the problem domain, and contain fewer irrelevant details.

The Cytocomputer was pretty capable; in a pipeline of ten gate arrays, it was able to do on the order of 100 million nearest-neighbor operations (Sobel, dilation, erosion, stuff like that) per second, with a throughput of 10 megapixels per second. Your laptop CPU can probably do a few times more than that now, but it's not implemented in a gate array, and it has 28 more years of Moore's Law behind it. The Cytocomputer was contemporary with the Cray X-MP and the 6510, but I think it was actually faster than the X-MP at the stuff it was good for. (No doubt your GPU can do one or two orders of magnitude more than your CPU at things like that? I haven't really looked into GPGPU programming.)

The best information online I've found about the Cytocomputer is in a patent from 1989, which points at the original Cytocomputer patents: http://www.freepatentsonline.com/4928313.html

I think it got deployed in a bunch of machine vision applications throughout the 1980s, but I'm not clear on whether those were production or experimental.

I don't know if its approach is still valuable, or if the stuff in a modern GPU or something is a Pareto improvement (faster, cheaper, and more flexible, or something). But it seems like, for stuff that you can phrase in terms of nearest-neighbor operations, it should be nearly optimal; and it should be possible to support multiple-image elementwise operations by sticking branches into the linear pipeline.

But maybe I'm suffering from an AVM problem :)

Oh, about method dispatch. I chatted with a buddy of mine who prototyped a BTB implementation at Transmeta. Apparently BTBs speed up C++-style virtual method dispatch quite a lot, so maybe the associative polymorphic cache would only save a few cycles.

#12 kragen on 04.24.08 at 10:24 pm

Um, obviously I meant SSE, not SSA.

#13 snicolai on 05.17.08 at 6:04 pm

I'll take a slightly different angle to the question. What can you leave out of a CPU to improve performance? Have you looked at the Singularity project from MS Research?

http://research.microsoft.com/os/singularity/publications/OSR2007_RethinkingSoftwareStack.pdf gives some details. Section 4.2.1 talks about the "unsafe code tax" imposed by the hardware protection mechanisms in the CPU. They measured it as high as 37%.

I see this just as a continuation of the RISC trend, moving things that were traditionally done in hardware to software. Ultimately what I want from the hardware is memory load/store throughput, enough registers/cache in the processor to hide the latency of that throughput and balanced processing elements to operate on that data.

Current memory management hardware tracks the state of each page (dirty, etc.) and a virtual address for that page. How much would dumping the virtual addressing mechanisms (but keeping the page state mechanisms, which are useful for garbage collection) speed up a CPU? I haven't kept up with modern CPUs, but getting rid of virtual address translation should save a stage or two in the pipeline.

A VM operating on physical addresses would now be able to do memory layout optimizations to take advantage of extra banks of memory in one machine vs another. The virtual addressing layer makes that difficult now.

The VM could still page items in and out, it would just replace references to an item being paged out with a proxy object that read the item back in before accessing the object.

The wonderful advantage we now have with many of the high level languages today is that they are defined in terms of a VM. This gives us a layer under any application written to that VM to change and implement new ways of executing the application transparently to the application. Witness the progress made in speeding up the JVM over the last 10 years. Theoretically, the application doesn't change, but executes faster on newer versions of the JVM.

#14 Yossi Kreinin on 05.18.08 at 10:26 am

Regarding the cost of page table management: I currently work on embedded systems with page table support unavailable or turned off. I wouldn't know the cost of address translation, but it appears to be passable because they cache mappings, and comparisons/negations are fast. That said, I'd rather have instructions for checked array access, and with those plus a VM taking care of type safety, I think you could get better security and safety than the process model gives. I think I'm actually speaking in the spirit of the Singularity project here.

Regarding JVM: it started out slow as hell, despite being relatively low-level, so you could expect improvement. Today, AFAIK it's comparable to C for object grinding, but is hardly an optimal platform for image processing or non-hardware-accelerated computer graphics (I'd guess array boundary checking is the main problem, and maybe object inlining; C# has structures to avoid the latter, and unsafe code to avoid the former).

#15 Yossi Kreinin on 05.18.08 at 10:28 am

Also regarding JVM efficiency and programs executing on different versions of the VM out there – a post by John Carmack:

http://www.armadilloaerospace.com/n.x/johnc/recent%20updates/archive?news_id=295

#16 Yossi Kreinin on 05.18.08 at 10:39 am

To kragen:

Regarding vectorization – I don't believe in automatic vectorization by compilers, nor do I believe in vectorized operations built into a programming language. I believe in hand-coding with intrinsics. I believe it to yield at least 5x more efficient code, on average. No figures to back that up, of course, but lots of confidence. So vectorization is out of the scope.

Now, with vectorization being out of the scope, a language with unchecked array access and support for small data types will beat a language without those by a large margin. However, I think you can make Lisp support small data types in memory, and you could make its array accesses unsafe. Now, if CPUs supported commands for checked array access, you have optimized, safe, portable Lisp.

Regarding machines executing gazillions of operations on neighbors: they usually bite the dust when you get beyond shamelessly parallel image processing. I'd rather play with a multi-core DSP which can handle data-parallel scenarios, not just task-parallel ones.

#17 davidmathers on 06.02.08 at 10:29 pm

I don't know enough to know how relevant this is but…

Design of a LISP-Based Microprocessor by Guy Steele and Gerald Sussman

LISP differs from traditional machine languages in that the program/data storage is conceptually an unordered set of linked record structures of various sizes, rather than an ordered, indexable vector of integers or bit fields of fixed size. An instruction set can be designed for programs expressed as trees of record structures. A processor can interpret these program trees in a recursive fashion and provide automatic storage management for the record structures.

We discuss a small-scale prototype VLSI microprocessor which has been designed and fabricated, containing a sufficiently complete instruction interpreter to execute small programs and a rudimentary storage allocator.

http://delivery.acm.org/10.1145/360000/359031/p628-steele.pdf?key1=359031&key2=3630878911&coll=ACM&dl=ACM&CFID=15151515&CFTOKEN=6184618

#18 Yossi Kreinin on 06.04.08 at 12:36 pm

I didn't read the whole thing, and I might at some point. However, it seems that it's a basic implementation of the core Lisp evaluator, using type tags for dispatching. Tagged memory has efficiency problems if you use it straightforwardly, for example, you can't compactly represent byte arrays. Of course there could be ways to save tags, I just don't see what a complete result would look like. I'll delve into the paper some more.

BTW, I wish we had something like Lisp machines rather than the buggy insecure desktop hardware/software towers of today. In this posting, I'm explicitly approaching the problem from a completely anal-retentive perspective I obtained in the world of embedded apps; it's hardly the right way to look at other things.

#19 London Dry Gin on 01.28.09 at 7:27 pm

> Regular expression and string functions in hardware

SSE4.2 introduced some string functions in hardware, http://en.wikipedia.org/wiki/SSE4#SSE4.2

#20 Evan on 05.24.12 at 9:59 am

PSA:

The David Moon conversation has moved to this link since this thread was posted:

http://development.azuldev.com/blog/cliff/2008-11-18-brief-conversation-david-moon
