"High-level CPU": follow-up

This is a follow-up on the previous entry, the "high-level CPU" challenge. I'll try to summarize the replies and my opinion on the various proposals. But first, a summary of my original points:

  1. "Very" high-level languages have a cost. Attributing this cost to the underlying hardware architecture is wrong. You could move the cost from software to hardware, but that wouldn't eliminate it. I primarily referred to languages characterized by indirection levels and late binding of user-defined operations, such as Lisp and Python, and to a lesser extent/confidence to side-effect-free languages like Haskell. I didn't mean to say that high-level languages should not be used, in fact I think that their cost is wildly overestimated by many. However, denying the existence of any intrinsic cost guarantees that people will keep overestimating it, because if it weren't that high a cost, why would you lie to them? I mean it very seriously; horrible tech marketing is responsible for the death (or coma) of many great things.
  2. Of all systems with similar cost and features, the one that has the least stuff implemented in hardware is the best, because you can change more things. The idea that moving things to hardware is a sure way to make them efficient is a misconception. Hardware can't do "anything in one cycle"; there are many constraints involved. Therefore, it's better to let the software explicitly control a set of low-level components than build hardware logic implementing high-level interfaces to them. For example, to add 2 numbers on a RISC machine, you load them to registers, then add. You could have a command adding operands from memory; it wouldn't run faster, because the hardware would have to spend cycles on loading operands to (implicit) registers. Hardware doesn't have to be a RISC machine, but it's always better to move as much control to software as possible under the given system cost constraints.
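To make point 1 concrete, here's a minimal sketch, in Python and entirely made up for illustration (no real interpreter works exactly like this), of what "indirection levels and late binding" mean for something as simple as adding two values in a dynamic language: every operand lives behind a pointer with a type tag, and the operation itself is looked up at run time.

    # A toy model (not any real interpreter) of dynamic addition.
    class Boxed:
        """A heap-allocated value carrying a run-time type tag."""
        def __init__(self, type_name, payload):
            self.type_name = type_name      # the tag consulted on every operation
            self.payload = payload          # the actual bits

    # Per-type operation table, bound at run time; user code may replace entries.
    DISPATCH = {
        ('int', 'add'): lambda a, b: Boxed('int', a.payload + b.payload),
        ('str', 'add'): lambda a, b: Boxed('str', a.payload + b.payload),
    }

    def dynamic_add(a, b):
        # 1. load each operand's tag (extra memory accesses),
        # 2. look up the operation (late binding),
        # 3. allocate a fresh box for the result.
        if a.type_name != b.type_name:
            raise TypeError("mismatched tags")
        op = DISPATCH[(a.type_name, 'add')]
        return op(a, b)

    print(dynamic_add(Boxed('int', 2), Boxed('int', 3)).payload)   # 5

The statically-typed equivalent of that last line is a single machine add: no tag loads, no table lookup, no allocation. That gap is the cost being discussed; it can be shrunk by clever compilers or clever hardware, but it has to be paid somewhere.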

I basically asked people to refute point 1 ("HLLs are costly"). What follows describes the attempts people made at it.

Computers you can't program

Several readers managed to ignore my references to specific high-level languages and used the opportunity to pimp hardware architectures that can't run those languages. Or any other programming languages designed for human beings, for that matter. Example architectures: cellular automata and neural networks.

It is my opinion that the fans of this family of hardware/vaporware, consistent advocates of The New Age of Computing, have serious AI problems. Here's a sample quote on cellular automata: "I guess they really are like us." Well, if you want to build a computing device in order to have a relationship with it, maybe a cellular automaton will do the trick. Although I'd recommend first checking the fine selection of Homo Sapiens we have here on Planet Earth. Because those come with lots of features you'd like in a friend, a foe, a spouse or an employee already built-in, while computer hardware has a certain gap to fill in this department.

Me, I want to build machines to do stuff that someone "like us" wouldn't want to do, for any of several reasons (the job is hard/boring/stinky/whatever). And once I've built them, I want people to be able to use them. Please note this last point. People and other "nature's computers", like animals and fungi, aren't supposed to be "used". In fact, all those systems spend a huge amount of resources to avoid being used. Machines aren't supposed to be like that. Machines are supposed to do what you want. Which means that both the designer and the user need to control them. Now, a computer that can't even be tricked into parsing HTML in a straightforward way doesn't look like it's built to be controlled, does it?

Let me supply you with an example: Prolog. Prolog is an order of magnitude more tame than a neural net (and two orders of magnitude compared to a cellular automaton) when it comes to "control" – you can implement HTML parsing with it. But Prolog does show alarming signs of independence – it spends most of its time in its inference engine, an elaborate mechanism running lengthy non-trivial loops, which sometimes turn out to be infinite. You aren't supposed to single-step those loops; you're supposed to specify truths about your world, and Prolog will derive more truths for you. Prolog was supposed to be the wave of the future about 25 years ago. I think it can be safely called dead by now, despite the fair amount of money poured into it. I think it died because it's extremely frustrating to use – you just can't tell why the hell it worked that way in each particular case. I've never seen anything remotely as annoying as Prolog, with the notable exception of Makefiles, running on top of a wonderful inference engine of their own.
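To make the "elaborate mechanism running lengthy loops" bit concrete, here's a toy sketch in Python – nowhere near a real Prolog engine, which unifies terms containing variables; the rules and query below are made up – showing how a depth-first proof search over declared rules turns into exactly the kind of loop you can't meaningfully single-step, and how one left-recursive rule makes it infinite:

    # Propositional Horn clauses: head -> list of alternative bodies (conjunctions).
    RULES = {
        'path_a_c': [['edge_a_b', 'path_b_c']],
        'path_b_c': [['path_b_c', 'edge_c_c'],   # left-recursive alternative
                     ['edge_b_c']],
        'edge_a_b': [[]],                        # facts have empty bodies
        'edge_b_c': [[]],
    }

    def prove(goal, depth=0, max_depth=25):
        """Depth-first, left-to-right proof search, the way Prolog resolves goals."""
        if depth > max_depth:
            raise RecursionError("runaway inference")
        for body in RULES.get(goal, []):
            if all(prove(g, depth + 1, max_depth) for g in body):
                return True
        return False

    try:
        print(prove('path_a_c'))
    except RecursionError:
        # The first alternative for path_b_c calls path_b_c before making any
        # progress, so the search never terminates -- and figuring out why it
        # "worked that way" is left to the rule author.
        print("query diverged")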

My current opinion is that neural networks rarely deserve a special hardware implementation – if you need them, build a more traditional computer and run them on top of that; and cellular automata are just stillborn. I might be wrong in the sense that a hardware implementation of these models is the optimal solution for some problem, hence we'll see those beasts in some corner of a successful real-world system. But the vast majority of computing, including AI apps, will run on machines that support basic bread-and-butter programmer things simply and straightforwardly. Here's a Computing Technology Acceptance Lower Bound for ya: if you can't parse a frigging log file with it, you can't do anything with it.

Self-assembly computers

Our next contestant is a machine that you surely can program, once you've built it from the pieces which came in the box. Some people mentioned "FPGA", others failed to call it by its name (one comment mentioned a "giant hypercube of gates", for example). In this part, I'm talking about the suggestions to use an FPGA without further advice on exactly how it should be used; that is, FPGA as the architecture, not FPGA used to prototype an architecture.

Maybe people think that with an FPGA, "everything is possible", so in particular, you could easily build a processor efficiently implementing a HLL. Well, FPGA is just a way to implement hardware allowing you to trade NRE (non-recurring engineering cost) for unit cost. And with hardware, some things are possible and some aren't, or so I claim – for example, you can't magically make the cost of HLLs go away. If you can't think of a way to reduce the overhead HLLs impose on the system cost, citing FPGA doesn't make your argument look any better. On the contrary – you've saved NRE, but you've raised the cost of the hardware by a factor of 5.

Another angle: can you build a compiler? Probably so. Would you like to start your project with building a compiler? Probably not. Now, what makes people think that they want to build hardware themselves? I really don't know. Building hardware is gnarly, FPGA or not – there are lots of constraints you have to think about to make the thing efficient, and it's extremely easy to err on the side of not having enough flexibility. The latter typically happens because you try to implement overly high-level interfaces; it then turns out that you need the same low-level components to do something slightly different.

And changing hardware isn't quite as easy as changing software, even with FPGA, because hardware description code, with its massive parallelism and underlying synthesis constraints, is fairly tricky. FPGA is a perfectly legitimate platform for hardware vendors, but an awful interface for application programmers. If you deliver FPGAs, make it your implementation detail; giving it to application programmers isn't very likely to make them happy in the long run.

At the other end of the spectrum, there's the kind of "self-assembly computer" that reassembles itself automatically, "adapting to the user's needs". Even if it made any sense (and it doesn't), it still wouldn't answer the question: how should this magical hardware adapt to handle HLLs, for example, indirect memory access?

Actual computers designed to run HLLs

Some people mentioned actual hardware which was built to run HLLs, including Reduceron, Tcl on Board, Lisp Machines, Rekursiv, and ARM's Jazelle instruction set. For some reason, nobody mentioned Intel's 432, an object-oriented microprocessor which was supposed to replace x86, but was, among other things, too slow. This illustrates that the existence of a "high-level processor" doesn't mean that it was a good idea (of course it doesn't mean the opposite, either).

I'll now talk about these machines in increasing order of my confidence that the architecture doesn't remove the overhead posed by the HLL it's supposed to run.

  • Reduceron is designed to run Haskell, and focuses on an optimization problem I wasn't even aware of, that of graph reduction (a toy illustration of graph reduction appears right after this list). One of the primary ideas seems to be that graph reduction doesn't suffer from dependency problems which could inhibit parallelization, but still can't be parallelized on stock CPUs. That's because a lot of memory access is involved, and there's typically little load/store bandwidth available to a CPU compared to its data processing capability. Well, I agree with this completely in the sense that memory access is the number one area where custom hardware design can help; more on that later. However, I'm not sure that the right way to go about it is to build a "Haskell Machine"; building a lower-level processor with lots of bandwidth available to it could be better. Then again, it could be worse, and my confidence level in this area is extremely low, which is why I list the Reduceron before the others: I think I'll look into this whole business some more. Pure functional languages are a weak spot of mine; for now, I can only say three things for sure: (1) side effects are a huge source of bugs, (2) although they get in the way of optimizers, side effects are a poor man's number one source of optimizations, so living without them isn't easy, and (3) the Reduceron is a pretty cool project.
  • Tcl on Board was built to run a Tcl dialect. Tcl doesn't pose the optimization problems that languages like Lisp or Python do – it's largely a procedural language grinding flat objects. And there's another thing I ought to tell you: I don't like Tcl. However, I think that this Tcl chip is kind of insightful, because it's designed for low-end applications. And the single biggest win of having a "high-level" instruction set is to save space on program encoding. Several people mentioned it as a big deal; I don't think of it as a big deal, because instruction caches always worked great for me (~90% hits without any particular optimizations). However, for really small systems of the low-end embedded kind, program encoding is a real issue. I'm not saying that Tcl on Board is a good (or a bad) idea by itself; I know nothing about these things. I'm just saying that while I think high-level hardware will fail to deliver speed gains, it might give you space gains, so it may be the way to go for really small systems which aren't supposed to scale. Not that I know much about those systems, except that if I had to build one, I'd seriously consider Forth…
  • Lisp Machines ran Lisp, and Rekursiv ran LINGO, which apparently was somewhat similar to Smalltalk. This I know. What I don't know is how the hardware support for the high-level features would eliminate the cost overhead of the HLLs involved; that's because I don't know the architecture, and nobody gave much detail. I don't see a way to solve the fundamental problems. I mean, if I want to support arrays of bytes, then each byte must be tagged, mustn't it? And if I only support fixnums larger than bytes, then I'd waste space, right? And just what could the LispM do about the hairy binding done by CLOS behind the scenes? Again, this doesn't mean these machines weren't a good idea; in fact I wish my desktop hardware were more expensive and more secure, and tagged architectures could help. All I'm saying is that it would be more expensive. I think. I'd like to hear more about LispM, simply because most people who used it seem to be very fond of it – I know just one exception.
  • Jazelle is supposed to run Java. Java is significantly lower-level than Lisp or Smalltalk. It still is a beautiful example, because the hardware support in this case yields little performance benefit. In fact MIPS reported that a software implementation of the JVM running on a MIPS core outperformed a JVM using Jazelle by a factor of about 2. I've never seen a refutation of that.
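Since graph reduction came up in the Reduceron item above, here's a minimal sketch of the idea in Python – plain SKI combinators, no sharing of reduced subgraphs and no supercombinators, so it's a caricature of what the Reduceron actually does – that at least shows why the workload is all about chasing and rewriting heap nodes rather than crunching numbers:

    class App:
        """An application node in the program graph: (fun arg)."""
        def __init__(self, fun, arg):
            self.fun, self.arg = fun, arg

    S, K, I = 'S', 'K', 'I'

    def apply_all(f, args):
        for a in args:
            f = App(f, a)
        return f

    def whnf(node):
        """Reduce a graph to weak head normal form by rewriting the spine."""
        while isinstance(node, App):
            # Unwind the spine: walk down the .fun links, collecting App nodes.
            spine, head = [], node
            while isinstance(head, App):
                spine.append(head)
                head = head.fun
            args = [a.arg for a in reversed(spine)]   # innermost argument first
            if head == I and len(args) >= 1:          # I x     -> x
                node = apply_all(args[0], args[1:])
            elif head == K and len(args) >= 2:        # K x y   -> x
                node = apply_all(args[0], args[2:])
            elif head == S and len(args) >= 3:        # S x y z -> x z (y z)
                x, y, z = args[:3]
                node = apply_all(App(App(x, z), App(y, z)), args[3:])
            else:
                return node                           # no redex at the head
        return node

    # S K K behaves as the identity function: (S K K) 'x' reduces to 'x'.
    print(whnf(App(App(App(S, K), K), 'x')))          # x

Every step above is a couple of pointer loads, a tag check and a few stores, which is why a machine with wide access to its heap can hope to win here, and why raw ALU throughput doesn't help much.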

Stock computers with bells and whistles

Finally, there was a bunch of suggestions to add specific features to traditional processors.

  • Content-addressable memory is supposed to speed up associative array look-ups. There's a well-known aphorism by Alan Perlis – "A language that doesn't affect the way you think about programming is not worth knowing". Here's my attempt at an aphorism: "A processor that doesn't affect the way you access memory is not worth building". This makes the wide variety of tools designed to help you build a SIMD VLIW machine with your own data processing instructions uninteresting to me, and on the other hand, makes CAM quite appealing. I came to believe that your biggest problem isn't processing the data, it's fetching the data. I might talk about it some time; the Reduceron, essentially designed to solve a memory access problem preventing the optimization of a "perfectly parallelizable" algorithm, is one example of this. However, CAM goes way beyond providing more bandwidth or helping with the addressing – it adds comparison logic to each memory word (a toy model of this appears right after this list). While it sounds impractical to replace all of your RAM with CAM, stashing a CAM array somewhere inside your system could help with some problems. Then again, it won't necessarily pay off – it depends on the exact details of what you're doing. All I can say at this point is that it's a Worthy Idea, which, for some reason, I keep forgetting about, and I shouldn't.
  • GC/reference counting optimizations. Maybe I'm wildly wrong, but I don't think the garbage is a big deal, 'cause how much time do you spend on garbage collection compared to plain malloc/free? The way I see it, the problem isn't so much with the overhead of garbage collection as it is with the amount of small objects allocated by the system and, most importantly, the amount of indirect memory accesses. I learned that some Lisp compilers can do object inlining with varying amounts of user intervention; well, when it works out, it removes the need for special hardware support. The thing is, I think the main battle here is to flatten objects, not to efficiently get rid of them (a tiny illustration of flattening appears after this list). And I think that it's quite clearly software that should fight that battle.
  • Regular expression and string functions in hardware: I don't think it's worth the trouble, because how much time do you spend in regex matching anyway? Maybe it's because I don't process massive volumes of text, but when I do process the moderate amounts of text I bump into, there's the part where you store your findings in data structures, and I think it might be the bottleneck. And then a huge amount of data comes from places like RDBMSes where you don't have to parse much. You'd end up with idle silicon, quietly leaking power.
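On CAM, here's a rough functional model in Python – names and sizes are made up, and sequential Python stands in for what would be a single-cycle parallel compare in silicon – just to pin down the interface being discussed: present a key, every stored word compares itself against it at once, and the matching entry comes back without any hashing or probing loop.

    class CAM:
        """A content-addressable memory: lookup by value, not by address."""
        def __init__(self, size):
            self.keys = [None] * size     # one stored key per word
            self.values = [None] * size   # payload RAM selected by the match line

        def write(self, line, key, value):
            self.keys[line] = key
            self.values[line] = value

        def lookup(self, key):
            # In hardware, every word runs this comparison in the same cycle.
            for line, stored in enumerate(self.keys):
                if stored == key:
                    return self.values[line]
            return None                   # no match line asserted

    cam = CAM(size=8)
    cam.write(0, key="errors", value=42)
    cam.write(1, key="warnings", value=7)
    print(cam.lookup("warnings"))         # 7, from the "compare everywhere" step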
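And on flattening objects, here's a tiny Python illustration (the Point class and the numbers are made up) of the difference between a heap of small objects reached through pointers and the same data "inlined" into one flat buffer – the latter is what a good compiler, rather than special GC hardware, buys you:

    import array

    class Point:                  # boxed representation: one small allocation per point
        def __init__(self, x, y):
            self.x, self.y = x, y

    boxed = [Point(1, 2), Point(3, 4), Point(5, 6)]
    sum_boxed = sum(p.x for p in boxed)              # a pointer chase per element

    # Flattened representation: x0, y0, x1, y1, ... in one contiguous block.
    flat = array.array('i', [1, 2, 3, 4, 5, 6])
    sum_flat = sum(flat[i] for i in range(0, len(flat), 2))

    assert sum_boxed == sum_flat == 9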

The good stuff

At the bottom line, there were two hardware-related things which captured my intoxicated imagination: the Reduceron and content-addressable memories. If anything ever materializes around this, I'll send out some samples. In the meanwhile – thanks!

20 comments

#1 uaf1989 on 02.03.08 at 11:36 am

Please allow me to present a subclass of problems that would be amenable to this approach. If you had a language that could compile a purely logical design, say a logic table for traffic-light controllers, it could produce code that would be able to configure a logic gate array in the most efficient manner, à la Karnaugh mapping. Then the gate array would be able to execute the logical functions in the most efficient way possible. Scale this up for encryption, perhaps. I have often thought it would be nice to have a programmable array that could be referenced in software after it had been preprogrammed to perform some complex operation.

#2 uaf1989 on 02.03.08 at 11:39 am

I would add that the gate array itself could consist entirely of NAND gates.

#3 bjorn.lindberg on 02.03.08 at 5:15 pm

One application that actually uses specialized hardware successfully is graphics cards – typically drawing a lot of polygons really fast. I suppose you have to find some key function that is used a lot before you could justify some special hardware-implemented instruction.

#4 macoovacany on 02.03.08 at 8:39 pm

scheme-79 chip:
ftp://publications.ai.mit.edu/ai-publications/pdf/AIM-559.pdf

LISP Machine :
ftp://publications.ai.mit.edu/ai-publications/500-999/AIM-514.ps

[ps|pdf].

Timbo

#5 hotmichel on 02.04.08 at 3:19 am

It is possible to make a Java CPU which greatly outperforms traditional CPUs (Intel, AMD, …). Look at http://www.azulsystems.com/ – they have been marketing Java CPUs for some time now. Their Vega 2 processor seems to be the first 48-core chip designed and optimized for Java, and achieves much greater performance than general-purpose CPUs.

#6 Yossi Kreinin on 02.04.08 at 1:49 pm

Vega 2: (1) Java is fairly low-level compared to the languages I cited, (2) Vega 2 is supposed to run a gazillion threads; when you have trivially parallelizable workloads (say, in servers), life is good.

Programmable gate arrays: actual FPGAs are actually more flexible than that, they're just gnarly to program.

Graphics cards: I know close to nothing about those; I was shocked to hear that Michael Abrash basically plans to make them obsolete: http://www.radgametools.com/pixomain.htm

The thing is, the less software-configurable a piece of hardware is, the less chances it has to survive, because people want features, and ought to be able to tweak things.

#7 yahoolian on 02.05.08 at 3:56 pm

Side effects limit optimization. If the code must be run in a certain order, the optimizer cannot reorder it to make it faster. Does C have a cost compared to assembly? I doubt humans could consistently outperform a C compiler by coding in assembly. There are too many things to keep track of. There is no intrinsic cost to using C vs assembly, and there is no intrinsic cost to using Haskell vs C. It just depends on which implementation has better optimizations implemented.

By the way, RISC uses more memory bandwidth for the instructions than x86. This results in slower performance, as memory is usually the bottleneck. The Reduceron is faster because it can access memory blocks and compute in parallel, instead of being forced to run sequentially on a von Neumann architecture. A von Neumann processor runs sequentially. By the way, all processors are solving optimization problems. Yes, you could go lower level than Haskell and compile directly to FPGA to get parallel access to memory, but that is much more work, and you'd have to reconfigure the FPGA every time you want to run a different program, which would be quite slow.

Here's an interesting chip, based on the Parallel Random Access Machine (PRAM) model.
http://www.umiacs.umd.edu/~vishkin/XMT/index.shtml

And with respect to memory management, a compiler can automatically detect where memory is deallocated and allocated. A deallocation followed by an allocation can simply reuse the memory. This compile-time garbage collection is implemented in Mercury. http://www.cs.mu.oz.au/research/mercury/information/papers.html#iclp2001_ctgc

Pixomatic only gives DX7 features, and doesn't come anywhere close to current GPU performance. The goal of Pixomatic is not to make GPUs obsolete.

#8 kragen on 04.23.08 at 11:00 pm

The old machine Kay was talking about as being fast for Smalltalk was a Dorado, not a B5000; the B5000 came out something like 15 years earlier, as you know but maybe not all your readers do. The Dorado's approach was the microcode thingy that lost out to RISC.

As far as I can tell, Kay is basically wrong about the Moore's Law thing; I wrote about it in this kragen-tol post: http://lists.canonical.org/pipermail/kragen-tol/2007-March/000850.html

In theory that tells you what benchmarks Kay was probably thinking about.

Basically he was wrong, first, because the Dorado was built out of 3000 extremely expensive, high-performance ECL chips. The appropriate comparison is not to a laptop but to a Cray; that accounts for two of the three "missing" orders of magnitude in performance.

Second, he was wrong because he's talking about Squeak's performance, and Squeak is a bytecode interpreter. If you compile your bytecodes down to machine code with some PICs, you get back the other "missing" order of magnitude.

So here are the approaches that have been suggested either in Yossi's original post or in the comments:
- data word tagging: Yosef doesn't like this because it doesn't work for tiny data (say, 8 bits).
- FPGAs for specialized coprocessors. I think this is a fantastic idea, especially for things like image-processing, and maybe you can use it for things like Reduceron-style combinator-graph reduction too. The Reduceron is a perfect example of how FPGAs do more than allow you to trade NRE for unit cost: you have to reprogram the gate array to make it run a different Haskell program, because that Haskell program is embodied in the set of supercombinators it knows how to reduce.
- CAMs for hash tables. I don't know if that will work.
- reference counting hardware. Maybe a good idea, I don't know.
- lightweight parallelism. Good idea, but more of a design goal than an architectural design feature.
- integer-only cores (maybe "core diversity" is a better term?) probably will result in better overall efficiency, but might be harder to program. Seems like a step backwards if what you want is to reduce the penalty to run high-level languages; what you really need to do is come up with a software system that would hide this complexity from the programmer.
- private per-core memory. Same comments as integer-only cores.
- cellular automata. For what it's worth, ERIM's Cytocomputer — a pipelined raster-scanning cellular-automaton machine on custom silicon — was pretty damn fast for certain kinds of image processing. (You used the CA matrix to perform bulk nearest-neighbor operations SIMD-style on, in effect, the whole image at once.) But, again, not helpful for high-level languages. I don't think anyone ever used a Cytocomputer or a CAM-8 to parse a frigging log file.
- combinator-graph reduction machines. One of the first ones of these was the SKIM ("S-K-I-Machine") back in I think the 1970s; the Reduceron is a supercombinator-based modern equivalent.
- zeroing newly allocated cache lines without reading from DRAM. Modern CPUs can do this already.
- the Reduceron. Yup, awesome.
- the Tera MTA, which was the 128-register-set machine.

I'm going to tell the story of what I think happened to the Tera. Contrary to another commenter's assertion, the problem with the MTA was not that software wasn't available to take advantage of the machine; Tera ported LS-DYNA3D to it, supported all the usual HPC stuff in C and Fortran, and got some really impressive performance numbers IIRC. The problem seemed to be that they were competing on performance with Intel, the Digital Alpha team, and the StrongARM team, all of whom had enormous market volumes and could afford to spend orders of magnitude more money on their CPU designs. I think they only ever shipped two or three generations of their hardware over the course of ten years or so; the last one was a CMOS design actually designed by engineers at Cadence. Fortunately they got enough money that they were able to buy Cray, took the Cray name, and eventually de-emphasized the MTA and even the old Cray vector line in favor of huge clusters of commodity microprocessors.

There's an important lesson here for would-be higher-performance CPUs. It's not enough that the CPU be a lot faster than the other CPUs at the time that you get the idea to build it; it also needs to be a lot faster than the other CPUs when it actually ships. The Itanic and Symbolics can also tell this sad tale.

Yossi also said, about Jamie Zawinski's assertion that Lisp can be fast on stock hardware:

> If you use Lisp in the Lispy way that makes it so attractive in the first place, how on Earth can you optimize out the dynamic type checks and binding? You’d have to solve undecidable problems to make sense of the data flow.

The answer is that there are decidable approximations that give you good speedups. Olin Shivers's dissertation was a starting point for a lot of this research, but it has continued since then. Also, you can take the specializing-JIT approach Java has taken, although I don't think anyone has.

Remember that Jamie got his start at Lucid, whose claim to fame was precisely that they proved you didn't need specialized hardware to run Lisp fast.

I don't really know what the Lisp Machines had in order to help them run Lisp fast. I know they tagged each word, and had tag-checking instructions that checked the tags in parallel with doing the most likely operation, and I'm not confident that you can actually do that in software even on a modern out-of-order CPU.

A couple of times the "tag every byte" thing has come up. In my experience, whenever I'm dealing with large arrays of small numbers (1, 8, 16, or 32 bits) in a high-level language, it's a lot easier to put all of those numbers into a single tagged blob object, rather than putting tagged versions of all those numbers into a potentially-heterogeneous vector. Then you have SIMD instructions that add, or multiply, or subtract, corresponding members, or whatever. This makes interpretation overhead pretty much irrelevant. I've done real-time audio and video synthesis in Python this way, on my 700MHz laptop. I've done OLAP in Python this way. It's a very popular approach — it's how Matlab works — and it dates back to APL360, so it's not new either.
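Concretely, with NumPy (the modern descendant of the Numeric library mentioned below; the sizes and values in this snippet are arbitrary), one tagged blob holds a million untagged bytes and a single interpreted call does all the element-wise work:

    import numpy as np

    a = np.zeros(1_000_000, dtype=np.uint8)   # one boxed object wrapping a million raw bytes
    a[:5] = [1, 2, 3, 4, 5]
    b = a + 3                                 # one dynamic dispatch, then a compiled loop
    print(b[:5])                              # [4 5 6 7 8]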

So if you had a computer with all kinds of crazy architectural features to execute high-level languages quickly, you probably still wouldn't want to use them most of the time to grovel over images pixel by pixel. You probably want to use something like Numeric, or Matlab, or Sisal, or APL, or J, or A+, or Lush, or the Perl Data Language, or PV-WAVE IDL. The speed of your high-level-language code is basically irrelevant here.

Anyway, I'm no expert. I've never even built a CPU of my own, let alone designed one, let alone designed a cutting-edge world-beating fast CPU, and I've only ever written one compiler for a high-level language, and it was a toy. (It's called Ur-Scheme; it compiles itself.) But here are some approaches that I think show some promise:

- a conditional call instruction, or a "previous PC" register that always points to the previous instruction executed, or a "PC before last jump", or something. It would be really great if handling an unexpected operand type could be just a type test followed by a conditional jump, rather than conditionally jumping around a call instruction so that you know where the type fault happened. (This is useful both for safety and for dynamic dispatch, e.g. PICs.)

- stack computers. If you're serious about packing more cores into less area, well, they're much smaller in silicon area than register machines, and often with less function-call overhead. IntellaSys's new SeaForth chip sounds really interesting, although limited in memory, and I imagine you could synthesize something similar that took only 10x as much silicon, and then it wouldn't take you ten years to ship a working chip.

- associative polymorphic caches. Modern CPUs have branch target buffers, or BTBs, which speed up register-indirect jumps and calls by caching the most likely destination of the jump. This is great for cutting down the interpretation penalty of a bytecode interpreter (as long as you put an indirect jump at the end of each bytecode instead of jumping to a central dispatcher) but I have the impression that they still leave a bunch of pipeline stalls in late-bound method calls. Which is why the JIT and similar machines implement, essentially, a BTB in software in the form of a PIC.

- user-level memory fault handling. It's really fast to allocate, say, 12 bytes of memory if you have a copying collector: you just copy your heap pointer register to another register and add 12 to the heap pointer. Except that then you have to do a compare and conditional jump to check for heap overflow, which is a lot more expensive. You can use virtual-memory hardware to trap your heap overflow cheaply, but that typically involves context-switching into the kernel and back. There's no reason it has to. (A sketch of this fast path follows the list.)

- memory protection between objects, as in Erlang and KeyKOS. The idea is that you divide state into "domains" or "processes" that share nothing and communicate only by sending each other messages, each with its own event loop to handle those messages. KeyKOS used the IBM 370 virtual memory architecture to do this, so the objects could be written in whatever language you wanted, with an expected object granularity of about a memory page and maybe 16 "keys" or references to other objects per object. Erlang uses a virtual machine instead. IIRC the Intel 432 was built more or less with this in mind, but there were a bunch of "capability hardware" machines back in the 1960s with the same idea. I think the AS/400 ('scuse me, iSeries) still works this way, and has from the beginning. But you don't have to totally redesign your CPU architecture to support memory protection between objects better; just support a smaller page size and have enough MMU contexts that you don't have to flush a TLB every time you context-switch from one object to another.

- tag-checking instructions as on the SPARC, maybe.
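Here's the allocation fast path in question as a minimal Python sketch (the class and numbers are invented for illustration; a real allocator's fast path compiles down to a few inline machine instructions, not a method call):

    class BumpAllocator:
        """Allocation with a copying collector: bump a pointer, check for overflow."""
        def __init__(self, heap_size):
            self.heap_ptr = 0
            self.heap_end = heap_size

        def alloc(self, n):
            result = self.heap_ptr            # copy the heap pointer
            new_ptr = result + n              # add n
            if new_ptr > self.heap_end:       # the compare-and-branch that a
                self.collect()                # user-level memory trap could replace
                result, new_ptr = self.heap_ptr, self.heap_ptr + n
            self.heap_ptr = new_ptr
            return result

        def collect(self):
            # A real copying collector evacuates live objects to a fresh
            # semispace here; the sketch just resets the heap.
            self.heap_ptr = 0

    heap = BumpAllocator(heap_size=1 << 20)
    print(heap.alloc(12), heap.alloc(12))     # 0 12 -- two cheap allocations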

Mark Lentczner recommends reading David Ungar's dissertation about what you'd want in a CPU to make it run Smalltalk faster.

#9 kragen on 04.23.08 at 11:05 pm

Oh, and I don't know if HLLs are expensive. Ur-Scheme compiles code to run surprisingly fast, like only 5x slower than GCC, despite being totally type-checked and bounds-checked at run-time and not doing anything you could call "optimization". But Scheme is a long way from Python, and 5x is still 4 Moore-years. On the other hand, Chambers's dissertation explained how they got Self within a factor of 2 of C on the SPARC, and I don't think there are any languages more dynamic than Self at the moment. (Well, maybe Bicicleta, sort of.)

#10 Yossi Kreinin on 04.24.08 at 11:37 am

Regarding http://lists.canonical.org/pipermail/kragen-tol/2007-March/000850.html – nice piece of research. I wonder what really happened with all that benchmark business. Likely, Alan Kay cited it casually without giving it much thought, and then we all came along with our nitpicking… The numbers turned out to be waaaay too impressive.

"Ur-Scheme compiles code to run surprisingly fast, like only 5x slower than GCC"

Not on image processing code I'd guess :) Measuring efficiency is damn tricky.

Regarding cellular automaton for image processing: I'd love to see a competitor build one of those. Down would they go.

#11 kragen on 04.24.08 at 9:55 pm

The 5x figure was from (define (fib n) (if (< n 2) 1 (+ (fib (1- n)) (fib (- n 2))))) and its C equivalent, so indeed it might not generalize to more realistic programs. On the other hand, that program consists entirely of integer operations and function calls, and Ur-Scheme is particularly bad at both of those (it has to check and fix up type tags for integer operations, and function calls indirect through a global variable, pass and check the number of arguments every time, and its function prologues and epilogues are horrible crap) so the gap might be less rather than more. (On the other hand, the basic blocks are so short that there's not that much optimizing GCC can do.)

However, since it's an MFTL, it doesn't implement anything that isn't needed to compile itself. And it doesn't use vectors, so it doesn't have vectors. It would clearly need to have those!

Does GCC SSA-vectorize image-processing-type loops yet? If not, it seems like the quality of your library of Matlab-like primitives would matter a lot more than the code generated by your compiler. And I'm going to argue that for image-processing code, code written with Matlab-like or even K-like primitives is "higher level" than the equivalent nest of loops in a language like Python (without Numeric) or C, in the sense that they more closely approximate the terminology and concepts of the problem domain, and contain fewer irrelevant details.

The Cytocomputer was pretty capable; in a pipeline of ten gate arrays, it was able to do on the order of 100 million nearest-neighbor operations (Sobel, dilation, erosion, stuff like that) per second, with a throughput of 10 megapixels per second. Your laptop CPU can probably do a few times more than that now, but it's not implemented in a gate array, and it has 28 more years of Moore's Law behind it. The Cytocomputer was contemporary with the Cray X-MP and the 6510, but I think it was actually faster than the X-MP at the stuff it was good for. (No doubt your GPU can do one or two orders of magnitude more than your CPU at things like that? I haven't really looked into GPGPU programming.)

The best information online I've found about the Cytocomputer is in a patent from 1989, which points at the original Cytocomputer patents: http://www.freepatentsonline.com/4928313.html

I think it got deployed in a bunch of machine vision applications throughout the 1980s, but I'm not clear on whether those were production or experimental.

I don't know if its approach is still valuable, or if the stuff in a modern GPU or something is a Pareto improvement (faster, cheaper, and more flexible, or something). But it seems like, for stuff that you can phrase in terms of nearest-neighbor operations, it should be nearly optimal; and it should be possible to support multiple-image elementwise operations by sticking branches into the linear pipeline.

But maybe I'm suffering from an AVM problem :)

Oh, about method dispatch. I chatted with a buddy of mine who prototyped a BTB implementation at Transmeta. Apparently BTBs speed up C++-style virtual method dispatch quite a lot, so maybe the associative polymorphic cache would only save a few cycles.

#12 kragen on 04.24.08 at 10:24 pm

Um, obviously I meant SSE, not SSA.

#13 snicolai on 05.17.08 at 6:04 pm

I'll take a slightly different angle to the question. What can you leave out of a CPU to improve performance? Have you looked at the Singularity project from MS Research?

http://research.microsoft.com/os/singularity/publications/OSR2007_RethinkingSoftwareStack.pdf gives some details. Section 4.2.1 talks about the "unsafe code tax" imposed by the hardware protection mechanisms in the CPU. They measured it as high as 37%.

I see this just as a continuation of the RISC trend, moving things that were traditionally done in hardware to software. Ultimately what I want from the hardware is memory load/store throughput, enough registers/cache in the processor to hide the latency of that throughput and balanced processing elements to operate on that data.

Current memory management hardware tracks the state of each page (dirty, etc.) and a virtual address for that page. How much would dumping the virtual addressing mechanisms (but keeping the page state mechanisms, which are useful for garbage collection) speed up a CPU? I haven't kept up with modern CPUs, but getting rid of virtual address translation should save a stage or two in the pipeline.

A VM operating on physical addresses would now be able to do memory layout optimizations to take advantage of extra banks of memory in one machine vs another. The virtual addressing layer makes that difficult now.

The VM could still page items in and out, it would just replace references to an item being paged out with a proxy object that read the item back in before accessing the object.

The wonderful advantage we now have with many of the high level languages today is that they are defined in terms of a VM. This gives us a layer under any application written to that VM to change and implement new ways of executing the application transparently to the application. Witness the progress made in speeding up the JVM over the last 10 years. Theoretically, the application doesn't change, but executes faster on newer versions of the JVM.

#14 Yossi Kreinin on 05.18.08 at 10:26 am

Regarding the cost of page table management: I currently work on embedded systems with page table support unavailable or turned off. I wouldn't know the cost of address translation, but it appears to be passable because they cache mappings, and comparisons/negations are fast. That said, I'd rather have instructions for checked array access, and with those plus a VM taking care of type safety, I think you could get better security and safety than the process model gives. I think I'm actually speaking in the spirit of the Singularity project here.

Regarding JVM: it started out slow as hell, despite being relatively low-level, so you could expect improvement. Today, AFAIK it's comparable to C for object grinding, but is hardly an optimal platform for image processing or non-hardware-accelerated computer graphics (I'd guess array boundary checking is the main problem, and maybe object inlining; C# has structures to avoid the latter, and unsafe code to avoid the former).

#15 Yossi Kreinin on 05.18.08 at 10:28 am

Also regarding JVM efficiency and programs executing on different versions of the VM out there – a post by John Carmack:

http://www.armadilloaerospace.com/n.x/johnc/recent%20updates/archive?news_id=295

#16 Yossi Kreinin on 05.18.08 at 10:39 am

To kragen:

Regarding vectorization – I don't believe in automatic vectorization by compilers, nor do I believe in vectorized operations built into a programming language. I believe in hand-coding with intrinsics. I believe it to yield at least 5x more efficient code, on average. No figures to back that up, of course, but lots of confidence. So vectorization is out of the scope.

Now, with vectorization being out of the scope, a language with unchecked array access and support for small data types will beat a language without those by a large margin. However, I think you can make Lisp support small data types in memory, and you could make its array accesses unsafe. Now, if CPUs supported commands for checked array access, you have optimized, safe, portable Lisp.

Regarding machines executing gazillions of operations on neighbors: they usually bite the dust when you get beyond shamelessly parallel image processing. I'd rather play with a multi-core DSP which can handle data-parallel scenarios, not just task-parallel ones.

#17 davidmathers on 06.02.08 at 10:29 pm

I don't know enough to know how relevant this is but…

Design of a LISP-Based Microprocessor by Guy Steele and Gerald Sussman

LISP differs from traditional machine languages in that the program/data storage is conceptually an unordered set of linked record structures of various sizes, rather than an ordered, indexable vector of integers or bit fields of fixed size. An instruction set can be designed for programs expressed as trees of record structures. A processor can interpret these program trees in a recursive fashion and provide automatic storage management for the record structures.

We discuss a small-scale prototype VLSI microprocessor which has been designed and fabricated, containing a sufficiently complete instruction interpreter to execute small programs and a rudimentary storage allocator.

http://delivery.acm.org/10.1145/360000/359031/p628-steele.pdf?key1=359031&key2=3630878911&coll=ACM&dl=ACM&CFID=15151515&CFTOKEN=6184618

#18 Yossi Kreinin on 06.04.08 at 12:36 pm

I didn't read the whole thing, and I might at some point. However, it seems that it's a basic implementation of the core Lisp evaluator, using type tags for dispatching. Tagged memory has efficiency problems if you use it straightforwardly, for example, you can't compactly represent byte arrays. Of course there could be ways to save tags, I just don't see what a complete result would look like. I'll delve into the paper some more.

BTW, I wish we had something like Lisp machines rather than the buggy insecure desktop hardware/software towers of today. In this posting, I'm explicitly approaching the problem from a completely anal-retentive perspective I obtained in the world of embedded apps; it's hardly the right way to look at other things.

#19 London Dry Gin on 01.28.09 at 7:27 pm

> Regular expression and string functions in hardware

SSE4.2 introduced some string functions in hardware, http://en.wikipedia.org/wiki/SSE4#SSE4.2

#20 Evan on 05.24.12 at 9:59 am

PSA:

The David Moon conversation has moved to this link since this thread was posted:

http://development.azuldev.com/blog/cliff/2008-11-18-brief-conversation-david-moon
