The "high-level CPU" challenge

Do you love ("very") high-level languages? Like Lisp, Smalltalk, Python, Ruby? Or maybe Haskell, ML? I love high-level languages.

Do you think high-level languages would run fast if the stock hardware weren't "brain-damaged"/"built to run C"/"a von Neumann machine (instead of some other wonderful thing)"? You do think so? I have a challenge for you. I bet you'll be interested.


  • I work on the definition of custom instruction set processors (just finished one).
  • It's fairly high-end stuff (MHz/transistor count in the hundreds of millions).
  • I also work on the related programming languages (compilers, etc.).
  • Whenever application programmers have to deal with low-level issues of the machine I'm (partly) responsible for, I feel genuine shame. They should be doing their job; the machine details are my job. Feels like failure (even if "the state of the art" isn't any better).
  • …But, I'm also obsessed with performance. Because the apps which run on top of my stuff are ever-hungry, number-crunching real time monsters. Online computer vision. Loads of fun, and loads of processing that would make a "classic" DSP hacker's eyeballs pop out of his skull.

My challenge is this. If you think that you know how hardware and/or compilers should be designed to support HLLs, why don't you actually tell us about it, instead of briefly mentioning it? Requirement: your architecture should allow to run HLL code much faster than a compiler emitting something like RISC instructions, without significant physical size penalties. In other words, if I have so many square millimeters of silicon, and I pad it with your cores instead of, say, MIPS cores, I'll be able to implement my apps in a much more high-level fashion without losing much performance (25% sounds like a reasonable upper bound). Bonus points for intrinsic support for vectorized low-precision computations.

If your architecture meets these requirements, I'll consider a physical implementation very seriously (because we could use that kind of thing), and if it works out, you'll get a chip so you can show people what your ideas look like. I can't promise anything, because, as usual, there are more forces at play than the theoretical technical virtue of an idea. I can only promise to publicly announce that your idea was awesome and I'd love to implement it; not much, but it's the best I can deliver.

If you can't think of anything, then your consistent assertions about "stupid hardware" are a stupid bluff. Do us a favor and shut up. WARNING: I can't do hardware myself, but there are lots of brilliant hardware hackers around me, and I've seen how actual chips are made and what your constraints are. Don't bullshit me, buddy.

Seriously, I'm sick and tired of HLL weenie trash talk. Especially when it comes from apparently credible and exceedingly competent people.

Alan Kay, the inventor of Smalltalk: "Just as an aside, to give you an interesting benchmark—on roughly the same system, roughly optimized the same way, a benchmark from 1979 at Xerox PARC runs only 50 times faster today. Moore’s law has given us somewhere between 40,000 and 60,000 times improvement in that time. So there’s approximately a factor of 1,000 in efficiency that has been lost by bad CPU architectures." … "We’re not going to worry about whether we can compile it into a von Neumann computer or not, and we will make the microcode do whatever we need to get around these inefficiencies because a lot of the inefficiencies are just putting stuff on obsolete hardware architectures."

Jamie Zawinski, an author of Mozilla: "In a large application, a good garbage collector is more efficient than malloc/free." … "Don't blame the concept of GC just because you've never seen a good GC that interfaces well with your favorite language." Elsewhere: "it's a misconception that lisp is, by its nature, slow, or even slower than C" … "if you're doing a *big* project in C or C++, well, you're going to end up reinventing most of the lisp runtime anyway"

Steve Yegge, a great tech blogger: "The von Neumann machine is a convenient, cost-effective, 1950s realization of a Turing Machine, which is a famous abstract model for performing computations." … "There are various other kinds of computers, such as convenient realizations of neural networks or cellular automata, but they're nowhere as popular either, at least not yet". And… "The Von Neumann architecture is not the only one out there, nor is it going to last much longer (in the grand 400-year scheme of things.)"

Wow. Sounds dazzling and mind-opening, doesn't it? Except there isn't any technical detail whatsoever. I mean, it's important to be open-minded and stuff. It really is. The fact that something doesn't seem "practical" doesn't mean you shouldn't think or talk about it. But if something isn't even a something, just a vague idea about Awesome Coolness, it poisons the readers' minds, people. It's like talking about Spirituality of the kind that lets you jump over cliffs at your mighty will or something (I'm not that good at New Age, but I think they have things like these in stock). This can only lead to three results:

  1. Your reader ignores you.
  2. Your reader sits on a couch and waits to gain enough Spirituality to jump around cliffs. Congratulations! Your writing has got you one fat fanboy.
  3. Your reader assumes he's Spiritual enough already and jumps off a cliff, so you've got a slim fanboy corpse.

It's the same with this Great High-Level Hardware talk. I can ignore it, or I can wait forever until it emerges, or I can miserably fail trying to do it myself. Seriously, let's look at these claims a little closer.

Alan Kay mentions a benchmark showing how lame our CPUs are. I'd really like to see that benchmark. Because I've checked out the B5000 which he praised in that article. And I don't think a modern implementation of that architecture would beat a modern CPU in terms of raw efficiency. You see, RISC happened for a reason. Very roughly, it's like this:

  • You can access memories at single cycle throughput.
  • You can process operands in registers at single cycle throughput.
  • And that's pretty much what you can do.

Suppose you want to support strings and have a string comparison instruction. You might think that "it's done in the hardware", so it's blindingly fast. It isn't, because the hardware still has to access memory, one word per cycle. A superscalar/VLIW assembly loop would run just as quickly; the only thing you'd save is a few bytes for instruction encoding. On the other hand, your string comparison thingie has got you into several sorts of trouble:

  • Your machine is larger, with little gain – you don't compare strings most of the time.
  • Your machine is complicated, so optimizing the hardware is trickier.
  • Compilers have trouble actually utilizing your instructions.
  • Especially as the underlying hardware implementation grows more complicated and the performance of assembly code gets harder to model.

When people were supposed to write assembly programs, the inclusion of complicated high-level instructions was somewhat natural. When it became clear that compilers write most of the programs (because compilation became cheap enough), processors became less high-level; the points above hopefully explain why.

And don't get me started about the tagging of data words. B5000 had polymorphic microcode – it would load two words, look at their type bits and add them according to the run time types. Well, B5000 didn't support things like unsigned 8-bit integers, which happen to be something I need, because that's how you store images, for example. Am I supposed to carry tag bits in each frigging byte? Let me point out that it has its cost. And I don't think this sort of low-level polymorphism dwarfs the cost of Lisp or Smalltalk-style dynamic binding, either (B5000 was designed to run Algol; what would you do to run Smalltalk?)

There's another angle to it: Alan Kay mentions that you almost couldn't crash the B5000, which suited the business apps it was supposed to run quite well. I think that's just awesome, I really do (I shoveled through lots of core dumps). In fact, I think people who implemented modern desktop operating systems and web browsers in unsafe languages on top of unsafe hardware are directly responsible for the vast majority of actual security problems out there. But (1) in many systems, the performance is really really important and (2) I think that security in software, the way it's done in JVM or .NET, still has lower overall cost than tagging every byte (I'm less sure about part 2 because I don't really know the guts of those VMs). Anyway, I think that hardware-enforced safety is costly, and you ought to acknowledge it (or really show why this fairly intuitive assumption is wrong, that is, delve into the details).

JWZ's Lisp-can-be-efficient-on-stock-hardware claim isn't much better than Smalltalk-can-be-efficient-on-custom-hardware, I find. Just how can it be? If you use Lisp's static annotation system, your code becomes uglier than Java, and much less safe (I don't think Lisp does static checking of parameter types, it just goes ahead and passes you an object and lets you think it's an integer). If you use Lisp in the Lispy way that makes it so attractive in the first place, how on Earth can you optimize out the dynamic type checks and binding? You'd have to solve undecidable problems to make sense of the data flow. "A large project in C would implement the Lisp run time?" Oh really? You mean each variable will have the type LispObject (or PyObject or whatever)? Never happens, unless the C code is written by a deeply disturbed Lisp weenie (gcc and especially BetaPlayer, I'm talking about you). The fact that some people write C code as if they were a Lisp back-end is their personal problem, nothing more, nothing less.

The dynamic memory allocation business is no picnic, either. I won't argue that garbage collection is significantly less efficient than manual malloc/free calls, because I'm not so sure about it. What I will argue is that a good Lisp program will use much more dynamic allocation and indirection levels than a good C program (again, I ignore the case of emulating C in Lisp, or Lisp in C, because I think it's a waste of time anyway). And if you want to make your objects flat, I think you need a static type system, so you won't be much higher-level than Java in terms of dynamic flexibility. And levels of indirection are extremely costly because every data-dependent memory access is awfully likely to introduce pipeline stalls.

Pure functional languages with static typing have their own problem – they lack side effects and make lots of copies at the interface level; eliminating those copies is left as an exercise to the compiler writer. I've never worked through a significant array of such exercises, so I won't argue about the problems of that. I'll just mention that static typing (irregardless of the type inference technique) characterizes lower-level languages, because now I have to think about types, just the way imperative programming is lower-level than functional programming, because now I have to think about the order of side effects. You can tell me that I don't know what "high-level" means; I won't care.

Now, the von Neumann machine business. Do you realize the extent to which memory arrays are optimized and standardized today? It's nowhere near what happens with CPUs. There are lots of CPU families running lots of different instruction sets. All memories just load and store. Both static RAM (the expensive and fast kind) and dynamic RAM (the cheap and slower kind) are optimized to death, from raw performance to factory testing needed to detect manufacturing defects. You don't think about memories when you design hardware, just the way you don't think about the kind of floating point you want to use in your numeric app – you go for IEEE because so much intellectual effort was invested in it on all levels to make it work well.

But let's go with the New Age flow of "von Neumann machine is a relic from the fifties". What kinds of other architectures are there, and how do you program them, may I ask? "C is for von Neumann machines". Well, so is Java and so is Lisp; all have contiguous arrays. Linked lists and dictionaries aren't designed for any other kind of machine, either; in fact lots of standard big O complexity analysis assumes a von Neumann machine – O(1) random access.

And suppose you're willing to drop standard memories and standard programming languages and standard complexity analysis. I don't think you're a crackpot, I really don't; I think you're bluffing, most probably, but you could be a brilliant and creative individual. I sincerely don't think that anything practiced by millions can automatically be considered "true" or "right"; I was born in the Soviet Union, so I know all about it. Anyway, I want to hear your ideas. I have images. I must process those images and find stuff in them. I need to write a program and control its behavior. You know, the usual edit-run-debug-swear cycle. What model do you propose to use? Don't just say "neural nets". Let's hear some details about hardware architecture.

I really want to know. I assume that an opinion held by quite some celebrities is shared by lots and lots of people out there. Many of you are competent programmers, some stronger than myself. Tell me why I'm wrong. I'll send you a sample chip. I'll publicly admit I was a smug misinformed dumbass. Whatever you like. I want to close this part of "efficient/high-level isn't a trade-off" nonsense, so that I can go back to my scheduled ranting about the other part. You know, when I poke fun at C++ programmers who think STL is "high-level" (ha!). But until this "Lisp is efficient" (ha!) issue lingers, I just can't go on ranting with clear conscience. Unbalanced ranting is evil, don't you think?


#1 alfiejohn on 01.30.08 at 11:49 am

How about trying to implement Parrot's VM?

#2 Yossi Kreinin on 01.30.08 at 12:04 pm

Exactly: /how/ should I go about implementing Parrot's VM? :) And why would it be more efficient than running it in software on a standard RISC core hooked to standard RAM?

I'm not saying there are no good VMs for HLLs, just that implementing them in hardware wouldn't increase the overall system efficiency (performance/price).

#3 szetlan on 01.30.08 at 12:31 pm

How about adding processor-level instructions for regular expression matching, or for higher-level operations used in regular expression matching? I'm not a processor guy, so I don't know if that's just a pipe dream, but wouldn't it be faster to run almost any text-processing language (think perl, python, ruby) if fewer instructions are required to perform the comparisons? This might be impossible, but then again … maybe it's just really, really difficult.

#4 queisser on 01.30.08 at 12:38 pm

What about FPGAs? We run a soft core on a cheap FPGA at low MHz but get lots of acceleration from "coprocessor" cores that were compiled from C code. Instead of developing custom chips for one specific HLL maybe we'd like some FPGA coprocessors that can be loaded up with task-specific logic as needed.

You mention image processing – I remember using image processing boards in the late 80s that had FPGA front ends because the PCs were just too slow. Maybe it's time to resurrect that approach.

#5 Justin.Wick on 01.30.08 at 12:42 pm

First of all, if one's going to try to implement a virtual machine (seems that would defeat the purpose of it being a VM, but hey whatever works), something like JVM or LLVM would likely be much more productive. Of course you're left with the problem that they are essentially just slightly higher level than a normal CPU, nothing drastic. And Yossi is right, the trick is – how would you even do this?

My question is this – there was once a time when memory hierarchies were shallow enough that many if not most programs were CPU bound. Except in tightly pipelined systems, cryptography, and a few cases where datasets are so small that they fit into on-chip cache, this no longer seems to be the case. For that reason, I cannot possibly understand what performance benefit we'd see from implementing a higher level instruction set – unless it prevents us from accessing memory in a cache-incoherent manner. Anyone have thoughts on this?

I think Yossi would agree that it might be nice to have a standard CPU architecture that was easier to generate efficient code for (x86 is a bitch, and it's intended successor EPIC was so much harder that the scientists I worked with had to code entire chunks of program in assembly, or suffer 300-400% performance hits easily). Indeed, if anything modern compiling has taught us it's that reducing code size (and therefore instruction cache misses) is a big deal. Perhaps an architecture that had operands which map to higher level (but still fundamentally "C-like") constructs, and then compiled them down to internal RISC opcodes might burn some silicon to save IC misses, as a form of code compression.

We already live in an age where (for some applications) dynamically compiled languages like Java can be *faster* than statically compiled C (if you don't code like a slob, and have twice the RAM you'd need in C) due to runtime analysis, hotspot optimization, etc. CPU cores idling, just waiting for something to do while they fetch from RAM. If anything, what we need is not a better CPU, we need libraries, languages, and compilers that support dynamic program transformations to encourage cache-coherency. Get the data accesses in your LISP or Java program clustered appropriately, and you'll see significant gains. Move away from the "arrays are contiguous blocks of RAM I index directly" paradigm of C, into something more along the lines of "arrays are fast O(1) containers for things I index via integer keys, which may be transformed at run-time depending on the optimized internal structure."

#6 twp on 01.30.08 at 1:00 pm

(Specialty hardware for running Java?)

Does that count?

I have no idea what the zaap does – but that's almost what it appears to be..

#7 betaray on 01.30.08 at 2:16 pm

I'd agree that high level languages certainly have their cost on von Neuman machines, but you're obviously failing Yegge's test for common sense. You're forgetting that we can make computers out of things other than switches. There have been general purpose analog neural networks like the Intel ETANN and Bellcore CLNN32/64. There have been digitally implemented neural networks like the IBM ZISC036. In fact I believe the ZISC036 meets all the requirements of your challenge. (See:

If you are arguing that the von Neumann machine is likely the best computer you can build based on switches and current knowledge, then I'd agree.

Efficiency is a wide field. Efficiency in what dimension, for what problem? There have been situation where DNA and enzyme based computers are able to solve problems 100,000 times faster than it would have taken a state of the art von Neumann machine.

Transistors, as you point out, have had a lot of effort put into their optimization, but I think that Yegge's whole points was that there exists technologies that when optimized will far exceed the capabilities even the best swtich.

#8 bjorn.lindberg on 01.30.08 at 2:30 pm

Great article, I'll get some beer and wait …

#9 dbrunton on 01.30.08 at 2:50 pm

Okay, I'll bite.

Here's my submission:

Now I'll sit back, also with a beer, and wait for my free chip ;)

#10 stevedekorte on 01.30.08 at 3:18 pm

As someone who has written a high level language, the major expense I've seen is for hash table lookups (for dynamic binding) which take many cycles. I've been told that associative memory hardware could reduce these lookups to a single cycle – is this true?

#11 sdhawley on 01.30.08 at 3:22 pm

Interesting post. What about Lisp machines?

As I understand it they were special purpose hardware to run Lisp programs faster than the general purpose hardware of the day.

#12 Paul Stansifer on 01.30.08 at 6:49 pm

Since I like the idea of side-effect-free HLLs, I'd like to see hardware support for reference-counting, because side-effect-free languages cannot create cyclic data structures. As far as I can tell, very little research has gone into making reference-counting faster (probably because people have always assumed that we would need to collect circular structures), but it seems like an ideal problem for hardware to solve. Assuming that the first field of an object is its reference count, you'd have instructions like "Copy and increment referent" and "Decrement referent and jump if zero". Much of the slowness of side-effect-free work could also be eliminated by not copying objects if their reference count is about to become zero (a common case when simulating side-effects — the optimization turns the simulated side-effect into a real one).

As I understand it, one of the problems with reference counting is that it writes to memory a whole lot. An associative cache for reference counts could alleviate that. Other reference counting tricks, like one-bit reference counting or weighted reference counts, could become much faster with hardware support.

While I'm dreaming, it'd also be nice to see better tools for lightweight parallelism. All the silicon that's spent on deep pipelines and sophisticated branch prediction and register scoreboarding could go to increasing the number of cores, provided compilers took advantage of them. I have to admit that I don't have a clear idea of what the instructions for ultra-lightweight parallelism would look like, but the potential usefulness is limited only by the amount of time overhead needed to ascertain the number of free cores and start them working on a slice of the data. The compiler can ensure that they don't need to touch the stack for the duration of the data-level parallelism.

Specialized cores that can only add, subtract, and multiply integers would be exceptionally useful, because lots of work just doesn't involve floating-point math. (

Alternately, imagine a set of cores, each with a dedicated bus connected to dedicated memory (in addition to main memory). The memory could be addressed such that the cores are interleaved at the word level. One could drop an array into that area of memory and have all the cores operate on it in parallel. I feel like I'm describing something that already exists, though.

I feel that creating hybrid processors, based on von Neumann machines, but with special optimizations for Haskell-like languages, will produce the fastest machines for HLLs of the future. Our universe is very skewed towards the von Neumann model, and sometimes our problems map to it nicely, but sometimes they are much more like lambda calculus.

#13 RixiM on 01.30.08 at 7:36 pm

dbrunton and stevedekorte beat me :(
(oh, and szetlan is saying the same thing steve is maybe without realizing it…)
ideally, the answer is cellular automata, i read someplace that the reason von neumann used unified memory was to facilitate code modification in automata… i wonder what he'd think about modern computer viruses…

i think "associative memory hardware" gets you more than half way to solving the HLL challenge and is a lot more practical than a cellular automata hardware implementation (as i understand it, current chip fabrication and wiring make 3D matrices pretty cost ineffective.)

if one is going to go through the trouble of a hardware neural network implementation, (which doesn't actually solve the challenge at all,) they might as well just go the full monty and implement cellular automata, the two aren't really that much different; i think cellular automata are technically more space efficient.

FPGAs aren't really an answer; they are arrays of one bit adders, as many others before me have pointed out. in other words very stupid, very slow, 2D cellular automata.

so, in some regards i think new fabracation techniques and the old standby-never-quite-implemented-efficiently associative memories and cellular automata will be the "solution"

incidentally, phase change memory is well suited for all of the above things… as well as other existing 3D fab techniques.

#14 rektide on 01.30.08 at 9:24 pm

I like the Niagra2. The onboard ethernet has a chunk of silicon that lets it queue up events on different cpu's depending on different filtering criteron, thats extremely smart when you're using 8 massively wide cores as stream processors/request processors off some 10G feeds. To some degree it turns the computer into a multicomp, and lets each cpu keep its own memory (aka avoid shared memory). Add in a crypto processor capable of saturating the 10G link and you can replace half your networking gear. Going back to the massively wide architecture and using something like HT is very smart as well, basically providing a bunch of functional units and then trying to keep them all fed. If the question is "how do you scale out," Sun has some very good answers, and most of them appear on die.

Bonus q: how do you feed 8 extra-wide cores? Four dual channel banks of FBDimms. Nice.

The spagetti stack hardware architecture of the Burrough's matches closely to continuations, which we're seeing more and more of for implementing asych event handling routines. This would potentially be excellent for dealing with highly concurrent systems.

I'm no expert on the Burroughts, but I dont think the burroughs was particularly a CISC v. RISC fight, no more than the x86's CS, DS, SS, ES, you could implement most of its architecture in RISC form. Its just that the "segment registers" notion was take na little higher level with the Burroughts, to let the processor know a little more about the stack and to let the stack grow in different ways. The most glaring counterclaim would be having to chase down the stack and decend lexical levels to dig up another variable, but there's plenty of caching opportunities here, X86 adds its own TLB's all the time to deal with the same class of problems for segmentation and indirect lookups. Who knows, its probably a terrible idea, I certainly dont know enough about Burroughs like systems to judge, but I think it would be really fascinating to try and make an object oriented cpu or a spagetti stack cpu or something other than the segmented stack machine. (PPC has some interesting potential with its inverted page tables)

#15 trh on 01.30.08 at 9:41 pm

Although it would still essentially be a von Neumann computer, I would really love to see a CPU which is a graph reducer and implemented SK combinators.

Has this been done before?

It would need to support GC explicitly, and some other memory and math operations, but implementing a functional language on this sort of architecture would be a breeze, obviously.

#16 Brendan on 01.31.08 at 11:02 am

My understanding is that when the JVM needs a chunk of blank memory the system carefully loads the previous contents of the memory to cache, which the VM then clobbers. A design of cache/memory architecture that eliminated the redundant read would improve performance.

#17 slam4982 on 02.01.08 at 9:12 am

Not my idea, though I wish I could take credit for it.

It's called the Reduceron2.

"The Reduceron is an extremely simple machine, containing just 4 instructions, and executes core Haskell almost directly." <- from the website.

Pretty cool, does this meet the criteria for someone (though not me) actually doing something instead of just talking about it?

#18 Yossi Kreinin on 02.01.08 at 9:24 am

Of course it does! Reduceron is a pretty cool project.

I'm in the process of chasing the various links people sent and digesting everything. I'll publish a summary of the interesting stuff when I'm done.

#19 Tiempo finito y logarítmico » Desempeño de arquitectura on 02.02.08 at 7:09 pm

[...] – Alan Kay, leído en The “high-level CPU” challenge [...]

#20 xirium on 02.03.08 at 10:56 pm

More than 10 years ago, I saw a proposal for a processor with one ALU and 128 program counters and sets of registers. The intention was that it would be exceptionally good at implementing multi-threaded applications. Unfortunately, suitable software was quite lacking in that era. However, the theory is very good.

If you design a processor which is extremely multi-threaded then do not need to implement, deep pipelines, speculative execution, branch prediction, out-of-order execution (and reconciliation) or register shadowing.

High level languages are very problematic for pipelines. This is especially true when running object oriented languages. The problem is that method calls are dependent on the instance being handled. A tree or network of classes have some common methods. Additional methods may be lazily loaded during execution. This arrangement hinders compilation techniques. It also affects pipelining.

A common case is a method calling a method of another class. This process can involve multiple memory accesses. Each method call also requires a jump to an address which is unknown until very briefly before the jump is made. This is problematic for performance if you have a deep pipeline which decodes instructions over many clock cycles. It is possible to defer write operations to increase read throughput. This can reduce immediate impact but it increases complexity and creates additional opportunities for pipeline stalls. Furthermore, it doesn't change the fact that your ALU is idle between method calls.

Instead, it would be more productive to have a very large number of hardware threads. ALU, registers and instruction set could be designed with minimal complexity. It would be possible to implement eight simple ALUs, each with eight sets of registers. Therefore, you'd have eight threads per ALU. Threads within a group would be allocated processing time with strict preference. Each group is run out of phase.

One or more dedicated instruction caches are not required because instructions can be retreived from secondary cache within eight cycles. If access to main memory takes 20 cycles then a thread will block and two lesser threads within its group can be run. Additional memory accesses will cause additional threads to block. Regardless, throughput from the cache should be optimal and one or more ALUs will be in use almost constantly. However, unlike contemporary designs, this remains true when running OO and bytecode programs.

I'd suggest implementing a two address machine. This allows a rich but compact set of instructions to be loaded, cached, transferred and decoded efficiently. I'd implement a subset of the complexity of a TMS9900 or a MC68000 because it would give the best throughput versus complexity. If you stripped the instructions which required microcode then the remainder would easily complete within eight cycles.

You'd probably want a symmetric design, for simplicity. Regardless, it would be possible to have differing ALUs, registers and instruction sets within one implementation. One or more ALUs could be optimised for DSP operations. One or more ALUs could be optimised for bit operations.

This idea works very efficiently with existing software. It doesn't require fancy compilers. It doesn't require exceptional bus bandwidth. I just wish it was mine.

#21 Yossi Kreinin on 02.04.08 at 1:59 pm

Hyper-threading (lots of reg sets per ALU) has its problems:

1. Some workloads are hard to parallelize
2. Some workloads are ALU-bound

Basically, AFAIK you hit diminishing returns at the reg set #4-#5 – the chances to increase actual throughput aren't high enough to justify the die size.

Hyperthreading doesn't remove the cost of HLLs, although it can help fill the stalls in the pipeline, sufficiently reducing this cost; however, it won't help in the (probably frequent) case when your vtables or whatever it is that causes the indirection are already in the cache (unless the core has a high load-from-cache latency, which is a problem by itself).

#22 DGentry on 05.08.08 at 8:34 pm

I read this article a couple weeks ago, and started writing a response. It grew large enough that I decided to post it as a separate article rather than in this tiny little comment window.

A summary of the proposal is a second TLB intended to be managed directly by application code, allowing cheap copy-on-write mappings for functional language argument passing.

#23 Chris Barts on 06.03.08 at 9:52 pm

You should update your article, because the Lisp type system does not work the way you think it does. SBCL, a commonly-used Lisp system, does refuse code that does not pass its type checking (both static and dynamic) and does compile away the checks on code it knows is safe (unboxing is done as well). As for Lisp ever being uglier than Java: Macros buy me so much in terms of expressiveness Java seems like a bad joke. That, plus the fact Java doesn't seem to have a workable type system, clinches it for me.

#24 Yossi Kreinin on 06.04.08 at 1:13 pm

Peter Seibel "should" update his book then. Quote:

"… to make Common Lisp compile add more or less like a C compiler would, you can write it like this:

(defun add (x y)
(declare (optimize (speed 3) (safety 0)))
(declare (fixnum x y))
(the fixnum (+ x y)))

Of course, now the Lisp version suffers from many of the same liabilities as the C version–if the arguments passed aren't fixnums or if the addition overflows, the result will be mathematically incorrect or worse."

I guess it has something to do with the (safety 0) thingie. It has the advantage of yielding assembly code identical to that of a statically typed C version, but is apparently even less safe.

Can SBCL be better than what the CL standard apparently mandates implementations to be, specifically, be as fast as C due to explicit typing or type inference without sacrificing safety? I dunno. SBCL shootout benchmarks run ~2x slower than C. And regarding "beauty" compared to C/Java/… – (the fixnum (+ a b)) when a and b are already declared as fixnums is something you wouldn't have to do in C or Java. SBCL benchmark implementations also have their share of declare and convert calls. Not to mention the (declare (optimize…)) from the book.

That macros buy expressiveness doesn't contradict anything of what I said. I didn't diss Lisp, praise Java or claimed Lisp was linguistically inferior to Java (ha!). I claimed the following:

1. Implementing support for Lisp in hardware doesn't reduce system cost compared to an implementation on a "traditional" machine and an optimizing compiler.
2. Lisp intrinsically costs more than a low-level language like C; neither hardware nor software can hide it completely. It doesn't mean you should use low-level languages, it means just what I said.

Also, Lisp with all the declarations you'd make in C with all the limitations you'd have in C isn't Lisp with types, it's C with sexprs (I'd like to have C with sexprs, but that's another matter). That is hardly provable/refutable though; it's a personal preference/perception thing.

P.S. Java has a "workable" type system since quite a bunch of systems were delivered with it. It doesn't mean I want to work in Java; by the test above, C++ is "workable", and in case you didn't notice, I despise C++. If people get work done in it, it's "workable". But that's just a side note.

#25 gdfcsx on 10.13.08 at 9:21 pm

hey, dont take my money.

#26 Zarutian on 01.28.09 at 1:46 pm

Well, I think that Alan Kay and many more are actualy pointing out bad cpu designs.

Two examples are branch prediction and memory caches.

I rather have Broadband Engine Cell type of proccessor where each core has private scratch memory and DMA between those and main memory. Cache coherrency proplem and branch prediction penalty disapear and less silicon real estate is taken up by cache circuitry. Add propper hardware level subroutine handling (a la Forths returnstack) and possibility of fitting more code into constrained scratch memory gives you more functionality per kilobinarybyte. Add register file and rstack DMAing and you got yourself a faster task switching. And if you could add fpga matrix as "one core" on the chip that is an bonus too ;-)

#27 Ryan Ische on 01.28.09 at 2:48 pm

Reminds me of

#28 mbishop on 01.28.09 at 4:13 pm

I remember seeing a presentation a while back about a multi-core system that had two stacks per core. and 60 cores (or something like that). It was sort of like Forth in hardware. I can't remember where I saw it, but it might be on google video or something if you want to search (but I can't remember the guys name…)

#29 Dave Roberts on 01.28.09 at 7:33 pm

You're correct that most features of a high level language wouldn't be accelerated with special hardware. Specifically, type tagging and microcoding are a big lose. As another commenter pointed out, modern processing performance is memory bound, not CPU-bound. If you can keep your computations "on-chip," then you're better off with a simple RISC implementing those calculations rather than special-purpose hardware. The only minor win is that RISC might have slightly more icache bandwidth required to fetch the additional instructions. Otherwise, keep the core simpler. This is why tagging isn't a big deal. Tags can easily be stripped/added/compared while keeping everything in registers.

I'd point out that there is no reason that Lisp can't run as fast as C, even without type declarations. Modern dynamic compilation can produce type-specific code at runtime that will be highly optimized for the actual types in use. Most function parameters are of the same type on each call can therefore run at full speed. See Andreas Gal's work on trace compilers, for instance. With multicore, you could even dedicate a core to handling the compilation tasks in parallel, as code continues to run.

Now, all that said, one area where high level languages do burn a lot of cycles is in memory management. In fact, all languages tend to burn a lot of performance in memory management, but HLLs suffer more because they tend to put more in the heap and they are typically garbage collected. The issue of what goes in heap is a compiler problem and I won't say anything about it. GC, however, can be made to operate in parallel with special hardware. This does not necessarily mean full HW GC. Much of the actual GC algorithm can still be implemented on a simple software co-processor, but the special hardware allows you to intercept memory references issued by the main CPU to keep the memory model coherent (think read and/or write barriers that operate in parallel and simply stall the main CPU for a cycle or two when needed, until the memory state becomes coherent again). This eliminates comparisons and branches associated with every read/write (depending on when you implement the barrier). It also eliminates, or greatly reduces any pauses caused by "stop the world" GC requirements. Things like tracepoints can also be optimized. Finally, a lot of the performance-robbing GC bookkeeping can be reduced/eliminated (think card-marking and back-pointer checking).

I'm still going to handwave here, but it seems like this area would definitely help HLL performance. In the same way that everybody implements the V-to-P translation of modern virtual memory systems in hardware using a TLB, this would speed up memory specifically for the requirements of HLLs.

Finally, I'll just point you an email conversation between Cliff Click at Azul and Dave Moon and Dan Weinreb, formerly at Symbolics. They discussed the hardware GC support of the Azul hardware Java processor. It definitely seemed like a win over stock RISC.

#30 Dave Roberts on 01.28.09 at 7:35 pm

Sorry, forgot the reference:

#31 Brendan on 01.28.09 at 8:42 pm

I understand that when reclaiming old memory Java wants to zero it.

And that current CPU's load the old contents from slow memory into cache just so it can be zero'd.

What about a way to not load current memory contents into cache and just write to it?

#32 pm on 01.29.09 at 12:52 am

I mean, the best of high-level architecture if Intel iMAX-432.
It is too complex, but it can be simplified.

#33 Ian on 01.29.09 at 2:48 am

To 'just write to memory' wont work. You have to ensure that all processor cores have a consistent view of that memory space first, this means synchronizing caches among other things. If I just zeroed out the main memory, other processors could still be using data in their L2/L3 caches which they think is fresh, not realizing that another processor on the system has declared the area dead. Equally, they could then write their contents back over my newly zeroed area. So given that you have to pause and evaluate the contents of all other core's caches, you may as well go ahead and just load the area into cache, write it and then use the standard cache consistency algorithms to handle the writeback rather than having to implement and debug a special case.

Personally, I'd just love to see people target the correct architectures when they compile stuff. Sure, i386 and i586 will work on i686, but it also prevents the use of any modern features at all. The move to 64bit is less useful for the expanded address space than it is for doubling the register count (to 16, half of what a POWER core has) and forcing people to target an up-to-date architecture.

If I had my way, the world of processors would be a lot more RISC than currently. I just wish I could afford a Power6 or SPARC system to play with. I'm also keen to see how the hardware scout feature of the SPARC Rock series processors turns out.

#34 Mate on 01.29.09 at 3:28 am

I think the problem is that you want to solve the wrong problem. As you said, high levels of indirection are costly, and you want to keep the access of flat memory structures fast.

However, IMHO the image processing application is an extremely specific example. Not very many things are stored in flat structures these days. Even databases are flat only as long as you only don't want massively parallel access, only want to read them, etc.

The point is: real world data is often (way more often than not) composed of complicated data structures full of indirection. Modern "killer applications" are not about sequential, predictable operations over a structured array; instead, they're about complex methods of updating these deep and indirect data structures – and indeed, this is the main and natural domain of these HLLs you mention.

We can keep specialized CPUs or architectures for sequential operations (in fact, we already have these; they're called GPUs); and we should forget about optimizing our main CPUs for a case of flat data.

#35 Jon Harrop on 01.29.09 at 1:07 pm

I want to create HLL implementations that will work efficiently on my customers' many cores in the years to come but I cannot buy a manycore system today so I cannot test my ideas. I want robustly-deployable binaries of immersive graphical applications but today's separation between the CPU and GPU requires graphics drivers that are a major weak point in the reliability of the whole system, massively increasing our support costs and even forcing us to abandon product lines.

My proposed solution: an affordable manycore CPU today.

Many cores executing an existing instruction set: can reuse existing OS and compiler technology like Linux and LLVM. Good programmable graphics capability from multithreaded software without an error-prone graphics driver.

On a separate note, concurrent garbage collection is becoming increasingly important and is already a major performance bottleneck. I have no idea how this might be done but perhaps dedicated hardware can help speed this up? Also, work stealing queues (e.g. Cilk, Microsoft Task Parallel Library) are an excellent general-purpose solution for efficient parallelism. Perhaps they could be built into each core in order to lower the overheads?

#36 CybFreak on 01.30.09 at 3:22 pm

I agree with author we don't need new hardware for heigh level languages. What we really need, it is start thinking more efficiency. Good discipline and education could make programmer use low-level or "difficult" languages as they now using heigh level languages.

#37 Joe on 01.30.09 at 11:25 pm

"If you use Lisp in the Lispy way that makes it so attractive in the first place…"

Then you only need to optimize the tiny part of your program that's the *bottleneck*. It's a lot easier to rewrite a function or two from Lisp into (slightly different) Lisp, than it is to write it in C and interface it and link it in (at least, in any HLL I've ever used).

"“A large project in C would implement the Lisp run time?” Oh really? You mean each variable will have the type LispObject (or PyObject or whatever)?"

No, he means you end up with a half-dozen specialized ad-hoc MAPCAR implementations, and a half-dozen specialized ad-hoc FILTER implementations, and so on. Every large C project I've ever seen does this. To counter-challenge: I'd be curious to see one that didn't.

Your challenge is a good one, if you could constrain it a bit (is there a good simulator or something we should use?). But JWZ didn't say anything about changing machine architectures. I don't know why you think his quote is at all related to the others, or to your subject.

#38 Somejan on 01.31.09 at 7:13 am

As has been said in other comments: primitives that help GC on a RISC core could help. Also, allow cheap thread creation without going through the OS, so it would be feasible to run a sequence of a few hundred instructions on another core.

Performance of dynamic languages depends for a much greater part on smart compilers, jits, and dynamic optimizers that take the actual runtime values of data the program is handling into account. To do that, the program has to be profiled, which on current hardware allways has some performance impact (or at least is a tradeoff between profiling accuracy and performance). How about one core sending hardware generated profiling data to a second core that can do the analysis, so there is no runtime cost to profiling (as long as enough cores are available). If this happens, profiling everything, including production use, could become the norm. This in itself does not speed up anything, but I think it would improve the opportunities for both compilers and developers a lot.

But a lot of the current performance challenges for HLLs probably need to be solved with software (jits etc) instead of hardware.

#39 Dave Roberts on 02.02.09 at 9:51 am

@Somejan, you say "Also, allow cheap thread creation without going through the OS, so it would be feasible to run a sequence of a few hundred instructions on another core."

I don't think this would help much. Thread creation is such an OS-specific, or even language-specific in the case of Erlang, operation that I don't think you could include it as the level of the CPU.

I'm a believer that GC *assist* is the right answer, but it's only assist. Different languages and operating systems are going to want to manage GC in a slightly different way. The best you can do is implement some of the more painful primitives and let them use those, as opposed to putting a full GC algorithm in hardware itself.

#40 Loper OS » Blog Archive » The Hardware Culture, or: What They Build, Works! Can We Say the Same? on 03.12.09 at 10:47 am

[...] who believe that a CPU ought to be designed specifically around the needs of high-level languages: Do you think high-level languages would run fast if the stock hardware weren’t "brain-damage… My challenge is this. If you think that you know how hardware and/or compilers should be designed [...]

#41 Peufeu on 09.28.09 at 2:25 am

If you have billions of dollars and the worst ISA ever designed, you can still make the fastest CPU (but you'll have to waste some watts and some silicon). See Intel versus Mototola PPC.

Anyway : I'm not going to take your challenge, because executing HLL in hardware is not only impossible, it is useless.

The future of HLL is JIT compilation and particularly specializing JITs. In theory this can be faster than C code.

Consider some code. Perhaps it's some vision processing code, or some big numeric analysis code, or some video encoder. In this case, what you want is big fat DSP power with big SIMD registers, big memory bandwidth, lots of MAC units, etc. And your code is going to be written in C if not handwritten assembly.

But most probably it won't be your code : you'll be using a FFT library from Matlab or LAPACK, or something. In any case it's not going to be written in HLL, it'll be written by "bare metal" guys.

So, a HLL cpu makes no sense in this case.

Now consider the rest of the code on earth, like a database, a web server, your usual Python script, etc. Most likely there is no DSP power needed, and most likely this code is control-flow bound, that is to say the execution time is dominated by if's, with their friends, mispredicted branches and non-prefetched memory accesses. Especially HLL interpreters contain mostly conditionals : resolve and dispatch next bytecode, etc.

More specifically, let's look at PostgresSQL. It does verify Greenspun's Tenth Rule of Programming, since it contains explicit interpreters for a few languages (SQL, plpgSQL, Python, Perl, etc) and implicit interpreters for a few ad-hoc languages : for instance your SQL query, once parsed and planned, is turned into an internal representation, which is nothing more than a HLL program written in a language with specialized instructions like "read tuple", "add tuple to hash join", "compute expression", "test condition", etc.

Any random bit of code that contains lots of conditional statements is, in effect, an interpreter for an ad-hoc language.

So, back to specializing JITs.

I believe an essential attribute of a CPU is that it should make the job of the compiler easy, so that the compiler can generate good, fast code, if possible almost optimal code, very quickly. This is essential if the compiler has to run in real-time, as is the case of Just-In-Time compilation. x86 is the complete opposite of this since the ISA is shit.

Which is why recent x86 CPUs all have hardware JIT compilers that JIT-compile x86 code into some secret internal representation. Talk about silicon waste.

You need an ISA with good uniformity (any register is just as good as any other for anything), and either lots of registers, or stack-based (which quite simplifies register allocation since there is none). You also need conditional instructions to handle short tests without jumps. You need branch hints from the compiler.

And you need a specializing JIT. What it does is interpret your HLL code, while compiling it. However compiled HLL code doesn't necessarily run faster than interpreted, since you can't make lots of assumptions, for instance in Python, variables can be of any type, so wether you compile or interpret, you'll still have lots of tests and dispatch to know if a+b are strings or integers. You could use static type checking, but would you want to use it EVERYWHERE ? You could end up with D (which is good) or C++ (urgh).

Here comes the specializing JIT. When it detects that a bit of code is running frequently, it will look at the types of the variables, then emit specialized code. For instance it will replace all "+" by floating point machine instructions, remove boxing/unboxing of all local variables, etc, then execute the generated machine code instead.

You could have used D and got rid of the interpreter, but that doesn't solve your next problem, which is that you are going to write an interpreter many times in your program (see postgresql case above). So you might as well use the same language your program runs in and have the JIT turn it into machine code.

The next step for the JIT is to look at the values of variables. For instance this bit of code is often called with foo = 42. So we can remove a big switch-case on foo, remove a few ifs, and all that's left of the function is 5 instructions, which we inline in the caller. Then we notice the caller calls the same function with foo=43 many times too, because the caller is some kind of ad-hoc interpreter like a SQL query executor, so we JIT it, and we propagate this up the stack as needed, and we get a custom-made super fast program.

Suppose you have to read data from a database. Data is stored on disk in a special format, which depends on lots of parameters (how many columns, their types, etc). You want to compute average(a million values). The on-disk format parser is an interpreter for an ad-hoc language, and it's going to be called a million times with the same program. JIT it to 5 machine code instructions ! The query executor which takes your data and forwards it to the function which calculates sum() is an interpreter for an ad-hoc language, and it's going to be called a million times with the same program. JIT it to 5 machine code instructions !

You get the idea.

If you want a revlutionary jump in performance on most workloads, you ain't gonna get it with a new CPU. For instance it takes postgres about 700 cycles to process 1 row if you want sum(a column), which is already amazingly optimized given the complexity of the stuff. It requires an extremely smart CPU though, to be able to predict all these conditionals. You could get easily a 10x speedup by JITting all the ad-hoc interpreters, though.

So a "HLL" cpu shouldn't be HLL at all, it should make the job of a specializing JIT easi, fast, and efficient : generate good machine code quickly. Maybe a few helper instructions are needed, like some optimized checks to see if the assumptions under which the JITted code was generated still holds, but that's about it.

They you'll need a good JIT, the ability to hint it (for instance by type hints, or "value doesn't change" hints), and the ability to generate code, very quickly, using the same language as you used to write your program, so you can get rid of all the ad-hoc interpreters. Or keep the ad-hoc interpreters and have the JIT simplify and customize them by removing all the conditionals, if you can hint it about a parameter that is always a constant type or value. Or both. If you make a parser for a special binary format, you might as well generate code. It is very important that this integrates smoothly in your program : you must not have one main language and one JIT language. See Lisp.

.NET has Lightweight Code Generation. That's cool.

Next thing, parallel processing. Multicores are great for this, however we have the big problem of synchronization overhead. CPU executes 2.5 billion cycles per second, but it takes something like 1000 cycles to grab even a slightly contended lock. WTF ?

Your CPU is going to need some hardware synchronization like :

- Explicit lock and release memory range (ie cache lines) (who needs test and set ?) with fast re-acquisition by whoever needs it next. Once it releases the lock, put it back NOW in the L2 so the next one can grab it quickly, don't keep it in L1, duh !

- Transactional memory to get rid of locks : most often, the action you need to do under the lock is simple, like reserve some buffer space + increment a few counters. Remove the lock and do it in a transaction. If transactions can only touch a few cache lines, it's not that complex. If another transaction updated said cache lines, retry. This needs software support.

- Ability for userland to block interrupts while in a critical section so it doesn't get scheduled out while holding the lock. Since that is an usurpation of userland power, only give the ability to block interrupts for, say, 100 cycles.

- Lock waits should take no resources, so the other thread which HyperThreads on the same core can use 100% of the resources.

- Fast atomic counters

#42 Yossi Kreinin on 09.28.09 at 10:29 am

"I’m not going to take your challenge, because executing HLL in hardware is not only impossible, it is useless"

Well, the "useless" part was, more or less, the point I was trying to make.

#43 Neko on 10.17.09 at 2:34 pm

"Do you think high-level languages would run fast if the stock hardware weren’t “brain-damaged”/”built to run C”/”a von Neumann machine (instead of some other wonderful thing)”? You do think so? I have a challenge for you. I bet you’ll be interested."

Yes, I think so. It's a degenerate case of the current state of the world: HLLs *do* already run fast. On hardware designed especially for them, they would presumably run no slower. :-)

#44 Dan Cook on 11.08.09 at 8:46 pm

I found this article because I am designing an object-oriented language for a simple processor that cuts out all the crap in the middle. For example, all variables alias to a static address (including pointers) because that is how the assembly programmers do it.

I just wanted to comment because (1) it thrills me to find someone with my same enthusiasm for this kind of "problem" (and frustration for lack of info on the topic) and (2) I was looking for advice, but if here is not the place for it, I will keep looking (any suggestions where?)

Anyway, here is my conundrum:

IDEALLY, Class objects (think C++/Java) containing n bytes of data each alias to a chunk of memory containing n bytes of data, and nothing more. However, my language (OPIA) will allow for single inheritance, so I need a way to manage late-binding functions. In other words, an Animal object may have a default walk(), but a given Animal might just HAPPEN to be a Snake (a subclass of Animal) and walk() differently — even though that context may never know it's a Snake.

So I have three options: have each (or just some) object also contain a pointer to a function table (or the function itself, if that is the only difference); or I can tell late-binding to bite me; or I can just remove inheritance from the language.

My thinking is to just keep the late-binding and have it be used less automatically, leaving it to the programmer to make efficient code. So if I am going to have these pointers for late-binding, those pretty much as like class-ID tags and allow me to provide an "instanceof" operator (like in Java), but there are issues with that too: (1) asking "B instanceof C" is not just a check of B being a C, but also of B being any SUBCLASS of C (this can be reduced by having all functions and/or function-tables arranged so their addresses make a range, simplifying to boundary check) — I could generate a keyword that JUST checks the one class too. (2) Either some class objects do NOT have any such pointers because there are no redefined functions in them (rendering the instanceof operator unusable on those), or all objects are forced to have a pointer to a table or function (even if it's a dummy) so that the instanceof operator always works. One thing to consider is that my language is VERY loose about type safety; a "void pointer" to an integer might be casted to some other object, and then the instanceof check has potential to incorrectly test true (I REFUSE to put a type-tag on primitive values).

So I guess my question is, what is your opinion on the need for late-binding and/or pulling strings just to make the things that come with it work? My language is to generate the most efficient and low-level code possible without forcing data and operations to carry extra bulk; but if I am going to provide that feature, that is the only way to do it; and if I do, is it right to be partial about it (allowing an "instanceof" to only work on instances of decendents of select classes)?

#45 Dan Cook on 11.08.09 at 8:51 pm

Please email me if you have anything (, or just click my name to see where I am hosting my project

#46 Yossi Kreinin on 11.08.09 at 10:20 pm

@Dan: Why not talk here in the comments? I presume it's because of the lack of notification upon reply. Oh well. I'll answer in an e-mail then.

#47 ken on 12.29.09 at 12:58 am

First, let me note that I work way higher up in the stack, so don't really know what I'm talking about.

My understanding is that performance of most modern software is gated on access to memory. It's the L1 and L2 access patterns that really matter.

Accordingly, modern hardware speculatively faults cache lines into the L2 and L1 caches based on detected access patterns.

..which as far as I know are all based around sequential data access with a stride. That isn't what objected oriented programs do. A typical object has fields, each of which is either a scalar or a pointer. The pointers get followed (sometimes).

How about speculatively faulting memory into the caches based on the assumption that people are pulling in blocks of memory containing pointers that they're going to dereference?

There are clearly some issues.

It's seems undesirable and probably incorrect to take a page fault that otherwise wouldn't be taken. Okay, don't do that.

You may want advisory information passed down about whether a particular read _is_ likely to be pulling in an object, and hence should start pulling in its contents, or whether it's a read that might be part of sequential data access. Heck, maybe you even want to have layout maps of some sort available specifying at what offsets from the original read interesting pointers are likely to occur.

You probably cannot afford to speculatively pull in very much this way, so something has to give. The layout map is one possibility. Another might be to only speculatively dereference only one or two words at the head of likely objects. That'd catch the pointer to the class object in ObjC, for example, which has to be dereferenced for any message sent to the object. Or, I don't know much of anything about how branch prediction hardware works. Perhaps dereferencing prediction could be done. "The last two times you read memory location A, you soon read *A and *(A + sizeof(void*)), so we're going to speculatively dereference them this time."

#48 Yossi Kreinin on 12.29.09 at 1:58 am

Extending speculative memory access to handle indirection is an interesting idea and might yield performance benefits. I don't think it can eliminate the overhead of indirection though, that is, a flat layout will still work faster: a single burst access bringing a cache line will always be handled more efficiently than a bunch of (correctly guessed) burst accesses bringing several cache lines, both because of spending more bandwidth, memory controller queue slots, opening more DRAM pages and other such crud to spell the requests and because of transferring more data (you can't transfer just the words you need unless you don't intend to cache them, which on average you do want to do since you access at least a couple of adjacent words).

The practical thing thus isn't eliminating the overhead of indirection but lowering it; the extent of this is hard for me to guess. I can testify that the current speculative access mechanisms aren't enough to get performance "transparently" out of a memory system; I actually prefer a DMA to an explicitly managed local memory (Cell-style) to optimizing cache access with cache prefetching hints, and caches won't work very well without prefetching hints in tight loops accessing arrays. However this doesn't mean that the more complicated mechanism you suggest would work worse than manual tuning; on one hand, it's, well, more complicated which means it's probably harder to make it work "right", on the other hand the manual tuning you can do in the case of indirection is gnarlier so the human competitor isn't in a good position, either. So it's very hard for me to guess how well it would work compared to the current performance.

#49 Erik H. on 06.17.10 at 4:09 pm

I'm nowhere near you guys in technical experience, but doesn't the Burroughs B-series CPU meet your requirements?

It was made to directly support HLL (well, what was considered high-level at the time), and according to this article, did not have an assembly language in the traditional sense of the word, as it ran Algol code natively. From what I've heard, Lisp ran quickly fast on these cpu's (at least, compared to other architectures at the time).

#50 Yossi Kreinin on 06.18.10 at 4:05 am

I mention the B5000 in the text; the question isn't whether a "high-level CPU" is possible, but whether it's a gain in terms of performance/price. Because if it isn't, our current situation where you implement HLL VMs in software is just fine (except that unsafe, lower-level languages are also available, thus some software is written in them, thus is itself unsafe and/or not to the liking of those liking HLLs). If it is, then someone could presumably made a fortune selling more efficient systems, with higher-level hardware and simpler compilers/VMs, running HLLs and giving you all of their benefits for a lower price than that of a conventional hw/sw stack (that this doesn't happen can be explained away by network effects/consumer stupidity/etc.; I'm looking for a technical proof, or rather at least a rough outline, showing that such a price/performance gain is at all possible though).

#51 Ingo on 07.05.10 at 4:39 pm

Just because nobody mentioned it:

JAVA optimized processor as GPL code :-)

I found this quite interesting since the guy behind this (Martin Schoeberl) has some real interesting ideas how to get better "Worst Case Execution Time" boundaries in the presence of a Garbage Collector, by employing a hardware unit to copy objects (for garbage collecting purposes).
See here:

And he actually implemented all that in a cheap Altera Cyclone FPGA.
I am really itching to try this out :-)

so long

#52 Yossi Kreinin on 07.05.10 at 11:38 pm

gcj seems to beat that one on the author's benchmark, and it doesn't seem to have been compared to modern JIT VMs. It seems that the main potential benefit is the ability to run JVM bytecode in a very small memory footprint – something you can do with neither a compiler generating native code nor a JIT VM; but it's a niche goal, and somewhat tangential to the question of a language's level of abstraction and its implied performance penalty.

#53 David Barbour on 09.09.10 at 7:09 pm

The ability to 'watch' a region of memory and get a signal when it is updated would be very nice for reactive programming models and efficient concurrent garbage-collection.

I'd love to have improved atomic ops (hardware transactional memory, or even just double compare and swap). This would allow more efficient parallelism.

#54 Yossi Kreinin on 09.10.10 at 12:13 am

Watching memory regions – the question is how many. If it's a couple or a dozen, it's dirt cheap and on some targets available today with hardware data breakpoints and/or page table tweaks. If you need to watch an arbitrary number of regions of arbitrary size, then the only way I see that could work is by having a shadow memory keeping a region ID for each word. If the ID is, say, 24b, and the word size, say, 16 bytes, the cost would be <25% memory overhead. The question is what sort of performance improvement you'd expect.

As to atomic ops – I'm too used to coarse-grained conservative parallelism suitable to current targets to be able to easily contemplate the benefits of what you propose.

#55 Johan Gall on 07.10.11 at 4:05 pm


I don't think the debate is still ongoing and I am quite the newbie junior fresh-out-of-school programmer but I would like to put down just a few lines about my ideas.

Nowadays I think the big problems are "memory latency", "energy efficiency", "parrallelism&concurrency".
And the big tendencies are going into lockless programming, STM, Actors, functional programming.

For the base of the core (RISC or x86, such things) I don't have enough experience to give a sensible opinion but here is my take on what should be tweaked on such a base.

User-level VMM (It was adressed on a previous comment) with "aliasing TLB"
-> detailed idea here
The author speaks about functional programming gains on function calls but I think it would massively help GC performance (these are ubiquitous these days).
I think the Azul JVM delves into these things (by playing at the OS level). (papers linked on the site)
Control on this supplementary indirection is something you see in countless concurrent GC papers.

User-addressable cache memory with user-addressable sync instruction. I think we hide too much to hide the hardware and end up having to game it to avoid cache-wise pathological behaviour… However such things would probably end up significantly increasing binary size so I think there should be 2 types of cache memory. The current invisible one for "best effort" and an explicitly addressable-one.
I also think that control on what goes into the cache who also help for minimizing context-switching cost and heavy concurrency.

Some kind of hardware level STM for inter-cache synchronization (and for main mem) but I don't know if it's implementable or if it has already actually been done. That one is hhhheavily debatable…

That's all.

I think too much abstraction on low-level things actually doesn't help HLL compilers and optimizing JIT to produce efficient code. A few things should be exposed, at least experimentally. If it proves successful the old abstraction could actually be removed, that would perhaps simplify the chip too (don't really know…)! (This is a bit like stating the fact that that RPC did not save us from distributed computing hardships).

Best regards,
johan.gall+yosefk >>= google mail service

#56 Alaric Snell-Pym on 09.14.11 at 5:37 am

Oh, dear! Premature optimisation a-plenty in this thread :-)

Here's a plan: Take code compiled from the best HLL compilers, and then run it, profile it with callgrind or gprof or similar, and decide how much time it's spending doing something that might be done faster in hardware. THEN we have a basis to work from.

From memory, dynamic type dispatching is a bit of a killer; when done with "small" value, that boils down to branching between integer, FP, or call-a-handler to implement arithmetic on word-sized quantities (and yes, you need untagged arithmetic instructions available for when the compiler can deduce a type, and for dealing with supplied-untagged data such as 8-bit bytes). When done with "large" values (eg, objects) this turns into multiple-pointer indirections, which is just down to memory bandwidth, and better CPUs won't really help with that.

Things like hardware level STM are handy for speeding up higher-level inter-process communications, but I think they're not quite what Yosef is after here – they'll speed up IPC written in assembly just as much as IPC written in Lisp; higher level domain-specific abstractions are orthogonal to higher-level *languages*

#57 Yossi Kreinin on 09.27.11 at 11:21 pm

Yeah, the memory problem with "large" objects is one reason why RISC is a good-enough HLL target. As to tagging and tag-based dispatching – I guess it really would be an "actual" increase of efficiency (not just moving costs between hardware and software ending up with the same performance per dollar) if you use a small amount of tag bits ("integer, FP, or call-a-handler"), but it only seems to deal with arithmetic operations (at the cost of data bits), and polymorphic arithmetic doesn't sound like a very large part of the productivity gain one expects from an HLL.

#58 4kCheshire on 11.16.11 at 2:13 am

Seems like i mostly agree with 5-Justin.Wick…
90% of problems now not in hardware performance. But in modern OS -
complicated structure, preemptive multitasking (badly working "solution"
to dynamical system extensibility), unsafe-by-design multithreading (see
next sentence), bloated overdesigned libraries, layers on layers…
…note…Just as pointers, threads is oversimplified building brick, and so
not safe. (pointers+GC) is safe and is brick. Meanwhile, inheritance
also unsafe – it's like pointer only in larger time scale, that is why
OOP is suck in large)) Because no architecture-GC in brain)) modular,
AD – was good, OOP – not.
And threads. It is also ONE-WAY dependence and so unsafe in back direction.
See? sure it has problems… many programming problems is SAME.
Uncontrolled dependencies…end-note…
"Constantly working re-simplification of all system" may be way to go))
We need implement not just "more processing power" (special purpose
coprocessors is the way for this and they already exist) but maybe some
essential language-runtime or os divide-et-conq blackmagic to rule all
this power… working on simple compiler-friendly hardware.
There are was many projects for "more high level" hardware…
Running smlnj on bare hardware seems already was done sometimes…
Lisp? is not good because 1) "only lists" is not good (memory bottleneck wise).
automatic indirection is bullshit.
2) strong static typing is a MUST, it really simplifies code reasoning.
Yes, 12-Paul, some assist to garbage collection will be nice from hardware.
This is exactly thing i want to say. not GC coprocessor with own memory
channel)) No, what would be good… is more like mlkit region-based memory
management, only segment-table alike. Think of it as of more sophisticated
MMU unit. memory protection and collection assist. sit and watch adresses.
transactional memory?
Haskell is not good (unpredictable), ML maybe good.
22-DGentry, interesting)
36-CybFreak, words like "good discipline and education could make
programmer use low level"… pfu…that just not work, and will not))
Errors is INHERENT property of ANY human brain, no matter how professional
that 1.4kg bunch of neurons is)) Real professional is just ready for
own errors… We say – "big people make big errors", yea?))

…anyway – who controls the memory, controls the multicore world))
and remember, all there! we not need one more powerful hardware feature))
we need way to cut off this cool but unneeded brain-eating unsafe features!))
so they cant stand on our way to problem solution…
too many unneeded abstractions… no more…please…hhhh…
(silently dying)

#59 Matt on 08.05.12 at 8:31 am

Yes! I'm glad that someone is saying this stuff.

#60 Bike on 08.07.12 at 8:22 pm

I realize this article is four years old, and you've probably seen it already, but: Have you read "Lambda, the ultimate opcode"? It's about a machine that uses tree execution instead of vector execution (i.e. Lisp-like evaluation instead of traditional assembly), and they built a prototype. or

#61 Yossi Kreinin on 08.07.12 at 10:16 pm

I don't claim it can't be done, only that it's pointless. To quote Ken Thompson (from memory) – "Lisp is not a special enough language to warrant a hardware implementation. The PDP-10 is a great Lisp machine. The PDP-11 is a great Lisp machine. I told these guys, you're crazy."

#62 Bike on 08.11.12 at 11:32 am

I was just bringing it up as an example of someone actually making an HLL architecture instead of just complaining about x86. I don't know enough about hardware to know if it's actually helpful (though I doubt it, because I'm not using one).

#63 Matthew H on 09.05.12 at 1:02 pm

So current technology was allowed 30 years to bring it where it is today, and you want someone to go off and develop a viable processor, in isolation, and yet be within 25% of a modern system?

I understand your complaint, I'm in the same boat, but challenging someone else to come up with a solution is not going to produce anything. Especially since anyone who is capable of producing a design that could be implemented, would also be capable of implementing it themselves, i.e. they don't need you to try and produce your interpretation of their design for them.

Your challenge is like those weenies who post in the game forums: "I have the best idea for the next killer FPS, I just need people to make it for me and you will get a cut once the game is making millions…"

I'm sick of weenies asking for others to make things for them, with the only return for blood, sweat, and tears being some thanks and recognition. Why not put up a few million bucks instead?

To me, "current technology", the "Internet", and all the protocols that go with it are human's first-pass at potentially solving a problem. Now the collective "we" need to toss out all the current methods (CPU designs, programming languages (all of them), protocols, etc.) and solve the problem properly. You will not build a better mouse trap if you always start with the same initial design. You have to get away from the influence, step back and actually think differently.

I'm getting older so I like to rant and use analogies, so I'll use one here. I propose that if you tell 10 people to design a new operating system, you will get 5 new versions of Windows and 5 new versions of OSX. Nothing actually new, just more of the same. Projects like WINE and ReactOS prove my theory.

Same with the CPU or computer systems. Designers don't know how to make anything new, and any research that is going on is focused on somehow making CPUs even faster or cramming more cores onto a die, and further widening the already gaping hole between the CPU speed and the rest of the system.

In all seriousness, I actually happen to have a few new ideas of my own directly related to what you are talking about, and I'm starting to kick them around in hardware, but I have yet to fine anyone to discuss them with. Maybe I'll eventually piss someone off enough to spark a good conversation. ;-)

#64 Yossi Kreinin on 09.06.12 at 9:20 am

Well, if you want to discuss your ideas with me, either publicly or privately, you're very welcome. As to "new things" – there's always the question of whether the new things are like the old ones because of inertia or for the same reasons that new wheels are as round as old ones. I believe that CPUs are more like wheels in this regard, and here I tried to explain why.

#65 Mohit on 09.20.13 at 8:46 pm

I'm interested in working on LISP processors, and I do believe that they can process data (I'm thinking statistical data analysis here – thats my domain) quicker than conventional CPUs – I'm still learning though.

As for the challenge, I present this:

#66 Yossi Kreinin on 09.21.13 at 6:10 am

You mean a Lisp Machine paper from 1989? "the tracing of a list with cdr and car"? I'm guessing it wouldn't concern itself with cache prefetching to the extent that's required to make linked lists work reasonably well in 2013.

#67 FT on 03.13.15 at 6:54 pm

Some people mentioned implementing the JVM. It's been done (

There must be a reason were not all writing java for arm embedded systems…

#68 Yossi Kreinin on 03.13.15 at 7:25 pm

In fairness to those arguing for a hardware-assisted JVM, Jazelle's failure doesn't necessarily prove them wrong, it might just be an example of it done badly. (A lot of ideas in hardware architecture fail initially/in some markets and take off later/elsewhere, say, VLIW – failed for Multiflow, failed in Itanium, works great and found everywhere in modern DSPs.)

#69 Bill Torpey on 03.13.15 at 8:02 pm

I don't pretend to be a chip designer, but I was very interested in the Intel 432 when it came out. It failed epically, but perhaps there are some ideas in there that you would find useful:

#70 schlenk on 03.13.15 at 8:47 pm

Maybe of interest. Someone did a Tcl runtime in hardware.

#71 Duncan on 03.13.15 at 8:52 pm

One rather small example that'd help with at least one HLL is a somewhat less specialised call/return instruction.

In modern intel CPUs the call/return is not just a CISCy shorthand but is used to help with branch prediction for returns. Unfortunately the call/return is too tightly tied to the C calling convention to be usable in some HLL compilers. For example in GHC we need to be able to return to some address that is not simply the subsequent instruction. We need to use a return address slightly further on because we store metadata (mostly for the GC) in front of each code span.

Another example would be proper hardware support for checked integer overflow/underflow. By proper we mean not slower in the typical case. (The Mill seems to have an interesting approach to this using NaN/NaRs)

Another one would be a more flexible approach to alignment and pointer sizes. Why can't we have 5-byte pointers and store all our data structures unaligned? 4-byte pointers are too small but 8-byte ones are a waste for most programs. In HLLs we tend to use a lot of pointers and so the cost of larger pointers is much more significant than for C.

Think of the JVM and its pointer compression mode, and then imagine if it could do that more flexibly, e.g. up to 1TB heap using 5-byte pointers.

#72 resistor on 03.13.15 at 11:44 pm

How about "Checked Load: Architectural Support for JavaScript Type-Checking on Mobile Processors"?

#73 Hui Zhou on 03.14.15 at 4:59 pm

It is silly to differentiate low level and high level, it is rather context. E.g. try fix a car, at one context, we are thinking in components, then when we narrow down to a component, we need think about how to remove the stuck screw. What is going on is we only allow limited items at a time — call it high level or low level, within a context, they are all the same.

Now pick 100 lines of code in any current programming languages and you'll find variables, types, functions, expressions, … all weaved together — that is the problem: there is only one context.

Now if we put the car into one context, we have the same dilemma: should we think about the car at component level *only* or screw level?

So what we need is a new programming layer that can allow us to write code one context at a time.

#74 da Tyga on 10.09.15 at 9:36 am

I believe that Alan Kay originally conceived Smalltalk as being a collection of objects sending and receiving messages. That would have implied a multiprocessor architecture of many simple processors and a non-blocking many-to-many communications fabric. Then they broke the concept by shoehorning it into the CPUs of the day. In fact, most early Smalltalk systems were horrendously slow for production use.

Early Lisp systems, e.g. Symbolics, used custom hardware that implemented lists and primitives in the hardware. Again in the 1970's that wasn't a feasible way to go.

The RISC movement made sense while the CPU was the bottleneck. These days, we need 2-3 levels of cache to couple fast CPUs to slow main memory.

So I would suggest that current commodity hardware is fine. I think that compilers are working on too low a level to make optimal use of the resources. If we were to treat current commodity CPUs like microprogrammed CPUs of yore and implement higher levels of abstraction primitives (yes that's a contradiction of sorts) as assembler routines, then cache hit ratios should increase and result in more efficient use of hardware resources.

#75 Oren on 10.09.15 at 10:10 am

How about the "mill"? They claim they can get around 10x performance for normal code, "merely" by doing "almost everything" differently (instructions, registers, memory management, security, caching…).

How about HW that is explicitly designed to efficiently execute Erlang-style actors? Maybe combined with Rust-like explicit ownership tracking?

The down side of any non-traditional architecture is that it is expected to immediately compete with the decades of fiendish cleverness that has been invested in the current approach, and might possibly also require us to re-write all our software from scratch.

Either way it would be a huge investment. And nobody is willing to invest the ludicrous amount of money required based just on the intuition that there is gold at the end of some particular rainbow. Everyone who thinks about it feels very strongly that the gold is there _somewhere_, but nobody is quite certain _exactly_ where it is.

The last serious well-funded attempt was the Japanese 5th generation project and, in retrospect, their rainbow was located _just_ a bit off from the pot of gold they were aiming for.

It will probably take another century or so for us to figure things out :-(

#76 David on 10.09.15 at 10:27 am

Just implement a cellular automaton for Conway's Game of Life in Hardware ('s_Game_of_Life). One Flip-Flop per Cell plus a few logic gates to compute the next state and a global clock routed to every cell. So you get (parallel) computing and memory from one homogeneous design that can be easily scaled towards larger chips (Note: Conway's Life is proven to be turing complete)

I/O interface is implemented by some cells at the boundary of the automaton. E.g. just shoot gliders at them to read/write their state.

Then write a compiler from HLL to a bit-map that can run on the cellular automaton.

#77 Yossi Kreinin on 10.09.15 at 2:57 pm

@da Tyga: Smalltalk is a serial language, multiple processors does nothing for it. Smalltalk is not Erlang. Whether something like Erlang on top of a huge number of wimpy cores could ever be more efficient in terms of compute throughput than today's CPUs is a good question, I tend to answer in the negative because of the cost of communication but I'm not sure, the obvious thing however is that Erlang and Smalltalk inherently limit your compute throughput and you'd want a lower level language sending those messages between cores even if that approach gave you anything and this is why this whole line of reasoning is beside my original point.

@Oren: the Mill, regardless of the validity of their claims, does not target HLLs more than it targets C, does it?

"Competing with fiendish cleverness"? Not really. If Alan Kay has a 1000x better arch he'd be leveraging a lot of that fiendish cleverness – such as EDA tools and chip fabs. At worst the 1000x would degrade to 100x and still he'd steamroll the competition and the reason he doesn't do it is he's full of shit. And I've been doing computer architecture for more than a decade and I can tell you that it's never been cheaper, not necessarily to make a chip but certainly to make a synthesizable core for someone else to manufacture, you don't really need that much manpower and the tools are there and they're great, look for instance at Adapteva's Epiphany (which will remind of you of Erlang but – surprise surprise! it runs C not Erlang whose creator said he "doesn't give a hoot about efficiency" and, well, it's compute throughput was, is and will be in the shitter.) This company is five people who're not as full of shit as Alan Kay and who unlike him put their money where their mouth is.

And as to another 100 years… I think we've figured out long ago many of the stuff that particular people (like hardcore functional programming types) still think is not figured out and I predict that a thousand years won't be enough to convince someone who never bothers to look at evidence to begin with.

@David: surely you're joking; the best part in your comical delivery is "Just."

#78 Josh on 10.09.15 at 4:43 pm

I love the way you sugar-coat everything, Yossi. :D

#79 aaron on 10.09.15 at 4:48 pm

1) a general purpose processor running a general purpose software stack will always be better for anything
2) an optimized system will perform better for a given task.
obviously #2 is possible and worth pursuing.
This trade-off was not discussed.

I sense an over-developed sense of reverence for the arbitrary market-driven decision tree that masquerades as the commmercial side of computer science.

any known specific task for which an application (a system: hardware and software) is being designed will be best approached from the side of defining the actual application functionality and user interface, followed by modeling the functionality and only later partitioning part of it in hardware and part in software, once inner loops and underlying functionality has been determined, performance analysis etc… I'm not a dedicated embedded systems engineer, but they exist, and they quite regularly opt away from the standarm arm-driven-soc options when costs (lower or higher) or performance allow/dictate. While I admit that loading linux on everything is tempting, it's not optimum, in any sense of the word, except in cases where a full-blown multi-user os designed for multi-mode processors is actually required… I digress… but the tone of this piece is dismal.

#80 Yossi Kreinin on 10.09.15 at 4:52 pm

@Josh: I really don't know why I shit-coat everything, at least everything in this thread, Josh, I really don't. I don't even like my original post anymore because it's a bit too combative, mainly, and yet the completely unsubstantiated claims that get endlessly recited wrt this topic somehow never fail to get me worked up all over again. And what I rarely see is an actual simulation of an actual hardware architecture and you'd expect that economics would compel people to do it if they could. But why do I still care? I don't know.

(The Reduceron BTW is one exception and for years I sorta have this plan to write why it solves a non-problem and why it's not ever going to be competitive with a traditional CPU.)

#81 Anonymous on 10.09.15 at 5:19 pm

High-level CPUs are really dumb idea, what I'm interested in is maximizing FLOPS (or fixed point OPS, or imprecise float OPS) and RAM bandwidth per mm^2 of silicon and per Watt.
Consider , it is basically an array of simple processors operating on float-like numbers, 100x more dense than standard single precision FP units.
The problem is external memory bandwidth, but HBM could somewhat alleviate it (not sure if your fab can produce HBM-enabled chip assemblies).

#82 Luis Soeiro on 10.09.15 at 7:15 pm

Hey Yossi, good article.

What if we slighted changed the new CPU architecture goal? I wonder how difficult it would be to design a new ISA, without any major impacts on performance, that could focus on the following goals:

1) Provide the "perception that the CPU is actually running source code". This would be only a perception, of course, because the code would be jited, interpreted, compiled, etc. But it could be implemented in such a way that the user can, at any time, pause any piece of program, look at its source and even make some modifications.

2) Always provide some kind of rollback, idealy dependending only on the amount of memory letf that could be used to rollback to some n-states (source instructions) back.

For implementing #1, one idea would be to somehow always keep a mapping to current line of code and code being executed, or something.

For implementing #2, one idea would be to use some kind of copy on write up to available-memory steps. If you need more RAM for next instruction and none is available, free the farthest rollback step region.

Anyway, that was a good insipration.

#83 EricW on 10.09.15 at 7:40 pm

Maybe you're reacting more to entrenched CPU design mindsets, which HLL designers seem to be doing as well in your examples? Neural nets, graphics processors as GPUs, etc. – we all kind of expect some next big thing to take over, or at least fundamentally shake things up at some point. The brain works too well and is too different (though I suppose airplanes don't operate like birds).

IMO, the central issue is how to best utilize memory. In the bad old days of low pin counts – and who knew how much expensive memory you really needed because you couldn't afford it anyway – memory residing off-chip made sense. These days it's still off-chip, which makes no sense to me, as this demands enormous, complex, and expensive on-chip caches in order to work. In his 2000 book "Robot" Hans Moravec points out that the ratio of memory size to processor speed has remained a fairly constant one megabyte per MIPS throughout the history of general computing, which (if still true) is a very strong argument for fixed size, tightly integrated, on-chip RAM.

I'm pretty sure I'm hopelessly naive, but RISC processors are still way too complex. Integrate a multi-port main memory on the CPU die, toss out the caches, memory manager, branch prediction, TLB, security modes and supervisor registers, etc. and keep the core busy via barrel processing with equal bandwidth threads intercommunicating in the shared memory space. Stick a bunch of these together (a miracle occurs here) and deal with security at a higher level.

I don't think you're biased towards shit-coating. I'm not god's gift but like everything else, 90% of engineering work is pretty much crappily done and ill thought out (big endian anyone? virtual machines?). It's rare to find a diamond in the rough, much less a highly polished one, when it comes to hardware or software or languages or anything really. And all the brainpower wasted on making the hideous Intel architecture 10% faster by doubling the multi-billion leaky transistor count is tragic.

The hardest work out there is critically examining a foundation, putting literally everything on the table, and generating reasonable and workable ideas for basic, fundamental improvement – which is probably why it so rarely happens.

#84 Dan on 10.10.15 at 1:06 am

Implement a chipset capable of supporting urbit/nock/hoon at the hardware level.

#85 Tim Daly on 10.10.15 at 11:54 am

It seems you are focused on the CPU but the bottleneck seems to be memory bandwidth. Why not make memory smarter? Back in the last century there were areas of memory with special semantics (e.g. video memory). Why not make a gig-sized address range with local instructions in the memory array.

Suppose this magic memory (MM) could walk a linked list or sort an array or do a search or add/shift/mask two lines… all without any memory buss traffic except the command/completion signals.

Companies could sell gig-sized special hardware with MM for doing local convolutions for vision.

The new instructions needed that manipulate the various MM blocks could be loaded into an FPGA core in the main CPU.

As a side note, Intel just bought Altera (about 40% of the FPGA market). Once Intel applies its tech the Altera Cyclone will be about 16x as dense. So I expect to see "user loadable instructions" soon.

The advantage of MM is that it lives on the far side of the memory bottleneck. It offloads trivial processing, doesn't use up the caches, and doesn't use memory bandwidth.

And just for grins, if the memory chip had an FPGA on it I could push a whole Forth CPU (like the J1) and a custom program into any gig-sized memory block, let it smoke away on an algorithm, and do other stuff until I got a completion signal. Now my computer can do wildly parallel things (e.g. compression, encryption, convolution, sorting, searching, etc) in multiple memory blocks without burning up the buss.

I'll wait for you to send me a memory I can plug into my machine with the "extra fpga extension".

#86 Don on 10.13.15 at 5:59 pm

I was curious about what Alan Kay mentioned about B5000 saying that it is efficient for higher language implementation. Also it is hard to crash because of hardware bound checks. Added to that, he was mentioning that Intel doesn't really know how to design chips (something to do with how data and instruction regions are treated in their architecture). However, I do not see anyone else talking about these except this blog. Do you intend to talk more on the topic explaining more on the details so more people can understand the issues involved better? Thanks.

#87 Eliot Miranda on 10.16.15 at 5:59 am

Hi Yossi,

compative is not always bad; I like the skepticism in your original post; bravo.

Earlier in the thread "#41 Peufeu on 09.28.09 at 2:25 am" says it best; JITs can do a good job at executing dynamic oopls like Smalltalk (my love), so the issue should indeed be supporting the writing of JITs, at least in part. But also one should support the execution of the code a JIT would like to produce, and support GC.

There is a short history of Smalltalk processors. SOAR led to SPARC (register windows) and its tagged arithmetic instructions, but neither ended up being used in Peter Deutsch's HPS, the highest performance commercially available Smalltalk VM for commodity processors (which my Spur VM is beating by about -40%). So two concrete suggestions from SOAR

1. Tagged arithmetic instructions should neither hardware the tag values nor the tag width and should jump on failure (or better skip next on success) rather than trap. So the instruction should allow one to specify the number of tag bits (forcing them to be least significant irks probably fine) and which tag pattern represents a fixnum.

A key instruction sequence in a Smalltalk JIT VM is the inline cache check which looks like eg

movq %rdx, %rax
andq #7, %rax. ; test tag bits of receiver
jnz Lcmp ; if nonzero they are the cache tag
movq (%rdx),%rax ; if zero, fetch class (index) from header
andq #3FFFFF,%rax
cmpq %rax, %rcx ; compare cache with receiver's cache tag

The x86 provides a handy-dandy conditional move that one would think could eliminate the jump and yield

movq %rdx, %rax
andq #7, %rax. ; test tag bits of receiver
cmoveqq (%rdx),%rax ; if zero, fetch class (index) from header
andq #3FFFFF,%rax
cmpq %rax, %rcx ; compare cache with receiver's cache tag

except some cruel tease in Intel decided to make the instruction trap when given an illegal address *whetger the instruction made the move or not* so it's useless :-(. So if you provide a yummy conditional move, make sure it doesn't trap if the condition is false.

Another tedious operation is the store check. It would be great to have a checked store instruction, again one that used the skip next on success pattern so one can jump to the remembering code, rather than handling traps. A store check would be based on bounds registers, set up infrequently, eg when entering jutted code, that would specify the base and extent of new space and trap if storing an untagged value that lies in new space to an address outside it.

One *really cool* piece of hardware support, that would essentially reduce my new VM to something trivial to implement would be the ability to search memory in parallel for aligned word sized cells that contain a specific value. This means /all/of the heap, not just the part that is paged in, which probably implies no paging ;-). Smalltalk supports become, upon which lots of things are implemented including shape changing instances at runtime. Become is essentially "replace all references to a with references to b" or "exchange all references to a and b", so to add an inst var at runtime, the system collects all instances of a class, creates a new class with the added inst var, creates copies of all instances, with the new inst var having the value of nil, and then "atomically" exchanges the classes and the instances so that all references in the system refer to the new class and instances, leaving the old ones to the GC. The problem is in finding the references. A trawl through the heap is slow. The original Smalltalk-80 added an j direction in the header of each object, and hence one exchanges the indirections, and this design remains in HPS, but the explicit indirection slows down all accesses. My new Spur VM uses "lazy forwarding", turning the original objects into forwarding pointers to copies, and fixing them up when message sends fail (since forwarders don't have valid classes) or when visited by the GC, which is why Sour is exactly twice as fast as HOS in the benchmarks game's binary trees benchmark, but at the cost of /lots/ of complexity. If memory were smart enough to render the search O(1) that would be magnificent.

There are other such ideas (azul's already been mentioned, asplos is a good source for the design ideas around cache management (avoiding the need to zero unmapped pages etc) and cheap user-level traps (although I think snip next/conditional instructions a la arm are way more convenient). But the above are what are pressing from my own experience.

Anyway, interesting and provocative post; provoked me to reply. Good luck, and have fun; this /is/ fun!

#88 Yossi Kreinin on 10.16.15 at 8:27 pm

@Eliot: cool to hear from someone actually working on JIT compilation!

I think the point that I tried to make here and that got buried in the whole HLL angle, and which I then made again, hopefully more clearly (, is that some things have a fundamental cost because they require a certain amount of logical operations to be performed and that will cost you energy, first and foremost (you can trade machine time for machine size to some extent etc. etc. but energy is the simplest constraint to think of because you can't cheat, there's a lower bound on how much energy something is going to cost.) Math and physics limit what hardware can do, basically, hardware support is not magical.

The example among your suggestions where I think this problem is most pronounced is associative memory (searching for a value everywhere in parallel.) Associative processors have been suggested and implemented (I wrote a bit about my understanding of the area here:, but the upshot is, comparing that many bits in parallel costs a lot of energy and a machine doing this very quickly would be fundamentally more expensive than a machine running a programming language that didn't need such a power-hungry feature. And if you wanted your memory to also support standard random access, not just replacing values with other values, then you'd need to make it larger. (Note as well that for that to work in the heap you need a DRAM memory supporting it, and even Intel has hard time shifting the priorities of the DRAM market… so this one at least as I understood it is not within the reach of most chip makers out there even if they all wanted to do it.)

I do agree that a lot could be done in hardware to make things less awkward for a given language – here HLLs are not that different from lower-level languages; if your language supports floating point math you'd want your hardware to support it, and emulating floating point in software will suck, performance-wise, and the same is true for many HLL features. All I meant that some of the extra cost of HLLs – such as more levels of memory indirection – cannot be made to go away.

#89 Craig Millerr on 07.05.17 at 3:33 am

Confront your cheating spouse with evidence, i was able to spy on my cheating ex phone without finding out… really helped me during my divorce …you can contact (CYBERHACKTON@GMAIL.COM) call and text (916) 302-2234 for spying and hacking social networks, school servers, icloud and much more, viber chats hack, Facebook messages and yahoo messenger, calls log and spy call recording, monitoring SMS text messages remotely, cell phone GPS location tracking, spy on Whats app Messages, his services are cheap.. and please tell him i referred you to him he is a man with a heart of GOLD.

#90 Amanda Josh on 09.12.17 at 10:58 pm

After too many disappointments with hackers. O my God i still can believe the world still have legit hackers like this out there. i came across Sabina Roman’s post on a blog where she talked about a hacker whom helped her out with a bank hack so I contacted him to see if its truth because of my experience with some hackers that I had to stop looking for a hacker until I saw the post and I decided to try for the last time. its actually impossible to put in words how much of a Genius he is and also can't stop thanking him for helping me through my divorce case because I needed to hack into my husband’s telephone to get some proof before going for the bank hacking .Jonny Belter is a Black-hat hacker and very capable of any type of hack in the hacking world as they call it. He is actually one of the best out there and also very good to understand what so ever you are going through, in my own case the money wasn't the problem and i can gladly say every money spent was so worth it. I made a promise to him that I must tell the world of his good works and if you are interested in any type of hacking jobs you can contact him through his email hackwithjonny-AT-gmail-DOT-com and please tell him Amanda Josh told you about his good works and I keep saying thank you Jonny belter and God bless you for everything.

#91 janiceduchiq82 on 03.22.19 at 5:17 pm

[color=pink][b]CLICK ON THE BANNER![/b][/color]
She licked him clean,[url=]Hot dog adjustable harness pets collar leopard campo professional out door dog harness hand strap for 6.23$[/url] making sure that all of the cum was gone.She told him not to worry,[url=]2017 bohemian summer jewelry mutilayer beads bracelets bangles for women elastic strand pulseras mujer femme bijouterie 1.51$[/url] just give her a couple of minutes and he could go home.He asked to see her badge again,[b]femjoy olivia[/b] she showed him,[i]black cum in mouth[/i] he saw the name Lindsey Boxer.They hit on her all the time,penis enlargement operation but she almost never goes out,[url=]csi miami wallpaper[/url] you know? It's all about work and money and…me.They stood and moved around the coffee table,[url=]BLUBOO XTOUCH 3GB 4G Smartphone-152.54 $ [/url] sitting on either side of the older woman.Bree put the remainder of her strawberry between her lips and leaned forward,[i]asian cute gay[/i] pressing her mouth tenderly to Lilian's.Love you,[i]album furtado loose lyric nelly[/i] baby.""You,sweet hot sluts too"Grace Finley had been best friends with Willow Morgan for the last six years.At,[url=]bikini blonde hot[/url] say,[b]fingering young pussy[/b] 10 o'clock?""I'll pass that along,[u]beyonce knowles nude pic[/u] Sweetie.

[color=red][b]CLICK ON THE BANNER![/b][/color]

[b]READ MORE[/b]
big dick gallery
hairy mature tit
[url=]adult email finder friend handle login member personals[/url]
gay intercourse
amateur blow gay job
[b]cream fat pie[/b]
[u]tawnee jordan[/u]
his cock was too small
[b]free big tit round ass[/b]
[i]college porn site[/i]
[url=]Roswheel 4 2 4 8 5 5 front bicycle bags panniers mountain road mtb bike top 12.80$[/url]
[u]part racing scat[/u]

[color=gray][b]CLICK ON THE BANNER![/b][/color]

She got to her desk,project voyeur web main entered her password into the computer and said,face sitting top 100 "Let me see your notice."I handed it to her and she said,[b]hot teen bitch[/b] "Okay,[url=]10000 pcs mixing color hydrogel crystal soil bullet soft water gun paintball bullet bullet growing water 3.69$[/url] here it is.When she took a swallow she almost gagged,tiffany teen sex video it was so strong.Lilian was standing at the other end of the couch,ass fine the unplugged phone wire swinging casually in her hand."Aunty Lilian They're…The fantasy drove her pleasure higher,[u]young cross dresser[/u] and she started to pump her hips involuntarily against her aunt's eager thrusts."Yes,hardcore hot moms how does that feel,gay hard sex Baby? Do you like having your aunt's fingers inside you?"Kimmie struggled to speak,[u]britney no pantie paparazzi photo spear[/u] as her body throbbed with arousal. "Y-yessssss…" she finally hissed."Did you enjoy the sight of your cousins fucking each other?""Y-yessssss…""You're a dirty little dyke,[url=]Colares femininos simulated pearl necklace for women fashion gold beads choker necklaces statement jewelry 2016 collier 2.28$[/url] after all,man fucking pussy aren't you? Just like my beautiful girls."Kimmie's eyes shot open.The way she had dominated her sexually,[u]amy miller penthouse pet[/u] forcing her over the edge time after time.Why are they talking about penises?""One of them thought his masculinity was being questioned,first time young sex so he started talking about how big his hand is.""I don't understand.

#92 janiceduchlu86 on 03.24.19 at 3:20 pm

[color=brown][b]CLICK ON THE BANNER![/b][/color]
It was as hard as she could remember,[url=]ejaculation problem[/url] hot and burning. "God,adult free movie xxx hurry,[url=]LBtoys LB85 1 18 2.4G 4WD High Speed Big Foot RC Vehicle Ready-to-run 61.39$ [/url] put it in me,pants skater tight stop teasing or you'll get it big time" she grunted loudly.Her mother dealt with all 'problems' so when she tried bringing them to him,[b]extreme piss[/b] he'd get frustrated very quickly and tell her to just deal with it.She'd started smoking with a friend her senior year of high school and she'd somehow managed to keep it hidden from him until she graduated from college.Shortly afterward they departed,[u]code doujin geass[/u] having finished eating."Oh,pants skater tight by the way,[u]bj and the bear dvd[/u] Sarah.David wasn't very happy about these events,[url=]2 pcs lot 14 smd led arrow panel for car rear view mirror indicator turn signal 0.99$[/url] but that's how things were in their kind of life and upbringing.Both men were big in stature,nude anime pic well over 6ft,[url=]Cheap lace front human hair wigs short wigs for black women human hair bob wig brazilian 92.00$[/url] big boned,latina girl online not an ounce of fat on their muscular but slender bodies.That's scary,adobe acrobat reader 5 " she replied. "So you used to teach.

[color=pink][b]CLICK ON THE BANNER![/b][/color]

[b]READ MORE[/b]
[url=]Original Xiaomi Hybrid Dual Drivers Earphones Mi IV In-Ear Headphones Pro-23.12 $ [/url]
big boy
divestiture literotica
[u]boy boy disco lyric panic will[/u]
[url=]Cubot X16 S 5 inch 4G Smartphone-136.85 $ [/url]
ebony fellatio
free fuck vids
free gay blow job movie
[url=]216Pcs NdFeB Mini 5mm Square Magnetic Block Puzzle Educational Toy for Intelligence Development-15.31 $ [/url]

[color=yellow][b]CLICK ON THE BANNER![/b][/color]

When he woke up after his 'nap' he didn't know where he was,virginia lottery then he found out he was tied to the bed.He turned him onto his side,anime clip free hentai video the 'recovery' position,how to get my wife to have threesome tucked a pillow under his back,[b]boy lane love mandy[/b] made sure he was okay,defloration video clip he left a scribbled note telling him where July was,webdate mobile and left the room in a hurry,[url=]nipples petite puffy[/url] before he too was sick all over.He returned to his room where July was sat on the edge of his bed.Bob and I were even fitted for rental tuxes.Belize was an up and coming Caribbean country and our resort was quite new,[url=]animation yiff[/url] with a series of bungalows,[url=]sexual harassment charge[/url] with each two bungalows sharing a small pool.Like Father McKenna,[url=]SMPlayer 32-Bit 17.3.0 [/url] he would also be upset with her.He'd been an outdoorsman growing up,tranny jerking off so the area was ideal for his hobbies.Craig stood six feet four,amp land tit and had a demeanor that commanded respect.Seeing new places,erotic anime art meeting every-day Russians,[url=]woman with huge boob[/url] eating the local food.

#93 janiceduchit85 on 03.25.19 at 4:15 pm

[color=green][b]CLICK ON THE BANNER![/b][/color]
Our two black voyeurs were back,[url=]Cobbler legend real cowhide leather bifold clutch men s short wallets purses 2016 male id credit 24.00$[/url] standing in front of Jill.I've known that for years but if I don't mention it,[url=]free teen cam chat rooms[/url] someone will leave a comment 'informing' me on the matter so…Oh,[url=]download sex video[/url] I never write about it in my adventures as Callie,[u]babes cum swap[/u] but since I'm coming clean as it were,[i]picture of little penis[/i] let me be completely honest and admit this,[b]silvia saint movie[/b] too.This vicious circle began spinning out of control when they went through the last of their savings just to put food—and booze—on the table as their mortgage and bills went unpaid.They were close to being evicted,[u]babes cum swap[/u] the power had been turned off earlier in the day,bhabhi free indian sex story and her cell phone had also been disconnected.

[color=black][b]CLICK ON THE BANNER![/b][/color]

[b]READ MORE[/b]
[i]cock pissing[/i]
[url=]adult homemade video[/url]
photo of woman peeing
free gay blow job movie
sweet pussy little young
[b]america model next picture top[/b]
teen girl sunbathing
[url=]Cube iWork11 Ultrabook Tablet PC-225.07 $ [/url]
[url=]allthings cfnm[/url]
sexy ass model
[u]hilton home paris sex tape[/u]
[b]356 car porsche pre squirters windshield[/b]
[b]door girl next song theme[/b]
[url=]glory hole picture[/url]
asian gay mem

[color=black][b]CLICK ON THE BANNER![/b][/color]

Then I rested my hand on his knee,[url=]Sun5 uv led nail lamp 48w with lcd timer and bottom makeup nail dryer polish machine 39.99$[/url] and looked him right in the eye.I think he got the message'Well,ass parade com ' he said,[url=]7a peruvian virgin hair body wave 4 bundles unice hair peruvian body wave human hair extension 32.88$[/url] 'I'm about done here now; I'm going home and having a nice cup of coffee.' He told me,free online virtual sex with I hope,[u]asian exhibitionist[/u] was hope in his voice.'What kind of coffee do you drink,[b]free porn review site[/b] ' I asked.It was just so…different."Her violin teacher was a kindly old man,[url=]Yueqi YQ8008 216pcs LED Programmable Bicycle Spoke Light DIY-85.14 $ [/url] but he seemed incapable of offering a compliment that didn't include some kind of 'but' or 'however.' It was what her grandmother called a left-handed compliment and her late father referred to as a compliment with shit on it.He is twenty one,wet latina pussy and I love him to bits,[u]hot pussy picture[/u] always in an Aunty way until a few moments ago.Morgan could only groan,[url=]mature natural pussy[/url] she was already admitting to herself that she was going under,[url=]North Korea fires first test missile of Trump era [/url] he was going to fuck her,barenaked lady jingle bells she had nothing to fight him with,free young teen porn clip her body had already rebelled and was going with it.He kissed his way down her supple yielding body,cum face dipping and swirling his tongue in and around her belly button,[u]gay teen jerking off[/u] his fingers still playing havoc in her pussy.She was desperate for release,[b]hardcore fucking people[/b] and her cousin had her teetering on the brink.Bree slid her free hand down Kimmie's body,movie swinger and started to rub furiously at her aching clit.She had a pretty face,[u]mom vette vicky[/u] but no one could ever see her body because it was always covered in something that would have made a paper bag look fashionable.Staring into his sparkling eyes,[url=]free hentai pic[/url] which seemed to be searching the depths of her soul,[url=]Onemix men s sport running shoes music rhythm men s sneakers breathable mesh outdoor athletic shoe 92.00$[/url] she unfastened her skirt,[url=";]cloud red[/url] letting it pool around her feet,[url=]blacks on blonde lesbian[/url] revealing a pair of silk panties tightly gripping her crotch.Jackson couldn't seem to work his mouth as he stood there opening and closing it.When she screamed and he didn't come,[url=]flashing dog collar[/url] that's when she'd panicked.Mall security checked every video camera in the entire shopping center looking for any sign of the seven-year old boy with dark hair wearing a blue jacket and tennis shoes.He put on the robe and felt ridiculous when he saw himself in the large mirror. "I look like some kind of freak,[url=]Cobbler legend real cowhide leather bifold clutch men s short wallets purses 2016 male id credit 24.00$[/url] " he told himself as he saw a boy in a girl's bathrobe."Okay,frog sex picture and movie I'm comin' back in,[url=]teen pregnancy story[/url] " she said tapping on the door.She was well-informed,[b]nude olsen pic twin[/b] she actually listened,[url=]Nylon pet led dog collar night safety led flashing glow led pet supplies dog cat collar 3.62$[/url] and most of all,[url=]next wifey[/url] she cared.

#94 janiceduchic84 on 03.27.19 at 11:50 am

[color=green][b]CLICK ON THE BANNER![/b][/color]
She had defended herself admirably against him,free defloration video attempting desperately to protect her marriage.She stayed where she was,woman peeing in diaper somehow thinking stupidly,[url=]2pcs led door warning light for vw logo projector for volkswagen vw phaeton passat b5 13.32$[/url] that he would stop.He didn'tHis hand was starting to drift higher and lower,naked hunk snaking up,free bdsm picture then snaking down.Annie didn't know what to do,big black boob movie preview her son in law was caressing her legs,[url=]com pinkpornstars[/url] in a slow seductive way,[b]vegas show girl costume[/b] and she was nonplussed by it.She shifted her stance a little,[url=]upskirt sexy pantie[/url] he took his hand away,teen first bra but then it was back on her calf again,[url=]teen asian babes[/url] Annie sighed.All the while he was chatting to Jenny and her,gang bang pic his warm hand was making her shiver.What the fuck is he doing,[url=]New hot delicious 100pcs giant onion seeds organic vegetables for interest diy 0.51$[/url] doesn't he know what that's doing to me,[b]going poop[/b] Annie thought to her self?Her body was heating up,[url=]Vkworld T1 Plus 4G Phablet-144.36 $ [/url] it began to behave in the treacherous way it always did,[url=]sexy wife clip[/url] when certain types of men toyed with her.That's when his hand,[url=]exhibitionist masturbation[/url] was now at the back of her exposed knee's,[url=]Teclast X3 Pro 2 in 1 Ultrabook Tablet PC-476.17 $ [/url] his fingers and nails,[url=]Colrovie woman blouses korean styles fashion blouses women streetwear pink mock neck frill trim long sleeve 14.97$[/url] tickled her,[u]nutrition pregnant teen[/u] and lightly scratched them,vibrating cock ring it was definitely one of her danger zones.And because her right hip was against Paul,[i]frog sex video[/i] he couldn't be seen doing what he was doing.But Annie couldn't make herself step away and return to the other side of Jenny's bed.She was effectively trapped,[url=]busty bikini girl[/url] but by her own feelings.Katy moaned loudly,[url=]big clit sex[/url] her hips rose to meet her invader,funny adult cartoon picture to give back what she was receiving.Ricky thumped his prick back and forth,[url=]Bolsos mujer fashion women shoulder bag duffle bags canvas girl pink vs secret luxury packets traveling 33.90$[/url] Katy was incoherent in her sexual desperation,biggest loser poppi her needs to be fucked were paramount desire,fat hardcore fuck nothing mattered more than having a prick securely lodged in her,shit video it was all she wanted in life at this very moment in time.Ricky gave no thought to her,[b]pinoy erotic story[/b] he wasn't interested if she got hers or not,hot sexy horny woman all he knew was,chick dick hot small he had a beautiful woman under him,[url=]free henna tattoo design[/url] and he was fucking her,asian bondage free and he would fuck again too,[b]going poop[/b] any time he wanted to.The resemblance may not have been as strong as Trisha's to Kate's,quickoffice symbian keygen but it was unmistakable: the crystal-clear blue eyes,[url=]Xiaomi Intelligent Air Purifier-253.88 $ [/url] their almond shape,ass and pussy spread the high cheekbones,[url=] discus index.html[/url] the nearly perfect symmetry of her face.He was overcome with emotion,[url=] discus index.html[/url] he cried in front of them,huge blonde dick it was the most emotional moment in his young life.They all understood,nude anime pic and cheered him.He calmed down,rear entry sex position his mother by his side,clitoris finding and then they partied,[u]double dildo orgasm[/u] song and dance were the order of the night.At 10pm his mother called the room to order,[url=]non nude asian[/url] champagne was dished out,[url=]free lady nude picture[/url] a toast was made by his best friend.And then Val shouted to every one. 'Right you lot,[url=]2016 summer new fashion elastic hairband for women hair accessories girls elastic hair bands children hair 1.00$[/url] out to the front of the hotel,girl just want to have fun movie I have a small surprise.'They all trooped out wondering what was on.As they all stood around Craig and his mother,virginia lottery she took something out of her bag and pressed it.Right in front of them a car came to life,[url=]Amazing Sky Star Laser Projector Lamp Starry Night Light Powered by AA Battery-3.39 $ [/url] lights blinked and it beeped,[i]hentai mom movie[/i] doors unlocked.

[color=blue][b]CLICK ON THE BANNER![/b][/color]
[b]READ MORE[/b]
door garage liftmaster
[url=]5 colors pigment eyeshadow palette eye shadow powder metallic shimmer makeup beauty profissional make up warm 3.32$[/url]
home made porn .com
[url=]orgy gallery[/url]
[url=]defloration video clip[/url]
[url=]4 colors face pro concealer palette cream corrector contour mixiu brand bronzer primer makeup camouflage concealer 4.99$[/url]
teen sucking cock
[url=]free small cock[/url]
teen tgp mpeg
[url=]2017 women long wallet multi functional umbrella zipper female purse wallets handbag clutch card holder top 5.19$[/url]
com pinkpornstars
dangerdave free

[color=blue][b]CLICK ON THE BANNER![/b][/color]

As she slid back and forth,[b]free preteen model picture[/b] she gripped the older woman's tits tightly in both hands."Oh,toon manga porn fuck…That night she worked hard at being 'Beverly,young lesbian strap on ' she made dinner,[u]anime kinky sex[/u] put her son to bed,[i]car part race sale used[/i] chatted,lesbian wet kiss had a small glass of wine,disney in porn and turned her back on her husband when they went to bed,comic digimon sex she was saving herselfThe next morning after a surprisingly good night's sleep,[b]top black female model[/b] she saw her husband off,[url=]Amazing Sky Star Laser Projector Lamp Starry Night Light Powered by AA Battery-3.39 $ [/url] took her son to school,[url=]midwest express airline flight[/url] then dashed home to ready herself for her lunch time date with Mark at 12 noon.McClain?""Alpine Dentistry,asian child model " he told her."Yeah,barenaked lady jingle bells they don't exactly have the best reputation.Eva had my head in a vice like grip.I went into automatic,[b]dudes jerking off[/b] I needed no further encouragement.She had knocked gently on his door and gone into his bedroom,[url=]real mature tit[/url] he had turned over to see her,allstar fan fiction and he was topless.She'd never heard the term "RICO" before but after everything they'd jointly owned was confiscated by the government,asian dildo she became intimately familiar with it.They don't know how permanent the damage is yet,[b]free preteen model picture[/b] but I may never be able to walk again."Steve was stunned by Sybil's words.I would drink a beer or a glass of wine at home,[i] girl of maxim[/i] but never at a social event."Henry,[url=]LETV Leeco One X600 32GB ROM 4G Phablet-152.21 $ [/url] we have to go to this dinner dance for the library foundation two weeks from Friday.With a groan,cock video gay she climbed out of bed and padded quickly toward the door.

#95 janiceduchye83 on 03.30.19 at 8:19 pm

[color=green][b]CLICK ON THE BANNER![/b][/color]
So I wants to learn to talk,[url=]Pulseras fashion gold silver cuff bracelets bangles for women men jewelry female charm bracelet pulseiras bijoux 3.67$[/url] you know,[url=]Leagoo Shark 1 6.0 inch 4G Phablet-186.48 $ [/url] like she do and like my president and he wife do.""Okay,89 mustang pic that makes sense.Got it?" Kelli asked."I think so,[b]live web cam girl[/b] " Kendra replied. "This is just all so new to me.I promise,[b]sex with blowup doll[/b] " she assured him.He paused then said,mature mom lesbian "Do I know you?"Crystal was absolutely beaming as she said,[i]free hentai sex game[/i] "Of course Please have a seat.

[color=pink][b]CLICK ON THE BANNER![/b][/color]
[b]READ MORE[/b]
[b]jennifer aniston nude ass[/b]
free my friends hot mom
[u]naked college man[/u]
ass latin black
[b]fucking in lesbian shower[/b]
teen in aqua thong
[u]denver used car price[/u]
[b]porn raw retro[/b]
mexican pussy porn
[url=]real young girl[/url]
[b]jennifer aniston sex tape[/b]

[color=brown][b]CLICK ON THE BANNER![/b][/color]

She wore a bra,[url=]akon dog ft snoop[/url] but there was certainly no anatomically-necessary reason to do so.Why aren't you pursuing it full time? The better question is has my uncle introduced you to Kalen McElroy yet? If not,asian black chick dick in monster that's criminal.""I know who he is,[u]free blow job movie sample[/u] of course.Trust me." Then I added,[u]japanese school girl thumb[/u] "So I guess if we weren't related I'd be happy to learn you'd want to go out with me."Nancy laughed and said,male adult toy "I know you're only 18 but I bet I could get you into Pete's Tavern You definitely look like you're 21 and besides,young big tit I know the bartender.I mean,[url=]JXD 509G 5.8G 2.0MP Camera RC Quadcopter-90.51 $ [/url] what's not to like? They're smarter,[url=]latina hottie[/url] wiser,[u]video porn de paris hilton[/u] more mature,[url=]blonde free nude older pic[/url] and better looking than girls my age,[u]hentai porn teen titans[/u] " I replied.So,[url=]bare cameron diaz naked[/url] Emma is it? Where do I know you from?" he said."Oh,free nude wrestling diva pic I'm sure we've never met,clit to clit Mr.Williams?""I done heard you does tutoring and I wants he'p.""Help with what exactly?" he asked."Okay,[url=]PURIDEA I2 Wireless Bluetooth Speaker 10000mAh Power Bank with Mic and LED Flashlight Support Hands-free-33.29 $ [/url] I was sittin' in my old crib watchin' Oprah when my boyfrien' was out.She started to answer and he said,clitoris peircing "Look.So the norm is to have eight or nine months of weather where dark,arab sex saudi gray,[b]picture of hot guys[/b] raw days and temperatures range from 35-55 predominate.My personal fetishes aren't always another person's cup of tea but judging by the many favorable comments I've received in private emails,[url=]ASUS Zenfone Pegasus 3 X008 4G Smartphone-190.84 $ [/url] at least a few of you enjoy reading them.That's because the Cascade Mountains,[url=]Hillary Clinton: April Ryan patronized by Sean Spicer [/url] which cut the state in half from the Canadian border through the state of Oregon,body art tattoo pic serve to block the clouds that form off the Pacific Ocean and Puget Sound from moving further east.

#96 janiceduchvc83 on 04.01.19 at 4:41 am

[color=purple][b]CLICK ON THE BANNER![/b][/color]
Were she to stay in that environment,my wife love big black cock what she'd learned was essential to her survival.You'll feel even more complete once we get you in this dress and in a pair of heels with some jewelry to set everything off,bangers big michelle tit " Kelli assured her."So let's have you try on this undergarment but first we're gonna have to find a way to uh,[url=]camcorder from hsn samsung[/url] tuck your 'package' so that it can't show through at all.I heard dad say,[url=]free sex toons[/url] "Look who stopped by,[b]nude oil woman wrestling[/b] Cal.Oh,[url=]bare cameron diaz naked[/url] Kelli,blonde fuck fest this is SO much fun""You're very welcome,[url=]huge cock video[/url] sweetie.Trust me." Then I added,carrier strep throat "So I guess if we weren't related I'd be happy to learn you'd want to go out with me."Nancy laughed and said,[u]horny girl next door[/u] "I know you're only 18 but I bet I could get you into Pete's Tavern You definitely look like you're 21 and besides,[url=]arab sex saudi[/url] I know the bartender.And Candy…or Connie…or Casey…or whatever…was not only there but she was…perky What can I get you for breakfast?" she'd asked him in that 'I'm sooo glad you're finally awake again so I can say mindless,bikini blonde hot milf sexy wife annoying things to you' voice.'"What time is it?" he asked her rhetorically.

[color=blue][b]CLICK ON THE BANNER![/b][/color]
[b]READ MORE[/b]
[i]2001 dirty joke[/i]
virginia dmv
[b]intercourse with dog[/b]
[i]lyric adam sandler hanukkah[/i]
[b]small cock gay[/b]
[url=]ASUS ZenFone 2 Laser 4G Phablet-273.29 $ [/url]
[url=]50 pcs bag iris flower iris seeds rare bonsai phalaenopsis orchid flower seeds 18 colors perennial 0.88$[/url]
[u]free hentai sex movie[/u]
snow blower tire chain
[b]porn raw retro[/b]
fhm cover girl
[u]latin girl gone wild[/u]

[color=red][b]CLICK ON THE BANNER![/b][/color]

When are you free?" he asked."Me? Oh,bobbi billard web cam my goodness.On second thought,[url=]xxx may pokemon[/url] you may actually be even more rattled until you get used to looking at the image I'm seeing which is well—amazing You look as much like a girl as any girl I've ever seen and I'm not exaggerating.As Zack took her hand,[b]hot little redhead[/b] she placed the other behind his shoulder.Take care and have a nice day."Casey ripped off his shirt,[url=]milf seeker janine[/url] threw it on the bed,[u]stripper[/u] then pulled on her clothes and stormed out slamming the door behind her.That smells so good,[url=]latina hottie[/url] honey,[u]teen titans hentai xxx[/u] " Kelli told her. "Now let's make you beautiful"For the next half hour,adult lesbian personals Kelli helped shape Kendra's eyebrows,[url=]ash pikachu[/url] then did her half of her face with liquid foundation that made Kendra's face look smooth and ready for the rest of her makeover.She called me over to watch and although I'm no musician,[i]fucking old pussy[/i] I have to say I was very impressed.""Oh.Like all of us (with the exception of Loren),[url=]Ulefone Be Touch 3 4G Phablet-149.14 $ [/url] Nancy was poor.I promise,[url=]cover letter resume sales sample[/url] " she assured him.I mean,[url=]conner kiss lesbian pic tara[/url] what's not to like? They're smarter,[url=]Car air vent outlet stent diamond airfreshener aroma diffuser for car fragrance januar10 0.51$[/url] wiser,[i]dog mating video[/i] more mature,[b]lady pissing[/b] and better looking than girls my age,dog collar and leash " I replied.Admitting these things had been hard,[url=]UMI TOUCH 4G Phablet-155.75 $ [/url] but once she realized no one was to blame; that this was just how life was where she was raised,fuck bitch then moving on became a whole lot easier.How could she know I hated being called by my full name and that no one had done that since junior high? As much as I disliked that,aqua comedy force hunger teen tv I didn't say anything.And he was almost never wrong.Yet when it came to dating,free shemale sex video Braden had long since tired of just fucking women he found physically attractive yet he kept right on doing the same thing somehow expecting different results.This is Ombre Rose,[i]free hentai sex game[/i] my favorite perfume.

#97 KevinMub on 04.04.19 at 11:26 pm

Hi there,

That's quite an interesting website I've got into, while was surfing a web. Not much, just some [url=]real amateur sex[/url] [url=]best hd amateur porn[/url] [url=]amateur porn upload[/url] [url=]best amteur porn[/url] here :)


#98 mina on 04.05.19 at 10:58 am

As soon as I noticed this website I went on reddit to share some of the love with them.

#99 Eden on 04.14.19 at 2:34 am

Howdy! This is kind of off topic but I need some help from an established blog.
Is it hard to set up your own blog? I'm not very techincal but I can figure things
out pretty fast. I'm thinking about making my own but I'm not sure
where to begin. Do you have any ideas or suggestions?
Many thanks

#100 KevinMub on 04.19.19 at 12:36 pm


That's pretty stunning site I've got into, while was surfing a web. Nothing much, just some [url=]submit amatuer porn[/url] [url=]best site for homemade porn[/url] [url=]amateur home porno[/url] [url=]porno amatuers[/url] [url=]teen amatuer porn tube[/url] found :)


#101 distance moving companies on 05.03.19 at 7:29 am

Hey exceptional blog! Does running a blog like this take a massive amount work?
I have no expertise in computer programming but I had been hoping to
start my own blog soon. Anyway, should you have any ideas or techniques for new blog
owners please share. I understand this is off subject nevertheless I just wanted to
ask. Thanks!

#102 Trimble TSC3 Brochure on 05.07.19 at 5:01 pm

Have you ever thought about adding a little
bit more than just your articles? I mean, what
you say is fundamental and all. But imagine if you added some
great graphics or videos to give your posts more, "pop"!
Your content is excellent but with pics and video clips,
this blog could undeniably be one of the very best in its niche.
Amazing blog!

#103 ScanStation C10 Manual on 05.09.19 at 6:08 am

Greetings I am so grateful I found your webpage, I really found you by error,
while I was researching on Aol for something else, Regardless I am here now and would just like to say thanks for a
marvelous post and a all round thrilling blog (I also love the theme/design), I
don’t have time to browse it all at the moment but I have book-marked it and also included
your RSS feeds, so when I have time I will be back to read much
more, Please do keep up the superb jo.

#104 resharper crack on 05.15.19 at 5:06 pm

Enjoyed examining this, very good stuff, thanks .

#105 Isaac Linard on 05.16.19 at 6:08 am is solid, well-researched content. I just forwarded this on 5/15/2019 to a colleague who has been involved in some work of their own on this subject. To say thanks, he just bought me dinner! So, let me express my gratitude by saying: Thanks for the drink!

#106 Kelwhanda on 05.30.19 at 6:17 pm

Propecia Spermiogramma Levitra Bei Erkaltung [url=]viagra online prescription[/url] Mail Order Medications From Canada Clomid 200g

#107 belize travel guide  on 06.02.19 at 7:05 am

I have to thank you for the efforts you have put in writing this website.
I really hope to view the same high-grade content from you later on as well.
In fact, your creative writing abilities has inspired me to get my own, personal blog now ;)

#108 on 06.03.19 at 2:23 am


#109 on 06.03.19 at 4:03 pm

[url=]fruity loops free download crack[/url]

#110 Berry Tell on 06.04.19 at 8:02 pm is an excellent read. I just now passed this on 6/4/2019 to a coworker who has been involved in a little research of their own on this subject. To say thank you, he just bought me dinner! So, I should probably say: Thank you for the meal!

#111 on 06.04.19 at 8:48 pm

[url=]norton internet security serial key[/url]
[url=]acid pro 6.0 download[/url]
[url=]windows 7 32 bit ultimate activator free download[/url]
[url=]windows 7 directx version[/url]
[url=]dj music mixer free[/url]

#112 on 06.05.19 at 1:41 am

[url=]avg driver updater key list[/url]
[url=]clion free version[/url]
[url=]windows 8.1 all updates download[/url]
[url=]how to downgrade iphone 4s without itunes[/url]
[url=]malwarebytes anti malware premium lifetime activation key[/url]

#113 Juan Zervas on 06.08.19 at 12:19 am

6/7/2019 Love this site– very informative and a lot of stuff to think about!

#114 Kelwhanda on 06.10.19 at 2:12 am

Real Progesterone Pills Visa No Prior Script Viagra Se Vende Con Receta Medica Viagra Generika Rezeptfrei [url=]xenical[/url] Clonidine Overnight

#115 Lesangent on 06.10.19 at 5:01 am

Airmail Pharmacy Uk [url=]cialis online[/url] Clomid C Est Comment Real Pfiefer Viagra Cialis 20 Controindicazioni

#116 Marybeth Grissett on 06.11.19 at 1:56 am is an excellent read. I just now passed this on 6/10/2019 to a fellow student who's been involved in some work of his own on the topic. To show his appreciation, she just bought me lunch! So, I guess I should say: Cheers for the meal!

#117 Keven Emlin on 06.13.19 at 9:19 pm

I'm pleased by the manner in which covers this kind of issue. Usually to the point, often polemic, consistently well-written and also stimulating.

#118 aimbot fortnite download on 06.17.19 at 3:43 am

Enjoyed examining this, very good stuff, thanks .

#119 SEO on 06.17.19 at 6:09 pm

Hey There. I found your blog using msn. This is a very well written article.

I will make sure to bookmark it and return to
read more of your useful info. Thanks for the post. I will
definitely comeback.

#120 epidian 652 on 06.17.19 at 6:10 pm

Thank you for another informative blog. Where else could I am getting that type of
information written in such an ideal method? I've a venture
that I am simply now working on, and I have been at the look
out for such information.

#121 먹튀검증 커뮤니티 on 06.18.19 at 5:57 am

You are so interesting! I do not suppose I've truly read anything like this before.
So great to discover someone with some unique thoughts on this subject matter.
Really.. many thanks for starting this up. This site
is one thing that is required on the web, someone with a little originality!

#122 먹튀검증 커뮤니티 on 06.18.19 at 7:09 am

I know this if off topic but I'm looking into starting my own weblog and was wondering what all is needed to get setup?
I'm assuming having a blog like yours would cost a pretty penny?
I'm not very internet smart so I'm not 100% sure.
Any tips or advice would be greatly appreciated.
Thank you

#123 poseidonhd on 06.21.19 at 8:19 am

Does your blog have a contact page? I'm having problems locating it but, I'd like to shoot you an email.
I've got some recommendations for your blog you might
be interested in hearing. Either way, great blog and I look forward to seeing it expand over time.

#124 poseidon hd on 06.21.19 at 8:39 am

Hi there! I just would like to give you a big thumbs up for your excellent
info you have here on this post. I am coming back to your web site for more soon.

#125 Kelwhanda on 06.22.19 at 8:00 am

Como Afecta Propecia buy accutane in usa [url=]buy generic cialis[/url] Prix Du Cialis En Andorre Buy Super Active Viagra Online Free Over The Counter Mediformin

#126 Lesangent on 06.23.19 at 12:00 pm

Viagra Cialis Danger Is There Penicillen In Keflex [url=]cialis no prescription[/url] Priligy Vente Uk Buy Now Online Zentel

#127 Fiskeolie til hunde on 06.26.19 at 7:25 am

I really liked your blog article.

Leave a Comment