How about trying to implement Parrot's VM?
http://www.parrotcode.org/
Exactly: /how/ should I go about implementing Parrot's VM? :) And why
would it be more efficient than running it in software on a standard
RISC core hooked to standard RAM?
I'm not saying there are no good VMs for HLLs, just that implementing
them in hardware wouldn't increase the overall system efficiency
(performance/price).
How about adding processor-level instructions for regular expression
matching, or for higher-level operations used in regular expression
matching? I'm not a processor guy, so I don't know if that's just a pipe
dream, but wouldn't it be faster to run almost any text-processing
language (think perl, python, ruby) if fewer instructions are required
to perform the comparisons? This might be impossible, but then again ...
maybe it's just really, really difficult.
What about FPGAs? We run a soft core on a cheap FPGA at low MHz but
get lots of acceleration from "coprocessor" cores that were compiled
from C code. Instead of developing custom chips for one specific HLL
maybe we'd like some FPGA coprocessors that can be loaded up with
task-specific logic as needed.
You mention image processing – I remember using image processing
boards in the late 80s that had FPGA front ends because the PCs were
just too slow. Maybe it's time to resurrect that approach.
First of all, if one's going to try to implement a virtual machine
(seems that would defeat the purpose of it being a VM, but hey whatever
works), something like JVM or LLVM would likely be much more productive.
Of course you're left with the problem that they are essentially just
slightly higher level than a normal CPU, nothing drastic. And Yossi is
right, the trick is – how would you even do this?
My question is this – there was once a time when memory hierarchies
were shallow enough that many if not most programs were CPU bound.
Except in tightly pipelined systems, cryptography, and a few cases where
datasets are so small that they fit into on-chip cache, this no longer
seems to be the case. For that reason, I cannot possibly understand what
performance benefit we'd see from implementing a higher level
instruction set – unless it prevents us from accessing memory in a
cache-incoherent manner. Anyone have thoughts on this?
I think Yossi would agree that it might be nice to have a standard
CPU architecture that was easier to generate efficient code for (x86 is
a bitch, and it's intended successor EPIC was so much harder that the
scientists I worked with had to code entire chunks of program in
assembly, or suffer 300-400% performance hits easily). Indeed, if
anything modern compiling has taught us it's that reducing code size
(and therefore instruction cache misses) is a big deal. Perhaps an
architecture that had operands which map to higher level (but still
fundamentally "C-like") constructs, and then compiled them down to
internal RISC opcodes might burn some silicon to save IC misses, as a
form of code compression.
We already live in an age where (for some applications) dynamically
compiled languages like Java can be *faster* than statically compiled C
(if you don't code like a slob, and have twice the RAM you'd need in C)
due to runtime analysis, hotspot optimization, etc. CPU cores idling,
just waiting for something to do while they fetch from RAM. If anything,
what we need is not a better CPU, we need libraries, languages, and
compilers that support dynamic program transformations to encourage
cache-coherency. Get the data accesses in your LISP or Java program
clustered appropriately, and you'll see significant gains. Move away
from the "arrays are contiguous blocks of RAM I index directly" paradigm
of C, into something more along the lines of "arrays are fast O(1)
containers for things I index via integer keys, which may be transformed
at run-time depending on the optimized internal structure."
http://www-03.ibm.com/systems/z/zaap/
(Specialty hardware for running Java?)
Does that count?
I have no idea what the zaap does – but that's almost what it appears
to be..
I'd agree that high level languages certainly have their cost on von
Neuman machines, but you're obviously failing Yegge's test for common
sense. You're forgetting that we can make computers out of things other
than switches. There have been general purpose analog neural networks
like the Intel ETANN and Bellcore CLNN32/64. There have been digitally
implemented neural networks like the IBM ZISC036. In fact I believe the
ZISC036 meets all the requirements of your challenge. (See: http://www.it.hiof.no/prosjekter/hoit/html/nr1_96/zisc036.html)
If you are arguing that the von Neumann machine is likely the best
computer you can build based on switches and current knowledge, then I'd
agree.
Efficiency is a wide field. Efficiency in what dimension, for what
problem? There have been situation where DNA and enzyme based computers
are able to solve problems 100,000 times faster than it would have taken
a state of the art von Neumann machine.
Transistors, as you point out, have had a lot of effort put into
their optimization, but I think that Yegge's whole points was that there
exists technologies that when optimized will far exceed the capabilities
even the best swtich.
Great article, I'll get some beer and wait ...
Okay, I'll bite.
Here's my submission: http://www.davidbrunton.com/2008/01/submission-to-high-level-cpu-challenge.html
Now I'll sit back, also with a beer, and wait for my free chip ;)
As someone who has written a high level language, the major expense
I've seen is for hash table lookups (for dynamic binding) which take
many cycles. I've been told that associative memory hardware could
reduce these lookups to a single cycle – is this true?
Interesting post. What about Lisp machines? http://en.wikipedia.org/wiki/Lisp_machine
As I understand it they were special purpose hardware to run Lisp
programs faster than the general purpose hardware of the day.
Since I like the idea of side-effect-free HLLs, I'd like to see
hardware support for reference-counting, because side-effect-free
languages cannot create cyclic data structures. As far as I can tell,
very little research has gone into making reference-counting faster
(probably because people have always assumed that we would need to
collect circular structures), but it seems like an ideal problem for
hardware to solve. Assuming that the first field of an object is its
reference count, you'd have instructions like "Copy and increment
referent" and "Decrement referent and jump if zero". Much of the
slowness of side-effect-free work could also be eliminated by not
copying objects if their reference count is about to become zero (a
common case when simulating side-effects — the optimization turns the
simulated side-effect into a real one).
As I understand it, one of the problems with reference counting is
that it writes to memory a whole lot. An associative cache for reference
counts could alleviate that. Other reference counting tricks, like
one-bit reference counting or weighted reference counts, could become
much faster with hardware support.
While I'm dreaming, it'd also be nice to see better tools for
lightweight parallelism. All the silicon that's spent on deep pipelines
and sophisticated branch prediction and register scoreboarding could go
to increasing the number of cores, provided compilers took advantage of
them. I have to admit that I don't have a clear idea of what the
instructions for ultra-lightweight parallelism would look like, but the
potential usefulness is limited only by the amount of time overhead
needed to ascertain the number of free cores and start them working on a
slice of the data. The compiler can ensure that they don't need to touch
the stack for the duration of the data-level parallelism.
Specialized cores that can only add, subtract, and multiply integers
would be exceptionally useful, because lots of work just doesn't involve
floating-point math. (http://bloggablea.wordpress.com/2007/04/27/so-does-anyone-even-use-all-these-darn-cpu-instructions/)
Alternately, imagine a set of cores, each with a dedicated bus
connected to dedicated memory (in addition to main memory). The memory
could be addressed such that the cores are interleaved at the word
level. One could drop an array into that area of memory and have all the
cores operate on it in parallel. I feel like I'm describing something
that already exists, though.
I feel that creating hybrid processors, based on von Neumann
machines, but with special optimizations for Haskell-like languages,
will produce the fastest machines for HLLs of the future. Our universe
is very skewed towards the von Neumann model, and sometimes our problems
map to it nicely, but sometimes they are much more like lambda
calculus.
dbrunton and stevedekorte beat me :(
(oh, and szetlan is saying the same thing steve is maybe without
realizing it...)
ideally, the answer is cellular automata, i read someplace that the
reason von neumann used unified memory was to facilitate code
modification in automata... i wonder what he'd think about modern
computer viruses...
i think "associative memory hardware" gets you more than half way to
solving the HLL challenge and is a lot more practical than a cellular
automata hardware implementation (as i understand it, current chip
fabrication and wiring make 3D matrices pretty cost ineffective.)
if one is going to go through the trouble of a hardware neural
network implementation, (which doesn't actually solve the challenge at
all,) they might as well just go the full monty and implement cellular
automata, the two aren't really that much different; i think cellular
automata are technically more space efficient.
FPGAs aren't really an answer; they are arrays of one bit adders, as
many others before me have pointed out. in other words very stupid, very
slow, 2D cellular automata.
so, in some regards i think new fabracation techniques and the old
standby-never-quite-implemented-efficiently associative memories and
cellular automata will be the "solution"
incidentally, phase change memory is well suited for all of the above
things... as well as other existing 3D fab techniques.
I like the Niagra2. The onboard ethernet has a chunk of silicon that
lets it queue up events on different cpu's depending on different
filtering criteron, thats extremely smart when you're using 8 massively
wide cores as stream processors/request processors off some 10G feeds.
To some degree it turns the computer into a multicomp, and lets each cpu
keep its own memory (aka avoid shared memory). Add in a crypto processor
capable of saturating the 10G link and you can replace half your
networking gear. Going back to the massively wide architecture and using
something like HT is very smart as well, basically providing a bunch of
functional units and then trying to keep them all fed. If the question
is "how do you scale out," Sun has some very good answers, and most of
them appear on die.
Bonus q: how do you feed 8 extra-wide cores? Four dual channel banks
of FBDimms. Nice.
The spagetti stack hardware architecture of the Burrough's matches
closely to continuations, which we're seeing more and more of for
implementing asych event handling routines. This would potentially be
excellent for dealing with highly concurrent systems.
I'm no expert on the Burroughts, but I dont think the burroughs was
particularly a CISC v. RISC fight, no more than the x86's CS, DS, SS,
ES, you could implement most of its architecture in RISC form. Its just
that the "segment registers" notion was take na little higher level with
the Burroughts, to let the processor know a little more about the stack
and to let the stack grow in different ways. The most glaring
counterclaim would be having to chase down the stack and decend lexical
levels to dig up another variable, but there's plenty of caching
opportunities here, X86 adds its own TLB's all the time to deal with the
same class of problems for segmentation and indirect lookups. Who knows,
its probably a terrible idea, I certainly dont know enough about
Burroughs like systems to judge, but I think it would be really
fascinating to try and make an object oriented cpu or a spagetti stack
cpu or something other than the segmented stack machine. (PPC has some
interesting potential with its inverted page tables)
Although it would still essentially be a von Neumann computer, I
would really love to see a CPU which is a graph reducer and implemented
SK combinators.
Has this been done before?
It would need to support GC explicitly, and some other memory and
math operations, but implementing a functional language on this sort of
architecture would be a breeze, obviously.
My understanding is that when the JVM needs a chunk of blank memory
the system carefully loads the previous contents of the memory to cache,
which the VM then clobbers. A design of cache/memory architecture that
eliminated the redundant read would improve performance.
Not my idea, though I wish I could take credit for it.
It's called the Reduceron2.
http://www-users.cs.york.ac.uk/~mfn/reduceron2/
"The Reduceron is an extremely simple machine, containing just 4
instructions, and executes core Haskell almost directly." <- from the
website.
Pretty cool, does this meet the criteria for someone (though not me)
actually doing something instead of just talking about it?
Of course it does! Reduceron is a pretty cool project.
I'm in the process of chasing the various links people sent and
digesting everything. I'll publish a summary of the interesting stuff
when I'm done.
[...] – Alan Kay, leĂdo en The “high-level CPU” challenge [...]
More than 10 years ago, I saw a proposal for a processor with one ALU
and 128 program counters and sets of registers. The intention was that
it would be exceptionally good at implementing multi-threaded
applications. Unfortunately, suitable software was quite lacking in that
era. However, the theory is very good.
If you design a processor which is extremely multi-threaded then do
not need to implement, deep pipelines, speculative execution, branch
prediction, out-of-order execution (and reconciliation) or register
shadowing.
High level languages are very problematic for pipelines. This is
especially true when running object oriented languages. The problem is
that method calls are dependent on the instance being handled. A tree or
network of classes have some common methods. Additional methods may be
lazily loaded during execution. This arrangement hinders compilation
techniques. It also affects pipelining.
A common case is a method calling a method of another class. This
process can involve multiple memory accesses. Each method call also
requires a jump to an address which is unknown until very briefly before
the jump is made. This is problematic for performance if you have a deep
pipeline which decodes instructions over many clock cycles. It is
possible to defer write operations to increase read throughput. This can
reduce immediate impact but it increases complexity and creates
additional opportunities for pipeline stalls. Furthermore, it doesn't
change the fact that your ALU is idle between method calls.
Instead, it would be more productive to have a very large number of
hardware threads. ALU, registers and instruction set could be designed
with minimal complexity. It would be possible to implement eight simple
ALUs, each with eight sets of registers. Therefore, you'd have eight
threads per ALU. Threads within a group would be allocated processing
time with strict preference. Each group is run out of phase.
One or more dedicated instruction caches are not required because
instructions can be retreived from secondary cache within eight cycles.
If access to main memory takes 20 cycles then a thread will block and
two lesser threads within its group can be run. Additional memory
accesses will cause additional threads to block. Regardless, throughput
from the cache should be optimal and one or more ALUs will be in use
almost constantly. However, unlike contemporary designs, this remains
true when running OO and bytecode programs.
I'd suggest implementing a two address machine. This allows a rich
but compact set of instructions to be loaded, cached, transferred and
decoded efficiently. I'd implement a subset of the complexity of a
TMS9900 or a MC68000 because it would give the best throughput versus
complexity. If you stripped the instructions which required microcode
then the remainder would easily complete within eight cycles.
You'd probably want a symmetric design, for simplicity. Regardless,
it would be possible to have differing ALUs, registers and instruction
sets within one implementation. One or more ALUs could be optimised for
DSP operations. One or more ALUs could be optimised for bit
operations.
This idea works very efficiently with existing software. It doesn't
require fancy compilers. It doesn't require exceptional bus bandwidth. I
just wish it was mine.
Hyper-threading (lots of reg sets per ALU) has its problems:
1. Some workloads are hard to parallelize
2. Some workloads are ALU-bound
Basically, AFAIK you hit diminishing returns at the reg set #4-#5 –
the chances to increase actual throughput aren't high enough to justify
the die size.
Hyperthreading doesn't remove the cost of HLLs, although it can help
fill the stalls in the pipeline, sufficiently reducing this cost;
however, it won't help in the (probably frequent) case when your vtables
or whatever it is that causes the indirection are already in the cache
(unless the core has a high load-from-cache latency, which is a problem
by itself).
I read this article a couple weeks ago, and started writing a
response. It grew large enough that I decided to post it as a separate
article rather than in this tiny little comment window.
A summary of the proposal is a second TLB intended to be managed
directly by application code, allowing cheap copy-on-write mappings for
functional language argument passing.
http://codingrelic.geekhold.com/2008/05/high-level-cpu-response.html
You should update your article, because the Lisp type system does not
work the way you think it does. SBCL, a commonly-used Lisp system, does
refuse code that does not pass its type checking (both static and
dynamic) and does compile away the checks on code it knows is safe
(unboxing is done as well). As for Lisp ever being uglier than Java:
Macros buy me so much in terms of expressiveness Java seems like a bad
joke. That, plus the fact Java doesn't seem to have a workable type
system, clinches it for me.
Peter Seibel "should" update his book then. Quote:
"... to make Common Lisp compile add more or less like a C compiler
would, you can write it like this:
(defun add (x y)
(declare (optimize (speed 3) (safety 0)))
(declare (fixnum x y))
(the fixnum (+ x y)))
Of course, now the Lisp version suffers from many of the same
liabilities as the C version–if the arguments passed aren't fixnums or
if the addition overflows, the result will be mathematically incorrect
or worse."
I guess it has something to do with the (safety 0) thingie. It has
the advantage of yielding assembly code identical to that of a
statically typed C version, but is apparently even less safe.
Can SBCL be better than what the CL standard apparently mandates
implementations to be, specifically, be as fast as C due to explicit
typing or type inference without sacrificing safety? I dunno. SBCL
shootout benchmarks run ~2x slower than C. And regarding "beauty"
compared to C/Java/... – (the fixnum (+ a b)) when a and b are already
declared as fixnums is something you wouldn't have to do in C or Java.
SBCL benchmark implementations also have their share of declare and
convert calls. Not to mention the (declare (optimize...)) from the
book.
That macros buy expressiveness doesn't contradict anything of what I
said. I didn't diss Lisp, praise Java or claimed Lisp was linguistically
inferior to Java (ha!). I claimed the following:
1. Implementing support for Lisp in hardware doesn't reduce system
cost compared to an implementation on a "traditional" machine and an
optimizing compiler.
2. Lisp intrinsically costs more than a low-level language like C;
neither hardware nor software can hide it completely. It doesn't mean
you should use low-level languages, it means just what I said.
Also, Lisp with all the declarations you'd make in C with all the
limitations you'd have in C isn't Lisp with types, it's C with sexprs
(I'd like to have C with sexprs, but that's another matter). That is
hardly provable/refutable though; it's a personal preference/perception
thing.
P.S. Java has a "workable" type system since quite a bunch of systems
were delivered with it. It doesn't mean I want to work in Java; by the
test above, C++ is "workable", and in case you didn't notice, I despise
C++. If people get work done in it, it's "workable". But that's just a
side note.
hey, dont take my money.
Well, I think that Alan Kay and many more are actualy pointing out
bad cpu designs.
Two examples are branch prediction and memory caches.
I rather have Broadband Engine Cell type of proccessor where each
core has private scratch memory and DMA between those and main memory.
Cache coherrency proplem and branch prediction penalty disapear and less
silicon real estate is taken up by cache circuitry. Add propper hardware
level subroutine handling (a la Forths returnstack) and possibility of
fitting more code into constrained scratch memory gives you more
functionality per kilobinarybyte. Add register file and rstack DMAing
and you got yourself a faster task switching. And if you could add fpga
matrix as "one core" on the chip that is an bonus too ;-)
Reminds me of http://www.loper-os.org/
I remember seeing a presentation a while back about a multi-core
system that had two stacks per core. and 60 cores (or something like
that). It was sort of like Forth in hardware. I can't remember where I
saw it, but it might be on google video or something if you want to
search (but I can't remember the guys name...)
You're correct that most features of a high level language wouldn't
be accelerated with special hardware. Specifically, type tagging and
microcoding are a big lose. As another commenter pointed out, modern
processing performance is memory bound, not CPU-bound. If you can keep
your computations "on-chip," then you're better off with a simple RISC
implementing those calculations rather than special-purpose hardware.
The only minor win is that RISC might have slightly more icache
bandwidth required to fetch the additional instructions. Otherwise, keep
the core simpler. This is why tagging isn't a big deal. Tags can easily
be stripped/added/compared while keeping everything in registers.
I'd point out that there is no reason that Lisp can't run as fast as
C, even without type declarations. Modern dynamic compilation can
produce type-specific code at runtime that will be highly optimized for
the actual types in use. Most function parameters are of the same type
on each call can therefore run at full speed. See Andreas Gal's work on
trace compilers, for instance. With multicore, you could even dedicate a
core to handling the compilation tasks in parallel, as code continues to
run.
Now, all that said, one area where high level languages do burn a lot
of cycles is in memory management. In fact, all languages tend to burn a
lot of performance in memory management, but HLLs suffer more because
they tend to put more in the heap and they are typically garbage
collected. The issue of what goes in heap is a compiler problem and I
won't say anything about it. GC, however, can be made to operate in
parallel with special hardware. This does not necessarily mean full HW
GC. Much of the actual GC algorithm can still be implemented on a simple
software co-processor, but the special hardware allows you to intercept
memory references issued by the main CPU to keep the memory model
coherent (think read and/or write barriers that operate in parallel and
simply stall the main CPU for a cycle or two when needed, until the
memory state becomes coherent again). This eliminates comparisons and
branches associated with every read/write (depending on when you
implement the barrier). It also eliminates, or greatly reduces any
pauses caused by "stop the world" GC requirements. Things like
tracepoints can also be optimized. Finally, a lot of the
performance-robbing GC bookkeeping can be reduced/eliminated (think
card-marking and back-pointer checking).
I'm still going to handwave here, but it seems like this area would
definitely help HLL performance. In the same way that everybody
implements the V-to-P translation of modern virtual memory systems in
hardware using a TLB, this would speed up memory specifically for the
requirements of HLLs.
Finally, I'll just point you an email conversation between Cliff
Click at Azul and Dave Moon and Dan Weinreb, formerly at Symbolics. They
discussed the hardware GC support of the Azul hardware Java processor.
It definitely seemed like a win over stock RISC.
Sorry, forgot the reference: http://blogs.azulsystems.com/cliff/2008/11/a-brief-conversation-with-david-moon.html
I understand that when reclaiming old memory Java wants to zero
it.
And that current CPU's load the old contents from slow memory into
cache just so it can be zero'd.
What about a way to not load current memory contents into cache and
just write to it?
I mean, the best of high-level architecture if Intel iMAX-432.
It is too complex, but it can be simplified.
To 'just write to memory' wont work. You have to ensure that all
processor cores have a consistent view of that memory space first, this
means synchronizing caches among other things. If I just zeroed out the
main memory, other processors could still be using data in their L2/L3
caches which they think is fresh, not realizing that another processor
on the system has declared the area dead. Equally, they could then write
their contents back over my newly zeroed area. So given that you have to
pause and evaluate the contents of all other core's caches, you may as
well go ahead and just load the area into cache, write it and then use
the standard cache consistency algorithms to handle the writeback rather
than having to implement and debug a special case.
Personally, I'd just love to see people target the correct
architectures when they compile stuff. Sure, i386 and i586 will work on
i686, but it also prevents the use of any modern features at all. The
move to 64bit is less useful for the expanded address space than it is
for doubling the register count (to 16, half of what a POWER core has)
and forcing people to target an up-to-date architecture.
If I had my way, the world of processors would be a lot more RISC
than currently. I just wish I could afford a Power6 or SPARC system to
play with. I'm also keen to see how the hardware scout feature of the
SPARC Rock series processors turns out.
I think the problem is that you want to solve the wrong problem. As
you said, high levels of indirection are costly, and you want to keep
the access of flat memory structures fast.
However, IMHO the image processing application is an extremely
specific example. Not very many things are stored in flat structures
these days. Even databases are flat only as long as you only don't want
massively parallel access, only want to read them, etc.
The point is: real world data is often (way more often than not)
composed of complicated data structures full of indirection. Modern
"killer applications" are not about sequential, predictable operations
over a structured array; instead, they're about complex methods of
updating these deep and indirect data structures – and indeed, this is
the main and natural domain of these HLLs you mention.
We can keep specialized CPUs or architectures for sequential
operations (in fact, we already have these; they're called GPUs); and we
should forget about optimizing our main CPUs for a case of flat
data.
I want to create HLL implementations that will work efficiently on my
customers' many cores in the years to come but I cannot buy a manycore
system today so I cannot test my ideas. I want robustly-deployable
binaries of immersive graphical applications but today's separation
between the CPU and GPU requires graphics drivers that are a major weak
point in the reliability of the whole system, massively increasing our
support costs and even forcing us to abandon product lines.
My proposed solution: an affordable manycore CPU today.
Many cores executing an existing instruction set: can reuse existing
OS and compiler technology like Linux and LLVM. Good programmable
graphics capability from multithreaded software without an error-prone
graphics driver.
On a separate note, concurrent garbage collection is becoming
increasingly important and is already a major performance bottleneck. I
have no idea how this might be done but perhaps dedicated hardware can
help speed this up? Also, work stealing queues (e.g. Cilk, Microsoft
Task Parallel Library) are an excellent general-purpose solution for
efficient parallelism. Perhaps they could be built into each core in
order to lower the overheads?
I agree with author we don't need new hardware for heigh level
languages. What we really need, it is start thinking more efficiency.
Good discipline and education could make programmer use low-level or
"difficult" languages as they now using heigh level languages.
"If you use Lisp in the Lispy way that makes it so attractive in the
first place..."
Then you only need to optimize the tiny part of your program that's
the *bottleneck*. It's a lot easier to rewrite a function or two from
Lisp into (slightly different) Lisp, than it is to write it in C and
interface it and link it in (at least, in any HLL I've ever used).
"“A large project in C would implement the Lisp run time?” Oh really?
You mean each variable will have the type LispObject (or PyObject or
whatever)?"
No, he means you end up with a half-dozen specialized ad-hoc MAPCAR
implementations, and a half-dozen specialized ad-hoc FILTER
implementations, and so on. Every large C project I've ever seen does
this. To counter-challenge: I'd be curious to see one that didn't.
Your challenge is a good one, if you could constrain it a bit (is
there a good simulator or something we should use?). But JWZ didn't say
anything about changing machine architectures. I don't know why you
think his quote is at all related to the others, or to your subject.
As has been said in other comments: primitives that help GC on a RISC
core could help. Also, allow cheap thread creation without going through
the OS, so it would be feasible to run a sequence of a few hundred
instructions on another core.
Performance of dynamic languages depends for a much greater part on
smart compilers, jits, and dynamic optimizers that take the actual
runtime values of data the program is handling into account. To do that,
the program has to be profiled, which on current hardware allways has
some performance impact (or at least is a tradeoff between profiling
accuracy and performance). How about one core sending hardware generated
profiling data to a second core that can do the analysis, so there is no
runtime cost to profiling (as long as enough cores are available). If
this happens, profiling everything, including production use, could
become the norm. This in itself does not speed up anything, but I think
it would improve the opportunities for both compilers and developers a
lot.
But a lot of the current performance challenges for HLLs probably
need to be solved with software (jits etc) instead of hardware.
@Somejan, you say "Also, allow cheap thread creation without going
through the OS, so it would be feasible to run a sequence of a few
hundred instructions on another core."
I don't think this would help much. Thread creation is such an
OS-specific, or even language-specific in the case of Erlang, operation
that I don't think you could include it as the level of the CPU.
I'm a believer that GC *assist* is the right answer, but it's only
assist. Different languages and operating systems are going to want to
manage GC in a slightly different way. The best you can do is implement
some of the more painful primitives and let them use those, as opposed
to putting a full GC algorithm in hardware itself.
[...] who believe that a CPU ought to be designed specifically around
the needs of high-level languages: Do you think high-level languages
would run fast if the stock hardware weren’t "brain-damage... My
challenge is this. If you think that you know how hardware and/or
compilers should be designed [...]
If you have billions of dollars and the worst ISA ever designed, you
can still make the fastest CPU (but you'll have to waste some watts and
some silicon). See Intel versus Mototola PPC.
Anyway : I'm not going to take your challenge, because executing HLL
in hardware is not only impossible, it is useless.
The future of HLL is JIT compilation and particularly specializing
JITs. In theory this can be faster than C code.
Consider some code. Perhaps it's some vision processing code, or some
big numeric analysis code, or some video encoder. In this case, what you
want is big fat DSP power with big SIMD registers, big memory bandwidth,
lots of MAC units, etc. And your code is going to be written in C if not
handwritten assembly.
But most probably it won't be your code : you'll be using a FFT
library from Matlab or LAPACK, or something. In any case it's not going
to be written in HLL, it'll be written by "bare metal" guys.
So, a HLL cpu makes no sense in this case.
Now consider the rest of the code on earth, like a database, a web
server, your usual Python script, etc. Most likely there is no DSP power
needed, and most likely this code is control-flow bound, that is to say
the execution time is dominated by if's, with their friends,
mispredicted branches and non-prefetched memory accesses. Especially HLL
interpreters contain mostly conditionals : resolve and dispatch next
bytecode, etc.
More specifically, let's look at PostgresSQL. It does verify
Greenspun's Tenth Rule of Programming, since it contains explicit
interpreters for a few languages (SQL, plpgSQL, Python, Perl, etc) and
implicit interpreters for a few ad-hoc languages : for instance your SQL
query, once parsed and planned, is turned into an internal
representation, which is nothing more than a HLL program written in a
language with specialized instructions like "read tuple", "add tuple to
hash join", "compute expression", "test condition", etc.
Any random bit of code that contains lots of conditional statements
is, in effect, an interpreter for an ad-hoc language.
So, back to specializing JITs.
I believe an essential attribute of a CPU is that it should make the
job of the compiler easy, so that the compiler can generate good, fast
code, if possible almost optimal code, very quickly. This is essential
if the compiler has to run in real-time, as is the case of Just-In-Time
compilation. x86 is the complete opposite of this since the ISA is
shit.
Which is why recent x86 CPUs all have hardware JIT compilers that
JIT-compile x86 code into some secret internal representation. Talk
about silicon waste.
You need an ISA with good uniformity (any register is just as good as
any other for anything), and either lots of registers, or stack-based
(which quite simplifies register allocation since there is none). You
also need conditional instructions to handle short tests without jumps.
You need branch hints from the compiler.
And you need a specializing JIT. What it does is interpret your HLL
code, while compiling it. However compiled HLL code doesn't necessarily
run faster than interpreted, since you can't make lots of assumptions,
for instance in Python, variables can be of any type, so wether you
compile or interpret, you'll still have lots of tests and dispatch to
know if a+b are strings or integers. You could use static type checking,
but would you want to use it EVERYWHERE ? You could end up with D (which
is good) or C++ (urgh).
Here comes the specializing JIT. When it detects that a bit of code
is running frequently, it will look at the types of the variables, then
emit specialized code. For instance it will replace all "+" by floating
point machine instructions, remove boxing/unboxing of all local
variables, etc, then execute the generated machine code instead.
You could have used D and got rid of the interpreter, but that
doesn't solve your next problem, which is that you are going to write an
interpreter many times in your program (see postgresql case above). So
you might as well use the same language your program runs in and have
the JIT turn it into machine code.
The next step for the JIT is to look at the values of variables. For
instance this bit of code is often called with foo = 42. So we can
remove a big switch-case on foo, remove a few ifs, and all that's left
of the function is 5 instructions, which we inline in the caller. Then
we notice the caller calls the same function with foo=43 many times too,
because the caller is some kind of ad-hoc interpreter like a SQL query
executor, so we JIT it, and we propagate this up the stack as needed,
and we get a custom-made super fast program.
Suppose you have to read data from a database. Data is stored on disk
in a special format, which depends on lots of parameters (how many
columns, their types, etc). You want to compute average(a million
values). The on-disk format parser is an interpreter for an ad-hoc
language, and it's going to be called a million times with the same
program. JIT it to 5 machine code instructions ! The query executor
which takes your data and forwards it to the function which calculates
sum() is an interpreter for an ad-hoc language, and it's going to be
called a million times with the same program. JIT it to 5 machine code
instructions !
You get the idea.
If you want a revlutionary jump in performance on most workloads, you
ain't gonna get it with a new CPU. For instance it takes postgres about
700 cycles to process 1 row if you want sum(a column), which is already
amazingly optimized given the complexity of the stuff. It requires an
extremely smart CPU though, to be able to predict all these
conditionals. You could get easily a 10x speedup by JITting all the
ad-hoc interpreters, though.
So a "HLL" cpu shouldn't be HLL at all, it should make the job of a
specializing JIT easi, fast, and efficient : generate good machine code
quickly. Maybe a few helper instructions are needed, like some optimized
checks to see if the assumptions under which the JITted code was
generated still holds, but that's about it.
They you'll need a good JIT, the ability to hint it (for instance by
type hints, or "value doesn't change" hints), and the ability to
generate code, very quickly, using the same language as you used to
write your program, so you can get rid of all the ad-hoc interpreters.
Or keep the ad-hoc interpreters and have the JIT simplify and customize
them by removing all the conditionals, if you can hint it about a
parameter that is always a constant type or value. Or both. If you make
a parser for a special binary format, you might as well generate code.
It is very important that this integrates smoothly in your program : you
must not have one main language and one JIT language. See Lisp.
.NET has Lightweight Code Generation. That's cool.
Next thing, parallel processing. Multicores are great for this,
however we have the big problem of synchronization overhead. CPU
executes 2.5 billion cycles per second, but it takes something like 1000
cycles to grab even a slightly contended lock. WTF ?
Your CPU is going to need some hardware synchronization like :
- Explicit lock and release memory range (ie cache lines) (who needs
test and set ?) with fast re-acquisition by whoever needs it next. Once
it releases the lock, put it back NOW in the L2 so the next one can grab
it quickly, don't keep it in L1, duh !
- Transactional memory to get rid of locks : most often, the action
you need to do under the lock is simple, like reserve some buffer space
+ increment a few counters. Remove the lock and do it in a transaction.
If transactions can only touch a few cache lines, it's not that complex.
If another transaction updated said cache lines, retry. This needs
software support.
- Ability for userland to block interrupts while in a critical
section so it doesn't get scheduled out while holding the lock. Since
that is an usurpation of userland power, only give the ability to block
interrupts for, say, 100 cycles.
- Lock waits should take no resources, so the other thread which
HyperThreads on the same core can use 100% of the resources.
- Fast atomic counters
"I’m not going to take your challenge, because executing HLL in
hardware is not only impossible, it is useless"
Well, the "useless" part was, more or less, the point I was trying to
make.
"Do you think high-level languages would run fast if the stock
hardware weren’t “brain-damaged”/”built to run C”/”a von Neumann machine
(instead of some other wonderful thing)”? You do think so? I have a
challenge for you. I bet you’ll be interested."
Yes, I think so. It's a degenerate case of the current state of the
world: HLLs *do* already run fast. On hardware designed especially for
them, they would presumably run no slower. :-)
I found this article because I am designing an object-oriented
language for a simple processor that cuts out all the crap in the
middle. For example, all variables alias to a static address (including
pointers) because that is how the assembly programmers do it.
I just wanted to comment because (1) it thrills me to find someone
with my same enthusiasm for this kind of "problem" (and frustration for
lack of info on the topic) and (2) I was looking for advice, but if here
is not the place for it, I will keep looking (any suggestions
where?)
Anyway, here is my conundrum:
IDEALLY, Class objects (think C++/Java) containing n bytes of data
each alias to a chunk of memory containing n bytes of data, and nothing
more. However, my language (OPIA) will allow for single inheritance, so
I need a way to manage late-binding functions. In other words, an Animal
object may have a default walk(), but a given Animal might just HAPPEN
to be a Snake (a subclass of Animal) and walk() differently — even
though that context may never know it's a Snake.
So I have three options: have each (or just some) object also contain
a pointer to a function table (or the function itself, if that is the
only difference); or I can tell late-binding to bite me; or I can just
remove inheritance from the language.
My thinking is to just keep the late-binding and have it be used less
automatically, leaving it to the programmer to make efficient code. So
if I am going to have these pointers for late-binding, those pretty much
as like class-ID tags and allow me to provide an "instanceof" operator
(like in Java), but there are issues with that too: (1) asking "B
instanceof C" is not just a check of B being a C, but also of B being
any SUBCLASS of C (this can be reduced by having all functions and/or
function-tables arranged so their addresses make a range, simplifying to
boundary check) — I could generate a keyword that JUST checks the one
class too. (2) Either some class objects do NOT have any such pointers
because there are no redefined functions in them (rendering the
instanceof operator unusable on those), or all objects are forced to
have a pointer to a table or function (even if it's a dummy) so that the
instanceof operator always works. One thing to consider is that my
language is VERY loose about type safety; a "void pointer" to an integer
might be casted to some other object, and then the instanceof check has
potential to incorrectly test true (I REFUSE to put a type-tag on
primitive values).
So I guess my question is, what is your opinion on the need for
late-binding and/or pulling strings just to make the things that come
with it work? My language is to generate the most efficient and
low-level code possible without forcing data and operations to carry
extra bulk; but if I am going to provide that feature, that is the only
way to do it; and if I do, is it right to be partial about it (allowing
an "instanceof" to only work on instances of decendents of select
classes)?
Please email me if you have anything ([email protected]), or just
click my name to see where I am hosting my project
@Dan: Why not talk here in the comments? I presume it's because of
the lack of notification upon reply. Oh well. I'll answer in an e-mail
then.
First, let me note that I work way higher up in the stack, so don't
really know what I'm talking about.
My understanding is that performance of most modern software is gated
on access to memory. It's the L1 and L2 access patterns that really
matter.
Accordingly, modern hardware speculatively faults cache lines into
the L2 and L1 caches based on detected access patterns.
..which as far as I know are all based around sequential data access
with a stride. That isn't what objected oriented programs do. A typical
object has fields, each of which is either a scalar or a pointer. The
pointers get followed (sometimes).
How about speculatively faulting memory into the caches based on the
assumption that people are pulling in blocks of memory containing
pointers that they're going to dereference?
There are clearly some issues.
It's seems undesirable and probably incorrect to take a page fault
that otherwise wouldn't be taken. Okay, don't do that.
You may want advisory information passed down about whether a
particular read _is_ likely to be pulling in an object, and hence should
start pulling in its contents, or whether it's a read that might be part
of sequential data access. Heck, maybe you even want to have layout maps
of some sort available specifying at what offsets from the original read
interesting pointers are likely to occur.
You probably cannot afford to speculatively pull in very much this
way, so something has to give. The layout map is one possibility.
Another might be to only speculatively dereference only one or two words
at the head of likely objects. That'd catch the pointer to the class
object in ObjC, for example, which has to be dereferenced for any
message sent to the object. Or, I don't know much of anything about how
branch prediction hardware works. Perhaps dereferencing prediction could
be done. "The last two times you read memory location A, you soon read
*A and *(A + sizeof(void*)), so we're going to speculatively dereference
them this time."
Extending speculative memory access to handle indirection is an
interesting idea and might yield performance benefits. I don't think it
can eliminate the overhead of indirection though, that is, a flat layout
will still work faster: a single burst access bringing a cache line will
always be handled more efficiently than a bunch of (correctly guessed)
burst accesses bringing several cache lines, both because of spending
more bandwidth, memory controller queue slots, opening more DRAM pages
and other such crud to spell the requests and because of transferring
more data (you can't transfer just the words you need unless you don't
intend to cache them, which on average you do want to do since you
access at least a couple of adjacent words).
The practical thing thus isn't eliminating the overhead of
indirection but lowering it; the extent of this is hard for me to guess.
I can testify that the current speculative access mechanisms aren't
enough to get performance "transparently" out of a memory system; I
actually prefer a DMA to an explicitly managed local memory (Cell-style)
to optimizing cache access with cache prefetching hints, and caches
won't work very well without prefetching hints in tight loops accessing
arrays. However this doesn't mean that the more complicated mechanism
you suggest would work worse than manual tuning; on one hand, it's,
well, more complicated which means it's probably harder to make it work
"right", on the other hand the manual tuning you can do in the case of
indirection is gnarlier so the human competitor isn't in a good
position, either. So it's very hard for me to guess how well it would
work compared to the current performance.
I'm nowhere near you guys in technical experience, but doesn't the
Burroughs B-series CPU meet your requirements? http://en.wikipedia.org/wiki/Burroughs_large_systems#B5000
It was made to directly support HLL (well, what was considered
high-level at the time), and according to this article, did not have an
assembly language in the traditional sense of the word, as it ran Algol
code natively. From what I've heard, Lisp ran quickly fast on these
cpu's (at least, compared to other architectures at the time).
I mention the B5000 in the text; the question isn't whether a
"high-level CPU" is possible, but whether it's a gain in terms of
performance/price. Because if it isn't, our current situation where you
implement HLL VMs in software is just fine (except that unsafe,
lower-level languages are also available, thus some software is written
in them, thus is itself unsafe and/or not to the liking of those liking
HLLs). If it is, then someone could presumably made a fortune selling
more efficient systems, with higher-level hardware and simpler
compilers/VMs, running HLLs and giving you all of their benefits for a
lower price than that of a conventional hw/sw stack (that this doesn't
happen can be explained away by network effects/consumer stupidity/etc.;
I'm looking for a technical proof, or rather at least a rough outline,
showing that such a price/performance gain is at all possible
though).
Just because nobody mentioned it:
http://www.jopdesign.com/
JAVA optimized processor as GPL code :-)
I found this quite interesting since the guy behind this (Martin
Schoeberl) has some real interesting ideas how to get better "Worst Case
Execution Time" boundaries in the presence of a Garbage Collector, by
employing a hardware unit to copy objects (for garbage collecting
purposes).
See here:
http://www.jopdesign.com/doc/nbgc.pdf
And he actually implemented all that in a cheap Altera Cyclone
FPGA.
I am really itching to try this out :-)
so long
Ingo
gcj seems to beat that one on the author's benchmark, and it doesn't
seem to have been compared to modern JIT VMs. It seems that the main
potential benefit is the ability to run JVM bytecode in a very small
memory footprint – something you can do with neither a compiler
generating native code nor a JIT VM; but it's a niche goal, and somewhat
tangential to the question of a language's level of abstraction and its
implied performance penalty.
The ability to 'watch' a region of memory and get a signal when it is
updated would be very nice for reactive programming models and efficient
concurrent garbage-collection.
I'd love to have improved atomic ops (hardware transactional memory,
or even just double compare and swap). This would allow more efficient
parallelism.
Watching memory regions – the question is how many. If it's a couple
or a dozen, it's dirt cheap and on some targets available today with
hardware data breakpoints and/or page table tweaks. If you need to watch
an arbitrary number of regions of arbitrary size, then the only way I
see that could work is by having a shadow memory keeping a region ID for
each word. If the ID is, say, 24b, and the word size, say, 16 bytes, the
cost would be <25% memory overhead. The question is what sort of
performance improvement you'd expect.
As to atomic ops – I'm too used to coarse-grained conservative
parallelism suitable to current targets to be able to easily contemplate
the benefits of what you propose.
Hello,
I don't think the debate is still ongoing and I am quite the newbie
junior fresh-out-of-school programmer but I would like to put down just
a few lines about my ideas.
Nowadays I think the big problems are "memory latency", "energy
efficiency", "parrallelism&concurrency".
And the big tendencies are going into lockless programming, STM, Actors,
functional programming.
For the base of the core (RISC or x86, such things) I don't have
enough experience to give a sensible opinion but here is my take on what
should be tweaked on such a base.
User-level VMM (It was adressed on a previous comment) with "aliasing
TLB"
-> detailed idea here http://codingrelic.geekhold.com/2008/05/high-level-cpu-response.html
The author speaks about functional programming gains on function calls
but I think it would massively help GC performance (these are ubiquitous
these days).
I think the Azul JVM delves into these things (by playing at the OS
level). http://en.wikipedia.org/wiki/Azul_Systems (papers
linked on the site)
Control on this supplementary indirection is something you see in
countless concurrent GC papers.
User-addressable cache memory with user-addressable sync instruction.
I think we hide too much to hide the hardware and end up having to game
it to avoid cache-wise pathological behaviour... However such things
would probably end up significantly increasing binary size so I think
there should be 2 types of cache memory. The current invisible one for
"best effort" and an explicitly addressable-one.
I also think that control on what goes into the cache who also help for
minimizing context-switching cost and heavy concurrency.
Some kind of hardware level STM for inter-cache synchronization (and
for main mem) but I don't know if it's implementable or if it has
already actually been done. That one is hhhheavily debatable...
That's all.
I think too much abstraction on low-level things actually doesn't
help HLL compilers and optimizing JIT to produce efficient code. A few
things should be exposed, at least experimentally. If it proves
successful the old abstraction could actually be removed, that would
perhaps simplify the chip too (don't really know...)! (This is a bit
like stating the fact that that RPC did not save us from distributed
computing hardships).
Best regards,
johan.gall+yosefk >>= google mail service
Oh, dear! Premature optimisation a-plenty in this thread :-)
Here's a plan: Take code compiled from the best HLL compilers, and
then run it, profile it with callgrind or gprof or similar, and decide
how much time it's spending doing something that might be done faster in
hardware. THEN we have a basis to work from.
From memory, dynamic type dispatching is a bit of a killer; when done
with "small" value, that boils down to branching between integer, FP, or
call-a-handler to implement arithmetic on word-sized quantities (and
yes, you need untagged arithmetic instructions available for when the
compiler can deduce a type, and for dealing with supplied-untagged data
such as 8-bit bytes). When done with "large" values (eg, objects) this
turns into multiple-pointer indirections, which is just down to memory
bandwidth, and better CPUs won't really help with that.
Things like hardware level STM are handy for speeding up higher-level
inter-process communications, but I think they're not quite what Yosef
is after here – they'll speed up IPC written in assembly just as much as
IPC written in Lisp; higher level domain-specific abstractions are
orthogonal to higher-level *languages*
Yeah, the memory problem with "large" objects is one reason why RISC
is a good-enough HLL target. As to tagging and tag-based dispatching – I
guess it really would be an "actual" increase of efficiency (not just
moving costs between hardware and software ending up with the same
performance per dollar) if you use a small amount of tag bits ("integer,
FP, or call-a-handler"), but it only seems to deal with arithmetic
operations (at the cost of data bits), and polymorphic arithmetic
doesn't sound like a very large part of the productivity gain one
expects from an HLL.
Seems like i mostly agree with 5-Justin.Wick...
90% of problems now not in hardware performance. But in modern OS
-
complicated structure, preemptive multitasking (badly working
"solution"
to dynamical system extensibility), unsafe-by-design multithreading
(see
next sentence), bloated overdesigned libraries, layers on
layers...
...note...Just as pointers, threads is oversimplified building brick,
and so
not safe. (pointers+GC) is safe and is brick. Meanwhile,
inheritance
also unsafe – it's like pointer only in larger time scale, that is
why
OOP is suck in large)) Because no architecture-GC in brain))
modular,
AD – was good, OOP – not.
And threads. It is also ONE-WAY dependence and so unsafe in back
direction.
See? sure it has problems... many programming problems is SAME.
Uncontrolled dependencies...end-note...
"Constantly working re-simplification of all system" may be way to
go))
We need implement not just "more processing power" (special
purpose
coprocessors is the way for this and they already exist) but maybe
some
essential language-runtime or os divide-et-conq blackmagic to rule
all
this power... working on simple compiler-friendly hardware.
There are was many projects for "more high level" hardware...
Running smlnj on bare hardware seems already was done sometimes...
Lisp? is not good because 1) "only lists" is not good (memory bottleneck
wise).
automatic indirection is bullshit.
2) strong static typing is a MUST, it really simplifies code
reasoning.
Yes, 12-Paul, some assist to garbage collection will be nice from
hardware.
This is exactly thing i want to say. not GC coprocessor with own
memory
channel)) No, what would be good... is more like mlkit region-based
memory
management, only segment-table alike. Think of it as of more
sophisticated
MMU unit. memory protection and collection assist. sit and watch
adresses.
transactional memory?
Haskell is not good (unpredictable), ML maybe good.
22-DGentry, interesting)
36-CybFreak, words like "good discipline and education could make
programmer use low level"... pfu...that just not work, and will
not))
Errors is INHERENT property of ANY human brain, no matter how
professional
that 1.4kg bunch of neurons is)) Real professional is just ready
for
own errors... We say – "big people make big errors", yea?))
...anyway – who controls the memory, controls the multicore
world))
and remember, all there! we not need one more powerful hardware
feature))
we need way to cut off this cool but unneeded brain-eating unsafe
features!))
so they cant stand on our way to problem solution...
too many unneeded abstractions... no more...please...hhhh...
(silently dying)
))
Yes! I'm glad that someone is saying this stuff.
I realize this article is four years old, and you've probably seen it
already, but: Have you read "Lambda, the ultimate opcode"? It's about a
machine that uses tree execution instead of vector execution (i.e.
Lisp-like evaluation instead of traditional assembly), and they built a
prototype.
http://repository.readscheme.org/ftp/papers/ai-lab-pubs/AIM-514.ps.gz
or http://repository.readscheme.org/ftp/papers/ai-lab-pubs/AIM-514.pdf
I don't claim it can't be done, only that it's pointless. To quote
Ken Thompson (from memory) – "Lisp is not a special enough language to
warrant a hardware implementation. The PDP-10 is a great Lisp machine.
The PDP-11 is a great Lisp machine. I told these guys, you're
crazy."
I was just bringing it up as an example of someone actually making an
HLL architecture instead of just complaining about x86. I don't know
enough about hardware to know if it's actually helpful (though I doubt
it, because I'm not using one).
So current technology was allowed 30 years to bring it where it is
today, and you want someone to go off and develop a viable processor, in
isolation, and yet be within 25% of a modern system?
I understand your complaint, I'm in the same boat, but challenging
someone else to come up with a solution is not going to produce
anything. Especially since anyone who is capable of producing a design
that could be implemented, would also be capable of implementing it
themselves, i.e. they don't need you to try and produce your
interpretation of their design for them.
Your challenge is like those weenies who post in the game forums: "I
have the best idea for the next killer FPS, I just need people to make
it for me and you will get a cut once the game is making
millions..."
I'm sick of weenies asking for others to make things for them, with
the only return for blood, sweat, and tears being some thanks and
recognition. Why not put up a few million bucks instead?
To me, "current technology", the "Internet", and all the protocols
that go with it are human's first-pass at potentially solving a problem.
Now the collective "we" need to toss out all the current methods (CPU
designs, programming languages (all of them), protocols, etc.) and solve
the problem properly. You will not build a better mouse trap if you
always start with the same initial design. You have to get away from the
influence, step back and actually think differently.
I'm getting older so I like to rant and use analogies, so I'll use
one here. I propose that if you tell 10 people to design a new operating
system, you will get 5 new versions of Windows and 5 new versions of
OSX. Nothing actually new, just more of the same. Projects like WINE and
ReactOS prove my theory.
Same with the CPU or computer systems. Designers don't know how to
make anything new, and any research that is going on is focused on
somehow making CPUs even faster or cramming more cores onto a die, and
further widening the already gaping hole between the CPU speed and the
rest of the system.
In all seriousness, I actually happen to have a few new ideas of my
own directly related to what you are talking about, and I'm starting to
kick them around in hardware, but I have yet to fine anyone to discuss
them with. Maybe I'll eventually piss someone off enough to spark a good
conversation. ;-)
Well, if you want to discuss your ideas with me, either publicly or
privately, you're very welcome. As to "new things" – there's always the
question of whether the new things are like the old ones because of
inertia or for the same reasons that new wheels are as round as old
ones. I believe that CPUs are more like wheels in this regard, and here
I tried to explain why.
I'm interested in working on LISP processors, and I do believe that
they can process data (I'm thinking statistical data analysis here –
thats my domain) quicker than conventional CPUs – I'm still learning
though.
As for the challenge, I present this: http://www.researchgate.net/publication/3501984_MARS-a_RISC-based_architecture_for_LISP
You mean a Lisp Machine paper from 1989? "the tracing of a list with
cdr and car"? I'm guessing it wouldn't concern itself with cache
prefetching to the extent that's required to make linked lists work
reasonably well in 2013.
Some people mentioned implementing the JVM. It's been done (http://en.wikipedia.org/wiki/Jazelle).
There must be a reason were not all writing java for arm embedded
systems...
In fairness to those arguing for a hardware-assisted JVM, Jazelle's
failure doesn't necessarily prove them wrong, it might just be an
example of it done badly. (A lot of ideas in hardware architecture fail
initially/in some markets and take off later/elsewhere, say, VLIW –
failed for Multiflow, failed in Itanium, works great and found
everywhere in modern DSPs.)
I don't pretend to be a chip designer, but I was very interested in
the Intel 432 when it came out. It failed epically, but perhaps there
are some ideas in there that you would find useful:
http://en.wikipedia.org/wiki/Intel_iAPX_432
Maybe of interest. Someone did a Tcl runtime in hardware. http://www.tcl.tk/community/tcl2002/archive/Tcl2002papers/thibault-hardware/tcl2002.pdf
One rather small example that'd help with at least one HLL is a
somewhat less specialised call/return instruction.
In modern intel CPUs the call/return is not just a CISCy shorthand
but is used to help with branch prediction for returns. Unfortunately
the call/return is too tightly tied to the C calling convention to be
usable in some HLL compilers. For example in GHC we need to be able to
return to some address that is not simply the subsequent instruction. We
need to use a return address slightly further on because we store
metadata (mostly for the GC) in front of each code span.
Another example would be proper hardware support for checked integer
overflow/underflow. By proper we mean not slower in the typical case.
(The Mill seems to have an interesting approach to this using
NaN/NaRs)
Another one would be a more flexible approach to alignment and
pointer sizes. Why can't we have 5-byte pointers and store all our data
structures unaligned? 4-byte pointers are too small but 8-byte ones are
a waste for most programs. In HLLs we tend to use a lot of pointers and
so the cost of larger pointers is much more significant than for C.
Think of the JVM and its pointer compression mode, and then imagine
if it could do that more flexibly, e.g. up to 1TB heap using 5-byte
pointers.
How about "Checked Load: Architectural Support for JavaScript
Type-Checking on Mobile Processors"?
https://homes.cs.washington.edu/~luisceze/publications/anderson-hpca2011.pdf
It is silly to differentiate low level and high level, it is rather
context. E.g. try fix a car, at one context, we are thinking in
components, then when we narrow down to a component, we need think about
how to remove the stuck screw. What is going on is we only allow limited
items at a time — call it high level or low level, within a context,
they are all the same.
Now pick 100 lines of code in any current programming languages and
you'll find variables, types, functions, expressions, ... all weaved
together — that is the problem: there is only one context.
Now if we put the car into one context, we have the same dilemma:
should we think about the car at component level *only* or screw
level?
So what we need is a new programming layer that can allow us to write
code one context at a time.
I believe that Alan Kay originally conceived Smalltalk as being a
collection of objects sending and receiving messages. That would have
implied a multiprocessor architecture of many simple processors and a
non-blocking many-to-many communications fabric. Then they broke the
concept by shoehorning it into the CPUs of the day. In fact, most early
Smalltalk systems were horrendously slow for production use.
Early Lisp systems, e.g. Symbolics, used custom hardware that
implemented lists and primitives in the hardware. Again in the 1970's
that wasn't a feasible way to go.
The RISC movement made sense while the CPU was the bottleneck. These
days, we need 2-3 levels of cache to couple fast CPUs to slow main
memory.
So I would suggest that current commodity hardware is fine. I think
that compilers are working on too low a level to make optimal use of the
resources. If we were to treat current commodity CPUs like
microprogrammed CPUs of yore and implement higher levels of abstraction
primitives (yes that's a contradiction of sorts) as assembler routines,
then cache hit ratios should increase and result in more efficient use
of hardware resources.
How about the "mill"? They claim they can get around 10x performance
for normal code, "merely" by doing "almost everything" differently
(instructions, registers, memory management, security, caching...).
How about HW that is explicitly designed to efficiently execute
Erlang-style actors? Maybe combined with Rust-like explicit ownership
tracking?
The down side of any non-traditional architecture is that it is
expected to immediately compete with the decades of fiendish cleverness
that has been invested in the current approach, and might possibly also
require us to re-write all our software from scratch.
Either way it would be a huge investment. And nobody is willing to
invest the ludicrous amount of money required based just on the
intuition that there is gold at the end of some particular rainbow.
Everyone who thinks about it feels very strongly that the gold is there
_somewhere_, but nobody is quite certain _exactly_ where it is.
The last serious well-funded attempt was the Japanese 5th generation
project and, in retrospect, their rainbow was located _just_ a bit off
from the pot of gold they were aiming for.
It will probably take another century or so for us to figure things
out :-(
Just implement a cellular automaton for Conway's Game of Life in
Hardware (https://en.wikipedia.org/wiki/Conway's_Game_of_Life).
One Flip-Flop per Cell plus a few logic gates to compute the next state
and a global clock routed to every cell. So you get (parallel) computing
and memory from one homogeneous design that can be easily scaled towards
larger chips (Note: Conway's Life is proven to be turing complete)
I/O interface is implemented by some cells at the boundary of the
automaton. E.g. just shoot gliders at them to read/write their
state.
Then write a compiler from HLL to a bit-map that can run on the
cellular automaton.
@da Tyga: Smalltalk is a serial language, multiple processors does
nothing for it. Smalltalk is not Erlang. Whether something like Erlang
on top of a huge number of wimpy cores could ever be more efficient in
terms of compute throughput than today's CPUs is a good question, I tend
to answer in the negative because of the cost of communication but I'm
not sure, the obvious thing however is that Erlang and Smalltalk
inherently limit your compute throughput and you'd want a lower level
language sending those messages between cores even if that approach gave
you anything and this is why this whole line of reasoning is beside my
original point.
@Oren: the Mill, regardless of the validity of their claims, does not
target HLLs more than it targets C, does it?
"Competing with fiendish cleverness"? Not really. If Alan Kay has a
1000x better arch he'd be leveraging a lot of that fiendish cleverness –
such as EDA tools and chip fabs. At worst the 1000x would degrade to
100x and still he'd steamroll the competition and the reason he doesn't
do it is he's full of shit. And I've been doing computer architecture
for more than a decade and I can tell you that it's never been cheaper,
not necessarily to make a chip but certainly to make a synthesizable
core for someone else to manufacture, you don't really need that much
manpower and the tools are there and they're great, look for instance at
Adapteva's Epiphany (which will remind of you of Erlang but – surprise
surprise! it runs C not Erlang whose creator said he "doesn't give a
hoot about efficiency" and, well, it's compute throughput was, is and
will be in the shitter.) This company is five people who're not as full
of shit as Alan Kay and who unlike him put their money where their mouth
is.
And as to another 100 years... I think we've figured out long ago
many of the stuff that particular people (like hardcore functional
programming types) still think is not figured out and I predict that a
thousand years won't be enough to convince someone who never bothers to
look at evidence to begin with.
@David: surely you're joking; the best part in your comical delivery
is "Just."
I love the way you sugar-coat everything, Yossi. :D
1) a general purpose processor running a general purpose software
stack will always be better for anything
or
2) an optimized system will perform better for a given task.
obviously #2 is possible and worth pursuing.
This trade-off was not discussed.
I sense an over-developed sense of reverence for the arbitrary
market-driven decision tree that masquerades as the commmercial side of
computer science.
any known specific task for which an application (a system: hardware
and software) is being designed will be best approached from the side of
defining the actual application functionality and user interface,
followed by modeling the functionality and only later partitioning part
of it in hardware and part in software, once inner loops and underlying
functionality has been determined, performance analysis etc... I'm not a
dedicated embedded systems engineer, but they exist, and they quite
regularly opt away from the standarm arm-driven-soc options when costs
(lower or higher) or performance allow/dictate. While I admit that
loading linux on everything is tempting, it's not optimum, in any sense
of the word, except in cases where a full-blown multi-user os designed
for multi-mode processors is actually required... I digress... but the
tone of this piece is dismal.
@Josh: I really don't know why I shit-coat everything, at least
everything in this thread, Josh, I really don't. I don't even like my
original post anymore because it's a bit too combative, mainly, and yet
the completely unsubstantiated claims that get endlessly recited wrt
this topic somehow never fail to get me worked up all over again. And
what I rarely see is an actual simulation of an actual hardware
architecture and you'd expect that economics would compel people to do
it if they could. But why do I still care? I don't know.
(The Reduceron BTW is one exception and for years I sorta have this
plan to write why it solves a non-problem and why it's not ever going to
be competitive with a traditional CPU.)
High-level CPUs are really dumb idea, what I'm interested in is
maximizing FLOPS (or fixed point OPS, or imprecise float OPS) and RAM
bandwidth per mm^2 of silicon and per Watt.
Consider http://www.bdti.com/InsideDSP/2013/10/23/SingularComputing
http://www.gwern.net/docs/2010-bates.pdf , it is
basically an array of simple processors operating on float-like numbers,
100x more dense than standard single precision FP units.
The problem is external memory bandwidth, but HBM could somewhat
alleviate it (not sure if your fab can produce HBM-enabled chip
assemblies).
Hey Yossi, good article.
What if we slighted changed the new CPU architecture goal? I wonder
how difficult it would be to design a new ISA, without any major impacts
on performance, that could focus on the following goals:
1) Provide the "perception that the CPU is actually running source
code". This would be only a perception, of course, because the code
would be jited, interpreted, compiled, etc. But it could be implemented
in such a way that the user can, at any time, pause any piece of
program, look at its source and even make some modifications.
2) Always provide some kind of rollback, idealy dependending only on
the amount of memory letf that could be used to rollback to some
n-states (source instructions) back.
For implementing #1, one idea would be to somehow always keep a
mapping to current line of code and code being executed, or
something.
For implementing #2, one idea would be to use some kind of copy on
write up to available-memory steps. If you need more RAM for next
instruction and none is available, free the farthest rollback step
region.
Anyway, that was a good insipration.
Maybe you're reacting more to entrenched CPU design mindsets, which
HLL designers seem to be doing as well in your examples? Neural nets,
graphics processors as GPUs, etc. – we all kind of expect some next big
thing to take over, or at least fundamentally shake things up at some
point. The brain works too well and is too different (though I suppose
airplanes don't operate like birds).
IMO, the central issue is how to best utilize memory. In the bad old
days of low pin counts – and who knew how much expensive memory you
really needed because you couldn't afford it anyway – memory residing
off-chip made sense. These days it's still off-chip, which makes no
sense to me, as this demands enormous, complex, and expensive on-chip
caches in order to work. In his 2000 book "Robot" Hans Moravec points
out that the ratio of memory size to processor speed has remained a
fairly constant one megabyte per MIPS throughout the history of general
computing, which (if still true) is a very strong argument for fixed
size, tightly integrated, on-chip RAM.
I'm pretty sure I'm hopelessly naive, but RISC processors are still
way too complex. Integrate a multi-port main memory on the CPU die, toss
out the caches, memory manager, branch prediction, TLB, security modes
and supervisor registers, etc. and keep the core busy via barrel
processing with equal bandwidth threads intercommunicating in the shared
memory space. Stick a bunch of these together (a miracle occurs here)
and deal with security at a higher level.
I don't think you're biased towards shit-coating. I'm not god's gift
but like everything else, 90% of engineering work is pretty much
crappily done and ill thought out (big endian anyone? virtual
machines?). It's rare to find a diamond in the rough, much less a highly
polished one, when it comes to hardware or software or languages or
anything really. And all the brainpower wasted on making the hideous
Intel architecture 10% faster by doubling the multi-billion leaky
transistor count is tragic.
The hardest work out there is critically examining a foundation,
putting literally everything on the table, and generating reasonable and
workable ideas for basic, fundamental improvement – which is probably
why it so rarely happens.
Implement a chipset capable of supporting urbit/nock/hoon at the
hardware level.
It seems you are focused on the CPU but the bottleneck seems to be
memory bandwidth. Why not make memory smarter? Back in the last century
there were areas of memory with special semantics (e.g. video memory).
Why not make a gig-sized address range with local instructions in the
memory array.
Suppose this magic memory (MM) could walk a linked list or sort an
array or do a search or add/shift/mask two lines... all without any
memory buss traffic except the command/completion signals.
Companies could sell gig-sized special hardware with MM for doing
local convolutions for vision.
The new instructions needed that manipulate the various MM blocks
could be loaded into an FPGA core in the main CPU.
As a side note, Intel just bought Altera (about 40% of the FPGA
market). Once Intel applies its tech the Altera Cyclone will be about
16x as dense. So I expect to see "user loadable instructions" soon.
The advantage of MM is that it lives on the far side of the memory
bottleneck. It offloads trivial processing, doesn't use up the caches,
and doesn't use memory bandwidth.
And just for grins, if the memory chip had an FPGA on it I could push
a whole Forth CPU (like the J1) and a custom program into any gig-sized
memory block, let it smoke away on an algorithm, and do other stuff
until I got a completion signal. Now my computer can do wildly parallel
things (e.g. compression, encryption, convolution, sorting, searching,
etc) in multiple memory blocks without burning up the buss.
I'll wait for you to send me a memory I can plug into my machine with
the "extra fpga extension".
I was curious about what Alan Kay mentioned about B5000 saying that
it is efficient for higher language implementation. Also it is hard to
crash because of hardware bound checks. Added to that, he was mentioning
that Intel doesn't really know how to design chips (something to do with
how data and instruction regions are treated in their architecture).
However, I do not see anyone else talking about these except this blog.
Do you intend to talk more on the topic explaining more on the details
so more people can understand the issues involved better? Thanks.
Hi Yossi,
compative is not always bad; I like the skepticism in your original
post; bravo.
Earlier in the thread "#41 Peufeu on 09.28.09 at 2:25 am" says it
best; JITs can do a good job at executing dynamic oopls like Smalltalk
(my love), so the issue should indeed be supporting the writing of JITs,
at least in part. But also one should support the execution of the code
a JIT would like to produce, and support GC.
There is a short history of Smalltalk processors. SOAR led to SPARC
(register windows) and its tagged arithmetic instructions, but neither
ended up being used in Peter Deutsch's HPS, the highest performance
commercially available Smalltalk VM for commodity processors (which my
Spur VM is beating by about -40%). So two concrete suggestions from
SOAR
1. Tagged arithmetic instructions should neither hardware the tag
values nor the tag width and should jump on failure (or better skip next
on success) rather than trap. So the instruction should allow one to
specify the number of tag bits (forcing them to be least significant
irks probably fine) and which tag pattern represents a fixnum.
A key instruction sequence in a Smalltalk JIT VM is the inline cache
check which looks like eg
movq %rdx, %rax
andq #7, %rax. ; test tag bits of receiver
jnz Lcmp ; if nonzero they are the cache tag
movq (%rdx),%rax ; if zero, fetch class (index) from header
andq #3FFFFF,%rax
Lcmp:
cmpq %rax, %rcx ; compare cache with receiver's cache tag
The x86 provides a handy-dandy conditional move that one would think
could eliminate the jump and yield
movq %rdx, %rax
andq #7, %rax. ; test tag bits of receiver
cmoveqq (%rdx),%rax ; if zero, fetch class (index) from header
andq #3FFFFF,%rax
Lcmp:
cmpq %rax, %rcx ; compare cache with receiver's cache tag
except some cruel tease in Intel decided to make the instruction trap
when given an illegal address *whetger the instruction made the move or
not* so it's useless :-(. So if you provide a yummy conditional move,
make sure it doesn't trap if the condition is false.
Another tedious operation is the store check. It would be great to
have a checked store instruction, again one that used the skip next on
success pattern so one can jump to the remembering code, rather than
handling traps. A store check would be based on bounds registers, set up
infrequently, eg when entering jutted code, that would specify the base
and extent of new space and trap if storing an untagged value that lies
in new space to an address outside it.
One *really cool* piece of hardware support, that would essentially
reduce my new VM to something trivial to implement would be the ability
to search memory in parallel for aligned word sized cells that contain a
specific value. This means /all/of the heap, not just the part that is
paged in, which probably implies no paging ;-). Smalltalk supports
become, upon which lots of things are implemented including shape
changing instances at runtime. Become is essentially "replace all
references to a with references to b" or "exchange all references to a
and b", so to add an inst var at runtime, the system collects all
instances of a class, creates a new class with the added inst var,
creates copies of all instances, with the new inst var having the value
of nil, and then "atomically" exchanges the classes and the instances so
that all references in the system refer to the new class and instances,
leaving the old ones to the GC. The problem is in finding the
references. A trawl through the heap is slow. The original Smalltalk-80
added an j direction in the header of each object, and hence one
exchanges the indirections, and this design remains in HPS, but the
explicit indirection slows down all accesses. My new Spur VM uses "lazy
forwarding", turning the original objects into forwarding pointers to
copies, and fixing them up when message sends fail (since forwarders
don't have valid classes) or when visited by the GC, which is why Sour
is exactly twice as fast as HOS in the benchmarks game's binary trees
benchmark, but at the cost of /lots/ of complexity. If memory were smart
enough to render the search O(1) that would be magnificent.
There are other such ideas (azul's already been mentioned, asplos is
a good source for the design ideas around cache management (avoiding the
need to zero unmapped pages etc) and cheap user-level traps (although I
think snip next/conditional instructions a la arm are way more
convenient). But the above are what are pressing from my own
experience.
Anyway, interesting and provocative post; provoked me to reply. Good
luck, and have fun; this /is/ fun!
@Eliot: cool to hear from someone actually working on JIT
compilation!
I think the point that I tried to make here and that got buried in
the whole HLL angle, and which I then made again, hopefully more clearly
(http://yosefk.com/blog/its-done-in-hardware-so-its-cheap.html),
is that some things have a fundamental cost because they require a
certain amount of logical operations to be performed and that will cost
you energy, first and foremost (you can trade machine time for machine
size to some extent etc. etc. but energy is the simplest constraint to
think of because you can't cheat, there's a lower bound on how much
energy something is going to cost.) Math and physics limit what hardware
can do, basically, hardware support is not magical.
The example among your suggestions where I think this problem is most
pronounced is associative memory (searching for a value everywhere in
parallel.) Associative processors have been suggested and implemented (I
wrote a bit about my understanding of the area here: http://yosefk.com/blog/an-unusual-hardware-architecture-apa-associative-processing-array.html),
but the upshot is, comparing that many bits in parallel costs a lot of
energy and a machine doing this very quickly would be fundamentally more
expensive than a machine running a programming language that didn't need
such a power-hungry feature. And if you wanted your memory to also
support standard random access, not just replacing values with other
values, then you'd need to make it larger. (Note as well that for that
to work in the heap you need a DRAM memory supporting it, and even Intel
has hard time shifting the priorities of the DRAM market... so this one
at least as I understood it is not within the reach of most chip makers
out there even if they all wanted to do it.)
I do agree that a lot could be done in hardware to make things less
awkward for a given language – here HLLs are not that different from
lower-level languages; if your language supports floating point math
you'd want your hardware to support it, and emulating floating point in
software will suck, performance-wise, and the same is true for many HLL
features. All I meant that some of the extra cost of HLLs –
such as more levels of memory indirection – cannot be made to go
away.
Post a comment