"It's done in hardware so it's cheap"
It isn't.
This is one of those things that look very obvious to me, to the point where they seem not worth discussing. However, I've
heard the idea that "hardware magically makes things cheap" from several PhDs over the years. So apparently, if you aren't into
hardware, it's not obvious at all.
So why doesn't "hardware support" automatically translate to "low cost"/"efficiency"? The short answer is, hardware is an
electric circuit and you can't do magic with that, there are rules. So what are the rules? We know that hardware support does
help at times. When does it, and when doesn't it?
To see the limitations of hardware support, let's first look at what hardware can do to speed things up. Roughly,
you can really only do two things:
- Specialization - save dispatching costs in speed and energy.
- Parallelization - save time, but not energy, by throwing more hardware at the job.
Let's briefly look at examples of these two speed-up methods – and then some examples where hardware support does nothing for
you, because neither of the two methods helps. We'll only consider run time and energy per operation, ignoring silicon area
(considering a third variable just makes it too hairy).
I'll also discuss the difference between real costs of operations and the price you pay for these
operations, and argue that in the long run, costs are more stable and more important than prices.
Specialization: cheaper dispatching
If you want to extract bits 3 to 7 of a 32-bit word and then multiply them by 13 – let's say an encryption algorithm requires
this – you can have an instruction doing just that. That will be faster and use less energy than, say, using bitwise AND, shift
& multiplication instructions.
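For concreteness, here's roughly what the general-purpose version looks like in C – just a sketch of the three-step sequence (the function name is made up; the bit positions and the constant 13 are from the example above):

#include <stdint.h>

/* Generic version: shift, mask and multiply are three separate operations,
   each dispatched as its own instruction on a typical RISC machine. */
uint32_t extract_and_scale(uint32_t word) {
    uint32_t field = (word >> 3) & 0x1f;  /* bits 3 to 7: five bits */
    return field * 13;
}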
Why – what costs were cut out? The costs of dispatching individual operations – circuitry controlling which operation is
executed, where the inputs come from and where the outputs go.
Specialization can be taken to an extreme. For instance, if you want a piece of hardware doing nothing but JPEG decoding, you
can bring dispatching costs close to zero by having a single "instruction" – "decode a JPEG image". Then you have no flexibility
– and none of the "overhead" circuitry found in more flexible machines (memory for storing instructions, logic for decoding
these instructions, multiplexers choosing the registers that inputs come from based on these instructions, etc.)
Before moving on, let's look a little closer at why we won here:
- We got a speed-up because the operations were fast to begin with – so dispatching costs dominated. With
specialization, we need 5 wires connected directly to bits 3 to 7 that have tiny physical delay – just the time it takes the
signal to travel to a nearby multiplier-by-13. Without specialization, we'd use a shifter shifting by a configurable amount of
bits – 3 in our case but not always – which is a bunch of gates introducing a much larger delay. On top of that, since we'd be
using several such circuits communicating through registers (let's say we're on a RISC CPU), we'd have delays due to reading and
writing registers, delays due to selecting registers from a large register file, etc. With all this taken out by having a
specialized instruction, no wonder we're seeing a big speed-up.
- Likewise, we'll see lower energy consumption because the operations didn't require a lot of energy to begin with. Roughly,
most of the energy is consumed when a signal value changes from 1 to 0 or back. When we use general-purpose instructions, most
of the gate inputs & outputs and most flip-flops changing their values are those implementing the dispatching. When we use a
specialized instruction, most of the switching is gone.
This means that, unsurprisingly, there's a limit to efficiency – the fundamental cost of the operations we need to do, which
can't be cut.
When the operations themselves are costly enough – for instance, memory access or floating point operations – then their cost
dominates the cost of dispatching. So specialized instructions that cut dispatching costs will give us little or nothing.
Parallelization: throwing more hardware at the job
What to do when specialization doesn't help? We can simply have N processors instead of one. For the parts that can be
parallelized, we'll cut the run time by N – but spend the same amount of energy. So things got faster but not necessarily
cheaper. A fixed power budget limits parallelization – as does a fixed budget of, well, money (the price of a 1000-CPU rack is
still not trivial today).
[Why have multicore chips if it saves no energy? Because a multicore chip is cheaper than many single-core ones, and because,
above a certain frequency, many low-frequency cores use less energy than a few high-frequency ones.]
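[A back-of-the-envelope sketch of that second point, assuming dynamic power scales roughly as V²·f and that voltage can be lowered together with frequency: one core at frequency f burns something proportional to f³ per unit of time, while four cores at f/4 burn 4·(f/4)³ = f³/16 – the same total throughput for far less power. The voltage-scaling assumption is the load-bearing one; once you hit the minimum operating voltage, lowering the frequency further stops paying off.]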
We can combine parallelization with specialization – in fact it's done very frequently. Actually, the JPEG decoder mentioned
above would do that – a lot of its specialized circuits would execute in parallel.
Another example is how SIMD or SIMT
processors broadcast a single instruction to multiple execution units. This way, we get only a speed-up, but no energy savings
at the execution unit level: instead of one floating point ALU, we now have 4, or 32, etc. We do, however, get energy savings at
the dispatching level – we save on program memory and decoding logic. As always with specialization, we pay in flexibility – we
can't have our ALUs do different things at the same time, as some programs might want to.
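To make the broadcast idea concrete, here's a minimal sketch using x86 SSE intrinsics – just an illustration, assuming an x86 machine with SSE; a single _mm_add_ps adds four single-precision floats, so the fetch/decode/dispatch work is paid once for all four lanes:

#include <xmmintrin.h>

/* One instruction, four additions: dispatching is shared across the lanes,
   but all four lanes must do the same thing. */
void add4(const float *x, const float *y, float *out) {
    __m128 a = _mm_loadu_ps(x);            /* load 4 floats */
    __m128 b = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ps(a, b));  /* 4 adds, one instruction */
}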
Why do we see more single-precision floating point SIMD than double-precision SIMD? Because the higher the raw cost of
operations, the less we save by specialization, and SIMD is a sort of specialization. If we have to pay for double-precision
ALUs, why not put each in a full-blown CPU core? That way, at least we get the most flexibility, which means more opportunities
to actually use the hardware rather than keeping it idle.
(It's really more complicated than that because SIMD can actually be a more useful programming model than multiple threads or
processes in some cases, but we won't dwell on that.)
What can't be done
Now that we know what can be done – and there really isn't anything else – we basically already know what can't be done.
Let's look at some examples.
Precision costs are forever
8-bit integers are fundamentally more efficient than 32-bit floating point, and no hardware support for any sort of floating
point operations can change this.
For one thing, multiplier circuit size (and energy consumption) is roughly quadratic in the size of the inputs. IEEE 32b floating
point numbers have 23b mantissas – 24b significands counting the implicit leading bit – so multiplying them means a ~9x larger
circuit than an 8×8-bit multiplier with the same throughput. Another cost, linear in size, is that you need more memory, flip-flops and wires to store and transfer a float than
an int8.
(People are more often aware of this one because SIMD instruction sets usually have fixed-sized registers which can be used
to keep, say, 4 floats or 16 uint8s. However, this makes people underestimate the overhead of floating point as 4x – when it's
more like 9x if you look at multiplying mantissas, not to mention handling exponents. Even int16 is 4x more costly to multiply
than int8, not 2x as the storage space difference makes one guess.)
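(To put rough numbers on the quadratic claim: a naive array multiplier produces one partial-product bit per pair of input bits, so a 24×24 multiplier has 24² = 576 of them versus 8² = 64 for an 8×8 one – the ~9x above – and a 16×16 multiplier has 256, the 4x for int16. Real multipliers use Booth encoding, Wallace trees and so on, which change the constants but not the roughly quadratic growth.)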
We design our own chips, and occasionally people say that it'd be nice to have a chip with, say, 256 floating point ALUs.
This sounds economically nonsensical – sure it's nice and it's also quite obvious, so if nobody makes such chips at a budget
similar to ours, it must be impossible, so why ask?
But actually it's a rather sensible suggestion, in that you can make a chip with 256 ALUs that is more efficient
than anything on the market for what you do, but not flexible enough to be marketed as a general-purpose computer. That's
precisely what specialization does.
However, specialization only helps with operations which are cheap enough to begin with compared to the cost of dispatching.
So this can work with low-precision ALUs, but not with high-precision ALUs. With high-precision ALUs, the raw cost of operations
would exceed our power budget, even if dispatching costs were zero.
Memory indirection costs are forever
I mentioned this in my old needlessly combative write-up about "high-level CPUs". There's this idea that we can have a
machine that makes "high-level languages" run fast, and that they're really only slow because we're running on "C machines" as
opposed to Lisp machines/Ruby machines/etc.
Leaving aside the question of what "high-level language" means (I really don't find it obvious at all, but never mind),
object-orientation and dynamic typing frequently result in indirection: pointers instead of values and pointers to pointers
instead of pointers. Sometimes it's done for no apparent reason – for instance, Erlang strings that are kept as linked lists of
ints. (Why do people even like linked lists as "the" data structure and head/tail recursion as "the" control structure? But I
digress.)
This kind of thing can never be sped up by specialization, because memory access fundamentally takes quite a lot of time and
energy, and when you do p->a, you need one such access, and when you do p->q->a, you need two, hence you'll spend twice
the time. Having a single "LOAD_LOAD" instruction instead of two – LOAD followed by a LOAD – does nothing for you.
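In C terms – a trivial sketch with made-up struct names – the difference is one dependent memory access versus two:

struct inner { int a; };
struct outer { struct inner *q; };

int direct(struct inner *p)   { return p->a; }    /* one load */
int indirect(struct outer *p) { return p->q->a; } /* two loads; the second can't start before the first finishes */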
All you can do is parallelization - throw more hardware at the problem, N processors instead of one. You can, alternatively,
combine parallelization with specialization, similarly to N-way floating point SIMD that's somewhat cheaper than having N
full-blown processors. For example, you could have several load-store units and several cache banks and a multiple-issue
processor. Then if you had to run p1->q1->a and, somewhere near that, p2->q2->b, and the pointers point into
different banks, some of the 4 LOADs would end up running in parallel, without having several processors.
But, similarly to low-precision math being cheaper whatever the merits of floating point SIMD, one memory access always costs
half as much as two, despite the merits of cache banking and multiple issue. Specifically, doubling the memory access
throughput roughly doubles the energy cost. This can sometimes be better than simply using two processors, but it's a
non-trivial cost and will always be.
A note about latency
We could discuss other examples but these two are among the most popular – floating point support is a favorite among math
geeks, and memory indirection support is a favorite among language geeks. So we'll move on to a general conclusion – but first,
we should mention the difference between latency costs and throughput costs.
In our two examples, we only discussed throughput costs. A floating point ALU with a given throughput uses more energy than
an int8 ALU. Two memory banks with a given throughput use about twice the energy of a single memory bank with half the
throughput. This, together with the relatively high costs of these operations compared to the costs of dispatching them, made us
conclude that there's nothing we can do.
In reality, the high latency of such heavyweight operations can be a bigger problem than our inability to increase their
throughput without paying a high price in energy. For example, consider the instruction sequence:
c = FIRST(a,b)
e = SECOND(c,d)
If FIRST has a low latency, then we'll quickly proceed to SECOND. If FIRST has a high latency, then SECOND will have to wait
for that amount of time, even if FIRST has excellent throughput. Say, if FIRST is a LOAD, being able to issue a LOAD every cycle
doesn't help if SECOND depends on the result of that LOAD and the LOAD latency is 5 cycles.
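In C, a small sketch of the point, assuming the 5-cycle load latency above (the function and variables are made up):

/* An out-of-order core can start the second, independent load while the
   first is still in flight; a simple in-order core stalls at the addition. */
int sum_two(const int *p, const int *q, int d) {
    int c = *p;      /* LOAD issued now, result ready ~5 cycles later */
    int e = c + d;   /* depends on c: an in-order core waits here     */
    int f = *q;      /* independent of c: can overlap with the first load */
    return e + f;
}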
A large part of computer architecture is various methods for dealing with these latencies – VLIW, out-of-order, barrel processors/SIMT, etc. These are all forms of parallelization –
finding something to do in parallel with the high-latency instruction. A barrel processor helps when you have many threads. An
out-of-order processor helps when you have nearby independent instructions in the same thread. And so on.
Just like having N processors, none of these parallelization methods lowers dispatching costs - in fact, they raise
them (more registers, higher issue bandwidth, tricky dispatching logic, etc.) The processor doesn't become more energy
efficient - you get more done per unit of time but not per unit of energy. A simple processor would be stuck at the FIRST
instruction, while a more clever one would find something to do – and spend the energy to do it.
So latency is a very important problem with fundamentally heavyweight operations, and machinery for hiding this latency is
extremely consequential for execution speed. But fighting latency using any of the available methods is just a special case of
parallelization, and in this sense not fundamentally different from simply having many cores in terms of energy consumed.
The upshot is that parallelization, whether it's having many cores or having single-core latency-hiding circuitry, can help
you with execution speed – throughput per cycle – but not with energy efficiency – throughput per watt.
The latency of heavyweight stuff is important and not hopeless; its throughput is important and hopeless.
Cost vs price
"But on my GPU, floating point operations are actually as fast as int8 operations! How about that?"
Well, a bus ticket can be cheaper than the price of getting to the same place in a taxi. The bus ticket will be cheaper even
if you're the only passenger, in which case the real cost of getting from A to B in a bus is surely higher than the cost of
getting from A to B in a taxi. Moreover, a bus might take you there more quickly if there are lanes reserved for buses that
taxis are not allowed to use.
It's basically a cost vs price thing – math and physics vs economics and marketing. The fundamentals only say that a hardware
vendor can always make int8 cheaper than float – but they can have good reasons not to. It's not that they made floats
as cheap as int8 – actually, they made int8 as expensive as floats in terms of real costs.
Just like you going alone in a bus designed to carry dozens of people is an inefficient use of a bus, using float ALUs to
process what could be int8 numbers is an inefficient use of float ALUs. Similarly, just like transport regulations can make
lanes available for buses but not cars, an instruction set can make fetching a float easy but make fetching a single byte hard
(no load byte/load byte with sign extension instructions). But cars could use those lanes – and loading bytes
could be made easy.
As a passenger, of course you will use the bus and not the taxi, because economics and/or marketing and/or regulations made
it the cheaper option in terms of price. Perhaps it's so because the bus is cheaper overall, with all the passengers it carries
during rush hours. Perhaps it's so because the bus is a part of the contract with your employer – it's a bus carrying employees
towards a nearby something. And perhaps it's so because the bus is subsidized by the government. Whatever the reason, you go
ahead and use the cheaper bus.
Likewise, as a programmer, if you're handed a platform where floating point is not more expensive or even cheaper than int8,
it is perhaps wise to use floating point everywhere. The only things to note are, the vendor could have given you
better int8 performance; and, at some point, a platform might emerge that you want to target and where int8 is much more
efficient than float.
The upshot is that it's possible to lower the price of floating point relative to int8, but not the
cost.
What's more "important" – prices or costs?
Prices have many nice properties that real costs don't have. For instance, all prices can be compared – just convert them all
to your currency of choice. Real costs are hard to compare without prices: is 2x less time for 3x more energy better or
worse?
In any discussion about "fundamental real costs", there tend to be hidden assumptions about prices. For example, I chose to
ignore area in this discussion under the assumption that area is usually less important than power. What makes this assumption
true – or false – is the prices fabs charge for silicon production, the sort of cooling solutions that are marketable today (a
desktop fan could be used to cool a cell phone but you couldn't sell that phone), etc. It's really hard to separate costs from
prices.
Here's
a computer architect's argument to the effect of "look at prices, not costs":
While technical metrics like performance, power, and programmer effort make up for nice fuzzy debates, it is pivotal for
every computer guy to understand that “Dollar” is the one metric that rules them all. The other metrics are just sub-metrics
derived from the dollar: Performance matters because that’s what customers pay for; power matters because it allows OEMs to put
cheaper, smaller batteries and reduce people’s electricity bills; and programmer effort matters because it reduces the cost of
making software.
I have two objections: that prices are the effect, not the cause, and that prices are too volatile to commit to memory as a
"fundamental".
Prices are the effect in the sense that, customers pay for performance because it matters, not "performance matters because
customers pay for it". Or, more precisely – customers pay for performance because it matters to them. As a result –
because customers pay for it – performance matters to vendors. Ultimately, the first cause is that performance matters,
not that it sells.
The other thing about prices is that they're rather jittery. Even a price index designed for stability such as the S&P 500 jumps up and down like crazy. In a
changing world, knowledge about costs has a longer shelf life than knowledge about prices.
For instance, power is considered cheap for desktops but expensive for servers and really expensive for mobile devices. In
reality, desktops likely consume more power than servers, there being more desktops than servers. So the real costs are not like
the prices – and prices change; the rise of mobile computing means rising prices for power-hungry architectures.
It seems to me that, taking the long view, the following makes sense:
- It's best to reason in costs and project them to the relevant prices – not forget the underlying costs and "think in
prices", so as to not get into habits that will become outdated when prices change.
- If you see a high real cost "hidden" by contemporary prices, it's a good bet to assume that at some point in the future,
prices will shift so that the real cost will rear its ugly head.
For example, any RISC architecture – ARM, MIPS, PowerPC, etc. – is fundamentally cheaper than, specifically, x86, in at least
two ways: hardware costs – area & power – and the costs of developing said hardware. [At least so I believe; let's say that
it's not as significant in my view as my other, more basic examples, and I might be wrong – I'm only using this as an
illustration.]
In the long run, this spells doom for the x86, whatever momentum it otherwise has at any point in time – software
compatibility costs, Intel's manufacturing capabilities vs competitors' capabilities, etc. Mathematically or physically
fundamental costs will, in the long run, trump everything else.
In the long run, there is no x86, no ARM, no Windows, no iPhone, etc. There are just ideas. We remember ideas originating in
ancient Greece and Rome, but no products. Every product is eventually outsold by another product. Old software is forgotten and
old fabs rot. But fundamentals are forever. An idea that is sufficiently more costly fundamentally than a competing idea can not
survive.
This is why I disagree with the following quote by Bob Colwell – the chief architect of the Pentium Pro (BTW, I love the interview and intend
to publish a summary of the entire 160-something page document):
...you might say that CISC only stayed viable because Intel was able to throw a lot of money and people at it, and die size,
bigger chips and so on.
In that sense, RISC still was better, which is what was claimed all along. And I said you know, there's a point to be made
there. I agree with you that Intel had more to do to stay competitive. They were starting a race from far behind the start line.
But if you can throw money at a problem then, it's not really so fundamental technologically, is it? We look for more deep
things than that, so if all the RISC/CISC thing amounted to was, you had a slight advantage economically, well, that's not as
profound as it seemed back in the 80s was it?
Well, here's my counter-argument and it's not technical. The technical argument would be, CISC is worse, to the point where
Intel's 32nm Medfield performs about as well as ARM-based 40nm chips in a space where power matters. Which can be countered with
an economic argument – so what, Intel does have a better manufacturing ability so who cares, they still compete.
But my non-technical argument is, sure, you can be extremely savvy business-wise, and perhaps, if Intel realized early on how
big mobile is going to be, they'd make a good enough x86-based offering back then and then everyone would have been locked out
due to software compatibility issues and they'd reign like they reign in the desktop market.
But you can't do that forever. Every company is going to lose to some company at some point or other because you only need
one big mistake and you'll make it, you'll ignore a single emerging market and that will be the end. Or, someone will outperform
you technically – build a better fab, etc. If an idea is only ("only"?) being dragged into the future kicking and screaming by a
very business-savvy and technically excellent company, then the idea has no chance.
The idea that will win is the idea that every new product will use. New products always beat old products – always
have.
And nobody, nobody at all has made a new CISC architecture in ages. Intel will lose to a company or companies making RISC
CPUs because nobody makes anything else – and it has to lose to someone. Right now it seems like it's ARM but it
doesn't matter how it comes out in this round. It will happen at some point or other.
And if ARM beats x86, it won't be, straightforwardly, "because RISC is better" – x86 will have lost for business
reasons, and it could have gone the other way for business reasons. But the fact that it will have lost to a
RISC – that will be because RISC is technically better. That's why there's no CISC competitor to lose to.
Or, if you dismiss this with the sensible "in the long run, we're all dead" – then, well, if you're alive right now and
you're designing hardware, you are not making a CISC processor, are you? QED, no?
Getting back to our subject – based on the assumption that real costs matter, I believe that ugly, specialized hardware is
forever. It doesn't matter how much money is poured into general-purpose computing, by whom and why. You will always have
sufficiently important tasks that can be accomplished 10x or 100x more cheaply by using fundamentally cheap operations, and it
will pay off for someone to make the ugly hardware and write the ugly, low-level code doing low-precision arithmetic to make it
work.
And, on the other hand, the market for general-purpose hardware is always going to be huge, in particular, because there are
so many things that must be done where specialization fundamentally doesn't help at all.
Conclusion
Hardware can only deliver "efficiency miracles" for operations that are fundamentally cheap to begin with. This is done by
lowering dispatching costs and so increasing throughput per unit of energy. The price paid is reduced flexibility.
Some operations, such as high-precision arithmetic and memory access, are fundamentally expensive in terms of energy consumed
to reach a given throughput. With these, hardware can still give you more speed through parallelization, but at an energy cost
that may be prohibitive.
You're totally confused. Memory indirection can be optimized
perfectly well – every single CPU does precisely that with its virtual
memory system. TLB caches are far older than regular memory caches, and
if you were right then any modern operating system would be far slower
than running DOS on the same hardware. TLB caches make all memory
indirection related to virtual memory access essentially free. Without
them computers would be how many times slower than they are now?
Lisp hardware had similar optimizations for memory indirection built
into their cache systems. And modern CPUs do various indirect branch
prediction etc.
I'm talking about, specifically, p->q->a vs p->a or
something along those lines. What you're talking about is (1) TLB
optimizations and (2) latency of dependent instructions (that's what
branch prediction is about). I'm talking about the raw access to cache
memory banks – not the cost of missing the cache, not the cost of
address translation, not the cost of waiting for the previous
instruction (though I discussed the latter). I'm talking about the raw
cost in energy of local memory throughput.
Tomasz brings up an interesting point – although not free (address
translation + memory coherence is the main reason why L1 caches in most
processors are limited in size to the page size times the number of ways
in the cache), virtual address translation is an example of hardware
optimizing a chain of loads into a page table in a manner that is more
efficient than what you could do in software.
TLBs cache the result of the full set of lookups into the page
table.
However, TLBs work for the following reasons:
(1) There's a high probability that the entire set of loads will be
reused in the future (i.e. you're accessing the same page again), and
the working set is tiny. In order to provide the illusion of being free,
a TLB that runs in parallel with the L1 cache needs to be implemented in
flip-flops rather than an SRAM, which limits the practical size to 16-64
entries.
(2) There's a very high read-to-write ratio, so the TLB can assume
the page tables are effectively read-only, and drop the set of cached
entries on a page table update.
If we try to apply this to e.g. strings stored as single-linked
lists, the cacheability and working set assumption (1) is easily
violated. Similarly, if we apply this to a high-level language virtual
machine like CPython where every value is a pointer to an object, both
assumption (1) and (2) are violated.
I think the take-away from this is that while there are very specific
circumstances where memory indirection can be optimized with a hardware
feature, the resulting hardware feature is going to come with many
caveats, rendering it hard to use for typical high-level language
implementations.
Hi,
I don't agree in some areas. I'm an electronics engineer and have
designed hardware too.
We agree that parallel means losing flexibility, but we don't agree
that it doesn't take less power. If you use 1000 units to do the same
work a CPU does, each at 1/1000th of the frequency, and power consumption
scales with the square of frequency, you spend 1000 × (1/1000)² = 1/1000
of the energy – 1000 times less.
You only need 100-120 Hz for video, less for audio, because the brain
is a low-frequency, massively parallel computer itself. If I do a CAD/CAE
simulation for a design I don't care about a 1/100 delay, and so on. There
are lots of applications for hardware acceleration.
Energy does not scale exactly with the square of frequency, especially
at low frequencies, but you can also shut down hardware modules you don't
use, and combine all that with what you already stated.
You could design a real module that spends 200-300 times less power
than the CPU does. In fact you can see them in every phone, iPad
or MacBook Air.
It seems to me that you don't know what you are talking about in the
big picture, you see trees but can't see the forest.
@Rune Holm: Of course managing the TLB contents is nothing like
managing the actual cache contents – which is why the original comment
is totally irrelevant :) There's a lot to do with loads & stores in
many specific cases, but not if you have "sufficiently random
addresses".
@Jose: well, I mentioned your point in the part about multicore;
several slow cores being more energy efficient than one fast one is the
same effect as "a real module" running at a low frequency spending less
energy than a high-frequency CPU.
However, there's a certain minimal frequency where you gain nothing
by lowering it still more. To me, it's interesting what happens at about
that range of frequencies, because what happens above it is grossly
inefficient in terms of energy consumption and people only target their
designs (CPUs, mostly) at such "unhealthy" frequencies for
programmer/consumer convenience. And I'm not in the consumer market.
BTW, 100MHz at 40nm, for instance, is way below that minimal frequency
for a reasonably pipelined design.
Also, this is a second comment pointing out a mistake I didn't even
make, and not in a very polite manner. Oh, the joy of the web.
Haters gonna hate...
I'm usually a lurker of your content – which I enjoy greatly... your
subject matter this time is going to draw out "the experts"
"I believe that ugly, specialized hardware is forever."
Is awfully close to an argument for CISC...
@Luke: glad you like my stuff.
@Morris: with CISC it would be "ugly, general-purpose hardware"; I
sort of think ugliness is a price that should buy you something, and
with specialisation it buys you efficiency but without, it could only
buy compatibility – which is worth a fortune until people toss their old
toys and buy new toys and then it's worth nothing.
You are wrong about Intel... They don't make CISC processors. All
their designs have been RISC-model since 1995. It's just that this has
been hidden under the x86 instruction set, which they have emulated since
the Pentium Pro.
Wody, the x86 instruction set is a CISC-style instruction set.
Converting the instructions to RISC-style instruction set doesn't make
the CPU a RISC CPU, it just means there's overhead that a true RISC CPU
wouldn't have to deal with.
Excellent points, though I would throw in maybe adding a FPLD next to
the CPU so you can download "specialization" once. Or the CPU cores with
DSP like on my Nokia n810. I guess the first question is whether the
specialization is intrinsic or temporary to the whole system. There will
be time/energy costs in setup, but then it runs fast.
It also goes further. QR Codes have a part of the algorithm that uses
mod 3 as the operator; however, if you do a bit of unrolling you can
remove the divide operations, and I think I only have one 8×8
multiply:
https://github.com/tz1/qrduino
I'll also add an Amen to the object oriented overload. I don't have a
problem with OOP per se – you can do it in plain C, if that paradigm is
the right one to solve a problem. But there seems to have been a
separate philosophy being taught that everything should be atomized,
abstracted, obfuscated, so it becomes setvalueofA(getvalueofB()) and
hope the compiler/linker is good enough to untangle it into the
load-store.
This was also one reason that Windows CE was such a failure at the
time (when it was competing with the Palm Pilot). You can't fool power
usage or a battery. The Palm used very efficient routines – basically
an original Macintosh processor and code updated. CE ported the Windows
kernel with all the bloat and baggage (it had an interrupt jitter making
it all but unusable for embedded). If something takes 10x the gates or
cycles, or whatever resource, it will drain the battery 10x faster.
Adding faster hardware is usually quadratic, i.e. to make it go 2x
faster, it will eat 4x the power (and parallelization isn't free – the
extra routing adds gates).
The only resource that is rarely tight today is memory, so it is
often easier to create tables of results rather than efficient
operations (e.g. Rainbow Tables for breaking encryption). That can be
traded off for speed and power in some cases. But that doesn't apply to
microcontrollers with only a few K.
Regarding CISC vs RISC, what do you think about memory efficiency of
instruction encodings?
Most RISC designs I'm aware of use a fixed width (usually 32-bit)
encoding, whereas I think x86 instructions average out to about 3.5
bytes per instruction, allowing more instructions to fit in a code cache
line. On top of that, allowing instructions to access memory directly
can eliminate entire load/store instructions compared to RISC.
Of course, optimizing encoding length is not CISC-specific as such
(as evidenced by Thumb), but it would seem that CISC in general would
enable shorter instructions overall, possibly leading to better memory
utilization.
@tz: I actually think of an FPLD/FPGA as a specialized architecture
(rather than a "clean slate you can do anything with") – I have a draft
about this bit. Likewise, a DSP is a specialized architecture – good at
a fraction of things, bad at most. Actually "specialization" is an
annoyingly fuzzy term because it's not clear what "general-purpose"
means; in this post I allowed myself to use the term because the context
was gaining efficiency through supporting what you want in hardware and
here, I can implicitly assume that you have a less efficient baseline
which you consider something done "without hardware support" – and
relative to that hypothetical baseline, the term makes sense. In a
broader context, I think there's no such thing as "general purpose
hardware" – there are just many different architectures, CPU is one
family, FPGA is another, DSP a third, etc. In the broad sense though, we
can call X more "general purpose" than Y if it runs more lines of code
or sells more units or has more programmers targeting it or a
combination; in this sense, I think clearly CPUs are the winner, and in
a business sense, "general purpose" is a sensible term even though it's
somewhat fuzzy technically.
@Jussi K: x86 binaries, specifically, are indeed somewhat denser than
RISC binaries (and ARM binaries are a bit denser than MIPS binaries, for
instance, so there are differences within the RISC family, too.)
The thing is, code size did use to be an issue in the old days, but
it rarely is these days – most memory is used by data, most instruction
accesses hit the cache, and in terms of energy, what matters is the
width of your instruction cache memory bus and what it takes to decode
the instructions once you fetch them; I believe RISC is cheaper to
decode overall.
This shows though that trade-offs are nearly impossible to get right
in any kind of a "timeless" fashion. If you have a hairy variable-length
encoding at a time when memory is really scarce, and your other option
is to have a fixed-sized encoding, presumably the reasonable thing is to
look at the prices of the two – contemporary prices – and choose to
conserve memory. You know that you made decoding complicated and that's
spending in terms of resources that can come back to bite you when
prices shift. But it's still the right trade-off at contemporary
prices.
So the only case where "looking at resources" is the sensible thing
is if your choice is, waste something or not waste it, without a
trade-off; this is the case with int8 vs float – but not really, not if
you add, say, development effort to the mix... It's only sensible if you
frame the problem as "hardware costs", which I think made sense in the
context of my write-up where I try to explain what hardware can and
cannot do, but not sensible in a broader case. With trade-offs, I guess
all you can do is realize you made one, in terms of fundamental
resources, and then watch out for price shifts...
@Yossi: FPGAs trade performance for flexibility (programmable
routing networks are slower than metal conductors). I'm not sure
whether that also means higher cost in terms of energy consumption –
e.g. whether the same application on an FPGA consumes less energy than
on a GPU or CPU.
There are very few applications where FPGA-based solutions can be
compared with CPU/GPU; the only one I can think of is Bitcoin
hash-mining. According to:
https://en.bitcoin.it/wiki/Mining_hardware_comparison
FPGAs seem about 10-20x more energy efficient than GPUs, and GPUs
are themselves about 10x more energy efficient than CPUs.
Anyway, FPGAs do have substantial advantages over CPUs and GPUs in
terms of being able to precisely fit arithmetic precision. If you need a
12-bit adder, you use resources for a 12-bit adder, not resources for a
16-bit adder, etc.
Also, FPGAs can do bit extraction and manipulation hardwired, and not
by utilizing generic shift/and instructions as in your "extract bits 3:7"
example.
The largest drawback, I still think, is that FPGAs are usually
programmed by hardware designers, and they are the diametrical opposite of
software developers in terms of development culture. Their tools are
low-level timing diagrams and ancient programming languages such as VHDL.
Even though C->VHDL compilers and direct synthesis of C code have been
developed, FPGAs are still programmed using VHDL or Verilog.
The biggest hurdle for a software developer to experiment with
FPGAs is that step 1 is still to design and build hardware that uses
FPGAs. Even though there are several evaluation boards available, you
still need the HDL tools.
Hi Yossi,
Another good article.
It has been a long time since you published; I was wondering
whether you had given up.
Deep down your main points stand:
- one still needs to perform the operations required for the task and
there is a lower bound as to how much energy you will need.
- A programmable machine using a set of types and operations is
inherently less efficient than a fully specialised component which can
use any type and any operation as long as it can be clock/power gated
when not needed
- Dollar is all that matters in the end. If someone else can make it so
that your customer can make a better buck out of the product, you will
lose.
I'll be careful from here on; I am a bit worried about revealing
trade-sensitive information from my current or my previous employer.
I'd just want to point out that:
1) SMT is not as bad as you make it out to be – at least once you've gone
through the hassle of getting a full OOO, multi-issue pipeline. I would
not be surprised if designs based on the technology became fashionable
again. Barrel processors, on the other hand, probably not. I would be even
less surprised to see x86 with wider execution pipelines and more than
2-way SMT. But then, I don't work for Intel, so it is just one of my
guesses on how they will fight the lighter many-core designs which seem to
be entering the server market. Essentially, unused powered silicon is a
full energy loss. Low-level clock gating is hard (as in small clock
domains), low-level power gating is even harder. Also, it can be good for
3. Making wider pipelines and a good I-cache means that you can boost
instruction throughput for specific types of parallel-friendly streams,
yet having several instruction streams means that you can perform OK on
high-latency workloads by running several in parallel. Scheduling threads
in such a setting is an interesting issue.
2) Dynamic consumption is a lot less of a problem than one would
think in modern devices; leakage is huge on the latest processes, and
clock distribution is a massive energy hog in high-speed designs. At less
cutting-edge nodes and at lower frequencies, it is still very relevant.
But we are talking CPUs and GPUs here. Comparing them with special-purpose
HW is a bit of a difficult comparison. On one side you have
something running at 1-2 GHz on cutting-edge processes, on the other side
something running at 100-400 MHz a couple of nodes behind, most of the
time.
3) One real difference between accelerators and general processors
is memory coherency. Accelerators will typically be given private memory
and a minimalistic MMU, just so that the device memory space can be
abstracted. On the other hand, memory coherency in coherent core systems
is an arduous issue (and an expensive one in terms of power).
Time and power consumption are not the only costs imposed by the use
of idiot hardware.
What is the total economic cost of the existence of buffer overflows
of every kind? They simply don't need to exist. Any of them. Implement
hardware bounds checking for *all* array accesses, and hardware
type-checking for all arithmetic ops (1970s state of the art!) and
buffer overflows vanish, taking with them the lion's share of known
security exploits, crashes, and other digital miseries.
x86 has had the BOUND instruction for hardware-accelerated bounds
checking since 80186 in 1982. It also has had the INTO instruction for
hardware-accelerated integer arithmetic overflow checks since 8086 in
1978. Not that it seemed to help.
And even for typical web languages that do provide array bounds
checking and widening of integer arithmetic into infinite-sized
integers, the security problems did not vanish, they just shifted into
e.g. SQL injection and cross-site scripting.
I believe secure programming is a programmer and programming language
problem, not a hardware problem.
@Mattias Ernelli: there are many differences in the table even
within, say, the ARM family that are hard to understand (not directly
related to, say, frequency); I guess knowing more about the systems and
the benchmark could help.
@Ben D (glad to hear from you!): I don't think I said much about OOO
or SMT – not about how "bad" they were... As to dynamic vs static
consumption – it depends on your process. At 40nm LP, static consumption
is almost zero at room temperature and still very small around 100C. I
don't compare CPUs designed for high frequencies with accelerators
designed for low power – there are CPUs designed or at least synthesised
for low power. Regarding memory coherence in accelerators – sure, one of
the whole slew of things accelerators don't have to worry about, and
making CPUs really, really inefficient...
@Stanislav Datskovskiy – I agree with Rune Holm I guess... With the
twist that it's really a hardware/software co-design problem – hardware
would have the features if much of the software used them, much of the
software would use the features if almost all of the hardware had them.
It's a chicken and egg problem in a situation where hardware and
software evolution is only very loosely coordinated.
One can find mentions that astronomers at JPL integrated the motion of
solar system bodies on a DEC Alpha with 128-bit precision (H-floating in
DEC terminology). And after that, nothing. The marketing sheets of some
NVidia cards say that some graphics computations are supposedly done with
128-bit precision, but you still can't use it through CUDA. There are no
such data types. Intel processors don't have them either (though compilers
do – it works through emulation, very slowly).
I'm curious about your opinion: why doesn't anyone ship processors with
native quadruple precision support? From the standpoint of precision
costs, does it even make sense to build them? For now, those who need 128
bits use double-double arithmetic.
I guess your question is, if you need 128 bits of precision, then is
hardware support better than emulation, right? I don't know, really –
huge multipliers in particular make no sense starting at some size, I
think, better to implement using 4 multiplies using 2x smaller hardware
multipliers or some such. But I never cared about this sort of thing so
I wouldn't really know where to draw the line.
My guess is that a nice system could use software emulation sped up
by fairly trivial hardware support for shuffling bits –
extracting/packing mantissas, exponents and signs, dealing with parts of
large mantissas, etc. – and that such emulation with rudimentary
hardware support would have about the same energy efficiency as
full-blown hardware support for 128b floating point; but it's really
just a guess.
Oh, as to why there are no processors with native/nice support for
quadruple precision: I think it's just because it's a tiny market –
almost no software would use the feature, so why bother. As to
processors targeted at supercomputers – perhaps even scientists don't
care about quad precision and perhaps they do have some sort of
rudimentary support that helps bring emulation to about the same energy
efficiency as full-blown hardware support would have; I dunno.
Over the years I noticed that hardware beyond the level of the
simplest RISC architecture tended to resemble fossilized software
intended to do something faster, or cheaper with regard to some
resource, or better in some way. Faster and cheaper were usually
measurable. Better was much fuzzier.
A real question with floating point arithmetic, "Does it matter that
you almost always get a wrong (imprecise) answer?"
Well, there's also the question of what's the alternative...
In the long run, this spells doom for the x86
Seems like the long run will turn out to have been surprisingly short.
@Aristotle: I think it's still in the "it's not over till it's over"
stage, but yeah.
Looks like it’s in free fall: “Windows PCs sales in
U.S. retail stores fell a staggering 21% in the four-week period from
October 21 to November 17, compared to the same period the previous
year. In short, there is now falling demand for x86 processors.” The
rest of that article looks at plenty more signs but none so visceral as
that one.
And here someone is making sense about relegating the x86 to
the role of coprocessor to an ARM CPU while PCs still need it.
It’s not over but it sure looks like a foregone conclusion at this
point.
(Meanwhile, tangentially, you can
now run bona fide RISC OS on an honest-to-goodness ARM-powered PC –
no emulation, the real deal. The Archimedes 3000 I always dreamed of
when I was a boy but never had the pocket money to buy before they died
– it’s now but a few bucks away. Forgotten youth that never was, here I
come...)
"Free fall" assuming what acceleration constant? :)
I'm with you of course, although the demise of PC as a platform isn't
necessarily that great (it's a question how the desktop computer is
going to be like).
I do not expect the desktop computer to change all that much – unless
you mean “PC as a platform” in the narrow Wintel sense and your question
is about what OS is going to run on ARM desktops. Microsoft is not
likely to deliver one, Apple is highly likely, and I really hope someone
else steps into that void. (ChromeOS...? Eh.) But an Apple-only future
would be a far bleaker outlook than a Microsoft-only one ever was, in
spite of their far better products. (Partly, actually, because
of that. People are delighted to have no other choice than
Apple, in droves.)
I meant "PC as a platform" in the sense of having standard protocols
and building your system using devices from disparate vendors that use
these protocols; and in the sense of being able to run an OS that the
machine didn't come preinstalled with. And even in the narrower sense –
Windows is wonderful in ways that nothing else is. It's not as bad as an
Apple product in terms of the choices it dares to take away from you and
it offers compatibility that a Linux distribution is yet to achieve
(including Android – I can't run simple games on my 2-year-old phone
without upgrading the OS).
As to Apple: I never found their products to be substantially
different from the competition, and I think their design is rather ugly,
especially rounded rectangles everywhere. My explanation of their recent
runaway success is something that you could call hypnosis, and I give
them maybe 5 years, 8 years, tops to fall back into relative obscurity
now that the hypnotist has passed away.
Having programmed for both x86 and PowerPC I found, to my surprise,
that I prefer x86. Well, x64 anyway. Yes, it's messy, but some of that
mess adds value. For instance, loading constants that use more than
16 bits is trivial on x86 but on PowerPC requires either multiple
instructions or reading the constant from a 64-KB section pointed to by a
reserved register. This is one instance of a general problem caused by fixed-size
instructions — sometimes you don't have enough bits to express the
instruction you want. In x86 land you can *always* extend the
instruction set.
x86 instructions can also be far more powerful with addressing modes
like eax += [const+reg1+reg2*8].
So those are the benefits of x86. The costs, on the other hand, only go
down, as the portion of the die associated with decoding becomes a
smaller fraction. Looked at that way it seems inevitable that RISC chips
in a particular market will have a temporary advantage (when decode
size/power matters) but if they fail to win quickly then their advantage
will disappear.
Or maybe you're right and Intel will eventually lose. Certainly
winning streaks are hard to maintain.
You mean you programmed in assembly and liked x86 better? That'd make
sense; RISC is nicer to target a compiler at, not to hand-code for. As
to the portion of the die – look at ARM-based designs with 4 high-end
cores and 4 low-end cores. How big would the low-end cores be if they
were x86? It's not true that larger chips always use larger cores, they
could use more instead, and then chips increasingly shrink over time as
power and yield constraints prevent us from fully reaping Moore's law
benefits.
Winning streaks are impossible to maintain forever. Now if you had a
CISC contender, then CISC would have a chance in the long run, but there
isn't any.
@Aristotle: My opinion is that idea about combining an A5 with an x64
is mad. We're not yet at the point in heterogeneous computing where that
makes sense. Wait for a few more years until the GPU is a first class
citizen of the processor interconnect fabric. Right now it's on the
wrong end of a PCIe link and all access will need an arbitration chip or
some equally ugly hack (not that that doesn't have recent examples like
Optimus v1).
@Yossi, et-al (mostly et-al)
Like you I expect RISC based designs to dominate long-term but, while
nothing lasts forever, forever is a long time. The other comment on
this, mostly applicable to ARM and POWER is that RISC is generally taken
as a base to build on, complexity is added back in every time something
turns out to work better with a specialised instruction e.g. SIMD.
Excellent reading, thanks for that.
Though almost two years after the last post, I still feel the need to
react on that misleading RISC vs CISC debate.
RISC and CISC terms have been extrapolated to qualify very different
notions: variable length of the instruction set, symmetry of the
encoding, decoupling memory accesses from computations, general-purpose
registers vs specialized ones. However, initially the only difference
between RISC and CISC terms is Complex vs Simple (Reduced). Of course
defining the exact boundary between Complex and Simple is a hard
task.
In modern out-of-order processors, the complexity of decoding is not
that important (though I must admit that x86[_64] brought instruction
decoding to an unsuspected level of brainf***king). Maybe one of the
major instruction features that need to be decoded as soon as possible
is whether or not the instruction is a branch, to do everything that is
possible to prevent breaking the instruction fetch. In a second step,
easy identification of register dependencies is really nice to have.
So-called RISC architectures such as Power, ARMv8, Alpha (and others) do
provide simple decoding. But ARMv7 (RISC?) makes branch identification
almost harder than in x86 (CISC). In fact, the first pages of the ARMv8
specifications acknowledge, in a very interesting discussion, the
weakness of the ARMv7.
In the end, I think Wody and Alex are both right: Intel architectures
are internally obviously RISC-like; and yes, the Instruction Set is
complex. But apart from designing extremely-ultra-super-low-end cores I
don't think that the ISA overhead does matter anyway/anymore. If you
were to design Instruction Set from scratch you would obviously not do
anything close to an x86 ISA, however Intel is not starting from
scratch...
I am very late to this discussion.
But I feel you have A FUNDAMENTAL PROBLEM in your logic.
The COST of any given operation you discuss (8/32/load/multiply/etc)
is not fixed. In fact, it has an exponential range of values.
The LOAD instruction takes a picojoule if it is an L1 hit. It
can take 10^6 times that if it triggers some sort of NUMA page access (or
whatever).
Generations of architects have brought us a finely tuned ecosystem
where imperative code is compiled for a certain instruction set,
running on a system with very specific trade-offs.
The amortized cost of an average operation is then somewhat
predictable.
As soon as you try a "new" component, e.g. excess indirection or fuzzy
arithmetic, of course the system is thrown out of whack.
But give it a few decades and there is no reason why a cache system
can make a->p1->p2 faster than accessing a[1],a[2]
Similar arguments can be made about many-core parallelism taking LESS
ABSOLUTE POWER, e.g. by always selecting a core closest to the memory or
whatever (or by lower frequency, as you mentioned).
So your entire argument is therefore limited to the CURRENT status
quo rather than an absolute limit.
"there is no reason why a cache system can make a->p1->p2
faster than accessing a[1],a[2]"
This is correct (though I assume you meant "can't", and that is
incorrect.) Try it.
"Similar arguments can be made about many core parallelism taking
LESS ABSOLUTE POWER e.g. by always selecting a core closest to the
memory or what ever"
They can be made, and they're wrong. Try it.