Working simultaneously vs waiting simultaneously
"Multiprocessing", "multi-threading", "parallelism", "concurrency" etc. etc. can give you two kinds of benefits:
- Doing many things at once – 1000 multiplications every cycle.
- Waiting for many things at once – wait for 1000 HTTP requests just issued.
Some systems help with one of these but not the other, so you want to know which one – and if it's the one you need.
For instance, CPython has the infamous GIL – global
interpreter lock. To what extent does the GIL render CPython "useless on multiple cores"?
- Indeed you can hardly do many things at once – not in a single pure Python process. One thread doing something
takes the GIL and the other thread waits.
- You can however wait for many things at once just fine – for example, using a thread pool with the familiar pool.map interface (the multiprocessing.dummy module provides one), or you could spawn your own thread pool to do the same. Many Python threads can concurrently issue system calls that wait for data – reading from TCP sockets, etc. Then instead of 1000 sequential request-wait steps, you issue 1000 requests and wait for them all simultaneously. That could be close to a 1000x speed-up for long waits (with a 1000-thread worker pool; more on that below).
Works like a charm.
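To make this concrete, here's a minimal sketch of simultaneous waiting, assuming Python 3 – multiprocessing.dummy supplies a pool.map that runs on threads, and the URLs are made-up placeholders:

```python
# Simultaneous waiting with a thread pool: 1000 blocking HTTP fetches
# issued concurrently. multiprocessing.dummy.Pool has the same API as
# multiprocessing.Pool but uses threads, which is what we want here -
# the threads spend their time blocked in system calls, during which
# CPython releases the GIL.
from multiprocessing.dummy import Pool
import urllib.request

def fetch(url):
    # Blocks on network I/O; the GIL is released while waiting.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

urls = ["http://example.com/item/%d" % i for i in range(1000)]  # hypothetical
with Pool(1000) as pool:           # one thread per outstanding request
    pages = pool.map(fetch, urls)  # issue all requests, wait for all
```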
So the GIL is not a problem for "simultaneous waiting" for I/O. Is the GIL a problem for simultaneous processing? If you ask me – no, because:
- If you want performance, it's kinda funny to use pure Python and then mourn the fact that you can't run, on 8 cores, Python
code that's 30-50x slower than C to begin with.
- On the other hand, if you use C bindings, then the C code could use multiple threads actually running on multiple
cores just fine; numpy does it if properly configured, for instance. Numpy also uses SIMD/vector instructions (SSE etc.) –
another kind of "doing many things at once" that pure Python can't do regardless of the GIL.
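For instance, here's a sketch of the kind of thing that escapes the GIL entirely – whether it actually uses multiple cores and SIMD depends on the BLAS library numpy was built against (OpenBLAS, MKL, etc.):

```python
# A BLAS-backed numpy operation: the matrix multiply below runs in
# compiled C/Fortran code with the GIL released, and - depending on
# the BLAS numpy was built against - may use multiple cores and SIMD.
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
c = a @ b  # the heavy lifting happens outside the interpreter
```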
So IMO Python doesn't have as bad a story in this department as it's reputed to have – and if it does look bad to you, you
probably can't tolerate Python's slowness doing one thing at a time in the first place.
So Python – or C, for that matter – is OK for simultaneous waiting, but is it great? Probably not as great as Go or
Erlang – which let you wait in parallel for millions of things. How do they do it? Cheap context management.
Context management is a big challenge of waiting for many things at once. If you wait for a million things, you need a
million sets of variables keeping track of what exactly you're waiting for (has the header arrived? then I'm waiting for the
query. has it arrived? then I ask the database and wait for it etc. etc.)
If those variables are thread-local variables in a million threads, then you run into one of the problems with C – and hence
OS-supported threads designed to run C. The problem is that C has no idea how much stack it's gonna need (because of the halting
problem, so you can't blame C); and C has no mechanism to detect that it ran out of stack space at runtime and allocate some
more (because that's how its ABIs have evolved; in theory C could do this, but it doesn't.)
So the best thing a Unixy OS can do is give C one page for the stack (say 4K), and make, say, the next 1-2M of the virtual address space inaccessible (with 64-bit pointers, address space is cheap). When C page-faults upon stack overflow, give it more physical memory – say another 4K. This method means at least 4K of allocated physical memory per thread, or 4G for a million threads – rather wasteful. (I think in practice it's usually way worse.) All this regardless of the fact that we often need only a fraction of that memory for the actual state.
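The per-thread stack cost is visible from Python, too, since Python threads are OS threads; here's a small sketch using threading.stack_size, which can shrink the reservation down to the platform minimum:

```python
# Each OS thread reserves stack address space up front; parking 1000
# threads on an event keeps 1000 stacks alive - the memory cost
# described above. threading.stack_size() shrinks the per-thread
# reservation, but not below the platform minimum (32K in CPython).
import threading

threading.stack_size(64 * 1024)  # request 64K stacks per thread

stop = threading.Event()
threads = [threading.Thread(target=stop.wait) for _ in range(1000)]
for t in threads:
    t.start()   # each thread now sits blocked, its stack allocated
stop.set()      # release all the waiters
for t in threads:
    t.join()
```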
And that's before we get to the cost of context switching – which can be made smaller if we use setjmp/longjmp-based coroutines or something similar, but that
wouldn't help much with stack space. C's lax approach to stack management – which is the way it is to shave a few cycles off the
function call cost – can thus make C terribly inefficient in terms of memory footprint (speed vs space is generally a common
trade-off – it's just a bad one in the specific use case of "massive waiting" in C).
So Go/Erlang don't rely on the C-ish OS threads but roll their own – based on their stack management, which doesn't require a
contiguous block of addresses. And AFAIK you really can't get readable and efficient "massive waiting" code in any
other way – your alternatives, apart from the readable but inefficient threads, are:
- Manual state machine management – yuck
- Layered state machines as in Twisted –
better, but you still have callbacks looking at state variables
- Continuation passing as in Node.js – perhaps nicer still, but still far from the
smoothness of threads/processes/coroutines
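To show the contrast, here's a toy sketch – read_request, query_db, send and their async variants are hypothetical helpers, not any real API:

```python
# The same "read request, query, reply" flow in two styles. With a
# thread or coroutine, the waiting state lives in local variables;
# in continuation-passing style, each wait point becomes a callback
# and the "stack" is a chain of closures.
# (read_request, query_db, send and the *_async variants are
# hypothetical helpers, for illustration only.)

def handle(conn):
    req = read_request(conn)   # blocks; state kept on the stack
    row = query_db(req.key)    # blocks again
    send(conn, row)

def handle_cps(conn):
    def on_request(req):
        def on_row(row):
            send(conn, row)
        query_db_async(req.key, on_row)   # register the continuation
    read_request_async(conn, on_request)  # returns immediately
```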
The old Node.js slides say that "green
threads/coroutines can improve the situation dramatically, but there is still machinery involved". I'm not sure how that
machinery – the machinery in Go or Erlang – is any worse than the machinery involved in continuation passing and event loops
(unless the argument is about compatibility more than efficiency – in which case machinery seems to me a surprising choice of
words.)
Millions of cheap threads or whatever you call them are exciting if you wait for many events. Are they exciting if you do
many things at once? No; C threads are just fine – and C is faster to begin with. You likely don't want to use threads directly
– it's ugly – but you can multiplex tasks onto threads easily enough.
A "task" doesn't need to have its own context – it's just a function bound to its arguments. When a worker thread is out of
work, it grabs the task out of a queue and runs it to completion. Because the machine works - rather than
waits - you don't have the problems with stack management created by waiting. You only wait when there's no more work,
but never in the middle of things.
So a thread pool running millions of tasks doesn't need a million threads. It can be a thread per core,
maybe more if you have some waiting – say, if you wait for stuff offloaded to a GPU/DSP.
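Here's a minimal sketch of such a task pool in Python – for illustration of the structure only, since under the GIL pure-Python task bodies would be serialized; picture the task bodies as GIL-releasing C calls:

```python
# A task pool: one worker thread per core, each pulling (function,
# args) tasks off a shared queue and running them to completion.
# Workers only wait when the queue is empty - never mid-task - so
# there's no per-task context to keep around.
import os, queue, threading

tasks = queue.Queue()

def worker():
    while True:
        func, args = tasks.get()
        if func is None:      # sentinel: time to shut down
            break
        func(*args)           # run the task to completion
        tasks.task_done()

workers = [threading.Thread(target=worker) for _ in range(os.cpu_count())]
for w in workers:
    w.start()

for i in range(10):
    tasks.put((print, ("task", i)))  # a task: a function + its arguments

tasks.join()                         # wait until all tasks are done
for _ in workers:
    tasks.put((None, ()))            # one shutdown sentinel per worker
for w in workers:
    w.join()
```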
I really don't understand how Joe Armstrong could say Erlang is faster than C on multiple cores, or
things to that effect, with examples involving image processing – instead of event handling, which is where Erlang can be said to be more efficient.
Finally, a hardware-level example – which kind of hardware is good at simultaneous working, and which is good at simultaneous
waiting?
If your goal is parallelizing work, eventually you'll deteriorate to SIMD. SIMD is great because there's just one "manager" –
instruction sequencer – for many "workers" – ALUs. CPUs, DSPs and GPUs all have SIMD. NVIDIA calls its ALUs "cores" and 16-32
ALUs running the same instruction "threads", but that's just shameless marketing. A "thread" implies, to
everyone but a marketeer, independent control flow, while GPU "threads" march in lockstep.
In practice, SIMD is hard despite thousands of man-years having been invested into better languages and libraries – because
telling a bunch of dumb soldiers marching in lockstep what to do is just harder than running a bunch of self-motivating threads
each doing its own thing.
(Harder in one way, easier in another: marching in lockstep precludes races – non-deterministic, once-in-a-blue-moon, scary races. But races of the kind arising between worker threads can be almost completely remedied with tools. Managing the dumb ALUs cannot be made easier with tools and libraries to the same extent – not even close. Where I work, there's roughly an entire team responsible for SIMD programming, while threading is mostly automatic and bugs are weeded out by automated testing.)
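To give a gut feeling for the lockstep constraint, here's a toy numpy sketch – a per-element branch must become a per-lane select, because all the "soldiers" execute the same instruction:

```python
# Scalar code can branch per element; lockstep (SIMD-style) code
# computes both sides and selects per lane with numpy.where.
import numpy as np

def scalar(xs):
    out = []
    for x in xs:                 # independent control flow per element
        out.append(x * 2 if x > 0 else -x)
    return out

def lockstep(xs):
    xs = np.asarray(xs)
    # both branches are evaluated for every lane, then selected
    return np.where(xs > 0, xs * 2, -xs)
```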
If, however, you expect to be waiting much of the time – for memory or for high-latency floating point operations, for
instance – then hordes of hardware threads lacking their own ALUs, as in barrel threading or hyper-threading, can be a
great idea, while SIMD might do nothing for you. Similarly, a bunch of weaker cores can be better than a smaller number of
stronger cores. The point being, what you really need here is a cheap way to keep context and switch between contexts, while
actually doing a lot at once is unlikely to be possible in the first place.
Conclusions
- Doing little or nothing while waiting for many things is both surprisingly useful and surprisingly hard (which took me way too long to internalize, both in my hardware-related and my software/server-related work). It motivates things that look rather strange, such as "green threads" and hardware threads without their own ALUs.
- Actually doing many things in parallel – to me the more "obviously useful" thing – is difficult in an entirely different
way. It tends to drag in ugly languages, intrinsics, libraries etc. about as much as having to do one single thing
quickly. The "parallelism" part is actually the simplest (few threads so easy context management; races either non-existent
[SIMD] or very easy to weed out [worker pool running tasks])
- People doing servers (which wait a lot) and people doing number-crunching (work) think very differently about these things.
Transplanting experience/advice from one area to the other can lead to nonsensical conclusions.
See also
Parallelism and concurrency need different tools – expands on the reasons why races are easy to find in computational code, but impossible to even uniformly define for most event-handling code.
One interesting solution to this in the parallel computing space is
the object based decomposition and Structured Dagger (SDAG) control-flow
mechanism implemented in the Charm++ parallel programming environment.
Each concurrent execution context is a user-defined object driven by
asynchronous calls to particular 'entry methods'. Thus, the state for
processing a message is given by both the statically known members of
the recipient object and the contents of the message itself – no stack
problem. When those messages need to be processed in a particular
arrangement, SDAG generates code inside each object to store them and
consume them as intended.
Thus, when an object's control flow needs to 'block' waiting for some
remote message to arrive, the generated code sets up its own
continuation in its internal structures, and then simply returns from
the executing method. The runtime scheduler driving each worker thread
then continues by picking up some other message, and delivering it to
whatever object it's directed to. Hence, 'context switching' purely by
function call/return.
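A toy Python sketch of that shape (not Charm++ syntax – just the mechanism: the object stores its continuation state and returns, and the scheduler "context switches" by plain call/return):

```python
# Toy sketch: message-driven objects whose "blocking" is storing
# state in the object and returning to the scheduler loop.
import queue

msgs = queue.Queue()

class Adder:
    def __init__(self):
        self.pending = None            # continuation state, not a stack

    def recv(self, value):
        if self.pending is None:
            self.pending = value       # "block": remember and return
        else:
            print("sum:", self.pending + value)
            self.pending = None

obj = Adder()
msgs.put((obj, 1))
msgs.put((obj, 2))
while not msgs.empty():                # the scheduler's worker loop
    target, value = msgs.get()
    target.recv(value)                 # deliver; plain call/return
```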
http://ppl.cs.illinois.edu/research/dagger
http://ppl.cs.illinois.edu/papers/13-36 (section
3.2)
http://charm.cs.illinois.edu/manuals/html/charm++/5.html
Interesting stuff. Could you link to a page with simple example code
doing something concrete? (5.html above has code doing something rather
abstract – Input a, Input b etc.; I think it'd be easiest to get a gut
feeling of how this kind of approach stacks up compared to others by
looking at example code doing something concrete – sort something,
filter something, send/receive something etc., something an application
programmer could be doing.)
I'm not going to get into the whole threads / continuation passing /
etc. (I see threads as CPS with the continuation address stored on the
stack, FWIW – the fact that the stack is a big array causes most of the
problems) – I just wanted to get a bit "well, actually" on Python
GIL.
You argue that the GIL isn't important because if you want
performance you shouldn't be using Python. But IMO that's not the
sequence of events. You start out using something like Python because
it's productive, there's heaps of libraries that are available and
easily installed for it, and you can glue them together into a solution
that works well. Over time it gets more featured, bigger, and slower.
Where things like the GIL hit you is that you can't use multithreading
to get much of that performance back.
You can't simply write it in C to begin with, because that's a major
engineering effort – half the libraries you're using don't use C, some
of them are pure Python, some of them use C++, etc. You can't even
extract out the inner loop, because there is no hot loop.
You say "if you want performance" – but that's a false question.
Everybody wants performance. Performance is just another requirement to
be prioritized, and if something isn't performed in a finite time, it
might as well not be performed at all. But as you do more and more work
– all software grows over time – the time required to do this work tends
to increase, but the time allotted tends not to increase quite as
liberally. Eventually you hit the deadline, and performance moves up the
priority list.
tl/dr: performance always matters, even if you're starting out 50x
slower – you're getting the 50x back somewhere else.
Well, the thing is, if you use C bindings on an 8-core machine you'll get your 8x on top of the 30-50x you recover by leaving pure Python. Given that Python
does not have particularly good tools for debugging multi-threaded code
– C does better there – it makes sense to first implement your
bottleneck in C and then maybe do multi-threading even in a world
without the GIL:
* Moving things to C gives a larger performance boost than
multithreading
* It is less buggy than multithreading
* If you want to parallelize the code then doing it in C is easier to
get right so you'd want to move to C anyway
That said, I realize that temptation exists to just somehow
multithread whatever it is you've coded up in Python under time
pressure, so I see your point...
TempleOS has master/slave multicore support, not SMP. For real-time
graphics, you want to break a problem like a flight simulator into 8
pieces and have each core do one strip of land. You do not want the OS
scheduler to move tasks between cores. 7 cores with an extra job
doubled-up is not the same as 8 cores!! Putting two tasks on one core
makes refresh time 30fps instead of 60fps!
Here is a flight simulator:
https://www.youtube.com/watch?v=FxEq6IM43sA&feature=player_embedded
The code for the flight simulator is here:
http://www.templeos.org/Wb/Demo/Games/EagleDive.html
TempleOS:
http://www.templeos.org
TempleOS does master/slave multicore, not SMP. For my flight
simulator, I put one task on each of the 8 cores. I do NOT want the
scheduler to change it. If the scheduler put 2 tasks on one core instead
of one task on each of 8 cores, that would cause the refresh to glitch
and wait a whole nother cycle!!
Here's the SDAG code for parallel LU factorization without pivoting
(for demonstration only, since it would be fine with real numbers, but
can be numerically junk in floating point):
http://charm.cs.uiuc.edu/gerrit/gitweb?p=charmlu.git;a=blob;f=lu.ci;h=6d9f5aa0b5dce28982d14fd73bdf86cb0db2c053;hb=31fb59692d2d5ee337cb0ad3e115cd8cf3b9fc16
Here's the correct numerical code that implements 'partial'
pivoting:
http://charm.cs.uiuc.edu/gerrit/gitweb?p=charmlu.git;a=blob;f=lu.ci;h=9a1d0650191320efce5172dd31210b760c740a18;hb=HEAD
Excellent untangling of these two problem domains. I'll just avoid
using the words "concurrency" and "parallelism" altogether. There are
just too many competing definitions and if you look in the dictionary –
"at the same time" is mentioned in both. The potential for
misinterpretation is far too great.
The same goes for "function", "routine", "procedure" and "subroutine". If you try to convey different ideas using words that are
so strongly related, or are so broadly used to mean the same thing –
you're going to have a bad time.
So I'm adopting your "waiting for many things at once" vs "doing many
things at once".
PS nodejs has a GIL, though it's less famous – perhaps because it doesn't have a name.
I thought Node.js was actually completely single-threaded because JS
is; I guess you could call that the ultimate GIL...
Well, I think it also depends on how you allocate resources when creating a specific thread; if it has a low footprint, then you could achieve big improvements.
As reported in a node.js module, kernel threads were way faster than simple concurrency.
"If you want performance, it's kinda funny to use pure Python and
then mourn the fact that you can't run, on 8 cores, Python code that's
30-50x slower than C to begin with."
Not at all. Python programmers don't measure themselves against the
arbitrary yardstick of "C", just like C programmers don't measure
themselves against assembly language or machine language. Faster is
always better, even if it's not as fast as some other environment.
When the JVM upgraded from green threads to native threads, it became
possible to write "working simultaneously" programs in pure Java (though
they still wouldn't be as fast as C, at least at the time). Programmers
now had the *choice* of using C FFI or pure Java for their threading
needs — and I know of zero who use C FFI today, given this choice.
Whatever the technical merits may be, it seems that the population in
general would rather use their existing language, than escape to C, even
if it's slower.
When I write a program for other people, and most of my users tell me
that they really wish it worked a different way, my response is not to
dig in my heels and insist that this way is technically superior. I
change my program to work the way they think it should.
"Moving things to C gives a larger performance boost than
multithreading"
Yes or no. This depends entirely on the extent to which the problem
is parallelizable.
"It is less buggy than multithreading. If you want to parallelize the
code then doing it in C is easier to get right so you'd want to move to
C anyway"
Am I reading this correctly? Are you saying C coding is the *less*
buggy alternative? I'm a longtime C programmer and I strongly
disagree.
Threading in an HLL isn't terribly hard — worst case, you add too
many locks, and it doesn't run as efficiently as it could.
Doing *anything* in C is an exercise in getting every detail perfect,
something which even seasoned professionals have trouble with — worst
case, your process has a security hole the size of Nebraska, and half
the internet needs an immediate software update.
Well, Python has a "special relationship" with C, as acknowledged by its creator, and it assumes you use C much more than Java does – Java
very much prefers you to stay "pure" IMO. I think that for a Java
programmer it is more sensible to only know Java and not C than it is
for a Python programmer to only know Python and not C; for one thing,
Python is way way slower than Java.
As to multi-threaded code vs serial C code – I do prefer the latter, and if I have to do multiple threads, the last thing I would reach for would be explicit locks. I have enough experience with both serial C code and multithreading to be entitled to my own opinion, though of course others are entitled to theirs.
If Python lost the GIL, letting one gun for 8x less latency with threads (and 1x higher throughput, or less... using multiple cores means you'll need more cores to transfer the speedup benefits to the user, eventually) instead of 30-50x less latency and more throughput with C, I wouldn't object. What I would like to see then
though, is good frameworks atop threads (not unlike pool.map) and good
race detectors (currently I don't know about these for Python). And even
then I'd hardly recommend it for the hard cases. For easy cases
with trivial parallelization and not that much code and just one
developer touching that code – perhaps.
Response to #4 above:
Hi Yossi,
I agree fundamentally with your position here, but if the context is
a programmer or team that isn't expert in C, then Barry's point carries
more weight. There is a bit of a price for introducing C into a pure Python program, because even if all the interfaces are good, you have the worry of debugging / configuring that code.
(I'm speaking in the abstract; I'm much more conversant with C than
Python)
wrt. "Layered state machines as in Twisted – better, but you still
have callbacks looking at state variables"
You might want to read
https://glyph.twistedmatrix.com/2014/02/unyielding.html
which goes over possible concurrency mechanisms available for Python and
Twisted (including implicit coroutines you seem to like) and points out
why you actually don't want implicit context switches in your code.