Working simultaneously vs waiting simultaneously

April 11th, 2014

"Multiprocessing", "multi-threading", "parallelism", "concurrency" etc. etc. can give you two kinds of benefits:

Some systems help with one of these but not the other, so you want to know which one – and if it's the one you need.

For instance, CPython has the infamous GIL – global interpreter lock. To what extent does the GIL render CPython "useless on multiple cores"?
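The short answer: less than the reputation suggests, at least for waiting. CPython releases the GIL around blocking I/O calls, so threads that are waiting don't hold each other up. A minimal sketch (the URL is just a stand-in):

```python
import threading
import urllib.request

def fetch(url):
    # the GIL is dropped while this thread is blocked on the socket,
    # so all ten waits genuinely overlap
    with urllib.request.urlopen(url) as resp:
        return resp.read()

threads = [threading.Thread(target=fetch, args=("http://example.com/",))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```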

So the GIL is not a problem for "simultaneous waiting" for I/O. Is the GIL a problem for simultaneous processing? If you ask me – no, because:

* If you want performance, it's kinda funny to use pure Python and then mourn the fact that you can't run, on 8 cores, Python code that's 30-50x slower than C to begin with.
* You can implement the performance-critical parts in C and release the GIL in that C code – at which point your threads really do run on multiple cores.

So IMO Python doesn't have as bad a story in this department as it's reputed to have – and if it does look bad to you, you probably can't tolerate Python's slowness doing one thing at a time in the first place.

So Python – or C, for that matter – is OK for simultaneous waiting, but is it great? Probably not as great as Go or Erlang – which let you wait in parallel for millions of things. How do they do it? Cheap context management.

Context management is a big challenge of waiting for many things at once. If you wait for a million things, you need a million sets of variables keeping track of what exactly you're waiting for (has the header arrived? then I'm waiting for the query. has it arrived? then I ask the database and wait for it etc. etc.)
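Languages with cheap context management keep those variables in small, growable per-context frames rather than full stacks. Python generators model the idea nicely (the protocol and wait points here are made up for illustration):

```python
# Each suspended generator frame holds one connection's variables -
# "what exactly am I waiting for" costs bytes, not an OS thread's stack.
def connection():
    header = yield "need header"   # parked until the header arrives
    query = yield "need query"     # then until the query arrives
    result = yield ("db", query)   # then until the database answers
    yield ("reply", result)

# 100,000 parked contexts is no big deal this way; an event loop would
# resume each generator when its event arrives.
contexts = [connection() for _ in range(100_000)]
for c in contexts:
    next(c)   # advance each to its first wait point
```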

If those variables are thread-local variables in a million threads, then you run into one of the problems with C – and hence OS-supported threads designed to run C. The problem is that C has no idea how much stack it's gonna need (because of the halting problem, so you can't blame C); and C has no mechanism to detect that it ran out of stack space at runtime and allocate some more (because that's how its ABIs have evolved; in theory C could do this, but it doesn't.)

So the best thing a Unixy OS could do is give C one page for the stack (say 4K), and make, say, the next 1-2M of the virtual address space inaccessible (with 64b pointers, address space is cheap). When C page-faults upon stack overflow, give it more physical memory – say another 4K. This method means at least 4K of allocated physical memory per thread, or 4G for a million threads – rather wasteful. (I think in practice it's usually way worse.) All regardless of us often needing a fraction of that memory for the actual state.
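You can see the per-thread cost on a live system with a sketch like this (numbers are system-dependent, and you may bump into OS thread limits before reaching truly large counts):

```python
import threading

# Ask for small stacks; the OS still rounds up to whole pages and
# reserves a guard region, which is the per-thread cost discussed above.
threading.stack_size(64 * 1024)

stop = threading.Event()
threads = [threading.Thread(target=stop.wait) for _ in range(10_000)]
for t in threads:
    t.start()
# While the threads are parked here, check the process's RSS with your
# tool of choice: thousands of idle waiters, each paying for stack pages.
stop.set()
for t in threads:
    t.join()
```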

And that's before we got to the cost of context switching – which can be made smaller if we use setjmp/longjmp-based coroutines or something similar, but that wouldn't help much with stack space. C's lax approach to stack management – which is the way it is to shave a few cycles off the function call cost – can thus make C terribly inefficient in terms of memory footprint (speed vs space is generally a common trade-off – it's just a bad one in the specific use case of "massive waiting" in C).

So Go/Erlang don't rely on the C-ish OS threads but roll their own – based on their stack management, which doesn't require a contiguous block of addresses. And AFAIK you really can't get readable and efficient "massive waiting" code in any other way – your alternatives, apart from the readable but inefficient threads, are:

* Hand-written state machines – efficient, but the control flow is chopped into pieces and turned inside out.
* Layered state machines as in Twisted – better, but you still have callbacks looking at state variables (sketched below).
* Continuation passing as in Node.js – callbacks again, with closures rather than explicit state variables holding the context.
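To feel the difference from the generator sketch above – the same connection logic in callback style, where the context is explicit state variables every callback must consult (db_lookup is a stand-in for an asynchronous database call):

```python
def db_lookup(query, callback):
    # stand-in: pretend this is completed later by an event loop
    callback("rows for " + query)

class Connection:
    def __init__(self):
        self.state = "header"   # "what exactly am I waiting for", by hand
        self.header = None

    def on_data(self, data):
        if self.state == "header":
            self.header = data
            self.state = "query"
        elif self.state == "query":
            self.state = "db"
            db_lookup(data, self.on_db_result)

    def on_db_result(self, result):
        self.state = "done"
        print("reply:", result)

c = Connection()
c.on_data("GET /chat")   # header arrives
c.on_data("id=42")       # query arrives; the reply goes out via callback
```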

The old Node.js slides say that "green threads/coroutines can improve the situation dramatically, but there is still machinery involved". I'm not sure how that machinery – the machinery in Go or Erlang – is any worse than the machinery involved in continuation passing and event loops (unless the argument is about compatibility more than efficiency – in which case machinery seems to me a surprising choice of words.)

Millions of cheap threads or whatever you call them are exciting if you wait for many events. Are they exciting if you do many things at once? No; C threads are just fine – and C is faster to begin with. You likely don't want to use threads directly – it's ugly – but you can multiplex tasks onto threads easily enough.

A "task" doesn't need to have its own context – it's just a function bound to its arguments. When a worker thread is out of work, it grabs the task out of a queue and runs it to completion. Because the machine works - rather than waits - you don't have the problems with stack management created by waiting. You only wait when there's no more work, but never in the middle of things.

So a thread pool running millions of tasks doesn't need a million threads. It can be a thread per core, maybe more if you have some waiting – say, if you wait for stuff offloaded to a GPU/DSP.
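A minimal sketch of the pattern – tasks as functions bound to arguments, multiplexed onto a thread per core (in CPython specifically, the tasks would need to be GIL-releasing C work, or you'd use a process pool, per the GIL discussion above):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # runs to completion - no waiting in the middle, no context to keep
    return n * n

with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    results = list(pool.map(task, range(100_000)))
```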

I really don't understand how Joe Armstrong could say Erlang is faster than C on multiple cores, or things to that effect, with examples involving image processing – instead of event handling, which is where Erlang can be said to be more efficient.

Finally, a hardware-level example – which kind of hardware is good at simultaneous working, and which is good at simultaneous waiting?

If your goal is parallelizing work, eventually you'll deteriorate to SIMD. SIMD is great because there's just one "manager" – instruction sequencer – for many "workers" – ALUs. CPUs, DSPs and GPUs all have SIMD. NVIDIA calls its ALUs "cores" and 16-32 ALUs running the same instruction "threads", but that's just shameless marketing. A "thread" implies, to everyone but a marketeer, independent control flow, while GPU "threads" march in lockstep.
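NumPy makes a decent model of the lockstep style: a per-element branch can't diverge, so both sides get computed and a mask selects the result per "lane":

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -4.0])

# Independent control flow: each element takes its own branch.
def scalar(v):
    return v * 2.0 if v > 0 else -v

# Lockstep: every lane computes both branches; np.where picks per lane.
result = np.where(x > 0, x * 2.0, -x)

assert list(result) == [scalar(v) for v in x]
```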

In practice, SIMD is hard despite thousands of man-years having been invested into better languages and libraries – because telling a bunch of dumb soldiers marching in lockstep what to do is just harder than running a bunch of self-motivating threads each doing its own thing.

(Harder in one way, easier in another: marching in lockstep precludes races – non-deterministic, once-in-a-blue-moon, scary races. But races of the kind arising between worker threads can be almost completely remedied with tools. Managing the dumb ALUs cannot be made easier with tools and libraries to the same extent – not even close. Where I work, there's roughly an entire team responsible for SIMD programming, while threading is mostly automatic and bugs are weeded out by automated testing.)

If, however, you expect to be waiting much of the time – for memory or for high-latency floating point operations, for instance – then hordes of hardware threads lacking their own ALUs, as in barrel threading or hyper-threading, can be a great idea, while SIMD might do nothing for you. Similarly, a bunch of weaker cores can be better than a smaller number of stronger cores. The point being, what you really need here is a cheap way to keep context and switch between contexts, while actually doing a lot at once is unlikely to be possible in the first place.

Conclusions

* "Multiprocessing" etc. can buy you two different things – working simultaneously and waiting simultaneously – and they need different machinery.
* For working simultaneously you want few, busy contexts: SIMD, or a thread pool with roughly a thread per core, multiplexing tasks that run to completion.
* For waiting simultaneously you want many, cheap contexts: lightweight threads with growable stacks as in Go/Erlang, or hardware threads and barrel threading at the hardware level.

See also

Parallelism and concurrency need different tools – expands on the reasons for races being easy to find in computational code, but impossible to even uniformly define for most event handling code.

1. Phil Miller, Apr 12, 2014

One interesting solution to this in the parallel computing space is the object based decomposition and Structured Dagger (SDAG) control-flow mechanism implemented in the Charm++ parallel programming environment. Each concurrent execution context is a user-defined object driven by asynchronous calls to particular 'entry methods'. Thus, the state for processing a message is given by both the statically known members of the recipient object and the contents of the message itself – no stack problem. When those messages need to be processed in a particular arrangement, SDAG generates code inside each object to store them and consume them as intended.

Thus, when an object's control flow needs to 'block' waiting for some remote message to arrive, the generated code sets up its own continuation in its internal structures, and then simply returns from the executing method. The runtime scheduler driving each worker thread then continues by picking up some other message, and delivering it to whatever object it's directed to. Hence, 'context switching' purely by function call/return.
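A rough sketch of that pattern (plain Python, not Charm++/SDAG syntax):

```python
from collections import deque

class Accumulator:
    # the object's members hold what a blocked stack would have held
    def __init__(self):
        self.pending = None

    def on_message(self, value):
        if self.pending is None:
            self.pending = value   # not ready yet: store the continuation
            return                 # state and simply return to the scheduler
        print("sum:", self.pending + value)
        self.pending = None

inbox = deque([("a", 1), ("b", 10), ("a", 2), ("b", 20)])
objects = {"a": Accumulator(), "b": Accumulator()}
while inbox:                           # the scheduler loop
    dest, value = inbox.popleft()
    objects[dest].on_message(value)    # 'context switch' by call/return
```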

http://ppl.cs.illinois.edu/research/dagger
http://ppl.cs.illinois.edu/papers/13-36 (section 3.2)
http://charm.cs.illinois.edu/manuals/html/charm++/5.html

2. Yossi Kreinin, Apr 12, 2014

Interesting stuff. Could you link to a page with simple example code doing something concrete? (5.html above has code doing something rather abstract – Input a, Input b etc.; I think it'd be easiest to get a gut feeling of how this kind of approach stacks up compared to others by looking at example code doing something concrete – sort something, filter something, send/receive something etc., something an application programmer could be doing.)

3. Barry Kelly, Apr 12, 2014

I'm not going to get into the whole threads / continuation passing / etc. debate (I see threads as CPS with the continuation address stored on the stack, FWIW – the fact that the stack is a big array causes most of the problems) – I just wanted to get a bit "well, actually" on the Python GIL.

You argue that the GIL isn't important because if you want performance you shouldn't be using Python. But IMO that's not the sequence of events. You start out using something like Python because it's productive, there's heaps of libraries that are available and easily installed for it, and you can glue them together into a solution that works well. Over time it gets more featured, bigger, and slower. Where things like the GIL hit you is that you can't use multithreading to get much of that performance back.

You can't simply write it in C to begin with, because that's a major engineering effort – half the libraries you're using don't use C, some of them are pure Python, some of them use C++, etc. You can't even extract out the inner loop, because there is no hot loop.

You say "if you want performance" – but that's a false question. Everybody wants performance. Performance is just another requirement to be prioritized, and if something isn't performed in a finite time, it might as well not be performed at all. But as you do more and more work – all software grows over time – the time required to do this work tends to increase, but the time allotted tends not to increase quite as liberally. Eventually you hit the deadline, and performance moves up the priority list.

tl;dr: performance always matters, even if you're starting out 50x slower – you're getting the 50x back somewhere else.

4. Yossi Kreinin, Apr 12, 2014

Well, the thing is, if you use C bindings on an 8-core machine, you'll get your 8x on top of the 30-50x you gain by not using pure Python. Given that Python does not have particularly good tools for debugging multi-threaded code – C does better there – it makes sense to first implement your bottleneck in C and only then maybe do multi-threading, even in a world without the GIL:

* Moving things to C gives a larger performance boost than multithreading
* It is less buggy than multithreading
* If you want to parallelize the code, then doing it in C is easier to get right, so you'd want to move to C anyway
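A sketch of how the two combine (the library name and function are hypothetical; the real point is that ctypes releases the GIL around foreign calls):

```python
import ctypes
import threading

lib = ctypes.CDLL("./libcrunch.so")   # hypothetical C library
lib.crunch.argtypes = [ctypes.c_int]
lib.crunch.restype = ctypes.c_int

# ctypes drops the GIL for the duration of each call, so the C work
# can occupy all 8 cores while Python threads merely dispatch it.
threads = [threading.Thread(target=lib.crunch, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```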

That said, I realize that temptation exists to just somehow multithread whatever it is you've coded up in Python under time pressure, so I see your point...

5. Terry A. Davis, Apr 13, 2014

TempleOS has master/slave multicore support, not SMP. For real-time graphics, you want to break a problem like a flight simulator into 8 pieces and have each core do one strip of land. You do not want the OS scheduler to move tasks between cores. 7 cores with an extra job doubled-up is not the same as 8 cores!! Putting two tasks on one core makes refresh time 30fps instead of 60fps!

Here is a flight simulator:

https://www.youtube.com/watch?v=FxEq6IM43sA&feature=player_embedded

The code for the flight simulator is here:

http://www.templeos.org/Wb/Demo/Games/EagleDive.html

TempleOS:
http://www.templeos.org

6. Terry A. Davis, Apr 13, 2014

TempleOS does master/slave multicore, not SMP. For my flight simulator, I put one task on each of the 8 cores. I do NOT want the scheduler to change it. If the scheduler put 2 tasks on one core instead of one task on each of 8 cores, that would cause the refresh to glitch and wait a whole nother cycle!!

7. Phil Miller, Apr 13, 2014

Here's the SDAG code for parallel LU factorization without pivoting (for demonstration only, since it would be fine with real numbers, but can be numerically junk in floating point):
http://charm.cs.uiuc.edu/gerrit/gitweb?p=charmlu.git;a=blob;f=lu.ci;h=6d9f5aa0b5dce28982d14fd73bdf86cb0db2c053;hb=31fb59692d2d5ee337cb0ad3e115cd8cf3b9fc16

Here's the correct numerical code that implements 'partial' pivoting:
http://charm.cs.uiuc.edu/gerrit/gitweb?p=charmlu.git;a=blob;f=lu.ci;h=9a1d0650191320efce5172dd31210b760c740a18;hb=HEAD

8. Yuval Greenfield, Apr 13, 2014

Excellent untangling of these two problem domains. I'll just avoid using the words "concurrency" and "parallelism" altogether. There are just too many competing definitions and if you look in the dictionary – "at the same time" is mentioned in both. The potential for misinterpretation is far too great.

The same is accepted for "function", "routine", "procedure" and "subroutine" – they're all broadly used to mean the same thing. If you try to convey different ideas using words that are so strongly related, you're going to have a bad time.

So I'm adopting your "waiting for many things at once" vs "doing many things at once".

P.S. Node.js has a GIL, though it's less famous – perhaps because it doesn't have a name.

9. Yossi Kreinin, Apr 13, 2014

I thought Node.js was actually completely single-threaded because JS is; I guess you could call that the ultimate GIL...

10. Uristoare, May 2, 2014

Well, I think it also depends on how you are allocating resources when you create a specific thread; if it has a low footprint then you could achieve big improvements.
As reported for a node.js module, kernel threads were way faster than simple concurrency.

11. Hooman, May 4, 2014

"If you want performance, it's kinda funny to use pure Python and then mourn the fact that you can't run, on 8 cores, Python code that's 30-50x slower than C to begin with."

Not at all. Python programmers don't measure themselves against the arbitrary yardstick of "C", just like C programmers don't measure themselves against assembly language or machine language. Faster is always better, even if it's not as fast as some other environment.

When the JVM upgraded from green threads to native threads, it became possible to write "working simultaneously" programs in pure Java (though they still wouldn't be as fast as C, at least at the time). Programmers now had the *choice* of using C FFI or pure Java for their threading needs — and I know of zero who use C FFI today, given this choice. Whatever the technical merits may be, it seems that the population in general would rather use their existing language, than escape to C, even if it's slower.

When I write a program for other people, and most of my users tell me that they really wish it worked a different way, my response is not to dig in my heels and insist that this way is technically superior. I change my program to work the way they think it should.

12. C programmer, May 4, 2014

"Moving things to C gives a larger performance boost than multithreading"

Yes or no – it depends entirely on the extent to which the problem is parallelizable.

"It is less buggy than multithreading. If you want to parallelize the code then doing it in C is easier to get right so you'd want to move to C anyway"

Am I reading this correctly? Are you saying C coding is the *less* buggy alternative? I'm a longtime C programmer and I strongly disagree.

Threading in an HLL isn't terribly hard — worst case, you add too many locks, and it doesn't run as efficiently as it could.

Doing *anything* in C is an exercise in getting every detail perfect, something which even seasoned professionals have trouble with — worst case, your process has a security hole the size of Nebraska, and half the internet needs an immediate software update.

13. Yossi Kreinin, May 4, 2014

Well, Python has a "special relationship" with C, as acknowledged but its creator, and it assumes you use C much more than Java does – Java very much prefers you to stay "pure" IMO. I think that for a Java programmer it is more sensible to only know Java and not C than it is for a Python programmer to only know Python and not C; for one thing, Python is way way slower than Java.

As to multi-threaded code vs serial C code – I do prefer the latter, and if I have to do multiple threads, the last thing I would reach for would be explicit locks; I have enough experience with both serial C code and multithreading to be entitled to my own opinion, though of course others are entitled to theirs.

If Python lost the GIL, letting one gun for 8x less latency with threads (and 1x higher throughput, or less... using multiple cores means you'll need more cores to transfer the speedup benefits to the user, eventually) instead of 30-50x less latency and more throughput with C, I wouldn't object. What I would like to see then, though, is good frameworks atop threads (not unlike pool.map) and good race detectors (currently I don't know about these for Python). And even then I'd hardly recommend it for the hard cases. For easy cases with trivial parallelization, not that much code, and just one developer touching that code – perhaps.

14. Adam Kelly, May 6, 2014

Response to #4 above:

Hi Yossi,

I agree fundamentally with your position here, but if the context is a programmer or team that isn't expert in C, then Barry's point carries more weight. There is a bit of a price for introducing C into a pure Python program, because even if all the interfaces are good, you have the worry of debugging / configuring that code.

(I'm speaking in the abstract; I'm much more conversant with C than Python)

15. Ccx, Jun 11, 2014

wrt. "Layered state machines as in Twisted – better, but you still have callbacks looking at state variables"

You might want to read
https://glyph.twistedmatrix.com/2014/02/unyielding.html
which goes over possible concurrency mechanisms available for Python and Twisted (including implicit coroutines you seem to like) and points out why you actually don't want implicit context switches in your code.


