Working simultaneously vs waiting simultaneously

"Multiprocessing", "multi-threading", "parallelism", "concurrency" etc. etc. can give you two kinds of benefits:

  • Doing many things at once – 1000 multiplications every cycle.
  • Waiting for many things at once – wait for 1000 HTTP requests just issued.

Some systems help with one of these but not the other, so you want to know which one – and if it's the one you need.

For instance, CPython has the infamous GIL – global interpreter lock. To what extent does the GIL render CPython "useless on multiple cores"?

  • Indeed you can hardly do many things at once – not in a single pure Python process. One thread doing something takes the GIL and the other thread waits.
  • You can however wait for many things at once just fine – for example, using the multiprocessing module, or you could spawn your own thread pool to do the same. Many Python threads can concurrently issue system calls that wait for data – reading from TCP sockets, etc. Then instead of 1000 request-wait, request-wait steps, you issue 1000 requests and wait for them all simultaneously. That could be close to a 1000x speed-up for long waits (with a 1000-thread worker pool; more on that below). Works like a charm.
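A minimal sketch of this "wait in parallel" pattern, using only the standard library (the `fetch` function is invented for illustration – `time.sleep` stands in for a blocking network call; CPython threads release the GIL while blocked in the OS):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(i):
    # Stand-in for a blocking I/O call (socket read, HTTP request, ...).
    # While a thread sleeps/waits inside the OS it holds no GIL,
    # so many such waits overlap freely.
    time.sleep(0.05)
    return i

start = time.time()
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(fetch, range(20)))
elapsed = time.time() - start
# 20 waits of 0.05s each overlap: wall time stays close to 0.05s, not ~1s.
```

The same structure works with real sockets or `urllib.request`; the speed-up comes entirely from overlapping the waits, not from running Python bytecode in parallel.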

So GIL is not a problem for "simultaneous waiting" for I/O. Is GIL a problem for simultaneous processing? If you ask me – no, because:

  • If you want performance, it's kinda funny to use pure Python and then mourn the fact that you can't run, on 8 cores, Python code that's 30-50x slower than C to begin with.
  • On the other hand, if you use C bindings, then the C code could use multiple threads actually running on multiple cores just fine; numpy does it if properly configured, for instance. Numpy also uses SIMD/vector instructions (SSE etc.) – another kind of "doing many things at once" that pure Python can't do regardless of the GIL.
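The numpy point can be made concrete: the same element-wise multiply done in pure Python executes one interpreted multiplication at a time under the GIL, while numpy runs the whole loop in C, vectorized with SIMD where the build supports it (a sketch, assuming numpy is installed; the check below only verifies the two routes agree, since timings vary by machine):

```python
import numpy as np

n = 100_000
a = np.arange(n, dtype=np.float64)
b = np.arange(n, dtype=np.float64)

# Pure Python: one bytecode-dispatched multiplication per element,
# serialized by the GIL.
slow = [x * y for x, y in zip(a.tolist(), b.tolist())]

# numpy: the loop runs in C; SIMD (and, for BLAS-backed ops like np.dot,
# multiple cores) can be used regardless of the GIL.
fast = a * b
```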

So IMO Python doesn't have as bad a story in this department as it's reputed to have – and if it does look bad to you, you probably can't tolerate Python's slowness doing one thing at a time in the first place.

So Python – or C, for that matter – is OK for simultaneous waiting, but is it great? Probably not as great as Go or Erlang – which let you wait in parallel for millions of things. How do they do it? Cheap context management.

Context management is a big challenge of waiting for many things at once. If you wait for a million things, you need a million sets of variables keeping track of what exactly you're waiting for (has the header arrived? then I'm waiting for the query. has it arrived? then I ask the database and wait for it etc. etc.)

If those variables are thread-local variables in a million threads, then you run into one of the problems with C – and hence OS-supported threads designed to run C. The problem is that C has no idea how much stack it's gonna need (because of the halting problem, so you can't blame C); and C has no mechanism to detect that it ran out of stack space at runtime and allocate some more (because that's how its ABIs have evolved; in theory C could do this, but it doesn't.)

So the best thing a Unixy OS can do is give C one page for the stack (say 4K), and make, say, the next 1-2M of the virtual address space inaccessible (with 64-bit pointers, address space is cheap). When C page-faults upon stack overflow, give it more physical memory – say another 4K. This method means at least 4K of allocated physical memory per thread, or 4G for a million threads – rather wasteful. (I think in practice it's usually way worse.) All regardless of us often needing a fraction of that memory for the actual state.
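For the record, the back-of-the-envelope arithmetic (a decimal million of threads, and only the 4K first page counted – real-world defaults commit considerably more per thread):

```python
PAGE = 4 * 1024            # one committed stack page per thread
THREADS = 1_000_000

min_stack_memory = PAGE * THREADS
# ~4 GB of physical memory spent on mostly-empty stacks
print(min_stack_memory)    # 4096000000 bytes, i.e. roughly 3.8 GiB
```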

And that's before we got to the cost of context switching – which can be made smaller if we use setjmp/longjmp-based coroutines or something similar, but that wouldn't help much with stack space. C's lax approach to stack management – which is the way it is to shave a few cycles off the function call cost – can thus make C terribly inefficient in terms of memory footprint (speed vs space is generally a common trade-off – it's just a bad one in the specific use case of "massive waiting" in C).

So Go/Erlang don't rely on the C-ish OS threads but roll their own – based on their stack management, which doesn't require a contiguous block of addresses. And AFAIK you really can't get readable and efficient "massive waiting" code in any other way – your alternatives, apart from the readable but inefficient threads, are:

  • Manual state machine management – yuck
  • Layered state machines as in Twisted – better, but you still have callbacks looking at state variables
  • Continuation passing as in Node.js – perhaps nicer still, but still far from the smoothness of threads/processes/coroutines
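To make the contrast concrete, here is a toy "receive a header, then a body" exchange done both ways in Python (all names invented for illustration). The manual state machine scatters the context across explicit state variables that every handler must inspect; a generator coroutine keeps the context in its locals and its suspension point – which is roughly what green threads give you language-wide:

```python
# Manual state machine: every event handler inspects state variables.
class Machine:
    def __init__(self):
        self.state = "WANT_HEADER"
        self.header = None

    def on_data(self, chunk):
        if self.state == "WANT_HEADER":
            self.header = chunk
            self.state = "WANT_BODY"
        elif self.state == "WANT_BODY":
            self.state = "DONE"
            return (self.header, chunk)

# Coroutine: the logic reads top to bottom; the generator's locals and
# suspension point *are* the context, kept for us by the language.
def coroutine():
    header = yield
    body = yield
    return (header, body)

def drive(gen, chunks):
    next(gen)                      # run to the first yield
    try:
        for c in chunks:
            gen.send(c)
    except StopIteration as done:  # the coroutine returned
        return done.value

m = Machine()
m.on_data("HDR")
via_machine = m.on_data("BODY")
via_coroutine = drive(coroutine(), ["HDR", "BODY"])
```

Two states are bearable; a protocol with a dozen waiting points makes the state-machine version sprawl while the coroutine stays a straight-line function.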

The old Node.js slides say that "green threads/coroutines can improve the situation dramatically, but there is still machinery involved". I'm not sure how that machinery – the machinery in Go or Erlang – is any worse than the machinery involved in continuation passing and event loops (unless the argument is about compatibility more than efficiency – in which case machinery seems to me a surprising choice of words.)

Millions of cheap threads or whatever you call them are exciting if you wait for many events. Are they exciting if you do many things at once? No; C threads are just fine – and C is faster to begin with. You likely don't want to use threads directly – it's ugly – but you can multiplex tasks onto threads easily enough.

A "task" doesn't need to have its own context – it's just a function bound to its arguments. When a worker thread is out of work, it grabs a task off a queue and runs it to completion. Because the machine works – rather than waits – you don't have the stack-management problems created by waiting. You only wait when there's no more work, but never in the middle of things.

So a thread pool running millions of tasks doesn't need a million threads. It can be a thread per core, maybe more if you have some waiting – say, if you wait for stuff offloaded to a GPU/DSP.
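A bare-bones sketch of such a pool in Python's stdlib (a real program would reach for `concurrent.futures.ThreadPoolExecutor`; this just shows the shape – a handful of threads draining a queue of (function, argument) tasks, each run to completion):

```python
import queue
import threading

tasks = queue.Queue()
results = queue.Queue()

def worker():
    # Run each task to completion; block only when the queue is empty,
    # never in the middle of a task - so no per-task stack to keep alive.
    while True:
        item = tasks.get()
        if item is None:           # sentinel: no more work
            return
        func, arg = item
        results.put(func(arg))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for i in range(100):               # far more tasks than threads
    tasks.put((lambda x: x * x, i))
for _ in threads:                  # one sentinel per worker
    tasks.put(None)
for t in threads:
    t.join()

total = sum(results.get() for _ in range(100))
```

Four threads, a hundred tasks, no stack kept per task – the count of threads tracks cores (or waiting), not the amount of work.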

I really don't understand how Joe Armstrong could say Erlang is faster than C on multiple cores, or things to that effect, with examples involving image processing – instead of event handling, which is where Erlang can be said to be more efficient.

Finally, a hardware-level example – which kind of hardware is good at simultaneous working, and which is good at simultaneous waiting?

If your goal is parallelizing work, eventually you'll deteriorate to SIMD. SIMD is great because there's just one "manager" – instruction sequencer – for many "workers" – ALUs. CPUs, DSPs and GPUs all have SIMD. NVIDIA calls its ALUs "cores" and 16-32 ALUs running the same instruction "threads", but that's just shameless marketing. A "thread" implies, to everyone but a marketeer, independent control flow, while GPU "threads" march in lockstep.

In practice, SIMD is hard despite thousands of man-years having been invested into better languages and libraries – because telling a bunch of dumb soldiers marching in lockstep what to do is just harder than running a bunch of self-motivating threads each doing its own thing.

(Harder in one way, easier in another: marching in lockstep precludes races – non-deterministic, once-in-a-blue-moon, scary races. But races of the kind arising between worker threads can be almost completely remedied with tools. Managing the dumb ALUs cannot be made easier with tools and libraries to the same extent – not even close. Where I work, there's roughly an entire team responsible for SIMD programming, while threading is mostly automatic and bugs are weeded out by automated testing.)

If, however, you expect to be waiting much of the time – for memory or for high-latency floating point operations, for instance – then hordes of hardware threads lacking their own ALUs, as in barrel threading or hyper-threading, can be a great idea, while SIMD might do nothing for you. Similarly, a bunch of weaker cores can be better than a smaller number of stronger cores. The point being, what you really need here is a cheap way to keep context and switch between contexts, while actually doing a lot at once is unlikely to be possible in the first place.


To sum up:

  • Doing little or nothing while waiting for many things is both surprisingly useful and surprisingly hard (it took me way too long to internalize this, both in my hardware-related and my software/server-related work). It motivates things that look rather strange, such as "green threads" and hardware threads without their own ALUs.
  • Actually doing many things in parallel – to me the more "obviously useful" thing – is difficult in an entirely different way. It tends to drag in ugly languages, intrinsics, libraries etc. about as much as having to do one single thing quickly. The "parallelism" part is actually the simplest (few threads, so easy context management; races either non-existent [SIMD] or very easy to weed out [worker pool running tasks]).
  • People doing servers (which wait a lot) and people doing number-crunching (work) think very differently about these things. Transplanting experience/advice from one area to the other can lead to nonsensical conclusions.

See also

Parallelism and concurrency need different tools – expands on why races are easy to find in computational code, but impossible to even uniformly define for most event-handling code.


#1 Phil Miller on 04.12.14 at 10:16 am

One interesting solution to this in the parallel computing space is the object based decomposition and Structured Dagger (SDAG) control-flow mechanism implemented in the Charm++ parallel programming environment. Each concurrent execution context is a user-defined object driven by asynchronous calls to particular 'entry methods'. Thus, the state for processing a message is given by both the statically known members of the recipient object and the contents of the message itself – no stack problem. When those messages need to be processed in a particular arrangement, SDAG generates code inside each object to store them and consume them as intended.

Thus, when an object's control flow needs to 'block' waiting for some remote message to arrive, the generated code sets up its own continuation in its internal structures, and then simply returns from the executing method. The runtime scheduler driving each worker thread then continues by picking up some other message, and delivering it to whatever object it's directed to. Hence, 'context switching' purely by function call/return. (section 3.2)

#2 Yossi Kreinin on 04.12.14 at 10:40 am

Interesting stuff. Could you link to a page with simple example code doing something concrete? (5.html above has code doing something rather abstract – Input a, Input b etc.; I think it'd be easiest to get a gut feeling of how this kind of approach stacks up compared to others by looking at example code doing something concrete – sort something, filter something, send/receive something etc., something an application programmer could be doing.)

#3 Barry Kelly on 04.12.14 at 6:28 pm

I'm not going to get into the whole threads / continuation passing / etc. (I see threads as CPS with the continuation address stored on the stack, FWIW – the fact that the stack is a big array causes most of the problems) – I just wanted to get a bit "well, actually" on Python GIL.

You argue that the GIL isn't important because if you want performance you shouldn't be using Python. But IMO that's not the sequence of events. You start out using something like Python because it's productive, there's heaps of libraries that are available and easily installed for it, and you can glue them together into a solution that works well. Over time it gets more featured, bigger, and slower. Where things like the GIL hit you is that you can't use multithreading to get much of that performance back.

You can't simply write it in C to begin with, because that's a major engineering effort – half the libraries you're using don't use C, some of them are pure Python, some of them use C++, etc. You can't even extract out the inner loop, because there is no hot loop.

You say "if you want performance" – but that's a false question. Everybody wants performance. Performance is just another requirement to be prioritized, and if something isn't performed in a finite time, it might as well not be performed at all. But as you do more and more work – all software grows over time – the time required to do this work tends to increase, but the time allotted tends not to increase quite as liberally. Eventually you hit the deadline, and performance moves up the priority list.

tl/dr: performance always matters, even if you're starting out 50x slower – you're getting the 50x back somewhere else.

#4 Yossi Kreinin on 04.12.14 at 9:36 pm

Well, the thing is, if you use C bindings on an 8-core machine you'll get your 8x times the 30-50x spent on using Python. Given that Python does not have particularly good tools for debugging multi-threaded code – C does better there – it makes sense to first implement your bottleneck in C and then maybe do multi-threading even in a world without the GIL:

* Moving things to C gives a larger performance boost than multithreading
* It is less buggy than multithreading
* If you want to parallelize the code then doing it in C is easier to get right so you'd want to move to C anyway

That said, I realize that temptation exists to just somehow multithread whatever it is you've coded up in Python under time pressure, so I see your point…

#5 Terry A. Davis on 04.13.14 at 2:39 am

TempleOS has master/slave multicore support, not SMP. For real-time graphics, you want to break a problem like a flight simulator into 8 pieces and have each core do one strip of land. You do not want the OS scheduler to move tasks between cores. 7 cores with an extra job doubled-up is not the same as 8 cores!! Putting two tasks on one core makes refresh time 30fps instead of 60fps!

Here is a flight simulator:

The code for the flight simulator is here:


#6 Terry A. Davis on 04.13.14 at 2:41 am

TempleOS does master/slave multicore, not SMP. For my flight simulator, I put one task on each of the 8 cores. I do NOT want the scheduler to change it. If the scheduler put 2 tasks on one core instead of one task on each of 8 cores, that would cause the refresh to glitch and wait a whole nother cycle!!

#7 Phil Miller on 04.13.14 at 10:02 am

Here's the SDAG code for parallel LU factorization without pivoting (for demonstration only, since it would be fine with real numbers, but can be numerically junk in floating point):;a=blob;;h=6d9f5aa0b5dce28982d14fd73bdf86cb0db2c053;hb=31fb59692d2d5ee337cb0ad3e115cd8cf3b9fc16

Here's the correct numerical code that implements 'partial' pivoting:;a=blob;;h=9a1d0650191320efce5172dd31210b760c740a18;hb=HEAD

#8 Yuval Greenfield on 04.13.14 at 11:05 pm

Excellent untangling of these two problem domains. I'll just avoid using the words "concurrency" and "parallelism" altogether. There are just too many competing definitions and if you look in the dictionary – "at the same time" is mentioned in both. The potential for misinterpretation is far too great.

As is accepted for "function", "routine", "procedure" and "subroutine". If you try and convey different ideas using words that are so strongly related, or are so broadly used to mean the same thing – you're going to have a bad time.

So I'm adopting your "waiting for many things at once" vs "doing many things at once".

PS nodejs has a GIL though it's less famous perhaps because it doesn't have a name.

#9 Yossi Kreinin on 04.13.14 at 11:17 pm

I thought Node.js was actually completely single-threaded because JS is; I guess you could call that the ultimate GIL…

#10 Uristoare on 05.02.14 at 2:52 pm

Well i think it also depends on how you are allocating resources when you are creating a specific thread, if it has a low footprint then you could achieve big improvements.
As reported in a node.js module, kernel threads were way faster than simple concurrency.

#11 Hooman on 05.04.14 at 11:39 am

"If you want performance, it's kinda funny to use pure Python and then mourn the fact that you can't run, on 8 cores, Python code that's 30-50x slower than C to begin with."

Not at all. Python programmers don't measure themselves against the arbitrary yardstick of "C", just like C programmers don't measure themselves against assembly language or machine language. Faster is always better, even if it's not as fast as some other environment.

When the JVM upgraded from green threads to native threads, it became possible to write "working simultaneously" programs in pure Java (though they still wouldn't be as fast as C, at least at the time). Programmers now had the *choice* of using C FFI or pure Java for their threading needs — and I know of zero who use C FFI today, given this choice. Whatever the technical merits may be, it seems that the population in general would rather use their existing language, than escape to C, even if it's slower.

When I write a program for other people, and most of my users tell me that they really wish it worked a different way, my response is not to dig in my heels and insist that this way is technically superior. I change my program to work the way they think it should.

#12 C programmer on 05.04.14 at 11:55 am

"Moving things to C gives a larger performance boost than multithreading"

Yes or no. This depends entirely on the extent to which the problem is parallelizable.

"It is less buggy than multithreading. If you want to parallelize the code then doing it in C is easier to get right so you'd want to move to C anyway"

Am i reading this correctly? Are you saying C coding is the *less* buggy alternative? I'm a longtime C programmer and I strongly disagree.

Threading in an HLL isn't terribly hard — worst case, you add too many locks, and it doesn't run as efficiently as it could.

Doing *anything* in C is an exercise in getting every detail perfect, something which even seasoned professionals have trouble with — worst case, your process has a security hole the size of Nebraska, and half the internet needs an immediate software update.

#13 Yossi Kreinin on 05.04.14 at 9:51 pm

Well, Python has a "special relationship" with C, as acknowledged by its creator, and it assumes you use C much more than Java does – Java very much prefers you to stay "pure" IMO. I think that for a Java programmer it is more sensible to only know Java and not C than it is for a Python programmer to only know Python and not C; for one thing, Python is way way slower than Java.

As to multi-threaded code vs serial C code – I do prefer the latter, and if I have to do multiple threads, the last thing I would reach to would be explicit locks, and have enough experience with both serial C code and multithreading to be entitled to my own opinion, though of course others are entitled to theirs.

If Python lost the GIL, letting one gun for 8x less latency with threads (and 1x higher throughput, or less… using multiple cores means you'll need more cores to transfer the speedup benefits to the user, eventually) instead of 30-50x less latency and more throughput with C, I wouldn't object. What I would like to see then, though, is good frameworks atop threads (not unlike and good race detectors (currently I don't know about these for Python). And even then I'd hardly recommend it for the hard cases. For easy cases with trivial parallelization and not that much code and just one developer touching that code – perhaps.

#14 Adam Kelly on 05.06.14 at 6:31 pm

Response to #4 above:

Hi Yossi,

I agree fundamentally with your position here, but if the context is a programmer or team that isn't expert in C, then Barry's point carries more weight. There is a bit of a price for introducing C into a pure Python program, because even if all the interfaces are good, you have the worry of debugging / configuring that code.

(I'm speaking in the abstract; I'm much more conversant with C than Python)

#15 Ccx on 06.11.14 at 5:44 am

wrt. "Layered state machines as in Twisted – better, but you still have callbacks looking at state variables"

You might want to read
which goes over possible concurrency mechanisms available for Python and Twisted (including implicit coroutines you seem to like) and points out why you actually don't want implicit context switches in your code.
