checkedthreads: bug-free shared memory parallelism

If you aren’t deeply frightened about all the issues raised by concurrency, you aren’t thinking about it hard enough.

John Carmack

This post introduces checkedthreads – a free framework for parallelizing C and C++ code, and for automatically finding every race condition that could potentially manifest on given program inputs. It comes with:

  • a Valgrind-based checker for thorough verification
  • an event-reordering scheduler for fast verification

The code in checkedthreads is fresh and hasn't been used in production yet. However, tools using the same approach have been successfully used for years in automotive safety software containing millions of lines of code written by many dozens of developers.

A nice use case for checkedthreads: you have a complex serial program that you want to parallelize. With checkedthreads, you'll be able to run your test suite and be sure that you have no parallelism bugs – as sure as you'd be about, say, memory leaks. And if your parallelization does introduce bugs, the Valgrind checker will pinpoint them for you, so that you can quickly fix them.

In what follows, I explain how race detection works in checkedthreads, and then briefly discuss its other features and how to get started with it.

***

Why are threads so error-prone? We're accustomed to seeing the root of the problem in mutable shared memory. Commonly proposed alternatives to threads avoid mutable shared memory. For example, pure functional code gets rid of the "mutable" part, and process-based parallelism gets rid of the "shared" part.

While processes and pure FP have their virtues, I believe that mutable shared memory is not, by itself, what makes threads bug-prone. You can keep mutable shared memory and eliminate the bugs. Moreover, as we'll see, it's perfectly possible to eliminate shared memory – and keep the bugs.

What, then, is the root of the problem? I believe what we need is to find the right interfaces. Threads and locks are great low-level primitives, but a bad interface to use directly in source code.

In this sense, threads are not unlike goto. Goto is a horrible interface for human programmers. But it's fine as a machine instruction underlying higher-level interfaces ranging from loops and function calls to exceptions and coroutines.

What higher-level interfaces do we have on top of threads? One such interface is fork/join parallelism – parallel loops and function calls. ("Loops and function calls" is notably very similar to the most popular interface on top of goto.) "Fork/join", because starting a parallel loop logically forks a thread per iteration, and those threads are joined back when the loop ends:

Ordering constraints of fork-join code

Examples of fork/join interfaces include TBB's/PPL's parallel_for and parallel_invoke, and OpenMP's #pragma omp parallel for. Checkedthreads provides something similar as well:

ctx_for(_objs.size(), [&](int i) {
  process(_objs[i]); //in parallel for all i's
});

Fork/join parallelism is well-known, and appreciated for its automation of synchronization and load balancing. But how does a fork/join interface help with program correctness compared to raw threads and locks?

We'll see how fork/join helps by looking at two methods to verify parallel code – event reordering and memory access monitoring. Both methods are implemented by checkedthreads and together, they guarantee near-100% freedom from parallelism bugs.

We'll discuss why these methods are so effective with fork/join code – and why they are comparatively ineffective with raw threads.

Since parallel function calls can be implemented using parallel loops, we'll only explicitly mention parallel loops. And we'll only discuss "pure" fork/join programs – programs where forking and joining is the only synchronization mechanism.

That is, code inside a parallel loop can access stuff written by whoever spawned the loop – and code after the loop can access stuff written in the loop. But you can't access shared data using semaphores, atomic counters, lock-free containers, etc. – such accesses will be flagged as bugs. To synchronize threads, you must fork or join.

Lastly, we'll assume that you can't proceed until the loop you spawned completes – and you can't wait for anything but a loop you spawned yourself. (This is an obvious property of code spelled using some kind of a "parallel for", but not of other fork/join interfaces; a recent paper calls this property "strict fork/join parallelism".)

We're thus assuming both "purity" and "strictness"… which sounds more restrictive than it is – but that is a separate topic. And now, with all the assumptions spelled out:

Why fork/join code is verifiable – and why raw threads aren't

First, let's consider event reordering – a cheap and effective verification method.

Suppose thread A writes to address X, and thread B concurrently reads from X. Then B might see A's write or it might not, depending on timing, so it's a bug. A cheap way to find this bug is, don't run the program with the production thread scheduler, where the order of events depends on timing.

Instead, use a debugging scheduler which purposefully and deterministically reorders events. Make it schedule things so that in one run, A's write precedes B's read – and in another run, B's read comes first. Then compare the results of the two runs – if they differ, there's a bug.

How many orders does it take to reorder every two instructions that could ever run in parallel? (Actually, this requirement is too weak to find all the bugs – we'll plug that hole later – but it's usually enough to find most of the bugs.)

In a fork/join program, you need just two orders:

  1. run all loops from 0 to N.
  2. run them all backwards, from N to 0.

Here's an illustration – consider this example pseudo-code with nested parallel loops:

parallel for i=0:2
  foo(i)
  parallel for j=0:i
    bar(i,j)
  baz(i)

The ordering constraints – which code block can run in parallel with which – make up the following DAG:

Ordering constraints of fork-join code

And here are the two schedules – the 0 to N one and the backwards one:

The two event reordering schedules

Pick any two instructions that could possibly run in parallel and you'll see that they are reordered in these two schedules.

What happens if you use raw threads and locks?

Then the ordering constraints don't have to look like a simple fork/join DAG. All we can say is that the ordering constraints form some sort of a partial order over the set of serial code blocks (whose instructions are fully ordered).

With N code blocks, an upper bound on the number of orders you might need is N*2. (For every code segment, schedule it to run as early as possible, and then schedule it to run as late as possible – that's N*2 schedules reordering every two independent instructions.)

But N*2 is a lot of orders – N can be rather large. Can we do better, as we did in the fork/join case – by having fewer schedules which reorder more things?

Perhaps we can – but finding out the lower bound on the number of orders is an NP-hard problem (its standard name is finding the poset dimension). And heuristics analyzing partial orders in an attempt to produce fewer than N*2 orders take minutes to run with N in the few thousands, in my experience.

The upshot is that reordering every pair of independent instructions is trivial with fork/join code and very hard with raw threads.

Now let's consider an improvement upon simple event reordering – memory access monitoring based on something like program instrumentation or a compiler pass.

Plain event reordering has two main drawbacks:

  • Some bugs are missed. Consider updates to accumulators. Whether we run sum+=a[5] before or after sum+=a[6], sum will reach the same value – whereas in a parallel run, it may not. (Finer-grained reordering – trying every way to interleave instructions that could run in parallel, and not just reordering every pair of independent instructions – would catch the bug. But that's obviously infeasible.)
  • Bugs aren't pinpointed. Reordering gives evidence that you have a bug by demonstrating that results differ under different schedules. But how do you find the bug?

Here's how we can improve upon plain reordering. Let's intercept memory accesses and record the ID of the thread which was the last to write to each and every location – the location's owner. (Logically, each loop index could run in a separate thread – so ideally, we'd map each index to a separate thread ID. In practice, we might want IDs to be small, so we'd use low bits of the index or some such.)

Then if a location is accessed by someone whose ID is different from the owner's ID, we have a bug, and we've pinpointed it – all we need to do is print the current call stack:

checkedthreads: error - thread 56 accessed 0x7FF000340 [0x7FF000340,4], owned by 55
==2919==    at 0x40202C: std::_Function_handler...::_M_invoke (bug.cpp:16)
==2919==    by 0x403293: ct_valgrind_for_loop (valgrind_imp.c:62)
==2919==    by 0x4031C8: ct_valgrind_for (valgrind_imp.c:82)
==2919==    by 0x40283C: ct_for (ct_api.c:177)
==2919==    by 0x401E9D: main (bug.cpp:20)

It's a tad more complicated with nested loops, but not by much; the details aren't worth discussing here. The point is, this works regardless of whether there's an effect on results that can be reproduced in a serial run. We pinpoint bugs involving accumulators and shared temporary buffers and what-not, even though results look fine in serial runs.

Note that we still rely on event reordering. That's because we catch the bug if an address was written first and then read – but not if it was read first and then written:

Memory access monitoring

This is OK – if you have a reordering scheduler guaranteeing that in one of your runs, the address is actually written first and then read. With such a scheduler and access monitoring, you'll never miss a bug if it could ever happen with your input data – and you'll always have it pinpointed.

Why doesn't it work nearly as well with raw threads and semaphores?

The first problem is, you need a reordering scheduler, and there are too many schedules to cover, as we discussed above. I've shown in the past an example where Helgrind, a memory access monitoring Valgrind tool for debugging pthread applications, misses a bug – because the bug is masked by the order in which things happen to run.

But there's another problem with raw threads and locks, preventing memory access monitoring from pinpointing some of the bugs when they do reproduce.

Specifically, if two threads access a location, and they use a lock to synchronize their accesses, then you can't flag the access as a bug, right? However, it very well may be a bug. It's certainly not a data race (unsynchronized access to memory) – but it can be a race condition (a bug due to event ordering).

As an example, consider a supposedly deterministic simulation (not an interactive game – though the screenshot below is from a game, Lock 'n' Chase). In the simulation, agents run around a maze, picking up coins. Whoever made it first to pick up a coin has won a race – literally!

Lock 'n' Chase

Suppose each agent is simulated by its own thread. Then it might be that the simulated speed of an agent in the maze depends on the amount of CPU time its thread used compared to others. And then results depend on timing and it's a bug. (A thread per maze region would be a better strategy than a thread per agent.)

This is a race condition – because timing affects results. However, as long as agents lock the coins they pick up, it's not a data race – because access to shared memory is synchronized. And memory access monitoring can only pinpoint data races, not race conditions.

You can simulate a maze, notice that memory where you keep a coin representation is accessed concurrently without locking, and print the offending call stack.

But you can't simulate a maze, notice that with different event ordering, different agents pick up the (properly locked) coin, and "print the offending call stack". There simply is no single offending call stack – only the fact that results end up differing.

John Regehr has a great discussion on the difference between race conditions and data races. Check out his example with a bank account, where all account balances are locked "nicely" – but the bank still doesn't properly transfer money between accounts.

The thing that matters in our context is that with pure fork/join code, all race conditions are data races. In the general case of raw threads, it's no longer true – which gets in the way of pinpointing bugs.

Note that using something like Go's channels or even Erlang's processes doesn't solve the maze race problem – in the sense that you still can't pinpoint bugs. Instead of locking coins, you might have agent processes or goroutines or what-not – and one or several processes keeping coins. Agents would send requests to pick up coins to these processes – and who'd make it first isn't deterministic, and there's no single place in the code "where the bug is".

This is one example of eliminating shared memory and keeping the bugs – as opposed to fork/join code's ability to keep shared memory while (automatically) eliminating the bugs.

(This is not to say that Erlang-style lightweight processes aren't a good idea – they can be great in certain contexts. I do believe that parallelizing computational code is a somewhat different problem from handling concurrent events – and that fork/join parallelism is close to "the right thing" for computational code.)

To summarize the entire discussion of verification:

  • With fork/join code, every bug which could possibly manifest with the given inputs can be deterministically reproduced and pinpointed.
  • Conversely, with raw threads and locks, it's computationally hard to deterministically reproduce bugs. Furthermore, even reproducible bugs cannot always be automatically pinpointed.

It is also notable that with raw threads, you get to worry about deadlocks. With pure fork/join code, there simply can never be a deadlock.

(The analysis above mostly isn't rigorous and you're welcome to correct me.)

Some features of checkedthreads

Noteworthy features of checkedthreads are discussed here. A short summary:

  • Guaranteed bug detection as discussed above.
  • Integration with other frameworks: if you already use OpenMP or TBB, you can configure checkedthreads to use their scheduler (and their thread pool) instead of fighting over the machine with the other framework.
  • Dynamic load balancing: work gets done as soon as a thread is available to do it. Load balancing automatically takes into account variability between the different tasks – as well as whatever unexpected load the CPUs might be handling while running your code.
  • A C89 and a C++11 API are available.
  • Free as in "do what you want with it" (FreeBSD license).
  • Easily portable (in theory).
  • Small and simple (at the moment).
  • Custom schedulers are very easy to add (though it's more of a recreational activity than a necessity, or so I hope).

Downloading, building, installing and using checkedthreads

I recommend reading the build instructions, and then, possibly, downloading precompiled binaries. (The build instructions are useful, in particular, to know what you're getting in the binary archive.) The source code of version 1.0 is archived here; git@github.com:yosefk/checkedthreads.git keeps the latest source code.

Before actually using checkedthreads, I recommend reading the rather short documentation of the API, the environment variables, and runtime verification.

If you run into any issues with checkedthreads, drop me a line at Yossi.Kreinin@gmail.com.

Conclusion

To quote the same piece by John Carmack again:

…if you have a large enough codebase, any class of error that is syntactically legal probably exists there.

In other words, "correct by design" is an almost non-existent property; correctness is either demonstrated automatically, or it is absent. While nobody has to make the simple errors we discussed in our toy examples, analogous errors will necessarily creep into large parallel programs.

Checkedthreads shows that for one significant family of parallel imperative programs – fork/join code – it's possible to automatically, deterministically reproduce and pinpoint every bug.

I hope it's a convincing example of what I think is a more general truth – namely, that parallel imperative programs don't have to be "deeply frightening", any more than serial imperative programs ought to be a scary nest of computed gotos. What we need is the right higher-level interfaces on top of raw threads and locks.

There generally appears to be a necessary trade-off between flexibility and correctness. A lot of widely known approaches to parallelism maximize one at the expense of the other. Raw threads – and many of the higher-level frameworks out there – are very flexible, but it's very hard to know that your code is correct. With pure functional code, you have statically guaranteed determinism – but "no side effects" is a very severe restriction.

Checkedthreads attempts to offer a different balance: determinism guaranteed by testing instead of statically – and an imperative language to program in, albeit with less synchronization options than those available with raw threads.

I hope you like it – and don't hesitate to email me if you need any sort of support :-) (In particular, I'll gladly port the thing to other platforms/languages if people are interested.)

20 comments

#1 Random person on the Internet on 04.02.13 at 4:59 am

By "no license" you mean public domain? If so, you should state that explicitly somewhere in the code, since any work is copyrighted by default by Berne convention. And since public domain is a somewhat nebulous concept in some countries, you might as well distribute the code under the 3-clause BSD or MIT license.

#2 Yossi Kreinin on 04.02.13 at 5:58 am

I say that you're free to do anything you want to at no charge, and without any warranty, in README.md; isn't that good enough?

#3 Antti Tuppurainen on 04.02.13 at 6:54 am

It might be good enough for you, but not for any lawyer I've ever met. :)

Licenses are tricky things and whenever you come up with a new one with subtly different wording, it's likely to result in undefined behavior in some jurisdictions around the world. Just like programmers, lawyers hate undefined behavior.

The MIT license and the new style BSD license give the same effective guarantees that you're aiming for, but it'll also be a hell of a lot easier to convince my company's lawyers that we can use and contribute to software under those licenses.

#4 Yossi Kreinin on 04.02.13 at 7:03 am

But all those licenses are less permissive than what I'm aiming for, actually; they specifically don't let you do whatever you want to with the code – you're obliged to carry the license with the code and sometimes give credit or whatever. I want it to be like characters that landed magically into your editor/file system – do whatever you want to with them; and I don't want to uglify every file with boilerplate crapola.

Are you seriously saying that I should use an existing license because you'd actually have to convince your company's lawyers? Does it work if the license is in the directory but not in each and every file?

#5 Daniel Janus on 04.02.13 at 7:20 am

Just attach a copy of the WTFPL.

#6 Yossi Kreinin on 04.02.13 at 7:23 am

I would, but I'm not sure it will appease the type of lawyer people are worrying about.

I'm shopping for a license now… I really don't want to uglify every file…

#7 Yossi Kreinin on 04.02.13 at 7:47 am

OK; so it's FreeBSD 2-clause license… Argh.

#8 Name on 04.02.13 at 6:21 pm

The most permissive license is CC0:
http://creativecommons.org/publicdomain/zero/1.0/legalcode

It's lawyer-approved, and doesn't force the recipient to reproduce the copyright notice in source / binaries and whatnot.

#9 Name on 04.02.13 at 6:28 pm

There is also the UNLICENSE:

http://unlicense.org/

The popular sqlite database uses something similar to this.

#10 Aristotle Pagaltzis on 04.02.13 at 7:16 pm

Yes, unfortunate though it is, you really cannot just say “do whatever you want” if you want to minimise the amount of headaches for everyone else. Sigh.

(I had been using MIT because I was not aware of 2-clause BSD until now. Thanks for finding that.)

#11 Yossi Kreinin on 04.02.13 at 9:53 pm

What annoys me about these things is that they're a virus; the first thing they require is "redistribute me when you redistribute the code". Oh well…

#12 Yossi Kreinin on 04.03.13 at 12:47 am

CC0, and especially the very short unlicense – nice stuff. I ain't fiddling with this anymore though… Maybe next time.

#13 Miguel Osorio on 04.03.13 at 3:51 am

First of all, thank you for such good reading material, Yossi! My first visit to this blog was by following a link from Carmack's twitter feed, if I recall correctly. Ended up reading the whole shebang in almost one go!

I was inspired by http://en.wikipedia.org/wiki/Algorithmic_skeleton to write my own APIs for parallel code, and guess what? To date, the most basic one I've written was fork-join. Parallel loops are implemented using that.

I have one or two questions though. I didn't understand very well your example of the nested loops. That only two orderings are necessary. Don't all iterations of each parallel-for run concurrently? So in one run, I might have:

——–> Time
i = 0; i = 1; i = 2;

But another run might be:

——–> Time
i = 1; i = 2; i = 0;

That's different from the strict reverse ordering, right? (2; 1; 0)

Or is it that what actually matters is that, since it's fork-join, the only threads that are actually doing work are always the innermost loops, since all the remaining higher level ones are instead "waiting in parallel"?

Not sure if I rubber-ducked you :P.

Cheers!

#14 Yossi Kreinin on 04.03.13 at 11:01 am

It's true that you could reverse the outer loop(s) but there's still a boatload of orders in which inner loops can run. What the two orders I mentioned are sufficient for is just reordering every pair of independent instructions (that is, every two instructions that could run in parallel). It's a rather weak statement about the order of things – too weak to catch all the bugs in fact as I mentioned, though strong enough to catch them all together with memory interception.

As an example, consider:

i=0 { j=0 j=1 j=2 } i=1 { j=0 j=1 j=2 }

And:

i=1 { j=2 j=1 j=0 } i=0 { j=2 j=1 j=0 }

Where i is the outer index and j is the inner index. Here, if you pick two instructions belonging to two different serial code blocks that could run in parallel, then these two code blocks have now been reversed. But this is not the only way to achieve the effect, nor does it imply anything stronger about the order of things.

#15 Miguel Osorio on 04.03.13 at 11:35 am

Got it! So that's what you meant with "The upshot is that reordering every pair of independent instructions is trivial with fork/join code and very hard with raw threads.".

Your approach is generic because in this context, fork-join and parallel-for are actually interchangeable, right? A "generic" fork-join could spawn n threads, each doing a different job. The parallel-for would just be a case where n threads are spawned doing the same job, but with different data payloads.

#16 Yossi Kreinin on 04.03.13 at 11:54 am

It's indeed generic in the sense that you can do a parallel_invoke or some such where you give N different functions instead of having an index running from 0 to N.

It's actually less generic performance-wise if you want what the paper that I cited calls "non-strict fork-join" – that is, code which can wait, not just for stuff it spawned, but for other stuff as well – as you'll be able to do with most task-based fork-join interfaces, and as you won't be able to do with just parallel_for and parallel_invoke.

Why the restriction?

Because (A) I don't believe the difference in performance due to the extra sync options is significant in real life (though I could easily come up with a contrived example where it is); and (B) because it's damn hard to verify generic dependency DAGs compared to the very simple fork-join DAGs – in fact you hit the "NP-hardness" of poset dimension.

I hope to blog about it in detail; it's actually covered not-so-badly in the old post – http://www.yosefk.com/blog/making-data-races-manifest-themselves.html – except that it spells things in terms of some other framework and it's older and I understood less back then (for instance, it says that the trouble with locks is just that you can forget to lock or that you can deadlock; in fact the biggest trouble with locks is stuff along the lines of the maze race example.)

The upshot is that I'm doing generic graphs at work but I don't think the world needs them, really; I might add them to checkedthreads in the future though.

#17 Uri on 04.04.13 at 6:47 am

Amazing stuff. Thanks!

#18 Yossi Kreinin on 04.04.13 at 8:08 am

If you're the Uri that I think you are, then you know all of this all too well except for the part where it's a dynamic simple-shaped dependency DAG instead of a static arbitrarily-shaped dependency DAG :) (Which is the more general is hard to tell; qsort is better expressed with this fork/join stuff whereas task parallel graphs are better handled by the arbitrarily shaped graphs. Which is the more gnarly is obvious…)

#19 Levi on 06.23.13 at 7:10 pm

I am not typically a Go apologist, but I think you've mischaracterized Go's concurrency model. Although it supports nondeterministic concurrency, it also supports deterministic concurrency. Go's synchronization is based on channels, which come in two forms: buffered and unbuffered. Unbuffered channels create a rendezvous point–a write cannot complete until the other end executes a read, and vice versa.

You can do a fork/join in Go by creating an unbuffered channel per goroutine fork. There is no question then which goroutine returned a result. If you spawn a goroutine for each iteration of a loop (the fork), then loop again to read from each channel (the join), you will get the exact same result every time. The order in which the goroutines get scheduled and actually complete may vary, but that will not be observable by your program.

Of course, you would have to do some hacking in the guts of Go in order to use your checking tools with it, but the fork/join model itself is one model that is supported by Go's primitives.

#20 Yossi Kreinin on 06.23.13 at 8:08 pm

I'm not saying you can't implement fork/join on top of goroutines (you can); I'm only pointing out what happens if you have shared state and you use a goroutine to access that shared state – that you eliminate data races but not race conditions, which is obvious to some but surprising to others.
