It's "locking" if it's blocking

December 27th, 2012

Given a piece of code, how do you know if it's lock-based or lock-free, and why do you care?

Sounds like a trivial question – it's lock-based if it says, "lock a mutex", which is usually an OS service. If it doesn't say "lock" or "unlock", then it's lock-free. And we care because OS services like mutexes are expensive and if we don't use them, code runs faster.

Which is all true, in a way, but it doesn't really answer the question. "The OS" isn't magic – it's code. You can have code implementing userspace locks outside the OS. Or you can have an OS where there's just one address space, and then that OS's locks are "userspace" code like any other. And so the question remains: which piece of code implements "locking" and which doesn't?

It's all code, right? A piece of code is a sort if it sorts. When is code locking? What is locking, as opposed to "lock-free" code? We'll try to answer that question, and then the answer will tell us why (and when) the "locking"/"lock-free" distinction actually matters.

We'll start with two examples – atomic addition and spin locks. The code for these can be surprisingly similar, and this similarity will help highlight the one thing that really sets the two apart.

Consider an atomic addition – something like gcc's __sync_fetch_and_add(), which increments a number at some memory location and returns the old value "atomically".
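
For reference, a typical use looks like this (a minimal sketch; the counter and the printout are made up for illustration):

#include <stdio.h>

int counter = 0;

void bump(void) {
  int old = __sync_fetch_and_add(&counter, 1);  /* counter += 1, atomically */
  printf("old value: %d\n", old);               /* the value before our increment */
}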

"Atomic" (only) means that nobody will mess up the state midway through the operation. "Atomic" doesn't mean, for example, that a single instruction is used – in fact, often it's not the case. Nor does "atomic" imply "lock-free". An implementation of atomic addition could legitimately use:

  1. A single machine instruction
  2. A loop observing that someone messed up our state, and retrying
  3. A mutex that's locked while the incrementing is done

Let's say it's a loop, something along the lines of:

do {
  val = *addr;
}
while(!compare_and_swap(addr, val, val+inc));

This is atomic, because if someone modifies addr before we manage to write val+inc, then compare_and_swap (CAS – a typical primitive) will observe that addr no longer holds val, fail, and make us retry. Note that we can get stuck at the loop for an indefinite period of time.
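
Spelled out as compilable code, it could look like this – a minimal sketch using gcc's __sync_bool_compare_and_swap builtin as the CAS primitive (the wrapper function name is made up):

int fetch_and_add(int* addr, int inc) {
  int val;
  do {
    val = *addr;  /* read the current value */
  }
  while(!__sync_bool_compare_and_swap(addr, val, val+inc));  /* retry if *addr changed under us */
  return val;  /* the value *addr held just before our increment */
}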

Now consider a spinlock – a loop doing something like:

while(!compare_and_swap(addr, UNLOCKED_VAL, LOCKED_VAL));

This will wait until addr holds UNLOCKED_VAL, and will modify it to hold LOCKED_VAL – unless someone else gets there ahead of us, in which case they will write LOCKED_VAL first – and we're back to calling our CAS primitive in a loop. Note that we can get stuck at the loop for an indefinite period of time.
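
Again as a compilable sketch, using the same builtin (names are made up; a real spin lock would also be nicer to the CPU while spinning, e.g. with a "pause" instruction):

#define UNLOCKED_VAL 0
#define LOCKED_VAL   1

void spin_lock(int* addr) {
  /* keep retrying until we're the ones who flip UNLOCKED_VAL to LOCKED_VAL */
  while(!__sync_bool_compare_and_swap(addr, UNLOCKED_VAL, LOCKED_VAL));
}

void spin_unlock(int* addr) {
  __sync_lock_release(addr);  /* write 0 (UNLOCKED_VAL) with release semantics */
}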

So now you see the difference between "locking" and "lock-free": our atomic addition loop is lock-free, and our spin lock loop implements locking.

Wait, what?

They're both loops, and very similar-looking ones. Moreover, we can get stuck at both loops for an indefinite period of time. How come they're on opposite sides of the locking/lock-free distinction?! Where's the difference?

The difference is in whether we get stuck if another thread gets suspended.

The first loop – atomic addition – never gets stuck because of someone else being suspended. On the contrary, it will finish faster. It gets stuck if others modify addr all the time and it keeps retrying. If they're suspended, they can't modify addr, and it succeeds more quickly.

The second loop – the spin lock – will very much get stuck if another thread obtains the lock and then gets suspended before releasing it. And it will stay stuck until that thread gets the opportunity to finish whatever it was doing with the lock taken, and releases the lock.

That's why we care – and that's why, with locking, we tend to involve the OS in the first place! Having the OS involved could be a vast improvement over our spin lock loop, because the OS can notice that our thread is stuck waiting for a lock held by a suspended thread. And if that thread is ready to run, the OS could let it run right now, knowing that otherwise, our thread won't make progress anyway.

Moreover, if our thread has a high priority, and the suspended "locker" thread has a low priority, the OS could raise the "locker's" priority until it releases the lock – because releasing the lock that a high-priority thread wants should itself have high priority. (This stuff is known as dealing with "priority inversion" – the situation where they are less important than we are, and yet they block us.)
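
For instance, POSIX lets you ask for this behavior explicitly when creating a mutex, where the implementation supports it – a minimal sketch:

#include <pthread.h>

pthread_mutex_t lock;

void init_priority_inheriting_mutex(void) {
  pthread_mutexattr_t attr;
  pthread_mutexattr_init(&attr);
  /* ask the OS to boost whoever holds the lock while a higher-priority
     thread is blocked on it */
  pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
  pthread_mutex_init(&lock, &attr);
  pthread_mutexattr_destroy(&attr);
}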

And for all these good things to happen, the OS needs to know that a lock is taken – which it doesn't with our spin lock loop. To the OS, it's just some loop that could be doing anything. The OS only knows what's going on if we use its own locking functions. (BTW, there's another option for getting the OS involved: explicitly fiddle with threads and priorities inside the spin lock loop if it takes too long. It can get ugly but it can be the right thing to do.)

Of course having the OS involved will cost us, and we don't want locks in our code because of that, but that's a consequence. The root cause is that threads that lock and get suspended block other threads that lock, and this is why spin locks aren't always a great way to get rid of the pesky OS overhead.

This also shows when spin locks or similar are fine – efficient and just as good as "lock-free code". For example, if you're using hardware threads that are never suspended, such as those in XMOS processors, then locks are fine. We'd certainly see problems if suspension were possible, but it isn't.

There are other, perhaps more common situations when locking is fine. For instance, two highest-priority threads running on two distinct physical processors can communicate using spin locks very nicely, because they can't get suspended while holding a lock (they could be suspended if interrupts have still higher priority, but we don't care if interrupt latency is small enough – and perhaps they themselves are interrupt handlers locking some common resource.) Such threads don't need any help from the OS.

Or, perhaps there's a high-priority thread that sets two variables, A and B, for a low-priority thread. A is written to and then B. After getting notified that A was written, the low-priority thread reads B in a loop – as long as B is NULL, it wasn't written to yet, so keep polling.

Effectively, this is locking – if the high-priority thread gets suspended, then the low-priority thread will get stuck in the loop. But, assuming the high-priority thread has the highest priority, it won't get suspended – and again we're fine without any help from the OS.
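
In code, that could look something like this (a sketch; the variable names and the value written to A are made up, and the notification about A is left out):

int A;                    /* the data */
int* volatile B = NULL;   /* the "A is ready" flag; volatile so the polling loop re-reads it */

/* high-priority thread: write A, then B */
void write_A_then_B(void) {
  A = 42;
  __sync_synchronize();   /* make sure A is visible before B becomes non-NULL */
  B = &A;
}

/* low-priority thread, after being notified that A is on its way */
void poll_B(void) {
  while(B == NULL);       /* effectively locking: we're stuck here if the writer is suspended */
  /* ...now it's safe to read A (that is, *B)... */
}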

These examples show that "locking" isn't defined by "using OS locking functions", and it can take many different forms. You don't spot it in code by looking for certain library calls or for loops having some known look. You spot locking by asking, can I get stuck at this point if someone gets suspended? (And sometimes you answer, "yes, so it's locking; but they won't get suspended, so I don't care.")

Likewise, "lock-free" isn't defined by the use of CAS or LL/SC (load linked/store conditional – two commonly available instructions that can be used to implement compare-and-swap), or by a specific sort of loops.

For instance, our atomic addition loop could be modified to check whether the increment would take the value past a certain size, and to quit the loop if so, without modifying the value:

do {
  val = *addr;
  if(val+inc >= size) break;
}
while(!compare_and_swap(addr, val, val+inc));

This is still atomic addition – if someone modifies addr, it doesn't break our code. We just keep retrying until addr holds something that can be incremented without exceeding a given size, and we manage to actually increment it without someone beating us to it. This can be used to implement a sort of lock-free queue.
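
Here's a related use of the same bounded loop – not a queue, just a non-blocking "try to grab one of N slots" counter (a sketch with made-up names):

/* returns 1 if we got a slot, 0 if all `capacity` slots are taken */
int try_take_slot(int* used, int capacity) {
  int val;
  do {
    val = *used;
    if(val+1 > capacity) return 0;  /* full - fail immediately instead of blocking */
  }
  while(!__sync_bool_compare_and_swap(used, val, val+1));
  return 1;
}

void release_slot(int* used) {
  __sync_fetch_and_sub(used, 1);  /* atomically give the slot back */
}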

And this code isn't "lock-free" because CAS is used in a loop – that is true for our spin lock as well – but because we aren't blocked when threads are suspended (in fact, it helps us if they get suspended, since there's less interference).

And this is why we never need help from the OS with lock-free code.

Before concluding, it's worth noting a way in which lock-free code isn't better than lock-based code. We've already seen that "lock-free" isn't (necessarily) better in any way when suspension doesn't threaten us. But regardless of suspensions, "lock-free" is never better in a particular way – specifically, it isn't better if you're worried about correctness.

For instance, suppose you have a counter keeping an account balance. There's a bug in the code transferring money between accounts, where the value of the source account is checked before decrementing it:

if(balance - transfer > 0) {
  balance -= transfer;
}

But several transfers can do the check before reaching the decrementing statement. Then they will all "succeed" – each will decrement the source account balance. As a result, the account balance may reach a negative value, and transfers will be authorized that shouldn't have been.

Here, it doesn't matter if the balance counter is locked or updated using any sort of lock-free addition. Both ways are better than nothing in the sense that all decrements will eventually happen and none will be lost. Both ways are still busted because the check and the decrement should be done atomically, and not just the decrement.
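
To make that concrete, here's a sketch of fixing it either way – in both versions the check and the decrement become a single atomic step (names and types are made up):

#include <pthread.h>

/* lock-based fix: check and decrement under one mutex */
int withdraw_locked(pthread_mutex_t* m, long* balance, long transfer) {
  int ok = 0;
  pthread_mutex_lock(m);
  if(*balance - transfer > 0) {
    *balance -= transfer;
    ok = 1;
  }
  pthread_mutex_unlock(m);
  return ok;
}

/* lock-free fix: check and decrement inside one CAS retry loop */
int withdraw_lockfree(long* balance, long transfer) {
  long val;
  do {
    val = *balance;
    if(val - transfer <= 0) return 0;  /* would overdraw - fail without blocking anyone */
  }
  while(!__sync_bool_compare_and_swap(balance, val, val - transfer));
  return 1;
}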

A program using a lock-free counter, or container, or any sort of data structure is thus exactly as correct or buggy as a lock-based version of the same program (provided that the locking and the lock-free updates are both implemented correctly, of course). That is, you can make (or avoid) exactly the same mistakes having to do with what orders of events you allow and how you handle them.

In other words, "locking" and "lock-free" are two equivalent ways to order events, the difference being that with locking, you have to deal with suspension, while with lock-free updates, you have to deal with the effects of concurrent interferences.

(You could say that locking tends to be less buggy and potentially less efficient because a uniform approach to coding is possible: just use the OS's locks. But as we've seen, you don't always have to or want to – and on the other hand, a uniform approach to coding is possible with lock-free code as well: just use off-the-shelf lock-free data structure libraries. And just like it's not automatically clear that locks are easier to develop with, it's not automatically clear that lock-free code is more efficient; both questions depend on the details.)

So that's it; I find it interesting though it could be rather trivial for you – and if you're one of those people, you might have spotted mistakes, in which case I'll be grateful if you point them out. (I'm not that good at this stuff; what I'm comparatively good at is getting things done without this stuff – that is, sweeping it under the rug and getting imperative, concurrent, bug-free programs. But that doesn't generalize to all types of programs and it's a subject for a separate discussion.)

P.S. according to Wikipedia, it's actually a bit subtler. Specifically, the property of not being blocked by suspended threads makes an algorithm "non-blocking", but not "lock-free". "Lock-freedom" is a stronger property – a non-blocking algorithm is lock-free if there's guaranteed progress (as opposed to a situation where threads are busy undoing the effects of each other's interferences, and make no "real" progress). And there's a still stronger property, "wait-freedom", where progress is guaranteed in a bounded number of steps.

I allowed myself to ignore the distinction between non-blocking and lock-free because it's not a key part of what distinguishes lock-based synchronization from all the other kinds, which are often collectively called "lock-free" in informal speech. Of course in practice, a non-blocking algorithm that can get stuck in an interference-undoing endless loop is useless unless there's some sort of a workaround. Presumably that's the reason why "non-blocking" used to be a synonym for "lock-free" until about 2003 – it's not easy to see how a non-blocking algorithm that can get stuck forever could be made usable enough for this sort of thing to deserve a name of its own.

1. Josh Haberman, Dec 28, 2012

"A program using a lock-free counter, or container, or any sort of data structure is thus exactly as correct or buggy as a lock-based version of the same program"

While this may be true as you've stated it, I think it misses the point a bit. With the exception of simple atomic counters, lock-free programming is a significantly different design approach than traditional lock-based programming. I don't think it's meaningful to talk about lock-based and lock-free versions of the same program, as if they were slight variations of one another. If you try to make lock-based designs lock-free by just removing locks (carefully), lock-freedom will look terrible.

If you look at the literature about lock-free programming, it's not investigating how to concurrently modify a bank account balance (that's more the domain of STM, which is a more direct alternative to lock-based, shared-state programming). Lock-free literature focuses on how you can implement data structures like stacks, queues, sets, hash tables, malloc() heaps, etc. without locks. A lock-free queue, for example, can be a highly useful work-distributing and communications mechanism. It would be a great candidate for implementing Erlang mailboxes or other similar virtual machine primitives.

2. Evgeny Lazin, Dec 28, 2012

OS synchronization primitives can use many useful optimizations to prevent various problems, such as thread starvation or convoying. Handling priority inversion is not the only trick.
That's why we should prefer OS locks to spin locks in most cases.

3. yossi kreinin, Dec 29, 2012

@Josh: my point is just as relevant with containers. If you add things to a set and you observe intermediate states of that set when you shouldn't, you have a race condition whether the set is lock-protected or lock-free.

@Evgeny: sure, I don't claim to exhaustively list the features you get from OS locking support, just the core reason why you need them. I do claim that if suspension is not a problem for whatever reason, then you are always fine without OS support.

4. Josh Haberman, Dec 29, 2012

@yossi: that's true, but in general lock-free algorithms are so difficult to get right that writing one is basically a research-level problem. If you need a lock-free set, you don't go to a white-board, you go to a paper like http://www.research.ibm.com/people/m/michael/spaa-2002.pdf

5. Yossi Kreinin, Dec 29, 2012

I didn't mean bugs in the set. I meant bugs in using it. What my account example was supposed to show was not a bug in implementing the counter but the fact that you have bugs with perfectly correct atomic counters. Likewise, with a perfectly correct set, you can have bugs where you fill the set, access it concurrently with filling it, and get the wrong idea of who's a member of the set because you looked too early. The set is fine – concurrent and all – doesn't matter if it's lock-based or lock-free. But your program still has race conditions. That's my point (I intend to make it in a separate post; working title: "there's no such thing as thread-safe data structures" – precisely because the fact that a data structure is thread-safe doesn't mean that the program using it is correct.)

6. Josh Haberman, Dec 29, 2012

I see what you are saying though I would put it slightly differently. There *are* such things as thread-safe data structures, inasmuch as they can make API guarantees and honor them. Your account example is broken because the code is trying to provide *more* guarantees than the underlying counter does but is failing to implement those guarantees correctly.

The point I was trying to make is complementary to this. My point is: for lock-free programming to actually be a win, one needs to approach it not as "removing locks" but as "designing so that the program doesn't *need* locks." The primary tool for accomplishing this is ensuring that operations that logically need to be serialized happen on the same thread.

So in your account example (which we both agree is bad), the problem is that it attempts to avoid locks but still uses a fundamentally shared-state design where threads are performing a mix of business logic and mutation on the same data in parallel. The ideal lock-free design would be to ensure that each account is assigned to a particular thread and only modified from that thread. Obviously for this design to be effective, you would need to be able to handle all mutations on a single thread, or partition your object space such that transactions don't need to span threads. If neither of these is possible, then you're fundamentally looking at a concurrent-mutators design, and should probably just use locks instead of trying to be clever about removing them.

7. Sanjoy Das, Dec 31, 2012

@Yossi

+1 on the "correct program vs. correct data structure" issue.

There is a very nice talk by Cliff Click [1] describing a lock-free hash-table, with `hash_table[key]` semantically equivalent to a global variable shared among threads. Threads can still have races, stomp on each other's writes, etc.; but the hash-table itself doesn't crash.

My experience with lock-free programming [2] is that it is rather difficult to get right, even for extremely simple data structures.

[1] http://www.youtube.com/watch?v=WYXgtXWejRM
[2] http://playingwithpointers.com/archives/1035

8. neleai, Dec 31, 2012

@Josh: A lock-free map is just a bunch of lock-free lists that you select by hash function. Why should someone write a paper about something that obvious?

9. Sanjoy Das, Dec 31, 2012

@neleai

A bucketed (each bucket being a linked list) hash-table has terrible cache performance; so you'd want a probing hash-table. And then you'd want to be able to resize the hash-table from time to time. All this adds complexity (but maybe not enough to warrant a research paper).

10. Aristotle Pagaltzis, Dec 31, 2012

Josh:

"If neither of these is possible, then you're fundamentally looking at a concurrent-mutators design, and should probably just use locks instead of trying to be clever about removing them."

Yossi's point was precisely that in that situation you can't "just" use locks and expect your concurrent mutator design to be correct. (Nor, conversely, "just" use lock-free data structures in a situation that calls for it, and automatically expect correctness.)

11. Josh Haberman, Jan 2, 2013

@neleai: I think you mean a bunch of lock-free *sets*. The paper I referenced describes both lock-free sets and hash tables in the same paper. Implementing a lock-free set is not obvious.

@Aristotle: If you use locks to serialize concurrent mutators, then you avoid race conditions. There shouldn't be anything controversial about that statement. If you wrapped Yossi's example in a lock, it would no longer have the bug he was describing.

Does proper serialization of a single data structure guarantee that your entire program is correct? Of course not. No data structure or API can guarantee that its clients are correct. You wouldn't say "there's no such thing as a correct sorting algorithm" just because a client can mis-implement the comparator. Likewise it doesn't make sense to say "there's no such thing as a thread-safe data structure" just because clients can have synchronization bugs in how they use it.

12. Chris, May 15, 2014

"if(balance – transfer > 0) balance -= transfer;"

The obvious place locks are better is somewhere like this:
if (CanGetKeysToSafe()) GetKeysToSafe();

If we wrap the whole thing in a lock then it is correct; otherwise there can be a race condition. But yes, in the first example, given enough time, the state will eventually be correct. You just don't want to let the user close their account while doing multiple withdrawals that could go negative at the same time :)


