Wait/wound mutexes
Developers wanting to add new locking primitives to the kernel tend to be received with a certain amount of skepticism. The kernel is already well equipped with locking mechanisms, and experience shows that new mechanisms tend to be both unnecessary and hard to get right. The "wait/wound mutex mechanism" proposed by Maarten Lankhorst may well get that kind of response. But it is an interesting approach to a specific locking problem that merits a closer look.
A conceptual overview
Situations where multiple locks must be held simultaneously pose a risk of deadlocks: if the order in which those locks are acquired is not always the same, there will eventually come a time when two threads find themselves blocked, each waiting for the other to release a lock. Kernel code tends to be careful about lock ordering, and the "lockdep" checking tool has gotten quite good about finding code that violates the rules. So deadlocks are quite rare, despite the huge number of locks used by the kernel.
But what about situations where the ordering of lock acquisition cannot be specified in advance, or, even worse, is controlled by user space? Maarten's patch describes one such scenario: a chain of buffers used with the system's graphics processing unit (GPU). These buffers must, at various times, be "owned" by the GPU itself, the GPU driver, user space, and, possibly, another driver completely, such as for a video frame grabber. User space can submit the buffers for processing in an arbitrary order, and the GPU may complete them in a different order. If locking is used to control the ownership of the buffers, and if multiple buffers must be manipulated at once, avoiding deadlocks could become difficult.
Imagine a simple situation where there are two buffers of interest, Buffer A and Buffer B. Imagine further that we have two threads (we'll call them T1 and T2) that attempt to lock both buffers in the opposite order: T1 starts with Buffer A, while T2 starts with Buffer B. As long as they do not both try to grab the buffers at the same time, things will work. But, someday, each will succeed in locking one buffer and then find itself waiting for the other: T1 will hold Buffer A while waiting for Buffer B, and T2 will hold Buffer B while waiting for Buffer A.
The kernel's existing locking primitives have no answer to a situation like this other than "don't do that." The wait/wound mutex, instead, is designed for just this case. In general terms, what will happen in this situation is:
- The thread that "got there first" will simply sleep until the remaining buffer becomes available. If T1 started the process of locking the buffers first, it will be the thread that waits.
- The other thread will be "wounded," meaning that it will be told it must release any locks it holds and start over from scratch.
So if T2 is wounded, the deadlock will be resolved by telling T2 to release Buffer B; it must then wait until that buffer becomes available again and start over. T1 then holds both buffers and can complete its work, while T2 waits.
Once T1 has released the buffers, T2 will be able to retry and, presumably, make forward progress on its task.
The details
The first step toward using a set of locks within the wait/wound mechanism is to define a "class"; this class is essentially a context within which the locks are to be acquired. When multiple threads contend for the same locks, they must do so using the same class. A wait/wound class is defined with:
#include <linux/mutex.h>

static DEFINE_WW_CLASS(my_class);
As far as users of the system are concerned, the class needs to exist, but it is otherwise opaque; there is no explicit initialization required. Internally, the main purpose for the class's existence is to hold a sequence number (an atomic counter) used to answer the "who got there first" question; it also contains some information used by lockdep to verify correct use of wait/wound locks.
The acquisition of a specific set of locks must be done within a "context" that tracks the specific locks held. Before acquiring the first lock, a call should be made to:
void ww_acquire_init(struct ww_acquire_ctx *ctx, struct ww_class *ww_class);
This call will assign a sequence number to the context and do a bit of record keeping. Once that has been done, it is possible to start acquiring locks:
int ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx);
If the lock has been successfully acquired, the return value will be zero. When all goes well, the thread will manage to acquire all of the locks it needs. Once that process is complete, that fact should be signaled with:
void ww_acquire_done(struct ww_acquire_ctx *ctx);
This function is actually a no-op in the current implementation, but that could change in the future. After this call, the processing of the locked data can proceed normally. Once the job is done, it is time to release the locks and clean up:
void ww_mutex_unlock(struct ww_mutex *lock);
void ww_acquire_fini(struct ww_acquire_ctx *ctx);
Each held lock should be released with ww_mutex_unlock(); once all locks have been released, the context should be cleaned up with ww_acquire_fini().
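To make the sequence concrete, an uncontended acquisition of two buffers might look something like the sketch below. The struct buffer type, the buf_class class, and the helper functions are inventions for illustration rather than code from the patch set; the per-lock initializer is assumed to be ww_mutex_init(), as in the eventual mainline API, and handling of contention is deferred for the moment.

#include <linux/mutex.h>

/* A hypothetical buffer type; not taken from the patch set. */
struct buffer {
    struct ww_mutex lock;
    /* ... the data being protected ... */
};

static DEFINE_WW_CLASS(buf_class);

/* Per-lock setup, assumed to use ww_mutex_init() with the shared class. */
static void buffer_init(struct buffer *buf)
{
    ww_mutex_init(&buf->lock, &buf_class);
}

/* The uncontended path: acquire both buffers, do the work, clean up. */
static void process_two_buffers(struct buffer *a, struct buffer *b)
{
    struct ww_acquire_ctx ctx;

    ww_acquire_init(&ctx, &buf_class);

    ww_mutex_lock(&a->lock, &ctx);  /* contention handling omitted here */
    ww_mutex_lock(&b->lock, &ctx);

    ww_acquire_done(&ctx);          /* all needed locks are now held */

    /* ... operate on both buffers ... */

    ww_mutex_unlock(&a->lock);
    ww_mutex_unlock(&b->lock);
    ww_acquire_fini(&ctx);
}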
The description so far covers what happens when all goes well, but it leaves out an important case that all wait/wound mutex users must handle: the detection of a potential deadlock. That case comes about whenever an attempt is made to lock a ww_mutex that is already locked; when that happens, there are three possible outcomes.
The first of these comes about if the locking thread already holds that ww_mutex and is attempting to lock it for a second time. With ordinary mutexes, this would be an error, but the wait/wound mechanism is designed for this case. Evidently, sometimes, the ordering of the locking is so poorly defined that multiple locking attempts can happen. In such cases, ww_mutex_lock() will return -EALREADY. The locking thread, assuming it knows how to respond to -EALREADY, can continue about its business.
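For code that cannot easily tell whether it has already locked a given object, simply treating -EALREADY as success is enough. A hypothetical wrapper, reusing the struct buffer type from the sketch above, might be:

static int buffer_lock(struct buffer *buf, struct ww_acquire_ctx *ctx)
{
    int ret = ww_mutex_lock(&buf->lock, ctx);

    /* A second attempt within the same context is not an error here. */
    if (ret == -EALREADY)
        return 0;
    return ret;
}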
The second possibility is that the sequence number in the context for the locking thread is higher than the number associated with the thread already holding the lock. In this case, the new caller gets "wounded"; ww_mutex_lock() will return -EDEADLK to signal that fact. The wounded thread is expected to clean up and get out of the way. "Cleaning up" means releasing all locks held under the relevant context with calls to ww_mutex_unlock(). Once all of the locks are free, the wounded thread can try again, but only when the contended lock is released by the victorious thread; waiting for that to happen is done with:
void ww_mutex_lock_slow(struct ww_mutex *lock, struct ww_acquire_ctx *ctx);
This function will block the calling thread until lock becomes free; once it returns, the thread can try again to acquire all of the other locks it needs. It is entirely possible that this thread could, once again, fail to acquire all of the needed locks. But, since the sequence number increases monotonically, a once-wounded thread must eventually reach a point where it has the highest priority and will win out.
The final case comes about when the new thread's sequence number is lower than that of the thread currently holding the lock. In this case, the new thread will simply block in ww_mutex_lock() until the lock is freed. If the thread holding the contended lock attempts to acquire another lock that is already held by the new thread, it will get the -EDEADLK status at that point; it will then release the contended lock and let the new thread proceed. Going back to the example above:
Thread T1, holding the lower sequence number, will wait for Buffer B to be unlocked, while thread T2 will see -EDEADLK when it attempts to lock Buffer A.
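Putting the pieces together, the backoff-and-retry dance for the two-buffer example might look roughly like the sketch below. Once again, lock_both() and struct buffer are illustrative inventions rather than code from the patch set; the caller is assumed to have set up the context with ww_acquire_init() and to release both locks and call ww_acquire_fini() once the work is done.

static int lock_both(struct buffer *a, struct buffer *b,
                     struct ww_acquire_ctx *ctx)
{
    struct buffer *contended = NULL;
    int ret;

retry:
    if (contended != a) {
        ret = ww_mutex_lock(&a->lock, ctx);
        if (ret == -EDEADLK) {
            /* Wounded: drop what we hold and wait for the winner. */
            if (contended == b)
                ww_mutex_unlock(&b->lock);
            contended = a;
            ww_mutex_lock_slow(&a->lock, ctx);
            goto retry;
        }
    }
    if (contended != b) {
        ret = ww_mutex_lock(&b->lock, ctx);
        if (ret == -EDEADLK) {
            /* Wounded: drop what we hold and wait for the winner. */
            ww_mutex_unlock(&a->lock);
            contended = b;
            ww_mutex_lock_slow(&b->lock, ctx);
            goto retry;
        }
    }
    ww_acquire_done(ctx);
    return 0;
}

The important pattern here is that a wounded thread releases everything it holds before sleeping in ww_mutex_lock_slow(), so the thread that won the sequence-number race is never left waiting on it.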
The documentation in the patch does not describe what happens if the holding thread never calls ww_mutex_lock() again; in that case, it will never know that it is supposed to back off. But such a holder must necessarily already have acquired all of the locks it needs, so there should be no reason why it cannot simply finish its work and release the locks normally. The end result will be the same.
Conclusion
Needless to say, there are plenty of details that have not been covered here; see the ww-mutex-design.txt document included with the patch set for more information.
In that document, there are code examples for three different ways of working with wait/wound mutexes. One need not read for long to conclude that the API looks a bit complex and tricky to use; it will be far harder to write correct locking code using this facility than it would be with normal mutexes. Perhaps that complexity is necessary, and it seems certain that this mechanism will not be needed in many places in the kernel, so the complexity should not spread too far. But an API like this can be expected to raise some eyebrows.
What is missing at this point is any real code that uses wait/wound mutexes. Kernel developers will certainly want to see some examples of where this kind of locking mechanism is needed. After all, the kernel has made it through its first two decades without this kind of complex locking; convincing the community that this feature is now necessary is going to take a strong sales effort. That is best done by showing how wait/wound mutexes solve a problem that cannot be easily addressed otherwise. Until that is done, wait/wound mutexes are likely to remain an interesting bit of code on the sidelines.
Index entries for this article:
Kernel: Locking mechanisms/Mutexes
Wait/wound mutexes
Posted May 2, 2013 7:54 UTC (Thu) by airlied (subscriber, #9104)
So Maarten has code to port TTM over to this infrastructure already in a branch, and has posted it to dri-devel previously I think.
Wait/wound mutexes
Posted May 2, 2013 8:07 UTC (Thu) by mlankhorst (subscriber, #52260)
Wait/wound mutexes
Posted May 2, 2013 9:03 UTC (Thu) by blackwood (guest, #44174)
Rob Clark started a proposal for dma_fences (now in Maarten's branch), which are attached to each dma_buf taking part in any given gpu render batch (or any other dma op affecting a dma_buf).
Now if you naively walk all the dma_bufs, lock each of them one-by-one and attach a new set of fences, races with other threads have a peculiar effect: If you're unlucky you can create a synchronization loop between a bunch of buffers and fences, and since these fences can be fully hw-based (i.e. never wake up the cpu to do the syncing) you'll end up with deadlocked gpus, each waiting on the other.
Hw sync/wait deadlocks between different drivers is the kind of fun I don't want to deal with, so we need a multi-object locking scheme which works cross-devices.
Note that i915 isn't currently based on ttm (and personally I'm not known as a big fan of ttm), but the proposed ww mutex stuff is massively improved:
- not intermingled with all the gpu memory management craziness ttm also does
- sane, clear api (we grade on a curve in gpu-land ...) with decent documentation
- _really_ good debug support - while writing the kerneldoc for Maarten's code I've checked whether any kind of interface abuse would be caught. Furthermore we have a slowpath injection debug option to exercise the contended case (-EDEADLK) with single-threaded tests.
Now one critique I hear often is "why can't you guys use something simpler?". Reasons against that:
- current ttm based drivers (radeon, nouveau, ...) already deal with this complexity. Furthermore gpus tend to die randomly, so all the command submission ioctls for the big drivers (i915, radeon, nouveau) are already fully restartable. Ditching a tiny bit of code to handle the ww mutex slowpath won't shed complexity.
- Current drivers rely on the -EALREADY semantics for correctness. Crazy, I know, but like I've said: We grade on a scale ... Any simple scheme would need to support this, too. Together with the first point you won't really be able to achieve any reduction in interface complexity for drivers.
- Thanks to a big discussion with Peter Zijlstra we have a rather solid plan forward for PI-boosting and bounded lock acquisition in linux-rt.
Thus far all the proposed "simple" schemes fall short in one place or the other.
Also, cross-device sync with dma_buf/fence isn't the only use-case I see rolling in:
- i915 is in desperate need of a finer-grained locking scheme. We run into ugly OOM issues all over the place due to our current "one lock to rule them all" design. We duct-tape over the worst with horrible hacks, but spurious OOMs are still too easy to hit. Moving to a per-object lock scheme using ww mutexes will fix that. Plan B would be to use ttm, but that'd be _really_ disruptive, and I don't like ttm that much.
- We're in the process of switching to per-object locking for drm modeset objects, but the complex (and dynamic) graph nature of all the connections poses some interesting issues. Ww mutexes would naturally solve this problem, too.
-Daniel
Wait/wound mutexes
Posted May 2, 2013 9:14 UTC (Thu) by blackwood (guest, #44174)
So we have both code using w/w mutexes, and it's not really a new concept for drivers/gpu/drm. Last paragraphs need to be updated a bit ;-)
Wait/Wake
Posted May 2, 2013 21:07 UTC (Thu) by ncm (guest, #165)
Wait/Wake
Posted May 2, 2013 21:16 UTC (Thu) by jake (editor, #205)
Indeed. Typo alert. "wait/wound" is correct, fixed now, thanks.
jake
They're mutexes, Jim, but not as we know them
Posted May 9, 2013 21:48 UTC (Thu) by ksandstr (guest, #60862)
The first is that a failure to try-lock requires some sort of a fall-back procedure that either doesn't violate locking order, or does so with a different try-lock subject than in previous iterations. Coming up with that procedure is an enormous hassle, and cramps many a concurrent design. Wait/wound mutexes would seem to avoid this hazard by permitting the wounded thread to resume only after the contended mutex has been released at least once.
The second is that (depending on the interface) wait/wound mutexes could reduce the "slumbering herd" effect that occurs when a thread holds a number of mutexes but then has to wait on one more, repeating down the tree. (this effect also tends to magnify non-mutex waits through the mutex scheme, making it especially pernicious.) This reduction would happen by having the wounded thread wait for the contended mutex only after releasing its own, thereby allowing sufficiently unrelated threads to proceed unimpeded. The net gain is lower aggregate latency under contention.
They're mutexes, Jim, but not as we know them
Posted May 10, 2013 17:08 UTC (Fri) by heijo (guest, #88363)
Eventually you'll succeed, although it may take time quadratic in the total number of locks that may be taken across attempts.
Wait/wound on the other hand guarantees that the first-coming thread will progress in linear time in the number of locks, but it seems the other threads may burn unlimited CPU time attempting to take locks and retrying in pathological cases.
This could be fixable by having the "later" thread wait directly for all earlier threads to be done, to save power at the expense of latency, although this is probably not an issue in practice.
They're mutexes, Jim, but not as we know them
Posted May 10, 2013 19:17 UTC (Fri) by dlang (guest, #313)
undefined CPU time, but not unlimited.
each thread gets a sequence number when they start trying to get a lock, and when two threads conflict, the one with the lower sequence number wins.
As a result, every thread is guaranteed to be able to get the lock in preference to any threads that first tried to get the lock after it did.
This puts an outer bound on the amount of CPU it will waste in the meantime (admittedly, not a bound that you can readily calculate, since you don't know how long the locks will be held by processes that have priority over you).
They're mutexes, Jim, but not as we know them
Posted May 10, 2013 23:44 UTC (Fri) by brong (guest, #87268)
My understanding was that once you're wounded, you have to restart from scratch. If you're restarting with the same (low) sequence number rather than being allocated a brand new one, then I see your point. Otherwise, I see starvation possibilities, the horror.
They're mutexes, Jim, but not as we know them
Posted May 11, 2013 2:21 UTC (Sat) by dlang (guest, #313)
> But, since the sequence number increases monotonically, a once-wounded thread must eventually reach a point where it has the highest priority and will win out.
They don't say explicitly that the sequence number is maintained, but I don't see what this would mean otherwise.
They're mutexes, Jim, but not as we know them
Posted May 15, 2013 13:45 UTC (Wed) by mlankhorst (subscriber, #52260)
They're mutexes, Jim, but not as we know them
Posted May 15, 2013 20:49 UTC (Wed) by dlang (guest, #313)
But this detail is critical to avoiding the risk of permanent starvation of some thread.
They're mutexes, Jim, but not as we know them
Posted May 13, 2013 17:40 UTC (Mon) by ksandstr (guest, #60862)
But as you say, for algorithms where the needed set of locks cannot change between backout and retry, your solution is adequate. I've found that those algorithms generally aren't candidates for fine-grained locking, though that's neither here nor there.
Personally, if wait/wound mutexes remove these halfway esoteric concerns from mainstream kernel code (along with the entire idea of lock ordering), I'll be happy as a clam.
Nesting wait/wound mutexes?
Posted Jul 13, 2013 3:22 UTC (Sat) by vomlehn (guest, #45588)
It may be the case that, as a practical matter, this is not a performance issue. I can see, however, a maintenance issue. One subsystem may acquire an m/m mutex, then call another subsystem that acquires a different m/m mutex. This m/m mutex implementation implies that the called subsystem will have to know that the caller holds an m/m mutex and return to that subsystem (or use a callback to that subsystem) to release the first mutex. Ick.
I can conceive of the m/m mutex code keeping track of the m/m mutexes held by a given task and providing feedback on whether any more mutexes need to be released, which simplifies the problem a little from the performance perspective.
All-in-all, I'm not sure this is ready for prime time.