Bug #13056


pessimal rw locks behavior

Added by Mateusz Guzik over 1 year ago.

While read-write locking that always modifies the same cacheline suffers from fundamental scalability issues, the current implementation makes the problem more acute than it needs to be. Note that the description below is amd64-centric, but the outlined problems concern all architectures.

The least non-scalable approach that sticks to the above design performs a lock xadd for both locking and unlocking and then inspects the result (e.g., to find out whether the lock has a writer). However, implementing it in illumos while keeping the lock pointer-sized is problematic, since the code needs to always be able to find the writer in order to lend it priority. Currently this is trivially done by reading the stored pointer, but then it is not safe to blindly xadd into the lock word. One idea to take care of the problem is to shift out the part of the address which is always known, making enough room for blind xadds not to corrupt the unique part.

The current approach of cas-ing into the word has significantly worse behavior in the face of multiple lock requests due to repeat atomics stemming from failed attempts (e.g., 2 read lock requests at the same time will result in a failure for one of them). Even then, the current handling degrades more than it needs to.

I don't have specific numbers for illumos, but to give you an idea of the impact: on FreeBSD, a locking primitive with the same problem (cas loops with explicit re-reads) which got augmented into a cas loop without explicit re-reads saw a 27% speedup under concurrent access on a 2-socket * 10-core * 2-thread Sandy Bridge box a few years back.

The problem starts in the fast path, which immediately falls through to the slow path. Instead it should loop for as long as the lock can be taken for reading.

The slow path performs the following:

                if (((old = lp->rw_wwwh) & lock_busy) == 0) {

An explicit read of the lock word. The failing fast path can pass in the value it already observed, making this read unnecessary. Also see below.
                        if (casip(&lp->rw_wwwh, old, old + lock_value) != old) {

casip returns the value it found, but that value only gets used for the comparison. Looping back then issues an avoidable re-read instead of retrying with it.
                                if (rw_lock_delay != NULL) {

The pointer is NULL on amd64. Whether anything should be done here depends heavily on the architecture; it is fine to busy-loop on amd64 on bare metal. Virtualized environments may want to start injecting "pause" after a while.

A concern of a similar sort applies to the loop setting the waiter bits.

