Bug #13057

pessimal mutex behavior

Added by Mateusz Guzik 4 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

This is a slightly edited copy of what I originally posted here:
https://illumos.topicbox.com/groups/discuss/T58ee93f9106073ba

While all code should strive to have as little lock contention as possible, it
is often unavoidable in real-world projects (especially ones which started
uniprocessor-oriented) and locking primitive behavior remains an important
factor in performance.

With this in mind I decided to take a detailed look at illumos and found
room for improvement. While I'm not interested in working on it, I figured
I could share. The review is x86-centric; I don't know what the penalty is
on sparc.

I was surprised to find that both spinlocks and regular mutexes use the same
entry point. Spinlocks fail the general fast path and are checked for in the
slow path.

The basic logic is very simple: spin on cpu if the lock owner is running, go
off cpu otherwise. Going to sleep and waking up is interlocked with the
turnstile_* facility.

The fast path starts with lock cmpxchg trying to replace 0 with the address
of the current thread. cmpxchg returns the value it found, but it is discarded
and the code falls back to the slow path (mutex_vector_enter), passing only
the lock address.

void
mutex_vector_enter(mutex_impl_t *lp)
{
[..]
        if (MUTEX_TYPE_SPIN(lp)) {
                lock_set_spl(&lp->m_spin.m_spinlock, lp->m_spin.m_minspl,
                    &lp->m_spin.m_oldspl);
                return;
        }

        if (!MUTEX_TYPE_ADAPTIVE(lp)) {
                mutex_panic("mutex_enter: bad mutex", lp);
                return;
        }

Each of these macros results in a read in the generated assembly. This is
slightly problematic: in the time window between failing the fast path and
getting here, someone else could have dirtied the cacheline (e.g. by also
failing the fast path), meaning the cpu is now forced to fetch it again. The
inspected value was already returned by cmpxchg and could have been passed
as an argument instead.
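
As a sketch, the slow-path entry could take the word the fast path already
has in hand (the two-argument signature and the MUTEX_TYPE_SPIN_OBSERVED
macro are hypothetical, used only to illustrate the idea):

        /*
         * Hypothetical variant: the assembly fast path passes along the value
         * cmpxchg returned in addition to the lock address, so the slow path
         * can classify the mutex without immediately touching the possibly
         * re-dirtied cacheline again.
         */
        void
        mutex_vector_enter(mutex_impl_t *lp, uintptr_t observed)
        {
                if (MUTEX_TYPE_SPIN_OBSERVED(observed)) {
                        lock_set_spl(&lp->m_spin.m_spinlock, lp->m_spin.m_minspl,
                            &lp->m_spin.m_oldspl);
                        return;
                }
                /* ... adaptive handling continues, reusing 'observed' where it can ... */
        }

The actual code then sets up backoff: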
        backoff = mutex_lock_backoff(0);        /* set base backoff */
        for (;;) {
                mutex_lock_delay(backoff); /* backoff delay */

Despite fairness issues, backoff may be a decent choice even today.
The current code tries to avoid blindly increasing the delay, which may
be a good idea for many workloads.

However, the way the delay is implemented should be modified. Each delay step
uses the MUTEX_DELAY macro:

#define MUTEX_DELAY()   {       \
                        mutex_delay(); \
                        SMT_PAUSE();    \
                        }

SMT_PAUSE is the "pause" instruction which is fine, but mutex_delay is
set to:
        ENTRY(mutex_delay_default)
        movq    $92,%r11
0:      decq    %r11
        jg      0b
        ret
        SET_SIZE(mutex_delay_default)

I don't know the reasoning behind this. It may be that early amd64 cpus did
not have the pause instruction and the above busy-wait was used to induce a
delay regardless. The current standard is to just do "pause". So I think this
routine should be scrapped, but that will have the side effect of changing
how long each step takes. In other words, the backoff parameters will have to
be re-tuned. (They should probably be revisited anyway, since they are
unchanged from the beginning.)
        if ((owner = MUTEX_OWNER(vlp)) == NULL) {
                if (mutex_adaptive_tryenter(lp)) {
                        break;
                }
                [..]
        }

The lock is read again after the pause, which is fine.
        if (mutex_owner_running(lp) != NULL)  {
                continue;
        }

This intends to go back to spinning if the owner is running, and it does so
for that specific case, but see below for a problem.

This routine performs its own read in a dedicated section, which is part
of the magic used to keep threads freeable while other CPUs may want
to inspect them.

        ENTRY(mutex_owner_running)
mutex_owner_running_critical_start:
        movq    (%rdi), %r11            /* get owner field */
        andq    $MUTEX_THREAD, %r11     /* remove waiters bit */
        cmpq    $0, %r11                /* if free, skip */
        je      1f                      /* go return 0 */
        movq    T_CPU(%r11), %r8        /* get owner->t_cpu */
        movq    CPU_THREAD(%r8), %r9    /* get t_cpu->cpu_thread */
.mutex_owner_running_critical_end:
        cmpq    %r11, %r9       /* owner == running thread? */
        je      2f              /* yes, go return cpu */
1:
        xorq    %rax, %rax      /* return 0 */
        ret
2:
        movq    %r8, %rax               /* return cpu */
        ret
        SET_SIZE(mutex_owner_running)

So what it really returns is the cpu struct if the lock is owned and
the owner is running. But it returns NULL in two vastly different cases:
- the lock is owned and the owner is off cpu
- the lock is free, which can happen in the time window between the
previous read and this one

These two cases can't be distinguished by the caller, which is consequently
forced to lock the turnstile.
One trivial solution would be to pass an extra arg like this:

        mutex_owner_running(lp, &owner)

Then we get an updated owner and can check it. This will be important
below.
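
A hypothetical C rendition of that interface could look like this (the real
routine is in assembly and has to keep its critical-section labels so that
touching a possibly-freed thread can be recovered from; the sketch glosses
over that):

        /*
         * Sketch: also hand the observed owner back to the caller, so that
         * "the lock is free" can be told apart from "the owner is off cpu".
         */
        static cpu_t *
        mutex_owner_running(mutex_impl_t *lp, kthread_t **ownerp)
        {
                kthread_t *owner = MUTEX_OWNER(lp);     /* strips the waiters bit */

                *ownerp = owner;
                if (owner == NULL)
                        return (NULL);                  /* lock is free */
                if (owner->t_cpu->cpu_thread == owner)
                        return (owner->t_cpu);          /* owner is on cpu */
                return (NULL);                          /* owner exists but is off cpu */
        }

The slow path then moves on to the turnstile: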
       /*
        * The owner appears not to be running, so block.
        * See the Big Theory Statement for memory ordering issues.
        */
       ts = turnstile_lookup(lp);

Note the comment is misleading. As noted earlier, we get here if either the
owner is not running OR there is no owner to begin with.

But if we found the lock to be free, we wasted time taking a trip through
the turnstile code.

        MUTEX_SET_WAITERS(lp);

This is:
#define MUTEX_SET_WAITERS(lp)                                           \
{                                                                       \
        uintptr_t old;                                                  \
        while ((old = (lp)->m_owner) != 0 &&                            \
            casip(&(lp)->m_owner, old, old | MUTEX_WAITERS) != old)     \
                continue;                                               \
}

That is, set the MUTEX_WAITERS bit if the lock is held. Also note that if the
flag is already set, it sets it again anyway; a sketch of how to avoid the
redundant cas follows below.

turnstile_lookup grabs a spinlock. By the time we get here, a number of
things could have happened, especially if we contended with other
waiters.

In particular:
- the lock is no longer owned by anyone, then the loop correctly does nothing
- the lock is owned and the owner is not running, then the loop correctly
sets the bit
- the lock is owned and the owner is running, then the bit is set for no
reason

Also note that even if the lock is still owned, it may be owned by a
completely different thread than the one previously seen.
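
As a rough sketch, the redundant cas for an already-set bit could be avoided
like this (whether the extra check pays off in practice would need measuring):

        /*
         * Sketch: stop as soon as the lock is free or the waiters bit is
         * already set, instead of issuing a cas that rewrites the same value.
         */
        #define MUTEX_SET_WAITERS(lp)                                           \
        {                                                                       \
                uintptr_t old;                                                  \
                while ((old = (lp)->m_owner) != 0 &&                            \
                    (old & MUTEX_WAITERS) == 0 &&                               \
                    casip(&(lp)->m_owner, old, old | MUTEX_WAITERS) != old)     \
                        continue;                                               \
        }

Continuing with the actual code: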

        /*
         * Recheck whether owner is running after waiters bit hits
         * global visibility (above).  If owner is running, spin.
         */
        if (mutex_owner_running(lp) != NULL) {
                turnstile_exit(lp);
                continue;
        }

This check should have been done prior to the MUTEX_SET_WAITERS loop. Note
that while it is always possible for the owner to start running after we set
the bit, the time window between this check and the loop is much smaller than
the one between turnstile_lookup and this check, i.e. losing the race would
be less likely.
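
A conservative sketch of that reordering (hypothetical; it keeps the existing
post-set recheck for the memory-ordering reason cited in the comment, and
only adds an earlier check to shrink the window):

        ts = turnstile_lookup(lp);
        /* early check: bail out before touching the waiters bit at all */
        if (mutex_owner_running(lp) != NULL) {
                turnstile_exit(lp);
                continue;
        }
        MUTEX_SET_WAITERS(lp);
        /* existing recheck after the waiters bit is globally visible */
        if (mutex_owner_running(lp) != NULL) {
                turnstile_exit(lp);
                continue;
        }

The actual code then decides whether to block: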
        /*
         * If owner and waiters bit are unchanged, block.
         */
        if (MUTEX_OWNER(vlp) == owner && MUTEX_HAS_WAITERS(vlp)) {
                sleep_time -= gethrtime();
                (void) turnstile_block(ts, TS_WRITER_Q, lp,
                    &mutex_sobj_ops, NULL, NULL);
                sleep_time += gethrtime();
                /* reset backoff after turnstile */
                backoff = mutex_lock_backoff(0);
        } else {
                turnstile_exit(lp);
        }

Note that 'owner' is what was read off the lock way back, prior to
turnstile_lookup and even the first mutex_owner_running.

Consider a scenario where the lock changed owners between the read and
turnstile_lookup, and the new owner is off cpu.

The MUTEX_SET_WAITERS loop will set the bit and mutex_owner_running will
conclude the new owner is not running. Yet MUTEX_OWNER(vlp) == owner will
compare against the pre-turnstile value and incorrectly bail, only to call
turnstile_lookup again. Instead, the owner could have been re-read at the
beginning of the turnstile section.
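
A sketch of that adjustment, using the variable names from the quoted code
(hypothetical, not the actual illumos code):

        ts = turnstile_lookup(lp);
        owner = MUTEX_OWNER(vlp);       /* refresh the snapshot under the turnstile lock */
        MUTEX_SET_WAITERS(lp);
        /* ... recheck the owner running, then the unchanged-owner test as before ... */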

The unlock path uses a clever trick to perform the unlock without an atomic
operation if there are no waiters. However, it has to account for the case
where the waiters bit is set and there is nobody asleep. This happens more
often than it should because of how loose the checks are. I strongly suspect
this results in worse multithreaded performance in the face of contention, as
it adds accesses to the lock word. Instead, a counter could be maintained in
struct thread. Threads going off CPU would take the relevant dispatcher lock,
re-check that the owner is not running and bump the counter to denote that
there is a lock with blocked waiters. Blocking is rare enough that some false
positives stemming from this should be fine.
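
One possible reading of that suggestion, with made-up names (t_lockwaiters
does not exist in struct thread, and the unlock-side check is an assumption
rather than something spelled out above):

        /* waiter side, while holding the dispatcher (turnstile) lock and after
         * re-checking that the owner is off cpu: */
        atomic_inc_uint(&owner->t_lockwaiters);         /* hypothetical field */

        /* unlock side: only consider the wakeup path when the current thread
         * is known to own a lock with blocked waiters, instead of probing the
         * lock word (the matching decrement is omitted from this sketch): */
        if (curthread->t_lockwaiters != 0) {
                /* ... existing turnstile wakeup ... */
        }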

All in all, there should be a few percent of contention reduction just from
addressing these issues.
