Bug #11909


THREAD_KPRI_RELEASE does nothing of the sort

Added by John Levon almost 4 years ago. Updated over 3 years ago.

Start date:
Due date:
% Done:


Estimated time:
Gerrit CR:
External Bug:


OS-7753.jira.json (34.7 KB) OS-7753.jira.json Joshua M. Clulow, 2020-02-28 05:57 PM

Related issues

Related to illumos gate - Bug #12771: 11909 re-introduced lockstat panicClosedJohn Levon

Actions #1

Updated by John Levon almost 4 years ago

Work by Patrick Mooney originally at (and OS-7926)

Some testing notes from that below. In addition to below, I've run these bits for a while during illumos-gate build.

On several occasions, most recently in SWSUP-1426, systems have been observed in dire or deadlocked states thanks to threads running with MINCLSYSPRI acquired via THREAD_KPRI_REQUEST. With such a thread blocks on a lock with nonzero t_kpri_req, FSS and TSS will set a flag (FSSKPRI and TSSKPRI, respectively) which will persist in the scheduler struct until it is cleared in fss_trapret when the thread returns to userspace. If the thread spends a lot of time in userspace, as is common for KVM, bhyve, or workloads like sendfile(), then FSSKPRI will remain asserted, causing that thread to bypass many scheduler decisions, including expiration of its time quantum so other threads can preempt it.

This behavior has dubious origins, and in the world of fully preemptable kernel logic running on large multi-CPU systems, it seems ideal to rip it out.

With a stock PI and one with my change to rip out the KPRI logic, I ran full illumos-joyent nightly builds and compared the performance. When run repeatedly, the build duration was consistent within each PI, but the patched version was faster by a significant margin. Here are two such runs:


==== Elapsed build time (non-DEBUG) ====

real    14:01.1
user  1:50:14.7
sys     55:59.8


==== Elapsed build time (non-DEBUG) ====

real    10:37.6
user  1:52:30.6
sys     46:22.0

Between each of the builds, I clobbered much of the workspace (leaving the bootstrap tools intact) and then initiated the build after a reboot so ARC warmth would not be a factor.

Actions #2

Updated by Gordon Ross almost 4 years ago

The design work for this really should consider its effect on problems and explain how the result will avoid that problem.

Actions #3

Updated by John Levon over 3 years ago

Here's an some excerpts of a real-life case where this caused a machine to basically grind to a halt (from Joyent SWSUP-1426), courtesy of Mike Gerdts and Patrick, lightly edited:

There are hundreds of instances running.  A sampling of them shows that they are hung up on vmadm trying to acquire a lock.

Here's one of those vmadm processes. It is trying to register its dtrace probes.

> fffffe28b82ad198::walk thread |::findstack -v
stack pointer for thread fffffe28a895c040 (vmadm/1): fffffcc2791775d0
[ fffffcc2791775d0 _resume_from_idle+0x12b() ]
  fffffcc279177600 swtch+0x141()
  fffffcc2791776a0 turnstile_block+0x21a(fffffe28e71d0a90, 0, ffffffffc012b448, fffffffffbc1b4e0, 0, 0)
  fffffcc279177710 mutex_vector_enter+0x3a3(ffffffffc012b448)
  fffffcc279177760 dtrace_helper_provider_register+0x39(fffffe28b82ad198, fffffe28b4221d80, fffffcc279177800)

It is blocked by a node process that is trying to do the same.

> ffffffffc012b448::mutex
ffffffffc012b448 adapt fffffe27dd5213c0      -      -     yes
> fffffe27dd5213c0::findstack -v
stack pointer for thread fffffe27dd5213c0 (node/1): fffffcc26b0beb70
[ fffffcc26b0beb70 _resume_from_idle+0x12b() ]
  fffffcc26b0beba0 swtch+0x141()
  fffffcc26b0bec40 turnstile_block+0x21a(fffffe298f9cbea0, 0, fffffffffbc8b468, fffffffffbc1b4e0, 0, 0)
  fffffcc26b0becb0 mutex_vector_enter+0x3a3(fffffffffbc8b468)
  fffffcc26b0bece0 dtrace_enabling_matchall+0x24()

That node process is blocked by a bhyve process:

> fffffe28e72daba0::findstack -v
stack pointer for thread fffffe28e72daba0 (bhyve/42 [vcpu 1]): fffffcc276385310
[ fffffcc276385310 _resume_from_idle+0x12b() ]
  fffffcc276385340 swtch+0x141()
  fffffcc2763853e0 turnstile_block+0x21a(fffffe234d51a398, 0, fffffffffbc23008, fffffffffbc1b4e0, 0, 0)
  fffffcc276385450 mutex_vector_enter+0x3a3(fffffffffbc23008)
  fffffcc2763854a0 cbe_xcall+0x5d(fffffe234decf000, fffffe234decf000, fffffffffb9f7a80, fffffcc2763854b0)
  fffffcc276385520 cyclic_remove_here+0x57(fffffe237fd40340, 1c, fffffcc276385560, 1)
  fffffcc2763855b0 cyclic_juggle_one_to+0xf4(fffffe237fd940b0, fffffe237fd40a00)
  fffffcc2763855f0 cyclic_move_here+0x88(fffffe237fd940b0)
  fffffcc276385620 vmm_glue_callout_localize+0x27(fffffe27f52f6e90)
  fffffcc276385640 vlapic_localize_resources+0x15(fffffe27f52f6e40)
  fffffcc276385670 vm_localize_resources+0x55(fffffe28ae399000, fffffe28ae399350)
  fffffcc276385760 vm_run+0x18d(fffffe28ae399000, fffffcc276385ae0)

In fact, that cbe_xcall_lock was released (so if the bhyve thread could run, it could acquire the lock).

Over to Patrick:

The lack of waiters on the cbe_xcall_lock is not itself troubling when we look at the status of that CPU:

  9 fffffffffbc4ee00  1b    1    0  60   no    no t-77805 fffffe28a2df6100 qemu-system-x86_
                       |    |
            RUNNING <--+    +-->  PRI THREAD           PROC
              READY                 1 fffffe28e72daba0 bhyve

The bhyve thread was made runnable when the owner of cbe_xcall_lock dropped it.
The new question becomes: Why hasn't the bhyve thread been able to run yet?

 The qemu thread occupying the CPU does have a higher priority. The long t_disp_time delay on the qemu thread is troubling, but nothing in its stack suggests anything particularly unusual:

> fffffe28a2df6100::findstack -v
stack pointer for thread fffffe28a2df6100 (qemu-system-x86_/6): fffffcc27d8281e0
  fffffcc27d828220 ht_begin_intr+0x51(a)
  fffffcc27d828290 0xb()
  fffffcc27d8282e0 apix_hilevel_intr_epilog+0xc6(fffffcc27d828350, 0)
  fffffcc27d828340 apix_do_interrupt+0x382(fffffcc27d828350, fffffe29ebefb028)
  fffffcc27d828350 _sys_rtt_ints_disabled+8()
  fffffcc27d828460 vcpu_enter_guest+0x3e7()
  fffffcc27d8284a0 __vcpu_run+0x9b()
  fffffcc27d8284e0 kvm_arch_vcpu_ioctl_run+0x114()
  fffffcc27d828cc0 kvm_ioctl+0x466()

The ticket trail switches over at this point to OS-7753 and Patrick's comments beginning "On several occasions...". So to join the dots, what we have here is a bhyve thread, waiting to get on a CPU, that's blocking lots of other threads. However, that bhyve thread itself is not getting a chance due to a (qemu) thread on CPU that's got t_kpri_req > 0. Since that won't get reliably cleared until going back out to userspace, this state of affairs can continue on arbitrarily: essentially, nothing will cause the qemu thread to come off CPU.

This example is particularly egregious, but it should be clear that this is a general problem: any thread that gets a t_kpri_req bump can hog a CPU for longer than would be ideal. Often, this is bounded by that thread going through trapret or otherwise dropping t_kpri_req, but there's no hard requirement on this.

Actions #5

Updated by Gordon Ross over 3 years ago

I asked for some time to look into this further, so I'll share some notes on what I learned,
as well as some additional notes from Bryan about this work.

One concern was whether removing the THREAD_KPRI_REQUEST/RELEASE
mechanism would leave us without a solution to priority inversion problems.

It doesn't. The primary mechanism for avoiding priority inversion is a
priority inheritance mechanism implemented in uts/common/os/turnstile.c
which is used by mutex_enter, rw_enter (at least for the writer case) etc.

There are apparently some additional cases where threads could block
that could cause priority inversion, one example being rwlock reader
threads (from user-level) getting preempted while a writer waits to
enter that rwlock. The THREAD_KPRI_REQUEST mechanism was
an attempt to temporarily raise thread priorities for these cases (for
user-level threads) but the implementation had a number of problems.

Someone could fix THREAD_KPRI_RELEASE (make it restore the
original thread priority earlier), but there does not appear to be
interest in doing that because: (a) the problem it tries to solve is
apparently very rare (not observed on fairly busy systems), and
(b) if one were to start over trying to solve these remaining (and
mostly theoretical) priority inversion problems, the design would
likely be qutie different. In particular, the placement of all the
(increment/decrement) sites has poor maintainability.

Bryan had more to say along those lines: (quoting)

(1) The priority inversion is in theory possible both before and after this change, as rwlocks have never supported priority inheritance for writers blocked by readers (and note that it necessitates a middling-priority spinner -- if there are but two threads, the writer will block and the reader will run)

(2) Such as the mechanism addressed this priority inversion at all, it did very little: it is a long, long way from reader tracking and true priority inheritance for rwlock writers blocked by readers

(3) The way rwlocks are used in the system make true priority inheritance for writers blocked by readers highly questionable to begin with (mutexes should be used where this behavior is desired)

(4) The change itself has been studied extensively, with special attention paid to possible effects on scheduling behavior outside of priority inversions

(5) The cure is unquestionably better than the disease, which itself caused many production problems

Actions #6

Updated by Electric Monk over 3 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 6a0b121754dbd27d5d72dcddcb6465aab4092ba1

commit  6a0b121754dbd27d5d72dcddcb6465aab4092ba1
Author: Patrick Mooney <>
Date:   2020-04-08T14:57:16.000Z

    11909 THREAD_KPRI_RELEASE does nothing of the sort
    Reviewed by: Bryan Cantrill <>
    Reviewed by: Jerry Jelinek <>
    Reviewed by: Gordon Ross <>
    Approved by: Dan McDonald <>

Actions #7

Updated by Joshua M. Clulow over 3 years ago

  • Related to Bug #12771: 11909 re-introduced lockstat panic added

Also available in: Atom PDF