Feature #1532


Long-term kernel-resident processes need a way to play fair

Added by Rich Lowe about 12 years ago. Updated almost 3 years ago.


While implementing AMD-V support for the KVM port to illumos done by Joyent, we hit a problem in the dispatcher which we believe to be pretty unique to this situation, and which creates unthinkably severe priority inversion.


A KVM guest is booted which spins in a real-mode loop while polling the
keyboard (the OpenBSD x86 loader). Boot is interrupted in the loader.

Soon after this happens, the machine drops off the network and appears to
hang. The deadman does not fire.

Breaking in via the debugger, we find that the system is in fact live, and
qemu is on the cpu. Manipulating the qemu process via /proc briefly
re-awakens the network.

After some experimentation with this scenario, a dump is forced which, by
chance, happens to demonstrate the issues quite wonderfully.

> ::cpuinfo -v
  0 fffffffffbc486e0  1b    5    0 108   no    no t-0    ffffff00076cbc40 sched
                       |    |    |
            RUNNING <--+    |    +--> PIL THREAD
              READY         |           9 ffffff00076cbc40
             EXISTS         |           - ffffff0007605c40 (idle)
             ENABLE         |
                            +-->  PRI THREAD           PROC
                                   99 ffffff00076f5c40 sched
                                   99 ffffff00076fbc40 sched
                                   99 ffffff0007701c40 sched
                                   60 ffffff0007dbac40 fsflush
                                   60 ffffff0008910c40 sched

  1 ffffff01d3e32580  1f    4    0   0  yes    no t-6620600 ffffff01d5a600c0 
                       |    |
            RUNNING <--+    +-->  PRI THREAD           PROC
              READY                99 ffffff0007e01c40 sched
           QUIESCED                99 ffffff0007e1fc40 sched
             EXISTS                59 ffffff01cfa6d7c0 qemu-system-x86_
             ENABLE                59 ffffff01e35b43c0 qemu-system-x86_

The SWITCH time on cpu1 is not an error, and can be confirmed from the
scheduling class specific data:

> ffffff01d5a600c0::print kthread_t t_cldata | ::print -a tsproc_t ts_timeleft
ffffff01dde41b90 ts_timeleft = 0xff9b0253
> ffffff01dde41b90/D
0xffffff01dde41b90:             -6618541

The time left in our timeslice is massively negative (over 18 hours overdrawn).

Looking back at this we notice that RNRN is true, so if we were to ever return
to userland we would be preempted, but KVM with AMD-V is able to handle all
the exits induced by the OpenBSD loader in-kernel, and so has not returned to
userland (and will not, without user intervention).

Our next questions are thus:

1. Why are we not KRNRN?
2. Why is this hanging the network?

These turn out to have the same answer.

There are two system threads at priority 99 on the cpu1 runq, the maximum
non-interrupt/realtime SYS priority. KRNRN won't come true unless a thread
with greater than kpreemptpri is waiting on our runq. kpreemptpri is defined
to be the maximum SYS pri + 1, evading this.

The waiting threads turn out to be soft ring and squeue workers. Ah ha!

> ffffff0007e01c40::findstack -v
stack pointer for thread ffffff0007e01c40: ffffff0007e01b60
[ ffffff0007e01b60 _resume_from_idle+0xf1() ]
  ffffff0007e01b90 swtch+0x1e6()
  ffffff0007e01bc0 cv_wait+0x7f(ffffff01d6fd5b60, ffffff01d6fd5ac0)
  ffffff0007e01c20 mac_soft_ring_worker+0xc5(ffffff01d6fd5ac0)
  ffffff0007e01c30 thread_start+8()
> ffffff0007e1fc40::findstack -v
stack pointer for thread ffffff0007e1fc40: ffffff0007e1fb60
[ ffffff0007e1fb60 _resume_from_idle+0xf1() ]
  ffffff0007e1fb90 swtch+0x1e6()
  ffffff0007e1fbc0 cv_wait+0x7f(ffffff01d2c35c50, ffffff01d2c35c10)
  ffffff0007e1fc20 squeue_worker+0xd1(ffffff01d2c35c00)
  ffffff0007e1fc30 thread_start+8()

The problem, thus, is that the dispatcher is not equipped to deal with a
thread which will spend an unbounded amount of time in the kernel without ever
sleeping or returning to userland, and massive priority inversion results
should this occur (in this case, we have starved two threads at far greater
priority than ourselves for 18 hours).

It is unclear why the threads aren't being migrated to cpu0 and run there. I believe this represents a second bug.

Suggested fixes

As a rather sad workaround to this problem, we are at present setting our
kprunrun, unconditionally, immediately after VM exit to encourage the system
to treat us as harshly as possible. This obviously pessimizes our own
performance greatly, as we will be preempted by lower-priority threads, but it
avoids the problem in a compact fashion. A better solution is called for.

Scheduling Class

You may, at this point, be thinking how similar this problem sounds to that
solved by the System Duty Cycle (SDC) scheduling class introduced for ZFS,
another situation where kernel threads may use non-negligible amounts of CPU
for non-negligible amounts of time. Indeed, SDC arranges in cpu_surrender()
for its threads to be treated harshly for preemption, using a straight
priority comparison without regard for kpreemptpri. Unfortunately, it also
deliberately does not support threads with a userland component (it requires
that they have LWPs, for the purposes of having microstate data which it may
track, nothing more).

A "Treat me rough" button.

Another solution would be a flag on the kthread_t instructing the system that we
require SDC-like treatment with respect to preemption (SDC would be adjusted
to set this flag, rather than have special-purpose logic).

A pair of functions could then be added, for the sake of argument:

  • kpreempt_agressively() would set this flag, and cause the current thread to be subject to harsh preemption.
  • kpreempt_normally() would unset the flag.

(compare to kpreempt_enable() and kpreempt_disable(), already in use). We would
then call these across the VCPU run, such that when we spin busy in that loop,
and only when we do that, we are subject to aggressive preemption.

This is the solution I currently favour (partly, it must be said, because of
the isolation of effect, combined with the difficulty of testing dispatcher
changes across a range of workloads).

Related issues

Related to illumos gate - Bug #7232: add tunables to combat scheduling delay of kernel threads (Closed, Daniel Kimmel, 2016-07-28)

#1

Updated by Rich Lowe about 12 years ago

  • Tags deleted (needs-triage)
#2

Updated by Rich Lowe about 12 years ago

  • Status changed from New to In Progress
#3

Updated by Rich Lowe over 11 years ago

  • Status changed from In Progress to Feedback
  • Assignee deleted (Rich Lowe)

SmartOS deals with this by setting kpreemptpri such that this isn't a problem. We should give that change some time to shake out on their systems and, assuming all is well, just do that.

#4

Updated by Kent Watsen almost 11 years ago

Any update on this? Is Joyent's kpreemptpri setting stable enough to use here?

#5

Updated by Rich Lowe almost 11 years ago

Almost certainly, yes.

#6

Updated by Joshua M. Clulow about 3 years ago

  • Status changed from Feedback to In Progress
  • Assignee set to Joshua M. Clulow

I think it's time to fix kpreemptpri. It has been at 99 in SmartOS for so long that I didn't immediately even think things could still get stuck this way.

I recently determined that this is one cause of deadlock under heavy paging activity on a system with only a small number of CPUs. Roughly speaking: pageout() is waiting for ZFS I/O to occur to push pages to swap, and the disk driver has reported completion to ZFS. A single ZIO sits idle forever because receipt of the completion is deferred to the zio_write_intr taskq for dispatch -- but the taskq thread is stuck behind other high priority kernel threads, perhaps ironically in this case stuck behind DTrace scrubbing on one core and the pageout_scanner on another. As such, the ZIO sits in that taskq forever, we don't free the page, pageout() eventually gets stuck waiting for the next txg sync to complete and spa_sync() is stuck forever waiting on ZIO in the completion taskq. It looks just like memory deadlock, but freemem is still at a few thousand pages.

Had kpreemptpri been 99 instead of 100, it seems that this would not have happened, and we could go back to the regular cause of memory deadlock: ZFS being too hungry for pageout_reserve pages.

This change was made as:

commit 23e44ac77691c8b18b6c707b7de655f2b6cbc144 (HEAD -> pageout, jclulow/pageout)
Author: Bryan Cantrill <>
Date:   Mon Feb 20 09:03:35 2012 +0000

    OS-970 kpreemptpri should be lowered to maxclsyspri

I will get somebody at Joyent to open up the analysis:

#7

Updated by Joshua M. Clulow about 3 years ago

  • Related to Bug #7232: add tunables to combat scheduling delay of kernel threads added
#8

Updated by Electric Monk about 3 years ago

  • Gerrit CR set to 886
#9

Updated by Joshua M. Clulow about 3 years ago

Testing Notes

First, this change has been in SmartOS for a very long time (since 2012). It has thus run for a long time on a lot of systems, running a pretty reasonable general purpose workload mix, far in excess of testing I will be able to arrange even with community help.

Second, I did a smoke test by:

  • doing several full nightly builds of master
  • concurrently running several iperf load generators (both sending and receiving) while doing the builds, and then for 24 hours afterwards

I tried to push the network stack a bit, while otherwise busying the machine, on the basis that at least there are some background workers at different priorities that need to run to make that work out.

There were no observable issues.

#10

Updated by Electric Monk almost 3 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 6d40a71ead689b14c68d91a5d92845e5f3daa020

commit  6d40a71ead689b14c68d91a5d92845e5f3daa020
Author: Bryan Cantrill <>
Date:   2020-09-23T22:22:51.000Z

    1532 Long-term kernel-resident processes need a way to play fair
    Reviewed by: Andy Fiddaman <>
    Reviewed by: Patrick Mooney <>
    Approved by: Richard Lowe <>

