Long-term kernel-resident processes need a way to play fair
While implementing AMD-V support for Joyent's KVM port to illumos, we hit a problem in the dispatcher which we believe to be unique to this situation, and which creates unthinkably severe priority inversion.
A KVM guest is booted which spends its time in a real-mode loop while polling
the keyboard (the OpenBSD x86 loader). Boot is interrupted in the loader.
Soon after this happens, the machine drops off the network and appears to
hang. The deadman does not fire.
Breaking in via the debugger, we find that the system is in fact live, and
qemu is on the cpu. Manipulating the qemu process via /proc briefly
re-awakens the network.
After some experimentation with this scenario, a dump is forced which, by
chance, happens to demonstrate the issues quite wonderfully.
> ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc486e0  1b    5    0 108   no    no    t-0 ffffff00076cbc40 sched
                       |    |    |
            RUNNING <--+    |    +--> PIL THREAD
              READY         |           9 ffffff00076cbc40
             EXISTS         |           - ffffff0007605c40 (idle)
             ENABLE         |
                            +-->  PRI THREAD           PROC
                                   99 ffffff00076f5c40 sched
                                   99 ffffff00076fbc40 sched
                                   99 ffffff0007701c40 sched
                                   60 ffffff0007dbac40 fsflush
                                   60 ffffff0008910c40 sched

 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH    THREAD           PROC
  1 ffffff01d3e32580  1f    4    0   0  yes    no t-6620600 ffffff01d5a600c0 qemu-system-x86_
                       |    |
            RUNNING <--+    +-->  PRI THREAD           PROC
              READY                99 ffffff0007e01c40 sched
           QUIESCED                99 ffffff0007e1fc40 sched
             EXISTS                59 ffffff01cfa6d7c0 qemu-system-x86_
             ENABLE                59 ffffff01e35b43c0 qemu-system-x86_
The SWITCH time on cpu1 is not an error, and can be confirmed from the
scheduling class specific data:
> ffffff01d5a600c0::print kthread_t t_cldata | ::print -a tsproc_t ts_timeleft
ffffff01dde41b90 ts_timeleft = 0xff9b0253
> ffffff01dde41b90/D
0xffffff01dde41b90:             -6618541
The time left in our timeslice is massively negative (we have overrun it by
more than 18 hours).
Looking back at this we notice that RNRN is true, so if we were to ever return
to userland we would be preempted, but KVM with AMD-V is able to handle all
the exits induced by the OpenBSD loader in-kernel, and so has not returned to
userland (and will not, without user intervention).
Our next questions are thus
1. Why are we not KRNRN?
2. Why is this hanging the network?
These turn out to have the same answer.
There are two system threads at priority 99 on the cpu1 runq, the maximum
non-interrupt/realtime SYS priority. KRNRN won't come true unless a thread
with a priority of at least kpreemptpri is waiting on our runq. kpreemptpri is
defined to be the maximum SYS pri + 1 (that is, 100), so priority 99 threads
never trigger it.
The waiting threads turn out to be soft ring and squeue workers. Ah ha!
> ffffff0007e01c40::findstack -v
stack pointer for thread ffffff0007e01c40: ffffff0007e01b60
[ ffffff0007e01b60 _resume_from_idle+0xf1() ]
  ffffff0007e01b90 swtch+0x1e6()
  ffffff0007e01bc0 cv_wait+0x7f(ffffff01d6fd5b60, ffffff01d6fd5ac0)
  ffffff0007e01c20 mac_soft_ring_worker+0xc5(ffffff01d6fd5ac0)
  ffffff0007e01c30 thread_start+8()
> ffffff0007e1fc40::findstack -v
stack pointer for thread ffffff0007e1fc40: ffffff0007e1fb60
[ ffffff0007e1fb60 _resume_from_idle+0xf1() ]
  ffffff0007e1fb90 swtch+0x1e6()
  ffffff0007e1fbc0 cv_wait+0x7f(ffffff01d2c35c50, ffffff01d2c35c10)
  ffffff0007e1fc20 squeue_worker+0xd1(ffffff01d2c35c00)
  ffffff0007e1fc30 thread_start+8()
The problem is, thus, that the dispatcher is not equipped to deal with a
thread which will spend unbounded amounts of time in the kernel without ever
sleeping or returning to userland, and will suffer massive priority inversion
problems should this occur (in this case, we have starved two threads at far
greater priority than ourselves for 18 hours).
It is unclear why the threads aren't being migrated to cpu0 and run there. I believe this represents a second bug.
As a rather sad workaround to this problem, we're at present setting our
kprunrun, unconditionally, immediately after VM exit to encourage the system
to treat us as harshly as possible. This obviously pessimizes our own
performance greatly, as we'll be preempted by lower priority threads, but it
avoids this problem in a compact fashion. A better solution is called for.
You may, at this point, be thinking how similar this problem sounds to that
solved by the System Duty Cycle (SDC) scheduling class introduced for ZFS,
another situation where kernel threads may use non-negligible amounts of CPU
for non-negligible amounts of time, and indeed SDC arranges such that in
cpu_surrender() it is treated harshly for preemption, using a straight
priority comparison without regard for kpreemptpri. Unfortunately, it also
deliberately does not support threads with a userland component (it requires
that they have LWPs, for the purposes of having microstate data which it may
track, nothing more).
A "Treat me rough" button.
Another solution would be a flag to a kthread_t instructing the system that we
require SDC-like treatment with respect to preemption (SDC would be adjusted
to set this flag, rather than have special-purpose logic).
A pair of functions could then be added, for the sake of argument:
- kpreempt_agressively() would set this flag, and cause the current thread to be subject to harsh preemption.
- kpreempt_normally() would unset the flag.
kpreempt_disable() in use). We would
then call these across the VCPU run, such that when we spin busy in that loop,
and only when we do that, we are subject to aggressive preemption.
This is the solution I currently favour (partly, it must be said, because of
the isolation of effect, combined with the difficulty of testing dispatcher
changes across a range of workloads).
Updated by Rich Lowe over 9 years ago
- Status changed from In Progress to Feedback
- Assignee deleted (
SmartOS deals with this via setting kpreemptpri such that this isn't a problem. We should give that change some time to shake out on their systems and, assuming all is well, just do that.
Updated by Joshua M. Clulow about 1 year ago
- Status changed from Feedback to In Progress
- Assignee set to Joshua M. Clulow
I think it's time to fix kpreemptpri. It has been at 99 in SmartOS for so long that I didn't immediately even think things could still get stuck this way.
I recently determined that this is one cause of deadlock under heavy paging activity on a system with only a small number of CPUs. Roughly speaking:
pageout() is waiting for ZFS I/O to occur to push pages to swap, and the disk driver has reported completion to ZFS. A single ZIO sits idle forever because receipt of the completion is deferred to the zio_write_intr taskq for dispatch, but the taskq thread is stuck behind other high priority kernel threads, perhaps ironically in this case stuck behind DTrace scrubbing on one core and the pageout_scanner on another. As such, the ZIO sits in that taskq forever, we don't free the page, pageout() eventually gets stuck waiting for the next txg sync to complete and spa_sync() is stuck forever waiting on ZIO in the completion taskq. It looks just like memory deadlock, but freemem is still at a few thousand pages.
Had kpreemptpri been 99 instead of 100, it seems that this would not have happened, and we could go back to the regular cause of memory deadlock: ZFS being too hungry for
This change was made as:
commit 23e44ac77691c8b18b6c707b7de655f2b6cbc144 (HEAD -> pageout, jclulow/pageout)
Author: Bryan Cantrill <firstname.lastname@example.org>
Date:   Mon Feb 20 09:03:35 2012 +0000

    OS-970 kpreemptpri should be lowered to maxclsyspri
I will get somebody at Joyent to open up the analysis: https://smartos.org/bugview/OS-970
Updated by Joshua M. Clulow about 1 year ago
First, this change has been in SmartOS for a very long time (since 2012). It has thus run for a long time on a lot of systems, running a pretty reasonable general purpose workload mix, far in excess of testing I will be able to arrange even with community help.
Second, I did a smoke test by:
- doing several full nightly builds of master
- concurrently running several iperf load generators (both sending and receiving) while doing the builds, and then for 24 hours afterwards
I tried to push the network stack a bit, while otherwise busying the machine, on the basis that at least there are some background workers at different priorities that need to run to make that work out.
There were no observable issues.
Updated by Electric Monk about 1 year ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
commit 6d40a71ead689b14c68d91a5d92845e5f3daa020
Author: Bryan Cantrill <email@example.com>
Date: 2020-09-23T22:22:51.000Z

    1532 Long-term kernel-resident processes need a way to play fair
    Reviewed by: Andy Fiddaman <firstname.lastname@example.org>
    Reviewed by: Patrick Mooney <email@example.com>
    Approved by: Richard Lowe <firstname.lastname@example.org>