Long-term kernel-resident processes need a way to play fair
While implementing AMD-V support for Joyent's KVM port to illumos, we hit a problem in the dispatcher which we believe to be pretty much unique to this situation, and which creates unthinkably severe priority inversion.
A KVM guest is booted which spins in a real-mode loop while polling the
keyboard (the OpenBSD x86 loader). Boot is interrupted in the loader.
Soon after this happens, the machine drops off the network and appears to
hang. The deadman does not fire.
Breaking in via the debugger, we find that the system is in fact live, and
qemu is on the cpu. Manipulating the qemu process via /proc briefly
re-awakens the network.
After some experimentation with this scenario, a dump is forced which, by
chance, happens to demonstrate the issues quite wonderfully.
    > ::cpuinfo -v
     ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
      0 fffffffffbc486e0  1b    5    0 108   no    no t-0    ffffff00076cbc40 sched
                           |    |    |
                RUNNING <--+    |    +--> PIL THREAD
                  READY         |           9 ffffff00076cbc40
                 EXISTS         |           - ffffff0007605c40 (idle)
                 ENABLE         |
                                +-->  PRI THREAD           PROC
                                       99 ffffff00076f5c40 sched
                                       99 ffffff00076fbc40 sched
                                       99 ffffff0007701c40 sched
                                       60 ffffff0007dbac40 fsflush
                                       60 ffffff0008910c40 sched

     ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH    THREAD           PROC
      1 ffffff01d3e32580  1f    4    0   0  yes    no t-6620600 ffffff01d5a600c0 qemu-system-x86_
                           |    |
                RUNNING <--+    +-->  PRI THREAD           PROC
                  READY                99 ffffff0007e01c40 sched
               QUIESCED                99 ffffff0007e1fc40 sched
                 EXISTS                59 ffffff01cfa6d7c0 qemu-system-x86_
                 ENABLE                59 ffffff01e35b43c0 qemu-system-x86_
The SWITCH time on cpu1 is not an error, and can be confirmed from the
scheduling class specific data:
    > ffffff01d5a600c0::print kthread_t t_cldata | ::print -a tsproc_t ts_timeleft
    ffffff01dde41b90 ts_timeleft = 0xff9b0253
    > ffffff01dde41b90/D
    0xffffff01dde41b90:             -6618541
The time left in our timeslice is massively negative (over 18 hours overdue).
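The 18 hour figure can be checked by hand. The following is a minimal sketch of that arithmetic, assuming a 100 Hz clock tick (the dump does not state the tick rate; `ASSUMED_HZ` and both helper names are hypothetical and exist only for this illustration). It reinterprets the raw word printed for ts_timeleft as a signed tick count, then converts the overrun to whole hours.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: a 100 Hz clock tick; the dump itself does not state the rate. */
#define	ASSUMED_HZ	100

/* Reinterpret the raw 32-bit word shown by ::print as a signed tick count. */
static int64_t
signed_ticks(uint32_t raw)
{
	int64_t v = (int64_t)raw;

	/* Values with the top bit set are negative in two's complement. */
	return ((raw & 0x80000000u) ? v - 4294967296LL : v);
}

/* Whole hours by which a negative time-left value has overrun its slice. */
static int64_t
hours_overrun(int64_t timeleft, int hz)
{
	assert(hz > 0);
	return (timeleft < 0 ? (-timeleft / hz) / 3600 : 0);
}
```

Under that assumption, 0xff9b0253 reads back as -6618541 ticks, which is a little over 18 hours, in line with the SWITCH column for cpu1 (t-6620600).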
Looking back at the ::cpuinfo output, we notice that RNRN is true, so if we
were ever to return to userland we would be preempted. But KVM with AMD-V is
able to handle all the exits induced by the OpenBSD loader in-kernel, and so
has not returned to userland (and will not, without user intervention).
Our next questions are thus
1. Why are we not KRNRN?
2. Why is this hanging the network?
These turn out to have the same answer.
There are two system threads at priority 99 on the cpu1 runq, the maximum
non-interrupt/realtime SYS priority. KRNRN won't come true unless a thread
with a priority of at least kpreemptpri is waiting on our runq. kpreemptpri is
defined to be the maximum SYS pri + 1, so threads at priority 99 never trigger it.
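The threshold effect can be shown with a small model. This is a simplified sketch, not the dispatcher's actual code: `would_set_krnrn()` and both constants are hypothetical names, the priority values are taken from the dump above, and the comparison is assumed to be "at or above kpreemptpri".

```c
#include <assert.h>
#include <stdbool.h>

/* ASSUMPTION: values matching the dump, where the top SYS priority is 99. */
#define	MAX_SYS_PRI	99
#define	KPREEMPT_PRI	(MAX_SYS_PRI + 1)

/*
 * Simplified, hypothetical model: would a thread of priority `waiting`
 * arriving on the run queue force a kernel preemption (KRNRN) of a
 * thread currently running at priority `running`?
 */
static bool
would_set_krnrn(int running, int waiting)
{
	assert(running >= 0 && waiting >= 0);
	return (waiting > running && waiting >= KPREEMPT_PRI);
}
```

In this model a priority 99 waiter never forces a kernel preemption, however low the running thread's priority is, while a waiter at 100 always does. That is the gap the soft ring and squeue workers fall into.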
The waiting threads turn out to be soft ring and squeue workers. Ah ha!
    > ffffff0007e01c40::findstack -v
    stack pointer for thread ffffff0007e01c40: ffffff0007e01b60
    [ ffffff0007e01b60 _resume_from_idle+0xf1() ]
      ffffff0007e01b90 swtch+0x1e6()
      ffffff0007e01bc0 cv_wait+0x7f(ffffff01d6fd5b60, ffffff01d6fd5ac0)
      ffffff0007e01c20 mac_soft_ring_worker+0xc5(ffffff01d6fd5ac0)
      ffffff0007e01c30 thread_start+8()
    > ffffff0007e1fc40::findstack -v
    stack pointer for thread ffffff0007e1fc40: ffffff0007e1fb60
    [ ffffff0007e1fb60 _resume_from_idle+0xf1() ]
      ffffff0007e1fb90 swtch+0x1e6()
      ffffff0007e1fbc0 cv_wait+0x7f(ffffff01d2c35c50, ffffff01d2c35c10)
      ffffff0007e1fc20 squeue_worker+0xd1(ffffff01d2c35c00)
      ffffff0007e1fc30 thread_start+8()
The problem is, thus, that the dispatcher is not equipped to deal with a
thread which will spend unbounded amounts of time in the kernel without ever
sleeping or returning to userland, and will suffer massive priority inversion
problems should this occur (in this case, we have starved two threads at far
greater priority than ourselves for 18 hours).
It is unclear why the threads aren't being migrated to cpu0 and run there. I believe this to represent a 2nd bug.
As a rather sad workaround to this problem, we're at present setting our
kprunrun, unconditionally, immediately after VM exit to encourage the system
to treat us as harshly as possible. This obviously pessimizes our own
performance greatly, as we'll be preempted by lower priority threads, but it
avoids this problem in a compact fashion. A better solution is called for.
You may, at this point, be thinking how similar this problem sounds to that
solved by the System Duty Cycle (SDC) scheduling class introduced for ZFS,
another situation where kernel threads may use non-negligible amounts of CPU
for non-negligible amounts of time, and indeed SDC arranges such that in
cpu_surrender() it is treated harshly for preemption, using a straight
priority comparison without regard for kpreemptpri. Unfortunately, it also
deliberately does not support threads with a userland component (it requires
that they have LWPs, for the purposes of having microstate data which it may
track, nothing more).
A "Treat me rough" button.
Another solution would be a flag on a kthread_t instructing the system that we
require SDC-like treatment with respect to preemption (SDC would be adjusted
to set this flag, rather than have special-purpose logic).
A pair of functions could then be added, for the sake of argument:
- kpreempt_agressively() would set this flag, and cause the current thread to be subject to harsh preemption.
- kpreempt_normally() would unset the flag.
kpreempt_disable() in use). We would
then call these across the VCPU run, such that when we spin busy in that loop,
and only when we do that, we are subject to aggressive preemption.
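To make the proposal concrete, here is a minimal userland model of the flag and the two functions. Everything in it is hypothetical: `model_thread_t` stands in for the relevant part of a kthread_t, the functions take an explicit thread argument where the real ones would act on the current thread, and `KPREEMPT_PRI` is assumed to be 100 (maximum SYS priority of 99, plus 1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	KPREEMPT_PRI	100	/* ASSUMPTION: max SYS priority (99) + 1 */

/* Hypothetical stand-in for the relevant part of a kthread_t. */
typedef struct model_thread {
	int	mt_pri;			/* dispatch priority */
	bool	mt_preempt_hard;	/* hypothetical "treat me rough" flag */
} model_thread_t;

/* Opt the thread in to harsh, SDC-like preemption. */
static void
kpreempt_agressively(model_thread_t *t)
{
	assert(t != NULL);
	t->mt_preempt_hard = true;
}

/* Return the thread to the ordinary kpreemptpri rule. */
static void
kpreempt_normally(model_thread_t *t)
{
	assert(t != NULL);
	t->mt_preempt_hard = false;
}

/* Should `running` be preempted in-kernel by a waiter at `waiting`? */
static bool
should_kpreempt(const model_thread_t *running, int waiting)
{
	assert(running != NULL);
	if (waiting <= running->mt_pri)
		return (false);
	/* Flagged threads get a straight priority comparison. */
	return (running->mt_preempt_hard || waiting >= KPREEMPT_PRI);
}
```

In this model a VCPU thread at priority 59 is not preempted by a priority 99 waiter until it calls kpreempt_agressively(), and it returns to the normal rule as soon as it calls kpreempt_normally() on leaving the run loop.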
This is the solution I currently favour (partly, it must be said, because of
the isolation of effect, combined with the difficulty of testing dispatcher
changes across a range of workloads).
Updated by Rich Lowe almost 8 years ago
- Status changed from In Progress to Feedback
- Assignee deleted (
SmartOS deals with this via setting kpreemptpri such that this isn't a problem. We should give that change some time to shake out on their systems and, assuming all is well, just do that.