Bug #5029
process starvation in FSS
Status: Closed
% Done: 100%
Description
Brendan has four cpu-bound processes:
for (;;) {}
He is seeing that one of these processes occasionally won't run, sometimes for seconds at a time.
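For reference, a complete reproducer is just that loop wrapped in main(); the burn1.c file name and the pbind invocation below are assumptions about the setup, since the report only says the four processes are bound to the same CPU.

	/* burn1.c: a cpu-bound process that spins forever in user-land. */
	int
	main(void)
	{
		for (;;)
			;
		/* NOTREACHED */
		return (0);
	}

Something like building this with cc, starting four copies, and running pbind -b <cpuid> <pid> on each would give the bound-to-one-CPU setup shown further down (again, the exact binding method used is an assumption).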
--
We know the processes are always on the run queue. In fss_tick, it looks like there are a few issues that could cause problems.
	fssproc = FSSPROC(t);
	fssproj = FSSPROC2FSSPROJ(fssproc);
	if (fssproj != NULL) {
		fsspset_t *fsspset = FSSPROJ2FSSPSET(fssproj);
		disp_lock_enter_high(&fsspset->fssps_displock);
		fssproj->fssp_ticks += fss_nice_tick[fssproc->fss_nice];
		fssproj->fssp_zone_ticks++;
-->		fssproc->fss_ticks++;
		disp_lock_exit_high(&fsspset->fssps_displock);
	}
Since fss_ticks is used in fss_newpri, we might not be incrementing it here if the project is NULL.
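A minimal sketch of one way to address that (just an illustration of the observation above, not necessarily what the eventual fix does):

	fssproc = FSSPROC(t);
	fssproj = FSSPROC2FSSPROJ(fssproc);

	/*
	 * Sketch: always account the thread's own tick so fss_newpri sees
	 * it, and only take the pset displock for the project/zone counters
	 * when a project is actually attached.  (Whether fss_ticks really
	 * needs the displock is a separate question.)
	 */
	fssproc->fss_ticks++;
	if (fssproj != NULL) {
		fsspset_t *fsspset = FSSPROJ2FSSPSET(fssproj);
		disp_lock_enter_high(&fsspset->fssps_displock);
		fssproj->fssp_ticks += fss_nice_tick[fssproc->fss_nice];
		fssproj->fssp_zone_ticks++;
		disp_lock_exit_high(&fsspset->fssps_displock);
	}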
We also have this block:
	if (thread_change_pri(t, new_pri, 0)) {
		if ((t->t_schedflag & TS_LOAD) &&
		    (lwp = t->t_lwp) && lwp->lwp_state == LWP_USER)
			t->t_schedflag &= ~TS_DONT_SWAP;
		fssproc->fss_timeleft = fss_quantum;
	} else {
		call_cpu_surrender = B_TRUE;
	}
The thread_change_pri call might change our position in the run queue, but in that case we never set call_cpu_surrender, so we would keep running.
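If that observation matters, a hypothetical variant (a sketch only, not the committed change) would be to force a trip through the dispatcher whenever the recalculated priority differs from the one we are running at, rather than only in the else branch:

	/* Sketch: let the dispatcher re-evaluate if our priority changed. */
	if (t->t_pri != new_pri)
		call_cpu_surrender = B_TRUE;

	if (thread_change_pri(t, new_pri, 0)) {
		if ((t->t_schedflag & TS_LOAD) &&
		    (lwp = t->t_lwp) && lwp->lwp_state == LWP_USER)
			t->t_schedflag &= ~TS_DONT_SWAP;
		fssproc->fss_timeleft = fss_quantum;
	} else {
		call_cpu_surrender = B_TRUE;
	}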
--
The four processes are bound to the same CPU. Here is some output that shows what is happening.
Run time:

# dtrace -n 'sched:::on-cpu /execname == "burn1"/ { self->s = timestamp; } sched:::off-cpu /self->s/ { @["runtime (ms)"] = lquantize((timestamp - self->s) / 1000000, 0, 10000, 100); self->s = 0; }'

  runtime (ms)
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@                242900
             100 |@@@@@@@@@@@@@                            126509
             200 |@                                        7878
             300 |                                         2092
             400 |                                         1624
             500 |                                         1250
             600 |                                         1199
             700 |                                         861
             800 |                                         801
             900 |                                         723
            1000 |                                         656
            1100 |                                         518
            1200 |                                         411
            1300 |                                         359
            1400 |                                         321
            1500 |                                         311
            1600 |                                         294
            1700 |                                         229
            1800 |                                         253
            1900 |                                         649
            2000 |                                         1027
            2100 |                                         0
Off-CPU time:
# dtrace -n 'sched:::off-cpu /execname == "burn1"/ { self->s = timestamp; } sched:::on-cpu /self->s/ { @["off-cpu (ms)"] = lquantize((timestamp - self->s) / 1000000, 0, 100000, 1000); self->s = 0; }'

  off-cpu (ms)
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  387184
            1000 |                                         2256
            2000 |                                         1078
            3000 |                                         862
            4000 |                                         1070
            5000 |                                         637
            6000 |                                         535
            7000 |                                         489
            8000 |                                         454
            9000 |                                         387
           10000 |                                         308
           11000 |                                         303
           12000 |                                         221
           13000 |                                         194
           14000 |                                         196
           15000 |                                         144
           16000 |                                         99
           17000 |                                         119
           18000 |                                         94
           19000 |                                         108
           20000 |                                         49
           21000 |                                         46
           22000 |                                         59
           23000 |                                         58
           24000 |                                         31
           25000 |                                         39
           26000 |                                         28
           27000 |                                         32
           28000 |                                         33
           29000 |                                         20
           30000 |                                         15
           31000 |                                         16
           32000 |                                         10
           33000 |                                         12
           34000 |                                         7
           35000 |                                         4
           36000 |                                         3
           37000 |                                         9
           38000 |                                         6
           39000 |                                         2
           40000 |                                         1
           41000 |                                         3
           42000 |                                         2
           43000 |                                         2
           44000 |                                         1
           45000 |                                         0
Confirming long off-CPU time:
[root@headnode (bh1-kvm4:0) ~]# prstat -mLcp 98288,99168,99169,98287 1
Please wait...
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 98287 root      38 0.0 0.0 0.0 0.0 0.0 0.0  62   0   6   0   0 burn1/1
 98288 root      37 0.0 0.0 0.0 0.0 0.0 0.0  63   0   6   0   0 burn1/1
 99168 root      27 0.0 0.0 0.0 0.0 0.0 0.0  73   0   4   0   0 burn1/1
 99169 root      25 0.0 0.0 0.0 0.0 0.0 0.0  75   0   4   0   0 burn1/1
Total: 4 processes, 4 lwps, load averages: 4.23, 4.21, 4.08
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 99169 root      34 0.0 0.0 0.0 0.0 0.0 0.0  66   0   3   0   0 burn1/1
 98287 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   3   0   0 burn1/1
 99168 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   4   0   0 burn1/1
 98288 root     0.0 0.0 0.0 0.0 0.0 0.0 0.0 100   0   0   0   0 burn1/1
Total: 4 processes, 4 lwps, load averages: 4.26, 4.22, 4.09
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 99169 root      34 0.0 0.0 0.0 0.0 0.0 0.0  66   0   3   0   0 burn1/1
 98287 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   3   0   0 burn1/1
 99168 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   8   0   0 burn1/1
 98288 root     0.0 0.0 0.0 0.0 0.0 0.0 0.0 100   0   0   0   0 burn1/1
Total: 4 processes, 4 lwps, load averages: 4.30, 4.23, 4.09
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 99169 root      34 0.0 0.0 0.0 0.0 0.0 0.0  66   0   3   0   0 burn1/1
 98287 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   3   0   0 burn1/1
 99168 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   3   0   0 burn1/1
 98288 root     0.0 0.0 0.0 0.0 0.0 0.0 0.0 100   0   0   0   0 burn1/1
Total: 4 processes, 4 lwps, load averages: 4.33, 4.23, 4.09
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 99169 root      34 0.0 0.0 0.0 0.0 0.0 0.0  66   0   3   0   0 burn1/1
 98287 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   3   0   0 burn1/1
 99168 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   8   0   0 burn1/1
 98288 root     0.0 0.0 0.0 0.0 0.0 0.0 0.0 100   0   0   0   0 burn1/1
[...]
The last block of output repeated like this for another 15 seconds.
--
Here is some analysis and an evaluation of what's happening.
fss_tick
--------
We know we are TS_ONPROC; the other threads are either TS_ONPROC or TS_RUN, since they run all the time in user-land.
With the default FSS quantum, we know we stay onproc for 11 ticks (fssproc->fss_timeleft) before we recalculate the priority in fss_newpri. We also stay on unless there is a higher-priority process which needs to run (t->t_pri < t->t_disp_queue->disp_maxrunpri). If there is, we set call_cpu_surrender.
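To make that concrete, here is a tiny user-land model of the per-tick decision (all of the names and numbers in it are made up for illustration; it is not the kernel code): we burn down an 11-tick quantum, and we only give up the CPU when something on the queue outranks us.

	#include <stdio.h>

	#define	MODEL_QUANTUM	11	/* models the default fss_quantum */

	typedef struct {
		int timeleft;		/* models fssproc->fss_timeleft */
		int pri;		/* models t->t_pri */
	} model_thread_t;

	/* Return 1 if the on-CPU thread should surrender this tick, else 0. */
	static int
	model_tick(model_thread_t *mt, int maxrunpri)
	{
		if (--mt->timeleft <= 0) {
			/* Quantum used up: fss_tick would recompute the pri here. */
			mt->timeleft = MODEL_QUANTUM;
		}
		/* Either way, we only give up the CPU if someone outranks us. */
		return (mt->pri < maxrunpri);
	}

	int
	main(void)
	{
		model_thread_t mt = { MODEL_QUANTUM, 48 };	/* our thread at pri 48 */
		int surrenders = 0, tick;

		for (tick = 0; tick < 1000; tick++)
			surrenders += model_tick(&mt, 46);	/* best waiter at pri 46 */
		printf("surrendered %d times in 1000 ticks\n", surrenders);	/* 0 */
		return (0);
	}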
When we do call fss_newpri, we see that the following call to thread_change_pri always returns 0 and that the priority we're passing in is almost always the same as the one we currently have. thread_change_pri returns 0 because we take the first block and because our state is TS_ONPROC. Since our thread is still the highest priority, it's going to be the one dispatched (cp->cpu_dispthread).
When we do call cpu_surrender, we set FSSBACKQ, but we may still be the highest-priority thread, so we would run again.
XXX 2290: how can thread_change_pri ever return anything but 0 here?
XXX 2299: how can this test t->t_state == TS_ONPROC ever be false?
fss_newpri
----------
When fss_newpri is called from fss_tick for this test case, we know several things which simplify the priority calculation:
- all 4 threads in the same project
- 4 runnable threads in the project (fssp_runnable)
- all 4 threads share the project's fssp_shusage
- the current thread ran for 11 ticks (fssproc->fss_ticks)
A typical shusage value I saw is something like this: 398851
This gives:
	fsspri + (398851 * 4 * 11)
	= fsspri + 17549444
That becomes the new fsspri. Ignoring the old fsspri for a moment, and assuming this value is < maxfsspri (but of the same magnitude as maxfsspri), this maps to an fss_umdpri. Assume maxfsspri is 19000000.
	invpri = (fsspri * 58) / 19000000;	<- 53
	fssproc->fss_umdpri = 59 - invpri;	<- 6
However, if the maxfsspri is an order of magnitude bigger, say 175494440:
	invpri = (fsspri * 58) / 175494440;	<- 5
	fssproc->fss_umdpri = 59 - invpri;	<- 54
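Those two cases are easy to check outside the kernel; the sketch below just replays the arithmetic with the example numbers (58 and 59 as in the expressions above):

	#include <stdio.h>

	int
	main(void)
	{
		/* Per-quantum fsspri growth for this test: shusage * nrunnable * ticks. */
		long long fsspri = 398851LL * 4 * 11;		/* 17549444 */
		long long maxes[] = { 19000000LL, 175494440LL };
		int i;

		for (i = 0; i < 2; i++) {
			long long invpri = (fsspri * 58) / maxes[i];
			printf("maxfsspri %9lld -> invpri %2lld, umdpri %2lld\n",
			    maxes[i], invpri, 59 - invpri);
		}
		/*
		 * Prints invpri 53 / umdpri 6 for maxfsspri 19000000, and
		 * invpri 5 / umdpri 54 for maxfsspri 175494440.
		 */
		return (0);
	}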
So how does maxfsspri get big? Well, once we have a process with a huge fsspri, that becomes the maxfsspri. It will be decayed in fss_decay_usage, but that can take a little while.
So this process might have an fsspri 10x lower than the maxfsspri, and it will take a while before this process's fsspri grows to be in the same range as (or greater than) maxfsspri; in the meantime, this process will keep a high priority.
Once the process's fsspri exceeds the current maxfsspri, its priority will drop to 1 and another process will get its shot, but then that process will take a while before its fsspri gets up into the range of maxfsspri.
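Plugging one sample from the trace further down (the records with max 724301647375528) into the same formula reproduces the priorities we observe and shows why the thread with the smallest fsspri keeps winning; this is a user-land check of the arithmetic, not kernel code:

	#include <stdio.h>

	int
	main(void)
	{
		/* fsspri values of the four burn1 threads from one trace sample. */
		struct { int pid; long long fsspri; } t[] = {
			{ 7880, 145089110246316LL },	/* the thread that keeps running */
			{ 7881, 166606394586027LL },
			{ 7883, 206228343731219LL },
			{ 7885, 249192430786967LL },
		};
		long long maxfsspri = 724301647375528LL;
		int i;

		for (i = 0; i < 4; i++) {
			long long invpri = (t[i].fsspri * 58) / maxfsspri;
			printf("pid %d fsspri %lld -> umdpri %lld\n",
			    t[i].pid, t[i].fsspri, 59 - invpri);
		}
		/* Prints umdpri 48, 46, 43 and 40, matching the trace below. */
		return (0);
	}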
So this explains the behavior we see:
- assume one of the other processes had run, so maxfsspri is large
- the process will first run for 11 ticks before we hit fss_newpri
- our fsspri is 10x lower than maxfsspri, so the fss_umdpri will calculate to a high value and we will remain the highest-priority process to run
- over several seconds our fsspri will have grown past the old maxfsspri value, so this process drops to priority 1, maxfsspri gets set to this high value, and then finally another process will get a chance to run
This matches what we see when we trace our 4 processes coming into the fss_newpri function:
pid 7880 ticks 11 fsspri 195487265864494 umdp 48 max 975901166990186
pid 7880 ticks 11 fsspri 195487401366718 umdp 48 max 975901166990186
pid 7880 ticks 11 fsspri 195487536868942 umdp 48 max 975901166990186
pid 7880 ticks 11 fsspri 195487672371166 umdp 48 max 975901166990186
pid 7880 ticks 11 fsspri 195487807873390 umdp 48 max 975901166990186
pid 7880 ticks 11 fsspri 195487943375614 umdp 48 max 975901166990186
pid 7880 ticks 11 fsspri 195488078877838 umdp 48 max 975901166990186
pid 7880 ticks 11 fsspri 195488214380062 umdp 48 max 975901166990186
pid 7880 ticks 11 fsspri 195488349882286 umdp 48 max 975901166990186
pid 7881 ticks  0 fsspri 166606394586027 umdp 46 max 724301647375528
pid 7885 ticks  0 fsspri 249192430786967 umdp 40 max 724301647375528
pid 7883 ticks  0 fsspri 206228343731219 umdp 43 max 724301647375528
pid 7880 ticks 11 fsspri 145089110246316 umdp 48 max 724301647375528
pid 7880 ticks 11 fsspri 145089140096972 umdp 48 max 724301647375528
pid 7880 ticks 11 fsspri 145089169947628 umdp 48 max 724301647375528
pid 7880 ticks 11 fsspri 145089199798284 umdp 48 max 724301647375528
pid 7880 ticks 11 fsspri 145089229648940 umdp 48 max 724301647375528
pid 7880 ticks 11 fsspri 145089259499596 umdp 48 max 724301647375528
pid 7880 ticks 11 fsspri 145089289350252 umdp 48 max 724301647375528
pid 7880 ticks 11 fsspri 145089319200908 umdp 48 max 724301647375528
pid 7880 ticks 11 fsspri 145089349051564 umdp 48 max 724301647375528
pid 7885 ticks  0 fsspri 184947507224702 umdp 40 max 537567628911524
pid 7883 ticks  0 fsspri 153060098863014 umdp 43 max 537567628911524
pid 7881 ticks  0 fsspri 123653183481816 umdp 46 max 537567628911524
pid 7880 ticks 11 fsspri 107683523403991 umdp 48 max 537567628911524
pid 7880 ticks 11 fsspri 107683550191983 umdp 48 max 537567628911524
pid 7880 ticks 11 fsspri 107683576979975 umdp 48 max 537567628911524
pid 7880 ticks 11 fsspri 107683603767967 umdp 48 max 537567628911524
pid 7880 ticks 11 fsspri 107683630555959 umdp 48 max 537567628911524
pid 7880 ticks 11 fsspri 107683657343951 umdp 48 max 537567628911524
pid 7880 ticks 11 fsspri 107683684131943 umdp 48 max 537567628911524
pid 7880 ticks 11 fsspri 107683710919935 umdp 48 max 537567628911524
pid 7880 ticks 11 fsspri 107683737707927 umdp 48 max 537567628911524
pid 7883 ticks  0 fsspri 113599292124893 umdp 43 max 398975974582771
pid 7881 ticks  0 fsspri 91773847115410 umdp 46 max 398975974582771
pid 7885 ticks  0 fsspri 137265728018333 umdp 40 max 398975974582771
pid 7880 ticks 11 fsspri 79921543961814 umdp 48 max 398975974582771
pid 7880 ticks 11 fsspri 79921568364786 umdp 48 max 398975974582771
pid 7880 ticks 11 fsspri 79921592767758 umdp 48 max 398975974582771
pid 7880 ticks 11 fsspri 79921617170730 umdp 48 max 398975974582771
pid 7880 ticks 11 fsspri 79921641573702 umdp 48 max 398975974582771
pid 7880 ticks 11 fsspri 79921665976674 umdp 48 max 398975974582771
pid 7880 ticks 11 fsspri 79921690379646 umdp 48 max 398975974582771
pid 7880 ticks 11 fsspri 79921714782618 umdp 48 max 398975974582771
pid 7880 ticks 11 fsspri 79921739185590 umdp 48 max 398975974582771
pid 7881 ticks  0 fsspri 68113402155968 umdp 46 max 296114981135650
pid 7885 ticks  0 fsspri 101876907513606 umdp 40 max 296114981135650
pid 7883 ticks  0 fsspri 84311974623944 umdp 43 max 296114981135650
pid 7880 ticks 11 fsspri 59316933913385 umdp 48 max 296114981135650
pid 7880 ticks 11 fsspri 59316956615581 umdp 48 max 296114981135650
pid 7880 ticks 11 fsspri 59316979317777 umdp 48 max 296114981135650
pid 7880 ticks 11 fsspri 59317002019973 umdp 48 max 296114981135650
pid 7880 ticks 11 fsspri 59317024722169 umdp 48 max 296114981135650
pid 7880 ticks 11 fsspri 59317047424365 umdp 48 max 296114981135650
pid 7880 ticks 11 fsspri 59317070126561 umdp 48 max 296114981135650
pid 7880 ticks 11 fsspri 59317092828757 umdp 48 max 296114981135650
pid 7880 ticks 11 fsspri 59317115530953 umdp 48 max 296114981135650
pid 7885 ticks  0 fsspri 75611767295254 umdp 40 max 219772837561615
pid 7883 ticks  0 fsspri 62575293666208 umdp 43 max 219772837561615
pid 7881 ticks  0 fsspri 50552915662632 umdp 46 max 219772837561615
pid 7880 ticks 11 fsspri 44024438532415 umdp 48 max 219772837561615
pid 7881 ticks 11 fsspri 50552915662632 umdp 46 max 219772837561615
pid 7883 ticks 11 fsspri 62575293666208 umdp 43 max 219772837561615
pid 7885 ticks 11 fsspri 75611767295254 umdp 40 max 219772837561615
pid 7880 ticks 11 fsspri 133797318897451 umdp 24 max 219772837561615
pid 7881 ticks 11 fsspri 140325796027668 umdp 22 max 223570199262487
pid 7883 ticks 11 fsspri 152348174031244 umdp 19 max 230098676392704
pid 7885 ticks 11 fsspri 165384647660290 umdp 16 max 242121054396280
pid 7880 ticks 11 fsspri 223570199262487 umdp  1 max 255157528025326
pid 7883 ticks  0 fsspri 179699220059739 umdp  1 max 232559316911052
pid 7880 ticks  0 fsspri 232559316911052 umdp  1 max 232559316911052
pid 7881 ticks  4 fsspri 170776361385210 umdp  1 max 232559316911052
pid 7885 ticks  0 fsspri 189374727831296 umdp  1 max 232559316911052
pid 7881 ticks 11 fsspri 170776368816826 umdp 17 max 232559316911052
pid 7881 ticks 11 fsspri 170776389253770 umdp 17 max 232559316911052
pid 7881 ticks 11 fsspri 170776409690714 umdp 17 max 232559316911052
pid 7881 ticks 11 fsspri 170776430127658 umdp 17 max 232559316911052
pid 7881 ticks 11 fsspri 170776450564602 umdp 17 max 232559316911052
pid 7881 ticks 11 fsspri 170776471001546 umdp 17 max 232559316911052
Updated by Electric Monk over 7 years ago
- Status changed from New to Closed
git commit 482a774964a5167f49856a87b22ce7797ce5d694
commit 482a774964a5167f49856a87b22ce7797ce5d694
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Date:   2014-09-20T01:04:05.000Z

    5029 process starvation in FSS
    5028 improve ffs comments and observability
    Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Approved by: Richard Lowe <richlowe@richlowe.net>