Bug #5029 (closed): process starvation in FSS

Added by Robert Mustacchi almost 8 years ago. Updated over 7 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: kernel
Start date: 2014-07-19
Due date:
% Done: 100%
Estimated time:
Difficulty: Hard
Tags:
Gerrit CR:

Description

Brendan has four cpu-bound processes:

for (;;) {}

He is seeing that sometimes one of these processes won't run, sometimes for seconds at a time.
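
For reference, burn1 can be thought of as nothing more than that loop compiled into a program; a minimal stand-in (assuming the real test program does nothing else) would be:

        /* hypothetical burn1.c: a pure CPU burner */
        int
        main(void)
        {
                for (;;)
                        ;
                /* NOTREACHED */
                return (0);
        }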

--

We know the processes are always on the run queue. In fss_tick, it looks like there are a few issues that could cause problems.

        fssproc = FSSPROC(t);
        fssproj = FSSPROC2FSSPROJ(fssproc);
        if (fssproj != NULL) {
                fsspset_t *fsspset = FSSPROJ2FSSPSET(fssproj);
                disp_lock_enter_high(&fsspset->fssps_displock);
                fssproj->fssp_ticks += fss_nice_tick[fssproc->fss_nice];
                fssproj->fssp_zone_ticks++;
-->             fssproc->fss_ticks++;
                disp_lock_exit_high(&fsspset->fssps_displock);
        }

Since fss_ticks is used in fss_newpri, we might not be incrementing it here if fssproj is NULL.
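
If this is indeed a problem, a minimal sketch of one possible rearrangement (a sketch only, not the committed fix, and the locking of fss_ticks would need review) is to count the thread's tick regardless of the project pointer:

        fssproc = FSSPROC(t);
        fssproj = FSSPROC2FSSPROJ(fssproc);

        /* sketch: charge the thread's tick even when it has no project */
        fssproc->fss_ticks++;

        if (fssproj != NULL) {
                fsspset_t *fsspset = FSSPROJ2FSSPSET(fssproj);
                disp_lock_enter_high(&fsspset->fssps_displock);
                fssproj->fssp_ticks += fss_nice_tick[fssproc->fss_nice];
                fssproj->fssp_zone_ticks++;
                disp_lock_exit_high(&fsspset->fssps_displock);
        }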

We also have this block:

                        if (thread_change_pri(t, new_pri, 0)) {
                                if ((t->t_schedflag & TS_LOAD) &&
                                    (lwp = t->t_lwp) &&
                                    lwp->lwp_state == LWP_USER)
                                        t->t_schedflag &= ~TS_DONT_SWAP;
                                fssproc->fss_timeleft = fss_quantum;
                        } else {
                                call_cpu_surrender = B_TRUE;
                        }

The thread_change_pri call might change our position on the run queue, but in the success path we never set call_cpu_surrender, so we would keep running.
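
As a purely hypothetical sketch of that concern (not the committed fix), the success path could also check whether the new priority leaves us below the best runnable priority and surrender in that case as well:

                        if (thread_change_pri(t, new_pri, 0)) {
                                if ((t->t_schedflag & TS_LOAD) &&
                                    (lwp = t->t_lwp) &&
                                    lwp->lwp_state == LWP_USER)
                                        t->t_schedflag &= ~TS_DONT_SWAP;
                                fssproc->fss_timeleft = fss_quantum;
                                /* hypothetical: yield if we are no longer best */
                                if (t->t_pri < t->t_disp_queue->disp_maxrunpri)
                                        call_cpu_surrender = B_TRUE;
                        } else {
                                call_cpu_surrender = B_TRUE;
                        }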

--

The four processes are bound to the same CPU. Here is some output that shows what is happening.

Run time:

# dtrace -n 'sched:::on-cpu /execname == "burn1"/ { self->s = timestamp; } sched:::off-cpu /self->s/ { @["runtime (ms)"] = lquantize((timestamp - self->s) / 1000000, 0, 10000, 100); self->s = 0; }'

  runtime (ms)                                      
           value  ------------- Distribution ------------- count    
             < 0 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@                242900   
             100 |@@@@@@@@@@@@@                            126509   
             200 |@                                        7878     
             300 |                                         2092     
             400 |                                         1624     
             500 |                                         1250     
             600 |                                         1199     
             700 |                                         861      
             800 |                                         801      
             900 |                                         723      
            1000 |                                         656      
            1100 |                                         518      
            1200 |                                         411      
            1300 |                                         359      
            1400 |                                         321      
            1500 |                                         311      
            1600 |                                         294      
            1700 |                                         229      
            1800 |                                         253      
            1900 |                                         649      
            2000 |                                         1027     
            2100 |                                         0       

Off-CPU time:

# dtrace -n 'sched:::off-cpu /execname == "burn1"/ { self->s = timestamp; } sched:::on-cpu /self->s/ { @["off-cpu (ms)"] = lquantize((timestamp - self->s) / 1000000, 0, 100000, 1000); self->s = 0; }'

  off-cpu (ms)                                      
           value  ------------- Distribution ------------- count    
             < 0 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  387184   
            1000 |                                         2256     
            2000 |                                         1078     
            3000 |                                         862      
            4000 |                                         1070     
            5000 |                                         637      
            6000 |                                         535      
            7000 |                                         489      
            8000 |                                         454      
            9000 |                                         387      
           10000 |                                         308      
           11000 |                                         303      
           12000 |                                         221      
           13000 |                                         194      
           14000 |                                         196      
           15000 |                                         144      
           16000 |                                         99       
           17000 |                                         119      
           18000 |                                         94       
           19000 |                                         108      
           20000 |                                         49       
           21000 |                                         46       
           22000 |                                         59       
           23000 |                                         58       
           24000 |                                         31       
           25000 |                                         39       
           26000 |                                         28       
           27000 |                                         32       
           28000 |                                         33       
           29000 |                                         20       
           30000 |                                         15       
           31000 |                                         16       
           32000 |                                         10       
           33000 |                                         12       
           34000 |                                         7        
           35000 |                                         4        
           36000 |                                         3        
           37000 |                                         9        
           38000 |                                         6        
           39000 |                                         2        
           40000 |                                         1        
           41000 |                                         3        
           42000 |                                         2        
           43000 |                                         2        
           44000 |                                         1        
           45000 |                                         0      

Confirming long off-CPU time:

[root@headnode (bh1-kvm4:0) ~]# prstat -mLcp 98288,99168,99169,98287 1
Please wait...
  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
98287 root      38 0.0 0.0 0.0 0.0 0.0 0.0  62   0   6   0   0 burn1/1
98288 root      37 0.0 0.0 0.0 0.0 0.0 0.0  63   0   6   0   0 burn1/1
99168 root      27 0.0 0.0 0.0 0.0 0.0 0.0  73   0   4   0   0 burn1/1
99169 root      25 0.0 0.0 0.0 0.0 0.0 0.0  75   0   4   0   0 burn1/1
Total: 4 processes, 4 lwps, load averages: 4.23, 4.21, 4.08
  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
99169 root      34 0.0 0.0 0.0 0.0 0.0 0.0  66   0   3   0   0 burn1/1
98287 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   3   0   0 burn1/1
99168 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   4   0   0 burn1/1
98288 root     0.0 0.0 0.0 0.0 0.0 0.0 0.0 100   0   0   0   0 burn1/1
Total: 4 processes, 4 lwps, load averages: 4.26, 4.22, 4.09
  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
99169 root      34 0.0 0.0 0.0 0.0 0.0 0.0  66   0   3   0   0 burn1/1
98287 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   3   0   0 burn1/1
99168 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   8   0   0 burn1/1
98288 root     0.0 0.0 0.0 0.0 0.0 0.0 0.0 100   0   0   0   0 burn1/1
Total: 4 processes, 4 lwps, load averages: 4.30, 4.23, 4.09
  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
99169 root      34 0.0 0.0 0.0 0.0 0.0 0.0  66   0   3   0   0 burn1/1
98287 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   3   0   0 burn1/1
99168 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   3   0   0 burn1/1
98288 root     0.0 0.0 0.0 0.0 0.0 0.0 0.0 100   0   0   0   0 burn1/1
Total: 4 processes, 4 lwps, load averages: 4.33, 4.23, 4.09
  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
99169 root      34 0.0 0.0 0.0 0.0 0.0 0.0  66   0   3   0   0 burn1/1
98287 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   3   0   0 burn1/1
99168 root      33 0.0 0.0 0.0 0.0 0.0 0.0  67   0   8   0   0 burn1/1
98288 root     0.0 0.0 0.0 0.0 0.0 0.0 0.0 100   0   0   0   0 burn1/1
[...]

The last output repeated for another 15 seconds.

--

Here is some analysis and an evaluation of what's happening.

fss_tick
--------
We know our thread is TS_ONPROC; the other threads are either TS_ONPROC or TS_RUN, since they run in user-land all the time.

With the default FSS quantum, we know we stay on-proc for 11 ticks (fssproc->fss_timeleft) before we recalculate the priority in fss_newpri. We also stay on unless there is a higher-priority process that needs to run (t->t_pri < t->t_disp_queue->disp_maxrunpri); if there is, we set call_cpu_surrender.
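
Condensed, the tick-handling flow described above looks roughly like this (a paraphrase, not verbatim kernel code):

        if (--fssproc->fss_timeleft <= 0) {
                /* quantum used up: recompute priority and requeue */
                fss_newpri(fssproc);
                new_pri = fssproc->fss_umdpri;
                if (thread_change_pri(t, new_pri, 0))
                        fssproc->fss_timeleft = fss_quantum;
                else
                        call_cpu_surrender = B_TRUE;
        } else if (t->t_state == TS_ONPROC &&
            t->t_pri < t->t_disp_queue->disp_maxrunpri) {
                /* a higher-priority thread is waiting: give up the CPU */
                call_cpu_surrender = B_TRUE;
        }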

When we do call fss_newpri, we see that the following call to thread_change_pri always returns 0, and that the priority we're passing in is almost always the same as the one we currently have. thread_change_pri returns 0 because our state is TS_ONPROC, so we take the first block. Since our thread is still the highest priority, it's going to be the one dispatched (cp->cpu_dispthread).

When we do call cpu_surrender, we set FSSBACKQ, but we may still be the highest-priority thread, so we would just run again.
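
In other words, assuming the preempt path handles FSSBACKQ in roughly the usual way (a paraphrase, not verbatim kernel code), the thread only goes to the back of the dispatch queue for its current priority:

        if (fssproc->fss_flags & FSSBACKQ) {
                fssproc->fss_flags &= ~FSSBACKQ;
                /*
                 * Back of the queue for t->t_pri; if nothing else has a
                 * higher priority, this thread is simply picked again.
                 */
                setbackdq(t);
        } else {
                setfrontdq(t);
        }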

XXX 2290: How can thread_change_pri ever return anything but 0 here?
XXX 2299: How can the test t->t_state == TS_ONPROC ever be false?

fss_newpri
----------
When fss_newpri is called from fss_tick for this test case, we know several things which simplify the priority calculation:

  • all 4 threads are in the same project
  • there are 4 runnable threads in the project (fssp_runnable)
  • all 4 threads share the project's fssp_shusage
  • the current thread ran for 11 ticks (fssproc->fss_ticks)

A typical shusage value I saw is something like this: 398851
This gives:

    fsspri + (398851 * 4 * 11)
    fsspri + 17549444

That becomes the new fsspri. Ignoring the previous fsspri term for a moment, and assuming this value is < maxfsspri (but of the same order of magnitude as maxfsspri), this yields the fss_umdpri. Assume maxfsspri is 19000000:
        invpri = (fsspri * 58) / 19000000;      <- 53
        fssproc->fss_umdpri = 59 - invpri;      <- 6

However, if maxfsspri is an order of magnitude bigger, say 175494440:

        invpri = (fsspri * 58) / 175494440;     <- 5
        fssproc->fss_umdpri = 59 - invpri;      <- 54
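
A tiny standalone calculation (the constants 58 and 59 are taken from the formulas quoted above, the fsspri increment from the shusage example) shows how the same fsspri maps to very different user-mode priorities depending on maxfsspri, and also why a thread whose fsspri reaches maxfsspri drops to priority 1, which comes up again below:

        #include <stdio.h>

        /* worked example of: invpri = (fsspri * 58) / maxfsspri;
         *                    umdpri = 59 - invpri; */
        static void
        umdpri(long long fsspri, long long maxfsspri)
        {
                long long invpri = (fsspri * 58) / maxfsspri;

                printf("fsspri %lld max %lld -> invpri %lld umdpri %lld\n",
                    fsspri, maxfsspri, invpri, 59 - invpri);
        }

        int
        main(void)
        {
                long long fsspri = 398851LL * 4 * 11;   /* 17549444 */

                umdpri(fsspri, 19000000LL);     /* same magnitude: umdpri 6 */
                umdpri(fsspri, 175494440LL);    /* 10x larger: umdpri 54 */
                umdpri(fsspri, fsspri);         /* fsspri == max: umdpri 1 */
                return (0);
        }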

So how does maxfsspri get big? Well, once we have a process with a huge fsspri, that value becomes the maxfsspri. It will be decayed in fss_decay_usage, but that can take a little while.

So this process might have an fsspri 10x lower than the maxfsspri, and it will take a while before this process's fsspri grows into the same range as, or greater than, maxfsspri; in the meantime, this process will keep a high priority.

Once the process's fsspri exceeds the current maxfsspri, its priority will drop to 1 and another process will get its shot, but then that process will in turn take a while before its fsspri gets up into the range of maxfsspri.

So this explains the behavior we see:

  • assume one of the other processes had run, so maxfsspri is large
  • the process will first run for 11 ticks before we hit fss_newpri
  • our fsspri is 10x lower than maxfsspri, so fss_umdpri will calculate to
    a high value and we will remain the highest-priority process to run
  • over several seconds our fsspri will have grown past the
    old maxfsspri value, so this process drops to priority 1, maxfsspri gets
    set to this high value, and finally another process gets a chance
    to run

This matches what we see when we trace our 4 processes coming into the fss_newpri function:

pid 7880 ticks 11 fsspri   195487265864494 umdp 48 max   975901166990186
pid 7880 ticks 11 fsspri   195487401366718 umdp 48 max   975901166990186
pid 7880 ticks 11 fsspri   195487536868942 umdp 48 max   975901166990186
pid 7880 ticks 11 fsspri   195487672371166 umdp 48 max   975901166990186
pid 7880 ticks 11 fsspri   195487807873390 umdp 48 max   975901166990186
pid 7880 ticks 11 fsspri   195487943375614 umdp 48 max   975901166990186
pid 7880 ticks 11 fsspri   195488078877838 umdp 48 max   975901166990186
pid 7880 ticks 11 fsspri   195488214380062 umdp 48 max   975901166990186
pid 7880 ticks 11 fsspri   195488349882286 umdp 48 max   975901166990186
pid 7881 ticks  0 fsspri   166606394586027 umdp 46 max   724301647375528
pid 7885 ticks  0 fsspri   249192430786967 umdp 40 max   724301647375528
pid 7883 ticks  0 fsspri   206228343731219 umdp 43 max   724301647375528
pid 7880 ticks 11 fsspri   145089110246316 umdp 48 max   724301647375528
pid 7880 ticks 11 fsspri   145089140096972 umdp 48 max   724301647375528
pid 7880 ticks 11 fsspri   145089169947628 umdp 48 max   724301647375528
pid 7880 ticks 11 fsspri   145089199798284 umdp 48 max   724301647375528
pid 7880 ticks 11 fsspri   145089229648940 umdp 48 max   724301647375528
pid 7880 ticks 11 fsspri   145089259499596 umdp 48 max   724301647375528
pid 7880 ticks 11 fsspri   145089289350252 umdp 48 max   724301647375528
pid 7880 ticks 11 fsspri   145089319200908 umdp 48 max   724301647375528
pid 7880 ticks 11 fsspri   145089349051564 umdp 48 max   724301647375528
pid 7885 ticks  0 fsspri   184947507224702 umdp 40 max   537567628911524
pid 7883 ticks  0 fsspri   153060098863014 umdp 43 max   537567628911524
pid 7881 ticks  0 fsspri   123653183481816 umdp 46 max   537567628911524
pid 7880 ticks 11 fsspri   107683523403991 umdp 48 max   537567628911524
pid 7880 ticks 11 fsspri   107683550191983 umdp 48 max   537567628911524
pid 7880 ticks 11 fsspri   107683576979975 umdp 48 max   537567628911524
pid 7880 ticks 11 fsspri   107683603767967 umdp 48 max   537567628911524
pid 7880 ticks 11 fsspri   107683630555959 umdp 48 max   537567628911524
pid 7880 ticks 11 fsspri   107683657343951 umdp 48 max   537567628911524
pid 7880 ticks 11 fsspri   107683684131943 umdp 48 max   537567628911524
pid 7880 ticks 11 fsspri   107683710919935 umdp 48 max   537567628911524
pid 7880 ticks 11 fsspri   107683737707927 umdp 48 max   537567628911524
pid 7883 ticks  0 fsspri   113599292124893 umdp 43 max   398975974582771
pid 7881 ticks  0 fsspri    91773847115410 umdp 46 max   398975974582771
pid 7885 ticks  0 fsspri   137265728018333 umdp 40 max   398975974582771
pid 7880 ticks 11 fsspri    79921543961814 umdp 48 max   398975974582771
pid 7880 ticks 11 fsspri    79921568364786 umdp 48 max   398975974582771
pid 7880 ticks 11 fsspri    79921592767758 umdp 48 max   398975974582771
pid 7880 ticks 11 fsspri    79921617170730 umdp 48 max   398975974582771
pid 7880 ticks 11 fsspri    79921641573702 umdp 48 max   398975974582771
pid 7880 ticks 11 fsspri    79921665976674 umdp 48 max   398975974582771
pid 7880 ticks 11 fsspri    79921690379646 umdp 48 max   398975974582771
pid 7880 ticks 11 fsspri    79921714782618 umdp 48 max   398975974582771
pid 7880 ticks 11 fsspri    79921739185590 umdp 48 max   398975974582771
pid 7881 ticks  0 fsspri    68113402155968 umdp 46 max   296114981135650
pid 7885 ticks  0 fsspri   101876907513606 umdp 40 max   296114981135650
pid 7883 ticks  0 fsspri    84311974623944 umdp 43 max   296114981135650
pid 7880 ticks 11 fsspri    59316933913385 umdp 48 max   296114981135650
pid 7880 ticks 11 fsspri    59316956615581 umdp 48 max   296114981135650
pid 7880 ticks 11 fsspri    59316979317777 umdp 48 max   296114981135650
pid 7880 ticks 11 fsspri    59317002019973 umdp 48 max   296114981135650
pid 7880 ticks 11 fsspri    59317024722169 umdp 48 max   296114981135650
pid 7880 ticks 11 fsspri    59317047424365 umdp 48 max   296114981135650
pid 7880 ticks 11 fsspri    59317070126561 umdp 48 max   296114981135650
pid 7880 ticks 11 fsspri    59317092828757 umdp 48 max   296114981135650
pid 7880 ticks 11 fsspri    59317115530953 umdp 48 max   296114981135650
pid 7885 ticks  0 fsspri    75611767295254 umdp 40 max   219772837561615
pid 7883 ticks  0 fsspri    62575293666208 umdp 43 max   219772837561615
pid 7881 ticks  0 fsspri    50552915662632 umdp 46 max   219772837561615
pid 7880 ticks 11 fsspri    44024438532415 umdp 48 max   219772837561615
pid 7881 ticks 11 fsspri    50552915662632 umdp 46 max   219772837561615
pid 7883 ticks 11 fsspri    62575293666208 umdp 43 max   219772837561615
pid 7885 ticks 11 fsspri    75611767295254 umdp 40 max   219772837561615
pid 7880 ticks 11 fsspri   133797318897451 umdp 24 max   219772837561615
pid 7881 ticks 11 fsspri   140325796027668 umdp 22 max   223570199262487
pid 7883 ticks 11 fsspri   152348174031244 umdp 19 max   230098676392704
pid 7885 ticks 11 fsspri   165384647660290 umdp 16 max   242121054396280
pid 7880 ticks 11 fsspri   223570199262487 umdp  1 max   255157528025326
pid 7883 ticks  0 fsspri   179699220059739 umdp  1 max   232559316911052
pid 7880 ticks  0 fsspri   232559316911052 umdp  1 max   232559316911052
pid 7881 ticks  4 fsspri   170776361385210 umdp  1 max   232559316911052
pid 7885 ticks  0 fsspri   189374727831296 umdp  1 max   232559316911052
pid 7881 ticks 11 fsspri   170776368816826 umdp 17 max   232559316911052
pid 7881 ticks 11 fsspri   170776389253770 umdp 17 max   232559316911052
pid 7881 ticks 11 fsspri   170776409690714 umdp 17 max   232559316911052
pid 7881 ticks 11 fsspri   170776430127658 umdp 17 max   232559316911052
pid 7881 ticks 11 fsspri   170776450564602 umdp 17 max   232559316911052
pid 7881 ticks 11 fsspri   170776471001546 umdp 17 max   232559316911052
#1

Updated by Electric Monk over 7 years ago

  • Status changed from New to Closed

git commit 482a774964a5167f49856a87b22ce7797ce5d694

commit  482a774964a5167f49856a87b22ce7797ce5d694
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Date:   2014-09-20T01:04:05.000Z

    5029 process starvation in FSS
    5028 improve ffs comments and observability
    Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Approved by: Richard Lowe <richlowe@richlowe.net>
