Bug #5892 (open)

CPU caps can throttle performance more than intended

Added by Robert Mustacchi on 2015-04-30.

Status: New
Priority: Normal
Assignee:
Category: kernel
Start date: 2015-04-30
Due date:
% Done: 100%
Estimated time:
Difficulty: Hard
Tags:
Gerrit CR:

Description

From Brendan's original analysis:

CPU caps throttle performance more than intended. I suspect this may be due to the architecture of using separate caps dispatcher queues, and waking up one thread per tick when a zone drops back down below its cap.

Background: the mysqlgauge benchmark was run to test MySQL on us-east-1 (joyent_20120424T232010Z). It was configured to use 32 CPU-hot threads on a system with a cap of 800, which resulted in a high degree of cap latency. While this high load doesn't make much sense to benchmark (it's testing queueing, not running), the result was worse than expected given a cap of 800.

Test: to reproduce more easily, this C program burns a little CPU and sleeps:

#include <unistd.h>

int
main(void)
{
        /* volatile so the compiler can't optimize the busy loop away */
        volatile int j = 0;
        int i;

        while (1) {
                usleep(9 * 1000);       /* sleep ~9 ms ... */
                for (i = 0; i < 200000; i++) {  /* ... then burn a little CPU */
                        j++;
                }
        }
        /* NOTREACHED */
}

From 1 to 220 instances of this program were launched, with DTrace sampling the on-CPU rate. The zone had a cap of 300, and this was tested with both hz=100 and hz=1000 (hires_tick=1). The attached graph (capshz.png) shows the results, with hz=1000 delivering the CPU cap much more closely.
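For reference, hz=1000 is obtained by setting the hires_tick tunable; a minimal /etc/system fragment (a sketch; it takes effect at the next reboot, so check your platform before applying it):

* raise the clock interrupt rate from hz=100 to hz=1000
set hires_tick=1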

From usr/src/uts/common/disp/cpucaps.c:

/*
 * If cap limit is not reached, make one thread from wait queue runnable.
 * The waitq_isempty check is performed without the waitq lock. If a new thread
 * is placed on the waitq right after the check, it will be picked up during the
 * next invocation of cap_poke_waitq().
 *
 * Called once per tick for zones.
 */
/* ARGSUSED */
static void
cap_poke_waitq(cpucap_t *cap, int64_t gen)
{
[...]
        if (cap->cap_usage >= cap->cap_chk_value) {
                cap->cap_above++;
        } else {
                waitq_t *wq = &cap->cap_waitq;

                cap->cap_below++;

                if (!waitq_isempty(wq))
                        waitq_runone(wq);
        }
}

waitq_runone() runs the first thread on the queue, and it is called only once per tick, adding latency when runnable threads should be running but are instead waiting for the next tick to wake up.
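To put a number on that (a toy userland model added here for illustration, not kernel code): draining one thread per tick means the k-th queued thread waits about k ticks, i.e. k × 10 ms at hz=100, after the zone drops below its cap. The queue depth of 34 comes from the waitq_runone() measurements further down.

#include <stdio.h>

int
main(void)
{
        int hz = 100;           /* default clock tick rate */
        int queued = 34;        /* deepest waitq observed in the data below */
        int k;

        for (k = 1; k <= queued; k++)
                printf("thread %2d waits up to %3d ms\n", k, k * 1000 / hz);
        return (0);
}

At a depth of 34 the last thread waits roughly 340 ms per below-cap episode, which is consistent with the measured shortfall.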

This may be the same issue that was hurting the MySQL benchmark. Using DTrace to sample the on-CPU time of MySQL shows it dropping to about 30% below its cap: with a cap of 800 (8 CPUs) and profile-999, running flat out at the cap would be roughly 8 × 999 ≈ 7992 samples per second, but the troughs below dip under 6000:

[bgregg@30NBHS1 (us-east-1) ~]$ dtrace -n 'profile-999 /zonename == "36b50e4c-88d2-4588-a974-11195fac000b"/ { @ = count(); } tick-1s { printa("%@d", @); trunc(@); }'
dtrace: description 'profile-999 ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  2  56649                         :tick-1s 6109
  2  56649                         :tick-1s 7740
  2  56649                         :tick-1s 8230
  2  56649                         :tick-1s 6532
  2  56649                         :tick-1s 7854
  2  56649                         :tick-1s 7239
  2  56649                         :tick-1s 8093
  2  56649                         :tick-1s 5920
  2  56649                         :tick-1s 8504
  2  56649                         :tick-1s 7459
  2  56649                         :tick-1s 5689
[...]

Also, the number of threads on the cap queues gets over 30 (the left column is the observed wq_count, the right column is how many waitq_runone() calls saw that depth; "T" counts all calls in the interval):

[bgregg@30NBHS1 (us-east-1) ~]$ dtrace -n 'waitq_runone:entry /stringof(args[0]->wq_first->t_procp->p_zone->zone_name) == "36b50e4c-88d2-4588-a974-11195fac000b"/ { @[args[0]->wq_count] = count(); @t["T"] = count(); } tick-1s { printa(@); printa(@t); trunc(@); trunc(@t); } profile-999 /zonename == "36b50e4c-88d2-4588-a974-11195fac000b"/ { @p = count(); } tick-1s { printa(@p); trunc(@p); }'
dtrace: description 'waitq_runone:entry ' matched 4 probes
CPU     ID                    FUNCTION:NAME
 11  56649                         :tick-1s
       34                1
        1                2
        2                2
        3                2
        4                2
        5                2
        6                2
        7                2
        8                2
       21                2
       22                2
       23                2
       24                2
       25                2
       26                2
       27                2
       28                2
       29                2
       30                2
       31                2
       32                2
       33                2
        9                3
       10                3
       11                3
       19                3
       20                3
       12                4
       13                4
       14                4
       15                4
       16                4
       17                4
       18                4

  T                                                                86
[...]

From Jerry's follow-up:

I explored a variety of alternatives. First I tried tweaking the decay rate, but I was able to confirm that we really were being throttled by the one-thread-per-tick behavior. I implemented a 1 kHz cyclic, which delivered the same behavior as the hires tick, and then I implemented the current solution, which lets us run at the slower 100 Hz rate but dequeues more threads per tick. This delivers performance similar to the hires tick across a variety of loads that I tested, and gets us right at 100% at lower caps and much closer to 100% at higher cap values.
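A minimal sketch of that shape of change, for illustration only (the function name cap_poke_waitq_multi and the headroom-to-wakeup conversion are assumptions, not the actual fix): on each tick, wake as many waiting threads as the remaining headroom under the cap suggests, rather than exactly one.

/* ARGSUSED */
static void
cap_poke_waitq_multi(cpucap_t *cap, int64_t gen)
{
        if (cap->cap_usage >= cap->cap_chk_value) {
                cap->cap_above++;
        } else {
                waitq_t *wq = &cap->cap_waitq;
                /* hypothetical: one wakeup per unit of unused cap */
                int nwake = (int)(cap->cap_chk_value - cap->cap_usage);

                cap->cap_below++;

                while (nwake-- > 0 && !waitq_isempty(wq))
                        waitq_runone(wq);
        }
}

Dequeuing in proportion to headroom keeps the cheaper 100 Hz tick (no 1 kHz cyclic needed) while draining the waitq fast enough that the zone can actually consume its cap.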


Files

capshz.png (122 KB), Robert Mustacchi, 2015-04-30 08:28 PM
