Bug #13148

do not sq_wait unnecessarily

Added by Patrick Mooney 10 days ago. Updated about 2 hours ago.

Status: In Progress
Priority: Normal
Category: kernel
% Done: 0%
Difficulty: Medium
Gerrit CR: 926

Description

Upstreaming OS-6150 from SmartOS:

A member of the community reported extremely slow results running apache via the Phoronix Test Suite as a means of comparing performance between LX and ESX:

Has anyone run the Phoronix Test Suite on smartos? I've tried it in an lx-zone:
"image_uuid": "68a837fe-1b9b-11e7-a66d-ab7961786c42"

and in a joyent smartos zone:
"image_uuid": "f6acf198-2037-11e7-8863-8fdd4ce58b6a",

- SmartOS (build: 20170413T062226Z)

One of the few tests that would run on solaris is the pts/apache test:

./phoronix-test-suite benchmark pts/apache

I've tried it on two generations of servers, and compared the results to
similar hardware running esx with a full-blown vm with centos 6 installed. The
results for smartos are awful. An average of 1749.53 Requests Per Second for
the smartos zone vs 23425.61 Requests Per Second for the full vm (centos 6)
on an esx host. The performance of the lx-zone was very similar to the
smartos zone.

On the pts/apache test, the host on esx ran the test in approximately 3
minutes; the smartos zone and the lx-zone were closer to 30 minutes.

If I am doing something wrong, I'd really like to know.

Thanks!

-al

After following their instructions for install (same image, same test), I observed similar slowness. The test runs 'ab' against the apache instance running on localhost. During the test, CPU usage was minimal. Further examination showed high latency in connect() syscalls.

Tracing for threads spending excessive time off-cpu in 'ab' yielded this result
(the upper figure shows the distribution of off-cpu time in nanoseconds; the stack() count aggregation follows):

           value  ------------- Distribution ------------- count
         7000000 |                                         0
         7500000 |                                         2
         8000000 |                                         7
         8500000 |@@@@@@@                                  104
         9000000 |@@@@@@@@@@@@@@@@@@@@@                    315
         9500000 |@@@@@@@@@@@@                             179
        10000000 |                                         0

              genunix`cv_wait+0x70
              ip`squeue_synch_enter+0x12b
              ip`tcp_connect+0x72
              sockfs`so_connect+0xc4
              sockfs`socket_connect+0x3c
              lx_brand`lx_connect+0xaf
              lx_brand`lx_syscall_enter+0x16f
              unix`sys_syscall+0x142
              607

10ms is a long time to wait on the squeue. Further investigation is required.

Here are the preserved args for the different test processes to aid in manual reproduction:

22284:  /var/lib/phoronix-test-suite/installed-tests/pts/apache-1.6.1//httpd_/bin/httpd
argv[0]: /var/lib/phoronix-test-suite/installed-tests/pts/apache-1.6.1//httpd_/bin/httpd
argv[1]: -k
argv[2]: start
argv[3]: -f
argv[4]: /var/lib/phoronix-test-suite/installed-tests/pts/apache-1.6.1//httpd_/conf/httpd.conf

22666:  ./httpd_/bin/ab -n 1000000 -c 100 http://localhost:8088/test.html
argv[0]: ./httpd_/bin/ab
argv[1]: -n
argv[2]: 1000000
argv[3]: -c
argv[4]: 100
argv[5]: http://localhost:8088/test.html

Quick D script I threw together for tracking long periods off-cpu in connect(2):

/* Timestamp entry into connect(2) for the target process. */
lx-syscall::connect:entry
/pid == $target/
{
        self->t = timestamp;
}

/* Note when a thread that is inside connect() goes off-cpu... */
sched:::off-cpu
/self->t/
{
        self->w = timestamp;
}

/* ...and how long it stayed off-cpu once it is switched back on. */
sched:::on-cpu
/self->w/
{
        @ws = sum(timestamp - self->w);
        @wq = quantize(timestamp - self->w);
        this->diff = timestamp - self->w;
        self->w = 0;
}

/* For waits longer than 1ms, keep a finer histogram and the blocked stack. */
sched:::on-cpu
/this->diff > 1000000/
{
        @wl = llquantize(this->diff, 10, 6, 8, 20);
        this->diff = 0;
        @off[stack()] = count();
}

/* Overall connect(2) latency distribution. */
lx-syscall::connect:return
/self->t/
{
        @ = quantize(timestamp - self->t);
        self->t = 0;
}

/* @off (the stack aggregation) is printed automatically at exit. */
END
{
        trace("syscall times");
        printa(@);
        trace("wait quantize");
        printa(@wq);
        trace("wait total");
        printa(@ws);
        trace("wait long");
        printa(@wl);
}

The 10ms timeout corresponds to the standard ip_squeue_worker_wait timeout. When wake-up of the squeue worker thread is requested via squeue_worker_wakeup(), this timeout (stored in sq_wait) is used to delay the actual wake-up of the worker. The motivations for this delay are not immediately apparent.
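
To make that mechanism concrete, here is a schematic C sketch of the deferred wake-up. This is illustrative kernel-style pseudocode only, not the actual squeue code; the sketch_* structure and function names are invented, while sq_wait, sq_tid and the worker condition variable correspond to the fields described above.

/*
 * Schematic sketch of the deferred wake-up -- illustrative only.
 * Names prefixed with "sketch_" are invented for this example.
 */
typedef struct sketch_squeue {
        kmutex_t        sq_lock;
        kcondvar_t      sq_worker_cv;   /* the squeue worker thread sleeps here */
        timeout_id_t    sq_tid;         /* pending deferred wake-up, if any */
        clock_t         sq_wait;        /* wake-up delay in ms (10 by default) */
} sketch_squeue_t;

/* timeout(9F) callback: the deferred wake-up finally signals the worker */
static void
sketch_squeue_fire(void *arg)
{
        sketch_squeue_t *sqp = arg;

        mutex_enter(&sqp->sq_lock);
        sqp->sq_tid = 0;
        cv_signal(&sqp->sq_worker_cv);
        mutex_exit(&sqp->sq_lock);
}

/*
 * Instead of signalling the worker's cv immediately, arm a timeout of
 * sq_wait milliseconds.  A thread blocked in cv_wait() under
 * squeue_synch_enter() (as in the stack above) is ultimately waiting for
 * the worker to run, so the deferred wake-up adds up to sq_wait of
 * latency -- the ~10ms observed in connect(2).
 */
static void
sketch_squeue_worker_wakeup(sketch_squeue_t *sqp)
{
        ASSERT(MUTEX_HELD(&sqp->sq_lock));

        if (sqp->sq_tid == 0) {
                sqp->sq_tid = timeout(sketch_squeue_fire, sqp,
                    MSEC_TO_TICK(sqp->sq_wait));
        }
}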

Tuning sq_wait to 0 on my test machine via mdb -kw resulted in a roughly 20x throughput increase (from ~750 conn/s to ~15k conn/s).

Comparing results for the above ab(1) test, a platform bearing the update performs much better: roughly 15500 req/s vs 750 req/s. Bulk throughput appears unaffected, both for machine-local iperf tests between zones and for off-machine transfers. This comes as no surprise, as the high-throughput operations show no reliance on squeue worker wake-up.
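
For contrast, a conceptual sketch (again with invented names, extending the structure above with a hypothetical sq_state field) of why bulk transfers do not care about the delay: a thread entering the squeue processes its packet inline whenever it can claim the squeue for itself, and only the queued-packet path ends up requesting the deferred worker wake-up.

/*
 * Conceptual sketch only -- not the actual squeue_enter() code.
 * sq_state, SKETCH_SQS_PROC, sketch_enqueue() and sketch_process_packet()
 * are hypothetical.
 */
static void
sketch_squeue_enter(sketch_squeue_t *sqp, mblk_t *mp)
{
        mutex_enter(&sqp->sq_lock);
        if (!(sqp->sq_state & SKETCH_SQS_PROC)) {
                /* Fast path: the caller processes its own packet inline. */
                sqp->sq_state |= SKETCH_SQS_PROC;
                mutex_exit(&sqp->sq_lock);
                sketch_process_packet(mp);
                mutex_enter(&sqp->sq_lock);
                sqp->sq_state &= ~SKETCH_SQS_PROC;
        } else {
                /*
                 * Squeue already owned: queue the packet and request a
                 * wake-up -- the only path where sq_wait adds latency.
                 */
                sketch_enqueue(sqp, mp);
                sketch_squeue_worker_wakeup(sqp);
        }
        mutex_exit(&sqp->sq_lock);
}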


Related issues

Related to illumos gate - Bug #13149: elide squeue wake-ups when prudent (In Progress)

History

#1

Updated by Patrick Mooney 10 days ago

  • Related to Bug #13149: elide squeue wake-ups when prudent added
#2

Updated by Patrick Mooney 10 days ago

  • Status changed from New to In Progress
#3

Updated by Electric Monk 8 days ago

  • Gerrit CR set to 926
#4

Updated by Patrick Mooney about 2 hours ago

Rather than using PTS as the means of driving httpd/ab, I simply installed apache-24 from the ooce repo on my test system. On a vanilla illumos-gate build, it was only able to achieve ~500 req/s. With the patch applied, it jumped up to ~10000 req/s - a similar increase to what was seen on SmartOS/illumos-joyent.

This patch is identical in content to what was side-pulled into OmniOSCE back in February. As far as I know, they've not heard any complaints from r34 or bloody users regarding the change.
