Bug #14781

Poor performance and high CPU load on large AMD machine

Added by Ian Collins about 1 year ago. Updated 7 months ago.

Status: New
Priority: High
Assignee: -
Category: zones
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug:

Description

I am seeing some strange behaviour on a couple of large machines (Dell PowerEdge R7525, 2xAMD EPYC 7543 32-Core and 512GB RAM) that I can't get a handle on. I don't see this on smaller hosts.

The machines have a number of lx zones that are each allocated 64 cores' worth of CPU and 64 GB of RAM. These are used as compile engines, running up to 64 jobs.

What I see is very high host CPU usage (measured from cpu_nsec_idle), close to 100%, with no measured CPU activity from the zones. This manifests as a stall in the zone: compiles take a very long time and the console (both zone and host) is unresponsive until things recover.
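For reference, the host utilisation figures quoted here are derived from deltas of the cpu_nsec_idle kstat (cpu:N:sys:cpu_nsec_idle, nanoseconds of idle time accumulated since boot). A minimal sketch of that calculation; the function name and sample numbers are my own, purely illustrative:

```python
# Sketch: per-interval CPU busy % from two samples of the
# cpu_nsec_idle kstat, summed across all CPUs. The kstat counts
# nanoseconds of idle time since boot, so busy time is whatever
# part of the interval was not idle.

def busy_percent(idle_prev_ns, idle_now_ns, interval_ns, ncpus):
    """Busy % across ncpus over one sampling interval."""
    idle_delta = idle_now_ns - idle_prev_ns   # idle ns accumulated this interval
    total = interval_ns * ncpus               # total CPU-ns available
    return 100.0 * (total - idle_delta) / total

# Illustrative sample: 64 CPUs, 10 s interval, 32 CPU-seconds idle.
print(round(busy_percent(0, 32_000_000_000, 10_000_000_000, 64), 1))  # -> 95.0
```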

I ran a high-load test on one of these machines and on an older dual E5-2687W 12-core CPU machine. Both had an lx zone with its cpu_cap, and its RAM in GB, equal to the number of cores on the machine. In the zone, I ran clang-tidy over a large code base with one job per core. The test script runs under "nice", so it shouldn't make the system unresponsive.

While the tests were running, this dtrace script from Dan McDonald was run to see what was hogging the CPUs:
dtrace -n 'profile-97 /curthread->t_name == NULL/ { @hogs[execname, "noname"] = count(); } profile-97 /curthread->t_name != NULL/ { @hogs[execname, stringof(curthread->t_name)] = count(); } tick-10s { printa(@hogs); clear(@hogs); }'

On the Xeon box, the machine is happy: other stuff just works, and my kstat-based logging shows the machine fully busy, with the zone using 80% of the CPU for the duration of the test.
A typical sample is:

Tue Jun 21 07:49:28 2022
clang-tidy noname 1021094
qemu-system-x86_ noname 152650
bhyve vcpu 0 37661
bhyve vcpu 1 18564
bhyve vcpu 3 14050
bhyve vcpu 2 7540

Which is what I would expect to see for this test.

On the big dual-socket EPYC 7543 machine, the system is unresponsive: I see 100% global-zone CPU, I can't ssh in while the test is running, and a typical sample is:

Tue Jun 21 21:08:01 2022
clang-tidy noname 52913
sched noname 43670
pageout noname 7805
bhyve vcpu 6 1593
bhyve vcpu 2 1336
bhyve vcpu 0 1285
zpool-zones noname 1239

when starting the test and

Tue Jun 21 21:10:51 2022
clang-tidy noname 55693
node noname 34680
svc.startd noname 6157
init noname 3530
svc.configd noname 3529
sched noname 1655
nscd revalidate 1309

once the test gets going.

The AMD machine has 176 GB of RAM allocated to bhyve machines, and the ARC size remains constant at around 256 GB throughout.


Files

Screenshot from 2022-10-25 21-40-27.png (74.6 KB) Ian Collins, 2022-10-25 08:40 AM
Screenshot from 2023-03-02 16-47-21.png (142 KB) CPU kstats when starting a test. Ian Collins, 2023-03-02 03:48 AM
Screenshot from 2023-03-02 16-50-25.png (65.6 KB) ARC size and free mem when starting a test. Ian Collins, 2023-03-02 03:50 AM
#1

Updated by Ian Collins 11 months ago

Getting back to this after being sidetracked...

Looking at unix:0:system_pages:freemem when the job starts, it falls to under 1 GB. The ARC size remains > 190 GB.

I wonder if setting a lower ARC size is the solution here?
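If capping the ARC is worth a try, illumos accepts a hard limit via the zfs_arc_max tunable in /etc/system, applied at the next boot. A sketch; the 64 GB value below is only an illustration, not a recommendation for this machine:

```
* /etc/system fragment: cap the ZFS ARC at 64 GB (example value only)
set zfs:zfs_arc_max=68719476736
```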

#2

Updated by Ian Collins 7 months ago

We are still plagued by this issue.

Can anyone suggest how I can debug this further? I don't expect to see a machine with 512 GB of RAM swapping!

I have attached the CPU loading for the host when starting a test run, which shows the system CPU usage going a bit crazy...
