Bug #14842

open

Strange CPU spikes in system time on smbd

Added by Adam Stylinski 3 months ago. Updated 2 months ago.

Status: New
Priority: Normal
Assignee: -
Category: smb - SMB server and client
% Done: 0%
Difficulty: Medium

Description

Since updating, one of our servers has been showing intermittent CPU spikes in system time, seemingly in smbd (it monopolizes 40-50% of the CPU time during a spike). What's weird is that other servers, configured similarly and at similar OI revisions, don't exhibit this. When it happens, smbd is hitting /var/krb5/rcache/root/cifs_0 a lot (observed via zfsslower; the reads are serviced entirely from ARC).
I'm linking to a flame graph of kernel stacks captured via profile-299 during one of these spikes, with the unix`idle derived stacks removed:
https://paste.c-net.org/RuthlessMontreal
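
For anyone reproducing the idle-filtering step, here is a minimal sketch. It assumes the stacks have already been collapsed into the folded format that flame graph tooling commonly consumes (one `frame;frame;... count` line per stack). The function name and the sample frames are hypothetical, and this is not necessarily how the graph above was produced.

```python
def drop_idle_stacks(folded_lines, marker="unix`idle"):
    """Return folded-stack lines in which no frame contains `marker`.

    Assumed input format (not taken from the report): "f1;f2;...;fN count".
    """
    kept = []
    for raw in folded_lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split the trailing sample count from the stack, if one is present.
        stack, sep, count = line.rpartition(" ")
        if not sep or not count.isdigit():
            stack = line  # no count found; treat the whole line as the stack
        frames = stack.split(";")
        if any(marker in frame for frame in frames):
            continue  # stack passes through the idle loop; drop it
        kept.append(line)
    return kept


# Made-up sample data, for illustration only.
sample = [
    "unix`thread_start;unix`idle;unix`cpu_idle 120",
    "genunix`fop_read;zfs`zfs_read 42",
]
for line in drop_idle_stacks(sample):
    print(line)
```

Matching on a substring of each frame, rather than on the whole line, keeps the filter from being thrown off by the trailing sample count.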

I tried to attach the output of smbsrv.d during this time (fairly large), but even with xz it's still 4.6 MB (just slightly too large).


Files

smbd1.txt.xz (2 MB) part1 Adam Stylinski, 2022-07-21 08:58 PM
smbd2.txt.xz (2.5 MB) part2 Adam Stylinski, 2022-07-21 08:58 PM
Actions #1

Updated by Adam Stylinski 3 months ago

Also, while the flame graph doesn't connect the two, I believe the fop_read call happening in smbd is being serviced by a different kernel thread, the one that accounts for the massive second half of the flame graph.

Actions #2

Updated by Adam Stylinski 3 months ago

The compressed smbsrv.d output wasn't that much larger than the attachment limit, so splitting it into two parts was reasonable here.

Actions #3

Updated by Adam Stylinski 3 months ago

Also, here is a userspace flame graph captured during one of these spikes:
https://paste.c-net.org/AschenTimeless

Actions #4

Updated by Gordon Ross 2 months ago

The user-level stack looks interesting, particularly krb5_rc_file_recover_locked.
I don't have time to look right now, but is this doing a lock/retry loop of some sort?

Actions #5

Updated by Adam Stylinski 2 months ago

This is a fairly widely used shared file server, so is it possible that a client is inducing that? What's a good way to answer that question?

Actions #6

Updated by Adam Stylinski 2 months ago

I've confirmed today that this issue persists across a reboot. It seems likely to be something specific to our configuration or to a client accessing the server. What can I do to investigate further?
