Bug #14142
closed
kernel SMB spams log when it hits max_connections
100%
Description
We've got a box which is using kernel SMB which seems to hang occasionally. When it hangs, the kernel is still responding to ping, typed characters on serial console do echo, but userland is unresponsive, both on serial console (login prompts don't work at all) and over the network (TCP connection attempts just hang and time out).
NMI on this machine works fine normally to trigger a dump or enter kmdb -- the tweakables to panic on NMI are set in /etc/system. However, in this hang situation when we issue an NMI this happens:
WARNING: SMB Session: taskq_dispatch failed
WARNING: SMB Session: taskq_dispatch failed
WARNING: SMB Session: taskq_dispatch failed
WARNING: SMB Session: taskq_dispatch failed
WARNING: SMB Session: taskq_dispatch fail...
The system prints this to the serial console at maximum baud rate, apparently forever. A second NMI never seems to do anything to it, either, and we resort to a hard reset.
Updated by Gordon Ross over 1 year ago
This is a cmn_err(CE_WARN, ...) call here:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/smbsrv/smb_server.c?r=4e065a9f#2494
The system must be receiving a flood of connections. (Under attack?)
Why would all of these be going to the console? I don't think that's normal.
Updated by Gordon Ross over 1 year ago
Have you seen that more than once? Any clues how to reproduce it?
Updated by Gordon Ross 12 months ago
- Subject changed from "kernel SMB stops NMI from inducing panic, printfs as hard and fast as it can instead" to "kernel SMB spams log when it hits max_connections"
This happens when the SMB server has created too many sessions, and runs into max_connections.
Updated by Gordon Ross 12 months ago
There are actually a couple of problems here.
1: Checking for max_connections should happen earlier, before we try to allocate a taskq thread.
2: The error handling code if/when taskq_dispatch fails was missing a call to smb_session_logoff, which is necessary to allow the tear-down to proceed smoothly.
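The ordering described in the two points above can be sketched in userland C. This is only an illustration under stated assumptions, not the code from the fix: every name here (sketch_server_t, sketch_dispatch, sketch_accept) is hypothetical, the free_workers counter merely models a taskq running out of threads, and decrementing cur_connections stands in for the real session logoff and release.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the server state; not the smbsrv structures. */
typedef struct sketch_server {
	unsigned int max_connections;	/* configured connection limit */
	unsigned int cur_connections;	/* sessions currently open */
	unsigned int free_workers;	/* models remaining taskq capacity */
} sketch_server_t;

typedef enum {
	ACCEPT_OK,
	ACCEPT_REJECT_LIMIT,	/* refused before any dispatch attempt */
	ACCEPT_REJECT_DISPATCH	/* dispatch failed, session torn down */
} accept_result_t;

/* Models a dispatch that fails when no worker is available. */
static bool
sketch_dispatch(sketch_server_t *sv)
{
	if (sv->free_workers == 0)
		return (false);
	sv->free_workers--;
	return (true);
}

static accept_result_t
sketch_accept(sketch_server_t *sv)
{
	assert(sv != NULL);

	/* 1: check the limit first, before touching the worker pool. */
	if (sv->cur_connections >= sv->max_connections)
		return (ACCEPT_REJECT_LIMIT);

	sv->cur_connections++;		/* session created */

	/* 2: on dispatch failure, undo the session so tear-down is clean. */
	if (!sketch_dispatch(sv)) {
		sv->cur_connections--;	/* stands in for logoff + release */
		return (ACCEPT_REJECT_DISPATCH);
	}
	return (ACCEPT_OK);
}
```

With a limit of 3 and three workers, a fourth call returns ACCEPT_REJECT_LIMIT without consuming a worker. Raising only the limit by one then makes the next call reach the dispatch step and fail there.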
This can be tested by setting max_connections to a small value (I used 3) and then making enough connections to run over that limit. In the updated code, the first error handling code path can be tested with that configuration. To test the second error handling code path, use "mdb -kw" to increase smb_server->sv_cfg.skc_maxconnections by one and then repeat the attempt to add client connections. Both cases should immediately disconnect the client. After both error code paths are exercised, make sure the SMB service can restart without problems.
Updated by Gordon Ross 12 months ago
- Status changed from New to In Progress
- Assignee set to Gordon Ross
Updated by Joshua M. Clulow 12 months ago
My question:
While this fix does seem like an improvement to the code, do you have
any idea about the interaction with the NMI/panic described in the bug?
Gordon's answer:
I believe it was simply that their system was too busy logging stuff
to the console to respond to the NMI.
I did not actually reproduce that problem, though I think that's
possible with any console spammer.
And I forgot my test results, which were: I repeated the procedures
described in the issue, verifying that we get the messages at no more
than once per minute, and that service restart works.
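The "no more than once per minute" behaviour observed in testing could come from a time-based throttle like the one below. This is a minimal sketch and an assumption: the issue does not say how the commit limits the messages, and ratelimit_t, ratelimit_should_warn and WARN_INTERVAL_SEC are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	WARN_INTERVAL_SEC	60	/* assumed interval: one minute */

/* Hypothetical throttle state; zero-initialize before first use. */
typedef struct ratelimit {
	int64_t last_warn;	/* time of the last emitted warning, seconds */
	bool warned;		/* false until the first warning is emitted */
} ratelimit_t;

/*
 * Returns true if the caller should emit a warning at time `now`
 * (in seconds), and records that it did. Returns false while the
 * previous warning is less than WARN_INTERVAL_SEC old.
 */
static bool
ratelimit_should_warn(ratelimit_t *rl, int64_t now)
{
	assert(rl != NULL);

	if (rl->warned && now - rl->last_warn < WARN_INTERVAL_SEC)
		return (false);

	rl->warned = true;
	rl->last_warn = now;
	return (true);
}
```

A caller would wrap its warning in a check of ratelimit_should_warn, so a flood of rejected connections yields one console line per interval instead of one per connection.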
Updated by Electric Monk 12 months ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
git commit 61b20185b3a9f12c5f69672abe47b79dfb002cab
commit 61b20185b3a9f12c5f69672abe47b79dfb002cab
Author: Gordon Ross <gwr@racktopsystems.com>
Date: 2022-06-09T15:00:11.000Z
14142 kernel SMB spams log when it hits max_connections
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Albert Lee <alee@racktopsystems.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Matt Barden <mbarden@tintri.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>