[ Analysis from Bayard Bell ]
The mutex does seem to be bad, but we can nevertheless link this to a thread, which turns out to be in smb_server_shutdown():
> ffffff28b2f81f50::mutex
ADDR TYPE HELD MINSPL OLDSPL WAITERS
mdb: 0xffffff28b2f81f50: invalid adaptive mutex (-f to dump anyway)
> ffffff28b2f81f38::print -atx smb_event_t
ffffff28b2f81f38 smb_event_t {
ffffff28b2f81f38 uint32_t se_magic = 0xbaa9b1ab
ffffff28b2f81f40 list_node_t se_lnd = {
ffffff28b2f81f40 struct list_node *list_next = 0
ffffff28b2f81f48 struct list_node *list_prev = 0
}
ffffff28b2f81f50 kmutex_t se_mutex = {
ffffff28b2f81f50 void *[1] _opaque = [ 0xffffff296df13c46 ]
}
ffffff28b2f81f58 kcondvar_t se_cv = {
ffffff28b2f81f58 ushort_t _opaque = 0xd6
}
ffffff28b2f81f60 smb_server_t *se_server = 0xffffff18623be000
ffffff28b2f81f68 uint32_t se_txid = 0xf8a3b3
ffffff28b2f81f6c boolean_t se_notified = 0 (0)
ffffff28b2f81f70 int se_waittime = 0x1
ffffff28b2f81f74 int se_timeout = 0x7530
ffffff28b2f81f78 int se_errno = 0
}
> ffffff28b2f81f50::mutex -f
ADDR TYPE HELD MINSPL OLDSPL WAITERS
ffffff28b2f81f50 adapt ffffff296df13c40 - - no
> ffffff296df13c40::findstack -v
stack pointer for thread ffffff296df13c40: ffffff00a39329c0
[ ffffff00a39329c0 _resume_from_idle+0xf1() ]
ffffff00a39329f0 swtch+0x145()
ffffff00a3932a20 cv_wait+0x61(ffffffffc0179f58, ffffffffc0179f50)
ffffff00a3932a40 smb_kdoor_close+0x41()
ffffff00a3932a70 smb_server_shutdown+0x2d(ffffff18623be000)
ffffff00a3932a90 smb_server_delete+0x91()
ffffff00a3932ac0 smb_drv_close+0x1d(d200000000, 1, 2, ffffff2711c2b030)
ffffff00a3932af0 dev_close+0x3d(d200000000, 1, 2, ffffff2711c2b030)
ffffff00a3932b30 device_close+0xa2(ffffff168ea06a80, 1, ffffff2711c2b030)
ffffff00a3932bc0 spec_close+0x163(ffffff168ea06a80, 1, 1, 0, ffffff2711c2b030, 0)
ffffff00a3932c40 fop_close+0x71(ffffff168ea06a80, 1, 1, 0, ffffff2711c2b030, 0)
ffffff00a3932c80 closef+0x5d(ffffff16cdaf9768)
ffffff00a3932cc0 closeall+0x66(ffffff169be406c0)
ffffff00a3932d50 proc_exit+0x41a(3, b)
ffffff00a3932d70 exit+0x15(3, b)
ffffff00a3932df0 psig+0x53c()
ffffff00a3932f00 trap+0x13ea(ffffff00a3932f10, 2c, 1)
ffffff00a3932f10 0xfffffffffb8001d6()
We seem to have other threads facing a similar scenario:
> ffffff28b2f81f38::whatthread | ::stacks
THREAD STATE SOBJ COUNT
ffffff00a1961c60 SLEEP CV 214
swtch+0x145
cv_timedwait_hires+0xe0
cv_reltimedwait+0x4f
smb_event_wait+0x88
smb_threshold_enter+0x69
smb_authenticate+0x39
smb_com_session_setup_andx+0x36
smb_dispatch_request+0x4a6
smb_session_worker+0x6c
taskq_d_thread+0xb1
thread_start+8
ffffff00a509bc60 PANIC <NONE> 1
0x246
All 214 of these are using the same CV and mutex for their arguments and are calling smb_event_wait against the same smb_event_t:
> ffffff28b2f81f38::whatthread | ::stacks -c cv_reltimedwait | ::stacks -C vpanic | ::findstack -v ! nawk '$2 ~ /cv_reltimedwait/ {print $2, $3, $4, $5}' | sort -u
cv_reltimedwait+0x4f(ffffff28b2f81f58, ffffff28b2f81f50, 64, 4)
> ffffff28b2f81f38::whatthread | ::stacks -c cv_reltimedwait | ::stacks -C vpanic | ::findstack -v ! nawk '$2 ~ /smb_event_wait/ {print $2}' | sort -u
smb_event_wait+0x88(ffffff28b2f81f38)
Our kernel CIFS server is indeed stopping, and this appears to be because the smbd process is exiting and dumping core according to its p_flags (usr/src/uts/common/sys/proc.h SEXITING at line 479, SDOCORE at line 503, also has SMSACCT at line 497 and SMSFORK at line 502):
> ::smbsrv
SERVER ZONE STATE
ffffff18623be000 0 STOPPING
> ffffff296df13c40::print kthread_t t_procp | ::ps
S PID PPID PGID SID UID FLAGS ADDR NAME
R 5952 1 5952 5952 0 0xc2000002 ffffff169be3fc88 smbd
We also know that the smb_event_wait() call is via smb_threshold_enter(), which is a request throttle in kernel CIFS (comments at usr/src/uts/common/fs/smbsrv/smb_kutil.c lines 2288-2299, code at 2301-2353). Thus the event on which these threads are waiting is identified by reference by ct_event member of an smb_cmd_threshold_t. Looking at that data structure from a waiter's stack trace, we see that the ct_event pointer is in fact NULL and that data structure generally appears to have been zero'ed out:
> ffffff18623be708::print -atx smb_cmd_threshold_t
ffffff18623be708 smb_cmd_threshold_t {
ffffff18623be708 char *ct_cmd = 0
ffffff18623be710 kmutex_t ct_mutex = {
ffffff18623be710 void *[1] _opaque = [ 0 ]
}
ffffff18623be718 volatile uint32_t ct_active_cnt = 0
ffffff18623be71c volatile uint32_t ct_blocked_cnt = 0
ffffff18623be720 volatile uint32_t ct_error_cnt = 0
ffffff18623be724 uint32_t ct_threshold = 0
ffffff18623be728 struct smb_event *ct_event = 0
ffffff18623be730 uint32_t ct_event_id = 0
}
This corresponds precisely to what happens when smb_threshold_fini() is called (usr/src/uts/common/fs/smbsrv/smb_kutil.c lines 2279-2285), and it would not appear to accord with the preceding comments (lines 2274-2277)--while we may need to destroy the events, doing so without first shutting down the threads referencing it in the session worker pool taskq is an autofrag. The comments tell us what we'd find by examining callers together with the stack trace for the server shutdown, which is that we are calling smb_server_shutdown() from smb_server_delete(), so we've already completed the three calls to smb_threshold_fini().