Bug #5899

panic on bad mutex in smb_event_wait()

Added by Gordon Ross over 6 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Start date: 2015-05-02
Due date: -
% Done: 100%
Estimated time: -
Difficulty: Medium
Tags: needs-triage
Gerrit CR: -

Description

Here's the panic info and back-trace:

> ::panicinfo 
cpu 21
thread ffffff00a509bc60
message mutex_enter: bad mutex, lp=ffffff28b2f81f50 owner=ffffff296df13c40 thread=ffffff00a509bc60
[...]
> ::stack
vpanic()
mutex_panic+0x73(fffffffffb94b618, ffffff28b2f81f50)
mutex_vector_enter+0x446(ffffff28b2f81f50)
cv_timedwait_hires+0xfd(ffffff28b2f81f58, ffffff28b2f81f50, 3b9aca00, 989680, 0)
cv_reltimedwait+0x4f(ffffff28b2f81f58, ffffff28b2f81f50, 64, 4)
smb_event_wait+0x88(ffffff28b2f81f38)
smb_threshold_enter+0x69(ffffff18623be708)
smb_authenticate+0x39(ffffff295fd72698, ffffff1832ac7808, ffffff00a509ba88)
smb_com_session_setup_andx+0x36(ffffff295fd72698)
smb_dispatch_request+0x4a6(ffffff295fd72698)
smb_session_worker+0x6c(ffffff295fd72698)
taskq_d_thread+0xb1(ffffff20c50479f0)
thread_start+8()

#1 - Updated by Gordon Ross over 6 years ago

[ Analysis from Bayard Bell ]

The mutex does seem to be bad, but we can nevertheless link this to a thread, which turns out to be in smb_server_shutdown():

> ffffff28b2f81f50::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
mdb: 0xffffff28b2f81f50: invalid adaptive mutex (-f to dump anyway)
> ffffff28b2f81f38::print -atx smb_event_t
ffffff28b2f81f38 smb_event_t {
    ffffff28b2f81f38 uint32_t se_magic = 0xbaa9b1ab
    ffffff28b2f81f40 list_node_t se_lnd = {
        ffffff28b2f81f40 struct list_node *list_next = 0
        ffffff28b2f81f48 struct list_node *list_prev = 0
    }
    ffffff28b2f81f50 kmutex_t se_mutex = {
        ffffff28b2f81f50 void *[1] _opaque = [ 0xffffff296df13c46 ]
    }
    ffffff28b2f81f58 kcondvar_t se_cv = {
        ffffff28b2f81f58 ushort_t _opaque = 0xd6
    }
    ffffff28b2f81f60 smb_server_t *se_server = 0xffffff18623be000
    ffffff28b2f81f68 uint32_t se_txid = 0xf8a3b3
    ffffff28b2f81f6c boolean_t se_notified = 0 (0)
    ffffff28b2f81f70 int se_waittime = 0x1
    ffffff28b2f81f74 int se_timeout = 0x7530
    ffffff28b2f81f78 int se_errno = 0
}
> ffffff28b2f81f50::mutex -f
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
ffffff28b2f81f50 adapt ffffff296df13c40      -      -      no
> ffffff296df13c40::findstack -v
stack pointer for thread ffffff296df13c40: ffffff00a39329c0
[ ffffff00a39329c0 _resume_from_idle+0xf1() ]
  ffffff00a39329f0 swtch+0x145()
  ffffff00a3932a20 cv_wait+0x61(ffffffffc0179f58, ffffffffc0179f50)
  ffffff00a3932a40 smb_kdoor_close+0x41()
  ffffff00a3932a70 smb_server_shutdown+0x2d(ffffff18623be000)
  ffffff00a3932a90 smb_server_delete+0x91()
  ffffff00a3932ac0 smb_drv_close+0x1d(d200000000, 1, 2, ffffff2711c2b030)
  ffffff00a3932af0 dev_close+0x3d(d200000000, 1, 2, ffffff2711c2b030)
  ffffff00a3932b30 device_close+0xa2(ffffff168ea06a80, 1, ffffff2711c2b030)
  ffffff00a3932bc0 spec_close+0x163(ffffff168ea06a80, 1, 1, 0, ffffff2711c2b030, 0)
  ffffff00a3932c40 fop_close+0x71(ffffff168ea06a80, 1, 1, 0, ffffff2711c2b030, 0)
  ffffff00a3932c80 closef+0x5d(ffffff16cdaf9768)
  ffffff00a3932cc0 closeall+0x66(ffffff169be406c0)
  ffffff00a3932d50 proc_exit+0x41a(3, b)
  ffffff00a3932d70 exit+0x15(3, b)
  ffffff00a3932df0 psig+0x53c()
  ffffff00a3932f00 trap+0x13ea(ffffff00a3932f10, 2c, 1)
  ffffff00a3932f10 0xfffffffffb8001d6()

We seem to have other threads facing a similar scenario:
> ffffff28b2f81f38::whatthread | ::stacks
THREAD           STATE    SOBJ                COUNT
ffffff00a1961c60 SLEEP    CV                    214
                 swtch+0x145
                 cv_timedwait_hires+0xe0
                 cv_reltimedwait+0x4f
                 smb_event_wait+0x88
                 smb_threshold_enter+0x69
                 smb_authenticate+0x39
                 smb_com_session_setup_andx+0x36
                 smb_dispatch_request+0x4a6
                 smb_session_worker+0x6c
                 taskq_d_thread+0xb1
                 thread_start+8

ffffff00a509bc60 PANIC    <NONE>                  1
                 0x246

All 214 of these threads pass the same CV and mutex as arguments and call smb_event_wait() against the same smb_event_t:

> ffffff28b2f81f38::whatthread | ::stacks -c cv_reltimedwait | ::stacks -C vpanic | ::findstack -v ! nawk '$2 ~ /cv_reltimedwait/ {print $2, $3, $4, $5}' | sort -u
cv_reltimedwait+0x4f(ffffff28b2f81f58, ffffff28b2f81f50, 64, 4)
> ffffff28b2f81f38::whatthread | ::stacks -c cv_reltimedwait | ::stacks -C vpanic | ::findstack -v ! nawk '$2 ~ /smb_event_wait/ {print $2}' | sort -u
smb_event_wait+0x88(ffffff28b2f81f38)

Our kernel CIFS server is indeed stopping, and this appears to be because the smbd process is exiting and dumping core, according to its p_flags (see usr/src/uts/common/sys/proc.h: SEXITING at line 479, SDOCORE at line 503; the flags word also includes SMSACCT at line 497 and SMSFORK at line 502):
> ::smbsrv
SERVER           ZONE STATE
ffffff18623be000 0    STOPPING
> ffffff296df13c40::print kthread_t t_procp | ::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R   5952      1   5952   5952      0 0xc2000002 ffffff169be3fc88 smbd

We also know that the smb_event_wait() call comes via smb_threshold_enter(), which is a request throttle in kernel CIFS (comments at usr/src/uts/common/fs/smbsrv/smb_kutil.c lines 2288-2299, code at lines 2301-2353). The event on which these threads are waiting is therefore the one referenced by the ct_event member of an smb_cmd_threshold_t. Looking at that data structure from a waiter's stack trace, we see that the ct_event pointer is in fact NULL and that the structure as a whole appears to have been zeroed out:
> ffffff18623be708::print -atx smb_cmd_threshold_t
ffffff18623be708 smb_cmd_threshold_t {
    ffffff18623be708 char *ct_cmd = 0
    ffffff18623be710 kmutex_t ct_mutex = {
        ffffff18623be710 void *[1] _opaque = [ 0 ]
    }
    ffffff18623be718 volatile uint32_t ct_active_cnt = 0
    ffffff18623be71c volatile uint32_t ct_blocked_cnt = 0
    ffffff18623be720 volatile uint32_t ct_error_cnt = 0
    ffffff18623be724 uint32_t ct_threshold = 0
    ffffff18623be728 struct smb_event *ct_event = 0
    ffffff18623be730 uint32_t ct_event_id = 0
}

This corresponds precisely to what happens when smb_threshold_fini() is called (usr/src/uts/common/fs/smbsrv/smb_kutil.c lines 2279-2285), and it does not accord with the preceding comments (lines 2274-2277): while we may need to destroy the events, doing so without first shutting down the session worker pool taskq threads that still reference them is a self-inflicted wound. The comments tell us what we'd find by examining callers together with the shutdown thread's stack trace: smb_server_shutdown() is being called from smb_server_delete(), so the three calls to smb_threshold_fini() have already completed.

#2 - Updated by Gordon Ross over 6 years ago

This is a fix we did back in Nov 2013.

There was an ordering problem in the tear-down of the "events" support and these "threshold" things. I never liked the smb_event stuff: it's a global event list that largely duplicates the old-style Unix sleep/wakeup (on an address) mechanism. It was not needed here, so the use of events was removed and the wait was implemented with a simple cv_timedwait() instead.

#3 - Updated by Electric Monk about 6 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

commit  856399cf160942495309c59ac7a9541834573cd3
Author: Gordon Ross <gwr@nexenta.com>
Date:   2015-05-08T14:37:58.000Z

    5899 panic on bad mutex in smb_event_wait()
    Reviewed by: Thomas Keiser <thomas.keiser@nexenta.com>
    Reviewed by: Dan McDonald <danmcd@nexenta.com>
    Reviewed by: Albert Lee <trisk@nexenta.com>
    Reviewed by: Richard Lowe <richlowe@richlowe.net>
    Approved by: Dan McDonald <danmcd@omniti.com>
