Bug #13851

panic in smb_llist_enter() with bad rwlock

Added by Gordon Ross 6 months ago. Updated 6 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
cifs - CIFS server and client
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

System panic while SMB handling normal work

> $C
ffffd002e3499150 vpanic()
ffffd002e3499170 rw_panic+0x54(fffffffffb95e5df, ffffd0a6a8c77cd0)
ffffd002e34991e0 rw_enter_sleep+0x366(ffffd0a6a8c77cd0, 1)
ffffd002e3499200 smb_llist_enter+0x15(ffffd0a6a8c77cd0, 1)
ffffd002e3499270 smb_node_notify_change+0x46(ffffd0a6a8c77c88, 1, ffffd09df3b335b8)
ffffd002e3499300 smb_fem_fcn_create+0x7c(ffffd002e3499430, ffffd09df3b335b8, ffffd002e34997a0, 0, 80, ffffd002e3499838, ffffd099e88f0878,
ffffd09400000000, 0, 0)
ffffd002e3499360 vnext_create+0x97(ffffd002e3499430, ffffd09df3b335b8, ffffd002e34997a0, 0, 80, ffffd002e3499838, ffffd099e88f0878, ffffd00200000000, 0,
0)
ffffd002e34993f0 smb_fem_fcn_create+0x5c(ffffd002e3499430, ffffd09df3b335b8, ffffd002e34997a0, 0, 80, ffffd002e3499838, ffffd099e88f0878,
ffffd09a00000000, 0, 0)
ffffd002e34994a0 vhead_create+0xfe(ffffd09a12d87a40, ffffd09df3b335b8, ffffd002e34997a0, 0, 80, ffffd002e3499838, ffffd099e88f0878, ffffffff00000000, 0,
0)
ffffd002e3499540 fop_create+0xc7(ffffd09a12d87a40, ffffd09df3b335b8, ffffd002e34997a0, 0, 80, ffffd002e3499838, ffffd099e88f0878, ffffd09a00000000, 0, 0
)
ffffd002e34998a0 rfs3_create+0x40b(ffffd002e3499ac0, ffffd002e3499980, ffffd09a25a59300, ffffd002e3499ca0, ffffd099e88f0878, 0)
ffffd002e3499c20 common_dispatch+0x600(ffffd002e3499ca0, ffffd09465853a00, 2, 4, fffffffff87e1aac, ffffffffc0313060)
ffffd002e3499c40 rfs_dispatch+0x2d(ffffd002e3499ca0, ffffd09465853a00)
ffffd002e3499d20 svc_getreq+0x1c1(ffffd09465853a00, ffffd096af2d6d40)
ffffd002e3499d90 svc_run+0x198(ffffd094412c43f8)
ffffd002e3499dd0 svc_do_run+0x8e(1)
ffffd002e3499ec0 nfssys+0x111(e, fdfc0fbc)
ffffd002e3499f10 _sys_sysenter_post_swapgs+0x149()

> smb_llist_enter::nm -f ctype
C Type
void (*)(smb_llist_t *, krw_t)

> ffffd0a6a8c77cd0::print smb_llist_t
{
    ll_lock = {
        _opaque = [ 0xc ]
    }
    ll_list = {
        list_size = 0x3a0
        list_offset = 0x10
        list_head = {
            list_next = 0
            list_prev = 0
        }
    }
    ll_count = 0
    ll_wrop = 0x606
    ll_mutex = {
        _opaque = [ 0xffffd002de017c46 ]
    }
    ll_deleteq = {
        list_size = 0x28
        list_offset = 0
        list_head = {
            list_next = 0
            list_prev = 0
        }
    }
    ll_deleteq_count = 0
    ll_flushing = 0 (0)
}
> ffffd0a6a8c77cd0::print -t smb_llist_t
smb_llist_t {
    krwlock_t ll_lock = {
        void *[1] _opaque = [ 0xc ]
    }
    list_t ll_list = {
        size_t list_size = 0x3a0
        size_t list_offset = 0x10
        struct list_node list_head = {
            struct list_node *list_next = 0
            struct list_node *list_prev = 0
        }
    }
    uint32_t ll_count = 0
    uint64_t ll_wrop = 0x606
    kmutex_t ll_mutex = {
        void *[1] _opaque = [ 0xffffd002de017c46 ]
    }
    list_t ll_deleteq = {
        size_t list_size = 0x28
        size_t list_offset = 0
        struct list_node list_head = {
            struct list_node *list_next = 0
            struct list_node *list_prev = 0
        }
    }
    uint32_t ll_deleteq_count = 0
    boolean_t ll_flushing = 0 (0)
}

It looks like the rw lock has been destroyed.
Actions #1

Updated by Gordon Ross 6 months ago

  • Status changed from New to In Progress

Further analysis of the crash

$ mdb -k 0
mdb: warning: dump is from SunOS 5.11 NexentaStor_5.2.2:553a1d42ea; dcmds and macros may not match kernel implementation
Loading modules: [ unix genunix dtrace specfs mac cpu.generic uppc apix scsi_vhci zfs mr_sas sd ip hook neti sockfs arp usba xhci stmf sata mm mpt lofs mpt_sas random pmcs i40e idm cpc crypto fcip fctl fcp kvm stmf_sbd ufs logindmux nsmb ptm smbsrv nfs ipc emlxs ]
> ::status
debugging crash dump vmcore.0 (64-bit) from filestore2-1
operating system: 5.11 NexentaStor_5.2.2:553a1d42ea (i86pc)
image uuid: 887130fa-3ffa-e0f9-bac1-f4761c3fc6b4
panic message: 
rw_enter: bad rwlock, lp=ffffd0a6a8c77cd0 wwwh=c thread=ffffd0943a749400
dump content: kernel pages only
> 
> $C
ffffd002e3499150 vpanic()
ffffd002e3499170 rw_panic+0x54(fffffffffb95e5df, ffffd0a6a8c77cd0)
ffffd002e34991e0 rw_enter_sleep+0x366(ffffd0a6a8c77cd0, 1)
ffffd002e3499200 smb_llist_enter+0x15(ffffd0a6a8c77cd0, 1)
ffffd002e3499270 smb_node_notify_change+0x46(ffffd0a6a8c77c88, 1, ffffd09df3b335b8)
ffffd002e3499300 smb_fem_fcn_create+0x7c(ffffd002e3499430, ffffd09df3b335b8, 
ffffd002e34997a0, 0, 80, ffffd002e3499838, ffffd099e88f0878, ffffd09400000000, 0
, 0)
ffffd002e3499360 vnext_create+0x97(ffffd002e3499430, ffffd09df3b335b8, 
ffffd002e34997a0, 0, 80, ffffd002e3499838, ffffd099e88f0878, ffffd00200000000, 0
, 0)
ffffd002e34993f0 smb_fem_fcn_create+0x5c(ffffd002e3499430, ffffd09df3b335b8, 
ffffd002e34997a0, 0, 80, ffffd002e3499838, ffffd099e88f0878, ffffd09a00000000, 0
, 0)
ffffd002e34994a0 vhead_create+0xfe(ffffd09a12d87a40, ffffd09df3b335b8, 
ffffd002e34997a0, 0, 80, ffffd002e3499838, ffffd099e88f0878, ffffffff00000000, 0
, 0)
ffffd002e3499540 fop_create+0xc7(ffffd09a12d87a40, ffffd09df3b335b8, 
ffffd002e34997a0, 0, 80, ffffd002e3499838, ffffd099e88f0878, ffffd09a00000000, 0
, 0)
ffffd002e34998a0 rfs3_create+0x40b(ffffd002e3499ac0, ffffd002e3499980, 
ffffd09a25a59300, ffffd002e3499ca0, ffffd099e88f0878, 0)
ffffd002e3499c20 common_dispatch+0x600(ffffd002e3499ca0, ffffd09465853a00, 2, 4
, fffffffff87e1aac, ffffffffc0313060)
ffffd002e3499c40 rfs_dispatch+0x2d(ffffd002e3499ca0, ffffd09465853a00)
ffffd002e3499d20 svc_getreq+0x1c1(ffffd09465853a00, ffffd096af2d6d40)
ffffd002e3499d90 svc_run+0x198(ffffd094412c43f8)
ffffd002e3499dd0 svc_do_run+0x8e(1)
ffffd002e3499ec0 nfssys+0x111(e, fdfc0fbc)
ffffd002e3499f10 _sys_sysenter_post_swapgs+0x149()
> 
> ::stacks -m smbsrv
THREAD           STATE    SOBJ                COUNT
ffffd002e0fa4c40 SLEEP    CV                     39
                 swtch+0x141
                 cv_wait+0x70
                 cv_wait_sig+0x82
                 so_dequeue_msg+0x2f7
                 so_recvmsg+0x249
                 socket_recvmsg+0x33
                 ksocket_recv+0x15a
                 smb_sorecv+0x39
                 smb_session_xprt_gethdr+0x33
                 smb_session_reader+0x36
                 smb_session_receiver+0x82
                 smb_server_receiver+0x18
                 taskq_d_thread+0xb7
                 thread_start+8

ffffd002df6d8c40 SLEEP    CV                      1
                 swtch+0x141
                 cv_timedwait_hires+0xec
                 cv_reltimedwait+0x51
                 smb_thread_continue_timedwait_locked+0x3c
                 smb_thread_continue_timedwait+0x40
                 smb_server_timers+0x35
                 smb_thread_entry_point+0x53
                 thread_start+8

ffffd002de555c40 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait+0x70
                 cv_wait_sig_swap_core+0x8a
                 cv_wait_sig_swap+0x17
                 so_acceptq_dequeue_locked+0x6e
                 so_acceptq_dequeue+0x41
                 so_accept+0xed
                 socket_accept+0x1f
                 ksocket_accept+0xb8
                 smb_server_listener+0x58
                 smb_thread_entry_point+0x53
                 thread_start+8

ffffd002df6c6c40 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait+0x70
                 smb_thread_continue_timedwait_locked+0x5d
                 smb_thread_continue+0x2d
                 smb_kshare_unexport_thread+0x40
                 smb_thread_entry_point+0x53
                 thread_start+8

ffffd0943a749400 PANIC    <NONE>                  1
                 param_preset
                 rw_panic+0x54
                 rw_enter_sleep+0x366
                 smb_llist_enter+0x15
                 smb_node_notify_change+0x46
                 smb_fem_fcn_create+0x7c
                 vnext_create+0x97
                 smb_fem_fcn_create+0x5c
                 vhead_create+0xfe
                 fop_create+0xc7
                 rfs3_create+0x40b
                 common_dispatch+0x600
                 rfs_dispatch+0x2d
                 svc_getreq+0x1c1
                 svc_run+0x198
                 svc_do_run+0x8e
                 nfssys+0x111
                 _sys_sysenter_post_swapgs+0x149

> 
> ::smblist ! cat > smblist.0.txt

Looking through smblist.0.txt, there are lots of clients and open files,
but very little going on in the way of SMB requests.

Let's look at the smb_node passed in the panic stack above, to
smb_node_notify_change(ffffd0a6a8c77c88, 1, ...)

> ffffd0a6a8c77c88 ::smbnode
ADDR             VP               NODE-NAME          OFILES LOCKS  WLOCKS   
OPLOCK   REF   
ffffd0a6a8c77c88 ffffd09a00c32980 7                  0      0      0        
0        0      
> ffffd0a6a8c77c88 ::print smb_node_t
{
    n_lnd = {
        list_next = 0
        list_prev = 0
    }
    n_magic = 0
    n_lock = {
        _opaque = [ 0xc ]
    }
    n_mutex = {
        _opaque = [ 0xffffd002de017c46 ]
    }
    n_state = 0x1 (SMB_NODE_STATE_DESTROYING)
    n_refcnt = 0
    n_hashkey = 0xa4cac487
    n_hash_bucket = smb_node_hash_table+0x36d8
    n_open_count = 0
    n_opening_count = 0
    n_ofile_list = {
        ll_lock = {
            _opaque = [ 0xc ]
        }
        ll_list = {
            list_size = 0x3a0         
            list_offset = 0x10
            list_head = {
                list_next = 0
                list_prev = 0
            }
        }
        ll_count = 0
        ll_wrop = 0x606
        ll_mutex = {
            _opaque = [ 0xffffd002de017c46 ]
        }
        ll_deleteq = {
            list_size = 0x28
            list_offset = 0
            list_head = {
                list_next = 0
                list_prev = 0
            }
        }
        ll_deleteq_count = 0
        ll_flushing = 0 (0)
    }
    n_lock_list = {                   
        ll_lock = {
            _opaque = [ 0xc ]
        }
        ll_list = {
            list_size = 0x60
            list_offset = 0
            list_head = {
                list_next = 0
                list_prev = 0
            }
        }
        ll_count = 0
        ll_wrop = 0
        ll_mutex = {
            _opaque = [ 0xffffd002de017c46 ]
        }
        ll_deleteq = {
            list_size = 0x28
            list_offset = 0
            list_head = {
                list_next = 0
                list_prev = 0
            }                         
        }
        ll_deleteq_count = 0
        ll_flushing = 0 (0)
    }
    n_wlock_list = {
        ll_lock = {
            _opaque = [ 0xc ]
        }
        ll_list = {
            list_size = 0x60
            list_offset = 0
            list_head = {
                list_next = 0
                list_prev = 0
            }
        }
        ll_count = 0
        ll_wrop = 0
        ll_mutex = {
            _opaque = [ 0xffffd002de017c46 ]
        }
        ll_deleteq = {
            list_size = 0x28          
            list_offset = 0
            list_head = {
                list_next = 0
                list_prev = 0
            }
        }
        ll_deleteq_count = 0
        ll_flushing = 0 (0)
    }
    flags = 0
    n_allocsz = 0
    n_fcn_count = 0x2
    n_oplock = {
        ol_mutex = {
            _opaque = [ 0xffffd002de017c46 ]
        }
        ol_fem = 0 (0)
        excl_open = 0
        ol_state = 0x10000000
        cnt_II = 0
        cnt_R = 0
        cnt_RH = 0
        cnt_RHBQ = 0                  
        waiters = 0
        WaitingOpenCV = {
            _opaque = 0
        }
    }
    n_dnode = 0xffffd0a5012f9020
    n_unode = 0
    delete_on_close_cred = 0
    n_delete_on_close_flags = 0
    od_name = [ "7" ]
    vp = 0xffffd09a00c32980
    n_audit_buf = 0
}
> 
The rw_enter panic is expected to be on n_ofile_list.ll_lock:

> ffffd0a6a8c77c88 ::print -a -s 2 smb_node_t n_ofile_list
ffffd0a6a8c77cd0 n_ofile_list = {
    ffffd0a6a8c77cd0 n_ofile_list.ll_lock = {
        ffffd0a6a8c77cd0 _opaque = [ ... ]
    }
    ffffd0a6a8c77cd8 n_ofile_list.ll_list = {
        ffffd0a6a8c77cd8 list_size = 0x3a0
        ffffd0a6a8c77ce0 list_offset = 0x10
        ffffd0a6a8c77ce8 list_head = { ... }
    }
    ffffd0a6a8c77cf8 n_ofile_list.ll_count = 0
    ffffd0a6a8c77d00 n_ofile_list.ll_wrop = 0x606
    ffffd0a6a8c77d08 n_ofile_list.ll_mutex = {
        ffffd0a6a8c77d08 _opaque = [ ... ]
    }
    ffffd0a6a8c77d10 n_ofile_list.ll_deleteq = {
        ffffd0a6a8c77d10 list_size = 0x28
        ffffd0a6a8c77d18 list_offset = 0
        ffffd0a6a8c77d20 list_head = { ... }
    }
    ffffd0a6a8c77d30 n_ofile_list.ll_deleteq_count = 0
    ffffd0a6a8c77d34 n_ofile_list.ll_flushing = 0 (0)
}

> ffffd0a6a8c77cd0 ::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffd0a6a8c77cd0                1  B100
                                |   |  
                    DESTROYED --+---+

So we have an smb_node that's in the process of destruction
(n_state = 0x1 (SMB_NODE_STATE_DESTROYING))
but appears to still have FEM hooks pointing to it
(n_fcn_count = 0x2).

There are no ofiles left in n_ofile_list,
so somewhere we missed some smb_node_fcn_unsubscribe() calls.

Actions #2

Updated by Gordon Ross 6 months ago

After some discussions among the team...

It appears that what's likely to have happened is that the server "subscribed" an open file (ofile) in an SMB notify request while simultaneously handling either an SMB close or (more likely) a client disconnect on the session that owns that ofile. There is a failure to defend against a particular sequence of tear-down during ofile close vs. creation of data structures in notify (where an ofile is "subscribed"). The defense against this problem is fairly simple: in smb_notify, verify that the ofile is still open before setting up the subscription for notification events.

Actions #3

Updated by Electric Monk 6 months ago

  • Gerrit CR set to 1531
Actions #4

Updated by Gordon Ross 6 months ago

Testing:

This is difficult to reproduce. We used temporary delays to force the ordering of events, and then showed that the problem no longer happens (even with the temporary delays in place).
The fix has also been in the field for several months.

Actions #5

Updated by Electric Monk 6 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 4b69ad0900e9388c55c1b0e0b522dae0a595b4fa

commit  4b69ad0900e9388c55c1b0e0b522dae0a595b4fa
Author: Gordon Ross <gordon.ross@tintri.com>
Date:   2021-06-07T22:03:09.000Z

    13851 panic in smb_llist_enter() with bad rwlock
    Reviewed by: Matt Barden <mbarden@tintri.com>
    Reviewed by: Evan Layton <elayton@tintri.com>
    Reviewed by: Alexander Stetsenko <alex.stetsenko@gmail.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>
