Project

General

Profile

Bug #11830

smb/server service shutdown hang

Added by Gordon Ross 5 months ago. Updated 4 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:

Description

Description:
After smbd crashed with a SIGSEGV, the SMB service could not restart.
The "svcs" command shows the SMB service as "online*" meaning it's stuck restarting.

Steps to Reproduce:
Unclear. Happens sometimes if smbd is killed or dies i.e. via SIGSEGV

Expected Results:
Should be able to kill and restart the SMB server service,
even if smbd terminates with some sort of failure.

Actual Results:
The SMB service got stuck with smb waiting in the "smbsrv" driver close function.

History

#1

Updated by Gordon Ross 5 months ago

I was lucky enough to run into this while running a debug build.
One of the tests at the recent MS plug-fest testing event caused smbd to crash, after which the service got stuck in shutdown as described above. I was able to get a crash dump for analysis.

> ::stacks -m smbsrv
THREAD           STATE    SOBJ                COUNT
ffffff001068ac40 SLEEP    CV                      1
                 swtch+0x18a
                 cv_timedwait_hires+0xf8
                 cv_reltimedwait+0x71
                 smb_rwx_cvwait+0xa6
                 smb_session_logoff+0x1a5
                 smb_session_cancel+0x76
                 smb_session_receiver+0xb7
                 smb_server_receiver+0x18
                 taskq_d_thread+0xf3
                 thread_start+8

ffffff1ae1da2080 SLEEP    CV                      1
                 swtch+0x18a
                 cv_wait+0x89
                 smb_avl_destroy+0x63
                 smb_export_stop+0x4e
                 smb_server_shutdown+0x17a
                 smb_server_delete+0x1cf
                 smb_drv_close+0x35
                 dev_close+0x31
                 device_close+0xd8
                 spec_close+0x1d9     
                 fop_close+0x61
                 closef+0x97
                 closeall+0x82
                 proc_exit+0x450
                 exit+0x15
                 psig+0x358
                 trap+0x140d

ffffff0011174c40 SLEEP    CV                      1
                 swtch+0x18a
                 cv_wait+0x89
                 smb_thread_continue_timedwait_locked+0x5d
                 smb_thread_continue+0x4d
                 smb_kshare_unexport_thread+0x40
                 smb_thread_entry_point+0x7a
                 thread_start+8

> ::smblist
SERVER           ZONE STATE                            
ffffff05d114e580 0    STOPPING                         
  SESSION          IP_ADDR                PORT     DIALECT  STATE       
  ffffff19e03f27d0 10.10.2.18             43686    0x302    DISCONNECTED
    USER             UID   SSNID            ACCOUNT                         
    ffffff04289c7af8 1     fbe8b7450f       NSTOR\test                      
    TREE             TID   SHARE NAME       RESOURCE                        
    ffffff04843aa010 1     test             /export/test                    
      OFILE            FID   NODE             CRED             LEASE            
      ffffff05db73c3e8 2     ffffff0506dd1678 ffffff0506d9d9b0 0

> ffffff04289c7af8 ::smbuser -v
SMB user information (ffffff04289c7af8):
UID: 1
SSNID: fbe8b7450f
State: 2 (LOGGING_OFF)
Flags: 0x00000000 <>
Privileges: 0x00800000 <CHANGE_NOTIFY>
Credential: ffffff0506d9d9b0
Reference Count: 2
User Account: NSTOR\test

>  ffffff04843aa010 ::smbtree -v
SMB tree information (ffffff04843aa010):

TID: 0001
State: 1 (DISCONNECTING)
Share: test
Resource: /export/test
Type: ZFS
Volume: export
Umask: 0000
Flags: 000d2fce <ACLS,STREAMS,CI,OPLOCKS,SHORTNAMES,XVATTR,DIRENTFLAGS,ACL_CR
,ACEMASK,UNICODE,QUOTA,SPARSE,XMOUNTS>
SMB Node: ffffff04289c5958
Reference Count: 1

> ffffff05db73c3e8 ::smbofile -v
SMB ofile information (ffffff05db73c3e8):

FID: 2
State: 3 (SAVING)
DH Type: 1 (DURABLE_V1)
Lease: 0
SMB Node: ffffff0506dd1678
LLF Offset: 0x0 (Invalid)
Flags: 0x00000000 <>
Granted Acc.: 0x001f01ff <READ_DATA,WRITE_DATA,APPEND_DATA,READ_EA,WRITE_EA
,EXECUTE,DELETE_CHILD,READ_ATTR,WRITE_ATTR,DELETE,READ_CTRL,WRITE_DAC
,WRITE_OWNER,SYNCH>
Share Mode: 0x00000007 <READ,WRITE,DELETE>
User: ffffff04289c7af8
Tree: ffffff04843aa010
Credential: ffffff0506d9d9b0

> 0xffffff04843aa010 ::print  smb_tree_t t_ofile_list
t_ofile_list = {
    t_ofile_list.ll_lock = {
        _opaque = [ 0 ]
    }
    t_ofile_list.ll_list = {
        list_size = 0x3a0
        list_offset = 0
        list_head = {
            list_next = 0xffffff05db73c3e8
            list_prev = 0xffffff05db73c3e8
        }
    }
    t_ofile_list.ll_count = 0x1
    t_ofile_list.ll_wrop = 0x1
    t_ofile_list.ll_mutex = {
        _opaque = [ 0 ]
    }
    t_ofile_list.ll_deleteq = {
        list_size = 0x28
        list_offset = 0
        list_head = {
            list_next = 0xffffff0448046600
            list_prev = 0xffffff0448046600
        }                             
    }
    t_ofile_list.ll_deleteq_count = 0x1
    t_ofile_list.ll_flushing = 0 (0)
}
> 0xffffff0448046600 ::whatis
ffffff0448046600 is allocated from smb_dtor_cache:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
ffffff0480dafc18 ffffff0448046600     2103f1229bc0 ffffff0011b69c40
                 ffffff042aa2a348 ffffff03dad13b80 ffffff03de09f8d0
                 kmem_cache_alloc_debug+0x2e0
                 kmem_cache_alloc+0xdd
                 smb_llist_post+0x45
                 smb_ofile_release+0xf4
                 smb_request_free+0x72
                 smb_oplock_async_break+0xc9
                 taskq_d_thread+0xf3
                 thread_start+8

> 0xffffff0448046600 ::print smb_dtor_t
{
    dt_lnd = {
        list_next = 0xffffff04843aa0b0
        list_prev = 0xffffff04843aa0b0
    }
    dt_magic = 0x44544f52
    dt_object = 0xffffff05db73c3e8
    dt_proc = smb_ofile_save_dh
}

It looks like it was handling an oplock break during shutdown.

The stack from the allocation of the ofile destructor shows us that
the last ref to ofile was dropped by smb_oplock_async_break
smb_ofile_release did state: SAVE_DH >> SAVING

The "problem" sequence must be:

smb_oplock_ind_break gets hold via smb_ofile_hold_olbrk()
and schedules a taskq job to do the oplock break.
before that job has a chance to run...

smb_session_logoff / smb_user_logoff /
smb_session_disconnect_owned_trees / ...
smb_tree_disconnect

then the taskq job from smb_oplock_ind_break runs and
during request free we drop the last ref on the ofile

Nothing after that flushes the delete queue on the tree or user,
so those never actually get deleted, and shutdown waits forever.

#2

Updated by Gordon Ross 4 months ago

A relatively simple way to fix this is:
When building the special request to do an oplock break, if we find that the ofile has tree and session pointers, take references on the tree and session and store them in the request (tid_tree, uid_user). That way, when the taskq job processing the oplock break calls smb_request_free, the tree and session will get "release" calls (which run the delete queues etc) regardless of which code path ends up being the last to do call release on the tree and session.

I did not come up with a reliable way to reproduce this problem, but the fix has had some "mileage".

This code path is well covered by the smbtorture oplock tests. Re-ran those; test results unchanged.

#3

Updated by Electric Monk 4 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 5bcbb01ca45de96cf8ebfb0dfd19cc3016d217c7

commit  5bcbb01ca45de96cf8ebfb0dfd19cc3016d217c7
Author: Gordon Ross <gwr@nexenta.com>
Date:   2019-10-19T13:01:05.000Z

    11830 smb/server service shutdown hang
    Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
    Reviewed by: Evan Layton <evan.layton@nexenta.com>
    Reviewed by: Matt Barden <matt.barden@nexenta.com>
    Reviewed by: Andy Stormont <AStormont@racktopsystems.com>
    Approved by: Robert Mustacchi <rm@fingolfin.org>

Also available in: Atom PDF