Bug #11830
smb/server service shutdown hang
Status: Closed
Added by Gordon Ross over 2 years ago. Updated over 2 years ago.
% Done: 100%
Description
After smbd crashed with a SIGSEGV, the SMB service could not restart.
The "svcs" command showed the SMB service as "online*", meaning it was stuck restarting.
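For reference, a stuck instance looks something like this in svcs output (illustrative transcript; the STIME value here is made up):

    $ svcs smb/server
    STATE          STIME    FMRI
    online*        17:20:15 svc:/network/smb/server:default

The trailing asterisk marks an SMF instance in transition; here the transition never completes because the stop path hangs in the driver close.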
Steps to Reproduce:
Unclear. Happens sometimes if smbd is killed or dies, e.g. via SIGSEGV.
Expected Results:
Should be able to kill and restart the SMB server service,
even if smbd terminates with some sort of failure.
Actual Results:
The SMB service got stuck, with smbd waiting in the "smbsrv" driver close function.
Updated by Gordon Ross over 2 years ago
I was lucky enough to run into this while running a debug build.
One of the tests at the recent MS plug-fest testing event caused smbd to crash, after which the service got stuck in shutdown as described above. I was able to get a crash dump for analysis.
> ::stacks -m smbsrv
THREAD           STATE    SOBJ                COUNT
ffffff001068ac40 SLEEP    CV                      1
                 swtch+0x18a
                 cv_timedwait_hires+0xf8
                 cv_reltimedwait+0x71
                 smb_rwx_cvwait+0xa6
                 smb_session_logoff+0x1a5
                 smb_session_cancel+0x76
                 smb_session_receiver+0xb7
                 smb_server_receiver+0x18
                 taskq_d_thread+0xf3
                 thread_start+8

ffffff1ae1da2080 SLEEP    CV                      1
                 swtch+0x18a
                 cv_wait+0x89
                 smb_avl_destroy+0x63
                 smb_export_stop+0x4e
                 smb_server_shutdown+0x17a
                 smb_server_delete+0x1cf
                 smb_drv_close+0x35
                 dev_close+0x31
                 device_close+0xd8
                 spec_close+0x1d9
                 fop_close+0x61
                 closef+0x97
                 closeall+0x82
                 proc_exit+0x450
                 exit+0x15
                 psig+0x358
                 trap+0x140d

ffffff0011174c40 SLEEP    CV                      1
                 swtch+0x18a
                 cv_wait+0x89
                 smb_thread_continue_timedwait_locked+0x5d
                 smb_thread_continue+0x4d
                 smb_kshare_unexport_thread+0x40
                 smb_thread_entry_point+0x7a
                 thread_start+8

> ::smblist
SERVER           ZONE STATE
ffffff05d114e580    0 STOPPING
    SESSION          IP_ADDR          PORT DIALECT STATE
    ffffff19e03f27d0 10.10.2.18      43686 0x302   DISCONNECTED
        USER             UID SSNID      ACCOUNT
        ffffff04289c7af8   1 fbe8b7450f NSTOR\test
            TREE             TID SHARE NAME RESOURCE
            ffffff04843aa010   1 test       /export/test
                OFILE            FID NODE             CRED             LEASE
                ffffff05db73c3e8   2 ffffff0506dd1678 ffffff0506d9d9b0 0

> ffffff04289c7af8 ::smbuser -v
SMB user information (ffffff04289c7af8):
        UID: 1
        SSNID: fbe8b7450f
        State: 2 (LOGGING_OFF)
        Flags: 0x00000000 <>
        Privileges: 0x00800000 <CHANGE_NOTIFY>
        Credential: ffffff0506d9d9b0
        Reference Count: 2
        User Account: NSTOR\test

> ffffff04843aa010 ::smbtree -v
SMB tree information (ffffff04843aa010):
        TID: 0001
        State: 1 (DISCONNECTING)
        Share: test
        Resource: /export/test
        Type: ZFS
        Volume: export
        Umask: 0000
        Flags: 000d2fce <ACLS,STREAMS,CI,OPLOCKS,SHORTNAMES,XVATTR,DIRENTFLAGS,
            ACL_CR,ACEMASK,UNICODE,QUOTA,SPARSE,XMOUNTS>
        SMB Node: ffffff04289c5958
        Reference Count: 1

> ffffff05db73c3e8 ::smbofile -v
SMB ofile information (ffffff05db73c3e8):
        FID: 2
        State: 3 (SAVING)
        DH Type: 1 (DURABLE_V1)
        Lease: 0
        SMB Node: ffffff0506dd1678
        LLF Offset: 0x0 (Invalid)
        Flags: 0x00000000 <>
        Granted Acc.: 0x001f01ff <READ_DATA,WRITE_DATA,APPEND_DATA,READ_EA,
            WRITE_EA,EXECUTE,DELETE_CHILD,READ_ATTR,WRITE_ATTR,DELETE,
            READ_CTRL,WRITE_DAC,WRITE_OWNER,SYNCH>
        Share Mode: 0x00000007 <READ,WRITE,DELETE>
        User: ffffff04289c7af8
        Tree: ffffff04843aa010
        Credential: ffffff0506d9d9b0

> 0xffffff04843aa010 ::print smb_tree_t t_ofile_list
t_ofile_list = {
    t_ofile_list.ll_lock = {
        _opaque = [ 0 ]
    }
    t_ofile_list.ll_list = {
        list_size = 0x3a0
        list_offset = 0
        list_head = {
            list_next = 0xffffff05db73c3e8
            list_prev = 0xffffff05db73c3e8
        }
    }
    t_ofile_list.ll_count = 0x1
    t_ofile_list.ll_wrop = 0x1
    t_ofile_list.ll_mutex = {
        _opaque = [ 0 ]
    }
    t_ofile_list.ll_deleteq = {
        list_size = 0x28
        list_offset = 0
        list_head = {
            list_next = 0xffffff0448046600
            list_prev = 0xffffff0448046600
        }
    }
    t_ofile_list.ll_deleteq_count = 0x1
    t_ofile_list.ll_flushing = 0 (0)
}

> 0xffffff0448046600 ::whatis
ffffff0448046600 is allocated from smb_dtor_cache:
            ADDR          BUFADDR        TIMESTAMP           THREAD
ffffff0480dafc18 ffffff0448046600     2103f1229bc0 ffffff0011b69c40
                            CACHE          LASTLOG         CONTENTS
                 ffffff042aa2a348 ffffff03dad13b80 ffffff03de09f8d0
                 kmem_cache_alloc_debug+0x2e0
                 kmem_cache_alloc+0xdd
                 smb_llist_post+0x45
                 smb_ofile_release+0xf4
                 smb_request_free+0x72
                 smb_oplock_async_break+0xc9
                 taskq_d_thread+0xf3
                 thread_start+8

> 0xffffff0448046600 ::print smb_dtor_t
{
    dt_lnd = {
        list_next = 0xffffff04843aa0b0
        list_prev = 0xffffff04843aa0b0
    }
    dt_magic = 0x44544f52
    dt_object = 0xffffff05db73c3e8
    dt_proc = smb_ofile_save_dh
}
It looks like it was handling an oplock break during shutdown.
The stack from the allocation of the ofile destructor shows that
the last reference on the ofile was dropped by smb_oplock_async_break;
smb_ofile_release transitioned the ofile state: SAVE_DH >> SAVING.
The smb_dtor_t sitting on the tree's delete queue points at this ofile
(dt_object) with dt_proc = smb_ofile_save_dh, i.e. the ofile's deferred
destructor is stranded on that queue.
The "problem" sequence must be:
smb_oplock_ind_break takes a hold on the ofile via smb_ofile_hold_olbrk()
and schedules a taskq job to do the oplock break.
Before that job has a chance to run, the logoff path tears the tree down:
smb_session_logoff / smb_user_logoff /
smb_session_disconnect_owned_trees / ... /
smb_tree_disconnect.
Then the taskq job from smb_oplock_ind_break runs, and
during request free we drop the last reference on the ofile,
posting its destructor to the tree's delete queue.
Nothing after that flushes the delete queue on the tree or user,
so those never actually get deleted, and shutdown waits forever.
(A minimal model of this ordering is sketched below.)
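To make the ordering concrete, here is a minimal user-space model of the race. This is hypothetical illustration code, not smbsrv source: obj_t and llist_t are simplified stand-ins, and llist_post/llist_flush model what smb_llist_post and smb_llist_flush do (defer a destructor, then run the queued destructors).

    #include <stdio.h>
    #include <stdbool.h>

    typedef struct obj {
    	int refcnt;
    	bool destroyed;
    } obj_t;

    typedef struct llist {
    	obj_t *deleteq[8];	/* posted destructors awaiting a flush */
    	int dq_count;
    	int obj_count;		/* objects not yet destroyed */
    } llist_t;

    /* Models smb_llist_post(): defer destruction onto the delete queue. */
    static void
    llist_post(llist_t *ll, obj_t *o)
    {
    	ll->deleteq[ll->dq_count++] = o;
    }

    /* Models smb_llist_flush(): run whatever destructors are queued. */
    static void
    llist_flush(llist_t *ll)
    {
    	while (ll->dq_count > 0) {
    		obj_t *o = ll->deleteq[--ll->dq_count];
    		o->destroyed = true;
    		ll->obj_count--;	/* shutdown waits for this to drain */
    	}
    }

    /* Models smb_ofile_release(): the last hold posts, but does not flush. */
    static void
    obj_release(llist_t *ll, obj_t *o)
    {
    	if (--o->refcnt == 0)
    		llist_post(ll, o);
    }

    int
    main(void)
    {
    	llist_t tree_list = { .obj_count = 1 };
    	obj_t ofile = { .refcnt = 2 };	/* tree's hold + oplock-break hold */

    	/* 1: smb_oplock_ind_break took its hold; taskq job is pending. */

    	/* 2: the disconnect path drops the tree's hold and flushes. */
    	obj_release(&tree_list, &ofile);
    	llist_flush(&tree_list);	/* queue still empty: a no-op */

    	/* 3: the taskq job runs and drops the last hold via request free. */
    	obj_release(&tree_list, &ofile);	/* posts to the delete queue */

    	/* Nobody flushes again: prints destroyed=0 pending=1 remaining=1. */
    	printf("destroyed=%d pending=%d remaining=%d\n",
    	    ofile.destroyed, tree_list.dq_count, tree_list.obj_count);
    	return (0);
    }

With this ordering, obj_count never reaches zero, which is the model's analogue of smb_server_shutdown waiting forever.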
Updated by Gordon Ross over 2 years ago
A relatively simple way to fix this is:
When building the special request used to deliver an oplock break, if the ofile has tree and user pointers, take references on the tree and user and store them in the request (tid_tree, uid_user). That way, when the taskq job processing the oplock break calls smb_request_free, the tree and user get "release" calls (which run the delete queues etc.) regardless of which code path ends up being the last to call release on the tree and user.
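For illustration, the hold-taking part might look roughly like the following kernel-style sketch. This is a paraphrase of the approach described above, not the actual diff; it assumes the usual smbsrv request and ofile fields (fid_ofile, tid_tree, uid_user, f_tree, f_user) and is not compilable on its own.

    /*
     * Sketch: while constructing the special oplock-break request,
     * also store holds on the tree and user so that smb_request_free
     * releases them later, whichever path runs last.
     */
    sr = smb_request_alloc(ofile->f_session, 0);
    if (sr == NULL)
    	return;			/* server is shutting down */
    sr->fid_ofile = ofile;	/* caller already holds the ofile */
    if (ofile->f_tree != NULL) {
    	sr->tid_tree = ofile->f_tree;
    	smb_tree_hold_internal(sr->tid_tree);
    }
    if (ofile->f_user != NULL) {
    	sr->uid_user = ofile->f_user;
    	smb_user_hold_internal(sr->uid_user);
    }

As I read it, smb_request_free already releases sr->tid_tree and sr->uid_user when they are set, and those release paths flush the corresponding delete queues, which closes the stranded-destructor window.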
I did not come up with a reliable way to reproduce this problem, but the fix has had some "mileage".
This code path is well covered by the smbtorture oplock tests. Re-ran those; test results unchanged.
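For example, something along these lines exercises the oplock paths (illustrative invocation; the server, share, and credentials are placeholders):

    $ smbtorture //server/share -U user%pass smb2.oplock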
Updated by Electric Monk over 2 years ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
git commit 5bcbb01ca45de96cf8ebfb0dfd19cc3016d217c7
commit 5bcbb01ca45de96cf8ebfb0dfd19cc3016d217c7
Author: Gordon Ross <gwr@nexenta.com>
Date:   2019-10-19T13:01:05.000Z

    11830 smb/server service shutdown hang
    Reviewed by: Roman Strashkin <roman.strashkin@nexenta.com>
    Reviewed by: Evan Layton <evan.layton@nexenta.com>
    Reviewed by: Matt Barden <matt.barden@nexenta.com>
    Reviewed by: Andy Stormont <AStormont@racktopsystems.com>
    Approved by: Robert Mustacchi <rm@fingolfin.org>