Bug #6916
Race between nlm_unexport() and nlm_svc_stopping() can cause panic
Description
There is a race between nlm_unexport() and nlm_svc_stopping() which can cause this panic:
> ::status
debugging crash dump vmcore.0 (64-bit) from t1
operating system: 5.11 illumos-f83b46b (i86pc)
image uuid: 13e647cf-2104-43e6-a7e4-d95e902c5a96
panic message: mutex_enter: bad mutex, lp=ffffff00d1d18e18 owner=ffffff00c8848880 thread=ffffff00d1328c00
dump content: kernel pages only
> ::stack
vpanic()
mutex_panic+0x58(fffffffffb94ac1c, ffffff00d1d18e18)
mutex_vector_enter+0x347(ffffff00d1d18e18)
nlm_unexport+0xd4(ffffff00d0009100)
lm_unexport+0x11(ffffff00d0009100)
unexport+0xfd(ffffff00d0009100)
exportfs+0x297(ffffff00034a2da0, 100000, ffffff00c60985c8)
stubs_common_code+0x51()
nfs_export+0x9e(8046038)
nfssys+0x438(2, 8046038)
_sys_sysenter_post_swapgs+0x149()
>
Updated by Marcel Telka over 7 years ago
Steps to reproduce the panic
Since the panic is caused by very short race windows in nlm_svc_stopping() and nlm_unexport(), we need to instrument both functions a bit to make sure the race windows are wide enough to reproduce the panic easily. Please find attached the 6916-repro.patch file with the needed change. This change is needed on the NFS server.
In addition, you need an application on the NFS client that is able to lock and unlock a file, for example the attached locker.c file (compile using "gcc -Wall -o locker locker.c").
Steps on the NFS server:
1) Share a dataset:
# zfs set sharenfs=on rpool/export
# chmod 777 /export/
#
2) Make sure the race windows are wide enough:
# echo "nlm_svc_stopping_wait/W1" | mdb -kw
nlm_svc_stopping_wait:          0               =       0x1
# echo "nlm_unexport_wait/W1" | mdb -kw
nlm_unexport_wait:              0               =       0x1
#
Continue on the NFS client:
3) Mount the share using NFSv3:
# mount -o vers=3 SERVER:/export /mnt
#
4) Lock and unlock a file on the share:
# ./locker lock r 0 0 /mnt/file
#
Back on the NFS server:
5) Restart lockd:
# svcadm restart nlockmgr
#
6) Unshare the dataset (this will hang due to instrumentation in the klmmod module):
# zfs set sharenfs=off rpool/export
7) Unblock both threads waiting in the race windows:
# echo "nlm_svc_stopping_wait/W0" | mdb -kw
nlm_svc_stopping_wait:          0x1             =       0x0
# sleep 2
# echo "nlm_unexport_wait/W0" | mdb -kw
nlm_unexport_wait:              0x1             =       0x0
#
8) PANIC!
Updated by Marcel Telka over 7 years ago
Root cause
The problem is in nlm_svc_stopping(), where the manipulations of nlm_hosts_tree are not protected by nlm_globals->lock, probably in the hope that we are the only thread touching nlm_hosts_tree. This is obviously not true, and the author of nlm_svc_stopping() was aware of that, because this is in the comment above nlm_svc_stopping():
2432  * NOTE: NFS code can call NLM while it's
2433  * stopping or even if it's shut down. Any attempt
2434  * to lock file either on client or on the server
2435  * will fail if NLM isn't in NLM_ST_UP state.
The fix just makes sure that nlm_hosts_tree is properly protected by nlm_globals->lock, as all other functions do.
Updated by Marcel Telka over 7 years ago
Updated by Marcel Telka over 7 years ago
- Status changed from In Progress to Pending RTI
Updated by Marcel Telka over 7 years ago
- File 6916-repro.patch added
The bug in nlm_svc_stopping() could cause a similar panic in other function(s) too.