Bug #6525
closed
nlm_unexport() should not call nlm_vhold_clean() with g->lock held
100%
Description
nlm_unexport() calls nlm_vhold_clean() with g->lock held. Calling into other non-trivial subsystems with locks (especially mutexes) held is a big no-no.
Looking at the other nlm_vhold_clean() call here:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/klm/nlm_impl.c#1282
we can see that the caller there carefully drops the mutex before calling nlm_vhold_clean(), leaving no mutex held. The situation in nlm_unexport() is almost the same: hostp->nh_lock is dropped, but the other mutex (g->lock) is left held.
Currently, this does not cause any known issue with vanilla illumos-gate, but with the #6009 feature integrated we would see the following recursive mutex_enter panic:
> ::status
debugging crash dump vmcore.0 (64-bit) from t1
operating system: 5.11 NexentaStor_5.0.0.25:d89426b20f:debug (i86pc)
image uuid: 18671a59-8083-6b9c-88ff-8f5089bc8b97
panic message: recursive mutex_enter, lp=ffffff01ec6264c0 owner=ffffff0235266be0 thread=ffffff0235266be0
dump content: kernel pages only
> ::stack
vpanic()
mutex_panic+0x58(fffffffffb96881d, ffffff01ec6264c0)
mutex_vector_enter+0x30f(ffffff01ec6264c0)
nlm_host_find_by_sysid+0x2b(ffffff01ec6264c0, 3)
nlm_sysid_to_host+0x61(0, 3, ffffff000825db50, ffffff000825dc28)
translate_sysid_to_host+0xb5(0, 3, ffffff000825dbf0, 2e, ffffff000825dc28)
flk_stale_lock_release+0x55(ffffff022ccd2480)
flk_delete_active_lock+0x1b8(ffffff022ccd2480, 0)
cleanlocks+0x165(ffffff01f0f25200, ffffffff, 3)
nlm_vhold_clean+0x2d(ffffff01dcbd8f80, 3)
nlm_unexport+0xd0(ffffff0235ff16c0)
lm_unexport+0x11(ffffff0235ff16c0)
unexport+0xfd(ffffff0235ff16c0)
exportfs+0x28f(ffffff000825ed90, 100000, ffffff01f6747290)
stubs_common_code+0x51()
nfs_export+0xb4(80473d8)
nfssys+0x588(2, 80473d8)
_sys_sysenter_post_swapgs+0x237()
>
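The locking problem can be illustrated outside the kernel. The following is only a minimal userland sketch, not the actual fix from the commit: it assumes POSIX threads, uses an error-checking pthread mutex in place of the kernel mutex, and all names in it (struct globals, subsystem_clean(), unexport_lock_held(), unexport_lock_dropped()) are hypothetical stand-ins. Where the kernel panics with "recursive mutex_enter", the error-checking mutex simply returns EDEADLK.

```c
#ifndef _XOPEN_SOURCE
#define	_XOPEN_SOURCE 700	/* for PTHREAD_MUTEX_ERRORCHECK */
#endif
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the NLM globals; only the lock matters here. */
struct globals {
	pthread_mutex_t lock;
	int cleaned;		/* how many clean-up calls succeeded */
};

/* Set up g->lock as an error-checking mutex so re-locking is reported. */
int
globals_init(struct globals *g)
{
	pthread_mutexattr_t attr;
	int err;

	g->cleaned = 0;
	if ((err = pthread_mutexattr_init(&attr)) != 0)
		return (err);
	err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (err == 0)
		err = pthread_mutex_init(&g->lock, &attr);
	(void) pthread_mutexattr_destroy(&attr);
	return (err);
}

/*
 * Stand-in for the callee: a non-trivial subsystem call that, somewhere
 * down its call chain, needs g->lock itself.
 */
static int
subsystem_clean(struct globals *g)
{
	int err = pthread_mutex_lock(&g->lock);

	if (err != 0)
		return (err);	/* EDEADLK if this thread already owns it */
	g->cleaned++;
	(void) pthread_mutex_unlock(&g->lock);
	return (0);
}

/* Problematic shape: call out while still holding g->lock. */
int
unexport_lock_held(struct globals *g)
{
	int err;

	(void) pthread_mutex_lock(&g->lock);
	err = subsystem_clean(g);
	(void) pthread_mutex_unlock(&g->lock);
	return (err);
}

/* Safer shape: finish the work under g->lock, drop it, then call out. */
int
unexport_lock_dropped(struct globals *g)
{
	(void) pthread_mutex_lock(&g->lock);
	/* ... bookkeeping that really needs g->lock goes here ... */
	(void) pthread_mutex_unlock(&g->lock);
	return (subsystem_clean(g));
}
```

After globals_init(), unexport_lock_held() returns EDEADLK and leaves cleaned at 0, while unexport_lock_dropped() returns 0. In real code, anything looked up under the lock may need to be held by reference or revalidated once the lock has been dropped; the sketch does not cover that.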
Steps to reproduce
Note: These steps work only with the #6009 feature integrated.
On the NFS server share a directory and make sure the stale lock detection timeout is short enough (the default is one hour; it is too long):
# share /tmp
# echo "stale_lock_timeout/W 5" | mdb -kw
stale_lock_timeout: 0xe10 = 0x5
#
On the NFS client mount the share via NFSv3 and lock a file:
# mount -o vers=3 t1:/tmp /mnt
# ./locker
> lock r 0 0 /mnt/file
>
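The locker tool used in these steps is not attached to this bug. For anyone who needs a substitute, the following is a rough sketch of the core of such a tool, built on fcntl() record locks. The function name take_lock() and its arguments are hypothetical. It assumes that the two zeros in the locker command are the lock offset and length (a length of 0 meaning "to end of file") and that 'r' requests a read lock; how the real tool parses its arguments is not known from this report.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Take a POSIX record lock on path. type is 'r' (shared) or 'w'
 * (exclusive); a len of 0 means "to end of file". If wait is non-zero
 * the call blocks until the lock can be granted. Returns an open file
 * descriptor that holds the lock, or -1 on error.
 */
int
take_lock(const char *path, char type, off_t start, off_t len, int wait)
{
	struct flock fl = { 0 };
	int fd;

	if (type != 'r' && type != 'w')
		return (-1);
	/* O_RDWR so that either lock type is permitted on the descriptor. */
	fd = open(path, O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return (-1);
	fl.l_type = (type == 'r') ? F_RDLCK : F_WRLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = start;
	fl.l_len = len;
	if (fcntl(fd, wait ? F_SETLKW : F_SETLK, &fl) == -1) {
		(void) close(fd);
		return (-1);
	}
	return (fd);
}
```

The lock is released when the returned descriptor is closed or the process exits, so a test program has to keep running for as long as the lock is supposed to stay in place.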
Back on the NFS server, try to lock the same file, wait for the stale lock detection message and then try to unshare the directory:
# ./locker lock W 0 0 /tmp/file &
[1] 10609
# sleep 5
# tail -n1 /var/adm/messages
Dec 17 20:19:51 t1 genunix: [ID 964031 kern.info] NOTICE: Stale lock (host: 010.000.100.002 (NLM), pid: 100745, vnode: ffffff01f0f25200, path: /tmp/file, RDLCK: 0:18446744073709551615)
# unshare /tmp
PANIC!
Updated by Marcel Telka over 6 years ago
- Related to Feature #6009: Support for remote stale lock detection added
Updated by Marcel Telka over 6 years ago
- Status changed from In Progress to Pending RTI
Updated by Electric Monk over 6 years ago
- Status changed from Pending RTI to Closed
- % Done changed from 0 to 100
git commit b2b464a48ff6cc58978813dbfc2f622e2dab29ce
commit b2b464a48ff6cc58978813dbfc2f622e2dab29ce
Author: Marcel Telka <marcel.telka@nexenta.com>
Date: 2015-12-28T21:48:45.000Z

    6525 nlm_unexport() should not call nlm_vhold_clean() with g->lock held
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Approved by: Dan McDonald <danmcd@omniti.com>