Bug #6525

nlm_unexport() should not call nlm_vhold_clean() with g->lock held

Added by Marcel Telka almost 5 years ago. Updated almost 5 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: nfs - NFS server and client
Start date: 2015-12-19
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

nlm_unexport() calls nlm_vhold_clean() with g->lock held. Calling into other non-trivial subsystems with locks (especially mutexes) held is a big no-no.

Looking at the other nlm_vhold_clean() call here:

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/klm/nlm_impl.c#1282

we can see that the caller there carefully drops the mutex before calling nlm_vhold_clean(), leaving no mutex held. The situation in nlm_unexport() is almost the same: hostp->nh_lock is dropped, but the other mutex (g->lock) is left held.

Currently, this does not cause any known issue with vanilla illumos-gate, but with the #6009 feature integrated we would see the following recursive mutex_enter panic:

> ::status
debugging crash dump vmcore.0 (64-bit) from t1
operating system: 5.11 NexentaStor_5.0.0.25:d89426b20f:debug (i86pc)
image uuid: 18671a59-8083-6b9c-88ff-8f5089bc8b97
panic message: recursive mutex_enter, lp=ffffff01ec6264c0 owner=ffffff0235266be0 thread=ffffff0235266be0
dump content: kernel pages only
> ::stack
vpanic()
mutex_panic+0x58(fffffffffb96881d, ffffff01ec6264c0)
mutex_vector_enter+0x30f(ffffff01ec6264c0)
nlm_host_find_by_sysid+0x2b(ffffff01ec6264c0, 3)
nlm_sysid_to_host+0x61(0, 3, ffffff000825db50, ffffff000825dc28)
translate_sysid_to_host+0xb5(0, 3, ffffff000825dbf0, 2e, ffffff000825dc28)
flk_stale_lock_release+0x55(ffffff022ccd2480)
flk_delete_active_lock+0x1b8(ffffff022ccd2480, 0)
cleanlocks+0x165(ffffff01f0f25200, ffffffff, 3)
nlm_vhold_clean+0x2d(ffffff01dcbd8f80, 3)
nlm_unexport+0xd0(ffffff0235ff16c0)
lm_unexport+0x11(ffffff0235ff16c0)
unexport+0xfd(ffffff0235ff16c0)
exportfs+0x28f(ffffff000825ed90, 100000, ffffff01f6747290)
stubs_common_code+0x51()
nfs_export+0xb4(80473d8)
nfssys+0x588(2, 80473d8)
_sys_sysenter_post_swapgs+0x237()
>
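
The stack above shows nlm_unexport() reaching nlm_host_find_by_sysid() through nlm_vhold_clean() and cleanlocks(), where the thread tries to enter a mutex it already owns. The following is a minimal userland sketch of that shape and of the "drop the lock before calling out" alternative. It is not the kernel code and not the actual fix: struct globals, subsystem_callback() and both unexport_*() functions are hypothetical stand-ins, and a POSIX error-checking mutex is assumed so that the recursive enter returns EDEADLK instead of panicking.

```c
#define _XOPEN_SOURCE 700	/* expose PTHREAD_MUTEX_ERRORCHECK */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the NLM globals; only the lock matters here. */
struct globals {
	pthread_mutex_t lock;
};

/*
 * Initialize g->lock as an error-checking mutex, so that a recursive
 * lock attempt by the owner fails with EDEADLK rather than hanging.
 */
static int
globals_init(struct globals *g)
{
	pthread_mutexattr_t attr;
	int rc;

	if ((rc = pthread_mutexattr_init(&attr)) != 0)
		return (rc);
	rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (rc == 0)
		rc = pthread_mutex_init(&g->lock, &attr);
	(void) pthread_mutexattr_destroy(&attr);
	return (rc);
}

/*
 * Stand-in for the path that ends in nlm_host_find_by_sysid():
 * another subsystem that needs g->lock for its own lookup.
 */
static int
subsystem_callback(struct globals *g)
{
	int rc = pthread_mutex_lock(&g->lock);

	if (rc != 0)
		return (rc);	/* EDEADLK when the caller already owns it */
	rc = pthread_mutex_unlock(&g->lock);
	assert(rc == 0);
	return (0);
}

/* Problematic shape: call out while still holding g->lock. */
static int
unexport_lock_held(struct globals *g)
{
	int rc, urc;

	rc = pthread_mutex_lock(&g->lock);
	assert(rc == 0);
	rc = subsystem_callback(g);	/* recursive enter is detected here */
	urc = pthread_mutex_unlock(&g->lock);
	assert(urc == 0);
	return (rc);
}

/* Safer shape: drop g->lock around the call, then retake it. */
static int
unexport_lock_dropped(struct globals *g)
{
	int rc, lrc;

	lrc = pthread_mutex_lock(&g->lock);
	assert(lrc == 0);
	/* ... examine state protected by g->lock ... */
	lrc = pthread_mutex_unlock(&g->lock);
	assert(lrc == 0);

	rc = subsystem_callback(g);	/* no lock held across the call */

	lrc = pthread_mutex_lock(&g->lock);
	assert(lrc == 0);
	/* ... state may have changed while unlocked; revalidate it ... */
	lrc = pthread_mutex_unlock(&g->lock);
	assert(lrc == 0);
	return (rc);
}
```

After globals_init() succeeds, unexport_lock_held() returns EDEADLK while unexport_lock_dropped() returns 0. The cost of the second shape is that anything read under the lock must be revalidated once the lock is retaken.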

Steps to reproduce

Note: These steps work only with the #6009 feature integrated.

On the NFS server, share a directory and make sure the stale lock detection timeout is short enough (the default of one hour is too long):

# share /tmp
# echo "stale_lock_timeout/W 5" | mdb -kw
stale_lock_timeout:             0xe10           =       0x5
#

On the NFS client, mount the share via NFSv3 and lock a file:

# mount -o vers=3 t1:/tmp /mnt
# ./locker
> lock r 0 0 /mnt/file
>

Back on the NFS server, try to lock the same file, wait for the stale lock detection message and then try to unshare the directory:

# ./locker lock W 0 0 /tmp/file &
[1] 10609
# sleep 5
# tail -n1 /var/adm/messages 
Dec 17 20:19:51 t1 genunix: [ID 964031 kern.info] NOTICE: Stale lock (host: 010.000.100.002 (NLM), pid: 100745, vnode: ffffff01f0f25200, path: /tmp/file, RDLCK: 0:18446744073709551615)
# unshare /tmp

PANIC!
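
The steps above rely on the attached locker.c, which is not reproduced here. For readers without it, the following is a rough sketch of how a whole-file lock such as the RDLCK 0:18446744073709551615 reported in the stale lock message can be requested with fcntl(). The helper name lock_whole_file() and its parameters are assumptions for illustration and are not taken from locker.c.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from locker.c: apply a record lock of
 * the given type (F_RDLCK, F_WRLCK or F_UNLCK) to the whole file.
 * If wait is non-zero, block until the lock can be granted.
 * Returns 0 on success, or -1 with errno set, as fcntl() does.
 */
static int
lock_whole_file(int fd, short type, int wait)
{
	struct flock fl = { 0 };

	fl.l_type = type;
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;		/* from the start of the file */
	fl.l_len = 0;		/* 0 means "through end of file" */

	return (fcntl(fd, wait ? F_SETLKW : F_SETLK, &fl));
}
```

On an NFSv3 mount, a request like this is what the client forwards to the server through NLM, which is how the server ends up holding a lock on behalf of a remote host.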


Files

locker.c (3.57 KB), Marcel Telka, 2015-12-19 12:08 AM

Related issues

Related to illumos gate - Feature #6009: Support for remote stale lock detection (In Progress, Marcel Telka, 2015-06-16)

#1

Updated by Marcel Telka almost 5 years ago

  • Related to Feature #6009: Support for remote stale lock detection added
#2

Updated by Marcel Telka almost 5 years ago

  • Status changed from In Progress to Pending RTI
#3

Updated by Electric Monk almost 5 years ago

  • Status changed from Pending RTI to Closed
  • % Done changed from 0 to 100

commit  b2b464a48ff6cc58978813dbfc2f622e2dab29ce
Author: Marcel Telka <marcel.telka@nexenta.com>
Date:   2015-12-28T21:48:45.000Z

    6525 nlm_unexport() should not call nlm_vhold_clean() with g->lock held
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Approved by: Dan McDonald <danmcd@omniti.com>
