Bug #6916

Race between nlm_unexport() and nlm_svc_stopping() can cause panic

Added by Marcel Telka almost 5 years ago. Updated almost 5 years ago.

Status: Pending RTI
Category: nfs - NFS server and client
There is a race between nlm_unexport() and nlm_svc_stopping() that can cause this panic:

> ::status
debugging crash dump vmcore.0 (64-bit) from t1
operating system: 5.11 illumos-f83b46b (i86pc)
image uuid: 13e647cf-2104-43e6-a7e4-d95e902c5a96
panic message: 
mutex_enter: bad mutex, lp=ffffff00d1d18e18 owner=ffffff00c8848880 thread=ffffff00d1328c00
dump content: kernel pages only
> ::stack
mutex_panic+0x58(fffffffffb94ac1c, ffffff00d1d18e18)
exportfs+0x297(ffffff00034a2da0, 100000, ffffff00c60985c8)
nfssys+0x438(2, 8046038)


locker.c (3.57 KB) — Marcel Telka, 2016-04-16 05:41 PM
6916-repro.patch (1.06 KB) — Marcel Telka, 2016-04-29 01:18 PM

Updated by Marcel Telka almost 5 years ago

Steps to reproduce the panic

Since the panic is caused by very short race windows in nlm_svc_stopping() and nlm_unexport(), we need to instrument both functions a bit to make the race windows big enough to reproduce the panic easily. Please find attached the 6916-repro.patch file with the needed change. This change is applied on the NFS server.

In addition, you need an application on the NFS client that is able to lock and unlock a file, for example the attached locker.c file (compile using "gcc -Wall -o locker locker.c").

Steps on the NFS server:

1) Share a dataset:

# zfs set sharenfs=on rpool/export
# chmod 777 /export/

2) Make sure the race windows are big enough:

# echo "nlm_svc_stopping_wait/W1" | mdb -kw
nlm_svc_stopping_wait:          0               =       0x1
# echo "nlm_unexport_wait/W1" | mdb -kw
nlm_unexport_wait:              0               =       0x1

Continue on the NFS client:

3) Mount the share using NFSv3:

# mount -o vers=3 SERVER:/export /mnt

4) Lock and unlock a file on the share:

# ./locker lock r 0 0 /mnt/file

Back on the NFS server:

5) Restart lockd:

# svcadm restart nlockmgr

6) Unshare the dataset (this will hang due to instrumentation in the klmmod module):

# zfs set sharenfs=off rpool/export

7) Unblock both threads waiting in the race windows:

# echo "nlm_svc_stopping_wait/W0" | mdb -kw
nlm_svc_stopping_wait:          0x1             =       0x0
# sleep 2
# echo "nlm_unexport_wait/W0" | mdb -kw
nlm_unexport_wait:              0x1             =       0x0



Updated by Marcel Telka almost 5 years ago

Root cause

The problem is in nlm_svc_stopping(), where the manipulation of nlm_hosts_tree is not protected by nlm_globals->lock, probably in the hope that it is the only thread touching nlm_hosts_tree. This is obviously not true, and the nlm_svc_stopping() author was aware of that, as the comment above nlm_svc_stopping() shows:

2432 * NOTE: NFS code can call NLM while it's
2433 * stopping or even if it's shut down. Any attempt
2434 * to lock file either on client or on the server
2435 * will fail if NLM isn't in NLM_ST_UP state.

The fix just makes sure the nlm_hosts_tree is properly protected by nlm_globals->lock, as all other functions do.


Updated by Marcel Telka almost 5 years ago

  • Status changed from In Progress to Pending RTI

Updated by Marcel Telka almost 5 years ago

  • Description updated (diff)

Updated by Marcel Telka almost 5 years ago

  • File deleted (6916-repro.patch)

Updated by Marcel Telka almost 5 years ago

The bug in nlm_svc_stopping() could cause a similar panic in other functions too.


Updated by Marcel Telka almost 5 years ago

  • Description updated (diff)
