6009 Support for remote stale lock detection

Review Request #61 - Created June 16, 2015 and updated

Information
Marcel Telka
illumos-gate
6009
Reviewers
general

webrev: http://cr.illumos.org/~webrev/marcel/il-6009-stale-lock/

Locking over NFSv2/3 is implemented using the sideband NLM protocol (with the support of the NSM protocol). Unfortunately, there are scenarios in which the NLM server and the NLM client can get out of sync in their knowledge of the locks. While most such cases are just bugs in either the NLM server or the NLM client implementation, there are some valid problematic scenarios that are not caused by any bug and are simply inherent flaws in the NFS/NLM/NSM protocol design.

When the NLM client and NLM server are out of sync in their record of the locks, the NLM (NFS) server may think a file is locked by some client while that client thinks it does not hold such a lock. If another client then asks for a conflicting lock, it won't be able to get the lock and might wait (block) forever. For these scenarios the operating system provides the clear_locks(1m) tool to help the administrator clear such "stale" locks.
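To make the out-of-sync state concrete, here is a minimal sketch that models it in Python. It is only an illustration of the scenario described above, not the detection logic used in this change (the description does not spell that out). The `Lock` record, the `stale_candidates` helper, and the host names and paths are all hypothetical.

```python
from typing import NamedTuple


class Lock(NamedTuple):
    """Hypothetical record of a byte-range lock, for illustration only."""
    client: str   # host name of the lock owner (made up)
    path: str     # file the lock applies to (made up)
    start: int    # starting offset of the locked range
    length: int   # length of the locked range; 0 means "to end of file"


def stale_candidates(server_locks, client_views):
    """Return locks the server records but the owning client does not hold.

    server_locks: list of Lock records as seen by the server.
    client_views: dict mapping client name to the set of Lock records
                  that client believes it holds.
    """
    stale = []
    for lock in server_locks:
        held = client_views.get(lock.client, set())
        if lock not in held:
            stale.append(lock)
    return stale


# The server still records a lock for clientA, but clientA has no record
# of it; clientB and the server agree with each other.
server_locks = [
    Lock("clientA", "/export/data/db", 0, 0),
    Lock("clientB", "/export/data/log", 0, 100),
]
client_views = {
    "clientA": set(),
    "clientB": {Lock("clientB", "/export/data/log", 0, 100)},
}

stale = stale_candidates(server_locks, client_views)
print(stale)
```

In this model only the clientA lock is reported, which is the kind of candidate an administrator would then consider clearing with clear_locks(1m). A real server cannot query the client's view this way, which is part of what makes such locks hard to find.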

The problem with clear_locks(1m) is that it is very hard for the administrator to find a "stale" lock candidate for cleanup. Basically, the operating system itself offers no help to make the admin's life easier. This change adds support for "stale" lock detection, so the admin will be able to easily find a "stale" lock candidate for clearing.

The implementation checks for remote stale locks only (NLM, NFSv4, SMB), with the focus on NLM, since the other two sharing protocols (NFSv4 and SMB) do not have known scenarios with stale locks. Local locks are not checked, but if needed the implementation could easily be modified to include the check for local stale locks too.

I manually tested many scenarios to make sure the stale lock reporting works as
designed and expected.  I thoroughly tested scenarios with partial unlocking of
some areas of a file, relocking, and upgrading and downgrading of locks.  The
testing also covered the interaction between NFSv2/3 (NLM), NFSv4, and local
locks on the NFS server.

During the testing I reproduced a new panic in the NLM implementation:
6007 assertion failed: TAILQ_EMPTY(&hostp->nh_vholds_list), file: ../../common/klm/nlm_impl.c, line: 1161

I was able to reproduce the panic both with and without the fix using exactly
the same steps, so the panic is not related to my fix.

Dan McDonald

Thank you for addressing my concerns. For bonus points, any testing details beyond what you have already would be appreciated.

Ken Mays
Ship It!