6009 Support for remote stale lock detection
Review Request #61 — Created June 16, 2015 and updated
Locking over NFSv2/3 is implemented using the sideband NLM protocol (with the support of the NSM protocol). Unfortunately, there are scenarios in which the NLM server and the NLM client fall out of sync in their knowledge of the locks. While most such cases are simply bugs in either the NLM server or the NLM client implementation, some problematic scenarios are valid: they are not caused by any bug and are inherent flaws in the NFS/NLM/NSM protocol design.
When the NLM client and NLM server are out of sync in their record of the locks, the NLM (NFS) server may think a file is locked by some client while that client thinks it holds no such lock. If another client then asks for a conflicting lock, it will not be able to get the lock and might wait (block) forever. For these scenarios the operating system provides the clear_locks(1m) tool, which helps the administrator clear such "stale" locks.
The problem with using clear_locks(1m) is that it is very hard for the administrator to find a "stale" lock candidate for the cleanup. The operating system itself offers no help to make the admin's life easier. This change adds support for "stale" lock detection, so the admin can easily find a "stale" lock candidate to clear.
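To make the out-of-sync condition concrete, here is a toy model in C. It is only an illustration of the mismatch, not the code in this change and not how the change identifies candidates (that is not described here). The toy_lock_t structure and both functions are hypothetical, and the sketch assumes both views of the locks are visible in one place, which is exactly what a real server lacks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record of one byte-range lock; not an illumos structure. */
typedef struct toy_lock {
	int	tl_client;	/* id of the client that owns the lock */
	long	tl_start;	/* first byte of the locked range */
	long	tl_len;		/* length of the locked range */
} toy_lock_t;

/* Return true if the client-side table contains exactly this lock. */
static bool
client_knows(const toy_lock_t *cl, size_t ncl, const toy_lock_t *lk)
{
	for (size_t i = 0; i < ncl; i++) {
		if (cl[i].tl_client == lk->tl_client &&
		    cl[i].tl_start == lk->tl_start &&
		    cl[i].tl_len == lk->tl_len)
			return (true);
	}
	return (false);
}

/*
 * Print and count the locks the server still records but the owning
 * client no longer knows about, i.e. the "stale" candidates.
 */
static size_t
report_stale(const toy_lock_t *srv, size_t nsrv,
    const toy_lock_t *cl, size_t ncl)
{
	size_t nstale = 0;

	for (size_t i = 0; i < nsrv; i++) {
		if (!client_knows(cl, ncl, &srv[i])) {
			(void) printf("stale candidate: client %d, "
			    "start %ld, len %ld\n", srv[i].tl_client,
			    srv[i].tl_start, srv[i].tl_len);
			nstale++;
		}
	}
	return (nstale);
}
```

The comparison is an exact match on owner and range, which is a simplification: a real check would have to cope with split and merged ranges.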
The implementation checks for remote stale locks only (NLM, NFSv4, SMB), with the focus on NLM, since the other two sharing protocols (NFSv4 and SMB) do not have known scenarios that produce stale locks. Local locks are not checked, but if needed the implementation could easily be modified to check for local stale locks too.
I manually tested many scenarios to make sure the stale lock reporting works as designed and expected. I thoroughly tested scenarios with partial unlocking of some areas of a file, relocking, and upgrading and downgrading of the locks. The testing also covered the interaction between NFSv2/3 (NLM), NFSv4, and local locks on the NFS server. During the testing I reproduced a new panic in the NLM implementation:
6007 assertion failed: TAILQ_EMPTY(&hostp->nh_vholds_list), file: ../../common/klm/nlm_impl.c, line: 1161
I was able to reproduce the panic both with and without the fix using exactly the same steps, so the panic is not related to my fix.
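For readers who want to drive a similar sequence themselves, the lock operations named above (partial unlock, relock, upgrade, downgrade) can be issued from a client with POSIX fcntl(2) record locks. The following is a sketch only, not the test program used for this review; the function names and byte ranges are made up for illustration.

```c
#include <fcntl.h>
#include <unistd.h>

/* Apply one non-blocking record-lock operation to [start, start + len). */
static int
set_lock(int fd, short type, off_t start, off_t len)
{
	struct flock fl = { 0 };

	fl.l_type = type;
	fl.l_whence = SEEK_SET;
	fl.l_start = start;
	fl.l_len = len;
	return (fcntl(fd, F_SETLK, &fl));
}

/* Hypothetical sequence; returns 0 if every step succeeded, -1 otherwise. */
static int
exercise_locks(const char *path)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	int rc = -1;

	if (fd < 0)
		return (-1);

	if (set_lock(fd, F_RDLCK, 0, 100) == 0 &&	/* shared, bytes 0..99 */
	    set_lock(fd, F_UNLCK, 40, 20) == 0 &&	/* partial unlock 40..59 */
	    set_lock(fd, F_RDLCK, 40, 20) == 0 &&	/* relock the hole */
	    set_lock(fd, F_WRLCK, 0, 100) == 0 &&	/* upgrade to exclusive */
	    set_lock(fd, F_RDLCK, 0, 100) == 0)		/* downgrade to shared */
		rc = 0;

	(void) close(fd);	/* closing the file drops this process's locks */
	return (rc);
}
```

To exercise NLM rather than local locking, path would need to name a file on an NFSv2/3 mount; on a local file system the same calls only touch local locks.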
Implemented changes suggested by Dan McDonald.
Revision 3 (+528 -189)
Thank you for addressing my concerns. For bonus points, any testing details beyond what you have already provided would be appreciated.