6009 Support for remote stale lock detection

Review Request #61 — Created June 16, 2015 and updated

marcel
illumos-gate
6009
general

webrev: http://cr.illumos.org/~webrev/marcel/il-6009-stale-lock/

The support for locking over NFSv2/3 is implemented using the sideband NLM protocol (with the support of the NSM protocol). Unfortunaly, there are some scenarios when the NLM server and the NLM client might run out of sync with their knowledge of the locks. While the most such cases are just a bugs either in the NLM server or NLM client implementation, there are some valid
problematic scenarios that are not caused by any bug and they are just inherent flaws in the NFS/NLM/NSM protocol design.

In a case the NLM client and NLM server are out of sync with their record of the locks, we could see a case when the NLM (NFS) server thinks a file is locked by some client, but such a client thinks it does not hold such a lock. If there is some other client asking for the conflicting lock we will see that the other client won't be able to get the lock and it might wait (block) forever. For these scenarios the operating system contains the clear_locks(1m) tool to help the administrator to clear such "stale" locks.

The problem with the clear_locks(1m) usage is that it is very hard for the administrator to find such a "stale" lock candidate for the cleanup. Basically, there is no help from the operating system itself to make the admin's life easir. This change adds the support for the "stale" locks detection, so the admin will be able easily find a "stale" lock candidate for the clearing.

The implementation checks for the remote stale locks only (NLM, NFSv4, SMB) with the focus on the NLM, since the other two sharing protocols (NFSv4 and SMB) does not have known scenarios with the stale locks. The local locks are not checked, but if needed the implementation could be easily modified to include the check for local stale locks too.

I manually tested many scenarios to make sure the stale lock reporting works as
designed and expected.  I thoroughly tested scenarios with partial unlocking of
some areas of file, relocking, upgrading and downgrading of the locks.  The
testing also covered the interaction between NFSv2/3 (NLM), NFSv4, and local
locks on the NFS server.

During the testing I reproduced new panic in the NLM implementation:
6007 assertion failed: TAILQ_EMPTY(&hostp->nh_vholds_list), file: ../../common/klm/nlm_impl.c, line: 1161

I was able reproduce the panic both with and without the fix using the exactly
same steps, so the panic is not related to my fix.
  • 0
  • 0
  • 2
  • 1
  • 3
Description From Last Updated
marcel
danmcd
  1. 
      
  2. usr/src/uts/common/klm/nlm_impl.c (Diff revision 2)
     
     

    Please add a comment:

    / NC_INET & NC_INET6 are string constants, so strcmp() is safe. /

  3. usr/src/uts/common/os/flock.c (Diff revision 2)
     
     

    Isn't there a function like this already somewhere? It seems surprising that there isn't...

    1. Something roughly similar is here: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/vnode.c#3051 but it is not directly usable for our needs here, so I decided to create a new one.

      Yes, it is surprising.

  4. usr/src/uts/common/os/flock.c (Diff revision 2)
     
     

    Parens!

  5. usr/src/uts/common/os/flock.c (Diff revision 2)
     
     

    Please use parens here for readability. (This is a problem in other places too.)

  6. usr/src/uts/common/os/flock.c (Diff revision 2)
     
     

    Parens

  7. usr/src/uts/common/os/flock.c (Diff revision 2)
     
     

    Parens

  8. usr/src/uts/common/os/flock.c (Diff revision 2)
     
     

    PARENS!!! (Esp. here.)

  9. 
      
marcel
Review request changed
danmcd
  1. Thank you for addressing my concerns. For bonus points, any testing details beyond what you have already would be appreciated.

  2. 
      
kmays
  1. Ship It!
  2. 
      
Loading...