Bug #6225
NFSv4: setlock() can spin forever
Description
Here is the simplified setlock() implementation:
int
setlock(...)
{
	int error;

retry:
	for (...) {
		error = VOP_FRLOCK(F_SETLK, ...);
	}
	if (error) {
		VOP_FRLOCK(F_GETLK, ...);
		if (F_UNLCK)
			goto retry;
	}
	return (error);
}
The setlock() just tries to get the lock in the for loop. If it is not successful, it tries to find out who is preventing us from getting the lock. If it finds that there is no pending lock, it retries (goto retry). If we are extremely unlucky, some other process might always hold a lock when we do VOP_FRLOCK(F_SETLK), and it might always release that lock just before we do VOP_FRLOCK(F_GETLK). This would cause setlock() to spin forever. We should make sure this extremely unlucky scenario cannot happen.
In addition to the extremely unlucky (but possible) scenario described above, the setlock() spin can be reproduced far more easily:
The current core file locking implementation is write-preferring. This means that if there is a read lock active (process 1) and a write lock is waiting (process 2) for the read lock, an additional read lock request (process 3) will fail (or block). If process 3 then tries to get information about the lock preventing it from getting the read lock (using F_GETLK), F_UNLCK is returned, because the only conflict is a *waiting* write lock, not a held one. This causes setlock() to spin forever too. Please see Bug #5957 for a discussion of write-preferring vs. read-preferring locking and for an example of the described behavior.
There are several problems caused by the spinning setlock():
- The setlock() thread holds the exported_lock as reader - see rfs4_compound(). This means that while setlock() spins, all attempts to acquire the exported_lock as writer block and wait until the setlock() thread stops spinning. For example, every attempt to share or unshare a filesystem will hang.
- The NFSv4 client won't see a reply to its NFSv4 LOCK call (since the thread responsible for sending the reply is spinning in setlock()).
Steps to reproduce
You need one NFS server with a shared filesystem (let's say /export) and one NFS client with the share mounted from the NFS server using the NFSv4 protocol (mount -o vers=4 SERVER:/export /mnt).
You also need to compile the attached lock.c source file in the /export directory:
# gcc -Wall -o lock lock.c
Do this locally on the NFS server in the /export directory:
# ./lock r file &
F_SETLK: F_RDLCK
sleeping
[1] 718
# sleep 2
# ./lock W file
F_SETLKW: F_WRLCK
Do this on the NFS client in the /mnt directory:
# ./lock R file
F_SETLKW: F_RDLCK
Now we can find a thread on the NFS server cycling in setlock():
> ::stacks -c setlock
THREAD           STATE    SOBJ                COUNT
ffffff00d0fb38a0 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait+0x70
                 delay_common+0xb8
                 delay+0x30
                 setlock+0x164
                 rfs4_do_lock+0x1b8
                 rfs4_op_lock+0x619
                 rfs4_compound+0x1e9
                 rfs4_dispatch+0x21d
                 common_dispatch+0x765
                 rfs_dispatch+0x2d
                 svc_getreq+0x1c1
                 svc_run+0x146
                 svc_do_run+0x8e
                 nfssys+0xf1
                 _sys_sysenter_post_swapgs+0x149
>
The thread calls in a cycle VOP_FRLOCK() with F_SETLK (or F_SETLK_NBMAND) and F_GETLK. The VOP_FRLOCK(F_SETLK) will always fail, but the VOP_FRLOCK(F_GETLK) will always succeed and return F_UNLCK. This will cause the infinite spin in setlock().
Now, when you try to unshare any exported filesystem, the unshare command hangs with this stack:
> ffffff00c8dd97a0::findstack -v
stack pointer for thread ffffff00c8dd97a0: ffffff0002ebcc80
[ ffffff0002ebcc80 _resume_from_idle+0xf4() ]
  ffffff0002ebccb0 swtch+0x141()
  ffffff0002ebcd50 turnstile_block+0x21a(0, 0, ffffffffc01afc88, fffffffffbc08bc0, 0, 0)
  ffffff0002ebcdc0 rw_enter_sleep+0x19b(ffffffffc01afc88, 0)
  ffffff0002ebd270 unexport+0x28(ffffff00d2943340)
  ffffff0002ebdd40 exportfs+0x25f(ffffff0002ebdda0, 100000, ffffff00d0d72470)
  ffffff0002ebdd80 stubs_common_code+0x51()
  ffffff0002ebddd0 nfs_export+0x9e(8046048)
  ffffff0002ebdec0 nfssys+0x388(2, 8046048)
  ffffff0002ebdf10 _sys_sysenter_post_swapgs+0x149()
>
Files
Related issues
Updated by Marcel Telka over 5 years ago
- Related to Bug #5957: File locking should be read-preferring added
Updated by Marcel Telka over 5 years ago
The solution
The NFSv4.0 LOCK operation does not support blocking locks (at the NFSv4.0
protocol level), so the client needs to resend the LOCK request in case the
lock is denied by the NFSv4 server. NFSv4.0 clients are prepared for that
(obviously); they send their LOCK requests with some delay between the
attempts. See nfs4frlock() and nfs4_block_and_wait() for the locking and delay
implementation on the client side.
To make the clients' life easier, the illumos NFSv4 server does some fast
retries of its own (the for loop in setlock()) in the hope that the lock will
become available soon. That way the client does not need to resend its LOCK
requests so quickly just to check lock availability. This saves some network
traffic and helps the client get the lock as soon as possible.
So the for loop in setlock() makes perfect sense and is good to have, but it
is completely wrong to allow setlock() to spin forever (goto retry).
The solution is easy: with the above in mind, make sure setlock() cannot
spin forever, so the "goto retry" is going to be removed.
Updated by Arne Jansen over 5 years ago
- Related to Bug #6253: F_GETLK doesn't always return lock owner added
Updated by Marcel Telka over 5 years ago
Arne Jansen and Gordon Ross requested to preserve the retry loop and limit it to 10 iterations instead. This is the explanation (written by Gordon):
There's a race inherent in the current VOP_FRLOCK design where:

a: "other guy" takes a lock that conflicts with a lock we want
b: we attempt to take our lock (non-blocking) and the attempt fails
c: "other guy" releases the conflicting lock
d: we ask what lock conflicts with the lock we want, getting F_UNLCK (no lock blocks us)

If we retry the non-blocking lock attempt in this case (restart at step 'b'), there's some possibility that many such attempts might fail. However, a test designed to actually provoke this race shows that the vast majority of cases require no retry, and only a few took as many as three retries. Here's the test outcome:

number of retries    how many times we needed that many retries
0                    79461
1                    862
2                    49
3                    5

Given those empirical results, we arbitrarily limit the retry count to ten. If we actually make it to ten retries and give up, nothing catastrophic happens, but we're unable to return the information about the conflicting lock to the NFS client. That's an acceptable trade-off vs. letting this retry loop run forever.
Updated by Electric Monk over 5 years ago
- Status changed from Pending RTI to Closed
- % Done changed from 0 to 100
git commit 55a4551ddde8feed8f8e011458f992ea74257bd9
commit 55a4551ddde8feed8f8e011458f992ea74257bd9
Author: Marcel Telka <marcel.telka@nexenta.com>
Date:   2015-10-07T18:30:42.000Z

    6225 NFSv4: setlock() can spin forever
    Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Arne Jansen <arne@die-jansens.de>
    Approved by: Dan McDonald <danmcd@omniti.com>