Bug #4827
opennfs4: slow file locking
Description
With multiple concurrent processes, nfs4 file locking gets unacceptably slow for us.
If we cannot get the lock, the nfs4 locking mechanism (function nfs4_block_and_wait) retries with an exponential back-off algorithm to prevent us from flooding the server with locking requests. In nfs4_block_and_wait() and nfs4frlock_pre_setup() we start with a delay of one(!) second. We believe starting with one second is too pessimistic, but let’s check.
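The retry scheme described above can be sketched as follows. This is only an illustration and not the code in nfs4_block_and_wait(); the function name backoff_delay_ms, the doubling factor, and the cap parameter are assumptions made for the example.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the illumos source): delay in
 * milliseconds before retry number `attempt` (0-based). The delay
 * starts at `initial_ms`, doubles per retry, and is capped at `max_ms`.
 */
unsigned long
backoff_delay_ms(unsigned long initial_ms, unsigned int attempt,
    unsigned long max_ms)
{
    unsigned long delay = initial_ms;

    assert(initial_ms > 0 && initial_ms <= max_ms);
    while (attempt-- > 0 && delay < max_ms) {
        delay *= 2;
        if (delay > max_ms)
            delay = max_ms;
    }
    return (delay);
}
```

With initial_ms set to 1000, even the very first retry cannot happen earlier than one second after the failed attempt, no matter how quickly the lock is released.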
With two concurrent processes we indirectly determine the back-off time in the kernel between two consecutive tries to get a blocking flock. More precisely, we measure the time between the moment one process frees the lock and the moment the flock succeeds in the other process.
Old behavior:
Time between unlock by P1 and lock by P2: 0.992331028
Time between unlock by P1 and lock by P2: 0.992113113
Time between unlock by P1 and lock by P2: 0.991441965
Time between unlock by P1 and lock by P2: 0.992605925
Time between unlock by P1 and lock by P2: 0.993023157
Time between unlock by P1 and lock by P2: 0.991801023
Time between unlock by P1 and lock by P2: 0.991739035
Time between unlock by P1 and lock by P2: 0.991793156
Time between unlock by P1 and lock by P2: 0.991232873
Time between unlock by P1 and lock by P2: 0.991338968
The lock is free but P2 has to wait at least 0.99s until it gets the lock.
With the following mdb one-liners we set the initial delay from 1s to 1ms:
echo 'nfs4frlock_pre_setup+0x1b/W 0x3e8' | mdb -kw
nfs4frlock_pre_setup+0x1b: 0xf4240 = 0x3e8
echo 'nfs4_block_and_wait+0x6a/W 1' | mdb -kw
nfs4_block_and_wait+0x6a: 0x3e8 = 0x1
echo 'nfs4_block_and_wait+0x91/W 0x3e8' | mdb -kw
nfs4_block_and_wait+0x91: 0xf4240 = 0x3e8
The new behavior looks like this:
Time between unlock by P1 and lock by P2: 0.002027035
Time between unlock by P1 and lock by P2: 0.002145052
Time between unlock by P1 and lock by P2: 0.001981974
Time between unlock by P1 and lock by P2: 0.002073049
Time between unlock by P1 and lock by P2: 0.002177954
Time between unlock by P1 and lock by P2: 0.002015829
Time between unlock by P1 and lock by P2: 0.001969099
Time between unlock by P1 and lock by P2: 0.002105951
Time between unlock by P1 and lock by P2: 0.001879930
Time between unlock by P1 and lock by P2: 0.001966000
Looks much faster, right?
I will try to provide a source code patch soon. Maybe the initial back-off delay should be a tunable kernel setting.
Updated by Simon K about 9 years ago
The code was introduced by Git commit 7c478bd95313f5f23a4c, so I think the code is at least 9 years old. Starting with a delay of 1s does not seem reasonable nowadays.
We’ve encountered a side effect when dropping the initial delay from 1s to 1ms. The distribution of lock assignments gets much better in case the first process(P1) does some periodical file locking and unlocking. For P2, there is a better chance or a higher probability to catch the lock after it was freed by P1. With the current back-off behavior, P1 is able to catch the lock again and again while P2 would probably sleep for a long time.
Updated by Garrett D'Amore about 9 years ago
A 1ms back-off may be under the threshold for userland sleep times (the clock often runs at 100Hz, i.e. 10msec per tick). I'd recommend trying a 20msec delay. This is still far better than 1 sec (1/50th the time), but it is likely to give other participants a fair chance to get the lock.
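The granularity argument can be illustrated with a small sketch. It assumes that a requested sleep is rounded up to a whole number of clock ticks; whether that holds depends on the timer facility actually used, and the function name delay_rounded_to_tick_ms is made up for this example.

```c
#include <assert.h>

/*
 * Hypothetical helper: round a requested delay up to a whole number
 * of clock ticks and return the effective delay in milliseconds.
 * Assumes hz divides 1000 evenly (true for 100 Hz and 1000 Hz).
 */
unsigned long
delay_rounded_to_tick_ms(unsigned long delay_ms, unsigned long hz)
{
    unsigned long tick_ms;

    assert(hz > 0 && hz <= 1000 && 1000 % hz == 0);
    tick_ms = 1000 / hz;
    return (((delay_ms + tick_ms - 1) / tick_ms) * tick_ms);
}
```

Under that assumption a 1ms request on a 100Hz clock effectively becomes a 10ms sleep, while a 20ms request is exactly two ticks.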
Updated by Marcel Telka about 9 years ago
The actual locking at the NFSv4 server is implemented in setlock() - see file nfs4_srv.c.
If the file is already locked, setlock() tries to lock the file four times in a row, with delays of 10ms, 20ms, and 40ms between the tries. So if the file is locked for a long time by someone else, the (denied) lock reply will arrive at the NFSv4 client after about 70ms.
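The 70ms figure is just the sum of the three delays. A minimal model of that retry pattern, not the actual setlock() code from nfs4_srv.c (the function name and parameters are hypothetical), could look like this:

```c
#include <assert.h>

/*
 * Hypothetical model of the retry pattern described above: `tries`
 * lock attempts with a sleep between consecutive attempts that starts
 * at `first_delay_ms` and doubles each time. Returns the total time
 * spent sleeping when every attempt fails.
 */
unsigned long
total_retry_delay_ms(unsigned long first_delay_ms, unsigned int tries)
{
    unsigned long delay = first_delay_ms;
    unsigned long total = 0;
    unsigned int i;

    assert(tries >= 1);
    /* There is one sleep fewer than there are attempts. */
    for (i = 1; i < tries; i++) {
        total += delay;
        delay *= 2;
    }
    return (total);
}
```

For four tries starting at 10ms, this yields 10 + 20 + 40 = 70ms, ignoring the time spent in the lock attempts themselves.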
Based on that, it is not very important for the NFSv4 server whether the initial delay at the NFSv4 client between lock retries is 1ms, 10ms, or even more. The illumos NFSv4 server already has some overload protection implemented.
OTOH, it also won't help much to set the initial delay at the client to only 1ms. Considering the illumos NFSv4 implementation, a better initial delay would be tens of milliseconds, IMHO. Note that for other NFSv4 server implementations the best initial delay might differ.
Please note that it is always possible to construct a test case where any arbitrarily chosen initial delay timeout is the best. You apparently found a scenario where 1ms seems to be optimal, but it is easy to construct another test case where, for example, 100ms would behave better than 1ms.
Updated by Marcel Telka about 9 years ago
Simon Klinkert wrote:
We’ve encountered a side effect when dropping the initial delay from 1s to 1ms. The distribution of lock assignments gets much better in case the first process(P1) does some periodical file locking and unlocking. For P2, there is a better chance or a higher probability to catch the lock after it was freed by P1. With the current back-off behavior, P1 is able to catch the lock again and again while P2 would probably sleep for a long time.
The above looks like #4842.
Updated by Simon K about 9 years ago
- File 0001-4827-nfs4-slow-file-locking.patch added
- Assignee deleted (Simon K)
I've attached a git format-patch file. This patch contains a fix for #4837 and should make the file locking a bit faster and more effective. To get even more speed (will increase the server load!), set nfs4_base_wait_time to 10ms, limit max_msec_delay to 1s, and multiply tick_delay by 1.5 instead of 2.
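To get a feeling for what the growth factor changes, the sketch below counts how many back-off steps it takes to go from the base delay to the cap. It is not taken from the patch; steps_to_cap is a hypothetical helper, the factor is passed as a fraction (num/den), and the integer arithmetic is an assumption of the example.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of multiplications by num/den needed
 * until a delay starting at `base_ms` reaches or exceeds `max_ms`.
 */
unsigned int
steps_to_cap(unsigned long base_ms, unsigned long max_ms,
    unsigned long num, unsigned long den)
{
    unsigned long delay = base_ms;
    unsigned int steps = 0;

    assert(base_ms > 0 && den > 0 && num > den);
    while (delay < max_ms) {
        unsigned long next = delay * num / den;

        /* Integer truncation must not stall the growth. */
        if (next <= delay)
            next = delay + 1;
        delay = next;
        steps++;
    }
    return (steps);
}
```

With a 10ms base and a 1s cap, steps_to_cap(10, 1000, 2, 1) returns 7 and steps_to_cap(10, 1000, 3, 2) returns 12, so the smaller factor keeps the client retrying at short intervals for longer, which is where the extra server load comes from.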
Updated by Simon K almost 9 years ago
The server side has a (slow) retry loop for locking requests as well. Three retries are possible, starting with a delay of 10ms:
6 73172 setlock:entry
6 20809 fop_frlock:entry
6 20810 fop_frlock:return ret=11
6 20809 fop_frlock:entry
6 20810 fop_frlock:return ret=11
1 20809 fop_frlock:entry
1 20810 fop_frlock:return ret=11
2 20809 fop_frlock:entry
2 20810 fop_frlock:return ret=11
2 20809 fop_frlock:entry
2 20810 fop_frlock:return ret=0
2 73173 setlock:return time=69ms
The nfs4 file locking feels slower than it should be on a system with hundreds of processes struggling with the same file. Even an unlock may need 120s(!).
Unfortunately, rfs4_maxlock_tries and rfs4_lock_delay are static variables and thus not tunable by mdb or /etc/system.