Bug #4827

open

nfs4: slow file locking

Added by Simon K about 8 years ago. Updated almost 8 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
nfs - NFS server and client
Start date:
2014-04-30
Due date:
% Done:
0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:

Description

With multiple concurrent processes, nfs4 file locking becomes unacceptably slow for us.

When a lock cannot be acquired immediately, the nfs4 locking mechanism (function nfs4_block_and_wait) retries with an exponential back-off algorithm to avoid flooding the server with locking requests. In nfs4_block_and_wait() and nfs4frlock_pre_setup() we start with a delay of one(!) second. We believe starting with one second is too pessimistic, but let’s check…

With two concurrent processes we indirectly measure the kernel’s back-off time between two consecutive attempts to acquire a blocking flock: we record the time between the moment one process frees the lock and the moment the flock succeeds in the other process.

Old behavior:

  Time between unlock by P1 and lock by P2: 0.992331028
  Time between unlock by P1 and lock by P2: 0.992113113
  Time between unlock by P1 and lock by P2: 0.991441965
  Time between unlock by P1 and lock by P2: 0.992605925
  Time between unlock by P1 and lock by P2: 0.993023157
  Time between unlock by P1 and lock by P2: 0.991801023
  Time between unlock by P1 and lock by P2: 0.991739035
  Time between unlock by P1 and lock by P2: 0.991793156
  Time between unlock by P1 and lock by P2: 0.991232873
  Time between unlock by P1 and lock by P2: 0.991338968

The lock is free but P2 has to wait at least 0.99s until it gets the lock.

With the following mdb one-liners we lower the initial delay from 1 s (0xf4240 µs) to 1 ms (0x3e8 µs):

  echo 'nfs4frlock_pre_setup+0x1b/W 0x3e8' | mdb -kw
  nfs4frlock_pre_setup+0x1b:      0xf4240         =       0x3e8
  echo 'nfs4_block_and_wait+0x6a/W 1' | mdb -kw
  nfs4_block_and_wait+0x6a:       0x3e8           =       0x1
  echo 'nfs4_block_and_wait+0x91/W 0x3e8' | mdb -kw
  nfs4_block_and_wait+0x91:       0xf4240         =       0x3e8

The new behavior looks like this:

  Time between unlock by P1 and lock by P2: 0.002027035
  Time between unlock by P1 and lock by P2: 0.002145052
  Time between unlock by P1 and lock by P2: 0.001981974
  Time between unlock by P1 and lock by P2: 0.002073049
  Time between unlock by P1 and lock by P2: 0.002177954
  Time between unlock by P1 and lock by P2: 0.002015829
  Time between unlock by P1 and lock by P2: 0.001969099
  Time between unlock by P1 and lock by P2: 0.002105951
  Time between unlock by P1 and lock by P2: 0.001879930
  Time between unlock by P1 and lock by P2: 0.001966000

Looks much faster, right?

I will try to provide a source code patch soon. Maybe the initial back-off delay should be a tunable kernel setting.


Files

0001-4827-nfs4-slow-file-locking.patch (5.83 KB) 0001-4827-nfs4-slow-file-locking.patch git patch Simon K, 2014-05-14 10:17 AM

Related issues

Related to illumos gate - Bug #4837: NFSv4 client lock retry delay upper limit should be shorter (New, 2014-05-02)

Actions #1

Updated by Simon K about 8 years ago

The code was introduced by Git commit 7c478bd95313f5f23a4c, so it is at least 9 years old. Starting with a delay of 1 s no longer seems reasonable on modern systems.

We’ve encountered a side effect when dropping the initial delay from 1 s to 1 ms: the distribution of lock assignments improves considerably when the first process (P1) periodically locks and unlocks the file. P2 then has a much higher probability of catching the lock after P1 frees it. With the current back-off behavior, P1 can reacquire the lock again and again while P2 may sleep for a long time.

Actions #2

Updated by Garrett D'Amore about 8 years ago

A 1 ms back-off may be below the resolution of userland sleep times (the clock is often 100 Hz, i.e. 10 ms ticks). I'd recommend trying a 20 ms delay. This is still far better than 1 s (1/50th of the time), but it is likely to give other participants a fair chance to get the lock.

Actions #3

Updated by Marcel Telka about 8 years ago

The actual locking at the NFSv4 server is implemented in setlock() - see file nfs4_srv.c.

If the file is already locked, setlock() tries to lock the file four times in a row, with delays of 10 ms, 20 ms, and 40 ms between the retries. So if the file stays locked by someone else, the (denied) lock reply arrives at the NFSv4 client after about 70 ms.

Based on that, it is not very important to the NFSv4 server whether the initial delay at the NFSv4 client between lock retries is 1 ms, 10 ms, or even more. The illumos NFSv4 server already has some overload protection implemented.

OTOH, it also won't help much to have the initial delay at the client set to only 1 ms. Considering the illumos NFSv4 implementation, a better initial delay would be tens of milliseconds, IMHO. It should be noted that for other NFSv4 server implementations the best initial delay might differ.

Please note that it is always possible to construct a test case where any arbitrarily chosen initial delay timeout is the best. You apparently found a scenario where 1ms seems to be optimal, but it is easy to construct another test case where, for example, 100ms would behave better than 1ms.

Actions #4

Updated by Marcel Telka about 8 years ago

Simon Klinkert wrote:

We’ve encountered a side effect when dropping the initial delay from 1 s to 1 ms: the distribution of lock assignments improves considerably when the first process (P1) periodically locks and unlocks the file. P2 then has a much higher probability of catching the lock after P1 frees it. With the current back-off behavior, P1 can reacquire the lock again and again while P2 may sleep for a long time.

The above looks like #4842.

Actions #5

Updated by Marcel Telka about 8 years ago

  • Assignee set to Simon K
Actions #6

Updated by Simon K about 8 years ago

I've attached a git format-patch file. This patch contains a fix for #4837 and should make file locking a bit faster and more efficient. To get even more speed (this will increase the server load!), set nfs4_base_wait_time to 10 ms, limit max_msec_delay to 1 s, and multiply tick_delay by 1.5 instead of 2.

Actions #7

Updated by Simon K almost 8 years ago

The server side has a (slow) retry loop for locking requests as well. Up to three retries are possible, starting with a delay of 10 ms:

  6  73172                    setlock:entry
  6  20809                 fop_frlock:entry
  6  20810                fop_frlock:return ret=11
  6  20809                 fop_frlock:entry
  6  20810                fop_frlock:return ret=11
  1  20809                 fop_frlock:entry
  1  20810                fop_frlock:return ret=11
  2  20809                 fop_frlock:entry
  2  20810                fop_frlock:return ret=11
  2  20809                 fop_frlock:entry
  2  20810                fop_frlock:return ret=0
  2  73173                   setlock:return time=69ms

The nfs4 file locking feels slower than it should be on a system with hundreds of processes contending for the same file. Even an unlock may take 120 s(!).

Unfortunately, rfs4_maxlock_tries and rfs4_lock_delay are static variables and thus not tunable by mdb or /etc/system.
