Bug #12681


Excess service times for NFS(v3) locking calls

Added by Gordon Ross almost 4 years ago. Updated almost 4 years ago.

nfs - NFS server and client

Customer reports slow access from Linux clients running Apache services.
Examination of network captures on both sides shows that all "call back" attempts fail
(where the server tries to do an RPC bind and open a connection back to the client).
It turns out a firewall is preventing calls back to the client. However,
the customer insists this all works with another vendor's server.

Actions #1

Updated by Gordon Ross almost 4 years ago

Additional analysis of this problem (where we have a firewall blocking call-backs):
the NLM server gets into an unfortunate state in this configuration, apparently hung.

We forced a crash dump and found that we have 511 threads all blocked like this:

> ::stacks -m rpcmod                  
THREAD           STATE    SOBJ                COUNT
ffffd09c24848820 SLEEP    CV                    511

Let's have a look at maxthreads...
> ffffd09c24848820 ::findstack -v
stack pointer for thread ffffd09c24848820: ffffd002e3781260
[ ffffd002e3781260 _resume_from_idle+0x112() ]
  ffffd002e3781290 swtch+0x141()
  ffffd002e3781330 cv_timedwait_sig_hires+0x39d(ffffd0c732c343d4, ffffd0940feccc80, 34630b8a000, 989680, 0)
  ffffd002e3781380 cv_timedwait_sig+0x49(ffffd0c732c343d4, ffffd0940feccc80, 11e88ee)
  ffffd002e37813c0 wait_for_lock+0x10c(ffffd0c732c34300)
  ffffd002e3781420 flk_wait_execute_request+0x230(ffffd0c732c34300)
  ffffd002e37814a0 flk_process_request+0x4ed(ffffd0c732c34300)
  ffffd002e3781630 reclock+0x5dd(ffffd09de5360580, ffffd002e3781a00, e, 10003, 0, ffffd002e3781650)
  ffffd002e37816d0 fs_frlock+0x187(ffffd09de5360580, 7, ffffd002e3781a00, 10003, 0, ffffd002e37818f0, ffffd09cfa33c760, 0)
  ffffd002e3781770 zfs_frlock+0xf8(ffffd09de5360580, 7, ffffd002e3781a00, 10003, 0, ffffd002e37818f0, ffffd09cfa33c760, 0)
  ffffd002e3781810 fop_frlock+0x65(ffffd09de5360580, 7, ffffd002e3781a00, 10003, 0, ffffd002e37818f0, ffffd09cfa33c760, 0)
  ffffd002e3781860 nlm_vop_frlock+0x51(ffffd09de5360580, 7, ffffd002e3781a00, 10003, 0, ffffd002e37818f0, ffffd09cfa33c760, 0)
  ffffd002e37819a0 nlm_block+0xeb(ffffd002e3781b60, ffffd09cb1d006b8, ffffd0a192ea4b10, ffffd09c39f5ddc8, ffffd002e3781a00, fffffffff868c330)
  ffffd002e3781ac0 nlm_do_lock+0x2f5(ffffd002e3781b60, ffffd002e3781b10, ffffd002e3781ca0, fffffffff868c300, 0, fffffffff868c330)
  ffffd002e3781af0 nlm4_lock_4_svc+0x2a(ffffd002e3781b60, ffffd002e3781b10, ffffd002e3781ca0)
  ffffd002e3781c20 nlm_dispatch+0xfe(ffffd002e3781ca0, ffffd0a146edea00, fffffffff8693610)
  ffffd002e3781c40 nlm_prog_4+0x34(ffffd002e3781ca0, ffffd0a146edea00)
  ffffd002e3781d20 svc_getreq+0x1c1(ffffd0a146edea00, ffffd0aaa5f29f00)
  ffffd002e3781d90 svc_run+0x198(ffffd0a02e2b0750)
  ffffd002e3781dd0 svc_do_run+0x8e(2)
  ffffd002e3781ec0 nfssys+0x111(e, febf0fbc)
  ffffd002e3781f10 _sys_sysenter_post_swapgs+0x149()
> ffffd0a02e2b0750 ::print struct __svcpool
    p_next = 0
    p_prev = 0xffffd09c425573f8
    p_id = 0x2
    p_threads = 0x2
    p_detached_threads = 0x1ff
    p_maxthreads = 0x200
    p_redline = 0x1
    p_reserved_threads = 0
    p_thread_lock = {
        _opaque = [ 0 ]
    }
    p_asleep = 0
    p_drowsy = 0
    p_req_cv = {
        _opaque = 0
    }
    p_timeout = 0x1f4
    p_req_lock = {
        _opaque = [ 0 ]
    }
    p_reqs = 0x300e97
    p_walkers = 0
    p_max_same_xprt = 0x8
    p_stksize = 0
    p_closing = 0
    p_creator_signaled = 0
    p_creator_exit = 0
    p_creator_cv = {
        _opaque = 0x1
    }
    p_creator_lock = {
        _opaque = [ 0 ]
    }
    p_lcount = 0x1a6
    p_lhead = 0xffffd0a4d8f831f0
    p_lrwlock = {

Yup. We're simply out of RPC threads in the lockd service pool (p_maxthreads = 0x200, i.e. 512)!
We're also blocking against locks held by clients that we've been unable to notify (with a lock call-back) that they were granted the lock they asked for.

Actions #2

Updated by Gordon Ross almost 4 years ago

It turns out the dump shown above was misleading: it shows the state of things after the call-back attempts had already failed. All of that activity eventually times out, but that takes a while.

We had some other dumps (sorry, don't have the stacks handy now) where we can see that all the NLM RPC service threads are blocked in rpcbind_getaddr calls. Looking into that some more, I see that we're attempting the rpcbind_getaddr call earlier than we need to, and doing so even when we will not use the handle. This is part of what's causing us to run out of threads: at the point where we block trying to set up this handle, the RPC service thread has not yet "detached" itself from the RPC service thread pool. RPC threads are not supposed to block (for a long time) until they have detached. That's why we run out of threads.

I'll move that RPC client handle setup (which does the rpcbind_getaddr call) later, doing it only when we know we need to call back the client (i.e. for a lock call that must block).
This should allow non-contested lock calls to proceed without attempting to set up the call-back handle.
There were also problems with reporting errors when a call-back fails and with cleaning up state afterward.

This also turns out to dramatically improve locking performance over NFS when the lock request is non-contested.

Out for review:

Actions #3

Updated by Gordon Ross almost 4 years ago

Testing: Run Linux NFS client from behind a firewall that prevents any call-back RPC.
Verified uncontested locks can be taken without call-back RPC (these used to fail).
Verified the call-back failures are reported, and lock state is cleaned up when that happens.
General regression tests via the Connectathon NFS lock tests.

Actions #4

Updated by Dan McDonald almost 4 years ago

At Joyent we had been seeing unusual NFS problems with an LX NFS client (using the SmartOS native-assistant lx_lockd, but the actual Linux rpcbind and/or rpc.statd). Those problems seem to dissipate with the combination of a SmartOS lx-brand fix and this fix on the server end.

Actions #5

Updated by Gordon Ross almost 4 years ago

  • Status changed from In Progress to Pending RTI
Actions #6

Updated by Electric Monk almost 4 years ago

  • Status changed from Pending RTI to Closed
  • % Done changed from 0 to 100

commit  54b3584e66419f0f60689044f13d702660b4233f
Author: Gordon Ross <>
Date:   2020-05-09T23:08:44.000Z

    12681 Excess service times for NFS(v3) locking calls
    Reviewed by: Evan Layton <>
    Reviewed by: Dan McDonald <>
    Approved by: Richard Lowe <>

