Bug #6554

Deadlock between rfs4_do_lock() and rfs4_op_read()

Added by Marcel Telka over 4 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: nfs - NFS server and client
Start date: 2016-01-14
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:
Description

We encountered a hang in the NFSv4 server: the NFS server thread pool was exhausted and all NFS server threads were blocked waiting on locks.

Root cause analysis

There are 2000+ threads stuck in rfs4_lease_expired() waiting for a mutex:

> ::stacks -m nfssrv
THREAD           STATE    SOBJ                COUNT
ffffff6ce4ade760 SLEEP    MUTEX                2020
                 swtch+0x141
                 turnstile_block+0x262
                 mutex_vector_enter+0x3c5
                 rfs4_dbe_lock+0x11
                 rfs4_lease_expired+0x25
                 rfs4_get_lo_state+0x71
                 rfs4_get_all_state+0xcb
                 rfs4_check_stateid+0x12c
                 rfs4_op_read+0xa9
                 rfs4_compound+0x1e9
                 rfs4_dispatch+0x145
                 common_dispatch+0x765
                 rfs_dispatch+0x2d
                 svc_getreq+0x1c1
                 svc_run+0xe0
                 svc_do_run+0x8e
                 nfssys+0xf1
                 dtrace_systrace_syscall32+0xe4
                 _sys_sysenter_post_swapgs+0x149

fffffe87bb2ef120 SLEEP    MUTEX                  24
                 swtch+0x141          
                 turnstile_block+0x262
                 mutex_vector_enter+0x3c5
                 rfs4_dbsearch+0xd6
>

Some information about the mutex:

> ffffff6ce4ade760::findstack -v
stack pointer for thread ffffff6ce4ade760: fffffe42eda30160
[ fffffe42eda30160 _resume_from_idle+0xf4() ]
  fffffe42eda30190 swtch+0x141()
  fffffe42eda30240 turnstile_block+0x262(ffffff6ce3237c38, 0, fffffe84b884e428, fffffffffbc07960, 0, 0)
  fffffe42eda302b0 mutex_vector_enter+0x3c5(fffffe84b884e428)
  fffffe42eda302d0 rfs4_dbe_lock+0x11(fffffe84b884e428)
  fffffe42eda30300 rfs4_lease_expired+0x25(fffffe84b884e490)
  fffffe42eda30350 rfs4_get_lo_state+0x71(ffffff39edbcabc0, fffffe42eda30368, 0)
  fffffe42eda303c0 rfs4_get_all_state+0xcb(ffffff39edbcabc0, fffffe42eda30400, fffffe42eda303f8, fffffe42eda303f0)
  fffffe42eda30470 rfs4_check_stateid+0x12c(1, fffffe849ad87e00, ffffff39edbcabc0, 0, fffffe42eda30678, 1, fffffe42eda30590)
  fffffe42eda30620 rfs4_op_read+0xa9(ffffff39edbcabb8, fffffe84e8c09168, fffffe42eda30c00, fffffe42eda30640)
  fffffe42eda307b0 rfs4_compound+0x1e9(fffffe42eda30a00, fffffe42eda307d0, 0, fffffe42eda30c00, 0, fffffe42eda30808)
  fffffe42eda30860 rfs4_dispatch+0x145(ffffffffc0109010, fffffe42eda30c00, fffffe8e426f3c00, fffffe42eda30a00)
  fffffe42eda30b80 common_dispatch+0x765(fffffe42eda30c00, fffffe8e426f3c00, 2, 4, fffffffff822f796, ffffffffc0109060)
  fffffe42eda30ba0 rfs_dispatch+0x2d(fffffe42eda30c00, fffffe8e426f3c00)
  fffffe42eda30c80 svc_getreq+0x1c1(fffffe8e426f3c00, fffffe87e5156b60)
  fffffe42eda30cf0 svc_run+0xe0(fffffe849a505668)
  fffffe42eda30d30 svc_do_run+0x8e(1)
  fffffe42eda30e20 nfssys+0xf1(e, fe1e0fbc)
  fffffe42eda30ec0 dtrace_systrace_syscall32+0xe4(e, fe1e0fbc, bc368040, e4ade760, 1, 0)
  fffffe42eda30f10 _sys_sysenter_post_swapgs+0x149()
> fffffe84b884e428::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
fffffe84b884e428 adapt fffffe84e6f8eba0      -      -     yes
> fffffe84e6f8eba0::findstack -v
stack pointer for thread fffffe84e6f8eba0: fffffe42ed77e420
[ fffffe42ed77e420 _resume_from_idle+0xf4() ]
  fffffe42ed77e450 swtch+0x141()
  fffffe42ed77e500 turnstile_block+0x262(ffffff6ce323a1e0, 0, ffffff6ce38aaea8, fffffffffbc07960, 0, 0)
  fffffe42ed77e570 mutex_vector_enter+0x3c5(ffffff6ce38aaea8)
  fffffe42ed77e590 rfs4_dbe_lock+0x11(ffffff6ce38aaea8)
  fffffe42ed77e620 rfs4_op_release_lockowner+0x1f4(fffffe8492545200, ffffff6ce7111d00, fffffe42ed77ec00, fffffe42ed77e640)
  fffffe42ed77e7b0 rfs4_compound+0x1e9(fffffe42ed77ea00, fffffe42ed77e7d0, 0, fffffe42ed77ec00, 0, fffffe42ed77e808)
  fffffe42ed77e860 rfs4_dispatch+0x21d(ffffffffc0109010, fffffe42ed77ec00, ffffff6ce627ee00, fffffe42ed77ea00)
  fffffe42ed77eb80 common_dispatch+0x765(fffffe42ed77ec00, ffffff6ce627ee00, 2, 4, fffffffff822f796, ffffffffc0109060)
  fffffe42ed77eba0 rfs_dispatch+0x2d(fffffe42ed77ec00, ffffff6ce627ee00)
  fffffe42ed77ec80 svc_getreq+0x1c1(ffffff6ce627ee00, fffffe8827e12560)
  fffffe42ed77ecf0 svc_run+0xe0(fffffe849a505668)
  fffffe42ed77ed30 svc_do_run+0x8e(1)
  fffffe42ed77ee20 nfssys+0xf1(e, f9130fbc)
  fffffe42ed77eec0 dtrace_systrace_syscall32+0xe4(e, f9130fbc, bb29f800, e6f8eba0, 1, 0)
  fffffe42ed77ef10 _sys_sysenter_post_swapgs+0x149()
>

The mutex is held by thread fffffe84e6f8eba0, which is itself waiting for another mutex, ffffff6ce38aaea8. Who is holding that one?

> ffffff6ce38aaea8::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
ffffff6ce38aaea8 adapt ffffff6ce4a5e080      -      -     yes
> ffffff6ce4a5e080::findstack -v
stack pointer for thread ffffff6ce4a5e080: fffffe42eced60a0
[ fffffe42eced60a0 _resume_from_idle+0xf4() ]
  fffffe42eced60d0 swtch+0x141()
  fffffe42eced6180 turnstile_block+0x262(0, 0, fffffe849fa4dca0, fffffffffbc08b80, 0, 0)
  fffffe42eced61f0 rw_enter_sleep+0x21d(fffffe849fa4dca0, 0)
  fffffe42eced6210 nbl_start_crit+0x19(fffffe849fa4dc40, 0)
  fffffe42eced62b0 fs_frlock+0xeb(fffffe849fa4dc40, 2a, fffffe42eced6500, 10001, 0, 0, fffffe84a90f3738, 0)
  fffffe42eced6350 zfs_frlock+0xf8(fffffe849fa4dc40, 2a, fffffe42eced6500, 10001, 0, 0, fffffe84a90f3738, 0)
  fffffe42eced63f0 fop_frlock+0x65(fffffe849fa4dc40, 2a, fffffe42eced6500, 10001, 0, 0, fffffe84a90f3738, 0)
  fffffe42eced64c0 setlock+0xa8(fffffe849fa4dc40, fffffe42eced6500, 10001, fffffe84a90f3738)
  fffffe42eced65b0 rfs4_do_lock+0x1b8(fffffe849ae96068, 2, 0, ffffffffffffffff, fffffe84a90f3738, fffffe84e8c90e28)
  fffffe42eced6620 rfs4_op_locku+0x288(ffffff39edcb38f8, fffffe84e8c90e28, fffffe42eced6c00, fffffe42eced6640)
  fffffe42eced67b0 rfs4_compound+0x1e9(fffffe42eced6a00, fffffe42eced67d0, 0, fffffe42eced6c00, 0, fffffe42eced6808)
  fffffe42eced6860 rfs4_dispatch+0x21d(ffffffffc0109010, fffffe42eced6c00, fffffe8e3f301c00, fffffe42eced6a00)
  fffffe42eced6b80 common_dispatch+0x765(fffffe42eced6c00, fffffe8e3f301c00, 2, 4, fffffffff822f796, ffffffffc0109060)
  fffffe42eced6ba0 rfs_dispatch+0x2d(fffffe42eced6c00, fffffe8e3f301c00)
  fffffe42eced6c80 svc_getreq+0x1c1(fffffe8e3f301c00, fffffe87b8fa3320)
  fffffe42eced6cf0 svc_run+0xe0(fffffe849a505668)
  fffffe42eced6d30 svc_do_run+0x8e(1)
  fffffe42eced6e20 nfssys+0xf1(e, f6cb0fbc)
  fffffe42eced6ec0 dtrace_systrace_syscall32+0xe4(e, f6cb0fbc, bc8e6ac0, e4a5e080, 1, 0)
  fffffe42eced6f10 _sys_sysenter_post_swapgs+0x149()
>

The mutex is held by thread ffffff6ce4a5e080, which is hung in fs_frlock(), reached via NFS and ZFS. The thread is waiting to acquire rwlock fffffe849fa4dca0 as a writer. Note that the thread is working on vnode fffffe849fa4dc40.

> fffffe849fa4dca0::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
fffffe849fa4dca0        READERS=1  B011 ffffff6ce4a5e080 (W)
                                     ||
                 WRITE_WANTED -------+|
                  HAS_WAITERS --------+
>

Who is holding the rwlock as a reader? This thread (note the same vnode fffffe849fa4dc40 here too):

> fffffe849d1deb80::findstack -v
stack pointer for thread fffffe849d1deb80: fffffe42edca41f0
[ fffffe42edca41f0 _resume_from_idle+0xf4() ]
  fffffe42edca4220 swtch+0x141()
  fffffe42edca42d0 turnstile_block+0x262(0, 0, ffffff6ce38aaea8, fffffffffbc07960, 0, 0)
  fffffe42edca4340 mutex_vector_enter+0x3c5(ffffff6ce38aaea8)
  fffffe42edca4360 rfs4_dbe_lock+0x11(ffffff6ce38aaea8)
  fffffe42edca43c0 rfs4_state_has_access+0x3c(ffffff6ce38aaf28, 1, fffffe849fa4dc40)
  fffffe42edca4470 rfs4_check_stateid+0x43b(1, fffffe849fa4dc40, fffffe849ff508c0, 0, fffffe42edca4678, 1, fffffe42edca4590)
  fffffe42edca4620 rfs4_op_read+0x1a3(fffffe849ff508b8, fffffe84e8c8e728, fffffe42edca4c00, fffffe42edca4640)
  fffffe42edca47b0 rfs4_compound+0x1e9(fffffe42edca4a00, fffffe42edca47d0, 0, fffffe42edca4c00, 0, fffffe42edca4808)
  fffffe42edca4860 rfs4_dispatch+0x145(ffffffffc0109010, fffffe42edca4c00, ffffff6ce629b200, fffffe42edca4a00)
  fffffe42edca4b80 common_dispatch+0x765(fffffe42edca4c00, ffffff6ce629b200, 2, 4, fffffffff822f796, ffffffffc0109060)
  fffffe42edca4ba0 rfs_dispatch+0x2d(fffffe42edca4c00, ffffff6ce629b200)
  fffffe42edca4c80 svc_getreq+0x1c1(ffffff6ce629b200, fffffe8824477c60)
  fffffe42edca4cf0 svc_run+0xe0(fffffe849a505668)
  fffffe42edca4d30 svc_do_run+0x8e(1)
  fffffe42edca4e20 nfssys+0xf1(e, fd360fbc)
  fffffe42edca4ec0 dtrace_systrace_syscall32+0xe4(e, fd360fbc, ec48ec40, 9d1deb80, 1, 0)
  fffffe42edca4f10 _sys_sysenter_post_swapgs+0x149()
> rfs4_op_read::dis!grep call
rfs4_op_read+0xa4:              call   +0xe427  <rfs4_check_stateid>
rfs4_op_read+0xc9:              call   +0x3838162       <kmem_free>
rfs4_op_read+0x12b:             call   +0x3857dc0       <nbl_need_check>
rfs4_op_read+0x146:             call   +0x3857d45       <nbl_start_crit>
rfs4_op_read+0x162:             call   +0x3857db9       <nbl_conflict>
rfs4_op_read+0x19e:             call   +0xe32d  <rfs4_check_stateid>
>

Looking at the sources we see:

3203    if ((stat = rfs4_check_stateid(FREAD, vp, &args->stateid, FALSE,
3204        deleg, TRUE, &ct)) != NFS4_OK) {
3205        *cs->statusp = resp->status = stat;
3206        goto out;
3207    }
3208
3209    /*
3210     * Enter the critical region before calling VOP_RWLOCK
3211     * to avoid a deadlock with write requests.
3212     */
3213    if (nbl_need_check(vp)) {
3214        nbl_start_crit(vp, RW_READER);
3215        in_crit = 1;
3216        if (nbl_conflict(vp, NBL_READ, args->offset, args->count, 0,
3217            &ct)) {
3218            *cs->statusp = resp->status = NFS4ERR_LOCKED;
3219            goto out;
3220        }
3221    }
3222
3223    if ((stat = rfs4_check_stateid(FREAD, vp, &args->stateid, FALSE,
3224        deleg, TRUE, &ct)) != NFS4_OK) {
3225        *cs->statusp = resp->status = stat;
3226        goto out;
3227    }

rfs4_op_read() is hung at the second rfs4_check_stateid() call, at line 3223.

Just to be sure: is the VFS_NBMAND (0x200) flag set?

> fffffe849fa4dc40::print vnode_t v_vfsp|::print vfs_t vfs_flag
vfs_flag = 0x6620
>

Yes, it is. So we called nbl_start_crit() at line 3214 to acquire the following rwlock:

> fffffe849fa4dc40::print -a vnode_t v_nbllock
fffffe849fa4dca0 v_nbllock = {
    fffffe849fa4dca0 v_nbllock._opaque = [ 0xb ]
}
>

So, thread fffffe849d1deb80 (rfs4_op_read) is holding rwlock fffffe849fa4dca0 as a reader and is waiting for mutex ffffff6ce38aaea8. Meanwhile, thread ffffff6ce4a5e080 (zfs_frlock) is holding mutex ffffff6ce38aaea8 and is waiting for rwlock fffffe849fa4dca0 as a writer.

Deadlock!

The bug is in rfs4_do_lock()/setlock(): another subsystem (in this case ZFS) is called with a mutex held, so the two code paths acquire the mutex and the vnode's v_nbllock in opposite orders.

The other, minor issue identified is that rfs4_op_read() calls rfs4_check_stateid() twice (lines 3203 and 3223). One of the two calls is very likely unnecessary and could be removed.

Steps to reproduce

First, instrument the nfssrv kernel module to widen the deadlock window: apply the attached nfssrv.patch and rebuild the module (dmake install in usr/src/uts/intel/nfssrv).

Second, you need a helper test program to run on an NFS client. Compile the attached deadlock.c.

Then do the following to reproduce the deadlock:

@SERVER # cp nfssrv /kernel/misc/amd64
@SERVER # bootadm update-archive
@SERVER # reboot
@SERVER # zfs create -o sharenfs=on -o nbmand=on rpool/export/test
@SERVER # chmod 777 /export/test/

@CLIENT # mount -o vers=4 SERVER:/export/test /mnt
@CLIENT # echo test > /mnt/test
@CLIENT # ./deadlock /mnt/test

Files

nfssrv.patch (1.16 KB) - Marcel Telka, 2016-01-14 03:26 PM
deadlock.c (814 Bytes) - Marcel Telka, 2016-01-14 03:26 PM

History

#1

Updated by Marcel Telka over 4 years ago

  • Status changed from In Progress to Pending RTI
#2

Updated by Electric Monk over 4 years ago

  • Status changed from Pending RTI to Closed
  • % Done changed from 0 to 100

git commit 3a36268e678711ec7f0ae522d7ef460e51186768

commit  3a36268e678711ec7f0ae522d7ef460e51186768
Author: Marcel Telka <marcel.telka@nexenta.com>
Date:   2016-01-26T22:35:58.000Z

    6554 Deadlock between rfs4_do_lock() and rfs4_op_read()
    Reviewed by: Dan Fields <dan.fields@nexenta.com>
    Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Approved by: Robert Mustacchi <rm@joyent.com>
