Bug #6554
Deadlock between rfs4_do_lock() and rfs4_op_read()
Status: Closed
% Done: 100%
Description
We encountered a hang in the NFSv4 server: the NFS server thread pool was exhausted and all NFS server threads were blocked waiting on locks.
Root cause analysis
There are 2000+ threads stuck in rfs4_lease_expired() waiting for a mutex:
> ::stacks -m nfssrv
THREAD           STATE    SOBJ                COUNT
ffffff6ce4ade760 SLEEP    MUTEX                2020
                 swtch+0x141
                 turnstile_block+0x262
                 mutex_vector_enter+0x3c5
                 rfs4_dbe_lock+0x11
                 rfs4_lease_expired+0x25
                 rfs4_get_lo_state+0x71
                 rfs4_get_all_state+0xcb
                 rfs4_check_stateid+0x12c
                 rfs4_op_read+0xa9
                 rfs4_compound+0x1e9
                 rfs4_dispatch+0x145
                 common_dispatch+0x765
                 rfs_dispatch+0x2d
                 svc_getreq+0x1c1
                 svc_run+0xe0
                 svc_do_run+0x8e
                 nfssys+0xf1
                 dtrace_systrace_syscall32+0xe4
                 _sys_sysenter_post_swapgs+0x149

fffffe87bb2ef120 SLEEP    MUTEX                  24
                 swtch+0x141
                 turnstile_block+0x262
                 mutex_vector_enter+0x3c5
                 rfs4_dbsearch+0xd6
>
Some information about the mutex:
> ffffff6ce4ade760::findstack -v
stack pointer for thread ffffff6ce4ade760: fffffe42eda30160
[ fffffe42eda30160 _resume_from_idle+0xf4() ]
  fffffe42eda30190 swtch+0x141()
  fffffe42eda30240 turnstile_block+0x262(ffffff6ce3237c38, 0, fffffe84b884e428, fffffffffbc07960, 0, 0)
  fffffe42eda302b0 mutex_vector_enter+0x3c5(fffffe84b884e428)
  fffffe42eda302d0 rfs4_dbe_lock+0x11(fffffe84b884e428)
  fffffe42eda30300 rfs4_lease_expired+0x25(fffffe84b884e490)
  fffffe42eda30350 rfs4_get_lo_state+0x71(ffffff39edbcabc0, fffffe42eda30368, 0)
  fffffe42eda303c0 rfs4_get_all_state+0xcb(ffffff39edbcabc0, fffffe42eda30400, fffffe42eda303f8, fffffe42eda303f0)
  fffffe42eda30470 rfs4_check_stateid+0x12c(1, fffffe849ad87e00, ffffff39edbcabc0, 0, fffffe42eda30678, 1, fffffe42eda30590)
  fffffe42eda30620 rfs4_op_read+0xa9(ffffff39edbcabb8, fffffe84e8c09168, fffffe42eda30c00, fffffe42eda30640)
  fffffe42eda307b0 rfs4_compound+0x1e9(fffffe42eda30a00, fffffe42eda307d0, 0, fffffe42eda30c00, 0, fffffe42eda30808)
  fffffe42eda30860 rfs4_dispatch+0x145(ffffffffc0109010, fffffe42eda30c00, fffffe8e426f3c00, fffffe42eda30a00)
  fffffe42eda30b80 common_dispatch+0x765(fffffe42eda30c00, fffffe8e426f3c00, 2, 4, fffffffff822f796, ffffffffc0109060)
  fffffe42eda30ba0 rfs_dispatch+0x2d(fffffe42eda30c00, fffffe8e426f3c00)
  fffffe42eda30c80 svc_getreq+0x1c1(fffffe8e426f3c00, fffffe87e5156b60)
  fffffe42eda30cf0 svc_run+0xe0(fffffe849a505668)
  fffffe42eda30d30 svc_do_run+0x8e(1)
  fffffe42eda30e20 nfssys+0xf1(e, fe1e0fbc)
  fffffe42eda30ec0 dtrace_systrace_syscall32+0xe4(e, fe1e0fbc, bc368040, e4ade760, 1, 0)
  fffffe42eda30f10 _sys_sysenter_post_swapgs+0x149()
> fffffe84b884e428::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
fffffe84b884e428  adapt fffffe84e6f8eba0      -      -     yes
> fffffe84e6f8eba0::findstack -v
stack pointer for thread fffffe84e6f8eba0: fffffe42ed77e420
[ fffffe42ed77e420 _resume_from_idle+0xf4() ]
  fffffe42ed77e450 swtch+0x141()
  fffffe42ed77e500 turnstile_block+0x262(ffffff6ce323a1e0, 0, ffffff6ce38aaea8, fffffffffbc07960, 0, 0)
  fffffe42ed77e570 mutex_vector_enter+0x3c5(ffffff6ce38aaea8)
  fffffe42ed77e590 rfs4_dbe_lock+0x11(ffffff6ce38aaea8)
  fffffe42ed77e620 rfs4_op_release_lockowner+0x1f4(fffffe8492545200, ffffff6ce7111d00, fffffe42ed77ec00, fffffe42ed77e640)
  fffffe42ed77e7b0 rfs4_compound+0x1e9(fffffe42ed77ea00, fffffe42ed77e7d0, 0, fffffe42ed77ec00, 0, fffffe42ed77e808)
  fffffe42ed77e860 rfs4_dispatch+0x21d(ffffffffc0109010, fffffe42ed77ec00, ffffff6ce627ee00, fffffe42ed77ea00)
  fffffe42ed77eb80 common_dispatch+0x765(fffffe42ed77ec00, ffffff6ce627ee00, 2, 4, fffffffff822f796, ffffffffc0109060)
  fffffe42ed77eba0 rfs_dispatch+0x2d(fffffe42ed77ec00, ffffff6ce627ee00)
  fffffe42ed77ec80 svc_getreq+0x1c1(ffffff6ce627ee00, fffffe8827e12560)
  fffffe42ed77ecf0 svc_run+0xe0(fffffe849a505668)
  fffffe42ed77ed30 svc_do_run+0x8e(1)
  fffffe42ed77ee20 nfssys+0xf1(e, f9130fbc)
  fffffe42ed77eec0 dtrace_systrace_syscall32+0xe4(e, f9130fbc, bb29f800, e6f8eba0, 1, 0)
  fffffe42ed77ef10 _sys_sysenter_post_swapgs+0x149()
>
The mutex is held by thread fffffe84e6f8eba0, which is itself waiting for another mutex, ffffff6ce38aaea8. Who is holding that one?
> ffffff6ce38aaea8::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
ffffff6ce38aaea8  adapt ffffff6ce4a5e080      -      -     yes
> ffffff6ce4a5e080::findstack -v
stack pointer for thread ffffff6ce4a5e080: fffffe42eced60a0
[ fffffe42eced60a0 _resume_from_idle+0xf4() ]
  fffffe42eced60d0 swtch+0x141()
  fffffe42eced6180 turnstile_block+0x262(0, 0, fffffe849fa4dca0, fffffffffbc08b80, 0, 0)
  fffffe42eced61f0 rw_enter_sleep+0x21d(fffffe849fa4dca0, 0)
  fffffe42eced6210 nbl_start_crit+0x19(fffffe849fa4dc40, 0)
  fffffe42eced62b0 fs_frlock+0xeb(fffffe849fa4dc40, 2a, fffffe42eced6500, 10001, 0, 0, fffffe84a90f3738, 0)
  fffffe42eced6350 zfs_frlock+0xf8(fffffe849fa4dc40, 2a, fffffe42eced6500, 10001, 0, 0, fffffe84a90f3738, 0)
  fffffe42eced63f0 fop_frlock+0x65(fffffe849fa4dc40, 2a, fffffe42eced6500, 10001, 0, 0, fffffe84a90f3738, 0)
  fffffe42eced64c0 setlock+0xa8(fffffe849fa4dc40, fffffe42eced6500, 10001, fffffe84a90f3738)
  fffffe42eced65b0 rfs4_do_lock+0x1b8(fffffe849ae96068, 2, 0, ffffffffffffffff, fffffe84a90f3738, fffffe84e8c90e28)
  fffffe42eced6620 rfs4_op_locku+0x288(ffffff39edcb38f8, fffffe84e8c90e28, fffffe42eced6c00, fffffe42eced6640)
  fffffe42eced67b0 rfs4_compound+0x1e9(fffffe42eced6a00, fffffe42eced67d0, 0, fffffe42eced6c00, 0, fffffe42eced6808)
  fffffe42eced6860 rfs4_dispatch+0x21d(ffffffffc0109010, fffffe42eced6c00, fffffe8e3f301c00, fffffe42eced6a00)
  fffffe42eced6b80 common_dispatch+0x765(fffffe42eced6c00, fffffe8e3f301c00, 2, 4, fffffffff822f796, ffffffffc0109060)
  fffffe42eced6ba0 rfs_dispatch+0x2d(fffffe42eced6c00, fffffe8e3f301c00)
  fffffe42eced6c80 svc_getreq+0x1c1(fffffe8e3f301c00, fffffe87b8fa3320)
  fffffe42eced6cf0 svc_run+0xe0(fffffe849a505668)
  fffffe42eced6d30 svc_do_run+0x8e(1)
  fffffe42eced6e20 nfssys+0xf1(e, f6cb0fbc)
  fffffe42eced6ec0 dtrace_systrace_syscall32+0xe4(e, f6cb0fbc, bc8e6ac0, e4a5e080, 1, 0)
  fffffe42eced6f10 _sys_sysenter_post_swapgs+0x149()
>
The mutex is held by thread ffffff6ce4a5e080, which is hung in fs_frlock(), called from the NFS server through ZFS. The thread is waiting to acquire rwlock fffffe849fa4dca0 as a writer. Please note the thread is working on vnode fffffe849fa4dc40.
> fffffe849fa4dca0::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
fffffe849fa4dca0        READERS=1  B011 ffffff6ce4a5e080 (W)
                                    ||
                 WRITE_WANTED -------+|
                  HAS_WAITERS --------+
>
Who is holding the rwlock as reader? The rwlock records only a reader count, not the reader threads themselves, so the holder has to be found by inspecting other stacks that reference the same vnode fffffe849fa4dc40. It is this one (note the vnode in its arguments):
> fffffe849d1deb80::findstack -v
stack pointer for thread fffffe849d1deb80: fffffe42edca41f0
[ fffffe42edca41f0 _resume_from_idle+0xf4() ]
  fffffe42edca4220 swtch+0x141()
  fffffe42edca42d0 turnstile_block+0x262(0, 0, ffffff6ce38aaea8, fffffffffbc07960, 0, 0)
  fffffe42edca4340 mutex_vector_enter+0x3c5(ffffff6ce38aaea8)
  fffffe42edca4360 rfs4_dbe_lock+0x11(ffffff6ce38aaea8)
  fffffe42edca43c0 rfs4_state_has_access+0x3c(ffffff6ce38aaf28, 1, fffffe849fa4dc40)
  fffffe42edca4470 rfs4_check_stateid+0x43b(1, fffffe849fa4dc40, fffffe849ff508c0, 0, fffffe42edca4678, 1, fffffe42edca4590)
  fffffe42edca4620 rfs4_op_read+0x1a3(fffffe849ff508b8, fffffe84e8c8e728, fffffe42edca4c00, fffffe42edca4640)
  fffffe42edca47b0 rfs4_compound+0x1e9(fffffe42edca4a00, fffffe42edca47d0, 0, fffffe42edca4c00, 0, fffffe42edca4808)
  fffffe42edca4860 rfs4_dispatch+0x145(ffffffffc0109010, fffffe42edca4c00, ffffff6ce629b200, fffffe42edca4a00)
  fffffe42edca4b80 common_dispatch+0x765(fffffe42edca4c00, ffffff6ce629b200, 2, 4, fffffffff822f796, ffffffffc0109060)
  fffffe42edca4ba0 rfs_dispatch+0x2d(fffffe42edca4c00, ffffff6ce629b200)
  fffffe42edca4c80 svc_getreq+0x1c1(ffffff6ce629b200, fffffe8824477c60)
  fffffe42edca4cf0 svc_run+0xe0(fffffe849a505668)
  fffffe42edca4d30 svc_do_run+0x8e(1)
  fffffe42edca4e20 nfssys+0xf1(e, fd360fbc)
  fffffe42edca4ec0 dtrace_systrace_syscall32+0xe4(e, fd360fbc, ec48ec40, 9d1deb80, 1, 0)
  fffffe42edca4f10 _sys_sysenter_post_swapgs+0x149()
> rfs4_op_read::dis!grep call
rfs4_op_read+0xa4:   call   +0xe427     <rfs4_check_stateid>
rfs4_op_read+0xc9:   call   +0x3838162  <kmem_free>
rfs4_op_read+0x12b:  call   +0x3857dc0  <nbl_need_check>
rfs4_op_read+0x146:  call   +0x3857d45  <nbl_start_crit>
rfs4_op_read+0x162:  call   +0x3857db9  <nbl_conflict>
rfs4_op_read+0x19e:  call   +0xe32d     <rfs4_check_stateid>
>
Looking at the sources we see:
3203         if ((stat = rfs4_check_stateid(FREAD, vp, &args->stateid, FALSE,
3204             deleg, TRUE, &ct)) != NFS4_OK) {
3205                 *cs->statusp = resp->status = stat;
3206                 goto out;
3207         }
3208
3209         /*
3210          * Enter the critical region before calling VOP_RWLOCK
3211          * to avoid a deadlock with write requests.
3212          */
3213         if (nbl_need_check(vp)) {
3214                 nbl_start_crit(vp, RW_READER);
3215                 in_crit = 1;
3216                 if (nbl_conflict(vp, NBL_READ, args->offset, args->count, 0,
3217                     &ct)) {
3218                         *cs->statusp = resp->status = NFS4ERR_LOCKED;
3219                         goto out;
3220                 }
3221         }
3222
3223         if ((stat = rfs4_check_stateid(FREAD, vp, &args->stateid, FALSE,
3224             deleg, TRUE, &ct)) != NFS4_OK) {
3225                 *cs->statusp = resp->status = stat;
3226                 goto out;
3227         }
So rfs4_op_read() is hung at line 3223: the return address rfs4_op_read+0x1a3 in the stack corresponds to the rfs4_check_stateid call at rfs4_op_read+0x19e in the disassembly above, i.e. the second call, made after the thread had already entered the critical region at line 3214.
Just to be sure: is the VFS_NBMAND (0x200) flag set on the filesystem?
> fffffe849fa4dc40::print vnode_t v_vfsp|::print vfs_t vfs_flag
vfs_flag = 0x6620
>
Yes, it is (0x6620 & 0x200 != 0). So we called nbl_start_crit() at line 3214 to acquire the following rwlock:
> fffffe849fa4dc40::print -a vnode_t v_nbllock
fffffe849fa4dca0 v_nbllock = {
    fffffe849fa4dca0 v_nbllock._opaque = [ 0xb ]
}
>
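For reference, nbl_start_crit() is essentially a thin wrapper around this per-vnode rwlock (paraphrased from memory of usr/src/uts/common/fs/vnode.c; the exact body may differ by release):

/* Enter/exit the NBMAND critical region for a vnode (paraphrased). */
void
nbl_start_crit(vnode_t *vp, krw_t rw)
{
	rw_enter(&vp->v_nbllock, rw);
}

void
nbl_end_crit(vnode_t *vp)
{
	rw_exit(&vp->v_nbllock);
}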
So, thread fffffe849d1deb80 (rfs4_op_read) holds rwlock fffffe849fa4dca0 as reader and is waiting for mutex ffffff6ce38aaea8. On the other hand, thread ffffff6ce4a5e080 (rfs4_do_lock/zfs_frlock) holds mutex ffffff6ce38aaea8 and is waiting for rwlock fffffe849fa4dca0 as writer.
Deadlock!
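The cycle reduces to a classic two-lock ordering inversion. Here is a minimal userland model of it (pthreads stand in for the kernel mutex and the v_nbllock rwlock; all names here are ours, not from the dump); with the right interleaving both threads block forever:

#include <pthread.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;	/* ffffff6ce38aaea8 */
static pthread_rwlock_t nbl_lock = PTHREAD_RWLOCK_INITIALIZER;	/* fffffe849fa4dca0 */

/* Like rfs4_op_read(): v_nbllock as reader first, then the state mutex. */
static void *
reader_path(void *arg)
{
	(void) pthread_rwlock_rdlock(&nbl_lock);  /* nbl_start_crit(vp, RW_READER) */
	(void) pthread_mutex_lock(&state_lock);   /* rfs4_dbe_lock() - blocks here */
	(void) pthread_mutex_unlock(&state_lock);
	(void) pthread_rwlock_unlock(&nbl_lock);
	return (NULL);
}

/* Like rfs4_do_lock()/setlock(): state mutex held, then v_nbllock as writer. */
static void *
locker_path(void *arg)
{
	(void) pthread_mutex_lock(&state_lock);   /* held across the call-out */
	(void) pthread_rwlock_wrlock(&nbl_lock);  /* nbl_start_crit(vp, RW_WRITER) - blocks */
	(void) pthread_rwlock_unlock(&nbl_lock);
	(void) pthread_mutex_unlock(&state_lock);
	return (NULL);
}

int
main(void)
{
	pthread_t a, b;

	(void) pthread_create(&a, NULL, reader_path, NULL);
	(void) pthread_create(&b, NULL, locker_path, NULL);
	(void) pthread_join(a, NULL);	/* never returns once both threads block */
	(void) pthread_join(b, NULL);
	return (0);
}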
The bug is in rfs4_do_lock()/setlock(): another subsystem (in this case ZFS, via VOP_FRLOCK) is called while an rfs4_dbe mutex is held, so a blocking call on that path leaves the mutex taken indefinitely.
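The usual remedy for this pattern is to never hold the mutex across the blocking call into the other subsystem. In the same toy model as above (a sketch only; the actual fix, in the commit referenced below, may be shaped differently):

/*
 * Same toy model, with the ordering inversion removed: drop the mutex
 * before the blocking call-out and reacquire (and revalidate) after.
 */
static void *
locker_path_fixed(void *arg)
{
	(void) pthread_mutex_lock(&state_lock);
	/* ... inspect/update the lock state ... */
	(void) pthread_mutex_unlock(&state_lock); /* drop before calling out */

	(void) pthread_rwlock_wrlock(&nbl_lock);  /* the blocking call-out */
	(void) pthread_rwlock_unlock(&nbl_lock);

	(void) pthread_mutex_lock(&state_lock);   /* reacquire; state may have changed */
	(void) pthread_mutex_unlock(&state_lock);
	return (NULL);
}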
A minor issue was also identified: rfs4_op_read() calls rfs4_check_stateid() twice (lines 3203 and 3223). One of the two calls is very likely unnecessary and could be removed.
Steps to reproduce
First, instrument the nfssrv kernel module to widen the deadlock window: apply the attached nfssrv.patch and rebuild the nfssrv kernel module (dmake install in usr/src/uts/intel/nfssrv).
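The patch itself is attached rather than inlined; hypothetically, widening the window can be as simple as stalling rfs4_op_read() between the nbl_start_crit() at line 3214 and the second rfs4_check_stateid() at line 3223, along these lines (our guess, not the attached patch):

	/*
	 * Hypothetical instrumentation (NOT the attached nfssrv.patch):
	 * stall while v_nbllock is held as reader, so that a concurrent
	 * LOCKU has time to reach setlock() with the state mutex held.
	 */
	delay(SEC_TO_TICK(5));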
Second, you need a helper test program to run on an NFS client. Compile the attached deadlock.c; a purely illustrative sketch of what such a helper might look like follows.
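The attached deadlock.c is likewise not shown here. A plausible shape for such a helper (an illustrative guess only; the attachment may differ) is two threads hammering the same file, one with read(2) and one with fcntl(2) byte-range locks, so that READ and LOCKU compounds race on the server:

/* Illustrative guess at a reproducer; the attached deadlock.c may differ. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *
read_loop(void *arg)
{
	int fd = *(int *)arg;
	char buf[16];

	for (;;) {
		if (pread(fd, buf, sizeof (buf), 0) < 0)
			perror("pread");
	}
	return (NULL);
}

static void *
lock_loop(void *arg)
{
	int fd = *(int *)arg;
	struct flock fl = { 0 };

	fl.l_whence = SEEK_SET;
	for (;;) {
		fl.l_type = F_WRLCK;
		(void) fcntl(fd, F_SETLKW, &fl);	/* LOCK over the wire */
		fl.l_type = F_UNLCK;
		(void) fcntl(fd, F_SETLK, &fl);		/* LOCKU -> rfs4_do_lock() */
	}
	return (NULL);
}

int
main(int argc, char **argv)
{
	pthread_t t1, t2;
	int fd;

	if (argc != 2 || (fd = open(argv[1], O_RDWR)) < 0) {
		(void) fprintf(stderr,
		    "usage: %s <file on nbmand NFSv4 mount>\n", argv[0]);
		return (1);
	}
	(void) pthread_create(&t1, NULL, read_loop, &fd);
	(void) pthread_create(&t2, NULL, lock_loop, &fd);
	(void) pthread_join(t1, NULL);
	(void) pthread_join(t2, NULL);
	return (0);
}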
Then do the following to reproduce the deadlock:
@SERVER # cp nfssrv /kernel/misc/amd64
@SERVER # bootadm update-archive
@SERVER # reboot
@SERVER # zfs create -o sharenfs=on -o nbmand=on rpool/export/test
@SERVER # chmod 777 /export/test/
@CLIENT # mount -o vers=4 SERVER:/export/test /mnt
@CLIENT # echo test > /mnt/test
@CLIENT # ./deadlock /mnt/test
Updated by Marcel Telka about 7 years ago
- Status changed from In Progress to Pending RTI
Updated by Electric Monk about 7 years ago
- Status changed from Pending RTI to Closed
- % Done changed from 0 to 100
git commit 3a36268e678711ec7f0ae522d7ef460e51186768
commit 3a36268e678711ec7f0ae522d7ef460e51186768
Author: Marcel Telka <marcel.telka@nexenta.com>
Date:   2016-01-26T22:35:58.000Z

    6554 Deadlock between rfs4_do_lock() and rfs4_op_read()
    Reviewed by: Dan Fields <dan.fields@nexenta.com>
    Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Approved by: Robert Mustacchi <rm@joyent.com>