Project

General

Profile

Bug #8040

NFSv4 client: 3-way deadlock between nfs4_bio(), nfs4_do_delegreturn(), and nfs4_flush_pages()

Added by Marcel Telka over 2 years ago. Updated over 2 years ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
nfs - NFS server and client
Start date:
2017-04-06
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

A 3-way deadlock between nfs4_readahead(), nfs4_do_delegreturn(), and nfs4_purge_caches() threads was detected.

The nfs4_readahead() thread is waiting for the nfs4_do_delegreturn() thread, while the nfs4_do_delegreturn() thread is waiting for the nfs4_purge_caches() thread, and finally, the nfs4_purge_caches() is waiting for the nfs4_readahead() thread:

> ffffff0049061c40::findstack -v
stack pointer for thread ffffff0049061c40: ffffff00490614b0
[ ffffff00490614b0 _resume_from_idle+0xf4() ]
  ffffff00490614e0 swtch+0x141()
  ffffff0049061520 cv_wait+0x70(ffffff0aa1daa750, ffffff0aa1daa748)
  ffffff0049061590 cv_wait_sig+0x82(ffffff0aa1daa750, ffffff0aa1daa748)
  ffffff00490615f0 nfs_rw_enter_sig+0x175(ffffff0aa1daa738, 1, 4)
  ffffff0049061640 wait_for_recall1+0x5e(ffffff0aaef08500, 1, ffffff0049061920, 8)
  ffffff00490616a0 wait_for_recall+0x48(ffffff0aaef08500, 0, 1, ffffff0049061920)
  ffffff0049061760 nfs4_start_fop+0xac(ffffff0aa6275000, ffffff0aaef08500, 0, 1, ffffff0049061920, 0)
  ffffff00490619b0 nfs4read+0x101(ffffff0aaef08500, ffffff0af2c29000, 14f8f8000, 8000, ffffff4c0c6bde50, ffffffce6a67e7c8, ffffff0a00000001, 0)
  ffffff0049061a70 nfs4_bio+0xf1(ffffff4c0c6bde00, 0, ffffffce6a67e7c8, 1)
  ffffff0049061b30 nfs4_readahead+0x185(ffffff0aaef08500, 14f8f8000, e1ee000, fffffffffbc34880, ffffffce6a67e7c8)
  ffffff0049061c00 nfs4_async_common_start+0x34d(ffffffce9bdd6140, 0)
  ffffff0049061c20 nfs4_async_start+0x13(ffffffce9bdd6140)
  ffffff0049061c30 thread_start+8()
> 0xffffff0046ecdc40::findstack -v
stack pointer for thread ffffff0046ecdc40: ffffff0046ecd5d0
[ ffffff0046ecd5d0 _resume_from_idle+0xf4() ]
  ffffff0046ecd600 swtch+0x141()
  ffffff0046ecd640 cv_wait+0x70(ffffff0aa1daa5a4, ffffff0aa1daa480)
  ffffff0046ecd6b0 cv_wait_sig+0x82(ffffff0aa1daa5a4, ffffff0aa1daa480)
  ffffff0046ecd760 nfs4_attr_cache+0x120(ffffff0aaef08500, ffffffc73ed05158, 2db841414e80ff, ffffff231f730400, 1, 0)
  ffffff0046ecd980 nfs4delegreturn_otw+0x153(ffffff0aa1daa3e8, ffffff231f730400, ffffff0046ecda90)
  ffffff0046ecdaf0 nfs4_do_delegreturn+0xea(ffffff0aa1daa3e8, 30, ffffff231f730400, ffffff0a8ec26b80)
  ffffff0046ecdb70 nfs4delegreturn_impl+0x1c9(ffffff0aa1daa3e8, 30, ffffff0a8ec26b80)
  ffffff0046ecdc20 nfs4delegreturn_thread+0x220(ffffff0af3540200)
  ffffff0046ecdc30 thread_start+8()
> ffffff0a9efa27c0::findstack -v
stack pointer for thread ffffff0a9efa27c0: ffffff004b79f070
[ ffffff004b79f070 _resume_from_idle+0xf4() ]
  ffffff004b79f0a0 swtch+0x141()
  ffffff004b79f0e0 cv_wait+0x70(ffffff0022af9f04, ffffff003cf040c0)
  ffffff004b79f160 page_lock_es+0x139(ffffff0022af9ec0, 1, fffffffffbc86590, 1, 0)
  ffffff004b79f190 page_lock+0x20(ffffff0022af9ec0, 1, fffffffffbc86590, 1)
  ffffff004b79f260 pvn_vplist_dirty+0x21c(ffffff0aaef08500, 0, fffffffff8157f20, 10000, ffffff231f730400)
  ffffff004b79f310 nfs4_putpages+0x298(ffffff0aaef08500, 0, 0, 10000, ffffff231f730400)
  ffffff004b79f3a0 nfs4_putpage+0xa9(ffffff0aaef08500, 0, 0, 10000, ffffff231f730400, 0)
  ffffff004b79f420 fop_putpage+0x4c(ffffff0aaef08500, 0, 0, 10000, ffffff231f730400, 0)
  ffffff004b79f460 nfs4_flush_pages+0x30(ffffff0aaef08500, ffffff231f730400)
  ffffff004b79f4f0 nfs4_purge_caches+0x1cb(ffffff0aaef08500, 0, ffffff231f730400, 0)
  ffffff004b79f5a0 nfs4_attr_cache+0x2b9(ffffff0aaef08500, ffffff938abe2080, 2db84105dd7ba7, ffffff231f730400, 0, 0)
  ffffff004b79f5f0 nfs4_update_attrcache+0x41(0, ffffff938abe2080, 2db84105dd7ba7, ffffff0aaef08500, ffffff231f730400)
  ffffff004b79fac0 nfs4setattr+0x70a(ffffff0aaef08500, ffffff004b79fae0, 0, ffffff231f730400, 0)
  ffffff004b79fbd0 nfs4_space+0xd7(ffffff0aaef08500, b, ffffff004b79fd30, 2002, 0, ffffff231f730400, 0)
  ffffff004b79fc60 fop_space+0x59(ffffff0aaef08500, b, ffffff004b79fd30, 2002, 0, ffffff231f730400, 0)
  ffffff004b79fec0 fcntl+0x474(f, 1b, 8047514)
  ffffff004b79ff10 _sys_sysenter_post_swapgs+0x149()
>

All three threads are in such a state for very long time now:

> ffffff0049061c40::thread -d
            ADDR         DISPTIME BOUND PR SWITCH
ffffff0049061c40         4cb450ac    -1  0 t-20115988
> 0xffffff0046ecdc40::thread -d
            ADDR         DISPTIME BOUND PR SWITCH
ffffff0046ecdc40         4cb450ac    -1  0 t-20115988
> ffffff0a9efa27c0::thread -d
            ADDR         DISPTIME BOUND PR SWITCH
ffffff0a9efa27c0         4cb45048    -1  0 t-20116088
>

Files

instr-8040.patch (2.36 KB) instr-8040.patch Marcel Telka, 2017-06-13 08:24 AM
test-8040.c (1.25 KB) test-8040.c Marcel Telka, 2017-06-13 08:26 AM

History

#1

Updated by Marcel Telka over 2 years ago

Crash dump analysis

We detected a stuck nfs4_ephemeral_harvester():

> ::stacks -c nfs4_ephemeral_harvester|::thread -d
            ADDR         DISPTIME BOUND PR SWITCH
ffffff003e7f7c40         4cb49c5a    -1  0 t-20096614
> ::stacks -c nfs4_ephemeral_harvester|::findstack -v
stack pointer for thread ffffff003e7f7c40: ffffff003e7f7890
[ ffffff003e7f7890 _resume_from_idle+0xf4() ]
  ffffff003e7f78c0 swtch+0x141()
  ffffff003e7f7900 cv_wait+0x70(ffffff0aa627514e, ffffff0aa6275150)
  ffffff003e7f7970 cv_wait_sig+0x82(ffffff0aa627514e, ffffff0aa6275150)
  ffffff003e7f79c0 nfs4_async_stop_sig+0x84(ffffffce9bdd6140)
  ffffff003e7f7a30 nfs4_unmount+0x64(ffffffce9bdd6140, 8, ffffff0a871bfdb0)
  ffffff003e7f7a60 fsop_unmount+0x1b(ffffffce9bdd6140, 8, ffffff0a871bfdb0)
  ffffff003e7f7ab0 dounmount+0x57(ffffffce9bdd6140, 8, ffffff0a871bfdb0)
  ffffff003e7f7b00 umount2_engine+0x96(ffffffce9bdd6140, 8, ffffff0a871bfdb0, 0)
  ffffff003e7f7b40 nfs4_ephemeral_record_umount+0x39(ffffffce9bdd6140, 8, ffffff0a908a15c8, ffffff0be6aa9070)
  ffffff003e7f7bf0 nfs4_ephemeral_harvest_forest+0x1db(ffffff0a9dc4b888, 0, 1)
  ffffff003e7f7c20 nfs4_ephemeral_harvester+0x77(ffffff0a9dc4b888)
  ffffff003e7f7c30 thread_start+8()
>

The nfs4_ephemeral_harvester() thread is waiting for mi->mi_async_cv with address ffffff0aa627514e.

> ::offsetof mntinfo4_t mi_async_cv
offsetof (mntinfo4_t, mi_async_cv) = 0x14e, sizeof (...->mi_async_cv) = 2
> ffffff0aa627514e-0x14e=K
                ffffff0aa6275000
>

The mntinfo4 is ffffff0aa6275000.

> ffffff0aa6275000::print -a mntinfo4_t mi_async_reqs mi_async_tail
ffffff0aa62750a0 mi_async_reqs = [ 0, 0, 0, 0, 0, 0 ]
ffffff0aa62750d0 mi_async_tail = [ 0, 0, 0, 0xffffff0ab6a60400, 0, 0 ]
>

The async requests queue is empty, but the last request we removed from there is still in mi_async_tail. It is 0xffffff0ab6a60400.

> 0xffffff0ab6a60400::print struct nfs4_async_reqs
{
    a_next = 0
    a_vp = 0xffffff0aaef08500
    a_cred = 0xffffffce6a67e7c8
    a_io = 3 (NFS4_READ_AHEAD)
    a_args = {
        a_read_args = {
            readahead = nfs4_readahead
            blkoff = 0x14f910000
            seg = kmapseg
            addr = 0x1097000
        }
        a_pageio_args = {
            pageio = nfs4_readahead
            pp = 0x14f910000
            io_off = 0xfffffffffbc34880
            io_len = 0x1097000
            flags = 0
        }
        a_readdir_args = {
            readdir = nfs4_readahead
            rdc = 0x14f910000
        }
        a_commit_args = {
            commit = nfs4_readahead
            plist = 0x14f910000
            offset = 0xfffffffffbc34880
            count = 0x1097000
        }
    }
}
>

The request I/O type is NFS4_READ_AHEAD for vnode 0xffffff0aaef08500. The request should be handled by the nfs4_readahead() function (a_args.a_read_args.readahead).

> ::stacks -c nfs4_readahead|::findstack -v!grep nfs4_readahead
  ffffff0049061b30 nfs4_readahead+0x185(ffffff0aaef08500, 14f8f8000, e1ee000, fffffffffbc34880, ffffffce6a67e7c8)
  ffffff004cf81b30 nfs4_readahead+0x185(ffffff0aaef08500, 14f908000, 1093000, fffffffffbc34880, ffffffce6a67e7c8)
  ffffff004593cb30 nfs4_readahead+0x185(ffffff0aaef08500, 14f900000, 108f000, fffffffffbc34880, ffffffce6a67e7c8)
  ffffff00434c5b30 nfs4_readahead+0x185(ffffff0aaef08500, 14f910000, 1097000, fffffffffbc34880, ffffffce6a67e7c8)
>

We have 4 nfs4_readahead() threads. All of them are working on our vnode (ffffff0aaef08500) with different offsets.

> ::stacks -c nfs4_readahead
THREAD           STATE    SOBJ                COUNT
ffffff0049061c40 SLEEP    CV                      4
                 swtch+0x141
                 cv_wait+0x70
                 cv_wait_sig+0x82
                 nfs_rw_enter_sig+0x175
                 wait_for_recall1+0x5e
                 wait_for_recall+0x48
                 nfs4_start_fop+0xac
                 nfs4read+0x101
                 nfs4_bio+0xf1
                 nfs4_readahead+0x185
                 nfs4_async_common_start+0x34d
                 nfs4_async_start+0x13
                 thread_start+8

> ::stacks -c nfs4_readahead|::findstack -v!grep wait_for_recall\\\>
  ffffff00490616a0 wait_for_recall+0x48(ffffff0aaef08500, 0, 1, ffffff0049061920)
  ffffff004cf816a0 wait_for_recall+0x48(ffffff0aaef08500, 0, 1, ffffff004cf81920)
  ffffff004593c6a0 wait_for_recall+0x48(ffffff0aaef08500, 0, 1, ffffff004593c920)
  ffffff00434c56a0 wait_for_recall+0x48(ffffff0aaef08500, 0, 1, ffffff00434c5920)
>

All 4 threads are waiting for the delegation recall for our vnode and they are waiting for that for pretty long time:

> ::stacks -c nfs4_readahead|::thread -d
            ADDR         DISPTIME BOUND PR SWITCH
ffffff0049061c40         4cb450ac    -1  0 t-20115988
ffffff004cf81c40         4cb450ac    -1  0 t-20115988
ffffff004593cc40         4cb450ac    -1  0 t-20115988
ffffff00434c5c40         4cb450ac    -1  0 t-20115988
>

The threads are waiting to get the ffffff0aa1daa738 nfs_rwlock as READER:

> ffffff0049061c40::findstack -v
stack pointer for thread ffffff0049061c40: ffffff00490614b0
[ ffffff00490614b0 _resume_from_idle+0xf4() ]
  ffffff00490614e0 swtch+0x141()
  ffffff0049061520 cv_wait+0x70(ffffff0aa1daa750, ffffff0aa1daa748)
  ffffff0049061590 cv_wait_sig+0x82(ffffff0aa1daa750, ffffff0aa1daa748)
  ffffff00490615f0 nfs_rw_enter_sig+0x175(ffffff0aa1daa738, 1, 4)
  ffffff0049061640 wait_for_recall1+0x5e(ffffff0aaef08500, 1, ffffff0049061920, 8)
  ffffff00490616a0 wait_for_recall+0x48(ffffff0aaef08500, 0, 1, ffffff0049061920)
  ffffff0049061760 nfs4_start_fop+0xac(ffffff0aa6275000, ffffff0aaef08500, 0, 1, ffffff0049061920, 0)
  ffffff00490619b0 nfs4read+0x101(ffffff0aaef08500, ffffff0af2c29000, 14f8f8000, 8000, ffffff4c0c6bde50, ffffffce6a67e7c8, ffffff0a00000001, 0)
  ffffff0049061a70 nfs4_bio+0xf1(ffffff4c0c6bde00, 0, ffffffce6a67e7c8, 1)
  ffffff0049061b30 nfs4_readahead+0x185(ffffff0aaef08500, 14f8f8000, e1ee000, fffffffffbc34880, ffffffce6a67e7c8)
  ffffff0049061c00 nfs4_async_common_start+0x34d(ffffffce9bdd6140, 0)
  ffffff0049061c20 nfs4_async_start+0x13(ffffffce9bdd6140)
  ffffff0049061c30 thread_start+8()
>

But the lock is held as WRITER by 0xffffff0046ecdc40:

> ffffff0aaef08500::print vnode_t v_data|::print -a rnode4_t r_deleg_recall_lock
ffffff0aa1daa738 r_deleg_recall_lock = {
    ffffff0aa1daa738 r_deleg_recall_lock.count = 0xfffffffe
    ffffff0aa1daa73c r_deleg_recall_lock.waiters = 0
    ffffff0aa1daa740 r_deleg_recall_lock.owner = 0xffffff0046ecdc40
    ffffff0aa1daa748 r_deleg_recall_lock.lock = {
        ffffff0aa1daa748 _opaque = [ 0 ]
    }
    ffffff0aa1daa750 r_deleg_recall_lock.cv = {
        ffffff0aa1daa750 _opaque = 0x4
    }
}
>

The 0xffffff0046ecdc40 thread is blocked on a condition variable:

> 0xffffff0046ecdc40::findstack -v
stack pointer for thread ffffff0046ecdc40: ffffff0046ecd5d0
[ ffffff0046ecd5d0 _resume_from_idle+0xf4() ]
  ffffff0046ecd600 swtch+0x141()
  ffffff0046ecd640 cv_wait+0x70(ffffff0aa1daa5a4, ffffff0aa1daa480)
  ffffff0046ecd6b0 cv_wait_sig+0x82(ffffff0aa1daa5a4, ffffff0aa1daa480)
  ffffff0046ecd760 nfs4_attr_cache+0x120(ffffff0aaef08500, ffffffc73ed05158, 2db841414e80ff, ffffff231f730400, 1, 0)
  ffffff0046ecd980 nfs4delegreturn_otw+0x153(ffffff0aa1daa3e8, ffffff231f730400, ffffff0046ecda90)
  ffffff0046ecdaf0 nfs4_do_delegreturn+0xea(ffffff0aa1daa3e8, 30, ffffff231f730400, ffffff0a8ec26b80)
  ffffff0046ecdb70 nfs4delegreturn_impl+0x1c9(ffffff0aa1daa3e8, 30, ffffff0a8ec26b80)
  ffffff0046ecdc20 nfs4delegreturn_thread+0x220(ffffff0af3540200)
  ffffff0046ecdc30 thread_start+8()
> 0xffffff0046ecdc40::thread -d
            ADDR         DISPTIME BOUND PR SWITCH
ffffff0046ecdc40         4cb450ac    -1  0 t-20115988
>

We are waiting there because r_serial is non-NULL:

483             while (rp->r_serial != NULL) {
484                     if (!cv_wait_sig(&rp->r_cv, &rp->r_statelock)) {
485                             mutex_exit(&rp->r_statelock);
486                             if (lwp != NULL)
487                                     lwp->lwp_nostop--;
488                             return;
489                     }
490             }

The r_serial is a pointer to another nfs4_attr_cache() thread we are waiting for:

> ffffff0aaef08500::print vnode_t v_data|::print rnode4_t r_serial|::findstack -v
stack pointer for thread ffffff0a9efa27c0: ffffff004b79f070
[ ffffff004b79f070 _resume_from_idle+0xf4() ]
  ffffff004b79f0a0 swtch+0x141()
  ffffff004b79f0e0 cv_wait+0x70(ffffff0022af9f04, ffffff003cf040c0)
  ffffff004b79f160 page_lock_es+0x139(ffffff0022af9ec0, 1, fffffffffbc86590, 1, 0)
  ffffff004b79f190 page_lock+0x20(ffffff0022af9ec0, 1, fffffffffbc86590, 1)
  ffffff004b79f260 pvn_vplist_dirty+0x21c(ffffff0aaef08500, 0, fffffffff8157f20, 10000, ffffff231f730400)
  ffffff004b79f310 nfs4_putpages+0x298(ffffff0aaef08500, 0, 0, 10000, ffffff231f730400)
  ffffff004b79f3a0 nfs4_putpage+0xa9(ffffff0aaef08500, 0, 0, 10000, ffffff231f730400, 0)
  ffffff004b79f420 fop_putpage+0x4c(ffffff0aaef08500, 0, 0, 10000, ffffff231f730400, 0)
  ffffff004b79f460 nfs4_flush_pages+0x30(ffffff0aaef08500, ffffff231f730400)
  ffffff004b79f4f0 nfs4_purge_caches+0x1cb(ffffff0aaef08500, 0, ffffff231f730400, 0)
  ffffff004b79f5a0 nfs4_attr_cache+0x2b9(ffffff0aaef08500, ffffff938abe2080, 2db84105dd7ba7, ffffff231f730400, 0, 0)
  ffffff004b79f5f0 nfs4_update_attrcache+0x41(0, ffffff938abe2080, 2db84105dd7ba7, ffffff0aaef08500, ffffff231f730400)
  ffffff004b79fac0 nfs4setattr+0x70a(ffffff0aaef08500, ffffff004b79fae0, 0, ffffff231f730400, 0)
  ffffff004b79fbd0 nfs4_space+0xd7(ffffff0aaef08500, b, ffffff004b79fd30, 2002, 0, ffffff231f730400, 0)
  ffffff004b79fc60 fop_space+0x59(ffffff0aaef08500, b, ffffff004b79fd30, 2002, 0, ffffff231f730400, 0)
  ffffff004b79fec0 fcntl+0x474(f, 1b, 8047514)
  ffffff004b79ff10 _sys_sysenter_post_swapgs+0x149()
> ffffff0a9efa27c0::thread -d
            ADDR         DISPTIME BOUND PR SWITCH
ffffff0a9efa27c0         4cb45048    -1  0 t-20116088
>

The thread is operating on page ffffff0022af9ec0 and it is waiting until the page is unlocked. Who is holding the lock on the page?

> ffffff0022af9ec0::print page_t p_offset p_vnode p_selock
p_offset = 0x14f8f8000
p_vnode = 0xffffff0aaef08500
p_selock = 0x89061c40
>

The page belongs to vnode 0xffffff0aaef08500, offset 0x14f8f8000 and the lock holder is stored in p_selock. But the pointer to the holder is mangled there. Only low 30 bits of its address are stored in p_selock. Highest bit in the p_selock is always set to 1, and the 2nd highest bit is always cleared (see the SE_WRITER definition). There are two possible threads holding the page locked:

> ::thread!grep 9061c40
ffffff0049061c40 sleep       8 2000    3    60     0   0              n/a
ffffff0079061c40 sleep       8 2000    3    60     0   0              n/a
> ffffff0049061c40::findstack -v
stack pointer for thread ffffff0049061c40: ffffff00490614b0
[ ffffff00490614b0 _resume_from_idle+0xf4() ]
  ffffff00490614e0 swtch+0x141()
  ffffff0049061520 cv_wait+0x70(ffffff0aa1daa750, ffffff0aa1daa748)
  ffffff0049061590 cv_wait_sig+0x82(ffffff0aa1daa750, ffffff0aa1daa748)
  ffffff00490615f0 nfs_rw_enter_sig+0x175(ffffff0aa1daa738, 1, 4)
  ffffff0049061640 wait_for_recall1+0x5e(ffffff0aaef08500, 1, ffffff0049061920, 8)
  ffffff00490616a0 wait_for_recall+0x48(ffffff0aaef08500, 0, 1, ffffff0049061920)
  ffffff0049061760 nfs4_start_fop+0xac(ffffff0aa6275000, ffffff0aaef08500, 0, 1, ffffff0049061920, 0)
  ffffff00490619b0 nfs4read+0x101(ffffff0aaef08500, ffffff0af2c29000, 14f8f8000, 8000, ffffff4c0c6bde50, ffffffce6a67e7c8, ffffff0a00000001, 0)
  ffffff0049061a70 nfs4_bio+0xf1(ffffff4c0c6bde00, 0, ffffffce6a67e7c8, 1)
  ffffff0049061b30 nfs4_readahead+0x185(ffffff0aaef08500, 14f8f8000, e1ee000, fffffffffbc34880, ffffffce6a67e7c8)
  ffffff0049061c00 nfs4_async_common_start+0x34d(ffffffce9bdd6140, 0)
  ffffff0049061c20 nfs4_async_start+0x13(ffffffce9bdd6140)
  ffffff0049061c30 thread_start+8()
> ffffff0079061c40::findstack -v
stack pointer for thread ffffff0079061c40: ffffff0079061b20
[ ffffff0079061b20 _resume_from_idle+0xf4() ]
  ffffff0079061b50 swtch+0x141()
  ffffff0079061b90 cv_wait+0x70(ffffff43fef3a130, ffffff43fef3a150)
  ffffff0079061c20 nfs4_async_manager+0xf4(ffffffbb2b7ca440)
  ffffff0079061c30 thread_start+8()
>

Obviously, the first one is more suspicious. It is also working on the same vnode ffffff0aaef08500 with the same offset 14f8f8000. And what does nfs4_readahead() before it calls nfs4_bio()? It allocates and locks the page(s) for the area it wants to read.

#2

Updated by Marcel Telka over 2 years ago

Steps to reproduce the hang

To reproduce the problem you need a NFS server and one NFS client.

As a prerequisite we need to instrument the NFS client using the attached instr-8040.patch file to enlarge the race windows to make sure we reproduce the hang reliably.

For the reproduction we will use a test application attached as test-8040.c doing the right sequence of operations needed to reproduce the bug.

# Share a dir and create a file we will work with
SERVER# share /export; mkfile 100m /export/test ; chmod 777 /export/test

# Mount the shared directory
CLIENT# mount -o vers=4 SERVER:/export /mnt 

# Enable the delay loops to enlarge the race windows
CLIENT# mdb -kwe 'repro_8040/W 1'

# Create the readahead and nfs4_space threads
# Note: the following command will hang
CLIENT# ./test-8040 /mnt/test

# Make sure the grace period is over (if there is any)
CLIENT# sleep 100

# remove the file at the server so the NFS server asks for deleg return
SERVER# rm /export/test

# Unblock the readahead thread so it can start to wait for the deleg return thread
CLIENT# mdb -kwe 'repro_8040_readahead_thread/W 0'
# Wait for a while to make sure the readahead thread is safely waiting for the deleg return
CLIENT# sleep 2
# Unblock the nfs4_space thread so it can deadlock with the other two threads
CLIENT# mdb -kwe 'repro_8040_space_thread/W 0'

Now you will find the NFS client hung:

# mdb -ke '::stacks -c page_lock_es|::thread -d'
            ADDR         DISPTIME BOUND PR SWITCH
ffffff01f7b90ba0             8a78    -1  0 t-62930
#
#3

Updated by Marcel Telka over 2 years ago

With slightly modified instrumentation I'm able to reproduce the deadlock using nfs4_read() thread too:

> ::stacks -m nfs|::stacks -C nfs4_inactive_thread|::stacks -C nfs4_async_manager|::stacks -C nfssys|::stacks -C nfs4_renew_lease_thread
THREAD           STATE    SOBJ                COUNT
ffffff0006b17c40 SLEEP    CV                      1
                 swtch+0x18a
                 cv_wait+0x89
                 cv_wait_sig+0x2cb
                 nfs4_attr_cache+0x148
                 nfs4delegreturn_otw+0x153
                 nfs4_do_delegreturn+0x111
                 nfs4delegreturn_impl+0x219
                 nfs4delegreturn_thread+0x22f
                 thread_start+8

ffffff020f5283c0 SLEEP    CV                      1
                 swtch+0x18a
                 cv_wait+0x89
                 page_lock_es+0x1c7
                 page_lock+0x20
                 pvn_vplist_dirty+0x244
                 nfs4_putpages+0x2dc
                 nfs4_putpage+0xbc
                 fop_putpage+0x4c
                 nfs4_flush_pages+0x30
                 nfs4_purge_caches+0x1cb
                 nfs4_attr_cache+0x32f
                 nfs4_update_attrcache+0x8e
                 nfs4setattr+0x78d
                 nfs4_space+0x118
                 fop_space+0x59
                 fcntl+0x50b
                 _sys_sysenter_post_swapgs+0x237

ffffff01f8131b60 SLEEP    CV                      1
                 swtch+0x18a
                 cv_wait_sig+0x1b5
                 nfs_rw_enter_sig+0x17f
                 wait_for_recall1+0x5e
                 wait_for_recall+0x57
                 nfs4_start_fop+0x13a
                 nfs4read+0x166
                 nfs4_bio+0x129
                 nfs4_getapage+0x30b
                 pvn_getpages+0x181
                 nfs4_getpage+0x1ae
                 fop_getpage+0x7e
                 vpm_map_pages+0x303
                 vpm_data_copy+0x8a
                 nfs4_read+0x315
                 fop_read+0x5b
                 read+0x2b8
                 read32+0x1e
                 _sys_sysenter_post_swapgs+0x237

> ::stacks -m nfs|::stacks -C nfs4_inactive_thread|::stacks -C nfs4_async_manager|::stacks -C nfssys|::stacks -C nfs4_renew_lease_thread|::thread -d
            ADDR         DISPTIME BOUND PR SWITCH
ffffff0006b17c40             4f73    -1  0 t-375593
ffffff020f5283c0             63fe    -1  0 t-370334
ffffff01f8131b60             5652    -1  0 t-373834
>
#4

Updated by Marcel Telka over 2 years ago

  • Subject changed from NFSv4 client: 3-way deadlock between nfs4_readahead(), nfs4_do_delegreturn(), and nfs4_purge_caches() to NFSv4 client: 3-way deadlock between nfs4_bio(), nfs4_do_delegreturn(), and nfs4_flush_pages()
#6

Updated by Vitaliy Gusev over 2 years ago

In short, this is a deadlock between taking resources:

A:  page->p_selock
B:  rnode4->r_deleg_recall_lock
C:  rnode4->r_serial (via rnode4->r_cv)

Threads T1 (nfs4_bio), T2 (nfs4_do_delegreturn), T3 (nfs4_flush_pages)
and order of getting resources :

T1:   A, B
T2:   B, C 
T3:   C, A

Also available in: Atom PDF