Bug #6458


nfsauth_cache_get() deadlock with kmem reap thread

Added by Matthew Ahrens almost 8 years ago. Updated about 7 years ago.

nfs - NFS server and client
Start date:
Due date:
% Done:


Estimated time:
Gerrit CR:
External Bug:


When the system runs low on memory it will invoke kmem_reap() to reap all the kmem caches. The reaping processes each cache serially, first calling the reclaim callback. We have seen a couple of caches where the reclaim callback can block the reaping thread for a significant amount of time. The stack looks like this:

ffffff00d2417c40 SLEEP RWLOCK 1

Since the reaping is treated like a queue it's not possible to reap any other caches until we're finished reclaiming the exi_cache[s]. The exi_cache_trim() function is trying to grab the rw lock as writer and has to wait for all readers to finish processing any work. The problem is that the threads holding the rw lock as reader may end up getting stuck waiting on an allocation. It would make more sense to make this a best effort attempt and only process the reclaim if we're able to obtain the rw lock as writer.

The reclaim callback should call rw_tryenter() instead of blocking. This will ensure that we don't block the main thread which is trying to free memory.

Related issues

Related to illumos gate - Feature #5296: Support for more than 16 groups with AUTH_SYSClosedMarcel Telka2014-11-07

Actions #1

Updated by Marcel Telka almost 8 years ago

I assume this caused a deadlock with one thread in exi_cache_trim(), the other one blocked here:

and some other blocked threads between them (at least one, maybe more).

Could you please share the crash dump file, or at least stacks of the deadlocked threads?

I tried to find what exactly is at exi_cache_trim+0x40, but I'm unable to do so with the recent OI hipster kernel (different compiler used?):

# echo exi_cache_trim::dis | mdb -k | head -20
exi_cache_trim:                 pushq  %rbp
exi_cache_trim+1:               movq   %rdi,%rax
exi_cache_trim+4:               addq   $0x100,%rax
exi_cache_trim+0xa:             movq   %rsp,%rbp
exi_cache_trim+0xd:             subq   $0x70,%rsp
exi_cache_trim+0x11:            movq   %rbx,-0x38(%rbp)
exi_cache_trim+0x15:            movq   %r12,-0x30(%rbp)
exi_cache_trim+0x19:            movq   %r13,-0x28(%rbp)
exi_cache_trim+0x1d:            movq   %r14,-0x20(%rbp)
exi_cache_trim+0x21:            movq   %r15,-0x18(%rbp)
exi_cache_trim+0x25:            movq   %rdi,-0x8(%rbp)
exi_cache_trim+0x29:            movq   %rdi,-0x70(%rbp)
exi_cache_trim+0x2d:            movq   %rax,-0x68(%rbp)
exi_cache_trim+0x31:            movq   $0x0,-0x60(%rbp)
exi_cache_trim+0x39:            movq   -0x70(%rbp),%rax
exi_cache_trim+0x3d:            movq   -0x60(%rbp),%rdx
exi_cache_trim+0x41:            movq   0x128(%rax,%rdx),%rdx
exi_cache_trim+0x49:            movq   %rdx,-0x58(%rbp)
exi_cache_trim+0x4d:            call   +0x359b46e       <gethrestime_sec>
exi_cache_trim+0x52:            movq   -0x68(%rbp),%rdi

What rw_enter() call is at exi_cache_trim+0x40 ? There are several options, so I'd like to be sure:

Thank you.

Actions #2

Updated by Matthew Ahrens almost 8 years ago

Marcel: it's both of the latter ones. See the code here:

Actions #3

Updated by Marcel Telka almost 8 years ago

Matthew Ahrens wrote:

Marcel: it's both of the latter ones. See the code here:

I saw the code, but I'm not sure the fix is complete, so I try to understand as much as possible.

How it could be "both" when you pasted one stack only? Could you please share more info? Thanks.

Actions #4

Updated by George Wilson almost 8 years ago

Keep in mind that this deadlock was detected prior to the changes made by Marcel's change:

commit 71da0c32412fe8362672d1dba2fa4c6b061445b6
Author: Marcel Telka <>
Date: Tue Jan 13 00:22:41 2015 +0100

5509 nfsauth_cache_get() could spend a lot of time walking exi_cache
Reviewed by: Gordon Ross &lt;&gt;
Reviewed by: Robert Mustacchi &lt;&gt;
Approved by: Dan McDonald &lt;&gt;

The original fix was not a straightforward merge

Here's another thread that is involved in the deadlock:

stack pointer for thread ffffff8b13654120: ffffff03d6be34c0
ffffff03d6be3520 do_interrupt+0xba(ffffff8579564ac0, ffffff03d6be35b0)
ffffff03d6be35b0 0x282()
ffffff03d6be36e0 0xffffff8579566000()
ffffff03d6be3720 apic_intr_exit+0x3f(0, f0)
ffffff03d6be3780 hilevel_intr_prolog+0x3d(ffffff8579566000, f, 0,
ffffff03d6be37e0 do_interrupt+0x120(ffffff03d6be37f0, 0)
ffffff03d6be37f0 _interrupt+0xba()
ffffff03d6be3910 mutex_delay_default+7()
ffffff03d6be3980 mutex_vector_enter+0xcc(fffffffffbc364b8)
ffffff03d6be3a00 htable_steal+0x9c(1)
ffffff03d6be3a90 htable_alloc+0xf7(ffffff85b265ea88, fee00000, 0, 0)
ffffff03d6be3b50 htable_create+0x1e9(ffffff85b265ea88, feef6000, 0, 0)
ffffff03d6be3be0 hati_load_common+0x198(ffffff85b265ea88, feef6000,
ffffff01f0e09370, 40d, 400, 0, 405e734)
ffffff03d6be3c60 hat_memload+0x75(ffffff85b265ea88, feef6000, ffffff01f0e09370
, d, 400)
ffffff03d6be3ca0 hat_memload_region+0x25(ffffff85b265ea88, feef6000,
ffffff01f0e09370, d, 400, ffffffffffffffff)
ffffff03d6be3dc0 segvn_faultpage+0x462(ffffff85b265ea88, ffffff85b4ef1c98,
feef6000, d6000, ffffff85b4a39e2c, ffffff03d6be3e90, ffffff850000000d, 0,
ffffff0300000001, ffffff8500000000)
ffffff03d6be3f90 segvn_fault+0x8e6(ffffff85b265ea88, ffffff85b4ef1c98,
feef6000, 1000, 0, 1)
ffffff03d6be40a0 as_fault+0x31a(ffffff85b265ea88, ffffff85b318ede0, feef6835,
1, 0, 1)
ffffff03d6be4130 pagefault+0x96(feef6835, 0, 1, 0)
ffffff03d6be4240 trap+0xd23(ffffff03d6be4250, feef6835, 25)
ffffff03d6be4250 0xfffffffffb8001d6()
ffffff03d6be4360 bcopy+0x208()
ffffff03d6be43c0 copyin_nowatch+0x4b(feef6835, ffffff9d32f55fb0, 1)
ffffff03d6be43f0 prstop+0x61(1, 0)
ffffff03d6be44a0 door_upcall+0x11d(ffffff85b14bec80, ffffff03d6be4620, 0,
ffffffffffffffff, 0)
ffffff03d6be4520 door_ki_upcall_limited+0x67(ffffff85b0f3b670,
ffffff03d6be4620, 0, ffffffffffffffff, 0)
ffffff03d6be4560 stubs_common_code+0x51()
ffffff03d6be4730 nfsauth_retrieve+0x1dd(ffffffbaafd7ca00, ffffff99c37479a0, 1
, ffffff03d6be4760, ffffff03d6be477c)
ffffff03d6be47d0 nfsauth_cache_get+0x2a1(ffffffbaafd7ca00, ffffff03d6be4ca0, 1
ffffff03d6be4840 nfsauth_access+0xc6(ffffffbaafd7ca00, ffffff03d6be4ca0)
ffffff03d6be4900 checkauth+0xca(ffffffbaafd7ca00, ffffff03d6be4ca0,
ffffff8b59664740, 0, 0)
ffffff03d6be4c20 common_dispatch+0x289(ffffff03d6be4ca0, ffffff8fc00d1800, 2,
4, fffffffff8503c86, ffffffffc01a3060)
ffffff03d6be4c40 rfs_dispatch+0x2d(ffffff03d6be4ca0, ffffff8fc00d1800)
ffffff03d6be4d20 svc_getreq+0x1c1(ffffff8fc00d1800, ffffff8bd3fd8720)
ffffff03d6be4d90 svc_run+0x146(ffffff85b4efd0b0)
ffffff03d6be4dd0 svc_do_run+0x8e(1)
ffffff03d6be4ec0 nfssys+0xf1(e, fd050fbc)
ffffff03d6be4f10 _sys_sysenter_post_swapgs+0x149()

This thread is stuck waiting on nfsd which is stuck waiting on memory. So in order to give memory back (in particular for ZFS to give memory back) we need to allow the reaper taskq to make progress.

Actions #5

Updated by Marcel Telka almost 8 years ago

I'm not surprised it happened before #5509, since the problem was introduced by #5296.

Actions #6

Updated by Marcel Telka almost 8 years ago

  • Category set to nfs - NFS server and client
  • Status changed from New to In Progress
  • Assignee changed from Matthew Ahrens to Marcel Telka
Actions #7

Updated by Marcel Telka almost 8 years ago

  • Related to Feature #5296: Support for more than 16 groups with AUTH_SYS added
Actions #8

Updated by Marcel Telka almost 8 years ago

Possible deadlock scenario

Since I do not have access to the crash dump file, nor the stack list, I cannot be sure what exactly happened in the real life. Here is one possible scenario of the deadlock:

The problem is caused by the memory allocation call at line 1119:

1117        if (ngids != NULL && gids != NULL) {
1118            *ngids = p->auth_srv_ngids;
1119            *gids = kmem_alloc(*ngids * sizeof (gid_t), KM_SLEEP);
1120            bcopy(p->auth_srv_gids, *gids, *ngids * sizeof (gid_t));
1121        }

The kmem_alloc() is called with the held auth_lock mutex and the call might block waiting for the memory. Because of this, some other thread might be blocked, for example at line 951 here with authc_lock rw lock held (the rw lock is held as reader):

951    mutex_enter(&p->auth_lock);
952    rw_exit(&c->authc_lock);

And because of this, another thread might get blocked asking for the authc_lock writer lock, for example at line 932 here:

932        rw_enter(&c->authc_lock, RW_WRITER);
933        rw_exit(&exi->exi_cache_lock);

There (at line 932) the exi_cache_lock is held as reader. Next, we might have some other thread waiting for the exi_cache_lock writer lock, for example here at line 883:

878        rw_init(&nc->authc_lock, NULL, RW_DEFAULT, NULL);
879        avl_create(&nc->authc_tree, nfsauth_cache_compar,
880            sizeof (struct auth_cache),
881            offsetof(struct auth_cache, auth_link));
883        rw_enter(&exi->exi_cache_lock, RW_WRITER);
884        c = (struct auth_cache_clnt *)avl_find(tree, &acc, &where);

Now, when the exi_cache reaper is running (the exi_cache_trim() function) it might hang on every rw_enter() and/or mutex_enter() call there.

Root cause

The root cause is at line 1119 where the kmem_alloc(KM_SLEEP) is called with a lock held. To fix this, we need either to change the kmem_alloc() call so it is KM_NOSLEEP, or make sure there is no lock held during the kmem_alloc(KM_SLEEP) call.

Actions #9

Updated by Gordon Ross almost 8 years ago

If you can find a way to make the kmem_alloc call where no locks are held, that would be nice.
One method often used is to do the allocation "speculatively" (before we know whether we'll actually need it or not) before taking locks, then take locks, and either insert the new object or free it.

Actions #10

Updated by Marcel Telka almost 8 years ago

Gordon Ross wrote:

If you can find a way to make the kmem_alloc call where no locks are held, that would be nice.
One method often used is to do the allocation "speculatively" (before we know whether we'll actually need it or not) before taking locks, then take locks, and either insert the new object or free it.

This case is not about "inserting objects". It is about "copying some data from existing object where the lock needs to be hold to make sure the data in the object are valid". I cannot do the allocation speculatively, since I do not know the size of the buffer in advance. The size is known (for sure) only when the lock is held.

I'm working on a fix with kmem_alloc(KM_NOSLEEP). It seems to be easily doable with little cost. Otherwise I'd need to do non-trivial refactoring of the code, which is risky.

Actions #11

Updated by Marcel Telka almost 8 years ago

  • Subject changed from kmem reap thread gets blocked in reclaim callback to nfsauth_cache_get() deadlock with kmem reap thread
Actions #12

Updated by Electric Monk about 7 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 8ee802cce77385ce86e5fe8520cee28ac9fd4669

commit  8ee802cce77385ce86e5fe8520cee28ac9fd4669
Author: George Wilson <>
Date:   2016-07-15T21:36:59.000Z

    6458 nfsauth_cache_get() deadlock with kmem reap thread
    Reviewed by: Paul Dagnelie <>
    Reviewed by: Matthew Ahrens <>
    Reviewed by: Sebastien Roy <>
    Reviewed by: Steven Hartland <>
    Reviewed by: Dan McDonald <>
    Reviewed by: Josef 'Jeff' Sipek <>
    Approved by: Gordon Ross <>


Also available in: Atom PDF