Project

General

Profile

Bug #3498

panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)

Added by Christopher Siden over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Category:
zfs - Zettabyte File System
Start date:
2013-01-23
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

This is a panic that only occurs on debug builds, although on non-debug builds this might cause a deadlock. From George Wilson's bug report and analysis at Delphix:

The system panics in arc_read():

vpanic()
assfail+0x73(fffffffff78ee4b8, fffffffff78ef060, af9)
arc_read+0x114(ffffff165b700ab0, ffffff164b8b1000, ffffffe4a7bcf080, 
ffffff17e59fa7d8, fffffffff7822500, ffffff33d1cd8018)
dsl_read+0x33(ffffff165b700ab0, ffffff164b8b1000, ffffffe4a7bcf080, 
ffffff17e59fa7d8, fffffffff7822500, ffffff33d1cd8018)
dbuf_read_impl+0x1cb(ffffff33d1cd8018, ffffff165b700ab0, ffffff009cf32fc8)
dbuf_read+0x1de(ffffff33d1cd8018, ffffff165b700ab0, a)
dmu_tx_check_ioerr+0x6b(ffffff165b700ab0, ffffffd2c3800740, 0, 7ae1)
dmu_tx_count_write+0x1bd(ffffff16878e9cc8, f5c0f600, 20000)
dmu_tx_hold_write+0x96(ffffff1633709380, e, f5c0f600, 20000)
zfs_write+0x6f9(ffffff2736592200, ffffff009cf33630, 10, ffffff4740342388, 
ffffff009cf334c0)
fop_write+0x6b(ffffff2736592200, ffffff009cf33630, 10, ffffff4740342388, 
ffffff009cf334c0)
rfs3_write+0x551(ffffff009cf337a0, ffffff009cf33890, ffffff1676af5180, 
ffffff009cf33ad0, ffffff4740342388)
common_dispatch+0x4c4(ffffff009cf33ad0, ffffff1672b2c240, 2, 4,
fffffffff8591118
, ffffffffc01a63e0)
rfs_dispatch+0x2d(ffffff009cf33ad0, ffffff1672b2c240)
svc_getreq+0x20d(ffffff1672b2c240, ffffff91d5af4a60)
svc_run+0x19a(ffffff166eff8600)
svc_do_run+0x81(1)
nfssys+0x7e7(e, fda50fbc)
_sys_sysenter_post_swapgs+0x237()

panic[cpu5]/thread=ffffff1d18defb40: 
assertion failed: !refcount_is_zero(&pbuf->b_hdr->b_refcnt), file:
../../common/
fs/zfs/arc.c, line: 2809

So somehow the reference count for this arc_buf_t was 0 but the value in the
crash dump does not agree with this:

> ffffff17e59fa7d8::print arc_buf_t b_hdr->b_refcnt
b_hdr->b_refcnt = {
    b_hdr->b_refcnt.rc_mtx = {
        _opaque = [ 0 ]
    }
    b_hdr->b_refcnt.rc_list = {
        list_size = 0x28
        list_offset = 0
        list_head = {
            list_next = 0xffffff398538cf90
            list_prev = 0xffffff398538cf90
        }
    }
    b_hdr->b_refcnt.rc_removed = {
        list_size = 0x28
        list_offset = 0
        list_head = {
            list_next = 0xffffff398538cfb0
            list_prev = 0xffffff398538cfb0
        }
    }
    b_hdr->b_refcnt.rc_count = 0x1
    b_hdr->b_refcnt.rc_removed_count = 0
}

This seems to imply that we're somehow racing with some other thread. Looking
at where we created the arc_buf_hdr_t associated with this arc_buf_t gives us a
clue:

> 0xffffff398538cee8::whatis
ffffff398538cee8 is allocated from arc_buf_hdr_t:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
ffffff39a0c8bc18 ffffff398538cee8     696b005d1b4d ffffff009b3edc40
                 ffffff1639144448 ffffff159cdeea80 ffffff15e05759a0
                 kmem_cache_alloc_debug+0x283
                 kmem_cache_alloc+0x23f
                 arc_release+0x239
                 arc_release_bp+0x25
                 dbuf_release_bp+0xe3
                 dbuf_write+0x84
                 dbuf_sync_indirect+0x23a
                 dbuf_sync_list+0x65
                 dbuf_sync_indirect+0x25b
                 dbuf_sync_list+0x65
                 dbuf_sync_indirect+0x25b
                 dbuf_sync_list+0x65
                 dnode_sync+0x559
                 dmu_objset_sync_dnodes+0xb6
                 dmu_objset_sync+0x271

So that's the sync thread. This seems to imply that we have a race between
arc_release() and arc_read(). Let's see if we can prove this. Let's look at the
sync thread:

> ffffff009b3edc40::findstack -v
stack pointer for thread ffffff009b3edc40: ffffff009b3ed100
  ffffff009b3edc40 hilevel_intr_prolog+0x35(0, 0, fffffffffb87ab48, 
  ffffff009b3edc30)

That's totally not helpful! Thanks to Adam we were able to find the actual
stack for the sync thread using the following:

> ffffff009b3edc40::print kthread_t t_pcb.val[2] | ::eval "*.$C" 
ffffff009a346cb0 panic+0x94()
ffffff009a346cf0 assfail+0x73(fffffffffb9912f0, fffffffffb991300, 93)
ffffff009a346d50 hilevel_intr_prolog+0x23c(ffffff164c773580, f, a, 
ffffff009a346dc0)
ffffff009a346db0 do_interrupt+0x10d(ffffff009a346dc0, ffffff164c784e00)
ffffff009a346dc0 _interrupt+0x1e9()
ffffff009a346ed0 dumpsys_helper+0x1c()
ffffff009a346ee0 panic_idle+0x25()
ffffff009a346f60 xc_serv+0x119(0, 0)
ffffff009a346fc0 av_dispatch_autovect+0x8f(f0)
ffffff009a346ff0 dispatch_hilevel+0x1f(f0, 0)
ffffff009b3ed160 switch_sp_and_call+0x13()
ffffff009b3ed1c0 do_interrupt+0x137(ffffff009b3ed1d0, ffffff164c784c88)
ffffff009b3ed1d0 _interrupt+0x1e9()
ffffff009b3ed310 0xfffffffffb8654fe()
ffffff009b3ed390 kmem_cache_alloc_debug+0x29d(ffffff155d42a888,
ffffff28e4bae3b8
, 0, 0, fffffffffbdba552)
ffffff009b3ed3f0 kmem_cache_alloc+0x115(ffffff155d42a888, 0)
ffffff009b3ed430 kmem_zalloc+0x6a(20, 0)
ffffff009b3ed4c0 arc_write+0xa2(ffffff17d26a7c78, ffffff164b8b1000, 27a08, 
ffffffe4a7bcc400, ffffff17eb934600, 1, ffffff009b3ed540, fffffffff78279e0, 
fffffffff7827df0, ffffff3627a67098, 4, 0, ffffff009b3ed520)
ffffff009b3ed5a0 dbuf_write+0x332(ffffff16ec5a9810, ffffff17eb934600, 
ffffff1633775700)
ffffff009b3ed630 dbuf_sync_leaf+0x3c6(ffffff16ec5a9810, ffffff1633775700)
ffffff009b3ed670 dbuf_sync_list+0x58(ffffff164e9c8090, ffffff1633775700)
ffffff009b3ed6e0 dbuf_sync_indirect+0x25b(ffffff164e9c8050, ffffff1633775700)
ffffff009b3ed720 dbuf_sync_list+0x65(ffffff168b23a588, ffffff1633775700)
ffffff009b3ed790 dbuf_sync_indirect+0x25b(ffffff168b23a548, ffffff1633775700)
ffffff009b3ed7d0 dbuf_sync_list+0x65(ffffff16ec4a3690, ffffff1633775700)
ffffff009b3ed840 dbuf_sync_indirect+0x25b(ffffff16ec4a3650, ffffff1633775700)
ffffff009b3ed880 dbuf_sync_list+0x65(ffffffd2c3800818, ffffff1633775700)
ffffff009b3ed8f0 dnode_sync+0x559(ffffffd2c3800740, ffffff1633775700)
ffffff009b3ed940 dmu_objset_sync_dnodes+0xb6(ffffff167a3b8940, ffffff167a3b8900
, ffffff1633775700)
ffffff009b3ed9e0 dmu_objset_sync+0x271(ffffff167a3b8780, ffffff1fe9ff8e58, 
ffffff1633775700)
ffffff009b3eda20 dsl_dataset_sync+0x74(ffffff166d7ddc80, ffffff1fe9ff8e58, 
ffffff1633775700)
ffffff009b3edac0 dsl_pool_sync+0xdb(ffffff164b436b40, 27a08)
ffffff009b3edb80 spa_sync+0x3cc(ffffff164b8b1000, 27a08)
ffffff009b3edc20 txg_sync_thread+0x2cf(ffffff164b436b40)
ffffff009b3edc30 thread_start+8()

From the stack we can see we've moved off to syncing leaf blocks. Let's see if
we can find the dbuf associated with the panic arc_buf_t. We know from the
panic stack trace that we obtained the arc_buf_t here:

static void
dbuf_read_impl(dmu_buf_impl_t *db, zio_t *zio, uint32_t *flags)
{
<snip>
        if (db->db_parent)
                pbuf = db->db_parent->db_buf;
        else
                pbuf = db->db_objset->os_phys_buf;

        (void) dsl_read(zio, spa, db->db_blkptr, pbuf,
            dbuf_read_done, db, ZIO_PRIORITY_SYNC_READ,
            (*flags & DB_RF_CANFAIL) ? ZIO_FLAG_CANFAIL : ZIO_FLAG_MUSTSUCCEED,
            &aflags, &zb);
<snip>
}

So let's look at the dbuf from panic stack trace:
> ffffff165b700ab0::print dmu_buf_impl_t db_parent
db_parent = 0
> ffffff33d1cd8018::print dmu_buf_impl_t db_parent
db_parent = 0xffffff26a179e3b0
> 0xffffff26a179e3b0::print dmu_buf_impl_t db_buf
db_buf = 0xffffff17e59fa7d8

Let's see if we can determine if the db_parent may have been released by the
sync thread. We'll start by looking at dbuf_dirty_record for the
dbuf_sync_indirect() calls:

ffffff009b3ed6e0 dbuf_sync_indirect+0x25b(ffffff164e9c8050, ffffff1633775700)

> ffffff164e9c8050::print dbuf_dirty_record_t dr_dbuf
dr_dbuf = 0xffffff26a179e3b0

Found it! Let's look closer at the race:

The sync thread would have called dbuf_release_bp(0xffffff26a179e3b0) and this
calls arc_release(0xffffff17e59fa7d8). So we must have gone through this code
path:

void
arc_release(arc_buf_t *buf, void *tag)
{
<snip>
        /*
         * Do we have more than one buf?
         */
        if (hdr->b_datacnt > 1) {
                arc_buf_hdr_t *nhdr;
                arc_buf_t **bufp;
                uint64_t blksz = hdr->b_size;
                uint64_t spa = hdr->b_spa;
                arc_buf_contents_t type = hdr->b_type;
                uint32_t flags = hdr->b_flags;

                ASSERT(hdr->b_buf != buf || buf->b_next != NULL);
                /*
                 * Pull the data off of this hdr and attach it to
                 * a new anonymous hdr.
                 */
                (void) remove_reference(hdr, hash_lock, tag);
                bufp = &hdr->b_buf;
                while (*bufp != buf)
                        bufp = &(*bufp)->b_next;
                *bufp = buf->b_next;
                buf->b_next = NULL;
<snip>
                nhdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
                nhdr->b_size = blksz;
                nhdr->b_spa = spa;
                nhdr->b_type = type;
                nhdr->b_buf = buf;
                nhdr->b_state = arc_anon;
                nhdr->b_arc_access = 0;
                nhdr->b_flags = flags & ARC_L2_WRITING;
                nhdr->b_l2hdr = NULL;
                nhdr->b_datacnt = 1;
                nhdr->b_freeze_cksum = NULL;
                (void) refcount_add(&nhdr->b_refcnt, tag);
                buf->b_hdr = nhdr;
<snip>

So if we assume that the b_datacnt > 1 and the reference count for this
arc_buf_hdr_t is currently 1 then w

Let's assume b_datacnt > 1 and the reference count == 1. In the code above we
would have called remove_reference() on the arc_buf_hdr_t bringing the value of
b_refcnt to 0. If immediately afterward the thread doing the read executes
this:

int
arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_buf_t *pbuf,
    arc_done_func_t *done, void *private, int priority, int zio_flags,
    uint32_t *arc_flags, const zbookmark_t *zb)
{
<snip>
        ASSERT(!refcount_is_zero(&pbuf->b_hdr->b_refcnt));

It would discover that the b_refcount is 0 but before we can panic the sync
thread continues to run and now moves the buffer to a newly created
arc_buf_hdr_t and increments the b_refcnt to 1.
The fix for this involves removing some unused logic in arc_read(). The logic
was originally intended for bp-rewrite which is not fully implemented. By
removing the logic we can combine both arc_read() and arc_read_nolock() into a
common function and remove the need to pass around the parent's arc_buf_t when
doing traversals.

History

#1

Updated by Christopher Siden over 6 years ago

  • Status changed from Pending RTI to Resolved
commit 1b912ec
Author: George Wilson <george.wilson@delphix.com>
Date:   Fri Feb 1 14:11:09 2013

    3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)
    Reviewed by: Adam Leventhal <ahl@delphix.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>

Also available in: Atom PDF