Actions
Bug #3498
closedpanic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)
Start date:
2013-01-23
Due date:
% Done:
100%
Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:
External Bug:
Description
This is a panic that only occurs on debug builds, although on non-debug builds this might cause a deadlock. From George Wilson's bug report and analysis at Delphix:
The system panics in arc_read(): vpanic() assfail+0x73(fffffffff78ee4b8, fffffffff78ef060, af9) arc_read+0x114(ffffff165b700ab0, ffffff164b8b1000, ffffffe4a7bcf080, ffffff17e59fa7d8, fffffffff7822500, ffffff33d1cd8018) dsl_read+0x33(ffffff165b700ab0, ffffff164b8b1000, ffffffe4a7bcf080, ffffff17e59fa7d8, fffffffff7822500, ffffff33d1cd8018) dbuf_read_impl+0x1cb(ffffff33d1cd8018, ffffff165b700ab0, ffffff009cf32fc8) dbuf_read+0x1de(ffffff33d1cd8018, ffffff165b700ab0, a) dmu_tx_check_ioerr+0x6b(ffffff165b700ab0, ffffffd2c3800740, 0, 7ae1) dmu_tx_count_write+0x1bd(ffffff16878e9cc8, f5c0f600, 20000) dmu_tx_hold_write+0x96(ffffff1633709380, e, f5c0f600, 20000) zfs_write+0x6f9(ffffff2736592200, ffffff009cf33630, 10, ffffff4740342388, ffffff009cf334c0) fop_write+0x6b(ffffff2736592200, ffffff009cf33630, 10, ffffff4740342388, ffffff009cf334c0) rfs3_write+0x551(ffffff009cf337a0, ffffff009cf33890, ffffff1676af5180, ffffff009cf33ad0, ffffff4740342388) common_dispatch+0x4c4(ffffff009cf33ad0, ffffff1672b2c240, 2, 4, fffffffff8591118 , ffffffffc01a63e0) rfs_dispatch+0x2d(ffffff009cf33ad0, ffffff1672b2c240) svc_getreq+0x20d(ffffff1672b2c240, ffffff91d5af4a60) svc_run+0x19a(ffffff166eff8600) svc_do_run+0x81(1) nfssys+0x7e7(e, fda50fbc) _sys_sysenter_post_swapgs+0x237() panic[cpu5]/thread=ffffff1d18defb40: assertion failed: !refcount_is_zero(&pbuf->b_hdr->b_refcnt), file: ../../common/ fs/zfs/arc.c, line: 2809 So somehow the reference count for this arc_buf_t was 0 but the value in the crash dump does not agree with this: > ffffff17e59fa7d8::print arc_buf_t b_hdr->b_refcnt b_hdr->b_refcnt = { b_hdr->b_refcnt.rc_mtx = { _opaque = [ 0 ] } b_hdr->b_refcnt.rc_list = { list_size = 0x28 list_offset = 0 list_head = { list_next = 0xffffff398538cf90 list_prev = 0xffffff398538cf90 } } b_hdr->b_refcnt.rc_removed = { list_size = 0x28 list_offset = 0 list_head = { list_next = 0xffffff398538cfb0 list_prev = 0xffffff398538cfb0 } } b_hdr->b_refcnt.rc_count = 0x1 b_hdr->b_refcnt.rc_removed_count = 0 } This seems to imply that we're somehow racing with some other thread. Looking at where we created the arc_buf_hdr_t associated with this arc_buf_t gives us a clue: > 0xffffff398538cee8::whatis ffffff398538cee8 is allocated from arc_buf_hdr_t: ADDR BUFADDR TIMESTAMP THREAD CACHE LASTLOG CONTENTS ffffff39a0c8bc18 ffffff398538cee8 696b005d1b4d ffffff009b3edc40 ffffff1639144448 ffffff159cdeea80 ffffff15e05759a0 kmem_cache_alloc_debug+0x283 kmem_cache_alloc+0x23f arc_release+0x239 arc_release_bp+0x25 dbuf_release_bp+0xe3 dbuf_write+0x84 dbuf_sync_indirect+0x23a dbuf_sync_list+0x65 dbuf_sync_indirect+0x25b dbuf_sync_list+0x65 dbuf_sync_indirect+0x25b dbuf_sync_list+0x65 dnode_sync+0x559 dmu_objset_sync_dnodes+0xb6 dmu_objset_sync+0x271 So that's the sync thread. This seems to imply that we have a race between arc_release() and arc_read(). Let's see if we can prove this. Let's look at the sync thread: > ffffff009b3edc40::findstack -v stack pointer for thread ffffff009b3edc40: ffffff009b3ed100 ffffff009b3edc40 hilevel_intr_prolog+0x35(0, 0, fffffffffb87ab48, ffffff009b3edc30) That's totally not helpful! Thanks to Adam we were able to find the actual stack for the sync thread using the following: > ffffff009b3edc40::print kthread_t t_pcb.val[2] | ::eval "*.$C" ffffff009a346cb0 panic+0x94() ffffff009a346cf0 assfail+0x73(fffffffffb9912f0, fffffffffb991300, 93) ffffff009a346d50 hilevel_intr_prolog+0x23c(ffffff164c773580, f, a, ffffff009a346dc0) ffffff009a346db0 do_interrupt+0x10d(ffffff009a346dc0, ffffff164c784e00) ffffff009a346dc0 _interrupt+0x1e9() ffffff009a346ed0 dumpsys_helper+0x1c() ffffff009a346ee0 panic_idle+0x25() ffffff009a346f60 xc_serv+0x119(0, 0) ffffff009a346fc0 av_dispatch_autovect+0x8f(f0) ffffff009a346ff0 dispatch_hilevel+0x1f(f0, 0) ffffff009b3ed160 switch_sp_and_call+0x13() ffffff009b3ed1c0 do_interrupt+0x137(ffffff009b3ed1d0, ffffff164c784c88) ffffff009b3ed1d0 _interrupt+0x1e9() ffffff009b3ed310 0xfffffffffb8654fe() ffffff009b3ed390 kmem_cache_alloc_debug+0x29d(ffffff155d42a888, ffffff28e4bae3b8 , 0, 0, fffffffffbdba552) ffffff009b3ed3f0 kmem_cache_alloc+0x115(ffffff155d42a888, 0) ffffff009b3ed430 kmem_zalloc+0x6a(20, 0) ffffff009b3ed4c0 arc_write+0xa2(ffffff17d26a7c78, ffffff164b8b1000, 27a08, ffffffe4a7bcc400, ffffff17eb934600, 1, ffffff009b3ed540, fffffffff78279e0, fffffffff7827df0, ffffff3627a67098, 4, 0, ffffff009b3ed520) ffffff009b3ed5a0 dbuf_write+0x332(ffffff16ec5a9810, ffffff17eb934600, ffffff1633775700) ffffff009b3ed630 dbuf_sync_leaf+0x3c6(ffffff16ec5a9810, ffffff1633775700) ffffff009b3ed670 dbuf_sync_list+0x58(ffffff164e9c8090, ffffff1633775700) ffffff009b3ed6e0 dbuf_sync_indirect+0x25b(ffffff164e9c8050, ffffff1633775700) ffffff009b3ed720 dbuf_sync_list+0x65(ffffff168b23a588, ffffff1633775700) ffffff009b3ed790 dbuf_sync_indirect+0x25b(ffffff168b23a548, ffffff1633775700) ffffff009b3ed7d0 dbuf_sync_list+0x65(ffffff16ec4a3690, ffffff1633775700) ffffff009b3ed840 dbuf_sync_indirect+0x25b(ffffff16ec4a3650, ffffff1633775700) ffffff009b3ed880 dbuf_sync_list+0x65(ffffffd2c3800818, ffffff1633775700) ffffff009b3ed8f0 dnode_sync+0x559(ffffffd2c3800740, ffffff1633775700) ffffff009b3ed940 dmu_objset_sync_dnodes+0xb6(ffffff167a3b8940, ffffff167a3b8900 , ffffff1633775700) ffffff009b3ed9e0 dmu_objset_sync+0x271(ffffff167a3b8780, ffffff1fe9ff8e58, ffffff1633775700) ffffff009b3eda20 dsl_dataset_sync+0x74(ffffff166d7ddc80, ffffff1fe9ff8e58, ffffff1633775700) ffffff009b3edac0 dsl_pool_sync+0xdb(ffffff164b436b40, 27a08) ffffff009b3edb80 spa_sync+0x3cc(ffffff164b8b1000, 27a08) ffffff009b3edc20 txg_sync_thread+0x2cf(ffffff164b436b40) ffffff009b3edc30 thread_start+8() From the stack we can see we've moved off to syncing leaf blocks. Let's see if we can find the dbuf associated with the panic arc_buf_t. We know from the panic stack trace that we obtained the arc_buf_t here: static void dbuf_read_impl(dmu_buf_impl_t *db, zio_t *zio, uint32_t *flags) { <snip> if (db->db_parent) pbuf = db->db_parent->db_buf; else pbuf = db->db_objset->os_phys_buf; (void) dsl_read(zio, spa, db->db_blkptr, pbuf, dbuf_read_done, db, ZIO_PRIORITY_SYNC_READ, (*flags & DB_RF_CANFAIL) ? ZIO_FLAG_CANFAIL : ZIO_FLAG_MUSTSUCCEED, &aflags, &zb); <snip> } So let's look at the dbuf from panic stack trace: > ffffff165b700ab0::print dmu_buf_impl_t db_parent db_parent = 0 > ffffff33d1cd8018::print dmu_buf_impl_t db_parent db_parent = 0xffffff26a179e3b0 > 0xffffff26a179e3b0::print dmu_buf_impl_t db_buf db_buf = 0xffffff17e59fa7d8 Let's see if we can determine if the db_parent may have been released by the sync thread. We'll start by looking at dbuf_dirty_record for the dbuf_sync_indirect() calls: ffffff009b3ed6e0 dbuf_sync_indirect+0x25b(ffffff164e9c8050, ffffff1633775700) > ffffff164e9c8050::print dbuf_dirty_record_t dr_dbuf dr_dbuf = 0xffffff26a179e3b0 Found it! Let's look closer at the race: The sync thread would have called dbuf_release_bp(0xffffff26a179e3b0) and this calls arc_release(0xffffff17e59fa7d8). So we must have gone through this code path: void arc_release(arc_buf_t *buf, void *tag) { <snip> /* * Do we have more than one buf? */ if (hdr->b_datacnt > 1) { arc_buf_hdr_t *nhdr; arc_buf_t **bufp; uint64_t blksz = hdr->b_size; uint64_t spa = hdr->b_spa; arc_buf_contents_t type = hdr->b_type; uint32_t flags = hdr->b_flags; ASSERT(hdr->b_buf != buf || buf->b_next != NULL); /* * Pull the data off of this hdr and attach it to * a new anonymous hdr. */ (void) remove_reference(hdr, hash_lock, tag); bufp = &hdr->b_buf; while (*bufp != buf) bufp = &(*bufp)->b_next; *bufp = buf->b_next; buf->b_next = NULL; <snip> nhdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE); nhdr->b_size = blksz; nhdr->b_spa = spa; nhdr->b_type = type; nhdr->b_buf = buf; nhdr->b_state = arc_anon; nhdr->b_arc_access = 0; nhdr->b_flags = flags & ARC_L2_WRITING; nhdr->b_l2hdr = NULL; nhdr->b_datacnt = 1; nhdr->b_freeze_cksum = NULL; (void) refcount_add(&nhdr->b_refcnt, tag); buf->b_hdr = nhdr; <snip> So if we assume that the b_datacnt > 1 and the reference count for this arc_buf_hdr_t is currently 1 then w Let's assume b_datacnt > 1 and the reference count == 1. In the code above we would have called remove_reference() on the arc_buf_hdr_t bringing the value of b_refcnt to 0. If immediately afterward the thread doing the read executes this: int arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_buf_t *pbuf, arc_done_func_t *done, void *private, int priority, int zio_flags, uint32_t *arc_flags, const zbookmark_t *zb) { <snip> ASSERT(!refcount_is_zero(&pbuf->b_hdr->b_refcnt)); It would discover that the b_refcount is 0 but before we can panic the sync thread continues to run and now moves the buffer to a newly created arc_buf_hdr_t and increments the b_refcnt to 1.
The fix for this involves removing some unused logic in arc_read(). The logic was originally intended for bp-rewrite which is not fully implemented. By removing the logic we can combine both arc_read() and arc_read_nolock() into a common function and remove the need to pass around the parent's arc_buf_t when doing traversals.
Updated by Christopher Siden over 10 years ago
- Status changed from Pending RTI to Resolved
commit 1b912ec Author: George Wilson <george.wilson@delphix.com> Date: Fri Feb 1 14:11:09 2013 3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt) Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net>
Actions