Bug #13034

dnode_sync is careless with range tree

Added by Patrick Mooney 7 months ago. Updated 6 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: zfs - Zettabyte File System
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

While testing an unrelated change, I observed a panic in ZFS:

panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe007d4be490 addr=fffffe5a5616a00c
dump content: kernel pages only
> $C
fffffe007d4be600 zfs_btree_find+0x238(fffffe59eb90a700, fffffe007d4be610, 0)
fffffe007d4be670 range_tree_find_impl+0x134(fffffe59eb90a700, 13960c, 1)
fffffe007d4be6c0 range_tree_find+0x45(fffffe59eb90a700, 13960c, 1)
fffffe007d4be6f0 range_tree_contains+0x10(fffffe59eb90a700, 13960c, 1)
fffffe007d4be740 dnode_block_freed+0x80(fffffe5a090ad520, 13960c)
fffffe007d4be810 dbuf_read_impl+0xe7(fffffe5a35c775d0, fffffe5a57e3b568, 1e)
fffffe007d4be8b0 dbuf_read+0xdb(fffffe5a35c775d0, fffffe5a57e3b568, 1e)
fffffe007d4be960 dmu_buf_hold_array_by_dnode+0x11c(fffffe5a090ad520, 13960c000, 1000, 1, fffffffff79998a0, fffffe007d4be9ac, fffffe007d4be9b0, fffffe5a00000000)
fffffe007d4bea10 dmu_read_uio_dnode+0x54(fffffe5a090ad520, fffffe007d4bec78, 1000)
fffffe007d4bea70 dmu_read_uio+0x73(fffffe59d6d934c0, 1, fffffe007d4bec78, 1000)
fffffe007d4beae0 zvol_read+0xd5(11200000003, fffffe007d4bec78, fffffe59f5208448)
fffffe007d4beb10 cdev_read+0x2a(11200000003, fffffe007d4bec78, fffffe59f5208448)
fffffe007d4beba0 spec_read+0x310(fffffe59de796d00, fffffe007d4bec78, 0, fffffe59f5208448, 0)
fffffe007d4bec20 fop_read+0x60(fffffe59de796d00, fffffe007d4bec78, 0, fffffe59f5208448, 0)
fffffe007d4beef0 preadv+0x22a(4, 167abe8, 1, 13960c000, 0)
fffffe007d4bef00 sys_syscall+0x283()

Which corresponds to:

        for (node = (zfs_btree_core_t *)tree->bt_root; depth < tree->bt_height;
            node = (zfs_btree_core_t *)node->btc_children[child], depth++) {
                ASSERT3P(node, !=, NULL);
                void *d = zfs_btree_find_in_buf(tree, node->btc_elems,
                    node->btc_hdr.bth_count, value, &idx);

in zfs_btree_find, where the child node is dereferenced in order to pass its element buffer and count to zfs_btree_find_in_buf. Inspection of the zfs_btree_t shows that the faulting child pointer is indeed present in the root of the tree:

> fffffe59eb90a700::print zfs_btree_t
{
    bt_root = 0xfffffe5a52688000
    bt_height = 0x1
    bt_elem_size = 0x10
    bt_num_elems = 0x12b1
    bt_num_nodes = 0x11
    bt_bulk = 0
    bt_compar = range_tree_seg64_compare
}
> fffffe59eb90a700::print zfs_btree_t bt_root | ::print zfs_btree_core_t
{
    btc_hdr = {
        bth_parent = 0
        bth_core = 0x1 (B_TRUE)
        bth_count = 0x1a
    }
    btc_children = [ 0xfffffe5a51c8c000, 0xfffffe5a56cc2000, 0xfffffe5a55946000, 0xfffffe5a57c70000, 0xfffffe5a5422a000, 0xfffffe5a5616a000, 0xfffffe5a52c68000, 0xfffffe5a54228000, 0xfffffe5a5702c000, 0xfffffe5a55ba6000, 0xfffffe5a52686000, 0xfffffe5a5344a000, 0xfffffe5a5521a000, 0xfffffe5a55db8000, 0xfffffe5a53eae000, 0xfffffe5a555b4000, 0xfffffe5a56588000, 0xfffffe5a57294000, 0xfffffe5a52a86000, 0xfffffe5a5422c000, 0xfffffe5a555be000, 0xfffffe5a55d86000, 0xfffffe5a5674c000, 0xfffffe5a54e4e000, 0xfffffe5a57a8c000, 0xfffffe5a5656a000, 0xfffffe5a55ba4000, 0xabadb10cdeadbeef, 0xabadb10cdeadbeef, 0xabadb10cdeadbeef, 0xabadb10cdeadbeef, 0xabadb10cdeadbeef, ... ]
    btc_elems = [ ... ]
}

The child is indeed not mapped:

> fffffe5a5616a000::print zfs_btree_leaf_t
{
    btl_hdr = {
mdb: failed to read bth_parent pointer at fffffe5a5616a000: no mapping for address
    }
}

Strangely, many of the other leading elements in the array appear unmapped as well:

> fffffe5a52688010,1a/Kn | ::print zfs_btree_leaf_t
{
    btl_hdr = {
mdb: failed to read bth_parent pointer at fffffe5a51c8c000: no mapping for address
    }
}
{
    btl_hdr = {
mdb: failed to read bth_parent pointer at fffffe5a56cc2000: no mapping for address
    }
}
{
    btl_hdr = {
mdb: failed to read bth_parent pointer at fffffe5a55946000: no mapping for address
    }
}
{
    btl_hdr = {
mdb: failed to read bth_parent pointer at fffffe5a57c70000: no mapping for address
    }
}
{
    btl_hdr = {
mdb: failed to read bth_parent pointer at fffffe5a5422a000: no mapping for address
    }
}
{
    btl_hdr = {
mdb: failed to read bth_parent pointer at fffffe5a5616a000: no mapping for address
    }
}
{
    btl_hdr = {
mdb: failed to read bth_parent pointer at fffffe5a52c68000: no mapping for address
    }
}
{
    btl_hdr = {
mdb: failed to read bth_parent pointer at fffffe5a54228000: no mapping for address
    }
}
{
    btl_hdr = {
mdb: failed to read bth_parent pointer at fffffe5a5702c000: no mapping for address
    }
}
{
    btl_hdr = {
mdb: failed to read bth_parent pointer at fffffe5a55ba6000: no mapping for address
    }
}
{
    btl_hdr = {
mdb: failed to read bth_parent pointer at fffffe5a52686000: no mapping for address
    }
}
{
    btl_hdr = {
        bth_parent = 0xfffffe5a52688000
        bth_core = 0 (0)
        bth_count = 0x7f
    }
    btl_elems = [ ... ]
}
{
    btl_hdr = {
        bth_parent = 0xfffffe5a52688000
        bth_core = 0 (0)
        bth_count = 0xb6
    }
    btl_elems = [ ... ]
}

Going back to the leaf pointer which caused the fatal fault in the first place, we can look at the kmem transaction log (::kmalog) to gain more insight into what has been going on here. (Fortunately, the panic was observed under a DEBUG kernel, where kmem audit logging is enabled.)

> fffffe5a5616a000::kmalog

T-0.065070871  addr=fffffe5a5616a000  zfs_btree_leaf_cache
         kmem_slab_free+0x2c7
         kmem_slab_free_constructed+0x7d
         kmem_cache_free+0xd9
         zfs_btree_node_destroy+0x3c
         zfs_btree_next_helper+0xdd
         zfs_btree_destroy_nodes+0x35
         range_tree_vacate+0x184
         dnode_sync+0x3e5
         dmu_objset_sync_dnodes+0x9e
         sync_dnodes_task+0x31
         taskq_thread+0x315
         thread_start+0xb

T-1.982798978  addr=fffffe5a5616a000  zfs_btree_leaf_cache
         kmem_slab_alloc_impl+0x324
         kmem_slab_alloc+0x69
         kmem_cache_alloc+0x193
         zfs_btree_insert_into_leaf+0xb2
         zfs_btree_add_idx+0x1be
         range_tree_add_impl+0xe0c
         range_tree_add+0x13
         dnode_free_range+0x460
         dmu_free_long_range_impl+0x253
         dmu_free_long_range+0x82
         zvol_ioctl+0x6de
         zfsdev_ioctl+0x34a
         cdev_ioctl+0x2b
         spec_ioctl+0x45
         fop_ioctl+0x5b

So it was once valid and was freed only recently, from range_tree_vacate. Let's look for threads still executing the function which freed it:

> ::stacks -c range_tree_vacate
THREAD           STATE    SOBJ                COUNT
fffffe007c40bc20 ONPROC   <NONE>                  1
                 dbuf_redirty+0x24
                 dmu_buf_will_dirty_impl+0x8a
                 apix_hilevel_intr_prolog+0xa0
                 apix_do_interrupt+0x56e
                 _interrupt+0x1f2
                 fletcher_2_incremental_native+0x2b
                 fletcher_2_native+0x1e
                 arc_cksum_verify+0xca
                 arc_buf_thaw+0x57
                 arc_release+0x5c5
                 dbuf_release_bp+0x61
                 free_children+0xb0
                 free_children+0x350
                 free_children+0x350
                 dnode_sync_free_range_impl+0x11a
                 dnode_sync_free_range+0x54
                 range_tree_vacate+0x178
                 dnode_sync+0x3e5
                 dmu_objset_sync_dnodes+0x9e
                 sync_dnodes_task+0x31
                 taskq_thread+0x315
                 thread_start+0xb

Aha, that thread is even still running. What code is responsible for its behavior?

        /* process all the "freed" ranges in the file */
        if (dn->dn_free_ranges[txgoff] != NULL) {
                dnode_sync_free_range_arg_t dsfra;
                dsfra.dsfra_dnode = dn;
                dsfra.dsfra_tx = tx;
                dsfra.dsfra_free_indirects = freeing_dnode;
                if (freeing_dnode) {
                        ASSERT(range_tree_contains(dn->dn_free_ranges[txgoff],
                            0, dn->dn_maxblkid + 1));
                }
                mutex_enter(&dn->dn_mtx);
                range_tree_vacate(dn->dn_free_ranges[txgoff],
                    dnode_sync_free_range, &dsfra);
                range_tree_destroy(dn->dn_free_ranges[txgoff]);
                dn->dn_free_ranges[txgoff] = NULL;
                mutex_exit(&dn->dn_mtx);
        }

It appears to be taking the dnode mutex before vacating the free ranges btree. Does that thread hold the mutex?

> fffffe5a090ad520::print -ta dnode_t dn_mtx
fffffe5a090ad620 kmutex_t dn_mtx = {
    fffffe5a090ad620 void *[1] dn_mtx._opaque = [ 0xfffffe59e9f1b060 ]
}
> fffffe5a090ad620::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
fffffe5a090ad620 adapt fffffe59e9f1b060      -      -      no
> fffffe59e9f1b060::findstack
stack pointer for thread fffffe59e9f1b060 (bhyve/2 [blk-5:0-0]): fffffe007d4be1a0
  fffffe007d4be220 segkmem_fault+0x41()
  fffffe007d4be310 as_fault+0x4d6()
  fffffe007d4be390 rewrite_syscall()
  fffffe007d4be480 trap+0x12b7()
  fffffe007d4be490 cmntrap_pushed+0xf6()
  fffffe007d4be600 zfs_btree_find+0x238()
  fffffe007d4be670 range_tree_find_impl+0x134()
  fffffe007d4be6c0 range_tree_find+0x45()
  fffffe007d4be6f0 range_tree_contains+0x10()
  fffffe007d4be740 dnode_block_freed+0x80()
  fffffe007d4be810 dbuf_read_impl+0xe7()
  fffffe007d4be8b0 dbuf_read+0xdb()
  fffffe007d4be960 dmu_buf_hold_array_by_dnode+0x11c()
  fffffe007d4bea10 dmu_read_uio_dnode+0x54()
  fffffe007d4bea70 dmu_read_uio+0x73()
  fffffe007d4beae0 zvol_read+0xd5()
  fffffe007d4beb10 cdev_read+0x2a()
  fffffe007d4beba0 spec_read+0x310()
  fffffe007d4bec20 fop_read+0x60()
  fffffe007d4beef0 preadv+0x22a()
  fffffe007d4bef00 sys_syscall+0x283()

No, the thread which stumbled into the unmapped memory holds the lock. That seems odd, considering dnode_sync acquired it before beginning to vacate the range tree. A closer look at dnode_sync_free_range, the callback provided to that range_tree_vacate operation, sheds some light on the situation:

static void
dnode_sync_free_range(void *arg, uint64_t blkid, uint64_t nblks)
{
        dnode_sync_free_range_arg_t *dsfra = arg;
        dnode_t *dn = dsfra->dsfra_dnode;

        mutex_exit(&dn->dn_mtx);
        dnode_sync_free_range_impl(dn, blkid, nblks,
            dsfra->dsfra_free_indirects, dsfra->dsfra_tx);
        mutex_enter(&dn->dn_mtx);
}

It drops the mutex while it performs the "real work" behind the operation, leaving the dnode open to being locked by another party during that time. Is there anything about the range_tree_vacate operation that would make this a problem? Looking at how it is implemented behind the scenes, we can see that zfs_btree_destroy_nodes is used:

void
range_tree_vacate(range_tree_t *rt, range_tree_func_t *func, void *arg)
{

        if (rt->rt_ops != NULL && rt->rt_ops->rtop_vacate != NULL)
                rt->rt_ops->rtop_vacate(rt, rt->rt_arg);

        if (func != NULL) {
                range_seg_t *rs;
                zfs_btree_index_t *cookie = NULL;

                while ((rs = zfs_btree_destroy_nodes(&rt->rt_root, &cookie)) !=
                    NULL) {
                        func(arg, rs_get_start(rs, rt), rs_get_end(rs, rt) -
                            rs_get_start(rs, rt));
                }
        } else {
                zfs_btree_clear(&rt->rt_root);
        }

        bzero(rt->rt_histogram, sizeof (rt->rt_histogram));
        rt->rt_space = 0;
}

The comment detailing the expectations of zfs_btree_destroy_nodes has an ominous warning, considering our circumstances:

 * This function is used to visit all the elements in the tree before
 * destroying the tree. This allows the calling code to perform any cleanup it
 * needs to do. This is more efficient than just removing the first element
 * over and over, because it removes all rebalancing. Once the destroy_nodes()
 * function has been called, no other btree operations are valid until it
 * returns NULL, at which point the only valid operation is zfs_btree_destroy().

So once range_tree_vacate makes its first call to zfs_btree_destroy_nodes, no other operations (read or write) are allowed on the btree until the teardown completes. Because dnode_sync_free_range drops the dnode mutex for every range it processes, that leaves a window for another thread (here, the zvol reader) to acquire the lock and walk the now-invalid btree.

If the dnode mutex must be dropped while syncing those free ranges, some other state on the dnode must be kept to prevent other threads from stumbling into the btree while it is being vacated. Otherwise, a normal remove-one-element-at-a-time method should be used so the btree remains in a valid state whenever the mutex is dropped.
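To make the remove-one-element-at-a-time option concrete, here is a minimal sketch (not the actual fix) of what such a vacate loop could look like. The name range_tree_vacate_incremental is invented for illustration; the sketch assumes the existing zfs_btree_first() and zfs_btree_remove_idx() interfaces and gives up the rebalancing savings of zfs_btree_destroy_nodes() in exchange for a tree that stays valid across every callback invocation, including any point where the callback drops dn_mtx.

/*
 * Illustrative sketch only: vacate the range tree by removing one
 * segment at a time.  The btree is fully consistent at the top of each
 * iteration, so a callback that temporarily drops dn_mtx (as
 * dnode_sync_free_range() does) cannot expose a half-destroyed tree to
 * other threads.
 */
static void
range_tree_vacate_incremental(range_tree_t *rt, range_tree_func_t *func,
    void *arg)
{
        if (rt->rt_ops != NULL && rt->rt_ops->rtop_vacate != NULL)
                rt->rt_ops->rtop_vacate(rt, rt->rt_arg);

        if (func != NULL) {
                range_seg_t *rs;
                zfs_btree_index_t where;

                while ((rs = zfs_btree_first(&rt->rt_root, &where)) != NULL) {
                        uint64_t start = rs_get_start(rs, rt);
                        uint64_t size = rs_get_end(rs, rt) - start;

                        /* Detach the segment before calling back out. */
                        zfs_btree_remove_idx(&rt->rt_root, &where);
                        func(arg, start, size);
                }
        } else {
                zfs_btree_clear(&rt->rt_root);
        }

        bzero(rt->rt_histogram, sizeof (rt->rt_histogram));
        rt->rt_space = 0;
}

Whether the extra rebalancing cost is acceptable on the dnode sync path is a separate question; the point is only that the tree never becomes invalid to a concurrent range_tree_contains() caller.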

#1

Updated by Patrick Mooney 7 months ago

Also open as OpenZFS #10708

#2

Updated by Patrick Mooney 7 months ago

The range_tree_vacate function was previously implemented using avl_destroy_nodes, prior to the integration of #11971. That too would have been dangerous, given the restrictions spelled out in its man page:

After the avl_destroy_nodes() function is called on an AVL tree, the only
valid functions that may be called on it are additional calls to
avl_destroy_nodes() and finally avl_destroy().
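For reference, the canonical teardown loop from that man page looks roughly like the sketch below (the tree and node type names are placeholders); the same restriction applied because the AVL tree, like the btree, is dismantled while it is walked:

        void *cookie = NULL;
        my_node_t *node;

        /*
         * Between the first call to avl_destroy_nodes() and the final
         * avl_destroy(), no other AVL operations on the tree are valid.
         */
        while ((node = avl_destroy_nodes(&tree, &cookie)) != NULL)
                kmem_free(node, sizeof (*node));
        avl_destroy(&tree);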

#3

Updated by Electric Monk 6 months ago

  • Gerrit CR set to 867
#4

Updated by Patrick Mooney 6 months ago

With the proposed patch included in a DEBUG build, the reproducer workload is no longer able to panic the machine; prior to the fix it was a reliable crash. For inclusion into OpenZFS, the fix was also run through their test suite and passed.

#5

Updated by Electric Monk 6 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 7611ab7b771a068fbab94a90143de2afd495769f

commit  7611ab7b771a068fbab94a90143de2afd495769f
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2020-08-27T19:21:30.000Z

    13034 dnode_sync is careless with range tree
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed by: Igor Kozhukhov <igor@dilos.org>
    Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
    Reviewed by: Jason King <jason.king@joyent.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
