Project

General

Profile

Actions

Bug #14003

open

panic: zfs dbuf code destroys buffer still held by others on encrypted pool

Added by Alex Wilson 4 months ago. Updated 3 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

We had a SmartOS CN panic like this:

> ::status
debugging crash dump ./vmcore.0 (64-bit) from sdc-cb8030
operating system: 5.11 joyent_20210623T002805Z (i86pc)
git branch: eait-20210623
git rev: f12a9dd262eb31f6cd4765ca82ad60d64defee9c
image uuid: (not set)
panic message: assertion failed: remove_reference(hdr, ((void *)0), tag) == 0 (0x3 == 0x0), file: ../../common/fs/zfs/arc.c, line: 3773
dump content: kernel pages only
> $C
fffffb04504c27e0 vpanic()
fffffb04504c2870 0xfffffffffba436e1()
fffffb04504c28b0 arc_buf_destroy+0xe2(fffffdc9f1eb3768, fffffdc8e9811680)
fffffb04504c2900 dbuf_issue_final_prefetch_done+0x47(fffffed81e424080, fffffed81e424080, fffffed81e424128, fffffdc9f1eb3768, fffffdc8e9811680)
fffffb04504c2990 arc_read_done+0x229(fffffed81e424080)
fffffb04504c2a30 zio_done+0x542(fffffed81e424080)
fffffb04504c2a60 zio_execute+0xa7(fffffed81e424080)
fffffb04504c2b10 taskq_thread+0x2cd(fffffdc57b003088)
fffffb04504c2b20 thread_start+0xb()

It looks like the arc_buf_hdr_t being destroyed still had 3 other arc_buf_t holding it (4 total):

> fffffdc9f1eb3768::print -t arc_buf_t
arc_buf_t {
    arc_buf_hdr_t *b_hdr = 0xfffffe68e26ecaa8
    arc_buf_t *b_next = 0xfffffe79af57c820
    kmutex_t b_evict_lock = {
        void *[1] _opaque = [ 0 ]
    }
    void *b_data = 0xfffffdc93424f000
    arc_buf_flags_t b_flags = 0x6 (ARC_BUF_FLAG_{COMPRESSED|ENCRYPTED})
}
> 0xfffffe68e26ecaa8::print -t arc_buf_hdr_t
arc_buf_hdr_t {
    dva_t b_dva = {
        uint64_t [2] dva_word = [ 0, 0 ]
    }
    uint64_t b_birth = 0
    arc_buf_contents_t b_type = 2 (ARC_BUFC_METADATA)
    arc_buf_hdr_t *b_hash_next = 0
    arc_flags_t b_flags = 0x2168010 (ARC_FLAG_{L2CACHE|PROTECTED|BUFC_METADATA|HAS_L1HDR|COMPRESSED_ARC|COMPRESS_1})
    uint16_t b_psize = 0x20
    uint16_t b_lsize = 0x20
    uint64_t b_spa = 0x579d5b1f1bf19e55
    l2arc_buf_hdr_t b_l2hdr = {
        l2arc_dev_t *b_dev = 0xfffffdc61ed15000
        uint64_t b_daddr = 0x1a8c42000
        arc_state_type_t b_arcs_state = 0 (ARC_STATE_ANON)
        list_node_t b_l2node = {
            struct list_node *list_next = 0
            struct list_node *list_prev = 0
        }
    }
    l1arc_buf_hdr_t b_l1hdr = {
        kmutex_t b_freeze_lock = {
            void *[1] _opaque = [ 0 ]
        }
        zio_cksum_t *b_freeze_cksum = 0
        arc_buf_t *b_buf = 0xfffffdca00f70a90
        uint32_t b_bufcnt = 0x4
        kcondvar_t b_cv = {
            ushort_t _opaque = 0
        }
        uint8_t b_byteswap = 0xa
        arc_state_t *b_state = ARC_anon
        multilist_node_t b_arc_node = {
            struct list_node *list_next = 0
            struct list_node *list_prev = 0
        }
        clock_t b_arc_access = 0
        zfs_refcount_t b_refcnt = {
            uint64_t rc_count = 0x3
        }
        arc_callback_t *b_acb = 0
        abd_t *b_pabd = 0
    }
    arc_buf_hdr_crypt_t b_crypt_hdr = {
        abd_t *b_rabd = 0xfffffdcde7c9fa00
        dmu_object_type_t b_ot = 0t10 (DMU_OT_DNODE)
        uint32_t b_ebufcnt = 0x3      
        uint64_t b_dsobj = 0x1fc34
        uint8_t [8] b_salt = [ 0x18, 0xec, 0xe5, 0x12, 0x50, 0xa1, 0x75, 0x40 ]
        uint8_t [12] b_iv = [ 0x64, 0x9c, 0x7, 0x33, 0x4, 0xc, 0xa6, 0x7d, 0xaa, 0x73, 0xf, 0x2e ]
        uint8_t [16] b_mac = [ 0xa1, 0xef, 0xdb, 0x5d, 0x7e, 0xce, 0x4d, 0x4c, 0x98, 0x60, 0xa8, 0x19, 0xd5, 0x62, 0x64, 0xa5 ]
    }
}

It's a metadata buf, owned by a dbuf in the metadnode for a dataset on the pool. These are normally owned and managed by the dbuf cache, and with compressed ARC enabled, never make it onto the main ARC lists (and so nothing other than the dbuf cache should ever find them or use them). This one is marked ARC_anon, which is consistent with that.

However, we can see that the ARC_FLAG_PREFETCH flag is no longer on the header, even though this must have been created in dbuf_issue_final_prefetch (which sets that flag unconditionally). We usually only unset this flag in arc.c when we find a prefetched buffer on the MRU or MFU lists (which this buf should never have been on!) or the buf_hash (which it also should never have been on!).

Also, the flags imply that this buffer was in the L2ARC, which buffers owned by the dbuf cache should never be (they're never on the main ARC lists for the L2ARC writer to find them).

Actions #1

Updated by Alex Wilson 4 months ago

https://github.com/openzfs/zfs/issues/6881 looks similar, but was closed by the stale bot.

same with https://github.com/openzfs/zfs/issues/10085

there are some other open issues which seem to mention similar panics in passing, but none have a resolution that I can see

Actions #2

Updated by Alex Wilson 4 months ago

Got a new set of panics across 5 different machines which look closely related:

> ::status
debugging crash dump vmcore.0 (64-bit) from sdc-cb8028
operating system: 5.11 joyent_20210623T002805Z (i86pc)
git branch: eait-20210623
git rev: fae973e1f1706a5aaf6ce807bc437fe3aafa6e2b
image uuid: (not set)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffb045398c630 addr=4 occurred in module "zfs" due to a NULL pointer dereference
dump content: kernel pages only
> $C
fffffb045398c730 abd_iter_advance+0xd(fffffb045398c748, 0)
fffffb045398c7c0 abd_iterate_func+0x4a(0, 0, 4000, fffffffff7c8dbe0, fffffb045398c7d0)
fffffb045398c800 abd_copy_to_buf_off+0x39(fffffdcc487bb000, 0, 0, 4000)
fffffb045398c880 arc_buf_fill+0x29d(fffffdca2e22abe8, fffffdc571e36000, fffffdcc49d06568, 7)
fffffb045398c900 arc_buf_alloc_impl+0x14d(fffffdcc49d0aba0, fffffdc571e36000, fffffdcc49d06568, fffffdcc55ec0c30, 1, 1, 0, fffffdc500000001, fffffdcc49d06550)
fffffb045398c990 arc_read_done+0x12a(fffffdc8a0a7a968)
fffffb045398ca30 zio_done+0x542(fffffdc8a0a7a968)
fffffb045398ca60 zio_execute+0xa7(fffffdc8a0a7a968)
fffffb045398cb10 taskq_thread+0x2cd(fffffdc61ea5f1c0)
fffffb045398cb20 thread_start+0xb()

once again we have an ARC buf belonging to an encrypted, compressed metadnode, and once again the flags look a bit weird:

> fffffdca2e22abe8::print -t arc_buf_t
arc_buf_t {
    arc_buf_hdr_t *b_hdr = 0xfffffdcc49d0aba0
    arc_buf_t *b_next = 0xfffffdcb2169b618
    kmutex_t b_evict_lock = {
        void *[1] _opaque = [ 0 ]
    }
    void *b_data = 0xfffffdcc487bb000
    arc_buf_flags_t b_flags = 0x6 (ARC_BUF_FLAG_{COMPRESSED|ENCRYPTED})
}
> 0xfffffdcc49d0aba0::print -t arc_buf_hdr_t
arc_buf_hdr_t {
    dva_t b_dva = {
        uint64_t [2] dva_word = [ 0, 0 ]
    }
    uint64_t b_birth = 0
    arc_buf_contents_t b_type = 2 (ARC_BUFC_METADATA)
    arc_buf_hdr_t *b_hash_next = 0
    arc_flags_t b_flags = 0x2168110 (ARC_FLAG_{L2CACHE|IO_IN_PROGRESS|PROTECTED|BUFC_METADATA|HAS_L1HDR|COMPRESSED_ARC|COMPRESS_1})
    uint16_t b_psize = 0x20
    uint16_t b_lsize = 0x20
    uint64_t b_spa = 0xd6b85d7b4a5b89bb
    l2arc_buf_hdr_t b_l2hdr = {
        l2arc_dev_t *b_dev = 0
        uint64_t b_daddr = 0
        arc_state_type_t b_arcs_state = 0 (ARC_STATE_ANON)
        list_node_t b_l2node = {
            struct list_node *list_next = 0
            struct list_node *list_prev = 0
        }
    }
    l1arc_buf_hdr_t b_l1hdr = {
        kmutex_t b_freeze_lock = {
            void *[1] _opaque = [ 0 ]
        }
        zio_cksum_t *b_freeze_cksum = 0
        arc_buf_t *b_buf = 0xfffffdca2e22abe8
        uint32_t b_bufcnt = 0x2
        kcondvar_t b_cv = {
            ushort_t _opaque = 0
        }
        uint8_t b_byteswap = 0xa
        arc_state_t *b_state = ARC_anon
        multilist_node_t b_arc_node = {
            struct list_node *list_next = 0
            struct list_node *list_prev = 0
        }
        clock_t b_arc_access = 0
        zfs_refcount_t b_refcnt = {
            uint64_t rc_count = 0x2
        }
        arc_callback_t *b_acb = 0xfffffdcc49d06540
        abd_t *b_pabd = 0
    }
    arc_buf_hdr_crypt_t b_crypt_hdr = {
        abd_t *b_rabd = 0
        dmu_object_type_t b_ot = 0t10 (DMU_OT_DNODE)
        uint32_t b_ebufcnt = 0x1
        uint64_t b_dsobj = 0xbe
        uint8_t [8] b_salt = [ 0x39, 0x3a, 0x87, 0x5d, 0xd6, 0xc4, 0x28, 0x8a ]
        uint8_t [12] b_iv = [ 0x97, 0xe2, 0xc3, 0xa6, 0xf5, 0xc3, 0xbc, 0x96, 0x5a, 0x64, 0x41, 0xea ]
        uint8_t [16] b_mac = [ 0x45, 0x12, 0xda, 0x72, 0x72, 0x3f, 0xfa, 0xa0, 0xd, 0x44, 0x6c, 0xdd, 0x4a, 0xeb, 0x59, 0x1 ]
    }
}

this time its b_crypt_hdr.b_rabd is NULL, though, which is what causes this other panic

Actions #3

Updated by Alex Wilson 4 months ago

Additional update: one of the CNs that hit this new panic is reporting permanent corruption in a dataset metadnode (looks like the one it panic'd on). persists after multiple scrubs.

Actions #4

Updated by Rich Ercolani 4 months ago

I should remark that, at least with my test reproductions on https://github.com/openzfs/zfs/issues/12257, L2ARC is never involved, so presuming this is the same issue of a dbuf getting freed while someone still had a handle to it, L2ARC is not always required to end up sad in this way.

Actions #5

Updated by Alex Wilson 4 months ago

new day, new stack!

panic[cpu15]/thread=fffffb044fd2bc20: assertion failed: remove_reference(hdr, ((void *)0), tag) == 0 (0x1 == 0x0), file: ../../common/fs/zfs/arc.c, line: 3773

Warning - stack not written to the dump buffer
fffffb044fd2b6a0 fffffffffba436e1 ()
fffffb044fd2b6e0 zfs:arc_buf_destroy+e2 ()
fffffb044fd2b740 zfs:dbuf_destroy+38 ()
fffffb044fd2b840 zfs:dbuf_rele_and_unlock+d0 ()
fffffb044fd2b880 zfs:dbuf_rele+29 ()
fffffb044fd2b940 zfs:dnode_next_offset_level+122 ()
fffffb044fd2b9d0 zfs:dnode_next_offset+95 ()
fffffb044fd2ba90 zfs:dmu_object_next+ff ()
fffffb044fd2baf0 zfs:dmu_objset_space_upgrade+b7 ()
fffffb044fd2bb20 zfs:dmu_objset_id_quota_upgrade_cb+87 ()
fffffb044fd2bb60 zfs:dmu_objset_upgrade_task_cb+55 ()
fffffb044fd2bc00 genunix:taskq_d_thread+bc ()
fffffb044fd2bc10 unix:thread_start+b ()
Actions #6

Updated by Alex Wilson 4 months ago

Update: our first permanent data corruption of significance has happened due to these panics. We have now lost customer data and had to pull some zones out of backups whose dataset metadnode has experienced permanent corruption -- they can no longer be mounted or sent.

Actions #7

Updated by Alex Wilson 4 months ago

For anyone else experiencing this, if we haven't figured it out and fixed it by then, get your data off encrypted ZFS ASAP before it eats your metadnodes. Don't delay -- once you see the first panic, evacuate the machine.

Actions #8

Updated by Alex Wilson 4 months ago

  • Subject changed from panic: dbuf_issue_final_prefetch_done destroys buffer still held by others to panic: zfs dbuf code destroys buffer still held by others
Actions #9

Updated by Alex Wilson 4 months ago

  • Subject changed from panic: zfs dbuf code destroys buffer still held by others to panic: zfs dbuf code destroys buffer still held by others on encrypted pool
Actions #10

Updated by Joshua M. Clulow 3 months ago

I spent some time reproducing this on Friday. The conditions that seem particularly relevant are:

- older ZFS send stream, that would require some set of file system-level upgrades, was received into an encrypted dataset
- more than one clone of that received snapshot is created simultaneously or almost simultaneously

An upgrade that involves dmu_objset_id_quota_upgrade_cb() and dmu_objset_space_upgrade() (etc) needs to write to the pool. We can't upgrade the contents of the snapshot as those are frozen, so we end up upgrading the cloned copy. It seems like that cloned copy can/does continue to refer to some of the blocks in the original snapshot.

By using a SmartOS platform with DEBUG bits, I was able to get a much crisper panic:

panic[cpu17]/thread=fffffe00f7f1ac20: assertion failed: !HDR_IO_IN_PROGRESS(hdr), file: ../../common/fs/zfs/arc.c, line: 6392

fffffe00f7f19eb0 fffffffffbdcf334 ()
fffffe00f7f19f60 zfs:arc_release+830 ()
fffffe00f7f19f90 zfs:dbuf_release_bp+64 ()
fffffe00f7f1a0e0 zfs:dbuf_write+b8 ()
fffffe00f7f1a200 zfs:dbuf_sync_leaf+1c8 ()
fffffe00f7f1a250 zfs:dbuf_sync_list+bb ()
fffffe00f7f1a300 zfs:dbuf_sync_indirect+172 ()
fffffe00f7f1a350 zfs:dbuf_sync_list+ac ()
fffffe00f7f1a400 zfs:dbuf_sync_indirect+172 ()
fffffe00f7f1a450 zfs:dbuf_sync_list+ac ()
fffffe00f7f1a500 zfs:dbuf_sync_indirect+172 ()
fffffe00f7f1a550 zfs:dbuf_sync_list+ac ()
fffffe00f7f1a600 zfs:dbuf_sync_indirect+172 ()
fffffe00f7f1a650 zfs:dbuf_sync_list+ac ()
fffffe00f7f1a700 zfs:dbuf_sync_indirect+172 ()
fffffe00f7f1a750 zfs:dbuf_sync_list+ac ()
fffffe00f7f1a7e0 zfs:dnode_sync+507 ()
fffffe00f7f1a920 zfs:dmu_objset_sync+1a5 ()
fffffe00f7f1a990 zfs:dsl_dataset_sync+b2 ()
fffffe00f7f1aa30 zfs:dsl_pool_sync+b2 ()
fffffe00f7f1aab0 zfs:spa_sync_iterate_to_convergence+f7 ()
fffffe00f7f1ab30 zfs:spa_sync+272 ()
fffffe00f7f1ac00 zfs:txg_sync_thread+29f ()
fffffe00f7f1ac10 unix:thread_start+b ()

Looking at select frames from the panic stack:

fffffe00f7f19f60 arc_release+0x830(
    arc_buf_t           fffffeb30345d300,
    void *tag           fffffeb3143da060)

fffffe00f7f19f90 dbuf_release_bp+0x64(
    dmu_buf_impl_t      fffffeb3143da060)

fffffe00f7f1a0e0 dbuf_write+0xb8(
    dbuf_dirty_record_t fffffeb30bc44580,
        | > fffffeb30bc44580::print dbuf_dirty_record_t dr_dbuf
        | dr_dbuf = 0xfffffeb3143da060
        | > 0xfffffeb3143da060::dbuf
        |             addr   object lvl     blkid holds os
        | fffffeb3143da060      mdn   0        34    33 dynamite/plant/0000
        |
        | > 0xfffffeb3143da060::print dmu_buf_impl_t db_state db_level
        | db_state = 4 (DB_CACHED)
        | db_level = 0
        | > 0xfffffeb39a7bd860::print dnode_t dn_type
        | dn_type = 0t10 (DMU_OT_DNODE)

    arc_buf_t           fffffeb30345d300,
    dmu_tx_t            fffffeb3cbe57900)

...
fffffe00f7f1a7e0 dnode_sync+0x507(
    dnode_t             fffffeb39a7bd860,
        | > fffffeb39a7bd860::print dnode_t
        |     dn_objset->os_dsl_dataset->ds_dir->dd_myname
        | dn_objset->os_dsl_dataset->ds_dir->dd_myname = [ "0000" ]

    dmu_tx_t            fffffeb3cbe57900)

...
fffffe00f7f1ab30 spa_sync+0x272(fffffeb36e344000, 1389a)

There is also an upgrade task waiting on a ZIO:

> fffffe00f3fdcc20::findstack -v
stack pointer for thread fffffe00f3fdcc20 (tq:z_upgrade): fffffe00f3fdc730
[ fffffe00f3fdc730 _resume_from_idle+0x12b() ]
  fffffe00f3fdc760 swtch+0x18a()
  fffffe00f3fdc790 cv_wait+0x89(fffffeb35ee6b4f0, fffffeb35ee6b4e8)
  fffffe00f3fdc7d0 zio_wait+0xbb(fffffeb35ee6b120)
  fffffe00f3fdc860 dbuf_read+0x30d(fffffeb37a73ecf0, 0, 46)
  fffffe00f3fdc940 dnode_next_offset_level+0x508(fffffeb35cbb3c08, 0, fffffe00f3fdca48, 0, 20, 0)
  fffffe00f3fdc9e0 dnode_next_offset+0xb2(fffffeb35cbb3c08, 0, fffffe00f3fdca48, 0, 20, 0)
  fffffe00f3fdcaa0 dmu_object_next+0xea(fffffeb309c77880, fffffe00f3fdcab8, 0, 0)
  fffffe00f3fdcb00 dmu_objset_space_upgrade+0x71(fffffeb309c77880)
  fffffe00f3fdcb30 dmu_objset_id_quota_upgrade_cb+0xb5(fffffeb309c77880)
  fffffe00f3fdcb70 dmu_objset_upgrade_task_cb+0x51(fffffeb309c77880)
  fffffe00f3fdcc00 taskq_d_thread+0xfb(fffffeb369d8c0f8)
  fffffe00f3fdcc10 thread_start+0xb()

The first child zio in the tree that the upgrade task is waiting on has the arc_read_done() callback, with the same arc_buf_hdr_t we're trying to chuck out:

> fffffeb35ee6b120::zio -c -r
ADDRESS                 TYPE  STAGE            WAITER           TIME_ELAPSED
fffffeb35ee6b120        NULL  CHECKSUM_VERIFY  fffffe00f3fdcc20 -
 fffffeb3b9e71710 <<<   READ  VDEV_IO_START    -                -
  fffffeb35a5a2ab0      READ  VDEV_IO_START    -                -
   fffffeb3c27b84c8     READ  VDEV_IO_START    -                25776ms

> fffffeb3b9e71710::print zio_t io_done io_private
io_done = arc_read_done
io_private = 0xfffffeb34d2e32e8

> 0xfffffeb34d2e32e8::print arc_buf_hdr_t b_l1hdr.b_acb
b_l1hdr.b_acb = 0xfffffeb362ce1a60

> 0xfffffeb362ce1a60::print arc_callback_t acb_done acb_private acb_next
acb_done = dbuf_read_done
acb_private = 0xfffffeb37a73ecf0
acb_next = 0
1| dbuf from spa sync dbuf_write/dbuf_release_bp/arc_release:
1|
1| > 0xfffffeb3143da060::print dmu_buf_impl_t db_blkptr |::blkptr
1| DVA[0]=<0:6c2c3bce00:4000>
1| DVA[1]=<0:1875fcd600:4000>
1| salt=9a66415f5d7e43b4 iv=5d93388c6a7ae0b5:a4d685e4
1| [L0 DNODE] FLETCHER_4 OFF encrypted LE contiguous unique double
1| size=4000L/4000P birth=1345L/1345P fill=32
1| cksum=f76500ede3bc963c:f55159c2025849f:1f5bb4f444e36704:3c2398842a4679a7
1|
1| > 0xfffffeb3143da060::dbuf
1|             addr   object lvl     blkid holds os
1| fffffeb3143da060      mdn   0        34    33 dynamite/plant/0000

2| dbuf from dbuf_read/zio_wait in upgrade task:
2|
2| > fffffeb37a73ecf0::print dmu_buf_impl_t db_blkptr | ::blkptr
2| DVA[0]=<0:6c2c3bce00:4000>
2| DVA[1]=<0:1875fcd600:4000>
2| salt=9a66415f5d7e43b4 iv=5d93388c6a7ae0b5:a4d685e4
2| [L0 DNODE] FLETCHER_4 OFF encrypted LE contiguous unique double
2| size=4000L/4000P birth=1345L/1345P fill=32
2| cksum=f76500ede3bc963c:f55159c2025849f:1f5bb4f444e36704:3c2398842a4679a7
2|
2| > fffffeb37a73ecf0::dbuf
2|             addr   object lvl     blkid holds os
2| fffffeb37a73ecf0      mdn   0        34     2 dynamite/plant/0001

Same block pointer, presumably to the data in the snapshot we cannot change. But, two distinct read/write I/O operations.

It would seem that we are not correctly tracking references or holds on something here.

Actions

Also available in: Atom PDF