Bug #16436

closed

zfs reference tracking panic during zfs_receive_raw_incremental test

Added by Andy Fiddaman 20 days ago. Updated 12 days ago.

Status: Closed
Priority: Normal
Assignee:
Category: zfs - Zettabyte File System
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug:

Description

When running the ZFS testsuite on a system which has zfs reference tracking
enabled (zfs:reference_tracking_enable = 1), the system panics
when the zfs_receive_raw_incremental test runs.

panic[cpu12]/thread=fffffe003de4fc20: No such hold fffffe31f0338d18 with count 4096 on refcount ffffffffc0006298

fffffe003de4f5a0 zfs:zvol_minors+33915ae4 ()
fffffe003de4f610 zfs:arc_free_data_impl+6d ()
fffffe003de4f650 zfs:arc_free_data_abd+2a ()
fffffe003de4f690 zfs:arc_hdr_free_pabd+122 ()
fffffe003de4f800 zfs:arc_write+2a2 ()
fffffe003de4f940 zfs:dmu_objset_sync+19f ()
fffffe003de4f9a0 zfs:dsl_dataset_sync+a0 ()
fffffe003de4fa50 zfs:dsl_pool_sync+aa ()
fffffe003de4fad0 zfs:spa_sync_iterate_to_convergence+dd ()
fffffe003de4fb50 zfs:spa_sync+2ae ()
fffffe003de4fc00 zfs:txg_sync_thread+263 ()
fffffe003de4fc10 unix:thread_start+b ()

Related issues

Related to illumos gate - Feature #16203: zfs: switch refcount tracking from lists to AVL-trees (Closed, Andy Fiddaman)

Actions #1

Updated by Andy Fiddaman 20 days ago

Interestingly, the tag that is being released is for an arc header allocated
in arc_hdr_realloc_crypt, and there is another reference with the same
4096 count for a buffer which has been freed in arc_hdr_realloc_crypt.

> fffffe2df2f0f018::whatis
fffffe2df2f0f018 is allocated from arc_buf_hdr_t_full_crypt:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
fffffe2df2a62978 fffffe2df2f0f018       167850df46 fffffe0040ee7c20
                 fffffe2d5181d648 fffffe2d1d19bfc0                0
                 kmem_cache_alloc_debug+0x37c
                 kmem_cache_alloc+0x253
                 arc_hdr_realloc_crypt+0xee
                 arc_convert_to_raw+0x18d
                 dmu_objset_sync+0x12d
                 dsl_dataset_sync+0xa0
                 dsl_pool_sync+0xaa
                 spa_sync_iterate_to_convergence+0xdd
                 spa_sync+0x2ae
                 txg_sync_thread+0x263
                 thread_start+0xb

> ffffffffc0006298::zfs_refcount ! grep 4096
reference with count=4096 with tag fffffe2d5bb05ab8 "", held at:
reference with count=4096 with tag fffffe2d6ae1fb80, held at:
reference with count=4096 with tag fffffe2df2f103f8, held at:
> fffffe2df2f103f8::whatis
fffffe2df2f103f8 is freed from arc_buf_hdr_t_full:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
fffffe2df31589b0 fffffe2df2f103f8       167850e499 fffffe0040ee7c20
                 fffffe2d5181d008 fffffe2d1d19c200 fffffe2d3d967d70
                 kmem_cache_free_debug+0x114
                 kmem_cache_free+0x89
                 arc_hdr_realloc_crypt+0x335
                 arc_convert_to_raw+0x18d
                 dmu_objset_sync+0x12d
                 dsl_dataset_sync+0xa0
                 dsl_pool_sync+0xaa
                 spa_sync_iterate_to_convergence+0xdd
                 spa_sync+0x2ae
                 txg_sync_thread+0x263
                 thread_start+0xb

It's fairly easy to see what's happening here. When an ARC header is
converted from unencrypted to encrypted (or vice versa), a new header is
allocated and the old one is freed, but the references held under the old
header's tag are not transferred to the new one. This goes unnoticed unless
reference tracking is enabled, since only then are the tags checked.
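
The failure mode can be illustrated with a toy model. The sketch below is not the ZFS code and not the change made for this bug; the class, its method names and the `realloc_header` helper are all hypothetical. It assumes only that a tracked refcount remembers which tag took each hold and refuses to release a hold under a tag it has never seen.

```python
class TrackedRefcount:
    """Toy model of a refcount with reference tracking enabled (hypothetical)."""

    def __init__(self):
        # (tag, count) -> number of outstanding holds with that tag and count
        self.holds = {}

    def add_many(self, count, tag):
        key = (tag, count)
        self.holds[key] = self.holds.get(key, 0) + 1

    def remove_many(self, count, tag):
        key = (tag, count)
        if self.holds.get(key, 0) == 0:
            # Stands in for the kernel panic seen in the report.
            raise RuntimeError(f"No such hold {tag} with count {count}")
        self.holds[key] -= 1
        if self.holds[key] == 0:
            del self.holds[key]

    def transfer_ownership_many(self, count, old_tag, new_tag):
        # Move one hold from old_tag to new_tag without changing the total.
        self.remove_many(count, old_tag)
        self.add_many(count, new_tag)

    def total(self):
        return sum(count * n for (_tag, count), n in self.holds.items())


def realloc_header(rc, old_hdr, new_hdr, size, transfer):
    """Replace old_hdr with new_hdr, optionally moving its hold across."""
    if transfer:
        rc.transfer_ownership_many(size, old_hdr, new_hdr)
    return new_hdr


# Without the transfer: the hold is still recorded under the old tag,
# so releasing it with the new header as the tag fails.
buggy = TrackedRefcount()
buggy.add_many(4096, "old_hdr")
hdr = realloc_header(buggy, "old_hdr", "new_hdr", 4096, transfer=False)
try:
    buggy.remove_many(4096, hdr)
    buggy_result = "released"
except RuntimeError as err:
    buggy_result = str(err)

# With the transfer: the release under the new tag succeeds.
fixed = TrackedRefcount()
fixed.add_many(4096, "old_hdr")
hdr = realloc_header(fixed, "old_hdr", "new_hdr", 4096, transfer=True)
fixed.remove_many(4096, hdr)
fixed_total = fixed.total()
```

In the untracked case the model would only keep a running total, so the mismatch between tags would never be detected, which matches the observation that the problem only shows up with tracking turned on.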

Actions #2

Updated by Andy Fiddaman 19 days ago

I looked at OpenZFS and it doesn't look like they ever fixed this. They have since moved away from having different L1 ARC header structs for the encrypted and unencrypted cases, and just accept the overhead in the non-crypt case.

With the fix in place, a full ZFS testsuite run in DEBUG bits with reference tracking enabled succeeds:

Results Summary
PASS     1260
FAIL       8
SKIP      24
KILLED     1

Running Time:   11:16:57
Percent passed: 97.4%
Log directory:  /var/tmp/test_results/20240401T233017

omnios@zfstest:~$ pfexec mdb -ke reference_tracking_enable::print
0x1

The skips are the TRIM tests due to lack of support in the hypervisor I'm using and the killed test is persist_l2arc_008_pos as per usual.

#16203 significantly reduces the impact of reference tracking on the time it takes to run the testsuite; the overhead is now almost lost in the noise.

Actions #3

Updated by Electric Monk 19 days ago

  • Gerrit CR set to 3396
Actions #4

Updated by Andy Fiddaman 19 days ago

  • Related to Feature #16203: zfs: switch refcount tracking from lists to AVL-trees added
Actions #5

Updated by Andy Fiddaman 19 days ago

  • Gerrit CR deleted (3396)
Actions #6

Updated by Electric Monk 19 days ago

  • Gerrit CR set to 3398
Actions #7

Updated by Electric Monk 12 days ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 3133214d3780f1eb6711c8d5d2a5b2a7c64ae50b

commit  3133214d3780f1eb6711c8d5d2a5b2a7c64ae50b
Author: Andy Fiddaman <illumos@fiddaman.net>
Date:   2024-04-09T15:21:23.000Z

    16436 zfs reference tracking panic during zfs_receive_raw_incremental test
    Reviewed by: Bill Sommerfeld <sommerfeld@hamachi.org>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Dan McDonald <danmcd@mnx.io>
