Project

General

Profile

Actions

Bug #13358

closed

dmu_objset_upgrade_stop() needs to wait

Added by Jason King over 1 year ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
zfs - Zettabyte File System
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

While running the zfs test suite on a DEBUG build, it tripped the following assertion:

panic message: assertion failed: rc->rc_count == number, file: ../../common/fs/zfs/refcount.c, line: 90

Recreating it with refcount tracking enabled, this immediate cause is clear:

> ::stack
vpanic()
0xfffffffffbdcd395()
zfs_refcount_destroy_many+0x30(fffffe172cbad768, 0)
zfs_refcount_destroy+0x10(fffffe172cbad768)
dsl_dataset_evict_async+0x12a(fffffe172cbad440)
taskq_thread+0x315(fffffe170cee0eb0)
thread_start+0xb()
> fffffe172cbad768::zfs_refcount
zfs_refcount_t at fffffe172cbad768 has 1 current holds, 3 recently released holds
current holds:
reference with tag fffffffff79decca "upgrade_tag", held at:
fffffe1c6a649300 is allocated from reference_cache:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
fffffe1c67d829d0 fffffe1c6a649300      4e89b292133 fffffe1c9e3387c0
                 fffffe170ce92888 fffffe16e20e9cc0 fffffe16f4b6d850
                 kmem_cache_alloc_debug+0x2fc
                 kmem_cache_alloc+0x135
                 zfs_refcount_add_many+0x82
                 zfs_refcount_add+0x13
                 dsl_dataset_long_hold+0x59
                 dmu_objset_upgrade+0x73
                 dmu_objset_id_quota_upgrade+0x15
                 dmu_objset_own+0x14f
                 zfsvfs_create+0x6c
                 zfs_domount+0x50
                 zfs_mount+0x2a7
                 fsop_mount+0x14
                 domount+0xa78
                 mount+0xfe
                 syscall_ap+0x98

> fffffe170cee0eb0::taskq
ADDR             NAME                             ACT/THDS Q'ED  MAXQ INST
fffffe170cee0eb0 dbu_evict                          1/   1    1   390    -

So we have a hold on the dataset due to dmu_object_upgrade() that hasn't been released.

Comparing to OpenZFS, we are missing a call to txg_wait_synced(os->os_spa->spa_dsl_pool, 0); after we wait for the upgrade taskq to complete. With this call in place, I've been unable to trip the assert again (without it, I can easily trip it with a subset of the zfs test suite).

Frustratingly, I cannot yet come up with a good explanation why that fixes it. I suspect there is some sort of race going on, but I don't know what, and if the txg_wait_synced is just papering over the problem by inducing sufficient delay to not cause the race, or if that is in fact correct.

The call should probably be there since it seems to be a bit from the zfs encryption port that got missed, but I'd still like dig more to have a better explanation.


Related issues

Related to illumos gate - Bug #13194: null/dangling pointer deref somewhere under dmu_objset_upgradeClosed

Actions
Actions #1

Updated by Andy Fiddaman over 1 year ago

  • Related to Bug #13194: null/dangling pointer deref somewhere under dmu_objset_upgrade added
Actions #2

Updated by Andy Fiddaman over 1 year ago

This is a reliable reproducer for me in a bhyve VM with 16 vCPUs.

zpool create -f -O version=1 testpool c3t0d0; zpool destroy testpool
Actions #3

Updated by Andy Fiddaman over 1 year ago

I tested the proposed fix (see attached Gerrit) by verifying directly that it fixes the leaked reference using the simple reproducer in the previous note.

I successfully ran the full ZFS test suite in a loop for a couple of days under a DEBUG onu with this change in place and observed no regressions in the list of successful tests (and no panics!); there are still some tests which are known to fail but they failed both with and without this change in place.

Actions #4

Updated by Electric Monk over 1 year ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit cc37296fa5ddc1ddd4012e040f797164c2a5cf5d

commit  cc37296fa5ddc1ddd4012e040f797164c2a5cf5d
Author: Andy Fiddaman <omnios@citrus-it.co.uk>
Date:   2020-12-16T14:18:06.000Z

    13358 dmu_objset_upgrade_stop() needs to wait
    12397 12254 broke the zfs test suite for older python versions
    Portions contributed by: Tom Caputi <tcaputi@datto.com>
    Portions contributed by: Nasf-Fan <fan.yong@intel.com>
    Portions contributed by: Gvozden Neskovic <neskovic@gmail.com>
    Reviewed by: Jason King <jason.king@joyent.com>
    Reviewed by: Igor Kozhukhov <igor@dilos.org>
    Approved by: Dan McDonald <danmcd@joyent.com>

Actions

Also available in: Atom PDF