Bug #11942

Panic on zil/slog replay when TX_REMOVE followed by TX_CREATE

Added by Andy Fiddaman 14 days ago. Updated 10 days ago.

Status: Closed
Priority: High
Assignee:
Category: -
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:

Description

panic[cpu7]/thread=fffffe16f0f4fb40: assertion failed: dmu_object_claim_dnsize(zfsvfs->z_os, obj, DMU_OT_PLAIN_FILE_CONTENTS, 0, obj_type, bonuslen, dnodesize, tx) == 0 (0x1c == 0x0), file: ../../common/fs/zfs/zfs_znode.c, line: 861

fffffe002258a280 genunix:process_type+153649 ()
fffffe002258a400 zfs:zfs_mknode+7e0 ()
fffffe002258a540 zfs:zfs_create+6fa ()
fffffe002258a5e0 genunix:fop_create+cf ()
fffffe002258a7a0 zfs:zfs_replay_create+2b8 ()
fffffe002258a800 zfs:zil_replay_log_record+f2 ()
fffffe002258a9d0 zfs:zil_parse+1f8 ()
fffffe002258aa50 zfs:zil_replay+bc ()
fffffe002258aa90 zfs:zfsvfs_setup+bd ()
fffffe002258ab10 zfs:zfs_domount+171 ()
fffffe002258ac30 zfs:zfs_mount+2a7 ()
fffffe002258ac60 genunix:fsop_mount+14 ()
fffffe002258add0 genunix:domount+952 ()
fffffe002258ae70 genunix:mount+fe ()
fffffe002258aeb0 genunix:syscall_ap+98 ()
fffffe002258af10 unix:brand_sys_sysenter+1dc ()

This problem was introduced with the large dnodes feature back in March; I have seen only two reports of it in the wild.

The following commit from zfsonlinux fixes this (and adds a test to ensure there are no future regressions):

commit 035e96118bc9a7cbf435dd17dda507b870fcf6e6
Author: Chunwei Chen <david.chen@nutanix.com>
Date:   Wed Aug 28 10:42:02 2019 -0700

Fix zil replay panic when TX_REMOVE followed by TX_CREATE

If TX_REMOVE is followed by TX_CREATE on the same object id, we need to
make sure the object removal is completely finished before creation. The
current implementation relies on dnode_hold_impl with
DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work
fine before, in current version it does not guarantee the object removal
is completed.

We fix this by checking if DNODE_MUST_BE_FREE returns successful
instead. Also add test and remove dead code in dnode_hold_impl.
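
As a rough illustration of the reasoning in the commit message above, here is a toy model in C. It is not the real dnode code: the slot states, the `toy_*` names, and the "freeing" intermediate state are all hypothetical, and the use of ENOSPC is an assumption based only on the 0x1c (28) value in the panic message. It shows why "a MUST_BE_ALLOCATED hold returned ENOENT" can be true while the object id is still not claimable, whereas "a MUST_BE_FREE hold succeeded" implies the claim will succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model only: NOT the real dnode code. All names are hypothetical. */
enum toy_slot_state {
	TOY_SLOT_FREE,		/* slot can be claimed by a new object */
	TOY_SLOT_ALLOCATED,	/* live object */
	TOY_SLOT_FREEING	/* removal started but not yet completed */
};

enum toy_hold_flag {
	TOY_MUST_BE_ALLOCATED,
	TOY_MUST_BE_FREE
};

/* Loosely mimics the two hold modes named in the commit message. */
static int
toy_hold(enum toy_slot_state st, enum toy_hold_flag flag)
{
	if (flag == TOY_MUST_BE_ALLOCATED) {
		/* An object that is still being freed already looks gone. */
		return (st == TOY_SLOT_ALLOCATED ? 0 : ENOENT);
	}
	assert(flag == TOY_MUST_BE_FREE);
	/* Succeeds only once the slot is really reusable (assumed ENOSPC). */
	return (st == TOY_SLOT_FREE ? 0 : ENOSPC);
}

/* Old-style check: ENOENT from a MUST_BE_ALLOCATED hold means "removed". */
static bool
toy_removal_done_old(enum toy_slot_state st)
{
	return (toy_hold(st, TOY_MUST_BE_ALLOCATED) == ENOENT);
}

/* New-style check: require a MUST_BE_FREE hold to succeed. */
static bool
toy_removal_done_new(enum toy_slot_state st)
{
	return (toy_hold(st, TOY_MUST_BE_FREE) == 0);
}

/* In this model, claiming an object id fails unless the slot is free. */
static int
toy_claim(enum toy_slot_state st)
{
	return (toy_hold(st, TOY_MUST_BE_FREE));
}
```

In this model, a slot in the `TOY_SLOT_FREEING` state passes the old check but fails `toy_claim()`, which corresponds to the assertion failure in the panic; the new check rejects that state, so a replay could wait for the removal to finish before creating the object.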


Related issues

Related to illumos gate - Feature #8423: Implement large_dnode pool feature | Closed | 2017-06-23

History

#1

Updated by Andy Fiddaman 14 days ago

  • Related to Feature #8423: Implement large_dnode pool feature added
#2

Updated by Andy Fiddaman 13 days ago

With the ZoL fix in place, the new test passes and does not cause a panic:

Test: /opt/zfs-tests/tests/functional/slog/slog_replay_fs_001 (run as root) [00:11] [PASS]
Test: /opt/zfs-tests/tests/functional/slog/slog_replay_fs_002 (run as root) [00:22] [PASS]
#3

Updated by Electric Monk 10 days ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

commit  d8849d7dee03b84a3fa281ec65eb9e3d86d3756b
Author: Chunwei Chen <david.chen@nutanix.com>
Date:   2019-11-11T20:21:17.000Z

    11943 Fix out-of-order ZIL txtype lost on hardlinked files
    11942 Panic on zil/slog replay when TX_REMOVE followed by TX_CREATE
    Portions contributed by: Ryan Moeller <ryan@freqlabs.com>
    Portions contributed by: Andy Fiddaman <andy@omniosce.org>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
