Bug #13892

closed

panic when deleting millions of files

Added by Jerry Jelinek 4 months ago. Updated 4 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: zfs - Zettabyte File System
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

At RackTop we ran into this bug.

The system panicked while deleting 67 million files from a dataset with the following command:

# rm -rf /storage/m01/global/mil01/Root_NFS/

Actions #1

Updated by Jerry Jelinek 4 months ago

Here is the relevant info from the dump:

> ::status
debugging crash dump vmcore.0 (64-bit) from bsrscale02
operating system: 23.0 23.0.0TEST.2088 (i86pc)
image uuid: 783fd37e-ba45-c743-fb91-87becbe5e1e0
panic message: mutex_exit: not owner, lp=fffffe279602bab8 owner=0 thread=fffffcc2707e2c20
dump content: kernel pages only
> 
> $C
fffffcc2707e2630 vpanic()
fffffcc2707e2650 mutex_panic+0x4a(fffffffffb95c36e, fffffe279602bab8)
fffffcc2707e2680 mutex_vector_exit+0x3e(fffffe279602bab8)
fffffcc2707e26d0 dbuf_try_add_ref+0x3e(fffffe279602ba50, fffffe442b1950c0, 1b1f, ffffffffffffffff, fffffffff7ab2580)
fffffcc2707e2720 dsl_dataset_try_add_ref+0x4f(fffffe271cc3d000, ffffff1c1a52e280, fffffffff7ab2580)
fffffcc2707e27a0 dsl_prop_notify_all_cb+0x97(fffffe271cc3d000, fffffebbccb20dc0, 0)
fffffcc2707e2850 dmu_objset_find_dp_impl+0x15a(fffffe293bb2a518)
fffffcc2707e28f0 dmu_objset_find_dp+0x148(fffffe271cc3d000, 2f0d, fffffffff79d10e0, 0, 2)
fffffcc2707e2910 dsl_prop_notify_all+0x25(fffffe2f97c21d40)
fffffcc2707e2990 dsl_dir_rename_sync+0x2c4(fffffcc270a10b20, fffffe3633f73c00)
fffffcc2707e29d0 dsl_sync_task_sync+0x93(fffffcc270a109f8, fffffe3633f73c00)
fffffcc2707e2a80 dsl_pool_sync+0x34b(fffffe271cc3d000, 222f)
fffffcc2707e2b00 spa_sync_iterate_to_convergence+0xd0(fffffeb2dcc02000, fffffe28c1db0280)
fffffcc2707e2b60 spa_sync+0x2f6(fffffeb2dcc02000, 222f)
fffffcc2707e2c00 txg_sync_thread+0x1f5(fffffe271cc3d000)
fffffcc2707e2c10 thread_start+0xb()
> fffffe279602bab8::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
fffffe279602bab8 adapt               no      -      -      no

We're calling mutex_exit on a mutex we're not holding. I've looked up the call stack and confirmed none of the callers are taking the mutex and expecting this function to release it.

I've looked at the OpenZFS code, and this is fixed there by commit d617648c7fc.
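To illustrate the class of bug behind the "mutex_exit: not owner" panic, here is a minimal user-space sketch using a POSIX error-checking mutex as an analogy. It is not the kernel code and not the OpenZFS fix: `demo_buf_t`, `try_add_ref_buggy`, `try_add_ref_fixed`, and `demo_check` are hypothetical names, and the control flow is only an assumed example of a path that releases a lock it never took. Where the kernel panics, an error-checking pthread mutex returns `EPERM`.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* for PTHREAD_MUTEX_ERRORCHECK on strict builds */
#endif
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object protected by its own lock. */
typedef struct demo_buf {
	pthread_mutex_t lock;
	int holds;
} demo_buf_t;

/* Use an error-checking mutex so that misuse is reported as an error code. */
static int
demo_buf_init(demo_buf_t *b)
{
	pthread_mutexattr_t attr;
	int err;

	b->holds = 0;
	if ((err = pthread_mutexattr_init(&attr)) != 0)
		return (err);
	err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (err == 0)
		err = pthread_mutex_init(&b->lock, &attr);
	(void) pthread_mutexattr_destroy(&attr);
	return (err);
}

/* Buggy shape: the unlock runs even on the path that never locked. */
static int
try_add_ref_buggy(demo_buf_t *b, bool found)
{
	if (found) {
		(void) pthread_mutex_lock(&b->lock);
		b->holds++;
	}
	return (pthread_mutex_unlock(&b->lock));	/* EPERM if !found */
}

/* Fixed shape: the unlock is paired with the lock on the same path. */
static int
try_add_ref_fixed(demo_buf_t *b, bool found)
{
	int err = 0;

	if (found) {
		(void) pthread_mutex_lock(&b->lock);
		b->holds++;
		err = pthread_mutex_unlock(&b->lock);
	}
	return (err);
}

/* Returns 0 when both variants behave as described above. */
static int
demo_check(void)
{
	demo_buf_t b;

	if (demo_buf_init(&b) != 0)
		return (-1);
	if (try_add_ref_buggy(&b, false) != EPERM)
		return (-1);
	if (try_add_ref_fixed(&b, false) != 0)
		return (-1);
	if (try_add_ref_fixed(&b, true) != 0 || b.holds != 1)
		return (-1);
	return (pthread_mutex_destroy(&b.lock));
}
```

Calling `demo_check()` from a `main` and checking for a zero return exercises both variants. The error-checking mutex type is used here only because a default pthread mutex leaves unlocking an unowned mutex undefined.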

Actions #2

Updated by Electric Monk 4 months ago

  • Gerrit CR set to 1558
Actions #3

Updated by Jerry Jelinek 4 months ago

I don't think this bug is easy to reproduce: we have seen it only once, and the fix has been in OpenZFS since 2015. Given how long the fix has been in OpenZFS, I would like to treat it as having been tested there.

Actions #4

Updated by Joshua M. Clulow 4 months ago

Jerry Jelinek wrote in #note-3:

I don't think this bug is easy to reproduce: we have seen it only once, and the fix has been in OpenZFS since 2015. Given how long the fix has been in OpenZFS, I would like to treat it as having been tested there.

I suspect that's fair enough, provided you do the appropriate build and some sort of smoke test afterwards to make sure things are still plausibly working.

Actions #5

Updated by Jerry Jelinek 4 months ago

At RackTop we included this change in the middle of last week, and we're running it through all of our normal usage and testing, since we're in the middle of a beta release. We've had about one week of solid usage at this point.

Actions #6

Updated by Electric Monk 4 months ago

  • Status changed from New to Closed
  • % Done changed from 90 to 100

git commit 6f43873c1c66055abf8e3c15d5fec716e84f69ba

commit  6f43873c1c66055abf8e3c15d5fec716e84f69ba
Author: Ned Bass <bass6@llnl.gov>
Date:   2021-06-24T21:38:07.000Z

    13892 panic when deleting millions of files

    Portions contributed by: Jerry Jelinek <gjelinek@gmail.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Jason King <jason.brian.king@gmail.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
