Bug #3085

zfs diff panics, then panics in a loop on booting

Added by Rich Lowe about 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Urgent
Category: zfs - Zettabyte File System
Start date: 2012-08-16
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:

Description

I had the misfortune to interrupt a 'zfs diff'. The system immediately panicked for a reason I didn't catch; subsequent boots then panic under

spa_history_log_internal_ds() --> dsl_dataset_name() --> dsl_dir_name(), because the ds in question has a NULL ds_dir.

We panic the same way when running 'zfs diff' in general; it's just a matter of when we try to release the hold.

History

#1

Updated by Rich Lowe about 7 years ago

It is perhaps important that the zfs diff invocation I used was of the form

zfs diff foo@bar foo

Such that a hidden snapshot (or whatever the term) was created.

#2

Updated by Rich Lowe about 7 years ago

It's not clear to me (or, I think, Schrock), whether the problem is that the ds_dir is NULL, or that we don't cope.

Searching the code, we're not particularly resilient to it being NULL, which suggests it should not be. But Matt fixed http://wesunsolve.net/bugid/id/6798428 in combination with http://wesunsolve.net/bugid/id/6860996, where the former has the same basic cause and the latter did indeed introduce a little care about NULL ds_dir.

#3

Updated by Matthew Ahrens about 7 years ago

Are you able to get a dump? If the diff was not on your root filesystem, you can boot with "-m milestone=none" in the "kernel" line in grub, and then it will probably come up and have the dump.

#4

Updated by Christopher Siden about 7 years ago

I think we found the problem. spa_history_log_internal_ds() is being called too late from dsl_dataset_user_release_sync() (after the dataset has been destroyed). It would be nice to confirm that this is the case you hit, though (is the caller of spa_history_log_internal_ds() in your crash dump dsl_dataset_user_release_sync()?).

I'll get a fix prepared tonight.
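
To make the ordering problem concrete, here is a minimal user-space sketch, not the illumos code. Every name in it (model_dir, model_ds, model_destroy, model_log, release_log_late, release_log_early) is hypothetical, and the model assumes that destroying a dataset leaves it with a NULL ds_dir, as reported in the description. The logging step returns -1 at the point where the kernel would dereference the NULL pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel structures; not the illumos definitions. */
struct model_dir { const char *dd_name; };
struct model_ds  { struct model_dir *ds_dir; };

/* Assumption: after destruction the dataset no longer has a directory. */
static void model_destroy(struct model_ds *ds)
{
    ds->ds_dir = NULL;
}

/*
 * Models building a history record that needs the dataset name.
 * Returns -1 where the real code would dereference a NULL ds_dir.
 */
static int model_log(const struct model_ds *ds, char *buf, size_t len)
{
    if (ds->ds_dir == NULL)
        return -1;
    snprintf(buf, len, "release %s", ds->ds_dir->dd_name);
    return 0;
}

/* Ordering described in this comment: destroy first, log afterwards. */
static int release_log_late(struct model_ds *ds, char *buf, size_t len)
{
    model_destroy(ds);
    return model_log(ds, buf, len);
}

/* One ordering that avoids the problem: log while the dataset is intact. */
static int release_log_early(struct model_ds *ds, char *buf, size_t len)
{
    int rc = model_log(ds, buf, len);
    model_destroy(ds);
    return rc;
}
```

In this model release_log_late() always fails and release_log_early() always succeeds. Whether the real fix reorders the calls or takes some other approach is not stated here.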

#5

Updated by Christopher Siden about 7 years ago

  • Assignee set to Christopher Siden

#6

Updated by Rich Lowe about 7 years ago

Yes, dsl_dataset_user_release_sync.

<richlowe> txg_sync_thread --> spa_sync --> dsl_pool_sync -->
           dsl_sync_task_group_sync --> dsl_dataset_user_release_sync -->
           spa_history_log_internal_ds

Is the stack I typed out for schrock. Sorry I forgot to paste it in here, too.

#7

Updated by Rich Lowe about 7 years ago

I booted a prior BE, out of desperation, which hasn't crashed.

It seems likely that Matt's change to use _internal_ds v. _internal in dsl_dataset_user_release_sync is the cause of at least the post-reboot panics, as csiden said.

I'm going to try to reproduce the zfs diff panic with a bit more control, in case it differs.

#8

Updated by Rich Lowe about 7 years ago

  • Subject changed from "interrupting zfs diff panics, may damage filesystem" to "zfs diff panics, then panics in a loop on booting"

#9

Updated by Christopher Siden about 7 years ago

It looks like the panic will occur any time a 'zfs release' causes a defer-destroyed snapshot to actually get destroyed (this happens implicitly when you 'zfs diff' the live file system). It looks like 'zfs diff' and 'zfs hold/release' are not exercised by the ZFS test suite. I will look into that.
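
The trigger condition above can be sketched as a small user-space model of holds and deferred destruction. This is an illustration only, not the ZFS implementation: model_snap and its functions are hypothetical names, and the rule that a destroy request on a held snapshot is deferred until the last hold is released is the assumption being modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot state; not the on-disk or in-kernel ZFS layout. */
struct model_snap {
    int  holds;          /* outstanding user holds */
    bool defer_destroy;  /* a destroy was requested while holds existed */
    bool destroyed;      /* the snapshot is gone */
};

static void model_hold(struct model_snap *s)
{
    s->holds++;
}

/* A destroy request is deferred while holds exist, immediate otherwise. */
static void model_destroy_deferred(struct model_snap *s)
{
    if (s->holds > 0)
        s->defer_destroy = true;
    else
        s->destroyed = true;
}

/*
 * Drops one hold. Returns true only when this release is the one that
 * actually destroys a defer-destroyed snapshot, which is the case this
 * comment identifies as the one that panics.
 */
static bool model_release(struct model_snap *s)
{
    if (s->holds > 0)
        s->holds--;
    if (s->holds == 0 && s->defer_destroy && !s->destroyed) {
        s->destroyed = true;
        return true;
    }
    return false;
}
```

In this model a release of an ordinary held snapshot, or a release that leaves other holds in place, never reaches the destroying branch; only the final release of a defer-destroyed snapshot does.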

#10

Updated by Rich Lowe about 7 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100
  • Tags deleted (needs-triage)

Resolved in r13772 commit:2579580ac955
