i/o errors when deleting filesystem/zvol can lead to space map corruption

Added by Christopher Siden almost 10 years ago. Updated over 9 years ago.

zfs - Zettabyte File System
Analysis by Matt Ahrens:

If we encounter an i/o error (on a piece of metadata) while deleting a
filesystem or zvol, we don't update the bptree_entry_phys_t's bookmark.  When
we sync the next txg, we will revisit some bp's that have already been freed,
freeing them again.  This will continue every txg until we happen to reload the
broken space map, at which point we will panic.  Typical panic message due to
this bug:

panic[cpu0]/thread=ffffff0009bf0c40: zfs: allocating allocated segment(offset=12518912 size=1536)

ffffff0009bf0920 genunix:vcmn_err+23 ()
ffffff0009bf0990 zfs:zfs_panic_recover+51 ()
ffffff0009bf0a50 zfs:range_tree_add+14e ()
ffffff0009bf0b00 zfs:space_map_load+34e ()
ffffff0009bf0b30 zfs:metaslab_load+3b ()
ffffff0009bf0b60 zfs:metaslab_preload+45 ()
ffffff0009bf0c20 genunix:taskq_thread+2d0 ()
ffffff0009bf0c30 unix:thread_start+8 ()

Note that this manifestation should only be possible on nondebug bits.  On
debug, we should fail this assertion in bptree_iterate():
            ASSERT(err == 0 || err == ERESTART);
which will prevent us from corrupting the on-disk state.
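
The double free described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the struct, the function names and the fixed six-block layout are all hypothetical and are not taken from the ZFS source. It shows that when the persisted bookmark is left stale after an error, every later pass frees the same leading blocks again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NBLOCKS 6

/* Hypothetical toy state: one slot per block pointer to be freed. */
struct sketch_destroy {
	bool readable[SKETCH_NBLOCKS];   /* false simulates an i/o error */
	int free_count[SKETCH_NBLOCKS];  /* how often each block was freed */
	size_t bookmark;                 /* persisted resume point */
};

/*
 * One txg's worth of work, modelled on the buggy behaviour: blocks are
 * freed as they are visited, but the persisted bookmark only advances
 * when the whole pass completes without error.
 * Returns 0 on completion, -1 on a simulated i/o error.
 */
int
sketch_sync_buggy(struct sketch_destroy *d)
{
	size_t i;

	assert(d->bookmark <= SKETCH_NBLOCKS);
	for (i = d->bookmark; i < SKETCH_NBLOCKS; i++) {
		if (!d->readable[i])
			return (-1);	/* bookmark is left stale */
		d->free_count[i]++;
	}
	d->bookmark = SKETCH_NBLOCKS;
	return (0);
}

/* Run `txgs` passes and return the largest per-block free count. */
int
sketch_max_frees_after(int txgs)
{
	struct sketch_destroy d = {
		.readable = { true, true, true, false, true, true },
	};
	int t, max = 0;
	size_t i;

	for (t = 0; t < txgs; t++)
		(void) sketch_sync_buggy(&d);
	for (i = 0; i < SKETCH_NBLOCKS; i++) {
		if (d.free_count[i] > max)
			max = d.free_count[i];
	}
	return (max);
}
```

In this model the blocks ahead of the unreadable one are freed once per pass, so sketch_max_frees_after(3) reports three frees of the same block.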

This problem would be exacerbated by
  3835 zfs need not store 2 copies of all metadata
because there would be only one copy of L1 indirect blocks.  Right now,
there are 2 copies of all metadata, so losing a single vdev (on a multi-vdev
pool) would typically not lose any metadata, and we couldn't hit this bug.

A quick fix would be to turn this ASSERT into a VERIFY, so that nondebug systems will
panic here as well.
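
For readers unfamiliar with the distinction: ASSERT is compiled out of nondebug builds, while VERIFY is always evaluated. The sketch below imitates that with hypothetical SKETCH_ASSERT / SKETCH_VERIFY macros. These are stand-ins, not the kernel definitions; they count failures instead of panicking, and the SKETCH_EIO / SKETCH_ERESTART values are placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical failure counter; the real macros panic instead. */
int sketch_failures;

void
sketch_check_failed(const char *expr)
{
	assert(expr != NULL);
	sketch_failures++;
	(void) fprintf(stderr, "check failed: %s\n", expr);
}

/* Always evaluated, like VERIFY. */
#define	SKETCH_VERIFY(expr) \
	((void)((expr) || (sketch_check_failed(#expr), 0)))

/* Compiled out unless SKETCH_DEBUG is defined, like ASSERT. */
#ifdef SKETCH_DEBUG
#define	SKETCH_ASSERT(expr)	SKETCH_VERIFY(expr)
#else
#define	SKETCH_ASSERT(expr)	((void)0)
#endif

/* Placeholder error numbers, used for illustration only. */
#define	SKETCH_EIO	5
#define	SKETCH_ERESTART	91

/* Returns how many of the two checks tripped for a given error. */
int
sketch_checks_tripped(int err)
{
	sketch_failures = 0;
	SKETCH_ASSERT(err == 0 || err == SKETCH_ERESTART);
	SKETCH_VERIFY(err == 0 || err == SKETCH_ERESTART);
	return (sketch_failures);
}
```

Built without SKETCH_DEBUG, an i/o error trips only the VERIFY-style check; the ASSERT-style check has vanished from the build, which is the nondebug situation described above.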

The real fix would probably involve enhancing the traversal logic to be able to
resume from any bookmark; then we can save the bookmark when an i/o error
occurs.  Once we've hit an error, we may also want to stop retrying every txg
and instead wait for a pool state change (e.g. device comes back online, "zpool
clear", etc).  We may also want to provide some mechanism to complete the
destroy by ignoring i/o errors, thus leaking any space pointed to by the
unreadable blocks.
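
A minimal sketch of that idea, using a toy model rather than the real traversal code: the bookmark is advanced past each block as soon as it has been handled, so a retry resumes at the failing block instead of revisiting freed ones, and an optional ignore_errors mode skips unreadable blocks and counts them as leaked. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NBLOCKS 6

/* Hypothetical toy state: one slot per block pointer to be freed. */
struct sketch_destroy {
	bool readable[SKETCH_NBLOCKS];   /* false simulates an i/o error */
	int free_count[SKETCH_NBLOCKS];  /* how often each block was freed */
	int leaked;                      /* unreadable blocks skipped */
	size_t bookmark;                 /* persisted resume point */
};

/*
 * One pass.  The bookmark always points at the next unhandled block, so
 * on error it already names the block to retry.  Returns 0 when the
 * destroy is complete, -1 on a simulated i/o error.
 */
int
sketch_sync_resumable(struct sketch_destroy *d, bool ignore_errors)
{
	assert(d->bookmark <= SKETCH_NBLOCKS);
	while (d->bookmark < SKETCH_NBLOCKS) {
		size_t i = d->bookmark;

		if (!d->readable[i]) {
			if (!ignore_errors)
				return (-1);	/* resume here next time */
			d->leaked++;		/* give up on this block */
		} else {
			d->free_count[i]++;
		}
		d->bookmark = i + 1;
	}
	return (0);
}

/* Retry `failing_passes` times, then finish by ignoring errors. */
struct sketch_destroy
sketch_demo(int failing_passes)
{
	struct sketch_destroy d = {
		.readable = { true, true, true, false, true, true },
	};
	int t;

	for (t = 0; t < failing_passes; t++)
		(void) sketch_sync_resumable(&d, false);
	(void) sketch_sync_resumable(&d, true);
	return (d);
}

/* Largest per-block free count in a finished model. */
int
sketch_max_frees(struct sketch_destroy d)
{
	int max = 0;
	size_t i;

	for (i = 0; i < SKETCH_NBLOCKS; i++) {
		if (d.free_count[i] > max)
			max = d.free_count[i];
	}
	return (max);
}
```

However many failing passes are run in this model, no block is freed more than once, and the single unreadable block ends up counted as leaked once errors are ignored.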

This problem was partially mitigated by the fix for bug '4391 panic system rather than corrupting pool if we hit bug 4390'.

This issue can be reproduced with the following script (disk names may vary):

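#!/bin/bash -x

# Create a single-disk pool and a filesystem that keeps fewer metadata copies.
zpool create -o cachefile=none -f test c1t1d0
zfs create -o redundant_metadata=most test/fs

# Write 100MB of data, then zero part of the disk while the pool is exported
# so that some blocks become unreadable.
dd if=/dev/urandom of=/test/fs/a bs=1024k count=100
zpool export test
dd if=/dev/zero of=/dev/dsk/c1t1d0s0 seek=8 bs=1024k count=100 conv=notrunc
zpool import -o cachefile=none test

# Destroy the damaged filesystem and check how much space is still being freed.
zfs destroy test/fs
sleep 10
zpool get freeing
zpool status -v test

# Create and destroy a second filesystem to generate more frees.
zfs create test/a
mkfile 10m /test/a/10m
zfs destroy test/a
sleep 10
zpool get freeing

# Traverse the pool's blocks and report block statistics.
zdb -bbe test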

Related to illumos gate - Bug #3835: zfs need not store 2 copies of all metadata (Closed, Christopher Siden, 2013-06-21)

Related to illumos gate - Feature #4391: panic system rather than corrupting pool if we hit bug 4390 (Closed, Christopher Siden, 2013-12-11)

git commit 7fd05ac4dec0c343d2f68f310d3718b715ecfbaf

commit 7fd05ac4dec0c343d2f68f310d3718b715ecfbaf
Author: Matthew Ahrens
Date:   2014-06-05T21:20:08.000Z

    4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
    Reviewed by: George Wilson
    Reviewed by: Christopher Siden
    Reviewed by: Adam Leventhal
    Reviewed by: Dan McDonald
    Reviewed by: Saso Kiselkov
    Approved by: Dan McDonald


