i/o errors when deleting filesystem/zvol can lead to space map corruption
Analysis by Matt Ahrens:
If we encounter an i/o error (on a piece of metadata) while deleting a filesystem or zvol, we don't update the bptree_entry_phys_t's bookmark. When we sync the next txg, we will revisit some bp's that have already been freed, freeing them again. This will continue every txg until we happen to reload the broken space map, at which point we will panic.

Typical panic message due to this bug:

    panic[cpu0]/thread=ffffff0009bf0c40: zfs: allocating allocated segment(offset=12518912 size=1536)

    ffffff0009bf0920 genunix:vcmn_err+23 ()
    ffffff0009bf0990 zfs:zfs_panic_recover+51 ()
    ffffff0009bf0a50 zfs:range_tree_add+14e ()
    ffffff0009bf0b00 zfs:space_map_load+34e ()
    ffffff0009bf0b30 zfs:metaslab_load+3b ()
    ffffff0009bf0b60 zfs:metaslab_preload+45 ()
    ffffff0009bf0c20 genunix:taskq_thread+2d0 ()
    ffffff0009bf0c30 unix:thread_start+8 ()

Note that this manifestation should only be possible on nondebug bits. On debug, we should fail this assertion in bptree_iterate():

    ASSERT(err == 0 || err == ERESTART);

which will prevent us from corrupting the on-disk state.

This problem would be exacerbated by '3835 zfs need not store 2 copies of all metadata' because there would be only one copy of L1 indirect blocks. Right now, there are 2 copies of all metadata, so losing a single VDEV (on a multi-vdev pool) would typically not lose any metadata, so we couldn't hit this bug.

A quick fix would be to turn this into a VERIFY, so that nondebug systems will panic here as well.

The real fix would probably involve enhancing the traversal logic to be able to resume from any bookmark; then we can save the bookmark when an i/o error occurs. Once we've hit an error, we may also want to not retry every txg, and instead wait for a pool state change (e.g. device comes back online, "zpool clear", etc). We may also want to provide some mechanism to complete the destroy by ignoring i/o errors, thus leaking any space pointed to by the unreadable blocks.
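The failure mode described in the analysis can be illustrated with a toy model. The sketch below is not ZFS code: blocks are plain integers, the "space map" is a list of ALLOC/FREE records, and every name in it (destroy_one_txg, load_space_map, simulate) is hypothetical. It assumes that a destroy pass frees blocks in order until it hits an unreadable one, and that loading the space map rejects a FREE record for a block that is not currently allocated.

```python
def destroy_one_txg(blocks, unreadable, bookmark, log, save_bookmark_on_error):
    """One destroy pass: free blocks from `bookmark` on, stop at an I/O error.

    Returns the bookmark to use in the next txg.
    """
    for i in range(bookmark, len(blocks)):
        if blocks[i] in unreadable:
            # Simulated I/O error. The buggy behaviour keeps the old bookmark,
            # so the next pass revisits blocks that were already freed.
            return i if save_bookmark_on_error else bookmark
        log.append(("FREE", blocks[i]))
    return len(blocks)


def load_space_map(log):
    """Replay the log; raise if a record contradicts the current state."""
    allocated = set()
    for op, block in log:
        if op == "ALLOC":
            if block in allocated:
                raise RuntimeError(f"allocating allocated block {block}")
            allocated.add(block)
        else:
            if block not in allocated:
                raise RuntimeError(f"freeing free block {block}")
            allocated.remove(block)
    return allocated


def simulate(save_bookmark_on_error, txgs=2):
    """Run `txgs` destroy passes over four blocks, one of them unreadable."""
    blocks = [10, 11, 12, 13]
    unreadable = {12}
    log = [("ALLOC", b) for b in blocks]
    bookmark = 0
    for _ in range(txgs):
        bookmark = destroy_one_txg(
            blocks, unreadable, bookmark, log, save_bookmark_on_error
        )
    return log
```

In this model, simulate(False) writes the FREE records for blocks 10 and 11 twice, and load_space_map() only notices when the log is replayed, which mirrors the delayed panic. simulate(True) records each free once, but the destroy never gets past block 12, which is why the analysis also suggests not retrying every txg and offering a way to leak the unreadable space.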
This problem was partially mitigated by the fix for bug '4391 panic system rather than corrupting pool if we hit bug 4390'.
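The trade-off behind panicking rather than corrupting the pool can be shown with a toy model of ASSERT versus VERIFY. This is a sketch under stated assumptions, not the actual change: it assumes only that ASSERT is checked on debug builds while VERIFY is always checked. Panic, sync_destroy_step, and the ERESTART value are placeholders for illustration.

```python
import errno

ERESTART = 91  # placeholder value for this sketch


class Panic(Exception):
    """Stands in for a kernel panic."""


def VERIFY(cond, what):
    # Always checked, on debug and nondebug builds alike.
    if not cond:
        raise Panic(f"assertion failed: {what}")


def ASSERT(cond, what, debug_build):
    # Only checked on debug builds; a no-op otherwise.
    if debug_build:
        VERIFY(cond, what)


def sync_destroy_step(err, debug_build, use_verify):
    """Hypothetical stand-in for the check made after a traversal pass."""
    what = "err == 0 || err == ERESTART"
    ok = err in (0, ERESTART)
    if use_verify:
        VERIFY(ok, what)
    else:
        ASSERT(ok, what, debug_build)
    # Reaching this point with a real error is the corruption case.
    return "on-disk state updated"
```

With the ASSERT form on a nondebug build, sync_destroy_step(errno.EIO, False, False) returns normally and the bad state would be written. With the VERIFY form the same call raises Panic, so the system stops before the on-disk state is touched.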
This issue can be reproduced with the following script (disk names may vary):
    #!/bin/bash -x
    zpool create -o cachefile=none -f test c1t1d0
    zfs create -o redundant_metadata=most test/fs
    dd if=/dev/urandom of=/test/fs/a bs=1024k count=100
    sync
    zpool export test
    dd if=/dev/zero of=/dev/dsk/c1t1d0s0 seek=8 bs=1024k count=100 conv=notrunc
    zpool import -o cachefile=none test
    zfs destroy test/fs
    sleep 10
    zpool get freeing
    zpool status -v test
    zfs create test/a
    mkfile 10m /test/a/10m
    sync
    zfs destroy test/a
    sleep 10
    zpool get freeing
    zdb -bbe test