ZFS Dedup vs. mismatching checksums
My 6-disk raidz2 data pool has experienced some "unrecoverable errors", found during a scrub from oi_148a LiveUSB. From my further analysis it seems that a userdata block of deduped data got corrupted and no longer matches the stored checksum. For whatever reason, raidz2 did not help in recovery of this data, so I rsync'ed the files over from another copy. Then things got interesting...
It seems the block-pointer block with that mismatching checksum did not get invalidated, so my attempts to rsync known-good versions of the bad files from an external source appeared to work but in fact failed: subsequent reads of the files still produced IO errors.
Apparently (my wild guess), when the blocks were written, their checksums were calculated and a matching DDT entry was found. ZFS did not care that the entry pointed to inconsistent data (no longer matching the checksum); it simply increased the DDT reference counter.
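To make the guess above concrete, here is a toy Python model of a checksum-keyed dedup table. It is only a sketch of the suspected behaviour, not ZFS code: `ToyPool`, `readable` and the dictionary layout are invented for illustration, and SHA-256 is assumed as the checksum.

```python
import hashlib


class ToyPool:
    """Toy model: 'disk' maps address -> bytes, 'ddt' maps checksum -> [address, refcount]."""

    def __init__(self):
        self.disk = {}
        self.ddt = {}
        self.next_addr = 0

    def write(self, data):
        """Dedup write that trusts the checksum and never reads the stored block."""
        csum = hashlib.sha256(data).hexdigest()
        entry = self.ddt.get(csum)
        if entry is not None:
            entry[1] += 1            # only the reference counter grows
            return csum, entry[0]
        addr = self.next_addr
        self.next_addr += 1
        self.disk[addr] = data
        self.ddt[csum] = [addr, 1]
        return csum, addr

    def read(self, block_pointer):
        """Return the block, or raise if it no longer matches its checksum."""
        csum, addr = block_pointer
        data = self.disk[addr]
        if hashlib.sha256(data).hexdigest() != csum:
            raise OSError("checksum mismatch")
        return data


def readable(pool, block_pointer):
    """True if the block behind block_pointer still passes its checksum."""
    try:
        pool.read(block_pointer)
        return True
    except OSError:
        return False


pool = ToyPool()
good = b"user data block"
first_bp = pool.write(good)                    # original file
pool.disk[first_bp[1]] = b"user data blocK"    # simulate on-disk corruption
second_bp = pool.write(good)                   # "restore" from a known-good copy
```

In this model the restore gets back the same block pointer as the corrupted original, so the "repaired" file is just as unreadable as the old one, which matches what I saw after the rsync.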
The problem was solved by disabling dedup for the dataset involved and rsync-updating the file in-place.
After the dedup feature was disabled and new blocks were uniquely written, everything was readable (and md5sums matched) as originally expected.
I believe that if a block is detected to be corrupt (its data no longer matches the checksum), the checksum value in the block pointers and in the DDT should be rewritten to an "impossible" value, perhaps all-zeroes or such, at the moment the error is detected.
Alternatively (opportunistically), a flag might be set in the DDT entry requesting that the next write matching this stored checksum be committed to disk, thus "repairing" all files which reference the block (or at least stopping the IO errors). Alas, so far there is no guarantee that it was not the checksum itself that got corrupted (short of using ZDB to retrieve the block contents and comparing them with a known-good copy of the data, if any), so corruption of the checksum would also cause replacement of "really-good-but-normally-inaccessible" data.
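The flag idea could look roughly like the following toy Python sketch. Everything here is hypothetical: `DdtEntry`, `needs_rewrite`, `write` and `read` are illustrative names, files are modelled as referencing a block by checksum only, and the block is overwritten in place, which a real copy-on-write implementation would not do.

```python
import errno
import hashlib


def checksum(data):
    return hashlib.sha256(data).hexdigest()


class DdtEntry:
    def __init__(self, addr):
        self.addr = addr
        self.refcount = 1
        self.needs_rewrite = False   # hypothetical "repair on next write" flag


disk = {}   # address -> bytes
ddt = {}    # checksum -> DdtEntry


def write(data):
    """Dedup write; if the entry is flagged, commit the incoming copy to disk."""
    csum = checksum(data)
    entry = ddt.get(csum)
    if entry is None:
        addr = len(disk)
        disk[addr] = data
        ddt[csum] = DdtEntry(addr)
        return csum
    entry.refcount += 1
    if entry.needs_rewrite:
        disk[entry.addr] = data      # toy in-place repair of the shared block
        entry.needs_rewrite = False
    return csum


def read(csum):
    """Read by checksum; flag the DDT entry when a mismatch is detected."""
    entry = ddt[csum]
    data = disk[entry.addr]
    if checksum(data) != csum:
        entry.needs_rewrite = True
        raise OSError(errno.EIO, "checksum mismatch")
    return data


good = b"shared block contents"
csum = write(good)
disk[ddt[csum].addr] = b"shared block c0ntents"   # simulate corruption
try:
    read(csum)
except OSError:
    pass
flagged_after_failed_read = ddt[csum].needs_rewrite
write(good)                                        # restore from a good copy
```

After the second write, every reference to that checksum reads good data again in this model. Note that the sketch shares the weakness described above: it trusts the stored checksum, so it cannot tell a corrupted block from a corrupted checksum.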
Updated by Jim Klimov over 9 years ago
As was suggested on the zfs-discuss list, setting the dedup property to "verify" instead of "on" did somewhat help when rsync-updating each "broken" file found by scrub (or by routine EIO errors): now every block that matches a checksum in the DDT but differs from the on-disk data is written separately, instead of increasing the counter pointing to a broken block. In effect this allows files to be repaired, and when all references to the "broken" block are gone, this probably should allow the mismatching blocks to be freed (I haven't finished my repairs yet, so I can't guarantee the automatic cleanup of obsolete broken blocks from the pool; until then, scrub reports them as nameless numbered "files", but zdb can't test or do any IO on them).
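The difference between "on" and "verify" as I understand it can be shown with a small Python sketch. This is a toy model under my own assumptions, not the ZFS write path: `dedup_write`, the list used as a disk and the dictionary used as a DDT are all invented for illustration.

```python
import hashlib


def dedup_write(disk, ddt, data, verify):
    """Toy dedup write: return the address the caller's block pointer would use."""
    csum = hashlib.sha256(data).hexdigest()
    entry = ddt.get(csum)
    if entry is not None:
        # Without verify the checksum match alone is trusted; with verify
        # the stored bytes must also equal the incoming bytes.
        if not verify or disk[entry["addr"]] == data:
            entry["refs"] += 1
            return entry["addr"]
    # New data, or a verify mismatch: write a separate copy.
    addr = len(disk)
    disk.append(data)
    if entry is None:
        ddt[csum] = {"addr": addr, "refs": 1}
    return addr


disk, ddt = [], {}
good = b"file block"
first_addr = dedup_write(disk, ddt, good, verify=True)
disk[first_addr] = b"f1le block"                           # simulate corruption

addr_on = dedup_write(disk, ddt, good, verify=False)       # behaves like dedup=on
addr_verify = dedup_write(disk, ddt, good, verify=True)    # behaves like dedup=verify
```

With `verify=False` the restored block lands on the corrupted address again; with `verify=True` it gets its own address holding good data, while the old entry and its broken block stay behind until nothing references them.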
NOTE: This still requires that every file with "broken" blocks be restored from trusted backup or original sources. Writing another block that matches the checksum in DDT does not replace the original block which used to match the same checksum but is broken now (even if several files actually did share the same on-disk data block).
MANUAL REPAIR: It might be possible, but quite time-consuming to say the least, to walk the block tree with SCRUB (to find broken files), DD (to find broken block numbers) and ZDB (to locate on-disk DVA addresses and try to compare L1-checksums of different broken files' invalid blocks), and then try to replace broken blocks in the middle of those files for which you don't have another copy, if you guess that a mid-file block was shared via the DDT and exactly matches a block you have from another, already repaired file.
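The "DD to find broken block numbers" step can also be scripted. Below is a minimal Python sketch that reads a file block by block and records which block indexes return EIO. The function name `find_unreadable_blocks` is made up here, and the 128 KiB default is only an assumption about the dataset's recordsize; pass the real value for your dataset.

```python
import errno
import os


def find_unreadable_blocks(path, block_size=128 * 1024):
    """Return the indexes of block_size-sized blocks whose read fails with EIO."""
    bad = []
    size = os.path.getsize(path)
    block_count = (size + block_size - 1) // block_size
    fd = os.open(path, os.O_RDONLY)
    try:
        for index in range(block_count):
            try:
                os.pread(fd, block_size, index * block_size)
            except OSError as exc:
                if exc.errno != errno.EIO:
                    raise            # anything other than an IO error is unexpected
                bad.append(index)
    finally:
        os.close(fd)
    return bad
```

The returned indexes are file-relative block numbers, which is what you would then feed into zdb to look up the corresponding block pointers.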
Recommended reading: on-disk ZFS format spec, Max Bruning's blogs, zfs and zdb sources; several days of time ;)