ZFS Repair corrupted deduped blocks with incoming writes having same checksum
During the mailing-list discussion of bugs #2024 and #2649, an additional proposal emerged, and it was suggested to post it as an explicit RFE (http://permalink.gmane.org/gmane.os.illumos.devel/7252):
...in case of collision on write (with dedup=verify), also check if the existing on-disk block still matches the checksum; if the block is corrupted and useless anyway - have it replaced by the new block which does match the same checksum (write the block, save its BP into the DDT entry leaving other fields intact).
This way, manual repair of one broken deduped file (by copying it over from a backup or another source) would automatically propagate to all other files that contain this block. And such unfixable corruption of on-disk userdata seems orders of magnitude more probable than a sha256 hash collision, whose chance is on the order of 10^-77 ;)
Also note (and this is the essence of bug #2015) that even with dedup=verify, where the userdata block contents are compared upon write, the checksum of the on-disk data is not recalculated "just for kicks", nor does a mismatch increase the pool's CKSUM counters. This is unlike a ZPL/ZVOL read attempt from this block, which would verify the data against its checksum, increase the counters and sound the alarm.
So I wonder HOW the dedup-writing code reads the block (in order to compare its contents) so that it bypasses the normal checksumming and alarming? Directly from the DVA? And how, lacking a checksum match, did it reconstruct the (corrupt) block from raidz2 satisfactorily? In particular, if a single sector is broken (and can be recovered by raidzN), would this result in a contents mismatch for dedup-verify and a separate unique write, instead of proper repair (and at least signalling) of the partially or fully broken block?
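To make the proposed write path concrete, here is a minimal toy sketch in Python. It is not ZFS code: `ToyDedupPool`, its fields and the returned action strings are all hypothetical names invented for illustration. It assumes that files resolve their blocks through the dedup table by checksum (a simplification chosen for the sketch, not a claim about ZFS internals), that the reference count is bumped for the repairing write, and it does not model freeing the corrupted block.

```python
import hashlib


class ToyDedupPool:
    """Toy model of a dedup table (DDT): checksum -> {block pointer, refcount}."""

    def __init__(self):
        self.disk = {}    # hypothetical block pointer (int) -> bytes "on disk"
        self.ddt = {}     # sha256 digest -> {"bp": int, "refcnt": int}
        self.next_bp = 0

    def _alloc(self, data):
        # Write data to a fresh location and return its block pointer.
        bp = self.next_bp
        self.next_bp += 1
        self.disk[bp] = data
        return bp

    def write(self, data):
        """dedup=verify write with the proposed repair step; returns (bp, action)."""
        cksum = hashlib.sha256(data).digest()
        entry = self.ddt.get(cksum)
        if entry is None:
            bp = self._alloc(data)
            self.ddt[cksum] = {"bp": bp, "refcnt": 1}
            return bp, "new"
        on_disk = self.disk[entry["bp"]]
        if on_disk == data:
            # Normal verified dedup hit.
            entry["refcnt"] += 1
            return entry["bp"], "dedup"
        if hashlib.sha256(on_disk).digest() != cksum:
            # Proposed step: the stored block no longer matches its own checksum,
            # so write the good data and save its BP into the existing entry.
            entry["bp"] = self._alloc(data)
            entry["refcnt"] += 1  # assumption: the new write adds a reference
            return entry["bp"], "repaired"
        # The on-disk block is intact but different: a genuine hash collision.
        return self._alloc(data), "collision"

    def read(self, cksum):
        # Toy lookup: resolve the block through the DDT entry.
        return self.disk[self.ddt[cksum]["bp"]]


pool = ToyDedupPool()
good = b"A" * 16
key = hashlib.sha256(good).digest()
bp0, action0 = pool.write(good)   # first copy of the block
pool.disk[bp0] = b"B" * 16        # simulate silent on-disk corruption
bp1, action1 = pool.write(good)   # copy the file back from backup
```

In this model the second write takes the repair branch, so every later lookup through the same DDT entry returns the good data again; a real implementation would also have to deal with the old on-disk location and with signalling the error.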
Updated by Jim Klimov about 9 years ago
It should also be noted that currently dedup=verify still "dedups" only the first known block with the given checksum.
All other blocks with the same checksum but different contents are going to be unique writes, even if there are many of them, e.g. when corrupted files are copied over from the backup/source, while the truly corrupted, mismatching version of the block still occupies the DDT :(
This thought should increase the priority of implementing repair of known-to-be-corrupted-anyway blocks by replacing them with known-good ones ;)
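The effect described in this update can be modelled with a few lines of Python. Again this is only a toy sketch of the behaviour as I understand it, not ZFS code: `write_verify_current`, the list used as a "disk" and the dict used as a "DDT" are hypothetical stand-ins.

```python
import hashlib


def write_verify_current(disk, ddt, data):
    """Toy model of today's dedup=verify; returns the index of the block used."""
    cksum = hashlib.sha256(data).digest()
    if cksum not in ddt:
        # First block with this checksum: it takes the DDT entry.
        disk.append(data)
        ddt[cksum] = len(disk) - 1
        return ddt[cksum]
    if disk[ddt[cksum]] == data:
        # Contents verified against the stored block: deduplicate.
        return ddt[cksum]
    # Contents differ: unique write, and the DDT entry is left untouched.
    disk.append(data)
    return len(disk) - 1


disk, ddt = [], {}
good = b"payload"
first = write_verify_current(disk, ddt, good)
disk[first] = b"garbage"  # simulate corruption of the block held by the DDT

# Restoring the same data three times never matches the corrupted block,
# so each restore allocates its own block and nothing gets deduplicated.
copies = [write_verify_current(disk, ddt, good) for _ in range(3)]
```

Every restored copy ends up in a separate block while the table keeps pointing at the corrupted one, which is exactly the space and integrity cost the proposed repair step would avoid.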