ZFS dedup=on should not panic the system when two blocks with same checksum exist in DDT
While testing interactions of dedup with corrupt on-disk data on my system during repairs, I found that oi_148a (LiveUSB) instantly reboots after the procedure listed below (tested twice during remote recovery).
Prerequisite: system with a bad data block in a file (doesn't match checksum stored in blockpointers and probably in DDT), original block which matches that checksum (I overwrite it into the corrupted file as part of the repairs). Original file (containing the corrupted block) was written onto a dataset with dedup enabled.
- "dedup=on" neither fixed the on-disk file, nor logged an error, and subsequent reads produced IO errors (and increased the counter). Probably just the DDT counter was increased during the write (that's the "works as designed" part according to Richard);
- "dedup=verify" doesn't log a checksum error if it finds a block whose assumed checksum matches the newly written block, but contents differ from the new block during dedup-verification and in fact these contents do not match the checksum either (at least, not the one in block pointer). Reading the block produced no errors (bug #2015);
- what's worse, reenabling "dedup=on" and writing the same again block crashes (reboots) the system instantly.
Possibly, because now there are two DDT entries pointing to same checksum in different blocks, and no verification was explicitly requested in this dataset?
If so, basically this means that if dedup=verify is used on at least one dataset (and has created two DDT entries with same checksums), it is unsafe to use plain dedup=on on any dataset in that pool. Particularly because the user has the file whose block matches the checksum, and it is likely that the file will be copied around and trigger the error. Thus - a BUG report ;)
Workaround: if you use dedup, always use it with verification. Helps against corrupt on-disk sectors, too ;)
Details in mail list, here: http://mail.opensolaris.org/pipermail/zfs-discuss/2012-January/050881.html
See also bugs #1981 and #2015
PS: It might be nice to craft a test with two different blocks sharing the same checksum and written with dedup=verify (so that two DDT entries exist), and then rewritten with dedup=on in same and/or different dataset -- would that work or crash? I'm not sure about how to create the hash collision samples... =)
Updated by Jim Klimov over 9 years ago
1) for the first "dedup=on" - the error was previously known (found by scrub), and still the block was not invalidated and reused bu dedup "as is" - that's basically tracked as bug #1981
2) "dedup=verify" doesn't log a checksum error ... - that's the bug #2015, not the fact that reading the corrected block produced no errors ;)
Updated by Jim Klimov about 9 years ago
The problem of kernel panicking with such double-entries in DDT (one of which is invalid) was further explored by me on-list (zfs-discuss):
In essence, I tracked down at least one of the panic scenarios to be a kernel-side NULL pointer dereference, and the way it gets into the situation. In short, there is a way for ddt_phys_select() to not find any blocks matching its query and return NULL, and ddt_phys_decref() used the passed pointer without any validation and got the NULL dereference.
I think that in this situation, if we didn't find the block - there's no DDT-refcount to decrease already and it gets automatically freed. However I may be naive and this solution might cause some block-leaks (which to me are better than panics anyway) ;)
The attached fix adds some value checking and error reporting for the situation.
This might not be a complete solution, but I got no replies on-list and it worked for me for a month, so I publish it and mark the issue 50% complete. Maybe someone will revise and integrate the fix up to 100%? :)
Updated by Jim Klimov about 9 years ago
- Status changed from New to In Progress
- Assignee set to Jim Klimov
So here goes my first webrev attempt:
Call for reviews posted on oi-dev mailing list.
Updated by Jim Klimov almost 9 years ago
- % Done changed from 50 to 60
The root cause of MY pool's problem was uncovered and solved in bug #2649, and it was decided that ZFS should indeed panic on such cases by default and bring attention of analysts and developers - so sanity-checks for pointer non-NULLness in the functions is strategically a bad thing. As a workaround, zfs_recover can be set to cleanse the pool of a pre-existing errors at possible expense of leaked blocks.
The remaining code snippet for that was discussed on-list, webrev not yet made due to other pre-occupation :( Hope to get back to it soon.