Feature #2008

ZFS option to verify all writes (did they make it to disk correctly)?

Added by Jim Klimov over 8 years ago. Updated over 8 years ago.

Status: New
Priority: Normal
Assignee: -
Category: zfs - Zettabyte File System
Start date: 2012-01-21
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

My recent investigation of corrupted data on my pool led me to believe that some writes did not land on disk where requested (misdirected writes are one of the possible disk errors, like bitrot and others, that ZFS tries to protect against). Such a misdirected write can damage two (or more) files in one shot: the one that did not get its data, and the one that was overwritten on the media.

It seems like a worthy RFE to include a pool-wide "verify-after-write/commit" option, to test that data from a recent TXG sync has indeed made it to disk, into the designated sector numbers, on (consumer-grade) hardware.

Perhaps the test should be delayed by several seconds after the sync writes (to allow the disk's write cache to drain, since the disk can lie about sync writes and/or write ordering).

If the read verification fails, the data from recent TXGs can be recovered from on-disk redundancy and/or from the RAM cache, if it still exists there, and then rewritten (and tested again).
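
The write, delayed read-back, and rewrite cycle described above can be sketched roughly as follows. This is only a toy model in Python, not ZFS code: `FlakyDisk`, `commit_and_verify`, `settle_seconds` and `max_rewrites` are hypothetical names invented for illustration, SHA-256 stands in for whatever block checksum the pool uses, and the in-memory `txg_blocks` dictionary plays the role of the data still held in RAM.

```python
import hashlib
import time


class FlakyDisk:
    """Toy block device that silently loses the first write to chosen sectors."""

    def __init__(self, lose_first_write_to=()):
        self.sectors = {}
        self._lose = set(lose_first_write_to)

    def write(self, sector, data):
        if sector in self._lose:
            self._lose.discard(sector)  # drop this write once, then behave
            return
        self.sectors[sector] = data

    def read(self, sector):
        return self.sectors.get(sector, b"")


def checksum(data):
    return hashlib.sha256(data).digest()


def commit_and_verify(disk, txg_blocks, settle_seconds=0.0, max_rewrites=2):
    """Write all blocks of a TXG, read them back, and rewrite any mismatches.

    Returns the sectors that had to be rewritten. Raises IOError if some
    sector still fails verification after max_rewrites rewrite attempts.
    """
    expected = {sector: checksum(data) for sector, data in txg_blocks.items()}
    for sector, data in txg_blocks.items():
        disk.write(sector, data)

    to_check = sorted(txg_blocks)
    rewritten = []
    for attempt in range(max_rewrites + 1):
        time.sleep(settle_seconds)  # give the write cache time to drain
        failed = [s for s in to_check if checksum(disk.read(s)) != expected[s]]
        if not failed:
            return rewritten
        if attempt == max_rewrites:
            break
        for sector in failed:
            disk.write(sector, txg_blocks[sector])  # recommit from the RAM copy
            if sector not in rewritten:
                rewritten.append(sector)
        to_check = failed
    raise IOError(f"sectors still unverified: {failed}")


disk = FlakyDisk(lose_first_write_to={7})
print(commit_and_verify(disk, {5: b"alpha", 7: b"beta"}))  # prints [7]
```

In this sketch only the sectors that failed are checked again after a rewrite. A real implementation would also have to make sure the read-back comes from the media rather than from a cache.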

More importantly, a failed test may mean that the write landed at a random location on disk, and the pool should be scrubbed ASAP. One may guess that the yet-unknown error lies within "epsilon" tracks (sector numbers) of the data now found to be unwritten, so if it is possible to scrub just a portion of the pool based on DVAs, that is the preferred place to start. Some data may be recoverable if it is tended to ASAP (i.e. from a mirror, raidz, copies>1, or external backups)...
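
To illustrate the "epsilon" idea, here is a small Python sketch that turns a list of failed sector numbers into merged sector ranges worth checking first. The function name `scrub_windows` and its parameters are made up for this example, and it only computes the ranges; whether a scrub can actually be restricted to them by DVA is exactly the open question of this request.

```python
def scrub_windows(failed_sectors, epsilon, last_sector):
    """Return merged (first, last) sector ranges within epsilon of each failure."""
    windows = []
    for sector in sorted(failed_sectors):
        lo = max(0, sector - epsilon)
        hi = min(last_sector, sector + epsilon)
        if windows and lo <= windows[-1][1] + 1:
            # Overlaps or touches the previous window: extend it.
            windows[-1] = (windows[-1][0], max(windows[-1][1], hi))
        else:
            windows.append((lo, hi))
    return windows


print(scrub_windows([100, 105, 5000], epsilon=10, last_sector=10000))
# prints [(90, 115), (4990, 5010)]
```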

History

#1

Updated by Jim Klimov over 8 years ago

It should be stressed that the verification reads must get the data from the media (bypassing the disks' read caches if possible, and ZFS's prefetch/ARC caches as a must).
At the same time, the data remaining in the ARC caches might be useful for restoring the on-disk data if a discrepancy is found quickly.

An implementation detail was suggested by Bob Friesenhahn on the list:

This is an interesting idea. I think that you would want to do a
mini-scrub on a TXG at least one behind the last one written since
otherwise any test would surely be foiled by caching. The ability to
restore data from RAM is doubtful since TXGs get forgotten from
memory as soon as they are written.

I think the expiration of recent TXGs can be rearranged as part of this new feature, to facilitate repairs (or plain recommits) of erroneous writes, i.e. free an old TXG only after the verification read for that TXG number has passed.
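
A rough Python sketch of that retention rule, combined with Bob's suggestion to verify a TXG at least one behind the last one written, could look like this. The class `TxgRetainer` and its methods are hypothetical and do not correspond to any existing ZFS structure; the sketch only shows the bookkeeping: keep the in-memory copy of each synced TXG until its verification read has passed.

```python
from collections import OrderedDict


class TxgRetainer:
    """Keeps in-memory copies of synced TXGs until they are verified."""

    def __init__(self, lag=1):
        self.lag = lag                  # verify TXGs at least this far behind
        self._retained = OrderedDict()  # txg number -> {sector: data}

    def synced(self, txg, blocks):
        """Record a TXG that was just written, keeping its data in RAM."""
        self._retained[txg] = dict(blocks)

    def verifiable(self, latest_txg):
        """TXGs old enough that a read-back should not be served from cache."""
        return [t for t in self._retained if t <= latest_txg - self.lag]

    def blocks(self, txg):
        """Retained data for a TXG, usable for a recommit if verification fails."""
        return self._retained[txg]

    def verified(self, txg):
        """Free the retained copy once the verification read has passed."""
        self._retained.pop(txg, None)

    def retained_txgs(self):
        return list(self._retained)


r = TxgRetainer(lag=1)
r.synced(10, {1: b"x"})
r.synced(11, {2: b"y"})
print(r.verifiable(latest_txg=11))  # prints [10]
r.verified(10)
print(r.retained_txgs())            # prints [11]
```

The cost of such a scheme is memory: every unverified TXG stays in RAM for at least `lag` extra sync cycles.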
