Feature #3372


ZFS tunable: Induce delays between access to same locations in component disks of a Top-Level VDEV

Added by Jim Klimov about 9 years ago. Updated about 9 years ago.

zfs - Zettabyte File System

While discussing an earlier problem of mine, the appearance of unrecoverable errors on a 6-disk raidz2 pool, one of the most likely causes the mailing list and I came up with was a hardware event (a power surge, wire noise interpreted as an I/O command by the disks, a mechanical jolt to the storage box, etc.) that caused the HDDs to corrupt the data under their heads.

Given that a raidzN disk set simultaneously accesses the sectors on all component disks that make up a logical ZFS block, it seems likely that a single common hardware event can cause all the heads to collide with their platters (or perform unintended I/O) at once and corrupt every part of that block.

This RFE proposes an optional mode (enabled via a tunable) that forces reads and writes to the same offsets of the component disks of a top-level vdev (TLVDEV) to occur at different times, so that a single common external event cannot momentarily corrupt all components of one ZFS block by crashing the disk heads onto it.
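As a rough illustration of the proposed staggering, here is a minimal sketch in C. This is hypothetical code, not the actual ZFS vdev layer; the tunable and function names are invented for the example. Each child disk of the TLVDEV gets a per-offset issue delay proportional to its index, so no two children touch the same offset at the same instant:

```c
#include <stdint.h>

/*
 * Hypothetical sketch only -- not actual ZFS code.  Assume an invented
 * tunable, zfs_stagger_delay_us, giving the minimum time separation
 * between two component disks accessing the same offset of a TLVDEV.
 */
uint64_t zfs_stagger_delay_us = 1000;   /* invented tunable, in microseconds */

/*
 * Return how long (in microseconds) to postpone the I/O issued to
 * child 'c' of a top-level vdev: child 0 is issued immediately,
 * child 1 after one delay unit, child 2 after two, and so on.
 */
uint64_t
stagger_delay(uint32_t c)
{
	return ((uint64_t)c * zfs_stagger_delay_us);
}
```

With the default above, the six disks of a raidz2 TLVDEV would touch a given offset spread across a 5 ms window rather than simultaneously; the delay unit would be chosen against the expected duration of the feared external event.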

Additional considerations:
  • This problem as posed likely applies only to mirrors and raidzN sets of local spinning hard disks, including USB/eSATA-attached disks and JBODs. The workaround should not apply to NAS/SAN LUNs, iSCSI LUNs, pools backed by files or zvols, single-disk pools, (arguably) flash drives (USB, CF, SD...) and SSDs, etc. Arguably, it should also not apply in VM environments with virtual disks (but should apply to directly accessed disks via PCI passthrough or other means of HDD/partition delegation).
  • If many I/Os are outstanding, they may be queued in different orders to different disks so that overall throughput suffers minimally. Since a COW write triggers an avalanche of block-pointer updates with 2 or 3 copies of each, every disk in the TLVDEV already has an assortment of widely different physical locations to seek to.
  • The workaround should apply to all I/Os (reads and writes) regardless of urgency (sync/async) or source (userdata I/O, scrub, ZIL updates, etc.).
  • For multiple-disk redundancy (raidz2, raidz3, three- or four-way mirrors), some disks might be allowed simultaneous I/O to the same offsets: incidental corruption of just a couple of sectors in a logical block does not guarantee unrecoverability, as it would if all sectors were corrupted at once.
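The multi-redundancy point above can be sketched as a slot-assignment policy (hypothetical, with invented names): place at most `parity` children into each time slot, so that even if one external event corrupts every disk active in a slot, the block loses no more sectors than the redundancy can absorb.

```c
#include <stdint.h>

/*
 * Hypothetical sketch only.  With parity P (2 for raidz2, 3 for
 * raidz3), at most P children of a TLVDEV may share a time slot.
 * If one event corrupts every disk active in a slot, the block
 * loses at most P sectors and remains reconstructible.
 */
uint32_t
stagger_slot(uint32_t child, uint32_t parity)
{
	if (parity == 0)
		parity = 1;	/* no redundancy: fully serialize */
	return (child / parity);
}
```

For a 6-disk raidz2 set this yields three slots of two disks each, halving the added latency of full serialization while still keeping any single-event damage within what raidz2 can reconstruct.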

This is proposed as tunable behaviour, set at the discretion of admins/owners of inherently unreliable storage boxes where "garden-variety" hardware may cause random problems (such as home NASes). The feature will likely reduce the performance of ZFS pools on spinning HDDs, but should strengthen them against simultaneous hardware- or environment-induced glitches.

