ZFS tunable: Induce delays between access to same locations in component disks of a Top-Level VDEV
While discussing my older problem with unrecoverable errors appearing on a 6-disk raidz2 pool, one of the most likely causes that the list and I came up with was a hardware event (such as a power surge, wire noise interpreted by the disks as an IO command, or a mechanical shake of the storage box) that caused the HDDs to somehow corrupt the data under their heads.
Given that a raidzN disk set has to simultaneously access sectors on all component disks that make up a logical ZFS block, it seems likely that a common hardware event can cause the heads to collide with the disk (or do unintended IO) and corrupt the data.
This RFE proposes to add an optional (enabled with a tunable) mode to enforce the reads and writes from the same offsets of component disks of a TLVDEV to be done at different times, so that a common external cause wouldn't be able to momentarily corrupt all components of the same ZFS block by crashing the disk heads on it.
Additional considerations:
- This problem as posed likely only applies to mirrors and raidzN sets of local spinning hard disks, including USB/eSATA attached disks and JBODs. The workaround should not be applied to NAS/SAN LUNs, iSCSI LUNs, pools in files and zvols, single-disk pools, (arguably?) flash drives (USB, CF, SD...) and SSDs, etc. Arguably, it should also not be applied in VM environments with virtual disks (but should be applied to disks accessed directly via PCI passthrough or other means of HDD/partition delegation).
- If there are many IOs outstanding, they may be queued in different orders to different disks, so that the overall throughput suffers minimally. Since a COW write causes an avalanche of block-pointer updates with 2 or 3 copies of each, it is not a problem to have an assortment of widely different physical locations to seek to for each of the disks in the TLVDEV.
- This workaround should take place for all IOs (write and read) regardless of urgency (sync/async) or source (userdata IO, scrub, ZIL update, etc.)
- For multiple-disk redundancy (raidz2, raidz3, three- or four-way mirrors) some disks might be allowed simultaneous IO to the same offsets, because incidental corruption of just a couple of sectors in a logical block still leaves it recoverable, unlike the case where all of its sectors get corrupted.
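The last consideration can be made concrete with a small sketch. This is not ZFS code: the function name `stagger_slots` and its parameters are hypothetical, and it assumes the TLVDEV can reconstruct a block after losing up to `redundancy` components (2 for raidz2 or a three-way mirror). Disks are grouped so that no more than `redundancy` of them touch the same offset in the same time slot.

```python
def stagger_slots(n_disks, redundancy):
    """Return a time-slot index for each component disk of one TLVDEV.

    Hypothetical helper, not part of ZFS. At most `redundancy` disks share
    a slot, so a glitch during one slot damages no more sectors of a
    logical block than the vdev is assumed to be able to reconstruct.
    """
    if n_disks < 1 or redundancy < 1:
        raise ValueError("need at least one disk and redundancy >= 1")
    return [disk // redundancy for disk in range(n_disks)]


# 6-disk raidz2: three slots with two disks each
print(stagger_slots(6, 2))  # [0, 0, 1, 1, 2, 2]
# 5-disk raidz1: every disk gets its own slot
print(stagger_slots(5, 1))  # [0, 1, 2, 3, 4]
```

A real implementation would also have to decide how far apart the slots are in time and how to rotate the grouping between blocks; the sketch only shows the grouping rule.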
This is proposed to be a tunable behaviour, set at their discretion by admins/owners of inherently unreliable storage boxes (such as home NASes) where "garden-variety" grade hardware may cause random problems. This feature will likely reduce performance of ZFS pools on spinning HDDs, but should hopefully strengthen them against simultaneous hardware- or environment-induced glitches.
Updated by Jim Klimov about 8 years ago
An additional consideration: by design, this feature adds latency to individual raidzN reads. Sectors at similar offsets on different disks are forced to be accessed at different times, and the logical block can be recombined (resulting in a successful or failed read IO) only once all of them are available in RAM. The more disks in a raidz set, the more ~10ms delays are imposed. For usual (non-scrub) IO, reads from non-parity sectors can be queued first, so that the checksum match for non-corrupted blocks is seen sooner and the additional latency of reading the disks that hold parity for this block is avoided, but I think this is already done today.
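As a rough back-of-the-envelope sketch of that latency cost (the helper `added_read_latency_ms` and all of its parameters are hypothetical, and the model is deliberately simplistic): assume a block spans every data disk of the raidz set, each staggered slot costs about 10 ms as estimated above, and parity disks are read only when asked for.

```python
import math


def added_read_latency_ms(n_disks, parity, per_slot=1, slot_ms=10.0,
                          need_parity=False):
    """Estimate the extra latency that staggering adds to one block read.

    Hypothetical model, not measured data: the first slot is the normal
    read, every further slot adds `slot_ms` milliseconds.
    """
    if not 0 <= parity < n_disks or per_slot < 1:
        raise ValueError("invalid vdev geometry")
    # Data-first ordering: parity sectors are skipped unless needed.
    disks_to_read = n_disks if need_parity else n_disks - parity
    slots = math.ceil(disks_to_read / per_slot)
    return (slots - 1) * slot_ms


# 6-disk raidz2, one disk per slot, data sectors only
print(added_read_latency_ms(6, 2))                    # 30.0
# Same pool when parity must be read too (e.g. after a checksum mismatch)
print(added_read_latency_ms(6, 2, need_parity=True))  # 50.0
# Allowing two disks per slot, as raidz2 redundancy might permit
print(added_read_latency_ms(6, 2, per_slot=2))        # 10.0
```

Small blocks that occupy only a few disks would see less added latency than this model suggests.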
However, if several streaming reads are detected, it is okay to request the sectors of different streams at adequately different locations on different disks, thus hiding some of the artificially added latency on average.
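One way to picture this interleaving is a simple rotation, sketched here under the assumption that there are at least as many concurrent streams as disks. The name `rotate_streams` and the slot model are hypothetical illustrations, not an existing ZFS mechanism. In every time slot each disk serves a different stream, so no two disks work on the same logical block at once, yet no disk sits idle.

```python
def rotate_streams(n_disks, n_streams, n_slots):
    """Build schedule[t][d]: the stream served by disk d in time slot t.

    Hypothetical sketch. With n_streams >= n_disks the indices within one
    slot are all distinct, so a glitch in that slot can hit at most one
    sector of any given stream's current block.
    """
    if n_streams < n_disks:
        raise ValueError("need at least as many streams as disks")
    return [[(disk + slot) % n_streams for disk in range(n_disks)]
            for slot in range(n_slots)]


for row in rotate_streams(n_disks=3, n_streams=3, n_slots=3):
    print(row)
# [0, 1, 2]
# [1, 2, 0]
# [2, 0, 1]
```

After `n_streams` slots every disk has served every stream once, so the staggering delay is spread across the streams instead of stalling any single one.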