ZFS should not trust the layers underneath regarding drive timeouts/failure
I just woke up from an anxiety attack about the possibility of future disk failures at work. Last week we had a 30-minute platform-wide outage caused by a failing disk in our NFS storage array, and it was not the first time: the same thing happened several months ago.
It's not a low-end setup. We're talking about a pair of NFS heads with 96GB RAM, six 256GB L2ARC devices, ZeusRAMs for ZIL, and LSI 6Gbps eSAS cards (mpt_sas) attached to a daisy chain of 12-disk LSI 6Gbps SAS JBODs, with 7.2k RPM Seagate Constellation SAS disks.
When a disk goes bad, IO on the whole array just stalls or stutters or grinds to a halt. Then your customers start calling screaming at you.
The last time this happened we dropped sd's un_cmd_timeout to 5 seconds and un_retry_count to 3, but this seems to have made no difference whatsoever. The settings are definitely in effect:
echo "::walk sd_state | ::grep '.!=0' | ::sd_state" | mdb -k | egrep "un_retry_count|un_cmd_timeout" | sort | uniq
un_cmd_timeout = 0x5
un_retry_count = 0x3
I don't think this problem is going to go away until ZFS itself takes responsibility for dropping bad disks. It's supposed to be the last word in filesystems and cope with all kinds of failure (bitrot etc.), but for some reason it trusts the sd layer to send IO down to the actual disks and time out if necessary. The sd layer in turn passes this trust on to the drivers, which trust the firmware in the HBAs, which probably trusts the firmware in the hard drives. That's a lot of trust.
I propose two new pool-wide properties, failtimeout and failretries (named to match failmode). Each accepts either a single numeric value (I'd choose 5 seconds and 2 retries personally) or a comma-separated list of drives and their respective values, if you want a custom timeout or retry count per drive. For example:
zpool set failtimeout=5 tank
zpool set failretries=2 tank
zpool set failtimeout="c0t0d0=5,c0t1d0=10" tank
zpool set failretries="c0t0d0=2,c0t1d0=4" tank
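To make the two accepted value forms concrete, here is a rough sketch in plain sh of how a value could be resolved for a single drive. Everything in it is an assumption for illustration: the function name fail_value_for is made up, the properties themselves do not exist in ZFS today, and the fallback argument (what to use for a drive missing from a per-drive list) is my own guess at sensible behaviour, not part of the proposal.

```sh
#!/bin/sh
# Hypothetical helper, not part of ZFS: resolve the effective value of the
# proposed failtimeout/failretries property for one drive.
#   $1 = property value: a single number ("5") or a per-drive list
#        ("c0t0d0=5,c0t1d0=10")
#   $2 = drive name to look up (e.g. "c0t1d0")
#   $3 = fallback for drives not named in a per-drive list (assumption)
fail_value_for() {
    prop=$1
    dev=$2
    fallback=$3

    # No "=" in the value: it is a single pool-wide number.
    case $prop in
        *=*) ;;
        *) echo "$prop"; return 0 ;;
    esac

    # Otherwise walk the comma-separated "drive=value" entries.
    rest=$prop
    while [ -n "$rest" ]; do
        entry=${rest%%,*}              # first entry in the remaining list
        if [ "$entry" = "$rest" ]; then
            rest=                      # that was the last entry
        else
            rest=${rest#*,}            # drop the entry we just took
        fi
        if [ "${entry%%=*}" = "$dev" ]; then
            echo "${entry#*=}"         # value after the "="
            return 0
        fi
    done

    # Drive not listed: use the fallback.
    echo "$fallback"
}

fail_value_for "5" c0t0d0 30
fail_value_for "c0t0d0=5,c0t1d0=10" c0t1d0 30
fail_value_for "c0t0d0=5,c0t1d0=10" c0t2d0 30
```

The three calls at the end cover the pool-wide form, a drive that appears in the list, and a drive that does not. A real implementation would live inside ZFS and would also need to validate the values, which this sketch skips.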
I am so annoyed at this situation that I am declaring right now: if a single individual takes this on and completes the work (meaning I don't get an outage the next time a disk fails), I will pay them £5000 ($7742) in three payments over three months. This is not a joke.