
Feature #1553

ZFS should not trust the layers underneath regarding drive timeouts/failure

Added by Alasdair Lumsden almost 8 years ago. Updated almost 7 years ago.

Status: New
Priority: High
Assignee: -
Category: zfs - Zettabyte File System
Start date: 2011-09-22
Due date: -
% Done: 0%
Estimated time: -
Difficulty: Hard
Tags: needs-triage

Description

I just woke up from an anxiety attack about the possibility of future disk failures at work. Last week we had a 30-minute platform-wide outage due to a failing disk in our NFS storage array, and this is not the first time either - it happened several months ago too.

It's not a low-end setup - we're talking a pair of NFS heads with 96GB of RAM, six 256GB L2ARC devices, ZeusRAMs for the ZIL, and LSI 6Gbps eSAS cards (mpt_sas) attached to a daisy chain of 12-disk LSI 6Gbps SAS JBODs with 7.2k RPM Seagate Constellation SAS disks.

When a disk goes bad, IO on the whole array stalls, stutters, or grinds to a halt. Then your customers start calling, screaming at you.

The last time this happened we had already dropped sd's un_cmd_timeout and un_retry_count to 5 seconds and 3 retries respectively, but this seems to have made no difference whatsoever:

echo "::walk sd_state | ::grep '.!=0' | ::sd_state" | mdb -k | egrep "un_retry_count|un_cmd_timeout" | sort | uniq
un_cmd_timeout = 0x5
un_retry_count = 0x3
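For reference, the usual way to lower the global defaults behind those per-LUN fields is via /etc/system, taking effect at the next boot. This is a hedged sketch: sd_io_time is a documented sd tunable, but whether sd_retry_count exists under that exact name in your sd version should be verified first (e.g. with echo "sd_retry_count::print" | mdb -k) before relying on it.

```
* /etc/system fragment using the values from this ticket
* (5 second command timeout, 3 retries).
* sd_io_time is documented; sd_retry_count is an assumption here.
set sd:sd_io_time = 5
set sd:sd_retry_count = 3
```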

I don't think this problem is going to go away until ZFS itself takes responsibility for dropping bad disks. It's supposed to be the last word in filesystems, able to cope with all kinds of failure (bitrot etc), yet for some reason it trusts the sd layer to send IO down to the actual disks and time out if necessary. The sd layer in turn passes this trust on to the HBA drivers, which trust the firmware in the HBAs, which probably in turn trusts the firmware in the hard drives. That's a lot of trust.

I propose two new pool-wide properties, failtimeout and failretries (to match failmode), which accept either a single numeric value (I'd choose 5 seconds and 2 retries personally) or, if you want a custom timeout per drive, a list of drives and their respective values, e.g.:

zpool set tank failtimeout=5
zpool set tank failretries=2
OR
zpool set tank failtimeout="c0t0d0=5,c0t1d0=10"
zpool set tank failretries="c0t0d0=2,c0t1d0=4"

I am so annoyed by this situation that I am right now declaring: if a single individual takes this on and completes the work (meaning I don't get an outage the next time a disk fails), I will pay them £5000 ($7742) in three payments over 3 months. This is not a joke.

History

#1

Updated by Garrett D'Amore almost 8 years ago

Alasdair:

Architecturally the solution for this problem does not belong with ZFS. Transport errors and timeouts should be detected and handled by the underlying storage subsystem.

The defaults in sd are terrible.

The HBA drivers need work here too.

That does not mean that we ought to move this logic into the filesystem code. I would strongly oppose such a project on a number of grounds. Besides being the wrong place, the actual timeouts may be very different for different transports. A 60-second timeout may be quite reasonable for a disk attached via iSCSI across a WAN; it's insane for a locally connected SSD, though.

In fact, sd is the wrong place for much of this work, too. Really, as I said, these are transport-specific values and ought to be handled via the transport subsystems -- i.e. the HBAs.

We should talk more about this if you want. I'd actually prefer to reject this particular RFE, and talk about the more correct approach of solving your problem(s).

#2

Updated by Alasdair Lumsden almost 8 years ago

Hi Garrett,

Many thanks for responding.

Garrett D'Amore wrote:

Architecturally the solution for this problem does not belong with ZFS. Transport errors and timeouts should be detected and handled by the underlying storage subsystem.

But, for example, we don't have the source to mpt, and we certainly don't have the source to the firmware on the HBAs or hard drives. If the sd layer can be enhanced, that would be acceptable to me, but (not being a kernel developer, I can only talk at a high level here) I don't think we can trust the drivers or the layers below, since we often don't have control over them and they're often written by vendors.

The defaults in sd are terrible.

I don't think the sd timeouts I adjusted actually do a great deal in the circumstances I'm experiencing. I spoke to someone on IRC who also adjusted them and found they made no difference:

11:52 <philhar> did you change the timeout in the LSI bios?
11:52 <Alasdairrr> no
11:53 <Alasdairrr> we never tested the sd timeout stuff
11:53 <Alasdairrr> just took it on faith that the variable would help
11:53 <philhar> "With these values in place, our timeout is reduced from upwards of 3 minutes, to a mere 15 seconds. "
11:53 <philhar> so thats just from the calculations
11:53 <Alasdairrr> "in theory"
11:54 <Alasdairrr> yeah
11:54 <philhar> I see...
11:54 <philhar> well we did the real time adjustments
11:54 <philhar> and tried to pull just a single drive, didn't make a huge difference
11:54 <Alasdairrr> :(
11:54 <Alasdairrr> that sucks
11:54 <philhar> still took about 90-120 seconds for the drive to be marked as faulted in zfs

He's lucky - 90-120 seconds would be a vast improvement in my book.

The HBA drivers need work here too.

That does not mean that we ought to move this logic into the filesystem code. I would strongly oppose such a project on a number of grounds. Besides being the wrong place, the actual timeouts may be very different for different transports. A 60-second timeout may be quite reasonable for a disk attached via iSCSI across a WAN; it's insane for a locally connected SSD, though.

That's precisely why I think it's a useful tunable: ZFS can sit atop lots of different transports, not just SAS HBAs, and you don't always have an easy place underneath to set a timeout. The defaults could be set really high (or unset and unused by default), and you would only enable them if you wanted to take control of the situation, as in these circumstances.

In fact, sd is the wrong place for much of this work, too. Really, as I said, these are transport-specific values and ought to be handled via the transport subsystems -- i.e. the HBAs.

We should talk more about this if you want. I'd actually prefer to reject this particular RFE, and talk about the more correct approach of solving your problem(s).

I have spoken to a few others who think ZFS might be the right place for this. It's not your average filesystem: it already handles abstractions usually handled in hardware, such as RAID and checksumming, and I honestly think it would be reasonable for it to handle timeouts as an optional feature, precisely because there are so many alternative transports to put your pool on, such as file-backed storage over NFS.

I would really appreciate it if you'd indulge the discussion a little longer rather than flat-out rejecting it. It would be good to have a chat about it, as I'm certainly not the only one being caught out by this.

#3

Updated by Richard Elling almost 8 years ago

ZFS is the wrong place for timeouts. The proper place is in the lower layers where each physical block device can have its own timeouts, retries, policies, and fault management.

We have done extensive testing over the years with sd timeouts and retry logic. While sd is a mess, for unrelated reasons, it is a better location to handle many errors. sd needs more love or a casket.

#4

Updated by Garrett D'Amore almost 8 years ago

There are other reasons, beyond just notifying the filesystem of failure, why ZFS is the wrong place. You have to release resources, etc., and that can only be done by the actual device driver. And as I indicated, even sd is too high in the stack... it belongs in the driver.

"mpt" is unfortunate. We're working on an open source rewrite. In the meantime there is mptsas, which is better for a variety of reasons anyway. Making a bad architectural decision to support legacy hardware is wrong; this was done for Cassini and it was a mistake, and it would be a mistake to do it now for mpt.

The hardware and firmware of HBAs are irrelevant... we can safely time out operations in the HBA driver without intervention from the HBA firmware. This is done for all kinds of device drivers already -- it's just a SMOP (small matter of programming). (In some cases we might need to hard reset the HBA when we encounter a timeout, to allow us to release resources held for DMA without risking the HBA scribbling over them at a future date, but this should not be common.)

#5

Updated by Garrett D'Amore almost 8 years ago

  • Subject changed from ZFS should not trust the layers underneith regarding drive timeouts/failure to ZFS should not trust the layers underneath regarding drive timeouts/failure
#6

Updated by Rich Lowe almost 8 years ago

Anyone doing this would be well advised to talk to the ZFS folks about where they were going with device error handling/FMA integration. At least as far as I understood things there was a well thought out end state for their efforts in this regard.

#7

Updated by Steve Radich almost 8 years ago

The reasons to move this down the layers are perfectly logical, and although I hate to admit it from an admin perspective... well, I kind of agree this doesn't belong in ZFS.

HOWEVER, we already count read/write/checksum errors in ZFS, and you could argue those "belong" in the lower layers too. I'd propose adding a column for timeouts exceeded and counting them as proposed. You could also discard the stuck I/Os by pushing them off to a thread that waits for "garbage" (i.e. discarded reads/writes that timed out). As stage 1, we count these as errors for reporting purposes; as stage 2, we actually have an option to park all the discarded I/Os that are still waiting for completion.

From an admin perspective, if a write to a failing drive hangs, I NEED it to abort and the drive to be failed BEFORE the client notices, regardless of whether the drive/controller is still performing the operation. Let it continue to run (i.e. no need to worry about resources); just wait for the response in a queue that discards results.

As to resetting the controller - I actually think that's a bad idea: if you cause the controller to rescan disks (just as an example), that's far too long an operation. Let the drive drop out and an admin look at it - a scrub is a nightmare, but better than timeouts.

ESPECIALLY, though, add the timeout counters so an admin can see that one drive is slower than the others; then (s)he can decide whether to take action. We've had PLENTY of drives that respond slowly without actually failing, and they cause numerous problems until we find the drive - which is quite difficult even when iostat shows one drive with longer response/busy times than the others for no apparent reason.

Steve Radich
www.LinkedIn.com/in/SteveRadich

#9

Updated by David Gwynne almost 8 years ago

Could you use the failfast support SVM introduced (http://www.filibeto.org/~aduritz/truetrue/solaris10/svm/svm_failfast.pdf) in ZFS? This isn't a new problem.

#10

Updated by matthew lagoe about 7 years ago

Personally, I don't think this should be done only at the OS level. All the other major storage vendors (EMC, NetApp) have this issue solved, yet ZFS seems to ignore it. It is a MAJOR issue for anyone wanting to use ZFS in a high-availability environment; the whole point of using ZFS as RAID is that a single drive failure doesn't take you down.

There are a number of ways to fix this. One would be for ZFS to be smart enough to either fail the drive in the ZFS pool (not at the OS level), or to go ahead with writing to the other disks, mark the slow drive as "problematic" (after a set timeout), and catch it up when it can.

RAID cards have had this logic for years, and frankly ZFS is a software RAID card; why can't it also have the necessary features of a high-end RAID card?

I am with Alasdair on this. I would be happy to pay a developer to fix this, but saying it's not an issue, or that it's someone else's issue, is just blatantly wrong.

So far this has taken our storage down twice this year...

#11

Updated by matthew lagoe about 7 years ago

Side note: putting it in ZFS would also have the added benefit of ensuring your storage uptime even when there are bugs in drivers, etc. Yes, we all know to use HBA X, but even the HBA that always works perfectly can have issues. ZFS is designed to prevail over hardware issues; it seems logical to me that it should be able to prevail over a funky HBA (with appropriate warnings and messages).

Also if there is some magical HBA/drive etc out there that can work around this issue I would love to know about it.

#12

Updated by Sam Zaydel about 7 years ago

I think this really needs to be dealt with at the driver level and integrated with FMA; after all, issues like these are commonly handled by hardware drivers, which have much more insight into the hardware than the layers above do. FMA really needs to be involved in making sure failures are reported at the OS level; FMA should correctly retire the device, remove device links, log, etc.

#13

Updated by matthew lagoe about 7 years ago

Oh, don't get me wrong, it should be handled in the driver as well; however, ZFS shouldn't trust anything and should always assume that everything else is broken in every way.
