Bug #5483

Pool hang & system panic because of single disk failure

Added by Chip Schweiss over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date: 2014-12-24
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

A single disk failure on a raidz3 pool caused I/O to hang, eventually panicking the system.

The disk failed by going completely offline. Upon reboot the system was dropped to maintenance mode because the device was unresponsive and the pool would not import. It took removing /etc/zfs/zpool.cache to get the system to boot. Importing with '-f' afterwards worked without a problem.
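For reference, the recovery from maintenance mode amounted to roughly this (the pool name here is illustrative):

mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bad    # or simply remove the cache file
reboot
zpool import -f tank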

There are two problems in how this was handled. First, the disk should have been faulted out of the pool, triggering a spare to replace it. Second, the system shouldn't hang on boot because a single disk in a raidz3 pool is missing. A single disk failure should never be able to take the entire system down.

There seem to have been many long discussions about I/O hanging on a single disk and the appropriate action to take. My suggestion is, at a minimum, to allow policies to be set via /etc/system or as zpool settings: the number of seconds of no response before a disk is failed, and the maximum number of disks to fail in a vdev before timeouts are ignored, for raidz3, raidz2, raidz1 and mirrors.
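Something along these lines is what I have in mind (the property names are purely hypothetical, just to illustrate the kind of policy):

zpool set disk-fail-timeout=60 tank       # hypothetical: seconds with no response before a disk is faulted
zpool set max-timeout-faults=3 tank       # hypothetical: disks to fault per raidz3 vdev before timeouts are ignored

or the equivalent as /etc/system tunables.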

In a raidz3 I would always choose availability over trying to preserve a single disk.

Hardware:
2 LSI-9207-8e HBAs, Firmware version 17
Supermicro SC847E26-RJBOD1
Seagate Constellation ES.3 SAS

OS: OmniOS r151010

Dump available at ftp://nrg-ftp.wustl.edu/pub/zfs/io_hung_20141223

Sorry, the compressed dump is 43 GB. The system has 128 GB of RAM. Contact me if you have trouble downloading it.


Files

msgbuf.txt (21.7 KB) - Chip Schweiss, 2014-12-29 01:39 PM

History

#2

Updated by Rich Ercolani over 5 years ago

Gonna guess you're hitting some combination of #4310 and... I forget the other bug number, depending on which bits are in r151010. If you can include the ::stack output here, I can tell you which.

The latter is fixed in stock illumos-gate after a certain revision (and, I believe, in the OmniOS bloody releases); the former will be fixed sometime in January at the latest.

#3

Updated by Chip Schweiss over 5 years ago

I have another dump from another system with the same failure mode. Seems disk failures crash the system more often than not.

Hardware:
2 LSI-9206-16e
DataON DNS-1660 JBOD
Seagate Constellation ST6000NM0034

OS: OmniOS r151012

Dump available at: ftp://ftp.nrg.wustl.edu/pub/zfs/disk-fault_20141225

Again a large memory system. This dump is 25GB.

If anyone would like mdb or dtrace info from these, let me know specifically and I will try to provide it. I'm a novice with mdb and dtrace, so specific instructions will help immensely.
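For example, I can open either dump with mdb and run whatever dcmds are useful (the file names below depend on where savecore put the dump):

cd /var/crash/<hostname>
mdb unix.0 vmcore.0
> ::status
> ::msgbuf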

#4

Updated by Chip Schweiss over 5 years ago

::stack output from the dump on the r151012 crash

> ::stack
vpanic()
vdev_deadman+0x10b(fffff021f35106c0)
vdev_deadman+0x4a(fffff021f350acc0)
vdev_deadman+0x4a(fffff021f34ff380)
spa_deadman+0xad(fffff021ff3cf000)
cyclic_softint+0xfd(fffff021ede1a540, 0)
cbe_low_level+0x14()
av_dispatch_softvect+0x78(2)
apix_dispatch_softint+0x35(0, 0)
switch_sp_and_call+0x13()
apix_do_softint+0x6c(fffff000f5895a50)
apix_do_interrupt+0x34a(fffff000f5895a50, 2)
_interrupt+0xba()
acpi_cpu_cstate+0x11b(fffff021ed715120)
cpu_acpi_idle+0x8d()
cpu_idle_adaptive+0x13()
idle+0xa7()
thread_start+8()

I just read #4310. In the first crash, both the pool with the failed disk and the dump disk were on mpt_sas controllers; however, they are different controllers, so I don't think what happened took mpt_sas completely down.

::msgbuf reveals that in each case the offending disk is offline for some time before the pool is declared hung and the system panics. I've attached it.
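If I'm reading the stack right, that is the ZFS deadman timer (the spa_deadman/vdev_deadman frames above) deciding the I/O has been outstanding too long and panicking the box. As I understand it, the panic itself can be disabled in /etc/system, e.g.:

set zfs:zfs_deadman_enabled = 0

though that only suppresses the panic, not the underlying hang.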

About a year ago I caught this type of failure in action and was able to zpool offline the offending disk before the system panicked. It recovered immediately. Wish I had known then how to take a dump of a running system. :(
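(For the record, a dump of a running system can be taken without a reboot, assuming a dump device is configured:

savecore -L

)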

Maybe the fact that I see this mode of failure more often than not is related to how Seagate Constellations fail.
