Bug #8391

mr_sas controller lockup on Dell H330

Added by Michael Loftis 11 days ago. Updated 11 days ago.

Status: New
Start date: 2017-06-13
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -
Difficulty: Medium
Tags: needs-triage

Description

Dell R530 with the H330 SAS RAID controller with disks in Non-RAID mode

Machine is currently running SmartOS 20170608T172228Z which reports driver version 6.503.00.00ILLUMOS-20170524

Controller has the latest firmware package from Dell (25.5.2.0001); prior revisions were totally unusable due to a cache flush bug in the firmware.

Disks are 8x WD 6TB RED WD60EFRX

Under heavy I/O load the controller "goes away": the driver reports the controller timing out, and the kernel eventually panics. What seems to particularly tickle it is innobackupex --apply-log during backup creation, which issues a lot of fairly random I/O. The kernel errors below (reproduced here, no serial console at the moment) repeat until the panic:

mr_sas: [ID 270009 kern.warning] WARNING: mr_sas0: io_timeout_checker: FW Fault, calling reset adapter
mr_sas: [ID 643100 kern.notice] mr_sas0: io_timeout_checker: fw_outstanding 0x60 max_fw_cmds 0xEF

Once those messages start, the number of outstanding ops never changes and the machine never recovers. Eventually, at least in prior SmartOS releases (I just updated to this release today), ZFS panics and the machine reboots (or drops to kmdb if enabled).
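For anyone comparing console captures, here is a rough sketch of how the io_timeout_checker notices could be decoded and checked for a stuck outstanding count. It is not part of the driver or of any illumos tooling. The names parse_outstanding and is_stalled, and the three-sample threshold, are made up for illustration, and the regular expression only assumes the message format quoted above.

```python
import re

# Matches the io_timeout_checker notice quoted in this report, e.g.
# "mr_sas0: io_timeout_checker: fw_outstanding 0x60 max_fw_cmds 0xEF".
# The format is assumed from that single sample line.
PATTERN = re.compile(
    r"(mr_sas\d+): io_timeout_checker: "
    r"fw_outstanding (0x[0-9A-Fa-f]+) max_fw_cmds (0x[0-9A-Fa-f]+)"
)


def parse_outstanding(line):
    """Return (instance, outstanding, max_cmds) with counts as ints, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


def is_stalled(lines, min_repeats=3):
    """True if the last min_repeats notices all report the same fw_outstanding.

    min_repeats is an arbitrary illustrative threshold, not a driver constant.
    """
    samples = [p[1] for p in map(parse_outstanding, lines) if p is not None]
    if len(samples) < min_repeats:
        return False
    return len(set(samples[-min_repeats:])) == 1


if __name__ == "__main__":
    sample = ("mr_sas: [ID 643100 kern.notice] mr_sas0: io_timeout_checker: "
              "fw_outstanding 0x60 max_fw_cmds 0xEF")
    print(parse_outstanding(sample))
    print(is_stalled([sample] * 3))
```

With the line from this report, the decoded values are 96 commands outstanding against a limit of 239, so the controller is wedged well below its advertised queue depth.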

I can be reached as ZOP on Freenode in both #illumos and #smartos, by /msg, or through the bug here.

History

#1 Updated by Michael Loftis 11 days ago

Additionally, when the reset IS happening, it seems to call mrsas_tbolt_reset_ppc, which fails:

mr_sas: [ID 643100 kern.notice] mr_sas0: mrsas_tbolt_reset_ppc(): RESET FAILED; return failure!!!

I'm getting an mdb/serial debug session going and kicking it over again now.
