Bug #8391

mr_sas controller lockup on Dell H330

Added by Michael Loftis 4 months ago. Updated 12 days ago.

Status: New
Start date: 2017-06-13
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -
Difficulty: Medium
Tags: needs-triage

Description

Dell R530 with the H330 SAS RAID controller with disks in Non-RAID mode

Machine is currently running SmartOS 20170608T172228Z which reports driver version 6.503.00.00ILLUMOS-20170524

Controller has latest firmware package from Dell - 25.5.2.0001 - prior revisions were totally unusable due to a cache flush bug in the firmware.

Disks are 8x WD 6TB RED WD60EFRX

Under heavy I/O load the controller "goes away": the driver reports the controller timing out, and the kernel eventually panics. What seems to particularly tickle it is innobackupex --apply-log operations during backup creation, which issue a lot of fairly random I/O. The kernel errors below (reproduced, no serial console ATM) repeat until a panic...

mr_sas: [ID 270009 kern.warning] WARNING: mr_sas0: io_timeout_checker: FW Fault, calling reset adapter
mr_sas: [ID 643100 kern.notice] mr_sas0: io_timeout_checker: fw_outstanding 0x60 max_fw_cmds 0xEF

The number of outstanding ops doesn't change once those messages start appearing, and the machine never recovers. Eventually, at least in prior SmartOS releases (just updated to this release today), ZFS calls a PANIC and the machine reboots (or drops to kmdb if enabled).
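For anyone comparing their own logs against the messages above, here is a minimal Python sketch that decodes the hex counters and checks whether fw_outstanding has stopped changing. The regular expression is based only on the two log lines quoted in this report; the function names and the min_repeats threshold are hypothetical and not part of the mr_sas driver.

```python
import re

# Matches the io_timeout_checker notice quoted in this report.
# The pattern is an assumption derived from that single sample line.
PATTERN = re.compile(
    r"(mr_sas\d+): io_timeout_checker: "
    r"fw_outstanding (0x[0-9A-Fa-f]+) max_fw_cmds (0x[0-9A-Fa-f]+)"
)


def parse_outstanding(lines):
    """Return (instance, fw_outstanding, max_fw_cmds) tuples with integer counters."""
    samples = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            samples.append((m.group(1), int(m.group(2), 16), int(m.group(3), 16)))
    return samples


def is_stuck(samples, min_repeats=3):
    """True if the last min_repeats samples share one nonzero fw_outstanding value."""
    if len(samples) < min_repeats:
        return False
    tail = [s[1] for s in samples[-min_repeats:]]
    return tail[0] != 0 and len(set(tail)) == 1


if __name__ == "__main__":
    line = ("mr_sas: [ID 643100 kern.notice] mr_sas0: io_timeout_checker: "
            "fw_outstanding 0x60 max_fw_cmds 0xEF")
    samples = parse_outstanding([line] * 3)
    print(samples[0], is_stuck(samples))
```

Decoded, the sample line shows 0x60 = 96 commands outstanding against a limit of 0xEF = 239.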

I can be reached as ZOP on Freenode in both #illumos and #smartos, via /msg, or through the bug here.

History

#1 Updated by Michael Loftis 4 months ago

Additionally, it seems that when the reset IS happening it calls mrsas_tbolt_reset_ppc, which fails...

mr_sas: [ID 643100 kern.notice] mr_sas0: mrsas_tbolt_reset_ppc(): RESET FAILED; return failure!!!

I'm getting a mdb/serial debug session going and kicking it over again now.

#2 Updated by Juergen Gotteswinter 4 months ago

Your description sounds more like the usual effects of SATA Disks hooked up to a SAS HBA.

#3 Updated by Michael Loftis 4 months ago

Juergen Gotteswinter wrote:

Your description sounds more like the usual effects of SATA Disks hooked up to a SAS HBA.

We've quite a few very large storage arrays configured this way (and some mixed SAS/SATA), but they run mostly FreeBSD and some Linux; SmartOS with the H330 is the only exception. Illumos (SmartOS) is the only combo where we've got issues like this. Most of the others are under much more constant I/O load too. I opened the bug because it's definitely a problem with the Illumos kernel driver, though it MAY be some firmware bug it's tickling. The driver can't properly reset this controller, and that's definitely a bug even if the root cause were the SATA disks on a SAS HBA. But the fact that we've got much larger arrays (hundreds of drives) and some smaller ones running these same disks tells me it's not the disks. We don't have very many H330-based systems in the field yet, but this is the only one having issues of any kind, and it's also the only one with Illumos/SmartOS.

If it was "just a SATA disk on SAS HBA problem" we'd be seeing it all over, and I wouldn't be wasting others' time by opening a bug to reach someone who has more knowledge about the controllers and hopefully better access to Broadcom and some sort of programming manual for the firmware/chips.

#4 Updated by Michael Loftis 4 months ago

Michael Loftis wrote:

Juergen Gotteswinter wrote:

Your description sounds more like the usual effects of SATA Disks hooked up to a SAS HBA.

...

So maybe the issue is that the timeout values are too small for slower SATA drives like these REDs, causing the driver to unnecessarily attempt to reset the controller (and then fail to reset it)?

#5 Updated by Ian Collins about 1 month ago

I have been seeing similar problems with the H330, as reported on the SmartOS mail list:

The controller is running firmware version 25.5.2.0001.

I'm running 20170805T013701Z on this box.

The card is in HBA mode and the 6 drives in the pool are Toshiba SAS SSDs (PH043PCJTB300765051GA00).

Thrashing each disk in turn with format surface analysis does not trigger the bug, but importing the pool read only does...

Could there be two bugs at play here? A firmware bug causing timeouts and a driver bug failing to reset the controller?

On my systems (I have seen this on two), I don't think it is the volume of I/O, but the location. Working through the client KVMs and zero filling their drives (dd in Linux, sdelete in Windows), which maxes out the pool I/O, does not cause timeouts. One VM always triggers the timeout and hang; even trying to destroy the KVM volume triggers the bug.

#6 Updated by Michael Loftis 16 days ago

I've had time to start coming back to this. After upgrading to 20170928T144204Z and RAID fw 25.5.3.0005, I've verified that the problem is still present and reproducible.

#7 Updated by Ian Collins 12 days ago

I have also struck the problem again this morning. If you are going to run Illumos on Dell, get Dell to fit a "normal" SAS controller.
