Bug #8391

open

mr_sas controller lockup on Dell H330

Added by Michael Loftis almost 6 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
2017-06-13
Due date:
% Done:
0%
Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:
External Bug:

Description

Dell R530 with the H330 SAS RAID controller with disks in Non-RAID mode

Machine is currently running SmartOS 20170608T172228Z which reports driver version 6.503.00.00ILLUMOS-20170524

Controller has latest firmware package from Dell - 25.5.2.0001 - prior revisions were totally unusable due to a cache flush bug in the firmware.

Disks are 8x WD 6TB RED WD60EFRX

Under heavy I/O load the controller "goes away": the driver reports the controller timing out, and eventually the kernel panics. What seems to trigger it in particular is innobackupex --apply-log during backup creation, which issues a lot of fairly random I/O. The kernel errors below (reproduced by hand, no serial console ATM) repeat until a panic...

mr_sas: [ID 270009 kern.warning] WARNING: mr_sas0: io_timeout_checker: FW Fault, calling reset adapter
mr_sas: [ID 643100 kern.notice] mr_sas0: io_timeout_checker: fw_outstanding 0x60 max_fw_cmds 0xEF

The number of outstanding ops doesn't change once those messages start to output and the machine never recovers. Eventually - at least in prior SmartOS releases (just updated to this release today) ZFS calls a PANIC and the machine reboots (or drops to kmdb if enabled).
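
For anyone comparing their own logs against these messages, here is a minimal sketch (not part of the original report) that pulls the two hex counters out of the io_timeout_checker notice and checks whether the outstanding count stays constant across repeats. The function names and the repeated sample line are illustrative assumptions; only the message format is taken from the lines quoted above. For reference, 0x60 is 96 and 0xEF is 239 in decimal.

```python
import re

# Matches the io_timeout_checker notice line quoted in this report.
LINE_RE = re.compile(
    r"(?P<inst>mr_sas\d+): io_timeout_checker: "
    r"fw_outstanding 0x(?P<out>[0-9A-Fa-f]+) "
    r"max_fw_cmds 0x(?P<max>[0-9A-Fa-f]+)"
)


def parse_timeout_lines(lines):
    """Return (instance, outstanding, max_cmds) tuples for matching lines."""
    results = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            results.append(
                (m.group("inst"), int(m.group("out"), 16), int(m.group("max"), 16))
            )
    return results


def outstanding_is_stuck(samples):
    """True if there are at least two samples and the count never changes."""
    counts = {out for _, out, _ in samples}
    return len(samples) >= 2 and len(counts) == 1


# Illustrative input: the notice line from the report, repeated twice
# (the repetition is an assumption standing in for the real log).
sample_log = [
    "mr_sas: [ID 270009 kern.warning] WARNING: mr_sas0: io_timeout_checker: FW Fault, calling reset adapter",
    "mr_sas: [ID 643100 kern.notice] mr_sas0: io_timeout_checker: fw_outstanding 0x60 max_fw_cmds 0xEF",
    "mr_sas: [ID 643100 kern.notice] mr_sas0: io_timeout_checker: fw_outstanding 0x60 max_fw_cmds 0xEF",
]

samples = parse_timeout_lines(sample_log)
for inst, out, max_cmds in samples:
    print(f"{inst}: {out} of {max_cmds} firmware commands outstanding")
print("stuck:", outstanding_is_stuck(samples))
```

A constant count across repeats only shows that no commands are completing; it does not say whether the firmware or the driver is at fault.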

I can be reached as ZOP on Freenode in both #illumos and #smartos, via /msg, or through the bug here.

Actions #1

Updated by Michael Loftis almost 6 years ago

Additionally, it seems that when the reset IS happening, it calls mrsas_tbolt_reset_ppc, which fails...

mr_sas: [ID 643100 kern.notice] mr_sas0: mrsas_tbolt_reset_ppc(): RESET FAILED; return failure!!!

I'm getting a mdb/serial debug session going and kicking it over again now.

Actions #2

Updated by Juergen Gotteswinter almost 6 years ago

Your description sounds more like the usual effects of SATA Disks hooked up to a SAS HBA.

Actions #3

Updated by Michael Loftis almost 6 years ago

Juergen Gotteswinter wrote:

Your description sounds more like the usual effects of SATA Disks hooked up to a SAS HBA.

We've quite a few very large storage arrays, only SmartOS with the H330s configured this way (and some mixed SAS/SATA), but they run mostly FreeBSD and some Linux. Illumos (SmartOS) is the only combo where we've got issues like this. Most of the others are under much more constant I/O load too. I opened the bug because it's definitely a problem with the Illumos kernel driver, though it MAY be some firmware bug it's tickling. The driver can't properly reset this controller, and that's definitely a bug even if the root cause were the SATA disks on a SAS HBA. But the fact that we've got much larger arrays (hundreds of drives) and some smaller ones running these same disks tells me it's not the disks. We don't have very many H330 based systems in the field yet, but this is the only one having issues of any kind, and it's also the only one with Illumos/SmartOS.

If it was "just a SATA disk on SAS HBA problem" we'd be seeing it all over, and I wouldn't be wasting others' time by attempting to open a bug and get someone who has more knowledge about the controllers and hopefully better access to Broadcom and some sort of programming manual for the firmware/chips.

Actions #4

Updated by Michael Loftis almost 6 years ago

Michael Loftis wrote:

Juergen Gotteswinter wrote:

Your description sounds more like the usual effects of SATA Disks hooked up to a SAS HBA.

...

So maybe the issue is the timeout values are too small for slower SATA drives like these REDs causing it to unnecessarily attempt to reset the controller? (and failing to reset the controller)

Actions #5

Updated by Ian Collins over 5 years ago

I have been seeing similar problems with the H330, as reported on the SmartOS mail list:

The controller is running firmware version 25.5.2.0001.

I'm running 20170805T013701Z on this box.

The card is in HBA mode and the 6 drives in the pool are Toshiba SAS SSDs (PH043PCJTB300765051GA00).

Thrashing each disk in turn with format surface analysis does not trigger the bug, but importing the pool read only does...

Could there be two bugs at play here? A firmware bug causing timeouts and a driver bug failing to reset the controller?

On my systems (I have seen this on two), I don't think it is the volume of I/O, but the location. Working through the client KVMs and zero filling their drives (dd in Linux, sdelete in Windows), which maxes out the pool I/O, does not cause timeouts. One VM always triggers the timeout and hang; even trying to destroy the KVM volume triggers the bug.

Actions #6

Updated by Michael Loftis over 5 years ago

I've had time to come back to this. After upgrading to 20170928T144204Z and RAID fw 25.5.3.0005, I have verified that the problem is still present and reproducible.

Actions #7

Updated by Ian Collins over 5 years ago

I have also struck the problem again this morning. If you are going to run Illumos on Dell, get Dell to fit a "normal" SAS controller.

Actions #8

Updated by George Pochiscan over 5 years ago

Hello,

We have the same issue with the 20171012T005133Z image, Dell 730 and H330mini RAID controller.

Our setup contains 9 R730 servers, 15x 600GB SAS disks TH06DWVPHGT0074P011JA00, PERC H330 Mini (Embedded).

Controller firmware version is 25.5.2.0001.

Actions #9

Updated by Ian Collins almost 4 years ago

I am still seeing this issue with the latest platform image.

It always appears to be localised to an individual volume. Once isolated, I can stop the VM in question and the timeouts stop, until the next time...

Actions #10

Updated by George Pochiscan almost 4 years ago

Hello,

I am curious whether this bug is related to the type of disks or to the controller.
Also, I am curious whether this occurs on both the HBA 330 and the PERC H330 RAID controller.

Thanks!

Actions #11

Updated by Ian Collins almost 4 years ago

In our case, the drives are Dell SAS SSDs, part #MZ7LM480HCHP00D3.

Unfortunately, I was foolish enough not to take the controller out of RAID mode when building the machines. The disks are all set up as non-RAID devices.

I have never seen any issues with the HBA controller.

Actions #12

Updated by Cullum Smith over 2 years ago

I am also affected by this issue.

I have a Dell R630 with PERC H730 Mini (which is just a rebranded LSI 3108). I have all my disks in the controller's semi-fake "passthrough" mode.

I see the exact same messages on the console and all VMs hang. Looks like the RAID controller just disappears.

Oddly, for me this bug is only ever triggered when deleting a zone or ZFS dataset. I've done full pkgsrc builds with heavy I/O and it's been fine, but sometimes an innocuous "vmadm delete" from SmartOS forces me to power cycle the box.

Actions #13

Updated by Ian Collins over 2 years ago

We bit the bullet and replaced our controllers with HBA330s. No issues since.

Our symptoms were as you describe.

If you are seeing these symptoms, I strongly recommend replacing the controller. We had a drive failure that completely locked up the pool; it never got out of the controller reset loop.
