mr_sas controller lockup on Dell H330
Dell R530 with the H330 SAS RAID controller with disks in Non-RAID mode
Machine is currently running SmartOS 20170608T172228Z which reports driver version 6.503.00.00ILLUMOS-20170524
Controller has latest firmware package from Dell - 25.5.2.0001 - prior revisions were totally unusable due to a cache flush bug in the firmware.
Disks are 8x WD 6TB RED WD60EFRX
Under heavy I/O load the controller "goes away": the driver reports the controller timing out, and the kernel eventually panics. What seems to particularly tickle it is innobackupex --apply-log operations during backup creation, which issue a lot of fairly random I/O. The kernel errors below (reproduced, no serial console ATM) repeat until the panic:
mr_sas: [ID 270009 kern.warning] WARNING: mr_sas0: io_timeout_checker: FW Fault, calling reset adapter
mr_sas: [ID 643100 kern.notice] mr_sas0: io_timeout_checker: fw_outstanding 0x60 max_fw_cmds 0xEF
Once those messages start, the number of outstanding ops never changes and the machine never recovers. Eventually, at least in prior SmartOS releases (I only updated to this release today), ZFS panics and the machine reboots (or drops to kmdb if enabled).
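For anyone trying to confirm the same symptom from their own logs, here is a minimal sketch that pulls the fw_outstanding / max_fw_cmds values out of the io_timeout_checker notices and flags the case where the count has stopped moving. It assumes the message format is exactly as quoted above; the function names and the three-repeat threshold are my own choices for illustration, not anything from the driver.

```python
import re
import sys

# Matches the io_timeout_checker notice as quoted in this report.
# Assumption: other driver versions use the same wording.
PATTERN = re.compile(
    r"io_timeout_checker: fw_outstanding (0x[0-9A-Fa-f]+) "
    r"max_fw_cmds (0x[0-9A-Fa-f]+)"
)


def parse_outstanding(lines):
    """Return a list of (fw_outstanding, max_fw_cmds) integer pairs."""
    samples = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            samples.append((int(match.group(1), 16), int(match.group(2), 16)))
    return samples


def is_stuck(samples, min_repeats=3):
    """True if the last min_repeats samples share one nonzero fw_outstanding.

    min_repeats is an arbitrary illustrative threshold.
    """
    if len(samples) < min_repeats:
        return False
    tail = [outstanding for outstanding, _ in samples[-min_repeats:]]
    return tail[0] != 0 and len(set(tail)) == 1


if __name__ == "__main__":
    found = parse_outstanding(sys.stdin)
    for outstanding, max_cmds in found:
        print(f"outstanding={outstanding} max={max_cmds}")
    print("stuck" if is_stuck(found) else "not stuck (or too few samples)")
```

Feed it the system log on stdin, for example `python3 check_mr_sas.py < /var/adm/messages` (the script name is hypothetical, and the log path is an assumption about where your syslog writes kernel messages). With the values above, 0x60 is 96 outstanding commands out of a maximum of 0xEF, or 239.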
I can be reached as ZOP on Freenode in both #illumos and #smartos, via /msg, or through the bug here.
#1 Updated by Michael Loftis 11 days ago
Additionally, when the reset IS happening, it seems to call mrsas_tbolt_reset_ppc, which fails:
mr_sas: [ID 643100 kern.notice] mr_sas0: mrsas_tbolt_reset_ppc(): RESET FAILED; return failure!!!
I'm getting an mdb/serial debug session going and kicking it over again now.