Bug #5122

open

mpt_sas hangs forever with some kinds of malfunctioning drive

Added by Rich Ercolani almost 9 years ago. Updated about 7 years ago.

Status: New
Priority: Normal
Assignee:
Category: -
Start date: 2014-08-26
Due date:
% Done: 50%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:
External Bug:

Description

Precisely as it says.

> ::mptsas
        mptsas_t inst ncmds suspend  power
================================================================================
ffffff23b62fd000    0     0       0 ON=D0
ffffff23b62f6000    1     0       0 ON=D0
> ::stacks -m sd
THREAD           STATE    SOBJ                COUNT
ffffff2335289400 SLEEP    CV                      1
                 swtch+0x18a
                 cv_wait_sig+0x1b5
                 sd_check_media+0x2c0
                 sdioctl+0xe0d
                 cdev_ioctl+0x39
                 spec_ioctl+0x60
                 fop_ioctl+0x55
                 ioctl+0x9b
                 _sys_sysenter_post_swapgs+0x237

ffffff00f4394c40 SLEEP    SEMA                    1
                 swtch+0x18a
                 sema_p+0x2a9
                 biowait+0xd9
                 default_physio+0x345
                 physio+0x25
                 scsi_uscsi_handle_cmd+0x29d
                 sd_ssc_send+0x195
                 sd_send_scsi_RDWR+0x353
                 sd_tg_rdwr+0x423
                 cmlb_use_efi+0xd0
                 cmlb_validate_geometry+0x121
                 cmlb_validate+0x83
                 sd_unit_attach+0x9ff
                 sdattach+0x19
                 devi_attach+0x9e
                 attach_node+0x14f
                 i_ndi_config_node+0xc0
                 i_ddi_attachchild+0x88
                 devi_attach_node+0x88
                 ndi_devi_online+0xb0
                 i_mdi_pi_state_change+0x4cf
                 mdi_pi_online+0x43
                 mptsas_create_virt_lun+0x963
                 mptsas_create_lun+0x1b0
                 mptsas_probe_lun+0xdd
                 mptsas_config_target+0x38
                 mptsas_handle_topo_change+0x356
                 mptsas_handle_dr+0x12c
                 taskq_thread+0x318
                 thread_start+8
> ::msgbuf
[snip]
WARNING: /pci@0,0/pci8086,3c08@3/pci1000,30d0@0 (mpt_sas0):
        mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31110610
WARNING: /pci@0,0/pci8086,3c08@3/pci1000,30d0@0 (mpt_sas0):
        mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31110610
WARNING: /pci@0,0/pci8086,3c08@3/pci1000,30d0@0 (mpt_sas0):
        mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31170000
WARNING: /pci@0,0/pci8086,3c08@3/pci1000,30d0@0 (mpt_sas0):
        mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31170000
/pci@0,0/pci8086,3c08@3/pci1000,30d0@0 (mpt_sas0):
        Log info 0x31110610 received for target 25 w5000c50038ae5ae4.
        scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc

Probably can work around this by pulling the relevant malfunctioning disk, but thought I'd log it as I didn't think this was still a thing that could happen in modern mpt_sas land.

illumos-gate, current as of 20140821.

Actions #1

Updated by Rich Ercolani almost 9 years ago

By the by, pulling the drive or reseating such a drive will not fix this problem once it occurs - nor will cycling the enclosure. The OS will not recover once it reaches this state, short of a power cycle - and if you cycle the OS with the misbehaving hardware still attached, it will have the same problem on boot, possibly leading to other failures.

Actions #2

Updated by Rich Ercolani about 7 years ago

  • Assignee set to Dan Fields

Assigning by request.

Actions #3

Updated by Dan Fields about 7 years ago

  • % Done changed from 0 to 50

diffs - https://github.com/illumos/illumos-gate/compare/master...kojack:bug5122

Summary: When a SCSI command has an invalid device handle (0xffff) at its target, the code shouldn't return an error to the SCSI transport. Instead, it should accept the command, mark it as 'device gone', and let the initiator handle the result.

This biowait() hang showed up at a site with pools containing SSD drives. When the drive failed, it didn't begin to fail slowly, sending .derr or .merr or resets. Instead, it disappeared from the SAS topology. The consistent sign of this is the loginfo value "0x31170000", which appears in all logs associated with this issue that I've seen.

$ ./lsi_decode_loginfo.py 0x31170000
Value           31170000h
Type:           30000000h       SAS
Origin:         01000000h       PL
Code:           00170000h       PL_LOGINFO_CODE_IO_DEVICE_MISSING_DELAY_RETRY I-T Nexus Loss
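
For readers without the decoder script, the field split shown in its output can be reproduced with a short sketch. This is not the lsi_decode_loginfo.py source. The bit layout (4-bit type, 4-bit origin, 8-bit code, 16-bit sub-code) is an assumption inferred from the output above, and the name tables hold only the values that appear in this report.

```python
# Assumed layout, inferred from the decoder output quoted above:
# bits 31-28 type, bits 27-24 origin, bits 23-16 code, bits 15-0 sub-code.
TYPE_MASK = 0xF0000000
ORIGIN_MASK = 0x0F000000
CODE_MASK = 0x00FF0000
SUBCODE_MASK = 0x0000FFFF

# Only names that appear in this report; everything else prints "unknown".
TYPE_NAMES = {0x30000000: "SAS"}
ORIGIN_NAMES = {0x01000000: "PL"}
PL_CODE_NAMES = {0x00170000: "PL_LOGINFO_CODE_IO_DEVICE_MISSING_DELAY_RETRY"}


def split_loginfo(value):
    """Split a 32-bit IOCLogInfo value into its masked fields."""
    return {
        "type": value & TYPE_MASK,
        "origin": value & ORIGIN_MASK,
        "code": value & CODE_MASK,
        "subcode": value & SUBCODE_MASK,
    }


def describe_loginfo(value):
    """Return a multi-line, human-readable breakdown of a loginfo value."""
    fields = split_loginfo(value)
    type_name = TYPE_NAMES.get(fields["type"], "unknown")
    origin_name = ORIGIN_NAMES.get(fields["origin"], "unknown")
    code_name = "unknown"
    if origin_name == "PL":
        # Code names are only meaningful relative to their origin.
        code_name = PL_CODE_NAMES.get(fields["code"], "unknown")
    return "\n".join([
        f"Value:   {value:08x}h",
        f"Type:    {fields['type']:08x}h  {type_name}",
        f"Origin:  {fields['origin']:08x}h  {origin_name}",
        f"Code:    {fields['code']:08x}h  {code_name}",
        f"SubCode: {fields['subcode']:08x}h",
    ])


if __name__ == "__main__":
    # The two IOCLogInfo values seen in the ::msgbuf output of this bug.
    for raw in (0x31170000, 0x31110610):
        print(describe_loginfo(raw))
        print()
```

The same split applied to 0x31110610 from the ::msgbuf output gives the same type and origin, with code 00110000h and sub-code 0610h; the sketch deliberately leaves that code unnamed because this report does not identify it.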

The mpt_sas driver handles this by resetting the FW and, in the process, invalidating all target device handles. A re-scan takes place and, if a disk doesn't come back, subsequent SCSI commands destined for that target are failed back to the transport and retried at line 3452 (https://github.com/illumos/illumos-gate/compare/master...kojack:bug5122#diff-3116992b45e2a7d840c02d3864acc00fL3452).
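
The difference between the behaviour described above and the change proposed in the summary can be illustrated with a toy model. This is not driver code: the names `start_old`, `start_new`, `submit`, `Packet`, and `Tran` are hypothetical, and the retry cap exists only so the sketch terminates. Only the invalid handle value 0xffff and the 'device gone' marking come from this report.

```python
from dataclasses import dataclass
from enum import Enum

INVALID_DEVHDL = 0xFFFF  # invalid device handle value quoted in the summary


class Tran(Enum):
    """Hypothetical stand-ins for the start routine's result."""
    ACCEPT = "accept"
    ERROR = "error"


@dataclass
class Packet:
    """Hypothetical stand-in for a SCSI packet's completion state."""
    reason: str = "pending"
    completed: bool = False


def start_old(devhdl, pkt):
    """Existing behaviour as described: reject commands for a missing target."""
    if devhdl == INVALID_DEVHDL:
        return Tran.ERROR
    return Tran.ACCEPT


def start_new(devhdl, pkt):
    """Proposed behaviour: accept, mark 'device gone', and complete the packet."""
    if devhdl == INVALID_DEVHDL:
        pkt.reason = "device gone"
        pkt.completed = True
    return Tran.ACCEPT


def submit(start, devhdl, max_attempts=5):
    """Toy initiator: resubmit while the start routine reports an error.

    Returns the number of attempts made and the final packet state.
    """
    pkt = Packet()
    for attempt in range(1, max_attempts + 1):
        if start(devhdl, pkt) is Tran.ACCEPT:
            return attempt, pkt
    return max_attempts, pkt


if __name__ == "__main__":
    for name, start in (("old", start_old), ("new", start_new)):
        attempts, pkt = submit(start, INVALID_DEVHDL)
        print(name, attempts, pkt.reason, pkt.completed)
```

In the model, the old path exhausts its attempts with the packet never completed, which corresponds to the thread left sleeping in biowait(). The new path completes the packet on the first attempt with a 'device gone' reason, so the waiter has a result to act on.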
