mpt_sas hangs forever with some kinds of malfunctioning drive
Precisely as it says.
> ::mptsas
        mptsas_t      inst  ncmds  suspend  power
================================================================================
ffffff23b62fd000         0      0        0  ON=D0
ffffff23b62f6000         1      0        0  ON=D0
> ::stacks -m sd
THREAD           STATE    SOBJ                COUNT
ffffff2335289400 SLEEP    CV                      1
                 swtch+0x18a
                 cv_wait_sig+0x1b5
                 sd_check_media+0x2c0
                 sdioctl+0xe0d
                 cdev_ioctl+0x39
                 spec_ioctl+0x60
                 fop_ioctl+0x55
                 ioctl+0x9b
                 _sys_sysenter_post_swapgs+0x237

ffffff00f4394c40 SLEEP    SEMA                    1
                 swtch+0x18a
                 sema_p+0x2a9
                 biowait+0xd9
                 default_physio+0x345
                 physio+0x25
                 scsi_uscsi_handle_cmd+0x29d
                 sd_ssc_send+0x195
                 sd_send_scsi_RDWR+0x353
                 sd_tg_rdwr+0x423
                 cmlb_use_efi+0xd0
                 cmlb_validate_geometry+0x121
                 cmlb_validate+0x83
                 sd_unit_attach+0x9ff
                 sdattach+0x19
                 devi_attach+0x9e
                 attach_node+0x14f
                 i_ndi_config_node+0xc0
                 i_ddi_attachchild+0x88
                 devi_attach_node+0x88
                 ndi_devi_online+0xb0
                 i_mdi_pi_state_change+0x4cf
                 mdi_pi_online+0x43
                 mptsas_create_virt_lun+0x963
                 mptsas_create_lun+0x1b0
                 mptsas_probe_lun+0xdd
                 mptsas_config_target+0x38
                 mptsas_handle_topo_change+0x356
                 mptsas_handle_dr+0x12c
                 taskq_thread+0x318
                 thread_start+8
> ::msgbuf
[snip]
WARNING: /pci@0,0/pci8086,3c08@3/pci1000,30d0@0 (mpt_sas0):
        mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31110610
WARNING: /pci@0,0/pci8086,3c08@3/pci1000,30d0@0 (mpt_sas0):
        mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31110610
WARNING: /pci@0,0/pci8086,3c08@3/pci1000,30d0@0 (mpt_sas0):
        mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31170000
WARNING: /pci@0,0/pci8086,3c08@3/pci1000,30d0@0 (mpt_sas0):
        mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31170000
/pci@0,0/pci8086,3c08@3/pci1000,30d0@0 (mpt_sas0):
        Log info 0x31110610 received for target 25 w5000c50038ae5ae4.
        scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
We can probably work around this by pulling the relevant malfunctioning disk, but I thought I'd log it, as I didn't think this was still a thing that could happen in modern mpt_sas land.
illumos-gate, current as of 20140821.
Updated by Rich Ercolani almost 9 years ago
By the by, pulling the drive or reseating such a drive will not fix this problem once it occurs - nor will cycling the enclosure. The OS will not recover once it reaches this state, short of a power cycle - and if you cycle the OS with the misbehaving hardware still attached, it will have the same problem on boot, possibly leading to other failures.
Updated by Rich Ercolani about 7 years ago
- Assignee set to Dan Fields
Assigning by request.
Updated by Dan Fields about 7 years ago
- % Done changed from 0 to 50
diffs - https://github.com/illumos/illumos-gate/compare/master...kojack:bug5122
Summary: When a SCSI command arrives for a target whose device handle is invalid (0xffff), the code shouldn't
return an error to the SCSI transport. Instead it should accept the command, mark it as 'device gone', and let the
initiator handle the result.
This biowait() showed up at a site with pools containing SSD drives. When a drive failed, it didn't slowly
begin to fail, sending .derr or .merr ereports or resets; instead it disappeared from the SAS topology entirely. The consistent
sign of this is the loginfo value "0x31170000", which appears in all logs associated with this issue that I've seen.
$ ./lsi_decode_loginfo.py 0x31170000
Value   31170000h
Type:   30000000h SAS
Origin: 01000000h PL
Code:   00170000h PL_LOGINFO_CODE_IO_DEVICE_MISSING_DELAY_RETRY I-T Nexus Loss
The mpt_sas driver handles this by resetting the FW, which in the process invalidates all target device handles. A re-scan
then takes place and, if a disk doesn't come back, subsequent SCSI commands destined for that target are failed back to the transport and retried at line 3452 (https://github.com/illumos/illumos-gate/compare/master...kojack:bug5122#diff-3116992b45e2a7d840c02d3864acc00fL3452)