Bug #12163
mpt_sas: Collateral damage caused by dead SATA disk
Status: Closed
% Done: 100%
Description
This is the typical symptom of the problem (when disks are under load):
Dec 8 03:17:16 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec 8 03:17:16 server passthrough command timeout
Dec 8 03:17:17 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec 8 03:17:17 server MPT Firmware version v20.0.7.0 (SAS2308)
Dec 8 03:17:17 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec 8 03:17:17 server mpt_sas0 MPI Version 0x200
Dec 8 03:17:24 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec 8 03:17:24 server mpt0: IOC Operational.
Dec 8 03:17:47 server scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec 8 03:17:47 server mptsas_handle_event_sync: event 0xf, IOCStatus=0x8000, IOCLogInfo=0x31111000
Dec 8 03:17:48 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec 8 03:17:48 server Log info 0x31111000 received for target 15 p0.
Dec 8 03:17:48 server scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Dec 8 03:17:48 server scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec 8 03:17:48 server mptsas request inquiry page 0x83 for target:f, lun:0 failed!
Dec 8 03:17:48 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000cca232e88ffa,0 (sd58):
Dec 8 03:17:48 server Command failed to complete...Device is gone
Dec 8 03:17:48 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000cca232e88176,0 (sd53):
Dec 8 03:17:48 server Command failed to complete...Device is gone
...
Dec 8 03:17:48 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f/disk@w5000cca232e88a99,0 (sd45):
Dec 8 03:17:48 server Command failed to complete...Device is gone
Dec 8 03:17:48 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000cca232e8b7ee,0 (sd70):
Dec 8 03:17:48 server Command failed to complete...Device is gone
In the example above, 23 disks disappeared within a second.
When the disks are not under load, this can happen instead:
Jan 6 18:17:42 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Jan 6 18:17:42 server MPT Firmware version v20.0.7.0 (SAS2308)
Jan 6 18:17:42 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Jan 6 18:17:42 server mpt_sas0 MPI Version 0x200
Jan 6 18:17:42 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Jan 6 18:17:42 server mpt0: IOC Operational.
Jan 6 18:17:54 server scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Jan 6 18:17:54 server mptsas_handle_event_sync: event 0xf, IOCStatus=0x8000, IOCLogInfo=0x31111000
Jan 6 18:17:54 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Jan 6 18:17:54 server Log info 0x31111000 received for target 26 p0.
Jan 6 18:17:54 server scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Jan 6 18:17:54 server scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Jan 6 18:17:54 server mptsas request inquiry page 0x83 for target:1a, lun:0 failed!
Jan 6 18:18:09 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000c50067504b74,0 (sd54):
Jan 6 18:18:09 server drive offline
Jan 6 18:18:09 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000c500794606e0,0 (sd48):
Jan 6 18:18:09 server drive offline
Jan 6 18:18:09 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000c50079414f2d,0 (sd66):
Jan 6 18:18:09 server drive offline
...
Jan 6 18:18:09 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000c500791896b1,0 (sd49):
Jan 6 18:18:09 server drive offline
Jan 6 18:18:13 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000c500794a5250,0 (sd46):
Jan 6 18:18:13 server drive offline
Jan 6 18:18:27 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000c5006757bc3c,0 (sd64):
Jan 6 18:18:27 server drive offline
Another possible manifestation of the problem: when a system with a bad SATA disk is booted, some disks are simply not accessible.
In all cases the affected disks are those attached to the mpt_sas controller after the dead SATA disk (try something like lsiutils -p 1 42 to see the disk order).
Related issues
Updated by Marcel Telka about 3 years ago
Root cause
Two prerequisites are needed to see this problem:
- Bad SATA disk attached to mpt_sas
- mpt_sas controller reset/initialization (either during normal operation, or during boot)
The bug is in the mptsas_update_hashtab() function, and the problem usually manifests through the following code path: mptsas_restart_ioc() -> mptsas_update_driver_data() -> mptsas_update_hashtab().
The problem is that when mptsas_get_target_device_info() fails with DEV_INFO_FAIL_GUID, all subsequent disks are simply skipped (line 14903):
14892         while (mpt->m_done_traverse_dev == 0) {
14893                 ptgt = NULL;
14894                 page_address =
14895                     (MPI2_SAS_DEVICE_PGAD_FORM_GET_NEXT_HANDLE &
14896                     MPI2_SAS_DEVICE_PGAD_FORM_MASK) |
14897                     (uint32_t)dev_handle;
14898                 rval = mptsas_get_target_device_info(mpt, page_address,
14899                     &dev_handle, &ptgt);
14900                 if ((rval == DEV_INFO_FAIL_PAGE0) ||
14901                     (rval == DEV_INFO_FAIL_ALLOC) ||
14902                     (rval == DEV_INFO_FAIL_GUID)) {
14903                         break;
14904                 }
14905
14906                 mpt->m_dev_handle = dev_handle;
14907         }
The fix
To fix this we should continue (instead of break) when rval == DEV_INFO_FAIL_GUID.
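The difference between the two policies can be illustrated with a small standalone model. This is only a sketch, not the committed change: struct fake_dev, fake_get_next() and count_discovered() are hypothetical names, the enum values are illustrative, and the model assumes the device handle advances even when the GUID lookup fails (otherwise continue would never make progress).

```c
#include <assert.h>
#include <stddef.h>

/* Result codes named after those in the report; the values are illustrative. */
enum dev_info_rval {
	DEV_INFO_SUCCESS,
	DEV_INFO_FAIL_PAGE0,
	DEV_INFO_FAIL_ALLOC,
	DEV_INFO_FAIL_GUID
};

/* Hypothetical stand-in for one device attached to the controller. */
struct fake_dev {
	int guid_ok;	/* 0 models a dead SATA disk whose GUID cannot be read */
};

/*
 * Hypothetical stand-in for mptsas_get_target_device_info(): fetch the
 * device at *idx and advance *idx. Assumption: the index (device handle)
 * advances even on a GUID failure.
 */
static enum dev_info_rval
fake_get_next(const struct fake_dev *devs, size_t ndevs, size_t *idx)
{
	const struct fake_dev *d;

	if (*idx >= ndevs)
		return (DEV_INFO_FAIL_PAGE0);	/* no more devices */
	d = &devs[*idx];
	(*idx)++;
	return (d->guid_ok ? DEV_INFO_SUCCESS : DEV_INFO_FAIL_GUID);
}

/*
 * Count the devices the traversal discovers.
 * skip_bad_guid == 0 models the old behaviour (break on GUID failure),
 * skip_bad_guid != 0 models the proposed behaviour (continue).
 */
size_t
count_discovered(const struct fake_dev *devs, size_t ndevs, int skip_bad_guid)
{
	size_t idx = 0, found = 0;

	assert(devs != NULL || ndevs == 0);
	for (;;) {
		enum dev_info_rval rval = fake_get_next(devs, ndevs, &idx);

		if (rval == DEV_INFO_FAIL_PAGE0 || rval == DEV_INFO_FAIL_ALLOC)
			break;		/* end of list or fatal error */
		if (rval == DEV_INFO_FAIL_GUID) {
			if (skip_bad_guid)
				continue;	/* skip only the bad disk */
			break;		/* old: drop every later disk too */
		}
		found++;
	}
	return (found);
}
```

With five modelled devices where the third is dead, count_discovered() returns 2 under the old policy and 4 under the new one, which mirrors the observation that only disks enumerated after the dead one are lost.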
There are similar code snippets in mptsas_phy_to_tgt() and mptsas_wwid_to_ptgt() too, so they will be changed in the same way.
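For the lookup-style functions the effect is that a target enumerated after the dead disk cannot be found at all. The following is a rough model of that situation, not the driver code: struct fake_target, find_by_wwn() and the WWN values are all hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a target as seen during traversal. */
struct fake_target {
	uint64_t wwn;	/* made-up identifier, not a real disk WWN */
	int guid_ok;	/* 0 models the dead SATA disk */
};

/*
 * Search the targets in enumeration order for the given WWN.
 * Returns the index of the match, or -1 if it was not reached.
 * skip_bad_guid == 0 models break on GUID failure (old behaviour),
 * skip_bad_guid != 0 models continue (proposed behaviour).
 */
int
find_by_wwn(const struct fake_target *tgts, size_t n, uint64_t wwn,
    int skip_bad_guid)
{
	size_t i;

	assert(tgts != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!tgts[i].guid_ok) {
			if (skip_bad_guid)
				continue;	/* ignore only the bad disk */
			break;		/* old: give up on the whole search */
		}
		if (tgts[i].wwn == wwn)
			return ((int)i);
	}
	return (-1);
}
```

In this model a healthy target placed before the dead disk is found under both policies, while one placed after it is found only when the bad disk is skipped.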
Updated by Marcel Telka about 3 years ago
Testing Done
I tried to force mpt_sas resets with this fix, and the massive disk outage is no longer reproducible.
Also, with this fix the machine is able to boot with all disks available (except the dead one, of course).
Updated by Marcel Telka about 3 years ago
- Status changed from In Progress to Pending RTI
Updated by Electric Monk about 3 years ago
- Status changed from Pending RTI to Closed
- % Done changed from 0 to 100
git commit d59679dc4ee5ea26c61e7762a3f7a6f74a1f4c2c
commit d59679dc4ee5ea26c61e7762a3f7a6f74a1f4c2c
Author: Marcel Telka <marcel@telka.sk>
Date: 2020-01-08T07:34:36.000Z

    12163 mpt_sas: Collateral damage caused by dead SATA disk
    Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Stefan Behrens <sbehrens@giantdisaster.de>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Approved by: Garrett D'Amore <garrett@damore.org>
Updated by Marcel Telka about 3 years ago
- Related to Bug #8745: Hipster is marking drives as bad when they are ok added