Bug #12163

mpt_sas: Collateral damage caused by dead SATA disk

Added by Marcel Telka 11 months ago. Updated 11 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: driver - device drivers
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

This is the typical symptom of the problem (when disks are under load):

Dec  8 03:17:16 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec  8 03:17:16 server  passthrough command timeout
Dec  8 03:17:17 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec  8 03:17:17 server  MPT Firmware version v20.0.7.0 (SAS2308)
Dec  8 03:17:17 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec  8 03:17:17 server  mpt_sas0 MPI Version 0x200
Dec  8 03:17:24 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec  8 03:17:24 server  mpt0: IOC Operational.
Dec  8 03:17:47 server scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec  8 03:17:47 server  mptsas_handle_event_sync: event 0xf, IOCStatus=0x8000, IOCLogInfo=0x31111000
Dec  8 03:17:48 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec  8 03:17:48 server  Log info 0x31111000 received for target 15 p0.
Dec  8 03:17:48 server  scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Dec  8 03:17:48 server scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Dec  8 03:17:48 server  mptsas request inquiry page 0x83 for target:f, lun:0 failed!
Dec  8 03:17:48 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000cca232e88ffa,0 (sd58):
Dec  8 03:17:48 server  Command failed to complete...Device is gone
Dec  8 03:17:48 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000cca232e88176,0 (sd53):
Dec  8 03:17:48 server  Command failed to complete...Device is gone
...
Dec  8 03:17:48 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f/disk@w5000cca232e88a99,0 (sd45):
Dec  8 03:17:48 server  Command failed to complete...Device is gone
Dec  8 03:17:48 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000cca232e8b7ee,0 (sd70):
Dec  8 03:17:48 server  Command failed to complete...Device is gone

In the example above, 23 disks disappeared within one second.

When disks are not under load this could happen:

Jan  6 18:17:42 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Jan  6 18:17:42 server         MPT Firmware version v20.0.7.0 (SAS2308)
Jan  6 18:17:42 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Jan  6 18:17:42 server         mpt_sas0 MPI Version 0x200
Jan  6 18:17:42 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Jan  6 18:17:42 server         mpt0: IOC Operational.
Jan  6 18:17:54 server scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Jan  6 18:17:54 server         mptsas_handle_event_sync: event 0xf, IOCStatus=0x8000, IOCLogInfo=0x31111000
Jan  6 18:17:54 server scsi: [ID 365881 kern.info] /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Jan  6 18:17:54 server         Log info 0x31111000 received for target 26 p0.
Jan  6 18:17:54 server         scsi_status=0x0, ioc_status=0x804b, scsi_state=0xc
Jan  6 18:17:54 server scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0 (mpt_sas0):
Jan  6 18:17:54 server         mptsas request inquiry page 0x83 for target:1a, lun:0 failed!
Jan  6 18:18:09 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000c50067504b74,0 (sd54):
Jan  6 18:18:09 server         drive offline
Jan  6 18:18:09 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000c500794606e0,0 (sd48):
Jan  6 18:18:09 server         drive offline
Jan  6 18:18:09 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000c50079414f2d,0 (sd66):
Jan  6 18:18:09 server         drive offline
...
Jan  6 18:18:09 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000c500791896b1,0 (sd49):
Jan  6 18:18:09 server         drive offline
Jan  6 18:18:13 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000c500794a5250,0 (sd46):
Jan  6 18:18:13 server         drive offline
Jan  6 18:18:27 server scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,e0a@3,2/pci1000,3020@0/iport@f0/disk@w5000c5006757bc3c,0 (sd64):
Jan  6 18:18:27 server         drive offline

Another possible manifestation: when a system with a bad SATA disk is booted, some disks are simply not accessible.

In all cases, the affected disks are those attached to the mpt_sas controller after the dead SATA disk (try something like lsiutils -p 1 42 to see the disk order).


Related issues

Related to illumos gate - Bug #8745: Hipster is marking drives as bad when they are ok (New, Marcel Telka, 2017-10-29)

#1

Updated by Marcel Telka 11 months ago

Root cause

Two prerequisites must be met to see this problem:
  1. Bad SATA disk attached to mpt_sas
  2. mpt_sas controller reset/initialization (either during normal operation, or during boot)

The bug is in the mptsas_update_hashtab() function and the problem is usually manifested by the following code path: mptsas_restart_ioc() -> mptsas_update_driver_data() -> mptsas_update_hashtab().

The problem is that when mptsas_get_target_device_info() fails with DEV_INFO_FAIL_GUID, the loop breaks (line 14903) and all subsequent disks are skipped:

14892    while (mpt->m_done_traverse_dev == 0) {
14893        ptgt = NULL;
14894        page_address =
14895            (MPI2_SAS_DEVICE_PGAD_FORM_GET_NEXT_HANDLE &
14896            MPI2_SAS_DEVICE_PGAD_FORM_MASK) |
14897            (uint32_t)dev_handle;
14898        rval = mptsas_get_target_device_info(mpt, page_address,
14899            &dev_handle, &ptgt);
14900        if ((rval == DEV_INFO_FAIL_PAGE0) ||
14901            (rval == DEV_INFO_FAIL_ALLOC) ||
14902            (rval == DEV_INFO_FAIL_GUID)) {
14903            break;
14904        }
14905
14906        mpt->m_dev_handle = dev_handle;
14907    }

The fix

To fix this, the loop should continue (instead of break) when rval == DEV_INFO_FAIL_GUID.

mptsas_phy_to_tgt() and mptsas_wwid_to_ptgt() contain similar code, so they will be changed in the same way.
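
The difference between break and continue can be illustrated with a small standalone model. This is only a sketch, not the driver code or the committed patch: count_discovered(), the skip_on_guid_failure flag, and the array of per-device results are hypothetical, and the enum merely reuses the status names from the snippet above (DEV_INFO_SUCCESS is assumed here as the success value).

```c
#include <assert.h>
#include <stddef.h>

/* Status names borrowed from the snippet above; the values are arbitrary. */
enum dev_info_rval {
	DEV_INFO_SUCCESS,	/* assumed name for the success case */
	DEV_INFO_FAIL_PAGE0,
	DEV_INFO_FAIL_ALLOC,
	DEV_INFO_FAIL_GUID
};

/*
 * Hypothetical model of the traversal loop. devs[i] is the result that
 * looking up the i-th attached device would return. The function returns
 * how many devices end up registered.
 *
 * skip_on_guid_failure == 0 models the old behaviour (break),
 * skip_on_guid_failure != 0 models the proposed behaviour (continue).
 */
size_t
count_discovered(const enum dev_info_rval *devs, size_t ndevs,
    int skip_on_guid_failure)
{
	size_t found = 0;

	assert(devs != NULL || ndevs == 0);

	for (size_t i = 0; i < ndevs; i++) {
		enum dev_info_rval rval = devs[i];

		/* These failures still end the traversal in both models. */
		if (rval == DEV_INFO_FAIL_PAGE0 ||
		    rval == DEV_INFO_FAIL_ALLOC)
			break;

		if (rval == DEV_INFO_FAIL_GUID) {
			if (skip_on_guid_failure)
				continue;	/* skip only the dead disk */
			break;			/* drop every later disk too */
		}

		found++;
	}

	return (found);
}
```

In this model, for six devices where only the third one fails with DEV_INFO_FAIL_GUID, the break variant registers two devices, while the continue variant registers five, which mirrors the "all disks after the dead one are gone" symptom described above.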

#3

Updated by Marcel Telka 11 months ago

Testing Done

I forced mpt_sas resets with this fix applied, and the mass disk outage is no longer reproducible.

Also, with this fix the machine boots with all disks available (except the dead one, of course).

#4

Updated by Marcel Telka 11 months ago

  • Status changed from In Progress to Pending RTI

#5

Updated by Electric Monk 11 months ago

  • Status changed from Pending RTI to Closed
  • % Done changed from 0 to 100

git commit d59679dc4ee5ea26c61e7762a3f7a6f74a1f4c2c

commit  d59679dc4ee5ea26c61e7762a3f7a6f74a1f4c2c
Author: Marcel Telka <marcel@telka.sk>
Date:   2020-01-08T07:34:36.000Z

    12163 mpt_sas: Collateral damage caused by dead SATA disk
    Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Stefan Behrens <sbehrens@giantdisaster.de>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Approved by: Garrett D'Amore <garrett@damore.org>

#6

Updated by Marcel Telka 11 months ago

  • Related to Bug #8745: Hipster is marking drives as bad when they are ok added
