Bug #11767

mpt_sas hangs after config header request timeout

Added by Hans Rosenfeld 6 months ago. Updated 6 months ago.

Status: Closed
Priority: Normal
Category: driver - device drivers
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:

Description

On a customer system, we ran into these warnings at boot in roughly one out of ten reboots:

WARNING: /pci@14,0/pci8086,2030@0/pci1028,1f45@0 (mpt_sas0):
       config header request timeout
WARNING: /pci@14,0/pci8086,2030@0/pci1028,1f45@0 (mpt_sas0):
       Failed to get sas phy page 0 for each phy
WARNING: /pci@14,0/pci8086,2030@0/pci1028,1f45@0 (mpt_sas0):
       mptsas phy update failed

The system would then take hours to complete its boot and would not find any disks/SSDs attached to the affected SAS HBA. As the first warning indicates, this is caused by a command timeout in mptsas_access_config_page(); the remaining warnings are direct consequences of that failure. The timeout for config header requests (and for the config page requests later in the function) is hardcoded to 60s.

There is an oversight in mptsas_access_config_page(): like other places in mpt_sas, it already contains code to reset the I/O controller (IOC) after a command timeout, but that code is never executed, leaving the controller in a hung state. This is easy to fix. It is still unclear what causes these timeouts in the first place. With the IOC reset done properly, the timeouts may still happen occasionally at reboot, but they cause no more harm than a single 60s delay at boot, and all disks/SSDs are visible and usable.
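The failure mode described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the actual driver code or the actual fix: fake_ioc_t, fake_send_config_request(), fake_reset_ioc() and fake_access_config_page() are hypothetical names invented for illustration, and the reach_reset flag merely models whether the existing reset code is reachable after a timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the HBA state; not the real mpt_sas soft state. */
typedef struct fake_ioc {
	bool hung;		/* controller stopped answering after a timeout */
	int resets;		/* number of IOC resets performed so far */
	int timeouts_left;	/* how many upcoming requests will time out */
} fake_ioc_t;

/*
 * Pretend to send a config request and wait for the reply.
 * Returns false if the request timed out (60s in the real driver).
 */
static bool
fake_send_config_request(fake_ioc_t *ioc)
{
	if (ioc->hung)
		return (false);		/* a hung controller never answers */
	if (ioc->timeouts_left > 0) {
		ioc->timeouts_left--;
		ioc->hung = true;	/* the timeout leaves it hung */
		return (false);
	}
	return (true);
}

/* Pretend to reset the I/O controller, bringing it back to life. */
static void
fake_reset_ioc(fake_ioc_t *ioc)
{
	ioc->hung = false;
	ioc->resets++;
}

/*
 * Model of a config page access: a header request followed by a page
 * request. With reach_reset == false the reset code exists but is never
 * executed (the bug); with reach_reset == true it runs after a timeout.
 * Returns 0 on success and -1 on failure.
 */
static int
fake_access_config_page(fake_ioc_t *ioc, bool reach_reset)
{
	bool reset_needed = false;
	int rval = 0;

	assert(ioc != NULL);

	if (!fake_send_config_request(ioc)) {	/* config header request */
		rval = -1;
		reset_needed = true;
		goto done;
	}
	if (!fake_send_config_request(ioc)) {	/* config page request */
		rval = -1;
		reset_needed = true;
	}

done:
	if (reset_needed && reach_reset)
		fake_reset_ioc(ioc);
	return (rval);
}
```

In this model, a single timeout with reach_reset set to false leaves the controller hung, so every later call fails as well. With reach_reset set to true, the call that hits the timeout still fails, but the controller is reset and the next call succeeds, which matches the observed behavior of one delayed request followed by a normal boot.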

History

#1

Updated by Hans Rosenfeld 6 months ago

Testing: I manually rebooted the customer's system several times until the timeouts happened at boot. Previously, a timeout meant the system would take ages to boot and then see none of its SSDs. With these changes, the system experienced a short delay (60s) due to the timeout, but once it finished booting, all SSDs were visible and usable and the system behaved as expected.

#2

Updated by Electric Monk 6 months ago

  • Status changed from New to Closed
  • % Done changed from 20 to 100

git commit 882fdc88a3017c111dceac12ee7c111b7c4f1232

commit  882fdc88a3017c111dceac12ee7c111b7c4f1232
Author: Hans Rosenfeld <hans.rosenfeld@joyent.com>
Date:   2019-10-04T22:00:10.000Z

    11767 mpt_sas hangs after config header request timeout
    Reviewed by: Robert Mustacchi <rm+illumos@fingolfin.org>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Andy Stormont <astormont@racktopsystems.com>
    Reviewed by: Andy Giles <illumos@ang.homedns.org>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
