mpt_sas hangs after config header request timeout
On a customer system, we ran into these warnings at boot, roughly once in ten reboots:
WARNING: /pci@14,0/pci8086,2030@0/pci1028,1f45@0 (mpt_sas0): config header request timeout
WARNING: /pci@14,0/pci8086,2030@0/pci1028,1f45@0 (mpt_sas0): Failed to get sas phy page 0 for each phy
WARNING: /pci@14,0/pci8086,2030@0/pci1028,1f45@0 (mpt_sas0): mptsas phy update failed
The system would then take hours to complete its boot and would not find any disks/SSDs attached to the affected SAS HBA. This is caused by a command timeout in mptsas_access_config_page(), as indicated by the first warning; the remaining warnings are direct consequences of that failure. The timeout for config header requests (and for the config page requests later in the function) is hardcoded to 60s.
It turns out there's an oversight in mptsas_access_config_page(): like other places in mpt_sas, it already contains code to reset the I/O controller (IOC) after a command timeout, but that code is never executed, which leaves the controller in a hung state. This is easy to fix.

It is still unclear what causes these timeouts in the first place. With the IOC reset done properly, the timeouts may still happen every now and then at boot, but they cause no more harm than a single 60s delay. All disks/SSDs are visible and usable.
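To illustrate the kind of oversight described above, here is a toy model in C. It is not the driver's code: toy_ioc_t, toy_send_request(), toy_reset() and both toy_access_config_page_*() functions are hypothetical names, and the precise reason the reset path was unreachable in mpt_sas is not shown in this issue. The sketch assumes one plausible shape of the bug, where the cleanup path contains a reset that is guarded by a flag the timeout branch never sets.

```c
#include <stdbool.h>

/* Hypothetical stand-in for controller state; NOT the real mptsas_t. */
typedef struct toy_ioc {
	bool fail_next;	/* inject a timeout on the next request */
	bool hung;	/* controller no longer answers requests */
	int resets;	/* number of resets performed */
} toy_ioc_t;

/* Returns true on success, false if the request "timed out". */
static bool
toy_send_request(toy_ioc_t *ioc)
{
	if (ioc->hung)
		return (false);
	if (ioc->fail_next) {
		ioc->fail_next = false;
		ioc->hung = true;	/* stays hung until it is reset */
		return (false);
	}
	return (true);
}

static void
toy_reset(toy_ioc_t *ioc)
{
	ioc->hung = false;
	ioc->resets++;
}

/* Assumed bug shape: the reset exists in cleanup but is never reached. */
static int
toy_access_config_page_buggy(toy_ioc_t *ioc)
{
	bool need_reset = false;
	int rval = 0;

	if (!toy_send_request(ioc)) {
		rval = -1;	/* timeout: need_reset is never set here */
		goto done;
	}
done:
	if (need_reset)		/* dead code in this version */
		toy_reset(ioc);
	return (rval);
}

/* Fixed shape: the timeout branch requests the reset before cleanup. */
static int
toy_access_config_page_fixed(toy_ioc_t *ioc)
{
	bool need_reset = false;
	int rval = 0;

	if (!toy_send_request(ioc)) {
		rval = -1;
		need_reset = true;
		goto done;
	}
done:
	if (need_reset)
		toy_reset(ioc);
	return (rval);
}
```

In this model, a timed-out request through the buggy variant leaves the toy controller hung, so every later request fails as well, which mirrors the follow-on warnings and the missing disks. With the fixed variant the first call still fails, but the controller is reset and later requests succeed, matching the observed single delay followed by a normal boot.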
Updated by Hans Rosenfeld 6 months ago
Testing: I manually rebooted the customer's system a few times until the timeouts happened at boot. Previously, a timeout meant the system would take ages to boot and then see none of its SSDs. With these changes, the system experienced a short delay (60s) due to the timeout, but once it finished booting all SSDs were visible and usable, and the system behaved as expected.
Updated by Electric Monk 6 months ago
- Status changed from New to Closed
- % Done changed from 20 to 100
commit 882fdc88a3017c111dceac12ee7c111b7c4f1232
Author: Hans Rosenfeld <firstname.lastname@example.org>
Date: 2019-10-04T22:00:10.000Z

11767 mpt_sas hangs after config header request timeout
Reviewed by: Robert Mustacchi <email@example.com>
Reviewed by: Jerry Jelinek <firstname.lastname@example.org>
Reviewed by: Andy Stormont <email@example.com>
Reviewed by: Andy Giles <firstname.lastname@example.org>
Reviewed by: Toomas Soome <email@example.com>
Approved by: Dan McDonald <firstname.lastname@example.org>