Feature #11369
PCIe errors on passthru devices shouldn't cause a panic
Description
My test system reproducibly experiences PCIe errors on several devices (two Intel NVMe devices and one LSI SAS controller) that are passed through to a bhyve VM running SmartOS. These errors are reported to the host through pcieb_intr_handler(), and error analysis in pf_scan_fabric() / pf_analyse_error() causes a panic in roughly 9 out of 10 cases.
pf_scan_fabric() walks the PCIe fabric to gather error information from the error registers of each device. If a device reported an error, its error registers are read, the information is put in a pf_data_t, and the registers are cleared. That pf_data_t is put on a queue for later analysis by pf_analyse_error().
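The two-phase flow described above can be modelled in user space. This is a minimal sketch, not the illumos implementation: every name in it (sketch_dev_t, sketch_pf_data_t, sketch_scan(), sketch_analyse()) is hypothetical, each device is reduced to a single error register, and the fabric is a flat array rather than a tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_MAX_DEVS	8

/* Hypothetical stand-in for a device with one error status register. */
typedef struct sketch_dev {
	uint32_t sd_err_reg;	/* non-zero: the device reported an error */
} sketch_dev_t;

/* Hypothetical stand-in for pf_data_t: a snapshot of one device's errors. */
typedef struct sketch_pf_data {
	size_t spd_dev_index;
	uint32_t spd_err_snapshot;
} sketch_pf_data_t;

/* Queue of snapshots handed from the scan phase to the analysis phase. */
typedef struct sketch_queue {
	sketch_pf_data_t sq_items[SKETCH_MAX_DEVS];
	size_t sq_count;
} sketch_queue_t;

/*
 * Phase 1: walk all devices. For each device that reported an error,
 * snapshot its register, queue the snapshot, and clear the register.
 * Returns the number of queued snapshots.
 */
static size_t
sketch_scan(sketch_dev_t *devs, size_t ndevs, sketch_queue_t *q)
{
	assert(ndevs <= SKETCH_MAX_DEVS);

	q->sq_count = 0;
	for (size_t i = 0; i < ndevs; i++) {
		if (devs[i].sd_err_reg == 0)
			continue;
		q->sq_items[q->sq_count].spd_dev_index = i;
		q->sq_items[q->sq_count].spd_err_snapshot = devs[i].sd_err_reg;
		q->sq_count++;
		devs[i].sd_err_reg = 0;
	}
	return (q->sq_count);
}

/*
 * Phase 2: analyse the queued snapshots. This sketch only ORs the
 * snapshots together; the real analysis is far more involved.
 */
static uint32_t
sketch_analyse(const sketch_queue_t *q)
{
	uint32_t combined = 0;

	for (size_t i = 0; i < q->sq_count; i++)
		combined |= q->sq_items[i].spd_err_snapshot;
	return (combined);
}
```

Because the registers are cleared during the scan, the analysis phase can only work from the queued snapshots. That is why any per-device decision, such as whether the device is under passthru, has to be captured when the snapshot is created.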
Even in the case of PCIe passthru I assume we want to be informed by FMA about faults in the hardware, in case it is a real hardware fault. So we still want to get the full analysis of the error, but we want to detect when a device is under passthru control and prevent error analysis from requesting a panic in response to the error.
When creating the pf_data_t for a device, we'll check its pcie_bus_t for a flag indicating passthru. This flag will be set by the passthru driver (only ppt at this time) whenever a device is under VM control, and will be cleared when the driver controls the device. If the flag is set, we'll set up a new pe_severity_mask member of pf_data_t to mask PF_ERR_PANIC. Later, in pf_analyse_error(), we'll use this mask on pe_severity_flags to prevent panics from this device instance only. Should other devices that are not currently used for PCIe passthru experience a fatal error, a panic may still happen.
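The masking idea can be sketched as follows. This is an illustration under stated assumptions, not the committed code: the types and functions (sk_bus_t, sk_pf_data_t, sk_pf_data_init(), sk_should_panic()) are hypothetical, the SK_ERR_* bit values are invented for the sketch and are not taken from the illumos headers, and the way severities are combined across devices is simplified to a bitwise OR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative severity bits; values are made up for this sketch. */
#define	SK_ERR_CE	(1u << 0)	/* correctable error */
#define	SK_ERR_PANIC	(1u << 1)	/* analysis requests a panic */

/* Hypothetical stand-in for the passthru flag kept in pcie_bus_t. */
typedef struct sk_bus {
	bool sb_passthru;	/* set while the device is under VM control */
} sk_bus_t;

/* Hypothetical stand-in for the relevant pf_data_t members. */
typedef struct sk_pf_data {
	uint32_t sp_severity_flags;	/* filled in by error analysis */
	uint32_t sp_severity_mask;	/* severities to suppress */
} sk_pf_data_t;

/* At creation time, mask the panic severity for passthru devices only. */
static void
sk_pf_data_init(sk_pf_data_t *pfd, const sk_bus_t *bus)
{
	assert(pfd != NULL && bus != NULL);

	pfd->sp_severity_flags = 0;
	pfd->sp_severity_mask = bus->sb_passthru ? SK_ERR_PANIC : 0;
}

/*
 * Combine the severities of all devices, applying each device's own
 * mask first, so that a masked device cannot request a panic while an
 * unmasked one still can.
 */
static bool
sk_should_panic(const sk_pf_data_t *pfds, size_t n)
{
	uint32_t severity = 0;

	for (size_t i = 0; i < n; i++) {
		severity |= pfds[i].sp_severity_flags &
		    ~pfds[i].sp_severity_mask;
	}
	return ((severity & SK_ERR_PANIC) != 0);
}
```

Applying the mask per device rather than globally keeps the other severity bits intact, so the error can still be reported, and leaves the panic path unchanged for devices that the host controls.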
Testing:
I have two NVMe devices that will reliably cause PCIe errors when used in passthru in a SmartOS VM. Previously those errors would cause the host to panic, but with these changes in place the panic does not happen. FMA still records the faults and tries to retire the devices, which will only take effect at the next reboot. The VM sometimes does hang when accessing those devices (the nvme driver gets stuck waiting for I/O completions), but that is a problem for a different day.
Updated by Electric Monk over 4 years ago
- Status changed from New to Closed
git commit 79bed773cb9f85f14d6c40e097abafdf4cc1e687
commit 79bed773cb9f85f14d6c40e097abafdf4cc1e687
Author: Hans Rosenfeld <hans.rosenfeld@joyent.com>
Date: 2019-08-19T17:40:33.000Z

    11369 PCIe errors on passthru devices shouldn't cause a panic
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Andy Fiddaman <andy@omniosce.org>
    Approved by: Richard Lowe <richlowe@richlowe.net>