Feature #11369

closed

PCIe errors on passthru devices shouldn't cause a panic

Added by Robert Mustacchi over 4 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Category:
driver - device drivers
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

My test system reproducibly experiences PCIe errors on several devices (two Intel NVMe, one LSI SAS controller) that are passed through to a bhyve VM running SmartOS. These errors are reported to the host through pcieb_intr_handler(), and error analysis in pf_scan_fabric() / pf_analyse_error() will cause a panic in roughly 9 out of 10 cases.

The way pf_scan_fabric() works is that it walks the PCIe fabric to gather error information from the error register of each device. If a device reported an error its error registers will be read, the information will be put in a pf_data_t, and the registers will be cleared. That pf_data_t is put on a queue for later analysis. This analysis is later done by pf_analyse_error().
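The scan-then-analyse flow described above can be sketched roughly as follows. This is a simplified user-space model, not the illumos code: the types and names (sketch_dev_t, sketch_rec_t, sketch_queue_t, sketch_scan()) are hypothetical stand-ins for the fabric devices, pf_data_t, the analysis queue, and pf_scan_fabric(), and the fabric is flattened into an array instead of a device tree.

```c
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_MAX_DEVS	8

/* Hypothetical stand-in for a device's error status register. */
typedef struct sketch_dev {
	uint32_t sd_err_status;		/* nonzero means an error was logged */
} sketch_dev_t;

/* Hypothetical stand-in for pf_data_t: a snapshot of one device's error. */
typedef struct sketch_rec {
	size_t sr_dev_index;		/* which device reported the error */
	uint32_t sr_err_status;		/* register value at scan time */
} sketch_rec_t;

/* Hypothetical queue of records awaiting analysis. */
typedef struct sketch_queue {
	sketch_rec_t sq_recs[SKETCH_MAX_DEVS];
	size_t sq_count;
} sketch_queue_t;

/*
 * Walk all devices. For each one that logged an error, snapshot the
 * register into a record, clear the register, and queue the record
 * for later analysis. Returns the number of records queued.
 */
static size_t
sketch_scan(sketch_dev_t *devs, size_t ndevs, sketch_queue_t *q)
{
	q->sq_count = 0;
	for (size_t i = 0; i < ndevs && q->sq_count < SKETCH_MAX_DEVS; i++) {
		if (devs[i].sd_err_status == 0)
			continue;
		sketch_rec_t *rec = &q->sq_recs[q->sq_count++];
		rec->sr_dev_index = i;
		rec->sr_err_status = devs[i].sd_err_status;
		devs[i].sd_err_status = 0;	/* clear after reading */
	}
	return (q->sq_count);
}
```

Because the registers are cleared during the scan, the queued records are the only remaining copy of the error state by the time analysis runs.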

Even in the case of PCIe passthru I assume we want to be informed by FMA about faults in the hardware, in case it is a real hardware fault. So we still want to get the full analysis of the error, but we want to detect when a device is under passthru control and prevent error analysis from requesting a panic in response to the error.

When creating the pf_data_t for a device we'll check its pcie_bus_t for a flag indicating passthru, which will be set by the passthru driver (only ppt at this time) whenever a device is under VM control, and will be cleared when the driver controls the device. If the flag is set, we'll set up a new pe_severity_mask member of pf_data_t to mask PF_ERR_PANIC. Later in pf_analyse_error() we'll use this mask on pe_severity_flags to prevent panics from this device instance only. Should other devices that are not currently used for PCIe passthru experience a fatal error, a panic may still happen.
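A minimal sketch of the masking idea, under stated assumptions: PF_ERR_PANIC, pe_severity_flags and pe_severity_mask are named in the description, but the flag values, the sketch_pf_data_t type, the helper functions, and the choice that the mask holds the bits to suppress are all hypothetical and not taken from the actual commit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values, not the real illumos definitions. */
#define	SKETCH_ERR_NONFATAL	0x01	/* hypothetical non-fatal severity */
#define	PF_ERR_PANIC		0x02	/* analysis requests a panic */

/* Hypothetical, cut-down stand-in for pf_data_t. */
typedef struct sketch_pf_data {
	uint32_t pe_severity_flags;	/* severities found by analysis */
	uint32_t pe_severity_mask;	/* severities to suppress (assumed) */
} sketch_pf_data_t;

/* At creation time, mask PF_ERR_PANIC only if the device is passed through. */
static void
sketch_pf_data_init(sketch_pf_data_t *pfd, bool passthru)
{
	pfd->pe_severity_flags = 0;
	pfd->pe_severity_mask = passthru ? PF_ERR_PANIC : 0;
}

/* Severity of one device after its own mask has been applied. */
static uint32_t
sketch_effective_severity(const sketch_pf_data_t *pfd)
{
	return (pfd->pe_severity_flags & ~pfd->pe_severity_mask);
}

/* A panic is requested if any device still reports PF_ERR_PANIC unmasked. */
static bool
sketch_should_panic(const sketch_pf_data_t *pfds, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (sketch_effective_severity(&pfds[i]) & PF_ERR_PANIC)
			return (true);
	}
	return (false);
}
```

Since the mask lives in each per-device record, a fatal error on a passed-through device leaves its other severity bits intact for reporting, while the same error on a device that is not passed through still requests a panic.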

Testing:

I have two NVMe devices that will reliably cause PCIe errors when used in passthru in a SmartOS VM. Previously those errors would cause the host to panic, but with these changes in place the panic does not happen. FMA still records the faults and tries to retire the devices, which will only take effect at the next reboot. The VM sometimes does hang when accessing those devices (the nvme driver gets stuck waiting for I/O completions), but that is a problem for a different day.

Actions #1

Updated by Electric Monk over 4 years ago

  • Status changed from New to Closed

git commit 79bed773cb9f85f14d6c40e097abafdf4cc1e687

commit  79bed773cb9f85f14d6c40e097abafdf4cc1e687
Author: Hans Rosenfeld <hans.rosenfeld@joyent.com>
Date:   2019-08-19T17:40:33.000Z

    11369 PCIe errors on passthru devices shouldn't cause a panic
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Andy Fiddaman <andy@omniosce.org>
    Approved by: Richard Lowe <richlowe@richlowe.net>
