Bug #12135

closed

ECRC PCIe errors shouldn't be fatal

Added by Robert Mustacchi over 2 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Category: driver - device drivers
% Done: 100%
Difficulty: Medium

Description

PCI Express defines errors in three categories:

  • Correctable errors
  • Uncorrectable, non-fatal errors
  • Uncorrectable, fatal errors

The PCIe specification classifies an end-to-end CRC (ECRC) error as an uncorrectable, non-fatal error. The underlying system will generally retry transactions when such an error occurs. However, the kernel today treats it as a fatal error and panics the system. As seen in 11953, there are systems where such events occur and take out the system, which they really shouldn't. Instead, FMA should determine what to do and, in general, the system should not panic. This change demotes such errors to generating ereports, a path we have exercised extensively while working through 11953.
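To illustrate the classification described above, here is a minimal C sketch. It is not the illumos implementation: the names `classify_uce`, `sev_action`, `uce_sev_t`, and the `UCE_*` macros are hypothetical, and the fatal masks are only illustrative. The one detail taken from the PCIe AER register layout is that ECRC Error Status is bit 19 of the Uncorrectable Error Status register. The sketch assumes that a set of "fatal" bits decides whether an uncorrectable error panics the system or is reported and left for diagnosis.

```c
#include <stdint.h>

/* A few AER Uncorrectable Error Status bits (subset, for illustration). */
#define	UCE_DLP_ERR		(1u << 4)	/* Data Link Protocol error */
#define	UCE_MALFORMED_TLP	(1u << 18)	/* Malformed TLP */
#define	UCE_ECRC		(1u << 19)	/* ECRC error */

/* Hypothetical severity type for this sketch. */
typedef enum { SEV_NONE, SEV_NONFATAL, SEV_FATAL } uce_sev_t;

/*
 * Classify an uncorrectable error status word. Any status bit that is
 * also set in fatal_mask makes the whole event fatal; otherwise the
 * event is non-fatal. A zero status means nothing was reported.
 */
static uce_sev_t
classify_uce(uint32_t status, uint32_t fatal_mask)
{
	if (status == 0)
		return (SEV_NONE);
	return ((status & fatal_mask) != 0 ? SEV_FATAL : SEV_NONFATAL);
}

/* Map a severity to the action this sketch assumes the system takes. */
static const char *
sev_action(uce_sev_t sev)
{
	switch (sev) {
	case SEV_FATAL:
		return ("panic");
	case SEV_NONFATAL:
		return ("generate ereport");
	default:
		return ("ignore");
	}
}
```

With a mask that includes `UCE_ECRC`, `classify_uce(UCE_ECRC, mask)` yields `SEV_FATAL`, which corresponds to the behavior this bug describes. Removing `UCE_ECRC` from the mask yields `SEV_NONFATAL`, so the event would only produce an ereport.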


Related issues

Related to illumos gate - Bug #11953: PCI Fatal Error encountered on AMD B450 platform (New, Robert Mustacchi)

#1

Updated by Robert Mustacchi over 2 years ago

  • Related to Bug #11953: PCI Fatal Error encountered on AMD B450 platform added
#2

Updated by Electric Monk over 2 years ago

  • Status changed from New to Closed
  • % Done changed from 90 to 100

git commit f7afc1fdb20343a25c4d88d6e7004d102e4c3e38

commit  f7afc1fdb20343a25c4d88d6e7004d102e4c3e38
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   2019-12-29T00:23:43.000Z

    12135 ECRC PCIe errors shouldn't be fatal
    12136 Want hook to disable PCIe link monitoring
    12134 Capture PCIe aspm status
    Reviewed by: John Levon <john.levon@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Garrett D'Amore <garrett@damore.org>
