Bug #12135
Closed
ECRC PCIe errors shouldn't be fatal
100%
Description
PCI Express defines errors in several different categories:
- Correctable errors
- Uncorrectable, non-fatal errors
- Uncorrectable, fatal errors
The PCIe specification classifies an end-to-end CRC (ECRC) error as an uncorrectable, non-fatal error. The underlying system will generally retry transactions when such an error occurs. However, the kernel today treats this as a fatal error and panics the system when it occurs. As seen in #11953, there are systems where such events occur and take the system down, when they really shouldn't. Instead, FMA should be allowed to determine what to do and, in general, the system should not panic. This change demotes such errors to generating ereports, a path which we have exercised extensively while working through #11953.
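To make the intended policy concrete, here is a minimal sketch in C of how an uncorrectable error status word could be mapped to a disposition, with ECRC falling into the "ereport only" bucket. This is not the code from the actual change: every `sketch_`/`SKETCH_` name is hypothetical, and the fatal mask is an assumed policy for illustration only. The bit positions follow the layout of the AER Uncorrectable Error Status register.

```c
#include <stdint.h>

/*
 * Hypothetical names, for illustration only. Bit positions follow the
 * AER Uncorrectable Error Status register layout.
 */
#define	SKETCH_AER_UCE_DLP	(1u << 4)	/* Data Link Protocol Error */
#define	SKETCH_AER_UCE_PTLP	(1u << 12)	/* Poisoned TLP */
#define	SKETCH_AER_UCE_ECRC	(1u << 19)	/* ECRC Error */

typedef enum {
	SKETCH_DISP_NONE,	/* no error bits set */
	SKETCH_DISP_EREPORT,	/* non-fatal: generate an ereport, let FMA decide */
	SKETCH_DISP_PANIC	/* fatal: take the system down */
} sketch_disp_t;

/*
 * Assumed policy: only the bits in this mask are treated as fatal.
 * ECRC is deliberately left out, so it is reported rather than fatal.
 */
static const uint32_t sketch_fatal_mask = SKETCH_AER_UCE_DLP;

/* Map an uncorrectable error status word to a disposition. */
sketch_disp_t
sketch_classify_uce(uint32_t status)
{
	if (status == 0)
		return (SKETCH_DISP_NONE);
	if ((status & sketch_fatal_mask) != 0)
		return (SKETCH_DISP_PANIC);
	return (SKETCH_DISP_EREPORT);
}
```

With this policy, `sketch_classify_uce(SKETCH_AER_UCE_ECRC)` yields `SKETCH_DISP_EREPORT`, while any status word that also carries a bit from the fatal mask still yields `SKETCH_DISP_PANIC`.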
Related issues
Updated by Robert Mustacchi over 3 years ago
- Related to Bug #11953: PCI Fatal Error encountered on AMD B450 platform added
Updated by Electric Monk over 3 years ago
- Status changed from New to Closed
- % Done changed from 90 to 100
git commit f7afc1fdb20343a25c4d88d6e7004d102e4c3e38
commit f7afc1fdb20343a25c4d88d6e7004d102e4c3e38
Author: Robert Mustacchi <rm@fingolfin.org>
Date: 2019-12-29T00:23:43.000Z

    12135 ECRC PCIe errors shouldn't be fatal
    12136 Want hook to disable PCIe link monitoring
    12134 Capture PCIe aspm status
    Reviewed by: John Levon <john.levon@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Garrett D'Amore <garrett@damore.org>