Bug #11953
PCI Fatal Error encountered on AMD B450 platform
Status: Open · % Done: 90%
Description
Recently put together a new build on a GIGABYTE AORUS PRO based on the AMD B450 chipset to replace an aging Core2 build from 2007 (which was struck by lightning a while ago; the PCI cards and storage were spared).
- Ryzen 7 2700X, 8 CPUs, 16 threads
- 32GB DDR4-2400
- 512GB SSD
- 1000GB HDD
- two spare 2000GB HDDs for RAID1 (to replace single HDD above eventually)
Tested both a GTX 1650 and a GT 1030, as there is no built-in Radeon Vega on the processor. Booting from the OI DVD while forcing the VESA driver, or forcing the text console, made no difference; the faulting controller just moved from pcieb-9 to pcieb-0.
- Booted OpenBSD/i386 miniroot correctly.
- Booted OpenBSD/amd64 miniroot correctly
- Booted Fedora Core 31/amd64 Live correctly
- failed to boot existing OI on hard drive (PCIe fatal error on pcieb-9)
- failed to boot OI Live (pcie fatal error on pcieb-0)
- failed to boot OI live with VESA driver
- failed to boot OI live with text console only
- failed to boot OI live in EFI mode as above (vesa, text-only)
There does not seem to be any kind of serial port present, which may hinder debugging efforts.
Will try to provide more information as needed.
Files
Related issues
Updated by Rick V over 3 years ago
Output of ::msgbuf:
Output of ::status:
Output of ::stack:
Updated by Rick V over 3 years ago
Output of ffffff0d0458c240::print pcieb_devstate_t pcieb_dip | ::devinfo
Updated by Rick V over 3 years ago
- File fmdump.sun-pc.log fmdump.sun-pc.log added
- File messages.log messages.log added
After hardcoding pcieb`pcieb_die to zero, I can boot to the text console with networking (integrated igb0, external gani2 PCI-E card).
Updated by Rick V over 3 years ago
Updating to Quadro 440.31 fixes graphics support for me, leaving me with the manual workaround (patching pcieb_die) and unable to address more than 4 CPUs or hardware threads.
Updated by Rick V over 3 years ago
An attempt to disable the non-functional AMD IOMMU causes the entire system to hang as soon as the GPU driver is loaded (not on every boot, however).
Updated by János Zsitvai over 3 years ago
I hit the same PCI(-X) Express Fatal Error on an ASUS B450M-K with a Ryzen 2600X on the 2019.10 release of OpenIndiana, but 2019.04 boots to the graphical interface without issue on my end.
Updated by Harmen R over 3 years ago
I have similar problems with SmartOS, the pcieb driver panics the system at boot.
This is on an X470 based system with a Ryzen 7 3700X.
For me this is related to the pci link bandwidth event handling added a couple of months back. The system works just fine, when I patch the pcieb driver to not initialize the link bandwidth interrupt handler.
Updated by Rick V over 3 years ago
Booting Solaris 11.4 also works, which is expected because Sun Grid OCI runs on AMD EPYC. I also have access to all 16 threads there, but it isn't really an option for me, as I have valuable data on OpenZFS volumes installed, and without patches it's highly vulnerable.
Harmen: what should I patch to disable that ISR?
Updated by Harmen R over 3 years ago
This is the patch I'm using.
---
 usr/src/uts/common/io/pciex/pcieb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/usr/src/uts/common/io/pciex/pcieb.c b/usr/src/uts/common/io/pciex/pcieb.c
index c9d65748bbc..f7dd4fa268c 100644
--- a/usr/src/uts/common/io/pciex/pcieb.c
+++ b/usr/src/uts/common/io/pciex/pcieb.c
@@ -587,7 +587,7 @@ pcieb_attach(dev_info_t *devi, ddi_attach_cmd_t cmd)
 	(void) pcie_hpintr_enable(devi);
-	(void) pcie_link_bw_enable(devi);
+	// (void) pcie_link_bw_enable(devi);
 	/* Do any platform specific workarounds needed at this time */
 	pcieb_plat_attach_workaround(devi);
Updated by Rick V over 3 years ago
ok managed to copy the old pcieb
driver from my original (2019.04) BE. While the latest Quadro driver works, it is very buggy, and sometimes locks up when openwindows xorg is started.
Updated by Robert Mustacchi over 3 years ago
Harmen, can you describe how the system panics when link bandwidth checks are enabled? Do you happen to have a crash dump available from that? János, do you happen to have a crash dump from that as well?
Rick, I'll try to get some time to look at the error reports and see what's coming in and being treated as a fatal error. Thanks for providing them.
Updated by Harmen R over 3 years ago
Hi Robert,
The system panics during boot and then reboots. I've attached a 12 second video that shows the behavior. The system prints some info before it reboots, that's kinda visible 9 seconds into the video.
Stacktrace is available also in the screenshot I posted before.
Unfortunately I don't have a crash dump, the system stops before the dump device becomes available. The kmdb console is also inoperable.
If there are other ways to get more info for debugging this, please let me know.
Updated by Rick V over 3 years ago
Indeed. To get k[am]db back: reboot, drop into the EFI PROM monitor, check the CMOS configuration for USB legacy support (ON) and port 0x60/0x64 emulation (ON), and turn OFF XHCI handoff, if such options exist on your machine.
	Can't read ACPI SRAT table, one NUMA node only
At least illumos-joyent is nice enough to tell you why it can't bootstrap any more of the existing CPUs.
Updated by János Zsitvai over 3 years ago
I don't have a crash dump either, because of the dump device unavailability. Kmdb works, but I'm not familiar enough to know what to look for.
Updated by Rick V over 3 years ago
Got a crash dump later on; see if this is of any use: https://snowlight.net/files/bug/core.tar.xz
XHCI reset occurred
Updated by Robert Mustacchi over 3 years ago
I'll take a look at that and the rest of the information and see what can be the most useful way of approaching this. Thank you all for your patience.
Is it correct that for everyone involved disabling the PCIe link bandwidth services is helping?
Updated by Rick V over 3 years ago
Yes, that's what it looks like for me. My original boot environment dates back to a few days after 2019.04 was released, so I just picked the pcieb driver out piecemeal to boot without problems on the current BE.
Updated by János Zsitvai over 3 years ago
Would remote access to kmdb help? I can provide ssh access to a machine connected to the affected system through serial port (picocom + tmux).
Updated by Robert Mustacchi over 3 years ago
János, let me go through all of the data that Rick has. Once I can confirm with that and figure out some possible next steps that make sense, I may take you up on that. I appreciate the offer. In the interim, since you have the same stack, the most useful thing would be to use kmdb to get the error report queue that generated it, so I can compare it to what we see in Rick's case.
At the command prompt I think you can do that by doing something like:
> ::walk ereportq_dump | ::ereport -v
> ::walk ereportq_pend | ::ereport -v
Updated by János Zsitvai over 3 years ago
Attached is the output I got from those commands.
Updated by Robert Mustacchi over 3 years ago
Thanks for including that output János. The class of error that you're seeing is the same as the one that Rick sees. I expect these are the same that Harmen is seeing as well.
The thing that's the same in all of these is that we have received a PCIe AER with ue_status set to 0x80000. This corresponds to an ECRC uncorrectable error being generated. Uncorrectable errors in PCIe parlance can be broken down into two categories: fatal and non-fatal. The idea is that a non-fatal error indicates an issue with a given transaction: while that transaction could not be recovered, the link itself is still fine and additional transactions can continue on the bus.
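For reference, 0x80000 is bit 19 of the AER Uncorrectable Error Status register, which the PCIe specification defines as the ECRC Error Status bit. A quick, illustrative decoder (not illumos code; the helper and its table are a sketch following the AER capability layout):

```python
# Decode a PCIe AER Uncorrectable Error Status value into error names.
# Bit positions follow the PCIe AER capability layout; this helper is
# illustrative only and is not part of illumos.
AER_UCE_BITS = {
    4:  "Data Link Protocol Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}

def decode_ue_status(value):
    """Return the names of the uncorrectable error bits set in ue_status."""
    return [name for bit, name in AER_UCE_BITS.items() if value & (1 << bit)]

# The value seen in the ereports on this thread:
print(decode_ue_status(0x80000))
```

Running this against the 0x80000 value from the ereports reports only the ECRC Error bit, matching the analysis above.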
Today, the OS enables these by default. It does know that this is a non-fatal, uncorrectable error; however, the tables in pcie_fault.c end up treating it as an event that we should panic on. This has historically been the behavior in illumos (and Solaris before it): a fatal error that should panic the system. The origins of that decision have been lost to time. However, it's not clear that encountering ECRC errors should be treated as fatal. Presuming that the underlying devices and buses can retry the transactions, they should be able to recover and proceed. If they don't, there are other classes of PCIe errors, like transaction timeouts, that will percolate up through the system.
Ultimately, FMA can still handle these and opt to offline a device. The experiments already done with setting pcieb_die to zero show that the system does continue to function and that the devices still work.
Why enabling link bandwidth monitoring on a subset of the devices in these systems causes an ECRC error to be generated is a real mystery to me. I've double-checked, and I don't believe the code paths could cause us to incorrectly enable this on the wrong set of PCIe capability registers or on a non-bridge device. That said, there's nothing that says we couldn't get these at some other time. It would be interesting to see whether such errors are reported on other platforms, where at least they would be non-fatal.
I'm working on a change to try and make these non-fatal. A prototype diff is below:
commit 78a0e36cee5bad893b3118111f2f55f6d171544e (HEAD -> master)
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   Fri Nov 15 14:55:58 2019 +0000

    11953 PCI Fatal Error encountered on AMD B450 platform

diff --git a/usr/src/uts/common/io/pciex/pcie_fault.c b/usr/src/uts/common/io/pciex/pcie_fault.c
index 6a335db3e2..90563c1d1a 100644
--- a/usr/src/uts/common/io/pciex/pcie_fault.c
+++ b/usr/src/uts/common/io/pciex/pcie_fault.c
@@ -1207,7 +1207,7 @@ const pf_fab_err_tbl_t pcie_pcie_tbl[] = {
 	{PCIE_AER_UCE_MTLP, pf_panic,
 	    PF_AFFECTED_PARENT, 0},
-	{PCIE_AER_UCE_ECRC, pf_panic,
+	{PCIE_AER_UCE_ECRC, pf_no_panic,
 	    PF_AFFECTED_SELF, 0},
 	{PCIE_AER_UCE_UR, pf_analyse_ca_ur,
@@ -1248,7 +1248,7 @@ const pf_fab_err_tbl_t pcie_rp_tbl[] = {
 	    PF_AFFECTED_SELF | PF_AFFECTED_AER,
 	    PF_AFFECTED_SELF | PF_AFFECTED_CHILDREN},
-	{PCIE_AER_UCE_ECRC, pf_panic,
+	{PCIE_AER_UCE_ECRC, pf_no_panic,
 	    PF_AFFECTED_AER, PF_AFFECTED_CHILDREN},
 	{PCIE_AER_UCE_UR, pf_no_panic,
@@ -1289,7 +1289,7 @@ const pf_fab_err_tbl_t pcie_sw_tbl[] = {
 	    PF_AFFECTED_SELF | PF_AFFECTED_AER,
 	    PF_AFFECTED_SELF | PF_AFFECTED_CHILDREN},
-	{PCIE_AER_UCE_ECRC, pf_panic,
+	{PCIE_AER_UCE_ECRC, pf_no_panic,
 	    PF_AFFECTED_AER,
 	    PF_AFFECTED_SELF | PF_AFFECTED_CHILDREN},
 	{PCIE_AER_UCE_UR, pf_analyse_ca_ur,
I'll try and get something more formal together for this. Rick, János, Harmen, if you don't mind trying this change, I'd appreciate it. If you need help building this or getting media appropriate for whatever you're running, please let me know and I'll work with folks to try and produce that.
Thanks again for your patience and helping me work through this. Let's see how this works out. If it does, then Rick, let's follow up on the 4 CPUs issue that you're having and Harmen, if you don't mind, I have some changes available to enable the temperature sensors on the Zen 2 line that I could use some help testing.
Updated by Rick V over 3 years ago
OK, applied the patch; https://snowlight.net/files/bug/pcieb.tar.xz is available as a quick-n-dirty gate build artefact.
Updated by Harmen R over 3 years ago
Thanks for working with us Robert, much appreciated. I don't mind testing the temperature sensors, please let me know how you want to proceed with that.
I won't be able to test the patch until Monday, I'll communicate the results then.
Updated by János Zsitvai over 3 years ago
- File console.log console.log added
Using pcieb.tar.xz from above, I get PCIEX-8000-G2. This is the log from verbose boot and kmdb enabled, but it doesn't drop to debugger, just hangs.
Updated by Rick V over 3 years ago
Yes, I've seen that at least once since applying the patch. As Robert noted, FMA can still kick in and disable that PCI slot if it encounters link errors. Which devices are affected I can never quite tell, as my system continues to operate normally regardless.
Updated by Robert Mustacchi over 3 years ago
If we can get the fault information (fmdump -fV) and the output of prtconf -v, I should be able to figure out which device this is impacting. I'll see if I can spend a bit of time narrowing down what's going on here, and try to review how we determine which PCIe bridges to enable link reporting on. Perhaps we're doing it for one that we shouldn't be.
Updated by Rick V over 3 years ago
- File fmdump2.txt fmdump2.txt added
- File prtconf2.txt prtconf2.txt added
here goes
Updated by János Zsitvai over 3 years ago
- File fmdump.txt fmdump.txt added
- File prtconf.txt prtconf.txt added
Here's what I get.
Updated by Harmen R over 3 years ago
- File prtconf.txt prtconf.txt added
Robert, your prototype patch works for my system. I can boot and the system works just fine.
fmdump only shows me horsing around with sata hotplug a couple of days ago.
I've attached my prtconf. The device that's causing my boot panics is pci1022,1483, instance #0.
Updated by Robert Mustacchi over 3 years ago
Thanks everyone for all the data. I've been going through this and I'm not sure I have a great lead yet. I have one idea that I'd like to try, which is basically disabling the autonomous state machine transition bandwidth events that we have. So, one experiment I've done is to push another change up that might address that, which would be worth trying.
Otherwise, I'll need to see if I can track down some erratum or an AMD contact. For the time being, I've gone ahead and pushed https://github.com/rmustacc/illumos-gate/tree/pcie-ecrc. This has both the original change and a follow up one that tries to look at some of the bandwidth monitoring. Anyways, if folks would be willing to build with both of those and give it a shot, I'd appreciate it. I'll need to figure out a few different experiments if that doesn't work.
Thank you again for your help and patience.
Updated by Robert Mustacchi over 3 years ago
One other question: can you see if the BIOS has an option related to ASPM, the PCIe 'Active State Power Management' feature?
Updated by Rick V over 3 years ago
Mine does not. I can, however, cap the PCI bus supported revision. The Solaris PCI-E driver only supported PCI-E 1.1; setting it to v3 or later causes the NVIDIA modesetting driver to terminate and reset the fb console (or hang the system).
Updated by Robert Mustacchi over 3 years ago
OK. I'll have to manually put something together for the ASPM test at some point in the future. If folks can try building the top commit on that branch in addition to the previous one: https://github.com/rmustacc/illumos-gate/commit/555cf78e16310e90707b38c6ca0e9f93d2b264da I'd appreciate it.
Updated by Rick V over 3 years ago
OK, got the GPU working just long enough to build https://snowlight.net/files/bug/pcie.xz, which is /kernel/misc/amd64/pcie with the two patches applied. Boots fine on my end.
Updated by Robert Mustacchi over 3 years ago
Are you seeing any additional ereports about timeouts since that's happened? Harmen and János, if you get a chance to try it out, I'd appreciate that.
Updated by Rick V over 3 years ago
- File fmdump.txt fmdump.txt added
Yeah, this new fmdump -Vep output is HUGE, and that's just for the last week or so.
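When an ereport log gets that large, one quick way to see which classes dominate is to tally the class column. A sketch (the sample file below just mimics the two-column TIME/CLASS layout of fmdump -e; on a live system you'd pipe fmdump -e in instead):

```shell
# Tally ereport classes by frequency from `fmdump -e`-style output.
# The here-doc mimics the TIME/CLASS columns; replace the sample file
# with a real `fmdump -e` pipe on a live system.
cat <<'EOF' > /tmp/fmdump.sample
Nov 26 14:27:01.1234 ereport.io.pciex.rc.ce-msg
Nov 26 14:27:02.5678 ereport.io.pciex.bw-change
Nov 26 14:27:03.9012 ereport.io.pciex.rc.ce-msg
EOF
# The last field is the class; count occurrences and sort by frequency.
awk '{print $NF}' /tmp/fmdump.sample | sort | uniq -c | sort -rn
```

This puts the most frequent ereport class on the first output line, which makes it easy to spot whether one noisy class accounts for most of the volume.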
Updated by Robert Mustacchi over 3 years ago
Just to confirm, several of those ereports are from after you booted onto the new bits right?
Updated by Harmen R over 3 years ago
I should have some time tomorrow for trying this.
Updated by Rick V over 3 years ago
Robert: Yes. Anything after 26 Nov @ 1427 (GMT-6) corresponds to the latest patched driver. Anything older simply has the first patch applied (turn off ECRC panic)
Updated by Robert Mustacchi over 3 years ago
OK, thanks Rick. I have another theory: this may be related to ASPM, or we're somehow tickling it. I'm going to go experiment with trying to capture this information and see if it's related. I did have something like this happen in the past and it was related to that, but I'm not convinced we'll be that lucky.
Updated by Robert Mustacchi over 3 years ago
So, to see if ASPM might be on the scene here, I pushed a new commit to that same branch which, when present, will add two properties to the devinfo tree. If folks manage to build with that as well and can then upload prtconf -v output again, that'd be great.
Updated by Rick V over 3 years ago
- File prtconf-aspm.txt prtconf-aspm.txt added
ok
https://snowlight.net/files/bug/pcie2.xz - with ASPM status patch
Updated by Robert Mustacchi over 3 years ago
OK. So, ASPM isn't really enabled on all of the relevant parts. I think what I should probably do, assuming the data from Harmen and János is similar to Rick's, is offer an /etc/system value to turn this off on some platforms, and see if I can eventually reach someone at AMD. It may be that in some cases we only see an occasional ereport and it won't be so bad, but if it does end up causing parts to be disabled, we can disable the link reporting (on a per-system basis) until we better understand why the two are related.
Updated by Harmen R over 3 years ago
- File fmdump.txt fmdump.txt added
- File prtconf.txt prtconf.txt added
I have tested a smartos-live build with all 3 patches.
The machine boots and operates as normal. I do have a pci device failing now, see attached fmdump.
ASPM seems to be disabled for me as well.
Updated by Rick V over 3 years ago
I'm so embarrassed right about now. My old Core2 box needed a change to the boot-ncpus value in the emulated OBP eeprom(1M) file, and I forgot to remove it after I junked the old box. Zen+/Pinnacle Ridge SMP and SMT work just fine. Still mostly unable to keep it running for more than a day or two, though.
Update: ah yes, I had that value in the fake-OBP ROM because a Skylake chipset was also misreporting CPUs even with SMT turned on.
Updated by Robert Mustacchi over 3 years ago
- Assignee set to Robert Mustacchi
- % Done changed from 0 to 90
I haven't had a lot of luck with being able to dig further here. I was able to get a hold of some of the erratum associated with these parts, but nothing turned up there. This is really stumping me. I've put together a change, which I'll get reviewed, that allows us to disable this on a per-system basis. So at least folks seeing this on B450 can disable the link state information. There'll be a minor impact for FMA, but at the end of the day, compared to a device that's constantly being offlined, that's not a big deal. Sorry for the trouble here and the lack of a better solution. I've put up a change for review at https://code.illumos.org/c/illumos-gate/+/241. Hopefully this'll help, and I'll also try to get back to some of the other Zen+/Zen 2 enabling.
Updated by Robert Mustacchi over 3 years ago
- Related to Feature #12134: Capture PCIe aspm status added
Updated by Robert Mustacchi over 3 years ago
- Related to Bug #12135: ECRC PCIe errors shouldn't be fatal added
Updated by Robert Mustacchi over 3 years ago
- Related to Bug #12136: Want hook to disable PCIe link monitoring added
Updated by Robert Mustacchi over 3 years ago
An update here, I am at least getting the workarounds that you all have helped test into illumos. I am also trying to reach out to AMD on this front, but it's a bit of a slow moving process (mostly at my end). In terms of other feature enablement, I did put up socket detection (mostly cosmetic) and temperature sensor stuff for review in https://code.illumos.org/c/illumos-gate/+/278 and https://code.illumos.org/c/illumos-gate/+/277. If you wouldn't mind giving those a whirl, I'd be happy to walk through how to test and verify they're working. I already did verify with someone else that they work on Zen+, so the main outstanding bit is Zen 2, whose temperature sensors I'm more concerned about. Thanks for all the help.
Updated by Harmen R over 3 years ago
Thanks Robert.
I'll see if I can build and test with the socket detection and temperature sensor changes. It's probably going to be tomorrow or on Thursday.
Updated by Harmen R over 3 years ago
So I've built an image from the smartos master branch with the patches from 11976, 11975 and 12135 applied.
Temperature sensors don't show in fmtopo.
[root@ryzen-server ~]# /usr/lib/fm/fmd/fmtopo -V '*sensor=temp*'
TIME                 UUID
Jan 02 10:15:11      907b0e78-f9a9-e418-a289-a7236ffb2b2d
[root@ryzen-server ~]#
Are there other tests you want me to run? Also for the socket detection? I only have a single-socket system, so I'm not sure how helpful that would be.
Updated by Rick V over 3 years ago
Not sure if SmartOS ships things like the CPU sensors base driver. I had to install driver/cpu/sensor and then boot -r to get it to show up in fmtopo. I've since rolled back the patches to prepare for integration (except for /kernel/drv/amd64/pcieb and /kernel/misc/amd64/pcie), and to see if the nvidia driver cuts back on its own memory corruption. (Turns out I'm having similar problems in a Windows 10 dual-boot, but on Windows the entire system simply hangs. I'm almost tempted to install a checked build.)
Updated by Harmen R over 3 years ago
The temperature sensors do show on my Intel systems running smartos.
Updated by Rick V over 3 years ago
So the EFI configuration program recently exposed an option to toggle PCI advanced error reporting. What effect would this have on the OS handling PCI errors?
Updated by Robert Mustacchi over 3 years ago
Regarding the temperature sensors, if you're building SmartOS, the big problem is that it uses its own packaging. So you have to manually add the PCI IDs that I added to the manifest file to the file uts/intel/os/driver_aliases. It has a slightly different format. So you'd need to add something that looked like the following to that file in illumos-joyent in addition to the patches.
amdf17nbdf "pci1022,1440,p"
amdf17nbdf "pci1022,1480,p"
amdf17nbdf "pci1022,1490,p"
amdf17nbdf "pci1022,15d0,p"
amdf17nbdf "pci1022,15e8,p"
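For SmartOS builders, applying those additions could be scripted along these lines. This is a sketch that works on a scratch copy for illustration; in a real illumos-joyent tree the target would be the uts/intel/os/driver_aliases file mentioned above:

```shell
# Append the amdf17nbdf aliases listed above to driver_aliases, skipping
# any entries already present. ALIASES points at a scratch copy here; in
# a real tree it would be uts/intel/os/driver_aliases in illumos-joyent.
ALIASES=/tmp/driver_aliases.sample
: > "$ALIASES"
for id in 1440 1480 1490 15d0 15e8; do
    line="amdf17nbdf \"pci1022,${id},p\""
    grep -qxF "$line" "$ALIASES" || printf '%s\n' "$line" >> "$ALIASES"
done
cat "$ALIASES"
```

The duplicate check makes the script idempotent, so re-running it against an already-patched file adds nothing.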
Rick, regarding the toggling of that, I'm not exactly sure. It may cause firmware to interpose on certain cfgspace accesses so as to hide the AER support or it may intercept it using a firmware-first mechanism so as to make it so the OS will never receive any of the events. Unfortunately, since this is all firmware dependent, it's hard to say.
Updated by Harmen R over 3 years ago
Thanks Robert, I didn't know that. After applying your suggested changes to driver_aliases I get temperature readings.
I've posted my results in the 11976 feature thread. (https://www.illumos.org/issues/11976#note-3)
Updated by Robert Mustacchi over 3 years ago
Thanks for testing that, Harmen. I've followed up on that thread, if you don't mind doing a little bit more. Could you also run prtconf -vp and make sure you see a socket AM4 output?
Unrelated to the temperature sensors, I've put back the initial bits that we worked through. Thank you all for your help.
Updated by Harmen R over 3 years ago
- File prtconf_vp.txt prtconf_vp.txt added
I don't see socket AM4 in the output. I've attached the output.
Updated by Robert Mustacchi over 3 years ago
I'm sorry, I should have said psrinfo -vp, not prtconf. That's my mistake.
Updated by Harmen R over 3 years ago
# psrinfo -vp
The physical processor has 8 cores and 16 virtual processors (0-15)
  The core has 2 virtual processors (0 8)
  The core has 2 virtual processors (1 9)
  The core has 2 virtual processors (2 10)
  The core has 2 virtual processors (3 11)
  The core has 2 virtual processors (4 12)
  The core has 2 virtual processors (5 13)
  The core has 2 virtual processors (6 14)
  The core has 2 virtual processors (7 15)
    x86 (AuthenticAMD 870F10 family 23 model 113 step 0 clock 3593 MHz)
      AMD Ryzen 7 3700X 8-Core Processor
Updated by Rick V over 3 years ago
same here
$ psrinfo -vp
The physical processor has 8 cores and 16 virtual processors (0-15)
  The core has 2 virtual processors (0 8)
  The core has 2 virtual processors (1 9)
  The core has 2 virtual processors (2 10)
  The core has 2 virtual processors (3 11)
  The core has 2 virtual processors (4 12)
  The core has 2 virtual processors (5 13)
  The core has 2 virtual processors (6 14)
  The core has 2 virtual processors (7 15)
    x86 (AuthenticAMD 800F82 family 23 model 8 step 2 clock 3693 MHz)
      AMD Ryzen 7 2700X Eight-Core Processor
Updated by Robert Mustacchi over 3 years ago
I double checked things and it looks like we don't yet have a model 70-7fh revision guide for us to properly map that, so I guess this is currently expected. I'll have to see if I can change the kernel routines here to get us the socket information without the processor revisions or track down a revision guide that probably comes with an NDA attached.
Updated by Rick V over 3 years ago
On a whim I decided to redo the case airflow and CPU cooling; GPU crashes are now reduced to switching video mode (and usually only at boot or when OpenWindows restarts lightdm).
Applying the patch and setting pcie_disable_lbw in /etc/system cleared up the rest of the errors for now.
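For anyone else landing here, the workaround Rick describes would look something like the following in /etc/system. This is a sketch: it assumes the pcie_disable_lbw tunable lives in the pcie kernel module, so verify the variable and module name against the build you're running before relying on it.

```
* Workaround for spurious ECRC ereports on some AMD B450/X470 boards:
* disable PCIe link bandwidth monitoring.
set pcie:pcie_disable_lbw = 1
```

A reboot is required for /etc/system changes to take effect.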
Updated by Harmen R over 3 years ago
Thanks for your efforts Robert, as of now I'm running vanilla smartos again.
Updated by Rick V over 3 years ago
I mean, I'm still piling up errors randomly, but now I'm getting them at the PCI root nexus:
(I run a workstation setup; these are probably coming from the GPU. PCI AER is turned ON in the platform config.)
$ fmdump -Vp -u 94a10c81-b449-c218-b820-b7fb6ea89775
TIME                           UUID                                 SUNW-MSG-ID
Feb 10 2020 17:59:57.485931000 94a10c81-b449-c218-b820-b7fb6ea89775 SUNOS-8000-J0

  TIME                 CLASS                             ENA
  Feb 10 17:59:57.2831 ereport.io.pciex.rc.ce-msg        0x0724a51517401401

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 94a10c81-b449-c218-b820-b7fb6ea89775
        code = SUNOS-8000-J0
        diag-time = 1581379197 416859
        de = fmd:///module/eft
        fault-list-sz = 0x2
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.eft.unexpected_telemetry
                certainty = 0x32
                resource = dev:////pci@0,0
                reason = no valid path to component was found in ereport.io.pciex.rc.ce-msg
                retire = 0
                response = 0
        (end fault-list[0])
        (start fault-list[1])
        nvlist version: 0
                version = 0x0
                class = fault.sunos.eft.unexpected_telemetry
                certainty = 0x32
                resource = dev:////pci@0,0
                reason = no valid path to component was found in ereport.io.pciex.rc.ce-msg
                retire = 0
                response = 0
        (end fault-list[1])
        fault-status = 0x1 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x5e41ee7d 0x1cf6b7f8
Updated by Dan McDonald almost 3 years ago
I have an X470 chipset on an ASRock Rack X470D4U and am seeing these sorts of errors, which now tickle FMA thanks to #12135. Watching this bug now.