Bug #11953 (open)

PCI Fatal Error encountered on AMD B450 platform

Added by Rick V over 3 years ago. Updated almost 3 years ago.

Status: New
Priority: Normal
Category: -
Start date:
Due date:
% Done: 90%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug:

Description

Recently put together a new build on a GIGABYTE AORUS PRO board based on the AMD B450 chipset to replace an aging Core2 build from 2007 (which was struck by lightning a while ago; the PCI cards and storage were spared).

  • Ryzen 7 2700X, 8 CPUs, 16 threads
  • 32GB DDR4-2400
  • 512GB SSD
  • 1000GB HDD
  • two spare 2000GB HDDs for RAID1 (to replace single HDD above eventually)

Tested both a GTX1650 and a GT1030, as the processor has no built-in Radeon Vega graphics. Booting from the OI DVD with the VESA driver forced, or with the console forced to text mode, made no difference; the faulting controller just moved from pcieb-9 to pcieb-0.

  • Booted OpenBSD/i386 miniroot correctly
  • Booted OpenBSD/amd64 miniroot correctly
  • Booted Fedora Core 31/amd64 Live correctly
  • Failed to boot existing OI on the hard drive (PCIe fatal error on pcieb-9)
  • Failed to boot OI Live (PCIe fatal error on pcieb-0)
  • Failed to boot OI Live with the VESA driver
  • Failed to boot OI Live with the text console only
  • Failed to boot OI Live in EFI mode as above (VESA, text-only)

There does not seem to be any kind of serial port present, which may hinder debugging efforts.
Will try to provide more information as needed.


Files

fmdump.sun-pc.log (29.6 KB) - output of fmdump -eVp - Rick V, 2019-11-11 12:15 AM
messages.log (154 KB) - system log output - Rick V, 2019-11-11 12:15 AM
prtconf.txt (323 KB) - prtconf -vDd output - Rick V, 2019-11-11 01:19 AM
Screenshot from 2019-10-13 11-56-40.png (191 KB) - Harmen R, 2019-11-13 06:45 PM
video_30-9-2019_8-55-26_part1.m4v (679 KB) - Harmen R, 2019-11-14 01:36 PM
kmdb.txt (898 Bytes) - János Zsitvai, 2019-11-15 10:38 PM
console.log (18.1 KB) - János Zsitvai, 2019-11-16 08:51 PM
prtconf2.txt (301 KB) - largely similar to the original prtconf dump, less a few USB devices - Rick V, 2019-11-17 04:03 AM
fmdump2.txt (95.6 KB) - fmdump of PCIEX-8000-G2 - Rick V, 2019-11-17 04:03 AM
fmdump.txt (158 KB) - János Zsitvai, 2019-11-17 01:52 PM
prtconf.txt (270 KB) - János Zsitvai, 2019-11-17 01:52 PM
prtconf.txt (544 KB) - Harmen R, 2019-11-17 08:38 PM
fmdump.txt (2.98 MB) - huge fmdump - Rick V, 2019-11-26 09:40 PM
prtconf-aspm.txt (245 KB) - current prtconf dump with ASPM status - Rick V, 2019-11-27 05:17 AM
fmdump.txt (8.7 KB) - Harmen R, 2019-11-28 05:55 AM
prtconf.txt (574 KB) - Harmen R, 2019-11-28 05:55 AM
prtconf_vp.txt (54.5 KB) - Harmen R, 2020-01-04 08:02 PM
err.log (8.34 KB) - Rick V, 2020-02-11 01:12 AM

Related issues

  • Related to illumos gate - Feature #12134: Capture PCIe aspm status (Closed, Robert Mustacchi)
  • Related to illumos gate - Bug #12135: ECRC PCIe errors shouldn't be fatal (Closed, Robert Mustacchi)
  • Related to illumos gate - Bug #12136: Want hook to disable PCIe link monitoring (Closed, Robert Mustacchi)
Actions #1

Updated by Rick V over 3 years ago

Output of ::msgbuf:



Output of ::status

Output of ::stack

Actions #2

Updated by Rick V over 3 years ago

output of ffffff0d0458c240::print pcieb_devstate_t pcieb_dip |::devinfo

Actions #3

Updated by Rick V over 3 years ago

After hardcoding pcieb`pcieb_die to zero, I can boot to the text console with networking (integrated igb0, external gani2 PCI-E card).
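For reference, a minimal sketch of the same workaround without hand-patching the binary, assuming pcieb_die is a settable global in the pcieb module (the variable name comes from the comment above; the /etc/system form itself is not verified against the current gate):

* /etc/system fragment: ask pcieb not to panic on fatal PCIe errors (hypothetical form)
set pcieb:pcieb_die = 0

A reboot is needed for /etc/system changes to take effect.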

Actions #4

Updated by Rick V over 3 years ago

Actions #5

Updated by Rick V over 3 years ago

Updating to Quadro 440.31 fixes graphics support for me, leaving me with the manual workaround (patching pcieb_die) still in place and unable to address more than 4 CPUs or hardware threads.

Actions #6

Updated by Rick V over 3 years ago

An attempt to disable the non-functional AMD IOMMU causes the entire system to hang as soon as the GPU driver is loaded, though not on every boot.

Actions #7

Updated by János Zsitvai over 3 years ago

I hit the same "PCI(-X) Express Fatal Error" on an ASUS B450M-K with a Ryzen 2600X on the 2019.10 release of OpenIndiana, but 2019.04 boots to the graphical interface without issue on my end.

Actions #8

Updated by Harmen R over 3 years ago

I have similar problems with SmartOS, the pcieb driver panics the system at boot.

This is on an X470 based system with a Ryzen 7 3700X.

For me this is related to the PCIe link bandwidth event handling added a couple of months back. The system works just fine when I patch the pcieb driver to not initialize the link bandwidth interrupt handler.

Actions #9

Updated by Rick V over 3 years ago

Booting Solaris 11.4 also works, which is expected, since Sun Grid OCI runs on AMD EPYC. I also get access to all 16 threads there, but that isn't really an option for me, as I have valuable data installed on OpenZFS volumes and, without patches, it's highly vulnerable.

Harmen: what should I patch to disable that ISR?

Actions #10

Updated by Harmen R over 3 years ago

This is the patch I'm using.

---
 usr/src/uts/common/io/pciex/pcieb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/usr/src/uts/common/io/pciex/pcieb.c b/usr/src/uts/common/io/pciex/pcieb.c
index c9d65748bbc..f7dd4fa268c 100644
--- a/usr/src/uts/common/io/pciex/pcieb.c
+++ b/usr/src/uts/common/io/pciex/pcieb.c
@@ -587,7 +587,7 @@ pcieb_attach(dev_info_t *devi, ddi_attach_cmd_t cmd)

     (void) pcie_hpintr_enable(devi);

-    (void) pcie_link_bw_enable(devi);
+    // (void) pcie_link_bw_enable(devi);

     /* Do any platform specific workarounds needed at this time */
     pcieb_plat_attach_workaround(devi);

Actions #11

Updated by Rick V over 3 years ago

OK, managed to copy the old pcieb driver from my original (2019.04) BE. While the latest Quadro driver works, it is very buggy and sometimes locks up when the OpenWindows Xorg session starts.

Actions #12

Updated by Robert Mustacchi over 3 years ago

Harmen, can you describe how the system panics when link bandwidth checks are enabled? Do you happen to have a crash dump available from that? János, do you happen to have a crash dump from that as well?

Rick, I'll try to get some time to look at the error reports and see what's coming in and being treated as a fatal error. Thanks for providing them.

Actions #13

Updated by Harmen R over 3 years ago

Hi Robert,

The system panics during boot and then reboots. I've attached a 12 second video that shows the behavior. The system prints some info before it reboots, that's kinda visible 9 seconds into the video.

Stacktrace is available also in the screenshot I posted before.

Unfortunately I don't have a crash dump, the system stops before the dump device becomes available. The kmdb console is also inoperable.

If there are other ways to get more info for debugging this, please let me know.

Actions #14

Updated by Rick V over 3 years ago

indeed. To get k[am]db back, reboot, drop into EFI PROM monitor, check the CMOS configuration for USB legacy support (ON), port 0x60/0x64 emulation (ON), and turn OFF XHCI handoff, if such options exist on your machine.

Can't read ACPI SRAT table, one NUMA node only

At least illumos-joyent is nice enough to tell you why it can't bootstrap any more of the existing CPUs

Actions #15

Updated by János Zsitvai over 3 years ago

I don't have a crash dump either, because of the dump device unavailability. Kmdb works, but I'm not familiar enough to know what to look for.

Actions #16

Updated by Rick V over 3 years ago

Got a crash dump later on; see if it's of any use: https://snowlight.net/files/bug/core.tar.xz
XHCI reset occurred

Actions #17

Updated by Robert Mustacchi over 3 years ago

I'll take a look at that and the rest of the information and see what can be the most useful way of approaching this. Thank you all for your patience.

Is it correct that for everyone involved disabling the PCIe link bandwidth services is helping?

Actions #18

Updated by Rick V over 3 years ago

Yes, that's what it looks like for me. My original boot environment dates back to a few days after 2019.04 was released, so I just picked the pcieb driver out of it piecemeal to boot without problems on the current BE.

Actions #19

Updated by János Zsitvai over 3 years ago

Would remote access to kmdb help? I can provide ssh access to a machine connected to the affected system through serial port (picocom + tmux).

Actions #20

Updated by Robert Mustacchi over 3 years ago

János, let me go through all of the data that Rick has. Once I can confirm with that and work out some possible next steps that make sense, I may take you up on that; I appreciate the offer. In the interim, since you have the same stack, the most useful thing would be to use kmdb to get the error report queue that generated it, so I can compare it to what we see in Rick's case.

At the kmdb command prompt, I think you can do that with something like:

> ::walk ereportq_dump | ::ereport -v
> ::walk ereportq_pend | ::ereport -v
Actions #21

Updated by János Zsitvai over 3 years ago

Attached is the output I got from those commands.

Actions #22

Updated by Robert Mustacchi over 3 years ago

Thanks for including that output János. The class of error that you're seeing is the same as the one that Rick sees. I expect these are the same that Harmen is seeing as well.

The thing that's the same in all of these is that we have received a PCIe AER with ue_status set to 0x80000. This corresponds to an ECRC uncorrectable error being generated. Uncorrectable errors in PCIe parlance can be broken down into two categories: fatal and non-fatal. The idea is that a non-fatal error indicates an issue with a given transaction: while that transaction could not be recovered, the link itself is still fine and additional transactions can continue on the bus.
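As a quick sanity check on that decode (this is based on the PCIe AER register layout, not on anything specific in the attached dumps): bit 19 of the Uncorrectable Error Status register is the ECRC Error Status bit, so

    0x80000 == (1 << 19)    /* the bit the headers call PCIE_AER_UCE_ECRC */

which is why a ue_status of 0x80000 reads as an ECRC uncorrectable error.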

Today, the OS enables these by default. It does know that this is a non-fatal, uncorrectable error; however, the tables in pcie_fault.c end up treating it as an event that we should panic on. This has historically been how illumos (and Solaris before it) handled it, as a fatal error that should panic the system, and the origins of that decision have been lost to time. However, it's not clear that ECRC errors should be treated as fatal. Presuming that the underlying devices and buses can retry the transactions, they should be able to recover and proceed. If they don't, there are other classes of PCIe errors, such as transaction timeouts, that will percolate up through the system.

Ultimately, FMA can still handle these and opt to offline a device. The experiments already done with setting pcieb_die to zero show that the system does continue to function and that the devices still work.

Why enabling link bandwidth monitoring on a subset of the devices in these systems causes an ECRC error to be generated is a real mystery to me. I've double-checked, and I don't believe the code paths could cause us to incorrectly enable this on the wrong set of PCIe capability registers or on a non-bridge device. That said, there's nothing that says we couldn't get these errors at some other time. It would be interesting to see whether such errors are reported on other platforms, where they at least seem to be non-fatal.

I'm working on a change to try and make these non-fatal. A prototype diff is below:

commit 78a0e36cee5bad893b3118111f2f55f6d171544e (HEAD -> master)
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   Fri Nov 15 14:55:58 2019 +0000

    11953 PCI Fatal Error encountered on AMD B450 platform

diff --git a/usr/src/uts/common/io/pciex/pcie_fault.c b/usr/src/uts/common/io/pciex/pcie_fault.c
index 6a335db3e2..90563c1d1a 100644
--- a/usr/src/uts/common/io/pciex/pcie_fault.c
+++ b/usr/src/uts/common/io/pciex/pcie_fault.c
@@ -1207,7 +1207,7 @@ const pf_fab_err_tbl_t pcie_pcie_tbl[] = {
        {PCIE_AER_UCE_MTLP,     pf_panic,
            PF_AFFECTED_PARENT, 0},

-       {PCIE_AER_UCE_ECRC,     pf_panic,
+       {PCIE_AER_UCE_ECRC,     pf_no_panic,
            PF_AFFECTED_SELF, 0},

        {PCIE_AER_UCE_UR,       pf_analyse_ca_ur,
@@ -1248,7 +1248,7 @@ const pf_fab_err_tbl_t pcie_rp_tbl[] = {
            PF_AFFECTED_SELF | PF_AFFECTED_AER,
            PF_AFFECTED_SELF | PF_AFFECTED_CHILDREN},

-       {PCIE_AER_UCE_ECRC,     pf_panic,
+       {PCIE_AER_UCE_ECRC,     pf_no_panic,
            PF_AFFECTED_AER, PF_AFFECTED_CHILDREN},

        {PCIE_AER_UCE_UR,       pf_no_panic,
@@ -1289,7 +1289,7 @@ const pf_fab_err_tbl_t pcie_sw_tbl[] = {
            PF_AFFECTED_SELF | PF_AFFECTED_AER,
            PF_AFFECTED_SELF | PF_AFFECTED_CHILDREN},

-       {PCIE_AER_UCE_ECRC,     pf_panic,
+       {PCIE_AER_UCE_ECRC,     pf_no_panic,
            PF_AFFECTED_AER, PF_AFFECTED_SELF | PF_AFFECTED_CHILDREN},

        {PCIE_AER_UCE_UR,       pf_analyse_ca_ur,

I'll try and get something more formal together for this. Rick, János, Harmen, if you don't mind trying this change, I'd appreciate it. If you need help building this or getting media appropriate for whatever you're running, please let me know and I'll work with folks to try and produce that.

Thanks again for your patience and helping me work through this. Let's see how this works out. If it does, then Rick, let's follow up on the 4 CPUs issue that you're having and Harmen, if you don't mind, I have some changes available to enable the temperature sensors on the Zen 2 line that I could use some help testing.

Actions #23

Updated by Rick V over 3 years ago

OK, applied the patch; https://snowlight.net/files/bug/pcieb.tar.xz is a quick-n-dirty gate build artefact.
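For anyone wanting to reproduce a module-only build like this, here is a rough sketch against an illumos-gate workspace. It assumes the onbld tools are installed and that an env file named illumos.sh exists; output paths can differ between setups, so treat it as a starting point rather than a recipe.

$ /opt/onbld/bin/bldenv -d illumos.sh   # enter a debug build environment
$ cd usr/src/uts/intel/pcieb            # the pcieb PCIe bridge driver module
$ dmake all                             # build just this module
$ ls debug64/                           # the freshly built pcieb should land here

The resulting binary can then be copied over /kernel/drv/amd64/pcieb in a disposable boot environment for testing.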

Actions #24

Updated by Harmen R over 3 years ago

Thanks for working with us Robert, much appreciated. I don't mind testing the temperature sensors, please let me know how you want to proceed with that.

I won't be able to test the patch until Monday, I'll communicate the results then.

Actions #25

Updated by János Zsitvai over 3 years ago

Using pcieb.tar.xz from above, I get PCIEX-8000-G2. This is the log from a verbose boot with kmdb enabled, but it doesn't drop to the debugger, it just hangs.

Actions #26

Updated by Rick V over 3 years ago

Yes, I've seen that at least once since applying the patch. As Robert noted, FMA can still kick in and disable that PCI slot if it encounters link errors. Which devices are affected I can never quite tell, as my system continues to operate normally regardless.

Actions #27

Updated by Robert Mustacchi over 3 years ago

If we can get the fault information (fmdump -fV) and the output of prtconf -v, I should be able to figure out which device this is impacting. I will see if I can spend a bit of time narrowing down what's going on here. I'll try to review how we determine which PCIe bridges to enable link reporting on; perhaps we're doing it for one that we shouldn't be.

Actions #28

Updated by Rick V over 3 years ago

here goes

Actions #29

Updated by János Zsitvai over 3 years ago

Here's what I get.

Actions #30

Updated by Harmen R over 3 years ago

Robert, your prototype patch works for my system. I can boot and the system works just fine.

fmdump only shows me horsing around with sata hotplug a couple of days ago.

I've attached my prtconf. The device that's causing my boot panics is pci1022,1483, instance #0.

Actions #31

Updated by Robert Mustacchi over 3 years ago

Thanks everyone for all the data. I've been going through this and I'm not sure I have a great lead yet. I have one idea that I'd like to try, which is basically disabling the autonomous state-machine-transition bandwidth events that we have. So one experiment I've done is to push up another change that might address that, which would be worth trying.

Otherwise, I'll need to see if I can track down some erratum or an AMD contact. For the time being, I've gone ahead and pushed https://github.com/rmustacc/illumos-gate/tree/pcie-ecrc. This has both the original change and a follow up one that tries to look at some of the bandwidth monitoring. Anyways, if folks would be willing to build with both of those and give it a shot, I'd appreciate it. I'll need to figure out a few different experiments if that doesn't work.

Thank you again for your help and patience.

Actions #32

Updated by Robert Mustacchi over 3 years ago

One other question: can you see if the BIOS has an option related to ASPM, the PCIe 'Active State Power Management' feature?

Actions #33

Updated by Rick V over 3 years ago

Mine does not. I can, however, cap the PCI bus supported revision. The Solaris PCI-E driver only supported PCI-E 1.1; setting it to v3 or later causes the NVIDIA modesetting driver to terminate and reset the framebuffer console (or hang the system).

Actions #34

Updated by Robert Mustacchi over 3 years ago

OK. I'll have to manually put something together for the ASPM test at some point in the future. If folks can try building the top commit on that branch in addition to the previous one: https://github.com/rmustacc/illumos-gate/commit/555cf78e16310e90707b38c6ca0e9f93d2b264da I'd appreciate it.

Actions #35

Updated by Rick V over 3 years ago

OK, got the GPU working just long enough to build https://snowlight.net/files/bug/pcie.xz, which is /kernel/misc/amd64/pcie with the two patches applied. Boots fine on my end.

Actions #36

Updated by Robert Mustacchi over 3 years ago

Are you seeing any additional ereports about timeouts since that happened? Harmen and János, if you get a chance to try it out, I'd appreciate it.

Actions #37

Updated by Rick V over 3 years ago

Yeah. This new fmdump -Vep output is huge, and that's just for the last week or so.

Actions #38

Updated by Robert Mustacchi over 3 years ago

Just to confirm, several of those ereports are from after you booted onto the new bits, right?

Actions #39

Updated by Harmen R over 3 years ago

I should have some time tomorrow for trying this.

Actions #40

Updated by Rick V over 3 years ago

Robert: yes. Anything after 26 Nov @ 14:27 (GMT-6) corresponds to the latest patched driver. Anything older has only the first patch applied (turning off the ECRC panic).

Actions #41

Updated by Robert Mustacchi over 3 years ago

OK, thanks Rick. I have another theory: this may be related to ASPM, or we're somehow tickling it. I'm going to experiment with trying to capture this information and see if it's related. I did have something like this happen in the past and it was related to that, but I'm not convinced we'll be that lucky.

Actions #42

Updated by Robert Mustacchi over 3 years ago

So to see if ASPM might be on the scene here, I pushed a new commit to that same branch which when present will add two properties to the devinfo tree. So if folks manage to build with that as well and can then upload prtconf -v output again, that'd be great.

Actions #43

Updated by Rick V over 3 years ago

OK: https://snowlight.net/files/bug/pcie2.xz - with the ASPM status patch applied

Actions #44

Updated by Robert Mustacchi over 3 years ago

OK. So ASPM isn't really enabled on all of the relevant parts. I think what I should probably do, assuming the data from Harmen and János is similar to Rick's, is offer an /etc/system value to turn this off on some platforms, and see if I can eventually reach someone at AMD. It may be that in some cases we only see an occasional ereport and it won't be so bad, but if it does end up causing parts to be disabled, we can disable the link reporting (on a per-system basis) until we better understand why the two are related.

Actions #45

Updated by Harmen R over 3 years ago

I have tested a smartos-live build with all 3 patches.

The machine boots and operates as normal. I do have a PCI device failing now; see the attached fmdump.

ASPM seems to be disabled for me as well.

Actions #46

Updated by Rick V over 3 years ago

I'm so embarrassed right about now: my old Core2 box needed a change to the boot-ncpus value in the emulated OBP eeprom(1m) file, and I forgot to remove it after I junked the old box. Zen+/Pinnacle Ridge SMP and SMT work just fine.

Still mostly unable to keep it running for more than a day or two.

Update: ah yes, I had that value in the fake-OBP ROM because the Skylake chipset was also misreporting CPUs even with SMT turned on.

Actions #47

Updated by Robert Mustacchi over 3 years ago

  • Assignee set to Robert Mustacchi
  • % Done changed from 0 to 90

I haven't had a lot of luck digging further here. I was able to get hold of some of the errata associated with these parts, but nothing turned up there. This is really stumping me. I've put together a change, which I'll get reviewed, that allows us to disable this on a per-system basis, so at least folks who are seeing this on B450 can disable the link state information. There'll be a minor impact for FMA, but at the end of the day, compared to a device that's constantly being offlined, that's not a big deal. Sorry for the trouble here and the lack of a better solution. I've put up a change for review at https://code.illumos.org/c/illumos-gate/+/241. Hopefully this'll help, and I'll also try to get back to some of the other Zen+/Zen 2 enabling.

Actions #48

Updated by Robert Mustacchi over 3 years ago

Actions #49

Updated by Robert Mustacchi over 3 years ago

  • Related to Bug #12135: ECRC PCIe errors shouldn't be fatal added
Actions #50

Updated by Robert Mustacchi over 3 years ago

  • Related to Bug #12136: Want hook to disable PCIe link monitoring added
Actions #51

Updated by Robert Mustacchi over 3 years ago

An update here: I am at least getting the workarounds that you all have helped test into illumos. I am also trying to reach out to AMD on this front, but it's a bit of a slow-moving process (mostly at my end). In terms of other feature enablement, I did put up socket detection (mostly cosmetic) and the temperature sensor work for review at https://code.illumos.org/c/illumos-gate/+/278 and https://code.illumos.org/c/illumos-gate/+/277. If you wouldn't mind giving those a whirl, I'd be happy to walk through how to test and verify that they're working. I already verified with someone else that they work on Zen+, so the main outstanding bit is Zen 2, whose temperature sensors I'm more concerned about. Thanks for all the help.

Actions #52

Updated by Harmen R over 3 years ago

Thanks Robert.

I'll see if I can build and test with the socket detection and temperature sensor changes. It's probably going to be tomorrow or on Thursday.

Actions #53

Updated by Harmen R over 3 years ago

So I've built an image from the smartos master branch with the patches from 11976, 11975 and 12135 applied.

Temperature sensors don't show in fmtopo.

[root@ryzen-server ~]# /usr/lib/fm/fmd/fmtopo -V '*sensor=temp*'
TIME                 UUID
Jan 02 10:15:11 907b0e78-f9a9-e418-a289-a7236ffb2b2d

[root@ryzen-server ~]# 

Are there other tests you want me to run? Also for the socket detection? I only have a single socket system, not sure how helpful that would be?

Actions #54

Updated by Rick V over 3 years ago

Not sure if SmartOS ships things like the cpu-sensors base driver. I had to install driver/cpu/sensor and then boot -r to get it to show up in fmtopo. I have since rolled back the patches to prepare for integration (except for /kernel/drv/amd64/pcieb and /kernel/misc/amd64/pcie), and to see if the nvidia driver cuts back on its own memory corruption. (Turns out I'm having similar problems in a Windows 10 dual-boot, but on Windows the entire system simply hangs. I'm almost tempted to install a checked build.)
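Roughly, the OpenIndiana-side steps described above would look like this (the package name and fmtopo invocation are the ones used earlier in this thread; a reconfiguration reboot can alternatively be requested by touching /reconfigure before a normal reboot):

# pkg install driver/cpu/sensor               # install the CPU sensor driver
# reboot -- -r                                # reconfiguration reboot so the driver attaches
# /usr/lib/fm/fmd/fmtopo -V '*sensor=temp*'   # the temperature sensors should now show up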

Actions #55

Updated by Harmen R over 3 years ago

The temperature sensors do show on my Intel systems running smartos.

Actions #56

Updated by Rick V over 3 years ago

So the EFI configuration program recently exposed an option to toggle PCI advanced error reporting. What effect would this have on how the OS handles PCI errors?

Actions #57

Updated by Robert Mustacchi over 3 years ago

Regarding the temperature sensors: if you're building SmartOS, the big problem is that it uses its own packaging, so you have to manually add the PCI IDs that I added to the manifest file to uts/intel/os/driver_aliases, which has a slightly different format. You'd need to add something like the following to that file in illumos-joyent, in addition to the patches.

amdf17nbdf "pci1022,1440,p" 
amdf17nbdf "pci1022,1480,p" 
amdf17nbdf "pci1022,1490,p" 
amdf17nbdf "pci1022,15d0,p" 
amdf17nbdf "pci1022,15e8,p" 

Rick, regarding toggling that, I'm not exactly sure. It may cause firmware to interpose on certain cfgspace accesses so as to hide the AER support, or it may intercept the errors using a firmware-first mechanism so the OS never receives any of the events. Unfortunately, since this is all firmware dependent, it's hard to say.
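Once an image is built with those aliases, a quick and generic illumos check that the driver actually bound to those devices would be something along these lines (assuming the module attached successfully):

# prtconf -D | grep amdf17nbdf   # list device nodes whose bound driver is amdf17nbdf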

Actions #58

Updated by Harmen R over 3 years ago

Thanks Robert, I didn't know that. After applying your suggested changes to driver_aliases I get temperature readings.

I've posted my results in the 11976 feature thread. (https://www.illumos.org/issues/11976#note-3)

Actions #59

Updated by Robert Mustacchi over 3 years ago

Thanks for testing that, Harmen. I've followed up on that thread, if you don't mind doing a little bit more. Could you also run prtconf -vp and make sure you see a socket AM4 in the output?

Unrelated to the temperature sensors, I've put back the initial bits that we worked through. Thank you all for your help.

Actions #60

Updated by Harmen R over 3 years ago

I don't see socket AM4 in the output. I've attached the output.

Actions #61

Updated by Robert Mustacchi over 3 years ago

I'm sorry, I should have said psrinfo -vp, not prtconf. That's my mistake.

Actions #62

Updated by Harmen R over 3 years ago

# psrinfo -vp
The physical processor has 8 cores and 16 virtual processors (0-15)
  The core has 2 virtual processors (0 8)
  The core has 2 virtual processors (1 9)
  The core has 2 virtual processors (2 10)
  The core has 2 virtual processors (3 11)
  The core has 2 virtual processors (4 12)
  The core has 2 virtual processors (5 13)
  The core has 2 virtual processors (6 14)
  The core has 2 virtual processors (7 15)
    x86 (AuthenticAMD 870F10 family 23 model 113 step 0 clock 3593 MHz)
      AMD Ryzen 7 3700X 8-Core Processor
Actions #63

Updated by Rick V over 3 years ago

same here

$ psrinfo -vp
The physical processor has 8 cores and 16 virtual processors (0-15)
  The core has 2 virtual processors (0 8)
  The core has 2 virtual processors (1 9)
  The core has 2 virtual processors (2 10)
  The core has 2 virtual processors (3 11)
  The core has 2 virtual processors (4 12)
  The core has 2 virtual processors (5 13)
  The core has 2 virtual processors (6 14)
  The core has 2 virtual processors (7 15)
    x86 (AuthenticAMD 800F82 family 23 model 8 step 2 clock 3693 MHz)
      AMD Ryzen 7 2700X Eight-Core Processor

Actions #64

Updated by Robert Mustacchi over 3 years ago

I double-checked things, and it looks like we don't yet have a models 70h-7Fh revision guide to properly map that, so I guess this is currently expected. I'll have to see if I can change the kernel routines to get us the socket information without the processor revisions, or track down a revision guide, which probably comes with an NDA attached.

Actions #65

Updated by Rick V over 3 years ago

On a whim I decided to redo the case airflow and CPU cooling; GPU crashes are now reduced to video mode switches (and usually only at boot or when OpenWindows restarts lightdm).

Applying the patch and setting pcie_disable_lbw in /etc/system cleared up the rest of the errors for now.
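For anyone else landing here, the /etc/system form of that workaround is presumably something like the following sketch. It assumes the tunable lives in the pcie misc module and that a nonzero value disables link bandwidth monitoring; check the integrated #12136 change for the authoritative name and semantics.

* Disable PCIe link bandwidth monitoring (bug #12136 workaround)
set pcie:pcie_disable_lbw = 1

A reboot is required for /etc/system changes to take effect.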

Actions #66

Updated by Harmen R over 3 years ago

Thanks for your efforts Robert, as of now I'm running vanilla smartos again.

Actions #67

Updated by Rick V over 3 years ago

I mean, I'm still piling up errors randomly, but now I'm getting them at the PCI root nexus (I run a workstation setup, so these are probably coming from the GPU; PCI AER is turned ON in the platform configuration):

$ fmdump -Vp -u 94a10c81-b449-c218-b820-b7fb6ea89775
TIME                           UUID                                 SUNW-MSG-ID
Feb 10 2020 17:59:57.485931000 94a10c81-b449-c218-b820-b7fb6ea89775 SUNOS-8000-J0

  TIME                 CLASS                                 ENA
  Feb 10 17:59:57.2831 ereport.io.pciex.rc.ce-msg            0x0724a51517401401

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 94a10c81-b449-c218-b820-b7fb6ea89775
        code = SUNOS-8000-J0
        diag-time = 1581379197 416859
        de = fmd:///module/eft
        fault-list-sz = 0x2
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.eft.unexpected_telemetry
                certainty = 0x32
                resource = dev:////pci@0,0
                reason = no valid path to component was found in ereport.io.pciex.rc.ce-msg
                retire = 0
                response = 0
        (end fault-list[0])
        (start fault-list[1])
        nvlist version: 0
                version = 0x0
                class = fault.sunos.eft.unexpected_telemetry
                certainty = 0x32
                resource = dev:////pci@0,0
                reason = no valid path to component was found in ereport.io.pciex.rc.ce-msg
                retire = 0
                response = 0
        (end fault-list[1])

        fault-status = 0x1 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x5e41ee7d 0x1cf6b7f8

Actions #68

Updated by Rick V over 3 years ago

Freshly rotated error log.

Actions #69

Updated by Dan McDonald almost 3 years ago

I have an X470 chipset on an ASRock Rack X470D4U and am seeing these sorts of errors, which now tickle FMA thanks to #12135. Watching this bug now.
