Bug #12387

X11 fails for certain Ryzen CPUs

Added by Gary Mills 8 months ago. Updated 26 days ago.

Status: New
Priority: Normal
Assignee: -
Category: driver - device drivers
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

When I upgrade from OI hipster-20191229 to OI hipster-20200212 and boot the new BE, I get an X11 hang on the console. The Xorg.0.log file ends like this:

        ARUBA, ARUBA
    [    72.142] (II) VESA: driver for VESA chipsets: vesa
    [    72.142] (++) using VT number 7

This system has an ASUS PRIME B350M-A motherboard. This board contains hardware to support an on-CPU GPU, since some Ryzen CPUs have built-in graphics. However, I'm using an AMD Ryzen 3 1200 CPU, which does not have built-in graphics and therefore requires an external graphics card, as do some other Ryzen CPUs. For this, I'm using a Radeon HD 7450 PCIe video card. The frame buffer device (/dev/fb) shows up like this:

    /devices/pci@0,0/pci1022,1453@3,1/display@0:text-0

I built debugging versions of the pcie and pcieb drivers from the latest illumos source and installed them in the new BE. In /etc/system, I've enabled full debugging in this manner:

    * Enable driver debugging
    set pcie:pcie_debug_flags = 1
    set pcieb:pcieb_dbg_print = 1
    set pcieb:pcieb_dbg_intr_print = 1

As far as I can tell, the kernel correctly identifies the video card. I've attached output from the serial console, as file cons.txt, to this bug report. Perhaps somebody will notice a problem in this file, although it looks okay to me.

At the end of the console output, these syslog messages appear:

Mar  7 09:48:49 ryzen genunix: WARNING: constraints forbid retire: /pci@0,0/pci1022,1453@1,3/pci1b21,1142@0
Mar  7 09:48:49 ryzen genunix: NOTICE: Device: already retired: /pci@0,0/pci1022,1453@1,3/pci1022,43b2@0,2/pci1022,43b4@0/pci1043,86f2@0

I don't know what triggers these messages. Of course, syslog messages may be from an earlier time than the surrounding direct console messages. The path components in the messages are curious. The first is for a USB controller. Specifically, pci1022,1453@1,3 is an AMD PCIe GPP Bridge, and pci1b21,1142@0 is an ASMedia USB 3.1 Host Controller. The second message is for an unknown ASUS device. Specifically, pci1022,43b2@0,2 is an unknown AMD device, pci1022,43b4@0 is an AMD 300 Series Chipset PCIe Port, and pci1043,86f2@0 is an unknown ASUSTeK device.

I suspect that the virtual terminal service is using the wrong video device, thus causing the Xorg hang. How can I confirm or deny this idea?

I'm willing to use my Ryzen system to help debug the problem further, but I don't know how to do this now. Perhaps somebody can suggest a way.


Files

cons.txt (51.6 KB) Gary Mills, 2020-03-13 02:29 PM

Related issues

Related to illumos gate - Bug #12453: Missing shutdown log on AMD Ryzen (Rejected)

#1

Updated by Gordon Ross 8 months ago

FYI, the common drm driver support we currently have is missing some things (TTM features) required by the radeon drm driver, so there's no drm driver for Radeon cards. That means Radeon cards will probably not work, or might only allow VESA modes.

That said, the kernel should not hang. Since this issue is opened against illumos gate, let's keep the focus of this issue on fixing the hang.

#2

Updated by Gary Mills 8 months ago

I agree with your recommendation. The Radeon card did work in December of 2019. It was only in February of 2020 that it stopped working. The VESA driver is fine with me. That's all I expected to work, and what Xorg chose. Something in illumos has changed in those two months. My current suspect is the VT service. I believe that's also part of illumos.

#3

Updated by Gary Mills 7 months ago

I have an update. I've reviewed the source of vt and console-kit-daemon. Neither of them recreates files in /dev/vt, although both use these files. I've also reviewed the source of dev. I can't tell whether dev recreates these files. I assume it does.

In any case, I've recently upgraded one Intel system and two AMD Ryzen systems, including the one that is the subject of this bug report, to OI versions from 20200321 and 20200322. All of these upgrades were successful. I am able to do OS upgrades again. All three of these OI versions added six services and configured devices at the beginning of the first boot.

On the system that led to this bug report, I tried another boot of the 20200212 OI version. I got the usual video hang. When I tried a reboot with the reconfigure option, I got the same result.

Booting the 20200321 version was still successful. I conclude that reconfigure did not resolve the problem, and that something has changed between 2020-02-12 and 2020-03-21. I don't know if this change occurred in illumos or OI. At least, OS upgrades work again.

#4

Updated by Gary Mills 7 months ago

I have another update. I tried a BIOS upgrade on the test system, from 4014 to 5220. On reboot, I got the usual X11 hang. When I reverted to the original BIOS version, I still got the X11 hang. This OS image used to work. Now it doesn't. Something has changed in the OS image. Could there be a cache that is now incorrect? Can I somehow invalidate this cache?

To validate this idea, I did another upgrade to today's release of OI. It built a new BE. The new BE booted normally, without the X11 hang. As long as I don't upgrade the BIOS version, it should continue to work. A cache would certainly explain the results I've been seeing, although some of my earlier conclusions would be wrong.

What could be causing this peculiar behavior?

#5

Updated by Gary Mills 7 months ago

  • Related to Bug #12453: Missing shutdown log on AMD Ryzen added
#6

Updated by Gary Mills 4 months ago

I tried it again, with a different procedure. First, I created a duplicate of an existing BE with beadm create. The new BE booted into X11 normally. Then, I used the EZ-flash utility in the BIOS setup to upgrade the BIOS from 4014 to 5220. When I booted the BE, it failed to enter X11, leaving the console with a black screen and a white cursor. After that, it hung. I had to power off the system to recover control.

ASUS does not provide a way to downgrade the BIOS version. Instead, I used an AMI utility to downgrade the BIOS from 5220 back to 4014. I tried again to boot the duplicate BE. This time, it left the console in text mode, and displayed two device retire messages. It gave me the console login message as well, but hung when I tried to log in. Later, I discovered two reports from the fault manager in /var/adm/messages . Here are parts of what I found:

Jul  7 20:08:15 ryzen fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SUNOS-8000-J0, TYPE: Fault, VER: 1, SEVERITY: Major
...
Jul  7 20:08:15 ryzen DESC: The diagnosis engine encountered telemetry from the listed devices for which it was unable to perform a diagnosis - 
...
Jul  7 20:08:18 ryzen fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-G2, TYPE: Fault, VER: 1, SEVERITY: Major
...
Jul  7 20:08:18 ryzen DESC: A problem has been detected on one of the specified devices or on one of the specified connecting buses.
...
Jul  7 20:08:18 ryzen genunix: [ID 631017 kern.notice] NOTICE: Device: already retired: /pci@0,0/pci1022,1453@1,3/pci1022,43b2@0,2/pci1022,43b4@0/pci1043,86f2@0
Jul  7 20:08:18 ryzen last message repeated 1 time
Jul  7 20:08:18 ryzen genunix: [ID 631017 kern.notice] NOTICE: Device: already retired: /pci@0,0/pci1022,1453@1,3/pci1022,43b2@0,2/pci1022,43b4@0

The original BE booted normally into X11 . Clearly, something has contaminated the duplicate BE, so that it fails to boot with either BIOS version. I suspect the PCI cache. According to the man page, this cache is supposed to be managed automatically. Perhaps this doesn't happen when a BIOS upgrade rearranges the PCI bus. Could this be the case? What can I do to confirm this idea?

#7

Updated by Joshua M. Clulow 4 months ago

If the device is being retired, you might check to see if you have a retire store at /etc/devices/retire_store in that BE. There are other device cache files in /etc/devices as well. These are packed nvlists with a header, so you'll need a program like this to dump them:

#include <stdlib.h>
#include <stdio.h>
#include <libnvpair.h>
#include <fcntl.h>
#include <strings.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>
#include <errno.h>

int
main(int argc, char *argv[])
{
        int fd;
        struct stat st;
        char *buf;
        size_t pos = 0;
        size_t rem;
        int e;

        if (argc != 2) {
                errx(1, "which file?");
        }

        if ((fd = open(argv[1], O_RDONLY)) < 0) {
                err(1, "open %s", argv[1]);
        }

        if (fstat(fd, &st) != 0) {
                err(1, "fstat %s", argv[1]);
        }

        rem = st.st_size;
        if ((buf = malloc(rem)) == NULL) {
                err(1, "malloc");
        }

        for (;;) {
                ssize_t rsz;

                if ((rsz = read(fd, buf + pos, rem)) < 0) {
                        if (errno == EINTR) {
                                continue;
                        }

                        err(1, "read %s", argv[1]);
                } else if (rsz == 0) {
                        if (rem != 0) {
                                errx(1, "file length changed?");
                        }

                        break;
                }

                rem -= rsz;
                pos += rsz;
        }

        /*
         * Look for the appropriate header:
         */
        if (st.st_size < 128) {
                errx(1, "file too short");
        }
        uint32_t magic;
        bcopy(buf, &magic, sizeof (magic));
        if (magic != 0xdeb1dcac) {
                errx(1, "magic %x was not %x", magic, 0xdeb1dcac);
        }
        int32_t version;
        bcopy(buf + 4, &version, sizeof (version));
        if (version != 1) {
                errx(1, "version %d was not %d", version, 1);
        }

        nvlist_t *nvl;
        if ((e = nvlist_unpack(buf + 128, st.st_size - 128, &nvl, 0)) != 0) {
                errno = e;
                err(1, "nvlist_unpack");
        }

        nvlist_print(stdout, nvl);

        free(buf);
        nvlist_free(nvl);
        (void) close(fd);
}

(also up at https://gist.github.com/jclulow/3652e561d5dfb4f336e1e3db4bf9ca7a )

Build with gcc -m64 -o dump_devices_file dump_devices_file.c -lnvpair and you can point it at at least some of the files in /etc/devices. Let me know how it goes!
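
For example, something like this should work (devid_cache here is just an arbitrary target; any of the nvlist-format files should behave similarly):

    $ gcc -m64 -o dump_devices_file dump_devices_file.c -lnvpair
    $ ./dump_devices_file /etc/devices/devid_cache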

#8

Updated by Gary Mills 4 months ago

In the display below, the duplicate BE is mounted at /mnt .

There was no such file as /etc/devices/retire_store in either the original or the duplicate BE .

Here are the files in /etc/devices for both BEs:

<root@ryzen># ll -t /mnt/etc/devices
total 783
-r--r--r--   1 root     root      293506 Jul  7 20:46 snapshot_cache
-r--r--r--   1 root     root         788 Jul  6 15:49 devname_cache
-r--r--r--   1 root     root         684 Sep 13  2018 devid_cache
-r--r--r--   1 root     root         228 Sep 11  2018 pci_unitaddr_persistent
-r--r--r--   1 root     root         276 Sep 11  2018 mdi_scsi_vhci_cache
<root@ryzen># ll -t /etc/devices    
total 783
-r--r--r--   1 root     root         788 Jul  7 21:36 devname_cache
-r--r--r--   1 root     root      293506 Jul  7 21:36 snapshot_cache
-r--r--r--   1 root     root         684 Sep 13  2018 devid_cache
-r--r--r--   1 root     root         228 Sep 11  2018 pci_unitaddr_persistent
-r--r--r--   1 root     root         276 Sep 11  2018 mdi_scsi_vhci_cache

I assume the files with the 2018 dates have not changed. I tried the nvlist dumper program on some of them anyway. It seemed to work. The cmp command told me that the devname_cache file was the same in both BEs, but the snapshot_cache files were different. The nvlist dumper did not work on this one. Here's what I got:

<root@ryzen># /tmp/nvdump /mnt/etc/devices/snapshot_cache
nvdump: magic 2 was not deb1dcac

I suppose that this cache file is not an nvlist cache.

The man page suggests that it's safe to remove all of these cache files. Should I try that?

#9

Updated by Gary Mills 4 months ago

I have a little more information. First, I upgraded the BIOS version, from 4014 to 5220 . Then, I booted the live USB image OI-hipster-gui-200504 . It booted into an X11 desktop quite nicely. That's the first time I've seen an X11 desktop with that BIOS version on that system. I suppose I could do an install at that point, but I really need to do an upgrade of an existing BE.

To that end, I created a second duplicate of a recent BE, and removed the cache files from /etc/devices on that duplicate. I also rebuilt the boot archive on that duplicate BE. When I did a reconfigure boot of that duplicate, it still failed, giving me only a black screen with a white cursor.

To recover, I downgraded the BIOS version back to 4014 and booted the original BE. It gave me the usual X11 console with a login box. So, I'm a little further ahead, but not much further. Something on the BE is still causing the boot to fail. What could it be? The BIOS upgrade changes something, probably the PCI device paths. The USB image can accommodate this change but the installed system cannot. What can I do now?

#10

Updated by Gary Mills 3 months ago

I was able to do a bit more testing, with the upgraded BIOS version and a duplicate BE. Any attempt to boot that BE ended in failure. With a typical failure, I got a black console screen with a white cursor at the upper left corner. X11 did not work. There was no login box and no graphical desktop.

However, I was able to boot the OI USB image to the X11 desktop. From there, I could enable remote login and access the duplicate BE on the disk. I used zpool import , beadm mount , and bootadm update-archive for this purpose.
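
Roughly, that sequence looked like this (the pool and BE names here are just placeholders for the real ones):

    # zpool import rpool                  # import the root pool from the live image
    # beadm mount hipster-dup /mnt        # mount the duplicate BE at /mnt
    # bootadm update-archive -R /mnt      # rebuild the boot archive for that BE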

To determine if the device cache is the problem, I copied the cache files from the USB image. I got two retire messages, followed by a hang.

To check on /etc/path_to_inst, I also copied this file from the USB image. Again, I got two retire messages, followed by a hang.

Finally, to provide a clean start of the fault manager, I copied the contents of the directory /var/fm/fmd from the USB image. This time, I got a fault manager message that disappeared off the screen, followed by a typical failure.

The retire messages come from the function modctl_retire(). The source file is /illumos-gate/usr/src/uts/common/os/modctl.c . This function only performs the retire of a device path, and logs failures. The orders for retirement come from the fault manager. /var/adm/messages on the duplicate BE revealed two messages from the fault manager:

Jul 27 19:13:05 ryzen fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SUNOS-8000-J0, TYPE: Fault, VER: 1, SEVERITY: Major
Jul 27 19:13:05 ryzen EVENT-TIME: Mon Jul 27 19:13:05 CDT 2020
Jul 27 19:13:05 ryzen PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: ryzen
Jul 27 19:13:05 ryzen SOURCE: eft, REV: 1.16
Jul 27 19:13:05 ryzen EVENT-ID: 002929e2-1241-4f27-84af-a05956677ec9
Jul 27 19:13:05 ryzen DESC: The diagnosis engine encountered telemetry from the listed devices for which it was unable to perform a diagnosis - 
Jul 27 19:13:05 ryzen   Refer to http://illumos.org/msg/SUNOS-8000-J0 for more information.  Refer to http://illumos.org/msg/SUNOS-8000-J0 for more information.
Jul 27 19:13:05 ryzen AUTO-RESPONSE: Error reports have been logged for examination by your illumos distribution team.
Jul 27 19:13:05 ryzen IMPACT: Automated diagnosis and response for these events will not occur.
Jul 27 19:13:05 ryzen REC-ACTION: Ensure that the latest illumos Kernel and Predictive Self-Healing (PSH) updates are installed.
...
Jul 27 19:13:09 ryzen fmd: [ID 377184 daemon.error] SUNW-MSG-ID: PCIEX-8000-G2, TYPE: Fault, VER: 1, SEVERITY: Major
Jul 27 19:13:09 ryzen EVENT-TIME: Mon Jul 27 19:13:09 CDT 2020
Jul 27 19:13:09 ryzen PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: ryzen
Jul 27 19:13:09 ryzen SOURCE: eft, REV: 1.16
Jul 27 19:13:09 ryzen EVENT-ID: 7dbb05b3-5178-c234-f2b9-8df45a8ac811
Jul 27 19:13:09 ryzen DESC: A problem has been detected on one of the specified devices or on one of the specified connecting buses.
Jul 27 19:13:09 ryzen   Refer to http://illumos.org/msg/PCIEX-8000-G2 for more information.
Jul 27 19:13:09 ryzen AUTO-RESPONSE: One or more device instances may be disabled
Jul 27 19:13:09 ryzen IMPACT: Loss of services provided by the device instances associated with this fault
Jul 27 19:13:09 ryzen REC-ACTION: Ensure that the latest drivers and patches are installed. Otherwise schedule a repair procedure to replace the affected device(s).  Use fmadm faulty to identify the devices or contact your illumos distribution team for support.
Jul 27 19:13:09 ryzen genunix: [ID 631017 kern.notice] NOTICE: Device: already retired: /pci@0,0/pci1022,1453@1,3/pci1022,43b2@0,2/pci1022,43b4@0/pci1043,86f2@0
Jul 27 19:13:09 ryzen last message repeated 1 time
Jul 27 19:13:09 ryzen genunix: [ID 631017 kern.notice] NOTICE: Device: already retired: /pci@0,0/pci1022,1453@1,3/pci1022,43b2@0,2/pci1022,43b4@0

Why does the fault manager retire those paths? How can I investigate further? Is there a way to tell the fault manager not to touch those paths?
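
For reference, the fault manager's records can be inspected from a booted system; roughly (just the inspection commands, nothing that changes state):

    # fmdump -v        # fault log: what fmd diagnosed
    # fmdump -eV       # error log: the raw telemetry behind the diagnoses
    # fmadm faulty     # resources fmd currently considers faulty or retired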

#11

Updated by Gary Mills 3 months ago

It always puzzled me that the live USB image boots to the X11 desktop, but the BEs on the disk never get that far. Recently, I discovered the reason: the fault manager is disabled on the live USB image.

With the duplicate BE, the boot always fails with the upgraded BIOS version. When I disabled the fault manager on the duplicate BE, it booted to X11, giving me a login box.
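
(For reference, on a running system the fault manager can be disabled and re-enabled through SMF; this is just a sketch assuming the standard service name, and doing it in a not-yet-booted BE would instead mean editing that BE's SMF repository:)

    # svcs fmd                                 # check the current state
    # svcadm disable svc:/system/fmd:default   # persists across reboots
    # svcadm enable svc:/system/fmd:default    # restore once testing is done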

Here's the mechanism of the failure. On startup, the fault manager retires the path to the PCIe video device. When Xorg starts, it puts the video card into graphics mode and then hangs. The result is no X11 graphics, including the login box. This situation seems to affect only the AMD Ryzen 3 1200 CPU. Later Ryzen CPUs don't have this problem.

Other people have reported intermittent X11 failures on boot with Ryzen CPUs. Mine happens every time. I wonder if there is a race going on. Both the fault manager and Xorg are accessing the video device.

#12

Updated by Gary Mills about 1 month ago

I found a workaround that is better than disabling the fault manager. Simply put:

    * Disable link bandwidth notifications
    set pcie:pcie_disable_lbw=1

into /etc/system and reboot. Then, remove all the files and directories under /var/fm/fmd . At that point, you can enable the fault manager without having it retire a bunch of PCIe paths. Even with another reboot, the system behaves normally.
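
Condensed into commands, the workaround looks roughly like this (assuming the fault manager is still disabled from before, and using the standard fmd service name):

    # echo 'set pcie:pcie_disable_lbw=1' >> /etc/system
    # reboot

    (after the reboot, with the fault manager still disabled)
    # rm -rf /var/fm/fmd/*
    # svcadm enable svc:/system/fmd:default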

This problem only happens with the family 23 (0x17) model 1 (0x01) Ryzen CPU. It doesn't happen with a CPU of the same family but model 17 (0x11). I don't have anything between the two to use for testing.

#13

Updated by Gary Mills 26 days ago

I now have something between the two. I upgraded the CPU in my Ryzen system from:

  x86 (AuthenticAMD 800F11 family 23 model 1 step 1 clock 3100 MHz)
        AMD Ryzen 3 1200 Quad-Core Processor    [ Socket: AM4 ]

to:

  x86 (AuthenticAMD 800F82 family 23 model 8 step 2 clock 3400 MHz)
        AMD Ryzen 5 2600 Six-Core Processor

The procedure outlined above didn't work with this CPU: I had to disable the fault manager to prevent the hang. This CPU used the same (probably faulty) bridge as before. Setting the pcie_disable_lbw variable was not sufficient.
