X11 fails for certain Ryzen CPUs
When I upgrade from OI hipster-20191229 to OI hipster-20200212 and boot the new BE, I get an X11 hang on the console. The Xorg.0.log file ends like this:
ARUBA, ARUBA
[ 72.142] (II) VESA: driver for VESA chipsets: vesa
[ 72.142] (++) using VT number 7
This system has an ASUS PRIME B350M-A motherboard. This board contains hardware for an on-CPU GPU because some Ryzen CPUs have built-in graphics. However, I'm using an AMD Ryzen 3 1200 CPU which does not have built-in graphics. This CPU requires an external graphics card. Other Ryzen CPUs also have this requirement. For this, I'm using a Radeon HD 7450 PCIe video card. The frame buffer device (/dev/fb) shows up like this:
I built debugging versions of the pcie and pcieb drivers from the latest illumos source and installed them in the new BE. In /etc/system, I've enabled full debugging in this manner:
* Enable driver debugging
set pcie:pcie_debug_flags = 1
set pcieb:pcieb_dbg_print = 1
set pcieb:pcieb_dbg_intr_print = 1
As far as I can tell, the kernel correctly identifies the video card. I've attached output from the serial console, as file cons.txt, to this bug report. Perhaps somebody will notice a problem in this file, although it looks okay to me.
At the end of the console output, these syslog messages appear:
Mar 7 09:48:49 ryzen genunix: WARNING: constraints forbid retire: /pci@0,0/pci1022,1453@1,3/pci1b21,1142@0
Mar 7 09:48:49 ryzen genunix: NOTICE: Device: already retired: /pci@0,0/pci1022,1453@1,3/pci1022,43b2@0,2/pci1022,43b4@0/pci1043,86f2@0
I don't know what triggers these messages. Of course, syslog messages may be from an earlier time than the surrounding direct console messages. The path components in the messages are curious. The first message is for a USB controller. Specifically, pci1022,1453@1,3 is an AMD PCIe GPP Bridge, and pci1b21,1142@0 is an ASMedia USB 3.1 Host Controller. The second message is for an unknown ASUS device. Specifically, pci1022,43b2@0,2 is an unknown AMD device, pci1022,43b4@0 is an AMD 300 Series Chipset PCIe Port, and pci1043,86f2@0 is an unknown ASUSTeK device.
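To make the path components easier to read, they can be split mechanically. The following is a minimal Python sketch, not part of any illumos tool. It assumes each component has the form pciVVVV,DDDD@dev[,func], where the two hexadecimal fields are vendor-style and device-style IDs (on illumos these may be subsystem IDs rather than the primary IDs, so treat the labels as an assumption). The function names and the small vendor table are hypothetical; the table holds only the three vendors identified in this report.

```python
import re

# Vendor names as identified in this report; this is not a full PCI ID database.
KNOWN_VENDORS = {
    0x1022: "AMD",
    0x1B21: "ASMedia",
    0x1043: "ASUSTeK",
}

# Assumed component form: pci<hex>,<hex>@<dev>[,<func>]
COMPONENT_RE = re.compile(
    r"^pci([0-9a-f]+),([0-9a-f]+)@([0-9a-f]+)(?:,([0-9a-f]+))?$",
    re.IGNORECASE,
)


def decode_component(component):
    """Decode one path component, or return None if it carries no IDs."""
    match = COMPONENT_RE.match(component)
    if match is None:
        return None  # e.g. the root nexus "pci@0,0"
    vendor = int(match.group(1), 16)
    return {
        "component": component,
        "vendor_id": vendor,
        "device_id": int(match.group(2), 16),
        "dev": int(match.group(3), 16),
        # A missing function number is taken to mean function 0.
        "func": int(match.group(4), 16) if match.group(4) else 0,
        "vendor_name": KNOWN_VENDORS.get(vendor, "unknown"),
    }


def decode_path(path):
    """Decode every ID-bearing component of a /devices-style path."""
    decoded = (decode_component(c) for c in path.strip("/").split("/"))
    return [d for d in decoded if d is not None]


if __name__ == "__main__":
    usb_path = "/pci@0,0/pci1022,1453@1,3/pci1b21,1142@0"
    for entry in decode_path(usb_path):
        print("{component}: {vendor_name} {vendor_id:04x},{device_id:04x} "
              "dev {dev} func {func}".format(**entry))
```

Running the script prints one line per component of the first syslog path. This only splits the names; mapping the IDs to device descriptions still needs a PCI ID list.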
I suspect that the virtual terminal service is using the wrong video device, thus causing the Xorg hang. How can I confirm or deny this idea?
I'm willing to use my Ryzen system to help debug the problem further, but I don't know how to do this now. Perhaps somebody can suggest a way.
Updated by Gordon Ross 3 months ago
FYI the common drm driver support we currently have is missing some things (TTM features) required by the radeon drm driver, so there's no drm driver for Radeon cards. That means Radeon cards will probably not work, or might only allow VESA modes.
That said, the kernel should not hang. Since this issue is opened against illumos gate, let's keep the focus of this issue on fixing the hang.
Updated by Gary Mills 3 months ago
I agree with your recommendation. The Radeon card did work in December of 2019. It was only in February of 2020 that it stopped working. The VESA driver is fine with me. That's all I expected to work, and what Xorg chose. Something in illumos has changed in those two months. My current suspect is the VT service. I believe that's also part of illumos.
Updated by Gary Mills 2 months ago
I have an update. I've reviewed the source of vt and console-kit-daemon. Neither of them recreates files in /dev/vt, although they do use these files. I've also reviewed the source of dev. I can't tell if dev recreates these files. I assume it does.
In any case, I've recently upgraded one Intel system and two AMD Ryzen systems, including the one that is the subject of this bug report, to OI versions from 20200321 and 20200322. All of these upgrades were successful. I am able to do OS upgrades again. In all three cases, the new OI version added six services and configured devices at the beginning of the first boot.
On the system that led to this bug report, I tried another boot of the 20200212 OI version. I got the usual video hang. When I tried a reboot with the reconfigure option, I got the same result.
Booting the 20200321 version was still successful. I conclude that reconfigure did not resolve the problem, and that something has changed between 2020-02-12 and 2020-03-21. I don't know if this change occurred in illumos or OI. At least, OS upgrades work again.
Updated by Gary Mills 2 months ago
I have another update. I tried a BIOS upgrade on the test system, from 4014 to 5220. On reboot, I got the usual X11 hang. When I reverted to the original BIOS version, I still got the X11 hang. This OS image used to work. Now it doesn't. Something has changed in the OS image. Could there be a cache that is now incorrect? Can I somehow invalidate this cache?
To validate this idea, I did another upgrade to today's release of OI. It built a new BE. The new BE booted normally, without the X11 hang. As long as I don't upgrade the BIOS version, it should continue to work. A cache would certainly explain the results I've been seeing, although some of my earlier conclusions would be wrong.
What could be causing this peculiar behavior?