Bug #8684

illumos disables a working USB controller

Added by Gary Mills 7 months ago. Updated 5 months ago.

Status: New
Start date: 2017-09-23
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: kernel
Target version: -
Difficulty: Medium
Tags: needs-triage

Description

When I'm booting the OI-hipster-gui-20170502 USB image, I get these messages on the console:

SunOS Release ...
Copyright (c) ...
WARNING: /pci@0,0/pci1043,81c0@b,1 (ehci0): No SOF interrupts have been received, this USB EHCI controller is unusable
WARNING: /pci@0,0/pci1043,81c0@b (ohci0): No SOF interrupts have been received, this USB OHCI controller is unusable
WARNING: /pci@0,0/pci1043,81c0@b (ohci0): No SOF interrupts have been received, this USB OHCI controller is unusable
WARNING: /pci@0,0/pci1043,81c0@b,1 (ehci0): No SOF interrupts have been received, this USB EHCI controller is unusable
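
To see at a glance which controllers the kernel gave up on, the repeated warnings above can be reduced to one entry per device. This is a minimal sketch and not part of illumos; the function name `unusable_controllers` and the regular expression are assumptions based only on the message format shown in the console output above.

```python
import re

# Assumption: every SOF warning has the shape seen on the console above:
#   WARNING: <device path> (<driver><instance>): No SOF interrupts have been
#   received, this USB <TYPE> controller is unusable
SOF_WARNING = re.compile(
    r"^WARNING: (?P<path>\S+) \((?P<instance>[a-z]+\d+)\): "
    r"No SOF interrupts have been received, "
    r"this USB (?P<kind>\w+) controller is unusable$"
)


def unusable_controllers(console_text):
    """Return unique (path, instance, kind) tuples in first-seen order."""
    seen = []
    for line in console_text.splitlines():
        match = SOF_WARNING.match(line.strip())
        if match is None:
            continue  # banner, copyright, or any other unrelated line
        entry = (match["path"], match["instance"], match["kind"])
        if entry not in seen:
            seen.append(entry)
    return seen


CONSOLE = """\
SunOS Release ...
Copyright (c) ...
WARNING: /pci@0,0/pci1043,81c0@b,1 (ehci0): No SOF interrupts have been received, this USB EHCI controller is unusable
WARNING: /pci@0,0/pci1043,81c0@b (ohci0): No SOF interrupts have been received, this USB OHCI controller is unusable
WARNING: /pci@0,0/pci1043,81c0@b (ohci0): No SOF interrupts have been received, this USB OHCI controller is unusable
WARNING: /pci@0,0/pci1043,81c0@b,1 (ehci0): No SOF interrupts have been received, this USB EHCI controller is unusable
"""

for path, instance, kind in unusable_controllers(CONSOLE):
    print(instance, kind, path)
```

For the console output quoted here, the four warnings collapse to two entries: ehci0 and ohci0.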

When this happened, the keyboard, mouse, and USB stick went dead. I had to hold down the power button to recover the system. The messages come from these kernel files:

usr/src/uts/common/io/usb/hcd/ehci/ehci_util.c
usr/src/uts/common/io/usb/hcd/openhci/ohci.c

The kernel has disabled a USB controller that was working and in use for the keyboard and mouse, as well as for the front-panel ports. scanpci lists the USB controllers as follows:

pci bus 0x0000 cardnum 0x0b function 0x00: vendor 0x10de device 0x026d
 NVIDIA Corporation MCP51 USB Controller
pci bus 0x0000 cardnum 0x0b function 0x01: vendor 0x10de device 0x026e
 NVIDIA Corporation MCP51 USB Controller
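
The unit address at the end of each device path in the warnings (`@b` and `@b,1`) reads as the PCI device number and function in hexadecimal, which is how the warnings can be matched to the two scanpci entries. The following is a minimal sketch of that matching, assuming this reading of the unit address; the helper names are illustrative and do not belong to any illumos tool.

```python
import re


def unit_address_to_dev_func(device_path):
    """Split the trailing '@dev[,func]' of a device path into (device, function).

    Assumption: both numbers are hexadecimal and a missing function means 0.
    """
    unit = device_path.rsplit("@", 1)[1]
    dev, _, func = unit.partition(",")
    return int(dev, 16), int(func or "0", 16)


# Matches only the header line of each scanpci entry, not the description line.
SCANPCI_LINE = re.compile(
    r"pci bus (0x[0-9a-f]+) cardnum (0x[0-9a-f]+) function (0x[0-9a-f]+): "
    r"vendor (0x[0-9a-f]+) device (0x[0-9a-f]+)"
)


def parse_scanpci(text):
    """Map (cardnum, function) to (vendor, device) for each scanpci entry."""
    table = {}
    for m in SCANPCI_LINE.finditer(text):
        _bus, card, func, vendor, device = (int(x, 16) for x in m.groups())
        table[(card, func)] = (vendor, device)
    return table


SCANPCI = """\
pci bus 0x0000 cardnum 0x0b function 0x00: vendor 0x10de device 0x026d
 NVIDIA Corporation MCP51 USB Controller
pci bus 0x0000 cardnum 0x0b function 0x01: vendor 0x10de device 0x026e
 NVIDIA Corporation MCP51 USB Controller
"""

devices = parse_scanpci(SCANPCI)
for path in ("/pci@0,0/pci1043,81c0@b", "/pci@0,0/pci1043,81c0@b,1"):
    vendor, device = devices[unit_address_to_dev_func(path)]
    print(f"{path}: vendor {vendor:#06x} device {device:#06x}")
```

Under that assumption, ohci0 (`@b`) is the function reported as device 0x026d and ehci0 (`@b,1`) is the one reported as device 0x026e.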

The machine on which this behavior occurs has an ASUS M2NPV-VM motherboard with an AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ CPU. The BIOS version is Revision 1401 08/07/2008. It works correctly with an OS installed from the OI-hipster-gui-20160421 DVD. Sometime between that release and the OI-hipster-gui-20170502 image, a kernel change occurred that broke this machine.

History

#1 Updated by Jean-Pierre André 7 months ago

I can confirm the issue on a different motherboard (Dell) with a similar CPU (AMD Athlon(tm) 64 X2 Dual Core Processor 4200+), and the same USB controller. So I cannot upgrade any more.

#2 Updated by Gary Mills 7 months ago

I found the same USB errors when booting the OI-hipster-gui-20161030 DVD. This means that the change that introduced the fatal error must have been incorporated into OI sometime between April 2016 and October 2016. Was there an exception list for this USB controller before? Did the time delay waiting for the SOF interrupts change? Why is illumos initializing USB controllers that are in use anyway?

#3 Updated by Toomas Soome 7 months ago

Gary Mills wrote:

I found the same USB errors when booting the OI-hipster-gui-20161030 DVD. This means that the change that introduced the fatal error must have been incorporated into OI sometime between April 2016 and October 2016. Was there an exception list for this USB controller before? Did the time delay waiting for the SOF interrupts change? Why is illumos initializing USB controllers that are in use anyway?

It is actually an old issue that has plagued Solaris going back to at least Solaris 10, and it has already been filed as #1857. It's just that no one has been available to investigate the EHCI driver and fix it...

#4 Updated by Gary Mills 7 months ago

I'm aware of #1857. I omitted it because it dealt only with virtualized systems. Something changed in 2016 so that the bug began to affect real hardware. I'm still searching for that change. The two files cited above have not changed in many years, so they can't be the cause. Also, those files contain no code to retry after the first failure. They simply print the message and disable the controller.

The curious thing is that even though my motherboard contains two of those USB controllers, the kernel only disables the one that is in use.

#5 Updated by Gary Mills 6 months ago

I found a workaround that enabled me to boot and run the hipster BE that I upgraded last month. The BIOS of this system contained an item called 'HPET Support'. It was enabled. I disabled it. After that change, the SOF error messages did not appear. The first HPET message also disappeared from the messages log, leaving only the second one:

Oct 17 20:32:12 amd pcplusmp: [ID 801116 kern.info] NOTICE: ACPI HPET table query failed

Also, the USB keyboard and USB mouse worked normally.

The implications of this seemingly unrelated change are curious. The kernel apparently initialized the HPET device incorrectly, resulting in no interrupts from the USB controller. That's why the kernel disabled the USB controller. The incorrect initialization may be the result of an incorrect HPET table in the BIOS, or it may be just bad kernel code. The problem may affect only Nvidia HPET devices or only Nvidia MCP51 HPET devices. Without the HPET support, the interrupts from the USB controller were normal.

#6 Updated by Gary Mills 5 months ago

I've investigated this problem in several different ways, all to no avail. Only the BIOS change to disable HPET support brought the system back to normal.

I looked for related source changes between the two dates cited above, but found none. I reviewed the source code for HPET support in the kernel, but found no obvious errors or omissions. I compared the verbose boot output on a serial console, including breakpoint debugging with kmdb, for the cases of HPET support enabled and disabled in the BIOS, but they were identical.

In case the boot was proceeding slowly instead of stalling, I waited patiently for 1 1/2 hours: it still hadn't completed. I even tried the slow boot workaround (setting apix_enable to zero). This too had no effect.

Something goes wrong in the kernel. Somehow HPET support affects the USB controller and causes the rest of the boot to stall. That's all I know.
