Bug #1625
Xorg hang (100% CPU), nvidia-related
Status: Closed
% Done: 0%
Description
This Xorg hang is happening on two of my systems. Both run OI-151a now, and both have the nVidia driver version 280.13. The problem also happened under OI-148 with the same nVidia driver, and with several previous driver revisions (I download them directly from the nVidia site). One machine is a 64-bit Intel Core i7 w/8GB RAM, GeForce 7300GT; the other is a 32-bit Intel P4 w/2GB RAM, GeForce 6200.
When the problem occurs, mouse and keyboard input are ignored, but one can log in remotely. The Xorg process appears to be using 100% CPU, although "prstat -m" actually shows it as 33% USR, 33% LCK, 33% SLP. Restarting gdm is usually not sufficient: I think the nVidia card or driver is left in an unhappy state, with some blocks of black & white bars on the screen after Xorg exits, so I usually just reboot the system.
Also, the Xorg.0.log shows entries like the ones below at the end, prior to killing the Xorg process:
(WW) Oct 04 10:24:40 NVIDIA: WAIT (2, 6, 0x8000, 0x0000c290, 0x0000ca08)
(WW) Oct 04 10:24:47 NVIDIA: WAIT (1, 6, 0x8000, 0x0000c290, 0x0000ca08)
. . .
At the same time, this shows up in /var/adm/messages:
Oct 4 10:24:44 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 16, Head 00000000 Count 00fdd77b
Oct 4 10:24:44 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 16, Head 00000001 Count 00fdd74c
Oct 4 10:24:52 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 16, Head 00000000 Count 00fdd77c
Oct 4 10:24:52 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 16, Head 00000001 Count 00fdd74d
Oct 4 10:24:53 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 8, Channel 0000001e
Oct 4 10:25:02 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 8, Channel 00000020
Oct 4 10:25:02 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 16, Head 00000000 Count 00fdd77e
Oct 4 10:25:02 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 16, Head 00000001 Count 00fdd74f
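When comparing incidents, it may help to tally the NVRM Xid reports rather than reading them by eye. The following is a minimal Python sketch, assuming the message layout is exactly as in the lines quoted above; the function name `tally_xids` is made up for illustration, and the sketch does not attempt to interpret what each Xid code means.

```python
import re
from collections import Counter

# Matches the NVRM lines quoted above, for example:
#   "... NVRM: Xid (0000:01:00): 16, Head 00000000 Count 00fdd77b"
# Assumption: the layout is "NVRM: Xid (<pci address>): <code>, <detail>".
XID_RE = re.compile(
    r"NVRM: Xid \((?P<dev>[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<detail>.*)$"
)


def tally_xids(lines):
    """Count Xid reports per (PCI device, Xid code) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = XID_RE.search(line.rstrip())
        if m:
            counts[(m.group("dev"), int(m.group("code")))] += 1
    return counts
```

A possible use is `tally_xids(open("/var/adm/messages", errors="replace"))`, then printing the resulting counter after each hang to see whether the mix of codes changes between driver versions.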
Note that the problem is difficult to replicate on demand. It seems to happen at random times, and probably at or near the time of a window focus change. It does not appear to be necessary to be viewing flash video or a web page, or using any particular X application. I do have "compiz" window-manager effects enabled.
Let me know what kind of debug information I can collect, and how.
Updated by Chris Jordan over 11 years ago
- Category set to XNV (X Window System)
- Assignee set to OI XNV
- Difficulty changed from Medium to Hard
- Tags changed from needs-triage to nvidia
Updated by Marion Hakanson over 11 years ago
I've been using the new nVidia driver version 285.05.09 for a little over a week, no hangs since then. As I said, I don't know how to trigger it on demand, so there's no guarantee the problem is gone forever, but if it's not "fixed", the frequency seems reduced.
I did not find any mention, in either the 285.05.09 README or the release notes, of a particular bug fix that seemed to match the problem I describe.
Updated by Marion Hakanson over 11 years ago
My previous entry was wishful thinking. Suffered another Xorg-busy hang this evening, and had the presence of mind to force a kernel crash dump via "reboot -d".
Is there somewhere I can upload the "vmdump.0" file so someone more expert than I am can have a look? I haven't poked around in it myself yet, will try to get a chance to do that tomorrow.
Updated by Marion Hakanson almost 11 years ago
Some new pieces of information:
(1) I've been repeatedly failing to get a good crash dump. Turns out this was due to bug #1369, so see that bug report for the workaround.
(2) When the hang happens, if instead of rebooting you just kill the Xorg process, when Xorg tries to restart it fails, complaining that the nvidia card is not receiving interrupts. On my two machines, other devices sharing the same IRQ also fail to receive any interrupts when in this state.
(3) The problem still occurs in the stable release candidate oi_151a2, with the latest nVidia driver, version 295.20. This despite the new driver claiming to have fixed a bug which causes hangs on OpenGL systems, including X11/compiz machines like these.
(4) I do finally have a successful crash dump from the latest incident in (3) above. Here's an excerpt of an "mdb -k" session:
> ::cpuinfo
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc304a0  1f    0    0  10   no    no t-58   ffffff02d686cba0 Xorg
  1 ffffff02d4d61040  1f    0    0  -1   no    no t-3    ffffff000f696c40 (idle)
  2 ffffff02d50f0b00  1f    0    0  -1   no    no t-1    ffffff000f505c40 (idle)
  3 ffffff02d50ed500  1f    0    0  -1   no    no t-1    ffffff000f59dc40 (idle)
  4 ffffff02d50ec000  1f    0    0  -1   no    no t-20   ffffff000f564c40 (idle)
  5 fffffffffbc3aca0  1b    0    0  59   no    no t-2    ffffff02debef7c0 reboot
  6 ffffff02d50e2080  1f    1    0  -1   no    no t-5    ffffff000fbfec40 (idle)
  7 ffffff02d5269000  1f    0    0  -1   no    no t-5    ffffff000fc7fc40 (idle)
> ffffff02d686cba0::findstack
stack pointer for thread ffffff02d686cba0: ffffff000fcd9a00
  ffffff000fcd9a80 xc_serv+0x186()
  ffffff000fcd9b10 0xffffff02d686cba0()
  ffffff000fcd9b40 restorecontext+0x13b()
  ffffff000fcd9f00 getsetcontext+0x227()
  ffffff000fcd9f10 _interrupt+0xba()
>
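For anyone going through several dumps, picking out the non-idle CPUs from "::cpuinfo" can be scripted. Below is a rough Python sketch, assuming the output has been saved as text with one CPU per line and the same eleven columns as in the excerpt above; `busy_cpus` is a hypothetical helper name, not an mdb feature.

```python
def busy_cpus(cpuinfo_text):
    """Return (cpu_id, thread_addr, proc) for rows whose PROC is not "(idle)".

    Assumes eleven whitespace-separated columns per row, as in the excerpt:
    ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
    """
    rows = []
    for line in cpuinfo_text.splitlines():
        fields = line.split()
        # Skip the header, mdb prompts, and anything that is not a data row.
        if len(fields) != 11 or not fields[0].isdigit():
            continue
        cpu_id, thread, proc = int(fields[0]), fields[9], fields[10]
        if proc != "(idle)":
            rows.append((cpu_id, thread, proc))
    return rows
```

Each thread address returned this way can then be passed to "::findstack" in mdb, as was done by hand for the Xorg thread above.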
Perhaps this isn't actually an nvidia-related problem, but instead some kind of interrupt-handling problem. I would welcome suggestions on what to look for in the crash dump I've got, or I can make it available for download if someone wants to take a crack at it themselves.
Updated by Marion Hakanson almost 11 years ago
My system was in this state again when I came in to work this morning. I can confirm that (according to "intrstat 1") the nvidia driver was not receiving any interrupts (nor was the ehci#0 driver which shares the same IRQ).
On this system, enabling MSI for the nvidia driver appears to have no effect.
Marion
Updated by Ken Mays almost 11 years ago
- Status changed from New to Closed
- Priority changed from Normal to Low
- Target version set to oi_151_stable
Marion - I checked for this issue. IMHO, install Solaris 11 and check OS stability from that point. I 'closed' this ticket as I cannot reproduce the issue of this ticket with oi_151a Xnv or Nvidia 280.13/295.20 drivers on my test hardware at this time.
Updated by Marion Hakanson almost 11 years ago
Ken, thanks for looking into this. I understand your reasons for closing this item -- heck, I can't even reproduce it on demand, yet it does continue to happen on a regular basis on at least one of my two desktop machines.
As was hinted at above, I'm wondering if this is in fact not Xorg or nvidia specific, but rather a case of lost interrupts peculiar to my hardware. Both of these systems have the nvidia driver sharing the same IRQ with some other onboard devices (keyboard, mouse, disk, or ehci drivers). In fact, something changed on my older machine when I went to oi151a2 (prestable), which seems to have changed the IRQ assignments (nvidia driver no longer shares with ehci and pci-ide), and I haven't had this problem on that machine since.
If I could get some guidance on troubleshooting or diagnosing a system in the state where drivers are no longer receiving interrupts, I believe I could make some progress with this problem.
Could this issue be re-classified appropriately, and kept open until I give up on it?
Dell Optiplex 980:
# echo "::interrupts -d" | mdb -k
IRQ  Vect IPL Bus Trg Type  CPU Share APIC/INT# Driver Name(s)
1    0x41 5   ISA Edg Fixed 4   1     0x0/0x1   i8042#0
9    0x80 9   PCI Lvl Fixed 1   1     0x0/0x9   acpi_wrapper_isr
11   0xd1 14  PCI Lvl Fixed 2   1     0x0/0xb   hpet_isr
12   0x42 5   ISA Edg Fixed 5   1     0x0/0xc   i8042#0
16   0x82 9   PCI Lvl Fixed 7   2     0x0/0x10  ehci#0, nvidia#0
23   0x83 9   PCI Lvl Fixed 0   1     0x0/0x17  ehci#1
24   0x40 5   PCI Edg MSI   3   1     -         ahci#0
25   0x81 7   PCI Edg MSI   6   1     -         pcieb#0
26   0x60 6   PCI Edg MSI   1   1     -         e1000g#0
32   0x20 2       Edg IPI   all 1     -         cmi_cmci_trap
160  0xa0 0       Edg IPI   all 0     -         poke_cpu
208  0xd0 14      Edg IPI   all 1     -         kcpc_hw_overflow_intr
209  0xd3 14      Edg IPI   all 1     -         cbe_fire
210  0xd4 14      Edg IPI   all 1     -         cbe_fire
240  0xe0 15      Edg IPI   all 1     -         xc_serv
241  0xe1 15      Edg IPI   all 1     -         apic_error_intr
#
Old Pentium-4 machine:
# echo "::interrupts -d" | mdb -k
IRQ  Vect IPL Bus Trg Type  CPU Share APIC/INT# Driver Name(s)
1    0x41 5   ISA Edg Fixed 0   1     0x0/0x1   i8042#0
4    0xb0 12  ISA Edg Fixed 0   1     0x0/0x4   asy#0
6    0x44 5   ISA Edg Fixed 0   1     0x0/0x6   fdc#0
7    0x45 5   ISA Edg Fixed 0   1     0x0/0x7   ecpp#0
9    0x80 9   PCI Lvl Fixed 0   1     0x0/0x9   acpi_wrapper_isr
12   0x42 5   ISA Edg Fixed 0   1     0x0/0xc   i8042#0
15   0x43 5   ISA Edg Fixed 0   1     0x0/0xf   ata#1
16   0x81 9   PCI Lvl Fixed 0   3     0x0/0x10  uhci#3, uhci#0, nvidia#0
18   0x84 9   PCI Lvl Fixed 0   4     0x0/0x12  uhci#2, e1000g#0, pci-ide#1, pci-ide#1
19   0x83 9   PCI Lvl Fixed 0   1     0x0/0x13  uhci#1
22   0x40 5   PCI Lvl Fixed 0   1     0x0/0x16  pci-ide#2
23   0x82 9   PCI Lvl Fixed 0   1     0x0/0x17  ehci#0
160  0xa0 0       Edg IPI   all 0     -         poke_cpu
208  0xd0 14      Edg IPI   all 1     -         kcpc_hw_overflow_intr
209  0xd1 14      Edg IPI   all 1     -         cbe_fire
210  0xd3 14      Edg IPI   all 1     -         cbe_fire
240  0xe0 15      Edg IPI   all 1     -         xc_serv
241  0xe1 15      Edg IPI   all 1     -         apic_error_intr
#
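To check quickly which devices share an IRQ with the nvidia driver (for example before and after an OS upgrade reshuffles the assignments), the "::interrupts -d" output can be parsed. This is only a sketch in Python, assuming the output has been saved as text with one interrupt per line and the column order shown above; `irq_sharers` and `ROW_RE` are names invented for this example.

```python
import re

# One row of "::interrupts -d" output. Assumption: columns appear in the order
# IRQ Vect IPL Bus Trg Type CPU Share APIC/INT# Driver Name(s),
# and the Bus column is blank on IPI rows.
ROW_RE = re.compile(
    r"^\s*(?P<irq>\d+)\s+(?P<vect>0x[0-9a-fA-F]+)\s+(?P<ipl>\d+)\s+"
    r"(?:(?P<bus>[A-Za-z]+)\s+)?"
    r"(?P<trg>Edg|Lvl)\s+(?P<type>\S+)\s+(?P<cpu>\S+)\s+(?P<share>\d+)\s+"
    r"(?P<apic>\S+)\s+(?P<drivers>.+?)\s*$"
)


def irq_sharers(interrupts_text, driver="nvidia"):
    """Map IRQ number -> list of other driver instances sharing it with `driver`.

    `driver` is compared against the part of each instance name before '#'.
    """
    shared = {}
    for line in interrupts_text.splitlines():
        m = ROW_RE.match(line)
        if not m:
            continue  # header, shell prompt, or an unrecognised row
        names = [n.strip() for n in m.group("drivers").split(",") if n.strip()]
        if any(n.split("#")[0] == driver for n in names):
            shared[int(m.group("irq"))] = [
                n for n in names if n.split("#")[0] != driver
            ]
    return shared
```

Run against the two listings above, this should report ehci#0 as the only sharer on the Optiplex, and uhci#3 and uhci#0 on the Pentium-4 machine.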
Updated by Ken Mays almost 11 years ago
Best to install Solaris 11 on one of your test machines and try to reproduce your hardware issue from there using the default Nvidia driver (280.13). You can also try to install the newer Nvidia 295.33 driver on oi_151a2 on your other test system. This will give you two different test baselines to work from and help isolate the issue. Solaris 11 is more resilient with IRQ issues so try that first if updating to the Nvidia 295.33 driver does not work for you.
Updated by Marion Hakanson almost 11 years ago
> Best to install Solaris 11 on one of your test machines and try to
> reproduce your hardware issue from there using the default Nvidia
> driver (280.13). You can also try to install the newer Nvidia 295.33 driver
> on oi_151a2 on your other test system. This will give you two different test
> baselines to work from and help isolate the issue. Solaris 11 is more
> resilient with IRQ issues so try that first if updating to the Nvidia 295.33
> driver does not work for you.
Thanks for the suggestions, Ken.
I've got driver 295.33 installed; we'll see if the issue continues to occur.
I'm not sure installing Solaris-11 is going to be feasible for me, except in a live image scenario. These "test machines" are my two daily-use desktop machines (work and home). Perhaps I'm being a bit dense, but is there a way using IPS to install Solaris-11 in a new BE on my oi root pool, without doing a full "nuke and repave" of the existing rpool? Will "pkg image-create" do that?
Also, when I do have the system in "no interrupts to the nvidia driver" mode, are there some useful details I can extract using "mdb -k"?