Bug #1625

Xorg hang (100% CPU), nvidia-related

Added by Marion Hakanson about 8 years ago. Updated over 7 years ago.

Status: Closed
Priority: Low
Assignee:
Category: XNV (X Window System)
Target version:
Start date: 2011-10-11
Due date:
% Done: 0%
Estimated time:
Difficulty: Hard
Tags: nvidia

Description

This Xorg hang is happening on two of my systems. Both run OI-151a now, and both have the nVidia driver version 280.13. The problem also happened under OI-148 with the same nVidia driver, and with several previous driver revisions (I download them directly from the nVidia site). One machine is a 64-bit Intel Core i7 w/8GB RAM, GeForce 7300GT; the other is a 32-bit Intel P4 w/2GB RAM, GeForce 6200.

When the problem occurs, mouse and keyboard input are ignored, but one can still log in remotely. The Xorg process appears to be using 100% CPU, although "prstat -m" actually shows it as 33% USR, 33% LCK, 33% SLP. Restarting gdm is usually not sufficient: I think the nVidia card or driver is left in an unhappy state, with some blocks of black & white bars on the screen after Xorg exits, so I usually just reboot the system.

Also, the Xorg.0.log shows entries like the ones below at the end, prior to killing the Xorg process:

(WW) Oct 04 10:24:40 NVIDIA: WAIT (2, 6, 0x8000, 0x0000c290, 0x0000ca08)
(WW) Oct 04 10:24:47 NVIDIA: WAIT (1, 6, 0x8000, 0x0000c290, 0x0000ca08)
. . .

At the same time, this shows up in /var/adm/messages:

Oct 4 10:24:44 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 16, Head 00000000 Count 00fdd77b
Oct 4 10:24:44 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 16, Head 00000001 Count 00fdd74c
Oct 4 10:24:52 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 16, Head 00000000 Count 00fdd77c
Oct 4 10:24:52 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 16, Head 00000001 Count 00fdd74d
Oct 4 10:24:53 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 8, Channel 0000001e
Oct 4 10:25:02 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 8, Channel 00000020
Oct 4 10:25:02 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 16, Head 00000000 Count 00fdd77e
Oct 4 10:25:02 kyklops nvidia: [ID 702911 kern.notice] NVRM: Xid (0000:01:00): 16, Head 00000001 Count 00fdd74f
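
For anyone sifting through a longer /var/adm/messages, a small script can tally how often each NVRM Xid code appears per PCI device. This is only a sketch: the regular expression is inferred from the lines quoted above, the helper name `tally_xids` is made up for illustration, and other driver versions may format these messages differently.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this report (assumption:
# other nVidia driver versions may word NVRM messages differently).
XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[0-9a-fA-F:.]+)\): (?P<code>\d+),")


def tally_xids(lines):
    """Count (PCI device, Xid code) pairs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group("dev"), int(match.group("code")))] += 1
    return counts
```

Passing an open file object for the messages log to `tally_xids` returns a `Counter` keyed by device and Xid code, which makes it easier to compare one incident with the next.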

Note that the problem is difficult to replicate on demand. It seems to happen at random times, and probably at or near the time of a window focus change. It does not appear to be necessary to be viewing flash video or a web page, or using any particular X application. I do have "compiz" window-manager effects enabled.

Let me know what kind of debug information I can collect, and how.


Files

Xorg.0.log.old (13.5 KB), Marion Hakanson, 2011-10-11 01:05 AM

Related issues

Related to OpenIndiana Distribution - Bug #2446: Update NVDAgraphics to 295.33 (Resolved, 2012-03-23, 2012-05-08)

History

#1

Updated by Chris Jordan about 8 years ago

  • Category set to XNV (X Window System)
  • Assignee set to OI XNV
  • Difficulty changed from Medium to Hard
  • Tags changed from needs-triage to nvidia
#2

Updated by Marion Hakanson about 8 years ago

I've been using the new nVidia driver version 285.05.09 for a little over a week, with no hangs since then. As I said, I don't know how to trigger it on demand, so there's no guarantee the problem is gone forever, but if it's not "fixed", the frequency seems reduced.

I did not find any mention, in the 285.05.09 README or release notes, of a bug fix that seems to match the problem I describe.

#3

Updated by Marion Hakanson about 8 years ago

My previous entry was wishful thinking. Suffered another Xorg-busy hang this evening, and had the presence of mind to force a kernel crash dump via "reboot -d".

Is there somewhere I can upload the "vmdump.0" file so someone more expert than I am can have a look? I haven't poked around in it myself yet, will try to get a chance to do that tomorrow.

#4

Updated by Marion Hakanson almost 8 years ago

A few new pieces of information:

(1) I've been repeatedly failing to get a good crash dump. Turns out this was due to bug #1369, so see that bug report for the workaround.

(2) When the hang happens, if instead of rebooting you just kill the Xorg process, when Xorg tries to restart it fails, complaining that the nvidia card is not receiving interrupts. On my two machines, other devices sharing the same IRQ also fail to receive any interrupts when in this state.

(3) The problem still occurs in the stable release candidate oi_151a2, with the latest nVidia driver, version 295.20. This despite the new driver claiming to have fixed a bug which causes hangs on OpenGL systems, including X11/compiz machines like these.

(4) I do finally have a successful crash dump from the latest incident in (3) above. Here's an excerpt of an "mdb -k" session:

> ::cpuinfo
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc304a0  1f    0    0  10   no    no t-58   ffffff02d686cba0 Xorg
  1 ffffff02d4d61040  1f    0    0  -1   no    no t-3    ffffff000f696c40 (idle)
  2 ffffff02d50f0b00  1f    0    0  -1   no    no t-1    ffffff000f505c40 (idle)
  3 ffffff02d50ed500  1f    0    0  -1   no    no t-1    ffffff000f59dc40 (idle)
  4 ffffff02d50ec000  1f    0    0  -1   no    no t-20   ffffff000f564c40 (idle)
  5 fffffffffbc3aca0  1b    0    0  59   no    no t-2    ffffff02debef7c0 reboot
  6 ffffff02d50e2080  1f    1    0  -1   no    no t-5    ffffff000fbfec40 (idle)
  7 ffffff02d5269000  1f    0    0  -1   no    no t-5    ffffff000fc7fc40 (idle)
> ffffff02d686cba0::findstack
stack pointer for thread ffffff02d686cba0: ffffff000fcd9a00
  ffffff000fcd9a80 xc_serv+0x186()
  ffffff000fcd9b10 0xffffff02d686cba0()
  ffffff000fcd9b40 restorecontext+0x13b()
  ffffff000fcd9f00 getsetcontext+0x227()
  ffffff000fcd9f10 _interrupt+0xba()
> 

Perhaps this isn't actually an nvidia-related problem, but instead some kind of interrupt-handling problem. I would welcome suggestions on what to look for in the crash dump I've got, or I can make it available for download if someone wants to take a crack at it themselves.

#5

Updated by Marion Hakanson almost 8 years ago

My system was in this state again when I came in to work this morning. I can confirm that (according to "intrstat 1") the nvidia driver was not receiving any interrupts (nor was the ehci#0 driver which shares the same IRQ).

On this system, enabling MSI for the nvidia driver appears to have no effect.

Marion

#6

Updated by Ken Mays over 7 years ago

  • Status changed from New to Closed
  • Priority changed from Normal to Low
  • Target version set to oi_151_stable

Marion - I checked for this issue. IMHO, install Solaris 11 and check OS stability from that point. I closed this ticket because I cannot reproduce the issue with oi_151a Xnv or Nvidia 280.13/295.20 drivers on my test hardware at this time.

#7

Updated by Marion Hakanson over 7 years ago

Ken, thanks for looking into this. I understand your reasons for closing this item -- heck, I can't even reproduce it on demand, yet it does continue to happen on a regular basis on at least one of my two desktop machines.

As was hinted at above, I'm wondering if this is in fact not Xorg or nvidia specific, but rather a case of lost interrupts peculiar to my hardware. Both of these systems have the nvidia driver sharing the same IRQ with some other onboard devices (keyboard, mouse, disk, or ehci drivers). In fact, when I moved my older machine to oi_151a2 (prestable), the IRQ assignments appear to have changed (the nvidia driver no longer shares with ehci and pci-ide), and I haven't had this problem on that machine since.

If I could get some guidance on troubleshooting or diagnosing a system in the state where drivers are no longer receiving interrupts, I believe I could make some progress with this problem.

Could this issue be re-classified appropriately, and kept open until I give up on it?

Dell Optiplex 980:

# echo "::interrupts -d" | mdb -k
IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# Driver Name(s) 
1    0x41 5   ISA    Edg Fixed  4   1     0x0/0x1   i8042#0
9    0x80 9   PCI    Lvl Fixed  1   1     0x0/0x9   acpi_wrapper_isr
11   0xd1 14  PCI    Lvl Fixed  2   1     0x0/0xb   hpet_isr
12   0x42 5   ISA    Edg Fixed  5   1     0x0/0xc   i8042#0
16   0x82 9   PCI    Lvl Fixed  7   2     0x0/0x10  ehci#0, nvidia#0
23   0x83 9   PCI    Lvl Fixed  0   1     0x0/0x17  ehci#1
24   0x40 5   PCI    Edg MSI    3   1     -         ahci#0
25   0x81 7   PCI    Edg MSI    6   1     -         pcieb#0
26   0x60 6   PCI    Edg MSI    1   1     -         e1000g#0
32   0x20 2          Edg IPI    all 1     -         cmi_cmci_trap
160  0xa0 0          Edg IPI    all 0     -         poke_cpu
208  0xd0 14         Edg IPI    all 1     -         kcpc_hw_overflow_intr
209  0xd3 14         Edg IPI    all 1     -         cbe_fire
210  0xd4 14         Edg IPI    all 1     -         cbe_fire
240  0xe0 15         Edg IPI    all 1     -         xc_serv
241  0xe1 15         Edg IPI    all 1     -         apic_error_intr
# 

Old Pentium-4 machine:

# echo "::interrupts -d" | mdb -k
IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# Driver Name(s) 
1    0x41 5   ISA    Edg Fixed  0   1     0x0/0x1   i8042#0
4    0xb0 12  ISA    Edg Fixed  0   1     0x0/0x4   asy#0
6    0x44 5   ISA    Edg Fixed  0   1     0x0/0x6   fdc#0
7    0x45 5   ISA    Edg Fixed  0   1     0x0/0x7   ecpp#0
9    0x80 9   PCI    Lvl Fixed  0   1     0x0/0x9   acpi_wrapper_isr
12   0x42 5   ISA    Edg Fixed  0   1     0x0/0xc   i8042#0
15   0x43 5   ISA    Edg Fixed  0   1     0x0/0xf   ata#1
16   0x81 9   PCI    Lvl Fixed  0   3     0x0/0x10  uhci#3, uhci#0, nvidia#0
18   0x84 9   PCI    Lvl Fixed  0   4     0x0/0x12  uhci#2, e1000g#0, pci-ide#1, pci-ide#1
19   0x83 9   PCI    Lvl Fixed  0   1     0x0/0x13  uhci#1
22   0x40 5   PCI    Lvl Fixed  0   1     0x0/0x16  pci-ide#2
23   0x82 9   PCI    Lvl Fixed  0   1     0x0/0x17  ehci#0
160  0xa0 0          Edg IPI    all 0     -         poke_cpu
208  0xd0 14         Edg IPI    all 1     -         kcpc_hw_overflow_intr
209  0xd1 14         Edg IPI    all 1     -         cbe_fire
210  0xd3 14         Edg IPI    all 1     -         cbe_fire
240  0xe0 15         Edg IPI    all 1     -         xc_serv
241  0xe1 15         Edg IPI    all 1     -         apic_error_intr
# 
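
To check quickly which IRQs the nvidia driver shares with other devices, the listings above can be filtered with a few lines of Python. This is a rough sketch rather than a supported tool: `shared_irqs` is a hypothetical helper name, and it assumes the column layout shown in the two listings above, with the comma-separated driver names in the last column.

```python
def shared_irqs(interrupts_output, driver="nvidia"):
    """Return {irq: [driver instances]} for IRQs that `driver` shares.

    Expects the text printed by `echo "::interrupts -d" | mdb -k` in the
    column layout shown in this report (an assumption, not a stable format).
    """
    result = {}
    for line in interrupts_output.splitlines():
        # Nine fixed columns, then the driver list as the remainder.
        fields = line.split(None, 9)
        # Skips the header, shell prompts, and wrapped continuation lines.
        # IPI rows have an empty Bus column, so they yield fewer fields and
        # are skipped too; they are not device interrupts anyway.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        names = [n.strip() for n in fields[9].split(",") if n.strip()]
        # Driver instances look like "nvidia#0"; compare the part before "#".
        if len(names) > 1 and any(n.split("#")[0] == driver for n in names):
            result[int(fields[0])] = names
    return result
```

Run against the Dell Optiplex 980 listing, this reports IRQ 16 as shared by ehci#0 and nvidia#0; against the Pentium-4 listing, IRQ 16 is shared by uhci#3, uhci#0 and nvidia#0.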

#8

Updated by Ken Mays over 7 years ago

Best to install Solaris 11 on one of your test machines and try to reproduce your hardware issue from there using the default Nvidia driver (280.13). You can also try to install the newer Nvidia 295.33 driver on oi_151a2 on your other test system. This will give you two different test baselines to work from and help isolate the issue. Solaris 11 is more resilient with IRQ issues so try that first if updating to the Nvidia 295.33 driver does not work for you.

#9

Updated by Marion Hakanson over 7 years ago

> Best to install Solaris 11 on one of your test machines and try to
> reproduce your hardware issue from there using the default Nvidia
> driver (280.13). You can also try to install the newer Nvidia 295.33 driver
> on oi_151a2 on your other test system. This will give you two different test
> baselines to work from and help isolate the issue. Solaris 11 is more
> resilient with IRQ issues so try that first if updating to the Nvidia 295.33
> driver does not work for you.

Thanks for the suggestions, Ken.

I've got driver 295.33 installed; we'll see if the issue continues to occur.

I'm not sure installing Solaris 11 is going to be feasible for me, except in a live image scenario. These "test machines" are my two daily-use desktop machines (work and home). Perhaps I'm being a bit dense, but is there a way using IPS to install Solaris 11 in a new BE on my oi root pool, without doing a full "nuke and repave" of the existing rpool? Will "pkg image-create" do that?

Also, when I do have the system in "no interrupts to the nvidia driver" mode, are there some useful details I can extract using "mdb -k"?
