Bug #13067

open

Regression between Hipster-November2019 install media and 2020-07: 'ls /devices' hangs

Added by carton . about 1 year ago. Updated 8 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

On a boot environment fresh from the standard November 2019 install media, not pkg image-update'd, the following things work as intended:

- pressing the power button for two seconds powers down after about 30 seconds.
- 'reboot' restarts the machine
- 'ls /devices' and 'ls /dev/rdsk' work
- 'prtconf -v' works (output attached)
- 'format -M' works (it doesn't print any debug info. I guess I'm holding it wrong.)
- 'mdb -k' gives its REPL prompt as root, or gives an error if run as non-root
- 'zpool status' works

After pkg image-update:

- pressing the power button still works the same as it did
- 'reboot' hangs. The machine still echoes newlines when I press return and still responds to pings, but won't accept a new ssh connection after running 'reboot', and even left overnight it will not reboot.
- 'ls /devices' and 'ls /dev/rdsk' hang.
- 'prtconf -v' hangs.
- 'format -M' hangs without printing any debug info.
- 'mdb -k' hangs, whether run as root or non-root
- 'zpool status' works if none of the above commands has been attempted. If any has, 'zpool status' also hangs.

I'm able to switch between boot environments with 'beadm activate' whether the system has hung in one of these commands or not.

When one of the above commands hangs, I can still ssh into the system and run other commands. The only way to reboot is with the power button, though.
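
Since a hung command ties up the shell it was started from, a small wrapper that gives up after a deadline makes it easier to record which commands hang in a given boot environment. The following is a minimal sketch in Python and is not part of the original report: `probe`, `probe_all`, `REPORT_COMMANDS`, and the 10-second deadline are hypothetical names and values chosen for illustration. It polls the child instead of waiting on it, on the assumption that a process blocked in the kernel may not respond to a kill.

```python
import subprocess
import time

# Non-interactive commands from the report. 'zpool status' comes first
# because the report says it only works before any of the others is tried.
REPORT_COMMANDS = [
    ["zpool", "status"],
    ["ls", "/devices"],
    ["ls", "/dev/rdsk"],
    ["prtconf", "-v"],
]


def probe(cmd, timeout_s=10.0, poll_s=0.1):
    """Run cmd and classify the result as 'ok', 'error', or 'hang'."""
    try:
        child = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The command is missing or not executable.
        return "error"

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = child.poll()
        if status is not None:
            return "ok" if status == 0 else "error"
        time.sleep(poll_s)

    # Best effort only: ask the child to die, but never block waiting
    # for it, since it may be stuck inside the kernel.
    try:
        child.kill()
    except OSError:
        pass
    return "hang"


def probe_all(commands, timeout_s=10.0):
    """Probe each command in order and return {command line: verdict}."""
    return {" ".join(cmd): probe(cmd, timeout_s) for cmd in commands}
```

Calling `probe_all(REPORT_COMMANDS)` once in each boot environment would give a comparable table of verdicts. 'format -M' and 'mdb -k' are left out because they are interactive.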

Unfortunately, because almost every interaction with the system results in a hang, I couldn't gather debug info using any normal commands. 'truss' either doesn't work or, in the case of 'truss format', shows something 'ls /devices'-like as the last syscall.

'dmesg' does work on the updated system after a command has hung, but no new message is emitted when the hang occurs.

I'm happy to attempt whatever you ask.

In a different world I could bisect, but I don't think the repository has retained all packages between the install media and the present, has it? I think it has not because I'm unable to install programs like rpc.bootparamd without image-updating and was guessing that's because the packages matching the install media have been gc'd. Maybe it would've been smarter to keep a thinned out history so unupdated machines could install packages, and we could do binary bisects with at least month granularity.


Files

hipster-november2019-prtconf_-V-eidothea.txt (393 KB): prtconf -V output on November 2019 not-updated boot environment. carton ., 2020-08-20 01:55 PM
crash.0 (577 KB): forced panic while hung: ::panicinfo, ::cpuinfo -v, ::threadlist -v 10, ::msgbuf, *panic_thread::findstack -v, ::stacks. carton ., 2020-09-10 11:25 PM
Actions #1

Updated by Robert Mustacchi about 1 year ago

Can you generate an NMI? If not, we may be able to boot with kmdb and drop into the debugger once the system is hung, then force a kernel panic that we can look at.

Actions #2

Updated by carton . about 1 year ago

I can't generate an NMI, but I do have a PS/2 keyboard. The instructions here:

https://illumos.org/docs/user-guide/debug-systems/#gathering-information-from-a-running-system-using-only-nmi-x86

seem outdated because we now seem to have a great FreeBSD-like FORTH-speaking(?) loader instead of grub, but the new loader has menu options to enable kmdb so I used the menus in lieu of the web page. I did this:

1. enable kmdb in 'loader', "on boot", and change console from graphics to text
2. as web page instructs,

   [0]> moddebug/W 80000000
   [0]> snooping/W 1
   [0]> :c

3. login with ssh (it shows pty modules loaded when I did this)
4. in shell

$ ls /dev/dsk &   # <-- the command hangs, but the system overall keeps mostly-working, as the bug describes.

5. F1-A
6. in kmdb

    [0]> $<systemdump    (succeeds 100%)

7. reboot into Hipster-November2019
8. savecore
9. savecore -vf vmdump.0
10. reboot into a boot environment updated to 2020-08-17
11. in shell

$ echo '::panicinfo\n::cpuinfo -v\n::threadlist -v 10\n::msgbuf\n*panic_thread::findstack -v\n::stacks' | mdb 0 > ~/crash.0
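
The one-liner in step 11 depends on 'echo' expanding '\n', which varies between shells. A sketch that avoids this by writing the dcmds one per line to mdb's standard input is below. It is an illustration in Python, not what was actually run: `DCMDS`, `mdb_script`, and `run_mdb` are hypothetical names, and it assumes it is started in the directory that holds unix.0 and vmcore.0.

```python
import subprocess

# The dcmds from step 11, one per entry.
DCMDS = [
    "::panicinfo",
    "::cpuinfo -v",
    "::threadlist -v 10",
    "::msgbuf",
    "*panic_thread::findstack -v",
    "::stacks",
]


def mdb_script(dcmds):
    """Join dcmds into newline-terminated text for mdb's stdin."""
    return "".join(dcmd + "\n" for dcmd in dcmds)


def run_mdb(dump_suffix, dcmds, out_path):
    """Feed dcmds to 'mdb <suffix>' and save its stdout to out_path.

    Like the shell redirection in step 11, only stdout is captured.
    """
    with open(out_path, "w") as out:
        subprocess.run(
            ["mdb", str(dump_suffix)],
            input=mdb_script(dcmds),
            stdout=out,
            text=True,
            check=True,
        )
```

With this, `run_mdb(0, DCMDS, "crash.0")` would be the equivalent of the command in step 11, and further dcmds can be added to the list without worrying about quoting.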

Only the 'crash.0' output from the command above is attached. I also have vmdump.0, vmcore.0, and unix.0 on my end, but they are larger than 4 MB, so I can't attach them to the bug.

I tried the mdb command under Hipster-November2019, but it printed errors, so I reran mdb under the same system version that wrote the dump. I did run savecore under the older system, so hopefully that's okay.

Let me know if more crash-dump information would help and how to generate it, for example if you want me to run more mdb commands and attach the output, or send the dumps themselves out of band from Redmine.

Actions #3

Updated by Dan McDonald 8 months ago

One of the stacks in your mdb output is fascinating. There's an InfiniBand device (using the "hermon" driver) on your machine, and it seems that there's a thread hung in device-configuration of InfiniBand:

fffffe0010c51c20 fffffffffbc48080                0   0  60 fffffffffbc4936c
  PC: _resume_from_idle+0x12b
  stack pointer for thread fffffe0010c51c20 (mt_config_thread()): fffffe0010c51740
  [ fffffe0010c51740 _resume_from_idle+0x12b() ]
    swtch+0x133()
    cv_wait+0x68(fffffffffbc4936c, fffffffffbd0b988)
    thread_join+0x37(0)
    sdpib_detach+0x93()
    devi_detach+0x88(fffffe0bdc1ae350, 0)
    detach_node+0x49(fffffe0bdc1ae350, 2000)
    i_ndi_unconfig_node+0x171(fffffe0bdc1ae350, 4, 2000)
    i_ddi_detachchild+0x33(fffffe0bdc1ae350, 2000)
    devi_detach_node+0xd1(fffffe0bdc1ae350, 2000)
    unconfig_immediate_children+0x176(fffffe0bd2c3e898, 0, 2000, ffffffff)
    ndi_busop_bus_unconfig+0x75(fffffe0bd2c3e898, 2000, 7, ffffffff)
    ibnex_bus_unconfig+0x35(fffffe0bd2c3e898, 2000, 7, ffffffff)
    devi_unconfig_common+0x13b(fffffe0bd2c3e898, 0, 2000, ffffffff, 0)
    mt_config_thread+0x15f(fffffe0c1e1f8c30)
    thread_start+0xb()
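
A stack like the one above can be found mechanically in a large crash.0 by grouping frames per thread and searching for a function name. The following is a rough Python sketch under stated assumptions: it assumes each thread's frames follow a "stack pointer for thread" line and look like `name+0xoffset(`, as in the excerpt above. `parse_stacks` and `threads_with` are hypothetical helper names.

```python
import re

THREAD_RE = re.compile(r"stack pointer for thread ([0-9a-f]+)")
FRAME_RE = re.compile(r"([A-Za-z_][\w.`]*)\+0x[0-9a-f]+\(")

# Shortened excerpt of the stack quoted above.
STACK_SAMPLE = """\
fffffe0010c51c20 fffffffffbc48080                0   0  60 fffffffffbc4936c
  PC: _resume_from_idle+0x12b
  stack pointer for thread fffffe0010c51c20 (mt_config_thread()): fffffe0010c51740
  [ fffffe0010c51740 _resume_from_idle+0x12b() ]
    swtch+0x133()
    cv_wait+0x68(fffffffffbc4936c, fffffffffbd0b988)
    thread_join+0x37(0)
    sdpib_detach+0x93()
"""


def parse_stacks(text):
    """Map each thread address to the function names on its stack."""
    stacks = {}
    current = None
    for line in text.splitlines():
        header = THREAD_RE.search(line)
        if header:
            current = stacks.setdefault(header.group(1), [])
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current.append(frame.group(1))
        elif not line.strip():
            # A blank line ends the current thread's block.
            current = None
    return stacks


def threads_with(stacks, function):
    """Return the addresses of threads whose stack contains function."""
    return [addr for addr, frames in stacks.items() if function in frames]
```

Running `threads_with(parse_stacks(open("crash.0").read()), "sdpib_detach")` would list every thread sitting in that detach path, which may help show whether other hung threads are waiting behind it.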

The prtconf -v output confirms the Mellanox device that has IB on it.


                    name='vendor-id' type=int items=1
                        value=000015b3
                    name='device-id' type=int items=1
                        value=0000634a
                    name='vendor-name' type=string items=1
                        value='Mellanox Technologies'
                    name='device-name' type=string items=1
                        value='MT25408A0-FCC-DI ConnectX, Dual Port 20Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 2.5GT/s Interface'

But a quick look in the source suggests nothing has changed in this code (hermon, sdp, or ib) for a while.
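
For checking a full 'prtconf -v' dump from another boot environment for the same adapter, the name/value pairs quoted above can be extracted with a short script. This is only a sketch in Python: `parse_properties` is a hypothetical helper, and it assumes int values are printed in hex with multiple items separated by '.', which matches the excerpt but is not verified against every property type.

```python
import re

NAME_RE = re.compile(r"name='([^']*)' type=(\w+)")
VALUE_RE = re.compile(r"value=(.*)")

# Excerpt of the properties quoted above.
PRTCONF_SAMPLE = """\
                    name='vendor-id' type=int items=1
                        value=000015b3
                    name='device-id' type=int items=1
                        value=0000634a
                    name='vendor-name' type=string items=1
                        value='Mellanox Technologies'
"""


def parse_properties(text):
    """Return (name, value) pairs in the order they appear."""
    props = []
    pending = None  # (name, type) waiting for its value line
    for line in text.splitlines():
        stripped = line.strip()
        name = NAME_RE.match(stripped)
        if name:
            pending = (name.group(1), name.group(2))
            continue
        value = VALUE_RE.match(stripped)
        if value and pending:
            prop_name, kind = pending
            raw = value.group(1).strip()
            if kind == "int":
                try:
                    # Assumption: hex items separated by '.'.
                    parsed = [int(part, 16) for part in raw.split(".")]
                except ValueError:
                    parsed = raw
            else:
                parsed = raw.strip("'")
            props.append((prop_name, parsed))
            pending = None
    return props
```

Filtering the result for a `vendor-id` of `[0x15b3]` would then pick out the Mellanox entries in each dump.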
