Bug #13067

Regression between Hipster-November2019 install media and 2020-07: 'ls /devices' hangs

Added by carton . 3 months ago. Updated 3 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

On a boot environment fresh from the standard November 2019 install media, not pkg image-update'd, the following things work as intended:

- pressing the power button for two seconds powers down after about 30 seconds.
- 'reboot' restarts the machine
- 'ls /devices' and 'ls /dev/rdsk' work
- 'prtconf -v' works (output attached)
- 'format -M' works (it doesn't print any debug info. I guess I'm holding it wrong.)
- 'mdb -k' gives its REPL prompt as root, or gives an error if run as non-root
- 'zpool status' works

After pkg image-update:

- pressing the power button still works the same as it did
- 'reboot' hangs. The machine still echoes newlines when I press return and still responds to pings, but won't accept a new ssh connection after running 'reboot', and even left overnight it will not reboot.
- 'ls /devices' and 'ls /dev/rdsk' hang.
- 'prtconf -v' hangs.
- 'format -M' hangs without printing any debug info.
- 'mdb -k' hangs, whether run as root or non-root
- 'zpool status' works if none of the above commands have been attempted. If any have, 'zpool status' also hangs.
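One way to repeat the checks above without tying up the login shell is to run each command in the background and report whether it returns within a time limit. This is a minimal sketch, not something from the original report: the function name 'probe' and the time limits are hypothetical, and it uses only POSIX shell features. It deliberately does not try to kill a command that is still running when the limit expires, since the report indicates that only the power button recovers the system.

```shell
#!/bin/sh
# probe SECONDS COMMAND [ARGS...]   (hypothetical helper, not part of illumos)
# Runs COMMAND in the background with its output discarded and reports
# whether it exited within SECONDS. A command still running at the limit
# is reported as hung and left alone.
probe() {
    limit=$1
    shift
    "$@" >/dev/null 2>&1 &
    pid=$!
    waited=0
    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "HUNG after ${limit}s: $*"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # The command has exited; collect and report its exit status.
    wait "$pid"
    echo "returned (exit $?): $*"
}
```

For example, 'probe 10 zpool status' followed by 'probe 10 ls /devices'. Order matters here: per the list above, 'zpool status' only works if none of the hanging commands has been attempted first.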

I'm able to switch between boot environments with 'beadm activate' whether the system has hung in one of these commands or not.

When one of the above commands hangs, I can still ssh into the system and run other commands. The only way to reboot is with the power button, though.

Unfortunately, because almost all interaction with the system results in a hang, I couldn't gather debug info using any normal commands. 'truss' either doesn't work or, in the case of 'truss format', shows something 'ls /devices'-like as the last syscall.

'dmesg' does work on the updated system after a command has hung, but no new message is emitted when the hang occurs.

I'm happy to attempt whatever you ask.

In a different world I could bisect, but I don't think the repository has retained all packages between the install media and the present, has it? I think it has not because I'm unable to install programs like rpc.bootparamd without image-updating and was guessing that's because the packages matching the install media have been gc'd. Maybe it would've been smarter to keep a thinned out history so unupdated machines could install packages, and we could do binary bisects with at least month granularity.


Files

hipster-november2019-prtconf_-V-eidothea.txt (393 KB): prtconf -V output on November 2019 not-updated boot environment. carton ., 2020-08-20 01:55 PM
crash.0 (577 KB): forced panic while hung: ::panicinfo, ::cpuinfo -v, ::threadlist -v 10, ::msgbuf, *panic_thread::findstack -v, ::stacks. carton ., 2020-09-10 11:25 PM

#1

Updated by Robert Mustacchi 3 months ago

Do you have the ability to generate an NMI? If not, we may be able to boot with kmdb and drop into the debugger when it's hung to force a kernel panic that we can look at.

#2

Updated by carton . 3 months ago

I can't generate an NMI, but I do have a PS/2 keyboard. The instructions here:

https://illumos.org/docs/user-guide/debug-systems/#gathering-information-from-a-running-system-using-only-nmi-x86

seem outdated because we now seem to have a great FreeBSD-like FORTH-speaking(?) loader instead of grub, but the new loader has menu options to enable kmdb so I used the menus in lieu of the web page. I did this:

1. enable kmdb in 'loader', "on boot", and change console from graphics to text
2. as web page instructs,

   [0]> moddebug/W 80000000
   [0]> snooping/W 1
   [0]> :c

3. login with ssh (it shows pty modules loaded when I did this)
4. in shell

$ ls /dev/dsk &   # <-- the command hangs, but the system overall keeps mostly-working, as the bug describes.

5. F1-A
6. in kmdb

    [0]> $<systemdump    (succeeds 100%)

7. reboot into Hipster-November2019
8. savecore
9. savecore -vf vmdump.0
10. reboot into a boot environment updated to 2020-08-17
11. in shell

$ echo '::panicinfo\n::cpuinfo -v\n::threadlist -v 10\n::msgbuf\n*panic_thread::findstack -v\n::stacks' | mdb 0 > ~/crash.0

Only the 'crash.0' output from above is attached. I also have vmdump.0, vmcore.0, and unix.0 available on my end, but they are >4MByte so I can't attach them to the bug.

I tried the mdb command under Hipster-November2019, but it printed errors, so I reran mdb under the same system version that wrote the dump. I did run savecore under the older system, so hopefully that's okay.
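A note on the mdb invocation in step 11: it relies on 'echo' expanding '\n' into newlines, which depends on the shell (bash's built-in echo, for instance, does not do so by default). A sketch of a more portable way to feed the same dcmds to mdb uses printf. The function name 'crash_dcmds' is hypothetical; the dcmd list is copied from step 11.

```shell
#!/bin/sh
# crash_dcmds (hypothetical helper): print one mdb dcmd per line.
# printf '%s\n' emits each argument followed by a newline in any POSIX
# shell, so the result does not depend on how echo treats backslashes.
crash_dcmds() {
    printf '%s\n' \
        '::panicinfo' \
        '::cpuinfo -v' \
        '::threadlist -v 10' \
        '::msgbuf' \
        '*panic_thread::findstack -v' \
        '::stacks'
}

# Usage, as in step 11 (assumes the dump files written by savecore are in
# the current directory):
#   crash_dcmds | mdb 0 > ~/crash.0
```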

Let me know if more crashdump-like info would be helpful and how to generate it: for example, if you want me to run more mdb commands and attach the output, or send the dumps themselves out-of-band from Redmine.
