Feature #957

Need a way to enforce a reboot (reset-like) from software

Added by Jim Klimov over 8 years ago. Updated over 7 years ago.

Status: Feedback
Priority: Normal
Assignee: -
Category: -
Start date: 2011-04-25
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

As described in my other bug report (956), some infinite loop in ZFS or some other bug prevents a pool from being manipulated or exported. In turn, this prevents the system from rebooting with its current methods, because even with "reboot -qnl" it still tries to export the pool and hangs indefinitely.

This is particularly annoying because I'm now thousands of miles away from the box, and there isn't always someone nearby to come and press "reset". (It's not in a datacenter; there are no remotely controllable UPSes or PDUs to reset it, and no IPMI card either. It is a home-class PC in an apartment.)

From my Linux experience I remember that it has some Magic SysRQ keyboard sequences to cause sync, reboot, process and stack dumps to console, etc. - and these were very useful features regarding flaky hardware such as early Promise IDE controllers which occasionally lost the hard disks but found them during BIOS reboot sequence. Most useful was the interface to these keyboard sequences from software, like:
  1. echo b > /proc/sysrq-trigger
    This particular command would be the same as pressing "Alt + SysRq + B" on the console keyboard, and if the kernel is somewhat alive - it would ungracefully reboot the machine (kill kernel, return to BIOS).
    It was particularly useful as it's accessible over SSH, and works even if "reboot" doesn't (i.e. the system disk with /bin has disappeared due to the controller, but the SSH session was opened a week ago and still works as a process).
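For illustration, here is a minimal Python sketch of the Linux interface described above, as seen from a program rather than a shell. The helper name `sysrq` and its `path` parameter are assumptions made for this example (the parameter exists only so the function can be pointed at a scratch file). Python itself allocates memory, so this shows the interface, not a malloc-free resetter.

```python
import os

SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # Linux-only; writing requires root


def sysrq(command, path=SYSRQ_TRIGGER):
    """Write a single SysRq command character (e.g. "b") to the trigger file."""
    if len(command) != 1:
        raise ValueError("SysRq command must be a single character")
    # Plain open/write/close on a file descriptor: no buffering layer,
    # so the byte is handed to the kernel as soon as write() is called.
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, command.encode("ascii"))
    finally:
        os.close(fd)


# sysrq("s")  # would request an emergency sync
# sysrq("b")  # would reboot immediately, without syncing or unmounting
```

The two calls are left commented out on purpose: on a real Linux box running as root, the second one resets the machine.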

Now these old-looking problems of "can't reboot because disk is hanging" make me wish for a similar interface in Solaris - or in OpenIndiana, as this is what I expect to use in the coming years...

Link: http://en.wikipedia.org/wiki/Magic_SysRq_key

History

#1

Updated by Rich Lowe over 8 years ago

You may be able to do this via uadmin, though I'm not sure. See uadmin(1M) and uadmin(2).
(it's not the most graceful of interfaces)

This should evade basically all reboot processing that happens in userland, I'm not sure where the stuff you're trying to avoid is run.

#2

Updated by Albert Lee over 8 years ago

  • Project changed from OpenIndiana Distribution to illumos gate
#3

Updated by Jim Klimov over 8 years ago

Thanks, I'll try uadmin if I manage to catch my system dying in real-time one way or another, to see if it solves my problem at the moment.

It seems from manpages that indeed the reboot subcommands (unlike shutdown subcommands) should do a quick and ungraceful job. Hopefully it would also kill the all-consuming loops in kernel which hang my system (scanrate hell and/or pool block traversing, as I think now).

However this does not alleviate the desire for a MagicSysRQ-like interface via console and a purely software interface to kernel like /proc/sysrq-trigger which would not require spawning a process.

Yesterday when my system was caught descending into scanrate hell, it could not fork a "swap" process to add some virtual memory, although a shell command "echo b > /proc/sysrq-trigger" would have probably worked.

#4

Updated by Garrett D'Amore over 8 years ago

I once wrote a program called "erbd" at qualcomm that was a daemon that stuck around just for this reason -- it would do a reboot without doing any malloc()'s or fork()'s.

Now that said, there is a magic break sequence, which can be enabled in the keyboard driver. F1-A on a PC keyboard -- it is only normally enabled when you activate kmdb (boot with -k) for kernel debugging. See kbd(1m) for details on this.

There's also a way to set up a hardware watchdog if your system is equipped with one.

At the end of the day, there are some software problems for which there is no way to recover -- e.g. a loop in a high level interrupt context.

As a result, I'm not a big fan of this kind of approach.

#5

Updated by Garrett D'Amore over 8 years ago

  • Status changed from New to Feedback

Note also that you should be familiar with the old sysadmin trick of "exec". If you can't fork a new process, you can often get your shell to exec another process... in this case uadmin.

So "exec uadmin 1 0" would probably do the equivalent of what you need.
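The same trick can be sketched from a long-running program: replace the current process image instead of forking a child. This is only an illustration. The helper `exec_in_place` and its injectable `execfn` parameter are hypothetical names for this example, and the `uadmin` arguments are simply the ones quoted above.

```python
import os


def exec_in_place(argv, execfn=os.execvp):
    """Replace the current process image with argv; no fork is involved.

    On success the real os.execvp never returns. execfn is injectable
    only so the sketch can be exercised without replacing the process.
    """
    if not argv:
        raise ValueError("argv must not be empty")
    return execfn(argv[0], list(argv))


# exec_in_place(["uadmin", "1", "0"])  # arguments as quoted in the comment above
```

As with the shell version, the new program still has to be read from disk and mapped into memory, so this avoids fork() but not I/O.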

#6

Updated by Jim Klimov over 8 years ago

  • Difficulty set to Medium

Of my bug reports, this one seems more appropriate for this update: I have made a binary version of my watchdog based on vmstat and calling uadmin if bad things happen, and at least once it worked in my situation (once tried, so far).

I am not sure it is as fork/malloc-clean as "erbd", because it also tries to warn all terminals in the process, but in the last second or so of uptime it still has enough RAM to do that :)

So if this would prove to work for me, then the problem of a different software-resetter is not so urgent.
BUT I'd like to draw attention to the fact that this is not a solution like proposed in my original feature request #957, and it actually solves a different problem under different circumstances ;)

Now I'm thinking the /proc/sysrq-trigger part of solution to this requested feature of mine would be along the lines of a kernel driver or module which would merely accept a couple of numbers (and maybe a string) written into a device file, and pass them to uadmin(2) :)
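To make the idea concrete, here is a userland Python model of what such a module might accept. It is purely hypothetical: the input format "<cmd> <fcn> [mdep]" and the names `UadminRequest` and `parse_request` are assumptions for this sketch, and nothing here is an existing illumos interface. A real module would hand the parsed values to uadmin(2) inside the kernel.

```python
from typing import NamedTuple, Optional


class UadminRequest(NamedTuple):
    cmd: int
    fcn: int
    mdep: Optional[str]  # optional machine-dependent string, if one was given


def parse_request(line):
    """Parse "<cmd> <fcn> [mdep...]" as it might be written to the device file."""
    parts = line.strip().split(None, 2)
    if len(parts) < 2:
        raise ValueError("expected at least two numbers: cmd and fcn")
    cmd, fcn = int(parts[0]), int(parts[1])  # int() raises ValueError on junk
    if cmd < 0 or fcn < 0:
        raise ValueError("cmd and fcn must be non-negative")
    mdep = parts[2] if len(parts) == 3 else None
    return UadminRequest(cmd, fcn, mdep)
```

With this format, `echo "1 0" > <device>` from an already-running shell would carry the same two numbers as "uadmin 1 0", without spawning a process.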


My FreeRAM-Watchdog code and compiled i386 binary can be found here:
http://thumper.cos.ru/~jim/freeram-watchdog-20110501-firstcommit.tgz

It is largely a bastardized "vmstat" with a piece of "shutdown" based on code from ONNV repository (files dated 2010-10-29, included). In particular, the "stat/common" subdir dealing with actual kstats is included and used as is with no modifications. Of all my systems at hand, it only compiled on OI_148a; while Solaris 10 and SXCE 129 did not have some features of the original code (nanosleep, etc). It was compiled with GCC-3/GMAKE and currently with GDB debugging support built in.

It still takes roughly 15 minutes after I issue a pool import request for the system to go down like this, according to original "vmstat 1":

 kthr       memory                   page                    disk           faults        cpu
 r b w     swap    free re  mf pi  po  fr     de     sr f0  s2  s3  s4    in   sy    cs us sy id
 9 0 0 38156184 3014016 48 775  0   0   0      0      0  0 219 213 270  2389 8659  2086  2 98  0
 8 0 0 37941032 2798860 52 843  0   0   0      0      0  0 257 253 214  2305 8485  1978  2 98  0
 6 0 0 37718736 2576568 48 783  0   0   0      0      0  0 205 214 247  2384 8147  2089  2 98  0
 8 0 0 37484908 2342696 32 608  0   0   0      0      0  0 267 250 236  2531 8016  2192  2 98  0
 5 0 0 37274804 2132592 34 611  0   0   0      0      0  0 205 193 245  2300 7984  1900  2 98  0
 8 0 0 37021592 1879420 29 374  0   0   0      0      0  0 301 225 255  2382 6923  2010  1 99  0
 8 0 0 36806540 1664368 48 776  0   0   0      0      0  0 229 258 234  2290 8005  1722  2 98  0
 4 0 0 36558608 1416440 48 783  0   0   0      0      0  0 270 203 261  2320 8171  1884  2 98  0
 7 0 0 36326244 1184076 49 796  0   0   0      0      0  0 236 263 186  2414 8195  2121  1 99  0
10 0 0 36212496 1070328 47 771  0   0   0      0      0  0 185 199 201 28405 7979  1568  2 98  0
 7 0 0 36094756  936016 49 789  0   0   0      0      0  0 194 211 200 22586 5636  1541  1 99  0
16 0 0 36094756  792408 43 771  0   0   0      0      0  0 195 215 195 22942 5010  2546  1 99  0
13 0 0 36094756  663764 58 722  0   0   0      0      0  0 187 193 180  1804 7210  6572  2 98  0
 8 0 0 36094756  445176 51 824  0   0   0      0      0  0 177 181 200  2860 5688 11771  2 98  0
 7 0 0 36094756  265188 44 719  0   0   0      0      0  0 195 181 192 18925 6672  7206  1 99  0

!! May 1, 2011 11:58:37 AM MSD: WARNING: freeram-watchdog
!! detected deadly VM conditions, launching preemptive reset!
!! FreeSWAP: 36095600 FreeRAM: 98112 ScanRate: 280838

18 0 0 36095600   98112  2  64  4 629 629 122576 318704  0 201 218 205 14903 6064  9471  2 98  0
 kthr       memory                   page                    disk           faults        cpu
 r b w     swap    free re  mf pi  po  fr     de     sr f0  s2  s3  s4    in   sy    cs us sy id

As we can see, one second too late would be too late ;)
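For readers who want to reproduce this kind of check against plain "vmstat 1" output, a rough Python sketch follows. It is not the freeram-watchdog code: the thresholds in `is_deadly` are made-up values for illustration, the disk column names (f0, s2, s3, s4) are specific to the system shown above, and the second "sy" column is renamed `sy_cpu` here so the field names stay unique.

```python
VMSTAT_FIELDS = ("r b w swap free re mf pi po fr de sr "
                 "f0 s2 s3 s4 in sy cs us sy_cpu id").split()


def parse_vmstat_row(line):
    """Map one vmstat data row onto the column names of the header above."""
    values = [int(v) for v in line.split()]
    if len(values) != len(VMSTAT_FIELDS):
        raise ValueError("unexpected number of columns")
    return dict(zip(VMSTAT_FIELDS, values))


def is_deadly(row, min_free_kb=131072, max_scanrate=1000):
    """Hypothetical trigger: very little free RAM while the page scanner is busy."""
    return row["free"] < min_free_kb and row["sr"] > max_scanrate
```

Fed the last two data rows above, this flags only the final one (free 98112, sr 318704), which is also the sample on which the watchdog fired.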

Unfortunately, VM kstats are updated by kernel only once per second. So there is still a chance to miss the moment.

Please tell me if there is a higher-resolution VM polling mechanism?

This watchdog tries to increase polling frequency to 10Hz when RAM depletion is too fast (200Mb/sec by default), because per-CPU stats can still be collected and ScanRate seems derived from them. But my first live test failed to prove (or disprove) that:

freq freeswap freeram scanrate   Dswap    Dram    in    sy    cs us  sy id
  10 36326244 1184076        0       0       0 87900  7762  1662  0 100  0
  10 36326244 1184076        0       0       0 88964  2555  1200  0 100  0
  10 36212496 1070328        0 -113748 -113748 87871  6742   816  2  98  0
   1 36094756  936016        0 -117740 -134312  2062  5559  1790  1  99  0
   1 36094756  792408        0       0 -143608 23549  5153  2559  1  99  0
   1 36094756  663764        0       0 -128644  1819  5869  6813  2  98  0
   1 36094756  445176        0       0 -218588  2715  5663 10383  1  99  0
  10 36094756  445176        0       0       0  1827  2555  9536  0 100  0
  10 36094756  445176        0       0       0  1893  7407 13387  0 100  0
  10 36094756  445176        0       0       0  1940 14060  1900  0 100  0
  10 36094756  445176        0       0       0  2389  3122 18689  0 100  0
  10 36094756  445176        0       0       0 47806 10815  6126  3  97  0
  10 36094756  265188        0       0 -179988  3971  5352  4800  5  95  0
   1 36095600   98112   280838     844 -167076 16712  5040  8632  1  99  0
May 1, 2011 11:58:37 AM MSD: WARNING: freeram-watchdog
detected deadly VM conditions, launching preemptive reset! == FreeSWAP: 36095600 FreeRAM: 98112 ScanRate: 280838

Seemingly the scan rate jumped up just in the last second. Maybe it was true, the result is not definitive for me yet.
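The polling-rate switch described above could be sketched roughly as follows in Python. This is an assumption-laden illustration, not the watchdog's actual logic: "200Mb/sec" is read here as 200 MiB/s, free RAM is taken in KB as vmstat reports it, and the function names are invented for the example. It also makes no attempt to reproduce how long the real watchdog stays at 10Hz in the log above.

```python
SLOW_HZ = 1
FAST_HZ = 10
# "200Mb/sec" read as 200 MiB/s, expressed in KB/s to match vmstat's "free" column.
DEPLETION_LIMIT_KB_S = 200 * 1024


def depletion_rate_kb_s(prev_free_kb, cur_free_kb, interval_s):
    """Free RAM lost per second; negative if memory was freed."""
    return (prev_free_kb - cur_free_kb) / interval_s


def next_poll_hz(prev_free_kb, cur_free_kb, current_hz):
    """Pick the next polling frequency from the last observed depletion rate."""
    rate = depletion_rate_kb_s(prev_free_kb, cur_free_kb, 1.0 / current_hz)
    return FAST_HZ if rate > DEPLETION_LIMIT_KB_S else SLOW_HZ
```

Under these assumptions, the 1Hz sample that lost 218588 KB (663764 to 445176) crosses the limit and switches to 10Hz, while the preceding sample that lost 128644 KB does not.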

//Jim

PS: Thank you for your other comments, but sadly still I can not say they are on target.

1) As for "exec" - yes, I know of that feature. It did not come to my fingertips as I was typing the command, and a couple of seconds later it was too late.
Also it still incurs reads from disks, allocating memory, etc. which might be impossible in this case.

2) Like it or not, but when you go out to the big community, you'll have to deal with all sorts of hardware, not only high-grade lab equipment.
I can relate to my past with such hardware, and I'm certain that with the kind of rigs people put together now for their Home ZFS servers published on the forums, such experience will be relevant for a long time ;)
Back in my Linux-enthusiast days of running a university campus infrastructure with whatever scraps the students cashed together to make a server, there were many hardware-related faults leading to loss of service until reboot of the server (i.e. scenario with temp-loss of root disk, above). Luckily this rarely led to loss of data as well. Replacing the hardware was not a monetary option for us-students, though. Apparently there was no remote-control card, it would cost like half the server (some 150 or 200$) if compatible at all.
As the servers tended to be in many different student-dormitory buildings, running to press keys on keyboard was not an option either - from 1am to 6am the doors were locked, and maintenance/repairs usually happened at night - we tried to study at daytime ;)
So the ability to enter the system by a previously created SSH remote shell, and echo a reboot sequence straight into the kernel - all these ops needed no HDD I/Os - it was a life saver I don't know how many times.

Now, my current situation with the system hanging quickly in a kernel loop (?) is obviously very different from that typical situation from my past of a system merely being unable to do stuff, but quite responsive for the tasks launched before.

#7

Updated by Jim Klimov over 7 years ago

  • Tags set to needs-triage

For those stumbling on the system-hang problem, I was recently made aware of the "deadman timer" in Solaris, which (if enabled) might reboot the system from such hangs, and perhaps cause a panic to get the kernel dump for debugging the problem.
I have not yet got a chance to try it out, but others in similar peril might want to know :)
See: http://wiki.illumos.org/pages/viewpage.action?pageId=1146983
