Feature #2441


Failsafe boot option

Added by Jim Klimov about 9 years ago. Updated almost 6 years ago.

OS/Net (Kernel and Userland)

OpenSolaris SXCE releases, as well as Solaris 10, included a "Failsafe boot" environment.
It was a RAMDisk image stored on the same boot disk, containing all relevant drivers for your hardware and the current kernel along with some userland software. In essence, it was an expanded and somewhat interactive variant of the "boot archive" we still have today, but less frequently updated (and potentially broken along the way).
This environment was often crucial for repairing a broken installation, dealing with problems such as renamed drives (or changed device paths, like an IDE/SATA conversion in the BIOS, or a move to another server after the original one got fried but the disks survived), or a refusal to boot with broken mirrors without quorum, etc.
Nowadays a (customized) LiveCD/LiveUSB has saved me countless times, when my experiments or hardware failures on my home test system have left its HDD-based BEs unusable.
However, booting from such alternate media is slow and often inconvenient (it needs to be plugged in and/or prioritized in the BIOS), particularly for remote maintenance of unmanned locations, and even more so on hardware lacking proper networked IPMI/ILOM consoles.

I propose that some sort of failsafe or live environment with minimal HW/ZFS requirements (bootable anywhere the backing image file can be read) should again be provided as part of the installed software. Ideally it should have virtual consoles built in, and it should expect connections (and dump kernel logs) on both the text console and the serial console (the LiveCD should really do that as well).

There should be a means to add customizations to the failsafe image, such as required drivers (storage, net), networking setup, ssh, or perhaps some lower-level testing software or scripts that the user likes. The image should then be built (iso, gzipped ufs, whatever) and used as a single unmodifiable unit (unlike a ZFS alternate BE, which can become unmountable for a number of reasons). Perhaps the current scripts for boot_archive maintenance can be reused for this purpose.

Perhaps manual or automated updates to the image (replacing the image file or adding another one "near" it) should be enabled for kernel upgrades, especially if storage is involved (updated ZFS code/versions/features), for net/storage driver installations and upgrades, and for changes to driver.conf settings.
NOTE that the possible/probable failure of such upgrades is one of the reasons we want the failsafe environment in the first place, so a fully automated replacement without the user's consent is something we likely don't want (perhaps no sooner than the first successful boot and proper shutdown after the upgrade?). But it is also useless to have an obsolete image which can't access the current ZFS pool or changed hardware... I think the rebuild procedure could be automated, but invoked by the user manually, the way "bootadm update-archive -R /a" can be used today.
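
As a rough illustration of the replacement policy discussed above (not an existing tool), the sketch below allows a replacement either on an explicit user request or, in an optional automatic mode, only after at least one clean boot-and-shutdown cycle since the upgrade. It also keeps the previous image next to the new one. All names here (may_replace_image, install_image, the ".new" and ".prev" suffixes) are hypothetical, and building the image itself is out of scope.

```python
import os
import shutil


def may_replace_image(user_requested, auto_update_enabled, clean_cycles_since_upgrade):
    """Decide whether the failsafe image may be replaced (hypothetical policy).

    A manual request always wins; automatic replacement is allowed only
    after at least one successful boot followed by a proper shutdown.
    """
    if user_requested:
        return True
    return auto_update_enabled and clean_cycles_since_upgrade >= 1


def install_image(new_image, target):
    """Put new_image in place of target, keeping the old one as target + '.prev'."""
    staged = target + ".new"
    shutil.copyfile(new_image, staged)             # stage next to the target
    if os.path.exists(target):
        shutil.copyfile(target, target + ".prev")  # keep the known-good image
    os.replace(staged, target)                     # rename within one filesystem
```

Copying the old image aside before the rename means a usable image stays at the target path for the whole operation, which matters if the box loses power halfway through.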

GRUB should support booting this Failsafe image file from such sources as:
  • current active root BE on rpool
  • a dedicated filesystem on the rpool (e.g. the user would like to set copies=3 for this file only, and keep it separate from other BE tests, rolling snapshots and other work)
  • a dedicated partition/slice with whatever filesystem GRUB and Solaris support (perhaps not containing a zfs pool but a "recovery" FAT or UFS partition with this one file or so)
  • a separate storage device (e.g. a Failsafe USB Flash which is not unlike a LiveUSB, but customized for this particular box, or a dataset on a different ZFS pool in the same box)

Also, if GRUB or the kernel fails to load a "normal" BE, there should be a way to (optionally?) reboot into the failsafe image automatically, without user interaction (e.g. to conveniently recover a failed remote/headless box).
On the kernel side this could perhaps be done with a specialized procedure in the kernel panic/crashdump handler combined with a fastboot procedure, for cases where the kernel detects that it cannot mount its root (such as zfs:vfs_mount failures). Or maybe in some other way...
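
One way to picture the automatic fallback is a small counter of consecutive failed boots of the normal BE, consulted before deciding what to boot next. The sketch below only illustrates that decision logic. Where such a counter would actually live (GRUB, the kernel panic path, or elsewhere) is left open, and choose_boot_target, record_boot_result, max_failures and the target names are all made up for illustration.

```python
def choose_boot_target(consecutive_failures, auto_failsafe=True, max_failures=1):
    """Return 'failsafe' once the normal BE has failed too often, else 'normal'."""
    if auto_failsafe and consecutive_failures >= max_failures:
        return "failsafe"
    return "normal"


def record_boot_result(consecutive_failures, booted_ok):
    """Reset the counter after a good boot, otherwise count one more failure."""
    return 0 if booted_ok else consecutive_failures + 1
```

With auto_failsafe turned off the function always returns 'normal', which matches the "optionally?" above: the automatic reboot stays a choice of the administrator.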

I guess this is late notice to make it into the currently expected "stable" release, but for that release to be really usable in real-life situations, I think this should be done.

Thanks for listening, this was
//Jim Klimov

Related issues

Has duplicate illumos gate - Bug #1111: OpenIndiana lacks a "failsafe boot" image and option (New, 2011-06-14)

