Failsafe boot option
OpenSolaris SXCE releases, as well as Solaris 10, included a "Failsafe boot" environment.
It was essentially a RAM disk image stored on the same boot disk, containing all relevant drivers for your hardware and the current kernel, along with some userland software. In effect, it was an expanded and somewhat interactive variant of the "boot archive" we still have today, but less frequently updated (and so less frequently at risk of being broken in the process).
This environment was often crucial for repairing a broken installation, dealing with problems such as renamed drives (or changed device paths, like an IDE/SATA mode switch in the BIOS, or a move to another server after the original one got fried but the disks survived), refusal to boot with broken mirrors lacking quorum, etc.
Nowadays a (customized) LiveCD/LiveUSB has saved me countless times when my experiments or hardware failures on my home test system have left its HDD-based BEs unusable.
However, booting from such alternate media is slow and often inconvenient (it needs to get plugged in and/or prioritized in BIOS), particularly for remote maintenance of unmanned locations (and even more so on hardware lacking proper networked IPMI/ILOM consoles).
I propose that some sort of failsafe or live environment with minimal HW/ZFS requirements (bootable anywhere the backing image file can be read) should again be provided as part of the installed software. Ideally it should have virtual consoles built in and expect connections (and dump kernel logs) on both the text console and the serial console (the LiveCD should really do that as well).
There should be a means to add customizations to the failsafe image, such as required drivers (storage, net), networking setup, ssh, or perhaps some lower-level testing software or scripts that the user likes. The image should then be built (iso, gzipped ufs, whatever) and used as a single unmodifiable unit (unlike a ZFS alternate BE, which can become unmountable for a number of reasons). Perhaps the current scripts for boot_archive maintenance can be reused for this purpose.
Perhaps manual or automated updates to the image (replacing the image file or adding another one "near" it) should be enabled for kernel upgrades, especially if storage is involved (updated ZFS code/versions/features), for net/storage driver installations and upgrades, and for changes to driver.conf settings.
NOTE that the possible/probable failure of such upgrades is one of the reasons we want the failsafe environment in the first place, so a fully automated replacement without the user's consent is something we likely don't want (perhaps no sooner than the first successful boot and proper shutdown after the upgrade?). But it is also useless to have an obsolete image which can't access the current ZFS pool or changed hardware... I think the rebuild procedure could be automated, but invoked manually by the user, like "bootadm update-archive -R /a" can be used today.
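As a rough illustration of the "automated procedure, manual trigger" idea, a helper could merely report when the failsafe image looks older than the kernel or driver configuration it is supposed to match, and leave the decision to rebuild to the administrator. The function name `failsafe_is_stale` and the idea of comparing modification times are assumptions made for this sketch, not part of bootadm or any existing tool.

```shell
# Sketch only: "failsafe_is_stale" is a hypothetical helper, not an existing command.
# Usage: failsafe_is_stale IMAGE REFERENCE_FILE...
# Returns 0 (stale) if IMAGE is missing or any existing REFERENCE_FILE is newer
# than IMAGE; returns 1 otherwise. It never rebuilds anything by itself.
failsafe_is_stale() {
    image=$1
    shift
    # A missing image is treated as stale.
    [ -f "$image" ] || return 0
    for ref in "$@"; do
        # Skip reference files that do not exist on this system.
        [ -e "$ref" ] || continue
        # find prints the reference file only if it is newer than the image.
        if [ -n "$(find "$ref" -prune -newer "$image")" ]; then
            return 0
        fi
    done
    return 1
}
```

An administrator (or a post-upgrade hook that only prints a warning) might call it as `failsafe_is_stale /path/to/failsafe.img /platform/i86pc/kernel/amd64/unix /kernel/drv/*.conf`, where the image path is a placeholder, and then decide whether to run the rebuild.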
The failsafe image could be stored in one of several places:
- the current active root BE on rpool
- a dedicated filesystem on the rpool (e.g. the user might like to set copies=3 for this file only, and keep it separate from his other BE tests, rolling snapshots and other works)
- a dedicated partition/slice with whatever filesystem GRUB and Solaris support (perhaps not containing a zfs pool but a "recovery" FAT or UFS partition with this one file or so)
- a separate storage device (e.g. a Failsafe USB Flash which is not unlike a LiveUSB, but customized for this particular box, or a dataset on a different ZFS pool in the same box)
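With several possible storage locations, whatever maintains or verifies the image would need a search order. A minimal sketch, assuming the candidate paths are passed in by the caller (the function name and all paths are hypothetical), simply returns the first candidate that is readable and non-empty:

```shell
# Sketch only: "find_failsafe_image" and its candidate paths are hypothetical.
# Usage: find_failsafe_image CANDIDATE_PATH...
# Prints the first candidate that is a readable, non-empty file and returns 0;
# returns 1 if none of the candidates qualifies.
find_failsafe_image() {
    for candidate in "$@"; do
        if [ -r "$candidate" ] && [ -s "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

The order of the arguments expresses the preference, for example a dedicated dataset first, then the active BE, then a mounted recovery partition.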
Also, if GRUB or the kernel fails to load a "normal" BE, there should be a way to (optionally?) reboot automatically into the failsafe image without user interaction (e.g. to conveniently recover a failed remote/headless box).
On the kernel side this could perhaps be done with a specialized procedure in the kernel panic/crashdump handler combined with a fastboot procedure, for cases where the kernel detects that it cannot mount (such as zfs:vfs_mount failures). Or maybe in some other way...
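On the GRUB side, part of this automatic fallback may already be expressible with the `fallback` directive of GRUB legacy, which tries another menu entry when the default one fails to load. The menu.lst fragment below is only a sketch: it assumes the installed GRUB build retains the stock `fallback` directive, and the BE name `rpool/ROOT/openindiana` and the image path `/platform/i86pc/failsafe.img` are hypothetical placeholders. Note that `fallback` only covers failures GRUB itself detects (for example an unreadable kernel or boot archive); a kernel that loads and then panics would still need the kernel-side handling discussed above.

```
# menu.lst sketch for GRUB legacy; BE name and image path are hypothetical
default 0
fallback 1
timeout 10

title OpenIndiana (normal BE)
findroot (pool_rpool,0,a)
bootfs rpool/ROOT/openindiana
kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS
module$ /platform/i86pc/$ISADIR/boot_archive

title Failsafe (RAM disk image)
findroot (pool_rpool,0,a)
bootfs rpool/ROOT/openindiana
kernel$ /platform/i86pc/kernel/$ISADIR/unix
module$ /platform/i86pc/failsafe.img
```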
I guess this is too late a notice to make it into the currently expected "stable" release, but for that release to be really usable in real-life situations, I think this should be done.
Thanks for listening, this was
Updated by Jim Klimov about 7 years ago
A "Firefly" minimal installation image was created by Alex Eremin (alhazred) which should essentially allow the recovery tasks:
Now, the illumos/OI quest is to automate the building of such images based on the current OS, as old FailSafe Boots did :)
Updated by Jim Klimov almost 6 years ago
I've been integrating the Firefly failsafe images (pre-built and occasionally distributed on Alex Eremin's page and sourceforge repo) with my system-management scripts.
As a result, there is a script with which one can now deploy contents of a Firefly ISO into a dedicated dataset on an rpool, and re-build the Firefly image with kernel and driver bits from a current BE so that it matches the ZFS capabilities, etc. The resulting failsafe boot archive can be stored in the same BE as the OS installation or in a dedicated one.
This script can also be called from my split-root upgrade scripts to keep the Firefly failsafe image up to date.
See here: https://github.com/jimklimov/illumos-splitroot-scripts/tree/master/bin (licensed as CDDL, feel free to use and/or port the logic into proper OI/illumos).