Feature #829
Support installing OpenIndiana into a hierarchy of datasets
Description
Many of the SXCE and Solaris 10 systems I maintain were installed using a hierarchy of datasets in the root ZFS pool.
This approach separates the "/" root filesystem dataset from "/opt", "/var" and "/usr" (and optionally "/usrlocal"), so that these 3 or 4 datasets can be kept as children of the root dataset and compressed on disk, while the "/" filesystem stays uncompressed and thus bootable by GRUB.
Also, some of the user and log data from "/var" is shared between boot environments by living in dedicated (compressed) datasets. This allows switching back and forth between BEs while upgrading or testing updates, while keeping a single track of the system's historical data, etc.
NOTE that sharing "/var/tmp" like this proved to be a bad idea: the mountpoint may be written to early in boot (e.g. by the ipfilter service), which prevents mounting the shared FS later in the boot sequence.
Granted, a separate "/usr" is sometimes a hassle during single-user mode repairs, but when we are tight on space on smallish boot media, a compressed "/usr" is more of a bonus than a problem - workarounds are documented internally. And with the latest releases the automounting worked properly enough...
While not quite officially supported, this approach was discussed on opensolaris forum and even worked transparently with LiveUpgrade - LU properly cloned the root's children when making a new BE, and attached existing shared datasets.
Here's a layout example for my current testbed system, which will hopefully work after the reboot (as older systems did well):
root@openindiana:~# df -k | egrep 'ROOT|SHARED|zfsroot|/a'
rpool/ROOT/oi_148a 14088223 190948 13897275 2% /zfsroot
rpool/ROOT/oi_148a/usr
14006968 109693 13897275 1% /zfsroot/usr
rpool/ROOT/oi_148a/opt
13899058 1783 13897275 1% /zfsroot/opt
rpool/ROOT/oi_148a/var
13942656 45381 13897275 1% /zfsroot/var
rpool/ROOT/oi_148a/usrlocal
13897306 31 13897275 1% /zfsroot/usrlocal
rpool/SHARED/var/adm 512000 66 511934 1% /zfsroot/var/adm
rpool/SHARED/var/log 13897327 52 13897275 1% /zfsroot/var/log
rpool/SHARED/var/mail
13948475 32 13948443 1% /zfsroot/var/mail
rpool/SHARED/var/crash
13897306 31 13897275 1% /zfsroot/var/crash
rpool/SHARED/var/cores
102400 31 102369 1% /zfsroot/var/cores
I propose that the OpenIndiana installer and other tools at least expect the possibility of a non-monolithic root FS.
Better yet - the interactive and automated installers should offer to create such a hierarchy if the user desires (e.g. later releases of Solaris 10 allowed creating a separate /var dataset, although not shared between BEs, which was a step in this direction).
Coupled with my bug 828 (letting OI install into existing rpools), it would be nice if the installer proposed to update such split roots as well.
Thanks,
//Jim Klimov
Updated by Jim Klimov about 12 years ago
Just for info - here's the range of space savings (with lzjb) in split root filesystem datasets:
- df -k (selected datasets)
rpool/ROOT/oi_148a 12941359 499121 12442239 4% /zfsroot
rpool/ROOT/oi_148a/usr 13680979 1238740 12442239 10% /zfsroot/usr
rpool/ROOT/oi_148a/opt 12444022 1783 12442239 1% /zfsroot/opt
rpool/ROOT/oi_148a/var 12487620 45381 12442239 1% /zfsroot/var
rpool/ROOT/oi_148a/usrlocal 12442270 31 12442239 1% /zfsroot/usrlocal
rpool/SHARED/var/adm 512000 66 511934 1% /zfsroot/var/adm
rpool/SHARED/var/log 12442290 52 12442239 1% /zfsroot/var/log
rpool/SHARED/var/mail 12493439 32 12493407 1% /zfsroot/var/mail
rpool/SHARED/var/crash 12442270 31 12442239 1% /zfsroot/var/crash
rpool/SHARED/var/cores 102400 31 102369 1% /zfsroot/var/cores
rpool/ROOT/openindiana 15411543 2969304 12442239 20% /a
USED REFER RATIO COMPRESS NAME
7.74G 48.5K 1.18x off rpool
4.56G 31K 1.24x off rpool/ROOT
1.70G 487M 1.66x off rpool/ROOT/oi_148a
1.74M 1.74M 2.17x on rpool/ROOT/oi_148a/opt
1.18G 1.18G 1.91x on rpool/ROOT/oi_148a/usr
31K 31K 1.00x on rpool/ROOT/oi_148a/usrlocal
44.3M 44.3M 2.11x on rpool/ROOT/oi_148a/var
2.86G 2.83G 1.00x off rpool/ROOT/openindiana
50.2M 31K 2.16x on rpool/SHARED
50.2M 31K 2.29x on rpool/SHARED/var
66K 66K 4.63x on rpool/SHARED/var/adm
31K 31K 1.00x on rpool/SHARED/var/cores
31K 31K 1.00x on rpool/SHARED/var/crash
51.5K 51.5K 1.69x on rpool/SHARED/var/log
32K 32K 1.00x on rpool/SHARED/var/mail
1.50G 1.50G 1.00x off rpool/dump
96.5K 32K 1.00x off rpool/export
64.5K 32K 1.00x off rpool/export/home
32.5K 32.5K 1.00x off rpool/export/home/admin
1.59G 126M 1.00x off rpool/swap
So, overall, the 2.9G default installation compressed into 1.7G - not so bad for users with e.g. USB (or SSD) flash drives, old small boot HDDs, etc.
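As a quick check of the figures quoted above, the saving can be worked out with a small helper. This is only a sketch: the function name savings_pct is made up for illustration, and the two inputs are the rounded sizes from the comment, not fresh measurements.

```shell
#!/bin/sh
# Percentage of space saved when `before` shrinks to `after`
# (any unit, as long as both values use the same one).
# savings_pct is a hypothetical helper name, not an existing tool.
savings_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.0f\n", (1 - after / before) * 100 }'
}

# Figures quoted above: 2.9G monolithic root vs 1.7G split and compressed.
savings_pct 2.9 1.7
```

With these rounded inputs the saving comes out at roughly two fifths of the original footprint.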
Updated by Jim Klimov about 12 years ago
Some good news. Apparently, beadm automatically found the new hierarchy - even before I rebooted from LiveUSB system instance.
However it only "sees" the child datasets of the rpool/ROOT/oi_148a (var usr opt), and does not mount the rpool/SHARED/var/* datasets. Even despite the manual changes to etc/vfstab to include these datasets - like I do on other servers ;)
Updated by Jim Klimov about 12 years ago
And bad news. It seems that unlike SXCE and Solaris 10 releases, OpenIndiana requires /usr parts (some files from sbin/, bin/, lib/) to be available for basic bootup, even before it tries to mount filesystems in /etc/vfstab by "mount -a" or whatever, or does "zfs mount -a".
Also it seems there is no "FailSafe" miniroot...
SO FAR I'm abandoning the idea - until dependency on /usr is resolved. It takes up most of the root disk space, so inability to separate and compress it makes much of the other idea useless.
However, mounting the shared var/(log|adm|crash|cores|mail) does work nearly as expected - via vfstab. ZFS automounts kick in too late in the boot sequence now - after some log-writing services try to start :(
//Jim
Updated by Julian Wiesener almost 12 years ago
- Status changed from New to Rejected
- Difficulty set to Medium
- Tags set to needs-triage
Changing the way we handle boot environments is currently out of scope for us, since we would possibly need to modify beadm, IPS and bootadm, while it doesn't give us much advantage. If you want to implement it, please do so. You're very welcome to provide patches we can integrate into our installer and IPS.
beadm and bootadm changes would need to pass the illumos integration rules. I propose it is best to start with beadm/bootadm in illumos and if that does work right, continue with IPS and our installer.
Updated by Jim Klimov about 11 years ago
- Status changed from Rejected to Feedback
OVERVIEW
After a while I got another machine to test the split-root hierarchy with, now with oi_151a release, and I've played with the situation for a week now. Reopening the case to add my findings, and because it seems quite solvable - beadm support seems adequate to me (although things don't seem to work if I circumvent beadm).
Main problem is actually with BROKEN "/sbin" contents, and singularly with "/sbin/sh" which is now a symlink to "../usr/bin/ksh93". The "/sbin" directory is supposed to be there for "system binaries", the minimum set required for OS maintenance and bootup. For much of Solaris history it contained either statically compiled binaries with self-sufficient executable files, or at most with linker dependencies on "/lib" contents which are also relatively small.
Now in current OpenIndiana "/sbin/sh" which is "/usr/bin/ksh" pulls along lots of libraries from "/usr/lib" making the system unbootable if "/usr" is a filesystem separate from "/" root, and separating the needed pieces (to put them into root dataset) is not very trivial. In this case the kernel just loops with "INIT unable to spawn shell to run stuff" kind of message.
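One way to see which libraries tie a binary to "/usr" is to filter the output of ldd. The sketch below assumes the usual "name => path" layout of ldd output; the helper name usr_deps is invented for illustration.

```shell
#!/bin/sh
# Read `ldd` output on stdin and print each resolved library path that
# lives under /usr, i.e. one that is unavailable until /usr is mounted.
# usr_deps is a hypothetical helper name.
usr_deps() {
    awk '$2 == "=>" && $3 ~ /^\/usr\// { print $3 }'
}

# Intended use on a live system (not run here):
#   ldd /sbin/sh | usr_deps
```

An empty result would suggest the binary can run from the root dataset alone, although libraries loaded at run time with dlopen() would not show up this way.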
However, I copied over an "/sbin/sh" from an older SXCE release (where it is a compact old Bourne shell), and the system booted almost OK - some SMF methods had to be changed to use ksh directly instead of "#!/sbin/sh" (syntax differences failed these scripts under Bourne shell "sh"). For me these were "/lib/svc/method/logadm-upgrade" and "/lib/svc/method/net-physical". This fix seems in line with the finding that most other standard manifests use "/bin/ksh" or "/bin/ksh -p" or "/usr/bin/ksh93" already.
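To find other method scripts that would run under a replaced "/sbin/sh", something like the following could be used. It is a sketch only: list_sbin_sh_scripts is a made-up name, and it only checks the first line of each file, not whether the script actually uses ksh-only syntax.

```shell
#!/bin/sh
# Print every regular file in directory $1 whose first line starts with
# "#!/sbin/sh" (with or without arguments such as "-p").
# list_sbin_sh_scripts is a hypothetical helper name.
list_sbin_sh_scripts() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        first=
        # read may fail on a one-line file without a trailing newline,
        # but still fills $first, so only skip when nothing was read
        IFS= read -r first < "$f" || [ -n "$first" ] || continue
        case $first in
            '#!/sbin/sh'|'#!/sbin/sh '*) printf '%s\n' "$f" ;;
        esac
    done
}

# Intended use on a live system (not run here):
#   list_sbin_sh_scripts /lib/svc/method
```

Each file it lists is a candidate for review, not necessarily a script that needs changing.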
I think a proper fix would be to build and package a static ksh93 and store that as "/sbin/sh", the rest of system already seems ready for the split-root hierarchy.
Also I found that root's default "~/.profile" files were changed to use BASH syntax (like "export VAR=VAL" instead of "VAR=VAL; export VAR"), and while I like bash, this choice of incompatible syntax is rather a regression. The absence of bash in my minimal root FS, and the resulting lack of a proper PATH (due to "export PATH=..." being invalid for old "sh"), was an inconvenience during single-user mode boots which had not mounted "/usr" for one reason or another. Again, in the golden old Solaris days, root's shell was "/sbin/sh" (not relying on other FSes); now it could be a static build of "/sbin/bash", I suppose. Not "/usr/bin/bash" as of now.
DETAILS
Here is the filesystem layout my test box has currently booted with:
root@openindiana:~# df -k
Filesystem 1K-blocks Used Available Use% Mounted on
rpool/ROOT/openindiana3 51660103 684780 48925050 2% /
swap 5179448 1104 5178344 1% /etc/svc/volatile
rpool/ROOT/openindiana3/usr 49698442 773392 48925050 2% /usr
swap 5178348 4 5178344 1% /tmp
rpool/ROOT/openindiana3/usrlocal 48925081 31 48925050 1% /usrlocal
rpool/ROOT/openindiana3/var 48953565 28515 48925050 1% /var
swap 5178388 44 5178344 1% /var/run
rpool/export 48925083 33 48925050 1% /export
rpool/export/ftp 48925082 32 48925050 1% /export/ftp
rpool/export/ftp/distribs 49737913 812863 48925050 2% /export/ftp/distribs
rpool/export/home 48925083 33 48925050 1% /export/home
rpool/export/home/jim 48925086 36 48925050 1% /export/home/jim
rpool 48925100 50 48925050 1% /rpool
rpool/SHARED/var/adm 48925339 289 48925050 1% /var/adm
rpool/SHARED/var/cores 5242862 31 5242831 1% /var/cores
rpool/SHARED/var/crash 5242862 31 5242831 1% /var/crash
rpool/SHARED/var/log 48925117 67 48925050 1% /var/log
rpool/SHARED/var/mail 48925082 32 48925050 1% /var/mail
rpool/SHARED/var/spool/clientmqueue 2097134 33 2097102 1% /var/spool/clientmqueue
rpool/SHARED/var/spool/mqueue 2097134 31 2097103 1% /var/spool/mqueue
The dataset hierarchy, including other test roots:
root@openindiana:~# zfs list -o canmount,mountpoint,referenced,compression,compressratio,name
CANMOUNT MOUNTPOINT REFER COMPRESS RATIO NAME
on /rpool 50K off 1.14x rpool
off legacy 31K off 1.21x rpool/ROOT
noauto / 420M off 1.00x rpool/ROOT/oi_151a
noauto /opt 1.41M off 1.00x rpool/ROOT/oi_151a/opt
noauto /usr 755M gzip-9 1.00x rpool/ROOT/oi_151a/usr
noauto /usrlocal 31K gzip-9 1.00x rpool/ROOT/oi_151a/usrlocal
noauto /var 27.8M off 1.00x rpool/ROOT/oi_151a/var
noauto / 2.61G off 1.00x rpool/ROOT/oi_151b
noauto /opt 1.41M off 1.00x rpool/ROOT/oi_151b/opt
noauto /usrlocal 31K off 1.00x rpool/ROOT/oi_151b/usrlocal
noauto /var 27.7M off 1.00x rpool/ROOT/oi_151b/var
noauto / 2.61G off 1.00x rpool/ROOT/openindiana
noauto /opt 1.41M gzip-9 1.00x rpool/ROOT/openindiana/opt
noauto /usrlocal 31K off 1.00x rpool/ROOT/openindiana/usrlocal
noauto /var 27.7M gzip-9 2.91x rpool/ROOT/openindiana/var
noauto / 2.61G off 1.00x rpool/ROOT/openindiana2
noauto /opt 1.41M off 1.00x rpool/ROOT/openindiana2/opt
noauto /usrlocal 31K off 1.00x rpool/ROOT/openindiana2/usrlocal
noauto /var 27.9M off 1.00x rpool/ROOT/openindiana2/var
noauto / 422M off 1.00x rpool/ROOT/openindiana3
noauto legacy 1.41M off 1.00x rpool/ROOT/openindiana3/opt
noauto legacy 755M off 1.00x rpool/ROOT/openindiana3/usr
noauto legacy 31K off 1.00x rpool/ROOT/openindiana3/usrlocal
noauto legacy 27.9M off 1.00x rpool/ROOT/openindiana3/var
noauto /a 422M off 1.41x rpool/ROOT/openindiana4
noauto /opt 1.41M off 3.16x rpool/ROOT/openindiana4/opt
noauto /usr 755M off 3.02x rpool/ROOT/openindiana4/usr
noauto /usrlocal 31K off 1.00x rpool/ROOT/openindiana4/usrlocal
noauto /var 27.9M off 3.03x rpool/ROOT/openindiana4/var
noauto legacy 31K off 3.82x rpool/SHARED
noauto /var 31K gzip-9 3.91x rpool/SHARED/var
on /var/adm 289K lzjb 5.85x rpool/SHARED/var/adm
on /var/cores 31K gzip-9 1.00x rpool/SHARED/var/cores
on /var/crash 31K gzip-9 1.00x rpool/SHARED/var/crash
on /var/log 66.5K lzjb 1.70x rpool/SHARED/var/log
on /var/mail 32K lzjb 1.00x rpool/SHARED/var/mail
off /var/spool 31K gzip-9 1.00x rpool/SHARED/var/spool
on /var/spool/clientmqueue 32.5K lzjb 1.00x rpool/SHARED/var/spool/clientmqueue
on /var/spool/mqueue 31K lzjb 1.00x rpool/SHARED/var/spool/mqueue
- - 2.00G off 1.00x rpool/dump
on /export 33K lzjb 1.01x rpool/export
on /export/ftp 32K lzjb 1.01x rpool/export/ftp
on /export/ftp/distribs 794M gzip-9 1.01x rpool/export/ftp/distribs
on /export/home 33K lzjb 1.02x rpool/export/home
on /export/home/admin 36K lzjb 1.03x rpool/export/home/admin
on /export/home/jim 36K lzjb 1.03x rpool/export/home/jim
- - 731M off 1.00x rpool/swap
In particular, as can be seen above, the default installation's "/usr" can be compressed 3x from 2.3Gb to 800Mb with gzip-9 (like in livecd compressed image). This is a substantial difference on many systems I maintain (small local HDDs or Flash devices for roots, or numerous VMs colocated on relatively small storage):
root@openindiana:~# beadm mount openindiana /a
Mounted successfully on: '/a'
root@openindiana:~# du -ks /a/usr
2332540 /a/usr
root@openindiana:~# du -ks /usr
802290 /usr
I did have inexplicable problems when trying to "zfs snapshot", "zfs clone" and "zfs set org.opensolaris.libbe:uuid" my root hierarchies - they happened to be unbootable (freezing after mounting root and detecting hardware). However, "beadm create" of current root's clone and further mutilation of its datasets worked okay. I trussed the "beadm create" calls but failed to note any difference in my procedure (which worked for SXCE) from its inner magic.
Perhaps the UUID has to be saved somewhere, or not be completely random? ;)
It was not at all difficult to separate "/var" and "/opt" into child datasets of the current root's clone and move (rsync) files into them. While the root dataset is still required to be uncompressed (I've tested that explicitly - GRUB fails to mount it even if I get past the failsafes implemented by beadm and zfs/zpool to disallow setting compression on bootable datasets or setting bootability on compressed datasets), its children may use at least "lzjb" or "gzip-9" (the ones I tested), since they are mounted by a fully-fledged Solaris libzfs and not the trimmed version in GRUB.
Also I've made a number of shared datasets to serve as subdirectories of "/var" in all BEs as seen above (shared logs, mail spools, etc.), so I can enforce different compression and quotas on these spacehogs, while using the same log history, etc. between reboots into different BEs.
You can see above the intricate structure under "rpool/SHARED/var/*" which mixes "canmount=off" for some nodes (var, var/spool) and allows automounts for actual directories. They are mounted after the proper root dataset and its children, so no conflict occurs such as attempts to mount "/var/adm" before "/var". Still, I protected all such mountpoints in the parent datasets by immutability as detailed below.
NOTE however, that they are not mounted with "beadm mount" when you configure an alternate BE, just like it does not mount "/export/*" and other datasets into the ABE. If you require them to get automounted in these cases, you might instead use legacy mounts and /etc/vfstab references. Or you could use loop-mounts manually to link the directories (see lofs(7FS) like in delegation of paths to local zones).
Such separated "/var", "/var/*", "/usrlocal" and "/opt" mounts work for both bootups and "beadm mount" as either "canmount=noauto,mountpoint=legacy" datasets with mount definitions from appropriate BE root's "etc/vfstab" files, and as "canmount=noauto,mountpoint=/var" etc. datasets WITHOUT reference in "etc/vfstab". The "/var/*" components are automatically mounted with "zfs mount -a" on this system, and as legacy mountpoints with references from "/etc/vfstab" on some of my other systems.
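For the legacy-mount variant, the "etc/vfstab" entries could be generated rather than typed by hand. The sketch below assumes the standard seven-field vfstab layout and the shared dataset names used in this report; vfstab_zfs_entry is a hypothetical helper name.

```shell
#!/bin/sh
# Print one /etc/vfstab line for a ZFS dataset whose mountpoint property
# is "legacy". Fields: device to mount, device to fsck, mount point,
# FS type, fsck pass, mount at boot, mount options.
# vfstab_zfs_entry is a hypothetical helper name.
vfstab_zfs_entry() {
    printf '%s\t-\t%s\tzfs\t-\tyes\t-\n' "$1" "$2"
}

# Shared datasets named in this report; adjust the list to taste.
for d in adm log mail cores crash; do
    vfstab_zfs_entry "rpool/SHARED/var/$d" "/var/$d"
done
```

The output only prints to the terminal; appending it to a BE's "etc/vfstab" is left as a deliberate manual step.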
NOTE that it is obviously wrong to use "canmount=on,mountpoint=/var" type of entries, because different BE's "/var" child datasets would come into conflict. This is also not needed for the BE mount with beadm or boot procedures, and beadm sets "canmount=noauto" to such children.
In particular, I was pleased to see that beadm and bootup procedure both properly mounted non-standard paths like my "/usrlocal" by virtue of this dataset being a "canmount=noauto" child of the current root.
NOTE: There can be a conflict where you have both "canmount=noauto,mountpoint=/var" and a reference in "etc/vfstab" - leading to "svc:/filesystem/minimal" failures ("mount -a" fails because it only supports explicit legacy mountpoints).
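A conflict of this kind can be spotted before rebooting by comparing the dataset properties against the vfstab file. The following is a minimal sketch, assuming mountpoints without embedded whitespace; the function name and the temporary file path in the usage comment are made up for illustration.

```shell
#!/bin/sh
# $1: file with `zfs list -H -o name,mountpoint` output (tab-separated)
# $2: a vfstab file
# Prints each dataset that vfstab mounts as type "zfs" although its
# mountpoint property is not "legacy".
# vfstab_conflicts is a hypothetical helper name.
vfstab_conflicts() {
    awk '
        NR == FNR { mp[$1] = $2; next }   # first file: dataset -> mountpoint
        /^[ \t]*#/ || NF < 4 { next }     # skip vfstab comments and short lines
        $4 == "zfs" && ($1 in mp) && mp[$1] != "legacy" {
            printf "%s\t%s\n", $1, mp[$1]
        }
    ' "$1" "$2"
}

# Intended use on a live system (not run here):
#   zfs list -H -o name,mountpoint > /tmp/zfs.lst
#   vfstab_conflicts /tmp/zfs.lst /a/etc/vfstab
```

Any line it prints names a dataset that should either get "mountpoint=legacy" or be dropped from vfstab.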
NOTE: One issue I had was that I tried to be lazy and "inherit mountpoint". The effective values remained like "/a/var" which precluded further booting and failed into single-user mode, but that was easy to fix with explicit mountpoint definition. It seems that "beadm mount" works safely and does not overwrite the defined mountpoint attributes when it mounts the BE into an alternate root.
NOTE that root's homedir "/root" could not be separated safely in my tests, but one can make "rpool/SHARED/root-data" and mount that under all BE's like "/root/root-data". YMMV.
I protected the mountpoints (such as "/var" in the root dataset) to be immutable directories so that for example "/var/adm" can not be mounted if proper "/var" is not yet mounted. Otherwise ZFS refused to mount "/var" into a contaminated non-empty directory (even after reboots when there is no longer a sub-FS there, only the mountpoint remains).
It was even more important to protect paths like "/var/adm" in the "/var" dataset itself - so that no logs, utmpx or similar files would be created before an actual "/var/adm" dataset is mounted.
Roughly, the routine went like this (NOTE that the default path points to GNU utilities first, and the GNU chmod lacks advanced features needed for ZFS ACLs):
# umount /a/var
# /usr/bin/chmod S+ci /a/var
# mkdir /a/var/xxx
mkdir failed: Not owner
# mount /a/var
# /usr/bin/chmod S+ci /a/var/adm /a/var/log .....
This particular problem can also be worked around with legacy mounts and the "-O" option for overlay mounts in "/etc/vfstab" entries (see also my RFE #997 to add support for overlay auto-mounting as a zfs dataset attribute).
The main (profanity removed) "issue" I finally had was with separating "/usr" into a separate dataset, compressed or not. The core problem was that "/sbin" no longer contains self-sufficient binaries, and the system bootup routine now depends on resources from "/usr". I detailed this issue in the overview above.
My workaround was to embed a self-sufficient "/sbin/sh" into the root dataset, and fix some SMF service manifests to reference KSH93 "/usr/bin/ksh93" directly.
Finally, I also tested that this new setup is supportable by "beadm create" - and it cloned my split-root BE just nicely, "beadm mount"ed the clone, activated and booted it.
I've also changed that clone from using legacy mounts and "/etc/vfstab" into automatic mounts of root dataset's children with "canmount=noauto,mountpoint=/var" etc.
RESULT
Overall, as of the oi_151a release the system seems almost ready to support complicated hierarchies (in order to provide compression, quotas, different copies settings and whatever else you can think of), with different working methods of automounting the component directories. Most components can be easily separated manually post-installation (and it is a matter of updating the Caiman wizards to do this during initial installation).
One key to separating the "/usr" partition and saving over a gigabyte of disk in the default installation is simply to once again provide a self-sufficient "/sbin/sh" executable.
Updated by Jim Klimov about 11 years ago
Additional experiment regarding the internal magic of beadm: starting with my split root hierarchy described above, I made a recursive zfs snapshot and a set of cloned filesystems as an alternate root with no other changes. This root has booted successfully regardless of whether "org.opensolaris.libbe:uuid" was set at all, or was set to a manually defined value (i.e. another dataset's uuid with some characters zeroed out).
However "zfs create"'ing a hierarchy and rsync'ing files into it has failed the boot test :(
For clarity, I ought to retry with an ACL-aware "sun cpio" or such utility, as this is about the only idea I have left now. Open to suggestions ;)
Updated by Jim Klimov about 11 years ago
Okay, there is no beadm magic. Only "/sbin/sh" remains as the single problem for split-dataset root environments.
DETAILS
A fully-manual method HAS also worked out at last. The culprit was in the loopback-mounted files like this (and damn, I knew and documented this before!):
/usr/lib/libc/libc_hwcap1.so.1 51797824 774447 51023377 2% /lib/libc.so.1
One way or another, a root hierarchy manually created and copied "from scratch" has booted up for me now.
Both with "find ... | cpio" as detailed below, and with "rsync -xavPHK" as I tried several times before.
Following is the routine to recreate in my case, YMMV.
1. Create and mount the BE root dataset hierarchy, including protected (immutable) mountpoints for shared paths I've had made before:
BENAME=oi_151a
zfs create -o canmount=noauto -o mountpoint=/a rpool/ROOT/${BENAME} && \
mkdir -p /a && \
zfs mount rpool/ROOT/${BENAME} && \
cd /a && \
for D in var opt usr usrlocal; do
  mkdir $D && \
  /bin/chmod S+ci $D
  zfs create -o canmount=noauto -o mountpoint=/a/$D -o compression=gzip-9 rpool/ROOT/${BENAME}/$D && \
  zfs mount rpool/ROOT/${BENAME}/$D
done
cd /a/var && \
mkdir -p adm log mail cores crash spool/clientmqueue spool/mqueue && \
/bin/chmod S+ci adm log mail cores crash spool/clientmqueue spool/mqueue
If you did not have a separated "/usr/local", create the symlink:
ln -s ../usrlocal /a/usr/local
The lack of UUID has not shown to be fatal for beadm BE switches to work back and forth, but it might be nice to set a unique ID anyway. Be creative, change some HEX characters in the sample uuid below:
zfs set org.opensolaris.libbe:uuid=c2c6c968-9866-c662-aac1-86c6cc77c2ff rpool/ROOT/${BENAME}
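Instead of editing hex characters by hand, a random value in the same layout can be produced from /dev/urandom. This is a sketch under the assumption that any unique 8-4-4-4-12 hex string is acceptable here; gen_uuid is a made-up name, and the result does not set the version or variant bits that a strict RFC 4122 UUID would carry.

```shell
#!/bin/sh
# Print 16 random bytes as a lowercase 8-4-4-4-12 hex string.
# This only mimics the UUID layout; it does not set RFC 4122
# version/variant bits. gen_uuid is a hypothetical helper name.
gen_uuid() {
    od -An -N16 -tx1 /dev/urandom | tr -dc '0-9a-f' |
        sed 's/^\(.\{8\}\)\(.\{4\}\)\(.\{4\}\)\(.\{4\}\)\(.\{12\}\)$/\1-\2-\3-\4-\5/'
    echo
}

# Intended use on a live system (not run here):
#   zfs set org.opensolaris.libbe:uuid=$(gen_uuid) rpool/ROOT/${BENAME}
```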
2. Loopback-mount the current root and copy it over:
mount -F lofs -o nosub / /mnt && \
cd /mnt && \
find . -xdev -depth -print | cpio -pPvdm /a/ && \
cd / && \
umount /mnt
Primarily this helps avoid other mounts infringing into the real root dataset and maybe clouding some original data.
Skipping this known step may have been the single critical error I did during my earlier rsync tests (perhaps that just skipped copying of libc.so which is kinda critical).
The same can be done with rsync:
mount -F lofs -o nosub / /mnt && \
rsync -xavPHK /mnt/ /a/ && \
umount /mnt
3. If your original root BE is comprised of more than one dataset (mountpoint) already, copy others over:
cd /
find var opt usr usrlocal -xdev -depth -print | cpio -pPvdm /a/
As above, this invocation of find should only locate the contents of mountpoints listed on the command line and not descend into their submounts if any (like /var/run). If this does not work, try old-school ways:
cd /
for D in var opt usr usrlocal ; do
  # paths printed by find already start with $D, so the target stays /a/
  find $D -xdev -depth -print | cpio -pPvdm /a/
done
There may be errors when cpio tries to change the immutable mountpoints (update ownership, timestamps, or even worse - try to add files into them), these are okay - that's what we're protecting from. After mounting the proper datasets, their root directories (such as "/var/adm/") should be assigned correct POSIX properties.
The same can be done with rsync:
for D in var opt usr usrlocal ; do
  rsync -xavPHK /$D/ /a/$D/
done
4. Clean up for beadm to work:
for D in var opt usr usrlocal ; do
  zfs umount /a/$D
done
zfs umount /a
5. Use beadm to mount the new ABE and perhaps apply its magic (like updating children's mountpoints so that they do automount after reboot, etc.), and ultimately boot into it:
:; beadm mount ${BENAME} /a
Mounted successfully on: '/a'
:; /bin/df -k | grep " /a"
rpool/ROOT/oi_151a 61415424 432365 51023734 1% /a
rpool/ROOT/oi_151a/opt 61415424 1441 51023734 1% /a/opt
rpool/ROOT/oi_151a/usrlocal 61415424 31 51023734 1% /a/usrlocal
rpool/ROOT/oi_151a/var 61415424 28376 51023734 1% /a/var
rpool/ROOT/oi_151a/usr 61415424 774455 51023734 2% /a/usr
:; bootadm update-archive -R /a
(This is likely to return nothing since your current root's boot_archive should be up-to-date)
:; beadm umount ${BENAME}
:; beadm activate ${BENAME}
:; init 6
NOTE that "beadm mount; beadm umount" routine is fairly important to finalize things. Alternatively, you can manually set the mountpoint properties of datasets after you have unmounted them - this also works, but with more hassle.
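For the manual alternative, the property changes could look roughly like the commands printed by the sketch below. It assumes the pool is named rpool and that each child should end up mounted at "/" plus its own name, as in the listings earlier in this report; print_mountpoint_fixups is a hypothetical helper that only prints the commands so they can be reviewed before running.

```shell
#!/bin/sh
# Print (do not run) the `zfs set` commands that give a BE root and its
# child datasets their final mountpoints.
# $1: BE name; remaining arguments: child dataset names.
# print_mountpoint_fixups is a hypothetical helper name.
print_mountpoint_fixups() {
    be=$1; shift
    printf 'zfs set mountpoint=/ rpool/ROOT/%s\n' "$be"
    for d in "$@"; do
        printf 'zfs set mountpoint=/%s rpool/ROOT/%s/%s\n' "$d" "$be" "$d"
    done
}

print_mountpoint_fixups oi_151a var opt usr usrlocal
```

The datasets should already be unmounted and have "canmount=noauto" before any of the printed commands are applied.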
HTH,
//Jim Klimov
Updated by Jim Klimov about 11 years ago
I've updated the system to oi_151a2 with "pkg update" and it worked along nicely with the hierarchical root, including the separated 2x-compressed "/usr", structured as described above and illustrated below.
root@openindiana:~# uname -a
SunOS openindiana 5.11 oi_151a2 i86pc i386 i86pc
root@openindiana:~# pkg update
No updates available for this image.
root@openindiana:~# zfs list -o canmount,mountpoint,referenced,compression,compressratio,name | grep ROOT; zfs list
off legacy 31K off 1.60x rpool/ROOT
noauto / 422M off 1.05x rpool/ROOT/oi_151a
noauto / 428M off 1.00x rpool/ROOT/oi_151a-1
noauto /opt 3.83M off 1.00x rpool/ROOT/oi_151a-1/opt
noauto /usr 1.09G off 1.00x rpool/ROOT/oi_151a-1/usr
noauto /usrlocal 31K off 1.00x rpool/ROOT/oi_151a-1/usrlocal
noauto /var 91.4M off 1.00x rpool/ROOT/oi_151a-1/var
noauto / 429M off 1.60x rpool/ROOT/oi_151a-2
noauto /opt 3.83M off 1.74x rpool/ROOT/oi_151a-2/opt
noauto /usr 1.25G off 2.08x rpool/ROOT/oi_151a-2/usr
noauto /usrlocal 31K off 1.00x rpool/ROOT/oi_151a-2/usrlocal
noauto /var 199M off 1.26x rpool/ROOT/oi_151a-2/var
noauto /opt 1.38M gzip-9 1.00x rpool/ROOT/oi_151a/opt
noauto /usr 755M gzip-9 1.00x rpool/ROOT/oi_151a/usr
noauto /usrlocal 31K gzip-9 1.00x rpool/ROOT/oi_151a/usrlocal
noauto /var 33.9M gzip-9 1.16x rpool/ROOT/oi_151a/var
noauto / 422M off 1.00x rpool/ROOT/oi_151a_release
noauto /opt 1.38M gzip-9 1.00x rpool/ROOT/oi_151a_release/opt
noauto /usr 755M gzip-9 1.00x rpool/ROOT/oi_151a_release/usr
noauto /usrlocal 31K gzip-9 1.00x rpool/ROOT/oi_151a_release/usrlocal
noauto /var 27.6M gzip-9 1.00x rpool/ROOT/oi_151a_release/var
NAME USED AVAIL REFER MOUNTPOINT
rpool 7.53G 51.0G 48.5K /rpool
rpool/ROOT 2.63G 51.0G 31K legacy
rpool/ROOT/oi_151a-1 7.91M 51.0G 428M /
rpool/ROOT/oi_151a-1/opt 0 51.0G 3.83M /opt
rpool/ROOT/oi_151a-1/usr 2.13M 51.0G 1.09G /usr
rpool/ROOT/oi_151a-1/usrlocal 0 51.0G 31K /usrlocal
rpool/ROOT/oi_151a-1/var 4.98M 51.0G 91.4M /var
rpool/ROOT/oi_151a-2 2.62G 51.0G 429M /
rpool/ROOT/oi_151a-2/opt 3.86M 51.0G 3.83M /opt
rpool/ROOT/oi_151a-2/usr 1.37G 51.0G 1.25G /usr
rpool/ROOT/oi_151a-2/usrlocal 33K 51.0G 31K /usrlocal
rpool/ROOT/oi_151a-2/var 418M 51.0G 199M /var
rpool/ROOT/oi_151a 7.40M 51.0G 422M /
rpool/ROOT/oi_151a/opt 0 51.0G 1.38M /opt
rpool/ROOT/oi_151a/usr 2.18M 51.0G 755M /usr
rpool/ROOT/oi_151a/usrlocal 0 51.0G 31K /usrlocal
rpool/ROOT/oi_151a/var 3.10M 51.0G 33.9M /var
rpool/ROOT/oi_151a_release 54K 51.0G 422M /
rpool/ROOT/oi_151a_release/opt 1K 51.0G 1.38M /opt
rpool/ROOT/oi_151a_release/usr 1K 51.0G 755M /usr
rpool/ROOT/oi_151a_release/usrlocal 1K 51.0G 31K /usrlocal
rpool/ROOT/oi_151a_release/var 1K 51.0G 27.6M /var
rpool/SHARED 1.01M 51.0G 31K legacy
rpool/SHARED/var 1006K 51.0G 31K /var
rpool/SHARED/var/adm 533K 51.0G 410K /var/adm
rpool/SHARED/var/cores 67K 5.00G 31K /var/cores
rpool/SHARED/var/crash 49K 5.00G 31K /var/crash
rpool/SHARED/var/log 110K 51.0G 66.5K /var/log
rpool/SHARED/var/mail 50K 51.0G 32K /var/mail
rpool/SHARED/var/spool 166K 51.0G 31K /var/spool
rpool/SHARED/var/spool/clientmqueue 68.5K 2.00G 32.5K /var/spool/clientmqueue
rpool/SHARED/var/spool/mqueue 67K 2.00G 31K /var/spool/mqueue
rpool/dump 2.00G 51.0G 2.00G -
rpool/export 794M 51.0G 33K /export
rpool/export/ftp 794M 51.0G 32K /export/ftp
rpool/export/ftp/distribs 794M 51.0G 794M /export/ftp/distribs
rpool/export/home 277K 51.0G 35K /export/home
rpool/export/home/jim 55K 51.0G 36K /export/home/jim
rpool/swap 2.12G 52.4G 731M -
root@openindiana:~# df -k
Filesystem kbytes used avail capacity Mounted on
rpool/ROOT/oi_151a-2 61415424 439462 53514567 1% /
/devices 0 0 0 0% /devices
/dev 0 0 0 0% /dev
ctfs 0 0 0 0% /system/contract
proc 0 0 0 0% /proc
mnttab 0 0 0 0% /etc/mnttab
swap 5214072 1104 5212968 1% /etc/svc/volatile
objfs 0 0 0 0% /system/object
sharefs 0 0 0 0% /etc/dfs/sharetab
rpool/ROOT/oi_151a-2/usr 61415424 1306014 53514567 3% /usr
/usr/lib/libc/libc_hwcap1.so.1 54820581 1306014 53514567 3% /lib/libc.so.1
fd 0 0 0 0% /dev/fd
rpool/ROOT/oi_151a-2/var 61415424 203270 53514567 1% /var
swap 5212972 4 5212968 1% /tmp
swap 5213016 48 5212968 1% /var/run
rpool/ROOT/oi_151a-2/opt 61415424 3921 53514567 1% /opt
rpool/ROOT/oi_151a-2/usrlocal 61415424 31 53514567 1% /usrlocal
rpool/export 61415424 33 53514567 1% /export
rpool/export/ftp 61415424 32 53514567 1% /export/ftp
rpool/export/ftp/distribs 61415424 812862 53514567 2% /export/ftp/distribs
rpool/export/home 61415424 35 53514567 1% /export/home
rpool/export/home/jim 61415424 36 53514567 1% /export/home/jim
rpool 61415424 48 53514567 1% /rpool
rpool/SHARED/var/adm 61415424 409 53514567 1% /var/adm
rpool/SHARED/var/cores 5242880 31 5242813 1% /var/cores
rpool/SHARED/var/crash 5242880 31 5242831 1% /var/crash
rpool/SHARED/var/log 61415424 66 53514567 1% /var/log
rpool/SHARED/var/mail 61415424 32 53514567 1% /var/mail
rpool/SHARED/var/spool/clientmqueue 2097152 32 2097083 1% /var/spool/clientmqueue
rpool/SHARED/var/spool/mqueue 2097152 31 2097085 1% /var/spool/mqueue
This is another 2 cents in that the core OS and BE support is already in place. The development might focus on 'wizardization' of this setup (Caiman?), and on providing a static "/sbin/sh".
Updated by Jim Klimov about 11 years ago
Well, there is one more quirk that "beadm" could improve on in regard to such split-dataset root support: when cloning the BE, carry over some ZFS attributes such as the compression, so that the benefits of gzip-9 for /usr or lzjb for /var remain in place on ABEs as well.
Possibly other attributes should be copied from sources onto clones as well - perhaps any ZFS attrs that were explicitly defined in the source DSes (like optional dedup, copies, atime, zfs-auto-snap settings and so on).
Updated 2013-02-16: I posted an RFE #3569 to track this particular suggestion.
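Until beadm does this itself, locally-set properties can be carried over by hand after "beadm create". The sketch below turns the output of "zfs get -H -s local -o property,value all" into "zfs set" commands for a target dataset. The helper name print_prop_copy and the dataset names in the usage comment are made up for illustration, and values containing whitespace would need extra quoting.

```shell
#!/bin/sh
# Read "property<TAB>value" lines on stdin (the format produced by
#   zfs get -H -s local -o property,value all <source-dataset>)
# and print `zfs set` commands for the target dataset $1.
# mountpoint, canmount and the libbe uuid are skipped, since those are
# per-BE settings. print_prop_copy is a hypothetical helper name.
print_prop_copy() {
    awk -F '\t' -v target="$1" '
        $1 == "mountpoint" || $1 == "canmount" { next }
        $1 == "org.opensolaris.libbe:uuid" { next }
        NF >= 2 { printf "zfs set %s=%s %s\n", $1, $2, target }
    '
}

# Intended use on a live system (not run here; dataset names are examples):
#   zfs get -H -s local -o property,value all rpool/ROOT/oi_151a/usr |
#       print_prop_copy rpool/ROOT/oi_151a-1/usr
```

Note that setting compression on a clone only affects blocks written afterwards; data shared with the origin snapshot keeps whatever compression it was written with.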
Updated by Jim Klimov about 11 years ago
As I've recently discovered, the "pkgadd" command (for SVR4 support) now somehow relies on "/sbin/sh" being KSH. The relevant error (when it is a Bourne shell semi-static binary carried over from OpenSolaris SXCE) seems like this:
# gzcat MySVR4.pkg.gz > /tmp/x
# pkgadd -d /tmp/x
/sbin/sh: -1: bad number
pkgadd: ERROR: attempt to process datastream failed
- process </usr/bin/cpio -icdumD -C 512> failed, exit code 1
pkgadd: ERROR: could not process datastream from </tmp/x>
The fix is to make the "/sbin/sh" a KSH again. Reinstating the symlink would break bootability with separated /usr. Until I have compiled a static binary, a lofs-overlay can be used like this:
# mount -F lofs /usr/bin/`isainfo -n`/ksh93 /sbin/sh
# pkgadd -d /tmp/x
The following packages are available:
1 MySVR4 My test SVR4 package
(none) 1.9.24
Select package(s) you wish to process (or 'all' to process
all packages). (default: all) [?,??,q]:
Processing package instance <MySVR4> from </tmp/x>
...
[ verifying class <none> ]
## Executing postinstall script.
Installation of <MySVR4> was successful.
DISCLAIMER: I did not test if this is safe to use in i.e. /etc/vfstab (likely with a static name for 'amd64' or 'i386' ISA dirname) so that the original static Bourne shell is only used between boot and mounting of /usr, and I did not test in which order these mounts do happen either.
Updated by Igor Pashev about 11 years ago
I think there is no need for a separate /usr.
/usr/local, /var/tmp, /var/log etc. are ok.
The idea is: all directories which are under control of packaging tools should be on a single filesystem,
so you can do consistent snapshots. Other directories could be on different FS.
Updated by Jim Klimov almost 11 years ago
I think there is a need (moderate, since the systems are known to work without such separation), but nonetheless.
For one, compression of the installed base on the order of 2x-3x translates to savings - on disk, in faster reads from disk, and perhaps in RAM (when/if ARC caches blocks in compressed form as they are stored on disk). This may be especially important not for workstations with huge rpools (though even there I tend to use as small an rpool as reasonable, and keep the data pool separate), but in VMs with limited virtual disk sizes, or in embedded environments, e.g. on size-limited Flash storage (USB, CF, SSD).
On the other hand, many admins like myself are accustomed to such separation - though many old customs are being destroyed by the onslaught of IPS. A secondary argument for this position might be that the SMF services still separately define "svc:/system/filesystem/root:default" and "svc:/system/filesystem/usr:default", and on a "distro" like the LiveCD there are different instances, e.g. "svc:/system/filesystem/root:net", "svc:/system/filesystem/root:media" and "svc:/system/filesystem/usr:live-media". So, outside of IPS control such separation is useful and used.
Third, this just plain works well (with the two quirks above - the non-static /sbin/sh, and beadm clones forgetting the ZFS attributes of their origin, such as compression). This works with beadm snapshots too, so in terms of managing BEs with consistent snapshotting/cloning - this does work already.
Your argument does pose a slightly different point - about remaining under control of packaging tools - but not all illumos distros even use those, and I think the aim is consistency of boot environments - which we do have with separation of /usr as well.
You've also mentioned separation of /var/tmp. That is actually an interesting point - it worked on some of my older servers (e.g. SXCE, Sol10), but tends to work poorly in newer OpenSolaris descendants. The problem is that some SMF services start before the mounting of secondary FSes like /var/tmp and already expect that directory to be writeable. We can later overlay-mount a "real" separate /var/tmp dataset with /etc/vfstab (-O option) and override a non-empty mountpoint, but we cannot do so with ZFS automounts (yet, see bug #997). Also it is questionable whether some such early-starting services would need to access their (now hidden) temporary files.
So, while separation of /var/tmp would indeed make sense, so far I decided not to touch it. A possible solution might be to expand the list of primary FSes like in SMF service filesystem/minimal, filesystem/var or some such, which is a dependency for most of the other services, to enforce mounting of the separate /var/tmp if that is defined on the filesystem.
PS: Perhaps this task should be relocated to illumos-gate - I could not tell the difference well between two projects and their subprojects a year ago, and OpenIndiana was the only illumos-based distro I tried as a likely generic-distro future for my systems.
Apparently, since I am now talking about the quest's usefulness outside of IPS and OpenIndiana, and since such low-level things could be used (or chosen not to be used) by all distros alike, this is a quest for upstream illumos :)
Updated by Jim Klimov almost 11 years ago
- new syntax used in current ksh-oriented SMF method scripts (For me these were "/lib/svc/method/logadm-upgrade" and "/lib/svc/method/net-physical" and "/lib/svc/method/ipfilter"),
- and "pkgadd" breaking with "Bad number: -1" (something to do with UID/GID setups, and maybe the RBAC stuff - /sbin/sh does not want to start interactively for root either, but does so for non-root users)
as I complained above.
It does also work for this RFE's purpose of separating the /usr filesystem.
So, using a copy of the binary provided by OpenIndiana already would seemingly solve this half of the RFE:
# mv /sbin/sh /sbin/sh.old && cp -pf /usr/bin/bash /sbin/sh
# ldd /sbin/sh
        libsocket.so.1 =>        /lib/libsocket.so.1
        libresolv.so.2 =>        /lib/libresolv.so.2
        libnsl.so.1 =>   /lib/libnsl.so.1
        libgen.so.1 =>   /lib/libgen.so.1
        libc.so.1 =>     /lib/libc.so.1
        libcurses.so.1 =>        /lib/libcurses.so.1
        libmd.so.1 =>    /lib/libmd.so.1
        libmp.so.2 =>    /lib/libmp.so.2
        libm.so.2 =>     /lib/libm.so.2
# /sbin/sh --version
GNU bash, version 4.0.28(1)-release (i386-pc-solaris2.11)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
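The point of the ldd listing is that every library the new /sbin/sh needs resolves under /lib, i.e. on the root filesystem, so the shell can start before /usr is mounted. A small filter can make that check explicit for any candidate binary. This is just a sketch: the function name libs_under_usr is made up here, and it assumes ldd prints lines in the "name => path" form shown above.

```shell
#!/bin/sh
# Sketch: read ldd output on stdin and print every dependency whose resolved
# path lives under /usr (and so would be missing before /usr is mounted).
# libs_under_usr is a hypothetical helper name.
libs_under_usr() {
    awk '$2 == "=>" && $3 ~ /^\/usr\// { print $3 }'
}

# Intended use (not run here):
#   ldd /sbin/sh | libs_under_usr
# No output means no dependency was resolved from under /usr.
```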
Updated by Jim Klimov almost 11 years ago
PS: bash is a bit more compact than ksh in RAM as well ;)
# ps -o rss,vsz,comm | grep sbin/sh
2476 3872 /sbin/sh.ksh
2552 3680 /sbin/sh.bash
2392 3680 /sbin/sh
Strangely, even though /sbin/sh.bash and /sbin/sh are copies of the same file, they show slightly different RSS sizes... ;)
Updated by Jim Klimov almost 11 years ago
Another issue was located and fixed with hierarchical datasets: "rpool/SHARED/var/adm" was not automatically mounted as part of "filesystem/minimal", which led to misuse of /var/adm/utmpx and such files. This has little noticeable effect except that "uptime", "w" and similar commands report wrong uptimes (starting at approximately the installation time and the last recorded reboot), and the "last" command does not include system shutdown and boot times. This is annoying, but not fatal.
This patch adds support for such shared system datasets to the SMF method (although the list remains as it was, and other datasets get mounted with "zfs mount -a" sometime later):
# gdiff -Naur /lib/svc/method/fs-minimal.orig /lib/svc/method/fs-minimal
--- /lib/svc/method/fs-minimal.orig	2011-09-12 15:04:00.000000000 +0400
+++ /lib/svc/method/fs-minimal	2012-06-13 00:28:44.146412124 +0400
@@ -31,6 +31,10 @@
 . /lib/svc/share/smf_include.sh
 . /lib/svc/share/fs_include.sh
 
+# Report selected mounts to /dev/console
+debug_mnt=0
+[ -f /.debug_mnt ] && debug_mnt=1
+
 # Mount other file systems to be available in single user mode.
 # Currently, these are /var, /var/adm, /var/run and /tmp. A change
 # here will require a modification to the following programs (and
@@ -49,6 +53,7 @@
 	if [ -n "$mountp" ]; then
 		mounted $mountp $mntopts $fstype < /etc/mnttab && continue
 		checkfs $fsckdev $fstype $mountp || exit $SMF_EXIT_ERR_FATAL
+		[ x"$debug_mnt" = x1 ] && echo "Mounting '$fs': use '$mountp' of type '$fstype' with opts '$mntopts' from vfstab" > /dev/console
 		mountfs -O $mountp $fstype $mntopts - ||
 		    exit $SMF_EXIT_ERR_FATAL
 		continue
@@ -57,10 +62,50 @@
 		mountpt=`zfs get -H -o value mountpoint $be$fs 2>/dev/null`
 		if [ $? = 0 ] ; then
 			if [ "x$mountpt" = "x$fs" ] ; then
+				[ x"$debug_mnt" = x1 ] && echo "Mounting '$fs': use '$be$fs': in same root hierarchy" > /dev/console
 				/sbin/zfs mount -O $be$fs
+				continue
 			fi
 		fi
-	fi
+		# These mountpoints can be shared among BEs in a separate tree.
+		# Find and mount matching automountable datasets; if there is
+		# choice - prefer the (first found?) one in the current rpool.
+		mountdslist="`zfs list -H -o canmount,mountpoint,name | awk '( $1 == "on" && $2 == "'"$fs"'" ) {print $3}' 2>/dev/null`"
+		if [ $? = 0 -a "x$mountdslist" != x ] ; then
+			if [ "x`echo "$mountdslist"|wc -l|sed 's/ //g'`" = x1 ]; then
+				# We only had one hit
+				[ x"$debug_mnt" = x1 ] && echo "Mounting '$fs': use '$mountdslist': the only option" > /dev/console
+				/sbin/zfs mount -O "$mountdslist"
+				continue
+			else
+				rpoolname="`echo "$be" | awk -F/ '{print $1}'`"
+				mountdspref="`echo "$mountdslist" | egrep '^'"$rpoolname/" | head -1`"
+				if [ $? = 0 -a "x$mountdspref" != x ] ; then
+					[ x"$debug_mnt" = x1 ] && echo "Mounting '$fs': use '$mountdspref': same rpool" > /dev/console
+					/sbin/zfs mount -O "$mountdspref"
+					continue
+				fi
+				# This is the least-definite situation: several
+				# matching datasets exist, and none on the current
+				# rpool. See if any pools can be ruled out due to
+				# bad (non-default) altroots.
+				for mountds in $mountdslist; do
+					dspool="`echo "$mountds" | awk -F/ '{print $1}'`"
+					dspool_altroot="`zpool list -H -o altroot "$dspool"`"
+					if [ $? = 0 -a \
+					    x"$dspool_altroot" = "x-" -o \
+					    x"$dspool_altroot" = "x/" ]; then
+						[ x"$debug_mnt" = x1 ] && echo "Mounting '$fs': use '$mountds': good altroot" > /dev/console
+						/sbin/zfs mount -O "$mountds"
+						continue
+					fi
+				done
+			fi
+		fi
+		# Technically, it is possible to have a pool named var with
+		# the default altroot and a dataset "var/adm" with an inherited
+		# mountpoint, which should automount into "/var/adm". TBD...
+	fi ### if root is ZFS
 done
 
 mounted /var/run - tmpfs < /etc/mnttab
@@ -73,6 +118,7 @@
 fi
 
 if [ "$rootiszfs" = 1 ] ; then
+	# Mount (other) possible children of current rootfs dataset
 	/sbin/zfs list -rH -o mountpoint -s mountpoint -t filesystem $be | \
 	    while read mountp ; do
 		if [ "x$mountp" != "x" -a "$mountp" != "legacy" ] ; then
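To make the selection order in the patch easier to follow, here is the same preference logic reduced to plain text processing, so it can be tried without touching any pools. It is only a sketch of the first two steps (a single candidate wins; otherwise prefer a candidate in the current root pool). The altroot check is left out, the function name pick_dataset is invented for this example, and a POSIX awk with -v support (nawk or gawk) is assumed.

```shell
#!/bin/sh
# Sketch of the dataset preference order from the patch, on text input only.
# stdin: "canmount<TAB>mountpoint<TAB>name" lines, i.e. the output format of
#        zfs list -H -o canmount,mountpoint,name
# $1: wanted mountpoint (e.g. /var/adm), $2: name of the current root pool
# pick_dataset is a hypothetical helper name.
pick_dataset() {
    awk -F'\t' -v fs="$1" -v pool="$2" '
        $1 == "on" && $2 == fs {
            n++
            if (first == "") first = $3
            # Remember the first candidate that lives in the current root pool
            if (pref == "" && index($3, pool "/") == 1) pref = $3
        }
        END {
            if (n == 1) print first          # the only candidate
            else if (pref != "") print pref  # several: prefer the current rpool
            # otherwise print nothing (the altroot check is not sketched here)
        }'
}

# Intended use (not run here):
#   zfs list -H -o canmount,mountpoint,name | pick_dataset /var/adm rpool
```

An empty result corresponds to the "least-definite situation" in the patch, where several datasets match and none is on the current rpool.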
Updated by Jim Klimov over 10 years ago
I am not sure when support for a compressed root fs was added to OI/GRUB, but as of oi_151a5 I did successfully compress (gzip-9) and boot a BE (a copy of a stock openindiana installation). This basically eliminates the need for split-off usr and opt child datasets as described above, as long as the rationale was to get them compressed. Some deployments might still want to do that, but I (author and proponent of the technique above) can't conjure up a reason at the moment ;)
Still, things like different compression/quota/copies settings, or the desire to share stuff between BEs instead of wasting space on storing volatile data in clones of BEs occupying a full single FS, do still warrant a separate hierarchy of some things under /var, as described above...
Curiously, while the compression of root dataset now works, the "zfs set" and "zpool set" do still protect the user/admin from setting bootability on a compressed dataset (zpool set bootfs=...) or enabling compression on current bootable dataset.
This can be easily worked around if desired, by using an inherited compression setting for the rootfs'es, with the base value being defined in rpool/SHARED. I hope this would also allow retaining the benefits of compression when cloning BEs during installations and updates, so the protective checks above might become new trouble-makers now if bootfs compression is enabled...
Updated by Jim Klimov over 9 years ago
- #4352 - the relevant patches for fs-* SMF methods to mount such roots more reliably than currently existing ones, including a limited solution to #997 to enforce overlay-mounting for components of the root filesystem hierarchy,
- #4351 - the procedure to bring ksh93 and its dependency libraries into the root filesystem as /sbin/sh (so that /usr can be safely split off),
- #4360 - on a related note, update SMF methods and other scripts that should be executed by (/usr)/bin/sh to use /sbin/sh instead,
- #4361 - fix the networking/physical SMF method scripts to not require programs from /usr,
- #3569 - the procedure to replicate custom ZFS attributes after a "beadm clone" (such as compression for child datasets of the rootfs), including #4355 - to do this in particular when "pkg" creates the clones.
Issues #4353 and #4354 track proposed improvements to the Caiman installer with steps needed to support such installations as a hands-off automated procedure, and are not part of that article.
Updated by Jim Klimov almost 8 years ago
FYI: The scripts and patches relevant to this issue and ones related to it have initially been placed as attachments in the Wiki article, but were relocated to github this year: https://github.com/jimklimov/illumos-splitroot-scripts (licensed as CDDL, so hopefully the logic and/or implementation can get into upstream OI or illumos).