Feature #829


Support installing OpenIndiana into a hierarchy of datasets

Added by Jim Klimov over 13 years ago. Updated 5 months ago.

In Progress
Caiman (Installer)

Many of the SXCE and Solaris 10 systems I maintain were installed using a hierarchy of datasets for the root ZFS pool.

This approach separates the "/" root filesystem dataset from "/opt", "/var" and "/usr" (and optionally "/usrlocal") so that these 3 or 4 datasets can be kept as the root's children and compressed on disk, while the "/" filesystem stays uncompressed and thus remains bootable by GRUB.

Also, some of the user/log data from "/var" is shared between boot environments by living in dedicated (compressed) datasets. This allows switching back and forth between BEs while upgrading or testing updates, while keeping a single track of the system's historical data, etc.

NOTE that sharing "/var/tmp" like this proved to be a bad idea: the mountpoint may be written to early in boot (e.g. by the ipfilter service), which prevents mounting the shared FS later in the boot sequence.

Granted, a separate "/usr" system is sometimes a hassle during single-user mode repairs, but when we're tight with space on smallish boot media, the compressed "/usr" is more of a bonus than a problem - workarounds are documented internally. And with latest releases the automounting worked properly enough...

While not quite officially supported, this approach was discussed on the OpenSolaris forum and even worked transparently with LiveUpgrade - LU properly cloned the root's children when making a new BE, and attached existing shared datasets.

Here's a layout example for my current testbed system, which will hopefully work after the reboot (as older systems did well):

root@openindiana:~# df -k | egrep 'ROOT|SHARED|zfsroot|/a'
rpool/ROOT/oi_148a 14088223 190948 13897275 2% /zfsroot
14006968 109693 13897275 1% /zfsroot/usr
13899058 1783 13897275 1% /zfsroot/opt
13942656 45381 13897275 1% /zfsroot/var
13897306 31 13897275 1% /zfsroot/usrlocal
rpool/SHARED/var/adm 512000 66 511934 1% /zfsroot/var/adm
rpool/SHARED/var/log 13897327 52 13897275 1% /zfsroot/var/log
13948475 32 13948443 1% /zfsroot/var/mail
13897306 31 13897275 1% /zfsroot/var/crash
102400 31 102369 1% /zfsroot/var/cores

I propose that the OpenIndiana installer and other tools at least expect the possibility of a non-monolithic root FS.

Better yet - the interactive and automated installers should offer to create such a hierarchy if the user desires (e.g. later releases of Solaris 10 did allow creating a separate /var dataset, although not shared between BEs, which was a step in this direction).

Coupled with my bug 828 (letting OI install into existing rpools), it would be nice if the installer proposed to update such split roots as well.

//Jim Klimov

Related issues

Related to OpenIndiana Distribution - Feature #4353: Update Caiman installer to allow installation into an existing root pool and dataset hierarchy (New, 2013-11-24)

Related to OpenIndiana Distribution - Feature #4354: Update Caiman to offer split-root installations as an option (New, 2013-11-24)

Actions #1

Updated by Jim Klimov over 13 years ago

Just for info - here's the range of space savings (with lzjb) in split root filesystem datasets:

# df -k (selected datasets)
    rpool/ROOT/oi_148a 12941359 499121 12442239 4% /zfsroot
    rpool/ROOT/oi_148a/usr 13680979 1238740 12442239 10% /zfsroot/usr
    rpool/ROOT/oi_148a/opt 12444022 1783 12442239 1% /zfsroot/opt
    rpool/ROOT/oi_148a/var 12487620 45381 12442239 1% /zfsroot/var
    rpool/ROOT/oi_148a/usrlocal 12442270 31 12442239 1% /zfsroot/usrlocal
    rpool/SHARED/var/adm 512000 66 511934 1% /zfsroot/var/adm
    rpool/SHARED/var/log 12442290 52 12442239 1% /zfsroot/var/log
    rpool/SHARED/var/mail 12493439 32 12493407 1% /zfsroot/var/mail
    rpool/SHARED/var/crash 12442270 31 12442239 1% /zfsroot/var/crash
    rpool/SHARED/var/cores 102400 31 102369 1% /zfsroot/var/cores
    rpool/ROOT/openindiana 15411543 2969304 12442239 20% /a

    7.74G 48.5K 1.18x off rpool
    4.56G 31K 1.24x off rpool/ROOT
    1.70G 487M 1.66x off rpool/ROOT/oi_148a
    1.74M 1.74M 2.17x on rpool/ROOT/oi_148a/opt
    1.18G 1.18G 1.91x on rpool/ROOT/oi_148a/usr
    31K 31K 1.00x on rpool/ROOT/oi_148a/usrlocal
    44.3M 44.3M 2.11x on rpool/ROOT/oi_148a/var
    2.86G 2.83G 1.00x off rpool/ROOT/openindiana
    50.2M 31K 2.16x on rpool/SHARED
    50.2M 31K 2.29x on rpool/SHARED/var
    66K 66K 4.63x on rpool/SHARED/var/adm
    31K 31K 1.00x on rpool/SHARED/var/cores
    31K 31K 1.00x on rpool/SHARED/var/crash
    51.5K 51.5K 1.69x on rpool/SHARED/var/log
    32K 32K 1.00x on rpool/SHARED/var/mail
    1.50G 1.50G 1.00x off rpool/dump
    96.5K 32K 1.00x off rpool/export
    64.5K 32K 1.00x off rpool/export/home
    32.5K 32.5K 1.00x off rpool/export/home/admin
    1.59G 126M 1.00x off rpool/swap

So, overall, the 2.9G default installation compressed into 1.7G - not so bad for users with e.g. USB (or SSD) flash drives, old small boot HDDs, etc.

Actions #2

Updated by Jim Klimov over 13 years ago

Some good news. Apparently, beadm automatically found the new hierarchy - even before I rebooted from LiveUSB system instance.

However it only "sees" the child datasets of the rpool/ROOT/oi_148a (var usr opt), and does not mount the rpool/SHARED/var/* datasets. Even despite the manual changes to etc/vfstab to include these datasets - like I do on other servers ;)

Actions #3

Updated by Jim Klimov over 13 years ago

And bad news. It seems that unlike SXCE and Solaris 10 releases, OpenIndiana requires /usr parts (some files from sbin/, bin/, lib/) to be available for basic bootup, even before it tries to mount filesystems in /etc/vfstab by "mount -a" or whatever, or does "zfs mount -a".

Also it seems there is no "FailSafe" miniroot...

SO FAR I'm abandoning the idea - until the dependency on /usr is resolved. "/usr" takes up most of the root disk space, so the inability to separate and compress it makes much of the rest of the idea useless.

However, mounting the shared var/(log|adm|crash|cores|mail) does work nearly as expected - via vfstab. ZFS automounts kick in too late in the boot sequence now - after some log-writing services try to start :(


Actions #4

Updated by Julian Wiesener almost 13 years ago

  • Status changed from New to Rejected
  • Difficulty set to Medium
  • Tags set to needs-triage

Changing the way we handle boot environments is currently out of scope for us, since we would possibly need to modify beadm, IPS and bootadm, while it doesn't give us much advantage. If you want to implement it, please do so. You're very welcome to provide patches we can integrate into our installer and IPS.
beadm and bootadm changes would need to pass the illumos integration rules. I propose it is best to start with beadm/bootadm in illumos and if that does work right, continue with IPS and our installer.

Actions #5

Updated by Jim Klimov about 12 years ago

  • Status changed from Rejected to Feedback


After a while I got another machine to test the split-root hierarchy with, now with oi_151a release, and I've played with the situation for a week now. Reopening the case to add my findings, and because it seems quite solvable - beadm support seems adequate to me (although things don't seem to work if I circumvent beadm).

Main problem is actually with BROKEN "/sbin" contents, and singularly with "/sbin/sh" which is now a symlink to "../usr/bin/ksh93". The "/sbin" directory is supposed to be there for "system binaries", the minimum set required for OS maintenance and bootup. For much of Solaris history it contained either statically compiled binaries with self-sufficient executable files, or at most with linker dependencies on "/lib" contents which are also relatively small.

Now in current OpenIndiana "/sbin/sh" which is "/usr/bin/ksh" pulls along lots of libraries from "/usr/lib" making the system unbootable if "/usr" is a filesystem separate from "/" root, and separating the needed pieces (to put them into root dataset) is not very trivial. In this case the kernel just loops with "INIT unable to spawn shell to run stuff" kind of message.
However, I copied over an "/sbin/sh" from an older SXCE release (where it is a compact old Bourne shell), and the system booted almost OK - some SMF methods had to be changed to use ksh directly instead of "#!/sbin/sh" (syntax differences failed these scripts under Bourne shell "sh"). For me these were "/lib/svc/method/logadm-upgrade" and "/lib/svc/method/net-physical". This fix seems in line with the finding that most other standard manifests use "/bin/ksh" or "/bin/ksh -p" or "/usr/bin/ksh93" already.
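
A quick way to see which libraries tie a binary to "/usr" is to filter the ldd output for paths under "/usr/". The helper below is only a sketch: the name usr_deps is made up for illustration, and it assumes the usual "name => path" layout of ldd output.

```shell
# usr_deps: read `ldd` output on stdin and print every resolved
# library path that lives under /usr/. (Hypothetical helper name.)
usr_deps() {
  awk '$2 == "=>" && $3 ~ /^\/usr\// { print $3 }'
}

# Example (run on the system being checked):
#   ldd /sbin/sh | usr_deps
```

An empty result would suggest the binary can run before "/usr" is mounted, at least as far as direct library dependencies go.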

I think a proper fix would be to build and package a static ksh93 and store that as "/sbin/sh", the rest of system already seems ready for the split-root hierarchy.

Also I found that root's default user "~/.profile" files were changed to use BASH syntax (like "export VAR=VAL" instead of "VAR=VAL; export VAR") and while I like bash, this choice of incompatible syntax is rather a regression. Absence of bash in my minimal root FS and the resulting lack of a proper PATH (due to "export PATH=..." being invalid for old "sh") was an inconvenience during single-user mode boots which had not mounted "/usr" for one reason or another. Again, in golden old Solaris days, root shell was "/sbin/sh" (not relying on other FSes); now it could be a static build of "/sbin/bash", I suppose. Not "/usr/bin/bash" as of now.


Here is the filesystem layout my test box has currently booted with:

root@openindiana:~# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
                      51660103    684780  48925050   2% /
swap                   5179448      1104   5178344   1% /etc/svc/volatile
                      49698442    773392  48925050   2% /usr
swap                   5178348         4   5178344   1% /tmp
                      48925081        31  48925050   1% /usrlocal
                      48953565     28515  48925050   1% /var
swap                   5178388        44   5178344   1% /var/run
rpool/export          48925083        33  48925050   1% /export
rpool/export/ftp      48925082        32  48925050   1% /export/ftp
                      49737913    812863  48925050   2% /export/ftp/distribs
rpool/export/home     48925083        33  48925050   1% /export/home
                      48925086        36  48925050   1% /export/home/jim
rpool                 48925100        50  48925050   1% /rpool
rpool/SHARED/var/adm  48925339       289  48925050   1% /var/adm
                       5242862        31   5242831   1% /var/cores
                       5242862        31   5242831   1% /var/crash
rpool/SHARED/var/log  48925117        67  48925050   1% /var/log
                      48925082        32  48925050   1% /var/mail
                       2097134        33   2097102   1% /var/spool/clientmqueue
                       2097134        31   2097103   1% /var/spool/mqueue

The dataset hierarchy, including other test roots:

root@openindiana:~# zfs list -o canmount,mountpoint,referenced,compression,compressratio,name
      on  /rpool                     50K       off  1.14x  rpool
     off  legacy                     31K       off  1.21x  rpool/ROOT
  noauto  /                         420M       off  1.00x  rpool/ROOT/oi_151a
  noauto  /opt                     1.41M       off  1.00x  rpool/ROOT/oi_151a/opt
  noauto  /usr                      755M    gzip-9  1.00x  rpool/ROOT/oi_151a/usr
  noauto  /usrlocal                  31K    gzip-9  1.00x  rpool/ROOT/oi_151a/usrlocal
  noauto  /var                     27.8M       off  1.00x  rpool/ROOT/oi_151a/var
  noauto  /                        2.61G       off  1.00x  rpool/ROOT/oi_151b
  noauto  /opt                     1.41M       off  1.00x  rpool/ROOT/oi_151b/opt
  noauto  /usrlocal                  31K       off  1.00x  rpool/ROOT/oi_151b/usrlocal
  noauto  /var                     27.7M       off  1.00x  rpool/ROOT/oi_151b/var
  noauto  /                        2.61G       off  1.00x  rpool/ROOT/openindiana
  noauto  /opt                     1.41M    gzip-9  1.00x  rpool/ROOT/openindiana/opt
  noauto  /usrlocal                  31K       off  1.00x  rpool/ROOT/openindiana/usrlocal
  noauto  /var                     27.7M    gzip-9  2.91x  rpool/ROOT/openindiana/var
  noauto  /                        2.61G       off  1.00x  rpool/ROOT/openindiana2
  noauto  /opt                     1.41M       off  1.00x  rpool/ROOT/openindiana2/opt
  noauto  /usrlocal                  31K       off  1.00x  rpool/ROOT/openindiana2/usrlocal
  noauto  /var                     27.9M       off  1.00x  rpool/ROOT/openindiana2/var
  noauto  /                        422M       off  1.00x  rpool/ROOT/openindiana3
  noauto  legacy                   1.41M       off  1.00x  rpool/ROOT/openindiana3/opt
  noauto  legacy                    755M       off  1.00x  rpool/ROOT/openindiana3/usr
  noauto  legacy                     31K       off  1.00x  rpool/ROOT/openindiana3/usrlocal
  noauto  legacy                   27.9M       off  1.00x  rpool/ROOT/openindiana3/var
  noauto  /a                        422M       off  1.41x  rpool/ROOT/openindiana4
  noauto  /opt                     1.41M       off  3.16x  rpool/ROOT/openindiana4/opt
  noauto  /usr                      755M       off  3.02x  rpool/ROOT/openindiana4/usr
  noauto  /usrlocal                  31K       off  1.00x  rpool/ROOT/openindiana4/usrlocal
  noauto  /var                     27.9M       off  3.03x  rpool/ROOT/openindiana4/var
  noauto  legacy                     31K       off  3.82x  rpool/SHARED
  noauto  /var                       31K    gzip-9  3.91x  rpool/SHARED/var
      on  /var/adm                  289K      lzjb  5.85x  rpool/SHARED/var/adm
      on  /var/cores                 31K    gzip-9  1.00x  rpool/SHARED/var/cores
      on  /var/crash                 31K    gzip-9  1.00x  rpool/SHARED/var/crash
      on  /var/log                 66.5K      lzjb  1.70x  rpool/SHARED/var/log
      on  /var/mail                  32K      lzjb  1.00x  rpool/SHARED/var/mail
     off  /var/spool                 31K    gzip-9  1.00x  rpool/SHARED/var/spool
      on  /var/spool/clientmqueue  32.5K      lzjb  1.00x  rpool/SHARED/var/spool/clientmqueue
      on  /var/spool/mqueue          31K      lzjb  1.00x  rpool/SHARED/var/spool/mqueue
       -  -                        2.00G       off  1.00x  rpool/dump
      on  /export                    33K      lzjb  1.01x  rpool/export
      on  /export/ftp                32K      lzjb  1.01x  rpool/export/ftp
      on  /export/ftp/distribs      794M    gzip-9  1.01x  rpool/export/ftp/distribs
      on  /export/home               33K      lzjb  1.02x  rpool/export/home
      on  /export/home/admin         36K      lzjb  1.03x  rpool/export/home/admin
      on  /export/home/jim           36K      lzjb  1.03x  rpool/export/home/jim
       -  -                         731M       off  1.00x  rpool/swap

In particular, as can be seen above, the default installation's "/usr" can be compressed about 3x, from 2.3 GB to 800 MB, with gzip-9 (like in the livecd compressed image). This is a substantial difference on many systems I maintain (small local HDDs or Flash devices for roots, or numerous VMs colocated on relatively small storage):

root@openindiana:~# beadm mount openindiana /a
Mounted successfully on: '/a'

root@openindiana:~# du -ks /a/usr
2332540 /a/usr

root@openindiana:~# du -ks /usr
802290  /usr
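
The ratio follows directly from the two du figures above. A small sketch of the arithmetic (the helper names are mine, not part of any tool):

```shell
# compress_ratio BEFORE_KB AFTER_KB: uncompressed size divided by compressed size
compress_ratio() {
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%.2f\n", b / a }'
}

# saved_pct BEFORE_KB AFTER_KB: percentage of disk space saved
saved_pct() {
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (b - a) * 100 / b }'
}

compress_ratio 2332540 802290   # prints 2.91
saved_pct 2332540 802290        # prints 65.6
```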

I did have inexplicable problems when trying to "zfs snapshot", "zfs clone" and "zfs set org.opensolaris.libbe:uuid" my root hierarchies - they happened to be unbootable (freezing after mounting root and detecting hardware). However, "beadm create" of current root's clone and further mutilation of its datasets worked okay. I trussed the "beadm create" calls but failed to note any difference in my procedure (which worked for SXCE) from its inner magic.
Perhaps the UUID has to be saved somewhere, or not be completely random? ;)

It was not at all difficult to separate "/var" and "/opt" into child datasets of the current root's clone and move (rsync) files into them. While the root dataset is still required to be uncompressed (I've tested that explicitly - GRUB fails to mount it even if I get past the failsafes implemented by beadm and zfs/zpool to disallow setting compression on bootable datasets or setting bootability on compressed datasets), its children may use "lzjb" or "gzip-9" at least (the ones I tested) since they are mounted by a fully-fledged Solaris libzfs and not the trimmed version in GRUB.

Also I've made a number of shared datasets to serve as subdirectories of "/var" in all BEs as seen above (shared logs, mail spools, etc.), so I can enforce different compression and quotas on these space hogs while using the same log history, etc. across reboots into different BEs.

You can see above the intricate structure under "rpool/SHARED/var/*" which mixes "canmount=off" for some nodes (var, var/spool) and allows automounts for actual directories. They are mounted after the proper root dataset and its children, so no conflict occurs such as attempts to mount "/var/adm" before "/var". Still, I protected all such mountpoints in the parent datasets by immutability as detailed below.
NOTE however, that they are not mounted with "beadm mount" when you configure an alternate BE, just like it does not mount "/export/*" and other datasets into the ABE. If you require them to get automounted in these cases, you might instead use legacy mounts and /etc/vfstab references. Or you could use loop-mounts manually to link the directories (see lofs(7FS) like in delegation of paths to local zones).

Such separated "/var", "/var/*", "/usrlocal" and "/opt" mounts work for both bootups and "beadm mount" either as "canmount=noauto,mountpoint=legacy" datasets with mount definitions from the appropriate BE root's "etc/vfstab" file, or as "canmount=noauto,mountpoint=/var" etc. datasets WITHOUT a reference in "etc/vfstab". The "/var/*" components are automatically mounted with "zfs mount -a" on this system, and as legacy mountpoints with references from "/etc/vfstab" on some of my other systems.
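
For the legacy-mount variant, each dataset needs "mountpoint=legacy" and a line in the BE's "etc/vfstab". Below is a sketch of what such lines look like, printed by a hypothetical helper (vfstab_zfs_line is not an existing tool; the dataset names are the ones from the listing above):

```shell
# vfstab_zfs_line DATASET MOUNTPOINT
# Prints one vfstab entry. Fields: device to mount, device to fsck,
# mount point, FS type, fsck pass, mount at boot, mount options.
vfstab_zfs_line() {
  printf '%s\t-\t%s\tzfs\t-\tyes\t-\n' "$1" "$2"
}

# Print entries for a few of the shared /var datasets
for d in adm log mail; do
  vfstab_zfs_line "rpool/SHARED/var/$d" "/var/$d"
done
```

The output is meant to be reviewed and appended by hand to the vfstab of the BE in question.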

NOTE that it is obviously wrong to use "canmount=on,mountpoint=/var" type of entries, because different BE's "/var" child datasets would come into conflict. This is also not needed for the BE mount with beadm or boot procedures, and beadm sets "canmount=noauto" to such children.
In particular, I was pleased to see that beadm and bootup procedure both properly mounted non-standard paths like my "/usrlocal" by virtue of this dataset being a "canmount=noauto" child of the current root.

NOTE: There can be a conflict where you have both "canmount=noauto,mountpoint=/var" and a reference in "etc/vfstab" - leading to "svc:/filesystem/minimal" failures ("mount -a" fails because it only supports explicit legacy mountpoints).
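
A rough way to spot this conflict before rebooting is to check whether a mountpoint is listed in a BE's vfstab while the dataset itself is not set to legacy. This is only a sketch; the helper name is hypothetical, and the commented usage assumes the ABE is mounted at "/a":

```shell
# vfstab_has_mountpoint FILE MOUNTPOINT
# Succeeds if FILE has a non-comment entry whose third field is MOUNTPOINT.
vfstab_has_mountpoint() {
  awk -v mp="$2" '$1 !~ /^#/ && $3 == mp { found = 1 } END { exit !found }' "$1"
}

# Example for /var of an alternate BE mounted at /a:
#   mp=$(zfs get -H -o value mountpoint rpool/ROOT/${BENAME}/var)
#   if [ "$mp" != legacy ] && vfstab_has_mountpoint /a/etc/vfstab /var; then
#     echo "conflict: /var is automounted AND listed in etc/vfstab"
#   fi
```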

NOTE: One issue I had was that I tried to be lazy and "inherit mountpoint". The effective values remained like "/a/var", which precluded further booting and dropped the system into single-user mode, but that was easy to fix with an explicit mountpoint definition. It seems that "beadm mount" works safely and does not overwrite the defined mountpoint attributes when it mounts the BE into an alternate root.

NOTE that root's homedir "/root" could not be separated safely in my tests, but one can make "rpool/SHARED/root-data" and mount that under all BE's like "/root/root-data". YMMV.

I protected the mountpoints (such as "/var" in the root dataset) to be immutable directories so that for example "/var/adm" can not be mounted if proper "/var" is not yet mounted. Otherwise ZFS refused to mount "/var" into a contaminated non-empty directory (even after reboots when there is no longer a sub-FS there, only the mountpoint remains).
It was even more important to protect paths like "/var/adm" in the "/var" dataset itself - so that no logs, utmpx or similar files would be created before an actual "/var/adm" dataset is mounted.
Roughly, the routine went like this (NOTE that the default path points to GNU utilities first, and the GNU chmod lacks advanced features needed for ZFS ACLs):

# umount /a/var
# /usr/bin/chmod S+ci /a/var
# mkdir /a/var/xxx
mkdir failed: Not owner
# mount /a/var
# /usr/bin/chmod S+ci /a/var/adm /a/var/log .....

This particular problem can also be worked around with legacy mounts and the "-O" option for overlay mounts in "/etc/vfstab" entries (see also my RFE #997 to add support for overlay auto-mounting as a zfs dataset attribute).

The main (profanity removed) "issue" I finally had was with separating "/usr" into a separate dataset, compressed or not. The core problem was that "/sbin" no longer contains self-sufficient binaries, and the system bootup routine now depends on resources from "/usr". I detailed this issue in the overview above.

My workaround was to embed a self-sufficient "/sbin/sh" into the root dataset, and fix some SMF service manifests to reference KSH93 "/usr/bin/ksh93" directly.
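
To find candidate method scripts for that fix, one can list the files that still start with a "#!/sbin/sh" shebang. This is a sketch with a made-up helper name; not every script it lists is necessarily broken, only those that use ksh-specific syntax will fail under a Bourne shell:

```shell
# list_sbin_sh_scripts DIR
# Print regular files directly under DIR whose first line is a
# "#!/sbin/sh" shebang (optionally followed by arguments).
list_sbin_sh_scripts() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    first=$(head -n 1 "$f" 2>/dev/null)
    case "$first" in
      '#!/sbin/sh'|'#!/sbin/sh '*) echo "$f" ;;
    esac
  done
}

# Example for an alternate BE mounted at /a:
#   list_sbin_sh_scripts /a/lib/svc/method
```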

Finally, I also tested that this new setup is supportable by "beadm create" - and it cloned my split-root BE just nicely, "beadm mount"ed the clone, activated and booted it.
I've also changed that clone from using legacy mounts and "/etc/vfstab" into automatic mounts of root dataset's children with "canmount=noauto,mountpoint=/var" etc.


Overall, as of the oi_151a release the system seems almost ready to support complicated hierarchies (in order to provide compression, quotas, different copies settings and whatever else you can think of), with different working methods of automounting the component directories. Most components can be easily separated manually post-installation (and it is a matter of updating the Caiman wizards to do this during initial installation).

One key to separating the "/usr" partition and saving over a gigabyte of disk in the default installation is simply to once again provide a self-sufficient "/sbin/sh" executable.

Actions #6

Updated by Jim Klimov about 12 years ago

Additional experiment regarding the internal magic of beadm: starting with my split root hierarchy described above, I made a recursive zfs snapshot and a set of cloned filesystems as an alternate root with no other changes. This root has booted successfully regardless of whether "org.opensolaris.libbe:uuid" was set at all, or was set to a manually defined value (e.g. another dataset's uuid with some characters zeroed out).

However "zfs create"'ing a hierarchy and rsync'ing files into it has failed the boot test :(

For clarity, I ought to retry with an ACL-aware "sun cpio" or such utility, as this is about the only idea I have left now. Open to suggestions ;)

Actions #7

Updated by Jim Klimov about 12 years ago

Okay, there is no beadm magic. Only "/sbin/sh" remains as the single problem for split-dataset root environments.


A fully-manual method HAS also worked out at last. The culprit was in the loopback-mounted files like this (and damn, I knew and documented this before!):

                      51797824    774447  51023377   2% /lib/

One way or another, a root hierarchy manually created and copied "from scratch" has booted up for me now.
Both with "find ... | cpio" as detailed below, and with "rsync -xavPHK" as I tried several times before.

Following is the routine to recreate in my case, YMMV.

1. Create and mount the BE root dataset hierarchy, including protected (immutable) mountpoints for the shared paths I had created before:

# Create the BE root (uncompressed, since GRUB must read it) and mount it at /a
zfs create -o canmount=noauto -o mountpoint=/a rpool/ROOT/${BENAME}  && \
mkdir -p /a && \
zfs mount rpool/ROOT/${BENAME} && \
cd /a && \
for D in var opt usr usrlocal; do
  # Make the mountpoint directory immutable before mounting the child over it
  mkdir $D  && \
  /bin/chmod S+ci $D  && \
  zfs create -o canmount=noauto -o mountpoint=/a/$D -o compression=gzip-9 rpool/ROOT/${BENAME}/$D  && \
  zfs mount rpool/ROOT/${BENAME}/$D
done

cd /a/var && \
mkdir -p adm log mail cores crash spool/clientmqueue spool/mqueue && \
/bin/chmod S+ci adm log mail cores crash spool/clientmqueue spool/mqueue

If you did not have a separated "/usr/local", create the symlink:

ln -s ../usrlocal /a/usr/local

The lack of a UUID has not proven fatal for beadm BE switches to work back and forth, but it might be nice to set a unique ID anyway. Be creative, change some HEX characters in the sample uuid below:

zfs set org.opensolaris.libbe:uuid=c2c6c968-9866-c662-aac1-86c6cc77c2ff rpool/ROOT/${BENAME}
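
Instead of editing hex digits by hand, a random value of the same shape can be generated. This is a sketch under assumptions: the helper name is made up, it needs /dev/urandom, and it produces a random string in the 8-4-4-4-12 layout rather than a standards-compliant UUID (nothing in my tests suggested beadm cares about the difference):

```shell
# gen_be_uuid: print 16 random bytes as 8-4-4-4-12 lowercase hex.
gen_be_uuid() {
  # Keep only the hex digits from the od dump (32 characters)
  hex=$(od -An -N16 -tx1 /dev/urandom | tr -dc '0-9a-f')
  printf '%s\n' "$hex" |
    sed 's/^\(.\{8\}\)\(.\{4\}\)\(.\{4\}\)\(.\{4\}\)\(.\{12\}\)$/\1-\2-\3-\4-\5/'
}

# Example:
#   zfs set org.opensolaris.libbe:uuid=$(gen_be_uuid) rpool/ROOT/${BENAME}
```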

2. Loopback-mount the current root and copy it over:

mount -F lofs -o nosub / /mnt && \
cd /mnt  && \
find . -xdev -depth -print | cpio -pPvdm /a/  && \
cd / && \
umount /mnt

Primarily this helps keep other mounts from intruding into the real root dataset and possibly hiding some of the original data.
Skipping this known step may have been the single critical error I made during my earlier rsync tests (perhaps they simply skipped copying the file hidden under the loopback mount, which is kinda critical).

The same can be done with rsync:

mount -F lofs -o nosub / /mnt && \
rsync -xavPHK /mnt/ /a/ && \
umount /mnt

3. If your original root BE is comprised of more than one dataset (mountpoint) already, copy others over:

cd /
find var opt usr usrlocal -xdev -depth -print | cpio -pPvdm /a/

As above, this invocation of find should only locate the contents of mountpoints listed on the command line and not descend into their submounts if any (like /var/run). If this does not work, try old-school ways:
cd /
for D in var opt usr usrlocal ; do
  # Run find inside each mountpoint so the copied paths are relative to it
  ( cd /$D && find . -xdev -depth -print | cpio -pPvdm /a/$D )
done

There may be errors when cpio tries to change the immutable mountpoints (update ownership, timestamps, or even worse - try to add files into them); these are okay - that's what we're protecting against. After mounting the proper datasets, their root directories (such as "/var/adm/") should be assigned correct POSIX properties.

The same can be done with rsync:

for D in var opt usr usrlocal ; do
  rsync -xavPHK /$D/ /a/$D/
done

4. Clean up for beadm to work:

for D in var opt usr usrlocal ; do
  zfs umount /a/$D
done
zfs umount /a

5. Use beadm to mount the new ABE and perhaps apply its magic (like updating children's mountpoints so that they do automount after reboot, etc.), and ultimately boot into it:

:; beadm mount ${BENAME} /a
Mounted successfully on: '/a'

:; /bin/df -k | grep " /a" 
rpool/ROOT/oi_151a   61415424  432365 51023734     1%    /a
rpool/ROOT/oi_151a/opt 61415424    1441 51023734     1%    /a/opt
rpool/ROOT/oi_151a/usrlocal 61415424      31 51023734     1%    /a/usrlocal
rpool/ROOT/oi_151a/var 61415424   28376 51023734     1%    /a/var
rpool/ROOT/oi_151a/usr 61415424  774455 51023734     2%    /a/usr

:; bootadm update-archive -R /a
(This is likely to return nothing since your current root's boot_archive should be up-to-date)

:; beadm umount ${BENAME}

:; beadm activate ${BENAME}

:; init 6

NOTE that "beadm mount; beadm umount" routine is fairly important to finalize things. Alternatively, you can manually set the mountpoint properties of datasets after you have unmounted them - this also works, but with more hassle.
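
For the manual alternative, the mountpoints need to end up as in the listings above ("/" for the BE root, "/var", "/usr" etc. for its children). The sketch below only prints the commands so they can be reviewed first; the helper name is hypothetical and the pool name "rpool" is assumed as in my setup:

```shell
# print_mountpoint_cmds BENAME CHILD...
# Dry run: prints the zfs commands instead of executing them.
print_mountpoint_cmds() {
  be=$1; shift
  echo "zfs set mountpoint=/ rpool/ROOT/$be"
  for d in "$@"; do
    echo "zfs set mountpoint=/$d rpool/ROOT/$be/$d"
  done
}

print_mountpoint_cmds oi_151a var opt usr usrlocal
```

Run the printed commands only after all datasets of the new BE have been unmounted.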

//Jim Klimov

Actions #8

Updated by Jim Klimov about 12 years ago

I've updated the system to oi_151a2 with "pkg update" and it worked along nicely with the hierarchical root, including the separated 2x-compressed "/usr", structured as described above and illustrated below.

root@openindiana:~# uname -a 
SunOS openindiana 5.11 oi_151a2 i86pc i386 i86pc

root@openindiana:~# pkg update
No updates available for this image.   

root@openindiana:~# zfs list -o canmount,mountpoint,referenced,compression,compressratio,name | grep ROOT; zfs list 
     off  legacy                               31K       off  1.60x  rpool/ROOT
  noauto  /                                   422M       off  1.05x  rpool/ROOT/oi_151a
  noauto  /                                   428M       off  1.00x  rpool/ROOT/oi_151a-1
  noauto  /opt                               3.83M       off  1.00x  rpool/ROOT/oi_151a-1/opt
  noauto  /usr                               1.09G       off  1.00x  rpool/ROOT/oi_151a-1/usr
  noauto  /usrlocal                            31K       off  1.00x  rpool/ROOT/oi_151a-1/usrlocal
  noauto  /var                               91.4M       off  1.00x  rpool/ROOT/oi_151a-1/var
  noauto  /                                   429M       off  1.60x  rpool/ROOT/oi_151a-2
  noauto  /opt                               3.83M       off  1.74x  rpool/ROOT/oi_151a-2/opt
  noauto  /usr                               1.25G       off  2.08x  rpool/ROOT/oi_151a-2/usr
  noauto  /usrlocal                            31K       off  1.00x  rpool/ROOT/oi_151a-2/usrlocal
  noauto  /var                                199M       off  1.26x  rpool/ROOT/oi_151a-2/var
  noauto  /opt                               1.38M    gzip-9  1.00x  rpool/ROOT/oi_151a/opt
  noauto  /usr                                755M    gzip-9  1.00x  rpool/ROOT/oi_151a/usr
  noauto  /usrlocal                            31K    gzip-9  1.00x  rpool/ROOT/oi_151a/usrlocal
  noauto  /var                               33.9M    gzip-9  1.16x  rpool/ROOT/oi_151a/var
  noauto  /                                   422M       off  1.00x  rpool/ROOT/oi_151a_release
  noauto  /opt                               1.38M    gzip-9  1.00x  rpool/ROOT/oi_151a_release/opt
  noauto  /usr                                755M    gzip-9  1.00x  rpool/ROOT/oi_151a_release/usr
  noauto  /usrlocal                            31K    gzip-9  1.00x  rpool/ROOT/oi_151a_release/usrlocal
  noauto  /var                               27.6M    gzip-9  1.00x  rpool/ROOT/oi_151a_release/var
NAME                                     USED  AVAIL  REFER  MOUNTPOINT
rpool                                   7.53G  51.0G  48.5K  /rpool
rpool/ROOT                              2.63G  51.0G    31K  legacy
rpool/ROOT/oi_151a-1                    7.91M  51.0G   428M  /
rpool/ROOT/oi_151a-1/opt                    0  51.0G  3.83M  /opt
rpool/ROOT/oi_151a-1/usr                2.13M  51.0G  1.09G  /usr
rpool/ROOT/oi_151a-1/usrlocal               0  51.0G    31K  /usrlocal
rpool/ROOT/oi_151a-1/var                4.98M  51.0G  91.4M  /var
rpool/ROOT/oi_151a-2                    2.62G  51.0G   429M  /
rpool/ROOT/oi_151a-2/opt                3.86M  51.0G  3.83M  /opt
rpool/ROOT/oi_151a-2/usr                1.37G  51.0G  1.25G  /usr
rpool/ROOT/oi_151a-2/usrlocal             33K  51.0G    31K  /usrlocal
rpool/ROOT/oi_151a-2/var                 418M  51.0G   199M  /var
rpool/ROOT/oi_151a                      7.40M  51.0G   422M  /
rpool/ROOT/oi_151a/opt                      0  51.0G  1.38M  /opt
rpool/ROOT/oi_151a/usr                  2.18M  51.0G   755M  /usr
rpool/ROOT/oi_151a/usrlocal                 0  51.0G    31K  /usrlocal
rpool/ROOT/oi_151a/var                  3.10M  51.0G  33.9M  /var
rpool/ROOT/oi_151a_release                54K  51.0G   422M  /
rpool/ROOT/oi_151a_release/opt             1K  51.0G  1.38M  /opt
rpool/ROOT/oi_151a_release/usr             1K  51.0G   755M  /usr
rpool/ROOT/oi_151a_release/usrlocal        1K  51.0G    31K  /usrlocal
rpool/ROOT/oi_151a_release/var             1K  51.0G  27.6M  /var
rpool/SHARED                            1.01M  51.0G    31K  legacy
rpool/SHARED/var                        1006K  51.0G    31K  /var
rpool/SHARED/var/adm                     533K  51.0G   410K  /var/adm
rpool/SHARED/var/cores                    67K  5.00G    31K  /var/cores
rpool/SHARED/var/crash                    49K  5.00G    31K  /var/crash
rpool/SHARED/var/log                     110K  51.0G  66.5K  /var/log
rpool/SHARED/var/mail                     50K  51.0G    32K  /var/mail
rpool/SHARED/var/spool                   166K  51.0G    31K  /var/spool
rpool/SHARED/var/spool/clientmqueue     68.5K  2.00G  32.5K  /var/spool/clientmqueue
rpool/SHARED/var/spool/mqueue             67K  2.00G    31K  /var/spool/mqueue
rpool/dump                              2.00G  51.0G  2.00G  -
rpool/export                             794M  51.0G    33K  /export
rpool/export/ftp                         794M  51.0G    32K  /export/ftp
rpool/export/ftp/distribs                794M  51.0G   794M  /export/ftp/distribs
rpool/export/home                        277K  51.0G    35K  /export/home
rpool/export/home/jim                     55K  51.0G    36K  /export/home/jim
rpool/swap                              2.12G  52.4G   731M  -

root@openindiana:~# df -k
Filesystem            kbytes    used   avail capacity  Mounted on
rpool/ROOT/oi_151a-2 61415424  439462 53514567     1%    /
/devices                   0       0       0     0%    /devices
/dev                       0       0       0     0%    /dev
ctfs                       0       0       0     0%    /system/contract
proc                       0       0       0     0%    /proc
mnttab                     0       0       0     0%    /etc/mnttab
swap                 5214072    1104 5212968     1%    /etc/svc/volatile
objfs                      0       0       0     0%    /system/object
sharefs                    0       0       0     0%    /etc/dfs/sharetab
                     61415424 1306014 53514567     3%    /usr
                     54820581 1306014 53514567     3%    /lib/
fd                         0       0       0     0%    /dev/fd
                     61415424  203270 53514567     1%    /var
swap                 5212972       4 5212968     1%    /tmp
swap                 5213016      48 5212968     1%    /var/run
                     61415424    3921 53514567     1%    /opt
                     61415424      31 53514567     1%    /usrlocal
rpool/export         61415424      33 53514567     1%    /export
rpool/export/ftp     61415424      32 53514567     1%    /export/ftp
                     61415424  812862 53514567     2%    /export/ftp/distribs
rpool/export/home    61415424      35 53514567     1%    /export/home
                     61415424      36 53514567     1%    /export/home/jim
rpool                61415424      48 53514567     1%    /rpool
rpool/SHARED/var/adm 61415424     409 53514567     1%    /var/adm
                     5242880      31 5242813     1%    /var/cores
                     5242880      31 5242831     1%    /var/crash
rpool/SHARED/var/log 61415424      66 53514567     1%    /var/log
                     61415424      32 53514567     1%    /var/mail
                     2097152      32 2097083     1%    /var/spool/clientmqueue
                     2097152      31 2097085     1%    /var/spool/mqueue

This adds another 2 cents to the point that the core OS and BE support is already in place. Development could focus on 'wizardization' of this setup (Caiman?) and on providing a static "/sbin/sh".

Actions #9

Updated by Jim Klimov about 12 years ago

Well, there is one more quirk that "beadm" could improve on in regard to such split-dataset root support: when cloning the BE, carry over some ZFS attributes such as the compression, so that the benefits of gzip-9 for /usr or lzjb for /var remain in place on ABEs as well.

Possibly other attributes should be copied from sources onto clones as well - perhaps any ZFS attrs that were explicitly defined in the source DSes (like optional dedup, copies, atime, zfs-auto-snap settings and so on).

Updated 2013-02-16: I posted an RFE #3569 to track this particular suggestion.
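
A minimal sketch of the attribute carry-over suggested above, not what beadm actually does. The helper name `emit_zfs_set` is made up for illustration. It reads tab-separated "property value" pairs, such as those printed by `zfs get -H -s local -o property,value all <source-dataset>`, and prints the `zfs set` commands that would re-apply them to a clone. It is a dry run that only prints commands, and the dataset name is just an example.

```shell
#!/bin/sh
# Hypothetical helper (not part of beadm): turn "property<TAB>value" lines
# on stdin into "zfs set" commands for the dataset named in $1.
# Dry run: the commands are printed, never executed.
emit_zfs_set() {
    target="$1"
    # Split input fields on tabs only, so values containing spaces survive
    while IFS=$(printf '\t') read -r prop value; do
        # Skip blank lines and properties without a value
        [ -n "$prop" ] && [ -n "$value" ] || continue
        printf 'zfs set "%s=%s" "%s"\n' "$prop" "$value" "$target"
    done
}

# Canned example input; on a real system the input could come from
#   zfs get -H -s local -o property,value all rpool/ROOT/oi_151a/usr
printf 'compression\tgzip-9\ncopies\t2\n' |
    emit_zfs_set rpool/ROOT/oi_151a-2/usr
```

A real implementation would probably need to skip properties that must differ between BEs (mountpoint or canmount, for instance); that filtering is left out of this sketch.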

Actions #10

Updated by Jim Klimov about 12 years ago

As I've recently discovered, the "pkgadd" command (for SVR4 support) now somehow relies on "/sbin/sh" being KSH. The relevant error (when "/sbin/sh" is a semi-static Bourne shell binary carried over from OpenSolaris SXCE) looks like this:

# gzcat MySVR4.pkg.gz > /tmp/x
# pkgadd -d /tmp/x
/sbin/sh: -1: bad number
pkgadd: ERROR: attempt to process datastream failed
    - process </usr/bin/cpio -icdumD -C 512> failed, exit code 1
pkgadd: ERROR: could not process datastream from </tmp/x>

The fix is to make "/sbin/sh" a KSH again. Reinstating the symlink would break bootability with a separate /usr. Until I have compiled a static binary, a lofs overlay can be used like this:

# mount -F lofs /usr/bin/`isainfo -n`/ksh93 /sbin/sh
# pkgadd -d /tmp/x                           

The following packages are available:
  1  MySVR4     My test SVR4 package
               (none) 1.9.24

Select package(s) you wish to process (or 'all' to process
all packages). (default: all) [?,??,q]: 

Processing package instance <MySVR4> from </tmp/x>
[ verifying class <none> ]
## Executing postinstall script.
Installation of <MySVR4> was successful.

DISCLAIMER: I did not test whether this is safe to use in e.g. /etc/vfstab (likely with a static 'amd64' or 'i386' ISA directory name) so that the original static Bourne shell is only used between boot and the mounting of /usr, nor did I test in which order these mounts happen.

Actions #11

Updated by Igor Pashev about 12 years ago

I think there is no need for a separate /usr.

/usr/local, /var/tmp, /var/log etc. are ok.

The idea is: all directories which are under control of packaging tools should be on a single filesystem,
so you can do consistent snapshots. Other directories could be on different FS.

Actions #12

Updated by Jim Klimov about 12 years ago

I think there is a need (moderate, since the systems are known to work without such separation), but nonetheless.

For one, compression of the installed base on the order of 2x-3x translates to savings: less space on disk, faster reads from disk, and perhaps less RAM (when/if ARC caches compressed blocks as they are stored on disk). This matters less for workstations with huge rpools (though even there I tend to use as small an rpool as reasonable, and keep the data pool separate) than for VMs with limited virtual disk sizes, or for embedded environments, e.g. on size-limited Flash storage (USB, CF, SSD).

On the other hand, many admins like myself are accustomed to such separation - though many old customs are being destroyed by the onslaught of IPS. A secondary argument for this position is that the SMF services still separately define "svc:/system/filesystem/root:default" and "svc:/system/filesystem/usr:default", and on a "distro" like the LiveCD there are different instances, e.g. "svc:/system/filesystem/root:net", "svc:/system/filesystem/root:media" and "svc:/system/filesystem/usr:live-media". So, outside of IPS control such separation is useful and used.

On the third, this just plain works well (with two quirks above - the non-static /sbin/sh and with beadm clones forgetting zfs attributes of their origin - like that they are compressed). This works with beadm snapshots too, so in terms of managing BEs with consistent snapshotting/cloning - this does work already.
Your argument does pose a slightly different point - about remaining under control of packaging tools - but not all illumos distros even use those, and I think the aim is consistency of boot environments - which we do have with separation of /usr as well.

You've also mentioned separation of /var/tmp. That is actually an interesting point - it worked on some of my older servers (e.g. SXCE, Sol10), but tends to work poorly in newer OpenSolaris descendants. The problem is that some SMF services start before secondary FSes like /var/tmp are mounted, and already expect that directory to be writeable. We can later overlay-mount a "real" separate /var/tmp dataset via /etc/vfstab (-O option) and override a non-empty mountpoint, but we cannot do so with ZFS automounts (yet, see bug #997). It is also questionable whether some such early-starting services would later need to access their (now hidden) temporary files.
So, while separation of /var/tmp would indeed make sense, so far I decided not to touch it. A possible solution might be to expand the list of primary FSes like in SMF service filesystem/minimal, filesystem/var or some such, which is a dependency for most of the other services, to enforce mounting of the separate /var/tmp if that is defined on the filesystem.

PS: Perhaps this task should be relocated to illumos-gate - I could not tell the difference well between two projects and their subprojects a year ago, and OpenIndiana was the only illumos-based distro I tried as a likely generic-distro future for my systems.
Apparently, since I now talk about the quest's usefulness outside of IPS and OpenIndiana, and since such low-level things could be used (or deliberately not used) by all distros alike, this is a quest for upstream illumos :)

Actions #13

Updated by Jim Klimov about 12 years ago

A new piece of intel: at least as of oi_151a4, the /usr/bin/bash binary only has dependencies on objects in "/lib", so it is a suitable candidate for "/sbin/sh" - and at that it works better than the binary from SXCE. While it does consume twice as much memory (both RAM and VM), bash so far has not shown the older "sh" binary's incompatibilities with:
  • new syntax used in current ksh-oriented SMF method scripts (for me these were "/lib/svc/method/logadm-upgrade", "/lib/svc/method/net-physical" and "/lib/svc/method/ipfilter"),
  • and "pkgadd" breaking with "Bad number: -1" (something to do with UID/GID setups, and maybe the RBAC stuff - /sbin/sh does not want to start interactively for root either, but does so for non-root users), as I complained above.

It also works for this RFE's purpose of separating the /usr filesystem.

So, using a copy of the binary provided by OpenIndiana already would seemingly solve this half of the RFE:

# mv /sbin/sh /sbin/sh.old && cp -pf /usr/bin/bash /sbin/sh

# ldd /sbin/sh
        =>        /lib/
        =>        /lib/
        =>        /lib/
        =>        /lib/
        =>        /lib/
        =>        /lib/
        =>        /lib/
        =>        /lib/
        =>        /lib/

# /sbin/sh --version
GNU bash, version 4.0.28(1)-release (i386-pc-solaris2.11)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
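
The "ldd" check above can be automated. This is a minimal sketch, assuming "ldd" output lines of the form "name => /path/to/lib". The function name `deps_outside_lib` and the library "libfoo.so.1" are made up for illustration. It prints every dependency that does not resolve under /lib, so empty output suggests the binary could run before /usr is mounted.

```shell
#!/bin/sh
# Hypothetical helper: read ldd-style output on stdin and print every
# dependency whose resolved path does not start with /lib/.
deps_outside_lib() {
    awk '$2 == "=>" && $3 !~ /^\/lib\// { print $1 " => " $3 }'
}

# Canned example with one made-up library under /usr/lib; on a real system:
#   ldd /usr/bin/bash | deps_outside_lib
printf '\tlibc.so.1 =>\t /lib/libc.so.1\n\tlibfoo.so.1 =>\t /usr/lib/libfoo.so.1\n' |
    deps_outside_lib
```

Unresolved entries (where "ldd" reports that a file was not found) are also printed, since their third field does not start with /lib/ either.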

Actions #14

Updated by Jim Klimov about 12 years ago

PS: bash is a bit more compact than ksh in RAM as well ;)

# ps -o rss,vsz,comm | grep sbin/sh
2476 3872  /sbin/sh.ksh
2552 3680  /sbin/sh.bash
2392 3680  /sbin/sh

Strangely, even though /sbin/sh.bash and /sbin/sh are copies of the same file, their RSS sizes differ slightly... ;)

Actions #15

Updated by Jim Klimov about 12 years ago

Another issue was located and fixed with hierarchical datasets: "rpool/SHARED/var/adm" was not automatically mounted as part of "filesystem/minimal", which led to misuse of /var/adm/utmpx and such files. This has little noticeable effect except that "uptime", "w" and similar commands report wrong uptimes (starting at approximately the installation time and the last recorded reboot), and the "last" command does not include system shutdown and boot times. This is annoying, but not fatal.

This patch adds support for such shared system datasets to the SMF method (although the list remains as it was, and other datasets get mounted with "zfs mount -a" sometime later):

# gdiff -Naur /lib/svc/method/fs-minimal.orig /lib/svc/method/fs-minimal 
--- /lib/svc/method/fs-minimal.orig     2011-09-12 15:04:00.000000000 +0400
+++ /lib/svc/method/fs-minimal  2012-06-13 00:28:44.146412124 +0400
@@ -31,6 +31,10 @@
 . /lib/svc/share/
 . /lib/svc/share/

+# Report selected mounts to /dev/console
+[ -f /.debug_mnt ] && debug_mnt=1
 # Mount other file systems to be available in single user mode.
 # Currently, these are /var, /var/adm, /var/run and /tmp.  A change
 # here will require a modification to the following programs (and
@@ -49,6 +53,7 @@
        if [ -n "$mountp" ]; then
                mounted $mountp $mntopts $fstype < /etc/mnttab && continue
                checkfs $fsckdev $fstype $mountp || exit $SMF_EXIT_ERR_FATAL
+               [ x"$debug_mnt" = x1 ] && echo "Mounting '$fs': use '$mountp' of type '$fstype' with opts '$mntopts' from vfstab" > /dev/console
                mountfs -O $mountp $fstype $mntopts - ||
                    exit $SMF_EXIT_ERR_FATAL
@@ -57,10 +62,50 @@
                mountpt=`zfs get -H -o value mountpoint $be$fs 2>/dev/null`
                if [ $? = 0 ] ; then
                        if [ "x$mountpt" = "x$fs" ] ; then
+                               [ x"$debug_mnt" = x1 ] && echo "Mounting '$fs': use '$be$fs': in same root hierarchy" > /dev/console
                                /sbin/zfs mount -O $be$fs
+                               continue
-       fi
+               # These mountpoints can be shared among BEs in a separate tree.
+               # Find and mount matching automountable datasets; if there is
+               # choice - prefer the (first found?) one in the current rpool.
+               mountdslist="`zfs list -H -o canmount,mountpoint,name | awk '( $1 == "on" && $2 == "'"$fs"'" ) {print $3}' 2>/dev/null`" 
+               if [ $? = 0 -a "x$mountdslist" != x ] ; then
+                       if [ "x`echo "$mountdslist"|wc -l|sed 's/ //g'`" = x1 ]; then
+                               # We only had one hit
+                               [ x"$debug_mnt" = x1 ] && echo "Mounting '$fs': use '$mountdslist': the only option" > /dev/console
+                               /sbin/zfs mount -O "$mountdslist" 
+                               continue
+                       else
+                               rpoolname="`echo "$be" | awk -F/ '{print $1}'`" 
+                               mountdspref="`echo "$mountdslist" | egrep '^'"$rpoolname/" | head -1`" 
+                               if [ $? = 0 -a "x$mountdspref" != x ] ; then
+                                       [ x"$debug_mnt" = x1 ] && echo "Mounting '$fs': use '$mountdspref': same rpool" > /dev/console
+                                       /sbin/zfs mount -O "$mountdspref" 
+                                       continue
+                               fi
+                               # This is the least-definite situation: several
+                               # matching datasets exist, and none on the current
+                               # rpool. See if any pools can be ruled out due to
+                               # bad (non-default) altroots.
+                               for mountds in $mountdslist; do
+                                       dspool="`echo "$mountds" | awk -F/ '{print $1}'`" 
+                                       dspool_altroot="`zpool list -H -o altroot "$dspool"`" 
+                                       if [ $? = 0 -a \
+                                            x"$dspool_altroot" = "x-" -o \
+                                            x"$dspool_altroot" = "x/" ]; then
+                                               [ x"$debug_mnt" = x1 ] && echo "Mounting '$fs': use '$mountds': good altroot" > /dev/console
+                                               /sbin/zfs mount -O "$mountds" 
+                                               continue
+                                       fi
+                                   done
+                               fi
+               fi
+               # Technically, it is possible to have a pool named var with
+               # the default altroot and a dataset "var/adm" with an inherited
+               # mountpoint, which should automount into "/var/adm". TBD...
+       fi  ### if root is ZFS

 mounted /var/run - tmpfs < /etc/mnttab
@@ -73,6 +118,7 @@

 if [ "$rootiszfs" = 1 ] ; then
+       # Mount (other) possible children of current rootfs dataset
        /sbin/zfs list -rH -o mountpoint -s mountpoint -t filesystem $be | \
            while read mountp ; do
                if [ "x$mountp" != "x" -a "$mountp" != "legacy" ] ; then
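
The selection order used in the patch above (a single match wins; otherwise prefer a dataset in the current root pool) can be illustrated in isolation. This is a simplified sketch rather than the patch itself: `pick_dataset` is a hypothetical name, the candidates arrive on stdin instead of coming from "zfs list", and the altroot fallback is left out.

```shell
#!/bin/sh
# Hypothetical helper: choose which dataset to mount for a shared mountpoint.
#   $1    = name of the current root pool
#   stdin = candidate dataset names, one per line
# Prints the chosen dataset; prints nothing and returns 1 if there is no
# candidate or the choice is ambiguous.
pick_dataset() {
    rpoolname="$1"
    candidates=$(cat)
    [ -n "$candidates" ] || return 1
    count=$(printf '%s\n' "$candidates" | wc -l | tr -d ' ')
    if [ "$count" -eq 1 ]; then
        # Only one match: use it
        printf '%s\n' "$candidates"
        return 0
    fi
    # Several matches: prefer the first one located in the current root pool
    preferred=$(printf '%s\n' "$candidates" | grep "^$rpoolname/" | head -1)
    [ -n "$preferred" ] || return 1
    printf '%s\n' "$preferred"
}

# Example: two pools (the pool name "tank" is made up) carry a dataset for
# the same mountpoint; the one in the root pool "rpool" is chosen.
printf 'tank/SHARED/var/adm\nrpool/SHARED/var/adm\n' | pick_dataset rpool
```

Keeping the choice in a function that prints a name, separate from the "zfs mount -O" call, also makes it easy to dry-run the decision without mounting anything.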

Actions #16

Updated by Jim Klimov almost 12 years ago

I am not sure when support for a compressed root fs was added to OI/GRUB, but as of oi_151a5 I did successfully compress (gzip-9) and boot a BE (a copy of the stock openindiana installation). This basically eliminates the need for split-off usr and opt child datasets as described above, insofar as the rationale was to get them compressed. Some deployments might still want to do that, but I (author and proponent of the technique above) can't conjure up a reason at the moment ;)
Still, different compression/quota/copies settings, or the desire to share data between BEs instead of wasting space on volatile data stored in clones of BEs that each occupy a single full FS, do warrant a separate hierarchy for some things under /var, as described above...

Curiously, while the compression of root dataset now works, the "zfs set" and "zpool set" do still protect the user/admin from setting bootability on a compressed dataset (zpool set bootfs=...) or enabling compression on current bootable dataset.

This can be easily worked around if desired, by using an inherited compression setting for the rootfs'es, with the base value being defined in rpool/SHARED. I hope this would also allow retaining the benefits of compression when cloning BEs during installations and updates, so the protective checks above might become new trouble-makers now if bootfs compression is enabled...

Actions #17

Updated by Ken Mays almost 11 years ago

  • Assignee set to OI Caiman
Actions #18

Updated by Jim Klimov over 10 years ago

Much of my research on this subject has culminated in a Wiki publication which describes copy-pastable steps to transform a standard Solaris-like installation (with OI as the example and testbed) into a split-root one. This article includes all the missing pieces needed to complete this quest manually. I also posted these pieces (and more) as bugs, in the hope that the solutions will make it upstream so that the setup can be done and maintained out of the box:
  • #4352 - the relevant patches for fs-* SMF methods to mount such roots more reliably than currently existing ones, including a limited solution to #997 to enforce overlay-mounting for components of the root filesystem hierarchy,
  • #4351 - the procedure to bring ksh93 and its dependency libraries into the root filesystem as /sbin/sh (so that /usr can be safely split off),
    • #4360 - on a related note, update SMF methods and other scripts that should be executed by (/usr)/bin/sh to use /sbin/sh instead,
  • #4361 - fix the networking/physical SMF method scripts to not require programs from /usr,
  • #3569 - the procedure to replicate custom ZFS attributes after a "beadm clone" (such as compression for child datasets of the rootfs), including #4355 - to do this in particular when "pkg" creates the clones.

Issues #4353 and #4354 track proposed improvements to the Caiman installer with steps needed to support such installations as a hands-off automated procedure, and are not part of that article.

Actions #19

Updated by Jim Klimov almost 9 years ago

FYI: The scripts and patches relevant to this issue and ones related to it have initially been placed as attachments in the Wiki article, but were relocated to github this year: (licensed as CDDL, so hopefully the logic and/or implementation can get into upstream OI or illumos).

Actions #20

Updated by Marcel Telka 5 months ago

  • Related to Feature #4353: Update Caiman installer to allow installation into an existing root pool and dataset hierarchy added
Actions #21

Updated by Marcel Telka 5 months ago

  • Related to Feature #4354: Update Caiman to offer split-root installations as an option added
Actions #22

Updated by Marcel Telka 5 months ago

  • Status changed from Feedback to In Progress
  • Assignee deleted (OI Caiman)
Actions #23

Updated by Marcel Telka 5 months ago

  • Category set to Caiman (Installer)
  • Target version set to Hipster
