ZFS: Allow specifying minimum dataset/metaslab block size AND alignment
My original concern (a real problem, even) was that ZFS metadata wastes a lot of on-disk space on my system, which uses 4kb blocks on the VDEV (forced with ashift=12) to accommodate new HDDs with native 4kb sectors.
In bug #954 I proposed that ZFS metadata should be re-packaged in a virtualization layer, such that there can be several metadata entries in one ZFS block.
Today I've thought of an alternative solution that should be compatible with the existing ZFS on-disk format, and thus I like it better - the change would be in the ZFS write/allocate logic and would not produce an incompatible disk format. Still, this may need one or more new ZFS and ZPOOL attributes.
Instead of reworking ZFS metadata processing for several entries to be "packaged" in one ZFS 4kb-block, we could formally use 512-byte blocks for the ZFS (ashift=9) but enforce larger blocks for filesystem and volume datasets (and for metaslabs and other system data block-ranges) in zfs-write logic.
Namely, ZFS datasets and metaslabs should be written with the minimum block size and alignment requested by the user (4kb or more) - and with guaranteed start-alignment relative to hardware sectors.
ZFS would need some knowledge of the storage device layout - i.e. the logical block number at which ZFS's slice/partition starts - so that it can tell whether the LBA sector number of a particular block is divisible by minblocksize/512 bytes (8 for minblocksize=4kb), and use this knowledge when allocating blocks for a new write.
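The divisibility check described above can be sketched as follows. This is only an illustration of the arithmetic, not ZFS allocator code: the function names, and the idea of passing the partition start LBA in as a plain number, are my assumptions for the example.

```python
SECTOR = 512  # nominal vdev block size in bytes (ashift=9)


def sectors_per_block(minblocksize: int) -> int:
    """Number of 512-byte sectors in one minimum-size block (8 for 4kb)."""
    if minblocksize < SECTOR or minblocksize % SECTOR != 0:
        raise ValueError("minblocksize must be a multiple of 512 bytes")
    return minblocksize // SECTOR


def is_aligned(partition_start_lba: int, offset_sectors: int, minblocksize: int) -> bool:
    """True if the absolute LBA of this offset is divisible by minblocksize/512."""
    absolute_lba = partition_start_lba + offset_sectors
    return absolute_lba % sectors_per_block(minblocksize) == 0


def next_aligned_offset(partition_start_lba: int, offset_sectors: int, minblocksize: int) -> int:
    """Round an in-partition offset up to the next hardware-aligned offset."""
    n = sectors_per_block(minblocksize)
    absolute_lba = partition_start_lba + offset_sectors
    rounded_up = -(-absolute_lba // n) * n  # ceiling division, then scale back
    return rounded_up - partition_start_lba
```

For example, with a partition starting at LBA 63 and minblocksize=4kb, `next_aligned_offset(63, 0, 4096)` skips one sector so that the first allocated block lands on absolute LBA 64.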
This would also remove the current hassle of aligning partition starts when making new pools - in the worst case, ZFS would waste a few starting blocks for rounding, and issue aligned writes anyway (for most I/Os).
For metaslab processing (so as to lose less performance with 512-byte metadata entries on 4kb-sectored devices) some sort of 8-entry prefetch might be useful (i.e. when we need one entry, we read and cache all 8 anyway).
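A minimal sketch of such an 8-entry prefetch, assuming entries are addressed by a 512-byte sector index counted from an aligned origin. The class name and the `read_sectors(first, count)` callback are hypothetical stand-ins for whatever the real read path would provide.

```python
class GroupPrefetchCache:
    """Read a whole aligned group of entries whenever one entry is requested."""

    def __init__(self, read_sectors, group=8):
        # read_sectors(first, count) is a hypothetical device-read callback
        # returning a list of `count` entries starting at sector `first`.
        self._read = read_sectors
        self._group = group
        self._cache = {}
        self.device_reads = 0  # how many times the device was actually hit

    def get(self, sector):
        if sector not in self._cache:
            # Round down to the start of the aligned group containing `sector`.
            first = (sector // self._group) * self._group
            entries = self._read(first, self._group)
            self.device_reads += 1
            for i, data in enumerate(entries):
                self._cache[first + i] = data
        return self._cache[sector]
```

With this, asking for entries 10, 15 and 8 costs one device read, since all three live in the same group of 8. The sketch never evicts anything; a real cache would need a size limit.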
That's about it.
For metaslabs and other pool-wide technical info this could be a pool-level setting, while for datasets it could be a per-dataset setting, with the root dataset's default inherited from the pool (if that's possible - otherwise, initialized with the same value at pool creation). It would take effect for all writes after the setting is set or changed (like compression, dedup, etc).
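The inheritance I have in mind would resolve roughly like this. It is a sketch only: `minblocksize` is a proposed property name, not an existing ZFS attribute, and the `Dataset` class is a toy model invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dataset:
    name: str
    parent: Optional["Dataset"] = None
    minblocksize: Optional[int] = None  # None means "inherit from parent"


def effective_minblocksize(ds: Dataset, pool_default: int) -> int:
    """Walk up the dataset tree; fall back to the pool-level value."""
    node = ds
    while node is not None:
        if node.minblocksize is not None:
            return node.minblocksize
        node = node.parent
    return pool_default
```

So a dataset with a local 512-byte setting, and its children, would use 512, while everything else would follow the pool's value (say 4096).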
Thus the user would have the option of using unaligned blocks of 512 bytes on specific datasets, if he desires to - i.e. for rarely-accessed archival storage, where performance matters just once, during a write, and the user agrees to wait longer for this one write but save some space on zillions of small files.
Updated by Jim Klimov about 10 years ago
I believe my description above may be somewhat incorrect in terms of terminology.
In particular, at the time of writing I believed that "metaslabs" are reserved ranges of blocks for metadata, which seems not to be true. Regardless, metaslabs should still be aligned at the requested minblocksize sector boundary.
And newly written metadata blocks should "just happen" to be grouped by the write algorithm to make up the minblocksize - and with a nominal vdev block size of 512 bytes, no encapsulation is needed for reads, although read-prefetching is still a good idea.
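To illustrate the grouping idea: the writer could collect pending 512-byte metadata blocks and emit them in minblocksize-sized runs, so that each device write is a full aligned unit while each block keeps its own 512-byte address. The function below is a hypothetical sketch; zero-padding a short last group is my assumption, and a real allocator might rather wait for more blocks.

```python
def group_for_write(blocks, minblocksize=4096, sector=512):
    """Pack 512-byte blocks into minblocksize-sized payloads for aligned writes."""
    if any(len(b) != sector for b in blocks):
        raise ValueError("every block must be exactly one sector long")
    per_group = minblocksize // sector
    groups = []
    for i in range(0, len(blocks), per_group):
        payload = b"".join(blocks[i:i + per_group])
        # Assumption: pad a short trailing group with zeros to a full unit.
        payload += b"\0" * (minblocksize - len(payload))
        groups.append(payload)
    return groups
```

Because the blocks are simply laid side by side, a reader can still fetch any single one at its 512-byte offset without unpacking anything.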