ZFS metadata uses way too much space with 4 KB blocks (ashift=12)
Details, from the original question to my research and findings, are in this OpenSolaris forum thread: http://opensolaris.org/jive/thread.jspa?threadID=138604&tstart=0
In short, modern drives with native 4 KB sectors are slower when addressed in smaller units (i.e. emulated 512-byte sectors), so my pool was created with a modified zpool binary that enforces ashift=12, making 4 KB the minimum block size for ZFS.
Filesystem overhead now appears to have grown considerably. This is especially visible for volumes with a small volblocksize: in my extreme case, every data block of a volume with a 4 KB volblocksize is matched by a 4 KB metadata block, at least doubling the space used in the pool compared with the "user data size". It also affects snapshots, which seem to reserve space for doubling the metadata blocks, and that is now quite a lot.
One proposal would be to raise the maximum block size from 128 KB to 1 MB, so as to keep the percentage of filesystem metadata overhead smaller.
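As a rough illustration of the overhead described above, the following sketch compares the metadata-to-data ratio for a few block sizes. It assumes a deliberately simplified model taken from the observation in this report: each data block is matched by exactly one metadata block of the minimum size (2^ashift bytes). The function name and the one-to-one ratio are assumptions for illustration only, not how ZFS actually accounts for space.

```python
def metadata_overhead(data_block_size, ashift, metadata_blocks_per_data_block=1):
    """Return metadata bytes per byte of user data under a simplified model.

    Assumption (hypothetical model): every data block is accompanied by a
    fixed number of metadata blocks, each of the minimum size 2**ashift.
    """
    min_block = 1 << ashift  # smallest block the pool can allocate
    return metadata_blocks_per_data_block * min_block / data_block_size


if __name__ == "__main__":
    KB = 1024
    for ashift in (9, 12):
        for size in (4 * KB, 128 * KB, 1024 * KB):
            ratio = metadata_overhead(size, ashift)
            print(f"ashift={ashift:2d} block={size // KB:5d} KB "
                  f"overhead={ratio * 100:.2f}%")
```

Under this model a 4 KB volume block on an ashift=12 pool carries 100% overhead, while a 1 MB block carries well under 1%, which is the motivation for a larger maximum block size.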
Another proposal is to treat such 4 KB metadata blocks as containers holding eight colocated 512-byte blocks, each with a different piece of ZFS metadata, i.e. by adding or expanding a layer of abstraction for addressing metadata blocks.
This idea seems backwards-compatible with code that might expect metadata blocks to be (at least) 512 bytes in size. It would keep the overhead as low as with standard ashift=9 pools and should do little harm to performance on hard disks with 4 KB sectors (or SSD media).
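The container idea can be sketched as a simple address translation: a logical 512-byte metadata slot number is mapped to a 4 KB container block plus a byte offset inside it. The names below (`SLOT_SIZE`, `CONTAINER_SIZE`, `locate_slot`) are hypothetical and only illustrate the arithmetic; they do not correspond to any existing ZFS structure or function.

```python
SLOT_SIZE = 512          # size of one logical metadata block
CONTAINER_SIZE = 4096    # size of the physical block on an ashift=12 pool
SLOTS_PER_CONTAINER = CONTAINER_SIZE // SLOT_SIZE  # eight slots per container


def locate_slot(slot_index):
    """Map a logical 512-byte slot number to (container index, byte offset)."""
    if slot_index < 0:
        raise ValueError("slot_index must be non-negative")
    container, slot = divmod(slot_index, SLOTS_PER_CONTAINER)
    return container, slot * SLOT_SIZE


if __name__ == "__main__":
    for index in (0, 7, 8, 9):
        print(index, locate_slot(index))
```

With this layout, eight pieces of metadata share one physical 4 KB block, so the space cost per piece stays at 512 bytes as on an ashift=9 pool.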
Updated by Jim Klimov about 10 years ago
- Difficulty set to Medium
- Tags set to needs-triage
Today I thought of an alternative solution that might be more compatible with the ZFS on-disk format: instead of reworking ZFS metadata processing so that several entries are "packaged" in one 4 KB ZFS block, we could formally keep 512-byte blocks for ZFS but write larger blocks (4 KB or more) with guaranteed start alignment.
I'll detail that alternative idea in another bug. I already like it better: the change would be in the ZFS write logic and would not produce an incompatible on-disk format.
Posted as bug #1044.
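One possible reading of the alternative above, sketched under stated assumptions: the pool keeps 512-byte addressing, small blocks may start at any 512-byte boundary, and blocks of 4 KB or more are placed at a 4 KB-aligned start offset. The helper names (`align_up`, `choose_start`) and the placement policy are hypothetical; the actual proposal is described in bug #1044.

```python
SECTOR = 512   # formal minimum block size (ashift=9)
ALIGN = 4096   # physical sector size the large blocks should start on


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


def choose_start(next_free, size):
    """Pick a start offset for a block of `size` bytes (assumed policy).

    Blocks smaller than ALIGN only need 512-byte alignment; blocks of
    ALIGN bytes or more start on a 4 KB boundary.
    """
    if size >= ALIGN:
        return align_up(next_free, ALIGN)
    return align_up(next_free, SECTOR)


if __name__ == "__main__":
    print(choose_start(512, 512))    # small metadata block
    print(choose_start(512, 4096))   # larger block moved to a 4 KB boundary
```

The gap left between a small block and the next aligned large block is the cost of this policy; how that space would be reused is outside this sketch.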