ZFS checksum/compression algorithm number indirection
Block pointers in ZFS include checksum and compression algorithm numbers. These are 8-bit numbers that say which algorithms were used on the data that the blkptr points to. Currently the mapping of number -> algorithm must be the same across all versions of ZFS; otherwise, when different software versions access the same pool, each one will try to use different algorithms on the data, resulting in I/O errors. If two downstream implementations ship code that uses the same algorithm number for two different algorithms, it will be impossible to merge both of them into the same code base, since neither one can change the algorithm number it uses without invalidating pools it has already created. If one of those implementations pushes to Illumos, the other will not be able to pull those changes.
To fix this, http://lists.freebsd.org/pipermail/freebsd-fs/2011-May/011568.html proposes an "indirection table" at the pool level that maps unique string names for algorithms to their 8-bit numbers. The 8-bit number for a particular algorithm could then differ from pool to pool.
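To make the idea concrete, here is a minimal sketch of such a per-pool table in C. It is not taken from the linked proposal or from any ZFS code: the type pool_algo_table_t, the functions pool_algo_register and pool_algo_name, and the name length limit are all hypothetical, and the sketch ignores persistence and any reserved values (such as "inherit" or "off") that a real implementation would have to keep fixed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ALGO_TABLE_SIZE 256 /* blkptr algorithm fields are 8 bits wide */
#define ALGO_NAME_MAX   64  /* assumed limit, chosen for illustration */

/*
 * Hypothetical per-pool table: the index is the 8-bit number stored in
 * block pointers of this pool, the value is the unique algorithm name.
 * An empty string marks an unused slot.
 */
typedef struct pool_algo_table {
    char names[ALGO_TABLE_SIZE][ALGO_NAME_MAX];
} pool_algo_table_t;

void
pool_algo_table_init(pool_algo_table_t *t)
{
    memset(t, 0, sizeof (*t));
}

/*
 * Return the pool-local number for `name`. If the name is not yet in the
 * table, assign it the lowest free slot. Returns -1 if the name is empty,
 * too long, or the table is full.
 */
int
pool_algo_register(pool_algo_table_t *t, const char *name)
{
    assert(t != NULL && name != NULL);

    size_t len = strlen(name);
    int free_slot = -1;

    if (len == 0 || len >= ALGO_NAME_MAX)
        return (-1);

    for (int i = 0; i < ALGO_TABLE_SIZE; i++) {
        if (t->names[i][0] == '\0') {
            if (free_slot < 0)
                free_slot = i;
        } else if (strcmp(t->names[i], name) == 0) {
            return (i); /* already registered in this pool */
        }
    }
    if (free_slot < 0)
        return (-1);

    memcpy(t->names[free_slot], name, len + 1);
    return (free_slot);
}

/* Translate a number read from a blkptr back to a name, or NULL if unused. */
const char *
pool_algo_name(const pool_algo_table_t *t, uint8_t num)
{
    return (t->names[num][0] != '\0' ? t->names[num] : NULL);
}
```

With a table like this, two pools can give the same algorithm different numbers, and two implementations can add different algorithms without coordinating on numbers, because only the string name has to be globally unique.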
These changes should also include a re-usable system for updating feature reference counts to represent the number of blocks in the pool that are using that algorithm. Every time a transaction group is synced out we should count the total change in the number of blocks using each algorithm and update the feature's reference count (updating the reference count directly for each block would be too inefficient).
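The batching described above could look roughly like the following C sketch. All names here (algo_refcounts_t, algo_block_born, algo_block_freed, algo_sync_txg) are hypothetical and not part of ZFS; a real implementation would update the feature's on-disk reference count rather than an in-memory array, and would need locking or per-CPU counters.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_ALGOS 256 /* one slot per possible 8-bit algorithm number */

/* Hypothetical bookkeeping for one pool. */
typedef struct algo_refcounts {
    uint64_t refcount[NUM_ALGOS];  /* stands in for the persistent counts */
    int64_t  txg_delta[NUM_ALGOS]; /* net change within the current txg */
} algo_refcounts_t;

void
algo_refcounts_init(algo_refcounts_t *rc)
{
    memset(rc, 0, sizeof (*rc));
}

/* Cheap per-block hooks: they only touch the in-memory delta. */
void
algo_block_born(algo_refcounts_t *rc, uint8_t algo)
{
    rc->txg_delta[algo]++;
}

void
algo_block_freed(algo_refcounts_t *rc, uint8_t algo)
{
    rc->txg_delta[algo]--;
}

/*
 * Called once when the transaction group is synced out. Applies each
 * non-zero delta to the reference count and resets it. Returns how many
 * reference counts were actually updated.
 */
int
algo_sync_txg(algo_refcounts_t *rc)
{
    int updated = 0;

    for (int i = 0; i < NUM_ALGOS; i++) {
        int64_t d = rc->txg_delta[i];

        if (d == 0)
            continue;
        if (d < 0) {
            /* More frees than births must never underflow the count. */
            assert(rc->refcount[i] >= (uint64_t)(-d));
            rc->refcount[i] -= (uint64_t)(-d);
        } else {
            rc->refcount[i] += (uint64_t)d;
        }
        rc->txg_delta[i] = 0;
        updated++;
    }
    return (updated);
}
```

In this sketch, a txg that writes 1000 blocks and frees 400 blocks with the same algorithm results in a single reference count update of +600 instead of 1400 separate updates, and an algorithm whose births and frees cancel out is not touched at all.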
This feature should be implemented by someone implementing a new checksum or compression algorithm so that there is a real-world consumer to test it with.
Updated by Jim Klimov over 8 years ago
Would it make sense to add a better-compressing open-source algorithm, like bzip2, p7zip or xz? For example, WORM data like large distribution file repositories might benefit from heavyweight, slow compression that is applied once at write time, making storage more compact and reads faster (fewer I/Os).
Also, it might be helpful to provide an "adaptive strong compression" method which would compress individual blocks (or files) with one or another "slow" algorithm like "gzip-9", "bzip2-9" and so on, based on some heuristic. For example, it could look at the incoming block size (smaller blocks might gain nothing from compression and would still use a sector or two), or at the compression history of the file's previous blocks (which algorithm yielded the best results).
In such a case, each individual blkptr_t would reference the particular algorithm used for its block, while the dataset would be marked with another compression type number standing for this adaptive meta-algorithm.
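As a rough illustration of the heuristic suggested in this comment, the C sketch below picks an algorithm per block from the block size and from the compression history so far. Everything in it is an assumption made for illustration: the algorithm identifiers, the 512-byte sector size, the adaptive_state_t structure and both functions are hypothetical, and the actual compressors are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool-local algorithm numbers for the candidates. */
enum { ALGO_OFF = 0, ALGO_A = 1, ALGO_B = 2, ALGO_COUNT = 3 };

#define SECTOR_SIZE 512 /* assumed minimum allocation unit */

/* Per-file (or per-dataset) history of bytes in and out per algorithm. */
typedef struct adaptive_state {
    uint64_t in_bytes[ALGO_COUNT];
    uint64_t out_bytes[ALGO_COUNT];
} adaptive_state_t;

/* Record the result of compressing one block with `algo`. */
void
adaptive_record(adaptive_state_t *s, int algo, uint64_t in, uint64_t out)
{
    assert(algo > ALGO_OFF && algo < ALGO_COUNT);
    s->in_bytes[algo] += in;
    s->out_bytes[algo] += out;
}

/*
 * Choose the algorithm for the next block. The returned number is what
 * would be stored in that block's blkptr; the dataset itself would carry
 * a separate number for the adaptive meta-algorithm.
 */
int
adaptive_choose(const adaptive_state_t *s, size_t block_size)
{
    int best = -1;

    /* A block that already fits in one sector cannot save any space. */
    if (block_size <= SECTOR_SIZE)
        return (ALGO_OFF);

    /* Try every candidate at least once before comparing ratios. */
    for (int a = ALGO_A; a < ALGO_COUNT; a++) {
        if (s->in_bytes[a] == 0)
            return (a);
    }

    /*
     * Pick the lowest out/in ratio. Cross-multiplication avoids floating
     * point; a real implementation would have to guard against overflow.
     */
    for (int a = ALGO_A; a < ALGO_COUNT; a++) {
        if (best < 0 || s->out_bytes[a] * s->in_bytes[best] <
            s->out_bytes[best] * s->in_bytes[a])
            best = a;
    }
    return (best);
}
```

Because the choice is recorded per block, readers never need to know about the heuristic: they only look up the algorithm number found in the block pointer.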
Updated by Christopher Siden about 8 years ago
- Assignee set to Christopher Siden
- % Done changed from 0 to 20
I've been making slow progress on this. So far I've updated everything that uses compression algo numbers to use a spa-specific table instead of the global zio_compress_table, and I've started building some simple infrastructure to hook new algos into feature flags (there is still a bit of work to do here in order to properly keep track of how many blocks are using a specific algo, so we know whether it is active or just enabled). I've also hit a snag in the way we do the translation from algo name -> integer: currently it is done in libzfs (userland), where we don't have direct access to the mappings in the spa. This translation probably needs to be moved to the kernel (it probably wouldn't be a bad idea to just do this for all properties at once).