zfs large block support
ZFS should support large blocks (>128KB). However, these large blocks will only be used if explicitly requested by setting the "recordsize" property. The default recordsize will still be 128KB, and we'll only issue i/o > 128KB for large blocks (i/o aggregation will still be capped at 128KB). This maintains the existing behavior (and performance characteristics) for all uses except filesystems whose "recordsize" has been increased.
The "recordsize" property can be set up to 1MB. The on-disk format and the SPA will support up to 16MB blocks, but these can only be used by "experimentally" changing a kernel tunable to allow larger "recordsize" values. The advantage of adding support for 16MB (the most that can be easily supported by the on-disk format) now is that we will not need to change the on-disk format or add feature flags when increasing the "officially supported" block size from 1MB to any larger values.
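The limits above can be illustrated with a small validity check. This is only a sketch: the function name, the `max_recordsize` parameter (standing in for the kernel tunable, which is not named here), the 512-byte lower bound, and the power-of-two requirement are assumptions for illustration, not the actual implementation.

```python
KIB = 1024
MIB = 1024 * KIB

MIN_RECORDSIZE = 512               # assumption: smallest allowed block size
DEFAULT_MAX_RECORDSIZE = 1 * MIB   # "officially supported" limit from the text
ON_DISK_MAX = 16 * MIB             # most the on-disk format supports, per the text


def is_valid_recordsize(size, max_recordsize=DEFAULT_MAX_RECORDSIZE):
    """Return True if `size` would be an acceptable recordsize.

    `max_recordsize` is a hypothetical stand-in for the kernel tunable
    that "experimentally" raises the limit above 1MB.
    """
    if max_recordsize > ON_DISK_MAX:
        raise ValueError("limit exceeds what the on-disk format supports")
    # Assumption: block sizes must be powers of two.
    is_power_of_two = size > 0 and (size & (size - 1)) == 0
    return is_power_of_two and MIN_RECORDSIZE <= size <= max_recordsize


if __name__ == "__main__":
    for size in (128 * KIB, 1 * MIB, 2 * MIB):
        print(size, is_valid_recordsize(size))
```

With the default limit a 2MB value is rejected, and it is only accepted when the hypothetical tunable is raised, e.g. `is_valid_recordsize(2 * MIB, max_recordsize=16 * MIB)`.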
Note that by default, "zfs send" will automatically break any large blocks up into 128KB chunks, so that the send stream can be received on any system. A new flag (-b) can be specified which will transmit the large blocks, in which case the send stream can only be received by systems that support large blocks.
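The default splitting behavior can be sketched as follows. This is a simplified illustration, not the send code itself: the function name and the (offset, length) representation are hypothetical, and it only shows how one large block maps onto 128KB pieces.

```python
LEGACY_MAX_BLOCK = 128 * 1024  # largest block a pre-large-block receiver accepts


def split_for_legacy_stream(offset, length, chunk=LEGACY_MAX_BLOCK):
    """Split one block's byte range into pieces of at most `chunk` bytes.

    Returns a list of (offset, length) tuples that together cover the
    original range exactly once, in order.
    """
    pieces = []
    end = offset + length
    while offset < end:
        n = min(chunk, end - offset)
        pieces.append((offset, n))
        offset += n
    return pieces


if __name__ == "__main__":
    # A single 1MB block becomes eight 128KB records in the stream.
    for piece in split_for_legacy_stream(0, 1024 * 1024):
        print(piece)
```

A block that is already 128KB or smaller passes through as a single piece, so streams from filesystems that never raised "recordsize" are unaffected.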
Note that GRUB "supports" the large_blocks feature, so it is safe to enable it on root pools. However, GRUB can't actually read large blocks (due to lack of memory), so the recordsize of the boot filesystem cannot be increased above 128KB (this is enforced similarly to GRUB's lack of gzip support).
Recommended uses center around improving performance of random reads of large blocks (>= 128KB):
- files that are randomly read in large chunks (e.g. video files when streaming many concurrent streams such that prefetch cannot effectively cache data); performance will be improved in this case because random 1MB reads from rotating disks have higher bandwidth than random 128KB reads.
- typically, performance of scrub/resilver is improved, especially with RAID-Z
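The bandwidth claim for rotating disks can be made concrete with a back-of-the-envelope model in which each random read pays a fixed positioning cost (seek plus rotational latency) before transferring data. The 8 ms positioning time and 150 MB/s transfer rate below are assumed round numbers for illustration, not measurements.

```python
def random_read_bandwidth(block_bytes,
                          positioning_s=0.008,         # assumed seek + rotation
                          transfer_bytes_per_s=150e6):  # assumed media rate
    """Effective bytes/second for back-to-back random reads of one block each."""
    time_per_read = positioning_s + block_bytes / transfer_bytes_per_s
    return block_bytes / time_per_read


if __name__ == "__main__":
    for label, size in (("128KB", 128 * 1024), ("1MB", 1024 * 1024)):
        mb_per_s = random_read_bandwidth(size) / 1e6
        print(f"{label}: {mb_per_s:.1f} MB/s")
```

Under these assumed numbers the model gives roughly 15 MB/s for 128KB reads versus roughly 70 MB/s for 1MB reads, because the fixed positioning cost is amortized over eight times as much data.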
The tradeoffs to consider when using large blocks include:
- accessing large blocks tends to increase latency of all operations, because even small reads will need to get in line behind large reads/writes
- sub-block writes (e.g. a write to 128KB of a 1MB block) will incur an even larger read-modify-write penalty
- the last, partially-filled block of each file will be larger, wasting memory and, if compression is not enabled, disk space (expected waste is 1/2 the recordsize per file, assuming random file length)
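The last two tradeoffs can be estimated with simple arithmetic. The sketch below uses hypothetical helper names and two simplifying assumptions: a sub-block write rewrites the whole block it lands in, and every file spans more than one block, so its final block is allocated at the full recordsize.

```python
KIB = 1024
MIB = 1024 * KIB


def write_amplification(write_bytes, recordsize):
    """Bytes rewritten per byte changed for a write contained in one block."""
    if write_bytes > recordsize:
        raise ValueError("write must fit within a single block")
    return recordsize / write_bytes


def tail_slack(file_bytes, recordsize):
    """Unused bytes in the last block of a file (uncompressed)."""
    return (-file_bytes) % recordsize


def mean_tail_slack(recordsize, step=4 * KIB):
    """Average slack over file lengths spread evenly across one recordsize."""
    lengths = range(recordsize, 2 * recordsize, step)
    return sum(tail_slack(n, recordsize) for n in lengths) / len(lengths)


if __name__ == "__main__":
    print(write_amplification(128 * KIB, 1 * MIB))   # 128KB write, 1MB block
    print(mean_tail_slack(1 * MIB) / MIB)            # fraction of a recordsize
```

A 128KB write into a 1MB block rewrites eight times the data that changed, and the average slack comes out close to half the recordsize, matching the estimate in the last bullet.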