Bug #5027

zfs large block support

Added by Matthew Ahrens about 3 years ago. Updated over 1 year ago.

Status: Closed
Start date: 2014-07-19
Priority: Normal
Due date:
Assignee: Matthew Ahrens
% Done:
Category: zfs - Zettabyte File System
Target version: -
Difficulty: Medium
Tags: needs-triage


ZFS should support large blocks (>128KB). However, these large blocks will only be used if explicitly requested by setting the "recordsize" property. The default recordsize will still be 128KB, and we'll only issue i/o > 128KB for large blocks (i/o aggregation will still be capped at 128KB). This maintains the existing behavior (and performance characteristics) for all uses except filesystems whose "recordsize" has been increased.
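The property change described above uses the standard "zfs set" interface; the dataset name below is illustrative:

```shell
# Raise the block size for one dataset (affects newly written data only;
# existing blocks keep their original size). "tank/videos" is a
# hypothetical dataset name.
zfs set recordsize=1M tank/videos

# Confirm the new value.
zfs get recordsize tank/videos
```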

The "recordsize" property can be set up to 1MB. The on-disk format and the SPA will support up to 16MB blocks, but these can only be used by "experimentally" changing a kernel tunable to allow larger "recordsize" values. The advantage of adding support for 16MB (the most that can be easily supported by the on-disk format) now is that we will not need to change the on-disk format or add feature flags when increasing the "officially supported" block size from 1MB to any larger values.
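As a sketch of the "experimental" path above: on illumos such a cap would be raised through a kernel tunable in /etc/system. The tunable name "zfs_max_recordsize" is an assumption here; verify it against your platform's source before relying on it.

```shell
# Experimental and unsupported: raise the maximum allowed "recordsize"
# to 16MB. Tunable name "zfs_max_recordsize" is assumed, not confirmed
# by this writeup. A reboot is required for /etc/system to take effect.
echo "set zfs:zfs_max_recordsize = 0x1000000" >> /etc/system
```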

Note that by default, "zfs send" will automatically break any large blocks up into 128KB chunks, so that the send stream can be received on any system. A new flag (-b) can be specified which will transmit the large blocks, in which case the send stream can only be received by systems that support large blocks.
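The two send modes described above can be sketched as follows; the snapshot and host names are illustrative, and the flag letter (-b) is taken from this writeup:

```shell
# Default: large blocks are split into 128KB chunks, so the stream can
# be received on any system.
zfs send tank/fs@snap | ssh backuphost zfs receive backuppool/fs

# With the large-block flag: blocks are transmitted intact, and the
# receiving system must support the large_blocks feature.
zfs send -b tank/fs@snap | ssh backuphost zfs receive backuppool/fs
```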

Note that GRUB "supports" the large_blocks feature, so it is safe to enable it on root pools. However, GRUB can't actually read large blocks (due to lack of memory), so the recordsize of the boot filesystem can not be increased above 128KB (this is enforced similarly to GRUB's lack of gzip support).
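For reference, "large_blocks" is a pool feature flag and is enabled with the usual "zpool set" syntax (pool name illustrative):

```shell
# Enable the feature on an existing pool; it transitions from "enabled"
# to "active" once a dataset actually uses a recordsize above 128KB.
zpool set feature@large_blocks=enabled tank
zpool get feature@large_blocks tank
```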

Recommended uses center around improving performance of random reads of large blocks (>= 128KB):
- files that are randomly read in large chunks (e.g. video files when streaming many concurrent streams, such that prefetch cannot effectively cache data); performance improves because random 1MB reads from rotating disks have higher bandwidth than random 128KB reads.
- typically, performance of scrub/resilver is improved, especially with RAID-Z

The tradeoffs to consider when using large blocks include:
- accessing large blocks tends to increase latency of all operations, because even small reads will need to get in line behind large reads/writes
- sub-block writes (e.g. writing 128KB of a 1MB block) will incur an even larger read-modify-write penalty
- the last, partially-filled block of each file will be larger, wasting memory, and if compression is not enabled, disk space (expected waste is 1/2 the recordsize per file, assuming random file length)
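The last tradeoff above can be quantified with a quick shell calculation. The recordsize and file count below are hypothetical; the expected tail waste of recordsize/2 per file is from the writeup above.

```shell
recordsize=$((1024 * 1024))   # 1MB recordsize, in bytes
nfiles=10000                  # hypothetical number of files
# Expected waste in the final, partially-filled block is ~recordsize/2
# per file, assuming random file lengths.
waste=$(( recordsize / 2 * nfiles ))
echo "$(( waste / 1024 / 1024 )) MB"   # prints "5000 MB" with these numbers
```

With compression enabled, the unused tail of the final block compresses away on disk, but the in-memory cost remains.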


#1 Updated by Electric Monk almost 3 years ago

  • % Done changed from 0 to 100
  • Status changed from New to Closed

git commit b515258426fed6c7311fd3f1dea697cfbd4085c6

commit  b515258426fed6c7311fd3f1dea697cfbd4085c6
Author: Matthew Ahrens <matt@mahrens.org>
Date:   2014-11-07T16:30:07.000Z

    5027 zfs large block support
    Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
    Reviewed by: George Wilson <george.wilson@delphix.com>
    Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
    Reviewed by: Richard Elling <richard.elling@richardelling.com>
    Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
    Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
    Approved by: Dan McDonald <danmcd@omniti.com>

#2 Updated by Electric Monk over 1 year ago

git commit c3d26abc9ee97b4f60233556aadeb57e0bd30bb9

commit  c3d26abc9ee97b4f60233556aadeb57e0bd30bb9
Author: Matthew Ahrens <matt@mahrens.org>
Date:   2016-02-08T16:30:40.000Z

    5027 zfs large block support (add copyright)
