Bug #11971


Reduce loaded range tree memory usage

Added by Jerry Jelinek over 3 years ago. Updated over 3 years ago.

zfs - Zettabyte File System
Start date:
Due date:
% Done:


Estimated time:
Gerrit CR:
External Bug:


Port from OpenZFS
ca5777793 Reduce loaded range tree memory usage

    This patch implements a new tree structure for ZFS, and uses it to
    store range trees more efficiently.

    The new structure is approximately a B-tree, though there are some
    small differences from the usual characterizations. The tree has core
    nodes and leaf nodes; each contain data elements, which the elements
    in the core nodes acting as separators between its children. The
    difference between core and leaf nodes is that the core nodes have an
    array of children, while leaf nodes don't. Every node in the tree may
    be only partially full; in most cases, they are all at least 50% full
    (in terms of element count) except for the root node, which can be
    less full. Underfull nodes will steal from their neighbors or merge to
    remain full enough, while overfull nodes will split in two. The data
    elements are contained in tree-controlled buffers; they are copied
    into these on insertion, and overwritten on deletion. This means that
    the elements are not independently allocated, which reduces overhead,
    but also means they can't be shared between trees (and also that
    pointers to them are only valid until a side-effectful tree operation
    occurs). The overhead varies based on how dense the tree is, but is
    usually on the order of about 50% of the element size; the per-node
    overheads are very small, and so don't make a significant difference.
    The trees can accept arbitrary records; they accept a size and a
    comparator to allow them to be used for a variety of purposes.

    The new trees replace the AVL trees used in the range trees today.
    Currently, the range_seg_t structure contains three 8 byte integers
    of payload and two 24 byte avl_tree_node_ts to handle its storage in
    both an offset-sorted tree and a size-sorted tree (total size: 64
    bytes). In the new model, the range seg structures are usually two 4
    byte integers, but a separate one needs to exist for the size-sorted
    and offset-sorted tree. Between the raw size, the 50% overhead, and
    the double storage, the new btrees are expected to use 8*1.5*2 = 24
    bytes per record, or 33.3% as much memory as the AVL trees (this is
    for the purposes of storing metaslab range trees; for other purposes,
    like scrubs, they use ~50% as much memory).

    We reduced the size of the payload in the range segments by teaching
    range trees about starting offsets and shifts; since metaslabs have a
    fixed starting offset, and they all operate in terms of disk sectors,
    we can store the ranges using 4-byte integers as long as the size of
    the metaslab divided by the sector size is less than 2^32. For 512-byte
    sectors, this is a 2^41 (or 2TB) metaslab, which with the default
    settings corresponds to a 256PB disk. 4k sector disks can handle
    metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
    anticipate disks of this size in the near future, there should be
    almost no cases where metaslabs need 64-byte integers to store their
    ranges. We do still have the capability to store 64-byte integer ranges
    to account for cases where we are storing per-vdev (or per-dnode) trees,
    which could reasonably go above the limits discussed. We also do not
    store fill information in the compact version of the node, since it
    is only used for sorted scrub.

    We also optimized the metaslab loading process in various other ways
    to offset some inefficiencies in the btree model. While individual
    operations (find, insert, remove_from) are faster for the btree than
    they are for the avl tree, remove usually requires a find operation,
    while in the AVL tree model the element itself suffices. Some clever
    changes actually caused an overall speedup in metaslab loading; we use
    approximately 40% less cpu to load metaslabs in our tests on Illumos.

    Another memory and performance optimization was achieved by changing
    what is stored in the size-sorted trees. When a disk is heavily
    fragmented, the df algorithm used by default in ZFS will almost always
    find a number of small regions in its initial cursor-based search; it
    will usually only fall back to the size-sorted tree to find larger
    regions. If we increase the size of the cursor-based search slightly,
    and don't store segments that are smaller than a tunable size floor
    in the size-sorted tree, we can further cut memory usage down to
    below 20% of what the AVL trees store. This also results in further
    reductions in CPU time spent loading metaslabs.

    The 16KiB size floor was chosen because it results in substantial memory
    usage reduction while not usually resulting in situations where we can't
    find an appropriate chunk with the cursor and are forced to use an
    oversized chunk from the size-sorted tree. In addition, even if we do
    have to use an oversized chunk from the size-sorted tree, the chunk
    would be too small to use for ZIL allocations, so it isn't as big of a
    loss as it might otherwise be. And often, more small allocations will
    follow the initial one, and the cursor search will now find the
    remainder of the chunk we didn't use all of and use it for subsequent
    allocations. Practical testing has shown little or no change in
    fragmentation as a result of this change.

    If the size-sorted tree becomes empty while the offset sorted one still
    has entries, it will load all the entries from the offset sorted tree
    and disregard the size floor until it is unloaded again. This operation
    occurs rarely with the default setting, only on incredibly thoroughly
    fragmented pools.

    There are some other small changes to zdb to teach it to handle btrees,
    but nothing major.

Actions #1

Updated by Jerry Jelinek over 3 years ago

  • % Done changed from 0 to 80
Actions #2

Updated by Jerry Jelinek over 3 years ago

There are no new tests with this change but I ran a zfs test run on both DEBUG and non-DEBUG builds

Actions #3

Updated by Dan McDonald over 3 years ago

As presented at (Metaslab Allocation Performance).

Actions #4

Updated by Jerry Jelinek over 3 years ago

For reference, this was integrated into the Delphix illumos release branch June 20. It was integrated into the ZoL master Oct 9.

Actions #5

Updated by Electric Monk over 3 years ago

  • Status changed from New to Closed
  • % Done changed from 80 to 100

git commit 4d7988d6050abba5c1ff60e7fd196e95c22e20f4

commit  4d7988d6050abba5c1ff60e7fd196e95c22e20f4
Author: Paul Dagnelie <>
Date:   2019-12-06T12:47:33.000Z

    11971 Reduce loaded range tree memory usage
    Portions contributed by: Jerry Jelinek <>
    Reviewed by: George Wilson <>
    Reviewed by: Matt Ahrens <>
    Reviewed by: Sebastien Roy
    Reviewed by: Igor Kozhukhov <>
    Reviewed by: Brian Behlendorf <>
    Reviewed by: Kody Kantor <>
    Reviewed by: Andy Fiddaman <>
    Approved by: Dan McDonald <>


Also available in: Atom PDF