Bug #11557


Log Spacemap Project

Added by Jerry Jelinek almost 5 years ago. Updated over 4 years ago.

zfs - Zettabyte File System

Port from ZoL:
93e28d661 Log Spacemap Project
along with related fixes:
cae97c88c List log_spacemap feature in zpool-features.5 manual
1ba4f3e7b 9072 handle error of zap_cursor_retrieve() for log spacemap zap
2fcf4481a mismerged log spacemap comment for metaslab_verify_weight_and_frag
7f3190891 Tricky semantics of ms_max_size in metaslab_should_allocate()
bac15c119 zdb: don't print log spacemap stats in pools without the feature

Actions #1

Updated by Jerry Jelinek over 4 years ago

Also include the following two commits:
d3230d761a looping in metaslab_block_picker impacts performance on fragmented pools
cb020f0d8 Reduced IOPS when all vdevs are in the zfs_mg_fragmentation_threshold

Also, include the mdb changes from the original Delphix work from here:

Actions #2

Updated by Jerry Jelinek over 4 years ago

Here is the ZoL commit msg for the original work.

    Log Spacemap Project

    = Motivation

    At Delphix we've seen a lot of customer systems where fragmentation
    is over 75% and random writes take a performance hit because a lot
    of time is spent on I/Os that update on-disk space accounting metadata.
    Specifically, we've seen cases where 20% to 40% of sync time is spent
    after sync pass 1 and ~30% of the I/Os on the system are spent updating
    spacemaps.

    The problem is that these pools have existed long enough that we've
    touched almost every metaslab at least once, and random writes
    scatter frees across all metaslabs every TXG, thus appending to
    their spacemaps and resulting in many I/Os. To give an example,
    assuming that every VDEV has 200 metaslabs and our writes fit within
    a single spacemap block (generally 4K) we have 200 I/Os. Then if we
    assume 2 levels of indirection, we need 400 additional I/Os and
    since we are talking about metadata for which we keep 2 extra copies
    for redundancy we need to triple that number, leading to a total of
    1800 I/Os per VDEV every TXG.
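
The estimate above can be reproduced with a few lines of arithmetic. This is an illustrative sketch only: the function and parameter names are made up for this example and are not ZFS identifiers. It assumes, as the text does, one spacemap block written per metaslab, one additional block per level of indirection, and three copies of every metadata block.

```python
def spacemap_ios_per_txg(metaslabs=200, indirection_levels=2, copies=3):
    """Estimate spacemap I/Os per vdev per TXG under the text's assumptions."""
    # One spacemap (leaf) block written per touched metaslab.
    leaf_ios = metaslabs
    # Each level of indirection adds one more block write per metaslab.
    indirect_ios = metaslabs * indirection_levels
    # Metadata keeps 2 extra copies, so every write is issued 3 times.
    return (leaf_ios + indirect_ios) * copies


print(spacemap_ios_per_txg())  # 1800
```

Halving the metaslab count halves the result, which is the tradeoff the next paragraph discusses.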

    We could try and decrease the number of metaslabs so we have fewer
    I/Os per TXG but then each metaslab would cover a wider range on
    disk and thus would take more time to be loaded in memory from disk.
    In addition, after it's loaded, its range tree would consume more
    memory.

    Another idea would be to just increase the spacemap block size
    which would allow us to fit more entries within an I/O block
    resulting in fewer I/Os per metaslab and a speedup in loading time.
    The problem is that this still doesn't address the number of I/Os
    going up as the number of metaslabs increases. Moreover, we
    generally write a lot to a few metaslabs and a little to the rest
    of them. Thus, just increasing the block size would actually waste
    bandwidth because most writes would not fill the bigger blocks.

    = About this patch

    This patch introduces the Log Spacemap project which provides the
    solution to the above problem while taking into account all the
    aforementioned tradeoffs. The details on how it achieves that can
    be found in the references sections below and in the code (see
    Big Theory Statement in spa_log_spacemap.c).

    Even though the change is fairly constrained within the metaslab
    and lower-level SPA codepaths, there is a side-change that is
    user-facing. The change is that VDEV IDs from VDEV holes will no
    longer be reused. To give some background and reasoning for this,
    when a log device was removed and its VDEV structure was replaced
    with a hole (or was compacted, if at the end of the vdev array),
    its vdev_id could be reused by devices added after that. Now
    with the pool-wide space maps recording the vdev ID, this behavior
    can cause problems (e.g. is this entry referring to a segment in
    the new vdev or the removed log?). Thus, to simplify things the
    ID reuse behavior is gone and now vdev IDs for top-level vdevs
    are truly unique within a pool.
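
To make the ambiguity concrete, the sketch below contrasts the two ID policies. It is a toy model, not ZFS code: `VdevTable`, `add`, `remove`, and the `reuse_holes` flag are hypothetical names, the pool-wide log is reduced to a list of (vdev ID, offset) tuples, and compaction of the vdev array is not modeled.

```python
class VdevTable:
    """Toy top-level vdev array; None marks a hole left by a removed vdev."""

    def __init__(self, reuse_holes):
        self.reuse_holes = reuse_holes
        self.slots = []

    def add(self, name):
        if self.reuse_holes and None in self.slots:
            vdev_id = self.slots.index(None)  # old behavior: fill the hole
            self.slots[vdev_id] = name
        else:
            vdev_id = len(self.slots)  # new behavior: always a fresh ID
            self.slots.append(name)
        return vdev_id

    def remove(self, vdev_id):
        self.slots[vdev_id] = None


def ambiguous_entries(reuse_holes):
    table = VdevTable(reuse_holes)
    table.add("disk0")
    log_id = table.add("log0")
    pool_log = [(log_id, 0x1000)]  # pool-wide entry recorded for the log device
    table.remove(log_id)
    new_id = table.add("disk1")
    # Old entries whose vdev ID now also names the new device are ambiguous.
    return [entry for entry in pool_log if entry[0] == new_id]


print(ambiguous_entries(reuse_holes=True))   # [(1, 4096)]
print(ambiguous_entries(reuse_holes=False))  # []
```

With reuse enabled, the entry written for the removed log device is indistinguishable from one written for the new disk. Without reuse, no old entry can collide with a new top-level vdev.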

    = Testing

    The illumos implementation of this feature has been used internally
    for a year and has been in production for ~6 months. For this patch
    specifically there don't seem to be any regressions introduced to
    ZTS and I have been running zloop for a week without any related
    problems.

    = Performance Analysis (Linux Specific)

    All performance results and analysis for illumos can be found in
    the links of the references. Redoing the same experiments in Linux
    gave similar results. Below are the specifics of the Linux run.

    After the pool reached stable state the percentage of the time
    spent in pass 1 per TXG was 64% on average for the stock bits
    while the log spacemap bits stayed at 95% during the experiment.

    Sync times per TXG were 37.6 seconds on average for the stock
    bits and 22.7 seconds for the log spacemap bits. As a result
    the log spacemap bits were able to push more TXGs, which is also
    the reason why all graphs quantified per TXG have more entries for
    the log spacemap bits.
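
As a rough check on these figures, the sketch below derives the relative improvement from the two averages quoted above. The derived values are back-of-the-envelope numbers computed here, not measurements from the original experiment. The TXG ratio additionally assumes that syncs run back to back over the same wall-clock period, which the text does not state.

```python
stock_sync_s = 37.6  # average sync time per TXG, stock bits
logsm_sync_s = 22.7  # average sync time per TXG, log spacemap bits

# Fraction of sync time saved per TXG.
reduction = 1 - logsm_sync_s / stock_sync_s
# How many more TXGs fit in the same time if syncs are back to back.
txg_ratio = stock_sync_s / logsm_sync_s

print(f"sync time reduced by {reduction:.1%}")
print(f"roughly {txg_ratio:.2f}x as many TXGs in the same period")
```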

    Another interesting aspect in terms of txg syncs is that the stock
    bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
    and 20% reach sync pass 9. The log spacemap bits reached sync pass 4
    in 79% of their TXGs, sync pass 7 in 19%, and sync pass 8 in 1%. This
    emphasizes the fact that not only do we spend less time on metadata
    but we also iterate fewer times to convergence in spa_sync() dirtying
    objects.

    Finally, the improvement in IOPS that the userland gains from the
    change is approximately 40%. There is a consistent win in IOPS,
    but the absolute amount of improvement that the log spacemap gives
    varies within each minute interval.

    = Porting to Other Platforms

    For people that want to port this commit to other platforms, below
    is a list of ZoL commits that this patch depends on:

    Make zdb results for checkpoint tests consistent

    Update vdev_is_spacemap_addressable() for new spacemap encoding

    Simplify spa_sync by breaking it up to smaller functions

    Factor metaslab_load_wait() in metaslab_load()

    Rename range_tree_verify to range_tree_verify_not_present

    Change target size of metaslabs from 256GB to 16GB

    zdb -L should skip leak detection altogether

    vs_alloc can underflow in L2ARC vdevs

    Simplify log vdev removal code

    Get rid of space_map_update() for ms_synced_length

    Introduce auxiliary metaslab histograms

    Error path in metaslab_load_impl() forgets to drop ms_sync_lock

    = References

    Background, Motivation, and Internals of the Feature
    - OpenZFS 2017 Presentation:
    - Slides:

    Flushing Algorithm Internals & Performance Results
    (Illumos Specific)
    - Blogpost:
    - OpenZFS 2018 Presentation:
    - Slides:

Actions #3

Updated by Jerry Jelinek over 4 years ago

To test this I've run the zfs-test suite on DEBUG and non-DEBUG builds. The new and updated tests pass. I've manually verified the updated man page and the new mdb dcmds.

Actions #4

Updated by Electric Monk over 4 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 814dcd43c3de9925fd6226c256e4d4327841a0e1

commit  814dcd43c3de9925fd6226c256e4d4327841a0e1
Author: Serapheim Dimitropoulos
Date:   2019-09-25T16:13:23.000Z

    11557 Log Spacemap Project
    Portions contributed by: Jerry Jelinek
    Portions contributed by: Matthew Ahrens
    Reviewed by: George Melikov
    Reviewed by: Brian Behlendorf
    Reviewed by: Matt Ahrens
    Reviewed by: Brian Behlendorf
    Reviewed by: Paul Dagnelie
    Reviewed by: Tony Nguyen
    Reviewed by: George Wilson
    Reviewed by: Sara Hartse
    Reviewed by: Igor Kozhukhov
    Reviewed by: Pavel Zakharov
    Reviewed by: Andy Fiddaman
    Reviewed by: Toomas Soome
    Reviewed by: C Fraire
    Reviewed by: Kody Kantor
    Approved by: Gordon Ross

