Bug #11557

Log Spacemap Project

Added by Jerry Jelinek 3 months ago. Updated about 2 months ago.

Status: Closed
Priority: Normal
Category: zfs - Zettabyte File System
% Done: 100%
Difficulty: Medium

Description

Port from ZoL:
93e28d661 Log Spacemap Project
along with related fixes:
cae97c88c List log_spacemap feature in zpool-features.5 manual
1ba4f3e7b 9072 handle error of zap_cursor_retrieve() for log spacemap zap
2fcf4481a mismerged log spacemap comment for metaslab_verify_weight_and_frag
7f3190891 Tricky semantics of ms_max_size in metaslab_should_allocate()
bac15c119 zdb: don't print log spacemap stats in pools without the feature

History

#1

Updated by Jerry Jelinek 2 months ago

Also include the following two commits:
d3230d761a looping in metaslab_block_picker impacts performance on fragmented pools
cb020f0d8 Reduced IOPS when all vdevs are in the zfs_mg_fragmentation_threshold

Also, include the mdb changes from the original Delphix work, available here:
https://github.com/delphix/delphix-os/commit/fe7bf6cfb4574e480938b8c824d5bbea8a817cfa#diff-cc64a7dca2e6972ab2b85ec543f6d3c9

#2

Updated by Jerry Jelinek 2 months ago

Here is the ZoL commit message for the original work.

    Log Spacemap Project

    = Motivation

    At Delphix we've seen a lot of customer systems where fragmentation
    is over 75% and random writes take a performance hit because a lot
    of time is spent on I/Os that update on-disk space accounting metadata.
    Specifically, we've seen cases where 20% to 40% of sync time is spent
    after sync pass 1 and ~30% of the I/Os on the system are spent updating
    spacemaps.

    The problem is that these pools have existed long enough that we've
    touched almost every metaslab at least once, and random writes
    scatter frees across all metaslabs every TXG, thus appending to
    their spacemaps and resulting in many I/Os. To give an example,
    assuming that every VDEV has 200 metaslabs and our writes fit within
    a single spacemap block (generally 4K), we have 200 I/Os. Then, if
    we assume 2 levels of indirection, we need 400 additional I/Os, and
    since we are talking about metadata, for which we keep 2 extra copies
    for redundancy, we need to triple that number, leading to a total of
    1800 I/Os per VDEV every TXG.
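
    As a quick sanity check of that arithmetic, here is a minimal
    sketch; the constants are the example's assumptions above, not
    tunables taken from the code:

        #include <stdio.h>

        int
        main(void)
        {
            /* Assumptions from the example above, not from the code. */
            int metaslabs = 200;        /* spacemap appends per vdev */
            int indirect_levels = 2;    /* indirect blocks per append */
            int copies = 3;             /* metadata is kept in 3 copies */

            int data_ios = metaslabs;                       /* 200 */
            int indirect_ios = metaslabs * indirect_levels; /* 400 */
            int total = (data_ios + indirect_ios) * copies; /* 1800 */

            printf("spacemap I/Os per vdev per TXG: %d\n", total);
            return (0);
        }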

    We could try to decrease the number of metaslabs so we have fewer
    I/Os per TXG, but then each metaslab would cover a wider range on
    disk and thus would take more time to be loaded into memory from
    disk. In addition, after it's loaded, its range tree would consume
    more memory.

    Another idea would be to just increase the spacemap block size,
    which would allow us to fit more entries within an I/O block,
    resulting in fewer I/Os per metaslab and a speedup in loading time.
    The problem remains that this doesn't address the number of I/Os
    going up as the number of metaslabs increases, and the fact is that
    we generally write a lot to a few metaslabs and a little to the
    rest of them. Thus, just increasing the block size would actually
    waste bandwidth, because most writes would not fill the bigger
    blocks.

    = About this patch

    This patch introduces the Log Spacemap project, which provides the
    solution to the above problem while taking into account all the
    aforementioned tradeoffs. The details on how it achieves that can
    be found in the references section below and in the code (see the
    Big Theory Statement in spa_log_spacemap.c).
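
    To give the rough shape of the idea, here is a conceptual sketch
    with hypothetical names (nothing below is taken from the actual
    sources): frees from all metaslabs are appended to a single
    pool-wide log per TXG, and only a few metaslab spacemaps are
    rewritten per TXG.

        #include <stdio.h>

        /*
         * Conceptual sketch only.  Per TXG, frees go to one pool-wide
         * log spacemap (a single stream of I/O) instead of to every
         * touched metaslab's own spacemap.  A few metaslabs are then
         * flushed per TXG, amortizing the spacemap rewrites and letting
         * the oldest logs be destroyed once their entries are persisted.
         */
        #define NUM_METASLABS   8
        #define FLUSHES_PER_TXG 2   /* the real code tunes this rate */

        static int flush_cursor;    /* round-robin pick of what to flush */

        static void
        sync_txg(unsigned long txg)
        {
            /* One log append per TXG, however many metaslabs saw frees. */
            printf("txg %lu: append all frees to pool-wide log\n", txg);

            /* Incrementally rewrite a few metaslab spacemaps. */
            for (int i = 0; i < FLUSHES_PER_TXG; i++) {
                int m = flush_cursor++ % NUM_METASLABS;
                printf("txg %lu: flush metaslab %d\n", txg, m);
            }
        }

        int
        main(void)
        {
            for (unsigned long txg = 1; txg <= 4; txg++)
                sync_txg(txg);
            return (0);
        }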

    Even though the change is fairly constrained to the metaslab and
    lower-level SPA codepaths, there is one side change that is
    user-facing: VDEV IDs from VDEV holes will no longer be reused.
    To give some background and reasoning for this, when a log device
    is removed and its VDEV structure is replaced with a hole (or
    compacted away, if it is at the end of the vdev array), its vdev_id
    could previously be reused by devices added after that. Now that
    the pool-wide space maps record the vdev ID, this behavior can
    cause problems (e.g. is this entry referring to a segment in the
    new vdev or the removed log?). Thus, to simplify things, the ID
    reuse behavior is gone, and vdev IDs for top-level vdevs are now
    truly unique within a pool.
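
    The rule change itself is easy to illustrate. Below is a
    hypothetical sketch, not the actual vdev allocation path:

        #include <stdio.h>

        /*
         * Hypothetical illustration of the user-visible change.
         * Before: a hole left by a removed log device could hand its
         * ID to the next device added.  After: new top-level vdevs
         * always get a never-before-used ID, so a log spacemap entry's
         * vdev ID is unambiguous.
         */
        #define MAX_VDEVS   16

        enum vstate { V_UNUSED, V_ACTIVE, V_HOLE };

        static enum vstate vdevs[MAX_VDEVS];
        static int next_id;     /* high-water mark, never rewound */

        /* Old behavior: scan for a hole to recycle. */
        static int
        alloc_id_old(void)
        {
            for (int i = 0; i < next_id; i++) {
                if (vdevs[i] == V_HOLE) {
                    vdevs[i] = V_ACTIVE;
                    return (i);
                }
            }
            vdevs[next_id] = V_ACTIVE;
            return (next_id++);
        }

        /* New behavior: IDs stay unique for the lifetime of the pool. */
        static int
        alloc_id_new(void)
        {
            vdevs[next_id] = V_ACTIVE;
            return (next_id++);
        }

        int
        main(void)
        {
            (void) alloc_id_new();      /* data vdev -> id 0 */
            int log = alloc_id_new();   /* log vdev  -> id 1 */
            vdevs[log] = V_HOLE;        /* the log device is removed */

            /* The old rule hands out 1 again; the new rule never does. */
            printf("old: %d\n", alloc_id_old());    /* prints 1 */
            vdevs[1] = V_HOLE;          /* re-hole it, for contrast */
            printf("new: %d\n", alloc_id_new());    /* prints 2 */
            return (0);
        }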

    = Testing

    The illumos implementation of this feature has been used internally
    for a year and has been in production for ~6 months. For this patch
    specifically there don't seem to be any regressions introduced to
    ZTS and I have been running zloop for a week without any related
    problems.

    = Performance Analysis (Linux Specific)

    All performance results and analysis for illumos can be found in
    the links of the references. Redoing the same experiments in Linux
    gave similar results. Below are the specifics of the Linux run.

    After the pool reached a stable state, the percentage of the time
    spent in pass 1 per TXG was 64% on average for the stock bits,
    while the log spacemap bits stayed at 95% during the experiment
    (graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).

    Sync times per TXG were 37.6 seconds on average for the stock
    bits and 22.7 seconds for the log spacemap bits (related graph:
    sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
    the log spacemap bits were able to push more TXGs, which is also
    the reason why all graphs quantified per TXG have more entries for
    the log spacemap bits.

    Another interesting aspect in terms of TXG syncs is that the stock
    bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
    and 20% reach 9. The log spacemap bits reached sync pass 4 in 79%
    of their TXGs, sync pass 7 in 19%, and sync pass 8 in 1%. This
    emphasizes the fact that not only do we spend less time on metadata,
    but we also iterate fewer times to convergence in spa_sync()
    dirtying objects.
    [related graphs:
    stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
    lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]

    Finally, the improvement in IOPS that userland gains from the
    change is approximately 40%. There is a consistent win in IOPS, as
    you can see from the graphs below, but the absolute amount of
    improvement that the log spacemap gives varies within each minute
    interval.
    sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
    sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png

    = Porting to Other Platforms

    For people who want to port this commit to other platforms, below
    is a list of ZoL commits that this patch depends on:

    Make zdb results for checkpoint tests consistent
    db587941c5ff6dea01932bb78f70db63cf7f38ba

    Update vdev_is_spacemap_addressable() for new spacemap encoding
    419ba5914552c6185afbe1dd17b3ed4b0d526547

    Simplify spa_sync by breaking it up to smaller functions
    8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834

    Factor metaslab_load_wait() in metaslab_load()
    b194fab0fb6caad18711abccaff3c69ad8b3f6d3

    Rename range_tree_verify to range_tree_verify_not_present
    df72b8bebe0ebac0b20e0750984bad182cb6564a

    Change target size of metaslabs from 256GB to 16GB
    c853f382db731e15a87512f4ef1101d14d778a55

    zdb -L should skip leak detection altogether
    21e7cf5da89f55ce98ec1115726b150e19eefe89

    vs_alloc can underflow in L2ARC vdevs
    7558997d2f808368867ca7e5234e5793446e8f3f

    Simplify log vdev removal code
    6c926f426a26ffb6d7d8e563e33fc176164175cb

    Get rid of space_map_update() for ms_synced_length
    425d3237ee88abc53d8522a7139c926d278b4b7f

    Introduce auxiliary metaslab histograms
    928e8ad47d3478a3d5d01f0dd6ae74a9371af65e

    Error path in metaslab_load_impl() forgets to drop ms_sync_lock
    8eef997679ba54547f7d361553d21b3291f41ae7

    = References

    Background, Motivation, and Internals of the Feature
    - OpenZFS 2017 Presentation:
    youtu.be/jj2IxRkl5bQ
    - Slides:
    slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project

    Flushing Algorithm Internals & Performance Results
    (Illumos Specific)
    - Blogpost:
    sdimitro.github.io/post/zfs-lsm-flushing/
    - OpenZFS 2018 Presentation:
    youtu.be/x6D2dHRjkxw
    - Slides:
    slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm

#3

Updated by Jerry Jelinek about 2 months ago

To test this I've run the zfs-test suite on DEBUG and non-DEBUG builds. The new and updated tests pass. I've manually verified the updated man page and the new mdb dcmds.

#4

Updated by Electric Monk about 2 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 814dcd43c3de9925fd6226c256e4d4327841a0e1

commit  814dcd43c3de9925fd6226c256e4d4327841a0e1
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
Date:   2019-09-25T16:13:23.000Z

    11557 Log Spacemap Project
    Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Portions contributed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: George Melikov <mail@gmelikov.ru>
    Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed by: Matt Ahrens <mahrens@delphix.com>
    Reviewed by: Paul Dagnelie <pcd@delphix.com>
    Reviewed by: Tony Nguyen <tony.nguyen@delphix.com>
    Reviewed by: George Wilson <george.wilson@delphix.com>
    Reviewed by: Sara Hartse <sara.hartse@delphix.com>
    Reviewed by: Igor Kozhukhov <igor@dilos.org>
    Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
    Reviewed by: Andy Fiddaman <andy@omniosce.org>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: C Fraire <cfraire@me.com>
    Reviewed by: Kody Kantor <kody.kantor@joyent.com>
    Approved by: Gordon Ross <gwr@nexenta.com>
