Bug #11918

metaslab improvements

Added by Jerry Jelinek 12 months ago. Updated 12 months ago.

zfs - Zettabyte File System
Start date:
Due date:
% Done:


Estimated time:
Gerrit CR:


Port the following metaslab improvements from ZoL

475aa97ca Prevent metaslab_sync panic due to spa_final_dirty_txg
eef0f4d84 Keep more metaslabs loaded
f09fda507 Cap metaslab memory usage
c81f1790e Metaslab max_size should be persisted while unloaded
fe0ea8481 Don't activate metaslabs with weight 0
679b0f2ab Concurrent small allocation defeats large allocation



Updated by Jerry Jelinek 12 months ago

To test this I ran the zfs-test suite on both DEBUG and non-DEBUG builds.


Updated by Jerry Jelinek 12 months ago

Concurrent small allocation defeats large allocation

With the new parallel allocators scheme, there is a possibility for
a problem where two threads, allocating from the same allocator at
the same time, conflict with each other. There are two primary cases
to worry about. First, another thread working on another allocator
activates the same metaslab that the first thread was trying to
activate. This results in the first thread needing to go back and
reselect a new metaslab, even though it may have waited a long time
for this metaslab to load. Second, another thread working on the same
allocator may have activated a different metaslab while the first
thread was waiting for its metaslab to load. Both of these cases
can cause the first thread to be significantly delayed in issuing
its IOs. The second case can also cause metaslab load/unload churn;
because the metaslab is loaded but not fully activated, we never set
the selected_txg, which results in the metaslab being immediately
unloaded again. This process can repeat many times, wasting disk and
cpu resources. This is more likely to happen when the IO of the first
thread is a larger one (like a ZIL write) and the other thread is
doing a smaller write, because it is more likely to find an
acceptable metaslab quickly.

There are two primary changes. The first is to always proceed with
the allocation when returning from metaslab_activate if we were
preempted in either of the ways described in the previous section.
The second change is to set the selected_txg before we do the call
to activate so that even if the metaslab is not used for an
allocation, we won't immediately attempt to unload it.

Don't activate metaslabs with weight 0

We return ENOSPC in metaslab_activate if the metaslab has weight 0,
to avoid activating a metaslab with no space available. For sanity
checking, we also assert that there is no free space in the range
tree in that case.

Metaslab max_size should be persisted while unloaded

When we unload metaslabs today in ZFS, the cached max_size value is
discarded. We instead use the histogram to determine whether or not we
think we can satisfy an allocation from the metaslab. This can result in
situations where, if we're doing I/Os of a size not aligned to a
histogram bucket, a metaslab is loaded even though it cannot satisfy the
allocation we think it can. For example, a metaslab with 16 entries in
the 16k-32k bucket may have entirely 16kB entries. If we try to allocate
a 24kB buffer, we will load that metaslab because we think it should be
able to handle the allocation. Doing so is expensive in CPU time, disk
reads, and average IO latency. This is exacerbated if the write being
attempted is a sync write.

This change makes ZFS cache the max_size after the metaslab is
unloaded. If we ever get a free (or a coalesced group of frees) larger
than the max_size, we will update it. Otherwise, we leave it as is. When
attempting to allocate, we use the max_size as a lower bound, and
respect it unless we are in try_hard. However, we do age the max_size
out at some point, since we expect the actual max_size to increase as we
do more frees. A more sophisticated algorithm here might be helpful, but
this works reasonably well.

Cap metaslab memory usage

On systems with large amounts of storage and high fragmentation, a huge
amount of space can be used by storing metaslab range trees. Since
metaslabs are only unloaded during a txg sync, and only if they have
been inactive for 8 txgs, it is possible to get into a state where all
of the system's memory is consumed by range trees and metaslabs, and
txgs cannot sync. While ZFS knows how to evict ARC data when needed,
it has no such mechanism for range tree data. This can result in boot
hangs for some system configurations.

First, we add the ability to unload metaslabs outside of syncing
context. Second, we store a multilist of all loaded metaslabs, sorted
by their selection txg, so we can quickly identify the oldest
metaslabs. We use a multilist to reduce lock contention during heavy
write workloads. Finally, we add logic that will unload a metaslab
when we're loading a new metaslab, if we're using more than a certain
fraction of the available memory on range trees.

Keep more metaslabs loaded

With the other metaslab changes loaded onto a system, we can
significantly reduce the memory usage of each loaded metaslab and
unload them on demand if there is memory pressure. However, none
of those changes actually result in us keeping more metaslabs loaded.
If we don't keep more metaslabs loaded, we will still have to wait
for demand-loading to finish when no loaded metaslab can satisfy our
allocation, which can cause ZIL performance issues. In addition,
performance is traditionally measured by IOs per unit time, while
unloading is currently done on a txg-count basis. Txgs can take a
widely varying range of times, from tenths of a second to several
seconds. This can result in confusing, hard to predict behavior.

This change simply adds a time-based component to metaslab unloading.
A metaslab will remain loaded for one minute and 8 txgs (by default)
after it was last used, unless it is evicted due to memory pressure.

Prevent metaslab_sync panic due to spa_final_dirty_txg

If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.


Updated by Electric Monk 12 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit af1d63aba5cec023f92214c1f1faec9b489ac517

commit  af1d63aba5cec023f92214c1f1faec9b489ac517
Author: Paul Dagnelie <>
Date:   2019-11-13T20:49:14.000Z

    11918 metaslab improvements
    Portions contributed by: Jerry Jelinek <>
    Reviewed by: Matt Ahrens <>
    Reviewed by: Brian Behlendorf <>
    Reviewed by: George Wilson <>
    Reviewed by: Igor Kozhukhov <>
    Reviewed by: Sebastien Roy <>
    Reviewed by: Serapheim Dimitropoulos <>
    Reviewed by: Andy Fiddaman <>
    Approved by: Dan McDonald <>

Also available in: Atom PDF