Bug #6171

dsl_prop_unregister() slows down dataset eviction.

Added by Justin Gibbs about 4 years ago. Updated almost 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Start date:
2015-08-29
Due date:
% Done:
100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

A change to a property on a dataset must be propagated to its descendants
in case that property is inherited. For datasets whose information is
not currently loaded into memory (e.g. a snapshot that isn't currently
mounted), there is nothing to do; the property change will take effect
the next time that dataset is loaded. To handle updates to datasets that
are in-core, ZFS registers a callback entry for each property of each
loaded dataset with the dsl directory that holds that dataset. There
is a dsl directory associated with each live dataset that references
both the live dataset and any snapshots of that dataset. A property
change is effected by traversing the tree of dsl directories for a
pool, starting at the directory sourcing the change, and invoking
these callbacks.

The current implementation both registers and de-registers properties
individually for each loaded dataset. While registration of a property
is O(1) (an insert into a list), de-registration is O(n) (a list search
followed by a removal). The 'n' for de-registration, however, is not
limited to the size (number of snapshots + 1) of the dsl directory. The
eviction portion of the life cycle for the in-core state of datasets is
asynchronous, which allows multiple copies of the dataset information
to be in-core at once. Only one of these copies is active at any time,
with the rest going through tear-down processing, but all copies
contribute to the cost of performing a dsl_prop_unregister().

One way to create multiple in-flight copies of dataset information
is to perform "zfs list" operations from multiple threads
concurrently. In-core dataset information is loaded on demand and
then evicted when its reference count drops to zero. For datasets
that are not mounted, there is no persistent reference to keep them
resident. A list operation will therefore load them, compute the
information required for the listing, and then evict them. When this
is done from multiple threads, some of the in-core dataset information
may be reused, but it is also possible to lose the race and load the
dataset again even while the same information is being torn down.

Compounding the performance issue further is a change made for illumos
issue 5056 which made dataset eviction single threaded. In environments
using automation to manage ZFS datasets, it is now possible to create
enough of a backlog of dataset evictions to consume excessive amounts
of kernel memory and to bog down the system.

The fix employed here is to make property de-registration O(1). With this
change in place, it is hoped that a single thread is more than sufficient
to handle eviction processing. If it isn't, the problem can be solved
by increasing the number of threads devoted to the eviction taskq.

Comments from proposed change:

uts/common/fs/zfs/dsl_dataset.c
uts/common/fs/zfs/dsl_dir.c:
uts/common/fs/zfs/dsl_prop.c:
uts/common/fs/zfs/sys/dsl_dataset.h:
uts/common/fs/zfs/sys/dsl_dir.h:
uts/common/fs/zfs/sys/dsl_prop.h:
    Associate dsl property callback records with both the
    dsl directory and the dsl dataset that is registering the
    callback. Both connections are protected by the dsl directory's
    "dd_lock".

    When linking callbacks into a dsl directory, group them by
    the property type. This helps reduce the space penalty for the
    double association (the property name pointer is stored once
    per dsl_dir instead of in each record) and reduces the number of
    strcmp() calls required to do callback processing when updating
    a single property. Property types are stored in a linked list
    since currently ZFS registers a maximum of 10 property types
    for each dataset.

    Note that the property buckets/records associated with a dsl
    directory are created on demand, but only freed when the dsl
    directory is freed. Given the static nature of property types
    and their small number, there is no benefit to freeing the few
    bytes of memory used to represent the property record earlier.
    When a property record becomes empty, the dsl directory is either
    going to become unreferenced a little later in this thread of
    execution, or there is a high chance that another dataset is
    going to be loaded that would recreate the bucket anyway.

    Replace dsl_prop_unregister() with dsl_prop_unregister_all().
    All callers of dsl_prop_unregister() are trying to remove
    all property registrations for a given dsl dataset anyway. By
    changing the API, we can avoid doing any lookups of callbacks
    by property type and just traverse the list of all callbacks
    for the dataset and free each one.

uts/common/fs/zfs/dmu_objset.c:
uts/common/fs/zfs/zfs_vfsops.c:
    Replace use of dsl_prop_unregister() with the new
    dsl_prop_unregister_all() API.

History

#2

Updated by Justin Gibbs about 4 years ago

  • Assignee set to Justin Gibbs
#3

Updated by Dan McDonald about 4 years ago

This bug was first observed, and analyzed in some detail, in early July. This mail to the OpenZFS developer's list ( http://lists.open-zfs.org/pipermail/developer/2015-July/001521.html ) documents the analysis.

#4

Updated by David Pacheco about 4 years ago

I'm able to reproduce at least some of the original problem on a much smaller system. My system has 48GB of DRAM and 16 CPUs. "zfs list | wc -l" reports 271, and "zfs list -t all | wc -l" reports 535. I ran this in a loop in one shell:

# for i in $(seq 1 10); do ( while ! [[ -f stop ]]; do zfs list -t all > /dev/null; done ) & done

This kicks off 10 tight loops that just run "zfs list -t all". You can stop all of the loops using "touch stop". When I ran this 5 times to get 50 loops, I found that the taskq did start backing up:

    ADDR             NAME                             ACT/THDS Q'ED  MAXQ INST
    ffffff0d2c64d8f8 dbu_evict                          0/   1   95 29308    -

This system hasn't gotten to the point where the feedback loop will keep it from ever recovering, and the queue does get to zero, but it shows that on this system, it's not hard to generate enough work to back up the taskq. (I actually found it harder to do this on a beefier system, so if you've got a smaller one, it's still worth trying this out.)

#5

Updated by Justin Gibbs about 4 years ago

Are your results from before or after applying the proposed fix for this issue?

#6

Updated by Electric Monk almost 4 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 03bad06fbb261fd4a7151a70dfeff2f5041cce1f

commit  03bad06fbb261fd4a7151a70dfeff2f5041cce1f
Author: Justin Gibbs <gibbs@scsiguy.com>
Date:   2015-09-17T12:05:51.000Z

    6171 dsl_prop_unregister() slows down dataset eviction.
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Prakash Surya <prakash.surya@delphix.com>
    Approved by: Dan McDonald <danmcd@omniti.com>
