Project

General

Profile

Bug #4873

zvol unmap calls can take a very long time for larger datasets

Added by Alex Reece over 5 years ago. Updated about 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
zfs - Zettabyte File System
Start date:
2014-05-20
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

Currently, dbuf_free_range iterates over all dbufs in an object when freeing a range (since dn->dn_dbufs is a linked list). For large files, this O(n) operation can take a long time. On a customer's system, the unmap ran for more than an hour.

History

#1

Updated by Alex Reece about 5 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 90

By changing dn->dn_dbufs to an AVL tree, we changed the O(n) "find overlapping dbufs" operation to O(log(n)) for a substantial performance improvement. Detailed performance analysis available in the review: http://reviews.csiden.org/r/22/

#2

Updated by Josef Sipek about 5 years ago

Can you please copy & paste the analysis here? I'm worried that at some point, reviews.csiden.org's database may go away and we'll lose valuable context for this change.

(The analysis on reviewboard seems to be full of markup - that's why I didn't just copy & paste from there.)

#3

Updated by Alex Reece about 5 years ago

Setup:

For both the areece-control (existing bits) and areece-free-range (new bits), create a test zpool and a zfs dataset with recordsize=512 and compression=off.

    sudo zpool create test c1t1d0
    sudo zfs create -o recordsize=512 -o compression=off test/fs
    sudo chown -R $(logname) /test

Regression test 1:

Created a file with 1 million dbufs via dd if=/dev/zero of=/test/fs/file bs=$((2**20)) count=$((2**9)). This is a workload where dbuf_create is called many times and now has an O(log(n)) avl_add operation instead of an O(1) list_insert_head. There was not a statistically meaningful difference in the performance here, perhaps because dbuf_create was not a large portion of the workload (CPU profiling indicates that it is <5% of the cost of this action)

    $ cat free_range_create.times | histogram.py --confidence=0.9
    # NumSamples = 31; Min = 9.50; Max = 20.75
    # Mean = 13.516924 (+/- 0.753623); Variance = 6.507525; SD = 2.550985; Median 13.188155
    # each * represents a count of 1
        9.5012 -    10.6259 [     3]: ***
       10.6259 -    11.7506 [     5]: *****
       11.7506 -    12.8753 [     7]: *******
       12.8753 -    14.0000 [     4]: ****
       14.0000 -    15.1247 [     5]: *****
       15.1247 -    16.2494 [     2]: **
       16.2494 -    17.3742 [     2]: **
       17.3742 -    18.4989 [     2]: **
       18.4989 -    19.6236 [     0]: 
       19.6236 -    20.7483 [     1]: *
    $ cat control_create.times| histogram.py --confidence=0.9
    # NumSamples = 31; Min = 7.56; Max = 18.35
    # Mean = 13.112786 (+/- 0.703081); Variance = 5.663922; SD = 2.379900; Median 12.995457
    # each * represents a count of 1
        7.5602 -     8.6397 [     1]: *
        8.6397 -     9.7192 [     0]: 
        9.7192 -    10.7986 [     4]: ****
       10.7986 -    11.8781 [     7]: *******
       11.8781 -    12.9576 [     2]: **
       12.9576 -    14.0370 [     6]: ******
       14.0370 -    15.1165 [     4]: ****
       15.1165 -    16.1960 [     3]: ***
       16.1960 -    17.2754 [     3]: ***
       17.2754 -    18.3549 [     1]: *

Regression Test 2:

Starting with a (0.5G) file with 1 million dbufs, read it repeatedly until it is cached in the ARC. When it is fully cached, zfs unmount and zfs mount it. This clears the dbufs from the file, but not the arc bufs. We then compare the time it takes to read the file, which will necessitate creating all the necessary dbufs but no io, since the file contents are cached (the lack of io was verified w/ iostat). Again, there is not a statistically meaningful difference in performance, perhaps because dbuf\\\\_create was not a large portion of the workload (CPU profiling indicates that it is <5% of the cost of this action)

    $ cat control\\\\_test2.times | histogram.py --confidence=0.9
    \\\\# NumSamples = 31; Min = 3.33; Max = 7.76
    \\\\# Mean = 4.762711 (+/- 0.445965); Variance = 2.278809; SD = 1.509572; Median 4.074336
    \\\\# each \\\\* represents a count of 1
        3\\\\.3267 -     3.7703 \\\\[    12\\\\]: \\\\*\\\\*********\\\\*\\\\*
        3\\\\.7703 -     4.2140 \\\\[     6\\\\]: \\\\*\\\\***\\\\*\\\\*
        4\\\\.2140 -     4.6576 \\\\[     1\\\\]: \\\\*
        4\\\\.6576 -     5.1012 \\\\[     1\\\\]: \\\\*
        5\\\\.1012 -     5.5449 \\\\[     3\\\\]: \\\\*\\\\*\\\\*
        5\\\\.5449 -     5.9885 \\\\[     1\\\\]: \\\\*
        5\\\\.9885 -     6.4321 \\\\[     0\\\\]: 
        6\\\\.4321 -     6.8758 \\\\[     1\\\\]: \\\\*
        6\\\\.8758 -     7.3194 \\\\[     3\\\\]: \\\\*\\\\*\\\\*
        7\\\\.3194 -     7.7630 \\\\[     3\\\\]: \\\\*\\\\*\\\\*
    $ cat free\\\\_range\\\\_test2.times | histogram.py --confidence=0.9
    \\\\# NumSamples = 31; Min = 2.93; Max = 7.57
    \\\\# Mean = 4.959024 (+/- 0.392162); Variance = 1.762129; SD = 1.327452; Median 4.666025
    \\\\# each \\\\* represents a count of 1
        2\\\\.9304 -     3.3940 \\\\[     3\\\\]: \\\\*\\\\*\\\\*
        3\\\\.3940 -     3.8576 \\\\[     6\\\\]: \\\\*\\\\***\\\\*\\\\*
        3\\\\.8576 -     4.3212 \\\\[     4\\\\]: \\\\*\\\\*\\\\*\\\\*
        4\\\\.3212 -     4.7848 \\\\[     3\\\\]: \\\\*\\\\*\\\\*
        4\\\\.7848 -     5.2484 \\\\[     1\\\\]: \\\\*
        5\\\\.2484 -     5.7120 \\\\[     0\\\\]: 
        5\\\\.7120 -     6.1756 \\\\[     7\\\\]: \\\\*\\\\****\\\\*\\\\*
        6\\\\.1756 -     6.6392 \\\\[     4\\\\]: \\\\*\\\\*\\\\*\\\\*
        6\\\\.6392 -     7.1027 \\\\[     2\\\\]: \\\\*\\\\*
        7\\\\.1027 -     7.5663 \\\\[     1\\\\]: \\\\*

Regression test 3:

I created a file w/ 16 million dbufs and removed it entirely.

$ sudo zpool create test c3t\\\\{1,2,3\\\\}d0
$ sudo zfs create -o recordsize=512 -o compression=off test/fs
$ sudo chown -R $(logname) /test

$ dd if=/dev/zero of=/test/fs/file bs=$((16 * (2**20))) count=$((2**9))
512+0 records in
512+0 records out
8589934592 bytes (8.6 GB) copied, 440.277 s, 19.5 MB/s

$ ptime rm /test/fs/file

When I test release OI to release illumos w/ the new bits on a machine w/ sufficient ram to keep the entire file cached, I see a 10x improvement:

alex@oi$ ptime rm /test/fs/file

real     2:11.528192698
user        0.000450506
sys      1:34.483731748

alex@illumos$ ptime rm /test/fs/file

real        8.306551951
user        0.000512151
sys         8.305646676

This ~10x different in perf is very close to the theoretical difference: I observe 484 calls to dbuf_free_range on OI, so the total number of "operations" for the linked list approach is ~3 billion and the total number of "operations" for the avl tree approach is ~300 million.

If we look at how the time is spent in the original case, we see that 75% of the time is spend in list_next (presumably because of cache misses):

On top of that, using the flame graph I do not see a majority of time being spent in avl_rotate for the new approach (I only see ~10% of the time spent there):

Performance Test:

With a (0.5G) file with 1 million dbufs, free 100 blocks from the middle of a file, 1 block at a time, via repeatedly calling \\\\`fcntl(fd, F\\\\_FREESP64, \\\\{ .l\\\\_start = <offset\\\\>, .l\\\\_len = 512 \\\\})\\\\` The difference between the control and the new code is almost 4 orders of magnitude:

    $ cat free\\\\_range\\\\_free.times | histogram.py --confidence=0.9
    \\\\# NumSamples = 30; Min = 0.00; Max = 0.00
    \\\\# Mean = 0.000032 (+/- 0.000002); Variance = 0.000000; SD = 0.000006; Median 0.000032
    \\\\# each \\\\* represents a count of 1
        0\\\\.0000 -     0.0000 \\\\[     3\\\\]: \\\\*\\\\*\\\\*
        0\\\\.0000 -     0.0000 \\\\[     4\\\\]: \\\\*\\\\*\\\\*\\\\*
        0\\\\.0000 -     0.0000 \\\\[     2\\\\]: \\\\*\\\\*
        0\\\\.0000 -     0.0000 \\\\[     4\\\\]: \\\\*\\\\*\\\\*\\\\*
        0\\\\.0000 -     0.0000 \\\\[     7\\\\]: \\\\*\\\\****\\\\*\\\\*
        0\\\\.0000 -     0.0000 \\\\[     2\\\\]: \\\\*\\\\*
        0\\\\.0000 -     0.0000 \\\\[     2\\\\]: \\\\*\\\\*
        0\\\\.0000 -     0.0000 \\\\[     3\\\\]: \\\\*\\\\*\\\\*
        0\\\\.0000 -     0.0000 \\\\[     2\\\\]: \\\\*\\\\*
        0\\\\.0000 -     0.0000 \\\\[     1\\\\]: \\\\*
    $ cat control\\\\_free.times| histogram.py --confidence=0.9
    \\\\# NumSamples = 30; Min = 0.24; Max = 0.34
    \\\\# Mean = 0.296387 (+/- 0.006490); Variance = 0.000467; SD = 0.021612; Median 0.302462
    \\\\# each \\\\* represents a count of 1
        0\\\\.2433 -     0.2533 \\\\[     2\\\\]: \\\\*\\\\*
        0\\\\.2533 -     0.2634 \\\\[     2\\\\]: \\\\*\\\\*
        0\\\\.2634 -     0.2734 \\\\[     1\\\\]: \\\\*
        0\\\\.2734 -     0.2834 \\\\[     1\\\\]: \\\\*
        0\\\\.2834 -     0.2935 \\\\[     3\\\\]: \\\\*\\\\*\\\\*
        0\\\\.2935 -     0.3035 \\\\[     7\\\\]: \\\\*\\\\****\\\\*\\\\*
        0\\\\.3035 -     0.3136 \\\\[    11\\\\]: \\\\*\\\\********\\\\*\\\\*
        0\\\\.3136 -     0.3236 \\\\[     2\\\\]: \\\\*\\\\*
        0\\\\.3236 -     0.3336 \\\\[     0\\\\]: 
        0\\\\.3336 -     0.3437 \\\\[     1\\\\]: \\\\*
#4

Updated by Electric Monk about 5 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 90 to 100

git commit 0f6d88aded0d165f5954688a9b13bac76c38da84

commit  0f6d88aded0d165f5954688a9b13bac76c38da84
Author: Alex Reece <alex@delphix.com>
Date:   2014-07-26T21:40:04.000Z

    4873 zvol unmap calls can take a very long time for larger datasets
    Reviewed by: George Wilson <george@delphix.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
    Reviewed by: Basil Crow <basil.crow@delphix.com>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Approved by: Robert Mustacchi <rm@joyent.com>

Also available in: Atom PDF