Feature #9112

Improve allocation performance on high-end systems

Added by Paul Dagnelie over 1 year ago. Updated 11 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: zfs - Zettabyte File System
Start date: 2018-02-15
Due date:
% Done: 100%
Estimated time:
Difficulty: Hard
Tags: needs-triage

Description

On high-end systems running async sequential write workloads, especially NUMA systems with flash or NVMe storage, one significant performance bottleneck is selecting a metaslab to do allocations from. This process can be parallelized, providing significant performance increases for these workloads.

History

#1

Updated by Electric Monk over 1 year ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit f78cdc34af236a6199dd9e21376f4a46348c0d56

commit  f78cdc34af236a6199dd9e21376f4a46348c0d56
Author: Paul Dagnelie <pcd@delphix.com>
Date:   2018-04-04T17:22:16.000Z

    9112 Improve allocation performance on high-end systems
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: George Wilson <george.wilson@delphix.com>
    Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
    Reviewed by: Alexander Motin <mav@FreeBSD.org>
    Approved by: Gordon Ross <gwr@nexenta.com>

#2

Updated by Sanjay Nadkarni 11 months ago

Adding details from openzfs emails from Paul.

Overview

We parallelize the allocation process by creating the concept of "allocators". There are a certain number of allocators per metaslab group, defined by the value of a tunable at pool open time. Each allocator for a given metaslab group has up to 2 active metaslabs: one "primary" and one "secondary". The primary and secondary weights mean the same thing they did in the pre-allocator world; primary metaslabs are used for most allocations, and secondary metaslabs are used for ditto blocks being allocated in the same metaslab group. There is also the CLAIM weight, which has been separated out from the other weights, but that is less important to understanding the patch.

The active metaslabs for each allocator are moved from their normal place in the group's metaslab tree to the back of the tree. This way, they will not be selected for use by other allocators searching for new metaslabs unless all the passive metaslabs are unsuitable for allocations. If that does happen, the allocators will "steal" from each other to ensure that IOs don't fail until there is truly no space left to perform allocations.
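The sketch below is a minimal, self-contained model of the activation scheme described above; the field names (mg_primaries, mg_secondaries, ms_active), the use of free space as a stand-in for the weight, and the flat array in place of the group's tree are all simplifications for illustration, not the actual illumos code.

/*
 * Model of per-allocator metaslab activation. Illustrative only; the
 * real code tracks metaslab weights in a per-group tree rather than
 * the simple array used here.
 */
#include <stdio.h>

#define NUM_ALLOCATORS  4
#define NUM_METASLABS   8

typedef struct metaslab {
    int  ms_id;
    long ms_free;    /* stand-in for the real metaslab weight */
    int  ms_active;  /* currently owned by some allocator */
} metaslab_t;

typedef struct metaslab_group {
    metaslab_t  mg_metaslabs[NUM_METASLABS];
    metaslab_t *mg_primaries[NUM_ALLOCATORS];
    metaslab_t *mg_secondaries[NUM_ALLOCATORS];
} metaslab_group_t;

/*
 * Best passive metaslab; skipping active ones models the way active
 * metaslabs are moved to the back of the group's tree.
 */
static metaslab_t *
pick_passive(metaslab_group_t *mg)
{
    metaslab_t *best = NULL;

    for (int i = 0; i < NUM_METASLABS; i++) {
        metaslab_t *ms = &mg->mg_metaslabs[i];
        if (!ms->ms_active &&
            (best == NULL || ms->ms_free > best->ms_free))
            best = ms;
    }
    return (best);
}

static metaslab_t *
allocator_primary(metaslab_group_t *mg, int allocator)
{
    metaslab_t *ms;

    if (mg->mg_primaries[allocator] != NULL)
        return (mg->mg_primaries[allocator]);

    ms = pick_passive(mg);
    if (ms == NULL) {
        /* Steal another allocator's primary instead of failing. */
        for (int a = 0; a < NUM_ALLOCATORS && ms == NULL; a++)
            if (a != allocator)
                ms = mg->mg_primaries[a];
        if (ms == NULL)
            return (NULL);   /* genuinely out of space */
    }
    ms->ms_active = 1;
    mg->mg_primaries[allocator] = ms;
    return (ms);
}

int
main(void)
{
    static metaslab_group_t mg;  /* zeroed; nothing active yet */

    for (int i = 0; i < NUM_METASLABS; i++) {
        mg.mg_metaslabs[i].ms_id = i;
        mg.mg_metaslabs[i].ms_free = (i + 1) * 1024;
    }
    for (int a = 0; a < NUM_ALLOCATORS; a++)
        printf("allocator %d -> metaslab %d\n", a,
            allocator_primary(&mg, a)->ms_id);
    return (0);
}

Each allocator activates the best passive metaslab it can find and marks it active so the other allocators skip it; only when nothing passive is usable does it fall back to reusing another allocator's primary, mirroring the "stealing" behavior described above.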

In addition, the alloc queue for each metaslab group has been broken into a separate queue for each allocator. We don't want to dramatically increase the number of inflight IOs on low-end systems, because it can significantly increase txg times. On the other hand, we want to ensure that there are enough IOs for each allocator to allow for good coalescing before sending the IOs to the disk. As a result, we take a compromise path; each allocator's alloc queue max depth starts at a certain value for every txg. Every time an IO completes, we increase the max depth. This should hopefully provide a good balance between the two failure modes, while not dramatically increasing complexity.
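As a rough illustration of that ramp-up, the fragment below models a per-allocator queue whose cap is reset when a txg opens and raised on every completed I/O; the names and the base depth of 32 are assumptions made for the sketch, not the real tunables.

#include <stdio.h>

#define AQ_BASE_DEPTH  32  /* starting cap per allocator, per txg */

typedef struct alloc_queue {
    int aq_queued;     /* I/Os currently held by this allocator */
    int aq_max_depth;  /* current cap, reset when a txg opens */
} alloc_queue_t;

static void
alloc_queue_txg_reset(alloc_queue_t *aq)
{
    aq->aq_max_depth = AQ_BASE_DEPTH;
}

/* Returns 1 if the I/O may be issued now, 0 if it must wait. */
static int
alloc_queue_try_issue(alloc_queue_t *aq)
{
    if (aq->aq_queued >= aq->aq_max_depth)
        return (0);
    aq->aq_queued++;
    return (1);
}

static void
alloc_queue_io_done(alloc_queue_t *aq)
{
    aq->aq_queued--;
    aq->aq_max_depth++;  /* ramp the cap as completions arrive */
}

int
main(void)
{
    alloc_queue_t aq = { 0, 0 };

    alloc_queue_txg_reset(&aq);
    for (int i = 0; i < 100; i++) {
        if (alloc_queue_try_issue(&aq))
            alloc_queue_io_done(&aq);  /* pretend it completed */
    }
    printf("cap after 100 completions: %d\n", aq.aq_max_depth);
    return (0);
}

Running this shows the cap climbing from 32 to 132 after 100 completions; the queue is only allowed to deepen as fast as the disks actually drain it.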

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause very similar contention when selecting IOs to allocate. This parallelization uses the same allocator scheme as metaslab selection.
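A minimal sketch of that split, assuming one lock and one queue per allocator (the real structures are sorted trees of zios; the linked list and names here are only illustrative):

#include <pthread.h>
#include <stddef.h>

#define SPA_ALLOCATORS  4

typedef struct zio {
    struct zio *io_next;
    int         io_allocator;  /* chosen once per zio; policy not shown */
} zio_t;

typedef struct spa {
    pthread_mutex_t spa_alloc_locks[SPA_ALLOCATORS];
    zio_t          *spa_alloc_queues[SPA_ALLOCATORS];
} spa_t;

/* Queue a zio for allocation under its own allocator's lock only. */
static void
spa_alloc_enqueue(spa_t *spa, zio_t *zio)
{
    int a = zio->io_allocator;

    pthread_mutex_lock(&spa->spa_alloc_locks[a]);
    zio->io_next = spa->spa_alloc_queues[a];
    spa->spa_alloc_queues[a] = zio;
    pthread_mutex_unlock(&spa->spa_alloc_locks[a]);
}

int
main(void)
{
    static spa_t spa;  /* zeroed; queues start empty */
    zio_t zio = { NULL, 2 };

    for (int a = 0; a < SPA_ALLOCATORS; a++)
        pthread_mutex_init(&spa.spa_alloc_locks[a], NULL);
    spa_alloc_enqueue(&spa, &zio);
    return (0);
}

Because each zio is assigned an allocator once and only ever takes that allocator's lock, writes spread across allocators no longer serialize on a single spa-wide mutex.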

Performance results

Performance improvements from this change can vary significantly based on the number of CPUs in the system, whether or not the system has a NUMA architecture, the speed of the drives, the values of the various tunables, and the workload being performed. For an fio async sequential write workload on a 24-core NUMA system with 256 GB of RAM and eight 128 GB SSDs, there is a roughly 25% performance improvement.

Future work

Analysis of the system's performance with this patch applied shows that a significant new bottleneck is the vdev disk queues, which also need to be parallelized. A prototype of that change showed a further performance improvement, but more work is needed to verify its stability before it is ready to be upstreamed.

---------------------
Performance on HDD disks.

I created a pool on four 1 TB HDDs, and then created a filesystem with compression off (since I'm using random data, it wouldn't help) and edon-r as the checksum. I then create a number of files of a given size using dd, creating 10 at a time in parallel. Once the files are created, I export and import the pool to clear the cache, then read in all the files, writing them to /dev/null. I run that part 15 times, timing each run to get a decent performance measurement. Then I destroy the filesystem and go back to the filesystem-creation step with new parameters.

I ran a decent spectrum of tests, so hopefully some of this data will help reassure you. In each table, the column headers are the sizes of the files created, the row labels on the left are the number of files, and the left and right blocks of columns are before and after the change. The times are the average of the 15 runs, in seconds.

Read performance:

For recordsize=512:
        ------------ Before ------------     ------------ After -------------
Files   128k    1m      8m      64m          128k    1m      8m      64m
10      0.09    0.36    2.88    21.70        0.07    0.33    2.89    20.21
40      0.29    1.55    11.93   86.50        0.24    1.46    10.94   83.53
160     1.14    6.08    50.10   361.76       0.93    6.23    44.20   342.40

For recordsize=8k:
        ------------ Before ------------     ------------ After -------------
Files   128k    1m      8m      64m          128k    1m      8m      64m
10      0.05    0.16    0.64    3.89         0.07    0.16    0.72    3.61
40      0.25    0.59    3.02    17.72        0.20    0.68    3.46    15.96
160     0.69    2.79    12.43   68.86        0.76    2.59    14.47   59.43

For recordsize=128k:
        ------------ Before ------------     ------------ After -------------
Files   128k    1m      8m      64m          128k    1m      8m      64m
10      0.04    0.10    0.53    3.58         0.05    0.11    0.64    4.32
40      0.14    0.37    2.31    17.94        0.14    0.45    2.62    16.95
160     0.59    1.67    9.81    60.39        0.56    1.77    10.42   61.90

It looks like read performance is usually similar with the new bits; some runs are slightly better and some slightly worse. Those results may be within each other's margin of error, or there may be some pattern that explains why some runs are slightly faster and some slightly slower; I'm not sure.

Write performance:

For recordsize=512:
        ------------ Before ------------     ------------ After -------------
Files   128k    1m      8m      64m          128k    1m      8m      64m
10      0.03    0.07    0.45    3.27         0.03    0.07    0.43    3.35
40      0.11    0.29    1.68    34.28        0.11    0.29    1.66    27.49
160     0.46    1.17    8.18    478.53       0.45    1.18    8.23    357.20

For recordsize=8k:
        ------------ Before ------------     ------------ After -------------
Files   128k    1m      8m      64m          128k    1m      8m      64m
10      0.00    0.00    0.16    1.09         0.00    0.00    0.16    1.08
40      0.11    0.17    0.64    4.67         0.11    0.00    0.64    4.49
160     0.46    0.70    2.72    19.08        0.46    0.70    2.69    22.66

For recordsize=128k:
        ------------ Before ------------     ------------ After -------------
Files   128k    1m      8m      64m          128k    1m      8m      64m
10      0.00    0.04    0.00    1.02         0.00    0.04    0.15    1.02
40      0.12    0.17    0.61    4.18         0.11    0.17    0.61    4.22
160     0.46    0.67    2.48    16.78        0.46    0.68    2.43    16.81

For recordsize=1m:
        ------------ Before ------------     ------------ After -------------
Files   128k    1m      8m      64m          128k    1m      8m      64m
10      0.00    0.04    0.15    1.05         0.03    0.00    0.00    1.04
40      0.11    0.18    0.62    4.27         0.11    0.17    0.62    4.24
160     0.46    0.71    2.53    17.14        0.44    0.68    2.51    17.10

Similar to the read performance, or slightly better: almost everything is the same or better, and only a couple of cases are worse. Oddly, the 160 x 64m runs are worse for recordsize=8k and 128k, but not for 512 and 1m.
