
Feature #10295

sysctls for ZFS DDT blocks, similar to those for spacemap blocks, to enable mitigation of heavy 4K I/O

Added by Stilez y 10 months ago. Updated 10 months ago.

Status:         New
Priority:       Normal
Assignee:       -
Category:       zfs - Zettabyte File System
Start date:     2019-01-25
Due date:       -
% Done:         0%
Estimated time: -
Difficulty:     Medium
Tags:           needs-triage

Description

The problem scenario

I'm working on a server which is about 45% full and not that heavily used, with ~ 33TB capacity and ~ 15TB of heavily dedupable VM data (~ 3.5x dedup ratio). The server and pool were specced in full knowledge and anticipation of the heavy workload and RAM requirements of the DDT (large fast RAM, fast CPU, cache on NVMe, fast mirror pool, sysctls reserving ~ 80GB RAM for metadata, etc.). Unfortunately this still wasn't enough, and the system still grinds to an unusable halt, choked by 4k reads. The culprit was found to be spacemap and DDT loading into ARC. Migrating the data to more, smaller metaslabs with larger spacemap block sizes (~ 1000 x 8GB slabs per vdev + 64k spacemap blocks), as described in Matt Ahrens' talk at BSDCan 2016 (p.20-28 and especially p.27-28), was a helpful partial mitigation. This exposed DDT loading (4k reads) as the main remaining problem. Unfortunately, no equivalent sysctl mitigation exists yet for DDT load/unloads, and that is the essence of this feature request.
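For concreteness, the spacemap-side mitigation described above can be expressed as a sysctl.conf sketch. The names below follow the existing spacemap tunables referenced in this report (with FreeBSD's vfs.zfs. prefix); exact names and availability vary by platform and ZFS version, and the values are illustrative, not recommendations:

```
# /etc/sysctl.conf (sketch; names/values illustrative, availability varies)
vfs.zfs.metaslab.debug_load=1       # load all spacemaps when the pool is opened
vfs.zfs.metaslab.debug_unload=1     # prevent spacemaps from being unloaded
vfs.zfs.standard_sm_blksz=65536     # 64k spacemap blocks instead of 4k
vfs.zfs.arc_meta_min=85899345920    # reserve ~80 GiB of ARC for metadata
```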

With spacemaps mitigated, the disks initially read in a huge amount of 4k I/O with very few writes while the ARC warms up (and again en masse later if the data is evicted) - all of this is DDT, and it takes tens of minutes before writing proper commences. The resulting initial file R/W can take over 2 hours for 100GB, most of that time spent reading DDTs into ARC. If the test writes are done across a 10G LAN (to prevent caching of the source), then with the DDT "warm", a second test run is immensely faster, with almost no 4k reads.
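For scale, a back-of-envelope estimate of the DDT working set is consistent with the warm-up behaviour described above. The figures below (128 KiB average block size, ~320 bytes of in-core footprint per DDT entry) are rule-of-thumb assumptions, not measurements from this pool:

```shell
# Rough DDT working-set estimate (all figures are assumptions, not measured):
#   15 TiB of dedupable data, 128 KiB average block size, 3.5x dedup ratio,
#   ~320 bytes of in-core footprint per unique-block DDT entry.
data_bytes=$(( 15 * 1024 * 1024 * 1024 * 1024 ))  # 15 TiB
blocks=$(( data_bytes / (128 * 1024) ))           # total 128 KiB blocks written
unique=$(( blocks * 10 / 35 ))                    # divide by the 3.5x dedup ratio
ddt_mib=$(( unique * 320 / 1024 / 1024 ))         # ~320 B per DDT entry in core
echo "unique blocks: $unique; approx DDT in core: $ddt_mib MiB"
```

Tens of millions of entries and roughly 11 GiB of DDT pulled in as 4k reads is a plausible match for the tens of minutes of read I/O observed before writes start.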

It would therefore make sense to mitigate/optimise DDT handling in the same way as spacemaps - perhaps by setting a larger block size, or by using debug tunables to force preload or restrict unload. But unlike the spacemap equivalents, these controls just don't exist for the DDT: there is no way to preload DDTs or prevent their eviction, and no way to modify DDT block sizes either, which might otherwise provide a way to reduce the I/O count.

It might also be beneficial to have clearer sight of "metadata" in the ARC, broken down into its main components (spacemap, DDT, "other"), to expose which of these is under pressure or hogging RAM/IO resources - and perhaps even a couple of tunables to prioritise ARC/L2ARC metadata retention between spacemap, DDT and "other". Currently one can only stipulate a reservation for total metadata: there is no way to reserve or prioritise between types of metadata, nor to specify a preferred retention priority when ARC eviction is required.

Feature request

Dedup is a beneficial technology, relevant enough to some use-cases to justify a longer-term improvement project (Matt Ahrens, "Dedup doesn't have to suck!"). But although rehabilitating dedup is a substantial, already-begun project, there are unhelpful omissions in dedup-related sysctls: areas where better control and better stats could enable immediate improvements, and where equivalents already exist for spacemap handling but not for DDT handling. I would like to request the following additional sysctls, paralleling the existing spacemap sysctls:

METADATA CONTROL SYSCTLS

SYSCTL                                                      DESCRIPTION                                                                                  BASED ON
vfs.zfs.dedup.debug_load                                    Load all DDTs when pool is first opened                                                      metaslab.debug_load
vfs.zfs.dedup.debug_unload                                  Prevent DDTs from being unloaded                                                             metaslab.debug_unload
vfs.zfs.standard_dedup_blksz                                DDT block size                                                                               standard_sm_blksz
vfs.zfs.dedup.unload_min_time                               Minimum time (sec) for which an unused DDT block will not be evicted, if avoidable           metaslab.unload_delay [time, not txg count]

vfs.zfs.arc_min_sm                                          Spacemap metadata will not be evicted while spacemap metadata in ARC is below this level     new
vfs.zfs.arc_min_dedup                                       DDT metadata will not be evicted while DDT metadata in ARC is below this level               new

It would be extremely helpful if these could be considered for addition: with them, DDT tuning and at least some substantial optimisation become possible to explore, and hard stats for the different major types of metadata implicated in severe slowdowns can be properly distinguished. Since the equivalent code already exists for spacemaps and the change affects only block allocation and stat collection, it might be a not-huge job, with significant benefits for these users and use-cases.
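If adopted, the DDT-side mitigation would mirror the spacemap one. A hypothetical sysctl.conf sketch, using only the proposed names from the table above (none of these exist yet; values are illustrative):

```
# /etc/sysctl.conf (hypothetical; proposed sysctls only, values illustrative)
vfs.zfs.dedup.debug_load=1          # preload all DDTs when the pool is opened
vfs.zfs.dedup.debug_unload=1        # prevent DDTs from being unloaded
vfs.zfs.standard_dedup_blksz=65536  # 64k DDT blocks instead of 4k
vfs.zfs.arc_min_dedup=34359738368   # keep >= 32 GiB of DDT metadata in ARC
```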

Also for stats usage, perhaps these or some subset:

METADATA STATS EXPOSURE SYSCTLS

ARC HIT+MISS STATS:

kstat.zfs.misc.arcstats.anon_evictable_sm_metadata          specific stats for ARC spacemap metadata
kstat.zfs.misc.arcstats.arc_sm_meta_max                     specific stats for ARC spacemap metadata
kstat.zfs.misc.arcstats.arc_sm_meta_min                     specific stats for ARC spacemap metadata
kstat.zfs.misc.arcstats.arc_sm_meta_used                    specific stats for ARC spacemap metadata
kstat.zfs.misc.arcstats.demand_sm_metadata_hits             specific stats for ARC spacemap metadata
kstat.zfs.misc.arcstats.demand_sm_metadata_misses           specific stats for ARC spacemap metadata
kstat.zfs.misc.arcstats.sm_metadata_size                    specific stats for ARC spacemap metadata
kstat.zfs.misc.arcstats.prefetch_sm_metadata_hits           specific stats for ARC spacemap metadata
kstat.zfs.misc.arcstats.prefetch_sm_metadata_misses         specific stats for ARC spacemap metadata

kstat.zfs.misc.arcstats.anon_evictable_ddt_metadata         specific stats for ARC dedup metadata
kstat.zfs.misc.arcstats.arc_ddt_meta_max                    specific stats for ARC dedup metadata
kstat.zfs.misc.arcstats.arc_ddt_meta_min                    specific stats for ARC dedup metadata
kstat.zfs.misc.arcstats.arc_ddt_meta_used                   specific stats for ARC dedup metadata
kstat.zfs.misc.arcstats.demand_ddt_metadata_hits            specific stats for ARC dedup metadata
kstat.zfs.misc.arcstats.demand_ddt_metadata_misses          specific stats for ARC dedup metadata
kstat.zfs.misc.arcstats.ddt_metadata_size                   specific stats for ARC dedup metadata
kstat.zfs.misc.arcstats.prefetch_ddt_metadata_hits          specific stats for ARC dedup metadata
kstat.zfs.misc.arcstats.prefetch_ddt_metadata_misses        specific stats for ARC dedup metadata

MFU+MRU STATS:

vfs.zfs.mfu_ghost_dedup_esize                               size of DDT metadata in mfu ghost state
vfs.zfs.mfu_dedup_esize                                     size of DDT metadata in mfu state
vfs.zfs.mfu_ghost_sm_esize                                  size of spacemap metadata in mfu ghost state
vfs.zfs.mfu_sm_esize                                        size of spacemap metadata in mfu state

vfs.zfs.mru_ghost_dedup_esize                               size of DDT metadata in mru ghost state
vfs.zfs.mru_dedup_esize                                     size of DDT metadata in mru state
vfs.zfs.mru_ghost_sm_esize                                  size of spacemap metadata in mru ghost state
vfs.zfs.mru_sm_esize                                        size of spacemap metadata in mru state

kstat.zfs.misc.arcstats.mfu_evictable_ddt_data              specific stats for MFU dedup metadata
kstat.zfs.misc.arcstats.mfu_evictable_ddt_metadata          specific stats for MFU dedup metadata
kstat.zfs.misc.arcstats.mfu_ghost_evictable_ddt_data        specific stats for MFU ghost dedup metadata
kstat.zfs.misc.arcstats.mfu_ghost_evictable_ddt_metadata    specific stats for MFU ghost dedup metadata
kstat.zfs.misc.arcstats.mru_evictable_ddt_metadata          specific stats for MRU dedup metadata
kstat.zfs.misc.arcstats.mru_ghost_evictable_ddt_data        specific stats for MRU ghost dedup metadata
kstat.zfs.misc.arcstats.mru_ghost_evictable_ddt_metadata    specific stats for MRU ghost dedup metadata

kstat.zfs.misc.arcstats.mfu_evictable_sm_data               specific stats for MFU spacemap metadata
kstat.zfs.misc.arcstats.mfu_evictable_sm_metadata           specific stats for MFU spacemap metadata
kstat.zfs.misc.arcstats.mfu_ghost_evictable_sm_data         specific stats for MFU ghost spacemap metadata
kstat.zfs.misc.arcstats.mfu_ghost_evictable_sm_metadata     specific stats for MFU ghost spacemap metadata
kstat.zfs.misc.arcstats.mru_evictable_sm_metadata           specific stats for MRU spacemap metadata
kstat.zfs.misc.arcstats.mru_ghost_evictable_sm_data         specific stats for MRU ghost spacemap metadata
kstat.zfs.misc.arcstats.mru_ghost_evictable_sm_metadata     specific stats for MRU ghost spacemap metadata

These probably aren't hard to add, and they would give much clearer insight into the major components of metadata I/O pressure when it arises. Right now only totals for metadata are exposed, which isn't ideal given how significantly spacemaps and, where enabled, the DDT can impact a system.
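As a stopgap, the aggregate numbers that do exist today can at least be ratioed to spot overall metadata pressure. A minimal sketch, assuming the FreeBSD kstat.zfs.misc.arcstats.arc_meta_used / arc_meta_limit sysctls are available; the per-type breakdown requested above is exactly what this can't show:

```shell
# meta_pct USED LIMIT -> integer percentage of the ARC metadata limit in use.
# Kept as pure arithmetic so it can be fed from any stat source.
meta_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Live use on FreeBSD (assumed sysctl names; adjust per platform/version):
#   meta_pct "$(sysctl -n kstat.zfs.misc.arcstats.arc_meta_used)" \
#            "$(sysctl -n kstat.zfs.misc.arcstats.arc_meta_limit)"
```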

History

#1

Updated by Joshua M. Clulow 10 months ago

Which illumos distribution and version are you running on the system where you're hitting the issue?

#2

Updated by Stilez y 10 months ago

Apologies if this isn't the right place. I'm running ZFS on FreeBSD, but OpenZFS' website doesn't seem to link to an issue/feature tracker, and everything I can find says FreeBSD uses OpenZFS, which in turn is synced to illumos as its upstream source. As I can't find an OpenZFS tracker, I've reported it here, on the basis that it's probably of use to all. If it's okay, I'll gladly provide such info as I can. Basically, I've spent a long time tracking down a severe performance issue related to write-throttle failure on fast hardware, and I'm finally at a point where the issues are clear, reproducible and evidenced. The throttle isn't appropriately invoked during some I/O, causing session death. The spacemap sysctls mitigate the problem in part, which suggests equivalent DDT sysctls would help a lot, but they don't exist yet. Hence this request. If I'm wrong about where to report it, can you direct me to the appropriate OpenZFS tracker instead? And thank you, either way.

#3

Updated by Joshua M. Clulow 10 months ago

This issue may well be FreeBSD-specific, and at the very least the sysctl mechanism you're talking about is something we don't have in illumos. I think it would be best to report this in the FreeBSD bug tracker initially.

#4

Updated by Stilez y 10 months ago

I highly doubt it's FreeBSD-specific. The issue with the 4k spacemap block size is well known (on all platforms) and can be altered by the sysadmin. Dedup table blocks - for pools where dedup is a sane option - are almost certainly affected identically (tens of GB of 4k reads that have to be loaded before any data can be written).

While dedup itself is hard to sort out, exposing these few specific controls for DDT block size etc., plus some ARC stats breaking metadata down into its major components, is valuable in itself.

I'll report this on the FreeBSD tracker as well, as suggested, but these few additional ZFS controls/options are surely of value to all dedup users, on all platforms, including this one and any related ones.
