sysctls for ZFS DDT blocks, similar to those for spacemap blocks, to enable mitigation of heavy 4K I/O
I'm working on a server which is about 45% full and not that heavily used, with ~33TB capacity and ~15TB of heavily dedupable VM data (~3.5x). The server and pool were specced with full knowledge and anticipation of the heavy workload and RAM requirements of DDT (large fast RAM, fast CPU, cache on NVMe, fast mirror pool, sysctls reserving ~80GB RAM for metadata, etc). Unfortunately this still wasn't enough, and the system still grinds to an unusable halt, choked by 4k reads. The culprit was found to be spacemaps and DDT loading into ARC. Partial mitigation has helped: migrating the data to smaller metaslabs and larger spacemap block sizes (~1000 x 8GB slabs per vdev + 64k spacemap blocks), as described in Matt Ahrens' talk at BSDCan 2016 (p.20 - 28 and especially p.27-28). This exposed DDT loading (4K reads) as the main remaining problem. Unfortunately, sysctl mitigation isn't yet available for DDT load/unloads, and that's the essence of this feature request.
With spacemaps mitigated, the disks initially read in a huge amount of 4k IO with very few writes as ARC warms up (and also reloads any evicted data en masse later) - all of this is DDT and takes tens of minutes before writing proper commences. The resulting initial file R/W can take over 2 hours for 100GB, and most of that time is spent reading DDTs into ARC. If the test writes are across a 10G LAN (to prevent caching the source), then with DDT "warm", the 2nd test run is immensely faster, with almost no 4k read.
It would therefore make sense to mitigate/optimise DDT similarly to spacemaps - perhaps setting a larger block size, or using debug tunables to force preload/restrict unload. But unlike spacemap data, these controls over DDT just don't exist. So there's no way to preload DDT or prevent eviction, unlike spacemaps, and no way to modify DDT block sizes either, which might provide a way to reduce IO count.
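To give a rough sense of why block size matters for IO count, here is a minimal back-of-envelope sketch. The 40 GiB table size is a hypothetical figure chosen only for illustration (not a measurement from this pool), and the calculation assumes one read per block with no prefetch or aggregation.

```python
KIB = 1024
GIB = 1024 ** 3


def read_ops(table_bytes, block_bytes):
    """Number of block reads needed to load table_bytes, rounding up."""
    return -(-table_bytes // block_bytes)


# Hypothetical DDT size, for illustration only.
ddt_bytes = 40 * GIB

for block in (4 * KIB, 16 * KIB, 64 * KIB):
    ops = read_ops(ddt_bytes, block)
    print(f"{block // KIB:>2}K blocks: {ops:,} reads")
```

Under these assumptions, going from 4K to 64K blocks cuts the number of reads by a factor of 16. The trade-off is that each random lookup into a cold table then pulls in a larger block, so whether it helps overall would need testing, which is exactly what a tunable would allow.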
It might also be beneficial to have a clearer sight of "metadata" in ARC, broken down into its main components (spacemap, ddt, "other"), to expose which of these is under pressure or hogging RAM/IO resources, and perhaps even a couple of tunables to prioritise ARC/L2ARC metadata retention between spacemap/DDT/"other". Currently one can only stipulate a reservation for total metadata and not a reservation/prioritisation between types of metadata, and cannot specify preferred retention priority when ARC eviction is required.
Dedup is a beneficial technology, very relevant to some use-cases, and valuable enough to have prompted a longer term project for its improvement (Matt Ahrens, "Dedup doesn’t have to suck!"). Although rehabilitating dedup is a substantial project that has already begun, there are unhelpful omissions in the dedup-related sysctls: areas where better control and better stats could enable immediate improvements of the kind that already exist for spacemap handling, but not for DDT handling. I would like to request these additional sysctls, to parallel the existing spacemap sysctls:
METADATA CONTROL SYSCTLS
SYSCTL | DESCRIPTION | BASED ON
vfs.zfs.dedup.debug_load | Load all ddts when pool is first opened | metaslab.debug_load
vfs.zfs.dedup.debug_unload | Prevent ddts from being unloaded | metaslab.debug_unload
vfs.zfs.standard_dedup_blksz | DDT block size | standard_sm_blksz
vfs.zfs.dedup.unload_min_time | Minimum time (sec) that unused DDT block will not be evicted if avoidable | metaslab.unload_delay [time, not txg count]
vfs.zfs.arc_min_sm | Spacemap metadata will not be evicted if spacemap metadata in ARC below this level | new
vfs.zfs.arc_min_dedup | DDT metadata will not be evicted if DDT metadata in ARC below this level | new
It would be extremely helpful if these could be considered for addition, since then DDT tuning and at least some substantial optimisation becomes possible to explore, and hard stats related to different major types of metadata implicated in severe slowdown can be better distinguished. Since the code exists for spacemaps and will affect only block allocation + stat collection, it might be a not-huge job, with significant benefits for these users and use-cases.
Also for stats usage, perhaps these or some subset:
METADATA STATS EXPOSURE SYSCTLS
ARC HIT+MISS STATS:

Specific stats for ARC spacemap metadata:
kstat.zfs.misc.arcstats.anon_evictable_sm_metadata
kstat.zfs.misc.arcstats.arc_sm_meta_max
kstat.zfs.misc.arcstats.arc_sm_meta_min
kstat.zfs.misc.arcstats.arc_sm_meta_used
kstat.zfs.misc.arcstats.demand_sm_metadata_hits
kstat.zfs.misc.arcstats.demand_sm_metadata_misses
kstat.zfs.misc.arcstats.metadata_size
kstat.zfs.misc.arcstats.prefetch_sm_metadata_hits
kstat.zfs.misc.arcstats.prefetch_sm_metadata_misses

Specific stats for ARC dedup metadata:
kstat.zfs.misc.arcstats.anon_evictable_ddt_metadata
kstat.zfs.misc.arcstats.arc_ddt_meta_max
kstat.zfs.misc.arcstats.arc_ddt_meta_min
kstat.zfs.misc.arcstats.arc_ddt_meta_used
kstat.zfs.misc.arcstats.demand_ddt_metadata_hits
kstat.zfs.misc.arcstats.demand_ddt_metadata_misses
kstat.zfs.misc.arcstats.metadata_size
kstat.zfs.misc.arcstats.prefetch_ddt_metadata_hits
kstat.zfs.misc.arcstats.prefetch_ddt_metadata_misses

MFU+MRU STATS:

vfs.zfs.mfu_ghost_dedup_esize - size of DDT metadata in mfu ghost state
vfs.zfs.mfu_dedup_esize - size of DDT metadata in mfu state
vfs.zfs.mfu_ghost_sm_esize - size of spacemap metadata in mfu ghost state
vfs.zfs.mfu_sm_esize - size of spacemap metadata in mfu state
vfs.zfs.mru_ghost_dedup_esize - size of DDT metadata in mru ghost state
vfs.zfs.mru_dedup_esize - size of DDT metadata in mru state
vfs.zfs.mru_ghost_sm_esize - size of spacemap metadata in mru ghost state
vfs.zfs.mru_sm_esize - size of spacemap metadata in mru state

Specific stats for MFU/MRU dedup metadata:
kstat.zfs.misc.arcstats.mfu_evictable_ddt_data
kstat.zfs.misc.arcstats.mfu_evictable_ddt_metadata
kstat.zfs.misc.arcstats.mfu_ghost_evictable_ddt_data
kstat.zfs.misc.arcstats.mfu_ghost_evictable_ddt_metadata
kstat.zfs.misc.arcstats.mru_evictable_ddt_metadata
kstat.zfs.misc.arcstats.mru_ghost_evictable_ddt_data
kstat.zfs.misc.arcstats.mru_ghost_evictable_ddt_metadata

Specific stats for MFU/MRU spacemap metadata:
kstat.zfs.misc.arcstats.mfu_evictable_sm_data
kstat.zfs.misc.arcstats.mfu_evictable_sm_metadata
kstat.zfs.misc.arcstats.mfu_ghost_evictable_sm_data
kstat.zfs.misc.arcstats.mfu_ghost_evictable_sm_metadata
kstat.zfs.misc.arcstats.mru_evictable_sm_metadata
kstat.zfs.misc.arcstats.mru_ghost_evictable_sm_data
kstat.zfs.misc.arcstats.mru_ghost_evictable_sm_metadata
These probably aren't hard to add, and they would allow clearer insight into the major components of metadata IO pressure when that is the problem. Right now the only insight exposed is for metadata in total, which isn't ideal considering how significantly spacemaps and (if enabled) DDT can impact a system.
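As an illustration of how such a breakdown might be used, here is a minimal sketch that computes demand hit ratios per metadata category from sysctl-style "name: value" text. The demand_sm_* and demand_ddt_* counters are the ones proposed in this request and do not exist today; only the total demand_metadata_* counters are assumed to be present already. The sample numbers are made up.

```python
def parse_sysctl(text):
    """Parse 'name: value' lines (sysctl-style output) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def demand_hit_ratio(stats, kind=""):
    """Demand hit ratio for one metadata category.

    kind is "" for total metadata (existing counters), or "sm_" / "ddt_"
    for the per-category counters proposed in this request (hypothetical).
    Returns None if there were no accesses or the counters are absent.
    """
    base = "kstat.zfs.misc.arcstats.demand_" + kind + "metadata_"
    hits = stats.get(base + "hits", 0)
    misses = stats.get(base + "misses", 0)
    total = hits + misses
    return hits / total if total else None


# Made-up numbers, purely for illustration.
SAMPLE = """\
kstat.zfs.misc.arcstats.demand_metadata_hits: 9000
kstat.zfs.misc.arcstats.demand_metadata_misses: 1000
kstat.zfs.misc.arcstats.demand_sm_metadata_hits: 800
kstat.zfs.misc.arcstats.demand_sm_metadata_misses: 200
kstat.zfs.misc.arcstats.demand_ddt_metadata_hits: 250
kstat.zfs.misc.arcstats.demand_ddt_metadata_misses: 750
"""

stats = parse_sysctl(SAMPLE)
for label, kind in (("total", ""), ("spacemap", "sm_"), ("ddt", "ddt_")):
    print(label, demand_hit_ratio(stats, kind))
```

With the invented figures above, the total metadata hit ratio looks healthy while the DDT ratio is poor, which is the kind of imbalance that the current total-only stats would hide.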
Updated by Stilez y about 2 years ago
Apologies if this isn't the correct place. I'm running ZFS on FreeBSD, but the OpenZFS website doesn't seem to link to an issue/feature tracker, and all I can find says that FreeBSD uses OpenZFS, which in turn uses Illumos as the upstream source it's synced to. As I can't find an OpenZFS tracker, I've reported it here, on the basis that it's probably of use to all. If it's okay, I'll gladly provide such info as I can. Basically I've been tracking down a severe performance issue related to write throttle failure on fast hardware for a long time, and I'm finally getting to a point where the issues are clear, reproducible and evidenced. The throttle isn't appropriately invoked during some IO, causing session death. Spacemap sysctls mitigate this in part, suggesting that equivalent DDT sysctls would help a lot, but they don't exist yet. Hence this request. If I'm wrong about where to report it, can you direct me to the appropriate OpenZFS tracker instead? And thank you, either way.
Updated by Stilez y about 2 years ago
I highly doubt it's FreeBSD specific. The issue with the 4k block size of space maps is well known (all platforms), and the block size can be altered by the sysadmin. Dedup table blocks, for pools where dedup is a sane option, are almost certainly affected identically (tens of GB of 4k blocks that have to be loaded before any data can be written).
While DDT is hard to sort out, exposing/adding these few specific controls for DDT block size etc, and some ARC stats to break down metadata held into major components, is valuable in itself.
I'll report this as suggested on the FreeBSD tracker as well, but these few additional ZFS controls/options are surely of value to expose, for all dedup-users, on all platforms, including this and any related ones.