Bug #14658

open

zfs: dsl_scan can block txg for arbitrary length of time

Added by Alex Wilson 2 months ago. Updated 2 months ago.

Status: New
Priority: Normal
Assignee:
Category: zfs - Zettabyte File System
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

See https://github.com/openzfs/zfs/pull/9300

Currently, on a large pool with many datasets, when a resilver starts against a device that already holds data, the dsl_scan code can spend a very long time walking top-level datasets and skipping each one because it has not changed. This walk is not interruptible, so dsl_scan can block txgs and reads from completing for an arbitrary length of time.

We've observed this on large OmniOS fileservers, where the dsl_scan walk can take a day or more to complete and all I/O to the pool is blocked for the entire time. There is no safe way to suspend or stop a scan that is in this state.
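
To illustrate the general shape of the problem, here is a minimal sketch of a dataset walk that can be suspended and resumed. It is not the linked patch and does not use real ZFS structures: `sketch_dataset_t`, `sketch_scan_t`, `sketch_scan_pass` and the per-pass `budget` are all hypothetical names made up for this example. The point it shows is that datasets skipped as unchanged still have to count against the budget, otherwise a long run of unchanged datasets is walked in one uninterruptible pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a dataset; not a real ZFS structure. */
typedef struct sketch_dataset {
	uint64_t last_changed_txg;	/* txg in which it last changed */
} sketch_dataset_t;

/* Hypothetical scan state; the cursor lets a later pass resume. */
typedef struct sketch_scan {
	size_t cursor;		/* index of the next dataset to examine */
	size_t budget;		/* max datasets examined per pass (> 0) */
	uint64_t min_txg;	/* datasets older than this are skipped */
	size_t visited;		/* datasets that actually needed a scan */
} sketch_scan_t;

/*
 * Walk datasets starting at scan->cursor.  Returns true once every
 * dataset has been examined, or false if the budget ran out and the
 * caller should let other work proceed before calling again.
 */
bool
sketch_scan_pass(sketch_scan_t *scan, const sketch_dataset_t *ds, size_t nds)
{
	size_t examined = 0;

	/* A zero budget would never make progress. */
	assert(scan->budget > 0);

	while (scan->cursor < nds) {
		/* Skipped datasets count too, so this check always fires. */
		if (examined >= scan->budget)
			return (false);

		if (ds[scan->cursor].last_changed_txg >= scan->min_txg)
			scan->visited++;	/* changed: would be scanned */

		scan->cursor++;
		examined++;
	}
	return (true);
}
```

A caller would invoke `sketch_scan_pass` repeatedly until it returns true, doing other work between passes. A real implementation would more likely bound each pass by elapsed time than by a fixed count; the count is used here only to keep the sketch deterministic.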

Actions #1

Updated by Alex Wilson 2 months ago

The patch from ZoL is fairly short and applies cleanly (https://github.com/openzfs/zfs/commit/5815f7ac30e108fcbf4c6487328c28d818e9e014). We've run it on OmniOS r151038 for a few months now. With the patch applied, I/O no longer stalls for an entire day during one of these resilvers.

We also have a hot-patch version that we have used. It was built from the ZoL patch, following the old Joyent sdevhp template.

Actions #2

Updated by Electric Monk 2 months ago

  • Gerrit CR set to 2127

Actions #3

Updated by Alex Wilson 2 months ago

  • Description updated (diff)