Bug #14658
openzfs: dsl_scan can block txg for arbitrary length of time
Description
See https://github.com/openzfs/zfs/pull/9300
Currently, if you have a large pool with many datasets and a resilver starts against a device that already holds data, the dsl_scan code can spend a very long time walking top-level datasets, only to eliminate them because they haven't changed. This walk is not interruptible, so dsl_scan can block any txgs or reads from completing for an arbitrary length of time.
We've observed this on large OmniOS fileservers, where dsl_scan can take a day or more to complete and all I/O to the pool is blocked for the entire time. There is no safe way to suspend or stop a scan that is in this state.
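To illustrate the general shape of the problem, here is a minimal, self-contained sketch of a dataset walk that checks a suspend condition on every iteration and resumes from a saved cursor. It is not taken from the linked patch or from the ZFS source: `dataset_t`, `scan_budget_t`, `should_suspend` and `scan_pass` are hypothetical names, and the "budget" counts visited datasets rather than measuring time spent in a txg.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dataset; not a real ZFS structure. */
typedef struct {
	unsigned long last_changed_txg;
} dataset_t;

/* Hypothetical work budget for a single pass (assumption: counted in datasets). */
typedef struct {
	size_t visited;
	size_t budget;
} scan_budget_t;

/* True once the current pass has used up its budget. */
static bool
should_suspend(const scan_budget_t *b)
{
	return (b->visited >= b->budget);
}

/*
 * Walk datasets starting at *cursor. Datasets changed at or after min_txg
 * are counted in *to_scan; unchanged ones are skipped. Returns true when
 * the walk is finished, or false if it suspended early. In the latter
 * case *cursor records where the next pass should resume.
 */
static bool
scan_pass(const dataset_t *ds, size_t n, size_t *cursor,
    unsigned long min_txg, size_t budget, size_t *to_scan)
{
	scan_budget_t b = { 0, budget };

	/* A zero budget could never make progress. */
	assert(budget > 0);

	while (*cursor < n) {
		/* Checking inside the loop is what keeps the walk interruptible. */
		if (should_suspend(&b))
			return (false);
		if (ds[*cursor].last_changed_txg >= min_txg)
			(*to_scan)++;
		(*cursor)++;
		b.visited++;
	}
	return (true);
}
```

A caller would invoke `scan_pass` once per sync pass and let other work proceed whenever it returns false. Without the check inside the loop, a single call has to visit every dataset before returning, which corresponds to the stall described above.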
Updated by Alex Wilson 2 months ago
The patch from ZoL is fairly short and applies cleanly (https://github.com/openzfs/zfs/commit/5815f7ac30e108fcbf4c6487328c28d818e9e014). We've run it on OmniOS r151038 for a few months now. With the patch applied, I/O no longer stalls for an entire day during one of these resilvers.
We also have a hot-patch version that we have used; it was built from the ZoL patch, following the old Joyent sdevhp template.