ZFS sorted scans

Review Request #1226 - Created Oct. 8, 2018 and updated

Toomas Soome
This originated from ZFS On Linux, as https://github.com/zfsonlinux/zfs/commit/d4a72f23863382bdf6d0ae33196f5b5decbc48fd


During scans (scrubs or resilvers), it sorts the blocks in each transaction group by block offset; the result can be a significant improvement. (On my test system just now, which I put some effort to introduce fragmentation into the pool since I set it up yesterday, a scrub went from 1h2m to 33.5m with the changes.) I've seen similar rations on production systems.

FreeNAS has had these changes since Oct 2017.

Scrub & rebuild pools. Note times for performance analysis.

The pools are compatible with systems without the changes, so bouncing back and forth between two versions is possible, and I've used that for correctness-checking.

zpool scrub
zpool offline rpool disk and then later zpool online rpool disk
zpool replace

In my test systems the speedup on phydical disks is very noticeable, 4x 4TB raidz1 (7200RPM WD Black SATA), the scrub/resilver is down to 2 hours from 4 hours (3.5TB allocated data). The virtual disks under vmware fusion (on top of SSD) are not that drastic, but still significant and overall impression is consistent with notes from ZoL and FreeBSD.

Matthew Ahrens
Toomas Soome
Review request changed

Change Summary:

add following patches (as suggested):
a76f3d0437e5e974f0f748f8735af3539443b388: Fix deadlock in IO pipeline
c26cf0966d131b722c32f8ccecfe5791a789d975: Fix zio->io_priority failed (7 < 6) assert




Revision 2 (+3215 -860)

Show changes

Jerry Jelinek

I know you didn't write this code so I only concentrated on looking for possible bugs or problems. Overall this looks good. I assume this was already reviewed by people who are more knowledgeable about zfs than me. Here are the issues I noticed.

348-349 This comment seems wrong now
5069 At first I thought this should still be wrapped in VERIFY0, but after
looking at the next change, I think we're missing the assignment of
the return to 'rc'.

2445 Why was this block added? It looks redundant with the existing block
at 2406.

99 s/limitted/limited/
137 and 160 either change to boolean or use 0
159 this tunable seems worthy of an explanatory comment
384 I'm not sure why the fill_weight code is duplicated here from scan_init.
1757 Unless I'm misreading the code, it appears that we'll leak bp_toread if
dsl_scan_recurse returns an error.
3857 I don't see how the rest of the code in this function will block the
caller. Is something missing here?

210 This function seems kind of problematic. If rtop_remove is set but
rtop_add is NULL, we lose the entry. It seems like it would be better
to do the checks up front before we remove/adjust/add, although based
on how this is used, maybe it should just assert the ops? I don't see
any of these with only one op and not the other. I'm not sure why the
extra checks were added for the individual ops throughout the code.

136 Trailing whilespace. cstyle should have caught this.