Feature #2017

ZFS VDEV prefetch - reenable for scrubs, and use separate caches for metadata and other data

Added by Jim Klimov almost 9 years ago. Updated almost 9 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
zfs - Zettabyte File System
Start date:
2012-01-21
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:

Description

PREHISTORY

I have recently been reintroduced to the prefetching mechanisms in ZFS, and in particular to VDEV prefetching and its turbulent history.
As far as I've learned, it is a mechanism designed to optimise use of the media during reads by pre-reading several KB of data after each requested small read, in the hope that some of this extra data will soon be requested. Such pre-reads are relatively cheap in terms of latency compared to the seek times spinning disks need for explicit reads of every small block. By the original (and current?) design, a separate cache is maintained for each leaf VDEV (hard disk), 10MB by default.
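For reference, the behaviour described above maps onto three well-known illumos/Solaris tunables from vdev_cache.c (the values below are the historical defaults, mentioned for illustration rather than as a recommendation): reads smaller than zfs_vdev_cache_max (16KB) are inflated to 1 << zfs_vdev_cache_bshift bytes (64KB), and the results land in a rolling cache of zfs_vdev_cache_size bytes (the 10MB above) per leaf VDEV. The live values can be checked on a running system, e.g.:

# echo "zfs_vdev_cache_max/D" | mdb -k
# echo "zfs_vdev_cache_bshift/D" | mdb -k
# echo "zfs_vdev_cache_size/D" | mdb -k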
By snv_70 it was found that the cache rolled over too quickly to be efficient (entries expired before they were read), so the cache was modified to keep only metadata, as detected while processing the prefetched blocks [2],[3].
By snv_148 it was decided that the mechanism was useless and evil, based on the fact that with default settings it consumed 6.5GB of 8GB of RAM on a system with 768 vdevs, and that "Although the vdev cache can be limited or disabled by tuning the zfs:zfs_vdev_cache_max parameter, the default setting appears to allow the vdev cache to exhaust all available memory on systems with enough vdevs"; so the default VDEV prefetch cache size was set to zero, effectively disabling it [4],[5],[6]. In fact, the code is slated for removal because it is not efficient for some users [5].
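For scale, the arithmetic matches that report: with the historical 10MB-per-leaf-vdev default, 768 vdevs can pin up to roughly 768 x 10MB ≈ 7.5GB of kernel memory, which is consistent with the ~6.5GB observed on that 8GB machine.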

PROPOSAL

First of all, I object to removing the code for everybody just because it does not suit some cases well. That case is already covered by zeroing the default cache size.

Secondly, for smaller systems, such as laptops or home NASes with 8-20 drives, the prefetch caching overhead is negligible, while the potential benefit can be large (subject to each user's testing, tuning and desired balance). In fact, the feature could be expanded somewhat for those who can benefit from it.

I don't think a cache can be considered "useless" if it shows 30%-70% efficiency at an operation that is essentially free (prefetch) and has an expensive alternative (seek). Modern large storage systems have 48-192GB+ of RAM and can afford some 10GB for prefetching across a thousand drives. Arguably, this cache could also save some cycles on contended buses/HBAs ;)

In particular, this prefetch shone in my tests with scrubbing, which is not an infrequent operation, giving 99+% efficiency by prefetching metadata only:

# kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class    misc
zfs:0:vdev_cache_stats:crtime   60.67302806
zfs:0:vdev_cache_stats:delegations      23398
zfs:0:vdev_cache_stats:hits     1309308
zfs:0:vdev_cache_stats:misses   11592
zfs:0:vdev_cache_stats:snaptime 89207.679698161
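For the record, the efficiency quoted above follows directly from these counters: hits / (hits + misses) = 1309308 / (1309308 + 11592) = 1309308 / 1320900 ≈ 99.1%.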

RFE1: In the future, when beginning a scrub, the system might auto-tune the VDEV prefetch, or at least suggest enabling it (perhaps with larger prefetch reads)...
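Until something like that exists, a manual approximation of RFE1 is possible with the stock tunable; a rough sketch, assuming zfs_vdev_cache_size is still present in the running build and that the pool is named "tank" (example name only):

# echo "zfs_vdev_cache_size/W 0t10485760" | mdb -kw    (re-enable the 10MB per-vdev cache)
# zpool scrub tank
# kstat -p zfs:0:vdev_cache_stats                      (after the scrub: check the hit rate)
# echo "zfs_vdev_cache_size/W 0t0" | mdb -kw           (restore the snv_148+ default of zero)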

The current mechanism discards prefetched "userdata", even though it might be requested in the very next read after the metadata block is processed - perhaps as part of a file-prefetch mechanism.
RFE2: I propose that a second rolling cache (per drive or common) be maintained for already-prefetched sectors that would otherwise be discarded. This would potentially allow users to benefit from prefetched file/zvol data while avoiding the problem of the prefetched metadata (which is likely more frequently useful) being rolled out of the cache. Tuning the two cache sizes separately would allow users to strike the right balance for THEIR needs.
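To make RFE2 a bit more concrete, one possible shape would be a second, separately sized rolling cache next to the existing metadata one. In /etc/system terms it might look like the sketch below, where zfs_vdev_cache_data_size is a hypothetical name that does not exist today and is used purely for illustration:

* existing tunable: rolling prefetch cache (metadata) per leaf vdev
set zfs:zfs_vdev_cache_size = 10485760
* hypothetical tunable (illustration only): keep otherwise-discarded
* prefetched userdata in its own, separately sized rolling cache
set zfs:zfs_vdev_cache_data_size = 33554432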

Links:
[1] http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device-Level_Prefetching
[2] http://www.cuddletech.com/blog/pivot/entry.php?id=1040 - Understanding ZFS: Prefetch
[3] http://wesunsolve.net/bugid/id/6437054
[4] http://wesunsolve.net/bugid/id/6684116
[5] https://www.illumos.org/issues/175
[6] http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg48068.html

#1

Updated by Garrett D'Amore almost 9 years ago

I'm happy to see the prefetch code remain, provided there is a use case for it. In the situations I had seen, the prefetch was effectively useless... but if it helps your use case, then it makes sense to retain it.

I'd like to do some research to see how general the effectiveness is for scrubbing... that is effectively your RFE 1.

A lot more investigation needs to be done to see if RFE 2 would indeed be useful.
