ZFS VDEV prefetch - reenable for scrubs, and use separate caches for metadata and other data
I have recently been reintroduced to the prefetching mechanisms in ZFS, and particularly to VDEV prefetching and its turbulent history.
As far as I've learned, it is a mechanism designed to optimise media use during reads by pre-reading several KB of data after each requested small read, in the hope that some of this extra data will soon be requested. Such pre-reads are relatively cheap in terms of latency, compared to the seek times spinning disks need for explicit reads of every small block. By the original (and current?) design, a separate cache is maintained for each leaf VDEV (hard disk), 10MB by default.
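For reference, this behaviour is governed by a few kernel tunables (names taken from the illumos ZFS code; defaults shown here are the historical pre-snv_148 values and may differ between releases, so verify against your own system before relying on them). A minimal /etc/system fragment to restore those defaults might look like:

```
* /etc/system fragment: re-enable the per-vdev prefetch cache.
* Historical defaults; a starting point for testing, not a recommendation.
* 10 MB cache per leaf vdev:
set zfs:zfs_vdev_cache_size = 10485760
* Inflate reads of 16 KB or less...
set zfs:zfs_vdev_cache_max = 16384
* ...into 64 KB (1 << 16) aligned reads:
set zfs:zfs_vdev_cache_bshift = 16
```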
By snv_70 it was found that the cache rolled over too quickly to be efficient (entries expired before they were ever read), so it was modified to retain only metadata, as detected while processing the prefetched blocks.
By snv_148 the mechanism was deemed useless and harmful, based on a system with 768 vdevs where the default settings consumed 6.5GB of 8GB RAM: "Although the vdev cache can be limited or disabled by tuning the zfs:zfs_vdev_cache_max parameter, the default setting appears to allow the vdev cache to exhaust all available memory on systems with enough vdevs". The default VDEV prefetch cache size was therefore set to zero, effectively disabling it. In fact, the code is slated for removal because it is not efficient for some users.
First of all, I object to removing the code for everybody just because it does not suit some use cases well. That case is already covered by zeroing the default cache size.
Secondly, for smaller systems, such as laptops or home NASes with 8-20 drives, the prefetch caching overhead is negligible, while the benefits can potentially be large (subject to each user's testing, tuning and desired balance). In fact, the feature could even be expanded somewhat for those who can benefit from it.
I don't think that a cache may be considered "useless" if it shows 30%-70% efficiency at an operation that is essentially free (prefetch) and has an expensive alternative (seek). Modern large storage systems have 48-192GB+ of RAM and can afford some 10GB for prefetching across a thousand drives. Arguably, this cache could also save some cycles on contended buses/HBAs ;)
In particular, this prefetch shone in my tests with scrubbing, which is not an infrequent operation, showing 99+% efficiency while prefetching metadata only:
# kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class        misc
zfs:0:vdev_cache_stats:crtime       60.67302806
zfs:0:vdev_cache_stats:delegations  23398
zfs:0:vdev_cache_stats:hits         1309308
zfs:0:vdev_cache_stats:misses       11592
zfs:0:vdev_cache_stats:snaptime     89207.679698161
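The efficiency figure above works out as hits / (hits + misses); a quick shell calculation with the counter values copied from the kstat output (on a live system they could be parsed out of `kstat -p zfs:0:vdev_cache_stats` instead of being hard-coded):

```shell
#!/bin/sh
# Vdev cache hit rate from the kstat counters shown above.
hits=1309308
misses=11592
rate=$(awk -v h="$hits" -v m="$misses" \
    'BEGIN { printf "%.1f", 100 * h / (h + m) }')
echo "vdev cache hit rate: ${rate}%"
```

For this sample it prints a hit rate of 99.1%.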
RFE1: For the future, when beginning a scrub, the system might auto-tune, or at least suggest enabling, the VDEV prefetch (perhaps with larger strokes)...
The current mechanism discards prefetched "userdata", even though it might be requested in the very next read after the metadata block is processed - perhaps as part of the file-prefetch mechanism.
RFE2: I propose that a second rolling cache (per drive or common) be maintained for already-prefetched sectors that would otherwise be discarded. This would potentially allow users to benefit from prefetching file/zvol data while avoiding the problem of the prefetched metadata (which is likely more frequently useful) rolling away. Tuning the two cache sizes separately would allow users to strike the right balance for THEIR needs.
 http://www.cuddletech.com/blog/pivot/entry.php?id=1040 - Understanding ZFS: Prefetch
Updated by Garrett D'Amore over 8 years ago
I'm happy to see the prefetch code remain provided there is a use case for it. In the situations I had seen, the prefetch was effectively useless... but if it helps your use case then it makes sense to retain it.
I'd like to research how general the effectiveness is for scrubbing... that is effectively your RFE 1.
A lot more investigation needs to be done to see if RFE 2 would indeed be useful.