zfs prefetch code needs work
The existing ZFS prefetch code (dmu_zfetch.c) has some problems:
1. It's nearly impossible to understand. For example, there is an abundance of kstats, but it's hard to know what they mean (see below).
2. For some workloads, it detects patterns that aren't really there (e.g. strided patterns, backwards scans), and issues needless prefetch i/os for blocks that will never be referenced.
3. It has lock contention issues. These are caused primarily by dmu_zfetch_colinear() calling dmu_zfetch_dofetch() (which can block waiting for i/o) with the zf_rwlock held for writer, thus blocking all other threads accessing this file.
I suggest that we rewrite this code to detect only forward, sequential streams.
Here is my understanding of what the existing prefetch kstats mean:
bogus_streams: should always be zero
stride_hits/misses: count of block accesses that are part of strided streams. stride_hits counts accesses of blocks that are already cached (perhaps due to a prefetch). stride_misses counts accesses of blocks that are not cached (and will require reads from disk).
streams_resets: number of times that we accessed a block that is part of a prefetch stream, and thus should have already been prefetched and thus cached, but it was actually not cached (e.g. because it was evicted by the time we got around to accessing it). In this case, we throw away ("reset") the prefetch stream.
streams_noresets: number of times that we accessed a block that is part of a non-strided prefetch stream, and the block was already cached (perhaps due to a prefetch).
hits = streams_noresets + stride_hits. The number of times we accessed a block that is part of a prefetch stream, and the block was already cached (perhaps due to a prefetch).
misses: opposite of hits. The number of times we accessed a block that is not part of a prefetch stream, or the block was not cached.
colinear_hits: The number of times we accessed a block that was a "miss", but the access forms a group of 3 equidistant accesses, which we detect as a strided stream.
colinear_misses: opposite of colinear_hits. Prefetch couldn't detect anything about this access.
colinear_hits + colinear_misses = misses
reclaim_successes: Upon an access that prefetch couldn't detect anything about (colinear_misses), we try to create a new stream to represent this access. However, we can't create a new stream if we already have the maximum number of streams (8) and they were all accessed recently. This count represents successes.
reclaim_failures: opposite of reclaim_successes; we couldn't create a new stream.
reclaim_successes + reclaim_failures = colinear_misses
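The three identities above can be checked mechanically against a snapshot of the zfetch kstats. This is a minimal sketch in Python, assuming the counters have already been read into a dict keyed by the names listed above. The function name check_zfetch_identities is hypothetical and the sample values are made up for illustration.

```python
def check_zfetch_identities(ks):
    """Return the identities (as strings) that do not hold for snapshot ks."""
    checks = {
        "hits = streams_noresets + stride_hits":
            ks["hits"] == ks["streams_noresets"] + ks["stride_hits"],
        "colinear_hits + colinear_misses = misses":
            ks["colinear_hits"] + ks["colinear_misses"] == ks["misses"],
        "reclaim_successes + reclaim_failures = colinear_misses":
            ks["reclaim_successes"] + ks["reclaim_failures"]
            == ks["colinear_misses"],
    }
    return [name for name, ok in checks.items() if not ok]


# Made-up sample values, chosen so that all three identities hold.
sample = {
    "hits": 900, "streams_noresets": 850, "stride_hits": 50,
    "misses": 100, "colinear_hits": 10, "colinear_misses": 90,
    "reclaim_successes": 70, "reclaim_failures": 20,
}
print(check_zfetch_identities(sample))  # an empty list means all identities hold
```

If my reading of the kstats is right, a non-empty result on a quiescent system would point at a counter I have misunderstood.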
Additionally, the existing prefetch and ARC code does not track basic stats that we would need to know if prefetching is effective.
We will want to know:
1. how many prefetch i/os were issued
2. how many times was a demand read satisfied by a prefetched i/o (counting each prefetch i/o at most once)
3. how many times did a demand read have to wait for a prefetch i/o to complete (priority inversion problem)
If the difference between #1 and #2 increases, it indicates that we did useless prefetches (prefetched data that was not used). #2/#1 is roughly "prefetch efficiency" - it should always be high, otherwise prefetch could actually be hurting you.
(#2 / total demand reads) is the fraction of all demand reads that were satisfied by prefetch; it is roughly "prefetch efficacy". It could be high or low depending on the workload, but if it's high, it indicates that substantial value is being derived from prefetching.
arcstats.prefetch_*_misses is #1 - how many prefetch i/os were issued
in the new code, arcstats.demand_hit_prefetch (as well as zfetchstats.hits) is #2 - a demand read on a prefetched block
in the new code, arcstats.demand_wait_for_prefetch is #3 - a demand read that has to wait for a prefetch to complete
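The two ratios can be computed from these counters as follows. This is a rough sketch in Python, not part of the proposed change. It assumes that prefetch_*_misses expands to prefetch_data_misses plus prefetch_metadata_misses, and that the total number of demand reads is obtained separately. The function name prefetch_ratios is hypothetical and all values are made up.

```python
def prefetch_ratios(prefetch_issued, demand_hit_prefetch, total_demand_reads):
    """Return (efficiency, efficacy); a ratio is None if its denominator is 0."""
    efficiency = (demand_hit_prefetch / prefetch_issued
                  if prefetch_issued else None)
    efficacy = (demand_hit_prefetch / total_demand_reads
                if total_demand_reads else None)
    return efficiency, efficacy


# Made-up counter values for illustration only.
arcstats = {
    "prefetch_data_misses": 4000,
    "prefetch_metadata_misses": 1000,
    "demand_hit_prefetch": 4500,
}
total_demand_reads = 18000  # assumption: gathered separately from the demand counters

# #1: prefetch i/os issued (assumed expansion of prefetch_*_misses).
issued = (arcstats["prefetch_data_misses"]
          + arcstats["prefetch_metadata_misses"])
efficiency, efficacy = prefetch_ratios(
    issued, arcstats["demand_hit_prefetch"], total_demand_reads)
print(efficiency, efficacy)
```

With these made-up numbers, 4500 of 5000 prefetch i/os were later used by a demand read, and a quarter of all demand reads were satisfied by prefetch.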
There is an additional problem with multiblock reads (described in the context of recordsize=8k):
If we see a multiblock read (e.g. for 64KB - 8x 8KB blocks), it is not necessarily correlated with subsequent reads of adjacent file offsets. However, the prefetch code interprets this read of 8 sequential blocks as the start of a pattern, and issues prefetch reads for an additional 135 blocks (1MB). The common case is that we do not see a subsequent read request for the adjacent 1MB, so these prefetch reads are useless. We should not consider prefetching until we have received multiple adjacent read requests from the application.
On a customer system, I observed that we were issuing many prefetch i/os that were not ultimately used by demand reads. I also saw that we were getting many large read requests – 100KB up to ~2MB. A single large request results in several sequential single-block reads, which the prefetch code interprets as the start of a pattern. Because the prefetch code is so aggressive, a read of as little as 8 blocks (64KB) will result in prefetching an additional 135 blocks (1MB). I was able to reduce the number of unused prefetch i/os by reducing these parameters (zfetch_block_cap and zfetch_array_rd_sz). However, this serves only to reduce the damage caused by prefetch.
This problem may be exacerbated by the fact that we consider prefetching for all multiblock reads, despite zfetch_array_rd_sz (default 1MB), which is intended to disable prefetch for reads larger than 1MB. Because zfs_read_chunk_size also defaults to 1MB, the ZPL splits all reads into chunks of at most 1MB. zfetch therefore never sees a read larger than 1MB, and thus always considers prefetching.
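The interaction between the two 1MB defaults can be illustrated with a small model. This is only a sketch in Python of the behaviour described above, not the actual ZPL or zfetch code. The helper names split_into_chunks and skips_prefetch are hypothetical.

```python
MB = 1024 * 1024


def split_into_chunks(nbytes, chunk_size):
    """Split a read of nbytes into consecutive pieces of at most chunk_size."""
    sizes = []
    while nbytes > 0:
        n = min(nbytes, chunk_size)
        sizes.append(n)
        nbytes -= n
    return sizes


def skips_prefetch(read_size, array_rd_sz):
    """Model of the intended check: reads larger than array_rd_sz skip prefetch."""
    return read_size > array_rd_sz


zfs_read_chunk_size = 1 * MB  # default, per the description above
zfetch_array_rd_sz = 1 * MB   # default, per the description above

# A read of 2MB + 100KB arrives at zfetch as three separate chunks.
chunks = split_into_chunks(2 * MB + 100 * 1024, zfs_read_chunk_size)
print(chunks)
print([skips_prefetch(c, zfetch_array_rd_sz) for c in chunks])
```

No chunk is ever larger than 1MB, so the check never fires for any chunk, even though the original request was well over the threshold.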
The prefetch code should be modified to consider a multiblock read as a single read for prefetching purposes, and wait until an adjacent (perhaps also multiblock) read before issuing prefetches.
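The proposed behaviour can be modelled roughly as follows. This is a hypothetical sketch in Python of the idea only, not the rewritten dmu_zfetch.c. It tracks a single forward stream, treats each request (however many blocks) as one read, and prefetches only once a request starts exactly where the previous one ended. The class name and the prefetch_blocks parameter are illustrative, not ZFS names or tunables.

```python
class ForwardStreamDetector:
    """Hypothetical model: prefetch only after a second, adjacent read request."""

    def __init__(self, prefetch_blocks=8):
        self.next_blkid = None  # block id expected to follow the last request
        self.prefetch_blocks = prefetch_blocks  # illustrative prefetch distance

    def on_read(self, blkid, nblocks):
        """Record one read request; return the block ids to prefetch."""
        end = blkid + nblocks
        adjacent = self.next_blkid is not None and blkid == self.next_blkid
        self.next_blkid = end
        if not adjacent:
            return []  # first request, or a jump: no pattern detected yet
        return list(range(end, end + self.prefetch_blocks))


detector = ForwardStreamDetector(prefetch_blocks=4)
first = detector.on_read(0, 8)     # one 8-block read: not yet a pattern
second = detector.on_read(8, 8)    # adjacent request: start prefetching
third = detector.on_read(100, 8)   # jump elsewhere: stream is dropped
print(first, second, third)
```

In this model a lone 64KB read (8 blocks) triggers no prefetch at all. A real implementation would also need multiple streams per file and would have to avoid re-issuing prefetches for blocks already requested, both of which this sketch omits.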
Updated by Electric Monk about 4 years ago
- % Done changed from 0 to 100
- Status changed from New to Closed
commit cf6106c8a0d6598b045811f9650d66e07eb332af
Author: Matthew Ahrens <email@example.com>
Date: 2015-09-09T21:50:40.000Z

5987 zfs prefetch code needs work
Reviewed by: Adam Leventhal <firstname.lastname@example.org>
Reviewed by: George Wilson <email@example.com>
Reviewed by: Paul Dagnelie <firstname.lastname@example.org>
Approved by: Gordon Ross <email@example.com>