Bug #4881 (Closed)
zfs send performance degradation when embedded block pointers are encountered
Start date: 2014-05-23
% Done: 100%
Difficulty: Medium
Tags: needs-triage
Description
Matt Ahrens:
As George noted below, the problem is that the traverse_prefetch_thread is able to get way ahead of the dmu_send() thread (we saw it get 2 million blocks ahead). When this happens, the data that was prefetched can be evicted from the cache (ARC) before the dmu_send() thread accesses it. This decreases performance because we lose the benefit of prefetching (i.e. having multiple i/os outstanding concurrently) and we end up reading every block from disk twice (once by the prefetch and once when we actually need it).

The prefetch thread is supposed to be at most zfs_pd_blks_max (default 100) blocks ahead of the dmu_send() thread. This relies on the prefetch and dmu_send() threads coordinating how many blocks have been "produced (prefetched) but not yet consumed (accessed)" in prefetch_data_t:pd_blks_fetched.

When the prefetch thread visits embedded block pointers (blocks whose data compresses down to <100 bytes and thus fits inside the block pointer itself), it does not increment pd_blks_fetched (i.e. it does not "produce" a prefetched block), because no i/o is needed to "read" the block. (See traverse_prefetcher().) However, when the zfs_send() thread visits an embedded block pointer, it decrements pd_blks_fetched, thinking that it has "consumed" a prefetched block. (See traverse_visitbp().) These extra decrements cause the zfs_send() thread to fall one block further behind for every embedded block that is visited (waiting if necessary so that the count doesn't go negative).

The fix is to move the "do we count this block as a prefetch?" logic into a common function, so that the logic is exactly the same in the producer (traverse_prefetcher()) and the consumer (traverse_visitbp()).

When we're in this bad state, we will see zio_read() being called by the dmu_send() thread. Normally this should not happen, because everything it needs will have already been prefetched.
(Note that if the disks are slower than the network, the dmu_send() thread can still need to wait for i/o to complete, but it will do so by waiting for the already-issued prefetch i/o, via an interlock in the ARC.) Therefore we can detect that we've gotten into this situation by seeing whether zio_read() is called from dmu_send():

dpp -n 'zio_read:entry/callers["dmu_send"]/{@=count()}' -c 'zfs send ...'

We can also observe how far the prefetch thread is ahead of the main thread (the results only make sense when doing a non-incremental send of a large, non-sparse file). The main and prefetch threads' blkids should not be further apart than the number of blocks "fetched":

dpp -q -n 'traverse_visitbp:entry
/args[3]->zb_blkid%100==0/
{
	if (args[0]->td_pfd) {
		printf("main thread: blkid=%u fetched=%u\n",
		    args[3]->zb_blkid, args[0]->td_pfd->pd_blks_fetched);
	} else {
		printf("prefetch thread: blkid=%u\n", args[3]->zb_blkid);
	}
}'

Note that the bug will be triggered (i.e. the prefetch thread will get further ahead than designed) whenever sending a dataset with embedded block pointers, i.e. blocks that compress very well (controlled by the feature@embedded_data pool property). However, performance will still be fine unless there are more of these embedded block pointers than there is memory available to the ARC. Thus the problem would not be noticed in small-scale tests.
Related issues
Updated by Electric Monk over 9 years ago
- Status changed from In Progress to Closed
git commit 06315b795c0d54f0228e0b8af497a28752dd92da
commit 06315b795c0d54f0228e0b8af497a28752dd92da
Author: Matthew Ahrens <mahrens@delphix.com>
Date:   2014-06-05T21:34:21.000Z

    4881 zfs send performance degradation when embedded block pointers are encountered
    Reviewed by: George Wilson <george.wilson@delphix.com>
    Reviewed by: Christopher Siden <christopher.siden@delphix.com>
    Approved by: Dan McDonald <danmcd@omniti.com>