multiple scans can lead to partial resilvering
zfs - Zettabyte File System
Analysis by George Wilson:
I've been looking at many zloop/ztest failures that involve checksum errors. The scenario typically involves a pool that is mirrored or spared, or whose devices have been replaced. It appears that during the resilver operation we are failing to copy over all the data from the original device. That device is later detached, leaving a partially resilvered vdev to service the pool. The end result is everything from suspended pools, to random assertions, failures to open the pool, and the list goes on.
When a device is attached (or replaced) the spa code will create a DTL that indicates that the newly attached device does not contain any contents between the specified txg range. The range must contain the currently open txg to ensure that dmu_sync-ed blocks are also resilvered. After setting the DTL it calls into the DSL to notify any running scan that it should restart and passes the txg for when the restart should occur. The issue is that the DSL is told to restart the scan when the open context txg is synced, which creates a race when multiple scans are running.

Imagine this scenario. A pool configuration that looks like this:

    ADDR     STATE   AUX DESCRIPTION
    080b9800 HEALTHY -   root
    08c27a00 HEALTHY -     mirror
    080ba640 HEALTHY -       /rpool/tmp/ztest.0b
    08c29b40 HEALTHY -       /rpool/tmp/ztest.0a
    0a1087c0 HEALTHY -     mirror
    080b9cc0 HEALTHY -       /rpool/tmp/ztest.1a
    0a109140 HEALTHY -       /rpool/tmp/ztest.1b

In this case we have 2 devices, ztest.0a and ztest.1b, that are both in need of resilvering. The ztest.0a device was attached first at txg 850 and kicks off a resilver scan. At txg 857 ztest.1b is attached and the resilver is scheduled to restart at txg 860, but before this happens the first scan completes at txg 858. This is when things go horribly wrong.

As a result of the scan completing at txg 858 we look at all the vdevs and their DTLs. Any DTL that indicates that it is missing txgs in the range of 3 to 358 will now be cleared since we finished the scan. This is fine for device ztest.0a, but for device ztest.1b, which was newly attached, it means that most of the blocks have not been resilvered. Later, when txg 860 syncs, it will start the scan, but that will only try to resilver blocks from (859, 860].
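The race described above can be illustrated with a toy model. This is only a sketch in Python, not the ZFS code: the `Vdev` class, its fields, and `finish_scan_buggy` are hypothetical names invented for the illustration, and each DTL is reduced to a single flag. The txg numbers are the ones from the scenario.

```python
from dataclasses import dataclass


@dataclass
class Vdev:
    """Hypothetical stand-in for a leaf vdev; not the real vdev_t."""
    name: str
    attach_txg: int
    dtl_missing: bool = True   # DTL still says this device lacks data
    resilvered: bool = False   # a scan really copied the data over


def finish_scan_buggy(vdevs, scan_start_txg):
    """Model of the behaviour described above: completion clears every DTL."""
    for v in vdevs:
        # Only devices present when the scan started were actually copied to.
        if v.attach_txg <= scan_start_txg:
            v.resilvered = True
        # Bug: the DTL is cleared even for devices attached mid-scan.
        v.dtl_missing = False


ztest_0a = Vdev("ztest.0a", attach_txg=850)
ztest_1b = Vdev("ztest.1b", attach_txg=857)

# The scan began when ztest.0a was attached (txg 850) and completes at
# txg 858, before the restart that was scheduled for txg 860.
finish_scan_buggy([ztest_0a, ztest_1b], scan_start_txg=850)

for v in (ztest_0a, ztest_1b):
    print(v.name, "dtl_missing =", v.dtl_missing, "resilvered =", v.resilvered)
```

In this model ztest.1b ends up with a cleared DTL even though no scan ever copied its data, which is the partially resilvered state the analysis describes.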
To resolve this we only excise the txgs on devices that were part of the entire scan. Any device that was added after the scan was started will not have its DTLs updated when the current running scan completes.
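A minimal sketch of that rule, again as a toy Python model rather than the actual change: `Vdev`, `should_excise`, and `finish_scan` are hypothetical names, the DTL is reduced to a flag, and "part of the entire scan" is assumed to mean "attached no later than the txg at which the scan started".

```python
from dataclasses import dataclass


@dataclass
class Vdev:
    """Hypothetical stand-in for a leaf vdev; not the real vdev_t."""
    name: str
    attach_txg: int
    dtl_missing: bool = True   # DTL still says this device lacks data


def should_excise(vdev, scan_start_txg):
    # Assumption for this sketch: a device took part in the entire scan
    # only if it was attached no later than the scan's starting txg.
    return vdev.attach_txg <= scan_start_txg


def finish_scan(vdevs, scan_start_txg):
    """On scan completion, clear DTLs only for devices the scan fully covered."""
    for v in vdevs:
        if should_excise(v, scan_start_txg):
            v.dtl_missing = False


vdevs = [Vdev("ztest.0a", attach_txg=850), Vdev("ztest.1b", attach_txg=857)]
finish_scan(vdevs, scan_start_txg=850)

# Devices whose DTL survives are left for the restarted scan to resilver.
still_missing = [v.name for v in vdevs if v.dtl_missing]
print(still_missing)
```

Under these assumptions the DTL for ztest.1b survives the first scan's completion, so the scan restarted later still has a record of what that device is missing.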
Updated by Christopher Siden almost 9 years ago
- Status changed from In Progress to Closed
commit b4952e1
Author: George Wilson <email@example.com>
Date: Wed Aug 7 13:16:22 2013

    3956 ::vdev -r should work with pipelines
    3957 ztest should update the cachefile before killing itself
    3958 multiple scans can lead to partial resilvering
    3959 ddt entries are not always resilvered
    3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth
    3961 freed gang blocks are not resilvered and can cause pool to suspend
    3962 ztest should print out zfs debug buffer before exiting
    Reviewed by: Matthew Ahrens <firstname.lastname@example.org>
    Reviewed by: Adam Leventhal <email@example.com>
    Approved by: Richard Lowe <firstname.lastname@example.org>