ztest fault injection should avoid resilvering devices
zfs - Zettabyte File System
Analysis by George Wilson:
I've seen several instances where the ztest deadman timer kicks in. Looking at the core file, I see that the spa has been suspended and we're in a state like this:

    > ::spa -v
    ADDR     STATE     NAME
    10c2a000 ACTIVE    ztest

    ADDR     STATE     AUX          DESCRIPTION
    082fd640 DEGRADED  -            root
    082fb500 DEGRADED  -            replacing
    082fd180 CANT_OPEN CORRUPT_DATA /tank/spacemap/ztest/ztest.0a
    082fb9c0 HEALTHY   -            /tank/spacemap/ztest/ztest.0b
    17334300 HEALTHY   -            /tank/spacemap/ztest/ztest.1a

Unfortunately, at the same time that we injected an error we also did a spa reguid. I suspect that the pool eventually suspended because it could not satisfy a read from top-level vdev 0, and when we try to resume the pool we end up with the state shown above. To solve the first part of this problem, we should avoid performing certain fault injection on devices which still have missing DTLs.
In ztest_fault_inject() we periodically change a leaf vdev's state to close, cant_write, or cant_read. This gets done on the first child of the top-level vdev. If that top-level vdev is in the middle of a resilver, then we potentially end up losing the only copy of good data.
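The guard described above can be sketched roughly as follows. This is a minimal model under stated assumptions, not the code from the actual fix: `model_vdev_t`, its `has_missing_dtl` flag, and `model_fault_injection_allowed()` are hypothetical names made up for illustration, and the real `vdev_t` tracks DTLs as range structures rather than a single boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for a vdev. The real vdev_t is far
 * richer; this only models what the guard needs.
 */
typedef struct model_vdev {
	struct model_vdev *top;	/* top-level vdev this leaf belongs to */
	bool has_missing_dtl;	/* assumed flag: missing DTL is non-empty */
} model_vdev_t;

/*
 * Return true if it is safe to inject a fault on this leaf.
 * If the leaf's top-level vdev still has missing DTLs (it has not been
 * fully resilvered), faulting a child could remove the only good copy
 * of some data, so injection is refused.
 */
bool
model_fault_injection_allowed(const model_vdev_t *leaf)
{
	assert(leaf != NULL && leaf->top != NULL);

	if (leaf->top->has_missing_dtl)
		return (false);

	return (true);
}
```

A caller modelled on ztest_fault_inject() would check this before changing the leaf's state and simply skip the injection for that pass when it returns false.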
Updated by Christopher Siden almost 9 years ago
- Status changed from In Progress to Closed
commit 2c1e2b4
Author: George Wilson <email@example.com>
Date: Wed Aug 7 11:24:34 2013

    3949 ztest fault injection should avoid resilvering devices
    3950 ztest: deadman fires when we're doing a scan
    3951 ztest hang when running dedup test
    3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together
    Reviewed by: Matthew Ahrens <firstname.lastname@example.org>
    Reviewed by: Adam Leventhal <email@example.com>
    Approved by: Richard Lowe <firstname.lastname@example.org>