Bug #3961
Closed
freed gang blocks are not resilvered and can cause pool to suspend
Start date:
2013-08-02
Due date:
% Done:
100%
Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:
External Bug:
Description
Analysis by George Wilson:
While running zloop I've seen instances where the pool ends up in a suspended state.

    > 08559000::print spa_t spa_suspend_zio_root | ::zio -rc
    ADDRESS  TYPE  STAGE   WAITER  TIME_ELAPSED
    8c88390  NULL  OPEN    -       -
    89a7c70  NULL  DONE    153     -
    9f399a0  FREE  FAILED  -       -

Looking at the block pointer, we always seem to have a dedup-ed gang block that is at fault:

    > 9f399a0::print zio_t io_bp | ::blkptr
    DVA[0]=<0:15ab200:10200>
    DVA[1]=<0:3899200:200>
    [L0 DEDUP] SHA256 OFF LE gang unique single
    size=10000L/10000P birth=296L/296P fill=0
    cksum=0:0:0:0

The config looks like this:

    > ::spa -v
    ADDR      STATE    NAME
    08559000  ACTIVE   ztest

    ADDR      STATE    AUX  DESCRIPTION
    080b9cc0  HEALTHY  -    root
    091fa840  HEALTHY  -    /rpool/tmp/ztest.0b

Looking at the debug messages, we see that we've just completed a scan:

    visited 2816 blocks in 1774ms
    finished scan txg 313
    txg 313 scan done complete=1
    spa=ztest async request task=8

And we see that we were detaching a device:

    > ::stacks -c spa_vdev_detach
    THREAD   STATE    SOBJ    COUNT
    16b      PARKED   CV          1
             libc.so.1`cond_wait_queue+0x60
             libc.so.1`__cond_wait+0x86
             libc.so.1`cond_wait+0x24
             libzpool.so.1`cv_wait+0x40
             libzpool.so.1`txg_wait_synced+0x17d
             libzpool.so.1`spa_vdev_config_exit+0x14f
             libzpool.so.1`spa_vdev_exit+0x31
             libzpool.so.1`spa_vdev_detach+0x4c1
             ztest_vdev_attach_detach+0x297
             ztest_execute+0x67
             ztest_thread+0xf7
             libc.so.1`_thrp_setup+0x9b
             libc.so.1`_lwp_start

We're detaching a device and, as we saw from the pool configuration, it's already been removed from the vdev tree. So it looks like we've detached a device without resilvering some of the blocks.
Further investigation shows that the free is coming from:

    > 153::findstack -v
    stack pointer for thread 153: f1fe1d98
    [ f1fe1d98 libc.so.1`__lwp_park+0xb() ]
      f1fe1dc8 libc.so.1`cond_wait_queue+0x60(89a7f20, 89a7f08, 0, fee9b342)
      f1fe1e08 libc.so.1`__cond_wait+0x86(89a7f20, 89a7f08, f1fe1e28, fee9b402)
      f1fe1e28 libc.so.1`cond_wait+0x24(89a7f20, 89a7f08, f1fe1e48, fed59476)
      f1fe1e48 libzpool.so.1`cv_wait+0x40(89a7f20, 89a7f00, f1fe1e88, fed4c587)
      f1fe1e88 libzpool.so.1`zio_wait+0x7f(89a7c70, fed15b20, 89a7c70, b0e5b80)
      f1fe1f28 libzpool.so.1`spa_sync+0x37a(8559000, 13a, 0, fed21136)
      f1fe1fc8 libzpool.so.1`txg_sync_thread+0x3f2(80ba180, fef15000, f1fe1fe8, feea26c6)
      f1fe1fe8 libc.so.1`_thrp_setup+0x9b(f89f4a40)
      f1fe1ff8 libc.so.1`_lwp_start(f89f4a40, 0, 0, 0, 0, 0)

This corresponds to this code in spa_sync():

    /*
     * If anything has changed in this txg, or if someone is waiting
     * for this txg to sync (eg, spa_vdev_remove()), push the
     * deferred frees from the previous txg. If not, leave them
     * alone so that we don't generate work on an otherwise idle
     * system.
     */
    if (!txg_list_empty(&dp->dp_dirty_datasets, txg) ||
        !txg_list_empty(&dp->dp_dirty_dirs, txg) ||
        !txg_list_empty(&dp->dp_sync_tasks, txg) ||
        ((dsl_scan_active(dp->dp_scan) ||
        txg_sync_waiting(dp)) && !spa_shutting_down(spa))) {
            zio_t *zio = zio_root(spa, NULL, NULL, 0);
            VERIFY3U(bpobj_iterate(defer_bpo,
                spa_free_sync_cb, zio, tx), ==, 0);
            VERIFY0(zio_wait(zio));
    }

So this means that we're processing the frees from the previous txg. The scenario looks like this:

1. A replace operation was running.
2. A dedup block's reference count goes to 0.
3. We free the block just as the scan completes.
4. The freed block is deferred since spa_sync_pass is already 1.
5. The device detaches.
6. We process the deferred frees from the previous txg.
7. One of the freed blocks is a gang block, so we go to read the gang block header[s] but fail to read because the block was never copied over to this device.
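The ordering problem in the scenario above can be illustrated with a small toy model. This is only a sketch: every `toy_` name is hypothetical and none of it is real ZFS code. It assumes a single deferred gang block whose header lives only on the old device, because the scan finished before it could be resilvered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of the race; not real ZFS structures. */
typedef struct toy_pool {
	bool old_dev_attached;    /* device being replaced is still in the vdev tree */
	bool hdr_on_new_dev;      /* gang header was resilvered to the new device */
	size_t ndeferred;         /* gang-block frees deferred to the next txg */
	bool suspended;           /* pool suspended after a failed free */
} toy_pool_t;

/* Freeing a gang block requires reading its header from some device. */
static bool
toy_gang_header_readable(const toy_pool_t *p)
{
	return (p->old_dev_attached || p->hdr_on_new_dev);
}

/* Process the deferred frees; returns how many frees succeeded. */
size_t
toy_sync_deferred(toy_pool_t *p)
{
	size_t freed = 0;

	while (p->ndeferred > 0) {
		if (!toy_gang_header_readable(p)) {
			p->suspended = true;   /* step 7: header read fails */
			break;
		}
		p->ndeferred--;
		freed++;
	}
	return (freed);
}

/*
 * Replay the scenario. If detach_first is true the old device is
 * detached before the deferred frees run (the buggy ordering).
 * Returns true if the pool ends up suspended.
 */
bool
toy_replay_scenario(bool detach_first)
{
	toy_pool_t p = { .old_dev_attached = true, .hdr_on_new_dev = false };

	p.ndeferred = 1;                   /* steps 2-4: free is deferred */
	assert(p.ndeferred == 1);
	if (detach_first)
		p.old_dev_attached = false;    /* step 5: device detaches */
	(void) toy_sync_deferred(&p);      /* steps 6-7: deferred frees run */
	if (!detach_first)
		p.old_dev_attached = false;    /* safe ordering: detach afterwards */
	return (p.suspended);
}
```

In this model, `toy_replay_scenario(true)` ends with the pool suspended, while `toy_replay_scenario(false)` does not, because the header is still readable from the old device when the free is processed.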
When a scan finishes its traversal, it must wait one additional txg to allow any deferred frees to get processed. This would ensure that a device cannot be removed until any deferred frees that occurred while we were doing the scan have actually been processed.
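A minimal sketch of that rule, assuming a per-txg sync hook and a "done txg" field: the names `toy_scan_t`, `toy_scan_sync` and `toy_detach_allowed` are hypothetical illustrations, not the actual change that was committed. The scan records the txg after the one in which traversal finished and only reports completion once that txg is syncing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scan state; not the real dsl_scan_t. */
typedef struct toy_scan {
	uint64_t done_txg;   /* txg in which the scan may complete; 0 = unset */
	bool done;           /* scan has been declared complete */
} toy_scan_t;

/*
 * Called once for each syncing txg. traversal_finished is true in the
 * txg where the scan visits its last block.
 */
void
toy_scan_sync(toy_scan_t *s, uint64_t txg, bool traversal_finished)
{
	assert(txg != 0);   /* 0 is reserved to mean "unset" */

	if (s->done)
		return;

	if (s->done_txg != 0 && txg >= s->done_txg) {
		/* Deferred frees from the traversal txg have now been pushed. */
		s->done = true;
		return;
	}

	if (traversal_finished && s->done_txg == 0) {
		/* Traversal is complete, but wait one more txg before finishing. */
		s->done_txg = txg + 1;
	}
}

/* A detach that depends on the scan must wait until it is really done. */
bool
toy_detach_allowed(const toy_scan_t *s)
{
	return (s->done);
}
```

With the txg numbers from the debug messages above, a traversal that finishes in txg 313 would not permit the detach until txg 314 has synced, by which point the frees deferred in txg 313 have been processed.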
Updated by Christopher Siden over 10 years ago
- Status changed from In Progress to Closed
commit b4952e1
Author: George Wilson <george.wilson@delphix.com>
Date: Wed Aug 7 13:16:22 2013

    3956 ::vdev -r should work with pipelines
    3957 ztest should update the cachefile before killing itself
    3958 multiple scans can lead to partial resilvering
    3959 ddt entries are not always resilvered
    3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth
    3961 freed gang blocks are not resilvered and can cause pool to suspend
    3962 ztest should print out zfs debug buffer before exiting
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Adam Leventhal <ahl@delphix.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>