Bug #3961

freed gang blocks are not resilvered and can cause pool to suspend

Added by Christopher Siden over 6 years ago. Updated over 6 years ago.

Status: Closed
Priority: Normal
Category: zfs - Zettabyte File System
Start date: 2013-08-02
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

Analysis by George Wilson:

While running zloop I've seen instances where the pool ends up in a suspended
state.

> 08559000::print spa_t spa_suspend_zio_root | ::zio -rc
ADDRESS         TYPE  STAGE            WAITER           TIME_ELAPSED
8c88390         NULL  OPEN             -                -            
 89a7c70        NULL  DONE             153              -            
  9f399a0       FREE  FAILED           -                -            

Looking at the block pointer we always seem to have a dedup-ed gang block that
is at fault:

> 9f399a0::print zio_t io_bp | ::blkptr
DVA[0]=<0:15ab200:10200>
DVA[1]=<0:3899200:200>
[L0 DEDUP] SHA256 OFF LE gang unique single
size=10000L/10000P birth=296L/296P fill=0
cksum=0:0:0:0

The config looks like this:

> ::spa -v
ADDR         STATE NAME                                                        
08559000    ACTIVE ztest

    ADDR     STATE     AUX          DESCRIPTION                                
    080b9cc0 HEALTHY   -            root
    091fa840 HEALTHY   -              /rpool/tmp/ztest.0b

Looking at the debug messages we see that we've just completed a scan:

visited 2816 blocks in 1774ms
finished scan txg 313
txg 313 scan done complete=1
spa=ztest async request task=8

And we see that we were detaching a device:

> ::stacks -c spa_vdev_detach
THREAD   STATE    SOBJ        COUNT
16b      PARKED   CV              1
         libc.so.1`cond_wait_queue+0x60
         libc.so.1`__cond_wait+0x86
         libc.so.1`cond_wait+0x24
         libzpool.so.1`cv_wait+0x40
         libzpool.so.1`txg_wait_synced+0x17d
         libzpool.so.1`spa_vdev_config_exit+0x14f
         libzpool.so.1`spa_vdev_exit+0x31
         libzpool.so.1`spa_vdev_detach+0x4c1
         ztest_vdev_attach_detach+0x297
         ztest_execute+0x67
         ztest_thread+0xf7
         libc.so.1`_thrp_setup+0x9b
         libc.so.1`_lwp_start

We're detaching a device and, as we saw from the pool configuration, it's
already been removed from the vdev tree.

So it looks like we've detached a device without resilvering some of the
blocks.

Further investigation shows that the free is coming from:

> 153::findstack -v
stack pointer for thread 153: f1fe1d98
[ f1fe1d98 libc.so.1`__lwp_park+0xb() ]
  f1fe1dc8 libc.so.1`cond_wait_queue+0x60(89a7f20, 89a7f08, 0, fee9b342)
  f1fe1e08 libc.so.1`__cond_wait+0x86(89a7f20, 89a7f08, f1fe1e28, fee9b402)
  f1fe1e28 libc.so.1`cond_wait+0x24(89a7f20, 89a7f08, f1fe1e48, fed59476)
  f1fe1e48 libzpool.so.1`cv_wait+0x40(89a7f20, 89a7f00, f1fe1e88, fed4c587)
  f1fe1e88 libzpool.so.1`zio_wait+0x7f(89a7c70, fed15b20, 89a7c70, b0e5b80)
  f1fe1f28 libzpool.so.1`spa_sync+0x37a(8559000, 13a, 0, fed21136)
  f1fe1fc8 libzpool.so.1`txg_sync_thread+0x3f2(80ba180, fef15000, f1fe1fe8, 
  feea26c6)
  f1fe1fe8 libc.so.1`_thrp_setup+0x9b(f89f4a40)
  f1fe1ff8 libc.so.1`_lwp_start(f89f4a40, 0, 0, 0, 0, 0)

This corresponds to this code in spa_sync():

        /*
         * If anything has changed in this txg, or if someone is waiting
         * for this txg to sync (eg, spa_vdev_remove()), push the
         * deferred frees from the previous txg.  If not, leave them
         * alone so that we don't generate work on an otherwise idle
         * system.
         */
        if (!txg_list_empty(&dp->dp_dirty_datasets, txg) ||
            !txg_list_empty(&dp->dp_dirty_dirs, txg) ||
            !txg_list_empty(&dp->dp_sync_tasks, txg) ||
            ((dsl_scan_active(dp->dp_scan) ||
            txg_sync_waiting(dp)) && !spa_shutting_down(spa))) {
                zio_t *zio = zio_root(spa, NULL, NULL, 0);
                VERIFY3U(bpobj_iterate(defer_bpo,
                    spa_free_sync_cb, zio, tx), ==, 0);
                VERIFY0(zio_wait(zio));
        }

So this means that we're processing the frees from the previous txg.

So the scenario looks like this:

1) a replace operation was running
2) a dedup block's reference count goes to 0
3) we free the block just as the scan completes
4) the free is deferred since we're already past sync pass 1
5) the device detaches
6) we process the deferred frees from the previous txg
7) one of the freed blocks is a gang block, so we go to read the gang block
header[s] but fail because the block was never copied over to the remaining
device.

When a scan finishes its traversal it must wait one additional txg to allow
any deferred frees to get processed. This would ensure that a device cannot
be removed until any deferred frees that occurred while the scan was running
have actually been processed.

History

#1

Updated by Christopher Siden over 6 years ago

  • Status changed from In Progress to Closed
commit b4952e1
Author: George Wilson <george.wilson@delphix.com>
Date:   Wed Aug 7 13:16:22 2013

    3956 ::vdev -r should work with pipelines
    3957 ztest should update the cachefile before killing itself
    3958 multiple scans can lead to partial resilvering
    3959 ddt entries are not always resilvered
    3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth
    3961 freed gang blocks are not resilvered and can cause pool to suspend
    3962 ztest should print out zfs debug buffer before exiting
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Adam Leventhal <ahl@delphix.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>
