Bug #3959

ddt entries are not always resilvered

Added by Christopher Siden over 5 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Category: zfs - Zettabyte File System
Start date: 2013-08-02
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

Analysis by George Wilson:

Running zloop I have hit cases where 'zdb' detects checksum errors. In all of
the instances the checksum errors are associated with dedup blocks. Errors look
like this:

Traversing all blocks to verify checksums and verify nothing leaked ...

10.9M completed (   5MB/s) estimated time remaining: 0hr 00min 01sec       
zdb_blkptr_cb: Got error 50 reading <394, 4, 0, 8000002500>
DVA[0]=<0:efaa00:200> DVA[1]=<0:27daa00:200> [L0 other uint64[]] sha256
uncompressed LE contiguous dedup double size=200L/200P birth=376L/376P fill=1
cksum=c76681257c327e2b:bb5c90712fdc4ab3:1905b6292581bbee:628fcc2ec1c1eb55 -- skipping
zdb_blkptr_cb: Got error 50 reading <394, 4, 0, 200000003000>
DVA[0]=<0:efaa00:200> DVA[1]=<0:27daa00:200> [L0 other uint64[]] sha256
uncompressed LE contiguous dedup double size=200L/200P birth=397L/376P fill=1
cksum=c76681257c327e2b:bb5c90712fdc4ab3:1905b6292581bbee:628fcc2ec1c1eb55 -- skipping

Error counts:

    errno  count
       50  2

    No leaks (block sum matches space maps exactly)

    bp count:            6710
    bp logical:      70804480      avg:  10552
    bp physical:     11543040      avg:   1720     compression:   6.13
    bp allocated:    19259392      avg:   2870     compression:   3.68
    bp deduped:          2048    ref>1:      2   deduplication:   1.00
    SPA allocated:   19257344     used: 13.21%

Blocks    LSIZE    PSIZE    ASIZE      avg     comp    %Total    Type
     -        -        -        -        -        -         -    unallocated
     2      32K       4K    12.0K    6.00K     8.00      0.06    object directory
     2       1K       1K    3.00K    1.50K     1.00      0.02    object array
     1      16K      16K    48.0K    48.0K     1.00      0.26    packed nvlist
     -        -        -        -        -        -         -    packed nvlist size
    27    3.27M     126K     378K    14.0K    26.54      2.01    bpobj
     -        -        -        -        -        -         -    bpobj header
     -        -        -        -        -        -         -    SPA space map header
    73     292K    73.0K     219K    3.00K     4.00      1.16    SPA space map
    13    92.0K    92.0K    92.0K    7.08K     1.00      0.49    ZIL intent log
    43     688K    43.0K    98.0K    2.28K    16.00      0.52        L6 DMU dnode
    43     688K    43.0K    98.0K    2.28K    16.00      0.52        L5 DMU dnode
    43     688K    43.0K    98.0K    2.28K    16.00      0.52        L4 DMU dnode
    43     688K    43.0K    98.0K    2.28K    16.00      0.52        L3 DMU dnode
    43     688K    43.0K    98.0K    2.28K    16.00      0.52        L2 DMU dnode
    44     704K    44.5K     103K    2.33K    15.82      0.54        L1 DMU dnode
    93    1.45M     165K     431K    4.63K     9.02      2.29        L0 DMU dnode
   352    5.50M     425K    1.00M    2.91K    13.27      5.44    DMU dnode
    47    94.0K    23.5K    54.5K    1.16K     4.00      0.29    DMU objset
     -        -        -        -        -        -         -    DSL directory
    32    17.5K      16K    48.0K    1.50K     1.09      0.26    DSL directory child map
    30    15.0K    15.0K    45.0K    1.50K     1.00      0.24    DSL dataset snap map
    32      16K      16K    48.0K    1.50K     1.00      0.26    DSL props
     -        -        -        -        -        -         -    DSL dataset
     -        -        -        -        -        -         -    ZFS znode
     -        -        -        -        -        -         -    ZFS V0 ACL
     -        -        -        -        -        -         -    ZFS plain file
     -        -        -        -        -        -         -    ZFS directory
     -        -        -        -        -        -         -    ZFS master node
     -        -        -        -        -        -         -    ZFS delete queue
     -        -        -        -        -        -         -    zvol object
     -        -        -        -        -        -         -    zvol prop
     -        -        -        -        -        -         -    other uint8[]
     3    3.00K    1.50K    3.00K       1K     2.00      0.02        L14 other uint64[]
    12    14.0K    6.00K    14.0K    1.17K     2.33      0.07        L13 other uint64[]
    26    35.0K    13.0K    29.5K    1.13K     2.69      0.16        L12 other uint64[]
    39    60.0K    19.5K    46.5K    1.19K     3.08      0.25        L11 other uint64[]
    63     109K    33.5K    81.5K    1.29K     3.25      0.43        L10 other uint64[]
    99     217K    53.0K     130K    1.31K     4.09      0.69        L9 other uint64[]
   142     377K    75.0K     183K    1.29K     5.03      0.97        L8 other uint64[]
   185     635K    96.5K     237K    1.28K     6.58      1.26        L7 other uint64[]
   235    1.01M     127K     313K    1.33K     8.17      1.66        L6 other uint64[]
   290    1.50M     162K     394K    1.36K     9.50      2.09        L5 other uint64[]
   347    1.80M     196K     475K    1.37K     9.41      2.52        L4 other uint64[]
   447    2.10M     249K     597K    1.33K     8.65      3.17        L3 other uint64[]
   692    3.16M     389K     909K    1.31K     8.32      4.83        L2 other uint64[]
 1.17K    5.30M     671K    1.51M    1.29K     8.09      8.24        L1 other uint64[]
 1.81K    36.4M    6.90M    8.36M    4.62K     5.27     45.53        L0 other uint64[]
 5.50K    52.7M    8.94M    13.2M    2.40K     5.89     71.90    other uint64[]
    34     544K    39.5K    96.5K    2.84K    13.77      0.51        L1 other ZAP
   299    4.02M    1.06M    2.61M    8.93K     3.79     14.20        L0 other ZAP
   333    4.55M    1.10M    2.70M    8.31K     4.14     14.71    other ZAP
     -        -        -        -        -        -         -    persistent error log
     1     128K    26.0K    78.0K    78.0K     4.92      0.41    SPA history
     -        -        -        -        -        -         -    SPA history offsets
     1      512      512    1.50K    1.50K     1.00      0.01    Pool properties
     -        -        -        -        -        -         -    DSL permissions
     -        -        -        -        -        -         -    ZFS ACL
     -        -        -        -        -        -         -    ZFS SYSACL
     -        -        -        -        -        -         -    FUID table
     -        -        -        -        -        -         -    FUID table size
     3    3.00K    1.50K    4.50K    1.50K     2.00      0.02    DSL dataset next clones
     -        -        -        -        -        -         -    scan work queue
     -        -        -        -        -        -         -    ZFS user/group used
     -        -        -        -        -        -         -    ZFS user/group quota
     2       1K       1K    3.00K    1.50K     1.00      0.02    snapshot refcount tags
     1       4K       2K    6.00K    6.00K     2.00      0.03        L1 DDT ZAP algorithm
    26     104K    57.5K     173K    6.63K     1.81      0.92        L0 DDT ZAP algorithm
    27     108K    59.5K     179K    6.61K     1.82      0.95    DDT ZAP algorithm
     2      32K    3.50K    10.5K    5.25K     9.14      0.06    DDT statistics
     -        -        -        -        -        -         -    System attributes
     -        -        -        -        -        -         -    SA master node
     -        -        -        -        -        -         -    SA attr registration
     -        -        -        -        -        -         -    SA attr layouts
     -        -        -        -        -        -         -    scan translations
     -        -        -        -        -        -         -    deduplicated block
    58    29.0K    29.0K    87.0K    1.50K     1.00      0.46    DSL deadlist map
     -        -        -        -        -        -         -    DSL deadlist map hdr
    10    6.50K    5.00K    15.0K    1.50K     1.30      0.08    DSL dir clones
     4     512K    18.0K    54.0K    13.5K    28.44      0.29    bpobj subobj
    25     160K    34.0K     102K    4.08K     4.71      0.54    deferred free
     -        -        -        -        -        -         -    dedup ditto
     4    33.0K    4.50K    13.5K    3.38K     7.33      0.07    other
     3    3.00K    1.50K    3.00K       1K     2.00      0.02        L14 Total
    12    14.0K    6.00K    14.0K    1.17K     2.33      0.07        L13 Total
    26    35.0K    13.0K    29.5K    1.13K     2.69      0.16        L12 Total
    39    60.0K    19.5K    46.5K    1.19K     3.08      0.25        L11 Total
    63     109K    33.5K    81.5K    1.29K     3.25      0.43        L10 Total
    99     217K    53.0K     130K    1.31K     4.09      0.69        L9 Total
   142     377K    75.0K     183K    1.29K     5.03      0.97        L8 Total
   185     635K    96.5K     237K    1.28K     6.58      1.26        L7 Total
   278    1.68M     170K     411K    1.48K    10.15      2.19        L6 Total
   333    2.17M     205K     492K    1.48K    10.87      2.61        L5 Total
   390    2.47M     239K     573K    1.47K    10.60      3.04        L4 Total
   490    2.77M     292K     695K    1.42K     9.73      3.70        L3 Total
   735    3.83M     432K    1007K    1.37K     9.09      5.35        L2 Total
 1.25K    6.52M     757K    1.71M    1.37K     8.82      9.33        L1 Total
 2.58K    46.7M    8.67M    12.8M    4.98K     5.38     69.94        L0 Total
 6.55K    67.5M    11.0M    18.4M    2.80K     6.13    100.00    Total

                            capacity   operations   bandwidth  ---- errors ----
description                used avail  read write  read write  read write cksum
ztest                     18.4M  121M 7.25K     0 10.1M     0     0     0     4
  replacing               18.4M  121M 7.25K     0 10.1M     0     0     0     8
    /rpool/tmp/ztest.0b               6.68K     0 11.4M     0     0     0     8
    /rpool/tmp/ztest.0a                   6     0  204K     0     0     0     8

The reason for the checksum errors is that some blocks in the DDT can be
skipped when a resilver is active. The scenario has to do with DDT entries that
are changing reference counts at the time a resilver is active. Above
dsl_scan_ddt() the following comment exists:

/*
 * Scrub/dedup interaction.
 *
 * If there are N references to a deduped block, we don't want to scrub it
 * N times -- ideally, we should scrub it exactly once.
 *
 * We leverage the fact that the dde's replication class (enum ddt_class)
 * is ordered from highest replication class (DDT_CLASS_DITTO) to lowest
 * (DDT_CLASS_UNIQUE) so that we may walk the DDT in that order.
 *
 * To prevent excess scrubbing, the scrub begins by walking the DDT
 * to find all blocks with refcnt > 1, and scrubs each of these once.
 * Since there are two replication classes which contain blocks with
 * refcnt > 1, we scrub the highest replication class (DDT_CLASS_DITTO) first.
 * Finally the top-down scrub begins, only visiting blocks with refcnt == 1.
 *
 * There would be nothing more to say if a block's refcnt couldn't change
 * during a scrub, but of course it can so we must account for changes
 * in a block's replication class.
 *
 * Here's an example of what can occur:
 *
 * If a block has refcnt > 1 during the DDT scrub phase, but has refcnt == 1
 * when visited during the top-down scrub phase, it will be scrubbed twice.
 * This negates our scrub optimization, but is otherwise harmless.
 *
 * If a block has refcnt == 1 during the DDT scrub phase, but has refcnt > 1
 * on each visit during the top-down scrub phase, it will never be scrubbed.
 * To catch this, ddt_sync_entry() notifies the scrub code whenever a block's
 * reference class transitions to a higher level (i.e DDT_CLASS_UNIQUE to
 * DDT_CLASS_DUPLICATE); if it transitions from refcnt == 1 to refcnt > 1
 * while a scrub is in progress, it scrubs the block right then.
 */
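
The class ordering that this comment relies on can be sketched in a few lines of C. This is a simplified model, not the ZFS source: the names ddt_class_model, class_for_refcnt and needs_immediate_scan are hypothetical, and promotion to the DITTO class is ignored so that the class depends only on refcnt.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the ordering described in the comment: the highest
 * replication class has the lowest numeric value, so the DDT walk
 * can visit classes in ascending order.
 */
enum ddt_class_model {
	MODEL_CLASS_DITTO = 0,
	MODEL_CLASS_DUPLICATE,
	MODEL_CLASS_UNIQUE
};

/* Simplification (assumption): refcnt > 1 means DUPLICATE, else UNIQUE. */
enum ddt_class_model
class_for_refcnt(uint64_t refcnt)
{
	return (refcnt > 1 ? MODEL_CLASS_DUPLICATE : MODEL_CLASS_UNIQUE);
}

/*
 * A block whose class moves to a numerically lower (higher replication)
 * class may already have been passed by the DDT walk, so it has to be
 * scanned at the time of the transition.
 */
bool
needs_immediate_scan(enum ddt_class_model oclass, enum ddt_class_model nclass)
{
	return (nclass < oclass);
}
```

Under this model, a block going from refcnt == 1 to refcnt == 2 requires an immediate scan, while a block dropping from refcnt == 2 to refcnt == 1 does not (it is at worst scrubbed twice).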

The case of interest is when a dedup block transitions to refcnt > 1. The code
for that looks like this:

ddt_sync_entry()
{
<snip>
                /*
                 * If the class changes, the order that we scan this bp
                 * changes.  If it decreases, we could miss it, so
                 * scan it right now.  (This covers both class changing
                 * while we are doing ddt_walk(), and when we are
                 * traversing.)
                 */
                if (nclass < oclass) {
                        dsl_scan_ddt_entry(dp->dp_scan,
                            ddt->ddt_checksum, dde, tx);
                }

This is correct, so let's look at dsl_scan_ddt_entry():

void
dsl_scan_ddt_entry(dsl_scan_t *scn, enum zio_checksum checksum,
    ddt_entry_t *dde, dmu_tx_t *tx)
{
        const ddt_key_t *ddk = &dde->dde_key;
        ddt_phys_t *ddp = dde->dde_phys;
        blkptr_t bp;
        zbookmark_t zb = { 0 };

        if (scn->scn_phys.scn_state != DSS_SCANNING)
                return;

        for (int p = 0; p < DDT_PHYS_TYPES; p++, ddp++) {
                if (ddp->ddp_phys_birth == 0 ||
                    ddp->ddp_phys_birth > scn->scn_phys.scn_cur_max_txg)
                        continue;
                ddt_bp_create(checksum, ddk, ddp, &bp);

                scn->scn_visited_this_txg++;
                scan_funcs[scn->scn_phys.scn_func](scn->scn_dp, &bp, &zb);
        }
}

At first glance this looks correct. After some additional debugging I was able
to isolate the bug to this line of code:

                    ddp->ddp_phys_birth > scn->scn_phys.scn_cur_max_txg

The problem is that scn_cur_max_txg != scn_max_txg when we resume a scan on a
snapshot. So a dedup block that existed before the scan started, but whose
physical birth txg is greater than scn_cur_max_txg, is never scrubbed or
resilvered.
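
The effect of that comparison can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the committed patch (which is not shown in this report): scan_model_t and both helper functions are hypothetical, and bounding the check by the scan-wide maximum is one plausible way to avoid the skip. The birth txg 376 is taken from the zdb output above; the two txg limits used in the usage note are made-up illustrative values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the relevant scan state. */
typedef struct {
	uint64_t max_txg;	/* models scn_max_txg: bound for the whole scan */
	uint64_t cur_max_txg;	/* models scn_cur_max_txg: bound for the dataset being visited */
} scan_model_t;

/* The check as quoted above: skip anything newer than cur_max_txg. */
bool
skipped_by_cur_max(const scan_model_t *scn, uint64_t phys_birth)
{
	assert(scn->cur_max_txg <= scn->max_txg);
	return (phys_birth == 0 || phys_birth > scn->cur_max_txg);
}

/* Assumed alternative: bound the check by the scan-wide max_txg instead. */
bool
skipped_by_scan_max(const scan_model_t *scn, uint64_t phys_birth)
{
	assert(scn->cur_max_txg <= scn->max_txg);
	return (phys_birth == 0 || phys_birth > scn->max_txg);
}
```

With max_txg = 400 and cur_max_txg = 300 (as might happen while the scan is visiting an older snapshot), a DDT entry with physical birth 376 is skipped by the first check even though it predates the start of the scan, whereas the second check lets it through.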

History

#1

Updated by Christopher Siden over 5 years ago

  • Status changed from In Progress to Closed
commit b4952e1
Author: George Wilson <george.wilson@delphix.com>
Date:   Wed Aug 7 13:16:22 2013

    3956 ::vdev -r should work with pipelines
    3957 ztest should update the cachefile before killing itself
    3958 multiple scans can lead to partial resilvering
    3959 ddt entries are not always resilvered
    3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth
    3961 freed gang blocks are not resilvered and can cause pool to suspend
    3962 ztest should print out zfs debug buffer before exiting
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Adam Leventhal <ahl@delphix.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>
