Bug #3959: ddt entries are not always resilvered
Status: Closed
Start date: 2013-08-02
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:
External Bug:
Description
Analysis by George Wilson:
Running zloop I have hit cases where 'zdb' detects checksum errors. In all of the instances the checksum errors are associated with dedup blocks. Errors look like this:

Traversing all blocks to verify checksums and verify nothing leaked ...
10.9M completed ( 5MB/s) estimated time remaining: 0hr 00min 01sec
zdb_blkptr_cb: Got error 50 reading <394, 4, 0, 8000002500> DVA[0]=<0:efaa00:200> DVA[1]=<0:27daa00:200> [L0 other uint64[]] sha256 uncompressed LE contiguous dedup double size=200L/200P birth=376L/376P fill=1 cksum=c76681257c327e2b:bb5c90712fdc4ab3:1905b6292581bbee:628fcc2ec1c1eb55 -- skipping
zdb_blkptr_cb: Got error 50 reading <394, 4, 0, 200000003000> DVA[0]=<0:efaa00:200> DVA[1]=<0:27daa00:200> [L0 other uint64[]] sha256 uncompressed LE contiguous dedup double size=200L/200P birth=397L/376P fill=1 cksum=c76681257c327e2b:bb5c90712fdc4ab3:1905b6292581bbee:628fcc2ec1c1eb55 -- skipping

Error counts:

	errno  count
	   50      2

No leaks (block sum matches space maps exactly)

	bp count:           6710
	bp logical:     70804480    avg:  10552
	bp physical:    11543040    avg:   1720    compression:   6.13
	bp allocated:   19259392    avg:   2870    compression:   3.68
	bp deduped:         2048    ref>1:    2    deduplication: 1.00
	SPA allocated:  19257344    used: 13.21%

Blocks LSIZE PSIZE ASIZE avg comp %Total Type
- - - - - - - unallocated
2 32K 4K 12.0K 6.00K 8.00 0.06 object directory
2 1K 1K 3.00K 1.50K 1.00 0.02 object array
1 16K 16K 48.0K 48.0K 1.00 0.26 packed nvlist
- - - - - - - packed nvlist size
27 3.27M 126K 378K 14.0K 26.54 2.01 bpobj
- - - - - - - bpobj header
- - - - - - - SPA space map header
73 292K 73.0K 219K 3.00K 4.00 1.16 SPA space map
13 92.0K 92.0K 92.0K 7.08K 1.00 0.49 ZIL intent log
43 688K 43.0K 98.0K 2.28K 16.00 0.52 L6 DMU dnode
43 688K 43.0K 98.0K 2.28K 16.00 0.52 L5 DMU dnode
43 688K 43.0K 98.0K 2.28K 16.00 0.52 L4 DMU dnode
43 688K 43.0K 98.0K 2.28K 16.00 0.52 L3 DMU dnode
43 688K 43.0K 98.0K 2.28K 16.00 0.52 L2 DMU dnode
44 704K 44.5K 103K 2.33K 15.82 0.54 L1 DMU dnode
93 1.45M 165K 431K 4.63K 9.02 2.29 L0 DMU dnode
352 5.50M 425K 1.00M 2.91K 13.27 5.44 DMU dnode
47 94.0K 23.5K 54.5K 1.16K 4.00 0.29 DMU objset
- - - - - - - DSL directory
32 17.5K 16K 48.0K 1.50K 1.09 0.26 DSL directory child map
30 15.0K 15.0K 45.0K 1.50K 1.00 0.24 DSL dataset snap map
32 16K 16K 48.0K 1.50K 1.00 0.26 DSL props
- - - - - - - DSL dataset
- - - - - - - ZFS znode
- - - - - - - ZFS V0 ACL
- - - - - - - ZFS plain file
- - - - - - - ZFS directory
- - - - - - - ZFS master node
- - - - - - - ZFS delete queue
- - - - - - - zvol object
- - - - - - - zvol prop
- - - - - - - other uint8[]
3 3.00K 1.50K 3.00K 1K 2.00 0.02 L14 other uint64[]
12 14.0K 6.00K 14.0K 1.17K 2.33 0.07 L13 other uint64[]
26 35.0K 13.0K 29.5K 1.13K 2.69 0.16 L12 other uint64[]
39 60.0K 19.5K 46.5K 1.19K 3.08 0.25 L11 other uint64[]
63 109K 33.5K 81.5K 1.29K 3.25 0.43 L10 other uint64[]
99 217K 53.0K 130K 1.31K 4.09 0.69 L9 other uint64[]
142 377K 75.0K 183K 1.29K 5.03 0.97 L8 other uint64[]
185 635K 96.5K 237K 1.28K 6.58 1.26 L7 other uint64[]
235 1.01M 127K 313K 1.33K 8.17 1.66 L6 other uint64[]
290 1.50M 162K 394K 1.36K 9.50 2.09 L5 other uint64[]
347 1.80M 196K 475K 1.37K 9.41 2.52 L4 other uint64[]
447 2.10M 249K 597K 1.33K 8.65 3.17 L3 other uint64[]
692 3.16M 389K 909K 1.31K 8.32 4.83 L2 other uint64[]
1.17K 5.30M 671K 1.51M 1.29K 8.09 8.24 L1 other uint64[]
1.81K 36.4M 6.90M 8.36M 4.62K 5.27 45.53 L0 other uint64[]
5.50K 52.7M 8.94M 13.2M 2.40K 5.89 71.90 other uint64[]
34 544K 39.5K 96.5K 2.84K 13.77 0.51 L1 other ZAP
299 4.02M 1.06M 2.61M 8.93K 3.79 14.20 L0 other ZAP
333 4.55M 1.10M 2.70M 8.31K 4.14 14.71 other ZAP
- - - - - - - persistent error log
1 128K 26.0K 78.0K 78.0K 4.92 0.41 SPA history
- - - - - - - SPA history offsets
1 512 512 1.50K 1.50K 1.00 0.01 Pool properties
- - - - - - - DSL permissions
- - - - - - - ZFS ACL
- - - - - - - ZFS SYSACL
- - - - - - - FUID table
- - - - - - - FUID table size
3 3.00K 1.50K 4.50K 1.50K 2.00 0.02 DSL dataset next clones
- - - - - - - scan work queue
- - - - - - - ZFS user/group used
- - - - - - - ZFS user/group quota
2 1K 1K 3.00K 1.50K 1.00 0.02 snapshot refcount tags
1 4K 2K 6.00K 6.00K 2.00 0.03 L1 DDT ZAP algorithm
26 104K 57.5K 173K 6.63K 1.81 0.92 L0 DDT ZAP algorithm
27 108K 59.5K 179K 6.61K 1.82 0.95 DDT ZAP algorithm
2 32K 3.50K 10.5K 5.25K 9.14 0.06 DDT statistics
- - - - - - - System attributes
- - - - - - - SA master node
- - - - - - - SA attr registration
- - - - - - - SA attr layouts
- - - - - - - scan translations
- - - - - - - deduplicated block
58 29.0K 29.0K 87.0K 1.50K 1.00 0.46 DSL deadlist map
- - - - - - - DSL deadlist map hdr
10 6.50K 5.00K 15.0K 1.50K 1.30 0.08 DSL dir clones
4 512K 18.0K 54.0K 13.5K 28.44 0.29 bpobj subobj
25 160K 34.0K 102K 4.08K 4.71 0.54 deferred free
- - - - - - - dedup ditto
4 33.0K 4.50K 13.5K 3.38K 7.33 0.07 other
3 3.00K 1.50K 3.00K 1K 2.00 0.02 L14 Total
12 14.0K 6.00K 14.0K 1.17K 2.33 0.07 L13 Total
26 35.0K 13.0K 29.5K 1.13K 2.69 0.16 L12 Total
39 60.0K 19.5K 46.5K 1.19K 3.08 0.25 L11 Total
63 109K 33.5K 81.5K 1.29K 3.25 0.43 L10 Total
99 217K 53.0K 130K 1.31K 4.09 0.69 L9 Total
142 377K 75.0K 183K 1.29K 5.03 0.97 L8 Total
185 635K 96.5K 237K 1.28K 6.58 1.26 L7 Total
278 1.68M 170K 411K 1.48K 10.15 2.19 L6 Total
333 2.17M 205K 492K 1.48K 10.87 2.61 L5 Total
390 2.47M 239K 573K 1.47K 10.60 3.04 L4 Total
490 2.77M 292K 695K 1.42K 9.73 3.70 L3 Total
735 3.83M 432K 1007K 1.37K 9.09 5.35 L2 Total
1.25K 6.52M 757K 1.71M 1.37K 8.82 9.33 L1 Total
2.58K 46.7M 8.67M 12.8M 4.98K 5.38 69.94 L0 Total
6.55K 67.5M 11.0M 18.4M 2.80K 6.13 100.00 Total

                           capacity    operations    bandwidth   ---- errors ----
description               used avail   read  write   read write  read write cksum
ztest                    18.4M  121M  7.25K      0  10.1M     0     0     0     4
  replacing              18.4M  121M  7.25K      0  10.1M     0     0     0     8
    /rpool/tmp/ztest.0b               6.68K      0  11.4M     0     0     0     8
    /rpool/tmp/ztest.0a                   6      0   204K     0     0     0     8
The reason for the checksum errors is that some blocks in the DDT can be skipped while a resilver is active. The scenario involves DDT entries whose reference counts are changing while the resilver runs. Above dsl_scan_ddt() the following comment exists:

/*
 * Scrub/dedup interaction.
 *
 * If there are N references to a deduped block, we don't want to scrub it
 * N times -- ideally, we should scrub it exactly once.
 *
 * We leverage the fact that the dde's replication class (enum ddt_class)
 * is ordered from highest replication class (DDT_CLASS_DITTO) to lowest
 * (DDT_CLASS_UNIQUE) so that we may walk the DDT in that order.
 *
 * To prevent excess scrubbing, the scrub begins by walking the DDT
 * to find all blocks with refcnt > 1, and scrubs each of these once.
 * Since there are two replication classes which contain blocks with
 * refcnt > 1, we scrub the highest replication class (DDT_CLASS_DITTO) first.
 * Finally the top-down scrub begins, only visiting blocks with refcnt == 1.
 *
 * There would be nothing more to say if a block's refcnt couldn't change
 * during a scrub, but of course it can so we must account for changes
 * in a block's replication class.
 *
 * Here's an example of what can occur:
 *
 * If a block has refcnt > 1 during the DDT scrub phase, but has refcnt == 1
 * when visited during the top-down scrub phase, it will be scrubbed twice.
 * This negates our scrub optimization, but is otherwise harmless.
 *
 * If a block has refcnt == 1 during the DDT scrub phase, but has refcnt > 1
 * on each visit during the top-down scrub phase, it will never be scrubbed.
 * To catch this, ddt_sync_entry() notifies the scrub code whenever a block's
 * reference class transitions to a higher level (i.e. DDT_CLASS_UNIQUE to
 * DDT_CLASS_DUPLICATE); if it transitions from refcnt == 1 to refcnt > 1
 * while a scrub is in progress, it scrubs the block right then.
 */

The case of interest is when a dedup block transitions to refcnt > 1. The code for that looks like this:

ddt_sync_entry()
{
	<snip>
	/*
	 * If the class changes, the order that we scan this bp
	 * changes. If it decreases, we could miss it, so
	 * scan it right now. (This covers both class changing
	 * while we are doing ddt_walk(), and when we are
	 * traversing.)
	 */
	if (nclass < oclass) {
		dsl_scan_ddt_entry(dp->dp_scan,
		    ddt->ddt_checksum, dde, tx);
	}

This is correct, so let's look at dsl_scan_ddt_entry():

void
dsl_scan_ddt_entry(dsl_scan_t *scn, enum zio_checksum checksum,
    ddt_entry_t *dde, dmu_tx_t *tx)
{
	const ddt_key_t *ddk = &dde->dde_key;
	ddt_phys_t *ddp = dde->dde_phys;
	blkptr_t bp;
	zbookmark_t zb = { 0 };

	if (scn->scn_phys.scn_state != DSS_SCANNING)
		return;

	for (int p = 0; p < DDT_PHYS_TYPES; p++, ddp++) {
		if (ddp->ddp_phys_birth == 0 ||
		    ddp->ddp_phys_birth > scn->scn_phys.scn_cur_max_txg)
			continue;
		ddt_bp_create(checksum, ddk, ddp, &bp);

		scn->scn_visited_this_txg++;
		scan_funcs[scn->scn_phys.scn_func](scn->scn_dp, &bp, &zb);
	}
}

At first glance this looks correct. After some additional debugging I was able to isolate the bug to this line of code:

	ddp->ddp_phys_birth > scn->scn_phys.scn_cur_max_txg

The problem is that scn_cur_max_txg != scn_max_txg when we resume a scan on a snapshot. So a dedup block that existed prior to the start of the scan, but happens to be newer than the scn_cur_max_txg value, is never scrubbed or resilvered.
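In other words, the skip test should compare a DDT entry's birth txg against the scan-wide maximum txg rather than the per-dataset cursor. The loop below is only a sketch of that direction, reusing the body of dsl_scan_ddt_entry() shown above and assuming scn_max_txg is the appropriate scan-wide bound; the authoritative change is the commit referenced in the close note below.

	for (int p = 0; p < DDT_PHYS_TYPES; p++, ddp++) {
		/*
		 * Sketch: skip only entries with no birth txg or born after
		 * the scan itself started (scn_max_txg), rather than after
		 * the dataset currently being visited (scn_cur_max_txg), so
		 * a dedup block that predates the scan is not silently
		 * skipped when the scan resumes on a snapshot.
		 */
		if (ddp->ddp_phys_birth == 0 ||
		    ddp->ddp_phys_birth > scn->scn_phys.scn_max_txg)
			continue;
		ddt_bp_create(checksum, ddk, ddp, &bp);

		scn->scn_visited_this_txg++;
		scan_funcs[scn->scn_phys.scn_func](scn->scn_dp, &bp, &zb);
	}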
Updated by Christopher Siden over 9 years ago
- Status changed from In Progress to Closed
commit b4952e1
Author: George Wilson <george.wilson@delphix.com>
Date:   Wed Aug 7 13:16:22 2013

    3956 ::vdev -r should work with pipelines
    3957 ztest should update the cachefile before killing itself
    3958 multiple scans can lead to partial resilvering
    3959 ddt entries are not always resilvered
    3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth
    3961 freed gang blocks are not resilvered and can cause pool to suspend
    3962 ztest should print out zfs debug buffer before exiting
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Adam Leventhal <ahl@delphix.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>