Bug #6451

ztest fails due to checksum errors

Added by Matthew Ahrens about 4 years ago. Updated about 4 years ago.

Status:
Closed
Priority:
Normal
Category:
zfs - Zettabyte File System
Start date:
2015-11-09
Due date:
% Done:
100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

Sometimes ztest fails because zdb detects checksum errors. e.g.:

Traversing all blocks to verify checksums and verify nothing leaked ...

zdb_blkptr_cb: Got error 50 reading <71, 47, 0, 8000160> DVA0=<0:1cc2000:180000> [L0 other uint64[]] sha256 uncompressed LE contiguous unique single size=100000L/100000P birth=271L/271P fill=1 cksum=c5a3e27d1ed0f894:843bca3a5473c4bf:f76a19b6830a2e4:91292591613a12bf -- skipping
zdb_blkptr_cb: Got error 50 reading <71, 47, 0, 800000180> DVA0=<0:ce16800:180000> [L0 other uint64[]] sha256 uncompressed LE contiguous unique single size=100000L/100000P birth=840L/840P fill=1 cksum=5d018f3d061e17f3:6d1584784587bf63:2805a74a0ce37369:ba68a214806c7e75 -- skipping
zdb_blkptr_cb: Got error 50 reading <71, 47, 0, 1000000360> DVA0=<0:10d37400:180000> [L0 other uint64[]] sha256 uncompressed LE contiguous unique single size=100000L/100000P birth=904L/904P fill=1 cksum=fa1e11d4138bd14b:86c9488c444473e3:f31e43c72e72e46b:e3446472d1174dba -- skipping
zdb_blkptr_cb: Got error 50 reading <71, 47, 0, 400000002c0> DVA0=<0:127ef400:180000> [L0 other uint64[]] sha256 uncompressed LE contiguous dedup single size=100000L/100000P birth=549L/549P fill=1 cksum=30e14955ebf13522:66dc2ff8067e6810:4607e750abb9d3b3:6582b8af909fcb58 -- skipping
zdb_blkptr_cb: Got error 50 reading <657, 5, 0, 1c0> DVA0=<0:1a180400:180000> [L0 other uint64[]] fletcher4 uncompressed LE contiguous unique single size=100000L/100000P birth=1091L/1091P fill=1 cksum=a6cf1e50:29b3bd01c57e5:36779b914035db9a:db61cdcf6bec56f0 -- skipping

The problem is that ztest_fault_inject() can inject multiple faults into the same block. It is designed such that it can inject errors on all leaves of a RAID-Z or mirror, but for a given range of offsets, it will only inject errors on a single leaf. The idea is that any given block will have errors injected on only a single device, because this is the type of reconstruction we support. The injection-eligible ranges are separated by non-injectable ranges (aka DMZ) such that a single block is not supposed to straddle two injectable ranges.

This logic assumes that the maximum blocksize is 128KB, and therefore that each DMZ need be only 256KB. The DMZ should have been enlarged when we introduced large blocks, as they can now span from one injectable range, across the 256KB DMZ, and into the next injectable range on a different leaf. As the example above shows, all the blocks with checksum errors are >128KB (size=100000 hex, i.e. 1MB).
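The straddling can be illustrated with a small model. This is only a sketch, not the ztest source: the function name, the repeating [injectable range][DMZ] layout, and the 256KB injectable-range length used in the note below are assumptions made for illustration; only the 256KB DMZ and the 128KB and 1MB block sizes come from this report.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model, not the ztest source: offsets are laid out as
 * repeating [injectable range][DMZ] pairs, and consecutive injectable
 * ranges are assumed to belong to different leaves.
 *
 * Returns how many injectable ranges the block [start, start + len)
 * overlaps. A result greater than one means faults could be injected
 * on more than one leaf for the same block.
 */
uint64_t
injectable_ranges_touched(uint64_t start, uint64_t len,
    uint64_t inj_len, uint64_t dmz_len)
{
	uint64_t period = inj_len + dmz_len;
	uint64_t end = start + len;	/* exclusive end of the block */
	uint64_t count = 0;

	assert(len > 0 && inj_len > 0);

	/* Only periods that the block overlaps can contribute. */
	for (uint64_t k = start / period; k <= (end - 1) / period; k++) {
		uint64_t range_start = k * period;
		uint64_t range_end = range_start + inj_len;

		if (start < range_end && end > range_start)
			count++;
	}
	return (count);
}
```

In this model, with 256KB injectable ranges and 256KB DMZs, a 128KB block that starts 512 bytes before the end of an injectable range touches one range, while a 1MB block at the same offset touches three. Growing the gap to 1MB brings the 1MB block back to a single range, which is consistent with the diagnosis that the DMZ has to be at least as large as the biggest block.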

History

#1

Updated by Electric Monk about 4 years ago

  • % Done changed from 0 to 100
  • Status changed from New to Closed

git commit f9eb9fdf196b6ed476e4ffc69cecd8b0da3cb7e7

commit  f9eb9fdf196b6ed476e4ffc69cecd8b0da3cb7e7
Author: Matthew Ahrens <mahrens@delphix.com>
Date:   2015-11-16T17:44:23.000Z

    6451 ztest fails due to checksum errors
    Reviewed by: George Wilson <george.wilson@delphix.com>
    Reviewed by: Prakash Surya <prakash.surya@delphix.com>
    Reviewed by: Jorgen Lundman <lundman@lundman.net>
    Approved by: Dan McDonald <danmcd@omniti.com>
