Project

General

Profile

Bug #5820

verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig)

Added by Matthew Ahrens over 4 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Category:
zfs - Zettabyte File System
Start date:
2015-04-09
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

we hit this VERIFY in zio_done()::

if (zio->io_flags & ZIO_FLAG_NOPWRITE)
VERIFY(BP_EQUAL(bp, &zio->io_bp_orig));

The proximate problem is that the NOPWRITE flag is set, but we have changed the BP.

In all of the failures, we were doing an override write, caused by dmu_sync(), and the io_bp_orig was a hole. On the test system, we could see that the hole was written the previous txg.

> ::stack
zio_done+0x165(ffffff01cf2d03e0)
...
spa_sync+0x3b7(ffffff01cf61f000, f34)

> ffffff01cf2d03e0::print zio_t io_done
io_done = dbuf_write_override_done

> ffffff01cf2d03e0::print zio_t io_private | ::print -a dbuf_dirty_record_t dt.dl.dr_overridden_by | ::blkptr
DVA[0]=<0:35c00a00:200>
[L0 PLAIN_FILE_CONTENTS] SHA256 OFF LE contiguous unique single
size=200L/200P birth=3844L/3844P fill=1
cksum=23e9c26287d421a8:6bd45fd43dd1adc9:d040a480eabe9bec:3814834a803b52af

> ffffff01cf2d03e0::print -a zio_t io_bp_orig | ::blkptr
HOLE [L0 PLAIN_FILE_CONTENTS] size=200L birth=3891L

> ffffff01cf61f000::print -d spa_t spa_syncing_txg
spa_syncing_txg = 0t3892


The fact that the block that we expected to find a real BP but instead found a hole was the big tip-off. dmu_sync() has to determine if we can nop-write the block. Nopwrite will compare with the current version of the BP, so it's critical that we not attempt to nopwrite if there could be any changes to the BP before the TXG in which this dmu_sync() will take effect. The comment in dmu_sync() describes this situation: /* * Assume the on-disk data is X, the current syncing data is Y, * and the current in-memory data is Z (currently in dmu_sync). * X and Z are identical but Y is has been modified. Normally, * when X and Z are the same we will perform a nopwrite but if Y * is different we must disable nopwrite since the resulting write * of Y to disk can free the block containing X. If we allowed a * nopwrite to occur the block pointing to Z would reference a freed * block. Since this is a rare case we simplify this by disabling * nopwrite if the current dmu_sync-ing dbuf has been modified in * a previous transaction.
*/

The problem is that our check for "if this BP could be modified in a previous txg" is incomplete; it takes into account if the dbuf is dirty (e.g. as a result of a zfs_write()), but not if this block is freed. The fix is to also check dnode_block_freed(), and to disable nopwrite if the block will be freed. The comment has also been updated:

/*
         * Assume the on-disk data is X, the current syncing data (in
         * txg - 1) is Y, and the current in-memory data is Z (currently
         * in dmu_sync).
         *
         * We usually want to perform a nopwrite if X and Z are the
         * same.  However, if Y is different (i.e. the BP is going to
         * change before this write takes effect), then a nopwrite will
         * be incorrect - we would override with X, which could have
         * been freed when Y was written.
         *
         * (Note that this is not a concern when we are nop-writing from
         * syncing context, because X and Y must be identical, because
         * all previous txgs have been synced.)
         *
         * Therefore, we disable nopwrite if the current BP could change
         * before this TXG.  There are two ways it could change: by
         * being dirty (dr_next is non-NULL), or by being freed
         * (dnode_block_freed()).  This behavior is verified by
         * zio_done(), which VERIFYs that the override BP is identical
         * to the on-disk BP.
*/

The problem can be reproduced by running the following. Timing parameters may be performance dependent.

#!/bin/bash -x

zpool create test c2t1d0
zfs create \\
        -o sync=always \\
        -o logbias=throughput \\
        -o recordsize=512 \\
        -o checksum=sha256 \\
        -o compress=lz4 \\
        test/fs

dd if=/dev/urandom of=/test/fs/base bs=1024k count=1

(while true; do sync; sleep 0.1; done)&

while true; do
        dd if=/test/fs/base of=/test/fs/file bs=512
done

The problem is that dmu_sync()'s check for "if this BP could be modified in a previous txg" is incomplete; it takes into account if the dbuf is dirty (e.g. as a result of a zfs_write()), but not if this block is freed. The fix is to also check dnode_block_freed(), and to disable nopwrite if the block will be freed.

History

#1

Updated by Electric Monk over 4 years ago

  • % Done changed from 0 to 100
  • Status changed from New to Closed

git commit 34e8acef009195effafdcf6417aec385e241796e

commit  34e8acef009195effafdcf6417aec385e241796e
Author: Matthew Ahrens <mahrens@delphix.com>
Date:   2015-04-11T18:35:39.000Z

    5820 verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig)
    Reviewed by: Alex Reece <alex@delphix.com>
    Reviewed by: George Wilson <george@delphix.com>
    Reviewed by: Steven Hartland <killing@multiplay.co.uk>
    Approved by: Garrett D'Amore <garrett@damore.org>

Also available in: Atom PDF