Bug #14824

open

zfs: receive_object uses VERIFY0 on calls that can return EIO, panics on garden-variety disk failure

Added by Alex Wilson 3 months ago. Updated 3 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug:

Description

Currently, receive_object() in dmu_recv.c has the following around line 1416:

    if (data != NULL) {
        dmu_buf_t *db;
        dnode_t *dn;
        uint32_t flags = DMU_READ_NO_PREFETCH;

        if (rwa->raw)
            flags |= DMU_READ_NO_DECRYPT;

        VERIFY0(dnode_hold(rwa->os, drro->drr_object, FTAG, &dn));
        VERIFY0(dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags));

        dmu_buf_will_dirty(db, tx);

Here VERIFY0() is applied to the return values of dnode_hold() and dmu_bonus_hold_by_dnode(). Both calls may need ZFS to read blocks in from disk, and either can return EIO if that read fails (including a transient failure that would succeed if retried later).

When that happens, the VERIFY fails and panics the kernel.

As a result, on a server that spends all day running zfs receive (e.g. a backup snapshot box), any small transient disk glitch or failure can panic the kernel. Not great.

#1

Updated by Alex Wilson 3 months ago

  • Description updated

#2

Updated by Alex Wilson 3 months ago

  • Description updated

#3

Updated by Alex Wilson 3 months ago

These VERIFY assertions were last touched in 8727 (Native data and metadata encryption for ZFS), but the dnode_hold() call there has been wrapped in VERIFY0 forever.

So far we've only hit this in production on the dmu_bonus_hold_by_dnode() call. Perhaps dnode_hold() is safe by virtue of the work receive_object() does before this point, while the dmu_bonus_hold_by_dnode() VERIFY, which the ZFS encryption work added, is not?
