Bug #5630

stale bonus buffer in recycled dnode_t leads to data corruption

Added by Justin Gibbs over 4 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: zfs - Zettabyte File System
Start date: 2015-02-16
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

dnode_sync_free() resets the in-core dnode_t of a just-deleted
object so that it represents an unallocated object. The dnode_t is
only considered to represent a free dnode after its last hold
is released, and this last hold may be dropped after dnode_sync_free()
returns. For data dbufs holding the dnode, this delay can be due
to asynchronous evictions from the ARC coincident with the dnode
being destroyed. For bonus buffers, this can happen if the object
type can be destroyed even when held in another context.

Data corruption was observed when a dsl_dir was destroyed for a
recursive dataset destroy, while a "zfs list" operation also held
this dsl_dir. Although dnode_sync_free() calls dnode_evict_dbufs(),
the hold on the dsl_dir's bonus buffer prevented it from being
evicted. This left the bonus buffer associated with the dnode_t
even after the last hold on the dnode was released.

Some time later, this dnode_t was reused for a new dsl_dir. Instead
of getting a new (and zero-filled) bonus buffer, the new dsl_dir was
handed the contents from the old dsl_dir. The dsl_dir initialization
code assumes the bonus buffer is zero-filled and so only explicitly
initializes fields that have non-zero initial values. This allowed
the stale data to be incorporated into the new dsl_dir and written
to the media.
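
The sequence above can be illustrated with a small stand-alone model. This is only a sketch: the toy_* types and functions are hypothetical stand-ins invented for illustration, not the real dnode_t, dbuf, or dsl_dir code. The detach_held flag models one possible remedy (dropping the slot's reference to a still-held buffer when the object is freed); this report does not describe what the actual fix does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins; not the real ZFS structures. */
#define	TOY_BONUS_LEN	4

typedef struct toy_bonus {
	int		holds;			/* outstanding holds */
	uint64_t	data[TOY_BONUS_LEN];	/* bonus payload */
} toy_bonus_t;

typedef struct toy_dnode {
	int		allocated;	/* nonzero while the slot is a live object */
	toy_bonus_t	*bonus;		/* cached bonus buffer, or NULL */
} toy_dnode_t;

/* Take a hold on the bonus buffer; create a zero-filled one if none is cached. */
static toy_bonus_t *
toy_bonus_hold(toy_dnode_t *dn)
{
	if (dn->bonus == NULL)
		dn->bonus = calloc(1, sizeof (toy_bonus_t));
	assert(dn->bonus != NULL);
	dn->bonus->holds++;
	return (dn->bonus);
}

static void
toy_bonus_rele(toy_bonus_t *b)
{
	assert(b->holds > 0);
	b->holds--;
}

/* Model of eviction: a buffer that is still held cannot be evicted. */
static void
toy_evict_bonus(toy_dnode_t *dn)
{
	if (dn->bonus != NULL && dn->bonus->holds == 0) {
		free(dn->bonus);
		dn->bonus = NULL;
	}
}

/*
 * Model of freeing the object. With detach_held set, the slot also forgets
 * a buffer that could not be evicted (an assumed remedy, see the lead-in).
 */
static void
toy_sync_free(toy_dnode_t *dn, int detach_held)
{
	toy_evict_bonus(dn);
	if (detach_held)
		dn->bonus = NULL;	/* the remaining holder now owns it */
	dn->allocated = 0;
}

/* Returns the value the recycled object sees in a field it never set. */
static uint64_t
toy_demo(int detach_held)
{
	toy_dnode_t dn = { 1, NULL };

	/* Old object: another context (think "zfs list") holds the bonus. */
	toy_bonus_t *old = toy_bonus_hold(&dn);
	old->data[1] = 0xdeadbeef;

	toy_sync_free(&dn, detach_held);	/* destroyed while still held */
	toy_bonus_rele(old);			/* last hold dropped afterwards */
	if (dn.bonus != old)
		free(old);			/* detached: holder discards it */

	/* The slot is recycled for a new object. */
	dn.allocated = 1;
	toy_bonus_t *fresh = toy_bonus_hold(&dn);
	fresh->data[0] = 42;		/* init sets only non-zero fields */
	uint64_t leaked = fresh->data[1];	/* a new object expects 0 here */

	toy_bonus_rele(fresh);
	toy_evict_bonus(&dn);
	return (leaked);
}
```

In this model, toy_demo(0) returns the old object's value because the held buffer survives the free and is handed back on reuse, while toy_demo(1) returns 0 because the recycled slot allocates a fresh zero-filled buffer.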

History

#2

Updated by Dan McDonald over 4 years ago

When was this bug introduced? Or has this been around a long time?

#3

Updated by Justin Gibbs over 4 years ago

I believe this bug has been here for a long time, but the recent changes to perform asynchronous objset eviction may have made the race window easier to hit. At Spectra Logic, we had a bpobj corruption issue in June of last year which I now think was also caused by this bug. This was long before the async eviction stuff was present in our tree.

#4

Updated by Electric Monk over 4 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit cd485b49201b16c079663125308af274b6299e96

commit  cd485b49201b16c079663125308af274b6299e96
Author: Justin T. Gibbs <justing@spectralogic.com>
Date:   2015-03-04T04:37:12.000Z

    5630 stale bonus buffer in recycled dnode_t leads to data corruption
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: George Wilson <george@delphix.com>
    Reviewed by: Will Andrews <will@freebsd.org>
    Approved by: Robert Mustacchi <rm@joyent.com>
