Bug #3603


panic from bpobj_enqueue_subobj()

Added by Christopher Siden almost 9 years ago. Updated almost 9 years ago.

zfs - Zettabyte File System
Start date:
Due date:
% Done:


Estimated time:
Gerrit CR:


From Matt Ahrens's original bug report and analysis:

George, Adam, and I took a look at the dump file.  We found the problem in this
code in bpobj_enqueue_subobj():

     * If subobj has only one block of subobjs, then move subobj's
     * subobjs to bpo's subobj list directly.  This reduces
     * recursion in bpobj_iterate due to nested subobjs.
    subsubobjs = subbpo.bpo_phys->bpo_subobjs;
    if (subsubobjs != 0) {
        dmu_object_info_t doi;

        VERIFY3U(0, ==, dmu_object_info(bpo->bpo_os, subsubobjs, &doi));
        if (doi.doi_max_offset == doi.doi_data_block_size) {
            dmu_buf_t *subdb;
            uint64_t numsubsub = subbpo.bpo_phys->bpo_num_subobjs;

            VERIFY3U(0, ==, dmu_buf_hold(bpo->bpo_os, subsubobjs,
                0, FTAG, &subdb, 0));
            dmu_write(bpo->bpo_os, bpo->bpo_phys->bpo_subobjs,
                bpo->bpo_phys->bpo_num_subobjs * sizeof (subobj),
                numsubsub * sizeof (subobj), subdb->db_data, tx);

We are writing numsubsub * 8 bytes from the subsubobj's dbuf into the target
bpobj's subobj.  The problem is that dmu_object_info() computes doi_max_offset
from the dnode_phys_t, so if we wrote to the subsubobj in this txg, it may
actually be more than one block long. I.e. numsubsub * 8 > subdb->db_size. 
Thus the dmu_write() will attempt to read past the end of subdb->db_data.

Luckily, the address after subdb->db_data is not mapped, so we panicked rather
than simply corrupting the subobj on disk.

The fix should be as simple as having dmu_object_info_from_dnode() compute
doi_max_offset from dn->dn_maxblkid rather than dn->dn_phys->dn_max_blkid. 
(Plus some assertions in bpobj_enqueue_subobj() to ensure that we never try to
write more data that there is in the dbuf.)

The following two conditions must be met to hit the bug:

There must be a couple hundred subobjs.  This can only happen if we have
created at least that many snapshots of this filesystem/zvol, and then
destroyed them.

We must be deleting several snapshots of the same filesystem/zvol at once. 
This will commonly happen after an upgrade, because many snapshots will have
fallen out of the retention policy while the stack was down during the upgrade.
 In this case we are deleting 1192 snapshots from at least 11 zvols.  If it was
only those 11 zvols, and 1 snapshot falls out of retention every 15 minutes,
then that implies that the stack was down for at least 24 hours.  But it may be
that there were more zvols involved, as we have already deleted 95% of the 1192
snaps to delete, and we don't know what zvols the already-deleted snapshots are
the bug can be reproduced with the following script:


zpool create test c1t1d0
zfs destroy -r test/fs
zfs create -o recordsize=1k test/fs


for (( i=0; i<n; i=i+1 )); do
        echo creating $i
        for (( j=0; j<i; j=j+1 )); do
                dd if=/dev/urandom of=/test/fs/file conv=notrunc bs=1k
oseek=$((j * n + i)) count=1 >& /dev/null
        dd if=/dev/urandom of=/test/fs/file conv=notrunc bs=1k oseek=$((i * n))
count=$n >& /dev/null
        zfs snapshot test/fs@$i

zfs destroy test/fs@$((n/2))%
zfs destroy test/fs@1%$((n/2-2)),0
The incorrect code has existed in ZFS since the dawn of time (2005).  However
it was made much easier to hit by the fix for 12024, which is the performance
enhancement I made to destroy multiple snapshots of a single filesystem in one

Also available in: Atom PDF