Bug #3469

dbuf_read_impl shows too much enthusiasm

Added by Joshua M. Clulow over 7 years ago. Updated over 7 years ago.

Status: New
Priority: Normal
Category: zfs - Zettabyte File System
Start date: 2013-01-11
Due date:
% Done: 90%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

In the post-mortem for a zpool containing byzantine corruption (induced previously by #2638) it was discovered that if the bonus block length read from disk was too long, we would corrupt our heap by copying a (potentially very large) region of memory out of the ARC into our dmu_buf_impl_t's db_data.

History

#1

Updated by Joshua M. Clulow over 7 years ago

After the heap corruption, ::kmem_verify reported several pages of memory were destroyed:

LEGEND -- F = Free    A = Allocated    C = Corrupt
...
======================== 9c000 unmapped
======================== 9d000 unmapped
======================== 9e000 page of zio_buf_512
ffffff1c8709e000 F
  ... F ...
ffffff1c8709e600 F
ffffff1c8709e800 A <--- our most recently allocated db.db_data
ffffff1c8709ea00 AC
ffffff1c8709ec00 AC
ffffff1c8709ee00 AC
======================== 9f000 page of dnode_t
ffffff1c8709f0b8 AC <-- dnode_t from panic stack
  ... AC ...
ffffff1c8709fcb8 AC
======================== a0000 page of kmem_alloc_192
ffffff1c870a0000 FC
  ... FC ...
ffffff1c870a0700 FC
ffffff1c870a0800 AC
  ... AC ...
ffffff1c870a0f00 AC
======================== a1000 page of znode_t
ffffff1c870a1010 FC
  ... FC ...
ffffff1c870a1890 FC
ffffff1c870a19a0 AC <-- znode_t most recently allocated and corrupt
  ... AC ...
ffffff1c870a1ef0 AC 
======================== a2000 page of dnode_t
ffffff1c870a20b0 AC <--- corrupt up until about ffffff1c870a2208
ffffff1c870a23b0 A  <--- from here on out, we're no longer corrupt
  ... A ...
======================== a3000 page of sa_bonus
...

The last allocation before panic was a dmu_buf_impl_t ...

> *panic_thread::allocdby
BUFCTL              TIMESTAMP CALLER
....
ffffff1c8708d118  148257b4ae1 dbuf_create+0x4e
ffffff1c87090718  148257b4767 dnode_create+0x47
ffffff1d6f28d670  148257b32cf zfs_dirent_lock+0x25b
ffffff1d508733b0  148257b0fea dbuf_dirty+0x18e
ffffff1c8702b638  148257b0e05 dbuf_dirty+0xa7
ffffff1a9ec9dd20  148257b09e9 zil_itx_assign+0x166
....

> ffffff1c8708d118::bufctl
ADDR             BUFADDR          TIMESTAMP    THREAD           CALLER
ffffff1c8708d118 ffffff1c870a3620  148257b4ae1 ffffff1ac868ab40 dbuf_create+0x4e

> ffffff1c870a3620::dbuf
        addr object lvl blkid holds os
ffffff1c870a3620     cfda 0     bonus  1 zones/69bf9e72-fad9-40ad-8ca7-3c03cdcf681b.toxic
> ffffff1c870a3620::print dmu_buf_impl_t db.db_data
db.db_data = 0xffffff1c8709e800

This is the first allocated data buffer in the 0xffffff1c8709e000 page; it belongs to the toxic dataset and sits at the head of the corruption.

Several reasonably unique ASCII strings appear in the data, as do runs of 0xbaddcafe, leading to the suspicion that a large chunk of memory was miscopied out of the presently running kernel's heap rather than copied directly from disk. Selecting a string and attempting to locate its source:

> ffffff1c8709e000,4000::dump
...
ffffff1c870a1f60:  2f417573 7472616c 69612f44 61727769  /Australia/Darwi
...

> ffffff1c870a1f60+8/K
0xffffff1c870a1f68:             69777261442f6169
> 69777261442f6169::kgrep
ffffff1c870a1f68
ffffff1c905e9ca8

That is: once in the destination of the errant copy, and once in the source! To find the start of the copy, we take the string's offset from the start of the destination buffer and subtract it from the string's address in the source:

> ffffff1c870a1f68 - ffffff1c8709e800 = K
                3768
> ffffff1c905e9ca8 - 3768 = K
                ffffff1c905e6540

So the copy started at 0xffffff1c905e6540. By inspecting the two memory regions, we can see that around 0x39f0 bytes are identical and were likely copied.
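
The same arithmetic, as a small standalone C sketch for clarity (the constants are just the addresses from this dump; this is not ZFS code):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* Addresses taken from the dump above. */
	uint64_t dst_buf = 0xffffff1c8709e800ULL;	/* db.db_data (copy destination) */
	uint64_t dst_str = 0xffffff1c870a1f68ULL;	/* "/Australia/Darwi" within the destination */
	uint64_t src_str = 0xffffff1c905e9ca8ULL;	/* the same string found elsewhere in the heap */

	uint64_t off = dst_str - dst_buf;		/* 0x3768: string's offset in the destination */
	uint64_t src_buf = src_str - off;		/* start of the copy source */

	(void) printf("offset  = 0x%llx\n", (unsigned long long)off);
	(void) printf("src_buf = 0x%llx\n", (unsigned long long)src_buf);
	return (0);
}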

The dmu_buf_impl_t was for object 0xcfda allocated here:

                 kmem_cache_alloc+0x22e
                 dbuf_create+0x4e
                 dbuf_create_bonus+0x2a
                 dmu_bonus_hold+0x7e
                 sa_buf_hold+0x1d
                 zfs_zget+0x5e

The panic stack in this dump is also for object 0xcfda:

mutex_owner_running+0xd()
dmu_object_info_from_dnode+0x3c(ffffff1c8709f0b8, ffffff00b8167600)
dmu_object_info_from_db+0x35(ffffff1c870a3620, ffffff00b8167600)
zfs_zget+0x78(ffffff1a97437040, cfda, ffffff00b8167780)
zfs_dirent_lock+0x3d6(ffffff00b8167788, ffffff1c87055330, ffffff00b81679d0, ffffff00b8167780, 6, 0)
zfs_dirlook+0xdb(ffffff1c87055330, ffffff00b81679d0, ffffff00b81679a0, 0, 0, 0)
zfs_lookup+0x25f(ffffff1c87051b00, ffffff00b81679d0, ffffff00b81679a0, ffffff00b8167c20, 0,
....

Inspecting the code, we see that very little happens as a result of zfs_zget() between these two stacks. It is also clear from inspection that our dmu_buf_impl_t is for a bonus block and would have been created in state DB_UNCACHED. In dbuf_read_impl(), we call zio_buf_alloc(), which is where the destination of the errant copy was allocated:

> ffffff1c8709e800::kmalog

T-0.000152456  addr=ffffff1c8709e000  kmem_va_4096
         kmem_slab_alloc_impl+0xba
         kmem_slab_alloc+0xa6
         kmem_cache_alloc+0x1fa
         vmem_alloc+0x1bc
         segkmem_xalloc+0x90
         segkmem_alloc_vn+0xcd
         segkmem_alloc+0x24
         vmem_xalloc+0x581
         vmem_alloc+0x161
         kmem_slab_create+0x81
         kmem_slab_alloc+0x5b
         kmem_cache_alloc+0x1fa
         zio_buf_alloc+0x2c
         dbuf_read_impl+0x21c
         dbuf_read+0x1b4

Then we attempt a bcopy() using information likely gleaned from the dnode_phys_t...

527    if (db->db_blkid == DMU_BONUS_BLKID) {
528        int bonuslen = MIN(dn->dn_bonuslen, dn->dn_phys->dn_bonuslen);
529
530        ASSERT3U(bonuslen, <=, db->db.db_size);
531        db->db.db_data = zio_buf_alloc(DN_MAX_BONUSLEN);
532        arc_space_consume(DN_MAX_BONUSLEN, ARC_SPACE_OTHER);
533        if (bonuslen < DN_MAX_BONUSLEN)
534            bzero(db->db.db_data, DN_MAX_BONUSLEN);
535        if (bonuslen)
536            bcopy(DN_BONUS(dn->dn_phys), db->db.db_data, bonuslen);
537        DB_DNODE_EXIT(db);
538        dbuf_update_data(db);
539        db->db_state = DB_CACHED;
540        mutex_exit(&db->db_mtx);
541        return;
542    }

NB: DN_BONUS(x) is essentially x + 0x140 (the bonus area follows the dnode_phys_t's 0x40-byte header and its block pointers; with two 0x80-byte block pointers that is offset 0x140).

Looking back 0x140 from the start of the copy source, we see a value of dn_bonuslen which would have induced the wild thrashing about in the heap:

> ffffff1c905e6540-140::print dnode_phys_t
{
    dn_type = 0x68
    dn_indblkshift = 0
    dn_nlevels = 0x53
    dn_nblkptr = 0x2
    dn_bonustype = 0
    dn_checksum = 0
    dn_compress = 0
    dn_flags = 0x6e
    dn_datablkszsec = 0x776f
    dn_bonuslen = 0x3a00
    dn_pad2 = [ 0x2, 0, 0, 0x1 ]
    dn_maxblkid = 0x65736f6c63
    dn_used = 0x7610408204000000
....

This also matches the dnode_phys_t data we expect for object 0xcfda, based on notes from previous analysis of the corrupt pool with zdb(1M), etc.; i.e., historical market data posing as part of a zpool:

> ffffff1c905e6540-140,100::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
ffffff1c905e6400:  68005302 0000006e 6f77003a 02000001  h.S....now.:....
...
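
As a rough cross-check (assuming DN_MAX_BONUSLEN is 0x140 here, so the bonus buffer comes out of the zio_buf_512 cache seen by ::kmem_verify above), a copy length on the order of that dn_bonuslen implies:

    length passed to bcopy()      ~0x3a00  (14848 bytes)
    destination zio_buf_512        0x200   (512 bytes)
    overrun past the buffer       ~0x3800  (~14 KiB)

which lines up with the damage ::kmem_verify reported: the corruption from ffffff1c8709e800 up to about ffffff1c870a2208 is roughly 0x3a08 bytes, and agrees with the ~0x39f0 identical bytes noted earlier.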
#3

Updated by Jason King over 7 years ago

Dumb question perhaps, but will other parts of ZFS catch this corruption with your attached fix? If not, should some sort of notification take place when we encounter this?

#4

Updated by Joshua M. Clulow over 7 years ago

So, the only time we've ever seen this was in a pool that was already essentially permanently damaged from running for some time on a machine afflicted with #2638. While so afflicted, we clearly had memory corruption of some in-memory ZFS data that was then checksummed and written to disk (both ZFS and zdb(1M) were perfectly happy to load the apparently correct blocks).

In this particular case, stat(2) was variously returning EINVAL or ENOENT when the mangled (and now truncated!) data was later used. We fully expect not to see this particular defect again, but it is nonetheless good form to never copy beyond the end of your buffer.
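
For illustration only, a minimal sketch of the kind of clamp described above (not necessarily the change that was actually integrated) bounds the bcopy() length by the buffer just allocated:

    int bonuslen = MIN(dn->dn_bonuslen, dn->dn_phys->dn_bonuslen);

    /* Never copy more than the fixed-size bonus buffer can hold. */
    bonuslen = MIN(bonuslen, DN_MAX_BONUSLEN);
    ...
    if (bonuslen)
        bcopy(DN_BONUS(dn->dn_phys), db->db.db_data, bonuslen);

Whether such a clamp should be silent, or should instead surface an error so the damage is reported (per the question in #3), is a separate design choice.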
