Bug #3469
dbuf_read_impl shows too much enthusiasm
Description
In the post-mortem for a zpool containing byzantine corruption (induced previously by #2638) it was discovered that if the bonus block length read from disk was too long, we would corrupt our heap by copying a (potentially very large) region of memory out of the ARC into our dmu_buf_impl_t's db_data.
Updated by Joshua M. Clulow over 9 years ago
After the heap corruption, ::kmem_verify reported several pages of memory were destroyed:
LEGEND -- F = Free  A = Allocated  C = Corrupt
...
======================== 9c000 unmapped
======================== 9d000 unmapped
======================== 9e000 page of zio_buf_512
ffffff1c8709e000 F
...              F ...
ffffff1c8709e600 F
ffffff1c8709e800 A   <--- our most recently allocated db.db_data
ffffff1c8709ea00 AC
ffffff1c8709ec00 AC
ffffff1c8709ee00 AC
======================== 9f000 page of dnode_t
ffffff1c8709f0b8 AC  <-- dnode_t from panic stack
...              AC ...
ffffff1c8709fcb8 AC
======================== a0000 page of kmem_alloc_192
ffffff1c870a0000 FC
...              FC ...
ffffff1c870a0700 FC
ffffff1c870a0800 AC
...              AC ...
ffffff1c870a0f00 AC
======================== a1000 page of znode_t
ffffff1c870a1010 FC
...              FC ...
ffffff1c870a1890 FC
ffffff1c870a19a0 AC  <-- znode_t most recently allocated and corrupt
...              AC ...
ffffff1c870a1ef0 AC
======================== a2000 page of dnode_t
ffffff1c870a20b0 AC  <--- corrupt up until about ffffff1c870a2208
ffffff1c870a23b0 A   <--- from here on out, we're no longer corrupt
...              A ...
======================== a3000 page of sa_bonus
...
The last allocation before panic was a dmu_buf_impl_t ...
> *panic_thread::allocdby
BUFCTL           TIMESTAMP   CALLER
....
ffffff1c8708d118 148257b4ae1 dbuf_create+0x4e
ffffff1c87090718 148257b4767 dnode_create+0x47
ffffff1d6f28d670 148257b32cf zfs_dirent_lock+0x25b
ffffff1d508733b0 148257b0fea dbuf_dirty+0x18e
ffffff1c8702b638 148257b0e05 dbuf_dirty+0xa7
ffffff1a9ec9dd20 148257b09e9 zil_itx_assign+0x166
....

> ffffff1c8708d118::bufctl
ADDR             BUFADDR          TIMESTAMP   THREAD           CALLER
ffffff1c8708d118 ffffff1c870a3620 148257b4ae1 ffffff1ac868ab40 dbuf_create+0x4e

> ffffff1c870a3620::dbuf
        addr     object lvl blkid holds os
ffffff1c870a3620 cfda   0   bonus 1     zones/69bf9e72-fad9-40ad-8ca7-3c03cdcf681b.toxic

> ffffff1c870a3620::print dmu_buf_impl_t db.db_data
db.db_data = 0xffffff1c8709e800
This is the first allocated buffer in the zio_buf_512 page at 0xffffff1c8709e000; it belongs to the toxic dataset and sits at the head of the corruption.
Reasonably unique ASCII strings appear in the data, as do runs of 0xbaddcafe (the kmem debugging pattern for uninitialised memory), leading to the suspicion that a large chunk of memory was miscopied out of the running kernel's heap rather than read from disk. Selecting a string and attempting to locate its source:
> ffffff1c8709e000,4000::dump
...
ffffff1c870a1f60:  2f417573 7472616c 69612f44 61727769  /Australia/Darwi
...

> ffffff1c870a1f60+8/K
0xffffff1c870a1f68:             69777261442f6169

> 69777261442f6169::kgrep
ffffff1c870a1f68
ffffff1c905e9ca8
That is, the string appears twice: once in the destination of the errant copy, and once in the source. To find the start of the copy, we take the string's offset from the top of the destination buffer and subtract it from its address in the source:
> ffffff1c870a1f68 - ffffff1c8709e800 = K
                3768
> ffffff1c905e9ca8 - 3768 = K
                ffffff1c905e6540
So the copy started at 0xffffff1c905e6540. By inspecting the two memory regions, we can see that around 0x39f0 bytes are identical and were likely copied.
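To make the relationship explicit, here is a small C sketch of the arithmetic used above (purely illustrative; the function name is invented for this note):

/*
 * Illustrative only: given the base of the buffer we know was the copy
 * destination (dst_base), the address inside it where a recognisable
 * string was found (dst_hit), and the address of the same string elsewhere
 * in the heap (src_hit), recover where the errant copy started reading.
 */
#include <stdint.h>

uintptr_t
copy_source_start(uintptr_t dst_base, uintptr_t dst_hit, uintptr_t src_hit)
{
	/* 0xffffff1c870a1f68 - 0xffffff1c8709e800 = 0x3768 */
	uintptr_t off = dst_hit - dst_base;

	/* 0xffffff1c905e9ca8 - 0x3768 = 0xffffff1c905e6540 */
	return (src_hit - off);
}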
The dmu_buf_impl_t was for object 0xcfda, and was allocated here:
kmem_cache_alloc+0x22e
dbuf_create+0x4e
dbuf_create_bonus+0x2a
dmu_bonus_hold+0x7e
sa_buf_hold+0x1d
zfs_zget+0x5e
The panic stack in this dump is also for object 0xcfda:
mutex_owner_running+0xd()
dmu_object_info_from_dnode+0x3c(ffffff1c8709f0b8, ffffff00b8167600)
dmu_object_info_from_db+0x35(ffffff1c870a3620, ffffff00b8167600)
zfs_zget+0x78(ffffff1a97437040, cfda, ffffff00b8167780)
zfs_dirent_lock+0x3d6(ffffff00b8167788, ffffff1c87055330, ffffff00b81679d0, ffffff00b8167780, 6, 0)
zfs_dirlook+0xdb(ffffff1c87055330, ffffff00b81679d0, ffffff00b81679a0, 0, 0, 0)
zfs_lookup+0x25f(ffffff1c87051b00, ffffff00b81679d0, ffffff00b81679a0, ffffff00b8167c20, 0, ....
Inspecting the code, we see that very little happens as a result of zfs_zget() between these two stacks. We can also see (again from inspection) that our dmu_buf_impl_t is for a bonus block, and would have been created in state DB_UNCACHED. In dbuf_read_impl(), we call zio_buf_alloc(), which is where the destination of the errant copy was allocated:
> ffffff1c8709e800::kmalog
T-0.000152456  addr=ffffff1c8709e000  kmem_va_4096
         kmem_slab_alloc_impl+0xba
         kmem_slab_alloc+0xa6
         kmem_cache_alloc+0x1fa
         vmem_alloc+0x1bc
         segkmem_xalloc+0x90
         segkmem_alloc_vn+0xcd
         segkmem_alloc+0x24
         vmem_xalloc+0x581
         vmem_alloc+0x161
         kmem_slab_create+0x81
         kmem_slab_alloc+0x5b
         kmem_cache_alloc+0x1fa
         zio_buf_alloc+0x2c
         dbuf_read_impl+0x21c
         dbuf_read+0x1b4
Then we attempt a bcopy() using information likely gleaned from the dnode_phys_t...
	if (db->db_blkid == DMU_BONUS_BLKID) {
		int bonuslen = MIN(dn->dn_bonuslen, dn->dn_phys->dn_bonuslen);

		ASSERT3U(bonuslen, <=, db->db.db_size);
		db->db.db_data = zio_buf_alloc(DN_MAX_BONUSLEN);
		arc_space_consume(DN_MAX_BONUSLEN, ARC_SPACE_OTHER);
		if (bonuslen < DN_MAX_BONUSLEN)
			bzero(db->db.db_data, DN_MAX_BONUSLEN);
		if (bonuslen)
			bcopy(DN_BONUS(dn->dn_phys), db->db.db_data, bonuslen);
		DB_DNODE_EXIT(db);
		dbuf_update_data(db);
		db->db_state = DB_CACHED;
		mutex_exit(&db->db_mtx);
		return;
	}

NB: DN_BONUS(x) is essentially == x + 0x140
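A more defensive variant of this path (a sketch only, reusing the identifiers from the excerpt above; not necessarily the change that was ultimately integrated) would refuse to trust an on-disk dn_bonuslen larger than the buffer it is copied into:

	if (db->db_blkid == DMU_BONUS_BLKID) {
		int bonuslen = MIN(dn->dn_bonuslen, dn->dn_phys->dn_bonuslen);

		/*
		 * Sketch only: never let a (possibly corrupt) on-disk bonus
		 * length carry the bcopy() past the end of the
		 * DN_MAX_BONUSLEN buffer allocated below.
		 */
		if (bonuslen > DN_MAX_BONUSLEN)
			bonuslen = DN_MAX_BONUSLEN;

		ASSERT3U(bonuslen, <=, db->db.db_size);
		db->db.db_data = zio_buf_alloc(DN_MAX_BONUSLEN);
		arc_space_consume(DN_MAX_BONUSLEN, ARC_SPACE_OTHER);
		if (bonuslen < DN_MAX_BONUSLEN)
			bzero(db->db.db_data, DN_MAX_BONUSLEN);
		if (bonuslen)
			bcopy(DN_BONUS(dn->dn_phys), db->db.db_data, bonuslen);
		...
	}

Silently clamping is only one option; an arguably better one is to treat an out-of-range length as corruption and fail the read so the damage is reported rather than hidden, which is essentially the question raised in the comments below.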
Looking back 0x140 bytes, we see a value of dn_bonuslen which would have induced this wild thrashing about in the heap:
> ffffff1c905e6540-140::print dnode_phys_t
{
    dn_type = 0x68
    dn_indblkshift = 0
    dn_nlevels = 0x53
    dn_nblkptr = 0x2
    dn_bonustype = 0
    dn_checksum = 0
    dn_compress = 0
    dn_flags = 0x6e
    dn_datablkszsec = 0x776f
    dn_bonuslen = 0x3a00
    dn_pad2 = [ 0x2, 0, 0, 0x1 ]
    dn_maxblkid = 0x65736f6c63
    dn_used = 0x7610408204000000
....
This is also the dnode_phys_t data we expect for object 0xcfda, comparing with notes from previous analysis of the corrupt pool using zdb(1M), etc.; i.e., historical market data posing as part of a zpool:
> ffffff1c905e6540-140,100::dump
                   \/ 1 2 3  4 5 6 7  8 9 a b  c d e f  v123456789abcdef
ffffff1c905e6400:  68005302 0000006e 6f77003a 02000001  h.S....now.:....
...
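For scale (assuming the classic 512-byte dnode, in which DN_MAX_BONUSLEN is 0x140 (320) bytes, and noting from the ::kmem_verify output above that db.db_data came out of the zio_buf_512 cache):

	0x3a00         = 14848 bytes handed to bcopy()
	0x3a00 - 0x140 = 14528 bytes beyond the largest legal bonus length
	0x3a00 - 0x200 = 14336 bytes beyond the end of the 512-byte zio buffer

which is consistent with the roughly 0x39f0 bytes of identical data observed between source and destination earlier.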
Updated by Joshua M. Clulow over 9 years ago
Updated by Jason King over 9 years ago
Dumb question perhaps, but will other parts of ZFS catch this corruption with your attached fix? If not, should some sort of notification take place when we encounter this?
Updated by Joshua M. Clulow over 9 years ago
So, the only time we've ever seen this was in a pool that was already essentially permanently damaged due to running for some time on a machine afflicted with #2638. While so afflicted, we clearly had memory corruption of some in-memory ZFS data that was then checksummed and written to disk (both ZFS and zdb(1M) were perfectly happy to load the apparently-correct blocks).
In this particular case, stat(2) was variously returning EINVAL or ENOENT when the mangled (and now truncated!) data was later used. We fully expect not to see this particular defect again, but it is nonetheless good form to never copy beyond the end of your buffer.