Bug #5117

space map reallocation can cause corruption

Added by Matthew Ahrens about 5 years ago. Updated about 5 years ago.

Status: Closed
Priority: Normal
Category: zfs - Zettabyte File System
Start date: 2014-08-24
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

The system panicked with space map corruption:

$c

vpanic()
vcmn_err+0x23(3, fffffffff7a69098, ffffff0006463d60)
zfs_panic_recover+0x51(fffffffff7a69098)
range_tree_remove+0x2ac(ffffff019ae782c0, 0, 200)
space_map_load+0x210(ffffff019ae08040, ffffff019ae782c0, 1)
metaslab_activate+0xb0(ffffff019a9abc80, 4000000000000000)
...

panic[cpu0]/thread=ffffff0196519ba0:
zfs: freeing free segment (offset=0 size=512)

When reproduced with debug bits, we hit this assertion
in dnode_sync():

ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));
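
For orientation, here's a hedged sketch of the check around that assertion, paraphrased from memory of dnode_sync() rather than quoted verbatim: when a reallocation shrinks dn_nblkptr, every block pointer being dropped is expected to already be a hole.

if (dn->dn_next_nblkptr[txgoff] != 0 &&
    dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr) {
	int i;

	/* the blkptrs we are losing had better be unallocated */
	for (i = dn->dn_next_nblkptr[txgoff]; i < dnp->dn_nblkptr; i++)
		ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));
	dnp->dn_nblkptr = dn->dn_next_nblkptr[txgoff];
}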

We expected all the blocks to be freed by the previous call:

dnode_sync_free_range(dn, rp->fr_blkid, rp->fr_nblks, tx);

space_map_truncate() set up the free_range correctly, covering the entire file
(fr_blkid=0, fr_nblks=UINT64_MAX). However, we subsequently wrote to the first
2 blocks of the file, thus dnode_clear_range()-ing them. This modifies the
free_range to indicate that these dirty blocks need not be freed, because we
are about to overwrite them.

So by the time we got to dnode_sync(), the free_range only covered part of the
file (in this case, fr_blkid=2, fr_nblks=UINT64_MAX). So we left the first 2
blocks in the dn_blkptr[], causing this assertion to fail.
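
To make that concrete, here is a minimal self-contained sketch of the front-trimming that dnode_clear_range() effectively performs on a pending free_range. The struct layout and the helper are illustrative, not the real illumos definitions:

#include <stdint.h>

/* Illustrative pending-free record; fr_blkid/fr_nblks as in the text. */
typedef struct free_range {
	uint64_t fr_blkid;	/* first block to free */
	uint64_t fr_nblks;	/* count; UINT64_MAX means "to end of file" */
} free_range_t;

/*
 * Dirtying the block at the front of the range (because we are about
 * to overwrite it) trims the range from below, so the block will no
 * longer be freed at sync time.
 */
static void
clear_front_block(free_range_t *fr, uint64_t blkid)
{
	if (fr->fr_nblks != 0 && blkid == fr->fr_blkid) {
		fr->fr_blkid++;
		if (fr->fr_nblks != UINT64_MAX)
			fr->fr_nblks--;
	}
	/* splits in the middle of the range are elided in this sketch */
}

Starting from (fr_blkid=0, fr_nblks=UINT64_MAX), dirtying blocks 0 and then 1 leaves exactly the shrunken range described above: (fr_blkid=2, fr_nblks=UINT64_MAX).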

On nondebug bits, we would drive on, setting dn_phys->dn_nblkptr=1. We then
call dnode_increase_indirection(), which will only copy dn_nblkptr (1) blocks
to the new top-level indirect block. The 2nd and 3rd blocks are turned into
holes, and their space is leaked. When we subsequently append to the
space_map, the 3rd block will be written, leaving only the 2nd block (blkid=1)
as a hole, as seen in Duh's crash.
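
For reference, the copy in dnode_increase_indirection() moves only dn_nblkptr block pointers into the new top-level indirect block; paraphrased (not verbatim illumos source):

/*
 * Only the first dn_nblkptr entries of dn_blkptr[] are copied into the
 * new top-level indirect block.  With dn_nblkptr already (wrongly)
 * reduced to 1, the 2nd and 3rd bp's are never copied (their slots in
 * the indirect block stay holes) and never freed: their space leaks.
 */
int nblkptr = dn->dn_phys->dn_nblkptr;	/* 1, after the bogus shrink */

bcopy(dn->dn_phys->dn_blkptr, db->db.db_data,
    sizeof (blkptr_t) * nblkptr);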

This bug could theoretically occur when doing a "zfs receive", but in practice
all of the free-able objects in the ZPL have nblkptr=1, because the system
attributes take up the rest of the space. (And zvols have no free-able
objects.) There is one exception to this: ACLs on ZPL versions 1 and 2, which
have nblkptr=3. However, I think that ACL objects can only ever have a single
block of data, so they also cannot hit this bug.

We can fix this by doing the dnode_increase_indirection() before processing
dn_next_nblkptr. This moves all 3 bp's to the new top-level indirect block, so
the dnode's dn_blkptr[] will be empty when we reduce dn_nblkptr.
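
A hedged sketch of the reordered dnode_sync() logic (paraphrased; the actual change is in the commit referenced in the history below):

/*
 * Handle the level increase first, so all existing bp's are copied out
 * of dn_blkptr[] before dn_nblkptr can shrink.
 */
if (dn->dn_next_nlevels[txgoff]) {
	dnode_increase_indirection(dn, tx);	/* empties dn_blkptr[] */
	dn->dn_next_nlevels[txgoff] = 0;
}

if (dn->dn_next_nblkptr[txgoff] != 0 &&
    dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr) {
	/* every slot we are dropping is now a hole, so this is safe
	 * (the growing case is elided from this sketch) */
	dnp->dn_nblkptr = dn->dn_next_nblkptr[txgoff];
	dn->dn_next_nblkptr[txgoff] = 0;
}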

A few notes from the core dump we gathered, as supporting evidence for the analysis:

Here's the object on disk, before the panicking txg. Note the 3 bp's in the dnode:

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        34    1    16K     4K  24.0K  12.0K  100.00  SPA space map
                                        24  bonus  SPA space map header
        dnode flags: USED_BYTES
        dnode maxblkid: 2
Indirect blocks:
               0 L0 2:1c2a2000:c00 1000L/c00P F=1 B=1638/1638
            1000 L0 2:1d57ea00:c00 1000L/c00P F=1 B=1653/1653
            2000 L0 2:1902fe00:800 1000L/800P F=1 B=1682/1682

                segment [0000000000000000, 0000000000003000) size 12.0K

In the dump there are only 2 bp's; the 3rd is a hole:

ffffff01a2fb3890::print dnode_t dn_phys->dn_blkptr[0] | ::blkptr

DVA0=<2:1c2a2000:c00>
DVA1=<0:203a9e00:c00>
DVA2=<1:1fd2b200:c00>
[L0 SPACE_MAP] FLETCHER_4 LZJB LE contiguous unique triple
size=1000L/c00P birth=1638L/1638P fill=1
cksum=bb34699e50:11512f433085a:10f04ddd366bf9e:c850d2d31977cff4

ffffff01a2fb3890::print dnode_t dn_phys->dn_blkptr[1] | ::blkptr

mdb: index 1 is outside of array bounds [0 .. 0]
DVA0=<2:1d57ea00:c00>
DVA1=<0:2168f800:c00>
DVA2=<1:19109a00:c00>
[L0 SPACE_MAP] FLETCHER_4 LZJB LE contiguous unique triple
size=1000L/c00P birth=1653L/1653P fill=1
cksum=ab24704e14:124dbda0742d1:12c94fc3c018681:e477e68a3a5994b9

ffffff01a2fb3890::print dnode_t dn_phys->dn_blkptr[2] | ::blkptr

mdb: index 2 is outside of array bounds [0 .. 0]
<hole>

How did the 3rd get to be a hole but not the first 2?

Looks like it probably did do a dnode_sync_free_range(); this is the first thing
in ::kmem_log:

ffffff023d4dd7c0::whatis

ffffff023d4dd7c0 is freed from kmem_alloc_40:
            ADDR          BUFADDR        TIMESTAMP           THREAD
ffffff01e7711498 ffffff023d4dd7c0      c4e1ce900e9 ffffff0008007c40
           CACHE          LASTLOG         CONTENTS
ffffff018b227008 ffffff018be7a740 ffffff0190d40ea8
kmem_cache_free_debug+0x10f
kmem_cache_free+0x123
kmem_free+0x4e
dnode_sync+0x2d0
dmu_objset_sync_dnodes+0x93
dmu_objset_sync+0x1b3
dsl_pool_sync+0x364
spa_sync+0x35c
txg_sync_thread+0x250
thread_start+8

If dnode_sync_free_range() got the wrong arguments, it could have freed just
that 3rd BP, rather than all of them. Did we free it?

ffffff0199129000::print -a spa_t spa_free_bplist[3].bpl_list | ::walk list | ::print -a bplist_entry_t bpe_blk | ::blkptr

DVA0=<2:1902fe00:800>
DVA1=<0:219aaa00:800>
DVA2=<1:195d9a00:800>
[L0 SPACE_MAP] FLETCHER_4 LZJB LE contiguous unique triple
size=1000L/800P birth=1682L/1682P fill=1
cksum=7b9004468b:8949515287fa:5ab82be5f80f10:2c916f175cef00e8

Yes, we did! Was it from dnode_sync_free_range()?

ffffff0199129000::print -a spa_t spa_free_bplist[3].bpl_list | ::walk list | ::whatis

ffffff01e7c205e8 is allocated from kmem_alloc_160:
            ADDR          BUFADDR        TIMESTAMP           THREAD
ffffff01e6c3e438 ffffff01e7c205e8      c4e1ce8fb36 ffffff0008007c40
           CACHE          LASTLOG         CONTENTS
ffffff018b229508 ffffff018be7a680 ffffff0190d3ca28
kmem_cache_alloc_debug+0x2e8
kmem_cache_alloc+0xd4
kmem_alloc+0x4b
bplist_append+0x2e
zio_free+0x50
dsl_free+0x1c
dsl_dataset_block_kill+0x652
free_blocks+0x8d
dnode_sync_free_range+0x2e3
dnode_sync+0x2a0
dmu_objset_sync_dnodes+0x93
dmu_objset_sync+0x1b3
dsl_pool_sync+0x364
spa_sync+0x35c
txg_sync_thread+0x250

Yes, it was!

Did we dirty the first 2 blocks, causing the free_range to be reduced?

ffffff01a2fb3890::print -a dnode_t dn_dbufs|::walk list | ::dbuf

addr object lvl blkid holds os
ffffff01c295f2e0 22 1 0 1 mos
ffffff01a2fc03f8 22 0 1 1 mos
ffffff01a2fc0528 22 0 0 1 mos

We have dbufs for the first 2 blocks; are they dirty in this txg?

ffffff01a2fb3890::print -a dnode_t dn_dbufs | ::walk list | ::print dmu_buf_impl_t db_last_dirty->dr_txg

db_last_dirty->dr_txg = 0x693
db_last_dirty->dr_txg = 0x693
db_last_dirty->dr_txg = 0x693

Yes!

History

#1

Updated by Matthew Ahrens about 5 years ago

typo in the bug synopsis; we are talking about space maps here, not astronauts. (I don't see how to change it.)

#2

Updated by Marcel Telka about 5 years ago

  • Subject changed from spaceman reallocation can cause corruption to space map reallocation can cause corruption
#3

Updated by Electric Monk about 5 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit e503a685964805f048b35c2a4e70e0638344f2b7

commit  e503a685964805f048b35c2a4e70e0638344f2b7
Author: George Wilson <george.wilson@delphix.com>
Date:   2014-09-05T16:59:23.000Z

    5117 space map reallocation can cause corruption
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
    Reviewed by: Richard Elling <richard.elling@gmail.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>
