Bug #3695

missing rele in dmu_send_impl

Added by Robert Mustacchi about 7 years ago. Updated about 7 years ago.

Status:
Closed
Priority:
High
Category:
zfs - Zettabyte File System
Start date:
2013-04-04
Due date:
% Done:
0%
Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

We have a panic with the following status:

> ::status
debugging crash dump vmcore.0 (64-bit) from headnode
operating system: 5.11 joyent_20130325T170959Z (i86pc)
image uuid: (not set)
panic message: thread ffffff0a0cee6440 terminating with rrw lock ffffff090a09c2b8 held
dump content: kernel pages only
> $C
ffffff003fc08c50 vpanic()
ffffff003fc08c70 0xfffffffff7d854ce()
ffffff003fc08ca0 tsd_exit+0x79()
ffffff003fc08cc0 thread_exit+0x24()
ffffff003fc08d50 proc_exit+0xa62(2, d)
ffffff003fc08d70 exit+0x15(2, d)
ffffff003fc08e00 psig+0x33b()
ffffff003fc08ec0 post_syscall+0x82d(20, 0)
ffffff003fc08f00 syscall_exit+0x68()
ffffff003fc08f10 0xfffffffffb800ed9()

So we have a thread that is exiting on a SIGPIPE (the second argument to exit(), 0xd, is signal 13) while it still has ZFS thread-specific data (TSD). Unfortunately, at this point it's hard to know which process it was, but based on our expectations, we think it was a zfs send.

> rrw_tsd_key/D
rrw_tsd_key:
rrw_tsd_key:    7               
> ffffff0a0cee6440::tsd -k 7 | ::print rrw_node_t
{
    rn_next = 0
    rn_rrl = 0xffffff090a09c2b8
    rn_tag = __func__.34482
}
> __func__.34482/s
__func__.34482:
__func__.34482: dmu_send_obj

Therefore the hold comes from dmu_send_obj. By code inspection, we know that the hold has to correspond to the dsl_pool_hold. dmu_send_obj itself looks correct on inspection, but dmu_send_impl is more problematic. Specifically, our problem is the call to dump_bytes: if that fails, we goto out. However, every other goto out comes after the calls to dsl_dataset_long_hold() and dsl_pool_rele(). That means that if dump_bytes fails, we never release our hold on the pool. This leaves a dangling hold on the rrw lock, which triggers the panic when the thread exits.
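
The control-flow problem described above can be illustrated with a small user-space sketch. This is not the actual dmu_send_impl code: pool_holds, mock_pool_hold, mock_pool_rele, mock_dump_bytes, send_impl_buggy and send_impl_fixed are all hypothetical stand-ins, and the "fixed" variant is only one possible shape of a fix, not necessarily what the eventual change does. It assumes, as in the analysis, that the caller takes the hold and the callee is expected to drop it on every path.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of the outstanding pool holds for one thread. */
int pool_holds = 0;

/* Stand-ins for taking and dropping the pool hold (not the real ZFS calls). */
void mock_pool_hold(void) { pool_holds++; }
void mock_pool_rele(void) { assert(pool_holds > 0); pool_holds--; }

/* Stand-in for the first write to the stream; fail != 0 models a closed pipe. */
int mock_dump_bytes(int fail) { return (fail ? EPIPE : 0); }

/* Buggy shape: the early 'goto out' skips the release. */
int
send_impl_buggy(int fail_first_write)
{
    int err;

    err = mock_dump_bytes(fail_first_write);
    if (err != 0)
        goto out;           /* hold taken by the caller is leaked here */

    mock_pool_rele();       /* every later 'goto out' is past this point */
    /* ... remainder of the send would go here ... */
out:
    return (err);
}

/* One possible fix: drop the hold before taking the early exit. */
int
send_impl_fixed(int fail_first_write)
{
    int err;

    err = mock_dump_bytes(fail_first_write);
    if (err != 0) {
        mock_pool_rele();
        goto out;
    }

    mock_pool_rele();
    /* ... remainder of the send would go here ... */
out:
    return (err);
}
```

Calling mock_pool_hold() and then send_impl_buggy(1) leaves pool_holds at 1, which models the dangling hold found in the thread's TSD. The same sequence with send_impl_fixed(1) returns it to 0.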

History

#1

Updated by Robert Mustacchi about 7 years ago

A naive prototype fix is available at http://fingolfin.org/illumos/webrev/3695/.

#2

Updated by Matthew Ahrens about 7 years ago

  • Status changed from New to Closed

Great analysis on this. It is a duplicate of #3645.
