Bug #3695
Closed
missing rele in dmu_send_impl
Description
We have a panic with the following status:
> ::status
debugging crash dump vmcore.0 (64-bit) from headnode
operating system: 5.11 joyent_20130325T170959Z (i86pc)
image uuid: (not set)
panic message: thread ffffff0a0cee6440 terminating with rrw lock ffffff090a09c2b8 held
dump content: kernel pages only
> $C
ffffff003fc08c50 vpanic()
ffffff003fc08c70 0xfffffffff7d854ce()
ffffff003fc08ca0 tsd_exit+0x79()
ffffff003fc08cc0 thread_exit+0x24()
ffffff003fc08d50 proc_exit+0xa62(2, d)
ffffff003fc08d70 exit+0x15(2, d)
ffffff003fc08e00 psig+0x33b()
ffffff003fc08ec0 post_syscall+0x82d(20, 0)
ffffff003fc08f00 syscall_exit+0x68()
ffffff003fc08f10 0xfffffffffb800ed9()
So we have a thread that is exiting on a SIGPIPE and still has ZFS thread-specific data (TSD) attached. Unfortunately, at this point it's hard to know which process it was, but based on our expectations, we think it was a zfs send.
> rrw_tsd_key/D
rrw_tsd_key:
rrw_tsd_key:    7
> ffffff0a0cee6440::tsd -k 7 | ::print rrw_node_t
{
    rn_next = 0
    rn_rrl = 0xffffff090a09c2b8
    rn_tag = __func__.34482
}
> __func__.34482/s
__func__.34482:
__func__.34482: dmu_send_obj
Therefore the hold comes from dmu_send_obj. By code inspection, the hold has to be the one taken by dsl_pool_hold. dmu_send_obj itself looks fine, but dmu_send_impl is more problematic. Specifically, the problem is the call to dump_bytes: if it fails, we goto out. However, every other goto out comes after the calls to dsl_dataset_long_hold() and dsl_pool_rele(). That means that if dump_bytes fails, we never release our hold on the pool. The thread is left with a dangling hold on the rrw lock, which triggers the panic when the process exits.
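To make the control-flow problem concrete, here is a minimal user-space sketch of the pattern described above. It is not the ZFS code: pool_hold, pool_rele, write_header, send_buggy and send_fixed are hypothetical stand-ins (write_header plays the role of dump_bytes, and a plain counter stands in for the rrw lock hold). The "fixed" variant shows only one possible shape of a fix, releasing the hold on the early error path; it is not necessarily what the actual change does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the pool hold; a counter, not the real rrw lock. */
static int pool_holds = 0;

static void pool_hold(void) { pool_holds++; }
static void pool_rele(void) { assert(pool_holds > 0); pool_holds--; }

/* Hypothetical stand-in for dump_bytes(): fails with EPIPE when asked to. */
static int write_header(bool fail) { return fail ? EPIPE : 0; }

/*
 * Buggy shape: the caller has already taken the hold. The first
 * "goto out" happens before the release, so a header write failure
 * leaves the hold in place.
 */
static int send_buggy(bool header_fails)
{
	int err = write_header(header_fails);
	if (err != 0)
		goto out;	/* hold is still held on this path */

	pool_rele();		/* every later error path runs after this */
	/* ... stream the body here ... */
out:
	return (err);
}

/* One possible fix (an assumption): release the hold before bailing out. */
static int send_fixed(bool header_fails)
{
	int err = write_header(header_fails);
	if (err != 0) {
		pool_rele();
		goto out;
	}

	pool_rele();
	/* ... stream the body here ... */
out:
	return (err);
}
```

Calling pool_hold() followed by send_buggy(true) leaves pool_holds at 1, which corresponds to the dangling hold seen in the dump; the same sequence with send_fixed(true) brings it back to 0.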
Updated by Robert Mustacchi over 9 years ago
A naive prototype fix is at http://fingolfin.org/illumos/webrev/3695/.
Updated by Matthew Ahrens over 9 years ago
- Status changed from New to Closed
Great analysis on this. It is a duplicate of 3645.