hang in "dmu_tx_wait", likely due to how "cv_signal" is used in "dsl_pool_dirty_delta"
I managed to trigger a situation where multiple threads were "stuck" in dmu_tx_wait.
Some of the threads were due to zpool initialize activity:
ffffff00f6a23c40 SLEEP CV 4 swtch+0x141 cv_wait+0x70 dmu_tx_wait+0x6b dmu_tx_assign+0x55 vdev_initialize_write+0xc1 vdev_initialize_ranges+0xb3 vdev_initialize_thread+0x13d thread_start+8
and another was not:
ffffff31f802b7c0 SLEEP CV 1 swtch+0x141 cv_wait+0x70 dmu_tx_wait+0x6b dmu_tx_assign+0x55 dmu_free_long_range_impl+0xa7 dmu_free_long_range+0x62 zfs_rmnode+0x52 zfs_zinactive+0xe0 zfs_inactive+0x75 fop_inactive+0x76 vn_rele+0x82 zfs_remove+0x6b8 fop_remove+0x5b vn_removeat+0x382 unlinkat+0x59 unlink+0x1b
Based on the function and offset (dmu_tx_wait+0x6b), I'm fairly confident we're stuck on the following cv_wait in dmu_tx_wait:
mutex_enter(&dp->dp_lock); while (dp->dp_dirty_total >= zfs_dirty_data_max) cv_wait(&dp->dp_spaceavail_cv, &dp->dp_lock); uint64_t dirty = dp->dp_dirty_total; mutex_exit(&dp->dp_lock);
dp_dirty_total was verified to be 0 at this time:
> 0xffffff325c0f9740::print dsl_pool_t dp_dirty_total dp_dirty_total = 0
and zfs_dirty_data_max was non-zero:
> zfs_dirty_data_max::print 0x100000000
Background, Hypothesis, Initial Fix
Even though the bug's writeup in JIRA is very detailed, we still don't know exactly how we got to that state, thus I wasn't able to reproduce the bug. That said, we can make an educated guess combining the information on JIRA with the code.
From the fact that dp_dirty_total was 0 (which is less than zfs_dirty_data_max) we know that there was one thread that set it to 0 and then signaled one of the waiters of dp_spaceavail_cv [see dsl_pool_dirty_delta() which is also the only place that dp_dirty_total is changed]. Thus, the only logical explaination then for the bug being hit is that the waiter that just got awaken didn't go through dsl_pool_dirty_data(). Given that this function is only called by dsl_pool_dirty_space() or dsl_pool_undirty_space() I can only think of two possible ways of the above scenario happening:
 The waiter didn't call into any of the two functions - which I find highly unlikely (i.e. why wait on dp_spaceavail_cv to begin with?).
 The waiter did call in one of the above function but it passed 0 as the space/delta to be dirtied (or undirtied) and then the callee returned immediately (e.g both dsl_pool_dirty_space() and dsl_pool_undirty_space() return immediately when space is 0).
In any case and no matter how we got there, the easy fix would be to just broadcast to all waiters whenever dp_dirty_total hits 0. That said and given that we've never hit this before, it would make sense to think more on why the above situation occured.
Observations of pool dirtying/undirtying under different conditions
Attempting to reproduce this issue, I created a dataset with sync=always and started doing contiguous writes in a file within that dataset. I observed with DTrace that even though we update the pool's dirty data accounting when we would dirty stuff, the accounting wouldn't be decremented incrementally as as we were done with the ZIOs of those writes (the reason being that dbuf_write_physdone() isn't be called as we go through the override code paths, and thus dsl_pool_undirty_space() is never called). As a result we'd have to wait until we get to dsl_pool_sync() where we zero out all dirty data accounting for the pool and the current TXG's metadata. The same issue would arise when using dedup.
In both cases (sync & dedup) we shouldn't have to wait until dsl_pool_sync() zeros out the accounting data. According to the comment in that part of the code, the reason's why we do the zeroing, have nothing to do with what we observe:
/* * We have written all of the accounted dirty data, so our * dp_space_towrite should now be zero. However, some seldom-used * code paths do not adhere to this (e.g. dbuf_undirty(), also * rounding error in dbuf_write_physdone). * Shore up the accounting of any dirtied space now.
dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
Ideally what we want to do is to undirty in the accounting exactly what we dirty (I use the word ideally as we can still have rounding errors). This would make the behavior of the system more clear and predictable.
Another interesting issue that I observed with DTrace was that we wouldn't update any of the pool's dirty data accounting whenever we would dirty and/or undirty MOS data. In addition, every time we would change the size of a dbuf through dbuf_new_size() we wouldn't update the accounted space dirtied in the appropriate dirty record, so when ZIOs are done we would undirty less that we dirtied from the pool's accounting point of view.
Fixes Introduced and Final Thoughts
For the first two issues observed (sync & dedup) this patch ensures that we still update the pool's accounting when we undirty data, regardless of the write being physical or not.
For changes in the MOS, we first ensure to zero out the pool's dirty data accounting in dsl_pool_sync() after we synced the MOS. Then we can go ahead and enable the update of the pool's dirty data accounting wheneve we change MOS data.
Another fix is that we now update the accounting explicitly for counting errors in dbuf_write_done().
Finally, dbuf_new_size() updates the accounted space of the appropriate dirty record correctly now.
The problem is that we still don't know how the bug came up in the JIRA ticket. That said the issues fixed seem to be very relevant, so instead of going with the broadcasting solution right away, I decided to leave this patch as is.