some written buffers are not deducted from dirty size
Some written / sync-ed buffers are not deducted from the dirty data size until we clear out all of the remaining dirty space for the sync-ing TXG.
The latter action is taken to handle various rounding errors and rare scenarios.
However, it appears to mask problems with dirty data accounting in not so rare scenarios.
For example, it seems that zio_ddt_write does not propagate the io_physdone callback to any of the child zio-s that it creates.
We observe that during heavy writing to datasets with dedup enabled the dirty data size stays at its maximum value most of the time before dropping to almost zero and quickly growing back again.
I think that zio_ddt_write should pass the io_physdone callback to cio, and maybe to dio as well.
There are also other scenario-s where a buffer is not deducted from the dirty size until the big cleanup.
One example is a buffer that is sync-ed from the open context because of an indirect write log record.
In that case the dirty size is not adjusted in the open context.
It is also not adjusted in the sync-ing context because there is no physical write.
One idea to make the accounting more strict is to split it between the io_physdone and io_done callbacks, or more specifically between dbuf_write_physdone and dbuf_write_done.
dbuf_write_physdone would work as it does now but it would also keep track of how much space it undirties for a buffer / dirty record.
dbuf_write_done would undirty the remaining space.
This way we should be able to cover all scenarios and track the actual dirty space more closely.
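The split described above could be modelled roughly as follows. This is only a sketch, not the actual ZFS code: pool_t, dirty_record_t, the undirtied field and the model_* functions are hypothetical stand-ins invented for illustration, and the per-child share (accounted / phys_children) is an assumption about how the physdone side divides the space.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the pool-wide dirty data counter. */
typedef struct pool {
	uint64_t dirty_total;
} pool_t;

/* Hypothetical stand-in for a dirty record. */
typedef struct dirty_record {
	uint64_t accounted;	/* space charged when the buffer was dirtied */
	uint64_t undirtied;	/* space already given back (proposed new field) */
} dirty_record_t;

static void
pool_undirty_space(pool_t *dp, uint64_t space)
{
	/* Clamp so that the counter can never underflow. */
	if (space > dp->dirty_total)
		space = dp->dirty_total;
	dp->dirty_total -= space;
}

/* Model of the physdone side: called once per completed physical child. */
static void
model_write_physdone(pool_t *dp, dirty_record_t *dr, unsigned phys_children)
{
	uint64_t left, delta;

	if (phys_children == 0)
		return;
	assert(dr->undirtied <= dr->accounted);
	left = dr->accounted - dr->undirtied;
	delta = dr->accounted / phys_children;
	if (delta > left)
		delta = left;
	/* Remember how much this record has given back so far. */
	dr->undirtied += delta;
	pool_undirty_space(dp, delta);
}

/* Model of the done side: called once per logical write. */
static void
model_write_done(pool_t *dp, dirty_record_t *dr)
{
	uint64_t rest;

	assert(dr->undirtied <= dr->accounted);
	/* Whatever physdone did not return is returned here. */
	rest = dr->accounted - dr->undirtied;
	dr->undirtied = dr->accounted;
	pool_undirty_space(dp, rest);
}
```

In this model a 1000-byte record with three physical children gives back 333 bytes on each physdone call and the last byte on the done call, while a record that never gets a physical write gives back its whole size on the done call.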
Updated by Andriy Gapon over 3 years ago
I wonder if anyone has measured the difference in performance between the current dirty data size accounting via io_physdone and accounting via io_done. That is, deducting the size of the whole buffer once it is completely written instead of the current piecemeal algorithm.
Updated by Matthew Ahrens over 3 years ago
Yes, we introduced the io_physdone accounting specifically because doing it in io_done is too coarse-grained. Logical i/os that write to more than one top-level vdev (i.e. those with more than one DVA, aka ditto blocks, which all metadata are) would only be accounted when both DVAs were written. This tended to cause the dirty accounting for metadata writes to jump at the end of the txg, rather than completing smoothly as physical i/o completes.
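A toy model of the difference described above, for a single logical write with several copies. The function names and the even per-copy split are assumptions made for illustration, not the real ZFS logic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Dirty space still charged for one logical write of `size` bytes
 * after `completed` of its `copies` physical writes have finished,
 * when space is returned per physical completion.
 */
static uint64_t
dirty_left_per_physdone(uint64_t size, unsigned copies, unsigned completed)
{
	assert(copies > 0 && completed <= copies);
	return (size - (size / copies) * completed);
}

/*
 * The same quantity when space is returned only once, after the
 * whole logical write has completed.
 */
static uint64_t
dirty_left_per_done(uint64_t size, unsigned copies, unsigned completed)
{
	assert(copies > 0 && completed <= copies);
	return (completed == copies ? 0 : size);
}
```

For a 16384-byte block with two copies, the first function already reports 8192 bytes after one copy is written, while the second stays at 16384 until both copies are done.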
Updated by Stilez y over 2 years ago
This bug has severe impact for devices on 10G LAN. A deduped pool triggers this bug, which can perpetually kill Samba and iSCSI data write sessions.
The buffer handling issue means that on the dedup write path, for periods of milliseconds, data isn't correctly and smoothly drawn from the NIC because ZFS buffers aren't emptying. With 10G, a few milliseconds is long enough to drive the connection to a zero TCP window many times a second, even with a large RCV buffer, so it never gets time to recover. After a few attempts at continuing, remote clients for Samba, iSCSI and other protocols declare a timeout, so the connection dies mid-session after a random time (typically a few minutes, or 5 - 15 GB into a transfer), at the point where, by chance, a number of consecutive client enquiries have been answered with a zero window.
This happens even under optimal load - server and LAN otherwise quiet, one client directly connected, and the only file transfer is a single 30 - 60GB file to be written.
Inquiries on FreeBSD-fs mailing list suggest this is the underlying bug responsible.
Additional info (where this arose, system specs, graphs, debug info, discussion) is here, on the FreeNAS forum, but there is no point duplicating it as it doesn't add much or help.
As there's no easy workaround, the effect is severe, and this bug pretty much kills the ability to use dedup and 10G together (the pool here is highly dedupable at about 4.2x), any attention to this issue would be deeply valued :) thanks