too-frequent TXG sync causes excessive write inflation
ZFS starts syncing a TXG once there is 64MB of dirty data. This can result in very short TXG syncs, which are inefficient because the ratio of metadata to user data is poor. Typically this is not a real performance problem, because it was thought to happen only under very light workloads: as the (write) workload increases, the TXG sync time increases, improving efficiency. However, when the workload is almost entirely sync writes, especially those that use dmu_sync() to write the user data to its final resting place from open context, we can have very short TXGs even under moderate workloads, because spa_sync() isn't writing the actual user data.
The problem is not the short TXGs per se, but rather the high frequency of pushing out TXGs, caused by the low 64MB trigger for starting a sync. Conceptually, for a given workload, each TXG is going to have a fixed amount of overhead (in terms of MB, IOPS, or time taken to write it) that is only loosely coupled to the amount of dirty data (i.e. frequency of TXG sync). Therefore, decreasing TXG sync frequency will decrease the overhead per unit time.
The solution is to increase the amount of dirty data allowed before pushing out a TXG, from 64MB to 20% of zfs_dirty_data_max (820MB if total RAM >= 40GB). When we did this on the customer system in ESCL-658, overall write inflation (by bandwidth) decreased from ~5.5x to ~3x.
Updated by Electric Monk about 1 year ago
- Status changed from New to Closed
- % Done changed from 0 to 100
commit 7928f4baf4ab3230557eb6289be68aa7a3003f38
Author: Matthew Ahrens <email@example.com>
Date:   2018-09-18T20:53:10.000Z

    9617 too-frequent TXG sync causes excessive write inflation
    Reviewed by: Serapheim Dimitropoulos <firstname.lastname@example.org>
    Reviewed by: Brad Lewis <email@example.com>
    Reviewed by: George Wilson <firstname.lastname@example.org>
    Reviewed by: Andrew Stormont <email@example.com>
    Approved by: Robert Mustacchi <firstname.lastname@example.org>