zfs write throttle & i/o scheduler performance work
This review mainly consists of two related performance improvements:
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync read, sync write, async read, async write, and scrub/resilver. The scheduler issues a number of concurrent i/os from each class to the device. Once a class has been selected, an i/o is selected from this class using either an elevator algorithem (async, scrub classes) or FIFO (sync classes). The number of concurrent async write i/os is tuned dynamically based on i/o load, to achieve good sync i/o latency when there is not a high load of writes, and good write throughput when there is. See the block comment in vdev_queue.c for more details.
2. The write throttle (dsl_pool_tempreserve_space() and txg_constrain_throughput()) is rewritten to produce much more consistent delays when under constant load. The new write throttle is based on the amount of dirty data, rather than guesses about future performance of the system. When there is a lot of dirty data, each transaction (e.g. write() syscall) will be delayed by the same small amount. This eliminates the "brick wall of wait" that the old write throttle could hit, causing all transactions to wait several seconds until the next txg opens. One of the keys to the new write throttle is decrementing the amount of dirty data as i/o completes, rather than at the end of spa_sync(). Note that the write throttle is only applied once the i/o scheduler is issuing the maximum number of outstanding async writes. See the block comments in dsl_pool.c and above dmu_tx_delay() for more details.
This diff has several other effects, including:
- the commonly-tuned global variable zfs_vdev_max_pending has been removed; use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
- the size of each txg (meaning the amount of dirty data written, and thus the time it takes to write out) is now controlled differently. There is no longer an explicit time goal; the primary determinant is amount of dirty data. Systems that are under light or medium load will now often see that a txg is always syncing, but the impact to performance (e.g. read latency) is minimal. Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
- zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression, checksum, etc. This improves latency by not allowing these CPU-intensive tasks to consume all CPU (on machines with at least 4 CPU's; the percentage is rounded up).
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem with this is that if there are always i/os pending, then certain classes of i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async writes will be serviced until they become "past due". One symptom of this situation is that each pass of the txg sync takes at least several seconds (typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must service all of these overdue i/os before any new i/os. This happens when we enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in the future. If we can't complete all the i/os in 2.5 seconds (e.g. because there were always reads pending), then these i/os will become past due. Now we must service all the "async" writes (which could be hundreds of megabytes) before we service any reads, introducing considerable latency to synchronous i/os (reads or ZIL writes).
Updated by Christopher Siden almost 8 years ago
- Status changed from In Progress to Closed
commit 69962b5647e4a8b9b14998733b765925381b727e Author: Matthew Ahrens <email@example.com> Date: Mon Aug 26 14:13:26 2013 4045 zfs write throttle & i/o scheduler performance work Reviewed by: George Wilson <firstname.lastname@example.org> Reviewed by: Adam Leventhal <email@example.com> Reviewed by: Christopher Siden <firstname.lastname@example.org> Reviewed by: Ned Bass <email@example.com> Reviewed by: Brendan Gregg <firstname.lastname@example.org> Approved by: Robert Mustacchi <email@example.com>