lock contention on arcs_mtx
This patch attempts to reduce lock contention on the current arc_state_t
mutexes. These mutexes are used liberally to protect the number of LRU
lists within the ARC (e.g. ARC_mru, ARC_mfu, etc). The granularity at
which these locks are acquired has been shown to greatly affect the
performance of highly concurrent, cached workloads.
To illustrate this performance impact, I used fio to run a benchmark
on a system with 32GB of RAM and varied the number of CPUs on the system
from 1 to 32 CPUs, testing each power of two number of CPUs. The workload
remained constant for each test, and consisted of 32 threads each randomly
reading from unique files, consisting of 16GB worth of data in total (each
file was 512MB in size). The ZFS filesystem was configured to have a
recordsize of 8K, and the fio threads were configured to use a block size
Without this patch, when running with 32 CPUs, fio reported 388.6 thousand
IOPs during a 60 second run of the workload, which equates to about
3036.2 MB/s of aggregate read bandwidth. With this patch applied (again,
using 32 CPUs) fio reported 1266.8 thousand IOPs during a 60 second run,
which equates to about 9896.6 MB/s of aggregate read bandwidth. That's a
225% increase in the aggregate read bandwidth when using 32 CPUs.
As I mentioned above, I also ran the same workload while varying the
number of CPUs on the system, ranging from 1 CPU to 32 CPUs. The table
below shows the aggregate bandwidth results from those tests when running
on master, and the results when running with this patch applied:
| patched |
-------------+ | # CPUS | Bandwidth |
-------------+ | 1 | 833163KB/s | | 2 | 1520.2MB/s | | 4 | 2985.3MB/s | | 8 | 5913.2MB/s | | 16 | 8938.9MB/s | | 32 | 9896.6MB/s |
| master |
-------------+ | # CPUS | Bandwidth |
-------------+ | 1 | 840893KB/s | | 2 | 1559.5MB/s | | 4 | 3022.1MB/s | | 8 | 5104.9MB/s | | 16 | 3854.3MB/s | | 32 | 3036.2MB/s |
As you can see, the aggregate bandwidth scales much better when adding
more CPUs with this patch applied, than it does without this patch.
To achieve this, the single LRU and mutex that is used for each of the
arc states (e.g. ARC_mru, ARC_mfu, etc) has been split up into multiple
LRU lists each with it's own mutex. So, for the above test, 32
"sublists" were used for each arc state; one sublist per CPU on the
system. The multilist_t object was introduced to make this transition
from one LRU to many LRUs.
In addition, the logic performing reclamation from the ARC was reworked.
Now, only the arc_reclaim_thread will call the arc_evict() and
arc_evict_ghost() functions, whereas arc_evict() could previously be
called from arc_get_data_buf() (although, it can still be called from
dsl_pool_close() and "zinject -a"). This was done to try and prevent
multiple threads simultaneously contending on the arc state locks, all
trying to reclaim simultaneously. A secondary reason for this change is
it subjectively makes the code easier to understand as it removes some
edge cases (e.g. the "recycle" case, and mru vs. mfu decision in
One concern that this change introduces is "fairness" in the reclaim
logic. Now that each arc state is composed of multiple independent but
related LRUs, it's more difficult to determine which object is the
"oldest" across all of the LRUs composing each arc state. Thus, it is
possible for a buffer that isn't strictly the "oldest" in the set of
LRUs to be evicted out of order (e.g. before the "oldest" buffer). I've
reasoned this concern away, by assuming (and measuring during testing)
that buffers will be inserted into each arc state's LRU in a even
distribution across all sublists; thus, simply evicting the "oldest"
buffer(s) in one randomly selected sublist should be "good enough".
To work around a potential 3-way deadlock involving the multilist sublist
locks and the l2ad_mtx, the l2ad_mtx locking had to be slightly reworked
(we need to grab the sublist lock prior to the l2ad_mtx) in
l2arc_write_buffers. To accomodate these locking changes, the reclaim
logic surrounding l2arc buffers needed to changed as well, due to a couple
regressions introduced. The locking changes are straight forward, so I'll
refrain from discussing them here, but the changes to the reclaim logic
are more note worthy:
1. We need to be careful not to evict a header to the arc_l2c_only
state that's concurrently being written to the l2arc device. Since
the buffer isn't written out with the hash lock held, we must
explicitly account for this in the reclaim logic. While not a common
case, it is possible, so we must account for it. Now, if this was to
happen, the buffer will be skipped during eviction.
2. Now that the l2ad_mtx isn't held throughout the whole execution of
l2arc_write_header, we must be careful to place a header realloc'ed
to the arc_l2c_only state back into the l2ad's buflist in the same
location. Previously, the new header would be inserted into the
head of the buflist, which would allow the header to be re-written
out in l2arc_write_buffers (even though it no longer has a b_l1hdr).
3. And finally, if we encounter a hash lock collision in
l2arc_write_done(), we wait for the lock to become uncontended and
retry. Unlike the previous behavior, we can't skip the header, leaving
its ARC_FLAG_L2_WRITING bit set, since that would prevent the buffer
from being evicted from the ghost list until it was evicted from
the L2ARC device.
The last big change introduced was the call to arc_do_user_evicts() was
removed from arc_reclaim_thread(), and split out into a completely
separate thread. This was due to a deadlock being introduced as a result
of the change to arc_get_data_buf(). When arc_get_data_buf() sleeps
waiting for the eviction thread to evict buffers, it does so holding the
hash lock. The eviction function is carefully crafted as to not sleep
on any arc header's hash lock (e.g. using mutex_tryenter). Unfortunately,
the callbacks in arc_do_user_evicts() are not so careful, so we must
execute them in such a way that they cannot block the reclaim thread
waiting on to acquire a header's hash lock.
A few miscellaneous notes:
A couple new static dtrace probes were introduced to help measure
the evenness of a given multilist structure: 'multilist-insert' and
The tunable zfs_arc_num_sublists_per_state was added to allow a user
to configure the number of sublists to user for each arc state.
The tunable zfs_arc_overflow_shift was added to configure how far
past arc_c we're allowed to grow, before arc_get_data_buf() will
sleep and wait for the reclaim thread to free up some space.
The tunable arc_evict_iterations has been removed and replaced with
zfs_arc_evict_batch_limit, which defines the number of headers to
evict from each sublist before moving on to another sublist.
A couple new arcstats were added to track the new L2ARC reclaim
logic that's been added. These stats are "evict_l2_skip" and
A new kstat was added, "evict_not_enough", to track the number of
times arc_evict_state() is not able to meet it's target number of
bytes to evict from a given arc state.