Bug #3040
performance: remove rrw_lock
10%
Description
ZFS uses its own rrw_lock.
rrw_lock is not a real rw_lock: internally it is built on a mutex.
Therefore it causes HIGH CPU load when multiple "readers" try to take the rrw_lock for READ.
Below is lockstat output on 16 CPUs with a plain "getattr" test:
for i in `seq 1 16`; do
./test_stat file-$i;
done
lockstat -b sleep 5
Adaptive mutex spin: 1556009 events in 5.157 seconds (301739 events/sec)
Count indv cuml rcnt Lock Caller
-------------------------------------------------------------------------------
845878 54% 54% 0.00 0xffffff0252a40550 rrw_enter_read+0x22
709974 46% 100% 0.00 0xffffff0252a40550 rrw_exit+0x1e
21 0% 100% 0.00 0xffffff02523c6740 dtrace_systrace_syscall32+0xaa
Files
Updated by Vitaliy Gusev over 8 years ago
Solution: replace rrw_lock with OS rw_lock.
If that introduces deadlocks, add a property to rw_lock that enables nested read locks without their being blocked by waiting writers:
Example of that property:
rw_set_property(NESTED_READERS);
Once set, it enables "Linux" rwlock behavior: waiting writers cannot block readers, and cannot obtain the lock until all readers have released it.
Updated by Matthew Ahrens over 8 years ago
- Category set to zfs - Zettabyte File System
- Status changed from New to Feedback
Please update this report with:
(a) how much performance could improve if we removed the lock entirely, or changed it to a non-recursive rwlock, for a workload that matters. The test you ran doesn't mean anything to me because you don't say what ./test_stat does.
(b) why the z_teardown_lock needs to be recursive (try a write() syscall from an address that is backed by an mmap'ed ZFS file), and how you would address this requirement (by removing the requirement, or by implementing a better recursive rwlock).
Updated by Vitaliy Gusev over 8 years ago
- File teststat4.c teststat4.c added
- Status changed from Feedback to In Progress
- % Done changed from 0 to 10
a) According to the lockstat and dtrace output, a lot of time is spent in rrw_enter_read and rrw_exit, whereas this should be a fast path with no delay, since nobody is calling "zfsvfs_teardown" or anything similar.
mpstat shows the following:
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 1206 100 78 29 6 8878 0 30245 4 96 0 0
1 0 0 0 986 12 47 27 4 8868 0 29712 3 97 0 0
2 0 0 0 985 15 27 16 2 8878 0 28996 3 97 0 0
3 0 0 0 985 1 34 21 1 9475 0 31872 3 97 0 0
4 0 0 0 999 1 37 22 5 10364 0 34671 3 97 0 0
5 0 0 0 997 1 35 19 3 10636 0 35906 3 97 0 0
6 0 0 0 1005 14 53 29 3 11004 0 36987 3 97 0 0
7 0 0 3 1017 13 60 31 3 11807 0 39968 3 97 0 0
8 2 0 0 1004 6 25 17 3 8982 0 44097 5 95 0 0
9 0 0 0 983 3 63 12 3 9055 0 43354 5 95 0 0
10 0 0 0 986 1 11 9 0 8984 0 43765 4 96 0 0
11 0 0 0 995 12 13 12 1 9099 0 44471 4 96 0 0
12 0 0 0 986 6 24 13 2 9143 0 45977 4 96 0 0
13 0 0 0 1013 15 41 22 3 9442 0 47095 4 96 0 0
14 0 0 0 1004 14 36 21 1 9672 0 48095 4 96 0 0
15 0 0 0 987 4 14 10 1 9835 0 49737 4 96 0 0
It is about 40K syscalls/sec per CPU.
After simply removing the rrwlock (test variant):
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 1236 104 91 34 5 19 0 239799 18 82 0 0
1 0 0 18 984 1 51 30 4 34 0 281377 16 84 0 0
2 0 0 0 972 23 28 19 3 39 0 297364 17 83 0 0
3 0 0 0 1062 52 117 63 2 27 0 249862 18 82 0 0
4 0 0 0 1002 3 43 28 2 34 0 252618 19 81 0 0
5 0 0 0 983 1 25 18 3 32 0 259577 19 81 0 0
6 0 0 0 977 1 17 14 2 25 0 254605 18 82 0 0
7 0 0 0 1021 12 83 37 5 25 0 216745 16 84 0 0
8 2 0 0 1002 6 20 17 2 32 0 249552 19 81 0 0
9 0 0 0 1002 5 35 15 3 35 0 250925 18 82 0 0
10 0 0 0 991 1 5 9 0 31 0 253455 19 81 0 0
11 0 0 0 999 12 7 10 1 31 0 249123 17 83 0 0
12 0 0 0 968 0 1 7 0 33 0 304640 17 83 0 0
13 0 0 0 995 3 8 11 1 32 0 243559 18 82 0 0
14 0 0 0 986 0 1 8 0 33 0 264971 19 81 0 0
15 0 0 0 994 0 26 11 1 22 0 246027 19 81 0 0
It is about 240K syscalls/sec per CPU, more than a 5x improvement.
Updated by Vitaliy Gusev over 8 years ago
b) rrwlock was introduced by commit:
changeset: 5326:6752aa2bd5bc
user: ek110237
date: Wed Oct 24 16:54:46 2007 -0700
6425096 want online 'zfs recv' (read only and read/write)
6597182 .zfs/snapshot code could use a little more comments
I don't know where recursion is required, but I'll check. If it is needed, I want to introduce a suitable property on krw_lock; see my comment in #note-1.
Also, I think there are many places in the kernel where that property would be useful.
Updated by Matthew Ahrens over 8 years ago
- Status changed from In Progress to Feedback
I don't understand how you are pegging 16 CPUs with only 1 thread.
I get 1.6 Million syscl/sec using your test (1 thread on 1 CPU), on a not especially fast virtual machine.
Are you sure you are not running debug bits?
Updated by Vitaliy Gusev over 8 years ago
Matthew Ahrens wrote:
I don't understand how you are pegging 16 CPUs with only 1 thread.
Matthew, I described it in the bug description:
for i in `seq 1 16`; do
./test_stat file-$i;
done
I get 1.6 Million syscl/sec using your test (1 thread on 1 CPU), on a not especially fast virtual machine.
I run the test on different files on the same vfs. The problem occurs when the system has 4, 8, or more cores.
Are you sure you are not running debug bits?
I tested both variants. Just make sure that you run as many test instances as there are CPUs.
Updated by Vitaliy Gusev over 8 years ago
Sorry, the tests should be run in the background:
for i in `seq 1 16`; do
./test_stat file-$i &
done
Updated by Matthew Ahrens over 8 years ago
I saw ~ 1 million syscl/sec/CPU with 4 threads and 4 CPUs. Performance degraded with 8 threads and 8 CPUs, to ~400K syscl/sec/CPU.
Are you seeing lock contention on a real-world workload?
Would be interesting to see what the performance of a plain rwlock_t would be, as I imagine that would be the best possible.
Updated by Vitaliy Gusev over 8 years ago
Matthew Ahrens wrote:
I saw ~ 1 million syscl/sec/CPU with 4 threads and 4 CPUs. Performance degraded with 8 threads and 8 CPUs, to ~400K syscl/sec/CPU.
That's right. It is a CPU-scaling problem: the more CPUs you run on, the less total performance you get.
Are you seeing lock contention on a real-world workload?
Yes. Why not? All services (smb, nfs, ...) can access many files on the same vfs from many CPUs.
Would be interesting to see what the performance of a plain rwlock_t would be, as I imagine that would be the best possible.
I replaced the rrw with an rwlock, and it showed a ~1.6x performance improvement.
Abridged mpstat output (syscl column only):
syscl 19712 21714 22852 22394 23966 23410 20111 24155 103594 93337 96387 103220 104632 102914 102049 90828
Not so much either...
A dtrace profile-2000 showed:
unix`atomic_cas_64+0x8
zfs`zfs_getattr+0xc18
genunix`fop_getattr+0xad
genunix`cstat32+0x47
genunix`fstat32+0x52
genunix`fxstat32+0x27
genunix`dtrace_systrace_syscall32+0x11a
unix`sys_syscall32+0xff
Updated by Vitaliy Gusev over 8 years ago
For the original kernel, a dtrace profile-1000 shows this stack:
unix`default_lock_delay+0x8c
unix`mutex_vector_enter+0x2ad
zfs`rrw_enter_read+0x22
zfs`rrw_enter+0x2c
zfs`zfs_getattr+0x6f
genunix`fop_getattr+0xad
genunix`cstat32+0x47
genunix`fstat32+0x52
genunix`fxstat32+0x27
genunix`dtrace_systrace_syscall32+0x11a
unix`sys_syscall32+0xff
35235