Bug #3040

performance: remove rrw_lock

Added by Vitaliy Gusev about 8 years ago. Updated about 8 years ago.

Status: Feedback
Priority: Normal
Assignee:
Category: zfs - Zettabyte File System
Start date: 2012-07-30
Due date:
% Done: 10%
Estimated time:
Difficulty: Hard
Tags: needs-triage
Gerrit CR:
Description

ZFS uses its own rrw_lock.

rrw_lock is not a real rw_lock: internally it is built on a mutex.

As a result, CPU load gets very high when multiple readers try to acquire the rrw_lock for READ.
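For reference, the reader path of ZFS's rrwlock is shaped roughly like the simplified sketch below (this is not the exact source; details are omitted). The point is that every reader, on both enter and exit, serializes on the same mutex:

/*
 * Simplified sketch of rrw_enter_read(): all readers funnel through the
 * same adaptive mutex rr_lock, which is exactly where lockstat sees the
 * spinning below.
 */
void
rrw_enter_read(rrwlock_t *rrl, void *tag)
{
        mutex_enter(&rrl->rr_lock);             /* every reader contends here */
        while (rrl->rr_writer != NULL ||
            (rrl->rr_writer_wanted && refcount_is_zero(&rrl->rr_anon_rcount)))
                cv_wait(&rrl->rr_cv, &rrl->rr_lock);
        (void) refcount_add(&rrl->rr_anon_rcount, tag);  /* record the read hold */
        mutex_exit(&rrl->rr_lock);
}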

Below is an example of lockstat output on 16 CPUs with a plain "getattr" test:

for i in `seq 1 16`; do
    ./test_stat file-$i;
done

lockstat -b sleep 5

Adaptive mutex spin: 1556009 events in 5.157 seconds (301739 events/sec)

Count indv cuml rcnt Lock Caller
-------------------------------------------------------------------------------
845878 54% 54% 0.00 0xffffff0252a40550 rrw_enter_read+0x22
709974 46% 100% 0.00 0xffffff0252a40550 rrw_exit+0x1e
21 0% 100% 0.00 0xffffff02523c6740 dtrace_systrace_syscall32+0xaa
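The test_stat binary used above (attached below as teststat4.c) just calls stat(2) in a loop. For readers without the attachment, here is a minimal sketch of such a program; it is only an assumption of what teststat4.c roughly does, and the attached file may differ:

/* Sketch of a test_stat-style loop (assumed shape of teststat4.c). */
#include <sys/stat.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
        struct stat st;

        if (argc < 2) {
                (void) fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return (1);
        }
        for (;;) {
                if (stat(argv[1], &st) != 0) {
                        perror("stat");
                        return (1);
                }
        }
        /* NOTREACHED */
}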


Files

teststat4.c (771 Bytes) - test_stat: just calls sys_stat in a loop. Vitaliy Gusev, 2012-07-31 12:32 PM

History

#1

Updated by Vitaliy Gusev about 8 years ago

Solution: replace rrw_lock with the OS rw_lock.

If that introduces a deadlock, add a property for rw_lock that enables nested read locks without readers being blocked by waiting writers.

Example of that property:

rw_set_property(NESTED_READERS);

Once the property is set, it enables "Linux" rwlock behavior: waiting writers cannot block readers and cannot obtain the lock until all readers have released the rwlock.
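To make the proposal concrete, usage could look like the sketch below. Note that rw_set_property() and RW_NESTED_READERS are hypothetical names for the proposed interface; they do not exist in the current krwlock API:

/*
 * Hypothetical usage of the proposed property (rw_set_property() and
 * RW_NESTED_READERS do not exist today). With the property set, a waiting
 * writer never blocks readers, so a nested read enter cannot deadlock.
 */
static krwlock_t z_teardown_lock;

static void
teardown_lock_init(void)
{
        rw_init(&z_teardown_lock, NULL, RW_DEFAULT, NULL);
        rw_set_property(&z_teardown_lock, RW_NESTED_READERS);
}

static void
reader_path(void)
{
        rw_enter(&z_teardown_lock, RW_READER);
        /* ... may re-enter as RW_READER on the same thread without deadlock ... */
        rw_exit(&z_teardown_lock);
}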

#2

Updated by Matthew Ahrens about 8 years ago

  • Category set to zfs - Zettabyte File System
  • Status changed from New to Feedback

Please update this report with:

(b) how much performance could improve if we removed the lock entirely, or changed it to a non-recursive rwlock, for a workload that matters. The test you ran doesn't mean anything to me because you don't say what ./test_stat does.

(a) why the z_teardown_lock needs to be recursive (try a write() syscall from an address that is backed by an mmap'ed ZFS file, as sketched below), and how you would address this requirement (by removing the requirement, or by implementing a better recursive rwlock).
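A userland sketch of the scenario in (a), with illustrative file paths on an assumed ZFS pool "tank": write(2) enters z_teardown_lock as a reader, and copying from the mmap'ed source address can page-fault back into ZFS on the same thread, which takes the same lock again.

/*
 * Sketch of the recursion scenario: write(2) holds the teardown lock as a
 * reader; the copy from the mmap'ed ZFS-backed buffer can fault back into
 * ZFS (the getpage path), which enters the same lock again on the same
 * thread. Paths are illustrative.
 */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
        struct stat st;
        char *buf;
        int src = open("/tank/src", O_RDONLY);                  /* file on ZFS */
        int dst = open("/tank/dst", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (src == -1 || dst == -1 || fstat(src, &st) != 0)
                return (1);
        buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, src, 0);
        if (buf == MAP_FAILED)
                return (1);
        /* the copy of buf inside write() may fault back into ZFS */
        (void) write(dst, buf, st.st_size);
        return (0);
}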

#3

Updated by Vitaliy Gusev about 8 years ago

a) According to the lockstat and dtrace output, a lot of time is spent in rrw_enter_read and rrw_exit, even though this should be a fast path with no delay at all, since nobody is calling zfsvfs_teardown or anything similar.

mpstat shows the following:
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    0  1206  100   78   29    6 8878    0 30245    4  96   0   0
   1    0   0    0   986   12   47   27    4 8868    0 29712    3  97   0   0
   2    0   0    0   985   15   27   16    2 8878    0 28996    3  97   0   0
   3    0   0    0   985    1   34   21    1 9475    0 31872    3  97   0   0
   4    0   0    0   999    1   37   22    5 10364    0 34671    3  97   0   0
   5    0   0    0   997    1   35   19    3 10636    0 35906    3  97   0   0
   6    0   0    0  1005   14   53   29    3 11004    0 36987    3  97   0   0
   7    0   0    3  1017   13   60   31    3 11807    0 39968    3  97   0   0
   8    2   0    0  1004    6   25   17    3 8982    0 44097    5  95   0   0
   9    0   0    0   983    3   63   12    3 9055    0 43354    5  95   0   0
  10    0   0    0   986    1   11    9    0 8984    0 43765    4  96   0   0
  11    0   0    0   995   12   13   12    1 9099    0 44471    4  96   0   0
  12    0   0    0   986    6   24   13    2 9143    0 45977    4  96   0   0
  13    0   0    0  1013   15   41   22    3 9442    0 47095    4  96   0   0
  14    0   0    0  1004   14   36   21    1 9672    0 48095    4  96   0   0
  15    0   0    0   987    4   14   10    1 9835    0 49737    4  96   0   0

That is about 40K syscalls/sec per CPU.

After simply removing the rrwlock (test variant):


 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0    0  1236  104   91   34    5   19    0 239799   18  82   0   0
   1    0   0   18   984    1   51   30    4   34    0 281377   16  84   0   0
   2    0   0    0   972   23   28   19    3   39    0 297364   17  83   0   0
   3    0   0    0  1062   52  117   63    2   27    0 249862   18  82   0   0
   4    0   0    0  1002    3   43   28    2   34    0 252618   19  81   0   0
   5    0   0    0   983    1   25   18    3   32    0 259577   19  81   0   0
   6    0   0    0   977    1   17   14    2   25    0 254605   18  82   0   0
   7    0   0    0  1021   12   83   37    5   25    0 216745   16  84   0   0
   8    2   0    0  1002    6   20   17    2   32    0 249552   19  81   0   0
   9    0   0    0  1002    5   35   15    3   35    0 250925   18  82   0   0
  10    0   0    0   991    1    5    9    0   31    0 253455   19  81   0   0
  11    0   0    0   999   12    7   10    1   31    0 249123   17  83   0   0
  12    0   0    0   968    0    1    7    0   33    0 304640   17  83   0   0
  13    0   0    0   995    3    8   11    1   32    0 243559   18  82   0   0
  14    0   0    0   986    0    1    8    0   33    0 264971   19  81   0   0
  15    0   0    0   994    0   26   11    1   22    0 246027   19  81   0   0

That is about 240K syscalls/sec per CPU, i.e. more than a 5x improvement.

#4

Updated by Vitaliy Gusev about 8 years ago

b) rrwlock was introduced by commit:

changeset:   5326:6752aa2bd5bc
user: ek110237
date: Wed Oct 24 16:54:46 2007 -0700
6425096 want online 'zfs recv' (read only and read/write)
6597182 .zfs/snapshot code could use a little more comments

I don't know where recursion is actually required, but I'll check. If it is needed, I want to introduce a suitable property on krw_lock; see my comment at #note-1.

I also think there are a lot of places in the kernel where that property would be useful.
#5

Updated by Matthew Ahrens about 8 years ago

  • Status changed from In Progress to Feedback

I don't understand how you are pegging 16 CPUs with only 1 thread.

I get 1.6 Million syscl/sec using your test (1 thread on 1 CPU), on a not especially fast virtual machine.

Are you sure you are not running debug bits?

#6

Updated by Vitaliy Gusev about 8 years ago

Matthew Ahrens wrote:

> I don't understand how you are pegging 16 CPUs with only 1 thread.

Matthew, I described it in the bug description:

for i in `seq 1 16`; do
./test_stat file-$i;
done

> I get 1.6 Million syscl/sec using your test (1 thread on 1 CPU), on a not especially fast virtual machine.

I run the test on different files on the same vfs. The problem shows up when the system has 4, 8, or more cores.

> Are you sure you are not running debug bits?

I tested both variants. Just make sure that you run as many test instances as there are CPUs.

#7

Updated by Vitaliy Gusev about 8 years ago

Sorry, the test should be run in the background:

 for i in `seq 1 16`; do
        ./test_stat file-$i &
 done

#8

Updated by Matthew Ahrens about 8 years ago

I saw ~ 1 million syscl/sec/CPU with 4 threads and 4 CPUs. Performance degraded with 8 threads and 8 CPUs, to ~400K syscl/sec/CPU.

Are you seeing lock contention on a real-world workload?

Would be interesting to see what the performance of a plain rwlock_t would be, as I imagine that would be the best possible.

#9

Updated by Vitaliy Gusev about 8 years ago

Matthew Ahrens wrote:

> I saw ~ 1 million syscl/sec/CPU with 4 threads and 4 CPUs. Performance degraded with 8 threads and 8 CPUs, to ~400K syscl/sec/CPU.

That's right. It is a CPU scaling problem: the more CPUs you run the test on, the lower the total performance.

> Are you seeing lock contention on a real-world workload?

Yes. Why not? All services (smb, nfs, ...) can access many files on the same vfs from many CPUs.

> Would be interesting to see what the performance of a plain rwlock_t would be, as I imagine that would be the best possible.

I replaced rrw with rwlock and it showed a ~1.6x performance improvement:

Trimmed mpstat output (syscl column only):

    syscl
    19712
    21714
    22852
    22394
    23966
    23410
    20111
    24155
   103594
    93337
    96387
   103220
   104632
   102914
   102049
    90828

Not that much either...

A dtrace profile-2000 showed this hot stack:

unix`atomic_cas_64+0x8
zfs`zfs_getattr+0xc18
genunix`fop_getattr+0xad
genunix`cstat32+0x47
genunix`fstat32+0x52
genunix`fxstat32+0x27
genunix`dtrace_systrace_syscall32+0x11a
unix`sys_syscall32+0xff
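
For clarity, the "replaced rrw with rwlock" test variant presumably amounts to something like the sketch below. This is an assumption about the experiment, not the actual patch; the field and call sites are shown only to illustrate the shape of the change:

/*
 * Sketch only: the recursive rrwlock on the zfsvfs is swapped for a plain
 * krwlock_t, so the reader path in zfs_getattr() and friends becomes a
 * straight rw_enter()/rw_exit() pair.
 */
rw_enter(&zfsvfs->z_teardown_lock, RW_READER);  /* was rrw_enter(..., RW_READER, FTAG) */
/* ... getattr work ... */
rw_exit(&zfsvfs->z_teardown_lock);              /* was rrw_exit(..., FTAG) */

The atomic_cas_64 at the top of the profile above would then just be the plain rwlock's atomic update of its shared lock word; with 16 CPUs hammering the same cache line that still limits scaling, which is presumably why the gain is only ~1.6x.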
#10

Updated by Vitaliy Gusev about 8 years ago

For the original kernel, a dtrace profile-1000 shows this stack:

unix`default_lock_delay+0x8c
unix`mutex_vector_enter+0x2ad
zfs`rrw_enter_read+0x22
zfs`rrw_enter+0x2c
zfs`zfs_getattr+0x6f
genunix`fop_getattr+0xad
genunix`cstat32+0x47
genunix`fstat32+0x52
genunix`fxstat32+0x27
genunix`dtrace_systrace_syscall32+0x11a
unix`sys_syscall32+0xff
35235
