Bug #13058

high contention on root vnode mutex during a lookup-heavy workload

Added by Mateusz Guzik 4 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

The box at hand is quite old: a 2 socket * 10 cores * 2 threads Sandy Bridge. There is an lx-zone with Debian performing an incremental build of a "make allyesconfig" kernel. The system is SunOS smartos 5.11 joyent_20200813T030805Z i86pc i386 i86pc. The bottleneck is present in stock illumos kernels; the workload below is just a handy reproducer.

Like so:
[root@smartos ~]# time zlogin 9c3a771b-e317-c11f-c9a1-fa13a393c1b7 make -C /root/linux-5.3-rc8 -s -j 40 bzImage

real 0m33.587s
user 3m9.938s
sys 11m12.696s

lockstat leaves no doubt:

Adaptive mutex spin: 49453800 events in 66.364 seconds (745187 events/sec)
Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
17829329  36%  36% 0.00    45619 0xfffffeb2244e3400     vn_rele+0x20
16948900  34%  70% 0.00    48157 0xfffffeb2244e3400     lookuppnatcred+0xa5
5862413  12%  82% 0.00    28543 0xfffffeb2269db240     vn_rele+0x20
5263907  11%  93% 0.00    29813 0xfffffeb2269db240     lookuppnatcred+0xe7
1340556   3%  96% 0.00      935 0xfffffeb1c8534220     dnlc_lookup+0x8f
315749   1%  96% 0.00     1032 0xfffffeb224721b00     vn_rele+0x20
277099   1%  97% 0.00      666 0xfffffeb1c846e7a8     dnlc_lookup+0x8f
220136   0%  97% 0.00      778 0xfffffeb224721b00     dnlc_lookup+0x13c
118131   0%  97% 0.00      807 0xfffffeb227073b00     vn_rele+0x20
101586   0%  98% 0.00      724 0xfffffeb2247ddf48     zfs_fastaccesschk_execute+0x61

Printing the vnode containing the most contended mutex reveals it is the zone's root.

v_count distribution when entering vn_rele:

[root@smartos ~]# dtrace -n 'fbt::vn_rele:entry { @ = quantize(args[0]->v_count); }'
dtrace: description 'fbt::vn_rele:entry ' matched 1 probe

           value  ------------- Distribution ------------- count    
               0 |                                         0        
               1 |                                         1232803  
               2 |@@@@@@@@@@@@@@@@@@                       87972877 
               4 |@@@@@                                    24666657 
               8 |@@@@                                     19437996 
              16 |                                         721687   
              32 |@                                        4536612  
              64 |@@@@@@@                                  34993206 
             128 |@@@@@                                    22456804 
             256 |                                         1917     
             512 |                                         0

The suggested fix is to employ a cas loop which modifies v_count as long as it is not 1, and resorts to locking otherwise. While this still suffers from scalability limitations, it should be significantly faster in the face of contention (and also faster single-threaded). Moreover, the beginning of the lookup can blindly increment the counts of both the root and cwd vnodes, since the caller already holds a reference keeping them above 1. This should retain all currently provided guarantees.
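A minimal userspace sketch of the scheme using C11 atomics (vn_rele_fast and the bare counter are illustrative, not the actual illumos patch): drop a reference with a cas loop as long as the count is greater than 1; only when it is 1 (the vnode may need to go inactive) fall back to the mutex-protected slow path.

```c
#include <stdatomic.h>

/*
 * Hypothetical fast path for vn_rele. Returns 1 if the reference was
 * dropped locklessly, 0 if the caller must take v_lock and run the
 * existing mutex-protected vn_rele logic (count would hit zero).
 */
int
vn_rele_fast(_Atomic unsigned int *countp)
{
	unsigned int old = atomic_load(countp);

	while (old > 1) {
		/* On failure the CAS reloads 'old', so we simply retry. */
		if (atomic_compare_exchange_weak(countp, &old, old - 1))
			return (1);	/* dropped without the mutex */
	}
	return (0);	/* count == 1: fall back to the locked slow path */
}
```

The blind-increment half is even simpler: when the caller is known to already hold a reference (as with the zone root and cwd at lookup start), an atomic_fetch_add on the count suffices, since the count cannot drop to zero underneath it.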

With some more work it may be possible to avoid refing cwd/root around the lookup altogether. One idea is to create a wrapper struct which can be refed instead (once more without mutexes), so the hot vnode's count is only manipulated when the wrapper itself comes and goes.
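A rough userspace sketch of such a wrapper (all names hypothetical; the real thing would live in the kernel and interact with VN_HOLD/VN_RELE): the wrapper takes one long-lived reference on the underlying vnode, and lookups ref the wrapper with plain atomics instead of touching the vnode's mutex-protected count.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct vnode;				/* opaque for this sketch */

struct vref {
	_Atomic unsigned int vr_count;	/* lock-free reference count */
	struct vnode *vr_vp;		/* held once for wrapper lifetime */
};

struct vref *
vref_alloc(struct vnode *vp)
{
	struct vref *vr = malloc(sizeof (*vr));

	atomic_init(&vr->vr_count, 1);	/* owner's reference */
	vr->vr_vp = vp;			/* real code: VN_HOLD(vp) once here */
	return (vr);
}

void
vref_hold(struct vref *vr)
{
	atomic_fetch_add(&vr->vr_count, 1);
}

int
vref_rele(struct vref *vr)
{
	if (atomic_fetch_sub(&vr->vr_count, 1) == 1) {
		/* last ref: real code would VN_RELE(vr->vr_vp) here */
		free(vr);
		return (1);
	}
	return (0);
}
```

With this, a lookup does vref_hold/vref_rele on the per-zone (or per-process) wrapper and the contended vnode mutex is only taken when the wrapper itself is created or destroyed.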


Files

smartos-j40-bzImage-inc.svg (221 KB) smartos-j40-bzImage-inc.svg on cpu flamegraph Mateusz Guzik, 2020-08-19 12:22 AM
