Project

General

Profile

Bug #4426

mci_rw_lock white-hot under load

Added by Robert Mustacchi over 5 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Category:
networking
Start date:
2013-12-24
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:

Description

A customer has a system that is exhibiting approximately 185,000 calls per second to mac_client_check_flow_vid(). The outliers for this call are significant:

           value  ------------- Distribution ------------- count    
             256 |                                         0        
             512 |@@@@@@@@@@@@@@@                          71059    
            1024 |@@@@@@@@@@@@@@@@@                        81489    
            2048 |@@@@@                                    24689    
            4096 |                                         2133     
            8192 |@                                        3230     
           16384 |                                         2273     
           32768 |                                         2005     
           65536 |                                         1787     
          131072 |                                         1678     
          262144 |                                         535      
          524288 |                                         45       
         1048576 |                                         1        
         2097152 |                                         0        

This is all due to mci_rw_lock; we need to find a way to avoid acquiring it in the hot path.

Here is an mpstat snippet from an SDC7 platform dated 8/29/13:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  152   0    0   812  138 12251  419 3372  660  362 24901   35  15   0  50
  1  181   0    0   486   38 10707  332 2693  586  329 21933   31  13   0  56
  2   80   0    0  2175  170 20686 1889 2967 1296  193 13576   16  17   0  67
  3   44   0    0  1912  139 19507 1663 2627 1311  164 12387   16  17   0  67
  4   33   0    0  1615  112 17511 1348 2146 1237  136 11624   13  14   0  73
  5   38   0    0  1630  109 17389 1381 2185 1213  150 10700   14  15   0  71
  6   26   0    0  1571   89 19363 1374 2048 1372  139 10300   11  15   0  74
  7   56   0    0  1816  134 20561 1544 2387 1511  123 11314   12  16   0  72
  8   27   0    0  1563   89 17970 1346 2040 1287  174 11449   12  15   0  73
  9   11   0    0  1756  109 20963 1526 2164 1617  162 10467   10  17   0  73
 10   24   0    0  4259  385 49332 5274 3892 2164 2564 10193   12  24   0  64
 11  118   0    0   190    3 7002  127 1141  404  228 14762   19   8   0  73
 12   95   0    0 11129 10765 7854  259 1484 2504  359 15994   21  17   0  61
 13  154   0    0  8336 7993 8267  239 1578 1905  343 17141   24  16   0  60
 14   78   0    0 11015 10684 8034  235 1562 2525  401 16490   22  17   0  61
 15   89   0   31 11563 11169 8438  297 1718 2630  454 16872   24  20   0  56
 16   91   0    0 10579 10215 7603  234 1515 2144  356 15119   24  17   0  59
 17   99   0    0 11111 10735 7592  212 1509 2231  344 15154   22  17   0  61
 18  110   0    0  9303 8922 7574  202 1513 1862  293 15208   23  15   0  62
 19   89   0    0 11274 10925 7149  212 1380 2330  307 14240   21  16   0  63
 20   95   0    0   288   27 6582  129 1121  362  175 13752   20   8   0  72
 21   93   0    0 19580 19310 7717  175 1580  425  216 15953   24  16   0  60
 22   73   0    0   332   39 6858  141 1179  356  182 14102   19   9   0  72
 23   64   0    0   191   17 5491   84  829  307  136 12073   17   7   0  76
 24  146   0    0   242    8 6366  113  993  380  192 13217   18   8   0  74
 25  107   0 1053   360  109 5655   94  800  301  168 12238   16   7   0  77
 26   49   0    0   179    9 5773  111  884  333  139 12117   17   8   0  75
 27   45   0   19   163    1 5740  101  800  339  139 12623   16   7   0  77
 28   64   0    0   238   23 6012  100  978  326  172 12662   18   8   0  74
 29   99   0    0   159   42 5079   68  706  314  143 11049   15   7   0  78
 30   73   0    0  3814 3601 6900  128 1224  631  151 14617   19  12   0  68
 31   51   0   10  3731 3532 6091   94 1028  953  192 12769   19  14   0  67

Here's an mpstat snippet from the same load running on the same kind of machine on a slightly newer platform (11/4/13):
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    2   0    0  3305  224 19199 1302 5850 1785 6500  8984   13  36   0  50
  1    1   0    0 15577 1560 95441 37609 22358 1300 49066  1997    4  91   0   4
  2    5   0    0  1412   55 9461 1153 3187  309 2122 12425   21  14   0  65
  3    9   0    0  1306   63 9514 1073 3120  266 2112 12894   21  14   0  65
  4    8   0    0  1151   40 6792  898 2377  184 1141 11851   20  11   0  69
  5    4   0    0  1098   47 9131  910 2911  231 2039 12474   20  13   0  66
  6    0   0  784  1079   43 6670  929 2294  178 1179 11424   19  11   0  70
  7    8   0    0  1137   57 9118  915 2904  247 2136 12151   20  14   0  66
  8    9   0    0  1967   72 22743 1749 8262  624 8017 13018   20  37   0  43
  9   19   1   31  1631   52 15936 1376 5557  385 5111 13190   22  26   0  52
 10    1   0    0  1553   47 10173 1323 3510  264 2601 12754   21  18   0  61
 11    6   0   31  1625  150 8717 1208 2988  216 2034 12234   22  16   0  62
 12   28   0    0  9031 7795 6498 1077 2351  578 1146 11489   21  14   0  65
 13    1   0    0  9560 8282 8727 1263 2792  695 2113 11700   21  16   0  63
 14   18   1    0  8382 7264 6283 1017 2188  520 1066 11172   21  12   0  67
 15   14   0    0  9644 8485 7593 1006 2530  726 1623 11193   21  14   0  64
 16    2   0    0  9249 8215 8693  914 3004  694 2071 11410   20  16   0  64
 17   11   1    0  8631 7548 8886  894 3859  622 2745 10090   18  19   0  63
 18    7   0    0  9712 8737 9025  825 2835  720 2218 11289   18  18   0  64
 19   14   0    0  9381 8435 6930  788 2269  670 1318 10942   19  14   0  67
 20   10   0    0  1026   47 9379  818 2873  264 2288 12039   20  14   0  66
 21    7   0    0  1021   42 6572  844 2096  192 1187 11024   19  11   0  70
 22   15   0    0 19458 18399 12726  830 3796  395 3705 11765   19  26   0  55
 23   18   0    0  1140   43 8391  945 2715  235 1931 11438   19  14   0  68
 24    9   1    0  1168   40 7315 1003 2648  200 1584 10963   20  13   0  67
 25   10   0    0  1231   65 7182 1032 2151 1964 1722  8336   15  17   0  68
 26    2   0    0  1284   67 8360 1100 2458 1807 2034  9434   15  18   0  67
 27   14   0    0  1256   90 6523 1044 1845 1966 1426  7954   15  16   0  69
 28    0   0    0  3204 1975 8519 1119 2373 2237 2176  8394   14  21   0  65
 29    9   0    0  3085 1972 6647 1039 1948 1918 1428  7735   14  18   0  68
 30    7   0    0  1229   84 8693 1054 2446 1893 2219  9057   15  19   0  66
 31    4   0    0  1189  108 6389  968 1856 2012 1343  7974   15  16   0  69

Note in particular that "srw" – which is high in the initial output – is much higher in the second output. Also much higher are involuntary context switches. It was confirmed, however, that the absolute number of rwlock acquisitions was the same in both cases – which means that contention is higher under identical load on the newer platform. This may be due to increased preemption because of the reduced scheduling quantum.

Fixing this issue resulted in dramatically better behavior:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   37   0    0  3440  179 17051 1260 1439 1182    0  8557   12  18   0  69
  1  196   0    0  1875  103 17903 1626 1428 1200    0 12216   17  16   0  67
  2   51   0    0  1287   66 15907 1075  916 1092    0  9083   12  14   0  74
  3  133   0    0  1788   97 16568 1582 1202 1082    0 12388   18  15   0  67
  4   90   0    0  1338   73 15886 1143  808 1109    0  8426   12  14   0  74
  5  106   0    0  4836  409 40477 4674 1427 1124    0 11815   17  19   0  63
  6  163   0   31   118    1 5378   23  812  300    0 12570   19   7   0  74
  7  200   0    0   137    1 7166   39 1195  382    0 16723   24   9   0  68
  8  112   0    0    92   15 4817   30  753  257    0 10779   17   7   0  77
  9  115   0    0   116   21 6175   37 1097  322    0 14418   23   9   0  68
 10  116   0    0    80   15 4771   12  709  247    0 10788   17   7   0  76
 11  203   0    0   154   35 6584   36 1094  320    0 15135   22   9   0  69
 12   74   0    0  3175 3117 4861   20  762  759    0 10850   16  11   0  73
 13  106   0    0  3277 3170 6328   39 1195  554    0 13864   22  12   0  66
 14  102   0    0   104   38 4115   18  722  211    0  8889   14   6   0  79
 15  148   0    0   188   77 6135   31 1045  254    0 13859   21   8   0  70
 16  206   0    0  9116 8922 6805   40 1278 1317    0 15671   23  16   0  62
 17  111   0   15  8534 8369 5199   18  908 1141    0 11463   18  13   0  70
 18  192   0    0 10027 9824 6886   34 1255 1431    0 15706   23  16   0  61
 19  129   0    0 10711 10537 5395   36  934 1457    0 11955   18  16   0  67
 20  171   0    0  8933 8740 6966   45 1245 1300    0 15932   23  15   0  62
 21  120   0    0  8872 8698 5766   32 1023 1255    0 12716   20  14   0  66
 22  126   0   31  9281 8986 7004   30 1283 1344    0 16097   22  15   0  62
 23  181   0    0 10399 10219 5550   40  881 1447    0 12678   18  15   0  67
 24  179   0    0   159   33 6641   36 1145  344    0 15394   23   9   0  69
 25  136   0    0 19191 19087 5484   25  932  295    0 12276   18  15   0  67
 26   93   0    0   147   28 6552   40 1115  327    0 15689   22  10   0  68
 27  137   0    0    91   19 4670   30  707  235    0 10423   15   7   0  78
 28  130   0    0   169   36 6170   30 1090  316    0 14037   21   9   0  70
 29   19   0    0  1216   66 14092  996  733 1013    0  6660   10  15   0  75
 30  127   0    0  1787  101 16782 1580 1076 1147    0 11196   17  17   0  66
 31   34   0    0  1285   70 15142 1075  739 1065    0  6992   10  16   0  74

In particular, note that srw has gone to 0 . The customer has reported that this resulted in much better latency, and there now appears to be a much higher throughput limiter in the system (as yet undetermined).

History

#1

Updated by Electric Monk over 4 years ago

  • Status changed from New to Closed

git commit 3bb0cb708b0d9e3882b52efcf67a78f2a39baf90

commit  3bb0cb708b0d9e3882b52efcf67a78f2a39baf90
Author: Bryan Cantrill <bryan@joyent.com>
Date:   2015-04-30T15:06:22.000Z

    4426 mci_rw_lock white-hot under load
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Piotr Jasiukajtis <estibi@me.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Approved by: Dan McDonald <danmcd@omniti.com>

Also available in: Atom PDF