Bug #13208

Create aggr fails when underlying links have more than 128 Tx rings

Added by Paul Winder 22 days ago. Updated 10 days ago.

Status: New
Priority: Normal
Assignee:
Category: networking
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
Description

Using:

dladm create-aggr -l mlxcx0 -l mlxcx1 -l mlxcx2 aggr0

Currently panics the system (see #13207). Regardless, we should be able to create an aggr with more than 2 mlxcx links. It fails because the default number of Tx rings per group in mlxcx is 64, so two mlxcx links (2 × 64 = 128) already hit the global per-group limit of 128 - MAX_RINGS_PER_GROUP.

The default is set by this constant:

#define MLXCX_TX_NRINGS_PER_GROUP_DFLT  64

This seems far too high; I propose the default be set to 16, the same as the number of rings per Rx group.

It can be overridden in mlxcx.conf using the tx_nrings_per_group driver property.
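
For example, a minimal sketch of such an override in driver.conf style (the value 16 matches the proposed default above; the path and comment are illustrative):

# /kernel/drv/mlxcx.conf
# Hypothetical override: limit Tx rings per group to 16.
tx_nrings_per_group=16;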


Related issues

Related to illumos gate - Bug #13207: Creating an aggr with more than 128 Tx or Rx rings panics (New, Paul Winder)

#1

Updated by Paul Winder 22 days ago

  • Related to Bug #13207: Creating an aggr with more than 128 Tx or Rx rings panics added
#2

Updated by Alex Wilson 22 days ago

Setting this default lower seems fine to me. The default of 64 was mainly picked while thinking of large machines (64 or 128 logical CPUs) with single or dual NICs (like our fileserver head nodes) to make sure they could spread out across the CPUs available.

I created an aggr of 4 mlxcxs during testing originally, but it was on a machine with way fewer logical CPUs and it reduced the count of rings per group automatically there. I didn't realise the global limit was so low.

#3

Updated by Paul Winder 17 days ago

  • Subject changed from Can't create an aggr with more than 2 mlxcx links to Create aggr fails when underlying links have more than 128 Tx rings
#4

Updated by Paul Winder 17 days ago

I have revised how I intend to address this. Rather than reducing the default in the individual driver (mlxcx) where the fault was found, I have taken a more generic approach.

When an aggr is created, it will now limit the number of Tx rings per port/link, restricting it to MAX_RINGS_PER_GROUP / <number of ports>. If the aggr is created with a single link, we make sure that link does not use more than half of MAX_RINGS_PER_GROUP.
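
As a rough sketch of this limit (hypothetical code, not the actual aggr implementation; MAX_RINGS_PER_GROUP is the existing 128-ring limit):

#include <sys/types.h>

#define MAX_RINGS_PER_GROUP	128	/* existing global per-group limit */

/* Tx rings each port may use when an aggr is created with nports links. */
static uint_t
aggr_tx_rings_cap(uint_t nports)
{
	/* A single-link aggr is limited to half of MAX_RINGS_PER_GROUP. */
	if (nports <= 1)
		return (MAX_RINGS_PER_GROUP / 2);
	return (MAX_RINGS_PER_GROUP / nports);
}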

As an aggr is manipulated (e.g. by adding and removing ports/links), we maintain the highest number of ports/links the aggr has ever had. When adding links to an existing aggr, the new links are included in that high link count, and the new links will never use more than MAX_RINGS_PER_GROUP / <high port count> rings. This can lead to a situation where newly added links have fewer Tx rings than existing ones, which only becomes apparent when the underlying link driver (currently only mlxcx) has a significant number of Tx rings. To avoid this, where possible an aggr should be created with all of its underlying links in the initial dladm create-aggr command. Once the aggr has been created, if a link is removed from the aggr and then re-added, it will get the same number of rings it previously had.

If new links are added to an aggr and there are no Tx rings left, the operation will fail - as it does now.
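
A minimal sketch of the bookkeeping described above, using a hypothetical structure (field and function names are illustrative, not the real aggr code):

#include <sys/types.h>

#define MAX_RINGS_PER_GROUP	128	/* existing global per-group limit */

typedef struct {
	uint_t as_nports;	/* current number of ports/links */
	uint_t as_nports_high;	/* highest number of ports the aggr ever had */
} aggr_sketch_t;

/*
 * Fold nadd newly added links into the high port count and return the
 * Tx ring cap that applies to those new links.
 */
static uint_t
aggr_sketch_add_links(aggr_sketch_t *asp, uint_t nadd)
{
	asp->as_nports += nadd;
	if (asp->as_nports > asp->as_nports_high)
		asp->as_nports_high = asp->as_nports;
	return (MAX_RINGS_PER_GROUP / asp->as_nports_high);
}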

#5

Updated by Paul Winder 17 days ago

The test cases show:
  1. Creating an aggr with 1 link.
  2. Adding another link to it.
  3. Adding a 3rd link, which has too many Tx rings.
  4. Deleting what was created, and creating a 4-link aggr. Each mlxcx has 64 Tx rings, and the create succeeded by limiting the number used for each link.
  5. Removing and re-adding a link.
# dladm create-aggr -t -l mlxcx0 aggr0
# dladm show-aggr -L aggr0
LINK        PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr0       mlxcx0       yes          no   no   no   no        no
# dladm add-aggr -t -l mlxcx1 aggr0
# dladm show-aggr -L aggr0
LINK        PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr0       mlxcx0       yes          no   no   no   no        no
--          mlxcx1       yes          no   no   no   no        no
# dladm add-aggr -t -l mlxcx2 aggr0
dladm: add operation failed: I/O error
# dladm show-aggr -L aggr0
LINK        PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr0       mlxcx0       yes          no   no   no   no        no
--          mlxcx1       yes          no   no   no   no        no
# dladm delete-aggr aggr0
# dladm create-aggr -t -l mlxcx0 -l mlxcx1 -l mlxcx2 -l mlxcx3 aggr0
# dladm show-aggr -L aggr0
LINK        PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr0       mlxcx0       yes          no   no   no   no        no
--          mlxcx1       yes          no   no   no   no        no
--          mlxcx2       yes          no   no   no   no        no
--          mlxcx3       yes          no   no   no   no        no
# dladm remove-aggr -t -l mlxcx0  aggr0
# dladm show-aggr -L aggr0
LINK        PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr0       mlxcx1       yes          no   no   no   no        no
--          mlxcx2       yes          no   no   no   no        no
--          mlxcx3       yes          no   no   no   no        no
# dladm add-aggr -t -l mlxcx0  aggr0
# dladm show-aggr -L aggr0
LINK        PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr0       mlxcx1       yes          no   no   no   no        no
--          mlxcx2       yes          no   no   no   no        no
--          mlxcx3       yes          no   no   no   no        no
--          mlxcx0       yes          no   no   no   no        no
# dladm delete-aggr aggr0
#6

Updated by Electric Monk 17 days ago

  • Gerrit CR set to 970
#7

Updated by Paul Winder 10 days ago

I've revised this based on code review feedback. Two main changes:
  • Increased the default number of ports to 4. It is common for aggrs to have 4 ports, and we have had no reported issues with other NICs used in 4-port aggrs panicking because of depleted Tx rings, so it seems reasonable that this will have no impact on existing aggrs.
  • When adding further links to an aggr, it will ensure no individual port can have more Tx rings than any other. So, if a 5th link is added to the aggr, none of the links will use more than 1/5th of the maximum supported rings - MAX_RINGS_PER_GROUP. It does this by trimming down existing rings; see the sketch below.
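
An illustrative sketch of that trimming (again hypothetical, not the actual implementation): once the port count has grown, every port's Tx ring allocation is clipped to an equal share of the group limit.

#include <sys/types.h>

#define MAX_RINGS_PER_GROUP	128	/* existing global per-group limit */

/*
 * Clip each port's Tx ring count so no port uses more than an equal
 * share of the group limit across nports ports.
 */
static void
aggr_sketch_trim(uint_t *tx_rings, uint_t nports)
{
	uint_t cap = MAX_RINGS_PER_GROUP / nports;
	uint_t i;

	for (i = 0; i < nports; i++) {
		if (tx_rings[i] > cap)
			tx_rings[i] = cap;
	}
}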

To test the code, I used a version which started with a default of 2 ports and created an aggr with 2 mlxcx links. Each mlxcx link will have 64 Tx rings, so together they would use up to the MAX_RINGS_PER_GROUP limit. To this I added 2 more links, which would force the trimming of the existing links. This was done whilst sending iperf traffic.

Sample iperf output after adding the 2 links; there was a drop-off on some of the TCP connections, which later recovered:

[  5]  30.00-40.00  sec  6.64 GBytes  5.70 Gbits/sec                  
[  8]  30.00-40.00  sec  5.97 GBytes  5.12 Gbits/sec                  
[ 10]  30.00-40.00  sec  6.63 GBytes  5.70 Gbits/sec                  
[ 12]  30.00-40.00  sec  6.40 GBytes  5.50 Gbits/sec                  
[SUM]  30.00-40.00  sec  25.6 GBytes  22.0 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]  40.00-50.00  sec  11.6 GBytes  9.93 Gbits/sec                  
[  8]  40.00-50.00  sec  2.28 GBytes  1.96 Gbits/sec                  
[ 10]  40.00-50.00  sec  2.32 GBytes  1.99 Gbits/sec                  
[ 12]  40.00-50.00  sec  11.9 GBytes  10.2 Gbits/sec                  
[SUM]  40.00-50.00  sec  28.0 GBytes  24.1 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]  50.00-60.00  sec  13.7 GBytes  11.7 Gbits/sec                  
[  8]  50.00-60.00  sec  0.00 Bytes  0.00 bits/sec                  
[ 10]  50.00-60.00  sec  0.00 Bytes  0.00 bits/sec                  
[ 12]  50.00-60.00  sec  13.7 GBytes  11.8 Gbits/sec                  
[SUM]  50.00-60.00  sec  27.4 GBytes  23.5 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]  60.00-70.00  sec  13.5 GBytes  11.6 Gbits/sec                  
[  8]  60.00-70.00  sec  0.00 Bytes  0.00 bits/sec                  
[ 10]  60.00-70.00  sec  0.00 Bytes  0.00 bits/sec                  
[ 12]  60.00-70.00  sec  13.7 GBytes  11.7 Gbits/sec                  
[SUM]  60.00-70.00  sec  27.1 GBytes  23.3 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]  70.00-80.00  sec  13.6 GBytes  11.7 Gbits/sec                  
[  8]  70.00-80.00  sec  0.00 Bytes  0.00 bits/sec                  
[ 10]  70.00-80.00  sec  0.00 Bytes  0.00 bits/sec                  
[ 12]  70.00-80.00  sec  13.9 GBytes  12.0 Gbits/sec                  
[SUM]  70.00-80.00  sec  27.5 GBytes  23.6 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]  80.00-90.00  sec  13.5 GBytes  11.6 Gbits/sec                  
[  8]  80.00-90.00  sec  0.00 Bytes  0.00 bits/sec                  
[ 10]  80.00-90.00  sec  0.00 Bytes  0.00 bits/sec                  
[ 12]  80.00-90.00  sec  13.6 GBytes  11.7 Gbits/sec                  
[SUM]  80.00-90.00  sec  27.1 GBytes  23.3 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]  90.00-100.00 sec  9.27 GBytes  7.96 Gbits/sec                  
[  8]  90.00-100.00 sec  4.12 GBytes  3.53 Gbits/sec                  
[ 10]  90.00-100.00 sec  4.03 GBytes  3.46 Gbits/sec                  
[ 12]  90.00-100.00 sec  10.0 GBytes  8.62 Gbits/sec                  
[SUM]  90.00-100.00 sec  27.5 GBytes  23.6 Gbits/sec    

I assumed the drop-off was caused by the stopping of some of the rings, so I repeated the test when I knew no rings would be trimmed, and there was a similar effect. I'm not sure what causes this, but it doesn't seem to be a regression.

[  5]  80.00-90.00  sec  4.55 GBytes  3.91 Gbits/sec                  
[  8]  80.00-90.00  sec  4.51 GBytes  3.87 Gbits/sec                  
[ 10]  80.00-90.00  sec  11.0 GBytes  9.45 Gbits/sec                  
[ 12]  80.00-90.00  sec  4.57 GBytes  3.93 Gbits/sec                  
[SUM]  80.00-90.00  sec  24.6 GBytes  21.2 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]  90.00-100.00 sec  0.00 Bytes  0.00 bits/sec                  
[  8]  90.00-100.00 sec  0.00 Bytes  0.00 bits/sec                  
[ 10]  90.00-100.00 sec  22.4 GBytes  19.3 Gbits/sec                  
[ 12]  90.00-100.00 sec  0.00 Bytes  0.00 bits/sec                  
[SUM]  90.00-100.00 sec  22.4 GBytes  19.3 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5] 100.00-110.00 sec  0.00 Bytes  0.00 bits/sec                  
[  8] 100.00-110.00 sec  0.00 Bytes  0.00 bits/sec                  
[ 10] 100.00-110.00 sec  22.9 GBytes  19.6 Gbits/sec                  
[ 12] 100.00-110.00 sec  0.00 Bytes  0.00 bits/sec                  
[SUM] 100.00-110.00 sec  22.9 GBytes  19.6 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5] 110.00-120.00 sec  0.00 Bytes  0.00 bits/sec                  
[  8] 110.00-120.00 sec  0.00 Bytes  0.00 bits/sec                  
[ 10] 110.00-120.00 sec  22.2 GBytes  19.1 Gbits/sec                  
[ 12] 110.00-120.00 sec  0.00 Bytes  0.00 bits/sec                  
[SUM] 110.00-120.00 sec  22.2 GBytes  19.1 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5] 120.00-130.00 sec  0.00 Bytes  0.00 bits/sec                  
[  8] 120.00-130.00 sec  0.00 Bytes  0.00 bits/sec                  
[ 10] 120.00-130.00 sec  22.8 GBytes  19.6 Gbits/sec                  
[ 12] 120.00-130.00 sec  0.00 Bytes  0.00 bits/sec                  
[SUM] 120.00-130.00 sec  22.8 GBytes  19.6 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5] 130.00-140.00 sec  1.08 GBytes   928 Mbits/sec                  
[  8] 130.00-140.00 sec  1.09 GBytes   938 Mbits/sec                  
[ 10] 130.00-140.00 sec  19.6 GBytes  16.9 Gbits/sec                  
[ 12] 130.00-140.00 sec  1.08 GBytes   927 Mbits/sec                  
[SUM] 130.00-140.00 sec  22.9 GBytes  19.6 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5] 140.00-150.00 sec  6.74 GBytes  5.79 Gbits/sec                  
[  8] 140.00-150.00 sec  6.83 GBytes  5.87 Gbits/sec                  
[ 10] 140.00-150.00 sec  6.84 GBytes  5.87 Gbits/sec                  
[ 12] 140.00-150.00 sec  6.80 GBytes  5.84 Gbits/sec                  
[SUM] 140.00-150.00 sec  27.2 GBytes  23.4 Gbits/sec             
