
Bug #918

Need better IP fanout (esp. with VLANs present)

Added by Dan McDonald over 8 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assignee: --
Category: kernel
Start date: 2011-04-18
Due date: --
% Done: 100%
Estimated time: --
Difficulty: Medium
Tags: needs-triage

Description

Joyent found this as "OS-332", and it was fixed in the Joyent Illumos batch.

The proposed fix adds IPv4 checks to what is normally an IPv6-only codepath. While the proposed fix may address the problem, I believe further analysis may yield a cleaner fix.

History

#1

Updated by Robert Mustacchi almost 8 years ago

  • Difficulty set to Medium
  • Tags set to needs-triage

Original analysis from Brendan Gregg on this issue follows.

Summary: For high network load (especially 10 GbE on ixgbe), if VLANs are used then the kernel does not fanout across multiple CPUs properly, instead picking a single CPU. This CPU becomes the bottleneck. The issue is how VLANs are handled in mac_rx_srs_fanout().

mpstat during load:

# mpstat 1
[...]
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   15   0   19   535  203 3121   18   95  195    0  3278    2   4   0  94
  1   62   0  233   375    3 2870   25   92  165    0  4086    2   5   0  93
  2   21   0    0 19598 19514 2862   12   61 5806    0  3438    2  19   0  79
  3   64   0  715 19807 19517 2797   11   75 5767    0  3239    2  24   0  74
  4   62   0   46    52    3 2460    5   36  176    0  2974    2   7   0  91
  5   53   0  381   319    3 2660    5   62  187    0  3726    3   4   0  93
  6    8   0    0   110    6 8241   54   58 1255    0  2743    2   3   0  95
  7  176   0  385   361    4 2786   23   89  134    0  4041    4   5   0  91
  8   19   0    0    17    1 1459    1   18  121    0  1091    1   1   0  98
  9   30   0  242   342    3 2692   11   69  179    0  3465    2   4   0  94
 10    0   0    0    10    1  229    0   14   17    0   289    1   1   0  98
 11   55   0  294   354    1 2882   40   81  200    0  3308    3   6   0  91
 12   99   0    0    49    3 1776    2   18  139    0  2018    2   3   0  95
 13   88   0  403   343    1 2624   18   70  181    0  3710    3   5   0  92
 14   15   0    0    34    3 1680    3   24  113    0  1272    3   1   0  96
 15  397   0  165   322    4 2390    1   33  104    0  2813    5   4   0  91
 16   12   0    0    42    3 2181    9   39  145    0  2393    2   2   0  96
 17   27   0  186   364    5 2701   13   83  163    0  2526    2   4   0  94
 18    8   0    0    30    2 1414    2   16   90    0   854    1   1   0  98
 19  239   0   96   358    2 2776   10   69  129    0  2270    2   4   0  94
 20   16   0    0    88    2 2770   12   37  173    0  4124    2   3   0  95
 21   25   0   67   379    1 2558   16   57  149    0  3898    1   4   0  95
 22    0   0    0   253   23 1868   34    2  692    0     0    0 100   0   0
 23   26   0  156   369    2 2372   16   69  164    0  2497    1   3   0  96

Note how CPU 22 in this case was at 100% system time. Sampling that CPU's stack trace to see what it is doing:

# dtrace -n 'profile-1001 /cpu == 22/ { @[stack()] = count(); }'
[...]
              genunix`allocb+0x74
              ip`tcp_send+0x80e
              ip`tcp_wput_data+0x56e
              ip`tcp_input_data+0x3cfc
              ip`squeue_enter+0x440
              ip`ip_fanout_v4+0x48d
              ip`ire_recv_local_v4+0x366
              ip`ill_input_short_v4+0x69e
              ip`ip_input+0x23b
              dls`i_dls_link_rx+0x2e7
              mac`mac_rx_deliver+0x5d
              mac`mac_rx_soft_ring_drain+0xdf
              mac`mac_soft_ring_worker+0x111
              unix`thread_start+0x8
              311

              genunix`ddi_dma_addr_bind_handle+0x68
              ixgbe`ixgbe_tx_bind+0x51
              ixgbe`ixgbe_ring_tx+0x7a8
              mac`mac_hwring_tx+0x20
              mac`mac_tx_send+0x4bc
              mac`mac_tx_single_ring_mode+0x91
              mac`mac_tx+0x35d
              dld`str_mdata_fastpath_put+0xa4
              ip`ip_xmit+0x7eb
              ip`ire_send_wire_v4+0x345
              ip`conn_ip_output+0x250
              ip`tcp_send+0xb9d
              ip`tcp_wput_data+0x56e
              ip`tcp_input_data+0x3cfc
              ip`squeue_enter+0x440
              ip`ip_fanout_v4+0x48d
              ip`ire_recv_local_v4+0x366
              ip`ill_input_short_v4+0x69e
              ip`ip_input+0x23b
              dls`i_dls_link_rx+0x2e7
              mac`mac_rx_deliver+0x5d
              mac`mac_rx_soft_ring_drain+0xdf
              mac`mac_soft_ring_worker+0x111
              unix`thread_start+0x8
              355

              unix`mutex_enter+0x10
              dld`str_mdata_fastpath_put+0xa4
              ip`ip_xmit+0x7eb
              ip`ire_send_wire_v4+0x345
              ip`conn_ip_output+0x250
              ip`tcp_send+0xb9d
              ip`tcp_wput_data+0x56e
              ip`tcp_input_data+0x3cfc
              ip`squeue_enter+0x440
              ip`ip_fanout_v4+0x48d
              ip`ire_recv_local_v4+0x366
              ip`ill_input_short_v4+0x69e
              ip`ip_input+0x23b
              dls`i_dls_link_rx+0x2e7
              mac`mac_rx_deliver+0x5d
              mac`mac_rx_soft_ring_drain+0xdf
              mac`mac_soft_ring_worker+0x111
              unix`thread_start+0x8
              385

It isn't obvious from this output what the issue is: a mac_soft_ring_worker thread is doing the work. There should be more than one of these threads on other CPUs (and there are, as shown by mdb -k), but only the one on CPU 22 is doing any work, as confirmed by:

# dtrace -n 'fbt::mac_rx_deliver:entry { @[cpu] = count(); }'

The interfaces should be configured to fan out across multiple CPUs by default. This can be forced (and the CPU set widened) using dladm to set the "cpus" property, which I did for all interfaces:

[root@d4-85-64-7b-9a-4a /var/tmp/brendan]# dladm show-linkprop -p cpus
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
ixgbe0       cpus            rw   0-23           --             -- 
ixgbe1       cpus            rw   0-23           --             -- 
bnx0         cpus            rw   0-23           --             -- 
bnx1         cpus            rw   0-23           --             -- 
bnx2         cpus            rw   0-23           --             -- 
bnx3         cpus            rw   0-23           --             -- 
vc0_ff_ee_e7_5e_bd_0 cpus    rw   0-23           --             -- 
vc0_ff_ee_c0_fa_23_0 cpus    rw   0-23           --             -- 
vc0_ff_ee_7c_7d_f0_0 cpus    rw   0-23           --             -- 
vc0_ff_ee_dd_4c_3a_0 cpus    rw   0-23           --             -- 
vc0_ff_ee_22_c0_cc_0 cpus    rw   0-23           --             --

And confirmed in mdb that this had taken effect:

> ::mac_srs -rcv
                                      CPU_COUNT            FANOUT_CPU_COUNT    
            ADDR LINK_NAME            (CPU_LIST)           (CPU_LIST)          
ffffff2639cc9000 vc0_ff_ee_22_c0_cc_0 24                   22                  
                                      (00,01,02,03,04,05,  (01,02,03,04,05,06,
                                       06,07,08,09,0a,0b,   07,08,09,0a,0b,0c,
                                       0c,0d,0e,0f,10,11,   0d,0e,0f,10,11,12,
                                       12,13,14,15,16,17)    13,14,15,16)
ffffff2639cc9cc0 vc0_ff_ee_e7_5e_bd_0 24                   17                  
                                      (00,01,02,03,04,05,  (01,02,03,04,05,06,
                                       06,07,08,09,0a,0b,   07,08,09,0a,0b,0c,
                                       0c,0d,0e,0f,10,11,   0d,0e,0f,10,11)
                                       12,13,14,15,16,17)
ffffff26368ab040 vc0_ff_ee_dd_4c_3a_0 24                   22                  
                                      (00,01,02,03,04,05,  (01,02,03,04,05,06,
                                       06,07,08,09,0a,0b,   07,08,09,0a,0b,0c,
                                       0c,0d,0e,0f,10,11,   0d,0e,0f,10,11,12,
                                       12,13,14,15,16,17)    13,14,15,16)
ffffff26368ac9c0 vc0_ff_ee_7c_7d_f0_0 24                   22                  
                                      (00,01,02,03,04,05,  (01,02,03,04,05,06,
                                       06,07,08,09,0a,0b,   07,08,09,0a,0b,0c,
                                       0c,0d,0e,0f,10,11,   0d,0e,0f,10,11,12,
                                       12,13,14,15,16,17)    13,14,15,16)
ffffff2600ad2040 vc0_ff_ee_c0_fa_23_0 24                   22                  
                                      (00,01,02,03,04,05,  (01,02,03,04,05,06,
                                       06,07,08,09,0a,0b,   07,08,09,0a,0b,0c,
                                       0c,0d,0e,0f,10,11,   0d,0e,0f,10,11,12,
                                       12,13,14,15,16,17)    13,14,15,16)
ffffff2600ad5340 bnx0                 24                   22                  
                                      (00,01,02,03,04,05,  (01,02,03,04,05,06,
                                       06,07,08,09,0a,0b,   07,08,09,0a,0b,0c,
                                       0c,0d,0e,0f,10,11,   0d,0e,0f,10,11,12,
                                       12,13,14,15,16,17)    13,14,15,16)

(sorry, different state than the earlier mpstat showed; here the first fanout CPU for all interfaces is "01".)

The problem was the hashing of packets to decide which CPU (and therefore which mac_soft_ring_worker thread) should handle each packet. This was traced to the mac_rx_srs_long_fanout() function, which was being called for the ixgbe IPv4 packets:

/*
 * mac_rx_srs_long_fanout
 *
 * The fanout routine for IPv6
 */
static int
mac_rx_srs_long_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *mp,
    uint32_t sap, size_t hdrsize, enum pkt_type *type, uint_t *indx)
{
[...]
                switch (nexthdr) {
                case IPPROTO_TCP:
                        hash = HASH_ADDR(V4_PART_OF_V6(ip6h->ip6_src),
                            *(uint32_t *)whereptr);
                        *indx = COMPUTE_INDEX(hash,
                            mac_srs->srs_tcp_ring_count);

This sets the indx integer to select which entry from the earlier CPU_LIST to use for fanout.

However, this is an IPv6 function, which isn't supposed to handle IPv4:

mac_rx_srs_long_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *mp,
    uint32_t sap, size_t hdrsize, enum pkt_type *type, uint_t *indx)
{
[...]
        if (sap == ETHERTYPE_IPV6) {
[...]
        } else {
                *indx = 0;
                *type = OTH;
        }
        return (0);

DTrace showed that we were taking that "else" codepath, since this is IPv4. That codepath sets indx to 0, which picks the first CPU from the CPU_LIST. Always.

I don't think mac_rx_srs_long_fanout() should really be called, as it only deals with IPv6 properly. As an experiment, IPv6 was configured on the network, and fanout was found to then work properly. Apart from balancing in mpstat and mac_rx_deliver() firing on multiple CPUs, the following one-liner was also used to check:

dtrace -n 'mac_rx_srs_long_fanout:entry { self->i = args[5]; }
    mac_rx_srs_long_fanout:return { @[*self->i] = count(); self->i = 0; }'

Which showed that for IPv6, the index was being set to different values. On IPv4 the same one-liner always returned 0 (though the fact that this function is called at all is enough to know we are on the wrong code path).

The parent function here is mac_rx_srs_fanout(), with important parts:

static void
mac_rx_srs_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *head)
{
[...]
        /*
         * Special clients (eg. VLAN, non ether, etc) need DLS
         * processing in the Rx path. SRST_DLS_BYPASS will be clear for
         * such SRSs. Another way of disabling bypass is to set the
         * MCIS_RX_BYPASS_DISABLE flag.
         */
        dls_bypass = ((mac_srs->srs_type & SRST_DLS_BYPASS) != 0) &&
            ((mcip->mci_state_flags & MCIS_RX_BYPASS_DISABLE) == 0);
[...]
                if (!dls_bypass) {
                        if (mac_rx_srs_long_fanout(mac_srs, mp, sap,
                            hdrsize, &type, &indx) == -1) {
[...]

Later on is the "normal" hashing of TCP ports, but we never make it that far: since dls_bypass was not set, we enter mac_rx_srs_long_fanout() instead. Presumably the idea is that VLANs need special processing (and hence no dls_bypass), but the problem is that this feeds into the IPv6 fanout function, which doesn't handle IPv4 properly.

The easiest fix may be to add hashing code to mac_rx_srs_long_fanout() for the IPv4 case. The code can be copied from later in mac_rx_srs_fanout() - just the TCP case will be enough - after testing that the extra bytes are available and don't need a pullup(). See the PORTS_SIZE tests as examples. (Refactoring the code to avoid redundant TCP hashing calls is possible, but may be a more complex change for now.)

Jerry tracked down and added the following note:

Found the following comment from the Crossbow code review:

Known Design Issues

DLS bypass for IPv4 only and asymmetric IPv4 vs IPv6 datapath (RFE 6719435, Finish the DLS bypass work). The RFE entails bringing a full SAP classifier into MAC, dealing with mac_header_info for v6 while preserving packet chains, and making squeue polling symmetric for both v4 and v6. Since there is substantial overlap with the IP datapath refactoring project, this RFE will be done in the datapath project gate after it merges with Crossbow putback 1.

#2

Updated by Dan McDonald over 6 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

commit 3cc3202e63472d673bcac160bfb86d1865dd9fe7
Author: Dan McDonald <>
Date: Wed Apr 3 14:02:24 2013 -0400

918 Need better IP fanout (esp. with VLANs present)
Reviewed by: Hans Rosenfeld <>
Reviewed by: Sebastien Roy <>
Reviewed by: Garrett D'Amore <>
Approved by: Gordon Ross <>
