Bug #918
Closed
Need better IP fanout (esp. with VLANs present)
100%
Description
Joyent found this as "OS-332", and it was fixed in the Joyent Illumos batch.
The proposed fix adds IPv4 checks to what is normally an IPv6-only codepath. While the proposed fix may address the problem, I believe further analysis may yield a cleaner fix.
Updated by Robert Mustacchi almost 12 years ago
- Difficulty set to Medium
- Tags set to needs-triage
Original analysis from Brendan Gregg on this issue follows.
Summary: For high network load (especially 10 GbE on ixgbe), if VLANs are used then the kernel does not fanout across multiple CPUs properly, instead picking a single CPU. This CPU becomes the bottleneck. The issue is how VLANs are handled in mac_rx_srs_fanout().
mpstat during load:
# mpstat 1
[...]
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   15   0   19   535  203 3121   18   95  195    0  3278    2   4   0  94
  1   62   0  233   375    3 2870   25   92  165    0  4086    2   5   0  93
  2   21   0    0 19598 19514 2862  12   61 5806    0  3438    2  19   0  79
  3   64   0  715 19807 19517 2797  11   75 5767    0  3239    2  24   0  74
  4   62   0   46    52    3 2460    5   36  176    0  2974    2   7   0  91
  5   53   0  381   319    3 2660    5   62  187    0  3726    3   4   0  93
  6    8   0    0   110    6 8241   54   58 1255    0  2743    2   3   0  95
  7  176   0  385   361    4 2786   23   89  134    0  4041    4   5   0  91
  8   19   0    0    17    1 1459    1   18  121    0  1091    1   1   0  98
  9   30   0  242   342    3 2692   11   69  179    0  3465    2   4   0  94
 10    0   0    0    10    1  229    0   14   17    0   289    1   1   0  98
 11   55   0  294   354    1 2882   40   81  200    0  3308    3   6   0  91
 12   99   0    0    49    3 1776    2   18  139    0  2018    2   3   0  95
 13   88   0  403   343    1 2624   18   70  181    0  3710    3   5   0  92
 14   15   0    0    34    3 1680    3   24  113    0  1272    3   1   0  96
 15  397   0  165   322    4 2390    1   33  104    0  2813    5   4   0  91
 16   12   0    0    42    3 2181    9   39  145    0  2393    2   2   0  96
 17   27   0  186   364    5 2701   13   83  163    0  2526    2   4   0  94
 18    8   0    0    30    2 1414    2   16   90    0   854    1   1   0  98
 19  239   0   96   358    2 2776   10   69  129    0  2270    2   4   0  94
 20   16   0    0    88    2 2770   12   37  173    0  4124    2   3   0  95
 21   25   0   67   379    1 2558   16   57  149    0  3898    1   4   0  95
 22    0   0    0   253   23 1868   34    2  692    0     0    0 100   0   0
 23   26   0  156   369    2 2372   16   69  164    0  2497    1   3   0  96
Note how CPU 22 in this case was at 100% system time. Sampling its stack trace to see what it is doing:
# dtrace -n 'profile-1001 /cpu == 22/ { @[stack()] = count(); }'
[...]
              genunix`allocb+0x74
              ip`tcp_send+0x80e
              ip`tcp_wput_data+0x56e
              ip`tcp_input_data+0x3cfc
              ip`squeue_enter+0x440
              ip`ip_fanout_v4+0x48d
              ip`ire_recv_local_v4+0x366
              ip`ill_input_short_v4+0x69e
              ip`ip_input+0x23b
              dls`i_dls_link_rx+0x2e7
              mac`mac_rx_deliver+0x5d
              mac`mac_rx_soft_ring_drain+0xdf
              mac`mac_soft_ring_worker+0x111
              unix`thread_start+0x8
              311

              genunix`ddi_dma_addr_bind_handle+0x68
              ixgbe`ixgbe_tx_bind+0x51
              ixgbe`ixgbe_ring_tx+0x7a8
              mac`mac_hwring_tx+0x20
              mac`mac_tx_send+0x4bc
              mac`mac_tx_single_ring_mode+0x91
              mac`mac_tx+0x35d
              dld`str_mdata_fastpath_put+0xa4
              ip`ip_xmit+0x7eb
              ip`ire_send_wire_v4+0x345
              ip`conn_ip_output+0x250
              ip`tcp_send+0xb9d
              ip`tcp_wput_data+0x56e
              ip`tcp_input_data+0x3cfc
              ip`squeue_enter+0x440
              ip`ip_fanout_v4+0x48d
              ip`ire_recv_local_v4+0x366
              ip`ill_input_short_v4+0x69e
              ip`ip_input+0x23b
              dls`i_dls_link_rx+0x2e7
              mac`mac_rx_deliver+0x5d
              mac`mac_rx_soft_ring_drain+0xdf
              mac`mac_soft_ring_worker+0x111
              unix`thread_start+0x8
              355

              unix`mutex_enter+0x10
              dld`str_mdata_fastpath_put+0xa4
              ip`ip_xmit+0x7eb
              ip`ire_send_wire_v4+0x345
              ip`conn_ip_output+0x250
              ip`tcp_send+0xb9d
              ip`tcp_wput_data+0x56e
              ip`tcp_input_data+0x3cfc
              ip`squeue_enter+0x440
              ip`ip_fanout_v4+0x48d
              ip`ire_recv_local_v4+0x366
              ip`ill_input_short_v4+0x69e
              ip`ip_input+0x23b
              dls`i_dls_link_rx+0x2e7
              mac`mac_rx_deliver+0x5d
              mac`mac_rx_soft_ring_drain+0xdf
              mac`mac_soft_ring_worker+0x111
              unix`thread_start+0x8
              385
It isn't obvious from this what the issue is: it is simply a mac_soft_ring_worker thread doing work. There should be more than one of these threads, spread across other CPUs (and there are, as shown by mdb -k), but only the one on CPU 22 is doing any work, as confirmed by:
# dtrace -n 'fbt::mac_rx_deliver:entry { @[cpu] = count(); }'
The interfaces should be configured to fanout using multiple CPUs by default. This can be forced (and increased) using dladm to set the "cpus" property, which I did for all interfaces:
[root@d4-85-64-7b-9a-4a /var/tmp/brendan]# dladm show-linkprop -p cpus
LINK                 PROPERTY  PERM  VALUE  DEFAULT  POSSIBLE
ixgbe0               cpus      rw    0-23   --       --
ixgbe1               cpus      rw    0-23   --       --
bnx0                 cpus      rw    0-23   --       --
bnx1                 cpus      rw    0-23   --       --
bnx2                 cpus      rw    0-23   --       --
bnx3                 cpus      rw    0-23   --       --
vc0_ff_ee_e7_5e_bd_0 cpus      rw    0-23   --       --
vc0_ff_ee_c0_fa_23_0 cpus      rw    0-23   --       --
vc0_ff_ee_7c_7d_f0_0 cpus      rw    0-23   --       --
vc0_ff_ee_dd_4c_3a_0 cpus      rw    0-23   --       --
vc0_ff_ee_22_c0_cc_0 cpus      rw    0-23   --       --
And confirmed in mdb that this had taken effect:
> ::mac_srs -rcv
ADDR             LINK_NAME            CPU_COUNT  FANOUT_CPU_COUNT
                                      (CPU_LIST)  (CPU_LIST)
ffffff2639cc9000 vc0_ff_ee_22_c0_cc_0 24         22
    (00,01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16,17)
    (01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16)
ffffff2639cc9cc0 vc0_ff_ee_e7_5e_bd_0 24         17
    (00,01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16,17)
    (01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11)
ffffff26368ab040 vc0_ff_ee_dd_4c_3a_0 24         22
    (00,01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16,17)
    (01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16)
ffffff26368ac9c0 vc0_ff_ee_7c_7d_f0_0 24         22
    (00,01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16,17)
    (01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16)
ffffff2600ad2040 vc0_ff_ee_c0_fa_23_0 24         22
    (00,01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16,17)
    (01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16)
ffffff2600ad5340 bnx0                 24         22
    (00,01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16,17)
    (01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16)
(sorry, different state than the earlier mpstat showed; here the first fanout CPU for all interfaces is "01".)
The problem was in the hashing of packets that decides which CPU (and which mac_soft_ring_worker) should handle each one. This was traced to the mac_rx_srs_long_fanout() function, which was being called for the ixgbe IPv4 packets:
/*
 * mac_rx_srs_long_fanout
 *
 * The fanout routine for IPv6
 */
static int
mac_rx_srs_long_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *mp,
    uint32_t sap, size_t hdrsize, enum pkt_type *type, uint_t *indx)
{
[...]
	switch (nexthdr) {
	case IPPROTO_TCP:
		hash = HASH_ADDR(V4_PART_OF_V6(ip6h->ip6_src),
		    *(uint32_t *)whereptr);
		*indx = COMPUTE_INDEX(hash,
		    mac_srs->srs_tcp_ring_count);
This sets the indx integer, which identifies the entry in the CPU_LIST above to use for fanout.
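To make the mechanics concrete, here is a small user-space sketch of how a HASH_ADDR/COMPUTE_INDEX-style hash maps a flow to a soft-ring index. The macro bodies below are illustrative stand-ins, not verbatim copies from mac_sched.c (the kernel versions differ in details such as byte-order handling):

```c
#include <stdint.h>

/*
 * Illustrative stand-ins for the HASH_ADDR/COMPUTE_INDEX macros:
 * XOR the IPv4 source address with the TCP port bytes, then reduce
 * modulo the soft-ring count.
 */
#define HASH_ADDR(src, ports) \
	((src) ^ ((ports) >> 24) ^ ((ports) >> 16) ^ ((ports) >> 8) ^ (ports))
#define COMPUTE_INDEX(hash, count) ((hash) % (count))

/* Map a (source address, TCP ports) pair to a ring index in [0, count). */
static unsigned int
pick_ring(uint32_t v4src, uint32_t tcp_ports, unsigned int ring_count)
{
	return (COMPUTE_INDEX(HASH_ADDR(v4src, tcp_ports), ring_count));
}
```

With a hash like this, different flows land on different indices and hence different CPUs; that spreading is exactly what the IPv4-over-VLAN path loses when *indx is forced to 0.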
However, this is an IPv6 function, which isn't supposed to handle IPv4:
mac_rx_srs_long_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *mp,
    uint32_t sap, size_t hdrsize, enum pkt_type *type, uint_t *indx)
{
[...]
	if (sap == ETHERTYPE_IPV6) {
		[...]
	} else {
		*indx = 0;
		*type = OTH;
	}
	return (0);
DTrace showed that we were taking that "else" codepath, since this is IPv4. That codepath sets indx to 0, which picks the first CPU from the CPU_LIST. Always.
I don't think mac_rx_srs_long_fanout() should really be called, as it only deals with IPv6 properly. As an experiment, IPv6 was configured on the network, and fanout was found to then work properly. Apart from balancing in mpstat and mac_rx_deliver() firing on multiple CPUs, the following one-liner was also used to check:
dtrace -n 'mac_rx_srs_long_fanout:entry { self->i = args[5]; } mac_rx_srs_long_fanout:return { @[*self->i] = count(); self->i = 0; }'
This showed that for IPv6 the index was being set to different numbers. The same one-liner on IPv4 always returned 0 (though the fact that this function is being called at all is enough to know that we are in the wrong code path).
The parent function here is mac_rx_srs_fanout(), with important parts:
static void
mac_rx_srs_fanout(mac_soft_ring_set_t *mac_srs, mblk_t *head)
{
[...]
	/*
	 * Special clients (eg. VLAN, non ether, etc) need DLS
	 * processing in the Rx path. SRST_DLS_BYPASS will be clear for
	 * such SRSs. Another way of disabling bypass is to set the
	 * MCIS_RX_BYPASS_DISABLE flag.
	 */
	dls_bypass = ((mac_srs->srs_type & SRST_DLS_BYPASS) != 0) &&
	    ((mcip->mci_state_flags & MCIS_RX_BYPASS_DISABLE) == 0);
[...]
	if (!dls_bypass) {
		if (mac_rx_srs_long_fanout(mac_srs, mp, sap,
		    hdrsize, &type, &indx) == -1) {
[...]
Later on comes the "normal" hashing of TCP ports, but we never make it that far: since dls_bypass was not set, we enter mac_rx_srs_long_fanout() first. Presumably the idea is that VLANs need special processing (and so cannot take dls_bypass), but the fact that this feeds into the IPv6 fanout function, which doesn't handle IPv4 properly, is a problem.
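The bypass decision quoted above can be modeled as a tiny predicate. The flag values below are illustrative stand-ins, not the kernel's actual bit definitions; only the boolean structure mirrors mac_rx_srs_fanout():

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values; the real definitions live in the MAC headers. */
#define SRST_DLS_BYPASS        0x01u
#define MCIS_RX_BYPASS_DISABLE 0x02u

/*
 * Mirrors the dls_bypass computation: bypass is taken only when the SRS
 * allows it AND the client has not disabled it. VLAN clients have
 * SRST_DLS_BYPASS clear, so they fail this test and fall into the
 * mac_rx_srs_long_fanout() path regardless of IP version.
 */
static bool
dls_bypass(uint32_t srs_type, uint32_t mci_state_flags)
{
	return (((srs_type & SRST_DLS_BYPASS) != 0) &&
	    ((mci_state_flags & MCIS_RX_BYPASS_DISABLE) == 0));
}
```

This makes the failure mode easy to see: for a VLAN client the predicate is false, so every IPv4 packet is routed through the IPv6-only fanout function.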
The easiest fix may be to add hashing code to mac_rx_srs_long_fanout() for the IPv4 case. The code can be copied from later in mac_rx_srs_fanout() (just the TCP case should be enough), after testing that the extra bytes are available and don't need a pullup(); see the PORTS_SIZE tests as examples. Refactoring the code to avoid redundant TCP hashing calls is possible, but may be too complex a change for now.
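A hypothetical user-space sketch of what that IPv4 branch might look like follows: length-check the IPv4 header and port bytes (the PORTS_SIZE-style test standing in for a pullup check), then hash the source address and TCP ports. The function name, macro bodies, and return convention here are all illustrative, not the committed patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_ADDR(src, ports) \
	((src) ^ ((ports) >> 24) ^ ((ports) >> 16) ^ ((ports) >> 8) ^ (ports))
#define COMPUTE_INDEX(hash, count) ((hash) % (count))
#define IPPROTO_TCP_NUM 6

/*
 * Hypothetical sketch of the proposed IPv4 branch for
 * mac_rx_srs_long_fanout(): verify the full IPv4 header plus the TCP
 * port bytes are present, then hash them to a ring index. Returns -1
 * when the bytes are not all available (a pullup would be needed),
 * 0 for non-TCP traffic, else the computed ring index.
 */
static int
ipv4_tcp_fanout_index(const uint8_t *pkt, size_t len, unsigned int ring_count)
{
	if (len < 20)
		return (-1);			/* short of a full IPv4 header */
	size_t ihl = (size_t)(pkt[0] & 0x0f) * 4; /* header length in bytes */
	if (pkt[9] != IPPROTO_TCP_NUM)
		return (0);			/* non-TCP: no port hash here */
	if (len < ihl + 4)
		return (-1);			/* port bytes not present yet */

	uint32_t src, ports;
	memcpy(&src, pkt + 12, 4);		/* IPv4 source address */
	memcpy(&ports, pkt + ihl, 4);		/* TCP source + dest ports */
	return ((int)COMPUTE_INDEX(HASH_ADDR(src, ports), ring_count));
}
```

The key property is that the index now depends on the flow, so IPv4-over-VLAN traffic would spread across the fanout CPU list instead of collapsing onto index 0.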
Jerry tracked down and added the following note:
Found the following comment from the Crossbow code review:
Known Design Issues
DLS bypass for IPv4 only and asymmetric IPv4 vs IPv6 datapath (RFE 6719435, "Finish the DLS bypass work"). The RFE entails bringing a full SAP classifier into MAC, dealing with mac_header_info for v6 while preserving packet chains, and making squeue polling symmetric for both v4 and v6. Since there is substantial overlap with the IP datapath refactoring project, this RFE will be done in the datapath project gate after it merges with Crossbow putback 1.
Updated by Dan McDonald over 10 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
commit 3cc3202e63472d673bcac160bfb86d1865dd9fe7
Author: Dan McDonald <danmcd@nexenta.com>
Date: Wed Apr 3 14:02:24 2013 -0400
918 Need better IP fanout (esp. with VLANs present)
Reviewed by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Gordon Ross <gwr@nexenta.com>