Bug #12754
closedpacket flow over a defaulted LACP port
100%
Description
This is an upstream of SmartOS OS-6805. Found below is my original synopsis.
The problem is hit only under a confluence of conditions.
1. The aggr must have more then one client. Otherwise mac_tx_send
will avoid calling mac_bcast_send()
and instead head directly to mac_provider_tx()
, passing the ring supplied as argument which was properly selected via aggr's mac_tx_aggr_mode()
.
2. One of the aggr ports must be down and it must be the first one added to the aggr, causing mi_default_tx_ring()
to be set to the first pseudo ring of the downed port (this initiation is part of i_mac_group_add_ring()
).
3. LACP active mode will recognize that the other ports are up and call aggr_set_coll_dist()
with B_TRUE
in order to call aggr_send_port_enable()
to enable the port. This will place the port on lg_tx_ports
which will allow the port's pseudo rings to be selected as part of aggr_find_tx_ring()
(part of step 1). It will also make a call to aggr_grp_update_default()
to make sure the mi_defaut_tx_ring
is updated to the first ring or the first port in lg_tx_ports
. However, this will only happen if the pseudo ring is marked MR_INUSE
, otherwise the default ring will remain unchanged, leaving us at the mercy of the luck of the draw from step 2. And sure enough, when this LACP event fires the ring is not marked as in-use because it has no clients at the moment.
Here's an example of mac_hwring_set_default()
being called while the ring state is not MR_INUSE
(that's the 0x0
value at the end of the mac_hwring_set_default()
line).
13619438082931 aggr_send_port_enable 0xfffffe5967ffc9e8 aggr`aggr_set_coll_dist+0x7c aggr`lacp_mux_sm+0x1e0 aggr`lacp_receive_sm+0x309 aggr`aggr_lacp_rx+0xc1 aggr`aggr_lacp_rx_thread+0xd2 unix`thread_start+0xb 13619438093784 aggr_grp_update_default 0xfffffe595dd158f8 aggr`aggr_send_port_enable+0x86 aggr`aggr_set_coll_dist+0x7c aggr`lacp_mux_sm+0x1e0 aggr`lacp_receive_sm+0x309 aggr`aggr_lacp_rx+0xc1 aggr`aggr_lacp_rx_thread+0xd2 unix`thread_start+0xb 13619438099257 mac_hwring_set_default 0xfffffe59513b4008 0xfffffe5c7a74eca8 0x0
The end result of these three things is that the aggr's mac will be put in a state where its default Tx ring is pointing to a downed port instance and it has no way to notice this fact and recover. You might think that we need a call to aggr_send_port_disable()
to rectify the situation but the downed port is never enabled in the first place, so it's never part of the lg_tx_ports
list. This is why, if you only have one aggr client (i.e., you place IP directly on top of the primary client), the broadcast traffic will work just fine, because it will select a ring from the working lg_tx_ports
, but if you have two or more clients it will fallback to mac_bcast_send()
and try to use the bad mi_default_tx_ring
.
I think the real issue here is that mac_hwring_set_default()
shouldn't concern itself with ring state. It's sole purpose is to make sure that the mi_default_tx_ring()
is the first ring of the first port in lg_tx_ports
.
Updated by Ryan Zezeski almost 2 years ago
Here is the test I ran to verify the bug exists in gate.
Create two aggrs: aggr0={ixgbe2,ixgbe3}, aggr1={ixgbe5,ixgbe6}. Notice the first port in each aggr is down, this is a prerequisite for the bug to appear.
rpz@thunderhead:~$ dladm show-phys LINK MEDIA STATE SPEED DUPLEX DEVICE igb0 Ethernet unknown 0 half igb0 igb1 Ethernet up 1000 full igb1 ixgbe2 Ethernet down 0 unknown ixgbe2 ixgbe3 Ethernet up 10000 full ixgbe3 ixgbe4 Ethernet up 1000 full ixgbe4 ixgbe5 Ethernet down 0 unknown ixgbe5 ixgbe6 Ethernet up 10000 full ixgbe6 rpz@thunderhead:~$ pfexec dladm create-aggr -t -P L4 -L active -l ixgbe2 -l ixgbe3 aggr0 rpz@thunderhead:~$ pfexec dladm create-aggr -t -P L4 -L active -l ixgbe5 -l ixgbe6 aggr1 rpz@thunderhead:~$ dladm show-aggr -L LINK PORT AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED aggr0 ixgbe2 yes no no no yes no -- ixgbe3 yes yes yes yes no no aggr1 ixgbe5 yes no no no yes no -- ixgbe6 yes yes yes yes no no rpz@thunderhead:~$ dladm show-aggr -x LINK PORT SPEED DUPLEX STATE ADDRESS PORTSTATE aggr0 -- 10000Mb full up a0:36:9f:c2:3b:9c -- ixgbe2 0Mb unknown down a0:36:9f:c2:3b:9c standby ixgbe3 10000Mb full up a0:36:9f:c2:3b:9e attached aggr1 -- 10000Mb full up b4:96:91:5:48:38 -- ixgbe5 0Mb unknown down b4:96:91:5:48:38 standby ixgbe6 10000Mb full up b4:96:91:5:48:3a attached
Create an IP addr on top of aggr0 to send some broadcast traffic on.
rpz@thunderhead:~$ pfexec ipadm create-addr -t -T static -a 192.168.66.1/24 aggr0/v4
Then create a VNIC on aggr0 (the one which will be used to send broadcast traffic), as multiple clients is also a requirement for the bug to surface. We don't need an IP on this, we just need another mac client in order to influence the Tx code path.
rpz@thunderhead:~$ pfexec dladm create-vnic -t -l aggr0 unused0
Run this dtrace script to catch the bad send.
rpz@thunderhead:~$ pfexec dtrace -qn 'mac_bcast_send:entry /arg1 != NULL/ { this->grp = (mac_bcast_grp_t *)arg0; this->smcip = (mac_client_impl_t *)arg1; printf("%s 0x%p %s 0x%p 0x%p %s\n", probefunc, this->smcip, stringof(this->smcip->mci_name), this->grp, this->grp->mbg_mac_impl, stringof(this->grp->mbg_mac_impl->mi_name)); } ixgbe_ring_tx:entry { this->ring = (ixgbe_tx_ring_t *)arg0; this->ixgbe = this->ring->ixgbe; if ((this->ixgbe->link_state & 0x1) == 0) { printf("sending on down interface: ixgbe%d\n", this->ring->ixgbe->instance); stack(); } }'
Use ping to produce ARP packets outgoing on aggr0.
rpz@thunderhead:~$ ping -i 192.168.66.1 192.168.66.99
Notice dtrace picking up sends on the downed port.
rpz@thunderhead:~$ pfexec dtrace -qn 'mac_bcast_send:entry /arg1 != NULL/ { this->grp = (mac_bcast_grp_t *)arg0; this->smcip = (mac_client_impl_t *)arg1; printf("%s 0x%p %s 0x%p 0x%p %s\n", probefunc, this->smcip, stringof(this->smcip->mci_name), this->grp, this->grp->mbg_mac_impl, stringof(this->grp->mbg_mac_impl->mi_name)); } ixgbe_ring_tx:entry { this->ring = (ixgbe_tx_ring_t *)arg0; this->ixgbe = this->ring->ixgbe; if ((this->ixgbe->link_state & 0x1) == 0) { printf("sending on down interface: ixgbe%d\n", this->ring->ixgbe->instance); stack(); } }' mac_bcast_send 0xffffd01988fbc1d0 aggr0 0xffffd0195e418a08 0xffffd01989773260 aggr1017 sending on down interface: ixgbe2 mac`mac_hwring_tx+0x1c mac`mac_hwring_send_priv+0x90 aggr`aggr_ring_tx+0x1a mac`mac_hwring_tx+0x1c mac`mac_bcast_send+0x311 mac`mac_tx_send+0x1c4 mac`mac_tx_soft_ring_process+0x89 mac`mac_tx_aggr_mode+0x8c mac`mac_tx+0xdb dld`proto_unitdata_req+0x2d1 dld`dld_wput+0xa9 unix`putnext+0x233 ip`arp_output+0x1fd ip`arp_request+0x123 ip`ip_ndp_resolve+0x84 ip`ip_xmit+0xab5 ip`ire_send_wire_v4+0x345 ip`conn_ip_output+0x1d4 ip`icmp_output_newdst+0x644 ip`rawip_send+0x502 mac_bcast_send 0xffffd01988fbc1d0 aggr0 0xffffd0195e418a08 0xffffd01989773260 aggr1017 sending on down interface: ixgbe2 mac`mac_hwring_tx+0x1c mac`mac_hwring_send_priv+0x90 aggr`aggr_ring_tx+0x1a mac`mac_hwring_tx+0x1c mac`mac_bcast_send+0x311 mac`mac_tx_send+0x1c4 mac`mac_tx_soft_ring_process+0x89 mac`mac_tx_aggr_mode+0x8c mac`mac_tx+0xdb dld`proto_unitdata_req+0x2d1 dld`dld_wput+0xa9 unix`putnext+0x233 ip`arp_output+0x1fd ip`arp_request+0x123 ip`ip_ndp_resolve+0x84 ip`nce_timer+0x2fd genunix`callout_list_expire+0x8f genunix`callout_expire+0x33 genunix`callout_execute+0x1e genunix`taskq_thread+0x2cd
Use mdb to verify that aggr0's mi_default_tx_ring
is point to a ring on the downed port. First, get the pointer to the aggr driver.
rpz@thunderhead:~$ pfexec mdb -k Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs sata ip hook neti sockfs arp usba fctl stmf stmf_sbd mm sd lofs random idm mr_sas crypto ptm cpc fcip fcp ufs logindmux nsmb smbsrv nfs ] > ::walk mac_client_impl_cache | ::printf "%s %s 0x%p\n" mac_client_impl_t mci_name mci_mip->mi_name mci_mip->mi_driver ! grep "aggr0 " aggr0 aggr1017 0xffffd0199a09e128
Next, verify that the downed port (ixgbe2) is the first port in the aggr.
> 0xffffd0199a09e128::print aggr_grp_t lg_ports[0].lp_mh | ::print mac_impl_t mi_name mi_name = [ "ixgbe2" ]
Next, verify the Tx port list is correct.
> 0xffffd0199a09e128::print aggr_grp_t lg_tx_ports[0]->lp_mh | ::print mac_impl_t mi_name mi_name = [ "ixgbe3" ]
Finally, verify that mi_default_tx_ring
is pointing to the downed ixgbe2 port. This is our actual problem.
> 0xffffd0199a09e128::print aggr_grp_t lg_mh | ::print mac_impl_t mi_default_tx_ring mi_default_tx_ring = 0xffffd0195cf227a0 > 0xffffd0195cf227a0::print mac_ring_t mr_info.mri_driver | ::print aggr_pseudo_tx_ring_t atr_hw_rh | ::print mac_ring_t mr_info.mri_driver | ::print ixgbe_tx_ring_t ixgbe | ::print ixgbe_t instance instance = 0x2
Updated by Ryan Zezeski almost 2 years ago
Test With Patch¶
Create two aggrs: aggr0={ixgbe2,ixgbe3}, aggr1={ixgbe5,ixgbe6}
. Notice the first port in each aggr is down, this is a prerequisite for the bug to appear.
rpz@thunderhead:~$ dladm show-phys LINK MEDIA STATE SPEED DUPLEX DEVICE igb0 Ethernet unknown 0 half igb0 igb1 Ethernet up 1000 full igb1 ixgbe2 Ethernet down 0 unknown ixgbe2 ixgbe3 Ethernet unknown 0 unknown ixgbe3 ixgbe4 Ethernet up 1000 full ixgbe4 ixgbe5 Ethernet down 0 unknown ixgbe5 ixgbe6 Ethernet unknown 0 unknown ixgbe6 rpz@thunderhead:~$ pfexec dladm create-aggr -t -P L4 -L active -l ixgbe2 -l ixgbe3 aggr0 rpz@thunderhead:~$ pfexec dladm create-aggr -t -P L4 -L active -l ixgbe5 -l ixgbe6 aggr1 rpz@thunderhead:~$ dladm show-aggr -L LINK PORT AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED aggr0 ixgbe2 yes no no no yes no -- ixgbe3 yes yes yes yes no no aggr1 ixgbe5 yes no no no yes no -- ixgbe6 yes yes yes yes no no rpz@thunderhead:~$ dladm show-aggr -x LINK PORT SPEED DUPLEX STATE ADDRESS PORTSTATE aggr0 -- 10000Mb full up a0:36:9f:c2:3b:9c -- ixgbe2 0Mb unknown down a0:36:9f:c2:3b:9c standby ixgbe3 10000Mb full up a0:36:9f:c2:3b:9e attached aggr1 -- 10000Mb full up b4:96:91:5:48:38 -- ixgbe5 0Mb unknown down b4:96:91:5:48:38 standby ixgbe6 10000Mb full up b4:96:91:5:48:3a attached
Create an IP addr on top of aggr0 to send some broadcast traffic on.
rpz@thunderhead:~$ pfexec ipadm create-addr -t -T static -a 192.168.66.1/24 aggr0/v4
Then create a VNIC on aggr0 (the one which will be used to send broadcast traffic), as multiple clients is also a requirement for the bug to surface. We don't need an IP on this, we just need another mac client in order to influence the Tx code path.
rpz@thunderhead:~$ pfexec dladm create-vnic -t -l aggr0 unused0
Run this dtrace script to catch the bad send.
rpz@thunderhead:~$ pfexec dtrace -qn 'mac_bcast_send:entry /arg1 != NULL/ { this->grp = (mac_bcast_grp_t *)arg0; this->smcip = (mac_client_impl_t *)arg1; printf("%s 0x%p %s 0x%p 0x%p %s\n", probefunc, this->smcip, stringof(this->smcip->mci_name), this->grp, this->grp->mbg_mac_impl, stringof(this->grp->mbg_mac_impl->mi_name)); } ixgbe_ring_tx:entry { this->ring = (ixgbe_tx_ring_t *)arg0; this->ixgbe = this->ring->ixgbe; if ((this->ixgbe->link_state & 0x1) == 0) { printf("sending on down interface: ixgbe%d\n", this->ring->ixgbe->instance); stack(); } }'
Use ping to produce ARP packets outgoing on aggr0.
rpz@thunderhead:~$ ping -i 192.168.66.1 192.168.66.99
Notice dtrace picks up no sends on the downed port.
rpz@thunderhead:~$ pfexec dtrace -qn 'mac_bcast_send:entry /arg1 != NULL/ { this->grp = (mac_bcast_grp_t *)arg0; this->smcip = (mac_client_impl_t *)arg1; printf("%s 0x%p %s 0x%p 0x%p %s\n", probefunc, this->smcip, stringof(this->smcip->mci_name), this->grp, this->grp->mbg_mac_impl, stringof(this->grp->mbg_mac_impl->mi_name)); } ixgbe_ring_tx:entry { this->ring = (ixgbe_tx_ring_t *)arg0; this->ixgbe = this->ring->ixgbe; if ((this->ixgbe->link_state & 0x1) == 0) { printf("sending on down interface: ixgbe%d\n", this->ring->ixgbe->instance); stack(); } }' mac_bcast_send 0xfffffe5a97b8d390 aggr0 0xfffffe59eb504828 0xfffffe5a97b859c0 aggr1013 mac_bcast_send 0xfffffe5a97b8d390 aggr0 0xfffffe59eb504828 0xfffffe5a97b859c0 aggr1013 mac_bcast_send 0xfffffe5a97b8d390 aggr0 0xfffffe59eb504828 0xfffffe5a97b859c0 aggr1013 mac_bcast_send 0xfffffe5a97b8d390 aggr0 0xfffffe59eb504828 0xfffffe5a97b859c0 aggr1013 mac_bcast_send 0xfffffe5a97b8d390 aggr0 0xfffffe59eb504828 0xfffffe5a97b859c0 aggr1013 mac_bcast_send 0xfffffe5a97b8d390 aggr0 0xfffffe59eb504828 0xfffffe5a97b859c0 aggr1013
Use mdb to verify that aggr0's mi_default_tx_ring
is point to a ring on the downed port. First, get the pointer to the aggr driver.
rpz@thunderhead:~$ pfexec mdb -k Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs sata ip hook neti sockfs arp usba fctl stmf stmf_sbd mm sd lofs idm mr_sas ptm cpc fcip ufs logindmux nfs ] > ::walk mac_client_impl_cache | ::printf "%s %s 0x%p\n" mac_client_impl_t mci_name mci_mip->mi_name mci_mip->mi_driver ! grep "aggr0 " aggr0 aggr1013 0xfffffe5a833717d0
Next, verify that the downed port (ixgbe2) is the first port in the aggr. Note that this just verifies that ixgbe2 was the first port added -- it's the lg_tx_ports
and mi_default_tx_ring
that determine which port is used for Tx.
> 0xfffffe5a833717d0::print aggr_grp_t lg_ports[0].lp_mh | ::print mac_impl_t mi_name mi_name = [ "ixgbe2" ]
Next, verify the Tx port list is correct.
> 0xfffffe5a833717d0::print aggr_grp_t lg_tx_ports[0]->lp_mh | ::print mac_impl_t mi_name mi_name = [ "ixgbe3" ]
Finally, verify the fix: that mi_default_tx_ring
is pointing to a port that is up, in this case it should be ixgbe3.
> 0xfffffe5a833717d0::print aggr_grp_t lg_mh | ::print mac_impl_t mi_default_tx_ring mi_default_tx_ring = 0xfffffe5a6b8f8970 > 0xfffffe5a6b8f8970::print mac_ring_t mr_info.mri_driver | ::print aggr_pseudo_tx_ring_t atr_hw_rh | ::print mac_ring_t mr_info.mri_driver | ::print ixgbe_tx_ring_t ixgbe | ::print ixgbe_t instance instance = 0x3
Updated by Electric Monk almost 2 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit 99ad48a445346f36969661ed214f75b99e19a9a7
commit 99ad48a445346f36969661ed214f75b99e19a9a7 Author: Ryan Zezeski <rpz@joyent.com> Date: 2020-06-16T21:21:45.000Z 12754 packet flow over a defaulted LACP port Reviewed by: Dan McDonald <danmcd@joyent.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Robert Mustacchi <rm@fingolfin.org>