Project

General

Profile

Bug #12754

packet flow over a defaulted LACP port

Added by Ryan Zezeski 6 months ago. Updated 5 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
networking
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

This is an upstream of SmartOS OS-6805. Found below is my original synopsis.

The problem is hit only under a confluence of conditions.

1. The aggr must have more then one client. Otherwise mac_tx_send will avoid calling mac_bcast_send() and instead head directly to mac_provider_tx(), passing the ring supplied as argument which was properly selected via aggr's mac_tx_aggr_mode().

2. One of the aggr ports must be down and it must be the first one added to the aggr, causing mi_default_tx_ring() to be set to the first pseudo ring of the downed port (this initiation is part of i_mac_group_add_ring()).

3. LACP active mode will recognize that the other ports are up and call aggr_set_coll_dist() with B_TRUE in order to call aggr_send_port_enable() to enable the port. This will place the port on lg_tx_ports which will allow the port's pseudo rings to be selected as part of aggr_find_tx_ring() (part of step 1). It will also make a call to aggr_grp_update_default() to make sure the mi_defaut_tx_ring is updated to the first ring or the first port in lg_tx_ports. However, this will only happen if the pseudo ring is marked MR_INUSE, otherwise the default ring will remain unchanged, leaving us at the mercy of the luck of the draw from step 2. And sure enough, when this LACP event fires the ring is not marked as in-use because it has no clients at the moment.

Here's an example of mac_hwring_set_default() being called while the ring state is not MR_INUSE (that's the 0x0 value at the end of the mac_hwring_set_default() line).

13619438082931 aggr_send_port_enable 0xfffffe5967ffc9e8

              aggr`aggr_set_coll_dist+0x7c
              aggr`lacp_mux_sm+0x1e0
              aggr`lacp_receive_sm+0x309
              aggr`aggr_lacp_rx+0xc1
              aggr`aggr_lacp_rx_thread+0xd2
              unix`thread_start+0xb
13619438093784 aggr_grp_update_default 0xfffffe595dd158f8

              aggr`aggr_send_port_enable+0x86
              aggr`aggr_set_coll_dist+0x7c
              aggr`lacp_mux_sm+0x1e0
              aggr`lacp_receive_sm+0x309
              aggr`aggr_lacp_rx+0xc1
              aggr`aggr_lacp_rx_thread+0xd2
              unix`thread_start+0xb
13619438099257 mac_hwring_set_default 0xfffffe59513b4008 0xfffffe5c7a74eca8 0x0

The end result of these three things is that the aggr's mac will be put in a state where its default Tx ring is pointing to a downed port instance and it has no way to notice this fact and recover. You might think that we need a call to aggr_send_port_disable() to rectify the situation but the downed port is never enabled in the first place, so it's never part of the lg_tx_ports list. This is why, if you only have one aggr client (i.e., you place IP directly on top of the primary client), the broadcast traffic will work just fine, because it will select a ring from the working lg_tx_ports, but if you have two or more clients it will fallback to mac_bcast_send() and try to use the bad mi_default_tx_ring.

I think the real issue here is that mac_hwring_set_default() shouldn't concern itself with ring state. It's sole purpose is to make sure that the mi_default_tx_ring() is the first ring of the first port in lg_tx_ports.

#1

Updated by Ryan Zezeski 6 months ago

Here is the test I ran to verify the bug exists in gate.

Create two aggrs: aggr0={ixgbe2,ixgbe3}, aggr1={ixgbe5,ixgbe6}. Notice the first port in each aggr is down, this is a prerequisite for the bug to appear.

rpz@thunderhead:~$ dladm show-phys
LINK         MEDIA                STATE      SPEED  DUPLEX    DEVICE
igb0         Ethernet             unknown    0      half      igb0
igb1         Ethernet             up         1000   full      igb1
ixgbe2       Ethernet             down       0      unknown   ixgbe2
ixgbe3       Ethernet             up         10000  full      ixgbe3
ixgbe4       Ethernet             up         1000   full      ixgbe4
ixgbe5       Ethernet             down       0      unknown   ixgbe5
ixgbe6       Ethernet             up         10000  full      ixgbe6

rpz@thunderhead:~$ pfexec dladm create-aggr -t -P L4 -L active -l ixgbe2 -l ixgbe3 aggr0

rpz@thunderhead:~$ pfexec dladm create-aggr -t -P L4 -L active -l ixgbe5 -l ixgbe6 aggr1

rpz@thunderhead:~$ dladm show-aggr -L
LINK        PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr0       ixgbe2       yes          no   no   no   yes       no
--          ixgbe3       yes          yes  yes  yes  no        no
aggr1       ixgbe5       yes          no   no   no   yes       no
--          ixgbe6       yes          yes  yes  yes  no        no
rpz@thunderhead:~$ dladm show-aggr -x
LINK        PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr0       --             10000Mb full   up        a0:36:9f:c2:3b:9c  --
            ixgbe2         0Mb  unknown   down      a0:36:9f:c2:3b:9c  standby
            ixgbe3         10000Mb full   up        a0:36:9f:c2:3b:9e  attached
aggr1       --             10000Mb full   up        b4:96:91:5:48:38   --
            ixgbe5         0Mb  unknown   down      b4:96:91:5:48:38   standby
            ixgbe6         10000Mb full   up        b4:96:91:5:48:3a   attached

Create an IP addr on top of aggr0 to send some broadcast traffic on.

rpz@thunderhead:~$ pfexec ipadm create-addr -t -T static -a 192.168.66.1/24 aggr0/v4

Then create a VNIC on aggr0 (the one which will be used to send broadcast traffic), as multiple clients is also a requirement for the bug to surface. We don't need an IP on this, we just need another mac client in order to influence the Tx code path.

rpz@thunderhead:~$ pfexec dladm create-vnic -t -l aggr0 unused0

Run this dtrace script to catch the bad send.

rpz@thunderhead:~$ pfexec dtrace -qn 'mac_bcast_send:entry /arg1 != NULL/ { this->grp = (mac_bcast_grp_t *)arg0; this->smcip = (mac_client_impl_t *)arg1; printf("%s 0x%p %s 0x%p 0x%p %s\n", probefunc, this->smcip, stringof(this->smcip->mci_name), this->grp, this->grp->mbg_mac_impl, stringof(this->grp->mbg_mac_impl->mi_name)); } ixgbe_ring_tx:entry { this->ring = (ixgbe_tx_ring_t *)arg0; this->ixgbe = this->ring->ixgbe; if ((this->ixgbe->link_state & 0x1) == 0) { printf("sending on down interface: ixgbe%d\n", this->ring->ixgbe->instance); stack(); } }'

Use ping to produce ARP packets outgoing on aggr0.

rpz@thunderhead:~$ ping -i 192.168.66.1 192.168.66.99

Notice dtrace picking up sends on the downed port.

rpz@thunderhead:~$ pfexec dtrace -qn 'mac_bcast_send:entry /arg1 != NULL/ { this->grp = (mac_bcast_grp_t *)arg0; this->smcip = (mac_client_impl_t *)arg1; printf("%s 0x%p %s 0x%p 0x%p %s\n", probefunc, this->smcip, stringof(this->smcip->mci_name), this->grp, this->grp->mbg_mac_impl, stringof(this->grp->mbg_mac_impl->mi_name)); } ixgbe_ring_tx:entry { this->ring = (ixgbe_tx_ring_t *)arg0; this->ixgbe = this->ring->ixgbe; if ((this->ixgbe->link_state & 0x1) == 0) { printf("sending on down interface: ixgbe%d\n", this->ring->ixgbe->instance); stack(); } }'

mac_bcast_send 0xffffd01988fbc1d0 aggr0 0xffffd0195e418a08 0xffffd01989773260 aggr1017
sending on down interface: ixgbe2

              mac`mac_hwring_tx+0x1c
              mac`mac_hwring_send_priv+0x90
              aggr`aggr_ring_tx+0x1a
              mac`mac_hwring_tx+0x1c
              mac`mac_bcast_send+0x311
              mac`mac_tx_send+0x1c4
              mac`mac_tx_soft_ring_process+0x89
              mac`mac_tx_aggr_mode+0x8c
              mac`mac_tx+0xdb
              dld`proto_unitdata_req+0x2d1
              dld`dld_wput+0xa9
              unix`putnext+0x233
              ip`arp_output+0x1fd
              ip`arp_request+0x123
              ip`ip_ndp_resolve+0x84
              ip`ip_xmit+0xab5
              ip`ire_send_wire_v4+0x345
              ip`conn_ip_output+0x1d4
              ip`icmp_output_newdst+0x644
              ip`rawip_send+0x502
mac_bcast_send 0xffffd01988fbc1d0 aggr0 0xffffd0195e418a08 0xffffd01989773260 aggr1017
sending on down interface: ixgbe2

              mac`mac_hwring_tx+0x1c
              mac`mac_hwring_send_priv+0x90
              aggr`aggr_ring_tx+0x1a
              mac`mac_hwring_tx+0x1c
              mac`mac_bcast_send+0x311
              mac`mac_tx_send+0x1c4
              mac`mac_tx_soft_ring_process+0x89
              mac`mac_tx_aggr_mode+0x8c
              mac`mac_tx+0xdb
              dld`proto_unitdata_req+0x2d1
              dld`dld_wput+0xa9
              unix`putnext+0x233
              ip`arp_output+0x1fd
              ip`arp_request+0x123
              ip`ip_ndp_resolve+0x84
              ip`nce_timer+0x2fd
              genunix`callout_list_expire+0x8f
              genunix`callout_expire+0x33
              genunix`callout_execute+0x1e
              genunix`taskq_thread+0x2cd

Use mdb to verify that aggr0's mi_default_tx_ring is point to a ring on the downed port. First, get the pointer to the aggr driver.

rpz@thunderhead:~$ pfexec mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs sata ip hook neti sockfs arp usba fctl stmf stmf_sbd mm sd lofs random idm mr_sas crypto ptm cpc fcip fcp ufs logindmux nsmb smbsrv nfs ]

> ::walk mac_client_impl_cache | ::printf "%s %s 0x%p\n" mac_client_impl_t mci_name mci_mip->mi_name mci_mip->mi_driver ! grep "aggr0 " 
aggr0 aggr1017 0xffffd0199a09e128

Next, verify that the downed port (ixgbe2) is the first port in the aggr.

> 0xffffd0199a09e128::print aggr_grp_t lg_ports[0].lp_mh | ::print mac_impl_t mi_name
mi_name = [ "ixgbe2" ]

Next, verify the Tx port list is correct.

> 0xffffd0199a09e128::print aggr_grp_t lg_tx_ports[0]->lp_mh | ::print mac_impl_t mi_name
mi_name = [ "ixgbe3" ]

Finally, verify that mi_default_tx_ring is pointing to the downed ixgbe2 port. This is our actual problem.

> 0xffffd0199a09e128::print aggr_grp_t lg_mh | ::print mac_impl_t mi_default_tx_ring
mi_default_tx_ring = 0xffffd0195cf227a0
> 0xffffd0195cf227a0::print mac_ring_t mr_info.mri_driver | ::print aggr_pseudo_tx_ring_t atr_hw_rh | ::print mac_ring_t mr_info.mri_driver | ::print ixgbe_tx_ring_t ixgbe | ::print ixgbe_t instance
instance = 0x2
#2

Updated by Ryan Zezeski 5 months ago

Test With Patch

Create two aggrs: aggr0={ixgbe2,ixgbe3}, aggr1={ixgbe5,ixgbe6}. Notice the first port in each aggr is down, this is a prerequisite for the bug to appear.

rpz@thunderhead:~$ dladm show-phys
LINK         MEDIA                STATE      SPEED  DUPLEX    DEVICE
igb0         Ethernet             unknown    0      half      igb0
igb1         Ethernet             up         1000   full      igb1
ixgbe2       Ethernet             down       0      unknown   ixgbe2
ixgbe3       Ethernet             unknown    0      unknown   ixgbe3
ixgbe4       Ethernet             up         1000   full      ixgbe4
ixgbe5       Ethernet             down       0      unknown   ixgbe5
ixgbe6       Ethernet             unknown    0      unknown   ixgbe6

rpz@thunderhead:~$ pfexec dladm create-aggr -t -P L4 -L active -l ixgbe2 -l ixgbe3 aggr0
rpz@thunderhead:~$ pfexec dladm create-aggr -t -P L4 -L active -l ixgbe5 -l ixgbe6 aggr1

rpz@thunderhead:~$ dladm show-aggr -L
LINK        PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr0       ixgbe2       yes          no   no   no   yes       no
--          ixgbe3       yes          yes  yes  yes  no        no
aggr1       ixgbe5       yes          no   no   no   yes       no
--          ixgbe6       yes          yes  yes  yes  no        no

rpz@thunderhead:~$ dladm show-aggr -x
LINK        PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr0       --             10000Mb full   up        a0:36:9f:c2:3b:9c  --
            ixgbe2         0Mb  unknown   down      a0:36:9f:c2:3b:9c  standby
            ixgbe3         10000Mb full   up        a0:36:9f:c2:3b:9e  attached
aggr1       --             10000Mb full   up        b4:96:91:5:48:38   --
            ixgbe5         0Mb  unknown   down      b4:96:91:5:48:38   standby
            ixgbe6         10000Mb full   up        b4:96:91:5:48:3a   attached

Create an IP addr on top of aggr0 to send some broadcast traffic on.

rpz@thunderhead:~$ pfexec ipadm create-addr -t -T static -a 192.168.66.1/24 aggr0/v4

Then create a VNIC on aggr0 (the one which will be used to send broadcast traffic), as multiple clients is also a requirement for the bug to surface. We don't need an IP on this, we just need another mac client in order to influence the Tx code path.

rpz@thunderhead:~$ pfexec dladm create-vnic -t -l aggr0 unused0

Run this dtrace script to catch the bad send.

rpz@thunderhead:~$ pfexec dtrace -qn 'mac_bcast_send:entry /arg1 != NULL/ { this->grp = (mac_bcast_grp_t *)arg0; this->smcip = (mac_client_impl_t *)arg1; printf("%s 0x%p %s 0x%p 0x%p %s\n", probefunc, this->smcip, stringof(this->smcip->mci_name), this->grp, this->grp->mbg_mac_impl, stringof(this->grp->mbg_mac_impl->mi_name)); } ixgbe_ring_tx:entry { this->ring = (ixgbe_tx_ring_t *)arg0; this->ixgbe = this->ring->ixgbe; if ((this->ixgbe->link_state & 0x1) == 0) { printf("sending on down interface: ixgbe%d\n", this->ring->ixgbe->instance); stack(); } }'

Use ping to produce ARP packets outgoing on aggr0.

rpz@thunderhead:~$ ping -i 192.168.66.1 192.168.66.99

Notice dtrace picks up no sends on the downed port.

rpz@thunderhead:~$ pfexec dtrace -qn 'mac_bcast_send:entry /arg1 != NULL/ { this->grp = (mac_bcast_grp_t *)arg0; this->smcip = (mac_client_impl_t *)arg1; printf("%s 0x%p %s 0x%p 0x%p %s\n", probefunc, this->smcip, stringof(this->smcip->mci_name), this->grp, this->grp->mbg_mac_impl, stringof(this->grp->mbg_mac_impl->mi_name)); } ixgbe_ring_tx:entry { this->ring = (ixgbe_tx_ring_t *)arg0; this->ixgbe = this->ring->ixgbe; if ((this->ixgbe->link_state & 0x1) == 0) { printf("sending on down interface: ixgbe%d\n", this->ring->ixgbe->instance); stack(); } }'
mac_bcast_send 0xfffffe5a97b8d390 aggr0 0xfffffe59eb504828 0xfffffe5a97b859c0 aggr1013
mac_bcast_send 0xfffffe5a97b8d390 aggr0 0xfffffe59eb504828 0xfffffe5a97b859c0 aggr1013
mac_bcast_send 0xfffffe5a97b8d390 aggr0 0xfffffe59eb504828 0xfffffe5a97b859c0 aggr1013
mac_bcast_send 0xfffffe5a97b8d390 aggr0 0xfffffe59eb504828 0xfffffe5a97b859c0 aggr1013
mac_bcast_send 0xfffffe5a97b8d390 aggr0 0xfffffe59eb504828 0xfffffe5a97b859c0 aggr1013
mac_bcast_send 0xfffffe5a97b8d390 aggr0 0xfffffe59eb504828 0xfffffe5a97b859c0 aggr1013

Use mdb to verify that aggr0's mi_default_tx_ring is point to a ring on the downed port. First, get the pointer to the aggr driver.

rpz@thunderhead:~$ pfexec mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs sata ip hook neti sockfs arp usba fctl stmf stmf_sbd mm sd lofs idm mr_sas ptm cpc fcip ufs logindmux nfs ]
> ::walk mac_client_impl_cache | ::printf "%s %s 0x%p\n" mac_client_impl_t mci_name mci_mip->mi_name mci_mip->mi_driver ! grep "aggr0 " 
aggr0 aggr1013 0xfffffe5a833717d0

Next, verify that the downed port (ixgbe2) is the first port in the aggr. Note that this just verifies that ixgbe2 was the first port added -- it's the lg_tx_ports and mi_default_tx_ring that determine which port is used for Tx.

> 0xfffffe5a833717d0::print aggr_grp_t lg_ports[0].lp_mh | ::print mac_impl_t mi_name
mi_name = [ "ixgbe2" ]

Next, verify the Tx port list is correct.

> 0xfffffe5a833717d0::print aggr_grp_t lg_tx_ports[0]->lp_mh | ::print mac_impl_t mi_name
mi_name = [ "ixgbe3" ]

Finally, verify the fix: that mi_default_tx_ring is pointing to a port that is up, in this case it should be ixgbe3.

> 0xfffffe5a833717d0::print aggr_grp_t lg_mh | ::print mac_impl_t mi_default_tx_ring
mi_default_tx_ring = 0xfffffe5a6b8f8970
> 0xfffffe5a6b8f8970::print mac_ring_t mr_info.mri_driver | ::print aggr_pseudo_tx_ring_t atr_hw_rh | ::print mac_ring_t mr_info.mri_driver | ::print ixgbe_tx_ring_t ixgbe | ::print ixgbe_t instance
instance = 0x3
#3

Updated by Electric Monk 5 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 99ad48a445346f36969661ed214f75b99e19a9a7

commit  99ad48a445346f36969661ed214f75b99e19a9a7
Author: Ryan Zezeski <rpz@joyent.com>
Date:   2020-06-16T21:21:45.000Z

    12754 packet flow over a defaulted LACP port
    Reviewed by: Dan McDonald <danmcd@joyent.com>
    Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
    Reviewed by: Igor Kozhukhov <igor@dilos.org>
    Approved by: Robert Mustacchi <rm@fingolfin.org>

Also available in: Atom PDF