Project

General

Profile

Bug #10282

aggrs state machine in confusing state before mc_start() called

Added by Robert Mustacchi about 1 year ago. Updated about 1 year ago.

Status:
Closed
Priority:
Normal
Category:
networking
Start date:
2019-01-24
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:

Description

This is a tale of an aggr that was stuck. So stuck that it never tried to transmit any LACP frames ever again. The mystery: why.

aggr_port_t {
    struct aggr_port_s *lp_next = 0
    struct aggr_grp_s *lp_grp = 0xffffd0c589cfd1b8
    datalink_id_t lp_linkid = 0x2
    uint16_t lp_portid = 0x2
    uint8_t [6] lp_addr = [ 0x90, 0xe2, 0xba, 0xce, 0x3f, 0x10 ]
    uint32_t lp_refs = 0x3
    aggr_port_state_t lp_state = 0x2 (AGGR_PORT_STATE_ATTACHED)
    unsigned int:1 lp_started :1 = 0
    unsigned int:1 lp_tx_enabled :1 = 0
    unsigned int:1 lp_collector_enabled :1 = 0
    unsigned int:1 lp_promisc_on :1 = 0
    unsigned int:1 lp_no_link_update :1 = 0
    unsigned int:1 lp_rx_grp_added :1 = 0x1
    unsigned int:1 lp_tx_grp_added :1 = 0x1
    unsigned int:1 lp_closing :1 = 0
    unsigned int:24 lp_pad_bits = 0
    mac_handle_t lp_mh = 0xffffd0c572df56c0
    mac_client_handle_t lp_mch = 0xffffd0c60d8c3ab0
    const mac_info_t *lp_mip = 0xffffd0c572df5708
    mac_notify_handle_t lp_mnh = 0xffffd0c572a9e1d8
    uint_t lp_tx_idx = 0
    uint64_t lp_ifspeed = 0x2540be400
    link_state_t lp_link_state = 1 (LINK_STATE_UP)
    link_duplex_t lp_link_duplex = 0x2 (LINK_DUPLEX_FULL)
    uint64_t [17] lp_stat = [ 0x2540be400, 0, 0x266b, 0, 0, 0, 0, 0, 0, 0, 0, 0xffc0, 0xe14bc5f, 0, 0, 0, 0 ]
    uint64_t [75] lp_ether_stat = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x1, 0x3, 0x1, 0, 0, 0, 0, 0, 0, 0x1, 0x1, 0x1, 0, 0, 0, 0, 0, 0, 0x1, 0x1, ... ]
    aggr_lacp_port_t lp_lacp = {
        uint16_t ActorPortNumber = 0x2
        uint16_t ActorPortPriority = 0x1000
        uint32_t ActorPortAggrId = 0
        boolean_t NTT = 0x1 (B_TRUE)
        uint16_t ActorAdminPortKey = 0x3e8
        uint16_t ActorOperPortKey = 0x3e8
        aggr_lacp_state_t ActorAdminPortState = {
            struct  bit = {
                unsigned char:1 activity :1 = 0x1
                unsigned char:1 timeout :1 = 0x1
                unsigned char:1 aggregation :1 = 0x1
                unsigned char:1 sync :1 = 0
                unsigned char:1 collecting :1 = 0
                unsigned char:1 distributing :1 = 0
                unsigned char:1 defaulted :1 = 0
                unsigned char:1 expired :1 = 0
            }
            uint8_t state = 0x7
        }
        aggr_lacp_state_t ActorOperPortState = {
            struct  bit = {
                unsigned char:1 activity :1 = 0x1
                unsigned char:1 timeout :1 = 0x1
                unsigned char:1 aggregation :1 = 0x1
                unsigned char:1 sync :1 = 0
                unsigned char:1 collecting :1 = 0
                unsigned char:1 distributing :1 = 0
                unsigned char:1 defaulted :1 = 0x1
                unsigned char:1 expired :1 = 0
            }
            uint8_t state = 0x47
        }
        struct ether_addr PartnerAdminSystem = {
            ether_addr_t ether_addr_octet = [ 0, 0, 0, 0, 0, 0 ]
        }
        struct ether_addr PartnerOperSystem = {
            ether_addr_t ether_addr_octet = [ 0, 0, 0, 0, 0, 0 ]
        }
        uint16_t PartnerAdminSysPriority = 0
        uint16_t PartnerOperSysPriority = 0
        uint16_t PartnerAdminKey = 0
        uint16_t PartnerOperKey = 0
        uint16_t PartnerAdminPortNum = 0
        uint16_t PartnerOperPortNum = 0
        uint16_t PartnerAdminPortPriority = 0
        uint16_t PartnerOperPortPriority = 0
        aggr_lacp_state_t PartnerAdminPortState = {
            struct  bit = {
                unsigned char:1 activity :1 = 0
                unsigned char:1 timeout :1 = 0x1
                unsigned char:1 aggregation :1 = 0
                unsigned char:1 sync :1 = 0
                unsigned char:1 collecting :1 = 0
                unsigned char:1 distributing :1 = 0
                unsigned char:1 defaulted :1 = 0
                unsigned char:1 expired :1 = 0
            }
            uint8_t state = 0x2
        }
        aggr_lacp_state_t PartnerOperPortState = {
            struct  bit = {
                unsigned char:1 activity :1 = 0
                unsigned char:1 timeout :1 = 0x1
                unsigned char:1 aggregation :1 = 0
                unsigned char:1 sync :1 = 0x1
                unsigned char:1 collecting :1 = 0
                unsigned char:1 distributing :1 = 0
                unsigned char:1 defaulted :1 = 0
                unsigned char:1 expired :1 = 0
            }
            uint8_t state = 0xa
        }
        uint16_t PartnerCollectorMaxDelay = 0
        state_machine_t sm = {
            unsigned int:1 lacp_on :1 = 0x1
            unsigned int:1 begin :1 = 0
            unsigned int:1 lacp_enabled :1 = 0x1
            unsigned int:1 port_enabled :1 = 0x1
            unsigned int:1 actor_churn :1 = 0
            unsigned int:1 partner_churn :1 = 0
            unsigned int:1 ready_n :1 = 0x1
            unsigned int:1 port_moved :1 = 0
            unsigned int:24 pad_bits = 0
            lacp_selected_t selected = 0 (0)
            uint32_t current_while_timer_exp = 0
            lacp_periodic_state_t periodic_state = 1 (LACP_FAST_PERIODIC)
            lacp_receive_state_t receive_state = 4 (LACP_DEFAULTED)
            lacp_mux_state_t mux_state = 0 (LACP_DETACHED)
            lacp_churn_state_t churn_state = 0 (0)
        }
        lacp_timer_t current_while_timer = {
            uint32_t val = 0x3
            timeout_id_t id = 0
        }
        lacp_timer_t periodic_timer = {
            uint32_t val = 0x1
            timeout_id_t id = 0xd4c247a05
        }
        lacp_timer_t wait_while_timer = {
            uint32_t val = 0x2
            timeout_id_t id = 0
        }
        uint32_t lacp_timer_bits = 0
        kthread_t *lacp_timer_thread = 0xffffd003d18a7c40
        kmutex_t lacp_timer_lock = {
            void *[1] _opaque = [ 0 ]
        }
        kcondvar_t lacp_timer_cv = {
            ushort_t _opaque = 0x1
        }
        hrtime_t time = 0x19d009d9235fe5
    }
    lacp_stats_t lp_lacp_stats = {
        uint64_t LACPDUsRx = 0
        uint64_t MarkerPDUsRx = 0
        uint64_t MarkerResponsePDUsRx = 0
        uint64_t UnknownRx = 0
        uint64_t IllegalRx = 0
        uint64_t LACPDUsTx = 0
        uint64_t MarkerPDUsTx = 0
        uint64_t MarkerResponsePDUsTx = 0
    }
    uint32_t lp_margin = 0x4
    mac_promisc_handle_t lp_mphp = 0
    mac_unicast_handle_t lp_mah = 0xffffd0c57642ac90
    aggr_unicst_addr_t *lp_prom_addr = 0
    mac_group_handle_t lp_hwgh = 0xffffd0c57115d000
    int lp_tx_ring_cnt = 0x8
    mac_ring_handle_t *lp_tx_rings = 0xffffd0c574a38880
    mac_ring_handle_t *lp_pseudo_tx_rings = 0xffffd0c582c7d7c0
    void *lp_tx_notify_mh = 0xffffd0c589adc320
}

And the matching grp_t:

aggr_grp_t {
    datalink_id_t lg_linkid = 0x7
    uint16_t lg_key = 0x3e8
    uint32_t lg_refs = 0x3
    uint16_t lg_nports = 0x2
    uint8_t [6] lg_addr = [ 0x90, 0xe2, 0xba, 0xce, 0x44, 0xcc ]
    unsigned short:1 lg_closing :1 = 0
    unsigned short:1 lg_addr_fixed :1 = 0
    unsigned short:1 lg_started :1 = 0
    unsigned short:1 lg_promisc :1 = 0
    unsigned short:1 lg_zcopy :1 = 0x1
    unsigned short:1 lg_vlan :1 = 0x1
    unsigned short:1 lg_force :1 = 0
    unsigned short:1 lg_lso :1 = 0x1
    unsigned short:8 lg_pad_bits = 0
    aggr_port_t *lg_ports = 0xffffd0c60d888aa8
    aggr_port_t *lg_mac_addr_port = 0xffffd0c60d888aa8
    mac_handle_t lg_mh = 0xffffd0c572dffb20
    zoneid_t lg_zoneid = 0
    uint_t lg_nattached_ports = 0x2
    krwlock_t lg_tx_lock = {
        void *[1] _opaque = [ 0 ]
    }
    uint_t lg_ntx_ports = 0
    aggr_port_t **lg_tx_ports = 0
    uint_t lg_tx_ports_size = 0
    uint32_t lg_tx_policy = 0x7
    uint8_t lg_mac_tx_policy = 0x7
    uint64_t lg_ifspeed = 0x2540be400
    link_state_t lg_link_state = 1 (LINK_STATE_UP)
    link_duplex_t lg_link_duplex = 0x2 (LINK_DUPLEX_FULL)
    uint64_t [17] lg_stat = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
    uint64_t [75] lg_ether_stat = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]
    aggr_lacp_mode_t lg_lacp_mode = 0x1 (AGGR_LACP_ACTIVE)
    Agg_t aggr = {
        uint32_t AggregatorIdentifier = 0
        boolean_t IndividualAggr = 0 (0)
        uint32_t ActorAdminKey = 0
        uint32_t ActorOperKey = 0
        struct ether_addr PartnerSystem = {
            ether_addr_t ether_addr_octet = [ 0, 0, 0, 0, 0, 0 ]
        }
        uint32_t PartnerSystemPriority = 0
        uint32_t PartnerOperAggrKey = 0
        boolean_t ReceiveState = 0 (0)
        boolean_t TransmitState = 0 (0)
        uint16_t ActorSystemPriority = 0x1000
        uint16_t CollectorMaxDelay = 0xa
        aggr_lacp_timer_t PeriodicTimer = 0x1 (AGGR_LACP_TIMER_SHORT)
        uint64_t TimeOfLastOperChange = 0
        boolean_t ready = 0x1 (B_TRUE)
    }
    uint32_t lg_hcksum_txflags = 0x12
    uint_t lg_max_sdu = 0x5dc
    uint32_t lg_margin = 0x4
    mac_capab_lso_t lg_cap_lso = {
        t_uscalar_t lso_flags = 0x1
        lso_basic_tcp_ipv4_t lso_basic_tcp_ipv4 = {
            t_uscalar_t lso_max = 0xffff
        }
    }
    kmutex_t lg_lacp_lock = {
        void *[1] _opaque = [ 0 ]
    }
    kcondvar_t lg_lacp_cv = {
        ushort_t _opaque = 0x1
    }
    mblk_t *lg_lacp_head = 0
    mblk_t *lg_lacp_tail = 0
    kthread_t *lg_lacp_rx_thread = 0xffffd003d5ec6c40
    boolean_t lg_lacp_done = 0 (0)
...

All attempts to TX on this aggr_port_t bail out because lp_started is 0. Also note that we are in state AGGR_PORT_STATE_ATTACHED and LACP_DEFAULTED.


theory:

  • we start up and set up the aggr and the port on it
  • the port starts with the begin flag and moves to LACP_INITIALIZE, then LACP_PORT_DISABLED, then LACP_EXPIRED
  • in LACP_EXPIRED we set up the tx periodic SM and wait one SHORT_TIMEOUT_TIME. However, the tx periodic SM never transmits anything because we haven't joined the group yet (and lp_started is still 0).
  • if we don't receive anything from the switch in 3 sec, we go to LACP_DEFAULTED and we will sit there until we receive an LACP frame from the switch. This is clearly wrong, as we're in "Active" mode – we should be transmitting at this point.
  • if we do receive something, we go to LACP_CURRENT. Because we haven't sent anything to our partner, it's possible they sent a frame to us without the "sync" bit set (since at their end, the link it out of sync). This means we don't "select" this port automatically, and we don't call lacp_mux_sm... nothing works
  • the lp_started flag is set only by aggr_port_start, which is only called either by aggr_grp_add_ports (if lg_started is set, which it isn't), or aggr_m_start (the aggr MAC start callback, which would also set lg_started if it had been called)

First, let's get basic aggr state:

> !dladm show-aggr   
LINK            POLICY   ADDRPOLICY           LACPACTIVITY  LACPTIMER   FLAGS
aggr1           L2,L3,L4 auto                 active        short       -----
aggr2           L2,L3,L4 auto                 active        short       -----
> !dladm show-aggr -x
LINK        PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr1       --             10000Mb full   up        90:e2:ba:ce:44:cc  --
            ixgbe0         10000Mb full   up        90:e2:ba:ce:44:cc  attached
            ixgbe2         10000Mb full   up        90:e2:ba:ce:3f:10  attached
aggr2       --             10000Mb full   up        90:e2:ba:ce:44:cd  --
            ixgbe1         10000Mb full   up        90:e2:ba:ce:44:cd  attached
            ixgbe3         10000Mb full   up        90:e2:ba:ce:3f:11  attached
> !dladm show-aggr -L
LINK        PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr1       ixgbe0       yes          no   no   no   yes       no
--          ixgbe2       yes          no   no   no   yes       no
aggr2       ixgbe1       yes          yes  yes  yes  no        no
--          ixgbe3       yes          yes  yes  yes  no        no

So, based on this we care about aggr1, let's figure out which aggr_grp_t that
is.

> ::walk aggr_grp_cache
0xffffd0c589cfa970
0xffffd0c589cfd1b8
> ::walk aggr_grp_cache | ::print aggr_grp_t lg_mh | ::print mac_impl_t mi_addr
mi_addr = [ 0x90, 0xe2, 0xba, 0xce, 0x44, 0xcd, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0 ]
mi_addr = [ 0x90, 0xe2, 0xba, 0xce, 0x44, 0xcc, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0 ]

So, the aggr we care about is the second one, 0xffffd0c589cfd1b8. So, this
is the nic that should be used for all internal/external traffic.

Now, let's try and understand if this device has ever been started:

> 0xffffd0c589cfd1b8::print aggr_grp_t lg_mh | ::print mac_impl_t mi_active
mi_active = 0

Suspiciously, the NIC isn't considered active at this point. Now, if we double
check the system, we'll actually find there are no VMs or VNICs or anything on
this interface. Which explains 100% why we're not transmitting and why we're in
an odd state.

Importantly, the underlying links used in a link aggregation are always opened up by the driver when the individual links are added to the aggregation through the various aggregation ioctls. Effectively, the device is always started and can send and receive packets. In this case, deferring the tx state machine until after we call the mc_start() endpoint doesn't actually provide any value.

1. Resources aren't preserved, because all of the resources are already used because we've opened the links.
2. Because we have the rx state machine, but not the tx state machine we end up leaving administrators in a confused place where they believe that the link is down. We've seen a couple of customer incidents related to this.

In conclusion we should start the state machine as soon as we start the rx state machine which happens as soon as the port is attached to the group.


To test this I went through and booted the system before and after the change and observed the behavior with the link up and down.

On the new bits, everything entered the expected LACP state and stayed there, even with the aggregation being down:

[root@haswell ~]# dladm show-aggr -L
LINK        PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
foo0        i40e0        yes          yes  yes  yes  no        yes
--          i40e1        yes          yes  yes  yes  no        yes

I used a DTrace one liner to look at the number of functions called and verified that things looked reasonable and then also went through and verified that we saw similar amounts of traffic being sent via snoop on a physical device.

Even with this in place, I found that if we ended up enabling the aggregation and then stopping it, we'd end up seeing this behavior but under a slightly different complextion. While we were now sending out packets and receiving packets from the switch, the received packets were not making it to the aggregation driver.

The problem is that the aggregation driver calls a function to 'stop' the underlying hardware ring when its rings are stopped. As part of stopping the interface, all of its rings are stopped. The aggr driver has these 'pseudo-rings'. The 'psuedo-rings' end up calling a private mac function to stop the underlying ring when they need to, which causes the ring not to be stopped, but just quiesced. This means that all of the packets that come into the ring are dropped in MAC, which includes all of our lacp packets.

It's interesting to note that this doesn't happen in the default state. In fact, we always receive everything up into the aggr and then it gets dropped later on. In this case, I don't believe that we're receiving much value from quiescing the hardware rings relative to the operational cost of having the aggr enter the defaulted state from the perspective of lacp. It's not like any of the resources are being released. Really the only tradeoff here is that we're going to end up spending a little bit more CPU time before we drop the packets. However, there are a few important things to note:

1. Most systems start with nothing on the device and then have things added, the devices rarely stop themselves again.
2. We already pay this cost with aggrs that are created but aren't started. They already operate in this mode.
3. Unlike other devices, there's a lot more administrator intent behind creating a link aggregation, which argues that LACP should always be functioning. If they didn't want it to function, they shouldn't have created the aggr.

History

#1

Updated by Electric Monk about 1 year ago

  • Status changed from New to Closed

git commit 666e8af970029746a2a6532e68d14e14ccdde700

commit  666e8af970029746a2a6532e68d14e14ccdde700
Author: Robert Mustacchi <rm@joyent.com>
Date:   2019-01-28T22:36:15.000Z

    10282 aggrs state machine in confusing state before mc_start() called
    Reviewed by: Alex Wilson <alex.wilson@joyent.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

Also available in: Atom PDF