Project

General

Profile

Bug #6274

MAC tries to use aggr rings from downed links

Added by Robert Mustacchi over 5 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Normal
Category:
networking
Start date:
2015-09-30
Due date:
% Done:

100%

Estimated time:
Difficulty:
Hard
Tags:
Gerrit CR:

Description

In trying to reproduce an issue that was seen in the field, we downed
one of the ports of an aggregation pair and were testing various ICMP
pings that occurred. While doing so, we used snoop on downed interface
and noticed that there was ARP traffic going on there. From there, we
dug in with DTrace.

Here's our naughty ixgbe and the code path the odd frames are coming out
of:

[root@sw4cn01 (swdemo04) ~]# dtrace -n 'fbt::ixgbe_ring_tx:entry/(uintptr_t)((ixgbe_tx_ring_t *)arg0)->ixgbe == 0xffffff431daf2000/{ stack(); }'
  0  67892              ixgbe_ring_tx:entry 
              mac`mac_hwring_tx+0x1d
              mac`mac_hwring_send_priv+0x90
              aggr`aggr_ring_tx+0x24
              mac`mac_hwring_tx+0x1d
              mac`mac_bcast_send+0x350
              mac`mac_tx_send+0x2e8
              mac`mac_tx_soft_ring_process+0x79
              mac`mac_tx_aggr_mode+0x7c
              mac`mac_tx+0xda
              dld`proto_unitdata_req+0x2c1
              dld`dld_wput+0x163
              unix`putnext+0x225
              ip`arp_output+0x23c
              ip`arp_request+0xdf
              ip`ip_ndp_resolve+0x76
              ip`nce_timer+0x2c8
              genunix`callout_list_expire+0x98
              genunix`callout_expire+0x3b
              genunix`callout_execute+0x20
              genunix`taskq_thread+0x2d0

I found that ixgbe by leveraging the fact that we were roughly sending
one packet per second and using the following D one liner and picked the
one that roughly only had a count of 1 to 2. This was pretty easy
because of the nice property that snoop confirmed we only had arp
traffic at about 1-2/sec on the link. I also then confirmed the instance
number matched up.

dtrace -n 'fbt::ixgbe_ring_tx:entry{ @[(uintptr_t)((ixgbe_tx_ring_t *)arg0)->ixgbe] = count(); }' -n 'tick-1s { printa("%p: %@d\n", @); trunc(@); }'

So from here, the first big question is did we actually intend for this
to be sent out a ring that corresponds to this device? In other words,
the first problem seems that we ended up on here.

So with this stack trace I found myself with a few different questions.
The bigget one was trying to understand how it was we made it out there
with this stack. I decided to work backwards and first wanted to look at
how macbcast_send picked a ring to see if it was honoring the one that
we had sent it in on to begin with.

It turns out that if we look at the code in mac_bcast_send, it actually
just ends up using something different entirely, it uses the
mac_impl_t's default_tx_ring! Whoh, where did that come from. Let's see
what mac will tell us it is about our aggr.

I previously identified the aggr group by walking the
mac_aggr_grp_cache and figuring out which one of the entries
corresponded to our links. From there, I was able to get to the default
ring and it's private data:

::walk aggr_grp_cache | ::print aggr_grp_t lg_mh | ::print mac_impl_t mi_default_tx_ring
 | ::print mac_ring_t
{
    mr_index = 0
    mr_type = 0x2 (MAC_RING_TYPE_TX)
    mr_next = 0
    mr_gh = 0xffffff4326a75be0
    mr_classify_type = 0 (0)
    mr_srs = 0
    mr_prh = 0
    mr_refcnt = 0
    mr_gen_num = 0x2
    mr_ksp = 0xffffff4370db4040
    mr_mip = 0xffffff43515cf140
    mr_lock = {
        _opaque = [ 0 ]
    }
    mr_cv = {
        _opaque = 0
    }
    mr_state = 0x2 (MR_INUSE)
    mr_flag = 0
    mr_info = {
        mri_driver = 0xffffff432549a788
        mri_start = 0
        mri_stop = 0
        mri_intr = {
            mi_handle = 0
            mi_enable = 0
            mi_disable = 0
            mi_ddi_handle = 0xffffff432073b4c0
            mi_ddi_shared = 0 (0)
        }
        mrfunion = {
            send = aggr_ring_tx
            poll = aggr_ring_tx
        }
        mri_stat = aggr_tx_ring_stat
        mri_flags = 0
    }
}
{
    mr_index = 0
    mr_type = 0x2 (MAC_RING_TYPE_TX)
    mr_next = 0
    mr_gh = 0xffffff4323dbd338
    mr_classify_type = 0 (0)
    mr_srs = 0
    mr_prh = 0
    mr_refcnt = 0
    mr_gen_num = 0x2
    mr_ksp = 0xffffff43515bacd0
    mr_mip = 0xffffff43515d26c8
    mr_lock = {
        _opaque = [ 0 ]
    }
    mr_cv = {
        _opaque = 0
    }
    mr_state = 0x2 (MR_INUSE)
    mr_flag = 0
    mr_info = {
        mri_driver = 0xffffff432549cf40
        mri_start = 0                 
        mri_stop = 0
        mri_intr = {
            mi_handle = 0
            mi_enable = 0
            mi_disable = 0
            mi_ddi_handle = 0xffffff431ea79e30
            mi_ddi_shared = 0 (0)
        }
        mrfunion = {
            send = aggr_ring_tx
            poll = aggr_ring_tx
        }
        mri_stat = aggr_tx_ring_stat
        mri_flags = 0
    }
}
> 0xffffff432549cf40::whatis
ffffff432549cf40 is ffffff432549b7c0+1780, allocated from aggr_grp_cache
> 0xffffff432549cf40::print aggr_pseudo_tx_ring_t
{
    atr_rh = 0xffffff43209e8560
    atr_port = 0xffffff434143eba8
    atr_hw_rh = 0xffffff432073ad90
    atr_flags = 0x1
}
> ::walk aggr_port_cache
0xffffff434143e008
0xffffff434143e3e8
0xffffff434143e7c8
> 0xffffff434143eba8::print aggr_port_t lp_tx_enabled
lp_tx_enabled = 0

Well there's your problem. We somehow have a default tx ring that points
to the one of these that we turned off. That's no good. Because
lp_tx_enabled is B_FALSE, we know that we must have called
aggr_send_port_disable() on this. There are two different ways that we
may have come through here. Regardless, neither code path seems to do
anything that might also manipulate the pseudo rings. So the question
for us is what causes us to actually remove and insert things there.

We adjust the pseudo transmission rings in the function
aggr_rem_pseudo_tx_group. Now, one interesting thing to see is that
it has a flag to see if the port has been added or not. If we look at
our candidate:

> 0xffffff434143eba8::print aggr_port_t lp_tx_grp_added
lp_tx_grp_added = 0x1
> 0xffffff434143eba8::print aggr_port_t lp_rx_grp_added
lp_rx_grp_added = 0x1

We'll note that we're still enabled. So I believe that if we're removing
a group from tx service, we need to also remove its pseudo tx rings,
otherwise we're at the risk of continuing to have it be incorrectly
used.

The naive approach was of course naive and panicked right away. The
problem being here that we tried to tear down all the tx and rx rings,
which doesn't actually help us.

We need to actually still keep these rings about beacuse we need to use
them potentially for sending additional LACPDUs and marker messages.
They're used by the various 802.1ax state machines and logically even if
we think the aggr is down, we should still be enabled to send them.

There are two different directions that we could go out here. We could
refactor how we logically present our rings to mac so that any given
logical ring has rings from as many of the underlying nics in it as
possible. This wouldn't change how we do fanout per-se, but how we
present it to MAC. Thus when MAC pics a default ring, we can then go
switch it based on what links are up. However, I'm not sure that kind of
overhaul is going to end well. As such, we should should likely inform
MAC that it should do something different and pick a new default, even
if that means that we end up juggling around clients.

To that end, adding a private function call from the aggr code to update
what our default is may be a better path. Let's at least start with that
for now.

#1

Updated by Electric Monk over 5 years ago

  • Status changed from New to Closed

git commit 09b7f21a0999a8ceb9f3e517fff7c39c52405ba2

commit  09b7f21a0999a8ceb9f3e517fff7c39c52405ba2
Author: Robert Mustacchi <rm@joyent.com>
Date:   2015-10-08T21:50:26.000Z

    6274 MAC tries to use aggr rings from downed links
    Reviewed by: Bryan Cantrill <bryan@joyent.com>
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>

Also available in: Atom PDF