Bug #12980

attempting to change MTU on mlxcx based aggregation can induce FMA event

Added by Paul Winder 16 days ago. Updated 2 days ago.

Status: In Progress
Priority: Normal
Assignee:
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

Aggregations will let you attempt to change the MTU whilst they are in use; to do that, they stop and start the appropriate rings. If you attempt this on an mlxcx aggr which is under heavy load, you can induce an FMA event and a corresponding syslog message.
The NIC remains online and functional, and is not flagged for retirement, but after looking at the cause I suspect there is also a risk of a panic through bad pointer references.
The FMA event is similar to:

Jul 24 2020 11:50:28.714272905 ereport.io.device.inval_state
nvlist version: 0
        class = ereport.io.device.inval_state
        ena = 0x2d29cd1f7c203c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,6f08@3/pci117c,b4@0
        (end detector)

        __ttl = 0x1
        __tod = 0x5f1acb04 0x2a92f089

and messages:
2020-07-24T09:45:17.627768+00:00 localhost mlxcx: [ID 887215 kern.warning] WARNING: mlxcx0: got completion on CQ 17 but no buffer matching wqe found: 3c3d (first buffer counter = ffffffff)
2020-07-24T09:45:48.933365+00:00 localhost mlxcx: [ID 887215 kern.warning] WARNING: mlxcx0: got completion on CQ 16 but no buffer matching wqe found: 1598 (first buffer counter = ffffffff)

To induce this you need to create a set of links similar to:
# dladm show-link
mlxcx0      phys      9000   up       --         --
mlxcx1      phys      9000   up       --         --
aggr0       aggr      9000   up       --         mlxcx0,mlxcx1
data0       vnic      8000   up       --         aggr0

Notice the smaller MTU on the data0 vnic: it is this which allows us to attempt the MTU update and causes the rings to be stopped and started.
I ran iperf as a server on the system whose MTU I was attempting to change, and as a client on another system.
Repeated running of:
dladm set-linkprop -p mtu=8000 aggr0

produced the message and FMA record in about 1 in 3 cases.


Related issues

Related to illumos gate - Bug #12987: devo_power misconfigured in mlxcx (In Progress)
Related to illumos gate - Bug #12988: potential hang in mlxcx when async and ring vectors end up on same CPU (In Progress)

History

#1

Updated by Paul Winder 16 days ago

To test, I used two servers.
  1. on server 1, I had the aggregation and vnic configured as in the main descriptive text.
  2. on server 1, I started iperf3 -s
  3. on server 2, I started iperf3 -c <server1>
  4. on server 1, whilst iperf3 was running, I executed a (brain-dead) script similar to:
    # for i in 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 4; do dladm set-linkprop -p mtu=8000 aggr0; sleep 2; done
    

and it produces output, as expected:

dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy
dladm: warning: cannot set link property 'mtu' on 'aggr0': link busy

and no FMA events nor syslog messages.

This set of tests was carried out with kmem_flags set to 0xf and to zero.

#2

Updated by Paul Winder 13 days ago

  • Subject changed from attempting to change MTU on mlxcx based aggregation can induce and FMA event to attempting to change MTU on mlxcx based aggregation can induce FMA event
#3

Updated by Paul Winder 13 days ago

  • Related to Bug #12987: devo_power misconfigured in mlxcx added
#4

Updated by Paul Winder 13 days ago

  • Related to Bug #12988: potential hang in mlxcx when async and ring vectors end up on same CPU added
#5

Updated by Electric Monk 13 days ago

  • Gerrit CR set to 816
#6

Updated by Paul Winder 2 days ago

A walkthrough of the circumstances which cause the message to appear when a ring is stopped.

An important factor is that an interrupt vector can (and almost certainly will) have multiple CQs assigned to it. When the vector gets an interrupt, the interrupt handler matches each event to its CQ and then processes the pertinent CQ.

Also, CQs which act as receive queues/rings do have DMA buffers pre-allocated and assigned to entries in the CQ.

With this in mind, consider:
  1. An interrupt vector is scheduled and processing events.
  2. A command is issued which will stop the rings. Each ring is stopped in mlxcx_mac_ring_stop(), which issues a command to the HCA to stop the send and receive queues. After this, no more events will be posted to the CQ tied to the receive queue, but there may still be completions which the handler in step 1 has yet to process.
  3. mlxcx_mac_ring_stop() frees the DMA buffers tied to the CQ. This is safe because the CQ has been stopped, and the h/w will not post any more completions.
  4. The running interrupt vector sees there are completion events for the CQ; they would have been posted before the ring was stopped. It runs through the completions in the CQ and attempts to match completions to buffers. But the buffer lists have been emptied and the buffers freed, so it doesn't find a matching buffer for the completion and issues a warning message.

The warning messages are disconcerting, but in this case harmless. We need to avoid issuing messages for known situations which are not errors.
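
The sequence in steps 1 to 4 can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: every sketch_* name, the fixed-size arrays and the teardown flag are invented for illustration. They are not the mlxcx data structures, and this is not necessarily how the driver change handles the case. It shows why the lookup fails after the buffers are freed, and one possible way to tell an expected miss from a real error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NBUFS 4

/* Hypothetical stand-in for a receive CQ; not the real mlxcx structures. */
typedef struct sketch_cq {
    uint16_t scq_bufs[SKETCH_NBUFS];    /* WQE counters of posted buffers */
    size_t   scq_nbufs;                 /* buffers still posted */
    uint16_t scq_pending[SKETCH_NBUFS]; /* completions not yet processed */
    size_t   scq_npending;
    bool     scq_teardown;              /* assumed flag: ring is stopping */
    unsigned scq_delivered;
    unsigned scq_discarded;             /* misses expected during teardown */
    unsigned scq_warnings;              /* "no buffer matching wqe" cases */
} sketch_cq_t;

void
sketch_cq_init(sketch_cq_t *cq)
{
    memset(cq, 0, sizeof (*cq));
    for (size_t i = 0; i < SKETCH_NBUFS; i++)
        cq->scq_bufs[i] = (uint16_t)(0x10 + i);
    cq->scq_nbufs = SKETCH_NBUFS;
}

/* The h/w posts a completion before the queue is stopped. */
void
sketch_hw_complete(sketch_cq_t *cq, uint16_t wqe)
{
    assert(cq->scq_npending < SKETCH_NBUFS);
    cq->scq_pending[cq->scq_npending++] = wqe;
}

/* Steps 2 and 3: stop the queue, then free every posted buffer. */
void
sketch_ring_stop(sketch_cq_t *cq, bool mark_teardown)
{
    cq->scq_teardown = mark_teardown;
    cq->scq_nbufs = 0;
}

/* Remove the buffer matching wqe, if one is still posted. */
bool
sketch_take_buf(sketch_cq_t *cq, uint16_t wqe)
{
    for (size_t i = 0; i < cq->scq_nbufs; i++) {
        if (cq->scq_bufs[i] == wqe) {
            cq->scq_bufs[i] = cq->scq_bufs[--cq->scq_nbufs];
            return (true);
        }
    }
    return (false);
}

/* Step 4: the interrupt handler works through the pending completions. */
void
sketch_cq_drain(sketch_cq_t *cq)
{
    for (size_t i = 0; i < cq->scq_npending; i++) {
        if (sketch_take_buf(cq, cq->scq_pending[i]))
            cq->scq_delivered++;
        else if (cq->scq_teardown)
            cq->scq_discarded++;    /* known, harmless: stay quiet */
        else
            cq->scq_warnings++;     /* unexplained miss: warn */
    }
    cq->scq_npending = 0;
}

/* Replay the sequence and report how many warnings it would log. */
unsigned
sketch_simulate(bool teardown_aware)
{
    sketch_cq_t cq;

    sketch_cq_init(&cq);
    sketch_hw_complete(&cq, 0x10);
    sketch_hw_complete(&cq, 0x11);
    sketch_ring_stop(&cq, teardown_aware);
    sketch_cq_drain(&cq);
    return (cq.scq_warnings);
}
```

In this model, sketch_simulate(false) counts one warning per stale completion, whereas sketch_simulate(true) counts none because the misses are attributed to the teardown.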

However, even though I never encountered it, I could see that mlxcx_mac_ring_stop() empties the buffer lists, which are traversed when the CQs are processed, without holding the CQ lock. This could lead to the interrupt handler traversing a list whilst it is being dismantled.
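
As an illustration of the locking discipline this implies, the following user-space sketch empties a buffer list under the same lock that the traversing side takes. It is a sketch only: the sketch_* names are hypothetical, a pthread mutex stands in for the kernel CQ lock, and the real driver structures and fix may differ. The list is detached while the lock is held and the buffers are freed afterwards, so a traverser sees either the full list or an empty one, never a half-dismantled one.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical buffer and ring types; not the mlxcx definitions. */
typedef struct sketch_buf {
    struct sketch_buf *sb_next;
    uint16_t sb_wqe;
} sketch_buf_t;

typedef struct sketch_ring {
    pthread_mutex_t sr_lock;    /* stands in for the CQ lock */
    sketch_buf_t *sr_head;      /* posted buffers */
} sketch_ring_t;

void
sketch_ring_init(sketch_ring_t *r)
{
    assert(r != NULL);
    pthread_mutex_init(&r->sr_lock, NULL);
    r->sr_head = NULL;
}

/* Post a buffer for the given WQE counter. */
bool
sketch_ring_post(sketch_ring_t *r, uint16_t wqe)
{
    sketch_buf_t *b = malloc(sizeof (*b));

    if (b == NULL)
        return (false);
    b->sb_wqe = wqe;
    pthread_mutex_lock(&r->sr_lock);
    b->sb_next = r->sr_head;
    r->sr_head = b;
    pthread_mutex_unlock(&r->sr_lock);
    return (true);
}

/*
 * Interrupt side: traverse the list with the lock held. A bool is
 * returned rather than a pointer so nothing escapes the lock.
 */
bool
sketch_ring_has_buf(sketch_ring_t *r, uint16_t wqe)
{
    bool found = false;

    pthread_mutex_lock(&r->sr_lock);
    for (sketch_buf_t *b = r->sr_head; b != NULL; b = b->sb_next) {
        if (b->sb_wqe == wqe) {
            found = true;
            break;
        }
    }
    pthread_mutex_unlock(&r->sr_lock);
    return (found);
}

/*
 * Stop side: detach the whole list under the lock, then free the
 * buffers once no traverser can reach them. Returns the number freed.
 */
size_t
sketch_ring_teardown(sketch_ring_t *r)
{
    sketch_buf_t *detached;
    size_t nfreed = 0;

    pthread_mutex_lock(&r->sr_lock);
    detached = r->sr_head;
    r->sr_head = NULL;
    pthread_mutex_unlock(&r->sr_lock);

    while (detached != NULL) {
        sketch_buf_t *next = detached->sb_next;
        free(detached);
        detached = next;
        nfreed++;
    }
    return (nfreed);
}
```

Detaching under the lock and freeing outside it keeps the critical section short, which matters if the traversing side runs in interrupt context.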
