Bug #12480

long mblk chain will cause mlxcx to stop sending

Added by Paul Winder 8 months ago. Updated 8 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: driver - device drivers
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

The send queue is divided into WQEBBs. When a message is to be sent, hardware structures describing the message and its data are placed in the ring. Part of each message is a set of address and length entries which tell the hardware where the data is. The combined hardware structures for a message can span multiple WQEBBs, especially when the message is composed of disjoint physical addresses. This happens when a message is composed of multiple mblks.

There is a limit on the total size of the WQEBBs a single message can occupy, expressed in octowords (16 bytes each): a message can be at most 63 octowords. It is easy to construct a test which causes this counter to overflow. When it does, the ring normally ends up disabled, and messages similar to the following appear in the log:

WARNING: mlxcx0: got completion on CQ 3f but no buffer matching wqe found: 3681 (first buffer counter = 73684)
WARNING: mlxcx0: got completion on CQ 3f but no buffer matching wqe found: 3682 (first buffer counter = 73684)
WARNING: mlxcx0: got completion on CQ 3f but no buffer matching wqe found: 3683 (first buffer counter = 73684)

The driver has an ASSERT for this, but that is not a realistic solution.
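As a rough illustration of the arithmetic described above, here is a minimal C sketch. Only the 16-byte octoword and the limit of 63 come from this report. The fixed header overhead (`HDR_OCTOWORDS`), the one-octoword cost per address/length entry, the 6-bit wrap in `stored_count()`, and all function names are assumptions made for illustration and are not taken from the mlxcx source.

```c
#include <assert.h>

#define	OCTOWORD_BYTES		16u	/* from the report: 1 octoword = 16 bytes */
#define	MAX_OCTOWORDS		63u	/* from the report: per-message limit */
#define	HDR_OCTOWORDS		4u	/* ASSUMPTION: fixed per-message overhead */
#define	DATASEG_OCTOWORDS	1u	/* ASSUMPTION: one address/length entry */

/* Octowords needed for a message made of nsegs address/length entries. */
static unsigned
msg_octowords(unsigned nsegs)
{
	return (HDR_OCTOWORDS + nsegs * DATASEG_OCTOWORDS);
}

/* Returns 1 if the message stays within the 63-octoword limit. */
static int
msg_fits(unsigned nsegs)
{
	return (msg_octowords(nsegs) <= MAX_OCTOWORDS);
}

/*
 * ASSUMPTION: the count is kept in a 6-bit field, which is one way a
 * limit of 63 could arise. A value of 64 would then be stored as 0.
 */
static unsigned
stored_count(unsigned octowords)
{
	return (octowords & MAX_OCTOWORDS);
}

/* An ASSERT-style check: it detects the overflow but only by aborting. */
static void
check_msg(unsigned nsegs)
{
	assert(msg_fits(nsegs));
}
```

Under these assumed sizes, a message with 59 entries just fits, while one with 60 entries needs 64 octowords and the stored count wraps. The real threshold depends on the actual header layout in the driver.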

#1

Updated by Paul Winder 8 months ago

To reproduce, simply run an iperf test with arguments similar to:

iperf3 -c xx.xx.xx.xx -P 128 -b 1b

Yes, that is a 1 byte block size and is an extreme case, but it readily induces the bug.

#2

Updated by Paul Winder 8 months ago

To fix this: the code which assigns internal buffers to the send message already counts the number of DMA cookies. When the number of cookies would cause this overflow, we do a message pull-up.
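The idea behind the fix can be sketched in user-space C as follows. This is not the driver code: `frag_t` is a hypothetical stand-in for an mblk, each fragment is assumed to cost exactly one DMA cookie, and `chain_pullup()` only imitates what a kernel message pull-up does by copying the chain into one contiguous buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an mblk: one fragment of a message. */
typedef struct frag {
	struct frag *next;
	unsigned char *data;
	size_t len;
} frag_t;

/* Count fragments; ASSUMPTION: each fragment needs one DMA cookie. */
static size_t
chain_cookies(const frag_t *f)
{
	size_t n = 0;

	for (; f != NULL; f = f->next)
		n++;
	return (n);
}

/* Total payload bytes in the chain. */
static size_t
chain_bytes(const frag_t *f)
{
	size_t n = 0;

	for (; f != NULL; f = f->next)
		n += f->len;
	return (n);
}

/* Returns 1 if the chain has more cookies than the send path can describe. */
static int
needs_pullup(const frag_t *f, size_t max_cookies)
{
	assert(max_cookies > 0);
	return (chain_cookies(f) > max_cookies);
}

/*
 * Copy the whole chain into one newly allocated contiguous fragment.
 * The original chain is left untouched. Returns NULL if allocation fails.
 */
static frag_t *
chain_pullup(const frag_t *f)
{
	size_t total = chain_bytes(f);
	size_t off = 0;
	frag_t *nf = malloc(sizeof (*nf));

	if (nf == NULL)
		return (NULL);
	nf->data = malloc(total > 0 ? total : 1);
	if (nf->data == NULL) {
		free(nf);
		return (NULL);
	}
	for (; f != NULL; f = f->next) {
		if (f->len > 0)
			memcpy(nf->data + off, f->data, f->len);
		off += f->len;
	}
	nf->len = total;
	nf->next = NULL;
	return (nf);
}
```

A caller would check `needs_pullup()` while assigning buffers and, only when it returns true, send the result of `chain_pullup()` instead of the original chain. The copy is paid only for unusually fragmented messages, so the common path is unchanged.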

#3

Updated by Paul Winder 8 months ago

The code has been tested using iperf3 to load up the driver.
Tested with the specific test case, used dtrace to prove the new code path was executed.
Also, stressed using more "normal" iperf3 arguments.

#4

Updated by Electric Monk 8 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 70 to 100

git commit 82b4190e0f86654c179e1dad46c51c6f999464ec

commit  82b4190e0f86654c179e1dad46c51c6f999464ec
Author: Paul Winder <pwinder@racktopsystems.com>
Date:   2020-04-22T15:28:03.000Z

    12480 long mblk chain will cause mlxcx to stop sending
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Reviewed by: Andy Stormont <astormont@racktopsystems.com>
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Approved by: Dan McDonald <danmcd@joyent.com>
