long mblk chain will cause mlxcx to stop sending
The send queue is divided into WQEBBs. When a message is to be sent, hardware structures describing the message and its data are placed in the ring. Part of each message is a set of address and length entries that tell the hardware where the data is. The hardware structures for one message can span multiple WQEBBs, especially when the message is made up of physically discontiguous regions. This discontiguity arises when a message is composed of multiple mblks.
There is a limit on the total size of the WQEBBs a single message can occupy, expressed in octowords (16 bytes each): a message can use at most 63 octowords. It is easy to construct a test that causes this counter to overflow. When it does, the ring normally ends up disabled, and messages similar to the following appear in the log:
WARNING: mlxcx0: got completion on CQ 3f but no buffer matching wqe found: 3681 (first buffer counter = 73684)
WARNING: mlxcx0: got completion on CQ 3f but no buffer matching wqe found: 3682 (first buffer counter = 73684)
WARNING: mlxcx0: got completion on CQ 3f but no buffer matching wqe found: 3683 (first buffer counter = 73684)
The driver has an ASSERT for this, but that is not a realistic solution.
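To make the limit concrete, here is a minimal sketch of the kind of check a send path could run before building a WQE. It is not the driver's code. The `sketch_` names and the cut-down mblk structure are hypothetical, the fixed header cost of 2 octowords is an assumption, and so is the rule that each non-empty mblk costs exactly one address/length entry of one octoword (a fragment that maps to several physical ranges would cost more). Only the 16-byte octoword and the limit of 63 come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an mblk_t: only the fields this sketch needs. */
typedef struct sketch_mblk {
	struct sketch_mblk *b_cont;	/* next fragment in the chain */
	size_t b_len;			/* bytes of data in this fragment */
} sketch_mblk_t;

enum {
	SKETCH_OCTOWORD_BYTES = 16,	/* from the description above */
	SKETCH_MAX_OCTOWORDS = 63,	/* from the description above */
	SKETCH_HDR_OCTOWORDS = 2	/* ASSUMED fixed per-message header cost */
};

/*
 * Estimate the octowords one message needs: the assumed fixed header
 * plus one address/length entry per non-empty fragment (ASSUMED to be
 * one octoword each).
 */
static size_t
sketch_octowords_needed(const sketch_mblk_t *mp)
{
	size_t n = SKETCH_HDR_OCTOWORDS;

	for (; mp != NULL; mp = mp->b_cont) {
		if (mp->b_len > 0)
			n++;
	}
	return (n);
}

/* True if the chain can be described without exceeding the limit. */
static bool
sketch_fits_in_one_wqe(const sketch_mblk_t *mp)
{
	return (sketch_octowords_needed(mp) <= SKETCH_MAX_OCTOWORDS);
}
```

Under these assumptions a chain of 61 non-empty fragments just fits and a chain of 62 does not. When the check fails, one option is to coalesce the chain into fewer fragments before sending, for example with msgpullup(9F), rather than letting the count overflow.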
Updated by Electric Monk about 2 years ago
- Status changed from In Progress to Closed
- % Done changed from 70 to 100
commit 82b4190e0f86654c179e1dad46c51c6f999464ec
Author: Paul Winder <email@example.com>
Date: 2020-04-22T15:28:03.000Z

12480 long mblk chain will cause mlxcx to stop sending
Reviewed by: Garrett D'Amore <firstname.lastname@example.org>
Reviewed by: Andy Stormont <email@example.com>
Reviewed by: Robert Mustacchi <firstname.lastname@example.org>
Approved by: Dan McDonald <email@example.com>