Feature #13018

A virtual function of a ConnectX-4 VDI attached to an IllumOS based VM (OI) is not handled correctly by mlxcx

Added by Florian Manschwetus 4 months ago. Updated 3 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

I created an ESXi VM running current OI Hipster with a CX-4 VDI VF. The mlxcx driver tries to handle the device but fails, producing these messages in dmesg:

mlxcx: [ID 989156 kern.warning] WARNING: mlxcx1: command MLXCX_OP_CREATE_EQ 0x301 failed with status code MLXCX_CMD_R_BAD_PARAM (0x3)
mlxcx: [ID 989156 kern.warning] WARNING: mlxcx0: command MLXCX_OP_CREATE_EQ 0x301 failed with status code MLXCX_CMD_R_BAD_PARAM (0x3)


Files

mlxcx (221 KB) mlxcx Paul Winder, 2020-09-04 03:08 PM
#1

Updated by Paul Winder 3 months ago

MLXCX_OP_CREATE_EQ is used to create event queues for asynchronous events and completion events. I suspect the failure happens while creating the event queue for async events. That path calls:

        ret = mlxcx_setup_eq(mlxp, 0,
            (1ULL << MLXCX_EVENT_CMD_COMPLETION) |
            (1ULL << MLXCX_EVENT_PAGE_REQUEST) |
            (1ULL << MLXCX_EVENT_PORT_STATE) |
            (1ULL << MLXCX_EVENT_INTERNAL_ERROR) |
            (1ULL << MLXCX_EVENT_PORT_MODULE) |
            (1ULL << MLXCX_EVENT_SENDQ_DRAIN) |
            (1ULL << MLXCX_EVENT_LAST_WQE) |
            (1ULL << MLXCX_EVENT_CQ_ERROR) |
            (1ULL << MLXCX_EVENT_WQ_CATASTROPHE) |
            (1ULL << MLXCX_EVENT_PAGE_FAULT) |
            (1ULL << MLXCX_EVENT_WQ_INVALID_REQ) |
            (1ULL << MLXCX_EVENT_WQ_ACCESS_VIOL) |
            (1ULL << MLXCX_EVENT_NIC_VPORT) |
            (1ULL << MLXCX_EVENT_DOORBELL_CONGEST));

The event of particular concern is MLXCX_EVENT_NIC_VPORT. This event notifies the e-switch manager (i.e. in the hypervisor driver) of changes to vport context. It could be the cause of the BAD_PARAM error, as it is unlikely to be applicable to a VF.

In fact, there are h/w capabilities which can be checked to confirm whether events are supported, both for this one and for some of the others. Unfortunately I don't have h/w to test this hypothesis.
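As a rough illustration of the idea (not the actual fix), the async event mask could be built conditionally so that NIC_VPORT is only requested when the device reports support for it. Everything named `sketch_*` / `SKETCH_*` below is hypothetical: the event numbers are placeholders rather than the real MLXCX_EVENT_* values, and `sc_vport_events` stands in for whatever capability bit the hardware actually exposes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Placeholder event numbers for illustration only. The real values are
 * defined in the mlxcx driver headers and may differ.
 */
enum sketch_event {
	SKETCH_EVENT_CMD_COMPLETION = 1,
	SKETCH_EVENT_PAGE_REQUEST = 2,
	SKETCH_EVENT_PORT_STATE = 3,
	SKETCH_EVENT_NIC_VPORT = 4
};

/* Hypothetical capability flags; not the layout of the real HCA caps. */
typedef struct sketch_caps {
	bool sc_vport_events;	/* assumed: device accepts NIC_VPORT events */
} sketch_caps_t;

/* Turn an event number into its bit in the 64-bit event mask. */
static uint64_t
sketch_event_bit(unsigned int event)
{
	assert(event < 64);
	return (1ULL << event);
}

/*
 * Build the async event mask: always request the basic events, and add
 * NIC_VPORT only when the capability flag says it is supported.
 */
uint64_t
sketch_async_event_mask(const sketch_caps_t *caps)
{
	uint64_t mask =
	    sketch_event_bit(SKETCH_EVENT_CMD_COMPLETION) |
	    sketch_event_bit(SKETCH_EVENT_PAGE_REQUEST) |
	    sketch_event_bit(SKETCH_EVENT_PORT_STATE);

	if (caps->sc_vport_events)
		mask |= sketch_event_bit(SKETCH_EVENT_NIC_VPORT);

	return (mask);
}
```

With a mask built this way, a VF that does not report the capability would never ask the firmware for vport events, which is the condition suspected of triggering BAD_PARAM.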

#2

Updated by Florian Manschwetus 3 months ago

Is there a way to query/test this, having an illumos based system running in such an environment?
Or is there some documentation about this?

#3

Updated by Paul Winder 3 months ago

If you have an environment where you can test this, can you try the following to confirm whether my theory is correct:
  1. Unload the mlxcx driver. Use modinfo to get the list of drivers and get the 'id' of mlxcx. Then run 'modunload -i <id>'.
  2. Run this dtrace command:
    dtrace -Zn 'fbt:mlxcx:mlxcx_setup_async_eqs:entry {print(args[0]->mlx_caps->mlc_hca_cur); exit(0);}'
  3. Get the driver to attempt to load by doing something like 'dladm show-phys'

In the meantime, I will put together a change which I think will fix it.

#4

Updated by Paul Winder 3 months ago

A possible fix is here; you can cherry-pick that and build it yourself.

Otherwise I can send you a pre-built mlxcx driver, compiled from the head of master.

#5

Updated by Florian Manschwetus 3 months ago

admin@oi-testref:~$ sudo dtrace -Zn 'fbt:mlxcx:mlxcx_setup_async_eqs:entry {print(args[0]->mlx_caps->mlc_hca_cur); exit(0);}'
dtrace: invalid probe specifier fbt:mlxcx:mlxcx_setup_async_eqs:entry {print(args[0]->mlx_caps->mlc_hca_cur); exit(0);}: in action list: args[ ] may not be referenced because probe description fbt:mlxcx:mlxcx_setup_async_eqs:entry matches an unstable set of probes

#6

Updated by Paul Winder 3 months ago

Ok, try

dtrace -Zn 'fbt:mlxcx:mlxcx_setup_async_eqs:entry {print(((mlxcx_t *)arg0)->mlx_caps->mlc_hca_cur); exit(0);}'

#7

Updated by Paul Winder 3 months ago
