Feature #13018
open
A virtual function of a ConnectX-4 VDI attached to an IllumOS based VM (OI) is not handled correctly by mlxcx
Added by Florian Manschwetus almost 2 years ago.
Updated almost 2 years ago.
Description
I created an ESXi VM running current OI Hipster with a CX-4 VDI VF. The mlxcx driver tries to attach to the device but fails, and the following messages appear in dmesg:
mlxcx: [ID 989156 kern.warning] WARNING: mlxcx1: command MLXCX_OP_CREATE_EQ 0x301 failed with status code MLXCX_CMD_R_BAD_PARAM (0x3)
mlxcx: [ID 989156 kern.warning] WARNING: mlxcx0: command MLXCX_OP_CREATE_EQ 0x301 failed with status code MLXCX_CMD_R_BAD_PARAM (0x3)
Files
mlxcx (221 KB), Paul Winder, 2020-09-04 03:08 PM
MLXCX_OP_CREATE_EQ is used to create event queues for asynchronous events and completion events. I suspect the failure happens while creating the event queue for async events, which is set up with this call:
    ret = mlxcx_setup_eq(mlxp, 0,
        (1ULL << MLXCX_EVENT_CMD_COMPLETION) |
        (1ULL << MLXCX_EVENT_PAGE_REQUEST) |
        (1ULL << MLXCX_EVENT_PORT_STATE) |
        (1ULL << MLXCX_EVENT_INTERNAL_ERROR) |
        (1ULL << MLXCX_EVENT_PORT_MODULE) |
        (1ULL << MLXCX_EVENT_SENDQ_DRAIN) |
        (1ULL << MLXCX_EVENT_LAST_WQE) |
        (1ULL << MLXCX_EVENT_CQ_ERROR) |
        (1ULL << MLXCX_EVENT_WQ_CATASTROPHE) |
        (1ULL << MLXCX_EVENT_PAGE_FAULT) |
        (1ULL << MLXCX_EVENT_WQ_INVALID_REQ) |
        (1ULL << MLXCX_EVENT_WQ_ACCESS_VIOL) |
        (1ULL << MLXCX_EVENT_NIC_VPORT) |
        (1ULL << MLXCX_EVENT_DOORBELL_CONGEST));
The event of particular concern is MLXCX_EVENT_NIC_VPORT. This event notifies the e-switch manager (i.e. the hypervisor driver) of changes to the vport context. It is unlikely to be applicable to a VF, so requesting it could be the cause of the BAD_PARAM error.
In fact there are h/w capabilities which can be checked to confirm whether they are supported, both for this and some of the other events. Unfortunately I don't have h/w to test this hypothesis....
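To illustrate the idea of gating events on capabilities, here is a minimal standalone sketch in C. It is not the driver's code: the `sketch_*` names, the capability flag, and the event numbers are all hypothetical placeholders, and the real values and capability fields would come from the mlxcx headers and the HCA capability query.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Placeholder event numbers for illustration only; the real ones are
 * defined in the mlxcx driver headers.
 */
enum sketch_event {
	SKETCH_EVENT_PORT_STATE = 0x09,
	SKETCH_EVENT_CMD_COMPLETION = 0x0a,
	SKETCH_EVENT_PAGE_REQUEST = 0x0b,
	SKETCH_EVENT_NIC_VPORT = 0x0d
};

/*
 * Hypothetical summary of what the function reported as supported.
 * The flag name is an assumption, not a real mlxcx field.
 */
typedef struct sketch_caps {
	bool sc_vport_events;	/* may this function subscribe to vport events? */
} sketch_caps_t;

/*
 * Build the async event mask, adding the vport event only when the
 * capability says it is supported (e.g. not on a VF).
 */
static uint64_t
sketch_async_event_mask(const sketch_caps_t *caps)
{
	uint64_t mask;

	assert(caps != NULL);

	/* Events assumed to be valid for every function type. */
	mask = (1ULL << SKETCH_EVENT_CMD_COMPLETION) |
	    (1ULL << SKETCH_EVENT_PAGE_REQUEST) |
	    (1ULL << SKETCH_EVENT_PORT_STATE);

	/* Optional event: only request it when the capability is present. */
	if (caps->sc_vport_events)
		mask |= (1ULL << SKETCH_EVENT_NIC_VPORT);

	return (mask);
}
```

With a mask built this way, a function that does not report the capability would never ask the firmware for the vport event, which is the behaviour the hypothesis above suggests a VF needs.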
Is there a way to query/test this, having an illumos based system running in such an environment?
Or is there some documentation about this?
If you have an environment where you can test this, can you try the following steps to confirm whether my theory is correct:
- Unload the mlxcx driver. Use modinfo to get the list of drivers and get the 'id' of mlxcx. Then run 'modunload -i <id>'.
- Run this dtrace command:
dtrace -Zn 'fbt:mlxcx:mlxcx_setup_async_eqs:entry {print(args[0]->mlx_caps->mlc_hca_cur); exit(0);}'
- Get the driver to attempt to load by doing something like 'dladm show-phys'
In the meantime, I will put together a change which I think will fix it ....
A possible fix is here; you can cherry-pick it and build it yourself.
Otherwise, I can send you a pre-built mlxcx driver, compiled from the head of master.
admin@oi-testref:~$ sudo dtrace -Zn 'fbt:mlxcx:mlxcx_setup_async_eqs:entry {print(args[0]->mlx_caps->mlc_hca_cur); exit(0);}'
dtrace: invalid probe specifier fbt:mlxcx:mlxcx_setup_async_eqs:entry {print(args[0]->mlx_caps->mlc_hca_cur); exit(0);}: in action list: args[ ] may not be referenced because probe description fbt:mlxcx:mlxcx_setup_async_eqs:entry matches an unstable set of probes
OK, try this instead, which casts arg0 rather than using args[]:
dtrace -Zn 'fbt:mlxcx:mlxcx_setup_async_eqs:entry {print(((mlxcx_t *)arg0)->mlx_caps->mlc_hca_cur); exit(0);}'