Bug #12205
closed
want generic NIC transceiver fault events
Added by Alex Wilson over 2 years ago.
Updated over 2 years ago.
Description
It would be nice to have some generic FMA fault events for NIC transceiver issues. These should cover some broad categories of transceiver faults common across lots of different NICs. Importantly, some eversholt rules should be used to let drivers generate ereports on their regular PCI function device node and have that translated into a fault against the transceiver node.
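As a rough illustration of the translation described above, the following Python sketch models how an ereport detected on the PCI function node could map to a fault on the transceiver node. It is only a model, not the eversholt rules themselves. The names `TxrEreport` and `fault_for` are hypothetical, the caller is assumed to already know the hc path of the PCI function, and the `fault.io.nic.transceiver.<error>` naming is confirmed by this issue only for `notsupp`; applying it to the other error values is an assumption.

```python
from dataclasses import dataclass

# "error" property values named in the testing notes of this issue.
KNOWN_ERRORS = ("notsupp", "whitelist", "overtemp", "hwfail", "unknown")


@dataclass(frozen=True)
class TxrEreport:
    """Hypothetical model of an ereport.io.nic.txr-err payload."""
    detector_path: str  # device path of the PCI function that posted it
    error: str
    port_index: int
    txr_index: int


def fault_for(ereport: TxrEreport, nic_hc_path: str) -> tuple[str, str]:
    """Return (fault class, hc path of the faulted transceiver node).

    nic_hc_path is the hc path of the PCI function; resolving it from
    detector_path is left to the caller in this sketch.
    """
    if ereport.error not in KNOWN_ERRORS:
        raise ValueError(f"unrecognised error value: {ereport.error!r}")
    # Assumption: every error value maps to a fault class of the same name.
    fault_class = f"fault.io.nic.transceiver.{ereport.error}"
    resource = (f"{nic_hc_path}/port={ereport.port_index}"
                f"/transceiver={ereport.txr_index}")
    return fault_class, resource


if __name__ == "__main__":
    ev = TxrEreport("/pci@76,0/pci8086,2f02@1/pci15b3,2@0", "notsupp", 0, 0)
    print(fault_for(ev, "hc:///motherboard=0/hostbridge=9/pciexrc=9"
                        "/pciexbus=129/pciexdev=0/pciexfn=0"))
```

The point of the design is that the driver only needs to know its own device node plus the port and transceiver indices; the mapping onto the topology node is done outside the driver.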
Testing notes:
- Built and tested on SmartOS
- Tested together with #12204 (which is the first consumer for this new API)
- Also tested independently by using fminject with an injection like:
fmridef dev {
    uint8_t version;
    string scheme;
    string device-path;
};

evdef ereport.io.nic.txr-err {
    fmri dev detector;
    string error;
    uint8_t port_index;
    uint8_t txr_index;
};

evdef ereport.io.service.lost {
    fmri dev detector;
};

event ereport.io.nic.txr-err ev1 = {
    { 0, "dev", "/pci@76,0/pci8086,2f02@1/pci15b3,2@0" },
    "unknown",
    0,
    0
};

event ereport.io.service.lost ev2 = {
    { 0, "dev", "/pci@76,0/pci8086,2f02@1/pci15b3,2@0" }
};

ev1;
ev2;
- Tested each of the different "error" property values (notsupp, whitelist, overtemp, hwfail, unknown) by modifying the fminject script
- For each of the error types, checked fmdump -V and fmadm faulty after injecting, and inspected the console output. Checked that string substitutions work (at least when tested together with #12206) and produce sensible messages
- Tested the notsupp type error through the kernel by modifying the mlxcx code to always emit an ereport of that type on transceiver removal, then removed a transceiver
- Injected a notsupp/unknown event using fminject
Example message:
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jan 17 02:34:48 b93831f4-9a07-e328-89a4-d03b9b9409fc  NIC-8000-0Q    Critical injected

Host        : valium-temp
Platform    : PowerEdge-R630    Chassis_id  : 1GZM252
Product_sn  :

Fault class : fault.io.nic.transceiver.notsupp
Problem in  : hc://:product-id=PowerEdge-R630:server-id=valium-temp:chassis-id=1GZM252:serial=CN053HVN3B53PRF:part=616740003:revision=B/motherboard=0/hostbridge=9/pciexrc=9/pciexbus=129/pciexdev=0/pciexfn=0/port=0/transceiver=0
                  faulted but still in service
FRU         : hc://:product-id=PowerEdge-R630:server-id=valium-temp:chassis-id=1GZM252:serial=CN053HVN3B53PRF:part=616740003:revision=B/motherboard=0/hostbridge=9/pciexrc=9/pciexbus=129/pciexdev=0/pciexfn=0/port=0/transceiver=0
                  faulty

Description : NIC transceiver module 0 (SFP/SFP+/QSFP+ etc.) in mlxcx0 is of a
              type that is not supported. This may be due to an incompatible
              link type or speed. In some NICs, this may also be caused by
              enforcement of a vendor or part whitelist.

              NIC data link: mlxcx0 (ec:0d:9a:65:66:b8)
              Module vendor: Amphenol
              Module part: 616740003
              Module serial: CN053HVN3B53PRF

              Refer to http://illumos.org/msg/NIC-8000-0Q for more information.

Response    : The transceiver module has been disabled, and the network data
              link associated with it (mlxcx0) has been marked as down.

Impact      : No network traffic will pass through the data link or network
              interfaces associated with this transceiver slot.

Action      : Replace the transceiver module with one of a supported type.
- Built on OmniOS and tested notsupp with fminject. Didn't re-test the other types.
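The testing notes describe editing the fminject script by hand for each "error" value. A minimal sketch of automating that step, assuming the script shape shown earlier in this issue, could look like the following. The function name `render_script` and its parameters are hypothetical, and the default device path is the one from the original injection, which will differ on other systems.

```python
from string import Template

# "error" property values listed in the testing notes.
ERROR_VALUES = ("notsupp", "whitelist", "overtemp", "hwfail", "unknown")

# Device path from the original injection; replace with one from your system.
DEFAULT_PATH = "/pci@76,0/pci8086,2f02@1/pci15b3,2@0"

# Same declarations and events as the script in this issue, with the
# variable parts turned into $placeholders.
SCRIPT = Template("""\
fmridef dev {
    uint8_t version;
    string scheme;
    string device-path;
};

evdef ereport.io.nic.txr-err {
    fmri dev detector;
    string error;
    uint8_t port_index;
    uint8_t txr_index;
};

evdef ereport.io.service.lost {
    fmri dev detector;
};

event ereport.io.nic.txr-err ev1 = {
    { 0, "dev", "$path" },
    "$error",
    $port,
    $txr
};

event ereport.io.service.lost ev2 = {
    { 0, "dev", "$path" }
};

ev1;
ev2;
""")


def render_script(error, path=DEFAULT_PATH, port_index=0, txr_index=0):
    """Return the text of an injection script for one error value."""
    if error not in ERROR_VALUES:
        raise ValueError(f"unrecognised error value: {error!r}")
    return SCRIPT.substitute(path=path, error=error,
                             port=port_index, txr=txr_index)


if __name__ == "__main__":
    # Print one variant; redirect to a file and inject it as before.
    print(render_script("overtemp"))
```

Each rendered script can then be injected in the same way as the hand-edited one, checking fmdump -V and fmadm faulty after each run.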
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit 12eb87fbfbcd9e0abde89898daa0a87c695807e4
commit 12eb87fbfbcd9e0abde89898daa0a87c695807e4
Author: Alex Wilson <alex@uq.edu.au>
Date: 2020-03-04T02:46:52.000Z
12205 want generic NIC transceiver fault events
Reviewed by: Robert Mustacchi <rm@fingolfin.org>
Reviewed by: Paul Winder <paul@winders.demon.co.uk>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>