Bug #12205
want generic NIC transceiver fault events
Status: Closed
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
Description
It would be nice to have some generic FMA fault events for NIC transceiver issues. These should cover some broad categories of transceiver faults common across lots of different NICs. Importantly, some eversholt rules should be used to let drivers generate ereports on their regular PCI function device node and have that translated into a fault against the transceiver node.
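The translation described above can be sketched roughly as follows. This is only an illustration in Python, not the eversholt rules themselves: the function name and dictionary layout are hypothetical, it assumes the PCI function has already been resolved to its hc-scheme path, and only fault.io.nic.transceiver.notsupp is confirmed by this issue (the other class names are assumed to follow the same pattern).

```python
# Illustrative sketch only: the real translation is done by eversholt rules
# inside fmd, not by Python. Names below are hypothetical unless noted.

# "error" property values listed in the testing notes of this issue.
KNOWN_ERRORS = ("notsupp", "whitelist", "overtemp", "hwfail", "unknown")

# Only fault.io.nic.transceiver.notsupp appears in this issue; the other
# class names are ASSUMED to follow the same pattern.
FAULT_PREFIX = "fault.io.nic.transceiver."


def translate_ereport(function_fmri, error, port_index, txr_index):
    """Map a txr-err ereport on a PCI function to a transceiver fault."""
    if error not in KNOWN_ERRORS:
        raise ValueError("unrecognised error value: %r" % (error,))
    # The fault is raised against a child of the function node, matching the
    # .../pciexfn=0/port=0/transceiver=0 shape seen in the example message.
    resource = "%s/port=%d/transceiver=%d" % (
        function_fmri.rstrip("/"), port_index, txr_index)
    return {"class": FAULT_PREFIX + error, "resource": resource}
```

The point of the design is that the driver only needs to know its own function node plus the port and transceiver indices; locating the transceiver node is left to the diagnosis rules.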
Related issues
Updated by Robert Mustacchi over 2 years ago
- Related to Feature #12206: Want datalink properties in topo added
Updated by Alex Wilson over 2 years ago
Testing notes:
- Built and tested on SmartOS
- Tested together with #12204 (which is the first consumer for this new API)
- Also tested independently by using fminject with an injection like:

    fmridef dev {
        uint8_t version;
        string scheme;
        string device-path;
    };

    evdef ereport.io.nic.txr-err {
        fmri dev detector;
        string error;
        uint8_t port_index;
        uint8_t txr_index;
    };

    evdef ereport.io.service.lost {
        fmri dev detector;
    };

    event ereport.io.nic.txr-err ev1 = {
        { 0, "dev", "/pci@76,0/pci8086,2f02@1/pci15b3,2@0" },
        "unknown",
        0,
        0
    };

    event ereport.io.service.lost ev2 = {
        { 0, "dev", "/pci@76,0/pci8086,2f02@1/pci15b3,2@0" }
    };

    ev1;
    ev2;

- Tested each of the different "error" property values (notsupp, whitelist, overtemp, hwfail, unknown) by modifying the fminject script
- For each of the error types, after injecting looked at fmdump -V and fmadm faulty, as well as inspecting the console output. Checked that string substitutions are working (at least when tested with #12206 as well) and produce sensible messages
- Tested the notsupp type error through the kernel by modifying the mlxcx code to always emit an ereport of that type on transceiver removal, then removed a transceiver
- Injected a notsupp/unknown event using fminject
Example message:

    --------------- ------------------------------------ -------------- ---------
    TIME            EVENT-ID                             MSG-ID         SEVERITY
    --------------- ------------------------------------ -------------- ---------
    Jan 17 02:34:48 b93831f4-9a07-e328-89a4-d03b9b9409fc NIC-8000-0Q    Critical injected

    Host        : valium-temp
    Platform    : PowerEdge-R630
    Chassis_id  : 1GZM252
    Product_sn  :

    Fault class : fault.io.nic.transceiver.notsupp
    Problem in  : hc://:product-id=PowerEdge-R630:server-id=valium-temp:chassis-id=1GZM252:serial=CN053HVN3B53PRF:part=616740003:revision=B/motherboard=0/hostbridge=9/pciexrc=9/pciexbus=129/pciexdev=0/pciexfn=0/port=0/transceiver=0
                      faulted but still in service
    FRU         : hc://:product-id=PowerEdge-R630:server-id=valium-temp:chassis-id=1GZM252:serial=CN053HVN3B53PRF:part=616740003:revision=B/motherboard=0/hostbridge=9/pciexrc=9/pciexbus=129/pciexdev=0/pciexfn=0/port=0/transceiver=0
                      faulty

    Description : NIC transceiver module 0 (SFP/SFP+/QSFP+ etc.) in mlxcx0 is
                  of a type that is not supported. This may be due to an
                  incompatible link type or speed. In some NICs, this may also
                  be caused by enforcement of a vendor or part whitelist.

                  NIC data link: mlxcx0 (ec:0d:9a:65:66:b8)
                  Module vendor: Amphenol
                  Module part: 616740003
                  Module serial: CN053HVN3B53PRF

                  Refer to http://illumos.org/msg/NIC-8000-0Q for more
                  information.

    Response    : The transceiver module has been disabled, and the network
                  data link associated with it (mlxcx0) has been marked as
                  down.

    Impact      : No network traffic will pass through the data link or
                  network interfaces associated with this transceiver slot.

    Action      : Replace the transceiver module with one of a supported type.

- Built on OmniOS and tested notsupp with fminject. Didn't re-test the other types.
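For anyone repeating the per-error-type testing described in these notes, a small generator along the following lines might save some hand editing. It is a hypothetical helper, not part of the change: render_script and its parameters are invented for illustration, and the template simply reuses the fminject definitions quoted above with the "error" string and indices substituted.

```python
# Hypothetical helper: renders the fminject script from the testing notes
# once per "error" value, so each case can be injected in turn.

ERRORS = ["notsupp", "whitelist", "overtemp", "hwfail", "unknown"]

# Definitions copied from the injection script quoted in this issue; only
# the device path, error string and indices are substituted.
TEMPLATE = """\
fmridef dev { uint8_t version; string scheme; string device-path; };
evdef ereport.io.nic.txr-err { fmri dev detector; string error; uint8_t port_index; uint8_t txr_index; };
evdef ereport.io.service.lost { fmri dev detector; };
event ereport.io.nic.txr-err ev1 = { { 0, "dev", "%(path)s" }, "%(error)s", %(port)d, %(txr)d };
event ereport.io.service.lost ev2 = { { 0, "dev", "%(path)s" } };
ev1;
ev2;
"""


def render_script(path, error, port=0, txr=0):
    """Return fminject script text for one error value."""
    if error not in ERRORS:
        raise ValueError("unknown error value: %r" % (error,))
    return TEMPLATE % {"path": path, "error": error, "port": port, "txr": txr}


if __name__ == "__main__":
    # Device path taken from the example in the testing notes.
    dev = "/pci@76,0/pci8086,2f02@1/pci15b3,2@0"
    for err in ERRORS:
        # Output file names are arbitrary and chosen for illustration.
        with open("txr-%s.inj" % err, "w") as f:
            f.write(render_script(dev, err))
```

Each generated file can then be injected and checked with fmdump -V and fmadm faulty in the same way as the manual runs.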
Updated by Electric Monk over 2 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit 12eb87fbfbcd9e0abde89898daa0a87c695807e4
commit 12eb87fbfbcd9e0abde89898daa0a87c695807e4
Author: Alex Wilson <alex@uq.edu.au>
Date: 2020-03-04T02:46:52.000Z

    12205 want generic NIC transceiver fault events
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Reviewed by: Paul Winder <paul@winders.demon.co.uk>
    Reviewed by: Rob Johnston <rob.johnston@joyent.com>
    Approved by: Garrett D'Amore <garrett@damore.org>