Bug #12205

want generic NIC transceiver fault events

Added by Alex Wilson 11 months ago. Updated 9 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

It would be nice to have some generic FMA fault events for NIC transceiver issues. These should cover some broad categories of transceiver faults common across lots of different NICs. Importantly, some eversholt rules should be used to let drivers generate ereports on their regular PCI function device node and have that translated into a fault against the transceiver node.


Related issues

Related to illumos gate - Feature #12206: Want datalink properties in topo (Closed, Robert Mustacchi)

#1

Updated by Robert Mustacchi 11 months ago

#3

Updated by Alex Wilson 9 months ago

Testing notes:

  • Built and tested on SmartOS
  • Tested together with #12204 (which is the first consumer for this new API)
  • Also tested independently by using fminject with an injection like:
fmridef dev {
    uint8_t version;
    string scheme;
    string device-path;
};

evdef ereport.io.nic.txr-err {
    fmri dev detector;
    string error;
    uint8_t port_index;
    uint8_t txr_index;
};

evdef ereport.io.service.lost {
    fmri dev detector;
};

event ereport.io.nic.txr-err ev1 = {
    { 0, "dev", "/pci@76,0/pci8086,2f02@1/pci15b3,2@0" },
    "unknown",
    0,
    0
};

event ereport.io.service.lost ev2 = {
    { 0, "dev", "/pci@76,0/pci8086,2f02@1/pci15b3,2@0" }
};

ev1;
ev2;
  • Tested each of the different "error" property values (notsupp, whitelist, overtemp, hwfail, unknown) by modifying the fminject script
  • For each of the error types, after injecting, checked the output of fmdump -V and fmadm faulty and inspected the console output. Confirmed that string substitutions work (at least when tested together with #12206) and produce sensible messages
  • Tested the notsupp type error through the kernel by modifying the mlxcx code to always emit an ereport of that type on transceiver removal, then removed a transceiver
  • Injected a notsupp/unknown event using fminject

Example message:

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jan 17 02:34:48 b93831f4-9a07-e328-89a4-d03b9b9409fc  NIC-8000-0Q    Critical  injected

Host        : valium-temp
Platform    : PowerEdge-R630    Chassis_id  : 1GZM252
Product_sn  : 

Fault class : fault.io.nic.transceiver.notsupp
Problem in  : hc://:product-id=PowerEdge-R630:server-id=valium-temp:chassis-id=1GZM252:serial=CN053HVN3B53PRF:part=616740003:revision=B/motherboard=0/hostbridge=9/pciexrc=9/pciexbus=129/pciexdev=0/pciexfn=0/port=0/transceiver=0
                  faulted but still in service
FRU         : hc://:product-id=PowerEdge-R630:server-id=valium-temp:chassis-id=1GZM252:serial=CN053HVN3B53PRF:part=616740003:revision=B/motherboard=0/hostbridge=9/pciexrc=9/pciexbus=129/pciexdev=0/pciexfn=0/port=0/transceiver=0
                  faulty

Description : NIC transceiver module 0 (SFP/SFP+/QSFP+ etc.) in mlxcx0 is of a
              type that is not supported. This may be due to an incompatible
              link type or speed. In some NICs, this may also be caused by
              enforcement of a vendor or part whitelist.

              NIC data link: mlxcx0 (ec:0d:9a:65:66:b8)
              Module vendor: Amphenol
              Module part: 616740003
              Module serial: CN053HVN3B53PRF

              Refer to http://illumos.org/msg/NIC-8000-0Q for more information.

Response    : The transceiver module has been disabled, and the network data
              link associated with it (mlxcx0) has been marked as down.

Impact      : No network traffic will pass through the data link or network
              interfaces associated with this transceiver slot.

Action      : Replace the transceiver module with one of a supported type.
  • Built on OmniOS and tested notsupp with fminject. Didn't re-test the other types.
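
The notes above mention testing each "error" value by editing the fminject script by hand. As a rough sketch of how that could be automated, the following Python helper renders one script per value. The event definitions and device path are copied from the injection script above; the helper name `render_script`, the constants, and the `txr-<error>.inj` file names are hypothetical and not part of the change under test.

```python
# Hypothetical helper: render one fminject script per "error" value.
# The fmridef/evdef layout and the device path come from the testing notes;
# all Python names and the output file names are illustrative assumptions.

ERROR_VALUES = ["notsupp", "whitelist", "overtemp", "hwfail", "unknown"]

DEVICE_PATH = "/pci@76,0/pci8086,2f02@1/pci15b3,2@0"

# Literal braces are doubled because str.format() is used for substitution.
TEMPLATE = """\
fmridef dev {{
    uint8_t version;
    string scheme;
    string device-path;
}};

evdef ereport.io.nic.txr-err {{
    fmri dev detector;
    string error;
    uint8_t port_index;
    uint8_t txr_index;
}};

evdef ereport.io.service.lost {{
    fmri dev detector;
}};

event ereport.io.nic.txr-err ev1 = {{
    {{ 0, "dev", "{path}" }},
    "{error}",
    {port},
    {txr}
}};

event ereport.io.service.lost ev2 = {{
    {{ 0, "dev", "{path}" }}
}};

ev1;
ev2;
"""


def render_script(error, path=DEVICE_PATH, port=0, txr=0):
    """Return the injection script text for a single error value."""
    if error not in ERROR_VALUES:
        raise ValueError(f"unexpected error value: {error!r}")
    return TEMPLATE.format(path=path, error=error, port=port, txr=txr)


if __name__ == "__main__":
    # Write one script per error value into the current directory.
    for error in ERROR_VALUES:
        filename = f"txr-{error}.inj"
        with open(filename, "w") as f:
            f.write(render_script(error))
        print(filename)
```

Each generated file could then be given to fminject in the same way as the hand-written script, followed by the same fmdump -V and fmadm faulty checks.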
#4

Updated by Electric Monk 9 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 12eb87fbfbcd9e0abde89898daa0a87c695807e4

commit  12eb87fbfbcd9e0abde89898daa0a87c695807e4
Author: Alex Wilson <alex@uq.edu.au>
Date:   2020-03-04T02:46:52.000Z

    12205 want generic NIC transceiver fault events
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Reviewed by: Paul Winder <paul@winders.demon.co.uk>
    Reviewed by: Rob Johnston <rob.johnston@joyent.com>
    Approved by: Garrett D'Amore <garrett@damore.org>
