Feature #8975
closed
ipmi topo plugin should automatically enumerate sensors on nodes it enumerates
100%
Description
Today, unless you add sufficient platform-specific goo to the libtopo XML map files, sensors are not enumerated when the topo snapshot is built. The changes being introduced by issue #8974 will add a new, more platform-agnostic mechanism to the fac_prov_ipmi module for binding topo nodes to their associated sensors. All that's needed to complete the circuit here is to have the "ipmi" module do the following for any nodes that it creates (currently fans and PSUs):
1. Create the entity_id and entity_instance properties in the "ipmi" property group.
2. Register the TOPO_METH_FAC_ENUM method from the fac_prov_ipmi module.
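The binding idea behind these two steps can be illustrated with a small toy model. This is only a sketch in Python, not libtopo code: the `Sensor` and `TopoNode` classes and both helper functions are hypothetical names invented for illustration, and the entity values are the ones that appear for the J1101 in the fmtopo diff in the first update on this ticket.

```python
from dataclasses import dataclass, field


@dataclass
class Sensor:
    """Hypothetical stand-in for a sensor record known to the BMC."""
    name: str
    entity_id: int
    entity_instance: int


@dataclass
class TopoNode:
    """Hypothetical stand-in for a topo node such as psu=0 or fan=0."""
    name: str
    instance: int
    props: dict = field(default_factory=dict)       # property groups
    facilities: list = field(default_factory=list)  # child sensor names


def set_ipmi_entity(node, entity_id, entity_instance):
    # Step 1: record the entity id/instance in the node's "ipmi" group.
    node.props["ipmi"] = {
        "entity_id": entity_id,
        "entity_instance": entity_instance,
    }


def enumerate_facilities(node, sensors):
    # Step 2 (modelled): attach every sensor whose entity id and
    # instance match the values stored on the node.
    ipmi = node.props.get("ipmi")
    if ipmi is None:
        return []  # no binding information, so nothing is enumerated
    node.facilities = [
        s.name
        for s in sensors
        if s.entity_id == ipmi["entity_id"]
        and s.entity_instance == ipmi["entity_instance"]
    ]
    return node.facilities


sensors = [
    Sensor("PS2 Status", 0x0A, 2),
    Sensor("FAN1", 0x1D, 1),
    Sensor("FAN2", 0x1D, 2),
]
psu0 = TopoNode("psu", 0)
set_ipmi_entity(psu0, 0x0A, 2)
print(enumerate_facilities(psu0, sensors))
```

A node with no "ipmi" property group gets no facility children in this model, which mirrors the situation described above where sensors are missing unless a platform-specific map file supplies the binding.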
Related issues
Updated by Rob Johnston over 5 years ago
To test this change, I used the "fmtopo" utility to create topo snapshots both with and without my changes and verified that the only differences with my changes were the following:
1) The new "entity_id" and "entity_instance" properties were being correctly created on all fan and psu nodes
2) sensor facility nodes were now being enumerated as children of the psu and fan nodes
I tested this change in the global zone of a J1101 compute node and a SuperMicro box in the Joyent SF lab (lava).
Before and after diffs of the fmtopo output are below:
$ diff fmtopoV.before.j1101 fmtopoV.after.j1101
2c2
< Nov 20 22:35:04 d318b451-9dc5-60ca-fad3-8a4397ddd1a2
---
> Nov 20 22:35:20 383786f5-19e9-6a83-ac89-cda2adcbfdf2
23a24,39
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0xa
>     entity_instance uint32 0x2
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/psu=0?sensor=PS2 Status
>   group: protocol version: 1 stability: Private/Private
>     resource fmri hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/psu=0?sensor=PS2 Status
>   group: authority version: 1 stability: Private/Private
>     product-id string Joyent-Compute-Platform-1101
>     chassis-id string S12612523710138
>     server-id string headnode
>   group: facility version: 1 stability: Private/Private
>     entity_ref string[] [ "PS2 Status" ]
>     sensor-class string discrete
>     type uint32 0x8 (POWER_SUPPLY)
>     state uint32 0x1 (PRESENT)
33a50,67
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x1
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=0?sensor=FAN1
>   group: protocol version: 1 stability: Private/Private
>     resource fmri hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=0?sensor=FAN1
>   group: authority version: 1 stability: Private/Private
>     product-id string Joyent-Compute-Platform-1101
>     chassis-id string S12612523710138
>     server-id string headnode
>   group: facility version: 1 stability: Private/Private
>     entity_ref string[] [ "FAN1" ]
>     sensor-class string threshold
>     type uint32 0x101 (THRESHOLD_STATE)
>     state uint32 0xc0 (0xc0)
>     reading double 2925.000000
>     units uint32 0x12 (RPM)
43a78,95
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x2
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=1?sensor=FAN2
>   group: protocol version: 1 stability: Private/Private
>     resource fmri hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=1?sensor=FAN2
>   group: authority version: 1 stability: Private/Private
>     product-id string Joyent-Compute-Platform-1101
>     chassis-id string S12612523710138
>     server-id string headnode
>   group: facility version: 1 stability: Private/Private
>     entity_ref string[] [ "FAN2" ]
>     sensor-class string threshold
>     type uint32 0x101 (THRESHOLD_STATE)
>     state uint32 0xc0 (0xc0)
>     reading double 3000.000000
>     units uint32 0x12 (RPM)
53a106,123
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x3
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=2?sensor=FAN3
>   group: protocol version: 1 stability: Private/Private
>     resource fmri hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=2?sensor=FAN3
>   group: authority version: 1 stability: Private/Private
>     product-id string Joyent-Compute-Platform-1101
>     chassis-id string S12612523710138
>     server-id string headnode
>   group: facility version: 1 stability: Private/Private
>     entity_ref string[] [ "FAN3" ]
>     sensor-class string threshold
>     type uint32 0x101 (THRESHOLD_STATE)
>     state uint32 0xc0 (0xc0)
>     reading double 3000.000000
>     units uint32 0x12 (RPM)
63a134,138
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x4
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=3?sensor=FAN4
73a149,153
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x5
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=4?sensor=FAN5
83a164,168
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x6
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=5?sensor=FAN6
93a179,183
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x7
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=6?sensor=FAN7
103a194,198
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x8
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=7?sensor=FAN8

$ diff fmtopoV.before.lava fmtopoV.after.lava
2c2
< Nov 21 02:30:27 fb7f352b-0601-4ffa-bd73-918336f97241
---
> Nov 21 02:28:46 826ee98a-8f1d-ec08-f1ac-8ff7aa3e2f07
23a24,39
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0xa
>     entity_instance uint32 0x1
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=0?sensor=PS1 Status
>   group: protocol version: 1 stability: Private/Private
>     resource fmri hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=0?sensor=PS1 Status
>   group: authority version: 1 stability: Private/Private
>     product-id string SSG-2028R-ACR24L
>     chassis-id string S194308X7333203
>     server-id string lava
>   group: facility version: 1 stability: Private/Private
>     entity_ref string[] [ "PS1 Status" ]
>     sensor-class string discrete
>     type uint32 0x8 (POWER_SUPPLY)
>     state uint32 0x1 (PRESENT)
33a50,65
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0xa
>     entity_instance uint32 0x2
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1?sensor=PS2 Status
>   group: protocol version: 1 stability: Private/Private
>     resource fmri hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1?sensor=PS2 Status
>   group: authority version: 1 stability: Private/Private
>     product-id string SSG-2028R-ACR24L
>     chassis-id string S194308X7333203
>     server-id string lava
>   group: facility version: 1 stability: Private/Private
>     entity_ref string[] [ "PS2 Status" ]
>     sensor-class string discrete
>     type uint32 0x8 (POWER_SUPPLY)
>     state uint32 0xb (0x0b)
43a76,93
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x1
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=0?sensor=FAN1
>   group: protocol version: 1 stability: Private/Private
>     resource fmri hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=0?sensor=FAN1
>   group: authority version: 1 stability: Private/Private
>     product-id string SSG-2028R-ACR24L
>     chassis-id string S194308X7333203
>     server-id string lava
>   group: facility version: 1 stability: Private/Private
>     entity_ref string[] [ "FAN1" ]
>     sensor-class string threshold
>     type uint32 0x101 (THRESHOLD_STATE)
>     state uint32 0xc0 (0xc0)
>     reading double 2800.000000
>     units uint32 0x12 (RPM)
53a104,121
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x2
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=1?sensor=FAN2
>   group: protocol version: 1 stability: Private/Private
>     resource fmri hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=1?sensor=FAN2
>   group: authority version: 1 stability: Private/Private
>     product-id string SSG-2028R-ACR24L
>     chassis-id string S194308X7333203
>     server-id string lava
>   group: facility version: 1 stability: Private/Private
>     entity_ref string[] [ "FAN2" ]
>     sensor-class string threshold
>     type uint32 0x101 (THRESHOLD_STATE)
>     state uint32 0xc0 (0xc0)
>     reading double 2900.000000
>     units uint32 0x12 (RPM)
63a132,149
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x3
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=2?sensor=FAN3
>   group: protocol version: 1 stability: Private/Private
>     resource fmri hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=2?sensor=FAN3
>   group: authority version: 1 stability: Private/Private
>     product-id string SSG-2028R-ACR24L
>     chassis-id string S194308X7333203
>     server-id string lava
>   group: facility version: 1 stability: Private/Private
>     entity_ref string[] [ "FAN3" ]
>     sensor-class string threshold
>     type uint32 0x101 (THRESHOLD_STATE)
>     state uint32 0xc0 (0xc0)
>     reading double 4000.000000
>     units uint32 0x12 (RPM)
73a160,164
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x4
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=3?sensor=FAN4
83a175,179
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x5
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=4?sensor=FAN5
93a190,194
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x6
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=5?sensor=FAN6
103a205,222
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x7
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6?sensor=FANA
>   group: protocol version: 1 stability: Private/Private
>     resource fmri hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6?sensor=FANA
>   group: authority version: 1 stability: Private/Private
>     product-id string SSG-2028R-ACR24L
>     chassis-id string S194308X7333203
>     server-id string lava
>   group: facility version: 1 stability: Private/Private
>     entity_ref string[] [ "FANA" ]
>     sensor-class string threshold
>     type uint32 0x101 (THRESHOLD_STATE)
>     state uint32 0xc0 (0xc0)
>     reading double 2900.000000
>     units uint32 0x12 (RPM)
113a233,237
>   group: ipmi version: 1 stability: Private/Private
>     entity_id uint32 0x1d
>     entity_instance uint32 0x8
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=7?sensor=FANB
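
The "only additions" check described in this update could also be scripted rather than read off the diff by eye. The following is a minimal Python sketch, not the procedure actually used here: `added_lines` is a hypothetical helper, the sample lines are abbreviated stand-ins for fmtopo output, and the snapshot header line (which always differs in timestamp and UUID) is assumed to have been dropped before comparing.

```python
def added_lines(before, after):
    """Return the lines that appear only in `after`.

    Raises ValueError if `before` is not an in-order subsequence of
    `after`, i.e. if any original line was removed or modified.
    """
    added = []
    i = 0  # index of the next `before` line we expect to see
    for line in after:
        if i < len(before) and line == before[i]:
            i += 1
        else:
            added.append(line)
    if i != len(before):
        raise ValueError(
            f"'before' line {i + 1} was removed or changed: {before[i]!r}"
        )
    return added


# Abbreviated stand-ins for fmtopo output, snapshot header already dropped.
before = [
    "hc:///chassis=0/fan=0",
    "  group: protocol",
    "hc:///chassis=0/fan=1",
]
after = [
    "hc:///chassis=0/fan=0",
    "  group: protocol",
    "  group: ipmi",
    "    entity_id uint32 0x1d",
    "    entity_instance uint32 0x1",
    "hc:///chassis=0/fan=0?sensor=FAN1",
    "hc:///chassis=0/fan=1",
]

for line in added_lines(before, after):
    print("+", line)
```

A plain in-order walk is used instead of a general diff algorithm because the question is only whether every original line survives unchanged, and snapshot output contains many repeated lines that can make a heuristic matcher report spurious changes.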
Some additional testing
==================
I also built a USB stick with my changes and re-verified them on lava. I also tested things like removing power from one of the PSUs and physically obstructing one of the fans to verify that FMA was able to detect and diagnose the faults based on polling the relevant sensors.
The ereports and faults generated after removing the cable from one of the power supplies and obstructing one of the fan modules are shown below:
Nov 21 2017 23:54:05.407454535 ereport.sensor.failure
nvlist version: 0
    type = psu
    nonrecov = 1
    predictive = 0
    source = 0x1
    details = (embedded nvlist)
    nvlist version: 0
        PS2 Status = (embedded nvlist)
        nvlist version: 0
            type = 0x8
            state = 0xb
            units = 0x0
            nonrecov = 1
            predictive = 0
            source = 0x1
        (end PS2 Status)
    (end details)
    class = ereport.sensor.failure
    version = 0x0
    ena = 0x343d9e885005c01
    detector = (embedded nvlist)
    nvlist version: 0
        version = 0x0
        scheme = hc
        hc-root =
        authority = (embedded nvlist)
        nvlist version: 0
            product-id = SSG-2028R-ACR24L
            server-id = lava
            chassis-id = S194308X7333203
        (end authority)
        hc-list = (array of embedded nvlists)
        (start hc-list[0])
        nvlist version: 0
            hc-name = chassis
            hc-id = 0
        (end hc-list[0])
        (start hc-list[1])
        nvlist version: 0
            hc-name = psu
            hc-id = 1
        (end hc-list[1])
    (end detector)
    __ttl = 0x1
    __tod = 0x5a14bc9d 0x18494347

Nov 21 2017 23:54:05.430333822 ereport.sensor.failure
nvlist version: 0
    type = fan
    nonrecov = 1
    predictive = 0
    source = 0x0
    details = (embedded nvlist)
    nvlist version: 0
        FANA = (embedded nvlist)
        nvlist version: 0
            type = 0x101
            state = 0xc4
            units = 0x12
            nonrecov = 1
            predictive = 0
            source = 0x0
            reading = 0x0.000000
        (end FANA)
    (end details)
    class = ereport.sensor.failure
    version = 0x0
    ena = 0x343efbaa6705c01
    detector = (embedded nvlist)
    nvlist version: 0
        version = 0x0
        scheme = hc
        hc-root =
        authority = (embedded nvlist)
        nvlist version: 0
            product-id = SSG-2028R-ACR24L
            server-id = lava
            chassis-id = S194308X7333203
        (end authority)
        hc-list = (array of embedded nvlists)
        (start hc-list[0])
        nvlist version: 0
            hc-name = chassis
            hc-id = 0
        (end hc-list[0])
        (start hc-list[1])
        nvlist version: 0
            hc-name = fan
            hc-id = 6
        (end hc-list[1])
    (end detector)
    __ttl = 0x1
    __tod = 0x5a14bc9d 0x19a65f7e

# fmdump -v -c list.suspect
TIME                 UUID                                 SUNW-MSG-ID    EVENT
Nov 21 23:38:50.1852 bf796346-8199-ebff-c7df-db05dc3882e9 SENSOR-8000-6G Diagnosed
  100%  fault.psu.failed-int
        Problem in: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1
           Affects: -
               FRU: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1
          Location: PSU 1

Nov 21 23:54:05.5062 5ef0195b-2548-4369-a14e-e4652cb33635 SENSOR-8000-26 Diagnosed
  100%  fault.fan.failed
        Problem in: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6
           Affects: -
               FRU: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6
          Location: FAN 6

# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 21 23:38:50 bf796346-8199-ebff-c7df-db05dc3882e9 SENSOR-8000-6G Major

Host        : lava
Platform    : SSG-2028R-ACR24L    Chassis_id  : S194308X7333203
Product_sn  :

Fault class : fault.psu.failed-int
Problem in  : "PSU 1" (hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1)
                  faulted but still in service
FRU         : "PSU 1" (hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1)
                  faulty

Description : A sensor indicates that this power supply has failed. Refer to
              http://illumos.org/msg/SENSOR-8000-6G for more information.

Response    : None.

Impact      : The enclosure may be getting inadequate power. Subsequent loss
              of power supplies may force the enclosure to shutdown.

Action      : Replace the indicated power supply

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 21 23:54:05 5ef0195b-2548-4369-a14e-e4652cb33635 SENSOR-8000-26 Minor

Host        : lava
Platform    : SSG-2028R-ACR24L    Chassis_id  : S194308X7333203
Product_sn  :

Fault class : fault.fan.failed
Problem in  : "FAN 6" (hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6)
                  faulted but still in service
FRU         : "FAN 6" (hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6)
                  faulty

Description : External sensors indicate that a fan is no longer operating
              correctly. Refer to http://illumos.org/msg/SENSOR-8000-26 for
              more information.

Response    : None.

Impact      : The enclosure may be getting inadequate cooling. If the problem
              persists, the components may overheat and the enclosure may
              shutdown.

Action      : Check the fan to see if the fan has lost power or there is a
              physical obstruction. Replace the fan if the internal mechanism
              has failed.
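
As a side note on reading the ereports above, the `__tod` member holds the event time as a seconds/nanoseconds pair in hex. The short Python sketch below decodes it; `decode_tod` is a hypothetical helper name, and the result is rendered in UTC on the assumption that the timestamps in the output above were also displayed in UTC.

```python
from datetime import datetime, timezone


def decode_tod(sec_hex, nsec_hex):
    """Turn an ereport __tod pair (seconds, nanoseconds) into a timestamp."""
    sec = int(sec_hex, 16)
    nsec = int(nsec_hex, 16)
    stamp = datetime.fromtimestamp(sec, tz=timezone.utc)
    # Seconds come from the first word, the fractional part from the second.
    return f"{stamp:%b %d %Y %H:%M:%S}.{nsec:09d}"


print(decode_tod("0x5a14bc9d", "0x18494347"))  # PSU ereport
print(decode_tod("0x5a14bc9d", "0x19a65f7e"))  # fan ereport
```

With the values from the two ereports this yields the same timestamps that head each record (23:54:05.407454535 and 23:54:05.430333822 on Nov 21 2017).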
Then after restoring power and unblocking the fan...
# fmdump -v -c list.resolved -t "Nov 21 23:54:05.5062"
TIME                 UUID                                 SUNW-MSG-ID EVENT
Nov 21 23:56:05.5062 5ef0195b-2548-4369-a14e-e4652cb33635 FMD-8000-6U Resolved
  100%  fault.fan.failed        Repair Attempted
        Problem in: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6
           Affects: -
               FRU: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6
          Location: FAN 6

Nov 22 00:06:05.8279 bf796346-8199-ebff-c7df-db05dc3882e9 FMD-8000-6U Resolved
  100%  fault.psu.failed-int    Repair Attempted
        Problem in: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1
           Affects: -
               FRU: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1
          Location: PSU 1

# fmadm faulty
#
Updated by Electric Monk over 5 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit 8f022dd6c1ebe3edc269726bf537617e665df32f
commit 8f022dd6c1ebe3edc269726bf537617e665df32f
Author: Rob Johnston <rob.johnston@joyent.com>
Date: 2018-01-23T21:33:23.000Z

    8967 libipmi: add support for GET_CHASSIS_STATUS command
    8974 fac_prov_ipmi should support binding by entity id and instance
    8975 ipmi topo plugin should automatically enumerate sensors on nodes it enumerates
    8976 ipmi enumerator should include FRU identity information in FMRI authority
    8977 ipmi enumerator doesn't always enumerate nested entities
    8978 Add topo facility method for controlling chassis ident indicator
    Reviewed by: Yuri Pankov <yuripv@icloud.com>
    Reviewed by: Ben Sims <bensims@gmail.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
Updated by Matthias Scheler over 3 years ago
- Related to Bug #11818: IPMI topo plugin shouldn't return data from unavailable sensors added