Feature #8975

closed

ipmi topo plugin should automatically enumerate sensors on nodes it enumerates

Added by Rob Johnston over 5 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Start date:
2018-01-19
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:
External Bug:

Description

Today, unless you add sufficient platform-specific goo to the libtopo XML map files, sensors are not enumerated when building the topo snapshot. The changes being introduced by issue #8974 will add a new, more platform-agnostic mechanism to the fac_prov_ipmi module for binding topo nodes to their associated sensors. All that's needed to complete the circuit here is to have the "ipmi" module do the following for any nodes that it creates (currently fans and PSUs):

1. Create entity_id and entity_instance properties in the "ipmi" property group.
2. Register the TOPO_METH_FAC_ENUM method from the fac_prov_ipmi module.
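
As a rough illustration of the two steps above, here is a small Python model of the binding idea. It is not libtopo code: the `Sensor` and `TopoNode` classes, the `fac_enum` function, and the `enumerate_node` helper are all hypothetical stand-ins, and the real work is done in C by the "ipmi" and fac_prov_ipmi modules. The sketch only shows how matching on the (entity_id, entity_instance) pair lets a node find its sensors without platform-specific map entries.

```python
from dataclasses import dataclass, field


@dataclass
class Sensor:
    """Hypothetical stand-in for one IPMI sensor record."""
    name: str
    entity_id: int
    entity_instance: int


@dataclass
class TopoNode:
    """Hypothetical stand-in for a topo node (not a libtopo type)."""
    name: str
    instance: int
    props: dict = field(default_factory=dict)    # property group -> {property: value}
    methods: dict = field(default_factory=dict)  # method name -> callable
    facilities: list = field(default_factory=list)


def fac_enum(node, sensors):
    """Model of a facility-enumeration method: bind sensors whose entity matches."""
    ipmi = node.props.get("ipmi", {})
    key = (ipmi.get("entity_id"), ipmi.get("entity_instance"))
    node.facilities = [s.name for s in sensors
                       if (s.entity_id, s.entity_instance) == key]
    return node.facilities


def enumerate_node(name, instance, entity_id, entity_instance):
    """Model of what the enumerator does for each fan or psu node it creates."""
    node = TopoNode(name, instance)
    # Step 1: record the entity identity in the "ipmi" property group.
    node.props["ipmi"] = {"entity_id": entity_id,
                          "entity_instance": entity_instance}
    # Step 2: register the facility-enumeration method on the node.
    node.methods["fac_enum"] = fac_enum
    return node


# Entity IDs taken from the fmtopo output in this issue: 0xa for PSUs, 0x1d for fans.
sensors = [Sensor("PS1 Status", 0x0A, 1),
           Sensor("PS2 Status", 0x0A, 2),
           Sensor("FAN1", 0x1D, 1)]
psu1 = enumerate_node("psu", 1, 0x0A, 2)
bound = psu1.methods["fac_enum"](psu1, sensors)
```

In this model, `psu=1` with entity instance 2 binds only to "PS2 Status", which mirrors the psu=1 node in the lava output.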


Related issues

Related to illumos gate - Bug #11818: IPMI topo plugin shouldn't return data from unavailable sensors (Closed, Matthias Scheler)

Actions #1

Updated by Rob Johnston over 5 years ago

To test this change, I used the "fmtopo" utility to create topo snapshots both with and without my changes and verified that the only differences introduced by my changes were the following:

1) The new "entity_id" and "entity_instance" properties were being correctly created on all fan and psu nodes
2) Sensor facility nodes were now being enumerated as children of the psu and fan nodes
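
The first check can also be done mechanically. The sketch below is a hypothetical helper, not part of fmtopo or illumos: it scans saved snapshot text and reports fan and psu nodes that lack either property in the "ipmi" group. It assumes the layout seen in the diffs in this issue (an `hc://` line per node, followed by `group:` lines and one property per line) and that FMRIs contain no spaces.

```python
import re

# A fan or psu node line is an FMRI ending in /fan=N or /psu=N.
# Facility nodes end in "?sensor=..." and therefore do not match.
NODE_RE = re.compile(r"^hc://\S*/(?:fan|psu)=\d+$")
REQUIRED = {"entity_id", "entity_instance"}


def nodes_missing_entity_props(text):
    """Return FMRIs of fan/psu nodes lacking the required 'ipmi' properties."""
    seen = {}            # node FMRI -> set of property names seen in group "ipmi"
    node = group = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("hc://"):
            # A new node starts here; only track it if it is a fan or psu.
            node = line if NODE_RE.match(line) else None
            group = None
            if node is not None:
                seen.setdefault(node, set())
        elif line.startswith("group:"):
            parts = line.split()
            group = parts[1] if len(parts) > 1 else None
        elif node is not None and group == "ipmi" and line:
            # Property lines look like: "<name> <type> <value>".
            seen[node].add(line.split()[0])
    return sorted(n for n, props in seen.items() if not REQUIRED <= props)
```

Running it over the "after" snapshot should return an empty list if every fan and psu node received both properties.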

I tested this change in the global zone of a J1101 compute node and a SuperMicro box in the Joyent SF lab (lava).

Before and after diffs of the fmtopo output are below:

$ diff fmtopoV.before.j1101 fmtopoV.after.j1101
2c2
< Nov 20 22:35:04 d318b451-9dc5-60ca-fad3-8a4397ddd1a2
---
> Nov 20 22:35:20 383786f5-19e9-6a83-ac89-cda2adcbfdf2
23a24,39
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0xa
> entity_instance uint32 0x2
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/psu=0?sensor=PS2 Status
> group: protocol version: 1 stability: Private/Private
> resource fmri hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/psu=0?sensor=PS2 Status
> group: authority version: 1 stability: Private/Private
> product-id string Joyent-Compute-Platform-1101
> chassis-id string S12612523710138
> server-id string headnode
> group: facility version: 1 stability: Private/Private
> entity_ref string[] [ "PS2 Status" ]
> sensor-class string discrete
> type uint32 0x8 (POWER_SUPPLY)
> state uint32 0x1 (PRESENT)
33a50,67
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x1
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=0?sensor=FAN1
> group: protocol version: 1 stability: Private/Private
> resource fmri hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=0?sensor=FAN1
> group: authority version: 1 stability: Private/Private
> product-id string Joyent-Compute-Platform-1101
> chassis-id string S12612523710138
> server-id string headnode
> group: facility version: 1 stability: Private/Private
> entity_ref string[] [ "FAN1" ]
> sensor-class string threshold
> type uint32 0x101 (THRESHOLD_STATE)
> state uint32 0xc0 (0xc0)
> reading double 2925.000000
> units uint32 0x12 (RPM)
43a78,95
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x2
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=1?sensor=FAN2
> group: protocol version: 1 stability: Private/Private
> resource fmri hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=1?sensor=FAN2
> group: authority version: 1 stability: Private/Private
> product-id string Joyent-Compute-Platform-1101
> chassis-id string S12612523710138
> server-id string headnode
> group: facility version: 1 stability: Private/Private
> entity_ref string[] [ "FAN2" ]
> sensor-class string threshold
> type uint32 0x101 (THRESHOLD_STATE)
> state uint32 0xc0 (0xc0)
> reading double 3000.000000
> units uint32 0x12 (RPM)
53a106,123
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x3
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=2?sensor=FAN3
> group: protocol version: 1 stability: Private/Private
> resource fmri hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=2?sensor=FAN3
> group: authority version: 1 stability: Private/Private
> product-id string Joyent-Compute-Platform-1101
> chassis-id string S12612523710138
> server-id string headnode
> group: facility version: 1 stability: Private/Private
> entity_ref string[] [ "FAN3" ]
> sensor-class string threshold
> type uint32 0x101 (THRESHOLD_STATE)
> state uint32 0xc0 (0xc0)
> reading double 3000.000000
> units uint32 0x12 (RPM)
63a134,138
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x4
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=3?sensor=FAN4
73a149,153
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x5
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=4?sensor=FAN5
83a164,168
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x6
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=5?sensor=FAN6
93a179,183
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x7
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=6?sensor=FAN7
103a194,198
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x8
>
> hc://:product-id=Joyent-Compute-Platform-1101:server-id=headnode:chassis-id=S12612523710138/chassis=0/fan=7?sensor=FAN8

$ diff fmtopoV.before.lava fmtopoV.after.lava
2c2
< Nov 21 02:30:27 fb7f352b-0601-4ffa-bd73-918336f97241
---
> Nov 21 02:28:46 826ee98a-8f1d-ec08-f1ac-8ff7aa3e2f07
23a24,39
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0xa
> entity_instance uint32 0x1
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=0?sensor=PS1 Status
> group: protocol version: 1 stability: Private/Private
> resource fmri hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=0?sensor=PS1 Status
> group: authority version: 1 stability: Private/Private
> product-id string SSG-2028R-ACR24L
> chassis-id string S194308X7333203
> server-id string lava
> group: facility version: 1 stability: Private/Private
> entity_ref string[] [ "PS1 Status" ]
> sensor-class string discrete
> type uint32 0x8 (POWER_SUPPLY)
> state uint32 0x1 (PRESENT)
33a50,65
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0xa
> entity_instance uint32 0x2
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1?sensor=PS2 Status
> group: protocol version: 1 stability: Private/Private
> resource fmri hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1?sensor=PS2 Status
> group: authority version: 1 stability: Private/Private
> product-id string SSG-2028R-ACR24L
> chassis-id string S194308X7333203
> server-id string lava
> group: facility version: 1 stability: Private/Private
> entity_ref string[] [ "PS2 Status" ]
> sensor-class string discrete
> type uint32 0x8 (POWER_SUPPLY)
> state uint32 0xb (0x0b)
43a76,93
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x1
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=0?sensor=FAN1
> group: protocol version: 1 stability: Private/Private
> resource fmri hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=0?sensor=FAN1
> group: authority version: 1 stability: Private/Private
> product-id string SSG-2028R-ACR24L
> chassis-id string S194308X7333203
> server-id string lava
> group: facility version: 1 stability: Private/Private
> entity_ref string[] [ "FAN1" ]
> sensor-class string threshold
> type uint32 0x101 (THRESHOLD_STATE)
> state uint32 0xc0 (0xc0)
> reading double 2800.000000
> units uint32 0x12 (RPM)
53a104,121
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x2
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=1?sensor=FAN2
> group: protocol version: 1 stability: Private/Private
> resource fmri hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=1?sensor=FAN2
> group: authority version: 1 stability: Private/Private
> product-id string SSG-2028R-ACR24L
> chassis-id string S194308X7333203
> server-id string lava
> group: facility version: 1 stability: Private/Private
> entity_ref string[] [ "FAN2" ]
> sensor-class string threshold
> type uint32 0x101 (THRESHOLD_STATE)
> state uint32 0xc0 (0xc0)
> reading double 2900.000000
> units uint32 0x12 (RPM)
63a132,149
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x3
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=2?sensor=FAN3
> group: protocol version: 1 stability: Private/Private
> resource fmri hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=2?sensor=FAN3
> group: authority version: 1 stability: Private/Private
> product-id string SSG-2028R-ACR24L
> chassis-id string S194308X7333203
> server-id string lava
> group: facility version: 1 stability: Private/Private
> entity_ref string[] [ "FAN3" ]
> sensor-class string threshold
> type uint32 0x101 (THRESHOLD_STATE)
> state uint32 0xc0 (0xc0)
> reading double 4000.000000
> units uint32 0x12 (RPM)
73a160,164
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x4
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=3?sensor=FAN4
83a175,179
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x5
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=4?sensor=FAN5
93a190,194
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x6
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=5?sensor=FAN6
103a205,222
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x7
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6?sensor=FANA
> group: protocol version: 1 stability: Private/Private
> resource fmri hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6?sensor=FANA
> group: authority version: 1 stability: Private/Private
> product-id string SSG-2028R-ACR24L
> chassis-id string S194308X7333203
> server-id string lava
> group: facility version: 1 stability: Private/Private
> entity_ref string[] [ "FANA" ]
> sensor-class string threshold
> type uint32 0x101 (THRESHOLD_STATE)
> state uint32 0xc0 (0xc0)
> reading double 2900.000000
> units uint32 0x12 (RPM)
113a233,237
> group: ipmi version: 1 stability: Private/Private
> entity_id uint32 0x1d
> entity_instance uint32 0x8
>
> hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=7?sensor=FANB

Some additional testing
==================

I also built a USB stick with my changes and re-verified them on lava. In addition, I removed power from one of the PSUs and physically obstructed one of the fans to verify that FMA was able to detect and diagnose the faults based on polling the relevant sensors.

The ereports and faults generated after removing the cable from one of the power supplies and obstructing one of the fan modules are shown below:

Nov 21 2017 23:54:05.407454535 ereport.sensor.failure
nvlist version: 0
type = psu
nonrecov = 1
predictive = 0
source = 0x1
details = (embedded nvlist)
nvlist version: 0
PS2 Status = (embedded nvlist)
nvlist version: 0
type = 0x8
state = 0xb
units = 0x0
nonrecov = 1
predictive = 0
source = 0x1
(end PS2 Status)

(end details)

class = ereport.sensor.failure
version = 0x0
ena = 0x343d9e885005c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = hc
hc-root =
authority = (embedded nvlist)
nvlist version: 0
product-id = SSG-2028R-ACR24L
server-id = lava
chassis-id = S194308X7333203
(end authority)

hc-list = (array of embedded nvlists)
(start hc-list[0])
nvlist version: 0
hc-name = chassis
hc-id = 0
(end hc-list[0])
(start hc-list[1])
nvlist version: 0
hc-name = psu
hc-id = 1
(end hc-list[1])

(end detector)

__ttl = 0x1
__tod = 0x5a14bc9d 0x18494347

Nov 21 2017 23:54:05.430333822 ereport.sensor.failure
nvlist version: 0
type = fan
nonrecov = 1
predictive = 0
source = 0x0
details = (embedded nvlist)
nvlist version: 0
FANA = (embedded nvlist)
nvlist version: 0
type = 0x101
state = 0xc4
units = 0x12
nonrecov = 1
predictive = 0
source = 0x0
reading = 0x0.000000
(end FANA)

(end details)

class = ereport.sensor.failure
version = 0x0
ena = 0x343efbaa6705c01
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = hc
hc-root =
authority = (embedded nvlist)
nvlist version: 0
product-id = SSG-2028R-ACR24L
server-id = lava
chassis-id = S194308X7333203
(end authority)

hc-list = (array of embedded nvlists)
(start hc-list[0])
nvlist version: 0
hc-name = chassis
hc-id = 0
(end hc-list[0])
(start hc-list[1])
nvlist version: 0
hc-name = fan
hc-id = 6
(end hc-list[1])

(end detector)

__ttl = 0x1
__tod = 0x5a14bc9d 0x19a65f7e
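
As a cross-check on the two ereports above, the `__tod` pair can be decoded as seconds and nanoseconds since the epoch and compared with the timestamp printed on the first line of each ereport. The helper below is a hypothetical sketch, not an FMA tool, and it assumes the printed times are in UTC.

```python
from datetime import datetime, timezone


def decode_tod(sec_hex, nsec_hex):
    """Decode a __tod pair (hex seconds, hex nanoseconds) into a UTC timestamp string."""
    sec = int(sec_hex, 16)     # int() accepts the "0x" prefix when base is 16
    nsec = int(nsec_hex, 16)
    stamp = datetime.fromtimestamp(sec, tz=timezone.utc)
    return f"{stamp:%Y-%m-%d %H:%M:%S}.{nsec:09d}"


psu_time = decode_tod("0x5a14bc9d", "0x18494347")
fan_time = decode_tod("0x5a14bc9d", "0x19a65f7e")
```

Both values land on 2017-11-21 23:54:05 with fractional parts .407454535 and .430333822, agreeing with the ereport headers.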

# fmdump -v -c list.suspect
TIME UUID SUNW-MSG-ID EVENT
Nov 21 23:38:50.1852 bf796346-8199-ebff-c7df-db05dc3882e9 SENSOR-8000-6G Diagnosed
100% fault.psu.failed-int

Problem in: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1
Affects: -
FRU: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1
Location: PSU 1

Nov 21 23:54:05.5062 5ef0195b-2548-4369-a14e-e4652cb33635 SENSOR-8000-26 Diagnosed
100% fault.fan.failed

Problem in: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6
Affects: -
FRU: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6
Location: FAN 6

# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 21 23:38:50 bf796346-8199-ebff-c7df-db05dc3882e9 SENSOR-8000-6G Major

Host : lava
Platform : SSG-2028R-ACR24L Chassis_id : S194308X7333203
Product_sn :

Fault class : fault.psu.failed-int
Problem in : "PSU 1" (hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1)
faulted but still in service
FRU : "PSU 1" (hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1)
faulty

Description : A sensor indicates that this power supply has failed. Refer to
http://illumos.org/msg/SENSOR-8000-6G for more information.

Response : None.

Impact : The enclosure may be getting inadequate power. Subsequent loss
of power supplies may force the enclosure to shutdown.

Action : Replace the indicated power supply

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 21 23:54:05 5ef0195b-2548-4369-a14e-e4652cb33635 SENSOR-8000-26 Minor

Host : lava
Platform : SSG-2028R-ACR24L Chassis_id : S194308X7333203
Product_sn :

Fault class : fault.fan.failed
Problem in : "FAN 6" (hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6)
faulted but still in service
FRU : "FAN 6" (hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6)
faulty

Description : External sensors indicate that a fan is no longer operating
correctly. Refer to http://illumos.org/msg/SENSOR-8000-26 for
more information.

Response : None.

Impact : The enclosure may be getting inadequate cooling. If the problem
persists, the components may overheat and the enclosure may
shutdown.

Action : Check the fan to see if the fan has lost power or there is a
physical obstruction. Replace the fan if the internal mechanism
has failed.

Then after restoring power and unblocking the fan...

# fmdump -v -c list.resolved -t "Nov 21 23:54:05.5062" 
TIME UUID SUNW-MSG-ID EVENT
Nov 21 23:56:05.5062 5ef0195b-2548-4369-a14e-e4652cb33635 FMD-8000-6U Resolved
100% fault.fan.failed Repair Attempted

Problem in: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6
Affects: -
FRU: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/fan=6
Location: FAN 6

Nov 22 00:06:05.8279 bf796346-8199-ebff-c7df-db05dc3882e9 FMD-8000-6U Resolved
100% fault.psu.failed-int Repair Attempted

Problem in: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1
Affects: -
FRU: hc://:product-id=SSG-2028R-ACR24L:server-id=lava:chassis-id=S194308X7333203/chassis=0/psu=1
Location: PSU 1

# fmadm faulty
#
Actions #2

Updated by Electric Monk over 5 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 8f022dd6c1ebe3edc269726bf537617e665df32f

commit  8f022dd6c1ebe3edc269726bf537617e665df32f
Author: Rob Johnston <rob.johnston@joyent.com>
Date:   2018-01-23T21:33:23.000Z

    8967 libipmi: add support for GET_CHASSIS_STATUS command
    8974 fac_prov_ipmi should support binding by entity id and instance
    8975 ipmi topo plugin should automatically enumerate sensors on nodes it enumerates
    8976 ipmi enumerator should include FRU identity information in FMRI authority
    8977 ipmi enumerator doesn't always enumerate nested entities
    8978 Add topo facility method for controlling chassis ident indicator
    Reviewed by: Yuri Pankov <yuripv@icloud.com>
    Reviewed by: Ben Sims <bensims@gmail.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

Actions #3

Updated by Matthias Scheler over 3 years ago

  • Related to Bug #11818: IPMI topo plugin shouldn't return data from unavailable sensors added