Bug #315
open
fmadm telemetry faults
0%
Description
I had installed OpenIndiana on a generic x86 box, and I was seeing this error reported by fmadm:
root@vps3:~# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Sep 18 11:31:18 11badbe0-0605-6c0f-94fc-f9b3fa26d1e9 SUNOS-8000-J0 Major
Host : RS253
Platform : X8SIL Chassis_id : 0123456789
Product_sn :
Fault class : defect.sunos.eft.unexpected_telemetry 50%
fault.sunos.eft.unexpected_telemetry 50%
Problem in : dev:////pci@0,0/pci8086,3b4c@1c,5/pci15d9,605@0
faulted and taken out of service
Description : The diagnosis engine encountered telemetry from the listed
devices for which it was unable to perform a diagnosis -
Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.
Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.
Response : Error reports have been logged for examination by Sun.
Impact : Automated diagnosis and response for these events will not occur.
Action : Ensure that the latest Solaris Kernel and Predictive Self-Healing
(PSH) patches are installed.
Later, we installed OI on a Sun x4170 server. We see the same error there (though the "Problem in" device is different), which leads me to believe that this may be more of a bug in fmd:
root@vps5:~# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Sep 30 19:14:06 f7f2f656-222c-c162-b69a-be27807175ae SUNOS-8000-J0 Major
Host : vps5
Platform : SUN-FIRE-X4170-SERVER Chassis_id : 1003XF51B8
Product_sn :
Fault class : fault.sunos.eft.unexpected_telemetry 50%
defect.sunos.eft.unexpected_telemetry 50%
Problem in : dev:////pci@0,0
faulted and taken out of service
Description : The diagnosis engine encountered telemetry from the listed
devices for which it was unable to perform a diagnosis -
Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.
Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.
Response : Error reports have been logged for examination by Sun.
Impact : Automated diagnosis and response for these events will not occur.
Action : Ensure that the latest Solaris Kernel and Predictive Self-Healing
(PSH) patches are installed.
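For anyone comparing the "Problem in" paths across machines, here is a minimal Python sketch that splits a `dev:` FMRI into its device-tree nodes. The names `PciNode` and `parse_dev_fmri` are made up for illustration. It assumes the usual PCI unit-address convention of `device,function` in hex, with the function omitted when it is 0; for the root nexus `pci@0,0` the unit address identifies the host bridge rather than a device/function pair. The IDs in a node name such as `pci15d9,605` may be subsystem IDs rather than vendor/device IDs, so the sketch keeps the name as an opaque string.

```python
import re
from typing import List, NamedTuple


class PciNode(NamedTuple):
    name: str      # node name as printed, e.g. "pci8086,3b4c"
    device: int    # first unit-address field (PCI device number)
    function: int  # second unit-address field (0 when omitted)


# name@dev[,fn] where dev and fn are hexadecimal
_NODE_RE = re.compile(
    r"^(?P<name>[^@]+)@(?P<dev>[0-9a-fA-F]+)(?:,(?P<fn>[0-9a-fA-F]+))?$"
)


def parse_dev_fmri(fmri: str) -> List[PciNode]:
    """Split a dev: FMRI such as dev:////pci@0,0/... into its nodes."""
    prefix = "dev:///"
    if not fmri.startswith(prefix):
        raise ValueError("not a dev: FMRI: %r" % fmri)
    nodes = []
    for part in fmri[len(prefix):].split("/"):
        if not part:
            continue  # skip the empty string before the leading slash
        match = _NODE_RE.match(part)
        if match is None:
            raise ValueError("unrecognised path component: %r" % part)
        nodes.append(
            PciNode(
                name=match.group("name"),
                device=int(match.group("dev"), 16),
                function=int(match.group("fn") or "0", 16),
            )
        )
    return nodes
```

Applied to the first report, the faulted node is `pci15d9,605` at device 0, sitting behind `pci8086,3b4c` at device 0x1c, function 5.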
Updated by Garrett D'Amore over 12 years ago
- Project changed from site to illumos gate
Updated by Matthew Goheen over 12 years ago
We have also installed oi_147 on a Sun Fire X4170 and see continuous fm errors (60Mb/day of errors/warnings in /var/fm/fmd/errlog) related to the two Fibre Channel controllers installed in the system. Here is a (very) small portion of the output from "fmdump -e":
...
Oct 26 05:38:34.2556 ereport.io.pci.fabric
Oct 26 05:38:34.2556 ereport.io.pci.fabric
Oct 26 05:38:34.2556 ereport.io.pciex.rc.ce-msg
Oct 26 05:38:34.2556 ereport.io.pciex.pl.re
Oct 26 05:38:34.2556 ereport.io.pciex.rc.ce-msg
Oct 26 05:39:16.2110 ereport.io.pci.fabric
Oct 26 05:39:16.2110 ereport.io.pci.fabric
Oct 26 05:39:16.2110 ereport.io.pciex.rc.ce-msg
Oct 26 05:39:16.2110 ereport.io.pciex.pl.re
Oct 26 05:39:16.2110 ereport.io.pciex.rc.ce-msg
Oct 26 05:39:22.2068 ereport.io.pci.fabric
Oct 26 05:39:22.2069 ereport.io.pci.fabric
Oct 26 05:39:22.2068 ereport.io.pciex.rc.ce-msg
Oct 26 05:39:22.2069 ereport.io.pciex.pl.re
Oct 26 05:39:22.2069 ereport.io.pciex.rc.ce-msg
Oct 26 05:39:30.0206 ereport.io.pci.fabric
Oct 26 05:39:30.0207 ereport.io.pci.fabric
Oct 26 05:39:30.0206 ereport.io.pciex.rc.ce-msg
Oct 26 05:39:30.0207 ereport.io.pciex.pl.re
Oct 26 05:39:30.0207 ereport.io.pciex.rc.ce-msg
...
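With this much output it helps to see which ereport classes dominate. A small Python sketch, assuming the one-event-per-line layout shown above where the class is the last field on the line (`tally_ereports` is a name invented here, not part of any fm tooling):

```python
from collections import Counter
from typing import Iterable


def tally_ereports(lines: Iterable[str]) -> Counter:
    """Count ereport classes in `fmdump -e` style output."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        # Skip headers, blank lines and "..." elisions; keep event lines only.
        if fields and fields[-1].startswith("ereport."):
            counts[fields[-1]] += 1
    return counts
```

Passing it `sys.stdin` while piping in `fmdump -e`, then printing `most_common()`, gives a quick ranking of the classes.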
"fmdump" reports:
TIME UUID SUNW-MSG-ID EVENT
Oct 22 17:21:12.7552 682ee1c3-4dca-c578-899d-8262200ed77b SUNOS-8000-J0 Diagnosed
Oct 22 17:25:09.7113 fcbe397e-7b5c-c9b7-95bd-e1c87ccab845 PCIEX-8000-KP Diagnosed
Oct 22 17:54:16.1998 46d368bf-8525-43bd-d005-c3c929b98d6b PCIEX-8000-KP Diagnosed
Oct 25 11:08:26.3788 ed012fb2-3305-c74d-f3db-cd450e7690f4 SUNOS-8000-J0 Diagnosed
Oct 25 11:26:43.0540 46d368bf-8525-43bd-d005-c3c929b98d6b FMD-8000-58 Updated
Oct 25 11:26:53.7197 6770b1ae-e718-ccf9-d802-b1833f5b92e2 PCIEX-8000-KP Diagnosed
"fmadm faulty" reports:
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Oct 25 11:26:53 6770b1ae-e718-ccf9-d802-b1833f5b92e2 PCIEX-8000-KP Major
Host : hopi-fs
Platform : SUN-FIRE-X4170-SERVER Chassis_id : 1015XF50CF
Product_sn :
Fault class : fault.io.pciex.device-interr-corr 67%
fault.io.pciex.bus-linkerr-corr 33%
Affects : dev:////pci@0,0/pci8086,340e@7/pci1077,170@0
faulted but still in service
FRU : "PCIe 1" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=hopi-fs:chassis-id=1015XF50CF/motherboard=0/hostbridge=3/pciexrc=3/pciexbus=19/pciexdev=0)
faulty
Description : Too many recovered bus errors have been detected, which indicates
a problem with the specified bus or with the specified
transmitting device. This may degrade into an unrecoverable
fault.
Refer to http://sun.com/msg/PCIEX-8000-KP for more information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with
this fault
Action : If a plug-in card is involved check for badly-seated cards or bent
pins. Otherwise schedule a repair procedure to replace the
affected device. Use fmadm faulty to identify the device or
contact Sun for support.

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Oct 22 17:54:16 46d368bf-8525-43bd-d005-c3c929b98d6b PCIEX-8000-KP Major
Host : hopi-fs
Platform : SUN-FIRE-X4170-SERVER Chassis_id : 1015XF50CF
Product_sn :
Fault class : fault.io.pciex.device-interr-corr max 50%
fault.io.pciex.bus-linkerr-corr 25%
Affects : dev:////pci@0,0/pci8086,340e@7/pci1077,170@0
ok and in service
dev:////pci@0,0/pci8086,340e@7
faulted but still in service
FRU : "PCIe 1" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=hopi-fs:chassis-id=1015XF50CF/motherboard=0/hostbridge=3/pciexrc=3/pciexbus=19/pciexdev=0) max 50%
acquitted
"MB" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=hopi-fs:chassis-id=1015XF50CF/motherboard=0) 25%
faulty
Description : Too many recovered bus errors have been detected, which indicates
a problem with the specified bus or with the specified
transmitting device. This may degrade into an unrecoverable
fault.
Refer to http://sun.com/msg/PCIEX-8000-KP for more information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with
this fault
Action : If a plug-in card is involved check for badly-seated cards or bent
pins. Otherwise schedule a repair procedure to replace the
affected device. Use fmadm faulty to identify the device or
contact Sun for support.

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Oct 22 17:21:12 682ee1c3-4dca-c578-899d-8262200ed77b SUNOS-8000-J0 Major
Host : hopi-fs
Platform : SUN-FIRE-X4170-SERVER Chassis_id : 1015XF50CF
Product_sn :
Fault class : fault.sunos.eft.unexpected_telemetry 50%
defect.sunos.eft.unexpected_telemetry 50%
Problem in : dev:////pci@0,0
faulted and taken out of service
Description : The diagnosis engine encountered telemetry from the listed
devices for which it was unable to perform a diagnosis -
Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.
Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.
Response : Error reports have been logged for examination by Sun.
Impact : Automated diagnosis and response for these events will not occur.
Action : Ensure that the latest Solaris Kernel and Predictive Self-Healing
(PSH) patches are installed.

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Oct 25 11:08:26 ed012fb2-3305-c74d-f3db-cd450e7690f4 SUNOS-8000-J0 Major
Host : hopi-fs
Platform : SUN-FIRE-X4170-SERVER Chassis_id : 1015XF50CF
Product_sn :
Fault class : defect.sunos.eft.unexpected_telemetry 50%
fault.sunos.eft.unexpected_telemetry 50%
Problem in : dev:////pci@0,0/pci8086,3410@9/pci1077,170@0
faulted and taken out of service
Description : The diagnosis engine encountered telemetry from the listed
devices for which it was unable to perform a diagnosis -
Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.
Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.
Response : Error reports have been logged for examination by Sun.
Impact : Automated diagnosis and response for these events will not occur.
Action : Ensure that the latest Solaris Kernel and Predictive Self-Healing
(PSH) patches are installed.

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Oct 22 17:25:09 fcbe397e-7b5c-c9b7-95bd-e1c87ccab845 PCIEX-8000-KP Major
Host : hopi-fs
Platform : SUN-FIRE-X4170-SERVER Chassis_id : 1015XF50CF
Product_sn :
Fault class : fault.io.pciex.device-interr-corr max 50%
fault.io.pciex.bus-linkerr-corr 25%
Affects : dev:////pci@0,0/pci8086,3410/pci1077,170@0
dev:////pci@0,0/pci8086,3410@9
faulted and taken out of service
FRU : "PCIe 2" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=hopi-fs:chassis-id=1015XF50CF/motherboard=0/hostbridge=4/pciexrc=4/pciexbus=25/pciexdev=0) max 50%
"MB" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=hopi-fs:chassis-id=1015XF50CF/motherboard=0) 25%
faulty
Description : Too many recovered bus errors have been detected, which indicates
a problem with the specified bus or with the specified
transmitting device. This may degrade into an unrecoverable
fault.
Refer to http://sun.com/msg/PCIEX-8000-KP for more information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with
this fault
Action : If a plug-in card is involved check for badly-seated cards or bent
pins. Otherwise schedule a repair procedure to replace the
affected device. Use fmadm faulty to identify the device or
contact Sun for support.
Note that these (or similar) faults also occurred in NexentaStor 3.0.2 but seemed to go away in 3.0.3. They also occurred in NexentaCore 3.0.1.
When we first encountered this error in NexentaStor 3 we tried reseating the cards, trying only one card, and moving them around. I believe we even swapped the cards to another X4170 we had. Nothing helped. These cards happen to be QLogic QLE2560 cards.
The system is currently limping along with one card/slot faulted out.
Updated by Matthew Goheen over 12 years ago
See also this entry from the Sun X4170 Product Notes:
-----
CR #6858722
Description:
PCIe Errors Reported With QLE2562 Card Attached Using Active Riser
Issue:
If the Sun QLogic QLE2562-based, 8 Gigabit/second PCI-E Dual FC Host
Adapter (SG-XPCIE2FC-QF8-Z) or Sun QLogic QLE2560-based, 8
Gigabit/second PCI-E Single FC Host Adapter (SG-XPCIE1FC-QF8-Z) are placed
in PCIe slot 1, 2, 4, or 5, a stream of correctable PCIe errors will be reported. This
can cause performance issues and operating system stability issues due to the
large number of errors received.
Affected Hardware and Software:
Sun Fire X4270 and X4275 Servers
Supplemental Releases 1.0, 1.2, 1.4, 2.0, 2.0.1, 2.1 and 2.3.1
Workaround:
The Sun QLE2562-based or QLE-2560-based cards must be placed in PCIe slot 0
or 3 only.
-----
Certainly seems pertinent! Doesn't mention the X4170 but it certainly seems to affect our systems.
Updated by Udo Grabowski about 10 years ago
- Difficulty set to Medium
- Tags set to needs-triage
We see the same problem on a closely related architecture, the Sun X4275. It has the same BIOS/ILOM software package as the 4170 (3.0.16.15c - r79486).
The 'unexpected telemetry' part is certainly Oracle bug 15820471, which was fixed in Solaris 11.1, but the generic problem with the excessive correctable PCI errors (here on an Adaptec STK RAID INT adapter, aac driver, placed in slot 0) is still unexplained.
Updated by Bayard Bell over 9 years ago
- Tags deleted (needs-triage)
Could someone who's encountered this bug please attach the following output:
fmdump -mV
fmdump -eV
If output is massive, please slim it down with -t (e.g. 1d).
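If it helps anyone gather these, a rough Python sketch of the collection step follows. It only wraps the two commands requested above plus the optional `-t` window; the function names, the `"fault"`/`"error"` keys and the output paths are placeholders invented here, and `collect` has to be run on the affected host with sufficient privileges.

```python
import subprocess
from typing import List, Optional

# Flags as requested in this comment: -mV for the fault log, -eV for the error log.
_FLAGS = {"fault": "-mV", "error": "-eV"}


def fmdump_cmd(log: str, since: Optional[str] = None) -> List[str]:
    """Build the fmdump command line, optionally limited with -t (e.g. "1d")."""
    cmd = ["fmdump", _FLAGS[log]]
    if since:
        cmd += ["-t", since]
    return cmd


def collect(log: str, since: Optional[str], out_path: str) -> None:
    """Run fmdump and write its output to out_path for attaching to the bug."""
    with open(out_path, "w") as out:
        subprocess.run(fmdump_cmd(log, since), stdout=out, check=True)
```

For example, `collect("error", "1d", "fmdump_eV.txt")` would keep the error-log dump to the last day.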
Updated by Piotr Jasiukajtis over 9 years ago
It is indeed a BUG. I have seen this on a lot of Sun Fire machines.
Updated by Piotr Jasiukajtis about 9 years ago
- File fmdump_eV.txt added
- File fmdump_mV.txt added
Here are fmdump files from SUN FIRE X4270.
Updated by Bryan Horstmann-Allen about 9 years ago
- File fmdump-mV.txt added
I'm also seeing this on one Sun Fire X4270 (though not another, because life needs to be full of wonder.)
-rw-r--r-- 1 root root 405G Mar 23 14:36 /var/fm/fmd/errlog
`fmdump -eV` got to 3GB before I killed it. Here's a sample:
Mar 18 2014 22:00:07.666349575 ereport.io.pciex.rc.ce-msg
nvlist version: 0
ena = 0x79f6faca6ad01401
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@0,0/pci8086,340e@7
(end detector)
class = ereport.io.pciex.rc.ce-msg
rc-status = 0x1
source-id = 0x38
source-valid = 1
__ttl = 0x1
__tod = 0x5328c1e7 0x27b7b007
`fmdump -mV` attached.
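As a cross-check on the sample record, the `__tod` pair can be decoded by hand. The Python sketch below assumes the two hex words are seconds since the Unix epoch and nanoseconds within that second, which is consistent with the timestamp on the record's first line; the function names are made up for illustration, and the result is formatted in UTC, whereas fmdump prints local time.

```python
from datetime import datetime, timezone
from typing import Tuple


def decode_tod(value: str) -> Tuple[int, int]:
    """Split a __tod value like '0x5328c1e7 0x27b7b007' into (seconds, nanoseconds)."""
    sec_hex, nsec_hex = value.split()
    return int(sec_hex, 16), int(nsec_hex, 16)


def format_tod(value: str) -> str:
    """Render a __tod value as a UTC timestamp with nanosecond precision."""
    sec, nsec = decode_tod(value)
    stamp = datetime.fromtimestamp(sec, tz=timezone.utc)
    return f"{stamp:%Y-%m-%d %H:%M:%S}.{nsec:09d}"
```

For the record above, `format_tod("0x5328c1e7 0x27b7b007")` gives `2014-03-18 22:00:07.666349575`, matching the `Mar 18 2014 22:00:07.666349575` header.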
Updated by Adel abula about 6 years ago
Updated by Chad Schmelter about 1 year ago
I also encounter this on an Intel NUC10i7FNH and it appears to be the network card:
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
May 09 14:23:45 291d0373-c78d-c7d4-f7c4-f6f0aa59e181 SUNOS-8000-J0 Major
Host : cn01
Platform : NUC10i7FNH Chassis_id : G6FN13500H2K
Product_sn :
Fault class : defect.sunos.eft.unexpected_telemetry 50%
fault.sunos.eft.unexpected_telemetry 50%
Problem in : dev:////pci@0,0/pci8086,2081@1f,6
faulted and taken out of service
Description : The diagnosis engine encountered telemetry from the listed
devices for which it was unable to perform a diagnosis -
Refer to http://illumos.org/msg/SUNOS-8000-J0 for more
information. Refer to http://illumos.org/msg/SUNOS-8000-J0 for
more information.
Response : Error reports have been logged for examination by your illumos
distribution team.
Impact : Automated diagnosis and response for these events will not occur.
Action : Ensure that the latest illumos Kernel and Predictive Self-Healing
(PSH) updates are installed.
$ prtconf -Dv | grep 'pci8086,2081@1f,6'
dev_path=/pci@0,0/pci8086,2081@1f,6:e1000g0
$ dladm show-phys
LINK MEDIA STATE SPEED DUPLEX DEVICE
e1000g0 Ethernet up 1000 full e1000g0