Bug #315

fmadm telemetry faults

Added by Anil Jangity about 9 years ago. Updated over 2 years ago.

Status: Feedback
Priority: Normal
Assignee: -
Category: -
Start date: 2010-10-07
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:

Description

I installed OpenIndiana on a generic x86 box and saw this error reported by fmadm:

root@vps3:~# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 18 11:31:18 11badbe0-0605-6c0f-94fc-f9b3fa26d1e9  SUNOS-8000-J0  Major

Host        : RS253
Platform    : X8SIL     Chassis_id  : 0123456789
Product_sn  :

Fault class : defect.sunos.eft.unexpected_telemetry 50%
              fault.sunos.eft.unexpected_telemetry 50%
Problem in  : dev:////pci@0,0/pci8086,3b4c@1c,5/pci15d9,605@0
                  faulted and taken out of service

Description : The diagnosis engine encountered telemetry from the listed
              devices for which it was unable to perform a diagnosis -
              Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.
              Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.

Response    : Error reports have been logged for examination by Sun.

Impact      : Automated diagnosis and response for these events will not occur.

Action      : Ensure that the latest Solaris Kernel and Predictive Self-Healing
              (PSH) patches are installed.

Later, we installed OI on a Sun X4170 server. We see the same error there (though the "Problem in" device is different), which leads me to believe this is more likely a bug in fmd:

root@vps5:~# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 30 19:14:06 f7f2f656-222c-c162-b69a-be27807175ae  SUNOS-8000-J0  Major

Host        : vps5
Platform    : SUN-FIRE-X4170-SERVER     Chassis_id  : 1003XF51B8
Product_sn  :

Fault class : fault.sunos.eft.unexpected_telemetry 50%
              defect.sunos.eft.unexpected_telemetry 50%
Problem in  : dev:////pci@0,0
                  faulted and taken out of service

Description : The diagnosis engine encountered telemetry from the listed
              devices for which it was unable to perform a diagnosis -
              Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.
              Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.

Response    : Error reports have been logged for examination by Sun.

Impact      : Automated diagnosis and response for these events will not occur.

Action      : Ensure that the latest Solaris Kernel and Predictive Self-Healing
              (PSH) patches are installed.


Files

fmdump_eV.txt (7.18 KB), Piotr Jasiukajtis, 2014-03-18 09:09 AM
fmdump_mV.txt (6.13 KB), Piotr Jasiukajtis, 2014-03-18 09:09 AM
fmdump-mV.txt (36.7 KB), Bryan Horstmann-Allen, 2014-03-23 02:42 PM
mV.txt (62.5 KB), Adel abula, 2017-04-09 01:19 PM
eV.txt (62.5 KB), Adel abula, 2017-04-09 01:19 PM

History

#1

Updated by Garrett D'Amore about 9 years ago

  • Project changed from site to illumos gate

#2

Updated by Matthew Goheen almost 9 years ago

We have also installed oi_147 on a Sun Fire X4170 and see continuous fm errors (60 MB/day of fm errors/warnings in /var/fm/fmd/errlog) related to the two Fibre Channel controllers installed in the system. Here is a (very) small portion of the output from "fmdump -e":

...
Oct 26 05:38:34.2556 ereport.io.pci.fabric
Oct 26 05:38:34.2556 ereport.io.pci.fabric
Oct 26 05:38:34.2556 ereport.io.pciex.rc.ce-msg
Oct 26 05:38:34.2556 ereport.io.pciex.pl.re
Oct 26 05:38:34.2556 ereport.io.pciex.rc.ce-msg
Oct 26 05:39:16.2110 ereport.io.pci.fabric
Oct 26 05:39:16.2110 ereport.io.pci.fabric
Oct 26 05:39:16.2110 ereport.io.pciex.rc.ce-msg
Oct 26 05:39:16.2110 ereport.io.pciex.pl.re
Oct 26 05:39:16.2110 ereport.io.pciex.rc.ce-msg
Oct 26 05:39:22.2068 ereport.io.pci.fabric
Oct 26 05:39:22.2069 ereport.io.pci.fabric
Oct 26 05:39:22.2068 ereport.io.pciex.rc.ce-msg
Oct 26 05:39:22.2069 ereport.io.pciex.pl.re
Oct 26 05:39:22.2069 ereport.io.pciex.rc.ce-msg
Oct 26 05:39:30.0206 ereport.io.pci.fabric
Oct 26 05:39:30.0207 ereport.io.pci.fabric
Oct 26 05:39:30.0206 ereport.io.pciex.rc.ce-msg
Oct 26 05:39:30.0207 ereport.io.pciex.pl.re
Oct 26 05:39:30.0207 ereport.io.pciex.rc.ce-msg
...
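
To see which ereport classes dominate a listing like the one above, the lines can be tallied by class. This is only a rough sketch: the function name count_ereport_classes is made up for illustration, and it assumes each event line ends with the class name, as in the excerpt.

```python
from collections import Counter


def count_ereport_classes(fmdump_e_output: str) -> Counter:
    """Count events per ereport class in `fmdump -e` style text."""
    counts = Counter()
    for line in fmdump_e_output.splitlines():
        fields = line.split()
        # Assumption: event lines end with the class, e.g. "ereport.io.pci.fabric".
        if fields and fields[-1].startswith("ereport."):
            counts[fields[-1]] += 1
    return counts


# First five lines of the excerpt above.
SAMPLE = """\
Oct 26 05:38:34.2556 ereport.io.pci.fabric
Oct 26 05:38:34.2556 ereport.io.pci.fabric
Oct 26 05:38:34.2556 ereport.io.pciex.rc.ce-msg
Oct 26 05:38:34.2556 ereport.io.pciex.pl.re
Oct 26 05:38:34.2556 ereport.io.pciex.rc.ce-msg
"""

if __name__ == "__main__":
    for cls, n in count_ereport_classes(SAMPLE).most_common():
        print(n, cls)
```

On a real system the text would come from saved "fmdump -e" output rather than the embedded sample.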

"fmdump" reports:

TIME                 UUID                                 SUNW-MSG-ID EVENT
Oct 22 17:21:12.7552 682ee1c3-4dca-c578-899d-8262200ed77b SUNOS-8000-J0 Diagnosed
Oct 22 17:25:09.7113 fcbe397e-7b5c-c9b7-95bd-e1c87ccab845 PCIEX-8000-KP Diagnosed
Oct 22 17:54:16.1998 46d368bf-8525-43bd-d005-c3c929b98d6b PCIEX-8000-KP Diagnosed
Oct 25 11:08:26.3788 ed012fb2-3305-c74d-f3db-cd450e7690f4 SUNOS-8000-J0 Diagnosed
Oct 25 11:26:43.0540 46d368bf-8525-43bd-d005-c3c929b98d6b FMD-8000-58 Updated
Oct 25 11:26:53.7197 6770b1ae-e718-ccf9-d802-b1833f5b92e2 PCIEX-8000-KP Diagnosed
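
One UUID appears twice in the listing above (first Diagnosed, later Updated). Grouping the lines by UUID makes that easier to spot. A minimal sketch, assuming the six-field layout shown above; events_by_case is a hypothetical helper name.

```python
from collections import defaultdict


def events_by_case(fmdump_output: str) -> dict:
    """Map each case UUID to its (message ID, event) pairs, in log order."""
    cases = defaultdict(list)
    for line in fmdump_output.splitlines():
        fields = line.split()
        # Assumed data line layout: Mon DD HH:MM:SS.ffff UUID MSG-ID EVENT
        # The header line has only four fields, so it is skipped here.
        if len(fields) == 6:
            uuid, msg_id, event = fields[3], fields[4], fields[5]
            cases[uuid].append((msg_id, event))
    return dict(cases)


# The listing above, copied verbatim.
SAMPLE = """\
TIME                 UUID                                 SUNW-MSG-ID EVENT
Oct 22 17:21:12.7552 682ee1c3-4dca-c578-899d-8262200ed77b SUNOS-8000-J0 Diagnosed
Oct 22 17:25:09.7113 fcbe397e-7b5c-c9b7-95bd-e1c87ccab845 PCIEX-8000-KP Diagnosed
Oct 22 17:54:16.1998 46d368bf-8525-43bd-d005-c3c929b98d6b PCIEX-8000-KP Diagnosed
Oct 25 11:08:26.3788 ed012fb2-3305-c74d-f3db-cd450e7690f4 SUNOS-8000-J0 Diagnosed
Oct 25 11:26:43.0540 46d368bf-8525-43bd-d005-c3c929b98d6b FMD-8000-58 Updated
Oct 25 11:26:53.7197 6770b1ae-e718-ccf9-d802-b1833f5b92e2 PCIEX-8000-KP Diagnosed
"""

if __name__ == "__main__":
    for uuid, events in events_by_case(SAMPLE).items():
        print(uuid, events)
```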

"fmadm faulty" reports:
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Oct 25 11:26:53 6770b1ae-e718-ccf9-d802-b1833f5b92e2  PCIEX-8000-KP  Major     

Host        : hopi-fs
Platform    : SUN-FIRE-X4170-SERVER     Chassis_id  : 1015XF50CF
Product_sn  : 

Fault class : fault.io.pciex.device-interr-corr 67%
              fault.io.pciex.bus-linkerr-corr 33%
Affects     : dev:////pci@0,0/pci8086,340e@7/pci1077,170@0
                  faulted but still in service
FRU         : "PCIe 1" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=hopi-fs:chassis-id=1015XF50CF/motherboard=0/hostbridge=3/pciexrc=3/pciexbus=19/pciexdev=0)
                  faulty

Description : Too many recovered bus errors have been detected, which indicates
              a problem with the specified bus or with the specified
              transmitting device. This may degrade into an unrecoverable
              fault.
              Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : If a plug-in card is involved check for badly-seated cards or
              bent pins. Otherwise schedule a repair procedure to replace the
              affected device.  Use fmadm faulty to identify the device or
              contact Sun for support.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Oct 22 17:54:16 46d368bf-8525-43bd-d005-c3c929b98d6b  PCIEX-8000-KP  Major     

Host        : hopi-fs
Platform    : SUN-FIRE-X4170-SERVER     Chassis_id  : 1015XF50CF
Product_sn  : 

Fault class : fault.io.pciex.device-interr-corr max 50%
              fault.io.pciex.bus-linkerr-corr 25%
Affects     : dev:////pci@0,0/pci8086,340e@7/pci1077,170@0
                  ok and in service
              dev:////pci@0,0/pci8086,340e@7
                  faulted but still in service
FRU         : "PCIe 1" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=hopi-fs:chassis-id=1015XF50CF/motherboard=0/hostbridge=3/pciexrc=3/pciexbus=19/pciexdev=0) max 50%
                  acquitted
              "MB" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=hopi-fs:chassis-id=1015XF50CF/motherboard=0) 25%
                  faulty

Description : Too many recovered bus errors have been detected, which indicates
              a problem with the specified bus or with the specified
              transmitting device. This may degrade into an unrecoverable
              fault.
              Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : If a plug-in card is involved check for badly-seated cards or
              bent pins. Otherwise schedule a repair procedure to replace the
              affected device.  Use fmadm faulty to identify the device or
              contact Sun for support.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Oct 22 17:21:12 682ee1c3-4dca-c578-899d-8262200ed77b  SUNOS-8000-J0  Major     

Host        : hopi-fs
Platform    : SUN-FIRE-X4170-SERVER     Chassis_id  : 1015XF50CF
Product_sn  : 

Fault class : fault.sunos.eft.unexpected_telemetry 50%
              defect.sunos.eft.unexpected_telemetry 50%
Problem in  : dev:////pci@0,0
                  faulted and taken out of service

Description : The diagnosis engine encountered telemetry from the listed
              devices for which it was unable to perform a diagnosis - 
              Refer to http://sun.com/msg/SUNOS-8000-J0 for more information. 
              Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.

Response    : Error reports have been logged for examination by Sun.

Impact      : Automated diagnosis and response for these events will not occur.

Action      : Ensure that the latest Solaris Kernel and Predictive Self-Healing
              (PSH) patches are installed.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Oct 25 11:08:26 ed012fb2-3305-c74d-f3db-cd450e7690f4  SUNOS-8000-J0  Major     

Host        : hopi-fs
Platform    : SUN-FIRE-X4170-SERVER     Chassis_id  : 1015XF50CF
Product_sn  : 

Fault class : defect.sunos.eft.unexpected_telemetry 50%
              fault.sunos.eft.unexpected_telemetry 50%
Problem in  : dev:////pci@0,0/pci8086,3410@9/pci1077,170@0
                  faulted and taken out of service

Description : The diagnosis engine encountered telemetry from the listed
              devices for which it was unable to perform a diagnosis - 
              Refer to http://sun.com/msg/SUNOS-8000-J0 for more information. 
              Refer to http://sun.com/msg/SUNOS-8000-J0 for more information.

Response    : Error reports have been logged for examination by Sun.

Impact      : Automated diagnosis and response for these events will not occur.

Action      : Ensure that the latest Solaris Kernel and Predictive Self-Healing
              (PSH) patches are installed.

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Oct 22 17:25:09 fcbe397e-7b5c-c9b7-95bd-e1c87ccab845  PCIEX-8000-KP  Major     

Host        : hopi-fs
Platform    : SUN-FIRE-X4170-SERVER     Chassis_id  : 1015XF50CF
Product_sn  : 

Fault class : fault.io.pciex.device-interr-corr max 50%
              fault.io.pciex.bus-linkerr-corr 25%
Affects     : dev:////pci@0,0/pci8086,3410/pci1077,170@0
              dev:////pci@0,0/pci8086,3410@9
                  faulted and taken out of service
FRU         : "PCIe 2" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=hopi-fs:chassis-id=1015XF50CF/motherboard=0/hostbridge=4/pciexrc=4/pciexbus=25/pciexdev=0) max 50%
              "MB" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=hopi-fs:chassis-id=1015XF50CF/motherboard=0) 25%
                  faulty

Description : Too many recovered bus errors have been detected, which indicates
              a problem with the specified bus or with the specified
              transmitting device. This may degrade into an unrecoverable
              fault.
              Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : If a plug-in card is involved check for badly-seated cards or
              bent pins. Otherwise schedule a repair procedure to replace the
              affected device.  Use fmadm faulty to identify the device or
              contact Sun for support.

Note that these (or similar) faults also occurred in NexentaStor 3.0.2 but seemed to go away in 3.0.3. They also occurred in NexentaCore 3.0.1.

When we first encountered this error in NexentaStor 3 we tried reseating the cards, running with only one card, and moving the cards around. I believe we even swapped the cards to another X4170 we had. Nothing helped. These cards happen to be QLogic QLE2560 cards.

The system is currently limping along with one card/slot faulted out.

#3

Updated by Matthew Goheen almost 9 years ago

See also this entry from the Sun X4170 Product Notes:
-----
CR #6858722
Description:
PCIe Errors Reported With QLE2562 Card Attached Using Active Riser
Issue:
If the Sun QLogic QLE2562-based, 8 Gigabit/second PCI-E Dual FC Host
Adapter (SG-XPCIE2FC-QF8-Z) or Sun QLogic QLE2560-based, 8
Gigabit/second PCI-E Single FC Host Adapter (SG-XPCIE1FC-QF8-Z) are placed
in PCIe slot 1, 2, 4, or 5, a stream of correctable PCIe errors will be reported. This
can cause performance issues and operating system stability issues due to the
large number of errors received.
Affected Hardware and Software:
• Sun Fire X4270 and X4275 Servers
• Supplemental Releases 1.0, 1.2, 1.4, 2.0, 2.0.1, 2.1 and 2.3.1
Workaround:
The Sun QLE2562-based or QLE-2560-based cards must be placed in PCIe slot 0
or 3 only.
-----
This seems pertinent! It doesn't mention the X4170, but it certainly seems to affect our systems.

#4

Updated by Udo Grabowski over 6 years ago

  • Difficulty set to Medium
  • Tags set to needs-triage

We see the same problem on a closely related architecture, Sun X4275:

<http://www.listbox.com/member/archive/182180/2013/06/sort/time_rev/page/1/entry/1:6/20130603084047:CDEAC6EE-CC4A-11E2-9060-E754E4E05EAE/>

It has the same BIOS/ILOM software package as the 4170 (3.0.16.15c - r79486).

The 'unexpected telemetry' part is certainly Oracle bug 15820471, which was fixed in Solaris 11.1, but the generic problem with the excessive correctable PCI errors (here on an Adaptec STK RAID INT adapter, aac driver, placed in slot 0) is still unexplained.

#5

Updated by Bayard Bell about 6 years ago

  • Tags deleted (needs-triage)

Could someone who's encountered this bug please attach the following output:

fmdump -mV
fmdump -eV

If output is massive, please slim it down with -t (e.g. 1d).
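
For anyone collecting this, the two commands can be wrapped so the output goes straight to files instead of the terminal. A minimal sketch: only the fmdump flags named above (-mV, -eV, -t) are used, while the helper names and output file names are made up for illustration.

```python
import subprocess
from typing import List, Optional


def fmdump_command(log: str, since: Optional[str] = None) -> List[str]:
    """Build an fmdump argument list.

    log   -- "e" for `fmdump -eV` or "m" for `fmdump -mV`
    since -- optional value passed to -t, such as "1d"
    """
    if log not in ("e", "m"):
        raise ValueError("log must be 'e' or 'm'")
    cmd = ["fmdump", f"-{log}V"]
    if since:
        cmd += ["-t", since]
    return cmd


def capture(log: str, since: Optional[str], out_path: str) -> None:
    """Run fmdump and stream its output to a file (hypothetical file name)."""
    with open(out_path, "wb") as out:
        subprocess.run(fmdump_command(log, since), stdout=out, check=True)


if __name__ == "__main__":
    # Only prints the commands; call capture("e", "1d", "fmdump_eV.txt")
    # on the affected host to actually collect the output.
    print(" ".join(fmdump_command("m", "1d")))
    print(" ".join(fmdump_command("e", "1d")))
```

Writing to a file avoids holding a very large log in memory.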

#6

Updated by Bayard Bell about 6 years ago

  • Status changed from New to Feedback

#7

Updated by Piotr Jasiukajtis about 6 years ago

It is indeed a bug; I have seen this on a lot of Sun Fire machines.

#8

Updated by Piotr Jasiukajtis over 5 years ago

Here are fmdump files from a Sun Fire X4270.

#9

Updated by Bryan Horstmann-Allen over 5 years ago

I'm also seeing this on one Sun Fire X4270 (though not another, because life needs to be full of wonder.)

-rw-r--r-- 1 root root 405G Mar 23 14:36 /var/fm/fmd/errlog

`fmdump -eV` got to 3GB before I killed it. Here's a sample:

Mar 18 2014 22:00:07.666349575 ereport.io.pciex.rc.ce-msg
nvlist version: 0
        ena = 0x79f6faca6ad01401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,340e@7
        (end detector)

        class = ereport.io.pciex.rc.ce-msg
        rc-status = 0x1
        source-id = 0x38
        source-valid = 1
        __ttl = 0x1
        __tod = 0x5328c1e7 0x27b7b007

`fmdump -mV` attached.
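
Two fields in the sample can be decoded by hand. The sketch below rests on two assumptions: that __tod holds seconds and nanoseconds since the Unix epoch, and that source-id is a standard 16-bit PCIe requester ID (bus in bits 15:8, device in bits 7:3, function in bits 2:0). Under those assumptions the results agree with the sample: the time matches the header line when read as UTC, and the device number matches the @7 in the device-path. The function names are illustrative only.

```python
from datetime import datetime, timezone


def decode_tod(tod: str):
    """Split a `__tod` value into (seconds, nanoseconds) integers."""
    sec_hex, nsec_hex = tod.split()
    return int(sec_hex, 16), int(nsec_hex, 16)


def decode_requester_id(source_id: int):
    """Split an assumed 16-bit PCIe requester ID into (bus, device, function)."""
    return (source_id >> 8) & 0xFF, (source_id >> 3) & 0x1F, source_id & 0x7


if __name__ == "__main__":
    sec, nsec = decode_tod("0x5328c1e7 0x27b7b007")
    when = datetime.fromtimestamp(sec, tz=timezone.utc)
    print(when.strftime("%Y-%m-%d %H:%M:%S"), nsec)
    print(decode_requester_id(0x38))
```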

#10

Updated by Adel abula over 2 years ago

Having the same issue in Joyent Triton.

#11

Updated by Adel abula over 2 years ago
