Bug #5873

CPU cores faulted on boot

Added by Nebojsa Djogo over 3 years ago. Updated over 2 years ago.

Status: New
Start date: 2015-04-26
Priority: Immediate
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -
Difficulty: Medium
Tags: needs-triage

Description

This is a new OpenIndiana installation on a new computer with an Intel Core i5 processor.

After rebooting, 3 cores on a 4-core system are in a faulted state:

sudo psrinfo
0       faulted   since 04/26/2015 14:21:59
1       faulted   since 04/26/2015 14:21:59
2       faulted   since 04/26/2015 14:21:59
3       on-line   since 04/26/2015 14:21:55

The cores can be brought online with

psradm -n -F 0 1 2

Result:

sudo psrinfo
0       on-line   since 04/26/2015 14:28:32
1       on-line   since 04/26/2015 14:28:32
2       on-line   since 04/26/2015 14:28:32
3       on-line   since 04/26/2015 14:21:55

This is a serious bug. The system has been tested with various tools to make sure there are no hardware problems.
I also tried two different Linux live CDs and both work and enable all 4 cores without issues.

This needs an immediate workaround, as it is not feasible to run psradm every time the system reboots, and obviously a permanent fix soon.
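
As a stopgap, the manual step could be scripted. The following is only a minimal sketch, assuming the psrinfo output format shown above (CPU id in the first column, state in the second). The function names faulted_cpus and online_command are hypothetical, and the script prints the psradm command from this report instead of running it:

```sh
# faulted_cpus: read psrinfo-style output on stdin and print the ids of
# all CPUs whose state column is "faulted", space-separated on one line.
faulted_cpus() {
    awk '$2 == "faulted" { printf "%s%s", sep, $1; sep = " " }
         END { if (sep != "") print "" }'
}

# online_command: print (do not run) the psradm invocation used in this
# report for every faulted CPU found on stdin. Prints nothing if none.
online_command() {
    ids=$(faulted_cpus)
    if [ -n "$ids" ]; then
        echo "psradm -n -F $ids"
    fi
}
```

On the affected machine this would be used as `psrinfo | online_command`. Note that it only hides the symptom; it does not address why the cores are being faulted.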

OpenIndiana (powered by illumos)    SunOS 5.11    oi_151a8

For completeness, here is cpu_info after the cores have been enabled:

kstat cpu_info
module: cpu_info                        instance: 0
name:   cpu_info0                       class:    misc
        brand                           Intel(r) Core(tm) i5-4590 CPU @ 3.30GHz
        cache_id                        0
        chip_id                         0
        clock_MHz                       3300
        clog_id                         0
        core_id                         0
        cpu_type                        i386
        crtime                          22.272251786
        current_clock_Hz                3301000000
        current_cstate                  2
        family                          6
        fpu_type                        i387 compatible
        implementation                  x86 (chipid 0x0 GenuineIntel 306C3 family 6 model 60 step 3 clock 3300 MHz)
        model                           60
        ncore_per_chip                  4
        ncpu_per_chip                   4
        pg_id                           5
        pkg_core_id                     0
        snaptime                        651.071334540
        socket_type                     Unknown
        state                           on-line
        state_begin                     1430072912
        stepping                        3
        supported_frequencies_Hz        800000000:1000000000:1200000000:1300000000:1500000000:1700000000:1900000000:2000000000:2200000000:2400000000:2600000000:2800000000:2900000000:3100000000:3300000000:3301000000
        supported_max_cstates           2
        vendor_id                       GenuineIntel

module: cpu_info                        instance: 1
name:   cpu_info1                       class:    misc
        brand                           Intel(r) Core(tm) i5-4590 CPU @ 3.30GHz
        cache_id                        0
        chip_id                         0
        clock_MHz                       3300
        clog_id                         2
        core_id                         1
        cpu_type                        i386
        crtime                          24.649869888
        current_clock_Hz                3301000000
        current_cstate                  2
        family                          6
        fpu_type                        i387 compatible
        implementation                  x86 (chipid 0x0 GenuineIntel 306C3 family 6 model 60 step 3 clock 3300 MHz)
        model                           60
        ncore_per_chip                  4
        ncpu_per_chip                   4
        pg_id                           7
        pkg_core_id                     1
        snaptime                        651.072788040
        socket_type                     Unknown
        state                           on-line
        state_begin                     1430072912
        stepping                        3
        supported_frequencies_Hz        800000000:1000000000:1200000000:1300000000:1500000000:1700000000:1900000000:2000000000:2200000000:2400000000:2600000000:2800000000:2900000000:3100000000:3300000000:3301000000
        supported_max_cstates           2
        vendor_id                       GenuineIntel

module: cpu_info                        instance: 2
name:   cpu_info2                       class:    misc
        brand                           Intel(r) Core(tm) i5-4590 CPU @ 3.30GHz
        cache_id                        0
        chip_id                         0
        clock_MHz                       3300
        clog_id                         4
        core_id                         2
        cpu_type                        i386
        crtime                          24.678507276
        current_clock_Hz                3301000000
        current_cstate                  2
        family                          6
        fpu_type                        i387 compatible
        implementation                  x86 (chipid 0x0 GenuineIntel 306C3 family 6 model 60 step 3 clock 3300 MHz)
        model                           60
        ncore_per_chip                  4
        ncpu_per_chip                   4
        pg_id                           9
        pkg_core_id                     2
        snaptime                        651.072832446
        socket_type                     Unknown
        state                           on-line
        state_begin                     1430072912
        stepping                        3
        supported_frequencies_Hz        800000000:1000000000:1200000000:1300000000:1500000000:1700000000:1900000000:2000000000:2200000000:2400000000:2600000000:2800000000:2900000000:3100000000:3300000000:3301000000
        supported_max_cstates           2
        vendor_id                       GenuineIntel

module: cpu_info                        instance: 3
name:   cpu_info3                       class:    misc
        brand                           Intel(r) Core(tm) i5-4590 CPU @ 3.30GHz
        cache_id                        0
        chip_id                         0
        clock_MHz                       3300
        clog_id                         6
        core_id                         3
        cpu_type                        i386
        crtime                          24.701377875
        current_clock_Hz                3301000000
        current_cstate                  0
        family                          6
        fpu_type                        i387 compatible
        implementation                  x86 (chipid 0x0 GenuineIntel 306C3 family 6 model 60 step 3 clock 3300 MHz)
        model                           60
        ncore_per_chip                  4
        ncpu_per_chip                   4
        pg_id                           11
        pkg_core_id                     3
        snaptime                        651.072905323
        socket_type                     Unknown
        state                           on-line
        state_begin                     1430072515
        stepping                        3
        supported_frequencies_Hz        800000000:1000000000:1200000000:1300000000:1500000000:1700000000:1900000000:2000000000:2200000000:2400000000:2600000000:2800000000:2900000000:3100000000:3300000000:3301000000
        supported_max_cstates           2
        vendor_id                       GenuineIntel
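
The full dump above is long; for a quick per-CPU state check, the parseable form of kstat may be easier to read. A minimal sketch, assuming input in the `kstat -p` format (module:instance:name:statistic, a tab, then the value). The function name cpu_states is hypothetical:

```sh
# cpu_states: read `kstat -p`-style lines on stdin and print
# "instance state" for every cpu_info "state" statistic.
cpu_states() {
    awk '{
        n = split($1, key, ":")
        if (n == 4 && key[1] == "cpu_info" && key[4] == "state")
            print key[2], $2
    }'
}
```

On the affected machine this would be used as `kstat -p cpu_info:::state | cpu_states`.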


Related issues

Duplicates illumos gate - Bug #6858: FMD shutting down cores on Xeon Haswell CPU (HSW131 detection) (New, 2016-04-07)

History

#1 Updated by Nebojsa Djogo over 3 years ago

It seems that this was caused by some kind of driver failure. Clearing the error and bringing the cores back online solves the issue.

#2 Updated by Alexander Pyhalov over 3 years ago

Is the issue reproducible on a recent OpenIndiana Hipster?

#3 Updated by Nebojsa Djogo over 3 years ago

Unfortunately I do not have a Hipster installation. This machine needs to be stable.

uname -a
SunOS indiana 5.11 oi_151a8 i86pc i386 i86pc Solaris

I did manage to kind of get it to happen again when I was fiddling with VirtualBox machines setup/start/stop so it might be related to VBox drivers.

#4 Updated by Nebojsa Djogo over 3 years ago

This happened again just now. No additional actions were performed on the server. There are three VirtualBox machines running on it, but the load on the computer never exceeds 0.3, and there is plenty of RAM (16G).

Here are the details

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Apr 29 14:42:46 29936f79-5947-6ebb-9192-d1bce6e42c08  INTEL-8000-1J  Major    

Host        : indiana
Platform    : H97M-D3H    Chassis_id  : To-be-filled-by-O.E.M.
Product_sn  :

Fault class : fault.cpu.intel.internal
Affects     : cpu:///cpuid=1
                  faulted and taken out of service
FRU         : hc://:product-id=H97M-D3H:server-id=indiana:chassis-id=To-be-filled-by-O.E.M./motherboard=0/chip=0
                  faulty

Description : An internal error has been encountered on this cpu.  Refer to
              http://illumos.org/msg/INTEL-8000-1J for more information.

Response    : The system will attempt to offline this cpu to remove it from
              service.

Impact      : Performance of this system may be affected.

Action      : Schedule a repair procedure to replace the affected CPU.  Use
              'fmadm faulty' to identify the module.

Is it possible that this really is a faulty processor, or do you think this is connected to the software?
Is there anything I can try?
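
For logging or scripting, the affected CPU ids can be pulled out of the report text. A minimal sketch, assuming the "Affects" resource has the cpu:///cpuid=N form shown in the output above. The function name affected_cpuids is hypothetical:

```sh
# affected_cpuids: read `fmadm faulty`-style text on stdin and print the
# numeric id from every line containing a "cpu:///cpuid=N" resource,
# one id per line.
affected_cpuids() {
    sed -n 's|.*cpu:///cpuid=\([0-9][0-9]*\).*|\1|p'
}
```

On the affected machine this would be used as `fmadm faulty | affected_cpuids`.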

#5 Updated by Nebojsa Djogo over 3 years ago

More information that should help you pinpoint the issue. Generated using "fmadm"

panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff001ffeab80 addr=81a4 occurred in module "unix" due to an illegal access to a user address
panicstack = unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () | unix:mutex_enter+b () | genunix:closeall+66 () | genunix:proc_exit+43f () | genunix:exit+15 () | genunix:psig+582 () | genunix:post_syscall+4f0 () | unix:brand_sys_syscall+31a () |

Please let me know if there is anything else I can provide you with or help in any other way to debug this.

#6 Updated by Nebojsa Djogo over 3 years ago

Just a quick note that I have updated this system to the latest 151a9. I was running 151a8, so we will see whether this still happens with the update.

#7 Updated by Nebojsa Djogo over 3 years ago

It just happened again overnight with the latest version of everything.

psrinfo
0       on-line   since 04/29/2015 16:37:01
1       on-line   since 04/29/2015 16:37:04
2       faulted   since 04/30/2015 08:18:49
3       on-line   since 04/29/2015 16:37:04
fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Apr 30 08:18:49 c21697e9-8cfa-e427-fa9d-a3e877c15abb  INTEL-8000-1J  Major

Host        : indiana
Platform    : H97M-D3H  Chassis_id  : To-be-filled-by-O.E.M.
Product_sn  :

Fault class : fault.cpu.intel.internal
Affects     : cpu:///cpuid=2
                  faulted and taken out of service
FRU         : hc://:product-id=H97M-D3H:server-id=indiana:chassis-id=To-be-filled-by-O.E.M./motherboard=0/chip=0
                  faulty

Description : An internal error has been encountered on this cpu.  Refer to
              http://illumos.org/msg/INTEL-8000-1J for more information.

Response    : The system will attempt to offline this cpu to remove it from
              service.

Impact      : Performance of this system may be affected.

Action      : Schedule a repair procedure to replace the affected CPU.  Use
              'fmadm faulty' to identify the module.
fmdump -Vp  -u c21697e9-8cfa-e427-fa9d-a3e877c15abb
TIME                           UUID                                 SUNW-MSG-ID
Apr 30 2015 08:18:49.775101000 c21697e9-8cfa-e427-fa9d-a3e877c15abb INTEL-8000-1J

  TIME                 CLASS                                 ENA
  Apr 30 00:14:50.7600 ereport.cpu.intel.internal_parity     0x9ffe70d9f2b04001
  Apr 29 23:08:50.7600 ereport.cpu.intel.internal_parity     0x665e4a9266004001
  Apr 29 01:13:18.1680 ereport.cpu.intel.internal_parity     0xe9af313d87104001
  Apr 30 08:18:48.7600 ereport.cpu.intel.internal_parity     0x4694def806304001

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = c21697e9-8cfa-e427-fa9d-a3e877c15abb
        code = INTEL-8000-1J
        diag-time = 1430396329 760533
        de = fmd:///module/eft
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = fault.cpu.intel.internal
                certainty = 0x64
                resource = hc://:product-id=H97M-D3H:server-id=indiana:chassis-id=To-be-filled-by-O.E.M./motherboard=0/chip=0/core=2/strand=0
                asru = cpu:///cpuid=2
                fru = hc://:product-id=H97M-D3H:server-id=indiana:chassis-id=To-be-filled-by-O.E.M./motherboard=0/chip=0
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x55421da9 0x2e331a48
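
Two details of the fmdump output above can be checked mechanically: how many ereports fed the diagnosis, and whether the hexadecimal seconds field of __tod agrees with diag-time. A minimal sketch; the function names count_ereports and hex_to_dec are hypothetical:

```sh
# count_ereports: count the lines on stdin that mention an ereport class.
count_ereports() {
    awk '/ereport\./ { n++ } END { print n + 0 }'
}

# hex_to_dec: convert a 0x-prefixed hexadecimal value to decimal
# using POSIX shell arithmetic.
hex_to_dec() {
    echo "$(($1))"
}
```

For example, `hex_to_dec 0x55421da9` prints 1430396329, which matches the seconds part of diag-time in the listing above.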

#8 Updated by Nikola M. almost 3 years ago

Please report whether this is still an issue with the newest illumos in Hipster (there is also a live DVD/USB image to try), and whether you have tested it with another motherboard/CPU combination.

There is also a way of keeping OpenIndiana /dev installed alongside a hipster-2015 install, which brings the newest illumos to test when updated. You can install OI 201510 in VirtualBox and then zfs send/receive it to the already installed OI. Make the username chosen at install differ from the one on the on-hardware OI, and zfs send both the BE and the user directory datasets. You can prepare an empty BE with beadm create, beadm mount it at /mnt, rm all data inside the BE, then beadm umount it (and possibly rename it) so that it can receive the new install.

#9 Updated by Nebojsa Djogo over 2 years ago

Unfortunately I had to switch to a Linux system months ago so I cannot test this further. The Linux (Ubuntu) system has been online since day one with zero issues on same hardware. I would have preferred Illumos and maybe next time I upgrade my hardware systems I will look again at this project.

#10 Updated by Victor Vess over 2 years ago

Nebojsa Djogo wrote:

Unfortunately I had to switch to a Linux system months ago so I cannot test this further. The Linux (Ubuntu) system has been online since day one with zero issues on same hardware. I would have preferred Illumos and maybe next time I upgrade my hardware systems I will look again at this project.

Sorry to hear that. FYI, I've seen the same issue on the SmartOS distribution (also illumos). I am currently working around it by disabling the cpumem-retire module in fmadm (which is only acceptable because it's a home lab, not a production server). I've been running like this for over 2 months with no negative impact on any of the guest images.

Is your CPU a Haswell core as well? I'm wondering if it's related to a known issue with Haswell CPUs running virtualization platforms: https://bugs.launchpad.net/qemu/+bug/1307225 (see posts 9 and 12)
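
One way the cpumem-retire workaround mentioned above might be applied is with the fmadm unload subcommand. This is a sketch under that assumption, not necessarily the exact step used here. The helper name unload_cpumem_retire and the RUN variable are hypothetical, and by default the command is only printed:

```sh
# unload_cpumem_retire: print the fmadm call that unloads the
# cpumem-retire agent, or execute it when RUN=1 is set.
# Executing it requires root on a real illumos system.
unload_cpumem_retire() {
    cmd="fmadm unload cpumem-retire"
    if [ "${RUN:-0}" = "1" ]; then
        $cmd
    else
        echo "would run: $cmd"
    fi
}
```

With the retire agent unloaded, a CPU that is genuinely failing would presumably no longer be taken offline automatically, and the change may not survive an fmd restart or reboot.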

#11 Updated by Aurélien Larcher over 2 years ago

  • Duplicates Bug #6858: FMD shutting down cores on Xeon Haswell CPU (HSW131 detection) added
