Bug #5873
CPU cores faulted on boot
Description
This is a new OpenIndiana installation on a new computer with an Intel Core i5 processor.
After rebooting, 3 cores on a 4-core system are in a faulted state:
sudo psrinfo
0       faulted   since 04/26/2015 14:21:59
1       faulted   since 04/26/2015 14:21:59
2       faulted   since 04/26/2015 14:21:59
3       on-line   since 04/26/2015 14:21:55
The cores can be brought online with
psradm -n -F 0 1 2
Result:
sudo psrinfo
0       on-line   since 04/26/2015 14:28:32
1       on-line   since 04/26/2015 14:28:32
2       on-line   since 04/26/2015 14:28:32
3       on-line   since 04/26/2015 14:21:55
This is a serious bug. The system has been tested with various tools to make sure there are no hardware problems.
I also tried two different Linux live CDs and both work and enable all 4 cores without issues.
This needs an immediate workaround, as it is not feasible to run psradm every time the system reboots, and obviously a permanent fix soon.
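As a stopgap, the manual psradm step could be scripted to run after boot. The following is a minimal sketch, not a tested fix: the function names faulted_cpus and online_command are hypothetical, and it assumes psrinfo prints one processor per line in the form "ID STATE since DATE TIME". It only prints the psradm command so that it can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch: find faulted CPUs in psrinfo output and build the psradm call.
# Assumption: one processor per line, "<id> <state> since <date> <time>".

# Print the IDs of all faulted processors read from stdin, space-separated.
faulted_cpus() {
    awk '$2 == "faulted" { ids = ids (ids == "" ? "" : " ") $1 }
         END { print ids }'
}

# Print (do not run) the psradm command that would bring them back online.
# Prints nothing when no processor is faulted.
online_command() {
    ids=$(faulted_cpus)
    if [ -n "$ids" ]; then
        echo "psradm -n -F $ids"
    fi
}

# Usage on a live system:
#   psrinfo | online_command
```

Note that this only hides the symptom: the fault manager may retire the cores again if it keeps receiving error reports.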
OpenIndiana (powered by illumos) SunOS 5.11 oi_151a8
For completeness, here is cpu_info after the cores have been enabled:
kstat cpu_info
module: cpu_info                        instance: 0
name:   cpu_info0                       class:    misc
        brand                           Intel(r) Core(tm) i5-4590 CPU @ 3.30GHz
        cache_id                        0
        chip_id                         0
        clock_MHz                       3300
        clog_id                         0
        core_id                         0
        cpu_type                        i386
        crtime                          22.272251786
        current_clock_Hz                3301000000
        current_cstate                  2
        family                          6
        fpu_type                        i387 compatible
        implementation                  x86 (chipid 0x0 GenuineIntel 306C3 family 6 model 60 step 3 clock 3300 MHz)
        model                           60
        ncore_per_chip                  4
        ncpu_per_chip                   4
        pg_id                           5
        pkg_core_id                     0
        snaptime                        651.071334540
        socket_type                     Unknown
        state                           on-line
        state_begin                     1430072912
        stepping                        3
        supported_frequencies_Hz        800000000:1000000000:1200000000:1300000000:1500000000:1700000000:1900000000:2000000000:2200000000:2400000000:2600000000:2800000000:2900000000:3100000000:3300000000:3301000000
        supported_max_cstates           2
        vendor_id                       GenuineIntel

module: cpu_info                        instance: 1
name:   cpu_info1                       class:    misc
        brand                           Intel(r) Core(tm) i5-4590 CPU @ 3.30GHz
        cache_id                        0
        chip_id                         0
        clock_MHz                       3300
        clog_id                         2
        core_id                         1
        cpu_type                        i386
        crtime                          24.649869888
        current_clock_Hz                3301000000
        current_cstate                  2
        family                          6
        fpu_type                        i387 compatible
        implementation                  x86 (chipid 0x0 GenuineIntel 306C3 family 6 model 60 step 3 clock 3300 MHz)
        model                           60
        ncore_per_chip                  4
        ncpu_per_chip                   4
        pg_id                           7
        pkg_core_id                     1
        snaptime                        651.072788040
        socket_type                     Unknown
        state                           on-line
        state_begin                     1430072912
        stepping                        3
        supported_frequencies_Hz        800000000:1000000000:1200000000:1300000000:1500000000:1700000000:1900000000:2000000000:2200000000:2400000000:2600000000:2800000000:2900000000:3100000000:3300000000:3301000000
        supported_max_cstates           2
        vendor_id                       GenuineIntel

module: cpu_info                        instance: 2
name:   cpu_info2                       class:    misc
        brand                           Intel(r) Core(tm) i5-4590 CPU @ 3.30GHz
        cache_id                        0
        chip_id                         0
        clock_MHz                       3300
        clog_id                         4
        core_id                         2
        cpu_type                        i386
        crtime                          24.678507276
        current_clock_Hz                3301000000
        current_cstate                  2
        family                          6
        fpu_type                        i387 compatible
        implementation                  x86 (chipid 0x0 GenuineIntel 306C3 family 6 model 60 step 3 clock 3300 MHz)
        model                           60
        ncore_per_chip                  4
        ncpu_per_chip                   4
        pg_id                           9
        pkg_core_id                     2
        snaptime                        651.072832446
        socket_type                     Unknown
        state                           on-line
        state_begin                     1430072912
        stepping                        3
        supported_frequencies_Hz        800000000:1000000000:1200000000:1300000000:1500000000:1700000000:1900000000:2000000000:2200000000:2400000000:2600000000:2800000000:2900000000:3100000000:3300000000:3301000000
        supported_max_cstates           2
        vendor_id                       GenuineIntel

module: cpu_info                        instance: 3
name:   cpu_info3                       class:    misc
        brand                           Intel(r) Core(tm) i5-4590 CPU @ 3.30GHz
        cache_id                        0
        chip_id                         0
        clock_MHz                       3300
        clog_id                         6
        core_id                         3
        cpu_type                        i386
        crtime                          24.701377875
        current_clock_Hz                3301000000
        current_cstate                  0
        family                          6
        fpu_type                        i387 compatible
        implementation                  x86 (chipid 0x0 GenuineIntel 306C3 family 6 model 60 step 3 clock 3300 MHz)
        model                           60
        ncore_per_chip                  4
        ncpu_per_chip                   4
        pg_id                           11
        pkg_core_id                     3
        snaptime                        651.072905323
        socket_type                     Unknown
        state                           on-line
        state_begin                     1430072515
        stepping                        3
        supported_frequencies_Hz        800000000:1000000000:1200000000:1300000000:1500000000:1700000000:1900000000:2000000000:2200000000:2400000000:2600000000:2800000000:2900000000:3100000000:3300000000:3301000000
        supported_max_cstates           2
        vendor_id                       GenuineInte
Updated by Nebojsa Djogo almost 8 years ago
It seems that this was caused by some kind of driver failure. Clearing the error and bringing the cores back online solves the issue.
Updated by Alexander Pyhalov almost 8 years ago
Is the issue reproducible on a recent OpenIndiana Hipster?
Updated by Nebojsa Djogo almost 8 years ago
Unfortunately I do not have a Hipster installation. This machine needs to be stable.
uname -a SunOS indiana 5.11 oi_151a8 i86pc i386 i86pc Solaris
I did manage to more or less reproduce it while I was setting up, starting and stopping VirtualBox machines, so it might be related to the VBox drivers.
Updated by Nebojsa Djogo almost 8 years ago
This happened again just now. No additional actions were performed on the server. There are three VirtualBox machines running on it, but the load on the computer never exceeds 0.3, and there is plenty of RAM (16G).
Here are the details:
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Apr 29 14:42:46 29936f79-5947-6ebb-9192-d1bce6e42c08  INTEL-8000-1J  Major

Host        : indiana
Platform    : H97M-D3H
Chassis_id  : To-be-filled-by-O.E.M.
Product_sn  :

Fault class : fault.cpu.intel.internal
Affects     : cpu:///cpuid=1
                  faulted and taken out of service
FRU         : hc://:product-id=H97M-D3H:server-id=indiana:chassis-id=To-be-filled-by-O.E.M./motherboard=0/chip=0
                  faulty

Description : An internal error has been encountered on this cpu. Refer to
              http://illumos.org/msg/INTEL-8000-1J for more information.

Response    : The system will attempt to offline this cpu to remove it from
              service.

Impact      : Performance of this system may be affected.

Action      : Schedule a repair procedure to replace the affected CPU. Use
              'fmadm faulty' to identify the module.
Is it possible that this really is a faulty processor, or do you think this is connected to the software?
Is there anything I can try?
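For reference, the two fields in the fault report that matter when clearing a fault are the EVENT-ID and the "Affects" resource. The sketch below pulls both out of `fmadm faulty` output. It is only an illustration: fault_summary is a hypothetical helper name, and it assumes the layout shown in this report (event row with the UUID in the fourth whitespace-separated field, and an "Affects : <fmri>" line). The extracted resource is the kind of argument that `fmadm repaired` takes, and the UUID is what `fmadm acquit` takes; check the fmadm man page on the installed release before using either.

```shell
#!/bin/sh
# Sketch: summarise `fmadm faulty` output read from stdin.
# Assumed layout: "<Mon> <day> <time> <uuid> <msg-id> <severity>" event rows
# and "Affects : <fmri>" lines, as in the report in this ticket.

# Print "<event-id> <affected-fmri>" for each fault.
fault_summary() {
    awk '
        # Event rows carry a lowercase hex UUID in the fourth field.
        $4 ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/ { uuid = $4 }
        # The Affects line names the resource that was taken out of service.
        $1 == "Affects" { print uuid, $3 }
    '
}

# Usage on a live system:
#   fmadm faulty | fault_summary
```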
Updated by Nebojsa Djogo almost 8 years ago
More information that should help you pinpoint the issue. Generated using "fmadm"
panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff001ffeab80 addr=81a4 occurred in module "unix" due to an illegal access to a user address
panicstack = unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () | unix:mutex_enter+b () | genunix:closeall+66 () | genunix:proc_exit+43f () | genunix:exit+15 () | genunix:psig+582 () | genunix:post_syscall+4f0 () | unix:brand_sys_syscall+31a () |
Please let me know if there is anything else I can provide you with or help in any other way to debug this.
Updated by Nebojsa Djogo almost 8 years ago
Just a quick note that I have updated this system to the latest 151a9. I was running 151a8, so we will see if this still happens with the update.
Updated by Nebojsa Djogo almost 8 years ago
It just happened again overnight with the latest version of everything:
psrinfo
0       on-line   since 04/29/2015 16:37:01
1       on-line   since 04/29/2015 16:37:04
2       faulted   since 04/30/2015 08:18:49
3       on-line   since 04/29/2015 16:37:04
fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Apr 30 08:18:49 c21697e9-8cfa-e427-fa9d-a3e877c15abb  INTEL-8000-1J  Major

Host        : indiana
Platform    : H97M-D3H
Chassis_id  : To-be-filled-by-O.E.M.
Product_sn  :

Fault class : fault.cpu.intel.internal
Affects     : cpu:///cpuid=2
                  faulted and taken out of service
FRU         : hc://:product-id=H97M-D3H:server-id=indiana:chassis-id=To-be-filled-by-O.E.M./motherboard=0/chip=0
                  faulty

Description : An internal error has been encountered on this cpu. Refer to
              http://illumos.org/msg/INTEL-8000-1J for more information.

Response    : The system will attempt to offline this cpu to remove it from
              service.

Impact      : Performance of this system may be affected.

Action      : Schedule a repair procedure to replace the affected CPU. Use
              'fmadm faulty' to identify the module.
fmdump -Vp -u c21697e9-8cfa-e427-fa9d-a3e877c15abb
TIME                           UUID                                 SUNW-MSG-ID
Apr 30 2015 08:18:49.775101000 c21697e9-8cfa-e427-fa9d-a3e877c15abb INTEL-8000-1J

  TIME                 CLASS                                 ENA
  Apr 30 00:14:50.7600 ereport.cpu.intel.internal_parity     0x9ffe70d9f2b04001
  Apr 29 23:08:50.7600 ereport.cpu.intel.internal_parity     0x665e4a9266004001
  Apr 29 01:13:18.1680 ereport.cpu.intel.internal_parity     0xe9af313d87104001
  Apr 30 08:18:48.7600 ereport.cpu.intel.internal_parity     0x4694def806304001

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = c21697e9-8cfa-e427-fa9d-a3e877c15abb
        code = INTEL-8000-1J
        diag-time = 1430396329 760533
        de = fmd:///module/eft
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = fault.cpu.intel.internal
                certainty = 0x64
                resource = hc://:product-id=H97M-D3H:server-id=indiana:chassis-id=To-be-filled-by-O.E.M./motherboard=0/chip=0/core=2/strand=0
                asru = cpu:///cpuid=2
                fru = hc://:product-id=H97M-D3H:server-id=indiana:chassis-id=To-be-filled-by-O.E.M./motherboard=0/chip=0
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x55421da9 0x2e331a48
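The fmdump output above shows that the diagnosis was built from several ereport.cpu.intel.internal_parity events spread over roughly a day. To see at a glance how many error reports of each class fed a diagnosis, the ereport rows can be counted. This is a minimal sketch: ereport_counts is a hypothetical helper name, and it assumes the ereport rows have the form "<Mon> <day> <time> <class> <ENA>", as in the listing above.

```shell
#!/bin/sh
# Sketch: count ereports per class in `fmdump -Vp -u <uuid>` output on stdin.
# Assumed row layout: "<Mon> <day> <time> <class> <ENA>".

# Print "<count> <class>" for every ereport class seen.
ereport_counts() {
    awk '$4 ~ /^ereport\./ { n[$4]++ }
         END { for (c in n) print n[c], c }'
}

# Usage on a live system (UUID taken from `fmadm faulty`):
#   fmdump -Vp -u c21697e9-8cfa-e427-fa9d-a3e877c15abb | ereport_counts
```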
Updated by Nikola M. over 7 years ago
Please report whether this is still an issue with the newest illumos in Hipster (there is also a live DVD/USB to try),
and whether you have tested it with another motherboard/CPU combination.
There is also a way to keep OpenIndiana /dev installed alongside a hipster-2015 install which, when updated, brings the newest illumos to test.
(You can install OI 201510 in VirtualBox and then zfs send/receive it to the already installed OI. Make sure the username chosen during the install differs from the one on the on-hardware OI, and zfs send both the BE and the user directory datasets. You can prepare an empty BE with beadm create, beadm mount it at /mnt, rm all data inside the BE, then beadm umount it, and possibly rename it, to receive the new install.)
Updated by Nebojsa Djogo about 7 years ago
Unfortunately I had to switch to a Linux system months ago so I cannot test this further. The Linux (Ubuntu) system has been online since day one with zero issues on same hardware. I would have preferred Illumos and maybe next time I upgrade my hardware systems I will look again at this project.
Updated by Victor Vess about 7 years ago
Nebojsa Djogo wrote:
Unfortunately I had to switch to a Linux system months ago so I cannot test this further. The Linux (Ubuntu) system has been online since day one with zero issues on same hardware. I would have preferred Illumos and maybe next time I upgrade my hardware systems I will look again at this project.
Sorry to hear that. FYI, I've seen the same issue on the SmartOS distribution (also illumos). I am currently working around it by disabling the cpumem-retire module in fmadm (which is only acceptable because it's a home lab, not a production server). I've been running like this for over 2 months with no negative impact on any of the guest images.
Is your CPU a Haswell core as well? I'm wondering if it's related to a known issue with Haswell CPUs running virtualization platforms: https://bugs.launchpad.net/qemu/+bug/1307225 (see posts 9 and 12)
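The comment does not say exactly how the module was disabled. One way that fmadm offers is `fmadm unload cpumem-retire`, and `fmadm config` lists the loaded modules. The sketch below checks a module's status from that listing. It is an assumption-laden illustration: module_status is a hypothetical helper name, and it assumes `fmadm config` prints "MODULE VERSION STATUS DESCRIPTION" columns. Whether an unload survives an fmd restart or reboot should be verified on the system in question.

```shell
#!/bin/sh
# Sketch: report the status of one fmd module from `fmadm config` output.
# Assumed columns: MODULE VERSION STATUS DESCRIPTION.

# $1: module name; stdin: `fmadm config` output.
# Prints the STATUS column, or "not-loaded" if the module is not listed.
module_status() {
    awk -v m="$1" '$1 == m { print $3; found = 1 }
                   END { if (!found) print "not-loaded" }'
}

# Usage on a live system:
#   fmadm config | module_status cpumem-retire
#   fmadm unload cpumem-retire    # stops CPU retirement; home-lab use only
```

With the retire agent unloaded, faults are still diagnosed and logged but cores are no longer taken offline, so real hardware errors would go unhandled.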
Updated by Aurélien Larcher over 6 years ago
- Is duplicate of Bug #6858: FMD shutting down cores on Xeon Haswell CPU (HSW131 detection) added