Bug #6858


FMD shutting down cores on Xeon Haswell CPU (HSW131 detection)

Added by Victor Vess over 6 years ago. Updated over 6 years ago.

CPU: Xeon E3-1230V3 Haswell
OS version: SunOS smartos 5.11 joyent_20160317T000621Z i86pc i386 i86pc

I have set up SmartOS on a home lab machine with a Haswell Xeon, using it as a hypervisor to run various OS/LX/KVM zones (home/hobby use - I am new to Illumos).

Ever since setting up the machine in 10/2015 and creating some VMs, I have been receiving periodic errors from Fault Management saying that it is shutting down cores for INTEL-8000-1J. The cpumem-retire agent then offlines the core, but clearing the fault seems to resolve the problem. Currently I am mitigating this nuisance by disabling cpumem-retire (since this is a home machine, I can get away with it).

After some research I found other OSes experiencing similar MCE problems related to Intel erratum HSW131:

Running fmdump -epV gives the following report. The IA32_MCi_STATUS register seems to be a perfect match for HSW131:

Apr 06 2016 04:38:37.076445056
nvlist version: 0
        class =
        ena = 0x210c77fc04302401
        detector = hc:///motherboard=0/chip=0/core=1/strand=1
        IA32_MCG_STATUS = 0x0
        machine_check_in_progress = 0
        bank_number = *0x0*
        bank_msr_offset = 0x400
        IA32_MCi_STATUS = *0x90000040000f0005*
        overflow = 0
        error_uncorrected = 0
        error_enabled = 1
        processor_context_corrupt = 0
        error_code = 0x5
        model_specific_error_code = 0xf
        threshold_based_error_status = No tracking
        __ttl = 0x1
        __tod = 0x570492cd 0x48e7580
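
The decoded fields in the report can be cross-checked against the raw IA32_MCi_STATUS value by hand. Below is a minimal sketch in C, assuming the architectural bit layout from the Intel SDM (VAL = bit 63, OVER = bit 62, UC = bit 61, EN = bit 60, PCC = bit 57, model-specific code = bits 31:16, MCA error code = bits 15:0). The struct and function names are hypothetical and are not taken from fmd or the illumos source.

```c
#include <stdint.h>

/* Selected fields of IA32_MCi_STATUS (layout assumed from the Intel SDM). */
struct mci_status_fields {
	int valid;		/* bit 63: register holds a valid error */
	int overflow;		/* bit 62: an earlier error was overwritten */
	int uncorrected;	/* bit 61: error was not corrected */
	int enabled;		/* bit 60: reporting was enabled for this error */
	int pcc;		/* bit 57: processor context corrupt */
	uint16_t mscod;		/* bits 31:16: model-specific error code */
	uint16_t mcacod;	/* bits 15:0: MCA error code */
};

/* Hypothetical helper: split a raw status value into the fields above. */
struct mci_status_fields
decode_mci_status(uint64_t status)
{
	struct mci_status_fields f;

	f.valid = (int)((status >> 63) & 1);
	f.overflow = (int)((status >> 62) & 1);
	f.uncorrected = (int)((status >> 61) & 1);
	f.enabled = (int)((status >> 60) & 1);
	f.pcc = (int)((status >> 57) & 1);
	f.mscod = (uint16_t)((status >> 16) & 0xffff);
	f.mcacod = (uint16_t)(status & 0xffff);
	return (f);
}
```

Decoding 0x90000040000f0005 this way gives overflow = 0, error_uncorrected = 0, error_enabled = 1, processor_context_corrupt = 0, model_specific_error_code = 0xf and error_code = 0x5, consistent with the fmdump output above.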

Is there a way to get FMD to ignore/eat this particular MCE?
This may be a duplicate of depending on the CPU that user had

Related issues

Has duplicate OpenIndiana Distribution - Bug #5873: CPU cores faulted on boot (New, 2015-04-26)

Actions #1

Updated by Robert Mustacchi over 6 years ago

Probably we'll need to add some logic to the part of the system that's passing these events down and have it explicitly ignore MCEs with this specific signature.
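
As a rough illustration of what such a check could look like, here is a minimal standalone sketch in C. It assumes the signature is the one visible in the report above: bank 0, valid bit set, uncorrected bit clear, model-specific error code 0xf, MCA error code 0x5. The function and macro names are hypothetical and this is not existing illumos code. A real fix would presumably also restrict the check to the affected CPU family/model, which is omitted here, and where it would hook into the event path is left open.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Bits compared: VAL (63), UC (61), model-specific code (31:16) and
 * MCA error code (15:0). All other bits, such as EN and the corrected
 * error count, are ignored. Signature assumed from the report above.
 */
#define	HSW131_STATUS_MASK	0xa0000000ffffffffULL
/* VAL = 1, UC = 0, model-specific code = 0xf, MCA error code = 0x5 */
#define	HSW131_STATUS_MATCH	0x80000000000f0005ULL

/*
 * Hypothetical helper: return true if a logged machine-check record
 * looks like the spurious corrected error, so the caller can drop it
 * instead of passing it on for diagnosis.
 */
bool
mce_is_hsw131_spurious(unsigned int bank, uint64_t status)
{
	return (bank == 0 &&
	    (status & HSW131_STATUS_MASK) == HSW131_STATUS_MATCH);
}
```

With the values from the report, mce_is_hsw131_spurious(0, 0x90000040000f0005) is true, while the same status on another bank, or with the uncorrected bit set, is not matched and would still be reported.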

Actions #2

Updated by Victor Vess over 6 years ago

Sure, that's what I figured - I found the FreeBSD commit where someone fixed it there:

I actually pulled down the Illumos source with the naive hope that I'd find a parallel area and fix it myself, but I've been humbled by that exercise ;) I am not a very good C programmer and apparently shouldn't be anywhere near an OS kernel for a while.

I do have the ability to build and test any fixes on my system. I typically get 2-3 of these MCEs per day, so I can let it bake for a couple of weeks to test.

Actions #4

Updated by Aurélien Larcher about 6 years ago

  • Has duplicate Bug #5873: CPU cores faulted on boot added
