FMD shutting down cores on Xeon Haswell CPU (HSW131 detection)
CPU: Xeon E3-1230V3 Haswell
OS version: SunOS smartos 5.11 joyent_20160317T000621Z i86pc i386 i86pc
I have set up SmartOS on a home lab machine with a Haswell Xeon, using it as a hypervisor to run various OS/LX/KVM zones (home/hobby use; I am new to illumos).
Ever since setting up the machine in 10/2015 and creating some VMs, I have been receiving periodic errors from Fault Management reporting that it is shutting down cores for INTEL-8000-1J. The cpumem-retire agent then offlines the core, but clearing the fault seems to resolve the problem. Currently I am mitigating this nuisance by disabling cpumem-retire (since this is a home machine I can get away with it).
After some research I found other OSes experiencing similar MCE problems related to Intel erratum HSW131: http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e3-1200v3-spec-update.pdf
Running fmdump -epV I get the following report; the IA32_MCi_STATUS register seems to be a perfect match for HSW131:
Apr 06 2016 04:38:37.076445056 ereport.cpu.intel.internal_parity
nvlist version: 0
    class = ereport.cpu.intel.internal_parity
    ena = 0x210c77fc04302401
    detector = hc:///motherboard=0/chip=0/core=1/strand=1
    IA32_MCG_STATUS = 0x0
    machine_check_in_progress = 0
    bank_number = *0x0*
    bank_msr_offset = 0x400
    IA32_MCi_STATUS = *0x90000040000f0005*
    overflow = 0
    error_uncorrected = 0
    error_enabled = 1
    processor_context_corrupt = 0
    error_code = 0x5
    model_specific_error_code = 0xf
    threshold_based_error_status = No tracking
    __ttl = 0x1
    __tod = 0x570492cd 0x48e7580
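For anyone wanting to check the decoded fields in the report against the raw register value, here is a minimal sketch in C that splits an IA32_MCi_STATUS value into its architectural fields. The struct and function names (mci_status_t, mci_status_decode) are made up for illustration and are not illumos identifiers; the bit positions are my reading of the architectural layout in the Intel SDM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded view of an IA32_MCi_STATUS value (hypothetical helper type). */
typedef struct mci_status {
	bool valid;           /* bit 63: VAL, register holds a valid error */
	bool overflow;        /* bit 62: OVER */
	bool uncorrected;     /* bit 61: UC */
	bool enabled;         /* bit 60: EN */
	bool context_corrupt; /* bit 57: PCC */
	uint16_t mscod;       /* bits 31:16: model-specific error code */
	uint16_t mcacod;      /* bits 15:0: MCA error code */
} mci_status_t;

/* Split a raw status value into the fields above. */
static void
mci_status_decode(uint64_t status, mci_status_t *out)
{
	assert(out != NULL);

	out->valid = (status >> 63) & 1;
	out->overflow = (status >> 62) & 1;
	out->uncorrected = (status >> 61) & 1;
	out->enabled = (status >> 60) & 1;
	out->context_corrupt = (status >> 57) & 1;
	out->mscod = (uint16_t)((status >> 16) & 0xffff);
	out->mcacod = (uint16_t)(status & 0xffff);
}
```

Decoding 0x90000040000f0005 this way gives valid and enabled set, overflow/uncorrected/PCC clear, a model-specific code of 0xf and an MCA error code of 0x5, which lines up with the fields fmdump printed.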
Is there a way to get FMD to ignore/eat this particular MCE?
This may be a duplicate of https://www.illumos.org/issues/5873 depending on the CPU that user had
Updated by Victor Vess about 5 years ago
Sure, that's what I figured. I found the FreeBSD commit where someone fixed it there: http://svnweb.freebsd.org/base/head/sys/x86/x86/mca.c?r1=269052&r2=269051&pathrev=269052
I actually pulled down the illumos source in the naive hope that I'd find a parallel area and fix it myself, but I've been humbled by that exercise ;) I am not a very good C programmer and apparently shouldn't be anywhere near an OS kernel for a while.
I do have the ability to build and test any fixes on my system. I typically get 2-3 of these MCEs per day, so I can let it bake for a couple of weeks to test.
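To make the idea concrete, here is a rough sketch of the kind of signature test the FreeBSD change linked above applies, as I understand it: bank 0, VAL set, UC clear, and the low 32 bits equal to 0x000f0005. The function name is_hsw131_spurious is hypothetical, the mask and value constants are my recollection of that commit and should be verified against it and against the erratum text, and a real fix would presumably also check the CPU family/model before ignoring anything. This is not illumos code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed constants, modeled on the FreeBSD check:
 * keep VAL (bit 63), UC (bit 61) and the low 32 bits (MSCOD:MCACOD),
 * and ignore everything else (EN, corrected error count, and so on).
 */
#define	HSW131_STATUS_MASK	0xa0000000ffffffffULL
#define	HSW131_STATUS_MATCH	0x80000000000f0005ULL

/*
 * Return true if a logged error looks like the spurious corrected
 * internal parity error described by HSW131. Hypothetical helper;
 * the CPU model check is deliberately left out of this sketch.
 */
static bool
is_hsw131_spurious(unsigned int bank, uint64_t status)
{
	if (bank != 0)
		return (false);

	return ((status & HSW131_STATUS_MASK) == HSW131_STATUS_MATCH);
}
```

With the values from my ereport (bank 0, status 0x90000040000f0005) this returns true; the same status with the UC bit set, or reported from any other bank, does not match.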
Updated by Jason King about 5 years ago
If I had to guess, this might be where it goes: [[http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/cpu/genuineintel/gintel_main.c#gintel_error_action]]