Bug #6858

FMD shutting down cores on Xeon Haswell CPU (HSW131 detection)

Added by Victor Vess over 4 years ago. Updated over 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
2016-04-07
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:

Description

CPU: Xeon E3-1230V3 Haswell
OS version: SunOS smartos 5.11 joyent_20160317T000621Z i86pc i386 i86pc

I have set up SmartOS on a home-lab machine with a Haswell Xeon and am using it as a hypervisor to run various OS/LX/KVM zones (home/hobby use; I am new to Illumos).

Ever since setting up the machine in October 2015 and creating some VMs, I have been receiving periodic errors from Fault Management saying that it was shutting down cores for INTEL-8000-1J. The cpumem-retire agent then offlines the core, but clearing the fault seems to resolve the problem. Currently I am mitigating this nuisance by disabling cpumem-retire (since this is a home machine, I can get away with it).

After some research I found other operating systems experiencing similar MCE problems related to Intel erratum HSW131: http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e3-1200v3-spec-update.pdf

Running fmdump -epV gives the following report; the IA32_MCi_STATUS register seems to be a perfect match for HSW131:

Apr 06 2016 04:38:37.076445056 ereport.cpu.intel.internal_parity
nvlist version: 0
        class = ereport.cpu.intel.internal_parity
        ena = 0x210c77fc04302401
        detector = hc:///motherboard=0/chip=0/core=1/strand=1
        IA32_MCG_STATUS = 0x0
        machine_check_in_progress = 0
        bank_number = 0x0
        bank_msr_offset = 0x400
        IA32_MCi_STATUS = 0x90000040000f0005
        overflow = 0
        error_uncorrected = 0
        error_enabled = 1
        processor_context_corrupt = 0
        error_code = 0x5
        model_specific_error_code = 0xf
        threshold_based_error_status = No tracking
        __ttl = 0x1
        __tod = 0x570492cd 0x48e7580
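
For anyone checking the report by hand, here is a minimal sketch in C of how the raw IA32_MCi_STATUS value breaks down into the fields that fmdump prints. The bit positions follow the architectural layout of IA32_MCi_STATUS (VAL 63, OVER 62, UC 61, EN 60, PCC 57, corrected count 52:38, MSCOD 31:16, MCACOD 15:0). The struct and function names are hypothetical and are not taken from the illumos source.

```c
#include <stdint.h>

/* Hypothetical holder for the decoded fields; not an illumos type. */
struct mc_status_fields {
	int valid;			/* bit 63: register holds a valid error */
	int overflow;			/* bit 62: an earlier error was overwritten */
	int uncorrected;		/* bit 61: error was not corrected */
	int enabled;			/* bit 60: reporting enabled for this error */
	int pcc;			/* bit 57: processor context corrupt */
	unsigned int corrected_count;	/* bits 52:38, only meaningful where
					   corrected-error counting is supported */
	uint16_t mscod;			/* bits 31:16: model-specific error code */
	uint16_t mcacod;		/* bits 15:0: MCA error code */
};

/* Split a raw IA32_MCi_STATUS value into its individual fields. */
static struct mc_status_fields
decode_mc_status(uint64_t status)
{
	struct mc_status_fields f;

	f.valid = (int)((status >> 63) & 1);
	f.overflow = (int)((status >> 62) & 1);
	f.uncorrected = (int)((status >> 61) & 1);
	f.enabled = (int)((status >> 60) & 1);
	f.pcc = (int)((status >> 57) & 1);
	f.corrected_count = (unsigned int)((status >> 38) & 0x7fff);
	f.mscod = (uint16_t)((status >> 16) & 0xffff);
	f.mcacod = (uint16_t)(status & 0xffff);
	return (f);
}
```

Applied to 0x90000040000f0005, this yields valid = 1, overflow = 0, uncorrected = 0, enabled = 1, pcc = 0, mscod = 0xf and mcacod = 0x5, which agrees with the decoded lines in the report above.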

Is there a way to get FMD to ignore/eat this particular MCE?
This may be a duplicate of https://www.illumos.org/issues/5873, depending on which CPU that user had.


Related issues

Has duplicate: OpenIndiana Distribution - Bug #5873: CPU cores faulted on boot (New, 2015-04-26)

#1

Updated by Robert Mustacchi over 4 years ago

Probably we'll need to add some logic to the part of the system that's passing these events down and have it explicitly ignore MCEs with this specific signature.
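
As a rough illustration of what "ignore MCEs with this specific signature" could look like, here is a minimal standalone sketch in C. It is not illumos code: the function name and macros are hypothetical, and the mask is an assumption derived from the report in the description (bank 0, valid, not uncorrected, MSCOD 0x000f, MCACOD 0x0005). A real fix would presumably also check the CPU vendor, family and model before suppressing anything; that check is left out here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural IA32_MCi_STATUS bits used by the match below. */
#define	MC_STATUS_VAL	(1ULL << 63)	/* register contents are valid */
#define	MC_STATUS_UC	(1ULL << 61)	/* error was uncorrected */
#define	MC_STATUS_CODES	0xffffffffULL	/* MSCOD (31:16) and MCACOD (15:0) */

/*
 * Assumed signature: valid, corrected, MSCOD 0x000f, MCACOD 0x0005.
 * All other bits (EN, corrected count, ...) are deliberately ignored.
 */
#define	HSW131_MASK	(MC_STATUS_VAL | MC_STATUS_UC | MC_STATUS_CODES)
#define	HSW131_MATCH	(MC_STATUS_VAL | 0x000f0005ULL)

/*
 * Hypothetical predicate: return true if a machine-check record from the
 * given bank looks like the corrected internal parity error in the report.
 */
static bool
mce_matches_hsw131(unsigned int bank, uint64_t status)
{
	return (bank == 0 && (status & HSW131_MASK) == HSW131_MATCH);
}
```

With the values from the report, mce_matches_hsw131(0, 0x90000040000f0005) returns true. The same status from a different bank, with the uncorrected bit set, or with a different error code does not match, so only the narrow corrected-error case would be dropped.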

#2

Updated by Victor Vess over 4 years ago

Sure, that's what I figured. I found the FreeBSD commit where someone fixed it there: http://svnweb.freebsd.org/base/head/sys/x86/x86/mca.c?r1=269052&r2=269051&pathrev=269052

I actually pulled down the Illumos source with the naive hope that I'd find a parallel area and fix it myself, but I've been humbled by that exercise ;) I am not a very good C programmer and apparently shouldn't be anywhere near an OS kernel for a while.

I do have the ability to build and test any fixes on my system. I typically get 2-3 of these MCEs per day, so I can let it bake for a couple of weeks to test.

#4

Updated by Aurélien Larcher over 4 years ago

  • Has duplicate Bug #5873: CPU cores faulted on boot added
