Bug #6858

FMD shutting down cores on Xeon Haswell CPU (HSW131 detection)

Added by Victor Vess over 6 years ago. Updated over 6 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date: 2016-04-07
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

CPU: Xeon E3-1230V3 Haswell
OS version: SunOS smartos 5.11 joyent_20160317T000621Z i86pc i386 i86pc

I have set up SmartOS on a home lab machine with a Haswell Xeon, using it as a hypervisor to run various OS/LX/KVM zones (home/hobby use; I am new to Illumos).

Ever since setting up the machine in 10/2015 and creating some VMs, I have been receiving periodic errors from Fault Management reporting that it was shutting down cores for INTEL-8000-1J. The cpumem-retire agent then offlined the core, but clearing the fault seemed to resolve the problem. For now I am mitigating this nuisance by disabling cpumem-retire (since this is a home machine I can get away with it).

After some research I found that other operating systems have experienced similar MCE problems related to Intel erratum HSW131: http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e3-1200v3-spec-update.pdf

Running fmdump -epV I get the following report. The bank number and the IA32_MCi_STATUS register value seem to be a perfect match for HSW131:

Apr 06 2016 04:38:37.076445056 ereport.cpu.intel.internal_parity
nvlist version: 0
        class = ereport.cpu.intel.internal_parity
        ena = 0x210c77fc04302401
        detector = hc:///motherboard=0/chip=0/core=1/strand=1
        IA32_MCG_STATUS = 0x0
        machine_check_in_progress = 0
        bank_number = 0x0
        bank_msr_offset = 0x400
        IA32_MCi_STATUS = 0x90000040000f0005
        overflow = 0
        error_uncorrected = 0
        error_enabled = 1
        processor_context_corrupt = 0
        error_code = 0x5
        model_specific_error_code = 0xf
        threshold_based_error_status = No tracking
        __ttl = 0x1
        __tod = 0x570492cd 0x48e7580
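
To make the match easier to check by hand, here is a minimal C sketch that splits an IA32_MCi_STATUS value into the fields the ereport prints. The bit positions are the architectural ones from the Intel SDM machine-check chapter; the struct and function names are made up for illustration and are not taken from the illumos source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the architectural IA32_MCi_STATUS fields. */
struct mci_status_fields {
	unsigned valid;		/* bit 63: VAL, register holds a valid error */
	unsigned overflow;	/* bit 62: OVER */
	unsigned uncorrected;	/* bit 61: UC */
	unsigned enabled;	/* bit 60: EN */
	unsigned pcc;		/* bit 57: PCC, processor context corrupt */
	uint16_t mscod;		/* bits 31:16: model-specific error code */
	uint16_t mcacod;	/* bits 15:0: MCA error code */
};

/* Split a raw status value into its flag bits and error codes. */
void
decode_mci_status(uint64_t status, struct mci_status_fields *out)
{
	assert(out != NULL);

	out->valid = (unsigned)((status >> 63) & 1);
	out->overflow = (unsigned)((status >> 62) & 1);
	out->uncorrected = (unsigned)((status >> 61) & 1);
	out->enabled = (unsigned)((status >> 60) & 1);
	out->pcc = (unsigned)((status >> 57) & 1);
	out->mscod = (uint16_t)((status >> 16) & 0xffff);
	out->mcacod = (uint16_t)(status & 0xffff);
}
```

Feeding it 0x90000040000f0005 gives VAL=1, OVER=0, UC=0, EN=1, PCC=0, MSCOD=0xf and MCACOD=0x5, which lines up with the overflow, error_uncorrected, error_enabled, processor_context_corrupt, model_specific_error_code and error_code values in the report above.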

Is there a way to get FMD to ignore/eat this particular MCE?
This may be a duplicate of https://www.illumos.org/issues/5873, depending on the CPU that user had.


Related issues

Has duplicate OpenIndiana Distribution - Bug #5873: CPU cores faulted on boot (New, 2015-04-26)

Actions #1

Updated by Robert Mustacchi over 6 years ago

Probably we'll need to add some logic to the part of the system that's passing these events down and have it explicitly ignore MCEs with this specific signature.
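
As a rough illustration of what such a check could look like, here is a standalone C sketch of a signature test. It is not illumos code: the record type, macro names and function name are hypothetical, and the signature (bank 0, VAL set, UC clear, MSCOD 0x000f, MCACOD 0x0005) is simply read off the ereport in the description. Restricting it to family 0x6, model 0x3c is an assumption based on the reporter's E3-1230 v3; other affected models, if any, would need to be added.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for this sketch; not an illumos structure. */
struct mce_sample {
	unsigned cpu_family;	/* CPUID display family */
	unsigned cpu_model;	/* CPUID display model */
	unsigned bank;		/* machine-check bank number */
	uint64_t status;	/* raw IA32_MCi_STATUS value */
};

#define	MCI_STATUS_VAL		(1ULL << 63)	/* register valid */
#define	MCI_STATUS_UC		(1ULL << 61)	/* uncorrected error */
#define	MCI_STATUS_CODES	0xffffffffULL	/* MSCOD (31:16) + MCACOD (15:0) */

/* Compare only VAL, UC and the two error codes; ignore counters and EN. */
#define	SPURIOUS_MASK	(MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_CODES)
/* VAL set, UC clear, MSCOD 0x000f, MCACOD 0x0005 (from the ereport). */
#define	SPURIOUS_SIG	(MCI_STATUS_VAL | 0x000f0005ULL)

/* Return true if the sample matches the corrected-parity signature. */
bool
mce_looks_spurious(const struct mce_sample *s)
{
	assert(s != NULL);

	/* Assumption: only family 6, model 0x3c is checked here. */
	if (s->cpu_family != 0x6 || s->cpu_model != 0x3c)
		return (false);
	if (s->bank != 0)
		return (false);
	return ((s->status & SPURIOUS_MASK) == SPURIOUS_SIG);
}
```

Masking out everything except VAL, UC and the error codes means the corrected-error count in the upper bits does not affect the match, while an uncorrected error with the same codes is still reported.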

Actions #2

Updated by Victor Vess over 6 years ago

Sure, that's what I figured. I found the FreeBSD commit where someone fixed it there: http://svnweb.freebsd.org/base/head/sys/x86/x86/mca.c?r1=269052&r2=269051&pathrev=269052

I actually pulled down the Illumos source with the naive hope that I'd find a parallel area and fix it myself, but I've been humbled by that exercise ;) I am not a very good C programmer and apparently shouldn't be anywhere near an OS kernel for a while.

I do have the ability to build and test any fixes on my system. I typically get 2-3 of these MCEs per day, so I can let a fix bake for a couple of weeks to test it.

Actions #4

Updated by Aurélien Larcher about 6 years ago

  • Has duplicate Bug #5873: CPU cores faulted on boot added