Bug #12920

Newer releases falsely report CPU layer 2 cache is faulty on E4-4600 platform

Added by Jason Matthews 5 months ago. Updated 3 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

Some time after SmartOS 20190619 and definitely before 20200603, the fault manager began reporting the L2 cache on all CPU packages as faulty and disabling the affected cores, which reduces system performance. Rolling back to an older release prevents the cores from being disabled.

The system starts disabling cores shortly after boot and slowly disables cores across all physical CPU packages. One bad package would be plausible, but four going bad at the same time seems unlikely.

I first noticed this back in May, when I installed a May release of SmartOS on a Dell R820.
I rolled the first system back to 20190605 on 5/15/2020 and the errors did not recur.
I deployed 20200603 on another system on 6/30/2020, and errors started shortly after boot. I rolled that system back to 20190605, and 18 hours later there are no errors.

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jul 01 06:29:54 f8ec8b53-ec0c-c3e2-89a4-80e0bfecab44 INTEL-8000-MQ Major

Host : dba009
Platform : PowerEdge-R820 Chassis_id : DFC5CZ1
Product_sn :

Fault class : fault.cpu.intel.l2dcache
Affects : cpu:///cpuid=48
faulted and taken out of service
FRU : hc://:product-id=PowerEdge-R820:server-id=dba009:chassis-id=DFC5CZ1/motherboard=0/chip=0
faulty

Description : A level 2 Data Cache on this cpu is faulty. Refer to
http://illumos.org/msg/INTEL-8000-MQ for more information.

Response : The system will attempt to offline this cpu to remove it from
service.

Impact : Performance of this system may be affected.

Action : Schedule a repair procedure to replace the affected CPU. Use
'fmadm faulty' to identify the module.

root@dba009:/root# smbios |grep -i xeon
Version: Intel(R) Xeon(R) CPU E5-4650L 0 2.60GHz
Family: 179 (Intel Xeon)
Version: Intel(R) Xeon(R) CPU E5-4650L 0 2.60GHz
Family: 179 (Intel Xeon)
Version: Intel(R) Xeon(R) CPU E5-4650L 0 2.60GHz
Family: 179 (Intel Xeon)
Version: Intel(R) Xeon(R) CPU E5-4650L 0 2.60GHz
Family: 179 (Intel Xeon)

System Configuration: Dell Inc. PowerEdge R820
BIOS Configuration: Dell Inc. 2.0.20 01/16/2014

Let me know what other information I can provide to be helpful.

#1

Updated by Jason Matthews 5 months ago

The title should read E5-4600

#2

Updated by Dan McDonald 4 months ago

Do you have fmdump(1M) output you can put here as well? Thanks.

#3

Updated by Jason Matthews 4 months ago

root@dba012:/root# fmdump -vu fbf336c0-1a07-ecae-a33a-e8aec4c3d5f9
TIME UUID SUNW-MSG-ID EVENT
Jul 07 15:05:57.4475 fbf336c0-1a07-ecae-a33a-e8aec4c3d5f9 INTEL-8000-MQ Diagnosed
100% fault.cpu.intel.l2dcache

Problem in: hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=1/core=7/strand=1
Affects: cpu:///cpuid=61
FRU: hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=1
Location: -

root@dba012:/root# fmdump -vvu 71c515fa-6311-46da-ba37-fbc340d39a41
TIME UUID SUNW-MSG-ID EVENT
Jun 30 23:22:40.2070 71c515fa-6311-46da-ba37-fbc340d39a41 INTEL-8000-MQ Diagnosed
100% fault.cpu.intel.l2dcache

Problem in: hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=3/core=4/strand=1
Affects: cpu:///cpuid=51
FRU: hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=3
Location: -

root@dba012:/root# fmdump -vvu b62b2c92-15a3-e957-a64b-ab8373be2a88
TIME UUID SUNW-MSG-ID EVENT
Jul 14 14:54:12.2088 b62b2c92-15a3-e957-a64b-ab8373be2a88 INTEL-8000-MQ Diagnosed
100% fault.cpu.intel.l2dcache

Problem in: hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=0/core=7/strand=0
Affects: cpu:///cpuid=28
FRU: hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=0
Location: -

Jul 01 01:03:45.3388 5ecc2c55-96f2-6f62-c8e0-b55c4f29b49f INTEL-8000-MQ Diagnosed
Jul 01 02:46:00.4630 08f72756-25eb-6fb4-ad4d-b7c0ddb0c583 INTEL-8000-MQ Diagnosed
Jul 01 04:00:17.2533 d839ce77-6081-43da-cb95-b828899b84ff INTEL-8000-MQ Diagnosed
Jul 01 04:00:36.5058 bc0581d3-f0ab-cec4-ae6b-e33964dad1d1 INTEL-8000-MQ Diagnosed
Jul 01 05:15:32.9833 397bea28-3985-679e-c1d2-f5eeed2bb491 INTEL-8000-MQ Diagnosed
Jul 01 06:04:03.0343 ad99a9a5-f5a3-6803-f1ea-a9ccddda2edb INTEL-8000-MQ Diagnosed
Jul 01 06:14:33.1024 43b9f4cd-92d0-cd10-ed8e-9e4e77b7f230 INTEL-8000-MQ Diagnosed
Jul 01 09:36:08.2571 b85e2032-0991-e6c1-b047-b1dc5cb0ea8f INTEL-8000-MQ Diagnosed
Jul 01 11:09:38.3874 f775de2e-cc33-63fe-d500-fc3b1a54479f INTEL-8000-MQ Diagnosed
Jul 01 13:11:58.5179 4ca34b98-9a4b-ee7e-be8e-cdc3a7771c31 INTEL-8000-MQ Diagnosed
Jul 01 14:09:58.5525 dee2beb0-0e98-679a-903b-c10382a50df9 INTEL-8000-MQ Diagnosed
Jul 01 15:01:18.6370 bb2e0b84-9a96-4873-8abe-ec73cef8c9f1 INTEL-8000-MQ Diagnosed
Jul 01 16:21:23.6977 d121099b-e9a9-43cf-f6bc-d3709ae4c033 INTEL-8000-MQ Diagnosed
Jul 01 18:01:43.8074 d9cc213c-2cf5-400f-946c-cbc13cc870b0 INTEL-8000-MQ Diagnosed
Jul 01 18:51:08.8508 1844f97a-dc97-4328-d90a-afcc177bdd61 INTEL-8000-MQ Diagnosed
Jul 01 22:04:59.0965 2aee2e31-9a2a-46b0-850c-b54cd83c3ca5 INTEL-8000-MQ Diagnosed
Jul 01 22:44:34.1284 a0190e48-71bd-ed58-fd4b-e6e495f41af9 INTEL-8000-MQ Diagnosed
Jul 01 23:08:19.1239 4a946ec4-bcca-c729-f616-f52a04f2bd17 INTEL-8000-MQ Diagnosed
Jul 02 00:07:49.2371 7735a894-d7fd-679e-fb83-8de0a60cd16a INTEL-8000-MQ Diagnosed
Jul 02 01:00:34.2672 dd460be4-286c-e580-d755-aae7609c7185 INTEL-8000-MQ Diagnosed
Jul 02 01:54:54.3521 5caddf79-62f6-6c9c-b7b1-8aae9019368a INTEL-8000-MQ Diagnosed
Jul 02 02:40:34.3851 9d92a883-e05b-4f6b-a7d4-cacca73d50bd INTEL-8000-MQ Diagnosed
Jul 02 03:08:04.3907 dfea69f3-38dd-6756-a571-a85eb3fb8c81 INTEL-8000-MQ Diagnosed
Jul 02 03:08:34.4134 a7d0258e-0e39-6626-8f50-8d6fefde6d43 INTEL-8000-MQ Diagnosed
Jul 02 05:17:39.5271 9a3e2fa2-ca66-4fe6-ad49-bd1b5741d8d0 INTEL-8000-MQ Diagnosed
Jul 02 10:36:39.8413 ae845a9b-9a80-e496-edff-f8b4ad69f00a INTEL-8000-MQ Diagnosed
Jul 02 11:28:54.9113 3fe56414-40f7-c7bf-f396-8e0a5a1a8dd9 INTEL-8000-MQ Diagnosed
Jul 02 11:48:24.9456 1bd45384-455e-618a-d738-b592bee8a5f3 INTEL-8000-MQ Diagnosed
Jul 02 12:22:54.9752 f8d9c37c-746e-40cd-940e-9e30801b48a2 INTEL-8000-MQ Diagnosed
Jul 02 12:29:59.9755 5eb9c101-9514-c4ac-a5ef-bb84de5e4c89 INTEL-8000-MQ Diagnosed
Jul 02 14:45:15.1335 7d559ee8-7035-e5f0-84e9-b2bb35e325b6 INTEL-8000-MQ Diagnosed
Jul 02 15:26:40.1977 845f68da-77a2-ed52-e85f-9604fea54cc3 INTEL-8000-MQ Diagnosed
Jul 02 15:47:00.1931 0032b9e5-901d-6071-9928-a85abd5bce1e INTEL-8000-MQ Diagnosed
Jul 02 21:18:15.5985 7e65a09a-aafb-ef41-dd17-c508d0584f60 INTEL-8000-MQ Diagnosed
Jul 02 21:24:00.6070 99d003fc-ab21-6d2e-864b-bdceffc4c7e2 INTEL-8000-MQ Diagnosed
Jul 02 21:34:55.6046 115147e0-d711-e49a-f7ad-8d2145af0b9e INTEL-8000-MQ Diagnosed
Jul 02 23:40:15.7208 f493473f-d325-4fc9-92fa-f717bb3bbdd9 INTEL-8000-MQ Diagnosed
Jul 03 05:21:51.0997 2f3ccd8c-8efb-664b-bcff-97fa64a9b42c INTEL-8000-MQ Diagnosed
Jul 03 07:53:11.2649 3eb750b9-afab-c69a-d061-e55a7566a37c INTEL-8000-MQ Diagnosed
Jul 03 09:52:06.3539 71cbb324-ccc4-e930-b6b2-9ccd0c53b6a4 INTEL-8000-MQ Diagnosed
Jul 03 14:21:21.6162 f5e82751-e547-e781-d447-fc0f48ac5308 INTEL-8000-MQ Diagnosed
Jul 03 16:36:56.7603 07a9dfb5-d6a6-c7dc-815e-dcbeb8a9f8a6 INTEL-8000-MQ Diagnosed
Jul 03 19:48:06.9840 8257e11a-995f-45a7-e91a-830234aa64c6 INTEL-8000-MQ Diagnosed
Jul 03 21:01:02.0382 57e8fc19-65a8-cbef-9601-9fe25d27207d INTEL-8000-MQ Diagnosed
Jul 04 06:36:07.6367 420e49dc-b052-cb80-cdb8-d851496135b2 INTEL-8000-MQ Diagnosed
Jul 04 07:20:57.7012 f8350fc5-e09e-c534-8fdb-b437a265fbab INTEL-8000-MQ Diagnosed
Jul 04 14:56:48.1557 31803dba-149c-414e-b542-ba885664e8e8 INTEL-8000-MQ Diagnosed
Jul 04 21:04:18.5088 c9510d1e-be5c-c840-a741-b7779bef6deb INTEL-8000-MQ Diagnosed
Jul 04 22:47:53.5982 eb511650-8850-c45e-8bfa-d7e74cbbd164 INTEL-8000-MQ Diagnosed
Jul 05 11:52:59.3594 e0f642f5-3afa-ce7e-9e30-9105ac59fa79 INTEL-8000-MQ Diagnosed
Jul 05 12:28:59.4172 e44786bd-95f8-c4b4-f7fd-98fa2516961c INTEL-8000-MQ Diagnosed
Jul 06 22:35:26.4389 f9052b32-62c0-68c5-91b2-c7d592831e0d INTEL-8000-MQ Diagnosed
Jul 07 15:05:57.4475 fbf336c0-1a07-ecae-a33a-e8aec4c3d5f9 INTEL-8000-MQ Diagnosed
Jul 07 19:05:42.6555 b0bf73f5-893f-4aaa-8ca9-e15db991c0e3 INTEL-8000-MQ Diagnosed
Jul 10 05:17:21.0211 b603d0fe-c968-6ca5-ff77-aafd83dfb74e INTEL-8000-MQ Diagnosed
Jul 14 14:54:12.2088 b62b2c92-15a3-e957-a64b-ab8373be2a88 INTEL-8000-MQ Diagnosed


root@dba012:/root# psrinfo |grep faulted |wc -l
57
root@dba012:/root# psrinfo |grep on-line |wc -l
7
root@dba012:/root# psrinfo |grep on-line
9 on-line since 06/30/2020 23:12:14
12 on-line since 06/30/2020 23:12:14
15 on-line since 06/30/2020 23:12:14
24 on-line since 06/30/2020 23:12:14
26 on-line since 06/30/2020 23:12:14
29 on-line since 06/30/2020 23:12:14
58 on-line since 06/30/2020 23:12:15
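
One way to check whether the diagnoses cluster on a single package is to pull the chip, core and strand numbers out of the "Problem in:" FMRIs. This is only a minimal Python sketch, not an existing tool: the helper names `parse_hc_path` and `faults_per_chip` are made up for illustration, and the input is just the three FMRIs quoted above.

```python
from collections import Counter

# The three "Problem in:" FMRIs quoted in this comment.
PROBLEMS = [
    "hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=1/core=7/strand=1",
    "hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=3/core=4/strand=1",
    "hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=0/core=7/strand=0",
]


def parse_hc_path(fmri):
    """Return the /name=instance components of an hc:// FMRI as a dict of ints."""
    # Splitting on the first three slashes leaves the authority section
    # (product-id, server-id, chassis-id) at index 2 and the path at index 3.
    path = fmri.split("/", 3)[3]
    components = {}
    for part in path.split("/"):
        name, _, instance = part.partition("=")
        components[name] = int(instance)
    return components


def faults_per_chip(fmris):
    """Count how many of the given FMRIs point at each chip (hypothetical helper)."""
    return Counter(parse_hc_path(fmri)["chip"] for fmri in fmris)


for chip, count in sorted(faults_per_chip(PROBLEMS).items()):
    print(f"chip {chip}: {count} diagnosed fault(s)")
```

For the three diagnoses shown here the faults land on chips 0, 1 and 3, so they are spread over several packages and not concentrated on one.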

#4

Updated by Joshua M. Clulow 4 months ago

  • Project changed from site to illumos gate

#5

Updated by Jason King 4 months ago

The only remotely related change in that time frame I can see is

commit a47ab03e261661b7326ab0b642649034886be632
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   Fri Apr 3 08:56:49 2020 +0000

    12467 Add support for AMD PPIN
    12468 Remove generic_cpu -Wno-parentheses gag
    Reviewed by: C Fraire <cfraire@me.com>
    Reviewed by: Yuri Pankov <ypankov@fastmail.com>
    Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

only because it's touching the code that generates CPU fault events, but there is nothing there that appears like it would be relevant to the issue, so it seems less likely to be involved.

The fmdump -evv output might also be useful -- the error log can be different from the fault log (the error log should be the raw events, the fault log is what the system thinks is causing the problem after trying to correlate the events from the error log).

#6

Updated by Jason Matthews 3 months ago

This is from another system, but it is the same issue.

Jul 02 03:08:33.3748 ereport.cpu.intel.l2dcache_uc 0xb80d985dd162e401
Jul 02 05:17:38.4863 ereport.cpu.intel.l2dcache_uc 0x28c1f614e551a401
Jul 02 10:36:38.7996 ereport.cpu.intel.l2dcache_uc 0x3f4821fef2f0e401
Jul 02 11:28:53.8674 ereport.cpu.intel.l2dcache_uc 0x6ce6fcd5a9924001
Jul 02 11:48:23.9013 ereport.cpu.intel.l2dcache_uc 0x7deda66cbe516401
Jul 02 12:22:53.9297 ereport.cpu.intel.l2dcache_uc 0x9c0cfe38d1310001
Jul 02 12:29:58.9289 ereport.cpu.intel.l2dcache_uc 0xa23c37206580c401
Jul 02 14:45:14.0857 ereport.cpu.intel.l2dcache_uc 0x1853162d19d24401
Jul 02 15:26:39.1487 ereport.cpu.intel.l2dcache_uc 0x3c7c87afbd806401
Jul 02 15:46:59.1428 ereport.cpu.intel.l2dcache_uc 0x4e3d4ba9a4328001
Jul 02 21:18:14.5471 ereport.cpu.intel.l2dcache_uc 0x6f75db1b51c12001
Jul 02 21:23:59.5543 ereport.cpu.intel.l2dcache_uc 0x747b16d2d9904001
Jul 02 21:34:54.5511 ereport.cpu.intel.l2dcache_uc 0x7e031b4c2883e401
Jul 02 23:40:14.6635 ereport.cpu.intel.l2dcache_uc 0xeb714c0417230401
Jul 03 05:21:50.0431 ereport.cpu.intel.l2dcache_uc 0x15af69f788400401
Jul 03 07:53:10.2072 ereport.cpu.intel.l2dcache_uc 0x99d129e481b02001
Jul 03 09:52:05.2948 ereport.cpu.intel.l2dcache_uc 0x01a50bf63a022001
Jul 03 14:21:20.5556 ereport.cpu.intel.l2dcache_uc 0xecbb31c954528401
Jul 03 16:36:55.6981 ereport.cpu.intel.l2dcache_uc 0x631c84380c616001
Jul 03 19:48:05.9210 ereport.cpu.intel.l2dcache_uc 0x0a05cce099b30001
Jul 03 21:01:00.9742 ereport.cpu.intel.l2dcache_uc 0x49afe65dc7100001
Jul 04 06:36:06.5713 ereport.cpu.intel.l2dcache_uc 0x3fcd619b0f508401
Jul 04 07:20:56.6348 ereport.cpu.intel.l2dcache_uc 0x66f27d997322a401
Jul 04 14:56:47.0881 ereport.cpu.intel.l2dcache_uc 0xf4f1615a1972a001
Jul 04 21:04:17.4398 ereport.cpu.intel.l2dcache_uc 0x35d020fa07d08001
Jul 04 22:47:52.5283 ereport.cpu.intel.l2dcache_uc 0x9040cbfa17736401
Jul 05 11:52:58.2881 ereport.cpu.intel.l2dcache_uc 0x3db8d89c21b20401
Jul 05 12:28:58.3447 ereport.cpu.intel.l2dcache_uc 0x5d2790cc32902401
Jul 06 22:35:25.3647 ereport.cpu.intel.l2dcache_uc 0x57ea20298f40a001
Jul 07 15:05:56.3715 ereport.cpu.intel.l2dcache_uc 0xb8bc9eb68e21e401
Jul 07 19:05:41.5782 ereport.cpu.intel.l2dcache_uc 0x8a10e67190810401
Jul 10 05:17:19.9424 ereport.cpu.intel.l2dcache_uc 0x729fa149d0912401
Jul 14 14:54:11.1291 ereport.cpu.intel.l2dcache_uc 0x0f57cf009fb0e001
Jul 23 08:08:18.2132 ereport.cpu.intel.l2dcache_uc 0xe05e84563c11e001
Aug 02 03:33:32.2419 ereport.cpu.intel.l2dcache_uc 0x0d23b9e592906001

#7

Updated by Jason King 3 months ago

Is that fmdump -evv output from an event? If not, can you supply that?

#8

Updated by Jason Matthews 3 months ago

meaning these events?

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jul 06 22:35:26 f9052b32-62c0-68c5-91b2-c7d592831e0d INTEL-8000-MQ Major

Host : dba012
Platform : PowerEdge-R820 Chassis_id : DFC8CZ1
Product_sn :

Fault class : fault.cpu.intel.l2dcache
Affects : cpu:///cpuid=20
faulted but still in service
FRU : hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=0
faulty

Description : A level 2 Data Cache on this cpu is faulty. Refer to
http://illumos.org/msg/INTEL-8000-MQ for more information.

Response : The system will attempt to offline this cpu to remove it from
service.

Impact : Performance of this system may be affected.

Action : Schedule a repair procedure to replace the affected CPU. Use
'fmadm faulty' to identify the module.

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jul 07 15:05:57 fbf336c0-1a07-ecae-a33a-e8aec4c3d5f9 INTEL-8000-MQ Major

Host : dba012
Platform : PowerEdge-R820 Chassis_id : DFC8CZ1
Product_sn :

Fault class : fault.cpu.intel.l2dcache
Affects : cpu:///cpuid=61
faulted but still in service
FRU : hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=1
faulty

Description : A level 2 Data Cache on this cpu is faulty. Refer to
http://illumos.org/msg/INTEL-8000-MQ for more information.

Response : The system will attempt to offline this cpu to remove it from
service.

Impact : Performance of this system may be affected.

Action : Schedule a repair procedure to replace the affected CPU. Use
'fmadm faulty' to identify the module.

#9

Updated by Jason King 3 months ago

Actually try fmdump -eV -- I found some HW errata for that CPU that might be relevant, and that should contain enough details to determine if that's what's being encountered (that should output the raw event data).

#10

Updated by Jason Matthews 3 months ago

TIME                           CLASS
Mar 07 2018 05:02:42.679413745 ereport.cpu.intel.quickpath.mem_redundant
nvlist version: 0
        class = ereport.cpu.intel.quickpath.mem_redundant
        ena = 0xa6d85ca0c590a401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = motherboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = chip
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = memory-controller
                        hc-id = 0
                (end hc-list[2])

        (end detector)

        compound_errorname = MC_CH3_SCRUB_ERR
        IA32_MCG_STATUS = 0x0
        machine_check_in_progress = 0
        bank_number = 0xb
        bank_msr_offset = 0x42c
        IA32_MCi_STATUS = 0x8c000042000800c3
        overflow = 0
        error_uncorrected = 0
        error_enabled = 0
        processor_context_corrupt = 0
        error_code = 0xc3
        model_specific_error_code = 0x8
        threshold_based_error_status = No tracking
        IA32_MCi_ADDR = 0xf630567c0
        IA32_MCi_MISC = 0x90848000800048c
        ECC-syndrome = 0x9084800
        physaddr = 0xf630567c0
        resource = (array of embedded nvlists)
        (start resource[0])
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = motherboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = chip
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = memory-controller
                        hc-id = 0
                (end hc-list[2])
                (start hc-list[3])
                nvlist version: 0
                        hc-name = dram-channel
                        hc-id = 0
                (end hc-list[3])
                (start hc-list[4])
                nvlist version: 0
                        hc-name = dimm
                        hc-id = 0
                (end hc-list[4])

                hc-specific = (embedded nvlist)
                nvlist version: 0
                        offset = 0xffffffffffffffff
                (end hc-specific)

        (end resource[0])

        __ttl = 0x1
        __tod = 0x5a9f7272 0x287f07f1
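
The `__tod` pair at the end of the record appears to hold the event time as hexadecimal seconds and nanoseconds since the Unix epoch. The second word matches the fractional part printed in the header line, which supports that reading. Here is a minimal Python sketch for decoding it. The function name `decode_tod` is made up for illustration, and treating the first word as UTC seconds is an assumption.

```python
from datetime import datetime, timezone


def decode_tod(tod):
    """Turn an ereport "__tod" pair (hex seconds, hex nanoseconds) into a UTC time."""
    sec_hex, nsec_hex = tod.split()
    seconds = int(sec_hex, 16)      # int() accepts the "0x" prefix with base 16
    nanoseconds = int(nsec_hex, 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanoseconds


when, nsec = decode_tod("0x5a9f7272 0x287f07f1")
print(when.isoformat(), nsec)  # 2018-03-07T05:02:42+00:00 679413745
```

The decoded time agrees with the "Mar 07 2018 05:02:42.679413745" header, so this particular record dates from 2018 and not from the July 2020 diagnoses listed earlier in the thread.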

#11

Updated by Jason King 3 months ago

If you do 'fmdump -e' are all of the events 'ereport.cpu.intel.quickpath.mem_redundant' or are there different values?

#12

Updated by Jason Matthews 3 months ago

Sorry if I clipped something useful...

root@dba012:/root# fmdump -e | sort -k 5 | cut -d" " -f 5 |uniq -c
76
1772 ereport.cpu.intel.quickpath.mem_addr_parity
1772 ereport.cpu.intel.quickpath.mem_ce
100 ereport.cpu.intel.quickpath.mem_redundant
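
The 76 entries with no class name are possibly lines where the class is not the fifth space-separated field, since the timestamp format varies between the records pasted in this thread. Here is a minimal Python sketch that counts classes by looking for the token that starts with `ereport.`, wherever it sits on the line. The function name `count_ereport_classes` is made up, and the sample lines are copied from earlier comments, not from a fresh `fmdump -e` run.

```python
from collections import Counter


def count_ereport_classes(lines):
    """Count ereport classes, wherever the class token sits on the line."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            if token.startswith("ereport."):
                counts[token] += 1
                break  # at most one class per line
    return counts


# Sample input: a header line plus records quoted elsewhere in this thread.
sample = [
    "TIME                           CLASS",
    "Mar 07 2018 05:02:42.679413745 ereport.cpu.intel.quickpath.mem_redundant",
    "Jul 02 03:08:33.3748 ereport.cpu.intel.l2dcache_uc 0xb80d985dd162e401",
    "Jul 02 05:17:38.4863 ereport.cpu.intel.l2dcache_uc 0x28c1f614e551a401",
]

for name, count in count_ereport_classes(sample).most_common():
    print(count, name)
```

Saving the `fmdump -e` output to a file and passing its lines to the function would give a per-class count that does not depend on column position.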

#13

Updated by Jason Matthews 3 months ago

Mar 07 2018 05:08:24.819772196 ereport.cpu.intel.quickpath.mem_addr_parity
nvlist version: 0
        class = ereport.cpu.intel.quickpath.mem_addr_parity
        ena = 0xabd2f00848c02401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = motherboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = chip
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = memory-controller
                        hc-id = 0
                (end hc-list[2])

        (end detector)

        compound_errorname = MC_CH3_RD_ERR
        IA32_MCG_STATUS = 0x0
        machine_check_in_progress = 0
        bank_number = 0xb
        bank_msr_offset = 0x42c
        IA32_MCi_STATUS = 0x8800004200800093
        overflow = 0
        error_uncorrected = 0
        error_enabled = 0
        processor_context_corrupt = 0
        error_code = 0x93
        model_specific_error_code = 0x80
        threshold_based_error_status = No tracking
        IA32_MCi_MISC = 0x490848000800048c
        ECC-syndrome = 0x49084800
        resource = (array of embedded nvlists)
        (start resource[0])
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = motherboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = chip
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = memory-controller
                        hc-id = 0
                (end hc-list[2])
                (start hc-list[3])
                nvlist version: 0
                        hc-name = dram-channel
                        hc-id = 0
                (end hc-list[3])
                (start hc-list[4])
                nvlist version: 0
                        hc-name = dimm
                        hc-id = 0
                (end hc-list[4])

                hc-specific = (embedded nvlist)
                nvlist version: 0
                        offset = 0xffffffffffffffff
                (end hc-specific)

        (end resource[0])

        __ttl = 0x1
        __tod = 0x5a9f73c8 0x30dcbb24

Mar 08 2018 16:07:38.004013860 ereport.cpu.intel.quickpath.mem_ce
nvlist version: 0
        class = ereport.cpu.intel.quickpath.mem_ce
        ena = 0xd4b071565200e001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = motherboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = chip
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = memory-controller
                        hc-id = 0
                (end hc-list[2])

        (end detector)

        compound_errorname = MC_CH3_RD_ERR
        IA32_MCG_STATUS = 0x0
        machine_check_in_progress = 0
        bank_number = 0x5
        bank_msr_offset = 0x414
        IA32_MCi_STATUS = 0x8c00004000010093
        overflow = 0
        error_uncorrected = 0
        error_enabled = 0
        processor_context_corrupt = 0
        error_code = 0x93
        model_specific_error_code = 0x1
        threshold_based_error_status = No tracking
        IA32_MCi_ADDR = 0xf630567c0
        IA32_MCi_MISC = 0x20262686
        ECC-syndrome = 0x0
        physaddr = 0xf630567c0
        resource = (array of embedded nvlists)
        (start resource[0])
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = motherboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = chip
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = memory-controller
                        hc-id = 0
                (end hc-list[2])
                (start hc-list[3])
                nvlist version: 0
                        hc-name = dram-channel
                        hc-id = 1
                (end hc-list[3])
                (start hc-list[4])
                nvlist version: 0
                        hc-name = dimm
                        hc-id = 2
                (end hc-list[4])

                hc-specific = (embedded nvlist)
                nvlist version: 0
                        offset = 0xffffffffffffffff
                (end hc-specific)

        (end resource[0])

        __ttl = 0x1
        __tod = 0x5aa15fca 0x3d3f24

Mar 08 2018 16:07:38.004012753 ereport.cpu.intel.quickpath.mem_addr_parity
nvlist version: 0
        class = ereport.cpu.intel.quickpath.mem_addr_parity
        ena = 0xd4b071565200e001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = motherboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = chip
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = memory-controller
                        hc-id = 0
                (end hc-list[2])

        (end detector)

        compound_errorname = MC_CH3_RD_ERR
        IA32_MCG_STATUS = 0x0
        machine_check_in_progress = 0
        bank_number = 0xb
        bank_msr_offset = 0x42c
        IA32_MCi_STATUS = 0x8800004200800093
        overflow = 0
        error_uncorrected = 0
        error_enabled = 0
        processor_context_corrupt = 0
        error_code = 0x93
        model_specific_error_code = 0x80
        threshold_based_error_status = No tracking
        IA32_MCi_MISC = 0x490848000800048c
        ECC-syndrome = 0x49084800
        resource = (array of embedded nvlists)
        (start resource[0])
        nvlist version: 0
                version = 0x0
                scheme = hc
                hc-list = (array of embedded nvlists)
                (start hc-list[0])
                nvlist version: 0
                        hc-name = motherboard
                        hc-id = 0
                (end hc-list[0])
                (start hc-list[1])
                nvlist version: 0
                        hc-name = chip
                        hc-id = 0
                (end hc-list[1])
                (start hc-list[2])
                nvlist version: 0
                        hc-name = memory-controller
                        hc-id = 0
                (end hc-list[2])
                (start hc-list[3])
                nvlist version: 0
                        hc-name = dram-channel
                        hc-id = 0
                (end hc-list[3])
                (start hc-list[4])
                nvlist version: 0
                        hc-name = dimm
                        hc-id = 0
                (end hc-list[4])

                hc-specific = (embedded nvlist)
                nvlist version: 0
                        offset = 0xffffffffffffffff
                (end hc-specific)

        (end resource[0])

        __ttl = 0x1
        __tod = 0x5aa15fca 0x3d3ad1

#14

Updated by Jason King 3 months ago

I'll need to dig a bit more to confirm, but these would appear to be correctable error events... which leads to the question of why the correlation engine seems to think it's an L2 cache problem...
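
The "correctable" reading can be cross-checked against the raw IA32_MCi_STATUS values quoted in the earlier comments. Here is a minimal Python sketch, assuming the architectural bit layout from the Intel SDM (VAL at bit 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, the model-specific code in bits 31:16 and the MCA error code in bits 15:0). The function name is made up. The keys `valid`, `misc_valid` and `addr_valid` are my own labels, while the other keys follow the names in the fmdump output.

```python
def decode_mci_status(status):
    """Decode the architectural fields of an IA32_MCi_STATUS value."""
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),
        "overflow": bit(62),
        "error_uncorrected": bit(61),
        "error_enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "processor_context_corrupt": bit(57),
        "model_specific_error_code": (status >> 16) & 0xFFFF,
        "error_code": status & 0xFFFF,
    }


# Status words copied from the ereports pasted above.
for raw in (0x8C000042000800C3, 0x8800004200800093, 0x8C00004000010093):
    fields = decode_mci_status(raw)
    print(hex(raw),
          "UC" if fields["error_uncorrected"] else "not UC",
          hex(fields["error_code"]),
          hex(fields["model_specific_error_code"]))
```

All three words decode with the UC bit clear, which agrees with the `error_uncorrected = 0` lines in the pasted records. The second word also has the address-valid bit clear, which matches the missing IA32_MCi_ADDR field in those records.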

#15

Updated by Amos Hayes 3 months ago

I have what seems to me to be a very similar problem, where SmartOS/illumos is telling me that the cache on core 0 of my AMD Ryzen 9 3900X CPU is faulty.

I just thought I would mention it in case it is related or there is some way I can help you track it down. I finally have some time to get back to troubleshooting this. Thread is here:

https://smartos.topicbox.com/groups/smartos-discuss/T48e23fef4bc4b8c7/a-cache-on-this-cpu-is-faulty

#16

Updated by Jason Matthews 3 months ago

Okay, here is an update. I have a dev build, joyent_20200824T192056Z, running on an E5-4600. I forgot about this bug and loaded the system with a newer image by mistake. This PI doesn't disable CPU cores. It has three days of run time; in the past it would have faulted out most of the CPUs by now.

#17

Updated by Amos Hayes 3 months ago

Just a heads-up that my issue appears to have been a result of my motherboard automatically setting the frequency too high for my RAM when four DIMMs were installed. AMD stipulates that a single pair of DIMMs can run at listed speeds but with multiple pairs, the speeds need to be lowered. After lowering, my CPU 0 cache fault stopped reappearing on boot.
