Bug #12920

Newer releases falsely report CPU layer 2 cache is faulty on E4-4600 platform

Added by Jason Matthews about 1 month ago. Updated 12 days ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

Some time after SmartOS 20190619 and definitely before 20200603, the fault manager began reporting the L2 cache on all CPU packages as faulty. It disables the affected CPUs, which reduces system performance. Rolling back to an older release prevents the cores from being disabled.

The system starts disabling cores shortly after boot and slowly disables cores across all physical CPU packages. One bad package would be plausible, but four going bad at the same time seems unlikely.

I first noticed this back in May, when I installed a May release of SmartOS on a Dell R820.
I rolled the first system back to 20190605 on 5/15/2020 and the errors did not recur.
I deployed 20200603 on another system on 6/30/2020, and errors started shortly after boot. I rolled that system back to 20190605, and 18 hours later there are no errors.

--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jul 01 06:29:54 f8ec8b53-ec0c-c3e2-89a4-80e0bfecab44 INTEL-8000-MQ Major

Host : dba009
Platform : PowerEdge-R820 Chassis_id : DFC5CZ1
Product_sn :

Fault class : fault.cpu.intel.l2dcache
Affects : cpu:///cpuid=48
faulted and taken out of service
FRU : hc://:product-id=PowerEdge-R820:server-id=dba009:chassis-id=DFC5CZ1/motherboard=0/chip=0
faulty

Description : A level 2 Data Cache on this cpu is faulty. Refer to
http://illumos.org/msg/INTEL-8000-MQ for more information.

Response : The system will attempt to offline this cpu to remove it from
service.

Impact : Performance of this system may be affected.

Action : Schedule a repair procedure to replace the affected CPU. Use
'fmadm faulty' to identify the module.

root@dba009:/root# smbios |grep -i xeon
Version: Intel(R) Xeon(R) CPU E5-4650L 0 2.60GHz
Family: 179 (Intel Xeon)
Version: Intel(R) Xeon(R) CPU E5-4650L 0 2.60GHz
Family: 179 (Intel Xeon)
Version: Intel(R) Xeon(R) CPU E5-4650L 0 2.60GHz
Family: 179 (Intel Xeon)
Version: Intel(R) Xeon(R) CPU E5-4650L 0 2.60GHz
Family: 179 (Intel Xeon)

System Configuration: Dell Inc. PowerEdge R820
BIOS Configuration: Dell Inc. 2.0.20 01/16/2014

Let me know what other information I can provide to be helpful.

History

#1

Updated by Jason Matthews about 1 month ago

The title should read E5-4600

#2

Updated by Dan McDonald 12 days ago

Do you have fmdump(1M) output you can put here as well? Thanks.

#3

Updated by Jason Matthews 12 days ago

root@dba012:/root# fmdump -vu fbf336c0-1a07-ecae-a33a-e8aec4c3d5f9
TIME UUID SUNW-MSG-ID EVENT
Jul 07 15:05:57.4475 fbf336c0-1a07-ecae-a33a-e8aec4c3d5f9 INTEL-8000-MQ Diagnosed
100% fault.cpu.intel.l2dcache

Problem in: hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=1/core=7/strand=1
Affects: cpu:///cpuid=61
FRU: hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=1
Location: -

root@dba012:/root# fmdump -vvu 71c515fa-6311-46da-ba37-fbc340d39a41
TIME UUID SUNW-MSG-ID EVENT
Jun 30 23:22:40.2070 71c515fa-6311-46da-ba37-fbc340d39a41 INTEL-8000-MQ Diagnosed
100% fault.cpu.intel.l2dcache

Problem in: hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=3/core=4/strand=1
Affects: cpu:///cpuid=51
FRU: hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=3
Location: -

root@dba012:/root# fmdump -vvu b62b2c92-15a3-e957-a64b-ab8373be2a88
TIME UUID SUNW-MSG-ID EVENT
Jul 14 14:54:12.2088 b62b2c92-15a3-e957-a64b-ab8373be2a88 INTEL-8000-MQ Diagnosed
100% fault.cpu.intel.l2dcache

Problem in: hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=0/core=7/strand=0
Affects: cpu:///cpuid=28
FRU: hc://:product-id=PowerEdge-R820:server-id=dba012:chassis-id=DFC8CZ1/motherboard=0/chip=0
Location: -

Jul 01 01:03:45.3388 5ecc2c55-96f2-6f62-c8e0-b55c4f29b49f INTEL-8000-MQ Diagnosed
Jul 01 02:46:00.4630 08f72756-25eb-6fb4-ad4d-b7c0ddb0c583 INTEL-8000-MQ Diagnosed
Jul 01 04:00:17.2533 d839ce77-6081-43da-cb95-b828899b84ff INTEL-8000-MQ Diagnosed
Jul 01 04:00:36.5058 bc0581d3-f0ab-cec4-ae6b-e33964dad1d1 INTEL-8000-MQ Diagnosed
Jul 01 05:15:32.9833 397bea28-3985-679e-c1d2-f5eeed2bb491 INTEL-8000-MQ Diagnosed
Jul 01 06:04:03.0343 ad99a9a5-f5a3-6803-f1ea-a9ccddda2edb INTEL-8000-MQ Diagnosed
Jul 01 06:14:33.1024 43b9f4cd-92d0-cd10-ed8e-9e4e77b7f230 INTEL-8000-MQ Diagnosed
Jul 01 09:36:08.2571 b85e2032-0991-e6c1-b047-b1dc5cb0ea8f INTEL-8000-MQ Diagnosed
Jul 01 11:09:38.3874 f775de2e-cc33-63fe-d500-fc3b1a54479f INTEL-8000-MQ Diagnosed
Jul 01 13:11:58.5179 4ca34b98-9a4b-ee7e-be8e-cdc3a7771c31 INTEL-8000-MQ Diagnosed
Jul 01 14:09:58.5525 dee2beb0-0e98-679a-903b-c10382a50df9 INTEL-8000-MQ Diagnosed
Jul 01 15:01:18.6370 bb2e0b84-9a96-4873-8abe-ec73cef8c9f1 INTEL-8000-MQ Diagnosed
Jul 01 16:21:23.6977 d121099b-e9a9-43cf-f6bc-d3709ae4c033 INTEL-8000-MQ Diagnosed
Jul 01 18:01:43.8074 d9cc213c-2cf5-400f-946c-cbc13cc870b0 INTEL-8000-MQ Diagnosed
Jul 01 18:51:08.8508 1844f97a-dc97-4328-d90a-afcc177bdd61 INTEL-8000-MQ Diagnosed
Jul 01 22:04:59.0965 2aee2e31-9a2a-46b0-850c-b54cd83c3ca5 INTEL-8000-MQ Diagnosed
Jul 01 22:44:34.1284 a0190e48-71bd-ed58-fd4b-e6e495f41af9 INTEL-8000-MQ Diagnosed
Jul 01 23:08:19.1239 4a946ec4-bcca-c729-f616-f52a04f2bd17 INTEL-8000-MQ Diagnosed
Jul 02 00:07:49.2371 7735a894-d7fd-679e-fb83-8de0a60cd16a INTEL-8000-MQ Diagnosed
Jul 02 01:00:34.2672 dd460be4-286c-e580-d755-aae7609c7185 INTEL-8000-MQ Diagnosed
Jul 02 01:54:54.3521 5caddf79-62f6-6c9c-b7b1-8aae9019368a INTEL-8000-MQ Diagnosed
Jul 02 02:40:34.3851 9d92a883-e05b-4f6b-a7d4-cacca73d50bd INTEL-8000-MQ Diagnosed
Jul 02 03:08:04.3907 dfea69f3-38dd-6756-a571-a85eb3fb8c81 INTEL-8000-MQ Diagnosed
Jul 02 03:08:34.4134 a7d0258e-0e39-6626-8f50-8d6fefde6d43 INTEL-8000-MQ Diagnosed
Jul 02 05:17:39.5271 9a3e2fa2-ca66-4fe6-ad49-bd1b5741d8d0 INTEL-8000-MQ Diagnosed
Jul 02 10:36:39.8413 ae845a9b-9a80-e496-edff-f8b4ad69f00a INTEL-8000-MQ Diagnosed
Jul 02 11:28:54.9113 3fe56414-40f7-c7bf-f396-8e0a5a1a8dd9 INTEL-8000-MQ Diagnosed
Jul 02 11:48:24.9456 1bd45384-455e-618a-d738-b592bee8a5f3 INTEL-8000-MQ Diagnosed
Jul 02 12:22:54.9752 f8d9c37c-746e-40cd-940e-9e30801b48a2 INTEL-8000-MQ Diagnosed
Jul 02 12:29:59.9755 5eb9c101-9514-c4ac-a5ef-bb84de5e4c89 INTEL-8000-MQ Diagnosed
Jul 02 14:45:15.1335 7d559ee8-7035-e5f0-84e9-b2bb35e325b6 INTEL-8000-MQ Diagnosed
Jul 02 15:26:40.1977 845f68da-77a2-ed52-e85f-9604fea54cc3 INTEL-8000-MQ Diagnosed
Jul 02 15:47:00.1931 0032b9e5-901d-6071-9928-a85abd5bce1e INTEL-8000-MQ Diagnosed
Jul 02 21:18:15.5985 7e65a09a-aafb-ef41-dd17-c508d0584f60 INTEL-8000-MQ Diagnosed
Jul 02 21:24:00.6070 99d003fc-ab21-6d2e-864b-bdceffc4c7e2 INTEL-8000-MQ Diagnosed
Jul 02 21:34:55.6046 115147e0-d711-e49a-f7ad-8d2145af0b9e INTEL-8000-MQ Diagnosed
Jul 02 23:40:15.7208 f493473f-d325-4fc9-92fa-f717bb3bbdd9 INTEL-8000-MQ Diagnosed
Jul 03 05:21:51.0997 2f3ccd8c-8efb-664b-bcff-97fa64a9b42c INTEL-8000-MQ Diagnosed
Jul 03 07:53:11.2649 3eb750b9-afab-c69a-d061-e55a7566a37c INTEL-8000-MQ Diagnosed
Jul 03 09:52:06.3539 71cbb324-ccc4-e930-b6b2-9ccd0c53b6a4 INTEL-8000-MQ Diagnosed
Jul 03 14:21:21.6162 f5e82751-e547-e781-d447-fc0f48ac5308 INTEL-8000-MQ Diagnosed
Jul 03 16:36:56.7603 07a9dfb5-d6a6-c7dc-815e-dcbeb8a9f8a6 INTEL-8000-MQ Diagnosed
Jul 03 19:48:06.9840 8257e11a-995f-45a7-e91a-830234aa64c6 INTEL-8000-MQ Diagnosed
Jul 03 21:01:02.0382 57e8fc19-65a8-cbef-9601-9fe25d27207d INTEL-8000-MQ Diagnosed
Jul 04 06:36:07.6367 420e49dc-b052-cb80-cdb8-d851496135b2 INTEL-8000-MQ Diagnosed
Jul 04 07:20:57.7012 f8350fc5-e09e-c534-8fdb-b437a265fbab INTEL-8000-MQ Diagnosed
Jul 04 14:56:48.1557 31803dba-149c-414e-b542-ba885664e8e8 INTEL-8000-MQ Diagnosed
Jul 04 21:04:18.5088 c9510d1e-be5c-c840-a741-b7779bef6deb INTEL-8000-MQ Diagnosed
Jul 04 22:47:53.5982 eb511650-8850-c45e-8bfa-d7e74cbbd164 INTEL-8000-MQ Diagnosed
Jul 05 11:52:59.3594 e0f642f5-3afa-ce7e-9e30-9105ac59fa79 INTEL-8000-MQ Diagnosed
Jul 05 12:28:59.4172 e44786bd-95f8-c4b4-f7fd-98fa2516961c INTEL-8000-MQ Diagnosed
Jul 06 22:35:26.4389 f9052b32-62c0-68c5-91b2-c7d592831e0d INTEL-8000-MQ Diagnosed
Jul 07 15:05:57.4475 fbf336c0-1a07-ecae-a33a-e8aec4c3d5f9 INTEL-8000-MQ Diagnosed
Jul 07 19:05:42.6555 b0bf73f5-893f-4aaa-8ca9-e15db991c0e3 INTEL-8000-MQ Diagnosed
Jul 10 05:17:21.0211 b603d0fe-c968-6ca5-ff77-aafd83dfb74e INTEL-8000-MQ Diagnosed
Jul 14 14:54:12.2088 b62b2c92-15a3-e957-a64b-ab8373be2a88 INTEL-8000-MQ Diagnosed


root@dba012:/root# psrinfo |grep faulted |wc -l
57
root@dba012:/root# psrinfo |grep on-line |wc -l
7
root@dba012:/root# psrinfo |grep on-line
9 on-line since 06/30/2020 23:12:14
12 on-line since 06/30/2020 23:12:14
15 on-line since 06/30/2020 23:12:14
24 on-line since 06/30/2020 23:12:14
26 on-line since 06/30/2020 23:12:14
29 on-line since 06/30/2020 23:12:14
58 on-line since 06/30/2020 23:12:15
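
To check whether the diagnoses are spread across all packages rather than concentrated on one, the `Problem in:` lines from `fmdump -v` can be tallied per chip. This is only a minimal sketch: it assumes the output format shown above and an awk that provides the POSIX `match()` function, and the function name `tally_chips` is made up for illustration.

```shell
# Count diagnosed faults per chip from `fmdump -v` style text on stdin.
# tally_chips is a hypothetical helper, not part of the fault manager.
tally_chips() {
    awk '/Problem in:/ && match($0, /chip=[0-9]+/) {
             # RSTART and RLENGTH delimit the "chip=N" token found by match()
             count[substr($0, RSTART, RLENGTH)]++
         }
         END { for (c in count) print c, count[c] }' | sort
}

# Example use: fmdump -v | tally_chips
```

Each output line is a chip identifier followed by the number of diagnoses that name it, so an even spread across chips would be visible at a glance.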

#4

Updated by Joshua M. Clulow 12 days ago

  • Project changed from site to illumos gate

#5

Updated by Jason King 12 days ago

The only remotely related change in that time frame I can see is

commit a47ab03e261661b7326ab0b642649034886be632
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   Fri Apr 3 08:56:49 2020 +0000

    12467 Add support for AMD PPIN
    12468 Remove generic_cpu -Wno-parentheses gag
    Reviewed by: C Fraire <cfraire@me.com>
    Reviewed by: Yuri Pankov <ypankov@fastmail.com>
    Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

and that is only because it touches the code that generates CPU fault events. Nothing in it appears relevant to this issue, so it seems unlikely to be involved.

The fmdump -evv output might also be useful -- the error log can be different from the fault log (the error log should be the raw events, the fault log is what the system thinks is causing the problem after trying to correlate the events from the error log).
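
One way to compare the two logs is to summarize the raw error log by event class and set that against the diagnosed faults. The sketch below assumes that plain `fmdump -e` prints one event per line with the class as the last field, after a `TIME CLASS` header line; the helper name `tally_ereports` is hypothetical.

```shell
# Summarize raw error events by class from `fmdump -e` style text on stdin.
# tally_ereports is a hypothetical helper. It assumes the class is the last
# whitespace-separated field and skips the header and any blank lines.
tally_ereports() {
    awk '$1 != "TIME" && NF > 0 { count[$NF]++ }
         END { for (c in count) print count[c], c }' | sort -rn
}

# Example use: fmdump -e | tally_ereports
```

The most frequent classes come first, which should show quickly whether a single kind of error report is driving the diagnoses.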
