Bug #13177

dma memory allocation failures in mpt driver cause vdevs to be erroneously faulted by fmd

Added by Jason Matthews about 1 month ago.

Status: New
Priority: Normal
Assignee: -
Category: kernel
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

I am not sure whether this is a kernel or a driver problem, but I had to pick a category. It seems to me that if the kernel can't find 5 GB of RAM for a driver while the ZFS ARC is holding 70 GB, that is a kernel problem.
Should I consider limiting the size of the ARC, or is the problem rooted somewhere else?

System Configuration: Supermicro Super Server
BIOS Configuration: American Megatrends Inc. 2.0a 08/08/2019
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)

==== Processor Sockets ====================================

Version Location Tag
-------------------------------- --------------------------
AMD EPYC 7401P 24-Core Processor CPU

==== Memory Device Sockets ================================

Type Status Set Device Locator Bank Locator
----------- ------ --- ------------------- ----------------
DDR4 in use 0 DIMMA1 P0_Node0_Channel0_Dimm0
DDR4 in use 0 DIMMB1 P0_Node0_Channel1_Dimm0
DDR4 in use 0 DIMMC1 P0_Node0_Channel2_Dimm0
unknown empty 0 DIMMD1 P0_Node0_Channel3_Dimm0
unknown empty 0 DIMME1 P0_Node0_Channel4_Dimm0
unknown empty 0 DIMMF1 P0_Node0_Channel5_Dimm0
unknown empty 0 DIMMG1 P0_Node0_Channel6_Dimm0
unknown empty 0 DIMMH1 P0_Node0_Channel7_Dimm0

==== On-Board Devices =====================================

==== Upgradeable Slots ====================================

ID Status Type Description
--- --------- ---------------- ----------------------------
1 in use PCI Exp. Gen 3 x8 CPU SLOT1 PCI-E 3.0 X8, Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (mpt_sas)
2 available PCI Exp. Gen 3 x16 CPU SLOT2 PCI-E 3.0 X16
3 in use PCI Exp. Gen 3 x8 CPU SLOT3 PCI-E 3.0 X8, Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (mpt_sas)
4 available PCI Exp. Gen 3 x16 CPU SLOT4 PCI-E 3.0 X16
5 in use PCI Exp. Gen 3 x8 CPU SLOT5 PCI-E 3.0 X8, Chelsio Communications Inc T580-SO-CR Unified Wire Ethernet Controller (<unknown>)
6 available PCI Exp. Gen 3 x16 CPU SLOT6 PCI-E 3.0 X16
7 available PCI Exp. Gen 3 x4 M.2 PCIE X4

root@backup002:/var/adm# echo ::memstat | mdb -k
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 5763597 22514 23%
Boot pages 75331 294 0%
ZFS File Data 18221977 71179 73%
Anon 60588 236 0%
Exec and libs 1226 4 0%
Page cache 5790 22 0%
Free (cachelist) 18724 73 0%
Free (freelist) 985174 3848 4%

Total 25132407 98173
Physical 25132405 98173
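
As a sanity check on the ::memstat figures above, the following minimal sketch converts page counts to MB and to a share of total pages. It assumes a 4 KiB base page size, which the report does not state; the helper names are illustrative only.

```python
# Assumption: 4 KiB base pages (not stated in the report).
PAGE_SIZE = 4096

# Page counts copied from the ::memstat output above.
pages = {
    "Kernel": 5763597,
    "ZFS File Data": 18221977,
    "Free (freelist)": 985174,
}
total_pages = 25132407


def pages_to_mb(n):
    """Convert a page count to whole mebibytes (truncated)."""
    return n * PAGE_SIZE // (1024 * 1024)


def percent_of_total(n):
    """Share of all pages, rounded to the nearest whole percent."""
    return round(100 * n / total_pages)


for name, n in pages.items():
    print(f"{name:16} {pages_to_mb(n):>6} MB {percent_of_total(n):>3}%")
```

Under that page-size assumption, the truncated MB values reproduce the MB column shown in the table.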

Sep 16 19:11:00 backup002 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
Sep 16 19:11:00 backup002 EVENT-TIME: Wed Sep 16 19:10:58 UTC 2020
Sep 16 19:11:00 backup002 PLATFORM: Super-Server, CSN: 0123456789, HOSTNAME: backup002
Sep 16 19:11:00 backup002 SOURCE: zfs-diagnosis, REV: 1.0
Sep 16 19:11:00 backup002 EVENT-ID: db42f990-e846-cfb5-e36d-e92e4f45106e
Sep 16 19:11:00 backup002 DESC: The number of checksum errors associated with a ZFS device
Sep 16 19:11:00 backup002 exceeded acceptable levels. Refer to http://illumos.org/msg/ZFS-8000-GH for more information.
Sep 16 19:11:00 backup002 AUTO-RESPONSE: The device has been marked as degraded. An attempt
Sep 16 19:11:00 backup002 will be made to activate a hot spare if available.
Sep 16 19:11:00 backup002 IMPACT: Fault tolerance of the pool may be compromised.
Sep 16 19:11:00 backup002 REC-ACTION: Run 'zpool status -x' and replace the bad device.

Sep 17 00:16:02 backup002 scsi: [ID 107833 kern.warning] WARNING: /pci@34,0/pci1022,1453@1,1/pci15d9,808@0 (mpt_sas0):
Sep 17 00:16:02 backup002 unable to kmem_alloc enough memory for scatter/gather list

root@backup001:/root# cat ddi-memory-alloc.d
dtrace -n 'ddi_dma_mem_alloc:entry { self->amt = arg1 } ddi_dma_mem_alloc:return /arg1 != 0/ { trace(self->amt) } ddi_dma_mem_alloc:return { self->amt = 0 }'

Output from ddi-memory-alloc.d during fault:
CPU ID FUNCTION:NAME
22 24290 ddi_dma_mem_alloc:return 4736
22 24290 ddi_dma_mem_alloc:return 4736
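
For scale, a rough sketch that puts the traced request size next to the freelist figure from ::memstat. It assumes 4 KiB pages and that the traced value is the length argument in bytes; the ::memstat snapshot may not have been taken at the moment of the failure, and DMA allocations can carry constraints beyond total free memory, so this is only an order-of-magnitude comparison.

```python
import math

PAGE_SIZE = 4096         # assumption: 4 KiB base pages
failed_request = 4736    # bytes, as traced on failed ddi_dma_mem_alloc calls
freelist_pages = 985174  # "Free (freelist)" row of ::memstat

# Number of whole pages needed to back one failed request.
pages_needed = math.ceil(failed_request / PAGE_SIZE)

# Total bytes on the freelist in the snapshot.
freelist_bytes = freelist_pages * PAGE_SIZE

print(f"pages needed per request: {pages_needed}")
print(f"freelist bytes in snapshot: {freelist_bytes}")
```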

There are 102 physical drives connected to this mpt controller. Half of the drives are 'owned' by this host, the other half by another host.
