Bug #2994

MPT SGL mem alloc failed

Added by Tim Coalson over 11 years ago. Updated over 9 years ago.

Status: Closed
Priority: High
Assignee:
Category: Drivers
Target version: -
Start date: 2012-07-12
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

After upgrading to oi_151a5, these errors started showing up in /var/adm/messages in groups of dozens at a time (starting about halfway through a scrub):

Jul 7 01:15:21 myelin2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,30c0@0 (mpt_sas0):
Jul 7 01:15:21 myelin2 Unable to allocate dma memory for extra SGL.
Jul 7 01:15:21 myelin2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,30c0@0 (mpt_sas0):
Jul 7 01:15:21 myelin2 MPT SGL mem alloc failed
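
To put a number on "groups of dozens at a time", the failures can be tallied per minute. The following is only a rough sketch: `count_sgl_failures` is a hypothetical helper name, and it assumes a POSIX awk and the "Mon DD HH:MM:SS host message" layout seen in the excerpt above.

```shell
# Hypothetical helper: tally mpt_sas SGL allocation failures per minute.
# Assumes syslog lines of the form "Mon DD HH:MM:SS host message".
count_sgl_failures() {
    # $1: path to a log file such as /var/adm/messages
    awk '
        /MPT SGL mem alloc failed/ {
            key = $1 " " $2 " " substr($3, 1, 5)   # e.g. "Jul 7 01:15"
            if (!(key in count)) order[++n] = key   # remember first-seen order
            count[key]++
        }
        END {
            for (i = 1; i <= n; i++) print count[order[i]], order[i]
        }
    ' "$1"
}
```

Running `count_sgl_failures /var/adm/messages` would print one line per minute, with the count first, which makes it easier to line the bursts up against scrub progress.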

During the first weekly scrub after updating, these caused read, write, and checksum failures on multiple disks, and every disk in one of the two vdevs ended up marked as degraded. After clearing the errors and running a second scrub, the same symptoms recurred, and this time two disks were additionally marked as faulted. Since the errors mentioned memory, I ran ::memstat (no scrub was active at the time):

$ echo ::memstat | sudo mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    4553408             17786   72%
ZFS File Data              193505               755    3%
Anon                       108432               423    2%
Exec and libs                1351                 5    0%
Page cache                   5734                22    0%
Free (cachelist)            22007                85    0%
Free (freelist)           1402689              5479   22%

Total                     6287126             24559
Physical                  6287125             24559
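
As a sanity check on the table above, the MB and percentage columns can be recomputed from the page counts. This is a minimal sketch that assumes 4096-byte pages (not stated in the report, but consistent with the figures shown); `pages_to_mb` and `pages_pct` are hypothetical helper names.

```shell
# Assumption: 4096-byte pages (typical for x86); not stated in the report.
PAGE_BYTES=4096

# Convert a page count to whole megabytes, truncating any fraction.
# Dividing by pages-per-MB keeps the intermediate value small.
pages_to_mb() {
    echo $(( $1 / (1048576 / PAGE_BYTES) ))
}

# Integer percentage of $1 pages relative to $2 pages.
pages_pct() {
    echo $(( $1 * 100 / $2 ))
}

pages_to_mb 4553408           # Kernel pages            → 17786
pages_pct 4553408 6287126     # Kernel share of total   → 72
pages_to_mb 1402689           # Free (freelist) pages   → 5479
```

Under that assumption the kernel holds roughly 17 GB of a 24 GB machine while about 5.4 GB remains on the freelist.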

I booted back into oi_151a4 and ran a complete scrub, which fixed a lot of checksum errors without any reported data errors. I then ran a clear and a second scrub, which gave the pool a clean bill of health (no errors, 0 repaired bytes). Neither scrub in oi_151a4 caused any mpt_sas error messages in /var/adm/messages, although the first scrub did report the pool as degraded while fixing checksum errors, because the number of checksum errors exceeded acceptable levels on multiple devices.

The system is running two LSI 9201-16i HBAs (SAS2116_1(B1) according to sas2flash), with firmware 11.00.00.00 and BIOS 07.21.00.00, connected via fanout cables to 24 SATA disks (through a Supermicro BPN-SAS-846TQ backplane, with sideband connectors attached, in a Supermicro 846TQ-R1200B chassis, if that matters). It has previously run oi_151a4, a3, a2, and a1 without problems. The only other thing I changed when I did the upgrade was to update VirtualBox to 4.1.18 (it runs two VMs, 4GB of memory each, used for automated building of our software).

Attaching output of zdb -C (run from oi_151a4).

Choosing high priority because this bug caused checksum errors in a pool, though it was fixed by scrubbing after rolling back the update.


Files

zdb_-C_output.txt (10.9 KB), Tim Coalson, 2012-07-12 09:32 PM

#1

Updated by Ken Mays almost 11 years ago

  • Assignee set to OI illumos

#2

Updated by Tim Coalson almost 11 years ago

This issue disappeared in a6 and a7; I don't know whether that means it should be closed.

#3

Updated by Ken Mays over 9 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

Closed per user validation.
