Bug #2994
Closed
MPT SGL mem alloc failed
100%
Description
After upgrading to oi_151a5, these errors started showing up in /var/adm/messages in groups of dozens at a time (starting about halfway through a scrub):
Jul 7 01:15:21 myelin2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,30c0@0 (mpt_sas0):
Jul 7 01:15:21 myelin2 Unable to allocate dma memory for extra SGL.
Jul 7 01:15:21 myelin2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,30c0@0 (mpt_sas0):
Jul 7 01:15:21 myelin2 MPT SGL mem alloc failed
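To see how large each burst of these warnings is, the messages can be tallied per minute. The following is a minimal sketch, not part of the original report: it assumes the usual syslog layout (month, day, time, host, text) and runs against the four sample lines quoted above rather than a real /var/adm/messages. The function name `count_sgl_failures` is made up for illustration.

```python
import re
from collections import Counter

# Sample lines copied from the report (device path joined onto one line).
# A real run would read /var/adm/messages instead.
SAMPLE = """\
Jul 7 01:15:21 myelin2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,30c0@0 (mpt_sas0):
Jul 7 01:15:21 myelin2 Unable to allocate dma memory for extra SGL.
Jul 7 01:15:21 myelin2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,340a@3/pci1000,30c0@0 (mpt_sas0):
Jul 7 01:15:21 myelin2 MPT SGL mem alloc failed
"""

# Assumed layout: month, day, HH:MM:SS, hostname, then the message text.
# Group 1 keeps the timestamp down to the minute; group 2 is the message.
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+(.*)$")

# The two message texts that appear in the report.
PATTERNS = (
    "Unable to allocate dma memory for extra SGL",
    "MPT SGL mem alloc failed",
)


def count_sgl_failures(lines):
    """Count SGL allocation failure messages per (minute, message text)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not look like syslog entries
        minute, text = match.groups()
        for pattern in PATTERNS:
            if pattern in text:
                counts[(minute, pattern)] += 1
    return counts


counts = count_sgl_failures(SAMPLE.splitlines())
for (minute, pattern), n in sorted(counts.items()):
    print(f"{minute}  {n:4d}  {pattern}")
```

Comparing the per-minute counts against the scrub start time would show whether the failures really begin about halfway through the scrub.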
During the first weekly scrub after updating, these caused multiple read, write, and checksum failures on multiple disks, and ended up causing every disk in one of two vdevs to be marked as degraded. After clearing and trying a second scrub, the same symptoms happened again, but this time two disks were additionally marked as faulted. Since the errors mentioned memory, I ran ::memstat (no scrub was active at the time):
$ echo ::memstat | sudo mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    4553408             17786   72%
ZFS File Data              193505               755    3%
Anon                       108432               423    2%
Exec and libs                1351                 5    0%
Page cache                   5734                22    0%
Free (cachelist)            22007                85    0%
Free (freelist)           1402689              5479   22%

Total                     6287126             24559
Physical                  6287125             24559
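As a sanity check on the ::memstat figures, the MB and percentage columns can be recomputed from the page counts. This is a sketch under two assumptions that the report does not state: a 4 KiB page size (typical for x86), and that MB is truncated while the percentage is rounded to the nearest integer. Both assumptions are inferred only from the fact that they reproduce the numbers shown.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, not stated in the report

# Page counts copied from the ::memstat output in the report.
PAGES = {
    "Kernel": 4553408,
    "ZFS File Data": 193505,
    "Anon": 108432,
    "Exec and libs": 1351,
    "Page cache": 5734,
    "Free (cachelist)": 22007,
    "Free (freelist)": 1402689,
}
TOTAL_PAGES = 6287126


def to_mb(pages):
    """Convert a page count to whole mebibytes (truncating)."""
    return pages * PAGE_SIZE // (1024 * 1024)


def pct_of_total(pages, total):
    """Integer percentage of total, rounded to the nearest whole percent."""
    return (pages * 100 + total // 2) // total


rows = {name: (to_mb(p), pct_of_total(p, TOTAL_PAGES)) for name, p in PAGES.items()}

for name, (mb, pct) in rows.items():
    print(f"{name:<18}{PAGES[name]:>10}{mb:>8}{pct:>5}%")
print(f"{'Total':<18}{TOTAL_PAGES:>10}{to_mb(TOTAL_PAGES):>8}")
```

Under these assumptions the kernel accounts for roughly 17.4 GB of the 24 GB total even with no scrub running, while about 5.4 GB remains on the free list.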
I booted back into oi_151a4 and ran a complete scrub, which fixed a lot of checksum errors without any reported data errors. I then ran a clear and a second scrub, which gave the pool a clean bill of health (no errors, 0 repaired bytes). Neither scrub in oi_151a4 caused any mpt_sas error messages in /var/adm/messages, although the first scrub did report the pool as degraded while it was fixing checksum errors, because the number of checksum errors on multiple devices exceeded acceptable levels.
The system is running two LSI 9201-16i HBAs (SAS2116_1(B1) according to sas2flash), with firmware 11.00.00.00 and BIOS 07.21.00.00, connected via fanout cables to 24 SATA disks (through a Supermicro BPN-SAS-846TQ backplane, with sideband connectors attached, in a Supermicro 846TQ-R1200B chassis, if that matters). It has previously been running oi_151a4, a3, a2, and a1 without problems. The only other thing I changed when I did the upgrade was to update VirtualBox to 4.1.18 (it runs two VMs, 4GB of memory each, used for automated building of our software).
Attaching output of zdb -C (run from oi_151a4).
Choosing high priority because this bug caused checksum errors in a pool, though it was fixed by scrubbing after rolling back the update.
Updated by Tim Coalson almost 11 years ago
This issue disappeared in a6 and a7; I don't know whether that means it should be closed.
Updated by Ken Mays over 9 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
Closed per user validation.