General Protection Trap after r13630 (#998) combined with closed mpt driver
The system panics with putback 13630 (#998, obsolete DMA driver interfaces should be removed) when the closed mpt driver is loaded.
I have tested these builds so far:
r13629 - debug - is fine
r13630 - non debug - panics
# modinfo | grep mpt
 57 fffffffff7a5b000 43720 69 1 mpt (MPT HBA Driver)
 58 fffffffff7a88000 2b538 163 1 mpt_sas (MPTSAS HBA Driver 00.00.00.24)

non debug mpt:
# mcs -p /kernel/drv/amd64/mpt
/kernel/drv/amd64/mpt:
@(#)SunOS 5.11 onnv-gate:2010-08-18 August 2010

mpt - debug:
# mcs -p /mnt/x/kernel/drv/amd64/mpt
/mnt/x/kernel/drv/amd64/mpt:
@(#)SunOS 5.11 onnv-gate:2010-08-18 August 2010
@(#)SunOS Internal Development: 2010-Aug-18 [onnv-gate]
System Configuration: Supermicro X8DT3
# scanpci -v | grep -i lsi
 LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS
 LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
 CardVendor 0x1000 card 0x3020 (LSI Logic / Symbios Logic, Card unknown)
A workaround is to boot with -B disable-mpt=true.
Updated by Rich Lowe over 9 years ago
From an email I sent earlier
So, we're at mpt_restart_ioc+7c, in which we load %r15, which is not a good pointer.

%r15 comes from mpt_restart_ioc+6f, which loads %r8 plus a massive offset. %r8 comes from mpt_restart_ioc+5a, which loads %r13+50, and %r13 is an mpt_t. (I figure you're bored by now. I can go further, but that's tracing %rdi through life, knowing that %rdi is an mpt_t based on the CTF info of that function.)

If we trace the offsets through CTF's type info (see ctfdump), we find that mpt_t+50 is mpt_t->m_active, which is a struct mpt_slots. mpt_slots is, happily, also huge, so that massive offset isn't a big deal. WHOO! mpt_slots+120a0 == m_slot+0

It looks like m_slot is neither NULL nor a valid pointer, and we're crashing on it. (I say not NULL because we check.) The specifics of the #GP seem to bear this out.
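The chain of loads described in the email can be sketched in C. This is only an illustration: the mpt structures are closed source, so the helper names (load_ptr, chase_slot0, is_canonical) are hypothetical, and only the two offsets (0x50 and 0x120a0) come from the thread. The sketch assumes m_active is held as a pointer inside mpt_t, which is what the two-step load suggests. The canonical-address check assumes amd64 with 48-bit virtual addresses, where dereferencing a non-canonical address raises #GP rather than a page fault.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets quoted in the email; everything else is a hypothetical stand-in. */
#define	M_ACTIVE_OFF	0x50u		/* mpt_t + 0x50 -> m_active */
#define	M_SLOT_OFF	0x120a0u	/* mpt_slots + 0x120a0 -> m_slot[0] */

/* Read a pointer-sized value at base + off, like `mov off(%reg), %reg`. */
static uint64_t
load_ptr(const void *base, size_t off)
{
	uint64_t v;

	memcpy(&v, (const unsigned char *)base + off, sizeof (v));
	return (v);
}

/* Follow mpt_t -> m_active -> m_slot[0], as the disassembly appears to. */
static uint64_t
chase_slot0(const void *mpt)
{
	uint64_t slots = load_ptr(mpt, M_ACTIVE_OFF);	/* ends up in %r8 */

	/* The value returned here is what ends up in %r15. */
	return (load_ptr((const void *)(uintptr_t)slots, M_SLOT_OFF));
}

/*
 * With 48-bit virtual addresses, bits 63..47 must all be equal.
 * Anything else is non-canonical and cannot be dereferenced.
 */
static int
is_canonical(uint64_t addr)
{
	uint64_t top = addr >> 47;

	return (top == 0 || top == 0x1ffff);
}
```

A value that is neither NULL nor canonical passes a NULL check but still cannot be dereferenced, which would be consistent with the trap type seen here.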
Updated by Piotr Jasiukajtis over 9 years ago
Rich Lowe wrote:
This is a list of things I would try, without any real science to them at present:
- Boot with the IOMMU disabled
"-B intel-iommu=no" didn't help
- Boot the prior changeset, again, and make certain that one worked (just for paranoia).
Already done that.
Updated by Garrett D'Amore over 9 years ago
I've done some deep analysis here, and the problem is that mpt is using the devinfo structure internals directly, most likely via the DEVI macro, to determine whether the code is being called during attach or not, to decide whether to initialize the hardware.
Because of a change in the structure, mpt gets confused: it doesn't think the device is attaching, and so during its attach handler it winds up calling the reset logic before the rest of its data structure is initialized.
It's very, very unfortunate that mpt is using this private interface.
I would love to remove the header from delivery, but it turns out that this header is exposed via libdevice, and a bunch of other things as well. So drivers that have incorrectly used this structure or the macro are going to suffer. To be clear: this is only a problem because mpt has incorrectly accessed internal details that it never should have.
The interim fix we should apply is to restore the pointer we removed, but we'll restore it as a void * with a name that indicates that the previous value is a placeholder.
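A minimal sketch of why a placeholder helps, using made-up structures. The layouts and field names below are hypothetical and much simpler than the real devinfo structure; the thread does not say which field mpt reads. The point is only that a prebuilt binary has field offsets baked in, so removing a pointer shifts every later field, while a same-sized placeholder keeps the offsets stable.

```c
#include <stddef.h>
#include <string.h>

#define	STATE_ATTACHING	1	/* hypothetical state value */

/* Layout the closed driver was compiled against (hypothetical). */
struct devinfo_before {
	void	*parent;
	void	*removed_ptr;	/* the pointer the change deleted */
	int	node_state;	/* the private field the driver peeks at */
	void	*other;
};

/* Same structure after the pointer was removed. */
struct devinfo_after {
	void	*parent;
	int	node_state;	/* now at a smaller offset */
	void	*other;
};

/* Interim fix: keep a void * placeholder where the pointer used to be. */
struct devinfo_interim {
	void	*parent;
	void	*obsolete_placeholder;
	int	node_state;	/* back at the original offset */
	void	*other;
};

/* What a prebuilt binary effectively does: read at the old offset. */
static int
stale_is_attaching(const void *dip)
{
	int state;

	memcpy(&state, (const char *)dip +
	    offsetof(struct devinfo_before, node_state), sizeof (state));
	return (state == STATE_ATTACHING);
}
```

Against devinfo_after, the stale read lands on unrelated bytes, so an attaching device is not recognized as attaching. Against devinfo_interim, the read finds node_state again without the driver being rebuilt.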