Bug #2221

closed

General Protection Trap after r13630 (#998) combined with closed mpt driver

Added by Piotr Jasiukajtis over 9 years ago. Updated over 9 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
kernel
Start date:
2012-03-02
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:

Description

The system panics with a general protection trap when putback r13630 (#998, "obsolete DMA driver interfaces should be removed") is combined with the closed mpt driver.

So far I have tested these builds:

r13629 - debug - is fine
r13630 - non debug - panics

# modinfo | grep mpt
 57 fffffffff7a5b000  43720  69   1  mpt (MPT HBA Driver)
 58 fffffffff7a88000  2b538 163   1  mpt_sas (MPTSAS HBA Driver 00.00.00.24)

mpt - non debug:
# mcs -p /kernel/drv/amd64/mpt
/kernel/drv/amd64/mpt:
@(#)SunOS 5.11 onnv-gate:2010-08-18 August 2010

mpt - debug: 
# mcs -p /mnt/x/kernel/drv/amd64/mpt
/mnt/x/kernel/drv/amd64/mpt:
@(#)SunOS 5.11 onnv-gate:2010-08-18 August 2010
@(#)SunOS Internal Development:  2010-Aug-18 [onnv-gate]

System Configuration: Supermicro X8DT3

# scanpci -v | grep -i lsi
 LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS
 LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
 CardVendor 0x1000 card 0x3020 (LSI Logic / Symbios Logic, Card unknown)

A workaround is to boot with -B disable-mpt=true.


Files

msgbuf_1.png (17 KB) Piotr Jasiukajtis, 2012-03-02 06:52 PM
msgbuf_2.png (19.6 KB) Piotr Jasiukajtis, 2012-03-02 06:52 PM
msgbuf_3.png (17.6 KB) Piotr Jasiukajtis, 2012-03-02 06:52 PM
msgbuf_4.png (16.3 KB) Piotr Jasiukajtis, 2012-03-02 06:52 PM
msgbuf_5.png (18.7 KB) Piotr Jasiukajtis, 2012-03-02 06:52 PM
msgbuf_6.png (20.5 KB) Piotr Jasiukajtis, 2012-03-02 06:52 PM
regs.png (16.3 KB) Piotr Jasiukajtis, 2012-03-02 06:54 PM
stack_1.png (21 KB) Piotr Jasiukajtis, 2012-03-02 06:54 PM
stack_2.png (21.9 KB) Piotr Jasiukajtis, 2012-03-02 06:54 PM
stack_3.png (19.4 KB) Piotr Jasiukajtis, 2012-03-02 06:54 PM

Actions #3

Updated by Rich Lowe over 9 years ago

This is a list of things I would try, without any real science to them at present:

- Boot with the IOMMU disabled
- Boot the prior changeset, again, and make certain that one worked (just for paranoia).
Actions #4

Updated by Rich Lowe over 9 years ago

From an email I sent earlier

So, we're at mpt_restart_ioc+7c, in which we load %r15, which is not a good pointer.
%r15 comes from mpt_restart_ioc+6f, which is loading %r8 + a massive offset.
%r8 comes from mpt_restart_ioc+5a, which is loading %r13+50.
%r13 is an mpt_t (I figure you're bored by now; I can go further, but
that's tracing %rdi through life, knowing that %rdi is an mpt_t based on the CTF info of that function).

If we trace the offsets through CTF's type info (see ctfdump), we find that mpt_t+50 is
mpt_t->m_active, which is a struct mpt_slots.
mpt_slots is, happily, also huge, so that massive offset isn't a big
deal. WHOO! mpt_slots+120a0 == m_slot+0

It looks like m_slot[0] is neither NULL nor a valid pointer, and we're crashing on it. (I say not NULL because we check.) The specifics of the #GP seem to bear this out.
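
To make the offset arithmetic above concrete, here is a minimal C sketch. The real mpt structures are closed source, so struct mpt_sketch and struct mpt_slots_sketch are hypothetical stand-ins, padded only so that the two offsets match the ones quoted (0x50 and 0x120a0); treating m_active as a pointer is an assumption based on the two separate loads. is_canonical() is also just an illustrative helper: on amd64 with 48-bit virtual addresses, a load through a non-canonical address raises #GP rather than a page fault, which is one plausible reading of "the specifics of the #GP".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for the closed mpt structures. The padding is
 * sized only to reproduce the offsets quoted in the analysis.
 */
struct mpt_slots_sketch {
	char	pad[0x120a0];
	void	*m_slot[1];		/* mpt_slots+0x120a0 == m_slot+0 */
};

struct mpt_sketch {
	char	pad[0x50];
	struct mpt_slots_sketch *m_active;	/* mpt_t+0x50 */
};

/*
 * amd64, 48-bit virtual addresses (assumed): bits 63..47 must all be
 * equal, otherwise a memory access through the address raises #GP.
 */
static int
is_canonical(uint64_t va)
{
	uint64_t top = va >> 47;	/* the top 17 bits */

	return (top == 0 || top == 0x1ffff);
}

/*
 * Mirrors the two loads from the disassembly:
 *   %r8  = [%r13 + 0x50]      (mpt->m_active)
 *   %r15 = [%r8  + 0x120a0]   (m_active->m_slot[0])
 */
static void *
first_slot(const struct mpt_sketch *mpt)
{
	assert(mpt != NULL && mpt->m_active != NULL);
	return (mpt->m_active->m_slot[0]);
}
```

If first_slot() returned a value for which is_canonical() is false, dereferencing it would trap with #GP, and a non-NULL check alone would not catch it.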

Actions #5

Updated by Piotr Jasiukajtis over 9 years ago

Rich Lowe wrote:

This is a list of things I would try, without any real science to them at present:

- Boot with the IOMMU disabled

"-B intel-iommu=no" didn't help

- Boot the prior changeset, again, and make certain that one worked (just for paranoia).

Already done that.

Actions #6

Updated by Rich Lowe over 9 years ago

  • Status changed from New to Closed

#998 was backed out which should clear this up.

Actions #7

Updated by Garrett D'Amore over 9 years ago

I've done some deep analysis here, and the problem is that mpt is using the devinfo structure internals directly, most likely via the DEVI macro, to determine whether the code is being called during attach or not, to decide whether to initialize the hardware.

Because the structure's layout changed, mpt gets confused and doesn't think the device is attaching, so it winds up calling the reset logic from its attach handler before the rest of its data structure is initialized.

It's very, very unfortunate that mpt is using this private interface.

I would love to remove the header from delivery, but it turns out that this header is exposed via libdevice, and a bunch of other things as well. So drivers that have incorrectly used this structure or the macro are going to suffer. To be clear: this is only a problem because mpt has incorrectly accessed internal details that it never should have.

The interim fix we should apply is to restore the pointer we removed, but we'll restore it as a void * with a name that indicates that the previous value is a placeholder.
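
A minimal C sketch of why a placeholder helps a binary-only driver. All struct, field, and macro names here are hypothetical and far simpler than the real devinfo structure; the only point is that deleting a member moves every later member, while a void * placeholder of the same size keeps the old offsets valid.

```c
#include <stddef.h>
#include <string.h>

/* Layout the closed driver binary was compiled against (hypothetical). */
struct devi_before {
	void	*devi_parent;
	void	*devi_removed_ptr;	/* the member the change deleted */
	int	devi_state;
};

/* After the member is deleted, devi_state moves to a lower offset. */
struct devi_after {
	void	*devi_parent;
	int	devi_state;
};

/* Interim fix: a placeholder pointer restores the old offsets. */
struct devi_fixed {
	void	*devi_parent;
	void	*devi_obsolete_placeholder;	/* unused, keeps layout stable */
	int	devi_state;
};

#define	DEVI_ATTACHING_SKETCH	0x1	/* hypothetical "attaching" flag */

/*
 * What an already-compiled driver effectively does: read the state word
 * at the offset that was baked in when the driver was built.
 */
static int
old_binary_sees_attaching(const void *devi)
{
	const char *base = devi;
	int state;

	memcpy(&state, base + offsetof(struct devi_before, devi_state),
	    sizeof (state));
	return ((state & DEVI_ATTACHING_SKETCH) != 0);
}
```

Against devi_after the old binary reads the wrong bytes and misses the flag; against devi_fixed it reads the same field it always did. Recompiling the driver would also fix this, but that is not an option for a closed binary.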
