Bug #4310

closed

Another mpt_sas panic

Added by Rich Ercolani over 9 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
driver - device drivers
Start date:
2013-11-10
Due date:
% Done:
0%
Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

Latest git illumos as of today (2013-11-10), commit dfdb2065f4.

Using oi151_a8's stock kernel, I was having a problem where my system became unresponsive after boot and eventually panicked, apparently after attempting an IOC reset.

Inferring it was the problem resolved by 018d3f06fe6, I built git, booted into the new kernel, and was greeted by a new, different panic!

I don't have a dump at the moment; the panic appears to have prevented it from dumping successfully [the dump device is, in fact, on an mpt_sas controller, so...]. Screencap of the panic attached; I'm going to add a dump device that is not on the mpt_sas device and try again.


Files

snap_042146.png (78.1 KB): Screencap of mpt_sas panic message. Rich Ercolani, 2013-11-10 09:30 AM
there_goes_mptsas__1_.png (60.8 KB): Screencap of this on il-gate nightly 20140723. Rich Ercolani, 2014-07-23 02:48 PM

Related issues

Related to illumos gate - Bug #3195: mpt_sas IOC reset races can cause panics (Resolved, Albert Lee, 2012-09-14)
Has duplicate illumos gate - Bug #4399: mpt_sas should not do tasw (Closed, 2013-12-13)
Is duplicate of illumos gate - Bug #6256: mptsas: deadlock in mptsas_handle_topo_change (Closed, Dan Fields, 2015-09-22)

Actions #1

Updated by Rich Ercolani over 9 years ago

I have achieved vmdump.

http://skysrv.pha.jhu.edu/~rercola/another_mptsas_vmdump (1.5G)

Debugging to follow.

Actions #2

Updated by Rich Ercolani over 9 years ago

> ::panicinfo
             cpu               15
          thread ffffff00f608bc40
         message assertion failed: tq != curthread->t_taskq, file: ../../common/os/taskq.c, line: 1325
             rdi fffffffffbf53640
             rsi ffffff00f608b790
             rdx fffffffffbf3ecd9
             rcx              52d
              r8 ffffff00f608bc40
              r9                1
             rax ffffff00f608b7b0
             rbx ffffff2373399c50
             rbp ffffff00f608b7f0
             r10                0
             r11 ffffff00f608bc40
             r12 ffffff237343e020
             r13                0
             r14 ffffff23735bb000
             r15 ffffff237343e032
          fsbase                0
          gsbase ffffff232969f580
              ds               4b
              es               4b
              fs                0
              gs              1c3
          trapno                0
             err                0
             rip fffffffffb869df0
              cs               30
          rflags              286
             rsp ffffff00f608b788
              ss               38
          gdt_hi                0
          gdt_lo         300001ef
          idt_hi                0
          idt_lo         50000fff
             ldt                0
            task               70
             cr0         8005003b
             cr2          806f298
             cr3          4000000
             cr4              6f8
> ::stack
vpanic()
0xfffffffffbe0be68()
taskq_wait+0x108(ffffff2373399c50)
ddi_taskq_wait+0x11(ffffff2373399c50)
mptsas_flush_hba+0x214(ffffff237343e000)
mptsas_restart_ioc+0x89(ffffff237343e000)
mptsas_ioc_task_management+0x2c2(ffffff237343e000, 3, 34, 0, 0, 0)
mptsas_do_scsi_reset+0xb0(ffffff237343e000, 34)
mptsas_handle_topo_change+0x863(ffffff2385a1f180, 0)
mptsas_handle_dr+0x12c(ffffff2385a1f180)
taskq_thread+0x318(ffffff2373399c50)
thread_start+8()

What a fascinating assertion failure. That assert is just the first line of taskq_wait, which doesn't help us much on its own; let's go back to mptsas_flush_hba and see where we're calling it from...

Actions #3

Updated by Rich Ercolani over 9 years ago

Ah, okay. The problem, and the underlying meaning of the assertion, is that you can't call taskq_wait on a taskq from inside one of that taskq's own threads: the wait can't finish until the calling task itself returns, so deadlock will ensue. So that assertion tripping means we would have deadlocked if it weren't there. Nice. So we have my "mpt_sas becomes a black hole" bug right here, I suppose.
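
To make the self-wait problem concrete, here is a tiny userland model of that check. It is only a sketch and not the illumos implementation: toy_taskq_t, cur_taskq, toy_taskq_wait, toy_taskq_run and toy_wait_task are all made-up names, tasks run synchronously in a single thread, and the wait returns EDEADLK where the real kernel asserts.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a taskq; NOT the illumos taskq_t. */
typedef struct toy_taskq {
	int pending;	/* tasks started but not yet finished */
} toy_taskq_t;

/*
 * Plays the role of curthread->t_taskq: the queue whose task is
 * currently running. A plain global suffices because this model is
 * single-threaded; in the kernel this is per-thread state.
 */
static toy_taskq_t *cur_taskq = NULL;

typedef int (*toy_task_fn)(void *);

/*
 * Wait for tq to drain. If the caller is itself a task on tq, the
 * queue cannot drain until the caller returns, so report EDEADLK
 * instead of blocking forever.
 */
int
toy_taskq_wait(toy_taskq_t *tq)
{
	if (tq == cur_taskq)
		return (EDEADLK);
	/* Single-threaded model: any other queue is already idle. */
	assert(tq->pending == 0);
	return (0);
}

/* Run one task synchronously, as if on a worker thread of tq. */
int
toy_taskq_run(toy_taskq_t *tq, toy_task_fn fn, void *arg)
{
	toy_taskq_t *saved = cur_taskq;
	int rv;

	tq->pending++;
	cur_taskq = tq;
	rv = fn(arg);
	cur_taskq = saved;
	tq->pending--;
	return (rv);
}

/* A task that waits on whichever queue it is handed. */
int
toy_wait_task(void *arg)
{
	return (toy_taskq_wait((toy_taskq_t *)arg));
}
```

In this model, a task running on the "DR" queue that waits on the DR queue gets EDEADLK, which corresponds to the stack above: mptsas_handle_dr runs on the DR taskq and ends up in ddi_taskq_wait on that same taskq. Waiting on a different queue, or from outside any task, succeeds.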

Not sure how to best work around this failure mode. The "cheap" way would be to change the end of mptsas_flush_hba from:

        /*
         * Drain the taskqs prior to reallocating resources.
         */
        mutex_exit(&mpt->m_mutex);
        ddi_taskq_wait(mpt->m_event_taskq);
        ddi_taskq_wait(mpt->m_dr_taskq);
        mutex_enter(&mpt->m_mutex);

to:

        /*
         * Drain the taskqs prior to reallocating resources.
         * Skip the DR taskq if we are running on one of its threads,
         * since waiting on our own taskq can never complete.
         */
        mutex_exit(&mpt->m_mutex);
        ddi_taskq_wait(mpt->m_event_taskq);
        boolean_t in_thread = (mpt->m_dr_taskq == curthread->t_taskq);
        if (!in_thread) {
                ddi_taskq_wait(mpt->m_dr_taskq);
        }
        mutex_enter(&mpt->m_mutex);

But I don't quite know what the consequences of that would be, or how sane it is for a driver to poke at curthread->t_taskq like that.

Actions #4

Updated by Albert Lee over 9 years ago

  • Tags deleted (needs-triage)

This is a corner case in the fix for #3195, which assumes the IOC reset would be issued from a different thread and the DR and event taskqs would need to be synchronised.

Actions #5

Updated by Albert Lee over 9 years ago

The IOC reset is due to a task management command failure, which is certainly unusual and has a separate root cause.

The easier solution to the immediate problem should be to farm out the IOC reset to another thread and return failure since we can't continue processing the event if a reset occurs.
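
One way to picture that suggestion is the userland sketch below. It is not the mpt_sas driver and not a proposed patch: toy_hba_t and all the toy_* functions are hypothetical, and the "other thread" is modelled as a function that is simply called later, outside the DR handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-controller state; NOT the real mptsas_t. */
typedef struct toy_hba {
	bool reset_pending;	/* a reset was requested but has not run */
	int resets_done;	/* number of resets actually performed */
	bool in_dr_task;	/* true while the DR handler is running */
} toy_hba_t;

/*
 * Called from the DR handler when task management fails. Rather than
 * resetting inline (which would wait on the handler's own queue), it
 * only records the request and reports failure to the caller.
 */
int
toy_request_reset(toy_hba_t *hba)
{
	hba->reset_pending = true;
	return (-1);
}

/*
 * DR handler. tm_ok models whether the task management command
 * succeeded; on failure the event is abandoned.
 */
int
toy_handle_dr(toy_hba_t *hba, bool tm_ok)
{
	int rv = 0;

	hba->in_dr_task = true;
	if (!tm_ok)
		rv = toy_request_reset(hba);
	hba->in_dr_task = false;
	return (rv);
}

/*
 * Runs in a separate context (for example a dedicated thread), never
 * from inside the DR handler, so it is free to drain the DR queue
 * before resetting.
 */
void
toy_reset_worker(toy_hba_t *hba)
{
	assert(!hba->in_dr_task);
	if (hba->reset_pending) {
		hba->reset_pending = false;
		hba->resets_done++;
	}
}
```

The point of the split is that the handler never performs the reset itself; it returns failure right away, and the reset happens only once the handler is no longer running.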

Actions #6

Updated by Rich Ercolani over 8 years ago

In case people thought this was fixed, this still transpires on latest il-gate from 20140723.

Actions #7

Updated by Igor Kozhukhov over 8 years ago

Rich Ercolani wrote:

In case people thought this was fixed, this still transpires on latest il-gate from 20140723.

Could you try http://apt2.dilos.org/dilos/isos/dilos-core_install-14-07-21_01-17-1.3.7.9.iso in your environment?
I have updates for mpt_sas from the Nexenta tree and would be interested in having them tested in another environment.
My LSI9211-8i works well with it.

Actions #8

Updated by Rich Ercolani over 8 years ago

Igor Kozhukhov wrote:

Rich Ercolani wrote:

In case people thought this was fixed, this still transpires on latest il-gate from 20140723.

Could you try http://apt2.dilos.org/dilos/isos/dilos-core_install-14-07-21_01-17-1.3.7.9.iso in your environment?
I have updates for mpt_sas from the Nexenta tree and would be interested in having them tested in another environment.
My LSI9211-8i works well with it.

It would be easier for me to either pull changes from a tree or try a binary mpt_sas (and/or if necessary sd) module, than to boot another OS on the machine. Sadly, the one this transpired on is not a development machine.

Actions #9

Updated by Rich Ercolani over 8 years ago

From discussion with Albert:

"I haven't settled on a design for the fix yet. Basically we shouldn't be driving the reset from one of the threads that also needs to terminate on reset to release resources"

Makes sense, but I'm not currently familiar enough with the workings of the codebase to suggest/implement a good fix. We'll see if that changes...

Actions #10

Updated by Marcel Telka about 6 years ago

  • Is duplicate of Bug #6256: mptsas: deadlock in mptsas_handle_topo_change added

Actions #11

Updated by Marcel Telka about 6 years ago

  • Status changed from New to Closed

This is a duplicate of #6256.
