Bug #4310
Another mpt_sas panic
Status: closed
Description
Latest git illumos as of today (11/10/13), commit dfdb2065f4.
Using oi151_a8's stock kernel, I was having a problem where my system became unresponsive after boot and eventually panicked, apparently after attempting an IOC reset.
Inferring that this was the problem resolved by 018d3f06fe6, I built from git, booted into the new kernel, and was greeted by a new, different panic!
I don't have a dump at the moment; the panic appears to have prevented it from dumping successfully [the dump device is, in fact, on an mpt_sas controller, so...]. Screencap of the panic attached; I'm going to add a dump device that is not on the mpt_sas device and try again.
Updated by Rich Ercolani over 9 years ago
I have achieved vmdump.
http://skysrv.pha.jhu.edu/~rercola/another_mptsas_vmdump (1.5G)
Debugging to follow.
Updated by Rich Ercolani over 9 years ago
> ::panicinfo
    cpu 15
    thread ffffff00f608bc40
    message assertion failed: tq != curthread->t_taskq, file: ../../common/os/taskq.c, line: 1325
    rdi fffffffffbf53640
    rsi ffffff00f608b790
    rdx fffffffffbf3ecd9
    rcx 52d
    r8 ffffff00f608bc40
    r9 1
    rax ffffff00f608b7b0
    rbx ffffff2373399c50
    rbp ffffff00f608b7f0
    r10 0
    r11 ffffff00f608bc40
    r12 ffffff237343e020
    r13 0
    r14 ffffff23735bb000
    r15 ffffff237343e032
    fsbase 0
    gsbase ffffff232969f580
    ds 4b
    es 4b
    fs 0
    gs 1c3
    trapno 0
    err 0
    rip fffffffffb869df0
    cs 30
    rflags 286
    rsp ffffff00f608b788
    ss 38
    gdt_hi 0
    gdt_lo 300001ef
    idt_hi 0
    idt_lo 50000fff
    ldt 0
    task 70
    cr0 8005003b
    cr2 806f298
    cr3 4000000
    cr4 6f8
> ::stack
    vpanic()
    0xfffffffffbe0be68()
    taskq_wait+0x108(ffffff2373399c50)
    ddi_taskq_wait+0x11(ffffff2373399c50)
    mptsas_flush_hba+0x214(ffffff237343e000)
    mptsas_restart_ioc+0x89(ffffff237343e000)
    mptsas_ioc_task_management+0x2c2(ffffff237343e000, 3, 34, 0, 0, 0)
    mptsas_do_scsi_reset+0xb0(ffffff237343e000, 34)
    mptsas_handle_topo_change+0x863(ffffff2385a1f180, 0)
    mptsas_handle_dr+0x12c(ffffff2385a1f180)
    taskq_thread+0x318(ffffff2373399c50)
    thread_start+8()
What a fascinating assertion failure. That assert is just the first line of taskq_wait, which doesn't help us much; let's go back to mptsas_flush_hba and see where we're calling it from...
Updated by Rich Ercolani over 9 years ago
Ah, okay. The problem, and the underlying meaning of the assertion, is that you can't call taskq_wait on a taskq from inside one of that taskq's own threads, because the wait would never finish: the calling task is itself one of the tasks being waited on. So that assertion tripping means we would have deadlocked if it weren't there. Nice. So we have my "mpt_sas becomes a black hole" bug right here, I suppose.
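To make the self-wait condition concrete, here is a minimal userland sketch of the check behind that assertion. Everything prefixed with model_ is a hypothetical stand-in invented for illustration and is not the illumos kernel code: model_cur_taskq plays the role of curthread->t_taskq, and model_run_task imitates a worker thread running one task on a queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a taskq; NOT the illumos taskq_t. */
typedef struct model_taskq {
	int pending;	/* tasks queued or currently running */
} model_taskq_t;

/* Models curthread->t_taskq: the queue whose worker runs this code. */
static model_taskq_t *model_cur_taskq = NULL;

/*
 * A wait on tq only returns once tq has no running tasks. If the
 * caller is itself a task on tq, that can never happen.
 */
static bool
model_wait_would_deadlock(const model_taskq_t *tq)
{
	return (tq == model_cur_taskq);
}

/* Mirrors the shape of the assertion at the top of taskq_wait. */
static void
model_taskq_wait(model_taskq_t *tq)
{
	assert(tq != model_cur_taskq);
	/* A real implementation would block here until the queue is idle. */
	(void) tq;
}

typedef void (*model_task_fn)(void *);

/* Run fn the way a worker of tq would: the task counts as in flight. */
static void
model_run_task(model_taskq_t *tq, model_task_fn fn, void *arg)
{
	model_taskq_t *saved = model_cur_taskq;

	tq->pending++;
	model_cur_taskq = tq;
	fn(arg);
	model_cur_taskq = saved;
	tq->pending--;
}

/* Example task: records whether waiting on 'target' would self-deadlock. */
struct model_probe {
	model_taskq_t *target;
	bool would_deadlock;
};

static void
model_probe_task(void *arg)
{
	struct model_probe *p = arg;

	p->would_deadlock = model_wait_would_deadlock(p->target);
}
```

In the stack above, the taskq pointer passed to taskq_wait and the one passed to taskq_thread are the same (ffffff2373399c50), which is the situation model_probe_task detects when its target is the queue it is running on.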
Not sure how to best work around this failure mode. The "cheap" way would be to change the end of mptsas_flush_hba from:
    /*
     * Drain the taskqs prior to reallocating resources.
     */
    mutex_exit(&mpt->m_mutex);
    ddi_taskq_wait(mpt->m_event_taskq);
    ddi_taskq_wait(mpt->m_dr_taskq);
    mutex_enter(&mpt->m_mutex);
to:
    /*
     * Drain the taskqs prior to reallocating resources.
     */
    mutex_exit(&mpt->m_mutex);
    ddi_taskq_wait(mpt->m_event_taskq);
    bool in_thread = (mpt->m_dr_taskq == curthread->t_taskq);
    if (!in_thread) {
        ddi_taskq_wait(mpt->m_dr_taskq);
    }
    mutex_enter(&mpt->m_mutex);
But I don't quite know what the consequences of that would be, or the sanity of that kind of intimate violation.
Updated by Albert Lee over 9 years ago
- Tags deleted (needs-triage)
This is a corner case in the fix for #3195, which assumes the IOC reset would be issued from a different thread and the DR and event taskqs would need to be synchronised.
Updated by Albert Lee over 9 years ago
The IOC reset is due to a task management command failure, which is certainly unusual and has a separate root cause.
The easier solution to the immediate problem should be to farm out the IOC reset to another thread and return failure since we can't continue processing the event if a reset occurs.
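A rough userland sketch of that hand-off idea follows. It is only an illustration under stated assumptions: model_hba_t, model_handle_dr and model_reset_worker are hypothetical names, not mpt_sas functions, and the single-threaded model simply records that a reset was requested so that a different context can carry it out later.

```c
#include <assert.h>
#include <stdbool.h>

#define	MODEL_SUCCESS	0
#define	MODEL_FAILURE	(-1)

/* Hypothetical per-controller state; NOT the real mptsas_t. */
typedef struct model_hba {
	bool reset_pending;	/* reset requested but not yet performed */
	int resets_done;	/* number of resets actually carried out */
	bool in_dr_task;	/* true while the DR handler is running */
} model_hba_t;

/*
 * Reset body. It drains the DR queue in the real driver, so it must
 * never run from inside the DR handler itself.
 */
static void
model_reset_worker(model_hba_t *hba)
{
	assert(!hba->in_dr_task);
	if (!hba->reset_pending)
		return;
	hba->reset_pending = false;
	hba->resets_done++;
}

/*
 * DR handler. On a task management failure it only requests a reset
 * and returns failure, instead of resetting from its own thread.
 */
static int
model_handle_dr(model_hba_t *hba, bool tm_failed)
{
	int rv = MODEL_SUCCESS;

	hba->in_dr_task = true;
	if (tm_failed) {
		hba->reset_pending = true;	/* hand off, do not reset here */
		rv = MODEL_FAILURE;
	}
	hba->in_dr_task = false;
	return (rv);
}
```

In this sketch the event is abandoned as soon as the reset is requested, and the reset itself runs only after the handler has returned, so the thread that drains the queue is never one of the queue's own tasks.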
Updated by Rich Ercolani over 8 years ago
In case people thought this was fixed, this still transpires on latest il-gate from 20140723.
Updated by Igor Kozhukhov over 8 years ago
Rich Ercolani wrote:
In case people thought this was fixed, this still transpires on latest il-gate from 20140723.
Could you try http://apt2.dilos.org/dilos/isos/dilos-core_install-14-07-21_01-17-1.3.7.9.iso in your environment?
I have updates for mpt_sas from the Nexenta tree and would be interested in testing them in another environment.
My LSI9211-8i works well with it.
Updated by Rich Ercolani over 8 years ago
Igor Kozhukhov wrote:
Rich Ercolani wrote:
In case people thought this was fixed, this still transpires on latest il-gate from 20140723.
Could you try http://apt2.dilos.org/dilos/isos/dilos-core_install-14-07-21_01-17-1.3.7.9.iso in your environment?
I have updates for mpt_sas from the Nexenta tree and would be interested in testing them in another environment.
My LSI9211-8i works well with it.
It would be easier for me to either pull changes from a tree or try a binary mpt_sas (and/or if necessary sd) module, than to boot another OS on the machine. Sadly, the one this transpired on is not a development machine.
Updated by Rich Ercolani over 8 years ago
From discussion with Albert:
"I haven't settled on a design for the fix yet. Basically we shouldn't be driving the reset from one of the threads that also needs to terminate on reset to release resources"
Makes sense, but I'm not currently familiar enough with the workings of the codebase to suggest/implement a good fix. We'll see if that changes...
Updated by Marcel Telka about 6 years ago
- Is duplicate of Bug #6256: mptsas: deadlock in mptsas_handle_topo_change added
Updated by Marcel Telka about 6 years ago
- Status changed from New to Closed
This is a duplicate of #6256.