Bug #3195

closed

mpt_sas IOC reset races can cause panics

Added by Albert Lee about 9 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: driver - device drivers
Start date: 2012-09-14
Due date:
% Done: 90%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

There are races present between event handling and the IOC reset path for mpt_sas.
The most common panic resulting from this occurs near the beginning of mptsas_handle_event:

ffffff002f0c8b90 mptsas_handle_event+0x1c(ffffff07450fb020)
ffffff002f0c8c40 taskq_thread+0x285(ffffff07306c9280)
ffffff002f0c8c50 thread_start+8()

Sometimes the stack trace shows only a mutex_enter in the event taskq:
ffffff002ffd4c60 fffffffffbc2dbf0                0   0  60                0
  PC: panicsys+0x9b    TASKQ: mpt_sas_mptsas_event_taskq
  stack pointer for thread ffffff002ffd4c60: ffffff002ffd46b0
    panic+0x94()
    die+0xdd()
    trap+0x177b()
    0xfffffffffb8001d6()
    mutex_enter+0xb()
    taskq_thread+0x285()
    thread_start+8()

There are three things which we don't want to run concurrently with an IOC reset attempt: the interrupt handlers, an address reply taskq thread, and a dynamic reconfigure (topology change) taskq thread. These could reference resources reallocated or zeroed in mptsas_init_chip.

We call mptsas_init_chip to process a reset, which reallocates the reply queue from which mptsas_handle_event takes an entry. mptsas_handle_event must therefore grab mpt->m_mutex before checking the m_in_reset flag, and drop the event if a reset is in progress.
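
A minimal user-space sketch of the check described above, using pthreads in place of the kernel mutex. It is not taken from the driver or from the fix: the struct, its fields other than the m_mutex and m_in_reset names, and model_handle_event are all hypothetical stand-ins.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's per-instance state. */
typedef struct model_mpt {
	pthread_mutex_t	m_mutex;	/* models mpt->m_mutex */
	bool		m_in_reset;	/* models the m_in_reset flag */
	int		*reply_queue;	/* models memory a reset reallocates */
	size_t		reply_slots;
} model_mpt_t;

/*
 * Model of the event handler: take the lock first, then look at the
 * reset flag, and only then touch the reply queue. Returns true if the
 * event was processed and false if it was dropped.
 */
static bool
model_handle_event(model_mpt_t *mpt, size_t slot, int *out)
{
	bool handled = false;

	pthread_mutex_lock(&mpt->m_mutex);
	if (!mpt->m_in_reset && slot < mpt->reply_slots) {
		/* Safe: a reset cannot swap the queue while we hold the lock. */
		*out = mpt->reply_queue[slot];
		handled = true;
	}
	pthread_mutex_unlock(&mpt->m_mutex);

	return (handled);
}
```

The point of the ordering is that the flag is only meaningful while the lock is held; testing it before taking the lock would leave a window in which the reset path could start and reallocate the queue.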

In the solution, we have the interrupt handler exit if it's started in the middle of a reset (interrupts are already disabled, but either the polled or regular interrupt handler could still be running). We wait for the taskq threads to complete before proceeding. We also don't zero the address reply array since it's not necessary.
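
The "refuse new work during a reset, then wait for running work to finish" idea can be sketched in user space as below. This is an illustrative model only, assuming a simple counter and condition variable; the actual fix may rely on the kernel's taskq facilities instead, and every name here (model_reset_gate_t and the gate_* functions) is hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical gate shared by the reset path and the worker tasks. */
typedef struct model_reset_gate {
	pthread_mutex_t	lock;
	pthread_cond_t	idle_cv;	/* signalled when active_tasks hits 0 */
	bool		in_reset;
	int		active_tasks;
} model_reset_gate_t;

static void
gate_init(model_reset_gate_t *g)
{
	pthread_mutex_init(&g->lock, NULL);
	pthread_cond_init(&g->idle_cv, NULL);
	g->in_reset = false;
	g->active_tasks = 0;
}

/* A task (or interrupt handler) calls this first and exits on false. */
static bool
gate_task_enter(model_reset_gate_t *g)
{
	bool ok;

	pthread_mutex_lock(&g->lock);
	ok = !g->in_reset;
	if (ok)
		g->active_tasks++;
	pthread_mutex_unlock(&g->lock);
	return (ok);
}

/* A task that entered successfully calls this when it is done. */
static void
gate_task_exit(model_reset_gate_t *g)
{
	pthread_mutex_lock(&g->lock);
	if (--g->active_tasks == 0)
		pthread_cond_broadcast(&g->idle_cv);
	pthread_mutex_unlock(&g->lock);
}

/* The reset path: block new tasks, then wait for running ones to drain. */
static void
gate_reset_begin(model_reset_gate_t *g)
{
	pthread_mutex_lock(&g->lock);
	g->in_reset = true;
	while (g->active_tasks > 0)
		pthread_cond_wait(&g->idle_cv, &g->lock);
	pthread_mutex_unlock(&g->lock);
}

/* Called once the (modelled) chip re-initialization has finished. */
static void
gate_reset_end(model_reset_gate_t *g)
{
	pthread_mutex_lock(&g->lock);
	g->in_reset = false;
	pthread_mutex_unlock(&g->lock);
}
```

In this model a task must not call gate_reset_begin while it is itself counted in active_tasks, since the wait would then never finish.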

This was reported by a user on the developer list:
http://comments.gmane.org/gmane.os.illumos.devel/8922

This was determined to be a duplicate of Nexenta #10443 and Bugster 6966172. We've fixed this in illumos-nexenta:
https://bitbucket.org/nexenta/illumos-nexenta/changeset/36c6ca86c98e0b33ce8516c391c3e3c1b63371ee


Related issues

Related to illumos gate - Bug #4310: Another mpt_sas panic (Closed, 2013-11-10)
Related to illumos gate - Bug #5306: mpt_sas hangs up during IOC reset (Closed, 2014-11-10)
Related to illumos gate - Bug #6256: mptsas: deadlock in mptsas_handle_topo_change (Closed, Dan Fields, 2015-09-22)
Related to illumos gate - Bug #4399: mpt_sas should not do tasw (Closed, 2013-12-13)
