Bug #3195

mpt_sas IOC reset races can cause panics

Added by Albert Lee almost 8 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: driver - device drivers
Start date: 2012-09-14
Due date:
% Done: 90%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

There are races present between event handling and the IOC reset path for mpt_sas.
The most common panic resulting from this occurs near the beginning of mptsas_handle_event:

ffffff002f0c8b90 mptsas_handle_event+0x1c(ffffff07450fb020)
ffffff002f0c8c40 taskq_thread+0x285(ffffff07306c9280)
ffffff002f0c8c50 thread_start+8()

Sometimes the stack trace shows only a mutex_enter in the event taskq thread:
ffffff002ffd4c60 fffffffffbc2dbf0                0   0  60                0
  PC: panicsys+0x9b    TASKQ: mpt_sas_mptsas_event_taskq
  stack pointer for thread ffffff002ffd4c60: ffffff002ffd46b0
    panic+0x94()
    die+0xdd()
    trap+0x177b()
    0xfffffffffb8001d6()
    mutex_enter+0xb()
    taskq_thread+0x285()
    thread_start+8()

Three things must not run concurrently with an IOC reset attempt: the interrupt handlers, an address reply taskq thread, and a dynamic reconfiguration (topology change) taskq thread. Any of them could reference resources that mptsas_init_chip reallocates or zeroes.

A reset is processed by calling mptsas_init_chip, which reallocates the reply queue from which mptsas_handle_event takes an entry. mptsas_handle_event must therefore grab mpt->m_mutex before checking the m_in_reset flag, and drop the event if a reset is in progress.
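
The check-under-lock pattern described above can be illustrated with a minimal userland sketch. This is not the driver code: sketch_softc_t, sketch_handle_event, and the reply_queue/reply_count fields are hypothetical stand-ins, and a pthread mutex is assumed in place of the kernel mutex. Only the m_mutex and m_in_reset names come from the description.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-instance soft state (not the real mptsas_t). */
typedef struct sketch_softc {
	pthread_mutex_t	m_mutex;	/* models mpt->m_mutex */
	bool		m_in_reset;	/* models the m_in_reset flag */
	int		*reply_queue;	/* models the reply queue; replaced on reset */
	size_t		reply_count;	/* number of entries in reply_queue */
} sketch_softc_t;

/*
 * Read one reply entry for an event. The flag is checked only while the
 * mutex is held, so a reset cannot swap the queue between the check and
 * the access. Returns true if the event was handled, false if dropped.
 */
static bool
sketch_handle_event(sketch_softc_t *sc, size_t slot, int *out)
{
	bool handled = false;

	pthread_mutex_lock(&sc->m_mutex);
	if (!sc->m_in_reset && sc->reply_queue != NULL &&
	    slot < sc->reply_count) {
		*out = sc->reply_queue[slot];
		handled = true;
	}
	/* If a reset is in progress, the event is simply dropped. */
	pthread_mutex_unlock(&sc->m_mutex);

	return (handled);
}
```

Checking the flag before taking the mutex would leave a window in which the reset path could reallocate the queue after the check but before the entry is read, which is the race this bug describes.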

In the fix, the interrupt handler exits if it is entered in the middle of a reset (interrupts are already disabled at that point, but either the polled or the regular interrupt handler could still be running). The reset path waits for the taskq threads to complete before proceeding. We also no longer zero the address reply array, since that is not necessary.
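
The ordering in the fix (mark the reset, let in-flight work drain, then reinitialize) can be sketched in userland as follows. Everything here is an assumption made for illustration: sketch_ioc_t and its functions are hypothetical, a counter plus condition variable stands in for waiting on the driver's taskqs, and the generation field merely marks where the mptsas_init_chip work would happen.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of one controller instance. */
typedef struct sketch_ioc {
	pthread_mutex_t	lock;		/* protects every field below */
	pthread_cond_t	idle_cv;	/* signalled when active_tasks hits 0 */
	bool		in_reset;	/* a reset is in progress */
	int		active_tasks;	/* taskq-style jobs currently running */
	int		generation;	/* bumped on each reinitialization */
} sketch_ioc_t;

/* A task registers itself before touching shared state; refused during reset. */
static bool
sketch_task_enter(sketch_ioc_t *ioc)
{
	bool ok;

	pthread_mutex_lock(&ioc->lock);
	ok = !ioc->in_reset;
	if (ok)
		ioc->active_tasks++;
	pthread_mutex_unlock(&ioc->lock);
	return (ok);
}

/* A task that entered successfully calls this when it is done. */
static void
sketch_task_exit(sketch_ioc_t *ioc)
{
	pthread_mutex_lock(&ioc->lock);
	if (--ioc->active_tasks == 0)
		pthread_cond_broadcast(&ioc->idle_cv);
	pthread_mutex_unlock(&ioc->lock);
}

/* Interrupt handler: exits without claiming work if a reset is under way. */
static bool
sketch_intr(sketch_ioc_t *ioc)
{
	bool claimed;

	pthread_mutex_lock(&ioc->lock);
	claimed = !ioc->in_reset;
	pthread_mutex_unlock(&ioc->lock);
	return (claimed);
}

/* Reset path: block new work, wait for running tasks, then reinitialize. */
static void
sketch_reset(sketch_ioc_t *ioc)
{
	pthread_mutex_lock(&ioc->lock);
	ioc->in_reset = true;
	/* cond_wait drops the lock while sleeping, so tasks can still finish. */
	while (ioc->active_tasks > 0)
		pthread_cond_wait(&ioc->idle_cv, &ioc->lock);
	ioc->generation++;	/* stands in for reallocating chip resources */
	ioc->in_reset = false;
	pthread_mutex_unlock(&ioc->lock);
}
```

The wait releases the lock while sleeping. Waiting for tasks while still holding a lock that those tasks need would deadlock instead of draining.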

This was reported by a user on the developer list:
http://comments.gmane.org/gmane.os.illumos.devel/8922

This was determined to be a duplicate of Nexenta #10443 and Bugster 6966172. We've fixed this in illumos-nexenta:
https://bitbucket.org/nexenta/illumos-nexenta/changeset/36c6ca86c98e0b33ce8516c391c3e3c1b63371ee


Related issues

Related to illumos gate - Bug #4310: Another mpt_sas panic (Closed, 2013-11-10)
Related to illumos gate - Bug #5306: mpt_sas hangs up during IOC reset (Closed, 2014-11-10)
Related to illumos gate - Bug #6256: mptsas: deadlock in mptsas_handle_topo_change (Closed, 2015-09-22)
Related to illumos gate - Bug #4399: mpt_sas should not do tasw (Closed, 2013-12-13)

History

#1

Updated by Dan McDonald almost 7 years ago

  • Status changed from In Progress to Pending RTI
#2

Updated by Dan McDonald almost 7 years ago

  • Status changed from Pending RTI to Resolved

commit 018d3f06fe63d3b8316ef73502fb8f2dd473ffd1
Author: Albert Lee <>
Date: Fri Aug 31 14:53:16 2012 -0400

3195 mpt_sas IOC reset races can cause panics
Reviewed by: Dan McDonald <>
Reviewed by: Keith Wesolowski <>
Approved by: Garrett D'Amore <>
#3

Updated by Arne Jansen almost 5 years ago

  • Related to Bug #6256: mptsas: deadlock in mptsas_handle_topo_change added
#4

Updated by Marcel Telka over 3 years ago

  • Related to Bug #4399: mpt_sas should not do tasw added
