Bug #6256


mptsas: deadlock in mptsas_handle_topo_change

Added by Arne Jansen over 7 years ago. Updated over 7 years ago.

Status: Closed
Priority: Normal
Assignee: Dan Fields
Category: driver - device drivers
Start date: 2015-09-22
Due date:
% Done: 100%
Estimated time: 8.00 h
Difficulty: Medium
Tags: needs-triage
Gerrit CR:
External Bug:

Description

We encountered a hanging mptsas driver, stuck at

> ffffff001fc69c40::findstack -v
stack pointer for thread ffffff001fc69c40: ffffff001fc69780
[ ffffff001fc69780 _resume_from_idle+0xf4() ]
  ffffff001fc697b0 swtch+0x141()
  ffffff001fc697f0 cv_wait+0x70(ffffff04f52242aa, ffffff04f5224298)
  ffffff001fc69830 taskq_wait+0x43(ffffff04f5224278)
  ffffff001fc69850 ddi_taskq_wait+0x11(ffffff04f5224278)
  ffffff001fc698b0 mptsas_flush_hba+0x1e4(ffffff04f70f2000)
  ffffff001fc69900 mptsas_restart_ioc+0x7d(ffffff04f70f2000)
  ffffff001fc699b0 mptsas_ioc_task_management+0x2ca(ffffff04f70f2000, 3, 15, 0, 
  0, 0, ffffff0000000000)
  ffffff001fc69a00 mptsas_do_scsi_reset+0x90(ffffff04f70f2000, 15)
  ffffff001fc69af0 mptsas_handle_topo_change+0x763(ffffff04f6082978, 0)
  ffffff001fc69b60 mptsas_handle_dr+0x184(ffffff04f6082978)
  ffffff001fc69c20 taskq_thread+0x2d0(ffffff04f5224278)
  ffffff001fc69c30 thread_start+8()

The taskq it waits for contains only one thread, namely itself:

> ffffff04f5224278::taskq -at
ADDR             NAME                             ACT/THDS Q'ED  MAXQ INST
ffffff04f5224278 mpt_sas_mptsas_dr_taskq            1/   1    1     1    0
    THREAD           STATE    SOBJ                COUNT
    ffffff001fc69c40 SLEEP    CV                      1
                     swtch+0x141
                     cv_wait+0x70
                     taskq_wait+0x43
                     ddi_taskq_wait+0x11
                     mptsas_flush_hba+0x1e4
                     mptsas_restart_ioc+0x7d
                     mptsas_ioc_task_management+0x2ca
                     mptsas_do_scsi_reset+0x90
                     mptsas_handle_topo_change+0x763
                     mptsas_handle_dr+0x184
                     taskq_thread+0x2d0
                     thread_start+8
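
The stack and ::taskq output show the single worker thread of mpt_sas_mptsas_dr_taskq calling ddi_taskq_wait() on the very taskq it is running on; the wait can never return because the waiting task is itself still outstanding. Below is a minimal sketch of that pattern using the DDI taskq interfaces -- illustrative only, not the mptsas code:

#include <sys/ddi.h>
#include <sys/sunddi.h>

/*
 * Illustrative only.  A task dispatched onto a taskq calls
 * ddi_taskq_wait() on that same taskq.  ddi_taskq_wait() blocks until
 * every dispatched task has completed, but this task cannot complete
 * until the wait returns, so the thread sleeps forever.
 */
static ddi_taskq_t *dr_tq;

static void
dr_task(void *arg)
{
        /* ... handle the topology change, possibly resetting the IOC ... */

        /* Deadlock: this thread is one of the tasks being waited for. */
        ddi_taskq_wait(dr_tq);
}

static void
trigger(dev_info_t *dip)
{
        /* One worker thread, as with mpt_sas_mptsas_dr_taskq. */
        dr_tq = ddi_taskq_create(dip, "dr_taskq", 1, TASKQ_DEFAULTPRI, 0);
        (void) ddi_taskq_dispatch(dr_tq, dr_task, NULL, DDI_SLEEP);
}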

Files

BUG6256.jpg (43.4 KB), Dan Fields, 2015-10-22 02:16 PM
BUG6256-2.jpg (61.4 KB), Dan Fields, 2015-11-24 07:41 PM

Related issues

Related to illumos gate - Bug #3195: mpt_sas IOC reset races can cause panics (Resolved, Albert Lee, 2012-09-14)
Has duplicate illumos gate - Bug #4310: Another mpt_sas panic (Closed, 2013-11-10)
Has duplicate illumos gate - Bug #5306: mpt_sas hangs up during IOC reset (Closed, 2014-11-10)

Actions #1
Actions #1

Updated by Arne Jansen over 7 years ago

  • Related to Bug #3195: mpt_sas IOC reset races can cause panics added
Actions #2

Updated by Arne Jansen over 7 years ago

The ddi_taskq_wait call was added with:

Author: Albert Lee <trisk@nexenta.com>
Date:   Fri Aug 31 14:53:16 2012 -0400

    3195 mpt_sas IOC reset races can cause panics
    Reviewed by: Dan McDonald <danmcd@nexenta.com>
    Reviewed by: Keith Wesolowski <keith.wesolowski@joyent.com>
    Approved by: Garrett D'Amore <garrett.damore@gmail.com>
Actions #3

Updated by Dan Fields over 7 years ago

Arne,

Proposed fix (see below). It has been running for a couple of weeks with no occurrences of the deadlock or the previous problem. It does fix the deadlock on the taskq as it is currently coded with one thread per queue, but if that changes, the outcome is indeterminate. To get rid of this altogether, I agree with rewriting the firmware reset handling to run in a single thread and coordinating that with the few places in the driver that request a reset.

Arne Jansen wrote:

We encountered a hanging mptsas driver, stuck at

[...]

The taskq it waits for contains only one thread, namely itself:

[...]
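
The guard Dan describes amounts to not calling ddi_taskq_wait() when the flush is running on the dr taskq's one worker thread. A rough sketch of how such a guard might look follows; the field names m_dr_taskq and m_dr_taskq_thread are assumptions for illustration (driver context assumed), not necessarily the committed patch:

/*
 * Sketch only: the names m_dr_taskq and m_dr_taskq_thread are
 * illustrative and may not match the actual change.
 */
static void
dr_task_sketch(void *arg)
{
        mptsas_t *mpt = arg;

        /* Record which thread services the single-threaded dr taskq. */
        mpt->m_dr_taskq_thread = curthread;

        /* ... topology change handling, which may reset the IOC ... */
}

static void
flush_hba_sketch(mptsas_t *mpt)
{
        /*
         * Wait for the dr taskq to drain only when we are not running
         * on its one worker thread; otherwise we would be waiting on
         * ourselves.
         */
        if (curthread != mpt->m_dr_taskq_thread)
                ddi_taskq_wait(mpt->m_dr_taskq);
}

Recording the worker thread from inside the task is only safe if the taskq really has a single thread, which is what the following comments about an ASSERT/VERIFY address.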

Actions #4

Updated by Arne Jansen over 7 years ago

Dan,

I see how this fixes the deadlock. Can you please also ASSERT that the taskq really only has one thread, and add a comment at the taskq_create explaining why the number of threads has to be one?

Actions #5

Updated by Joshua M. Clulow over 7 years ago

If you're 100% sure this is an invariant, you should use VERIFY() rather than ASSERT().
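
For background on that suggestion: in illumos, ASSERT() is compiled out of non-DEBUG kernels, while VERIFY() is always evaluated and panics if the condition does not hold. A tiny sketch (the helper and its nthreads check are hypothetical):

#include <sys/debug.h>

/* Hypothetical helper illustrating the difference. */
static void
check_dr_taskq_nthreads(int nthreads)
{
        ASSERT(nthreads == 1);  /* checked only in DEBUG kernels */
        VERIFY(nthreads == 1);  /* checked in every build; panics if false */
}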

Actions #6

Updated by Simon K over 7 years ago

I would be glad to see the (modified) patch upstreamed soon :)

Dan Fields wrote:

Arne,

Proposed fix (see below). It has been running for a couple of weeks with no occurrences of the deadlock or the previous problem. It does fix the deadlock on the taskq as it is currently coded with one thread per queue, but if that changes, the outcome is indeterminate. To get rid of this altogether, I agree with rewriting the firmware reset handling to run in a single thread and coordinating that with the few places in the driver that request a reset.

Arne Jansen wrote:

We encountered a hanging mptsas driver, stuck at

[...]

The taskq it waits for contains only one thread, namely itself:

[...]

Actions #7

Updated by Dan Fields over 7 years ago

Simon/Arne,

After more thought, I realize the queue size of 1 is a separate issue and isn't a factor in this particular bug. I originally conflated the queue size with this issue, but I was wrong about that. The crux of this is the deadlock brought about by waiting on a thread's own taskq. The code as it stands fixes the issue, and the added comment helps explain the unique situation. Changing the size of the queue is a separate matter; regardless of the queue size, this issue would still remain.

Dan

Here is the updated patch, with the added comment, that I would like to push:

Actions #8

Updated by Simon K over 7 years ago

Dan, please send your webrev/patch to the developer mailing list. It's easier to discuss your fix there.

Dan Fields wrote:

Simon/Arne,

After more thought, I realize the queue size of 1 is a separate issue and isn't a factor in this particular bug. I originally conflated the queue size with this issue, but I was wrong about that. The crux of this is the deadlock brought about by waiting on a thread's own taskq. The code as it stands fixes the issue, and the added comment helps explain the unique situation. Changing the size of the queue is a separate matter; regardless of the queue size, this issue would still remain.

Dan

Here is the updated patch, with the added comment, that I would like to push:

Actions #9

Updated by Dan Fields over 7 years ago

Simon,

I posted to the [developer] mailing list.

Dan

Actions #10

Updated by Gordon Ross over 7 years ago

  • Assignee set to Dan Fields
Actions #11

Updated by Dan Fields over 7 years ago

  • Category set to driver - device drivers
  • Status changed from New to Resolved
  • % Done changed from 0 to 90
  • Estimated time set to 8.00 h
Actions #12

Updated by Dan Fields over 7 years ago

  • Status changed from Resolved to In Progress
Actions #13

Updated by Electric Monk over 7 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 90 to 100

git commit 6b4a8fe20e9d0459e1b43ecaea243341c29d0041

commit  6b4a8fe20e9d0459e1b43ecaea243341c29d0041
Author: Dan Fields <dan.fields@nexenta.com>
Date:   2016-01-08T16:18:24.000Z

    6256 mptsas: deadlock in mptsas_handle_topo_change
    Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
    Reviewed by: Albert Lee <trisk@omniti.com>
    Approved by: Dan McDonald <danmcd@omniti.com>

Actions #14

Updated by Marcel Telka over 6 years ago

  • Has duplicate Bug #4310: Another mpt_sas panic added
Actions #15

Updated by Marcel Telka over 6 years ago

  • Has duplicate Bug #5306: mpt_sas hangs up during IOC reset added