Bug #6256
closedmptsas: deadlock in mptsas_handle_topo_change
100%
Description
We encountered a hanging mptsas driver, stuck at
> ffffff001fc69c40::findstack -v stack pointer for thread ffffff001fc69c40: ffffff001fc69780 [ ffffff001fc69780 _resume_from_idle+0xf4() ] ffffff001fc697b0 swtch+0x141() ffffff001fc697f0 cv_wait+0x70(ffffff04f52242aa, ffffff04f5224298) ffffff001fc69830 taskq_wait+0x43(ffffff04f5224278) ffffff001fc69850 ddi_taskq_wait+0x11(ffffff04f5224278) ffffff001fc698b0 mptsas_flush_hba+0x1e4(ffffff04f70f2000) ffffff001fc69900 mptsas_restart_ioc+0x7d(ffffff04f70f2000) ffffff001fc699b0 mptsas_ioc_task_management+0x2ca(ffffff04f70f2000, 3, 15, 0, 0, 0, ffffff0000000000) ffffff001fc69a00 mptsas_do_scsi_reset+0x90(ffffff04f70f2000, 15) ffffff001fc69af0 mptsas_handle_topo_change+0x763(ffffff04f6082978, 0) ffffff001fc69b60 mptsas_handle_dr+0x184(ffffff04f6082978) ffffff001fc69c20 taskq_thread+0x2d0(ffffff04f5224278) ffffff001fc69c30 thread_start+8()
The taskq it waits for only contains one taskq, namely itself:
> ffffff04f5224278::taskq -at ADDR NAME ACT/THDS Q'ED MAXQ INST ffffff04f5224278 mpt_sas_mptsas_dr_taskq 1/ 1 1 1 0 THREAD STATE SOBJ COUNT ffffff001fc69c40 SLEEP CV 1 swtch+0x141 cv_wait+0x70 taskq_wait+0x43 ddi_taskq_wait+0x11 mptsas_flush_hba+0x1e4 mptsas_restart_ioc+0x7d mptsas_ioc_task_management+0x2ca mptsas_do_scsi_reset+0x90 mptsas_handle_topo_change+0x763 mptsas_handle_dr+0x184 taskq_thread+0x2d0 thread_start+8
Files
Related issues
Updated by Arne Jansen over 7 years ago
- Related to Bug #3195: mpt_sas IOC reset races can cause panics added
Updated by Arne Jansen over 7 years ago
The ddi_taskq_wait has been added with
Author: Albert Lee <trisk@nexenta.com> Date: Fri Aug 31 14:53:16 2012 -0400 3195 mpt_sas IOC reset races can cause panics Reviewed by: Dan McDonald <danmcd@nexenta.com> Reviewed by: Keith Wesolowski <keith.wesolowski@joyent.com> Approved by: Garrett D'Amore <garrett.damore@gmail.com>
Updated by Dan Fields over 7 years ago
- File BUG6256.jpg BUG6256.jpg added
Arne,
Proposed fix (see below). It has been running for a couple of weeks and no occurrences of deadlock or the previous
problem seen. It does fix the deadlock on the taskq as it is coded with one thread per queue - but if that changes, outcome is indeterminate. To get rid of this altogether I agree with re-writing the firmware reset handling to exist in one thread and coordinate that with the few places requesting it in the driver.
Arne Jansen wrote:
We encountered a hanging mptsas driver, stuck at
[...]
The taskq it waits for only contains one taskq, namely itself:
[...]
Updated by Arne Jansen over 7 years ago
Dan,
I see how this fixes the deadlock. Can you please also ASSERT that the taskq really only has one thread and add a comment to the taskq_create why the number of threads has to be set to one?
Updated by Joshua M. Clulow over 7 years ago
If you're 100% sure this is invariant, you should use VERIFY()
rather than ASSERT()
.
Updated by Simon K over 7 years ago
I would be glad to see the (modified) patch upstream soon :)
Dan Fields wrote:
Arne,
Proposed fix (see below). It has been running for a couple of weeks and no occurrences of deadlock or the previous
problem seen. It does fix the deadlock on the taskq as it is coded with one thread per queue - but if that changes, outcome is indeterminate. To get rid of this altogether I agree with re-writing the firmware reset handling to exist in one thread and coordinate that with the few places requesting it in the driver.
Arne Jansen wrote:
We encountered a hanging mptsas driver, stuck at
[...]
The taskq it waits for only contains one taskq, namely itself:
[...]
Updated by Dan Fields over 7 years ago
- File BUG6256-2.jpg BUG6256-2.jpg added
Simon/Arne,
After more thought, I realize the queue size of 1 is a different issue and isn't a factor in this particular bug.
I originally conflated the queue size with this issue but was wrong about that. The crux of this is the deadlock
brought about by waiting on a threads own taskq. The code as it stands fixes the issue and the addition of
the comment helps explain the unique situation. Changing the size of the queue is a different issue and even
with a queue size greater than or equal to 1 this issue would still remain.
Dan
Here it is updated patch, with added comment, I would like to push:
Updated by Simon K over 7 years ago
Dan, please send your webrev/patch to the developer mailing list. It's easier to discuss your fix there.
Dan Fields wrote:
Simon/Arne,
After more thought, I realize the queue size of 1 is a different issue and isn't a factor in this particular bug.
I originally conflated the queue size with this issue but was wrong about that. The crux of this is the deadlock
brought about by waiting on a threads own taskq. The code as it stands fixes the issue and the addition of
the comment helps explain the unique situation. Changing the size of the queue is a different issue and even
with a queue size greater than or equal to 1 this issue would still remain.Dan
Here it is updated patch, with added comment, I would like to push:
Updated by Dan Fields over 7 years ago
Simon,
I posted to the [developer] mailing list.
Dan
Updated by Dan Fields over 7 years ago
- Category set to driver - device drivers
- Status changed from New to Resolved
- % Done changed from 0 to 90
- Estimated time set to 8.00 h
Updated by Dan Fields over 7 years ago
- Status changed from Resolved to In Progress
Updated by Electric Monk over 7 years ago
- Status changed from In Progress to Closed
- % Done changed from 90 to 100
git commit 6b4a8fe20e9d0459e1b43ecaea243341c29d0041
commit 6b4a8fe20e9d0459e1b43ecaea243341c29d0041 Author: Dan Fields <dan.fields@nexenta.com> Date: 2016-01-08T16:18:24.000Z 6256 mptsas: deadlock in mptsas_handle_topo_change Reviewed by: Rick McNeal <rick.mcneal@nexenta.com> Reviewed by: Albert Lee <trisk@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com>
Updated by Marcel Telka over 6 years ago
- Has duplicate Bug #4310: Another mpt_sas panic added
Updated by Marcel Telka over 6 years ago
- Has duplicate Bug #5306: mpt_sas hangs up during IOC reset added