Bug #5306
mpt_sas hangs up during IOC reset
Status: Closed
Priority: Normal
Assignee: -
Category: driver - device drivers
Start date: 2014-11-10
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:
External Bug:
Description
Platform:
LSI SAS-9207 dual-port HBA card connected to two expanders with SAS drives attached.
MPxIO is enabled for those SAS drives.
Symptom:
When one SAS cable is pulled, the mpt_sas HBA driver sometimes hangs. There is a deadlock between the mptsas_handle_topo_change() thread and the mptsas_flush_hba() thread.
> ::stacks -m mpt_sas
ffffff01ea832c40 SLEEP    CV                      2
                 swtch+0x141
                 cv_wait+0x70
                 ndi_devi_enter+0x7f
                 mptsas_bus_config+0xaf
                 scsi_hba_bus_config+0x70
                 devi_config_common+0xa5
                 ndi_devi_config+0x1a
                 bus_config_phci+0xbb
                 thread_start+8

ffffff01ec1bcc40 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait+0x70
                 ndi_devi_enter+0x7f
                 mptsas_bus_config+0xaf
                 scsi_hba_bus_config+0x70
                 devi_config_common+0xa5
                 mt_config_thread+0x58
                 thread_start+8

ffffff01e97bac40 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait+0x70
                 ndi_devi_enter+0x7f
                 mptsas_handle_topo_change+0x3d5
                 mptsas_handle_dr+0x184
                 taskq_thread+0x2d0
                 thread_start+8

ffffff01eab8ac40 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait+0x70
                 taskq_wait+0x43
                 ddi_taskq_wait+0x11
                 mptsas_taskq_wait+0x23
                 mptsas_flush_hba+0x1dc
                 mptsas_restart_ioc+0x79
                 mptsas_do_passthru+0x588
                 mptsas_smp_start+0x139
                 smp_transport+0x16
                 smp_probe+0x8f
                 mptsas_probe_smp+0x58
                 mptsas_online_smp+0x63
                 mptsas_config_all+0xe6
                 mptsas_bus_config+0x140
                 scsi_hba_bus_config+0x70
                 devi_config_common+0xa5
                 ndi_devi_config+0x1a
                 bus_config_phci+0xbb
                 thread_start+8
Analysis:
Thread A:
> ffffff01e97bac40::findstack -v
stack pointer for thread ffffff01e97bac40: ffffff01e97ba980
[ ffffff01e97ba980 _resume_from_idle+0xf4() ]
  ffffff01e97ba9b0 swtch+0x141()
  ffffff01e97ba9f0 cv_wait+0x70(ffffff42e2d9095c, ffffff42e2d90870)
  ffffff01e97baa30 ndi_devi_enter+0x7f(ffffff42e2d90808, ffffff01e97baaa8)
  ffffff01e97baaf0 mptsas_handle_topo_change+0x3d5(ffffff43da4ae358, ffffff43848492b8)
  ffffff01e97bab60 mptsas_handle_dr+0x184(ffffff440b0136a8)
  ffffff01e97bac20 taskq_thread+0x2d0(ffffff4384102238)
  ffffff01e97bac30 thread_start+8()
Thread A is a worker thread of m_dr_taskq:
> ffffff01e97bac40::print -t "kthread_t" t_taskq
void *t_taskq = 0xffffff4384102238
Thread A is blocked by thread B, which is the busy thread of the node that thread A is trying to enter:
> ffffff42e2d90808::print -t "struct dev_info" devi_busy_thread
void *devi_busy_thread = 0xffffff01eab8ac40
Thread B is in turn blocked waiting for m_dr_taskq, which thread A is running on:
> 0xffffff01eab8ac40::findstack -v
stack pointer for thread ffffff01eab8ac40: ffffff01eab8a2b0
[ ffffff01eab8a2b0 _resume_from_idle+0xf4() ]
  ffffff01eab8a2e0 swtch+0x141()
  ffffff01eab8a320 cv_wait+0x70(ffffff438410226a, ffffff4384102258)
  ffffff01eab8a360 taskq_wait+0x43(ffffff4384102238)
  ffffff01eab8a380 ddi_taskq_wait+0x11(ffffff4384102238)
  ffffff01eab8a3a0 mptsas_taskq_wait+0x23(ffffff4384102238)
  ffffff01eab8a400 mptsas_flush_hba+0x1dc(ffffff4383d10000)
  ffffff01eab8a440 mptsas_restart_ioc+0x79(ffffff4383d10000)
  ffffff01eab8a640 mptsas_do_passthru+0x588(ffffff4383d10000, ffffff01eab8a690, ffffff01eab8a6c0, ffffff01eab8a770, 20, 1c, ffffff010000003c, ffffffff00000003, ffffff01eab8a7e0, ffffff0100000004, ffffff010000003c, 80000000)
  ffffff01eab8a730 mptsas_smp_start+0x139(ffffff01eab8a7b0)
  ffffff01eab8a750 smp_transport+0x16(ffffff01eab8a7b0)
  ffffff01eab8a830 smp_probe+0x8f(ffffff01eab8a850)
  ffffff01eab8a8a0 mptsas_probe_smp+0x58(ffffff4384849010, 500093d001de103f)
  ffffff01eab8a9e0 mptsas_online_smp+0x63(ffffff4384849010, ffffff435d0c8dc8, ffffff01eab8aa08)
  ffffff01eab8aa50 mptsas_config_all+0xe6(ffffff4384849010)
  ffffff01eab8ab00 mptsas_bus_config+0x140(ffffff4384849010, 4004000, 2, ffffffff, 0)
  ffffff01eab8ab70 scsi_hba_bus_config+0x70(ffffff4384849010, 4004000, 2, ffffffff, 0)
  ffffff01eab8abc0 devi_config_common+0xa5(ffffff4384849010, 4004000, ffffffff)
  ffffff01eab8abe0 ndi_devi_config+0x1a(ffffff4384849010, 4004000)
  ffffff01eab8ac20 bus_config_phci+0xbb(ffffff438ffdb008)
  ffffff01eab8ac30 thread_start+8()
Deadlock!
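The circular wait shown by the two stacks can be written down as a tiny wait-for graph. The following is only an illustrative user-space sketch, not driver code: the names THREAD_A, THREAD_B, in_wait_cycle() and dump_is_deadlocked() are made up for this example, and it assumes each thread waits on at most one other thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two threads seen in the dump. */
enum { THREAD_A = 0, THREAD_B = 1, NTHREADS = 2, NO_WAIT = -1 };

/*
 * waits_for[i] is the index of the thread that thread i is blocked on,
 * or NO_WAIT if thread i is runnable. Returns true if following the
 * chain of waits from `start` leads back to `start`.
 */
static bool
in_wait_cycle(const int *waits_for, int nthreads, int start)
{
	int cur = start;
	int steps;

	assert(start >= 0 && start < nthreads);
	for (steps = 0; steps < nthreads; steps++) {
		cur = waits_for[cur];
		if (cur == NO_WAIT)
			return (false);
		assert(cur >= 0 && cur < nthreads);
		if (cur == start)
			return (true);
	}
	return (false);
}

/* Encode the two edges read from the mdb output above. */
static bool
dump_is_deadlocked(void)
{
	int waits_for[NTHREADS];

	/* A sleeps in ndi_devi_enter(); devi_busy_thread is B. */
	waits_for[THREAD_A] = THREAD_B;
	/* B sleeps in taskq_wait() on the taskq that A is running on. */
	waits_for[THREAD_B] = THREAD_A;

	return (in_wait_cycle(waits_for, NTHREADS, THREAD_A));
}
```

Calling dump_is_deadlocked() reports a cycle for the two edges above. Removing either edge (for example, setting thread B's entry to NO_WAIT) breaks the cycle.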
When an expander device is removed, DR events are dispatched and handled by mpt_sas in this order:
SMP device --> disk physical path --> SES pseudo device.
https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/io/scsi/adapters/mpt_sas/mptsas.c#L6217
    /*
     * If HBA is being reset, don't perform operations depending
     * on the IOC. We must free the topo list, however.
     */
    if (!mpt->m_in_reset)
        mptsas_handle_topo_change(topo_node, parent);
    else
        NDBG20(("skipping topo change received during reset"));
However, mptsas_handle_topo_change() temporarily releases m_mutex, so another thread can take the mutex and start an IOC reset (setting m_in_reset) in the middle of DR event handling.
    mutex_exit(&mpt->m_mutex);
    ndi_devi_enter(scsi_vhci_dip, &circ);
    ndi_devi_enter(parent, &circ1);
    rval = mptsas_offline_target(parent, addr);
    ndi_devi_exit(parent, circ1);
    ndi_devi_exit(scsi_vhci_dip, circ);

    /*
     * Hold the nexus across the bus_config
     */
    ndi_devi_enter(scsi_vhci_dip, &circ);
    ndi_devi_enter(pdip, &circ1);

    /*
     * Drain the taskqs prior to reallocating resources.
     */
    mutex_exit(&mpt->m_mutex);
    ddi_taskq_wait(mpt->m_event_taskq);
    ddi_taskq_wait(mpt->m_dr_taskq);
    mutex_enter(&mpt->m_mutex);
This can lead to the following deadlock:
- The expander device is removed.
- The SMP removal event is propagated by syseventd while DR events are still pending in m_dr_taskq.
- An SMP sysevent listener (e.g. FMA, prtconf) then queries the devinfo tree.
- Devinfo finds that the info cache is invalid and attempts to reconfigure all devices.
- While the phci nodes are being configured, scsi_vhci_dip is held in mptsas_bus_config().
- The configure thread then acquires m_mutex, because mptsas_handle_dr() has temporarily released it.
- The configure thread times out while probing the expander SMP device and proceeds to restart the IOC. With scsi_vhci_dip still held, it waits for mptsas_handle_dr() to finish.
- mptsas_handle_dr() --> mptsas_handle_topo_change() is waiting for scsi_vhci_dip in order to offline the device in the devinfo tree. The two threads are deadlocked.
Summary:
This is a corner-case side effect of #3195, which does not check for and release scsi_vhci_dip, the root MPxIO node, when the IOC reset thread owns it.
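One possible direction, shown purely as a hypothetical user-space model and not as the change made for this bug or its duplicate, is for the DR handler to re-check the reset flag whenever it cannot enter the node, and to skip the topology change instead of blocking. Everything here is an assumption for illustration: struct model_hba, model_init() and model_handle_topo_change() are invented names, and C11 atomics stand in for m_in_reset and the busy state of scsi_vhci_dip.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for driver state; not the real mptsas_t. */
struct model_hba {
	atomic_bool in_reset;	/* models mpt->m_in_reset */
	atomic_flag node_busy;	/* models scsi_vhci_dip being held */
};

enum dr_result { DR_HANDLED, DR_SKIPPED };

static void
model_init(struct model_hba *h)
{
	assert(h != NULL);
	atomic_init(&h->in_reset, false);
	atomic_flag_clear(&h->node_busy);
}

/*
 * Try to enter the node. If it is busy and a reset has started in the
 * meantime, give up rather than wait for the thread doing the reset,
 * which may itself be waiting for this handler to finish.
 */
static enum dr_result
model_handle_topo_change(struct model_hba *h)
{
	assert(h != NULL);
	while (atomic_flag_test_and_set(&h->node_busy)) {
		if (atomic_load(&h->in_reset))
			return (DR_SKIPPED);
		/* A real driver would sleep here; this model just retries. */
	}

	/* ... the target would be offlined here ... */

	atomic_flag_clear(&h->node_busy);
	return (DR_HANDLED);
}
```

In this model a handler that finds the node free does its work and returns DR_HANDLED, while a handler that finds the node busy during a reset returns DR_SKIPPED, mirroring the existing "skipping topo change received during reset" path. Whether skipping is safe in the real driver depends on how the topology is rebuilt after the reset, which this sketch does not address.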
Related issues
Updated by Marcel Telka over 6 years ago
- Is duplicate of Bug #6256: mptsas: deadlock in mptsas_handle_topo_change added
Updated by Marcel Telka over 6 years ago
- Category set to driver - device drivers
- Status changed from New to Closed
This is a duplicate of #6256.