Bug #12988: potential hang in mlxcx when async and ring vectors end up on same CPU
Status: Closed
% Done: 100%
Description
On systems with fewer CPUs it is likely that the vectors for mlxcx_intr_n() and mlxcx_intr_async() will end up on the same CPU.
This can lead to a deadlock. Many code paths which issue commands to the HCA do so while holding the CQ or WQ mutex, and the CQ mutex in particular is also acquired by mlxcx_intr_n(). So, if the CQ mutex held across a command belongs to a CQ whose vector routine (mlxcx_intr_n()) is assigned to the same CPU as mlxcx_intr_async(), the blocked mlxcx_intr_n() routine will hold off the execution of mlxcx_intr_async() because they are scheduled at the same PIL.
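To make the blocking pattern concrete, here is a minimal sketch of the command side. The types, the field names (cq_mutex, cmd_done) and the post_command_to_hca() helper are illustrative stand-ins rather than the actual mlxcx code; only the synchronisation primitives (mutex_enter(), cv_wait()) are the real illumos ones visible in the stacks below.

#include <sys/types.h>
#include <sys/ksynch.h>

typedef struct sketch_cq {
	kmutex_t	cq_mutex;	/* stands in for the real CQ lock */
} sketch_cq_t;

typedef struct sketch_cmd {
	kmutex_t	cmd_mutex;
	kcondvar_t	cmd_cv;
	boolean_t	cmd_done;	/* set once mlxcx_intr_async() sees completion */
} sketch_cmd_t;

extern void post_command_to_hca(sketch_cmd_t *);	/* hypothetical helper */

/* Command path, cf. mlxcx_mac_ring_stop -> mlxcx_cmd_stop_sq -> mlxcx_cmd_wait. */
static void
stop_ring_sketch(sketch_cq_t *cq, sketch_cmd_t *cmd)
{
	mutex_enter(&cq->cq_mutex);		/* take the CQ lock ... */
	post_command_to_hca(cmd);
	mutex_enter(&cmd->cmd_mutex);
	while (!cmd->cmd_done)			/* ... and sleep, still holding it, */
		cv_wait(&cmd->cmd_cv, &cmd->cmd_mutex);
	mutex_exit(&cmd->cmd_mutex);
	mutex_exit(&cq->cq_mutex);		/* CQ lock dropped only after completion */
}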
For example:
THREAD           STATE    SOBJ                COUNT
ffffff001e6d7c20 SLEEP    MUTEX                   1
  swtch+0x86
  turnstile_block+0x25b
  mutex_vector_enter+0x358
  mlxcx_intr_n+0x128
  apix_dispatch_by_vector+0x8c
  apix_dispatch_lowlevel+0x1c

ffffff06ef707460 SLEEP    CV                      1
  swtch+0x133
  cv_wait+0x68
  mlxcx_cmd_wait+0x43
  mlxcx_cmd_stop_sq+0x111
  mlxcx_mac_ring_stop+0x20b
  mac_stop_ring+0x29
  mac_datapath_teardown+0x5d0
  mac_client_datapath_teardown+0x56
  mac_unicast_remove+0x29b
  aggr_set_port_sdu+0x47
  aggr_sdu_update+0x91
  aggr_m_setprop+0x6c
  mac_set_mtu+0xc4
  mac_set_prop+0x176
  drv_ioc_prop_common+0x4dd
  drv_ioc_setprop+0x1b
  drv_ioctl+0x1ef
  cdev_ioctl+0x2b
  spec_ioctl+0x45
  fop_ioctl+0x5b
  ioctl+0x153
  _sys_sysenter_post_swapgs+0x14f

ffffff0021952c20 SLEEP    MUTEX                   1
  swtch+0x133
  turnstile_block+0x25b
  mutex_vector_enter+0x358
  mlxcx_mac_ring_tx+0x168
  mac_hwring_tx+0x1c
  mac_hwring_send_priv+0x90
  aggr_ring_tx+0x1a
  mac_hwring_tx+0x1c
  mac_tx_send+0x63c
  mac_tx_soft_ring_process+0x89
  mac_tx_aggr_mode+0x8c
  mac_tx+0xdb
  str_mdata_fastpath_put+0x8e
  ip_xmit+0x843
  ire_send_wire_v4+0x345
  conn_ip_output+0x1d4
  tcp_send_data+0x58
  tcp_input_data+0x1f5a
  squeue_enter+0x3f9
  ip_fanout_v4+0xbca
  ip_input_local_v4+0xc6
  ire_recv_local_v4+0x131
  ill_input_short_v4+0x3ff
  ip_input_common_v4+0x23f
  ip_input+0x1f
  mac_rx_soft_ring_drain+0x115
  mac_soft_ring_worker+0xa1
  thread_start+0xb
and the held-off interrupt routine:
> fffffffffbc48000::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc48000  1f    1    6  -1   no    no t-1    ffffff001e605c20 (idle)
                  |      |      |
           RUNNING <--+  |      +--> PIL THREAD
             READY       |             6 ffffff001e6d7c20
          QUIESCED       |
            EXISTS       +--> PRI THREAD           PROC
            ENABLE             60 ffffff002002ac20 sched

> apixs/J
apixs:
apixs:          ffffff06d919c000

> ffffff06d919c000::print -t apix_impl_t
apix_impl_t {
    processorid_t x_cpuid = 0
    uint16_t x_intr_pending = 0x40
    struct autovec *[16] x_intr_head = [ 0, 0, 0, 0, 0, 0, 0xffffff06dda435a0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
    struct autovec *[16] x_intr_tail = [ 0, 0, 0, 0, 0, 0, 0xffffff06dda435a0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
    apix_vector_t *x_obsoletes = 0
    apix_vector_t *[256] x_vectbl = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]
    lock_t x_lock = 0
}

> 0xffffff06dda435a0::print -t struct autovec
struct autovec {
    struct autovec *av_link = 0
    uint_t (*)() av_vector = mlxcx_intr_async
    caddr_t av_intarg1 = 0xffffff06e4ab7dc0
    caddr_t av_intarg2 = 0xffffff06e90ea900
    uint64_t *av_ticksp = 0xffffff06e92af538
    uint_t av_prilevel = 0x6
    void *av_intr_id = 0xffffff06dd6179a8
    dev_info_t *av_dip = 0xffffff06dd5d0d50
    ushort_t av_flags = 0x100
    struct autovec *av_ipl_link = 0
}
Thread ffffff001e6d7c20 is blocked because it cannot get the CQ lock. That lock is held by ffffff06ef707460, which needs mlxcx_intr_async() to run before its command completes and it can release the CQ mutex.
A secondary effect is that any tx/rx which relies on vectors assigned to the same CPU will be blocked too.
The CQ and WQ mutexes are used to maintain consistency in fields which are only likely to change through user interaction (stopping/starting an interface, etc.). My proposed solution is to make mlxcx_intr_async() run at a higher PIL. Releasing the CQ/WQ locks in specific code paths could introduce bugs, and I am concerned we would not cover all the scenarios in which the lock handling needs to change.
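For illustration, here is a hedged sketch of how a driver can raise one vector's PIL through the DDI. This is not the literal mlxcx change (see the commit at the end of this issue); async_hdl is an assumed name for the handle of the async vector.

#include <sys/ddi.h>
#include <sys/sunddi.h>

/*
 * Raise the async vector one priority level above the ring vectors.
 * ddi_intr_set_pri() may only be called before the handler is added
 * with ddi_intr_add_handler(), and the new priority should remain
 * below ddi_intr_get_hilevel_pri() so the handler can still block
 * on mutexes.
 */
static int
raise_async_pri(ddi_intr_handle_t async_hdl)
{
	uint_t pri;

	if (ddi_intr_get_pri(async_hdl, &pri) != DDI_SUCCESS)
		return (DDI_FAILURE);

	/* e.g. 6 -> 7, matching the ::interrupts output shown below */
	if (ddi_intr_set_pri(async_hdl, pri + 1) != DDI_SUCCESS)
		return (DDI_FAILURE);

	return (DDI_SUCCESS);
}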
Related issues
Updated by Paul Winder almost 2 years ago
Testing alongside #12980.
And the vectors for mlxcx after the change:
# mdb -ke "::interrupts" | grep mlxcx
0/0x20  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
0/0x21  -  7  PCI  Edg  MSI-X  1  -  mlxcx_intr_async
0/0x22  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
1/0x23  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
1/0x24  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
1/0x25  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
2/0x22  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
2/0x23  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
3/0x21  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
3/0x22  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
4/0x22  -  7  PCI  Edg  MSI-X  1  -  mlxcx_intr_async
4/0x23  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
4/0x24  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
5/0x22  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
5/0x23  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
5/0x24  -  6  PCI  Edg  MSI-X  1  -  mlxcx_intr_n
Updated by Paul Winder almost 2 years ago
- Related to Bug #12980: attempting to change MTU on mlxcx based aggregation can induce FMA event added
Updated by Paul Winder almost 2 years ago
- Subject changed from potential hang in mlxcx when async and ring vectors end up same CPU to potential hang in mlxcx when async and ring vectors end up on same CPU
Updated by Paul Winder almost 2 years ago
Here is a description of the scenario which causes the deadlock.
The deadlock can only occur if an mlxcx_intr_n() vector and the mlxcx_intr_async() vector are assigned to the same CPU.
1. It starts when a "normal" thread runs a HCA command whilst holding a CQ lock - not uncommon, e.g. a user command which wants to stop the rings. When it does so, it acquires the WQ and CQ mutexes (the CQ mutex is the significant one), issues the command to stop the RQ to the HCA, and waits for the command to complete. It holds the CQ mutex whilst it is waiting.
2. CQs are assigned to interrupt vectors, and multiple CQs can be assigned to the same vector. Stopping a CQ has no direct effect on the vector, so the vector can still take interrupts. There are a couple of scenarios here too: other CQs can take events which trigger the vector, or the vector could already have been pending in APIX. Assume this vector (mlxcx_intr_n()) is the one sharing the CPU with mlxcx_intr_async(), and it is running on the CPU. It iterates over the CQs assigned to it, acquiring each CQ mutex as it goes - including the same CQ mutex acquired in step 1. Since the thread in step 1 already holds that mutex, the mlxcx_intr_n() vector will block.
3. The command issued in step 1 completes in the h/w, and the h/w generates an event which triggers an interrupt to run mlxcx_intr_async(). Significantly, this vector shares the CPU with the blocked mlxcx_intr_n() from step 2. Both vectors have the same IPL (6), so APIX sees a vector already running at IPL 6 on that CPU and holds off any interrupt threads at IPL 6 or lower. If mlxcx_intr_async() cannot run, it will not wake the waiting thread from step 1, which will never release the CQ mutex, which means step 2 remains blocked - this is the deadlock.
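The interrupt side, again as a hedged sketch (reusing the illustrative sketch_cq_t type from the description above; process_cq_events() is a hypothetical helper), with the hold-off marked where it bites:

extern void process_cq_events(sketch_cq_t *);	/* hypothetical helper */

/* cf. mlxcx_intr_n() in step 2; runs at PIL 6 on the shared CPU. */
static uint_t
intr_n_sketch(sketch_cq_t **cqs, int ncq)
{
	int i;

	for (i = 0; i < ncq; i++) {
		/*
		 * If the command thread from step 1 holds this CQ's lock,
		 * we block here. While this PIL-6 handler occupies the CPU,
		 * APIX holds off mlxcx_intr_async() at the same PIL, so the
		 * command never completes and the lock is never released:
		 * the deadlock of step 3.
		 */
		mutex_enter(&cqs[i]->cq_mutex);
		process_cq_events(cqs[i]);
		mutex_exit(&cqs[i]->cq_mutex);
	}
	return (DDI_INTR_CLAIMED);
}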
Updated by Electric Monk over 1 year ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
git commit e1447ca93391f31609bda487cb922dbff9dcdef5
commit e1447ca93391f31609bda487cb922dbff9dcdef5
Author: Paul Winder <pwinder@racktopsystems.com>
Date:   2020-08-26T14:21:08.000Z

    12980 attempting to change MTU on mlxcx based aggregation can induce FMA event
    12987 devo_power misconfigured in mlxcx
    12988 potential hang in mlxcx when async and ring vectors end up on same CPU
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Reviewed by: Dan McDonald <danmcd@joyent.com>
    Reviewed by: Igor Kozhukhov <igor@dilos.org>
    Approved by: Garrett D'Amore <garrett@damore.org>
Updated by Andy Fiddaman over 1 year ago
- Related to Bug #13108: mlxcx fails to attach on system using pcplusmp after 12988 added