Bug #12988

potential hang in mlxcx when async and ring vectors end up on same CPU

Added by Paul Winder 13 days ago. Updated 2 days ago.

Status:         In Progress
Priority:       Normal
Assignee:       -
Category:       -
Start date:     -
Due date:       -
% Done:         0%
Estimated time: -
Difficulty:     Medium
Tags:           -
Gerrit CR:      816

Description

On systems with fewer CPUs it is likely that the vectors for mlxcx_intr_n() and mlxcx_intr_async() will end up on the same CPU.

This can lead to a deadlock. Many of the code paths which issue commands to the HCA do so whilst holding the CQ or WQ mutex, and the CQ mutex in particular is also acquired by mlxcx_intr_n(). So, if the CQ mutex held across a command belongs to a CQ whose vector routine (mlxcx_intr_n()) is assigned to the same CPU as mlxcx_intr_async(), the blocked mlxcx_intr_n() routine will hold off the execution of mlxcx_intr_async(), because they are scheduled at the same PIL.

For example:

THREAD           STATE    SOBJ                COUNT
ffffff001e6d7c20 SLEEP    MUTEX                   1
                 swtch+0x86
                 turnstile_block+0x25b
                 mutex_vector_enter+0x358
                 mlxcx_intr_n+0x128
                 apix_dispatch_by_vector+0x8c
                 apix_dispatch_lowlevel+0x1c

ffffff06ef707460 SLEEP    CV                      1
                 swtch+0x133
                 cv_wait+0x68
                 mlxcx_cmd_wait+0x43
                 mlxcx_cmd_stop_sq+0x111
                 mlxcx_mac_ring_stop+0x20b
                 mac_stop_ring+0x29
                 mac_datapath_teardown+0x5d0
                 mac_client_datapath_teardown+0x56
                 mac_unicast_remove+0x29b
                 aggr_set_port_sdu+0x47
                 aggr_sdu_update+0x91
                 aggr_m_setprop+0x6c
                 mac_set_mtu+0xc4
                 mac_set_prop+0x176
                 drv_ioc_prop_common+0x4dd
                 drv_ioc_setprop+0x1b
                 drv_ioctl+0x1ef
                 cdev_ioctl+0x2b
                 spec_ioctl+0x45
                 fop_ioctl+0x5b
                 ioctl+0x153
                 _sys_sysenter_post_swapgs+0x14f

ffffff0021952c20 SLEEP    MUTEX                   1
                 swtch+0x133
                 turnstile_block+0x25b
                 mutex_vector_enter+0x358
                 mlxcx_mac_ring_tx+0x168
                 mac_hwring_tx+0x1c
                 mac_hwring_send_priv+0x90
                 aggr_ring_tx+0x1a
                 mac_hwring_tx+0x1c
                 mac_tx_send+0x63c
                 mac_tx_soft_ring_process+0x89
                 mac_tx_aggr_mode+0x8c
                 mac_tx+0xdb
                 str_mdata_fastpath_put+0x8e
                 ip_xmit+0x843
                 ire_send_wire_v4+0x345
                 conn_ip_output+0x1d4
                 tcp_send_data+0x58
                 tcp_input_data+0x1f5a
                 squeue_enter+0x3f9
                 ip_fanout_v4+0xbca
                 ip_input_local_v4+0xc6
                 ire_recv_local_v4+0x131
                 ill_input_short_v4+0x3ff
                 ip_input_common_v4+0x23f
                 ip_input+0x1f
                 mac_rx_soft_ring_drain+0x115
                 mac_soft_ring_worker+0xa1
                 thread_start+0xb

and the held-off interrupt routine:

> fffffffffbc48000::cpuinfo -v        
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc48000  1f    1    6  -1   no    no t-1    ffffff001e605c20 (idle)
                       |    |    |
            RUNNING <--+    |    +--> PIL THREAD
              READY         |           6 ffffff001e6d7c20
           QUIESCED         |
             EXISTS         +-->  PRI THREAD           PROC
             ENABLE                60 ffffff002002ac20 sched

> apixs/J
apixs:
apixs:          ffffff06d919c000 
> ffffff06d919c000::print -t apix_impl_t
apix_impl_t {
    processorid_t x_cpuid = 0
    uint16_t x_intr_pending = 0x40
    struct autovec *[16] x_intr_head = [ 0, 0, 0, 0, 0, 0, 0xffffff06dda435a0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
    struct autovec *[16] x_intr_tail = [ 0, 0, 0, 0, 0, 0, 0xffffff06dda435a0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
    apix_vector_t *x_obsoletes = 0
    apix_vector_t *[256] x_vectbl = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]
    lock_t x_lock = 0
}
> 0xffffff06dda435a0::print -t struct autovec
struct autovec {
    struct autovec *av_link = 0
    uint_t (*)() av_vector = mlxcx_intr_async
    caddr_t av_intarg1 = 0xffffff06e4ab7dc0
    caddr_t av_intarg2 = 0xffffff06e90ea900
    uint64_t *av_ticksp = 0xffffff06e92af538
    uint_t av_prilevel = 0x6
    void *av_intr_id = 0xffffff06dd6179a8
    dev_info_t *av_dip = 0xffffff06dd5d0d50
    ushort_t av_flags = 0x100
    struct autovec *av_ipl_link = 0
}

Thread ffffff001e6d7c20 is blocked because it can't get the CQ lock, which is held by ffffff06ef707460. That thread requires mlxcx_intr_async() to run to complete its command, after which it will release the CQ mutex. The apix output above shows the hold-off: x_intr_pending is 0x40 (a vector pending at IPL 6), and the pending autovec is mlxcx_intr_async() at av_prilevel 6, queued behind the running PIL 6 thread ffffff001e6d7c20.

A secondary effect is that any tx/rx which relies on vectors on the same CPU will be blocked too.
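
To make the blocking pattern concrete, here is a minimal sketch of the shape of the second stack above. All example_* names, types and fields are hypothetical, not the actual mlxcx code; the real path runs through mlxcx_mac_ring_stop() and mlxcx_cmd_wait():

#include <sys/types.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

typedef struct example_cq {
        kmutex_t        ecq_mtx;                /* the CQ mutex */
} example_cq_t;

typedef struct example_wq {
        kmutex_t        ewq_mtx;                /* the WQ mutex */
        example_cq_t    *ewq_cq;                /* CQ backing this WQ */
} example_wq_t;

typedef struct example {
        kmutex_t        ep_cmd_lock;
        kcondvar_t      ep_cmd_cv;
        boolean_t       ep_cmd_done;            /* set when the HCA completes */
} example_t;

extern void example_cmd_post_stop(example_t *, example_wq_t *);

static void
example_ring_stop(example_t *p, example_wq_t *wq)
{
        example_cq_t *cq = wq->ewq_cq;

        mutex_enter(&cq->ecq_mtx);              /* step 1: CQ mutex taken ... */
        mutex_enter(&wq->ewq_mtx);

        example_cmd_post_stop(p, wq);           /* issue stop command to the HCA */

        mutex_enter(&p->ep_cmd_lock);
        while (!p->ep_cmd_done) {
                /* ... and held across the wait; only the async vector wakes us */
                cv_wait(&p->ep_cmd_cv, &p->ep_cmd_lock);
        }
        mutex_exit(&p->ep_cmd_lock);

        mutex_exit(&wq->ewq_mtx);
        mutex_exit(&cq->ecq_mtx);               /* never reached in the deadlock */
}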

The CQ and WQ mutexes are used to maintain consistency of fields which are only likely to change through user interaction (stopping/starting an interface, etc.). My proposed solution is to set mlxcx_intr_async() to run at a higher PIL. I am concerned that releasing the CQ/WQ locks in specific code paths may expose us to bugs, and that we may not cover all the scenarios in which the lock manipulation needs to change.
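
One way the higher PIL could be requested is via the standard DDI interrupt interfaces. The sketch below is illustrative only, not the final change; the example_* names, the priority values and the assumption that vector 0 is the async vector are mine. Note ddi_intr_set_pri() must be called before ddi_intr_add_handler():

#include <sys/types.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

#define EXAMPLE_RING_INTR_PRI   6       /* mlxcx_intr_n() ring vectors */
#define EXAMPLE_ASYNC_INTR_PRI  7       /* mlxcx_intr_async(), one level higher */

/*
 * Illustrative only: assumes vector 0 is the async/event vector. With its
 * priority raised, a blocked mlxcx_intr_n() at PIL 6 on the same CPU can
 * no longer hold off mlxcx_intr_async().
 */
static int
example_set_intr_pris(ddi_intr_handle_t *handles, int count)
{
        int i;

        for (i = 0; i < count; i++) {
                uint_t pri = (i == 0) ?
                    EXAMPLE_ASYNC_INTR_PRI : EXAMPLE_RING_INTR_PRI;

                /* Must be called before ddi_intr_add_handler(). */
                if (ddi_intr_set_pri(handles[i], pri) != DDI_SUCCESS)
                        return (DDI_FAILURE);
        }
        return (DDI_SUCCESS);
}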


Related issues

Related to illumos gate - Bug #12980: attempting to change MTU on mlxcx based aggregation can induce FMA event (In Progress)


History

#1

Updated by Paul Winder 13 days ago

Testing alongside #12980.

And the vectors for mlxcx after the change - mlxcx_intr_async now runs at IPL 7, one above the mlxcx_intr_n ring vectors at IPL 6:

# mdb -ke "::interrupts"|grep mlxcx
0/0x20    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n
0/0x21    -   7   PCI    Edg MSI-X  1     -         mlxcx_intr_async
0/0x22    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n
1/0x23    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n
1/0x24    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n
1/0x25    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n
2/0x22    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n
2/0x23    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n
3/0x21    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n
3/0x22    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n
4/0x22    -   7   PCI    Edg MSI-X  1     -         mlxcx_intr_async
4/0x23    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n
4/0x24    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n
5/0x22    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n
5/0x23    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n
5/0x24    -   6   PCI    Edg MSI-X  1     -         mlxcx_intr_n

#2

Updated by Paul Winder 13 days ago

  • Related to Bug #12980: attempting to change MTU on mlxcx based aggregation can induce FMA event added
#3

Updated by Paul Winder 13 days ago

  • Subject changed from potential hang in mlxcx when async and ring vectors end up same CPU to potential hang in mlxcx when async and ring vectors end up on same CPU
#4

Updated by Electric Monk 13 days ago

  • Gerrit CR set to 816
#5

Updated by Paul Winder 2 days ago

Here is a description of the scenario which causes the deadlock.

The deadlock can only occur if a mlxcx_intr_n() and mlxcx_intr_async() vector are assigned to the same CPU.

1. It starts when a "normal" thread runs an HCA command whilst holding a CQ lock - not uncommon, e.g. when a user command is run which wants to stop the rings. When it does so, it acquires the WQ and CQ mutexes (the CQ mutex is the significant one), issues the command to stop the RQ to the HCA, and waits for the command to complete. It is holding the CQ mutex whilst it is waiting.

2. CQs are assigned to interrupt vectors, and multiple CQs can be assigned to the same vector. Stopping a CQ has no direct effect on the vector, so the vector can still take interrupts. There are a couple of scenarios here too: other CQs can take events which trigger the vector, or the vector could already have been pending in APIX. Assume this vector (mlxcx_intr_n()) is the one sharing the CPU with mlxcx_intr_async(), and it is running on the CPU. It iterates over the CQs assigned to it, acquiring each CQ mutex as it goes - including the same CQ mutex acquired in 1. Since the thread in 1. holds that mutex, the mlxcx_intr_n() vector will block.

3. The command issued in 1. completes in the h/w, and the h/w generates an event which triggers an interrupt to run mlxcx_intr_async(). Significantly, this vector shares the CPU with the blocked mlxcx_intr_n() from 2. Both these vectors have the same IPL - 6 - so APIX knows there is already a vector at IPL 6 running on the CPU, and will hold off any interrupt threads at IPL 6 or lower. If mlxcx_intr_async() cannot run, it will not wake the waiting thread in 1., which will never release the CQ mutex, which means 2. will remain blocked - this is the deadlock.
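
For reference, a sketch of the interrupt-side half of the deadlock (step 2), again with hypothetical example_* names; the real routine is mlxcx_intr_n(), which walks the CQs attached to its EQ:

#include <sys/types.h>
#include <sys/mutex.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

typedef struct example_cq {
        struct example_cq       *ecq_next;
        kmutex_t                ecq_mtx;
} example_cq_t;

typedef struct example_eq {
        example_cq_t    *eeq_cqs;       /* CQs assigned to this vector */
} example_eq_t;

extern void example_process_cq(example_cq_t *);

static uint_t
example_intr_n(caddr_t arg1, caddr_t arg2)
{
        example_eq_t *eq = (example_eq_t *)arg1;
        example_cq_t *cq;

        for (cq = eq->eeq_cqs; cq != NULL; cq = cq->ecq_next) {
                /*
                 * Step 2: blocks here if the thread from step 1 holds this
                 * CQ mutex.  The handler runs at PIL 6, so while it is
                 * blocked APIX holds off mlxcx_intr_async() (also PIL 6)
                 * on the same CPU - step 3.
                 */
                mutex_enter(&cq->ecq_mtx);
                example_process_cq(cq);
                mutex_exit(&cq->ecq_mtx);
        }
        return (DDI_INTR_CLAIMED);
}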
