Project

General

Profile

Bug #12481

attempting to change MTU size of aggregate over mlxcx can cause dladm to hang

Added by Paul Winder 8 months ago. Updated 7 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
driver - device drivers
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

If you create an aggr over mlxcx devices and then issue the command:

dladm set-linkprop -p mtu=1500 aggr1

and the link is in use, you will get one of two different hangs.
Either
> ffffd0328837dba0::findstack -v
stack pointer for thread ffffd0328837dba0 (dladm/1): ffffd000f63ff690
[ ffffd000f63ff690 _resume_from_idle+0x12b() ]
  ffffd000f63ff6c0 swtch+0x133()
  ffffd000f63ff700 cv_wait+0x68(ffffd0323fb00568, ffffd0323fb00520)
  ffffd000f63ff760 mlxcx_mac_ring_stop+0x11c(ffffd0328d401000)
  ffffd000f63ff790 mac_stop_ring+0x29(ffffd03277726720)
  ffffd000f63ff7c0 mac_stop_group_and_rings+0x38(ffffd03234f3c000)
  ffffd000f63ff7f0 mac_stop+0x33(ffffd031f41fc5c0)
  ffffd000f63ff850 mac_client_datapath_teardown+0x1c9(ffffd031f365c718, ffffd0328f7a1798, 0)
  ffffd000f63ff8a0 mac_unicast_remove+0x29b(ffffd031f365c718, ffffd0328f7a1798)
  ffffd000f63ff920 aggr_set_port_sdu+0x47(ffffd03cf66a5128, ffffd03345ccb308, 5dc, ffffd0328f2dbb08)
  ffffd000f63ff980 aggr_sdu_update+0x91(ffffd03cf66a5128, 5dc)
  ffffd000f63ff9e0 aggr_m_setprop+0x6c(ffffd03cf66a5128, fffffffffba08cdb, 6, 4, ffffd000f63ff9fc)
  ffffd000f63ffa50 mac_set_mtu+0xc4(ffffd03c89987138, 5dc, 0)
  ffffd000f63ffad0 mac_set_prop+0x136(ffffd03c89987138, 6, ffffd03c265e6010, ffffd03c265e6114, 4)
  ffffd000f63ffba0 drv_ioc_prop_common+0x4dd(ffffd03ea565aa80, 807e3c0, 1, ffffd0334ce09860, 100003)
  ffffd000f63ffbe0 drv_ioc_setprop+0x1b(ffffd03ea565aa80, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8)
  ffffd000f63ffc80 drv_ioctl+0x1ef(1200000000, d1d001b, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8)
  ffffd000f63ffcc0 cdev_ioctl+0x2b(1200000000, d1d001b, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8)
  ffffd000f63ffd10 spec_ioctl+0x45(ffffd031f0351500, d1d001b, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8, 0)
  ffffd000f63ffda0 fop_ioctl+0x5b(ffffd031f0351500, d1d001b, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8, 0)
  ffffd000f63ffec0 ioctl+0x153(3, d1d001b, 807e3c0)
  ffffd000f63fff10 _sys_sysenter_post_swapgs+0x14f()

Some analysis of this particular case:
The function mlxcx_mac_ring_stop() is waiting here:
                mutex_enter(&s->mlbs_mtx);
                while (!list_is_empty(&s->mlbs_busy))
                        cv_wait(&s->mlbs_free_nonempty, &s->mlbs_mtx);
                while ((buf = list_head(&s->mlbs_free)) != NULL) {
                        mlxcx_buf_destroy(mlxp, buf);
                }
                mutex_exit(&s->mlbs_mtx);

If we take a look at the mlbs_busy list, we see:
> ffffd0323fb00528::walk list|::print -t mlxcx_buffer_t mlb_state mlb_mp
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
mblk_t *mlb_mp = 0xffffd03cc8d0aec0
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
mblk_t *mlb_mp = 0xffffd03c81a2e8c0
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
mblk_t *mlb_mp = 0xffffd03e82f7b240

There are 3 buffers "on loan". These are receive buffers which have been passed up the stack, and in this case they are being held on to.
With some judicious kgrep'ing, I found this thread:
> ffffd000fb4a9c20::findstack -v
stack pointer for thread ffffd000fb4a9c20 (aggr_lacp_rx_thread()): ffffd000fb4a9a50
[ ffffd000fb4a9a50 _resume_from_idle+0x12b() ]
  ffffd000fb4a9a80 swtch+0x133()
  ffffd000fb4a9ac0 cv_wait+0x68(ffffd03c89987224, ffffd03c89987210)
  ffffd000fb4a9b00 i_mac_perim_enter+0x5b(ffffd03c89987138)
  ffffd000fb4a9b30 mac_perim_enter_by_mh+0x23(ffffd03c89987138, ffffd000fb4a9b40)
  ffffd000fb4a9b70 aggr_lacp_rx+0x3d(ffffd03cc8d0aec0)
  ffffd000fb4a9c00 aggr_lacp_rx_thread+0xd2(ffffd03cf66a5128)
  ffffd000fb4a9c10 thread_start+0xb()

The argument to aggr_lacp_rx() is a mblk, and it is one of the mlxcx mblk. This thread is blocked in i_mac_perim_enter() because it wants to be the owner.
A closer look at the stack in mlxcx reveals a call to mac_set_mtu() with the same address as the argument to i_mac_perim_enter(). Which is a mac_impl_t:
> ffffd03c89987138::print -t mac_impl_t mi_perim_owner
kthread_t *mi_perim_owner = 0xffffd0328837dba0

which is indeed owned by the thread blocked in mlxcx.
So, this is the deadlock: mlxcx thread is the perimeter owner, and is blocked because it has a mblk which is in use. That in use mblk is owned by another aggr thread which wants to be the perimeter owner.
The solution is not to block the mlxcx_mac_ring_stop() because of loaned mblk, but we do need to handle them be returned to use after the ring has stopped.

And another cause:

fffffe63205b2c20 SLEEP    MUTEX                   1
                 swtch+0x86
                 turnstile_block+0x25b
                 mutex_vector_enter+0x358
                 mlxcx_mac_ring_tx+0x170
                 mac_hwring_tx+0x1c
                 mac_hwring_send_priv+0x90
                 aggr_ring_tx+0x1a
                 mac_hwring_tx+0x1c
                 mac_tx_send+0x63c
                 mac_tx_soft_ring_process+0x89
                 mac_tx_aggr_mode+0x8c
                 mac_tx+0xdb
                 str_mdata_fastpath_put+0x8e
                 ip_xmit+0x841
                 ire_send_wire_v4+0x345
                 conn_ip_output+0x1d4
                 tcp_send_data+0x58
                 tcp_send+0x8c2
                 tcp_wput_data+0x68a
                 tcp_input_data+0xab7
                 squeue_enter+0x3f9
                 ip_fanout_v4+0xbca
                 ip_input_local_v4+0xc6
                 ire_recv_local_v4+0x131
                 ill_input_short_v4+0x3ff
                 ip_input_common_v4+0x23f
                 ip_input+0x1f
                 mac_rx_soft_ring_process+0x1be
                 mac_rx_srs_proto_fanout+0x29a
                 mac_rx_srs_drain+0x2e5
                 mac_rx_srs_process+0x123
                 mac_rx_classify+0x7c
                 mac_rx_flow+0x58
                 mac_rx_common+0x1d6
                 mac_rx+0xc6
                 mac_rx_ring+0x1f
                 aggr_mac_rx+0x1c
                 aggr_recv_path_cb+0x12f
                 aggr_recv_cb+0xd
                 mac_rx_common+0x20b  
                 mac_rx+0xc6
                 mac_rx_ring+0x1f
                 mlxcx_intr_n+0x191
                 apix_dispatch_by_vector+0x8c
                 apix_dispatch_lowlevel+0x1c

In this case the mac_set_mtu() is holding the WQ mutex (amongst others), and waiting for the bufs to drain. But the stack, is waiting for the same mutex, but has just taken a buffer.... deadlock again.

#1

Updated by Paul Winder 8 months ago

The problem is the mlxcx mac stop callback function mlxcx_mac_ring_stop() grabs the completion and work queue mutex and then waits for the buffer queues to drain whilst holding them. Holding the completion mutex blocks the interrupt routine, so we need to shutdown the queues, release the locks and then wait for the buffer lists to drain.

#2

Updated by Electric Monk 7 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 70 to 100

git commit 19325e87abbf92e73010df2342fc48019027aee1

commit  19325e87abbf92e73010df2342fc48019027aee1
Author: Paul Winder <pwinder@racktopsystems.com>
Date:   2020-04-30T07:09:30.000Z

    12481 attempting to change MTU size of aggregate over mlxcx can cause dladm to hang
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Reviewed by: Andy Stormont <astormont@racktopsystems.com>
    Reviewed by: Dan McDonald <danmcd@joyent.com>
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Approved by: Garrett D'Amore <garrett@damore.org>

Also available in: Atom PDF