Bug #12481

attempting to change MTU size of aggregate over mlxcx can cause dladm to hang

Added by Paul Winder 10 months ago. Updated 9 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: driver - device drivers
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
Description

If you create an aggr over mlxcx devices and then issue the command:

dladm set-linkprop -p mtu=1500 aggr1

and the link is in use, you will get one of two different hangs.
Either
> ffffd0328837dba0::findstack -v
stack pointer for thread ffffd0328837dba0 (dladm/1): ffffd000f63ff690
[ ffffd000f63ff690 _resume_from_idle+0x12b() ]
  ffffd000f63ff6c0 swtch+0x133()
  ffffd000f63ff700 cv_wait+0x68(ffffd0323fb00568, ffffd0323fb00520)
  ffffd000f63ff760 mlxcx_mac_ring_stop+0x11c(ffffd0328d401000)
  ffffd000f63ff790 mac_stop_ring+0x29(ffffd03277726720)
  ffffd000f63ff7c0 mac_stop_group_and_rings+0x38(ffffd03234f3c000)
  ffffd000f63ff7f0 mac_stop+0x33(ffffd031f41fc5c0)
  ffffd000f63ff850 mac_client_datapath_teardown+0x1c9(ffffd031f365c718, ffffd0328f7a1798, 0)
  ffffd000f63ff8a0 mac_unicast_remove+0x29b(ffffd031f365c718, ffffd0328f7a1798)
  ffffd000f63ff920 aggr_set_port_sdu+0x47(ffffd03cf66a5128, ffffd03345ccb308, 5dc, ffffd0328f2dbb08)
  ffffd000f63ff980 aggr_sdu_update+0x91(ffffd03cf66a5128, 5dc)
  ffffd000f63ff9e0 aggr_m_setprop+0x6c(ffffd03cf66a5128, fffffffffba08cdb, 6, 4, ffffd000f63ff9fc)
  ffffd000f63ffa50 mac_set_mtu+0xc4(ffffd03c89987138, 5dc, 0)
  ffffd000f63ffad0 mac_set_prop+0x136(ffffd03c89987138, 6, ffffd03c265e6010, ffffd03c265e6114, 4)
  ffffd000f63ffba0 drv_ioc_prop_common+0x4dd(ffffd03ea565aa80, 807e3c0, 1, ffffd0334ce09860, 100003)
  ffffd000f63ffbe0 drv_ioc_setprop+0x1b(ffffd03ea565aa80, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8)
  ffffd000f63ffc80 drv_ioctl+0x1ef(1200000000, d1d001b, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8)
  ffffd000f63ffcc0 cdev_ioctl+0x2b(1200000000, d1d001b, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8)
  ffffd000f63ffd10 spec_ioctl+0x45(ffffd031f0351500, d1d001b, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8, 0)
  ffffd000f63ffda0 fop_ioctl+0x5b(ffffd031f0351500, d1d001b, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8, 0)
  ffffd000f63ffec0 ioctl+0x153(3, d1d001b, 807e3c0)
  ffffd000f63fff10 _sys_sysenter_post_swapgs+0x14f()

Some analysis of this particular case:
The function mlxcx_mac_ring_stop() is waiting here:
                mutex_enter(&s->mlbs_mtx);
                while (!list_is_empty(&s->mlbs_busy))
                        cv_wait(&s->mlbs_free_nonempty, &s->mlbs_mtx);
                while ((buf = list_head(&s->mlbs_free)) != NULL) {
                        mlxcx_buf_destroy(mlxp, buf);
                }
                mutex_exit(&s->mlbs_mtx);

If we take a look at the mlbs_busy list, we see:
> ffffd0323fb00528::walk list|::print -t mlxcx_buffer_t mlb_state mlb_mp
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
mblk_t *mlb_mp = 0xffffd03cc8d0aec0
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
mblk_t *mlb_mp = 0xffffd03c81a2e8c0
mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN)
mblk_t *mlb_mp = 0xffffd03e82f7b240

There are 3 buffers "on loan". These are receive buffers which have been passed up the stack and, in this case, are still being held onto.
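To illustrate what "on loan" means here, below is a minimal sketch, not the actual mlxcx source (the ex_* names are made up), of how a driver typically loans a receive buffer up the stack: the DMA memory is wrapped in an mblk with desballoc(9F), and the registered free routine only runs when the rest of the stack finally calls freemsg() on it. Until then the driver cannot reuse or free the buffer.

    #include <sys/types.h>
    #include <sys/stream.h>

    /* Illustrative buffer states, loosely mirroring mlxcx_buffer_state_t. */
    typedef enum { EX_BUF_FREE, EX_BUF_BUSY, EX_BUF_ON_LOAN } ex_buf_state_t;

    typedef struct ex_buf {
        frtn_t          exb_frtn;   /* free routine + arg for desballoc */
        uint8_t         *exb_va;    /* receive DMA memory */
        size_t          exb_len;
        ex_buf_state_t  exb_state;
    } ex_buf_t;

    /* Runs from freemsg(), possibly long after the ring has been stopped. */
    static void
    ex_buf_recycle(caddr_t arg)
    {
        ex_buf_t *b = (ex_buf_t *)arg;

        b->exb_state = EX_BUF_FREE;
        /* ... move b from the busy list back onto the free list ... */
    }

    /* Wrap the buffer in an mblk and hand it to mac_rx_ring(). */
    static mblk_t *
    ex_rx_loan(ex_buf_t *b)
    {
        mblk_t *mp;

        b->exb_frtn.free_func = ex_buf_recycle;
        b->exb_frtn.free_arg = (caddr_t)b;
        mp = desballoc(b->exb_va, b->exb_len, 0, &b->exb_frtn);
        if (mp != NULL)
            b->exb_state = EX_BUF_ON_LOAN;  /* the stack now holds the data */
        return (mp);
    }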
With some judicious kgrep'ing, I found this thread:
> ffffd000fb4a9c20::findstack -v
stack pointer for thread ffffd000fb4a9c20 (aggr_lacp_rx_thread()): ffffd000fb4a9a50
[ ffffd000fb4a9a50 _resume_from_idle+0x12b() ]
  ffffd000fb4a9a80 swtch+0x133()
  ffffd000fb4a9ac0 cv_wait+0x68(ffffd03c89987224, ffffd03c89987210)
  ffffd000fb4a9b00 i_mac_perim_enter+0x5b(ffffd03c89987138)
  ffffd000fb4a9b30 mac_perim_enter_by_mh+0x23(ffffd03c89987138, ffffd000fb4a9b40)
  ffffd000fb4a9b70 aggr_lacp_rx+0x3d(ffffd03cc8d0aec0)
  ffffd000fb4a9c00 aggr_lacp_rx_thread+0xd2(ffffd03cf66a5128)
  ffffd000fb4a9c10 thread_start+0xb()

The argument to aggr_lacp_rx() is an mblk, and it is one of the mlxcx mblks (0xffffd03cc8d0aec0 in the list above). This thread is blocked in i_mac_perim_enter() because it wants to become the perimeter owner.
A closer look at the dladm stack blocked in mlxcx reveals a call to mac_set_mtu() with the same address as the argument to i_mac_perim_enter(), which is a mac_impl_t:
> ffffd03c89987138::print -t mac_impl_t mi_perim_owner
kthread_t *mi_perim_owner = 0xffffd0328837dba0

The perimeter is indeed owned by the thread that is blocked in mlxcx.
So this is the deadlock: the mlxcx thread is the perimeter owner, and is blocked because one of its mblks is still in use. That in-use mblk is held by another aggr thread which wants to become the perimeter owner.
The solution is not to block mlxcx_mac_ring_stop() on loaned mblks, but we do need to handle them being returned to us after the ring has stopped.
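As a rough sketch of that direction (illustrative ex_* names again, not the fix as integrated, and assuming each ex_buf_t also carries an exb_ring back-pointer): ring stop destroys only the buffers the driver still owns and marks the ring shut down; the desballoc free routine then notices the flag when a loaned mblk finally comes back and releases the buffer itself instead of recycling it.

    #include <sys/mutex.h>
    #include <sys/list.h>

    typedef struct ex_ring {
        kmutex_t    exr_mtx;        /* protects the lists and flag below */
        boolean_t   exr_shutdown;
        list_t      exr_free;       /* buffers the driver owns */
        list_t      exr_busy;       /* buffers posted to hw or on loan */
    } ex_ring_t;

    /* Ring stop no longer cv_waits for loaned buffers to come back. */
    static void
    ex_ring_stop(ex_ring_t *r)
    {
        ex_buf_t *b;

        mutex_enter(&r->exr_mtx);
        r->exr_shutdown = B_TRUE;
        while ((b = list_remove_head(&r->exr_free)) != NULL)
            ex_buf_destroy(b);      /* hypothetical teardown helper */
        mutex_exit(&r->exr_mtx);
    }

    /* Replaces the earlier free routine: checks the shutdown flag. */
    static void
    ex_buf_recycle(caddr_t arg)
    {
        ex_buf_t *b = (ex_buf_t *)arg;
        ex_ring_t *r = b->exb_ring;

        mutex_enter(&r->exr_mtx);
        list_remove(&r->exr_busy, b);
        if (r->exr_shutdown) {
            ex_buf_destroy(b);      /* ring is gone; free instead of reuse */
        } else {
            b->exb_state = EX_BUF_FREE;
            list_insert_tail(&r->exr_free, b);
        }
        mutex_exit(&r->exr_mtx);
    }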

And another cause:

fffffe63205b2c20 SLEEP    MUTEX                   1
                 swtch+0x86
                 turnstile_block+0x25b
                 mutex_vector_enter+0x358
                 mlxcx_mac_ring_tx+0x170
                 mac_hwring_tx+0x1c
                 mac_hwring_send_priv+0x90
                 aggr_ring_tx+0x1a
                 mac_hwring_tx+0x1c
                 mac_tx_send+0x63c
                 mac_tx_soft_ring_process+0x89
                 mac_tx_aggr_mode+0x8c
                 mac_tx+0xdb
                 str_mdata_fastpath_put+0x8e
                 ip_xmit+0x841
                 ire_send_wire_v4+0x345
                 conn_ip_output+0x1d4
                 tcp_send_data+0x58
                 tcp_send+0x8c2
                 tcp_wput_data+0x68a
                 tcp_input_data+0xab7
                 squeue_enter+0x3f9
                 ip_fanout_v4+0xbca
                 ip_input_local_v4+0xc6
                 ire_recv_local_v4+0x131
                 ill_input_short_v4+0x3ff
                 ip_input_common_v4+0x23f
                 ip_input+0x1f
                 mac_rx_soft_ring_process+0x1be
                 mac_rx_srs_proto_fanout+0x29a
                 mac_rx_srs_drain+0x2e5
                 mac_rx_srs_process+0x123
                 mac_rx_classify+0x7c
                 mac_rx_flow+0x58
                 mac_rx_common+0x1d6
                 mac_rx+0xc6
                 mac_rx_ring+0x1f
                 aggr_mac_rx+0x1c
                 aggr_recv_path_cb+0x12f
                 aggr_recv_cb+0xd
                 mac_rx_common+0x20b  
                 mac_rx+0xc6
                 mac_rx_ring+0x1f
                 mlxcx_intr_n+0x191
                 apix_dispatch_by_vector+0x8c
                 apix_dispatch_lowlevel+0x1c

In this case mac_set_mtu() is holding the WQ mutex (amongst others) and waiting for the buffers to drain. But this stack is waiting for that same mutex, having just taken a buffer: deadlock again.
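The shape of this second deadlock, reduced to a sketch (illustrative ex_* names once more, reusing ex_buf_t from the first sketch; the two locks stand in for the WQ mutex and the buffer-shard mutex): thread A keeps the WQ mutex across the whole wait, because cv_wait() drops only the shard mutex, while thread B already holds a buffer and cannot return it without first taking the WQ mutex.

    #include <sys/condvar.h>

    typedef struct ex_wq {
        kmutex_t    exw_wq_mtx;     /* "WQ mutex": protects the hw queue */
        kmutex_t    exw_buf_mtx;    /* protects the buffer lists */
        kcondvar_t  exw_buf_cv;     /* signalled when a buffer is returned */
        list_t      exw_busy;       /* buffers posted to hw or on loan */
    } ex_wq_t;

    /*
     * Thread A (dladm, via mac_set_mtu): stops the queue with the WQ mutex
     * held, then waits for every buffer to drain.  cv_wait() releases only
     * exw_buf_mtx; exw_wq_mtx stays held for the duration.
     */
    static void
    ex_wq_stop(ex_wq_t *wq)
    {
        mutex_enter(&wq->exw_wq_mtx);

        mutex_enter(&wq->exw_buf_mtx);
        while (!list_is_empty(&wq->exw_busy))
            cv_wait(&wq->exw_buf_cv, &wq->exw_buf_mtx);
        mutex_exit(&wq->exw_buf_mtx);

        /* ... tear down the queue ... */
        mutex_exit(&wq->exw_wq_mtx);
    }

    /*
     * Thread B (mlxcx rx interrupt -> aggr -> mlxcx tx): has already taken
     * a buffer for the outbound packet, so exw_busy cannot empty until it
     * runs, but it sleeps here waiting for the WQ mutex that A holds.
     */
    static void
    ex_wq_tx(ex_wq_t *wq, ex_buf_t *b)
    {
        mutex_enter(&wq->exw_wq_mtx);   /* blocks behind thread A: deadlock */
        /* ... write the descriptor for b and ring the doorbell ... */
        mutex_exit(&wq->exw_wq_mtx);
    }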
