Bug #12481
attempting to change MTU size of aggregate over mlxcx can cause dladm to hang
Start date:
Due date:
% Done:
100%
Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
Description
If you create an aggr over mlxcx devices and then issue the command:
dladm set-linkprop -p mtu=1500 aggr1
and the link is in use, you will get one of two different hangs.
Either
> ffffd0328837dba0::findstack -v stack pointer for thread ffffd0328837dba0 (dladm/1): ffffd000f63ff690 [ ffffd000f63ff690 _resume_from_idle+0x12b() ] ffffd000f63ff6c0 swtch+0x133() ffffd000f63ff700 cv_wait+0x68(ffffd0323fb00568, ffffd0323fb00520) ffffd000f63ff760 mlxcx_mac_ring_stop+0x11c(ffffd0328d401000) ffffd000f63ff790 mac_stop_ring+0x29(ffffd03277726720) ffffd000f63ff7c0 mac_stop_group_and_rings+0x38(ffffd03234f3c000) ffffd000f63ff7f0 mac_stop+0x33(ffffd031f41fc5c0) ffffd000f63ff850 mac_client_datapath_teardown+0x1c9(ffffd031f365c718, ffffd0328f7a1798, 0) ffffd000f63ff8a0 mac_unicast_remove+0x29b(ffffd031f365c718, ffffd0328f7a1798) ffffd000f63ff920 aggr_set_port_sdu+0x47(ffffd03cf66a5128, ffffd03345ccb308, 5dc, ffffd0328f2dbb08) ffffd000f63ff980 aggr_sdu_update+0x91(ffffd03cf66a5128, 5dc) ffffd000f63ff9e0 aggr_m_setprop+0x6c(ffffd03cf66a5128, fffffffffba08cdb, 6, 4, ffffd000f63ff9fc) ffffd000f63ffa50 mac_set_mtu+0xc4(ffffd03c89987138, 5dc, 0) ffffd000f63ffad0 mac_set_prop+0x136(ffffd03c89987138, 6, ffffd03c265e6010, ffffd03c265e6114, 4) ffffd000f63ffba0 drv_ioc_prop_common+0x4dd(ffffd03ea565aa80, 807e3c0, 1, ffffd0334ce09860, 100003) ffffd000f63ffbe0 drv_ioc_setprop+0x1b(ffffd03ea565aa80, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8) ffffd000f63ffc80 drv_ioctl+0x1ef(1200000000, d1d001b, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8) ffffd000f63ffcc0 cdev_ioctl+0x2b(1200000000, d1d001b, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8) ffffd000f63ffd10 spec_ioctl+0x45(ffffd031f0351500, d1d001b, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8, 0) ffffd000f63ffda0 fop_ioctl+0x5b(ffffd031f0351500, d1d001b, 807e3c0, 100003, ffffd0334ce09860, ffffd000f63ffdd8, 0) ffffd000f63ffec0 ioctl+0x153(3, d1d001b, 807e3c0) ffffd000f63fff10 _sys_sysenter_post_swapgs+0x14f()
Some analysis of this particular case:
The function mlxcx_mac_ring_stop() is waiting here:
mutex_enter(&s->mlbs_mtx); while (!list_is_empty(&s->mlbs_busy)) cv_wait(&s->mlbs_free_nonempty, &s->mlbs_mtx); while ((buf = list_head(&s->mlbs_free)) != NULL) { mlxcx_buf_destroy(mlxp, buf); } mutex_exit(&s->mlbs_mtx);
If we take a look at the mlbs_busy list, we see:
> ffffd0323fb00528::walk list|::print -t mlxcx_buffer_t mlb_state mlb_mp mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN) mblk_t *mlb_mp = 0xffffd03cc8d0aec0 mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN) mblk_t *mlb_mp = 0xffffd03c81a2e8c0 mlxcx_buffer_state_t mlb_state = 3 (MLXCX_BUFFER_ON_LOAN) mblk_t *mlb_mp = 0xffffd03e82f7b240
There are 3 buffers "on loan". These are receive buffers which have been passed up the stack, and in this case they are being held on to.
With some judicious kgrep'ing, I found this thread:
> ffffd000fb4a9c20::findstack -v stack pointer for thread ffffd000fb4a9c20 (aggr_lacp_rx_thread()): ffffd000fb4a9a50 [ ffffd000fb4a9a50 _resume_from_idle+0x12b() ] ffffd000fb4a9a80 swtch+0x133() ffffd000fb4a9ac0 cv_wait+0x68(ffffd03c89987224, ffffd03c89987210) ffffd000fb4a9b00 i_mac_perim_enter+0x5b(ffffd03c89987138) ffffd000fb4a9b30 mac_perim_enter_by_mh+0x23(ffffd03c89987138, ffffd000fb4a9b40) ffffd000fb4a9b70 aggr_lacp_rx+0x3d(ffffd03cc8d0aec0) ffffd000fb4a9c00 aggr_lacp_rx_thread+0xd2(ffffd03cf66a5128) ffffd000fb4a9c10 thread_start+0xb()
The argument to aggr_lacp_rx() is a mblk, and it is one of the mlxcx mblk. This thread is blocked in i_mac_perim_enter() because it wants to be the owner.
A closer look at the stack in mlxcx reveals a call to mac_set_mtu() with the same address as the argument to i_mac_perim_enter(). Which is a mac_impl_t:
> ffffd03c89987138::print -t mac_impl_t mi_perim_owner kthread_t *mi_perim_owner = 0xffffd0328837dba0
which is indeed owned by the thread blocked in mlxcx.
So, this is the deadlock: mlxcx thread is the perimeter owner, and is blocked because it has a mblk which is in use. That in use mblk is owned by another aggr thread which wants to be the perimeter owner.
The solution is not to block the mlxcx_mac_ring_stop() because of loaned mblk, but we do need to handle them be returned to use after the ring has stopped.
And another cause:
fffffe63205b2c20 SLEEP MUTEX 1 swtch+0x86 turnstile_block+0x25b mutex_vector_enter+0x358 mlxcx_mac_ring_tx+0x170 mac_hwring_tx+0x1c mac_hwring_send_priv+0x90 aggr_ring_tx+0x1a mac_hwring_tx+0x1c mac_tx_send+0x63c mac_tx_soft_ring_process+0x89 mac_tx_aggr_mode+0x8c mac_tx+0xdb str_mdata_fastpath_put+0x8e ip_xmit+0x841 ire_send_wire_v4+0x345 conn_ip_output+0x1d4 tcp_send_data+0x58 tcp_send+0x8c2 tcp_wput_data+0x68a tcp_input_data+0xab7 squeue_enter+0x3f9 ip_fanout_v4+0xbca ip_input_local_v4+0xc6 ire_recv_local_v4+0x131 ill_input_short_v4+0x3ff ip_input_common_v4+0x23f ip_input+0x1f mac_rx_soft_ring_process+0x1be mac_rx_srs_proto_fanout+0x29a mac_rx_srs_drain+0x2e5 mac_rx_srs_process+0x123 mac_rx_classify+0x7c mac_rx_flow+0x58 mac_rx_common+0x1d6 mac_rx+0xc6 mac_rx_ring+0x1f aggr_mac_rx+0x1c aggr_recv_path_cb+0x12f aggr_recv_cb+0xd mac_rx_common+0x20b mac_rx+0xc6 mac_rx_ring+0x1f mlxcx_intr_n+0x191 apix_dispatch_by_vector+0x8c apix_dispatch_lowlevel+0x1c
In this case the mac_set_mtu() is holding the WQ mutex (amongst others), and waiting for the bufs to drain. But the stack, is waiting for the same mutex, but has just taken a buffer.... deadlock again.