Bug #13610

open

mac_promisc_dispatch CV deadlock when re-entering against mac_promisc_remove

Added by Alex Wilson 7 months ago. Updated 7 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

In mac_promisc_dispatch, we call into mac_callback_walker_enter, which amongst other things increments mcbi_walker_cnt to act as a barrier against somebody else destroying the cb while we're using it. The idea is that if someone who wants to use or destroy it (e.g. mac_promisc_remove) enters while mcbi_walker_cnt is non-zero, they increment mcbi_del_cnt and then sleep on mcbi_cv until all the walkers are done. This is basically a slightly hacky reader-writer lock (though some extra features were added with mcbi_barrier_cnt in pmooney's bhyve changes, see #12674). To prevent a nasty race with a new walker, mac_callback_walker_enter also makes any new walker sleep on the CV if it tries to use the cb while mcbi_del_cnt is > 0.
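To make the counter protocol described above concrete, here is a minimal userspace sketch of it. It is not the MAC code: cb_info_t, walker_try_enter, walker_exit, remover_try_begin and remover_finish are hypothetical names, the fields only echo mcbi_walker_cnt and mcbi_del_cnt, and instead of sleeping on a CV each function just reports whether the caller would have had to sleep. mcbi_barrier_cnt and all locking are left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the callback-list state. The field names echo
 * mcbi_walker_cnt and mcbi_del_cnt but this is not the kernel structure.
 */
typedef struct cb_info {
	unsigned walker_cnt;	/* walkers currently inside the list */
	unsigned del_cnt;	/* removers waiting for walkers to drain */
} cb_info_t;

/* Returns true if the walker got in, false if it would sleep on the CV. */
static bool
walker_try_enter(cb_info_t *cbi)
{
	if (cbi->del_cnt > 0)
		return (false);	/* a remover is pending: new walkers wait */
	cbi->walker_cnt++;
	return (true);
}

/* Returns true if sleepers should now be woken (the CV signal point). */
static bool
walker_exit(cb_info_t *cbi)
{
	assert(cbi->walker_cnt > 0);
	cbi->walker_cnt--;
	return (cbi->walker_cnt == 0 && cbi->del_cnt > 0);
}

/* Returns true if removal can proceed now, false if it would sleep. */
static bool
remover_try_begin(cb_info_t *cbi)
{
	if (cbi->walker_cnt > 0) {
		cbi->del_cnt++;	/* register as a waiting remover */
		return (false);
	}
	return (true);
}

/* Called by a remover that was woken after the walkers drained. */
static void
remover_finish(cb_info_t *cbi)
{
	assert(cbi->walker_cnt == 0 && cbi->del_cnt > 0);
	cbi->del_cnt--;
}
```

In this model, a walker that enters first forces a later remover to register in del_cnt and wait; any walker arriving after that is turned away until the first walker exits and the remover finishes.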

Unfortunately, it's possible to re-enter mac_promisc_dispatch from under mac_promisc_dispatch_one, for example via a thread stack like this one:


fffffbe36151ec20 SLEEP    CV                      1
                 swtch+0x86
                 cv_wait+0x68
                 mac_callback_walker_enter+0x2b
                 mac_promisc_dispatch+0x37
                 mac_provider_tx+0x57
                 mac_hwring_send_priv+0x1a
                 aggr_ring_tx+0x1a
                 mac_hwring_tx+0x19
                 mac_ring_tx+0x1e
                 mac_provider_tx+0x85
                 mac_tx_send+0x288
                 mac_tx_soft_ring_process+0x89
                 mac_tx_aggr_mode+0x8c
                 mac_tx+0x1a9
                 str_mdata_fastpath_put+0x8e
                 ip_xmit+0x841
                 ire_send_wire_v4+0x345
                 conn_ip_output+0x1d4
                 tcp_send_data+0x58
                 tcp_input_data+0x1f3a
                 squeue_enter+0x3f9
                 ip_fanout_v4+0xbca
                 ip_input_local_v4+0xc6
                 ire_recv_local_v4+0x131
                 ill_input_short_v4+0x3ff
                 ip_input_common_v4+0x23f
                 ip_input+0x1f
                 mac_rx_soft_ring_process+0x1be
                 mac_rx_srs_fanout+0x395
                 mac_rx_srs_drain+0x22b
                 mac_rx_srs_process+0x123
                 mac_rx_classify+0x88
                 mac_rx_flow+0x58
                 mac_rx_common+0x23e
                 mac_rx+0xc6
                 aggr_mac_rx+0x25
                 aggr_recv_path_cb+0x11f
                 aggr_recv_promisc_cb+0x13
                 mac_promisc_dispatch_one+0x9c
                 mac_promisc_dispatch+0x82
                 mac_rx_common+0x47
                 mac_rx+0xc6
                 mac_rx_ring+0x1f
                 ixgbe_intr_rx_work+0x5c
                 ixgbe_intr_msix+0x3f
                 apix_dispatch_by_vector+0x8c
                 apix_dispatch_lowlevel+0x1c

Suppose we enter the callback walker section and then, while we're looping back around to call ourselves via a stack like this one, another thread enters mac_promisc_remove. The second, looped-back call then goes to sleep on mcbi_cv waiting for mac_promisc_remove, and it does so on the only thread which could signal that same CV (at the end of the original mac_promisc_dispatch call, when it calls into mac_callback_walker_exit).
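The ordering that leads to the hang can be sketched with a small single-threaded model that tracks which thread holds walker references. This is only an illustration under stated assumptions: cb_model_t, walker_enter, remover_arrive and the WE_* result codes are hypothetical names, thread IDs are plain array indices, and "sleeping" is reported as a return value rather than performed.

```c
#include <assert.h>

#define	MAX_THREADS	4

/* Hypothetical model: who holds walker references, who is waiting. */
typedef struct cb_model {
	unsigned del_cnt;		/* removers waiting for walkers */
	unsigned depth[MAX_THREADS];	/* walker refs held, per thread */
} cb_model_t;

typedef enum {
	WE_ENTERED,		/* walker got in */
	WE_BLOCKED,		/* would sleep; someone else can wake us */
	WE_SELF_DEADLOCK	/* would sleep while holding a walker ref */
} enter_result_t;

static unsigned
walker_total(const cb_model_t *m)
{
	unsigned i, total = 0;

	for (i = 0; i < MAX_THREADS; i++)
		total += m->depth[i];
	return (total);
}

/* Models a walker entering on thread tid. */
static enter_result_t
walker_enter(cb_model_t *m, unsigned tid)
{
	assert(tid < MAX_THREADS);
	if (m->del_cnt == 0) {
		m->depth[tid]++;
		return (WE_ENTERED);
	}
	/*
	 * A remover is waiting, so this walker would sleep. If this thread
	 * already holds a walker reference, it can never reach its own exit
	 * call, so the walker count never drains and nobody is ever woken.
	 */
	return (m->depth[tid] > 0 ? WE_SELF_DEADLOCK : WE_BLOCKED);
}

/* Models a remover arriving on some other thread. */
static void
remover_arrive(cb_model_t *m)
{
	if (walker_total(m) > 0)
		m->del_cnt++;	/* must wait for the walkers to finish */
}
```

Replaying the sequence from the report: thread 0 enters (the outer mac_promisc_dispatch), a remover arrives, and thread 0 then re-enters. In the model the re-entry returns WE_SELF_DEADLOCK, whereas an unrelated thread entering at the same point would only get WE_BLOCKED.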

This deadlocks MAC and any thread which attempts to transmit network traffic after this point. It usually ends in a full-system hang, as other things pile up behind threads which are blocked on network TX.

Actions #1

Updated by Alex Wilson 7 months ago

I have a dump of a system thus frozen, if anybody needs it to get further details. My original debugging sessions on it:

https://gist.github.com/arekinath/a005fb78756e56a64482ab5ef8aa0666

https://gist.github.com/arekinath/787480d32c99c8e87ac566c91831ccf7

Actions #2

Updated by Alex Wilson 7 months ago

  • Description updated (diff)