Bug #14677
openmlxcx NULL deref panic due to race in mlxcx_cmd_taskq
0%
Description
In mlxcx_cmd_taskq
, we set up and submit a new command to the hardware queue. We ring the hardware doorbell, and then the code proceeds to read cmd->mlcmd_poll
from the command control structure. However, the hardware could have interrupted us in this interval and completed the command, leading to this pointer (cmd
) being invalid. This code should not assume that cmd
is valid after it has rung the hardware doorbell, if interrupt handling of the command is enabled.
Updated by Alex Wilson over 1 year ago
> ::status debugging crash dump vmcore.0 (64-bit) from sdc-cb8031 operating system: 5.11 joyent_20211222T061856Z (i86pc) git branch: eait-20211222 git rev: 02cad5c2435e32903353fa6c1573ec69697e788a image uuid: (not set) panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffb0452895950 addr=20 occurred in module "genunix" due to a NULL pointer dereference dump content: kernel pages only > $C fffffb0452895a90 ddi_dma_sync+0x3a(0, 0, 0, 2) fffffb0452895af0 mlxcx_cmd_done+0x51(fffffb044f710480, 0) fffffb0452895b50 mlxcx_cmd_taskq+0x1a8(fffffb044f710480) fffffb0452895c00 taskq_thread+0x2cd(fffffdc6f82cf2c0) fffffb0452895c10 thread_start+0xb() > fffffb044f710480::print -t mlxcx_cmd_t mlxcx_cmd_t { struct mlxcx *mlcmd_mlxp = 0xfffffb044f710450 kmutex_t mlcmd_lock = { void *[1] _opaque = [ 0x10 ] } kcondvar_t mlcmd_cv = { ushort_t _opaque = 0x4f8 } boolean_t mlcmd_poll = 0xfffffb04 (0xfffffffffffffb04) uint8_t mlcmd_token = 0x10 mlxcx_cmd_op_t mlcmd_op = 0 (???) const void *mlcmd_in = 0x48 uint32_t mlcmd_inlen = 0 void *mlcmd_out = 0xfffffb044f7104b0 uint32_t mlcmd_outlen = 0x4f7104b0 list_t mlcmd_mbox_in = { size_t list_size = 0 size_t list_offset = 0x48 struct list_node list_head = { struct list_node *list_next = 0 struct list_node *list_prev = 0xfffffdd2b6ea5c88 } } uint8_t mlcmd_nboxes_in = 0x98 list_t mlcmd_mbox_out = { size_t list_size = 0x3 size_t list_offset = 0 struct list_node list_head = { struct list_node *list_next = 0 struct list_node *list_prev = 0 } } uint8_t mlcmd_nboxes_out = 0 mlxcx_cmd_state_t mlcmd_state = 0 (0) uint8_t mlcmd_status = 0 }
Looks like a garbage. Where's it from?
> fffffb044f710480::whatis fffffb044f710480 is in thread fffffb044f710c20's stack > fffffb044f710c20::findstack -v stack pointer for thread fffffb044f710c20 (tq:ddi_periodic_taskq): fffffb044f710380 [ fffffb044f710380 _resume_from_idle+0x12b() ] fffffb044f7103b0 swtch+0x133() fffffb044f7103f0 cv_wait+0x68(fffffb044f710470, fffffb044f710468) fffffb044f710430 mlxcx_cmd_wait+0x43(fffffb044f710460) fffffb044f710a50 mlxcx_cmd_query_cq+0xe5(fffffdc6ec98b200, fffffdc6faf71ac0, fffffb044f710a88) fffffb044f710b10 mlxcx_cq_check+0x7c(fffffdc6ec98b200) fffffb044f710b50 periodic_execute+0xcf(fffffdc6f7968bd8) fffffb044f710c00 taskq_thread+0x2cd(fffffdc676516248) fffffb044f710c10 thread_start+0xb()
OK, so this is the periodic completion queue status check. But the cmd_t it's
waiting on is different to the one that the taskq is looking at. Is the one it's
waiting on also on the taskq?
> fffffdc6f82cf2c0::taskq ADDR NAME ACT/THDS Q'ED MAXQ INST fffffdc6f82cf2c0 mlxcx_cmdq_1 1/ 1 3 3 1 > fffffdc6f82cf2c0::walk taskq_entry | ::taskq_entry ENTRY ARG FUNCTION fffffdca0338d7e8 fffffb044f6fe460 mlxcx_cmd_taskq fffffdcdd933a980 fffffb044f709fb0 mlxcx_cmd_taskq
So, yes, fffffb044f6fe460
is next on the taskq. Just to double-check:
> fffffb044f710460::print -t mlxcx_cmd_t mlxcx_cmd_t { struct mlxcx *mlcmd_mlxp = 0xfffffdc6ec98b200 kmutex_t mlcmd_lock = { void *[1] _opaque = [ 0 ] } kcondvar_t mlcmd_cv = { ushort_t _opaque = 0x1 } boolean_t mlcmd_poll = 0 (0) uint8_t mlcmd_token = 0xec mlxcx_cmd_op_t mlcmd_op = 0t1026 (MLXCX_OP_QUERY_CQ) const void *mlcmd_in = 0xfffffb044f710450 uint32_t mlcmd_inlen = 0x10 void *mlcmd_out = 0xfffffb044f7104f8 uint32_t mlcmd_outlen = 0x510 list_t mlcmd_mbox_in = { size_t list_size = 0x48 size_t list_offset = 0 struct list_node list_head = { struct list_node *list_next = 0xfffffb044f7104b0 struct list_node *list_prev = 0xfffffb044f7104b0 } } uint8_t mlcmd_nboxes_in = 0 list_t mlcmd_mbox_out = { size_t list_size = 0x48 size_t list_offset = 0 struct list_node list_head = { struct list_node *list_next = 0xfffffdd2b6ea5c88 struct list_node *list_prev = 0xfffffdcccd805198 } } uint8_t mlcmd_nboxes_out = 0x3 mlxcx_cmd_state_t mlcmd_state = 0 (0) uint8_t mlcmd_status = 0 }
That looks like a valid cmd_t. At this point we should also note that in the
panic stack, mlxcx_cmd_taskq
has called mlxcx_cmd_done
directly, which it only does in polling mode. The mlcmd_poll
field
in the command is used to determine whether we should poll the command or
whether the EQ is running.
So, clearly this address was already pointing to garbage at the point wheremlxcx_cmd_taskq
checks mlcmd_poll
. There's an ASSERT
about the mlcmd_op
field at the top of mlxcx_cmd_taskq
which could have blown to tell us whether this was already invalid at the very start
of the function, but alas this was not a DEBUG build.
If we look into the logic near the top of this function and compare it to the
state of this cmd_t
at the time of panic, it's pretty clear it would
have crashed earlier if it had already been garbage then -- the garbage cmdq
pointer would have been NULL
:
fffffb044f710480::print -t mlxcx_cmd_t mlcmd_mlxp->mlx_cmd mlxcx_cmd_queue_t mlcmd_mlxp->mlx_cmd = { kmutex_t mlcmd_mlxp->mlx_cmd.mcmd_lock = { void *[1] _opaque = [ 0 ] } kcondvar_t mlcmd_mlxp->mlx_cmd.mcmd_cv = { ushort_t _opaque = 0 } mlxcx_dma_buffer_t mlcmd_mlxp->mlx_cmd.mcmd_dma = { mlxcx_dma_buffer_flags_t mxdb_flags = 0 (0) caddr_t mxdb_va = 0 size_t mxdb_len = 0 ddi_acc_handle_t mxdb_acc_handle = 0 ddi_dma_handle_t mxdb_dma_handle = 0 uint_t mxdb_ncookies = 0 } boolean_t mlcmd_mlxp->mlx_cmd.mcmd_polled = 0 (0) uint8_t mlcmd_mlxp->mlx_cmd.mcmd_size_l2 = 0 uint8_t mlcmd_mlxp->mlx_cmd.mcmd_stride_l2 = 0 uint_t mlcmd_mlxp->mlx_cmd.mcmd_size = 0 uint32_t mlcmd_mlxp->mlx_cmd.mcmd_mask = 0 mlxcx_cmd_t *[32] mlcmd_mlxp->mlx_cmd.mcmd_active = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ] ddi_taskq_t *mlcmd_mlxp->mlx_cmd.mcmd_taskq = 0 id_space_t *mlcmd_mlxp->mlx_cmd.mcmd_tokens = 0 }
So the deref of ent
would have failed.
The check on mlcmd_poll
in mlxcx_cmd_taskq
takes place
after the command has been enqueued and the doorbell rung. The function is holding
no locks at this point. It seems possible that we rung the doorbell, the hardware
completed the command, and the original caller discarded the mlxcx_cmd_t, before
this taskq thread got back on CPU and looked at mlcmd_poll
.
Let's see if we can find the completion on the event queue still (it probably
hasn't wrapped around yet if it's such a tight race). We know it went in
command slot 0 (from the argument on the stack), so let's look for a completion
for that:
> 0xfffffdc6ec98b200::print -t mlxcx_t mlx_eqs[0] mlxcx_event_queue_t mlx_eqs = { kmutex_t mlx_eqs->mleq_mtx = { void *[1] _opaque = [ 0 ] } kcondvar_t mlx_eqs->mleq_cv = { ushort_t _opaque = 0 } mlxcx_t *mlx_eqs->mleq_mlx = 0xfffffdc6ec98b200 mlxcx_eventq_state_t mlx_eqs->mleq_state = 0x2b (MLXCX_EQ_{ALLOC|CREATED|ARMED|INTR_ENABLED}) mlxcx_eventq_type_t mlx_eqs->mleq_type = 0 (0) mlxcx_dma_buffer_t mlx_eqs->mleq_dma = { mlxcx_dma_buffer_flags_t mxdb_flags = 0x7 (MLXCX_DMABUF_{HDL_ALLOC|MEM_ALLOC|BOUND}) caddr_t mxdb_va = 0xfffffdc6f9a5c000 size_t mxdb_len = 0x10000 ddi_acc_handle_t mxdb_acc_handle = 0xfffffdc6f9930dc0 ddi_dma_handle_t mxdb_dma_handle = 0xfffffdc6f9a57b40 uint_t mxdb_ncookies = 0x10 } size_t mlx_eqs->mleq_entshift = 0xa size_t mlx_eqs->mleq_nents = 0x400 mlxcx_eventq_ent_t *mlx_eqs->mleq_ent = 0xfffffdc6f9a5c000 uint32_t mlx_eqs->mleq_cc = 0x86d2a9a uint32_t mlx_eqs->mleq_cc_armed = 0x86d2a9a uint32_t mlx_eqs->mleq_events = 0x44b3f38 uint32_t mlx_eqs->mleq_badintrs = 0 uint_t mlx_eqs->mleq_num = 0x4 uint_t mlx_eqs->mleq_intr_index = 0 mlxcx_uar_t *mlx_eqs->mleq_uar = 0xfffffdc6ec98b460 avl_tree_t mlx_eqs->mleq_cqs = { struct avl_node *avl_root = 0 int (*)() avl_compar = 0 size_t avl_offset = 0 ulong_t avl_numnodes = 0 size_t avl_size = 0 } uint32_t mlx_eqs->mleq_check_disarm_cc = 0 uint_t mlx_eqs->mleq_check_disarm_cnt = 0 } > 86d2a9a & (400-1)=X 29a > 0xfffffdc6ec98b200::print -t mlxcx_t mlx_eqs[0].mleq_ent[29a] mlxcx_eventq_ent_t mlx_eqs[0].mleq_ent[29a] = { uint8_t [1] mlx_eqs[0].mleq_ent[29a].mleqe_rsvd = [ 0 ] uint8_t mlx_eqs[0].mleq_ent[29a].mleqe_event_type = 0xa uint8_t [1] mlx_eqs[0].mleq_ent[29a].mleqe_rsvd2 = [ 0 ] uint8_t mlx_eqs[0].mleq_ent[29a].mleqe_event_sub_type = 0 uint8_t [28] mlx_eqs[0].mleq_ent[29a].mleqe_rsvd3 = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ] union mlx_eqs[0].mleq_ent[29a]. = { uint8_t [28] mleqe_unknown_data = [ 0, 0, 0, 0x1, 0x18, 0xec, 0, 0, 0, 0, 0x4, 0x2, 0, 0, 0x4, 0x18, 0, 0, 0, 0, 0, 0xe3, 0x40, 0xa3, 0, 0, 0, 0 ] mlxcx_evdata_cmd_completion_t mleqe_cmd_completion = { uint32be_t mled_cmd_completion_vec = { uint32_t be_val = 0x1000000 } ... uint8_t mlx_eqs[0].mleq_ent[29a].mleqe_signature = 0xc7 uint8_t mlx_eqs[0].mleq_ent[29a].mleqe_owner = 0 } > 0xfffffdc6ec98b200::print -t mlxcx_t mlx_eqs[0].mleq_ent[299] mlxcx_eventq_ent_t mlx_eqs[0].mleq_ent[299] = { uint8_t [1] mlx_eqs[0].mleq_ent[299].mleqe_rsvd = [ 0 ] uint8_t mlx_eqs[0].mleq_ent[299].mleqe_event_type = 0xa uint8_t [1] mlx_eqs[0].mleq_ent[299].mleqe_rsvd2 = [ 0 ] uint8_t mlx_eqs[0].mleq_ent[299].mleqe_event_sub_type = 0 uint8_t [28] mlx_eqs[0].mleq_ent[299].mleqe_rsvd3 = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ] union mlx_eqs[0].mleq_ent[299]. = { uint8_t [28] mleqe_unknown_data = [ 0, 0, 0, 0x1, 0x13, 0xeb, 0, 0, 0, 0, 0x4, 0x2, 0, 0, 0x4, 0x17, 0, 0, 0, 0, 0, 0xe2, 0xc3, 0x1e, 0, 0, 0, 0 ] mlxcx_evdata_cmd_completion_t mleqe_cmd_completion = { uint32be_t mled_cmd_completion_vec = { uint32_t be_val = 0x1000000 } uint8_t [24] mled_cmd_completion_rsvd = [ 0x13, 0xeb, 0, 0, 0, 0, 0x4, 0x2, 0, 0, 0x4, 0x17, 0, 0, 0, 0, 0, 0xe2, 0xc3, 0x1e, 0, 0, 0, 0 ] } uint8_t mlx_eqs[0].mleq_ent[299].mleqe_signature = 0xf8 uint8_t mlx_eqs[0].mleq_ent[299].mleqe_owner = 0 }
It seems we have been submitting every command into slot 0, because we useddi_ffs
on a mask to find the first available index. This is a
little unfortunate, as it makes it very hard to look back in the ring when
debugging like this. We can clearly see that command slot 0 has completed many
times in recent history, but this doesn't really enable us to tell if that
specific command (which was also in slot 0) was completed or not.
Going back to mlxcx_cmd_taskq
and referring to the Mellanox PRM,
it's not super clear whether setting the "ownership" field or ringing the
doorbell is considered sufficient before the hardware takes over. It definitely
seems true that setting the ownership field is necessary, and ringing the
doorbell is normally required, but it's not clear whether the hardware can
begin executing a command before the doorbell is rung if the ownership value
is set. These two operations should be the last thing we do, however, and we
should not be assuming that we can use the mlxcx_cmd_t
once we've
done either of them to a command that's enabled for interrupt-based handling:
the interrupt may happen before we reach the next line.
We should reorganise this code to write the owner field and ring the doorbell
last and never use the cmd_t
after that, and add comments explaining
about this potential race. It should read out the value of mlcmd_poll
near the start of the function and use that instead of accessing the struct again.
It's also worth considering whether this code should always hold the cmd_t's lock
at any point when it's reading from or manipulating it or anything hanging off it
(like the actual command DMA entry). This code was initially designed for
polling-only operation and had interrupts added later, and the current lack of
locking in the original code makes it easy to make this kind of mistake.
I also think we should make the ASSERT
at the top ofmlxcx_cmd_taskq
into a VERIFY()
, and perhaps also
validate a few other fields there so that we have more confidence about what's
going on when debugging command handling.
Additionally, I think we should make command number allocations proceed
sequentially so that we can look back in the event queue and get information
(to a reasonable degree of confidence) about whether the hardware has handled
a specific command when debugging.