Project

General

Profile

Actions

Bug #14677

open

mlxcx NULL deref panic due to race in mlxcx_cmd_taskq

Added by Alex Wilson 23 days ago. Updated 23 days ago.

Status:
New
Priority:
Normal
Assignee:
Category:
driver - device drivers
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

In mlxcx_cmd_taskq, we set up and submit a new command to the hardware queue. We ring the hardware doorbell, and then the code proceeds to read cmd->mlcmd_poll from the command control structure. However, the hardware could have interrupted us in this interval and completed the command, leading to this pointer (cmd) being invalid. This code should not assume that cmd is valid after it has rung the hardware doorbell, if interrupt handling of the command is enabled.

Actions #1

Updated by Alex Wilson 23 days ago

> ::status
debugging crash dump vmcore.0 (64-bit) from sdc-cb8031
operating system: 5.11 joyent_20211222T061856Z (i86pc)
git branch: eait-20211222
git rev: 02cad5c2435e32903353fa6c1573ec69697e788a
image uuid: (not set)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffb0452895950 addr=20 occurred in module "genunix" due to a NULL pointer dereference
dump content: kernel pages only

> $C
fffffb0452895a90 ddi_dma_sync+0x3a(0, 0, 0, 2)
fffffb0452895af0 mlxcx_cmd_done+0x51(fffffb044f710480, 0)
fffffb0452895b50 mlxcx_cmd_taskq+0x1a8(fffffb044f710480)
fffffb0452895c00 taskq_thread+0x2cd(fffffdc6f82cf2c0)
fffffb0452895c10 thread_start+0xb()

> fffffb044f710480::print -t mlxcx_cmd_t
mlxcx_cmd_t {
    struct mlxcx *mlcmd_mlxp = 0xfffffb044f710450
    kmutex_t mlcmd_lock = {
        void *[1] _opaque = [ 0x10 ]
    }
    kcondvar_t mlcmd_cv = {
        ushort_t _opaque = 0x4f8
    }
    boolean_t mlcmd_poll = 0xfffffb04 (0xfffffffffffffb04)
    uint8_t mlcmd_token = 0x10
    mlxcx_cmd_op_t mlcmd_op = 0 (???)
    const void *mlcmd_in = 0x48
    uint32_t mlcmd_inlen = 0
    void *mlcmd_out = 0xfffffb044f7104b0
    uint32_t mlcmd_outlen = 0x4f7104b0
    list_t mlcmd_mbox_in = {
        size_t list_size = 0
        size_t list_offset = 0x48
        struct list_node list_head = {
            struct list_node *list_next = 0
            struct list_node *list_prev = 0xfffffdd2b6ea5c88
        }
    }
    uint8_t mlcmd_nboxes_in = 0x98
    list_t mlcmd_mbox_out = {
        size_t list_size = 0x3
        size_t list_offset = 0
        struct list_node list_head = {
            struct list_node *list_next = 0
            struct list_node *list_prev = 0
        }
    }
    uint8_t mlcmd_nboxes_out = 0
    mlxcx_cmd_state_t mlcmd_state = 0 (0)
    uint8_t mlcmd_status = 0
}

Looks like a garbage. Where's it from?

> fffffb044f710480::whatis
fffffb044f710480 is in thread fffffb044f710c20's stack
> fffffb044f710c20::findstack -v
stack pointer for thread fffffb044f710c20 (tq:ddi_periodic_taskq): fffffb044f710380
[ fffffb044f710380 _resume_from_idle+0x12b() ]
  fffffb044f7103b0 swtch+0x133()
  fffffb044f7103f0 cv_wait+0x68(fffffb044f710470, fffffb044f710468)
  fffffb044f710430 mlxcx_cmd_wait+0x43(fffffb044f710460)
  fffffb044f710a50 mlxcx_cmd_query_cq+0xe5(fffffdc6ec98b200, fffffdc6faf71ac0, fffffb044f710a88)
  fffffb044f710b10 mlxcx_cq_check+0x7c(fffffdc6ec98b200)
  fffffb044f710b50 periodic_execute+0xcf(fffffdc6f7968bd8)
  fffffb044f710c00 taskq_thread+0x2cd(fffffdc676516248)
  fffffb044f710c10 thread_start+0xb()

OK, so this is the periodic completion queue status check. But the cmd_t it's
waiting on is different to the one that the taskq is looking at. Is the one it's
waiting on also on the taskq?

> fffffdc6f82cf2c0::taskq
ADDR             NAME                             ACT/THDS Q'ED  MAXQ INST
fffffdc6f82cf2c0 mlxcx_cmdq_1                       1/   1    3     3    1

> fffffdc6f82cf2c0::walk taskq_entry | ::taskq_entry
ENTRY            ARG              FUNCTION
fffffdca0338d7e8 fffffb044f6fe460 mlxcx_cmd_taskq
fffffdcdd933a980 fffffb044f709fb0 mlxcx_cmd_taskq

So, yes, fffffb044f6fe460 is next on the taskq. Just to double-check:

> fffffb044f710460::print -t mlxcx_cmd_t
mlxcx_cmd_t {
    struct mlxcx *mlcmd_mlxp = 0xfffffdc6ec98b200
    kmutex_t mlcmd_lock = {
        void *[1] _opaque = [ 0 ]
    }
    kcondvar_t mlcmd_cv = {
        ushort_t _opaque = 0x1
    }
    boolean_t mlcmd_poll = 0 (0)
    uint8_t mlcmd_token = 0xec
    mlxcx_cmd_op_t mlcmd_op = 0t1026 (MLXCX_OP_QUERY_CQ)
    const void *mlcmd_in = 0xfffffb044f710450
    uint32_t mlcmd_inlen = 0x10
    void *mlcmd_out = 0xfffffb044f7104f8
    uint32_t mlcmd_outlen = 0x510
    list_t mlcmd_mbox_in = {
        size_t list_size = 0x48
        size_t list_offset = 0
        struct list_node list_head = {
            struct list_node *list_next = 0xfffffb044f7104b0
            struct list_node *list_prev = 0xfffffb044f7104b0
        }
    }
    uint8_t mlcmd_nboxes_in = 0
    list_t mlcmd_mbox_out = {
        size_t list_size = 0x48
        size_t list_offset = 0
        struct list_node list_head = {
            struct list_node *list_next = 0xfffffdd2b6ea5c88
            struct list_node *list_prev = 0xfffffdcccd805198
        }
    }
    uint8_t mlcmd_nboxes_out = 0x3
    mlxcx_cmd_state_t mlcmd_state = 0 (0)
    uint8_t mlcmd_status = 0
}

That looks like a valid cmd_t. At this point we should also note that in the
panic stack, mlxcx_cmd_taskq has called mlxcx_cmd_done
directly, which it only does in polling mode. The mlcmd_poll field
in the command is used to determine whether we should poll the command or
whether the EQ is running.

So, clearly this address was already pointing to garbage at the point where
mlxcx_cmd_taskq checks mlcmd_poll. There's an ASSERT
about the mlcmd_op field at the top of mlxcx_cmd_taskq
which could have blown to tell us whether this was already invalid at the very start
of the function, but alas this was not a DEBUG build.

If we look into the logic near the top of this function and compare it to the
state of this cmd_t at the time of panic, it's pretty clear it would
have crashed earlier if it had already been garbage then -- the garbage
cmdq pointer would have been NULL:

 fffffb044f710480::print -t mlxcx_cmd_t mlcmd_mlxp->mlx_cmd
mlxcx_cmd_queue_t mlcmd_mlxp->mlx_cmd = {
    kmutex_t mlcmd_mlxp->mlx_cmd.mcmd_lock = {
        void *[1] _opaque = [ 0 ]
    }
    kcondvar_t mlcmd_mlxp->mlx_cmd.mcmd_cv = {
        ushort_t _opaque = 0
    }
    mlxcx_dma_buffer_t mlcmd_mlxp->mlx_cmd.mcmd_dma = {
        mlxcx_dma_buffer_flags_t mxdb_flags = 0 (0)
        caddr_t mxdb_va = 0
        size_t mxdb_len = 0
        ddi_acc_handle_t mxdb_acc_handle = 0
        ddi_dma_handle_t mxdb_dma_handle = 0
        uint_t mxdb_ncookies = 0
    }
    boolean_t mlcmd_mlxp->mlx_cmd.mcmd_polled = 0 (0)
    uint8_t mlcmd_mlxp->mlx_cmd.mcmd_size_l2 = 0
    uint8_t mlcmd_mlxp->mlx_cmd.mcmd_stride_l2 = 0
    uint_t mlcmd_mlxp->mlx_cmd.mcmd_size = 0
    uint32_t mlcmd_mlxp->mlx_cmd.mcmd_mask = 0
    mlxcx_cmd_t *[32] mlcmd_mlxp->mlx_cmd.mcmd_active = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
    ddi_taskq_t *mlcmd_mlxp->mlx_cmd.mcmd_taskq = 0
    id_space_t *mlcmd_mlxp->mlx_cmd.mcmd_tokens = 0
}

So the deref of ent would have failed.

The check on mlcmd_poll in mlxcx_cmd_taskq takes place
after the command has been enqueued and the doorbell rung. The function is holding
no locks at this point. It seems possible that we rung the doorbell, the hardware
completed the command, and the original caller discarded the mlxcx_cmd_t, before
this taskq thread got back on CPU and looked at mlcmd_poll.

Let's see if we can find the completion on the event queue still (it probably
hasn't wrapped around yet if it's such a tight race). We know it went in
command slot 0 (from the argument on the stack), so let's look for a completion
for that:

> 0xfffffdc6ec98b200::print -t mlxcx_t mlx_eqs[0]
mlxcx_event_queue_t mlx_eqs = {
    kmutex_t mlx_eqs->mleq_mtx = {
        void *[1] _opaque = [ 0 ]
    }
    kcondvar_t mlx_eqs->mleq_cv = {
        ushort_t _opaque = 0
    }
    mlxcx_t *mlx_eqs->mleq_mlx = 0xfffffdc6ec98b200
    mlxcx_eventq_state_t mlx_eqs->mleq_state = 0x2b (MLXCX_EQ_{ALLOC|CREATED|ARMED|INTR_ENABLED})
    mlxcx_eventq_type_t mlx_eqs->mleq_type = 0 (0)
    mlxcx_dma_buffer_t mlx_eqs->mleq_dma = {
        mlxcx_dma_buffer_flags_t mxdb_flags = 0x7 (MLXCX_DMABUF_{HDL_ALLOC|MEM_ALLOC|BOUND})
        caddr_t mxdb_va = 0xfffffdc6f9a5c000
        size_t mxdb_len = 0x10000
        ddi_acc_handle_t mxdb_acc_handle = 0xfffffdc6f9930dc0
        ddi_dma_handle_t mxdb_dma_handle = 0xfffffdc6f9a57b40
        uint_t mxdb_ncookies = 0x10
    }
    size_t mlx_eqs->mleq_entshift = 0xa
    size_t mlx_eqs->mleq_nents = 0x400
    mlxcx_eventq_ent_t *mlx_eqs->mleq_ent = 0xfffffdc6f9a5c000
    uint32_t mlx_eqs->mleq_cc = 0x86d2a9a
    uint32_t mlx_eqs->mleq_cc_armed = 0x86d2a9a
    uint32_t mlx_eqs->mleq_events = 0x44b3f38
    uint32_t mlx_eqs->mleq_badintrs = 0
    uint_t mlx_eqs->mleq_num = 0x4
    uint_t mlx_eqs->mleq_intr_index = 0
    mlxcx_uar_t *mlx_eqs->mleq_uar = 0xfffffdc6ec98b460
    avl_tree_t mlx_eqs->mleq_cqs = {
        struct avl_node *avl_root = 0
        int (*)() avl_compar = 0
        size_t avl_offset = 0
        ulong_t avl_numnodes = 0
        size_t avl_size = 0
    }
    uint32_t mlx_eqs->mleq_check_disarm_cc = 0
    uint_t mlx_eqs->mleq_check_disarm_cnt = 0
}

> 86d2a9a & (400-1)=X
                29a

> 0xfffffdc6ec98b200::print -t mlxcx_t mlx_eqs[0].mleq_ent[29a]
mlxcx_eventq_ent_t mlx_eqs[0].mleq_ent[29a] = {
    uint8_t [1] mlx_eqs[0].mleq_ent[29a].mleqe_rsvd = [ 0 ]
    uint8_t mlx_eqs[0].mleq_ent[29a].mleqe_event_type = 0xa
    uint8_t [1] mlx_eqs[0].mleq_ent[29a].mleqe_rsvd2 = [ 0 ]
    uint8_t mlx_eqs[0].mleq_ent[29a].mleqe_event_sub_type = 0
    uint8_t [28] mlx_eqs[0].mleq_ent[29a].mleqe_rsvd3 = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
    union  mlx_eqs[0].mleq_ent[29a]. = {
        uint8_t [28] mleqe_unknown_data = [ 0, 0, 0, 0x1, 0x18, 0xec, 0, 0, 0, 0, 0x4, 0x2, 0, 0, 0x4, 0x18, 0, 0, 0, 0, 0, 0xe3, 0x40, 0xa3, 0, 0, 0, 0 ]
        mlxcx_evdata_cmd_completion_t mleqe_cmd_completion = {
            uint32be_t mled_cmd_completion_vec = {
                uint32_t be_val = 0x1000000
            }
            ...
    uint8_t mlx_eqs[0].mleq_ent[29a].mleqe_signature = 0xc7
    uint8_t mlx_eqs[0].mleq_ent[29a].mleqe_owner = 0
}

> 0xfffffdc6ec98b200::print -t mlxcx_t mlx_eqs[0].mleq_ent[299]
mlxcx_eventq_ent_t mlx_eqs[0].mleq_ent[299] = {
    uint8_t [1] mlx_eqs[0].mleq_ent[299].mleqe_rsvd = [ 0 ]
    uint8_t mlx_eqs[0].mleq_ent[299].mleqe_event_type = 0xa
    uint8_t [1] mlx_eqs[0].mleq_ent[299].mleqe_rsvd2 = [ 0 ]
    uint8_t mlx_eqs[0].mleq_ent[299].mleqe_event_sub_type = 0
    uint8_t [28] mlx_eqs[0].mleq_ent[299].mleqe_rsvd3 = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
    union  mlx_eqs[0].mleq_ent[299]. = {
        uint8_t [28] mleqe_unknown_data = [ 0, 0, 0, 0x1, 0x13, 0xeb, 0, 0, 0, 0, 0x4, 0x2, 0, 0, 0x4, 0x17, 0, 0, 0, 0, 0, 0xe2, 0xc3, 0x1e, 0, 0, 0, 0 ]
        mlxcx_evdata_cmd_completion_t mleqe_cmd_completion = {
            uint32be_t mled_cmd_completion_vec = {
                uint32_t be_val = 0x1000000
            }
            uint8_t [24] mled_cmd_completion_rsvd = [ 0x13, 0xeb, 0, 0, 0, 0, 0x4, 0x2, 0, 0, 0x4, 0x17, 0, 0, 0, 0, 0, 0xe2, 0xc3, 0x1e, 0, 0, 0, 0 ]
        }
    uint8_t mlx_eqs[0].mleq_ent[299].mleqe_signature = 0xf8
    uint8_t mlx_eqs[0].mleq_ent[299].mleqe_owner = 0
}

It seems we have been submitting every command into slot 0, because we use
ddi_ffs on a mask to find the first available index. This is a
little unfortunate, as it makes it very hard to look back in the ring when
debugging like this. We can clearly see that command slot 0 has completed many
times in recent history, but this doesn't really enable us to tell if that
specific command (which was also in slot 0) was completed or not.

Going back to mlxcx_cmd_taskq and referring to the Mellanox PRM,
it's not super clear whether setting the "ownership" field or ringing the
doorbell is considered sufficient before the hardware takes over. It definitely
seems true that setting the ownership field is necessary, and ringing the
doorbell is normally required, but it's not clear whether the hardware can
begin executing a command before the doorbell is rung if the ownership value
is set. These two operations should be the last thing we do, however, and we
should not be assuming that we can use the mlxcx_cmd_t once we've
done either of them to a command that's enabled for interrupt-based handling:
the interrupt may happen before we reach the next line.

We should reorganise this code to write the owner field and ring the doorbell
last and never use the cmd_t after that, and add comments explaining
about this potential race. It should read out the value of mlcmd_poll
near the start of the function and use that instead of accessing the struct again.

It's also worth considering whether this code should always hold the cmd_t's lock
at any point when it's reading from or manipulating it or anything hanging off it
(like the actual command DMA entry). This code was initially designed for
polling-only operation and had interrupts added later, and the current lack of
locking in the original code makes it easy to make this kind of mistake.

I also think we should make the ASSERT at the top of
mlxcx_cmd_taskq into a VERIFY(), and perhaps also
validate a few other fields there so that we have more confidence about what's
going on when debugging command handling.

Additionally, I think we should make command number allocations proceed
sequentially so that we can look back in the event queue and get information
(to a reasonable degree of confidence) about whether the hardware has handled
a specific command when debugging.

Actions

Also available in: Atom PDF