Project

General

Profile

Actions

Bug #13359

closed

mlxcx_update_link_state can race against mlxcx_register_mac

Added by Alex Wilson 11 months ago. Updated 7 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

After the changes to make mlxcx do all the commands during attach asynchronously, we now have a much wider window for us to get and try to handle a link state change notification before mlxcx_register_mac has finished setting up mlx_mac_hdl.

If this happens, we get a null deref:

panic[cpu53]/thread=fffffb04508c6c20: BAD TRAP: type=e (#pf Page fault) rp=fffffb04508c69f0 addr=9c occurred in module "mac" due to a NULL pointer dereference

sched: #pf Page fault
Bad kernel fault at addr=0x9c
pid=0, pc=0xfffffffffbbd4f71, sp=0xfffffb04508c6ae0, eflags=0x10247
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe>  cr4: 3406f8<smap,smep,osxsav,xmme,fxsr,pge,mce,pae,pse,de>
cr2: 9c  cr3: 14000000  cr8: 0

        rdi:                0 rsi:                1 rdx:                1
        rcx:                1  r8: fffffdc594029dc8  r9:                0
        rax:                0 rbx: fffffdc592dafc80 rbp: fffffb04508c6b10
        r10:                1 r11: fffffb04508c6c20 r12: fffffdc5714f5dc0
        r13: fffffdc5714f5dc0 r14: fffffdc592dafc80 r15: fffffdc579897508
        fsb:                0 gsb: fffffdc56f15e000  ds:               4b
         es:               4b  fs:                0  gs:              1c3
        trp:                e err:                2 rip: fffffffffbbd4f71
         cs:               30 rfl:            10247 rsp: fffffb04508c6ae0
         ss:               38

Warning - stack not written to the dump buffer
fffffb04508c68f0 unix:die+c6 ()
fffffb04508c69e0 unix:trap+11c1 ()
fffffb04508c69f0 unix:cmntrap+e9 ()
fffffb04508c6b10 mac:mac_link_update+1 ()
fffffb04508c6b50 mlxcx:mlxcx_link_state_task+4c ()
fffffb04508c6c00 genunix:taskq_thread+2cd ()
fffffb04508c6c10 unix:thread_start+b ()

This also exposes the fact that currently, the locking on the mlxcx_t is probably inadequate to cover this case and cases like it. We might get away with just a if (mlxp->mlx_mac_hdl != NULL) in mlxcx_update_link_state on x86, but if we ever run on another less-well-ordered CPU that probably won't cut it.

Actions #1

Updated by Alex Wilson 11 months ago

  • Gerrit CR set to 1100

Testing done so far on cr#1100, ps#2:

  • Built a SmartOS PI with updated mlxcx
    • Booted on a box with a connectx-5 10 times in a row (this box previously would hit the null ptr deref about one in every 10 boots on average), created an aggr over the mlxcx interfaces and brought up an IP to check it passes traffic still
    • Booted on a box with a connectx-4 Lx, created aggr and checked IP
  • Ran illumos nightly with patch on omnios r151036
Actions #2

Updated by Electric Monk 7 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 80d1a7bde98a8ab2881940a6fe6775073564f253

commit  80d1a7bde98a8ab2881940a6fe6775073564f253
Author: Alex Wilson <alex@cooperi.net>
Date:   2021-04-06T15:38:25.000Z

    13359 mlxcx_update_link_state can race against mlxcx_register_mac
    13370 mlxcx_intr_n doing redundant check on mleqe_event_type
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Reviewed by: Paul Winder <paul@winder.uk.net>
    Approved by: Dan McDonald <danmcd@joyent.com>

Actions

Also available in: Atom PDF