Bug #14747

closed

pcieb_attach: race between pcie_init_root_port_mps and hotplug events

Added by Robert Mustacchi 21 days ago. Updated 6 days ago.

Status: Closed
Priority: Normal
Assignee:
Category: driver - device drivers
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

While doing bring-up on our platform, we discovered the following panic. The following is the analysis provided by the engineer who looked at this:

pcieb39 is /pci@2c,0/pci1022,1484@8,3
npe2 at root: space 58 offset 0
npe2 is /pci@58,0
NOTICE: pciehpc (pcieb40): card is inserted in the slot pcie16
initialized cpu module 'cpu.generic' on chip 0 core 1 strand 1

panic[cpu1]/thread=fffffcc2686d3c20: BAD TRAP: type=d (#gp General protection) rp=fffffcc2686d3720 addr=fffffcc2686d3960

#gp General protection
addr=0xfffffcc2686d3960
pid=0, pc=0xfffffffffbf0fbe0, sp=0xfffffcc2686d3810, eflags=0x10286
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe>  cr4: 360668<smap,smep,osxsav,pcide,xmme,fxsr,mce,pae,de>
cr2: 0  cr3: 7fff1000  cr8: 0

        rdi: fffffe25aab5a000 rsi: fffffffff7c4f660 rdx: fffffcc2686d3870
        rcx: fffffe25e503a000  r8: fffffcc2686d3c20  r9: fffffffffbc91830
        rax: deadbeefdeadbeef rbx: fffffe25aab5a000 rbp: fffffcc2686d3850
        r10:                0 r11: fffffcc2686d3c20 r12: fffffffff7c4f660
        r13: fffffcc2686d3870 r14:                0 r15: fffffcc2686d3960
        fsb:        200000000 gsb: fffffe25ef331000  ds:               38
         es:               38  fs:                0  gs:                0
        trp:                d err:                0 rip: fffffffffbf0fbe0
         cs:               30 rfl:            10286 rsp: fffffcc2686d3810
         ss:               38

CPU          ADDRESS    TIMESTAMP TYPE  VC HANDLER          PC
  0 fffffffffbe30d60  4443017d61c intr  f6 cbe_fire         tsc_gethrtime_delta+96
  0 fffffffffbe30be8  4442ff951c4 intr  f6 cbe_fire         tsc_gethrtime_delta+96
  0 fffffffffbe30a70  4442fef5e1c intr  f2 unknown          tsc_gethrtime_delta+96
  0 fffffffffbe308f8  4442fdad460 intr  f6 cbe_fire         i86_mwait+12
  0 fffffffffbe30780  4442fbc4acc intr  f6 cbe_fire         tsc_gethrtime_delta+96
  0 fffffffffbe30608  4442f9dcb38 intr  f6 cbe_fire         i86_mwait+12
  0 fffffffffbe30490  4442f7f4488 intr  f6 cbe_fire         i86_mwait+12
  0 fffffffffbe30318  4442f6734b0 intr  25 pcieb_intr_handl i86_mwait+12
  0 fffffffffbe301a0  4442f60c0e4 intr  f6 cbe_fire         i86_mwait+12
  0 fffffffffbe30028  4442f5186d8 intr  25 pcieb_intr_handl i86_mwait+12
  1 fffffe25ef3d8b10  444302103f4 trap   d #gp              ddi_walk_devs+40
  1 fffffe25ef3d8998  4442eb65d5c intr   - (fakesoftint)    fakesoftint+23
  1 fffffe25ef3d8820  4442eb5c568 intr  f2 unknown          sysp_putchar+4f
  1 fffffe25ef3d86a8  4442e9064e4 intr  f2 unknown          pcie_cfgspace_read_uint16+24
  1 fffffe25ef3d8530  4442e8f71c4 intr  f2 unknown          pcie_cfgspace_read_uint8+24
  1 fffffe25ef3d83b8  4442e8e75f8 intr  f2 unknown          pcie_cfgspace_read_uint16+24
  1 fffffe25ef3d8240  4442e8d6dfc intr  f2 unknown          __x86_indirect_thunk_r9+3
  1 fffffe25ef3d80c8  4442e8c7c08 intr  f2 unknown          pcie_cfgspace_read_uint16+24
  1 fffffe25ef3d7f50  4442e8b72b8 intr  f2 unknown          tsc_gethrtime_delta+96
  1 fffffe25ef3d7dd8  4442e8a19f4 intr  f2 unknown          getpcstack+46
  2 fffffe25ef502050  444301e062c intr  23 pcieb_intr_handl i86_mwait+12
  2 fffffe25ef501ed8  444300826b8 intr  23 pcieb_intr_handl i86_mwait+12
  2 fffffe25ef501d60  4442ffe6e84 intr  20 asyintr          ddi_get8+f
  2 fffffe25ef501be8  4442ffe4904 intr  20 asyintr          ddi_get8+f
  2 fffffe25ef501a70  4442ffe2dc0 intr  20 asyintr          ddi_get8+f
  2 fffffe25ef5018f8  4442ffe1240 intr  20 asyintr          ddi_get8+f
  2 fffffe25ef501780  4442ffdfcd8 intr  20 asyintr          ddi_get8+f
  2 fffffe25ef501608  4442ffdd3e8 intr  20 asyintr          ddi_get8+f
  2 fffffe25ef501490  4442ffdb8b8 intr  20 asyintr          ddi_get8+f
  2 fffffe25ef501318  4442ffda15c intr  20 asyintr          i86_mwait+12

Warning - stack not written to the dump buffer
fffffcc2686d3620 unix:rewrite_syscall+0 ()
fffffcc2686d3710 unix:trap+d6a ()
fffffcc2686d3720 unix:cmntrap+1cd ()
fffffcc2686d3850 genunix:ddi_walk_devs+40 ()
fffffcc2686d38c0 pcie:pcie_scan_mps+56 ()
fffffcc2686d3900 pcie:pcie_get_fabric_mps+1d ()
fffffcc2686d3950 pcie:pcie_init_root_port_mps+67 ()
fffffcc2686d39b0 pcieb:pcieb_attach+278 ()
fffffcc2686d3a30 genunix:devi_attach+ad ()
fffffcc2686d3a70 genunix:attach_node+149 ()
fffffcc2686d3ac0 genunix:i_ndi_config_node+16c ()
fffffcc2686d3af0 genunix:i_ddi_attachchild+60 ()
fffffcc2686d3b40 genunix:i_ddi_attach_node_hierarchy+8a ()
fffffcc2686d3b90 genunix:attach_driver_nodes+90 ()
fffffcc2686d3be0 genunix:ddi_hold_installed_driver+196 ()
fffffcc2686d3c00 genunix:attach_drivers+3c ()
fffffcc2686d3c10 unix:thread_start+b ()

The panic is at the site of a call to ddi_get_parent(), specifically the second of three such calls in ddi_walk_devs(), which in its entirety looks like this:

        ASSERT(dip == NULL || ddi_get_parent(dip) == NULL ||
           DEVI_BUSY_OWNED(ddi_get_parent(dip)));

        (void) walk_devs(dip, f, arg, 1);

We have passed the first two checks in the assertion: neither dip nor its parent is NULL, so we are going to get the parent again to see whether DEVI_BUSY_OWNED() is true. Note that there are three calls here because DEVI_BUSY_OWNED() is a broken macro that evaluates its argument twice:

  #define DEVI_BUSY_OWNED(dip)   (DEVI_BUSY_CHANGING(dip) &&    \
          ((DEVI(dip))->devi_busy_thread == curthread))
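
Expanded by hand (a rough sketch of the preprocessor expansion, not literal source), the assertion therefore ends up calling ddi_get_parent() three times:

        ASSERT(dip == NULL ||
            ddi_get_parent(dip) == NULL ||                          /* 1st call */
            (DEVI_BUSY_CHANGING(ddi_get_parent(dip)) &&             /* 2nd call */
            ((DEVI(ddi_get_parent(dip)))->devi_busy_thread ==       /* 3rd call */
            curthread)));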

Since we are blowing up on the second call, this implies a race, and sure enough we see that the node has gone away from under us:

[1]> fffffe25aab5a000::print struct dev_info
{
    devi_parent = 0xdeadbeefdeadbeef
    devi_child = 0xdeadbeefdeadbeef
...

Who else is interesting?

stack pointer for thread fffffcc2685fec20 (tq:pcie_link): fffffcc2685fea10
[ fffffcc2685fea10 _resume_from_idle+0x12b() ]
  fffffcc2685fea40 swtch+0x16b()
  fffffcc2685fea70 cv_wait+0x89(fffffe25aab5e154, fffffe25aab5e068)
  fffffcc2685feab0 ndi_devi_enter+0x87(fffffe25aab5e000, fffffcc2685feadc)
  fffffcc2685feb50 pcie`pcie_link_bw_taskq+0x153(fffffe25aab5e000)
  fffffcc2685fec00 taskq_thread+0x315(fffffe25e4200c40)
  fffffcc2685fec10 thread_start+0xb()

The dev_info_t at fffffe25aab5e000 happens to be bridge 40, which is the parent of the T6:

[1]> fffffe25aab5e000::devinfo
fffffe25aab5e000 pciexclass,060400, instance #40 (driver not attached)

So what we are seeing here is a race between something trying to attach the T6 to this bridge and something else destroying its dev_info_t. The thread waiting on the bridge node is the link bandwidth management interrupt handler (which is believed to be a frequent source of bogus error telemetry). Our current state there looks reasonable and reflects the fact that we haven't actually started running the bandwidth management interrupt code:

[1]> 0xfffffe25e2b677c0::print pcie_bus_t
{
    bus_dip = 0xfffffe25aab5e000
    bus_rp_dip = 0xfffffe25aab5e000
    bus_cfg_hdl = 0xfffffe25e42f7680
    bus_fm_flags = 0
    bus_soft_state = 0
    bus_bdf = 0x8009
    bus_rp_bdf = 0x8009
    bus_dev_ven_id = 0x14831022
    bus_rev_id = 0
    bus_hdr_type = 0x1
    bus_dev_type = 0x40
    bus_bdg_secbus = 0x85
    bus_pcie_off = 0x58
    bus_aer_off = 0x150
    bus_pcix_off = 0
    bus_pci_hp_off = 0
    bus_ecc_ver = 0
    bus_bus_range = {
        lo = 0x85
        hi = 0x85
    }
    bus_addr_ranges = 0xfffffe25e4275a48
    bus_addr_entries = 0x1
    bus_assigned_addr = 0
    bus_assigned_entries = 0
    bus_pfd = 0xfffffe25e42f5888
    bus_dom = 0xfffffe25e4306190
    bus_mps = 0
    bus_plat_private = 0
    bus_hp_sup_modes = 0x4 (PCIE_NATIVE_HP_MODE)
    bus_hp_curr_mode = 0x4 (PCIE_NATIVE_HP_MODE)
    bus_hp_ctrl = 0xfffffe25e4300c00
    bus_ari = 0
    bus_pcie2pci_secbus = 0
    bus_speed_mutex = {
        _opaque = [ 0 ]
    }
    bus_speed_flags = 0 (0)
    bus_max_width = 6 (PCIE_LINK_WIDTH_X16)
    bus_cur_width = 6 (PCIE_LINK_WIDTH_X16)
    bus_sup_speed = 0xf (PCIE_LINK_SPEED_{2_5|5|8|16})
    bus_max_speed = 0x8 (PCIE_LINK_SPEED_16)
    bus_cur_speed = 0x4 (PCIE_LINK_SPEED_8)
    bus_target_speed = 0x8 (PCIE_LINK_SPEED_16)
    bus_lbw_mutex = {
        _opaque = [ 0 ]
    }
    bus_lbw_cv = {
        _opaque = 0
    }
    bus_lbw_state = 0x3 (PCIE_LBW_S_{ENABLED|DISPATCHED})
    bus_lbw_ent = {
        tqent_next = 0
        tqent_prev = 0
        tqent_func = pcie`pcie_link_bw_taskq
        tqent_arg = 0xfffffe25aab5e000
        tqent_un = {
            tqent_bucket = 1
            tqent_flags = 0x1
        }
        tqent_thread = 0
        tqent_cv = {
            _opaque = 0
        }
    }
    bus_lbw_nevents = 0x1
    bus_lbw_pbuf = 0xfffffe25e2a5d000 "" 
    bus_lbw_cbuf = 0xfffffe25e2a5d4c0 "" 
}

So whodunit?

[1]> fffffe25aab5a000::whatis
fffffe25aab5a000 is freed from dev_info_node_cache:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
fffffe25e2e7baa0 fffffe25aab5a000      2221a34d331 fffffcc2685a4c20
                 fffffe25e2c59008 fffffe2382a81a80 fffffe24be51be98
                 kmem_cache_free_debug+0xfa
                 kmem_cache_free+0x86
                 i_ddi_free_node+0x27b
                 ddi_remove_child+0xb7
                 devi_detach_node+0x3a5
                 ndi_devi_offline+0xa6
                 pcicfg`pcicfg_teardown_device+0x59
                 pcicfg`pcicfg_unconfigure+0x14e
                 stubs_common_code+0x59
                 pcie`pcie_hp_unprobe+0x34
                 pcie`pciehpc_slot_unprobe+0x68
                 pcie`pciehpc_hp_ops+0x178
                 pcie`pcie_hp_common_ops+0x198
                 ddihp_cn_pre_change_state+0xa1
                 ddihp_connector_ops+0x1b1

There are several paths that can end up in ddihp_connector_ops() downgrading the state. It's interesting that we saw the notice stating that e1000g0 was going offline; while that's not the device that led to the panic, it's surprising that the device would go offline after previously being enumerated and attached. We may be racing with something happening inside the SMU firmware here, but it's very difficult to know for certain. One thing that may help -- beyond getting a dump so we don't have to debug this live -- will be to enable the DDI_HP_* flags in ddidebug. For the moment, the need to make progress elsewhere necessitates ending this debug session, but it doesn't seem like this will be too hard to reproduce by booting in a loop.


From there, Rick Altherr noted the following:

I set a breakpoint on pcicfg_unconfigure() to try to figure out where the unprobe is coming from. Unfortunately, it is dispatched asynchronously from a taskq, but this does let me verify that the unprobe event happens at the same time that attach_drivers is blocked in ndi_devi_enter for the same dip:

[0]> fffffb04506d37b0::stack                                                                                                                               
swtch+0x16b()                                                                                                                                              
cv_wait+0x89(fffffdca3f53a72c, fffffdca3f53a640)                                                                                                           
ndi_devi_enter+0x87(fffffdca3f53a5d8, fffffb04506d386c)                                                                                                    
pcie`pcie_scan_mps+0x43(fffffdca3f53b178, fffffdca3f534e88, fffffb04506d391c)                                                                              
pcie`pcie_get_fabric_mps+0x1d(fffffdca3f53b178, fffffdca3f534e88, fffffb04506d391c)                                                                        
pcie`pcie_init_root_port_mps+0x67(fffffdca3f53a5d8)                          
pcieb`pcieb_attach+0x278(fffffdca3f53a5d8, 0)                                                                                                              
devi_attach+0xad(fffffdca3f53a5d8, 0)                                                                                                                      
attach_node+0x149(fffffdca3f53a5d8)                                          
i_ndi_config_node+0x16c(fffffdca3f53a5d8, 6, 0)                                                                                                            
i_ddi_attachchild+0x60(fffffdca3f53a5d8)                                                                                                                   
i_ddi_attach_node_hierarchy+0x8a(fffffdca3f53a5d8)                                                                                                         
attach_driver_nodes+0x90(50)                                                                                                                               
ddi_hold_installed_driver+0x196(50)                                                                                                                        
attach_drivers+0x3c()                                                                                                                                      
thread_start+0xb()

[0]> fffffb045057a860::stack
0xfffffdca8e140900()
pcie`pciehpc_slot_unprobe+0x68(fffffdca8e140900)
pcie`pciehpc_hp_ops+0x178(fffffdca3f53a5d8, fffffdca7968a340, 4, 0, 0)
pcie`pcie_hp_common_ops+0x198(fffffdca3f53a5d8, fffffdca7968a340, 4, 0, 0)
ddihp_cn_pre_change_state+0xa1(fffffdca7968e680, 2000)
ddihp_connector_ops+0x1b1(fffffdca7968e680, 2, fffffb045057aadc, fffffb045057aae4)
ddihp_cn_req_handler+0x9f(fffffdca7968e680, 2000)
ddihp_cn_run_event+0xa3(fffffdca79688250)
taskq_d_thread+0x12d(fffffdca7947cba0)
thread_start+0xb()

[0]> fffffdca79688250::print ddi_hp_cn_async_event_entry_t
{
    dip = 0xfffffdca3f53a5d8
    cn_name = 0xfffffdca7968a280 "pcie9" 
    target_state = 0t8192 (DDI_HP_CN_STATE_PRESENT)
}

The last bit of pcieb_attach is:

    /*
     * Initialize interrupt handlers. Ignore return value.
     */
    (void) pcieb_intr_attach(pcieb);

    (void) pcie_hpintr_enable(devi);

    (void) pcie_link_bw_enable(devi);

    /* Do any platform specific workarounds needed at this time */
    pcieb_plat_attach_workaround(devi);

    /*
     * If this is a root port, determine and set the max payload size.
     * Since this will involve scanning the fabric, all error enabling
     * and sw workarounds should be in place before doing this.
     */
    if (PCIE_IS_RP(bus_p))
        pcie_init_root_port_mps(devi);

    ddi_report_dev(devi);
    return (DDI_SUCCESS);

Note that pcie_hpintr_enable is called before pcie_init_root_port_mps. So pcie_init_root_port_mps needs to be tolerant of hotplug interrupts arriving while it is doing its thing. Now, we know from the above callstacks that it eventually calls ndi_devi_enter to prevent modification of the dev_info subtree. ddihp_cn_run_event also calls ndi_devi_enter at the start of the hotplug interrupt handling and holds it for the duration. So these should be mutually exclusive, right?

If you look carefully at pcie_scan_mps, you'll notice that, while ndi_devi_enter is called on the parent before ddi_walk_devs, the dip provided to ddi_walk_devs was passed in to pcie_scan_mps and was thus obtained before ndi_devi_enter. If we go up a few frames in the stack, pcie_init_root_port_mps does the following:

    (void) pcie_get_fabric_mps(ddi_get_parent(dip),
        ddi_get_child(dip), &max_supported);

and pcie_get_fabric_mps passes straight through to pcie_scan_mps.

So, pcie_init_root_port_mps gets the dip of the root port's child (i.e., a device attached to that port) without first ensuring that the root port's subtree can't change (i.e., without calling ndi_devi_enter). If a hotplug event that unconfigures the port happens to arrive between that lookup and the ndi_devi_enter in pcie_scan_mps, the child is removed and pcie_scan_mps is left with a dangling reference to it. As soon as the hotplug event finishes, pcie_scan_mps is unblocked and immediately tries to walk the dangling reference.
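
As a rough sketch of the shape of the fix this implies (not the literal change; it assumes the two-argument ndi_devi_enter() interface visible in the stack traces above), the root port's own node needs to be entered before its child is looked up:

        int circ;

        /*
         * Sketch only: hold the root port's DEVI_BUSY before fetching its
         * child, so an in-flight hotplug unconfigure can't free the child
         * dev_info_t between ddi_get_child() and the fabric walk.
         */
        ndi_devi_enter(dip, &circ);
        (void) pcie_get_fabric_mps(ddi_get_parent(dip),
            ddi_get_child(dip), &max_supported);
        ndi_devi_exit(dip, circ);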


From there I went through and did an initial analysis of the NDI locking here:

So I spent a bit of additional time going through the NDI here. First, a little background. In general, when a given driver is active and one of its entry points is being called, its dev_info_t and that of its parent will by definition be valid and something you can use for iteration through the tree. However, there is no such guarantee for your children in the tree, though siblings are generally safe because of how the locking works here. For a dev_info_t, there are a few different things that keep a node in existence here:

  • devi_ref, which is a reference count on the node. This reference count is incremented when other things in the system have the device open or are going through various character/block device operations.
  • The DEVI_BUSY flag, which is tied into the ndi_devi_enter() and ndi_devi_exit() functions. These are used whenever we need to guarantee that no changes are being made to our children, which is important for iterating through them (a rough sketch of this idiom follows the list).
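
As a rough illustration of that second mechanism (a sketch, not code from the driver; pdip here is just a hypothetical parent node, using the two-argument ndi_devi_enter() seen in the stacks above):

        int circ;
        dev_info_t *cdip;

        ndi_devi_enter(pdip, &circ);            /* sets DEVI_BUSY on pdip */
        for (cdip = ddi_get_child(pdip); cdip != NULL;
            cdip = ddi_get_next_sibling(cdip)) {
                /* pdip's child list cannot change while we hold it */
        }
        ndi_devi_exit(pdip, circ);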

Today, before we call a driver's attach(9E) entry point, we ensure that we call ndi_devi_enter() on its parent. This is verified in attach_node(), which is what ultimately calls into a driver's attach(9E) routine. This also ensures things such as the driver being held. It is also what guarantees attach/detach ordering and that cb_ops and related entry points can't be called during attach. The same holds in the detach(9E) case.

In this case we basically have the following relationship:

  npe
   |
   v
 pcieb
   |
   v
 t4nex
   |
   v
cxgbe0 - cxgbe1

So in the thread that's calling pcieb_attach, which is ultimately the one that panics, when we enter pcie_init_root_port_mps() our thread holds the npe instance's DEVI_BUSY flag. Next, we have the thread that's doing the detach of t4nex; it holds the DEVI_BUSY flag for our pcieb instance. For pcieb to look at its children and ensure they can't be changing (which the dump shows they can), it needs to enter its own DEVI_BUSY flag via ndi_devi_enter(). This is what Rick's fix did and why it worked.

Now, the next question is whether this is sufficient, since so far we're only talking about a single layer. What if we had several PCIe switches in between here and there and were doing the fabric scan? The first key to this is the implementation of how we scan the fabric: it uses ddi_walk_devs(), which requires that the parent of the passed-in dev_info_t is held and marked BUSY. From that point, as it walks through the tree, it will make the necessary calls to ndi_devi_enter() and ndi_devi_exit(), ensuring that we have safety.
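
In other words, the calling contract looks roughly like this (a sketch under the same two-argument ndi_devi_enter() interface; my_walker is a hypothetical callback, not a function in the driver):

        int circ;

        /* Contract: the parent of the starting node must be held BUSY. */
        ndi_devi_enter(ddi_get_parent(dip), &circ);
        ddi_walk_devs(dip, my_walker, arg);     /* locks nodes as it descends */
        ndi_devi_exit(ddi_get_parent(dip), circ);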

So, with that, let's return to the actual function that we have. pcie_get_fabric_mps() is called with rc_dip corresponding to the npe instance and dip corresponding to the t4nex instance; ironically, the pcieb dip is not there. In pcie_scan_mps(), we currently do an enter on the parent of dip, which would be pcieb. But as pointed out, this is both too late and redundant with the fix that Rick put together, so we can eliminate it. We don't need any other locking beyond that, because at this point we correctly own the pcieb dev_info_t and are allowed to walk through our children. walk_devs() (the implementation of ddi_walk_devs()) will take care of locking the t4nex instance before descending further.

So with all that in mind, I'd summarize this as:

  • The fix by @kc8apf and the analysis are spot on.
  • The change should be safe for an arbitrary tree based on my analysis.
  • We can opt to remove the ndi_devi_enter in pcie_scan_mps because it's redundant.
#1

Updated by Robert Mustacchi 21 days ago

  • Assignee changed from Robert Mustacchi to Rick Altherr
#2

Updated by Electric Monk 15 days ago

  • Gerrit CR set to 2195
#3

Updated by Rick Altherr 7 days ago

Testing:
- booted hundreds of times on multiple Oxide platform systems which were hitting this on nearly every boot. Due to a different bug, that platform reliably delivers a PCIe presence detect changed hotplug event immediately when hotplug interrupts are enabled.
- booted on an i86pc platform (AMD Rome) and verified all PCIe devices were discovered correctly.
- Robert Mustacchi also tested on additional AMD Rome systems

#4

Updated by Electric Monk 6 days ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit da37308cbe221e77b90426b63bf79a8b67c490b0

commit  da37308cbe221e77b90426b63bf79a8b67c490b0
Author: Rick Altherr <rick@oxidecomputer.com>
Date:   2022-06-30T01:22:44.000Z

    14747 pcieb_attach: race between pcie_init_root_port_mps and hotplug events
    Reviewed by: Robert Mustacchi <rm+illumos@fingolfin.org>
    Approved by: Richard Lowe <richlowe@richlowe.net>
