Bug #14747: pcieb_attach: race between pcie_init_root_port_mps and hotplug events
Status: Closed (100% done)
Description
While doing bring-up on our platform, we discovered the following panic. Here is the analysis provided by the engineer who looked at it:
pcieb39 is /pci@2c,0/pci1022,1484@8,3
npe2 at root: space 58 offset 0
npe2 is /pci@58,0
NOTICE: pciehpc (pcieb40): card is inserted in the slot pcie16
initialized cpu module 'cpu.generic' on chip 0 core 1 strand 1

panic[cpu1]/thread=fffffcc2686d3c20: BAD TRAP: type=d (#gp General protection) rp=fffffcc2686d3720 addr=fffffcc2686d3960

#gp General protection
addr=0xfffffcc2686d3960
pid=0, pc=0xfffffffffbf0fbe0, sp=0xfffffcc2686d3810, eflags=0x10286
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 360668<smap,smep,osxsav,pcide,xmme,fxsr,mce,pae,de>
cr2: 0 cr3: 7fff1000 cr8: 0

        rdi: fffffe25aab5a000 rsi: fffffffff7c4f660 rdx: fffffcc2686d3870
        rcx: fffffe25e503a000  r8: fffffcc2686d3c20  r9: fffffffffbc91830
        rax: deadbeefdeadbeef rbx: fffffe25aab5a000 rbp: fffffcc2686d3850
        r10:                0 r11: fffffcc2686d3c20 r12: fffffffff7c4f660
        r13: fffffcc2686d3870 r14:                0 r15: fffffcc2686d3960
        fsb:        200000000 gsb: fffffe25ef331000  ds:               38
         es:               38  fs:                0  gs:                0
        trp:                d err:                0 rip: fffffffffbf0fbe0
         cs:               30 rfl:            10286 rsp: fffffcc2686d3810
         ss:               38

CPU          ADDRESS      TIMESTAMP TYPE  VC HANDLER           PC
  0 fffffffffbe30d60    4443017d61c intr  f6 cbe_fire          tsc_gethrtime_delta+96
  0 fffffffffbe30be8    4442ff951c4 intr  f6 cbe_fire          tsc_gethrtime_delta+96
  0 fffffffffbe30a70    4442fef5e1c intr  f2 unknown           tsc_gethrtime_delta+96
  0 fffffffffbe308f8    4442fdad460 intr  f6 cbe_fire          i86_mwait+12
  0 fffffffffbe30780    4442fbc4acc intr  f6 cbe_fire          tsc_gethrtime_delta+96
  0 fffffffffbe30608    4442f9dcb38 intr  f6 cbe_fire          i86_mwait+12
  0 fffffffffbe30490    4442f7f4488 intr  f6 cbe_fire          i86_mwait+12
  0 fffffffffbe30318    4442f6734b0 intr  25 pcieb_intr_handl  i86_mwait+12
  0 fffffffffbe301a0    4442f60c0e4 intr  f6 cbe_fire          i86_mwait+12
  0 fffffffffbe30028    4442f5186d8 intr  25 pcieb_intr_handl  i86_mwait+12
  1 fffffe25ef3d8b10    444302103f4 trap   d #gp               ddi_walk_devs+40
  1 fffffe25ef3d8998    4442eb65d5c intr   - (fakesoftint)     fakesoftint+23
  1 fffffe25ef3d8820    4442eb5c568 intr  f2 unknown           sysp_putchar+4f
  1 fffffe25ef3d86a8    4442e9064e4 intr  f2 unknown           pcie_cfgspace_read_uint16+24
  1 fffffe25ef3d8530    4442e8f71c4 intr  f2 unknown           pcie_cfgspace_read_uint8+24
  1 fffffe25ef3d83b8    4442e8e75f8 intr  f2 unknown           pcie_cfgspace_read_uint16+24
  1 fffffe25ef3d8240    4442e8d6dfc intr  f2 unknown           __x86_indirect_thunk_r9+3
  1 fffffe25ef3d80c8    4442e8c7c08 intr  f2 unknown           pcie_cfgspace_read_uint16+24
  1 fffffe25ef3d7f50    4442e8b72b8 intr  f2 unknown           tsc_gethrtime_delta+96
  1 fffffe25ef3d7dd8    4442e8a19f4 intr  f2 unknown           getpcstack+46
  2 fffffe25ef502050    444301e062c intr  23 pcieb_intr_handl  i86_mwait+12
  2 fffffe25ef501ed8    444300826b8 intr  23 pcieb_intr_handl  i86_mwait+12
  2 fffffe25ef501d60    4442ffe6e84 intr  20 asyintr           ddi_get8+f
  2 fffffe25ef501be8    4442ffe4904 intr  20 asyintr           ddi_get8+f
  2 fffffe25ef501a70    4442ffe2dc0 intr  20 asyintr           ddi_get8+f
  2 fffffe25ef5018f8    4442ffe1240 intr  20 asyintr           ddi_get8+f
  2 fffffe25ef501780    4442ffdfcd8 intr  20 asyintr           ddi_get8+f
  2 fffffe25ef501608    4442ffdd3e8 intr  20 asyintr           ddi_get8+f
  2 fffffe25ef501490    4442ffdb8b8 intr  20 asyintr           ddi_get8+f
  2 fffffe25ef501318    4442ffda15c intr  20 asyintr           i86_mwait+12

Warning - stack not written to the dump buffer
fffffcc2686d3620 unix:rewrite_syscall+0 ()
fffffcc2686d3710 unix:trap+d6a ()
fffffcc2686d3720 unix:cmntrap+1cd ()
fffffcc2686d3850 genunix:ddi_walk_devs+40 ()
fffffcc2686d38c0 pcie:pcie_scan_mps+56 ()
fffffcc2686d3900 pcie:pcie_get_fabric_mps+1d ()
fffffcc2686d3950 pcie:pcie_init_root_port_mps+67 ()
fffffcc2686d39b0 pcieb:pcieb_attach+278 ()
fffffcc2686d3a30 genunix:devi_attach+ad ()
fffffcc2686d3a70 genunix:attach_node+149 ()
fffffcc2686d3ac0 genunix:i_ndi_config_node+16c ()
fffffcc2686d3af0 genunix:i_ddi_attachchild+60 ()
fffffcc2686d3b40 genunix:i_ddi_attach_node_hierarchy+8a ()
fffffcc2686d3b90 genunix:attach_driver_nodes+90 ()
fffffcc2686d3be0 genunix:ddi_hold_installed_driver+196 ()
fffffcc2686d3c00 genunix:attach_drivers+3c ()
fffffcc2686d3c10 unix:thread_start+b ()
The panic is at the site of ddi_get_parent(), specifically the second of three such calls in ddi_walk_devs(), which looks like this in its entirety:
	ASSERT(dip == NULL || ddi_get_parent(dip) == NULL ||
	    DEVI_BUSY_OWNED(ddi_get_parent(dip)));
	(void) walk_devs(dip, f, arg, 1);
We have passed the first two checks in the assertion: neither dip nor its parent is NULL, so we are going to get the parent again to see whether DEVI_BUSY_OWNED() is true. Note that there are three calls here because DEVI_BUSY_OWNED() is a broken macro that evaluates its argument twice:
	#define	DEVI_BUSY_OWNED(dip)	(DEVI_BUSY_CHANGING(dip) &&	\
		((DEVI(dip))->devi_busy_thread == curthread))
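Expanding that third disjunct by hand makes the extra calls visible; this is just a textual expansion of the assertion shown above, not new code:

	/*
	 * Hand expansion of the assertion's third disjunct,
	 * DEVI_BUSY_OWNED(ddi_get_parent(dip)):
	 */
	(DEVI_BUSY_CHANGING(ddi_get_parent(dip)) &&
	    ((DEVI(ddi_get_parent(dip)))->devi_busy_thread == curthread))

Together with the ddi_get_parent(dip) == NULL disjunct, that is three separate fetches of the parent pointer from dip.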
Since we are blowing up on the second call, this implies a race, and sure enough we see that the node has gone away from under us:
[1]> fffffe25aab5a000::print struct dev_info
{
    devi_parent = 0xdeadbeefdeadbeef
    devi_child = 0xdeadbeefdeadbeef
    ...
Who else is interesting?
stack pointer for thread fffffcc2685fec20 (tq:pcie_link): fffffcc2685fea10
[ fffffcc2685fea10 _resume_from_idle+0x12b() ]
  fffffcc2685fea40 swtch+0x16b()
  fffffcc2685fea70 cv_wait+0x89(fffffe25aab5e154, fffffe25aab5e068)
  fffffcc2685feab0 ndi_devi_enter+0x87(fffffe25aab5e000, fffffcc2685feadc)
  fffffcc2685feb50 pcie`pcie_link_bw_taskq+0x153(fffffe25aab5e000)
  fffffcc2685fec00 taskq_thread+0x315(fffffe25e4200c40)
  fffffcc2685fec10 thread_start+0xb()

The dev_info_t at fffffe25aab5e000 happens to be bridge 40, which is the parent of the T6:

[1]> fffffe25aab5e000::devinfo
fffffe25aab5e000 pciexclass,060400, instance #40 (driver not attached)
So what we are seeing here is a race between something trying to attach the T6 to this bridge and something else destroying its dev_info_t. The thread waiting on the bridge node is the link bandwidth management interrupt handler (which is believed to be a frequent source of bogus error telemetry). Our current state there looks reasonable and reflects the fact that we haven't actually started running the bandwidth management interrupt code:
[1]> 0xfffffe25e2b677c0::print pcie_bus_t
{
    bus_dip = 0xfffffe25aab5e000
    bus_rp_dip = 0xfffffe25aab5e000
    bus_cfg_hdl = 0xfffffe25e42f7680
    bus_fm_flags = 0
    bus_soft_state = 0
    bus_bdf = 0x8009
    bus_rp_bdf = 0x8009
    bus_dev_ven_id = 0x14831022
    bus_rev_id = 0
    bus_hdr_type = 0x1
    bus_dev_type = 0x40
    bus_bdg_secbus = 0x85
    bus_pcie_off = 0x58
    bus_aer_off = 0x150
    bus_pcix_off = 0
    bus_pci_hp_off = 0
    bus_ecc_ver = 0
    bus_bus_range = {
        lo = 0x85
        hi = 0x85
    }
    bus_addr_ranges = 0xfffffe25e4275a48
    bus_addr_entries = 0x1
    bus_assigned_addr = 0
    bus_assigned_entries = 0
    bus_pfd = 0xfffffe25e42f5888
    bus_dom = 0xfffffe25e4306190
    bus_mps = 0
    bus_plat_private = 0
    bus_hp_sup_modes = 0x4 (PCIE_NATIVE_HP_MODE)
    bus_hp_curr_mode = 0x4 (PCIE_NATIVE_HP_MODE)
    bus_hp_ctrl = 0xfffffe25e4300c00
    bus_ari = 0
    bus_pcie2pci_secbus = 0
    bus_speed_mutex = {
        _opaque = [ 0 ]
    }
    bus_speed_flags = 0 (0)
    bus_max_width = 6 (PCIE_LINK_WIDTH_X16)
    bus_cur_width = 6 (PCIE_LINK_WIDTH_X16)
    bus_sup_speed = 0xf (PCIE_LINK_SPEED_{2_5|5|8|16})
    bus_max_speed = 0x8 (PCIE_LINK_SPEED_16)
    bus_cur_speed = 0x4 (PCIE_LINK_SPEED_8)
    bus_target_speed = 0x8 (PCIE_LINK_SPEED_16)
    bus_lbw_mutex = {
        _opaque = [ 0 ]
    }
    bus_lbw_cv = {
        _opaque = 0
    }
    bus_lbw_state = 0x3 (PCIE_LBW_S_{ENABLED|DISPATCHED})
    bus_lbw_ent = {
        tqent_next = 0
        tqent_prev = 0
        tqent_func = pcie`pcie_link_bw_taskq
        tqent_arg = 0xfffffe25aab5e000
        tqent_un = {
            tqent_bucket = 1
            tqent_flags = 0x1
        }
        tqent_thread = 0
        tqent_cv = {
            _opaque = 0
        }
    }
    bus_lbw_nevents = 0x1
    bus_lbw_pbuf = 0xfffffe25e2a5d000 ""
    bus_lbw_cbuf = 0xfffffe25e2a5d4c0 ""
}
So whodunit?
[1]> fffffe25aab5a000::whatis
fffffe25aab5a000 is freed from dev_info_node_cache:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
fffffe25e2e7baa0 fffffe25aab5a000      2221a34d331 fffffcc2685a4c20
                 fffffe25e2c59008 fffffe2382a81a80 fffffe24be51be98
                 kmem_cache_free_debug+0xfa
                 kmem_cache_free+0x86
                 i_ddi_free_node+0x27b
                 ddi_remove_child+0xb7
                 devi_detach_node+0x3a5
                 ndi_devi_offline+0xa6
                 pcicfg`pcicfg_teardown_device+0x59
                 pcicfg`pcicfg_unconfigure+0x14e
                 stubs_common_code+0x59
                 pcie`pcie_hp_unprobe+0x34
                 pcie`pciehpc_slot_unprobe+0x68
                 pcie`pciehpc_hp_ops+0x178
                 pcie`pcie_hp_common_ops+0x198
                 ddihp_cn_pre_change_state+0xa1
                 ddihp_connector_ops+0x1b1
There are several paths that can end up in ddihp_connector_ops() downgrading the state. It's interesting that we saw the notice stating that e1000g0 was going offline; while that's not the device that led to the panic, it's surprising that the device would go offline after previously being enumerated and attached. We may be racing with something happening inside the SMU firmware here, but it's very difficult to know for certain. One thing that may help -- beyond getting a dump so we don't have to debug this live -- will be to enable the DDI_HP_* flags in ddidebug. For the moment, the need to make progress elsewhere necessitates ending this debug session, but it doesn't seem like this will be too hard to reproduce by booting in a loop.
From there, Rick Altherr noted the following:
I set a breakpoint on pcicfg_unconfigure() to try to figure out where the unprobe is coming from. Unfortunately, it is dispatched asynchronously from a taskq, but this does let me verify that the unprobe event happens at the same time as attach_drivers is blocked on ndi_devi_enter for the same dip:
[0]> fffffb04506d37b0::stack
swtch+0x16b()
cv_wait+0x89(fffffdca3f53a72c, fffffdca3f53a640)
ndi_devi_enter+0x87(fffffdca3f53a5d8, fffffb04506d386c)
pcie`pcie_scan_mps+0x43(fffffdca3f53b178, fffffdca3f534e88, fffffb04506d391c)
pcie`pcie_get_fabric_mps+0x1d(fffffdca3f53b178, fffffdca3f534e88, fffffb04506d391c)
pcie`pcie_init_root_port_mps+0x67(fffffdca3f53a5d8)
pcieb`pcieb_attach+0x278(fffffdca3f53a5d8, 0)
devi_attach+0xad(fffffdca3f53a5d8, 0)
attach_node+0x149(fffffdca3f53a5d8)
i_ndi_config_node+0x16c(fffffdca3f53a5d8, 6, 0)
i_ddi_attachchild+0x60(fffffdca3f53a5d8)
i_ddi_attach_node_hierarchy+0x8a(fffffdca3f53a5d8)
attach_driver_nodes+0x90(50)
ddi_hold_installed_driver+0x196(50)
attach_drivers+0x3c()
thread_start+0xb()

[0]> fffffb045057a860::stack
0xfffffdca8e140900()
pcie`pciehpc_slot_unprobe+0x68(fffffdca8e140900)
pcie`pciehpc_hp_ops+0x178(fffffdca3f53a5d8, fffffdca7968a340, 4, 0, 0)
pcie`pcie_hp_common_ops+0x198(fffffdca3f53a5d8, fffffdca7968a340, 4, 0, 0)
ddihp_cn_pre_change_state+0xa1(fffffdca7968e680, 2000)
ddihp_connector_ops+0x1b1(fffffdca7968e680, 2, fffffb045057aadc, fffffb045057aae4)
ddihp_cn_req_handler+0x9f(fffffdca7968e680, 2000)
ddihp_cn_run_event+0xa3(fffffdca79688250)
taskq_d_thread+0x12d(fffffdca7947cba0)
thread_start+0xb()

[0]> fffffdca79688250::print ddi_hp_cn_async_event_entry_t
{
    dip = 0xfffffdca3f53a5d8
    cn_name = 0xfffffdca7968a280 "pcie9"
    target_state = 0t8192 (DDI_HP_CN_STATE_PRESENT)
}
The last bit of pcieb_attach is:
	/*
	 * Initialize interrupt handlers. Ignore return value.
	 */
	(void) pcieb_intr_attach(pcieb);

	(void) pcie_hpintr_enable(devi);

	(void) pcie_link_bw_enable(devi);

	/* Do any platform specific workarounds needed at this time */
	pcieb_plat_attach_workaround(devi);

	/*
	 * If this is a root port, determine and set the max payload size.
	 * Since this will involve scanning the fabric, all error enabling
	 * and sw workarounds should be in place before doing this.
	 */
	if (PCIE_IS_RP(bus_p))
		pcie_init_root_port_mps(devi);

	ddi_report_dev(devi);
	return (DDI_SUCCESS);
Note that pcie_hpintr_enable is called before pcie_init_root_port_mps. So pcie_init_root_port_mps needs to be tolerant of hotplug interrupts arriving while it is doing its thing. Now, we know from the above call stacks that it eventually calls ndi_devi_enter to prevent modification of the dev_info subtree. ddihp_cn_run_event also calls ndi_devi_enter at the start of the hotplug interrupt handling and holds it for the duration. So these should be mutually exclusive, right?
If you look carefully at pcie_scan_mps, you'll notice that, while ndi_devi_enter is called on the parent before ddi_walk_devs, the dip provided to ddi_walk_devs was passed in to pcie_scan_mps and was thus obtained before ndi_devi_enter. If we go up a few frames in the stack, pcie_init_root_port_mps does the following:
(void) pcie_get_fabric_mps(ddi_get_parent(dip), ddi_get_child(dip), &max_supported);
and pcie_get_fabric_mps passes straight through to pcie_scan_mps.
So, pcie_init_root_port_mps gets the dip of the root port's child, i.e. a device attached to that port, without ensuring that the root port's subtree can't change (i.e. without calling ndi_devi_enter). If a hotplug event that unconfigures the port happens to arrive between that point and the ndi_devi_enter in pcie_scan_mps, the child is removed and pcie_scan_mps is left with a dangling reference to it. As soon as the hotplug event finishes, pcie_scan_mps is unblocked and immediately tries to walk the dangling reference.
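To make the window concrete, here is a minimal sketch of the ordering just described. It is not the exact source: the callback name and the handling of the third parameter are illustrative, and the two-argument ndi_devi_enter() form simply matches the stacks shown earlier.

	/* Illustrative stand-in for the real per-device MPS walk callback. */
	static int
	pcie_scan_mps_cb(dev_info_t *cdip, void *arg)
	{
		return (DDI_WALK_CONTINUE);
	}

	static void
	pcie_scan_mps(dev_info_t *rc_dip, dev_info_t *dip, int *max_supported)
	{
		int circ;

		/*
		 * 'dip' was obtained by the caller via ddi_get_child()
		 * before any ndi_devi_enter().  A hotplug unconfigure can
		 * free the node it points to while we block here waiting
		 * for the parent (pcieb), which is exactly the panic above.
		 */
		ndi_devi_enter(ddi_get_parent(dip), &circ);
		ddi_walk_devs(dip, pcie_scan_mps_cb, max_supported);
		ndi_devi_exit(ddi_get_parent(dip), circ);
	}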
From there I went through and did an initial analysis of the NDI locking here:
So I spent a bit of additional time going through the NDI here. First, a little background. In general, when a given driver is active and one of its entry points is being called into, by definition its dev_info_t and that of its parent will be valid and are something you can access for iteration through the tree. However, there is no guarantee about your children in the tree, though siblings are generally safe because of how the locking works here. For a dev_info_t, there are a few different things that are used to keep a node in existence:
- devi_ref, which is a reference count on the node. This reference count is incremented when other things in the system have the device open or are going through various character/block device operations.
- The DEVI_BUSY flag, which is tied into the ndi_devi_enter() and ndi_devi_exit() functions. These are used whenever we need to guarantee that no changes are being made to our children, which is important for iterating through them (a generic sketch of this pattern follows below).
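As a generic illustration of the DEVI_BUSY rule (this is a made-up example, not code from this bug; the helper name is invented and the two-argument ndi_devi_enter() form is the one visible in the stacks above):

	static void
	walk_my_children(dev_info_t *self)
	{
		dev_info_t *child;
		int circ;

		ndi_devi_enter(self, &circ);	/* marks 'self' DEVI_BUSY */
		for (child = ddi_get_child(self); child != NULL;
		    child = ddi_get_next_sibling(child)) {
			/*
			 * 'child' cannot be detached or freed while we
			 * hold our own node busy.
			 */
		}
		ndi_devi_exit(self, circ);
	}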
Today, before we call a driver's attach(9E) entry point, we will ensure that we call ndi_devi_enter() on its parent. This is verified in attach_node(), which is what will ultimately call into a driver's attach(9E) routine. This also ensures things like the driver being held and related. It is also what guarantees the attach/detach ordering and that cbops and related can't be called during attach. Similarly, when we're in the detach(9E) case, this is happening as well.
In this case we basically have the following relationship:
npe
 |
 v
pcieb
 |
 v
t4nex
 |
 v
cxgbe0 - cxgbe1
So in the thread that's calling pcieb_attach, which ultimately is the one that's panicking, when we enter pcie_init_root_port_mps() we hold the npe instance's DEVI_BUSY flag in our thread. Next, we have the thread that's doing the detach of t4nex. This holds the DEVI_BUSY flag for our pcieb instance. For pcieb to look at its children and ensure they can't be changing, which the dump shows they can, it needs to enter its own DEVI_BUSY flag via ndi_devi_enter(). This is what Rick's fix did and why it worked.
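In other words, the shape of the fix is for pcie_init_root_port_mps() to enter its own node before it ever fetches the child. A minimal sketch of that shape, assuming the two-argument ndi_devi_enter() used elsewhere in this dump (illustrative only, not the literal committed diff):

	void
	pcie_init_root_port_mps(dev_info_t *dip)
	{
		int max_supported = 0;	/* illustrative initial value */
		int circ;

		/*
		 * Hold our own children steady before looking them up; a
		 * hotplug unconfigure now has to wait until we are done.
		 */
		ndi_devi_enter(dip, &circ);
		(void) pcie_get_fabric_mps(ddi_get_parent(dip),
		    ddi_get_child(dip), &max_supported);
		ndi_devi_exit(dip, circ);

		/* ... rest of the MPS setup ... */
	}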
Now, the next question is whether this is sufficient, since here we're only talking about a single layer. What if we had several PCIe switches in between and were doing the fabric scan? The first key to this is the implementation of how we scan the fabric. It uses ddi_walk_devs(), which requires that the parent of the passed-in dev_info_t is held and marked as BUSY. At that point, as it walks through the tree, it will ensure that it does the necessary calls to ndi_devi_enter() and ndi_devi_exit(), ensuring that we have safety.
So, with that, let's return to the actual function that we have. pcie_get_fabric_mps() is called with rc_dip corresponding to the npe instance and dip corresponding to the t4nex case. Ironically, the pcieb dip is not there. In the pcie_scan_mps() function, it currently does an enter on the parent of dip, which would be pcieb. But as pointed out, this is both too late and redundant with the fix that Rick put together, so we can eliminate it. However, we don't need any other locking, because we correctly will own the pcieb dev_info_t at this point and are allowed to walk through our children. walk_devs() (the implementation of ddi_walk_devs()) will take care of locking the t4nex instance before descending and related.
So with all that in mind, I'd summarize this as:
- The fix by @kc8apf and the analysis are all spot on.
- The change should be safe for an arbitrary tree based on my analysis.
- We can opt to remove the ndi_devi_enter in pcie_scan_mps because it's redundant.
Updated by Robert Mustacchi 21 days ago
- Assignee changed from Robert Mustacchi to Rick Altherr
Updated by Rick Altherr 7 days ago
Testing:
- booted hundreds of times on multiple Oxide platform systems which were hitting this on nearly every boot. Due to a different bug, that platform reliably delivers a PCIe presence detect changed hotplug event immediately when hotplug interrupts are enabled.
- booted on an i86pc platform (AMD Rome) and verified all PCIe devices were discovered correctly.
- Robert Mustacchi also tested on additional AMD Rome systems
Updated by Electric Monk 6 days ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit da37308cbe221e77b90426b63bf79a8b67c490b0
commit da37308cbe221e77b90426b63bf79a8b67c490b0
Author: Rick Altherr <rick@oxidecomputer.com>
Date:   2022-06-30T01:22:44.000Z

    14747 pcieb_attach: race between pcie_init_root_port_mps and hotplug events
    Reviewed by: Robert Mustacchi <rm+illumos@fingolfin.org>
    Approved by: Richard Lowe <richlowe@richlowe.net>