nvme: race between detach and removal callback
While working on a Gimlet, we were pulling a disk to try and understand some unrelated behavior to T6 testing on a new Gimlet (needed to insert a K.2) and we ended up panicking on insertion of the K.2 (which leads to more questions about the detection pieces that are currently firing). Here was the initial panic that we got in kmdb. As this was a fresh Gimlet being set up, there was no dump device so I debugged this in kmdb:
BRM44220005 # Dec 28 00:06:53 BRM44220005 genunix: /pci@ab,0/pci1022,1483@1 ,2/pci1b96,0@0/blkdev@w0014EE81000D30AF,0 (blkdev2) offline DeWARNING: nvme8: async event request returned failure, sct = 0, sc = 0, dnr = 0, m = 0 panic[cpu16]/thread=fffff7880cae9c20: mutex_enter: bad mutex, lp=fffffcf9580fc9b0 owner=fffff78809edec20 thread=fffff7880cae9c20 Warning - stack not written to the dump buffer fffff7880cae9820 unix:mutex_panic+4a () fffff7880cae9890 unix:mutex_vector_enter+307 () fffff7880cae9900 nvme:nvme_remove_callback+4e () fffff7880cae9980 genunix:ndi_event_do_callback+fa () fffff7880cae99d0 npe:npe_bus_post_event+3f () fffff7880cae9a00 genunix:ndi_post_event+20 () fffff7880cae9a70 pcie:pciehpc_intr+3bc () fffff7880cae9aa0 pcie:pcie_hp_intr+88 () fffff7880cae9ad0 pcie:pcie_intr+1c () fffff7880cae9b70 pcieb:pcieb_intr_handler+1d8 () fffff7880cae9bc0 apix:apix_dispatch_by_vector+8c () fffff7880cae9c00 apix:apix_dispatch_lowlevel+1c () fffff7880ca98a20 unix:switch_sp_and_call+15 () fffff7880ca98a90 apix:apix_do_interrupt+eb () fffff7880ca98aa0 unix:cmnint+c3 () fffff7880ca98b90 unix:i86_mwait+12 () fffff7880ca98bd0 unix:cpu_idle_mwait+142 () fffff7880ca98be0 unix:cpu_idle_adaptive+19 () fffff7880ca98c00 unix:idle+ae () fffff7880ca98c10 unix:thread_start+b ()
A bad mutex, for once something that's not immediately PCIe malaise! The nvme_remove_callback as we can see is called by the kernel when we see that a device has actually been removed from its slot. This is registered once for every distinct PCIe device and is used to clean up state. It tries to go through each queue. This means we're trying to do a mutex_enter() on nq_mutex.
So let's take stock of the world. First, let's confirm the devinfo in question is what we care about:
Welcome to kmdb Loaded modules: [ scsi_vhci crypto mac cpc ipcc neti ptm ufs unix mpt zfs krtld sata apix uhci hook lofs genunix ip logindmux nsmb usba xhci specfs random mm cpu.generic arp sockfs smbsrv ] > $C fffffffffbe908f0 kmdb_enter+0xb() fffffffffbe90920 debug_enter+0x75(fffffffffbd588f8) fffffffffbe90a10 panicsys+0x610(fffffffffbd5a5af, fffff7880cae97a8, fffffffffbe90a20, 1) fffff7880cae9790 vpanic+0x15c() fffff7880cae9800 0xfffffffffbca0bc1() fffff7880cae9820 mutex_panic+0x4a(fffffffffbd5b820, fffffcf9580fc9b0) fffff7880cae9890 mutex_vector_enter+0x307(fffffcf9580fc9b0) fffff7880cae9900 nvme`nvme_remove_callback+0x4e(fffffcf9307b5320, fffffcf931ec4708, fffffcf9581003c0, 0) fffff7880cae9980 ndi_event_do_callback+0xfa(fffffcf931ec0200, fffffcf9307b5320, fffffcf931ec4708, 0) fffff7880cae99d0 npe`npe_bus_post_event+0x3f(fffffcf9307b95e0, fffffcf9307b5320, fffffcf931ec4708, 0) fffff7880cae9a00 ndi_post_event+0x20(fffffcf9307b86c8, fffffcf9307b5320, fffffcf931ec4708, 0) fffff7880cae9a70 pcie`pciehpc_intr+0x3bc(fffffcf9307b86c8) fffff7880cae9aa0 pcie`pcie_hp_intr+0x88(fffffcf9307b86c8) fffff7880cae9ad0 pcie`pcie_intr+0x1c(fffffcf9307b86c8) fffff7880cae9b70 pcieb`pcieb_intr_handler+0x1d8(fffffcf9581011f0, 0) fffff7880cae9bc0 apix`apix_dispatch_by_vector+0x8c(20) fffff7880cae9c00 apix`apix_dispatch_lowlevel+0x1c(20, 0) fffff7880ca98a20 switch_sp_and_call+0x15() fffff7880ca98a90 apix`apix_do_interrupt+0xeb(fffff7880ca98aa0, 0) fffff7880ca98aa0 _interrupt+0xc3() fffff7880ca98b90 i86_mwait+0x12() fffff7880ca98bd0 cpu_idle_mwait+0x142() fffff7880ca98be0 cpu_idle_adaptive+0x19() fffff7880ca98c00 idle+0xae() fffff7880ca98c10 thread_start+0xb() > > fffffcf9307b5320::devinfo fffffcf9307b5320 pciexclass,010802, instance #8 (could not determine driver name) Driver properties at fffffcfa18743d70: name='io-cqueue-len' type=int items=1 value=00000400 name='io-squeue-len' type=int items=1 value=00000400 name='volatile-write-cache-enable' type=int items=1 value=00000000 name='volatile-write-cache-present' type=int items=1 value=00000000 name='async-event-limit' type=int items=1 value=0000000a name='admin-queue-len' type=int items=1 value=00000100 name='fm-accchk-capable' type=any items=0 name='fm-dmachk-capable' type=any items=0 name='fm-errcb-capable' type=any items=0 name='fm-ereport-capable' type=any items=0 Hardware properties at fffffcfa18743bb8: name='ddi-ufm-capable' type=any items=0 name='pci-msix-capid-pointer' type=int items=1 value=00000050 name='pci-msi-capid-pointer' type=int items=1 value=000000a0 name='pcie-serial' type=byte items=8 value=af.30.0d.00.81.ee.14.00 name='pcie-link-supported-speeds' type=int64 items=3 value=000000009502f900.000000012a05f200.00000001dcd65000 name='pcie-link-target-speed' type=int64 items=1 value=00000001dcd65000 name='pcie-link-maximum-speed' type=int64 items=1 value=00000001dcd65000 name='pcie-link-current-speed' type=int64 items=1 value=00000001dcd65000 name='pcie-link-current-width' type=int items=1 value=00000004 name='pcie-link-maximum-width' type=int items=1 value=00000004 name='assigned-addresses' type=int items=10 value=83c60010.00000000.8a000000.00000000.00008000.c3c60020.00000107.40000000.00000000.00001000 name='reg' type=int items=15 value=00c60000.00000000.00000000.00000000.00000000.03c60010.00000000.00000000.00000000.00008000.43c60020.00000000.00000000.00000000.00001000 name='compatible' type=string items=15 value='pciex1b96,2500.1b96.0.0' + 'pciex1b96,2500.1b96.0' + 'pciex1b96,2500.0' + 'pciex1b96,2500' + 'pciexclass,010802' + 'pciexclass,0108' + 'pci1b96,2500.1b96.0.0' + 'pci1b96,2500.1b96.0' + 'pci1b96,0,s' + 'pci1b96,0' + 'pci1b96,2500.0' + 'pci1b96,2500,p' + 'pci1b96,2500' + 'pciclass,010802' + 'pciclass,0108' name='model' type=string items=1 value='NVM Express' name='power-consumption' type=int items=2 value=00000001.00000001 name='devsel-speed' type=int items=1 value=00000000 name='interrupts' type=int items=1 value=00000001 name='subsystem-vendor-id' type=int items=1 value=00001b96 name='subsystem-id' type=int items=1 value=00000000 name='unit-address' type=string items=1 value='0' name='class-code' type=int items=1 value=00010802 name='revision-id' type=int items=1 value=00000000 name='vendor-id' type=int items=1 value=00001b96 name='device-id' type=int items=1 value=00002500 > fffffcf9581003c0::whatis fffffcf9581003c0 is allocated from kmem_alloc_768
That is the dev info of a WDC SSD, so that adds up. The nvme_t that we got in the argument to the nvme_remove_callback() is still valid. So let's see what els is going on in the nvme driver:
> ::stacks -m nvme THREAD STATE SOBJ COUNT fffff7880a2eac20 SLEEP RWLOCK 3 swtch+0x133 turnstile_block+0x25b rw_enter_sleep+0x1a5 log_enter+0x2c log_sendmsg+0xae cprintf+0x39d vzdcmn_err+0x6f dev_err+0x7f nvme`nvme_async_event_task+0x1dd taskq_thread+0x2cd thread_start+0xb fffff78809edec20 SLEEP CV 1 swtch+0x133 cv_wait+0x68 delay_common+0xb4 delay+0x3d nvme`nvme_reset+0x10f nvme`nvme_detach+0x341 devi_detach+0x88 detach_node+0x49 i_ndi_unconfig_node+0x171 i_ddi_detachchild+0x33 devi_detach_node+0xd1 ndi_devi_offline+0x135 ddihp_cn_change_children_state+0xac ddihp_cn_pre_change_state+0x3b ddihp_connector_ops+0x83 ddihp_cn_req_handler+0x3e ndi_hp_state_change_req+0x14f pcie`pciehpc_power_fault_handler+0x67 taskq_d_thread+0xbc thread_start+0xb fffff7880a15bc20 ONPROC <NONE> 1 apix`apix_hilevel_intr_prolog+0x85 apix`apix_do_interrupt+0x333 _interrupt+0xc3 cmt_balance+0x111 setbackdq+0x333 sleepq_wakeone_chan+0x8c cv_signal+0xa7 taskq_bucket_dispatch+0x79 taskq_dispatch+0xd7 qenable_locked+0xee putq+0x220 log_sendmsg+0x352 cprintf+0x39d vzdcmn_err+0x6f dev_err+0x7f nvme`nvme_async_event_task+0x1dd taskq_thread+0x2cd thread_start+0xb fffff7880cae9c20 PANIC <NONE> 1 0xfffffffffbca0bc1 mutex_panic+0x4a mutex_vector_enter+0x307 nvme`nvme_remove_callback+0x4e ndi_event_do_callback+0xfa npe`npe_bus_post_event+0x3f ndi_post_event+0x20 pcie`pciehpc_intr+0x3bc pcie`pcie_hp_intr+0x88 pcie`pcie_intr+0x1c pcieb`pcieb_intr_handler+0x1d8 apix`apix_dispatch_by_vector+0x8c apix`apix_dispatch_lowlevel+0x1c
OK, most notably we have a thread in detach that is hanging out trying to reset the device in a delay. This makes the ordering problem very straightforward:
- nvme_detach() calls the ddi event removal callback as one of the last things that it does meaning that it destroys the mutex before it removes the callback and while resetting.
- nvme_remove_callback() assumes that data is valid, but it has been beaten to the punch.
There are a few things that we need to look at here that we're not doing:
- nvme_detach() needs to make sure that the call to remove the DDI callback happens earlier in detach before the corresponding items are removed.
- nvme_remove_callback() probably wants to see if we're in detach and consider not doing anything.
- nvme_reset() right now doesn't look for invalid reads from its registers. In this case the WDC devices seem to have a n_timeout value of 7s. So this leads this as a pretty wide open window to hit.
However, there's a second issue here that is actually somewhat of a PCIe malaise issue and we should consider a deeper bit. We still need to fix the above because detach can always race with a surprise removal, but this is still somewhat more nuanced and will be duplicated into its own bug. We had both a power fault and a removal. If we look at the detach callback that's happening because of a power fault and so that's what triggers the detach. Normally we'd fire the removal callback and let it run to completion ahead of the node teardown.
However, here we can see that we already dispatched the power fault callback (how that happened is another question I wish I had realized when we first saw this!) and then we also hit a removal in a subsequent interrupt and dispatched this. While this isn't a guarantee that nefarious activity would occur the fact that this removal event is firing while detach is ongoing is very suspicious and its hard to think through how this could be working correctly.
In this ticket we are simply fixing the race condition in the NVMe driver itself and leaving
nvme_reset() for another time.
Updated by Electric Monk 11 days ago
- Gerrit CR set to 2878
Updated by Robert Mustacchi 2 days ago
In particular with this in place between Andy and I have tested that this:
- Works on i86pc with enterprise SSD style hoptlug
- On a gimlet with hotplug
- In a propolis VM which was where we originally saw this
On the first two I did do pulls and testing with some subsequent changes to fix the PCIe state machine logic in place.
Updated by Electric Monk 2 days ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit c0a9dba32d94978bd77a0d3a65cd31611e696c61
commit c0a9dba32d94978bd77a0d3a65cd31611e696c61 Author: Robert Mustacchi <firstname.lastname@example.org> Date: 2023-06-02T21:16:46.000Z 15689 nvme: race between detach and removal callback 15690 nvme driver does not attach to Propolis NVMe devices Reviewed by: Andy Fiddaman <email@example.com> Reviewed by: Hans Rosenfeld <firstname.lastname@example.org> Approved by: Dan McDonald <email@example.com>