Bug #15689

closed

nvme: race between detach and removal callback

Added by Robert Mustacchi 11 days ago. Updated 2 days ago.

Status: Closed
Priority: Normal
Category: driver - device drivers
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug:

Description

While working on a Gimlet, we were pulling a disk to try to understand some unrelated behavior related to T6 testing on a new Gimlet (we needed to insert a K.2), and we ended up panicking on insertion of the K.2 (which leads to more questions about the detection pieces that are currently firing). Here was the initial panic that we got. As this was a fresh Gimlet being set up, there was no dump device, so I debugged this in kmdb:

BRM44220005 # Dec 28 00:06:53 BRM44220005 genunix: /pci@ab,0/pci1022,1483@1
,2/pci1b96,0@0/blkdev@w0014EE81000D30AF,0 (blkdev2) offline
DeWARNING: nvme8: async event request returned failure, sct = 0, sc = 0,
dnr = 0, m = 0

panic[cpu16]/thread=fffff7880cae9c20: mutex_enter: bad mutex,
lp=fffffcf9580fc9b0 owner=fffff78809edec20 thread=fffff7880cae9c20

Warning - stack not written to the dump buffer
fffff7880cae9820 unix:mutex_panic+4a ()
fffff7880cae9890 unix:mutex_vector_enter+307 ()
fffff7880cae9900 nvme:nvme_remove_callback+4e ()
fffff7880cae9980 genunix:ndi_event_do_callback+fa ()
fffff7880cae99d0 npe:npe_bus_post_event+3f ()
fffff7880cae9a00 genunix:ndi_post_event+20 ()
fffff7880cae9a70 pcie:pciehpc_intr+3bc ()
fffff7880cae9aa0 pcie:pcie_hp_intr+88 ()
fffff7880cae9ad0 pcie:pcie_intr+1c ()
fffff7880cae9b70 pcieb:pcieb_intr_handler+1d8 ()
fffff7880cae9bc0 apix:apix_dispatch_by_vector+8c ()
fffff7880cae9c00 apix:apix_dispatch_lowlevel+1c ()
fffff7880ca98a20 unix:switch_sp_and_call+15 ()
fffff7880ca98a90 apix:apix_do_interrupt+eb ()
fffff7880ca98aa0 unix:cmnint+c3 ()
fffff7880ca98b90 unix:i86_mwait+12 ()
fffff7880ca98bd0 unix:cpu_idle_mwait+142 ()
fffff7880ca98be0 unix:cpu_idle_adaptive+19 ()
fffff7880ca98c00 unix:idle+ae ()
fffff7880ca98c10 unix:thread_start+b ()

A bad mutex, for once something that's not immediately PCIe malaise! nvme_remove_callback(), as we can see, is called by the kernel when it sees that a device has actually been removed from its slot. It is registered once for every distinct PCIe device and is used to clean up state. It tries to go through each queue, which means we're trying to do a mutex_enter() on nq_mutex.
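
For context, the shape of that callback is roughly the following sketch. The field names (n_dead, n_ioq, n_ioq_count) and types here follow the driver's general conventions but are written from memory for illustration; they are not copied from the source:

static void
nvme_remove_callback(dev_info_t *dip, ddi_eventcookie_t cookie, void *a,
    void *b)
{
        nvme_t *nvme = a;

        /* The device is gone: remember that and fail anything in flight. */
        nvme->n_dead = B_TRUE;

        for (uint_t i = 0; i < nvme->n_ioq_count; i++) {
                nvme_qpair_t *qp = nvme->n_ioq[i];

                /*
                 * This is the mutex_enter() that panicked above: if detach
                 * has already destroyed the queue (and nq_mutex with it),
                 * we trip the "bad mutex" assertion.
                 */
                mutex_enter(&qp->nq_mutex);
                /* ... abort or complete any outstanding commands ... */
                mutex_exit(&qp->nq_mutex);
        }
}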

So let's take stock of the world. First, let's confirm the devinfo in question is what we care about:

Welcome to kmdb
Loaded modules: [ scsi_vhci crypto mac cpc ipcc neti ptm ufs unix mpt zfs
krtld sata apix uhci hook lofs genunix ip logindmux nsmb usba xhci specfs
random mm cpu.generic arp sockfs smbsrv ]
[16]> $C
fffffffffbe908f0 kmdb_enter+0xb()
fffffffffbe90920 debug_enter+0x75(fffffffffbd588f8)
fffffffffbe90a10 panicsys+0x610(fffffffffbd5a5af, fffff7880cae97a8,
fffffffffbe90a20, 1)
fffff7880cae9790 vpanic+0x15c()
fffff7880cae9800 0xfffffffffbca0bc1()
fffff7880cae9820 mutex_panic+0x4a(fffffffffbd5b820, fffffcf9580fc9b0)
fffff7880cae9890 mutex_vector_enter+0x307(fffffcf9580fc9b0)
fffff7880cae9900 nvme`nvme_remove_callback+0x4e(fffffcf9307b5320,
fffffcf931ec4708, fffffcf9581003c0, 0)
fffff7880cae9980 ndi_event_do_callback+0xfa(fffffcf931ec0200,
fffffcf9307b5320, fffffcf931ec4708, 0)
fffff7880cae99d0 npe`npe_bus_post_event+0x3f(fffffcf9307b95e0,
fffffcf9307b5320, fffffcf931ec4708, 0)
fffff7880cae9a00 ndi_post_event+0x20(fffffcf9307b86c8, fffffcf9307b5320,
fffffcf931ec4708, 0)
fffff7880cae9a70 pcie`pciehpc_intr+0x3bc(fffffcf9307b86c8)
fffff7880cae9aa0 pcie`pcie_hp_intr+0x88(fffffcf9307b86c8)
fffff7880cae9ad0 pcie`pcie_intr+0x1c(fffffcf9307b86c8)
fffff7880cae9b70 pcieb`pcieb_intr_handler+0x1d8(fffffcf9581011f0, 0)
fffff7880cae9bc0 apix`apix_dispatch_by_vector+0x8c(20)
fffff7880cae9c00 apix`apix_dispatch_lowlevel+0x1c(20, 0)
fffff7880ca98a20 switch_sp_and_call+0x15()
fffff7880ca98a90 apix`apix_do_interrupt+0xeb(fffff7880ca98aa0, 0)
fffff7880ca98aa0 _interrupt+0xc3()
fffff7880ca98b90 i86_mwait+0x12()
fffff7880ca98bd0 cpu_idle_mwait+0x142()
fffff7880ca98be0 cpu_idle_adaptive+0x19()
fffff7880ca98c00 idle+0xae()
fffff7880ca98c10 thread_start+0xb()
[16]>
[16]> fffffcf9307b5320::devinfo
fffffcf9307b5320 pciexclass,010802, instance #8 (could not determine driver
name)
        Driver properties at fffffcfa18743d70:
            name='io-cqueue-len' type=int items=1
                value=00000400
            name='io-squeue-len' type=int items=1
                value=00000400
            name='volatile-write-cache-enable' type=int items=1
                value=00000000
            name='volatile-write-cache-present' type=int items=1
                value=00000000
            name='async-event-limit' type=int items=1
                value=0000000a
            name='admin-queue-len' type=int items=1
                value=00000100
            name='fm-accchk-capable' type=any items=0
            name='fm-dmachk-capable' type=any items=0
            name='fm-errcb-capable' type=any items=0
            name='fm-ereport-capable' type=any items=0
        Hardware properties at fffffcfa18743bb8:
            name='ddi-ufm-capable' type=any items=0
            name='pci-msix-capid-pointer' type=int items=1
                value=00000050
            name='pci-msi-capid-pointer' type=int items=1
                value=000000a0
            name='pcie-serial' type=byte items=8
                value=af.30.0d.00.81.ee.14.00
            name='pcie-link-supported-speeds' type=int64 items=3
                value=000000009502f900.000000012a05f200.00000001dcd65000
            name='pcie-link-target-speed' type=int64 items=1
                value=00000001dcd65000
            name='pcie-link-maximum-speed' type=int64 items=1
                value=00000001dcd65000
            name='pcie-link-current-speed' type=int64 items=1
                value=00000001dcd65000
            name='pcie-link-current-width' type=int items=1
                value=00000004
            name='pcie-link-maximum-width' type=int items=1
                value=00000004
            name='assigned-addresses' type=int items=10

value=83c60010.00000000.8a000000.00000000.00008000.c3c60020.00000107.40000000.00000000.00001000
            name='reg' type=int items=15

value=00c60000.00000000.00000000.00000000.00000000.03c60010.00000000.00000000.00000000.00008000.43c60020.00000000.00000000.00000000.00001000
            name='compatible' type=string items=15
                value='pciex1b96,2500.1b96.0.0' + 'pciex1b96,2500.1b96.0' +
'pciex1b96,2500.0' + 'pciex1b96,2500' + 'pciexclass,010802' +
'pciexclass,0108' + 'pci1b96,2500.1b96.0.0' + 'pci1b96,2500.1b96.0' +
'pci1b96,0,s' + 'pci1b96,0' + 'pci1b96,2500.0' + 'pci1b96,2500,p' +
'pci1b96,2500' + 'pciclass,010802' + 'pciclass,0108'
            name='model' type=string items=1
                value='NVM Express'
            name='power-consumption' type=int items=2
                value=00000001.00000001
            name='devsel-speed' type=int items=1
                value=00000000
            name='interrupts' type=int items=1
                value=00000001
            name='subsystem-vendor-id' type=int items=1
                value=00001b96
            name='subsystem-id' type=int items=1
                value=00000000
            name='unit-address' type=string items=1
                value='0'
            name='class-code' type=int items=1
                value=00010802
            name='revision-id' type=int items=1
                value=00000000
            name='vendor-id' type=int items=1
                value=00001b96
            name='device-id' type=int items=1
                value=00002500
[16]> fffffcf9581003c0::whatis
fffffcf9581003c0 is allocated from kmem_alloc_768

That is the devinfo of a WDC SSD, so that adds up. The nvme_t that we got as the argument to nvme_remove_callback() is still valid. So let's see what else is going on in the nvme driver:

[16]> ::stacks -m nvme
THREAD           STATE    SOBJ                COUNT
fffff7880a2eac20 SLEEP    RWLOCK                  3
                 swtch+0x133
                 turnstile_block+0x25b
                 rw_enter_sleep+0x1a5
                 log_enter+0x2c
                 log_sendmsg+0xae
                 cprintf+0x39d
                 vzdcmn_err+0x6f
                 dev_err+0x7f
                 nvme`nvme_async_event_task+0x1dd
                 taskq_thread+0x2cd
                 thread_start+0xb

fffff78809edec20 SLEEP    CV                      1
                 swtch+0x133
                 cv_wait+0x68
                 delay_common+0xb4
                 delay+0x3d
                 nvme`nvme_reset+0x10f
                 nvme`nvme_detach+0x341
                 devi_detach+0x88
                 detach_node+0x49
                 i_ndi_unconfig_node+0x171
                 i_ddi_detachchild+0x33
                 devi_detach_node+0xd1
                 ndi_devi_offline+0x135
                 ddihp_cn_change_children_state+0xac
                 ddihp_cn_pre_change_state+0x3b
                 ddihp_connector_ops+0x83
                 ddihp_cn_req_handler+0x3e
                 ndi_hp_state_change_req+0x14f
                 pcie`pciehpc_power_fault_handler+0x67
                 taskq_d_thread+0xbc
                 thread_start+0xb

fffff7880a15bc20 ONPROC   <NONE>                  1
                 apix`apix_hilevel_intr_prolog+0x85
                 apix`apix_do_interrupt+0x333
                 _interrupt+0xc3
                 cmt_balance+0x111
                 setbackdq+0x333
                 sleepq_wakeone_chan+0x8c
                 cv_signal+0xa7
                 taskq_bucket_dispatch+0x79
                 taskq_dispatch+0xd7
                 qenable_locked+0xee
                 putq+0x220
                 log_sendmsg+0x352
                 cprintf+0x39d
                 vzdcmn_err+0x6f
                 dev_err+0x7f
                 nvme`nvme_async_event_task+0x1dd
                 taskq_thread+0x2cd
                 thread_start+0xb

fffff7880cae9c20 PANIC    <NONE>                  1
                 0xfffffffffbca0bc1
                 mutex_panic+0x4a
                 mutex_vector_enter+0x307
                 nvme`nvme_remove_callback+0x4e
                 ndi_event_do_callback+0xfa
                 npe`npe_bus_post_event+0x3f
                 ndi_post_event+0x20
                 pcie`pciehpc_intr+0x3bc
                 pcie`pcie_hp_intr+0x88
                 pcie`pcie_intr+0x1c
                 pcieb`pcieb_intr_handler+0x1d8
                 apix`apix_dispatch_by_vector+0x8c
                 apix`apix_dispatch_lowlevel+0x1c

OK, most notably we have a thread in detach that is hanging out in a delay while trying to reset the device. This makes the ordering problem very straightforward:

  • nvme_detach() removes the DDI event callback as one of the last things it does, meaning it destroys the mutexes before the callback is unregistered, and all while the reset is still in progress (see the sketch after this list).
  • nvme_remove_callback() assumes that its data is still valid, but detach has beaten it to the punch.
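
Sketching the tail of detach makes the window clearer. This is not the driver's actual code; the helper name, the nvme_reset() arguments, and the callback-id field are illustrative only:

static int
nvme_detach(dev_info_t *dip, ddi_detach_cmd_t cmd)
{
        nvme_t *nvme = ddi_get_soft_state(nvme_state, ddi_get_instance(dip));

        /*
         * Per-queue state, including each nq_mutex, is torn down while the
         * removal callback registered at attach time is still in place ...
         */
        nvme_teardown_queues(nvme);             /* hypothetical helper */

        /*
         * ... and we then sit in nvme_reset() for up to n_timeout (about
         * 7 seconds on these WDC devices).  A surprise-removal interrupt in
         * that window still finds the registered callback and calls
         * nvme_remove_callback() against the destroyed mutexes.
         */
        (void) nvme_reset(nvme, B_FALSE);

        /* Only at the very end is the event callback unregistered -- too late. */
        (void) ddi_remove_event_handler(nvme->n_ev_rm_cb_id);

        return (DDI_SUCCESS);
}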

There are a few things that we need to look at here that we're not doing:

  • nvme_detach() needs to make sure that the call to remove the DDI callback happens earlier in detach, before the corresponding state is torn down.
  • nvme_remove_callback() probably wants to check whether we're in detach and, if so, consider not doing anything (see the sketch after this list).
  • nvme_reset() right now doesn't look for invalid reads from its registers. In this case the WDC devices seem to have an n_timeout value of 7s, which leaves this as a pretty wide-open window to hit.
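
As noted in the second item, one possible shape for that check (purely illustrative; the n_detaching flag and n_teardown_mutex lock are hypothetical names, not necessarily what the eventual fix uses) would be for the callback to bail out when detach already owns teardown:

static void
nvme_remove_callback(dev_info_t *dip, ddi_eventcookie_t cookie, void *a,
    void *b)
{
        nvme_t *nvme = a;

        /*
         * Hypothetical: n_teardown_mutex and n_detaching are a lock and flag
         * that outlive the queue pairs; the flag would be set at the top of
         * nvme_detach() before any per-queue state is destroyed.
         */
        mutex_enter(&nvme->n_teardown_mutex);
        if (nvme->n_detaching) {
                /* Detach owns teardown; nothing left for us to clean up. */
                mutex_exit(&nvme->n_teardown_mutex);
                return;
        }

        /*
         * Walk the queues and fail outstanding commands while still holding
         * n_teardown_mutex, so detach cannot start destroying them under us.
         */
        /* ... per-queue cleanup as before ... */

        mutex_exit(&nvme->n_teardown_mutex);
}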

However, there's a second issue here that is actually somewhat of a PCIe malaise issue, and one we should consider a bit more deeply. We still need to fix the above because detach can always race with a surprise removal, but this part is somewhat more nuanced and will be split out into its own bug. We had both a power fault and a removal. If we look at the stacks, the detach is happening because of a power fault, so that's what triggers the detach. Normally we'd fire the removal callback and let it run to completion ahead of the node teardown.

However, here we can see that we already dispatched the power fault callback (how that happened is another question I wish I had looked into when we first saw this!) and then we also hit a removal in a subsequent interrupt and dispatched that as well. While this isn't a guarantee that nefarious activity would occur, the fact that this removal event is firing while detach is ongoing is very suspicious, and it's hard to think through how this could work correctly.

In this ticket we are simply fixing the race condition in the NVMe driver itself and leaving nvme_reset() for another time.

Actions #1

Updated by Electric Monk 11 days ago

  • Gerrit CR set to 2878
Actions #2

Updated by Robert Mustacchi 2 days ago

In particular, with this in place, Andy and I have between us tested that this:

  • Works on i86pc with enterprise SSD-style hotplug
  • Works on a Gimlet with hotplug
  • Works in a Propolis VM, which was where we originally saw this

For the first two, I did pulls and testing with some subsequent changes in place to fix the PCIe state machine logic.

Actions #3

Updated by Electric Monk 2 days ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit c0a9dba32d94978bd77a0d3a65cd31611e696c61

commit  c0a9dba32d94978bd77a0d3a65cd31611e696c61
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   2023-06-02T21:16:46.000Z

    15689 nvme: race between detach and removal callback
    15690 nvme driver does not attach to Propolis NVMe devices
    Reviewed by: Andy Fiddaman <illumos@fiddaman.net>
    Reviewed by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
    Approved by: Dan McDonald <danmcd@mnx.io>
