Bug #15690
nvme driver does not attach to Propolis NVMe devices
Status: closed
100%
Description
Andy initially hit this and wrote the following:
I've had a few attempts to get illumos guests booted on dogfood, and noticed that they only see the cloud-init block device in their disk list.
The propolis NVMe controllers show up in the PCI device list, and their device IDs, class and so on look fine:
    # pcieadm show-devs
    BDF     TYPE  INSTANCE  DEVICE
    0/0/0   PCI   --        440FX - 82441FX PMC [Natoma]
    0/1/0   PCI   isa0      82371SB PIIX3 ISA [Natoma/Triton II]
    0/1/3   PCI   --        82371AB/EB/MB PIIX4 ACPI
    0/8/0   PCI   vioif0    Virtio network device
    0/10/0  PCI   --        Propolis NVMe Controller
    0/11/0  PCI   --        Propolis NVMe Controller
    0/18/0  PCI   vioblk0   Virtio block device

    pci1de,0 (driver not attached)
        name='compatible' type=string items=9
            value='pci1de,0.1de.0.0' + 'pci1de,0.1de.0' + 'pci1de,0,s' + 'pci1de,0' + 'pci1de,0.0' + 'pci1de,0,p' + 'pci1de,0' + 'pciclass,010802' + 'pciclass,0108'

    # grep nvme /etc/driver_aliases
    nvme "pciclass,010802"
    nvme "pciexclass,010802"
To try and replicate this I've booted an OmniOS 'fat' iso image under propolis, with a single nvme disk. The symptoms are the same.
Running the following dtrace script shows a failure in ddi_get_eventcookie(), very early in nvme_attach():
    kayak-r151047# modunload -i 0
    kayak-r151047# modinfo | grep nvme
    kayak-r151047#
    kayak-r151047# cat nvme.d
    #!/usr/sbin/dtrace -ZCs
    nvme_attach:entry {self->t++}
    nvme_attach:return{self->t--}
    fbt::ddi*:entry/self->t/{}
    fbt::ddi*:return/self->t/{trace(arg1)}
    kayak-r151047#
    kayak-r151047# ./nvme.d &
    [1] 288
    dtrace: script './nvme.d' matched 800 probes
    kayak-r151047# update_drv nvme
    pseudo-device: devinfo0
    devinfo0 is /pseudo/devinfo@0
    devfsadm: driver failed to attach: nvme
    Warning: Driver (nvme) successfully added to system but failed to attach
    CPU     ID                    FUNCTION:NAME
      0  15931           ddi_get_instance:entry
      0  15932          ddi_get_instance:return 0
      0  16170      ddi_soft_state_zalloc:entry
      0  16171     ddi_soft_state_zalloc:return 0
      0  17786         ddi_get_soft_state:entry
      0  17787        ddi_get_soft_state:return -2148172750080
      0  20380     ddi_set_driver_private:entry
      0  20381    ddi_set_driver_private:return -2148172750080
      0  24210        ddi_get_eventcookie:entry
      0  24211       ddi_get_eventcookie:return 4294967295
      0  15931           ddi_get_instance:entry
      0  15932          ddi_get_instance:return 0
      0  17786         ddi_get_soft_state:entry
      0  17787        ddi_get_soft_state:return -2148172750080
      0  18583      ddi_remove_minor_node:entry
      0  18584     ddi_remove_minor_node:return 0
      0  23986        ddi_soft_state_free:entry
      0  23987       ddi_soft_state_free:return -2148159496864
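For readers decoding the trace: the 4294967295 printed for ddi_get_eventcookie:return is 0xffffffff, which is what a 32-bit int return of -1 looks like when dtrace prints arg1 as a 64-bit value. illumos defines DDI_FAILURE as -1, so this is the failing call. A minimal sketch of that conversion follows; traced_to_int() is a hypothetical helper written for this note, not part of any illumos API.

```c
#include <stdint.h>

/* Same value as DDI_FAILURE in illumos <sys/sunddi.h>. */
#define	DDI_FAILURE	(-1)

/*
 * Hypothetical helper (not an illumos function): take the value that
 * trace(arg1) printed and recover the signed 32-bit int that the
 * traced function returned, by reinterpreting the low 32 bits.
 */
static int
traced_to_int(uint64_t traced)
{
	uint32_t low = (uint32_t)traced;

	if (low <= (uint32_t)INT32_MAX)
		return ((int)low);

	/* Values above INT32_MAX represent negative numbers: low - 2^32. */
	return ((int)((int64_t)low - 4294967296LL));
}
```

With this, traced_to_int(4294967295) compares equal to DDI_FAILURE, while the 0 values returned by the other ddi_* calls in the trace stay 0.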
From here, I took a look and wrote the following:
In this particular case, the parent nexus is pci(4D) and not npe(4D). In illumos#11698 ("Want NVMe Hotplug Support"), the logic for the removal cookie was only added to npe because all non-virtual PCI devices are built on top of PCI Express and are required to be by the spec. In this case the bus doesn't support the removal event and the device isn't really hotpluggable (at least not in a sense that we support, though the platform could support the ACPI event here).
There are two parts of a fix to consider:
1. Failing to get a removal event is not fatal. It just means that we must not perform callback registration.
2. We should choose at what point to actually allow the pci nexus driver to support this event. It technically supports this with the PCI SHPC controller, but that's not actually what is used for hotplug in these environments; that is an entirely different ACPI method.
I'm leaning towards just fixing (1) rather than (2) right now as we can't really test a removal event end to end in pci(4D) due to the lack of the ACPI hotplug method.
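The shape of the first option can be sketched in userland C. This is only an illustration under stated assumptions, not the committed change: every mock_* name below is hypothetical, and mock_get_eventcookie() merely stands in for ddi_get_eventcookie() failing on a nexus that does not offer the removal event.

```c
#include <stdbool.h>

/* Same values as DDI_SUCCESS and DDI_FAILURE in illumos <sys/sunddi.h>. */
#define	DDI_SUCCESS	0
#define	DDI_FAILURE	(-1)

/* Hypothetical stand-in for a parent nexus such as npe or pci. */
typedef struct mock_nexus {
	bool mn_has_remove_event;
} mock_nexus_t;

/* Hypothetical stand-in for the driver's per-instance state. */
typedef struct mock_nvme {
	bool n_remove_cb_registered;
} mock_nvme_t;

/* Mock of the cookie lookup: fails when the nexus lacks the event. */
static int
mock_get_eventcookie(const mock_nexus_t *nexus, int *cookiep)
{
	if (!nexus->mn_has_remove_event)
		return (DDI_FAILURE);
	*cookiep = 1;
	return (DDI_SUCCESS);
}

/*
 * Mock attach: a missing removal event is not treated as fatal. The
 * callback is registered only when a cookie was obtained, and attach
 * carries on either way.
 */
static int
mock_nvme_attach(const mock_nexus_t *nexus, mock_nvme_t *nvme)
{
	int cookie;

	nvme->n_remove_cb_registered = false;
	if (mock_get_eventcookie(nexus, &cookie) == DDI_SUCCESS) {
		/* A real driver would register its removal handler here. */
		nvme->n_remove_cb_registered = true;
	}

	/* The rest of attach would continue from here. */
	return (DDI_SUCCESS);
}
```

In this sketch, a nexus with mn_has_remove_event set to false still attaches successfully and simply ends up without a removal callback, which corresponds to the Propolis case described above.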
This was tested along with #15689. We tested on traditional PCs with Enterprise SSD style hotplug, on Oxide's hardware, which uses ExpressModule based hotplug (and has a power controller), and in propolis VMs, where we originally saw this.
Updated by Electric Monk 2 days ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit c0a9dba32d94978bd77a0d3a65cd31611e696c61
commit c0a9dba32d94978bd77a0d3a65cd31611e696c61
Author: Robert Mustacchi <rm@fingolfin.org>
Date: 2023-06-02T21:16:46.000Z

    15689 nvme: race between detach and removal callback
    15690 nvme driver does not attach to Propolis NVMe devices
    Reviewed by: Andy Fiddaman <illumos@fiddaman.net>
    Reviewed by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
    Approved by: Dan McDonald <danmcd@mnx.io>