Project

General

Profile

Bug #7921

failing to load the usba root hub module destroys driver parent private data

Added by Robert Mustacchi almost 4 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Category:
driver - device drivers
Start date:
2017-03-01
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

While working on xhci, I had several panics while unloading that looked like:

[root@haswell /var/crash/volatile]# mdb 47
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs mm lofs idm sata random cpc logindmux ptm sppp nfs sd i40e ]
> ::status
debugging crash dump vmcore.47 (64-bit) from haswell
operating system: 5.11 joyent_20160420T214651Z (i86pc)
image uuid: (not set)
panic message: 
assertion failed: vecp->v_share == 0, file: ../../i86pc/io/apix/apix_utils.c, line: 717
dump content: kernel pages only
> $C
ffffff0010c80520 vpanic()
ffffff0010c80550 0xfffffffffbdf9e58()
ffffff0010c80580 apix_cleanup_vector+0x31(ffffff03150abe80)
ffffff0010c80600 apix_free_vectors+0x93(ffffff02e383bdc8, 0, 1, 2)
ffffff0010c80670 apix_intr_ops+0x10c(ffffff02e383bdc8, ffffff034249b010, 1, 0)
ffffff0010c806a0 apix_irm_intr_ops+0x29(ffffff02e383bdc8, ffffff034249b010, 1, 0)
ffffff0010c807d0 pci_common_intr_ops+0x412(ffffff02e383c608, ffffff02e383bdc8, d, 
ffffff034249b010, 0)
ffffff0010c80810 npe_intr_ops+0x21(ffffff02e383c608, ffffff02e383bdc8, d, ffffff034249b010
, 0)
ffffff0010c80880 i_ddi_intr_ops+0x4e(ffffff02e383bdc8, ffffff02e383bdc8, d, 
ffffff034249b010, 0)
ffffff0010c808d0 ddi_intr_free+0x72(ffffff034249b010)
ffffff0010c80910 xhci_cleanup+0x199(ffffff0348e0f140)
ffffff0010c80940 xhci_attach+0x213(ffffff02e383bdc8, 0)
ffffff0010c809b0 devi_attach+0x9e(ffffff02e383bdc8, 0)
ffffff0010c809f0 attach_node+0x14f(ffffff02e383bdc8)
ffffff0010c80a40 i_ndi_config_node+0xc0(ffffff02e383bdc8, 6, 0)
ffffff0010c80a70 i_ddi_attachchild+0x88(ffffff02e383bdc8)
ffffff0010c80ab0 devi_attach_node+0x88(ffffff02e383bdc8, 4048)
ffffff0010c80b30 config_immediate_children+0xe0(ffffff02e383c608, 4048, c5)
ffffff0010c80b80 devi_config_common+0xd9(ffffff02e383c608, 4048, c5)
ffffff0010c80c20 mt_config_thread+0x70(ffffff035910a840)
ffffff0010c80c30 thread_start+8()

The apix module is angry because it expects to have v_share set to zero, but here it's one:

> ffffff03150abe80::print apix_vector_t
{
    v_state = 0x3
    v_type = 0x2
    v_cpuid = 0x4
    v_vector = 0x22
    v_share = 0x1
    v_inum = 0
    v_flags = 0
    v_bound_cpuid = 0x4
    v_busy = 0
    v_pri = 0x9
    v_autovect = 0xffffff02f226f700
    v_intrmap_private = 0
    v_devp = 0
    v_next = 0
}

This value gets manipulated in two functions. In apix_insert_av and apix_remove_av. This happens when an interrupt is enabled and disabled respectively. Now, the weird thing is that in the xhci module, we definitely call into and disable interrupts. Even more so, we'd log if it fails. The state flags in the dump indicate that we should have called it, but if we look at the dump, we find that we're already onto removing the interrupt when we hit this.

So, what's going on?

After a bunch of single stepping in kmdb through this, I found where we started to have things go wrong. Here's the stack trace:

ffffff0011c26660 npe`pci_disable_intr+0x53(ffffff02e383c608, ffffff02e383bdc8, ffffff02f6930dd0, 0)
ffffff0011c26790 npe`pci_common_intr_ops+0x8cc(ffffff02e383c608, ffffff02e383bdc8, a, ffffff02f6930dd0, 0)
ffffff0011c267d0 npe`npe_intr_ops+0x21(ffffff02e383c608, ffffff02e383bdc8, a,
ffffff02f6930dd0, 0)
ffffff0011c26840 i_ddi_intr_ops+0x4e(ffffff02e383bdc8, ffffff02e383bdc8, a,
ffffff02f6930dd0, 0)
ffffff0011c268a0 ddi_intr_block_disable+0x10c(ffffff03353a7a38, 1)
ffffff0011c268d0 xhci`xhci_ddi_intr_disable+0x2b(ffffff03353a79c0)
ffffff0011c26910 xhci`xhci_cleanup+0x1e8(ffffff03353a79c0)
ffffff0011c26940 xhci`xhci_attach+0x213(ffffff02e383bdc8, 0)
ffffff0011c269b0 devi_attach+0x9e(ffffff02e383bdc8, 0)
ffffff0011c269f0 attach_node+0x14f(ffffff02e383bdc8)
ffffff0011c26a40 i_ndi_config_node+0xc0(ffffff02e383bdc8, 6, 0)
ffffff0011c26a70 i_ddi_attachchild+0x88(ffffff02e383bdc8)
ffffff0011c26ab0 devi_attach_node+0x88(ffffff02e383bdc8, 4048)
ffffff0011c26b30 config_immediate_children+0xe0(ffffff02e383c608, 4048, c5)
ffffff0011c26b80 devi_config_common+0xd9(ffffff02e383c608, 4048, c5)
ffffff0011c26c20 mt_config_thread+0x70(ffffff033a367100)
ffffff0011c26c30 thread_start+8()

So what's happened is that in pci_disable_intr() we reach the point where we call pci_intx_get_ispec() and it returns NULL. From looking in the source code, that seems very suspicious. Now what makes this even more suspicious is that this only happens after I call usba_hubdi_bind_root_hub()! If I purposefully fail attach without that, then I find that everything's fine.

Even more worrisome is that I find that the pci_intx_get_ispec() value is non-NULL and has data in it that we'd expect to represent the interrupt in xhci. Now, the interesting point is where this data gets stored. If you look at the details of pci_intx_get_ispec() we'll see that we get that value by getting the parent's private data. Now that specifically is data that the parent dev_info_t stores on us. As a PCI device, a PCI based device should be our private data. But, here's where things get interesting.

Now, usba tries to play some games particularly with how it stores some of its driver specific data. Particularly we're looking at usba_set_usba_device(). Here, it does different things depending on whether or not it's a root hub. This makes sense. If it's a root hub (which is almost always a device with an entirely different parent, like PCI in our case) then it stores the data in a different place entirely. The function even notes that this very case is on the scene. But if it's not a root hub, it'll happily use it, which I guess makes sense, as it theoretically is the nexus driver for those devices.

Now, here's where things go wrong. Remember, we failed to register. This means we're in a rarely treaded code path. Realistically one that probably has never been taken. And here's where it all becomes clear. Unlike in the unbind path, the failure path of trying to bind the root hub does things in the following order:

  • Remove the device from the root hub list
  • NULL out the usba private value

So this is it! Because the framework removed us from the root hub list, when it went to set the device it thought we were a lowly device it owned. So it happily clobbered our data.

Thankfully, the solution here is simple. We just need to fix up the unwinding order in failure to match unbinding.

#1

Updated by Electric Monk over 3 years ago

  • Status changed from New to Closed

git commit 993e3faf6a142ae3efdd24388883264c2b56bede

commit  993e3faf6a142ae3efdd24388883264c2b56bede
Author: Robert Mustacchi <rm@joyent.com>
Date:   2017-03-10T18:21:18.000Z

    1979 USB 3.0 support
    7918 want usb_pipe_xopen(9F)
    7919 usbai burst macros for endpoint descriptor are wrong
    7920 usba_hcdi_register() should fail if driver is using private data
    7921 failing to load the usba root hub module destroys driver parent private data
    7922 want ::hubd walker
    7923 ::prtusb should include version
    7924 usb_*_request(9S) manual pages should match structure names
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Bryan Cantrill <bryan@joyent.com>
    Reviewed by: Dale Ghent <daleg@omniti.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>

Also available in: Atom PDF