Bug #12449
openapcupsd triggers ASSERT on debug kernels when killed
0%
Description
When rebooting a DEBUG SmartOS PI on a system running apcupsd, I consistently hit this assert during reboot:
assertion failed: draining_succeeded == USB_SUCCESS, file: ../../common/io/usb/usba/usbai_pipe_mgmt.c, line: 1433
This occurs when apcupsd dies (I can see the 'stopping on signal 15' message just prior to the panic).
There's nothing SmartOS specific in the USB code suggesting it's a general illumos bug (so I'm posting the details here in case others have hit this). I haven't completely root caused this yet, but documenting what I have found so far in this issue.
> ::stack vpanic() 0xfffffffffbe35245() usba_pipe_sync_reset+0xca(fffffeb2ea34f360, fffffeb2f0978960, fffffeb41209e2d8, 100) usba_pipe_setup_func_call+0x13b(fffffeb2ea34f360, fffffffff7acf200, fffffeb2f0978960, 0, 100, 0) usb_pipe_reset+0x143(fffffeb2ea34f360, fffffeb2f0978960, 100, 0, 0) ugen_epx_req+0x41c(fffffeb2f144e000, fffffeb427027dd0) ugen_strategy+0x15b(fffffeb427027dd0) default_physio+0x33a(fffffffff7adba90, 0, fa00000004, 40, fffffffff7adbc50, fffffe00f561de10) physio+0x11(fffffffff7adba90, 0, fa00000004, 40, fffffffff7adbc50, fffffe00f561de10) usb_ugen_read+0x58(fffffeb2f107bb88, fa00000004, fffffe00f561de10, fffffeb3ab1fc2d0) ugen_skel_read+0x5e(fa00000004, fffffe00f561de10, fffffeb3ab1fc2d0) cdev_read+0x2a(fa00000004, fffffe00f561de10, fffffeb3ab1fc2d0) spec_read+0x310(fffffeb4342df600, fffffe00f561de10, 0, fffffeb3ab1fc2d0, 0) fop_read+0x111(fffffeb4342df600, fffffe00f561de10, 0, fffffeb3ab1fc2d0, 0) read+0x26f(a, fffffc7fffdffc10, 14) sys_syscall+0x2ae()
So we're doing a reset on one of the pipes when this happens. Which pipe?
> fffffeb2f0978960::print usba_ph_impl_t usba_ph_dip usba_ph_data usba_ph_dip = 0xfffffeb2ea34f360 usba_ph_data = 0xfffffeb42649a218 > ::usba_device ... ugen 0 fffffeb2ea34f360 /pci@0,0/pci15d9,86d@14/input@1 EP TYPE DIR STATE P_HANDLE P_POLICY EP DESCR 0 Cntrl Out IDLE fffffeb2e95477d8 fffffeb2e9547810 fffffeb2e9547812 1 Intr In ACTIVE fffffeb42649a218 fffffeb42649a250 fffffeb42649a252 ...
The code triggering the assert is:
draining_succeeded = usba_drain_cbs(ph_data, USB_CB_RESET_PIPE, USB_CR_PIPE_RESET); /* this MUST have succeeded */ ASSERT(draining_succeeded == USB_SUCCESS);
So how can usba_drain_cbs
fail?
/* * wait for any callbacks in progress but don't wait for * for queued requests on the default pipe */ for (timeout = 0; (timeout < usba_drain_timeout) && (ph_data->p_req_count > usba_list_entry_count(&ph_data->p_queue)); timeout++) { mutex_exit(&ph_data->p_mutex); delay(drv_usectohz(1000)); mutex_enter(&ph_data->p_mutex); }
And ph_data->p_req_count
is:
> fffffeb2f0978960::print -t usba_ph_impl_t usba_ph_data | ::print usba_pipe_handle_data_t p_req_count p_req_count = 0x1
So there was a request that wasn't removed after the timeout has passed.
This is where my knowledge starts to get pretty fuzzy, but I think we are missing a bit of cleanup -- the only places I could find that decrement p_req_count
are in request callbacks -- that implies they aren't being called. In the context of a user-land ugen client being killed and forcing a reset, it seems plausible that we are missing a step in this path. I'll need to do some more digging to see if I can figure out more.
Updated by Jason King over 3 years ago
The command that queues the request:
> 0xfffffeb2ef9fad80::print usba_hcdi_ops_t usba_hcdi_pipe_intr_xfer usba_hcdi_pipe_intr_xfer = xhci_hcdi_pipe_intr_xfer Also > fffffeb2f0978960::print -t usba_ph_impl_t usba_ph_data | ::print usba_pipe_handle_data_t p_queue.count p_queue.count = 0
So it appears either we had a request that didn't cleanup in it's callback, or it's callback isn't being called.
Updated by Jason King over 3 years ago
I need to dig more, but I think what might be happening is that the draining causes usba_do_req_exc_cb
as the callback to invoke instead of the normal callback. This in turn runs all the request callbacks in a separate taskq. I'm wondering if the problem is that the taskq has not completed by the time draining code finishes, causing the error.