Project

General

Profile

Bug #12449

apcupsd triggers ASSERT on debug kernels when killed

Added by Jason King 4 months ago. Updated about 1 month ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
kernel
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

When rebooting a DEBUG SmartOS PI on a system running apcupsd, I consistently hit this assert during reboot:

assertion failed: draining_succeeded == USB_SUCCESS, file: ../../common/io/usb/usba/usbai_pipe_mgmt.c, line: 1433

This occurs when apcupsd dies (I can see the 'stopping on signal 15' message just prior to the panic).

There's nothing SmartOS specific in the USB code suggesting it's a general illumos bug (so I'm posting the details here in case others have hit this). I haven't completely root caused this yet, but documenting what I have found so far in this issue.

> ::stack
vpanic()
0xfffffffffbe35245()
usba_pipe_sync_reset+0xca(fffffeb2ea34f360, fffffeb2f0978960, fffffeb41209e2d8, 100)
usba_pipe_setup_func_call+0x13b(fffffeb2ea34f360, fffffffff7acf200, fffffeb2f0978960, 0, 100, 0)
usb_pipe_reset+0x143(fffffeb2ea34f360, fffffeb2f0978960, 100, 0, 0)
ugen_epx_req+0x41c(fffffeb2f144e000, fffffeb427027dd0)
ugen_strategy+0x15b(fffffeb427027dd0)
default_physio+0x33a(fffffffff7adba90, 0, fa00000004, 40, fffffffff7adbc50, fffffe00f561de10)
physio+0x11(fffffffff7adba90, 0, fa00000004, 40, fffffffff7adbc50, fffffe00f561de10)
usb_ugen_read+0x58(fffffeb2f107bb88, fa00000004, fffffe00f561de10, fffffeb3ab1fc2d0)
ugen_skel_read+0x5e(fa00000004, fffffe00f561de10, fffffeb3ab1fc2d0)
cdev_read+0x2a(fa00000004, fffffe00f561de10, fffffeb3ab1fc2d0)
spec_read+0x310(fffffeb4342df600, fffffe00f561de10, 0, fffffeb3ab1fc2d0, 0)
fop_read+0x111(fffffeb4342df600, fffffe00f561de10, 0, fffffeb3ab1fc2d0, 0)
read+0x26f(a, fffffc7fffdffc10, 14)
sys_syscall+0x2ae()

So we're doing a reset on one of the pipes when this happens. Which pipe?

> fffffeb2f0978960::print usba_ph_impl_t usba_ph_dip usba_ph_data
usba_ph_dip = 0xfffffeb2ea34f360
usba_ph_data = 0xfffffeb42649a218
> ::usba_device
...
ugen             0   fffffeb2ea34f360 /pci@0,0/pci15d9,86d@14/input@1

    EP  TYPE  DIR STATE   P_HANDLE         P_POLICY         EP DESCR
      0 Cntrl Out IDLE    fffffeb2e95477d8 fffffeb2e9547810 fffffeb2e9547812
      1 Intr  In  ACTIVE  fffffeb42649a218 fffffeb42649a250 fffffeb42649a252
...

The code triggering the assert is:

    draining_succeeded = usba_drain_cbs(ph_data, USB_CB_RESET_PIPE,
          USB_CR_PIPE_RESET);

      /* this MUST have succeeded */
      ASSERT(draining_succeeded == USB_SUCCESS);

So how can usba_drain_cbs fail?

        /*
       * wait for any callbacks in progress but don't wait for
       * for queued requests on the default pipe
       */
      for (timeout = 0; (timeout < usba_drain_timeout) &&
          (ph_data->p_req_count >
          usba_list_entry_count(&ph_data->p_queue));
          timeout++) {
          mutex_exit(&ph_data->p_mutex);
          delay(drv_usectohz(1000));
          mutex_enter(&ph_data->p_mutex);
      }

And ph_data->p_req_count is:

> fffffeb2f0978960::print -t usba_ph_impl_t usba_ph_data | ::print usba_pipe_handle_data_t p_req_count
p_req_count = 0x1

So there was a request that wasn't removed after the timeout has passed.

This is where my knowledge starts to get pretty fuzzy, but I think we are missing a bit of cleanup -- the only places I could find that decrement p_req_count are in request callbacks -- that implies they aren't being called. In the context of a user-land ugen client being killed and forcing a reset, it seems plausible that we are missing a step in this path. I'll need to do some more digging to see if I can figure out more.

History

#1

Updated by Jason King 4 months ago

The command that queues the request:

> 0xfffffeb2ef9fad80::print usba_hcdi_ops_t usba_hcdi_pipe_intr_xfer
usba_hcdi_pipe_intr_xfer = xhci_hcdi_pipe_intr_xfer

Also
> fffffeb2f0978960::print -t usba_ph_impl_t usba_ph_data | ::print usba_pipe_handle_data_t p_queue.count
p_queue.count = 0

So it appears either we had a request that didn't cleanup in it's callback, or it's callback isn't being called.

#2

Updated by Jason King about 1 month ago

I need to dig more, but I think what might be happening is that the draining causes usba_do_req_exc_cb as the callback to invoke instead of the normal callback. This in turn runs all the request callbacks in a separate taskq. I'm wondering if the problem is that the taskq has not completed by the time draining code finishes, causing the error.

Also available in: Atom PDF