Bug #15311
ipd_detach() should be more careful about non-empty ipd_nsl
Description
I received a kernel panic dump from a customer. It died here:
> $C
fffff946236048b0 vpanic()
fffff946236048d0 mutex_panic+0x4a(fffffffffb9633fb, ffffffffc0001db0)
fffff94623604940 mutex_vector_enter+0x307(ffffffffc0001db0)
fffff94623604970 ipd_nin_destroy+0x1a(11bd, fffffef69db8f6c0)
fffff946236049b0 neti_stack_apply_destroy+0x108(fffffd6b2fbbff68, fffffd6a660d78c8)
fffff946236049f0 neti_apply_all_instances+0x2b(fffffd6b2fbbff68, fffffffff7ab8620)
fffff94623604a20 neti_stack_fini+0x95(11bd, fffffd6b2fbbff68)
fffff94623604aa0 netstack_apply_destroy+0x104(fffffffffbccfb88, fffffd6b2f834c80, 4)
fffff94623604ae0 apply_all_modules_reverse+0x3f(fffffd6b2f834c80, fffffffffba315d0)
fffff94623604b20 netstack_stack_inactive+0x106(fffffd6b2f834c80)
fffff94623604b60 netstack_reap+0x1a(fffffd6b2f834c80)
fffff94623604c00 taskq_d_thread+0xbc(ffffff5e8a85c7b0)
fffff94623604c10 thread_start+0xb()
A quick inspection of the mutex in question, ipd_nsl_lock, shows it has been mutex_destroy()'ed:
> ffffffffc0001db0::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
mdb: ipd_nsl_lock: invalid adaptive mutex (-f to dump anyway)
> ffffffffc0001db0::mutex -f
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
ffffffffc0001db0 adapt fffff94625e01c20      -      -      no
> ffffffffc0001db0/P
ipd_nsl_lock:
ipd_nsl_lock:   0xfffff94625e01c26
The 0x6 at the end is an indicator of a destroyed mutex, and masking that 0x6 out shows the thread that destroyed the mutex:
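Just to spell out the arithmetic (a throwaway userland illustration, nothing more; the exact bit encoding is the mutex implementation's business):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        /* Raw owner word of the destroyed ipd_nsl_lock, from ::mutex -f above. */
        uint64_t owner = 0xfffff94625e01c26ULL;

        /* Mask out the 0x6 "destroyed" marker to recover the kthread_t pointer. */
        uint64_t thread = owner & ~(uint64_t)0x6;

        /* Prints fffff94625e01c20 -- the thread inspected below. */
        printf("%" PRIx64 "\n", thread);
        return (0);
}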
> fffff94625e01c20::findstack -v
stack pointer for thread fffff94625e01c20 (TS_FREE) (mt_config_thread()): fffff94625e01bb0
[ fffff94625e01bb0 resume_from_zombie+0x95() ]
  fffff94625e01be0 swtch_from_zombie+0x85()
  fffff94625e01c00 thread_exit+0xeb()
  fffff94625e01c10 thread_splitstack_run()
It appears to be a freed thread. Lucky for us, the ipd_detach() function is the ONLY function that can mutex_destroy(&ipd_nsl_lock). Let's take a look...
ipd_detach(dev_info_t *dip, ddi_detach_cmd_t cmd)
{
        if (cmd != DDI_DETACH)
                return (DDI_FAILURE);

        mutex_enter(&ipd_nactive_lock);
        if (ipd_nactive > 0) {
                mutex_exit(&ipd_nactive_lock);
                return (EBUSY);
        }
        mutex_exit(&ipd_nactive_lock);
        ASSERT(dip == ipd_devi);
        ddi_remove_minor_node(dip, NULL);
        ipd_devi = NULL;

        if (ipd_neti != NULL) {
                VERIFY(net_instance_unregister(ipd_neti) == 0);
                net_instance_free(ipd_neti);
        }

        mutex_destroy(&ipd_nsl_lock);
        mutex_destroy(&ipd_nactive_lock);
        list_destroy(&ipd_nsl);         /* XXX KEBE SAYS WHOA! */
Hmmm, the dump I have seems to indicate that ipd_detach() can race against the netstack reclamation (netstack_reap) thread and beat it to destroying both ipd_nsl_lock AND the ipd_nsl list head.
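To make the race concrete, here is roughly what the destroy callback on the reclamation side has to be doing when it trips over the already-destroyed lock. This is an assumed shape based on the panic stack and the neti instance-destroy callback signature, not a quote of ipd.c:

/*
 * Assumed shape of ipd's netstack destroy callback (NOT the actual source).
 * This is what the netstack_reap() taskq thread is running when it hits
 * the already-mutex_destroy()'ed ipd_nsl_lock.
 */
static void
ipd_nin_destroy(const netid_t id, void *arg)
{
        ipd_netstack_t *ins = arg;

        mutex_enter(&ipd_nsl_lock);     /* <-- panics if ipd_detach() got here first */
        list_remove(&ipd_nsl, ins);     /* ...and ipd_nsl may already be list_destroy()'ed */
        mutex_exit(&ipd_nsl_lock);

        /* ... tear down and free the per-netstack state ... */
}

If ipd_detach() wins the race, both the lock and the list this callback depends on are gone before it ever runs, which is exactly the panic above.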
More MDB inspection shows the disconnect between the ipd netstack instance (the one handed to ipd_nin_destroy() in the reclamation thread) and the ipd_nsl list head:
> ipd_nsl::print list_t
{
    list_size = 0x110
    list_offset = 0
    list_head = {
        list_next = 0
        list_prev = 0
    }
}
> fffffef69db8f6c0::print ipd_netstack_t
{
    ipdn_link = {
        list_next = ipd_nsl+0x10
        list_prev = ipd_nsl+0x10
    }
...
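That is exactly the picture you get when list_destroy() runs on a list that still has this ipd_netstack_t linked onto it: the head node's pointers are NULLed, but the element is never unlinked, so ipdn_link still points back at ipd_nsl+0x10 (the embedded list_head). Roughly (a paraphrase of the generic list(9F) teardown, not the actual source; note the emptiness check is only an ASSERT, which the customer's non-DEBUG kernel compiles away):

/* Paraphrase of list_destroy() behavior, not the actual common list code. */
void
list_destroy(list_t *list)
{
        list_node_t *head = &list->list_head;

        /* Debug-only check; a non-DEBUG kernel destroys a populated list anyway. */
        ASSERT(head->list_next == head);
        ASSERT(head->list_prev == head);

        head->list_next = NULL;
        head->list_prev = NULL;
}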
At first glance, I think having ipd_detach() take ipd_nsl_lock and check for a non-empty ipd_nsl (returning EBUSY in that case) would be the first correct thing to do.
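Something along these lines, next to the existing ipd_nactive check, is what I have in mind. This is only a sketch of the idea (untested; list_is_empty() is just the obvious way to phrase the check):

        /*
         * Sketch of the proposed guard in ipd_detach() -- not a tested patch.
         * Refuse to detach while any per-netstack state is still on ipd_nsl,
         * so we can't yank ipd_nsl_lock and the list out from under the
         * netstack destroy callbacks.
         */
        mutex_enter(&ipd_nsl_lock);
        if (!list_is_empty(&ipd_nsl)) {
                mutex_exit(&ipd_nsl_lock);
                return (EBUSY);
        }
        mutex_exit(&ipd_nsl_lock);

Whether that check alone closes the window, or whether the teardown also needs to be re-ordered around net_instance_unregister(), deserves a closer look.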