panic from ill_taskq_dispatch
We recently hit a panic while using SmartOS. So far this has only been seen under VMware, where pathological scheduling can occur, which would explain why we don't normally see it. The following is the analysis from Joyent:
    # mdb 0
    Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs mpt idm sd crypto random cpc logindmux ptm sppp nsmb smbsrv nfs ipc ]
    > ::status
    debugging crash dump vmcore.0 (64-bit) from headnode
    operating system: 5.11 joyent_20120806T172727Z (i86pc)
    image uuid: (not set)
    panic message: mutex_enter: bad mutex, lp=ffffff01b4aa4f58 owner=ffffff01b243c120 thread=ffffff000ac7ac40
    dump content: kernel pages only
    > $C
    ffffff000ac7aa80 vpanic()
    ffffff000ac7aab0 mutex_panic+0x73(fffffffffb933b6a, ffffff01b4aa4f58)
    ffffff000ac7ab20 mutex_vector_enter+0x367(ffffff01b4aa4f58)
    ffffff000ac7ab60 cv_wait+0x7c(ffffff01b4aa4f60, ffffff01b4aa4f58)
    ffffff000ac7ac20 ill_taskq_dispatch+0x155()
    ffffff000ac7ac30 thread_start+8()
The netstack pointer is on the stack, even though mdb didn't fish it out:
    > ffffff000ac7ac20-0x8/p
    0xffffff000ac7ac18: 0xffffff01b4aa4000
Based on offsetof(ip_stack_t, ips_capab_taskq_cv), this also matches the cv_wait argument.
But that thing has been freed:
    > 0xffffff01b4aa4000::whatis
    ffffff01b4aa4000 is freed from kmem_alloc_8192
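The address arithmetic behind the last two steps can be cross-checked with a few lines of C. This is only a sketch: the constants are copied from the mdb output above, the helper names are made up for illustration, and it assumes the freed 8192-byte buffer starts exactly at the pointer found on the stack (which is what the ::whatis output suggests).

```c
#include <assert.h>
#include <stdint.h>

/* Addresses copied from the crash dump output above. */
#define	IPST_BASE	0xffffff01b4aa4000ULL	/* pointer found on the stack */
#define	MUTEX_ADDR	0xffffff01b4aa4f58ULL	/* lp= in the panic message */
#define	CV_ADDR		0xffffff01b4aa4f60ULL	/* first argument to cv_wait */
#define	BUF_SIZE	8192ULL			/* kmem_alloc_8192 buffer size */

/* Hypothetical helper: offset of an address from the start of the buffer. */
static uint64_t
offset_in_buf(uint64_t addr)
{
	return (addr - IPST_BASE);
}

/* Hypothetical helper: nonzero if addr lies inside the freed buffer. */
static int
in_freed_buf(uint64_t addr)
{
	return (addr >= IPST_BASE && offset_in_buf(addr) < BUF_SIZE);
}
```

Under those assumptions the mutex sits at offset 0xf58 and the cv at offset 0xf60, both well inside the freed buffer, so both cv_wait arguments point into freed memory.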
This is smelling a lot like OS-119. In that bug, ip_stack_init() clobbered a mutex after starting the taskq thread that tried to use the mutex. In this case, maybe ip_stack_fini() is clobbering a mutex while the taskq thread is still running?
That mutex and cv do get freed in ip_stack_fini(). By that point, the taskq thread should already have been cleaned up in ip_stack_shutdown(), but that only does this:
    mutex_enter(&ipst->ips_capab_taskq_lock);
    ipst->ips_capab_taskq_quit = B_TRUE;
    cv_signal(&ipst->ips_capab_taskq_cv);
    mutex_exit(&ipst->ips_capab_taskq_lock);
Note that it doesn't actually wait for the thread to exit. I don't see any other code that does, either, but I'm not at all familiar with this subsystem.
The taskq thread itself is doing this:
    2250 if (ipst->ips_capab_taskq_quit)
    2251         break;
    2252 CALLB_CPR_SAFE_BEGIN(&cprinfo);
    2253 cv_wait(&ipst->ips_capab_taskq_cv, &ipst->ips_capab_taskq_lock);
So it looks like if ip_stack_shutdown() sets ips_capab_taskq_quit while the taskq thread is at line 2252, and the shutdown thread also gets all the way through ip_stack_fini() before the taskq thread gets to 2253, then the cv and mutex will already have been clobbered by the time cv_wait() touches them, and we panic.
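One way to close this kind of window is for the shutdown path to wait until the worker thread has really exited before anything destroys the lock and cv. The following is a minimal userland sketch of that pattern using POSIX threads, not the actual kernel fix. The stack_state_t structure and the stack_init/stack_shutdown/stack_fini functions are hypothetical stand-ins for the ip_stack_t fields and the ip_stack_*() routines discussed above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant ip_stack_t fields. */
typedef struct stack_state {
	pthread_mutex_t	lock;	/* role of ips_capab_taskq_lock */
	pthread_cond_t	cv;	/* role of ips_capab_taskq_cv */
	bool		quit;	/* role of ips_capab_taskq_quit */
	bool		exited;	/* set by the worker just before it returns */
	pthread_t	worker;
} stack_state_t;

/* Worker loop: sleep on the cv until asked to quit. */
static void *
worker_main(void *arg)
{
	stack_state_t *st = arg;

	pthread_mutex_lock(&st->lock);
	while (!st->quit)
		pthread_cond_wait(&st->cv, &st->lock);
	st->exited = true;
	pthread_mutex_unlock(&st->lock);
	return (NULL);
}

/* Init: set up the lock and cv before the thread can touch them. */
static int
stack_init(stack_state_t *st)
{
	pthread_mutex_init(&st->lock, NULL);
	pthread_cond_init(&st->cv, NULL);
	st->quit = false;
	st->exited = false;
	return (pthread_create(&st->worker, NULL, worker_main, st));
}

/* Shutdown: request exit, then wait until the worker is really gone. */
static void
stack_shutdown(stack_state_t *st)
{
	pthread_mutex_lock(&st->lock);
	st->quit = true;
	pthread_cond_signal(&st->cv);
	pthread_mutex_unlock(&st->lock);

	/* The step the analysis found missing: wait for the thread to exit. */
	pthread_join(st->worker, NULL);
}

/* Fini: only now is it safe to destroy the cv and lock. Returns 0 on success. */
static int
stack_fini(stack_state_t *st)
{
	int err = pthread_cond_destroy(&st->cv);

	if (err == 0)
		err = pthread_mutex_destroy(&st->lock);
	return (err);
}
```

Without the join in stack_shutdown(), stack_fini() could run while the worker is still between its quit check and its wait, which is the same window as between lines 2250 and 2253 above. A kernel fix would need an equivalent wait using the kernel's own thread primitives.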