Bug #3200

panic from ill_taskq_dispatch

Added by Robert Mustacchi about 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: High
Category: networking
Start date: 2012-09-17
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

We recently hit a panic while using SmartOS. To date this has only been seen under VMware, where pathological scheduling can occur, which would explain why we don't normally see it. The following is the analysis from Joyent:

# mdb 0
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs mpt idm sd crypto random cpc logindmux ptm sppp nsmb smbsrv nfs ipc ]
> ::status
debugging crash dump vmcore.0 (64-bit) from headnode
operating system: 5.11 joyent_20120806T172727Z (i86pc)
image uuid: (not set)
panic message: 
mutex_enter: bad mutex, lp=ffffff01b4aa4f58 owner=ffffff01b243c120 thread=ffffff000ac7ac40
dump content: kernel pages only
> $C
ffffff000ac7aa80 vpanic()
ffffff000ac7aab0 mutex_panic+0x73(fffffffffb933b6a, ffffff01b4aa4f58)
ffffff000ac7ab20 mutex_vector_enter+0x367(ffffff01b4aa4f58)
ffffff000ac7ab60 cv_wait+0x7c(ffffff01b4aa4f60, ffffff01b4aa4f58)
ffffff000ac7ac20 ill_taskq_dispatch+0x155()
ffffff000ac7ac30 thread_start+8()

The netstack pointer is on the stack, even though mdb didn't fish it out:
> ffffff000ac7ac20-0x8/p
0xffffff000ac7ac18:             0xffffff01b4aa4000

Based on offsetof(ip_stack_t, ips_capab_taskq_cv), this also matches the cv_wait argument.

But that thing has been freed:

> 0xffffff01b4aa4000::whatis
ffffff01b4aa4000 is freed from kmem_alloc_8192
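
The address arithmetic behind these observations can be checked mechanically. The following is only a sketch: the constants are copied from the dump output above, the helper names are hypothetical, and it assumes (as the analysis does) that the pointer found on the stack is the base of the ip_stack_t and that the freed kmem_alloc_8192 buffer starts at that same address.

```c
#include <stdint.h>

/* Addresses copied verbatim from the mdb session above. */
#define	STACK_PTR	0xffffff01b4aa4000ULL	/* value read at ffffff000ac7ac18 */
#define	CV_ARG		0xffffff01b4aa4f60ULL	/* first argument to cv_wait() */
#define	MUTEX_ARG	0xffffff01b4aa4f58ULL	/* second argument to cv_wait() */
#define	BUF_SIZE	8192ULL			/* from the kmem_alloc_8192 cache name */

/* Byte offset of addr from the pointer recovered from the stack. */
static uint64_t
offset_from_stack_ptr(uint64_t addr)
{
	return (addr - STACK_PTR);
}

/* Nonzero if addr falls inside the freed buffer (assumed to start at STACK_PTR). */
static int
in_freed_buffer(uint64_t addr)
{
	return (addr >= STACK_PTR && addr - STACK_PTR < BUF_SIZE);
}
```

With these values the cv sits at offset 0xf60 and the mutex at offset 0xf58, and both fall inside the freed 8192-byte buffer, which is consistent with the panic being a use-after-free of the lock.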

This is smelling a lot like OS-119. In that bug, ip_stack_init() clobbered a mutex after starting the taskq thread that tried to use the mutex. In this case, maybe ip_stack_fini() is clobbering a mutex while the taskq thread is still running?

That mutex and cv do get freed in ip_stack_fini(). By that point, the taskq thread should already have been cleaned up in ip_stack_shutdown(), but that only does this:

4362         mutex_enter(&ipst->ips_capab_taskq_lock);
4363         ipst->ips_capab_taskq_quit = B_TRUE;
4364         cv_signal(&ipst->ips_capab_taskq_cv);
4365         mutex_exit(&ipst->ips_capab_taskq_lock);

Note that it doesn't actually wait for the thread to exit. I don't see any other code that does, either, but I'm not at all familiar with this subsystem.

The taskq thread itself is doing this:

2250                 if (ipst->ips_capab_taskq_quit)
2251                         break;
2252                 CALLB_CPR_SAFE_BEGIN(&cprinfo);
2253                 cv_wait(&ipst->ips_capab_taskq_cv, &ipst->ips_capab_taskq_lock);

So it looks like if ip_stack_shutdown() sets ips_capab_taskq_quit while the taskq thread is at line 2252, and the shutdown thread also gets all the way through ip_stack_fini() before the taskq thread gets to 2253, then we'll clobber the cv and mutex in this way and panic.
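
One way to close this kind of window is for the shutdown path to wait until the worker thread has really exited before anything destroys the lock and cv. The sketch below illustrates that ordering in userland with POSIX threads. It is not the change committed for this bug and is not kernel code: demo_stack_t and the demo_* functions are hypothetical stand-ins for the ip_stack_t fields and for ip_stack_init(), ip_stack_shutdown(), and ip_stack_fini().

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userland stand-in for the relevant ip_stack_t fields. */
typedef struct demo_stack {
	pthread_mutex_t	ds_lock;	/* plays ips_capab_taskq_lock */
	pthread_cond_t	ds_cv;		/* plays ips_capab_taskq_cv */
	bool		ds_quit;	/* plays ips_capab_taskq_quit */
	bool		ds_exited;	/* set by the worker just before it returns */
	pthread_t	ds_thread;	/* the worker, kept so shutdown can join it */
} demo_stack_t;

/* Worker loop: sleep on the cv until asked to quit. */
static void *
demo_worker(void *arg)
{
	demo_stack_t *ds = arg;

	pthread_mutex_lock(&ds->ds_lock);
	while (!ds->ds_quit)
		pthread_cond_wait(&ds->ds_cv, &ds->ds_lock);
	ds->ds_exited = true;
	pthread_mutex_unlock(&ds->ds_lock);
	return (NULL);
}

/* Initialize the lock and cv first, then start the worker. */
static int
demo_init(demo_stack_t *ds)
{
	ds->ds_quit = false;
	ds->ds_exited = false;
	if (pthread_mutex_init(&ds->ds_lock, NULL) != 0)
		return (-1);
	if (pthread_cond_init(&ds->ds_cv, NULL) != 0) {
		pthread_mutex_destroy(&ds->ds_lock);
		return (-1);
	}
	if (pthread_create(&ds->ds_thread, NULL, demo_worker, ds) != 0) {
		pthread_cond_destroy(&ds->ds_cv);
		pthread_mutex_destroy(&ds->ds_lock);
		return (-1);
	}
	return (0);
}

/* Ask the worker to quit, then wait until it has actually exited. */
static void
demo_shutdown(demo_stack_t *ds)
{
	pthread_mutex_lock(&ds->ds_lock);
	ds->ds_quit = true;
	pthread_cond_signal(&ds->ds_cv);
	pthread_mutex_unlock(&ds->ds_lock);

	/* The step the quoted shutdown code lacks: block until the worker is gone. */
	pthread_join(ds->ds_thread, NULL);
}

/* Safe only after demo_shutdown(): no thread can still touch the lock or cv. */
static void
demo_fini(demo_stack_t *ds)
{
	pthread_cond_destroy(&ds->ds_cv);
	pthread_mutex_destroy(&ds->ds_lock);
}
```

Calling demo_init(), demo_shutdown(), and demo_fini() in that order never destroys a lock the worker may still reacquire, because the join cannot return until the worker has dropped the mutex and returned. A kernel version would need an equivalent handshake using kernel primitives rather than pthread_join().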

#1

Updated by Robert Mustacchi about 8 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Resolved in 13837:76dc42aca3ef.
