Bug #3326

VMWare: mutex panic during shutdown

Added by Dan Swartzendruber over 6 years ago. Updated almost 5 years ago.

Status: In Progress
Priority: Normal
Assignee: -
Category: -
Start date: 2012-10-31
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

OI151a7 running virtualized under ESXi 5.1. I can reproducibly cause a panic on shutdown, either via the ACPI button (virtual, of course) or 'init 6'. I will provide a crash dump as soon as I know the issue number.

> ::status
debugging crash dump vmcore.5 (64-bit) from omnios-appliance2
operating system: 5.11 omnios-bc85f2d (i86pc)
image uuid: 94d86b43-36cd-64b6-cb67-d0349b5c4a1a
panic message: mutex_enter: bad mutex, lp=ffffffffc0197198 owner=ffffff01e8750800 thread=ffffff0008b13c40
dump content: kernel pages only
> ::stack
vpanic()
mutex_panic+0x73(fffffffffb933be5, ffffffffc0197198)
mutex_vector_enter+0x367(ffffffffc0197198)
cv_timedwait_hires+0x107(ffffffffc01971a0, ffffffffc0197198, 3b9aca00, 989680, 0)
cv_timedwait_sig_hires+0x1eb(ffffffffc01971a0, ffffffffc0197198, 3b9aca00, 989680, 0)
cv_timedwait_sig+0x49(ffffffffc01971a0, ffffffffc0197198, 1e95e)
0xfffffffff85079c5()
thread_start+8()
>
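
For reference, output like the above is what you get from loading the saved crash dump into mdb; a typical session (the directory and file names here assume savecore wrote dump 5 into a /var/crash/<hostname> directory, which is not spelled out in this report) looks like:

# cd /var/crash/omnios-appliance2
# mdb unix.5 vmcore.5
> ::status
> ::stack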

History

#2

Updated by Dan Swartzendruber over 6 years ago

I suspect this may be VMware related. I'd been running this OI VM under ESXi 5.0 for months with no problems. I upgraded the hypervisor host to ESXi 5.1 just a couple of days ago, and now OI panics every time I shut it down.

#3

Updated by Marcel Telka over 6 years ago

  • Status changed from New to Feedback

Please provide the crash dump:

Not Found

The requested URL /illumos/3326/ was not found on this server.
Apache/2.2.22 (Ubuntu) Server at www.druber.com Port 80

#4

Updated by Dan Swartzendruber over 6 years ago

Sorry, I no longer have the dump. After a couple of months with no activity on the issue, I moved on to a different option, since this is a major inconvenience for me.

#5

Updated by Marcel Telka over 6 years ago

  • Status changed from Feedback to Closed

Closing. Not enough data.

Please reopen once we can reproduce the issue again and/or we have the crash dump available. Thanks.

#6

Updated by Dan Swartzendruber about 6 years ago

I can repro this at will using OmniOS as well. I will provide a dump. I am surprised no one has tried to reproduce this in-house (I have had 100% success). It's trivially easy. Install free ESXi 5. Deploy the OmniOS OVA from their website. Install VMware tools. Type 'init 6' and instant crash...

#7

Updated by Igor Kozhukhov about 6 years ago

I have ported open-vm-tools to DilOS and have no crashes with it on VMware ESXi 4, 5, and 5.1.
However, I am not running illumos-gate as-is; I have some additional changes of my own: LX zone, LIBM, etc.
open-vm-tools works well for me.
-Igor

Dan Swartzendruber wrote:

I can repro this at will using OmniOS as well. I will provide a dump. I am surprised no one has tried to reproduce this in-house (I have had 100% success). It's trivially easy. Install free ESXi 5. Deploy the OmniOS OVA from their website. Install VMware tools. Type 'init 6' and instant crash...

#8

Updated by Dan Swartzendruber about 6 years ago

Okay, unix.4.gz and vmcore.4.gz are at:

http://www.druber.com/illumos/3326/

#9

Updated by Dan Swartzendruber about 6 years ago

Well, damn, OmniOS created the dump zvol at only 512MB, so I think the dump is crap. I'm recreating it - stay tuned...

#10

Updated by Dan Swartzendruber about 6 years ago

Okay, I think I figured out why the dump zvol was too small: I only made an 8GB boot disk :( So I created a 16GB dump zvol on the tank pool instead. The dump is now much bigger. The new files are unix.5.gz and vmcore.5.gz.
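
For anyone reproducing this setup, a dedicated dump zvol is typically created and activated along these lines (the 16GB size matches the comment above; the pool and zvol names are illustrative, since the exact commands used aren't recorded here):

# create a 16GB zvol on the 'tank' pool to hold crash dumps
zfs create -V 16G tank/dump
# point the system dump configuration at the new zvol
dumpadm -d /dev/zvol/dsk/tank/dump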

#11

Updated by Marcel Telka about 6 years ago

  • Status changed from Closed to In Progress

I added the panic stack from vmcore.5 to the Description field and reopened the bug.

#12

Updated by Rich Lowe about 6 years ago

If this is easily repeatable, could you please boot with -kd, and when you stop at the kmdb prompt on the console, utter:

> moddebug/W 0x80004000
> :c

At the prompts. This should cause module text to be retained when modules unload, and load addresses to be printed as they load. This would hopefully give us some clue as to which module the panic thread is coming from and -- presumably -- where the bug is. At present:

fffffffff85079c5 is fffffffff8507000+9c5, freed from the module_text vmem arena

And while the text is just about intact, it's not providing any real clue as to whom it belongs, beyond it probably being the same module that would contain the mutex we're attempting to take (which has just been freed from module_data).
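
For reference, the value 0x80004000 appears to combine two moddebug flags from sys/modctl.h (this is my reading of the header, not something stated in this thread):

0x80000000  MODDEBUG_LOADMSG   print a message as each module loads/unloads
0x00004000  MODDEBUG_KEEPTEXT  keep module text around after the module is unloaded

which matches the behaviour described above.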

#13

Updated by Dan Swartzendruber about 6 years ago

Will do. Hopefully tonight...

#14

Updated by Dan Swartzendruber about 6 years ago

vmcore.0.gz and unix.0.gz

#15

Updated by Rich Lowe about 6 years ago

In those dumps we have:

ffffff0010ac29c0 vpanic()
ffffff0010ac29f0 mutex_panic+0x73(fffffffffb933d25, ffffffffc0189478)
ffffff0010ac2a60 mutex_vector_enter+0x367(ffffffffc0189478)
ffffff0010ac2af0 cv_timedwait_hires+0x107(ffffffffc0189480, ffffffffc0189478, 
3b9aca00, 989680, 0)
ffffff0010ac2b90 cv_timedwait_sig_hires+0x1eb(ffffffffc0189480, ffffffffc0189478
, 3b9aca00, 989680, 0)
ffffff0010ac2be0 cv_timedwait_sig+0x49(ffffffffc0189480, ffffffffc0189478, 3550
)
ffffff0010ac2c20 0xfffffffff86a99c5()
ffffff0010ac2c30 thread_start+8()

So, the thread function is 0xfffffffff86a99c5

We also have

> $<msgbuf ! grep -B1 fffffffff86a
...
text for /kernel/drv/amd64/vmmemctl 
was at fffffffff86a9000
...

I don't know what vmmemctl is, but we seem to have a thread running in that module after that module has been unloaded.

Now that we have the base address for the text, we can reasonably disassemble the entire module text (and be somewhat free of the perils of variable insn sizes).

When we do that, we find we're in a function that looks a whole lot like the top level function of a thread, though one which does have the capacity to exit. (You can do fffffffff86a9000::dis -n 350 or whatever)

My bet is that this vmmemctl driver, whatever it is, is not sufficiently cleaning up when detaching, and is leaving this thread alive to crash whenever it next runs.

#16

Updated by Rich Lowe about 6 years ago

Ah, it's part of the vmware tools.

Nothing stands out as overly broken-looking in the open-vm-tools code I can find, but basically any copy of the code I can find is on the wrong end of a particularly crappy connection (including git.opensource.vmware.com).

There was a crash on unload in the recent past (mid 2012?), but it doesn't look like this one.

Are you using open-vm-tools or the presumably different vmware-tools? Either way, I suspect it's their fault, and you should probably debug things from that point, on the assumption that the thread they create to manage ballooning has not exited by the time they return success from _fini (even though, as I said, the code I can find seems to do the right thing, when glancing at it).
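
To make the suspected failure mode concrete, here is a minimal C sketch (purely illustrative; none of these names come from vmmemctl or open-vm-tools) of the cleanup such a driver needs: the worker thread must be woken, told to exit, and joined before detach/_fini reports success, because the mutex and CV it sleeps on live in module data and its code lives in module text, both of which are freed on unload.

#include <sys/types.h>
#include <sys/time.h>
#include <sys/proc.h>
#include <sys/disp.h>
#include <sys/thread.h>
#include <sys/ksynch.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

static kmutex_t   ball_lock;   /* lives in module data */
static kcondvar_t ball_cv;     /* lives in module data */
static boolean_t  ball_exit;
static kt_did_t   ball_did;    /* worker thread id, for thread_join() */

static void
balloon_worker(void *arg)
{
	mutex_enter(&ball_lock);
	while (!ball_exit) {
		/* do the balloon work, then sleep ~1s; compare the panic stack */
		(void) cv_timedwait_sig(&ball_cv, &ball_lock,
		    ddi_get_lbolt() + drv_usectohz(MICROSEC));
	}
	mutex_exit(&ball_lock);
	thread_exit();
}

static void
balloon_start(void)
{
	kthread_t *t;

	mutex_init(&ball_lock, NULL, MUTEX_DRIVER, NULL);
	cv_init(&ball_cv, NULL, CV_DRIVER, NULL);
	ball_exit = B_FALSE;
	t = thread_create(NULL, 0, balloon_worker, NULL, 0, &p0,
	    TS_RUN, minclsyspri);
	ball_did = t->t_did;
}

static void
balloon_stop(void)
{
	mutex_enter(&ball_lock);
	ball_exit = B_TRUE;
	cv_broadcast(&ball_cv);         /* wake the worker so it sees ball_exit */
	mutex_exit(&ball_lock);
	thread_join(ball_did);          /* wait until the worker has really exited */
	cv_destroy(&ball_cv);
	mutex_destroy(&ball_lock);
	/* only now is it safe for detach()/_fini() to return success */
}

If balloon_stop() skips the thread_join() (or never runs at all), the worker is left sleeping on a CV whose backing memory is freed when the module unloads, and its next wakeup produces exactly the kind of "bad mutex" panic seen here.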

#17

Updated by Dan Swartzendruber about 6 years ago

No, I'm not using open-vm-tools, but the Solaris tools provided with ESXi. This only seems to be an issue when shutting down or rebooting, so I guess I'm willing to live with it. I don't have a support contract with VMware, so getting them to look at this seems unlikely. Thanks for the effort!

#18

Updated by Garrett D'Amore almost 5 years ago

  • Subject changed from mutex_enter: bad mutex panic to VMWare: mutex panic during shutdown
  • Priority changed from High to Normal

Updating priority to normal (panic during shutdown), and synopsis. It appears that this may be a bug in VMWare's tools.
