Bug #3289

zfs panic when sending replication stream

Added by Martin Blom about 5 years ago. Updated about 5 years ago.

Status: New
Start date: 2012-10-20
Priority: Urgent
Due date:
Assignee: -
% Done: 0%
Category: zfs - Zettabyte File System
Target version: -
Difficulty: Medium
Tags: needs-triage

Description

I've had a 'zfs send' problem for a long time (OpenSolaris 2009.06) and last week I stupidly decided to scrub the pool, which resulted in an endless reboot and panic cycle.

I'm now trying to transfer the filesystems using oi_151a7, but it turns out the same thing happens with that kernel too.

I took a screenshot while the dump was written, showing part of the stack trace, which is attached. Let me know what else I can do to help.

I'd prefer not to restore from backup, since I'll lose all snapshots, but the system has been offline for way too many days already.

zfs-panic-2012-10-20.png (264 KB) Martin Blom, 2012-10-20 12:05 PM

zfs-send-panic-2012-10-28.mov (3.49 MB) Martin Blom, 2012-10-28 05:19 PM

History

#1 Updated by Christopher Siden about 5 years ago

Your image is missing the top part of the stack trace that says which function the failure actually occurred in. Also there should be a panic or assertion failure message just above the stack trace that would be helpful.

#2 Updated by Martin Blom about 5 years ago

It's not easy to catch what's going on, and 'savecore' does not seem to work:

root@elizabeth:/home/lcs# savecore 
savecore: bad magic number 0

I've attached a video ... If you look carefully, you can see that it seems to start with

panic[cpu3]/thread=ffffff000f80ac40: BAD TRAP: type=0 (#de Divide error) rp=ffff...

#3 Updated by Christopher Siden about 5 years ago

I can't seem to get your video to display the stack trace in a readable way (it's blurry). The panic message only contains a generic message, so without a stack trace or the ability to reproduce the problem it's pretty much impossible to fix. Do you not have a scrollback buffer you can use to copy-paste the full output of the panic message + stack trace?

When you say savecore isn't working, do you mean the panic isn't generating a crash dump (i.e. there is nothing in /var/crash)?

#4 Updated by Martin Blom about 5 years ago

I understand that you need more information, and I would be happy to provide it, but I need some help in order to do so.

The console video is from a KVM that is sampling the VGA output, so copy/paste does not work, and the message/stack trace just flashes by and disappears, as the video indicates. I have only a few seconds available when the kernel is dumping to disk before it reboots. I didn't think the Solaris console had scrollback but I'll try page up/down next time.

Would attaching something to the serial port help?

Yes, I don't even have /var/crash and savecore just exits with that "bad magic number 0" message. Don't know what to do.

#5 Updated by Christopher Siden about 5 years ago

Maybe the core dump problem is related to:
https://www.illumos.org/issues/1110
You could try putting the workaround line from that issue in /etc/system (then run bootadm update-archive and reboot). All it does is disable multi-threaded crash dumps (which have a bug in them).

Otherwise you can add 'console=ttya' to your grub kernel line, so it looks something like this:

kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS,console=ttya

That will redirect the system console to a serial device. You will no longer see anything in your KVM after grub passes things off to the kernel, but you will be able to view the console (including the panic message and stack trace) from your serial console, where presumably you could get a scrollback buffer.
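
If it helps, here is a minimal sketch of how the machine on the other end of the serial cable could keep that scrollback as a log file. It is only an illustration in Python: the function name log_console, the device path /dev/ttyUSB0 and the log file name console.log are assumptions, not anything taken from this system, and it assumes the port speed and framing have already been set to match the console (for example with stty).

```python
import time


def log_console(src, dst, clock=time.strftime):
    """Copy console output from src to dst line by line, with a timestamp prefix.

    src: binary file-like object, e.g. an opened serial device (assumption).
    dst: text file-like object that receives the log.
    Returns the number of lines written.
    """
    count = 0
    # readline() returns b"" only at end of stream, which ends the loop.
    for raw in iter(src.readline, b""):
        # Panic output may contain stray bytes; do not let decoding stop the capture.
        line = raw.decode("ascii", errors="replace").rstrip("\r\n")
        dst.write(f"{clock('%Y-%m-%d %H:%M:%S')} {line}\n")
        dst.flush()  # keep the file current so nothing is lost on interruption
        count += 1
    return count


# Hypothetical usage on the capturing machine (paths are assumptions):
#
#     with open("/dev/ttyUSB0", "rb", buffering=0) as port, \
#             open("console.log", "a") as log:
#         log_console(port, log)
```

Because every line is flushed as it arrives, the panic message and stack trace should still be in the file after the panicking machine has rebooted. A terminal program with logging enabled would serve the same purpose.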
