Bug #3532
Aborted zfs recv causes NULL pointer kernel panic
Description
Aborting a zfs recv operation causes the system to panic. This was easily reproduced, 5 out of 5 times. The system is running an illumos-gate build from about two weeks ago.
crash dump at http://patrickdk.com/crash0.tar.gz
::status
debugging crash dump vmcore.0 (64-bit) from stor3
operating system: 5.11 oi_151a7c (i86pc)
image uuid: 5e96e3ce-73f5-e683-c532-ec021616a724
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0010375530 addr=0 occurred in module "unix" due to a NULL pointer dereference
dump content: kernel pages only
::stack
strncpy+0x28(ffffff02ee0e1000, 0, 2000)
dodefault+0xca(ffffff02e1bc14a0, 1, 2000, ffffff02ee0e1000)
dsl_prop_get_dd+0x1bf(ffffff02d583a000, ffffff02e1bc14a0, 1, 2000, ffffff02ee0e1000, 0)
dsl_prop_get_ds+0x152()
dsl_prop_set_sync+0x37a()
dsl_props_set_sync+0x9b(ffffff02e27da700, ffffff001089b860, ffffff02ead4b500)
dsl_sync_task_group_sync+0x146(ffffff02eb692198, ffffff02ead4b500)
dsl_pool_sync+0x443(ffffff02e0de3a00, 58898)
spa_sync+0x33f(ffffff02d5f52040, 58898)
txg_sync_thread+0x1fc(ffffff02e0de3a00)
thread_start+8()
::msgbuf
MESSAGE
open version 5000 pool aggr0 using 5000
This Solaris instance has UUID 5e96e3ce-73f5-e683-c532-ec021616a724
dump on /dev/zvol/dsk/rpool/dump size 8192 MB
pseudo-device: pm0
pm0 is /pseudo/pm@0
pseudo-device: power0
power0 is /pseudo/power@0
pseudo-device: srn0
srn0 is /pseudo/srn@0
iscsi0 at root
iscsi0 is /iscsi
fcoe0 at root
fcoe0 is /fcoe
pseudo-device: pseudo1
pseudo1 is /pseudo/zconsnex@1
pcplusmp: asy (asy) instance 0 irq 0x4 vector 0xb0 ioapic 0x8 intin 0x4 is bound to cpu 3
pcplusmp: ide (ata) instance 0 irq 0xe vector 0x43 ioapic 0x8 intin 0xe is bound to cpu 0
acpinex: sb@0, acpinex1
acpinex1 is /fw/sb@0
ATAPI device at targ 0, lun 0 lastlun 0x0
model TEAC CD-ROM CD-224E
PCI Express-device: ide@0, ata0
ata0 is /pci@0,0/pci-ide@1f,1/ide@0
ISA-device: asy0
asy0 is /pci@0,0/isa@1f/asy@1,3f8
pcplusmp: ide (ata) instance 1 irq 0xf vector 0x44 ioapic 0x8 intin 0xf is bound to cpu 1
UltraDMA mode 2 selected
sd0 at ata0: target 0 lun 0
sd0 is /pci@0,0/pci-ide@1f,1/ide@0/sd@0,0
ISA-device: pit_beep0
pit_beep0 is /pci@0,0/isa@1f/pit_beep
pseudo-device: llc10
llc10 is /pseudo/llc1@0
pseudo-device: lofi0
lofi0 is /pseudo/lofi@0
pseudo-device: ramdisk1024
ramdisk1024 is /pseudo/ramdisk@1024
pseudo-device: ucode0
ucode0 is /pseudo/ucode@0
pseudo-device: dcpc0
dcpc0 is /pseudo/dcpc@0
pseudo-device: dtrace0
dtrace0 is /pseudo/dtrace@0
pseudo-device: fasttrap0
fasttrap0 is /pseudo/fasttrap@0
pseudo-device: fbt0
fbt0 is /pseudo/fbt@0
pseudo-device: lockstat0
lockstat0 is /pseudo/lockstat@0
pseudo-device: profile0
profile0 is /pseudo/profile@0
pseudo-device: sdt0
sdt0 is /pseudo/sdt@0
pseudo-device: systrace0
systrace0 is /pseudo/systrace@0
pseudo-device: fcsm0
fcsm0 is /pseudo/fcsm@0
NOTICE: fcsm(0): attached to path /pci@0,0/pci8086,3595@2/pci8086,32a@0,2/pci1077,101@7/fp@0,0
NOTICE: fcsm(2): attached to path /pci@0,0/pci8086,3595@2/pci8086,32a@0,2/pci1077,101@7,1/fp@0,0
pseudo-device: nvidia255
nvidia255 is /pseudo/nvidia@255
pseudo-device: fcp0
fcp0 is /pseudo/fcp@0
pseudo-device: fct0
fct0 is /pseudo/fct@0
pseudo-device: stmf0
stmf0 is /pseudo/stmf@0
pseudo-device: pool0
pool0 is /pseudo/pool@0
pseudo-device: fssnap0
fssnap0 is /pseudo/fssnap@0
IP Filter: v4.1.9, running.
pseudo-device: nsmb0
nsmb0 is /pseudo/nsmb@0
pseudo-device: bpf0
bpf0 is /pseudo/bpf@0
pseudo-device: ii0
ii0 is /pseudo/ii@0
sv (revision 11.11, SunOS 5.11, None)
pseudo-device: sv0
sv0 is /pseudo/sv@0
pseudo-device: rdc0
rdc0 is /pseudo/rdc@0
pseudo-device: ncall0
ncall0 is /pseudo/ncall@0
pseudo-device: nsctl0
nsctl0 is /pseudo/nsctl@0
pseudo-device: nsctl0
nsctl0 is /pseudo/nsctl@0
pseudo-device: sdbc0
sdbc0 is /pseudo/sdbc@0
device pciclass,030000@d(display#0) keeps up device sd@0,0(sd#0), but the former is not power managed
pcplusmp: ide (ata) instance 1 irq 0xf vector 0x44 ioapic 0x8 intin 0xf is bound to cpu 2
pcplusmp: ide (ata) instance 1 irq 0xf vector 0x44 ioapic 0x8 intin 0xf is bound to cpu 3
WARNING: constraints forbid retire: /pci@0,0/pci8086,3595@2/pci8086,32a@0,2
pseudo-device: devinfo0
devinfo0 is /pseudo/devinfo@0
panic[cpu2]/thread=ffffff0010375c40:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0010375530 addr=0 occurred in module "unix" due to a NULL pointer dereference
sched:
#pf Page fault
Bad kernel fault at addr=0x0
pid=0, pc=0xfffffffffb887e98, sp=0xffffff0010375620, eflags=0x10206
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
cr2: 0
cr3: 8c00000
cr8: c
rdi: ffffff02ee0e1000 rsi: 0 rdx: 2000
rcx: ffffff02ee0e1000 r8: 3e r9: ffffff0010375438
rax: ffffff02ee0e1000 rbx: 0 rbp: ffffff0010375640
r10: 1 r11: 0 r12: ffffff02ee0e1000
r13: 2000 r14: 32 r15: ffffff02ee0e1000
fsb: 0 gsb: ffffff02d7cea080 ds: 4b
es: 4b fs: 0 gs: 1c3
trp: e err: 0 rip: fffffffffb887e98
cs: 30 rfl: 10206 rsp: ffffff0010375620
ss: 38
ffffff0010375410 unix:die+df ()
ffffff0010375520 unix:trap+db3 ()
ffffff0010375530 unix:cmntrap+e6 ()
ffffff0010375640 unix:strncpy+28 ()
ffffff0010375690 zfs:dodefault+ca ()
ffffff0010375750 zfs:dsl_prop_get_dd+1bf ()
ffffff0010375810 zfs:dsl_prop_get_ds+152 ()
ffffff00103758e0 zfs:dsl_prop_set_sync+37a ()
ffffff0010375990 zfs:dsl_props_set_sync+9b ()
ffffff00103759e0 zfs:dsl_sync_task_group_sync+146 ()
ffffff0010375aa0 zfs:dsl_pool_sync+443 ()
ffffff0010375b70 zfs:spa_sync+33f ()
ffffff0010375c20 zfs:txg_sync_thread+1fc ()
ffffff0010375c30 unix:thread_start+8 ()
syncing file systems...
done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
Updated by Christopher Siden almost 8 years ago
I can't reproduce this by aborting a zfs recv with Ctrl+C. Do you have more specific information about the commands you ran to trigger this reliably?
Updated by Patrick Domack almost 8 years ago
I'll try to work out what triggers it. In this case, I was moving a zvol from an existing location to a new one, and the zvol did not yet exist on the destination.
From the source location:
NAME              PROPERTY              VALUE                 SOURCE
aggr0/luns2/esx3  type                  volume                -
aggr0/luns2/esx3  creation              Wed May 2 10:29 2012  -
aggr0/luns2/esx3  used                  279G                  -
aggr0/luns2/esx3  available             2.19T                 -
aggr0/luns2/esx3  referenced            215G                  -
aggr0/luns2/esx3  compressratio         1.95x                 -
aggr0/luns2/esx3  reservation           none                  default
aggr0/luns2/esx3  volsize               800G                  local
aggr0/luns2/esx3  volblocksize          16K                   -
aggr0/luns2/esx3  checksum              on                    default
aggr0/luns2/esx3  compression           lzjb                  local
aggr0/luns2/esx3  readonly              off                   default
aggr0/luns2/esx3  copies                1                     default
aggr0/luns2/esx3  refreservation        none                  default
aggr0/luns2/esx3  primarycache          all                   default
aggr0/luns2/esx3  secondarycache        all                   default
aggr0/luns2/esx3  usedbysnapshots       63.9G                 -
aggr0/luns2/esx3  usedbydataset         215G                  -
aggr0/luns2/esx3  usedbychildren        0                     -
aggr0/luns2/esx3  usedbyrefreservation  0                     -
aggr0/luns2/esx3  logbias               latency               default
aggr0/luns2/esx3  dedup                 off                   default
aggr0/luns2/esx3  mlslabel              none                  default
aggr0/luns2/esx3  sync                  standard              default
aggr0/luns2/esx3  refcompressratio      1.83x                 -
aggr0/luns2/esx3  written               5.57G                 -
And the sending command: zfs send -v -R aggr0/luns2/esx3@repsnap
And the receiving command: zfs recv -v -F -d aggr0
It might not have been a Ctrl+C on the recv side but one on the sending end, causing an incomplete transmission.
I'll attempt to see if I can narrow it down.