Bug #5833

kernel crash on zfs import (mount)

Added by Jan-Peter Koopmann over 5 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: zfs - Zettabyte File System
Start date: 2015-04-13
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

Hi,

on a reboot I got this crash (over and over):
Apr 13 17:22:05 nasjpk ^Mpanic[cpu2]/thread=ffffff03e3b32780:
Apr 13 17:22:05 nasjpk genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ffffff001799b340 addr=20 occurred in module "zfs" due to a NULL pointer dereference
Apr 13 17:22:05 nasjpk unix: [ID 100000 kern.notice]
Apr 13 17:22:05 nasjpk unix: [ID 839527 kern.notice] zfs:
Apr 13 17:22:05 nasjpk unix: [ID 753105 kern.notice] #pf Page fault
Apr 13 17:22:05 nasjpk unix: [ID 532287 kern.notice] Bad kernel fault at addr=0x20
Apr 13 17:22:05 nasjpk unix: [ID 243837 kern.notice] pid=487, pc=0xfffffffff7a220e8, sp=0xffffff001799b438, eflags=0x10217
Apr 13 17:22:05 nasjpk unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
Apr 13 17:22:05 nasjpk unix: [ID 624947 kern.notice] cr2: 20
Apr 13 17:22:05 nasjpk unix: [ID 625075 kern.notice] cr3: 335010000
Apr 13 17:22:05 nasjpk unix: [ID 625715 kern.notice] cr8: c
Apr 13 17:22:05 nasjpk unix: [ID 100000 kern.notice]
Apr 13 17:22:05 nasjpk unix: [ID 592667 kern.notice] rdi: ffffff03ec5e0d68 rsi: 0 rdx: 0
Apr 13 17:22:05 nasjpk unix: [ID 592667 kern.notice] rcx: ed34b513 r8: 12cb4ae8 r9: ffffff001799b4a0
Apr 13 17:22:05 nasjpk unix: [ID 592667 kern.notice] rax: 7ffff rbx: 0 rbp: ffffff001799b480
Apr 13 17:22:05 nasjpk unix: [ID 592667 kern.notice] r10: ffff r11: 0 r12: ffffff03ec5e0d68
Apr 13 17:22:05 nasjpk unix: [ID 592667 kern.notice] r13: ffffff03ec5e0d68 r14: ffffff001799b5c0 r15: ffffff001799b600
Apr 13 17:22:05 nasjpk unix: [ID 592667 kern.notice] fsb: 0 gsb: ffffff03e31a5a80 ds: 4b
Apr 13 17:22:05 nasjpk unix: [ID 592667 kern.notice] es: 4b fs: 0 gs: 1c3
Apr 13 17:22:05 nasjpk unix: [ID 592667 kern.notice] trp: e err: 0 rip: fffffffff7a220e8
Apr 13 17:22:05 nasjpk unix: [ID 592667 kern.notice] cs: 30 rfl: 10217 rsp: ffffff001799b438
Apr 13 17:22:05 nasjpk unix: [ID 266532 kern.notice] ss: 38
Apr 13 17:22:05 nasjpk unix: [ID 100000 kern.notice]
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799b210 unix:die+dd ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799b330 unix:trap+17db ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799b340 unix:cmntrap+e6 ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799b480 zfs:zap_leaf_lookup_closest+40 ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799b510 zfs:fzap_cursor_retrieve+c9 ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799b5a0 zfs:zap_cursor_retrieve+17d ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799b780 zfs:zfs_purgedir+4c ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799b7d0 zfs:zfs_rmnode+50 ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799b810 zfs:zfs_zinactive+b5 ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799b860 zfs:zfs_inactive+11a ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799b8b0 genunix:fop_inactive+af ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799b8d0 genunix:vn_rele+5f ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799bac0 zfs:zfs_unlinked_drain+af ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799baf0 zfs:zfsvfs_setup+102 ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799bb50 zfs:zfs_domount+17c ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799bc70 zfs:zfs_mount+1cd ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799bca0 genunix:fsop_mount+21 ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799be00 genunix:domount+b0e ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799be80 genunix:mount+121 ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799bec0 genunix:syscall_ap+8c ()
Apr 13 17:22:05 nasjpk genunix: [ID 655072 kern.notice] ffffff001799bf10 unix:brand_sys_sysenter+1c9 ()
Apr 13 17:22:05 nasjpk unix: [ID 100000 kern.notice]
Apr 13 17:22:05 nasjpk genunix: [ID 672855 kern.notice] syncing file systems...
Apr 13 17:22:05 nasjpk genunix: [ID 904073 kern.notice] done

I started with the debugger and was able to identify the dataset that caused the crash during do_mount. I then booted to a live USB, imported the pool with -N, and deleted the dataset. That got me working again.
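
For reference, the recovery from the live environment amounts to roughly the following (a sketch only; the pool and dataset names are placeholders, not the ones from this system):

# zpool import -N tank
# zfs destroy tank/path/to/bad-dataset
# zpool export tank

zpool import -N imports the pool without mounting any of its datasets, so the mount that triggers the panic never runs before the offending dataset is destroyed.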

After the reboot I imported the pool again and rebooted once more. Strangely enough, that reboot triggered yet another panic (in the shutdown phase), but the system came back up and a scrub is now running.

Version: oi_151a8

-bash-4.0$ pfexec pkg info entire
Name: entire
Summary: incorporation to lock all system packages to same build
Description: This package constrains system package versions to the same
build. WARNING: Proper system update and correct package
selection depend on the presence of this incorporation.
Removing this package will result in an unsupported system.
State: Installed
Publisher: openindiana.org
Version: 0.5.11
Build Release: 5.11
Branch: 0.151.1.8
Packaging Date: July 21, 2013 12:50:47
Size: 0.00 B
FMRI: pkg://openindiana.org/entire@0.5.11,5.11-0.151.1.8:20130721T125047Z


Related issues

Related to illumos gate - Bug #2233: zfs mount hangs system (Feedback, 2012-03-04)

History

#1

Updated by Pavel Zakharov over 4 years ago

  • Related to Bug #2233: zfs mount hangs system added
#2

Updated by Pavel Zakharov over 4 years ago

Had the same issue occur in production.

After analyzing the crash dump, the problem seems to be related to xattr directories on the ZFS unlinked list. I got this information by tracing the znode_t argument to the zfs_purgedir() call at the time of the crash.
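
A rough sketch of that kind of inspection with mdb against the crash dump (the dump file names and the znode address are placeholders; z_id, z_unlinked and z_pflags are znode_t fields in the ZFS source):

# mdb unix.0 vmcore.0
> $C
> <znode-address-from-the-zfs_purgedir-frame>::print -t znode_t z_id z_unlinked z_pflags

$C prints the stack together with its (on amd64, approximate) view of the frame arguments; the first argument to zfs_purgedir() is the znode_t pointer of the directory being purged, and its z_id is the object number that can then be inspected with zdb.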

In our case, it seems like the problematic directory is the virtual <xattrdir> used by SMB: .$EXTEND/$QUOTA/<xattrdir>

When trying to run zdb on it, an assert occurs:
NestOS.ES200160-001-02:~# zdb -e -dddd poolpy-b82bd9a1-0b08-47a2-8526-cfbcb32f4f8a/sharetest45@auto_2015-07-23-094507_5668_r 481
Dataset poolpy-b82bd9a1-0b08-47a2-8526-cfbcb32f4f8a/sharetest45@auto_2015-07-23-094507_5668_r [ZPL], ID 205, cr_txg 11328, 870G, 459 objects, rootbp DVA0=<0:d99f5ed400:200> DVA1=<0:2c0003ec00:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=11296L/11296P fill=459 cksum=1088557663:678e97ba95a:14d8d44e07287:2df4e867cfb1e1

Object  lvl   iblk   dblk  dsize  lsize   %full  type
   481    1    16K    512      0    512    0.00  ZFS directory
                                    168   bonus  System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 0
path ???<object#481>
uid 0
gid 3
atime Thu Jul 23 09:42:33 2015
mtime Thu Jul 23 09:42:33 2015
ctime Thu Jul 23 09:42:33 2015
crtime Thu Jul 23 09:42:33 2015
gen 11290
mode 41777
size 2
parent 480 <--- $EXTEND/$QUOTA
links 0
worm WORM_STATE_OPEN
pflags 40800000145 <-- Indicates it is an xattr dir
Fat ZAP stats:
Pointer table:
1 elements
zt_blk: 0
zt_numblks: 0
zt_shift: 0
zt_blks_copied: 0
zt_nextblk: 0
ZAP entries: 0
Leaf blocks: 0
Total blocks: 0
zap_block_type: 0x0
zap_magic: 0x0
zap_salt: 0x0
Leafs with 2^n pointers:
Blocks with n*5 entries:
Blocks n/10 full:
Entries with n chunks:
Buckets with n entries:

assertion failed for thread 0xfffffd7fff152a40, thread-id 1: zap_f_phys(zap)->zap_magic == 0x2F52AB2ABULL (0x0 == 0x2f52ab2ab), file ../../../uts/common/fs/zfs/zap.c, line 578

#3

Updated by Dan McDonald over 4 years ago

This occurred for me, and after some digging, it appears that the zap_phys_t (which I'm guessing is read off disk?) is all zeroes.

I used zdb(1) to find the <xattr> directory causing a panic just like this, and here's what I found:

# mdb core.zdb
Loading modules: [ libumem.so.1 libc.so.1 libzpool.so.1 libavl.so.1 libcmdutils.so.1 libnvpair.so.1 libtopo.so.1 ld.so.1 ]
> $c 
libc.so.1`_lwp_kill+0xa()
libc.so.1`_assfail+0x168(ffffdf7fffdfea20, ffffdf7ffda347b0, 241)
libc.so.1`assfail3+0xe6(ffffdf7ffda34e88, 0, ffffdf7ffda39f6c, 2f52ab2ab, 
ffffdf7ffda347b0, 241)
libzpool.so.1`zap_deref_leaf+0x7b(3dceb500, 0, 0, 0, ffffdf7fffdff200)
libzpool.so.1`fzap_cursor_retrieve+0x140(3dceb500, ffffdf7fffdff1f0, 
ffffdf7fffdff050)
libzpool.so.1`zap_cursor_retrieve+0x67(ffffdf7fffdff1f0, ffffdf7fffdff050)
dump_zpldir+0xc3(3904d300, e107ad5, 0, 0)
dump_object+0x31f(3904d300, e107ad5, 4, ffffdf7fffdff8bc)
dump_dir+0x1e8(3904d300)
main+0x4b9(4, ffffdf7fffdff9e8)
_start+0x6c()
> 3dceb500::print -ta zap_t zap_dbuf
    3dceb548 struct dmu_buf *zap_dbuf = 0x3de46460
> 0x3de46460::print "struct dmu_buf"  
{
    db_object = 0xe107ad5
    db_offset = 0
    db_size = 0x200
    db_data = 0x3de6d400
}
> 0x3de6d400::print zap_phys_t
{
    zap_block_type = 0
    zap_magic = 0
    zap_ptrtbl = {
        zt_blk = 0
        zt_numblks = 0
        zt_shift = 0
        zt_nextblk = 0
        zt_blks_copied = 0
    }
    zap_freeblk = 0
    zap_num_leafs = 0
    zap_num_entries = 0
    zap_salt = 0
    zap_normflags = 0
    zap_flags = 0
}
> 

That's something very VERY wrong.
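
For comparison, a healthy fat-ZAP header should carry the magic value that the zdb assertion in the previous comment expects, along these lines (illustrative output only; the address is a placeholder, and the block-type value assumes the ZBT_HEADER constant from the ZAP code):

> <zap_phys-address>::print zap_phys_t zap_block_type zap_magic
    zap_block_type = 0x8000000000000001
    zap_magic = 0x2f52ab2ab

With an all-zero header, the ZAP code has no valid pointer table or leaf blocks to follow, which is consistent with the NULL pointer dereference in zap_leaf_lookup_closest() shown in the original panic.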

#4

Updated by Dan McDonald over 4 years ago

My user suggested this possible fix from ZoL:

https://github.com/zfsonlinux/zfs/commit/4254acb05743dc2175ae76f6e15b0785d4b688fd

This may stop the all-zero ZAP corruption in an XATTR directory.

#5

Updated by Dan McDonald over 4 years ago

  • Status changed from New to Closed

I think this is a duplicate of what just got pushed with #6842. I'm going to close this bug as a duplicate.
