Bug #6779

kernel panic in zfs_readdir shortly after mount

Added by Jan Schlien about 4 years ago. Updated almost 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
2016-03-21
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:

Description

I've got two similar crashes from two different systems. I looked at #5833 and #6510, but this appears to be different from those two.

The panic messages:

Feb  6 19:25:56 RXDATA: panic[cpu1]/thread=ffffff237d4f93a0: BAD TRAP: type=e (#pf Page fault) rp=ffffff0116da3d60 addr=20 occurred in module "zfs" due to a NULL pointer dereference
Feb  6 19:25:56 RXDATA: nfsd: #pf Page fault  
Feb  6 19:25:56 RXDATA: Bad kernel fault at addr=0x20 
Feb  6 19:25:56 RXDATA: pid=5969, pc=0xfffffffff7a18575, sp=0xffffff0116da3e50, eflags=0x10206
Feb  6 19:25:56 RXDATA: cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 1406f8<smep,osxsav,xmme,fxsr,pge,mce,pae,pse,de>
Feb  6 19:25:56 RXDATA: cr2: 20cr3: 20419cb000cr8: 0
Feb  6 19:25:56 RXDATA:         rdi: ffffff26eea4baa0 rsi: f6fbb2f000000000 rdx:                8
Feb  6 19:25:56 RXDATA:         rcx:         fb484a93  r8: ffffff0116da3ec0  r9:                0
Feb  6 19:25:56 RXDATA:         rax:         fb484a98 rbx: ffffff0116da41b0 rbp: ffffff0116da3ea0
Feb  6 19:25:57 RXDATA:         r10: fffffffffb854378 r11:                0 r12:            80000       
Feb  6 19:25:57 RXDATA:         r13:                0 r14: ffffff0116da3ec0 r15: ffffff0116da4090
Feb  6 19:25:57 RXDATA:         fsb:                0 gsb: ffffff23720bf580  ds:               4b
Feb  6 19:25:57 RXDATA:          es:               4b  fs:                0  gs:              1c3       
Feb  6 19:25:57 RXDATA:         trp:                e err:                0 rip: fffffffff7a18575       
Feb  6 19:25:57 RXDATA:          cs:               30 rfl:            10206 rsp: ffffff0116da3e50   
Feb  6 19:25:57 RXDATA:          ss:               38          
Feb  6 19:25:57 RXDATA: ffffff0116da3c40 unix:die+df ()
Feb  6 19:25:57 RXDATA: ffffff0116da3d50 unix:trap+dc0 ()
Feb  6 19:25:57 RXDATA: ffffff0116da3d60 unix:cmntrap+e6 ()
Feb  6 19:25:57 RXDATA: ffffff0116da3ea0 zfs:zap_leaf_lookup_closest+45 ()
Feb  6 19:25:57 RXDATA: ffffff0116da3f40 zfs:fzap_cursor_retrieve+bb ()
Feb  6 19:25:57 RXDATA: ffffff0116da3fe0 zfs:zap_cursor_retrieve+11e ()
Feb  6 19:25:57 RXDATA: ffffff0116da4260 zfs:zfs_readdir+33d ()
Feb  6 19:25:57 RXDATA: ffffff0116da42f0 genunix:fop_readdir+6b ()
Feb  6 19:25:57 RXDATA: ffffff0116da46c0 nfssrv:rfs4_op_readdir+68b () 
Feb  6 19:25:57 RXDATA: ffffff0116da4850 nfssrv:rfs4_compound+1e9 ()
Feb  6 19:25:58 RXDATA: ffffff0116da4900 nfssrv:rfs4_dispatch+145 ()
Feb  6 19:25:58 RXDATA: ffffff0116da4c20 nfssrv:common_dispatch+765 () 
Feb  6 19:25:58 RXDATA: ffffff0116da4c40 nfssrv:rfs_dispatch+2d ()
Feb  6 19:25:58 RXDATA: ffffff0116da4d20 rpcmod:svc_getreq+1c1 ()
Feb  6 19:25:58 RXDATA: ffffff0116da4d90 rpcmod:svc_run+e0 ()
Feb  6 19:25:58 RXDATA: ffffff0116da4dd0 rpcmod:svc_do_run+8e ()
Feb  6 19:25:58 RXDATA: ffffff0116da4ec0 nfs:nfssys+f1 ()
Feb  6 19:25:58 RXDATA: ffffff0116da4f10 unix:brand_sys_sysenter+1c9 ()
> zap_leaf_lookup_closest::dis
zap_leaf_lookup_closest:        pushq  %rbp
zap_leaf_lookup_closest+1:      movq   %rcx,%r8
zap_leaf_lookup_closest+4:      movq   %rsp,%rbp
zap_leaf_lookup_closest+7:      subq   $0x50,%rsp
zap_leaf_lookup_closest+0xb:    movl   0x50(%rdi),%eax
zap_leaf_lookup_closest+0xe:    movq   %rdx,-0x18(%rbp)
zap_leaf_lookup_closest+0x12:   movl   %edx,-0x4c(%rbp)
zap_leaf_lookup_closest+0x15:   movq   0x58(%rdi),%rdx
zap_leaf_lookup_closest+0x19:   movq   %rbx,-0x48(%rbp)
zap_leaf_lookup_closest+0x1d:   movq   %r12,-0x40(%rbp)
zap_leaf_lookup_closest+0x21:   movq   %rcx,-0x20(%rbp)
zap_leaf_lookup_closest+0x25:   leal   -0x5(%rax),%ecx
zap_leaf_lookup_closest+0x28:   movq   %r13,-0x38(%rbp)
zap_leaf_lookup_closest+0x2c:   movq   %r14,-0x30(%rbp)
zap_leaf_lookup_closest+0x30:   movl   $0x1,%r12d
zap_leaf_lookup_closest+0x36:   movq   %r15,-0x28(%rbp)
zap_leaf_lookup_closest+0x3a:   movq   %rsi,-0x10(%rbp)
zap_leaf_lookup_closest+0x3e:   shll   %cl,%r12d
zap_leaf_lookup_closest+0x41:   movq   %rdi,-0x8(%rbp)
zap_leaf_lookup_closest+0x45:   movq   0x18(%rdx),%r9

From the source code, this should be the dereference in zap_leaf_phys(l), which fits with %rdx being 8 in the panic buffer: the faulting instruction loads 0x18(%rdx), and 8 + 0x18 is exactly the reported fault address 0x20.
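
A minimal sketch of the structures behind that load (simplified from the illumos headers; the exact offsets are my assumption for this build):

/* public dmu_buf_t, sys/dmu.h (simplified) */
typedef struct dmu_buf {
        uint64_t db_object;     /* offset 0x00 */
        uint64_t db_offset;     /* offset 0x08 */
        uint64_t db_size;       /* offset 0x10 */
        void *db_data;          /* offset 0x18 -- the faulting load */
} dmu_buf_t;

/* sys/zap_leaf.h (simplified) */
static inline zap_leaf_phys_t *
zap_leaf_phys(zap_leaf_t *l)
{
        /*
         * Compiles to movq 0x18(%rdx),%r9 with %rdx = l->l_dbuf;
         * with l_dbuf == 8 this touches 8 + 0x18 == 0x20, the fault address.
         */
        return (l->l_dbuf->db_data);
}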

> ffffff26eea4baa0::print -t zap_leaf_t l_dbuf
dmu_buf_t *l_dbuf = 8

In fact, the object being used as a zap_leaf_t looks much more like a zap_t (see dbu_evict_func):

> ffffff26eea4baa0::print -t zap_leaf_t   
zap_leaf_t {
    dmu_buf_user_t l_dbu = {
        taskq_ent_t dbu_tqent = {
            struct taskq_ent *tqent_next = 0
            struct taskq_ent *tqent_prev = 0
            task_func_t *tqent_func = 0
            void *tqent_arg = 0
            union  tqent_un = {
                taskq_bucket_t *tqent_bucket = 0
                uintptr_t tqent_flags = 0
            }
            kthread_t *tqent_thread = 0
            kcondvar_t tqent_cv = {
                ushort_t _opaque = 0
            }
        }
        dmu_buf_evict_func_t *dbu_evict_func = zap_evict
    }
    krwlock_t l_rwlock = {
        void *[1] _opaque = [ 0xffffff26ae98e708 ]
    }
    uint64_t l_blkid = 0x4e99f
    int l_bs = 0xfb484a98
    dmu_buf_t *l_dbuf = 8
}
> ffffff26eea4baa0::print -t zap_t     
zap_t {
    dmu_buf_user_t zap_dbu = {
        taskq_ent_t dbu_tqent = {
            struct taskq_ent *tqent_next = 0
            struct taskq_ent *tqent_prev = 0
            task_func_t *tqent_func = 0
            void *tqent_arg = 0
            union  tqent_un = {
                taskq_bucket_t *tqent_bucket = 0
                uintptr_t tqent_flags = 0
            }
            kthread_t *tqent_thread = 0
            kcondvar_t tqent_cv = {
                ushort_t _opaque = 0
            }
        }
        dmu_buf_evict_func_t *dbu_evict_func = zap_evict
    }
    objset_t *zap_objset = 0xffffff26ae98e708
    uint64_t zap_object = 0x4e99f
    struct dmu_buf *zap_dbuf = 0xffffff26fb484a98
    krwlock_t zap_rwlock = {
        void *[1] _opaque = [ 8 ]
    }
    boolean_t zap_ismicro = 0 (0)
    int zap_normflags = 0
    uint64_t zap_salt = 0
    union  zap_u = {
        struct  zap_fat = {
            kmutex_t zap_num_entries_mtx = {
                void *[1] _opaque = [ 0 ]
            }
            int zap_block_shift = 0xe
        }
        struct  zap_micro = {
            int16_t zap_num_entries = 0
            int16_t zap_num_chunks = 0
            int16_t zap_alloc_next = 0
            avl_tree_t zap_avl = {
                struct avl_node *avl_root = 0xe
                int (*)() avl_compar = 0
                size_t avl_offset = 0
                ulong_t avl_numnodes = 0
                size_t avl_size = 0
            }
        }
    }
}
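
For reference, this is how the two layouts line up at ffffff26eea4baa0 (field order taken from the ::print output above; it is the same memory interpreted two ways):

/*
 *  zap_leaf_t field          zap_t field            value
 *  ----------------------    -------------------    --------------------------
 *  l_dbu                     zap_dbu                dbu_evict_func = zap_evict
 *  l_rwlock._opaque[0]       zap_objset             ffffff26ae98e708
 *  l_blkid                   zap_object             4e99f
 *  l_bs (32 bits)            zap_dbuf (low half)    fb484a98
 *  l_dbuf                    zap_rwlock             8
 */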

The cursor has the same pointer set for both zc_zap and zc_leaf, which I think should not happen; note that this is exactly the pointer (%rdi above) on which zap_leaf_lookup_closest faulted:

> ffffff0116da41b0::print -t zap_cursor_t
zap_cursor_t {
    objset_t *zc_objset = 0xffffff26ae98e700
    struct zap *zc_zap = 0xffffff26eea4baa0
    struct zap_leaf *zc_leaf = 0xffffff26eea4baa0
    uint64_t zc_zapobj = 0x4e99f
    uint64_t zc_serialized = 0x1f6fbb2f
    uint64_t zc_hash = 0xf6fbb2f000000000
    uint32_t zc_cd = 0x1
}

What also looks suspicious to me is that the znode passed to zfs_readdir has z_links = 0:

> 0xffffff26fafb2a40::print -t vnode_t v_data | ::print -t znode_t
znode_t {
    struct zfsvfs *z_zfsvfs = 0xffffff26aeb1c000
    vnode_t *z_vnode = 0xffffff26fafb2a40
    uint64_t z_id = 0x4e99f
    kmutex_t z_lock = {
        void *[1] _opaque = [ 0 ]
    }
    krwlock_t z_parent_lock = {
        void *[1] _opaque = [ 0 ]
    }
    krwlock_t z_name_lock = {
        void *[1] _opaque = [ 0 ]
    }
    zfs_dirlock_t *z_dirlocks = 0
    kmutex_t z_range_lock = {
        void *[1] _opaque = [ 0 ]
    }
    avl_tree_t z_range_avl = {
        struct avl_node *avl_root = 0
        int (*)() avl_compar = zfs_range_compare
        size_t avl_offset = 0x8
        ulong_t avl_numnodes = 0
        size_t avl_size = 0x40
    }
    uint8_t z_unlinked = 0
    uint8_t z_atime_dirty = 0
    uint8_t z_zn_prefetch = 0x1
    uint8_t z_moved = 0
    uint_t z_blksz = 0x4000
    uint_t z_seq = 0x7a4653
    uint64_t z_mapcnt = 0
    uint64_t z_gen = 0x1077b3f
    uint64_t z_size = 0x2
    uint64_t [2] z_atime = [ 0x533f8166, 0x1a780340 ]
    uint64_t z_links = 0
    uint64_t z_pflags = 0x40800000042
    uint64_t z_uid = 0x3e8
    uint64_t z_gid = 0x44c
    mode_t z_mode = 0x41ed
    uint32_t z_sync_cnt = 0
    kmutex_t z_acl_lock = {
        void *[1] _opaque = [ 0 ]
    }
    zfs_acl_t *z_acl_cached = 0xffffff26fb422ec0
    list_node_t z_link_node = {
        struct list_node *list_next = 0xffffff26aeb1c418
        struct list_node *list_prev = 0xffffff26fafb02c8
    }
    sa_handle_t *z_sa_hdl = 0xffffff26faf5f2e0
    boolean_t z_is_sa = 0 (0)         
    uint32_t z_new_content = 0
}

In another thread, there is an rmdir operation waiting for the same dataset (but for a different directory):

> ffffff237aa750c0::findstack -v
stack pointer for thread ffffff237aa750c0: ffffff00f61bb170
[ ffffff00f61bb170 _resume_from_idle+0xf4() ]
  ffffff00f61bb1a0 swtch+0x141()
  ffffff00f61bb1e0 cv_wait+0x70(ffffff237cc4c8c6, ffffff237cc4c880)
  ffffff00f61bb230 txg_wait_open+0x83(ffffff237cc4c700, 59c32b)
  ffffff00f61bb280 dmu_tx_wait+0x16f(ffffff26faffc100)
  ffffff00f61bb360 zfs_rmdir+0x202(ffffff26fafb2c40, ffffff26dbfeb248, ffffff235363e340, ffffff264b5a9bc8, 0, 0)
  ffffff00f61bb3e0 fop_rmdir+0x63(ffffff26fafb2c40, ffffff26dbfeb248, ffffff235363e340, ffffff264b5a9bc8, 0, 0)
  ffffff00f61bb6c0 rfs4_op_remove+0x79d(ffffff266423fb38, ffffff26ebb914a8, ffffff00f61bbca0, ffffff00f61bb6e0)
  ffffff00f61bb850 rfs4_compound+0x1e9(ffffff00f61bbaa0, ffffff00f61bb870, 0, ffffff00f61bbca0, 0, ffffff00f61bb8a8)
  ffffff00f61bb900 rfs4_dispatch+0x21d(ffffffffc040f010, ffffff00f61bbca0, ffffff26ee71dc00, ffffff00f61bbaa0)
  ffffff00f61bbc20 common_dispatch+0x765(ffffff00f61bbca0, ffffff26ee71dc00, 2, 4, fffffffff8de0e66, ffffffffc040f060)
  ffffff00f61bbc40 rfs_dispatch+0x2d(ffffff00f61bbca0, ffffff26ee71dc00)
  ffffff00f61bbd20 svc_getreq+0x1c1(ffffff26ee71dc00, ffffff237971d960)
  ffffff00f61bbd90 svc_run+0xe0(ffffff26ee2ba6e8)
  ffffff00f61bbdd0 svc_do_run+0x8e(1)
  ffffff00f61bbec0 nfssys+0xf1(e, fec90fbc)
  ffffff00f61bbf10 _sys_sysenter_post_swapgs+0x149()
> ffffff26fafb2c40::print -t vnode_t v_data | ::print -t znode_t                                                                                                                                     
znode_t {
    struct zfsvfs *z_zfsvfs = 0xffffff26aeb1c000
    vnode_t *z_vnode = 0xffffff26fafb2c40
    uint64_t z_id = 0x1f32c
    kmutex_t z_lock = {
        void *[1] _opaque = [ 0 ]
    }
    krwlock_t z_parent_lock = {
        void *[1] _opaque = [ 0 ]
    }
    krwlock_t z_name_lock = {
        void *[1] _opaque = [ 0 ]
    }
    zfs_dirlock_t *z_dirlocks = 0
    kmutex_t z_range_lock = {
        void *[1] _opaque = [ 0 ]
    }
    avl_tree_t z_range_avl = {
        struct avl_node *avl_root = 0
        int (*)() avl_compar = zfs_range_compare
        size_t avl_offset = 0x8
        ulong_t avl_numnodes = 0
        size_t avl_size = 0x40
    }
    uint8_t z_unlinked = 0
    uint8_t z_atime_dirty = 0
    uint8_t z_zn_prefetch = 0x1
    uint8_t z_moved = 0
    uint_t z_blksz = 0x4000
    uint_t z_seq = 0x7a4653
    uint64_t z_mapcnt = 0
    uint64_t z_gen = 0xb64842
    uint64_t z_size = 0x20
    uint64_t [2] z_atime = [ 0x53a87c51, 0x1da98c84 ]
    uint64_t z_links = 0x12
    uint64_t z_pflags = 0x40800000042
    uint64_t z_uid = 0x3e8
    uint64_t z_gid = 0x44c
    mode_t z_mode = 0x41ff
    uint32_t z_sync_cnt = 0
    kmutex_t z_acl_lock = {
        void *[1] _opaque = [ 0 ]
    }
    zfs_acl_t *z_acl_cached = 0xffffff26faffc340
    list_node_t z_link_node = {
        struct list_node *list_next = 0xffffff26fafb02c8
        struct list_node *list_prev = 0xffffff26fafb04b8
    }
    sa_handle_t *z_sa_hdl = 0xffffff26faf5f6a0
    boolean_t z_is_sa = 0 (0)         
    uint32_t z_new_content = 0
}

I believe dmu_tx_assign() in zfs_rmdir failed because the dataset was out of quota. The system kept panicking in a loop with the same stack until we slightly increased the affected dataset's quota.
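
For context, the zfs_rmdir stack above matches the usual illumos tx-assign retry idiom (a simplified sketch, not the literal zfs_rmdir code):

top:
        tx = dmu_tx_create(zfsvfs->z_os);
        /* ... dmu_tx_hold_*() calls for the rmdir ... */
        error = dmu_tx_assign(tx, TXG_NOWAIT);
        if (error) {
                if (error == ERESTART) {
                        /*
                         * Wait for the next txg to open, then retry; this can
                         * spin indefinitely if a snapshot keeps the dataset's
                         * space from ever being freed.
                         */
                        dmu_tx_wait(tx);        /* -> txg_wait_open(), as in the stack */
                        dmu_tx_abort(tx);
                        goto top;
                }
                dmu_tx_abort(tx);
                /* EDQUOT/ENOSPC etc. are returned to the caller */
        }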

Exactly the same thing happened to a different dataset on another system several days ago: same panic stack, %rdx = 8, zc_zap == zc_leaf, zfs_readdir on a znode with z_links = 0, quota exhausted, and the same mitigation worked. The only difference is that the stack of the second thread waiting to remove something contains zfs_remove instead of zfs_rmdir.

History

#1

Updated by Arne Jansen almost 4 years ago

The root cause has been found. Chronology of the events:

1. we have an empty directory
2. the filesystem is nearly out of quota
3. a snapshot prevents any space from being freed
4. a remote client calls opendir via nfs4
5. the same client removes the directory
6. the client proceeds with readdir --> boom.

During 4, the client obtains an FH for the directory.
During 5, the directory entry gets removed, the inode is added to the unlinked set, and z_unlinked is set.
Then the inode's contents are deleted (dmu_free_long_range).
Deletion of the inode itself fails because the fs ran out of space at exactly this moment. The inode is left in the unlinked set, and the znode is freed.
During 6, the client's readdir instantiates a new znode for the (empty) inode, this time with z_unlinked unset.
During the zap read, the contents of the empty file are interpreted as a fatzap, which leads to the crash.

I propose two changes as a remedy (sketched below):
1. when instantiating a znode, set z_unlinked when z_links == 0
2. mark the deletion of the inode as netfree (in zfs_rmnode).
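
A rough sketch of what the two changes might look like (the placement in zfs_znode_alloc and zfs_rmnode is my assumption, and the tx holds are abbreviated):

/*
 * Change 1: hypothetical placement in zfs_znode_alloc(), after the SA bulk
 * lookup has filled in the on-disk link count -- a znode with no links must
 * be treated as unlinked.
 */
zp->z_links = links;
zp->z_unlinked = (links == 0);

/*
 * Change 2: in zfs_rmnode(), mark the tx that destroys the object as a net
 * space gain, so dmu_tx_assign() is not refused while the dataset is over
 * quota.
 */
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
/* ... further holds for the object being destroyed ... */
dmu_tx_mark_netfree(tx);
error = dmu_tx_assign(tx, TXG_WAIT);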
