Bug #6779
kernel panic in zfs_readdir shortly after mount
Status: Closed
% Done: 100%
Description
I've got two similar crashes from two different systems. I looked at #5833 and #6510, but this appears to be different from those two.
The panic messages:
Feb 6 19:25:56 RXDATA: panic[cpu1]/thread=ffffff237d4f93a0: BAD TRAP: type=e (#pf Page fault) rp=ffffff0116da3d60 addr=20 occurred in module "zfs" due to a NULL pointer dereference
Feb 6 19:25:56 RXDATA: nfsd: #pf Page fault
Feb 6 19:25:56 RXDATA: Bad kernel fault at addr=0x20
Feb 6 19:25:56 RXDATA: pid=5969, pc=0xfffffffff7a18575, sp=0xffffff0116da3e50, eflags=0x10206
Feb 6 19:25:56 RXDATA: cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 1406f8<smep,osxsav,xmme,fxsr,pge,mce,pae,pse,de>
Feb 6 19:25:56 RXDATA: cr2: 20cr3: 20419cb000cr8: 0
Feb 6 19:25:56 RXDATA: rdi: ffffff26eea4baa0 rsi: f6fbb2f000000000 rdx: 8
Feb 6 19:25:56 RXDATA: rcx: fb484a93  r8: ffffff0116da3ec0  r9: 0
Feb 6 19:25:56 RXDATA: rax: fb484a98 rbx: ffffff0116da41b0 rbp: ffffff0116da3ea0
Feb 6 19:25:57 RXDATA: r10: fffffffffb854378 r11: 0 r12: 80000
Feb 6 19:25:57 RXDATA: r13: 0 r14: ffffff0116da3ec0 r15: ffffff0116da4090
Feb 6 19:25:57 RXDATA: fsb: 0 gsb: ffffff23720bf580 ds: 4b
Feb 6 19:25:57 RXDATA: es: 4b fs: 0 gs: 1c3
Feb 6 19:25:57 RXDATA: trp: e err: 0 rip: fffffffff7a18575
Feb 6 19:25:57 RXDATA: cs: 30 rfl: 10206 rsp: ffffff0116da3e50
Feb 6 19:25:57 RXDATA: ss: 38
Feb 6 19:25:57 RXDATA: ffffff0116da3c40 unix:die+df ()
Feb 6 19:25:57 RXDATA: ffffff0116da3d50 unix:trap+dc0 ()
Feb 6 19:25:57 RXDATA: ffffff0116da3d60 unix:cmntrap+e6 ()
Feb 6 19:25:57 RXDATA: ffffff0116da3ea0 zfs:zap_leaf_lookup_closest+45 ()
Feb 6 19:25:57 RXDATA: ffffff0116da3f40 zfs:fzap_cursor_retrieve+bb ()
Feb 6 19:25:57 RXDATA: ffffff0116da3fe0 zfs:zap_cursor_retrieve+11e ()
Feb 6 19:25:57 RXDATA: ffffff0116da4260 zfs:zfs_readdir+33d ()
Feb 6 19:25:57 RXDATA: ffffff0116da42f0 genunix:fop_readdir+6b ()
Feb 6 19:25:57 RXDATA: ffffff0116da46c0 nfssrv:rfs4_op_readdir+68b ()
Feb 6 19:25:57 RXDATA: ffffff0116da4850 nfssrv:rfs4_compound+1e9 ()
Feb 6 19:25:58 RXDATA: ffffff0116da4900 nfssrv:rfs4_dispatch+145 ()
Feb 6 19:25:58 RXDATA: ffffff0116da4c20 nfssrv:common_dispatch+765 ()
Feb 6 19:25:58 RXDATA: ffffff0116da4c40 nfssrv:rfs_dispatch+2d ()
Feb 6 19:25:58 RXDATA: ffffff0116da4d20 rpcmod:svc_getreq+1c1 ()
Feb 6 19:25:58 RXDATA: ffffff0116da4d90 rpcmod:svc_run+e0 ()
Feb 6 19:25:58 RXDATA: ffffff0116da4dd0 rpcmod:svc_do_run+8e ()
Feb 6 19:25:58 RXDATA: ffffff0116da4ec0 nfs:nfssys+f1 ()
Feb 6 19:25:58 RXDATA: ffffff0116da4f10 unix:brand_sys_sysenter+1c9 ()
> zap_leaf_lookup_closest::dis
zap_leaf_lookup_closest:        pushq  %rbp
zap_leaf_lookup_closest+1:      movq   %rcx,%r8
zap_leaf_lookup_closest+4:      movq   %rsp,%rbp
zap_leaf_lookup_closest+7:      subq   $0x50,%rsp
zap_leaf_lookup_closest+0xb:    movl   0x50(%rdi),%eax
zap_leaf_lookup_closest+0xe:    movq   %rdx,-0x18(%rbp)
zap_leaf_lookup_closest+0x12:   movl   %edx,-0x4c(%rbp)
zap_leaf_lookup_closest+0x15:   movq   0x58(%rdi),%rdx
zap_leaf_lookup_closest+0x19:   movq   %rbx,-0x48(%rbp)
zap_leaf_lookup_closest+0x1d:   movq   %r12,-0x40(%rbp)
zap_leaf_lookup_closest+0x21:   movq   %rcx,-0x20(%rbp)
zap_leaf_lookup_closest+0x25:   leal   -0x5(%rax),%ecx
zap_leaf_lookup_closest+0x28:   movq   %r13,-0x38(%rbp)
zap_leaf_lookup_closest+0x2c:   movq   %r14,-0x30(%rbp)
zap_leaf_lookup_closest+0x30:   movl   $0x1,%r12d
zap_leaf_lookup_closest+0x36:   movq   %r15,-0x28(%rbp)
zap_leaf_lookup_closest+0x3a:   movq   %rsi,-0x10(%rbp)
zap_leaf_lookup_closest+0x3e:   shll   %cl,%r12d
zap_leaf_lookup_closest+0x41:   movq   %rdi,-0x8(%rbp)
zap_leaf_lookup_closest+0x45:   movq   0x18(%rdx),%r9
From the source code, this should be the call to zap_leaf_phys(l), which fits with %rdx being 8 in the panic buffer.
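To illustrate why the fault lands at address 0x20: zap_leaf_phys() effectively dereferences l->l_dbuf->db_data. Below is a minimal, compilable sketch with the layouts reduced to what the disassembly shows (offsets 0x50, 0x58 and 0x18 come from the instructions at +0xb, +0x15 and +0x45); this is not a verbatim copy of the illumos headers.

#include <stdint.h>

/* Reduced layouts, reconstructed from the disassembly above. */
typedef struct dmu_buf {
	uint64_t db_object;	/* 0x00 */
	uint64_t db_offset;	/* 0x08 */
	uint64_t db_size;	/* 0x10 */
	void	*db_data;	/* 0x18 <- movq 0x18(%rdx),%r9 */
} dmu_buf_t;

typedef struct zap_leaf_phys zap_leaf_phys_t;	/* contents not needed here */

typedef struct zap_leaf {
	char		l_pad[0x50];	/* l_dbu, l_rwlock, l_blkid, ... */
	int		l_bs;		/* 0x50 <- movl 0x50(%rdi),%eax */
	dmu_buf_t	*l_dbuf;	/* 0x58 <- movq 0x58(%rdi),%rdx */
} zap_leaf_t;

/*
 * Equivalent of zap_leaf_phys(l): it dereferences l->l_dbuf->db_data.
 * With l->l_dbuf == 8 (see the l_dbuf value printed below), the load
 * touches 8 + 0x18 = 0x20, matching addr=20 / cr2: 20 in the panic.
 */
static inline zap_leaf_phys_t *
zap_leaf_phys(zap_leaf_t *l)
{
	return (l->l_dbuf->db_data);
}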
> ffffff26eea4baa0::print -t zap_leaf_t l_dbuf
dmu_buf_t *l_dbuf = 8
In fact, the whole thing that acts like a zap_leaf_t looks more like a zap_t (see dbu_evict_func):
> ffffff26eea4baa0::print -t zap_leaf_t
zap_leaf_t {
    dmu_buf_user_t l_dbu = {
        taskq_ent_t dbu_tqent = {
            struct taskq_ent *tqent_next = 0
            struct taskq_ent *tqent_prev = 0
            task_func_t *tqent_func = 0
            void *tqent_arg = 0
            union tqent_un = {
                taskq_bucket_t *tqent_bucket = 0
                uintptr_t tqent_flags = 0
            }
            kthread_t *tqent_thread = 0
            kcondvar_t tqent_cv = {
                ushort_t _opaque = 0
            }
        }
        dmu_buf_evict_func_t *dbu_evict_func = zap_evict
    }
    krwlock_t l_rwlock = {
        void *[1] _opaque = [ 0xffffff26ae98e708 ]
    }
    uint64_t l_blkid = 0x4e99f
    int l_bs = 0xfb484a98
    dmu_buf_t *l_dbuf = 8
}
> ffffff26eea4baa0::print -t zap_t
zap_t {
    dmu_buf_user_t zap_dbu = {
        taskq_ent_t dbu_tqent = {
            struct taskq_ent *tqent_next = 0
            struct taskq_ent *tqent_prev = 0
            task_func_t *tqent_func = 0
            void *tqent_arg = 0
            union tqent_un = {
                taskq_bucket_t *tqent_bucket = 0
                uintptr_t tqent_flags = 0
            }
            kthread_t *tqent_thread = 0
            kcondvar_t tqent_cv = {
                ushort_t _opaque = 0
            }
        }
        dmu_buf_evict_func_t *dbu_evict_func = zap_evict
    }
    objset_t *zap_objset = 0xffffff26ae98e708
    uint64_t zap_object = 0x4e99f
    struct dmu_buf *zap_dbuf = 0xffffff26fb484a98
    krwlock_t zap_rwlock = {
        void *[1] _opaque = [ 8 ]
    }
    boolean_t zap_ismicro = 0 (0)
    int zap_normflags = 0
    uint64_t zap_salt = 0
    union zap_u = {
        struct zap_fat = {
            kmutex_t zap_num_entries_mtx = {
                void *[1] _opaque = [ 0 ]
            }
            int zap_block_shift = 0xe
        }
        struct zap_micro = {
            int16_t zap_num_entries = 0
            int16_t zap_num_chunks = 0
            int16_t zap_alloc_next = 0
            avl_tree_t zap_avl = {
                struct avl_node *avl_root = 0xe
                int (*)() avl_compar = 0
                size_t avl_offset = 0
                ulong_t avl_numnodes = 0
                size_t avl_size = 0
            }
        }
    }
}
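The two dumps line up field for field, which suggests the buffer being used as a zap_leaf_t really is a zap_t. A rough sketch of the overlap, with the layouts reduced to the fields visible above (offsets are inferred from the dumps and disassembly, not taken from the headers; the *_view type names are mine):

#include <stdint.h>

/* Both types start with the same dmu_buf_user_t header (0x00-0x3f here). */
typedef struct dmu_buf_user { char dbu_pad[0x40]; } dmu_buf_user_t;

typedef struct zap_leaf_view {          /* how zap_leaf_lookup_closest reads it */
	dmu_buf_user_t	l_dbu;          /* 0x00: dbu_evict_func = zap_evict  */
	void		*l_rwlock;      /* 0x40: really zap_objset           */
	uint64_t	l_blkid;        /* 0x48: really zap_object (0x4e99f) */
	int		l_bs;           /* 0x50: low half of zap_dbuf        */
	void		*l_dbuf;        /* 0x58: really the zap_rwlock word  */
} zap_leaf_view_t;

typedef struct zap_view {               /* what the buffer actually is */
	dmu_buf_user_t	zap_dbu;        /* 0x00 */
	void		*zap_objset;    /* 0x40 = 0xffffff26ae98e708 */
	uint64_t	zap_object;     /* 0x48 = 0x4e99f            */
	void		*zap_dbuf;      /* 0x50 = 0xffffff26fb484a98 */
	void		*zap_rwlock;    /* 0x58 = 8                  */
} zap_view_t;

/* So the bogus "l_dbuf" of 8 is just the held zap_rwlock of a zap_t, and
 * dereferencing it at offset 0x18 yields the 0x20 fault shown above. */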
The cursor has the same pointer set for zc_zap and zc_leaf, which I think should not happen:
> ffffff0116da41b0::print -t zap_cursor_t
zap_cursor_t {
    objset_t *zc_objset = 0xffffff26ae98e700
    struct zap *zc_zap = 0xffffff26eea4baa0
    struct zap_leaf *zc_leaf = 0xffffff26eea4baa0
    uint64_t zc_zapobj = 0x4e99f
    uint64_t zc_serialized = 0x1f6fbb2f
    uint64_t zc_hash = 0xf6fbb2f000000000
    uint32_t zc_cd = 0x1
}
What also looks suspicious to me is that the znode passed to zfs_readdir has z_links = 0:
> 0xffffff26fafb2a40::print -t vnode_t v_data | ::print -t znode_t
znode_t {
    struct zfsvfs *z_zfsvfs = 0xffffff26aeb1c000
    vnode_t *z_vnode = 0xffffff26fafb2a40
    uint64_t z_id = 0x4e99f
    kmutex_t z_lock = {
        void *[1] _opaque = [ 0 ]
    }
    krwlock_t z_parent_lock = {
        void *[1] _opaque = [ 0 ]
    }
    krwlock_t z_name_lock = {
        void *[1] _opaque = [ 0 ]
    }
    zfs_dirlock_t *z_dirlocks = 0
    kmutex_t z_range_lock = {
        void *[1] _opaque = [ 0 ]
    }
    avl_tree_t z_range_avl = {
        struct avl_node *avl_root = 0
        int (*)() avl_compar = zfs_range_compare
        size_t avl_offset = 0x8
        ulong_t avl_numnodes = 0
        size_t avl_size = 0x40
    }
    uint8_t z_unlinked = 0
    uint8_t z_atime_dirty = 0
    uint8_t z_zn_prefetch = 0x1
    uint8_t z_moved = 0
    uint_t z_blksz = 0x4000
    uint_t z_seq = 0x7a4653
    uint64_t z_mapcnt = 0
    uint64_t z_gen = 0x1077b3f
    uint64_t z_size = 0x2
    uint64_t [2] z_atime = [ 0x533f8166, 0x1a780340 ]
    uint64_t z_links = 0
    uint64_t z_pflags = 0x40800000042
    uint64_t z_uid = 0x3e8
    uint64_t z_gid = 0x44c
    mode_t z_mode = 0x41ed
    uint32_t z_sync_cnt = 0
    kmutex_t z_acl_lock = {
        void *[1] _opaque = [ 0 ]
    }
    zfs_acl_t *z_acl_cached = 0xffffff26fb422ec0
    list_node_t z_link_node = {
        struct list_node *list_next = 0xffffff26aeb1c418
        struct list_node *list_prev = 0xffffff26fafb02c8
    }
    sa_handle_t *z_sa_hdl = 0xffffff26faf5f2e0
    boolean_t z_is_sa = 0 (0)
    uint32_t z_new_content = 0
}
In another thread, there is an rmdir operation waiting for the same dataset (but for a different directory):
> ffffff237aa750c0::findstack -v
stack pointer for thread ffffff237aa750c0: ffffff00f61bb170
[ ffffff00f61bb170 _resume_from_idle+0xf4() ]
  ffffff00f61bb1a0 swtch+0x141()
  ffffff00f61bb1e0 cv_wait+0x70(ffffff237cc4c8c6, ffffff237cc4c880)
  ffffff00f61bb230 txg_wait_open+0x83(ffffff237cc4c700, 59c32b)
  ffffff00f61bb280 dmu_tx_wait+0x16f(ffffff26faffc100)
  ffffff00f61bb360 zfs_rmdir+0x202(ffffff26fafb2c40, ffffff26dbfeb248, ffffff235363e340, ffffff264b5a9bc8, 0, 0)
  ffffff00f61bb3e0 fop_rmdir+0x63(ffffff26fafb2c40, ffffff26dbfeb248, ffffff235363e340, ffffff264b5a9bc8, 0, 0)
  ffffff00f61bb6c0 rfs4_op_remove+0x79d(ffffff266423fb38, ffffff26ebb914a8, ffffff00f61bbca0, ffffff00f61bb6e0)
  ffffff00f61bb850 rfs4_compound+0x1e9(ffffff00f61bbaa0, ffffff00f61bb870, 0, ffffff00f61bbca0, 0, ffffff00f61bb8a8)
  ffffff00f61bb900 rfs4_dispatch+0x21d(ffffffffc040f010, ffffff00f61bbca0, ffffff26ee71dc00, ffffff00f61bbaa0)
  ffffff00f61bbc20 common_dispatch+0x765(ffffff00f61bbca0, ffffff26ee71dc00, 2, 4, fffffffff8de0e66, ffffffffc040f060)
  ffffff00f61bbc40 rfs_dispatch+0x2d(ffffff00f61bbca0, ffffff26ee71dc00)
  ffffff00f61bbd20 svc_getreq+0x1c1(ffffff26ee71dc00, ffffff237971d960)
  ffffff00f61bbd90 svc_run+0xe0(ffffff26ee2ba6e8)
  ffffff00f61bbdd0 svc_do_run+0x8e(1)
  ffffff00f61bbec0 nfssys+0xf1(e, fec90fbc)
  ffffff00f61bbf10 _sys_sysenter_post_swapgs+0x149()
> ffffff26fafb2c40::print -t vnode_t v_data | ::print -t znode_t
znode_t {
    struct zfsvfs *z_zfsvfs = 0xffffff26aeb1c000
    vnode_t *z_vnode = 0xffffff26fafb2c40
    uint64_t z_id = 0x1f32c
    kmutex_t z_lock = {
        void *[1] _opaque = [ 0 ]
    }
    krwlock_t z_parent_lock = {
        void *[1] _opaque = [ 0 ]
    }
    krwlock_t z_name_lock = {
        void *[1] _opaque = [ 0 ]
    }
    zfs_dirlock_t *z_dirlocks = 0
    kmutex_t z_range_lock = {
        void *[1] _opaque = [ 0 ]
    }
    avl_tree_t z_range_avl = {
        struct avl_node *avl_root = 0
        int (*)() avl_compar = zfs_range_compare
        size_t avl_offset = 0x8
        ulong_t avl_numnodes = 0
        size_t avl_size = 0x40
    }
    uint8_t z_unlinked = 0
    uint8_t z_atime_dirty = 0
    uint8_t z_zn_prefetch = 0x1
    uint8_t z_moved = 0
    uint_t z_blksz = 0x4000
    uint_t z_seq = 0x7a4653
    uint64_t z_mapcnt = 0
    uint64_t z_gen = 0xb64842
    uint64_t z_size = 0x20
    uint64_t [2] z_atime = [ 0x53a87c51, 0x1da98c84 ]
    uint64_t z_links = 0x12
    uint64_t z_pflags = 0x40800000042
    uint64_t z_uid = 0x3e8
    uint64_t z_gid = 0x44c
    mode_t z_mode = 0x41ff
    uint32_t z_sync_cnt = 0
    kmutex_t z_acl_lock = {
        void *[1] _opaque = [ 0 ]
    }
    zfs_acl_t *z_acl_cached = 0xffffff26faffc340
    list_node_t z_link_node = {
        struct list_node *list_next = 0xffffff26fafb02c8
        struct list_node *list_prev = 0xffffff26fafb04b8
    }
    sa_handle_t *z_sa_hdl = 0xffffff26faf5f6a0
    boolean_t z_is_sa = 0 (0)
    uint32_t z_new_content = 0
}
I believe that dmu_tx_assign in zfs_rmdir failed because the dataset was out of quota. The system kept crashing in a loop with the same stack until we slightly increased the affected dataset's quota.
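For context, the dmu_tx_wait()/txg_wait_open() frames in the stack above match the usual tx-assignment retry pattern in the ZFS VOPs. A simplified sketch of that pattern (not the exact zfs_rmdir() code; variable names and the surrounding locking are assumed):

	/* simplified sketch of the common zfs_rmdir()/zfs_remove() tx loop */
top:
	tx = dmu_tx_create(zfsvfs->z_os);
	/* ... dmu_tx_hold_*() calls for the directory and the znode ... */
	error = dmu_tx_assign(tx, TXG_NOWAIT);
	if (error) {
		/* ... drop the directory locks taken above ... */
		if (error == ERESTART) {
			/* txg full / space temporarily short: wait and retry */
			dmu_tx_wait(tx);	/* <- the frame seen in the stack */
			dmu_tx_abort(tx);
			goto top;
		}
		dmu_tx_abort(tx);	/* hard failure, e.g. EDQUOT/ENOSPC */
		return (error);
	}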
Exactly the same thing happened to a different dataset on another system several days ago: same panic stack, %rdi = 8, zc_zap == zc_leaf, zfs_readdir on a znode with z_links = 0, quota exhausted, and the same mitigation strategy was used. The only difference is that the stack of the second thread waiting to remove something has zfs_remove in it instead of zfs_rmdir.
Updated by Arne Jansen about 6 years ago
The root cause has been found. Chronology of the events:
1. we have an empty directory
2. the filesystem is nearly out of quota
3. a snapshot prevents any space from being freed
4. a remote client calls opendir via nfs4
5. the same client removes the directory
6. the client proceeds with readdir --> boom.
During step 4, the client obtains an FH for the directory.
During step 5, the directory entry gets removed, the inode is added to the unlinked set, and z_unlinked is set.
Then the inode's contents are deleted (dmu_free_long_range).
Deletion of the inode itself fails because the fs ran out of space at exactly this moment. The inode is left in the unlinked set, and the znode is freed.
During step 6, the client's readdir instantiates a new znode for the (now empty) inode, this time with z_unlinked unset.
During the ZAP read, the contents of the empty file are interpreted as a fatzap, which leads to the crash.
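The "interpreted as a fatzap" step comes from how the ZAP code decides between microzap and fatzap: it looks at the block-type word at the start of the object's first block. A simplified sketch of that decision, paraphrasing mzap_open() (not the verbatim code):

	/*
	 * Sketch: the first 64-bit word of block 0 of the ZAP object is the
	 * block type. A freed/empty object does not contain ZBT_MICRO there,
	 * so the object is treated as a fatzap and its meaningless "header"
	 * drives the cursor, which is how zfs_readdir ends up in
	 * fzap_cursor_retrieve() on garbage.
	 */
	if (*(uint64_t *)db->db_data == ZBT_MICRO) {
		zap->zap_ismicro = B_TRUE;
		/* ... set up the microzap view ... */
	} else {
		zap->zap_ismicro = B_FALSE;
		/* ... read salt, normflags, block shift from the fatzap header ... */
	}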
I propose 2 changes as a remedy (roughly sketched below):
1. when instantiating a znode, set z_unlinked when z_links == 0
2. mark the deletion of the inode as netfree (in zfs_rmnode).
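A rough sketch of what the two changes could look like; placement, variable names, and the exact tx holds are assumptions on my part, not the final change:

	/* (1) In zfs_znode_alloc() (assumed placement): an inode whose on-disk
	 *     link count is zero is already on the unlinked set, so mark the
	 *     znode accordingly when it is instantiated. */
	zp->z_links = links;
	zp->z_unlinked = (links == 0);

	/* (2) In zfs_rmnode() (assumed placement): the tx that frees the object
	 *     only releases space, so mark it net-free; it can then no longer
	 *     fail with EDQUOT/ENOSPC on a full or over-quota dataset. */
	tx = dmu_tx_create(os);
	dmu_tx_hold_free(tx, zp->z_id, 0, DMU_OBJECT_END);
	dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
	dmu_tx_mark_netfree(tx);
	error = dmu_tx_assign(tx, TXG_WAIT);

With (1), code paths that check z_unlinked refuse to operate on the stale file handle instead of reading the freed contents; with (2), the final removal from the unlinked set cannot be stranded by a quota failure.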
Updated by Marcel Telka 7 months ago
- Category set to zfs - Zettabyte File System
- Status changed from New to In Progress
- Assignee set to Marcel Telka
Updated by Marcel Telka 7 months ago
- Status changed from In Progress to Pending RTI
Updated by Electric Monk 6 months ago
- Status changed from Pending RTI to Closed
- % Done changed from 0 to 100
git commit c2270953142194d5f260d803d60104fad12ab076
commit c2270953142194d5f260d803d60104fad12ab076
Author: Arne Jansen <jansen@webgods.de>
Date:   2022-02-04T15:10:10.000Z

    6779 kernel panic in zfs_readdir shortly after mount
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Marcel Telka <marcel@telka.sk>
    Approved by: Dan McDonald <danmcd@joyent.com>