Bug #3875
panic in zfs_root() after failed rollback
Description
A system panicked with the following stack:
ffffff01ecedf880 dnode_hold_impl+0x90(0, 4, 1, fffffffff7d633c1, ffffff01ecedf8c8)
ffffff01ecedf8b0 dnode_hold+0x28(0, 4, fffffffff7d633c1, ffffff01ecedf8c8)
ffffff01ecedf910 dmu_bonus_hold+0x37(0, 4, 0, ffffff01ecedf988)
ffffff01ecedf940 sa_buf_hold+0x1d(0, 4, 0, ffffff01ecedf988)
ffffff01ecedf9e0 zfs_zget+0x64(ffffff431a72a500, 4, ffffff01ecedfa08)
ffffff01ecedfa40 zfs_root+0x78(ffffff431a87db78, ffffff01ecedfa98)
ffffff01ecedfa80 fsop_root+0x2d(ffffff431a87db78, ffffff01ecedfa98)
ffffff01ecedfad0 traverse+0x87(ffffff01ecedfcc8)
ffffff01ecedfd30 lookuppnvp+0x229(ffffff01ecedfe50, ffffff01ecedfe30, 1, 0, 0, ffffff42e2b6d840, ffffff42e2b6d840, ffffff437f9083b0)
ffffff01ecedfdd0 lookuppnatcred+0x15e(ffffff01ecedfe50, ffffff01ecedfe30, 1, 0, 0, 0, ffffff437f9083b0)
ffffff01ecedfe20 lookuppn+0x38(ffffff01ecedfe50, ffffff01ecedfe30, 1, 0, 0)
ffffff01ecedfec0 resolvepath+0x86(8045580, 8045980, 400)
ffffff01ecedff10 _sys_sysenter_post_swapgs+0x149()
The proximate cause is that z_os is NULL: note that the objset pointer (the first argument) is 0 all the way from sa_buf_hold() down to dnode_hold_impl(). We are looking at:
> ffffff431a72a500::print zfsvfs_t z_vfs | ::print vfs_t vfs_mntpt | ::print refstr_t rs_string | /s
0xffffff4317f2f290:     /zones/d4658531-a2ae-4dce-9382-54b9a0d5fa82
> ffffff01ecedfe50::print struct pathname
{
    pn_buf = 0xffffff4373a91b80 "/zones/d4658531-a2ae-4dce-9382-54b9a0d5fa82"
    pn_path = 0xffffff4373a91bab ""
    pn_pathlen = 0
    pn_bufsize = 0x400
}
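For context, zfs_root() protects itself only with ZFS_ENTER(), which turns callers away solely on the basis of z_unmounted; nothing ever checks z_os for NULL. A simplified sketch, paraphrased from the illumos source of this era (not a verbatim quote):

    /* Paraphrased sketch, not verbatim source. */
    #define ZFS_ENTER(zfsvfs) \
    { \
            rrw_enter(&(zfsvfs)->z_teardown_lock, RW_READER, FTAG); \
            if ((zfsvfs)->z_unmounted) { \
                    ZFS_EXIT(zfsvfs); \
                    return (EIO); \
            } \
    }

    static int
    zfs_root(vfs_t *vfsp, vnode_t **vpp)
    {
            zfsvfs_t *zfsvfs = (zfsvfs_t *)vfsp->vfs_data;
            znode_t *rootzp;
            int error;

            ZFS_ENTER(zfsvfs);      /* blocks while rollback holds z_teardown_lock */

            /* zfs_zget() hands zfsvfs->z_os (here NULL) to sa_buf_hold() */
            error = zfs_zget(zfsvfs, zfsvfs->z_root, &rootzp);
            if (error == 0)
                    *vpp = ZTOV(rootzp);

            ZFS_EXIT(zfsvfs);
            return (error);
    }

This matches the panicking stack above: ZFS_ENTER() succeeds because z_unmounted was never set, and the NULL z_os flows down through zfs_zget() into dnode_hold_impl().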
The broader context here is that dozens of zones are being started up. The relevant ZFS activity consists of two general actions: rolling back each zone's filesystem datasets to a known snapshot, and checking each zone's root against all other zones by calling resolvepath() on them, as seen in zoneadm.c's crosscheck_zonepaths(). The threads blocked in zfs_root() are looking at:
> ::stacks -a -c zfs_root | ::findstack -v ! grep zfs_root | awk '{ print $2 }' | sed -e 's/.*(//' -e 's/,.*//' | sort | uniq
ffffff431a87d838
ffffff431a87db78
ffffff431ac435d0
> ffffff431a87d838::print zfsvfs_t z_vfs->vfs_vnodecovered | ::vnode2path
/zones/d928d767-3b66-43a2-916d-62afe6a13f92/cores
> ffffff431a87db78::print zfsvfs_t z_vfs->vfs_vnodecovered | ::vnode2path
/zones/d4658531-a2ae-4dce-9382-54b9a0d5fa82/cores
> ffffff431ac435d0::print zfsvfs_t z_vfs->vfs_vnodecovered | ::vnode2path
/zones/ee4d605a-1de0-4ecb-bd26-4c76ca72772e/cores
So we have a number of threads blocked trying to resolvepath() the very path on which we panicked. The thread we actually raced with is:
> 0t10431::pid2proc | ::walk thread | ::findstack -v
stack pointer for thread ffffff4360ce1b80: ffffff01ec6f1940
ffffff01ec6f1990 bitset_find+0x6a(fffffffffbc7e018)
ffffff01ec6f19d0 cpu_wakeup_mwait+0x70(ffffff42ed090080, 0)
ffffff01ec6f1a40 setbackdq+0x200(ffffff42ed09eb00)
ffffff01ec6f1a80 rwst_enter_common+0x2c5(fffffffff78dd92e, ffffffff00000000, 1fbcf4d98)
ffffff01ec6f1ac0 0xffffff01ec6f1b50()
ffffff01ec6f1c90 socket_vop_getattr+0x56(ffffff4358c84a80, ffffff01ec6f1db0, 0, ffffff43128860e0, 0)
ffffff01ec6f1d10 fop_getattr+0xa8(ffffff4358c84a80, ffffff01ec6f1db0, 0, ffffff43128860e0, 0)
ffffff01ec6f1e80 cstat64_32+0x4a(ffffff4358c84a80, 8045a00, 0, ffffff43128860e0)
ffffff01ec6f1ec0 fstat64_32+0x43(2, 8045a00)
ffffff01ec6f1f10 _sys_sysenter_post_swapgs+0x149()
This thread is part of the process doing a rollback of our toxic dataset, and in userland is at:
> 0t10431::pid2proc | ::walk thread | ::print kthread_t t_lwp->lwp_regs | ::print struct regs r_rip
r_rip = 0xfeef7ae3
From looking at a similar process (since our dump does not have this process's pages in it), we are at:
> $C
08046b98 libc_hwcap1.so.1`fstat64+0xc(2, 8046bb0, fefc9d86, fef80183, fefe6e90, 4)
08046c58 libc_hwcap1.so.1`isseekable+0x4d(80770a0, fefd2a7a, fef90300, 11, 80470c0, fef90348)
08046c88 libc_hwcap1.so.1`_setbufend+0x37(80770a0, fef6db58, feffb0a4, fef80086, 8047214, fef901b5)
08046d58 libc_hwcap1.so.1`_findbuf+0x138(80770a0, fefd04f4, feffc480, fef80086, feffb0a4, fefd04cf)
08047ad8 libc_hwcap1.so.1`_ndoprnt+0x55(8064d57, 8047b28, 80770a0, 0, 8047b10, fefc37a0)
08047b18 libc_hwcap1.so.1`fprintf+0xb0(80770a0, 8064d57, 8064e46, 8047dd4, 8047b50, feffb0a4)
08047c78 zfs_do_rollback+0xe2(1, 8047d04, 8076860, 801, 0, 3)
08047cc8 main+0x212(8060dca, feffb0a4, 8047cf4, 805558b, 2, 8047d00)
08047cf4 _start+0x83(2, 8047dd0, 8047dd4, 0, 8047ddd, 8047df1)
> 0xfeef7ae3::dis
libc_hwcap1.so.1`syscall:        popl   %edx
libc_hwcap1.so.1`syscall+1:      popl   %eax
libc_hwcap1.so.1`syscall+2:      pushl  %edx
libc_hwcap1.so.1`syscall+3:      call   +0x0    <libc_hwcap1.so.1`syscall+8>
libc_hwcap1.so.1`syscall+8:      popl   %edx
libc_hwcap1.so.1`syscall+9:      movl   %esp,%ecx
libc_hwcap1.so.1`syscall+0xb:    addl   $0xb,%edx
libc_hwcap1.so.1`syscall+0x11:   sysenter
libc_hwcap1.so.1`syscall+0x13:   movl   (%esp),%edx
libc_hwcap1.so.1`syscall+0x16:   pushl  %edx
libc_hwcap1.so.1`syscall+0x17:   jb     -0x9bb0d <libc_hwcap1.so.1`__cerror>
libc_hwcap1.so.1`syscall+0x1d:   ret
0xfeef7aee:                      nop
libc_hwcap1.so.1`__systemcall:   popl   %edx
libc_hwcap1.so.1`__systemcall+1: popl   %ecx
libc_hwcap1.so.1`__systemcall+2: popl   %eax
libc_hwcap1.so.1`__systemcall+3: pushl  %edx
libc_hwcap1.so.1`__systemcall+4: int    $0x91
libc_hwcap1.so.1`__systemcall+6: jae    +0xf    <libc_hwcap1.so.1`__systemcall+0x17>
>
So we are in the midst of trying to fprintf() to stderr after returning an error from the ZFS rollback ioctl. We therefore know that the ioctl failed. Looking at zfs_resume_fs(), the only way for z_os to be NULL is for us to fail at the top of this function. If we do so, we will:
- clear z_os
- drop our locks to allow others in ZFS_ENTER() to proceed
- return an error that is propagated to userland
This is entirely consistent with everything that we see, so we presume we ended up in this error path. Exactly how we got here is less clear, but we most likely got EBUSY from dmu_objset_own() somehow.
The problem here is that if we ever end up in this error path, we drop the locks protecting access to the zfsvfs_t prior to forcibly unmounting the filesystem. Because z_os is NULL, any thread that had already picked up the zfsvfs_t and was sitting in ZFS_ENTER() when we dropped our locks in zfs_resume_fs() will now acquire the lock, attempt to use z_os, and panic.
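To make the ordering concrete, the error path has roughly this shape (again paraphrased from the illumos source of this era, not a verbatim quote):

    int
    zfs_resume_fs(zfsvfs_t *zfsvfs, const char *osname)
    {
            int err;

            ASSERT(RRW_WRITE_HELD(&zfsvfs->z_teardown_lock));
            ASSERT(RW_WRITE_HELD(&zfsvfs->z_teardown_inactive_lock));

            err = dmu_objset_own(osname, DMU_OST_ZFS, B_FALSE, zfsvfs,
                &zfsvfs->z_os);
            if (err) {
                    zfsvfs->z_os = NULL;    /* (1) clear z_os */
            } else {
                    /* ... re-setup SA, re-zget held znodes, etc. ... */
            }

            /* (2) drop the locks: anyone blocked in ZFS_ENTER() proceeds */
            rw_exit(&zfsvfs->z_teardown_inactive_lock);
            rrw_exit(&zfsvfs->z_teardown_lock, FTAG);

            if (err) {
                    /*
                     * (3) only now try to force-unmount; a racing
                     * ZFS_ENTER() caller can dereference the NULL z_os
                     * before the unmount takes effect.
                     */
                    if (vn_vfswlock(zfsvfs->z_vfs->vfs_vnodecovered) == 0)
                            (void) dounmount(zfsvfs->z_vfs, MS_FORCE, CRED());
            }
            return (err);   /* (4) propagated to userland */
    }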
There are two plausible solutions here: one is to forcibly unmount the filesystem while still holding the locks. It is difficult to know whether this is safe to do. The other is simply to set z_unmounted prior to dropping the locks (a minimal sketch follows). The zfs_umount() path does not examine this flag and will happily allow unmounting to proceed asynchronously, while everyone in ZFS_ENTER() will see z_unmounted and fail. There may be an even better solution that would avoid the seemingly spurious vfsop failures this entails, but it's not clear what that is. In any case, the rollback itself is going to fail.
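A minimal sketch of that second option, assuming flagging the zfsvfs_t in the error branch suffices (hypothetical, and as the updates below show, ultimately unworkable):

    err = dmu_objset_own(osname, DMU_OST_ZFS, B_FALSE, zfsvfs,
        &zfsvfs->z_os);
    if (err) {
            zfsvfs->z_os = NULL;
            /*
             * Hypothetical change: make every subsequent ZFS_ENTER()
             * caller fail with EIO instead of dereferencing NULL.
             */
            zfsvfs->z_unmounted = B_TRUE;
    }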
Updated by Keith Wesolowski over 7 years ago
One possible way to reproduce this:
Part 1 (race exploder):
dtrace -wn 'fbt::zfs_resume_fs:entry { self->interest = 1; }' \
    -n 'fbt::zfs_suspend_fs:entry { self->interest = 1; }' \
    -n 'fbt::dsl_dataset_disown:entry/self->interest/ { print(args[0]); print(args[0]->ds_owner); self->ds = args[0]; }' \
    -n 'fbt::dsl_dataset_disown:return/self->interest/ { print(self->ds); print(self->ds->ds_owner); }' \
    -n 'fbt::zfs_suspend_fs:return { self->interest = 0; }' \
    -n 'fbt::zfs_domount:entry { self->interest = 1; }' \
    -n 'fbt::dsl_dataset_tryown:entry/self->interest/ { stack(); print(args[0]); print(args[0]->ds_owner); chill(200000000); print(args[0]->ds_owner); }' \
    -n 'fbt::zfs_domount:return { self->interest = 0; }'
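The chill(200000000) in the dsl_dataset_tryown:entry probe is the crux: it spins for 200ms in probe context (chill() is a destructive action, hence the -w flag), vastly widening the window in which the mount's attempt to own the dataset can interleave with the rollback's disown/re-own cycle. The print() calls show ds_owner before and after the delay, so a successful interleaving is visible in the output.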
Part 2 (rollbacks):
while :; do zfs rollback zones/foo@init; done &
Part 3 (race partner):
while :; do if mount -F zfs zones/foo /mnt 2>/dev/null; then echo got it; break; fi; done &
Part 4 (victim):
while :; do /var/tmp/resolvepath /zones/foo & /var/tmp/resolvepath /zones/foo & /var/tmp/resolvepath /zones/foo; done
Where the victim is the trivial program:
#include <unistd.h>
#include <limits.h>

int
main(int argc, char *argv[])
{
	char buf[PATH_MAX];

	return (resolvepath(argv[1], buf, PATH_MAX));
}
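For completeness, it can be built and put where the loop above expects it with something like the following (the source file name resolvepath.c is an assumption):

    cc -o /var/tmp/resolvepath resolvepath.c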
Updated by Keith Wesolowski over 7 years ago
Unfortunately, the unmount path does in fact use ZFS_ENTER(), specifically in:
  8  63579          zfsctl_root_lookup:return
              zfs`zfsctl_umount_snapshots+0x73
              zfs`zfs_umount+0x8f
              genunix`fsop_unmount+0x1b
              genunix`dounmount+0x57
              genunix`umount2_engine+0x96
              genunix`umount2+0x163
              unix`_sys_sysenter_post_swapgs+0x149
So the approach of simply setting z_unmounted and then dropping the locks in zfs_resume_fs()'s error path is unworkable.
Updated by Keith Wesolowski over 7 years ago
Attempting to force-unmount the stale filesystem doesn't work, so avoiding the dmu_objset_own() EBUSY is of little value:
 18  35523            vn_vfswlock:return
              zfs`zfs_resume_fs+0x91
              zfs`zfs_ioc_rollback+0x76
              zfs`zfsdev_ioctl+0x4a7
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`_sys_sysenter_post_swapgs+0x149
        16
We're getting EBUSY from vn_vfswlock() (the trailing 16 above is its traced return value; EBUSY is 16) and never even calling dounmount().
Updated by Robert Mustacchi over 7 years ago
- Status changed from New to Resolved
- Assignee set to Keith Wesolowski
- % Done changed from 0 to 100
Resolved in 91948b51b8e978ddc88a36b2bc3ae83c20cdc9aa.
Updated by Roman Strashkin almost 3 years ago
- Related to Bug #9324: zfs-recv can be broken by some ZFS-ioctls added