Bug #2807

zfs: deadlock with ds_rwlock

Added by Vitaliy Gusev over 7 years ago. Updated about 7 years ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
-
Start date:
2012-05-31
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

The system can hang while obtaining dsl_dataset_t->ds_rwlock.

History

#1

Updated by Vitaliy Gusev over 7 years ago

It can be:

1.
Because ds_rwlock is taken on READ recursively (twice or more).
If another thread N2 tries to get this lock on WRITE while thread N1 holds
it on READ, thread N2 will wait for the lock. If thread N1 then tries to take the
lock on READ a second time (recursively), it will hang forever.

This is because ds_rwlock is a krwlock_t, which has a barrier for pending WRITE attempts: new readers block while a writer is waiting.
Stacks:
dsl_dataset_hold_ref:entry ds ffffff00c56885c0 rwlock 0
zfs`dsl_dataset_hold+0xa6
zfs`dmu_objset_hold+0x2f
zfs`zfs_ioc_objset_stats+0x30
zfs`zfsdev_ioctl+0x183
dsl_dataset_hold_ref:entry ds ffffff00c56885c0 rwlock 8
zfs`dsl_dataset_hold_obj+0x3d
zfs`dsl_dataset_space_written+0xce
zfs`dsl_dataset_stats+0x1a5
zfs`dmu_objset_stats+0x3f
zfs`zfs_ioc_objset_stats_impl+0x5a
zfs`zfs_ioc_objset_stats+0x42
zfs`zfsdev_ioctl+0x183
------
#2

Updated by Vitaliy Gusev over 7 years ago

Another pair of stacks showing a real deadlock:


  > ::pgrep zfs|::walk thread|::findstack
  stack pointer for thread ffffff00e5125800: ffffff00030b9a30
  [ ffffff00030b9a30 _resume_from_idle+0xf1() ]
    ffffff00030b9a60 swtch+0x1e6()
    ffffff00030b9b00 turnstile_block+0x854()
    ffffff00030b9b70 rw_enter_sleep+0x1c2()
    ffffff00030b9be0 dsl_dataset_clone_swap+0x95()
    ffffff00030b9c30 zfs_ioc_rollback+0xec()
    ffffff00030b9cb0 zfsdev_ioctl+0x183()
    ffffff00030b9cf0 cdev_ioctl+0x45()
    ffffff00030b9d30 spec_ioctl+0x5a()
    ffffff00030b9db0 fop_ioctl+0x7b()
    ffffff00030b9eb0 ioctl+0x18e()
    ffffff00030b9f00 _sys_sysenter_post_swapgs+0x237()

  stack pointer for thread ffffff00e65b7480: ffffff000288f890
  [ ffffff000288f890 _resume_from_idle+0xf1() ]
    ffffff000288f8c0 swtch+0x1e6()
    ffffff000288f8f0 cv_wait+0x7f()
    ffffff000288f950 dsl_dataset_hold_ref+0xaf()
    ffffff000288f9c0 dsl_dataset_hold+0xa6()
    ffffff000288fa10 dmu_objset_hold+0x2f()
    ffffff000288fa50 getzfsvfs+0x2a()
    ffffff000288fc30 zfs_ioc_recv+0x49d()
    ffffff000288fcb0 zfsdev_ioctl+0x183()
    ffffff000288fcf0 cdev_ioctl+0x45()
    ffffff000288fd30 spec_ioctl+0x5a()
    ffffff000288fdb0 fop_ioctl+0x7b()
    ffffff000288feb0 ioctl+0x18e()
    ffffff000288ff00 _sys_sysenter_post_swapgs+0x237()
> 

#3

Updated by Vitaliy Gusev over 7 years ago

2. The second problem:

    "zfs recv" can sleep (waiting for data from a fifo) while holding the rwlock on READ.

   a. "zfs recv" holds the lock on READ and sleeps.

   b. "zfs rollback" tries to get this lock on WRITE and goes into uninterruptible sleep.

   c. "zfs list" hangs (and cannot be killed).

In this case the system can be repaired by killing the "zfs recv" process, but really either "zfs rollback" should return EBUSY instead of sleeping, or "zfs list" should be killable.

Stack:
ffffff04faeb4c60 ffffff04eac54e18 ffffff07427adc70   1  59 ffffff054cd55cac
  PC: _resume_from_idle+0xf1    CMD: /usr/sbin/rrdaemon
  stack pointer for thread ffffff04faeb4c60: ffffff00201633e0
  [ ffffff00201633e0 _resume_from_idle+0xf1() ]
    swtch+0x145()
    cv_wait_sig_swap_core+0x174()
    cv_wait_sig_swap+0x18()
    fifo_read+0xca()
    fop_read+0x6b()
    vn_rdwr+0x17f()
    restore_read+0x6c()
    restore_write+0x55()
    dmu_recv_stream+0x7ab()
    zfs_ioc_recv+0x42d()
    zfsdev_ioctl+0x15e()
    cdev_ioctl+0x45()
    spec_ioctl+0x5a()
    fop_ioctl+0x7b()
    ioctl+0x18e()
    dtrace_systrace_syscall32+0x11a()
    _sys_sysenter_post_swapgs+0x149()

ffffff04faedc180 ffffff053e849918 ffffff05511f8340   1  59 ffffff04fb0a0688
  PC: _resume_from_idle+0xf1    CMD: zfs rollback -Rrf backup/vhosting-d@AutoSync:94_2011-09-22-23:35:07
  stack pointer for thread ffffff04faedc180: ffffff001fd199a0
  [ ffffff001fd199a0 _resume_from_idle+0xf1() ]
    swtch+0x145()
    turnstile_block+0x760()
    rw_enter_sleep+0x1a3()
    dsl_dataset_clone_swap+0x79()
    zfs_ioc_rollback+0x167()
    zfsdev_ioctl+0x15e()
    cdev_ioctl+0x45()
    spec_ioctl+0x5a()
    fop_ioctl+0x7b()
    ioctl+0x18e()
    dtrace_systrace_syscall32+0x11a()
    _sys_sysenter_post_swapgs+0x149()
ffffff04faf061a0 ffffff04fdcc1708 ffffff0551208ca0   1  59 ffffff04fb0a0690
  PC: _resume_from_idle+0xf1    CMD: zfs list -t all -r -Ho name,mounted,mountpoint,origin,canmount,type backup
  stack pointer for thread ffffff04faf061a0: ffffff001fd49990
  [ ffffff001fd49990 _resume_from_idle+0xf1() ]
    swtch+0x145()
    cv_wait+0x61()
    dsl_dataset_hold_ref+0x8a()
    dsl_dataset_hold+0xa6()
    dmu_objset_hold+0x2f()
    zfs_ioc_objset_stats+0x30()
    zfs_ioc_dataset_list_next+0x155()
    zfsdev_ioctl+0x15e()
    cdev_ioctl+0x45()
    spec_ioctl+0x5a()
    fop_ioctl+0x7b()
    ioctl+0x18e()
    dtrace_systrace_syscall32+0x11a()

#4

Updated by Vitaliy Gusev over 7 years ago

So the "1" case can occur if you just do:

zfs list {pool}/{snapshot}
zfs destroy {pool}/{dataset}

Stacks for this bug:

::pgrep zfs|::walk thread|::findstack
stack pointer for thread ffffff00b1013780: ffffff000269b940
[ ffffff000269b940 _resume_from_idle+0xf1() ]
ffffff000269b970 swtch+0x1e6()
ffffff000269b9a0 cv_wait+0x7f()
ffffff000269ba00 dsl_dataset_hold_ref+0xaf()
ffffff000269ba40 dsl_dataset_hold_obj+0x3d()
ffffff000269bae0 dsl_dataset_space_written+0xce()
ffffff000269bb70 dsl_dataset_stats+0x1a5()
ffffff000269bba0 dmu_objset_stats+0x3f()
ffffff000269bbf0 zfs_ioc_objset_stats_impl+0x5a()
ffffff000269bc30 zfs_ioc_objset_stats+0x69()
ffffff000269bcb0 zfsdev_ioctl+0x183()
ffffff000269bcf0 cdev_ioctl+0x45()
ffffff000269bd30 spec_ioctl+0x5a()
ffffff000269bdb0 fop_ioctl+0x7b()
ffffff000269beb0 ioctl+0x18e()
ffffff000269bf00 _sys_sysenter_post_swapgs+0x237()
stack pointer for thread ffffff00b326a860: ffffff0001a5c740
[ ffffff0001a5c740 _resume_from_idle+0xf1() ]
ffffff0001a5c770 swtch+0x1e6()
ffffff0001a5c810 turnstile_block+0x854()
ffffff0001a5c880 rw_enter_sleep+0x1c2()
ffffff0001a5c8b0 dsl_dataset_make_exclusive+0x43()
ffffff0001a5cbc0 dsl_dataset_destroy+0x253()
ffffff0001a5cc00 dmu_objset_destroy+0x42()
ffffff0001a5cc30 zfs_ioc_destroy+0x51()
ffffff0001a5ccb0 zfsdev_ioctl+0x183()
ffffff0001a5ccf0 cdev_ioctl+0x45()
ffffff0001a5cd30 spec_ioctl+0x5a()
ffffff0001a5cdb0 fop_ioctl+0x7b()
ffffff0001a5ceb0 ioctl+0x18e()
ffffff0001a5cf00 _sys_sysenter_post_swapgs+0x237()

#5

Updated by Vitaliy Gusev over 7 years ago

Correction to the previous message:

zfs list {pool}/{dataset}
zfs destroy {pool}/{dataset}
#6

Updated by Vitaliy Gusev over 7 years ago

Reproducer test for case 1:

mkfile 100m diskfile
zpool create testpool $PWD/diskfile

zfs create testpool/data

mkfile 1M /testpool/data/f1
zfs snapshot testpool/data@1

mkfile 20M /testpool/data/f1
zfs snapshot testpool/data@2

zfs send -i @1 testpool/data@2 > t2
zfs rollback -r testpool/data@1

cat t2 | zfs recv testpool/data & zfs rollback -r testpool/data@1

#7

Updated by Matthew Ahrens over 7 years ago

I believe that the deadlock mentioned in update #2 is caused by the zfs_ioc_recv() thread recursively holding the dataset that is being received into. Recursively holding a dataset (i.e. holding a dataset while it is already held by the current thread) is never allowed.

zfs_ioc_recv() first holds the dataset via dmu_recv_begin(), and then later via getzfsvfs(). One way to fix this would be to add a new routine, getzfsvfs_ds() which takes a dataset_t as the argument, rather than the name of the dataset.

#8

Updated by Vitaliy Gusev over 7 years ago

Here is another deadlock after introducing getzfsvfs_ds().

This deadlock is due to a race between ds->ds_rwlock and vfs->z_teardown_lock.

::pgrep zfs|::walk thread|::findstack

stack pointer for thread ffffff00c68cd000: ffffff0001f10a30
[ ffffff0001f10a30 _resume_from_idle+0xf1() ]
ffffff0001f10a60 swtch+0x1e6()
ffffff0001f10b00 turnstile_block+0x854()
ffffff0001f10b70 rw_enter_sleep+0x1c2()
ffffff0001f10be0 dsl_dataset_clone_swap+0x95()
ffffff0001f10c30 zfs_ioc_rollback+0x155()
ffffff0001f10cb0 zfsdev_ioctl+0x183()
ffffff0001f10cf0 cdev_ioctl+0x45()
ffffff0001f10d30 spec_ioctl+0x5a()
ffffff0001f10db0 fop_ioctl+0x7b()
ffffff0001f10eb0 ioctl+0x18e()
ffffff0001f10f00 _sys_sysenter_post_swapgs+0x237()
stack pointer for thread ffffff00da818b80: ffffff0001ee6900
[ ffffff0001ee6900 _resume_from_idle+0xf1() ]
ffffff0001ee6930 swtch+0x1e6()
ffffff0001ee6960 cv_wait+0x7f()
ffffff0001ee69a0 rrw_enter_write+0x76()
ffffff0001ee69d0 rrw_enter+0x22()
ffffff0001ee6a30 zfsvfs_teardown+0x3b()
ffffff0001ee6a50 zfs_suspend_fs+0x18()
ffffff0001ee6c30 zfs_ioc_recv+0x477()
ffffff0001ee6cb0 zfsdev_ioctl+0x183()
ffffff0001ee6cf0 cdev_ioctl+0x45()
ffffff0001ee6d30 spec_ioctl+0x5a()
ffffff0001ee6db0 fop_ioctl+0x7b()
ffffff0001ee6eb0 ioctl+0x18e()
ffffff0001ee6f00 _sys_sysenter_post_swapgs+0x237()
