Bug #9340

deadlock between db_rwlock and db_mtx via dnode_increase_indirection()

Added by Brad Lewis over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
2018-03-23
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

spa_sync() is waiting on:

> 131::findstack -v
stack pointer for thread 131: fffffd7fd9299c70
[ fffffd7fd9299c70 libc.so.1`__lwp_park+0x17() ]
  fffffd7fd9299cb0 libc.so.1`mutex_lock_impl+0x189(87146b0, 0)
  fffffd7fd9299ce0 libc.so.1`mutex_lock+0x45(87146b0)
  fffffd7fd9299d10 libzpool.so.1`zmutex_enter+0x57(87146a0)
  fffffd7fd9299d80 libzpool.so.1`dbuf_find+0xbb(4895140, 2, 0, 0)
  fffffd7fd9299e40 libzpool.so.1`dnode_increase_indirection+0x205(4785da0, 6f1f400)
  fffffd7fd9299ed0 libzpool.so.1`dnode_sync+0x6f3(4785da0, 6f1f400)
  fffffd7fd9299f10 libzpool.so.1`dmu_objset_sync_dnodes+0x99(4173980, 6f1f400)
  fffffd7fd9299f40 libzpool.so.1`sync_dnodes_task+0x31(2149380)
  fffffd7fd9299fb0 libcmdutils.so.1`utaskq_thread+0xe0(1d42340)
  fffffd7fd9299fe0 libc.so.1`_thrp_setup+0x8a(fffffd7ffead5a40)
  fffffd7fd9299ff0 libc.so.1`_lwp_start()

> 87146a0::print kmutex_t
{
    m_owner = 0x199

> 199::findstack -v
stack pointer for thread 199: fffffd7fd46bd4f0
[ fffffd7fd46bd4f0 libc.so.1`__lwp_park+0x17() ]
  fffffd7fd46bd530 libc.so.1`rw_rdlock_impl+0x204(3115328, 0)
  fffffd7fd46bd560 libc.so.1`pthread_rwlock_rdlock+0x45(3115328)
  fffffd7fd46bd5a0 libzpool.so.1`rw_enter+0x8a(3115318, 0)
  fffffd7fd46bd5f0 libzpool.so.1`dmu_buf_lock_parent+0x2d(87145f0, 0, fffffd7ffa90f4ed)
  fffffd7fd46bd660 libzpool.so.1`dbuf_read+0xf2(87145f0, 0, 9)
  fffffd7fd46bd6a0 libzpool.so.1`dmu_buf_will_dirty+0xf5(87145f0, 2021900)
  fffffd7fd46bd740 libzpool.so.1`dmu_write_impl+0xdf(83b4800, 1, 0, 38, 649e540, 2021900)
  fffffd7fd46bd7c0 libzpool.so.1`dmu_write+0x98(4895140, 2, 0, 38, 649e540, 2021900)
  fffffd7fd46bd900 ztest_replay_write+0x568(fffffd7fd46bdff0, 649e480, 0)
  fffffd7fd46bd980 ztest_write+0x125(fffffd7fd46bdff0, 2, 0, 38, fffffd7fd46bda00)
  fffffd7fd46bda90 ztest_io+0x12f(fffffd7fd46bdff0, 2, 0)
  fffffd7fd46bdfd0 ztest_dmu_object_alloc_free+0xc8(fffffd7fd46bdff0, 12)
  fffffd7fd46bfec0 ztest_dmu_objset_create_destroy+0x16f(4915e0, 12)
  fffffd7fd46bff50 ztest_execute+0x83(a, 420d70, 12)
  fffffd7fd46bffb0 ztest_thread+0xf4(12)
  fffffd7fd46bffe0 libc.so.1`_thrp_setup+0x8a(fffffd7ffeae5240)
  fffffd7fd46bfff0 libc.so.1`_lwp_start()

> 3115318::print krwlock_t
{
    rw_owner = 0x131

Thread 0x131, running dnode_increase_indirection(), holds the db_rwlock as writer and calls dbuf_find(), which blocks waiting for db_mtx. That db_mtx is held by thread 0x199, which has called dbuf_read(). dbuf_read() takes the db_mtx and then tries to take the parent's db_rwlock, which thread 0x131 already holds, so neither thread can make progress.

The problem is that dbuf_read and dnode_increase_indirection acquire these two locks in opposite orders: dbuf_read takes the db_mtx and then the parent's db_rwlock, while dnode_increase_indirection takes the db_rwlock and then the db_mtx. We can resolve the issue by declaring the order in dbuf_read to be correct and fixing the one in dnode_increase_indirection. Specifically, the fix is to acquire the child's mutex before acquiring the parent's rwlock in all cases.
