Bug #13937
13317 mismerged; leaked rwlock counts on bonus' parents ensue
Status: Open
Description
We've done a ZoL mis-merge yet again:
https://github.com/illumos/illumos-gate/commit/9704bf7fb82e71b685e194a967937ff03843e73a
see:
https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/fs/zfs/dbuf.c#L1119
There should be a call to dmu_buf_unlock_parent here. We've clobbered it by once again merging commits from ZoL out of order -- as usual, resulting in ZFS encryption and another feature breaking each other.
In this case, ZoL has changed this function considerably:
https://github.com/openzfs/zfs/blob/master/module/zfs/dbuf.c#L1465
But in all their versions of it, the locking was correct. Ours is not.
Whenever we hit a decryption issue that causes dbuf_read_verify_dnode_crypt to return non-zero (e.g. by hitting #13697 or #13795, both of which are also caused by bad merges from ZoL!), we leak rwlock counts here, eventually leading to a whole-system deadlock that the zfs deadman doesn't detect.
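For reference, a sketch of what the correct error path looks like, based on the OpenZFS version of dbuf_read_impl() linked above (the exact names and surrounding code are from memory and approximate, not a verbatim patch):

    /*
     * dbuf_read() takes the parent's db_rwlock via dmu_buf_lock_parent()
     * before calling dbuf_read_impl() and passes the lock type in as
     * 'dblt', so every early return out of dbuf_read_impl() has to drop
     * that lock again.
     */
    err = dbuf_read_verify_dnode_crypt(db, flags);
    if (err != 0)
            goto early_unlock;
    ...

    early_unlock:
            DB_DNODE_EXIT(db);
            mutex_exit(&db->db_mtx);
            dmu_buf_unlock_parent(db, dblt, tag);   /* the call our merge dropped */
            return (err);

Our copy of the function returns on the dbuf_read_verify_dnode_crypt() failure without ever reaching dmu_buf_unlock_parent(), which is exactly the rwlock-count leak described above.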
Related issues
Updated by Alex Wilson over 1 year ago
My notes from debugging this in a dump:
> ::status
debugging crash dump vmcore.0 (64-bit) from sdc-cb8031
operating system: 5.11 joyent_20210623T002805Z (i86pc)
git branch: eait-20210623
git rev: f12a9dd262eb31f6cd4765ca82ad60d64defee9c
image uuid: (not set)
panic message: NMI received
dump content: kernel pages only

> ::stacks -m zfs
THREAD           STATE    SOBJ                COUNT
fffffe473261f880 SLEEP    CV                   8152
                 swtch+0x133
                 cv_wait+0x68
                 dmu_tx_wait+0x6b
                 zfs_create+0x283
                 vnext_create+0xdb
                 inotify_fop_create+0x5f
                 vnext_create+0xdb
                 inotify_fop_create+0x5f
                 vhead_create+0x139
                 fop_create+0xcf
                 vn_createat+0x67a
                 vn_openat+0x192
                 copen+0x441
                 openat+0x29
                 lx_openat+0x96
                 lx_syscall_enter+0x1aa
                 sys_syscall+0x145

fffffdca12def780 SLEEP    CV                    601
                 swtch+0x133
                 cv_wait+0x68
                 dmu_tx_wait+0x6b
                 dmu_tx_assign+0x4b
                 zil_commit_itx_assign+0x3b
                 zil_commit_impl+0x26
                 zil_commit+0x3e
                 zfs_fsync+0xb7
                 fop_fsync+0x4e
                 fdsync+0x2c

fffffdce381890a0 SLEEP    CV                    136
                 swtch+0x133
                 cv_wait_sig+0x171
                 txg_wait_synced_impl+0x7c
                 txg_wait_synced_sig+0x10
                 dsl_sync_task_common+0x1c0
                 dsl_sync_task_sig+0x23
                 zcp_eval+0x655
                 dsl_destroy_snapshots_nvl+0x15a
                 zfs_ioc_destroy_snaps+0x8b
                 zfsdev_ioctl+0x505
                 cdev_ioctl+0x2b
                 spec_ioctl+0x45
                 fop_ioctl+0x5b
                 ioctl+0x153
Oh we are waiting alright. We are waiting for a txg sync!
> ::zfs_dbgmsg
...
slow spa_sync: started 34815 seconds ago, calls 6764
slow spa_sync: started 34820 seconds ago, calls 6765
slow spa_sync: started 34825 seconds ago, calls 6766
slow spa_sync: started 34830 seconds ago, calls 6767
slow spa_sync: started 34835 seconds ago, calls 6768
slow spa_sync: started 34840 seconds ago, calls 6769
Mmhmm. Waiting.
> ::stacks -c txg_sync_thread
THREAD           STATE    SOBJ                COUNT
fffffb04536dbc20 SLEEP    CV                      1
                 swtch+0x133
                 cv_wait+0x68
                 zio_wait+0x6b
                 dsl_pool_sync+0xc6
                 spa_sync_iterate_to_convergence+0xd0
                 spa_sync+0x2f6
                 txg_sync_thread+0x1fd
                 thread_start+0xb

> fffffb04536dbc20::findstack -v
stack pointer for thread fffffb04536dbc20 (txg_sync_thread()): fffffb04536db920
[ fffffb04536db920 _resume_from_idle+0x12b() ]
  fffffb04536db950 swtch+0x133()
  fffffb04536db990 cv_wait+0x68(fffffedf89031950, fffffedf89031948)
  fffffb04536db9d0 zio_wait+0x6b(fffffedf89031580)
  fffffb04536dba80 dsl_pool_sync+0xc6(fffffdc570940940, 3928b7)
  fffffb04536dbb00 spa_sync_iterate_to_convergence+0xd0(fffffdc61db70000, fffffdef2e609800)
  fffffb04536dbb60 spa_sync+0x2f6(fffffdc61db70000, 3928b7)
  fffffb04536dbc00 txg_sync_thread+0x1fd(fffffdc570940940)
  fffffb04536dbc10 thread_start+0xb()

> fffffedf89031580::zio
ADDRESS                 TYPE  STAGE            WAITER           TIME_ELAPSED
fffffedf89031580        NULL  CHECKSUM_VERIFY  fffffb04536dbc20 -
Alright, we're waiting on a zio. Good to know. Where are all those zios at
anyway?
> ::zio_state -r
ADDRESS                 TYPE  STAGE            WAITER           TIME_ELAPSED
fffffedf89031580        NULL  CHECKSUM_VERIFY  fffffb04536dbc20 -
fffffdd3791a29e0        WRITE ISSUE_ASYNC      -                -
fffffdc81f312640        WRITE ISSUE_ASYNC      -                -
fffffedf74aeb948        WRITE ISSUE_ASYNC      -                -
fffffedf74e540d0        WRITE ISSUE_ASYNC      -                -
fffffedf95f6c240        WRITE ISSUE_ASYNC      -                -
fffffedf9571aef0        WRITE ISSUE_ASYNC      -                -
fffffedf87a47128        WRITE ISSUE_ASYNC      -                -
fffffedf887bf1e8        WRITE READY            -                -
fffffdc61db90ca8        NULL  OPEN             -                -
fffffdc61db910d0        NULL  OPEN             -                -
fffffdc61db914f8        NULL  OPEN             -                -
fffffdc61db91920        NULL  OPEN             -                -
...
fffffdc61e2750e0        NULL  OPEN             -                -
fffffedf9666b188        NULL  OPEN             -                -
fffffedf9f3ae8c8        WRITE OPEN             -                -
fffffdc61e67a640        NULL  OPEN             -                -
fffffdc61e68ca38        NULL  OPEN             -                -
fffffdc61e737698        NULL  OPEN             -                -
fffffdc61e740258        NULL  OPEN             -                -
fffffdc61e743250        NULL  OPEN             -                -
fffffdc61e743678        NULL  OPEN             -                -
fffffdc61e74be10        NULL  OPEN             -                -
fffffddb26a5a188        NULL  OPEN             -                -
Alright. There's some zio there.
> ::taskq -n zio
ADDR             NAME                             ACT/THDS Q'ED  MAXQ INST
fffffdc61e764640 zio_null_issue                     0/  1    0  5950    -
fffffdc61e764ac0 zio_null_intr                      0/  1    0   424    -
fffffdc61e26f1b8 zio_read_issue                     0/  8    0    43    -
fffffdc61e764760 zio_read_intr_0                    0/ 12    0    46    -
fffffdc61e7649a0 zio_read_intr_1                    0/ 12    0    42    -
fffffdc61e764be0 zio_read_intr_2                    0/ 12    0    22    -
fffffdc61e764d00 zio_read_intr_3                    0/ 12    0    33    -
fffffdc5e61fdaa8 zio_read_intr_4                    0/ 12    0    38    -
fffffdc5e61fd2c8 zio_read_intr_5                    0/ 12    0    22    -
fffffdc5e61fd1a8 zio_read_intr_6                    0/ 12    0    34    -
fffffdc5e61fd088 zio_read_intr_7                    0/ 12    0    29    -
fffffdc61e26f638 zio_write_issue                    1/ 48    1 49025    -
fffffdc61e215bd0 zio_write_issue_high               0/  5    0    22    -
fffffdc61e215ab0 zio_write_intr                     0/  8    0   853    -
fffffdc61e215870 zio_write_intr_high                0/  5    0     8    -
fffffdc61e7640a0 zio_free_issue_0                   0/ 12    0     0    -
fffffdc61e215990 zio_free_issue_1                   0/ 12    0     0    -
fffffdc61e215510 zio_free_issue_2                   0/ 12    0     0    -
fffffdc61e2153f0 zio_free_issue_3                   0/ 12    0     0    -
fffffdc61e2152d0 zio_free_issue_4                   0/ 12    0     0    -
fffffdc61e26fe18 zio_free_issue_5                   0/ 12    0     0    -
fffffdc620906768 zio_free_issue_6                   0/ 12    0     0    -
fffffdc61e7641c0 zio_free_issue_7                   0/ 12    0     0    -
fffffdc61e7642e0 zio_free_intr                      0/  1    0     0    -
fffffdc61e764400 zio_claim_issue                    0/  1    0     0    -
fffffdc61e764520 zio_claim_intr                     0/  1    0     0    -
fffffdc6209061c8 zio_ioctl_issue                    0/  1    0     0    -
fffffdc6209060a8 zio_ioctl_intr                     0/  1    0     4    -
One zio is running, it's a write. Where's it at?
> fffffdc61e26f638::taskq -t
ADDR             NAME                             ACT/THDS Q'ED  MAXQ INST
fffffdc61e26f638 zio_write_issue                    1/ 48    1 49025    -

THREAD           STATE    SOBJ                COUNT
fffffb0452669c20 SLEEP    RWLOCK                  1
                 swtch+0x133
                 turnstile_block+0x25b
                 rw_enter_sleep+0x1a5
                 dmu_buf_lock_parent+0x25
                 dbuf_write_ready+0x1fd
                 arc_write_ready+0x157
                 zio_ready+0x61
                 zio_execute+0xa7
                 taskq_thread+0x2cd
                 thread_start+0xb

> fffffb0452669c20::findstack -v
stack pointer for thread fffffb0452669c20 (zpool-zones/124): fffffb0452669770
[ fffffb0452669770 _resume_from_idle+0x12b() ]
  fffffb04526697a0 swtch+0x133()
  fffffb0452669840 turnstile_block+0x25b(0, 0, fffffe155b0f3148, fffffffffbc1c5c0, 0, 0)
  fffffb04526698b0 rw_enter_sleep+0x1a5(fffffe155b0f3148, 0)
  fffffb0452669900 dmu_buf_lock_parent+0x25(fffffdfe948bf9b0, 0, fffffffff7dc0750)
  fffffb0452669970 dbuf_write_ready+0x1fd(fffffedf887bf1e8, fffffe1579356008, fffffdfe948bf9b0)
  fffffb04526699e0 arc_write_ready+0x157(fffffedf887bf1e8)
  fffffb0452669a30 zio_ready+0x61(fffffedf887bf1e8)
  fffffb0452669a60 zio_execute+0xa7(fffffedf887bf1e8)
  fffffb0452669b10 taskq_thread+0x2cd(fffffdc61e26f638)
  fffffb0452669b20 thread_start+0xb()
So we're not blocked on I/O, then. The TXG sync is blocked up behind a zio
which is waiting for one of its kids. That kid is being executed right now,
but it's waiting for an rwlock on a dbuf's parent.
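For context on why the zio thread wants this lock at all: when a child dbuf's write is ready, dbuf_write_ready() publishes the child's new block pointer into the parent, and since the dn_struct_rwlock contention work that is done under the parent's db_rwlock held as a writer. A rough sketch from memory of the OpenZFS code (names approximate, not verbatim):

    mutex_exit(&db->db_mtx);

    /* take the parent's db_rwlock as WRITER to update its copy of the bp */
    db_lock_type_t dblt = dmu_buf_lock_parent(db, RW_WRITER, FTAG);
    *db->db_blkptr = *bp;
    dmu_buf_unlock_parent(db, dblt, FTAG);

That matches the rw_enter_sleep(..., 0) -- i.e. RW_WRITER -- in the stack above.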
Who's in that lock anyway and what's that dbuf?
> fffffe155b0f3148::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
fffffe155b0f3148        READERS=7  B011 fffffdcc488b1460 (R)
                                    ||  fffffb0452669c20 (W)
                 WRITE_WANTED -------+|
                  HAS_WAITERS --------+

> fffffdcc488b1460::findstack -v
stack pointer for thread fffffdcc488b1460 (mongod/20 [ftdc]): fffffb04a659f290
[ fffffb04a659f290 _resume_from_idle+0x12b() ]
  fffffb04a659f2c0 swtch+0x133()
  fffffb04a659f360 turnstile_block+0x25b(fffffdc56d63d5f0, 1, fffffe155b0f3148, fffffffffbc1c5c0, 0, 0)
  fffffb04a659f3d0 rw_enter_sleep+0x2f8(fffffe155b0f3148, 1)
  fffffb04a659f420 dmu_buf_lock_parent+0x25(fffffe15b2254558, 1, fffffffff7dc0598)
  fffffb04a659f4c0 dbuf_read+0xb8(fffffe15b2254558, 0, 9)
  fffffb04a659f540 dmu_bonus_hold_impl+0xed(fffffdc62a46a600, 4404d, 0, 1, fffffb04a659f640)
  fffffb04a659f570 dmu_bonus_hold+0x15(fffffdc62a46a600, 4404d, 0, fffffb04a659f640)
  fffffb04a659f5a0 sa_buf_hold+0xd(fffffdc62a46a600, 4404d, 0, fffffb04a659f640)
  fffffb04a659f720 zfs_mknode+0x151(fffffdd49625a278, fffffb04a659fb70, fffffdd0f315edc0, fffffdcb3d320d50, 0, fffffb04a659f7c0, fffffb04a659f7d0)
  fffffb04a659f860 zfs_create+0x6fa(fffffdcddb915b00, fffffdc8e435cde1, fffffb04a659fb70, 0, 80, fffffb04a659fb40, fffffdcb3d320d50, fffffdcd00002102, 0, 0)
  fffffb04a659f900 fop_create+0xcf(fffffdcddb915b00, fffffdc8e435cde1, fffffb04a659fb70, 0, 80, fffffb04a659fb40, fffffdcb3d320d50, 300002102, 0, 0)
  fffffb04a659fac0 vn_createat+0x67a(c380d0, 0, fffffb04a659fb70, 0, 80, fffffb04a659fb40, fffffdc700000000, ffffffff00002102, 3f, 0)
  fffffb04a659fc60 vn_openat+0x192(c380d0, 0, 2302, 1b6, fffffb04a659fcc0, 0, 3f, 0, fffffb0400000023)
  fffffb04a659fdd0 copen+0x441(ffd19553, c380d0, 2302, 1b6)
  fffffb04a659fe00 openat+0x29(ffd19553, c380d0, 301, 1b6)
  fffffb04a659fe70 lx_openat+0x96(ffffff9c, c380d0, 241, 1b6)
  fffffb04a659fef0 lx_syscall_enter+0x1aa()
  fffffb04a659ff10 sys_syscall+0x145()

> fffffb0452669c20::findstack -v
stack pointer for thread fffffb0452669c20 (zpool-zones/124): fffffb0452669770
[ fffffb0452669770 _resume_from_idle+0x12b() ]
  fffffb04526697a0 swtch+0x133()
  fffffb0452669840 turnstile_block+0x25b(0, 0, fffffe155b0f3148, fffffffffbc1c5c0, 0, 0)
  fffffb04526698b0 rw_enter_sleep+0x1a5(fffffe155b0f3148, 0)
  fffffb0452669900 dmu_buf_lock_parent+0x25(fffffdfe948bf9b0, 0, fffffffff7dc0750)
  fffffb0452669970 dbuf_write_ready+0x1fd(fffffedf887bf1e8, fffffe1579356008, fffffdfe948bf9b0)
  fffffb04526699e0 arc_write_ready+0x157(fffffedf887bf1e8)
  fffffb0452669a30 zio_ready+0x61(fffffedf887bf1e8)
  fffffb0452669a60 zio_execute+0xa7(fffffedf887bf1e8)
  fffffb0452669b10 taskq_thread+0x2cd(fffffdc61e26f638)
  fffffb0452669b20 thread_start+0xb()

> fffffdfe948bf9b0::dbuf
            addr object lvl blkid holds os
fffffdfe948bf9b0  4404c   0     0     1 zones/466c270c-7be0-6eb9-99f6-d658c761c475

> fffffdfe948bf9b0::print dmu_buf_impl_t db_parent | ::dbuf
            addr object lvl blkid holds os
fffffe155b0f30f0    mdn   0  2202     4 zones/466c270c-7be0-6eb9-99f6-d658c761c475
OK, so this rwlock the zio executor is waiting for is the db_rwlock of one of
the dbufs in the meta-dnode for zones/466c270c-7be0-6eb9-99f6-d658c761c475.
There are allegedly 7 readers currently in the rwlock, and two threads waiting
(one is our zio executor, the other is a reader from zfs_mknode).
Unfortunately, we're not on DEBUG bits, so we can't easily get a list of all the
readers. :(
Maybe we can find another way.
How many child dbufs of that parent do we currently have around?
> ::dbufs -n zones/466c270c-7be0-6eb9-99f6-d658c761c475 | ::print -a dmu_buf_impl_t db_parent ! grep fffffe155b0f30f0 | ... > kids
fffffe15b2254588 db_parent = 0xfffffe155b0f30f0
fffffdfe948bf9e0 db_parent = 0xfffffe155b0f30f0
fffffe0f032ca060 db_parent = 0xfffffe155b0f30f0

> ::cat kids | ::dbuf
            addr object lvl blkid holds os
fffffe15b2254558  4404d   0 bonus     1 zones/466c270c-7be0-6eb9-99f6-d658c761c475
fffffdfe948bf9b0  4404c   0     0     1 zones/466c270c-7be0-6eb9-99f6-d658c761c475
fffffe0f032ca030  4404c   0 bonus     1 zones/466c270c-7be0-6eb9-99f6-d658c761c475
So it's got two bonus and one regular child.
Let's see if we can easily eliminate any of those readers. Most things that
are going to deliberately hold a dbuf or its parent will keep a pointer to one
of them on the stack, right?
> ::cat kids | ::kgrep -v -a 0xfffffb0440000000 -A 0xfffffb04d0000000 ! tee stackpos
... | ::whatis
fffffb0452669878: fffffdfe948bf9b0 is in thread fffffb0452669c20's stack
fffffb0452669888: fffffdfe948bf9b0 is in thread fffffb0452669c20's stack
fffffb04526698e0: fffffdfe948bf9b0 is in thread fffffb0452669c20's stack
fffffb04526698f8: fffffdfe948bf9b0 is in thread fffffb0452669c20's stack
fffffb0452669958: fffffdfe948bf9b0 is in thread fffffb0452669c20's stack
fffffb045419a818: fffffdfe948bf9b0 is in thread fffffb045419ac20's stack (below sp)
fffffb045419a8e8: fffffdfe948bf9b0 is in thread fffffb045419ac20's stack (below sp)
fffffb04a659f398: fffffe15b2254558 is in thread fffffdcc488b1460's stack
fffffb04a659f3b0: fffffe15b2254558 is in thread fffffdcc488b1460's stack
fffffb04a659f418: fffffe15b2254558 is in thread fffffdcc488b1460's stack
fffffb04a659f450: fffffe15b2254558 is in thread fffffdcc488b1460's stack
fffffb04a659f490: fffffe15b2254558 is in thread fffffdcc488b1460's stack
fffffb04a659f4b8: fffffe15b2254558 is in thread fffffdcc488b1460's stack
fffffb04a659f3f0: fffffe155b0f30f0 is in thread fffffdcc488b1460's stack
This nets us the two threads we already knew about from the waiters list,
plus a below-SP match in fffffb045419ac20:
stack pointer for thread fffffb045419ac20 (tq:dp_sync_taskq): fffffb045419aa70
[ fffffb045419aa70 _resume_from_idle+0x12b() ]
  fffffb045419aaa0 swtch+0x133()
  fffffb045419aae0 cv_wait+0x68(fffffdc6210d68c0, fffffdc6210d68b0)
  fffffb045419ab50 taskq_thread_wait+0xbe(fffffdc6210d6890, fffffdc6210d68b0, fffffdc6210d68c0, fffffb045419ab98, ffffffffffffffff)
  fffffb045419ac00 taskq_thread+0x34f(fffffdc6210d6890)
  fffffb045419ac10 thread_start+0xb()
Weird, but this particular thread is probably innocuous? It's the dp_sync_taskq,
so it has plenty of legit reasons to have touched that dbuf at some point.
The lack of any other threads is super suspicious: did we leak rwlock counts
on this dbuf?
The most recent change in this area was 13317 (Decrease contention on dn_struct_rwlock). Did it mess up somewhere?
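To make the suspected failure mode concrete, here's a tiny userland analogy (illustration only, not ZFS code; all the names below are made up): a reader that takes an rwlock and then bails out on an error path without dropping it leaves the reader count elevated forever, so the next writer blocks indefinitely, just like our zio_write_issue thread.

    /* rwleak.c -- cc rwleak.c -lpthread; hangs at the wrlock by design */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_rwlock_t parent_lock = PTHREAD_RWLOCK_INITIALIZER;

    /* stand-in for dbuf_read_verify_dnode_crypt() returning non-zero */
    static int
    verify_fails(void)
    {
            return (1);
    }

    static void *
    reader(void *arg)
    {
            (void) pthread_rwlock_rdlock(&parent_lock);
            if (verify_fails())
                    return (arg);   /* BUG: early return without unlocking */
            (void) pthread_rwlock_unlock(&parent_lock);
            return (arg);
    }

    int
    main(void)
    {
            pthread_t tid;

            (void) pthread_create(&tid, NULL, reader, NULL);
            (void) pthread_join(tid, NULL);

            (void) printf("reader leaked its hold; writer now blocks forever\n");
            (void) pthread_rwlock_wrlock(&parent_lock);     /* never returns */
            return (0);
    }

If the error path in dbuf_read_impl() fires even a handful of times against the same parent dbuf, reader holds accumulate exactly like this until a writer (here, the write-issue zio publishing a block pointer) parks on the lock forever.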
Updated by Joshua M. Clulow over 1 year ago
- Related to Bug #13317: Decrease contention on dn_struct_rwlock added
Updated by Marcel Telka about 1 year ago
- Category set to zfs - Zettabyte File System