Bug #13937 (open)

13317 mismerged; leaked rwlock counts on bonus' parents ensue

Added by Alex Wilson over 1 year ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: zfs - Zettabyte File System
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug:

Description

We've done a ZoL mis-merge yet again:

https://github.com/illumos/illumos-gate/commit/9704bf7fb82e71b685e194a967937ff03843e73a

see:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/fs/zfs/dbuf.c#L1119

There should be a call to dmu_buf_unlock_parent here. We've clobbered it by once again merging commits from ZoL out of order; as usual, the result is ZFS encryption and another feature breaking each other.

In this case, ZoL has changed this function considerably:

https://github.com/openzfs/zfs/blob/master/module/zfs/dbuf.c#L1465

But in all their versions of it, the locking was correct. Ours is not.

Whenever we hit a decryption issue that causes dbuf_read_verify_dnode_crypt to return non-zero (e.g. by hitting #13697 or #13795, both of which are also caused by bad merges from ZoL!), we leak rwlock counts here, eventually producing a whole-system deadlock that the zfs deadman doesn't detect.
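To make the clobbering concrete, the mismerged error path looks roughly like this; a paraphrased sketch, not verbatim source (dblt is the hold on the parent's db_rwlock that the caller took via dmu_buf_lock_parent):

        /*
         * Sketch of the broken error path in dbuf_read_impl().  The
         * caller took the parent's db_rwlock and passed us dblt; every
         * exit path has to hand it back.
         */
        err = dbuf_read_verify_dnode_crypt(db, flags);
        if (err != 0) {
                DB_DNODE_EXIT(db);
                mutex_exit(&db->db_mtx);
                /* BUG: no dmu_buf_unlock_parent(db, dblt, tag) here */
                return (err);
        }

In ZoL's versions the same failure jumps to an early_unlock label that does the DB_DNODE_EXIT/mutex_exit and then drops the parent lock, so no reader count is leaked.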


Related issues

Related to illumos gate - Bug #13317: Decrease contention on dn_struct_rwlock (Closed, Jason King)

#1

Updated by Alex Wilson over 1 year ago

My notes from debugging this in a dump:

> ::status
debugging crash dump vmcore.0 (64-bit) from sdc-cb8031
operating system: 5.11 joyent_20210623T002805Z (i86pc)
git branch: eait-20210623
git rev: f12a9dd262eb31f6cd4765ca82ad60d64defee9c
image uuid: (not set)
panic message: NMI received

dump content: kernel pages only

> ::stacks -m zfs
THREAD           STATE    SOBJ                COUNT
fffffe473261f880 SLEEP    CV                   8152
                 swtch+0x133
                 cv_wait+0x68
                 dmu_tx_wait+0x6b
                 zfs_create+0x283
                 vnext_create+0xdb
                 inotify_fop_create+0x5f
                 vnext_create+0xdb
                 inotify_fop_create+0x5f
                 vhead_create+0x139
                 fop_create+0xcf
                 vn_createat+0x67a
                 vn_openat+0x192
                 copen+0x441
                 openat+0x29
                 lx_openat+0x96
                 lx_syscall_enter+0x1aa
                 sys_syscall+0x145

fffffdca12def780 SLEEP    CV                    601
                 swtch+0x133
                 cv_wait+0x68
                 dmu_tx_wait+0x6b
                 dmu_tx_assign+0x4b
                 zil_commit_itx_assign+0x3b
                 zil_commit_impl+0x26
                 zil_commit+0x3e
                 zfs_fsync+0xb7
                 fop_fsync+0x4e
                 fdsync+0x2c

fffffdce381890a0 SLEEP    CV                    136
                 swtch+0x133
                 cv_wait_sig+0x171
                 txg_wait_synced_impl+0x7c
                 txg_wait_synced_sig+0x10
                 dsl_sync_task_common+0x1c0
                 dsl_sync_task_sig+0x23
                 zcp_eval+0x655
                 dsl_destroy_snapshots_nvl+0x15a
                 zfs_ioc_destroy_snaps+0x8b
                 zfsdev_ioctl+0x505
                 cdev_ioctl+0x2b
                 spec_ioctl+0x45
                 fop_ioctl+0x5b
                 ioctl+0x153

Oh we are waiting alright. We are waiting for a txg sync!

> ::zfs_dbgmsg
...
slow spa_sync: started 34815 seconds ago, calls 6764
slow spa_sync: started 34820 seconds ago, calls 6765
slow spa_sync: started 34825 seconds ago, calls 6766
slow spa_sync: started 34830 seconds ago, calls 6767
slow spa_sync: started 34835 seconds ago, calls 6768
slow spa_sync: started 34840 seconds ago, calls 6769

Mmhmm. Waiting.

> ::stacks -c txg_sync_thread
THREAD           STATE    SOBJ                COUNT
fffffb04536dbc20 SLEEP    CV                      1
                 swtch+0x133
                 cv_wait+0x68
                 zio_wait+0x6b
                 dsl_pool_sync+0xc6
                 spa_sync_iterate_to_convergence+0xd0
                 spa_sync+0x2f6
                 txg_sync_thread+0x1fd
                 thread_start+0xb
> fffffb04536dbc20::findstack -v
stack pointer for thread fffffb04536dbc20 (txg_sync_thread()): fffffb04536db920
[ fffffb04536db920 _resume_from_idle+0x12b() ]
  fffffb04536db950 swtch+0x133()
  fffffb04536db990 cv_wait+0x68(fffffedf89031950, fffffedf89031948)
  fffffb04536db9d0 zio_wait+0x6b(fffffedf89031580)
  fffffb04536dba80 dsl_pool_sync+0xc6(fffffdc570940940, 3928b7)
  fffffb04536dbb00 spa_sync_iterate_to_convergence+0xd0(fffffdc61db70000, fffffdef2e609800)
  fffffb04536dbb60 spa_sync+0x2f6(fffffdc61db70000, 3928b7)
  fffffb04536dbc00 txg_sync_thread+0x1fd(fffffdc570940940)
  fffffb04536dbc10 thread_start+0xb()
> fffffedf89031580::zio
ADDRESS                 TYPE  STAGE            WAITER           TIME_ELAPSED
fffffedf89031580        NULL  CHECKSUM_VERIFY  fffffb04536dbc20 -            

Alright, we're waiting on a zio. Good to know. Where are all those zios at
anyway?

> ::zio_state -r
ADDRESS                 TYPE  STAGE            WAITER           TIME_ELAPSED
fffffedf89031580        NULL  CHECKSUM_VERIFY  fffffb04536dbc20 -            
 fffffdd3791a29e0       WRITE ISSUE_ASYNC      -                -            
  fffffdc81f312640      WRITE ISSUE_ASYNC      -                -            
   fffffedf74aeb948     WRITE ISSUE_ASYNC      -                -            
    fffffedf74e540d0    WRITE ISSUE_ASYNC      -                -            
     fffffedf95f6c240   WRITE ISSUE_ASYNC      -                -            
      fffffedf9571aef0  WRITE ISSUE_ASYNC      -                -            
       fffffedf87a47128 WRITE ISSUE_ASYNC      -                -            
       fffffedf887bf1e8 WRITE READY            -                -            
fffffdc61db90ca8        NULL  OPEN             -                -            
fffffdc61db910d0        NULL  OPEN             -                -            
fffffdc61db914f8        NULL  OPEN             -                -            
fffffdc61db91920        NULL  OPEN             -                -            
...
fffffdc61e2750e0        NULL  OPEN             -                -            
fffffedf9666b188        NULL  OPEN             -                -            
 fffffedf9f3ae8c8       WRITE OPEN             -                -            
fffffdc61e67a640        NULL  OPEN             -                -            
fffffdc61e68ca38        NULL  OPEN             -                -            
fffffdc61e737698        NULL  OPEN             -                -            
fffffdc61e740258        NULL  OPEN             -                -            
fffffdc61e743250        NULL  OPEN             -                -            
fffffdc61e743678        NULL  OPEN             -                -            
fffffdc61e74be10        NULL  OPEN             -                -            
fffffddb26a5a188        NULL  OPEN             -                -

Alright. There are some zios there.

> ::taskq -n zio
ADDR             NAME                             ACT/THDS Q'ED  MAXQ INST
fffffdc61e764640 zio_null_issue                     0/   1    0  5950    -
fffffdc61e764ac0 zio_null_intr                      0/   1    0   424    -
fffffdc61e26f1b8 zio_read_issue                     0/   8    0    43    -
fffffdc61e764760 zio_read_intr_0                    0/  12    0    46    -
fffffdc61e7649a0 zio_read_intr_1                    0/  12    0    42    -
fffffdc61e764be0 zio_read_intr_2                    0/  12    0    22    -
fffffdc61e764d00 zio_read_intr_3                    0/  12    0    33    -
fffffdc5e61fdaa8 zio_read_intr_4                    0/  12    0    38    -
fffffdc5e61fd2c8 zio_read_intr_5                    0/  12    0    22    -
fffffdc5e61fd1a8 zio_read_intr_6                    0/  12    0    34    -
fffffdc5e61fd088 zio_read_intr_7                    0/  12    0    29    -
fffffdc61e26f638 zio_write_issue                    1/  48    1 49025    -
fffffdc61e215bd0 zio_write_issue_high               0/   5    0    22    -
fffffdc61e215ab0 zio_write_intr                     0/   8    0   853    -
fffffdc61e215870 zio_write_intr_high                0/   5    0     8    -
fffffdc61e7640a0 zio_free_issue_0                   0/  12    0     0    -
fffffdc61e215990 zio_free_issue_1                   0/  12    0     0    -
fffffdc61e215510 zio_free_issue_2                   0/  12    0     0    -
fffffdc61e2153f0 zio_free_issue_3                   0/  12    0     0    -
fffffdc61e2152d0 zio_free_issue_4                   0/  12    0     0    -
fffffdc61e26fe18 zio_free_issue_5                   0/  12    0     0    -
fffffdc620906768 zio_free_issue_6                   0/  12    0     0    -
fffffdc61e7641c0 zio_free_issue_7                   0/  12    0     0    -
fffffdc61e7642e0 zio_free_intr                      0/   1    0     0    -
fffffdc61e764400 zio_claim_issue                    0/   1    0     0    -
fffffdc61e764520 zio_claim_intr                     0/   1    0     0    -
fffffdc6209061c8 zio_ioctl_issue                    0/   1    0     0    -
fffffdc6209060a8 zio_ioctl_intr                     0/   1    0     4    -

One zio is running, and it's a write. Where's it at?

> fffffdc61e26f638::taskq -t
ADDR             NAME                             ACT/THDS Q'ED  MAXQ INST
fffffdc61e26f638 zio_write_issue                    1/  48    1 49025    -
    THREAD           STATE    SOBJ                COUNT
    fffffb0452669c20 SLEEP    RWLOCK                  1
                     swtch+0x133
                     turnstile_block+0x25b
                     rw_enter_sleep+0x1a5
                     dmu_buf_lock_parent+0x25
                     dbuf_write_ready+0x1fd
                     arc_write_ready+0x157
                     zio_ready+0x61
                     zio_execute+0xa7
                     taskq_thread+0x2cd
                     thread_start+0xb

> fffffb0452669c20::findstack -v
stack pointer for thread fffffb0452669c20 (zpool-zones/124): fffffb0452669770
[ fffffb0452669770 _resume_from_idle+0x12b() ]
  fffffb04526697a0 swtch+0x133()
  fffffb0452669840 turnstile_block+0x25b(0, 0, fffffe155b0f3148, fffffffffbc1c5c0, 0, 0)
  fffffb04526698b0 rw_enter_sleep+0x1a5(fffffe155b0f3148, 0)
  fffffb0452669900 dmu_buf_lock_parent+0x25(fffffdfe948bf9b0, 0, fffffffff7dc0750)
  fffffb0452669970 dbuf_write_ready+0x1fd(fffffedf887bf1e8, fffffe1579356008, fffffdfe948bf9b0)
  fffffb04526699e0 arc_write_ready+0x157(fffffedf887bf1e8)
  fffffb0452669a30 zio_ready+0x61(fffffedf887bf1e8)
  fffffb0452669a60 zio_execute+0xa7(fffffedf887bf1e8)
  fffffb0452669b10 taskq_thread+0x2cd(fffffdc61e26f638)
  fffffb0452669b20 thread_start+0xb()

So we're not blocked on I/O, then. The TXG sync is blocked up behind a zio
which is waiting for one of its kids. That kid is being executed right now,
but it's waiting for an rwlock on a dbuf's parent.
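For context: the ready path takes the parent's db_rwlock as WRITER so it can publish the child's new block pointer into the parent's buffer. Roughly, paraphrasing the tail of dbuf_write_ready rather than quoting it:

        db_lock_type_t dblt = dmu_buf_lock_parent(db, RW_WRITER, FTAG);
        *db->db_blkptr = *bp;   /* copy the child's new bp into the parent */
        dmu_buf_unlock_parent(db, dblt, FTAG);

A single stuck reader hold on that lock is enough to park this thread forever.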

Who's in that lock anyway and what's that dbuf?

> fffffe155b0f3148::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
fffffe155b0f3148        READERS=7  B011 fffffdcc488b1460 (R)
                                     || fffffb0452669c20 (W)
                 WRITE_WANTED -------+|
                  HAS_WAITERS --------+

> fffffdcc488b1460::findstack -v
stack pointer for thread fffffdcc488b1460 (mongod/20 [ftdc]): fffffb04a659f290
[ fffffb04a659f290 _resume_from_idle+0x12b() ]
  fffffb04a659f2c0 swtch+0x133()
  fffffb04a659f360 turnstile_block+0x25b(fffffdc56d63d5f0, 1, fffffe155b0f3148, fffffffffbc1c5c0, 0, 0)
  fffffb04a659f3d0 rw_enter_sleep+0x2f8(fffffe155b0f3148, 1)
  fffffb04a659f420 dmu_buf_lock_parent+0x25(fffffe15b2254558, 1, fffffffff7dc0598)
  fffffb04a659f4c0 dbuf_read+0xb8(fffffe15b2254558, 0, 9)
  fffffb04a659f540 dmu_bonus_hold_impl+0xed(fffffdc62a46a600, 4404d, 0, 1, fffffb04a659f640)
  fffffb04a659f570 dmu_bonus_hold+0x15(fffffdc62a46a600, 4404d, 0, fffffb04a659f640)
  fffffb04a659f5a0 sa_buf_hold+0xd(fffffdc62a46a600, 4404d, 0, fffffb04a659f640)
  fffffb04a659f720 zfs_mknode+0x151(fffffdd49625a278, fffffb04a659fb70, fffffdd0f315edc0, fffffdcb3d320d50, 0, fffffb04a659f7c0, fffffb04a659f7d0)
  fffffb04a659f860 zfs_create+0x6fa(fffffdcddb915b00, fffffdc8e435cde1, fffffb04a659fb70, 0, 80, fffffb04a659fb40, fffffdcb3d320d50, fffffdcd00002102, 0, 0)
  fffffb04a659f900 fop_create+0xcf(fffffdcddb915b00, fffffdc8e435cde1, fffffb04a659fb70, 0, 80, fffffb04a659fb40, fffffdcb3d320d50, 300002102, 0, 0)
  fffffb04a659fac0 vn_createat+0x67a(c380d0, 0, fffffb04a659fb70, 0, 80, fffffb04a659fb40, fffffdc700000000, ffffffff00002102, 3f, 0)
  fffffb04a659fc60 vn_openat+0x192(c380d0, 0, 2302, 1b6, fffffb04a659fcc0, 0, 3f, 0, fffffb0400000023)
  fffffb04a659fdd0 copen+0x441(ffd19553, c380d0, 2302, 1b6)
  fffffb04a659fe00 openat+0x29(ffd19553, c380d0, 301, 1b6)
  fffffb04a659fe70 lx_openat+0x96(ffffff9c, c380d0, 241, 1b6)
  fffffb04a659fef0 lx_syscall_enter+0x1aa()
  fffffb04a659ff10 sys_syscall+0x145()

> fffffb0452669c20::findstack -v
stack pointer for thread fffffb0452669c20 (zpool-zones/124): fffffb0452669770
[ fffffb0452669770 _resume_from_idle+0x12b() ]
  fffffb04526697a0 swtch+0x133()
  fffffb0452669840 turnstile_block+0x25b(0, 0, fffffe155b0f3148, fffffffffbc1c5c0, 0, 0)
  fffffb04526698b0 rw_enter_sleep+0x1a5(fffffe155b0f3148, 0)
  fffffb0452669900 dmu_buf_lock_parent+0x25(fffffdfe948bf9b0, 0, fffffffff7dc0750)
  fffffb0452669970 dbuf_write_ready+0x1fd(fffffedf887bf1e8, fffffe1579356008, fffffdfe948bf9b0)
  fffffb04526699e0 arc_write_ready+0x157(fffffedf887bf1e8)
  fffffb0452669a30 zio_ready+0x61(fffffedf887bf1e8)
  fffffb0452669a60 zio_execute+0xa7(fffffedf887bf1e8)
  fffffb0452669b10 taskq_thread+0x2cd(fffffdc61e26f638)
  fffffb0452669b20 thread_start+0xb()

> fffffdfe948bf9b0::dbuf
            addr   object lvl     blkid holds os
fffffdfe948bf9b0    4404c   0         0     1 zones/466c270c-7be0-6eb9-99f6-d658c761c475

> fffffdfe948bf9b0::print dmu_buf_impl_t db_parent | ::dbuf
            addr   object lvl     blkid holds os
fffffe155b0f30f0      mdn   0      2202     4 zones/466c270c-7be0-6eb9-99f6-d658c761c475

OK, so this rwlock the zio executor is waiting for is the db_rwlock of one of
the dbufs in the meta-dnode for zones/466c270c-7be0-6eb9-99f6-d658c761c475.

There are allegedly 7 readers currently in the rwlock, and two threads waiting
(one is our zio executor, the other is a reader from zfs_mknode).
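For reference, this is the helper pair that 13317 brought in; paraphrased, not verbatim source: when a dbuf has a parent dbuf, the lock taken is that parent's db_rwlock, and the returned type tells the matching dmu_buf_unlock_parent which lock to drop.

        db_lock_type_t
        dmu_buf_lock_parent(dmu_buf_impl_t *db, krw_t rw, void *tag)
        {
                dmu_buf_impl_t *parent = db->db_parent;

                if (parent == NULL)
                        return (DLT_NONE); /* real code also handles an objset-level lock */
                rw_enter(&parent->db_rwlock, rw);
                return (DLT_PARENT);
        }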

Unfortunately, we're not on DEBUG bits, so we can't easily get a list of all the
readers. :(

Maybe we can find another way.

How many child dbufs of that parent do we currently have around?

> ::dbufs -n zones/466c270c-7be0-6eb9-99f6-d658c761c475 | ::print -a dmu_buf_impl_t db_parent ! grep fffffe155b0f30f0 | ... > kids
fffffe15b2254588 db_parent = 0xfffffe155b0f30f0
fffffdfe948bf9e0 db_parent = 0xfffffe155b0f30f0
fffffe0f032ca060 db_parent = 0xfffffe155b0f30f0

> ::cat kids | ::dbuf
            addr   object lvl     blkid holds os
fffffe15b2254558    4404d   0     bonus     1 zones/466c270c-7be0-6eb9-99f6-d658c761c475
fffffdfe948bf9b0    4404c   0         0     1 zones/466c270c-7be0-6eb9-99f6-d658c761c475
fffffe0f032ca030    4404c   0     bonus     1 zones/466c270c-7be0-6eb9-99f6-d658c761c475

So it's got two bonus dbufs and one regular child.

Let's see if we can easily eliminate any of those readers. Most things that
are going to deliberately hold a dbuf or its parent will keep a pointer to one
of them on the stack, right?

> ::cat kids | ::kgrep -v -a 0xfffffb0440000000 -A 0xfffffb04d0000000 ! tee stackpos ... | ::whatis

fffffb0452669878:       fffffdfe948bf9b0    is in thread fffffb0452669c20's stack
fffffb0452669888:       fffffdfe948bf9b0    is in thread fffffb0452669c20's stack
fffffb04526698e0:       fffffdfe948bf9b0    is in thread fffffb0452669c20's stack
fffffb04526698f8:       fffffdfe948bf9b0    is in thread fffffb0452669c20's stack
fffffb0452669958:       fffffdfe948bf9b0    is in thread fffffb0452669c20's stack
fffffb045419a818:       fffffdfe948bf9b0    is in thread fffffb045419ac20's stack (below sp)
fffffb045419a8e8:       fffffdfe948bf9b0    is in thread fffffb045419ac20's stack (below sp)
fffffb04a659f398:       fffffe15b2254558    is in thread fffffdcc488b1460's stack
fffffb04a659f3b0:       fffffe15b2254558    is in thread fffffdcc488b1460's stack
fffffb04a659f418:       fffffe15b2254558    is in thread fffffdcc488b1460's stack
fffffb04a659f450:       fffffe15b2254558    is in thread fffffdcc488b1460's stack
fffffb04a659f490:       fffffe15b2254558    is in thread fffffdcc488b1460's stack
fffffb04a659f4b8:       fffffe15b2254558    is in thread fffffdcc488b1460's stack
fffffb04a659f3f0:       fffffe155b0f30f0    is in thread fffffdcc488b1460's stack

This nets us the two threads we already knew about from the waiters list,
plus a below-SP match in fffffb045419ac20:

stack pointer for thread fffffb045419ac20 (tq:dp_sync_taskq): fffffb045419aa70
[ fffffb045419aa70 _resume_from_idle+0x12b() ]
  fffffb045419aaa0 swtch+0x133()
  fffffb045419aae0 cv_wait+0x68(fffffdc6210d68c0, fffffdc6210d68b0)
  fffffb045419ab50 taskq_thread_wait+0xbe(fffffdc6210d6890, fffffdc6210d68b0, fffffdc6210d68c0, fffffb045419ab98, ffffffffffffffff)
  fffffb045419ac00 taskq_thread+0x34f(fffffdc6210d6890)
  fffffb045419ac10 thread_start+0xb()

Weird, but this particular thread is probably innocuous? It's the
dp_sync_taskq, so it has plenty of legit reasons to have touched
that dbuf at some point.

The lack of any other threads is super suspicious: did we leak rwlock counts
on this dbuf?
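
Here's the suspected failure mode boiled down to a runnable userland toy, with pthread rwlocks standing in for krwlock_t (purely illustrative):

        #include <pthread.h>
        #include <stdio.h>

        /* Stand-in for the parent dbuf's db_rwlock. */
        static pthread_rwlock_t db_rwlock = PTHREAD_RWLOCK_INITIALIZER;

        /*
         * An error path that bails out without dropping its read hold,
         * the same shape as the broken dbuf_read_impl() path above.
         */
        static int
        leaky_read_path(void)
        {
                (void) pthread_rwlock_rdlock(&db_rwlock);
                /* ... hit a decrypt error and bail ... */
                return (-1);    /* BUG: no pthread_rwlock_unlock() */
        }

        int
        main(void)
        {
                (void) leaky_read_path();       /* reader count now stuck at 1 */

                /* The would-be writer (in the kernel, the zio executor) never gets in. */
                (void) printf("writer blocking...\n");
                (void) pthread_rwlock_wrlock(&db_rwlock);       /* hangs forever */
                return (0);
        }

Each failed bonus read would leak one more count, which lines up with READERS=7 and no findable reader threads.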

The most recent change in this area was #13317 (Decrease contention on dn_struct_rwlock). Did it mess up somewhere?

#2

Updated by Joshua M. Clulow over 1 year ago

  • Related to Bug #13317: Decrease contention on dn_struct_rwlock added
#3

Updated by Joshua M. Clulow over 1 year ago

  • Description updated (diff)
#4

Updated by Marcel Telka about 1 year ago

  • Category set to zfs - Zettabyte File System