Bug #13515

closed

panic with bad mutex in smb_ofile_hold_olbrk

Added by Gordon Ross 10 months ago. Updated 10 months ago.

Status: Closed
Priority: High
Assignee:
Category: -
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

Panic looks like this:

# mdb -k 2
Loading modules: [ ... ]

> ::status
debugging crash dump vmcore.2 (64-bit) from ...
operating system: 5.11 ... (i86pc)
image uuid: ...
panic message: mutex_enter: bad mutex, lp=ffffff5000c14bf0 owner=ffffff00f640cc40 thread=ffffff00f5ebec40
dump content: kernel pages only
> $C
ffffff00f5ebe380 vpanic()
ffffff00f5ebe3a0 mutex_panic+0x58(fffffffffb95e2d8, ffffff5000c14bf0)
ffffff00f5ebe410 mutex_vector_enter+0x347(ffffff5000c14bf0)
ffffff00f5ebe450 smb_ofile_hold_olbrk+0x38(ffffff5000c14bb8)
ffffff00f5ebe4c0 smb_oplock_ind_break+0x80(ffffff5000c14bb8, 3, 1, 0)
ffffff00f5ebe560 smb_oplock_break_cmn+0x889(ffffff32dbb89640, ffffffa5a0b66bd8, 100004)
ffffff00f5ebe590 smb_oplock_break_OPEN+0x31(ffffff32dbb89640, ffffffa5a0b66bd8, 120089, 1)
ffffff00f5ebe710 smb_common_open+0xe88(ffffff3412810338)
ffffff00f5ebeaa0 smb2_create+0x6c0(ffffff3412810338)
ffffff00f5ebeb50 smb2sr_work+0x356(ffffff3412810338)
ffffff00f5ebeb90 smb2_tq_work+0x73(ffffff3412810338)
ffffff00f5ebec20 taskq_d_thread+0xb7(ffffff3263742600)
ffffff00f5ebec30 thread_start+8()

Related issues

Related to illumos gate - Bug #13850: SMB session logoff stuck in smb_ofile_hold_olbrk (Closed, Gordon Ross)

Actions #1

Updated by Gordon Ross 10 months ago

  • Description updated (diff)
Actions #2

Updated by Gordon Ross 10 months ago

To make a long story short, this is a use-after-free:

smb2_lease_ofile_close() gets a hold on some ofile;
another thread calls smb_ofile_close(), which ends up moving the lease;
smb2_lease_ofile_close() then also tries to move the lease,
but ends up setting the exclusive open pointer to an ofile that is being freed.
Later, another thread calls smb_oplock_ind_break() with that freed ofile,
and smb_ofile_hold_olbrk() panics on its mutex.

(Thanks to Matt Barden for that analysis, summarized here.)
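
The interleaving summarized above can be replayed in a toy, single-threaded model. This is only a sketch: the struct and function names (model_ofile, model_node, pick_candidate, replay_race) are hypothetical stand-ins, not the real smbsrv structures, and the precise ordering in the real code is assumed rather than taken from the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real smbsrv structures. */
struct model_ofile {
	bool	closing;	/* close has begun; memory will soon be freed */
};

struct model_node {
	struct model_ofile *excl_open;	/* ofile that oplock breaks go to */
};

/* Choose another ofile that is not (yet) closing as the lease's new home. */
static struct model_ofile *
pick_candidate(struct model_ofile *ofiles, size_t n, struct model_ofile *self)
{
	for (size_t i = 0; i < n; i++) {
		if (&ofiles[i] != self && !ofiles[i].closing)
			return (&ofiles[i]);
	}
	return (NULL);
}

/*
 * Replay the unserialized interleaving step by step.
 * Returns true if the node ends up pointing at a closing ofile.
 */
bool
replay_race(void)
{
	struct model_ofile of[2] = { { false }, { false } };
	struct model_node node = { &of[0] };

	/* Thread A (lease close of of[0]) picks of[1] as the new home. */
	struct model_ofile *a_choice = pick_candidate(of, 2, &of[0]);
	assert(a_choice == &of[1]);

	/* Thread B starts closing of[1] before A publishes its choice. */
	of[1].closing = true;
	if (node.excl_open == &of[1])
		node.excl_open = NULL;

	/* Thread A publishes its now-stale choice without re-checking. */
	of[0].closing = true;
	node.excl_open = a_choice;

	return (node.excl_open != NULL && node.excl_open->closing);
}
```

In this model replay_race() returns true: the pointer a later oplock break would follow refers to an ofile that is already on its way to being freed.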

Actions #3

Updated by Gordon Ross 10 months ago

  • Status changed from New to In Progress
  • Assignee set to Gordon Ross

There are several ways one might fix this. We went with a conservative approach:

Serialize oplock tear-down for all ofiles on the same node
by holding the node oplock mutex throughout that tear-down.
Do that oplock tear-down before the rest of ofile close so that
ofiles cannot close while we hold the node oplock mutex.
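
A minimal user-space sketch of that approach, using a pthread mutex in place of the kernel mutex. All names here (fix_node, fix_ofile, fix_oplock_teardown, fix_ofile_close) are hypothetical and the logic is heavily simplified; it is not the code from the actual change.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define	MODEL_MAX_OFILES	4

/* Hypothetical, simplified stand-ins; NOT the real smbsrv structures. */
struct fix_ofile {
	bool	oplock_torn_down;	/* oplock state detached from node */
	bool	closed;			/* rest of close has run */
};

struct fix_node {
	pthread_mutex_t	oplock_mutex;	/* serializes all oplock tear-down */
	struct fix_ofile *ofiles[MODEL_MAX_OFILES];
	struct fix_ofile *excl_open;	/* ofile that oplock breaks go to */
};

void
fix_node_init(struct fix_node *node)
{
	assert(node != NULL);
	pthread_mutex_init(&node->oplock_mutex, NULL);
	for (size_t i = 0; i < MODEL_MAX_OFILES; i++)
		node->ofiles[i] = NULL;
	node->excl_open = NULL;
}

/*
 * Oplock tear-down for one ofile. The whole step runs under the node
 * oplock mutex, so a successor is both chosen and published while no
 * other ofile on this node can be torn down.
 */
void
fix_oplock_teardown(struct fix_node *node, struct fix_ofile *of)
{
	pthread_mutex_lock(&node->oplock_mutex);

	/* Unlink first, so this ofile can never be picked as a successor. */
	for (size_t i = 0; i < MODEL_MAX_OFILES; i++) {
		if (node->ofiles[i] == of)
			node->ofiles[i] = NULL;
	}
	of->oplock_torn_down = true;

	if (node->excl_open == of) {
		node->excl_open = NULL;
		for (size_t i = 0; i < MODEL_MAX_OFILES; i++) {
			if (node->ofiles[i] != NULL) {
				node->excl_open = node->ofiles[i];
				break;
			}
		}
	}

	pthread_mutex_unlock(&node->oplock_mutex);
}

/* Close: oplock tear-down happens before the rest of the close. */
void
fix_ofile_close(struct fix_node *node, struct fix_ofile *of)
{
	fix_oplock_teardown(node, of);
	of->closed = true;	/* stands in for the remaining close work */
}
```

In this sketch excl_open is only ever set to an ofile that is still linked on the node, and unlinking happens under the same mutex, so the pointer cannot be left referring to an ofile whose tear-down has already run.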

Actions #4

Updated by Electric Monk 10 months ago

  • Gerrit CR set to 1228
Actions #5

Updated by Gordon Ross 10 months ago

Testing:
Matt Barden wrote a test that can reproduce this with fairly high probability,
running simultaneous SMB2 close calls with things that break a lease.

Before/after the fix:
smbtorture ... lease.close_order
(panic before the fix / no panic after)

Actions #6

Updated by Electric Monk 10 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 1f0845f1dfb179d6aa598ad89bb44d432f4e1020

commit  1f0845f1dfb179d6aa598ad89bb44d432f4e1020
Author: Gordon Ross <gordon.ross@tintri.com>
Date:   2021-02-20T20:37:29.000Z

    13515 panic with bad mutex in smb_ofile_hold_olbrk
    Portions contributed by: Matt Barden <mbarden@tintri.com>
    Reviewed by: Evan Layton <elayton@tintri.com>
    Reviewed by: Rick McNeal <rmcneal@tintri.com>
    Reviewed by: Paul Winder <paul@winder.uk.net>
    Approved by: Robert Mustacchi <rm@fingolfin.org>

Actions #7

Updated by Gordon Ross 3 months ago

  • Related to Bug #13850: SMB session logoff stuck in smb_ofile_hold_olbrk added