Project

General

Profile

Actions

Bug #2041

closed

panic in nsmb_close

Added by Gordon Ross over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Start date:
2012-01-28
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:

Description

After SMB server (domain member) is idle for some time,
we see a panic in the nsmb driver's close function.

Here's the relevant stack:

panic+0x94()
die+0xdd()
trap+0x177b()
0xfffffffffb8001d6()
get_ok_ack+0x39()
nb_snddis+0x89()
nb_disconnect+0x6b()
smb_nbst_disconnect+0x1e()
smb_iod_disconnect+0x4b()
nsmb_close2+0x4f()
nsmb_close+0x4f()
dev_close+0x3d()
device_close+0xa2()
spec_close+0x163()
fop_close+0x71()
closef+0x5d()
closeandsetf+0x507()
close+0x18()

Actions #1

Updated by Gordon Ross over 9 years ago

  • Status changed from New to In Progress

This is a "tear down" race involving nsmb_close.

The crash happens in a thread calling close:

   swtch+0x150()
   cv_wait_sig+0x14d()
   str_cv_wait+0xbb()
   strwaitq+0x1fe()
   kstrgetmsg+0x3dc()
   tli_recv+0x62()
   get_ok_ack+0x2d()
   nb_snddis+0xb3()
   nb_disconnect+0x66()
   smb_nbst_disconnect+0x1e()
   smb_iod_disconnect+0x4b()
   nsmb_close2+0x4f()
   nsmb_close+0x4f()
   dev_close+0x52()
   device_close+0xa2()
   spec_close+0x163()
   fop_close+0x71()
   closef+0x5f()
   closeandsetf+0x4f5()
   close+0x18()
   sys_syscall32+0xff()

This dies because the tli object has already been torn down.
We can see evidence of that in this crash dump.
Starting with the arg to get_ok_ack:

> ffffff077a3001b8 ::print TIUSER
{
    fp = 0xffffff075a43a320
    tp_info = {
        addr = 0x10
        options = 0x5dc
        tsdu = 0
        etsdu = 0x80
        connect = 0
        discon = 0
        servtype = 0x2
    }
    flags = 0
}
> 0xffffff075a43a320 ::print "struct file" 
{
    f_tlock = {
        _opaque = [ 0xffffff075fce3e60 ]
    }
    f_flag = 0x3
    f_pad = 0xbadd
    f_vnode = 0
    f_offset = 0
    f_cred = 0xffffff0760401328
    f_audit_data = 0
    f_count = 0x1
}

The fact that the f_vnode field is null above means that another thread has destroyed this endpoint. It's not in this particular crash dump that I have handy, but the thread that typically destroys the endpoint is the "IOD" thread:

   nb_disconnect+0x26()
   nb_getmsg_mlen+0x206()
   nbssn_peekhdr+0x36()
   nbssn_recv+0x84()
   smb_nbst_recv+0x5b()
   smb_iod_recv1+0x3b()
   smb_iod_recvall+0x65()
   smb_iod_vc_work+0xa0()
   smb_usr_iod_work+0xf4()
   nsmb_ioctl+0xd5()
   cdev_ioctl+0x6c()
   spec_ioctl+0x5a()
   fop_ioctl+0xd0()
   ioctl+0x18e()

The IOD thread would normally be already blocked in tli_recv() when the close thread sends the disconnect down, so the ACK will typically be seen by the IOD thread. The close thread then proceeds to trash the endpoint while the IOD thread is trying to use it.

Some have asked why this "transport" vector code is there.

When we started porting this code from BSD (way back in the OpenSolaris smbfs project), there was a very different model for managing the network transport endpoint. All the connection setup/tear-down was here in the driver, and it used this smb_tranctp.c module as an attempt at a generic interface to something like BSD sockets.
Later, we decided to move the complex parts of connection setup to a user-space helper process (smbiod) and pass-in a transport endpoint ready for I/O. The operations "loan fp" and "unloan fp" are how the smbiod temporarily loans the already connected endpoint to this driver. Most of the remaining parts of this formerly generic transport vector became unused at that point, but were left there because we weren't entirely sure if we would keep this design or not. At this point, we could remove the unused stuff in smb_trantcp.c if desired.

Serialization design in this layer has not significantly changed, but had some flaws that this fix corrects.
There should be only one receiver thread for any connection, which is the "IOD" thread. In the trantcp code, we now verify that and error or assert if we detect violations. Similarly, senders must serialize at the layer above this code due to issues like request sequence number and serial numbers for message integrity signing. So here too, we verify that there are no concurrent callers at the tranctp level, and error or assert if we detect any.

The one place where we have to deal with some concurrency is tear-down, where for example, either a reader or write thread may detect an unrecoverable error on the transport and decide to mark it as no longer connected. Whatever the reason for the endpoint going to "not connected" state, that state transition must happen with protection against concurrency from any caller to smb_nbst_disconnect (i.e. via device close). As long as all state changes of these trantcp endpoints happen under a mutex lock, the first thread calling disconnect will change state, and all later threads will notice the new stat and do nothing more beyond getting out, possibly with an error indicating that the send or recv has failed.

If you look at the mutex lock nbp->nbp_lock and the member nbp->nbp_flags before the fix, you'll see that there were missing protections.

Actions #2

Updated by Gordon Ross over 9 years ago

  • Status changed from In Progress to Resolved
changeset:   13595:565bd3085959
tag:         tip
user:        Gordon Ross <gwr@nexenta.com>
date:        Sat Feb 04 15:55:57 2012 -0500
description:
    2041 panic in nsmb_close
    Reviewed by: Dan McDonald <danmcd@nexenta.com>
    Reviewed by: Richard Elling <richard.elling@richardelling.com>
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Approved by: Garrett D'Amore <garrett@damore.org>
Actions

Also available in: Atom PDF