Project

General

Profile

Bug #5133

fix races for two panics on NULL pointer derefs in nsmb connection teardown

Added by Dan McDonald over 5 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
cifs - CIFS server and client
Start date:
2014-09-05
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

In both cases an smbiod worker process triggers the panic with system calls against the nsmb (SMB client) kernel module as part of connection teardown.

SUP-538 summary:

This was a race between:
(a) the smbiod closing it's connection, leading to an "unloan" of the network socket, and
(b) other driver functions calling nb_disconnect(), i.e. as a result of network errors.

The root cause here is that nb_disconnect() can free nbp->nbp_frag out from under the smbiod receiver thread for lack of synchronization of the free with the state change, so our cleanest fix is to move the freemsg() to nb_unloan_fp(), which is synchronized with the state in nbp->nbp_flags.

Panic signature:

> ::status
[snip]
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff00b9ec78d0 addr=18 occurred in module "nsmb" due to a NULL pointer dereference
dump content: kernel pages and pages from PID -1
> ::stack
nbssn_recv+0xe8(ffffff3a01535e18, ffffff00b9ec7a88, ffffff00b9ec7a38, ffffff00b9ec7a3f)
smb_nbst_recv+0x62(ffffff6c5ff78080, ffffff00b9ec7a88)
smb_iod_recv1+0x3b(ffffff6c5ff78080, ffffff00b9ec7ae8)
smb_iod_recvall+0x65(ffffff6c5ff78080)
smb_iod_vc_work+0xa0(ffffff6c5ff78080, ffffff19b0fa5190)
smb_usr_iod_work+0xf4(ffffff19f25e44c8, 80b79f0, 100003, ffffff19b0fa5190)
nsmb_ioctl+0xd5(1000001d2ca, 6e7313, 80b79f0, 100003, ffffff19b0fa5190, ffffff00b9ec7d54)
cdev_ioctl+0x45(1000001d2ca, 6e7313, 80b79f0, 100003, ffffff19b0fa5190, ffffff00b9ec7d54)
spec_ioctl+0x5a(ffffff598484ba40, 6e7313, 80b79f0, 100003, ffffff19b0fa5190, ffffff00b9ec7d54)
fop_ioctl+0x7b(ffffff598484ba40, 6e7313, 80b79f0, 100003, ffffff19b0fa5190, ffffff00b9ec7d54)
ioctl+0x18e(6, 6e7313, 80b79f0)
dtrace_systrace_syscall32+0x11a(6, 6e7313, 80b79f0, 1c3, 1, 0)
_sys_sysenter_post_swapgs+0x149()

This can also exhibit in nbssn_peekhdr() and pullupmsg() called by the receiver thread. Examination of nbp->nbp_frag in the stack trace shows it to be NULL, while NBF_SENDLOCK is set in nbp->nbp_flags.

SUP-548 summary:
smb_iod_disconnect() checks whether vcp->iod_thr is NULL at smb_iod.c line 188, immediately before calling tsignal(vcp->iod_thr, SIGKILL) at line 190. Code that changes the pointer value (e.g. smb_usr_iod_work()) acquires/surrenders the necessary locks by calling SMB_VC_LOCK/SMB_VC_UNLOCK (see smb_usr.c lines 852-857). It appears that this code needs to do the same to synchronise without racing off to panic.

Signature:

> ::status
[snip]
operating system: 5.11 NexentaOS_134f (i86pc)
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0017d17a00 addr=180 occurred in module "genunix" due to a NULL pointer dereference
dump content: kernel pages and pages from PID -1
> ::stack
tsignal+0x1e(0, 9)
smb_iod_disconnect+0x76(ffffff0700f0b700)
nsmb_close2+0x4f(ffffff066c2e41d8, ffffff042180a698)
nsmb_close+0x4f(10000003131, 3, 2, ffffff042180a698)
dev_close+0x3d(10000003131, 3, 2, ffffff042180a698)
device_close+0xa2(ffffff040d8ac180, 3, ffffff042180a698)
spec_close+0x163(ffffff040d8ac180, 3, 1, 0, ffffff042180a698, 0)
fop_close+0x71(ffffff040d8ac180, 3, 1, 0, ffffff042180a698, 0)
closef+0x5d(ffffff0414fc7b28)
closeandsetf+0x507(c, 0)
close+0x18(c)
dtrace_systrace_syscall32+0x11a(c, 0, 0, 1c3, 1, 0)
_sys_sysenter_post_swapgs+0x149()

Stack trace makes it very apparent that vcp->iod_thr has been set to NULL in race with tsignal() call.

Subsequent experience has shown that these issues exhibits in more recent illumos-gate shipped by OmniTI.

History

#1

Updated by Gordon Ross over 5 years ago

SUP-538 summary:
This was a race between:
(a) the smbiod closing it's connection, leading to an "unloan" of the network socket, and
(b) other driver functions calling nb_disconnect, i.e. as a result of network errors.

The root cause here is that nb_disconnect can free nbp_frag without the receiver thread knowing about that.
The simplest fix is to move the work of freeing that frag to the "unloan" function, which is called by the receiver thread on it's way out.

SUP-548 summary:
smb_iod_disconnect() checks whether vcp->iod_thr is NULL at smb_iod.c line 188, immediately before calling tsignal(vcp->iod_thr, SIGKILL) at line 190. Code that changes the pointer value (e.g. smb_usr_iod_work()) acquires/surrenders the necessary locks by calling SMB_VC_LOCK/SMB_VC_UNLOCK (see smb_usr.c lines 852-857). It appears that this code needs to do the same to synchronise without racing off to panic.

#2

Updated by Dan McDonald over 5 years ago

  • Subject changed from Upstream SMB client fixes: Nexenta SUP-538 and SUP-548 to Two SMB client race fixes: Nexenta SUP-538 and SUP-548
#3

Updated by Bayard Bell over 5 years ago

  • Subject changed from Two SMB client race fixes: Nexenta SUP-538 and SUP-548 to fix races for two panics on NULL pointer derefs in nsmb connection teardown
  • Status changed from New to Pending RTI
  • Assignee set to Bayard Bell
  • Tags deleted (needs-triage)
#4

Updated by Electric Monk over 5 years ago

  • Status changed from Pending RTI to Closed
  • % Done changed from 0 to 100

git commit 3ce9ce383e93f64f4baed13c5a0a28d7a5f1b71e

commit  3ce9ce383e93f64f4baed13c5a0a28d7a5f1b71e
Author: Bayard Bell <bayard.bell@nexenta.com>
Date:   2014-09-16T17:59:09.000Z

    5133 fix races for two panics on NULL pointer derefs in nsmb connection teardown
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Reviewed by: Richard Lowe <richlowe@richlowe.net>
    Approved by: Robert Mustacchi <rm@joyent.com>

Also available in: Atom PDF