fix races for two panics on NULL pointer derefs in nsmb connection teardown
In both cases an smbiod worker process triggers the panic with system calls against the nsmb (SMB client) kernel module as part of connection teardown.
This was a race between:
(a) the smbiod closing it's connection, leading to an "unloan" of the network socket, and
(b) other driver functions calling nb_disconnect(), i.e. as a result of network errors.
The root cause here is that nb_disconnect() can free nbp->nbp_frag out from under the smbiod receiver thread for lack of synchronization of the free with the state change, so our cleanest fix is to move the freemsg() to nb_unloan_fp(), which is synchronized with the state in nbp->nbp_flags.
> ::status [snip] panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff00b9ec78d0 addr=18 occurred in module "nsmb" due to a NULL pointer dereference dump content: kernel pages and pages from PID -1 > ::stack nbssn_recv+0xe8(ffffff3a01535e18, ffffff00b9ec7a88, ffffff00b9ec7a38, ffffff00b9ec7a3f) smb_nbst_recv+0x62(ffffff6c5ff78080, ffffff00b9ec7a88) smb_iod_recv1+0x3b(ffffff6c5ff78080, ffffff00b9ec7ae8) smb_iod_recvall+0x65(ffffff6c5ff78080) smb_iod_vc_work+0xa0(ffffff6c5ff78080, ffffff19b0fa5190) smb_usr_iod_work+0xf4(ffffff19f25e44c8, 80b79f0, 100003, ffffff19b0fa5190) nsmb_ioctl+0xd5(1000001d2ca, 6e7313, 80b79f0, 100003, ffffff19b0fa5190, ffffff00b9ec7d54) cdev_ioctl+0x45(1000001d2ca, 6e7313, 80b79f0, 100003, ffffff19b0fa5190, ffffff00b9ec7d54) spec_ioctl+0x5a(ffffff598484ba40, 6e7313, 80b79f0, 100003, ffffff19b0fa5190, ffffff00b9ec7d54) fop_ioctl+0x7b(ffffff598484ba40, 6e7313, 80b79f0, 100003, ffffff19b0fa5190, ffffff00b9ec7d54) ioctl+0x18e(6, 6e7313, 80b79f0) dtrace_systrace_syscall32+0x11a(6, 6e7313, 80b79f0, 1c3, 1, 0) _sys_sysenter_post_swapgs+0x149()
This can also exhibit in nbssn_peekhdr() and pullupmsg() called by the receiver thread. Examination of nbp->nbp_frag in the stack trace shows it to be NULL, while NBF_SENDLOCK is set in nbp->nbp_flags.
smb_iod_disconnect() checks whether vcp->iod_thr is NULL at smb_iod.c line 188, immediately before calling tsignal(vcp->iod_thr, SIGKILL) at line 190. Code that changes the pointer value (e.g. smb_usr_iod_work()) acquires/surrenders the necessary locks by calling SMB_VC_LOCK/SMB_VC_UNLOCK (see smb_usr.c lines 852-857). It appears that this code needs to do the same to synchronise without racing off to panic.
> ::status [snip] operating system: 5.11 NexentaOS_134f (i86pc) panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0017d17a00 addr=180 occurred in module "genunix" due to a NULL pointer dereference dump content: kernel pages and pages from PID -1 > ::stack tsignal+0x1e(0, 9) smb_iod_disconnect+0x76(ffffff0700f0b700) nsmb_close2+0x4f(ffffff066c2e41d8, ffffff042180a698) nsmb_close+0x4f(10000003131, 3, 2, ffffff042180a698) dev_close+0x3d(10000003131, 3, 2, ffffff042180a698) device_close+0xa2(ffffff040d8ac180, 3, ffffff042180a698) spec_close+0x163(ffffff040d8ac180, 3, 1, 0, ffffff042180a698, 0) fop_close+0x71(ffffff040d8ac180, 3, 1, 0, ffffff042180a698, 0) closef+0x5d(ffffff0414fc7b28) closeandsetf+0x507(c, 0) close+0x18(c) dtrace_systrace_syscall32+0x11a(c, 0, 0, 1c3, 1, 0) _sys_sysenter_post_swapgs+0x149()
Stack trace makes it very apparent that vcp->iod_thr has been set to NULL in race with tsignal() call.
Subsequent experience has shown that these issues exhibits in more recent illumos-gate shipped by OmniTI.