Bug #9063
Closed
improve procfs exit handling
100%
Description
Here is a summary of the relevant behavior in the two code paths:

proc_exit()
     523  mutex_enter(&p->p_lock);
     572  prlwpexit(t); /* notify /proc */
     586  mutex_exit(&p->p_lock);    <---- XXX
     663  mutex_enter(&pidlock);
     763  mutex_enter(&p->p_lock);
     811  p->p_stat = SZOMB
     816  cdir = PTOU(p)->u_cdir;
     897  mutex_exit(&p->p_lock);
     951  mutex_exit(&pidlock);
     958  if (cdir) VN_RELE(cdir);

pr_lookup_procdir()
    3687  mutex_enter(&pidlock);
    3688  p = prfind(pid)
    3702  mutex_enter(&p->p_lock);
    3703  mutex_exit(&pidlock);
    3742  if (p->p_stat == SZOMB)
    3743          pcp->prc_flags |= PRC_DESTROY;
    3768  adds new vp to p_plist
    3771  mutex_exit(&p->p_lock);
Here is how I think this could happen. Within procfs, the PRC_DESTROY flag controls how much information is available. Normally it is set for a zombie process. When a process already has a /proc vnode, prlwpexit() is used in the exit path to set PRC_DESTROY. However, if a process is exiting with no /proc vnode and has passed my XXX mark above, another thread can then enter pr_lookup_procdir(), create the /proc vnode, and not realize that the process is exiting, because p_stat is not yet set to SZOMB. Once pr_lookup_procdir() returns and drops all of its locks, proc_exit() can continue past my XXX mark and flag the process as a zombie, but by then we have a /proc vnode which does not know the process is a zombie.
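The interleaving described above can be replayed in a small userland model. This is only a sketch: the struct, the MODEL_* names, and the `check_exiting` switch are hypothetical stand-ins rather than the illumos definitions, and the "also check an exiting flag" variant is just one way the lookup could notice an in-progress exit, not necessarily what the fix does.

```c
#include <stdbool.h>

/* Hypothetical userland stand-ins for kernel state (not the illumos types). */
enum model_stat { MODEL_SRUN, MODEL_SZOMB };
#define	MODEL_PRC_DESTROY	0x01

struct model_proc {
	enum model_stat p_stat;	/* models p->p_stat */
	bool exiting;		/* models SEXITING in p->p_flag */
	bool has_prvnode;	/* models a non-empty p_plist */
	int prc_flags;		/* models pcp->prc_flags on the /proc vnode */
};

/* First part of the exit path, up to the XXX mark. */
static void
model_exit_notify(struct model_proc *p)
{
	p->exiting = true;
	/* prlwpexit() can only flag a /proc vnode that already exists. */
	if (p->has_prvnode)
		p->prc_flags |= MODEL_PRC_DESTROY;
}

/*
 * Lookup path. With check_exiting == false this mirrors the summary above:
 * only p_stat == SZOMB sets PRC_DESTROY. With check_exiting == true it also
 * treats an exiting process as gone (an assumed hardening, for illustration).
 */
static void
model_lookup(struct model_proc *p, bool check_exiting)
{
	if (p->p_stat == MODEL_SZOMB || (check_exiting && p->exiting))
		p->prc_flags |= MODEL_PRC_DESTROY;
	p->has_prvnode = true;	/* new vnode added to p_plist */
}

/* Remainder of the exit path, after the XXX mark. */
static void
model_exit_finish(struct model_proc *p)
{
	p->p_stat = MODEL_SZOMB;
}

/*
 * Replays exit-notify -> lookup -> exit-finish and reports whether the
 * /proc vnode ended up flagged with PRC_DESTROY.
 */
static bool
model_race(bool check_exiting)
{
	struct model_proc p = { MODEL_SRUN, false, false, 0 };

	model_exit_notify(&p);
	model_lookup(&p, check_exiting);
	model_exit_finish(&p);
	return ((p.prc_flags & MODEL_PRC_DESTROY) != 0);
}
```

In this model, `model_race(false)` leaves the vnode unflagged even though the process has become a zombie, which is the stale state described above; `model_race(true)` flags it.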
I've tested this by writing two test programs. One loops doing a fork/exit: the parent waits for the child, then forks again. The other loops continuously, reading every /proc/<pid>/cwd. I used a DTrace script to chill on entry to pr_free_watched_pages when it is called in the proc_exit flow. With DTrace I can confirm that I sometimes find a process via prfind in /proc which has p_stat == SRUN but also has the SEXITING bit set in p_flag. However, I was never able to cause a panic running this on a platform from before my fix. That seems consistent with the fact that we have never hit this in JPC and that this bug report is mostly "theoretical" at this point.
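The two test programs are not attached to this report. A minimal sketch of what they might look like follows. The function names and the bounded iteration count are assumptions, and "reading" cwd is approximated here with a plain open(2) and close(2), which may differ from what the original tests did.

```c
#include <ctype.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits immediately, wait for it, then fork again.
 * Returns the number of children successfully reaped.
 */
static int
fork_exit_loop(int iterations)
{
	int reaped = 0;

	for (int i = 0; i < iterations; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;		/* fork failed; stop early */
		if (pid == 0)
			_exit(0);	/* child: exit right away */
		if (waitpid(pid, NULL, 0) == pid)
			reaped++;
	}
	return (reaped);
}

/*
 * One pass over /proc, attempting to open every /proc/<pid>/cwd.
 * Open failures (permissions, process already gone) are expected and
 * ignored. Returns the number of pid entries visited, or -1 if /proc
 * itself cannot be opened.
 */
static int
scan_proc_cwd(void)
{
	DIR *dir = opendir("/proc");
	struct dirent *de;
	int visited = 0;

	if (dir == NULL)
		return (-1);

	while ((de = readdir(dir)) != NULL) {
		char path[288];
		int fd;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;	/* skip non-pid entries */
		(void) snprintf(path, sizeof (path), "/proc/%s/cwd",
		    de->d_name);
		fd = open(path, O_RDONLY);
		if (fd >= 0)
			(void) close(fd);
		visited++;
	}
	(void) closedir(dir);
	return (visited);
}
```

To approximate the test described above, one process would call `fork_exit_loop()` with a large count while another calls `scan_proc_cwd()` in an endless loop.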
I also ran a bunch of things under strace in lx to verify that our ptrace emulation is still working as expected.
Updated by Electric Monk over 4 years ago
- Status changed from New to Closed
git commit 5203e56b6b338ebe19cb5433c609f9f5eb7d12b7
commit 5203e56b6b338ebe19cb5433c609f9f5eb7d12b7
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Date: 2018-08-06T15:56:02.000Z

    9063 improve procfs exit handling
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Reviewed by: Jason King <jason.king@joyent.com>
    Approved by: Joshua M. Clulow <josh@sysmgr.org>