Bug #3986
closed
svc.startd dies in getutxent_frec()
Description
We've seen a bunch of core dumps from getutxent_frec():
libc_hwcap1.so.1`fread+0x3c(fef594a0, 174, 1, 0, 805dba2, fef51000)
libc_hwcap1.so.1`getutxent_frec+0x11f(0, fc9a4a40, fc1a6f98, 805dba2, 807e7ac, fedf2000)
libc_hwcap1.so.1`getutxent+0x1d(807e7ac, fedf2000, 208, 807e7d0, fc9a4a40, 807e7d0)
fork_sulogin+0x1bb(0, 807e7d0, fc1a6fb8, 806398a, 80943ac, fef51000)
run_sulogin+0xbe(fc9a4a40, 0, fc1a6fe8, feede9ed, 0, 0)
sulogin_thread+0x71(0, 0, 0, 0)
libc_hwcap1.so.1`_thrp_setup+0x88(fc9a4a40)
libc_hwcap1.so.1`_lwp_start(fc9a4a40, 0, 0, 0, 0, 0)
The process – a child of svc.startd created in fork_sulogin() – died because fd (a static int in libc's getutx.c) was set, but fp (a static FILE * in the same file) was NULL. From the code, this is an inconsistent state, and could only be possible between a lockutx() and unlockutx() – routines that are only called from makeutx() (and always symmetrically). Now, makeutx() is called from svc.startd's utmpx_mark_init(), raising the possibility that another thread in the parent process was in the middle of this routine when the process was forked. Because fork_sulogin() (naturally) creates a child process via fork1(), no other LWP exists in the child process – which makes it difficult to determine where parent LWPs were at the time of the fork.
Fortunately, utmpx_mark_init()'s use of (MT-unsafe) makeutx() acquires a lock (utmpx_lock); if another thread were in this critical section at the time of the fork, we would expect this lock to be held. And indeed, in every dump, it does in fact seem to be owned – and not by the only remaining LWP:
> utmpx_lock::print
{
    __pthread_mutex_flags = {
        __pthread_mutex_flag1 = 0x4
        __pthread_mutex_flag2 = 0
        __pthread_mutex_ceiling = 0
        __pthread_mutex_type = 0x2
        __pthread_mutex_magic = 0x4d58
    }
    __pthread_mutex_lock = {
        __pthread_mutex_lock64 = {
            __pthread_mutex_pad = [ 0, 0, 0, 0, 0, 0, 0, 0x1 ]
        }
        __pthread_mutex_lock32 = {
            __pthread_ownerpid = 0
            __pthread_lockword = 0x1000000
        }
        __pthread_mutex_owner64 = 0x100000000000000
    }
    __pthread_mutex_data = 0xfec46a40
}
> ::walk ulwp
fec4b240
This is the smoking gun: another thread was in the utmpx critical section when we forked, leading to the inconsistency that induced the crash.
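The failure mode can be reproduced in isolation. The following is a minimal sketch, not code from svc.startd: demo_lock, holder() and child_sees_orphaned_lock() are hypothetical names, the semaphores exist only to make the timing deterministic, and fork() stands in for fork1(). It assumes a default (non-recursive, non-robust) mutex, for which a trylock on a lock whose owner was not carried into the child is expected to report EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for utmpx_lock. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t in_critical;	/* posted once holder() owns demo_lock */
static sem_t release;		/* posted to let holder() leave */

/* Plays the part of a thread caught inside the critical section. */
static void *
holder(void *arg)
{
	int r;

	(void) arg;
	r = pthread_mutex_lock(&demo_lock);
	assert(r == 0);
	(void) r;
	(void) sem_post(&in_critical);
	while (sem_wait(&release) != 0 && errno == EINTR)
		continue;
	(void) pthread_mutex_unlock(&demo_lock);
	return (NULL);
}

/*
 * Forks while holder() owns demo_lock.  Returns 1 if the child found
 * the lock held (by a thread that does not exist in the child), 0 if
 * it did not, and -1 on error.
 */
static int
child_sees_orphaned_lock(void)
{
	pthread_t tid;
	pid_t pid;
	int status = 0;

	if (sem_init(&in_critical, 0, 0) != 0 || sem_init(&release, 0, 0) != 0)
		return (-1);
	if (pthread_create(&tid, NULL, holder, NULL) != 0)
		return (-1);
	while (sem_wait(&in_critical) != 0) {
		if (errno != EINTR)
			return (-1);
	}

	pid = fork();	/* only the calling thread is copied */
	if (pid == 0)
		_exit(pthread_mutex_trylock(&demo_lock) == EBUSY ? 1 : 0);

	/* Parent: let holder() finish, then collect the child's verdict. */
	(void) sem_post(&release);
	(void) pthread_join(tid, NULL);
	if (pid < 0 || waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return (-1);
	return (WEXITSTATUS(status));
}
```

In the child, the lock is held but its owner is gone, so whatever state the lock was protecting (here nothing; in the dumps, libc's fd and fp) is frozen mid-update.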
The fix is straightforward: a utmpx_prefork()/utmpx_postfork() pair should be added to svc.startd to acquire and release the lock around the fork, ensuring that no thread is in one of the MT-unsafe utmpx routines at the time of the fork.
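A minimal sketch of that pattern follows. It is an illustration under stated assumptions rather than the committed change: the bodies of utmpx_prefork() and utmpx_postfork() are guessed from the description above, fork_quiesced() is a hypothetical wrapper, the local utmpx_lock merely stands in for the one in svc.startd, and fork() is used in place of fork1().

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for svc.startd's utmpx_lock (assumption for this sketch). */
static pthread_mutex_t utmpx_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take the lock so that no other thread can be inside makeutx(). */
static void
utmpx_prefork(void)
{
	int r = pthread_mutex_lock(&utmpx_lock);

	assert(r == 0);
	(void) r;
}

/* Drop the lock again; called in both the parent and the child. */
static void
utmpx_postfork(void)
{
	int r = pthread_mutex_unlock(&utmpx_lock);

	assert(r == 0);
	(void) r;
}

/*
 * Hypothetical wrapper: fork with the utmpx critical section known to
 * be empty.  The forking thread owns the lock across the fork, so its
 * copy in the child owns it too and may legitimately release it.
 */
static pid_t
fork_quiesced(void)
{
	pid_t pid;

	utmpx_prefork();
	pid = fork();
	utmpx_postfork();	/* runs in parent and, if created, child */
	return (pid);
}
```

Because the forking thread itself holds the lock, the child starts with libc's utmpx state consistent and the lock released, which is the same discipline pthread_atfork() handlers are designed to support.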
Updated by Robert Mustacchi over 10 years ago
- Status changed from New to Resolved
Resolved in 0d421f668cdfd7a53019f57234af254738038aa0.