Bug #3986


svc.startd dies in getutxent_frec()

Added by Robert Mustacchi almost 9 years ago. Updated almost 9 years ago.

cmd - userland programs

We've seen a bunch of core dumps from getutxent_frec():

fread+0x3c(fef594a0, 174, 1, 0, 805dba2, fef51000)
getutxent_frec+0x11f(0, fc9a4a40, fc1a6f98, 805dba2, 807e7ac, fedf2000)
getutxent+0x1d(807e7ac, fedf2000, 208, 807e7d0, fc9a4a40, 807e7d0)
fork_sulogin+0x1bb(0, 807e7d0, fc1a6fb8, 806398a, 80943ac, fef51000)
run_sulogin+0xbe(fc9a4a40, 0, fc1a6fe8, feede9ed, 0, 0)
sulogin_thread+0x71(0, 0, 0, 0)
_thrp_setup+0x88(fc9a4a40)
_lwp_start(fc9a4a40, 0, 0, 0, 0, 0)

The process – a child of svc.startd created in fork_sulogin() – died because fd (a static int in libc's getutx.c) was set, but fp (a static FILE * in the same file) was NULL. From the code, this is an inconsistent state, and could only be possible between a lockutx() and unlockutx() – routines that are only called from makeutx() (and always symmetrically). Now, makeutx() is called from svc.startd's utmpx_mark_init(), raising the possibility that another thread in the parent process was in the middle of this routine when the process was forked. Because fork_sulogin() (naturally) creates a child process via fork1(), no other LWP exists in the child process – which makes it difficult to determine where parent LWPs were at the time of the fork.

Fortunately, utmpx_mark_init()'s use of (MT-unsafe) makeutx() acquires a lock (utmpx_lock); if another thread were in this critical section at the time of the fork, we would expect this lock to be held. And indeed, in every dump, it does in fact seem to be owned – and not by the only remaining LWP:

> utmpx_lock::print
{
    __pthread_mutex_flags = {
        __pthread_mutex_flag1 = 0x4
        __pthread_mutex_flag2 = 0
        __pthread_mutex_ceiling = 0
        __pthread_mutex_type = 0x2
        __pthread_mutex_magic = 0x4d58
    }
    __pthread_mutex_lock = {
        __pthread_mutex_lock64 = {
            __pthread_mutex_pad = [ 0, 0, 0, 0, 0, 0, 0, 0x1 ]
        }
        __pthread_mutex_lock32 = {
            __pthread_ownerpid = 0
            __pthread_lockword = 0x1000000
        }
        __pthread_mutex_owner64 = 0x100000000000000
    }
    __pthread_mutex_data = 0xfec46a40
}
> ::walk ulwp

This is the smoking gun: another thread was in the utmpx critical section when we forked, leading to the inconsistency that induced the crash.
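The failure mode can be reproduced in miniature. The following is only a sketch, not svc.startd code: demo_lock stands in for utmpx_lock, and holder_thread() and child_sees_orphaned_lock() are hypothetical names invented for the illustration. One thread parks inside a critical section, a second thread forks, and the child (which contains only the forking thread) checks whether the lock still appears to be held. Strictly speaking, a child forked from a multithreaded process should only call async-signal-safe functions; the pthread_mutex_trylock() call in the child is for demonstration only.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for utmpx_lock; all names in this sketch are hypothetical. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Gate used to park the holder thread inside its critical section. */
static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gate_cv = PTHREAD_COND_INITIALIZER;
static int holder_ready = 0;
static int release_holder = 0;

/* Plays the role of a thread that is in the middle of makeutx(). */
static void *
holder_thread(void *arg)
{
	(void) arg;
	(void) pthread_mutex_lock(&demo_lock);

	(void) pthread_mutex_lock(&gate_lock);
	holder_ready = 1;
	(void) pthread_cond_broadcast(&gate_cv);
	while (!release_holder)
		(void) pthread_cond_wait(&gate_cv, &gate_lock);
	(void) pthread_mutex_unlock(&gate_lock);

	(void) pthread_mutex_unlock(&demo_lock);
	return (NULL);
}

/*
 * Fork while holder_thread() owns demo_lock. Returns 1 if the child saw
 * the lock as held, 0 if it saw it free, and -1 on error.
 */
int
child_sees_orphaned_lock(void)
{
	pthread_t tid;
	pid_t pid;
	int status = 0;

	holder_ready = 0;
	release_holder = 0;
	if (pthread_create(&tid, NULL, holder_thread, NULL) != 0)
		return (-1);

	/* Wait until the holder is definitely inside the critical section. */
	(void) pthread_mutex_lock(&gate_lock);
	while (!holder_ready)
		(void) pthread_cond_wait(&gate_cv, &gate_lock);
	(void) pthread_mutex_unlock(&gate_lock);

	pid = fork();
	if (pid == 0) {
		/* Child: the holder thread is gone, but its lock state was copied. */
		_exit(pthread_mutex_trylock(&demo_lock) != 0 ? 1 : 0);
	}

	/* Parent: let the holder finish, then collect the child's verdict. */
	(void) pthread_mutex_lock(&gate_lock);
	release_holder = 1;
	(void) pthread_cond_broadcast(&gate_cv);
	(void) pthread_mutex_unlock(&gate_lock);
	(void) pthread_join(tid, NULL);

	if (pid < 0 || waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return (-1);
	return (WEXITSTATUS(status));
}
```

In the child, the lock is owned by a thread that no longer exists, which is the same shape as the utmpx_lock dump above: held, but not by the only remaining LWP.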

The fix is straightforward: a utmpx_prefork()/utmpx_postfork() pair should be added to svc.startd, acquiring the lock before the fork and releasing it afterward, so that no thread is in one of the MT-unsafe utmpx routines at the time of the fork.
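A minimal sketch of that shape follows. It is an illustration under stated assumptions, not the committed change: the utmpx_lock here is a local stand-in for the one in svc.startd, fork_with_utmpx_quiesced() is a hypothetical wrapper name, and fork() is used in place of fork1() so the sketch builds on any POSIX system.

```c
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* Local stand-in for svc.startd's utmpx_lock (assumption for this sketch). */
static pthread_mutex_t utmpx_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take the lock so that no other thread can be inside makeutx() at the fork. */
void
utmpx_prefork(void)
{
	(void) pthread_mutex_lock(&utmpx_lock);
}

/* Drop the lock again; run in the parent and in the child after the fork. */
void
utmpx_postfork(void)
{
	(void) pthread_mutex_unlock(&utmpx_lock);
}

/* Hypothetical wrapper showing where the two calls sit around the fork. */
pid_t
fork_with_utmpx_quiesced(void)
{
	pid_t pid;

	utmpx_prefork();
	pid = fork();		/* svc.startd itself calls fork1() */
	utmpx_postfork();	/* also reached in the child, where pid == 0 */

	return (pid);
}
```

Because the forking thread itself owns the lock, the child's single thread is the owner of its copy and can release it safely. The child therefore starts with libc's utmpx state consistent and the lock free.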

Updated by Robert Mustacchi almost 9 years ago

  • Status changed from New to Resolved

Resolved in 0d421f668cdfd7a53019f57234af254738038aa0.

