Project

General

Profile

Actions

Bug #4996

closed

rtld _init race leads to incorrect symbol values

Added by Robert Mustacchi over 9 years ago. Updated almost 9 years ago.

Status:
Closed
Priority:
Normal
Category:
lib - userland libraries
Start date:
2014-07-11
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

Various cores in thoth with basically the same stack:

DUMP             TYPE  TIME                NODE/CMD         TICKET
ef55ec95b78fdbb3 core  2013-10-09T17:55:55 ldap_cachemgr    -
3d3c2e296af9dbc8 core  2013-10-09T17:55:55 ldap_cachemgr    -
a7add25a7727d4b4 core  2013-10-09T17:55:55 ldap_cachemgr    -
c7ecf9e7b9e00484 core  2013-10-09T17:59:04 ldap_cachemgr    -
958e7fe535695b41 core  2013-12-18T18:48:27 ldap_cachemgr    -
856e54fb3643bc21 core  2013-12-18T18:48:27 ldap_cachemgr    -
bb1daba63b48f01b core  2013-12-18T18:48:27 ldap_cachemgr    -
32542daafaf48938 core  2013-12-18T18:48:50 ldap_cachemgr    -
8a105ece703f1ea4 core  2013-12-18T18:49:10 ldap_cachemgr    -
5f8a43789ce82f9b core  2014-01-07T19:57:46 ldap_cachemgr    -
7193aabd99748f13 core  2014-01-13T15:50:05 ldap_cachemgr    -
5733913860d4080f core  2014-01-30T20:22:42 ldap_cachemgr    -
ab0266445715d355 core  2014-01-30T20:22:42 ldap_cachemgr    -

Stacks look mostly like:

libc.so.1`mutex_lock_impl+0x24()
libc.so.1`mutex_lock+0x19()
libnspr4.so`PR_Lock+0x1d()
libnspr4.so`PR_GetCurrentThread+0x96()
libnspr4.so`PR_SetError+0x18()
libnspr4.so`PR_NewMonitor+0xca()
libnss3.so`OCSP_InitGlobal+0xc9()
libnss3.so`nss_Init+0x9eb()
libnss3.so`NSS_Init+0x45()
libldap.so.5`ldapssl_clientauth_init+0x88()
libldap.so.5`ldapssl_client_init+0x24()
libsldap.so.1`createTLSSession+0xe5()
libsldap.so.1`createSession+0x103()
libsldap.so.1`openConnection+0x27a()
libsldap.so.1`makeConnection+0x290()
libsldap.so.1`getConnection+0x5a4()
libsldap.so.1`__s_api_getConnection+0x36()
libsldap.so.1`__ns_ldap_getRootDSE+0xd8()
getldap_get_rootDSE+0xd8()
libc.so.1`_thrp_setup+0x88()
libc.so.1`_lwp_start()

The problem here is a race between two initialisation threads:

stack pointer for thread 7: fe69f378
[ fe69f378 libc.so.1`mutex_lock_impl+0x24() ]
  fe69f398 libc.so.1`mutex_lock+0x19(0, 8156f60, 0, fe482a3b, fe49eaf8, 8156f60)
  fe69f3b8 libnspr4.so`PR_Lock+0x1d(0, 8156f60, feedcd4f, fe487261, fe49eaf8, 8156f60)
  fe69f3d8 libnspr4.so`PR_GetCurrentThread+0x96(fe69f428, fe49eaf8, fe69f428, fe482487, ffffe8b3, 0)
  fe69f3e8 libnspr4.so`PR_SetError+0x18(ffffe8b3, 0, 0, fe4823cb, fe69f420, fefc37a0)
  fe69f428 libnspr4.so`PR_NewMonitor+0xca(fe4125b0, 0, fe69f4a8, fe32b3df, fef54514, 0)
  fe69f438 libnss3.so`OCSP_InitGlobal+0xc9(fef54514, 0, fe440808, febd80a3, 0, febd9fd4)
  fe69f4a8 libnss3.so`nss_Init+0x9eb(813eec8, fe3fda22, fe3fda22, fe3fe4f0, fe3fda22, fe3fda22)
  fe69f508 libnss3.so`NSS_Init+0x45(813eec8, fec1b000, fe69f548, febfe3b6, fe69f540, fefc37a0)
  fe69f548 libldap.so.5`ldapssl_clientauth_init+0x88(813eec8, 0, 0, 0, 0, 7530)
  fe69f578 libldap.so.5`ldapssl_client_init+0x24(813eec8, 0, 1, 7530, 0, 0)
  fe69fd88 libsldap.so.1`createTLSSession+0xe5(8150e48, 8150df8, 27c, 7530, fe6a0dac, fe6a27d8)
  fe6a0598 libsldap.so.1`createSession+0x103(8150e48, 8150df8, 27c, 7530, fe6a0dac, fe6a27d8)
  fe6a0dc8 libsldap.so.1`openConnection+0x27a(fe6a0e38, 8150df8, 8150e48, 1e, fe6a27d8, 0)
  fe6a1638 libsldap.so.1`makeConnection+0x290(fe6a16d4, 8150ee8, 8150e48, fe6a1fac, 1e, fe6a27d8)
  fe6a1ec8 libsldap.so.1`getConnection+0x5a4(8150ee8, 400, 0, fe6a1fac, fe6a1fa8, fe6a27d8)
  fe6a1f18 libsldap.so.1`__s_api_getConnection+0x36(8150ee8, 400, 0, fe6a1fac, fe6a1fa8, fe6a27d8)
  fe6a2798 libsldap.so.1`__ns_ldap_getRootDSE+0xd8(8150ee8, fe6a2fac, fe6a27d8, 1, 813fe8c, 0)
  fe6a2fc8 getldap_get_rootDSE+0xd8(813fe70, 0, 0, 0)
  fe6a2fe8 libc.so.1`_thrp_setup+0x88(fec22a40)
  fe6a2ff8 libc.so.1`_lwp_start(fec22a40, 0, 0, 0, 0, 0)
stack pointer for thread 8: fe5a0378
[ fe5a0378 libnspr4.so`_PR_InitTPD() ]
  fe5a0388 libnspr4.so`PR_Init+0xb(0, 1, 0, fec04442, fe5a03e0, feffb0a4)
  fe5a03d8 libldap.so.5`prldap_nspr_init+0x78(feffb0a4, fefcee86, feffb0a4, febc0018, fe5a0440, fefd2777)
  fe5a03f0 libldap.so.5`_init+0x20(feffc8c0, 0, 0, fefd26a0, fed806c0, febc0718)
  fe5a0440 ld.so.1`call_init+0xe5(febc0710, 1, 2, fefd289c)
  fe5a0480 ld.so.1`load_completion+0xd5(fed806c0, febc0018, 2, 0)
  fe5a0550 ld.so.1`elf_bndr+0x356(fed80018, 378, fedaad15)
  fe5a0570 ld.so.1`elf_rtbndr+0x14(378, fedaad15, 0, 0, 1, 7530)
  fe5a0d88 0xfed80018(8150e70, 8150e20, 27c, 7530, fe5a1dac, fe5a37d8)
  fe5a1598 libsldap.so.1`createSession+0x103(8150e70, 8150e20, 27c, 7530, fe5a1dac, fe5a37d8)
  fe5a1dc8 libsldap.so.1`openConnection+0x27a(fe5a1e38, 8150e20, 8150e70, 1e, fe5a37d8, 0)
  fe5a2638 libsldap.so.1`makeConnection+0x290(fe5a26d4, 8150e98, 8150e70, fe5a2fac, 1e, fe5a37d8)
  fe5a2ec8 libsldap.so.1`getConnection+0x5a4(8150e98, 400, 0, fe5a2fac, fe5a2fa8, fe5a37d8)
  fe5a2f18 libsldap.so.1`__s_api_getConnection+0x36(8150e98, 400, 0, fe5a2fac, fe5a2fa8, fe5a37d8)
  fe5a3798 libsldap.so.1`__ns_ldap_getRootDSE+0xd8(8150e98, fe5a3fac, fe5a37d8, 1, 813fdec, 0)
  fe5a3fc8 getldap_get_rootDSE+0xd8(813fdd0, 0, 0, 0)
  fe5a3fe8 libc.so.1`_thrp_setup+0x88(fec23240)
  fe5a3ff8 libc.so.1`_lwp_start(fec23240, 0, 0, 0, 0, 0)
> 

In PR_InitStuff(), we have:

static void _PR_InitStuff(void)
{

    if (_pr_initialized) return;
    _pr_initialized = PR_TRUE;
    …

This is itself racy, but that's not the problem we're hitting. Thread 8 has set this flag, and is in the midst of performing initialisation. However, it has not yet initialised pt_book, which means the lock taken by thread 7 in PR_GetCurrentThread is NULL:

> pt_book/XXXXXXXX
libnspr4.so`pt_book:
libnspr4.so`pt_book:            0               0               0               0               0               
                0               0               0               

At some level, this seems straightforward; the obvious "solution" would be to add logic to _PR_ImplicitInitialization() that waits on a thread performing initialisation. Moving to such a 3-state regime would avoid this problem. But the whole point of nspr is that until it's initialised, there are is no way to take locks or wait on something; that's possible only after the various internal structures are set up. More curious is the fact that thread 7 seems to be in libldap.so.5 even though its constructor hasn't completed. That constructor is still running in thread 8 (and has never been run in thread 7, which we know because its nspr_initialized is still clear). This seems like a clear violation of the dynamic linking contract: all ctors are supposed to have completed before any function in the loaded object can be called via PLT or dlsym(). What's going on here?

First, note that for whatever reason both libsldap.so.1 and libldap.so.5 are being lazy-loaded:

[root@headnode (sf-3) ~]# elfdump -d /usr/lib/ldap/ldap_cachemgr 

Dynamic Section:  .dynamic
     index  tag                value
       [0]  NEEDED            0xe0e               libumem.so.1
       [1]  POSFLAG_1         0x1                 [ LAZY ]
       [2]  NEEDED            0xe24               libsldap.so.1
       [3]  NEEDED            0xe52               libc.so.1
       [4]  INIT              0x805aab0           
       [5]  FINI              0x805aae0           
       [6]  HASH              0x8050430           
       [7]  STRTAB            0x8051b00           
       [8]  STRSZ             0x10b7              
       [9]  SYMTAB            0x8050f10           
      [10]  SYMENT            0x10                
      [11]  SUNW_SYMTAB       0x8050a30           
      [12]  SUNW_SYMSZ        0x10d0              
      [13]  SUNW_SORTENT      0x4                 
      [14]  SUNW_SYMSORT      0x8052e48           
      [15]  SUNW_SYMSORTSZ    0x19c               
      [16]  CHECKSUM          0x8e06              
      [17]  VERNEED           0x8052bb8           
      [18]  VERNEEDNUM        0x3                 
      [19]  PLTRELSZ          0x460               
      [20]  PLTREL            0x11                
      [21]  JMPREL            0x805301c           
      [22]  REL               0x8052fe4           
      [23]  RELSZ             0x498               
      [24]  RELENT            0x8                 
      [25]  SYMINFO           0x8050134           
      [26]  SYMINSZ           0x2fc               
      [27]  SYMINENT          0x4                 
      [28]  DEBUG             0                   
      [29]  FLAGS             0                   0
      [30]  FLAGS_1           0                   0
      [31]  SUNW_STRPAD       0x200               
      [32]  SUNW_LDMACH       0x3e                EM_AMD64
      [33]  PLTGOT            0x806e000           
   [34-44]  NULL              0                   
[root@headnode (sf-3) ~]# elfdump -d /usr/lib/libsldap.so.1 

Dynamic Section:  .dynamic
     index  tag                value
       [0]  POSFLAG_1         0x1                 [ LAZY ]
       [1]  NEEDED            0x2819              libnsl.so.1
       [2]  POSFLAG_1         0x1                 [ LAZY ]
       [3]  NEEDED            0x2837              libldap.so.5
       [4]  POSFLAG_1         0x1                 [ LAZY ]
       [5]  NEEDED            0x2861              libscf.so.1
       [6]  NEEDED            0x2876              libc.so.1
       [7]  INIT              0x300d0             
       [8]  FINI              0x30100             
       [9]  SONAME            0x1                 libsldap.so.1
      [10]  HASH              0x690               
      [11]  STRTAB            0x3714              
      [12]  STRSZ             0x2ac8              
      [13]  SYMTAB            0x2634              
      [14]  SYMENT            0x10                
      [15]  SUNW_SYMTAB       0xf04               
      [16]  SUNW_SYMSZ        0x2810              
      [17]  SUNW_SORTENT      0x4                 
      [18]  SUNW_SYMSORT      0x65b4              
      [19]  SUNW_SYMSORTSZ    0x69c               
      [20]  CHECKSUM          0x5b80              
      [21]  VERDEF            0x633c              
      [22]  VERDEFNUM         0x3                 
      [23]  VERNEED           0x61dc              
      [24]  VERNEEDNUM        0x4                 
      [25]  RELCOUNT          0x142               
      [26]  PLTRELSZ          0x6e0               
      [27]  PLTREL            0x11                
      [28]  JMPREL            0x76b8              
      [29]  REL               0x6c50              
      [30]  RELSZ             0x1148              
      [31]  RELENT            0x8                 
      [32]  SYMINFO           0x258               
      [33]  SYMINSZ           0x438               
      [34]  SYMINENT          0x4                 
      [35]  FLAGS             0                   0
      [36]  FLAGS_1           0                   0
      [37]  SUNW_STRPAD       0x200               
      [38]  SUNW_LDMACH       0x3e                EM_AMD64
      [39]  PLTGOT            0x45000             
   [40-50]  NULL              0                   

Because of this, there is no guarantee that the ctors are run from thread 1 at startup time; instead, they will be run at the time the first reference is made to a symbol in each object. It turns out that in this case rtld provides no guarantees about the completion of the ctor with respect to a reference from another thread; the relevant code helpfully emits an LD_DEBUG message when this occurs before proceeding to resolve the reference:

        if ((FLAGS(dlmp) & (FLG_RT_RELOCED | FLG_RT_INITCALL)) ==
            (FLG_RT_RELOCED | FLG_RT_INITCALL)) {
                DBG_CALL(Dbg_util_no_init(dlmp));
                return;
        }

        if ((tobj = calloc(2, sizeof (Rt_map *))) != NULL) {
                tobj[0] = dlmp;
                call_init(tobj, DBG_INIT_DYN);
        }

In this manner we correctly avoid calling the ctor twice, but have no assurance that it has completed before proceeding. In fact, this is exactly our problem. This is trivial to reproduce, too:

libfoo.c:

#include <stdio.h>
#include <unistd.h>

static int blargh;
static void init_foo(void) __attribute__((constructor));

int
foo(int x)
{
    (void) printf("foo %d\\n", blargh);
    return (x);
}

static void
init_foo(void)
{
    sleep(1);
    blargh = 17;
}

consumer bar.c:

#include <pthread.h>
#include <stdlib.h>

extern int foo(int);

static void *
call_foo(void *x)
{
    (void) foo(0);
}

int
main(void)
{
    pthread_t a, b;

    (void) pthread_create(&a, NULL, call_foo, NULL);
    (void) pthread_create(&b, NULL, call_foo, NULL);

    (void) pthread_join(a, NULL);
    (void) pthread_join(b, NULL);

    exit(0);
}

Run it:

# LD_DEBUG=init ./bar
debug: 
debug: Solaris Linkers: 5.11-1.1748 (illumos)
debug: 
55804: 
55804: platform capability (CA_SUNW_PLAT) - i86pc
55804: machine capability (CA_SUNW_MACH) - i86pc
55804: hardware capabilities (CA_SUNW_HW_1) - 0x7dd75c6f  [ VMX AVX XSAVE PCLMULQDQ AES SSE4.2 SSE4.1 SSSE3 POPCNT TSCP AHF CX16 SSE3 SSE2 SSE FXSR MMX CMOV SEP CX8 TSC FPU ]
55804: 
55804: 
55804: configuration file=/var/ld/ld.config: [ ELF-DEFAULT-LIBPATH ELF-SECURE-LIBPATH ]
55804: 
55804: 
55804: 1: calling .init (from sorted order): /lib/libc.so.1
55804: 1: 
55804: 1: calling .init (done): /lib/libc.so.1
55804: 1: 
55804: 1: 
55804: 1: transferring control: bar
55804: 1: 
55804: 2: 
55804: 2: calling .init (from sorted order): /opt/local/gcc47/runtime/lib/./libgcc_s.so.1
55804: 2: 
55804: 3: calling .init (dynamically triggered): ./libfoo.so.1
55804: 3: 
55804: 3: warning: calling /opt/local/gcc47/runtime/lib/./libgcc_s.so.1 whose init has not completed
55804: 3: 
55804: 2: calling .init (done): /opt/local/gcc47/runtime/lib/./libgcc_s.so.1
55804: 2: 
55804: 2: warning: calling ./libfoo.so.1 whose init has not completed
55804: 2: 
foo 0
55804: 2: 
55804: 2: ld.so.1: bar: fatal: _ex_unwind: can't find symbol
55804: 2: 
55804: 2: 
55804: 3: calling .init (done): ./libfoo.so.1
55804: 3: 
foo 17
55804: 3: 
55804: 3: ld.so.1: bar: fatal: _ex_unwind: can't find symbol
55804: 3: 
55804: 3: 
55804: 1: calling .fini: ./libfoo.so.1
55804: 1: 
55804: 1: calling .fini: /opt/local/gcc47/runtime/lib/./libgcc_s.so.1
55804: 1: 
55804: 1: calling .fini: /lib/libc.so.1
55804: 1: 

Note that the second thread to resolve foo() gets the wrong value, because the ctor has not completed, and the telltale linker warning "calling ./libfoo.so.1 whose init has not completed".

It would seem that the simplest solution to this problem would be to change how ldap_cachemgr and libsldap.so.1 are built, so that libldap.so.5 is not lazy-loaded. Doing so might slightly increase the startup time for this utility and any others linked with these libraries, but would ensure correctness. In fact, no multi-threaded consumer of libldap.so.5 or any other consumer of the nspr libraries should ever be allowed to lazy-load it.

--

However, while we could restrict the use of libldap.so.5 or this general class of problems, we instead add a bit logic to rtld which will allow the right thing to happen in the case of lazy loaded libraries.

Actions

Also available in: Atom PDF