Bug #4996
closedrtld _init race leads to incorrect symbol values
100%
Description
Various cores in thoth with basically the same stack:
DUMP TYPE TIME NODE/CMD TICKET ef55ec95b78fdbb3 core 2013-10-09T17:55:55 ldap_cachemgr - 3d3c2e296af9dbc8 core 2013-10-09T17:55:55 ldap_cachemgr - a7add25a7727d4b4 core 2013-10-09T17:55:55 ldap_cachemgr - c7ecf9e7b9e00484 core 2013-10-09T17:59:04 ldap_cachemgr - 958e7fe535695b41 core 2013-12-18T18:48:27 ldap_cachemgr - 856e54fb3643bc21 core 2013-12-18T18:48:27 ldap_cachemgr - bb1daba63b48f01b core 2013-12-18T18:48:27 ldap_cachemgr - 32542daafaf48938 core 2013-12-18T18:48:50 ldap_cachemgr - 8a105ece703f1ea4 core 2013-12-18T18:49:10 ldap_cachemgr - 5f8a43789ce82f9b core 2014-01-07T19:57:46 ldap_cachemgr - 7193aabd99748f13 core 2014-01-13T15:50:05 ldap_cachemgr - 5733913860d4080f core 2014-01-30T20:22:42 ldap_cachemgr - ab0266445715d355 core 2014-01-30T20:22:42 ldap_cachemgr -
Stacks look mostly like:
libc.so.1`mutex_lock_impl+0x24() libc.so.1`mutex_lock+0x19() libnspr4.so`PR_Lock+0x1d() libnspr4.so`PR_GetCurrentThread+0x96() libnspr4.so`PR_SetError+0x18() libnspr4.so`PR_NewMonitor+0xca() libnss3.so`OCSP_InitGlobal+0xc9() libnss3.so`nss_Init+0x9eb() libnss3.so`NSS_Init+0x45() libldap.so.5`ldapssl_clientauth_init+0x88() libldap.so.5`ldapssl_client_init+0x24() libsldap.so.1`createTLSSession+0xe5() libsldap.so.1`createSession+0x103() libsldap.so.1`openConnection+0x27a() libsldap.so.1`makeConnection+0x290() libsldap.so.1`getConnection+0x5a4() libsldap.so.1`__s_api_getConnection+0x36() libsldap.so.1`__ns_ldap_getRootDSE+0xd8() getldap_get_rootDSE+0xd8() libc.so.1`_thrp_setup+0x88() libc.so.1`_lwp_start()
The problem here is a race between two initialisation threads:
stack pointer for thread 7: fe69f378 [ fe69f378 libc.so.1`mutex_lock_impl+0x24() ] fe69f398 libc.so.1`mutex_lock+0x19(0, 8156f60, 0, fe482a3b, fe49eaf8, 8156f60) fe69f3b8 libnspr4.so`PR_Lock+0x1d(0, 8156f60, feedcd4f, fe487261, fe49eaf8, 8156f60) fe69f3d8 libnspr4.so`PR_GetCurrentThread+0x96(fe69f428, fe49eaf8, fe69f428, fe482487, ffffe8b3, 0) fe69f3e8 libnspr4.so`PR_SetError+0x18(ffffe8b3, 0, 0, fe4823cb, fe69f420, fefc37a0) fe69f428 libnspr4.so`PR_NewMonitor+0xca(fe4125b0, 0, fe69f4a8, fe32b3df, fef54514, 0) fe69f438 libnss3.so`OCSP_InitGlobal+0xc9(fef54514, 0, fe440808, febd80a3, 0, febd9fd4) fe69f4a8 libnss3.so`nss_Init+0x9eb(813eec8, fe3fda22, fe3fda22, fe3fe4f0, fe3fda22, fe3fda22) fe69f508 libnss3.so`NSS_Init+0x45(813eec8, fec1b000, fe69f548, febfe3b6, fe69f540, fefc37a0) fe69f548 libldap.so.5`ldapssl_clientauth_init+0x88(813eec8, 0, 0, 0, 0, 7530) fe69f578 libldap.so.5`ldapssl_client_init+0x24(813eec8, 0, 1, 7530, 0, 0) fe69fd88 libsldap.so.1`createTLSSession+0xe5(8150e48, 8150df8, 27c, 7530, fe6a0dac, fe6a27d8) fe6a0598 libsldap.so.1`createSession+0x103(8150e48, 8150df8, 27c, 7530, fe6a0dac, fe6a27d8) fe6a0dc8 libsldap.so.1`openConnection+0x27a(fe6a0e38, 8150df8, 8150e48, 1e, fe6a27d8, 0) fe6a1638 libsldap.so.1`makeConnection+0x290(fe6a16d4, 8150ee8, 8150e48, fe6a1fac, 1e, fe6a27d8) fe6a1ec8 libsldap.so.1`getConnection+0x5a4(8150ee8, 400, 0, fe6a1fac, fe6a1fa8, fe6a27d8) fe6a1f18 libsldap.so.1`__s_api_getConnection+0x36(8150ee8, 400, 0, fe6a1fac, fe6a1fa8, fe6a27d8) fe6a2798 libsldap.so.1`__ns_ldap_getRootDSE+0xd8(8150ee8, fe6a2fac, fe6a27d8, 1, 813fe8c, 0) fe6a2fc8 getldap_get_rootDSE+0xd8(813fe70, 0, 0, 0) fe6a2fe8 libc.so.1`_thrp_setup+0x88(fec22a40) fe6a2ff8 libc.so.1`_lwp_start(fec22a40, 0, 0, 0, 0, 0) stack pointer for thread 8: fe5a0378 [ fe5a0378 libnspr4.so`_PR_InitTPD() ] fe5a0388 libnspr4.so`PR_Init+0xb(0, 1, 0, fec04442, fe5a03e0, feffb0a4) fe5a03d8 libldap.so.5`prldap_nspr_init+0x78(feffb0a4, fefcee86, feffb0a4, febc0018, fe5a0440, fefd2777) fe5a03f0 libldap.so.5`_init+0x20(feffc8c0, 0, 0, fefd26a0, fed806c0, febc0718) fe5a0440 ld.so.1`call_init+0xe5(febc0710, 1, 2, fefd289c) fe5a0480 ld.so.1`load_completion+0xd5(fed806c0, febc0018, 2, 0) fe5a0550 ld.so.1`elf_bndr+0x356(fed80018, 378, fedaad15) fe5a0570 ld.so.1`elf_rtbndr+0x14(378, fedaad15, 0, 0, 1, 7530) fe5a0d88 0xfed80018(8150e70, 8150e20, 27c, 7530, fe5a1dac, fe5a37d8) fe5a1598 libsldap.so.1`createSession+0x103(8150e70, 8150e20, 27c, 7530, fe5a1dac, fe5a37d8) fe5a1dc8 libsldap.so.1`openConnection+0x27a(fe5a1e38, 8150e20, 8150e70, 1e, fe5a37d8, 0) fe5a2638 libsldap.so.1`makeConnection+0x290(fe5a26d4, 8150e98, 8150e70, fe5a2fac, 1e, fe5a37d8) fe5a2ec8 libsldap.so.1`getConnection+0x5a4(8150e98, 400, 0, fe5a2fac, fe5a2fa8, fe5a37d8) fe5a2f18 libsldap.so.1`__s_api_getConnection+0x36(8150e98, 400, 0, fe5a2fac, fe5a2fa8, fe5a37d8) fe5a3798 libsldap.so.1`__ns_ldap_getRootDSE+0xd8(8150e98, fe5a3fac, fe5a37d8, 1, 813fdec, 0) fe5a3fc8 getldap_get_rootDSE+0xd8(813fdd0, 0, 0, 0) fe5a3fe8 libc.so.1`_thrp_setup+0x88(fec23240) fe5a3ff8 libc.so.1`_lwp_start(fec23240, 0, 0, 0, 0, 0) >
In PR_InitStuff(), we have:
static void _PR_InitStuff(void) { if (_pr_initialized) return; _pr_initialized = PR_TRUE; …
This is itself racy, but that's not the problem we're hitting. Thread 8 has set this flag, and is in the midst of performing initialisation. However, it has not yet initialised pt_book, which means the lock taken by thread 7 in PR_GetCurrentThread is NULL:
> pt_book/XXXXXXXX libnspr4.so`pt_book: libnspr4.so`pt_book: 0 0 0 0 0 0 0 0
At some level, this seems straightforward; the obvious "solution" would be to add logic to _PR_ImplicitInitialization() that waits on a thread performing initialisation. Moving to such a 3-state regime would avoid this problem. But the whole point of nspr is that until it's initialised, there are is no way to take locks or wait on something; that's possible only after the various internal structures are set up. More curious is the fact that thread 7 seems to be in libldap.so.5 even though its constructor hasn't completed. That constructor is still running in thread 8 (and has never been run in thread 7, which we know because its nspr_initialized is still clear). This seems like a clear violation of the dynamic linking contract: all ctors are supposed to have completed before any function in the loaded object can be called via PLT or dlsym(). What's going on here?
First, note that for whatever reason both libsldap.so.1 and libldap.so.5 are being lazy-loaded:
[root@headnode (sf-3) ~]# elfdump -d /usr/lib/ldap/ldap_cachemgr Dynamic Section: .dynamic index tag value [0] NEEDED 0xe0e libumem.so.1 [1] POSFLAG_1 0x1 [ LAZY ] [2] NEEDED 0xe24 libsldap.so.1 [3] NEEDED 0xe52 libc.so.1 [4] INIT 0x805aab0 [5] FINI 0x805aae0 [6] HASH 0x8050430 [7] STRTAB 0x8051b00 [8] STRSZ 0x10b7 [9] SYMTAB 0x8050f10 [10] SYMENT 0x10 [11] SUNW_SYMTAB 0x8050a30 [12] SUNW_SYMSZ 0x10d0 [13] SUNW_SORTENT 0x4 [14] SUNW_SYMSORT 0x8052e48 [15] SUNW_SYMSORTSZ 0x19c [16] CHECKSUM 0x8e06 [17] VERNEED 0x8052bb8 [18] VERNEEDNUM 0x3 [19] PLTRELSZ 0x460 [20] PLTREL 0x11 [21] JMPREL 0x805301c [22] REL 0x8052fe4 [23] RELSZ 0x498 [24] RELENT 0x8 [25] SYMINFO 0x8050134 [26] SYMINSZ 0x2fc [27] SYMINENT 0x4 [28] DEBUG 0 [29] FLAGS 0 0 [30] FLAGS_1 0 0 [31] SUNW_STRPAD 0x200 [32] SUNW_LDMACH 0x3e EM_AMD64 [33] PLTGOT 0x806e000 [34-44] NULL 0 [root@headnode (sf-3) ~]# elfdump -d /usr/lib/libsldap.so.1 Dynamic Section: .dynamic index tag value [0] POSFLAG_1 0x1 [ LAZY ] [1] NEEDED 0x2819 libnsl.so.1 [2] POSFLAG_1 0x1 [ LAZY ] [3] NEEDED 0x2837 libldap.so.5 [4] POSFLAG_1 0x1 [ LAZY ] [5] NEEDED 0x2861 libscf.so.1 [6] NEEDED 0x2876 libc.so.1 [7] INIT 0x300d0 [8] FINI 0x30100 [9] SONAME 0x1 libsldap.so.1 [10] HASH 0x690 [11] STRTAB 0x3714 [12] STRSZ 0x2ac8 [13] SYMTAB 0x2634 [14] SYMENT 0x10 [15] SUNW_SYMTAB 0xf04 [16] SUNW_SYMSZ 0x2810 [17] SUNW_SORTENT 0x4 [18] SUNW_SYMSORT 0x65b4 [19] SUNW_SYMSORTSZ 0x69c [20] CHECKSUM 0x5b80 [21] VERDEF 0x633c [22] VERDEFNUM 0x3 [23] VERNEED 0x61dc [24] VERNEEDNUM 0x4 [25] RELCOUNT 0x142 [26] PLTRELSZ 0x6e0 [27] PLTREL 0x11 [28] JMPREL 0x76b8 [29] REL 0x6c50 [30] RELSZ 0x1148 [31] RELENT 0x8 [32] SYMINFO 0x258 [33] SYMINSZ 0x438 [34] SYMINENT 0x4 [35] FLAGS 0 0 [36] FLAGS_1 0 0 [37] SUNW_STRPAD 0x200 [38] SUNW_LDMACH 0x3e EM_AMD64 [39] PLTGOT 0x45000 [40-50] NULL 0
Because of this, there is no guarantee that the ctors are run from thread 1 at startup time; instead, they will be run at the time the first reference is made to a symbol in each object. It turns out that in this case rtld provides no guarantees about the completion of the ctor with respect to a reference from another thread; the relevant code helpfully emits an LD_DEBUG message when this occurs before proceeding to resolve the reference:
if ((FLAGS(dlmp) & (FLG_RT_RELOCED | FLG_RT_INITCALL)) == (FLG_RT_RELOCED | FLG_RT_INITCALL)) { DBG_CALL(Dbg_util_no_init(dlmp)); return; } if ((tobj = calloc(2, sizeof (Rt_map *))) != NULL) { tobj[0] = dlmp; call_init(tobj, DBG_INIT_DYN); }
In this manner we correctly avoid calling the ctor twice, but have no assurance that it has completed before proceeding. In fact, this is exactly our problem. This is trivial to reproduce, too:
libfoo.c: #include <stdio.h> #include <unistd.h> static int blargh; static void init_foo(void) __attribute__((constructor)); int foo(int x) { (void) printf("foo %d\\n", blargh); return (x); } static void init_foo(void) { sleep(1); blargh = 17; } consumer bar.c: #include <pthread.h> #include <stdlib.h> extern int foo(int); static void * call_foo(void *x) { (void) foo(0); } int main(void) { pthread_t a, b; (void) pthread_create(&a, NULL, call_foo, NULL); (void) pthread_create(&b, NULL, call_foo, NULL); (void) pthread_join(a, NULL); (void) pthread_join(b, NULL); exit(0); }
Run it:
# LD_DEBUG=init ./bar debug: debug: Solaris Linkers: 5.11-1.1748 (illumos) debug: 55804: 55804: platform capability (CA_SUNW_PLAT) - i86pc 55804: machine capability (CA_SUNW_MACH) - i86pc 55804: hardware capabilities (CA_SUNW_HW_1) - 0x7dd75c6f [ VMX AVX XSAVE PCLMULQDQ AES SSE4.2 SSE4.1 SSSE3 POPCNT TSCP AHF CX16 SSE3 SSE2 SSE FXSR MMX CMOV SEP CX8 TSC FPU ] 55804: 55804: 55804: configuration file=/var/ld/ld.config: [ ELF-DEFAULT-LIBPATH ELF-SECURE-LIBPATH ] 55804: 55804: 55804: 1: calling .init (from sorted order): /lib/libc.so.1 55804: 1: 55804: 1: calling .init (done): /lib/libc.so.1 55804: 1: 55804: 1: 55804: 1: transferring control: bar 55804: 1: 55804: 2: 55804: 2: calling .init (from sorted order): /opt/local/gcc47/runtime/lib/./libgcc_s.so.1 55804: 2: 55804: 3: calling .init (dynamically triggered): ./libfoo.so.1 55804: 3: 55804: 3: warning: calling /opt/local/gcc47/runtime/lib/./libgcc_s.so.1 whose init has not completed 55804: 3: 55804: 2: calling .init (done): /opt/local/gcc47/runtime/lib/./libgcc_s.so.1 55804: 2: 55804: 2: warning: calling ./libfoo.so.1 whose init has not completed 55804: 2: foo 0 55804: 2: 55804: 2: ld.so.1: bar: fatal: _ex_unwind: can't find symbol 55804: 2: 55804: 2: 55804: 3: calling .init (done): ./libfoo.so.1 55804: 3: foo 17 55804: 3: 55804: 3: ld.so.1: bar: fatal: _ex_unwind: can't find symbol 55804: 3: 55804: 3: 55804: 1: calling .fini: ./libfoo.so.1 55804: 1: 55804: 1: calling .fini: /opt/local/gcc47/runtime/lib/./libgcc_s.so.1 55804: 1: 55804: 1: calling .fini: /lib/libc.so.1 55804: 1:
Note that the second thread to resolve foo() gets the wrong value, because the ctor has not completed, and the telltale linker warning "calling ./libfoo.so.1 whose init has not completed".
It would seem that the simplest solution to this problem would be to change how ldap_cachemgr and libsldap.so.1 are built, so that libldap.so.5 is not lazy-loaded. Doing so might slightly increase the startup time for this utility and any others linked with these libraries, but would ensure correctness. In fact, no multi-threaded consumer of libldap.so.5 or any other consumer of the nspr libraries should ever be allowed to lazy-load it.
--
However, while we could restrict the use of libldap.so.5 or this general class of problems, we instead add a bit logic to rtld which will allow the right thing to happen in the case of lazy loaded libraries.
Updated by Electric Monk over 8 years ago
- Status changed from New to Closed
git commit 4812581794004eff0af2b765b832403b30bf64ab
commit 4812581794004eff0af2b765b832403b30bf64ab Author: Keith M Wesolowski <wesolows@foobazco.org> Date: 2015-02-09T23:34:39.000Z 4996 rtld _init race leads to incorrect symbol values Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net>