Bug #15691
openuselocale(NULL) is too slow
0%
Description
Peeling one more layer off the onion that started with #15662 and continued with #15667.
uselocale() and its callees stick out in a flamegraph of the original egrep from nightly (appearing about 45% of the sample tracebacks).
Most processes are single-threaded and never call uselocale() with a non-NULL locale so that creates an opportunity for a fastpath - a global flag that doesn't flip until the first call with non-NULL locale.
Excluding tests, there are no callers of uselocale() in illumos-gate that pass a non-NULL locale.
Also it might make sense to have a simpler function in libc than uselocale(NULL) to call when you just want access to the current locale and don't want to change anything.
Related issues
Updated by Bill Sommerfeld 11 days ago
- Related to Bug #15662: nightly uses pathological regexp to find warnings added
Updated by Bill Sommerfeld 11 days ago
- Related to Bug #15667: libc regcomp/regexec show pathological behavior for "warn:|warning:" added
Updated by Robert Mustacchi 6 days ago
I was a bit curious about this. Looking at the flamegraph in #15667 we see that the membar_consumer()
and related that the thr_keycreate_once
is pretty high up there. This is inherent to the tsdalloc()
implementation. So I wondered what if we didn't actually do that at all and what would it look like with the locale_t
as part of the ulwp_t
as opposed to its current construction. I tossed an example of that up at https://github.com/rmustacc/illumos-gate/commit/0ea3368ffd7afa494726f495009c132aa8adfb05. It showed about a 2.73x improvement on 32-bit code and 6.37x improvement on 64-bit code.
To test this I wrote the following program:
#include <sys/time.h> #include <locale.h> #include <stdio.h> uint32_t warmup = 100; uint32_t iters = 100000000; int main(void) { hrtime_t start, end; for (uint32_t i = 0; i < warmup; i++) { (void) uselocale(NULL); } start = gethrtime(); for (uint32_t i = 0; i < iters; i++) { (void) uselocale(NULL); } end = gethrtime(); (void) printf("%u iterations: %lld\n", iters, end - start); return (0); }
I compiled it both 32-bit and 64-bit ala gcc -Wall -m64 -o mb_cur.64 mb_cur.c
and gcc -Wall -m32 -o mb_cur.32 mb_cur.c
with gcc 10.4.0-il-1. I explicitly did not enable optimizations and then disassembled the main functions to confirm the loops had not been optimized away.
I then pbound a process to a cpu and disabled interrupts on that CPU while leaving the box otherwise idle. I did not control turbo boost, but then proceeded to run this 100 times. Times were generally similar. This was run on a 2.8GHz AMD 7402P. And example set up and run was:
$ pfexec pbind -b 23 $$ $ pfexec psradm -i 23 $ for i in {1..100}; do ./mb_cur.64 2>&1 >> new/output.64; done
I did spot check this with ptime -m
just to make sure it was all user time. Here is a subset of data:
### old 32-bit 100000000 iterations: 1225172580 100000000 iterations: 1225140760 100000000 iterations: 1225101710 100000000 iterations: 1284265335 100000000 iterations: 1225660242 100000000 iterations: 1225245823 100000000 iterations: 1224972527 100000000 iterations: 1225157830 ### old 64-bit 100000000 iterations: 1524071578 100000000 iterations: 1524418676 100000000 iterations: 1524633631 100000000 iterations: 1526691039 100000000 iterations: 1558379760 100000000 iterations: 1523961875 100000000 iterations: 1523877353 100000000 iterations: 1524241171 100000000 iterations: 1523960716 100000000 iterations: 1523934524 ### new 32-bit 100000000 iterations: 448647759 100000000 iterations: 448688571 100000000 iterations: 448623439 100000000 iterations: 448655450 100000000 iterations: 448617479 100000000 iterations: 448792032 100000000 iterations: 450142830 100000000 iterations: 448604758 100000000 iterations: 448590299 ### new 64-bit 100000000 iterations: 239439401 100000000 iterations: 239442551 100000000 iterations: 239428151 100000000 iterations: 239431741 100000000 iterations: 239442920 100000000 iterations: 239581734 100000000 iterations: 239402340 100000000 iterations: 239405920 100000000 iterations: 239429220
I do think we need to re-evaluate the fact that tsdalloc() is always calling the thread key creation APIs on every use. I think it is more ok for things not in the critical path, but I don't think it was intended to be called every iteration. I think it's most notably bad in the uselocale() case because it wants to be used by every traditional call to get the locale.
Updated by Bill Sommerfeld 3 days ago
Thanks for prototyping that approach.
I went down a bit of a rathole looking at ways to improve access to the current locale via the current ulwp_t.
1) speeding access to per-thread state on i386/amd64:
Our ABI uses the (vestigial) segment registers fs
and gs
as base pointers to find per-cpu and per-thread state.
gcc 6 and later claim to support the __seg_fs
and __seg_gs
address space descriptors when __SEG_FS
and __SEG_GS
are predefined. This would let you get access to the current locale via something along the lines of:
#if defined(__SEG_GS) && defined(__SEG_FS)
#if defined(__amd64)
#define __seg_curthread __seg_fs
#elif defined(__i386)
#define __seg_curthread __seg_gs
#endif
locale_t curlocale(void) {
ulwp_t __seg_curthread *myulwp = 0;
return myulwp->ul_locale;
}
#endif
This compiles down to something along the lines of:
movq %fs:0x480,%rax
This works in standalone tests I've run. Unfortunately, the -fno-asm
flag to gcc (which cw
always adds to the command line) has an undocumented side effect turns off __seg_fs
and __seg_gs
without also disabling __SEG_GS
and __SEG_FS
. See GCC bug 79609: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79609
2) I'm trying to determine whether posix or the expectations of currently-running code actually mandate the double-pointer approach we currently use (process global locale + per-thread locale, where we fall back to the global locale if the per-thread locale hasn't been set or has been set back to the current value of the global locale), and you can change the global locale in all threads that haven't set a locale via setlocale
. After all, calling setlocale
while there are multiple threads active seems perilous and one document I found cautioned that this might not be threadsafe.
It would be simpler (one load, no conditional branches) if we could always find the current effective locale in the current ulwp_t and not have to also fetch the global locale if the local locale was NULL
.
If we have to preserve the "set locale in all threads that haven't overridden the global locale" behavior of setlocale
in multi-threaded programs, making setlocale
a little slower to make fetching the current locale faster seems like a good performance tradeoff.
3) we should be able to create a faster specialized inlineable function equivalent to uselocale(NULL)
that will eventually reduce to a small number of instructions that can be used internally within libc.
4) the thread exit path needs a closer look; the existing tsdalloc
code passes a destructor (freelocptr
) that would need to be invoked when a thread is freed.
I think it makes sense to structure this change into a series of commits to simplify review:
a) synch up ulwp_t and ulwp32_t
b) add the function in (3) initially as syntactic sugar for uselocale(NULL)
c) one or more commits to actually speed up local locale access.
Updated by Bill Sommerfeld 3 days ago
- Related to Bug #15713: ulwp32_t is out of sync with ulwp_t added
Updated by Bill Sommerfeld 3 days ago
I prototyped the read-only-from-curthread model described in the previous note - it did not take a lot of code to safely walk the list of ulwp_t
's and selectively update their ul_locale
pointer. So I don't have to fully research whether simplifying our complicated semantics is possible -- we can just preserve the existing behavior while making it simple & fast to fetch the current locale.