Project

General

Profile

Actions

Bug #15691

open

uselocale(NULL) is too slow

Added by Bill Sommerfeld 11 days ago. Updated 3 days ago.

Status:
In Progress
Priority:
Normal
Category:
lib - userland libraries
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

Peeling one more layer off the onion that started with #15662 and continued with #15667.

uselocale() and its callees stick out in a flamegraph of the original egrep from nightly (appearing about 45% of the sample tracebacks).

Most processes are single-threaded and never call uselocale() with a non-NULL locale so that creates an opportunity for a fastpath - a global flag that doesn't flip until the first call with non-NULL locale.

Excluding tests, there are no callers of uselocale() in illumos-gate that pass a non-NULL locale.

Also it might make sense to have a simpler function in libc than uselocale(NULL) to call when you just want access to the current locale and don't want to change anything.


Related issues

Related to illumos gate - Bug #15662: nightly uses pathological regexp to find warningsClosedBill Sommerfeld

Actions
Related to illumos gate - Bug #15667: libc regcomp/regexec show pathological behavior for "warn:|warning:"New

Actions
Related to illumos gate - Bug #15713: ulwp32_t is out of sync with ulwp_tPending RTIBill Sommerfeld

Actions
Actions #1

Updated by Bill Sommerfeld 11 days ago

  • Related to Bug #15662: nightly uses pathological regexp to find warnings added
Actions #2

Updated by Bill Sommerfeld 11 days ago

  • Related to Bug #15667: libc regcomp/regexec show pathological behavior for "warn:|warning:" added
Actions #3

Updated by Robert Mustacchi 6 days ago

I was a bit curious about this. Looking at the flamegraph in #15667 we see that the membar_consumer() and related that the thr_keycreate_once is pretty high up there. This is inherent to the tsdalloc() implementation. So I wondered what if we didn't actually do that at all and what would it look like with the locale_t as part of the ulwp_t as opposed to its current construction. I tossed an example of that up at https://github.com/rmustacc/illumos-gate/commit/0ea3368ffd7afa494726f495009c132aa8adfb05. It showed about a 2.73x improvement on 32-bit code and 6.37x improvement on 64-bit code.

To test this I wrote the following program:

#include <sys/time.h>
#include <locale.h>
#include <stdio.h>

uint32_t warmup = 100;
uint32_t iters = 100000000;

int
main(void)
{
        hrtime_t start, end;

        for (uint32_t i = 0; i < warmup; i++) {
                (void) uselocale(NULL);
        }

        start = gethrtime();
        for (uint32_t i = 0; i < iters; i++) {
                (void) uselocale(NULL);
        }

        end = gethrtime();
        (void) printf("%u iterations: %lld\n", iters, end - start);
        return (0);
}

I compiled it both 32-bit and 64-bit ala gcc -Wall -m64 -o mb_cur.64 mb_cur.c and gcc -Wall -m32 -o mb_cur.32 mb_cur.c with gcc 10.4.0-il-1. I explicitly did not enable optimizations and then disassembled the main functions to confirm the loops had not been optimized away.

I then pbound a process to a cpu and disabled interrupts on that CPU while leaving the box otherwise idle. I did not control turbo boost, but then proceeded to run this 100 times. Times were generally similar. This was run on a 2.8GHz AMD 7402P. And example set up and run was:

$ pfexec pbind -b 23 $$
$ pfexec psradm -i 23
$ for i in {1..100}; do ./mb_cur.64 2>&1 >> new/output.64; done

I did spot check this with ptime -m just to make sure it was all user time. Here is a subset of data:

### old 32-bit
100000000 iterations: 1225172580
100000000 iterations: 1225140760
100000000 iterations: 1225101710
100000000 iterations: 1284265335
100000000 iterations: 1225660242
100000000 iterations: 1225245823
100000000 iterations: 1224972527
100000000 iterations: 1225157830
### old 64-bit
100000000 iterations: 1524071578
100000000 iterations: 1524418676
100000000 iterations: 1524633631
100000000 iterations: 1526691039
100000000 iterations: 1558379760
100000000 iterations: 1523961875
100000000 iterations: 1523877353
100000000 iterations: 1524241171
100000000 iterations: 1523960716
100000000 iterations: 1523934524
### new 32-bit
100000000 iterations: 448647759
100000000 iterations: 448688571
100000000 iterations: 448623439
100000000 iterations: 448655450
100000000 iterations: 448617479
100000000 iterations: 448792032
100000000 iterations: 450142830
100000000 iterations: 448604758
100000000 iterations: 448590299
### new 64-bit
100000000 iterations: 239439401
100000000 iterations: 239442551
100000000 iterations: 239428151
100000000 iterations: 239431741
100000000 iterations: 239442920
100000000 iterations: 239581734
100000000 iterations: 239402340
100000000 iterations: 239405920
100000000 iterations: 239429220

I do think we need to re-evaluate the fact that tsdalloc() is always calling the thread key creation APIs on every use. I think it is more ok for things not in the critical path, but I don't think it was intended to be called every iteration. I think it's most notably bad in the uselocale() case because it wants to be used by every traditional call to get the locale.

Actions #4

Updated by Bill Sommerfeld 3 days ago

Thanks for prototyping that approach.

I went down a bit of a rathole looking at ways to improve access to the current locale via the current ulwp_t.

1) speeding access to per-thread state on i386/amd64:

Our ABI uses the (vestigial) segment registers fs and gs as base pointers to find per-cpu and per-thread state.

gcc 6 and later claim to support the __seg_fs and __seg_gs address space descriptors when __SEG_FS and __SEG_GS are predefined. This would let you get access to the current locale via something along the lines of:


#if defined(__SEG_GS) && defined(__SEG_FS)

#if defined(__amd64)
#define        __seg_curthread __seg_fs
#elif defined(__i386)
#define        __seg_curthread __seg_gs
#endif

locale_t curlocale(void) {
        ulwp_t __seg_curthread *myulwp = 0;
        return myulwp->ul_locale;
}
#endif

This compiles down to something along the lines of:

                          movq   %fs:0x480,%rax

This works in standalone tests I've run. Unfortunately, the -fno-asm flag to gcc (which cw always adds to the command line) has an undocumented side effect turns off __seg_fs and __seg_gs without also disabling __SEG_GS and __SEG_FS. See GCC bug 79609: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79609

2) I'm trying to determine whether posix or the expectations of currently-running code actually mandate the double-pointer approach we currently use (process global locale + per-thread locale, where we fall back to the global locale if the per-thread locale hasn't been set or has been set back to the current value of the global locale), and you can change the global locale in all threads that haven't set a locale via setlocale. After all, calling setlocale while there are multiple threads active seems perilous and one document I found cautioned that this might not be threadsafe.

It would be simpler (one load, no conditional branches) if we could always find the current effective locale in the current ulwp_t and not have to also fetch the global locale if the local locale was NULL.

If we have to preserve the "set locale in all threads that haven't overridden the global locale" behavior of setlocale in multi-threaded programs, making setlocale a little slower to make fetching the current locale faster seems like a good performance tradeoff.

3) we should be able to create a faster specialized inlineable function equivalent to uselocale(NULL) that will eventually reduce to a small number of instructions that can be used internally within libc.

4) the thread exit path needs a closer look; the existing tsdalloc code passes a destructor (freelocptr) that would need to be invoked when a thread is freed.

I think it makes sense to structure this change into a series of commits to simplify review:

a) synch up ulwp_t and ulwp32_t
b) add the function in (3) initially as syntactic sugar for uselocale(NULL)
c) one or more commits to actually speed up local locale access.

Actions #5

Updated by Bill Sommerfeld 3 days ago

  • Related to Bug #15713: ulwp32_t is out of sync with ulwp_t added
Actions #6

Updated by Bill Sommerfeld 3 days ago

I prototyped the read-only-from-curthread model described in the previous note - it did not take a lot of code to safely walk the list of ulwp_t's and selectively update their ul_locale pointer. So I don't have to fully research whether simplifying our complicated semantics is possible -- we can just preserve the existing behavior while making it simple & fast to fetch the current locale.

Actions #7

Updated by Bill Sommerfeld 3 days ago

  • Status changed from New to In Progress
Actions

Also available in: Atom PDF