Project

General

Profile

Actions

Bug #15667

open

libc regcomp/regexec show pathological behavior for "warn:|warning:"

Added by Bill Sommerfeld 18 days ago. Updated 13 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
lib - userland libraries
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

In #15662 I noticed pathological performance from egrep for the (case-insensitive) regexp "warn:warning:", running roughly 100x to 140x slower than the equivalent regexp "warn(ing)?:".

$ time egrep -i 'warn:|warning:' /z/ws/illumos-wformat/usr/src/install-nd-i386.out         
egrep -i 'warn:|warning:' /z/ws/illumos-wformat/usr/src/install-nd-i386.out  147.50s user 0.15s system 99% cpu 2:27.66 total

An alternate form that should be equivalent is much faster:

$ time egrep -i 'warn(ing)?:' /z/ws/illumos-wformat/usr/src/install-nd-i386.out 
egrep -i 'warn(ing)?:' /z/ws/illumos-wformat/usr/src/install-nd-i386.out  1.41s user 0.07s system 99% cpu 1.475 total

I did some preliminary investigation:

usr/src/cmd/grep uses the libc regcomp and regexec API as defined in <regex.h>.

I looked at the contents of the generated re_guts structure (see definition in usr/src/lib/libc/port/regex/regex2.h) for both the good and bad strings and some variations of them and noticed a couple things:

One difference is that the non-pathological version sets re_guts->must (a substring which must appear in the match) to ":" which enables a prescreening optimization in the regex matcher (see matcher() in .../engine.c)
(it looks like the prescreening optimization works better in case-insensitive mode where must is ":" than it does in case-sensitive where must is "warn" as there are relatively few lines in the file which contain a colon).

Another thing I noticed is that the use of case-insensitive matching causes each alphabetic character in the pattern to be converted into a character sets, which use a bitmap representation; there is no attempt to dedup the list of character sets which might reduce cache footprint.


Files

graph.svg (28 KB) graph.svg Flamegraph Bill Sommerfeld, 2023-05-22 06:43 PM
out.stacks-folded (12.1 KB) out.stacks-folded Bill Sommerfeld, 2023-05-22 06:45 PM
dumpregcmp.c (1.8 KB) dumpregcmp.c Bill Sommerfeld, 2023-05-22 08:32 PM

Related issues

Related to illumos gate - Bug #15662: nightly uses pathological regexp to find warningsClosedBill Sommerfeld

Actions
Related to illumos gate - Bug #15691: uselocale(NULL) is too slowIn ProgressBill Sommerfeld

Actions
Actions #1

Updated by Bill Sommerfeld 18 days ago

  • Related to Bug #15662: nightly uses pathological regexp to find warnings added
Actions #2

Updated by Bill Sommerfeld 18 days ago

When looking at a truss -u libc:: of a reduced test case, I noticed an extreme number of calls to __mb_cur_max() with an ensuing cascade of six subsequent calls:

/1@1:              -> libc:__mb_cur_max()
/1@1:                -> libc:uselocale()
/1@1:                  -> libc:tsdalloc()
/1@1:                    -> libc:thr_keycreate_once()
/1@1:                      -> libc:membar_consumer()
/1@1:                      <- libc:membar_consumer() = 0xfef5ba18
/1@1:                    <- libc:thr_keycreate_once() = 0
/1@1:                    -> libc:pthread_getspecific()
/1@1:                    <- libc:pthread_getspecific() = 0xfedf1c00
/1@1:                  <- libc:tsdalloc() = 0xfedf0200
/1@1:                <- libc:uselocale() = 0xfedf1a00
/1@1:                -> libc:__mb_cur_max_l(0xfedf1a00, 0x8068cae, 0x1, 0x8047388)
/1@1:                <- libc:__mb_cur_max_l() = 4
/1@1:              <- libc:__mb_cur_max() = 4

In the reduced test case, regexec() was called 10 times; CHIN was called 123794 times, and __mb_cur_max was called 125258 times.

The calls are due to the reference to MB_CUR_MAX in the middle of the NC macro in usr/src/lib/libc/port/regex/utils.h:

#define    NC        ((MB_CUR_MAX) == 1 ? (NC_MAX) : (128))

Some tinkering to avoid this (adding nc to the struct re_guts and passing it to CHIN, and marking CHIN and CHIN1 as inline) resulted in a ~3.5x speedup for the pathological case.

But the total number of calls to CHIN seems on the order of 5x too big -- the sample file is 11136 bytes long; I'd think that, worst case, it would compare each character in the file with the two "w"'s in the pattern and move on for each mismatch, only stepping further in the pattern when it found a w or W in the file..

Actions #3

Updated by Electric Monk 18 days ago

  • Gerrit CR set to 2860
Actions #4

Updated by Dan Cross 18 days ago

Sadly, I think the problem here is algorithmic, not just inefficiencies in the implementation.

Originally, egrep was implemented using a pre-computed DFA; at least, this is how it was implemented in 7th Edition, and that was carried forward into Solaris and then into illumos. However, this was replaced by the present driver, which calls into the regexp library that you saw in https://github.com/illumos/illumos-gate/commit/d2d52addd50254d1b7c318c6784172d8d7de20c6.

Unfortunately, the regexp package in libc is descended from Henry Spencer's open source library, and uses a backtracking implementation. I say "unfortunately" because backtracking can be exponential in both space and time (https://swtch.com/~rsc/regexp/regexp1.html).

The reasons to move away from the original egrep implementation back in 2017 were certainly good ones (multibyte characters, standards compliance, etc), but I think the pathological behavior seen here is an unintended consequence of that change. The original DFA would have been O(1) in the length of its input (though in general DFA construction can be exponential; egrep in 8th Edition and later Research Unix switched to DFA construction on-the-fly to try and amortize the cost a bit).

There aren't any really particularly great ways forward here, I'm afraid, other than trying to add some heuristics to make this case (and similar?) faster. Another option might be to port over another regexp library and call it from egrep: Google's re2 would be a good candidate, except that it's written in C++; TRE may work, and is in C. One could go in and modify Spencer's library to use an NFA where it can (ie, anywhere that doesn't require backreferences or similar features that aren't part of the regular languages). But while the theory is well-understood, that's an old library and the corner-cases might really mess something up. Yanking in plan9's egrep may be another option.

Actions #5

Updated by Bill Sommerfeld 17 days ago

So I've done a bit more poking at it, and it appears that backtracking simply isn't a factor in this case -- for the regexp at issue, once I fixed the "walk" function to not bother trying to detect word beginning/ends if they weren't needed by the regexp, the step function in engine.c is called exactly once per character in the input file.

Part of the problem is that step computes state reachability by checking each character in the input against every possible state in the graph -- even for states that are not reachable at the current character (where whether or not the character matches the state will have no effect on the reachability of any subsequent state). For example (and this is the most common operation in the regexp at issue):

        case OANYOF:
            cs = &g->sets[OPND(s)];
            if (!NONCHAR(ch) && CHIN(cs, ch))
                FWD(aft, bef, 1);
            break;

it evaluates whether ch is in cs for every state, even if that state in bef isn't reachable!

So we have O(n) calls to step, each of which does O(s) state evaluations. (n=input length, s=encoded regexp size), for net O(n*s) state evaluations. And none of the opcodes in the compiled representation of the regexp seem to do anything particularly tricky -- it really looks like the engine is just really slow in handling each character.

Bottom line: I agree that swapping out the package in favor of something better is probably the right answer here.

Actions #6

Updated by Robert Mustacchi 17 days ago

For what it's worth, I believe Yuri Pankov looked at TRE previously, assuming https://github.com/laurikari/tre/. The thought then was to use it to improve the core libc engine there. There may have been some minor compat pieces, I don't recall specifically. The properly working multibyte pieces there is useful.

Actions #7

Updated by Bill Sommerfeld 13 days ago

I've attached a flamegraph generated from the a run with the stock illumos egrep & libc showing that locale-related code is eating much of the time...

This was generated with:

$ dtrace -x ustackframes=100 -n 'profile-99 /pid == $target/ { @[ustack()] = count(); } tick-60s { exit(0); }' -o /tmp/out.stacks -c egrep -i 'warn:|warning:' install-nd-i386.out
$ ./stackcollapse.pl /tmp/out.stacks >/tmp/out.stacks-folded 
$ ./flamegraph.pl /tmp/out.stacks-folded >/tmp/graph.svg

Actions #8

Updated by Dan Cross 13 days ago

Bill Sommerfeld wrote in #note-5:

So I've done a bit more poking at it, and it appears that backtracking simply isn't a factor in this case -- for the regexp at issue, once I fixed the "walk" function to not bother trying to detect word beginning/ends if they weren't needed by the regexp, the step function in engine.c is called exactly once per character in the input file.

Part of the problem is that step computes state reachability by checking each character in the input against every possible state in the graph -- even for states that are not reachable at the current character (where whether or not the character matches the state will have no effect on the reachability of any subsequent state). For example (and this is the most common operation in the regexp at issue):

[...]

it evaluates whether ch is in cs for every state, even if that state in bef isn't reachable!

So we have O(n) calls to step, each of which does O(s) state evaluations. (n=input length, s=encoded regexp size), for net O(n*s) state evaluations. And none of the opcodes in the compiled representation of the regexp seem to do anything particularly tricky -- it really looks like the engine is just really slow in handling each character.

O(n*s) seems consistent with what we'd expect out of an NDFA simulation of a DFA; I'm sorry if I missed it, but surely the state space is relatively small here?

Bottom line: I agree that swapping out the package in favor of something better is probably the right answer here.

Agreed.

Bill Sommerfeld wrote in #note-7:

I've attached a flamegraph generated from the a run with the stock illumos egrep & libc showing that locale-related code is eating much of the time...
[...]

Naively, it looks as though the locale is being reinitialized for every input character? Surely that's not right.

Actions #9

Updated by Bill Sommerfeld 13 days ago

I wrote a program (attached. it's not pretty and you'll need to patch it to point at a tree that has regex2.h) to dump out the result of regcomp. For the regexp in question, it prints:

$ ./dumpregcmp 'warn:|warning:'
gp->ncsets: 11
gp->nstates: 19
gp->firststate: 0
gp->laststate: 18
gp->must: (null)
gp->moffset: -1
gp->mlen: 0
gp->nsub: 0
gp->backrefs: 0
gp->nplus: 0
0: 1 (OEND      ) 0
1: 15 (OCH_     ) 7
2: 6 (OANYOF    ) 0
3: 6 (OANYOF    ) 1
4: 6 (OANYOF    ) 2
5: 6 (OANYOF    ) 3
6: 2 (OCHAR     ) 58
7: 16 (OOR1     ) 6
8: 17 (OOR2     ) 9
9: 6 (OANYOF    ) 4
10: 6 (OANYOF   ) 5
11: 6 (OANYOF   ) 6
12: 6 (OANYOF   ) 7
13: 6 (OANYOF   ) 8
14: 6 (OANYOF   ) 9
15: 6 (OANYOF   ) 10
16: 2 (OCHAR    ) 58
17: 18 (O_CH    ) 10
18: 1 (OEND     ) 0

OANYOF: "is current character in set X" operation.
OCHAR: "is current character exactly this"

and OCH_/OOR1/OOR2/O_CH are the opcodes which bracket the alternatives for the "or".

The bulk of the calls into the locale code appear to be from the NC macro which is used in CHIN which does the work of OANYOF.

Actions #10

Updated by Bill Sommerfeld 11 days ago

  • Related to Bug #15691: uselocale(NULL) is too slow added
Actions

Also available in: Atom PDF