Bug #6901

Multibyte regexec could be supported

Added by Garrett D'Amore over 3 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: locale - data and messages
Start date: 2016-04-11
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

From FreeBSD:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=207681

It looks like we could support multibyte regexec. I believe this has been broken since the early days of illumos; in particular, bug #2 introduced the file in question:

usr/src/lib/libc/port/locale/regexec.c, which was borrowed from FreeBSD.

Here's the actual diff from FreeBSD:

Index: lib/libc/regex/regex.3
===================================================================
--- lib/libc/regex/regex.3    (revision 286580)
+++ lib/libc/regex/regex.3    (working copy)
@@ -723,5 +723,3 @@
 .Pp
 The implementation of word-boundary matching is a bit of a kludge,
 and bugs may lurk in combinations of word-boundary matching and anchoring.
-.Pp
-Word-boundary matching does not work properly in multibyte locales.
Index: lib/libc/regex/regexec.c
===================================================================
--- lib/libc/regex/regexec.c    (revision 286580)
+++ lib/libc/regex/regexec.c    (working copy)
@@ -116,9 +116,9 @@
 #define    FWD(dst, src, n)    ((dst) |= ((unsigned long)(src)&(here)) << (n))
 #define    BACK(dst, src, n)    ((dst) |= ((unsigned long)(src)&(here)) >> (n))
 #define    ISSETBACK(v, n)    (((v) & ((unsigned long)here >> (n))) != 0)
-/* no multibyte support */
-#define    XMBRTOWC    xmbrtowc_dummy
-#define    ZAPSTATE(mbs)    ((void)(mbs))
+/* multibyte support */
+#define    XMBRTOWC    xmbrtowc
+#define    ZAPSTATE(mbs)    memset((mbs), 0, sizeof(*(mbs)))
 /* function names */
 #define SNAMES            /* engine.c looks after details */

@@ -170,9 +170,9 @@
 #define    FWD(dst, src, n)    ((dst)[here+(n)] |= (src)[here])
 #define    BACK(dst, src, n)    ((dst)[here-(n)] |= (src)[here])
 #define    ISSETBACK(v, n)    ((v)[here - (n)])
-/* no multibyte support */
-#define    XMBRTOWC    xmbrtowc_dummy
-#define    ZAPSTATE(mbs)    ((void)(mbs))
+/* multibyte support */
+#define    XMBRTOWC    xmbrtowc
+#define    ZAPSTATE(mbs)    memset((mbs), 0, sizeof(*(mbs)))
 /* function names */
 #define    LNAMES            /* flag */

Some research needs to be done here.

History

#1

Updated by Garrett D'Amore over 3 years ago

  • Description updated (diff)
#2

Updated by Yuri Pankov over 2 years ago

  • Description updated (diff)
#3

Updated by Yuri Pankov over 2 years ago

I think the reporter is reading the code wrong: regexec.c includes engine.c several times, each time after setting different defines. With MNAMES defined, which defines the proper XMBRTOWC, we have a #define matcher mmatcher, and regexec.c has the following in regexec():

        if (MB_CUR_MAX > 1)
                return(mmatcher(g, string, nmatch, pmatch, eflags));

So we are already using the proper mbrtowc(); the man page note is specifically about word boundaries in multibyte locales.
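
The "same engine built several times, picked at run time" arrangement described above can be illustrated with a much smaller stand-in. This is only a sketch: it uses a macro to stamp out two variants instead of re-including a source file, and the names decode_single, decode_multi, DEFINE_COUNTER, scount, mcount and count_chars are hypothetical, not taken from regexec.c. The "engine" here just counts characters rather than matching anything.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical single-byte decoder: every byte is one character. */
static size_t
decode_single(wchar_t *wc, const char *s, size_t n, mbstate_t *mbs)
{
    (void)mbs;
    if (n == 0)
        return ((size_t)-2);
    *wc = (unsigned char)*s;
    return (*s == '\0' ? 0 : 1);
}

/* Hypothetical multibyte decoder: defer to the locale-aware mbrtowc(). */
static size_t
decode_multi(wchar_t *wc, const char *s, size_t n, mbstate_t *mbs)
{
    return (mbrtowc(wc, s, n, mbs));
}

/* Stamp out one character-counting "engine" per decoder. */
#define DEFINE_COUNTER(name, DECODE)                            \
static size_t                                                   \
name(const char *s)                                             \
{                                                               \
    mbstate_t mbs;                                              \
    wchar_t wc;                                                 \
    size_t len = strlen(s), count = 0, r;                       \
                                                                \
    memset(&mbs, 0, sizeof (mbs));                              \
    while (len > 0) {                                           \
        r = DECODE(&wc, s, len, &mbs);                          \
        if (r == (size_t)-1 || r == (size_t)-2 || r == 0)       \
            break;      /* invalid, truncated, or NUL */        \
        s += r;                                                 \
        len -= r;                                               \
        count++;                                                \
    }                                                           \
    return (count);                                             \
}

DEFINE_COUNTER(scount, decode_single)
DEFINE_COUNTER(mcount, decode_multi)

/* Run-time dispatch, using the same MB_CUR_MAX test as the snippet above. */
static size_t
count_chars(const char *s)
{
    assert(s != NULL);
    if (MB_CUR_MAX > 1)
        return (mcount(s));
    return (scount(s));
}
```

Both variants exist in the binary; which one runs depends only on the locale in effect when count_chars() is called, which is the point being made about mmatcher.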

#4

Updated by Yuri Pankov over 2 years ago

  • Status changed from New to Closed

To check that I'm not imagining things, I added some printfs to the regex routines and ran a very simple test case with regcomp()/regexec(). The important part here is not using REG_NOSUB:

$ LC_ALL=en_US.UTF-8 ./testre
I'm mmatcher
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
I'm xmbrtowc
pattern 'в' does match string 'абвгд'
$ LC_ALL=C ./testre
I'm smatcher
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
I'm xmbrtowc_dummy
pattern 'в' does match string 'абвгд'

So I'm going to close this as "nothing to fix" and will update the upstream ticket.
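
The source of testre is not included in the ticket. A minimal sketch of the kind of check described, assuming a hypothetical helper named re_matches and the use of REG_EXTENDED (neither is confirmed by the ticket), could look like this:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if pattern matches string, 0 if it
 * does not, and -1 if the pattern fails to compile or regexec() fails.
 * The caller is expected to have set the locale already.
 */
static int
re_matches(const char *pattern, const char *string)
{
    regex_t re;
    regmatch_t pm[1];
    int rc;

    assert(pattern != NULL && string != NULL);

    /* Deliberately no REG_NOSUB: match offsets are requested. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return (-1);

    rc = regexec(&re, string, 1, pm, 0);
    regfree(&re);

    if (rc == 0)
        return (1);
    if (rc == REG_NOMATCH)
        return (0);
    return (-1);
}
```

A driver would call setlocale(LC_ALL, "") first, so that LC_ALL from the environment takes effect, and then call re_matches("в", "абвгд") and print the result. The "I'm mmatcher" style lines in the output above come from printfs added inside libc itself, so this sketch alone will not reproduce them.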
