regex has problems with simple ranges in UTF-8 locales
In UTF-8 locales other than C.UTF-8, the libc regex has problems with simple ranges such as [A-Z]: the range also matches at least some lower-case characters.
This can be simply illustrated using:
build% echo o > tf
build% LC_ALL=C egrep '^[A-Z]' tf
build% LC_ALL=C.UTF-8 egrep '^[A-Z]' tf
build% LC_ALL=en_GB.UTF-8 egrep '^[A-Z]' tf
o
build% LC_ALL=en_US.UTF-8 egrep '^[A-Z]' tf
o
(note that this behaviour is also apparent on FreeBSD)
A simple test program, regcomp.c (attached), does the following in different locales:
regcomp(&rx, "^[L-R]", REG_EXTENDED);
regexec(&rx, "o", 0, NULL, 0);
and reports:
Testing locale en_GB.UTF-8 - Matched
Testing locale C.UTF-8 - Nomatch
Testing locale C - Nomatch
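For anyone without the attachment, a minimal sketch of the same check follows. It is not the attached regcomp.c: the helper name range_matches() and its return convention are assumptions made for illustration. It only uses setlocale(), regcomp(), regexec() and regfree().

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the attached regcomp.c): switch to the named
 * locale, compile `pattern` as an ERE and try it against `subject`.
 * Returns 1 on match, 0 on no match, and -1 if the locale is not
 * available or the pattern fails to compile.
 */
int
range_matches(const char *locale, const char *pattern, const char *subject)
{
    regex_t rx;
    int rc;

    if (setlocale(LC_ALL, locale) == NULL)
        return (-1);
    if (regcomp(&rx, pattern, REG_EXTENDED) != 0)
        return (-1);
    rc = regexec(&rx, subject, 0, NULL, 0);
    regfree(&rx);
    return (rc == 0 ? 1 : 0);
}
```

Calling range_matches("C", "^[L-R]", "o") should return 0. The result for a locale such as en_GB.UTF-8 depends on the libc in use, and that difference is what this report is about.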
This appears to be because p_b_term() in libc does a fairly basic interpretation of the range, checking each character from 0 to 255 against the start and end of the range using _collate_range_cmp(), and including the character if it falls between the range start and end characters. This function uses wcscoll_l() to do the comparison. Obviously this is only going to work in single-byte locales, as it only looks at those first 256 characters.
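The approach described above can be sketched roughly as follows. This is an illustration of the technique, not the libc source: range_cmp() and build_range_set() are hypothetical names, and the sketch uses wcscoll() in the current locale rather than the locale-specific variant.

```c
#include <wchar.h>

/*
 * Illustrative stand-in for the collation-based comparison described
 * above. The real libc helper is not reproduced here.
 */
static int
range_cmp(wchar_t a, wchar_t b)
{
    wchar_t s1[2] = { a, L'\0' };
    wchar_t s2[2] = { b, L'\0' };

    return (wcscoll(s1, s2));
}

/*
 * Mark every value 0..255 that collates between `start` and `end`
 * (inclusive) in the current locale. Values above 255 are never
 * examined, which is the single-byte limitation noted above.
 */
void
build_range_set(wchar_t start, wchar_t end, unsigned char set[256])
{
    int c;

    for (c = 0; c < 256; c++) {
        set[c] = (range_cmp(start, (wchar_t)c) <= 0 &&
            range_cmp((wchar_t)c, end) <= 0);
    }
}
```

In the C locale this yields exactly the code points from `start` to `end`. In a locale whose collation interleaves upper and lower case, lower-case letters can end up in the set as well.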
A simple program exercising `wcscoll` in different locales appears to show that, for example, a lower-case o does compare as between a capital L and a capital P. See attached:
build% ./wscoll C.UTF-8
Distance: L - O = -3
Distance: O - P = -1
Distance: L - o = -35
Distance: o - P = 31
build%
build% ./wscoll en_GB.UTF-8
Distance: L - O = -45
Distance: O - P = -29
Distance: L - o = -45
Distance: o - P = -29
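A probe along these lines can be sketched as below. This is a hypothetical reconstruction, not the attached program: the names distance() and print_distances() are assumptions, and the output format simply mirrors the listing above.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Collation result for two single characters in the current locale. */
static int
distance(wchar_t a, wchar_t b)
{
    wchar_t s1[2] = { a, L'\0' };
    wchar_t s2[2] = { b, L'\0' };

    return (wcscoll(s1, s2));
}

/*
 * Print the four comparisons from the listing above for the named
 * locale. Returns 0 on success, -1 if the locale cannot be selected.
 */
int
print_distances(const char *locale, FILE *out)
{
    static const wchar_t pairs[4][2] = {
        { L'L', L'O' }, { L'O', L'P' }, { L'L', L'o' }, { L'o', L'P' }
    };
    int i;

    if (setlocale(LC_ALL, locale) == NULL)
        return (-1);
    for (i = 0; i < 4; i++) {
        fprintf(out, "Distance: %lc - %lc = %d\n",
            (wint_t)pairs[i][0], (wint_t)pairs[i][1],
            distance(pairs[i][0], pairs[i][1]));
    }
    return (0);
}
```

Only the sign of each value is meaningful. The magnitude returned by wcscoll() is implementation-specific, so the numbers will differ between systems even when the ordering agrees.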
However, the same appears true on other systems, for example on Linux:
[bob@ns2]/tmp% ./wscoll
Distance: L - O = -3
Distance: O - P = -1
Distance: L - o = -3
Distance: o - P = -1
Of note, FreeBSD attempted to change this in their libc but reverted soon after.
Updated by Robert Mustacchi about 1 year ago
Thanks for writing this up, Andy. I did a similar investigation and reached the same conclusion: regcomp() was relying on collation to build the character class.
I suspect it started out using collation here because of this bit of POSIX: "In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched. A range expression shall be expressed as the starting point and the ending point separated by a hyphen ( '-' )." (9.3.5 no. 7 https://pubs.opengroup.org/onlinepubs/007904875/basedefs/xbd_chap09.html).
There's some useful background on collation from the Unicode folks: https://www.unicode.org/reports/tr10/#Common_Misperceptions. This tells us a couple of things: collation rules aren't stable and change over time and, as we discovered, they have no relation to the code points. Effectively we're seeing something like aAbBcC... (not sure whether upper case comes first or not).
Ranges in BREs seem to be built on a mental assumption of ASCII ordering. It's not clear that relying on collation here makes sense, but it's also not clear to me what the implications would be of changing this to follow the byte values, as in the C locale.
Updated by Andy Fiddaman about 1 year ago
I have found references that say that POSIX 1992 defined the ordering for a range expression (outside of the C and POSIX locales) to be in collation order.
POSIX 2008 introduced the wording that you quoted: "In other locales, a range expression has unspecified behaviour".
It seems that the illumos implementation is allowed, and may have been required for standards compliance in the past, but it certainly isn't what people expect.