Bug #13694

regex has problems with simple ranges in UTF-8 locales

Added by Andy Fiddaman 15 days ago. Updated 15 days ago.

Status: New
Priority: Normal
Assignee: -
Category: lib - userland libraries
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

In UTF-8 locales other than C.UTF-8, the libc regex implementation mishandles simple ranges such as [A-Z]: the range also matches at least some lower-case characters.

This can be simply illustrated using egrep:

build% echo o > tf
build% LC_ALL=C egrep '^[A-Z]' tf
build% LC_ALL=C.UTF-8 egrep '^[A-Z]' tf
build% LC_ALL=en_GB.UTF-8 egrep '^[A-Z]' tf
o
build% LC_ALL=en_US.UTF-8 egrep '^[A-Z]' tf
o

(note that this behaviour is also apparent on FreeBSD)

A simple test program, regcomp.c (attached), does the following in different locales:

regcomp(&rx, "^[L-R]", REG_EXTENDED);
regexec(&rx, "o", 0, NULL, 0);

outputs:

Testing locale en_GB.UTF-8 - Matched
Testing locale C.UTF-8 - Nomatch
Testing locale C - Nomatch

This appears to be because p_b_term() in libc interprets the range in a fairly basic way: it checks each character from 0 to 255 against the start and end of the range using _collate_range_cmp(), and includes the character if it falls between the two. That function uses wcscoll_l() to do the comparison. This can only ever work in single-byte locales, since it only looks at the first 256 characters.

A simple program exercising `wcscoll` in different locales appears to show that, for example, a lower-case o does compare as between a capital L and a capital P. See the attached wscoll.c:

build% ./wscoll C.UTF-8
Distance: L - O = -3
Distance: O - P = -1
Distance: L - o = -35
Distance: o - P = 31
build%
build% ./wscoll en_GB.UTF-8
Distance: L - O = -45
Distance: O - P = -29
Distance: L - o = -45
Distance: o - P = -29

However, the same appears true on other systems, for example on Linux:

[bob@ns2]/tmp% ./wscoll
Distance: L - O = -3
Distance: O - P = -1
Distance: L - o = -3
Distance: o - P = -1

Of note, FreeBSD attempted to change this in their libc but reverted soon after.


Files

wscoll.c (378 Bytes) - wscoll test program, Andy Fiddaman, 2021-04-05 12:27 PM
regcomp.c (717 Bytes) - regcomp test program, Andy Fiddaman, 2021-04-05 12:27 PM
#1

Updated by Robert Mustacchi 15 days ago

Thanks for writing this up, Andy. I did a similar investigation and came to a similar conclusion: regcomp() was relying on collation to build the character class.

I suspect it started out using collation here because of this bit of POSIX: "In the POSIX locale, a range expression represents the set of collating elements that fall between two elements in the collation sequence, inclusive. In other locales, a range expression has unspecified behavior: strictly conforming applications shall not rely on whether the range expression is valid, or on the set of collating elements matched. A range expression shall be expressed as the starting point and the ending point separated by a hyphen ( '-' )." (9.3.5 no. 7 https://pubs.opengroup.org/onlinepubs/007904875/basedefs/xbd_chap09.html).

There's a useful bit of background on collation from the Unicode folks (https://www.unicode.org/reports/tr10/#Common_Misperceptions). This tells us a couple of different things: collation rules aren't stable and change over time and, as we've discovered, they have no relation to the code points. Effectively we're seeing an ordering something like aAbBcC... (I'm not sure whether upper case comes first or not).

Ranges in BREs seem to have been designed with an implicit assumption of ASCII ordering. It's not clear that relying on collation here makes sense, but it's also not clear to me what the implications would be of changing this to use byte values, as in the C locale.

#2

Updated by Andy Fiddaman 15 days ago

I have found references that say that POSIX 1992 defined the ordering for a range expression (outside of the C and POSIX locales) to be in collation order.

POSIX 2008 introduced the wording that you quoted: "In other locales, a range expression has unspecified behavior".

It seems that the illumos implementation is allowed, and may have been required for standards compliance in the past, but it certainly isn't what people expect.
