Bug #8995

regcomp(3C) shouldn't use collations for range expressions in any locale

Added by Yuri Pankov over 2 years ago.

Status: In Progress
Priority: Normal
Assignee:
Category: lib - userland libraries
Start date: 2018-01-25
Due date:
% Done: 50%
Estimated time:
Difficulty: Bite-size
Tags:
Gerrit CR:

Description

The specification says the following about range expressions:

In the POSIX locale, a range expression represents the set of collating elements
that fall between two elements in the collation sequence, inclusive. In other locales,
a range expression has unspecified behavior: strictly conforming applications shall
not rely on whether the range expression is valid, or on the set of collating elements
matched.

We currently try to do collation lookups, but only halfway: the check for range correctness is there, but then we add the characters from the 0..UCHAR_MAX range, which makes no sense. Moreover, doing proper collation lookups (using wcscoll(), as done upstream) would only make things worse: the range would include a mix of lowercase, uppercase, and other characters for what looks to the user like a pretty simple range.

A binary lookup (non-collating, based simply on the wide character value) will work in most cases (e.g. for Cyrillic small letters [а-я]), and where it doesn't, well, the specification says the behavior is unspecified. Making this change will also allow a simple fix for #8908.
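
To illustrate the difference between the two approaches, here is a minimal sketch. It is not the actual regcomp(3C) code: the helper names `range_match_binary()` and `range_match_collate()` are hypothetical, and the real bracket-expression handling is more involved. The sketch only assumes the standard `wcscoll()` from `<wchar.h>`.

```c
#include <stdbool.h>
#include <wchar.h>

/*
 * Hypothetical helper (not the libc implementation): ch is in the range
 * [lo-hi] if its wide character value lies between the two endpoints.
 * The result does not depend on the current locale.
 */
static bool
range_match_binary(wchar_t lo, wchar_t hi, wchar_t ch)
{
	return (lo <= ch && ch <= hi);
}

/*
 * Hypothetical helper for comparison: ch is in the range if it collates
 * between lo and hi according to LC_COLLATE of the current locale.
 */
static bool
range_match_collate(wchar_t lo, wchar_t hi, wchar_t ch)
{
	wchar_t l[2] = { lo, L'\0' };
	wchar_t h[2] = { hi, L'\0' };
	wchar_t c[2] = { ch, L'\0' };

	return (wcscoll(l, c) <= 0 && wcscoll(c, h) <= 0);
}
```

With the binary variant, [a-z] never matches 'B' regardless of locale, and [а-я] (U+0430..U+044F) matches every letter in that code point block. A letter such as ё (U+0451) lies outside the block and is not matched, which is one of the cases where the result may surprise the user but is still permitted by the specification. What the collating variant matches depends entirely on the locale's collation order.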
