Bug #4006
closed
Certain printable unicode characters misclassified as nonprintable
100%
Description
iswprint incorrectly returns false for many printable Unicode characters (e.g. ½, €, £ and box-drawing characters) under LC_CTYPE=en_US.UTF-8. No iswctype check seems to return true for these either; you can use the attached test program to verify. (I don't believe this locale defines any additional character classes, so I think testing the ones valid in all locales should be sufficient.)
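For readers without the attachment, here is a minimal sketch of that kind of check. It is not the attached wctype_test.c; the helper name classes_of and the posix_classes table are made up for illustration. It only uses the standard wctype() and iswctype() calls and the twelve class names POSIX requires in every locale.

```c
#include <stdio.h>
#include <wctype.h>

/* The twelve character classes POSIX requires in every locale. */
static const char *const posix_classes[] = {
	"alnum", "alpha", "blank", "cntrl", "digit", "graph",
	"lower", "print", "punct", "space", "upper", "xdigit"
};

/*
 * Hypothetical helper: write a space-separated list of the classes that
 * iswctype() reports for wc into buf, and return how many matched.
 * Classification follows whatever LC_CTYPE locale is currently active.
 */
static int
classes_of(wint_t wc, char *buf, size_t buflen)
{
	size_t i, used = 0;
	int n, matches = 0;

	if (buflen == 0)
		return (0);
	buf[0] = '\0';

	for (i = 0; i < sizeof (posix_classes) / sizeof (posix_classes[0]);
	    i++) {
		wctype_t t = wctype(posix_classes[i]);

		/* wctype() returns 0 for a class the locale doesn't know. */
		if (t == (wctype_t)0 || !iswctype(wc, t))
			continue;
		matches++;

		n = snprintf(buf + used, buflen - used, "%s%s",
		    used != 0 ? " " : "", posix_classes[i]);
		if (n < 0 || (size_t)n >= buflen - used) {
			buf[used] = '\0';	/* drop the partial name */
			break;
		}
		used += (size_t)n;
	}
	return (matches);
}
```

To reproduce the report, a caller would first do setlocale(LC_CTYPE, "en_US.UTF-8") and then call classes_of() with a code point such as 0x00A7; a return value of 0 means no class claims the character.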
This is possibly related to #2408, but I'm not sure.
Updated by Lauri Tirkkonen about 9 years ago
- File wctype_test.c added
Also classified as non-printable are characters '§' (SECTION SIGN) and '¤' (CURRENCY SIGN). I looked into this further using those characters as a basis.
This happens because the input data doesn't specify these characters as belonging to the 'print' (or indeed, any) character class (see http://src.illumos.org/source/xref/illumos-gate/usr/src/cmd/localedef/ctype.c#190). CLDR's POSIX locale definition generator tools output the same thing; neither of the example characters appears in 'print' for en_US.UTF-8. Clearly they are printable and should be in 'print', and SECTION SIGN should also appear in 'punct', since it is listed in the 'punctuation' exemplar character set (CLDR common/main/en.xml, http://cldr.unicode.org/translation/characters). I also looked at the released POSIX-format locale files in an older CLDR release (2.0.1) and these characters weren't there either.
I also looked at a few other operating systems wrt. iswctype, iswprint and these characters. OpenBSD's iswprint returns true for both, and places them in character classes 'graph' 'print' and 'punct', per iswctype. On Debian 7 (EGLIBC 2.13-38+deb7u4), however, iswprint returns true, but iswctype doesn't place them in any character class (not even 'print'!). I modified my original test program a bit for this (missing #include), and have attached an updated version.
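The Debian discrepancy (iswprint true, but no 'print' class via iswctype) can be checked per character with something like the following. This is a sketch, not part of the attached program, and the function name print_checks_agree is invented here.

```c
#include <stdbool.h>
#include <wctype.h>

/*
 * Hypothetical helper: true when iswprint(wc) and
 * iswctype(wc, wctype("print")) give the same answer in the
 * currently active LC_CTYPE locale.
 */
static bool
print_checks_agree(wint_t wc)
{
	bool direct = iswprint(wc) != 0;
	bool via_class = iswctype(wc, wctype("print")) != 0;

	return (direct == via_class);
}
```

POSIX describes iswprint(wc) as equivalent to the iswctype() form, so a false return from this helper points at an inconsistency in the libc or its locale data rather than in the caller.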
So since OpenBSD seems to do the right thing, I looked at its source data. Based on the top comment, it originates from FreeBSD (so it's also visible at http://src.illumos.org/source/xref/freebsd-head/share/mklocale/UTF-8.src); there's a comment saying that it's generated, but no clues as to how. I'm unsure whether CLDR actually contains character classification data wrt. POSIX character classes.
It's not clear what the right approach here is, or whether this could be considered a CLDR bug or not. Still, since we only consider characters listed in the 'print' character class in the source files printable, iswprint returns false for most Unicode characters; perhaps that logic should be reversed (i.e. consider all characters printable except those known not to be).
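To make the "reversed" idea concrete, here is a rough sketch of a default-printable test over raw code points. It is only an illustration of the approach, not what localedef does and not the fix that was eventually committed; the function name printable_by_default and the choice of exclusions are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: treat a code point as printable unless it falls
 * in a range known not to be. Deliberately crude: format characters,
 * line/paragraph separators and unassigned code points all count as
 * printable here.
 */
static bool
printable_by_default(uint32_t cp)
{
	if (cp > 0x10FFFF)			/* outside the code space */
		return (false);
	if (cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F))
		return (false);			/* C0 controls, DEL, C1 */
	if (cp >= 0xD800 && cp <= 0xDFFF)	/* surrogates */
		return (false);
	if (cp >= 0xFDD0 && cp <= 0xFDEF)	/* noncharacters */
		return (false);
	if ((cp & 0xFFFE) == 0xFFFE)		/* U+nFFFE and U+nFFFF */
		return (false);
	return (true);
}
```

With this rule the characters from this report (U+00A7, U+00A4, U+00BD, U+20AC, U+00A3, box drawing at U+2500 onwards) all come out printable, at the cost of also accepting code points that have no assigned character.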
Updated by Lauri Tirkkonen about 9 years ago
Currently the UTF-8.ct file, which contains the character classifications and case mappings, is generated from the *.UTF-8.src data files, which come from CLDR. Since these are insufficient, we could augment the .ct file with information from the Unicode Character Database (UCD). In particular there is a property called "General_Category" [0, Section 4.5] which exists for all Unicode code points, which would be useful here (although there can be ambiguities in mapping these categories to POSIX character classes).
Actually, since the case mappings and character classifications end up being the same for all UTF-8 locales [1], perhaps the whole UTF-8.ct ought to be generated from the UCD instead of CLDR.
[0]: http://www.unicode.org/versions/Unicode7.0.0/
[1]: http://src.illumos.org/source/xref/illumos-gate/usr/src/cmd/localedef/data/ctype.sh#18
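As a sketch of what a General_Category to POSIX class mapping might look like: the two-letter category codes are from the UCD, but the mapping itself is just one possible choice, and the CLS_* flags and gc_to_classes are names invented for this example. The ambiguous spots are called out in comments.

```c
#include <string.h>

/* Illustrative class flags; not the values localedef uses. */
enum {
	CLS_ALPHA = 1 << 0,
	CLS_UPPER = 1 << 1,
	CLS_LOWER = 1 << 2,
	CLS_PUNCT = 1 << 3,
	CLS_SPACE = 1 << 4,
	CLS_CNTRL = 1 << 5,
	CLS_GRAPH = 1 << 6,
	CLS_PRINT = 1 << 7
};

/*
 * Hypothetical mapping from a two-letter General_Category code
 * (e.g. "Lu", "Sc", "Po") to a set of POSIX classes. Returns 0 for
 * categories that get no class at all.
 */
static unsigned
gc_to_classes(const char *gc)
{
	unsigned mask;

	if (gc == NULL || strlen(gc) != 2)
		return (0);

	switch (gc[0]) {
	case 'L':	/* letters */
		mask = CLS_ALPHA | CLS_GRAPH | CLS_PRINT;
		if (gc[1] == 'u')
			mask |= CLS_UPPER;
		else if (gc[1] == 'l')
			mask |= CLS_LOWER;
		return (mask);
	case 'M':	/* marks: ambiguous, treated as graphic only */
	case 'N':	/* numbers: 'digit' is left to ASCII 0-9 */
		return (CLS_GRAPH | CLS_PRINT);
	case 'P':	/* punctuation, e.g. SECTION SIGN (Po) */
	case 'S':	/* symbols, e.g. CURRENCY SIGN (Sc) */
		return (CLS_PUNCT | CLS_GRAPH | CLS_PRINT);
	case 'Z':	/* separators: only Zs is treated as printable */
		return (gc[1] == 's' ? (CLS_SPACE | CLS_PRINT) : CLS_SPACE);
	case 'C':	/* Cc is control; Cf, Cs, Co, Cn get nothing */
		return (gc[1] == 'c' ? CLS_CNTRL : 0);
	default:
		return (0);
	}
}
```

Under this mapping '§' (Po) and '¤', '€', '£' (all Sc) land in 'punct', 'graph' and 'print', which matches what OpenBSD reports for the first two; '½' (No) and box-drawing characters (So) become printable as well.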
Updated by Lauri Tirkkonen about 9 years ago
Further reading of Unicode docs reveals the following (under http://www.unicode.org/reports/tr35/#Bundle_vs_Item_Lookup ):
The locale data does not contain general character properties that are
derived from the Unicode Character Database [UAX44]. That data being common
across locales, it is not duplicated in the bundles. Constructing a POSIX
locale from the CLDR data requires use of UCD data. In addition, POSIX
locales may also specify the character encoding, which requires the data to
be transformed into that target encoding.
Interestingly the POSIX generation tools included with the CLDR don't require UCD data from what I can tell, though. If there's a way to provide UCD data to those tools, I haven't found it.
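Since the CLDR tools don't appear to take UCD input, one option would be to read it directly. Below is a minimal sketch of pulling the code point and General_Category out of one UnicodeData.txt line (semicolon-separated fields: hex code point first, General_Category third). The function name parse_ucd_line is made up, and the sketch ignores the "<..., First>" / "<..., Last>" pairs that UnicodeData.txt uses to describe large ranges.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse one UnicodeData.txt line of the form
 *   <hex code point>;<name>;<General_Category>;...
 * On success stores the code point in *cp and the two-letter category
 * (NUL-terminated) in gc, and returns 0. Returns -1 if the line does
 * not have that shape.
 */
static int
parse_ucd_line(const char *line, unsigned long *cp, char gc[3])
{
	char *end;
	const char *name_end, *gc_end;

	*cp = strtoul(line, &end, 16);
	if (end == line || *end != ';')
		return (-1);

	/* Skip the name field. */
	name_end = strchr(end + 1, ';');
	if (name_end == NULL)
		return (-1);

	/* The category field must be exactly two characters long. */
	gc_end = strchr(name_end + 1, ';');
	if (gc_end == NULL || gc_end - (name_end + 1) != 2)
		return (-1);

	memcpy(gc, name_end + 1, 2);
	gc[2] = '\0';
	return (0);
}
```

The resulting category string could then be fed into whatever General_Category to POSIX class mapping ends up being chosen.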
Updated by Lauri Tirkkonen about 9 years ago
- Status changed from New to In Progress
- Assignee set to Lauri Tirkkonen
Updated by Yuri Pankov almost 7 years ago
- Status changed from In Progress to Pending RTI
- Assignee changed from Lauri Tirkkonen to Yuri Pankov
- % Done changed from 0 to 90
- Difficulty changed from Medium to Bite-size
- Tags deleted (needs-triage)
Updated by Electric Monk almost 7 years ago
- Status changed from Pending RTI to Closed
- % Done changed from 90 to 100
git commit 3e6960d70408b9f4e09714ed3341173673ed28b2
commit 3e6960d70408b9f4e09714ed3341173673ed28b2
Author: Yuri Pankov <yuri.pankov@nexenta.com>
Date: 2017-03-01T16:14:17.000Z
4006 Certain printable unicode characters misclassified as nonprintable
Portions contributed by: John Marino <draco@marino.st>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Richard Lowe <richlowe@richlowe.net>