Bug #4006

Certain printable unicode characters misclassified as nonprintable

Added by Lauri Tirkkonen over 6 years ago. Updated almost 3 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: locale - data and messages
Start date: 2013-08-07
Due date:
% Done: 100%
Estimated time:
Difficulty: Bite-size
Tags:

Description

iswprint incorrectly returns false for many printable Unicode characters (e.g. ½, €, £ and box-drawing characters) when LC_CTYPE=en_US.UTF-8. No iswctype check seems to return true for these; you can use the attached test program to verify. (I don't believe this locale defines any additional character classes, so testing the ones valid in all locales should be sufficient.)
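For readers without the attachment, here is a minimal sketch of that kind of check. It is not the attached wctype_test.c; the names `list_classes` and `posix_classes` are hypothetical. It asks wctype()/iswctype() about each of the twelve class names POSIX requires in every locale and collects the ones that match.

```c
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* The twelve character class names that POSIX requires in every locale. */
static const char *const posix_classes[] = {
	"alnum", "alpha", "blank", "cntrl", "digit", "graph",
	"lower", "print", "punct", "space", "upper", "xdigit"
};

/*
 * Hypothetical helper (not taken from the attachment): write a
 * space-separated list of the classes containing wc into buf and
 * return how many classes matched. The result depends on the
 * current LC_CTYPE locale.
 */
static int
list_classes(wint_t wc, char *buf, size_t len)
{
	size_t i, n = sizeof (posix_classes) / sizeof (posix_classes[0]);
	int matched = 0;

	if (len == 0)
		return (0);
	buf[0] = '\0';
	for (i = 0; i < n; i++) {
		wctype_t t = wctype(posix_classes[i]);

		/* Skip names the locale does not know and non-members. */
		if (t == (wctype_t)0 || !iswctype(wc, t))
			continue;
		/* Stop if separator + name + NUL would not fit. */
		if (strlen(buf) + strlen(posix_classes[i]) + 2 > len)
			break;
		if (matched > 0)
			strcat(buf, " ");
		strcat(buf, posix_classes[i]);
		matched++;
	}
	return (matched);
}
```

To reproduce the report, a caller would first run setlocale(LC_CTYPE, "") with LC_CTYPE=en_US.UTF-8 in the environment and then pass a code point such as 0x00A7 (SECTION SIGN). A return value of 0 means no class claimed the character.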

This is possibly related to #2408, but I'm not sure.


Files

wctype_test.c (786 Bytes), Lauri Tirkkonen, 2013-08-07 06:41 PM
wctype_test.c (835 Bytes), Lauri Tirkkonen, 2014-10-06 10:08 PM

History

#1

Updated by Lauri Tirkkonen about 5 years ago

Also classified as non-printable are characters '§' (SECTION SIGN) and '¤' (CURRENCY SIGN). I looked into this further using those characters as a basis.

This happens because the input data doesn't specify these characters as belonging to the 'print' (or indeed, any) character class (see http://src.illumos.org/source/xref/illumos-gate/usr/src/cmd/localedef/ctype.c#190). CLDR's POSIX locale definition generator tools output the same thing: neither of the example characters appears in 'print' for en_US.UTF-8. Clearly they are printable and should be in 'print', and SECTION SIGN should also appear in 'punct', since it is listed in the 'punctuation' exemplar character set (CLDR common/main/en.xml, http://cldr.unicode.org/translation/characters). I also looked at the released POSIX-format locale files in an older CLDR release (2.0.1), and these characters weren't there either.

I also looked at how a few other operating systems handle iswctype and iswprint for these characters. OpenBSD's iswprint returns true for both, and iswctype places them in the character classes 'graph', 'print' and 'punct'. On Debian 7 (EGLIBC 2.13-38+deb7u4), however, iswprint returns true, but iswctype doesn't place them in any character class (not even 'print'!). I modified my original test program a bit for this (it was missing an #include), and have attached an updated version.

So since OpenBSD seems to do the right thing, I looked at its source data. Based on the top comment, it originates from FreeBSD (so it's also visible at http://src.illumos.org/source/xref/freebsd-head/share/mklocale/UTF-8.src); there's a comment saying that it's generated, but no clues as to how. I'm unsure whether CLDR actually contains character classification data wrt. POSIX character classes.

It's not clear what the right approach is here, or whether this could be considered a CLDR bug. Still, since we only consider characters listed in the 'print' character class in the source files printable, iswprint returns false for most Unicode characters; perhaps that logic should be reversed (i.e. consider all characters printable except those known not to be).
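To make the "reversed" idea concrete, here is a rough sketch of a default-printable test. It is an illustration only, not what localedef does and not the eventual fix. The function name `printable_by_default` is hypothetical, and the list of exclusions is an assumption: it covers control characters, surrogates, noncharacters and the line/paragraph separators, and it would still treat unassigned code points as printable.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: treat every code point as printable unless it
 * falls in a range known not to be. The exclusion list below is an
 * assumption for illustration and is not exhaustive.
 */
static bool
printable_by_default(uint32_t cp)
{
	if (cp > 0x10FFFF)			/* outside the code space */
		return (false);
	if (cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F))
		return (false);			/* C0 controls, DEL, C1 controls */
	if (cp >= 0xD800 && cp <= 0xDFFF)	/* surrogates */
		return (false);
	if (cp >= 0xFDD0 && cp <= 0xFDEF)	/* noncharacters */
		return (false);
	if ((cp & 0xFFFE) == 0xFFFE)		/* U+xxFFFE and U+xxFFFF */
		return (false);
	if (cp == 0x2028 || cp == 0x2029)	/* line/paragraph separator */
		return (false);
	return (true);
}
```

Under these assumptions, SECTION SIGN (U+00A7), CURRENCY SIGN (U+00A4) and EURO SIGN (U+20AC) all come out printable without having to be listed anywhere.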

#2

Updated by Lauri Tirkkonen about 5 years ago

Currently the UTF-8.ct file, which contains the character classifications and case mappings, is generated from the *.UTF-8.src data files, which come from CLDR. Since these are insufficient, we could augment the .ct file with information from the Unicode Character Database (UCD). In particular there is a property called "General_Category" [0, Section 4.5] which exists for all Unicode code points, which would be useful here (although there can be ambiguities in mapping these categories to POSIX character classes).

Actually, since the case mappings and character classifications end up being the same for all UTF-8 locales [1], perhaps the whole UTF-8.ct ought to be generated from the UCD instead of CLDR.

[0]: http://www.unicode.org/versions/Unicode7.0.0/
[1]: http://src.illumos.org/source/xref/illumos-gate/usr/src/cmd/localedef/data/ctype.sh#18
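As a rough illustration of what a General_Category-based classification could look like, the sketch below maps a two-letter category abbreviation (as found in the UCD, e.g. "Lu", "Po", "Sc") to a set of POSIX classes. The function name, the flag names and the mapping itself are assumptions made for this example; the ambiguous cases (numbers other than ASCII digits, no-break spaces in Zs, format characters) are resolved arbitrarily here.

```c
#include <string.h>

/* Hypothetical flag values for a subset of the POSIX classes. */
enum {
	CLS_ALPHA = 1 << 0,
	CLS_UPPER = 1 << 1,
	CLS_LOWER = 1 << 2,
	CLS_PUNCT = 1 << 3,
	CLS_SPACE = 1 << 4,
	CLS_CNTRL = 1 << 5,
	CLS_GRAPH = 1 << 6,
	CLS_PRINT = 1 << 7
};

/*
 * Hypothetical mapping from a UCD General_Category abbreviation to
 * POSIX class flags. This is one possible choice, not a standard one.
 */
static unsigned
classes_for_gc(const char *gc)
{
	const unsigned visible = CLS_GRAPH | CLS_PRINT;

	if (gc == NULL || strlen(gc) != 2)
		return (0);
	switch (gc[0]) {
	case 'L':	/* letters */
		if (gc[1] == 'u')
			return (visible | CLS_ALPHA | CLS_UPPER);
		if (gc[1] == 'l')
			return (visible | CLS_ALPHA | CLS_LOWER);
		return (visible | CLS_ALPHA);
	case 'M':	/* marks */
	case 'N':	/* numbers; 'digit' is left to ASCII 0-9 */
		return (visible);
	case 'P':	/* punctuation, e.g. SECTION SIGN is Po */
	case 'S':	/* symbols, e.g. CURRENCY SIGN is Sc */
		return (visible | CLS_PUNCT);
	case 'Z':	/* separators: Zs is printable but not graph */
		return (gc[1] == 's' ? (CLS_SPACE | CLS_PRINT) : CLS_SPACE);
	case 'C':	/* Cc is control; Cf, Cs, Co, Cn get no class here */
		return (gc[1] == 'c' ? CLS_CNTRL : 0);
	default:
		return (0);
	}
}
```

With this mapping, the characters from the report (§ is Po, ¤ and € are Sc, ½ is No, box-drawing characters are So) would all end up in 'print', and § would also land in 'punct'.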

#3

Updated by Lauri Tirkkonen about 5 years ago

Further reading of Unicode docs reveals the following (under http://www.unicode.org/reports/tr35/#Bundle_vs_Item_Lookup ):

The locale data does not contain general character properties that are
derived from the Unicode Character Database [UAX44]. That data being common
across locales, it is not duplicated in the bundles. Constructing a POSIX
locale from the CLDR data requires use of UCD data. In addition, POSIX
locales may also specify the character encoding, which requires the data to
be transformed into that target encoding.

Interestingly, though, the POSIX generation tools included with CLDR don't seem to require UCD data, from what I can tell. If there's a way to provide UCD data to those tools, I haven't found it.

#4

Updated by Lauri Tirkkonen about 5 years ago

  • Status changed from New to In Progress
  • Assignee set to Lauri Tirkkonen

#5

Updated by Yuri Pankov almost 3 years ago

  • Status changed from In Progress to Pending RTI
  • Assignee changed from Lauri Tirkkonen to Yuri Pankov
  • % Done changed from 0 to 90
  • Difficulty changed from Medium to Bite-size
  • Tags deleted (needs-triage)

#6

Updated by Electric Monk almost 3 years ago

  • Status changed from Pending RTI to Closed
  • % Done changed from 90 to 100

commit  3e6960d70408b9f4e09714ed3341173673ed28b2
Author: Yuri Pankov <yuri.pankov@nexenta.com>
Date:   2017-03-01T16:14:17.000Z

    4006 Certain printable unicode characters misclassified as nonprintable
    Portions contributed by: John Marino <draco@marino.st>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Approved by: Richard Lowe <richlowe@richlowe.net>
