Bug #11807
closed
"private use area" characters should be marked as "printable"
100%
Description
As reported in FreeBSD PR240911, at least some of the characters in the E000-F8FF range are used by Powerline fonts. Since UnicodeData.txt gives these ranges no attributes other than "Other, Private Use", it should be safe to mark all of them as printable.
Updated by Garrett D'Amore over 3 years ago
I'm not sure I agree that it is safe to mark private use characters as printable. That may be the case, but generally private use means that you can't make any assertions about them.
It would be perfectly reasonable to use codes in that space for joiners, to indicate context such as direction or style changes, or anything else that someone might want. Blithely assuming that they are printable is optimistic.
Updated by Yuri Pankov over 3 years ago
Yes, the description says as much. Marking those as "printable" is a deliberate choice we could make here to help existing known consumers. The alternative, using hand-crafted overrides for that area, doesn't look better to me.
Updated by Yuri Pankov over 3 years ago
Just checked Debian GNU/Linux and they seem to do the same; at least the 0xe000-0xf8ff range is all marked 'print punct graph'.
Updated by Garrett D'Amore over 3 years ago
Ok. Personally I think this is a mistake being made to accommodate non-standard use cases, but I understand the expediency of doing so.
Having said that, punct is absolutely dead wrong. Even the existing symbols in the font are not punctuation, and reporting them as such may wind up with them acquiring undesirable semantics (such as being allowed or disallowed in various contexts) -- these should be treated more like emoji, which are also not punctuation. Btw, how are these treated in terms of cell width?
Updated by Yuri Pankov over 3 years ago
Garrett D'Amore wrote:
Ok. Personally I think this is a mistake being made to accommodate non-standard use cases, but I understand the expediency of doing so.
I don't disagree here, and would rather see powerline and friends just get the official code points instead.
Having said that, punct is absolutely dead wrong. Even the existing symbols in the font are not punctuation, and reporting them as such may wind up with them acquiring undesirable semantics (such as being allowed or disallowed in various contexts) -- these should be treated more like emoji, which are also not punctuation. Btw, how are these treated in terms of cell width?
Agreed that 'punct' is wrong; I want to make it just 'graph', which implies 'print'.
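The intended classification (graph and print set, punct clear) can be checked with the standard wide-character classification functions. The following is a minimal sketch, not code from the actual change; the helper names `is_graph_not_punct` and `graph_implies_print` are hypothetical, and both helpers consult whatever locale is currently active.

```c
#include <assert.h>
#include <wctype.h>

/*
 * Hypothetical helper: returns 1 when the current locale classifies wc
 * as both graph and print but not punct, 0 otherwise.
 */
int
is_graph_not_punct(wint_t wc)
{
    return (iswgraph(wc) && iswprint(wc) && !iswpunct(wc));
}

/*
 * Hypothetical helper: returns 1 if every code point in [first, last]
 * that is graph is also print in the current locale, 0 otherwise.
 */
int
graph_implies_print(wint_t first, wint_t last)
{
    assert(first <= last);
    for (wint_t wc = first; ; wc++) {
        if (iswgraph(wc) && !iswprint(wc))
            return (0);
        if (wc == last)
            break;
    }
    return (1);
}
```

After `setlocale(LC_CTYPE, "en_US.UTF-8")`, calling `is_graph_not_punct(0xe0b1)` would be one way to confirm that a Powerline code point ended up as 'graph' without 'punct'.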
The wcwidth question is an interesting one; I hadn't thought about it. Checking the same Linux system, all of those have a width of 1. I'll make sure we do the same, so if anything expects something else, "we did all we could".
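A width check of this kind can be sketched with `wcwidth()`. This is an illustration only, assuming a system where `wchar_t` holds Unicode code points; the helper name `count_width_mismatches` is hypothetical, and the call changes the process-wide LC_CTYPE locale as a side effect.

```c
#define _XOPEN_SOURCE 700   /* wcwidth() is an XSI interface */
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: counts the code points in [first, last] whose
 * wcwidth() in the named locale differs from `expected`.
 * Returns -1 if the locale cannot be selected.
 */
int
count_width_mismatches(const char *locale, wchar_t first, wchar_t last,
    int expected)
{
    int mismatches = 0;

    assert(first <= last);
    if (setlocale(LC_CTYPE, locale) == NULL)
        return (-1);

    for (wchar_t wc = first; ; wc++) {
        if (wcwidth(wc) != expected)
            mismatches++;
        if (wc == last)
            break;
    }
    return (mismatches);
}
```

For example, `count_width_mismatches("en_US.UTF-8", 0xe000, 0xf8ff, 1)` returning 0 would indicate that every code point in that range reports a width of 1 in that locale.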
Updated by Joshua M. Clulow about 3 years ago
Background Information
- Unicode® Standard Annex #44 contains information on the UnicodeData.txt file format for Unicode 9.0.0, which appears to be the version we currently use.
- Wikipedia: Private Use Areas has some background on the Unicode code point ranges we're talking about here, including an explicit mention of Powerline using "U+E0A0–U+E0A2 and U+E0B0–U+E0B3".
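For reference, the code point ranges under discussion can be written down as a small classifier. This is a sketch for illustration: the enum and function names are hypothetical, the three Private Use Area ranges are those defined by Unicode, and the Powerline ranges are the ones quoted from Wikipedia above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the three Unicode Private Use Area ranges. */
enum pua_range { PUA_NONE = 0, PUA_BMP, PUA_PLANE15, PUA_PLANE16 };

/* Returns which Private Use Area (if any) contains the code point. */
enum pua_range
pua_classify(uint32_t cp)
{
    if (cp >= 0xE000 && cp <= 0xF8FF)
        return (PUA_BMP);       /* U+E000..U+F8FF */
    if (cp >= 0xF0000 && cp <= 0xFFFFD)
        return (PUA_PLANE15);   /* U+F0000..U+FFFFD */
    if (cp >= 0x100000 && cp <= 0x10FFFD)
        return (PUA_PLANE16);   /* U+100000..U+10FFFD */
    return (PUA_NONE);
}

/* Returns 1 for the Powerline code points quoted above, 0 otherwise. */
int
is_powerline_symbol(uint32_t cp)
{
    int hit = (cp >= 0xE0A0 && cp <= 0xE0A2) ||
        (cp >= 0xE0B0 && cp <= 0xE0B3);

    /* Both Powerline ranges sit inside the BMP Private Use Area. */
    assert(!hit || pua_classify(cp) == PUA_BMP);
    return (hit);
}
```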
glibc
Looking at what glibc does, both in the glibc shipped in Debian and in current upstream master, it seems that they mark all three PUA ranges as at least "graph" and "print" in their ctype file; e.g., look at localedata/locales/i18n_ctype etc. In another file, they note that they use a default character width of 1, and the portion of that file which would override the width does not mention the PUA ranges at all.
Whimsy and Non-normative Addenda
Interestingly, the Linux project includes some documentation that suggests they maintain a registry of code points in what they (of course!) call the "Linux Zone", from U+F000 up to U+F8FF. See Documentation/admin-guide/unicode.rst.
There is also a third party organisation that maintains the ConScript Unicode Registry which purports to "coordinate the assignment of blocks out of the Unicode Private Use Area (E000-F8FF and 000F0000-0010FFFF) to constructed/artificial scripts, including scripts for constructed/artificial languages". Of note is a long-standing proposed definition of Cirth script that is still "being revised" and appears to conflict with the Powerline characters!
Updated by Yuri Pankov about 3 years ago
To test this, I used the test case attached to the original FreeBSD PR:
alnum:0x4004, cntrl:0x20, ideogram:0x200, print:0x8000, space:0x8, xdigit:0x80, alpha:0x4000, digit:0x4, lower:0x2, punct:0x10, special:0x1000, blank:0x40, graph:0x2000, phonogram:0x100, rune:0xffffffff, upper:0x1,
Default Locale is: C
Character d 0x64 is in classes: alnum print xdigit alpha lower graph rune
in C locale, iswprint(0x64) = 1
in en_US.UTF-8 locale, iswprint(0x64) = 1
in ja_JP.UTF-8 locale, iswprint(0x64) = 1
Character 0xe0b1 is in classes: print graph rune
in C locale, iswprint(0xe0b1) = 0
in en_US.UTF-8 locale, iswprint(0xe0b1) = 1
in ja_JP.UTF-8 locale, iswprint(0xe0b1) = 1
Character 0xe0b2 is in classes: print graph rune
in C locale, iswprint(0xe0b2) = 0
in en_US.UTF-8 locale, iswprint(0xe0b2) = 1
in ja_JP.UTF-8 locale, iswprint(0xe0b2) = 1
Character 0xe0b3 is in classes: print graph rune
in C locale, iswprint(0xe0b3) = 0
in en_US.UTF-8 locale, iswprint(0xe0b3) = 1
in ja_JP.UTF-8 locale, iswprint(0xe0b3) = 1
Character 0xe0b0 is in classes: print graph rune
in C locale, iswprint(0xe0b0) = 0
in en_US.UTF-8 locale, iswprint(0xe0b0) = 1
in ja_JP.UTF-8 locale, iswprint(0xe0b0) = 1
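The test case itself lives in the FreeBSD PR and is not reproduced in this thread. The per-locale `iswprint()` lines could be produced by something like the sketch below; it is not the attached program, the helper names are hypothetical, and the locale list simply mirrors the three locales that appear in the output.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

/*
 * Hypothetical helper: returns 1 if wc is printable in the named locale,
 * 0 if it is not, and -1 if the locale cannot be selected.
 * Note: this changes the process-wide LC_CTYPE locale.
 */
int
iswprint_in_locale(const char *locale, wint_t wc)
{
    assert(locale != NULL);
    if (setlocale(LC_CTYPE, locale) == NULL)
        return (-1);
    return (iswprint(wc) != 0);
}

/* Prints one line per locale, in the same shape as the output above. */
void
report_iswprint(FILE *out, wint_t wc)
{
    static const char *const locales[] = {
        "C", "en_US.UTF-8", "ja_JP.UTF-8"
    };

    for (size_t i = 0; i < sizeof (locales) / sizeof (locales[0]); i++) {
        int r = iswprint_in_locale(locales[i], wc);

        if (r < 0) {
            fprintf(out, "locale %s is not available\n", locales[i]);
        } else {
            fprintf(out, "in %s locale, iswprint(0x%x) = %d\n",
                locales[i], (unsigned int)wc, r);
        }
    }
}
```

Calling `report_iswprint(stdout, 0xe0b1)` on systems with and without the change would show the difference for a Powerline code point, provided those locales are installed.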
Updated by Yuri Pankov about 3 years ago
Without the change:
alnum:0x4004, cntrl:0x20, ideogram:0x200, print:0x8000, space:0x8, xdigit:0x80, alpha:0x4000, digit:0x4, lower:0x2, punct:0x10, special:0x1000, blank:0x40, graph:0x2000, phonogram:0x100, rune:0xffffffff, upper:0x1,
Default Locale is: C
Character d 0x64 is in classes: alnum print xdigit alpha lower graph rune
in C locale, iswprint(0x64) = 1
in en_US.UTF-8 locale, iswprint(0x64) = 1
in ja_JP.UTF-8 locale, iswprint(0x64) = 1
Character 0xe0b1 is in classes: cntrl rune
in C locale, iswprint(0xe0b1) = 0
in en_US.UTF-8 locale, iswprint(0xe0b1) = 0
in ja_JP.UTF-8 locale, iswprint(0xe0b1) = 0
Character 0xe0b2 is in classes: cntrl rune
in C locale, iswprint(0xe0b2) = 0
in en_US.UTF-8 locale, iswprint(0xe0b2) = 0
in ja_JP.UTF-8 locale, iswprint(0xe0b2) = 0
Character 0xe0b3 is in classes: cntrl rune
in C locale, iswprint(0xe0b3) = 0
in en_US.UTF-8 locale, iswprint(0xe0b3) = 0
in ja_JP.UTF-8 locale, iswprint(0xe0b3) = 0
Character 0xe0b0 is in classes: cntrl rune
in C locale, iswprint(0xe0b0) = 0
in en_US.UTF-8 locale, iswprint(0xe0b0) = 0
in ja_JP.UTF-8 locale, iswprint(0xe0b0) = 0
Updated by Yuri Pankov about 3 years ago
I have also used the attached code to check that other ranges didn't change:
$ diff -u new old
No differences encountered
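The attached code is not shown in this thread. One way such a regression check could work is to dump the classification of every code point outside the changed range on both the old and the new system and then diff the two files. The sketch below illustrates that idea under those assumptions; the function names, the bit assignments, and the output format are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <wctype.h>

/* Hypothetical bit assignments, for this sketch only. */
enum {
    CLS_ALPHA = 1 << 0, CLS_DIGIT = 1 << 1, CLS_PUNCT = 1 << 2,
    CLS_SPACE = 1 << 3, CLS_CNTRL = 1 << 4, CLS_GRAPH = 1 << 5,
    CLS_PRINT = 1 << 6, CLS_UPPER = 1 << 7, CLS_LOWER = 1 << 8
};

/* Collects the current locale's classification of wc into a bitmask. */
unsigned int
class_bits(wint_t wc)
{
    unsigned int bits = 0;

    if (iswalpha(wc)) bits |= CLS_ALPHA;
    if (iswdigit(wc)) bits |= CLS_DIGIT;
    if (iswpunct(wc)) bits |= CLS_PUNCT;
    if (iswspace(wc)) bits |= CLS_SPACE;
    if (iswcntrl(wc)) bits |= CLS_CNTRL;
    if (iswgraph(wc)) bits |= CLS_GRAPH;
    if (iswprint(wc)) bits |= CLS_PRINT;
    if (iswupper(wc)) bits |= CLS_UPPER;
    if (iswlower(wc)) bits |= CLS_LOWER;
    return (bits);
}

/*
 * Writes one "codepoint bitmask" line for each code point in
 * [first, last], skipping [skip_first, skip_last] (the range that is
 * expected to change). Returns the number of lines written.
 */
long
dump_classes(FILE *out, unsigned long first, unsigned long last,
    unsigned long skip_first, unsigned long skip_last)
{
    long lines = 0;

    assert(first <= last);
    for (unsigned long cp = first; cp <= last; cp++) {
        if (cp >= skip_first && cp <= skip_last)
            continue;
        fprintf(out, "%06lx %03x\n", cp, class_bits((wint_t)cp));
        lines++;
    }
    return (lines);
}
```

Running `dump_classes(out, 0, 0x10FFFF, 0xE000, 0xF8FF)` under the same locale before and after the change would yield two files suitable for `diff -u`.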
Updated by Electric Monk about 3 years ago
- Status changed from In Progress to Closed
- % Done changed from 50 to 100
git commit 3052595ab8ddcc51231d239415b5eba5d913d45b
commit 3052595ab8ddcc51231d239415b5eba5d913d45b
Author: Yuri Pankov <ypankov@fastmail.com>
Date: 2020-05-05T01:12:39.000Z

11807 "private use area" characters should be marked as "printable"
Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Robert Mustacchi <rm@fingolfin.org>