Bug #11807

"private use area" characters should be marked as "printable"

Added by Yuri Pankov 12 months ago. Updated 5 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: locale - data and messages
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Bite-size
Tags:
Gerrit CR:

Description

As reported in FreeBSD PR240911, at least some of the characters in the E000-F8FF range are used by Powerline fonts. Since UnicodeData.txt gives these ranges no attributes other than "Other, Private Use", it should be safe to mark all of them as printable.


Files

ct.c (876 Bytes) ct.c Yuri Pankov, 2020-05-05 12:29 AM

History

#1

Updated by Garrett D'Amore 12 months ago

I'm not sure I agree that it is safe to mark private use characters as printable. That may be the case, but generally private use means that you can't make any assertions about them.

It would be perfectly reasonable to use codes in that space for joiners, to indicate context such as direction or style changes, or anything else that someone might want. Blithely assuming that they are printable is optimistic.

#2

Updated by Yuri Pankov 12 months ago

Yes, the description says as much; marking those as "printable" is a deliberate choice we could make here to help existing known consumers. The alternative, using hand-crafted overrides for that area, doesn't look better to me.

#3

Updated by Yuri Pankov 12 months ago

Just checked Debian GNU/Linux and they seem to do the same; at least the 0xe000-0xf8ff range is all marked 'print punct graph'.

#4

Updated by Garrett D'Amore 12 months ago

Ok. Personally I think this is a mistake being made to accommodate non-standard use cases, but I understand the expediency of doing so.

Having said that, punct is absolutely dead wrong. Even the existing symbols in the font are not punctuation, and reporting them as such may wind up with them acquiring undesirable semantics (such as being allowed or disallowed in various contexts) -- these should be treated more like emoji, which are also not punctuation. Btw, how are these treated in terms of cell width?

#5

Updated by Yuri Pankov 12 months ago

Garrett D'Amore wrote:

Ok. Personally I think this is a mistake being made to accommodate non-standard use cases, but I understand the expediency of doing so.

I don't disagree here, and would rather see powerline and friends just get the official code points instead.

Having said that, punct is absolutely dead wrong. Even the existing symbols in the font are not punctuation, and reporting them as such may wind up with them acquiring undesirable semantics (such as being allowed or disallowed in various contexts) -- these should be treated more like emoji, which are also not punctuation. Btw, how are these treated in terms of cell width?

Agreed that 'punct' is wrong; I want to make it just 'graph', which implies 'print'.

The wcwidth question is an interesting one; I hadn't thought about it. Checking the same Linux system, all of those have a width of 1. I'll make sure we do the same, so if anything expects something else, "we did all we could".

#6

Updated by Joshua M. Clulow 5 months ago

Background Information

  • Unicode® Standard Annex #44 contains information on the UnicodeData.txt file format for Unicode 9.0.0, which appears to be the version we currently use.
  • Wikipedia: Private Use Areas has some background on the Unicode code point ranges we're talking about here, including an explicit mention of Powerline using "U+E0A0–U+E0A2 and U+E0B0–U+E0B3".

glibc

Looking at what glibc are doing in both the shipping glibc in Debian and in their current stock master, it seems that they mark all three PUA ranges as at least "graph" and "print" in their ctype file; e.g., look at localedata/locales/i18n_ctype etc. In another file, they note that they use a default character width of 1, and then do not mention the PUA areas at all in the portion of the file which would provide an overridden width.
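For reference, the three Private Use Area ranges mentioned above can be written down as a small range table. This is only a sketch: the bounds are the ranges defined by Unicode, but the `cp_range` type and the `is_private_use` helper are hypothetical names introduced here for illustration and are not taken from glibc or from the illumos change.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive code point range (hypothetical type, for illustration only). */
struct cp_range {
	uint32_t first;
	uint32_t last;
};

/* The three Unicode Private Use Areas. */
static const struct cp_range pua_ranges[] = {
	{ 0xE000,   0xF8FF   },	/* Private Use Area (BMP) */
	{ 0xF0000,  0xFFFFD  },	/* Supplementary Private Use Area-A */
	{ 0x100000, 0x10FFFD },	/* Supplementary Private Use Area-B */
};

/* Hypothetical helper: true if cp lies in any Private Use Area. */
static bool
is_private_use(uint32_t cp)
{
	size_t n = sizeof (pua_ranges) / sizeof (pua_ranges[0]);

	for (size_t i = 0; i < n; i++) {
		if (cp >= pua_ranges[i].first && cp <= pua_ranges[i].last)
			return (true);
	}
	return (false);
}
```

With this table, the Powerline code points cited above (U+E0A0 to U+E0A2 and U+E0B0 to U+E0B3) all fall in the first range.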

Whimsy and Non-normative Addenda

Interestingly, the Linux project includes some documentation that suggests they maintain a registry of code points in what they (of course!) call the "Linux Zone", from U+F000 up to U+F8FF. See Documentation/admin-guide/unicode.rst.

There is also a third party organisation that maintains the ConScript Unicode Registry which purports to "coordinate the assignment of blocks out of the Unicode Private Use Area (E000-F8FF and 000F0000-0010FFFF) to constructed/artificial scripts, including scripts for constructed/artificial languages". Of note is a long-standing proposed definition of Cirth script that is still "being revised" and appears to conflict with the Powerline characters!

#7

Updated by Yuri Pankov 5 months ago

To test this, I used the test case attached to the original FreeBSD PR:

alnum:0x4004, cntrl:0x20, ideogram:0x200, print:0x8000, space:0x8, xdigit:0x80, alpha:0x4000, digit:0x4, lower:0x2, punct:0x10, special:0x1000, blank:0x40, graph:0x2000, phonogram:0x100, rune:0xffffffff, upper:0x1,
Default Locale is: C
Character d 0x64 is in classes: alnum print xdigit alpha lower graph rune
in C locale, iswprint(0x64) = 1
in en_US.UTF-8 locale, iswprint(0x64) = 1
in ja_JP.UTF-8 locale, iswprint(0x64) = 1

Character  0xe0b1 is in classes: print graph rune
in C locale, iswprint(0xe0b1) = 0
in en_US.UTF-8 locale, iswprint(0xe0b1) = 1
in ja_JP.UTF-8 locale, iswprint(0xe0b1) = 1

Character  0xe0b2 is in classes: print graph rune
in C locale, iswprint(0xe0b2) = 0
in en_US.UTF-8 locale, iswprint(0xe0b2) = 1
in ja_JP.UTF-8 locale, iswprint(0xe0b2) = 1

Character  0xe0b3 is in classes: print graph rune
in C locale, iswprint(0xe0b3) = 0
in en_US.UTF-8 locale, iswprint(0xe0b3) = 1
in ja_JP.UTF-8 locale, iswprint(0xe0b3) = 1

Character  0xe0b0 is in classes: print graph rune
in C locale, iswprint(0xe0b0) = 0
in en_US.UTF-8 locale, iswprint(0xe0b0) = 1
in ja_JP.UTF-8 locale, iswprint(0xe0b0) = 1
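The per-locale lines in this output can be reproduced with a check along the following lines. This is a minimal sketch and not the test case attached to the FreeBSD PR; `printable_in_locale` is a hypothetical helper name, and it assumes the named locales are installed on the system under test.

```c
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not the attached test case): switch LC_CTYPE to
 * the named locale and report iswprint(wc) as 1 or 0, or -1 if the
 * locale is not available.  Note that this changes the process locale
 * as a side effect.
 */
static int
printable_in_locale(const char *locale, wint_t wc)
{
	if (setlocale(LC_CTYPE, locale) == NULL)
		return (-1);
	return (iswprint(wc) != 0);
}
```

Calling `printable_in_locale("en_US.UTF-8", 0xe0b1)` on a system with the change applied would be expected to match the `iswprint(0xe0b1) = 1` line above; a return of -1 only means the locale could not be selected.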

#8

Updated by Yuri Pankov 5 months ago

Without the change:

alnum:0x4004, cntrl:0x20, ideogram:0x200, print:0x8000, space:0x8, xdigit:0x80, alpha:0x4000, digit:0x4, lower:0x2, punct:0x10, special:0x1000, blank:0x40, graph:0x2000, phonogram:0x100, rune:0xffffffff, upper:0x1,
Default Locale is: C
Character d 0x64 is in classes: alnum print xdigit alpha lower graph rune
in C locale, iswprint(0x64) = 1
in en_US.UTF-8 locale, iswprint(0x64) = 1
in ja_JP.UTF-8 locale, iswprint(0x64) = 1

Character  0xe0b1 is in classes: cntrl rune
in C locale, iswprint(0xe0b1) = 0
in en_US.UTF-8 locale, iswprint(0xe0b1) = 0
in ja_JP.UTF-8 locale, iswprint(0xe0b1) = 0

Character  0xe0b2 is in classes: cntrl rune
in C locale, iswprint(0xe0b2) = 0
in en_US.UTF-8 locale, iswprint(0xe0b2) = 0
in ja_JP.UTF-8 locale, iswprint(0xe0b2) = 0

Character  0xe0b3 is in classes: cntrl rune
in C locale, iswprint(0xe0b3) = 0
in en_US.UTF-8 locale, iswprint(0xe0b3) = 0
in ja_JP.UTF-8 locale, iswprint(0xe0b3) = 0

Character  0xe0b0 is in classes: cntrl rune
in C locale, iswprint(0xe0b0) = 0
in en_US.UTF-8 locale, iswprint(0xe0b0) = 0
in ja_JP.UTF-8 locale, iswprint(0xe0b0) = 0

#9

Updated by Yuri Pankov 5 months ago

I have also used the attached code to check that other ranges didn't change:

$ diff -u new old
No differences encountered
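A comparison like this can be produced by dumping the classification of every code point to a file before and after the change and diffing the two files. The sketch below is not the attached ct.c; the `CLS_*` bit values and the `class_mask` and `dump_classes` names are made up for illustration and do not correspond to the libc's internal class flags.

```c
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Illustrative bit values only; not the libc's internal class flags. */
enum {
	CLS_ALPHA = 1 << 0,
	CLS_DIGIT = 1 << 1,
	CLS_LOWER = 1 << 2,
	CLS_UPPER = 1 << 3,
	CLS_SPACE = 1 << 4,
	CLS_PUNCT = 1 << 5,
	CLS_CNTRL = 1 << 6,
	CLS_GRAPH = 1 << 7,
	CLS_PRINT = 1 << 8
};

/* Collect the standard isw*() results for wc into one bitmask. */
static unsigned
class_mask(wint_t wc)
{
	unsigned m = 0;

	if (iswalpha(wc)) m |= CLS_ALPHA;
	if (iswdigit(wc)) m |= CLS_DIGIT;
	if (iswlower(wc)) m |= CLS_LOWER;
	if (iswupper(wc)) m |= CLS_UPPER;
	if (iswspace(wc)) m |= CLS_SPACE;
	if (iswpunct(wc)) m |= CLS_PUNCT;
	if (iswcntrl(wc)) m |= CLS_CNTRL;
	if (iswgraph(wc)) m |= CLS_GRAPH;
	if (iswprint(wc)) m |= CLS_PRINT;
	return (m);
}

/* Write one "codepoint mask" line per code point in [first, last]. */
static void
dump_classes(FILE *out, unsigned long first, unsigned long last)
{
	for (unsigned long cp = first; cp <= last; cp++)
		fprintf(out, "0x%06lx 0x%03x\n", cp, class_mask((wint_t)cp));
}
```

A driver would call `setlocale(LC_CTYPE, ...)` for the locale of interest, then `dump_classes(stdout, 0, 0x10ffff)`, redirecting the output to `old` on an unpatched system and `new` on a patched one, with the private use ranges skipped or filtered out before running `diff -u new old`.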

#10

Updated by Electric Monk 5 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 50 to 100

commit  3052595ab8ddcc51231d239415b5eba5d913d45b
Author: Yuri Pankov <ypankov@fastmail.com>
Date:   2020-05-05T01:12:39.000Z

    11807 "private use area" characters should be marked as "printable" 
    Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Robert Mustacchi <rm@fingolfin.org>
