Bug #2408
CJK character width handled incorrectly in terminal emulators
Status: closed (100% done)
Description
On Illumos-based systems, CJK character width is not handled correctly in any terminal emulator. The terminal emulator is aware of the number of characters entered at the prompt, but when attempting to move the cursor through a CJK text string, each full-width character is counted as two characters, causing the cursor position to be reflected incorrectly and making the terminal unusable. This problem is specific to Illumos-based distributions and does not appear in Solaris 10, OpenSolaris, OpenIndiana 148, or Solaris 11.
Steps to reproduce: Type or paste some Japanese characters into a terminal emulator (tested with gnome-terminal, xterm, and Xfce Terminal on OpenIndiana 151a2). With the cursor at the end of the text string, press the left arrow key repeatedly to move back through the text. It takes two keypresses to move past a single character, and the cursor cannot move further back once the number of keypresses matches the number of characters entered. Attempting to insert additional characters between those already entered causes visual glitches.
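To make the two counts in this report concrete, here is a minimal sketch in C that measures a UTF-8 string both in characters and in display columns. The names `toy_width` and `measure` are hypothetical, and `toy_width` is only a stand-in for wcwidth() that covers the blocks used by typical Japanese sample text. It is not the libc implementation and not a complete width table.

```c
#include <stddef.h>

/* Hypothetical stand-in for wcwidth(): 2 columns for a few CJK blocks,
 * 1 column for everything else. Illustrative only. */
static int toy_width(unsigned long cp)
{
    if ((cp >= 0x3000 && cp <= 0x30FF) ||   /* CJK punctuation, kana */
        (cp >= 0x4E00 && cp <= 0x9FFF))     /* CJK unified ideographs */
        return 2;
    return 1;
}

/* Count code points and display columns in a well-formed UTF-8 string. */
static void measure(const char *s, size_t *chars, size_t *cols)
{
    const unsigned char *p = (const unsigned char *)s;

    *chars = 0;
    *cols = 0;
    while (*p != '\0') {
        unsigned long cp;
        int extra;  /* number of continuation bytes that follow */

        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        p++;
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            cp = (cp << 6) | (*p++ & 0x3F);

        (*chars)++;
        *cols += (size_t)toy_width(cp);
    }
}
```

Calling `measure()` on a string of full-width characters yields a column count twice the character count. A line editor and a terminal that disagree about which of those two numbers to use would produce the kind of cursor drift described above.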
Updated by Garrett D'Amore about 11 years ago
It would help me if you could provide specific detail about what locale you were using, and sample Japanese text to work with. I suspect that the wcwidth() routine is broken. When I implemented this, I was operating on the belief that wcwidth was only needed for EUC locales, but perhaps we need support for it in UTF-8 as well?
I'm not sure where I can find wcwidth() data ... I don't think it's in the unicode data files that I was given. :-(
Updated by Garrett D'Amore about 11 years ago
Yeah, so looking at this more, we missed it. So did the POSIX committee apparently -- there is no standard way to provide this data to localedef -- it's not part of the standard localedef grammar!
I am going to propose therefore a local extension, which should hopefully keep us from having to touch all the CLDR data files.
The local extension would be a new input file format, to be processed by localedef, which would contain the list of widths. This input file would be specified on the command line with a new switch (optional) to localedef.
One implication of this is that we will need to write the width data somewhere -- we already have a few bits reserved in the ctype data for this, so that's the logical place to do it.
I don't think this would be too hard to do.
Updated by Ian Johnson about 11 years ago
My locale is ja_JP.UTF-8, but the problem should also manifest in English locales. As for sample text, try:
これは例です。
If you paste this string into a terminal emulator and then try to move the cursor back over it, you should be able to observe that the cursor can't move back to the beginning, and that attempting to insert characters with the cursor in the middle of the string results in strange behavior.
I'm not familiar with the intricate details of how this is handled, but your explanation sounds likely to me. I have seen a few old mentions around the web of wcwidth() problems previously existing in other platforms causing similar issues in UTF-8 locales. What kind of wcwidth() data would be needed to resolve the issue?
Updated by Yuri Pankov almost 11 years ago
While looking at this issue, I noticed the following (not directly related to the issue itself) - __ct_rune_t is typedef'ed int for both LP64 and LP32 despite what the comment says (if I'm reading it correctly): http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libc/port/locale/runetype.h#88
Updated by Yuri Pankov almost 11 years ago
BTW, this affects not only CJK locales; the utility I'm trying to import also relies on wcwidth() to pretty-format its output. FreeBSD defines the widths as ranges of character codes, which we could use. Garrett, any hints on handling this properly?
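A range-based width table of the kind mentioned here could look roughly like the sketch below. This is not FreeBSD's actual data or format. The struct, the function name `range_width`, and the handful of entries are assumptions chosen for illustration, and a real table would need far more ranges.

```c
#include <stddef.h>

/* One inclusive range of code points that share a column width. */
struct width_range {
    unsigned long first;
    unsigned long last;
    int width;
};

/* Illustrative entries only; sorted by 'first' and non-overlapping. */
static const struct width_range sample_ranges[] = {
    { 0x0300, 0x036F, 0 },  /* combining diacritical marks */
    { 0x3000, 0x303E, 2 },  /* CJK symbols and punctuation */
    { 0x3041, 0x3096, 2 },  /* hiragana letters */
    { 0x30A1, 0x30FA, 2 },  /* katakana letters */
    { 0x4E00, 0x9FFF, 2 },  /* CJK unified ideographs */
    { 0xAC00, 0xD7A3, 2 },  /* Hangul syllables */
    { 0xFF01, 0xFF60, 2 },  /* fullwidth forms */
};

/* Binary search over the sorted ranges; unlisted code points get width 1. */
static int range_width(unsigned long cp)
{
    size_t lo = 0;
    size_t hi = sizeof(sample_ranges) / sizeof(sample_ranges[0]);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (cp < sample_ranges[mid].first)
            hi = mid;
        else if (cp > sample_ranges[mid].last)
            lo = mid + 1;
        else
            return sample_ranges[mid].width;
    }
    return 1;
}
```

Storing ranges rather than one entry per character keeps the table small, since wide characters tend to come in large contiguous blocks, and the sorted order allows a logarithmic lookup.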
Updated by Boris Osher almost 11 years ago
Hi Yuri,
Could you please let me know roughly how long it might take to fix the issue?
Thank you.
Updated by Yuri Pankov almost 11 years ago
- Status changed from New to In Progress
- Assignee set to Yuri Pankov
- % Done changed from 0 to 30
Updated by Lauri Tirkkonen almost 10 years ago
Garrett D'Amore wrote:
Yeah, so looking at this more, we missed it. So did the POSIX committee
apparently -- there is no standard way to provide this data to localedef --
it's not part of the standard localedef grammar!
It seems to me that you can specify the widths between 'WIDTH' and 'END WIDTH'
in the charmap file as described at the end of this section:
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tag_06_04
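For reference, POSIX describes each line between 'WIDTH' and 'END WIDTH' as either a symbolic name followed by a column width, or two symbolic names joined by "..." followed by a width. A minimal sketch of reading such a line is below. It assumes the symbolic names take the form `<UXXXX>` (a hexadecimal code point), which is only one possible naming scheme, and `parse_width_line` is a hypothetical helper, not part of localedef.

```c
#include <stdio.h>

/*
 * Parse one line from a WIDTH section, assuming <UXXXX> symbolic names.
 * Accepts "<U3041>...<U3096> 2" (a range) or "<U3002> 2" (one character).
 * Returns 1 and fills in first/last/width on success, 0 otherwise.
 */
static int parse_width_line(const char *line, unsigned long *first,
                            unsigned long *last, int *width)
{
    /* Try the range form first. */
    if (sscanf(line, " <U%lx>...<U%lx> %d", first, last, width) == 3)
        return 1;

    /* Fall back to the single-character form. */
    if (sscanf(line, " <U%lx> %d", first, width) == 2) {
        *last = *first;
        return 1;
    }
    return 0;
}
```

A caller would feed this every line after 'WIDTH' and stop when it sees 'END WIDTH', which this function rejects because it does not start with a symbolic name.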
Updated by Lauri Tirkkonen almost 10 years ago
Lauri Tirkkonen wrote:
It seems to me that you can specify the widths between 'WIDTH' and 'END WIDTH'
in the charmap file
It seems that the CLDR tools to generate charmap files don't bake width data in them, though I would think that the data exists in the CLDR data files somewhere.
Updated by Garrett D'Amore almost 10 years ago
I've written an implementation of WIDTH / END WIDTH localedef support, along with a very rich set of character widths taken from an external source and the Unicode characters. This works for all encodings.
I'll be posting it shortly. Good news - no changes in libc.
I also updated localedef(1)'s man page, which was ancient beyond belief.
Updated by Garrett D'Amore almost 10 years ago
- Assignee changed from Yuri Pankov to Garrett D'Amore
Updated by Garrett D'Amore almost 10 years ago
- Status changed from In Progress to Resolved
- % Done changed from 90 to 100
commit 2da1cd3a39e2d3da7f9d15071ea9462919c011ac
Author: Garrett D'Amore <garrett@dey-sys.com>
Date: Tue Aug 27 18:16:23 2013 -0700
2408 CJK character width handled incorrectly in terminal emulators
3019 localedef(1) manpage is pretty out of date
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Lauri Tirkkonen <lotheac@iki.fi>
Reviewed by: Andy Stormont <andyjstormont@gmail.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>