Bug #2408

CJK character width handled incorrectly in terminal emulators

Added by Ian Johnson over 7 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Category: lib - userland libraries
Start date: 2012-03-14
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:

Description

On Illumos-based systems, CJK character width is not handled correctly in any terminal emulator. The terminal emulator is aware of the number of characters entered at the prompt, but when attempting to move the cursor through a CJK text string, each full-width character is counted as two characters, causing the cursor position to be reflected incorrectly and making the terminal unusable. This problem is specific to Illumos-based distributions and does not appear in Solaris 10, OpenSolaris, OpenIndiana 148, or Solaris 11.

Steps to reproduce: Type or paste some Japanese characters into a terminal emulator (tested with gnome-terminal, xterm, and Xfce Terminal on OpenIndiana 151a2). With the cursor at the end of the text string, begin pressing the left arrow key to move back through the text. It will take two keypresses to move past a single character, and the cursor will be unable to move further back once the number of keypresses matches the number of characters entered. Attempting to insert additional characters between those already entered will cause visual glitches.
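
To illustrate the mismatch described above, here is a minimal Python sketch that compares the number of characters in a string with the number of terminal columns it would normally occupy. It uses the East Asian Width property from Python's unicodedata module as a stand-in for wcwidth(); that substitution and the helper name display_columns are assumptions made for illustration, not how libc or any terminal emulator computes widths. The sample text is the string given later in this report.

```python
import unicodedata


def display_columns(text):
    """Approximate the terminal columns needed to display text.

    Wide (W) and fullwidth (F) characters count as two columns,
    combining marks as zero, everything else as one.
    """
    cols = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks do not advance the cursor
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            cols += 2
        else:
            cols += 1
    return cols


sample = "\u3053\u308c\u306f\u4f8b\u3067\u3059\u3002"  # これは例です。
print("characters:", len(sample))
print("columns:   ", display_columns(sample))
```

A line editor needs the second number to place the cursor correctly; if it believes each of these characters occupies one column while the terminal draws two, the cursor position drifts in the way described above.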

History

#1

Updated by Garrett D'Amore over 7 years ago

It would help me if you could provide specific detail about what locale you were using, and sample Japanese text to work with. I suspect that the wcwidth() routine is broken. When I implemented this, I was operating on the belief that wcwidth was only needed for EUC locales, but perhaps we need support for it in UTF-8 as well?

I'm not sure where I can find wcwidth() data ... I don't think it's in the unicode data files that I was given. :-(

#2

Updated by Garrett D'Amore over 7 years ago

Yeah, so looking at this more, we missed it. So did the POSIX committee apparently -- there is no standard way to provide this data to localedef -- it's not part of the standard localedef grammar!

I am going to propose therefore a local extension, which should hopefully keep us from having to touch all the CLDR data files.

The local extension would be a new input file format, to be processed by localedef, which would contain the list of widths. This input file would be specified on the command line with a new switch (optional) to localedef.

One implication of this is that we will need to write the width data somewhere -- we already have a few bits reserved in the ctype data for this, so that's the logical place to do it.

I don't think this would be too hard to do.
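
As a rough illustration of storing a width in a few reserved bits of a per-character flags word, here is a Python sketch. The shift, the mask, and the function names are hypothetical and chosen only for this example; they are not the values used by the illumos ctype data.

```python
# Hypothetical layout: two bits of a 32-bit flags word hold a width of 0..3.
WIDTH_SHIFT = 28                  # assumption, not the real bit position
WIDTH_MASK = 0x3 << WIDTH_SHIFT   # assumption, not the real mask


def pack_width(flags, width):
    """Return flags with the width field set, leaving other bits untouched."""
    if not 0 <= width <= 3:
        raise ValueError("width must fit in two bits")
    return (flags & ~WIDTH_MASK) | (width << WIDTH_SHIFT)


def unpack_width(flags):
    """Extract the width field from a flags word."""
    return (flags & WIDTH_MASK) >> WIDTH_SHIFT


alpha_flag = 0x100                      # stand-in for some other ctype bit
entry = pack_width(alpha_flag, 2)       # a double-width character
print(hex(entry), unpack_width(entry))
```

The point of the design is that the width travels with the existing classification data, so no separate table or file format is needed at lookup time.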

#3

Updated by Ian Johnson over 7 years ago

My locale is ja_JP.UTF-8, but the problem should also manifest in English locales. As for sample text, try:

これは例です。

If you paste this string into a terminal emulator and then try to move the cursor back over it, you should be able to observe that the cursor can't move back to the beginning, and that attempting to insert characters with the cursor in the middle of the string results in strange behavior.

I'm not familiar with the intricate details of how this is handled, but your explanation sounds likely to me. I have seen a few old mentions around the web of wcwidth() problems previously existing in other platforms causing similar issues in UTF-8 locales. What kind of wcwidth() data would be needed to resolve the issue?

#4

Updated by Yuri Pankov over 7 years ago

While looking at this issue, I noticed the following (not directly related to the issue itself) - __ct_rune_t is typedef'ed int for both LP64 and LP32 despite what the comment says (if I'm reading it correctly): http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libc/port/locale/runetype.h#88

#5

Updated by Yuri Pankov over 7 years ago

BTW, this affects more than CJK locales: the utility I'm trying to import also relies on wcwidth() to pretty-format its output. FreeBSD defines the widths as ranges of character codes, which we could use. Garrett, any hints on handling this properly?
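
A width table expressed as ranges of character codes could be looked up roughly as in the Python sketch below. The ranges listed are a small illustrative subset picked for this example, not FreeBSD's actual table, and the names WIDE_RANGES and range_width are hypothetical. The sketch also ignores control and combining characters, which a real wcwidth() has to handle.

```python
import bisect

# Illustrative subset of double-width ranges (inclusive), sorted by start.
WIDE_RANGES = [
    (0x3000, 0x303E),  # CJK symbols and punctuation
    (0x3041, 0x3096),  # Hiragana letters
    (0x30A1, 0x30FA),  # Katakana letters
    (0x4E00, 0x9FFF),  # CJK unified ideographs
    (0xFF01, 0xFF60),  # Fullwidth forms
]
_STARTS = [lo for lo, _ in WIDE_RANGES]


def range_width(codepoint):
    """Return 2 if codepoint falls in a wide range, otherwise 1."""
    # Find the last range whose start is <= codepoint, then check its end.
    i = bisect.bisect_right(_STARTS, codepoint) - 1
    if i >= 0 and codepoint <= WIDE_RANGES[i][1]:
        return 2
    return 1


for ch in "a\u3042\u4f8b":
    print(hex(ord(ch)), range_width(ord(ch)))
```

Because the ranges are sorted and do not overlap, a binary search keeps each lookup cheap even when the table has many entries.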

#6

Updated by Boris Osher over 7 years ago

Hi Yuri,

Could you please let me know how long roughly it could take to fix the issue?

Thank you.

#7

Updated by Yuri Pankov over 7 years ago

  • Status changed from New to In Progress
  • Assignee set to Yuri Pankov
  • % Done changed from 0 to 30

#8

Updated by Lauri Tirkkonen over 6 years ago

Garrett D'Amore wrote:

Yeah, so looking at this more, we missed it. So did the POSIX committee
apparently -- there is no standard way to provide this data to localedef --
it's not part of the standard localedef grammar!

It seems to me that you can specify the widths between 'WIDTH' and 'END WIDTH'
in the charmap file as described at the end of this section:
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tag_06_04
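
As a rough sketch of what reading such a section involves, the Python below parses a WIDTH / END WIDTH block into a mapping from code point to column width. It assumes symbolic names of the form <UXXXX> with a hexadecimal code point, handles only single-name entries, and skips range entries written with "..."; those are simplifications for this example, and the function name is hypothetical. It is not localedef's parser.

```python
import re

# Assumption: symbolic names look like <U3042>, with a hex code point.
_NAME = re.compile(r"<U([0-9A-Fa-f]{4,8})>")


def parse_width_section(text):
    """Collect {code point: width} from single-name WIDTH entries."""
    widths = {}
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "WIDTH":
            in_section = True
            continue
        if line == "END WIDTH":
            break
        if not in_section or not line or "..." in line:
            continue  # outside the section, blank, or a range entry
        name, width = line.rsplit(None, 1)
        match = _NAME.fullmatch(name)
        if match:
            widths[int(match.group(1), 16)] = int(width)
    return widths


SAMPLE = """\
WIDTH
<U3042> 2
<U4F8B> 2
END WIDTH
"""

print(parse_width_section(SAMPLE))
```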

#9

Updated by Lauri Tirkkonen over 6 years ago

Lauri Tirkkonen wrote:

It seems to me that you can specify the widths between 'WIDTH' and 'END WIDTH'
in the charmap file

It seems that the CLDR tools to generate charmap files don't bake width data in them, though I would think that the data exists in the CLDR data files somewhere.

#10

Updated by Garrett D'Amore about 6 years ago

I've written an implementation of WIDTH / END WIDTH localedef support, along with a very rich set of character widths taken from an external source and the Unicode character data -- this works for all encodings.

I'll be posting it shortly. Good news - no changes in libc.

I also updated localedef(1)'s man page, which was ancient beyond belief.

#11

Updated by Garrett D'Amore about 6 years ago

  • Assignee changed from Yuri Pankov to Garrett D'Amore

#12

Updated by Garrett D'Amore about 6 years ago

  • % Done changed from 30 to 90

#13

Updated by Garrett D'Amore about 6 years ago

  • Tags deleted (needs-triage)

#14

Updated by Garrett D'Amore about 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

commit 2da1cd3a39e2d3da7f9d15071ea9462919c011ac
Author: Garrett D'Amore <>
Date: Tue Aug 27 18:16:23 2013 -0700

2408 CJK character width handled incorrectly in terminal emulators
3019 localedef(1) manpage is pretty out of date
Reviewed by: Yuri Pankov <>
Reviewed by: Lauri Tirkkonen <>
Reviewed by: Andy Stormont <>
Approved by: Gordon Ross <>
