Feature #7784

uts: console input should support Unicode

Added by Toomas Soome over 2 years ago. Updated 6 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: kernel
Start date: 2017-05-12
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Difficulty: Hard
Tags: needs-triage

Description

Console input currently supports only ISO 8859-1 and needs wider character support.

The current keyboard driver is built on keymaps that use 8-bit character encodings, specifically the ISO 8859-1 table (in some cases only the ASCII subset).

To provide better input support, we could either allow other 8-bit tables, or build the input path around Unicode and pass the characters to the upper layers in UTF-8 encoding.

Adding more 8-bit tables is a no-go because the world is already moving to Unicode. Therefore we rebuild the keymaps to support Unicode characters. The initial step is to rewrite the existing maps to use the same characters in Unicode encoding (because right now our console can only display ISO 8859-1). The keyboard's internal structures are redesigned to use 32-bit characters, and the Unicode characters are translated to UTF-8 sequences when the data leaves the keyboard driver.
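As an illustration of the last step, here is a minimal sketch of turning one 32-bit Unicode character into a UTF-8 byte sequence. The function name `utf8_encode` and its return convention are assumptions made for this example; they are not taken from the actual driver code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8 (illustrative helper, not the
 * driver's real function). Writes up to 4 bytes into buf and returns the
 * number of bytes written, or 0 if cp is a surrogate or above U+10FFFF.
 */
size_t
utf8_encode(uint32_t cp, uint8_t buf[4])
{
	if (cp < 0x80) {
		/* ASCII passes through unchanged as a single byte. */
		buf[0] = (uint8_t)cp;
		return (1);
	}
	if (cp < 0x800) {
		buf[0] = (uint8_t)(0xC0 | (cp >> 6));
		buf[1] = (uint8_t)(0x80 | (cp & 0x3F));
		return (2);
	}
	if (cp >= 0xD800 && cp <= 0xDFFF) {
		/* Surrogates are not valid scalar values. */
		return (0);
	}
	if (cp < 0x10000) {
		buf[0] = (uint8_t)(0xE0 | (cp >> 12));
		buf[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (uint8_t)(0x80 | (cp & 0x3F));
		return (3);
	}
	if (cp <= 0x10FFFF) {
		buf[0] = (uint8_t)(0xF0 | (cp >> 18));
		buf[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
		buf[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
		buf[3] = (uint8_t)(0x80 | (cp & 0x3F));
		return (4);
	}
	return (0);
}
```

Because the first 256 Unicode code points match ISO 8859-1, a Latin-1 keymap entry such as 0xE4 keeps its numeric value and leaves the driver as the two bytes 0xC3 0xA4.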

This approach needs 7796 (https://www.illumos.org/issues/7796) for this data stream to actually work properly through the ldterm module.

To display the data, tem already processes UTF-8 byte sequences to reconstruct Unicode characters, and it handles the unprintable ones. In follow-up work we will update tem to provide full support for Unicode characters (not just the Latin-1 set), and add a font subsystem to create bitmaps plus low-level screen handling to draw them.
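For readers unfamiliar with the receiving side, the following is a rough sketch of a byte-at-a-time UTF-8 decoder of the kind a terminal emulator needs. It is not tem's actual code: `struct utf8_state`, `utf8_feed`, and the two sentinel values are names invented for this example, and the sketch deliberately skips overlong-form and surrogate checks.

```c
#include <stdint.h>

#define	UTF8_PENDING		0xFFFFFFFFu	/* more bytes are needed */
#define	UTF8_REPLACEMENT	0xFFFDu		/* malformed input */

/* Decoder state kept between bytes (hypothetical layout). */
struct utf8_state {
	uint32_t cp;		/* code point accumulated so far */
	unsigned remaining;	/* continuation bytes still expected */
};

/*
 * Feed one byte. Returns a complete code point, UTF8_PENDING while a
 * multi-byte sequence is still incomplete, or UTF8_REPLACEMENT when the
 * byte is invalid. A byte that interrupts a sequence is consumed together
 * with the broken sequence; a fuller decoder would reprocess it.
 */
uint32_t
utf8_feed(struct utf8_state *st, uint8_t byte)
{
	if (st->remaining == 0) {
		if (byte < 0x80)
			return (byte);
		if ((byte & 0xE0) == 0xC0) {
			st->cp = byte & 0x1F;
			st->remaining = 1;
		} else if ((byte & 0xF0) == 0xE0) {
			st->cp = byte & 0x0F;
			st->remaining = 2;
		} else if ((byte & 0xF8) == 0xF0) {
			st->cp = byte & 0x07;
			st->remaining = 3;
		} else {
			/* Stray continuation byte or invalid lead byte. */
			return (UTF8_REPLACEMENT);
		}
		return (UTF8_PENDING);
	}
	if ((byte & 0xC0) != 0x80) {
		/* Sequence was cut short; drop it and reset. */
		st->remaining = 0;
		return (UTF8_REPLACEMENT);
	}
	st->cp = (st->cp << 6) | (byte & 0x3F);
	if (--st->remaining > 0)
		return (UTF8_PENDING);
	return (st->cp);
}
```

A caller would start with a zeroed `struct utf8_state`, feed each incoming byte, and draw a glyph whenever the return value is neither sentinel.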

This approach is nothing new: the same idea is implemented in the FreeBSD keyboard maps and keyboard driver (and probably elsewhere too).

The actual implementation is already done in https://github.com/tsoome/illumos-gate/tree/loader, and test ISO and USB images are currently available at http://205-118-191-90.dyn.estpak.ee.


Subtasks

Feature #8210: uts: remove kb streams module (Closed, Toomas Soome)

History

#1

Updated by Toomas Soome about 1 year ago

  • Description updated (diff)

Toomas Soome wrote:

Console input currently supports only ISO 8859-1 and needs wider character support.

#2

Updated by John Howard 8 months ago

Should be renamed to "uts: console input should support Unicode".

The Unicode Consortium and ISO define the codepoint attribute as a logical 21-bit range. Unicode-branded products must also comply with the usage of the upper bits. UTF-8 is useful because it lets us treat the legacy 7-bit ASCII character set as a subset that uses fewer than the physical 21 bits. Actually, UTF-8 can wastefully use more than 21 physical bits just to represent the codepoint attribute.

The primary problem with your approach is using the upper bits beyond the codepoint attribute (this makes it incompatible with the Unicode industry standard and the ISO 10646:* standards). You are using them for keyboard state attributes (again, not Unicode-compliant attributes). Do we really need that keyboard state stored in the 32 - 21 = 11 upper bits? Wasting those bits for useless keyboard state is a mistake from the start. Just because FreeBSD does it wrong does not mean illumos has to.

The secondary problem is UTF-32 Big Endian is simpler than UTF-8. The goal should be to optimally use the full 32 bits as the standard and not just the logical 21 bits codepoint attribute. What many do not realize is there are alternative specifications of additional character attributes defined from the upper bits (I designed one, initially licensed for illumos only, to be released). Ultimately one shall become Unicode branded. Your non-compliant upper bits keystate design is interfering with Unicode as the international standard of Information Technology.

The codepoint returned also indicates a logical True for its keystate as a key RELEASE predicate. Key REPEAT predicate can be deduced later if necessary (it's not). Low-level Unix does not use simultaneous keystate information requiring a keymap to be included with each instance of character input. Desktops build events via the timestamp field from the input drivers (keyboard, mouse, touch). Consider the Console using an event-driven interface (maybe Wayland's libinput) from multiple input drivers. Then character input Event type might be a 64 bits record structure with two fields: Timestamp, and UTF-32 Big Endian. I don't think redundant information (such as keystate) should be included with each character input Event.
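To make the proposed record concrete, here is one possible reading of it as a sketch: a 64-bit event made of a 32-bit timestamp followed by a 32-bit UTF-32 value, both written big-endian. The even 32/32 split, the timestamp units, and the names `struct char_event` and `char_event_pack` are assumptions; the comment above does not specify them.

```c
#include <stdint.h>

/* Hypothetical 64-bit character input event. */
struct char_event {
	uint32_t timestamp;	/* units are not specified in the proposal */
	uint32_t utf32;		/* Unicode code point */
};

/* Store a 32-bit value in big-endian byte order. */
void
put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Serialize the event as 8 bytes: timestamp first, then UTF-32BE. */
void
char_event_pack(const struct char_event *ev, uint8_t out[8])
{
	put_be32(out, ev->timestamp);
	put_be32(out + 4, ev->utf32);
}
```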

#3

Updated by Toomas Soome 8 months ago

John Howard wrote:

Should be renamed to "uts: console input should support Unicode".

The Unicode Consortium and ISO define the codepoint attribute as a logical 21-bit range. Unicode-branded products must also comply with the usage of the upper bits. UTF-8 is useful because it lets us treat the legacy 7-bit ASCII character set as a subset that uses fewer than the physical 21 bits. Actually, UTF-8 can wastefully use more than 21 physical bits just to represent the codepoint attribute.

The primary problem with your approach is using the upper bits beyond the codepoint attribute (this makes it incompatible with the Unicode industry standard and the ISO 10646:* standards). You are using them for keyboard state attributes (again, not Unicode-compliant attributes). Do we really need that keyboard state stored in the 32 - 21 = 11 upper bits? Wasting those bits for useless keyboard state is a mistake from the start. Just because FreeBSD does it wrong does not mean illumos has to.

Those bits are used for a simple reason: they make it possible to use a single 32-bit entry per key, nothing less, nothing more. Why is it done this way? Because the console fonts are limited and do not cover all the glyphs introduced in Unicode. Do we need a Unicode-compliant console keyboard driver? Maybe. I'd say we want a functional driver first (we are not even able to use Caps Lock properly!).

The implementations are not set in stone (except in illumos, where we have ancient source and lack developers to make things better), especially as the state bits are in any case used internally by the driver.
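The layout under discussion can be sketched as follows: the low 21 bits of each 32-bit keymap entry hold the code point and the remaining 11 bits are free for driver-internal flags. The macro names, the helper functions, and the specific flag bit are hypothetical; the real driver's bit assignments may differ.

```c
#include <stdint.h>

#define	KEY_CP_BITS	21
#define	KEY_CP_MASK	((UINT32_C(1) << KEY_CP_BITS) - 1)	/* 0x001FFFFF */
#define	KEY_FLAG_MASK	((uint32_t)~KEY_CP_MASK)		/* 0xFFE00000 */

/* Hypothetical flag: entry names a special key rather than a character. */
#define	KEY_FLAG_SPECIAL	(UINT32_C(1) << 31)

/* Combine a code point and internal flags into one 32-bit entry. */
uint32_t
key_entry_make(uint32_t cp, uint32_t flags)
{
	return ((flags & KEY_FLAG_MASK) | (cp & KEY_CP_MASK));
}

/* Extract the Unicode code point; this is what would be sent upstream. */
uint32_t
key_entry_codepoint(uint32_t entry)
{
	return (entry & KEY_CP_MASK);
}

/* Extract the driver-internal flag bits. */
uint32_t
key_entry_flags(uint32_t entry)
{
	return (entry & KEY_FLAG_MASK);
}
```

Under this scheme the flags never leave the driver: only the masked code point is converted to UTF-8 and passed up the stream.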

The secondary problem is UTF-32 Big Endian is simpler than UTF-8. The goal should be to optimally use the full 32 bits as the standard and not just the logical 21 bits codepoint attribute. What many do not realize is there are alternative specifications of additional character attributes defined from the upper bits (I designed one, initially licensed for illumos only, to be released). Ultimately one shall become Unicode branded. Your non-compliant upper bits keystate design is interfering with Unicode as the international standard of Information Technology.

The codepoint returned also indicates a logical True for its keystate as a key RELEASE predicate. Key REPEAT predicate can be deduced later if necessary (it's not). Low-level Unix does not use simultaneous keystate information requiring a keymap to be included with each instance of character input. Desktops build events via the timestamp field from the input drivers (keyboard, mouse, touch). Consider the Console using an event-driven interface (maybe Wayland's libinput) from multiple input drivers. Then character input Event type might be a 64 bits record structure with two fields: Timestamp, and UTF-32 Big Endian. I don't think redundant information (such as keystate) should be included with each character input Event.

Yeah, except we have nothing to do with those 64-bit events outside the keyboard driver itself.

If anyone feels we should have a fully compliant (whatever that means) Unicode driver, please jump on the boat and start writing one :)

#4

Updated by Toomas Soome 8 months ago

  • Subject changed from uts: console input should support utf-8 to uts: console input should support Unicode

#5

Updated by Electric Monk 6 months ago

  • Status changed from New to Closed

git commit adc2b73db62a4506a57dfd1ce89bcadc4a60a29d

commit  adc2b73db62a4506a57dfd1ce89bcadc4a60a29d
Author: Toomas Soome <tsoome@me.com>
Date:   2019-01-07T09:11:17.000Z

    7784 uts: console input should support Unicode
    Reviewed by: Yuri Pankov <yuripv@yuripv.net>
    Reviewed by: Igor Kozhukhov <igor@dilos.org>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
