Bug #16566

open

Native msgfmt produces invalid GNU .mo file when UTF-8 chars are in msgid

Added by Marcel Telka 30 days ago. Updated 9 days ago.

Status: New
Priority: Normal
Assignee: -
Category: cmd - userland programs
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug:

Description

Let's create this simple po file (note the á, the U+00E1 character, in msgid):

$ cat <<EOF > test.po 
msgid "" 
msgstr "Content-Type: text/plain; charset=UTF-8\n" 

msgid "á" 
msgstr "translated" 
EOF
$

An attempt to generate the GNU mo file using the native msgfmt produces an invalid file:
$ LC_ALL=en_US.UTF-8 /usr/bin/msgfmt -g -v -o illumos.mo test.po
Generating the MO file in the GNU MO format.
Processing file "test.po"...
1 translated message(s), 0 fuzzy translation(s), 0 untranslated message(s).
$ /usr/bin/msgunfmt -o illumos.po illumos.mo
/usr/bin/msgunfmt: file "illumos.mo" is not in GNU .mo format: Some messages are at a wrong index in the hash table.
$

Doing the same with GNU msgfmt works well:
$ LC_ALL=en_US.UTF-8 /usr/gnu/bin/msgfmt -v -o gnu.mo test.po
1 translated message.
$ /usr/bin/msgunfmt -o gnu.po gnu.mo
$
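
To see which msgids trip the "wrong index in the hash table" check, the .mo file can be inspected directly. The following is a minimal sketch, assuming the layout and hashpjw string hash described in the GNU gettext manual ("The Format of GNU MO Files"). The names `hashpjw` and `misplaced_msgids` are made up for this illustration. The sketch mimics the runtime hash lookup and is not necessarily the exact check msgunfmt performs.

```python
import struct

MO_MAGIC = 0x950412DE  # GNU .mo magic number


def hashpjw(data: bytes) -> int:
    """GNU gettext string hash; every byte is treated as unsigned."""
    hval = 0
    for byte in data:
        hval = ((hval << 4) + byte) & 0xFFFFFFFF
        g = hval & 0xF0000000
        if g:
            hval ^= g >> 24
            hval ^= g
    return hval


def misplaced_msgids(mo: bytes) -> list:
    """Return the msgids that a hash lookup cannot find in a .mo image."""
    magic = struct.unpack_from("<I", mo, 0)[0]
    if magic == MO_MAGIC:
        order = "<"  # little-endian file
    elif magic == 0xDE120495:
        order = ">"  # big-endian file
    else:
        raise ValueError("not a GNU .mo file")
    # Header after the magic: revision, N, O, T, S, H
    _rev, n, orig_off, _trans_off, hsize, hoff = struct.unpack_from(
        order + "6I", mo, 4)
    if hsize <= 2:
        return []  # no usable hash table in this file
    table = struct.unpack_from(f"{order}{hsize}I", mo, hoff)
    missing = []
    for i in range(n):
        length, offset = struct.unpack_from(order + "2I", mo, orig_off + 8 * i)
        msgid = mo[offset:offset + length]
        # The hash covers the string up to the first NUL (plural forms follow it).
        hval = hashpjw(msgid.split(b"\0", 1)[0])
        idx = hval % hsize
        incr = 1 + hval % (hsize - 2)
        for _ in range(hsize):
            slot = table[idx]
            if slot == 0:  # empty slot: the lookup gives up here
                missing.append(msgid)
                break
            if slot - 1 == i:  # slots store the string index plus one
                break
            idx = (idx + incr) % hsize
        else:
            missing.append(msgid)
    return missing
```

Calling `misplaced_msgids(open("illumos.mo", "rb").read())` and the same for gnu.mo would list the msgids, if any, that are not reachable through each file's hash table.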

Actions #1

Updated by David Stes 10 days ago

Is it common to have UTF-8 codeset used for "msgid" arguments?

There is work to make a POSIX standard for native msgfmt and gnu msgfmt by the Austin Group

https://austingroupbugs.net/view.php?id=1122

The POSIX draft says: "The msgid and msgid_plural arguments are typically in (US) English. The arguments are always used in the POSIX or C locale,"

I think this suggests that the POSIX codeset is ASCII and that msgid arguments are assumed to be ASCII?

However, I am not sure. Since you filed the issue as a "Bug", are you suggesting that it should be possible to use the UTF-8 codeset in msgid arguments?

I had a look at an example project (for Smalltalk, by the way) that uses .po files. Its msgstr values are localised translations in various languages, they use the UTF-8 codeset, and the header indicates charset=UTF-8.

But the msgids themselves in that project seem to be ASCII: I compared the msgid arguments as UTF-8 and as ASCII, and all of them seem to be plain ASCII.

However, as I said, I am not sure the POSIX draft insists that msgid arguments SHOULD be ASCII. In any case, my understanding is that it is a draft and not a standard.
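
The msgid check described above can be automated. This is a minimal sketch, assuming a conventionally formatted .po file in which `msgid` and `msgid_plural` start a quoted string that may continue on following lines. The function name `non_ascii_msgids` is made up for this illustration, and the parser is deliberately simplified (it does not decode escape sequences, which is not needed to spot non-ASCII characters).

```python
def non_ascii_msgids(po_text: str) -> list:
    """Return the raw msgid / msgid_plural strings that contain non-ASCII text."""
    found = []
    current = None  # text of the msgid being collected, or None

    def flush():
        nonlocal current
        if current is not None and not current.isascii():
            found.append(current)
        current = None

    for raw in po_text.splitlines():
        line = raw.strip()
        if line.startswith(("msgid ", "msgid_plural ")):
            flush()
            # Drop the keyword and the surrounding double quotes.
            current = line.split(None, 1)[1][1:-1]
        elif line.startswith('"') and current is not None:
            current += line[1:-1]  # continuation line of the same msgid
        else:
            flush()  # msgstr, comment, or blank line ends the msgid
    flush()
    return found
```

For example, `non_ascii_msgids(open("test.po", encoding="utf-8").read())` scans one catalog; an empty list means every msgid in it is plain ASCII.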

Actions #2

Updated by Marcel Telka 10 days ago

David Stes wrote in #note-1:

Is it common to have UTF-8 codeset used for "msgid" arguments?

I do not know how common it is, but some projects do so, for example git.

There is work to make a POSIX standard for native msgfmt and gnu msgfmt by the Austin Group

The native msgfmt claims it supports GNU mo files, so it should do that properly, including where GNU diverges from something that may become POSIX in the future.

Actions #3

Updated by David Stes 9 days ago

Can you please give an actual example from developer/versioning/git (git - Fast Version Control System) of a ".po" message catalog with a msgid argument with non-ASCII UTF-8 content?

Actions #4

Updated by Marcel Telka 9 days ago

David Stes wrote in #note-3:

Can you please give an actual example from developer/versioning/git

Isn't the direct link to the po file I provided in #note-2 enough?
