Bug #992

closed

towlower/towupper are broken

Added by Yuri Pankov almost 12 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: locale - data and messages
Start date: 2011-05-04
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:
External Bug:

Description

$ cat tow-test.c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

int
main(void) {
        wint_t  rusP = 0x41f;
        wint_t  rusp = 0x43f;

        setlocale(LC_ALL, "");

        printf("towlower: %lc -> %lc\n", rusP, towlower(rusP));
        printf("towupper: %lc -> %lc\n", rusp, towupper(rusp));

        return (0);
}

Illumos:
$ LC_ALL=en_US.UTF-8 ./tow-test
towlower: П -> П
towupper: п -> п
$ LC_ALL=ru_RU.UTF-8 ./tow-test
towlower: П -> П
towupper: п -> П

Solaris 11:
$ LC_ALL=en_US.UTF-8 ./tow-test
towlower: П -> п
towupper: п -> П
$ LC_ALL=ru_RU.UTF-8 ./tow-test
towlower: П -> п
towupper: п -> П

FreeBSD 8:
$ LC_ALL=en_US.UTF-8 ./tow-test
towlower: П -> п
towupper: п -> П
$ LC_ALL=ru_RU.UTF-8 ./tow-test
towlower: П -> п
towupper: п -> П

The letter is Cyrillic 'pe' (just an example; the problem isn't limited to it).
I'm not sure whether towlower/towupper are actually at fault here or the problem is deeper. I'm setting Garrett as assignee, as he was working on it.
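
Since the problem is not limited to 'pe', a broader check could sweep the basic Cyrillic capitals instead of testing one letter. The following is a minimal sketch, not part of the original report: the helper names case_pair_ok and count_broken_cyrillic are hypothetical, and it assumes the capitals U+0410..U+042F pair with lowercase letters exactly 0x20 higher (which holds for 'pe': 0x41f and 0x43f).

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: returns 1 if the pair converts correctly in both
 * directions under the current locale, 0 otherwise.
 */
int
case_pair_ok(wint_t upper, wint_t lower)
{
	return (towlower(upper) == lower && towupper(lower) == upper);
}

/*
 * Hypothetical helper: counts the broken pairs among the 32 basic
 * Cyrillic capitals U+0410..U+042F, assuming each lowercase letter
 * sits 0x20 above its capital.
 */
int
count_broken_cyrillic(void)
{
	int broken = 0;
	wint_t up;

	for (up = 0x410; up <= 0x42f; up++) {
		if (!case_pair_ok(up, up + 0x20))
			broken++;
	}
	return (broken);
}
```

Call setlocale(LC_ALL, "") first, as in tow-test.c, then print the result of count_broken_cyrillic(). The count depends on the locale data installed on the system, so no expected value is given here.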


Related issues

Blocks illumos gate - Feature #1146: find should support -iname (Resolved, Yuri Pankov, 2011-06-24)

Actions #1

Updated by Yuri Pankov almost 12 years ago

Looks like our maplower is badly broken. Here is part of a dump of both tables (note that the maplower ranges are keyed on lowercase characters where they should be keyed on capitals, and that the mappings don't exist):

        printf("magic=%s encoding=%s\n", _CurrentRuneLocale->__magic,
            _CurrentRuneLocale->__encoding);
        rr = &_CurrentRuneLocale->__maplower_ext;
        base = rr->__ranges;
        printf("maplower: ranges=%d\n", rr->__nranges);
        for (lim = 0; lim != rr->__nranges; lim++) {
                re = base + lim;
                printf("range=%d rangelen=%d chars=%lc-%lc -> maps=%lc-%lc\n",
                    (int)lim, (int)(re->__max - re->__min + 1),
                    re->__min, re->__max, re->__map,
                    re->__map + re->__max - re->__min);
        }
        printf("\n");
        rr = &_CurrentRuneLocale->__mapupper_ext;
        base = rr->__ranges;
        printf("mapupper: ranges=%d\n", rr->__nranges);
        for (lim = 0; lim != rr->__nranges; lim++) {
                re = base + lim;
                printf("range=%d rangelen=%d chars=%lc-%lc -> maps=%lc-%lc\n",
                    (int)lim, (int)(re->__max - re->__min + 1),
                    re->__min, re->__max, re->__map,
                    re->__map + re->__max - re->__min);
        }
        printf("\n");

added to lib/libc/port/locale/towlower.c

Illumos:

magic=RuneMagiUTF-8 encoding=UTF-8
maplower: ranges=393
range=0 rangelen=1 chars=ā-ā -> maps=^@-^@
range=1 rangelen=1 chars=ă-ă -> maps=^@-^@
range=2 rangelen=1 chars=ą-ą -> maps=^@-^@
range=3 rangelen=1 chars=ć-ć -> maps=^@-^@
range=4 rangelen=1 chars=ĉ-ĉ -> maps=^@-^@
range=5 rangelen=1 chars=ċ-ċ -> maps=^@-^@
range=6 rangelen=1 chars=č-č -> maps=^@-^@
range=7 rangelen=1 chars=ď-ď -> maps=^@-^@
range=8 rangelen=1 chars=đ-đ -> maps=^@-^@
range=9 rangelen=1 chars=ē-ē -> maps=^@-^@
range=10 rangelen=1 chars=ĕ-ĕ -> maps=^@-^@
...
mapupper: ranges=351
range=0 rangelen=1 chars=ā-ā -> maps=Ā-Ā
range=1 rangelen=1 chars=ă-ă -> maps=Ă-Ă
range=2 rangelen=1 chars=ą-ą -> maps=Ą-Ą
range=3 rangelen=1 chars=ć-ć -> maps=Ć-Ć
range=4 rangelen=1 chars=ĉ-ĉ -> maps=Ĉ-Ĉ
range=5 rangelen=1 chars=ċ-ċ -> maps=Ċ-Ċ
range=6 rangelen=1 chars=č-č -> maps=Č-Č
range=7 rangelen=1 chars=ď-ď -> maps=Ď-Ď
range=8 rangelen=1 chars=đ-đ -> maps=Đ-Đ
range=9 rangelen=1 chars=ē-ē -> maps=Ē-Ē
range=10 rangelen=1 chars=ĕ-ĕ -> maps=Ĕ-Ĕ
...

FreeBSD (useful, as our code is mostly identical for this stuff):

magic=RuneMagiUTF-8 encoding=UTF-8
maplower: ranges=410
range=0 rangelen=1 chars=Ā-Ā -> maps=ā-ā
range=1 rangelen=1 chars=Ă-Ă -> maps=ă-ă
range=2 rangelen=1 chars=Ą-Ą -> maps=ą-ą
range=3 rangelen=1 chars=Ć-Ć -> maps=ć-ć
range=4 rangelen=1 chars=Ĉ-Ĉ -> maps=ĉ-ĉ
range=5 rangelen=1 chars=Ċ-Ċ -> maps=ċ-ċ
range=6 rangelen=1 chars=Č-Č -> maps=č-č
range=7 rangelen=1 chars=Ď-Ď -> maps=ď-ď
range=8 rangelen=1 chars=Đ-Đ -> maps=đ-đ
range=9 rangelen=1 chars=Ē-Ē -> maps=ē-ē
range=10 rangelen=1 chars=Ĕ-Ĕ -> maps=ĕ-ĕ
...
mapupper: ranges=418
range=0 rangelen=1 chars=ā-ā -> maps=Ā-Ā
range=1 rangelen=1 chars=ă-ă -> maps=Ă-Ă
range=2 rangelen=1 chars=ą-ą -> maps=Ą-Ą
range=3 rangelen=1 chars=ć-ć -> maps=Ć-Ć
range=4 rangelen=1 chars=ĉ-ĉ -> maps=Ĉ-Ĉ
range=5 rangelen=1 chars=ċ-ċ -> maps=Ċ-Ċ
range=6 rangelen=1 chars=č-č -> maps=Č-Č
range=7 rangelen=1 chars=ď-ď -> maps=Ď-Ď
range=8 rangelen=1 chars=đ-đ -> maps=Đ-Đ
range=9 rangelen=1 chars=ē-ē -> maps=Ē-Ē
range=10 rangelen=1 chars=ĕ-ĕ -> maps=Ĕ-Ĕ
...
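
To help read the dumps above: each entry covers a contiguous run of characters min..max, and a character c in that run maps to map + (c - min), which is what the dump code prints for the end of each run. A lookup over such a table could be sketched as follows. This is an illustration under simplified assumptions, not the libc implementation; range_entry and map_via_ranges are hypothetical names, and the table is assumed to be sorted by min with no overlapping runs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one entry of the range table. */
typedef struct {
	uint32_t min;	/* first character in the run */
	uint32_t max;	/* last character in the run */
	uint32_t map;	/* mapping of the first character */
} range_entry;

/*
 * Binary search over runs sorted by min. Returns the mapped character,
 * or c unchanged if no run contains it.
 */
uint32_t
map_via_ranges(const range_entry *ranges, size_t nranges, uint32_t c)
{
	size_t lo = 0, hi = nranges;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (c < ranges[mid].min)
			hi = mid;
		else if (c > ranges[mid].max)
			lo = mid + 1;
		else
			return (ranges[mid].map + (c - ranges[mid].min));
	}
	return (c);
}
```

Under this model, a maplower table keyed on lowercase characters never matches a capital, so the capital falls through and comes back unchanged. That appears consistent with the illumos output in the description.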

Actions #2

Updated by Yuri Pankov almost 12 years ago

What we need to do here: fix __maplower_ext, and make *.UTF-8/LC_CTYPE/LCL_DATA contain mappings for all locales, not just the current one. No other OS makes case mappings and character types dependent on the current locale.

Actions #3

Updated by Yuri Pankov almost 12 years ago

Fixing the first appears to be easy; there's a nasty typo in cmd/localedef/ctype.c:

diff -r 4f5bb85e2547 usr/src/cmd/localedef/ctype.c
--- a/usr/src/cmd/localedef/ctype.c     Fri May 06 07:32:53 2011 -0700
+++ b/usr/src/cmd/localedef/ctype.c     Sun May 08 11:21:21 2011 +0400
@@ -321,8 +321,8 @@
                        ct[rl.runetype_ext_nranges - 1].map = ctn->ctype;
                        last_ct = ctn;
                }
-               if (ctn->toupper == 0) {
-                       last_up = NULL;
+               if (ctn->tolower == 0) {
+                       last_lo = NULL;
                } else if ((last_lo != NULL) &&
                    (last_lo->tolower + 1 == ctn->tolower)) {
                        lo[rl.maplower_ext_nranges-1].max = wc;
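
The hunk above changes the test that resets the lowercase run tracker. A minimal sketch of the kind of range-building loop it sits in is given below, assuming simplified stand-in types. The names ctype_node, range_entry and build_lower_ranges are hypothetical and this is not the localedef source. The check that code points are adjacent is an addition in this sketch and is not visible in the hunk.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for localedef's internal types. */
typedef struct {
	uint32_t wc;		/* code point */
	uint32_t tolower;	/* lowercase mapping, 0 if none */
} ctype_node;

typedef struct {
	uint32_t min;
	uint32_t max;
	uint32_t map;
} range_entry;

/*
 * Compresses nodes (sorted by wc) into runs of consecutive lowercase
 * mappings. out must have room for n entries. Returns the run count.
 */
size_t
build_lower_ranges(const ctype_node *nodes, size_t n, range_entry *out)
{
	const ctype_node *last_lo = NULL;
	size_t nranges = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		const ctype_node *ctn = &nodes[i];

		if (ctn->tolower == 0) {
			/* No lowercase mapping: end the current run. */
			last_lo = NULL;
		} else if (last_lo != NULL &&
		    last_lo->wc + 1 == ctn->wc &&	/* sketch-only check */
		    last_lo->tolower + 1 == ctn->tolower) {
			/* Extends the current run by one character. */
			out[nranges - 1].max = ctn->wc;
			last_lo = ctn;
		} else {
			/* Starts a new run. */
			out[nranges].min = ctn->wc;
			out[nranges].max = ctn->wc;
			out[nranges].map = ctn->tolower;
			nranges++;
			last_lo = ctn;
		}
	}
	return (nranges);
}
```

If the first branch tests the uppercase mapping instead, as in the removed lines, capitals (which have no uppercase mapping) are skipped and lowercase characters start runs with a zero map. That appears consistent with the illumos maplower dump earlier in this issue.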

Actions #4

Updated by Yuri Pankov almost 12 years ago

  • Assignee changed from Garrett D'Amore to Yuri Pankov
Actions #5

Updated by Gordon Ross over 11 years ago

  • Status changed from New to Resolved
changeset:   13399:a1d28d03839f
tag:         tip
user:        Yuri Pankov <yuri.pankov@gmail.com>
date:        Thu May 12 03:21:34 2011 +0400
description:
    992 towlower/towupper are broken
    Reviewed by: Garrett D'Amore <garrett@nexenta.com>
    Approved by: Gordon Ross <gwr@nexenta.com>