Bug #4173

OpenIndiana hipster: std::isalpha g++ error

Added by Alexander Pyhalov about 6 years ago. Updated over 3 years ago.

Status: New
Priority: Normal
Category: -
Target version: -
Start date: 2013-10-02
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

The std::isalpha implementation in G++ incorrectly classifies non-ASCII bytes (values above 0x7F) as alphabetic when a UTF-8 locale is used.
As a result, the attached program fails to URL-encode a string, although it works correctly on Debian Linux 7.0 and FreeBSD 9.
The Studio compiler does not have this bug, and iswalpha() correctly classifies these characters as non-alphabetic. GCC 3.4 and 4.7 are affected.
A test case is attached.
[pre]
$ g++ tohex.cc -o tohex
$ ./tohex
%88�i%25%3C7��%B80�%3B�%98%BD��%B7Z%94
[/pre]
The correct output is:
[pre]
$ LANG=C ./tohex
%88%C4i%25%3C7%E9%FC%B80%EA%3B%FF%98%BD%C8%D1%B7Z%94
[/pre]


Files

tohex.cc (1.23 KB) Alexander Pyhalov, 2013-10-02 05:48 PM
test_ctype.cc (554 Bytes) Alexander Pyhalov, 2013-10-04 02:52 PM

History

#1

Updated by Garrett D'Amore about 6 years ago

What happens if you add -D_XPG4 to the compiler flags?

#2

Updated by Alexander Pyhalov about 6 years ago

Nothing changes:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

$ g++ -D_XPG4 tohex.cc -o tohex_xpg4
$ ./tohex_xpg4
%88�i%25%3C7��%B80�%3B�%98%BD��%B7Z%94
$ env LANG=C ./tohex_xpg4
%88%C4i%25%3C7%E9%FC%B80%EA%3B%FF%98%BD%C8%D1%B7Z%94

P.S. Standard headers already define _XPG4 (by using sys/feature_tests.h ).

#3

Updated by Alexander Pyhalov about 6 years ago

Some more information. If I change the following lines in the sample program

if (std::isalpha(*first, std::locale::classic()) ||
std::isdigit(*first, std::locale::classic()) ||

to

if (std::isalpha((wchar_t)*first, std::locale::classic()) ||
std::isdigit((wchar_t)*first, std::locale::classic()) ||

it works as expected.
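
For comparison, here is a minimal sketch of a percent-encoder that does not consult any locale or ctype table at all. It is not the attached tohex.cc: the names is_unreserved and url_encode are hypothetical, and the set of characters left unencoded (the RFC 3986 "unreserved" set) is an assumption, since the full condition in the sample program is not quoted here.

```cpp
#include <string>

// Hypothetical helper: true for RFC 3986 "unreserved" characters.
// Compares against ASCII ranges directly, so no locale is involved.
static bool is_unreserved(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

// Percent-encode every byte that is not in the unreserved set.
std::string url_encode(const std::string& in)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(in.size() * 3);
    for (std::string::const_iterator it = in.begin(); it != in.end(); ++it) {
        const unsigned char c = static_cast<unsigned char>(*it);
        if (is_unreserved(c)) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];    // high nibble
            out += hex[c & 0x0F];  // low nibble
        }
    }
    return out;
}
```

Because the classification is done on the raw byte value, the result should be the same under LANG=C and under a UTF-8 locale.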

#4

Updated by Alexander Pyhalov about 6 years ago

I've traced the sample down to the following (test_ctype.cc). Is this an error in __ctype_mask, or just incorrect usage? This is what the std::isalpha call expands to:


std::use_facet< std::ctype<char> >(loc).is (std::ctype<char>::alpha, *s.begin())

extract from libstdc++-v3/config/os/solaris/solaris2.8/ctype_inline.h:

bool
ctype<char>::
is(mask __m, char __c) const { return _M_table[static_cast<unsigned char>(__c)] & __m; }
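
A small probe along these lines may help to see which byte values the facet treats as alphabetic. This is only a sketch and not the attached test file; is_alpha_in and count_alpha_bytes are hypothetical helper names.

```cpp
#include <locale>

// Hypothetical helper: ask the ctype<char> facet of `loc` whether `c` is
// alphabetic. This is the same call that std::isalpha(c, loc) makes.
bool is_alpha_in(const std::locale& loc, char c)
{
    return std::use_facet< std::ctype<char> >(loc).is(std::ctype<char>::alpha, c);
}

// Hypothetical helper: count how many of the 256 byte values the facet
// classifies as alphabetic.
int count_alpha_bytes(const std::locale& loc)
{
    int n = 0;
    for (int b = 0; b < 256; ++b)
        if (is_alpha_in(loc, static_cast<char>(b)))
            ++n;
    return n;
}
```

With std::locale::classic() the count would be expected to cover only the 52 ASCII letters; a larger number after a setlocale() call to a UTF-8 locale would point at the table rather than at the caller.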

#5

Updated by Garrett D'Amore about 6 years ago

libstdc++-v3 uses what looks like a custom version -- _M_table isn't a libc builtin, so it must come from somewhere else in libstdc++.

I have to say, it's been a long time since I used C++, and the syntax is incredibly foreign to me.

HOWEVER, if they are using isw*, then that is different than is*. iswalpha() considers values that are wide characters, and understands locales. isalpha() understands only 8-bit locales (e.g. ISO-8859-1), so its output cannot be trusted in a UTF-8 locale (or any other locale that uses multibyte strings.)

I'm thinking some magical work in C++ is translating types and hence code to isw*. The meanings of isw* are subtly different than the non-wide versions and you cannot simply "cast" them.

This feels like a bug in G++.

Additionally, you need to know that is* only examines a single byte at a time.
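
To illustrate why a single byte cannot be classified meaningfully in UTF-8, here is a sketch of a function that reports how long a UTF-8 sequence is based on its first byte. The name utf8_sequence_length is hypothetical; the byte ranges follow the UTF-8 encoding rules.

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence introduced by
// `lead`, or 0 if `lead` cannot start a sequence (a continuation byte or a
// byte value that never appears in valid UTF-8).
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;                   // 0xxxxxxx: ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx
    return 0;                                    // 10xxxxxx, 0xC0, 0xC1, 0xF5..0xFF
}
```

For example, the byte 0xE4 is "ä" in ISO-8859-1, but in UTF-8 it is only the first byte of a three-byte sequence, while "ä" itself is encoded as the two bytes 0xC3 0xA4. A byte-at-a-time is* call sees neither of those as a complete character.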

#7

Updated by Lauri Tirkkonen over 3 years ago

It does seem to be libstdc++ doing something weird. Using the test program from #6751, I looked at all libc functions which get 0xe4 as the first argument, and there were only two: btowc and btowc_l (the former calls the latter). Here's the locname and ustack for two invocations of the program using different LC_CTYPE:

kekkonen ~ % LC_CTYPE=C pfexec dtrace -Fn 'pid$target:libc.so.1:btowc_l:entry /arg0 == 0xe4/ { print(args[1]->locname); ustack(); }' -c ./alpha
dtrace: description 'pid$target:libc.so.1:btowc_l:entry ' matched 1 probe
is alpha: 0
dtrace: pid 24275 has exited
CPU FUNCTION                                 
 13  -> btowc_l                               char [129] "C/en_US.UTF-8/C/fi_FI.UTF-8/en_US.UTF-8/en_US.UTF-8" 

              libc.so.1`btowc_l
              libc.so.1`btowc+0x2d
              libstdc++.so.6.0.18`_ZNSt5ctypeIwE19_M_initialize_ctypeEv+0x50
              libstdc++.so.6.0.18`_ZNSt5ctypeIwEC1Ej+0x43
              libstdc++.so.6.0.18`_ZNSt6locale5_ImplC1Ej+0x80a
              libstdc++.so.6.0.18`_ZNSt6locale18_S_initialize_onceEv+0x29
              libc.so.1`pthread_once+0x6c
              libstdc++.so.6.0.18`_ZNSt6locale13_S_initializeEv+0x62
              libstdc++.so.6.0.18`_ZNSt6locale7classicEv+0x18
              alpha`main+0x22
              alpha`_start+0x83

kekkonen ~ % LC_CTYPE=en_US.UTF-8 pfexec dtrace -Fn 'pid$target:libc.so.1:btowc_l:entry /arg0 == 0xe4/ { print(args[1]->locname); ustack(); }' -c ./alpha
dtrace: description 'pid$target:libc.so.1:btowc_l:entry ' matched 1 probe
is alpha: 1
dtrace: pid 24346 has exited
CPU FUNCTION                                 
 29  -> btowc_l                               char [129] "en_US.UTF-8/en_US.UTF-8/C/fi_FI.UTF-8/en_US.UTF-8/en_US.UTF-8" 

              libc.so.1`btowc_l
              libc.so.1`btowc+0x2d
              libstdc++.so.6.0.18`_ZNSt5ctypeIwE19_M_initialize_ctypeEv+0x50
              libstdc++.so.6.0.18`_ZNSt5ctypeIwEC1Ej+0x43
              libstdc++.so.6.0.18`_ZNSt6locale5_ImplC1Ej+0x80a
              libstdc++.so.6.0.18`_ZNSt6locale18_S_initialize_onceEv+0x29
              libc.so.1`pthread_once+0x6c
              libstdc++.so.6.0.18`_ZNSt6locale13_S_initializeEv+0x62
              libstdc++.so.6.0.18`_ZNSt6locale7classicEv+0x18
              alpha`main+0x22
              alpha`_start+0x83

#8

Updated by Lauri Tirkkonen over 3 years ago

Lauri Tirkkonen wrote:

It does seem to be libstdc++ doing something weird. Using the test program from #6751, I looked at all libc functions which get 0xe4 as the first argument, and there were only two: btowc and btowc_l (the former calls the latter)

From what I can tell, ctype<wchar_t>::_M_initialize_ctype() in gcc-5.1.0/libstdc++-v3/config/locale/generic/ctype_members.cc:248 basically just calls btowc(i) for all 0 <= i <= 255 and stores the result. If std::locale::classic() is called before the setlocale() call in the test program, things happen to work, but apparently the initialization uses whatever locale is current at the time.
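
A simplified sketch of that behaviour, together with the ordering workaround, might look like the following. This is not the libstdc++ source: build_widen_table and init_locales are hypothetical names, and the sketch only assumes that the cached values come from btowc() under the C locale that is current at the time of the call.

```cpp
#include <clocale>
#include <cwchar>
#include <locale>
#include <vector>

// Simplified stand-in for the initialization described above: cache
// btowc(i) for every byte value. The results depend on whichever C locale
// is current when this function runs.
std::vector<std::wint_t> build_widen_table()
{
    std::vector<std::wint_t> table(256);
    for (int i = 0; i < 256; ++i)
        table[i] = std::btowc(i);
    return table;
}

// Ordering workaround: force the classic C++ locale to be constructed
// while the C locale is still the startup "C" locale, and only then
// switch to the locale from the environment.
void init_locales()
{
    (void)std::locale::classic();
    std::setlocale(LC_ALL, "");
}
```

Calling init_locales() at the top of main(), before anything else touches locales, would match the order in which the test program was observed to work. It only hides the problem in the calling program and does not change how the library initializes its tables.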

Not sure how to proceed with this bug.
