Bug #4412

character handling functions should return 0 for argument range 128-255 in UTF-8 locales

Added by Alexander Pyhalov almost 6 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: lib - userland libraries
Start date: 2013-12-22
Due date:
% Done: 100%
Estimated time:
Difficulty: Bite-size
Tags:

Description

Currently, the man page for islower()/isupper()/isalpha() (and so on) specifies that:

If the argument to any of the character handling  macros  is
     not  in the domain of the function, the result is undefined.

Since (char)128-255 is not a valid UTF-8 character, I think these functions fall back on undefined behavior for it. This is incorrect and breaks a lot of software; they should return 0 for values 128-255.
According to POSIX (http://pubs.opengroup.org/onlinepubs/7999959899/functions/islower.html): "The c argument is an int, the value of which the application shall ensure is a character representable as an unsigned char or equal to the value of the macro EOF. If the argument has any other value, the behavior is undefined". Nothing is stated about "a correct character in current locale", just "a character representable as an unsigned char or equal to the value of the macro EOF".
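
For reference, the usual way for a caller to stay inside the domain POSIX describes is to convert a plain char to unsigned char before the call. A minimal sketch, assuming a hypothetical helper named char_islower (it is not part of any library):

```c
#include <ctype.h>

/*
 * Hypothetical helper: classify a plain char without leaving the domain
 * POSIX allows for islower(). Where char is signed, bytes 128-255 are
 * negative as char, so they are converted to unsigned char first.
 */
static int char_islower(char c)
{
    return islower((unsigned char)c) != 0;
}
```

This only keeps the argument within the documented domain; what the function returns for 128-255 still depends on the current locale, which is the subject of this report.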

Test case:

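$ cat test_islower.c
#include <stdio.h>
#include <ctype.h>
#include <locale.h>

int main(void)
{
  int i;

  /* Select a UTF-8 locale before classifying. */
  setlocale(LC_ALL, "en_US.UTF-8");

  /* Print every value in 0-255 that islower() reports as lower case. */
  for (i = 0; i < 256; i++) {
    if (islower(i)) {
      printf("lower, %d %c\n", i, i);
    }
  }
  return 0;
}
$ gcc test_islower.c -o test_islower
$ ./test_islower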

Result shows that islower considers 170,181,186,223-246,248-255 to be lower case characters.

History

#1

Updated by Garrett D'Amore almost 6 years ago

Well, it seems unfortunate.

Note that if you set the locale to "C", you get correct answers.

It seems that UTF-8 is allowing you to test for numeric values for character types that are frankly illegal in UTF-8. That's probably a mistake. Although any program which relies on this behavior is also making a mistake. If they want portable answers to this question, they need to use the "C" locale.

The problem here is that bug 992 made a change that introduced some of this behavior; notably that non-language specific characters have mappings. (So that you could convert upper case Cyrillic to lower case, and vice versa, even when in a UTF-8 English locale.)

It's not clear what the right answer here is. Some people think that these foreign characters should be handled universally, whereas others (like the submitter here) would like the response to be valid only for characters that made sense in the language. (But it gets more complex, and we sometimes use accents and have foreign characters in names and words, even though those characters are not normally part of English. For example, Sašo is a name. It would be nice to handle the š and Š properly as upper and lower case.)

Again, if you have a test program that cares about this, and breaks because it wants to parse only ASCII, then fix your program. Either have it use the C locale, or have it restrict its input to just ASCII.
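
The second option, restricting the checks to ASCII, could look roughly like the following. This is only a sketch: ascii_islower and ascii_isupper are hypothetical names, and the range comparisons assume an ASCII-compatible execution character set where the letters are contiguous.

```c
/*
 * Hypothetical locale-independent checks: only ASCII letters count,
 * whatever locale the process has selected.
 * Assumption: 'a'..'z' and 'A'..'Z' are contiguous (true for ASCII).
 */
static int ascii_islower(int c)
{
    return c >= 'a' && c <= 'z';
}

static int ascii_isupper(int c)
{
    return c >= 'A' && c <= 'Z';
}
```

Because these never consult the locale, values such as 170 or 223 are rejected regardless of what setlocale() was called with.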

(Actually, according to this request, "W" would not be upper case in some locales, e.g. Cyrillic ones like ru_RU.UTF-8. Having ASCII English letters not treated as upper/lower case would probably break far, far more software than allowing Cyrillic characters to get upper/lower case treatment in Latin character locales. :-)

Anyway, I don't really view this request as a "bug", other than a bug in the calling program.

#2

Updated by Yuri Pankov over 2 years ago

  • Category set to lib - userland libraries
  • Status changed from New to In Progress
  • Assignee set to Yuri Pankov
  • % Done changed from 0 to 50
  • Difficulty changed from Medium to Bite-size
  • Tags deleted (needs-triage)

I think this bug is pretty much valid and isn't related to #992 at all: POSIX requires that the argument to these functions be representable as unsigned char or be EOF, and none of the UTF-8 locales (by design) can represent values in the 128-255 range as a valid character, so returning anything other than 0 there is a bug.

I've also tested this on FreeBSD and Linux (Debian) to make sure we won't break anything. The results are the same, and confirm what Alexander wrote originally.
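
A comparison like this can be reduced to a single number per locale. The sketch below is not the program that was actually run; count_high_lower is a hypothetical helper that counts how many arguments in 128-255 islower() accepts under a given locale.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical check: count the arguments in 128-255 that islower()
 * accepts under the named locale. Returns -1 if the locale cannot be
 * selected (for example, because it is not installed).
 */
static int count_high_lower(const char *locale_name)
{
    int c, count = 0;

    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;

    for (c = 128; c < 256; c++) {
        if (islower(c))
            count++;
    }
    return count;
}
```

With the behavior requested in this issue, count_high_lower("en_US.UTF-8") would be expected to return 0, while the original test case corresponds to a non-zero count.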

#3

Updated by Yuri Pankov over 2 years ago

  • Subject changed from islower()/isupper()/isalpha()/... functions should return 0 for symbols 128-255 in UTF-8 locales to character handling functions should return 0 for argument range 128-255 in UTF-8 locales

#4

Updated by Electric Monk over 2 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 50 to 100

git commit 4f9d8ec0de8c8724941a1c13a5adf4c710d6442c

commit  4f9d8ec0de8c8724941a1c13a5adf4c710d6442c
Author: Yuri Pankov <yuri.pankov@nexenta.com>
Date:   2017-04-03T19:09:44.000Z

    4412 character handling functions should return 0 for argument range 128-255 in UTF-8 locales
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Approved by: Dan McDonald <danmcd@omniti.com>
