Bug #11741

regexec: fix processing multibyte strings

Added by Yuri Pankov about 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Assignee:
Category: lib - userland libraries
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Bite-size
Tags:
Gerrit CR:

Description

From FreeBSD's PR153502:

I'm seeing odd behavior from programs using regex(3) like less(1), vi(1) and sed(1) when using LANG=en_US.UTF-8 and UTF-8 inputs.

Sometimes it seems to work right:

$ echo 'é' | sed -ne '/^.$/p'
é
$ echo 'éé' | sed -ne '/^..$/p'
éé
$ echo 'aéa' | sed -ne '/a.a/p'
aéa
$ echo 'aéa' | sed -ne '/a.*a/p'
aéa
$ echo 'aaéaa' | sed -ne '/aa.aa/p'
aaéaa
$ echo 'aéaéa' | sed -ne '/a.a.a/p'
aéaéa

But not always:

$ echo 'éa' | sed -ne '/.a/p'
$ echo 'aéaa' | sed -ne '/a.aa/p'
$ echo 'éaé' | sed -ne '/.a./p'

It seems that ".*", ".+", ".{0,}" and ".{1,}" work correctly, but ".{0,1}", ".{1,1}" and a lone "." do not always match.
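
The cases above can be collected into a small script for re-checking after a fix. This is only a sketch: the check function and the pattern:text case list are hypothetical helpers, not part of the original report, and the script assumes the en_US.UTF-8 locale named in the report is installed and that patterns contain no '/' or ':'.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): prints "match" if sed prints
# TEXT for /PATTERN/p, and "no match" otherwise.
# Assumes PATTERN contains no '/' characters.
check() {
    pattern=$1
    text=$2
    if [ -n "$(printf '%s\n' "$text" | sed -ne "/$pattern/p")" ]; then
        echo "match"
    else
        echo "no match"
    fi
}

# Locale taken from the report; assumed to be installed on this system.
LC_ALL=en_US.UTF-8
export LC_ALL

# The failing cases from the report, written as pattern:text pairs.
for entry in '.a:éa' 'a.aa:aéaa' '.a.:éaé'; do
    pattern=${entry%%:*}
    text=${entry#*:}
    echo "/$pattern/ against '$text': $(check "$pattern" "$text")"
done
```

With a working multibyte regexec, each line would be expected to report "match"; a "no match" line suggests the behavior described above is still present, or that the UTF-8 locale is not actually in effect.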

#1

Updated by Electric Monk about 1 year ago

  • Status changed from In Progress to Closed
  • % Done changed from 50 to 100

git commit 695dd8d1c21542efb8ca2e82c6eb63007a6a5212

commit  695dd8d1c21542efb8ca2e82c6eb63007a6a5212
Author: Yuri Pankov <yuri.pankov@nexenta.com>
Date:   2019-10-22T15:10:25.000Z

    11741 regexec: fix processing multibyte strings
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
