Feature #7587

SMB should support enhanced Unicode

Added by Dmitry Glushenok almost 3 years ago. Updated 4 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: cifs - CIFS server and client
Start date: 2016-11-16
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

The ZFS dataset was created with "-o utf8only=on -o normalization=formD -o casesensitivity=mixed -o nbmand=on" and shared using "sharesmb=on".

Creating a file with an emoji in its name from a Windows client fails with the error "The file name you specified is not valid or too long".
Creating the same file directly on ZFS succeeds, but the SMB share then stops displaying some files that already existed on the share. For example:

# touch test1
# touch test2
# touch test3
# touch test4
# touch test5
# touch test3🔥

After this, the SMB share no longer displays test3, test3🔥 and test4, but still shows test1, test2 and test5 (ZFS itself shows everything).

When utf8only and normalization are not used, SMB does not prevent creating files with an emoji in the name, but still does not display them. Moreover, in this case the name appears to be altered in some way that prevents the terminal from displaying it.
For example, the following "test3🔥" files were created on the SMB share (18:40) and using touch on ZFS (18:37):

# /usr/bin/ls -l
<cut> Nov 16 18:40 test3??????.txt
<cut> Nov 16 18:37 test3🔥
#
# GNU-ls -l
<cut> Nov 16 18:40 'test3'$'\355\240\275\355\264\245'
<cut> Nov 16 18:37 'test3'$'\360\237\224\245'
#
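
The two octal byte sequences in the GNU ls output can be inspected directly. The following is a minimal sketch (the names `FROM_SMB`, `FROM_TOUCH` and `describe` are made up for illustration) that decodes both sequences and reports whether each one is valid UTF-8:

```python
# Byte sequences copied from the GNU ls output above (octal escapes).
FROM_SMB = bytes([0o355, 0o240, 0o275, 0o355, 0o264, 0o245])
FROM_TOUCH = bytes([0o360, 0o237, 0o224, 0o245])


def describe(raw: bytes) -> str:
    """Return a short description of how `raw` decodes as UTF-8."""
    try:
        text = raw.decode("utf-8")  # strict decoding rejects encoded surrogates
    except UnicodeDecodeError:
        # "surrogatepass" lets the encoded surrogate halves through
        # so that they can be inspected individually.
        halves = raw.decode("utf-8", errors="surrogatepass")
        return "invalid UTF-8, surrogates " + " ".join(
            f"U+{ord(c):04X}" for c in halves)
    return "valid UTF-8, " + " ".join(f"U+{ord(c):04X}" for c in text)


if __name__ == "__main__":
    print(describe(FROM_SMB))    # invalid UTF-8, surrogates U+D83D U+DD25
    print(describe(FROM_TOUCH))  # valid UTF-8, U+1F525
```

The name created over SMB holds the two UTF-16 surrogate halves of U+1F525, each stored as its own three-byte sequence, while the name created with touch is the regular four-byte UTF-8 encoding of the same character.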

Is it possible that the Unicode tables used by SMB are outdated and need to be updated?

All tests on omnios-r151018.


Files

7587.c (593 Bytes) - test case for u8_* functions - Yuri Pankov, 2017-03-03 03:03 PM

Related issues

Has duplicate: illumos gate - Feature #10998: SMB should support enhanced Unicode (Closed, 2019-05-14)

History

#1

Updated by Yuri Pankov over 2 years ago

That's pretty strange, actually. SMB uses the same conversion routines as ZFS - u8_textprep_str(), u8_validate() - so the Unicode data they rely on shouldn't be the issue.

Moreover, I've done a bit of testing, and both functions treat the string with the emoji as valid.

And yes, I'm able to reproduce this with utf8only=on and normalization=formD, though my results on the Windows side are even more embarrassing: I don't see any files at all:

H:\>dir
 Volume in drive H is data
 Volume Serial Number is 9E0D-9C2F

 Directory of H:\

03/03/2017  05:51 PM    <DIR>          .
03/03/2017  05:51 PM    <DIR>          ..
File Not Found

however:

H:\>dir test1 test2 test3 test4 test5
 Volume in drive H is data
 Volume Serial Number is 9E0D-9C2F

 Directory of H:\

03/03/2017  05:51 PM                 0 test1

 Directory of H:\

03/03/2017  05:51 PM                 0 test2

 Directory of H:\

03/03/2017  05:51 PM                 0 test3

 Directory of H:\

03/03/2017  05:51 PM                 0 test4

 Directory of H:\

03/03/2017  05:51 PM                 0 test5
               5 File(s)              0 bytes
               0 Dir(s)  2,404,231,413,760 bytes free

#2

Updated by Yuri Pankov over 2 years ago

Just rechecked this after updating both ctype definitions (#4006) and locale data (#8170), and the problem is still there.

#3

Updated by Gordon Ross 5 months ago

  • Subject changed from Different SMB and ZFS behavior on emoji symbols to SMB should support enhanced Unicode

#4

Updated by Gordon Ross 5 months ago

The failure to display these directory entries happens when
u8_validate returns -1 in this call stack:

smb_odir_next_odirent
smb_odir_read_fileinfo
smb2_find_entries
smb2_query_dir
smb2sr_work

The string passed to u8_validate has extended Unicode chars like:

\355 \240 \275 \355 \264 \245

Actually, the above is not valid UTF-8: it is a UTF-16 surrogate
pair with each 16-bit half encoded as its own 3-byte sequence.
It got there because when the ZIP file is unpacked in the share,
the SMB server incorrectly converts UTF-16 characters
to UTF-8 when they require more than 16 bits. Those land in
the file system as invalid UTF-8 strings (assuming you don't
have utf8only set in your ZFS dataset).
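
To illustrate how such bytes can arise, here is a minimal sketch of a conversion that handles one 16-bit unit at a time. It is not the SMB server's code; the function names are hypothetical and only model the behavior described above.

```python
def encode_code_unit_as_utf8(unit: int) -> bytes:
    """Encode one 16-bit value using the generic 1- to 3-byte UTF-8
    pattern, without checking whether it is a surrogate half."""
    if unit < 0x80:
        return bytes([unit])
    if unit < 0x800:
        return bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])


def utf16_units(text: str) -> list:
    """Split a string into its UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]


def naive_utf16_to_utf8(units) -> bytes:
    """Convert unit by unit, as if every unit were a whole character."""
    return b"".join(encode_code_unit_as_utf8(u) for u in units)


if __name__ == "__main__":
    name = naive_utf16_to_utf8(utf16_units("test3\U0001F525"))
    print(name)  # b'test3\xed\xa0\xbd\xed\xb4\xa5'
```

The six trailing bytes are the same ones shown in octal above (\355 \240 \275 \355 \264 \245), and a strict UTF-8 validator rejects them.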

If instead I unpack the customer's ZIP on the appliance,
we get correct UTF-8 file names (complete with "fire" and
"speaker" emojis), but the SMB server still does not
display them in directory read requests because
smb_wcequiv_strlen() returns -1, then
smb2_find_mbc_encode returns "internal error",
so smb2_find_entries skips this entry.
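
The effect of a length calculation that only understands 16-bit characters can be sketched as follows. This is a simplified model, assuming the length function computes the byte size of the name once converted to 16-bit characters; `ucs2_equiv_len`, `utf16_equiv_len` and `list_entries` are hypothetical names, not the server's functions.

```python
def ucs2_equiv_len(name: str) -> int:
    """Bytes needed for `name` as UCS-2, or -1 if any code point
    does not fit in 16 bits."""
    total = 0
    for ch in name:
        if ord(ch) > 0xFFFF:
            return -1
        total += 2
    return total


def utf16_equiv_len(name: str) -> int:
    """Bytes needed for `name` as UTF-16: code points above U+FFFF
    take a surrogate pair (4 bytes), everything else takes 2."""
    return sum(4 if ord(ch) > 0xFFFF else 2 for ch in name)


def list_entries(names, length_fn):
    """Model of a directory read that skips entries whose length
    calculation fails."""
    return [n for n in names if length_fn(n) >= 0]


if __name__ == "__main__":
    names = ["test1", "test3\U0001F525", "test5"]
    print(list_entries(names, ucs2_equiv_len))   # emoji entry is skipped
    print(list_entries(names, utf16_equiv_len))  # all three entries listed
```

With the UCS-2-only length function the entry with the emoji silently disappears from the listing, which matches the symptom seen from the client.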

Both of these problems happen because the SMB server code
uses some wrapper functions for Unicode conversions that
assume all Unicode symbols can be represented as UCS-2
(historical Unicode symbols with codes < 65536) and that
conversions can be performed one symbol at a time.

We need to change those wrappers and the call sites that
use them to use string conversions instead of one-at-a-time
interfaces.
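
As a rough illustration of what a string-level conversion has to do differently, the sketch below walks a sequence of UTF-16 code units and combines each surrogate pair into one code point before encoding it. It only shows the general technique; `utf16_to_utf8` is a hypothetical name and this is not the code from the fix.

```python
def utf16_to_utf8(units) -> bytes:
    """Convert a sequence of UTF-16 code units to UTF-8,
    combining surrogate pairs into single code points."""
    out = bytearray()
    i = 0
    while i < len(units):
        cp = units[i]
        i += 1
        if 0xD800 <= cp <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i < len(units) and 0xDC00 <= units[i] <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i] - 0xDC00)
                i += 1
            else:
                raise ValueError("unpaired high surrogate")
        elif 0xDC00 <= cp <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        out += chr(cp).encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    units = [ord(c) for c in "test3"] + [0xD83D, 0xDD25]
    print(utf16_to_utf8(units))  # b'test3\xf0\x9f\x94\xa5'
```

Because the pair is consumed as a unit, the result is the four-byte sequence \360 \237 \224 \245 rather than two three-byte surrogate encodings, and an unpaired surrogate is reported as an error instead of being written out.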

#5

Updated by Gordon Ross 5 months ago

Testing: Unpack some files with names containing enhanced Unicode, and verify the SMB server presents them in directory listings.

The fix has been in production since late 2018.

#6

Updated by Gordon Ross 5 months ago

  • Status changed from New to In Progress

#7

Updated by Gordon Ross 5 months ago

Closed #10998 as a dup. of this.

#8

Updated by Gordon Ross 4 months ago

  • Has duplicate Feature #10998: SMB should support enhanced Unicode added

#9

Updated by Gordon Ross 4 months ago

  • Tracker changed from Bug to Feature

#10

Updated by Electric Monk 4 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 7d1ffc32e5e72873791b96934af035e0f051fc14

commit  7d1ffc32e5e72873791b96934af035e0f051fc14
Author: Gordon Ross <gwr@nexenta.com>
Date:   2019-06-04T02:09:35.000Z

    7587 SMB should support enhanced Unicode
    Reviewed by: Matt Barden <matt.barden@nexenta.com>
    Reviewed by: Evan Layton <evan.layton@nexenta.com>
    Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>
