Feature #7587
Closed
SMB should support enhanced Unicode
100%
Description
The ZFS dataset was created with "-o utf8only=on -o normalization=formD -o casesensitivity=mixed -o nbmand=on" and shared using "sharesmb=on".
Trying to create a file with an emoji in its name from a Windows client fails with the error "The file name you specified is not valid or too long".
Creating the same file directly on ZFS succeeds, but the SMB share then stops displaying some files that already existed on the share. For example:
# touch test1
# touch test2
# touch test3
# touch test4
# touch test5
# touch test3🔥
After this, the share stops displaying test3, test3🔥 and test4, but still shows test1, test2 and test5 (all via SMB; ZFS shows everything).
When utf8only and normalization are not used, SMB does not forbid creating files with emoji in the name, but it still does not display them. Moreover, in this case it looks like the name is somehow normalized, which prevents it from being displayed in a terminal.
For example, the following "test3🔥" files were created on SMB share (18:40) and using touch on ZFS (18:37):
# /usr/bin/ls -l
<cut> Nov 16 18:40 test3??????.txt
<cut> Nov 16 18:37 test3🔥
#
# GNU-ls -l
<cut> Nov 16 18:40 'test3'$'\355\240\275\355\264\245'
<cut> Nov 16 18:37 'test3'$'\360\237\224\245'
#
Is it possible that the Unicode tables used by SMB are outdated and need to be updated?
All tests on omnios-r151018.
Updated by Yuri Pankov over 6 years ago
That's pretty strange, actually. SMB uses the same conversion routines as ZFS - u8_textprep_str(), u8_validate(), so its data shouldn't be an issue.
Moreover, I've done a bit of testing, and both functions treat the string with emoji as valid.
And yes, I'm able to reproduce this with utf8only=yes and normalization=formD, though my results on the Windows side are even more embarrassing: I don't see any files at all:
H:\>dir
 Volume in drive H is data
 Volume Serial Number is 9E0D-9C2F
 Directory of H:\
03/03/2017  05:51 PM    <DIR>          .
03/03/2017  05:51 PM    <DIR>          ..
File Not Found
however:
H:\>dir test1 test2 test3 test4 test5
 Volume in drive H is data
 Volume Serial Number is 9E0D-9C2F
 Directory of H:\
03/03/2017  05:51 PM                 0 test1
 Directory of H:\
03/03/2017  05:51 PM                 0 test2
 Directory of H:\
03/03/2017  05:51 PM                 0 test3
 Directory of H:\
03/03/2017  05:51 PM                 0 test4
 Directory of H:\
03/03/2017  05:51 PM                 0 test5
               5 File(s)              0 bytes
               0 Dir(s)  2,404,231,413,760 bytes free
Updated by Yuri Pankov over 6 years ago
Updated by Gordon Ross over 4 years ago
- Subject changed from Different SMB and ZFS behavior on emoji symbols to SMB should support enhanced Unicode
Updated by Gordon Ross over 4 years ago
The failure to display these directory entries happens when
u8_validate returns -1 in this call stack:
smb_odir_next_odirent
smb_odir_read_fileinfo
smb2_find_entries
smb2_query_dir
smb2sr_work
The string passed to u8_validate has extended Unicode chars like:
\355 \240 \275 \355 \264 \245
Actually, the above is an invalid UTF-8 string: it is the UTF-16
surrogate pair D83D DD25 with each half encoded separately as
three bytes.
It got there because when the ZIP file is unpacked in the share,
the SMB server is incorrectly converting UTF-16 characters
to UTF-8 when they require more than 16 bits. Those land in
the file system as invalid UTF-8 strings (assuming you don't
have utf8only set in your ZFS dataset).
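The faulty conversion described above can be reproduced outside the server. The following is only an illustrative Python sketch, not the SMB server code: the helper names `encode_per_code_unit` and `is_valid_utf8` are made up for this example, and the "fire" emoji (U+1F525) is used because its bytes match the listings in this issue.

```python
FIRE = "\U0001F525"  # the "fire" emoji, a code point above U+FFFF


def encode_per_code_unit(name: str) -> bytes:
    """Mimic the bug: convert each 16-bit UTF-16 code unit to UTF-8 on its own.

    Hypothetical helper for illustration only. A character above U+FFFF is
    two code units (a surrogate pair) in UTF-16, so each half ends up as a
    separate three-byte sequence, which is not valid UTF-8.
    """
    utf16 = name.encode("utf-16-le")
    units = [int.from_bytes(utf16[i:i + 2], "little")
             for i in range(0, len(utf16), 2)]
    # "surrogatepass" lets Python emit the (invalid) surrogate encodings.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


def is_valid_utf8(data: bytes) -> bool:
    """Strict UTF-8 check: encoded surrogates are rejected."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


broken = encode_per_code_unit("test3" + FIRE)   # what landed on disk via SMB
correct = ("test3" + FIRE).encode("utf-8")      # what touch created on ZFS

if __name__ == "__main__":
    print("per-code-unit:", " ".join(f"\\{b:03o}" for b in broken[5:]))
    print("whole string: ", " ".join(f"\\{b:03o}" for b in correct[5:]))
    print("valid UTF-8?  ", is_valid_utf8(broken), is_valid_utf8(correct))
```

The per-code-unit result ends in the six bytes `\355 \240 \275 \355 \264 \245`, while the whole-string result ends in the four bytes `\360 \237 \224 \245` seen in the GNU ls listing.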
If instead I unpack the customer's ZIP on the appliance,
we get correct UTF-8 file names (complete with "fire" and
"speaker" emojis), but the SMB server still does not
display them in directory read requests because
smb_wcequiv_strlen() returns -1, then
smb2_find_mbc_encode returns "internal error"
so smb2_find_entries skips this entry.
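To illustrate why a length calculation can fail on a perfectly valid name, here is a rough Python sketch. It is an assumption about the general shape of the problem, not a transcription of smb_wcequiv_strlen(): both function names below are hypothetical, and the real routine's internals may differ.

```python
def wc_equiv_len_bmp_only(name: bytes) -> int:
    """Hypothetical BMP-only length routine.

    Returns the number of bytes the name would need as UTF-16, or -1 if
    any character does not fit in a single 16-bit unit.
    """
    total = 0
    for ch in name.decode("utf-8"):
        if ord(ch) > 0xFFFF:
            return -1          # cannot be represented as one UCS-2 unit
        total += 2
    return total


def wc_equiv_len(name: bytes) -> int:
    """Hypothetical corrected routine.

    Characters above U+FFFF need a surrogate pair, that is, two 16-bit
    units (four bytes) in UTF-16.
    """
    return sum(4 if ord(ch) > 0xFFFF else 2 for ch in name.decode("utf-8"))


emoji_name = "test3\U0001F525".encode("utf-8")
plain_name = b"test3"
```

With the BMP-only variant the emoji name yields -1, which a caller would treat as an error and skip the entry; the corrected variant yields the same byte count as a real UTF-16 encoding of the name.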
Both of these problems happen because the SMB server code
uses some wrapper functions for Unicode conversions that
assume all Unicode symbols can be represented as UCS-2
(historical Unicode symbols with codes < 65536) and that
conversions can be performed one symbol at a time.
We need to change those wrappers and the call sites that
use them to use string conversions instead of one-at-a-time
interfaces.
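As a sketch of what a string-at-a-time conversion has to do differently, the Python function below walks a sequence of UTF-16 code units and combines each surrogate pair into one code point before emitting UTF-8. The name `utf16_units_to_utf8` is made up for this illustration and is not the interface used in the actual fix.

```python
def utf16_units_to_utf8(units):
    """Convert a sequence of 16-bit UTF-16 code units to UTF-8 bytes.

    Illustrative only. Surrogate pairs are combined into a single code
    point first, so characters above U+FFFF become one four-byte UTF-8
    sequence instead of two invalid three-byte ones.
    """
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            cp = u
            i += 1
        out += chr(cp).encode("utf-8")
    return bytes(out)


# "test3" followed by the surrogate pair D83D DD25 (the "fire" emoji).
units = [0x74, 0x65, 0x73, 0x74, 0x33, 0xD83D, 0xDD25]
name = utf16_units_to_utf8(units)
```

The key point is that the converter needs to see both halves of a pair at once, which a one-symbol-at-a-time interface cannot provide.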
Updated by Gordon Ross over 4 years ago
Testing: Unpack some files with names containing enhanced Unicode, and verify the SMB server presents them in directory listings.
The fix has been in production since late 2018.
Updated by Gordon Ross over 4 years ago
- Has duplicate Feature #10998: SMB should support enhanced Unicode added
Updated by Electric Monk over 4 years ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
git commit 7d1ffc32e5e72873791b96934af035e0f051fc14
commit 7d1ffc32e5e72873791b96934af035e0f051fc14
Author: Gordon Ross <gwr@nexenta.com>
Date: 2019-06-04T02:09:35.000Z
7587 SMB should support enhanced Unicode
Reviewed by: Matt Barden <matt.barden@nexenta.com>
Reviewed by: Evan Layton <evan.layton@nexenta.com>
Reviewed by: Joyce McIntosh <joyce.mcintosh@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>