Bug #14735
Users with a large group sidlist cause performance problems in SMB
Status: Closed
% Done: 100%
Description
This issue was reported to us when a user copying 17MB of data across 733 files took over a minute just to 'find' the files to copy. This account had over 300 SIDs in its group list. When the user copied this same dataset with a local 'admin' user (over SMB) with 4 SIDs in its group list, the problem disappeared. The share in question has 'Access Based Enumeration' enabled.
Flamegraphs showed us that 97.3% of samples occurred in zfs_zaccess_aces_check() - between 13,000 and 14,000 calls per second - with many of those samples in zfs_groupmember() or mutex_enter/exit (on z_acl_lock). This occurs in two main places: smb_vop_lookup() and smb_vop_readdir(). The access check in smb_vop_lookup() comes from zfs_lookup() attempting to check that the user has ACE_EXECUTE access (SMB can't make use of DNLC due to case-insensitive lookups, and so almost never uses the 'fast path'). The check in smb_vop_readdir() comes from zfs_readdir() when ABE is enabled on the share.
Several things come to light when looking at these access checks:
1. In SMB, by default, users are always granted PRIV_FILE_DAC_SEARCH to emulate 'SeBypassTraverseChecking', which is granted to all users in Windows. This means they will always be granted ACE_EXECUTE on directories. However, the privilege check only occurs after searching the ACL fails to grant ACE_EXECUTE access. In essence, this is a wasteful check by default in SMB - we'll know at the SMB level whether we need to perform that check, but currently have no way of informing ZFS of this (zfs_lookup() discards ATTR_NOACLCHECK).
2. zfs_groupmember() and zfs_user_in_cred() search the credential's list of SIDs linearly, even though the corresponding posix-ID check (groupmember()) uses a binary search. When an ACE doesn't apply to the user, the entire sidlist is searched fruitlessly; when the list is large, this can take an extremely long time. If the ACL is long but grants the user no access, the entire sidlist is searched for every entry, multiplying the wasted time (see the sketch after this list).
3. zfs_groupmember() calls zfs_fuid_find_by_idx() on each iteration of the loop with the same arguments each time. That function performs a binary search (under a reader lock) of all domain SIDs known to ZFS (which is generally pretty small). zfs_user_in_cred(), on the other hand, only calls it once.
4. The entirety of zfs_zaccess_aces_check() occurs while the z_acl_lock mutex is held. When the ACL check takes too long (for example, if there are a very large number of groups or if it needs to contact idmap or ldap for some reason), this can cause significant contention for other requests that want to access the object. This is despite the fact that access checking is nearly always a read-only operation, with the one exception being the first time the ACL is read (it needs to be read from disk and then cached in the znode).
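To illustrate why point 2 matters (and the approach taken by change 2 below), here is a minimal user-space sketch that models a group SID as a (domain index, rid) pair, sorts the list once when it is built, and answers membership queries with a binary search rather than a linear scan. The sid_t type and function names here are illustrative assumptions only; the real kernel code works with ksid_t/ksidlist_t and differs in detail.

/*
 * Simplified user-space model of the sidlist search problem, assuming a
 * SID can be represented as a (domain index, rid) pair.  This only
 * illustrates why sorting the list once at creation time and
 * binary-searching it per ACE beats a linear scan when the list has
 * hundreds of entries; it is not the kernel interface.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct sid {
	uint32_t	s_domidx;	/* index of the domain SID */
	uint32_t	s_rid;		/* relative ID within that domain */
} sid_t;

/* Total order over (domain, rid); used both for sorting and searching. */
static int
sid_cmp(const void *a, const void *b)
{
	const sid_t *x = a, *y = b;

	if (x->s_domidx != y->s_domidx)
		return (x->s_domidx < y->s_domidx ? -1 : 1);
	if (x->s_rid != y->s_rid)
		return (x->s_rid < y->s_rid ? -1 : 1);
	return (0);
}

/* O(n) scan: the full cost is paid for every ACE that does not match. */
static int
sidlist_member_linear(const sid_t *list, size_t n, const sid_t *key)
{
	for (size_t i = 0; i < n; i++) {
		if (sid_cmp(&list[i], key) == 0)
			return (1);
	}
	return (0);
}

/* O(log n) lookup, valid only because the list was sorted when built. */
static int
sidlist_member_binary(const sid_t *list, size_t n, const sid_t *key)
{
	return (bsearch(key, list, n, sizeof (sid_t), sid_cmp) != NULL);
}

int
main(void)
{
	enum { NGROUPS = 400 };
	sid_t groups[NGROUPS];

	/* Build a 400-entry group list, then sort it once at "login". */
	for (size_t i = 0; i < NGROUPS; i++)
		groups[i] = (sid_t){ .s_domidx = i % 3, .s_rid = 1000 + i };
	qsort(groups, NGROUPS, sizeof (sid_t), sid_cmp);

	/* An ACE whose SID is not in the list: the worst case for a scan. */
	sid_t miss = { .s_domidx = 7, .s_rid = 42 };
	printf("linear: %d, binary: %d\n",
	    sidlist_member_linear(groups, NGROUPS, &miss),
	    sidlist_member_binary(groups, NGROUPS, &miss));
	return (0);
}

The key design point is that the sort happens once, when the list is created (for SMB, at authentication time), so every subsequent ACE comparison against a non-matching SID costs O(log n) instead of O(n).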
To resolve this, we made the following changes:
1. Update zfs_lookup() to honor ATTR_NOACLCHECK for the ACE_EXECUTE access checks, and update smb_vop_lookup() to pass this flag when the user has PRIV_FILE_DAC_SEARCH.
2. Implement a new interface to search ksidlists for SIDs and posix IDs, and sort the list at the time it's created (for SMB, that's at authentication time) so that the search can be a binary search.
3. Update zfs_groupmember() to call zfs_fuid_find_by_idx() only once, like its sibling zfs_user_in_cred() (sketched below).
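As a rough illustration of change 3, the following user-space sketch hoists the loop-invariant domain lookup out of the membership loop. domain_for_index() is a hypothetical stand-in for zfs_fuid_find_by_idx(), which the old code called on every loop iteration with the same arguments; the group_t structure and all other names are illustrative, not the actual kernel types.

/*
 * Minimal sketch of hoisting a loop-invariant lookup out of the
 * membership loop (change 3).  In the kernel, the hoisted call is
 * zfs_fuid_find_by_idx(), a binary search of ZFS's known domain SIDs
 * performed under a reader lock.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Stand-in for the cached domain-SID table that a FUID index points into. */
static const char *domain_table[] = {
	"S-1-5-21-111-222-333",
	"S-1-5-21-444-555-666",
};

static const char *
domain_for_index(uint32_t idx)
{
	/* In the kernel this is a binary search under a reader lock. */
	return (domain_table[idx]);
}

typedef struct group {
	const char	*g_domain;
	uint32_t	g_rid;
} group_t;

/* After the change: resolve the ACE's domain once, then compare. */
static int
groupmember(uint32_t dom_idx, uint32_t rid, const group_t *groups, size_t n)
{
	const char *dom = domain_for_index(dom_idx);	/* hoisted call */

	for (size_t i = 0; i < n; i++) {
		if (groups[i].g_rid == rid &&
		    strcmp(groups[i].g_domain, dom) == 0)
			return (1);
	}
	return (0);
}

int
main(void)
{
	group_t groups[] = {
		{ "S-1-5-21-111-222-333", 513 },
		{ "S-1-5-21-444-555-666", 1105 },
	};

	printf("%d\n", groupmember(1, 1105, groups, 2));	/* prints 1 */
	return (0);
}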
With these changes, we saw an 80% average reduction in our cold test case.
The following represents the change in execution time we saw with these changes vs without the changes. The first set of percentages is for 'cold' runs, and the second is for 'warm' runs (run immediately after the cold run). The copy causes the server to process the number of entries under 'Processed Entries', spread out as described under the first column. The user had 400 groups, and used Windows Explorer to copy the top-level directory to a local directory.
In all cases, the user only has access to 1000 total files.
largedir_1000 contains 1000 directories, each with one file; the User has access to all files and directories.
largedir_21000 contains 21000 directories, each with one file; the User has access to 1000 directories, and all of the files in those directories.
largedir_41000 contains 41000 directories, each with one file; the User has access to 1000 directories, and all of the files in those directories.
hierarchy_200 contains 200 directories, each with 200 files; the User has access to 200 directories, and all of the files in 5 directories.
hierarchy_500 contains 500 directories, each with 200 files; the User has access to 500 directories, and all of the files in 5 directories.
hierarchy_500_spread contains 500 directories, each with 200 files; the User has access to 500 directories, and two files in each directory.
Each file has 25,000 bytes of ASCII characters. Additionally, I enabled ABE on the share, and added 1000 entries to the dataset's FUID table (using chmod to add arbitrary SIDs).
'Transfer Rate' (measured in KB/s) is shown as a percentage increase; the times are shown as percentage reductions.
                          Processed  ---------------- Cold ----------------  ---------------- Warm ----------------
Dataset                   Entries    Discovery  Copy     Total    Transfer   Discovery  Copy     Total    Transfer
                                     Time       Time     Time     Rate       Time       Time     Time     Rate
1000 directories          2000       83.80%     65.92%   72.65%   365.65%    87.52%     68.85%   75.85%   414.09%
21000 directories         22000      87.47%     68.07%   81.83%   550.46%    93.95%     66.10%   85.58%   693.50%
41000 directories         42000      87.17%     63.06%   82.62%   575.45%    94.95%     66.38%   89.31%   935.28%
200 directories           40200      84.10%     62.80%   77.34%   441.35%    88.50%     65.94%   81.26%   533.67%
500 directories           100500     84.17%     66.66%   80.85%   522.06%    88.42%     69.49%   84.71%   654.08%
500 directories (spread)  100500     84.12%     66.15%   80.60%   515.39%    88.41%     69.72%   84.77%   656.74%
Average                              84.89%     64.67%   80.35%   513.56%    90.07%     67.88%   85.01%   694.94%
Since these changes reduced the time spent enumerating the files to an acceptable level, no work was done to change z_acl_lock.
Updated by Matt Barden over 1 year ago
- File CreateLargeGroupUser.ps added
- File largedirectorytest.sh added
Updated by Matt Barden 12 months ago
Tested using the provided scripts to verify the performance change, as well as the new usr/src/test/os-tests/tests/ksid.c.
Updated by Gordon Ross 5 months ago
- Assignee changed from Matt Barden to Toomas Soome
Toomas is going to work on integrating this.
Updated by Gordon Ross 5 months ago
- Has duplicate Bug #15593: Import NEX-22861 CIFS performance issues around zfs_zaccess_aces_check added
Updated by Electric Monk 4 months ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit f37b3cbb6f67aaea5eec1c335bdc7bf432867d64
commit f37b3cbb6f67aaea5eec1c335bdc7bf432867d64
Author: Matt Barden <mbarden@tintri.com>
Date:   2023-05-25T15:20:43.000Z

14735 Users with a large group sidlist cause performance problems in SMB
Reviewed by: Evan Layton <elayton@tintri.com>
Reviewed by: Gordon Ross <gordon.ross@tintri.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@mnx.io>
Updated by Gordon Ross 4 months ago
For the record, the earliest instance of this commit appears to be:
https://github.com/Nexenta/illumos-nexenta/commit/5811ebc4a41351483fd13ac00e40cf9725611b62
commit 5811ebc4a41351483fd13ac00e40cf9725611b62
Author: Matt Barden <mbarden@tintri.com>
Date:   Tue Jul 21 11:55:14 2020 -0400

NEX-22861 CIFS performance issues around zfs_zaccess_aces_check
Includes AVL macros from MVF illumos@c4ab0d3f46036e85ad0700125c5a83cc139f55a3
10809 Performance optimization of AVL tree comparator functions
Reviewed by: Evan Layton <elayton@tintri.com>
Reviewed by: Gordon Ross <gordon.ross@tintri.com>