Bug #14735


Users with a large group sidlist cause performance problems in SMB

Added by Matt Barden 25 days ago. Updated 25 days ago.

smb - SMB server and client
Start date:
Due date:
% Done:


Estimated time:
Gerrit CR:


This issue was reported to us when a user copying 17MB of data across 733 files took over a minute just to 'find' the files to copy. This account had over 300 SIDs in its group list. When the user copied this same dataset with a local 'admin' user (over SMB) with 4 SIDs in its group list, the problem disappeared. The share in question has 'Access Based Enumeration' enabled.

Flamegraphs showed us that 97.3% of samples occurred in zfs_zaccess_aces_check() - between 13,000 and 14,000 calls per second - with many examples in zfs_groupmember() or mutex_enter/exit (on z_acl_lock). This occurs in two main places: smb_vop_lookup() and smb_vop_readdir(). The access check in smb_vop_lookup() comes from zfs_lookup() attempting to check that the user has ACE_EXECUTE access (SMB can't make use of DNLC due to case-insensitive lookups, and so almost never uses the 'fast path'). The check in smb_vop_readdir() comes from zfs_readdir() when ABE is enabled on the share.

Several things come to light when looking at these access checks:

1. In SMB, by default, users are always granted PRIV_FILE_DAC_SEARCH to emulate 'SeBypassTraverseChecking', which is granted to all users in Windows. This means they will always be granted ACE_EXECUTE on directories. However, the privilege check only occurs after searching the ACL fails to grant ACE_EXECUTE access. In essence, this is a wasteful check by default in SMB - we'll know at the SMB level whether we need to perform that check, but currently have no way of informing ZFS of this (zfs_lookup() discards ATTR_NOACLCHECK).
2. zfs_groupmember() and zfs_user_in_cred() linearly search the list of SIDs, but perform binary search of posix IDs (in groupmember()). When the ACE doesn't apply to the user, we fruitlessly search the entire sidlist; when this list is large, this can take an extremely long time. If an ACL is long, but doesn't grant the user any access, the entire sidlist is searched for each entry, which can multiply the wasted time.
3. zfs_groupmember() calls zfs_fuid_find_by_idx() on each iteration of the loop with the same arguments each time. That function performs a binary search (under a reader lock) of all domain SIDs known to ZFS (which is generally pretty small). zfs_user_in_cred(), on the other hand, only calls it once.
4. The entirety of zfs_zaccess_aces_check() occurs while the z_acl_lock mutex is held. When the ACL check takes too long (for example, if there are a very large number of groups or if it needs to contact idmap or ldap for some reason), this can cause significant contention for other requests that want to access the object. This is despite the fact that access checking is nearly always a read-only operation, with the one exception being the first time the ACL is read (it needs to be read from disk and then cached in the znode).

To resolve this, we made the following changes:
1. Update zfs_lookup() to honor ATTR_NOACLCHECK for the ACE_EXECUTE access checks, and update smb_vop_lookup() to pass this flag when the user has PRIV_FILE_DAC_SEARCH.
2. Implement a new interface to search ksidlists for SIDs and posix IDs, then sort the list at the time it's created (for SMB that's at authentication time), so that we can implement binary search for the list.
3. Update zfs_groupmember() to only call zfs_fuid_find_by_idx() once, like its sibling zfs_user_in_cred().

With these changes, we saw an 80% average reduction in our cold test case.

The following represents the change in execution time we saw with these changes vs without the changes. The first set of percentages is for 'cold' runs, and the second is for 'warm' runs (ran immediately after the cold run). The copy causes the server to process the number of entries under 'Processed Entries', spread out as described under the first column. The user had 400 groups, and used Windows Explorer to copy the top-level directory to a local directory.

In all cases, the user only has access to 1000 total files.
largedir_1000 contains 1000 directories, each with one file; the User has access to all files and directories.
largedir_21000 contains 21000 directories, each with one file; the User has access to 1000 directories, and all of the files in those directories.
largedir_41000 contains 41000 directories, each with one file; the User has access to 1000 directories, and all of the files in those directories.
hierarchy_200 contains 200 directories, each with 200 files; the User has access to 200 directories, and all of the files in 5 directories.
hierarchy_500 contains 500 directories, each with 200 files; the User has access to 500 directories, and all of the files in 5 directories.
hierarchy_500_spread contains 500 directories, each with 200 files; the User has access to 500 directories, and two files in each directory.

Each file has 25,000 bytes of ASCII characters. Additionally, I enabled ABE on the share, and added 1000 entries to the dataset's FUID table (using chmod to add arbitrary SIDs).

'Transfer rate' is the percentage increase; the times are percentage reduction.

                    Processed Entries        Discovery Time    Copy Time    Total time    Transfer Rate (KB/s)    Discovery Time    Copy Time    Total time    Transfer Rate
1000 directories        2000                83.80%            65.92%        72.65%    365.65%                87.52%            68.85%        75.85%    414.09%
21000 directories        22000                87.47%            68.07%        81.83%    550.46%                93.95%            66.10%        85.58%    693.50%
41000 directories        42000                87.17%            63.06%        82.62%    575.45%                94.95%            66.38%        89.31%    935.28%
200 directories        40200                84.10%            62.80%        77.34%    441.35%                88.50%            65.94%        81.26%    533.67%
500 directories        100500                84.17%            66.66%        80.85%    522.06%                88.42%            69.49%        84.71%    654.08%
500 directories (spread)    100500                84.12%            66.15%        80.60%    515.39%                88.41%            69.72%        84.77%    656.74%
Average                                    84.89%            64.67%        80.35%    513.56%                90.07%            67.88%        85.01%    694.94%

Since these changes reduced the time spent enumerating the files to an acceptable level, no work was done to change z_acl_lock.

Files (2.6 KB) sets up the test directory Matt Barden, 2022-06-10 11:13 PM (342 Bytes) Creates a user with 400 groups (and the associated groups) Matt Barden, 2022-06-10 11:13 PM

Also available in: Atom PDF