Bug #11815
FSCTL_SRV_COPYCHUNK is horrendously slow in smb server
Description
A user recently complained to me that a massive directory copy by an application was horrendously slow. Using smb_req_time.d, I was able to find that the vast majority of smb2_ioctl calls (FSCTL_SRV_COPYCHUNK in particular) were taking anywhere from 300-500ms, while the actual VFS operations observed by the OS were somewhere between 0-50ms. I've posted a flamegraph of the kernel stacks, though it only shows on-CPU time, so if we're being held up by locks or something else in the network path, it won't be that helpful. I also might be able to send some packet captures (but definitely in private, not on here).
The user was able to copy the same directory to their local file system in a fraction of the time (15 minutes as opposed to 14 hours).
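For anyone who wants to reproduce the measurement: the gist of what smb_req_time.d does can be approximated with a minimal fbt-based sketch like the one below. This is just an illustration, not the actual script, and it assumes fbt can instrument an smb2_ioctl function in the smbsrv module.

#!/usr/sbin/dtrace -s
/*
 * Rough sketch: time each smb2_ioctl dispatch and print a latency
 * distribution (in nanoseconds) when the script is interrupted.
 * Illustrative only -- not the real smb_req_time.d.
 */

fbt:smbsrv:smb2_ioctl:entry
{
	self->ts = timestamp;
}

fbt:smbsrv:smb2_ioctl:return
/self->ts/
{
	@lat["smb2_ioctl latency (ns)"] = quantize(timestamp - self->ts);
	self->ts = 0;
}

Run it while a copy is in progress and interrupt it with ^C to print the distribution.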
Updated by Adam Stylinski over 3 years ago
- File smb_fnc_times.d added
I have a bit more refinement on this issue. It seems that the time is largely spent in the smb_vop_ioctl function (or at least in child calls that have been inlined into it). I wrote a dtrace script to measure function calls within the smbsrv module whose call time exceeded some threshold, which I set to 1ms. There were a couple of normal calls in the vicinity of 1-2ms, and even a couple of outliers in the hundreds of ms, but it seems the vast majority of the time was in smb_vop_ioctl.
Here's the output:
root@archive:~# ./smb_fnc_times.d 1
^C
totals

  smb_session_xprt_gethdr
           value  ------------- Distribution ------------- count
          262144 |                                         0
          524288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 5
         1048576 |                                         0

  smb_vop_read
           value  ------------- Distribution ------------- count
          262144 |                                         0
          524288 |@@@@@                                    1
         1048576 |@@@@@@@@@@@@@@@@@@@@@@@@@                5
         2097152 |@@@@@@@@@@                               2
         4194304 |                                         0

  smb_vop_ioctl
           value  ------------- Distribution ------------- count
       134217728 |                                         0
       268435456 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         15
       536870912 |@@@@@@@@                                 4
      1073741824 |                                         0

  smb_sorecv
           value  ------------- Distribution ------------- count
          262144 |                                         0
          524288 |@                                        58
         1048576 |@@@@@@@@@@@@@@@@@@@@                     911
         2097152 |@@@@@@@@@@@@                             553
         4194304 |@@@@                                     164
         8388608 |@                                        28
        16777216 |                                         17
        33554432 |@                                        25
        67108864 |                                         14
       134217728 |                                         18
       268435456 |                                         8
       536870912 |                                         0
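For anyone reading along without the attachment, the gist of the script is something like the sketch below (a simplified reconstruction from memory, not the attached smb_fnc_times.d verbatim; the argument is the threshold in milliseconds):

#!/usr/sbin/dtrace -s
/*
 * Simplified sketch of the idea: aggregate a latency distribution (ns)
 * for every smbsrv function call whose elapsed time exceeds $1 ms.
 * Run as ./smb_fnc_times.d <threshold_ms>.
 */

fbt:smbsrv::entry
{
	self->ts[probefunc] = timestamp;
}

/* Keep only the slow calls, keyed by function name. */
fbt:smbsrv::return
/self->ts[probefunc] && (timestamp - self->ts[probefunc]) > $1 * 1000000/
{
	@totals[probefunc] = quantize(timestamp - self->ts[probefunc]);
}

/* Always clear the saved entry timestamp. */
fbt:smbsrv::return
/self->ts[probefunc]/
{
	self->ts[probefunc] = 0;
}

Instrumenting every smbsrv function this way adds noticeable overhead, so the absolute numbers should be taken with a grain of salt.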
Updated by Gordon Ross over 3 years ago
Can you provide a snoop or wireshark capture of the network traffic while this copy happens?
Please restart either the SMB server or the client before starting the capture and running the test so we see the initial connection from the client in the capture.
After a look at the flamegraph, I think that system might be "thrashing" the page layer.
Could you re-try that test with this in /etc/system:
set smbsrv:smb2_copychunk_max_seg = 0x20000
If the client is using ODX, you may also want:
set smbsrv:smb2_odx_buf_size = 0x20000
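If you want to confirm the values actually took effect after the reboot (and assuming the tunables are module-scoped integers named exactly as above), a quick DTrace one-liner can read them out of the running kernel:

dtrace -qn 'BEGIN { printf("copychunk_max_seg=0x%x odx_buf_size=0x%x\n", smbsrv`smb2_copychunk_max_seg, smbsrv`smb2_odx_buf_size); exit(0); }'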
Updated by Adam Stylinski over 3 years ago
I sent you a private email with a packet capture. It was taken without the settings you mentioned above - though I should tell you that a large number of these files are small tifs, mostly under 500kb in size. If that max_seg parameter does what I think it does, I don't think it'll help.
Updated by Adam Stylinski over 3 years ago
So after a full day of chasing this down, it would seem that, at least in part, an update to the Windows 10 SMB client may have created this bug.
I don't have any Windows Server hosts to compare this behavior against; however, after chatting with Gordon for a bit and giving him a few packet captures, it seems there are a lot of SMB IOCTLs coming from the client that don't make any sense.
The suggested fix that I've found in a few forums now, to remedy one of the changes in the 1803 update, is to set a few registry settings that adjust how long-lived the client-side directory cache is (setting this to 0). Applying this seems to improve the copy times tremendously. Why this does anything, and whether or not this was broken as far back as 1803, is hard to say, but I can confirm the fix Windows users are suggesting is positively affecting this behavior.
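For reference, the values involved are most likely the SMB client-side cache lifetimes Microsoft documents for the workstation service; I'm listing them from memory, so treat the exact names as an assumption rather than gospel:

Under HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters (all REG_DWORD):
    DirectoryCacheLifetime = 0
    FileInfoCacheLifetime = 0
    FileNotFoundCacheLifetime = 0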
This also might explain why the flame graph isn't showing a ton of time on CPU for these copies. If the client isn't issuing the ioctls that drive these copies very quickly, smbsrv is going to sit idle waiting for the next request.
At the very least, this bug should be renamed, as I'm almost certain now that the title is misleading. It's only related to the ODX feature insofar as that's how I was able to trigger it.
Updated by Adam Stylinski over 3 years ago
So there's a little bit more to this bug. Those registry settings lessen the impact of the new SMB client from Microsoft, but not in a way that mitigates this completely. In Windows Explorer, the estimated copy time for the root of a very densely packed tree of files went from 1 day to 1-3 hours. However, the rate at which files were being copied still largely seemed to be about 1 file per second for tiny files.
The application, which didn't require any directory discovery (as it had an index stored in a database), copied at a rate of 1-1.5 files per second. The directory was 40,000-50,000 files, so this would have been ~14 hours. I'm now not entirely sure the estimated time from Explorer was ever going to be a reliable metric.
As a side note, trying this same exact application on Windows 7 took the expected 30 minutes or so. What's even weirder is that forcing SMB 2.1 as the maximum protocol didn't make this issue go away, either. It seems that there's something really strange going on with Microsoft's SMB client. Those who have relationships with Microsoft (maybe a player such as Nexenta?) should definitely leverage this to get a fix or explanation from them.
It's beginning to look like the issue falls mostly on Microsoft right now, but I can't say with any confidence that the smbd implementation isn't at least part of the equation here.
Updated by Adam Stylinski over 3 years ago
A new update:
It appears Windows Defender Real-Time Protection was the main culprit for this. I'm not sure why we weren't measuring a ton of time in the usual processes (the usual symptom of this behavior), but disabling it / adding exclusions for this workload restores full performance now.
We've always had this enabled, so it seems this is the first time we've actually been impacted by it.
Updated by Gordon Ross over 3 years ago
- Status changed from New to Resolved
Assuming this is (just) a Windows problem, unless information to the contrary comes along.