Bug #13138
Very bad SMB small file/request performance on i40e, made faster by iftop
Status: open
Description
- nvme with i40e
- normal: close to 15 minutes
- with iftop during small files copy phases: 3.7 minutes
- local cp -r from local tmpfs: 6 seconds
- Old Win2k12 system: 5 to 6 Minutes
- ESXi OI VM: roughly 5 Minutes
The source is my Win10 workstation; the build output is stored on a ramdisk (imdisk).
Deleting these files again also takes much longer than on other systems, but is likewise faster than on all the others when iftop is running.
However, iftop reduces the maximum throughput to less than half.
Files
Updated by Joshua M. Clulow about 3 years ago
To be clear, you are:
- building software on Windows 10 into a ramdisk on that machine, which produces about 20k files of 6GB in size
- copying all of that data to an SMB share of a ZFS file system on an illumos machine with an i40e NIC
Is that right? I have some follow-up questions:
- Which distribution of illumos, and what version?
- What tool are you using to perform the copy on the Windows machine?
The times you have listed for the "Old Win2k12 system" and the "ESXi OI VM", those are different systems that are also acting as the SMB file server in this scenario?
Updated by Florian Manschwetus about 3 years ago
- Yes that is right.
- Currently OpenIndiana (text/server); maybe switch to OmniOS
- Default Windows file copy; I put together some PowerShell to be able to measure the time it takes
The old Win2k12 is the system that should be replaced, among other things, by the new ZFS system.
I set up the ESXi OI VM, using some so-far-unused resources of an ESXi host, to check for software/OS issues.
Updated by Joshua M. Clulow about 3 years ago
From earlier discussion in IRC it sounds like the OI file server with NVMe/i40e is not particularly busy on any CPU while the slow transfer is happening? (Confirm with mpstat 1, say.)
If it is not that busy, presumably the additional delay is coming in the form of something going to sleep. When you engage iftop, presumably that has an impact on timing, or on the code paths being taken, which might be bringing things out of sleep more often and thus speeding up the small file case.
Looking in uts/common/fs/smbsrv/smb_net.c at the smb_net_send_uio() function, I can see one place where we might go to sleep waiting to send data out. Conveniently there are USDT (DTrace) probes on either side of this sleep, so you could look at the sleep length with something like:

dtrace -n '
    sdt:smbsrv::send-wait-start { self->s = timestamp; }
    sdt:smbsrv::send-wait-done /self->s/ {
        @ = quantize(timestamp - self->s);
        self->s = 0;
    }
    tick-1s { printa(@); trunc(@); }'
You can do similar timing analysis between different call points; e.g., looking at the length of time spent blocked in smb_sorecv() might be an interesting start:

dtrace -n '
    smb_sorecv:entry { self->s = timestamp; }
    smb_sorecv:return {
        @ = quantize(timestamp - self->s);
        self->s = 0;
    }
    tick-1s { printa(@); trunc(@); }'
The DTrace sched provider may also be useful in determining where things are going to sleep, or what wakes them up. The wakeup probe, for instance, would allow you to collect stacks of what is waking other threads up.
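As a sketch of that suggestion (probe arguments per the sched provider: args[1] is the psinfo of the process being woken; the smbd predicate is an assumption — much of the illumos SMB data path runs in kernel threads, so you may want to drop the predicate and look at all wakeups):

```shell
# Aggregate stacks of whoever is waking up threads of the smbd process.
# Run on the illumos file server; ^C or the 10s tick ends the trace.
dtrace -n '
    sched:::wakeup
    /stringof(args[1]->pr_fname) == "smbd"/
    {
        @[stack()] = count();
    }
    tick-10s { exit(0); }'
```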
Updated by Florian Manschwetus about 3 years ago
- File mpstat.txt added
- File mpstat-iftop.txt added
- File send-wait.txt added
- File send-wait-iftop.txt added
- File sorecv-wait.txt added
- File sorecv-wait-delete-scan.txt added
- File sorecv-wait-delete-scan-iftop.txt added
- File sorecv-wait-iftop.txt added
I made some experiments with and without iftop during the phases where iftop leads to significant improvements.
Updated by Florian Manschwetus about 3 years ago
What about recreating this with smbclient inside a zone? That may allow switching between scenarios with and without i40e.
Updated by Florian Manschwetus about 3 years ago
I reduced the test data to 3944 files totalling 26.3 MB.
- without iftop: 178.443 s
- with iftop: 32.712 s
Updated by Florian Manschwetus about 3 years ago
Output of netstat -s -P tcp 300; the second interval's output is shown.
without iftop
TCP     tcpRtoAlgorithm     =     0     tcpRtoMin           =   400
        tcpRtoMax           = 60000     tcpMaxConn          =    -1
        tcpActiveOpens      =     6     tcpPassiveOpens     =     1
        tcpAttemptFails     =     0     tcpEstabResets      =     2
        tcpCurrEstab        =     7     tcpOutSegs          = 38654
        tcpOutDataSegs      = 31837     tcpOutDataBytes     =4744661
        tcpRetransSegs      =     0     tcpRetransBytes     =     0
        tcpOutAck           =  6756     tcpOutAckDelayed    =   210
        tcpOutUrg           =     0     tcpOutWinUpdate     =     0
        tcpOutWinProbe      =     0     tcpOutControl       =    12
        tcpOutRsts          =     0     tcpOutFastRetrans   =     0
        tcpInSegs           = 42955     tcpInAckSegs        =     0
        tcpInAckBytes       =4744555    tcpInDupAck         =   393
        tcpInAckUnsent      =     0     tcpInInorderSegs    = 42447
        tcpInInorderBytes   =23827513   tcpInUnorderSegs    =     0
        tcpInUnorderBytes   =     0     tcpInDupSegs        =   393
        tcpInDupBytes       =396905     tcpInPartDupSegs    =     0
        tcpInPartDupBytes   =     0     tcpInPastWinSegs    =     0
        tcpInPastWinBytes   =     0     tcpInWinProbe       =     0
        tcpInWinUpdate      =     0     tcpInClosed         =     0
        tcpRttNoUpdate      =     2     tcpRttUpdate        = 31813
        tcpTimRetrans       =     0     tcpTimRetransDrop   =     0
        tcpTimKeepalive     =     0     tcpTimKeepaliveProbe=     0
        tcpTimKeepaliveDrop =     0     tcpListenDrop       =     0
        tcpListenDropQ0     =     0     tcpHalfOpenDrop     =     0
        tcpOutSackRetrans   =     0
with iftop
TCP     tcpRtoAlgorithm     =     0     tcpRtoMin           =   400
        tcpRtoMax           = 60000     tcpMaxConn          =    -1
        tcpActiveOpens      =     1     tcpPassiveOpens     =     2
        tcpAttemptFails     =     0     tcpEstabResets      =     2
        tcpCurrEstab        =     7     tcpOutSegs          =114850
        tcpOutDataSegs      = 32189     tcpOutDataBytes     =4763543
        tcpRetransSegs      =     0     tcpRetransBytes     =     0
        tcpOutAck           = 82601     tcpOutAckDelayed    =     0
        tcpOutUrg           =     0     tcpOutWinUpdate     =     0
        tcpOutWinProbe      =     0     tcpOutControl       =     4
        tcpOutRsts          =     0     tcpOutFastRetrans   =     0
        tcpInSegs           = 86946     tcpInAckSegs        =     0
        tcpInAckBytes       =4763428    tcpInDupAck         = 43417
        tcpInAckUnsent      =     0     tcpInInorderSegs    = 42381
        tcpInInorderBytes   =23754460   tcpInUnorderSegs    =     8
        tcpInUnorderBytes   = 11680     tcpInDupSegs        = 43580
        tcpInDupBytes       =25307842   tcpInPartDupSegs    =     0
        tcpInPartDupBytes   =     0     tcpInPastWinSegs    =     0
        tcpInPastWinBytes   =     0     tcpInWinProbe       =     0
        tcpInWinUpdate      =     0     tcpInClosed         =     0
        tcpRttNoUpdate      =     3     tcpRttUpdate        = 32159
        tcpTimRetrans       =     0     tcpTimRetransDrop   =     0
        tcpTimKeepalive     =     0     tcpTimKeepaliveProbe=     0
        tcpTimKeepaliveDrop =     0     tcpListenDrop       =     0
        tcpListenDropQ0     =     0     tcpHalfOpenDrop     =     0
        tcpOutSackRetrans   =     0
Updated by Florian Manschwetus about 3 years ago
I have to add two facts:
1. tcpdump also "fixes" the issue, so it can't be used directly to check on the server side
2. NFS (v4) seems to be unaffected, maybe due to a different traffic pattern or something else
Updated by Florian Manschwetus about 3 years ago
- File genDataLoadZero.tbz added
- normal: TotalSeconds : 154.1989154
- with tcpdump: TotalSeconds : 27.5101907
Updated by Florian Manschwetus about 3 years ago
- File copy-item.ps1 added
I have also created a small PowerShell script, which can be sourced in a PowerShell session, to time the copy and remove operations as performed by Windows Explorer.
Updated by Florian Manschwetus about 3 years ago
I collected the traffic on the client side with Wireshark for analysis.
SMB2 service times - without tcpdump running on the server side:
===========================================================================
SMB2 Service Response Time Statistics - copyGenZero.pcapng:
Index  Procedure           Calls  Min SRT (s)  Max SRT (s)  Avg SRT (s)  Sum SRT (s)
---------------------------------------------------------------------------
    6  Close                5367     0.000154     0.016469     0.000416     2.234292
    5  Create               5369     0.000217     0.016052     0.001024     5.497396
   14  Find                   44     0.000178     0.002430     0.000776     0.034153
   16  GetInfo             10558     0.000140     0.016761     0.000414     4.367313
   11  Ioctl                   6     0.000358     0.001136     0.000620     0.003718
   13  KeepAlive               7     0.000384     0.000711     0.000448     0.003137
    0  Negotiate Protocol      1     0.000200     0.000200     0.000200     0.000200
    8  Read                    1     0.000228     0.000228     0.000228     0.000228
    2  Session Logoff          1     0.001525     0.001525     0.001525     0.001525
    1  Session Setup           2     0.006513     0.021861     0.014187     0.028374
   17  SetInfo              7888     0.000151     0.016451     0.000483     3.810506
    3  Tree Connect            2     0.000211     0.000427     0.000319     0.000638
    4  Tree Disconnect         2     0.000481     0.000507     0.000494     0.000988
    9  Write                3326     0.000234     0.383245     0.033221   110.493686
---------------------------------------------------------------------------
- with tcpdump running on server side
===========================================================================
SMB2 Service Response Time Statistics - copyGenZeroTcpdump.pcapng:
Index  Procedure           Calls  Min SRT (s)  Max SRT (s)  Avg SRT (s)  Sum SRT (s)
---------------------------------------------------------------------------
    6  Close                5328     0.000110     0.030047     0.000304     1.620219
    5  Create               5330     0.000160     0.015628     0.000826     4.400416
   14  Find                   18     0.000218     0.003466     0.000943     0.016978
   16  GetInfo             10508     0.000119     0.015660     0.000294     3.090754
   11  Ioctl                   6     0.000149     0.000689     0.000275     0.001649
   13  KeepAlive               2     0.000176     0.009396     0.004786     0.009572
    0  Negotiate Protocol      1     0.000202     0.000202     0.000202     0.000202
    8  Read                    1     0.000197     0.000197     0.000197     0.000197
    2  Session Logoff          1     0.001055     0.001055     0.001055     0.001055
    1  Session Setup           2     0.006869     0.023717     0.015293     0.030586
   17  SetInfo              7888     0.000112     0.016304     0.000338     2.663702
    3  Tree Connect            2     0.000232     0.000256     0.000244     0.000488
    4  Tree Disconnect         2     0.000357     0.000922     0.000639     0.001279
    9  Write                3326     0.000180     0.017290     0.000588     1.956407
---------------------------------------------------------------------------
Updated by Florian Manschwetus about 3 years ago
- File copyGenZeroTcpTP.pdf added
- File copyGenZeroTcpdumpTcpTP.pdf added
- File copyGenZeroTcpRTT.pdf added
- File copyGenZeroTcpdumpTcpRTT.pdf added
- File copyGenZeroTcpdumpPacketLengths.txt added
- File copyGenZeroPacketLengths.txt added
- File copyGenZeroTcpdumpIO.pdf added
- File copyGenZeroIO.pdf added
I analysed the traffic I collected; here are some things that I think are interesting.
Updated by Florian Manschwetus about 3 years ago
- File copyGenZeroDtraceSMBServicetimes_compare.txt added
- File smb2-trace.d added
I modified smb2-trace.d to measure the service RTT internally, similar to what Wireshark offered on the client side.
Updated by Florian Manschwetus about 3 years ago
Hi, quick update: I found that running snoop, as well as snoop -P, also "fixes" this problem.
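Since both plain snoop (promiscuous) and snoop -P (non-promiscuous) mask the problem, promiscuous mode itself does not seem to be the trigger; it appears to be the mere presence of an open capture on the NIC. A minimal sketch of that comparison (the link name i40e0 is an assumption; adjust to the actual interface):

```shell
# Promiscuous capture, discarding the packets:
snoop -d i40e0 -o /dev/null &

# Non-promiscuous capture (-P) - if this also masks the problem, the
# relevant side effect is having a capture open, not promiscuous mode:
snoop -P -d i40e0 -o /dev/null &
```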
Updated by Florian Manschwetus about 3 years ago
- File copyGenZeroDtraceSMBServicetimes_compare2.txt added
I enhanced my DTrace RTT measurement to also measure the gap time between the SMB2 requests; this time I used snoop.
If interested, I can also attach the modified DTrace script.
- normal: 138.624117 s
- with snoop: 27.6608591 s
Updated by Florian Manschwetus about 3 years ago
- File LAN Skizze.pdf added
As it was asked for, here is a sketch of our network; aggregated ports use LACP as shown.
The ESXi connects twice to 40G, for two independent SR-IOV groups (CX4 VPI EDR).
Updated by Florian Manschwetus about 3 years ago
Today I found that with the 40G variant, snoop and the like do not make the problem go away, so I was able to capture traffic on both ends of the link simultaneously. The analysis shows that the server-side-measured SMB RTTs were as expected, while the client side still shows the problem under investigation. So it has to be concluded that the delay happens after the TX was captured on the server and/or before the RX is captured.
Paul mentioned _rx_intr_throttle, which has a clear and significant influence on the matter.
Currently I am doing some experiments to get a clearer picture.
Updated by Florian Manschwetus about 3 years ago
I modified
- /etc/i40e.conf
- rx_limit_per_intr=4096
- rx_ring_size=4096
- tx_ring_size=4096
- linkprops
- _tx_intr_throttle=110
- _rx_intr_throttle=160
- _rx_dma_threshold=156
- _tx_dma_threshold=400
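For reference, a sketch of how such settings might be applied (the link name i40e0 is an assumption; the i40e.conf entries take effect on driver reload or reboot, while the underscore-prefixed private link properties are set at runtime with dladm):

```shell
# /etc/i40e.conf additions (driver-global; reboot or driver reload needed):
#   rx_limit_per_intr=4096;
#   rx_ring_size=4096;
#   tx_ring_size=4096;

# Private link properties, applied per link with dladm:
dladm set-linkprop -p _tx_intr_throttle=110 i40e0
dladm set-linkprop -p _rx_intr_throttle=160 i40e0
dladm set-linkprop -p _rx_dma_threshold=156 i40e0
dladm set-linkprop -p _tx_dma_threshold=400 i40e0

# Verify the current values:
dladm show-linkprop -p _rx_intr_throttle,_tx_intr_throttle i40e0
```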
It might require some more study to see whether these settings can/should be adopted in general, or can be improved further.
Someone else also has to decide whether anything here still points to a potential problem in the driver code.
I also have to point out that I have still achieved better transfer times by manually turning iftop on/off during the transfer, so it is still not perfect. I'll try to retest this.