Bug #13138

open

Very bad SMB small file/request performance on i40e, made faster by iftop

Added by Florian Manschwetus about 3 years ago. Updated over 2 years ago.

Status: New
Priority: High
Assignee: -
Category: networking
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug:

Description

Copying a build tree (lots of small and mid-size files and only a few very large ones; about 20k files, close to 6 GB in total) takes:
  • NVMe with i40e
    • normal: close to 15 minutes
    • with iftop running during the small-file copy phases: 3.7 minutes
    • local cp -r from local tmpfs: 6 seconds
  • old Win2k12 system: 5 to 6 minutes
  • ESXi OI VM: roughly 5 minutes

The source is my Win10 workstation; the build output is stored on a ramdisk (imdisk).

Deleting these files again also takes much longer than on the other systems, but is likewise faster than all the others with iftop running.

However, iftop reduces the maximum throughput to less than half.


Files

mpstat.txt (6.39 KB) Florian Manschwetus, 2020-09-14 07:03 AM
mpstat-iftop.txt (5.11 KB) Florian Manschwetus, 2020-09-14 07:03 AM
send-wait-iftop.txt (6.88 KB) Florian Manschwetus, 2020-09-14 07:03 AM
send-wait.txt (4.19 KB) Florian Manschwetus, 2020-09-14 07:03 AM
sorecv-wait.txt (12.6 KB) Florian Manschwetus, 2020-09-14 07:03 AM
sorecv-wait-delete-scan.txt (8.75 KB) Florian Manschwetus, 2020-09-14 07:03 AM
sorecv-wait-delete-scan-iftop.txt (10.6 KB) Florian Manschwetus, 2020-09-14 07:03 AM
sorecv-wait-iftop.txt (23 KB) Florian Manschwetus, 2020-09-14 07:03 AM
genDataLoadZero.tbz (35.5 KB) tar bzip2 of generated dir tree Florian Manschwetus, 2020-09-28 09:43 AM
copy-item.ps1 (720 Bytes) PowerShell script defining functions to time Windows Explorer copy/remove operations Florian Manschwetus, 2020-09-28 09:46 AM
copyGenZeroTcpTP.pdf (1.75 MB) throughput, normal Florian Manschwetus, 2020-09-28 04:37 PM
copyGenZeroTcpdumpTcpTP.pdf (1.92 MB) throughput, tcpdump Florian Manschwetus, 2020-09-28 04:37 PM
copyGenZeroTcpdumpTcpRTT.pdf (679 KB) round-trip time, tcpdump Florian Manschwetus, 2020-09-28 04:37 PM
copyGenZeroTcpRTT.pdf (1.22 MB) round-trip time, normal Florian Manschwetus, 2020-09-28 04:37 PM
copyGenZeroTcpdumpPacketLengths.txt (1.95 KB) Florian Manschwetus, 2020-09-28 04:37 PM
copyGenZeroPacketLengths.txt (1.95 KB) Florian Manschwetus, 2020-09-28 04:37 PM
copyGenZeroTcpdumpIO.pdf (71.3 KB) Florian Manschwetus, 2020-09-28 04:37 PM
copyGenZeroIO.pdf (89.3 KB) Florian Manschwetus, 2020-09-28 04:37 PM
copyGenZeroDtraceSMBServicetimes_compare.txt (4.03 KB) internal RTT by dtrace smb2 Florian Manschwetus, 2020-09-30 09:31 AM
smb2-trace.d (2.83 KB) dtrace script used to get internal SMB2 RTT Florian Manschwetus, 2020-09-30 09:32 AM
copyGenZeroDtraceSMBServicetimes_compare2.txt (2.61 KB) dtrace-measured SMB2 service RTT, with the time gap between requests also measured Florian Manschwetus, 2020-09-30 11:00 AM
LAN Skizze.pdf (212 KB) Florian Manschwetus, 2020-09-30 03:42 PM
Actions #1

Updated by Joshua M. Clulow about 3 years ago

To be clear, you are:

  • building software on Windows 10 into a ramdisk on that machine, which produces about 20k files totalling around 6GB
  • copying all of that data to an SMB share of a ZFS file system on an illumos machine with an i40e NIC

Is that right? I have some follow-up questions:

  • Which distribution of illumos, and what version?
  • What tool are you using to perform the copy on the Windows machine?

The times you listed for the "Old Win2k12 system" and the "ESXi OI VM": are those different systems that are also acting as the SMB file server in this scenario?

Actions #2

Updated by Florian Manschwetus about 3 years ago

  • Yes, that is right.
  • Currently OpenIndiana (text/server); we may switch to OmniOS.
  • The default Windows file copy; I put together some PowerShell to be able to measure the time it takes.

The old Win2k12 machine is the system that is to be replaced, among other things, by the new ZFS system.
The ESXi OI VM is one I set up, using some so far unused resources of an ESXi host, to check for software/OS issues.

Actions #3

Updated by Joshua M. Clulow about 3 years ago

From earlier discussion in IRC it sounds like the OI file server with NVMe/i40e is not particularly busy on any CPU while the slow transfer is happening? (confirm with mpstat 1, say)

If it is not that busy, presumably the additional delay is coming in the form of something going to sleep. When you engage iftop presumably that has an impact on timing, or on the code paths being taken, which might be bringing things out of sleep more often and thus speeding up the small file case.

Looking in uts/common/fs/smbsrv/smb_net.c at the smb_net_send_uio() function I can see one place where we might go to sleep waiting to send data out. Conveniently there are SDT (DTrace) probes on either side of this sleep, so you could look at the sleep length with something like:

dtrace -n '
    sdt:smbsrv::send-wait-start { self->s = timestamp; }
    sdt:smbsrv::send-wait-done /self->s/ { @ = quantize(timestamp - self->s); self->s = 0; }
    tick-1s { printa(@); trunc(@); }'

You can do similar timing analysis between different call points; for instance, looking at the length of time spent blocked in smb_sorecv() might be an interesting start:

dtrace -n '
    smb_sorecv:entry { self->s = timestamp; }
    smb_sorecv:return /self->s/ { @ = quantize(timestamp - self->s); self->s = 0; }
    tick-1s { printa(@); trunc(@); }'

The DTrace sched provider may also be useful in determining where things are going to sleep, or what wakes them up. The wakeup probe, for instance, would allow you to collect stacks of what is waking other threads up.
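
As a minimal sketch of that idea (unfiltered; in practice you would probably restrict it, for instance on args[1]->pr_fname of the woken process), one could aggregate the kernel stacks of the waking threads:

dtrace -n '
    sched:::wakeup { @[stack()] = count(); }
    tick-10s { trunc(@, 20); printa(@); trunc(@); }'

This prints the twenty most frequent waker stacks every ten seconds, which should show which code paths are delivering the wakeups.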

Actions #4

Updated by Florian Manschwetus about 3 years ago

I ran some experiments with and without iftop during the phases where iftop leads to significant improvements.

Actions #5

Updated by Florian Manschwetus about 3 years ago

What about recreating this with smbclient inside a zone? That may allow switching between scenarios with and without i40e.

Actions #6

Updated by Florian Manschwetus about 3 years ago

I reduced the test data to 3944 files totaling 26.3 MB.

  • without iftop: 178.443 s
  • with iftop: 32.712 s
Actions #7

Updated by Florian Manschwetus about 3 years ago

Output of netstat -s -P tcp 300 (the second interval's output).

without iftop

TCP     tcpRtoAlgorithm     =     0     tcpRtoMin           =   400
        tcpRtoMax           = 60000     tcpMaxConn          =    -1
        tcpActiveOpens      =     6     tcpPassiveOpens     =     1
        tcpAttemptFails     =     0     tcpEstabResets      =     2
        tcpCurrEstab        =     7     tcpOutSegs          = 38654
        tcpOutDataSegs      = 31837     tcpOutDataBytes     =4744661
        tcpRetransSegs      =     0     tcpRetransBytes     =     0
        tcpOutAck           =  6756     tcpOutAckDelayed    =   210
        tcpOutUrg           =     0     tcpOutWinUpdate     =     0
        tcpOutWinProbe      =     0     tcpOutControl       =    12
        tcpOutRsts          =     0     tcpOutFastRetrans   =     0
        tcpInSegs           = 42955
        tcpInAckSegs        =     0     tcpInAckBytes       =4744555
        tcpInDupAck         =   393     tcpInAckUnsent      =     0
        tcpInInorderSegs    = 42447     tcpInInorderBytes   =23827513
        tcpInUnorderSegs    =     0     tcpInUnorderBytes   =     0
        tcpInDupSegs        =   393     tcpInDupBytes       =396905
        tcpInPartDupSegs    =     0     tcpInPartDupBytes   =     0
        tcpInPastWinSegs    =     0     tcpInPastWinBytes   =     0
        tcpInWinProbe       =     0     tcpInWinUpdate      =     0
        tcpInClosed         =     0     tcpRttNoUpdate      =     2
        tcpRttUpdate        = 31813     tcpTimRetrans       =     0
        tcpTimRetransDrop   =     0     tcpTimKeepalive     =     0
        tcpTimKeepaliveProbe=     0     tcpTimKeepaliveDrop =     0
        tcpListenDrop       =     0     tcpListenDropQ0     =     0
        tcpHalfOpenDrop     =     0     tcpOutSackRetrans   =     0

with iftop

TCP     tcpRtoAlgorithm     =     0     tcpRtoMin           =   400
        tcpRtoMax           = 60000     tcpMaxConn          =    -1
        tcpActiveOpens      =     1     tcpPassiveOpens     =     2
        tcpAttemptFails     =     0     tcpEstabResets      =     2
        tcpCurrEstab        =     7     tcpOutSegs          =114850
        tcpOutDataSegs      = 32189     tcpOutDataBytes     =4763543
        tcpRetransSegs      =     0     tcpRetransBytes     =     0
        tcpOutAck           = 82601     tcpOutAckDelayed    =     0
        tcpOutUrg           =     0     tcpOutWinUpdate     =     0
        tcpOutWinProbe      =     0     tcpOutControl       =     4
        tcpOutRsts          =     0     tcpOutFastRetrans   =     0
        tcpInSegs           = 86946
        tcpInAckSegs        =     0     tcpInAckBytes       =4763428
        tcpInDupAck         = 43417     tcpInAckUnsent      =     0
        tcpInInorderSegs    = 42381     tcpInInorderBytes   =23754460
        tcpInUnorderSegs    =     8     tcpInUnorderBytes   = 11680
        tcpInDupSegs        = 43580     tcpInDupBytes       =25307842
        tcpInPartDupSegs    =     0     tcpInPartDupBytes   =     0
        tcpInPastWinSegs    =     0     tcpInPastWinBytes   =     0
        tcpInWinProbe       =     0     tcpInWinUpdate      =     0
        tcpInClosed         =     0     tcpRttNoUpdate      =     3
        tcpRttUpdate        = 32159     tcpTimRetrans       =     0
        tcpTimRetransDrop   =     0     tcpTimKeepalive     =     0
        tcpTimKeepaliveProbe=     0     tcpTimKeepaliveDrop =     0
        tcpListenDrop       =     0     tcpListenDropQ0     =     0
        tcpHalfOpenDrop     =     0     tcpOutSackRetrans   =     0

Actions #8

Updated by Florian Manschwetus about 3 years ago

I have to add two facts:
1. tcpdump also "fixes" the issue, so it can't be used directly to investigate on the server side.
2. NFS (v4) seems to be unaffected, maybe due to a different traffic pattern or something else.

Actions #9

Updated by Florian Manschwetus about 3 years ago

OK, I created a set of zero-filled files with artificial names that mimic the original data (a sketch of how such a tree can be generated follows the timings below).
  • normal: TotalSeconds : 154.1989154
  • with tcpdump: TotalSeconds : 27.5101907
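
For illustration, a tree of this kind could be generated on the illumos side with mkfile; the attached genDataLoadZero.tbz contains the actual data, and the directory layout, file count, sizes, and target path below are purely illustrative assumptions:

# roughly 4000 files of 8k each, spread over 40 directories
cd /tank/testshare    # hypothetical target path
for d in $(seq 1 40); do
    mkdir -p "dir$d"
    for f in $(seq 1 100); do
        mkfile 8k "dir$d/file$f.obj"    # mkfile writes a zero-filled file
    done
done
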
Actions #10

Updated by Florian Manschwetus about 3 years ago

I have also created a small PowerShell script, which can be sourced in a PowerShell session, to time the copy and remove operations as performed by Windows Explorer.

Actions #11

Updated by Florian Manschwetus about 3 years ago

I captured the traffic on the client side with Wireshark for analysis.

SMB2 servicetimes
  • without tcpdump running on server side
    ===========================================================================
    SMB2 Service Response Time Statistics - copyGenZero.pcapng:
    Procedure          Index  Calls  Min SRT (s)  Max SRT (s)  Avg SRT (s)  Sum SRT (s)
    ---------------------------------------------------------------------------
    SMB2 
    Close              6   5367     0.000154     0.016469     0.000416     2.234292
    Create             5   5369     0.000217     0.016052     0.001024     5.497396
    Find              14     44     0.000178     0.002430     0.000776     0.034153
    GetInfo           16  10558     0.000140     0.016761     0.000414     4.367313
    Ioctl             11      6     0.000358     0.001136     0.000620     0.003718
    KeepAlive         13      7     0.000384     0.000711     0.000448     0.003137
    Negotiate Protocol 0      1     0.000200     0.000200     0.000200     0.000200
    Read               8      1     0.000228     0.000228     0.000228     0.000228
    Session Logoff     2      1     0.001525     0.001525     0.001525     0.001525
    Session Setup      1      2     0.006513     0.021861     0.014187     0.028374
    SetInfo           17   7888     0.000151     0.016451     0.000483     3.810506
    Tree Connect       3      2     0.000211     0.000427     0.000319     0.000638
    Tree Disconnect    4      2     0.000481     0.000507     0.000494     0.000988
    Write              9   3326     0.000234     0.383245     0.033221   110.493686
    SMB2 
    ---------------------------------------------------------------------------
    
  • with tcpdump running on server side
    ===========================================================================
    SMB2 Service Response Time Statistics - copyGenZeroTcpdump.pcapng:
    Procedure           Index  Calls  Min SRT (s)  Max SRT (s)  Avg SRT (s)  Sum SRT (s)
    ---------------------------------------------------------------------------
    SMB2 
    Close               6   5328     0.000110     0.030047     0.000304     1.620219
    Create              5   5330     0.000160     0.015628     0.000826     4.400416
    Find               14     18     0.000218     0.003466     0.000943     0.016978
    GetInfo            16  10508     0.000119     0.015660     0.000294     3.090754
    Ioctl              11      6     0.000149     0.000689     0.000275     0.001649
    KeepAlive          13      2     0.000176     0.009396     0.004786     0.009572
    Negotiate Protocol  0      1     0.000202     0.000202     0.000202     0.000202
    Read                8      1     0.000197     0.000197     0.000197     0.000197
    Session Logoff      2      1     0.001055     0.001055     0.001055     0.001055
    Session Setup       1      2     0.006869     0.023717     0.015293     0.030586
    SetInfo            17   7888     0.000112     0.016304     0.000338     2.663702
    Tree Connect        3      2     0.000232     0.000256     0.000244     0.000488
    Tree Disconnect     4      2     0.000357     0.000922     0.000639     0.001279
    Write               9   3326     0.000180     0.017290     0.000588     1.956407
    SMB2 
    ---------------------------------------------------------------------------
    
Actions #13

Updated by Florian Manschwetus about 3 years ago

I modified smb2-trace.d to measure the service RTT internally, similar to what Wireshark offered on the client side.
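
The attached smb2-trace.d is the authoritative script. As a rough sketch of the approach only, one could time the SMB2 dispatch path with fbt, assuming smb2sr_work() (see uts/common/fs/smbsrv/smb2_dispatch.c) is the per-request dispatch function on this build:

dtrace -n '
    fbt::smb2sr_work:entry { self->s = timestamp; }
    fbt::smb2sr_work:return /self->s/ {
        @["smb2 dispatch time (ns)"] = quantize(timestamp - self->s);
        self->s = 0;
    }
    tick-5s { printa(@); trunc(@); }'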

Actions #14

Updated by Florian Manschwetus about 3 years ago

Quick update: I found that running snoop, as well as snoop -P (non-promiscuous mode), also "fixes" this problem.
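
For reference, an invocation of the kind meant here might look as follows (the link name i40e0 and the port filter are assumptions). It captures in non-promiscuous mode and discards the packets, yet still changes the timing:

snoop -P -d i40e0 -o /dev/null port 445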

Actions #15

Updated by Florian Manschwetus about 3 years ago

I enhanced my DTrace RTT measurement to also measure the gap time between SMB2 requests; this time I used snoop.
If there is interest, I can also attach the modified DTrace script.

Overall time, as timed in PowerShell:
  • normal: 138.624117 s
  • with snoop: 27.6608591 s
Actions #16

Updated by Florian Manschwetus about 3 years ago

As requested, here is a sketch of our network; aggregated ports use LACP as shown.
The ESXi host connects twice at 40G, for two independent SR-IOV groups (CX4 VPI EDR).

Actions #17

Updated by Florian Manschwetus about 3 years ago

Today I found that with the 40G variant, snoop and the like do not make the problem go away, so I was able to capture traffic on both ends of the link simultaneously. The analysis shows that the server-side measured SMB RTTs were as expected, while the client side still shows the problem under investigation. So it has to be concluded that the delay happens after TX is captured on the server and/or before RX is captured on the client.
Paul mentioned _rx_intr_throttle, which has a clear and significant influence on the matter.

Currently I am running some experiments to get a clearer picture.
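
For reference, the current value of such a private link property can be inspected with dladm (the link name i40e0 is an assumption):

dladm show-linkprop -p _rx_intr_throttle i40e0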

Actions #18

Updated by Florian Manschwetus about 3 years ago

It turned out that several tuning parameters were a bad fit; I now get reasonable transfer times for my samples.
I modified the following (a sketch of how these settings might be applied follows the list):
  • /etc/i40e.conf
    • rx_limit_per_intr=4096
    • rx_ring_size=4096
    • tx_ring_size=4096
  • linkprops
    • _tx_intr_throttle=110
    • _rx_intr_throttle=160
    • _rx_dma_threshold=156
    • _tx_dma_threshold=400
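
A sketch of how these settings might be applied (the link name i40e0 is an assumption, and /etc/i40e.conf changes only take effect after a driver reload or reboot):

# /etc/i40e.conf additions (driver.conf syntax):
#   rx_limit_per_intr=4096;
#   rx_ring_size=4096;
#   tx_ring_size=4096;

# private link properties, set via dladm:
dladm set-linkprop -p _tx_intr_throttle=110 i40e0
dladm set-linkprop -p _rx_intr_throttle=160 i40e0
dladm set-linkprop -p _rx_dma_threshold=156 i40e0
dladm set-linkprop -p _tx_dma_threshold=400 i40e0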

It may require some more study to see whether these values can or should be adopted as general defaults, or improved further.
Someone else will also have to decide whether anything here still points to a potential problem in the driver code.

I also have to point out that I still achieved better transfer times by manually toggling iftop on and off during the transfer, so it is still not perfect. I'll try to retest this.

Actions #19

Updated by Gordon Ross over 2 years ago

  • Category set to networking