Bug #5295

remove maxburst logic from TCP's send algorithm

Added by Dan McDonald almost 3 years ago. Updated almost 3 years ago.

Status: Closed
Start date: 2014-11-06
Priority: Normal
Due date:
Assignee: Dan McDonald
% Done:
Target version: -
Difficulty: Medium
Tags: needs-triage


This has been lying around in Delphix's distro for a while (see https://github.com/delphix/delphix-os/commit/86e502910bb74d1b60b08940d03230aa514504ac ).

Here's Sebastien Roy's analysis:

Here's some analysis that led me to the TCP send burst issue I described in person. This might make a good blog post for someone interested in these details. My favorite part of this investigation was finding the same atrocity in FreeBSD surrounded by #if 0. :-)

On a customer system, we observed that NFS read throughput was anemic. We also observed that even though LSO was enabled, we never transmitted more than 7K at a time.

The first step was to eliminate two obvious possibilities:
A) The application was only transmitting 7K at a time and we never had any data queued up
B) We constantly have a full window and cannot transmit more than 7K at a time

I quickly eliminated A by using dtrace to aggregate message sizes passed into tcp_output(), and eliminated B by using dtrace to aggregate the head-room in the pipe for each TCP send. By "head-room", I mean the usable part of the window, which is: (min(swnd, cwnd) - (snxt - suna)).

swnd - send window size (our peer's receive window size)
cwnd - the congestion window size
snxt - the next available sequence number for transmit
suna - the earliest unacknowledged sequence number we've transmitted

This boils down to (effective window size) - (unacknowledged data). The resulting histograms showed us that we usually had hundreds of kilobytes of head-room.
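As a rough illustration of the head-room formula, here is a minimal C sketch. The function name tcp_headroom() is hypothetical and is not taken from the illumos source; it only restates (min(swnd, cwnd) - (snxt - suna)) using the four values defined above, assuming 32-bit sequence numbers.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the illumos implementation): computes the
 * "head-room" from the four values named in the analysis.
 */
static uint32_t
tcp_headroom(uint32_t swnd, uint32_t cwnd, uint32_t snxt, uint32_t suna)
{
	/* Effective window: the smaller of the send and congestion windows. */
	uint32_t window = (swnd < cwnd) ? swnd : cwnd;
	/* Unacknowledged bytes; unsigned subtraction tolerates wraparound. */
	uint32_t inflight = snxt - suna;

	/* If more is in flight than the window allows, nothing is usable. */
	return (inflight >= window ? 0 : window - inflight);
}
```

With a 1MB send window, a 256KB congestion window, and 4000 bytes in flight, this yields 262144 - 4000 = 258144 bytes of head-room, which is the order of magnitude the histograms showed.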

With both possibilities eliminated, I dove into the source code to home in on every piece of logic from tcp_output() down to tcp_send_data() that is related to calculating the number of bytes to transmit. Since we constantly had unsent data in the queue, the logic for determining how much data to send (let’s call that “length”), intuitively speaking, _should_ have looked something like this:

1. Start with length = min(usable, unsent), where <usable> is the “head-room” I mention above, and <unsent> is the amount of unsent data in the queue
2. Is LSO being used?
2.1 If no: length = min(length, mss), where <mss> is the maximum segment size, based on the path MTU (~1500 bytes)
2.2 If yes: length = min(length, 64KB), where 64KB is the maximum amount of data that can be transmitted in a single IP packet

Given that in our case, LSO was enabled, and we had 100s of KB of unsent data, and usable was also 100s of KB, length should have been 64KB. It obviously wasn’t.
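The intuitive outline above can be written down as a small C sketch. Everything here is illustrative: expected_send_len() and LSO_MAX_BYTES are hypothetical names, and 65535 is an assumed stand-in for the "64KB" limit mentioned in the outline.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for the "64KB" single-IP-packet limit. */
#define	LSO_MAX_BYTES	65535u

/*
 * Hypothetical sketch of the *expected* logic: start from
 * min(usable, unsent), then cap by mss (no LSO) or by 64KB (LSO).
 */
static uint32_t
expected_send_len(uint32_t usable, uint32_t unsent, uint32_t mss, bool lso)
{
	uint32_t len = (usable < unsent) ? usable : unsent;
	uint32_t cap = lso ? LSO_MAX_BYTES : mss;

	return (len < cap ? len : cap);
}
```

With LSO enabled and hundreds of KB both usable and unsent, this sketch returns the full 64KB cap, which is what we expected to see on the wire and did not.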

Tracing the logic down into tcp_send(), I ran into this bit of logic:

        int             num_burst_seg = tcp->tcp_snd_burst;
                /*
                 * Calculate the maximum payload length we can send at one
                 * time.
                 */
                if (do_lso_send) {
                        /*
                         * Check whether be able to to do LSO for the current
                         * available data.
                         */
                        if (num_burst_seg >= 2 && (*usable - 1) / mss >= 1) {
                                lso_usable = MIN(tcp->tcp_lso_max, *usable);
                                lso_usable = MIN(lso_usable,
                                    num_burst_seg * mss);

Here, usable and mss have the same semantics as in my outline above, and tcp->tcp_lso_max is 64KB, as above. Interestingly, we also take this “num_burst_seg” value into account. Some quick dtrace confirmed that num_burst_seg was set to the magical value “5”, and 5 * mss == ~7K.
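To make the arithmetic concrete, the clamp in the excerpt can be isolated into a standalone C sketch. The function lso_usable_sketch() is hypothetical, returning 0 when the LSO condition fails is a convention of this sketch only, and the mss of 1460 used in the note below is an assumption (a 1500-byte MTU minus 40 bytes of IPv4 and TCP headers, no options).

```c
#define	MIN(a, b)	((a) < (b) ? (a) : (b))

/*
 * Hypothetical standalone version of the clamp in the excerpt.
 * Returns 0 when the LSO condition is not met (sketch convention only).
 */
static int
lso_usable_sketch(int usable, int lso_max, int mss, int num_burst_seg)
{
	if (num_burst_seg >= 2 && (usable - 1) / mss >= 1) {
		int lso_usable = MIN(lso_max, usable);

		/* The burst limit is the clamp that produces the ~7K sends. */
		return (MIN(lso_usable, num_burst_seg * mss));
	}
	return (0);
}
```

With usable = 300000, lso_max = 65535, and the assumed mss of 1460, a num_burst_seg of 5 gives 7300 bytes, while a num_burst_seg of 65535 gives the full 65535.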

That explains why we’re only sending 7K at a time. The next question is, why is num_burst_seg set to 5, and why do we only see the 7K transmits when we’re sending to off-link hosts? We can answer that by looking at where tcp_snd_burst gets set:

It is initially set to “infinity” when the connection is established:

tcp->tcp_snd_burst = TCP_CWND_INFINITE;

After the connection is established, tcp_snd_burst gets adjusted using this bit of copy-and-pasted logic whenever we do a TCP retransmission:

                        if (tcp->tcp_rexmit) {
                                tcp->tcp_snd_burst = tcp->tcp_localnet ?
                                    TCP_CWND_INFINITE : TCP_CWND_NORMAL;

The constants above are:

/* TCP cwnd burst factor. */
#define TCP_CWND_INFINITE       65535
#define TCP_CWND_SS             3
#define TCP_CWND_NORMAL         5

Ah, now we’re getting somewhere, as this explains why we only see this for off-link destinations. tcp_localnet is set upon connect() or accept() (when we figure out the destination address) based on whether or not the destination is on the same subnet as the local address for the connection.

Interestingly, tcp_snd_burst is never increased again after it has been ratcheted down to 5, so the very first time we retransmit any TCP segment when sending off-link, the connection gets permanently penalized with an effective congestion window equal to 5 * mss, even though the real congestion window for the connection increases based on the congestion control algorithm. This logic is at least outdated, and at most completely busted.
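The one-way ratchet can be modelled in a few lines of C. This is only a sketch of the behaviour described above: struct conn_sketch and both functions are hypothetical stand-ins for the relevant tcp_t fields and code paths, and only the two constants are taken from the source.

```c
#include <stdbool.h>

#define	TCP_CWND_INFINITE	65535
#define	TCP_CWND_NORMAL		5

/* Hypothetical stand-in for the two tcp_t fields involved. */
struct conn_sketch {
	int	snd_burst;
	bool	localnet;
};

/* Connection setup: the burst limit starts out at "infinity". */
static void
conn_established(struct conn_sketch *c, bool localnet)
{
	c->localnet = localnet;
	c->snd_burst = TCP_CWND_INFINITE;
}

/* Retransmission: off-link connections drop to 5 and nothing raises it. */
static void
conn_retransmit(struct conn_sketch *c)
{
	c->snd_burst = c->localnet ? TCP_CWND_INFINITE : TCP_CWND_NORMAL;
}
```

In this model there is no function that raises snd_burst again, which mirrors the observation that a single retransmission to an off-link peer caps the connection at 5 segments per send for the rest of its life.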

To finally connect the 7K transmit issue with low throughput, we ran a long-running iperf and dynamically adjusted the value of tcp_localnet on the connection:

> ::netstat ! grep 5001
ffffff0310e2bb80  0      0    0
> ffffff0310e2bb80::print tcp_t tcp_localnet
tcp_localnet = 0
> ffffff0310e2bb80::print -a tcp_t tcp_localnet
ffffff0310e2bc55.1 tcp_localnet = 0
> ffffff0310e2bc55/v 0x52
0xffffff0310e2bc55:             0x50    =       0x52

0x52 was used because it is the bitwise OR of the existing byte at that address (0x50) with the tcp_localnet bit set. Immediately after making this change, we saw two things:
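The byte arithmetic behind that write can be checked with a tiny C sketch. The mask is an assumption: it reads the ".1" suffix in the ::print -a output as bit offset 1, which is at least consistent with 0x50 becoming 0x52. The names are hypothetical.

```c
#include <stdint.h>

/* Assumption: tcp_localnet sits at bit offset 1 of that byte. */
#define	LOCALNET_BIT_MASK	(1u << 1)

/* Returns the flags byte with the (assumed) tcp_localnet bit set. */
static uint8_t
set_localnet_bit(uint8_t flags_byte)
{
	return ((uint8_t)(flags_byte | LOCALNET_BIT_MASK));
}
```

Applied to the existing value 0x50 this gives 0x52, and applying it again leaves 0x52 unchanged, so the other bits sharing the byte are preserved.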

1. we were sending full 64KB packets to the NIC
2. throughput went up from < 100MB/s to > 1GB/s, confirmed by running dlstat -s 1:

      vmxnet3s0    7.22K  476.18K    7.63K   55.71M
      vmxnet3s0    6.04K  398.63K    6.42K   46.90M
      vmxnet3s0    5.17K  341.06K    5.45K   39.80M
      vmxnet3s0    6.58K  434.51K    7.00K   51.10M
      vmxnet3s0    9.81K  647.73K   10.35K   75.60M
      vmxnet3s0   11.37K  750.86K   12.06K   88.08M
      vmxnet3s0   14.55K  968.85K   14.60K  644.02M  <— set tcp_localnet here
      vmxnet3s0   23.07K    1.52M   23.10K    1.08G
      vmxnet3s0   24.06K    1.59M   24.04K    1.12G
      vmxnet3s0   23.89K    1.58M   23.90K    1.11G
      vmxnet3s0   23.68K    1.56M   23.68K    1.10G
      vmxnet3s0   23.44K    1.55M   23.44K    1.09G
      vmxnet3s0   24.33K    1.61M   24.36K    1.13G

Interestingly, our OS wasn't the first to introduce the concept of a maximum send burst. FreeBSD also had such a thing, and when NewReno came around, someone did the following to it (from FreeBSD's tcp_output()):

#if 0
        /*
         * This completely breaks TCP if newreno is turned on.  What happens
         * is that if delayed-acks are turned on on the receiver, this code
         * on the transmitter effectively destroys the TCP window, forcing
         * it to four packets (1.5Kx4 = 6K window).
         */
        if (sendalot && --maxburst)
                goto again;
#endif

Their maxburst parameter was set to 4, and ours is set to 5. Their observation was that the TCP window was artificially limited to 6K, and we've just observed that ours is artificially limited to 7K (but only when the destination is off-link).
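For readers unfamiliar with the `--maxburst` idiom, here is a hedged C sketch of how such a loop bounds the number of segments sent per call. It is not FreeBSD's code: segments_per_burst() is a hypothetical model in which each pass sends one segment and "sendalot" simply means more segments remain queued.

```c
/*
 * Hypothetical model of an "if (sendalot && --maxburst) goto again;"
 * loop: counts the segments sent in one call.
 */
static int
segments_per_burst(int maxburst, int queued_segments)
{
	int sent = 0;
	int sendalot;

	if (queued_segments <= 0)
		return (0);

	do {
		sent++;				/* transmit one segment */
		queued_segments--;
		sendalot = (queued_segments > 0);
	} while (sendalot && --maxburst);	/* stop once maxburst hits 0 */

	return (sent);
}
```

With plenty of data queued, a maxburst of 4 sends 4 segments (4 x 1.5K = 6K, matching the FreeBSD comment), and a limit of 5 sends 5 segments (~7K), matching what we observed.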


#1 Updated by Electric Monk almost 3 years ago

  • % Done changed from 80 to 100
  • Status changed from In Progress to Closed

git commit 633fc3a6eed35d918db16925b7048d7a2e28064a

commit  633fc3a6eed35d918db16925b7048d7a2e28064a
Author: Sebastien Roy <sebastien.roy@delphix.com>
Date:   2014-11-21T22:25:13.000Z

    5295 remove maxburst logic from TCP's send algorithm
    Reviewed by: Christopher Siden <christopher.siden@delphix.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Eric Diven <eric.diven@delphix.com>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Approved by: Gordon Ross <gordon.ross@nexenta.com>
