Bug #5295
remove maxburst logic from TCP's send algorithm
Status: Closed (100% done)
Description
This has been lying around in Delphix's distro for a while (see https://github.com/delphix/delphix-os/commit/86e502910bb74d1b60b08940d03230aa514504ac).
Here's Sebastien Roy's analysis:
Here's some analysis that led me to the TCP send burst issue I described in person. This might make a good blog post for someone interested in these details. My favorite part of this investigation was finding the same atrocity in FreeBSD surrounded by #if 0. :-)
On a customer system, we observed that NFS read throughput was anemic. We also observed that even though LSO was enabled, we never transmitted more than 7K at a time.
The first step was to eliminate two obvious possibilities:
A) The application was only transmitting 7K at a time and we never had any data queued up
B) We constantly had a full window and could not transmit more than 7K at a time
I quickly eliminated A by using dtrace to aggregate message sizes passed into tcp_output(), and eliminated B by using dtrace to aggregate the head-room in the pipe for each TCP send. By "head-room", I mean the usable part of the window, which is: (min(swnd, cwnd) - (snxt - suna)).
swnd - send window size (our peer's receive window size)
cwnd - the congestion window size
snxt - the next available sequence number for transmit
suna - the earliest unacknowledged sequence number we've transmitted
This boils down to (effective window size) - (unacknowledged data). The resulting histograms showed us that we usually had 100s of kilobytes of head-room.
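For reference, here is a minimal C sketch of that head-room calculation (my own illustration, not the illumos source); the parameter names mirror the tcp_t fields described above:

#include <stdint.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))

/*
 * Usable "head-room" in the pipe: the effective window (the smaller of the
 * peer's receive window and our congestion window) minus the data we have
 * already sent but not yet had acknowledged.
 */
static uint32_t
tcp_headroom(uint32_t swnd, uint32_t cwnd, uint32_t snxt, uint32_t suna)
{
	uint32_t window = MIN(swnd, cwnd);
	uint32_t unacked = snxt - suna;

	return (window > unacked ? window - unacked : 0);
}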
With both of these eliminated, I dove into the source code to zero in on every piece of logic from tcp_output() down to tcp_send_data() related to calculating the number of bytes to transmit. Since we constantly had unsent data in the queue, the logic for determining how much data to send (let's call that "length"), intuitively speaking, _should_ have looked something like this:
1. Start with length = min(usable, unsent), where <usable> is the “head-room” I mention above, and <unsent> is the amount of unsent data in the queue
2. Is LSO being used?
2.1 If no: length = min(length, mss) - <mss> is the maximum segment size, based on the path MTU (~1500 bytes)
2.2 If yes: length = min(length, 64KB) - 64KB is the maximum amount of data that can be transmitted in a single IP packet
Given that in our case LSO was enabled, we had 100s of KB of unsent data, and usable was also 100s of KB, length should have been 64KB. It obviously wasn't.
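Here is a rough sketch of that intended calculation (again mine, not the actual tcp_send() code); LSO_MAX is a stand-in for tcp->tcp_lso_max (~64KB):

#include <stdint.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))
#define	LSO_MAX		65535	/* stand-in for tcp->tcp_lso_max */

/*
 * How much to hand to the NIC in one send: start with the smaller of the
 * head-room and the unsent queue, then clamp to one MSS (no LSO) or to the
 * LSO maximum (LSO enabled).  Note there is no burst limit here.
 */
static uint32_t
tcp_send_length(uint32_t usable, uint32_t unsent, uint32_t mss,
    int do_lso_send)
{
	uint32_t length = MIN(usable, unsent);

	if (do_lso_send)
		return (MIN(length, LSO_MAX));
	return (MIN(length, mss));
}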
Tracing the logic down into tcp_send(), I ran into this bit of logic:
…
	int		num_burst_seg = tcp->tcp_snd_burst;
…
	/*
	 * Calculate the maximum payload length we can send at one
	 * time.
	 */
	if (do_lso_send) {
		/*
		 * Check whether be able to to do LSO for the current
		 * available data.
		 */
		if (num_burst_seg >= 2 && (*usable - 1) / mss >= 1) {
			lso_usable = MIN(tcp->tcp_lso_max, *usable);
			lso_usable = MIN(lso_usable, num_burst_seg * mss);
Here, usable and mss have the same semantics as in my outline above, and tcp->tcp_lso_max is 64KB, as above. Interestingly, we also take this "num_burst_seg" value into account. Some quick dtrace confirmed that num_burst_seg was set to the magical value "5", and 5 * mss == ~7K.
That explains why we’re only sending 7K at a time. The next question is, why is num_burst_seg set to 5, and why do we only see the 7K transmits when we’re sending to off-link hosts? We can answer that by looking at where tcp_snd_burst gets set:
It is initially set to “infinity” when the connection is established:
tcp->tcp_snd_burst = TCP_CWND_INFINITE;
After the connection is established, tcp_snd_burst gets adjusted using this bit of copy-and-pasted logic whenever we do a TCP retransmission:
	if (tcp->tcp_rexmit) {
		…
		tcp->tcp_snd_burst = tcp->tcp_localnet ?
		    TCP_CWND_INFINITE : TCP_CWND_NORMAL;
		…
The constants above are:
/* TCP cwnd burst factor. */
#define	TCP_CWND_INFINITE	65535
#define	TCP_CWND_SS		3
#define	TCP_CWND_NORMAL		5
Ah, now we’re getting somewhere, as this explains why we only see this for off-link destinations. tcp_localnet is set upon connect() or accept() (when we figure out the destination address) based on whether or not the destination is on the same subnet as the local address for the connection.
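As a rough illustration (a hypothetical helper, not the code that actually sets tcp_localnet), the on-link test for IPv4 amounts to comparing the destination and local addresses under the subnet mask:

#include <netinet/in.h>

/*
 * Hypothetical helper: returns nonzero if dst is on the same IPv4 subnet as
 * our local address, i.e. the destination is "on-link" for this connection.
 */
static int
is_local_net(in_addr_t local, in_addr_t dst, in_addr_t netmask)
{
	return ((local & netmask) == (dst & netmask));
}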
Interestingly, tcp_snd_burst is never increased again after it has been ratcheted down to 5, so the very first time we retransmit any TCP segment when sending off-link, the connection gets permanently penalized with an effective congestion window equal to 5 * mss, even though the real congestion window for the connection increases based on the congestion control algorithm. This logic is at least outdated, and at most completely busted.
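To put numbers on that penalty, here is a toy program (my own sketch mirroring the clamp quoted earlier, with assumed values for mss and head-room):

#include <stdio.h>
#include <stdint.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))

int
main(void)
{
	uint32_t mss = 1460;		/* typical Ethernet-path MSS (assumed) */
	uint32_t usable = 300 * 1024;	/* assume 100s of KB of head-room */
	uint32_t lso_max = 65535;	/* tcp_lso_max */
	uint32_t num_burst_seg = 5;	/* TCP_CWND_NORMAL after a retransmit */

	uint32_t lso_usable = MIN(lso_max, usable);
	lso_usable = MIN(lso_usable, num_burst_seg * mss);

	/* Prints 7300: every LSO send is capped at ~7K from here on. */
	(void) printf("%u\n", lso_usable);
	return (0);
}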
To finally connect the 7K transmit issue with low throughput, we ran a long running iperf, and dynamically adjusted the value of tcp_localnet on the connection:
> ::netstat ! grep 5001
ffffff0310e2bb80 0 10.100.3.46.36077 10.100.1.70.5001 0 0
> ffffff0310e2bb80::print tcp_t tcp_localnet
tcp_localnet = 0
> ffffff0310e2bb80::print -a tcp_t tcp_localnet
ffffff0310e2bc55.1 tcp_localnet = 0
> ffffff0310e2bc55/v 0x52
0xffffff0310e2bc55: 0x50 = 0x52
0x52 was used because it is the logical OR of the existing byte at that address and the tcp_localnet bit (0x50 | 0x2 == 0x52). Immediately after making this change, we saw two things:
1. we were sending full 64KB packets to the NIC
2. throughput went up from < 100MB/s to > 1GB/s, confirmed by running dlstat -s 1:
vmxnet3s0    7.22K  476.18K    7.63K   55.71M
vmxnet3s0    6.04K  398.63K    6.42K   46.90M
vmxnet3s0    5.17K  341.06K    5.45K   39.80M
vmxnet3s0    6.58K  434.51K    7.00K   51.10M
vmxnet3s0    9.81K  647.73K   10.35K   75.60M
vmxnet3s0   11.37K  750.86K   12.06K   88.08M
vmxnet3s0   14.55K  968.85K   14.60K  644.02M   <— set tcp_localnet here
vmxnet3s0   23.07K    1.52M   23.10K    1.08G
vmxnet3s0   24.06K    1.59M   24.04K    1.12G
vmxnet3s0   23.89K    1.58M   23.90K    1.11G
vmxnet3s0   23.68K    1.56M   23.68K    1.10G
vmxnet3s0   23.44K    1.55M   23.44K    1.09G
vmxnet3s0   24.33K    1.61M   24.36K    1.13G
Interestingly, our OS wasn't the first to introduce the concept of a maximum send burst. FreeBSD also had such a thing, and when New Reno came around, someone did the following to it (from FreeBSD's tcp_output()):
#if 0
	/*
	 * This completely breaks TCP if newreno is turned on. What happens
	 * is that if delayed-acks are turned on on the receiver, this code
	 * on the transmitter effectively destroys the TCP window, forcing
	 * it to four packets (1.5Kx4 = 6K window).
	 */
	if (sendalot && --maxburst)
		goto again;
#endif
Their maxburst parameter was set to 4, and ours is set to 5. Their observation was that the TCP window was artificially limited to 6K, and we've just observed that ours is artificially limited to 7K (but only when the destination is off-link).
Updated by Electric Monk over 8 years ago
- Status changed from In Progress to Closed
- % Done changed from 80 to 100
git commit 633fc3a6eed35d918db16925b7048d7a2e28064a
commit 633fc3a6eed35d918db16925b7048d7a2e28064a
Author: Sebastien Roy <sebastien.roy@delphix.com>
Date:   2014-11-21T22:25:13.000Z

    5295 remove maxburst logic from TCP's send algorithm
    Reviewed by: Christopher Siden <christopher.siden@delphix.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Eric Diven <eric.diven@delphix.com>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Approved by: Gordon Ross <gordon.ross@nexenta.com>