Feature #7062

Connections remain in TIME_WAIT too long

Added by Robert Mustacchi over 4 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Category: networking
Start date: 2016-06-06
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

We had a customer who tuned their TCP TIME_WAIT interval to 20 seconds (20000 ms) by running:

ndd -set /dev/tcp tcp_time_wait_interval 20000

However, when they run the following command:

netstat -naf inet -P tcp

they see connections sitting in TIME_WAIT for longer than the 20-second interval they set. It is possible to set tcp_time_wait_interval within the local zone, but changing the setting has not reduced the amount of time TCP connections spend in TIME_WAIT.


Looking at the zone in question, I was able to observe the described behavior.

I first confirmed via mdb that tcp_time_wait_interval was set correctly in the zone's netstack.
Using this DTrace script, I then gathered data on the actual lifetimes of connections in the TIME_WAIT state:

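/* Record when a connection in the zone's netstack (stack ID 598) is queued. */
tcp_time_wait_append:entry
/args[0]->tcp_tcps->tcps_netstack->netstack_stackid == 598/
{
        tw[arg0] = timestamp;
}

/* On removal, compute how long the connection stayed queued. */
tcp_time_wait_remove:entry
/tw[arg0]/
{
        /* timestamp is in nanoseconds; convert the delta to milliseconds. */
        this->time = (timestamp - tw[arg0]) / 1000000;
        tw[arg0] = 0;
        /* Linear histogram from 5 s to 60 s in 5 s steps. */
        @nsq = lquantize(this->time, 5000, 60000, 5000);
}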

Despite the correct setting, a number of connections linger longer than the 10-15 seconds that would be expected (values are in milliseconds):

           value  ------------- Distribution ------------- count
          < 5000 |                                         7
            5000 |                                         1
           10000 |@                                        226
           15000 |                                         82
           20000 |@@                                       278
           25000 |@@@@@@@@@@@@                             1985
           30000 |@@@@@@@@@@@@@@@@@@@                      3193
           35000 |@@                                       371
           40000 |@@                                       352
           45000 |@                                        248
           50000 |@                                        110
           55000 |                                         0

Assuming that the timer driving calls to tcp_time_wait_collector fires at the expected 5-second interval, I went looking for another explanation. I believe the cause is how connections are appended to the TIME_WAIT queue for later cleanup. The queue is maintained per squeue, so a single queue can hold connections from different netstacks with differing tcp_time_wait_interval values. Given the current logic used to walk the list, a connection with a short interval has to wait behind one with a long interval.
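
The head-of-line blocking described above can be illustrated with a small simulation. This is only a sketch, not the kernel code: it assumes a single shared FIFO, a collector that runs every 5 seconds and stops at the first unexpired entry, and a made-up workload of two connections (the names "slow-zone" and "fast-zone" and the 60-second interval are hypothetical).

```python
from collections import deque

COLLECTOR_PERIOD_MS = 5000  # collector timer period mentioned in the analysis


def simulate_fifo(arrivals, end_ms=120_000):
    """Return {conn_id: time spent queued in ms} for one shared FIFO.

    arrivals: list of (conn_id, enqueue_ms, interval_ms), sorted by enqueue_ms.
    Assumed model: the collector walks from the head of the list and stops
    at the first entry that has not yet expired.
    """
    queue = deque()
    pending = deque(arrivals)
    lifetimes = {}
    for now in range(0, end_ms + 1, COLLECTOR_PERIOD_MS):
        # Append everything that entered TIME_WAIT up to this tick.
        while pending and pending[0][1] <= now:
            queue.append(pending.popleft())
        # Reap from the head only; an unexpired head blocks what is behind it.
        while queue and queue[0][1] + queue[0][2] <= now:
            conn_id, enqueued, _ = queue.popleft()
            lifetimes[conn_id] = now - enqueued
    return lifetimes


# Hypothetical workload: one tenant keeps a 60 s interval, another uses 20 s.
arrivals = [
    ("slow-zone", 0, 60_000),
    ("fast-zone", 1_000, 20_000),
]
result = simulate_fifo(arrivals)
print(result)  # {'slow-zone': 60000, 'fast-zone': 59000}
```

In this model the 20-second connection is not reaped until the 60-second connection ahead of it expires, so it lingers for 59 seconds instead of being collected at the 25-second tick.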

To prevent tenants with longer tcp_time_wait_interval values from delaying connection cleanup for those with shorter values, the connection traversal logic needs to be restructured.
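
One way to picture such a restructuring is to order connections by expiry time rather than by arrival. The sketch below is purely illustrative and is not the change that was committed for this issue: it assumes a min-heap keyed on expiry, the same 5-second collector period, and a hypothetical two-connection workload ("slow-zone" and "fast-zone" are made-up names).

```python
import heapq

COLLECTOR_PERIOD_MS = 5000  # collector timer period mentioned in the analysis


def simulate_by_expiry(arrivals, end_ms=120_000):
    """Return {conn_id: time spent queued in ms} with expiry-ordered reaping.

    arrivals: list of (conn_id, enqueue_ms, interval_ms).
    Assumed model: connections sit in a min-heap keyed on expiry time, so
    the collector always sees the soonest-expiring connection first.
    """
    pending = sorted(arrivals, key=lambda a: a[1])
    idx = 0
    heap = []
    lifetimes = {}
    for now in range(0, end_ms + 1, COLLECTOR_PERIOD_MS):
        # Queue everything that entered TIME_WAIT up to this tick.
        while idx < len(pending) and pending[idx][1] <= now:
            conn_id, enqueued, interval = pending[idx]
            heapq.heappush(heap, (enqueued + interval, conn_id, enqueued))
            idx += 1
        # Reap every connection whose expiry time has passed.
        while heap and heap[0][0] <= now:
            _, conn_id, enqueued = heapq.heappop(heap)
            lifetimes[conn_id] = now - enqueued
    return lifetimes


# Hypothetical workload: one tenant keeps a 60 s interval, another uses 20 s.
arrivals = [
    ("slow-zone", 0, 60_000),
    ("fast-zone", 1_000, 20_000),
]
result = simulate_by_expiry(arrivals)
print(result)  # {'fast-zone': 24000, 'slow-zone': 60000}
```

Under these assumptions the 20-second connection is collected at the first tick after it expires, regardless of the longer-lived connection queued ahead of it.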

#1

Updated by Electric Monk over 4 years ago

  • Status changed from New to Closed

git commit 2404c9e6b54f427b32dd0a2d46940d6a4c5299bc

commit  2404c9e6b54f427b32dd0a2d46940d6a4c5299bc
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2016-06-09T20:31:42.000Z

    7062 Connections remain in TIME_WAIT too long
    7061 local TCP connections should be expediently purged from TIME_WAIT
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Approved by: Dan McDonald <danmcd@omniti.com>
