Bug #7209

tcp_time_wait_collector() collecting all the CPU time

Added by Robert Mustacchi about 4 years ago. Updated about 4 years ago.

Status: Closed
Priority: High
Category: networking
Start date: 2016-07-21
Due date:
% Done: 100%
Estimated time:
Difficulty: Hard
Tags:
Gerrit CR:

Description

So, we have a system that was NMI'd. Let's take a look at some of the
basics of the system:

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   16325304             63770   16%
Boot pages                  68308               266    0%
ZFS File Data            71502056            279304   71%
Anon                      9460825             36956    9%
Exec and libs               85009               332    0%
Page cache                 149122               582    0%
Free (cachelist)           133458               521    0%
Free (freelist)           2895251             11309    3%

Total                   100619333            393044
Physical                100619332            393044

Wow, so that's a lot of memory owned by ZFS and, while there's a bit in the
free list, not much. We have a bunch of node processes and others, so let's
take a look at what some of them are doing. First off, a lot of the threads
are onproc and others are stuck in mutex_vector_enter(), particularly from
tcp_get_conn(). Here's a representative stack:

> ffffff06986a5460::findstack -v
stack pointer for thread ffffff06986a5460: fffffe6378f25a30
[ fffffe6378f25a30 _resume_from_idle+0x112() ]
  fffffe6378f25a60 swtch+0x141()
  fffffe6378f25b00 turnstile_block+0x21a(0, 0, fffffef4ace55000, fffffffffbc08aa0, 0, 0)
  fffffe6378f25b70 mutex_vector_enter+0x3a3(fffffef4ace55000)
  fffffe6378f25bb0 tcp_get_conn+0x35(fffffef4acdc2680, fffffef966df9000)
  fffffe6378f25c20 tcp_create_common+0x94(fffffef459cff8a0, 0, 1, fffffe6378f25cbc)
  fffffe6378f25c80 tcp_create+0x7a(2, 2, 0, ffffff634b037948, ffffff634b037804, fffffe6378f25cbc, ffffffff00000000, fffffef459cff8a0)
  fffffe6378f25d00 socket_init_common+0x1c0(ffffff634b0377d0, 0, 0, fffffef459cff8a0)
  fffffe6378f25d30 so_init+0x25(ffffff634b0377d0, 0, fffffef459cff8a0, 0)
  fffffe6378f25de0 socket_create+0xc2(2, 2, 0, 0, 0, 0, fffffe6300000001, fffffef459cff8a0, fffffe6378f25e38)
  fffffe6378f25e70 lx_socket+0x84(2, 1, 0)
  fffffe6378f25ef0 lx_syscall_enter+0x16f()
  fffffe6378f25f10 sys_syscall+0x16c()

If we look at who owns this:

> fffffef4ace55000::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
fffffef4ace55000 adapt fffffe6360e58c40      -      -     yes
> fffffe6360e58c40::findstack -v
stack pointer for thread fffffe6360e58c40: fffffe6360e584d0
  fffffe6360e58500 apix_setspl+0x17(fffffffff78f0b81)
  fffffe6360e58540 splr+0x6a(fffffe635fcd4c40)
  fffffe6360e58c40 0xcaa5af57300fa()

Now, we know that the main place this gets called from is
tcp_time_wait_collector(). So let's see who's in it and what they're
doing:

> ::stacks -c tcp_time_wait_collector
THREAD           STATE    SOBJ                COUNT
fffffe6360e52c40 ONPROC   <NONE>                  2
                 ixgbe_check_acc_handle+0x2e
                 ixgbe_tx_fill_ring+0x235
                 ixgbe_ring_tx+0x8c0
                 mac_hwring_tx+0x1d
                 mac_hwring_send_priv+0x90
                 aggr_ring_tx+0x24
                 mac_hwring_tx+0x1d
                 mac_tx_send+0x605
                 mac_tx_soft_ring_process+0x79
                 mac_tx_aggr_mode+0x7c
                 mac_tx+0x2a8
                 proto_unitdata_req+0x13c
                 dld_wput+0x163
                 putnext+0x2c9
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x29a
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6361447c40 ONPROC   <NONE>                  1
                 0xf4
                 0
                 tcp_time_wait_collector+0x29a
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e2ec40 ONPROC   <NONE>                  1
                 0xf4
                 0
                 tcp_time_wait_collector+0x36c
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e40c40 ONPROC   <NONE>                  1
                 0xf4
                 apix_hilevel_intr_prolog+0x3e
                 tcp_time_wait_collector+0x29a
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe63614cbc40 ONPROC   <NONE>                  1
                 0x570
                 apix_hilevel_intr_prolog+0x3e
                 tcp_time_wait_collector+0x29a
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8       

fffffe636141dc40 ONPROC   <NONE>                  1
                 AcpiUtStatusExit+0x72
                 0
                 apix_do_interrupt+0x34a
                 0xf4
                 0
                 tcp_time_wait_collector+0x36c
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360ea6c40 ONPROC   <NONE>                  1
                 AcpiUtStatusExit+0x72
                 AcpiNsGetNode+0xb2
                 0x3c7
                 0xf4
                 tcp_time_wait_collector+0x335
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6361423c40 ONPROC   <NONE>                  1
                 AcpiUtStatusExit+0x72
                 AcpiNsGetNode+0xb2
                 apix_setspl+0x17
                 apix_do_softint_prolog+0x53
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x2bb
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe636144dc40 ONPROC   <NONE>                  1
                 AcpiUtStatusExit+0x72
                 AcpiNsGetNode+0xb2
                 apix_setspl+0x17
                 apix_do_softint_prolog+0x53
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x36c
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e64c40 ONPROC   <NONE>                  1
                 AcpiUtStatusExit+0x72
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x36c
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e76c40 ONPROC   <NONE>                  1
                 apix_do_interrupt+0x34a
                 0xf4
                 0
                 tcp_time_wait_collector+0x29a
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e6ac40 ONPROC   <NONE>                  1
                 apix_hilevel_intr_prolog+0x3e
                 6
                 tcp_time_wait_collector+0x29a
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e28c40 ONPROC   <NONE>                  1
                 apix_setspl+0x17
                 apix_do_softint_prolog+0x53
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x2b2
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e70c40 ONPROC   <NONE>                  1
                 apix_setspl+0x17
                 apix_do_softint_prolog+0x53
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x35c
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe636143bc40 ONPROC   <NONE>                  1
                 apix_setspl+0x17
                 apix_do_softint_prolog+0x53
                 apix_hilevel_intr_prolog+0x3e
                 6
                 tcp_time_wait_collector+0x35c
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe63614ddc40 ONPROC   <NONE>                  1
                 ixgbe_check_acc_handle+0x2e
                 ixgbe_tx_fill_ring+0x235
                 ixgbe_ring_tx+0x8c0
                 mac_hwring_tx+0x1d
                 mac_hwring_send_priv+0x90
                 aggr_ring_tx+0x24
                 mac_hwring_tx+0x1d
                 mac_tx_send+0x605
                 mac_tx_soft_ring_process+0x79
                 mac_tx_aggr_mode+0x7c
                 mac_tx+0x2a8
                 proto_unitdata_req+0x13c
                 dld_wput+0x163
                 putnext+0x2c9
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x173
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

Of course, due to the NMI, it's not that surprising to see interrupts on
everything that's runnable. But here's the scary bit:

> ::cpuinfo
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc32fc0  1f    4    0  99   no    no t-29817866 fffffe6360edcc40 sched
  1 fffffef46c698040  1f    4    0  99   no    no t-29817866 fffffe6360e7cc40 sched
  2 fffffffffbc3d8c0  1b    7    0  99   no    no t-37056266 fffffe63614b9c40 sched
  3 fffffef46c67f500  1f    3    0  99   no    no t-37060714 fffffe6360e22c40 sched
  4 fffffef46c67e000  1f    7    0  99   no    no t-37057657 fffffe6360e76c40 sched
  5 fffffef46c67cac0  1f    6    0  99   no    no t-37060714 fffffe6360e40c40 sched
  6 fffffef46c8c9580  1f    7    0  99   no    no t-37053297 fffffe6360e46c40 sched
  7 fffffef46c8c8080  1f    6    0  99   no    no t-37051693 fffffe636141dc40 sched
  8 fffffef46c8c6a80  1f    3    0  99   no    no t-37051693 fffffe6360e52c40 sched
  9 fffffef46c942580  1f    6    0  99   no    no t-37051693 fffffe6360e2ec40 sched
 10 fffffef46c941080  1f    6    0  99   no    no t-37053297 fffffe63614ddc40 sched
 11 fffffef46c935a80  1f    7    0  99   no    no t-37055877 fffffe6360e70c40 sched
 12 fffffef46c92e540  1f    8    0  99   no    no t-37054736 fffffe6360ea6c40 sched
 13 fffffef46c92d040  1f    0    0  -1   no    no t-78   fffffe63608bcc40 (idle)
 14 fffffef46c92bb00  1f    8    0  99   no    no t-37046346 fffffe6361423c40 sched
 15 fffffef46c927a80  1f    1    0   1  yes    no t-1078 fffffefed41aa020 node
 16 fffffef46ca3bac0  1f    0    0   1  yes    no t-1799866 fffffef767673b40 node
 17 fffffef46ca25080  1f    0    0 168  yes    no t-6755866 fffffe6360f61c40 sched
 18 fffffef46ca21a80  1f   10    0  99   no    no t-37051693 fffffe63614b3c40 sched
 19 fffffef4acda3540  1f    0    0 165  yes    no t-1078 fffffe6361061c40 sched
 20 fffffef4acda2040  1f   11    6  99   no    no t-37053346 fffffe6360e6ac40 sched
 21 fffffef4acd96b00  1f    8    6  99   no    no t-37040656 fffffe636143bc40 sched
 22 fffffef4acd93500  1f    8    6  99   no    no t-37042036 fffffe6360e3ac40 sched
 23 fffffef4acd92000  1f    6    0  99   no    no t-37051693 fffffe6360e58c40 sched
 24 fffffef4acd8eac0  1f    6    0  99   no    no t-37043487 fffffe636144dc40 sched
 25 fffffef4acd8b580  1f    2    0  99   no    no t-37051693 fffffe63614adc40 sched
 26 fffffef4acd8a080  1f    3    0  99   no    no t-37042526 fffffe6360e5ec40 sched
 27 fffffef4acd86a80  1f    4    0  99   no    no t-37036627 fffffe636145fc40 sched
 28 fffffef4ace60540  1f    6    0  99   no    no t-37061306 fffffe636149bc40 sched
 29 fffffef4ace5e580  1f    7    0  99   no    no t-37051693 fffffe6360e28c40 sched
 30 fffffef4ace54500  1f    6    0  99   no    no t-37043316 fffffe6361447c40 sched
 31 fffffef4ace53000  1f    0    0   1   no    no t-78   ffffff8c96dfa820 java
 32 fffffef4ace51ac0  1f    0    0   1  yes    no t-2397866 fffffef75fdc0020 node
 33 fffffef4ace4e580  1f    8    0  99   no    no t-37051693 fffffe63614cbc40 sched
 34 fffffef4ace4d080  1f    0    0  60   no    no t-34   fffffe6361531c40 fsflush
 35 fffffef4ace49a80  1f    1    0   1  yes    no t-1078 fffffef7618860c0 node
 36 fffffef4ace46540  1f   10    0  99   no    no t-37019627 fffffe6360e34c40 sched
 37 fffffef4ccf4e580  1f    9    0  99   no    no t-37044376 fffffe6360e64c40 sched
 38 fffffef4ccf4c500  1f   11    0  99   no    no t-37058176 fffffe6361465c40 sched
 39 fffffef4ccf4b000  1f    0    6  99   no    no t-0    fffffe635fcfec40 sched

Those are some massive switch times for sched, which of course is the
process our collector threads are running in. So, let's get a sense of how
much we think is sitting in these tcp time wait backlogs:

> ::walk squeue_cache | ::print squeue_t sq_private[0] | ::grep '. != 0'
> | ::print tcp_squeue_priv_t tcp_time_wait_cnt ! sort | uniq -c
     10 tcp_time_wait_cnt = 0
      2 tcp_time_wait_cnt = 0x1
      3 tcp_time_wait_cnt = 0x10
      2 tcp_time_wait_cnt = 0x11
      4 tcp_time_wait_cnt = 0x12
      1 tcp_time_wait_cnt = 0x13
      3 tcp_time_wait_cnt = 0x14
      3 tcp_time_wait_cnt = 0x15
      1 tcp_time_wait_cnt = 0x16
      4 tcp_time_wait_cnt = 0x17
      1 tcp_time_wait_cnt = 0x18
      1 tcp_time_wait_cnt = 0x19
      1 tcp_time_wait_cnt = 0x1a
      1 tcp_time_wait_cnt = 0x1b
      2 tcp_time_wait_cnt = 0x1c
      1 tcp_time_wait_cnt = 0x1d
      1 tcp_time_wait_cnt = 0x1e
      1 tcp_time_wait_cnt = 0x9
      3 tcp_time_wait_cnt = 0xc
      1 tcp_time_wait_cnt = 0xe
      3 tcp_time_wait_cnt = 0xf

Okay, so actually not that much of a backlog; however, we still have those
ridiculously high times since last switch. Based on this, I think what's
happened is that some number of the tcp time wait collectors have been
greedy on CPU: they have just enough going on to always find work and never
give up the processor. If other things could run, I think some of the
necessary back pressure (e.g. user-level failures when we can't recycle
fast enough) would come up; however, if the collectors are almost always on
CPU (hence the high switch times), then we're likely in a world of hurt.


The tcp_time_wait_collector thread is charged with expiring TIME_WAIT connections, which are stored in the buckets of a timing wheel, in ascending order, for reaping. When the collector thread is woken and has processed all of the connections in a bucket, it looks for the next bucket containing items to be expired and calculates the appropriate time at which it should fire next. This calculated time may already be in the past if the just-processed connections took overly long to process.
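
(To make the bucket/schedule relationship concrete, here is a minimal sketch of a timing wheel of this general shape. The names and constants are purely illustrative stand-ins, not the illumos TW_BUCKET/TCP_TIME_WAIT_DELAY definitions.)

    #include <stdint.h>

    /*
     * Illustrative timing wheel: NBUCKETS buckets, each covering DELAY_TICKS
     * ticks of expiry time.  Hypothetical names and values; the real code
     * uses TCP_TIME_WAIT_DELAY, TW_BUCKET(), TW_BUCKET_NEXT(), etc.
     */
    #define NBUCKETS    16
    #define DELAY_TICKS 100

    /* Map an absolute expiry time (in ticks) to its bucket index. */
    static inline unsigned int
    tw_bucket(int64_t expire_ticks)
    {
            return ((expire_ticks / DELAY_TICKS) % NBUCKETS);
    }

    /* The bucket after idx, wrapping around the wheel. */
    static inline unsigned int
    tw_bucket_next(unsigned int idx)
    {
            return ((idx + 1) % NBUCKETS);
    }

Connections whose expiry times fall within the same DELAY_TICKS window share a bucket, so the collector only needs one wake-up per interval to sweep everything that has accumulated there.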

tcp_time_wait_collector() is supposed to handle the case where this calculated time is already in the past by immediately processing that next bucket, since scheduling a wake-up in the past would do no good:

        /*
         * It is possible that the system is under such dire load that between
         * the timer scheduling and TIME_WAIT processing delay, execution
         * overran the interval allocated to this bucket.
         */
        now = ddi_get_lbolt64() - tsp->tcp_time_wait_offset;
        if (new_schedule <= now) {
                /*
                 * Attempt to right the situation by immediately performing a
                 * purge on the next bucket.  This loop will continue as needed
                 * until the schedule can be pushed out ahead of the clock.
                 */
                idx = TW_BUCKET(new_schedule - 1);
                goto retry;
        }

We can see that it selects the next bucket, chosen by the schedule, and jumps back to the logic for expiring connections from the bucket. This is working as desired, but there is a flaw in how the next scheduled time is calculated.

After looping back to process the next bucket due to the time overrun, tcp_time_wait_collector() goes to select the next bucket and a new schedule time:

                /*
                 * Locate the next bucket containing entries.
                 */
                new_schedule = active_schedule
                    + MSEC_TO_TICK(TCP_TIME_WAIT_DELAY);
                nidx = TW_BUCKET_NEXT(idx);

The issue is that active_schedule refers to the time initially used upon entry to tcp_time_wait_collector(). This means that new_schedule will now always be in the past, triggering an infinite loop.

(I believe there is a small edge case where it can escape this loop, but it relies on a near absence of TIME_WAIT connections after the first loop.)
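
To see why the loop cannot terminate, here is a minimal standalone sketch of just the retry arithmetic described above (hypothetical code, not the illumos source; it only models the scheduling math):

    #include <stdio.h>
    #include <stdint.h>

    #define DELAY_TICKS 100     /* stand-in for MSEC_TO_TICK(TCP_TIME_WAIT_DELAY) */

    int
    main(void)
    {
            /* Captured once on entry to the collector and never advanced. */
            int64_t active_schedule = 1000;
            /* By the time the bucket is drained, we are already past it. */
            int64_t now = 1500;
            int64_t new_schedule;
            int passes = 0;

            do {
                    /* ... drain the bucket chosen for this pass ... */

                    /*
                     * "Locate the next bucket": the new schedule is always
                     * derived from the stale entry-time value, so it never
                     * moves past the clock (and in the real system `now'
                     * keeps advancing while we work, making it worse).
                     */
                    new_schedule = active_schedule + DELAY_TICKS;

                    if (++passes > 5) {
                            printf("stuck: new_schedule=%lld, now=%lld\n",
                                (long long)new_schedule, (long long)now);
                            return (1);
                    }
            } while (new_schedule <= now);  /* equivalent of the `goto retry' */

            return (0);
    }

For the retry path to ever push the schedule out ahead of the clock, as the comment intends, new_schedule has to advance with each pass (e.g. be derived from the bucket just processed or from the current time), rather than from the value captured when the collector first woke up.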

History

#1

Updated by Electric Monk about 4 years ago

  • Status changed from New to Closed

git commit c79a72d7e59c9ff493d8dde8b462cca0445f51c8

commit  c79a72d7e59c9ff493d8dde8b462cca0445f51c8
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2016-08-02T14:47:22.000Z

    7209 tcp_time_wait_collector() collecting all the CPU time
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>
