Bug #7209
tcp_time_wait_collector() collecting all the CPU time
100%
Description
So, we have a system that was NMI'd. Let's take a look at some of the
basics of the system:
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   16325304             63770   16%
Boot pages                  68308               266    0%
ZFS File Data            71502056            279304   71%
Anon                      9460825             36956    9%
Exec and libs               85009               332    0%
Page cache                 149122               582    0%
Free (cachelist)           133458               521    0%
Free (freelist)           2895251             11309    3%

Total                   100619333            393044
Physical                100619332            393044
Wow, so that's a lot of memory owned by ZFS, and while there's a bit in the
free list, not much. So, we have a bunch of node processes and others, let's
just take a look at what some of them are doing. First off, we have a lot of
threads that are onproc and others that are stuck in mutex_vector_enter(),
particularly from tcp_get_conn(). Here's a representative stack:
> ffffff06986a5460::findstack -v
stack pointer for thread ffffff06986a5460: fffffe6378f25a30
[ fffffe6378f25a30 _resume_from_idle+0x112() ]
  fffffe6378f25a60 swtch+0x141()
  fffffe6378f25b00 turnstile_block+0x21a(0, 0, fffffef4ace55000, fffffffffbc08aa0, 0, 0)
  fffffe6378f25b70 mutex_vector_enter+0x3a3(fffffef4ace55000)
  fffffe6378f25bb0 tcp_get_conn+0x35(fffffef4acdc2680, fffffef966df9000)
  fffffe6378f25c20 tcp_create_common+0x94(fffffef459cff8a0, 0, 1, fffffe6378f25cbc)
  fffffe6378f25c80 tcp_create+0x7a(2, 2, 0, ffffff634b037948, ffffff634b037804, fffffe6378f25cbc, ffffffff00000000, fffffef459cff8a0)
  fffffe6378f25d00 socket_init_common+0x1c0(ffffff634b0377d0, 0, 0, fffffef459cff8a0)
  fffffe6378f25d30 so_init+0x25(ffffff634b0377d0, 0, fffffef459cff8a0, 0)
  fffffe6378f25de0 socket_create+0xc2(2, 2, 0, 0, 0, 0, fffffe6300000001, fffffef459cff8a0, fffffe6378f25e38)
  fffffe6378f25e70 lx_socket+0x84(2, 1, 0)
  fffffe6378f25ef0 lx_syscall_enter+0x16f()
  fffffe6378f25f10 sys_syscall+0x16c()
If we look at who owns this:
> fffffef4ace55000::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
fffffef4ace55000 adapt fffffe6360e58c40      -      -     yes
> fffffe6360e58c40::findstack -v
stack pointer for thread fffffe6360e58c40: fffffe6360e584d0
  fffffe6360e58500 apix_setspl+0x17(fffffffff78f0b81)
  fffffe6360e58540 splr+0x6a(fffffe635fcd4c40)
  fffffe6360e58c40 0xcaa5af57300fa()
Now, we know that the main place this lock is taken is
tcp_time_wait_collector(). So let's see who's in that and what they're
doing:
> ::stacks -c tcp_time_wait_collector
THREAD           STATE    SOBJ                COUNT
fffffe6360e52c40 ONPROC   <NONE>                  2
                 ixgbe_check_acc_handle+0x2e
                 ixgbe_tx_fill_ring+0x235
                 ixgbe_ring_tx+0x8c0
                 mac_hwring_tx+0x1d
                 mac_hwring_send_priv+0x90
                 aggr_ring_tx+0x24
                 mac_hwring_tx+0x1d
                 mac_tx_send+0x605
                 mac_tx_soft_ring_process+0x79
                 mac_tx_aggr_mode+0x7c
                 mac_tx+0x2a8
                 proto_unitdata_req+0x13c
                 dld_wput+0x163
                 putnext+0x2c9
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x29a
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6361447c40 ONPROC   <NONE>                  1
                 0xf4
                 0
                 tcp_time_wait_collector+0x29a
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e2ec40 ONPROC   <NONE>                  1
                 0xf4
                 0
                 tcp_time_wait_collector+0x36c
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e40c40 ONPROC   <NONE>                  1
                 0xf4
                 apix_hilevel_intr_prolog+0x3e
                 tcp_time_wait_collector+0x29a
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe63614cbc40 ONPROC   <NONE>                  1
                 0x570
                 apix_hilevel_intr_prolog+0x3e
                 tcp_time_wait_collector+0x29a
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe636141dc40 ONPROC   <NONE>                  1
                 AcpiUtStatusExit+0x72
                 0
                 apix_do_interrupt+0x34a
                 0xf4
                 0
                 tcp_time_wait_collector+0x36c
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360ea6c40 ONPROC   <NONE>                  1
                 AcpiUtStatusExit+0x72
                 AcpiNsGetNode+0xb2
                 0x3c7
                 0xf4
                 tcp_time_wait_collector+0x335
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6361423c40 ONPROC   <NONE>                  1
                 AcpiUtStatusExit+0x72
                 AcpiNsGetNode+0xb2
                 apix_setspl+0x17
                 apix_do_softint_prolog+0x53
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x2bb
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe636144dc40 ONPROC   <NONE>                  1
                 AcpiUtStatusExit+0x72
                 AcpiNsGetNode+0xb2
                 apix_setspl+0x17
                 apix_do_softint_prolog+0x53
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x36c
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e64c40 ONPROC   <NONE>                  1
                 AcpiUtStatusExit+0x72
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x36c
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e76c40 ONPROC   <NONE>                  1
                 apix_do_interrupt+0x34a
                 0xf4
                 0
                 tcp_time_wait_collector+0x29a
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e6ac40 ONPROC   <NONE>                  1
                 apix_hilevel_intr_prolog+0x3e
                 6
                 tcp_time_wait_collector+0x29a
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e28c40 ONPROC   <NONE>                  1
                 apix_setspl+0x17
                 apix_do_softint_prolog+0x53
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x2b2
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe6360e70c40 ONPROC   <NONE>                  1
                 apix_setspl+0x17
                 apix_do_softint_prolog+0x53
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x35c
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe636143bc40 ONPROC   <NONE>                  1
                 apix_setspl+0x17
                 apix_do_softint_prolog+0x53
                 apix_hilevel_intr_prolog+0x3e
                 6
                 tcp_time_wait_collector+0x35c
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8

fffffe63614ddc40 ONPROC   <NONE>                  1
                 ixgbe_check_acc_handle+0x2e
                 ixgbe_tx_fill_ring+0x235
                 ixgbe_ring_tx+0x8c0
                 mac_hwring_tx+0x1d
                 mac_hwring_send_priv+0x90
                 aggr_ring_tx+0x24
                 mac_hwring_tx+0x1d
                 mac_tx_send+0x605
                 mac_tx_soft_ring_process+0x79
                 mac_tx_aggr_mode+0x7c
                 mac_tx+0x2a8
                 proto_unitdata_req+0x13c
                 dld_wput+0x163
                 putnext+0x2c9
                 apix_hilevel_intr_prolog+0x3e
                 0
                 tcp_time_wait_collector+0x173
                 callout_list_expire+0x98
                 callout_expire+0x3b
                 callout_execute+0x20
                 taskq_thread+0x2d0
                 thread_start+8
Of course, due to the NMI, it's not that surprising to see interrupts on
everything that's runnable. But here's the scary bit:
> ::cpuinfo
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH     THREAD           PROC
  0 fffffffffbc32fc0  1f    4    0  99   no    no t-29817866 fffffe6360edcc40 sched
  1 fffffef46c698040  1f    4    0  99   no    no t-29817866 fffffe6360e7cc40 sched
  2 fffffffffbc3d8c0  1b    7    0  99   no    no t-37056266 fffffe63614b9c40 sched
  3 fffffef46c67f500  1f    3    0  99   no    no t-37060714 fffffe6360e22c40 sched
  4 fffffef46c67e000  1f    7    0  99   no    no t-37057657 fffffe6360e76c40 sched
  5 fffffef46c67cac0  1f    6    0  99   no    no t-37060714 fffffe6360e40c40 sched
  6 fffffef46c8c9580  1f    7    0  99   no    no t-37053297 fffffe6360e46c40 sched
  7 fffffef46c8c8080  1f    6    0  99   no    no t-37051693 fffffe636141dc40 sched
  8 fffffef46c8c6a80  1f    3    0  99   no    no t-37051693 fffffe6360e52c40 sched
  9 fffffef46c942580  1f    6    0  99   no    no t-37051693 fffffe6360e2ec40 sched
 10 fffffef46c941080  1f    6    0  99   no    no t-37053297 fffffe63614ddc40 sched
 11 fffffef46c935a80  1f    7    0  99   no    no t-37055877 fffffe6360e70c40 sched
 12 fffffef46c92e540  1f    8    0  99   no    no t-37054736 fffffe6360ea6c40 sched
 13 fffffef46c92d040  1f    0    0  -1   no    no t-78       fffffe63608bcc40 (idle)
 14 fffffef46c92bb00  1f    8    0  99   no    no t-37046346 fffffe6361423c40 sched
 15 fffffef46c927a80  1f    1    0   1  yes    no t-1078     fffffefed41aa020 node
 16 fffffef46ca3bac0  1f    0    0   1  yes    no t-1799866  fffffef767673b40 node
 17 fffffef46ca25080  1f    0    0 168  yes    no t-6755866  fffffe6360f61c40 sched
 18 fffffef46ca21a80  1f   10    0  99   no    no t-37051693 fffffe63614b3c40 sched
 19 fffffef4acda3540  1f    0    0 165  yes    no t-1078     fffffe6361061c40 sched
 20 fffffef4acda2040  1f   11    6  99   no    no t-37053346 fffffe6360e6ac40 sched
 21 fffffef4acd96b00  1f    8    6  99   no    no t-37040656 fffffe636143bc40 sched
 22 fffffef4acd93500  1f    8    6  99   no    no t-37042036 fffffe6360e3ac40 sched
 23 fffffef4acd92000  1f    6    0  99   no    no t-37051693 fffffe6360e58c40 sched
 24 fffffef4acd8eac0  1f    6    0  99   no    no t-37043487 fffffe636144dc40 sched
 25 fffffef4acd8b580  1f    2    0  99   no    no t-37051693 fffffe63614adc40 sched
 26 fffffef4acd8a080  1f    3    0  99   no    no t-37042526 fffffe6360e5ec40 sched
 27 fffffef4acd86a80  1f    4    0  99   no    no t-37036627 fffffe636145fc40 sched
 28 fffffef4ace60540  1f    6    0  99   no    no t-37061306 fffffe636149bc40 sched
 29 fffffef4ace5e580  1f    7    0  99   no    no t-37051693 fffffe6360e28c40 sched
 30 fffffef4ace54500  1f    6    0  99   no    no t-37043316 fffffe6361447c40 sched
 31 fffffef4ace53000  1f    0    0   1   no    no t-78       ffffff8c96dfa820 java
 32 fffffef4ace51ac0  1f    0    0   1  yes    no t-2397866  fffffef75fdc0020 node
 33 fffffef4ace4e580  1f    8    0  99   no    no t-37051693 fffffe63614cbc40 sched
 34 fffffef4ace4d080  1f    0    0  60   no    no t-34       fffffe6361531c40 fsflush
 35 fffffef4ace49a80  1f    1    0   1  yes    no t-1078     fffffef7618860c0 node
 36 fffffef4ace46540  1f   10    0  99   no    no t-37019627 fffffe6360e34c40 sched
 37 fffffef4ccf4e580  1f    9    0  99   no    no t-37044376 fffffe6360e64c40 sched
 38 fffffef4ccf4c500  1f   11    0  99   no    no t-37058176 fffffe6361465c40 sched
 39 fffffef4ccf4b000  1f    0    6  99   no    no t-0        fffffe635fcfec40 sched
Those are some massive switch times for sched, which of course is the process
these collector threads belong to. So, let's get a sense of how much we think
is sitting in these tcp time wait backlogs:
> ::walk squeue_cache | ::print squeue_t sq_private[0] | ::grep '. != 0'
> | ::print tcp_squeue_priv_t tcp_time_wait_cnt ! sort | uniq -c
  10 tcp_time_wait_cnt = 0
   2 tcp_time_wait_cnt = 0x1
   3 tcp_time_wait_cnt = 0x10
   2 tcp_time_wait_cnt = 0x11
   4 tcp_time_wait_cnt = 0x12
   1 tcp_time_wait_cnt = 0x13
   3 tcp_time_wait_cnt = 0x14
   3 tcp_time_wait_cnt = 0x15
   1 tcp_time_wait_cnt = 0x16
   4 tcp_time_wait_cnt = 0x17
   1 tcp_time_wait_cnt = 0x18
   1 tcp_time_wait_cnt = 0x19
   1 tcp_time_wait_cnt = 0x1a
   1 tcp_time_wait_cnt = 0x1b
   2 tcp_time_wait_cnt = 0x1c
   1 tcp_time_wait_cnt = 0x1d
   1 tcp_time_wait_cnt = 0x1e
   1 tcp_time_wait_cnt = 0x9
   3 tcp_time_wait_cnt = 0xc
   1 tcp_time_wait_cnt = 0xe
   3 tcp_time_wait_cnt = 0xf
Okay, so actually not that much of a backlog; however, we still have those
ridiculously high times since last switch. Based on this, I think what's
happened is that some number of the tcp time wait collectors have basically
been greedy on CPU and always have just enough work queued up that they keep
going and never give up the processor. If other things could run, I think
some of the necessary back pressure (e.g. user-level failures when we can't
recycle connections fast enough) would show up; however, if these threads are
almost always on CPU (hence the high switch times), then we're likely in a
world of hurt.
The tcp_time_wait_collector thread is charged with expiring TIME_WAIT connections, which are stored in the buckets of a timing wheel in ascending order for reaping. When the collector thread is woken and processes all connections in a bucket, it looks for the next bucket containing items to be expired, calculating the appropriate time at which it should fire next. This calculated time may be in the past if the just-processed connections took an overly long time to process.
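As a rough illustration of the bucketing involved, here is a minimal sketch of how a timing wheel like this maps an expiry tick to a bucket and steps from one bucket to the next. The sizes and names here (TW_NBUCKETS, TW_BUCKET_SPAN, tw_bucket_of, tw_bucket_next) are hypothetical and simplified; the actual TW_BUCKET/TW_BUCKET_NEXT macros in the illumos TCP time-wait code differ in detail:

#include <stdint.h>

#define	TW_NBUCKETS	16	/* hypothetical number of wheel buckets */
#define	TW_BUCKET_SPAN	10	/* hypothetical ticks covered by one bucket */

/* Map an absolute expiry time (in ticks) to its wheel bucket. */
static unsigned int
tw_bucket_of(int64_t expire_tick)
{
	return ((unsigned int)((expire_tick / TW_BUCKET_SPAN) % TW_NBUCKETS));
}

/* The bucket the collector visits after bucket idx. */
static unsigned int
tw_bucket_next(unsigned int idx)
{
	return ((idx + 1) % TW_NBUCKETS);
}

When the collector wakes, it drains the bucket for the current schedule and then computes a new wake-up time for the next non-empty bucket, which is where the problem described below comes in.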
The tcp_time_wait_collector function is supposed to handle this case by immediately processing the next bucket, since scheduling a wake-up in the past would do no good:
	/*
	 * It is possible that the system is under such dire load that between
	 * the timer scheduling and TIME_WAIT processing delay, execution
	 * overran the interval allocated to this bucket.
	 */
	now = ddi_get_lbolt64() - tsp->tcp_time_wait_offset;
	if (new_schedule <= now) {
		/*
		 * Attempt to right the situation by immediately performing a
		 * purge on the next bucket. This loop will continue as needed
		 * until the schedule can be pushed out ahead of the clock.
		 */
		idx = TW_BUCKET(new_schedule - 1);
		goto retry;
	}
We can see that it selects the next bucket, as chosen by the schedule, and jumps back to the logic for expiring connections from a bucket. That part works as desired, but there is a flaw in how the next scheduled time is calculated.
After looping to process the next bucket due to the time overrun, tcp_time_wait_collector goes to select the next bucket and a new schedule time:
	/*
	 * Locate the next bucket containing entries.
	 */
	new_schedule = active_schedule + MSEC_TO_TICK(TCP_TIME_WAIT_DELAY);
	nidx = TW_BUCKET_NEXT(idx);
The issue is that active_schedule refers to the time initially calculated upon entry to tcp_time_wait_collector, which is already behind the clock by the time the overrun path is taken. This means that new_schedule will now always be in the past, triggering an infinite loop.
(I believe there is a small edge case where it can escape this loop, but it relies on a near absence of TIME_WAIT connections after the first loop.)
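To make the arithmetic concrete, here is a small worked example under assumed numbers; DELAY_TICKS and the starting values below are hypothetical stand-ins, not the real MSEC_TO_TICK(TCP_TIME_WAIT_DELAY) or observed lbolt values. Once the collector has fallen behind, recomputing new_schedule from the stale active_schedule can never catch up to now, so the "new_schedule <= now" retry path fires on every pass:

#include <stdio.h>
#include <stdint.h>

#define	DELAY_TICKS	10	/* hypothetical stand-in for MSEC_TO_TICK(TCP_TIME_WAIT_DELAY) */

int
main(void)
{
	int64_t active_schedule = 1000;	/* schedule computed on entry; never updated */
	int64_t now = 1500;		/* we are already well behind the clock */
	int64_t new_schedule;
	int i;

	for (i = 0; i < 5; i++) {
		/* Mirrors the buggy recalculation: the base never advances. */
		new_schedule = active_schedule + DELAY_TICKS;
		printf("pass %d: new_schedule=%lld now=%lld -> %s\n", i,
		    (long long)new_schedule, (long long)now,
		    new_schedule <= now ? "retry (still behind)" : "sleep");
		now++;	/* time keeps marching on while we spin */
	}
	return (0);
}

One plausible remedy would be to base the recalculation on the schedule of the bucket just selected (or on the current time) rather than on the stale active_schedule; the actual fix is the commit referenced below.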
Updated by Electric Monk over 4 years ago
- Status changed from New to Closed
git commit c79a72d7e59c9ff493d8dde8b462cca0445f51c8
commit c79a72d7e59c9ff493d8dde8b462cca0445f51c8
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2016-08-02T14:47:22.000Z

    7209 tcp_time_wait_collector() collecting all the CPU time
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>