Bug #8923

tcpListenDrop counter continuously increases when we put our production webserver on OI

Added by anil choudhary almost 2 years ago. Updated over 1 year ago.

Status:
Closed
Priority:
Urgent
Assignee:
-
Category:
-
Start date:
2017-12-14
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

On running the following command, we see tcpListenDrop increasing at a rate of about 1000/sec:

netstat -sP tcp 1 | grep -i listendrop

tcpListenDrop       =  3321     tcpListenDropQ0     =     0
tcpListenDrop = 2494 tcpListenDropQ0 = 0
tcpListenDrop = 2473 tcpListenDropQ0 = 0
tcpListenDrop = 1321 tcpListenDropQ0 = 0
tcpListenDrop = 1518 tcpListenDropQ0 = 0
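As a side note, per-second samples like the above can be averaged with a small helper (a sketch; the function name is ours, and it assumes the `name = value` columns shown, with the first sample being the cumulative counter rather than a per-second delta):

```shell
# Average the per-second tcpListenDrop deltas from a capture of
# `netstat -sP tcp 1 | grep -i listendrop`.
# The 3rd whitespace-separated field is the tcpListenDrop value;
# the first (cumulative) sample is skipped.
avg_listen_drops() {
    awk 'NR > 1 { sum += $3; n++ }
         END    { if (n) printf "avg drops/sec: %.1f\n", sum / n }'
}
```

For example, on a saved capture: `avg_listen_drops < listendrop.log`.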

How can we fix this issue?


Related issues

Is duplicate of illumos gate - Bug #5123: IP DCE does not scale - part deux (Closed, 2014-08-26)


History

#1

Updated by anil choudhary almost 2 years ago

Initially tcpListenDrop stays at 0 for five to six hours.

tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
tcpListenDrop = 0 tcpListenDropQ0 = 0
If we reboot the webserver, it again works for five to six hours.

We tried tuning the following TCP parameters:
ndd -set /dev/tcp tcp_conn_req_max_q 8192
ndd -set /dev/tcp tcp_conn_req_max_q0 8192
ndd -set /dev/tcp tcp_xmit_hiwat 8192
ndd -set /dev/tcp tcp_recv_hiwat 8192
ndd -set /dev/tcp tcp_max_buf 1048576
ndd -set /dev/tcp tcp_smallest_anon_port 2048
ndd -set /dev/tcp tcp_ecn_permitted 1
ndd -set /dev/tcp tcp_keepalive_interval 60000
ndd -set /dev/tcp tcp_ip_abort_interval 10000
ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 30000

but it didn't help.

We would appreciate it if you could look into this, as we are running a production webserver on it.
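Incidentally, whether such ndd settings actually took effect can be double-checked with a small wrapper (a sketch; the function name is ours, and it assumes illumos ndd(8) is available, as on OI):

```shell
# Print each tuned TCP parameter next to its current value,
# by querying `ndd -get /dev/tcp <name>` for every name in the list.
check_tcp_tunables() {
    for p in tcp_conn_req_max_q tcp_conn_req_max_q0 \
             tcp_xmit_hiwat tcp_recv_hiwat; do
        printf '%s = %s\n' "$p" "$(ndd -get /dev/tcp "$p")"
    done
}
```

Running it after the ndd -set commands above should echo back the values you set.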

#2

Updated by anil choudhary almost 2 years ago

We are using the latest OI release:

pkg info kernel
Name: system/kernel
Summary: Core Solaris Kernel
Description: core kernel software for a specific instruction-set architecture
Category: System/Core
State: Installed
Publisher: openindiana.org
Version: 0.5.11
Branch: 2017.0.0.16782
Packaging Date: October 28, 2017 at 04:51:58 AM
Size: 42.37 MB
FMRI: pkg::20171028T045158Z

The webserver works fine initially, but after 8-16 hours it starts showing latency; what we observe during this time is that it hogs one CPU and tcpListenDrop also increases.

The output of the following dtrace command is:

dtrace -n 'profile-1002 /cpu == 18/ { @calls[curthread , stack(5)] = count(); }'

-1914828104640

ip`dce_lookup_and_add_v4+0x98
ip`ip_set_destination_v4+0x2d2
ip`ip_attr_connect+0xd6
ip`conn_connect+0x11a
ip`tcp_set_destination+0x6e

6665

We would really appreciate your help in sorting this out.

#3

Updated by anil choudhary almost 2 years ago

CPU 18 is at 100% utilization.

#4

Updated by anil choudhary almost 2 years ago

Even after setting the following (the default value is 3):

ndd -set /dev/ip ip_dce_reclaim_fraction 1
ndd -get /dev/ip ip_dce_reclaim_fraction
3

the dce_cache entries shown by ::kmem_cache keep increasing at a rate of about 200/sec:

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15194244

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15198612

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15200510

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15202538

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15204540

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15206126

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15208232

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15210364

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15213588

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15220764

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15222870

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15224300

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15226900

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15229422

::kmem_cache!grep dce

fffffea39a483008 dce_cache 0000 000000 152 15231918

Can you please suggest how to reduce it?
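For what it's worth, the growth rate can be extracted from repeated samples like the above with a small filter (a sketch; the function name is ours, and it assumes the ::kmem_cache layout shown, with the buffer count in the 6th column):

```shell
# Print the delta in dce_cache buffers between successive
# `::kmem_cache !grep dce` samples (6th column is the buffer count).
dce_growth() {
    awk '/dce_cache/ { if (prev) print $6 - prev; prev = $6 }'
}
```

For example, on saved mdb output: `dce_growth < kmem_cache.log`.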

#5

Updated by Robert Bailey almost 2 years ago

It would be good to know what hardware you are running this on, and also what web server software is installed. Maybe it is a problem with the NIC or the NIC driver?

Thank You

#6

Updated by anil choudhary almost 2 years ago

The hardware is an HP blade server using the bnxe driver.

#7

Updated by Avnindra Singh almost 2 years ago

Robert Bailey wrote:

It would be good to know what hardware you are running this on. Also what web server software is installed. Maybe it is a problem with the NIC or the NIC driver?

Thank You

The webserver is a proprietary application.

#8

Updated by anil choudhary almost 2 years ago

Even after increasing ip_dce_hash_size to 2048, the problem was pushed out by a couple of days but not resolved; it resurfaces after a couple of days.

we have followed work around mentioned in https://www.illumos.org/issues/3925


We have configured a cron job to run every 3 hours and put memory pressure on the system, using the following script (it saves lotsfree, raises it to 4x freemem via mdb to trigger a kmem reclaim, then restores it):

#!/bin/bash
ndd -set /dev/ip ip_dce_reclaim_fraction 1
{
echo '%/l/lotsfree >oldlotsfree'
echo 'lotsfree/Z ((%/l/freemem)*4)'
sleep 2
echo 'lotsfree/Z <oldlotsfree'
} | exec mdb -kw
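For reference, the corresponding crontab entry might look like this (the script path /root/dce_reclaim.sh is an assumption):

```shell
# Hypothetical crontab line: run the reclaim script every 3 hours.
entry='0 */3 * * * /root/dce_reclaim.sh'
printf '%s\n' "$entry"
# Install with: ( crontab -l 2>/dev/null; printf '%s\n' "$entry" ) | crontab -
```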

However, the reclaim stopped working after some time. In this condition we took a live crash dump with savecore -L, and using the scat tool we obtained the stack below; the reclaim worker thread hangs in cv_wait:

thread name: [ fffffe4228d759b0 _resume_from_idle+0x112() ]
fffffe4228d759e0 swtch+0x141()
fffffe4228d75a20 cv_wait+0x70(fffffea3900acd6a, fffffea3900acd60)
fffffe4228d75a80 tcp_ixa_cleanup_getmblk+0x93(fffffea408a6e080)
fffffe4228d75ad0 conn_ixa_cleanup+0x8d(fffffea408a6e080, 0)
fffffe4228d75b40 ipcl_walk+0xc3(fffffffff7c03140, 0, fffffea35ec5b000)
fffffe4228d75b80 ip_dce_reclaim_stack+0x91(fffffea35ec5b000)
fffffe4228d75bc0 ip_dce_reclaim+0x5c()
fffffe4228d75c20 dce_reclaim_worker+0xf0(0)
fffffe4228d75c30 thread_start+8()

#9

Updated by anil choudhary over 1 year ago

This bug can be closed; it is fixed by Bug #5123.

#10

Updated by Marcel Telka over 1 year ago

  • Is duplicate of Bug #5123: IP DCE does not scale - part deux added
#11

Updated by Marcel Telka over 1 year ago

  • Status changed from New to Closed
