Bug #3925

IP DCE does not scale

Added by Robert Mustacchi over 4 years ago. Updated over 3 years ago.

Status: Closed
Start date: 2013-07-28
Priority: High
Due date:
Assignee: Robert Mustacchi
% Done: 100%
Category: networking
Target version: -
Difficulty: Hard
Tags:

Description

From the Joyent bug report and evaluation:

The IP DCE is a cache of dce_t structures (see usr/src/uts/common/inet/ip/ip_dce.c, specifically
the dce_lookup_and_add_v4() function, which searches the cache). The cache consists of a hard-coded
256-entry hash table. New entries are added to the end of a linked list headed by a hash bucket
(ips_dce_hash_v4 in the ip_stack_t for a given IP stack). On a machine with a large exposure to
the internet, this cache can become quite large (easily millions of entries).
There are 3 problems with this scheme that are scaling related:

1. With only 256 hash buckets, assuming the hash algorithm is good, the number of elements on
a given hash chain can be quite large (thousands of entries).
2. Examining the results of the hashing algorithm on a given machine, every third bucket is being
used, effectively increasing the size of the hash chains that are in use by a factor of 3, i.e.,
the hashing algorithm needs to be reworked.
3. Entries are removed from the cache only when the system is under memory pressure. This effectively
means that entries may never be removed. On one of the systems that I have been examining,
the total number of DCE cache entries is over 40,000,000.
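
A back-of-the-envelope check of points 1 and 2 (a minimal sketch using the figures from this report, not code from the fix):

```c
#include <assert.h>

/* Average hash-chain length for a given cache size and bucket count. */
static long chain_length(long entries, long buckets) {
    return entries / buckets;
}
```

With 40,000,000 entries and 256 buckets, even a perfect hash leaves 156,250 entries per chain; if only every third bucket is in use, the chains actually walked approach 470,000 entries, traversed linearly on every lookup.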

The above factors result in the following symptoms:

1. Over time, the load average for a given system that is getting lots of requests from the network
from a large number of remote systems (IPs), steadily increases.
2. Response time to network events also steadily increases. Zeus, for instance, decides that the machine
has failed and initiates failover. It takes between 2 and 3 weeks of uptime before the load
becomes an issue. Any system with a large number of different IP addresses accessing it may run into the
problem.
3. Overall system performance is affected. Aside from rebooting the SM that is affected, or (artificially?)
inducing memory stress, there is no way to clear the cache.
4. The most obvious symptom shows up in mpstat output. One or more CPUs that are handling large numbers
of interrupts will show very low idle times, but very high system times (20%-80% respectively, or worse).
(The NIC interrupt handler indirectly calls dce_lookup_and_add_v4(), and the routine, instead of taking 20-30 microseconds,
takes 20-30 milliseconds.)

Workaround

One day before discovering the problem at Libero, someone else discovered the same problem existing in Solaris 11.
A workaround is shown at http://comments.gmane.org/gmane.os.solaris.opensolaris.networking/14278.
This workaround was tried at Libero, and reduced the load average from 1.2 to 0.7, with no ill effects.
It is included here for completeness:

#!/bin/bash
# Make each DCE reclaim free as large a fraction of the cache as possible.
ndd -set /dev/ip ip_dce_reclaim_fraction 1
{
  # Save lotsfree, raise it above freemem to simulate memory pressure
  # (which triggers the DCE cache reclaim callback), then restore it.
  echo '%/l/lotsfree >oldlotsfree'
  echo 'lotsfree/Z ((%/l/freemem)*4)'
  sleep 2
  echo 'lotsfree/Z <oldlotsfree'
} | exec mdb -kw

Suggested Fixes

Implement LRU to free up entries that have not been accessed in a long time.
Increase the number of hash buckets.
Modify the hashing algorithm to get more even distribution across hash buckets.
Change from using a linked list to some other data structure, for instance, an
AVL tree. This, of course, would require more work and may not be the best fit
(how often are entries added/deleted?).

And the resulting analysis from Jerry:

Given Max's investigation, the fix is pretty straightforward. Currently the cache grows indefinitely until there is memory pressure on the system as a whole, but the hash table will typically be unusable long before then. We need to prune the hash table when it hits a "fullness" threshold. I have added an ndd tunable "ip_dce_reclaim_threshold" with a default value of 1000. When adding an entry to the cache, if the bucket is over this threshold, we will prune the cache using the pre-existing "ip_dce_reclaim_fraction" ndd tunable value, which reduces the cache by that fraction (i.e., the default is 3, so we prune 1/3 of the cache).

I also added a global "ip_dce_hash_size" which can be tuned in /etc/system to increase the per-stack hash table size. I didn't want to change this from the existing default of 256 buckets, since each zone will have two of these hash tables. On a system which is seeing a lot of IP addresses, we could bump this up to 512, 1024, or more, but all zones on that system will then be using two hash tables of that size. Between these two tunables and the existing one, we now have good control over the behavior of the DCE cache.
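
The insert-time check described above can be sketched as follows. This is an illustration only: the struct, helper names, and per-bucket pruning are simplified stand-ins (the real fix prunes the whole per-stack cache, not just one bucket), with the tunable defaults taken from the analysis.

```c
#include <stdlib.h>

/* Illustrative stand-ins for the tunables (defaults from the analysis). */
static int ip_dce_reclaim_threshold = 1000; /* max entries per bucket */
static int ip_dce_reclaim_fraction = 3;     /* prune 1/fraction on reclaim */

typedef struct dce {
    struct dce *next;
    unsigned int addr;
} dce_t;

typedef struct {
    dce_t *head;
    int count;
} dce_bucket_t;

/* Prune roughly 1/fraction of a bucket's entries (hypothetical helper). */
static void bucket_reclaim(dce_bucket_t *b, int fraction) {
    int target = b->count - b->count / fraction;
    while (b->count > target) {
        dce_t *d = b->head;
        b->head = d->next;
        free(d);
        b->count--;
    }
}

/* Add an entry, pruning first if the bucket is over the threshold. */
static void bucket_add(dce_bucket_t *b, unsigned int addr) {
    if (b->count >= ip_dce_reclaim_threshold)
        bucket_reclaim(b, ip_dce_reclaim_fraction);
    dce_t *d = malloc(sizeof (*d));
    d->addr = addr;
    d->next = b->head;
    b->head = d;
    b->count++;
}
```

The point of the shape is that the bucket count is bounded by the threshold regardless of how many destinations the machine sees, instead of growing until system-wide memory pressure.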

Note that this work covers several different commits from illumos-joyent.

History

#1 Updated by Robert Mustacchi over 4 years ago

  • % Done changed from 90 to 100
  • Status changed from New to Pending RTI

The following are follow-up notes on the related issues that have been merged into this illumos bug.

This bug relates to issues in the squeue subsystem, used as a form of task queue by the IP classifier system. Entries are placed on the queue in the form of mblks and processed by worker threads. In many cases, there are fast paths that allow processing inline by the thread enqueuing the request. While there are several different types of tasks that can be run out of squeues, the task of primary concern here is DCE reclaim, performed by conn_ixa_cleanup() in ip_attr.c. See the initial report on this function's purpose. In short, it walks a list of connections and performs a cleanup operation on each one to free up data structures that have been marked for destruction elsewhere. See also the block comment at the top of squeue.c.

There are three separate problems here, of which two are closely related. The first and most obvious problem is that, as Dave noted in comment 29565, there is a path into squeue processing that does not set sq_run. Specifically, the slow path through squeue_enter() will end up invoking squeue_drain() near squeue.c:558 without setting this field. All other paths that can result in the invocation of the function hanging off b_queue set this field before starting and clear it after finishing. The result of this is a 1-way deadlock of the form (key paths highlighted):

swtch+0x8e
                 cv_wait+0x61
                 tcp_ixa_cleanup_wait_and_finish+0x56
                 conn_ixa_cleanup+0x1be
                 ipcl_walk+0xba
                 ip_dce_reclaim_stack+0x86                 <<<
                 dce_lookup_and_add_v4+0x6c                
                 ip_set_destination_v4+0x1cf
                 ip_attr_connect+0xc6
                 conn_connect+0xeb
                 tcp_set_destination+0x66
                 tcp_input_listener+0x7ca
                 squeue_drain+0x228                        <<<
                 squeue_enter+0x4fe                        <<<
                 ip_fanout_v4+0x48d
                 ire_recv_local_v4+0x366
                 ill_input_short_v4+0x6ce
                 ip_input+0x23b
                 mac_rx_soft_ring_process+0x17a
                 mac_rx_srs_proto_fanout+0x46f
                 mac_rx_srs_drain+0x26e
                 mac_rx_srs_process+0x1a9
                 mac_rx_classify+0x159
                 mac_rx_flow+0x54
                 mac_rx_common+0x1f6
                 mac_rx+0xac
                 mac_rx_ring+0x4c
                 igb_intr_rx+0x5e
                 av_dispatch_autovect+0x7c
                 dispatch_hardint+0x33

The problem here is that conn_ixa_cleanup() is relying on connp->conn_sqp->sq_run to detect that we are already executing in the context of squeue processing and must complete our work synchronously as part of some larger (and invisible to us) task. If we instead place our task at the end of the connection's squeue and wait for it, we'll never finish because we're already processing something at the head of that same squeue. And, in fact, this is exactly what happens:

1. A thread calls into squeue_enter() and takes the path through squeue.c:558.
2. It next invokes squeue_drain() to process the contents of the queue (sq_run == NULL).
3. A task on the squeue, invoked from squeue_drain(), ends up in ip_dce_reclaim_stack() because there are too many DCE entries lying around.
4. The thread then calls into conn_ixa_cleanup() as part of that reclaim, such that the connection being cleaned up has been assigned the same squeue that is being drained; this last condition is REQUIRED to trigger this case.
5. conn_ixa_cleanup() gets the per-stack shared mblk and, seeing that connp->conn_sqp->sq_run != curthread, enqueues the mblk on the same squeue via SQUEUE_ENTER_ONE() and waits for its processing to complete.
6. Processing of this mblk is blocked because the squeue is non-empty; i.e., we cannot enter the fast path in squeue_enter(). Since squeue_enter() will not process the mblk inline and the squeue is already being drained, further progress by this thread is impossible.
At this point, anyone relying on being able to call squeue_synch_enter() on this squeue or otherwise obtaining the result of anything being processed in this squeue will also block forever.

Note that comment 29565 incorrectly asserts that it is illegal to call conn_ixa_cleanup() with sq_run != curthread. That is in fact legal, provided that the thread is not currently processing squeue entries. If this code is reached from a syscall path, for instance, it is perfectly acceptable for sq_run to be NULL; it will be set in squeue_enter() prior to invoking tcp_ixa_cleanup().

This simple case requires only a single actor and is addressed by setting sq_run prior to making the call into squeue_drain() near squeue.c:558. It can be readily identified by a single thread that has called conn_ixa_cleanup() from within squeue_drain() and is making no progress. An unlimited number of additional threads may be further blocked, most likely in squeue_synch_enter(), but these are merely collateral damage. The amount of load, in the form of traffic or connection rate, is only weakly correlated with the probability that any given TCP processing task will trigger the deadlock. Fewer connections make inline processing in squeue_enter() more likely, which cannot trigger the deadlock. And more connections mean that DCE reclaim will be performed more frequently, thereby increasing the probability that a particular reclaim operation will occur on the same squeue on which processing is already occurring. For this reason, decreasing the DCE reclaim thresholds will also increase the risk. But this problem could occur even at low loads.
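
The fix for this first case amounts to treating the slow path like every other entry into squeue_drain(): mark the squeue as owned before draining, so tasks can detect that they are already in squeue context. A minimal single-threaded sketch of that ownership check (types and names heavily simplified from the real squeue code):

```c
#include <stddef.h>

typedef struct squeue {
    void *sq_run; /* thread currently draining this squeue, or NULL */
} squeue_t;

/* Simplified stand-in for the kernel's curthread. */
static void *curthread = &curthread;

/*
 * A task that must decide between synchronous cleanup and enqueue-and-wait.
 * If sq_run == curthread we are already draining this squeue and must do
 * the work inline; enqueueing and waiting would deadlock, since nothing
 * else can drain the queue we are sitting on.
 */
static int must_run_inline(squeue_t *sqp) {
    return (sqp->sq_run == curthread);
}

static void squeue_drain(squeue_t *sqp, int *ran_inline) {
    /* The bug: the slow path reached this point without setting sq_run. */
    sqp->sq_run = curthread;
    *ran_inline = must_run_inline(sqp); /* tasks now see the right owner */
    sqp->sq_run = NULL;
}
```

Without the assignment at the top of squeue_drain(), must_run_inline() would return 0 and the task would enqueue itself on the very squeue being drained.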

The second and third cases are very closely related, and much more subtle. These are the two missed wakeup races in conn_ixa_cleanup() itself. As noted in comment 29578, this function synchronizes itself with tcp_ixa_cleanup() using a single per-stack CV and the presence or absence of the per-stack shared mblk as a work token. When tcp_ixa_cleanup() completes, it takes the per-stack lock, returns the mblk to the stack, signals the next waiter, and drops the lock. This is broken because two different threads may be waiting on this CV, only one of which will do useful work when signaled.

In addition, because conn_ixa_cleanup() drops the per-stack lock before it calls SQUEUE_ENTER_ONE() to enqueue the task, then reacquires it later to check for the presence of the shared work token as a sign of completion, it is possible for that completion to occur either inline or in another thread, only for the work token to be taken by another thread prior to the completion check. This adds to the number of cases in which the missed wakeup can occur, leaving threads stranded at either ip_attr.c:1226 or ip_attr.c:1244 (or, ultimately, both). These sequences are highlighted well by comment 29581.

Because this relies on having multiple threads simultaneously trying to clean up connections in the same TCP stack, its likelihood increases dramatically as the number of connections increases and/or the thresholds for DCE reclaim decrease. It is highly unlikely that this would ever be seen when the rate of connection formation is moderate relative to the DCE reclaim threshold and the rate at which DCE reclaim can be completed. It is more difficult to identify than the simpler case, but a reliable guide is the presence of 2 or more threads stuck in conn_ixa_cleanup() and none in squeue_drain(), as in the dump referenced in comment 29585 from FC0SGS1:

ffffff19b147f120 SLEEP    CV                      1
                 swtch+0x145
                 cv_wait+0x61
                 conn_ixa_cleanup+0x169
                 ipcl_walk+0xba
                 ip_dce_reclaim_stack+0x86
                 dce_lookup_and_add_v4+0x6c
                 ip_set_destination_v4+0x1cf
                 ip_attr_connect+0xc6
                 conn_connect+0xeb
                 tcp_set_destination+0x66
                 tcp_connect_ipv4+0x135
                 tcp_do_connect+0x20f
                 tcp_connect+0x9f
                 so_connect+0xf8
                 socket_connect+0x47
                 connect+0xec
                 _sys_sysenter_post_swapgs+0x149

fffffffd920698a0 SLEEP    CV                      1
                 swtch+0x145
                 cv_wait+0x61
                 conn_ixa_cleanup+0x214
                 ipcl_walk+0xba
                 ip_dce_reclaim_stack+0x86
                 dce_lookup_and_add_v4+0x6c
                 ip_set_destination_v4+0x1cf
                 ip_attr_connect+0xc6
                 conn_connect+0xeb    
                 tcp_set_destination+0x66
                 tcp_connect_ipv4+0x135
                 tcp_do_connect+0x20f
                 tcp_connect+0x9f
                 so_connect+0xf8
                 socket_connect+0x47
                 connect+0xec
                 _sys_sysenter_post_swapgs+0x149

ffffff19b147e4e0 SLEEP    CV                      1
                 swtch+0x145
                 cv_wait+0x61
                 squeue_synch_enter+0x15a
                 tcp_connect+0x65
                 so_connect+0xf8
                 socket_connect+0x47
                 connect+0xec

In this case, thread ...120 is stuck at ip_attr.c:1226, while ...8a0 is stuck at ip_attr.c:1244. The third thread, ...4e0, is a victim. Both of the wedged threads have suffered from a missed wakeup at some time in the past.

The solution to this is somewhat complex. The mechanism provides us with the useful constraint that at most one connection in any given stack is being cleaned up at one time. But it imposes no limit on the number of threads waiting to clean up either the same or a different connection in the same stack, nor does it impose any limit on the number of threads waiting for a non-inline cleanup operation to complete (though the code is written with the implicit assumption that there is at most one thread waiting for either condition).

As a result, we must provide both reliable notification of completion and reliable notification that the per-stack shared mblk (work token) is available for the next operation. Therefore, simply changing tcp_ixa_cleanup() to broadcast its completion is insufficient; threads waiting on cleanup completion would still be racing with threads waiting to perform the next cleanup; there is no guarantee that these threads would ever see the shared mblk non-NULL and return (though this would prevent threads from getting stuck waiting to clean up another connection). Instead, we introduce a 3-state machine to the IXA itself and split the single cleanup CV into a completion CV and a token-available CV. An IXA that is not being cleaned up is in the IDLE state. Upon beginning cleanup, a thread sets its state to INPROGRESS, protected by the per-stack cleanup lock. When cleanup is complete, its state is set to COMPLETE at the same time the shared mblk is returned to the stack (again protected by the same lock). Finally, when the thread that either performed the cleanup inline or enqueued it for processing elsewhere receives a signal that processing has completed, it will set the state back to IDLE. Processing of a connection may not commence unless that connection's IXA is in the IDLE state.
Completion results in a signal to the next worker waiting to begin cleanup, and a broadcast to all threads waiting on completion; the latter is required because we leave both the completion and ready CVs in the stack rather than placing the completion CV in the IXA itself. Note that we also no longer need to distinguish between the inline and enqueue code paths: it is always safe to take the stack cleanup lock, check the state of an IXA, and if INPROGRESS, wait.
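
Ignoring the per-stack lock and the two CVs, the state machine described above can be sketched as follows; the enum and function names are illustrative, not the actual symbols added by the fix.

```c
#include <assert.h>

/* Illustrative IXA cleanup states (not the real symbol names). */
typedef enum { IXA_IDLE, IXA_INPROGRESS, IXA_COMPLETE } ixa_state_t;

typedef struct {
    ixa_state_t state;
} ixa_t;

/* Begin cleanup: only legal from IDLE; returns 0 if the caller must wait. */
static int ixa_cleanup_begin(ixa_t *ixa) {
    if (ixa->state != IXA_IDLE)
        return 0; /* real code: wait on the completion CV, then recheck */
    ixa->state = IXA_INPROGRESS;
    return 1;
}

/* Cleanup finished: return the work token and notify waiters. */
static void ixa_cleanup_done(ixa_t *ixa) {
    assert(ixa->state == IXA_INPROGRESS);
    ixa->state = IXA_COMPLETE;
    /* real code: return the shared mblk to the stack, signal the
     * token-available CV, broadcast on the completion CV */
}

/* The initiating thread observes completion and resets the state. */
static void ixa_cleanup_finish(ixa_t *ixa) {
    assert(ixa->state == IXA_COMPLETE);
    ixa->state = IXA_IDLE;
}
```

The invariant the three states enforce is exactly the one the prose describes: a second cleaner can never start on an IXA that is INPROGRESS or COMPLETE, and only the initiating thread moves it back to IDLE.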

All of the defects described in this issue can be found by setting ip:ip_dce_hash_size = 8 in /etc/system, ip_dce_reclaim_threshold to 8 using ndd(1M), and initiating a large (thousands per minute) number of TCP connections to the target system from unique IP addresses. These addresses need not be valid or reachable. nmap(1) is useful for generating this type of load; a single client system with 8 to 64 such nmap threads is sufficient. The kstat ip_dce_reclaim_calls can be used to verify that DCE reclaims are in fact occurring. The fix described in this issue and integrated into the source base was verified after 10 minutes of load, during which approximately 27,000 DCE reclaim calls were made.

The next related follow-up:

1. The tcp_close_output() mblk is put on SQ 2141340. The queue is empty, and there's only 1 mblk, so we take the fast path. We set SQS_PROC, drop the SQ lock, and call tcp_close_output().
2. While tcp_close_output() is doing whatever the hell it does, another thread decides to do DCE reclaim. It encounters a conn that belongs to SQ 2141340, queues it up, and waits for it to complete.
3. Meanwhile, tcp_close_output() has decided that it too needs to do DCE reclaim, and has started walking the connection list. As soon as it finds a connection for the same stack (i.e., almost certainly immediately), it will block waiting on the stack-shared mblk.

Given what we see here and what we learned previously, it is at best impractical to make the DCE reclaim path safe to call in any context. In particular, despite the clear intent of the author of conn_ixa_cleanup(), it is not possible to make this function safe to call when one is processing an squeue.

Therefore, we will replace the inline calls to ip_dce_reclaim_stack() (and for completeness, the kmem_cache reclaim callback) with a background thread that may be signaled when excessively full DCE hash buckets are detected.
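
The shape of that background thread can be sketched with POSIX threads standing in for the kernel primitives; the names and the one-bit "work pending" flag are illustrative, not the actual dce_reclaim_thread implementation.

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative stand-ins for the per-stack reclaim state. */
static pthread_mutex_t reclaim_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t reclaim_cv = PTHREAD_COND_INITIALIZER;
static int reclaim_needed = 0;
static int reclaims_done = 0;
static int shutting_down = 0;

/* Background worker: sleeps until signaled that a bucket is too full. */
static void *dce_reclaim_thread(void *arg) {
    (void)arg;
    pthread_mutex_lock(&reclaim_lock);
    for (;;) {
        while (!reclaim_needed && !shutting_down)
            pthread_cond_wait(&reclaim_cv, &reclaim_lock);
        if (shutting_down)
            break;
        reclaim_needed = 0;
        /* real code: walk the hash table and prune, safely outside
         * any squeue context */
        reclaims_done++;
    }
    pthread_mutex_unlock(&reclaim_lock);
    return NULL;
}

/* Called from the insert path instead of reclaiming inline. */
static void dce_reclaim_signal(void) {
    pthread_mutex_lock(&reclaim_lock);
    reclaim_needed = 1;
    pthread_cond_signal(&reclaim_cv);
    pthread_mutex_unlock(&reclaim_lock);
}
```

Because the insert path only sets a flag and signals, it can be called from any context, including while draining an squeue, which is the property the inline reclaim could not provide.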

Finally:

The root cause here is that the ip_xmit_attr_t is being bcopy()ed by the ironically named ixa_safe_copy() routine. Because this function's callers and conn_ixa_cleanup() do not share a locking strategy, this copy can occur during the small window during which we are cleaning up the ixa. If this occurs, the newly copied ixa will appear to be in the process of being cleaned up even though no thread is actually doing so. If a conn with the copied ixa is later selected for cleanup, the reclaim worker will be corked.

#2 Updated by Robert Mustacchi over 4 years ago

This covers the following commits from illumos-joyent:

OS-946 IP DCE does not scale
OS-1041 node process stuck in squeue mess
OS-1042 increase DCE cache tunables
OS-1056 ip_dce_reclaim_stack() cannot be made perfectly safe
OS-1059 new IP symbols need to be listed in symbol file
OS-1082 dce_reclaim_thread stops making forward progress

#3 Updated by Robert Mustacchi over 4 years ago

  • Status changed from Pending RTI to Resolved

Resolved in 7c6d7024e51780d3aacf9063d2133c1e957d7eea.

#4 Updated by Jason Matthews over 3 years ago

  • % Done changed from 100 to 90
  • Assignee set to Robert Mustacchi
  • Status changed from Resolved to Feedback

I still see this problem on OI151a9.

My webservers talk to several hundred million clients each month. I expected to eliminate the DCE cache memory leak by moving to 151a9, but the web tier started underperforming with just a couple of days' exposure, having not implemented the workaround.

Do you guys need access to a 'highly exposed' box for testing purposes?

#5 Updated by Robert Mustacchi over 3 years ago

  • % Done changed from 90 to 100
  • Status changed from Feedback to Closed

Can you please open a new bug related to this and include data such as flamegraphs, bucket sizes, etc.? Whatever we do will need to be done under a new bug anyway.
