Bug #15019


TCP and UDP default send_buf and recv_buf values should be increased

Added by Randy Fishel 2 months ago. Updated 2 months ago.


The default high water marks (a.k.a. send_buf and recv_buf) for TCP and UDP are outdated for modern machines and should be increased. While a project to make them more dynamic, similar to other OSes, would be the better long-term fix, quick relief can be had by just increasing the relevant HIWATER marks, such as:

diff --git a/usr/src/uts/common/inet/tcp_impl.h b/usr/src/uts/common/inet/tcp_impl.h
index 5669592cff..1825765390 100644
--- a/usr/src/uts/common/inet/tcp_impl.h
+++ b/usr/src/uts/common/inet/tcp_impl.h
@@ -61,9 +61,9 @@ extern sock_downcalls_t sock_tcp_downcalls;
  * by setting it to 0.
 #define        TCP_XMIT_LOWATER        4096
-#define        TCP_XMIT_HIWATER        49152
+#define        TCP_XMIT_HIWATER        8388608
 #define        TCP_RECV_LOWATER        2048
-#define        TCP_RECV_HIWATER        128000
+#define        TCP_RECV_HIWATER        8388608

  * Bind hash list size and has function.  It has to be a power of 2 for
diff --git a/usr/src/uts/common/inet/udp_impl.h b/usr/src/uts/common/inet/udp_impl.h
index 0fc597ccf3..6db103025b 100644
--- a/usr/src/uts/common/inet/udp_impl.h
+++ b/usr/src/uts/common/inet/udp_impl.h
@@ -119,9 +119,9 @@ typedef struct {
 #define        UDP_NUM_EPRIV_PORTS     64

 /* Default buffer size and flow control wake up threshold. */
-#define        UDP_RECV_HIWATER        (56 * 1024)
+#define        UDP_RECV_HIWATER        4194304
 #define        UDP_RECV_LOWATER        128
-#define        UDP_XMIT_HIWATER        (56 * 1024)
+#define        UDP_XMIT_HIWATER        4194304
 #define        UDP_XMIT_LOWATER        1024

My observations on some 10GB+ machines show no significant improvement in performance beyond 8M for TCP and 4M for UDP, but other experiments and observations might suggest better values.
I've also observed that changing the default value in the source gives better results than changing the values via ipadm. The hypothesis is that the initial default values in netstack are used for the high water values elsewhere in the lower stack, but I haven't found the specific detail to prove or disprove this.
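
One way to start narrowing down the ipadm hypothesis from user space is to check which buffer sizes a freshly created socket reports. The following is a minimal sketch, assuming that the send_buf and recv_buf defaults surface as the SO_SNDBUF and SO_RCVBUF socket options; the helper name default_socket_bufs is hypothetical and not part of any existing tool. It only shows what the socket layer reports, not what the lower layers use internally.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Report the send and receive buffer sizes that the stack assigns to a
 * freshly created IPv4 socket of the given type (SOCK_STREAM for TCP,
 * SOCK_DGRAM for UDP). Returns 0 on success, -1 on failure.
 * Hypothetical helper for illustration only.
 */
int
default_socket_bufs(int type, int *sndbuf, int *rcvbuf)
{
	socklen_t len;
	int rc = -1;
	int fd = socket(AF_INET, type, 0);

	if (fd < 0)
		return (-1);

	len = sizeof (*sndbuf);
	if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, sndbuf, &len) == 0) {
		len = sizeof (*rcvbuf);
		if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, rcvbuf, &len) == 0)
			rc = 0;
	}

	(void) close(fd);
	return (rc);
}
```

Calling it with SOCK_STREAM and SOCK_DGRAM before and after changing the properties with ipadm would show whether new sockets pick up the changed values, which helps separate a socket-level problem from one lower in the stack.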

Also note that the high water marks for other protocols, such as icmp and sctp, should be considered as well:

      icmp  recv_buf:  8192
      icmp  send_buf:  8192
      sctp  recv_buf:  102400
      sctp  send_buf:  102400

Actions #1

Updated by Randy Fishel 2 months ago

Seems that cut-and-paste didn't quite do what I wanted, but it should have shown that TCP_XMIT_HIWATER and TCP_RECV_HIWATER changed to 8388608, and the corresponding UDP values changed to 4194304.

Actions #2

Updated by Patrick Mooney 2 months ago

  • Description updated (diff)
Actions #3

Updated by Joshua M. Clulow 2 months ago

A few thoughts:

While this may improve performance on some systems, I think it's worth noting that it would allow processes to use vastly more kernel memory than they are presently able to use with the same file descriptor limit. This can be a particularly big problem for stuck processes that stop reading their sockets for some reason -- especially when stuck processes end up accumulating together on a host. As far as I know there is no specific resource control that caps this memory use.
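
To put rough numbers on this concern, the sketch below multiplies a socket count by the sum of the two TCP high water marks. It is a back-of-the-envelope bound only: it assumes every socket fills both buffers completely and ignores mblk overhead, and the function name and the socket count used afterwards are hypothetical. The buffer values come from the diff in the description.

```c
#include <stdint.h>

/* TCP values taken from the diff in the description. */
#define	OLD_TCP_XMIT_HIWATER	49152ULL
#define	OLD_TCP_RECV_HIWATER	128000ULL
#define	NEW_TCP_HIWATER		8388608ULL	/* proposed for both directions */

/*
 * Rough upper bound on bytes buffered in the kernel for nsockets TCP
 * sockets, assuming each one fills its send and receive buffers up to
 * the high water mark. Hypothetical helper for illustration only.
 */
uint64_t
worst_case_buffered_bytes(uint64_t nsockets, uint64_t xmit_hiwat,
    uint64_t recv_hiwat)
{
	return (nsockets * (xmit_hiwat + recv_hiwat));
}
```

For a single socket the bound goes from 177152 bytes to 16777216 bytes. For a process holding 1024 such sockets (an illustrative count, not a measured one), it goes from 181403648 bytes (about 173 MiB) to 17179869184 bytes (16 GiB).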

I think it's worth digging in on the suggestion that dynamic tuning is not as good as static tuning -- it is likely that this is just a bug, and presumably it will prevent tuning down for systems that require it as much as it apparently prevents tuning everything up today.

Finally, it'd be good to nail down a methodology for measuring the gains, and to compare with other contemporary UNIX platforms to make sure we're heading into the right ballpark.

Actions #4

Updated by Randy Fishel 2 months ago

Adding or increasing buffer sizes, high water marks, caches, etc., will always add to system memory consumption and increase the risk of hitting maximums. How much it impacts a particular system depends on its configuration. Modern systems that we choose to support will, however, likely either have a large amount of memory or have a configuration that is unlikely to consume much of this memory due to a lower need.

In the case of networking, mblks (with the exception of a previously described allocb_oversize issue) come from one of the STREAMS caches. Not only would memory pressure be visible by looking at the STREAMS and kmem caches via mdb, but in most cases the system will adjust cache sizes based on actual usage (lower-memory systems trying to handle high-performance networking are a potential issue independent of default values).

My experiments were run with various combinations of Illumos-based OSes, Linux, and FreeBSD, under various loads using sockets, NFS, and SMB protocols, and that is how I came up with the suggested numbers (I was achieving similar results on Illumos as I was seeing on Linux and FreeBSD, sometimes better). While I wasn't paying close attention to memory consumption, overall kernel memory usage and cache sizes showed no sign of extraordinary changes.

Most measurements were taken using iperf, as it is an easy test to set up and use. For the knowledgeable person: I was able to saturate a 25GB network link using the '-P 1' option (also keeping the window size at 64k to avoid calling allocb_oversize). Memory consumption was checked by running multiple iperfs over multiple NICs using large values for '-P' and performing large-file NFS transfers; kernel memory barely budged, especially after caches stabilized.

One of the prototypes I am creating places the default high water marks in kernel variables so that they can be adjusted via /etc/system, requiring only a reboot to see system changes instead of a new kernel (yes, I know that they can be changed with ipadm, but as mentioned, I see significant improvement when changing the default vs. just using ipadm).
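
The general shape of such a prototype might look like the sketch below. This is not the actual prototype code: the variable names, the tcp_defaults_t type, and the init function are all hypothetical, and the real netstack initialisation path is not shown. The macro values are the current ones from the diff in the description.

```c
#include <stdint.h>

/* Compile-time defaults as they stand before the proposed change. */
#define	TCP_XMIT_HIWATER	49152
#define	TCP_RECV_HIWATER	128000

/*
 * Hypothetical tunables: plain globals initialised from the macros, so
 * that a boot-time setting could override them without a rebuild. The
 * names are illustrative only.
 */
uint32_t tcp_default_xmit_hiwat = TCP_XMIT_HIWATER;
uint32_t tcp_default_recv_hiwat = TCP_RECV_HIWATER;

/* Hypothetical stand-in for the per-netstack TCP defaults. */
typedef struct {
	uint32_t xmit_hiwat;
	uint32_t recv_hiwat;
} tcp_defaults_t;

/* Seed a new stack's defaults from the tunables instead of the macros. */
void
tcp_defaults_init(tcp_defaults_t *d)
{
	d->xmit_hiwat = tcp_default_xmit_hiwat;
	d->recv_hiwat = tcp_default_recv_hiwat;
}
```

The point of the indirection is that every consumer of the default reads the variable rather than the macro, so an override applied at boot reaches all of them before any stack instance is initialised.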

On this latter point (changing defaults vs. ipadm), my hypothesis is that IP or some component below IP is using the initial value in netstack before it can be changed by ipadm. While this would likely be a bug, I still assert that the default values that the typical user would see are out of date and should be increased; possibly not by as much as my suggestion, but my current observation is that the impact on running systems isn't particularly significant.

Maybe the aforementioned prototype (set in a kernel tunable) is worthwhile, as others can run similar experiments to see how it might impact their load or usage. It might also lead to a more practical dynamic method.

