Bug #3630

NFS server should not allocate oversized buffers

Added by Christopher Siden over 6 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Category: nfs - NFS server and client
Start date: 2013-03-15
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

From Sebastien Roy's bug report at Delphix:

While working on a customer case, I found that one of the biggest
bottlenecks in this workload was caused by an insane amount of CPU cross-call
activity.  Here is a typical 1-second mpstat sample:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0 14766  102  739   15  196 1555   37    56    1  38   0  61
  1    0   0    7 23159  110  594   22  145 1362   28   996   28  31   0  41
  2    0   0    0 15248   11  915   20  180 1931   59   371    3  41   0  56
  3    0   0 1286 17318 1372  944   38  191 1906   44   264    0  45   0  55
  4    0   0    0 13284   12  902   13  165 1502   52    39    0  35   0  65
  5    0   0    0 16654   11  841   10  177 2352   63    14    0  43   0  57
  6    0   0    0  9924   43  737   20  162 1438   44    18    1  29   0  70
  7    0   0 105914  5070 4839  102    7   30  511    3  3287   13  83   0   4

The analysis of the cause of these cross calls follows:

For each 1MB read request, the NFS server allocates a 1MB buffer to hold the
response data and sends that along to TCP.  Each time a TCP segment associated
with that data is ACKed by the client, TCP frees that segment.  Because the
memory associated with the segment is not allocated from a kmem cache (buffers
> 128KB are deemed oversized and do not benefit from a kmem cache), the free
operation results in synchronous cross-calls to every CPU.  In this case,
that's 8 CPUs.  Aside from mpstat showing the cross call activity, the
observable side-effect was that simple context switches (e.g. RPC signaling an
NFS thread to pick off a request from the request queue with cv_signal()) were
taking 10s of milliseconds instead of 10s of microseconds.

I got in touch with Brendan Gregg at Joyent to talk about this case, and he
confirmed from his experience working on the 7000 series that using rsize >
128KB resulted in exactly these symptoms.  In fact, this exact problem is
addressed in a case study in his book!  http://tinyurl.com/8ub3hxg

Knowing that buffers > 128KB are allocated from kmem_oversize, the workaround
we employed was to set the rsize (maximum read request size) on the
client's mount point to 128KB.  This eliminated virtually all cross calls on
Delphix, brought the time that requests spent in RPC service queues back down
to the expected 10s of microseconds, and increased overall NFS throughput (I
measured a ~20% improvement when testing with fio).  This got their job time
down to ~8 minutes (remember that the original time when we started was ~17
minutes), and we measured an NFS throughput increase of ~25% during their job.

Mantha then measured that the new bottleneck was the RPC request queue on the
client, caused by the client now having to issue 8 times as many requests as
before.
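
The factor of 8 follows from the sizes quoted above: a 1MB read split into
128KB requests.  A minimal sketch of that arithmetic, where requests_needed()
is a hypothetical helper written for illustration and not part of any NFS
code:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of read requests a client must issue to
 * fetch `total` bytes when each request is capped at `rsize` bytes.
 */
static size_t
requests_needed(size_t total, size_t rsize)
{
	assert(rsize > 0);
	return ((total + rsize - 1) / rsize);	/* round up */
}
```

With total = 1MB, an rsize of 1MB needs one request and an rsize of 128KB
needs eight.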

I was left unsatisfied with the false choice between sluggish CPU performance
on Delphix (when rsize is > 128KB) and the increased number of requests from
the client (when rsize is <= 128KB).  We could have the best of both worlds if
the NFS server simply never allocated buffers larger than the maximum
cacheable buffer size, and used an iovec to hold multiple smaller buffers
instead of one giant buffer.
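
The idea can be sketched in userland C as follows.  This is only an
illustration under stated assumptions, not the code from the eventual fix:
malloc() stands in for the kernel allocator, MAX_CACHED_BUF is an assumed
name for the 128KB threshold mentioned above, and alloc_chunked() is a
hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/uio.h>

/* Assumed threshold: largest size still served from a cache (128KB). */
#define	MAX_CACHED_BUF	((size_t)128 * 1024)

/*
 * Hypothetical helper: describe `len` bytes as a list of buffers, none
 * larger than MAX_CACHED_BUF, instead of one oversized buffer.
 * Returns the number of iovec entries used, or -1 if allocation fails
 * or iov[] has too few entries (any partial allocations are freed).
 */
static int
alloc_chunked(size_t len, struct iovec *iov, int iovmax)
{
	int n = 0;

	while (len > 0) {
		size_t chunk = (len < MAX_CACHED_BUF) ? len : MAX_CACHED_BUF;

		if (n == iovmax || (iov[n].iov_base = malloc(chunk)) == NULL) {
			while (n > 0)
				free(iov[--n].iov_base);
			return (-1);
		}
		iov[n].iov_len = chunk;
		n++;
		len -= chunk;
	}
	return (n);
}
```

For a 1MB read this yields eight 128KB entries, so each buffer can be freed
individually without going through the oversized path.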

With this scheme NFS clients can set rsize to their allowable maximums without
concern for falling off of a memory allocation cliff on Delphix.

The NFS write path doesn't appear to have this same problem.  Here's a size
breakdown of all kmem_alloc() sizes while processing a single 1MB NFS write
operation:

  kmem_alloc sizes                                  
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@                                        2        
               8 |@@@                                      6        
              16 |@@@@@@@                                  13       
              32 |@@@@@@@@@@@@@                            23       
              64 |@@@@@@@@                                 14       
             128 |@@@@@@@                                  12       
             256 |                                         0        
             512 |                                         0        
            1024 |                                         0        
            2048 |                                         0        
            4096 |                                         0        
            8192 |@                                        1        
           16384 |                                         0

History

#1

Updated by Christopher Siden about 6 years ago

  • Status changed from In Progress to Closed

#2

Updated by Christopher Siden about 6 years ago

commit e36d7b1
Author: Sebastien Roy <seb@delphix.com>
Date:   Tue May 21 16:31:47 2013

    3630 NFS server should not allocate oversized buffers
    Reviewed by: Jeff Biseda <jeff.biseda@delphix.com>
    Reviewed by: Eric Schrock <Eric.Schrock@delphix.com>
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
    Approved by: Albert Lee <trisk@nexenta.com>
