Bug #3630
NFS server should not allocate oversized buffers

Status: Closed
Start date: 2013-03-15
% Done: 100%
Difficulty: Medium
Tags: needs-triage
Description
From Sebastien Roy's bug report at Delphix:
While working on a customer case, I found that one of the biggest bottlenecks in this workload was caused by an insane amount of CPU cross-call activity. Here is a typical 1-second mpstat sample:

    CPU minf mjf   xcal  intr ithr  csw icsw migr smtx srw syscl usr sys  wt idl
      0    0   0      0 14766  102  739   15  196 1555  37    56   1  38   0  61
      1    0   0      7 23159  110  594   22  145 1362  28   996  28  31   0  41
      2    0   0      0 15248   11  915   20  180 1931  59   371   3  41   0  56
      3    0   0   1286 17318 1372  944   38  191 1906  44   264   0  45   0  55
      4    0   0      0 13284   12  902   13  165 1502  52    39   0  35   0  65
      5    0   0      0 16654   11  841   10  177 2352  63    14   0  43   0  57
      6    0   0      0  9924   43  737   20  162 1438  44    18   1  29   0  70
      7    0   0 105914  5070 4839  102    7   30  511   3  3287  13  83   0   4

The analysis of the cause of these cross calls follows. For each 1MB read request, the NFS server allocates a 1MB buffer to hold the response data and sends it along to TCP. Each time a TCP segment associated with that data is ACKed by the client, TCP frees that segment. Because the memory associated with the segment is not allocated from a kmem cache (buffers > 128KB are deemed oversized and do not benefit from a kmem cache), the free operation results in synchronous cross-calls to every CPU; in this case, that's 8 CPUs.

Aside from mpstat showing the cross-call activity, the observable side effect was that simple context switches (e.g., RPC signaling an NFS thread to pick a request off the request queue with cv_signal()) were taking tens of milliseconds instead of tens of microseconds.

I got in touch with Brendan Gregg at Joyent to talk about this case, and he confirmed from his experience working on the 7000 series that using rsize > 128KB resulted in exactly these symptoms. In fact, this exact problem is addressed in a case study in his book! http://tinyurl.com/8ub3hxg

Knowing that buffers > 128KB are allocated from kmem_oversize, the workaround we employed was to set the rsize (maximum read request size) on the client's mount point to 128KB. This eliminated virtually all cross calls on Delphix, brought the time requests spent in RPC service queues back down to the expected tens-of-microseconds range, and increased overall NFS throughput (I measured a ~20% improvement when testing with fio). This got their job time down to ~8 minutes (the original time when we started was ~17 minutes), and we measured an NFS throughput increase of ~25% during their job. Mantha then measured that a new bottleneck was the RPC request queue on the client, caused by the client now having to issue 8 times as many requests as before.

I was left unsatisfied with the false choice between sluggish CPU performance on Delphix (when rsize is > 128KB) and an increased number of requests from the client (when rsize is <= 128KB). We could have the best of both worlds if the NFS server simply never allocated buffers larger than the maximum cacheable size, and used the iovec to hold multiple buffers instead of one giant buffer. With this scheme, NFS clients can set rsize to their allowable maximums without concern for falling off a memory allocation cliff on Delphix.

The NFS write path doesn't appear to have this same problem. Here's a size breakdown of all kmem_alloc() sizes while processing a single 1MB NFS write operation:

    kmem_alloc sizes
              value  ------------- Distribution ------------- count
                  2 |                                         0
                  4 |@                                        2
                  8 |@@@                                      6
                 16 |@@@@@@@                                  13
                 32 |@@@@@@@@@@@@@                            23
                 64 |@@@@@@@@                                 14
                128 |@@@@@@@                                  12
                256 |                                         0
                512 |                                         0
               1024 |                                         0
               2048 |                                         0
               4096 |                                         0
               8192 |@                                        1
              16384 |                                         0
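To make the proposed chunking concrete, here is a minimal, hypothetical C sketch of the idea: build the read reply from several buffers no larger than the biggest kmem-cacheable size and describe them with an iovec, rather than making one oversized allocation. The names (MAX_CACHED_ALLOC, reply_chunks_alloc) and the userland malloc()/free() stand-ins are illustrative assumptions, not the identifiers used in the actual fix.

    /*
     * Sketch: allocate `len` bytes of reply space as <= 128KB chunks,
     * each small enough to come from a kmem cache rather than
     * kmem_oversize, and describe them with an iovec array.
     */
    #include <stdlib.h>
    #include <sys/uio.h>

    #define MAX_CACHED_ALLOC (128 * 1024)  /* largest cacheable size */

    static struct iovec *
    reply_chunks_alloc(size_t len, int *niovp)
    {
        int n = (len + MAX_CACHED_ALLOC - 1) / MAX_CACHED_ALLOC;
        struct iovec *iov = calloc(n, sizeof (struct iovec));

        if (iov == NULL)
            return (NULL);
        for (int i = 0; i < n; i++) {
            size_t sz = (len < MAX_CACHED_ALLOC) ? len : MAX_CACHED_ALLOC;

            /* In-kernel this would be kmem_alloc(sz, KM_SLEEP). */
            iov[i].iov_base = malloc(sz);
            if (iov[i].iov_base == NULL) {
                while (i-- > 0)
                    free(iov[i].iov_base);
                free(iov);
                return (NULL);
            }
            iov[i].iov_len = sz;
            len -= sz;
        }
        *niovp = n;
        return (iov);
    }

Under this layout, a 1MB read reply becomes eight cache-backed 128KB buffers instead of a single kmem_oversize segment, so none of the eventual frees triggers the cross-call storm described above, and the read path can fill the chunks through a uio built from the same iovec.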
Updated by Christopher Siden about 10 years ago
- Status changed from In Progress to Closed
Updated by Christopher Siden about 10 years ago
commit e36d7b1
Author: Sebastien Roy <seb@delphix.com>
Date:   Tue May 21 16:31:47 2013

    3630 NFS server should not allocate oversized buffers
    Reviewed by: Jeff Biseda <jeff.biseda@delphix.com>
    Reviewed by: Eric Schrock <Eric.Schrock@delphix.com>
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Reviewed by: Marcel Telka <marcel.telka@nexenta.com>
    Approved by: Albert Lee <trisk@nexenta.com>