
Bug #4628

NFS extremely slow in nfsauth

Added by Marcel Telka about 6 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
nfs - NFS server and client
Start date:
2014-02-25
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:

Description

While testing the NFS server with 4500 clients, the server became very slow, spending most of its time in nfsauth.

Here is a typical snapshot of the NFS server thread stacks:

> ::stacks -m rpcmod
THREAD           STATE    SOBJ                COUNT
ffffff1a7062b1a0 SLEEP    RWLOCK                704
                 swtch+0x145
                 turnstile_block+0x760
                 rw_enter_sleep+0x205
                 nfsauth_cache_get+0x8b
                 nfsauth_access+0x102
                 checkauth+0xea
                 common_dispatch+0x2ac
                 rfs_dispatch+0x2d
                 svc_getreq+0x19c
                 svc_run+0x1d6
                 svc_do_run+0x81
                 nfssys+0x765
                 dtrace_systrace_syscall32+0x11a
                 _sys_sysenter_post_swapgs+0x149

ffffff1a040e98a0 SLEEP    RWLOCK                189
                 swtch+0x145
                 turnstile_block+0x760
                 rw_enter_sleep+0x205
                 nfsauth_cache_get+0x8b
                 nfsauth_access+0x102
                 rfs3_access+0x76
                 common_dispatch+0x384
                 rfs_dispatch+0x2d
                 svc_getreq+0x19c
                 svc_run+0x1d6
                 svc_do_run+0x81
                 nfssys+0x765
                 dtrace_systrace_syscall32+0x11a
                 _sys_sysenter_post_swapgs+0x149

ffffff1a03fdf8a0 SLEEP    RWLOCK                132
                 swtch+0x145
                 turnstile_block+0x760
                 rw_enter_sleep+0x1a3
                 nfsauth_cache_get+0x31e
                 nfsauth_access+0x102
                 checkauth+0xea
                 common_dispatch+0x2ac
                 rfs_dispatch+0x2d
                 svc_getreq+0x19c
                 svc_run+0x1d6        
                 svc_do_run+0x81      
                 nfssys+0x765         
                 dtrace_systrace_syscall32+0x11a
                 _sys_sysenter_post_swapgs+0x149

ffffff00b98dec60 SLEEP    CV                      3
                 swtch+0x145          
                 cv_wait+0x61         
                 svc_thread_creator+0xbf
                 thread_start+8       

ffffff19a79e1b00 SLEEP    CV                      2
                 swtch+0x145          
                 cv_wait_sig+0x14d    
                 svc_wait+0xb9        
                 nfssys+0x792         
                 _sys_sysenter_post_swapgs+0x149

ffffff1a03fb8e60 SLEEP    CV                      1
                 swtch+0x145          
                 cv_wait_sig+0x14d    
                 svc_wait+0xb9        
                 nfssys+0x792         
                 dtrace_systrace_syscall32+0x11a
                 _sys_sysenter_post_swapgs+0x149

And the queue is full of pending requests:

> ::svc_pool nfs
SVCPOOL = ffffff1a46ceb090 -> POOL ID = NFS(1)
Non detached threads    = 1025
Detached threads        = 0
Max threads             = 1024
`redline'               = 1
Reserved threads        = 0
Thread lock     = mutex not held
Asleep threads          = 0
Request lock    = mutex not held
Pending requests        = 61070
Walking threads         = 0
Max requests from xprt  = 8
Stack size for svc_run  = 0
Creator lock    = mutex not held
No of Master xprt's     = 62097
rwlock for the mxprtlist= owner 0   
master xprt list ptr    = ffffff19e295f398

>

During this, the machine is idle:

# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 1 0 0 77191288 88817792 68 618 9 0 0 0 22 -0 -0 -0  2 23613 10706 16138 0 4 96
 0 0 0 73385328 85063032 51 104 0 0 0 0  0  0  0  0  0 16796 833  980  0  0 100
 1 0 0 73385200 85062932 0 4 0  0  0  0  0  0  0  0  0 16917 492 1027  0  0 100
 0 0 0 73385200 85062932 0 0 0  0  0  0  0  0  0  0  0 16772 389  845  0  0 100
 1 0 0 73385200 85062932 0 0 0  0  0  0  0  0  0  0  0 16860 388  852  0  0 100
^C
#

The problem seems to be in the communication with mountd:

> ::stacks -c nfsauth_retrieve
THREAD           STATE    SOBJ                COUNT
ffffff00b9676c60 SLEEP    SHUTTLE                 1
                 swtch_to+0xe6
                 shuttle_resume+0x325
                 door_upcall+0x243
                 door_ki_upcall_limited+0x69
                 stubs_common_code+0x51
                 nfsauth_retrieve+0x249
                 nfsauth_refresh_thread+0x143
                 thread_start+8

>

The bug is in nfsauth_refresh_thread(), where exi_cache_lock is held for a very long time. It is held during the refresh of all auth entries for a particular exportinfo (the while loop), and it is also held across the nfsauth_retrieve() call, which might be very slow.

During my tests I saw nfsauth_retrieve() take more than 3 seconds to complete. This was caused primarily by a slow DNS service:

10:27:45.55562  10.3.53.167 -> 10.3.40.245  DNS C 160.4.168.192.in-addr.arpa. Internet PTR ?
10:27:48.59405  10.3.40.245 -> 10.3.53.167  DNS R  Error: 3(Name Error)

Holding exi_cache_lock in nfsauth_refresh_thread() caused all other threads to pile up in nfsauth_cache_get(), slowing down all NFS operations.

The fix removes the exi_cache_lock hold from nfsauth_refresh_thread(); in fact, there is no need to hold it there.

While touching nfsauth_refresh_thread(), I also did some minor code deduplication in the while loop.

History

#1

Updated by Marcel Telka about 6 years ago

  • Status changed from In Progress to Pending RTI
#2

Updated by Robert Mustacchi about 6 years ago

  • Status changed from Pending RTI to Resolved
  • % Done changed from 0 to 100
  • Tags deleted (needs-triage)
#3

Updated by Electric Monk about 6 years ago

git commit 35107df554395996bb53e7b81e5eedee9018e582

Author: Marcel Telka <marcel.telka@nexenta.com>

4628 NFS extremely slow in nfsauth
Reviewed by: Jan Kryl <jan.kryl@nexenta.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Robert Mustacchi <rm@joyent.com>
