svc_queueclose() could wake up too many threads
In some cases we could have many threads (hundreds) sleeping in svc_poll(), waiting for new work on the pool->p_req_cv condition variable. Here:
	/*
	 * Count us as asleep, mark this thread as safe
	 * for suspend and wait for a request.
	 */
	pool->p_asleep++;
	timeleft = cv_reltimedwait_sig(&pool->p_req_cv,
	    &pool->p_req_lock, pool->p_timeout, TR_CLOCK_TICK);
This could easily happen, for example, when you have hundreds of clients issuing NFS operations serially, with short delays between the calls. It is even easier to observe with very fast NFS operations, like NFS NULL.
In this scenario, when one client disconnects and svc_queueclose() is called, we might end up waking up all those sleeping threads:
	/*
	 * Wakeup threads sleeping in svc_poll() so that they
	 * unlink from the transport
	 */
	mutex_enter(&xprt->xp_pool->p_req_lock);
	cv_broadcast(&xprt->xp_pool->p_req_cv);
	mutex_exit(&xprt->xp_pool->p_req_lock);
We wake up all sleeping threads because we have no other way to wake only the one thread (or the few threads) attached to the transport being closed.
We are waking up the threads for one reason only: so they can unlink from the transport, allowing the transport to be properly closed and destroyed.
The problem with this is that several hundred threads are woken up only to find they have nothing to do (because they are not attached to the closing transport), after which they immediately go back to sleep.
It is well known that waking up many threads for no reason with cv_broadcast() (the thundering herd problem) causes a huge performance penalty and should be avoided where possible.
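The effect is easy to demonstrate outside the kernel. The following sketch (pthreads instead of the kernel cv_* primitives; all names are hypothetical) models the svc_queueclose() pattern: NTHREADS waiters sleep on a single condition variable, one broadcast wakes all of them, and every waiter except the one with actual work is woken for nothing:

```c
#include <pthread.h>

#define	NTHREADS	8

/* Hypothetical model state; req_cv plays the role of p_req_cv. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t req_cv = PTHREAD_COND_INITIALIZER;
static pthread_cond_t ack_cv = PTHREAD_COND_INITIALIZER;	/* rendezvous */
static int asleep, woken, useless, pending_work, generation;

static void *
waiter(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	if (++asleep == NTHREADS)
		pthread_cond_signal(&ack_cv);	/* everyone is asleep now */
	while (generation == 0)			/* sleep, as in svc_poll() */
		pthread_cond_wait(&req_cv, &lock);
	if (pending_work > 0)
		pending_work--;			/* this wakeup was useful */
	else
		useless++;			/* woken just to go back to sleep */
	if (++woken == NTHREADS)
		pthread_cond_signal(&ack_cv);
	pthread_mutex_unlock(&lock);
	return (NULL);
}

/* Broadcast to all NTHREADS waiters; returns how many wakeups were useless. */
static int
broadcast_demo(int work)
{
	pthread_t t[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, waiter, NULL);
	pthread_mutex_lock(&lock);
	while (asleep < NTHREADS)		/* wait until all are parked */
		pthread_cond_wait(&ack_cv, &lock);
	pending_work = work;
	generation = 1;
	pthread_cond_broadcast(&req_cv);	/* the svc_queueclose() pattern */
	while (woken < NTHREADS)		/* wait until all have woken */
		pthread_cond_wait(&ack_cv, &lock);
	pthread_mutex_unlock(&lock);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return (useless);
}
```

With work = 1, NTHREADS - 1 wakeups are useless; scale NTHREADS to the hundreds of nfsd threads seen here and each svc_queueclose() call turns into hundreds of pointless context switches.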
We saw this issue at a customer site (please see the attached customer.svg flame graph), where more than 22% of the time is spent spinning in cv_reltimedwait_sig().
The problem is also visible in the crash dump we have from the customer. There are many runnable threads (waiting for a CPU) in cv_reltimedwait_sig() and a few others running on CPU (plus 50 threads sleeping in cv_reltimedwait_sig(); these are the threads that just recently went back to sleep):
> ::stacks -m rpcmod -c cv_reltimedwait_sig
THREAD           STATE    SOBJ                COUNT
ffffff439b1c3080 RUN      <NONE>                283
                 swtch+0x141
                 cv_timedwait_sig_hires+0x39d
                 cv_reltimedwait_sig+0x4f
                 svc_poll+0x1f2
                 svc_run+0x16e
                 svc_do_run+0x8e
                 nfssys+0xf1
                 dtrace_systrace_syscall32+0xe4
                 _sys_sysenter_post_swapgs+0x149

ffffff439dc680c0 SLEEP    CV                     50
                 swtch+0x141
                 cv_timedwait_sig_hires+0x39d
                 cv_reltimedwait_sig+0x4f
                 svc_poll+0x1f2
                 svc_run+0x16e
                 svc_do_run+0x8e
                 nfssys+0xf1
                 dtrace_systrace_syscall32+0xe4
                 _sys_sysenter_post_swapgs+0x149

ffffff439f9c5060 ONPROC   <NONE>                  1
                 0
                 default_lock_delay+0x3f
                 mutex_vector_enter+0xcc
                 cv_timedwait_sig_hires+0x3c9
                 cv_reltimedwait_sig+0x4f
                 svc_poll+0x1f2
                 svc_run+0x16e
                 svc_do_run+0x8e
                 nfssys+0xf1
                 dtrace_systrace_syscall32+0xe4
                 _sys_sysenter_post_swapgs+0x149

ffffff439e0fd500 ONPROC   <NONE>                  1
                 0xb
                 _resume_from_idle+0xec
                 cv_timedwait_sig_hires+0x39d
                 cv_reltimedwait_sig+0x4f
                 svc_poll+0x1f2
                 svc_run+0x16e
                 svc_do_run+0x8e
                 nfssys+0xf1
                 dtrace_systrace_syscall32+0xe4
                 _sys_sysenter_post_swapgs+0x149
>
We were also able to reproduce the issue in our lab (see attached lab.svg flame graph).