add tunables to combat scheduling delay of kernel threads
The problem is that the threads running mac_soft_ring_worker() and squeue_worker() need to be scheduled promptly after receiving the network interrupt indicating that we have received a packet. These threads are running at the highest priority (pri=99). The mac_soft_ring_worker() threads are bound to specific CPUs, which exacerbates the scheduling problem, because if even one CPU will not allow other threads to run, then the connections assigned to the worker threads bound to that CPU will not be serviced.
There are two main issues that cause the networking worker threads to not be scheduled when they are runnable:
1. If another kernel thread is running, it will be allowed to run until it is voluntarily descheduled (e.g. by blocking or calling kpreempt()). There are several kernel threads that can use CPU for a long time without blocking (e.g. kmem_depot_ws_reap(), vmem_hash_rescale(), fsflush(), and zio threads). Note that when a higher-priority thread becomes runnable, the CPU will be interrupted (via poke_cpu()), but this is a waste of time, because the interrupt handler does not make any scheduling decisions if a kernel thread is running.
2. Even if the above were addressed, some kernel threads can run for a long time at the highest priority (pri=99), so even if the scheduler is invoked, these threads will continue to run.
The solution has several parts:
1. Kernel preemption (disabled by default)
We will allow higher-priority kernel threads to preempt lower-priority kernel threads. This ensures that lower-priority threads (e.g. kmem_depot_ws_reap()) will be forced to promptly give up the CPU so that the networking threads can run. This may have a wide-ranging performance impact.
2. ZIO threads reduced to pri=98.
The zio threads run in the SYSDC (Duty Cycle) scheduling class, which runs the threads (when on duty) at pri=99. This is changed to pri=98. SYSDC is only used by zio threads, so the impact is small.
3. Shrink vmem hashtable only if 8x as large as necessary.
vmem_hash_rescale() is called every 15 seconds by timeout(), which runs at pri=99. If a vmem arena's hash table is more than twice its ideal size (or less than half its ideal size), the hash table is resized, which takes up to 1.5 seconds on a test machine. I observed that this resizing would happen periodically, as the number of hash entries grew and shrank. To reduce this oscillation, I changed it to shrink the hash table only once it is 8x its ideal size. The threshold for growing the table is unchanged.
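The asymmetric resize thresholds described for solution 3 can be sketched as a small decision function. This is a minimal user-space illustration, not the code from the commit: `hash_should_resize` and `SHRINK_SHIFT` are hypothetical names, and the way the ideal size is computed is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical shift controlling how oversized the table must be before
 * it is shrunk: 3 means "8x the ideal size". Not the real tunable name.
 */
#define	SHRINK_SHIFT	3

/*
 * Decide whether a hash table with cur_size buckets should be resized,
 * given that ideal_size buckets would be ideal for the current load.
 * Growing keeps the 2x threshold; shrinking requires the table to be
 * more than 8x too large.
 */
bool
hash_should_resize(size_t cur_size, size_t ideal_size)
{
	assert(cur_size > 0 && ideal_size > 0);

	if (ideal_size > (cur_size << 1))
		return (true);	/* table is less than half its ideal size */
	if ((cur_size >> SHRINK_SHIFT) > ideal_size)
		return (true);	/* table is more than 8x its ideal size */
	return (false);		/* inside the hysteresis band: leave it alone */
}
```

With these thresholds, a table that is 4x too large (for example 1024 buckets with an ideal size of 256) is left alone, so a workload whose entry count swings by a factor of a few no longer triggers a resize on each swing.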
4. kmem_depot_ws_reap() periodically calls kpreempt() (disabled by default)
Force kmem_depot_ws_reap() to periodically invoke the scheduler by calling kpreempt(). This is not necessary if kernel preemption is enabled, but is a less invasive change in behavior.
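The pattern in solution 4, a long-running loop that periodically offers the CPU to the scheduler, can be sketched in user space as follows. This is an illustration under stated assumptions rather than the kernel change itself: `reap_with_yield`, `yield_cpu`, and `REAP_YIELD_INTERVAL` are hypothetical names, POSIX `sched_yield()` stands in for kpreempt(), and the real code may use a different trigger for yielding than an item count.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical threshold: offer the CPU after this many items. */
#define	REAP_YIELD_INTERVAL	64

/* User-space stand-in for kpreempt(): let other runnable threads run. */
void
yield_cpu(void)
{
	(void) sched_yield();
}

/*
 * Process nitems units of reap work, calling yield() after every
 * REAP_YIELD_INTERVAL items so that a runnable higher-priority thread
 * is not kept waiting for the whole pass.
 * Returns how many times yield() was called.
 */
size_t
reap_with_yield(size_t nitems, void (*yield)(void))
{
	size_t yields = 0;

	assert(yield != NULL);

	for (size_t i = 0; i < nitems; i++) {
		/* The per-item reap work would go here. */
		if ((i + 1) % REAP_YIELD_INTERVAL == 0) {
			yield();
			yields++;
		}
	}
	return (yields);
}
```

Calling `reap_with_yield(200, yield_cpu)` yields after items 64, 128, and 192. The interval trades responsiveness against the overhead of entering the scheduler more often.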
5. Kernel timeslicing (disabled by default)
Every tick (10ms), we will reevaluate the kernel threads' priorities and preempt the running thread if there is a higher-priority thread runnable (and optionally, if there is a same-priority thread runnable).
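The per-tick decision described for solution 5 reduces to a comparison of priorities. The sketch below is a hypothetical user-space illustration: `tick_should_preempt` and its parameters are invented for this example, and finding the best runnable priority (which the scheduler's dispatch queues would provide) is outside its scope.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Per-tick check: should the running kernel thread be preempted?
 *   cur_pri           priority of the thread currently on the CPU
 *   best_runnable_pri highest priority among runnable threads,
 *                     or -1 if nothing else is runnable
 *   same_pri_too      also preempt for an equal-priority runnable thread
 */
bool
tick_should_preempt(int cur_pri, int best_runnable_pri, bool same_pri_too)
{
	assert(cur_pri >= 0);

	if (best_runnable_pri < 0)
		return (false);		/* nothing else wants the CPU */
	if (best_runnable_pri > cur_pri)
		return (true);		/* higher-priority thread is waiting */
	return (same_pri_too && best_runnable_pri == cur_pri);
}
```

With `same_pri_too` set, two pri=99 threads would share the CPU in tick-sized slices; without it, a pri=99 thread is displaced only by blocking or yielding voluntarily.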
Updated by Daniel Kimmel about 5 years ago
My mistake: we decided to revert solution 5 described above, for the following reason:
The problem is that clock_tick_process() grabs t_plockp and then assumes that prevents the thread from being freed.
If the thread has an LWP, then it will have exited via lwp_exit(), which grabs t_procp->p_lock and sets TP_LWPEXIT. This prevents clock_tick_process() from processing this thread.
However, if the thread does not have an LWP (i.e. t_lwp == NULL, t_procp == &p0), then it will exit via thread_exit(), which does nothing with p_lock or t_plockp or TP_LWPEXIT. In this case, clock_tick_process() continues processing the thread after it has been freed, leading to this panic.
I'm not sure of the best way to add locking to prevent this, so for now the fix will be to revert the code that allows clock_tick_process() to work on LWP-less threads.
This bug will only address solutions 1-4.
Updated by Daniel Kimmel about 5 years ago
Updates after some code review feedback:
- We are going to drop solution 1 (the kernel preemption piece).
- We are going to enable solution 4 (the kmem_depot_ws_reap fix) by default.
So a new description of these fixes is:
- ZIO threads reduced to pri=98
- shrink vmem hashtable only if it's 8x as large as necessary
- kmem_depot_ws_reap() periodically calls kpreempt()
Updated by Electric Monk about 5 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
commit 1c207ae9a87e9b2ff04aa0cdad948fee76ac5dd7
Author: Matthew Ahrens <email@example.com>
Date: 2016-09-09T20:56:55.000Z

7232 add tunables to combat scheduling delay of kernel threads
Reviewed by: Paul Dagnelie <firstname.lastname@example.org>
Reviewed by: George Wilson <email@example.com>
Reviewed by: Sebastien Roy <firstname.lastname@example.org>
Reviewed by: Josef 'Jeff' Sipek <email@example.com>
Reviewed by: Bryan Cantrill <firstname.lastname@example.org>
Approved by: Dan McDonald <email@example.com>