add tunables to combat scheduling delay of kernel threads
The problem is that the threads running mac_soft_ring_worker() and squeue_worker() need to be scheduled promptly after receiving the network interrupt indicating that we have received a packet. These threads are running at the highest priority (pri=99). The mac_soft_ring_worker() threads are bound to specific CPUs, which exacerbates the scheduling problem, because if even one CPU will not allow other threads to run, then the connections assigned to the worker threads bound to that CPU will not be serviced.
There are two main issues that cause the networking worker threads to not be scheduled when they are runnable:
1. If another kernel thread is running, it will be allowed to run until it voluntarily gives up the CPU (e.g. by blocking or calling kpreempt()). Several kernel threads can use the CPU for a long time without blocking (e.g. kmem_depot_ws_reap(), vmem_hash_rescale(), fsflush(), and zio threads). Note that when a higher-priority thread becomes runnable, the CPU will be interrupted (via poke_cpu()), but this accomplishes nothing, because the interrupt handler does not make any scheduling decisions if a kernel thread is running.
2. Even if the above were addressed, some kernel threads can run for a long time at the highest priority (pri=99), so even if the scheduler is invoked, these threads will continue to run.
The solution has several parts:
1. Kernel preemption (disabled by default)
We will allow higher-priority kernel threads to preempt lower-priority kernel threads. This ensures that lower-priority threads (e.g. kmem_depot_ws_reap()) will be forced to promptly give up the CPU so that the networking threads can run. Because it changes scheduling behavior for all kernel threads, this may have a wide-ranging performance impact.
2. ZIO threads reduced to pri=98.
The zio threads run in the SYSDC (Duty Cycle) scheduling class, which runs the threads (when on duty) at pri=99. This is changed to pri=98. SYSDC is only used by zio threads, so the impact is small.
3. Shrink vmem hashtable only if 8x as large as necessary.
vmem_hash_rescale() is called every 15 seconds by timeout(), which runs at pri=99. If a vmem arena's hash table is more than twice its ideal size (or less than half its ideal size), the hash table is resized, which takes up to 1.5 seconds on a test machine. I observed that this resizing would happen periodically, as the number of hash entries grew and shrank. To reduce this oscillation, the hash table is now shrunk only once it is 8x its ideal size (i.e. the ideal size is 1/8th of the current size). The threshold for growing the table is unchanged.
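The asymmetric thresholds described above can be sketched as a small decision function. This is an illustration only, not the code in this change: should_resize(), GROW_FACTOR, and SHRINK_FACTOR are hypothetical names, and the exact boundary conditions (strict vs. non-strict comparison) are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical constants; the real change may use different names. */
#define GROW_FACTOR	2	/* grow when table is under 1/2 of ideal */
#define SHRINK_FACTOR	8	/* shrink only when table is over 8x ideal */

/*
 * Return true if a hash table of cur_size buckets should be resized,
 * given the ideal bucket count for the current number of entries.
 */
static bool
should_resize(size_t cur_size, size_t ideal_size)
{
	/* Too small: fewer than half the ideal number of buckets. */
	if (cur_size * GROW_FACTOR < ideal_size)
		return (true);

	/* Too large: more than 8x the ideal number of buckets. */
	if (cur_size > ideal_size * SHRINK_FACTOR)
		return (true);

	/* Inside the hysteresis band: leave the table alone. */
	return (false);
}
```

With the old symmetric 2x threshold, a table of 1024 buckets whose ideal size drops to 256 would be shrunk, then grown again once the entry count recovers. Under this sketch it is left alone until the ideal size falls below 128.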
4. kmem_depot_ws_reap() periodically calls kpreempt() (disabled by default)
Force kmem_depot_ws_reap() to periodically invoke the scheduler by calling kpreempt(). This is not necessary if kernel preemption is enabled, but is a less invasive change in behavior.
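The pattern of voluntarily invoking the scheduler from a long-running loop can be sketched in userland as follows. This is an analogy under stated assumptions, not the kernel code: sched_yield() stands in for kpreempt(), and reap_with_yield() and YIELD_INTERVAL are hypothetical names with an arbitrary interval.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical: how many items to process between yields. */
#define YIELD_INTERVAL	64

/*
 * Process nitems work items, offering the CPU to other runnable
 * threads after every YIELD_INTERVAL items. Returns the number of
 * times the scheduler was invoked.
 */
static size_t
reap_with_yield(size_t nitems)
{
	size_t yields = 0;

	for (size_t i = 0; i < nitems; i++) {
		/* ... free item i here ... */

		if ((i + 1) % YIELD_INTERVAL == 0) {
			/* Userland stand-in for kpreempt(). */
			(void) sched_yield();
			yields++;
		}
	}
	return (yields);
}
```

The interval trades scheduling latency against overhead: a smaller value bounds how long a higher-priority thread waits, at the cost of more scheduler invocations.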
5. Kernel timeslicing (disabled by default)
Every tick (10ms), we will reevaluate the kernel threads' priorities and preempt the running thread if there is a higher-priority thread runnable (and optionally, if there is a same-priority thread runnable).
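The per-tick decision can be modeled as a pure function. This is a simplified sketch, not the scheduler code: tick_should_preempt() and its parameters are hypothetical, and a negative priority is assumed here to mean that no other thread is runnable.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide at clock tick whether the running kernel thread should be
 * preempted.
 *
 * running_pri       priority of the thread currently on the CPU
 * best_runnable_pri highest priority among runnable threads waiting
 *                   for this CPU, or -1 if there are none (assumption)
 * preempt_same_pri  optional behavior: also preempt for a runnable
 *                   thread of equal priority (timeslicing among peers)
 */
static bool
tick_should_preempt(int running_pri, int best_runnable_pri,
    bool preempt_same_pri)
{
	/* Nothing else wants the CPU. */
	if (best_runnable_pri < 0)
		return (false);

	/* A higher-priority thread is waiting: always preempt. */
	if (best_runnable_pri > running_pri)
		return (true);

	/* An equal-priority thread is waiting: preempt only if enabled. */
	return (preempt_same_pri && best_runnable_pri == running_pri);
}
```

The same-priority option matters for the pri=99 case: without it, a long-running pri=99 thread would still hold the CPU against a pri=99 networking worker.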