Bug #12391

Kernel latencies for bound threads

Added by Arne Jansen 7 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Hard
Tags:
Gerrit CR:

Description

We've recently seen massive problems with network outages of several seconds to 20 minutes. It turns out this is due to an interaction between long-running kernel maintenance threads (like kmem_move and kmem_depot_ws_reap) and the various network threads. Research showed that this problem has been identified before, see #8493 and #7232. The solution there was to add scheduling points to all long-running tasks. This is tedious and error-prone, so I'd like to take a more general approach.

A bit on the background:
Traditionally there is no preemption for kernel threads running in the sys class. Priorities in this class range from 60 to 99; kernel preemption is only implemented for priorities >= 100, which are reserved for realtime threads.
When a thread in the sys class becomes runnable, a cpu is chosen and the thread is added to that cpu's dispatch queue, even if it has a higher priority than the thread currently running there. So it has to wait until the running thread reaches a scheduling point.
This normally isn't a problem even with long-running threads, because there is a second mechanism: before a cpu enters idle, it searches all other cpus for potential work and steals the most suitable thread. So our enqueued thread gets stolen and runs on a different cpu. Unfortunately this doesn't help in the case of the network stack, which has various workers bound to a specific cpu. The binding prevents the thread from being stolen, so it has to wait, in our case for up to 20 minutes.
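The two mechanisms described above can be sketched as a minimal C model. The names follow illumos conventions (kpreemptpri, the sys-class range), but the logic here is an illustrative simplification, not the actual disp.c code:

```c
#include <assert.h>
#include <stdbool.h>

#define MAXCLSYSPRI 99      /* top of the sys class (priorities 60-99) */
#define KPREEMPTPRI 100     /* kernel preemption only at or above this */

/*
 * Does enqueueing a thread at priority 'tpri' preempt the thread
 * currently running on the target cpu? For sys-class priorities
 * (<= 99) the answer is always no.
 */
static bool
triggers_kpreempt(int tpri)
{
	return (tpri >= KPREEMPTPRI);
}

/*
 * Can an idle cpu steal the enqueued thread? Binding to a cpu
 * disables stealing, which is what closes off the second escape
 * hatch for the bound network workers.
 */
static bool
can_be_stolen(bool bound)
{
	return (!bound);
}
```

A bound sys-class thread fails both tests: it neither preempts the running thread nor gets stolen, so it waits for a voluntary scheduling point.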

So how can we address this?
  1. The solution taken so far is to split identified long-running threads into smaller ones or to add scheduling points. This is not very generic, and it's still easy to get occasional latencies of >100ms.
  2. Don't bind the network threads (or any other thread that is time-critical). This defeats the per-cpu approach of the stack and probably has adverse effects on general networking performance.
  3. Implement generic kernel preemption for class sys. This has been proposed in #7232 but has later been backed out for reasons discussed here: https://github.com/openzfs/openzfs/pull/158#discussion_r75203592
  4. Implement kernel preemption only for threads that are bound to a cpu. This is the approach I'm currently testing, with good results so far. The resulting patch is very short, as all necessary bits are already in place. If a thread is bound, we re-route the scheduling decision to the kpreempt path by artificially raising the prio to 100 (or, more specifically, kpreemptpri), but only for the decision whether kernel preemption should take place. The thread itself stays at its given prio.

Patch proposal:

--- a/usr/src/uts/common/disp/disp.c
+++ b/usr/src/uts/common/disp/disp.c
@@ -1338,7 +1338,14 @@ setbackdq(kthread_t *tp)
                if (tpri > dp->disp_maxrunpri) {
                        dp->disp_maxrunpri = tpri;
                        membar_enter();
-                       cpu_resched(cp, tpri);
+                       if (bound)
+                               /*
+                                * force reschedule for bound threads, as
+                                * they can't be stolen
+                                */
+                               cpu_resched(cp, kpreemptpri);
+                       else
+                               cpu_resched(cp, tpri);
                }
        }

@@ -1497,7 +1504,14 @@ setfrontdq(kthread_t *tp)
                if (tpri > dp->disp_maxrunpri) {
                        dp->disp_maxrunpri = tpri;
                        membar_enter();
-                       cpu_resched(cp, tpri);
+                       if (bound)
+                               /*
+                                * force reschedule for bound threads, as
+                                * they can't be stolen
+                                */
+                               cpu_resched(cp, kpreemptpri);
+                       else
+                               cpu_resched(cp, tpri);
                }
        }

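To see why passing kpreemptpri is enough, here is a simplified model of the flag decision inside cpu_resched(). This is an assumption-laden sketch (the real function in disp.c also compares against the cpu's current dispatch priority and pokes the cpu); only the threshold comparison matters for the patch:

```c
#include <assert.h>
#include <stdbool.h>

#define UPREEMPTPRI 0     /* user preemption threshold (illustrative value) */
#define KPREEMPTPRI 100   /* kernel preemption threshold */

struct cpu_flags {
	bool cpu_runrun;     /* preempt at the next return to user mode */
	bool cpu_kprunrun;   /* preempt at the next kernel preemption point */
};

/*
 * Simplified model: which preemption flag gets set for a given
 * dispatch priority. Only a priority >= kpreemptpri sets
 * cpu_kprunrun, which is what forces prompt kernel preemption.
 */
static void
model_cpu_resched(struct cpu_flags *cp, int tpri)
{
	if (tpri >= UPREEMPTPRI)
		cp->cpu_runrun = true;
	if (tpri >= KPREEMPTPRI)
		cp->cpu_kprunrun = true;
}
```

With the patch, a bound thread is enqueued via cpu_resched(cp, kpreemptpri), so cpu_kprunrun is set and the long-running kernel thread is preempted at the next preemption check instead of running to a voluntary scheduling point. An unbound sys-class thread still only sets cpu_runrun, preserving the existing behavior.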
