Bug #13717


kernel fpu use can still lead to panic

Added by Jerry Jelinek about 1 month ago. Updated 5 days ago.

Even after the fix for illumos#13614 we can still hit a panic with kernel FPU usage. The status and the top of the stack are the same as in 13614. Specifically, we are attempting to context switch off of the CPU and the user-level context switch handler is trying to save the FPU state, but the CPU is in a bad state because another thread previously left it with the TS bit set in %cr0 via the call to fpdisable in kernel_fpu_end.
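
To make the failure mode concrete, here is a minimal user-space sketch of why a stale TS bit breaks the save. It is a toy model, not illumos code: every toy_* name is hypothetical, and the only hardware behavior it relies on is that FPU save instructions trap (#NM) while CR0.TS is set.

```c
#include <stdbool.h>

/* Toy model of one CPU; only the CR0.TS bit matters here. */
typedef struct {
	bool cr0_ts;	/* true: FPU instructions trap with #NM */
} toy_cpu_t;

/* Stand-in for fpdisable(): set TS so the next FPU use traps. */
static void
toy_fpdisable(toy_cpu_t *cpu)
{
	cpu->cr0_ts = true;
}

/* Stand-in for the CLTS done on the restore path. */
static void
toy_clts(toy_cpu_t *cpu)
{
	cpu->cr0_ts = false;
}

/*
 * Stand-in for the context-switch save handler. Returns false in the
 * case where the real save instruction would trap in the kernel.
 */
static bool
toy_fp_save(const toy_cpu_t *cpu)
{
	return (!cpu->cr0_ts);
}
```

In this model, a save attempted after toy_fpdisable() and before any toy_clts() fails, which corresponds to the panic described above.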

Actions #1

Updated by Jerry Jelinek about 1 month ago

We have additional, non-open-source code which makes heavy use of the FPU while in the kernel. While stress testing this code we can hit this panic after 12-36 hours of constant testing.
There are various races in kernel_fpu_begin and kernel_fpu_end in which a context switch at the wrong time leaves the thread's pcb FPU state incorrect.

Actions #2

Updated by Jerry Jelinek about 1 month ago

While debugging and fixing this, I took the opportunity to also clean up some of the code in kernel_fpu_end and to improve the comments.

The most pernicious bug was a context switch after the fp_save in kernel_fpu_begin but before the FPU_KERNEL flag was set. This meant that we would leave kernel_fpu_begin without the user-land FPU state being valid in the pcb. Thus, if we later context switched off the new CPU, and a system thread which had used the kernel FPU had previously run on that CPU, the CPU would still have the TS bit set in %cr0 (set by fpdisable in kernel_fpu_end, and never cleared because the user-land restore handler had not run to do CLTS), and we would panic.
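
The ordering problem can be sketched with a small user-space model. This is an illustration under stated assumptions, not the illumos implementation: the toy_* names and fields are hypothetical, and the way the pcb copy becomes invalid (a switch off and back on reloads the registers and marks the saved copy stale) is an assumed mechanism chosen only to show why the window between the save and the flag update matters.

```c
#include <stdbool.h>

typedef struct {
	bool pcb_valid;		/* user FPU state in the pcb is valid */
	bool fpu_kernel;	/* hypothetical stand-in for FPU_KERNEL */
} toy_thread_t;

/* Stand-in for fp_save: user state is copied into the pcb. */
static void
toy_save(toy_thread_t *t)
{
	t->pcb_valid = true;
}

/*
 * Assumed behavior: a switch off and back on leaves the user state
 * alone while the kernel owns the FPU; otherwise the state goes live
 * in the registers again and the pcb copy is treated as stale.
 */
static void
toy_switch_off_and_on(toy_thread_t *t)
{
	if (t->fpu_kernel)
		return;
	t->pcb_valid = false;
}

/* Returns whether the pcb copy is still valid on exit. */
static bool
toy_kernel_fpu_begin(toy_thread_t *t, bool preempt_in_window)
{
	toy_save(t);
	if (preempt_in_window)
		toy_switch_off_and_on(t);	/* the racy window */
	t->fpu_kernel = true;
	return (t->pcb_valid);
}
```

With preempt_in_window set, the model leaves toy_kernel_fpu_begin with pcb_valid false, matching the state described above; without it, the saved state stays valid, and a later switch while fpu_kernel is set does not disturb it.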

Of course this also implies that we have corrupted the user-land FPU state, which is bad, although in practice the user-land threads we have observed are mostly (but not always) SMB threads which come into the kernel but never leave.

Actions #3

Updated by Jerry Jelinek about 1 month ago

For testing this, we developed a stress load test which would reliably panic within 12 hours. After this fix we ran this load continuously for about 5 days without a panic.

Actions #4

Updated by Electric Monk 5 days ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit a16c2dd277fe07b088b1a5da5b953f5415ce55ec

commit  a16c2dd277fe07b088b1a5da5b953f5415ce55ec
Author: Jerry Jelinek <>
Date:   2021-05-10T16:10:07.000Z

    13717 kernel fpu use can still lead to panic

    Reviewed by: Dan McDonald <>
    Reviewed by: Robert Mustacchi <>
    Approved by: Richard Lowe <>

