Bug #13717

closed

kernel fpu use can still lead to panic

Added by Jerry Jelinek 5 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

Even after the fix for illumos#13614, we can still hit a panic with kernel FPU usage. The status and the top of the stack are the same as in 13614. Specifically, we are attempting to context switch off of the CPU and the user-level context switch handler is trying to save the FPU state, but the CPU is in a bad state because another thread previously left the CPU with the TS bit set in %cr0 via the call to fpdisable in kernel_fpu_end.
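To make the role of the TS bit concrete, here is a minimal user-space sketch in C. It only manipulates a plain integer standing in for %cr0; the toy_ names are hypothetical and are not the illumos functions. The one architectural fact it relies on is that TS is bit 3 of %cr0 and that the CLTS instruction clears it. While TS is set, an FPU instruction raises a device-not-available fault instead of executing.

```c
#include <stdbool.h>
#include <stdint.h>

/* TS (Task Switched) is bit 3 of %cr0 on x86. */
#define TOY_CR0_TS (UINT64_C(1) << 3)

/* Hypothetical helper: report whether TS is set in a %cr0 value. */
static bool
toy_cr0_ts_is_set(uint64_t cr0)
{
	return ((cr0 & TOY_CR0_TS) != 0);
}

/* Stand-in for the effect the report attributes to fpdisable: TS set. */
static uint64_t
toy_fpdisable(uint64_t cr0)
{
	return (cr0 | TOY_CR0_TS);
}

/* Stand-in for the CLTS instruction: TS cleared, other bits untouched. */
static uint64_t
toy_clts(uint64_t cr0)
{
	return (cr0 & ~TOY_CR0_TS);
}
```

In this model, a CPU left in the state returned by toy_fpdisable stays that way until something applies toy_clts, which is the situation the description says the save path ran into.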


Related issues

Related to illumos gate - Bug #13902: Fix for 13717 may break 8-disk raidz2 (Closed, Dan McDonald)
Related to illumos gate - Bug #13908: disable kernel FPU by default until it is stable (Closed, Joshua M. Clulow)

Actions #1

Updated by Jerry Jelinek 5 months ago

We have additional, non-open-source code which makes heavy use of the FPU while in the kernel. While stress testing this code, we can hit this panic after 12-36 hours of constant testing.
There are various race conditions in kernel_fpu_begin and kernel_fpu_end in which a context switch at the wrong time leaves the thread's pcb FPU state incorrect.

Actions #2

Updated by Jerry Jelinek 5 months ago

While debugging and fixing this, I took the opportunity to also clean up some of the code in kernel_fpu_end and to improve the comments.

The most pernicious bug occurred when the code would context switch after the fp_save in kernel_fpu_begin, but before the FPU_KERNEL flag was set. This meant that we would leave kernel_fpu_begin without the user-land FPU state being valid in the pcb. If the thread then ended up on a new CPU on which a system thread had previously used the kernel FPU, that CPU would still have the TS bit set in %cr0: fpdisable in kernel_fpu_end had set it, and the user-land restore handler would not have run to do CLTS. When we later context switched off that CPU, we would panic.

This also implies that we have corrupted the user-land FPU state, which is bad. In practice, most (but not all) of the user-land threads we have observed are SMB threads, which come into the kernel but never leave.
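The window between the save and the flag update can be illustrated with a small user-space model in C. This is only a sketch under stated assumptions: every toy_ name is hypothetical, the structure does not match the real illumos pcb, and deferring the switch with a preemption-disable count is one common way to close such a window, not necessarily what the committed fix does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; fields do not match the illumos structures. */
typedef struct toy_thread {
	bool state_saved;	/* stands in for "fp_save has run" */
	bool fpu_kernel;	/* stands in for the FPU_KERNEL flag */
	int preempt_disabled;	/* nesting depth of preemption disables */
	bool switch_pending;	/* a switch was requested while disabled */
	bool inconsistent_seen;	/* a switch saw saved && !fpu_kernel */
} toy_thread_t;

/* Simulated context switch: notes whether it saw the half-updated state. */
static void
toy_context_switch(toy_thread_t *t)
{
	if (t->preempt_disabled > 0) {
		t->switch_pending = true;	/* defer until re-enabled */
		return;
	}
	if (t->state_saved && !t->fpu_kernel)
		t->inconsistent_seen = true;
}

static void
toy_preempt_disable(toy_thread_t *t)
{
	t->preempt_disabled++;
}

static void
toy_preempt_enable(toy_thread_t *t)
{
	assert(t->preempt_disabled > 0);
	if (--t->preempt_disabled == 0 && t->switch_pending) {
		t->switch_pending = false;
		toy_context_switch(t);	/* run the deferred switch now */
	}
}

/* Racy ordering: a switch can land between the save and the flag. */
static bool
toy_begin_racy(toy_thread_t *t)
{
	t->state_saved = true;		/* fp_save stand-in */
	toy_context_switch(t);		/* worst-case preemption point */
	t->fpu_kernel = true;
	return (t->inconsistent_seen);
}

/* Guarded ordering: the same switch is deferred past the flag update. */
static bool
toy_begin_guarded(toy_thread_t *t)
{
	toy_preempt_disable(t);
	t->state_saved = true;
	toy_context_switch(t);		/* deferred, not executed here */
	t->fpu_kernel = true;
	toy_preempt_enable(t);		/* deferred switch runs here */
	return (t->inconsistent_seen);
}
```

Calling toy_begin_racy on a zero-initialized toy_thread_t reports that the switch saw the save without the flag, while toy_begin_guarded on a fresh thread does not, because the switch only runs once both updates are visible.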

Actions #3

Updated by Jerry Jelinek 5 months ago

To test this, we developed a stress load test which would reliably panic within 12 hours. After this fix, we ran this load continuously for about 5 days without a panic.

Actions #4

Updated by Electric Monk 4 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit a16c2dd277fe07b088b1a5da5b953f5415ce55ec

commit  a16c2dd277fe07b088b1a5da5b953f5415ce55ec
Author: Jerry Jelinek <gjelinek@gmail.com>
Date:   2021-05-10T16:10:07.000Z

    13717 kernel fpu use can still lead to panic

    Reviewed by: Dan McDonald <danmcd@joyent.com>
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Approved by: Richard Lowe <richlowe@richlowe.net>

Actions #5

Updated by Dan McDonald 3 months ago

  • Related to Bug #13902: Fix for 13717 may break 8-disk raidz2 added
Actions #6

Updated by Joshua M. Clulow 3 months ago

  • Project changed from site to illumos gate
Actions #7

Updated by Andy Fiddaman 3 months ago

  • Gerrit CR set to 1415
Actions #8

Updated by Joshua M. Clulow 3 months ago

  • Related to Bug #13908: disable kernel FPU by default until it is stable added