Bug #13614

user-level thread swtch panic

Added by Jerry Jelinek about 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Assignee:
Category: kernel
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

We hit the following:

panic message: BAD TRAP: type=7 (#nm Device not available) rp=fffffcc2720bb600 addr=fffffe23619e6000

The top of the stack is:
> $C
fffffcc2720bb720 xsaveopt_ctxt+0x16()
fffffcc2720bb730 resume+0x75()
fffffcc2720bb760 swtch+0x133()
fffffcc2720bb7e0 cv_timedwait_hires+0xd1(fffffe290d6561d8, fffffe290d6561d0, 37e11d600, f4240, 0)
fffffcc2720bb810 cv_reltimedwait+0x3a(fffffe290d6561d8, fffffe290d6561d0, 3a98, 4)

This occurs because the ts bit is set in %cr0, so the xsaveopt instruction generates
a #NM fault.

Normally the ts bit will be cleared when the FPU is in use, or when the context is saved, so that
the xsaveopt instruction in the xsaveopt_ctxt context op function will be able to access
the FPU. However, in this case the ts bit is still set when xsaveopt_ctxt runs.
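To make the failure condition concrete, here is a minimal user-space sketch of the rule described above: an FPU state instruction such as xsaveopt traps with #NM whenever the ts bit is set. All names here (model_cpu, model_xsaveopt, MODEL_TRAP_NM) are hypothetical and exist only for illustration; this is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one CPU; not an illumos structure. */
struct model_cpu {
	bool cr0_ts;	/* models the ts bit in %cr0 */
};

enum model_result {
	MODEL_OK,	/* instruction completed */
	MODEL_TRAP_NM	/* type=7, #nm Device not available */
};

/*
 * Models executing xsaveopt: with the ts bit set the instruction
 * faults instead of saving FPU state.
 */
static enum model_result
model_xsaveopt(const struct model_cpu *cpu)
{
	return (cpu->cr0_ts ? MODEL_TRAP_NM : MODEL_OK);
}
```

In this model the panic corresponds to calling model_xsaveopt() from the context save path while cr0_ts is true.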

Actions #1

Updated by Jerry Jelinek about 1 year ago

We have some additional kernel FPU crypto code that is not upstream and this is
hooked in with ZFS encryption. Normally the kfpu handling is done by a pure
kernel thread in the zio pipeline. However, there can be rare cases when a
user-level thread can execute that code directly.

If that user-level thread were then to context switch before leaving the kernel, it
can hit this bug.

The problem here is the unconditional fpdisable, which sets the ts bit, at
the end of kernel_fpu_end.
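The sequence can be sketched as a small user-space model. This is only an illustration under stated assumptions: the types and functions below (model_thread, model_kernel_fpu_end, model_swtch_traps) are hypothetical, and the "conditional" branch is one plausible way to avoid the problem, not necessarily what the actual change does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; not illumos structures. */
struct model_thread {
	/* true for a user-level thread whose FPU save op runs on switch */
	bool has_user_fpu_ctxop;
};

struct model_cpu {
	bool cr0_ts;	/* models the ts bit in %cr0 */
};

/*
 * Models the end of a kernel FPU section. With unconditional == true
 * it behaves as described in this bug: the ts bit is always set.
 * With unconditional == false (an assumed alternative) the ts bit is
 * left clear for threads that still need to save user FPU state.
 */
static void
model_kernel_fpu_end(struct model_cpu *cpu, const struct model_thread *t,
    bool unconditional)
{
	if (unconditional || !t->has_user_fpu_ctxop)
		cpu->cr0_ts = true;	/* models fpdisable() */
}

/*
 * Models swtch() invoking the thread's save context op. Returns true
 * if the save would fault with #NM because the ts bit is set.
 */
static bool
model_swtch_traps(const struct model_cpu *cpu, const struct model_thread *t)
{
	return (t->has_user_fpu_ctxop && cpu->cr0_ts);
}
```

For a thread with has_user_fpu_ctxop set, calling model_kernel_fpu_end() with unconditional == true and then model_swtch_traps() reports the trap; a pure kernel thread has no such context op and is unaffected either way.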

Actions #2

Updated by Electric Monk about 1 year ago

  • Gerrit CR set to 1319
Actions #3

Updated by Jerry Jelinek about 1 year ago

I tested this on our Racktop build of illumos. As I mentioned in the description, we
have additional FPU enabled code in our build which is hooked in to the kernel crypto framework.

We originally hit this bug in an SMB thread. The smbd threads are user-level threads. Even
though they come into the kernel and never leave, they still have the user-level thread
context switch handlers installed. We ran stress tests of the SMB server after this fix.

In addition to testing this fix with ZFS/SMB, I used the digest command, which calls into the
kernel to take advantage of HW accelerators. Using digest, which is a user-level consumer of
the kernel FPU code, combined with some DTrace chill to encourage a context switch after
using the FPU in the kernel, I was able to reproduce the panic on the first try. After the fix, I
ran tens of iterations of the digest/chill combination without a problem.

Actions #4

Updated by Electric Monk about 1 year ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 288166677c0b62978c976160131a2cc1cf4176b4

commit  288166677c0b62978c976160131a2cc1cf4176b4
Author: Jerry Jelinek <gjelinek@gmail.com>
Date:   2021-03-11T16:43:27.000Z

    13614 user-level thread swtch panic
    Reviewed by: Dan McDonald <danmcd@joyent.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Reviewed by: Igor Kozhukhov <igor@dilos.org>
    Approved by: Richard Lowe <richlowe@richlowe.net>
