Bug #12968

curthread swtch-ing while the kernel is using the FPU

Added by Marcel Telka 3 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: zfs - Zettabyte File System
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

After updating my six-month-old OpenIndiana hipster installation to the recent build, the machine started to panic with the following info:

> ::status
debugging crash dump vmcore.6 (64-bit) from MACHINE
operating system: 5.11 illumos-93f1cac532 (i86pc)
build version: heads/master-0-g93f1cac532-dirty

image uuid: 62eebea1-ac74-c3d0-9166-fd250c650c2a
panic message: curthread swtch-ing while the kernel is using the FPU
dump content: kernel pages only
> ::stack
vpanic()
0xfffffffffb880413()
resume+0x7a()
swtch+0x133()
turnstile_block+0x25b(0, 0, fffffe2c9ea2bc40, fffffffffbc1b540, 0, 0)
mutex_vector_enter+0x358(fffffe2c9ea2bc40)
kmem_cache_free+0x70(fffffe2c9ea2b888, fffffe2d6f46d580)
kmem_free+0x43(fffffe2d6f46d580, 58)
removectx+0x11c(fffffe003e5ffc20, 0, fffffffffb880310, fffffffffb8803b0, 0, 0)
kernel_fpu_end+0xf0(0, 2)
sse2_gen_pq+0x1a4(fffffe2cbda7b1c0)
vdev_raidz_math_generate+0x52(fffffe2cbda7b1c0)
vdev_raidz_generate_parity+0x16(fffffe2cbda7b1c0)
raidz_parity_verify+0xa4(fffffe2cccd6a9f0, fffffe2cbda7b1c0)
vdev_raidz_io_done+0x39a(fffffe2cccd6a9f0)
zio_vdev_io_done+0x80(fffffe2cccd6a9f0)
zio_execute+0x7c(fffffe2cccd6a9f0)
taskq_thread+0x2cd(fffffe2cbb0dbb90)
thread_start+0xb()
>

The machine was updated on 2020-07-18 at about 6:30pm CEST and it panicked 7 times overnight (between 3:50am CEST and 7:00am CEST) during the scrub operation (the scrub was started at 3:10am CEST by cron). Since I stopped the scrub (on 2020-07-19 at about 7:00am CEST), the machine seems to be running stably.

Note: CEST = UTC+2


Related issues

Related to illumos gate - Bug #12793: kernel FPU support (Closed, Jerry Jelinek)
Related to illumos gate - Bug #12794: ZFS support for vectorized algorithms on x86 (HW support) (Closed, Jerry Jelinek)
Related to illumos gate - Bug #12668: ZFS support for vectorized algorithms on x86 (initial support) (Closed, Jerry Jelinek)

#1

Updated by Marcel Telka 3 months ago

#2

Updated by Marcel Telka 3 months ago

  • Related to Bug #12794: ZFS support for vectorized algorithms on x86 (HW support) added
#3

Updated by Marcel Telka 3 months ago

  • Related to Bug #12668: ZFS support for vectorized algorithms on x86 (initial support) added
#4

Updated by Jerry Jelinek 3 months ago

  • Assignee set to Jerry Jelinek
#5

Updated by Marcel Telka 3 months ago

Just FYI: The machine is still running okay (but I haven't tried to start a scrub again yet).

#6

Updated by Marcel Telka 3 months ago

  • Description updated (diff)
#7

Updated by Jerry Jelinek 3 months ago

It looks like the fix for this is to turn off T_KFPU before we call removectx instead of after.
This should handle the voluntary swtch we see here in the stack. I'm going to try to
reproduce this first for testing before I put up the code review.
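
The ordering issue described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the illumos code or the actual patch: T_KFPU, removectx, kernel_fpu_end and swtch are names taken from this report, while every model_* and kfpu_end_* identifier, the flag value, and the "contended mutex" boolean are hypothetical stand-ins. It assumes T_KFPU behaves as a per-thread flag that the context-switch path checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical value; only the name T_KFPU comes from the bug report. */
#define T_KFPU 0x1u

/* Hypothetical model of a thread: not the illumos kthread_t. */
typedef struct model_thread {
	unsigned flags;   /* has T_KFPU set while the kernel uses the FPU */
	bool panicked;    /* set when the modelled switch check trips */
} model_thread_t;

/* Models the check behind the panic message: switching with T_KFPU set. */
static void
model_swtch(model_thread_t *t)
{
	if (t->flags & T_KFPU)
		t->panicked = true;
}

/*
 * Models removectx(): freeing the context can block on a contended
 * kmem mutex, which leads to a voluntary swtch (as in the stack above).
 */
static void
model_removectx(model_thread_t *t, bool mutex_contended)
{
	assert(t != NULL);
	if (mutex_contended)
		model_swtch(t);
}

/* Ordering seen in the panic: the flag is cleared after removectx. */
static bool
kfpu_end_clear_after(bool mutex_contended)
{
	model_thread_t t = { .flags = T_KFPU, .panicked = false };

	model_removectx(&t, mutex_contended);
	t.flags &= ~T_KFPU;
	return (t.panicked);
}

/* Proposed ordering: the flag is cleared before removectx. */
static bool
kfpu_end_clear_before(bool mutex_contended)
{
	model_thread_t t = { .flags = T_KFPU, .panicked = false };

	t.flags &= ~T_KFPU;
	model_removectx(&t, mutex_contended);
	return (t.panicked);
}
```

In this model, kfpu_end_clear_after(true) reports a panic while kfpu_end_clear_before(true) does not, and neither ordering panics when the mutex is uncontended. That matches the observation that the bug only shows up when the free path has to block.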

#8

Updated by Jerry Jelinek 3 months ago

I filled a small raidz1 with about 7GB of data (raidz1 about 1/3 full) and I was able to hit this bug during a scrub. I tested the proposed fix and multiple scrubs have completed without a panic. I'll get this up for CR while I do more testing.

#9

Updated by Joshua M. Clulow 3 months ago

  • Gerrit CR set to 803
#10

Updated by Jerry Jelinek 3 months ago

To test this I ran the raidz-specific subset of the zfs test suite in a loop for 2 hours.
This subset also passed previously in a 24-hour run, but the new run demonstrates
that the fix introduces no regressions here.

I was able to reproduce the panic on a raidz scrub. I ran 10 scrubs after the
fix and all of them completed without any problem.

#11

Updated by Joshua M. Clulow 3 months ago

Jerry made an observation about why the testing for #12794 et al did not catch this regression:

There are a fair number of tests in the zfs test suite which issue "zpool scrub".
I'll have to go through each one to determine if it is testing any raidz configs.
The obvious scrub subset looks like it concentrates on mirrors, but there are
other tests in different subsets that I haven't looked at yet. One possible
reason this didn't fail is that we might need a more realistically sized raidz
and it must be populated with a lot of data to trigger this, since it doesn't
happen immediately during a scrub. That kind of scenario is harder to add
to the test suite.

#12

Updated by Electric Monk 3 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 948cceb01d0173f5a732ef880dddcadff2204c12

commit  948cceb01d0173f5a732ef880dddcadff2204c12
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Date:   2020-07-20T20:53:10.000Z

    12968 curthread swtch-ing while the kernel is using the FPU
    Reviewed by: Jason King <jason.king@joyent.com>
    Reviewed by: Dan McDonald <danmcd@joyent.com>
    Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Joshua M. Clulow <josh@sysmgr.org>

#13

Updated by Alex Wilson 3 months ago

In case anyone is interested in further field reports, I've had 3 machines today hit this panic spontaneously without doing any scrubs at all, about 2 hours or so after upgrading them.

After panicking and coming back up on a kernel with the fix for this bug, scrubs on these machines are finding strange checksum errors and sometimes permanent errors in the oldest files on the pool. Not sure if it's related, but it only seems to have occurred so far on the machines that actually hit this panic and rebooted (and the machines scrubbed clean yesterday before the upgrade).

#14

Updated by Jerry Jelinek 3 months ago

It's not unexpected that you could hit this without doing a scrub since
there was nothing scrub-specific in the bug. Using scrub was how
Marcel hit this and how I was able to reproduce it for testing, but it
would be possible to hit the bug any time the kmem mutex was contended
during the kernel_fpu_end.

I don't have a good explanation for what you're seeing with scrub now.
None of this new kfpu/raidz code touched the checksum code in any
way. It's always hard to predict what happens when we panic in the
middle of zio handling like this bug did, but the on-disk state should always
be consistent.

#15

Updated by Marcel Telka 3 months ago

I updated to this fix two days ago and immediately started a scrub. Now, after almost two days of scrubbing (this is the usual time for a scrub of my pool), the scrub finished without a panic (thanks Jerry!) and no errors were detected either:

  scan: scrub repaired 0 in 1 days 19:55:25 with 0 errors on Thu Jul 23 09:22:00 2020
