curthread swtch-ing while the kernel is using the FPU
After updating my six-month-old OpenIndiana hipster installation to a recent build, the machine started to panic with the following info:
> ::status
debugging crash dump vmcore.6 (64-bit) from MACHINE
operating system: 5.11 illumos-93f1cac532 (i86pc)
build version: heads/master-0-g93f1cac532-dirty
image uuid: 62eebea1-ac74-c3d0-9166-fd250c650c2a
panic message: curthread swtch-ing while the kernel is using the FPU
dump content: kernel pages only
> ::stack
vpanic()
0xfffffffffb880413()
resume+0x7a()
swtch+0x133()
turnstile_block+0x25b(0, 0, fffffe2c9ea2bc40, fffffffffbc1b540, 0, 0)
mutex_vector_enter+0x358(fffffe2c9ea2bc40)
kmem_cache_free+0x70(fffffe2c9ea2b888, fffffe2d6f46d580)
kmem_free+0x43(fffffe2d6f46d580, 58)
removectx+0x11c(fffffe003e5ffc20, 0, fffffffffb880310, fffffffffb8803b0, 0, 0)
kernel_fpu_end+0xf0(0, 2)
sse2_gen_pq+0x1a4(fffffe2cbda7b1c0)
vdev_raidz_math_generate+0x52(fffffe2cbda7b1c0)
vdev_raidz_generate_parity+0x16(fffffe2cbda7b1c0)
raidz_parity_verify+0xa4(fffffe2cccd6a9f0, fffffe2cbda7b1c0)
vdev_raidz_io_done+0x39a(fffffe2cccd6a9f0)
zio_vdev_io_done+0x80(fffffe2cccd6a9f0)
zio_execute+0x7c(fffffe2cccd6a9f0)
taskq_thread+0x2cd(fffffe2cbb0dbb90)
thread_start+0xb()
>
The machine was updated on 2020-07-18 at about 6:30pm CEST and it panicked 7 times overnight (between 3:50am CEST and 7:00am CEST) during the scrub operation (the scrub was started at 3:10am CEST by cron). Since I stopped the scrub (on 2020-07-19 at about 7:00am CEST), the machine appears to be running stably.
Note: CEST = UTC+2
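For anyone triaging a similar dump, here is a small sketch (Python, not part of the original report) that pulls the function names out of the `::stack` frames quoted above and prints the call path from `kernel_fpu_end` down to `swtch`. The helper names `frame_names` and `call_path` are made up for illustration, and the frame text is copied from the report.

```python
import re

# Frames copied from the ::stack output in the report (innermost first).
STACK = """\
vpanic()
0xfffffffffb880413()
resume+0x7a()
swtch+0x133()
turnstile_block+0x25b(0, 0, fffffe2c9ea2bc40, fffffffffbc1b540, 0, 0)
mutex_vector_enter+0x358(fffffe2c9ea2bc40)
kmem_cache_free+0x70(fffffe2c9ea2b888, fffffe2d6f46d580)
kmem_free+0x43(fffffe2d6f46d580, 58)
removectx+0x11c(fffffe003e5ffc20, 0, fffffffffb880310, fffffffffb8803b0, 0, 0)
kernel_fpu_end+0xf0(0, 2)
sse2_gen_pq+0x1a4(fffffe2cbda7b1c0)
"""

FRAME_RE = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*)(?:\+0x[0-9a-f]+)?\(")


def frame_names(stack_text):
    """Return symbol names, innermost first; raw-address frames are skipped."""
    names = []
    for line in stack_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            names.append(m.group(1))
    return names


def call_path(names, caller, callee):
    """Return the frames from caller down to callee, in call order."""
    # The innermost frame is listed first, so the callee has the lower index.
    lo, hi = names.index(callee), names.index(caller)
    return list(reversed(names[lo:hi + 1]))


if __name__ == "__main__":
    print(" -> ".join(call_path(frame_names(STACK), "kernel_fpu_end", "swtch")))
```

Read in call order, the path shows the context being freed from inside `kernel_fpu_end`, the free blocking on a mutex, and the blocked thread being switched out.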
Updated by Jerry Jelinek 3 months ago
To test this I ran the raidz-specific subset of the zfs test suite in a loop for 2 hours.
This also passed previously in a 24-hour run, but it demonstrates that there are no regressions.
I was able to reproduce the panic on a raidz scrub. I ran 10 scrubs after the
fix and all of them completed without any problem.
Updated by Joshua M. Clulow 3 months ago
Jerry made an observation about why the testing for #12794 et al did not catch this regression:
There are a fair number of tests in the zfs test suite which issue "zpool scrub".
I'll have to go through each one to determine if it is testing any raidz configs.
The obvious scrub subset looks like it concentrates on mirrors, but there are
other tests in different subsets that I haven't looked at yet. One possible
reason this didn't fail is that we might need a more realistically sized raidz
and it must be populated with a lot of data to trigger this, since it doesn't
happen immediately during a scrub. That kind of scenario is harder to add
to the test suite.
Updated by Electric Monk 3 months ago
- Status changed from New to Closed
- % Done changed from 0 to 100
commit 948cceb01d0173f5a732ef880dddcadff2204c12
Author: Jerry Jelinek <firstname.lastname@example.org>
Date: 2020-07-20T20:53:10.000Z

12968 curthread swtch-ing while the kernel is using the FPU
Reviewed by: Jason King <email@example.com>
Reviewed by: Dan McDonald <firstname.lastname@example.org>
Reviewed by: Patrick Mooney <email@example.com>
Reviewed by: Toomas Soome <firstname.lastname@example.org>
Approved by: Joshua M. Clulow <email@example.com>
Updated by Alex Wilson 3 months ago
In case anyone is interested in further field reports, I've had 3 machines today hit this panic spontaneously without doing any scrubs at all, about 2 hours or so after upgrading them.
After panicking and coming back up on a kernel with the fix for this bug, scrubs on these machines are finding strange checksum errors and sometimes permanent errors in the oldest files on the pool. Not sure if it's related, but it only seems to have occurred so far on the machines that actually hit this panic and rebooted (and the machines scrubbed clean yesterday before the upgrade).
Updated by Jerry Jelinek 3 months ago
It's not unexpected that you could hit this without doing a scrub since
there was nothing scrub-specific in the bug. Using scrub was how
Marcel hit this and how I was able to reproduce it for testing, but it
would be possible to hit the bug any time the kmem mutex was contended
during the kernel_fpu_end.
I don't have a good explanation for what you're seeing with scrub now.
None of this new kfpu/raidz code touched the checksum code in any
way. It's always hard to predict what happens when we panic in the
middle of zio handling like this bug did, but the on-disk state should always
be consistent.
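
The contention dependence described above can be illustrated with a toy user-space model. This is Python, not kernel C, and every name in it (`ToyThread`, `free_ctx`, `fpu_end_free_first`, `fpu_end_clear_first`, `run`) is hypothetical. The "clear first" ordering is included only for contrast; it is not claimed to be what commit 948cceb actually changes.

```python
class ToyThread:
    """Toy model of one thread's kernel-FPU state; not illumos code."""

    def __init__(self, mutex_contended):
        self.kfpu_in_use = False
        self.mutex_contended = mutex_contended

    def swtch(self):
        # Models the check behind the panic message in this bug.
        if self.kfpu_in_use:
            raise RuntimeError(
                "curthread swtch-ing while the kernel is using the FPU")

    def free_ctx(self):
        # Stand-in for removectx() -> kmem_free(): the free takes a mutex,
        # and blocking on a contended mutex switches to another thread.
        if self.mutex_contended:
            self.swtch()

    def fpu_begin(self):
        self.kfpu_in_use = True

    def fpu_end_free_first(self):
        # Ordering visible in the panic stack: the context is freed while
        # the thread is still marked as using the FPU.
        self.free_ctx()
        self.kfpu_in_use = False

    def fpu_end_clear_first(self):
        # Hypothetical contrast: drop the FPU mark before anything can block.
        self.kfpu_in_use = False
        self.free_ctx()


def run(end_name, contended):
    """Run one begin/end cycle and report 'ok' or the simulated panic."""
    t = ToyThread(contended)
    t.fpu_begin()
    try:
        getattr(t, end_name)()
    except RuntimeError as e:
        return "panic: " + str(e)
    return "ok"


if __name__ == "__main__":
    for end_name in ("fpu_end_free_first", "fpu_end_clear_first"):
        for contended in (False, True):
            print(end_name, "contended=%s" % contended, run(end_name, contended))
```

In this model the free-first ordering only fails when the mutex happens to be contended, which is consistent with the panic showing up some time into a scrub, or with no scrub at all, rather than immediately.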
Updated by Marcel Telka 3 months ago
I updated to this fix two days ago and immediately started a scrub. Now, after almost two days of scrubbing (the usual time for a scrub of my pool), the scrub has finished without a panic (thanks Jerry!) and no errors were detected either:
scan: scrub repaired 0 in 1 days 19:55:25 with 0 errors on Thu Jul 23 09:22:00 2020