Bug #13902
closedFix for 13717 may break 8-disk raidz2
100%
Description
I've got three kernel dumps from SmartOS 20210617 and one more from SmartOS 20210520. The filer is running an 8-disk raidz (raidz1) pool and it panics within a few hours consistently, with this pattern of panic:
mdb: warning: dump is from SunOS 5.11 joyent_20210520T001536Z; dcmds and macros may not match kernel implementation Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba uhci smbios stmf_sbd stmf sd zfs mm lofs idm sata crypto random cpc logindmux ptm sppp nsmb smbsrv nfs ] > $C fffffe0170ae9590 fpxrestore+2() fffffe0170ae95b0 kernel_fpu_ctx_restore+0x13(0) fffffe0170ae95e0 restorectx+0x3d(fffffe0170ae9c20) > ::status debugging crash dump vmcore.4 (64-bit) from ronsu operating system: 5.11 joyent_20210520T001536Z (i86pc) git branch: release-20210520 git rev: 58a2b7ba9771c57f897a33cf69335cd18fa5992f image uuid: (not set) panic message: BAD TRAP: type=d (#gp General protection) rp=fffffe0170ae9470 addr=ffffff0a5c2ff000 dump content: kernel pages only > fffffe0170ae9c20::findstack -v stack pointer for thread fffffe0170ae9c20 (zpool-zones/120): fffffe0170ae95f0 fffffe0170ae9620 swtch+0x133() fffffe0170ae9640 preempt+0xfa() fffffe0170ae9670 kpreempt+0x3c(ffffffff) fffffe0170ae96f0 installctx+0x11a(fffffe0170ae9c20, 0, fffffffffb880870, fffffffffb880910, 0, 0, 0, 0) fffffe0170ae9740 kernel_fpu_begin+0x116(0, 2) fffffe0170ae97c0 ssse3_gen_pq+0x1b4(ffffff0c4e303580) fffffe0170ae97e0 vdev_raidz_math_generate+0x52(ffffff0c4e303580) fffffe0170ae9810 vdev_raidz_generate_parity+0x16(ffffff0c4e303580) fffffe0170ae98a0 vdev_raidz_io_start+0x1d8(ffffff0c4cd5c928) fffffe0170ae98f0 zio_vdev_io_start+0x89(ffffff0c4cd5c928) fffffe0170ae9920 zio_execute+0xf1(ffffff0c4cd5c928) fffffe0170ae9950 zio_nowait+0x2b(ffffff0c4cd5c928) fffffe0170ae99e0 vdev_mirror_io_start+0xba(ffffff0c4eb206b8) fffffe0170ae9a30 zio_vdev_io_start+0x1a2(ffffff0c4eb206b8) fffffe0170ae9a60 zio_execute+0xa7(ffffff0c4eb206b8) fffffe0170ae9b10 taskq_thread+0x2cd(ffffff0a9d94e298) fffffe0170ae9b20 thread_start+0xb() >
So the faulty instruction and data are:
> fpxrestore+2::dis fpxrestore: clts fpxrestore+2: fxrstor (%rdi) fpxrestore+6: ret 0xfffffffffb87e277: nopw 0x0(%rax,%rax) xrestore: clts xrestore+2: movl %esi,%eax xrestore+4: movq %rsi,%rdx xrestore+7: shrq $0x20,%rdx xrestore+0xb: xrstor (%rdi) xrestore+0xe: ret 0xfffffffffb87e28f: nop fpdisable: movq %cr0,%rdi > <rdi=P 0xffffff0a9f650800 > <rdi::whatis ffffff0a9f650800 is allocated from fxsave_cache > ffffff0a9f650800::print struct fxsave_state !head { fx_fcw = 0xcafe fx_fsw = 0xbadd fx_fctw = 0xcafe fx_fop = 0xbadd fx_rip = 0xbaddcafebaddcafe fx_rdp = 0xbaddcafebaddcafe fx_mxcsr = 0xbaddcafe fx_mxcsr_mask = 0xbaddcafe fx_st = [ >
Hmmm, this looks like uninitialized memory. Was this supposed to have something?
I'm giving the discoverer a DEBUG kernel, in the hopes that we'll have more information in its coredump. The discoverer is currently running a pre-illumos-13717 SmartOS platform image, and so far, no panic.
Files
Related issues
Updated by Dan McDonald about 2 years ago
- Subject changed from Fix for 13717 may break 8-disk raidz1 to Fix for 13717 may break 8-disk raidz2
Updated by Dan McDonald about 2 years ago
It's raidz2, not raidz:
[root@ronsu /zones]# zpool list -v NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT zones 7.27T 58.5G 7.21T - - 0% 0% 1.00x ONLINE - raidz2 7.27T 58.5G 7.21T - - 0% 0.78% c1t0d0 - - - - - - - c1t1d0 - - - - - - - c1t2d0 - - - - - - - c1t3d0 - - - - - - - c1t4d0 - - - - - - - c1t5d0 - - - - - - - c1t6d0 - - - - - - - c1t7d0 - - - - - - -
Updated by Dan McDonald about 2 years ago
- Related to Bug #13717: kernel fpu use can still lead to panic added
Updated by Dan McDonald about 2 years ago
- Related to Feature #12478: installctx needs kpreempt_disable protection added
Updated by Dan McDonald about 2 years ago
One other thing to notice is that the thread's stack is at the end of installing a save-FPU thread context, and when it calls kpreempt_enable() it IMMEDIATELY switches the same thread to restoring its FPU context, with an uninitialized buffer!
Updated by Dan McDonald about 2 years ago
Core dumps available in subdirectories at:
https://kebe.com/~danmcd/webrevs/13902/
folders based on SmartOS Platform Image run.
Updated by Dan McDonald about 2 years ago
A DEBUG dump was obtained, and it has the same stack and swtch() behavior as the non-debug ones. The only immediately useful piece of DEBUG data is that the allocated-but-unused fxsave_state buffer has allocation information:
> <rdi::whatis ffffff0c662c7450 is allocated from fxsave_cache: ADDR BUFADDR TIMESTAMP THREAD CACHE LASTLOG CONTENTS ffffff0c693fdcb8 ffffff0c662c7450 6991e8fb66 fffffe0173a24c20 ffffff0c2923a008 ffffff0a7275c640 0 kmem_cache_alloc_debug+0x2fc kmem_cache_alloc+0x25b fp_lwp_init+0x1f lwp_fp_init+0xe lwp_create+0xd80 lwp_kernel_create+0x6b taskq_thread_create+0x9b taskq_thread+0x5f8 thread_start+0xb >
Updated by Dan McDonald about 2 years ago
The floating point context for the kernel LWP has its pointer appear twice in the same lwp, That is standard behavior, however, see http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/os/fpu.c?r=948cceb0&fi=fp_lwp_init#860.
> <rdi::kgrep |::whatis fffffffffbc9c3a8 is panicbuf+0x40, in unix's data segment fffffe0171106420 is in thread fffffe0171106c20's stack ffffff0c34aeb6b8 is ffffff0c34ae6000+56b8, allocated from kmem_alloc_49152: ADDR BUFADDR TIMESTAMP THREAD CACHE LASTLOG CONTENTS ffffff0c29807600 ffffff0c34ae6000 433ff528dd fffffffffbc580c0 ffffff0a2c25e008 ffffff0a6929df00 0 kmem_slab_alloc_impl+0x324 kmem_slab_alloc+0x69 kmem_cache_alloc+0x193 kmem_alloc+0x4b kmem_zalloc+0x10d mp_cpu_configure_common+0x2aa start_cpu+0x39 start_other_cpus+0x2a0 main+0x30a _locore_start+0x90 ffffff0ca1df9c40 is allocated from lwp_cache: ADDR BUFADDR TIMESTAMP THREAD CACHE LASTLOG CONTENTS ffffff0c9f3a24b0 ffffff0ca1df9c40 6991e8ec01 fffffe0173a24c20 ffffff0c29f63008 ffffff0a7275c340 0 kmem_cache_alloc_debug+0x2fc kmem_cache_alloc+0x25b lwp_create+0x11bc lwp_kernel_create+0x6b taskq_thread_create+0x9b taskq_thread+0x5f8 thread_start+0xb ffffff0ca1df9d00 is ffffff0ca1df9c40+c0, allocated from lwp_cache: ADDR BUFADDR TIMESTAMP THREAD CACHE LASTLOG CONTENTS ffffff0c9f3a24b0 ffffff0ca1df9c40 6991e8ec01 fffffe0173a24c20 ffffff0c29f63008 ffffff0a7275c340 0 kmem_cache_alloc_debug+0x2fc kmem_cache_alloc+0x25b lwp_create+0x11bc lwp_kernel_create+0x6b taskq_thread_create+0x9b taskq_thread+0x5f8 thread_start+0xb ffffff0c693fdcc0 is ffffff0c693fdcb8+8, allocated from kmem_bufctl_audit_cache >
Updated by Dan McDonald about 2 years ago
The CPU on the filer's machine is Intel(r) Xeon(r) CPU E5649 @ 2.53GHz. If there's a known issue with Westmere EP and context switching, that may be useful to know.
Updated by Dan McDonald about 2 years ago
One last speculation before dinner.
I wonder if the FP registers are never saved during the installctx() setup, but then once installed the are attempted to be reloaded by the preempt()->swtch() sequence? Maybe we need kernel_fpu_begin() to be more explicit about the initial save?
Updated by Igor Kozhukhov about 2 years ago
could you try to test illumos based kvm VM under smartos or omnios and run ZFS tests ?
i can see FPU related crash on dilos kvm VM under dilos kvm host with ZFS tests.
Updated by Andy Fiddaman about 2 years ago
As a data point, I have a number of servers running 6-8 disk RAIDZ2, and with #13717 applied (latest OmniOS r38); no crashes.
However, they're running with avx2 and not ssse3.
> raidz_all_maths::print [ vdev_raidz_original_impl, vdev_raidz_scalar_impl, vdev_raidz_sse2_impl, vdev_raidz_ssse3_impl, vdev_raidz_avx2_impl ] > zfs_vdev_raidz_impl/x zfs_vdev_raidz_impl: zfs_vdev_raidz_impl: ffff > vdev_raidz_fastest_impl::print { init = 0 fini = 0 gen = [ avx2_gen_p, avx2_gen_pq, avx2_gen_pqr ] rec = [ avx2_rec_p, avx2_rec_q, avx2_rec_r, avx2_rec_pq, avx2_rec_pr, avx2_rec_qr, avx2_rec_pqr ] is_supported = 0 name = [ "fastest" ] }
Updated by Denis Kozadaev about 2 years ago
A KVM vm runs on a DilOS host:
dev0% uname -a
SunOS dev0 5.11 2.0.2.87 i86pc i386 i86pc
VM configuration
CPU: 4
RAM: 8GB
1 virtio block is for the root pool
1 virtio block is for data
3 virtio blocks are for ZFS tests (8GB each)
root@ztest03:/var/crash/ztest03# mdb unix.16 vmcore.16 mdb: warning: dump is from SunOS 5.11 2.0.2.86-20210618-1-0+int2-mm; dcmds and macros may not match kernel implementation Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs ip hook neti sockfs arp usba smbios sd mm random crypto ptm ufs logindmux nfs ] > ::status debugging crash dump vmcore.16 (64-bit) from ztest03 operating system: 5.11 2.0.2.86-20210618-1-0+int2-mm (i86pc) image uuid: 821d9f64-9eed-4eb3-b8e2-67272a355011 panic message: BAD TRAP: type=d (#gp General protection) rp=fffffe000fbd23f0 addr=fffffe0bebcfd000 dump content: kernel pages only > $C fffffe000fbd2510 fpxrestore+2() fffffe000fbd2540 kernel_fpu_ctx_restore+0x5b(0) fffffe000fbd2570 restorectx+0x5f(fffffe000fbd2c20) > $r %rax = 0x0000000000000002 %r9 = 0x000000000000050d %rbx = 0xfffffe0c48ef37c0 %r10 = 0xfffffe0c48ef39c0 %rcx = 0x0000000000000175 %r11 = 0xfffffffffbc590c0 kas %rdx = 0x00000000b82b5600 %r12 = 0xfffffe0c48ef37c0 %rsi = 0xfffffe0c48ef3818 %r13 = 0xfffffffffb890217 _resume_from_idle+0x12b %rdi = 0xfffffe0bfad7e220 %r14 = 0xfffffe0bebd3f000 %r8 = 0x0000000000000000 %r15 = 0xfffffe0bebcfd000 %rip = 0xfffffffffb889c42 fpxrestore+2 %rbp = 0xfffffe000fbd2510 %rsp = 0xfffffe000fbd24e8 %rflags = 0x00010246 id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0 status=<of,df,IF,tf,sf,ZF,af,PF,cf> %cs = 0x0030 %ds = 0x004b %es = 0x004b %fs = 0x0000 %gs = 0x0000 %gsbase = 0xfffffe0bebcfd000 %kgsbase = 0x0 %trapno = 0xd %err = 0x0 %cr2 = 0xfffffc7fee7e9004 %cr3 = 0x4400000 > fffffe000fbd2c20::whatis fffffe000fbd2c20 is allocated as a thread structure > fffffe000fbd2c20::findstack -v stack pointer for thread fffffe000fbd2c20 (zpool-testpool/94): fffffe000fbd2580 fffffe000fbd25b0 swtch+0x18a() fffffe000fbd25d0 preempt+0xf9() fffffe000fbd2600 kpreempt+0x3c(ffffffff) fffffe000fbd2680 installctx+0x13a(fffffe000fbd2c20, 0, fffffffffb88c5c0, fffffffffb88c730, 0, 0, 0, 0) fffffe000fbd26d0 kernel_fpu_begin+0x181(0, 2) fffffe000fbd2730 sse2_gen_p+0x1cc(fffffe111e1c2880) fffffe000fbd2750 vdev_raidz_math_generate+0x57(fffffe1128016bd8, fffffe111e1c2880) fffffe000fbd2780 vdev_raidz_generate_parity_row+0x2f(fffffe1128016bd8, fffffe111e1c2880) fffffe000fbd2830 vdev_raidz_io_start_write+0x43(fffffe112dfee128, fffffe111e1c2880, 9) fffffe000fbd2870 vdev_raidz_io_start+0xdf(fffffe112dfee128) fffffe000fbd28d0 zio_vdev_io_start+0x23f(fffffe112dfee128) fffffe000fbd2910 zio_nowait+0x183(fffffe112dfee128) fffffe000fbd29a0 vdev_mirror_io_start+0xda(fffffe112e93fd78) fffffe000fbd2a00 zio_vdev_io_start+0x517(fffffe112e93fd78) fffffe000fbd2a40 zio_execute+0xfb(fffffe112e93fd78) fffffe000fbd2b00 taskq_thread+0x311(fffffe0cf1de3d68) fffffe000fbd2b10 thread_start+0xb() > raidz_all_maths::print [ vdev_raidz_original_impl, vdev_raidz_scalar_impl, vdev_raidz_sse2_impl, vdev_raidz_ssse3_impl, vdev_raidz_avx2_impl ] > zfs_vdev_raidz_impl/x zfs_vdev_raidz_impl: zfs_vdev_raidz_impl: ffff > vdev_raidz_fastest_impl::print { init = 0 fini = 0 gen = [ sse2_gen_p, sse2_gen_pq, sse2_gen_pqr ] rec = [ sse2_rec_p, sse2_rec_q, sse2_rec_r, sse2_rec_pq, sse2_rec_pr, sse2_rec_qr, sse2_rec_pqr ] is_supported = 0 name = [ "fastest" ] } > <rdi=P 0xfffffe0bfad7e220 > fffffe0bfad7e220::whatis fffffe0bfad7e220 is allocated from fxsave_cache: ADDR BUFADDR TIMESTAMP THREAD CACHE LASTLOG CONTENTS fffffe0bfa830b70 fffffe0bfad7e220 855c1e531fcf8 fffffe00101e3c20 fffffe0be6362008 fffffe0bd2de8040 fffffe0bdd0c4858 kmem_cache_alloc_debug+0x3b0 kmem_cache_alloc+0xc7 fp_lwp_init+0x1f lwp_fp_init+0xe lwp_create+0xde0 lwp_kernel_create+0x6b taskq_thread_create+0x9b taskq_create_common+0x231 taskq_create_sysdc+0xae spa_taskqs_init+0x14a spa_create_zio_taskqs+0x29 spa_thread+0x1a9 thread_start+0xb > fffffe0bfad7e220::print struct fxsave_state { fx_fcw = 0xcafe fx_fsw = 0xbadd fx_fctw = 0xcafe fx_fop = 0xbadd fx_rip = 0xbaddcafebaddcafe fx_rdp = 0xbaddcafebaddcafe fx_mxcsr = 0xbaddcafe fx_mxcsr_mask = 0xbaddcafe fx_st = [ { fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ] fpr_mmx = 0xbaddcafebaddcafe __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ] fpr_mmx = 0xbaddcafebaddcafe __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ] fpr_mmx = 0xbaddcafebaddcafe __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ] fpr_mmx = 0xbaddcafebaddcafe __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ] fpr_mmx = 0xbaddcafebaddcafe __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ] fpr_mmx = 0xbaddcafebaddcafe __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ] fpr_mmx = 0xbaddcafebaddcafe __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ] fpr_mmx = 0xbaddcafebaddcafe __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, ] fx_xmm = [ { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, ] __fx_ign2 = [ { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, { _q = -3.6623428129233170e+847 _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ] }, ] } > ::spa -v ADDR STATE NAME fffffe0bead68000 ACTIVE data ADDR STATE AUX DESCRIPTION fffffe0bf787c000 HEALTHY - root fffffe0bf7881000 HEALTHY - /dev/dsk/c6t0d0s0 fffffe0be8461000 ACTIVE rpool fffffe0be8685000 HEALTHY - root fffffe0be868a000 HEALTHY - /dev/dsk/c4t0d0s0 fffffe123136a000 ACTIVE testpool fffffe123137e000 HEALTHY - root fffffe1231383000 HEALTHY - raidz fffffe1231388000 HEALTHY - /var/tmp/test_results/testdir/disk.cache/a fffffe123138d000 HEALTHY - /var/tmp/test_results/testdir/disk.cache/b fffffe1231392000 HEALTHY - /var/tmp/test_results/testdir/disk.cache/c - - - cache fffffe12314c9000 HEALTHY - /dev/dsk/c7t0d0s0 fffffe1231523000 HEALTHY - /dev/dsk/c8t0d0s0 >
The dump is also available
Updated by Dan McDonald about 2 years ago
I think I'm on to something.
Consider this fragment of code in kernel_fpu_begin():
/* * Not all threads may have an active LWP. If they do and we're not * going to re-use the LWP, then we should go ahead and save the state. * We must also note that the fpu is now being used by the kernel and * therefore we do not want to manage the fpu state via the user-level * thread's context handlers. * * We might have already saved once (due to a prior use of the kernel * FPU or another code path) so FPU_VALID could be set. This is handled * by fp_save, as is the FPU_EN check. */ if (pl != NULL) { kpreempt_disable(); if ((flags & KFPU_USE_LWP) == 0) /* XXX KEBE SAYS LOOK HERE!!! */ fp_save(&pl->lwp_pcb.pcb_fpu); pl->lwp_pcb.pcb_fpu.fpu_flags |= FPU_KERNEL; kpreempt_enable(); }
The check for KFPU_USE_LWP being turned off is interesting. I believe it's there to prevent a redundant save if userland has an LWP.
Our panicking process has LWPs, but is NOT a userland process (it's a taskq service). I wonder if we need to be less picky about when we perform this fp_save(). I'm going to have the filer test a small loosening of the restriction:
diff --git a/usr/src/uts/intel/ia32/os/fpu.c b/usr/src/uts/intel/ia32/os/fpu.c index d648e0552f..81a98c05a1 100644 --- a/usr/src/uts/intel/ia32/os/fpu.c +++ b/usr/src/uts/intel/ia32/os/fpu.c @@ -1382,7 +1382,8 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags) */ if (pl != NULL) { kpreempt_disable(); - if ((flags & KFPU_USE_LWP) == 0) + if ((flags & KFPU_USE_LWP) == 0 || + (curthread->t_procp->p_flag & SSYS) != 0) fp_save(&pl->lwp_pcb.pcb_fpu); pl->lwp_pcb.pcb_fpu.fpu_flags |= FPU_KERNEL; kpreempt_enable();
Stay tuned.
Updated by Dan McDonald about 2 years ago
Haven't tested the prior fix yet, but one OTHER fly in the ointment...
The caller isn't aware of this, but fp_save(), as a side-effect, ALWAYS CALLS kpreempt_enable() prior to return. Not sure what a double-call of kpreempt_enable() would do, but that's something ELSE to consider when examining the aforementioned fragment of kernel_fpu_begin().
Updated by Patrick Mooney about 2 years ago
The caller isn't aware of this, but fp_save(), as a side-effect, ALWAYS CALLS kpreempt_enable() prior to return. Not sure what a double-call of kpreempt_enable() would do, but that's something ELSE to consider when examining the aforementioned fragment of kernel_fpu_begin().
It has balanced kpreempt_disable
/kpreempt_enable
calls, so I don't think fp_save
is much of an outlier when it comes to preemption.
It's true that kpreempt_enable
will end up calling into swtch
if there happens to be work attempting to preempt the running thread, but I don't believe anything would prevent such preemption from happening spontaneously either (once t_kpreempt
was back at 0).
Updated by Dan McDonald about 2 years ago
Patrick Mooney wrote in #note-16:
It's true that
kpreempt_enable
will end up calling intoswtch
if there happens to be work attempting to preempt the running thread, but I don't believe anything would prevent such preemption from happening spontaneously either (oncet_kpreempt
was back at 0).
Thanks for the correction about fp_save(), and for the confirmation that kpreempt_enable() might force swtch(). Whatcha think about the loosening of the barriers to calling fp_save()?
Updated by Joshua M. Clulow about 2 years ago
- Related to Bug #13908: disable kernel FPU by default until it is stable added
Updated by Patrick Mooney about 2 years ago
#13908 seems like the way to go in the short term, so we can truly get to the bottom of this, rather than rushing a solution.
Updated by Jerry Jelinek about 2 years ago
I'll take a look at this sometime this week. Since the code is now disabled, I realize the urgency is less. One quick observation is that the notes in comment #15 are off track and the proposed fix is not relevant. This should be clear from the dump information provided where flags is KFPU_USE_LWP. In fact, in the (now disabled) illumos kfpu code, flags is only ever KFPU_NO_STATE or KFPU_USE_LWP. The fp_save described in comment #15 can never execute with how the kfpu was being used in stock illumos.
I'm fairly certain this is only peripherally related to the changes for 13717, which essentially just added kpreempt_disable/kpreempt_enable. However, we're now doing 2 preempt disble/enable here (the 2nd one when we do the installctx) and that may make this code more prone to a context switch at an inopportune point. The block at the end of the function will initialized the kernel thread's LWP fpu save area, but it might be better to do that before installing the context switch handler and perhaps wrap all of this in a preempt disble/enable to ensure we don't swtch before everything is setup.
Updated by Jerry Jelinek about 2 years ago
If anyone can provide a build with this change to the user who is hitting this (along with re-enabling kfpu) I'd be interested in hearing If this resolves the problem. Note that I haven't done any testing of this myself yet.
--- a/usr/src/uts/intel/ia32/os/fpu.c +++ b/usr/src/uts/intel/ia32/os/fpu.c @@ -1378,12 +1378,11 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags) * We might have already saved once so FPU_VALID could be set. * This is handled by fp_save. */ + kpreempt_disable(); if (pl != NULL) { - kpreempt_disable(); if ((flags & KFPU_USE_LWP) == 0) fp_save(&pl->lwp_pcb.pcb_fpu); pl->lwp_pcb.pcb_fpu.fpu_flags |= FPU_KERNEL; - kpreempt_enable(); } /* @@ -1418,6 +1417,7 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags) // initialize the kfpu state kernel_fpu_ctx_restore(kfpu); } + kpreempt_enable(); }
Updated by Dan McDonald about 2 years ago
I'm fairly certain this is only peripherally related to the changes for 13717, which essentially just added kpreempt_disable/kpreempt_enable. However, we're now doing 2 preempt disble/enable here (the 2nd one when we do the installctx) and that may make this code more prone to a context switch at an inopportune point. The block at the end of the function will initialized the kernel thread's LWP fpu save area, but it might be better to do that before installing the context switch handler and perhaps wrap all of this in a preempt disble/enable to ensure we don't swtch before everything is setup.
And your moving of kpreempt_{en,dis}able() in your suggested diff will do just that... and our dumps are literally showing preemption occurring right at the end of installctx()... it would also explain why we might not see this in the wild as much: it's a timing event and the more cycles per scheduling quanta the less likely the thread could context-switch at that inconvenient time.
I have the bug filer running a SmartOS PI with your new change, and I'll report back with either a new kernel dump OR good news.
Updated by Denis Kozadaev about 2 years ago
Jerry's patch works well
ZFS tests in the KVM vm passed without the kernel panic
Updated by Dan McDonald about 2 years ago
So far, the patch in comment 21 appears to be working.
There is one more issue to resolve, however. installctx() requires a KM_SLEEP allocation, which means it could block indefinitely while being called under kpreempt_disable(). Once we solve that, we solve this bug.
Two either-or ideas that immediately occur to me are:
1.) installctx() can be called with a pre-allocated buffer.
OR
2.) kernel_fpu_begin() can make sure installctx() is called only after everything else is set up.
Updated by Dan McDonald about 2 years ago
DO NOT USE -- THIS ONE PANICS.
Here was a proposed patch for option 2 "make sure installctx() is called only after everything else is set up." It moves the kernel-only-LWP initialization (like the taskq thread that's panicking) to before the installctx() call.
diff --git a/usr/src/uts/intel/ia32/os/fpu.c b/usr/src/uts/intel/ia32/os/fpu.c index d648e0552f..da8e1f76b4 100644 --- a/usr/src/uts/intel/ia32/os/fpu.c +++ b/usr/src/uts/intel/ia32/os/fpu.c @@ -1381,10 +1381,27 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags) * by fp_save, as is the FPU_EN check. */ if (pl != NULL) { + fpu_ctx_t *pf = &pl->lwp_pcb.pcb_fpu; + kpreempt_disable(); - if ((flags & KFPU_USE_LWP) == 0) - fp_save(&pl->lwp_pcb.pcb_fpu); - pl->lwp_pcb.pcb_fpu.fpu_flags |= FPU_KERNEL; + if ((flags & KFPU_USE_LWP) == 0) { + fp_save(pf); + } else { + /* + * For pure kernel threads with an LWP, we can use the + * LWP's pcb_fpu to save/restore context. + */ + VERIFY(curthread->t_procp->p_flag & SSYS); + VERIFY(kfpu == NULL); + ASSERT((pf->fpu_flags & FPU_EN) == 0); + + /* Always restore the fpu to the initial state. */ + if (fp_save_mech == FP_XSAVE) + pf->fpu_xsave_mask = XFEATURE_FP_ALL; + fpinit(); + pf->fpu_flags = FPU_EN; + } + pf->fpu_flags |= FPU_KERNEL; kpreempt_enable(); } @@ -1400,23 +1417,7 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags) curthread->t_flag |= T_KFPU; - if ((flags & KFPU_USE_LWP) == KFPU_USE_LWP) { - /* - * For pure kernel threads with an LWP, we can use the LWP's - * pcb_fpu to save/restore context. - */ - fpu_ctx_t *pf = &pl->lwp_pcb.pcb_fpu; - - VERIFY(curthread->t_procp->p_flag & SSYS); - VERIFY(kfpu == NULL); - ASSERT((pf->fpu_flags & FPU_EN) == 0); - - /* Always restore the fpu to the initial state. */ - if (fp_save_mech == FP_XSAVE) - pf->fpu_xsave_mask = XFEATURE_FP_ALL; - fpinit(); - pf->fpu_flags = FPU_EN | FPU_KERNEL; - } else { + if ((flags & KFPU_USE_LWP) == 0) { /* initialize the kfpu state */ kernel_fpu_ctx_restore(kfpu); }
It panics, for reasons TBD.
Updated by Jerry Jelinek about 2 years ago
The observation about a sleeping alloc is a good one, thanks for catching that. I don't think the previous diff is the right approach. How about this instead? I think this avoids the voluntary swtch we can hit before we're ready.
--- a/usr/src/uts/intel/ia32/os/fpu.c +++ b/usr/src/uts/intel/ia32/os/fpu.c @@ -1390,16 +1390,6 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags) kpreempt_enable(); } - /* - * Set the context operations for kernel FPU usage. Note that this - * cannot be done with pre-emption and interrupts disabled, since - * installctx does a sleeping allocation. We haven't finished - * initializing our kernel FPU state yet, but in the rare case that we - * happen to save/restore before that, no harm is done. - */ - installctx(curthread, kfpu, kernel_fpu_ctx_save, kernel_fpu_ctx_restore, - NULL, NULL, NULL, NULL); - curthread->t_flag |= T_KFPU; if ((flags & KFPU_USE_LWP) == KFPU_USE_LWP) { @@ -1422,6 +1412,17 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags) // initialize the kfpu state kernel_fpu_ctx_restore(kfpu); } + + /* + * Set the context operations for kernel FPU usage. Note that this + * cannot be done with pre-emption and interrupts disabled, since + * installctx does a sleeping allocation. We haven't finished + * initializing our kernel FPU state yet, but in the rare case that we + * happen to save/restore before that, no harm is done. + */ + installctx(curthread, kfpu, kernel_fpu_ctx_save, kernel_fpu_ctx_restore, + NULL, NULL, NULL, NULL); + }
Updated by Jason King about 2 years ago
I think this makes sense -- though I have a question (as I've been mostly observing from the sidelines) -- should T_KFPU be set after the handlers have been installed? Since they will also update T_KFPU as needed, I'm wondering if somehow kernel_fpu_no_switch()
could end up being called in-between T_KFPU being set and installctx() installing the handlers (causing a panic).
Updated by Dan McDonald about 2 years ago
Jerry's original fix, plus additional modifications to installctx() and its callers (others of which are preempting the kernel) have a webrev here:
https://kebe.com/~danmcd/webrevs/13902/webrev-prealloc/
that has tested so far.
Jerry's most recent suggestion will be in:
https://kebe.com/~danmcd/webrevs/13902/webrev-jerry2/
and will be tested later today.
Updated by Dan McDonald about 2 years ago
If Jerry's second fix works, it may be sufficient for this bug, but the "prealloc" changes will likely become their own issue.
Updated by Dan McDonald about 2 years ago
Jerry's second fix failed during the sleeping memory allocation. We likely need his first fix, AND the preallocation for installctx() callers that need disabled preemption over the installctx() call. Screen-shot attached.
Updated by Dan McDonald about 2 years ago
- Related to Bug #13915: installctx() blocking allocate causes problems added
Updated by Dan McDonald about 2 years ago
- File jerry2-panic.png jerry2-panic.png added
Updated by Jason King about 2 years ago
Is that panic from running https://kebe.com/~danmcd/webrevs/13902/webrev-jerry2/ (with all the revisions, it's not always clear). If so, I'm wondering if my earlier question is still relevant.
Updated by Dan McDonald about 2 years ago
Jason King wrote in #note-34:
Is that panic from running https://kebe.com/~danmcd/webrevs/13902/webrev-jerry2/ (with all the revisions, it's not always clear). If so, I'm wondering if my earlier question is still relevant.
It is from webrev-jerry2, and your question was answered, "YES, YES IT COULD!" :)
Updated by Dan McDonald about 2 years ago
The very latest fix, which includes #13915, has been running for:
15:45:10 up 17:39, 1 user, load average: 0.05, 0.05, 0.04
without incident on the discoverer's machine. Previous to this fix, it would panic within a handful of hours, sometimes even less than an hour.
Updated by Dan McDonald about 2 years ago
- Status changed from New to Pending RTI
- Assignee set to Dan McDonald
Updated by Dan McDonald about 2 years ago
- % Done changed from 0 to 90
- Difficulty changed from Hard to Medium
Updated by Electric Monk about 2 years ago
- Status changed from Pending RTI to Closed
- % Done changed from 90 to 100
git commit c21bd51d7acbaf77116c4cc3a23dfc6d16c637c2
commit c21bd51d7acbaf77116c4cc3a23dfc6d16c637c2 Author: Dan McDonald <danmcd@joyent.com> Date: 2021-07-02T19:24:36.000Z 13902 Fix for 13717 may break 8-disk raidz2 13915 installctx() blocking allocate causes problems Portions contributed by: Jerry Jelinek <gjelinek@gmail.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: Patrick Mooney <pmooney@pfmooney.com> Approved by: Joshua M. Clulow <josh@sysmgr.org>