Project

General

Profile

Actions

Bug #13902

closed

Fix for 13717 may break 8-disk raidz2

Added by Dan McDonald about 2 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
kernel
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

I've got three kernel dumps from SmartOS 20210617 and one more from SmartOS 20210520. The filer is running an 8-disk raidz (raidz1) pool and it panics within a few hours consistently, with this pattern of panic:

mdb: warning: dump is from SunOS 5.11 joyent_20210520T001536Z; dcmds and macros may not match kernel implementation
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba uhci smbios stmf_sbd stmf sd zfs mm lofs idm sata crypto random cpc logindmux ptm sppp nsmb smbsrv nfs ]
> $C
fffffe0170ae9590 fpxrestore+2()
fffffe0170ae95b0 kernel_fpu_ctx_restore+0x13(0)
fffffe0170ae95e0 restorectx+0x3d(fffffe0170ae9c20)
> ::status
debugging crash dump vmcore.4 (64-bit) from ronsu
operating system: 5.11 joyent_20210520T001536Z (i86pc)
git branch: release-20210520
git rev: 58a2b7ba9771c57f897a33cf69335cd18fa5992f
image uuid: (not set)
panic message: BAD TRAP: type=d (#gp General protection) rp=fffffe0170ae9470 addr=ffffff0a5c2ff000
dump content: kernel pages only
> fffffe0170ae9c20::findstack -v
stack pointer for thread fffffe0170ae9c20 (zpool-zones/120): fffffe0170ae95f0
  fffffe0170ae9620 swtch+0x133()
  fffffe0170ae9640 preempt+0xfa()
  fffffe0170ae9670 kpreempt+0x3c(ffffffff)
  fffffe0170ae96f0 installctx+0x11a(fffffe0170ae9c20, 0, fffffffffb880870, fffffffffb880910, 0, 0, 0, 0)
  fffffe0170ae9740 kernel_fpu_begin+0x116(0, 2)
  fffffe0170ae97c0 ssse3_gen_pq+0x1b4(ffffff0c4e303580)
  fffffe0170ae97e0 vdev_raidz_math_generate+0x52(ffffff0c4e303580)
  fffffe0170ae9810 vdev_raidz_generate_parity+0x16(ffffff0c4e303580)
  fffffe0170ae98a0 vdev_raidz_io_start+0x1d8(ffffff0c4cd5c928)
  fffffe0170ae98f0 zio_vdev_io_start+0x89(ffffff0c4cd5c928)
  fffffe0170ae9920 zio_execute+0xf1(ffffff0c4cd5c928)
  fffffe0170ae9950 zio_nowait+0x2b(ffffff0c4cd5c928)
  fffffe0170ae99e0 vdev_mirror_io_start+0xba(ffffff0c4eb206b8)
  fffffe0170ae9a30 zio_vdev_io_start+0x1a2(ffffff0c4eb206b8)
  fffffe0170ae9a60 zio_execute+0xa7(ffffff0c4eb206b8)
  fffffe0170ae9b10 taskq_thread+0x2cd(ffffff0a9d94e298)
  fffffe0170ae9b20 thread_start+0xb()
> 

So the faulty instruction and data are:

> fpxrestore+2::dis
fpxrestore:                     clts   
fpxrestore+2:                   fxrstor (%rdi)
fpxrestore+6:                   ret    
0xfffffffffb87e277:             nopw   0x0(%rax,%rax)
xrestore:                       clts   
xrestore+2:                     movl   %esi,%eax
xrestore+4:                     movq   %rsi,%rdx
xrestore+7:                     shrq   $0x20,%rdx
xrestore+0xb:                   xrstor (%rdi)
xrestore+0xe:                   ret    
0xfffffffffb87e28f:             nop    
fpdisable:                      movq   %cr0,%rdi
> <rdi=P
                0xffffff0a9f650800 
> <rdi::whatis
ffffff0a9f650800 is allocated from fxsave_cache
> ffffff0a9f650800::print struct fxsave_state !head
{
    fx_fcw = 0xcafe
    fx_fsw = 0xbadd
    fx_fctw = 0xcafe
    fx_fop = 0xbadd
    fx_rip = 0xbaddcafebaddcafe
    fx_rdp = 0xbaddcafebaddcafe
    fx_mxcsr = 0xbaddcafe
    fx_mxcsr_mask = 0xbaddcafe
    fx_st = [
> 

Hmmm, this looks like uninitialized memory. Was this supposed to have something?

I'm giving the discoverer a DEBUG kernel, in the hopes that we'll have more information in its coredump. The discoverer is currently running a pre-illumos-13717 SmartOS platform image, and so far, no panic.


Files

jerry2-panic.png (358 KB) jerry2-panic.png Dan McDonald, 2021-06-30 01:29 AM

Related issues

Related to illumos gate - Bug #13717: kernel fpu use can still lead to panicClosed

Actions
Related to illumos gate - Feature #12478: installctx needs kpreempt_disable protectionClosedPatrick Mooney

Actions
Related to illumos gate - Bug #13908: disable kernel FPU by default until it is stableClosedJoshua M. Clulow

Actions
Related to illumos gate - Bug #13915: installctx() blocking allocate causes problemsClosedDan McDonald

Actions
Actions #1

Updated by Dan McDonald about 2 years ago

  • Subject changed from Fix for 13717 may break 8-disk raidz1 to Fix for 13717 may break 8-disk raidz2
Actions #2

Updated by Dan McDonald about 2 years ago

It's raidz2, not raidz:

[root@ronsu /zones]# zpool list -v
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zones       7.27T  58.5G  7.21T        -         -     0%     0%  1.00x  ONLINE  -
  raidz2    7.27T  58.5G  7.21T        -         -     0%  0.78%
    c1t0d0      -      -      -        -         -      -      -
    c1t1d0      -      -      -        -         -      -      -
    c1t2d0      -      -      -        -         -      -      -
    c1t3d0      -      -      -        -         -      -      -
    c1t4d0      -      -      -        -         -      -      -
    c1t5d0      -      -      -        -         -      -      -
    c1t6d0      -      -      -        -         -      -      -
    c1t7d0      -      -      -        -         -      -      -
Actions #3

Updated by Dan McDonald about 2 years ago

  • Related to Bug #13717: kernel fpu use can still lead to panic added
Actions #4

Updated by Dan McDonald about 2 years ago

  • Related to Feature #12478: installctx needs kpreempt_disable protection added
Actions #5

Updated by Dan McDonald about 2 years ago

One other thing to notice is that the thread's stack is at the end of installing a save-FPU thread context, and when it calls kpreempt_enable() it IMMEDIATELY switches the same thread to restoring its FPU context, with an uninitialized buffer!

Actions #6

Updated by Dan McDonald about 2 years ago

Core dumps available in subdirectories at:

https://kebe.com/~danmcd/webrevs/13902/

folders based on SmartOS Platform Image run.

Actions #7

Updated by Dan McDonald about 2 years ago

A DEBUG dump was obtained, and it has the same stack and swtch() behavior as the non-debug ones. The only immediately useful piece of DEBUG data is that the allocated-but-unused fxsave_state buffer has allocation information:

> <rdi::whatis
ffffff0c662c7450 is allocated from fxsave_cache:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
ffffff0c693fdcb8 ffffff0c662c7450       6991e8fb66 fffffe0173a24c20
                 ffffff0c2923a008 ffffff0a7275c640                0
                 kmem_cache_alloc_debug+0x2fc
                 kmem_cache_alloc+0x25b
                 fp_lwp_init+0x1f
                 lwp_fp_init+0xe
                 lwp_create+0xd80
                 lwp_kernel_create+0x6b
                 taskq_thread_create+0x9b
                 taskq_thread+0x5f8
                 thread_start+0xb

> 
Actions #8

Updated by Dan McDonald about 2 years ago

The floating point context for the kernel LWP has its pointer appear twice in the same lwp, That is standard behavior, however, see http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/os/fpu.c?r=948cceb0&fi=fp_lwp_init#860.

> <rdi::kgrep |::whatis 
fffffffffbc9c3a8 is panicbuf+0x40, in unix's data segment
fffffe0171106420 is in thread fffffe0171106c20's stack
ffffff0c34aeb6b8 is ffffff0c34ae6000+56b8, allocated from kmem_alloc_49152:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
ffffff0c29807600 ffffff0c34ae6000       433ff528dd fffffffffbc580c0
                 ffffff0a2c25e008 ffffff0a6929df00                0
                 kmem_slab_alloc_impl+0x324
                 kmem_slab_alloc+0x69
                 kmem_cache_alloc+0x193
                 kmem_alloc+0x4b
                 kmem_zalloc+0x10d
                 mp_cpu_configure_common+0x2aa
                 start_cpu+0x39
                 start_other_cpus+0x2a0
                 main+0x30a
                 _locore_start+0x90

ffffff0ca1df9c40 is allocated from lwp_cache:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
ffffff0c9f3a24b0 ffffff0ca1df9c40       6991e8ec01 fffffe0173a24c20
                 ffffff0c29f63008 ffffff0a7275c340                0
                 kmem_cache_alloc_debug+0x2fc
                 kmem_cache_alloc+0x25b
                 lwp_create+0x11bc
                 lwp_kernel_create+0x6b
                 taskq_thread_create+0x9b
                 taskq_thread+0x5f8
                 thread_start+0xb

ffffff0ca1df9d00 is ffffff0ca1df9c40+c0, allocated from lwp_cache:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
ffffff0c9f3a24b0 ffffff0ca1df9c40       6991e8ec01 fffffe0173a24c20
                 ffffff0c29f63008 ffffff0a7275c340                0
                 kmem_cache_alloc_debug+0x2fc
                 kmem_cache_alloc+0x25b
                 lwp_create+0x11bc
                 lwp_kernel_create+0x6b
                 taskq_thread_create+0x9b
                 taskq_thread+0x5f8
                 thread_start+0xb

ffffff0c693fdcc0 is ffffff0c693fdcb8+8, allocated from kmem_bufctl_audit_cache
> 
Actions #9

Updated by Dan McDonald about 2 years ago

The CPU on the filer's machine is Intel(r) Xeon(r) CPU E5649 @ 2.53GHz. If there's a known issue with Westmere EP and context switching, that may be useful to know.

Actions #10

Updated by Dan McDonald about 2 years ago

One last speculation before dinner.

I wonder if the FP registers are never saved during the installctx() setup, but then once installed the are attempted to be reloaded by the preempt()->swtch() sequence? Maybe we need kernel_fpu_begin() to be more explicit about the initial save?

Actions #11

Updated by Igor Kozhukhov about 2 years ago

could you try to test illumos based kvm VM under smartos or omnios and run ZFS tests ?
i can see FPU related crash on dilos kvm VM under dilos kvm host with ZFS tests.

Actions #12

Updated by Andy Fiddaman about 2 years ago

As a data point, I have a number of servers running 6-8 disk RAIDZ2, and with #13717 applied (latest OmniOS r38); no crashes.
However, they're running with avx2 and not ssse3.

> raidz_all_maths::print
[ vdev_raidz_original_impl, vdev_raidz_scalar_impl, vdev_raidz_sse2_impl, vdev_raidz_ssse3_impl, vdev_raidz_avx2_impl ]

> zfs_vdev_raidz_impl/x
zfs_vdev_raidz_impl:
zfs_vdev_raidz_impl:            ffff

> vdev_raidz_fastest_impl::print
{
    init = 0
    fini = 0
    gen = [ avx2_gen_p, avx2_gen_pq, avx2_gen_pqr ]
    rec = [ avx2_rec_p, avx2_rec_q, avx2_rec_r, avx2_rec_pq, avx2_rec_pr, avx2_rec_qr, avx2_rec_pqr ]
    is_supported = 0
    name = [ "fastest" ]
}
Actions #13

Updated by Denis Kozadaev about 2 years ago

A KVM vm runs on a DilOS host:
dev0% uname -a
SunOS dev0 5.11 2.0.2.87 i86pc i386 i86pc

VM configuration
CPU: 4
RAM: 8GB
1 virtio block is for the root pool
1 virtio block is for data
3 virtio blocks are for ZFS tests (8GB each)

root@ztest03:/var/crash/ztest03# mdb unix.16 vmcore.16
mdb: warning: dump is from SunOS 5.11 2.0.2.86-20210618-1-0+int2-mm; dcmds and macros may not match kernel implementation
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs ip hook neti sockfs arp usba smbios sd mm random crypto ptm ufs logindmux nfs ]
> ::status
debugging crash dump vmcore.16 (64-bit) from ztest03
operating system: 5.11 2.0.2.86-20210618-1-0+int2-mm (i86pc)
image uuid: 821d9f64-9eed-4eb3-b8e2-67272a355011
panic message: BAD TRAP: type=d (#gp General protection) rp=fffffe000fbd23f0 addr=fffffe0bebcfd000
dump content: kernel pages only
> $C
fffffe000fbd2510 fpxrestore+2()
fffffe000fbd2540 kernel_fpu_ctx_restore+0x5b(0)
fffffe000fbd2570 restorectx+0x5f(fffffe000fbd2c20)
> $r
%rax = 0x0000000000000002                 %r9  = 0x000000000000050d 
%rbx = 0xfffffe0c48ef37c0                 %r10 = 0xfffffe0c48ef39c0 
%rcx = 0x0000000000000175                 %r11 = 0xfffffffffbc590c0 kas
%rdx = 0x00000000b82b5600                 %r12 = 0xfffffe0c48ef37c0 
%rsi = 0xfffffe0c48ef3818                 %r13 = 0xfffffffffb890217 _resume_from_idle+0x12b
%rdi = 0xfffffe0bfad7e220                 %r14 = 0xfffffe0bebd3f000 
%r8  = 0x0000000000000000                 %r15 = 0xfffffe0bebcfd000 

%rip = 0xfffffffffb889c42 fpxrestore+2
%rbp = 0xfffffe000fbd2510
%rsp = 0xfffffe000fbd24e8
%rflags = 0x00010246
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=<of,df,IF,tf,sf,ZF,af,PF,cf>

%cs = 0x0030    %ds = 0x004b    %es = 0x004b    %fs = 0x0000
%gs = 0x0000    %gsbase = 0xfffffe0bebcfd000    %kgsbase = 0x0
%trapno = 0xd   %err = 0x0      %cr2 = 0xfffffc7fee7e9004       %cr3 = 0x4400000

> fffffe000fbd2c20::whatis
fffffe000fbd2c20 is allocated as a thread structure

> fffffe000fbd2c20::findstack -v
stack pointer for thread fffffe000fbd2c20 (zpool-testpool/94): fffffe000fbd2580
  fffffe000fbd25b0 swtch+0x18a()
  fffffe000fbd25d0 preempt+0xf9()
  fffffe000fbd2600 kpreempt+0x3c(ffffffff)
  fffffe000fbd2680 installctx+0x13a(fffffe000fbd2c20, 0, fffffffffb88c5c0, fffffffffb88c730, 0, 0, 0, 0)
  fffffe000fbd26d0 kernel_fpu_begin+0x181(0, 2)
  fffffe000fbd2730 sse2_gen_p+0x1cc(fffffe111e1c2880)
  fffffe000fbd2750 vdev_raidz_math_generate+0x57(fffffe1128016bd8, fffffe111e1c2880)
  fffffe000fbd2780 vdev_raidz_generate_parity_row+0x2f(fffffe1128016bd8, fffffe111e1c2880)
  fffffe000fbd2830 vdev_raidz_io_start_write+0x43(fffffe112dfee128, fffffe111e1c2880, 9)
  fffffe000fbd2870 vdev_raidz_io_start+0xdf(fffffe112dfee128)
  fffffe000fbd28d0 zio_vdev_io_start+0x23f(fffffe112dfee128)
  fffffe000fbd2910 zio_nowait+0x183(fffffe112dfee128)
  fffffe000fbd29a0 vdev_mirror_io_start+0xda(fffffe112e93fd78)
  fffffe000fbd2a00 zio_vdev_io_start+0x517(fffffe112e93fd78)
  fffffe000fbd2a40 zio_execute+0xfb(fffffe112e93fd78)
  fffffe000fbd2b00 taskq_thread+0x311(fffffe0cf1de3d68)
  fffffe000fbd2b10 thread_start+0xb()

> raidz_all_maths::print
[ vdev_raidz_original_impl, vdev_raidz_scalar_impl, vdev_raidz_sse2_impl, vdev_raidz_ssse3_impl, vdev_raidz_avx2_impl ]

> zfs_vdev_raidz_impl/x
zfs_vdev_raidz_impl:
zfs_vdev_raidz_impl:            ffff    

> vdev_raidz_fastest_impl::print
{
    init = 0
    fini = 0
    gen = [ sse2_gen_p, sse2_gen_pq, sse2_gen_pqr ]
    rec = [ sse2_rec_p, sse2_rec_q, sse2_rec_r, sse2_rec_pq, sse2_rec_pr, sse2_rec_qr, sse2_rec_pqr ]
    is_supported = 0
    name = [ "fastest" ]
}
> <rdi=P
                0xfffffe0bfad7e220 
> fffffe0bfad7e220::whatis
fffffe0bfad7e220 is allocated from fxsave_cache:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
fffffe0bfa830b70 fffffe0bfad7e220    855c1e531fcf8 fffffe00101e3c20
                 fffffe0be6362008 fffffe0bd2de8040 fffffe0bdd0c4858
                 kmem_cache_alloc_debug+0x3b0
                 kmem_cache_alloc+0xc7
                 fp_lwp_init+0x1f
                 lwp_fp_init+0xe
                 lwp_create+0xde0
                 lwp_kernel_create+0x6b
                 taskq_thread_create+0x9b
                 taskq_create_common+0x231
                 taskq_create_sysdc+0xae
                 spa_taskqs_init+0x14a
                 spa_create_zio_taskqs+0x29
                 spa_thread+0x1a9
                 thread_start+0xb

> fffffe0bfad7e220::print struct fxsave_state
{
    fx_fcw = 0xcafe
    fx_fsw = 0xbadd
    fx_fctw = 0xcafe
    fx_fop = 0xbadd
    fx_rip = 0xbaddcafebaddcafe
    fx_rdp = 0xbaddcafebaddcafe
    fx_mxcsr = 0xbaddcafe
    fx_mxcsr_mask = 0xbaddcafe
    fx_st = [
        {
            fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ]
            fpr_mmx = 0xbaddcafebaddcafe
            __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ]
            fpr_mmx = 0xbaddcafebaddcafe
            __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ]
            fpr_mmx = 0xbaddcafebaddcafe
            __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ]
            fpr_mmx = 0xbaddcafebaddcafe
            __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ]
            fpr_mmx = 0xbaddcafebaddcafe
            __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ]
            fpr_mmx = 0xbaddcafebaddcafe
            __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ]
            fpr_mmx = 0xbaddcafebaddcafe
            __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            fpr_16 = [ 0xcafe, 0xbadd, 0xcafe, 0xbadd, 0xcafe ]
            fpr_mmx = 0xbaddcafebaddcafe
            __fpr_pad = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
    ]
    fx_xmm = [
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },                            
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
    ]
    __fx_ign2 = [
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
        {
            _q = -3.6623428129233170e+847
            _l = [ 0xbaddcafe, 0xbaddcafe, 0xbaddcafe, 0xbaddcafe ]
        },
    ]
}

> ::spa -v
ADDR                 STATE NAME                                                
fffffe0bead68000    ACTIVE data

    ADDR             STATE     AUX          DESCRIPTION                        
    fffffe0bf787c000 HEALTHY   -            root
    fffffe0bf7881000 HEALTHY   -              /dev/dsk/c6t0d0s0
fffffe0be8461000    ACTIVE rpool

    fffffe0be8685000 HEALTHY   -            root
    fffffe0be868a000 HEALTHY   -              /dev/dsk/c4t0d0s0
fffffe123136a000    ACTIVE testpool

    fffffe123137e000 HEALTHY   -            root
    fffffe1231383000 HEALTHY   -              raidz
    fffffe1231388000 HEALTHY   -                /var/tmp/test_results/testdir/disk.cache/a
    fffffe123138d000 HEALTHY   -                /var/tmp/test_results/testdir/disk.cache/b
    fffffe1231392000 HEALTHY   -                /var/tmp/test_results/testdir/disk.cache/c
    -                -         -            cache
    fffffe12314c9000 HEALTHY   -              /dev/dsk/c7t0d0s0
    fffffe1231523000 HEALTHY   -              /dev/dsk/c8t0d0s0
> 

The dump is also available

Actions #14

Updated by Dan McDonald about 2 years ago

I think I'm on to something.

Consider this fragment of code in kernel_fpu_begin():

    /*
     * Not all threads may have an active LWP. If they do and we're not
     * going to re-use the LWP, then we should go ahead and save the state.
     * We must also note that the fpu is now being used by the kernel and
     * therefore we do not want to manage the fpu state via the user-level
     * thread's context handlers.
     *
     * We might have already saved once (due to a prior use of the kernel
     * FPU or another code path) so FPU_VALID could be set. This is handled
     * by fp_save, as is the FPU_EN check.
     */
    if (pl != NULL) {
        kpreempt_disable();
        if ((flags & KFPU_USE_LWP) == 0)   /* XXX KEBE SAYS LOOK HERE!!! */
            fp_save(&pl->lwp_pcb.pcb_fpu);
        pl->lwp_pcb.pcb_fpu.fpu_flags |= FPU_KERNEL;
        kpreempt_enable();
    }

The check for KFPU_USE_LWP being turned off is interesting. I believe it's there to prevent a redundant save if userland has an LWP.

Our panicking process has LWPs, but is NOT a userland process (it's a taskq service). I wonder if we need to be less picky about when we perform this fp_save(). I'm going to have the filer test a small loosening of the restriction:

diff --git a/usr/src/uts/intel/ia32/os/fpu.c b/usr/src/uts/intel/ia32/os/fpu.c
index d648e0552f..81a98c05a1 100644
--- a/usr/src/uts/intel/ia32/os/fpu.c
+++ b/usr/src/uts/intel/ia32/os/fpu.c
@@ -1382,7 +1382,8 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags)
      */
     if (pl != NULL) {
         kpreempt_disable();
-        if ((flags & KFPU_USE_LWP) == 0)
+        if ((flags & KFPU_USE_LWP) == 0 ||
+            (curthread->t_procp->p_flag & SSYS) != 0)
             fp_save(&pl->lwp_pcb.pcb_fpu);
         pl->lwp_pcb.pcb_fpu.fpu_flags |= FPU_KERNEL;
         kpreempt_enable();

Stay tuned.

Actions #15

Updated by Dan McDonald about 2 years ago

Haven't tested the prior fix yet, but one OTHER fly in the ointment...

The caller isn't aware of this, but fp_save(), as a side-effect, ALWAYS CALLS kpreempt_enable() prior to return. Not sure what a double-call of kpreempt_enable() would do, but that's something ELSE to consider when examining the aforementioned fragment of kernel_fpu_begin().

Actions #16

Updated by Patrick Mooney about 2 years ago

The caller isn't aware of this, but fp_save(), as a side-effect, ALWAYS CALLS kpreempt_enable() prior to return. Not sure what a double-call of kpreempt_enable() would do, but that's something ELSE to consider when examining the aforementioned fragment of kernel_fpu_begin().

It has balanced kpreempt_disable/kpreempt_enable calls, so I don't think fp_save is much of an outlier when it comes to preemption.

It's true that kpreempt_enable will end up calling into swtch if there happens to be work attempting to preempt the running thread, but I don't believe anything would prevent such preemption from happening spontaneously either (once t_kpreempt was back at 0).

Actions #17

Updated by Dan McDonald about 2 years ago

Patrick Mooney wrote in #note-16:

It's true that kpreempt_enable will end up calling into swtch if there happens to be work attempting to preempt the running thread, but I don't believe anything would prevent such preemption from happening spontaneously either (once t_kpreempt was back at 0).

Thanks for the correction about fp_save(), and for the confirmation that kpreempt_enable() might force swtch(). Whatcha think about the loosening of the barriers to calling fp_save()?

Actions #18

Updated by Joshua M. Clulow about 2 years ago

  • Related to Bug #13908: disable kernel FPU by default until it is stable added
Actions #19

Updated by Patrick Mooney about 2 years ago

#13908 seems like the way to go in the short term, so we can truly get to the bottom of this, rather than rushing a solution.

Actions #20

Updated by Jerry Jelinek about 2 years ago

I'll take a look at this sometime this week. Since the code is now disabled, I realize the urgency is less. One quick observation is that the notes in comment #15 are off track and the proposed fix is not relevant. This should be clear from the dump information provided where flags is KFPU_USE_LWP. In fact, in the (now disabled) illumos kfpu code, flags is only ever KFPU_NO_STATE or KFPU_USE_LWP. The fp_save described in comment #15 can never execute with how the kfpu was being used in stock illumos.

I'm fairly certain this is only peripherally related to the changes for 13717, which essentially just added kpreempt_disable/kpreempt_enable. However, we're now doing 2 preempt disble/enable here (the 2nd one when we do the installctx) and that may make this code more prone to a context switch at an inopportune point. The block at the end of the function will initialized the kernel thread's LWP fpu save area, but it might be better to do that before installing the context switch handler and perhaps wrap all of this in a preempt disble/enable to ensure we don't swtch before everything is setup.

Actions #21

Updated by Jerry Jelinek about 2 years ago

If anyone can provide a build with this change to the user who is hitting this (along with re-enabling kfpu) I'd be interested in hearing If this resolves the problem. Note that I haven't done any testing of this myself yet.

--- a/usr/src/uts/intel/ia32/os/fpu.c
+++ b/usr/src/uts/intel/ia32/os/fpu.c
@@ -1378,12 +1378,11 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags)
         * We might have already saved once so FPU_VALID could be set.
         * This is handled by fp_save.
         */
+       kpreempt_disable();
        if (pl != NULL) {
-               kpreempt_disable();
                if ((flags & KFPU_USE_LWP) == 0)
                        fp_save(&pl->lwp_pcb.pcb_fpu);
                pl->lwp_pcb.pcb_fpu.fpu_flags |= FPU_KERNEL;
-               kpreempt_enable();
        }

        /*
@@ -1418,6 +1417,7 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags)
                // initialize the kfpu state
                kernel_fpu_ctx_restore(kfpu);
        }
+       kpreempt_enable();
 }
Actions #22

Updated by Dan McDonald about 2 years ago

I'm fairly certain this is only peripherally related to the changes for 13717, which essentially just added kpreempt_disable/kpreempt_enable. However, we're now doing 2 preempt disble/enable here (the 2nd one when we do the installctx) and that may make this code more prone to a context switch at an inopportune point. The block at the end of the function will initialized the kernel thread's LWP fpu save area, but it might be better to do that before installing the context switch handler and perhaps wrap all of this in a preempt disble/enable to ensure we don't swtch before everything is setup.

And your moving of kpreempt_{en,dis}able() in your suggested diff will do just that... and our dumps are literally showing preemption occurring right at the end of installctx()... it would also explain why we might not see this in the wild as much: it's a timing event and the more cycles per scheduling quanta the less likely the thread could context-switch at that inconvenient time.

I have the bug filer running a SmartOS PI with your new change, and I'll report back with either a new kernel dump OR good news.

Actions #23

Updated by Denis Kozadaev about 2 years ago

Jerry's patch works well
ZFS tests in the KVM vm passed without the kernel panic

Actions #24

Updated by Dan McDonald about 2 years ago

So far, the patch in comment 21 appears to be working.

There is one more issue to resolve, however. installctx() requires a KM_SLEEP allocation, which means it could block indefinitely while being called under kpreempt_disable(). Once we solve that, we solve this bug.

Two either-or ideas that immediately occur to me are:

1.) installctx() can be called with a pre-allocated buffer.

OR

2.) kernel_fpu_begin() can make sure installctx() is called only after everything else is set up.

Actions #25

Updated by Dan McDonald about 2 years ago

DO NOT USE -- THIS ONE PANICS.

Here was a proposed patch for option 2 "make sure installctx() is called only after everything else is set up." It moves the kernel-only-LWP initialization (like the taskq thread that's panicking) to before the installctx() call.

diff --git a/usr/src/uts/intel/ia32/os/fpu.c b/usr/src/uts/intel/ia32/os/fpu.c
index d648e0552f..da8e1f76b4 100644
--- a/usr/src/uts/intel/ia32/os/fpu.c
+++ b/usr/src/uts/intel/ia32/os/fpu.c
@@ -1381,10 +1381,27 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags)
      * by fp_save, as is the FPU_EN check.
      */
     if (pl != NULL) {
+        fpu_ctx_t *pf = &pl->lwp_pcb.pcb_fpu;
+
         kpreempt_disable();
-        if ((flags & KFPU_USE_LWP) == 0)
-            fp_save(&pl->lwp_pcb.pcb_fpu);
-        pl->lwp_pcb.pcb_fpu.fpu_flags |= FPU_KERNEL;
+        if ((flags & KFPU_USE_LWP) == 0) {
+            fp_save(pf);
+        } else {
+            /*
+             * For pure kernel threads with an LWP, we can use the
+             * LWP's pcb_fpu to save/restore context.
+             */
+            VERIFY(curthread->t_procp->p_flag & SSYS);
+            VERIFY(kfpu == NULL);
+            ASSERT((pf->fpu_flags & FPU_EN) == 0);
+
+            /* Always restore the fpu to the initial state. */
+            if (fp_save_mech == FP_XSAVE)
+                pf->fpu_xsave_mask = XFEATURE_FP_ALL;
+            fpinit();
+            pf->fpu_flags = FPU_EN;
+        }
+        pf->fpu_flags |= FPU_KERNEL;
         kpreempt_enable();
     }

@@ -1400,23 +1417,7 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags)

     curthread->t_flag |= T_KFPU;

-    if ((flags & KFPU_USE_LWP) == KFPU_USE_LWP) {
-        /*
-         * For pure kernel threads with an LWP, we can use the LWP's
-         * pcb_fpu to save/restore context.
-         */
-        fpu_ctx_t *pf = &pl->lwp_pcb.pcb_fpu;
-
-        VERIFY(curthread->t_procp->p_flag & SSYS);
-        VERIFY(kfpu == NULL);
-        ASSERT((pf->fpu_flags & FPU_EN) == 0);
-
-        /* Always restore the fpu to the initial state. */
-        if (fp_save_mech == FP_XSAVE)
-            pf->fpu_xsave_mask = XFEATURE_FP_ALL;
-        fpinit();
-        pf->fpu_flags = FPU_EN | FPU_KERNEL;
-    } else {
+    if ((flags & KFPU_USE_LWP) == 0) {
         /* initialize the kfpu state */
         kernel_fpu_ctx_restore(kfpu);
     }

It panics, for reasons TBD.

Actions #26

Updated by Jerry Jelinek about 2 years ago

The observation about a sleeping alloc is a good one, thanks for catching that. I don't think the previous diff is the right approach. How about this instead? I think this avoids the voluntary swtch we can hit before we're ready.

--- a/usr/src/uts/intel/ia32/os/fpu.c
+++ b/usr/src/uts/intel/ia32/os/fpu.c
@@ -1390,16 +1390,6 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags)
                kpreempt_enable();
        }

-       /*
-        * Set the context operations for kernel FPU usage. Note that this
-        * cannot be done with pre-emption and interrupts disabled, since
-        * installctx does a sleeping allocation. We haven't finished
-        * initializing our kernel FPU state yet, but in the rare case that we
-        * happen to save/restore before that, no harm is done.
-        */
-       installctx(curthread, kfpu, kernel_fpu_ctx_save, kernel_fpu_ctx_restore,
-           NULL, NULL, NULL, NULL);
-
        curthread->t_flag |= T_KFPU;

        if ((flags & KFPU_USE_LWP) == KFPU_USE_LWP) {
@@ -1422,6 +1412,17 @@ kernel_fpu_begin(kfpu_state_t *kfpu, uint_t flags)
                // initialize the kfpu state
                kernel_fpu_ctx_restore(kfpu);
        }
+
+       /*
+        * Set the context operations for kernel FPU usage. Note that this
+        * cannot be done with pre-emption and interrupts disabled, since
+        * installctx does a sleeping allocation. We haven't finished
+        * initializing our kernel FPU state yet, but in the rare case that we
+        * happen to save/restore before that, no harm is done.
+        */
+       installctx(curthread, kfpu, kernel_fpu_ctx_save, kernel_fpu_ctx_restore,
+           NULL, NULL, NULL, NULL);
+
 }

Actions #27

Updated by Jason King about 2 years ago

I think this makes sense -- though I have a question (as I've been mostly observing from the sidelines) -- should T_KFPU be set after the handlers have been installed? Since they will also update T_KFPU as needed, I'm wondering if somehow kernel_fpu_no_switch() could end up being called in-between T_KFPU being set and installctx() installing the handlers (causing a panic).

Actions #28

Updated by Dan McDonald about 2 years ago

Jerry's original fix, plus additional modifications to installctx() and its callers (others of which are preempting the kernel) have a webrev here:

https://kebe.com/~danmcd/webrevs/13902/webrev-prealloc/

that has tested so far.

Jerry's most recent suggestion will be in:

https://kebe.com/~danmcd/webrevs/13902/webrev-jerry2/

and will be tested later today.

Actions #29

Updated by Dan McDonald about 2 years ago

If Jerry's second fix works, it may be sufficient for this bug, but the "prealloc" changes will likely become their own issue.

Actions #30

Updated by Dan McDonald about 2 years ago

Jerry's second fix failed during the sleeping memory allocation. We likely need his first fix, AND the preallocation for installctx() callers that need disabled preemption over the installctx() call. Screen-shot attached.

Actions #31

Updated by Dan McDonald about 2 years ago

  • Related to Bug #13915: installctx() blocking allocate causes problems added
Actions #32

Updated by Electric Monk about 2 years ago

  • Gerrit CR set to 1579
Actions #34

Updated by Jason King about 2 years ago

Is that panic from running https://kebe.com/~danmcd/webrevs/13902/webrev-jerry2/ (with all the revisions, it's not always clear). If so, I'm wondering if my earlier question is still relevant.

Actions #35

Updated by Dan McDonald about 2 years ago

Jason King wrote in #note-34:

Is that panic from running https://kebe.com/~danmcd/webrevs/13902/webrev-jerry2/ (with all the revisions, it's not always clear). If so, I'm wondering if my earlier question is still relevant.

It is from webrev-jerry2, and your question was answered, "YES, YES IT COULD!" :)

Actions #36

Updated by Dan McDonald about 2 years ago

The very latest fix, which includes #13915, has been running for:

15:45:10    up 17:39,  1 user,  load average: 0.05, 0.05, 0.04

without incident on the discoverer's machine. Previous to this fix, it would panic within a handful of hours, sometimes even less than an hour.

Actions #37

Updated by Dan McDonald about 2 years ago

  • Status changed from New to Pending RTI
  • Assignee set to Dan McDonald
Actions #38

Updated by Dan McDonald about 2 years ago

  • % Done changed from 0 to 90
  • Difficulty changed from Hard to Medium
Actions #39

Updated by Electric Monk about 2 years ago

  • Status changed from Pending RTI to Closed
  • % Done changed from 90 to 100

git commit c21bd51d7acbaf77116c4cc3a23dfc6d16c637c2

commit  c21bd51d7acbaf77116c4cc3a23dfc6d16c637c2
Author: Dan McDonald <danmcd@joyent.com>
Date:   2021-07-02T19:24:36.000Z

    13902 Fix for 13717 may break 8-disk raidz2
    13915 installctx() blocking allocate causes problems
    Portions contributed by: Jerry Jelinek <gjelinek@gmail.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
    Approved by: Joshua M. Clulow <josh@sysmgr.org>

Actions

Also available in: Atom PDF