Bug #13902

closed

Fix for 13717 may break 8-disk raidz2

Added by Dan McDonald about 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Assignee:
Category: kernel
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

I have three kernel dumps from SmartOS 20210617 and one more from SmartOS 20210520. The filer runs an 8-disk raidz (raidz1) pool and consistently panics within a few hours, with this panic pattern:

mdb: warning: dump is from SunOS 5.11 joyent_20210520T001536Z; dcmds and macros may not match kernel implementation
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba uhci smbios stmf_sbd stmf sd zfs mm lofs idm sata crypto random cpc logindmux ptm sppp nsmb smbsrv nfs ]
> $C
fffffe0170ae9590 fpxrestore+2()
fffffe0170ae95b0 kernel_fpu_ctx_restore+0x13(0)
fffffe0170ae95e0 restorectx+0x3d(fffffe0170ae9c20)
> ::status
debugging crash dump vmcore.4 (64-bit) from ronsu
operating system: 5.11 joyent_20210520T001536Z (i86pc)
git branch: release-20210520
git rev: 58a2b7ba9771c57f897a33cf69335cd18fa5992f
image uuid: (not set)
panic message: BAD TRAP: type=d (#gp General protection) rp=fffffe0170ae9470 addr=ffffff0a5c2ff000
dump content: kernel pages only
> fffffe0170ae9c20::findstack -v
stack pointer for thread fffffe0170ae9c20 (zpool-zones/120): fffffe0170ae95f0
  fffffe0170ae9620 swtch+0x133()
  fffffe0170ae9640 preempt+0xfa()
  fffffe0170ae9670 kpreempt+0x3c(ffffffff)
  fffffe0170ae96f0 installctx+0x11a(fffffe0170ae9c20, 0, fffffffffb880870, fffffffffb880910, 0, 0, 0, 0)
  fffffe0170ae9740 kernel_fpu_begin+0x116(0, 2)
  fffffe0170ae97c0 ssse3_gen_pq+0x1b4(ffffff0c4e303580)
  fffffe0170ae97e0 vdev_raidz_math_generate+0x52(ffffff0c4e303580)
  fffffe0170ae9810 vdev_raidz_generate_parity+0x16(ffffff0c4e303580)
  fffffe0170ae98a0 vdev_raidz_io_start+0x1d8(ffffff0c4cd5c928)
  fffffe0170ae98f0 zio_vdev_io_start+0x89(ffffff0c4cd5c928)
  fffffe0170ae9920 zio_execute+0xf1(ffffff0c4cd5c928)
  fffffe0170ae9950 zio_nowait+0x2b(ffffff0c4cd5c928)
  fffffe0170ae99e0 vdev_mirror_io_start+0xba(ffffff0c4eb206b8)
  fffffe0170ae9a30 zio_vdev_io_start+0x1a2(ffffff0c4eb206b8)
  fffffe0170ae9a60 zio_execute+0xa7(ffffff0c4eb206b8)
  fffffe0170ae9b10 taskq_thread+0x2cd(ffffff0a9d94e298)
  fffffe0170ae9b20 thread_start+0xb()
> 

So the faulting instruction and its data are:

> fpxrestore+2::dis
fpxrestore:                     clts   
fpxrestore+2:                   fxrstor (%rdi)
fpxrestore+6:                   ret    
0xfffffffffb87e277:             nopw   0x0(%rax,%rax)
xrestore:                       clts   
xrestore+2:                     movl   %esi,%eax
xrestore+4:                     movq   %rsi,%rdx
xrestore+7:                     shrq   $0x20,%rdx
xrestore+0xb:                   xrstor (%rdi)
xrestore+0xe:                   ret    
0xfffffffffb87e28f:             nop    
fpdisable:                      movq   %cr0,%rdi
> <rdi=P
                0xffffff0a9f650800 
> <rdi::whatis
ffffff0a9f650800 is allocated from fxsave_cache
> ffffff0a9f650800::print struct fxsave_state !head
{
    fx_fcw = 0xcafe
    fx_fsw = 0xbadd
    fx_fctw = 0xcafe
    fx_fop = 0xbadd
    fx_rip = 0xbaddcafebaddcafe
    fx_rdp = 0xbaddcafebaddcafe
    fx_mxcsr = 0xbaddcafe
    fx_mxcsr_mask = 0xbaddcafe
    fx_st = [
> 

Hmmm, every field holds the 0xbaddcafe fill pattern, so this looks like uninitialized memory. Was this save area supposed to have something in it?
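To make the check above concrete, here is a minimal user-space sketch in C, not kernel code. It assumes the 0xbaddcafe fill is the allocator's "allocated but never written" debug pattern, which is my reading of the dump rather than something confirmed here. It also uses the architectural rule that MXCSR bits 31:16 are reserved and that fxrstor raises #GP when any of them is set, which would be consistent with the type=d trap given fx_mxcsr = 0xbaddcafe. The helper names buf_is_uninitialized() and mxcsr_would_fault() are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill pattern seen in the dump (0xbaddcafe repeated); assumed to mean "never written". */
#define	UNINIT_PATTERN	0xbaddcafebaddcafeULL

/* MXCSR bits 31:16 are reserved; loading a value with any of them set raises #GP. */
#define	MXCSR_RESERVED	0xffff0000U

/* Hypothetical helper: true if every 8-byte word of buf equals the fill pattern. */
static bool
buf_is_uninitialized(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t word;

	if (len == 0 || len % sizeof (word) != 0)
		return (false);
	for (size_t off = 0; off < len; off += sizeof (word)) {
		memcpy(&word, p + off, sizeof (word));
		if (word != UNINIT_PATTERN)
			return (false);
	}
	return (true);
}

/* Hypothetical helper: true if fxrstor would fault on this MXCSR image. */
static bool
mxcsr_would_fault(uint32_t mxcsr)
{
	return ((mxcsr & MXCSR_RESERVED) != 0);
}
```

Applied to a 512-byte fxsave image like the one printed above, buf_is_uninitialized() would report true if the whole area still carries the pattern, and mxcsr_would_fault(0xbaddcafe) reports true because the upper 16 bits are non-zero.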

I'm giving the discoverer a DEBUG kernel, in the hope that its crash dump gives us more information. The discoverer is currently running a pre-illumos-13717 SmartOS platform image and, so far, has seen no panic.


Files

jerry2-panic.png (358 KB) Dan McDonald, 2021-06-30 01:29 AM

Related issues

Related to illumos gate - Bug #13717: kernel fpu use can still lead to panic (Closed)
Related to illumos gate - Feature #12478: installctx needs kpreempt_disable protection (Closed, Patrick Mooney)
Related to illumos gate - Bug #13908: disable kernel FPU by default until it is stable (Closed, Joshua M. Clulow)
Related to illumos gate - Bug #13915: installctx() blocking allocate causes problems (Closed, Dan McDonald)
