Bug #8570
openixgbe crash with rings
Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
2017-08-09
Due date:
% Done:
0%
Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:
Description
My setup:
ixgbe1 is connected to a 10G switch, with a VLAN created like this:
dladm create-vlan -v 400 -l ixgbe1 subnet40
The 'subnet40' NIC is configured for layer 2 Ethernet frames only, with no IP address on it.
I get a core dump after some traffic:
panic[cpu4]/thread=ffffff003df43c40: BAD TRAP: type=e (#pf Page fault) rp=ffffff003df43990 addr=18 occurred in module "ixgbe" due to a NULL pointer dereference

sched: #pf Page fault
Bad kernel fault at addr=0x18
pid=0, pc=0xfffffffff81c9bb6, sp=0xffffff003df43a80, eflags=0x10246
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 3406f8<smap,smep,osxsav,xmme,fxsr,pge,mce,pae,pse,de>
cr2: 18 cr3: 16a68f000 cr8: 0

rdi: ffffff0d0d43f0a0 rsi: 0 rdx: 3a38
rcx: 20 r8: ffffff0cef852180 r9: 1
rax: 0 rbx: ffffff0d0d508040 rbp: ffffff003df43b10
r10: 1a r11: 4 r12: 80000b
r13: 4 r14: ffffff0cffb34280 r15: ffffff0d002a0000
fsb: fffffd7fff162a40 gsb: ffffff0d00083040 ds: 0
es: 0 fs: 0 gs: 0
trp: e err: 0 rip: fffffffff81c9bb6
cs: 30 rfl: 10246 rsp: ffffff003df43a80
ss: 38

ffffff003df43890 unix:real_mode_stop_cpu_stage2_end+ad84 ()
ffffff003df43980 unix:trap+12df ()
ffffff003df43990 unix:_cmntrap+e6 ()
ffffff003df43b10 ixgbe:ixgbe_ring_rx+1e6 ()
ffffff003df43b50 ixgbe:ixgbe_intr_rx_work+34 ()
ffffff003df43b90 ixgbe:ixgbe_intr_msix+55 ()
ffffff003df43be0 apix:apix_dispatch_by_vector+8a ()
ffffff003df43c20 apix:apix_dispatch_lowlevel+1c ()
ffffff003faa69c0 unix:switch_sp_and_call+13 ()
ffffff003faa6a20 apix:apix_do_interrupt+398 ()
ffffff003faa6a30 unix:cmnint+ba ()
ffffff003faa6b40 unix:kcopy+22 ()
ffffff003faa6bb0 genunix:uiomove+dd ()
ffffff003faa6ca0 genunix:vpm_data_copy+eb ()
ffffff003faa6d30 specfs:spec_read+114 ()
ffffff003faa6dd0 genunix:fop_read+fd ()
ffffff003faa6f00 genunix:pread+1c2 ()
ffffff003faa6f10 unix:brand_sys_syscall+21a ()

dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel + curproc
NOTICE: ahci0: ahci_tran_reset_dport port 4 reset port
root@dev6:/var/crash/myhost# mdb *.10
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs sata sd ip hook neti sockfs arp usba xhci uhci stmf stmf_sbd qlc pmcs mr_sas mpt_sas mpt mm random crypto ptm ufs logindmux smbsrv nfs nsmb ]
> $C
ffffff003df43b10 ixgbe_ring_rx+0x1e6(ffffff0cef852180, ffffffff)
ffffff003df43b50 ixgbe_intr_rx_work+0x34(ffffff0cef852180)
ffffff003df43b90 ixgbe_intr_msix+0x55(ffffff0d002a0918, 0)
ffffff003df43be0 apix_dispatch_by_vector+0x8a(21)
ffffff003df43c20 apix_dispatch_lowlevel+0x1c(21, 0)
ffffff003faa69c0 switch_sp_and_call+0x13()
ffffff003faa6a20 apix_do_interrupt+0x398(ffffff003faa6a30, 5043d0)
ffffff003faa6a30 _interrupt+0xba()
ffffff003faa6b40 kcopy+0x22()
ffffff003faa6bb0 uiomove+0xdd(fffffe0171b90000, 1000, 0, ffffff003faa6e28)
ffffff003faa6ca0 vpm_data_copy+0xeb(ffffff0d02822a80, 1647c000, 2000, ffffff003faa6e28, 1, 0, ffffffff00000000, ffffff0e00000001)
ffffff003faa6d30 spec_read+0x114(ffffff0dab89eb80, ffffff003faa6e28, 0, ffffff0d2b1dca28, 0)
ffffff003faa6dd0 fop_read+0xfd(ffffff0dab89eb80, ffffff003faa6e28, 0, ffffff0d2b1dca28, 0)
ffffff003faa6f00 pread+0x1c2(5, 42d3d0, 100000, 163a6000)
ffffff003faa6f10 sys_syscall+0x19f()
> ixgbe_ring_rx+0x1e6::dis
ixgbe_ring_rx+0x1c0: incl 0x28(%r8)
ixgbe_ring_rx+0x1c4: leaq 0x0(,%r11,8),%rcx
ixgbe_ring_rx+0x1cc: movl 0x1a0c(%r15),%esi
ixgbe_ring_rx+0x1d3: movq 0x50(%r14),%rax
ixgbe_ring_rx+0x1d7: testl %esi,%esi
ixgbe_ring_rx+0x1d9: movq (%rax,%rcx),%rax
ixgbe_ring_rx+0x1dd: je +0x7 <ixgbe_ring_rx+0x1e6>
ixgbe_ring_rx+0x1df: movl 0x60(%rax),%ecx
ixgbe_ring_rx+0x1e2: testl %ecx,%ecx
ixgbe_ring_rx+0x1e4: jne +0xf <ixgbe_ring_rx+0x1f5>
ixgbe_ring_rx+0x1e6: movq 0x18(%rax),%rax
ixgbe_ring_rx+0x1ea: movq $0x0,0x8(%rbx)
ixgbe_ring_rx+0x1f2: movq %rax,(%rbx)
ixgbe_ring_rx+0x1f5: movl 0x6c(%r14),%ecx
ixgbe_ring_rx+0x1f9: leal 0x1(%r13),%eax
ixgbe_ring_rx+0x1fd: cmpl %ecx,%eax
ixgbe_ring_rx+0x1ff: jb +0xcb <ixgbe_ring_rx+0x2d0>
ixgbe_ring_rx+0x205: movl $0x1,%eax
ixgbe_ring_rx+0x20a: incl %r10d
ixgbe_ring_rx+0x20d: subl %ecx,%eax
ixgbe_ring_rx+0x20f: addl %eax,%r13d
Updated by Dan McDonald about 5 years ago
I got a disassembly of where this is in Igor's code:
ixgbe_ring_rx+0x1c0: incl 0x28(%r8)
ixgbe_ring_rx+0x1c4: leaq 0x0(,%r11,8),%rcx
ixgbe_ring_rx+0x1cc: movl 0x1a0c(%r15),%esi
ixgbe_ring_rx+0x1d3: movq 0x50(%r14),%rax
ixgbe_ring_rx+0x1d7: testl %esi,%esi
ixgbe_ring_rx+0x1d9: movq (%rax,%rcx),%rax
ixgbe_ring_rx+0x1dd: je +0x7 <ixgbe_ring_rx+0x1e6>
ixgbe_ring_rx+0x1df: movl 0x60(%rax),%ecx
ixgbe_ring_rx+0x1e2: testl %ecx,%ecx
ixgbe_ring_rx+0x1e4: jne +0xf <ixgbe_ring_rx+0x1f5>
ixgbe_ring_rx+0x1e6: movq 0x18(%rax),%rax /* XXX KEBE SAYS PANIC IS HERE */
ixgbe_ring_rx+0x1ea: movq $0x0,0x8(%rbx)
ixgbe_ring_rx+0x1f2: movq %rax,(%rbx)
ixgbe_ring_rx+0x1f5: movl 0x6c(%r14),%ecx
ixgbe_ring_rx+0x1f9: leal 0x1(%r13),%eax
ixgbe_ring_rx+0x1fd: cmpl %ecx,%eax
ixgbe_ring_rx+0x1ff: jb +0xcb <ixgbe_ring_rx+0x2d0>
ixgbe_ring_rx+0x205: movl $0x1,%eax
ixgbe_ring_rx+0x20a: incl %r10d
ixgbe_ring_rx+0x20d: subl %ecx,%eax
ixgbe_ring_rx+0x20f: addl %eax,%r13d
So the closest source I could find to match that is here:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/ixgbe/ixgbe_rx.c#695
The panic is in one of the two places where
current_rbd->read.pkt_addr = current_rcb->rx_buf.dma_address;
current_rbd->read.hdr_addr = 0;
occurs. current_rcb is NULL for some reason, and it shouldn't be (or it should be checked). Working out why and how it becomes NULL is how this bug gets fixed.
Updated by Igor Kozhukhov almost 5 years ago
Additional info from another core dump:
root@bldhost:/zones/crash/con4# mdb *.4
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs sata sd ip hook neti sockfs arp usba stmf stmf_sbd mpt_sas mm random crypto ptm ufs logindmux smbsrv nfs nsmb ]
> ::status
debugging crash dump vmcore.4 (64-bit) from con4
operating system: 5.11 2.0.0.5+a2 (i86pc)
image uuid: 8bc7c7b0-9230-6746-d861-c239f96164bf
panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffd001e9884920 addr=18 occurred in module "genunix" due to a NULL pointer dereference
dump content: kernel pages only
> $C
ffffd001e9884a70 ddi_dma_sync()
ffffd001e9884b10 ixgbe_ring_rx+0x3a6(ffffd063163f1180, ffffffff)
ffffd001e9884b50 ixgbe_intr_rx_work+0x34(ffffd063163f1180)
ffffd001e9884b90 ixgbe_intr_msix+0x55(ffffd0631aeb3918, 0)
ffffd001e9884be0 apix_dispatch_by_vector+0x8a(20)
ffffd001e9884c20 apix_dispatch_lowlevel+0x1c(20, 0)
ffffd001e9833a60 switch_sp_and_call+0x13()
ffffd001e9833ac0 apix_do_interrupt+0x398(ffffd001e9833ad0, 0)
ffffd001e9833ad0 _interrupt+0xba()
ffffd001e9833bc0 i86_mwait+0xd()
ffffd001e9833c00 cpu_idle_mwait+0x127()
ffffd001e9833c20 idle+0xa2()
ffffd001e9833c30 thread_start+8()
> ffffd001e9884b10::whatis
ffffd001e9884b10 is in thread ffffd001e9884c40's stack
> ::findleaks
CACHE             LEAKED BUFFER           CALLER
ffffd062a1867008   18085 ffffd063174b5b30 ?
ffffd062a186a008    8282 ffffd063159b53d8 ?
ffffd062a186d008    9111 ffffd062db4325d0 ?
ffffd062a1873008       2 ffffd06315899ee0 ?
ffffd062a1879008     244 ffffd0631901ab90 ?
ffffd062a187c008       1 ffffd0663bcc70c8 ?
ffffd062a1914008  467443 ffffd062a20700c0 ?
ffffd062a191a008  467443 ffffd06315896080 ?
ffffd0631a6ca008  474979 ffffd0663bb86008 ?
ffffd0631a6d0008  475006 ffffd0663bb7a010 ?
------------------------------------------------------------------------
Total 1920596 buffers, 211947416 bytes
I will investigate the list of leaks; maybe it will help.