Bug #13297

mlxcx kernel panic on debug build when creating an aggregated link

Added by Denis Kozadaev 20 days ago. Updated 20 days ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

The system has two Mellanox 25G cards:
root@Shelf6:~# dladm show-link
LINK CLASS MTU STATE BRIDGE OVER
mlxcx1 phys 1500 up -- --
mlxcx2 phys 1500 up -- --

Creating an aggr link:
root@Shelf6:~# dladm create-aggr -l mlxcx1 -l mlxcx2 -P L2,L3 -L off aggr2
and the system panics:

panic[cpu0]/thread=fffffe003d0f8c20: pageout_deadman: stuck pushing the same page for 90 seconds (freemem is 19984)

fffffe003d0f8a10 genunix:process_type+2127bf ()
fffffe003d0f8a80 genunix:clock+7f3 ()
fffffe003d0f8b60 genunix:cyclic_softint+1df ()
fffffe003d0f8b80 unix:cbe_softclock+19 ()
fffffe003d0f8bd0 unix:av_dispatch_softvect+7d ()
fffffe003d0f8c00 apix:apix_dispatch_softint+35 ()
fffffe003d0059d0 unix:switch_sp_and_call+15 ()
fffffe003d005a30 apix:apix_do_softint+54 ()
fffffe003d005aa0 apix:apix_do_interrupt+3e0 ()
fffffe003d005ab0 unix:_interrupt+1f2 ()
fffffe003d005ba0 unix:i86_mwait+12 ()
fffffe003d005be0 unix:cpu_idle_mwait+127 ()
fffffe003d005c00 unix:idle+a2 ()
fffffe003d005c10 unix:thread_start+b ()

panic: entering debugger (continue to save dump)

Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ scsi_vhci crypto mac cpc uppc mr_sas neti sd ptm ufs unix mpt zfs krtld sata apix uhci pmcs hook genunix ip logindmux usba xhci specfs pcplusmp random mm cpu.generic arp mpt_sas emlxs sockfs ]
[0]> 

prtconf -Dd:

            pci15b3,20 (pciex15b3,1017) [Mellanox Technologies MT27800 Family [ConnectX-5]] (driver name: mlxcx)
            pci15b3,20 (pciex15b3,1017) [Mellanox Technologies MT27800 Family [ConnectX-5]] (driver name: mlxcx)

prtdiag:

System Configuration: Supermicro SSG-6048R-E1CR36L
BIOS Configuration: American Megatrends Inc. 2.0a 06/30/2016
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)

==== Processor Sockets ====================================

Version                          Location Tag
-------------------------------- --------------------------
Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz SOCKET 0
Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz SOCKET 1

==== Memory Device Sockets ================================

Type        Status Set Device Locator      Bank Locator
----------- ------ --- ------------------- ----------------
DDR4        in use 0   P1_DIMMA1           NODE 1
DDR4        empty  0   P1_DIMMA2           NODE 1
DDR4        in use 0   P1_DIMMB1           NODE 1
DDR4        empty  0   P1_DIMMB2           NODE 1
DDR4        empty  0   P1_DIMMC1           NODE 1
DDR4        empty  0   P1_DIMMC2           NODE 1
DDR4        empty  0   P1_DIMMD1           NODE 1
DDR4        empty  0   P1_DIMMD2           NODE 1
DDR4        in use 0   P2_DIMME1           NODE 3
DDR4        empty  0   P2_DIMME2           NODE 3
DDR4        in use 0   P2_DIMMF1           NODE 3
DDR4        empty  0   P2_DIMMF2           NODE 3
DDR4        empty  0   P2_DIMMG1           NODE 3
DDR4        empty  0   P2_DIMMG2           NODE 3
DDR4        empty  0   P2_DIMMH1           NODE 3
DDR4        empty  0   P2_DIMMH2           NODE 3

==== On-Board Devices =====================================
Onboard Aspeed Video
Onboard Intel Lan

==== Upgradeable Slots ====================================

ID  Status    Type             Description
--- --------- ---------------- ----------------------------
1   available PCI Exp. Gen 3 x8 CPU1 SLOT1
2   in use    PCI Exp. Gen 3 x16 CPU1 SLOT2, LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (mpt_sas)
3   in use    PCI Exp. Gen 3 x8 CPU1 SLOT3, Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (ixgbe)
4   available PCI Exp. Gen 3 x16 CPU2 SLOT4
5   available PCI Exp. Gen 3 x8 CPU2 SLOT5
6   in use    PCI Exp. Gen 3 x16 CPU2 SLOT6, Mellanox Technologies MT27800 Family [ConnectX-5] (mlxcx)
7   available PCI Exp. Gen 3 x16 CPU2 SLOT7

prtconf -v from a bootable BE on the device:

            pci15b3,20 (driver not attached)
                Hardware properties:
                    name='pci-msix-capid-pointer' type=int items=1
                        value=0000009c
                    name='pcie-link-supported-speeds' type=int64 items=3
                        value=000000009502f900.000000012a05f200.00000001dcd65000
                    name='pcie-link-maximum-speed' type=int64 items=1
                        value=00000001dcd65000
                    name='pcie-link-current-speed' type=int64 items=1
                        value=00000001dcd65000
                    name='pcie-link-current-width' type=int items=1
                        value=00000008
                    name='pcie-link-maximum-width' type=int items=1
                        value=00000008
                    name='acpi-namespace' type=string items=1
                        value='\_SB_.PCI1.QR2A.H001'
                    name='assigned-addresses' type=int items=5
                        value=c3810110.00000000.f6000000.00000000.02000000
                    name='reg' type=int items=10
                        value=00810100.00000000.00000000.00000000.00000000.43810110.00000000.00000000.00000000.02000000
                    name='compatible' type=string items=15
                        value='pciex15b3,1017.15b3.20.0' + 'pciex15b3,1017.15b3.20' + 'pciex15b3,1017.0' + 'pciex15b3,1017' + 'pciexclass,020000' + 'pciexclass,0200' + 'pci15b3,1017.15b3.20.0' + 'pci15b3,1017.15b3.20' + 'pci15b3,20,s' + 'pci15b3,20' + 'pci15b3,1017.0' + 'pci15b3,1017,p' + 'pci15b3,1017' + 'pciclass,020000' + 'pciclass,0200'
                    name='model' type=string items=1
                        value='Ethernet controller'
                    name='power-consumption' type=int items=2
                        value=00000001.00000001
                    name='devsel-speed' type=int items=1
                        value=00000000
                    name='interrupts' type=int items=1
                        value=00000002
                    name='subsystem-vendor-id' type=int items=1
                        value=000015b3
                    name='subsystem-id' type=int items=1
                        value=00000020
                    name='unit-address' type=string items=1
                        value='0,1'
                    name='class-code' type=int items=1
                        value=00020000
                    name='revision-id' type=int items=1
                        value=00000000
                    name='vendor-id' type=int items=1
                        value=000015b3
                    name='device-id' type=int items=1
                        value=00001017
                    name='vendor-name' type=string items=1
                        value='Mellanox Technologies'
                    name='device-name' type=string items=1
                        value='MT27800 Family [ConnectX-5]'
                    name='subsystem-name' type=string items=1
                        value='unknown subsystem'
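
For readers who want to check the link properties above, here is a minimal Python sketch that decodes the dot-separated hexadecimal int64 items printed by prtconf -v. The helper name decode_items is hypothetical, and reading the speed values as transfers per second (GT/s) is an assumption based on the property names, not something stated in this report.

```python
# Minimal sketch: decode dot-separated hex items as printed by `prtconf -v`.
# The property values are copied from the output above.
# Assumption: the speed values are transfers per second, hence the GT/s unit.

SUPPORTED_SPEEDS = "000000009502f900.000000012a05f200.00000001dcd65000"
CURRENT_SPEED = "00000001dcd65000"
CURRENT_WIDTH = "00000008"


def decode_items(value):
    """Split a prtconf value on '.' and parse each item as a hex integer."""
    return [int(item, 16) for item in value.split(".")]


supported = decode_items(SUPPORTED_SPEEDS)
current = decode_items(CURRENT_SPEED)[0]
width = decode_items(CURRENT_WIDTH)[0]

for speed in supported:
    print(f"supported: {speed / 1e9:g} GT/s")
print(f"current:   {current / 1e9:g} GT/s, x{width}")
```

Under that reading, the device reports the highest speed it supports as its current speed, on an 8-lane link.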

#1

Updated by Denis Kozadaev 20 days ago

The system has only 32 GB RAM.
Following Paul Winder's hint, I tuned /kernel/drv/mlxcx.conf:
sq_size_shift = 12;
rq_size_shift = 11;
tx_nrings_per_group = 32;
rx_ngroups_small = 16;

That helped: the system boots and the aggr is created:

root@Shelf6:~# dladm show-aggr -x aggr2
LINK        PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr2       --             10000Mb full   up        50:6b:4b:c1:ea:74  --
            mlxcx1         10000Mb full   up        50:6b:4b:c1:ea:74  attached
            mlxcx2         10000Mb full   up        50:6b:4b:c1:ea:75  attached
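
To make the mlxcx.conf values in this comment easier to reason about, here is a rough Python sketch that parses the four settings and converts the two *_size_shift values into entry counts. It assumes that each shift is the base-2 logarithm of the queue size, which is inferred from the property names and should be checked against the driver documentation. The helpers parse_conf and queue_entries are hypothetical. The sketch does not estimate memory use, because the per-entry buffer sizes are not given in this report.

```python
# Rough sketch: turn the mlxcx.conf tunables from this comment into entry counts.
# Assumption: "<name>_size_shift = N" means the queue holds 2**N entries.

CONF_TEXT = """
sq_size_shift = 12;
rq_size_shift = 11;
tx_nrings_per_group = 32;
rx_ngroups_small = 16;
"""


def parse_conf(text):
    """Parse 'name = value;' lines into a dict of integers."""
    settings = {}
    for line in text.splitlines():
        line = line.strip().rstrip(";")
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = int(value.strip())
    return settings


def queue_entries(settings):
    """Return 2**shift for every '*_size_shift' setting."""
    return {
        name: 1 << shift
        for name, shift in settings.items()
        if name.endswith("_size_shift")
    }


settings = parse_conf(CONF_TEXT)
entries = queue_entries(settings)
for name, count in entries.items():
    print(f"{name} = {settings[name]} -> {count} entries")
```

Under that assumption, each step down in a shift value halves the number of entries in that queue, which is one plausible way the tuning reduces memory pressure on a 32 GB system.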

#2

Updated by Denis Kozadaev 20 days ago

Update:
The system has only one Mellanox card, a dual-port 25G adapter.
A DEBUG build was also tested.
