Bug #12890
Closed
mlxcx uses excessive stack space causing stack overflow panic
% Done: 100%
Difficulty: Medium
Description
I've had a couple of boot time panics on a system with a pair of mlxcx cards. In both cases it was a double fault with a very long stack trace involving the mlxcx driver.
Here's one of them:
panic message: BAD TRAP: type=8 (#df Double fault) rp=fffffdc5cca88f10 addr=0
dump content: kernel pages and pages from PID -1
> $C
fffffb0450d6c030 tr_pftrap+0x11()
fffffb0450d6c0a0 vmem_alloc+0x1c3(fffffffffbf84b60, 20000, 0)
fffffb0450d6c1b0 vmem_xalloc+0x629(fffffdc4e1c1f000, 20000, 1000, 0, 0, 0, 0, 0)
fffffb0450d6c220 vmem_alloc+0x190(fffffdc4e1c1f000, 20000, 0)
fffffb0450d6c2b0 kmem_slab_create+0x7c(fffffdc4e1c2f008, 0)
fffffb0450d6c310 kmem_slab_alloc+0x10b(fffffdc4e1c2f008, 0)
fffffb0450d6c370 kmem_cache_alloc+0x15b(fffffdc4e1c2f008, 0)
fffffb0450d6c3e0 vmem_alloc+0x1ae(fffffdc4e1c1f000, 2000, 0)
fffffb0450d6c480 segkmem_xalloc+0xfe(fffffdc4e1c1f000, 0, 2000, 0, 0, fffffffffb8a7a50, fffffffffbfa96a0)
fffffb0450d6c4f0 segkmem_alloc_vn+0x3b(fffffdc4e1c1f000, 2000, 0, fffffffffbfa96a0)
fffffb0450d6c520 segkmem_alloc+0x17(fffffdc4e1c1f000, 2000, 0)
fffffb0450d6c630 vmem_xalloc+0x629(fffffdc4e1c20000, 2000, 1000, 0, 0, 0, 0, ffffffff00000000)
fffffb0450d6c6a0 vmem_alloc+0x190(fffffdc4e1c20000, 2000, 0)
fffffb0450d6c730 kmem_slab_create+0x7c(fffffdc551804008, 0)
fffffb0450d6c790 kmem_slab_alloc+0x10b(fffffdc551804008, 0)
fffffb0450d6c7f0 kmem_cache_alloc+0x15b(fffffdc551804008, 0)
fffffb0450d6c870 rootnex_coredma_allochdl+0x3f(fffffdc551081d50, fffffdc5510fd7f8, fffffb0450d6ca48, 1, 0, fffffdc5cfc1eb38)
fffffb0450d6c8f0 rootnex_dma_allochdl+0xd5(fffffdc551081d50, fffffdc5510fd7f8, fffffb0450d6ca48, 1, 0, fffffdc5cfc1eb38)
fffffb0450d6c960 ddi_dma_allochdl+0x62(fffffdc551081008, fffffdc5510fd7f8, fffffb0450d6ca48, 1, 0, fffffdc5cfc1eb38)
fffffb0450d6c9c0 pcieb_dma_allochdl+0x3e(fffffdc551081008, fffffdc5510fd7f8, fffffb0450d6ca48, 1, 0, fffffdc5cfc1eb38)
fffffb0450d6ca30 ddi_dma_allochdl+0x62(fffffdc5510fd7f8, fffffdc5510fd7f8, fffffb0450d6ca48, 1, 0, fffffdc5cfc1eb38)
fffffb0450d6cae0 ddi_dma_alloc_handle+0x7c(fffffdc5510fd7f8, fffffb0450d6cbc8, 1, 0, fffffdc5cfc1eb38)
fffffb0450d6cb90 mlxcx_dma_alloc+0x72(fffffdc5d691e040, fffffdc5cfc1eb18, fffffb0450d6cbc8, fffffb0450d6cbc2, 1, 240, 1)
fffffb0450d6cc70 mlxcx_cmd_mbox_alloc+0xb7(fffffdc5d691e040, fffffb0450d6cd70, 9)
fffffb0450d6cd00 mlxcx_cmd_send+0x79(fffffdc5d691e040, fffffb0450d6cd38, fffffb0450d6cdc8, 1010, fffffb0450d6ddd8, 10)
fffffb0450d6ee40 mlxcx_cmd_give_pages+0x154(fffffdc5d691e040, 1, 200, fffffb0450d6eef8)
fffffb0450d6ff40 mlxcx_give_pages+0x14f(fffffdc5d691e040, 11a9)
fffffb0450d6ff80 mlxcx_init_pages+0x42(fffffdc5d691e040, 2)
fffffb0450d6ffc0 mlxcx_attach+0x17f(fffffdc5510fd7f8, 0)
fffffb0450d70040 devi_attach+0xa1(fffffdc5510fd7f8, 0)
fffffb0450d70080 attach_node+0x8b(fffffdc5510fd7f8)
fffffb0450d700d0 i_ndi_config_node+0x12c(fffffdc5510fd7f8, 6, 0)
fffffb0450d70100 i_ddi_attachchild+0x3a(fffffdc5510fd7f8)
fffffb0450d70140 devi_attach_node+0x5d(fffffdc5510fd7f8, 4080)
fffffb0450d70210 devi_config_one+0x3ae(fffffdc551081008, fffffdc5cdb5fc00, fffffb0450d702e0, 4080, 0)
fffffb0450d70290 ndi_devi_config_one+0xad(fffffdc551081008, fffffdc5cdb5fc00, fffffb0450d702e0, 4080)
fffffb0450d70370 resolve_pathname+0x14f(fffffb0450d707c8, fffffb0450d70380, 0, 0)
fffffb0450d703a0 e_ddi_hold_devi_by_path+0x24(fffffb0450d707c8, 0)
fffffb0450d70c10 ufm_do_getcaps+0x95(8046724, 100001)
fffffb0450d70c70 ufm_ioctl+0xa3(ec00000000, 75666d01, 8046724, 100001, fffffdc5d6a70d20, fffffb0450d70dc8)
fffffb0450d70cb0 cdev_ioctl+0x2b(ec00000000, 75666d01, 8046724, 100001, fffffdc5d6a70d20, fffffb0450d70dc8)
fffffb0450d70d00 spec_ioctl+0x45(fffffdc5cf9b7140, 75666d01, 8046724, 100001, fffffdc5d6a70d20, fffffb0450d70dc8, 0)
fffffb0450d70d90 fop_ioctl+0x5b(fffffdc5cf9b7140, 75666d01, 8046724, 100001, fffffdc5d6a70d20, fffffb0450d70dc8, 0)
fffffb0450d70eb0 ioctl+0x153(b, 75666d01, 8046724)
fffffb0450d70f10 sys_syscall32+0x138()
This appears to be due to stack space exhaustion. Looking at the frame addresses above, about three pages of stack are being used by mlxcx_give_pages() and mlxcx_cmd_give_pages().
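The "three pages" figure can be checked from the frame pointer values in the $C output. The sketch below assumes that the distance between a function's frame pointer and its callee's frame pointer approximates that function's frame size, and that pages are 4 KiB; the helper and constant names are made up for this illustration and are not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for this estimate (4 KiB). */
#define	PAGE_SIZE_BYTES	4096u

/* Frame pointers copied from the $C output in this report. */
static const uint64_t FP_CMD_SEND = 0xfffffb0450d6cd00ULL;
static const uint64_t FP_CMD_GIVE_PAGES = 0xfffffb0450d6ee40ULL;
static const uint64_t FP_GIVE_PAGES = 0xfffffb0450d6ff40ULL;

/*
 * Approximate a function's frame size as the gap between its frame
 * pointer and the frame pointer of the function it called. The stack
 * grows downward, so the caller's frame pointer is the higher address.
 */
static uint64_t
frame_size(uint64_t fp, uint64_t callee_fp)
{
	assert(fp >= callee_fp);
	return (fp - callee_fp);
}

/* Combined stack use of the two large mlxcx frames, in bytes. */
static uint64_t
mlxcx_give_pages_stack_bytes(void)
{
	/* mlxcx_cmd_give_pages(): 0x2140 bytes; mlxcx_give_pages(): 0x1100. */
	return (frame_size(FP_CMD_GIVE_PAGES, FP_CMD_SEND) +
	    frame_size(FP_GIVE_PAGES, FP_CMD_GIVE_PAGES));
}
```

Under those assumptions the two frames come to 0x2140 + 0x1100 = 0x3240 bytes (12864), a little over three 4 KiB pages.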
Related issues
Updated by Andy Fiddaman almost 2 years ago
- Category set to driver - device drivers
Updated by Paul Winder almost 2 years ago
- Status changed from New to In Progress
- Assignee set to Paul Winder
- Gerrit CR set to 705
The changes in change 705 remove the arrays from the structs that were consuming this stack space.
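For readers unfamiliar with this class of fix, here is a generic user-space sketch of the idea, not the actual code from change 705. All type and function names are hypothetical, the 512-entry size is an assumption, and calloc()/free() stand in for a kernel allocator such as kmem_zalloc()/kmem_free().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define	MAX_PAGES	512	/* hypothetical batch size */

/* Before: the array is embedded, so a local of this type costs 4 KiB of stack. */
typedef struct pages_req_embedded {
	uint32_t npages;
	uint64_t pas[MAX_PAGES];
} pages_req_embedded_t;

/* After: the struct holds only a pointer; the array lives on the heap. */
typedef struct pages_req {
	uint32_t npages;
	uint64_t *pas;
} pages_req_t;

/* Allocate the address array off-stack; returns 0 on success, -1 on failure. */
static int
pages_req_init(pages_req_t *req, uint32_t npages)
{
	assert(req != NULL);
	req->pas = calloc(npages, sizeof (uint64_t));
	if (req->pas == NULL)
		return (-1);
	req->npages = npages;
	return (0);
}

/* Release the array and reset the request. */
static void
pages_req_fini(pages_req_t *req)
{
	free(req->pas);
	req->pas = NULL;
	req->npages = 0;
}
```

With this shape, a function that declares a pages_req_t local only spends a few bytes of stack on it, at the cost of an allocation that can fail and must be freed on every exit path.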
Updated by Dan McDonald almost 2 years ago
- Related to Bug #12797: mlxcx max flow table limit can be exceeded added
Updated by Dan McDonald almost 2 years ago
- Related to Bug #12798: mlxcx command interface should allow concurrent commands and be interrupt driven added
Updated by Dan McDonald almost 2 years ago
- Related to Bug #12799: mlxcx #if defined for MAC_VLAN_UNTAGGED is redundant added
Updated by Electric Monk almost 2 years ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
git commit 5f0e3176f407dfb9d989b5dcc94a6d5384d0b142
commit 5f0e3176f407dfb9d989b5dcc94a6d5384d0b142
Author: Paul Winder <pwinder@racktopsystems.com>
Date: 2020-07-10T15:41:30.000Z

12797 mlxcx max flow table limit can be exceeded
12798 mlxcx command interface should allow concurrent commands and be interrupt driven
12799 mlxcx #if defined for MAC_VLAN_UNTAGGED is redundant
12890 mlxcx uses excessive stack space causing stack overflow panic
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Robert Mustacchi <rm@fingolfin.org>
Reviewed by: Alex Wilson <alex@cooperi.net>
Approved by: Dan McDonald <danmcd@joyent.com>