Bug #9601
closedDivide by zero in i40e_get_available_resources()
100%
Description
With the following hardware (the motherboard is Supermicro X11SPM-TF, Version 1.01, with latest BIOS 2.0b):
pci bus 0x00b5 cardnum 0x00 function 0x00: vendor 0x8086 device 0x37d2 Intel Corporation Ethernet Connection X722 for 10GBASE-T CardVendor 0x15d9 card 0x37d2 (Super Micro Computer Inc, Card unknown) STATUS 0x0010 COMMAND 0x0046 CLASS 0x02 0x00 0x00 REVISION 0x08 BIST 0x00 HEADER 0x80 LATENCY 0x00 CACHE 0x10 BASE0 0xfa000000 SIZE 16777216 MEM PREFETCHABLE BASE3 0xfb008000 SIZE 32768 MEM PREFETCHABLE BASEROM 0x00000000 addr 0x00000000 MAX_LAT 0x00 MIN_GNT 0x00 INT_PIN 0x01 INT_LINE 0x0b pci bus 0x00b5 cardnum 0x00 function 0x01: vendor 0x8086 device 0x37d2 Intel Corporation Ethernet Connection X722 for 10GBASE-T CardVendor 0x15d9 card 0x37d2 (Super Micro Computer Inc, Card unknown) STATUS 0x0010 COMMAND 0x0046 CLASS 0x02 0x00 0x00 REVISION 0x08 BIST 0x00 HEADER 0x80 LATENCY 0x00 CACHE 0x10 BASE0 0xf9000000 SIZE 16777216 MEM PREFETCHABLE BASE3 0xfb000000 SIZE 32768 MEM PREFETCHABLE BASEROM 0x00000000 addr 0x00000000 MAX_LAT 0x00 MIN_GNT 0x00 INT_PIN 0x01 INT_LINE 0x0b
the i40e driver panices as follows:
> ::status debugging crash dump vmcore.0 (64-bit) from operating system: 5.11 titanic_39.90.2 (i86pc) image uuid: c2aa9826-72a6-ccbe-b32c-c21a800074df panic message: BAD TRAP: type=0 (#de Divide error) rp=ffffc1016f5767d0 addr=0 dump content: kernel pages only > ::stack i40e_get_available_resources+0x1dd(ffffc133bdb13000) i40e_common_code_init+0xeb(ffffc133bdb13000, ffffc133bdb13050) i40e_attach+0xe5(ffffc1338081c000, 0) devi_attach+0x92(ffffc1338081c000, 0) attach_node+0xa7(ffffc1338081c000) i_ndi_config_node+0x7d(ffffc1338081c000, 6, 0) i_ddi_attachchild+0x48(ffffc1338081c000) devi_attach_node+0x5e(ffffc1338081c000, 4004048) config_immediate_children+0xbf(ffffc1338081c2a8, 4004048, ffffffff) devi_config_common+0xd9(ffffc1338081c2a8, 4004048, ffffffff) mt_config_thread+0x58(ffffc133c0509de8) thread_start+8() >
Related issues
Updated by Marcel Telka almost 5 years ago
The immediate cause for the panic is that idp->id_nfuncs
is zero in i40e_get_available_resources()
. The idp->id_nfuncs
is set to zero in i40e_device_find()
because hw->num_partitions
is zero as well and finally, the hw->num_partitions
is set to zero due to this code in i40e_common.c
:
3823 /* count the enabled ports (aka the "not disabled" ports) */ 3824 hw->num_ports = 0; 3825 for (i = 0; i < 4; i++) { 3826 u32 port_cfg_reg = I40E_PRTGEN_CNF + (4 * i); 3827 u64 port_cfg = 0; 3828 3829 /* use AQ read to get the physical register offset instead 3830 * of the port relative offset 3831 */ 3832 i40e_aq_debug_read_register(hw, port_cfg_reg, &port_cfg, NULL); 3833 if (!(port_cfg & I40E_PRTGEN_CNF_PORT_DIS_MASK)) 3834 hw->num_ports++; 3835 } 3836 3837 valid_functions = p->valid_functions; 3838 num_functions = 0; 3839 while (valid_functions) { 3840 if (valid_functions & 1) 3841 num_functions++; 3842 valid_functions >>= 1; 3843 } 3844 3845 /* partition id is 1-based, and functions are evenly spread 3846 * across the ports as partitions 3847 */ 3848 if (hw->num_ports != 0) { 3849 hw->partition_id = (hw->pf_id / hw->num_ports) + 1; 3850 hw->num_partitions = num_functions / hw->num_ports; 3851 }
In the above code the four i40e_aq_debug_read_register()
calls (in the for loop) returns 0, 0, 2, and 2 in port_cfg
respectively, so hw->num_ports
is set to 4. Because p->valid_functions
is set to 0x3 (so num_functions
is 2) the value of hw->num_partitions
is calculated as zero (at line 3850).
This outcome is unexpected by i40e_device_find()
(please note the ASSERT there) so we fail miserably later.
The code cited above comes from Intel ixl-1.6.10 driver and it is suspicious for various reasons:
- It reads from
PRTGEN_CNF
and three subsequent locations afterPRTGEN_CNF
. In the controller specification there is onlyPRTGEN_CNF
documented (at address0x000B8120
, 32bits), but three subsequent addresses0x000B8124
,0x000B8128
, and0x000B812C
are simply left undocumented. - The
i40e_aq_debug_read_register()
function is used to read those four locations. It is unclear why the genericrd32()
is not used here instead. Maybe the initial intention was to read four NVM copies ofPRTGEN_CNF
, but in this case thei40e_read_nvm_word()
function had to be used instead. - The return value of
i40e_aq_debug_read_register()
is completely ignored, so in a case thei40e_aq_debug_read_register()
function fails we will count the port as active becauseport_cfg
is initialized to zero (this is not our case; in all four cases thei40e_aq_debug_read_register()
function succeeds).
Updated by Marcel Telka almost 5 years ago
I tried to use rd32()
to read PRTGEN_CNF
and three following addresses, but it always returns zero:
rd32(0x000B8120) = 0 rd32(0x000B8124) = 0 rd32(0x000B8128) = 0 rd32(0x000B812C) = 0
Then I noticed in the specification that we probably should consult PRTGEN_STATUS
instead of PRTGEN_CNF
to get the actual port status. Namely its PORT_VALID
bit looked promising, so I tried to read PRTGEN_STATUS
(and subsequent three locations as well) using both i40e_aq_debug_read_register()
and rd32()
with the following results:
i40e_aq_debug_read_register(0x000B8100) = 3 i40e_aq_debug_read_register(0x000B8104) = 3 i40e_aq_debug_read_register(0x000B8108) = 0 i40e_aq_debug_read_register(0x000B810C) = 0 rd32(0x000B8100) = 3 rd32(0x000B8104) = 3 rd32(0x000B8108) = 3 rd32(0x000B810C) = 3
Based on this observation my proposal to fix this issue is to change the code at lines 3823-3835 to check bit PORT_VALID
in the PRTGEN_STATUS
register instead of currently used bit PORT_DIS
in the PRTGEN_CNF
register.
Updated by Marcel Telka almost 5 years ago
- Status changed from New to In Progress
- Assignee set to Marcel Telka
Updated by Marcel Telka almost 5 years ago
Updated by Marcel Telka about 4 years ago
- Related to Bug #10176: i40e: driving X722 1GbE cause hangup added
Updated by Marcel Telka over 3 years ago
- Status changed from In Progress to Pending RTI
Updated by Electric Monk over 3 years ago
- Status changed from Pending RTI to Closed
- % Done changed from 0 to 100
git commit f0c1c263e90642997cf3e76484abec617782ddb8
commit f0c1c263e90642997cf3e76484abec617782ddb8 Author: Marcel Telka <marcel@telka.sk> Date: 2019-11-14T16:19:43.000Z 9601 Divide by zero in i40e_get_available_resources() Reviewed by: Robert Mustacchi <rm+illumos@fingolfin.org> Approved by: Dan McDonald <danmcd@joyent.com>
Updated by Marcel Telka over 2 years ago
- Related to Bug #13041: i40e_get_available_resources() broken again for X722 part added
Updated by Marcel Telka over 2 years ago
- Related to Bug #13046: i40e_device_find() ASSERTs that num_ports and num_partitions are non-zero, but i40e_parse_discover_capabilities() can return with them being zero added