Project

General

Profile

Actions

Bug #9601

closed

Divide by zero in i40e_get_available_resources()

Added by Marcel Telka almost 5 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
driver - device drivers
Start date:
2018-06-14
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:
External Bug:

Description

With the following hardware (the motherboard is Supermicro X11SPM-TF, Version 1.01, with latest BIOS 2.0b):

pci bus 0x00b5 cardnum 0x00 function 0x00: vendor 0x8086 device 0x37d2
 Intel Corporation Ethernet Connection X722 for 10GBASE-T
 CardVendor 0x15d9 card 0x37d2 (Super Micro Computer Inc, Card unknown)
  STATUS    0x0010  COMMAND 0x0046
  CLASS     0x02 0x00 0x00  REVISION 0x08
  BIST      0x00  HEADER 0x80  LATENCY 0x00  CACHE 0x10
  BASE0     0xfa000000 SIZE 16777216  MEM PREFETCHABLE
  BASE3     0xfb008000 SIZE 32768  MEM PREFETCHABLE
  BASEROM   0x00000000  addr 0x00000000
  MAX_LAT   0x00  MIN_GNT 0x00  INT_PIN 0x01  INT_LINE 0x0b

pci bus 0x00b5 cardnum 0x00 function 0x01: vendor 0x8086 device 0x37d2
 Intel Corporation Ethernet Connection X722 for 10GBASE-T
 CardVendor 0x15d9 card 0x37d2 (Super Micro Computer Inc, Card unknown)
  STATUS    0x0010  COMMAND 0x0046
  CLASS     0x02 0x00 0x00  REVISION 0x08
  BIST      0x00  HEADER 0x80  LATENCY 0x00  CACHE 0x10
  BASE0     0xf9000000 SIZE 16777216  MEM PREFETCHABLE
  BASE3     0xfb000000 SIZE 32768  MEM PREFETCHABLE
  BASEROM   0x00000000  addr 0x00000000
  MAX_LAT   0x00  MIN_GNT 0x00  INT_PIN 0x01  INT_LINE 0x0b

the i40e driver panices as follows:

> ::status
debugging crash dump vmcore.0 (64-bit) from
operating system: 5.11 titanic_39.90.2 (i86pc)
image uuid: c2aa9826-72a6-ccbe-b32c-c21a800074df
panic message: BAD TRAP: type=0 (#de Divide error) rp=ffffc1016f5767d0 addr=0
dump content: kernel pages only
> ::stack
i40e_get_available_resources+0x1dd(ffffc133bdb13000)
i40e_common_code_init+0xeb(ffffc133bdb13000, ffffc133bdb13050)
i40e_attach+0xe5(ffffc1338081c000, 0)
devi_attach+0x92(ffffc1338081c000, 0)
attach_node+0xa7(ffffc1338081c000)
i_ndi_config_node+0x7d(ffffc1338081c000, 6, 0)
i_ddi_attachchild+0x48(ffffc1338081c000)
devi_attach_node+0x5e(ffffc1338081c000, 4004048)
config_immediate_children+0xbf(ffffc1338081c2a8, 4004048, ffffffff)
devi_config_common+0xd9(ffffc1338081c2a8, 4004048, ffffffff)
mt_config_thread+0x58(ffffc133c0509de8)
thread_start+8()
>

Related issues

Related to illumos gate - Bug #10176: i40e: driving X722 1GbE cause hangupNew2019-01-05

Actions
Related to illumos gate - Bug #13041: i40e_get_available_resources() broken again for X722 partClosedJohn Levon

Actions
Related to illumos gate - Bug #13046: i40e_device_find() ASSERTs that num_ports and num_partitions are non-zero, but i40e_parse_discover_capabilities() can return with them being zeroNew

Actions
Actions #1

Updated by Marcel Telka almost 5 years ago

The immediate cause for the panic is that idp->id_nfuncs is zero in i40e_get_available_resources(). The idp->id_nfuncs is set to zero in i40e_device_find() because hw->num_partitions is zero as well and finally, the hw->num_partitions is set to zero due to this code in i40e_common.c:

3823    /* count the enabled ports (aka the "not disabled" ports) */
3824    hw->num_ports = 0;
3825    for (i = 0; i < 4; i++) {
3826        u32 port_cfg_reg = I40E_PRTGEN_CNF + (4 * i);
3827        u64 port_cfg = 0;
3828
3829        /* use AQ read to get the physical register offset instead
3830         * of the port relative offset
3831         */
3832        i40e_aq_debug_read_register(hw, port_cfg_reg, &port_cfg, NULL);
3833        if (!(port_cfg & I40E_PRTGEN_CNF_PORT_DIS_MASK))
3834            hw->num_ports++;
3835    }
3836
3837    valid_functions = p->valid_functions;
3838    num_functions = 0;
3839    while (valid_functions) {
3840        if (valid_functions & 1)
3841            num_functions++;
3842        valid_functions >>= 1;
3843    }
3844
3845    /* partition id is 1-based, and functions are evenly spread
3846     * across the ports as partitions
3847     */
3848    if (hw->num_ports != 0) {
3849        hw->partition_id = (hw->pf_id / hw->num_ports) + 1;
3850        hw->num_partitions = num_functions / hw->num_ports;
3851    }

In the above code the four i40e_aq_debug_read_register() calls (in the for loop) returns 0, 0, 2, and 2 in port_cfg respectively, so hw->num_ports is set to 4. Because p->valid_functions is set to 0x3 (so num_functions is 2) the value of hw->num_partitions is calculated as zero (at line 3850).

This outcome is unexpected by i40e_device_find() (please note the ASSERT there) so we fail miserably later.

The code cited above comes from Intel ixl-1.6.10 driver and it is suspicious for various reasons:

  1. It reads from PRTGEN_CNF and three subsequent locations after PRTGEN_CNF. In the controller specification there is only PRTGEN_CNF documented (at address 0x000B8120, 32bits), but three subsequent addresses 0x000B8124, 0x000B8128, and 0x000B812C are simply left undocumented.
  2. The i40e_aq_debug_read_register() function is used to read those four locations. It is unclear why the generic rd32() is not used here instead. Maybe the initial intention was to read four NVM copies of PRTGEN_CNF, but in this case the i40e_read_nvm_word() function had to be used instead.
  3. The return value of i40e_aq_debug_read_register() is completely ignored, so in a case the i40e_aq_debug_read_register() function fails we will count the port as active because port_cfg is initialized to zero (this is not our case; in all four cases the i40e_aq_debug_read_register() function succeeds).
Actions #2

Updated by Marcel Telka almost 5 years ago

I tried to use rd32() to read PRTGEN_CNF and three following addresses, but it always returns zero:

rd32(0x000B8120) = 0
rd32(0x000B8124) = 0
rd32(0x000B8128) = 0
rd32(0x000B812C) = 0

Then I noticed in the specification that we probably should consult PRTGEN_STATUS instead of PRTGEN_CNF to get the actual port status. Namely its PORT_VALID bit looked promising, so I tried to read PRTGEN_STATUS (and subsequent three locations as well) using both i40e_aq_debug_read_register() and rd32() with the following results:

i40e_aq_debug_read_register(0x000B8100) = 3
i40e_aq_debug_read_register(0x000B8104) = 3
i40e_aq_debug_read_register(0x000B8108) = 0
i40e_aq_debug_read_register(0x000B810C) = 0
rd32(0x000B8100) = 3
rd32(0x000B8104) = 3
rd32(0x000B8108) = 3
rd32(0x000B810C) = 3

Based on this observation my proposal to fix this issue is to change the code at lines 3823-3835 to check bit PORT_VALID in the PRTGEN_STATUS register instead of currently used bit PORT_DIS in the PRTGEN_CNF register.

Actions #3

Updated by Marcel Telka almost 5 years ago

  • Status changed from New to In Progress
  • Assignee set to Marcel Telka
Actions #5

Updated by Marcel Telka about 4 years ago

  • Related to Bug #10176: i40e: driving X722 1GbE cause hangup added
Actions #6

Updated by Marcel Telka over 3 years ago

  • Status changed from In Progress to Pending RTI
Actions #7

Updated by Electric Monk over 3 years ago

  • Status changed from Pending RTI to Closed
  • % Done changed from 0 to 100

git commit f0c1c263e90642997cf3e76484abec617782ddb8

commit  f0c1c263e90642997cf3e76484abec617782ddb8
Author: Marcel Telka <marcel@telka.sk>
Date:   2019-11-14T16:19:43.000Z

    9601 Divide by zero in i40e_get_available_resources()
    Reviewed by: Robert Mustacchi <rm+illumos@fingolfin.org>
    Approved by: Dan McDonald <danmcd@joyent.com>

Actions #8

Updated by Marcel Telka over 2 years ago

  • Related to Bug #13041: i40e_get_available_resources() broken again for X722 part added
Actions #9

Updated by Marcel Telka over 2 years ago

  • Related to Bug #13046: i40e_device_find() ASSERTs that num_ports and num_partitions are non-zero, but i40e_parse_discover_capabilities() can return with them being zero added
Actions

Also available in: Atom PDF