Project

General

Profile

Bug #13041

i40e_get_available_resources() broken again for X722 part

Added by John Levon 4 months ago. Updated 3 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
driver - device drivers
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

See bug 9601 for the original fix.
After that, we got some new code via bug 12933:

+       /* OCP cards case: if a mezz is removed the ethernet port is at
+        * disabled state in PRTGEN_CNF register. Additional NVM read is
+        * needed in order to check if we are dealing with OCP card.
+        * Those cards have 4 PFs at minimum, so using PRTGEN_CNF for counting
+        * physical ports results in wrong partition id calculation and thus
+        * not supporting WoL.
+        */
+       if (hw->mac.type == I40E_MAC_X722) {
+               if (i40e_acquire_nvm(hw, I40E_RESOURCE_READ) == I40E_SUCCESS) {
+                       status = i40e_aq_read_nvm(hw, I40E_SR_EMP_MODULE_PTR,
+                                                 2 * I40E_SR_OCP_CFG_WORD0,
+                                                 sizeof(ocp_cfg_word0),
+                                                 &ocp_cfg_word0, TRUE, NULL);
+                       if (status == I40E_SUCCESS &&
+                           (ocp_cfg_word0 & I40E_SR_OCP_ENABLED))
+                               hw->num_ports = 4;
+                       i40e_release_nvm(hw);
+               }
+       }
+

The assertion here that PF is at least 4 is clearly not the case as shown
in the above bug, and on my device 0x8086:0x37d2

num_functions is 2, so we end up with:

3951                 hw->num_partitions = num_functions / hw->num_ports;              

num_partitions as zero, and we go boom.
What's perplexing here is that the upstream code has the same code: it
forces OCP X722 to have 4 ports.

The above comment is talking about the PRTGEN_CNF register, but we
no longer use that, preferring the status one, as described in bug 9601.

I don't have a spec with any X722 or OCP details, but it seems like we might
have a chance at fixing this just by disabling that new code again.


Related issues

Related to illumos gate - Bug #9601: Divide by zero in i40e_get_available_resources()ClosedMarcel Telka2018-06-14

Actions
Related to illumos gate - Bug #12933: Update i40e core code to support NVM v1.7ClosedPaul Winder

Actions
#1

Updated by Marcel Telka 4 months ago

  • Related to Bug #9601: Divide by zero in i40e_get_available_resources() added
#2

Updated by Marcel Telka 4 months ago

  • Related to Bug #12933: Update i40e core code to support NVM v1.7 added
#3

Updated by Electric Monk 4 months ago

  • Gerrit CR set to 846
#4

Updated by John Levon 4 months ago

The system I tested this on is based on this:

https://www.supermicro.com/en/products/motherboard/X11DPH-T

which definitely seems to give lie to the idea that the OCP config
has to have four active ports. (Is it a X557-T2 in the system?)
The C620 chipset spec also has a section "38.11.1.1.4 OCP Support"
that seems to explicitly say both quad and dual port config is allowed.

#5

Updated by Randy Fishel 4 months ago

The problem as stated in previous comments is that the driver can find more ports than functions, and that is not a proper configuration per Intel. There is also a secondary problem that it is possible that both of num_ports and num_partitions can be zero, and later in the Illumos driver we expect both to be non-zero, but is possibly something that should be fixed as part of another bug.

The linux (and possibly BSD) driver does not find this to be a problem because because it doesn't use num_partitions, but a failure to have num_ports to be correct is a problem for the Linux driver as partition_id will be incorrect (Illumos does not use the features where an incorrect partition_id becomes an issue, but does expect num_ports and num_partitions to be non-zero).

While removing the OCP will allow the Illumos driver to function on 2 port X722 cards, it does become a deviation from the Intel common code, and it might be worthwhile to either specify that (in the code), or find a fix that works better in the X722 case (which could be a different bug).

#6

Updated by Robert Mustacchi 3 months ago

  • Category set to driver - device drivers
  • Assignee set to John Levon
#7

Updated by Electric Monk 3 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 341c5f490806c8b3e6e31512923db4c0e1b464b1

commit  341c5f490806c8b3e6e31512923db4c0e1b464b1
Author: John Levon <john.levon@joyent.com>
Date:   2020-08-19T07:30:51.000Z

    13041 i40e_get_available_resources() broken again for X722 part
    Reviewed by: Randy Fishel <randyf@sibernet.com>
    Approved by: Robert Mustacchi <rm@fingolfin.org>

Also available in: Atom PDF