Bug #12867

closed

Mis-programmed pcie bridge leaves 64-bit device unusable

Added by Andy Fiddaman over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: Normal
Assignee:
Category: kernel
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

An OmniOS user reported a problem on a server with a pair of Mellanox Technologies MT27700 Family (ConnectX-4) cards not working.
These cards are supported by the recently integrated illumos mlxcx driver but the driver is failing to attach with:

     WARNING: mlxcx0: found unsupported command revision: 65535, expected 5

The driver is getting back PCI_EINVAL32 in response to a PCI read, which means the read failed. There
are two of these cards on separate PCI buses, and they both work if the system is booted to Linux.

Let's look at a portion of the device tree on this box:

# prtconf -dD
i86pc (driver name: rootnex)
    pci, instance #0 (driver name: npe)
        pci1458,1000 (pciex1022,1480) [Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex], instance #0 (driver name: amdf17nbdf)
        pci1022,1482 (pciex1022,1482) [Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge]
        pci1022,1483 (pciex1022,1483) [Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge], instance #0 (driver name: pcieb)
            pci15b3,13 (pciex15b3,1013) [Mellanox Technologies MT27700 Family [ConnectX-4]] (driver name: mlxcx)

    pci, instance #4 (driver name: npe)
        pci1458,1000 (pciex1022,1480) [Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex], instance #3 (driver name: amdf17nbdf)
        pci1022,1482 (pciex1022,1482) [Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge]
        pci1022,1483 (pciex1022,1483) [Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge], instance #16 (driver name: pcieb)
            pci15b3,13 (pciex15b3,1013) [Mellanox Technologies MT27700 Family [ConnectX-4]] (driver name: mlxcx)

Looking at the second card and its associated bridge in more detail:

        pci1022,1483 (pciex1022,1483) [Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge], instance #16 (driver name: pcieb)
            Hardware properties:
                name='available' type=int items=5
                    value=81000000.00000000.00009000.00000000.00001000
                name='acpi-namespace' type=string items=1
                    value='\_SB_.S1D3.D7B0'
                name='reg' type=int items=5
                    value=00800900.00000000.00000000.00000000.00000000
                name='ranges' type=int items=24
                    value=81000000.00000000.00009000.81000000.00000000.00009000.00000000.00001000.82000000.00000000.90000000.82000000.00000000.90000000.00000000.02000000.c2000000.00000000.6e000000.c2000000.00000000.6e000000.00000000.02000000
                name='bus-range' type=int items=2
                    value=00000081.00000081
            pci15b3,13 (pciex15b3,1013) [Mellanox Technologies MT27700 Family [ConnectX-4]] (driver name: mlxcx)
                Hardware properties:
                    name='acpi-namespace' type=string items=1
                        value='\_SB_.S1D3.D7B0.D0E9'
                    name='assigned-addresses' type=int items=5
                        value=c3810010.00000601.90000000.00000000.02000000
                    name='reg' type=int items=10
                        value=00810000.00000000.00000000.00000000.00000000.43810010.00000000.00000000.00000000.02000000

The assigned-addresses property for the card shows that the first BAR has been assigned a 32MiB range starting at 0x60190000000,
and that it's a pre-fetchable memory range (the pre-fetchable bit, bit 30, is set in the first entry of the 5-tuple). That's a valid address but is surprisingly high
(although this is a 512GB system).
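
The reading of that 5-tuple can be sketched in a few lines of Python. This is only an illustration, assuming the usual IEEE 1275 PCI binding layout for the first cell (npt000ss bbbbbbbb dddddfff rrrrrrrr); decode_assigned is a name made up for this example and is not part of illumos.

```python
# Illustrative sketch only: decode_assigned() is a hypothetical helper, not illumos code.
# Assumed layout of the first cell (phys.hi), per the IEEE 1275 PCI binding:
#   npt000ss bbbbbbbb dddddfff rrrrrrrr
SPACES = {0: "config", 1: "io", 2: "mem32", 3: "mem64"}


def decode_assigned(value):
    """Decode one dotted-hex assigned-addresses 5-tuple as printed by prtconf."""
    hi, mid, lo, size_hi, size_lo = (int(cell, 16) for cell in value.split("."))
    return {
        "non_relocatable": bool((hi >> 31) & 1),  # 'n' bit
        "prefetchable": bool((hi >> 30) & 1),     # 'p' bit
        "space": SPACES[(hi >> 24) & 0x3],        # 'ss' field
        "bus": (hi >> 16) & 0xFF,
        "device": (hi >> 11) & 0x1F,
        "function": (hi >> 8) & 0x7,
        "register": hi & 0xFF,                    # config offset of the BAR
        "address": (mid << 32) | lo,
        "size": (size_hi << 32) | size_lo,
    }


card = decode_assigned("c3810010.00000601.90000000.00000000.02000000")
print(card["space"], card["prefetchable"], hex(card["address"]), hex(card["size"]))
```

For the value above this reports a pre-fetchable 64-bit memory BAR at register offset 0x10 on bus 0x81, which matches the description.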

What's odd here is that the ranges property for the bridge shows both parent and child addresses as 0x6e000000, which is a completely different address.

The ranges property for the PCI-to-PCI bridge shows two likely memory addresses, 0x90000000 and 0x6e000000, neither of which matches the address allocated to the card. At this point it looks like there's at least some 32-bit truncation occurring.
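
To make the bridge's ranges property easier to read, here is a rough Python sketch that splits it into windows. It assumes each entry is eight cells (three for the child address, three for the parent address, two for the size), which is consistent with the 24 items shown; decode_ranges is a hypothetical helper written for this note.

```python
# Illustrative sketch only: decode_ranges() is a hypothetical helper, not illumos code.
# Assumes 8 cells per entry: child address (3), parent address (3), size (2).
SPACES = {0: "config", 1: "io", 2: "mem32", 3: "mem64"}

BRIDGE_RANGES = (
    "81000000.00000000.00009000.81000000.00000000.00009000.00000000.00001000."
    "82000000.00000000.90000000.82000000.00000000.90000000.00000000.02000000."
    "c2000000.00000000.6e000000.c2000000.00000000.6e000000.00000000.02000000"
)


def decode_ranges(value):
    """Split a dotted-hex bridge ranges value into one dict per window."""
    cells = [int(cell, 16) for cell in value.split(".")]
    if len(cells) % 8:
        raise ValueError("expected a multiple of 8 cells")
    entries = []
    for i in range(0, len(cells), 8):
        c_hi, c_mid, c_lo, p_hi, p_mid, p_lo, s_hi, s_lo = cells[i:i + 8]
        entries.append({
            "space": SPACES[(c_hi >> 24) & 0x3],
            "prefetchable": bool((c_hi >> 30) & 1),
            "child": (c_mid << 32) | c_lo,
            "parent": (p_mid << 32) | p_lo,
            "size": (s_hi << 32) | s_lo,
        })
    return entries


for entry in decode_ranges(BRIDGE_RANGES):
    kind = entry["space"] + (" prefetch" if entry["prefetchable"] else "")
    print("%-14s base=%#x size=%#x" % (kind, entry["child"], entry["size"]))
```

All three windows decode to addresses below 4GiB, so none of them can cover the card's BAR at 0x60190000000.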

Let's look at the memory lists for the card - the bridge has a bus range of 81-81 so that's where the card is:

# mdb -k
> pci_bus_res::print [81]
[81] = {
    [81].io_avail = 0xfffffdc5510eeca8
    [81].io_used = 0
    [81].mem_avail = 0
    [81].mem_used = 0xfffffdc551d11ac8
    [81].pmem_avail = 0
    [81].pmem_used = 0xfffffdc5510ee788
    [81].bus_avail = 0
    [81].dip = 0xfffffdc55110c2b8
    [81].privdata = 0
    [81].par_bus = 0x80
    [81].sub_bus = 0x81
    [81].root_addr = 0xff
    [81].num_cbb = 0
    [81].io_reprogram = 0x1 (B_TRUE)
    [81].mem_reprogram = 0x1 (B_TRUE)
    [81].subtractive = 0 (0)
    [81].mem_size = 0x2000000
    [81].io_size = 0
}

# mem_used (non-prefetch ranges)
> pci_bus_res::print [81].mem_used | ::memlist
            ADDR             BASE             SIZE
fffffdc551d11ac8         90000000          2000000
>

# pmem_used (pre-fetch)
> pci_bus_res::print [81].pmem_used | ::memlist
            ADDR             BASE             SIZE
fffffdc5510ee788         6e000000          2000000

Three things jump out here: there's that 32-bit truncation of the assigned address again; there's apparently a second
range in use, on the pre-fetchable list, starting at 0x6e000000 that did not show up in the assigned-addresses data; and the card's memory
range has been reprogrammed at some point during PCI bus enumeration (the mem_reprogram flag).

Let's look at the bridge (bus 80, according to the par_bus property in the output above).

> pci_bus_res::print [80]
[80] = {
    [80].io_avail = 0
    [80].io_used = 0xfffffdc55126c198
    [80].mem_avail = 0
    [80].mem_used = 0xfffffdc5510ee728
    [80].pmem_avail = 0
    [80].pmem_used = 0xfffffdc5510ee768
    [80].bus_avail = 0xfffffdc551d11b08
    [80].dip = 0xfffffdc55110cab0
    [80].privdata = 0
    [80].par_bus = 0xff
    [80].sub_bus = 0x9f
    [80].root_addr = 0x70
    [80].num_cbb = 0
    [80].io_reprogram = 0 (0)
    [80].mem_reprogram = 0 (0)
    [80].subtractive = 0 (0)
    [80].mem_size = 0x2104000
    [80].io_size = 0
}
> pci_bus_res::print [80].mem_used | ::memlist
            ADDR             BASE             SIZE
fffffdc5510ee728         90000000          2200000
> pci_bus_res::print [80].pmem_used | ::memlist
            ADDR             BASE             SIZE
fffffdc5510ee768         6e000000          2000000

Same thing there.

We can also check the retrieved ACPI data for the bridge:

> acpi_mem_res::print [80] | ::memlist
            ADDR             BASE             SIZE
fffffdc551d11ac8         90000000          2000000

Just the 32-bit non-prefetch range there.

After extracting and decompiling the ACPI tables (acpidump, acpixtract, iasl -d) and finding the relevant code for the bridge
(the detailed output above helpfully includes the acpi-namespace value '\_SB_.S1D3.D7B0'), we can see that the ACPI data contains two address
ranges for this bridge (base + length):

32-bit - 0x90000000 + 0x2200000
64-bit - 0x58170200000 + 0x8000000000

If the assigned address had been 0x6016e000000 then it would fit into the second range (it's just over 32MiB down from the top),
but the other address of 0x6e000000 does not fit into either range, so it's apparently a 32-bit truncation of the 64-bit address.
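
That arithmetic is easy to check. The sketch below assumes both figures in each ACPI range are hexadecimal and are a base and a length; the names (ACPI_WINDOWS, containing_window) are invented for this illustration.

```python
# Illustrative sketch only; all names here are hypothetical.
# Each window is (base, length), assuming the ACPI figures quoted above are hex.
ACPI_WINDOWS = {
    "32-bit": (0x90000000, 0x2200000),
    "64-bit": (0x58170200000, 0x8000000000),
}
BAR_SIZE = 0x2000000  # 32MiB, the size from the card's assigned-addresses


def containing_window(base, size=BAR_SIZE):
    """Return the name of the ACPI window that fully contains [base, base+size)."""
    for name, (start, length) in ACPI_WINDOWS.items():
        if start <= base and base + size <= start + length:
            return name
    return None


top_64 = sum(ACPI_WINDOWS["64-bit"])  # first address past the 64-bit window
for addr in (0x6016e000000, 0x6e000000, 0x60190000000):
    print("%#x -> %s" % (addr, containing_window(addr)))
print("distance from top: %#x" % (top_64 - 0x6016e000000))
```

Under those assumptions 0x6016e000000 lands in the 64-bit window, 0x2200000 (34MiB) below its top, while neither 0x6e000000 nor 0x60190000000 fits anywhere.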

Another thing to note here is that this second range is big, but only a very small portion of it is showing up in the
bus memory lists. Where has the rest gone?

Let's boot the system with PCI autoconfig debugging on. This is very chatty, with lots of output going to the system console, and boot takes a very, very long time even with the framebuffer set to 8-bit colour.

# echo boot_resolution=1024x768x8 > /boot/conf.d/bootres
# beadm create pcidebug
# beadm mount pcidebug /a
# echo set pci_autoconfig:pci_boot_debug = 1 > /a/etc/system.d/pci
# bootadm update-archive -R /a
# beadm activate pcidebug
# init 6
# grep reprogram /var/adm/messages
NOTICE: reprogram mem-range on ppb[80/1/1]: 0x90000000 ~ 0x91ffffff

This shows that the bridge (ppb is PCI-to-PCI bridge, and we know the bridge is on bus 0x80)
was re-programmed with the 32-bit range that showed up earlier.

The early boot information (prtconf -dDvp) shows an initially assigned address of 0x6016e000000. The final address of 0x60190000000 is a combination of the high 32 bits of this and the low 32 bits of the address that was actually programmed (0x90000000), but this is by no means the only thing going wrong here!
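
As a quick check of that claim, the following Python sketch splices the upper half of one address onto the lower half of the other. It only reproduces the arithmetic; splice is a hypothetical name and this is not how the kernel code is written.

```python
# Illustrative sketch only: splice() is a hypothetical helper, not illumos code.
LOW32 = 0xFFFFFFFF


def splice(high_source, low_source):
    """Combine the upper 32 bits of one address with the lower 32 bits of another."""
    return ((high_source >> 32) << 32) | (low_source & LOW32)


initial = 0x6016e000000   # address seen in the early boot data
programmed = 0x90000000   # 32-bit range the bridge was re-programmed with

print(hex(splice(initial, programmed)))  # the address the card ended up with
print(hex(initial & LOW32))              # what plain 32-bit truncation gives
```

The first value matches the card's final assigned address and the second matches the 0x6e000000 seen in the bridge's ranges and memory lists.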

Actions #1

Updated by Andy Fiddaman over 1 year ago

  • Description updated (diff)
Actions #2

Updated by Andy Fiddaman over 1 year ago

  • Gerrit CR set to 732
Actions #3

Updated by Andy Fiddaman over 1 year ago

Testing results from the original problem server.

The devices now show up.

# dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
igb0        phys      1500   up       --         --
igb1        phys      1500   unknown  --         --
mlxcx0      phys      1500   down     --         --
mlxcx1      phys      1500   down     --         --

pmem_used is better at 0x6016e000000 (recall that before it was 0x6e000000).

# for f in mem_used mem_avail pmem_used pmem_avail; do
> mdb -ke "pci_bus_res::print [80].$f | ::memlist" 
> done
            ADDR             BASE             SIZE
fffffdc5510e7a88         92000000           200000
            ADDR             BASE             SIZE
fffffdc551d4bdc8         90000000          2000000
            ADDR             BASE             SIZE
fffffdc5510e7ac8      6016e000000          2000000
            ADDR             BASE             SIZE
mdb: can't read memlist at 0: no mapping for address

Here's the salient difference between prtconf -dDv before and after the patch:

-            pci15b3,13 (pciex15b3,1013) [Mellanox Technologies MT27700 Family [ConnectX-4]] (driver name: mlxcx)
+            pci15b3,13 (pciex15b3,1013) [Mellanox Technologies MT27700 Family [ConnectX-4]], instance #1 (driver name: mlxcx)
...
                     name='assigned-addresses' type=int items=5
-                        value=c3810010.00000601.90000000.00000000.02000000
+                        value=c3810010.00000601.6e000000.00000000.02000000

This now shows the correct assigned 64-bit address. It turns out that, in this case, the address being programmed was actually a combination of the top 32 bits from the pre-fetchable 64-bit address and the bottom 32 bits from the non-pre-fetchable space.

In the data from the boot system (prtconf -dDvp), there's a different story:

             model:  'PCI-PCI bridge'
-            ranges:  c2000000.00000000.6e000000.c2000000.00000000.6e000000.00000000.02000000
+            ranges:  c3000000.00000601.6e000000.c3000000.00000601.6e000000.00000000.02000000

Here there was truncation of the correct 64-bit address (the c2 at the start of the old range indicates pre-fetchable 32-bit memory, whereas the c3 in the new one indicates pre-fetchable 64-bit memory).
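
The before and after cells can be reproduced with a small Python sketch. It assumes the IEEE 1275 PCI binding layout for the first cell and simply sets the top (n) bit because every entry in the output above has it set; range_address_cells is a hypothetical helper, and nothing here claims to mirror the actual pci_autoconfig code.

```python
# Illustrative sketch only: range_address_cells() is a hypothetical helper.
def range_address_cells(addr, prefetchable, is_64bit):
    """Format an address as the three dotted-hex cells used in a ranges entry."""
    hi = 0x80000000                          # 'n' bit, as seen in the output above
    if prefetchable:
        hi |= 0x40000000                     # 'p' bit
    hi |= (0x3 if is_64bit else 0x2) << 24   # 'ss': 64-bit or 32-bit memory space
    return "%08x.%08x.%08x" % (hi, (addr >> 32) & 0xFFFFFFFF, addr & 0xFFFFFFFF)


window = 0x6016e000000

# Before: the address is cut down to 32 bits and tagged as 32-bit memory.
print(range_address_cells(window & 0xFFFFFFFF, True, False))
# After: the full address is kept and tagged as 64-bit memory.
print(range_address_cells(window, True, True))
```

The two lines printed correspond to the removed and added ranges values in the diff above.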

Actions #4

Updated by Andy Fiddaman over 1 year ago

This change has been successfully booted on the following systems with help from the community.

  • Gigabyte R282-Z91-00 server (where the original problem was reported)
  • bhyve VM
  • bhyve VM with virtual NVMe (-s 7,nvme,ram=1024)
  • ESX6.7 VM
  • Lenovo ThinkServer TS140 passing through a wireless card to a FreeBSD bhyve VM
  • Supermicro A1SAi-2750F
  • HP Z220 CMT Workstation
  • Supermicro X10SL7-F
  • Intel S2600CP
  • AsRockRack EPYCD8
  • Supermicro X9DRi-LN4+/X9DR3-LN4+
  • Dell R210 II
  • Dell PowerEdge R730
  • Supermicro SYS-5028D-TN4T
  • ASUS PRIME B350M-A motherboard with AMD Ryzen 3 1200 CPU

In all cases, there were no reported problems and the data gathered showed only improvements in the programming of PCI components. As a general note, the ranges properties now contain additional data that has come in via ACPI; previously, 64-bit ranges from ACPI were ignored.

Actions #5

Updated by Electric Monk over 1 year ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 6ecc470585ed07369dd51b0ed85f5cf848e5b5c2

commit  6ecc470585ed07369dd51b0ed85f5cf848e5b5c2
Author: Andy Fiddaman <omnios@citrus-it.co.uk>
Date:   2020-07-02T14:48:34.000Z

    12867 Mis-programmed pcie bridge leaves 64-bit device unusable
    12873 pci_autoconf: Makefile and compiler warning cleanup
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Reviewed by: Paul Winder <paul@winder.uk.net>
    Approved by: Dan McDonald <danmcd@joyent.com>
