Feature #14812

closed

pcie: properly set max packet size and tagging

Added by Robert Mustacchi 3 months ago. Updated about 2 months ago.

Status:
Closed
Priority:
Normal
Category:
driver - device drivers
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

While debugging a system we were seeing intermittent hangs and varied issues with a specific PCIe device. This leads with a summary of what's going on and what we're doing, and then follows up with the original debugging and analysis that got us here:

In PCIe there are settings that must be set consistently for everything that is part of a given 'hierarchy domain'; in other words, they must be identical for everything below a root port. While things like whether or not AERs or alternative routing IDs (ARI) are enabled can often be device-specific, these settings cannot be. In particular, you have issues here around fabric-wide values like:

  • The Maximum Packet Size
  • The kind of transaction tagging going on

These are the initial two that we're handling here, but they are not the only things that need to be looked at. While the stack has traditionally tried to handle the maximum packet size (MPS), it has not done so correctly, only applying it to nodes where we attach device drivers. This seems correct on the surface, but in the face of multi-function devices it is not. There are also more complications here when hotplug comes on the scene.

What this change does is introduce processing and scanning of the fabric to determine and set these items prior to attaching device drivers, taking into account the hotplug nature of some buses and the rest of the topology. Rather than replicate all of the nuance here, the big theory statement goes into the details of what's happening, what's being done, and why. As part of this, I've also looked ahead to forthcoming features in PCIe 5.0 and 6.0 and tried to ensure we're covering the initial set of those.
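
To make the fabric-wide calculation concrete, here is a minimal, illustrative C sketch of the idea: the payload size a hierarchy domain can use is the minimum of what every function below the root port supports. The pcie_node_t type, its fields, and fabric_min_mps() are hypothetical stand-ins for illustration; this is not the code the change adds.

    /*
     * Illustrative only: compute the smallest supported MPS (in bytes)
     * across an entire hierarchy domain, given a hypothetical tree of
     * the functions below a root port.
     */
    typedef struct pcie_node {
        unsigned int       pn_mps_supported;  /* supported MPS in bytes, from DEVCAP */
        struct pcie_node   *pn_child;         /* first downstream device/function */
        struct pcie_node   *pn_sibling;       /* next peer at this level */
    } pcie_node_t;

    /* Return the smallest supported MPS found in the subtree, bounded by 'cur'. */
    unsigned int
    fabric_min_mps(const pcie_node_t *node, unsigned int cur)
    {
        for (; node != NULL; node = node->pn_sibling) {
            if (node->pn_mps_supported < cur)
                cur = node->pn_mps_supported;
            cur = fabric_min_mps(node->pn_child, cur);
        }
        return (cur);
    }

    /* Usage: start from the root port, passing its own supported MPS as the bound. */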

While there may be more tuning that we want to allow here, in the form of administrator limits or a default for cases where the operator knows a higher value than our fail-safes would pick, that'll come as part of pcieadm link option handling.

The rest of this is a bunch of the original debugging and analysis from a broad cast of characters:

~~~

With PCIe straps enabled and booting in UP mode, when we reach the code discussed above, we panic instead of hanging hard. This suggests that we may have some additional bug associated with handling PCIe interrupts in MP mode. The evidence is not all that damning yet, but it's something to keep in mind.

The panic looks like this:

SUNW-MSG-ID: SUNOS-8000-0G, TYPE: Error, VER: 1, SEVERITY: Major
EVENT-TIME: 0x386d9ae3.0x11a49c41 (0xb38d4d6815)
PLATFORM: oxide, CSN: -, HOSTNAME: gimlet
SOURCE: SunOS, REV: 5.11 precompliance-0-g2968248d3e
DESC: Errors have been detected that require a reboot to ensure system
integrity.  See http://illumos.org/msg/SUNOS-8000-0G for more information.
AUTO-RESPONSE: Solaris will attempt to save and diagnose the error telemetry
IMPACT: The system will sync files, save a crash dump if needed, and reboot
REC-ACTION: Save the error summary below in case telemetry cannot be saved

panic[cpu0]/thread=fffffb0450271c20: pcieb-40: PCI(-X) Express Fatal Error. (0x41)

Warning - stack not written to the dump buffer
fffffb0450271b70 pcieb:pcieb_intr_handler+1fc ()
fffffb0450271bc0 apix:apix_dispatch_by_vector+8c ()
fffffb0450271c00 apix:apix_dispatch_lowlevel+1c ()
fffffb0457d1f020 unix:switch_sp_and_call+15 ()
fffffb0457d1f090 apix:apix_do_interrupt+13a ()
fffffb0457d1f0a0 unix:cmnint+1f2 ()
fffffb0457d1f2f0 unix:kstat_hold+1 ()
fffffb0457d1f3f0 unix:kstat_create_zone+c4 ()
fffffb0457d1f440 unix:kstat_create+28 ()
fffffb0457d1f4c0 t4nex:setup_rxq_kstats+86 ()
fffffb0457d1f510 t4nex:alloc_rxq+74 ()
fffffb0457d1f570 t4nex:t4_setup_port_queues+337 ()
fffffb0457d1f5c0 t4nex:port_full_init+42 ()
fffffb0457d1f620 t4nex:t4_init_synchronized+7a ()
fffffb0457d1f660 t4nex:t4_mc_start+29 ()
fffffb0457d1f6b0 mac:mac_start+65 ()
fffffb0457d1f710 dls:dls_open+fc ()
fffffb0457d1f780 dld:dld_str_attach+150 ()
fffffb0457d1f7f0 dld:dld_str_open+dc ()
fffffb0457d1f830 dld:dld_open+27 ()
fffffb0457d1f8e0 genunix:qattach+10e ()
fffffb0457d1fa00 genunix:stropen+32c ()
fffffb0457d1fad0 specfs:spec_open+4d0 ()
fffffb0457d1fb40 genunix:fop_open+a4 ()
fffffb0457d1fce0 genunix:vn_openat+208 ()
fffffb0457d1fe50 genunix:copen+431 ()
fffffb0457d1fe80 genunix:openat32+1a ()
fffffb0457d1feb0 genunix:open32+1c ()
fffffb0457d1ff00 unix:brand_sys_syscall32+290 ()

Note that we got a little bit farther: the DMA transaction started, and then we got these ereports:

[0]> ::walk ereportq_pend | ::ereport -v
class='ereport.io.pci.fabric'
ena=0b38d4cdcc900001
detector
    version=00
    scheme='dev'
    device-path='/pci@58,0/pci1022,1483@1,1/pci1de,fff9@0,4'
bdf=8504
device_id=6492
vendor_id=1425
rev_id=00
dev_type=0000
pcie_off=0070
pcix_off=0000
aer_off=0100
ecc_ver=0000
pci_status=0010
pci_command=0046
pcie_status=0024
pcie_command=2017
pcie_dev_cap=10008fe4
pcie_adv_ctl=000001f2
pcie_ue_status=00040000
pcie_ue_mask=00100000
pcie_ue_sev=00062030
pcie_ue_hdr0=4a008040
pcie_ue_hdr1=00000100
pcie_ue_hdr2=85040000
pcie_ue_hdr3=00000000
pcie_ce_status=00000000
pcie_ce_mask=00000000
pcie_ue_tgt_trans=00000000
pcie_ue_tgt_addr=0000000000000000
pcie_ue_tgt_bdf=ffff
remainder=00000000
severity=00000040

class='ereport.io.pci.fabric'
ena=0b38d4c297c00001
detector
    version=00
    scheme='dev'
    device-path='/pci@58,0/pci1022,1483@1,1'
bdf=8009
device_id=1483
vendor_id=1022
rev_id=00
dev_type=0040
pcie_off=0058
pcix_off=0000
aer_off=0150
ecc_ver=0000
pci_status=0010
pci_command=0047
pci_bdg_sec_status=4000
pci_bdg_ctrl=0003
pcie_status=0000
pcie_command=2057
pcie_dev_cap=00008022
pcie_adv_ctl=000001e0
pcie_ue_status=00000000
pcie_ue_mask=00100000
pcie_ue_sev=00062030
pcie_ue_hdr0=00000000
pcie_ue_hdr1=00000000
pcie_ue_hdr2=00000000
pcie_ue_hdr3=00000000
pcie_ce_status=00000000
pcie_ce_mask=00000000
pcie_rp_status=00000000
pcie_rp_control=0000
pcie_adv_rp_status=00000054
pcie_adv_rp_command=00000007
pcie_adv_rp_ce_src_id=0000
pcie_adv_rp_ue_src_id=8504
pcie_slot_cap=00820062
pcie_slot_control=103a
pcie_slot_status=0040
remainder=00000001
severity=00000001

pcie_ue_status=00040000 is MAL_TLP_STATUS indicating a malformed TLP error. The 4 ue_hdrN fields should correspond to PCIE_HDR_LOGn and contain the first 4 32-bit pieces of the TLP header. The RP status simply tells us that this was a fatal error and that it was the first uncorrectable error that occurred.

...

I believe the TLP is malformed because it exceeds the receiver's maximum payload size. We have set that size in function 4's device control register to match the bridge's (512 bytes) but not in function 0 since we didn't attach to it, so function 0's is still 128 bytes. Apparently the T6's implementation-defined way to interpret this is to use F0's value. See page 78 of PCIe4.

~~~

Workaround: 60::wrpcicfg 80 1 1 2017 in kmdb. No config/unconfig/attach/detach needed. Just do it before plumbing or doing anything else to cxgbeN that wants DMA. Alternate (probably preferred) workaround: 78::wrpcicfg 85 0 0 2150. Both workarounds set the MPS to match between the device and bridge, but setting the device's function 0 is better than setting the bridge's since we'd rather this be as large as possible.
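
For reference, here is a small, hedged C sketch (not part of the fix) showing how the Max_Payload_Size field in the PCIe Device Control register relates to the raw values quoted above: the field occupies bits 7:5 and encodes 128 << value bytes, which is why 0x2017 corresponds to 128 bytes while 0x2057 and 0x2150 correspond to 512 bytes.

    #include <stdint.h>
    #include <stdio.h>

    /* Decode the Max_Payload_Size field (bits 7:5) of a Device Control value. */
    static unsigned int
    devctl_mps_bytes(uint16_t devctl)
    {
        return (128u << ((devctl >> 5) & 0x7));
    }

    int
    main(void)
    {
        /* 0x2057: the bridge's pcie_command from the ereport above -> 512 bytes */
        /* 0x2017: the value the first workaround writes to the bridge -> 128 bytes */
        /* 0x2150: the value the alternate workaround writes to F0 -> 512 bytes */
        printf("0x2057 -> %u bytes\n", devctl_mps_bytes(0x2057));
        printf("0x2017 -> %u bytes\n", devctl_mps_bytes(0x2017));
        printf("0x2150 -> %u bytes\n", devctl_mps_bytes(0x2150));
        return (0);
    }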

~~~

A further analysis by Rick:

Max Payload Size is set by pcie_initchild_mps, which is only called in two places: pcie_init_root_port_mps and pcie_initchild. The former is the 2nd to last call in pcieb_attach and is only used to set MPS on the root port itself. pcie_initchild is called from both npe_initchild and pcieb_initchild, both of which are invoked in response to DDI_CTLOPS_INITCHILD, which comes from init_child, which only happens if a driver is bound to the node. For T6, nothing is bound to function 0, so init_child will never be invoked.

This poses a bit of a quandary. If we wait until pcie_initchild_mps is called by pcie_initchild, we'll need to do a fair bit of work to find all the other functions in the device, find the smallest MPS reported by their DEVCAP, and then program that into each function. This would then be repeated for each driver attach within the same device. For T6, this will happen at least twice.

Alternatively, we could set MPS when placeholder nodes are created for each function. pcie_init_bus is attractive for this option as it gets invoked in both the boot and hotplug paths and it already sets bus_mps in the newly created node. The downside is that it gets invoked for each function individually, which could lead to a different MPS per function if the device advertises different supported MPS via each function's DEVCAP. Otherwise, it's probably pcicfg_configure to catch the hotplug path and pcie_init_root_port_mps for boot time.

~~~

I followed up with the following:

Thanks for going through that bit of the kernel, @kc8apf. So, given that a change here is supposed to cover the entire fabric and requires scanning and ideally keeping everything in sync, I'm not sure we should be doing this on a per-device basis more generally (it'd be fine for our system today). The decision of what the MPS is has to be based on scanning the entire fabric. If we consider the case where we have an intermediate non-hotpluggable switch with multiple downstream ports, we'd need to make sure that we scanned everything.

In addition, there is the complication that once a driver is attached and performing I/O, there isn't really a great facility for changing this safely on the fly; you would basically need to quiesce that entire portion of the fabric. As such, I think we want to pursue your last suggestion: add a new notion of a fabric-changed event that causes us to rescan the fabric and set up new maximums as appropriate, which we would do in a similar place to pcie_init_root_port_mps, with a corresponding flag in pcicfg_configure. I suspect that this is not going to be the last of these that we encounter.

Logically I break our different cases down into four different categories:

  1. We have a non-hotplug capable segment, which means that the devices we discover at initial enumeration time are static. In this configuration we can have any number of intermediate switches with no problem. This will be set once when we scan the root fabric and go from there.
  2. We have a device whose immediate parent is a bridge that is hotpluggable, and this is the only hotpluggable device on the overall PCIe fabric. This is the common case that we have in Gimlet today and in most direct-attach NVMe systems, given the increase in lanes. Put differently: we have a fabric where there is only a single hot-pluggable slot or otherwise only a single leaf device.
  3. This is more of a weird case: a top-level hotplug-capable device is plugged in that has a static set of devices downstream, that is, we'll detect and enumerate them together. Imagine plugging in a device that has a switch and several fixed downstream devices built into the same enclosure.
  4. We have a fabric that has multiple hot-pluggable slots under a root port due to the use of switches. It doesn't matter whether there is one intermediate switch or multiple; as long as there are multiple hot-pluggable ports, we can't do anything.

Visually:

(1)

 rp -- > dev

 rp -- > switch -- > dev
               +-- > dev
               +-- > dev

 rp -- > switch -- > dev
               +-- > switch -- > dev
                           +-- > dev
               +-- > dev

(2)
    (hp)
rp -- > dev

    (hp)
rp -- > devA

             (hp)
rp -- > switch --> dev

(3)
    (hp)
rp -- > single-port switch --> dev

    (hp)                  (hp)
rp -- > single-port switch --> dev

    (hp)          
rp -- > switch --> dev
              +--> dev
              +--> dev

(4)
              (hp)
rp -- > switch --> dev
              (hp)
              +--> dev
              (hp)
              +--> dev

             (hp)
rp -- > switch --> dev
              +--> dev (not hp)

The reason I call these out is that at least in case (4) we can't really opt for a safe MPS value other than 128 bytes: once a driver is attached we can't safely change this without quiescing that state, and there's no way of knowing what can be plugged in. Strictly speaking, we could/should allow an operator to nominate a required MPS for device activity. Regardless, though, this is something that can't really be determined beyond some default.

Cases (1) and (2) mean that scanning is safe. Case (3) is more nuanced and may or may not be safe depending on how much we want to assume out of the standard pcieb device. It is also not a very commonly expected case.

Basically, because of these, I don't think it makes sense to do this in the per-child device bits. If we did, we'd still have to calculate what the actual MPS is and do the scan at both points: initial root port attachment and subsequent PCIe hotplug events. In those cases I'm not sure it's sufficient to program a single device, as we may have an intermediate one-port switch (which I know sounds crazy, but I see it on some Ryzen boards).

We may also want to adjust the maximum read request size along the way. In the future, I suspect we will need to think about things related to atomics and other bits, and similarly with LTR. The big question is how many of these other things will be function-specific versus things that, like MPS, can't vary across the fabric.
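
As a hedged aside on the read size point: the Max_Read_Request_Size field lives in the same Device Control register, bits 14:12, with the same 128 << value encoding, so it could be decoded and programmed in the same way as MPS. The helper below is illustrative only.

    #include <stdint.h>

    /* Decode the Max_Read_Request_Size field (bits 14:12) of a Device Control value. */
    unsigned int
    devctl_mrrs_bytes(uint16_t devctl)
    {
        return (128u << ((devctl >> 12) & 0x7));
    }

    /* For example, the 0x2017 value quoted earlier decodes to a 512-byte read request size. */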

Actions #1

Updated by Electric Monk 3 months ago

  • Gerrit CR set to 2240
Actions #2

Updated by Robert Mustacchi about 2 months ago

So the biggest question around this was how to go about testing it. This change impacts the initialization of every PCIe device, whether attached or not. Testing used a similar methodology in all cases:

  1. First grab the state of all PCIe devices in configuration space (/usr/lib/pci/pcieadm save-cfgspace -a old/)
  2. Reboot onto bits that have this change
  3. Grab all PCIe device configuration space again: (/usr/lib/pci/pcieadm save-cfgspace -a new/)
  4. Convert the binary files to a textual representation (usually on another system): for f in *.pci; do /usr/lib/pci/pcieadm show-cfgspace -f $f > $f.txt; done
  5. Check that we have the same number of files before / after
  6. After doing this, diff the old/new files. Usually a loop like for f in *.txt; do echo $f; diff -u $f ../new/$f; done

There were a few different cases that we were trying to check here and confirm:

  • Multi-function devices were properly set up when the BIOS/UEFI was not there (or didn't set them up)
  • Regression testing that we properly detect features across all of the PCIe switches that we have
  • Ensuring that the hotplug path was in good shape

To that end, between myself and members of the community, we took a look at several different systems based on both AMD and Intel platforms. In particular, the following are systems that we've looked at, in addition to some of the development systems that prompted this work:

  • AMD Rome 1P Server
  • Intel 2P Ivy Bridge Server
  • Intel 2P Skylake Server
  • AMD Zen 2 Matisse System
  • Intel 2P Haswell Server

For all Intel systems, things generally looked the same. The only difference (which isn't specific to this change) is that sometimes an MSI-X related set of addresses would change. On AMD systems with PCIe Gen 4 support, the default firmware (AGESA / UEFI) was not going through and enabling 10-bit completion support. With this change, we now enable 10-bit requests on supported systems.
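
For context on the 10-bit tag handshake, here is a sketch based on the PCIe 4.0 register layout, not the code in this change, and with macro names invented here for illustration: an endpoint's 10-Bit Tag Requester Enable bit (Device Control 2, bit 12) should only be set when the endpoint advertises 10-Bit Tag Requester Supported (Device Capabilities 2, bit 17) and its completer, typically the root port, advertises 10-Bit Tag Completer Supported (bit 16).

    #include <stdint.h>

    /* Bit positions per PCIe 4.0; the macro names here are ours, not illumos's. */
    #define DEVCAP2_10BIT_TAG_COMP_SUP  (1u << 16)  /* Device Capabilities 2 */
    #define DEVCAP2_10BIT_TAG_REQ_SUP   (1u << 17)  /* Device Capabilities 2 */
    #define DEVCTL2_10BIT_TAG_REQ_EN    (1u << 12)  /* Device Control 2 */

    /*
     * Only turn on 10-bit requests in an endpoint's Device Control 2 when the
     * endpoint can generate them and the root port can complete them.
     */
    uint16_t
    devctl2_with_10bit_tags(uint16_t devctl2, uint32_t ep_devcap2, uint32_t rp_devcap2)
    {
        if ((ep_devcap2 & DEVCAP2_10BIT_TAG_REQ_SUP) != 0 &&
            (rp_devcap2 & DEVCAP2_10BIT_TAG_COMP_SUP) != 0)
            devctl2 |= DEVCTL2_10BIT_TAG_REQ_EN;
        return (devctl2);
    }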

In terms of the breadth of testing, while going through the other systems that people had, we encountered a couple of instances of non-standard cases:

  • PCIe switches with multiple downstream ports and endpoints that had different max packet sizes; the smallest was used.
  • PCIe to PCI bridges which resulted in PCI devices beneath them (often VGA controllers).
  • Combinations of the above.
  • I looked for cases where the device had a larger MPS than the root port (common on both Intel and with the T6) and verified that we used the smaller.
  • I looked for cases where the root port had a larger MPS (often AMD) than the endpoint and confirmed we took the smaller.

In addition, with the hotplug testing I took two different PCIe devices with a 512 byte and 256 byte max payload size. Because the AMD root port has a max payload size of 512 bytes we would expect to see things change between 512 and 256 bytes on the port. I was able to verify this with pcieadm.

The only things that we haven't been able to 100% test at this point are features related to PCIe Gen 6. However, we have managed to get a reasonable spread of different PCIe topologies, particularly in the non-hotplug case.

Actions #3

Updated by Electric Monk about 2 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 5b2c4190a831f52d91a5b92473ffb5a06e84511d

commit  5b2c4190a831f52d91a5b92473ffb5a06e84511d
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   2022-08-05T00:34:57.000Z

    14812 pcie: properly set max packet size and tagging
    Reviewed by: Rich Lowe <richlowe@richlowe.net>
    Reviewed by: C Fraire <cfraire@me.com>
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Approved by: Dan McDonald <danmcd@mnx.io>
