Bug #11867
PCIe expansion slots mis-labelled in topo
100%
Description
I noticed that the labels for PCIe devices in the topo snapshot were incorrect and appeared to be off-by-one on the Joyent-M12G5 in the SF lab.
The system has 7 PCIe expansion slots and has PCIe cards installed in slots 1, 2, 4 and 6. However, in topo these cards are labelled as being in slots 2, 3, 5 and 7. After digging into how topo assigns the labels, I found that it uses logic like the following:
1) Determine the slot id of the device. To do this, it first looks at the slot-names devinfo property, which is derived from the IRQ routing table as defined by IEEE 1275 (see pci_slot_names_prop()), and tries to scrape the slot id off the encoded string.
If that fails (which isn't the case here), it tries to determine the slot id from the physical-slot# devinfo property.
2) Once it has a slot id, it cross-references it with the type 0x9 SMBIOS records and then sets the label to the locator string from the SMBIOS record with the same slot id.
This whole mechanism relies on SMBIOS and slot-names agreeing on the slot id numbering. On this system, however, the slot-names props number the slots from 1 to 7, which is what we'd expect, as slot id 0 should be reserved for on-board devices. The type 0x9 SMBIOS records, on the other hand, use slot ids 0 to 6.
As an example of this mismatch, there is an LSI HBA in the 2nd slot. The string that we scrape the slot id out of in the slot-names property for this device is "Slot2":
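A minimal sketch of the two-step lookup described above and of why it goes wrong on this system. It is not the libtopo code: the function names are hypothetical, and the SMBIOS table is a simplified stand-in that maps slot ids 0 to 6 onto shortened locator strings rather than the real ones.

```python
# Hypothetical, simplified SMBIOS Type 9 table: slot id -> locator string.
# The firmware on this system numbers slot ids 0 to 6; the locators are
# shortened stand-ins for the real strings.
SMBIOS_SLOTS = {slot_id: f"SLOT {slot_id + 1}" for slot_id in range(7)}


def slot_id_for_device(slot_names_id, physical_slot):
    """Step 1: prefer the id scraped from slot-names, else physical-slot#."""
    if slot_names_id is not None:
        return slot_names_id
    return physical_slot


def label_from_slot_id(slot_id, smbios_slots):
    """Step 2: use the locator of the SMBIOS record with the same slot id."""
    return smbios_slots.get(slot_id)


# Cards are physically installed in slots 1, 2, 4 and 6, and slot-names
# numbers the slots from 1 to 7.
installed = [1, 2, 4, 6]
labels = [
    label_from_slot_id(slot_id_for_device(s, None), SMBIOS_SLOTS)
    for s in installed
]
print(labels)
```

Because the two sources disagree on where numbering starts, every card picks up the locator of the next slot, which matches the 2, 3, 5 and 7 labels seen in the topo snapshot.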
name='slot-names' type=int items=3 value=00000001.746f6c53.00000032
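For reference, the integers above can be decoded by hand. The sketch below assumes the usual layout of slot-names (a leading device-number bitmask followed by NUL-terminated names) and little-endian byte order, as on x86. The helper names are hypothetical, and the digit scraping is only an approximation of what topo does.

```python
import struct


def decode_slot_names(words):
    """Split a slot-names value (list of 32-bit ints) into (mask, names).

    Assumes word 0 is the device-number bitmask and the remaining words
    hold NUL-terminated ASCII names in little-endian byte order.
    """
    mask = words[0]
    raw = b"".join(struct.pack("<I", w) for w in words[1:])
    names = [n.decode("ascii") for n in raw.split(b"\0") if n]
    return mask, names


def slot_id_from_name(name):
    """Scrape the trailing decimal digits off a name such as 'Slot2'."""
    digits = ""
    for ch in reversed(name):
        if not ch.isdigit():
            break
        digits = ch + digits
    return int(digits) if digits else None


mask, names = decode_slot_names([0x00000001, 0x746F6C53, 0x00000032])
print(mask, names, slot_id_from_name(names[0]))
```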
But that ends up matching to the following SMBIOS record:
ID    SIZE TYPE
15    41   SMB_TYPE_SLOT (type 9) (upgradeable system slot)

  Location Tag: CPU1 SLOT 3 PCI-E 3.0 X8
  Reference Designator: CPU1 SLOT 3 PCI-E 3.0 X8
  Slot ID: 0x2
  Type: 0xb5 (PCI Exp. Gen 3 x8)
  Width: 0xb (8x or x8)
  Usage: 0x3 (available)
  Length: 0x3 (short length)
  Slot Characteristics 1: 0xc
        SMB_SLCH1_33V (provides 3.3V)
        SMB_SLCH1_SHARED (opening shared with other slot)
  Slot Characteristics 2: 0x1
        SMB_SLCH2_PME (slot supports PME# signal)
  Segment Group: 0
  Bus Number: 255
  Device/Function Number: 0
Updated by Rob Johnston over 2 years ago
The fix will be to add logic similar to prtdiag(1m) - i.e. on SMBIOS implementations that expose the BDF in the Type 9 (Slot) record (SMBIOS 2.6 and later) attempt to match the SMBIOS slot record using the BDF. Systems with older SMBIOS implementations will fall back to the existing logic.
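A rough sketch of the matching order described above, not the actual patch. The record type, function name and sample bus numbers are all hypothetical. It assumes the SMBIOS convention that the Device/Function Number byte holds the device in bits 7:3 and the function in bits 2:0, and that a bus number of 0xff means the record carries no BDF information. It also assumes that on SMBIOS 2.6 and later only the BDF match is used, with no fallback to slot ids.

```python
from dataclasses import dataclass


@dataclass
class SlotRecord:
    """Hypothetical subset of an SMBIOS Type 9 record."""
    slot_id: int
    locator: str
    bus: int
    devfn: int  # device in bits 7:3, function in bits 2:0


def label_for_device(records, smbios_version, bus, dev, func, slot_id):
    """Return the slot locator for a device, or None if nothing matches."""
    if smbios_version >= (2, 6):
        for rec in records:
            if rec.bus == 0xFF:
                continue  # assumed to mean "no BDF information"
            if (rec.bus, rec.devfn >> 3, rec.devfn & 0x7) == (bus, dev, func):
                return rec.locator
        return None
    # Older SMBIOS implementations: fall back to matching on slot id.
    for rec in records:
        if rec.slot_id == slot_id:
            return rec.locator
    return None


# Made-up records: the bus number 0x3b is illustrative only.
records = [
    SlotRecord(slot_id=1, locator="SLOT 2", bus=0x3B, devfn=0x00),
    SlotRecord(slot_id=2, locator="SLOT 3", bus=0xFF, devfn=0x00),
]
new_label = label_for_device(records, (3, 1), 0x3B, 0, 0, slot_id=2)
old_label = label_for_device(records, (2, 5), 0x3B, 0, 0, slot_id=2)
print(new_label, old_label)
```

With these made-up records, the BDF match picks "SLOT 2" for the card, while the slot-id path still returns the off-by-one "SLOT 3".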
Additionally, there is a one-line bug in pci_slot_label_lookup() that causes us to source the slotname from the parent did_t. Honestly, I'm not sure how any of this ever worked. And indeed almost every system I've checked recently had broken PCIe labels - not just the Joyent-M12G5.
Updated by Rob Johnston over 2 years ago
Testing
I built a platform image with this change and compared the PCI labels that were generated by libtopo both before and after the change on a variety of platforms. For platforms that were affected by this bug, I verified that the labels were correct after the fix. For the one platform I could find that didn't exhibit this bug, I verified that the fix had no effect.
Below is a list of platforms tested:
PLATFORM                               BEFORE FIX     AFTER FIX
Supermicro SSG-2029P-ACR24L            Affected       Fixed
Supermicro SSG-6049P-E1CR36L           Not Affected   No change
Supermicro SYS-2028U-E1CNRT+           Affected       Fixed
Supermicro SSG-6049P-E1CR60L-JI006     Affected       Fixed
Western Digital H4060-S                Affected       Fixed
Micro-Star International MS-7b18       Affected       Fixed
Updated by Electric Monk over 2 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit 106e8bd44b02f4b8cd3c825790276c1c7081e67a
commit 106e8bd44b02f4b8cd3c825790276c1c7081e67a
Author: Rob Johnston <rob.johnston@joyent.com>
Date: 2019-10-29T23:40:00.000Z

    11867 PCIe expansion slots mis-labelled in topo
    Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
    Approved by: Robert Mustacchi <rm@fingolfin.org>