Bug #14181
closedpcie hotplug trapped in power off loop with no power controller
100%
Description
PCIe hotplug relies on a series of bits that are set in PCIe configuration space, generally in the slot capabilities, slot control, and slot status register which appear in the basic PCIe capability. You can see an example of the configuration space data with the following excerpt from a Gigabyte AMD server which has been configured for enterprise SSD hotplug:
Slot Capabilities: 0xe25e1 |--> Attention Button Present: yes (0x1) |--> Power Controller Present: no (0x0) |--> MRL Sensor Present: no (0x0) |--> Attention Indicator Present: no (0x0) |--> Power Indicator Present: no (0x0) |--> Hot-Plug Surprise: supported (0x20) |--> Hot-Plug Capable : supported (0x40) |--> Slot Power Limit Value: 0x4b |--> Slot Power Limit Scale: 0x0 |--> Electromechanical Interlock Present: yes (0x20000) |--> No Command Completed: 0x60e257 |--> Physical Slot Number: 0x1 Slot Control: 0x102b |--> Attention Button Pressed: enabled (0x1) |--> Power Fault Detected: enabled (0x2) |--> MRL Sensor Changed: disabled (0x0) |--> Presence Detect Changed: enabled (0x8) |--> Command Complete Interrupt: disabled (0x0) |--> Hot Plug Interrupt Enable: enabled (0x20) |--> Attention Indicator Control: reserved (0x0) |--> Power Indicator Control: reserved (0x0) |--> Power Controller Control: power on (0x0) |--> Electromechanical Interlock Control: 0x0 |--> Data Link Layer State Changed: enabled (0x1000) |--> Auto Slot Power Limit: enabled (0x0) |--> In-Band PD: enabled (0x0) Slot Status: 0xc0 |--> Attention Button Pressed: no (0x0) |--> Power Fault Detected: no (0x0) |--> MRL Sensor Changed: no (0x0) |--> Presence Detect Changed: no (0x0) |--> Command Complete: no (0x0) |--> MRL Sensor State: closed (0x0) |--> Presence Detect State: present (0x40) |--> Electromechanical Interlock: engaged (0x80) |--> Data Link Layer State Changed: no (0x0)
Of note here is that there is no power controller present. Without a power controller present, it means that writes to the Power Controller Control
bit in the control register may not work.
With this there are two problems in mind that stem from the same place. Effectively whether someone does surprise removal of the drive or uses cfgadm -c unconfigure
which is meant to turn off the power we will ultimately end up in an ultimate loop trying to fruitlessly power off this slot and seeing no change. In particular, we have a problematic while loop:
1449 pciehpc_downgrade_slot_state(pcie_hp_slot_t *slot_p, 1450 ddi_hp_cn_state_t target_state) 1451 { 1452 ddi_hp_cn_state_t curr_state; 1453 int rv = DDI_SUCCESS; 1454 1455 1456 curr_state = slot_p->hs_info.cn_state; 1457 while ((curr_state > target_state) && (rv == DDI_SUCCESS)) { 1458 1459 switch (curr_state) { 1460 case DDI_HP_CN_STATE_PRESENT: 1461 /* 1462 * From PRESENT to EMPTY, just check hardware slot 1463 * state. 1464 */ 1465 pciehpc_get_slot_state(slot_p); 1466 curr_state = slot_p->hs_info.cn_state; 1467 if (curr_state >= DDI_HP_CN_STATE_PRESENT) 1468 rv = DDI_FAILURE; 1469 break; 1470 case DDI_HP_CN_STATE_POWERED: 1471 rv = (slot_p->hs_ctrl->hc_ops.poweroff_hpc_slot)( 1472 slot_p, &curr_state); 1473 1474 break; 1475 case DDI_HP_CN_STATE_ENABLED: 1476 curr_state = slot_p->hs_info.cn_state = 1477 DDI_HP_CN_STATE_POWERED; 1478 1479 break; 1480 default: 1481 /* should never reach here */ 1482 ASSERT("unknown devinfo state"); 1483 } 1484 }
This basically is waiting for the state to change, but it never will. Now the gotcha here is that when it writes to the register, it doesn't actually end up looking at the result of this to determine it's return value. So for example, if you look at pciehpc_slot_poweroff
, it always gets the current state, bu doesn't really do anything to ensure that this is what it should be. So we try to power off but the power is on, then we'll return success; however, not actually change the state here. The same is not true on pciehpc_slot_poweron
. It seems to at least acknowledge that the device may fail to power on.
To deal with this, I believe there are several things we should do:
1. pciehpc_upgrade_slot_state()
and pciehpc_downgrade_slot_state()
need to basically keep track of how many times they've looped. If we've looped trying to transition with no change in state after some fixed number (10, 20, 100?) then we need to fail this. Otherwise you have a very bad day and can't make progress.
2. If we do not have power control here, then we need to not bother trying to do this at all. This means failing cfgadm
requests and not doing this in the hotplug path. It would be best if cfgadm
could tell the user why this is. It also means making sure that in the hotplug path we don't attempt to change anything here and just have to rely on the platform to do the right thing.
3. Potentially figure out if failing to power off the slot should result in returning DDI_FAILURE
in pciehpc_slot_poweroff
.
Updated by Robert Mustacchi over 1 year ago
- Assignee changed from Michael Zeller to Robert Mustacchi
Updated by Robert Mustacchi over 1 year ago
For the moment, what I've done is to go ahead and basically done (3) right now which doesn't quite solve the user experience bit in (2), but critically ensures that we won't end up blocked with NDI locks held. Now we instead see:
rm@romulus ~ $ pfexec cfgadm -c disconnect Slot19 cfgadm: Hardware specific failure: disconnect failed
Updated by Robert Mustacchi over 1 year ago
I performed some additional testing here as well. On systems that are based around Enterprise SSD systems, I was able to pull out drives, check that they disappeared from say nvmeadm list
, plugged drives back in and used cfgadm to configure them again and verified that this was working end to end.
Updated by Electric Monk over 1 year ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit b9a2a14b8f156d14859ac7ceeea07ac7194df08d
commit b9a2a14b8f156d14859ac7ceeea07ac7194df08d Author: Robert Mustacchi <rm@fingolfin.org> Date: 2022-07-17T14:01:37.000Z 14181 pcie hotplug trapped in power off loop with no power controller Reviewed by: Dan McDonald <danmcd@mnx.io> Reviewed by: Andy Fiddaman <andy@omnios.org> Approved by: Garrett D'Amore <garrett@damore.org>