Bug #14181


pcie hotplug trapped in power off loop with no power controller

Added by Robert Mustacchi 3 months ago.

driver - device drivers
Start date:
Due date:
% Done:


Estimated time:
Gerrit CR:


PCIe hotplug relies on a series of bits that are set in PCIe configuration space, generally in the slot capabilities, slot control, and slot status register which appear in the basic PCIe capability. You can see an example of the configuration space data with the following excerpt from a Gigabyte AMD server which has been configured for enterprise SSD hotplug:

  Slot Capabilities: 0xe25e1
    |--> Attention Button Present: yes (0x1)
    |--> Power Controller Present: no (0x0)
    |--> MRL Sensor Present: no (0x0)
    |--> Attention Indicator Present: no (0x0)
    |--> Power Indicator Present: no (0x0)
    |--> Hot-Plug Surprise: supported (0x20)
    |--> Hot-Plug Capable : supported (0x40)
    |--> Slot Power Limit Value: 0x4b
    |--> Slot Power Limit Scale: 0x0
    |--> Electromechanical Interlock Present: yes (0x20000)
    |--> No Command Completed: 0x60e257
    |--> Physical Slot Number: 0x1
  Slot Control: 0x102b
    |--> Attention Button Pressed: enabled (0x1)
    |--> Power Fault Detected: enabled (0x2)
    |--> MRL Sensor Changed: disabled (0x0)
    |--> Presence Detect Changed: enabled (0x8)
    |--> Command Complete Interrupt: disabled (0x0)
    |--> Hot Plug Interrupt Enable: enabled (0x20)
    |--> Attention Indicator Control: reserved (0x0)
    |--> Power Indicator Control: reserved (0x0)
    |--> Power Controller Control: power on (0x0)
    |--> Electromechanical Interlock Control: 0x0
    |--> Data Link Layer State Changed: enabled (0x1000)
    |--> Auto Slot Power Limit: enabled (0x0)
    |--> In-Band PD: enabled (0x0)
  Slot Status: 0xc0
    |--> Attention Button Pressed: no (0x0)
    |--> Power Fault Detected: no (0x0)
    |--> MRL Sensor Changed: no (0x0)
    |--> Presence Detect Changed: no (0x0)
    |--> Command Complete: no (0x0)
    |--> MRL Sensor State: closed (0x0)
    |--> Presence Detect State: present (0x40)
    |--> Electromechanical Interlock: engaged (0x80)
    |--> Data Link Layer State Changed: no (0x0)

Of note here is that there is no power controller present. Without a power controller present, it means that writes to the Power Controller Control bit in the control register may not work.

With this there are two problems in mind that stem from the same place. Effectively whether someone does surprise removal of the drive or uses cfgadm -c unconfigure which is meant to turn off the power we will ultimately end up in an ultimate loop trying to fruitlessly power off this slot and seeing no change. In particular, we have a problematic while loop:

1449 pciehpc_downgrade_slot_state(pcie_hp_slot_t *slot_p,
1450     ddi_hp_cn_state_t target_state)
1451 {
1452         ddi_hp_cn_state_t curr_state;
1453         int rv = DDI_SUCCESS;
1456         curr_state = slot_p->hs_info.cn_state;
1457         while ((curr_state > target_state) && (rv == DDI_SUCCESS)) {
1459                 switch (curr_state) {
1460                 case DDI_HP_CN_STATE_PRESENT:
1461                         /*
1462                          * From PRESENT to EMPTY, just check hardware slot
1463                          * state.
1464                          */
1465                         pciehpc_get_slot_state(slot_p);
1466                         curr_state = slot_p->hs_info.cn_state;
1467                         if (curr_state >= DDI_HP_CN_STATE_PRESENT)
1468                                 rv = DDI_FAILURE;
1469                         break;
1470                 case DDI_HP_CN_STATE_POWERED:
1471                         rv = (slot_p->hs_ctrl->hc_ops.poweroff_hpc_slot)(
1472                             slot_p, &curr_state);
1474                         break;
1475                 case DDI_HP_CN_STATE_ENABLED:
1476                         curr_state = slot_p->hs_info.cn_state =
1477                             DDI_HP_CN_STATE_POWERED;
1479                         break;
1480                 default:
1481                         /* should never reach here */
1482                         ASSERT("unknown devinfo state");
1483                 }
1484         }

This basically is waiting for the state to change, but it never will. Now the gotcha here is that when it writes to the register, it doesn't actually end up looking at the result of this to determine it's return value. So for example, if you look at pciehpc_slot_poweroff, it always gets the current state, bu doesn't really do anything to ensure that this is what it should be. So we try to power off but the power is on, then we'll return success; however, not actually change the state here. The same is not true on pciehpc_slot_poweron. It seems to at least acknowledge that the device may fail to power on.

To deal with this, I believe there are several things we should do:

1. pciehpc_upgrade_slot_state() and pciehpc_downgrade_slot_state() need to basically keep track of how many times they've looped. If we've looped trying to transition with no change in state after some fixed number (10, 20, 100?) then we need to fail this. Otherwise you have a very bad day and can't make progress.

2. If we do not have power control here, then we need to not bother trying to do this at all. This means failing cfgadm requests and not doing this in the hotplug path. It would be best if cfgadm could tell the user why this is. It also means making sure that in the hotplug path we don't attempt to change anything here and just have to rely on the platform to do the right thing.

3. Potentially figure out if failing to power off the slot should result in returning DDI_FAILURE in pciehpc_slot_poweroff.

No data to display


Also available in: Atom PDF