Project

General

Profile

Actions

Bug #14181

closed

pcie hotplug trapped in power off loop with no power controller

Added by Robert Mustacchi about 2 years ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Category:
driver - device drivers
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

PCIe hotplug relies on a series of bits that are set in PCIe configuration space, generally in the slot capabilities, slot control, and slot status register which appear in the basic PCIe capability. You can see an example of the configuration space data with the following excerpt from a Gigabyte AMD server which has been configured for enterprise SSD hotplug:

  Slot Capabilities: 0xe25e1
    |--> Attention Button Present: yes (0x1)
    |--> Power Controller Present: no (0x0)
    |--> MRL Sensor Present: no (0x0)
    |--> Attention Indicator Present: no (0x0)
    |--> Power Indicator Present: no (0x0)
    |--> Hot-Plug Surprise: supported (0x20)
    |--> Hot-Plug Capable : supported (0x40)
    |--> Slot Power Limit Value: 0x4b
    |--> Slot Power Limit Scale: 0x0
    |--> Electromechanical Interlock Present: yes (0x20000)
    |--> No Command Completed: 0x60e257
    |--> Physical Slot Number: 0x1
  Slot Control: 0x102b
    |--> Attention Button Pressed: enabled (0x1)
    |--> Power Fault Detected: enabled (0x2)
    |--> MRL Sensor Changed: disabled (0x0)
    |--> Presence Detect Changed: enabled (0x8)
    |--> Command Complete Interrupt: disabled (0x0)
    |--> Hot Plug Interrupt Enable: enabled (0x20)
    |--> Attention Indicator Control: reserved (0x0)
    |--> Power Indicator Control: reserved (0x0)
    |--> Power Controller Control: power on (0x0)
    |--> Electromechanical Interlock Control: 0x0
    |--> Data Link Layer State Changed: enabled (0x1000)
    |--> Auto Slot Power Limit: enabled (0x0)
    |--> In-Band PD: enabled (0x0)
  Slot Status: 0xc0
    |--> Attention Button Pressed: no (0x0)
    |--> Power Fault Detected: no (0x0)
    |--> MRL Sensor Changed: no (0x0)
    |--> Presence Detect Changed: no (0x0)
    |--> Command Complete: no (0x0)
    |--> MRL Sensor State: closed (0x0)
    |--> Presence Detect State: present (0x40)
    |--> Electromechanical Interlock: engaged (0x80)
    |--> Data Link Layer State Changed: no (0x0)

Of note here is that there is no power controller present. Without a power controller present, it means that writes to the Power Controller Control bit in the control register may not work.

With this there are two problems in mind that stem from the same place. Effectively whether someone does surprise removal of the drive or uses cfgadm -c unconfigure which is meant to turn off the power we will ultimately end up in an ultimate loop trying to fruitlessly power off this slot and seeing no change. In particular, we have a problematic while loop:

1449 pciehpc_downgrade_slot_state(pcie_hp_slot_t *slot_p,
1450     ddi_hp_cn_state_t target_state)
1451 {
1452         ddi_hp_cn_state_t curr_state;
1453         int rv = DDI_SUCCESS;
1454 
1455 
1456         curr_state = slot_p->hs_info.cn_state;
1457         while ((curr_state > target_state) && (rv == DDI_SUCCESS)) {
1458 
1459                 switch (curr_state) {
1460                 case DDI_HP_CN_STATE_PRESENT:
1461                         /*
1462                          * From PRESENT to EMPTY, just check hardware slot
1463                          * state.
1464                          */
1465                         pciehpc_get_slot_state(slot_p);
1466                         curr_state = slot_p->hs_info.cn_state;
1467                         if (curr_state >= DDI_HP_CN_STATE_PRESENT)
1468                                 rv = DDI_FAILURE;
1469                         break;
1470                 case DDI_HP_CN_STATE_POWERED:
1471                         rv = (slot_p->hs_ctrl->hc_ops.poweroff_hpc_slot)(
1472                             slot_p, &curr_state);
1473 
1474                         break;
1475                 case DDI_HP_CN_STATE_ENABLED:
1476                         curr_state = slot_p->hs_info.cn_state =
1477                             DDI_HP_CN_STATE_POWERED;
1478 
1479                         break;
1480                 default:
1481                         /* should never reach here */
1482                         ASSERT("unknown devinfo state");
1483                 }
1484         }

This basically is waiting for the state to change, but it never will. Now the gotcha here is that when it writes to the register, it doesn't actually end up looking at the result of this to determine it's return value. So for example, if you look at pciehpc_slot_poweroff, it always gets the current state, bu doesn't really do anything to ensure that this is what it should be. So we try to power off but the power is on, then we'll return success; however, not actually change the state here. The same is not true on pciehpc_slot_poweron. It seems to at least acknowledge that the device may fail to power on.

To deal with this, I believe there are several things we should do:

1. pciehpc_upgrade_slot_state() and pciehpc_downgrade_slot_state() need to basically keep track of how many times they've looped. If we've looped trying to transition with no change in state after some fixed number (10, 20, 100?) then we need to fail this. Otherwise you have a very bad day and can't make progress.

2. If we do not have power control here, then we need to not bother trying to do this at all. This means failing cfgadm requests and not doing this in the hotplug path. It would be best if cfgadm could tell the user why this is. It also means making sure that in the hotplug path we don't attempt to change anything here and just have to rely on the platform to do the right thing.

3. Potentially figure out if failing to power off the slot should result in returning DDI_FAILURE in pciehpc_slot_poweroff.

Actions #1

Updated by Robert Mustacchi over 1 year ago

  • Assignee changed from Michael Zeller to Robert Mustacchi
Actions #2

Updated by Robert Mustacchi over 1 year ago

For the moment, what I've done is to go ahead and basically done (3) right now which doesn't quite solve the user experience bit in (2), but critically ensures that we won't end up blocked with NDI locks held. Now we instead see:

rm@romulus ~ $ pfexec cfgadm -c disconnect Slot19
cfgadm: Hardware specific failure: disconnect failed   
Actions #3

Updated by Electric Monk over 1 year ago

  • Gerrit CR set to 2189
Actions #4

Updated by Robert Mustacchi over 1 year ago

I performed some additional testing here as well. On systems that are based around Enterprise SSD systems, I was able to pull out drives, check that they disappeared from say nvmeadm list, plugged drives back in and used cfgadm to configure them again and verified that this was working end to end.

Actions #5

Updated by Electric Monk over 1 year ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit b9a2a14b8f156d14859ac7ceeea07ac7194df08d

commit  b9a2a14b8f156d14859ac7ceeea07ac7194df08d
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   2022-07-17T14:01:37.000Z

    14181 pcie hotplug trapped in power off loop with no power controller
    Reviewed by: Dan McDonald <danmcd@mnx.io>
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Approved by: Garrett D'Amore <garrett@damore.org>

Actions

Also available in: Atom PDF