Project

General

Profile

Actions

Bug #14541

open

nvme_quiesce() can hang preventing reboot

Added by Joshua M. Clulow 6 months ago. Updated 4 months ago.

Status:
New
Priority:
Normal
Category:
driver - device drivers
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

I have some Gigabyte Z33 systems with five NVMe drives:

# diskinfo
TYPE    DISK                    VID      PID              SIZE          RMV SSD
NVME    c1t000CCA0060126400d0   NVMe     HUSPR3280ADP301   745.21 GiB   no  yes
NVME    c2t000CCA0061112F01d0   NVMe     HUSPR3280ADP301   745.21 GiB   no  yes
NVME    c3t000CCA006013C400d0   NVMe     HUSPR3280ADP301   745.21 GiB   no  yes
NVME    c4t1d0                  HP       SSD EX950 1TB     953.87 GiB   no  yes
NVME    c5t1d0                  HP       SSD EX950 1TB     953.87 GiB   no  yes

These machines reliably hang on reboot, some time after the rebooting... prompt. I dug in with KMDB, setting breakpoints for each thing that occurs after that prompt is printed. I was able to determine we were getting to the point of quiescing devices, but no further, so I looked at which devices we were quiescing:

quiesce_one_device::bp -c '<rdi::devinfo -q; :c'

The last device node we tried to quiesce was for nvme, so it seems that nvme_quiesce() is responsible. I confirmed this by replacing the function with nothing:

# mdb -kwe 'nvme_quiesce/v c3'

Patching out this function causes reboot to proceed reliably.

Actions #1

Updated by Joshua M. Clulow 5 months ago

  • Assignee set to Jordan Hendricks
Actions #2

Updated by Jordan Hendricks 5 months ago

I've dug into this and have an explanation for what's happening. I'll give the summary, then outline how I arrived here.

Summary

The behavior observed here is not actually a hang, but an unreasonably long reboot time; the machine will reliably reboot, but it takes about 8 minutes. This is the result of the nvme drives on this machine not processing an abrupt shutdown notification correctly, combined with a large timeout in the quiesce path.

When the nvme driver is asked to quiesce the device, it notifies the controller of an abrupt shutdown by writing to the Shutdown Notification field of the Controller Configuration register (CC.SHN), then clears the Controller Configuration register's Enable bit (CC.EN) to cause a controller reset. For both the shutdown notification and the reset, the driver checks that the controller has processed the updates in a loop, up until a timeout. Unfortunately, on these drives using an abrupt shutdown causes the reset to never appear to be processed. The driver thus exhausts its timeout before giving up, which translates to about 2 minutes per device, which, multiplied by several nvme devices, makes for a very long reboot time.

Investigation

Determining the source of the "hang"

First, I wanted to determine where exactly the machine was appearing to hang in nvme_quiesce. At this stage of reboot, facilities like taking a crash dump aren't available, so I used kmdb to observe what was happening. After setting initial breakpoints on functions called by nvme_quiesce, I observed that nvme_reset seemed to be where the hang was. I began to step through instructions to see if I could find one that hung. Instead, what I noticed was that it was looping a lot. For context, here's what nvme_reset looks like:

static boolean_t
nvme_reset(nvme_t *nvme, boolean_t quiesce)
{
        nvme_reg_csts_t csts;
        int i;

        nvme_put32(nvme, NVME_REG_CC, 0);

        csts.r = nvme_get32(nvme, NVME_REG_CSTS);
        if (csts.b.csts_rdy == 1) {
                nvme_put32(nvme, NVME_REG_CC, 0);
                for (i = 0; i != nvme->n_timeout * 10; i++) {
                        csts.r = nvme_get32(nvme, NVME_REG_CSTS);
                        if (csts.b.csts_rdy == 0)
                                break;

                        if (quiesce)
                                drv_usecwait(50000);
                        else
                                delay(drv_usectohz(50000));
                }
        }

        nvme_put32(nvme, NVME_REG_AQA, 0);
        nvme_put32(nvme, NVME_REG_ASQ, 0);
        nvme_put32(nvme, NVME_REG_ACQ, 0);

        csts.r = nvme_get32(nvme, NVME_REG_CSTS);
        return (csts.b.csts_rdy == 0 ? B_TRUE : B_FALSE);
}

Before the loop, the driver writes 0 to the Controller Status register, which will clear the enable bit (CC.EN), prompting a controller reset. Then, in a loop, it reads from the Controller Status register to see if the ready bit (CSTS.RDY) is 0, indicating the controller has processed the reset. If it isn't zero, nvme_reset will wait 50 ms using drv_usecwait. It does this 10 * n_timeout times, where n_timeout is read from the Controller Capabilities register (CAP.TO). The multiplication by 10 is because the timeout is in units of 500 ms, but the waiting period is 50 ms.

I wanted to see what was happening with this loop. So I found where i was incremented:

nvme_reset+0xd0:                incl   %r12d

Then set a breakpoint here to print out i, then continue execution:

nvme_reset+0xd0::bp -c '<r12d=K; :c'

I kicked off a reboot, which dropped me into kmdb due to the breakpoint. I observed that each call to nvme_reset (one per device) looped 2550 times, which seemed like it might be exhausting the loop. I checked the value of nvme->n_timeout, and sure enough, it was: the value was 255 (2550 / 10):

[0]> ::stack
nvme`nvme_reset+0xd0(fffffeb1ab458000, 1)
nvme`nvme_quiesce+0x48(fffffeb1959765b8)
devi_quiesce+0x31(fffffeb1959765b8)
quiesce_one_device+0x92(fffffeb1959765b8, fffffe00f4617d5c)
quiesce_devices+0x32(fffffeb1959765b8, fffffe00f4617d5c)
quiesce_devices+0x27(fffffeb19597b3b8, fffffe00f4617d5c)
quiesce_devices+0x27(fffffeb1a42b9430, fffffe00f4617d5c)
quiesce_devices+0x27(fffffeb19597bc28, fffffe00f4617d5c)
mdboot+0x265(2, 1, 0, 1)
kadmin+0x1bf(2, 1, 0, fffffeb1c647ad10)
uadmin+0x17d(2, 1, 0)
sys_syscall32+0x105()
[0]> fffffeb1ab458000::print nvme_t n_timeout 
n_timeout = 0xff
[0]> 0xff=e
                255 

So it seemed like for whatever reason, the ready bit was never set to 0 by the controller. I checked the spec to ensure the driver was doing the correct thing here. Section 3.1.5. says of CC.EN:

When this field is cleared to ‘0’, the CSTS.RDY bit is cleared to ‘0’ by the controller once the controller is ready to be re-enabled.

And section 3.1.1 says this of the timeout:

This is the worst case time that host software shall wait for
CSTS.RDY to transition from:
a) ‘0’ to ‘1’ after CC.EN transitions from ‘0’ to ‘1’; or
b) ‘1’ to ‘0’ after CC.EN transitions from ‘1’ to ‘0’.

This worst case time may be experienced after events such as an abrupt shutdown or activation of a new firmware image; typical times are expected to be much shorter. This field is in 500 millisecond units.

So while waiting 127.5 seconds is a long time, this code is doing what the spec says to do.

Experiment 1: Trying a read/modify/write of CC register in nvme_reset

Something I was then curious about was the fact that, when clearning the CC.EN bit, the driver also cleared all the other bits writing all zeroes to CC. That happens in this line in nvme_reset:

        nvme_put32(nvme, NVME_REG_CC, 0);

I would expect the driver to do a read/modify/write operation of this register instead of clearing all the bits. So I did an experiment where I changed the value that would be written to the register, to see if it would behave differently.

This is the definition of nvme_put32:

static inline void
nvme_put32(nvme_t *nvme, uintptr_t reg, uint32_t val)
{
        ASSERT(((uintptr_t)(nvme->n_regs + reg) & 0x3) == 0);

        /*LINTED: E_BAD_PTR_CAST_ALIGN*/
        ddi_put32(nvme->n_regh, (uint32_t *)(nvme->n_regs + reg), val);
}

First, I made a breakpoint on the call to ddi_io_put32 (since nvme_put32 is inlined):

nvme`nvme_reset+0x23:           call   +0x3bd8d78       <ddi_io_put32>

As this is an AMD machine, the first, second, and third arguments in the registers rdi, rsi, and rdx, respectively:

kmdb: target stopped at:
nvme`nvme_reset+0x23:   call   +0x3bccd78       <ddi_io_put32>
[42]> <rdi=K
                fffffeb1ac61d7c0 
[42]> <rsi=K
                fffffeb1437d6014 
[42]> <rdx=K
                0 

The second and third arguments are the most interesting ones here. The second argument is the address of the CC register. The third argument is the value we'll need to change. But first, we need to see what's currently in the register (which was read in an earlier function):

[42]> fffffeb1437d6014/X
kmdb: failed to read data from target: no mapping for address

This fails because that address is a virtual address. To read it here, we need to get the physical mapping:

[42]> fffffeb1437d6014::vtop
virtual fffffeb1437d6014 mapped to physical f8730014

I wanted to make sure I was reading from the right address for this register. So I checked another register with a value that was easy to verify: the Version register (section 3.1.8), which is at offset 08 (CC is at 14). I adjusted the offset in the address from 14 to 08 and read from it (note the backslash, since this is a physical address):

[42]> f8730008\X
0xf8730008:     10100           
[42]> 10100=j
                10000000100000000
                |       |
                |       +---------- bit 8  mask 0x00100
                +------------------ bit 16 mask 0x10000

This indicates that the device is an NVMe 1.1 device. I verified this was right by checking the nvme_t. It agreed:

[42]> fffffeb1abb86900::print nvme_t n_version
n_version = {
    n_version.v_minor = 0x1
    n_version.v_major = 0x1
}

Then I read the value of the CC register, and noted that the enable bit (bit 0) was set:

[42]> f8730014\X
0xf8730014:     468001          
[42]> 468001=j
                10001101000000000000001
                |   || |              |
                |   || |              +-- bit 0  mask 0x000001
                |   || +----------------- bit 15 mask 0x008000
                |   |+------------------- bit 17 mask 0x020000
                |   +-------------------- bit 18 mask 0x040000
                +------------------------ bit 22 mask 0x400000

Next up, I modified the val argument to ddi_io_put32 to change what got written to CC to only clear the enable bit:

[42]> 468000>rdx
[42]> <rdx=j
                10001101000000000000000
                |   || |
                |   || +----------------- bit 15 mask 0x008000
                |   |+------------------- bit 17 mask 0x020000
                |   +-------------------- bit 18 mask 0x040000
                +------------------------ bit 22 mask 0x400000

So I then stepped over the call the ddi_io_put32, and read the value again:

nvme`nvme_reset+0x28:   movq   0x10(%rbx),%rax
[42]> f8730014\X
0xf8730014:     0 

It was 0, not the value I had written. Further, the same thing happened as before -- the ready bit never appeared as 0, and the driver took a long time to finish quiesce.

CSTS indicated that the shutdown was complete (bits 03:02 are 10b), and the ready bit was still set, despite prompting a controller reset.

[42]> f873001c\X
0xf873001c:     9               
[42]> 9=j
                1001
                |  |
                |  +-- bit 0 mask 0x1
                +----- bit 3 mask 0x8

This was a bit of a disappointing result, and I spent a bunch of time verifying that this approach to changing the value that ends up in the register was correct. Ultimately, I decided I wanted to try modifying the CC.EN in value in a different way to see if I could produce a different outcome.

Experiment 2: Abrupt shutdown notification and controller reset in one register write

Another thing that stuck out to me while looking into this bug was the fact that shutdown notification and the reset were two separate writes to the CC register in the driver. Section 7.6.2 of the NVMe spec discusses shutdown, but it doesn't say explicitly that the shutdown notification and reset should be done in two steps. Section 3.1.5. notes that other fields in CC (including CC.EN) are allowed to be modified when CC.SHN is set to 01b (normal shutdown notification) or 10b (abrupt shutdown notification). So I was curious what would happen if I did the abrupt shutdown notification and cleared the enable bit in the same write.

To do this without a having to make a new OS build, I chose to do a similar approach as the first experiment in the nvme_shutdown path, which is called before nvme_reset. For context, this is what nvme_shutdown looks like:

static void
nvme_shutdown(nvme_t *nvme, int mode, boolean_t quiesce)
{
        nvme_reg_cc_t cc;
        nvme_reg_csts_t csts;
        int i;

        ASSERT(mode == NVME_CC_SHN_NORMAL || mode == NVME_CC_SHN_ABRUPT);

        cc.r = nvme_get32(nvme, NVME_REG_CC);
        cc.b.cc_shn = mode & 0x3;
        nvme_put32(nvme, NVME_REG_CC, cc.r);

        for (i = 0; i != 10; i++) {
                csts.r = nvme_get32(nvme, NVME_REG_CSTS);
                if (csts.b.csts_shst == NVME_CSTS_SHN_COMPLETE)
                        break;

                if (quiesce)
                        drv_usecwait(100000);
                else
                        delay(drv_usectohz(100000));
        }
}

I checked what the CC register value was just after it was read here:

        cc.r = nvme_get32(nvme, NVME_REG_CC);

That value was 460001, and the value the driver would write normally was 468001, which sets the abrupt shutown notification (bits 15:14 to 10b):

[9]> f8730014\X
0xf8730014:     460001   
[9]> <rdx=K
                468001
[9]> 468001=j
                10001101000000000000001
                |   || |              |
                |   || |              +-- bit 0  mask 0x000001
                |   || +----------------- bit 15 mask 0x008000
                |   |+------------------- bit 17 mask 0x020000
                |   +-------------------- bit 18 mask 0x040000
                +------------------------ bit 22 mask 0x400000

Instead, I modified the value that the write put in a similar manner as before to both clear the enable bit and set the abrupt shutdown notification:

[9]> 468000>rdx
[9]> ::step over

Then checked the values of CC and CSTS. Like before, CC was 0, and CSTS status had the ready bit set:

## CC
[9]> f8730014\X
0xf8730014:     0 

## CSTS
[9]> f873001c\j
0xf873001c:     1001
                |  |
                |  +-- bit 0 mask 0x1
                +----- bit 3 mask 0x8

As before, nvme_reset took a long time, as it never saw the ready bit clear to 0. Time to try something else.

Experiment 3: Normal shutdown notification and controller reset in one register write

At this point, I began to wonder if something strange was happening with abrupt shutdown. I wanted to see if issuing a normal shutdown notification would work as expected.

I took a similar approach to experiment 3 to both notify the controller of a normal shutdown (instead of an abrupt one) and clear the enable bit in one write:

# CC before write
[0]> f8730014\X
0xf8730014:     460001          

# CSTS before write
[0]> f873001c\X
0xf873001c:     1               

# value to write with CC.EN=0 and CC.SHN=01 (normal)
[0]> 467000=j
                10001100111000000000000
                |   ||  |||
                |   ||  ||+-------------- bit 12 mask 0x001000
                |   ||  |+--------------- bit 13 mask 0x002000
                |   ||  +---------------- bit 14 mask 0x004000
                |   |+------------------- bit 17 mask 0x020000
                |   +-------------------- bit 18 mask 0x040000
                +------------------------ bit 22 mask 0x400000
[0]> 467000>rdx
[0]> ::step over
kmdb: target stopped at:
nvme`nvme_shutdown+0x50:jmp    +0x10    <nvme`nvme_shutdown+0x62>

After the write and the subsequent read of the register, for the first time ever, the drive reported that CSTS.RDY was 0! But it did not acknowledge the shutdown notification in CSTS.SHST:

## CC after write
[0]> f8730014\X
0xf8730014:     0 

## CSTS after write
[0]> f873001c\X
0xf873001c:     0               

Also for the first time, nvme_reset completed quickly after this, as the ready bit was 0.

One thing that didn't make sense to me was that CSTS.SHST reported 0, indicating normal operation, and that it hasn't started processing or completed the shutdown notification. I thought this might be expected since the device was reset in the same operation, and I took a look at the spec again:

It says in Section 3.1.5 that:

All other controller registers defined in this section and internal controller state (e.g., Feature values defined in section 5.21.1 that are not persistent across power states) are reset to their default values.

So it seems like maybe the other values of CSTS might've been cleared on the reset, which would make sense. A controller reset is disabling this device, so reading that register again isn't useful. To distinguish a bit better between normal shutdown and abrupt shutdown, I wanted to try one last experiment where I followed the normal code path, only modifying nvme_shutdown to do a normal shutdown instead of an abrupt one.

Experiment 4: Normal shutdown notifcation, then controller reset

The last experiment I wanted to try was simply changing the write to CC.SHN in nvme_shutdown from an abrupt shutdown to a normal shutdown.

# CC before write
[16]> f8730014\X
0xf8730014:     460001          

# value to write to CC (from driver)
[16]> <rdx=K
                468001          

# modify argument to change abrupt shutdown to normal
[16]> 464001=j
                10001100100000000000001
                |   ||  |             |
                |   ||  |             +-- bit 0  mask 0x000001
                |   ||  +---------------- bit 14 mask 0x004000
                |   |+------------------- bit 17 mask 0x020000
                |   +-------------------- bit 18 mask 0x040000
                +------------------------ bit 22 mask 0x400000
[16]> 464001>rdx
[16]> ::step over

CSTS.SHST reflected that the shutdown was complete, and the ready bit was still 1, as expected:

[16]> f873001c\X
0xf873001c:     9               
[16]> 9=j
                1001
                |  |
                |  +-- bit 0 mask 0x1
                +----- bit 3 mask 0x8

When the driver reached nvme_reset and cleared CC.EN, the ready bit was 0 on the first read of CSTS.RDY:

[16]> f873001c\X
0xf873001c:     0               

And nvme_reset completed quickly this time.

There are five drives on this machine, and up to this point, I had generally been only running my experiments on the first or second one. Just to be sure this fix worked for all of them, I set a breakpoint before the write in nvme_shutdown, changed the value of rdx to indicate a normal shutdown, then continued, and observed that they all completed quickly.

Conclusions

In both cases where I experimented with using a normal shutdown (either in the same write as the reset or an earlier one), the device was able to process a reset successfully. In cases with the abrupt shutdown, the device never cleared the ready bit to demonstrate that it was okay to be treated as reset. On the particular machine I was testing on, this didn't seem to cause problems -- the device gets reset again when it is attached again. So it seems like abrupt shutdown is buggy on these devices, but the main symptom I've observed is a slow reboot time.

Another thing I did while debugging was try some of these same experiments on a different machine with different drives. I noted that their timeout was far less than 255 (127.5 seconds) -- it was 60 (30 seconds). They also processed the abrupt shutdown notification immediately, as would be expected since this hasn't been a reported issue on other drives.

Suggested Fix

One simple fix here would be to change nvme_quiesce to use a normal shutdown notification instead of an abrupt. I took a look at what Linux does for its nvme drivers, and I didn't see any use of abrupt shutdown, only normal shutdown. The spec says in section 7.6.2 that abrupt shutdown should be used for entry to the D3 power state, which is defined in section 8.4.4: "In Runtime D3 (RTD3) main power is removed from the controller. Auxiliary power may or may not be provided." As far as I know, nothing like that is happening in the reboot path, so normal shutdown should be sufficient here.

Actions #3

Updated by Electric Monk 4 months ago

  • Gerrit CR set to 2129
Actions #4

Updated by Jordan Hendricks 4 months ago

Testing

I tested this change on one of the systems this issue was observed on originally. I installed my OS image and rebooted to confirm that the reboot no longer took an unreasonable amount of time.

Actions

Also available in: Atom PDF