Project

General

Profile

Bug #8629

nvme: rework command abortion

Added by Hans Rosenfeld about 2 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Category:
driver - device drivers
Start date:
2017-09-07
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

The way timed-out commands are aborted in nvme is a bit odd. It drops the mutex of the command to be aborted very early, which means that there may be a race between abortion and the command. We handle this safely right now but it has a couple of odd side effects that other parts of nvme need deal with. Another issue is that it replaces the command callback with its own version to deal with cleanup, and a timeout needs to be detected by the submitter and be handled specially because of that.

It turns out that this could be improved. When a command is aborted it should go through its regular completion callback, which will deal with the abortion in the error code path and eventually clean up the command. At the same time we can deal better with abort commands that time out themselves. Improving the internal error propagation of nvme will also help.

History

#1

Updated by Hans Rosenfeld about 2 years ago

Webrev: https://grumpf.hope-2000.org/illumos-8629/

As command timeouts and aborts don't happen in normal operation, I have tested this with two different test setups:

  1. On a fully functional Intel P3700 device, which takes about 9 seconds to complete a format operation. I've reduced the format command timeout to 6 seconds, and as a result the driver attempted to abort the format command. The abort returned an error (abort failed) and the format command completed normally within the time the driver waited for the abort command. Reducing the format command timeout to 3s led to a double timeout and subsequent faulting of the NVMe device as expected.
  2. I've put two known to be broken Intel P3700 in my test system. These devices have the nasty habit of ignoring commands during attach, but at different points in attach. Both caused a double timeout at the expected places in attach and were faulted by FMA.

This change has been in illumos-joyent since July.

#2

Updated by Hans Rosenfeld about 2 years ago

  • Parent task set to #8628
#3

Updated by Hans Rosenfeld about 2 years ago

  • Parent task deleted (#8628)
#4

Updated by Electric Monk about 2 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit e984c70bc7d741cd0663924c95c15fed9f645565

commit  e984c70bc7d741cd0663924c95c15fed9f645565
Author: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Date:   2017-09-25T09:11:15.000Z

    8629 nvme: rework command abortion
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Jason King <jason.king@joyent.com>
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>

Also available in: Atom PDF