Project

General

Profile

Bug #9846

nvme driver shouldn't panic from userland commands

Added by Robert Mustacchi about 1 year ago. Updated 9 months ago.

Status:
Closed
Priority:
Normal
Category:
driver - device drivers
Start date:
2018-09-18
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:

Description

While trying to test changes to 9809, I encountered a panic without running my bits. Running nvmeadm get-features on an Intel P4600 fails:

panic[cpu0]/thread=fffffdc53ba95820: programming error: invalid NS/format in cmd fffffdc680f07140

fffffb0458748a20 genunix:dev_err+7b ()
fffffb0458748a50 nvme:nvme_check_generic_cmd_status+188 ()
fffffb0458748ad0 nvme:nvme_get_features+2b7 ()
fffffb0458748b50 nvme:nvme_ioctl_get_features+c4 ()
fffffb0458748c70 nvme:nvme_ioctl+141 ()
fffffb0458748cb0 genunix:cdev_ioctl+39 ()
fffffb0458748d00 specfs:spec_ioctl+60 ()
fffffb0458748d90 genunix:fop_ioctl+55 ()
fffffb0458748eb0 genunix:ioctl+9b ()
fffffb0458748f10 unix:brand_sys_sysenter+1d3 ()

Specifically we're trying to get NVME_FEAT_ERROR. Here is the error status from the nvme_cmd_t:

[0]> fffffdc680f07140::print nvme_cmd_t nc_cqe.cqe_sf  
nc_cqe.cqe_sf = {
    nc_cqe.cqe_sf.sf_p = 0x1
    nc_cqe.cqe_sf.sf_sc = 0x17
    nc_cqe.cqe_sf.sf_sct = 0
    nc_cqe.cqe_sf.sf_rsvd2 = 0
    nc_cqe.cqe_sf.sf_m = 0
    nc_cqe.cqe_sf.sf_dnr = 0x1
}

And here's the actual command we sent:

[0]> fffffdc680f07140::print nvme_cmd_t nc_sqe
nc_sqe = {
    nc_sqe.sqe_opc = 0xa
    nc_sqe.sqe_fuse = 0
    nc_sqe.sqe_rsvd = 0
    nc_sqe.sqe_psdt = 0
    nc_sqe.sqe_cid = 0x1c
    nc_sqe.sqe_nsid = 0
    nc_sqe.sqe_rsvd1 = 0
    nc_sqe.sqe_m = {
        m_ptr = 0
        m_sglp = 0
    }
    nc_sqe.sqe_dptr = {
        d_prp = [ 0, 0 ]
        d_sgl = {
            sgl_addr = 0
            sgl_len = 0
            sgl_rsvd = [ 0, 0, 0 ]
            sgl_zero = 0
            sgl_type = 0
        }
    }
    nc_sqe.sqe_cdw10 = 0x5
    nc_sqe.sqe_cdw11 = 0              
    nc_sqe.sqe_cdw12 = 0
    nc_sqe.sqe_cdw13 = 0
    nc_sqe.sqe_cdw14 = 0
    nc_sqe.sqe_cdw15 = 0
}

While this feature is mandatory, it's not clear what the programming error is or if something else is going on. We may need to also check for a firmware upgrade.

However, given the prevalence of issues of this kind, it's not clear that panicking the kernel is the right solution as it seems like often times we are hitting this for cases that aren't programmer errors. See also illumos#8945, illumos#8808, illumos#9285.

From talking with Hans, we discussed and agreed that we should take this a step further. Rather than continuing to whitelist various items, we should make sure that ioctls that are issued by user land through nvmeadm or other sources are always set not to panic. As it seems there will always be drives that cause us to go down those paths.


I tested this by making sure that we no longer panic when running nvmeadm against the devices that caused this and was able to now properly run get features. I also ran identify, format, and secure-erase on a handful of different devices. This used a series of different devices that are running nvme versions 1.0 and 1.2 from different manufacturers.


Related issues

Related to illumos gate - Bug #8945: nvme panics when async events are not supportedClosed2017-12-30

Actions
Related to illumos gate - Bug #8808: nvme: Software Progress Marker feature is optionalClosed2017-11-16

Actions
Related to illumos gate - Bug #9285: nvme: "programming error: invalid NS/format" doing 'nvmeadm list' on a controller without namespacesClosed2018-03-14

Actions
Related to illumos gate - Feature #9809: nvme driver should attach to all NVMe 1.x devicesClosed2018-09-09

Actions

History

#1

Updated by Robert Mustacchi about 1 year ago

  • Related to Bug #8945: nvme panics when async events are not supported added
#2

Updated by Robert Mustacchi about 1 year ago

  • Related to Bug #8808: nvme: Software Progress Marker feature is optional added
#3

Updated by Robert Mustacchi about 1 year ago

  • Related to Bug #9285: nvme: "programming error: invalid NS/format" doing 'nvmeadm list' on a controller without namespaces added
#4

Updated by Robert Mustacchi about 1 year ago

  • Related to Feature #9809: nvme driver should attach to all NVMe 1.x devices added
#5

Updated by Electric Monk 9 months ago

  • Status changed from New to Closed

git commit bc586359b7d9a851391f04b06254a9ead6109d47

commit  bc586359b7d9a851391f04b06254a9ead6109d47
Author: Robert Mustacchi <rm@joyent.com>
Date:   2019-01-10T21:49:38.000Z

    9846 nvme driver shouldn't panic from userland commands
    Reviewed by: Mike Gerdts <mike.gerdts@joyent.com>
    Reviewed by: Dan McDonald <danmcd@joyent.com>
    Reviewed by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
    Approved by: Richard Lowe <richlowe@richlowe.net>

Also available in: Atom PDF