Bug #9846
closednvme driver shouldn't panic from userland commands
100%
Description
While trying to test changes to 9809, I encountered a panic without running my bits. Running nvmeadm get-features on an Intel P4600 fails:
panic[cpu0]/thread=fffffdc53ba95820: programming error: invalid NS/format in cmd fffffdc680f07140 fffffb0458748a20 genunix:dev_err+7b () fffffb0458748a50 nvme:nvme_check_generic_cmd_status+188 () fffffb0458748ad0 nvme:nvme_get_features+2b7 () fffffb0458748b50 nvme:nvme_ioctl_get_features+c4 () fffffb0458748c70 nvme:nvme_ioctl+141 () fffffb0458748cb0 genunix:cdev_ioctl+39 () fffffb0458748d00 specfs:spec_ioctl+60 () fffffb0458748d90 genunix:fop_ioctl+55 () fffffb0458748eb0 genunix:ioctl+9b () fffffb0458748f10 unix:brand_sys_sysenter+1d3 ()
Specifically we're trying to get NVME_FEAT_ERROR. Here is the error status from the nvme_cmd_t:
[0]> fffffdc680f07140::print nvme_cmd_t nc_cqe.cqe_sf nc_cqe.cqe_sf = { nc_cqe.cqe_sf.sf_p = 0x1 nc_cqe.cqe_sf.sf_sc = 0x17 nc_cqe.cqe_sf.sf_sct = 0 nc_cqe.cqe_sf.sf_rsvd2 = 0 nc_cqe.cqe_sf.sf_m = 0 nc_cqe.cqe_sf.sf_dnr = 0x1 }
And here's the actual command we sent:
[0]> fffffdc680f07140::print nvme_cmd_t nc_sqe nc_sqe = { nc_sqe.sqe_opc = 0xa nc_sqe.sqe_fuse = 0 nc_sqe.sqe_rsvd = 0 nc_sqe.sqe_psdt = 0 nc_sqe.sqe_cid = 0x1c nc_sqe.sqe_nsid = 0 nc_sqe.sqe_rsvd1 = 0 nc_sqe.sqe_m = { m_ptr = 0 m_sglp = 0 } nc_sqe.sqe_dptr = { d_prp = [ 0, 0 ] d_sgl = { sgl_addr = 0 sgl_len = 0 sgl_rsvd = [ 0, 0, 0 ] sgl_zero = 0 sgl_type = 0 } } nc_sqe.sqe_cdw10 = 0x5 nc_sqe.sqe_cdw11 = 0 nc_sqe.sqe_cdw12 = 0 nc_sqe.sqe_cdw13 = 0 nc_sqe.sqe_cdw14 = 0 nc_sqe.sqe_cdw15 = 0 }
While this feature is mandatory, it's not clear what the programming error is or if something else is going on. We may need to also check for a firmware upgrade.
However, given the prevalence of issues of this kind, it's not clear that panicking the kernel is the right solution as it seems like often times we are hitting this for cases that aren't programmer errors. See also illumos#8945, illumos#8808, illumos#9285.
From talking with Hans, we discussed and agreed that we should take this a step further. Rather than continuing to whitelist various items, we should make sure that ioctls that are issued by user land through nvmeadm or other sources are always set not to panic. As it seems there will always be drives that cause us to go down those paths.
I tested this by making sure that we no longer panic when running nvmeadm against the devices that caused this and was able to now properly run get features. I also ran identify, format, and secure-erase on a handful of different devices. This used a series of different devices that are running nvme versions 1.0 and 1.2 from different manufacturers.
Related issues
Updated by Robert Mustacchi about 5 years ago
- Related to Bug #8945: nvme panics when async events are not supported added
Updated by Robert Mustacchi about 5 years ago
- Related to Bug #8808: nvme: Software Progress Marker feature is optional added
Updated by Robert Mustacchi about 5 years ago
- Related to Bug #9285: nvme: "programming error: invalid NS/format" doing 'nvmeadm list' on a controller without namespaces added
Updated by Robert Mustacchi about 5 years ago
- Related to Feature #9809: nvme driver should attach to all NVMe 1.x devices added
Updated by Electric Monk almost 5 years ago
- Status changed from New to Closed
git commit bc586359b7d9a851391f04b06254a9ead6109d47
commit bc586359b7d9a851391f04b06254a9ead6109d47 Author: Robert Mustacchi <rm@joyent.com> Date: 2019-01-10T21:49:38.000Z 9846 nvme driver shouldn't panic from userland commands Reviewed by: Mike Gerdts <mike.gerdts@joyent.com> Reviewed by: Dan McDonald <danmcd@joyent.com> Reviewed by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Approved by: Richard Lowe <richlowe@richlowe.net>