Bug #7351

The NVMe driver sporadically loses track of a completed I/O request, which leads to zpool hangs and machine panics.

Added by Youzhong Yang almost 4 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
driver - device drivers
Start date:
2016-09-01
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:

Description

We tested the new set of NVMe fixes and encountered this blocking issue. It can be easily reproduced by running two zfs send/recvs of a large dataset simultaneously.

In summary, the issue is that the NVMe SSD device indicates that an I/O has completed (as seen in the completion entry in the Completion Queue), but the driver never gets a chance to process it.
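For context on how the driver notices completions in the first place: an NVMe driver discovers a new completion by checking the phase tag of the entry at its completion-queue head, so a posted entry whose phase check is never (re)run sits unconsumed exactly as described. Below is a minimal sketch of that phase-tag protocol with simplified types; it is illustrative only and is not the illumos nvme_qpair_t code.

```c
#include <stdint.h>

/* Simplified completion queue entry: the real nvme_cqe_t carries more
 * fields; here only the status word holding the phase tag matters. */
typedef struct {
    uint16_t sqhd;    /* submission queue head reported by the device */
    uint16_t status;  /* bit 0 = phase tag, remaining bits = status */
} cqe_t;

#define CQ_ENTRIES 4

typedef struct {
    cqe_t cq[CQ_ENTRIES];
    unsigned cqhead;   /* next entry the driver will examine */
    int phase;         /* phase tag value that marks a new completion */
} qpair_t;

/* Returns 1 and advances the head if a new completion is pending,
 * 0 if the entry at the head has not been written by the device. */
static int poll_completion(qpair_t *q, cqe_t *out)
{
    cqe_t *cqe = &q->cq[q->cqhead];
    if ((cqe->status & 1) != (unsigned)q->phase)
        return 0;                      /* no new completion posted */
    *out = *cqe;
    if (++q->cqhead == CQ_ENTRIES) {   /* wrap: expected phase flips */
        q->cqhead = 0;
        q->phase = !q->phase;
    }
    return 1;
}
```

The failure mode reported here corresponds to the device having written an entry with the expected phase tag while poll_completion (or its interrupt-driven equivalent) is never invoked again for that queue.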

Please see the attached txt for an analysis of two cases (we have a bunch of crash dumps, but unfortunately we cannot share them publicly or privately without going through a lengthy approval process).

By the way, our NVMe drives are as follows:

- 22 x Intel DC P3600 400GB NVMe PCIe 3.0, MLC 2.5" 20nm SSDPE2ME400G4

- 2 x Intel DC P3700 800GB NVMe PCIe 3.0 HET MLC 2.5" 20nm SSDPE2MD800G4


Files

zfs-deadman.txt (28.8 KB) Youzhong Yang, 2016-09-01 05:30 PM
NVME_BUS_CORRS.txt (15.9 KB) Dan Fields, 2017-10-19 12:02 PM

History

#1

Updated by Youzhong Yang almost 4 years ago

Reproduced the issue using two dds on SmartOS image 20160915T211220Z (downloaded from Joyent).

Here is the crash dump: https://drive.google.com/file/d/0B6o_PKnt911fUk9vc2NIay1uS3M/view?usp=sharing
It is an lz4-compressed vmdump.2.

The script for reproduction is as follows:

#!/bin/bash

for i in $(seq 1 1000); do
    dd if=/dev/zero of=file00 bs=1M count=102400 oflag=sync &
    dd if=/dev/zero of=file01 bs=1M count=102400 oflag=sync &
    wait
    rm file00 file01
done

file00 and file01 are created under a ZFS filesystem with compression and atime set to off.

The zpool configuration and diskinfo output are as follows:

  pool: clusters
 state: ONLINE
  scan: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        clusters     ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            c20t1d0  ONLINE       0     0     0
            c23t1d0  ONLINE       0     0     0
            c17t1d0  ONLINE       0     0     0
            c18t1d0  ONLINE       0     0     0
            c21t1d0  ONLINE       0     0     0
            c22t1d0  ONLINE       0     0     0
            c4t1d0   ONLINE       0     0     0
            c5t1d0   ONLINE       0     0     0
        logs
          c25t1d0    ONLINE       0     0     0
          c0t1d0     ONLINE       0     0     0

errors: No known data errors

TYPE    DISK                    VID      PID              SIZE          RMV SSD
-       c0t1d0                  INTEL    SSDPE2MD800G4     745.21 GiB   no  yes
-       c1t1d0                  INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c2t1d0                  INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c3t1d0                  INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c4t1d0                  INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c5t1d0                  INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c6t1d0                  INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c7t1d0                  INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c8t1d0                  INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c9t1d0                  INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c10t1d0                 INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c11t1d0                 INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
UNKNOWN c12t0d0                 INTEL    SSDSC2KI010X6     931.51 GiB   no  yes
UNKNOWN c12t1d0                 INTEL    SSDSC2KI010X6     931.51 GiB   no  yes
USB     c13t0d0                 Innodisk USB Drive 3SE       3.87 GiB   yes no
-       c14t1d0                 INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c15t1d0                 INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c16t1d0                 INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c17t1d0                 INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c18t1d0                 INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c19t1d0                 INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c20t1d0                 INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c21t1d0                 INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c22t1d0                 INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c23t1d0                 INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c24t1d0                 INTEL    SSDPE2ME400G4     372.61 GiB   no  yes
-       c25t1d0                 INTEL    SSDPE2MD800G4     745.21 GiB   no  yes

For your convenience, here is the qpair that had the issue:

> ffffd127ed67f2b0::print -t nvme_qpair_t
nvme_qpair_t {
    size_t nq_nentry = 0x400
    nvme_dma_t *nq_sqdma = 0xffffd127ed693dd0
    nvme_sqe_t *nq_sq = 0xffffd127ed840000
    uint_t nq_sqhead = 0xc
    uint_t nq_sqtail = 0xd
    uintptr_t nq_sqtdbl = 0x1028
    nvme_dma_t *nq_cqdma = 0xffffd127ed6936f0
    nvme_cqe_t *nq_cq = 0xffffd127ed82c000
    uint_t nq_cqhead = 0xc
    uint_t nq_cqtail = 0
    uintptr_t nq_cqhdbl = 0x102c
    nvme_cmd_t **nq_cmd = 0xffffd127ed582000
    uint16_t nq_next_cmd = 0xd
    uint_t nq_active_cmds = 0x1
    int nq_phase = 0
    kmutex_t nq_mutex = {
        void *[1] _opaque = [ 0 ]
    }
}
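A quick sanity check on the dump above (my reading of the field names, assumed from the structure, not taken from the driver source): the submission queue shows one slot between head (0xc) and tail (0xd), which agrees with nq_active_cmds = 1, i.e. exactly one command is in flight and its completion was never consumed. The ring arithmetic:

```c
/* Values copied from the qpair dump above. */
enum {
    NQ_NENTRY = 0x400,  /* ring size (power of two) */
    NQ_SQHEAD = 0xc,
    NQ_SQTAIL = 0xd,
    NQ_ACTIVE = 0x1,
};

/* Commands in flight = distance from SQ head to tail, modulo ring size. */
static unsigned outstanding(unsigned head, unsigned tail, unsigned nentry)
{
    return (tail - head) & (nentry - 1);
}
```

With nq_cqhead also at 0xc and nq_phase = 0, the driver would next test cq entry 0xc against phase 0; the attached analysis covers what the device actually wrote there.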

#2

Updated by Dan Fields over 2 years ago

Not sure if it is related, but I have a P3700 which goes offline because of pciex errors. When I disable the diagnosis of the errors, the device stays online and continues to work. We are still testing.
