Bug #7351
NVMe driver sporadically loses track of a completed I/O request, which leads to zpool hangs and machine panics.
Description
We tested the new set of NVMe fixes and encountered this blocking issue. It can easily be reproduced by running two zfs send/recvs of a large dataset simultaneously.
In summary, the NVMe SSD indicates that an I/O has completed (the completion entry is visible in the Completion Queue), but the driver never gets a chance to process it.
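For context, this is roughly how a phase-tag based completion-queue consumer works. The sketch below is a hypothetical illustration, not the illumos driver's code: the names (cqe_t, cq_t, cq_consume_one) and struct layouts are ours, and the only assumption taken from the NVMe specification is that the device toggles the phase tag in each completion entry it posts. If the driver never re-examines the queue after the device posts an entry (for example because of a missed or raced interrupt), the I/O shows as complete in the CQ but is never retired, which matches the symptom described here.

#include <stdint.h>
#include <stdbool.h>

typedef struct cqe {
    uint16_t status;        /* bit 0 is the phase tag written by the device */
    uint16_t cid;           /* command identifier being completed */
} cqe_t;

typedef struct cq {
    cqe_t    *entries;      /* completion queue ring */
    uint32_t  nentries;     /* ring size */
    uint32_t  head;         /* next entry the driver expects to consume */
    int       phase;        /* phase value expected for a new completion */
} cq_t;

/*
 * Consume one completion, if any.  Returns true if an entry was processed.
 * If the device has posted an entry but this routine is never called again,
 * the I/O appears "complete" in the CQ yet is never retired by the driver.
 */
bool
cq_consume_one(cq_t *cq, uint16_t *cidp)
{
    cqe_t *e = &cq->entries[cq->head];

    /* The entry is new only if its phase tag matches the expected phase. */
    if ((e->status & 0x1) != cq->phase)
        return (false);

    *cidp = e->cid;

    /* Advance the head; the expected phase flips each time the ring wraps. */
    if (++cq->head == cq->nentries) {
        cq->head = 0;
        cq->phase ^= 1;
    }

    /* A real driver would now write cq->head to the CQ head doorbell. */
    return (true);
}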
Please see the attached txt for an analysis of two cases (we have a number of crash dumps, but unfortunately we cannot share them, publicly or privately, without going through a lengthy approval process).
By the way, our NVMe drives are as follows:
- 22 x Intel DC P3600 400GB NVMe PCIe 3.0, MLC 2.5" 20nm SSDPE2ME400G4
- 2 x Intel DC P3700 800GB NVMe PCIe 3.0 HET MLC 2.5" 20nm SSDPE2MD800G4
Updated by Youzhong Yang about 7 years ago
Reproduced the issue using two dds on SmartOS image 20160915T211220Z (downloaded from Joyent).
Here is the crash dump: https://drive.google.com/file/d/0B6o_PKnt911fUk9vc2NIay1uS3M/view?usp=sharing
It is an lz4-compressed vmdump.2.
The script for reproduction is as follows:
#!/bin/bash
for i in `seq 1 1000`; do
    dd if=/dev/zero of=file00 bs=1M count=102400 oflag=sync &
    dd if=/dev/zero of=file01 bs=1M count=102400 oflag=sync &
    wait
    rm file00 file01
done
file00 and file01 are created under a zfs filesystem with compression and atime set to off.
The zpool configuration and diskinfo output are as follows:
  pool: clusters
 state: ONLINE
  scan: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        clusters     ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            c20t1d0  ONLINE       0     0     0
            c23t1d0  ONLINE       0     0     0
            c17t1d0  ONLINE       0     0     0
            c18t1d0  ONLINE       0     0     0
            c21t1d0  ONLINE       0     0     0
            c22t1d0  ONLINE       0     0     0
            c4t1d0   ONLINE       0     0     0
            c5t1d0   ONLINE       0     0     0
        logs
          c25t1d0    ONLINE       0     0     0
          c0t1d0     ONLINE       0     0     0

errors: No known data errors

TYPE     DISK      VID       PID              SIZE          RMV  SSD
-        c0t1d0    INTEL     SSDPE2MD800G4    745.21 GiB    no   yes
-        c1t1d0    INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c2t1d0    INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c3t1d0    INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c4t1d0    INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c5t1d0    INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c6t1d0    INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c7t1d0    INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c8t1d0    INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c9t1d0    INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c10t1d0   INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c11t1d0   INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
UNKNOWN  c12t0d0   INTEL     SSDSC2KI010X6    931.51 GiB    no   yes
UNKNOWN  c12t1d0   INTEL     SSDSC2KI010X6    931.51 GiB    no   yes
USB      c13t0d0   Innodisk  USB Drive 3SE    3.87 GiB      yes  no
-        c14t1d0   INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c15t1d0   INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c16t1d0   INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c17t1d0   INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c18t1d0   INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c19t1d0   INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c20t1d0   INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c21t1d0   INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c22t1d0   INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c23t1d0   INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c24t1d0   INTEL     SSDPE2ME400G4    372.61 GiB    no   yes
-        c25t1d0   INTEL     SSDPE2MD800G4    745.21 GiB    no   yes
For your convenience, here is the qpair that had the issue:
> ffffd127ed67f2b0::print -t nvme_qpair_t
nvme_qpair_t {
    size_t nq_nentry = 0x400
    nvme_dma_t *nq_sqdma = 0xffffd127ed693dd0
    nvme_sqe_t *nq_sq = 0xffffd127ed840000
    uint_t nq_sqhead = 0xc
    uint_t nq_sqtail = 0xd
    uintptr_t nq_sqtdbl = 0x1028
    nvme_dma_t *nq_cqdma = 0xffffd127ed6936f0
    nvme_cqe_t *nq_cq = 0xffffd127ed82c000
    uint_t nq_cqhead = 0xc
    uint_t nq_cqtail = 0
    uintptr_t nq_cqhdbl = 0x102c
    nvme_cmd_t **nq_cmd = 0xffffd127ed582000
    uint16_t nq_next_cmd = 0xd
    uint_t nq_active_cmds = 0x1
    int nq_phase = 0
    kmutex_t nq_mutex = {
        void *[1] _opaque = [ 0 ]
    }
}
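Reading that dump under standard NVMe assumptions: nq_sqtail (0xd) is one ahead of nq_sqhead (0xc) and nq_active_cmds is 1, so exactly one command is outstanding, and its completion entry would be expected at CQ index nq_cqhead (0xc). The small sketch below is hypothetical helper code, not driver code; the 16-byte entry size and the location of the phase tag (bit 0 of the status field at offset 14) come from the NVMe specification. It just computes where that entry lives so it can be inspected in the dump.

#include <stdint.h>
#include <stdio.h>

#define NVME_CQE_SIZE   16      /* completion queue entry size per the NVMe spec */

int
main(void)
{
    /* Values taken from the nvme_qpair_t dump above. */
    uint64_t nq_cq     = 0xffffd127ed82c000ULL;  /* base of the completion queue */
    uint64_t nq_cqhead = 0xc;                    /* next entry the driver would consume */

    /* Address of the completion entry the driver should process next. */
    uint64_t entry = nq_cq + nq_cqhead * NVME_CQE_SIZE;

    /*
     * If the phase tag of this entry (bit 0 of the 16-bit status field at
     * offset 14) shows the device has already posted it, while the driver
     * still believes the queue is empty, that is the lost completion.
     */
    printf("CQ entry to inspect: 0x%llx\n", (unsigned long long)entry);
    return (0);
}

Under those assumptions, the entry of interest sits at 0xffffd127ed82c0c0 in the dump.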
Updated by Dan Fields about 6 years ago
- File NVME_BUS_CORRS.txt added
Not sure if it is related, but I have a P3700 that goes offline because of pciex errors. When I disable the diagnosis of these errors, the device stays online and continues to work. We are still testing.