Bug #7351

open

NVMe driver sporadically loses track of a completed I/O request, which leads to a zpool hang and machine panic.

Added by Youzhong Yang about 5 years ago. Updated almost 4 years ago.

Status: New
Priority: Normal
Assignee: -
Category: driver - device drivers
Start date: 2016-09-01
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

We tested the new set of NVMe fixes and encountered this blocking issue. It can be easily reproduced by running two zfs send/recv transfers of large datasets simultaneously.

In summary, the NVMe SSD indicates that an I/O has completed (as seen in the completion entry in the Completion Queue), but the driver never gets a chance to process it.
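To illustrate the state described above, here is a minimal sketch in C of how a posted but unconsumed completion entry can be recognized. It relies on the NVMe phase tag: the controller inverts the tag on each pass through the queue, so an entry at the driver's head index whose tag matches the expected phase is a completion the driver has not yet processed. The type and function names (`cqe_t`, `cq_t`, `cq_has_pending`, `cq_drain`) are hypothetical and simplified for illustration; they are not taken from the illumos nvme driver, and the struct layout assumes a little-endian host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified 16-byte completion queue entry (hypothetical names). */
typedef struct {
    uint32_t dw0;      /* command-specific result */
    uint32_t dw1;      /* reserved */
    uint16_t sq_head;  /* submission queue head pointer */
    uint16_t sq_id;    /* submission queue identifier */
    uint16_t cid;      /* command identifier of the completed request */
    uint16_t status;   /* bit 0 = phase tag, bits 15:1 = status field */
} cqe_t;

/* Hypothetical driver-side view of one completion queue. */
typedef struct {
    cqe_t   *entries;
    uint16_t depth;    /* number of slots in the ring */
    uint16_t head;     /* next slot the driver will examine */
    uint8_t  phase;    /* phase tag value that marks a new entry */
} cq_t;

/* True if the device has posted an entry the driver has not consumed. */
static bool cq_has_pending(const cq_t *cq)
{
    assert(cq->depth > 0);
    return (cq->entries[cq->head].status & 1) == cq->phase;
}

/* Consume up to max posted entries, recording their command IDs. */
static size_t cq_drain(cq_t *cq, uint16_t *cids, size_t max)
{
    size_t n = 0;

    while (n < max && cq_has_pending(cq)) {
        cids[n++] = cq->entries[cq->head].cid;
        if (++cq->head == cq->depth) {
            cq->head = 0;
            cq->phase ^= 1;  /* expected tag flips on wrap-around */
        }
    }
    return n;
}
```

In this model, the symptom reported here corresponds to `cq_has_pending()` being true for a queue while nothing ever calls `cq_drain()` on it, so the request that owns that command ID stays outstanding.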

Please see the attached text file for the analysis of two cases (we have a number of crash dumps, but unfortunately we cannot share them publicly or privately without going through a lengthy approval process).

By the way, our NVMe drives are as follows:

- 22 x Intel DC P3600 400GB NVMe PCIe 3.0, MLC 2.5" 20nm SSDPE2ME400G4

- 2 x Intel DC P3700 800GB NVMe PCIe 3.0 HET MLC 2.5" 20nm SSDPE2MD800G4


Files

zfs-deadman.txt (28.8 KB), Youzhong Yang, 2016-09-01 05:30 PM
NVME_BUS_CORRS.txt (15.9 KB), Dan Fields, 2017-10-19 12:02 PM