
Bug #7448

ZFS doesn't notice when disk vdevs have no write cache

Added by Youzhong Yang almost 3 years ago. Updated almost 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
driver - device drivers
Start date:
2016-10-03
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

I built a SmartOS image with all the NVMe commits, including 7372 (support for the NVMe volatile write cache), and repeated my dd test:

> #!/bin/bash
> for i in `seq 1 1000`; do
>     dd if=/dev/zero of=file00 bs=1M count=102400 oflag=sync &
>     dd if=/dev/zero of=file01 bs=1M count=102400 oflag=sync &
>     wait
>     rm file00 file01
> done

Previously each dd command took ~145 seconds to finish; now it takes ~400 seconds.

Eventually I figured out that it is 7372 that causes the unnecessary nvme_bd_sync() executions, which waste CPU cycles.

If an NVMe device doesn't support a write cache, the nvme_bd_sync function will return ENOTSUP to indicate this to upper layers.

It seems this return value is ignored by ZFS, so this bug is not really specific to NVMe. In vdev_disk_io_start() ZFS sends the flush to the disk driver (blkdev) with a callback to vdev_disk_ioctl_done(). Because the nvme driver fills in the bd_sync_cache function pointer, blkdev will not return ENOTSUP: as far as blkdev can tell, the driver supports cache flushes in general. Instead it issues an asynchronous flush to nvme and immediately returns 0, so ZFS does not set vdev_nowritecache here. The nvme driver will eventually process the cache flush command, and if the device has no write cache it will return ENOTSUP, which is delivered to the vdev_disk_ioctl_done() callback. That function does not check the error code and therefore never sets vdev_nowritecache either.

The right place to check the error code from the cache flush is in zio_vdev_io_assess(). That would catch both synchronous and asynchronous cache flushes, and would be independent of the implementation detail that some drivers can return ENOTSUP immediately.

Fixing this just in nvme has two problems: we would need a per-instance bd_ops vector so we could NULL out the bd_sync_cache function pointer when a particular device has no cache, and fixing it only in nvme wouldn't help any other disk drivers that may run into the same problem.


Files

zfs.diff (663 Bytes) zfs.diff Hans Rosenfeld, 2016-10-04 04:14 PM

History

#1

Updated by Hans Rosenfeld almost 3 years ago

If an NVMe device doesn't support a write cache, the nvme_bd_sync function will return ENOTSUP to indicate this to upper layers.

It seems this return value is ignored by ZFS:
In vdev_disk_io_start() ZFS sends the flush to blkdev with a callback to vdev_disk_ioctl_done(). In this case blkdev does an asynchronous flush and returns 0, so ZFS doesn't set vdev_nowritecache here. Later, vdev_disk_ioctl_done() does not check the error and also does not set vdev_nowritecache.

This should be fixed in ZFS.

#2

Updated by Hans Rosenfeld almost 3 years ago

  • Subject changed from nvme: performance regression when volatile write cache not present to ZFS doesn't notice when disk vdevs have no write cache
#3

Updated by Hans Rosenfeld almost 3 years ago

  • Description updated (diff)
#4

Updated by Hans Rosenfeld almost 3 years ago

  • Description updated (diff)
#5

Updated by Electric Monk almost 3 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

commit  295438ba3230419314faaa889a2616f561658bd5
Author: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Date:   2016-10-26T21:59:28.000Z

    7448 ZFS doesn't notice when disk vdevs have no write cache
    Reviewed by: Dan Fields <dan.fields@nexenta.com>
    Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
    Reviewed by: George Wilson <george.wilson@delphix.com>
    Approved by: Dan McDonald <danmcd@omniti.com>
