Feature #12940 (Closed): Add DKIOCFREE support to NVMe devices

Added by Jason King 10 months ago. Updated 10 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: driver - device drivers
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

With the impending integration of #12506, we will be able to add support for the DKIOCFREE ioctl to NVMe devices that support it. The NVMe spec states that the Dataset Management command with the deallocate attribute set should be treated as the equivalent of a SATA TRIM or SCSI UNMAP command.

To avoid confusion: there is also an NVMe Write Zeroes command, but that appears intended more for data-wiping purposes. While it could also be used, the spec explicitly states that the Dataset Management command with deallocate is the NVMe analogue of TRIM/UNMAP, so we will opt to use that on NVMe devices that advertise support for it.
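
For illustration, here is a minimal userland sketch of the consumer side of this interface: it asks a block device to free one extent via DKIOCFREE. The dkioc_free_list_t layout (dfl_flags, dfl_num_exts, dfl_offset, and an array of dkioc_free_list_ext_t extents with byte-based dfle_start/dfle_length), the DF_WAIT_SYNC flag, and the device path are assumptions based on my reading of sys/dkio.h and should be checked against the headers; this is not part of the actual change.

/*
 * Hedged sketch: free (TRIM/deallocate) a single extent on a block device
 * through the DKIOCFREE ioctl.  The structure layout and flag name below
 * are assumptions based on sys/dkio.h; the device path is made up.
 */
#include <sys/types.h>
#include <sys/dkio.h>
#include <stropts.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	int fd = open("/dev/rdsk/c6t589CFC20EB660001d0s0", O_RDWR);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	/* sizeof (dkioc_free_list_t) already has room for one extent. */
	dkioc_free_list_t *dfl = calloc(1, sizeof (dkioc_free_list_t));
	if (dfl == NULL) {
		perror("calloc");
		return (1);
	}

	dfl->dfl_flags = 0;		/* or DF_WAIT_SYNC to wait for completion */
	dfl->dfl_num_exts = 1;
	dfl->dfl_offset = 0;		/* added to each extent's dfle_start */
	dfl->dfl_exts[0].dfle_start = 16 * 1024 * 1024;	/* byte offset */
	dfl->dfl_exts[0].dfle_length = 1024 * 1024;	/* byte length */

	if (ioctl(fd, DKIOCFREE, dfl) != 0)
		perror("DKIOCFREE");

	free(dfl);
	(void) close(fd);
	return (0);
}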

Actions #1

Updated by Electric Monk 10 months ago

  • Gerrit CR set to 788
Actions #2

Updated by Jason King 10 months ago

For testing, I started with running the zfs trim tests on some NVMe devices (in bhyve):

root@pi:~# diskinfo
TYPE    DISK                    VID      PID              SIZE          RMV SSD
-       c2t0d0                  Virtio   Block Device       30.00 GiB   no  no
NVME    c6t589CFC20EB660001d0   NVMe     bhyve-NVMe         10.00 GiB   no  yes
NVME    c7t589CFC20EA260001d0   NVMe     bhyve-NVMe         10.00 GiB   no  yes
NVME    c8t589CFC202AE70001d0   NVMe     bhyve-NVMe         10.00 GiB   no  yes

ztest@pi:~$ export DISKS='c6t589CFC20EB660001d0 c7t589CFC20EA260001d0 c8t589CFC202AE70001d0'
ztest@pi:~$ zfstest -c /var/tmp/trim.run
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/setup (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_attach_detach_add_remove (run as root) [00:05] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_import_export (run as root) [00:05] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_multiple (run as root) [00:35] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_neg (run as root) [00:01] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_offline_export_import_online (run as root) [00:06] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_online_offline (run as root) [00:05] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_partial (run as root) [06:01] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_rate (run as root) [00:18] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_rate_neg (run as root) [00:01] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_secure (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_split (run as root) [00:14] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_start_and_cancel_neg (run as root) [00:02] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_start_and_cancel_pos (run as root) [00:02] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_suspend_resume (run as root) [00:09] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_unsupported_vdevs (run as root) [00:02] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_verify_checksums (run as root) [00:13] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_verify_trimmed (run as root) [00:09] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/cleanup (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/setup (run as root)    [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/autotrim_integrity (run as root) [01:06] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/autotrim_config (run as root) [01:23] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/autotrim_trim_integrity (run as root) [02:19] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/trim_integrity (run as root) [01:17] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/trim_config (run as root) [00:58] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/cleanup (run as root)  [00:00] [PASS]

Results Summary
PASS      26

Running Time:   00:15:22
Percent passed: 100.0%
Actions #3

Updated by Jason King 10 months ago

In addition to running the zfs tests on bhyve-emulated NVMe devices, I was able to locate a test machine (running SmartOS) with its primary zpool on mirrored physical NVMe devices. Running zpool trim zones trimmed approximately 800 GB from the devices. After the trim completed (confirmed with zpool status -t zones), a full scrub found no errors:

Prior to trim (nvmeadm list):

nvme9: model: SAMSUNG MZQLW1T9HMJP-00003, serial: S36WNX0J709307, FW rev: CXV8401Q, NVMe v1.2
  nvme9/1 (c2t1d0): Size = 1831420 MB, Capacity = 1831420 MB, Used = 935701 MB
nvme11: model: SAMSUNG MZQLW1T9HMJP-00003, serial: S36WNX0J709117, FW rev: CXV8401Q, NVMe v1.2
  nvme11/1 (c3t1d0): Size = 1831420 MB, Capacity = 1831420 MB, Used = 936458 MB

After trim:

nvme9: model: SAMSUNG MZQLW1T9HMJP-00003, serial: S36WNX0J709307, FW rev: CXV8401Q, NVMe v1.2
  nvme9/1 (c2t1d0): Size = 1831420 MB, Capacity = 1831420 MB, Used = 80295 MB
nvme11: model: SAMSUNG MZQLW1T9HMJP-00003, serial: S36WNX0J709117, FW rev: CXV8401Q, NVMe v1.2
  nvme11/1 (c3t1d0): Size = 1831420 MB, Capacity = 1831420 MB, Used = 80616 MB

Actions #4

Updated by Jason King 10 months ago

Additionally, I ran the zfs trim tests on an all-NVMe pool. Unfortunately, there were some failures, but these appear to be issues with the tests themselves:

ztest@nvme2:~$ /opt/zfs-tests/bin/zfstest -c /var/tmp/trim.run
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/setup (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_attach_detach_add_remove (run as root) [01:01] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_import_export (run as root) [00:04] [FAIL]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_multiple (run as root) [41:51] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_neg (run as root) [00:01] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_offline_export_import_online (run as root) [00:03] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_online_offline (run as root) [00:01] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_partial (run as root) [01:46] [FAIL]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_rate (run as root) [00:05] [FAIL]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_rate_neg (run as root) [00:01] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_secure (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_split (run as root) [00:04] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_start_and_cancel_neg (run as root) [00:01] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_start_and_cancel_pos (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_suspend_resume (run as root) [00:06] [FAIL]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_unsupported_vdevs (run as root) [00:02] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_verify_checksums (run as root) [00:12] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_verify_trimmed (run as root) [00:06] [FAIL]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/cleanup (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/setup (run as root)    [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/autotrim_integrity (run as root) [00:49] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/autotrim_config (run as root) [00:46] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/autotrim_trim_integrity (run as root) [01:11] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/trim_integrity (run as root) [00:45] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/trim_config (run as root) [00:27] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/cleanup (run as root)  [00:00] [PASS]

Results Summary
FAIL       5
PASS      21

Running Time:    00:49:34
Percent passed:    80.8%
Log directory:    /var/tmp/test_results/20200714T205927

Also, as it turns out, the TRIM tests are a mixture of tests that use actual disk devices and tests that create file-backed zpools. Obviously, the file-backed zpools are uninteresting for the NVMe testing.

The tests that utilize the disk devices are:

zpool_trim_attach_detach_add_remove
zpool_trim_multiple
zpool_trim_neg
zpool_trim_offline_export_import_online
zpool_trim_online_offline
zpool_trim_rate_neg
zpool_trim_split
zpool_trim_start_and_cancel_neg
zpool_trim_start_and_cancel_pos

All of these pass.

For completeness, the failures on the file-backed zpools appear to be largely timing bugs in the test scripts (full details below):

For zpool_trim_import_export:

SUCCESS: mkdir /var/tmp/testdir
SUCCESS: truncate -s 10G /var/tmp/testdir/largefile
SUCCESS: zpool create -f testpool /var/tmp/testdir/largefile
SUCCESS: zpool trim -r 256M testpool
SUCCESS: zpool export testpool
SUCCESS: zpool import -d /var/tmp/testdir testpool
SUCCESS: eval trim_prog_line testpool /var/tmp/testdir/largefile | grep suspended exited 1
cannot trim '/var/tmp/testdir/largefile': there is no active trim
ERROR: zpool trim -s testpool /var/tmp/testdir/largefile exited 255

In this instance, the trim is completing before the pool can be exported.

For zpool_trim_partial, the failing portion is:

ERROR: test 2694186496 -le 2254857830.4 exited 1

The actual statement is log_must test $new_size -le $(( 2.1 * LARGESIZE )); in this case the factor was roughly 2.5 instead of 2.1 (with LARGESIZE = 1073741824 bytes, 2694186496 / 1073741824 ≈ 2.51). The 2.1 factor appears to be somewhat arbitrary, a best guess based on the amount of space the metaslabs will consume. In this instance the metaslabs consumed a bit more space than expected; however, the amount of space used by the pool did decrease.

For zpool_trim_rate:

SUCCESS: zpool trim -r 200M testpool
SUCCESS: sleep 4
cannot trim '/var/tmp/testdir/largefile': there is no active trim
ERROR: zpool trim -s testpool exited 255

Again, it appears the trim completed before the test could check its status.

For zpool_trim_suspend_resume:

SUCCESS: mkdir /var/tmp/testdir
SUCCESS: truncate -s 10G /var/tmp/testdir/largefile
SUCCESS: zpool create -f testpool /var/tmp/testdir/largefile
SUCCESS: zpool trim -r 256M testpool
cannot trim '/var/tmp/testdir/largefile': there is no active trim
ERROR: zpool trim -s testpool exited 255
NOTE: Performing test-fail callback (/opt/zfs-tests/callbacks/zfs_dbgmsg)

Again, the trim completed before the script could check its status.

For zpool_trim_verify_trimmed:

SUCCESS: set_tunable32 zfs_trim_extent_bytes_min 4096
SUCCESS: mkdir /var/tmp/testdir
SUCCESS: truncate -s 1073741824 /var/tmp/testdir/largefile
SUCCESS: zpool create testpool /var/tmp/testdir/largefile
SUCCESS: zpool initialize testpool
SUCCESS: test 856241664 -gt 8388608
SUCCESS: zpool trim testpool
NOTE: Checking if 856241664 is within +/-134217728 of 512 (diff: 856241152)

ERROR: within_tolerance 856241664 512 134217728 exited 1

This one is more interesting: it appears that the trim didn't release any space. However, since this is a file-backed zpool, the TRIM zio is passed to vdev_file_io_start, which issues the F_FREESP fcntl (technically the VOP_SPACE vnode method), which in turn calls zfs_space. Running dtrace -n 'zfs_space:entry {print (*args[2]) }' while running this test shows that it is indeed being called, with reasonable values for l_start and l_len:

 69  55931                  zfs_space:entry flock64_t {
    short l_type = 0xb
    short l_whence = 0
    off64_t l_start = 0x45f400
    off64_t l_len = 0x1200
    int l_sysid = 0
    pid_t l_pid = 0
    long [4] l_pad = [ 0, 0, 0, 0 ]
}
 69  55931                  zfs_space:entry flock64_t {
    short l_type = 0xb
    short l_whence = 0
    off64_t l_start = 0x474000
    off64_t l_len = 0x3f8c000
    int l_sysid = 0
    pid_t l_pid = 0
    long [4] l_pad = [ 0, 0, 0, 0 ]
}
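
For reference, the F_FREESP path that vdev_file_io_start takes can also be exercised directly from userland. Below is a minimal sketch of punching a hole in a file with fcntl(F_FREESP); the file path and extent are made up, and it assumes ZFS accepts a non-zero l_len here, as the VOP_SPACE arguments above suggest.

/*
 * Hedged sketch: punch a hole in a file with fcntl(F_FREESP), the same
 * VOP_SPACE/zfs_space path a TRIM zio takes on a file-backed vdev.  The
 * path and extent below are made up for illustration.
 */
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	int fd = open("/var/tmp/testdir/largefile", O_RDWR);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	struct flock fl;
	(void) memset(&fl, 0, sizeof (fl));
	fl.l_whence = SEEK_SET;		/* l_start is an absolute offset */
	fl.l_start = 0x474000;		/* mirrors the second zfs_space call above */
	fl.l_len = 0x3f8c000;		/* bytes to free; 0 would mean "to EOF" */

	/* Ask the filesystem to deallocate the range. */
	if (fcntl(fd, F_FREESP, &fl) != 0)
		perror("F_FREESP");

	(void) close(fd);
	return (0);
}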

After a brief chat with Matt Ahrens, I believe what is happening is that the hole punching that occurs during this test doesn't become visible until the TXG is synced. On a physical machine, this happens quickly enough that du is run before the TXG can sync, while in a VM there is often enough scheduling latency and noisiness that the TXG has synced by the time the size of the backing file is checked (via du). I tested this by adding a call to 'sync' between the completion of the 'zpool trim' command and the subsequent du, and the test passes consistently after that (I'll file a bug for this).

Actions #5

Updated by Electric Monk 10 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 2ba19bafbe44c6a57d09e79cc5e11875088875bf

commit  2ba19bafbe44c6a57d09e79cc5e11875088875bf
Author: Jason King <jason.king@joyent.com>
Date:   2020-07-15T19:03:19.000Z

    12940 Add DKIOCFREE support to NVMe devices
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
