Project

General

Profile

Actions

Feature #12506

closed

Add support to vioblk for DISCARD operation

Added by Jason King over 3 years ago. Updated about 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
driver - device drivers
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

The discard operation is the vioblk equivalent of UNMAP or TRIM. While it was added in newer versions of the virtio spec, the newer versions allow hypervisors to advertise these features to legacy devices as well (other OSes take advantage of this for their equivalent support), so we can add support for this without also having to update the whole virtio layer to 1.1

One complication is that a host can present restrictions on DISCARD requests such as a stricter alignment than the device block size for the starting offset (e.g. starting offsets on 4k boundaries on a device with a 512byte block size), limits on the number of blocks that can be discarded in a single request, as well as limits on the number of extents it will accept in a single request. There is currently no mechanism for informing modules issuing DISCARD requests (e.g. zfs) of these requirements. To address this, the blkdev driver will be extended to accept DKIOCFREE ioctls, as well as allow drivers (e.g. vioblk) to inform blkdev of any restrictions on free requests. The blkdev layer will then align, split, and segment any requests that do not conform to the underlying driver's requirements (a blkdev using driver can opt out of this by not informing blkdev of any restrictions and then rejecting requests itself). The hope is this will make it simpler to extend DKIOCFREE support to other drivers utilizing blkdev (e.g. nvme) which may share similar concerns.

One area that is arguably not as robust as we might like is in error handling around DKIOCFREE requests. Today, DKIOCFREE calls are generally asynchronous and possibly queued by the driver, so any errors in processing the request aren't always guaranteed to be returned to the caller (the ioctl may return before the driver processes the request). Even when DF_WAIT_SYNC is used to wait until the operation is complete, if multiple extents are passed in a single request, there is no way to report a problem with a specific extent, nor is there any documented expectations on what one could assume on an error -- e.g. if a requests includes multiple extents, but one of the extents causes some error, nothing can be inferred about the state of the other extents in the request (completed, not completed, etc). It's also not clear if a better approach is even possible that would be compatible with the error semantics of all of the underlying protocols (scsi, sata, nvme, etc.) while presenting a single interface to consumers such as zfs -- if it does turn out to be possible, such support would be worth of it's own issue, and is not something we propose to solve in this issue -- only that the changes do not make the situation any worse.


Related issues

Related to illumos gate - Bug #12924: blkdev needs to be better at handling attach failuresClosedJason King

Actions
Related to illumos gate - Bug #14897: blkdev discard support is needlessly complexNew

Actions
Actions #1

Updated by Jason King over 3 years ago

For testing (so far), I've run the zfs trim tests which all pass with the exception of zpool_trim_verify_trimmed. That test creates a file based zpool and performs trim operations on that file-based zpool. The error is that the amount of 'sparseness' on that pool is not as large as it expects. Since it's not performing the test against a device (let along a vioblk device), it does not seem relevant.

In addition, I've been creating random files of various sizes, deleting them, overwriting portions of them, etc. on a zpool on top of a vioblk device and then performing a zpool trim and zpool scrub. No errors have been found so far.

Also, I've been running dtrace scripts on the host side looking at DKIOCFREE ioctls to zvols:

#!/usr/sbin/dtrace -s

zvol_ioctl:entry
/arg1 == 1074/
{
    d = (dkioc_free_list_t *)copyin(arg2, sizeof (dkioc_free_list_t));
    e = &d->dfl_exts[0];

      printf("start = 0x%llx length = %llu offset = 0x%llx n = %llu\n",
        e->dfle_start, (unsigned long long)e->dfle_length, d->dfl_offset,
        (unsigned long long)d->dfl_num_exts);
}

The test VM is running under bhyve, and bhyve only supports a single extent, so this was enough to see the whole request. I then used a small program (attached) to generate DKIOCFREE requests with various sized extents, different number of extents in a request, etc. while observing the resulting requests on the host side -- verifying that a large request was being split, multiple extents in a request were broken up into multiple requests, etc.

Actions #2

Updated by Dan McDonald about 3 years ago

  • Related to Bug #12924: blkdev needs to be better at handling attach failures added
Actions #3

Updated by Jason King about 3 years ago

Additionally, since zfs is the major consumer of the DKIOCFREE ioctl in illumos, I ran the zfs TRIM tests from the zfs test suite against 3 vioblk test devices which all passed:

ztest@pi:~$ zfstest -c /var/tmp/trim.run
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/setup (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_attach_detach_add_remove (run as root) [01:02] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_import_export (run as root) [00:05] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_multiple (run as root) [00:18] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_neg (run as root) [00:01] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_offline_export_import_online (run as root) [00:03] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_online_offline (run as root) [00:02] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_partial (run as root) [03:20] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_rate (run as root) [00:18] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_rate_neg (run as root) [00:01] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_secure (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_split (run as root) [00:11] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_start_and_cancel_neg (run as root) [00:01] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_start_and_cancel_pos (run as root) [00:02] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_suspend_resume (run as root) [00:09] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_unsupported_vdevs (run as root) [00:02] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_verify_checksums (run as root) [00:12] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/zpool_trim_verify_trimmed (run as root) [00:07] [PASS]
Test: /opt/zfs-tests/tests/functional/cli_root/zpool_trim/cleanup (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/setup (run as root)    [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/autotrim_integrity (run as root) [00:58] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/autotrim_config (run as root) [01:08] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/autotrim_trim_integrity (run as root) [02:12] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/trim_integrity (run as root) [01:14] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/trim_config (run as root) [00:52] [PASS]
Test: /opt/zfs-tests/tests/functional/trim/cleanup (run as root)  [00:00] [PASS]

Results Summary
PASS      26

Running Time:   00:12:29
Percent passed: 100.0%
Actions #4

Updated by Jason King about 3 years ago

Additionally, I wrote the following utility to issue DKIOCFREE commands against block devices to assist with testing:

/*
 * This file and its contents are supplied under the terms of the
 * Common Development and Distribution License ("CDDL"), version 1.0.
 * You may only use this file in accordance with the terms of version
 * 1.0 of the CDDL.
 *
 * A full copy of the text of the CDDL should have accompanied this
 * source.  A copy of the CDDL is also available via the Internet at
 * http://www.illumos.org/license/CDDL.
 */

/*
 * Copyright 2020 Joyent, Inc.
 */

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <sys/types.h>
#include <sys/dkio.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <err.h>
#include <unistd.h>

static uint64_t
get_size(const char *dev, int fd)
{
        struct dk_minfo dkm = { 0 };
        int bit = 0;

        if (ioctl(fd, DKIOCGMEDIAINFO, &dkm) < 0)
                err(EXIT_FAILURE, "ioctl(%s, DKIOCGMEDIAINFO) failed", dev);

        if (dkm.dki_lbsize > 0)
                bit = ffs((int)dkm.dki_lbsize) - 1;

        return ((uint64_t)dkm.dki_capacity << (uint_t)bit);
}

static uint64_t
get_num(const char *str)
{
        uint64_t val;

        errno = 0;
        val = strtoull(str, NULL, 0);
        if (errno != 0)
                err(EXIT_FAILURE, "could not parse '%s'", str);
        return (val);
}

static void
usage(void)
{
        fprintf(stderr,
            "Usage: %s [-aw] dev\n" 
            "       %s [-w] dev [offset length] [offset length...]\n",
            getprogname(), getprogname());
        exit(2);
}

static dkioc_free_list_t *
make_list(int argc, char **argv)
{
        dkioc_free_list_t *dfl = NULL;

        if (argc % 2 == 1)
                errx(EXIT_FAILURE, "odd number of parameters given");

        if ((dfl = calloc(1, DFL_SZ(argc / 2))) == NULL)
                err(EXIT_FAILURE, "no memory");

        dfl->dfl_num_exts = argc / 2;
        for (size_t i = 0; i < argc / 2; i++) {
                dfl->dfl_exts[i].dfle_start = get_num(argv[i * 2]);
                dfl->dfl_exts[i].dfle_length = get_num(argv[(i * 2) + 1]);
        }

        return (dfl);
}

static dkioc_free_list_t *
make_all(const char *dev, int fd)
{
        dkioc_free_list_t *dfl = NULL;
        uint64_t len = get_size(dev, fd);

        if ((dfl = calloc(1, DFL_SZ(1))) == NULL)
                err(EXIT_FAILURE, "no memory");

        dfl->dfl_num_exts = 1;
        dfl->dfl_exts[0].dfle_start = 0;
        dfl->dfl_exts[0].dfle_length = len;

        return (dfl);
}

int
main(int argc, char **argv)
{
        const char *devname = NULL;
        dkioc_free_list_t *dfl;
        boolean_t all = B_FALSE;
        boolean_t wait = B_FALSE;
        int fd = -1;
        int c;

        while ((c = getopt(argc, argv, "aw")) != -1) {
                switch (c) {
                case 'a':
                        all = B_TRUE;
                        break;
                case 'w':
                        wait = B_TRUE;
                        break;
                case '?':
                        fprintf(stderr, "Invalid option -%c\n", optopt);
                        usage();
                }
        }

        if (argc == optind)
                usage();

        devname = argv[optind];
        argc -= optind + 1;
        argv += optind + 1;

        if ((fd = open(devname, O_RDWR)) < 0)
                err(EXIT_FAILURE, "open %s failed", devname);

        dfl = all ? make_all(devname, fd) : make_list(argc, argv);

        printf("%s:\n", devname);
        for (size_t i = 0; i < dfl->dfl_num_exts; i++) {
                printf("    [%zu] start = %llu length = %llu\n", i,
                    dfl->dfl_exts[i].dfle_start,
                    dfl->dfl_exts[i].dfle_length);
        }

        if (wait)
                dfl->dfl_flags |= DF_WAIT_SYNC;

        if (ioctl(fd, DKIOCFREE, dfl) < 0)
                err(EXIT_FAILURE, "ioctl(%s, DKIOCFREE) failed", devname);

        (void) close(fd);
        return (0);
}
Actions #5

Updated by Jason King about 3 years ago

Besides the zfs TRIM tests, I also used the above C program to manually issue DKIOCFREE ioctls, while using the D script in the hypervisor (e.g. GZ).

The setup is an OmniOSCE VM with the changes and a vioblk device running under SmartOS (bhyve) with the vioblk device backed by a zvol on the host. Since there's several layers in play here between the guest and the hypervisor, it might be useful to describe the expected flow in a bit of detail. Guest DKIOCFREE ioctls (which we can generate via the above C program) get expanded by the guest blkdev driver into 0 or more dkioc_free_list_t@s. The @dfl_offset value may be altered (increased) to account for any partitioning of the vioblk device in the guest (the starting LBA of the partition is translated into a byte offset and added to any existing dfl_offset value passed to the guest blkdev driver). The guest vioblk driver then generates 0 or more DISCARD requests, using the LBA and length values passed to it.

Note that at this point any non-zero value of dfl_offset is added to each dfle_start value (and then converted to an LBA) to produce a device LBA (the vioblk DISCARD request doesn't provide a method to specify a global offset value like the DKIOCFREE ioctl or the SCSI UNMAP commands do). This means the start address seen by the hypervisor may be different than the one in the original request -- partitioning in the guest may mean LBA 0 of a slice resides at LBA 0x2000 of the device (for example), and the guest vioblk driver will translate the LBA as needed (by adding dfl_offset) so the host sees a start address of 0x2000 (to use the same example).

The host will then take all the start, length extents (currently bhyve only supports 1 extent per discard request), and create it's own DKIOCFREE ioctls for the backing zvols. Since these are whole devices (no partitioning by the hypervisor), dfl_offset is 0 in these DKIOCFREE requests.

With all of that, testing several scenarios:

Single block, aligned start address

Guest:

root@pi:/ws# ./zvol-free /dev/rdsk/c3t0d0s0 0 512
/dev/rdsk/c3t0d0s0:
    [0] start = 0 length = 512

Host:
  3  61416                 zvol_ioctl:entry start = 0x20000 length = 512 offset = 0x0 n = 1

Multiple full blocks, aligned start address

Guest:

root@pi:/ws# ./zvol-free /dev/rdsk/c3t0d0s0 1024 10240
/dev/rdsk/c3t0d0s0:
    [0] start = 1024 length = 10240

Host:
  1  61416                 zvol_ioctl:entry start = 0x20400 length = 10240 offset = 0x0 n = 1

Multiple full blocks, larger than the max vioblk DISCARD size (16MB for bhyve)

Guest (freeing 100MiB):

root@pi:/ws# ./zvol-free /dev/rdsk/c3t0d0s0 1024 104857600
/dev/rdsk/c3t0d0s0:
    [0] start = 1024 length = 104857600

Host:
  0  61416                 zvol_ioctl:entry start = 0x20400 length = 16777216 offset = 0x0 n = 1

 11  61416                 zvol_ioctl:entry start = 0x1020400 length = 16777216 offset = 0x0 n = 1

  0  61416                 zvol_ioctl:entry start = 0x2020400 length = 16777216 offset = 0x0 n = 1

 12  61416                 zvol_ioctl:entry start = 0x3020400 length = 16777216 offset = 0x0 n = 1

  3  61416                 zvol_ioctl:entry start = 0x4020400 length = 16777216 offset = 0x0 n = 1

  0  61416                 zvol_ioctl:entry start = 0x5020400 length = 16777216 offset = 0x0 n = 1

  3  61416                 zvol_ioctl:entry start = 0x6020400 length = 4194304 offset = 0x0 n = 1

(16777216 * 6 + 4194304 = 104857600)

Single block, starting address unaligned

Guest:

root@pi:/ws# ./zvol-free /dev/rdsk/c3t0d0s0 513 512
/dev/rdsk/c3t0d0s0:
    [0] start = 513 length = 512

No request to host (expected -- there isn't a subset of the byte range that can be represented as LBAs)

Multiple blocks, starting address unaligned

Guest:

root@pi:/ws# ./zvol-free /dev/rdsk/c3t0d0s0 513 1024
/dev/rdsk/c3t0d0s0:
    [0] start = 513 length = 1024

Host:
  1  61416                 zvol_ioctl:entry start = 0x20400 length = 512 offset = 0x0 n = 1

Multiple blocks, unaligned start address (but a bit larger)

Guest:

root@pi:/ws# ./zvol-free /dev/rdsk/c3t0d0s0 513 10240
/dev/rdsk/c3t0d0s0:
    [0] start = 513 length = 10240

Host:

 12  61416                 zvol_ioctl:entry start = 0x20400 length = 9728 offset = 0x0 n = 1

10240 - 512 = 9728, so this matches expectations

Aligned start address, unaligned length

Guest:

root@pi:/ws# ./zvol-free /dev/rdsk/c3t0d0s0 512 513
/dev/rdsk/c3t0d0s0:
    [0] start = 512 length = 513

Host:

 12  61416                 zvol_ioctl:entry start = 0x20200 length = 512 offset = 0x0 n = 1

Aligned start, unaligned length (multiple blocks)

Guest:

root@pi:/ws# ./zvol-free /dev/rdsk/c3t0d0s0 512 10239
/dev/rdsk/c3t0d0s0:
    [0] start = 512 length = 10239

Host:

 13  61416                 zvol_ioctl:entry start = 0x20200 length = 9728 offset = 0x0 n = 1

Unaligned start address, unaligned length

Guest:

root@pi:/ws# ./zvol-free /dev/rdsk/c3t0d0s0 513 10239
/dev/rdsk/c3t0d0s0:
    [0] start = 513 length = 10239

Host:

 14  61416                 zvol_ioctl:entry start = 0x20400 length = 9728 offset = 0x0 n = 1

Unaligned start, unaligned length (larger than the max for a single vioblk DISCARD request):

Guest:

root@pi:/ws# ./zvol-free /dev/rdsk/c3t0d0s0 1025 104857601
/dev/rdsk/c3t0d0s0:
    [0] start = 1025 length = 104857601

Host:

  5  61416                 zvol_ioctl:entry start = 0x20600 length = 16777216 offset = 0x0 n = 1

  4  61416                 zvol_ioctl:entry start = 0x1020600 length = 16777216 offset = 0x0 n = 1

 12  61416                 zvol_ioctl:entry start = 0x2020600 length = 16777216 offset = 0x0 n = 1

  5  61416                 zvol_ioctl:entry start = 0x3020600 length = 16777216 offset = 0x0 n = 1

 15  61416                 zvol_ioctl:entry start = 0x4020600 length = 16777216 offset = 0x0 n = 1

  4  61416                 zvol_ioctl:entry start = 0x5020600 length = 16777216 offset = 0x0 n = 1

  6  61416                 zvol_ioctl:entry start = 0x6020600 length = 4193792 offset = 0x0 n = 1

4194304-4193792=512, so also checks out.

Invalid length (too large)

Guest:

root@pi:/ws# ./zvol-free /dev/rdsk/c3t0d0s0 1025 99999999999999
/dev/rdsk/c3t0d0s0:
    [0] start = 1025 length = 99999999999999
zvol-free: ioctl(/dev/rdsk/c3t0d0s0, DKIOCFREE) failed: Invalid argument

(No request seen on host as expected).

Actions #6

Updated by Electric Monk about 3 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 1a5ae140ba142cafb59ab08b3212c4ebbce84f32

commit  1a5ae140ba142cafb59ab08b3212c4ebbce84f32
Author: Jason King <jason.king@joyent.com>
Date:   2020-07-09T22:01:46.000Z

    12506 Add support to vioblk for DISCARD operation
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Reviewed by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
    Approved by: Dan McDonald <danmcd@joyent.com>

Actions #7

Updated by Joshua M. Clulow about 1 year ago

  • Related to Bug #14897: blkdev discard support is needlessly complex added
Actions #8

Updated by Joshua M. Clulow about 1 year ago

Note: Review for this one is still available at: https://illumos.org/rb/r/2534/

Actions

Also available in: Atom PDF