Bug #14897


blkdev discard support is needlessly complex

Added by Garrett D'Amore almost 2 years ago. Updated almost 2 years ago.


When the initial implementation of discard support was introduced into blkdev, it seems that the project missed one of the principal goals of blkdev, which is to be as simple as reasonably possible, and to shield the driver author from having to deal with the details of DKIO ioctls.

The API to drivers should be as simple as:

  • maximum free size (how many blocks can be freed)
  • free granularity (perhaps this could be elided and media->m_pblksize could simply be presumed -- it is probably accurate!)

The free request should simply take the blk address and count as we do for other transfers, and support only contiguous free requests. There should not be any need to expose the dkioc_free_list or anything like that to drivers.

Blkdev can break large requests up as needed -- it already does this for large DMA windows that need to be passed via multiple windows.
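As a rough userland sketch of that splitting (all names here -- bd_free_xfer_t, bd_free_next, the max_free limit -- are illustrative stand-ins, not the real blkdev(4D) definitions), blkdev could bound each sub-request to the drive's advertised maximum and iterate, much as it already walks multiple DMA windows for large transfers:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified free request: just a block address and count,
 * exactly as for other transfers. */
typedef struct {
	uint64_t x_blkno;	/* starting block */
	uint64_t x_nblks;	/* contiguous block count to free */
} bd_free_xfer_t;

/*
 * blkdev-side splitting: given one contiguous free request and the
 * drive's advertised maximum, produce one bounded sub-request at a
 * time.  Returns 1 while there is another sub-request, 0 when done.
 */
static int
bd_free_next(uint64_t blkno, uint64_t nblks, uint64_t max_free,
    uint64_t *done, bd_free_xfer_t *out)
{
	if (*done >= nblks)
		return (0);	/* request fully issued */
	out->x_blkno = blkno + *done;
	out->x_nblks = nblks - *done;
	if (out->x_nblks > max_free)
		out->x_nblks = max_free;
	*done += out->x_nblks;
	return (1);
}
```

With this shape, the driver only ever sees requests no larger than its stated maximum, and never needs to know that the original request was bigger.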

While some devices have a lot more flexibility, the goal of blkdev is not to answer everyone's every possible desire, but to be a simple common layer for the most often needed functionality. If a device class needs to express a lot more complexity, then that device class should consider doing so via a full block driver using cmlb and company.

My instinct is that the current consumer of this, vioscsi, has no such need for great flexibility, and would be more than satisfied with just the simpler interface I've described here.

Related issues

Related to illumos gate - Feature #12506: Add support to vioblk for DISCARD operation (Closed, Jason King)

Related to illumos gate - Feature #12940: Add DKIOCFREE support to NVMe devices (Closed, Jason King)

Actions #1

Updated by Joshua M. Clulow almost 2 years ago

You say that vioscsi is a consumer of blkdev, but I thought it was a SCSI HBA that would make use of sd?

It seems that both of the drivers which provide .o_free_space to blkdev today do support (and should continue to support) amalgamating lists of non-contiguous discards into a single request; viz., vioblk_map_discard() for vioblk, and nvme_fill_ranges() for nvme.

Actions #2

Updated by Garrett D'Amore almost 2 years ago

I meant `vioblk` ... my bad. Sorry.

Actions #3

Updated by Garrett D'Amore almost 2 years ago

I see that nvme does do non-contiguous regions.

I view that as a design mistake.

blkdev should have been doing this work, if needed. To be honest, I'm not entirely sure why we need non-contiguous regions at all, except perhaps to accommodate freeing up data for files that might be scattered around the drive (fragmented).

I suppose there may be some performance reasons to present the entire list at once, rather than as separate commands, but nonetheless exposing this complexity to blkdev nexus drivers was very much not something I wanted to see when I designed blkdev originally.

The goal is to make it easy to write simple drivers, not to express every possible nuance.

Having said all that, I wonder how much difference it really makes to express these operations as a single scatter operation instead of individual operations for each contiguous region, particularly given the very deep queues that NVMe devices tend to have.

Actions #4

Updated by Garrett D'Amore almost 2 years ago

For the record, what threw me (and caused me to miss nvme), is that nvme doesn't assign any values to the d_free_align, d_max_free_seg, d_max_free_blks, or d_max_free_seg_blks in the drive info.

Those fields probably were amongst the ones that I had the most visceral gut reaction of "ewwww" to, but honestly the dkioc_free_list stuff is pretty awful too.

Actions #5

Updated by Joshua M. Clulow almost 2 years ago

  • Related to Feature #12506: Add support to vioblk for DISCARD operation added
  • Related to Feature #12940: Add DKIOCFREE support to NVMe devices added

Actions #6

Updated by Garrett D'Amore almost 2 years ago

Thinking about this a bit more, I think if we were to consolidate this somewhat we might be able to dispense with the fields needed to convey all of this information to the device driver. There are a few salient points:

1. If we eliminate multiple segments, that solves a huge amount of complexity, and of required tuning. The main benefit of multiple segments is to support freeing space on fragmented disks, and it may go slightly faster, but I strongly suspect that the savings for discontiguous regions is going to be limited to the dispatch overhead. For devices with parallel queues, we can even dispatch separate regions to separate queues, and might therefore see a net improvement(!) over multiple segments.

2. Linux does not use multiple segments when issuing UNMAP or TRIM. So that's a compelling piece of evidence that performance doesn't require it. (This makes sense as most devices with UNMAP or TRIM are going to have deep queues anyway. See above.)

3. That leaves the question of limits on the segment size and alignment. The following limits apply:

  • alignment
  • minimum size (granularity) to free -- probably the same as alignment (and probably 4K!)
  • maximum size to free (maybe because of latency?)

Looking at "real" devices, it seems like the alignment requirement is likely to be linked to the size of underlying storage cells. So maybe 4K, maybe bigger. Would be bigger for devices that use large stripe widths (RAID devices) I guess.

The same holds true for the minimum size.

The maximum size is probably not interesting, for the reason described below.

4. ZFS is going to use large block sizes for reclaiming anyway -- probably larger than the underlying device's.

5. Presently, the API does not provide any way for ZFS (or any other consumer) to learn what the limitations are. If ZFS issues a request that cannot be satisfied, EINVAL is returned.

6. UNMAP is "advisory". It's nice to be able to discard space, but it's not critical if this fails. The main benefit of using it is two-fold: a) to facilitate overprovisioning (a questionable but common practice), and b) to help some devices without sufficient headroom perform better when needing to reallocate. A failure to UNMAP should never result in a failure to be able to read or write data, nor should it cause storage space to be consumed without bound. (Consider a device that is used and rewritten and never gets to see an UNMAP.)

So I think the better solution is to shove the complexity that vioblk is requiring mostly into vioblk. Specifically what I'd propose is this:

  • blkdev breaks dkiocfree requests up, issuing a single request per contiguous extent.
  • blkdev has no knowledge (nor needs any) of restrictions on alignment, size, etc.
  • consuming device can look at the request:
    - if the request cannot be satisfied (alignment or other concerns), it should just fail EINVAL
    - if the request is too big, it should just chain requests (i.e. the interrupt handler can submit requests iteratively)
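The first bullet -- blkdev issuing one simple request per contiguous extent -- can be modeled in userland C roughly as follows (the extent struct, the callback type, and all names here are stand-ins for illustration, not the real dkio or blkdev definitions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a DKIOCFREE byte-range extent. */
typedef struct {
	uint64_t fe_start;	/* byte offset */
	uint64_t fe_length;	/* byte length */
} free_ext_t;

/* Stand-in for the simple per-extent driver entry point. */
typedef int (*issue_fn)(uint64_t blkno, uint64_t nblks, void *arg);

/*
 * Walk the list, narrow each extent to block boundaries, and issue one
 * plain (blkno, nblks) request per contiguous extent.  Extents that no
 * longer cover a whole block are elided (advisory semantics).
 */
static int
bd_free_list(const free_ext_t *ext, size_t next, uint32_t blksz,
    issue_fn issue, void *arg)
{
	for (size_t i = 0; i < next; i++) {
		uint64_t first = (ext[i].fe_start + blksz - 1) / blksz;
		uint64_t last = (ext[i].fe_start + ext[i].fe_length) / blksz;
		if (last <= first)
			continue;
		int r = issue(first, last - first, arg);
		if (r != 0)
			return (r);	/* e.g. EINVAL from the driver */
	}
	return (0);
}

/* Demo callback: count how many sub-requests would reach the driver. */
static int
count_issue(uint64_t blkno, uint64_t nblks, void *arg)
{
	(void)blkno;
	(void)nblks;
	(*(int *)arg)++;
	return (0);
}
```

The driver-facing side of this never sees a list at all: each call is a single contiguous range it can accept, fail, or (if too big) chain.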

This can lead to a fairly trivial version of the free_space entry point (which at this point I think I'd rename to just "o_discard"):

  • uses the x_blkno and x_nblks members of the xfer_t.
  • does not express any limitations on alignment or sizing
  • drivers that need to break the request up iteratively can do so, only calling bd_xfer_done once the entire request is satisfied

This would make for a much nicer/easier to use API, without really losing any tangible benefits from the current design.

Actions #7

Updated by Jason King almost 2 years ago

The complexity was to preserve the current semantics of the DKIOCFREE ioctl (which itself were derived from the SCSI UNMAP command) without having to write a lot of almost, but not quite, identical code in each driver (NVMe, vioblk, ...). There was also a thought it could (in the future) be used to simplify sd.c a bit as well, since it'd mean there could be one bit of common code to handle all the necessary segmentation/alignment requirements for the various protocols.

If we want to simplify this, then we probably need to consider the implications for the DKIOCFREE ioctl. If we change it for blkdev devices, it does mean they'll have different semantics for the ioctl than SCSI devices. That may not be a problem -- it is a 'best effort' attempt, and the ioctl is (AFAIK) still considered 'private', but probably something that we want to make sure we're clear about.

I will also note that NVMe 2.0 at least (I can't find the similar info in 1.2 or 1.3, but I haven't gone through the entire change history of the NVMe spec) does introduce the ability for a controller to advertise limits for the maximum number of ranges, the maximum number of blocks in a single range, and the maximum total number of blocks (summed across all ranges) in a single dataset management request. I don't think we've updated our NVMe support to 2.0 just yet, but that's probably coming in the future. We could certainly add APIs to allow consumers to discover this sort of information, but then we'd really just be pushing the complexity off to the consumers when we could handle it for them.

I'd also be cautious about making too many assumptions about how a controller will process TRIM requests. A single request with multiple ranges may allow the controller to better optimize the TRIM operation than multiple individual TRIM requests (even when submitted in parallel) since it might allow for coalescing things depending on how blocks are mapped.

The Windows TRIM API (at least the one I found) looks more like our DKIOCFREE ioctl (supporting multiple ranges in a single request) than the Linux or FreeBSD requests, so there's at least some precedent, on a platform arguably as popular as Linux, for submitting more than one range at a time. There were also proposals to have Linux aggregate multiple queued block discard requests into a smaller number of TRIM/UNMAP/... requests to take advantage of submitting multiple ranges in a single command (however, it's unclear if this work was ever submitted or integrated into Linux).

Actions #8

Updated by Garrett D'Amore almost 2 years ago

Perhaps I wasn't being sufficiently clear.

On the one hand, I don't love the DKIO ioctl for this, and I think designing it around SCSI UNMAP was likely a design mistake. (Not the first time that software / driver folks have made similar mistakes trying to expose hardware capabilities directly, btw.)

But on the other hand, and the one relevant to this discussion, I think that the complexity of handling that ioctl can almost entirely be handled in blkdev, without asking drivers to deal with much of it unless they really, really want to.

In particular, blkdev can handle breaking a single dkio request up into multiple simple requests for the bottom layer.

The only thing the bottom layer needs to do is "ignore" (or fail) requests that don't fit its capabilities, and possibly break up a single contiguous request if it has some limit on the size of the contiguous ranges it can support. (Today it does not appear nvme needs that.)

Using a single command for multiple contiguous ranges almost certainly has very tiny benefits, and greatly complicates life for driver developers.

The purpose of blkdev is to free driver developers from exactly this sort of complexity. (After all, before blkdev they could have just implemented the ioctls directly themselves, and used cmlb to help with labeling.)

Actions #9

Updated by Garrett D'Amore almost 2 years ago

More to the point, I'm not proposing that we change the DKIOCFREE ioctl.

What I'm proposing is that we not pass the complexity of handling it down to leaf drivers, and instead handle the complexity completely in the framework (with the possible exception of deferring the problem of limits on the contiguous range to drivers, which seems unlikely to be a significant problem).

Although I'd also be amenable to a max_discard member along the lines of max_xfer, and then handling any breakups in blkdev. That only makes sense if we think these range limits are going to be more prevalent than they are today though.

Actions #10

Updated by Jason King almost 2 years ago

But the proposal does change the behavior of the ioctl for blkdev devices compared to sd. Since it's still a private ioctl, we can probably hand-wave away the difference and just say no guarantees are made wrt what the requests to the device look like compared to the input. It would be nice to have that noted somewhere to help people in the future.

The assumption behind the original work is that research has shown that using TRIM improves write performance an appreciable amount -- even on enterprise SSDs (despite how one may feel about that). Since the controller is somewhat of a 'black box', giving the controller all the info we can (in terms of used/unused blocks) and letting the controller decide what to use/what to ignore was thought to be the best approach.

Every protocol except sdcard (ATA, SCSI, NVMe, vioblk) expects a list of ranges to erase. Unfortunately, not every protocol expresses the same types of limits on its version of erase (e.g., as noted earlier, NVMe up until around 2.0 had none outside of the limitations imposed by the field sizes of the request), so that meant that some values aren't used for some drivers. The DKIOCFREE ioctl uses byte ranges instead of LBAs and block counts, all of which adds complexity.

Today, blkdev takes a dkioc_free_list_t from the DKIOCFREE ioctl, and spits out zero or more dkioc_free_list_ts to the leaf driver. The dkioc_free_list_ts the driver sees are guaranteed by blkdev to meet all of its requirements in terms of alignments, sizes, etc. The leaf drivers merely need to translate the list of byte ranges into LBAs and block counts. This part was a bit unfortunate, but the feeling was that since the ioctl uses byte ranges and lengths, keeping it consistent instead of flipping from bytes to blocks midway would be less confusing (though not ideal). However, even there, blkdev does the work so that all the leaf drivers have to do is divide the byte offsets/lengths by the block size (all the starting offsets and lengths will be narrowed as necessary -- or even elided -- to ensure they fall on LBA/block boundaries).
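That driver-side translation is a sketch's worth of code; modeled in userland, with a stand-in struct whose field names follow dkioc_free_list_ext_t (the struct here is illustrative, not the real header):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a byte-range extent already narrowed by blkdev so that
 * both fields fall on block boundaries. */
typedef struct {
	uint64_t dfle_start;	/* byte offset, block-aligned */
	uint64_t dfle_length;	/* byte length, block-aligned */
} ext_sketch_t;

/*
 * With blkdev guaranteeing alignment, the leaf driver's conversion
 * from byte ranges to LBA/count is just a pair of divisions.
 */
static void
ext_to_lba(const ext_sketch_t *e, uint32_t blksz, uint64_t *lba,
    uint64_t *nblks)
{
	*lba = e->dfle_start / blksz;
	*nblks = e->dfle_length / blksz;
}
```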

The net result is that the leaf drivers have to do very little work with the request they are given. I'm not sure what complexity you think the current (or future) leaf drivers are being burdened with by blkdev. For the two major consumers (vioblk and NVMe), you still have to allocate the protocol-specific structure for the request -- whether it's 1 range or 100 ranges in the request. That complexity is inherent and can't go away. From there, the request has to be populated with the range(s) to erase. The only difference between a request with a single range and a request with multiple ranges is that, if multiple ranges are possible, the assignment of the ranges in the protocol-specific structure must happen within a loop.

If the leaf driver must handle splitting certain requests (to make blkdev more aesthetically pleasing), that is new complexity added to the leaf drivers that doesn't exist today (and complexity that will be largely duplicated between drivers). As was noted in the original (3-month-long) review, if we created some facility for callers of the DKIOCFREE ioctl to discover the device's requirements, and then rejected requests that don't conform, we could simplify this greatly since we could (after validation) always just pass through the request. However, such a facility is probably worthy of its own issue (if not an IPD). The alternative is to just drop any requests that don't meet the requirements of the underlying device, but that seems rather rude when we could service them today (and I will note that the FreeBSD and Linux frameworks will try to do the work to make the request conform to the requirements of the driver).

Actions #11

Updated by Garrett D'Amore almost 2 years ago

I feel like we may be talking / writing past one another.

There are 2 interface boundaries for blkdev.

1. The upper interface boundary, which is how blkdev interacts with ioctls like DKIOCFREE.

I am not proposing (here at least) to change that interface. It is undoubtedly flawed, and it's unfortunate that it appears it was designed as just a simple mapping of the UNMAP (or more likely TRIM) capability, rather than giving a little more thought to the design. But it is, at least for now, what it is. (If it's not public then we probably should consider some changes too, but again, that's not this proposal.) For all intents and purposes, this ioctl can continue to function as it does today, with exactly the same inputs and semantics.

2. The lower interface boundary, which is the API in blkdev.h. This is the interface that drivers like vioblk and nvme code to. This is where I would like to see some attention given to simplification; the purpose of blkdev is to make writing these drivers a very easy and straightforward thing, and to shield them from the craziness and legacy that we sometimes have in the dkio interfaces above.

The blkdev driver is the meat between these layers.

My proposal is that when blkdev gets a single DKIOCFREE request, it should examine that request, and break it up into multiple, perhaps sequential (although perhaps not!), requests to the underlying driver. I believe it's already doing some of this to provide the support for limiting segments and segment sizes. What I'm saying here is, I think we should explicitly make the decision that blkdev only supports one outstanding segment at a time. There are other things that I would free it from as well. The choice to pass on the insanity of using byte ranges, when everything about block devices is block-oriented, is incomprehensible to me. It makes reading the driver harder, and it's one more thing to be error-prone.

(As an aside, I both understand why SCSI and the filesystems use multiple segments -- they want to have a single simple SCSI request to issue when deleting a file, which might be scattered around the filesystem, and I also understand why this might be a terrible thing to actually issue. In the early days of TRIM and UNMAP, many implementations would stall all I/O while performing a TRIM or UNMAP operation, and having these be scattered around the media is likely to be quite expensive. Conversely they might be trying to amortize the cost of doing the GC into a single operation. I don't think modern implementations suffer any of these problems; if they did, Linux would surely have adopted an approach to permit multiple segments in a single request. I believe their experience is far larger here than ours is, simply because of the differing sizes of the communities.)

The next question is the limits for a single segment. If limitations are going to be common, then maybe we should expose this in blkdev and have blkdev do the break-up. Although I think we can probably limit it to just two values in that case: the maximum block count to support, and the alignment/granularity (these can probably be a single value, as they are almost certainly linked). But, if it's really just one esoteric use case, maybe blkdev shouldn't pay this cost, but the driver can. One thing that should be looked into is whether there is ever a case where the alignment/granularity is different than the physical block size. Personally, if these limitations are common, then I think we should explore ways to improve the upper interface so that the requests coming down are guaranteed to be reasonable.
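If alignment and granularity really do collapse to one value, honoring that limit is a few lines; a hedged sketch (free_narrow and its parameters are hypothetical names, with the granularity expressed in blocks, e.g. 8 for 4K cells on a 512B device):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Narrow a (blkno, nblks) free request so that it covers only whole
 * granules.  A request that no longer covers a whole granule degrades
 * to a no-op, which is acceptable for an advisory operation like
 * UNMAP/TRIM.  Returns 1 if anything remains, 0 otherwise.
 */
static int
free_narrow(uint64_t blkno, uint64_t nblks, uint64_t gran,
    uint64_t *out_blkno, uint64_t *out_nblks)
{
	uint64_t first = (blkno + gran - 1) / gran * gran; /* round up */
	uint64_t end = (blkno + nblks) / gran * gran;      /* round down */

	if (end <= first)
		return (0);	/* nothing whole to free */
	*out_blkno = first;
	*out_nblks = end - first;
	return (1);
}
```

Whether this belongs in blkdev or in the one driver that needs it is exactly the question above; the point is only that it is small either way.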

The whole segmentation/alignment thing smells very familiar -- like the ddi_dma_attributes. This is an area that I think many driver developers have struggled with, and frequently there is confusion as to what the fields really mean. The documentation for this is not easy to parse, and it really can't be because you need to understand all the concepts. If I could avoid injecting the same level of complexity in the blkdev API, I would prefer to do so.

At any rate, with this design, the interface between blkdev and the driver would not have needed new fields to be added to the bd_xfer_t, as the LBA and count would be sufficient to express any single operation. Further, it would have required at most two extra fields in the bd_drive structure, which would be readily understandable.

In my opinion, if we do nothing else, we should gut the use of the dkioc_free_list_t, and replace it with an array of segments where the values are LBA and count. Of course, as I've said already, I'd rather not do even that, and just use the single segment with the existing LBA and count in the bd_xfer_t.

I also think pushing the data structure for a private (undocumented) "foreign" (to blkdev) API into the driver interface was misguided. We want to be able to document the blkdev APIs and ideally raise them to a reasonably high commitment level. In support of that, it's best to avoid "commitment inheritance" from these other structures. (We do, I think, want to be able to fix the DKIOCFREE ioctls someday.)

All of this does imply that some additional complexity in blkdev itself would be needed. But that's fine, as long as we can spare the drivers from having it.

