Project

General

Profile

Bug #13190

rootnex_get_sgl() has a cookie problem

Added by Ryan Zezeski 7 months ago. Updated 7 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
kernel
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

Under some very specific conditions, the rootnex SGL logic increases cookie count before making full use of the current cookie, as seen here in rootnext_get_sgl().

        /*
         * this page didn't need the copy buffer, it is physically
         * contiguous with the last page, and it's <= the max cookie
         * size.
         */
        } else {
            sgl[cnt].dmac_size += psize;

            /*
             * if this exactly ==  the maximum cookie size, and
             * it isn't the last cookie, go to the next cookie.
             */
            if (((sgl[cnt].dmac_size + psize) == maxseg) &&
                ((cnt + 1) < sglinfo->si_max_pages)) {
                cnt++;
                sgl[cnt].dmac_laddress = 0;
                sgl[cnt].dmac_size = 0;
                sgl[cnt].dmac_type = 0;
            }
        }

NOTE: This same logic is also present in rootnex_dvma_get_sgl(), which makes sense as it was made 5 years later and probably based (read copy/paste) on the non-IOMMU logic.

By counting psize towards the cookie size twice we are prematurely increasing the cookie count, wasting a page's worth of memory, and, more importantly, potentially exceeding the maximum cookie count, needlessly failing the allocation. We enter this code block when the SGL logic has determined the current page is physically contiguous with the last page in the cookie and fits in the current cookie. We add psize to indicate that this cookie is taking ownership of the current page. The if statement then determines if we've reached the maximum segment (cookie) size (the maxseg value is the minimum of the dma_attr_count_max and dma_attr_maxxfer attributes). If we have reached the maximum size of a single cookie, then we must move onto the next one. The problem, of course, is that we already added psize to .dmac_size, and thus we will close out the current cookie prematurely, given the right circumstances, which are:

  • The DMA allocations are in exact multiples of a page size. E.g., page size of 4K and allocations of 4K, 8K, 12K, 16K, etc.
  • The dma_attr_count_max is set to <allocation size> - 1, which tells the rootnext the maximum cookie size is <allocation size> (the weirdness around max cookie size is its own topic and is something that has bitten me before with i40e, see https://smartos.org/bugview/OS-7344)
  • The dma_attr_maxxfer is set to <allocation size>.
  • The <allocation size> is 12K or more.

Okay, so why are these conditions important?

  • The allocation must be an exact multiple of the page size because that's the natural size rootnex works with to determine SGL count.
  • We need to match the maximum count and transfer size to the allocation size in order to hit the exact boundary in the if statement.
  • And finally, the weirdest one, we need to allocate 12K or more because we need at least 3 pages to hit this logic. Looking at rootnex_get_sgl(), you'll notice that the first cookie is a bit special in that we automatically assign the first page to it (because a cookie needs fit a page at minimum). So let's take the example of allocating 8K. Before entering the loop, we will have 4K in the first cookie. Then we enter the loop and fall into the else block listed above, and increment the cookie by psize (to give us a .dmac_size of 8K). Then the if statement will check 8K + 4K == 8K, and return false. Then we fall into the end of the while loop, decrement size by 4K to give us 0, and exit the while loop with a successful allocation.
1. before while loop: sgl[0].dmac_size = 4K, cnt = 0
2. enter while loop: size = 4K
3. enter else block
4. sgl[0].dmac_size += 4K, now sgl[0].dmac_size = 8K
5. sgl[0].dmac_size + psize == 8K, 8K + 4K == 8K, false
6. size -= 4K, now size = 0
7. exit while loop
8. sgl[0].dmac_size != 0, enter else block, set SGL size to cnt + 1
9. return: si_sgl_size = 1

Now the same for 12K

1. before while loop: sgl[0].dmac_size = 4K, cnt = 0
2. enter while loop: size = 8K
3. enter else block
4. sgl[0].dmac_size += 4K, now sgl[0].dmac_size = 8K
5. sgl[0].dmac_size + psize == 12K, true, enter if
6. cnt++, now cnt = 1, move to next cookie, sgl[1].dmac_size = 0
7. size -= 4K, now size = 4K
8. enter else block
9. sgl[1].dmac_size += 4K, now sgl[1].dmac_size = 4K
10. sgl[1].dmac_size + psize == 12K, false
11. size -= 4K, now size = 0
12. exit while loop
13. sgl[1].dmac_size != 0, enter else block, set SGL size to cnt + 1
14. return: si_sgl_size = 2

This logic was introduced in a fairly large changeset back in 2005.

https://github.com/illumos/illumos-gate/commit/12f080e7d03a5a6c62c85f0005491e9e4d355cfb

It's my belief that the original author's double accounting was simply a mistake of refactoring or copy/paste. With such a specific set of conditions required to trip this logic it's no wonder it has laid dormant for 15 years. It's probably highly unusual for a driver to specify such tight constraints on the DMA cookie and max transfer sizes. I only hit this because I was working on a driver, and for no great reason I can think of, decided to weld the max segment size to the actual allocation size (by building DMA attrs dynamically based on allocation size). Perhaps such a strategy is nonsensical, but it has still uncovered a bug that could bite real drivers if the right device came along.


Files

igb_gld.c (40.5 KB) igb_gld.c modified igb_gld.c Ryan Zezeski, 2020-10-01 02:24 AM
rootnex_get_sgl_bug.d (546 Bytes) rootnex_get_sgl_bug.d Ryan Zezeski, 2020-10-01 02:24 AM
#1

Updated by Ryan Zezeski 7 months ago

To replicate this issue (on something other than my in-development driver) I modified the igb driver to run the following function anytime the _intr_throttling private property was fetched.

static void
rpz_test_dma(igb_t *igb, size_t size)
{
    ddi_dma_attr_t attr;
    ddi_device_acc_attr_t acc;
    ddi_dma_handle_t dma_hdl;
    ddi_acc_handle_t acc_hdl;
    caddr_t va;
    size_t real_len;
    int ret;

    bcopy(&rpz_desc_dma_attr, &attr, sizeof (attr));
    acc.devacc_attr_version = DDI_DEVICE_ATTR_V1;
    acc.devacc_attr_endian_flags = DDI_NEVERSWAP_ACC;
    acc.devacc_attr_dataorder = DDI_STRICTORDER_ACC;
    acc.devacc_attr_access = DDI_DEFAULT_ACC;

    attr.dma_attr_count_max = size - 1;
    attr.dma_attr_maxxfer = size;
    ret = ddi_dma_alloc_handle(igb->dip, &attr, DDI_DMA_DONTWAIT, NULL,
        &dma_hdl);

    if (ret != 0) {
        cmn_err(CE_NOTE, "rpz_test_dma failed to alloc DMA handle: %d",
            ret);
        return;
    }

    ret = ddi_dma_mem_alloc(dma_hdl, size, &acc, DDI_DMA_CONSISTENT,
        DDI_DMA_DONTWAIT, NULL, &va, &real_len, &acc_hdl);

    if (ret != 0) {
        cmn_err(CE_NOTE, "rpz_test_dma failed to allocate %lu bytes of " 
            "DMA memory: %d", size, ret);
        ddi_dma_free_handle(&dma_hdl);
        return;
    }

    ret = ddi_dma_addr_bind_handle(dma_hdl, NULL, va, real_len,
        DDI_DMA_RDWR | DDI_DMA_CONSISTENT, DDI_DMA_DONTWAIT, NULL, NULL,
        NULL);

    if (ret == 0) {
        cmn_err(CE_NOTE,
            "rpz_test_dma sucessfully bound %lu bytes of DMA", size);
    } else {
        cmn_err(CE_NOTE,
            "rpz_test_dma failed to bind %lu bytes of DMA: %d", size,
            ret);
        ddi_dma_mem_free(&acc_hdl);
        ddi_dma_free_handle(&dma_hdl);
        return;
    }

    ddi_dma_unbind_handle(dma_hdl);
    ddi_dma_mem_free(&acc_hdl);
    ddi_dma_free_handle(&dma_hdl);
}

With igb_get_priv_prop() modified to call this function.

    } else if (strcmp(pr_name, "_intr_throttling") == 0) {
        rpz_test_dma(igb, 4096);
        rpz_test_dma(igb, 2 * 4096);
        rpz_test_dma(igb, 3 * 4096);
        rpz_test_dma(igb, 4 * 4096);
        value = igb->intr_throttling[0];

With that in place I then used this dtrace script to confirm that the first cookie is only partially full when allocating 12K or more.

rpz_test_dma:entry {self->t = 1;}

rootnex_get_sgl:entry /self->t/
{
    printf("\n*** DMA size: %u ", args[0]->dmao_size);
    self->sgl = args[1];
    self->sglinfo = args[2];
}

rootnex_get_sgl:return /self->sglinfo/
{
    printf("(%s) ***\n", self->sglinfo->si_sgl_size == 1 ? "GOOD" : "BAD");
    printf("*** sgl[0].dmac_size = %u ***\n", self->sgl[0].dmac_size);
    print(*self->sglinfo);
    self->sgl = 0;
    self->sglinfo = 0;
}

rpz_test_dma:return { self->t = 0; }

Then I ran the following command.

rpz@thunderhead:~$ dladm show-linkprop -p _intr_throttling igb0
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
igb0         _intr_throttling rw  200            200            -- 

With the following messages printed in the system log (-3 == DDI_DMA_TOOBIG).

Sep 30 15:08:08 thunderhead igb: [ID 760230 kern.notice] NOTICE: rpz_test_dma sucessfully bound 4096 bytes of DMA
Sep 30 15:08:08 thunderhead igb: [ID 760230 kern.notice] NOTICE: rpz_test_dma sucessfully bound 8192 bytes of DMA
Sep 30 15:08:08 thunderhead igb: [ID 103451 kern.notice] NOTICE: rpz_test_dma failed to bind 12288 bytes of DMA: -3
Sep 30 15:08:08 thunderhead igb: [ID 103451 kern.notice] NOTICE: rpz_test_dma failed to bind 16384 bytes of DMA: -3

And the following dtrace output.

rpz@thunderhead:~$ pfexec dtrace -qs /data/d/rootnex_get_sgl_bug.d 

*** DMA size: 4096 (GOOD) ***
*** sgl[0].dmac_size = 4096 ***
rootnex_sglinfo_t {
    boolean_t si_cancross = B_FALSE
    uint64_t si_min_addr = 0
    uint64_t si_max_addr = 0xffffffffffffffff
    uint64_t si_max_cookie_size = 0x1000
    uint64_t si_segmask = 0xffffffffffffffff
    uint_t si_flags = 0x200
    uint_t si_max_pages = 0x2
    boolean_t si_bounce_on_seg = B_FALSE
    size_t si_copybuf_req = 0
    off_t si_buf_offset = 0
    struct as *si_asp = 0xfffffffffbc58060
    uint_t si_sgl_size = 0x1
}
*** DMA size: 8192 (GOOD) ***
*** sgl[0].dmac_size = 8192 ***
rootnex_sglinfo_t {
    boolean_t si_cancross = B_FALSE
    uint64_t si_min_addr = 0
    uint64_t si_max_addr = 0xffffffffffffffff
    uint64_t si_max_cookie_size = 0x2000
    uint64_t si_segmask = 0xffffffffffffffff
    uint_t si_flags = 0x200
    uint_t si_max_pages = 0x3
    boolean_t si_bounce_on_seg = B_FALSE
    size_t si_copybuf_req = 0
    off_t si_buf_offset = 0
    struct as *si_asp = 0xfffffffffbc58060
    uint_t si_sgl_size = 0x1
}
*** DMA size: 12288 (BAD) ***
*** sgl[0].dmac_size = 8192 ***
rootnex_sglinfo_t {
    boolean_t si_cancross = B_FALSE
    uint64_t si_min_addr = 0
    uint64_t si_max_addr = 0xffffffffffffffff
    uint64_t si_max_cookie_size = 0x3000
    uint64_t si_segmask = 0xffffffffffffffff
    uint_t si_flags = 0x200
    uint_t si_max_pages = 0x4
    boolean_t si_bounce_on_seg = B_FALSE
    size_t si_copybuf_req = 0
    off_t si_buf_offset = 0
    struct as *si_asp = 0xfffffffffbc58060
    uint_t si_sgl_size = 0x2
}
*** DMA size: 16384 (BAD) ***
*** sgl[0].dmac_size = 12288 ***
rootnex_sglinfo_t {
    boolean_t si_cancross = B_FALSE
    uint64_t si_min_addr = 0
    uint64_t si_max_addr = 0xffffffffffffffff
    uint64_t si_max_cookie_size = 0x4000
    uint64_t si_segmask = 0xffffffffffffffff
    uint_t si_flags = 0x200
    uint_t si_max_pages = 0x5
    boolean_t si_bounce_on_seg = B_FALSE
    size_t si_copybuf_req = 0
    off_t si_buf_offset = 0
    struct as *si_asp = 0xfffffffffbc58060
    uint_t si_sgl_size = 0x2
}

Notice that allocations 12K and 16K failed to fully use the first cookie and return an si_sgl_size of 2.

#2

Updated by Electric Monk 7 months ago

  • Gerrit CR set to 955
#3

Updated by Ryan Zezeski 7 months ago

To test I ran the same custom igb allocation as mentioned above, but verified that none of the allocations failed, as shown below. I wasn't able to trigger this code path using the IOMMU (rootnex_dvma_get_sgl()). These specific allocations always took the fast path, iommulib_nexdma_mapobject(). That said, I was able to boot with the Intel IOMMU enabled and saw plenty of rootnex_dvma_get_slg() calls being made by ZFS. Given this is clearly a copy/paste job, it makes sense to fix both.

rpz@thunderhead:~$ pfexec dtrace -qs/data/d/rootnex_get_sgl_bug.d 

*** DMA size: 4096 (GOOD) ***
*** sgl[0].dmac_size = 4096 ***
rootnex_sglinfo_t {
    boolean_t si_cancross = B_FALSE
    uint64_t si_min_addr = 0
    uint64_t si_max_addr = 0xffffffffffffffff
    uint64_t si_max_cookie_size = 0x1000
    uint64_t si_segmask = 0xffffffffffffffff
    uint_t si_flags = 0x200
    uint_t si_max_pages = 0x2
    boolean_t si_bounce_on_seg = B_FALSE
    size_t si_copybuf_req = 0
    off_t si_buf_offset = 0
    struct as *si_asp = 0xfffffffffbc4a060
    uint_t si_sgl_size = 0x1
}
*** DMA size: 8192 (GOOD) ***
*** sgl[0].dmac_size = 8192 ***
rootnex_sglinfo_t {
    boolean_t si_cancross = B_FALSE
    uint64_t si_min_addr = 0
    uint64_t si_max_addr = 0xffffffffffffffff
    uint64_t si_max_cookie_size = 0x2000
    uint64_t si_segmask = 0xffffffffffffffff
    uint_t si_flags = 0x200
    uint_t si_max_pages = 0x3
    boolean_t si_bounce_on_seg = B_FALSE
    size_t si_copybuf_req = 0
    off_t si_buf_offset = 0
    struct as *si_asp = 0xfffffffffbc4a060
    uint_t si_sgl_size = 0x1
}
*** DMA size: 12288 (GOOD) ***
*** sgl[0].dmac_size = 12288 ***
rootnex_sglinfo_t {
    boolean_t si_cancross = B_FALSE
    uint64_t si_min_addr = 0
    uint64_t si_max_addr = 0xffffffffffffffff
    uint64_t si_max_cookie_size = 0x3000
    uint64_t si_segmask = 0xffffffffffffffff
    uint_t si_flags = 0x200
    uint_t si_max_pages = 0x4
    boolean_t si_bounce_on_seg = B_FALSE
    size_t si_copybuf_req = 0
    off_t si_buf_offset = 0
    struct as *si_asp = 0xfffffffffbc4a060
    uint_t si_sgl_size = 0x1
}
*** DMA size: 16384 (GOOD) ***
*** sgl[0].dmac_size = 16384 ***
rootnex_sglinfo_t {
    boolean_t si_cancross = B_FALSE
    uint64_t si_min_addr = 0
    uint64_t si_max_addr = 0xffffffffffffffff
    uint64_t si_max_cookie_size = 0x4000
    uint64_t si_segmask = 0xffffffffffffffff
    uint_t si_flags = 0x200
    uint_t si_max_pages = 0x5
    boolean_t si_bounce_on_seg = B_FALSE
    size_t si_copybuf_req = 0
    off_t si_buf_offset = 0
    struct as *si_asp = 0xfffffffffbc4a060
    uint_t si_sgl_size = 0x1
}
Oct  2 09:50:17 thunderhead igb: [ID 760230 kern.notice] NOTICE: rpz_test_dma sucessfully bound 4096 bytes of DMA
Oct  2 09:50:17 thunderhead igb: [ID 760230 kern.notice] NOTICE: rpz_test_dma sucessfully bound 8192 bytes of DMA
Oct  2 09:50:17 thunderhead igb: [ID 760230 kern.notice] NOTICE: rpz_test_dma sucessfully bound 12288 bytes of DMA
Oct  2 09:50:17 thunderhead igb: [ID 760230 kern.notice] NOTICE: rpz_test_dma sucessfully bound 16384 bytes of DMA
#4

Updated by Electric Monk 7 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 23a869ddcd5330bdcbefc98e81d9de23b83f3182

commit  23a869ddcd5330bdcbefc98e81d9de23b83f3182
Author: Ryan Zezeski <ryan@zinascii.com>
Date:   2020-10-02T22:07:04.000Z

    13190 rootnex_get_sgl() has a cookie problem
    Reviewed by: Paul Winder <paul@winder.uk.net>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

Also available in: Atom PDF