Bug #14035

viona mishandles merged-rx packets

Added by Patrick Mooney 2 months ago. Updated about 2 months ago.

Status:
Closed
Priority:
Normal
Category:
bhyve
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

SmartOS user Smithx10 reported several instances suffering from wedged RX virtqueues when running under bhyve (with viona driving the virtio-net emulation).
One guest in question, running Linux 4.14.35-1902.302.2.el7uek.x86_64 reported this error via dmesg:

[251475.388441] virtio_net virtio4: input.0:id 282 is not a head!

The state for the viona ring associated with that virtqueue was collected:
{
    vr_link = 0xfffffd6af396c600
    vr_lock = {
        _opaque = [ 0 ]
    }
    vr_cv = {
        _opaque = 0x1
    }
    vr_state = 0x3
    vr_state_flags = 0
    vr_xfer_outstanding = 0
    vr_worker_thread = 0xfffffd6663ccc4a0
    vr_lease = 0xfffffd6ad4720528
    vr_txdesb = 0
    vr_txiov = 0
    vr_intr_enabled = 0
    vr_msi_addr = 0xfee00000
    vr_msi_msg = 0x40a1
    vr_a_mutex = {
        _opaque = [ 0 ]
    }
    vr_u_mutex = {
        _opaque = [ 0 ]
    }
    vr_pa = 0x136f20000
    vr_size = 0x400
    vr_mask = 0x3ff
    vr_cur_aidx = 0x90fd
    vr_cur_uidx = 0x90fd
    vr_map_pages = 0xfffffd6ad585bc80
    vr_stats = {
        rs_ndesc_too_high = 0
        rs_bad_idx = 0
        rs_indir_bad_len = 0
        rs_indir_bad_nest = 0
        rs_indir_bad_next = 0
        rs_no_space = 0xf9f53
        rs_too_many_desc = 0
        rs_desc_bad_len = 0
        rs_bad_ring_addr = 0
        rs_fail_hcksum = 0
        rs_fail_hcksum6 = 0
        rs_fail_hcksum_proto = 0
        rs_bad_rx_frame = 0
        rs_rx_merge_overrun = 0
        rs_rx_merge_underrun = 0
        rs_rx_pad_short = 0x32
        rs_rx_mcast_check = 0
        rs_too_short = 0
        rs_tx_absent = 0
        rs_rx_hookdrop = 0
        rs_tx_hookdrop = 0
    }
}

We can see that the rs_no_space count is high and, when examined live, still climbing. The guest kernel would have stopped placing available buffers in that virtqueue after deciding it was in an erroneous state, so our failure to deliver packets makes sense in context. I dumped the guest memory pages holding the virtqueue for further inspection (attached as virtqueue.bin).
These macros (customized for how the VQ is sized) allow for easy inspection with mdb:
::typedef -c LP64
::typedef 'struct { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; }' virtio_desc_t
::typedef 'struct { uint16_t flags; uint16_t idx; uint16_t ent[0x400]; uint16_t last_avail_idx; uint8_t _pad[0x7fa]; }' virtio_avail_t
::typedef 'struct { uint32_t idx; uint32_t len; }' virtio_used_ent_t
::typedef 'struct { uint16_t flags; uint16_t idx; virtio_used_ent_t ent[0x400]; uint16_t last_used_idx;}' virtio_used_t
::typedef 'struct { virtio_desc_t desc[0x400]; virtio_avail_t avail; virtio_used_t used; }' virtio_queue_t

First, let's figure out what the error message from Linux means. Using the v4.14 sources as a rough guide, we see that message emitted from virtqueue_get_buf_ctx:

        /* Only get used array entries after they have been exposed by host. */
        virtio_rmb(vq->weak_barriers);

        last_used = (vq->last_used_idx & (vq->vring.num - 1));
        i = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].id);
        *len = virtio32_to_cpu(_vq->vdev, vq->vring.used->ring[last_used].len);

        if (unlikely(i >= vq->vring.num)) {
                BAD_RING(vq, "id %u out of range\n", i);
                return NULL;
        }
        if (unlikely(!vq->desc_state[i].data)) {
                BAD_RING(vq, "id %u is not a head!\n", i);
                return NULL;
        }

So it read what it expected to be the next entry from the "used" side of the virtqueue and found no in-kernel data associated with it. Fortunately for us, when the queue is working properly (as opposed to emitting the aforementioned error), the function records its progress, storing vq->last_used_idx into the ring's used_event field:

        if (!(vq->avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT))
                virtio_store_mb(vq->weak_barriers,
                                &vring_used_event(&vq->vring),
                                cpu_to_virtio16(_vq->vdev, vq->last_used_idx));

Let's start from there when examining the ring contents:

> 0::print virtio_queue_t avail.last_avail_idx
avail.last_avail_idx = 0x8ef6

The viona host state for the ring indicated it had progressed as far as 0x90fd before finding no more entries in the ring, a difference of 519 entries - well below the ring size of 1024. This makes sense, as the guest may not have kept the avail ring 100% populated with new buffers to fill.
Let's inspect the used entries around that index (0x8ef6 masked for the ring size of 0x400):

> 8ef6&3ff=K
                2f6
> 0::print -ta virtio_queue_t used.ent[2f6]
67b4 virtio_used_ent_t used.ent[2f6] = {
    67b4 uint32_t used.ent[2f6].idx = 0x11a
    67b8 uint32_t used.ent[2f6].len = 0x4e
}
> 11a=E
                282

So that entry does match the error message emitted.
Let's take a look at the surrounding entries in the ring for context. Presumably those preceding 0x11a were fine.
> 679c,6::print -ta virtio_used_ent_t
679c virtio_used_ent_t {
    679c uint32_t idx = 0x130
    67a0 uint32_t len = 0x56
}
67a4 virtio_used_ent_t {
    67a4 uint32_t idx = 0x12f
    67a8 uint32_t len = 0xa9
}
67ac virtio_used_ent_t {
    67ac uint32_t idx = 0x11d
    67b0 uint32_t len = 0x542
}
67b4 virtio_used_ent_t {
    67b4 uint32_t idx = 0x11a
    67b8 uint32_t len = 0x4e
}
67bc virtio_used_ent_t {
    67bc uint32_t idx = 0x114
    67c0 uint32_t len = 0x4e
}
67c4 virtio_used_ent_t {
    67c4 uint32_t idx = 0x11c
    67c8 uint32_t len = 0x4e
}

And now let's look for similar entries in the avail ring, as they would have been presented to the device to be populated with packets:

> 0::print -ta virtio_queue_t avail.ent
4004 uint16_t [1024] avail.ent = [ 0x385, 0x384, 0x383, 0x382, 0x3a2, 0x3a1, 0x3a0, 0x39f, 0x39e, 0x39d, 0x39c, 0x391, 0x390, 0x38f, 0x38e, 0x38d, 0x38c, 0x3ae, 0x3db, 0x3e2, 0x3e1, 0x3ad, 0x3d5, 0x3af, 0x3c7, 0x3bf, 0x37c, 0x37b, 0x37a, 0x379, 0x378, 0x377, ... ]
> 4004,400/axn ! grep 11a
0x411a:         2fd
0x4204:         11a
> (4204-4004)%2=K
                100
> 0::print -ta virtio_queue_t avail.ent[100]
4204 uint16_t avail.ent[100] = 0x11a
> 4204-c,8/axn
0x41f8:         0x41f8:         2ba
0x41fa:         2be
0x41fc:         2c0
0x41fe:         11b
0x4200:         11c
0x4202:         114
0x4204:         11a
0x4206:         128

There appears to be some difference between the order in which viona consumed these available buffers and the order in which the guest offered them, but that should be fine. Per the spec:

Device is not generally required to use buffers in the same order in which they have been made available by the driver.

That said, further investigation is required to see why those are so out of order, and if the state in the ring matches all expectations.


Files

virtqueue.bin (32 KB) Virtqueue Contents (Patrick Mooney, 2021-08-25 08:04 PM)
second.bin (32 KB) (Patrick Mooney, 2021-08-25 11:01 PM)

Related issues

Related to illumos gate - Feature #13783: viona should operate page-wise (Closed, Patrick Mooney)

Actions #1

Updated by Patrick Mooney 2 months ago

Data from another misbehaving host was collected:

{
    vr_link = 0xfffffdbb96a3e700
    vr_lock = {
        _opaque = [ 0 ]
    }
    vr_cv = {
        _opaque = 0x1
    }
    vr_state = 0x3
    vr_state_flags = 0
    vr_xfer_outstanding = 0
    vr_worker_thread = 0xfffffd69ebc0c860
    vr_lease = 0xfffffd72fe45e5f8
    vr_txdesb = 0
    vr_txiov = 0
    vr_intr_enabled = 0
    vr_msi_addr = 0xfee00000
    vr_msi_msg = 0x40c1
    vr_a_mutex = {
        _opaque = [ 0 ]
    }
    vr_u_mutex = {
        _opaque = [ 0 ]
    }
    vr_pa = 0x231dd8000
    vr_size = 0x400
    vr_mask = 0x3ff
    vr_cur_aidx = 0xa30d
    vr_cur_uidx = 0xa30d
    vr_map_pages = 0xfffffd7229e9bc80
    vr_stats = {
        rs_ndesc_too_high = 0
        rs_bad_idx = 0
        rs_indir_bad_len = 0
        rs_indir_bad_nest = 0
        rs_indir_bad_next = 0
        rs_no_space = 0x136f8c
        rs_too_many_desc = 0
        rs_desc_bad_len = 0
        rs_bad_ring_addr = 0
        rs_fail_hcksum = 0
        rs_fail_hcksum6 = 0
        rs_fail_hcksum_proto = 0
        rs_bad_rx_frame = 0x719f
        rs_rx_merge_overrun = 0
        rs_rx_merge_underrun = 0
        rs_rx_pad_short = 0x3d4e
        rs_rx_mcast_check = 0
        rs_too_short = 0x719f
        rs_tx_absent = 0
        rs_rx_hookdrop = 0
        rs_tx_hookdrop = 0
    }
}

[ 1111.998047] virtio_net virtio4: input.0:id 525 is not a head!

Actions #2

Updated by Patrick Mooney 2 months ago

While scrutinizing differences introduced by #13783 (since this behavior, so far, appears to be present only on machines running OS bits newer than that) I came across something suspicious:

vq_pushchain_many(viona_vring_t *ring, uint_t num_bufs, used_elem_t *elem)
{
        uint16_t uidx;

        mutex_enter(&ring->vr_u_mutex);

        uidx = ring->vr_cur_uidx;

        for (uint_t i = 0; i < num_bufs; i++) {
                vq_write_used_ent(ring, uidx & ring->vr_mask, elem[i].id,
                    elem[i].len);
        }
        uidx += num_bufs;

        membar_producer();
        vq_write_used_idx(ring, uidx);
        ring->vr_cur_uidx = uidx;

        mutex_exit(&ring->vr_u_mutex);
}

The vq_pushchain_many function is only used for merged-RX operation, which is the case here. If an incoming packet is larger than a single virtio descriptor, merged-RX allows it to be spread across multiple descriptors. All of those descriptors are then added to the "used" ring at once, with the first one bearing a header which records the number of descriptors making up the packet. This logic was updated for #13783 and does not advance the used index as it writes each descriptor:

        for (uint_t i = 0; i < num_bufs; i++) {
                vq_write_used_ent(ring, uidx & ring->vr_mask, elem[i].id,
                    elem[i].len);
        }

This simply overwrites the first used entry with each following one. The guest then reads the virtio-net header from a buffer which was not the first in the chain, and if the bytes residing at the offset used for num_buffers happen to hold a value greater than 1, it will march forward looking for other entries in the ring which were never properly written. There is a chance such an entry is valid, in which case the guest network stack merely encounters a garbage packet, but if the entry referred to a descriptor with no valid state associated with it in the guest, then the "is not a head" error would be a reasonable expectation.

There are other clues which point toward large (>1536 byte) packet reception on the host. While many entries in the RX descriptor ring are sized at 1536 bytes, some are 2k in size. The Linux network stack takes average packet size into account when populating new descriptor entries. Given the small (but present) population of 2k buffers, it is highly likely that LSO packets were at least occasionally being received by the guest, putting it in danger of triggering this bug.

Actions #3

Updated by Patrick Mooney 2 months ago

More context around large packets being to blame: I've heard from both sjorge (who also saw this recently) and Smithx10 that the instances in question had their vnic on a 9k MTU link.

Actions #4

Updated by Patrick Mooney 2 months ago

sjorge confirmed that in his environment, large (4000+ byte) ping packets reliably tripped the issue when running without the proposed patch.

Actions #5

Updated by Patrick Mooney 2 months ago

Actions #6

Updated by Patrick Mooney 2 months ago

  • Subject changed from "virtio_net wedged on viona device" to "viona mishandles merged-rx packets"
Actions #7

Updated by Patrick Mooney 2 months ago

Smithx10, who is running a test platform with the fix on one CN, confirmed that the fix prevents large ping packets from hanging the virtqueue. On the older platform, those large pings immediately hang the virtqueue.

Actions #8

Updated by Patrick Mooney about 2 months ago

The CN running the test PI with the proposed fix is still operational with no reported issues in the guest.

Actions #9

Updated by Patrick Mooney about 2 months ago

For testing purposes, I was able to reproduce this issue using jumbo frames over an etherstub device. With a Linux guest running on the stock kernel, changing the MTU from 1500 to 9000 on that NIC and running another iperf test resulted in the error occurring immediately. With the fix in place, the same test repeated encountered no such error.

Actions #10

Updated by Patrick Mooney about 2 months ago

In addition to testing of the behavior specific to this issue, I also booted the normal battery of guests to confirm they still function properly with the updated viona.

Actions #11

Updated by Electric Monk about 2 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit be899113441eaeb88dfc5644ec05599e4e27eacb

commit  be899113441eaeb88dfc5644ec05599e4e27eacb
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2021-09-05T00:43:10.000Z

    14035 viona mishandles merged-rx packets
    Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
    Reviewed by: Dan Cross <cross@oxidecomputer.com>
    Approved by: Gordon Ross <gordon.w.ross@gmail.com>
