Bug #14026

livelock in vmm_drv interface

Added by Patrick Mooney 4 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Category: bhyve
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

Smithx10 on IRC reported several instances that appeared to hang during boot, with dmesg from the CentOS 7 guest kernel stopping abruptly in the logs recorded by SmartOS (joyent_20210729T002724Z). Taking the kernel stacks of the process in question gives us some initial clues about the problem:

... OUTPUT CUT ...
stack pointer for thread fffffd6b04d27c00 (bhyve/36 [vcpu 0]): fffff946167525f0
[ fffff946167525f0 _resume_from_idle+0x12b() ]
  fffff94616752620 swtch+0x133()
  fffff946167526c0 turnstile_block+0x25b(0, 0, fffffd6b03a206a0, fffffffffbc1c5c0, 0, 0)
  fffff94616752730 rw_enter_sleep+0x1a5(fffffd6b03a206a0, 0)
  fffff94616752790 vmm_write_lock+0x87(fffffd6b03a20580)
  fffff94616752c40 vmmdev_do_ioctl+0x2e8(fffffd6b03a20580, 766c08, fffff7ffe8f7ca60, 202003, fffffd6c04e96258, fffff94616752e18)
  fffff94616752cc0 vmm_ioctl+0x99(13100000009, 766c08, fffff7ffe8f7ca60, 202003, fffffd6c04e96258, fffff94616752e18)
  fffff94616752d00 cdev_ioctl+0x2b(13100000009, 766c08, fffff7ffe8f7ca60, 202003, fffffd6c04e96258, fffff94616752e18)
  fffff94616752d50 spec_ioctl+0x45(fffffd676db35e40, 766c08, fffff7ffe8f7ca60, 202003, fffffd6c04e96258, fffff94616752e18, 0)
  fffff94616752de0 fop_ioctl+0x5b(fffffd676db35e40, 766c08, fffff7ffe8f7ca60, 202003, fffffd6c04e96258, fffff94616752e18, 0)
  fffff94616752f00 ioctl+0x153(3, 766c08, fffff7ffe8f7ca60)
  fffff94616752f10 sys_syscall+0x1a8()
stack pointer for thread fffffd6b03a113e0 (bhyve/37 [vcpu 1]): fffff94616b87630
[ fffff94616b87630 _resume_from_idle+0x12b() ]
  fffff94616b87660 swtch+0x133()
  fffff94616b876a0 cv_wait+0x68(fffffd6afa7da40a, fffffd6afa7da3f8)
  fffff94616b87710 vcpu_set_state_locked+0x94(fffffd6afa7da000, 1, 1, 1)
  fffff94616b87770 vcpu_set_state+0x6e(fffffd6afa7da000, 1, 1, 1)
  fffff94616b87790 vcpu_lock_one+0x19(fffffd6b03a20580, 1)
  fffff94616b87c40 vmmdev_do_ioctl+0xed(fffffd6b03a20580, 767001, fffff7ffe8cd1e40, 202003, fffffd6c04e96258, fffff94616b87e18)
  fffff94616b87cc0 vmm_ioctl+0x99(13100000009, 767001, fffff7ffe8cd1e40, 202003, fffffd6c04e96258, fffff94616b87e18)
  fffff94616b87d00 cdev_ioctl+0x2b(13100000009, 767001, fffff7ffe8cd1e40, 202003, fffffd6c04e96258, fffff94616b87e18)
  fffff94616b87d50 spec_ioctl+0x45(fffffd676db35e40, 767001, fffff7ffe8cd1e40, 202003, fffffd6c04e96258, fffff94616b87e18, 0)
  fffff94616b87de0 fop_ioctl+0x5b(fffffd676db35e40, 767001, fffff7ffe8cd1e40, 202003, fffffd6c04e96258, fffff94616b87e18, 0)
  fffff94616b87f00 ioctl+0x153(3, 767001, fffff7ffe8cd1e40)
  fffff94616b87f10 sys_syscall+0x1a8()
stack pointer for thread fffffd6b90aeb800 (bhyve/38 [viona_rx_fffffd6b1e186990]): fffff94617338d40
[ fffff94617338d40 _resume_from_idle+0x12b() ]
  fffff94617338d70 swtch+0x133()
  fffff94617338db0 cv_wait+0x68(fffffd6b03a2069c, fffffd6b03a20670)
  fffff94617338e10 vmm_drv_lease_sign+0xa3(fffffd6b1bcf5a98, fffffffff8250310, fffffd6b1e186990)
  fffff94617338e50 viona_ring_lease_renew+0x34(fffffd6b1e186990)
  fffff94617338ea0 viona_worker_rx+0x98(fffffd6b1e186990, fffffd6b1e186980)
  fffff94617338ef0 viona_worker+0x21d(fffffd6b1e186990)
  fffff94617338f00 thread_start+0xb()
stack pointer for thread fffffd6b0c4d6820 (bhyve/39): fffff946133cde00
[ fffff946133cde00 _resume_from_idle+0x12b() ]
  fffff946133cde30 swtch+0x133()
  fffff946133cdea0 cv_wait_sig+0x171(fffffd6b1e186ac8, fffffd6b1e186ac0)
  fffff946133cdef0 viona_worker+0x16d(fffffd6b1e186ab8)
  fffff946133cdf00 thread_start+0xb()

We see the vcpu-0 thread attempting a VM_MUNMAP_MEMSEG (VMM_LOCK_IOC_BASE | 0x08) ioctl, blocked while waiting to acquire the VM write lock. The vcpu-1 thread is waiting on its own vCPU-specific lock, which is held by the vcpu-0 thread as part of taking the write lock. The viona RX ring is attempting to renew its vmm_drv lease, which it had to relinquish as part of the VM write lock being taken. Of most interest is the viona TX ring, which is simply waiting for a ring state transition. The state of that ring sheds more light on the situation:

fffffd6b1e186ab8::print viona_vring_t
{
    vr_link = 0xfffffd6b1e186980
    vr_lock = {
        _opaque = [ 0 ]
    }
    vr_cv = {
        _opaque = 0x1
    }
    vr_state = 0x2
    vr_state_flags = 0
    vr_xfer_outstanding = 0
    vr_worker_thread = 0xfffffd6b0c4d6820
    vr_lease = 0xfffffd6bf7c44928
    vr_txdesb = 0xfffffd6c1b661000
    vr_txiov = 0xfffffd6b832eb000
    vr_intr_enabled = 0
    vr_msi_addr = 0
    vr_msi_msg = 0
    vr_a_mutex = {
        _opaque = [ 0 ]
    }
    vr_u_mutex = {
        _opaque = [ 0 ]
    }
    vr_pa = 0xbf0c0000
    vr_size = 0x400
    vr_mask = 0x3ff
    vr_cur_aidx = 0
    vr_cur_uidx = 0
    vr_map_pages = 0xfffffd6afc6e3900
    vr_stats = {
... ELIDED FOR BREVITY ...
    }
}

That worker thread is still holding its vmm_drv lease, which is preventing the vmm_write_lock call from ever succeeding (each lease implies a reader lock on the VM instance). Why is this thread still holding its lease? After all, it is supposed to check for expiration:

        while (ring->vr_state_flags == 0) {
                /*
                 * Keeping lease renewals timely while waiting for the ring to
                 * be started is important for avoiding deadlocks.
                 */
                if (vmm_drv_lease_expired(ring->vr_lease)) {
                        if (!viona_ring_lease_renew(ring)) {
                                goto cleanup;
                        }
                }

                (void) cv_wait_sig(&ring->vr_cv, &ring->vr_lock);

                if (VRING_NEED_BAIL(ring, p)) {
                        goto cleanup;
                }
        }

Indeed, the lease itself is marked as expired:

0xfffffd6bf7c44928::print vmm_lease_t
{
    vml_node = {
        list_next = 0xfffffd6b03a20688
        list_prev = 0xfffffd6b03a20688
    }
    vml_vm = 0xfffffd6afa7da000
    vml_expired = 0x1 (B_TRUE)
    vml_expire_func = viona_ring_lease_expire_cb
    vml_expire_arg = 0xfffffd6b1e186ab8
    vml_expire_node = {
        list_next = 0
        list_prev = 0
    }
    vml_hold = 0xfffffd6b1bcf5a98
}

How might we have missed the notification from vmm_write_lock when it expired all the vmm_drv leases? The viona expiration callback offers one possibility:

static boolean_t
viona_ring_lease_expire_cb(void *arg)
{
        viona_vring_t *ring = arg;

        cv_broadcast(&ring->vr_cv);

        /* The lease will be broken asynchronously. */
        return (B_FALSE);
}

Notably, it does not take ring->vr_lock, so there is a narrow window between the vmm_drv_lease_expired and cv_wait_sig calls where the lease can be marked expired and the cv_broadcast fired before the worker has begun waiting. With no waiter present, the broadcast is lost, and the worker goes to sleep still holding a lease it does not know has expired.

We should rearrange the locking in vmm_write_lock to allow the lease expiration functions (namely viona_ring_lease_expire_cb) to acquire the necessary lock(s) to ensure reliable notification.
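
As a rough sketch of what that could look like on the viona side, assuming the callback simply takes vr_lock around its broadcast (the committed change may differ, and the lock-ordering consequences of this are discussed in the notes below):

static boolean_t
viona_ring_lease_expire_cb(void *arg)
{
        viona_vring_t *ring = arg;

        /*
         * Sketch: with vr_lock held here, the expiration cannot slip into
         * the window between the worker's vmm_drv_lease_expired check and
         * its cv_wait_sig call, since the worker performs both with
         * vr_lock held.
         */
        mutex_enter(&ring->vr_lock);
        cv_broadcast(&ring->vr_cv);
        mutex_exit(&ring->vr_lock);

        /* The lease will be broken asynchronously. */
        return (B_FALSE);
}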

Actions #1

Updated by Patrick Mooney 4 months ago

Given how tight the race is from the vmm_drv_lease_expired check to the cv_wait_sig call, it seems a bit surprising that someone had multiple instances hit this all at once. It becomes a little more reasonable in context, however: these CNs (compute nodes) were being rebooted after a PI (platform image) update, meaning high load as all the instances started at once. To further compound the issue, at least one of the stuck instances had very tight CPU resource constraints (cap and shares), further increasing the chances of preemption.

Actions #2

Updated by Patrick Mooney 4 months ago

The simplest solution would be to wrap the cv_broadcast call in viona_ring_lease_expire_cb with entry/exit of vr_lock. That brings up a different lock-ordering problem, though: when a viona ring acquires a lease, it holds viona_vring_t`vr_lock and then enters vmm_softc_t`vmm_lease_lock. For the lease expire notification, vmm_write_lock holds vmm_softc_t`vmm_lease_lock (to protect the list it is iterating over) and then would need to enter viona_vring_t`vr_lock. This would be a hopeless deadlock, instead of the current livelock (which can be knocked loose by a signal to the stuck process).

In order to make this tenable, the vmm_write_lock call needs to drop vmm_lease_lock while performing the expire notifications. We can guarantee that vmm_lease_list remains unchanged for the duration, which makes such an approach safe.
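
A minimal sketch of that rearrangement follows, using the vmm_lease_t fields shown in the ::print output above (vml_expired, vml_expire_func, vml_expire_arg) and assuming a vmm_softc_t pointer sc whose lease list and lock are named vmm_lease_list and vmm_lease_lock as in the discussion; the bookkeeping that blocks new lease signings and waits for holders to release is elided, and the actual change may be structured differently:

        vmm_lease_t *lease;

        /*
         * Mark every outstanding lease as expired while holding
         * vmm_lease_lock, then drop that lock before running the expire
         * callbacks so they are free to take their own locks (such as
         * viona's vr_lock).  New lease signings are assumed to be blocked
         * for the duration, keeping vmm_lease_list stable.
         */
        mutex_enter(&sc->vmm_lease_lock);
        for (lease = list_head(&sc->vmm_lease_list); lease != NULL;
            lease = list_next(&sc->vmm_lease_list, lease)) {
                lease->vml_expired = B_TRUE;
        }
        mutex_exit(&sc->vmm_lease_lock);

        /* Notify the lease holders without vmm_lease_lock held. */
        for (lease = list_head(&sc->vmm_lease_list); lease != NULL;
            lease = list_next(&sc->vmm_lease_list, lease)) {
                (void) lease->vml_expire_func(lease->vml_expire_arg);
        }

        /*
         * Re-enter vmm_lease_lock and wait for the holders to break their
         * leases before the write lock can be granted (elided).
         */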

Actions #3

Updated by Patrick Mooney 3 months ago

Smithx10 has booted a PI featuring the proposed fix for this issue (along with the #14032 patch) and it seems to be fine for normal use. Given how tight the race is, the erroneous behavior was seldom observed in the first place.

Actions #4

Updated by Patrick Mooney 3 months ago

As this race has not otherwise been seen in the wild, reproduction has been difficult. The normal battery of guests was booted on a platform bearing the fix to check for regressions. The bhyvectl --wrlock-cycle utility was used to at least exercise the vmm_drv lease renewal code paths. Using DTrace, I could watch the existing viona instances receive the lease expiration notification (via viona_ring_lease_expire_cb) and then subsequently renew their leases (calling vmm_drv_lease_sign).

Actions #5

Updated by Electric Monk 3 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 6703a0e87b73700da5f26904f87f406b8ca509c3

commit  6703a0e87b73700da5f26904f87f406b8ca509c3
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2021-09-01T20:23:32.000Z

    14026 livelock in vmm_drv interface
    Reviewed by: Dan McDonald <danmcd@joyent.com>
    Reviewed by: Dan Cross <cross@oxidecomputer.com>
    Approved by: Robert Mustacchi <rm@fingolfin.org>
