Bug #13309

bhyve movs emulation leaks mem refcnt

Added by Patrick Mooney 10 months ago. Updated 10 months ago.

Status:
Closed
Priority:
Normal
Category:
bhyve
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
bhyve
Gerrit CR:

Description

Several users reported via IRC that they were seeing bhyve leak guest-sized chunks of memory when rebooting win2019 guests. Checking vmm_list and vmm_destroy_list, it did not appear to be a case where deferred clean-up/destroy was to blame. One of the users was available to run a test for me, so they recorded the struct vm state before rebooting the instance (which did appear to leak the guest memory) and then initiated a panic to capture a dump. Looking at the vm_objects which were associated with the now-destroyed instance, I could see that their ref-counts were still high. This prevented the memory backing them from being freed. I fired up a win2019 instance locally and found that it leaked memory in the same way. Armed with this information, I traced all vm_object_reference and vm_object_deallocate calls (which is where those refcounts are modified) to see where we might be missing a de-allocate. This led me to vie_emulate_movs:

        error = vm_copy_setup(vm, vcpuid, paging, srcaddr, opsize, PROT_READ,
            copyinfo, nitems(copyinfo), &fault);
        if (error == 0) {
                if (fault)
                        goto done;      /* Resume guest to handle fault */

                /*
                 * case (2): read from system memory and write to mmio.
                 */
                vm_copyin(vm, vcpuid, copyinfo, &val, opsize);
                vm_copy_teardown(vm, vcpuid, copyinfo, nitems(copyinfo));
                error = vie_mmio_write(vie, vm, vcpuid, gpa, val, opsize);
                if (error)
                        goto done;
        } else {
                /*
                 * 'vm_copy_setup()' is expected to fail for cases (3) and (4)
                 * if 'srcaddr' is in the mmio space.
                 */

                if (vie_get_gla(vie, vm, vcpuid, opsize, vie->addrsize,
                    PROT_WRITE, VM_REG_GUEST_ES, VM_REG_GUEST_RDI,
                    &dstaddr) != 0) {
                        goto done;
                }

                error = vm_copy_setup(vm, vcpuid, paging, dstaddr, opsize,
                    PROT_WRITE, copyinfo, nitems(copyinfo), &fault);
                if (error == 0) {
                        if (fault)
                                goto done;    /* Resume guest to handle fault */

                        /*
                         * case (3): read from MMIO and write to system memory.
                         *
                         * A MMIO read can have side-effects so we
                         * commit to it only after vm_copy_setup() is
                         * successful. If a page-fault needs to be
                         * injected into the guest then it will happen
                         * before the MMIO read is attempted.
                         */
                        error = vie_mmio_read(vie, vm, vcpuid, gpa, &val,
                            opsize);
                        if (error)
                                goto done;

                        vm_copyout(vm, vcpuid, &val, copyinfo, opsize);
                        vm_copy_teardown(vm, vcpuid, copyinfo,
                            nitems(copyinfo));

The "case 3" path is of specific interest. Here we're doing an MMIO read from an emulated device and storing the result in guest memory. The copy-setup (which establishes a hold on the memory) occurs before the MMIO is attempted. If that MMIO address is not handled in-kernel, the vie_mmio_read call will fail with an error (indicating that fulfillment needs to occur in userspace). It's there that we leak the vm_object reference, since the goto done skips the vm_copy_teardown call when the read needs to be processed in userspace.

As a precaution, I looked at the other vm_copy_teardown calls in the instruction emulation. All of them (save for the one in question) had proper flow control to ensure they were called to conclude the emulation.

Actions #1

Updated by Electric Monk 10 months ago

  • Gerrit CR set to 1059
Actions #2

Updated by Rick V 10 months ago

I'd seen it with other guests, but my Windows 10 guest was big enough (16GB) to prevent other zones from booting, as well as starving the swap watchdog twice after I'd updated to a build with #13802. (With an 8GB leak, I could purge the ARC up to twice before the host would lock up without warning pre-#13802.)

Actions #3

Updated by Patrick Mooney 10 months ago

After reproducing the issue on a stock BE, I booted onto a platform with the fix and ran through the same workload (Windows Server 2019). After numerous boot/halt/destroy cycles, VMM memory usage remained at zero, as it should. I also booted the other smoke-test guests to confirm that the modified codepath in question had no adverse effects for them. All ran (and cleaned up after themselves) as expected. A dump of the system after the testing reported nothing in ::findleaks (besides the expected ACPI noise).

Actions #4

Updated by Electric Monk 10 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

commit  83cd75bb2949d26e6eb38ddefc60fdeed1909643
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2020-11-13T21:09:09.000Z

    13309 bhyve movs emulation leaks mem refcnt
    Reviewed by: Dan McDonald <danmcd@joyent.com>
    Reviewed by: Jason King <jason.brian.king@gmail.com>
    Reviewed by: Ryan Zezeski <ryan@zinascii.com>
    Reviewed by: Rick V <rick@snowlight.net>
    Approved by: Robert Mustacchi <rm@fingolfin.org>
