Project

General

Profile

Bug #13309

bhyve movs emulation leaks mem refcnt

Added by Patrick Mooney 4 months ago. Updated 4 months ago.

Status:
Closed
Priority:
Normal
Category:
bhyve
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
bhyve
Gerrit CR:

Description

Several users reported via IRC that they were seeing bhyve leak guest-sized chunks of memory when rebooting win2019 guests. Checking vmm_list and vmm_destroy_list it did not appear to be a case where deferred clean-up/destroy was to blame. One of the users was available to run a test for me, so they recorded the struct vm state before rebooting the instance (which did seem to leak the guest memory) before initiating a panic to capture a dump. Looking at the vm_object@s which were associated with the now-destroy instance, I could see that their ref-counts were still high. This prevented the memory backing them from being freed. I fired up a win2019 instance locally and found that it leaked memory in the same way. Armed with this information, I traced all @vm_object_reference and vm_object_deallocate calls (which is where those refcounts are modified) to see where we might be missing a de-allocate. This lead me to vie_emulate_movs:

        error = vm_copy_setup(vm, vcpuid, paging, srcaddr, opsize, PROT_READ,
            copyinfo, nitems(copyinfo), &fault);
        if (error == 0) {
                if (fault)
                        goto done;      /* Resume guest to handle fault */

                /*
                 * case (2): read from system memory and write to mmio.
                 */
                vm_copyin(vm, vcpuid, copyinfo, &val, opsize);
                vm_copy_teardown(vm, vcpuid, copyinfo, nitems(copyinfo));
                error = vie_mmio_write(vie, vm, vcpuid, gpa, val, opsize);
                if (error)
                        goto done;
        } else {
                /*
                 * 'vm_copy_setup()' is expected to fail for cases (3) and (4)
                 * if 'srcaddr' is in the mmio space.
                 */

                if (vie_get_gla(vie, vm, vcpuid, opsize, vie->addrsize,
                    PROT_WRITE, VM_REG_GUEST_ES, VM_REG_GUEST_RDI,
                    &dstaddr) != 0) {
                        goto done;
                }

                error = vm_copy_setup(vm, vcpuid, paging, dstaddr, opsize,
                    PROT_WRITE, copyinfo, nitems(copyinfo), &fault);
                if (error == 0) {
                        if (fault)
                                goto done;    /* Resume guest to handle fault */

                        /*
                         * case (3): read from MMIO and write to system memory.
                         *
                         * A MMIO read can have side-effects so we
                         * commit to it only after vm_copy_setup() is
                         * successful. If a page-fault needs to be
                         * injected into the guest then it will happen
                         * before the MMIO read is attempted.
                         */
                        error = vie_mmio_read(vie, vm, vcpuid, gpa, &val,
                            opsize);
                        if (error)
                                goto done;

                        vm_copyout(vm, vcpuid, &val, copyinfo, opsize);
                        vm_copy_teardown(vm, vcpuid, copyinfo,
                            nitems(copyinfo));

The "case 3" path is of specific interest. Here we're doing an MMIO from an emulated device and storing the result in guest memory. The copy-setup (which establishes a hold on the memory) occurs before the MMIO is attempted. If that MMIO address is not handled in-kernel, the vie_mmio_read call will fail with an error (indicating that fulfillment needs to occur in userspace). It's there that we're leaking the vm_object reference, since the goto done skips the vm_copy_teardown call if the read needs to be processed in userspace.

As a precaution, I looked at the other vm_copy_teardown calls in the instruction emulation. All of them (save for the one in question) had proper flow control to ensure they were called to conclude the emulation.

Also available in: Atom PDF