Project

General

Profile

Actions

Bug #14281

closed

bhyve bungles kstat clean-up

Added by Patrick Mooney 8 months ago. Updated 8 months ago.

Status:
Closed
Priority:
Normal
Category:
bhyve
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

While playing with the bhyve kstats exposed by #13723, I hit a panic:

panic[cpu15]/thread=fffffe59eb8027e0:
BAD TRAP: type=e (#pf Page fault) rp=fffffe007ae3d910 addr=fffffe8bf48b83d0

kstat:
#pf Page fault
Bad kernel fault at addr=0xfffffe8bf48b83d0
pid=106800, pc=0xfffffffff7dc85bb, sp=0xfffffe007ae3da00, eflags=0x10286
cr0: 80050033<pg,wp,ne,et,mp,pe>  cr4: 3406f8<smap,smep,osxsav,xmme,fxsr,pge,mce,pae,pse,de>
cr2: fffffe8bf48b83d0
cr3: 841007000
cr8: 0

        rdi: fffffe8bf4898000 rsi:                0 rdx:           200000
        rcx:              18c  r8: fffffe007ae3deb0  r9:                0
        rax: fffffe8bf48b8000 rbx: fffffe8bf48b8000 rbp: fffffe007ae3da30
        r10: fffffffffb88512c r11:                1 r12: fffffe8bf4898110
        r13:                0 r14:          8bfe860 r15: fffffe8bf4898000
        fsb:                0 gsb: fffffe59d3e75000  ds:               4b
         es:               4b  fs:                0  gs:              1c3
        trp:                e err:                0 rip: fffffffff7dc85bb
         cs:               30 rfl:            10286 rsp: fffffe007ae3da00
         ss:               38

fffffe007ae3d810 unix:rewrite_syscall+0 ()
fffffe007ae3d900 unix:trap+12b7 ()
fffffe007ae3d910 unix:_cmntrap+1cd ()
fffffe007ae3da30 vmm:vmm_kstat_update_vcpu+6b ()
fffffe007ae3dc30 kstat:read_kstat_data+f5 ()
fffffe007ae3dc70 kstat:kstat_ioctl+5b ()
fffffe007ae3dcb0 genunix:cdev_ioctl+2b ()
fffffe007ae3dd00 specfs:spec_ioctl+45 ()
fffffe007ae3dd90 genunix:fop_ioctl+5b ()
fffffe007ae3deb0 genunix:ioctl+153 ()
fffffe007ae3df00 unix:brand_sys_syscall32+290 ()
> $C
fffffe007ae3da30 vmm_kstat_update_vcpu+0x6b(fffffe8bf4898000, 0)
fffffe007ae3dc30 read_kstat_data+0xf5(fffffe007ae3ddc8, 8bfe860, 100001)
fffffe007ae3dc70 kstat_ioctl+0x5b(7d00000000, 4b02, 8bfe860, 100001, fffffe59f7ecc2a8, fffffe007ae3ddc8)
fffffe007ae3dcb0 cdev_ioctl+0x2b(7d00000000, 4b02, 8bfe860, 100001, fffffe59f7ecc2a8, fffffe007ae3ddc8)
fffffe007ae3dd00 spec_ioctl+0x45(fffffe59e3abf800, 4b02, 8bfe860, 100001, fffffe59f7ecc2a8, fffffe007ae3ddc8, 0)
fffffe007ae3dd90 fop_ioctl+0x5b(fffffe59e3abf800, 4b02, 8bfe860, 100001, fffffe59f7ecc2a8, fffffe007ae3ddc8, 0)
fffffe007ae3deb0 ioctl+0x153(3, 4b02, 8bfe860)
fffffe007ae3df00 sys_syscall32+0x20f()
> fffffe8bf4898000::print kstat_t
{
    ks_crtime = 0x35943c61d2a83
    ks_next = 0
    ks_kid = 0x1d8d
    ks_module = [ "vmm" ]
    ks_resv = 0
    ks_instance = 0x1d
    ks_name = [ "vcpu0" ]
    ks_type = 0x1
    ks_class = [ "misc" ]
    ks_flags = 0
    ks_data = 0xfffffe8bf4898110
    ks_ndata = 0x7
    ks_data_size = 0x150
    ks_snaptime = 0x35943c61d2a83
    ks_update = vmm_kstat_update_vcpu
    ks_private = 0xfffffe8bf48b8000
    ks_snapshot = default_kstat_snapshot
    ks_lock = 0
}
> fffffe8bf4898000::print kstat_t ks_private | ::whatis
fffffe8bf48b8000 is fffffe8bf48b1000+7000, freed from the heap vmem arena:
            ADDR TYPE            START              END             SIZE
                                THREAD        TIMESTAMP
fffffe5e5ba018c8 FREE fffffe8bf48b1000 fffffe8bf53a7000         11493376

The bhyve kstats stash a reference to the struct vm in ks_private, so trying to access that after it's freed is not great. The kstat update is pretty straightforward in that regard:

int
vmm_kstat_update_vcpu(struct kstat *ksp, int rw)
{
        struct vm *vm = ksp->ks_private;
        vmm_vcpu_kstats_t *vvk = ksp->ks_data;
        const int vcpuid = vvk->vvk_vcpu.value.ui32;
        struct vcpu *vcpu = &vm->vcpu[vcpuid];

        ASSERT3U(vcpuid, <, VM_MAXCPU);

        vvk->vvk_time_init.value.ui64 = vcpu->ustate_total[VU_INIT];
        vvk->vvk_time_run.value.ui64 = vcpu->ustate_total[VU_RUN];
        vvk->vvk_time_idle.value.ui64 = vcpu->ustate_total[VU_IDLE];
        vvk->vvk_time_emu_kern.value.ui64 = vcpu->ustate_total[VU_EMU_KERN];
        vvk->vvk_time_emu_user.value.ui64 = vcpu->ustate_total[VU_EMU_USER];
        vvk->vvk_time_sched.value.ui64 = vcpu->ustate_total[VU_SCHED];

        return (0);
}

Why would that be freed? The VM in question was undergoing recreation, having been just destroyed and then subsequently created. If we look at the logic for VM destruction, it becomes clearer:

static int
vmm_do_vm_destroy_locked(vmm_softc_t *sc, boolean_t clean_zsd,
    boolean_t *hma_release)
{

...
        if (sc->vmm_is_open) {
                list_insert_tail(&vmm_destroy_list, sc);
                sc->vmm_flags |= VMM_DESTROY;
        } else {
                vm_destroy(sc->vmm_vm);
                vmm_kstat_fini(sc);
                ddi_soft_state_free(vmm_statep, minor);
                id_free(vmm_minors, minor);
                *hma_release = B_TRUE;
        }

...
}

static int
vmm_close(dev_t dev, int flag, int otyp, cred_t *credp)
{

...
        /*
         * If this VM was destroyed while the vmm device was open, then
         * clean it up now that it is closed.
         */
        if (sc->vmm_flags & VMM_DESTROY) {
                list_remove(&vmm_destroy_list, sc);
                vm_destroy(sc->vmm_vm);
                ddi_soft_state_free(vmm_statep, minor);
                id_free(vmm_minors, minor);
                hma_release = B_TRUE;
        }

...
}

So it appears we have two problems here:
1. The vmm_close after destroy case (where actual VM destruction is deferred until the device itself is closed) completely misses its needed vmm_kstat_fini call, leaving kstats lingering around with a reference to the now-destroyed VM.
2. The vmm_do_vm_destroy_locked case destroys the VM before cleaning up the associated kstats, leaving a short window for use-after-free if those stats are accessed during such a window.

Both of these should be addressed to make the bhyve kstats situation more correct.


Related issues

Related to illumos gate - Bug #13723: expose bhyve metrics as kstatsClosedPatrick Mooney

Actions
Actions #1

Updated by Electric Monk 8 months ago

  • Gerrit CR set to 1840
Actions #2

Updated by Patrick Mooney 8 months ago

  • Related to Bug #13723: expose bhyve metrics as kstats added
Actions #3

Updated by Patrick Mooney 8 months ago

For testing, I constructed a contrived example for case 1 (deferred vm_destroy leaving lingering kstats). On an unpatched machine, I was able to observe those kstats as "present" even after the deferred destruction of the vm completed (seen by tracing the call via dtrace, and then checking for an empty /dev/vmm). When running the proposed patch, there were no lingering kstats after the same set of actions.

The second case (the use-after-free race) was not observed causing problems in the wild. It was checked by following the vmm_kstat_fini and vm_destroy calls via dtrace to confirm their proper ordering to avoid the potential issue.

In addition to those specific tests, the machine running the test bits was forced to dump, so it could be checked with ::findleaks. It was clean, modulo the "expected" entries.

Actions #4

Updated by Electric Monk 8 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 6884664d1dc95e1917a534df2163ccd6f5c19b69

commit  6884664d1dc95e1917a534df2163ccd6f5c19b69
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2021-12-02T23:08:22.000Z

    14281 bhyve bungles kstat clean-up
    Reviewed by: Rich Lowe <richlowe@richlowe.net>
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Approved by: Dan McDonald <danmcd@joyent.com>

Actions

Also available in: Atom PDF