Project

General

Profile

Actions

Bug #13861

closed

bhyve apicv logic could take more care

Added by Patrick Mooney 3 months ago. Updated 3 months ago.

Status:
Closed
Priority:
Normal
Category:
bhyve
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

While scrutinizing consumers of the bhyve VM shim (vmspace, pmap, etc), my attention was drawn to how the VMX logic handles changes in APICv state during an instance lifetime. In particular, the transition from xAPIC to x2APIC mode, where access to the MMIO range is cut off and the x2APIC MSRs are "enabled". Currently, bhyve is both disabling the APIC MMIO access control (PROCBASED2_VIRTUALIZE_APIC_ACCESSES) as well as removing the MMIO mapping in the EPT tables for the APIC access page. With those done, the x2APIC MSRs are enabled in the MSR bitmap, along with the PROCBASED2_VIRTUALIZE_X2APIC_MODE control to activate the x2apic acceleration bits. One thing to note is that the unmapping of the APIC access page and the modification of the MSR bitmap are VM-wide operation (the vmspace and the bitmap apply to the whole guest). The former could cause potential issues if any other vCPUs attempt to access the APIC via MMIO, and the latter is literally called out as prohibited in the SDM:

Certain fields in the VMCS point to external data structures (for example: the MSR bitmap, the I/O bitmaps). If a logical processor is in VMX non-root operation, none of the external structures referenced by that logical processor's current VMCS should be modified by any logical processor or DMA. Before updating one of these structures, the VMM must ensure that no logical processor whose current VMCS references the structure is in VMX non-root operation.

Other VMMs have common solutions to both of these issues:
1. Leave the APIC access page mapped for the life of the VM. Disabling the VIRTUALIZE_APIC_ACCESSES control on a per-vcpu basis will be adequate to prevent APIC MMIOs from being processed normally. It's true that MMIO reads/writes would then land on that physical page of memory, but the SDM notes that with x2APIC enabled the MMIO interface would show "Behavior identical to xAPIC in globally disabled state". This is not a corner of the architecture we expect guests to explore.
2. Use a per-vCPU MSR bitmap. The x2APIC state can that be individually set on each vCPU as it makes those transitions and the MSR bitmap will never be altered while the associated VMCS used to run that vCPU.

It should be noted that booting bhyve guests with the x2APIC enabled (using the -x flag) does not appear to be successful today. This is clearly an area of the software which has not received much scrutiny, so functional expectations are low.


Related issues

Related to illumos gate - Bug #13896: bhyve VM interfaces should be better fitIn ProgressPatrick Mooney

Actions
Actions #1

Updated by Electric Monk 3 months ago

  • Gerrit CR set to 1538
Actions #2

Updated by Patrick Mooney 3 months ago

  • Related to Bug #13896: bhyve VM interfaces should be better fit added
Actions #3

Updated by Patrick Mooney 3 months ago

For testing, I booted up an Intel machine running these bits. I first confirmed that the normal suite of guests booted and ran. After that, I started looking for specifics related to the change. I inspected the MSR bitmaps initialized by the new logic. The results matched what I expected:

> fffffe5a2ee8e000,200=np8+ | ::grep '*.!=ffffffffffffffff' | /nap
0xfffffe5a2ee8e000:
0xfffffe5a2ee8e000:             0xfffffffffffeffff
0xfffffe5a2ee8e028:             0xff8fffffffffffff
0xfffffe5a2ee8e410:             0xffffffffffffffe0
0xfffffe5a2ee8e420:             0xfffffffffffffff8
0xfffffe5a2ee8e828:             0xff8fffffffffffff
0xfffffe5a2ee8ec10:             0xffffffffffffffe0
0xfffffe5a2ee8ec20:             0xfffffffffffffff8

That corresponds to MSR_TSC being read-only, while the sysenter, syscall, fsgsbase, and efer MSRs are read-write.

In addition to that, I checked that the per-instance APIC access page (for VMX to detect APIC accesses) was mapped in the expected address in the instance vmspace.

> fffffe5a2ee44000::print 'struct vm' vmspace | ::print -a 'struct vmspace'
fffffe5a2d418010 {
    fffffe5a2d418010 vm_map = {
        fffffe5a2d418010 vmm_space = 0
    }
    fffffe5a2d418018 vms_lock = {
        fffffe5a2d418018 _opaque = [ 0 ]
    }
    fffffe5a2d418020 vms_map_changing = 0 (0)
    fffffe5a2d418028 vms_pmap = {
        fffffe5a2d418028 pm_pml4 = 0xfffffe5a2ee01000
        fffffe5a2d418030 pm_active = {
            fffffe5a2d418030 cpub = [ 0, 0, 0, 0 ]
        }
        fffffe5a2d418050 pm_eptgen = 0x2f3ed
        fffffe5a2d418058 pm_type = 0x1 (PT_EPT)
        fffffe5a2d418060 pm_ops = ept_ops
        fffffe5a2d418068 pm_impl = 0xfffffe5a1fa13100
    }
    fffffe5a2d418070 vms_size = 0x400000000000
    fffffe5a2d418078 vms_maplist = {
        fffffe5a2d418078 list_size = 0x38
        fffffe5a2d418080 list_offset = 0
        fffffe5a2d418088 list_head = {
            fffffe5a2d418088 list_next = 0xfffffe5a1fa083c0
            fffffe5a2d418090 list_prev = 0xfffffe5a192fe870
        }
    }
}
> fffffe5a2d418078::walk list | ::print vmspace_mapping_t
{
    vmsm_node = {
        list_next = 0xfffffe59f0ecf4b0
        list_prev = 0xfffffe5a2d418088
    }
    vmsm_object = 0xfffffe5a2e1e1670
    vmsm_addr = 0xfee00000
    vmsm_len = 0x1000
    vmsm_offset = 0
    vmsm_prot = 0x3
}
{
    vmsm_node = {
        list_next = 0xfffffe5a2d429238
        list_prev = 0xfffffe5a1fa083c0
    }
    vmsm_object = 0xfffffe5a2e6ccbd0
    vmsm_addr = 0x100000000
    vmsm_len = 0x40000000
    vmsm_offset = 0
    vmsm_prot = 0x7
}
{
    vmsm_node = {
        list_next = 0xfffffe5a192fe870
        list_prev = 0xfffffe59f0ecf4b0
    }
    vmsm_object = 0xfffffe5a2e6cc480
    vmsm_addr = 0
    vmsm_len = 0xc0000000
    vmsm_offset = 0
    vmsm_prot = 0x7
}
{
    vmsm_node = {
        list_next = 0xfffffe5a2d418088
        list_prev = 0xfffffe5a2d429238
    }
    vmsm_object = 0xfffffe5a2e6cc438
    vmsm_addr = 0xffe20000
    vmsm_len = 0x1e0000
    vmsm_offset = 0xe20000
    vmsm_prot = 0x5
}
> fffffe5a2e1e1670::print 'struct vm_object'
{
    vmo_refcnt = 0x1
    vmo_type = 0x6
    vmo_size = 0x1000000
    vmo_pager = vm_object_pager_sg
    vmo_data = 0xfffffe5a26484240
    vmo_lock = {
        _opaque = [ 0 ]
    }
    vmo_attr = '\0'
}
> fffffe5a26484240::print 'struct sglist'
{
    sg_lock = {
        _opaque = [ 0 ]
    }
    sg_refcnt = 0x1
    sg_len = 0x1
    sg_next = 0x1
    sg_entries = [ ... ]
}
> fffffe5a26484240::print 'struct sglist' sg_entries[0]
mdb: index 0 is outside of array bounds [0 .. ffffffff]
sg_entries[0] = {
    sg_entries[0].sge_pa = 0x84fed9000
    sg_entries[0].sge_len = 0x1000
}
> fffffe5a2ee44000::print 'struct vm' cookie | ::print 'struct vmx' apic_access_page
apic_access_page = 0xfffffe5a2eed3000
> fffffe5a2eed3000::vtop
virtual fffffe5a2eed3000 mapped to physical 84fed9000
Actions #4

Updated by Patrick Mooney 3 months ago

After running the live tests. I dumped the test machine and a subsequent findleaks check yielded no issues.

Actions #5

Updated by Patrick Mooney 3 months ago

Mike Zeller tested this independently, running an Ubuntu workload on a SmartOS machine running the relevant bits.

Actions #6

Updated by Electric Monk 3 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 6b641d7a35808436d7283b7057a01791e2e2a4e2

commit  6b641d7a35808436d7283b7057a01791e2e2a4e2
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2021-07-02T22:08:24.000Z

    13861 bhyve apicv logic could take more care
    Reviewed by: Dan Cross <cross@oxidecomputer.com>
    Reviewed by: Mike Zeller <mike.zeller@joyent.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

Actions

Also available in: Atom PDF