Bug #13861
closedbhyve apicv logic could take more care
100%
Description
While scrutinizing consumers of the bhyve VM shim (vmspace, pmap, etc), my attention was drawn to how the VMX logic handles changes in APICv state during an instance lifetime. In particular, the transition from xAPIC to x2APIC mode, where access to the MMIO range is cut off and the x2APIC MSRs are "enabled". Currently, bhyve is both disabling the APIC MMIO access control (PROCBASED2_VIRTUALIZE_APIC_ACCESSES
) as well as removing the MMIO mapping in the EPT tables for the APIC access page. With those done, the x2APIC MSRs are enabled in the MSR bitmap, along with the PROCBASED2_VIRTUALIZE_X2APIC_MODE
control to activate the x2apic acceleration bits. One thing to note is that the unmapping of the APIC access page and the modification of the MSR bitmap are VM-wide operation (the vmspace and the bitmap apply to the whole guest). The former could cause potential issues if any other vCPUs attempt to access the APIC via MMIO, and the latter is literally called out as prohibited in the SDM:
Certain fields in the VMCS point to external data structures (for example: the MSR bitmap, the I/O bitmaps). If a logical processor is in VMX non-root operation, none of the external structures referenced by that logical processor's current VMCS should be modified by any logical processor or DMA. Before updating one of these structures, the VMM must ensure that no logical processor whose current VMCS references the structure is in VMX non-root operation.
Other VMMs have common solutions to both of these issues:
1. Leave the APIC access page mapped for the life of the VM. Disabling the VIRTUALIZE_APIC_ACCESSES
control on a per-vcpu basis will be adequate to prevent APIC MMIOs from being processed normally. It's true that MMIO reads/writes would then land on that physical page of memory, but the SDM notes that with x2APIC enabled the MMIO interface would show "Behavior identical to xAPIC in globally disabled state". This is not a corner of the architecture we expect guests to explore.
2. Use a per-vCPU MSR bitmap. The x2APIC state can that be individually set on each vCPU as it makes those transitions and the MSR bitmap will never be altered while the associated VMCS used to run that vCPU.
It should be noted that booting bhyve guests with the x2APIC enabled (using the -x
flag) does not appear to be successful today. This is clearly an area of the software which has not received much scrutiny, so functional expectations are low.
Related issues
Updated by Patrick Mooney over 1 year ago
- Related to Bug #13896: bhyve VM interfaces should be better fit added
Updated by Patrick Mooney over 1 year ago
For testing, I booted up an Intel machine running these bits. I first confirmed that the normal suite of guests booted and ran. After that, I started looking for specifics related to the change. I inspected the MSR bitmaps initialized by the new logic. The results matched what I expected:
> fffffe5a2ee8e000,200=np8+ | ::grep '*.!=ffffffffffffffff' | /nap 0xfffffe5a2ee8e000: 0xfffffe5a2ee8e000: 0xfffffffffffeffff 0xfffffe5a2ee8e028: 0xff8fffffffffffff 0xfffffe5a2ee8e410: 0xffffffffffffffe0 0xfffffe5a2ee8e420: 0xfffffffffffffff8 0xfffffe5a2ee8e828: 0xff8fffffffffffff 0xfffffe5a2ee8ec10: 0xffffffffffffffe0 0xfffffe5a2ee8ec20: 0xfffffffffffffff8
That corresponds to MSR_TSC being read-only, while the sysenter, syscall, fsgsbase, and efer MSRs are read-write.
In addition to that, I checked that the per-instance APIC access page (for VMX to detect APIC accesses) was mapped in the expected address in the instance vmspace.
> fffffe5a2ee44000::print 'struct vm' vmspace | ::print -a 'struct vmspace' fffffe5a2d418010 { fffffe5a2d418010 vm_map = { fffffe5a2d418010 vmm_space = 0 } fffffe5a2d418018 vms_lock = { fffffe5a2d418018 _opaque = [ 0 ] } fffffe5a2d418020 vms_map_changing = 0 (0) fffffe5a2d418028 vms_pmap = { fffffe5a2d418028 pm_pml4 = 0xfffffe5a2ee01000 fffffe5a2d418030 pm_active = { fffffe5a2d418030 cpub = [ 0, 0, 0, 0 ] } fffffe5a2d418050 pm_eptgen = 0x2f3ed fffffe5a2d418058 pm_type = 0x1 (PT_EPT) fffffe5a2d418060 pm_ops = ept_ops fffffe5a2d418068 pm_impl = 0xfffffe5a1fa13100 } fffffe5a2d418070 vms_size = 0x400000000000 fffffe5a2d418078 vms_maplist = { fffffe5a2d418078 list_size = 0x38 fffffe5a2d418080 list_offset = 0 fffffe5a2d418088 list_head = { fffffe5a2d418088 list_next = 0xfffffe5a1fa083c0 fffffe5a2d418090 list_prev = 0xfffffe5a192fe870 } } } > fffffe5a2d418078::walk list | ::print vmspace_mapping_t { vmsm_node = { list_next = 0xfffffe59f0ecf4b0 list_prev = 0xfffffe5a2d418088 } vmsm_object = 0xfffffe5a2e1e1670 vmsm_addr = 0xfee00000 vmsm_len = 0x1000 vmsm_offset = 0 vmsm_prot = 0x3 } { vmsm_node = { list_next = 0xfffffe5a2d429238 list_prev = 0xfffffe5a1fa083c0 } vmsm_object = 0xfffffe5a2e6ccbd0 vmsm_addr = 0x100000000 vmsm_len = 0x40000000 vmsm_offset = 0 vmsm_prot = 0x7 } { vmsm_node = { list_next = 0xfffffe5a192fe870 list_prev = 0xfffffe59f0ecf4b0 } vmsm_object = 0xfffffe5a2e6cc480 vmsm_addr = 0 vmsm_len = 0xc0000000 vmsm_offset = 0 vmsm_prot = 0x7 } { vmsm_node = { list_next = 0xfffffe5a2d418088 list_prev = 0xfffffe5a2d429238 } vmsm_object = 0xfffffe5a2e6cc438 vmsm_addr = 0xffe20000 vmsm_len = 0x1e0000 vmsm_offset = 0xe20000 vmsm_prot = 0x5 } > fffffe5a2e1e1670::print 'struct vm_object' { vmo_refcnt = 0x1 vmo_type = 0x6 vmo_size = 0x1000000 vmo_pager = vm_object_pager_sg vmo_data = 0xfffffe5a26484240 vmo_lock = { _opaque = [ 0 ] } vmo_attr = '\0' } > fffffe5a26484240::print 'struct sglist' { sg_lock = { _opaque = [ 0 ] } sg_refcnt = 0x1 sg_len = 0x1 sg_next = 0x1 sg_entries = [ ... ] } > fffffe5a26484240::print 'struct sglist' sg_entries[0] mdb: index 0 is outside of array bounds [0 .. ffffffff] sg_entries[0] = { sg_entries[0].sge_pa = 0x84fed9000 sg_entries[0].sge_len = 0x1000 } > fffffe5a2ee44000::print 'struct vm' cookie | ::print 'struct vmx' apic_access_page apic_access_page = 0xfffffe5a2eed3000 > fffffe5a2eed3000::vtop virtual fffffe5a2eed3000 mapped to physical 84fed9000
Updated by Patrick Mooney over 1 year ago
After running the live tests. I dumped the test machine and a subsequent findleaks check yielded no issues.
Updated by Patrick Mooney over 1 year ago
Mike Zeller tested this independently, running an Ubuntu workload on a SmartOS machine running the relevant bits.
Updated by Electric Monk over 1 year ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
git commit 6b641d7a35808436d7283b7057a01791e2e2a4e2
commit 6b641d7a35808436d7283b7057a01791e2e2a4e2 Author: Patrick Mooney <pmooney@pfmooney.com> Date: 2021-07-02T22:08:24.000Z 13861 bhyve apicv logic could take more care Reviewed by: Dan Cross <cross@oxidecomputer.com> Reviewed by: Mike Zeller <mike.zeller@joyent.com> Approved by: Dan McDonald <danmcd@joyent.com>