Bug #14266

closed

bhyve mishandles TLB flush on VMX

Added by Patrick Mooney 9 months ago. Updated 9 months ago.

Status: Closed
Priority: Normal
Category: bhyve
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

After the integration of #13896, some community members reported issues running bhyve workloads on Intel-based machines with current bits. The symptoms were somewhat sporadic: some saw instances which failed to boot, citing triple-fault errors, while another had a guest in which userspace processes hit strange and seemingly random SIGSEGV errors (including fault addresses which made no sense). One test case in particular received focus: an old UEFI ROM (including CSM support) used to boot OmniOSCE Bloody would triple-fault early in the illumos boot process.

Output from the bhyve process:

pci_fbuf: mmap_memseg failed
pci_fbuf: munmap_memseg failed
bhyve-bhyvetest: pci_passthru: map_pptdev_mmio failed
bhyve-bhyvetest: pci_passthru: unmap_pptdev_mmio failed
pci_fbuf: mmap_memseg failed
pci_fbuf: munmap_memseg failed
vm exit[0]
        reason          VMX
        rip             0x0000000000005249
        inst_length     3
        status          0
        exit_reason     2 (Triple fault)
        qualification   0x0000000000000000
        inst_type               0
        inst_error              0

Disassembly from that region of guest memory:

0x5200:                         pushl  %esi
0x5201:                         pushl  %edx
0x5202:                         pushl  %ecx
0x5203:                         pushl  %ebx
0x5204:                         pushl  %eax
0x5205:                         pushl  $0x0
0x5207:                         popl   %ds
0x5208:                         lgdt   (%esi)
0x520b:                         jo     +0x54    <0x5261>
0x520d:                         movw   $0x1,%ax
0x5211:                         addb   %al,(%eax)
0x5213:                         movw   %ax,%cr0
0x5217:                         ljmp   $0xffffb866,$0x10521c
0x521e:                         orb    %al,(%eax)
0x5220:                         movw   %eax,%ds
0x5222:                         movw   %eax,%ss
0x5224:                         movl   0x5400,%eax
0x5229:                         movl   %eax,%cr4
0x522c:                         movl   0x53f8,%eax
0x5231:                         movl   %eax,%cr3
0x5234:                         movl   $0xc0000080,%ecx
0x5239:                         rdmsr
0x523b:                         btsl   $0x8,%eax
0x523f:                         wrmsr
0x5241:                         movl   0x53f0,%eax
0x5246:                         movl   %eax,%cr0
0x5249:                         jmp    +0x0     <0x524b>
0x524b:                         pushl  $0x28
0x524d:                         pushl  $0x5253
0x5252:                         lret
0x5253:                         lgdt   0x53e0
0x525b:                         lidt   0x53d0
0x5263:                         decl   %eax

The %cr4 value which that code loads (from guest address 0x5400) was fetched from the guest as well:
0x5400:         1006b8
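
For reference, that value decodes (per the standard Intel CR4 bit definitions) as:

0x1006b8 = DE | PSE | PAE | PGE | OSFXSR | OSXMMEXCPT | SMEP
         = 0x8 | 0x10 | 0x20 | 0x80 | 0x200 | 0x400 | 0x100000

Notably, PAE is set (required for the long-mode transition shown in the disassembly above), along with PGE.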

This certainly resembled the circumstances of #14100, where manipulation of control registers which should have incurred a TLB flush failed to do so, leaving stale entries behind with disastrous consequences. Refreshed scrutiny of that logic was therefore prudent.

The logic which performs a TLB flush during %cr0 write emulation uses vmx_invvpid to flush from the TLB the "combined mapping" entries (as Intel calls them) which are tagged with the VPID of that vCPU:

static int
vmx_emulate_cr0_access(struct vmx *vmx, int vcpu, uint64_t exitqual)
{

...

        /* Flush the TLB if the paging or write-protect bits are changing */
        if ((diff & CR0_PG) != 0 || (diff & CR0_WP) != 0) {
                vmx_invvpid(vmx, vcpu, 1);
        }

In vmx_invvpid, there is some logic to elide the INVVPID if the vmspace (EPT) generation is such that an INVEPT on this host CPU would be necessary anyway:

static void
vmx_invvpid(struct vmx *vmx, int vcpu, int running)
{

...

        if (vmspace_table_gen(vms) == vmx->eptgen[curcpu]) {
                invvpid_desc._res1 = 0;
                invvpid_desc._res2 = 0;
                invvpid_desc.vpid = vmxstate->vpid;
                invvpid_desc.linear_addr = 0;
                invvpid(INVVPID_TYPE_SINGLE_CONTEXT, invvpid_desc);
                vmm_stat_incr(vmx->vm, vcpu, VCPU_INVVPID_DONE, 1);
        } else {
                /*
                 * The invvpid can be skipped if an invept is going to
                 * be performed before entering the guest. The invept
                 * will invalidate combined mappings tagged with
                 * 'vmx->eptp' for all vpids.
                 */
                vmm_stat_incr(vmx->vm, vcpu, VCPU_INVVPID_SAVED, 1);
        }

This all seems reasonable. A closer reading of the Intel SDM notes that "guest-physical mappings", which are associated with the EPT structures (but not with any guest-virtual address, unlike the "combined mappings"), can only be cleared with INVEPT. Such flushes are needed when the EPT tables are updated via an unmap or a protection change (empty/absent EPT entries which are later faulted in would not leave TLB entries requiring a flush). With that in mind, the next thing to scrutinize is that follow-up INVEPT flush, since it becomes critical when the INVVPID was elided. There we see the problem:

                /*
                 * Indicate activation of vmspace (EPT) table just prior to VMX
                 * entry, checking for the necessity of an invept invalidation.
                 */
                eptgen = vmc_table_enter(vmc);
                if (vmx->eptgen[vcpu] != eptgen) {
                        /*
                         * VMspace generation does not match what was previously
                         * used for this CPU so all mappings associated with
                         * this EPTP must be invalidated.
                         */
                        invept(1, vmx->eptp);
                        vmx->eptgen[vcpu] = eptgen;
                }

At a casual glance, it looks like the same check, but it is using vcpu, the vCPU ID, rather than curcpu, the host CPU ID. This mistake, introduced in #13896, would perform the wrong comparison to check if eptgen is current, and would store the now-current value in the wrong index. While it is a bit surprising that this never manifested in testing, there are a few factors which are suspected contributors:

1. The primary machine that #13896 was tested upon (for Intel) had 20C/40T. That meant not only that vcpuid/curcpu mismatches would be more common, but also that there were many more host CPUs (with potentially overwritten TLBs) to migrate to.
2. The updated vmspace logic introduced in #13896 would no longer increase the EPT generation (eptgen) for pages faulted in, only for unmap operations (with protection changes not currently used/supported).

Regardless, the eptgen logic in vmx_run needs to be reverted to use curcpu when checking the vmspace generation counter. In addition, it would be useful to add detail to the comments to help future readers understand why and when INVEPT or INVVPID is necessary for adequately flushing the TLB.
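
For illustration, a minimal sketch of the reverted check, using the identifiers from the snippets above (the expanded comment wording here is illustrative, not necessarily what was integrated):

                /*
                 * Indicate activation of vmspace (EPT) table just prior to VMX
                 * entry, checking for the necessity of an invept invalidation.
                 */
                eptgen = vmc_table_enter(vmc);
                if (vmx->eptgen[curcpu] != eptgen) {
                        /*
                         * The vmspace generation does not match what this
                         * host CPU (curcpu, not the vCPU ID) last observed,
                         * so any guest-physical and combined mappings tagged
                         * with this EPTP may be stale and must be invalidated
                         * via INVEPT before entering the guest.
                         */
                        invept(1, vmx->eptp);
                        vmx->eptgen[curcpu] = eptgen;
                }

The only functional change relative to the broken version is indexing eptgen by curcpu, both for the comparison and for the store.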


Related issues

Related to illumos gate - Bug #13896: bhyve VM interfaces should be better fit (Closed, Patrick Mooney)

Related to illumos gate - Bug #14100: bhyve misses TLB flush for shadowed cr0 (Closed, Patrick Mooney)

Actions #1

Updated by Patrick Mooney 9 months ago

  • Related to Bug #13896: bhyve VM interfaces should be better fit added
Actions #2

Updated by Patrick Mooney 9 months ago

  • Related to Bug #14100: bhyve misses TLB flush for shadowed cr0 added
Actions #3

Updated by Electric Monk 9 months ago

  • Gerrit CR set to 1825
Actions #4

Updated by Patrick Mooney 9 months ago

To clarify: The early error messages in the first trace above are not of concern to us now.

pci_fbuf: mmap_memseg failed
pci_fbuf: munmap_memseg failed
bhyve-bhyvetest: pci_passthru: map_pptdev_mmio failed
bhyve-bhyvetest: pci_passthru: unmap_pptdev_mmio failed
pci_fbuf: mmap_memseg failed
pci_fbuf: munmap_memseg failed

These are related to running a very old bootrom which is attempting some operations not well handled by the bhyve userspace. Those errors were already present prior to the integration of #13896.

Actions #5

Updated by Patrick Mooney 9 months ago

I've tested this change locally on Intel gear, where the normal battery of guests all booted and ran fine. Several community members who were experiencing the issue have reported that, with the proposed patch in place, the problems (segfaults and booting issues) no longer occur.

Actions #6

Updated by Electric Monk 9 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit d1c02647196254b902261506af8a31be3080ca68

commit  d1c02647196254b902261506af8a31be3080ca68
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2021-11-29T20:20:49.000Z

    14266 bhyve mishandles TLB flush on VMX
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Reviewed by: Dan McDonald <danmcd@joyent.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>
