Bug #14100

closed

bhyve misses TLB flush for shadowed cr0

Added by Patrick Mooney 4 months ago. Updated 4 months ago.

Status: Closed
Priority: Normal
Category: bhyve
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

During initial testing for #14081, which includes an update that exposes SMEP/SMAP to guests, several folks discovered that their Windows guests would fail early in boot with a triple-fault on SMEP-capable Intel hardware. The fault itself would occur during a long-jump:

vm exit[1]
        reason          VMX
        rip             0x00000000000016d7
        inst_length     3
        status          0
        exit_reason     2 (Triple fault)
        qualification   0x0000000000000000
        inst_type               0
        inst_error              0
fbuf frame buffer base: fffffc7fecc00000 [sz 16777216]

...

[0]> 1:x
[1]> $r
%cs = 0x0030            %eax = 0x80050033
%ds = 0x0020            %ebx = 0x00000000
%ss = 0x0000            %ecx = 0xc0000080
%es = 0x0000            %edx = 0x00000000
%fs = 0x0000            %esi = 0x00000000
%gs = 0x0000            %edi = 0x00001000

%eip = 0x000016d7
%ebp = 0x00000000
%esp = 0x00000000

%eflags = 0x00010046
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=<of,df,if,tf,sf,ZF,af,PF,cf>

%trapno = 0x0
   %err = 0x0
[1]> ::sysregs
%efer = 0xd01 <SCE,LME,LMA,NXE>
%cr0 = 0x80050033 <PE,MP,ET,NE,WP,AM,PG>
%cr2 = 0x0000000000000080 <0x80>
%cr3 = 0xbffdf000 <pfn:0xbffdf flags:>
%cr4 = 0x152678 <DE,PSE,PAE,MCE,OSFXSR,OSXMMEXCPT,VMXE,FSGSBASE,OSXSAVE,SMEP>

%pdpte0 = 0x0000000300000001    %pdpte2 = 0x0000000000000000
%pdpte1 = 0x0000000000000000    %pdpte3 = 0x0000000000004000

%gdtr = 0x0000000000001018/0x3f
%idtr = 0x0000000000000000/0xffff
%ldtr = 0x0000000000000000/0x0000ffff 0x00082 <usable, LDT, dpl 0, flags: P,16b>
%tr   = 0x0000000000000000/0x0000ffff 0x0008b <usable, 32b/64b TSS, busy, dpl 0, flags: P,16b>
%cs   = 0x0000000000000000/0xffffffff 0x0c09b <usable, code, non-conforming, execute-read, dpl 0, flags: P,32b,G,A>
%ss   = 0x0000000000000000/0x0000ffff 0x00093 <usable, data, up, read-write, dpl 0, flags: P,16b,A>
%ds   = 0x0000000000000000/0xffffffff 0x0c093 <usable, data, up, read-write, dpl 0, flags: P,32b,G,A>
%es   = 0x0000000000000000/0x0000ffff 0x00093 <usable, data, up, read-write, dpl 0, flags: P,16b,A>
%fs   = 0x0000000000000000/0x0000ffff 0x00093 <usable, data, up, read-write, dpl 0, flags: P,16b,A>
%gs   = 0x0000000000000000/0x0000ffff 0x00093 <usable, data, up, read-write, dpl 0, flags: P,16b,A>
%intr_shadow = 0x0

[1]> <rip::dis
0x16d7:                         ljmp   *0x66(%rdi)
0x16da:                         movl   %edi,%edi
0x16dc:                         movq   0x70(%rdi),%rcx
0x16e0:                         movq   0xa0(%rdi),%rax
0x16e7:                         movq   0x78(%rdi),%rdi

Looking back a little further, it seems to be a routine messing about with various paging/protection-related state:

0x167c:                         movl   0xa8(%rdi),%eax
0x1682:                         andl   $0xfffdffff,%eax
0x1687:                         movq   %rax,%cr4
0x168a:                         movl   0x58(%rdi),%eax
0x168d:                         movq   %rax,%cr3
0x1690:                         testl  $0x1,0x8(%rdi)
0x1697:                         je     +0xc     <0x16a5>
0x1699:                         movl   $0x1a0,%ecx
0x169e:                         rdmsr
0x16a0:                         andl   $0xfffffffb,%edx
0x16a3:                         wrmsr
0x16a5:                         testl  $0x2,0x8(%rdi)
0x16ac:                         je     +0xb     <0x16b9>
0x16ae:                         movq   %cr4,%rax
0x16b1:                         orl    $0x1000,%eax
0x16b6:                         movq   %rax,%cr4
0x16b9:                         movl   $0xc0000080,%ecx
0x16be:                         rdmsr
0x16c0:                         orl    0x88(%rdi),%eax
0x16c6:                         orl    0x8c(%rdi),%edx
0x16cc:                         wrmsr
0x16ce:                         movl   0x90(%rdi),%eax
0x16d4:                         movq   %rax,%cr0
0x16d7:                         ljmp   *0x66(%rdi)

It turns out that this load of %cr0 triggers the vmx_emulate_cr0_access logic in bhyve, since it includes CR0.CD (not shown in the ::sysregs output, because that output does not currently take shadowed bits into account as it should). If CR0.CD were not included, VMX would handle the access entirely, including any required TLB flushing, but since it incurs an exit for the shadow manipulation, that duty is delegated to us. Sadly, the cr0 handling for bhyve on Intel performs no such flush when the relevant bits (CR0.PG or CR0.WP) change, so the guest proceeds with a stale cached entry, which apparently incurs a fatal fault. (Inspecting the page table structures referenced by the guest %cr3 revealed nothing that appeared problematic.)
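To illustrate why a shadowed bit such as CR0.CD can be missing from a dump of the hardware-side register, here is a minimal sketch of how VMX composes the value a guest reads from %cr0: bits covered by the CR0 guest/host mask come from the read shadow, and the rest come from the real guest CR0. The function name, the single-bit mask, and the shadow value are assumptions made for illustration; this is not the bhyve code.

```c
#include <assert.h>
#include <stdint.h>

#define	CR0_CD	0x40000000u	/* cache disable, bit 30 */

/*
 * Hypothetical helper: for every bit set in the guest/host mask the guest
 * sees the read-shadow value; for every other bit it sees the real CR0.
 */
static uint64_t
guest_visible_cr0(uint64_t real, uint64_t shadow, uint64_t mask)
{
	return ((real & ~mask) | (shadow & mask));
}

/* Illustrative values only: 0x80050033 is the %cr0 shown by ::sysregs. */
static void
guest_visible_cr0_example(void)
{
	const uint64_t real = 0x80050033ULL;	/* CD clear in hardware */
	const uint64_t shadow = real | CR0_CD;	/* assumed: guest set CD */

	/* With CD in the mask, the guest reads back the CD it wrote. */
	assert(guest_visible_cr0(real, shadow, CR0_CD) == 0xc0050033ULL);
}
```

A debugger that prints only the real value would show 0x80050033 here, which is the kind of gap #14101 is about.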

The VMX logic in bhyve should be updated to perform such flushing.
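As a rough sketch of the check described above (not the change that was actually committed), an emulated %cr0 write could compare the old and new values and request a flush whenever CR0.PG or CR0.WP differ. The function name is hypothetical, and the flush itself (however the hypervisor chooses to invalidate the guest's cached translations) is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	CR0_WP	0x00010000u	/* write protect, bit 16 */
#define	CR0_PG	0x80000000u	/* paging, bit 31 */

/*
 * Hypothetical helper: returns true when an emulated CR0 write changes a
 * bit named in this issue as requiring a TLB flush (CR0.PG or CR0.WP).
 */
static bool
cr0_change_needs_tlb_flush(uint64_t old_cr0, uint64_t new_cr0)
{
	const uint64_t flush_bits = (uint64_t)CR0_PG | CR0_WP;

	return (((old_cr0 ^ new_cr0) & flush_bits) != 0);
}

static void
cr0_flush_example(void)
{
	/* Turning on paging changes CR0.PG, so a flush is needed. */
	assert(cr0_change_needs_tlb_flush(0x00050033ULL, 0x80050033ULL));
	/* Changing only CR0.CD (bit 30) does not, by this rule. */
	assert(!cr0_change_needs_tlb_flush(0xc0050033ULL, 0x80050033ULL));
}
```

The caller would run this check on the exit path for the shadowed write, since that is the case where the hardware no longer does the flushing on the guest's behalf.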


Related issues

Related to illumos gate - Bug #14101: bhyve should expose shadowed bits in CRs (Closed, Patrick Mooney)
Related to illumos gate - Feature #14081: bhyve upstream sync 2021 September (Closed, Andy Fiddaman)
Related to illumos gate - Bug #14266: bhyve mishandles TLB flush on VMX (Closed, Patrick Mooney)

Actions #1

Updated by Patrick Mooney 4 months ago

  • Related to Bug #14101: bhyve should expose shadowed bits in CRs added
Actions #2

Updated by Patrick Mooney 4 months ago

Actions #3

Updated by Patrick Mooney 4 months ago

  • % Done changed from 0 to 80
Actions #4

Updated by Electric Monk 4 months ago

  • Gerrit CR set to 1722
Actions #5

Updated by Patrick Mooney 4 months ago

This has been tested on Intel hardware capable of SMEP, where exposing that capability to Windows under bhyve was causing the triple-fault. With the proposed fix in place, Windows boots without issue. Others have tried the fix as well:

All looks good with this patch on top of 9p and the bhyve sync. Tried various guests - Linux, FreeBSD and Windows 10 without problems. The Linux and FreeBSD guests could see SMEP.

Jorge has confirmed the same on his server except that guests can also see SMAP since his host has it. Windows working fine there too.

Actions #6

Updated by Patrick Mooney 4 months ago

In addition to the specific Windows+SMEP test, I also ran the typical gauntlet of guest tests on Intel gear with the patch to look for unexpected behavior with the change. All of them booted and ran normally.

Actions #7

Updated by Electric Monk 4 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 80 to 100

git commit bf0dcd3f9893153e708295693e9015919b00112b

commit  bf0dcd3f9893153e708295693e9015919b00112b
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2021-09-27T19:48:33.000Z

    14100 bhyve misses TLB flush for shadowed cr0
    14101 bhyve should expose shadowed bits in CRs
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

Actions #8

Updated by Patrick Mooney about 2 months ago

  • Related to Bug #14266: bhyve mishandles TLB flush on VMX added