Bug #14100
Closed
bhyve misses TLB flush for shadowed cr0
100%
Description
During initial testing for #14081, which includes an update that exposes SMEP/SMAP to guests, several folks discovered that their Windows guests would fail early in boot with a triple fault on SMEP-capable Intel hardware. The fault itself would occur during a long jump:
vm exit[1]
    reason          VMX
    rip             0x00000000000016d7
    inst_length     3
    status          0
    exit_reason     2 (Triple fault)
    qualification   0x0000000000000000
    inst_type       0
    inst_error      0
fbuf frame buffer base: fffffc7fecc00000 [sz 16777216]
...
[0]> 1:x
[1]> $r
%cs = 0x0030            %eax = 0x80050033
%ds = 0x0020            %ebx = 0x00000000
%ss = 0x0000            %ecx = 0xc0000080
%es = 0x0000            %edx = 0x00000000
%fs = 0x0000            %esi = 0x00000000
%gs = 0x0000            %edi = 0x00001000
%eip = 0x000016d7
%ebp = 0x00000000
%esp = 0x00000000
%eflags = 0x00010046
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=<of,df,if,tf,sf,ZF,af,PF,cf>
%trapno = 0x0
   %err = 0x0
[1]> ::sysregs
%efer = 0xd01 <SCE,LME,LMA,NXE>
%cr0 = 0x80050033 <PE,MP,ET,NE,WP,AM,PG>
%cr2 = 0x0000000000000080 <0x80>
%cr3 = 0xbffdf000 <pfn:0xbffdf flags:>
%cr4 = 0x152678 <DE,PSE,PAE,MCE,OSFXSR,OSXMMEXCPT,VMXE,FSGSBASE,OSXSAVE,SMEP>
%pdpte0 = 0x0000000300000001    %pdpte2 = 0x0000000000000000
%pdpte1 = 0x0000000000000000    %pdpte3 = 0x0000000000004000
%gdtr = 0x0000000000001018/0x3f
%idtr = 0x0000000000000000/0xffff
%ldtr = 0x0000000000000000/0x0000ffff 0x00082 <usable, LDT, dpl 0, flags: P,16b>
%tr = 0x0000000000000000/0x0000ffff 0x0008b <usable, 32b/64b TSS, busy, dpl 0, flags: P,16b>
%cs = 0x0000000000000000/0xffffffff 0x0c09b <usable, code, non-conforming, execute-read, dpl 0, flags: P,32b,G,A>
%ss = 0x0000000000000000/0x0000ffff 0x00093 <usable, data, up, read-write, dpl 0, flags: P,16b,A>
%ds = 0x0000000000000000/0xffffffff 0x0c093 <usable, data, up, read-write, dpl 0, flags: P,32b,G,A>
%es = 0x0000000000000000/0x0000ffff 0x00093 <usable, data, up, read-write, dpl 0, flags: P,16b,A>
%fs = 0x0000000000000000/0x0000ffff 0x00093 <usable, data, up, read-write, dpl 0, flags: P,16b,A>
%gs = 0x0000000000000000/0x0000ffff 0x00093 <usable, data, up, read-write, dpl 0, flags: P,16b,A>
%intr_shadow = 0x0
[1]> <rip::dis
0x16d7: ljmp *0x66(%rdi)
0x16da: movl %edi,%edi
0x16dc: movq 0x70(%rdi),%rcx
0x16e0: movq 0xa0(%rdi),%rax
0x16e7: movq 0x78(%rdi),%rdi
Looking back a little further, it seems to be a routine messing about with various paging/protection-related state:
0x167c: movl 0xa8(%rdi),%eax
0x1682: andl $0xfffdffff,%eax
0x1687: movq %rax,%cr4
0x168a: movl 0x58(%rdi),%eax
0x168d: movq %rax,%cr3
0x1690: testl $0x1,0x8(%rdi)
0x1697: je +0xc <0x16a5>
0x1699: movl $0x1a0,%ecx
0x169e: rdmsr
0x16a0: andl $0xfffffffb,%edx
0x16a3: wrmsr
0x16a5: testl $0x2,0x8(%rdi)
0x16ac: je +0xb <0x16b9>
0x16ae: movq %cr4,%rax
0x16b1: orl $0x1000,%eax
0x16b6: movq %rax,%cr4
0x16b9: movl $0xc0000080,%ecx
0x16be: rdmsr
0x16c0: orl 0x88(%rdi),%eax
0x16c6: orl 0x8c(%rdi),%edx
0x16cc: wrmsr
0x16ce: movl 0x90(%rdi),%eax
0x16d4: movq %rax,%cr0
0x16d7: ljmp *0x66(%rdi)
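For anyone following along, the constants in that routine can be matched to architectural names. The sketch below is only a reading aid: the macro names are illustrative (not taken from the guest or from bhyve), while the bit positions and MSR numbers are the ones documented in the Intel SDM. The compile-time checks confirm that each mask touches exactly the bit named.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names; the values are architectural. */
#define MSR_IA32_MISC_ENABLE    0x1a0u          /* movl $0x1a0,%ecx */
#define MSR_EFER                0xc0000080u     /* movl $0xc0000080,%ecx */
#define CR4_LA57                (1u << 12)
#define CR4_PCIDE               (1u << 17)
#define MISC_ENABLE_XD_DISABLE  (1ull << 34)

/* 0x1682: andl $0xfffdffff,%eax clears only CR4.PCIDE. */
static_assert((uint32_t)~0xfffdffffu == CR4_PCIDE,
    "first CR4 load masks off PCIDE");

/* 0x16b1: orl $0x1000,%eax sets only bit 12 of CR4 (LA57). */
static_assert(0x1000u == CR4_LA57,
    "conditional CR4 update sets bit 12");

/*
 * 0x16a0: andl $0xfffffffb,%edx clears bit 2 of the high half of
 * IA32_MISC_ENABLE, which is bit 34 of the full MSR (XD Bit Disable).
 */
static_assert(((uint64_t)(uint32_t)~0xfffffffbu << 32) ==
    MISC_ENABLE_XD_DISABLE, "MSR update clears XD Bit Disable");
```

Read this way, the sequence loads %cr4 and %cr3, optionally re-enables XD and sets bit 12 of %cr4, ORs saved bits into EFER, and only then loads %cr0.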
It turns out that this load of %cr0 is triggering the vmx_emulate_cr0_access logic in bhyve, since it includes CR0.CD (not shown in the ::sysregs output, because it does not currently take shadowed bits into account as it should). If CR0.CD was not included, then VMX would handle the access entirely, including any TLB flushing required, but since it incurs an exit for the shadow manipulation, that duty is delegated to us. Sadly, the cr0 handling for bhyve on Intel performs no such flush when the relevant bits (CR0.PG or CR0.WP) change, so the guest proceeds with a stale cached entry, which apparently incurs a fatal fault. (Inspecting the page table structures referenced by the guest %cr3 revealed nothing that appeared problematic.)
The VMX logic in bhyve should be updated to perform such flushing.
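The two decisions involved can be sketched in C. This is not the bhyve code: the function names are hypothetical, and the choice of CR0.PG and CR0.WP as the flush-relevant bits simply follows the description above. The exit condition is the architectural one for a MOV to CR0 under VMX: the write exits if any bit set in the guest/host mask differs between the new value and the read shadow.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_WP  (1ull << 16)
#define CR0_CD  (1ull << 30)
#define CR0_PG  (1ull << 31)

/*
 * Hypothetical helper: would this guest write to %cr0 exit to the
 * hypervisor?  It does when a host-owned bit (one set in `mask`)
 * differs between the value written and the read shadow.
 */
static bool
cr0_write_exits(uint64_t mask, uint64_t shadow, uint64_t newval)
{
	return (((newval ^ shadow) & mask) != 0);
}

/*
 * Hypothetical helper: once the hypervisor is emulating the write,
 * does it also owe the guest a TLB flush?  Per the description above,
 * it does when CR0.PG or CR0.WP changes.
 */
static bool
cr0_write_needs_flush(uint64_t oldval, uint64_t newval)
{
	return (((oldval ^ newval) & (CR0_PG | CR0_WP)) != 0);
}
```

With a mask containing CR0.CD, a write that changes CD relative to the shadow takes the exit path, and if the same write also changes PG or WP, the emulation path is the only place the flush can happen.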
Related issues
Updated by Patrick Mooney over 1 year ago
- Related to Bug #14101: bhyve should expose shadowed bits in CRs added
Updated by Patrick Mooney over 1 year ago
- Related to Feature #14081: bhyve upstream sync 2021 September added
Updated by Patrick Mooney over 1 year ago
This has been tested on Intel hardware capable of SMEP, where exposing that capability to Windows under bhyve was causing the triple-fault. With the proposed fix in place, Windows boots without issue. Others have tried the fix as well:
All looks good with this patch on top of 9p and the bhyve sync. Tried various guests - Linux, FreeBSD and Windows 10 without problems. The Linux and FreeBSD guests could see SMEP.
Jorge has confirmed the same on his server except that guests can also see SMAP since his host has it. Windows working fine there too.
Updated by Patrick Mooney over 1 year ago
In addition to the specific Windows+SMEP test, I also ran the typical gauntlet of guest tests on Intel gear with the patch to look for unexpected behavior with the change. All of them booted and ran normally.
Updated by Electric Monk over 1 year ago
- Status changed from In Progress to Closed
- % Done changed from 80 to 100
git commit bf0dcd3f9893153e708295693e9015919b00112b
commit bf0dcd3f9893153e708295693e9015919b00112b
Author: Patrick Mooney <pmooney@pfmooney.com>
Date: 2021-09-27T19:48:33.000Z
    14100 bhyve misses TLB flush for shadowed cr0
    14101 bhyve should expose shadowed bits in CRs
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
Updated by Patrick Mooney over 1 year ago
- Related to Bug #14266: bhyve mishandles TLB flush on VMX added