Bug #14367
bhyve gpt mishandles small mappings
Status: Closed
% Done: 100%
Description
While andyf was testing VARS.fd support from upstream FreeBSD, he hit this panic:
panic[cpu1]/thread=fffffe59dceda440:
assertion failed: vmsm != NULL, file: ../../i86pc/io/vmm/vmm_vm.c, line: 749

fffffe007b8222a0 vpanic()
fffffe007b8222f0 0xfffffffffbdcd955()
fffffe007b8223b0 vmspace_lookup_map+0x227(fffffe7c5ce506c0, ffe00000, 1, fffffe007b8223c0, fffffe007b8223c8)
fffffe007b822420 vmc_hold+0x8f(fffffe59d6463f68, ffe00000, 1)
fffffe007b8224e0 vm_copy_setup+0x165(fffffea88c400000, 0, fffffe59e2358440, ffe00000, 8, 1, fffffe007b822580, fffffe5900000002, fffffe007b822534)
fffffe007b822620 vie_emulate_movs+0x17c(fffffe59e2358400, fffffea88c400000, 0, ffe00000)
fffffe007b822650 vie_emulate_mmio+0x105(fffffe59e2358400, fffffea88c400000, 0)
fffffe007b8226c0 vm_handle_mmio_emul+0x145(fffffea88c400000, 0)
fffffe007b822780 vm_run+0x3ea(fffffea88c400000, 0, fffffe007b8228b8)
fffffe007b822c30 vmmdev_do_ioctl+0xdd3(fffffe59ee861880, 767001, fffffc7fef00ae40, 202003, fffffe7b9247ca20, fffffe007b822e08)
fffffe007b822cb0 vmm_ioctl+0xec(11c00000008, 767001, fffffc7fef00ae40, 202003, fffffe7b9247ca20, fffffe007b822e08)
fffffe007b822cf0 cdev_ioctl+0x2b(11c00000008, 767001, fffffc7fef00ae40, 202003, fffffe7b9247ca20, fffffe007b822e08)
fffffe007b822d40 spec_ioctl+0x45(fffffe5a1b6df200, 767001, fffffc7fef00ae40, 202003, fffffe7b9247ca20, fffffe007b822e08, 0)
fffffe007b822dd0 fop_ioctl+0x5b(fffffe5a1b6df200, 767001, fffffc7fef00ae40, 202003, fffffe7b9247ca20, fffffe007b822e08, 0)
fffffe007b822ef0 ioctl+0x153(3, 767001, fffffc7fef00ae40)
fffffe007b822f00 sys_syscall+0x283()
That address, 0xffe00000, corresponds to the top of the 2MB region occupied by the bootrom. It is specifically part of the writable region exposed by the persistent VARS.fd handling. We can confirm that by looking at the mapping list in the vmspace_t:
{
    vmsm_node = {
        list_next = 0xfffffe622cd4bb48
        list_prev = 0xfffffe7c5ce50708
    }
    vmsm_object = 0xfffffe5a2c8ecc40
    vmsm_addr = 0
    vmsm_len = 0x8000000
    vmsm_offset = 0
    vmsm_prot = 0x7
}
{
    vmsm_node = {
        list_next = 0xfffffe7c5ce50708
        list_prev = 0xfffffe622cd4e828
    }
    vmsm_object = 0xfffffe627f567d60
    vmsm_addr = 0xffe20000
    vmsm_len = 0x1e0000
    vmsm_offset = 0xe20000
    vmsm_prot = 0x5
}
Here, addresses 0xffe00000 - 0xffe20000 are not covered by any valid mapping, meaning accesses to those physical addresses will result in an exit to userspace for MMIO emulation. Why then did we panic on an assert? Let's look at the relevant portion of vmspace_lookup_map:
	...
	vmm_gpt_walk(gpt, gpa, entries, MAX_GPT_LEVEL);
	leaf = entries[LEVEL1];
	if (leaf == NULL) {
		/*
		 * Since we populated the intermediate tables for any regions
		 * mapped in the GPT, an empty leaf entry indicates there is no
		 * mapping, populated or not, at this GPA.
		 */
		return (FC_NOMAP);
	}

	if (vmm_gpt_is_mapped(gpt, leaf, &pfn, &prot)) {
		if ((req_prot & prot) != req_prot) {
			return (FC_PROT);
		}
	} else {
		vmspace_mapping_t *vmsm;
		vm_object_t *vmo;

		/*
		 * Because of the prior leaf check, we should be confident that
		 * _some_ mapping covers this GPA
		 */
		vmsm = vm_mapping_find(vms, gpa, PAGESIZE);
		VERIFY(vmsm != NULL);
	...
Here, vmspace_lookup_map is operating under the assumption that if a leaf pagetable (one with entries describing 4k page mappings) is present, then the GPA in question must be covered by a mapping. This is due to the GPT logic pre-creating page tables (the leaf, and all higher levels) for the GPA range covering each new mapping. That assumption breaks down when a mapping is not aligned (in both size and position) to 2MB, so that it covers only part of a leaf table.
Several things must change for this to be fixed. First, the vmsm != NULL assert must be removed and replaced with a more appropriate FC_NOMAP bailout. That solves the immediate issue, but we must still ensure that mappings which are not 2MB-aligned cannot cause other problems. In particular, if two or more mappings "share" a leaf page, an unmap of one must not result in clean-up of that leaf page while the other still exists. The expectations and invariants around page table pre-population and clean-up must be checked for correctness.
Files
Related issues
Updated by Patrick Mooney 5 months ago
- Related to Bug #13896: bhyve VM interfaces should be better fit added
Updated by Patrick Mooney 5 months ago
- File bhyve added
- File BHYVE_VARS.fd added
- File README.txt added
Updated by Andy Fiddaman 5 months ago
- Blocks Feature #14372: bhyve upstream sync 2022 January added
Updated by Patrick Mooney 5 months ago
With the fix in place, the bhyve test suites pass:

root@carbide:~# /opt/bhyve-tests/bin/bhyvetest
Test: /opt/bhyve-tests/tests/memmap (run as root) [00:00] [PASS]

Results Summary
PASS 1

Running Time:   00:00:00
Percent passed: 100.0%
Log directory:  /var/tmp/test_results/20220111T195008

root@carbide:~# /opt/bhyvetest/bin/bhyvetest -a
Starting tests...
output directory: /root/bhyvetest.100524
Executing test /opt/bhyvetest/tst/mevent/lists.delete.exe ... passed
Executing test /opt/bhyvetest/tst/mevent/read.disable.exe ... passed
Executing test /opt/bhyvetest/tst/mevent/read.pause.exe ... passed
Executing test /opt/bhyvetest/tst/mevent/read.requeue.exe ... passed
-------------
Results
-------------
Tests passed: 4
Tests failed: 0
Tests ran: 4
Congrats, some tiny parts of bhyve aren't completely broken, the tests pass.
The specific memmap test fails (as expected) when run on vanilla gate bits without the fix.
I ran the reproducer from Andy and it successfully made it to the UEFI shell, rather than panicking the machine.
In smoke-testing the standard battery of guests, all booted and ran as expected.
Updated by Patrick Mooney 5 months ago
- Related to Bug #14398: consolidate bhyve tests added
Updated by Electric Monk 5 months ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
git commit 7daa540591d7332190b10fc3a818bfd5e7d536fe
commit 7daa540591d7332190b10fc3a818bfd5e7d536fe
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2022-01-11T20:49:23.000Z

    14367 bhyve gpt mishandles small mappings
    Reviewed by: Dan Cross <cross@oxidecomputer.com>
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Approved by: Dan McDonald <danmcd@joyent.com>