Bug #14042
GPT: Closed ranges conflict with other code causing panics
Status: Closed
% Done: 100%
Description
The recently-introduced GPT work in bhyve assumes that ranges are expressed as closed intervals when populating, vacating, mapping and unmapping ranges of guest physical address space; this was done to allow us to map right up to the end of the expressible address range without special code. However, the rest of the kernel assumes that ranges are half-open, as is the norm elsewhere. As a result, when rebooting guests, which unmaps and vacates the boot ROM region in the guest PA space, we both free and try to vacate the page immediately following the boot ROM area. On a VM configured with sufficient memory, that page belongs to a RAM region; thus, we free a page of RAM and try to vacate a region that still has mapped pages, resulting in an assertion failure on a kernel built in debug mode. This is obviously incorrect.
Given that the physical address space never extends all the way to 0xffff_ffff_ffff_ffff on hardware, and even in a virtualized context, we never map anything up there in the 2LPT, the fix here is just to use half-open ranges.
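The off-by-one-page effect of the closed-interval convention can be sketched in a few lines of C. This is an illustration, not the illumos code: it just counts the 4 KiB pages covered by a range under the half-open [start, end) and closed [start, end] interpretations.

```c
#include <assert.h>
#include <stdint.h>

#define	PAGESIZE	4096ULL

/*
 * Pages covered by [start, end): end is an exclusive bound, so the
 * page beginning at end is never touched.
 */
static uint64_t
pages_half_open(uint64_t start, uint64_t end)
{
	return ((end - start) / PAGESIZE);
}

/*
 * Pages covered by [start, end]: end is an inclusive bound, so the
 * page beginning at end IS touched -- one page more than above.
 */
static uint64_t
pages_closed(uint64_t start, uint64_t end)
{
	return ((end - start) / PAGESIZE + 1);
}
```

Applied to the boot ROM region described below (0xffe00000 up to 0x100000000), the half-open reading covers 512 pages (2 MiB), while the closed reading covers 513: the extra page at 0x100000000 is the first page of a RAM region.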
Related issues
Updated by Dan Cross about 2 years ago
A change to address this is in https://code.illumos.org/c/illumos-gate/+/1686.
Testing: built the system with this change, started and restarted various VMs, and observed that the system does not panic. Checked for memory leaks and verified that there are none.
Updated by Patrick Mooney about 2 years ago
- Related to Feature #13932: improve bhyve second level page table support added
Updated by Dan Cross about 2 years ago
More debugging detail was requested here:
Patrick reported to me that he was observing panics in this code when rebooting VMs (Propolis, our VMM at Oxide, recently gained reboot support); he suspected an impedance mismatch between the inclusive (closed) ranges in the GPT code and the assumption about half-open (exclusive upper bound) ranges elsewhere in the kernel. This seemed plausible.
I was able to reproduce locally. A representative stack dump from a panic:
TIME                           UUID                                 SUNW-MSG-ID
Aug 28 2021 02:45:10.773713000 0ea5299c-c158-6410-e95e-b82fef3981f7 SUNOS-8000-KL

TIME                 CLASS                                         ENA
Aug 28 02:45:10.7568 ireport.os.sunos.panic.dump_available         0x0000000000000000
Aug 28 02:45:01.3635 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 0ea5299c-c158-6410-e95e-b82fef3981f7
        code = SUNOS-8000-KL
        diag-time = 1630143910 763117
        de = fmd:///module/software-diagnosis
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.kernel.panic
                certainty = 0x64
                asru = sw:///:path=/var/crash//.0ea5299c-c158-6410-e95e-b82fef3981f7
                resource = sw:///:path=/var/crash//.0ea5299c-c158-6410-e95e-b82fef3981f7
                savecore-succcess = 1
                dump-dir = /var/crash/
                dump-files = vmdump.0
                os-instance-uuid = 0ea5299c-c158-6410-e95e-b82fef3981f7
                panicstr = assertion failed: ptes[i] == 0 (0xe23cd7067 == 0x0), file: ../../i86pc/io/vmm/vmm_gpt.c, line: 441
                panicstack = fffffffffbdcd481 () | vmm:vmm_gpt_vacate_entry+8b () | vmm:vmm_gpt_vacate_region+42 () | vmm:rvi_ops_unmap+59 () | vmm:vm_map_remove+98 () | vmm:vm_free_memmap+56 () | vmm:vm_cleanup+112 () | vmm:vm_destroy+1b () | vmm:vmm_do_vm_destroy_locked+c2 () | vmm:vmmdev_do_vm_destroy+ae () | vmm:vmm_ctl_ioctl+114 () | vmm:vmm_ioctl+d3 () | genunix:cdev_ioctl+2b () | specfs:spec_ioctl+45 () | genunix:fop_ioctl+5b () | genunix:ioctl+153 () | unix:brand_sys_syscall+304 () |
                crashtime = 1630143825
                panic-time = Sat Aug 28 02:43:45 2021 PDT
        (end fault-list[0])
        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x612a05a6 0x2e1dec68
The assertion in this case occurs when vacating a region: the code asserts that every entry in the leaf 2LPT corresponding to the address being vacated has been zeroed (that is, the 2LPTE is 0), but as the assertion message shows, this was not the case. My initial assumption was that, since a leaf PT maps a 2MiB region, perhaps two regions, vacated independently, spanned the same PT.
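As a simplified model of the failing check (the real code lives in vmm_gpt.c; the names here are illustrative, not the kernel's), vacating a leaf page table asserts that every PTE covered by the vacated range has already been unmapped:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	NPTES	512	/* entries in one leaf page table */

/*
 * Illustrative sketch: a leaf table may only be vacated (its backing
 * page freed) once every PTE in the covered range has been cleared.
 * A stale, still-mapped entry trips the assertion -- the same shape
 * of failure reported in the panic message above.
 */
static void
vacate_leaf(const uint64_t *ptes, size_t first, size_t last /* excl. */)
{
	for (size_t i = first; i < last; i++)
		assert(ptes[i] == 0 && "PTE still mapped while vacating");
}
```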
I added some debugging code to the kernel to print more detail (I wanted to see the PT index) and made the error non-fatal so that I could see exactly where the assertion was tripping: it was asserting on the 2nd (index 1) entry in a leaf PT not being null during a call to vmm_gpt_vacate_entry, and the GPA was 0x100000000: clearly this could not span an aligned 2MiB region.
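To see how a GPA like 0x100000000 decomposes into page-table slots, the standard x86 4-level index extraction can be written as a small helper (this is an illustration, not a function from the illumos source):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extract the page-table index of a guest physical address at a given
 * level of a 4-level table: level 0 is the leaf PTE index, level 3 the
 * root (PML4) index.  Each level uses 9 bits (512 entries), starting
 * above the 12-bit page offset.
 */
static unsigned
gpt_index(uint64_t gpa, int level)
{
	return ((unsigned)((gpa >> (12 + 9 * level)) & 0x1ff));
}
```

For example, 0x100000000 (4 GiB) sits at level-2 index 4 with all lower indices 0, while the boot ROM base 0xffe00000 occupies the very last (index 511) level-1 slot below 4 GiB; an inclusive upper bound at 0x100000000 therefore strays past the boot ROM's tables into those describing the RAM region above.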
Examining the memory map of the VM yielded some insight:
: frankenstein; sudo bhyvectl --vm=testvm --get-memmap
Address     Length    Segment    Offset    Prot    Flags
0           3072MB    sysmem     0         RWX
FFE00000    2048KB    bootrom    0         R-X
100000000   5120MB    sysmem     0         RWX
: frankenstein;
Note the bootrom region at 0xffe00000, extending to 0x100000000. Suspicious.... But perhaps the panic happens when we vacate the RAM region? Note that reboot only unmaps the boot ROM region, but let's check anyway. Quick, to the dtrace!
dtrace -n 'fbt:vmm:vm_map_remove:entry { print(*args[0]); print(args[1]); print(args[2]); }'
This showed that, on reboot, a call to vm_map_remove() is made to unmap the boot ROM region, with a VM argument corresponding to our VM, the start address 0xffe00000 and the end address 0x100000000. Aha. That end address is an inclusive upper bound, hence our reaching into the RAM region, hence tripping the assertion.
So yes, it's the bounds.
Since the rest of the kernel assumes half-open address ranges, the easiest fix is just to change the GPT code.
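The shape of the fix (a sketch, not the actual change under review at https://code.illumos.org/c/illumos-gate/+/1686) is simply to walk ranges with an exclusive upper bound. The two hypothetical loops below show the difference: with `<=` the walk touches the page at the bound itself; with `<` it stops one page short, as the rest of the kernel expects.

```c
#include <assert.h>
#include <stdint.h>

#define	PAGESIZE	4096ULL

/* Half-open walk: the page starting at 'end' is never visited. */
static uint64_t
last_page_half_open(uint64_t start, uint64_t end)
{
	uint64_t gpa, last = start;

	for (gpa = start; gpa < end; gpa += PAGESIZE)
		last = gpa;
	return (last);
}

/* Closed walk: the page starting at 'end' IS visited -- the bug. */
static uint64_t
last_page_closed(uint64_t start, uint64_t end)
{
	uint64_t gpa, last = start;

	for (gpa = start; gpa <= end; gpa += PAGESIZE)
		last = gpa;
	return (last);
}
```

For the boot ROM range above, the half-open walk's last page is 0xfffff000 (the final page of the boot ROM), while the closed walk's last page is 0x100000000, the first page of RAM: exactly the page that was wrongly freed and vacated.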
Updated by Electric Monk about 2 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit d7b40b493097c87910a5c1cf1316bb7c4be5bafd

commit d7b40b493097c87910a5c1cf1316bb7c4be5bafd
Author: Dan Cross <cross@oxidecomputer.com>
Date: 2021-09-03T15:17:41.000Z

14042 GPT: Closed ranges conflict with other code causing panics
Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
Reviewed by: Robert Mustacchi <rm@fingolfin.org>
Approved by: Dan McDonald <danmcd@joyent.com>