Bug #14042

closed

GPT: Closed ranges conflict with other code causing panics

Added by Dan Cross about 2 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
bhyve
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

The recently introduced GPT work in bhyve assumes that ranges are expressed as closed intervals when populating, vacating, mapping and unmapping ranges of guest physical address space. This was done to allow us to map right up to the end of the expressible address range without special code. However, the rest of the kernel assumes that ranges are half-open, as is the norm elsewhere. As a result, when rebooting a guest, which unmaps and vacates the boot ROM region in the guest PA space, we both free and try to vacate the page immediately following the boot ROM area. On a VM configured with sufficient memory, that page lies in a RAM region; thus, we free a page of RAM and try to vacate a region that still has mapped pages, resulting in an assertion failure on a kernel built in debug mode. This is obviously incorrect.

Given that the physical address space never extends all the way to 0xffff_ffff_ffff_ffff on hardware, and that even in a virtualized context we never map anything up there in the 2LPT, the fix here is just to use half-open ranges.
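To make the off-by-one concrete, here is a minimal sketch that counts the 4KiB pages visited under each interpretation of the same (start, end) pair. It is illustrative only and is not the code in vmm_gpt.c: the names EX_PAGE_SIZE, ex_pages_half_open and ex_pages_closed are hypothetical, and a 4KiB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4KiB page size; the name is hypothetical, for illustration only. */
#define	EX_PAGE_SIZE	4096ULL

/* Count the pages visited when the range is half-open: [start, end). */
static uint64_t
ex_pages_half_open(uint64_t start, uint64_t end)
{
	uint64_t n = 0;

	assert(start % EX_PAGE_SIZE == 0 && end % EX_PAGE_SIZE == 0);
	assert(start <= end);
	for (uint64_t gpa = start; gpa < end; gpa += EX_PAGE_SIZE)
		n++;
	return (n);
}

/* Count the pages visited when the range is closed: [start, end]. */
static uint64_t
ex_pages_closed(uint64_t start, uint64_t end)
{
	uint64_t n = 0;

	assert(start % EX_PAGE_SIZE == 0 && end % EX_PAGE_SIZE == 0);
	assert(start <= end);
	/* Stop on equality rather than "gpa <= end" so gpa cannot wrap. */
	for (uint64_t gpa = start; ; gpa += EX_PAGE_SIZE) {
		n++;
		if (gpa == end)
			break;
	}
	return (n);
}
```

For a 2MiB region starting at 0xffe00000 and ending at 0x100000000, the half-open walk visits 512 pages, while the closed walk visits 513: the extra one is the page at 0x100000000, which is the first page after the region.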


Related issues

Related to illumos gate - Feature #13932: improve bhyve second level page table support (Closed)

Actions #1

Updated by Dan Cross about 2 years ago

A change to address this is in https://code.illumos.org/c/illumos-gate/+/1686.

Testing: built the system with this change, started and restarted various VMs, and observed that the system does not panic. Checked for memory leaks and verified that there are none.

Actions #2

Updated by Patrick Mooney about 2 years ago

  • Related to Feature #13932: improve bhyve second level page table support added
Actions #3

Updated by Dan Cross about 2 years ago

More debugging detail was requested here:

Patrick reported to me that he was observing panics in this code when rebooting VMs (Propolis, our VMM at Oxide, recently gained reboot support); he suspected an impedance mismatch between the inclusive (closed) ranges in the GPT code and the assumption about half-open (exclusive upper bound) ranges elsewhere in the kernel. This seemed plausible.

I was able to reproduce locally. A representative stack dump from a panic:

TIME                           UUID                                 SUNW-MSG-ID
Aug 28 2021 02:45:10.773713000 0ea5299c-c158-6410-e95e-b82fef3981f7 SUNOS-8000-KL

  TIME                 CLASS                                 ENA
  Aug 28 02:45:10.7568 ireport.os.sunos.panic.dump_available 0x0000000000000000
  Aug 28 02:45:01.3635 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 0ea5299c-c158-6410-e95e-b82fef3981f7
        code = SUNOS-8000-KL
        diag-time = 1630143910 763117
        de = fmd:///module/software-diagnosis
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.kernel.panic
                certainty = 0x64
                asru = sw:///:path=/var/crash//.0ea5299c-c158-6410-e95e-b82fef3981f7
                resource = sw:///:path=/var/crash//.0ea5299c-c158-6410-e95e-b82fef3981f7
                savecore-succcess = 1
                dump-dir = /var/crash/
                dump-files = vmdump.0
                os-instance-uuid = 0ea5299c-c158-6410-e95e-b82fef3981f7
                panicstr = assertion failed: ptes[i] == 0 (0xe23cd7067 == 0x0), file: ../../i86pc/io/vmm/vmm_gpt.c, line: 441
                panicstack = fffffffffbdcd481 () | vmm:vmm_gpt_vacate_entry+8b () | vmm:vmm_gpt_vacate_region+42 () | vmm:rvi_ops_unmap+59 () | vmm:vm_map_remove+98 () | vmm:vm_free_memmap+56 () | vmm:vm_cleanup+112 () | vmm:vm_destroy+1b () | vmm:vmm_do_vm_destroy_locked+c2 () | vmm:vmmdev_do_vm_destroy+ae () | vmm:vmm_ctl_ioctl+114 () | vmm:vmm_ioctl+d3 () | genunix:cdev_ioctl+2b () | specfs:spec_ioctl+45 () | genunix:fop_ioctl+5b () | genunix:ioctl+153 () | unix:brand_sys_syscall+304 () |
                crashtime = 1630143825
                panic-time = Sat Aug 28 02:43:45 2021 PDT
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x612a05a6 0x2e1dec68

The assertion in this case occurs when vacating a region: the code asserts that every entry in the leaf 2LPT corresponding to the virtual address being vacated is nil'd out (that is, the 2LPTE is 0), but this obviously wasn't the case, as seen in the assertion message. My initial assumption was that, as the leaf PT maps a 2MiB region, perhaps two regions, vacated independently, spanned the PT.

I added some debugging code to the kernel to print more detail (I wanted to see the PT index) and to prevent the error from being fatal, so that I could see exactly where the assertion was tripping. It was asserting on the 2nd (index 1) entry in a leaf PT not being null during a call to vmm_gpt_vacate_entry, and the GPA was 0x100000000: clearly this could not span an aligned 2MiB region.

Examining the memory map of the VM yielded some insight:

: frankenstein; sudo bhyvectl --vm=testvm --get-memmap
Address     Length      Segment     Offset      Prot  Flags
0           3072MB      sysmem      0           RWX
FFE00000    2048KB      bootrom     0           R-X
100000000   5120MB      sysmem      0           RWX
: frankenstein;

Note the bootrom region starting at 0xffe00000 and extending to 0x100000000. Suspicious.... But perhaps the panic happens when we vacate the RAM region? Note that reboot only unmaps the boot ROM region, but let's check anyway. Quick, to the dtrace!

dtrace -n 'fbt:vmm:vm_map_remove:entry { print(*args[0]); print(args[1]); print(args[2]); }'

This showed that, on reboot, a call to vm_map_remove() is made to unmap the boot ROM region, with a VM argument corresponding to our VM, the start address 0xffe00000 and end address 0x100000000. Aha. The GPT code treats that end address as an inclusive upper bound, which is why we reach into the RAM region and trip the assertion.
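The overlap can be checked against the memory map shown earlier. The sketch below hard-codes the three segments from that bhyvectl output and looks up which one contains a given GPA. The struct and function names (ex_seg, ex_memmap, ex_seg_for_gpa) are hypothetical and are not part of bhyve or the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one guest memory segment. */
struct ex_seg {
	uint64_t	start;	/* first guest physical address */
	uint64_t	len;	/* length in bytes */
	const char	*name;	/* segment name from the memmap output */
};

/* Segments transcribed from the bhyvectl --get-memmap output above. */
static const struct ex_seg ex_memmap[] = {
	{ 0x0ULL,		3072ULL << 20,	"sysmem" },
	{ 0xffe00000ULL,	2048ULL << 10,	"bootrom" },
	{ 0x100000000ULL,	5120ULL << 20,	"sysmem" },
};

/* Return the segment whose half-open range contains gpa, or NULL. */
static const struct ex_seg *
ex_seg_for_gpa(uint64_t gpa)
{
	size_t nsegs = sizeof (ex_memmap) / sizeof (ex_memmap[0]);

	for (size_t i = 0; i < nsegs; i++) {
		const struct ex_seg *s = &ex_memmap[i];

		if (gpa >= s->start && gpa - s->start < s->len)
			return (s);
	}
	return (NULL);
}
```

With these values the bootrom segment ends exactly at 0x100000000 (0xffe00000 plus 2048KB), so its last page is at 0xfffff000. Looking up 0x100000000 returns the second sysmem segment, not the bootrom: that is the page a closed-range walk touches by mistake.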

So yes, it's the bounds.

Since the rest of the kernel assumes half-open address ranges, the easiest fix is just to change the GPT code.

Actions #4

Updated by Electric Monk about 2 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit d7b40b493097c87910a5c1cf1316bb7c4be5bafd

commit  d7b40b493097c87910a5c1cf1316bb7c4be5bafd
Author: Dan Cross <cross@oxidecomputer.com>
Date:   2021-09-03T15:17:41.000Z

    14042 GPT: Closed ranges conflict with other code causing panics
    Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Approved by: Dan McDonald <danmcd@joyent.com>
