Bug #14261
bhyve should expose kernel device state
closed
100%
Description
Given that bhyve implements a fair amount of device emulation in the kernel vmm (LAPIC, IO-APIC, PIT, etc), it would be useful for live migration (and debugging) to be able to query and manipulate the state of those devices. There are some existing mechanisms for getting/setting general purpose register state, and manipulating the nvram fields of the RTC, but nothing for the rest of the in-kernel state.
Rather than adding a bespoke getter/setter interface per device, I propose a more general API which exposes those different kinds of data through a common key-value interface. The state for each device would be covered by a given class, and within that class, the device would be free to define its own identifiers for the various parts of its state. For example, the local APIC state for a given vCPU would be under the LAPIC class, with the field offsets of the various registers (e.g. IRR0) used as the identifiers.
Those vCPU-specific devices would require that the vcpuid be specified for a given get/set operation. Devices which are machine-wide (like the IOAPIC or PIT) would stand alone with no vcpuid association.
Expected per-vcpu data classes:
- register (GPR, segment, etc)
- MSRs
- FPU
- LAPIC
- VMM (arch-specific vmm state, VMX or SVM)
Expected VM-wide classes:
- IOAPIC
- ATPIT
- ATPIC
- HPET
- PM Timer
- RTC
It may also make sense to include some meta classes for information about the capabilities and expectations of the data transfer interface itself, as well as versioning about the various classes.
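To make the proposal concrete, here is a minimal userland sketch of how a class/identifier/vcpuid lookup might be structured. It is not the interface that was integrated: every name in it (data_class, data_req, data_get, data_set, VCPU_NONE, CLASS_VERSION), the class values, the error codes, and the toy backing arrays are assumptions made purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical class IDs; the issue does not assign real values. */
enum data_class {
	DC_LAPIC = 1,		/* per-vCPU */
	DC_IOAPIC = 2,		/* VM-wide */
	DC_ATPIT = 3		/* VM-wide */
};

#define	NVCPU		2	/* toy vCPU count */
#define	NFIELDS		4	/* toy number of fields per class */
#define	VCPU_NONE	(-1)	/* vcpuid used for VM-wide classes */
#define	CLASS_VERSION	1	/* toy version shared by all classes */

/* One key-value request: class + identifier (+ vcpuid) name a field. */
struct data_req {
	enum data_class	dr_class;
	uint16_t	dr_version;	/* version the caller expects */
	int		dr_vcpuid;	/* VCPU_NONE for VM-wide classes */
	uint32_t	dr_ident;	/* e.g. a register offset */
	uint64_t	dr_value;	/* input for set, output for get */
};

/* Toy backing store standing in for in-kernel device state. */
static uint64_t lapic_state[NVCPU][NFIELDS];
static uint64_t ioapic_state[NFIELDS];
static uint64_t atpit_state[NFIELDS];

/* Resolve a request to its storage slot, or NULL with *errp set. */
static uint64_t *
data_lookup(const struct data_req *req, int *errp)
{
	assert(req != NULL && errp != NULL);

	if (req->dr_version != CLASS_VERSION) {
		*errp = ENOTSUP;	/* caller and class disagree on layout */
		return (NULL);
	}
	if (req->dr_ident >= NFIELDS) {
		*errp = ENOENT;		/* no such field in this class */
		return (NULL);
	}
	if (req->dr_class == DC_LAPIC) {
		/* Per-vCPU classes need a valid vcpuid. */
		if (req->dr_vcpuid < 0 || req->dr_vcpuid >= NVCPU) {
			*errp = EINVAL;
			return (NULL);
		}
		return (&lapic_state[req->dr_vcpuid][req->dr_ident]);
	}
	/* VM-wide classes must not name a vCPU. */
	if (req->dr_vcpuid != VCPU_NONE) {
		*errp = EINVAL;
		return (NULL);
	}
	switch (req->dr_class) {
	case DC_IOAPIC:
		return (&ioapic_state[req->dr_ident]);
	case DC_ATPIT:
		return (&atpit_state[req->dr_ident]);
	default:
		*errp = EINVAL;		/* unknown class */
		return (NULL);
	}
}

/* Read the named field into req->dr_value; returns 0 or an errno. */
int
data_get(struct data_req *req)
{
	int err = 0;
	uint64_t *slot = data_lookup(req, &err);

	if (slot == NULL)
		return (err);
	req->dr_value = *slot;
	return (0);
}

/* Store req->dr_value into the named field; returns 0 or an errno. */
int
data_set(const struct data_req *req)
{
	int err = 0;
	uint64_t *slot = data_lookup(req, &err);

	if (slot == NULL)
		return (err);
	*slot = req->dr_value;
	return (0);
}
```

In this sketch a caller fills in a data_req with the class, the version it expects, the vcpuid (or VCPU_NONE), and the identifier, then calls data_get or data_set. A single lookup path keeps the per-vCPU versus VM-wide rule in one place instead of repeating it for each device.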
Updated by Patrick Mooney over 1 year ago
This change is part of a series (including #14635) of work to enable save/restore and eventually live migration in bhyve (via Propolis). By itself, it is not enough to demonstrate even a successful save/restore, given the additional state which needs to be moved. It was split from #14635 in order to make it easier to review. When built with #14635 floated on top, and run with an experimental patchset of Propolis, we were able to successfully pause a running bhyve instance, dump its state to a file, destroy the in-kernel resources, exit the initial Propolis process, and then restore the guest in a new Propolis process with a fresh in-kernel VMM instance rehydrated with the saved information (DRAM contents and device state).
While this functionality is proven out, the riskiest portion of the added behavior (the ability to load device state into an existing instance) is being kept hidden behind a kernel variable (vmm_allow_state_writes) which must be changed via mdb -kw in order to be used. This allows developers and adventurous testers to use the functionality at their discretion while keeping it masked from other users until we are more comfortable about its robustness.
Updated by Patrick Mooney over 1 year ago
On a system running bits built from the change, the bhyve test suite still passes (on both AMD and Intel):
Test: /opt/bhyve-tests/tests/mevent/vnode_zvol (run as root)            [00:02] [PASS]
Test: /opt/bhyve-tests/tests/inst_emul/rdmsr (run as root)              [00:00] [PASS]
Test: /opt/bhyve-tests/tests/inst_emul/wrmsr (run as root)              [00:00] [PASS]
Test: /opt/bhyve-tests/tests/inst_emul/triple_fault (run as root)       [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vatpit_freq (run as root)             [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vhpet_freq (run as root)              [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vlapic_freq (run as root)             [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vlapic_freq_periodic (run as root)    [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vlapic_mmio_access (run as root)      [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vlapic_msr_access (run as root)       [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vpmtmr_freq (run as root)             [00:00] [PASS]
Test: /opt/bhyve-tests/tests/mevent/lists_delete (run as root)          [00:00] [PASS]
Test: /opt/bhyve-tests/tests/mevent/read_disable (run as root)          [00:00] [PASS]
Test: /opt/bhyve-tests/tests/mevent/read_pause (run as root)            [00:00] [PASS]
Test: /opt/bhyve-tests/tests/mevent/read_requeue (run as root)          [00:00] [PASS]
Test: /opt/bhyve-tests/tests/mevent/vnode_file (run as root)            [00:09] [PASS]
Test: /opt/bhyve-tests/tests/vmm/fpu_getset (run as root)               [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/interface_version (run as root)        [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/mem_devmem (run as root)               [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/mem_partial (run as root)              [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/mem_seg_map (run as root)              [00:00] [PASS]

Results Summary
PASS      21

Running Time:   00:00:14
Percent passed: 100.0%
Log directory:  /var/tmp/test_results/20220623T031540
The standard battery of test guests all booted and ran (on AMD and Intel) as well.
Updated by Patrick Mooney over 1 year ago
When testing with #14635 (and dev propolis), if vmm_allow_state_writes was not set to a non-zero value prior to attempting a VM restore, the VMM_DATA_WRITE operation attempts would fail as expected.
Updated by Electric Monk over 1 year ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
git commit d515dd7754a14758624ee9b1330197cdb6a47c49
commit d515dd7754a14758624ee9b1330197cdb6a47c49
Author: Patrick Mooney <pmooney@pfmooney.com>
Date: 2022-06-23T19:41:39.000Z

    14261 bhyve should expose kernel device state
    Reviewed by: Dan Cross <cross@oxidecomputer.com>
    Reviewed by: Luqman Aden <luqman@oxide.computer>
    Reviewed by: Jordan Paige Hendricks <jordan@oxidecomputer.com>
    Approved by: Dan McDonald <danmcd@mnx.io>