Bug #13645
bhyve queues wrong SIPI
Status: Closed
Description
While testing propolis on a wider range of gear, I found that it refused to boot a guest with >1 vCPU while running on Intel CPUs. Looking at the info printed to the qemu debug port (from the OVMF ROM) and at what was visible in mdb (while tracing the guest), it appeared to be waiting for the AP to start up during boot. Another user observed the guest panic/halt with a triple-fault on a seemingly random AP. This led me to scrutinize the INIT/SIPI sequences. In the OVMF logs, we can see it select memory to use for AP startup code:
AP Loop Mode is 1
WakeupBufferStart = 9F000, WakeupBufferSize = 1000
CpuMpPei: 5-Level Paging = 0
APIC MODE is 1
MpInitLib: Find 2 processors in system.
Where 0x9f000 is the buffer with the startup code, resulting in a SIPI vector of 0x9f to cause the APs to jump there when they start. With dtrace we can see that behavior:
dtrace -n 'vcpu_vector_sipi:entry { trace(arg1); print(arg2) } vm_inject_init:entry { trace(arg1) } vm_inject_sipi:entry { trace(arg1); print(arg2) }'
dtrace: description 'vcpu_vector_sipi:entry ' matched 3 probes
CPU     ID                    FUNCTION:NAME
 19  68150             vm_inject_init:entry   1
 19  68152             vm_inject_sipi:entry   1  int64_t 0x9f
  1  68467           vcpu_vector_sipi:entry   1  int64_t 0x9f
 19  68152             vm_inject_sipi:entry   1  int64_t 0x9f
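For reference, the SIPI vector is simply the page frame number of the wakeup buffer, since the AP begins real-mode execution at physical address (vector << 12). A minimal standalone sketch of that relationship (my own illustration, not code from bhyve or OVMF):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * A SIPI vector V starts the AP at physical address V << 12, so the
 * wakeup buffer must be page-aligned and sit below 1 MiB.
 */
static uint8_t
sipi_vector_for(uint32_t wakeup_buffer_pa)
{
	assert((wakeup_buffer_pa & 0xfffu) == 0);
	assert(wakeup_buffer_pa < 0x100000u);
	return ((uint8_t)(wakeup_buffer_pa >> 12));
}

int
main(void)
{
	/* The 0x9f000 buffer from the OVMF log implies vector 0x9f. */
	printf("0x%x\n", sipi_vector_for(0x9f000));
	return (0);
}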
The ROM takes several laps through this INIT/SIPI logic for various parts of startup. It's on the final sequence that we see something strange:
 19  68150             vm_inject_init:entry   1
  1  68467           vcpu_vector_sipi:entry   1  int64_t 0x9f
 19  68152             vm_inject_sipi:entry   1  int64_t 0x87
 19  68152             vm_inject_sipi:entry   1  int64_t 0x87
The INIT IPI is sent first, as expected. Surprisingly, a SIPI with vector 0x9f is immediately processed after that. While that matches the vector needed from before, the log tells a different story:
AP Loop Mode is 1
GetMicrocodePatchInfoFromHob: MicrocodeBase = 0x0, MicrocodeSize = 0x0
WakeupBufferStart = 87000, WakeupBufferSize = 239
CpuDxe: 5-Level Paging = 0
The SIPI requests we see after the vcpu_vector_sipi call match that address, but why did we vector immediately into what looks like the old SIPI address? It turns out that the OVMF ROM issues double-SIPIs to start APs on Intel. That second SIPI is queued on the vCPU and not cleared during a subsequent INIT:
int
vm_inject_init(struct vm *vm, int vcpuid)
{
	struct vcpu *vcpu;

	if (vcpuid < 0 || vcpuid >= vm->maxcpus)
		return (EINVAL);

	vcpu = &vm->vcpu[vcpuid];
	vcpu_lock(vcpu);
	/* Note: any SIPI already queued (VRS_PEND_SIPI) is left untouched. */
	vcpu->run_state |= VRS_PEND_INIT;
	vcpu_notify_event_locked(vcpu, VCPU_NOTIFY_EXIT);
	vcpu_unlock(vcpu);
	return (0);
}

int
vm_inject_sipi(struct vm *vm, int vcpuid, uint8_t vector)
{
	struct vcpu *vcpu;

	if (vcpuid < 0 || vcpuid >= vm->maxcpus)
		return (EINVAL);

	vcpu = &vm->vcpu[vcpuid];
	vcpu_lock(vcpu);
	/* The SIPI is queued unconditionally, even if the vCPU is running. */
	vcpu->run_state |= VRS_PEND_SIPI;
	vcpu->sipi_vector = vector;
	/* SIPI is only actionable if the CPU is waiting in INIT state */
	if ((vcpu->run_state & (VRS_INIT | VRS_RUN)) == VRS_INIT) {
		vcpu_notify_event_locked(vcpu, VCPU_NOTIFY_EXIT);
	}
	vcpu_unlock(vcpu);
	return (0);
}
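To make the failure mode concrete, here is a small user-space toy model of the pending-state flags I put together; the real VMM's pending-event handling is more involved, but the stale-SIPI effect is the same:

#include <stdint.h>
#include <stdio.h>

/* Toy stand-ins for the run_state flags referenced above. */
#define	VRS_INIT	(1 << 0)	/* waiting in INIT (wait-for-SIPI) */
#define	VRS_PEND_INIT	(1 << 1)
#define	VRS_PEND_SIPI	(1 << 2)

struct toy_vcpu {
	uint32_t run_state;
	uint8_t sipi_vector;
};

static void
toy_inject_init(struct toy_vcpu *v)
{
	/* Mirrors the code above: a previously queued SIPI stays pending. */
	v->run_state |= VRS_PEND_INIT;
}

static void
toy_inject_sipi(struct toy_vcpu *v, uint8_t vec)
{
	v->run_state |= VRS_PEND_SIPI;
	v->sipi_vector = vec;
}

static void
toy_process_pending(struct toy_vcpu *v)
{
	if (v->run_state & VRS_PEND_INIT) {
		/* Enter wait-for-SIPI, keeping any pending SIPI bit. */
		v->run_state = VRS_INIT | (v->run_state & VRS_PEND_SIPI);
	}
	if (v->run_state & VRS_PEND_SIPI) {
		printf("AP vectors to 0x%x\n", v->sipi_vector);
		v->run_state = 0;	/* AP is now running */
	}
}

int
main(void)
{
	struct toy_vcpu ap = { 0, 0 };

	/* Early startup lap: INIT, then the double SIPI at vector 0x9f. */
	toy_inject_init(&ap);
	toy_inject_sipi(&ap, 0x9f);
	toy_process_pending(&ap);	/* prints "AP vectors to 0x9f" */
	toy_inject_sipi(&ap, 0x9f);	/* second SIPI stays queued */

	/* Final lap: the INIT alone (wrongly) starts the AP at 0x9f. */
	toy_inject_init(&ap);
	toy_process_pending(&ap);	/* prints "AP vectors to 0x9f" again */
	return (0);
}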
That SIPI-queuing behavior was originally designed to avoid missing a SIPI that arrived while an INIT was yet to be processed by the VMM, but it's clearly not working here. It seems reasonable to clear any pending SIPI when an INIT is queued, so that a subsequent SIPI (with the then-correct vector) is required to start the AP.
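A minimal sketch of that suggestion, modelled on the vm_inject_init() shown above (not necessarily the change as committed):

int
vm_inject_init(struct vm *vm, int vcpuid)
{
	struct vcpu *vcpu;

	if (vcpuid < 0 || vcpuid >= vm->maxcpus)
		return (EINVAL);

	vcpu = &vm->vcpu[vcpuid];
	vcpu_lock(vcpu);
	vcpu->run_state |= VRS_PEND_INIT;
	/*
	 * Discard any stale pending SIPI so the AP will not start until the
	 * firmware issues a fresh SIPI carrying the vector appropriate for
	 * this boot phase.
	 */
	vcpu->run_state &= ~VRS_PEND_SIPI;
	vcpu->sipi_vector = 0;
	vcpu_notify_event_locked(vcpu, VCPU_NOTIFY_EXIT);
	vcpu_unlock(vcpu);
	return (0);
}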