Project

General

Profile

Actions

Bug #16183

closed

bhyve vlapic_reset mangles timer state

Added by Patrick Mooney 4 months ago. Updated 4 months ago.

Status:
Closed
Priority:
Normal
Category:
bhyve
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:
oxide:propolis#597

Description

While testing live instance migration with propolis, we encountered issues with the destination host becoming unresponsive. (See propolis#597)
After configuring the dumping the machine when it encountered these conditions, it became clear that a periodic timer on one of the vlapics was in a bad way. The a vm_run() thread is waiting effectively forever on a cyclic juggle:

stack pointer for thread fffffe595a3dabe0 (propolis-server/22 [vcpu-1]): fffffe007ad8b470
  fffffe007ad8b4e0 apix_hilevel_intr_prolog+0x82(fffffe5931e72000, f, 0, fffffe007ad8b550)
  fffffe007ad8b540 apix_do_interrupt+0x3d4(fffffe007ad8b550, 294a7ed)
  fffffe007ad8b550 _interrupt+0xc3()
  fffffe007ad8b670 tsc_read+3()
  fffffe007ad8b680 gethrtime+0x10()
  fffffe007ad8b6b0 drv_usecwait+0x9f(2000000)
  fffffe007ad8b740 cyclic_juggle_one_to+0xbc(fffffe5973d9ef50, fffffe5931a04480)
  fffffe007ad8b780 cyclic_move_here+0x88(fffffe5973d9ef50)
  fffffe007ad8b7b0 vmm_glue_callout_localize+0x27(fffffe5969c2c3f0)
  fffffe007ad8b7d0 vlapic_localize_resources+0x15(fffffe5969c2c3a8)
  fffffe007ad8b800 vm_localize_resources+0x46(fffffe59739f1000, fffffe59739f14a0)
  fffffe007ad8b880 vm_run+0x114(fffffe59739f1000, 1, fffffe007ad8b9c0)
  fffffe007ad8bc40 vmmdev_do_ioctl+0xb03(fffffe5971da91c0, 767001, fffffc7fe6bffc00, 202003, fffffe595f3598d8, fffffe007ad8be18)
  fffffe007ad8bcc0 vmm_ioctl+0x12f(11600000002, 767001, fffffc7fe6bffc00, 202003, fffffe595f3598d8, fffffe007ad8be18)
  fffffe007ad8bd00 cdev_ioctl+0x3f(11600000002, 767001, fffffc7fe6bffc00, 202003, fffffe595f3598d8, fffffe007ad8be18)
  fffffe007ad8bd50 spec_ioctl+0x55(fffffe5934cc4c80, 767001, fffffc7fe6bffc00, 202003, fffffe595f3598d8, fffffe007ad8be18, 0)
  fffffe007ad8bde0 fop_ioctl+0x40(fffffe5934cc4c80, 767001, fffffc7fe6bffc00, 202003, fffffe595f3598d8, fffffe007ad8be18, 0)
  fffffe007ad8bf00 ioctl+0x144(e, 767001, fffffc7fe6bffc00)
  fffffe007ad8bf10 sys_syscall+0x17d()

It was unable to move the cyclic because it's in a tight loop of firing and reprogramming:

> ::stacks -c vlapic_callout_handler | ::findstack -v
stack pointer for thread fffffe007bc08c20 (unix`thread_create_intr()): fffffe007bc087f0
  fffffe007bc08860 apix_hilevel_intr_prolog+0x82(fffffe593223b000, e, 2, fffffe007bc088d0)
  fffffe007bc088c0 apix_do_interrupt+0x3d4(fffffe007bc088d0, 2)
  fffffe007bc088d0 _interrupt+0xc3()
  fffffe007bc089f0 rw_exit()
  fffffe007bc08a50 callout_reset_hrtime+0x39(fffffe5969c2c3f0, 576238bddf, fffffffff4200b10, fffffe5969c2c3a8, 200)
  fffffe007bc08a70 vlapic_callout_reset+0x29(fffffe5969c2c3a8)
  fffffe007bc08aa0 vlapic_callout_handler+0xb2(fffffe5969c2c3a8)
  fffffe007bc08ad0 vmm_glue_callout_handler+0x36(fffffe5969c2c3f0)
  fffffe007bc08b60 cyclic_softint+0xe1(fffffe593223b000, 0)
  fffffe007bc08b80 cbe_low_level+0x20(0, 0)
  fffffe007bc08bd0 av_dispatch_softvect+0x72(2)
  fffffe007bc08c00 apix_dispatch_softint+0x35(0, 0)
  fffffe007bbb79b0 switch_sp_and_call+0x15()
  fffffe007bbb7a00 apix_do_softint+0x5a(fffffe007bbb7a70)
  fffffe007bbb7a60 apix_do_interrupt+0x35f(fffffe007bbb7a70, 0)
  fffffe007bbb7a70 _interrupt+0xc3()
  fffffe007bbb7ba0 fakesoftint_return()
  fffffe007bbb7be0 _resume_from_idle+0x12b()
  fffffe007bbb7c00 idle+0xbc()
  fffffe007bbb7c10 thread_start+0xb()
> fffffe5969c2c3a8::print 'struct vlapic'
{
    vm = 0xfffffe59739f1000
    vcpuid = 0x1
    apic_page = 0xfffffe59802e4000
    ops = {
        set_intr_ready = 0
        sync_state = 0
        intr_accepted = 0
        post_intr = 0
        enable_x2apic_mode = 0
    }
    esr_pending = 0
    callout = {
        c_cyc_id = 0xfffffe5973d9ef50
        c_target = 0x576238bddf
        c_fired = 0x576238bdb7
        c_func = vlapic_callout_handler
        c_arg = 0xfffffe5969c2c3a8
    }
    timer_fire_when = 0x576238bddf
    timer_period = 0
    timer_cur_freq = 0x8000000
    timer_lock = {
        _opaque = [ 0xfffffe007bc08c20 ]
    }
    msr_apicbase = 0xfee00800
    svr_last = 0xff
    lvt_last = [ 0x100d8, 0x400, 0x300d8, 0xe2, 0xfe, 0x10000, 0x10000 ]
    stats = {
        vs_clamp_ccr = 0
        vs_import_timer_overage = 0
    }
}
> fffffe5969c2c3a8::print 'struct vlapic' apic_page->icr_timer
apic_page->icr_timer = 0
> fffffe5969c2c3a8::print 'struct vlapic' apic_page->lvt_timer
apic_page->lvt_timer = 0x300d8

Note that the vlapic timer is configured for periodic mode, but the period is 0, thanks to a zero icrtimer value. By all accounts this timer should have been disabled if icrtimer was written to zero.
Spinning in such a state is obviously a problem which should be addressed by stronger logic in the vlapic callout reprogramming. That aside, how did we come to have a zero icrtimer while timer_fire_when is non-zero?
Greg captured the vlapic data from the migration payload (pardon the json escaping mess):

DevicePayload {
                kind: \"bhyve-x86-lapic\",
                      version: 1,
                data: \"(page:(id:16777216,version:393236,tpr:32,apr:0,ldr:33554432,dfr:4294967295,svr:479,isr:(0,0,0,0,0,0,0,0),tmr:(0,0,0,0,0,0,0,0),irr:(0,0,65536,0,0,0,0,0),esr:0,lvt_cmci:65536,icr:72057594038190127,lvt_timer:196824,lvt_thermal:65536,lvt_pcint:254,lvt_lint0:65752,lvt_lint1:1024,lvt_error:226,icr_timer:0,dcr_timer:11),msr_apicbase:4276094976,timer_target:64011593782,esr_pending:0)\" 

So icr_timer was zeroed with a non-zero timer_target (normalized timer_target_when) on the source when this instance was migrated. That is certainly strange, since vlapic_icrtmr_write_handler stops any pending calling and zeroes timer_target_when if icr_timer is written to zero.

After a bunch of confused tracing later, the culprit emerges.

dtrace:

vlapic_icrtmr_write_handler:entry
{
        self->vlapic = args[0];
}
vlapic_icrtmr_write_handler:entry,
vlapic_icrtmr_write_handler:return
/self->vlapic/
{
        printf("%s:%s(%p, when:%ld period:%ld icrtmr:%x)\n",
                probefunc,
                probename,
                self->vlapic,
                (long)self->vlapic->timer_fire_when,
                (long)self->vlapic->timer_period,
                self->vlapic->apic_page->icr_timer);
}
vlapic_icrtmr_write_handler:return
/self->vlapic/
{
        self->vlapic = 0;
}
vlapic_reset:entry
{
        self->vlapic = args[0];
}

vlapic_reset:entry,
vlapic_reset:return
/self->vlapic/
{
        printf("%s:%s(%p)\n", probefunc, probename, self->vlapic);
        printf("\t icr:%x when:%ld\n", self->vlapic->apic_page->icr_timer, (long)self->vlapic->timer_fire_when);
}

vlapic_reset:return
/self->vlapic/
{
        self->vlapic = 0;
}

Output:

vlapic_icrtmr_write_handler:entry(fffffe6664ad5020, when:2107474113559628 period:0 icrtmr:f81d0088)
vlapic_icrtmr_write_handler:return(fffffe6664ad5020, when:2107474113425857 period:62028322339 icrtmr:f81d0088)
vlapic_icrtmr_write_handler:entry(fffffe65f767eec8, when:2107474113555521 period:0 icrtmr:f81d0088)
vlapic_icrtmr_write_handler:entry(fffffe6571ad5320, when:2107474113562394 period:0 icrtmr:f81d0088)
vlapic_icrtmr_write_handler:return(fffffe6571ad5320, when:2107474113585382 period:62028322339 icrtmr:f81d0088)
vlapic_icrtmr_write_handler:return(fffffe65f767eec8, when:2107474113586884 period:62028322339 icrtmr:f81d0088)
vlapic_icrtmr_write_handler:entry(fffffe6571aec078, when:2107474103201712 period:4095999999046 icrtmr:ffffffff)
vlapic_icrtmr_write_handler:return(fffffe6571aec078, when:2111508234436723 period:4095999999046 icrtmr:ffffffff)
vlapic_icrtmr_write_handler:entry(fffffe6571aec078, when:2111508234436723 period:4095999999046 icrtmr:ffffffff)
vlapic_icrtmr_write_handler:return(fffffe6571aec078, when:2111508361188771 period:4095999999046 icrtmr:ffffffff)
vlapic_icrtmr_write_handler:entry(fffffe6571aec078, when:2111508361188771 period:4095999999046 icrtmr:0)
vlapic_icrtmr_write_handler:return(fffffe6571aec078, when:0 period:0 icrtmr:0)
vlapic_reset:entry(fffffe6571ad5320)
         icr:f81d0088 when:2107474113585382
vlapic_reset:return(fffffe6571ad5320)
         icr:0 when:2107474113585382
vlapic_reset:entry(fffffe65f767eec8)
         icr:f81d0088 when:2107474113586884
vlapic_reset:return(fffffe65f767eec8)
         icr:0 when:2107474113586884
vlapic_reset:entry(fffffe6664ad5020)
         icr:f81d0088 when:2107474113425857
vlapic_reset:return(fffffe6664ad5020)
         icr:0 when:2107474113425857

The vlapic is configured with a periodic timer, but then is reset. We are failing to clear timer_fire_when during vlapic_reset().

Given the pathological behavior when icr_timer and timer_fire_when disagree, we should also enforce congruence between them as part of VMM_DATA_WRITE operations to prevent invalid state from being loaded in addition to fixing vlapic_reset().

Actions #1

Updated by Electric Monk 4 months ago

  • Gerrit CR set to 3233
Actions #2

Updated by Patrick Mooney 4 months ago

With the import_vlapic test updated to cover the involved behavior (both the exportation of invalidly-reset state, and the allowed importation of that same state), I copied it to a machine running platform bits which predate the proposed fix. I was able to demonstrate both failure modes with the test. With the proposed fix in place, that same test, and all the others, passed:

Test: /opt/bhyve-tests/tests/mevent/vnode_zvol (run as root)      [00:02] [PASS]
Test: /opt/bhyve-tests/tests/inst_emul/cpuid (run as root)        [00:00] [PASS]
Test: /opt/bhyve-tests/tests/inst_emul/imul (run as root)         [00:00] [PASS]
Test: /opt/bhyve-tests/tests/inst_emul/rdmsr (run as root)        [00:00] [PASS]
Test: /opt/bhyve-tests/tests/inst_emul/wrmsr (run as root)        [00:00] [PASS]
Test: /opt/bhyve-tests/tests/inst_emul/triple_fault (run as root) [00:00] [PASS]
Test: /opt/bhyve-tests/tests/inst_emul/exit_paging (run as root)  [00:00] [PASS]
Test: /opt/bhyve-tests/tests/inst_emul/page_dirty (run as root)   [00:00] [PASS]
Test: /opt/bhyve-tests/tests/inst_emul/exit_consistent (run as root) [00:00] [PASS]
Test: /opt/bhyve-tests/tests/inst_emul/suspend_info (run as root) [00:00] [PASS]
Test: /opt/bhyve-tests/tests/inst_emul/vcpu_barrier (run as root) [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vatpit_freq (run as root)       [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vhpet_freq (run as root)        [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vlapic_freq (run as root)       [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vlapic_freq_periodic (run as root) [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vlapic_mmio_access (run as root) [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vlapic_msr_access (run as root) [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vpmtmr_freq (run as root)       [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/vrtc_ops (run as root)          [00:04] [PASS]
Test: /opt/bhyve-tests/tests/kdev/wrmsr_tsc (run as root)         [00:00] [PASS]
Test: /opt/bhyve-tests/tests/kdev/rdmsr_tsc (run as root)         [00:00] [PASS]
Test: /opt/bhyve-tests/tests/mevent/lists_delete (run as root)    [00:00] [PASS]
Test: /opt/bhyve-tests/tests/mevent/read_disable (run as root)    [00:00] [PASS]
Test: /opt/bhyve-tests/tests/mevent/read_pause (run as root)      [00:00] [PASS]
Test: /opt/bhyve-tests/tests/mevent/read_requeue (run as root)    [00:00] [PASS]
Test: /opt/bhyve-tests/tests/mevent/vnode_file (run as root)      [00:09] [PASS]
Test: /opt/bhyve-tests/tests/viona/interface_version (run as root) [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/auto_destruct (run as root)      [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/cpuid_ioctl (run as root)        [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/datarw_constraints (run as root) [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/datarw_msrs (run as root)        [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/datarw_vcpu (run as root)        [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/default_capabs (run as root)     [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/drv_hold (run as root)           [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/fpu_getset (run as root)         [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/import_vlapic (run as root)      [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/interface_version (run as root)  [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/legacy_destruct (run as root)    [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/maxcpu (run as root)             [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/mem_devmem (run as root)         [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/mem_partial (run as root)        [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/mem_seg_map (run as root)        [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/pause_resume (run as root)       [00:00] [PASS]
Test: /opt/bhyve-tests/tests/vmm/self_destruct (run as root)      [00:00] [PASS]

Results Summary
PASS      44

Running Time:   00:00:19
Percent passed: 100.0%
Actions #3

Updated by Patrick Mooney 4 months ago

In addition to the specific targeted testing, I also spun up the normal array of guests under c-bhyve to check that their behavior was not impacted by the change. They all operated normally.

Actions #4

Updated by Electric Monk 4 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 1f7cf86b3ece41a9cec19a3ed7e3b45e005d1461

commit  1f7cf86b3ece41a9cec19a3ed7e3b45e005d1461
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2024-01-17T22:53:38.000Z

    16183 bhyve vlapic_reset mangles timer state
    Reviewed by: Greg Colombo <greg@oxidecomputer.com>
    Approved by: Dan McDonald <danmcd@mnx.io>

Actions

Also available in: Atom PDF