Project

General

Profile

Actions

Bug #1687

closed

Race in determine_platform / get_hwenv causes multi-CPU VM hang at boot

Added by Matt Amdur about 12 years ago. Updated about 12 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
kernel
Start date:
2011-10-25
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

We hit an issue at a customer site a few months ago where an Illumos VM on ESX
failed to boot after the customer increased the vCPU count from 4 to 6. On
console we would see it get as far as printing the copyright before the VM
would hang. The VM was wedged, it would not respond to an NMI. We changed the
boot logic to enable kmdb and stepped through early initialization, but we
weren't able to track down what caused the hang. Changing the vCPU count back
to 4 resolved the problem for the customer.

We wrote a simple script to reproduce the problem in our lab by booting a VM,
waiting for tools to start, powering the VM off, changing the vCPU count
between 4 and 6, and rebooting. On ESX 4.1 the problem reproduced after
several days, but we weren't able to get any useful data from DTrace nor kmdb.
We migrated the VM to ESX 5.0 so that we could use vprobes, but it took
several weeks to reproduce the problem. Once we reproduced the problem we were
able to write a simple vprobe script to give us guest back-traces:

(defstring stack)
(vprobe VMM1Hz
   (gueststack stack)
      (printf "CPUD: %d Stack: %s\n" VCPUID stack))
CPUD: 1 Stack: GUEST_0xfffffffffb8545c6_0xfffffffffb854a79
CPUD: 3 Stack: GUEST_0xfffffffffb841c8e_0xfffffffffb842595_0xfffffffffb84278e_0xfffffffffb84357e_0x0
CPUD: 0 Stack: GUEST_0xfffffffffb84cd24_0xfffffffffb841e1c_0xfffffffffb841f5c_0xfffffffffb842220_0xfffffffffba724f4_0xfffffffffb8000a0_0x0
CPUD: 2 Stack: GUEST_0xfffffffffb801d94_0xfffffffffb9ec9ed_0xfffffffffb9ed69d_0xfffffffffb9d4b3f_0xfffffffffb9d54db_0xfffffffffb9dc2a3_0xfffffffffb9dc172_0xfffffffff79d376d_0xfffffffffb86e938_0x0

We used a simple perl script to pipe the output to mdb to convert this to
a set of symbolic back-traces:

CPU 0:
        tsc_sync_master+0x6c:
        mp_start_cpu_common+0x134:
        start_cpu+0x44:
        start_other_cpus+0x198:
        main+0x2ac:
        _locore_start+0x90:

CPU 1:
        xc_serv+0x12e:
        xc_common+0x229:

CPU 2:
        cbe_xcall+0x95:
        cyclic_reprogram_here+0x4d:
        cyclic_reprogram+0x95:
        callout_heap_insert+0x6f:
        timeout_generic+0x333:
        cv_timedwait_hires+0xab:
        cv_timedwait+0x5a:
        arc_reclaim_thread+0x13d:
        thread_start+8:

CPU 3:
        mp_startup_signal+0x6e:
        mp_startup_common+0x1b5:
        mp_startup_boot+0xe:
        0xfffffffffb84357e:

We looked at the backtraces for each CPU, CPUs 0 and 3 appeared to be stuck in
tight loops:

CPU 0 was stuck in tsc_sync_master:

...
        for (cnt = 0; cnt < SYNC_ITERATIONS; cnt++) {
                while (tsc_sync_go != TSC_SYNC_GO)
                        SMT_PAUSE(); <-- *** STUCK HERE ***
...

CPU 3 was stuck in the second call to mp_startup_signal from
mp_startup_common:

...
        CPUSET_ATOMIC_ADD(*(cpuset_t *)sp, cpuid);
        for (tempset = *sp; CPU_IN_SET(tempset, cpuid);
            tempset = *(volatile cpuset_t *)sp) {
                SMT_PAUSE();  <-- *** STUCK HERE ***
        }
...

At a high level CPU 0 was waiting for CPU 3 to indicate that it was ready to
sync its TSC. CPU 3 had already finished syncing its TSC and was waiting for
the boot CPU to signal it to finish initialization. Looking a bit more closely
at tsc_sync_master there is an obvious problem:

...
        hwtype = get_hwenv();
        if (!tsc_master_slave_sync_needed || hwtype == HW_XEN_HVM ||
            hwtype == HW_VMWARE)
                return; <--- *** Should have returned here ***

        flags = clear_int_flag();
        source = CPU->cpu_id;

        for (cnt = 0; cnt < SYNC_ITERATIONS; cnt++) {
                while (tsc_sync_go != TSC_SYNC_GO)
                        SMT_PAUSE(); <--- *** STUCK HERE ***

...

tsc_sync_master has explicit logic to not sync the TSC if we are running on
VMware (which we were), but somehow get_hwenv had returned something other
then HW_VMWARE. Looking at get_hwenv:

...
int
get_hwenv(void)
{
        if (platform_type == -1)
                determine_platform();

        return (platform_type);
}

static void
determine_platform()
{
        struct cpuid_regs cp;
        char *xen_str;
        uint32_t xen_signature[4], base;

        platform_type = HW_NATIVE;

        ...
}

determine_platform is clearly broken if multiple CPUs call it, as it resets
platform_type to native each time it's invoked before setting it to the real
value. We call determine_platform from cpuid_pass1, which is called as part of
mp_cpu_start_common for each CPU. The race is as follows, CPU 3 calls
determine_platform as part of startup which sets platform_type to HW_NATIVE
briefly before setting it to HW_VMWARE. CPU 0 happens to call
determine_platform during that window, which causes tsc_sync_master to get
stuck waiting for CPU 3 to set tsc_sync_go to TSC_SYNC_GO, which will never
happen.

The fix is pretty straight forward:

1. add determine_platform() to usr/src/uts/intel/sys/x86_archext.h
1.1. change the signature of determine_platform() to non-static and with
a (void)
2. move the only call to determine_platform() to mlsetup() right
before the call to cpuid_pass1()
3. delete all other calls to determine_platform()
4. get_hwenv assert platform_type is set

I've been running the script against the fix for the last 4 days with no
issues.

Actions #1

Updated by Rich Lowe about 12 years ago

  • Category set to kernel
  • Status changed from New to Resolved
  • % Done changed from 0 to 100
  • Tags deleted (needs-triage)

Resolved in r13498 commit:e702644ca141

Actions

Also available in: Atom PDF