Project

General

Profile

Bug #10284

Socket CMCI mismatch can lead to boot hang

Added by Robert Mustacchi 11 months ago. Updated 10 months ago.

Status:
Closed
Priority:
Normal
Category:
kernel
Start date:
2019-01-24
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:

Description

First, I took a look at the state of the CPUs when this happens:

[3]> ::cpuinfo
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc5a000  1b    0    0  96   no    no t-2793897 fffffffffbc58080 
  1 fffffdc9b6b82000  1b    0    0  -1   no    no t-2677539 fffffb044fc67c40 (idle)
  2 fffffdc9b6cc4000  1b    0    0  -1   no    no t-2677537 fffffb044fce7c40 (idle)
  3 fffffdc9b6e7a000  1b    0    0  -1   no    no t-2680896 fffffb044fd2bc40 (idle)
  4 fffffdc9b7057000  1b    0    0  -1   no    no t-2749270 fffffb044ffb0c40 (idle)
  5 fffffdc9b724a000  1b    0    0  -1   no    no t-2793672 fffffb0450098c40 (idle)
  6 fffffdc9b743c000  1b    0    0  -1   no    no t-2793672 fffffb045017ec40 (idle)
  7 fffffdc9b766e000  1b    0    0  -1   no    no t-2715094 fffffb0450252c40 (idle)
  8 fffffdc9b7852000  1b    0    0  -1   no    no t-2793895 fffffb04504a1c40 (idle)
  9 fffffdc9b7b36000  1b    0    0  -1   no    no t-2711901 fffffb045068dc40 (idle)
 10 fffffdc9b7f43000  1b    0    0  -1   no    no t-2793895 fffffb0450837c40 (idle)
 11 fffffdc9b8130000  1b    0    0  -1   no    no t-2710948 fffffb0450a17c40 (idle)
 12 fffffdc9b833d000  1b    0    0  -1   no    no t-2793895 fffffb0450b91c40 (idle)
 13 fffffdc9b86f2000  1b    0    0  -1   no    no t-2709737 fffffb0450ca3c40 (idle)
 14 fffffdc9d89e1000  1b    0    0  -1   no    no t-2793895 fffffb0450d29c40 (idle)
 15 fffffdc9d8b43000  1b    0    0  -1   no    no t-2708560 fffffb0450dafc40 (idle)
 16 fffffdc9d8ca5000  1b    0    0  -1   no    no t-2793259 fffffb0450e35c40 (idle)
 17 fffffdc9d8e67000  1b    0    0  -1   no    no t-2698833 fffffb0450ebbc40 (idle)
 18 fffffdc9d8fc9000  1b    0    0   0   no    no t-2677537 fffffb044faeac40 
 19 fffffdc9d914b000  64    0   10  -1   no    no t-2793873 fffffb0450fc7c40 (idle)
[3]> 0xfffffdc9d914b000::print cpu_t cpu_m.mcpu_cpi->cpi_pass
cpu_m.mcpu_cpi->cpi_pass = 0x1
[3]> apic_cpus::print [13].aci_status
[13].aci_status = 0
[3]> ap_mlsetup/p
ap_mlsetup:
ap_mlsetup:     apix`apix_post_cpu_start

Based on Rob's earlier work, I decided to dig a bit futher into what was
happening and where we are. Interestingly, based on the fact that we've
finished cpuid_pass1, but not more suggested that we were in the psm
module initialization. In this case, apix`apix_post_cpu_start(). I know
we hadn't finished it because we don't have the aci_status bits set.

So, we're executing mp_startup_boot(). Based on looking at the assembly
code that gets injected into the real mode platter, we end up setting up
the initial rsp to the IST1 from the TSS. This means we should be able
to find our stack by doing some dumpster diving:

[3]> fffffdc9d914b000::print cpu_t cpu_m.mcpu_tss->tss_ist1
cpu_m.mcpu_tss->tss_ist1 = 0xfffffdc9d92bf000
[3]> 0xfffffdc9d92bf000-300,50/nap
0xfffffdc9d92bed00:             
0xfffffdc9d92bed00:             0               
0xfffffdc9d92bed08:             0xb             
0xfffffdc9d92bed10:             0xfffffdc9d92bed50
0xfffffdc9d92bed18:             splr+0x6a       
0xfffffdc9d92bed20:             0xfffffb0450fc1c40
0xfffffdc9d92bed28:             cp_default      
0xfffffdc9d92bed30:             0xfffffdc9d8dfe280
0xfffffdc9d92bed38:             0xfffffdc9d914b000
0xfffffdc9d92bed40:             0               
0xfffffdc9d92bed48:             0xb             
0xfffffdc9d92bed50:             0xfffffdc9d92bed80
0xfffffdc9d92bed58:             tsc_gethrtime+0x5e
0xfffffdc9d92bed60:             0               
0xfffffdc9d92bed68:             0x1e0af722      
0xfffffdc9d92bed70:             0xfffffb0450fc1e40
0xfffffdc9d92bed78:             0xfffffb0450fc1c40
0xfffffdc9d92bed80:             0xfffffdc9d914b000
0xfffffdc9d92bed88:             0x13            
0xfffffdc9d92bed90:             0x5ceb1a943a2   
0xfffffdc9d92bed98:             0               
0xfffffdc9d92beda0:             0xfffffdc9d92bedb0
0xfffffdc9d92beda8:             gethrtime+0xa   
0xfffffdc9d92bedb0:             0xfffffdc9d92bede0
0xfffffdc9d92bedb8:             lbolt_event_driven+0x26
0xfffffdc9d92bedc0:             0xfffffdc9d914b000
0xfffffdc9d92bedc8:             0xfffffb0450fc1c40
0xfffffdc9d92bedd0:             0xfffffdc9d914b000
0xfffffdc9d92bedd8:             0xfffffb0450fc7c40
0xfffffdc9d92bede0:             0xfffffdc9d92bedf0
0xfffffdc9d92bede8:             ddi_get_lbolt+0xa
0xfffffdc9d92bedf0:             0xfffffdc9d92bee20
0xfffffdc9d92bedf8:             swtch+0x18a     
0xfffffdc9d92bee00:             t0              
0xfffffdc9d92bee08:             mutex_sobj_ops  
0xfffffdc9d92bee10:             0               
0xfffffdc9d92bee18:             t0              
0xfffffdc9d92bee20:             0xfffffdc9d92beec0
0xfffffdc9d92bee28:             turnstile_block+0x2ea
0xfffffdc9d92bee30:             0x930           
0xfffffdc9d92bee38:             p0              
0xfffffdc9d92bee40:             0               
0xfffffdc9d92bee48:             0xfb88821d      
0xfffffdc9d92bee50:             0               
0xfffffdc9d92bee58:             turnstile_table+0x938
0xfffffdc9d92bee60:             0xfffffb0450fc1e40
0xfffffdc9d92bee68:             cpu_lock        
0xfffffdc9d92bee70:             1               
0xfffffdc9d92bee78:             t0              
0xfffffdc9d92bee80:             0               
0xfffffdc9d92bee88:             0               
0xfffffdc9d92bee90:             0               
0xfffffdc9d92bee98:             0               
0xfffffdc9d92beea0:             mutex_sobj_ops  
0xfffffdc9d92beea8:             cpu_lock        
0xfffffdc9d92beeb0:             0               
0xfffffdc9d92beeb8:             0xfffffdc9d8e0c900
0xfffffdc9d92beec0:             0xfffffdc9d92bef30
0xfffffdc9d92beec8:             mutex_vector_enter+0x3fd
0xfffffdc9d92beed0:             0xfffffdc9d89b6600
0xfffffdc9d92beed8:             0xfffffdc9d8e0c900
0xfffffdc9d92beee0:             0x400000014     
0xfffffdc9d92beee8:             0x2866c9b7d98   
0xfffffdc9d92beef0:             0xfffffdc9d92bef30
0xfffffdc9d92beef8:             7               
0xfffffdc9d92bef00:             0xfffffdc9d92bef90
0xfffffdc9d92bef08:             1               
0xfffffdc9d92bef10:             0               
0xfffffdc9d92bef18:             0               
[3]> 0xfffffdc9d92beda0$C
fffffdc9d92bedb0 gethrtime+0xa()
fffffdc9d92bede0 lbolt_event_driven+0x26()
fffffdc9d92bedf0 ddi_get_lbolt+0xa()
fffffdc9d92bee20 swtch+0x18a()
fffffdc9d92beec0 turnstile_block+0x2ea(fffffdc9d8e0c900, 0, fffffffffbc9c2c8, fffffffffbc22520, 0, 0)
fffffdc9d92bef30 mutex_vector_enter+0x3fd(fffffffffbc9c2c8)
fffffdc9d92bef70 apix`apix_init_intr+0x354()
fffffdc9d92bef80 apix`apix_post_cpu_start+0x24()
fffffdc9d92befd0 mp_startup_common+0x73(1)
fffffdc9d92befe0 mp_startup_boot+0xe()
fffffdc9d92beff0 0xfffffffffb85fb5e()

Once again, the day is saved thanks to the power of the frame pointer.
We now know exactly where we are and what's going on. We're trying to
grab cpu_lock, which certainly isn't going to work in this context:

[3]> fffffffffbc9c2c8=p 
                cpu_lock        
[3]> fffffffffbc9c2c8::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
fffffffffbc9c2c8 adapt fffffffffbc58080      -      -     yes
[3]> fffffffffbc58080::findstack -v
stack pointer for thread fffffffffbc58080: fffffffffbc97ec0
[ fffffffffbc97ec0 _resume_from_idle+0x126() ]
  fffffffffbc98000 mp_startup_wait+0x4a(fffffffffbc2e780, 13)
  fffffffffbc98080 mp_start_cpu_common+0x16a(fffffdc9d914b000, 1)
  fffffffffbc980e0 start_cpu+0x75(13)
  fffffffffbc98110 start_other_cpus+0x1c3(0)
  fffffffffbc98150 main+0x306()
  fffffffffbc98160 _locore_start+0x90()

Yup, the person trying to start us has the lock we need. In general, it
seems it's forbidden to grab cpu_lock. so let's go and see what it is
that we're trying to do.

What it appears we're trying to do is to register a cpu setup function
that'll be run on all CPUs at some point. In particular we're in a block
that requires the cmi_enable_cmci variable to be set, which it is. At
which point it will try to, if we haven't already set up, call the
register_cpu_setup_func() with cmci_cpu_setup. Unfortunately,
this requires cpu_lock. Now, starting from here, one might
reasonably ask why don't we hit this all the time. That's a good
question. Let's go understand what would set cmi_enable_cmci.

This is being set by the cpu.generic module. This is being called as
part of the cmi_mca_init entry point. This entry point is looking at the
MCG_CAP MSR, which describes the capabilities of the MCA architecture.
If bit 10 is set, then we will go through and see if we can enable CMCI
and if so, set the cmci_cpu_setup variable is set global.

Next, I wanted to answer the question of why doesn't this happen all the
time. One would assume that if we had CMCI enabled, as we do on other
systems, that we would see this on most of our other systems that have
this, like Sandy Bridge through Haswell. If we look at the callers of
the CMI MCA initialization logic, it should normally happen before we
start up other CPUs in {{startup_modules()}. As in the following stack
trace:

[0]> $C
fffffffffbc98070 cpu.generic`gcpu_mca_init()
fffffffffbc98100 startup_modules+0x2e0()
fffffffffbc98110 startup+0x55()
fffffffffbc98150 main+0x9b()
fffffffffbc98160 _locore_start+0x90()

However, this raises an intersting question. If we expect this to
normally happen when starting up the system, why don't we in this case?

Thankfully, the answer to our question is availble to us. We end up
storing the MSR's value in a register handle. So let's check all of
those on this system:

::walk cmihdl | ::print cmi_hdl_impl_t cmih_cmidata | ::print gcpu_data_t gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000c14

Wait a minute, that's weird. One is different from the rest and it
corresponds to the last CPU we successfully started. So, why did this
all of a sudden happen here. It happened on the 19th CPU. That's
suspiciously like it's on a new socket:

[3]> ::walk cmihdl | ::printf "%x %x %x\n" cmi_hdl_impl_t cmih_chipid cmih_procnodeid cmih_strandid
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 1 0

And so it is. In fact, if we look at another Skylake system from this
vendor:

> ::walk cmihdl | ::print cmi_hdl_impl_t cmih_cmidata | ::print gcpu_data_t gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap ! sort | uniq -c
  28 gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000814
  28 gcpu_mca.gcpu_mca_bioscfg.bios_mcg_cap = 0xf000c14

Well, that's bad. One socket has CMCI and another doesn't. So, based on
all this we have two problems that we need to figure out:

1. Why does one socket have CMCI enabled, but not the other.
2. How do we make sure that if this gets enabled we don't end up
deadlocking in boot.


Test plan:

Booted bits on systems where all sockets are consistent and either
report CMCI or don't and made sure the following was true with both the
apix and pcplusmp modules:

  • When CMCI enabled, that the gcpu_mca_flags were set to 6
  • When CMCI enabled, that the gcpu_cmci_setup_func was registered
  • When CMCI enabled, made sure that the polling thread had started CMCI on all CPUs
  • When CMCI disabled, that the gcpu_mca_flags were set to 0
  • When CMCI disabled, that the gcpu_cmci_setup_func was not registered

On the systems with mixed CMCI socket settings, I booted them various
times noting and making sure that we booted different combinations on
both systems such that we had both CMCI enabled on the boot CPU and not
on the other CPU. From there, I did the following:

  • Made sure that all sockets with CMCI registered that fact
  • Used psradm to make sure that turning on and off sockets didn't cause a deadlock (psradm -a -f, psradm -a -n).
  • Used fminject from a different system to cause CPU faults to be simulated that would cause FM to offline to the CPU. Then acquitted those CPUs and forced them online.
  • Verified that the above worked with both pcplusmp and apix psm modules

History

#1

Updated by Electric Monk 10 months ago

  • Status changed from New to Closed

git commit 918e0d92ec24e67f572737a68faf135dc6409d26

commit  918e0d92ec24e67f572737a68faf135dc6409d26
Author: Robert Mustacchi <rm@joyent.com>
Date:   2019-01-28T22:40:36.000Z

    10284 Socket CMCI mismatch can lead to boot hang
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: John Levon <john.levon@joyent.com>
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

Also available in: Atom PDF