Bug #10275

Kernel attempts to apply old erratum to AMD Threadripper 1950X

Added by Robert Mustacchi 11 months ago. Updated 11 months ago.

Status: Closed
Priority: Normal
Category: kernel
Start date: 2019-01-23
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:

Description

On my AMD desktop I was attempting to run a debug kernel under my KVM dev VM, which manages to trip an ASSERT.

Loading kmdb...
Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ unix krtld genunix ]
[0]> :c
WARNING: Couldn't read ACPI SRAT table from BIOS. lgrp support will be limited to one group.
panic[cpu0]/thread=fffffffffbc590a0: assertion failed: family == 0xf || family == 0x10 || family == 0x11, file: ../../i86pc/os/mp_startup.c, line: 749
Warning - stack not written to the dump buffer
fffffffffbc99040 genunix:max_hres_adj+15fba8 ()
fffffffffbc99050 unix:opteron_get_nnodes+68 ()
fffffffffbc99110 unix:workaround_errata+40d ()
fffffffffbc99150 unix:mlsetup+62d ()
fffffffffbc99160 unix:_locore_start+8b ()
panic: entering debugger (no dump device, continue to reboot)
kmdb: target stopped at:
kmdb_enter+0xb: movq   %rax,%rdi
[0]> ::regs
%rax = 0x0000000000000002                 %r9  = 0x00000000000186a0
%rbx = 0xfffffffffb98fc00                 %r10 = 0x0000000000000000
%rcx = 0x0000000000000020                 %r11 = 0x0000000000000000
%rdx = 0x00000000000003f8                 %r12 = 0xfffffffffbc9d328 panicbuf
%rsi = 0x000000000000000a                 %r13 = 0x0000000000000000
%rdi = 0xfffffffffbfcb9c0     nullwrapper %r14 = 0x0000000000000000
%r8  = 0xfffffffffbca0fc0 panic_stack+0x19f0 %r15 = 0x0000000000000000
%rip = 0xfffffffffb88a36b kmdb_enter+0xb
%rbp = 0xfffffffffbca13c0
%rsp = 0xfffffffffbca13c0
%rflags = 0x00000002
  id=0 vip=0 vif=0 ac=0 vm=0 rf=0 nt=0 iopl=0x0
  status=<of,df,if,tf,sf,zf,af,pf,cf>
%cs = 0x0030    %ds = 0x0000    %es = 0x0000    %fs = 0x0000
%gs = 0x0000    %gsbase = 0xfffffffffbc5b000    %kgsbase = 0x200000000
%trapno = 0x14  %err = 0x0      %cr2 = 0x0      %cr3 = 0x1e400000

The tripped ASSERT seems to be the following:

 736                 /*
 737                  * This routine uses a PCI config space based mechanism
 738                  * for retrieving the number of nodes in the system.
 739                  * Device 24, function 0, offset 0x60 as used here is not
 740                  * AMD processor architectural, and may not work on processor
 741                  * families other than those listed below.
 742                  *
 743                  * Callers of this routine must ensure that we're running on
 744                  * a processor which supports this mechanism.
 745                  * The assertion below is meant to catch calls on unsupported
 746                  * processors.
 747                  */
 748                 family = cpuid_getfamily(CPU);
 749                 ASSERT(family == 0xf || family == 0x10 || family == 0x11);

[0]> cpu::print cpu_t cpu_m.mcpu_cpi->cpi_family
kmdb: failed to read 4 bytes at fffffffd0000001d: no mapping for address
[0]> cpu::print -at cpu_t cpu_m.mcpu_cpi
fffffffffbc07a80 struct cpuid_info *cpu_m.mcpu_cpi = 0xfffffffd00000001

On a non-DEBUG kernel I see the following with mdb:

> cpu::print -at cpu_t cpu_m ! grep mcpu_cpi
    fffffffffbc07a80 struct cpuid_info *cpu_m.mcpu_cpi = 1

KVM passes in the CPU as

The physical processor has 1 virtual processor (15)
  x86 (AuthenticAMD 800F12 family 23 model 1 step 2 clock 3393 MHz)
        AMD EPYC Processor (with IBPB)

Linux reports the CPU as

processor       : 31
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD Ryzen Threadripper 1950X 16-Core Processor
stepping        : 1
microcode       : 0x8001137
cpu MHz         : 3316.490
cache size      : 512 KB
physical id     : 0
siblings        : 32
core id         : 15
cpu cores       : 16
apicid          : 31
initial apicid  : 31
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 6788.03
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]

It also looks like family 0x17 (decimal 23) is not covered by the ASSERT statement. Perhaps this needs to be updated for newer CPUs?
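To make the family arithmetic concrete, here is a minimal sketch of how the signature 800F12 reported above decodes to family 23 (0x17), model 1, step 2. It assumes the standard CPUID leaf 1 %eax layout and the AMD rule that the extended family and model fields only apply when the base family is 0xf. The names cpu_sig_t and decode_cpu_sig() are hypothetical and are not illumos functions.

```c
#include <stdint.h>

/* Decoded fields of the CPUID leaf 1 %eax processor signature. */
typedef struct cpu_sig {
	uint32_t family;
	uint32_t model;
	uint32_t stepping;
} cpu_sig_t;

/*
 * Hypothetical helper for illustration only (not an illumos function).
 * Layout assumed: stepping in bits 3:0, base model in 7:4, base family
 * in 11:8, extended model in 19:16, extended family in 27:20.
 */
static cpu_sig_t
decode_cpu_sig(uint32_t eax)
{
	cpu_sig_t sig;
	uint32_t base_family = (eax >> 8) & 0xf;
	uint32_t base_model = (eax >> 4) & 0xf;
	uint32_t ext_family = (eax >> 20) & 0xff;
	uint32_t ext_model = (eax >> 16) & 0xf;

	sig.stepping = eax & 0xf;
	sig.family = base_family;
	sig.model = base_model;

	/* AMD rule: extended fields only count when base family is 0xf. */
	if (base_family == 0xf) {
		sig.family += ext_family;
		sig.model |= ext_model << 4;
	}

	return (sig);
}
```

Calling decode_cpu_sig(0x800F12) yields a base family of 0xf plus an extended family of 0x8, which is 0x17. That value is outside the set the ASSERT accepts (0xf, 0x10, 0x11).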


So, fundamentally, we shouldn't be calling into opteron_get_nnodes() at all. In fact, it's very suspicious that we've detected any errata at all, since nearly all of the errata that we're trying to work around apply to family 0x10 at the latest. I think family 0x11 was added because we also try to work around not having a constant TSC there, though it's hard to say. Looking at the programming manual for recent processors, it's pretty clear that, at least on families since 15h, the technique used there is not going to work.

So that leads us to ask: why are we calling this at all? With some help from Michael Zeller, we were able to figure out that we were calling into this because of what we call erratum '6336786'. This isn't actually an erratum at all! Instead, it is an attempt to work around the fact that certain deep C-states cause the TSC to be non-constant. Unfortunately, nothing seems to virtualize this today. Even worse, the things that we want to somewhat arbitrarily write don't seem to exist on all platforms.
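As a rough illustration of the caller-side check that the comment in mp_startup.c asks for, the sketch below gates the node-count mechanism on the same family list the ASSERT uses. This is not the change made in the commit for this bug. The function name nnodes_mechanism_supported() is hypothetical, and the family list is taken directly from the ASSERT quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate for illustration only. It returns true only for
 * the families named in the ASSERT in opteron_get_nnodes(): 0xf, 0x10
 * and 0x11. A caller would check this before using the PCI config space
 * mechanism, rather than relying on the ASSERT to catch misuse.
 */
static bool
nnodes_mechanism_supported(uint32_t family)
{
	return (family == 0xf || family == 0x10 || family == 0x11);
}
```

With this check, a family 0x17 processor such as the Threadripper 1950X would skip the call entirely, so a DEBUG kernel would never reach the ASSERT.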


1. Booted on the impacted Threadripper box, both virtualized (which triggers this) and on bare metal (which should not trigger this).
2. Tested on a physical AMD EPYC box to make sure there were no problems.

History

#1

Updated by Robert Mustacchi 11 months ago

#2

Updated by Robert Mustacchi 11 months ago

#3

Updated by Electric Monk 11 months ago

  • Status changed from New to Closed

git commit bf9b145b69c7f9e2ab076ee1ab9461cb6d527e64

commit  bf9b145b69c7f9e2ab076ee1ab9461cb6d527e64
Author: Robert Mustacchi <rm@joyent.com>
Date:   2019-01-29T21:04:52.000Z

    10275 Kernel attempts to apply old erratum to AMD Threadripper 1950X
    Reviewed by: Hans Rosenfeld <hans.rosenfeld@joyent.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Mike Zeller <mike.zeller@joyent.com>
    Reviewed by: Andy Stormont <astormont@racktopsystems.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Gergő Doma <domag02@gmail.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>
