Feature #11976
Closed
Want CPU temperature sensors for Zen 2, Raven Ridge
100%
Description
We should teach the amdf17nbdf driver about Zen 2 (both Matisse and Rome) and Raven Ridge processors. The change is confined to the amdf17nbdf driver.
Updated by Robert Mustacchi over 2 years ago
Review for this is at https://code.illumos.org/c/illumos-gate/+/278.
Updated by Robert Mustacchi over 2 years ago
We've managed to verify this on a Zen+ APU, thanks to Gary Mills. On that platform we see:
<root@ryzen># /usr/lib/fm/fmd/fmtopo -V '*sensor=temp*'
TIME                 UUID
Dec 29 17:06:26 c424cc3e-b9a0-ca24-b907-a53333468535

hc://:product-id=System-Product-Name:server-id=ryzen:chassis-id=System-Serial-Number/motherboard=0/chip=0/core=0?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:product-id=System-Product-Name:server-id=ryzen:chassis-id=System-Serial-Number/motherboard=0/chip=0/core=0?sensor=temp
  group: authority    version: 1    stability: Private/Private
    product-id      string    System-Product-Name
    chassis-id      string    System-Serial-Number
    server-id       string    ryzen
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    29.375000

hc://:product-id=System-Product-Name:server-id=ryzen:chassis-id=System-Serial-Number/motherboard=0/chip=0/core=1?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:product-id=System-Product-Name:server-id=ryzen:chassis-id=System-Serial-Number/motherboard=0/chip=0/core=1?sensor=temp
  group: authority    version: 1    stability: Private/Private
    product-id      string    System-Product-Name
    chassis-id      string    System-Serial-Number
    server-id       string    ryzen
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    29.375000

hc://:product-id=System-Product-Name:server-id=ryzen:chassis-id=System-Serial-Number/motherboard=0/chip=0/core=8?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:product-id=System-Product-Name:server-id=ryzen:chassis-id=System-Serial-Number/motherboard=0/chip=0/core=8?sensor=temp
  group: authority    version: 1    stability: Private/Private
    product-id      string    System-Product-Name
    chassis-id      string    System-Serial-Number
    server-id       string    ryzen
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    29.375000

hc://:product-id=System-Product-Name:server-id=ryzen:chassis-id=System-Serial-Number/motherboard=0/chip=0/core=9?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:product-id=System-Product-Name:server-id=ryzen:chassis-id=System-Serial-Number/motherboard=0/chip=0/core=9?sensor=temp
  group: authority    version: 1    stability: Private/Private
    product-id      string    System-Product-Name
    chassis-id      string    System-Serial-Number
    server-id       string    ryzen
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    29.375000
Updated by Harmen R over 2 years ago
I've tested this on a Ryzen 3700X / X470 system.
It reports temperatures for the CPU, but I'm not sure the temperature is correct: it's consistently 10 degrees Celsius higher than what the IPMI console reports for the CPU. I remember reading something about an offset like that, but unfortunately I can't find it anymore.
I've also read the temperature on the latest Linux kernel; on Linux the readings agree with the IPMI values.
[root@ryzen-server ~]# /usr/lib/fm/fmd/fmtopo -V '*sensor=temp*'
TIME                 UUID
Jan 02 19:06:56 3f2de7d0-f9f7-4672-8297-a90ebff26d7a

hc://:server-id=ryzen-server/motherboard=0/chip=0/core=0?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=ryzen-server/motherboard=0/chip=0/core=0?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    ryzen-server
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    49.750000

hc://:server-id=ryzen-server/motherboard=0/chip=0/core=1?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=ryzen-server/motherboard=0/chip=0/core=1?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    ryzen-server
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    49.750000

hc://:server-id=ryzen-server/motherboard=0/chip=0/core=2?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=ryzen-server/motherboard=0/chip=0/core=2?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    ryzen-server
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    49.750000

hc://:server-id=ryzen-server/motherboard=0/chip=0/core=3?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=ryzen-server/motherboard=0/chip=0/core=3?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    ryzen-server
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    49.750000

hc://:server-id=ryzen-server/motherboard=0/chip=0/core=4?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=ryzen-server/motherboard=0/chip=0/core=4?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    ryzen-server
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    49.750000

hc://:server-id=ryzen-server/motherboard=0/chip=0/core=5?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=ryzen-server/motherboard=0/chip=0/core=5?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    ryzen-server
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    49.750000

hc://:server-id=ryzen-server/motherboard=0/chip=0/core=6?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=ryzen-server/motherboard=0/chip=0/core=6?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    ryzen-server
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    49.750000

hc://:server-id=ryzen-server/motherboard=0/chip=0/core=7?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=ryzen-server/motherboard=0/chip=0/core=7?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    ryzen-server
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    49.750000
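For comparing readings like these against IPMI or lm-sensors values, the per-core numbers can be pulled out of the `fmtopo -V` text. The following is a minimal sketch, not part of the change under review: the function name `core_readings` and the regular expression are illustrative assumptions, and the sample string is an abbreviated copy of the output above.

```python
import re
from typing import Dict

# Match a node's "core=N?sensor=temp" FMRI and the first "reading double X"
# that follows it. DOTALL lets the match span line breaks, so this works on
# both wrapped and flattened copies of the output.
NODE_RE = re.compile(
    r"core=(\d+)\?sensor=temp.*?reading\s+double\s+(-?\d+(?:\.\d+)?)",
    re.DOTALL,
)


def core_readings(fmtopo_text: str) -> Dict[int, float]:
    """Return {core number: temperature in degrees C} from fmtopo -V text."""
    return {int(core): float(val) for core, val in NODE_RE.findall(fmtopo_text)}


# Abbreviated sample based on the output above (most properties omitted).
sample = (
    "hc://:server-id=ryzen-server/motherboard=0/chip=0/core=0?sensor=temp "
    "group: protocol version: 1 stability: Private/Private resource fmri "
    "hc://:server-id=ryzen-server/motherboard=0/chip=0/core=0?sensor=temp "
    "units uint32 0x1 (DEGREES_C) reading double 49.750000 "
    "hc://:server-id=ryzen-server/motherboard=0/chip=0/core=1?sensor=temp "
    "units uint32 0x1 (DEGREES_C) reading double 49.750000"
)
print(core_readings(sample))
```

All cores report the same value here, which is consistent with a single temperature reading being shared across the chip.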
Updated by Robert Mustacchi over 2 years ago
Thanks for testing this, Harmen. The fact that it's 10 degrees off is definitely a bit weird. While older Ryzen models did have an offset that the platform is required to note, there definitely isn't one for the Matisse family and your 3700X. I'd like to see what the raw value we read was. While on illumos with these changes, could you try running the following in mdb -k?
*amdf17nbdf::print amdf17nbdf_t amd_nbdf_nbs | ::walk list | ::print amdf17nb_t
I'm pretty sure that's the right syntax. If you could also run ipmitool sensor and include the output here (running it on the local host right around the same time as you run that mdb command), I'd appreciate it.
Updated by Robert Mustacchi over 2 years ago
One other question, can you confirm for me how you're viewing the temperature on Linux?
Updated by Harmen R over 2 years ago
No problem. On Linux I used the k10temp driver with lm-sensors. k10temp has supported this CPU since kernel 5.4.
I ran the commands you asked for one after the other, with the box idle (all zones stopped).
> *amdf17nbdf::print amdf17nbdf_t amd_nbdf_nbs | ::walk list | ::print amdf17nb_t
{
    amd_nb_link = {
        list_next = 0xfffffe5942f1d7b8
        list_prev = 0xfffffe5942f1d7b8
    }
    amd_nb_dip = 0xfffffe592a744558
    amd_nb_cfgspace = 0xfffffe5936580140
    amd_nb_bus = 0
    amd_nb_dev = 0
    amd_nb_func = 0
    amd_nb_df = 0xfffffe5942931220
    amd_nb_procnodeid = 0
    amd_nb_temp_minor = 0x1
    amd_nb_temp_last_read = 0x16d57107328
    amd_nb_temp_off = 0
    amd_nb_temp_reg = 0x2f500fef
    amd_nb_temp = 0x17a
}
3VSB | 3.360 | Volts | ok | 2.880 | 3.040 | na | na | 3.700 | 3.880
5VSB | 5.040 | Volts | ok | 4.260 | 4.500 | na | na | 5.490 | 5.730
VCPU | 0.930 | Volts | ok | na | na | na | na | 1.650 | 1.730
VSOC | 1.020 | Volts | ok | 0.340 | 0.360 | na | na | 1.540 | 1.610
VCCM | 1.190 | Volts | ok | 1.020 | 1.080 | na | na | 1.320 | 1.380
APU_VDDP | 0.950 | Volts | ok | 0.770 | 0.810 | na | na | 1.160 | 1.210
3V | 3.360 | Volts | ok | 2.800 | 2.980 | na | na | 3.620 | 3.780
5V | 5.040 | Volts | ok | 4.260 | 4.500 | na | na | 5.490 | 5.730
12V | 12.000 | Volts | ok | 10.200 | 10.800 | na | na | 13.200 | 13.800
MB Temp | 24.000 | degrees C | ok | na | na | na | 55.000 | na | na
Card side Temp | 26.000 | degrees C | ok | na | na | na | 68.000 | na | na
CPU Temp | 37.000 | degrees C | ok | na | na | na | 93.000 | 94.000 | na
DDR4_A2_Temp | na | degrees C | na | na | na | na | 84.000 | 85.000 | na
DDR4_A1_Temp | 23.000 | degrees C | ok | na | na | na | 84.000 | 85.000 | na
DDR4_B2_Temp | na | degrees C | na | na | na | na | 84.000 | 85.000 | na
DDR4_B1_Temp | 23.000 | degrees C | ok | na | na | na | 84.000 | 85.000 | na
FAN1 | 1200.000 | RPM | ok | na | na | 100.000 | na | na | na
FAN2 | 5600.000 | RPM | ok | na | na | 100.000 | na | na | na
FAN3 | 5200.000 | RPM | ok | na | na | 100.000 | na | na | na
FAN4 | na | RPM | na | na | na | 100.000 | na | na | na
FAN5 | 5400.000 | RPM | ok | na | na | 100.000 | na | na | na
FAN6 | 5000.000 | RPM | ok | na | na | 100.000 | na | na | na
ChassisIntr | 0x0 | discrete | 0x0080| na | na | na | na | na | na
CPU_PROCHOT | 0x0 | discrete | 0x0080| na | na | na | na | na | na
CPU_THERMTRIP | 0x0 | discrete | 0x0080| na | na | na | na | na | na
PSU1 Status | 0x0 | discrete | 0x0080| na | na | na | na | na | na
PSU1 AC lost | na | discrete | na | na | na | na | na | na | na
PSU2 Status | 0x0 | discrete | 0x0080| na | na | na | na | na | na
PSU2 AC lost | na | discrete | na | na | na | na | na | na | na
1.05V_PROM_S5 | 1.060 | Volts | ok | 0.890 | 0.950 | na | na | 1.160 | 1.210
2.5V_PROM | 2.580 | Volts | ok | 2.120 | 2.260 | na | na | 2.740 | 2.880
1.05V_PROM_RUN | 1.010 | Volts | ok | 0.890 | 0.950 | na | na | 1.160 | 1.210
BAT | 3.100 | Volts | ok | 2.000 | 2.700 | na | na | 3.400 | 3.560
PSU2 PIN | na | Watts | na | na | na | na | na | na | na
PSU2 POUT | na | Watts | na | na | na | na | na | na | na
PSU1 PIN | na | Watts | na | na | na | na | na | na | na
PSU1 VIN | na | Volts | na | na | na | na | na | na | na
PSU2 VIN | na | Volts | na | na | na | na | na | na | na
PSU2 IOUT | na | Amps | na | na | na | na | na | na | na
PSU1 IOUT | na | Amps | na | na | na | na | na | na | na
PSU1 POUT | na | Watts | na | na | na | na | na | na | na
Updated by Robert Mustacchi over 2 years ago
Thank you for including that, Harmen. Based on the math, we are reporting Tctl correctly from the raw value we read, and Linux should be using the same calculation, so I'm not sure why we got a reading 10 degrees higher than Linux did. I could see power state or other conditions shifting the measurement enough to make it differ, but it's weird for IPMI to read something different from us, especially when IPMI and Linux match. Given that this system can usually run both Zen 1 and Zen 2 chips, I could see firmware accidentally including the Ryzen 7 2700X 10 degree offset, but then I wouldn't expect Linux to match that.
I've confirmed that the 5.4 kernel doesn't have a temperature adjustment offset for the Ryzen 3000 (Zen 2) line. Can you confirm what model of motherboard this is? I assume the IPMI firmware is relatively up to date. I'll see if we can get some other numbers from different Zen 2 systems to make more sense of what's going on.
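The math referred to above can be reproduced from the mdb output. This is a minimal sketch that assumes the register layout used by the Linux k10temp driver for family 17h (temperature in bits 31:21 in 0.125 degree steps, with bit 19 selecting a range shifted down by 49 degrees); the constant and function names are illustrative and are not the identifiers used in amdf17nbdf.

```python
# Assumed layout, following Linux k10temp for family 17h parts:
#   bits 31:21  current temperature, in units of 0.125 degrees C
#   bit  19     range select; when set, the value is shifted by -49 degrees C
CUR_TEMP_SHIFT = 21
CUR_TEMP_MASK = 0x7FF
RANGE_SEL_BIT = 1 << 19
RANGE_ADJUST_C = -49.0
STEP_C = 0.125


def raw_temp(reg: int) -> int:
    """Extract the 11-bit raw temperature field from the register value."""
    return (reg >> CUR_TEMP_SHIFT) & CUR_TEMP_MASK


def decode_tctl(reg: int) -> float:
    """Convert the raw register value to Tctl in degrees C."""
    temp_c = raw_temp(reg) * STEP_C
    if reg & RANGE_SEL_BIT:
        temp_c += RANGE_ADJUST_C
    return temp_c


reg = 0x2F500FEF  # amd_nb_temp_reg from the mdb output above
print(hex(raw_temp(reg)))  # matches amd_nb_temp = 0x17a
print(decode_tctl(reg))    # 47.25
```

With the range-select bit clear and amd_nb_temp_off at 0, the register decodes to 47.25 degrees C, which is 10.25 degrees above the 37.000 that ipmitool reported for CPU Temp at about the same time.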
Updated by Harmen R over 2 years ago
I have an X470D4U from ASRock Rack (https://www.asrockrack.com/general/productdetail.asp?Model=X470D4U). My BIOS is the latest version, but I see there's a BMC firmware update available. I'll try to update that.
I will also double check my Linux results, just in case.
Updated by Harmen R over 2 years ago
OK, I've updated the BMC to the latest firmware. No changes.
On Linux the k10temp driver reports the same temperatures as ipmitool; on SmartOS the reported temperature is 10 degrees Celsius higher.
Updated by Robert Mustacchi about 2 years ago
So, another data point here. I have a Rome system I built using the ASRock Rack EPYCD8. Testing on it shows that all of the temperature data is in line between the OS and the BMC. Here's what I see there for comparison:
rm@beowulf:~$ pfexec /usr/lib/fm/fmd/fmtopo -V *sensor=temp
TIME                 UUID
Apr 02 17:35:20 2e847e24-0ded-e491-b920-a0c79ec0764c

hc://:server-id=beowulf/motherboard=0/chip=0/core=0?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=0?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=1?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=1?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=2?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=2?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=3?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=3?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=4?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=4?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=5?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=5?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=6?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=6?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=7?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=7?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=8?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=8?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=9?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=9?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=10?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=10?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=11?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=11?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=12?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=12?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=13?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=13?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=14?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=14?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000

hc://:server-id=beowulf/motherboard=0/chip=0/core=15?sensor=temp
  group: protocol    version: 1    stability: Private/Private
    resource        fmri      hc://:server-id=beowulf/motherboard=0/chip=0/core=15?sensor=temp
  group: authority    version: 1    stability: Private/Private
    server-id       string    beowulf
  group: facility    version: 1    stability: Private/Private
    sensor-class    string    threshold
    type            uint32    0x1 (TEMP)
    units           uint32    0x1 (DEGREES_C)
    reading         double    27.375000
and the relevant temperature sensor from the BMC:
CPU Temp | 27.000 | degrees C | ok | na | na | na | 95.000 | na | na
So at least it aligns here. Since I can't find anywhere that we're reading it incorrectly from the CPU, I think it may be worth moving forward with this. I'm sure we'll continue to see some challenges with the OS temperature sensors, but if the temperature we're reading is the one the CPU will generate alerts from, I think it's probably worth it.
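The OS-versus-BMC comparison above can be scripted. This is a minimal sketch that assumes the pipe-delimited, one-sensor-per-line layout of the `ipmitool sensor` output shown in this ticket; the helper name `ipmi_reading` is hypothetical.

```python
from typing import Optional


def ipmi_reading(sensor_output: str, name: str) -> Optional[float]:
    """Return the numeric reading for the named sensor, or None if the
    sensor is missing or its value is not a number (e.g. "na" or "0x0")."""
    for line in sensor_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2 and fields[0] == name:
            try:
                return float(fields[1])
            except ValueError:
                return None
    return None


bmc_row = "CPU Temp | 27.000 | degrees C | ok | na | na | na | 95.000 | na | na"
os_reading = 27.375  # fmtopo reading on the Rome system above

bmc = ipmi_reading(bmc_row, "CPU Temp")
print(os_reading - bmc)  # 0.375
```

On the Rome system the two agree to within 0.375 degrees, which is just the difference between the 0.125 degree steps in the OS reading and the whole degrees the BMC reports here.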
Updated by Electric Monk about 2 years ago
- Status changed from New to Closed
- % Done changed from 90 to 100
git commit 0e6adfea4a40da04a1864bdeed7e17450ce04df5
commit 0e6adfea4a40da04a1864bdeed7e17450ce04df5
Author: Robert Mustacchi <rm@fingolfin.org>
Date: 2020-04-03T05:16:46.000Z

    11976 Want CPU temperature sensors for Zen 2, Raven Ridge
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Joshua M. Clulow <josh@sysmgr.org>