Update nvme health logpage and temp thresholds
NVMe 1.2, 1.3, and 1.4 have extended the NVMe health log page information and added additional temperature sensors and thresholds. We should update our headers and nvmeadm to support fetching and understanding that information.
Updated by Robert Mustacchi 3 months ago
To test this, I used the two interfaces on a system that had a single sensor:
rm@beowulf:~$ pfexec nvmeadm -v get-features nvme0 "Temperature Threshold"
nvme0: Get Features
  Temperature Threshold
    Composite Over Temp. Threshold: 77C
    Composite Under Temp. Threshold: 0C
nvme0: SMART/Health Information
  Critical Warnings
    Available Space: OK
    Temperature: OK
    Device Reliability: OK
    Media: OK
    Volatile Memory Backup: OK
  Temperature: 32C
  Available Spare Capacity: 100%
  Available Spare Threshold: 10%
  Device Life Used: 0%
  Data Read: 561GB
  Data Written: 1033GB
  Read Commands: 92989671
  Write Commands: 56764216
  Controller Busy: 1168min
  Power Cycles: 123
  Power On: 464h
  Unsafe Shutdowns: 52
  Uncorrectable Media Errors: 0
  Errors Logged: 0
  Warning Composite Temperature Time: 0min
  Critical Composite Temperature Time: 0min
  Thermal Management Temp 1 Transition Count: 0
  Thermal Management Temp 2 Transition Count: 0
  Time for Thermal Management Temp 1: 0sec
  Time for Thermal Management Temp 2: 0sec
I also tested this on someone else's system that had multiple temperature sensors. This initially uncovered bugs in the implementation due to rounding and other issues. I then verified those cases by manually interposing on them with mdb to set the default limits that were advertised. While there was some consideration as to whether the default limits for the 'temperature thresholds' should be exposed, I opted to show them so that someone can understand what the device is programmed with, even if the actual value isn't practical (for example, an upper temperature threshold of UINT16_MAX Kelvin).
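To make the Kelvin point concrete, here is a minimal sketch of the arithmetic involved. It is not the code from this change: the function and macro names are hypothetical, the Celsius conversion assumes a whole-degree offset of 273, and the field layout (threshold in bits 15:0, sensor select in bits 19:16, over/under select in bits 21:20 of Command Dword 11) follows my reading of the Temperature Threshold feature in the NVMe specification.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the illumos nvme header names. */
#define	TT_TMPSEL_SHIFT	16	/* bits 19:16: 0 = composite, 1-8 = sensor */
#define	TT_THSEL_SHIFT	20	/* bits 21:20: 0 = over, 1 = under */

/*
 * Convert a threshold in Kelvin to whole degrees Celsius.
 * Assumes an integer offset of 273; widening to int32_t keeps
 * both 0 K and UINT16_MAX K representable.
 */
static int32_t
kelvin_to_celsius(uint16_t kelvin)
{
	return ((int32_t)kelvin - 273);
}

/*
 * Pack Command Dword 11 for the Temperature Threshold feature:
 * threshold in Kelvin, the sensor to select, and whether this is
 * the over (0) or under (1) threshold.
 */
static uint32_t
temp_thresh_dw11(uint16_t kelvin, uint8_t sensor, uint8_t thsel)
{
	assert(sensor <= 0xf);
	assert(thsel <= 1);

	return ((uint32_t)kelvin |
	    ((uint32_t)sensor << TT_TMPSEL_SHIFT) |
	    ((uint32_t)thsel << TT_THSEL_SHIFT));
}
```

Under these assumptions, a 77C over threshold like the one above corresponds to 350 K, while a default of UINT16_MAX Kelvin converts to 65262C, which is why such a value is reported even though it is not a practical limit.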
Updated by Electric Monk 3 months ago
- Status changed from New to Closed
- % Done changed from 90 to 100
commit 4a663bac9c5f9f82a5f633bc9639bbee3c2317ff
Author: Robert Mustacchi <email@example.com>
Date: 2020-07-09T04:54:40.000Z

    12904 Update nvme health logpage and temp thresholds
    Reviewed by: C Fraire <firstname.lastname@example.org>
    Reviewed by: Paul Winder <email@example.com>
    Approved by: Richard Lowe <firstname.lastname@example.org>