Feature #12904

Update nvme health logpage and temp thresholds

Added by Robert Mustacchi 3 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Category: driver - device drivers
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

NVMe 1.2, 1.3, and 1.4 have extended the nvme health logpage information and added temperature sensors and thresholds. We should update our headers and nvmeadm to support fetching and understanding that information.

History

#1

Updated by Electric Monk 3 months ago

  • Gerrit CR set to 759

#2

Updated by Robert Mustacchi 3 months ago

To test this, I used the two interfaces on a system that had a single sensor:

rm@beowulf:~$ pfexec nvmeadm -v get-features nvme0 "Temperature Threshold" 
nvme0: Get Features
  Temperature Threshold
    Composite Over Temp. Threshold:         77C
    Composite Under Temp. Threshold:        0C

nvme0: SMART/Health Information
  Critical Warnings
    Available Space:                        OK
    Temperature:                            OK
    Device Reliability:                     OK
    Media:                                  OK
    Volatile Memory Backup:                 OK
  Temperature:                              32C
  Available Spare Capacity:                 100%
  Available Spare Threshold:                10%
  Device Life Used:                         0%
  Data Read:                                561GB
  Data Written:                             1033GB
  Read Commands:                            92989671
  Write Commands:                           56764216
  Controller Busy:                          1168min
  Power Cycles:                             123
  Power On:                                 464h
  Unsafe Shutdowns:                         52
  Uncorrectable Media Errors:               0
  Errors Logged:                            0
  Warning Composite Temperature Time:       0min
  Critical Composite Temperature Time:      0min
  Thermal Management Temp 1 Transition Count: 0
  Thermal Management Temp 2 Transition Count: 0
  Time for Thermal Management Temp 1:       0sec
  Time for Thermal Management Temp 2:       0sec

I also tested this on someone else's system that had multiple temperature sensors. This initially uncovered bugs in the implementation due to rounding and other issues. However, I verified those by manually interposing on them with mdb to set the default limits that were advertised. While there was some consideration as to whether the default limits for the 'temperature thresholds' should be exposed, I opted to show them here so that someone can understand what the device is programmed with, even if the actual value isn't practical (an upper temperature threshold of, say, UINT16_MAX Kelvin).
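
To illustrate why a default of UINT16_MAX Kelvin is not a practical limit, here is a minimal sketch of converting a raw 16-bit Kelvin threshold into the whole-degree Celsius form shown in the output above. The helper name kelvin_to_celsius and the integer offset of 273 are assumptions made for illustration; this is not taken from the nvmeadm source.

```c
#include <stdint.h>

/*
 * Assumption: whole-degree conversion with an integer offset of 273
 * (rather than 273.15), so results are whole degrees Celsius.
 */
#define	KELVIN_OFFSET	273

/*
 * Hypothetical helper: convert a raw 16-bit temperature value in
 * Kelvin to degrees Celsius. The result is a signed int because
 * small Kelvin values map to negative Celsius values.
 */
int
kelvin_to_celsius(uint16_t kelvin)
{
	return ((int)kelvin - KELVIN_OFFSET);
}
```

Under these assumptions, a raw value of 350 corresponds to 77C, while a raw value of UINT16_MAX (65535) corresponds to 65262C, far beyond anything the device could reach.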

#3

Updated by Electric Monk 3 months ago

  • Status changed from New to Closed
  • % Done changed from 90 to 100

git commit 4a663bac9c5f9f82a5f633bc9639bbee3c2317ff

commit  4a663bac9c5f9f82a5f633bc9639bbee3c2317ff
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   2020-07-09T04:54:40.000Z

    12904 Update nvme health logpage and temp thresholds
    Reviewed by: C Fraire <cfraire@me.com>
    Reviewed by: Paul Winder <paul@winder.uk.net>
    Approved by: Richard Lowe <richlowe@richlowe.net>
