Bug #7327

open

fmd should not retire a disk when its reference temperature is exceeded

Added by Pavel Cahyna almost 7 years ago. Updated about 6 years ago.

Status: New
Priority: Normal
Assignee: -
Category: cmd - userland programs
Start date: 2016-08-26
Due date:
% Done: 0%
Estimated time:
Difficulty: Bite-size
Tags: needs-triage
Gerrit CR:
External Bug:

Description

Some hard drives have their "Drive Trip Temperature" set to a low value (40 deg C). When this temperature is exceeded, the fault manager disables the disk:

SUNW-MSG-ID: DISK-8000-12, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Tue Aug 23 22:35:12 CEST 2016
PLATFORM: X10DRH-LN4, CSN: 123456789, HOSTNAME: nfsserv2
SOURCE: eft, REV: 1.16
EVENT-ID: 5f46d123-7ffd-cc9f-f4aa-a82bfaee0905
DESC: A disk's temperature exceeded the limits established by
its manufacturer.
  Refer to http://illumos.org/msg/DISK-8000-12 for more information.
AUTO-RESPONSE: None.
IMPACT: Performance degradation is likely and continued disk operation
beyond the temperature threshold can result in disk
damage and potential data loss.
REC-ACTION: Ensure that the system is properly cooled, that all fans are
functional, and that there are no obstructions of airflow to the affected
disk.

Now the disk is inaccessible: accessing the raw device node produces "no such device or address". (By the way, this also means that one cannot upgrade the firmware to fix the problem, even on the Seagate drives, without using another system or arcane workarounds: http://www.bigdatajunkie.com/index.php/10-hardware/19-seagate-constellation-es-3-firmware-0003 )
When one has many drives with the same property, this can easily result in losing access to an entire zpool, even a redundant one.
Problematic drives include: Seagate Constellation ES3 (f/w rev. 0003), see https://forums.servethehome.com/index.php?threads/constellation-es3-drives-retired-from-solaris.3460/ , and WD WD4001FYYG (f/w VR08).
For the Seagate drives the value was raised to 60 in f/w 0004, but AFAIK there is no firmware update for the WD ones (the issue was in fact introduced by a f/w update, f/w rev. VR07). I contacted our disk reseller, who inquired at WD and got a response saying that this value is for display only, does not affect any disk or controller operation, and that changing it has no influence on functionality or configuration.
From this it is clear that retiring the disk when the value is exceeded is not a use of the value that the manufacturer intended.
Note also that the original Sun description of automated problem diagnosis (Patent US7379846 - System and method for automated problem diagnosis, https://www.google.com/patents/US7379846 ) says:

Upset: An upset is an event that temporarily causes the system to behave in an unexpected manner. For a computer, a typical form of upset might be a cosmic ray strike, a nuclear decay, or a power supply surge that might perhaps alter the value of a memory bit. A very hot day or a failure of the room air conditioning system could also be regarded as a potential upset (although the latter might also be modelled as a fault). Nevertheless, despite the upset the underlying operation of the system is correct, and no specific repair is needed to the system hardware.

From this description, exceeding a temperature threshold can be regarded as an upset, not a fault.
I therefore suggest disabling this fault in fmd and treating the event as an upset, as the patent description suggests. Here is a patch to do this:

--- ../disk.esc Wed Aug 24 15:01:30 2016
+++ disk.esc    Thu Aug 25 20:14:10 2016
@@ -70,6 +70,7 @@
 event upset.io.scsi.cmd.disk.dev.serr@P;
 event upset.io.scsi.cmd.disk.tran@P;
 event upset.io.scsi.cmd.disk.recovered@P;
+event upset.io.disk.over-temperature@P;

 /*
  * disk-as-detector: ereports from the kernel.
@@ -179,8 +180,11 @@
 /*
  * Fault events.
  */
-event fault.io.disk.over-temperature@P,
-    FITrate=10, FRU=P, ASRU=P;
+/*
+ * event fault.io.disk.over-temperature@P,
+ *     FITrate=10, FRU=P, ASRU=P;
+ */
+
 event fault.io.disk.predictive-failure@P, FITrate=10,
     FITrate=10, FRU=P, ASRU=P;
 event fault.io.disk.self-test-failure@P, FITrate=10,
@@ -196,9 +200,15 @@
 /*
  * Propagations.
  */
-prop fault.io.disk.over-temperature@P ->
+
+prop upset.io.disk.over-temperature@P ->
     ereport.io.scsi.disk.over-temperature@P;

+/*
+ * prop fault.io.disk.over-temperature@P ->
+ *     ereport.io.scsi.disk.over-temperature@P;
+ */
+
 prop fault.io.disk.self-test-failure@P ->
     ereport.io.scsi.disk.self-test-failure@P;
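
To illustrate the behavioural difference the patch is after, here is a toy model in Python. It is not fmd or Eversholt code: the `Disk` class, the rule tables and `handle_ereport` are hypothetical names made up for this sketch, and it only assumes what is described above, namely that a fault diagnosis ends with the disk retired while an upset needs no repair.

```python
from dataclasses import dataclass
from enum import Enum


class Diagnosis(Enum):
    FAULT = "fault"  # calls for repair; in this report it led to the disk being retired
    UPSET = "upset"  # transient condition; no repair of the hardware is needed


@dataclass
class Disk:
    name: str
    retired: bool = False


OVER_TEMP = "ereport.io.scsi.disk.over-temperature"

# Hypothetical rule tables: how the ereport is diagnosed before and after the patch.
BEFORE_PATCH = {OVER_TEMP: Diagnosis.FAULT}
AFTER_PATCH = {OVER_TEMP: Diagnosis.UPSET}


def handle_ereport(disk, ereport_class, rules):
    """Apply the toy rules: only a FAULT diagnosis retires the disk."""
    diagnosis = rules.get(ereport_class)
    if diagnosis is Diagnosis.FAULT:
        disk.retired = True
    return diagnosis


if __name__ == "__main__":
    for label, rules in (("before", BEFORE_PATCH), ("after", AFTER_PATCH)):
        disk = Disk("c0t0d0")
        diagnosis = handle_ereport(disk, OVER_TEMP, rules)
        print(label, diagnosis.value, "retired" if disk.retired else "still in service")
```

With the "before" rules the disk ends up retired; with the "after" rules the same ereport is still diagnosed, but the disk stays in service.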


Related issues

Related to illumos gate - Bug #7330: reconsider retiring disks from live pools by fmd (New, 2016-08-26)
Related to illumos gate - Bug #8244: Some drives report a bogus reference temperature (In Progress, 2017-05-17)

#2

Updated by Alek Pinchuk over 6 years ago

  • Related to Bug #7330: reconsider retiring disks from live pools by fmd added

#3

Updated by Pavel Cahyna over 6 years ago

Interesting. I suppose this change has never been merged? I observe this problem on omnios-c91bcdf.

#4

Updated by Andrew Stormont about 6 years ago

I'm not entirely sure what the point of including a reference temperature in your firmware is if it's wrong. Even if it's only for display purposes.

#5

Updated by Andrew Stormont about 6 years ago

  • Related to Bug #8244: Some drives report a bogus reference temperature added