fmd should not retire a disk when its reference temperature is exceeded
Some hard drives have their "Drive Trip Temperature" set to a low value (40 deg C). When this temperature is exceeded, the fault manager disables the disk:
SUNW-MSG-ID: DISK-8000-12, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Tue Aug 23 22:35:12 CEST 2016
PLATFORM: X10DRH-LN4, CSN: 123456789, HOSTNAME: nfsserv2
SOURCE: eft, REV: 1.16
EVENT-ID: 5f46d123-7ffd-cc9f-f4aa-a82bfaee0905
DESC: A disk's temperature exceeded the limits established by its manufacturer. Refer to http://illumos.org/msg/DISK-8000-12 for more information.
AUTO-RESPONSE: None.
IMPACT: Performance degradation is likely and continued disk operation beyond the temperature threshold can result in disk damage and potential data loss.
REC-ACTION: Ensure that the system is properly cooled, that all fans are functional, and that there are no obstructions of airflow to the affected disk.
Now the disk is inaccessible: accessing the raw device node produces "no such device or address". (By the way, this also means that one cannot upgrade the firmware to solve the problem, even on the Seagate drives, without using another system or arcane workarounds: http://www.bigdatajunkie.com/index.php/10-hardware/19-seagate-constellation-es-3-firmware-0003 )
When one has many drives with the same property, this can easily result in losing access to an entire zpool, even a redundant one.
Problematic drives include: Seagate Constellation ES3 (f/w rev. 0003) : https://forums.servethehome.com/index.php?threads/constellation-es3-drives-retired-from-solaris.3460/ , WD WD4001FYYG (f/w VR08).
For the Seagate drives the value was raised to 60 in f/w 0004, but there is AFAIK no firmware update for the WD ones (the issue was in fact introduced in a f/w update, f/w rev. VR07). I contacted our disk reseller who inquired at WD and got a response saying that this value is only for display, concerns no disk or controller operations and its change does not have any influence on functionality or configuration.
From this it is clear that retiring the disk when the value is exceeded was really not the intended use of the value by the manufacturer.
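For illustration, here is a minimal Python sketch of the kind of comparison that leads to this fault. It assumes the reference value is read from the SCSI Temperature log page (page code 0x0D), where parameter 0x0000 holds the current temperature and parameter 0x0001 the reference temperature. The function names, the sample bytes and the strict "greater than" comparison are my assumptions for the example, not the actual fmd code.

```python
import struct

TEMPERATURE_PAGE = 0x0D
PARAM_CURRENT = 0x0000      # current temperature, deg C
PARAM_REFERENCE = 0x0001    # reference temperature, deg C
NOT_AVAILABLE = 0xFF        # value reported when the drive has no reading


def parse_temperature_log_page(page: bytes) -> dict:
    """Extract current and reference temperature from a raw log page."""
    if page[0] & 0x3F != TEMPERATURE_PAGE:
        raise ValueError("not a Temperature log page")
    (page_len,) = struct.unpack_from(">H", page, 2)
    end = min(4 + page_len, len(page))
    result = {}
    offset = 4
    # Each parameter: 2-byte code, 1 control byte, 1 length byte, then data.
    while offset + 4 <= end:
        code, _control, plen = struct.unpack_from(">HBB", page, offset)
        data = page[offset + 4:offset + 4 + plen]
        if len(data) == 2:
            # First data byte is reserved, second is the temperature.
            if code == PARAM_CURRENT:
                result["current"] = data[1]
            elif code == PARAM_REFERENCE:
                result["reference"] = data[1]
        offset += 4 + plen
    return result


def over_temperature(page: bytes) -> bool:
    """True if the current temperature exceeds the reference temperature."""
    temps = parse_temperature_log_page(page)
    current = temps.get("current", NOT_AVAILABLE)
    reference = temps.get("reference", NOT_AVAILABLE)
    if NOT_AVAILABLE in (current, reference):
        return False
    return current > reference


# Hypothetical sample: drive at 42 deg C with a reference value of 40 deg C.
sample = bytes([0x0D, 0x00, 0x00, 0x0C,
                0x00, 0x00, 0x03, 0x02, 0x00, 0x2A,
                0x00, 0x01, 0x03, 0x02, 0x00, 0x28])
print(parse_temperature_log_page(sample), over_temperature(sample))
```

With a reference value of 40, a drive running at a quite ordinary 42 deg C already counts as over temperature, while the same drive with the reference raised to 60 (as in Seagate f/w 0004) would not.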
Note also that the original Sun description of automated problem diagnosis (Patent US7379846 - System and method for automated problem diagnosis, https://www.google.com/patents/US7379846 ) says:
Upset: An upset is an event that temporarily causes the system to behave in an unexpected manner. For a computer, a typical form of upset might be a cosmic ray strike, a nuclear decay, or a power supply surge that might perhaps alter the value of a memory bit. A very hot day or a failure of the room air conditioning system could also be regarded as a potential upset (although the latter might also be modelled as a fault). Nevertheless, despite the upset the underlying operation of the system is correct, and no specific repair is needed to the system hardware.
From this description, exceeding a temperature can be regarded as an upset, not a fault.
I therefore suggest disabling this fault in fmd and treating the event as an upset, as suggested in the patent description. Here is a patch to do this:
--- ../disk.esc Wed Aug 24 15:01:30 2016
+++ disk.esc Thu Aug 25 20:14:10 2016
@@ -70,6 +70,7 @@
 event upset.io.scsi.cmd.disk.dev.serr@P;
 event upset.io.scsi.cmd.disk.tran@P;
 event upset.io.scsi.cmd.disk.recovered@P;
+event upset.io.disk.over-temperature@P;
 
 /*
  * disk-as-detector: ereports from the kernel.
@@ -179,8 +180,11 @@
 /*
  * Fault events.
  */
-event fault.io.disk.over-temperature@P,
-    FITrate=10, FRU=P, ASRU=P;
+/*
+ * event fault.io.disk.over-temperature@P,
+ *     FITrate=10, FRU=P, ASRU=P;
+ */
+
 event fault.io.disk.predictive-failure@P, FITrate=10,
     FITrate=10, FRU=P, ASRU=P;
 event fault.io.disk.self-test-failure@P, FITrate=10,
@@ -196,9 +200,15 @@
 /*
  * Propagations.
  */
-prop fault.io.disk.over-temperature@P ->
+
+prop upset.io.disk.over-temperature@P ->
     ereport.io.scsi.disk.over-temperature@P;
 
+/*
+ * prop fault.io.disk.over-temperature@P ->
+ *     ereport.io.scsi.disk.over-temperature@P;
+ */
+
 prop fault.io.disk.self-test-failure@P ->
     ereport.io.scsi.disk.self-test-failure@P;
 