Bug #7330

reconsider retiring disks from live pools by fmd

Added by Pavel Cahyna over 3 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: cmd - userland programs
Start date: 2016-08-26
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

In some cases (cf. #7327, but there are certainly more) fmd retires a disk from the system. If the zfs vdev does not have enough redundancy, the whole zpool becomes inaccessible (and worse, see #7328 and #7329). Even if those bugs were not present, it is questionable whether retiring the last disk of a vdev and making the whole pool inaccessible is the right way to handle disk problems.

In fact, even when there is enough redundancy, this may lead to data loss. Imagine a RAID1 vdev with two disks, A and B. B has too many errors (let's say sectors X1, X2, ..., Xn are unreadable) and gets retired from the system. Now let's say that A has a sector Y (different from X1, X2, ..., Xn) that is unreadable or fails its checksum. If B were still present, one could repair Y on A using the data from B. But since B was retired, how can one avoid losing the data?
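
The two-disk example can be put in a few lines of Python. This is only an illustrative model, not fmd or ZFS code: the function name and the sector numbers are made up, and a sector is assumed to be recoverable whenever at least one attached disk still holds a good copy.

```python
def unrecoverable_sectors(bad_on_a, bad_on_b, b_present):
    """Return the sectors of a two-way mirror (A, B) that cannot be read.

    Hypothetical model: a sector survives as long as at least one
    attached disk still has a good copy of it.
    """
    bad_on_a = set(bad_on_a)
    if not b_present:
        # B was retired, so every bad sector on A has no other copy.
        return bad_on_a
    # B is still attached: only sectors bad on both disks are lost.
    return bad_on_a & set(bad_on_b)


bad_b = {101, 102, 103}  # X1..Xn: unreadable sectors on B (made-up numbers)
bad_a = {500}            # Y: the single bad sector on A (made-up number)

print(unrecoverable_sectors(bad_a, bad_b, b_present=True))   # set()
print(unrecoverable_sectors(bad_a, bad_b, b_present=False))  # {500}
```

With B still attached nothing is lost, because Y can be read from B. Once B is retired, Y is gone even though B held a good copy of it.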

In my opinion, disk faults should therefore trigger reconstruction onto a hot spare, and the disk should be retired only after that reconstruction has completed. AFAICT, zpool replace does this properly: it disconnects the disk being replaced from the pool only after the data has been copied to the new one. fmd could therefore initiate "zpool replace" instead of immediately disabling all access to the disk.
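
The proposed ordering could look roughly like the sketch below. It is not how fmd is implemented; the `Action` names and `plan_fault_response` are hypothetical, and the behaviour when no hot spare is available (keep the disk online and alert) is an assumption, not something this report specifies.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical action names, for illustration only.
    START_REPLACE = "start replace onto hot spare"
    WAIT_FOR_RESILVER = "wait until the copy to the spare completes"
    RETIRE_DISK = "retire the faulted disk"
    KEEP_ONLINE_AND_ALERT = "keep the disk online and alert the operator"


def plan_fault_response(spare_available):
    """Return the ordered actions for a faulted disk in a live pool.

    The disk is retired only as the last step, after its data has been
    copied to the spare, so it can still serve reads during the copy.
    """
    if not spare_available:
        # Assumption: without a spare, retiring the disk only removes
        # redundancy, so the disk stays attached.
        return [Action.KEEP_ONLINE_AND_ALERT]
    return [
        Action.START_REPLACE,
        Action.WAIT_FOR_RESILVER,
        Action.RETIRE_DISK,
    ]


for step in plan_fault_response(spare_available=True):
    print(step.value)
```

The point of the sketch is only the order: retirement comes after the replacement has finished, never before it.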


Related issues

Related to illumos gate - Bug #7327: fmd should not retire a disk when its reference temperature is exceeded (New, 2016-08-26)

History

#1

Updated by Alek Pinchuk over 3 years ago

  • Related to Bug #7327: fmd should not retire a disk when its reference temperature is exceeded added
#2

Updated by Andrew Stormont over 2 years ago

I think the right answer is to have better failure algorithms. If the disk looks like it's on the way out, you want to start resilvering onto the spare right then and there. Waiting until FMA retires the device is too late.

#3

Updated by Chip Schweiss over 1 year ago

I have to agree with Pavel's analysis. FMD is too aggressive in retiring disks.

Another example of this is when a disk throws a "predictive failure". FMD retires the disk immediately. Such a disk should be replaced gracefully, not offlined immediately, since it still provides redundancy for the vdev.

A simple fix would be to make this configurable, so that disk retiring is an optional action.
