Bug #10241

ZFS not detecting faulty spares in a timely manner

Added by Kody Kantor about 1 year ago. Updated about 1 month ago.

Status: Closed
Priority: Normal
Assignee:
Category: zfs - Zettabyte File System
Start date: 2019-01-15
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

This is an upstream copy of OS-6602 (https://smartos.org/bugview/OS-6602).

ZFS could be enhanced to proactively probe spare vdevs to ensure that they are accessible if needed.

History

#1

Updated by Kody Kantor about 2 months ago

Here's a brief explanation of the suggested fix.

Periodically during spa_sync we now request that all unused spare vdevs be probed, using the 'async_request' functionality provided by spa_sync. The probe itself runs at the end of spa_sync and reuses probe functionality that already exists in ZFS.

We added a global tunable called spa_spare_poll_interval_seconds to allow operators to change how often unused spare vdevs are polled. Each unused spare is polled every three hours by default.
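As a rough illustration of the interval check described above, here is a minimal sketch in C. Only the tunable name spa_spare_poll_interval_seconds and its three-hour default come from this ticket; the helper spare_poll_due() and its arguments are hypothetical and are not taken from the actual change.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tunable named in this ticket; the default is three hours. */
uint64_t spa_spare_poll_interval_seconds = 3 * 60 * 60;

/*
 * Hypothetical helper (not from the real change): returns true once at
 * least one full poll interval has elapsed since the last probe request.
 * Both times are in seconds.
 */
bool
spare_poll_due(uint64_t last_poll_sec, uint64_t now_sec)
{
	if (now_sec < last_poll_sec)
		return (false);	/* clock stepped backwards; wait */
	return (now_sec - last_poll_sec >= spa_spare_poll_interval_seconds);
}
```

A caller would presumably run a check like this on each sync pass and queue the asynchronous probe request only when it returns true.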

A SERD engine is wired up in the ZFS diagnosis engine for spare vdev probes. We fault a spare vdev if it fails five probes in a 24-hour period. The spare is then marked as 'FAULTED' by the retire agent. We had to enhance the retire agent to iterate over spare vdevs so this would work properly.
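The "five failures in 24 hours" rule can be sketched as a small sliding-window counter. This is an illustration only, not the fmd SERD implementation: the type spare_serd_t and the function spare_serd_record() are hypothetical, timestamps are assumed to be non-decreasing, and the sketch fires on the fifth failure, which may differ from the exact threshold semantics of the real engine.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	SPARE_SERD_N	5			/* failures needed to fire */
#define	SPARE_SERD_T	(24ULL * 60 * 60)	/* window, in seconds */

/* Hypothetical engine state: timestamps of the most recent N failures. */
typedef struct spare_serd {
	uint64_t ss_times[SPARE_SERD_N];
	size_t ss_count;	/* total failures recorded so far */
} spare_serd_t;

/*
 * Record one failed probe at time 'now' (seconds) and return true when
 * the last N failures, including this one, all fall within T seconds.
 */
bool
spare_serd_record(spare_serd_t *ssp, uint64_t now)
{
	size_t slot = ssp->ss_count % SPARE_SERD_N;
	uint64_t oldest;

	ssp->ss_times[slot] = now;
	ssp->ss_count++;
	if (ssp->ss_count < SPARE_SERD_N)
		return (false);

	/* After the write, the next slot holds the oldest of the last N. */
	oldest = ssp->ss_times[ssp->ss_count % SPARE_SERD_N];
	return (now - oldest <= SPARE_SERD_T);
}
```

With hourly failures the fifth call returns true; with failures seven hours apart the fifth does not, because the first and fifth are 28 hours apart.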

Once a spare vdev is faulted it will not be used to automatically replace a faulted pool vdev. There is a new test to verify this functionality.
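The selection behaviour described above can be pictured with the following sketch, which skips faulted spares when choosing a replacement. The types spare_state_t and spare_t and the function pick_spare() are hypothetical names made up for illustration; the real logic lives in ZFS and the retire agent and is not reproduced here.

```c
#include <stddef.h>

typedef enum { SPARE_HEALTHY, SPARE_FAULTED } spare_state_t;

/* Hypothetical, simplified view of a spare vdev. */
typedef struct spare {
	const char *sp_name;
	spare_state_t sp_state;
} spare_t;

/*
 * Return the first spare that is not faulted, or NULL if every spare is
 * faulted (or the list is empty), in which case no automatic
 * replacement can happen.
 */
const spare_t *
pick_spare(const spare_t *spares, size_t nspares)
{
	for (size_t i = 0; i < nspares; i++) {
		if (spares[i].sp_state != SPARE_FAULTED)
			return (&spares[i]);
	}
	return (NULL);
}
```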

#2

Updated by Kody Kantor about 2 months ago

Testing notes:

- Ran the ZFS test suite in my OpenIndiana virtual machine. These are the test failures:

$ grep '\[FAIL\]' /var/tmp/test_results/20200107T214831/log
Test: /opt/zfs-tests/tests/functional/cli_root/zfs_mount/zfs_mount_encrypted (run as root) [00:01] [FAIL]
Test: /opt/zfs-tests/tests/functional/mmp/mmp_on_zdb (run as root) [00:27] [FAIL]
Test: /opt/zfs-tests/tests/functional/projectquota/projectspace_004_pos (run as root) [00:00] [FAIL]
Test: /opt/zfs-tests/tests/functional/rsend/send_encrypted_props (run as root) [00:13] [FAIL]
Test: /opt/zfs-tests/tests/functional/scrub_mirror/setup (run as root) [00:00] [FAIL]
Test: /opt/zfs-tests/tests/functional/write_dirs/setup (run as root) [00:00] [FAIL]

These are some of the usual test failures I see on this machine. The new test, zpool_replace_002_neg, passes.

- Injected six probe ereports for the spare device in a pool and ensured that the device was faulted by FMA:

$ sudo /usr/lib/fm/fmd/fminject ./injection.txt
sending event spare_probe ... done
sending event spare_probe ... done
sending event spare_probe ... done
sending event spare_probe ... done
sending event spare_probe ... done
sending event spare_probe ... done

After a few seconds the spare is now listed as faulted:

$ zpool status testpool
  pool: testpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        testpool    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
        spares
          c6t0d0    FAULTED   too many errors

errors: No known data errors

ZFS doesn't allow the faulted spare to be used as a replacement, as intended:

$ sudo zpool replace testpool c4t0d0 c6t0d0
cannot replace c4t0d0 with c6t0d0: one or more devices is currently unavailable

There are additional testing notes in the smartos.org/bugview ticket linked in this bug description.

#3

Updated by Electric Monk about 1 month ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit e830fb12ed60bbd91c87459b169b5d1cc6ef0b9e

commit  e830fb12ed60bbd91c87459b169b5d1cc6ef0b9e
Author: Kody A Kantor <kody@kkantor.com>
Date:   2020-01-14T16:08:49.000Z

    10241 ZFS not detecting faulty spares in a timely manner
    12132 zfs-retire agent crashes fmd on systems without vdev devids
    12034 zfs test send_encrypted_props can fail
    Reviewed by: C Fraire <cfraire@me.com>
    Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Reviewed by: Rob Johnston <rob.johnston@joyent.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

#4

Updated by Gernot Strasser about 1 month ago

Compiler issue here:

zfs_de.c:64:8: error: duplicate member 'zc_serd_probe'
char zc_serd_probe[MAX_SERDLEN];
