ZFS not detecting faulty spares in a timely manner
This is an upstream copy of OS-6602 (https://smartos.org/bugview/OS-6602).
ZFS could be enhanced to proactively probe spare vdevs to ensure that they are accessible if needed.
Updated by Kody Kantor 11 months ago
Here's a brief explanation of the suggested fix.
Occasionally during spa_sync we now request that all unused spare vdevs be probed using the 'async_request' functionality provided by spa_sync. The probe happens at the end of spa_sync. The probe functionality already exists in ZFS.
We added a global tunable called spa_spare_poll_interval_seconds to allow operators to change how often unused spare vdevs are polled. Each unused spare is polled every three hours by default.
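To illustrate the polling decision, here is a minimal Python model of an interval check that could run on each sync pass. It is only a sketch and not the actual change: the class name SparePoller, the method should_probe, and the choice to start the timer at pool import are all assumptions made for illustration. Only the tunable name and the three-hour default come from the description above.

```python
# Three-hour default, mirroring the spa_spare_poll_interval_seconds tunable.
SPA_SPARE_POLL_INTERVAL_SECONDS = 3 * 60 * 60


class SparePoller:
    """Toy model (hypothetical) of deciding when to probe unused spares."""

    def __init__(self, interval_seconds=SPA_SPARE_POLL_INTERVAL_SECONDS, start=0.0):
        self.interval_seconds = interval_seconds
        # Assumption: the timer starts at 'start', so the first probe is
        # requested one full interval later.
        self.last_poll = start

    def should_probe(self, now):
        """Return True and reset the timer when at least one interval has elapsed."""
        if now - self.last_poll >= self.interval_seconds:
            self.last_poll = now
            return True
        return False
```

In this model a caller would invoke should_probe once per sync pass with a monotonic timestamp and queue a probe request for each unused spare whenever it returns True.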
There is a SERD engine wired up in the ZFS diagnosis engine for spare vdev probes. We fault a spare vdev if it fails five probes in a 24-hour period. The spare is then marked as 'FAULTED' by the retire agent. We had to enhance the retire agent to iterate over spare vdevs so this would work properly.
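The "five failures in 24 hours" rule can be modelled as a sliding-window counter. The Python sketch below is a simplified illustration, not the fmd SERD implementation: the class SerdEngine and its record method are hypothetical names, timestamps are assumed to arrive in non-decreasing order, and the real engine may treat the boundary case (firing on the Nth event versus the one after it) differently.

```python
from collections import deque


class SerdEngine:
    """Simplified sliding-window model: fire when n events fall within t_seconds."""

    def __init__(self, n, t_seconds):
        self.n = n
        self.t_seconds = t_seconds
        self.events = deque()  # timestamps of recent probe failures, oldest first

    def record(self, timestamp):
        """Record one probe failure; return True if the engine fires."""
        self.events.append(timestamp)
        # Discard failures that fall outside the window ending at this event.
        while timestamp - self.events[0] > self.t_seconds:
            self.events.popleft()
        return len(self.events) >= self.n


# Assumed parameters taken from the description: 5 failures within 24 hours.
spare_probe_serd = SerdEngine(n=5, t_seconds=24 * 60 * 60)
```

With probes three hours apart, five consecutive failures span twelve hours and the model fires on the fifth. Failures spaced seven hours apart never accumulate five within a 24-hour window, so the model never fires.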
Once a spare vdev is faulted it will not be used to automatically replace a faulted pool vdev. There is a new test to verify this functionality.
Updated by Kody Kantor 11 months ago
- Ran the zfs test suite in my OpenIndiana virtual machine. These are the test failures:
    $ grep '\[FAIL\]' /var/tmp/test_results/20200107T214831/log
    Test: /opt/zfs-tests/tests/functional/cli_root/zfs_mount/zfs_mount_encrypted (run as root) [00:01] [FAIL]
    Test: /opt/zfs-tests/tests/functional/mmp/mmp_on_zdb (run as root) [00:27] [FAIL]
    Test: /opt/zfs-tests/tests/functional/projectquota/projectspace_004_pos (run as root) [00:00] [FAIL]
    Test: /opt/zfs-tests/tests/functional/rsend/send_encrypted_props (run as root) [00:13] [FAIL]
    Test: /opt/zfs-tests/tests/functional/scrub_mirror/setup (run as root) [00:00] [FAIL]
    Test: /opt/zfs-tests/tests/functional/write_dirs/setup (run as root) [00:00] [FAIL]
These are some of the usual test failures I see on this machine. The new test, zpool_replace_002_neg, passes.
- Injected six probe ereports for the spare device in a pool and ensured that the device was faulted by FMA:
    $ sudo /usr/lib/fm/fmd/fminject ./injection.txt
    sending event spare_probe ... done
    sending event spare_probe ... done
    sending event spare_probe ... done
    sending event spare_probe ... done
    sending event spare_probe ... done
    sending event spare_probe ... done
After a few seconds the spare is now listed as faulted:
    $ zpool status testpool
      pool: testpool
     state: ONLINE
      scan: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            testpool    ONLINE       0     0     0
              mirror-0  ONLINE       0     0     0
                c4t0d0  ONLINE       0     0     0
                c5t0d0  ONLINE       0     0     0
            spares
              c6t0d0    FAULTED   too many errors

    errors: No known data errors
ZFS doesn't allow the faulted spare to be used as a replacement, as intended:
    $ sudo zpool replace testpool c4t0d0 c6t0d0
    cannot replace c4t0d0 with c6t0d0: one or more devices is currently unavailable
There are additional testing notes in the smartos.org/bugview ticket linked in this bug description.
Updated by Electric Monk 11 months ago
- Status changed from New to Closed
- % Done changed from 0 to 100
commit e830fb12ed60bbd91c87459b169b5d1cc6ef0b9e
Author: Kody A Kantor <firstname.lastname@example.org>
Date: 2020-01-14T16:08:49.000Z

    10241 ZFS not detecting faulty spares in a timely manner
    12132 zfs-retire agent crashes fmd on systems without vdev devids
    12034 zfs test send_encrypted_props can fail
    Reviewed by: C Fraire <email@example.com>
    Reviewed by: Andrew Stormont <firstname.lastname@example.org>
    Reviewed by: Jerry Jelinek <email@example.com>
    Reviewed by: Patrick Mooney <firstname.lastname@example.org>
    Reviewed by: Rob Johnston <email@example.com>
    Approved by: Dan McDonald <firstname.lastname@example.org>