zfs-retire agent crashes fmd on systems without vdev devids
I sent a fake vdev probe_failure ereport to FMD to make sure that a hot spare would take over if one side of a mirror was faulted as the result of a probe failure. The vdev in question was faulted properly, but the hot spare was not activated:
root@oi:/export/home/kkantor# zpool status testpool
  pool: testpool
 state: DEGRADED
...
        NAME        STATE     READ WRITE CKSUM
        testpool    DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c4t0d0  FAULTED      0     0     0  too many errors
            c5t0d0  ONLINE       0     0     0
        spares
          c6t0d0    AVAIL
I tried again and watched 'fmstat' until it reported a cryptic error that I remember seeing when FMD crashes:
fmstat: failed to retrieve list of modules: rpc call failed: RPC: Unable to send; errno = Bad file number
Sure enough, FMD left a core file that tells the story:
> ::stack
libzfs.so.1`path_from_physpath+0x22(0, 88d33ac, ac224bc)
libzfs.so.1`zpool_vdev_name+0x465(0, ac16588, abf3944, 0)
zfs-retire.so`replace_with_spare+0xab(ab4c3c0, ac16588, abf3944)
zfs-retire.so`zfs_retire_recv+0x4d1(ab4c3c0, 88532e0, ac667d8, 8b5687c)
fmd_module_dispatch+0xd3(ab4c3c0, 88532e0)
fmd_module_start+0x161(ab4c3c0)
fmd_thread_start+0x2c(a0adbd8)
libc.so.1`_thrp_setup+0x81(fed6b240)
libc.so.1`_lwp_start(fed6b240, 0, 0, 0, 0, 0)
For some reason the retire agent isn't sending a libzfs handle to the 'zpool_vdev_name' function (the first arg to zpool_vdev_name and path_from_physpath). This was acceptable until #10622. The 10622 change has libzfs try harder, which requires a libzfs handle in the new call to path_from_physpath when a devid is not found for the given vdev. path_from_physpath doesn't check for the existence of the libzfs handle before use, which results in a segfault.
This should only be a problem on systems without vdev devids, like my KVM machine, which uses the blkdev driver instead of the sd driver. We should have the replace_with_spare function in the retire agent pass its libzfs handle to libzfs.
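The failure mode described above can be sketched in a few lines of C. This is a simplified illustration, not the libzfs source: demo_handle_t, demo_path_from_physpath and demo_vdev_name are hypothetical stand-ins, and the has_devid flag is an assumption that stands in for whatever devid lookup the real code performs. It shows only why a missing handle goes unnoticed while a devid is present and becomes a problem once the physpath fallback is taken.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a libzfs handle; NOT the real libzfs_handle_t. */
typedef struct demo_handle {
	const char *dev_root;	/* directory searched for device nodes */
} demo_handle_t;

/*
 * Stand-in for the physpath fallback. Building the name needs state from
 * the handle, so a NULL handle must be rejected before it is dereferenced.
 * Returns 0 on success, -1 if no handle was supplied.
 */
static int
demo_path_from_physpath(const demo_handle_t *hdl, const char *physpath,
    char *buf, size_t buflen)
{
	if (hdl == NULL)
		return (-1);	/* without this check, hdl->dev_root faults */
	(void) snprintf(buf, buflen, "%s/%s", hdl->dev_root, physpath);
	return (0);
}

/*
 * Stand-in for the vdev name lookup. When a devid is available the handle
 * is never touched, which is why a NULL handle can go unnoticed on most
 * systems. Without a devid, the fallback above is used.
 */
static int
demo_vdev_name(const demo_handle_t *hdl, int has_devid, const char *path,
    const char *physpath, char *buf, size_t buflen)
{
	if (has_devid) {
		(void) snprintf(buf, buflen, "%s", path);
		return (0);
	}
	return (demo_path_from_physpath(hdl, physpath, buf, buflen));
}
```

In this sketch a caller that passes NULL gets an error back on the devid-less path instead of a crash. The change proposed in this issue is on the caller side: the retire agent passes its own handle, so the fallback has what it needs.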
Updated by Kody Kantor over 1 year ago
- Ran the zfs test suite. See #10241 for the failures encountered.
- I sent a single fake vdev probe_failure ereport to FMD as described in the bug description, which caused one side of the mirror to be faulted. This time, as expected, FMD did not crash as it attempted to replace the pool vdev with a hot spare:
$ sudo /usr/lib/fm/fmd/fminject ./injection.txt
sending event mirror_probe ... done
[... after a few seconds]
$ zpool status testpool
  pool: testpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 273K in 0 days 00:00:00 with 0 errors on Fri Jan 10 16:27:48 2020
config:
        NAME          STATE     READ WRITE CKSUM
        testpool      DEGRADED     0     0     0
          mirror-0    DEGRADED     0     0     0
            spare-0   DEGRADED     0     0     0
              c4t0d0  FAULTED      0     0     0  too many errors
              c6t0d0  ONLINE       0     0     0
            c5t0d0    ONLINE       0     0     0
        spares
          c6t0d0      INUSE     currently in use
errors: No known data errors
$ svcs fmd
STATE          STIME    FMRI
online         16:09:05 svc:/system/fmd:default
$ uptime
 16:30:06 up 21 min(s), 2 users, load average: 0.01, 0.03, 0.04
Updated by Electric Monk over 1 year ago
- Status changed from New to Closed
- % Done changed from 0 to 100
commit e830fb12ed60bbd91c87459b169b5d1cc6ef0b9e
Author: Kody A Kantor <email@example.com>
Date: 2020-01-14T16:08:49.000Z
10241 ZFS not detecting faulty spares in a timely manner
12132 zfs-retire agent crashes fmd on systems without vdev devids
12034 zfs test send_encrypted_props can fail
Reviewed by: C Fraire <firstname.lastname@example.org>
Reviewed by: Andrew Stormont <email@example.com>
Reviewed by: Jerry Jelinek <firstname.lastname@example.org>
Reviewed by: Patrick Mooney <email@example.com>
Reviewed by: Rob Johnston <firstname.lastname@example.org>
Approved by: Dan McDonald <email@example.com>