Bug #12132
zfs-retire agent crashes fmd on systems without vdev devids
Status: Closed
% Done: 100%
Description
I sent a fake vdev probe_failure ereport to FMD to verify that a hot spare would take over when one side of a mirror was faulted as a result of a probe failure. The vdev in question was faulted properly, but the hot spare was not activated:
root@oi:/export/home/kkantor# zpool status testpool
  pool: testpool
 state: DEGRADED
...
        NAME        STATE     READ WRITE CKSUM
        testpool    DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c4t0d0  FAULTED      0     0     0  too many errors
            c5t0d0  ONLINE       0     0     0
        spares
          c6t0d0    AVAIL
I tried again and watched fmstat(1M) until it reported a cryptic error that I remember seeing when fmd crashes:
fmstat: failed to retrieve list of modules: rpc call failed: RPC: Unable to send; errno = Bad file number
Sure enough, FMD left a core file that tells the story:
> ::stack
libzfs.so.1`path_from_physpath+0x22(0, 88d33ac, ac224bc)
libzfs.so.1`zpool_vdev_name+0x465(0, ac16588, abf3944, 0)
zfs-retire.so`replace_with_spare+0xab(ab4c3c0, ac16588, abf3944)
zfs-retire.so`zfs_retire_recv+0x4d1(ab4c3c0, 88532e0, ac667d8, 8b5687c)
fmd_module_dispatch+0xd3(ab4c3c0, 88532e0)
fmd_module_start+0x161(ab4c3c0)
fmd_thread_start+0x2c(a0adbd8)
libc.so.1`_thrp_setup+0x81(fed6b240)
libc.so.1`_lwp_start(fed6b240, 0, 0, 0, 0, 0)
For some reason the retire agent isn't passing a libzfs handle to zpool_vdev_name() (note the 0 first argument to both zpool_vdev_name and path_from_physpath in the stack above). That was acceptable until #10622. The 10622 change has libzfs try harder to construct a vdev name: when no devid is found for the given vdev, it now falls back to a new call to path_from_physpath(), which requires a valid libzfs handle. path_from_physpath() doesn't check that the handle exists before using it, which results in a segfault.
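For illustration, here is the shape of the failing call. This is a sketch, not the actual illumos source: the helper name is hypothetical, and zhp and vdev stand for the pool handle and vdev nvlist that replace_with_spare() already has on hand.

#include <stdio.h>
#include <stdlib.h>
#include <libzfs.h>

/*
 * Hypothetical helper showing the crash shape.  The first argument to
 * zpool_vdev_name() is a libzfs_handle_t *.  When the vdev has a devid
 * the NULL goes unused, but after #10622 a vdev without a devid sends
 * it into path_from_physpath(), which dereferences it.
 */
static void
print_vdev_name(zpool_handle_t *zhp, nvlist_t *vdev)
{
    char *dev_name;

    /* NULL libzfs handle: harmless with a devid, fatal without one */
    dev_name = zpool_vdev_name(NULL, zhp, vdev, B_FALSE);
    (void) printf("vdev: %s\n", dev_name);
    free(dev_name);
}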
This should only be a problem on systems without vdev devids, like my KVM machine, which uses the blkdev driver instead of sd. The fix is to have the replace_with_spare() function in the retire agent pass its libzfs handle down to libzfs.
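A minimal sketch of that fix, assuming the agent keeps the libzfs handle it opens at module load in its module-private data (the struct below is abbreviated to the one field used, and the function name is hypothetical):

#include <fm/fmd_api.h>
#include <libzfs.h>

/* Abbreviated module-private data for the retire agent. */
typedef struct zfs_retire_data {
    libzfs_handle_t *zrd_hdl;   /* libzfs handle opened at module init */
} zfs_retire_data_t;

/*
 * Fetch the handle the agent already owns and pass it through, so the
 * no-devid fallback in zpool_vdev_name() hands path_from_physpath() a
 * valid handle instead of NULL.
 */
static char *
vdev_display_name(fmd_hdl_t *hdl, zpool_handle_t *zhp, nvlist_t *vdev)
{
    zfs_retire_data_t *zdp = fmd_hdl_getspecific(hdl);

    /* was: zpool_vdev_name(NULL, zhp, vdev, B_FALSE) */
    return (zpool_vdev_name(zdp->zrd_hdl, zhp, vdev, B_FALSE));
}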