Bug #12132

closed

zfs-retire agent crashes fmd on systems without vdev devids

Added by Kody Kantor over 2 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
zfs - Zettabyte File System
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Bite-size
Tags:
Gerrit CR:

Description

I sent a fake vdev probe_failure ereport to FMD to make sure that a hot spare would take over if one side of a mirror was faulted as the result of a probe failure. The vdev in question was faulted properly, but the hot spare was not activated:

root@oi:/export/home/kkantor# zpool status testpool
  pool: testpool
 state: DEGRADED
...
        NAME        STATE     READ WRITE CKSUM
        testpool    DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c4t0d0  FAULTED      0     0     0  too many errors
            c5t0d0  ONLINE       0     0     0
        spares
          c6t0d0    AVAIL

I tried again and watched 'fmstat' until it reported a cryptic error that I recall seeing when fmd crashes:

fmstat: failed to retrieve list of modules: rpc call failed: RPC: Unable to send; errno = Bad file number

Sure enough, FMD left a core file that tells the story:

> ::stack
libzfs.so.1`path_from_physpath+0x22(0, 88d33ac, ac224bc)
libzfs.so.1`zpool_vdev_name+0x465(0, ac16588, abf3944, 0)
zfs-retire.so`replace_with_spare+0xab(ab4c3c0, ac16588, abf3944)
zfs-retire.so`zfs_retire_recv+0x4d1(ab4c3c0, 88532e0, ac667d8, 8b5687c)
fmd_module_dispatch+0xd3(ab4c3c0, 88532e0)
fmd_module_start+0x161(ab4c3c0)
fmd_thread_start+0x2c(a0adbd8)
libc.so.1`_thrp_setup+0x81(fed6b240)
libc.so.1`_lwp_start(fed6b240, 0, 0, 0, 0, 0)

For some reason the retire agent isn't passing a libzfs handle to 'zpool_vdev_name' (the first argument to both zpool_vdev_name and path_from_physpath is NULL in the stack above). That was acceptable until #10622. The 10622 change has libzfs try harder to produce a vdev name: when no devid is found for the given vdev, zpool_vdev_name now calls path_from_physpath, which requires a libzfs handle. path_from_physpath doesn't check that the handle exists before using it, which results in a segfault.

This should only be a problem on systems without vdev devids, like my KVM machine, which uses the blkdev driver instead of sd. The fix is to have the replace_with_spare function in the retire agent pass its libzfs handle down to libzfs.
