Bug #12132

zfs-retire agent crashes fmd on systems without vdev devids

Added by Kody Kantor about 2 months ago. Updated about 1 month ago.

Status: Closed
Priority: Normal
Assignee:
Category: zfs - Zettabyte File System
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Bite-size
Tags:

Description

I sent a fake vdev probe_failure ereport to FMD to make sure that a hot spare would take over if one side of a mirror was faulted as the result of a probe failure. The vdev in question was faulted properly, but the hot spare was not activated:

root@oi:/export/home/kkantor# zpool status testpool
  pool: testpool
 state: DEGRADED
...
        NAME        STATE     READ WRITE CKSUM
        testpool    DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c4t0d0  FAULTED      0     0     0  too many errors
            c5t0d0  ONLINE       0     0     0
        spares
          c6t0d0    AVAIL

I tried again and watched 'fmstat' until it reported a cryptic error that I remember seeing when FMD crashes:

fmstat: failed to retrieve list of modules: rpc call failed: RPC: Unable to send; errno = Bad file number

Sure enough, FMD left a core file that tells the story:

> ::stack
libzfs.so.1`path_from_physpath+0x22(0, 88d33ac, ac224bc)
libzfs.so.1`zpool_vdev_name+0x465(0, ac16588, abf3944, 0)
zfs-retire.so`replace_with_spare+0xab(ab4c3c0, ac16588, abf3944)
zfs-retire.so`zfs_retire_recv+0x4d1(ab4c3c0, 88532e0, ac667d8, 8b5687c)
fmd_module_dispatch+0xd3(ab4c3c0, 88532e0)
fmd_module_start+0x161(ab4c3c0)
fmd_thread_start+0x2c(a0adbd8)
libc.so.1`_thrp_setup+0x81(fed6b240)
libc.so.1`_lwp_start(fed6b240, 0, 0, 0, 0, 0)

For some reason the retire agent isn't passing a libzfs handle to 'zpool_vdev_name'. The handle is the first argument to both zpool_vdev_name and path_from_physpath, and it is 0 in both frames of the stack above. This was acceptable until #10622. The 10622 change has libzfs try harder: when a devid is not found for the given vdev, it makes a new call to path_from_physpath, which requires a libzfs handle. path_from_physpath doesn't check that the handle exists before using it, which results in a segfault.

This should only be a problem on systems without vdev devids, like my KVM machine, which uses the blkdev driver instead of the sd driver. We should have the replace_with_spare function in the retire agent pass its libzfs handle to libzfs.
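
The failure mode and the proposed fix can be sketched with a small stand-alone model. This is not the libzfs source: every type, function, and path below (mock_handle_t, mock_vdev_t, mock_path_from_physpath, mock_vdev_name, the example strings) is invented for illustration, and the NULL check in the fallback is an extra defensive step that the report does not say the real code has.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a libzfs handle; not the real structure. */
typedef struct mock_handle {
	const char *mh_physpath;	/* one known physical path ... */
	const char *mh_devpath;		/* ... and the device path it maps to */
} mock_handle_t;

/* Hypothetical stand-in for a vdev's configuration. */
typedef struct mock_vdev {
	const char *mv_devid;		/* NULL when the driver has no devid */
	const char *mv_physpath;	/* physical path recorded for the vdev */
	const char *mv_path;		/* device path recorded for the vdev */
} mock_vdev_t;

/*
 * Fallback lookup that needs the handle. Dereferencing hdl without the
 * NULL check is the kind of access that crashes when the caller passes
 * no handle; here the lookup simply fails instead.
 */
static const char *
mock_path_from_physpath(const mock_handle_t *hdl, const char *physpath)
{
	if (hdl == NULL || physpath == NULL)
		return (NULL);
	if (strcmp(hdl->mh_physpath, physpath) == 0)
		return (hdl->mh_devpath);
	return (NULL);
}

/*
 * Name a vdev. With no devid, try the physical-path fallback first and
 * fall back to the recorded path if that lookup fails.
 */
static const char *
mock_vdev_name(const mock_handle_t *hdl, const mock_vdev_t *vd)
{
	if (vd->mv_devid == NULL) {
		const char *p = mock_path_from_physpath(hdl, vd->mv_physpath);

		if (p != NULL)
			return (p);
	}
	return (vd->mv_path);
}
```

In this model, a caller that passes its handle (the change proposed for replace_with_spare) gets the path resolved through the fallback, while a caller that passes NULL gets the recorded path back rather than a crash.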

History

#1

Updated by Kody Kantor about 1 month ago

Testing notes:
- Ran the zfs test suite. See #10241 for the failures encountered.
- I sent a single fake vdev probe_failure ereport to FMD as described in the bug description, which caused one side of the mirror to be faulted. This time, as expected, FMD did not crash as it attempted to replace the pool vdev with a hot spare:

$ sudo /usr/lib/fm/fmd/fminject ./injection.txt
sending event mirror_probe ... done
 [... after a few seconds]
$ zpool status testpool
  pool: testpool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 273K in 0 days 00:00:00 with 0 errors on Fri Jan 10 16:27:48 2020
config:

        NAME          STATE     READ WRITE CKSUM
        testpool      DEGRADED     0     0     0
          mirror-0    DEGRADED     0     0     0
            spare-0   DEGRADED     0     0     0
              c4t0d0  FAULTED      0     0     0  too many errors
              c6t0d0  ONLINE       0     0     0
            c5t0d0    ONLINE       0     0     0
        spares
          c6t0d0      INUSE     currently in use

errors: No known data errors

$ svcs fmd
STATE          STIME    FMRI
online         16:09:05 svc:/system/fmd:default
$ uptime
16:30:06    up 21 min(s),  2 users,  load average: 0.01, 0.03, 0.04

#2

Updated by Electric Monk about 1 month ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit e830fb12ed60bbd91c87459b169b5d1cc6ef0b9e

commit  e830fb12ed60bbd91c87459b169b5d1cc6ef0b9e
Author: Kody A Kantor <kody@kkantor.com>
Date:   2020-01-14T16:08:49.000Z

    10241 ZFS not detecting faulty spares in a timely manner
    12132 zfs-retire agent crashes fmd on systems without vdev devids
    12034 zfs test send_encrypted_props can fail
    Reviewed by: C Fraire <cfraire@me.com>
    Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Reviewed by: Rob Johnston <rob.johnston@joyent.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
