Bug #3422

zpool create/syseventd race yields non-importable pool

Added by Bryan Cantrill over 8 years ago. Updated about 8 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: zfs - Zettabyte File System
Start date: 2012-12-19
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

On a standup of new hardware, we found that several new illumos-based
compute nodes were refusing to import their pool after the first boot.
Running DTrace on the failing zpool import revealed that the pools
were failing to import because the "txg" members of their labels were
not consistent across the pool. As it turns out, this issue is
easily reproduced with the following script:

#!/bin/ksh

disks="0 1 2 3" 
config="" 

timeout=5
pool=badpool

# print the txg recorded in each device's ZFS label
check()
{
    for d in $disks ; do
        zdb -l /dev/rlofi/10${d} | grep -w txg:
    done
}

for d in $disks; do
    if [[ $(expr $d % 2) -eq 0 ]]; then
        config="${config} mirror" 
    fi

    config="${config} /dev/lofi/10${d}" 

    disk=/var/tmp/${pool}.disk${d}

    if [[ ! -f $disk ]]; then
        echo "creating $disk..." 
        dd if=/dev/urandom of=$disk bs=1M count=100 2> /dev/null 1>&2
    fi
done

if zpool status $pool > /dev/null 2>&1 ; then
    echo "destroying existing test pool '$pool'..." 
    zpool destroy $pool
fi

echo "creating lofi devices..." 

for d in $disks; do
    if [[ -h /dev/lofi/10${d} ]]; then
        lofiadm -d /dev/lofi/10${d}
    fi

    lofiadm -a /var/tmp/${pool}.disk${d} /dev/lofi/10${d}
done

echo "creating test pool '$pool'..." 

zpool create $pool $config

echo "sleeping for $timeout seconds..." 
sleep $timeout

# count the number of distinct label txgs across all devices
txgs=`check | sort | uniq -c | wc -l | nawk '{ print $1 }'`

if [[ $txgs -gt 1 ]]; then
    echo "pool '$pool' has $txgs different label txgs; exiting" 
    exit 1
fi

echo "consistent label txgs found in pool '$pool'; cleaning up" 

zpool destroy $pool

for d in $disks; do
    lofiadm -d /dev/lofi/10${d}
done

The race here is between zpool creation from a set of devices and the
processing of those same devices by devfsadmd and syseventd; if a zpool is
created out of a device before that device has been processed by
devfsadm/syseventd, syseventd will find the new device in the pool (duh!)
and will attempt to online it via vdev_online().
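
As a way of confirming this, the online can be observed directly while the
reproduction script above runs.  The one-liner below is only an observation
sketch (the fbt probe simply fires on entry to the vdev_online() kernel
function named above; the output format is my own choice):

# Observation sketch: while the script runs, trace kernel entries to
# vdev_online() and note which process drove them; under the race
# described above, that process is syseventd rather than the zpool
# command itself.
dtrace -n 'fbt::vdev_online:entry { printf("%s[%d]", execname, pid); }'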

The act of onlining the (already online) vdev results in a call to
vdev_reopen(), which ultimately finds its way to vdev_state_dirty() from
spa_vdev_state_exit(). This act of dirtying causes the txg to be updated in
the labels of the devices under the top-level vdev that contains the
re-onlined device. But because the label txg is not actually updated on all
top-level vdevs (only on the one that contains the onlined device), the pool
itself is left in a non-importable state. Now, if the pool is exported, or
if something else happens that updates the label txg on all top-level vdevs
(e.g., setting a pool property), the pool will be fine (and indeed, a
workaround for us is to set a zpool property immediately before reboot after
an initial config; see the sketch below). But if the pool is not exported
before a reboot and nothing happens to update the txg, it will be in a
non-importable state. (It does not appear that there is a way to make the
pool importable via zdb or zpool import -- though perhaps there should be;
it's slightly nerve-wracking that misbehaving user-level software could
result in total pool loss.)
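
To make that workaround concrete, here is a minimal sketch.  The pool name
and the choice of property are arbitrary on my part; per the note above, any
pool property update that dirties the config should do:

# Workaround sketch: touching a pool property immediately before the
# first reboot forces the label txg to be rewritten on all top-level
# vdevs, leaving the pool in an importable state.  The "comment"
# property is chosen here only as a harmless property to touch.
pool=badpool
zpool set comment="label sync $(date '+%Y%m%d%H%M%S')" $pool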

In terms of fixing this, it seems that there are many ways: the label txg
could presumably be written out to all top-level vdevs to keep it consistent
(and perhaps that should already be happening), vdev_online() could no-op
out on a device that is already online, or perhaps zpool import needn't be
as rigid about the txg in the label. In short, the correct fix is entirely
unclear, and will need to await proper ZFS expertise. ;)
