Bug #3422

zpool create/syseventd race yield non-importable pool

Added by Bryan Cantrill about 4 years ago. Updated about 4 years ago.

Status: Closed
Start date: 2012-12-19
Priority: Normal
Due date:
Assignee: George Wilson
% Done: 100%
Category: zfs - Zettabyte File System
Target version: -
Difficulty: Medium
Tags: needs-triage

Description

On a standup of new hardware, we found that several new illumos-based
compute nodes were refusing to import their pool after the first boot.
Running DTrace on the failing zpool import revealed that the pools
were failing to import because the "txg" members of their labels were
not consistent across the pool. As it turns out, this issue is
easily reproduced with the following script:

#!/bin/ksh

disks="0 1 2 3" 
config="" 

timeout=5
pool=badpool

# Print the "txg:" line of every label on each lofi device.
check()
{
    for d in $disks ; do
        zdb -l /dev/rlofi/10${d} | grep -w txg:
    done
}

for d in $disks; do
    if [[ $(expr $d % 2) -eq 0 ]]; then
        config="${config} mirror" 
    fi

    config="${config} /dev/lofi/10${d}" 

    disk=/var/tmp/${pool}.disk${d}

    if [[ ! -f $disk ]]; then
        echo "creating $disk..." 
        dd if=/dev/urandom of=$disk bs=1M count=100 2> /dev/null 1>&2
    fi
done

if zpool status $pool > /dev/null 2>&1 ; then
    echo "destroying existing test pool '$pool'..." 
    zpool destroy $pool
fi

echo "creating lofi devices..." 

for d in $disks; do
    if [[ -h /dev/lofi/10${d} ]]; then
        lofiadm -d /dev/lofi/10${d}
    fi

    lofiadm -a /var/tmp/${pool}.disk${d} /dev/lofi/10${d}
done

echo "creating test pool '$pool'..." 

zpool create $pool $config

echo "sleeping for $timeout seconds..." 
sleep $timeout

txgs=`check | sort | uniq -c | wc -l | nawk '{ print $1 }'`

if [[ $txgs -gt 1 ]]; then
    echo "pool '$pool' has $txgs different label txgs; exiting" 
    exit 1
fi

echo "consistent label txgs found in pool '$pool'; cleaning up" 

zpool destroy $pool

for d in $disks; do
    lofiadm -d /dev/lofi/10${d}
done

The race here is between creating a zpool from a set of devices and the
processing of those same devices by devfsadmd and syseventd; if a zpool is
created out of a certain device before that device is processed by
devfsadmd/syseventd, syseventd will find the new device in the pool (duh!)
and will attempt to online it via vdev_online().

The act of onlining the (already online) vdev results in a call to
vdev_reopen(), which ultimately finds its way to vdev_state_dirty() from
spa_vdev_state_exit(). This act of dirtying causes the txg to be updated on
the labels of the devices in the top-level vdev that contains the re-onlined
device. But because the label txg is not actually updated on all top-level
vdevs (only the one that contains the onlined device), the pool itself is in a
non-importable state. Now, if the pool is exported or if something else
happens that updates the label txg on all top-level vdevs (e.g., setting a
pool property), the pool will be fine (and indeed, a workaround for us is to
set a zpool property immediately before reboot after an initial config). But
if the pool is not exported before a reboot and nothing happens to update the
txg, it will be in a non-importable state. (It does not appear that there is
a way to make the pool importable via zdb or zpool import -- though perhaps
there should be; it's slightly nerve-racking that misbehaving user-level
software could result in total pool loss.)
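
A minimal sketch of a pre-reboot check built on the same zdb -l output that
the reproduction script inspects. The function name count_label_txgs is
hypothetical, and the sketch assumes that each label prints its transaction
group on a line whose first field is exactly "txg:".

```shell
# Hypothetical helper (not part of ZFS): read `zdb -l`-style text on stdin
# and print the number of distinct label txg values it contains.
count_label_txgs()
{
    # Match only lines whose first field is exactly "txg:", so fields such
    # as "create_txg:" are ignored (same intent as `grep -w txg:`).
    awk '$1 == "txg:" { seen[$2] = 1 }
         END { n = 0; for (t in seen) n++; print n }'
}

# Intended use, left as a comment because it needs a real pool:
#   for d in 0 1 2 3; do zdb -l /dev/rlofi/10${d}; done | count_label_txgs
```

A result greater than 1 means the labels disagree. Going by the behavior
described above, setting a pool property before rebooting should bring them
back into agreement.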

In terms of fixing this, it seems that there are many ways: the label txg
could presumably be written out to all top-level vdevs to keep it consistent
(and perhaps that should already be happening), vdev_online() could no-op
out on a device that is already online, or perhaps zpool import needn't be as
rigid about the txg in the label. In short, the correct fix is entirely
unclear, and will need to await proper ZFS expertise. ;)

History

#1 Updated by George Wilson about 4 years ago

  • Assignee set to George Wilson
  • Status changed from New to In Progress
  • Category set to zfs - Zettabyte File System

In vdev_validate() we use the spa_config_txg value to determine the most appropriate label to read the config from. Unfortunately the spa_config_txg value that is used to check the label txg values is not the most up-to-date value.

#2 Updated by Christopher Siden about 4 years ago

This is a regression caused by bug #3102. George Wilson's analysis:

As part of '3102 vdev_uberblock_load() and vdev_validate() may read the wrong
label' we avoid reading labels that are newer than spa_config_txg.
Unfortunately there are times where spa_config_txg is not updated but the
label's txg is. This causes the device to show up as faulted and prevents the
pool import.
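
To make the comparison in the analysis above concrete, here is a toy model in
shell. It is a simplification for illustration only, not the kernel code: the
function names label_is_readable and skipped_labels are hypothetical, and the
only rule modeled is "a label newer than the config txg is not read".

```shell
# Toy model only; this is NOT how the kernel is written.
# Succeeds when a label with txg $1 would be read, given a config txg of $2.
label_is_readable()
{
    label_txg=$1
    config_txg=$2
    [ "$label_txg" -le "$config_txg" ]
}

# First argument: the (possibly stale) config txg.
# Remaining arguments: one label txg per device.
# Prints the label txgs that the toy model would skip as "too new".
skipped_labels()
{
    config_txg=$1
    shift
    for txg in "$@"; do
        label_is_readable "$txg" "$config_txg" || echo "$txg"
    done
}
```

With a stale config txg of 4 and label txgs of 4, 4, 7 and 7, the model skips
both labels at txg 7, which corresponds to the devices that show up as
faulted.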

#3 Updated by Christopher Siden about 4 years ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Closed

commit bda8819
Author: George Wilson <george.wilson@delphix.com>
Date:   Wed Jan 30 12:01:51 2013

    3422 zpool create/syseventd race yield non-importable pool
    3425 first write to a new zvol can fail with EFBIG
    Reviewed by: Adam Leventhal <ahl@delphix.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Approved by: Garrett D'Amore <garrett@damore.org>
