Bug #3422
closed
zpool create/syseventd race yield non-importable pool
100%
Description
On a standup of new hardware, we found that several new illumos-based
compute nodes were refusing to import their pool after the first boot.
Running DTrace on the failing zpool import revealed that the pools
were failing to import because the "txg" members of their labels were
not consistent across the pool. As it turns out, this issue is
easily reproduced with the following script:
#!/bin/ksh

disks="0 1 2 3"
config=""
timeout=5
pool=badpool

check()
{
	for d in $disks ; do
		zdb -l /dev/rlofi/10${d} | grep -w txg:
	done
}

for d in $disks; do
	if [[ $(expr $d % 2) -eq 0 ]]; then
		config="${config} mirror"
	fi

	config="${config} /dev/lofi/10${d}"
	disk=/var/tmp/${pool}.disk${d}

	if [[ ! -f $disk ]]; then
		echo "creating $disk..."
		dd if=/dev/urandom of=$disk bs=1M count=100 2> /dev/null 1>&2
	fi
done

if zpool status $pool > /dev/null 2>&1 ; then
	echo "destroying existing test pool '$pool'..."
	zpool destroy $pool
fi

echo "creating lofi devices..."

for d in $disks; do
	if [[ -h /dev/lofi/10${d} ]]; then
		lofiadm -d /dev/lofi/10${d}
	fi

	lofiadm -a /var/tmp/${pool}.disk${d} /dev/lofi/10${d}
done

echo "creating test pool '$pool'..."
zpool create $pool $config

echo "sleeping for $timeout seconds..."
sleep $timeout

txgs=`check | sort | uniq -c | wc -l | nawk '{ print $1 }'`

if [[ $txgs -gt 1 ]]; then
	echo "pool '$pool' has $txgs different label txgs; exiting"
	exit 1
fi

echo "consistent label txgs found in pool '$pool'; cleaning up"
zpool destroy $pool

for d in $disks; do
	lofiadm -d /dev/lofi/10${d}
done
The race here is between creating a zpool from a set of devices and the
processing of those same devices by devfsadmd and syseventd: if a zpool is
created out of a certain device before that device is processed by
devfsadm/syseventd, syseventd will find the new device in the pool (duh!)
and will attempt to online it via vdev_online().
The act of onlining the (already online) vdev results in a call to
vdev_reopen(), which ultimately finds its way to vdev_state_dirty() from
spa_vdev_state_exit(). This act of dirtying causes the txg to be updated on
the labels of the devices under the top-level vdev that contains the re-onlined
device. But because the label txg is not actually updated on all top-level
vdevs (only the one that contains the onlined device), the pool itself is in a
non-importable state. Now, if the pool is exported or if something else
happens that updates the label txg on all top-level vdevs (e.g., setting a
pool property), the pool will be fine (and indeed, a workaround for us is to
set a zpool property immediately before reboot after an initial config). But
if the pool is not exported before a reboot and nothing happens to update the
txg, it will be in a non-importable state. (It does not appear that there is
a way to make the pool importable via zdb or zpool import -- though perhaps
there should be; it's slightly nerve-wracking that misbehaving user-level
software could result in total pool loss.)
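A minimal sketch of a pre-reboot sanity check, in the same spirit as the check() function in the reproduction script. It assumes only that each label dump contains a line whose first field is "txg:", as the script above relies on. The function name count_label_txgs is hypothetical, and the function reads plain text on standard input rather than calling zdb itself.

```shell
#!/bin/sh
# Hypothetical helper: count the distinct "txg:" values found in label
# dumps read from standard input (for example, concatenated `zdb -l` output).
count_label_txgs() {
	# Keep only lines whose first field is exactly "txg:" (so lines such as
	# "create_txg:" are ignored), print the value, and count unique values.
	awk '$1 == "txg:" { print $2 }' | sort -u | wc -l | awk '{ print $1 }'
}
```

Piping the zdb -l output for every device in the pool into count_label_txgs and getting a result greater than 1 would indicate the inconsistent state described above, in which case applying the workaround (setting a pool property) before rebooting seems prudent.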
In terms of fixing this, it seems that there are many ways: the label txg
could presumably be written out to all top-level vdevs to keep it consistent
(and perhaps that should already be happening), or vdev_online() could no-op
out on a device that is already online, or perhaps zpool import needn't be as
rigid about the txg in the label. In short, the correct fix is entirely
unclear, and will need to await proper ZFS expertise. ;)
Updated by George Wilson over 10 years ago
- Category set to zfs - Zettabyte File System
- Status changed from New to In Progress
- Assignee set to George Wilson
In vdev_validate() we use the spa_config_txg value to determine the most appropriate label to read the config from. Unfortunately the spa_config_txg value that is used to check the label txg values is not the most up-to-date value.
Updated by Christopher Siden over 10 years ago
This is a regression caused by bug #3102. George Wilson's analysis:
As part of '3102 vdev_uberblock_load() and vdev_validate() may read the wrong label' we avoid reading labels that are newer than spa_config_txg. Unfortunately there are times when spa_config_txg is not updated but the label's txg is. This causes the device to show up as faulted and prevents the pool import.
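The comparison described in this analysis can be illustrated with a toy model. This is not ZFS source: the function names label_is_readable and skipped_labels, the "name:txg" argument format, and the sample txg values are all hypothetical, and the sketch only mirrors the stated rule that a label newer than the config txg is not read.

```shell
#!/bin/sh
# Toy model only; this is not ZFS source. It mirrors the rule described
# above: a label whose txg is newer than the config txg is not read.
# Hypothetical usage: label_is_readable <config_txg> <label_txg>
label_is_readable() {
	[ "$2" -le "$1" ]
}

# Hypothetical helper: given a config txg followed by "name:txg" arguments,
# print the name of every label that would be skipped.
skipped_labels() {
	config_txg=$1
	shift
	for entry in "$@"; do
		name=${entry%%:*}
		txg=${entry##*:}
		label_is_readable "$config_txg" "$txg" || echo "$name"
	done
}
```

For instance, with a stale config txg of 4 and one mirror whose labels were bumped to 7 by the re-online, skipped_labels 4 disk0:4 disk1:4 disk2:7 disk3:7 would list disk2 and disk3, which corresponds to the devices that show up as faulted.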
Updated by Christopher Siden over 10 years ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
commit bda8819
Author: George Wilson <george.wilson@delphix.com>
Date: Wed Jan 30 12:01:51 2013

    3422 zpool create/syseventd race yield non-importable pool
    3425 first write to a new zvol can fail with EFBIG
    Reviewed by: Adam Leventhal <ahl@delphix.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Approved by: Garrett D'Amore <garrett@damore.org>