Bug #1778

Assertion failed: rn->rn_nozpool == B_FALSE, file ../common/libzfs_import.c, line 1077, function zpool_open_func

Added by Josef Sipek over 7 years ago. Updated almost 4 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Start date: 2011-11-15
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

Debug build from ~Nov 9th.

  1. zpool export lacie
    halt & unplug the external USB drive; then boot
  2. zpool import lacie
    Assertion failed: ...
  3. rmformat
  4. zpool import lacie
    <ok>

Related issues

Related to illumos gate - Bug #934: FreeBSD's GPT not recognized (Resolved, 2011-04-19)

Has duplicate illumos gate - Bug #4348: Assertion failed: rn->rn_nozpool == B_FALSE, file ../common/libzfs_import.c, line 1080, function zpool_open_func (Closed, 2013-11-23)


History

#1

Updated by Alexander Eremin over 7 years ago

This is probably an old problem with reading EFI labels.

#2

Updated by Josef Sipek over 7 years ago

I hit this assertion again (with a different disk)...

$ sudo zpool import                     
  pool: toshiba
    id: 16816048008619957444
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

    toshiba     ONLINE
      c3t0d0    ONLINE
$ sudo zpool import toshiba
Assertion failed: rn->rn_nozpool == B_FALSE, file ../common/libzfs_import.c, line 1077, function zpool_open_func
#3

Updated by Josef Sipek over 7 years ago

Ok, another update... After trying to import it half a dozen times, the import succeeded. (Same thing happened last time with the lacie pool.)

#4

Updated by Josef Sipek about 7 years ago

I just ran into this issue again. This time, I tried:

# mv /dev/rdsk /dev/old-rdsk
# mv /dev/dsk /dev/old-dsk
# mkdir /dev/dsk /dev/rdsk
# devfsadm -c disk

zpool import succeeded afterward.

#5

Updated by Dan McDonald almost 6 years ago

I've seen this as well on oi_151a7 on a 24-way machine. After disabling 22 of the 24 cores, I no longer hit this bug. I suspect it's a race, as there are comments like this:

    /*
     * create a thread pool to do all of this in parallel;
     * rn_nozpool is not protected, so this is racy in that
     * multiple tasks could decide that the same slice can
     * not hold a zpool, which is benign. Also choose
     * double the number of processors; we hold a lot of
     * locks in the kernel, so going beyond this doesn't
     * buy us much.
     */

#6

Updated by Sam Zaydel almost 5 years ago

I have been dealing with this same issue, so the problem persists, but apparently only in particular situations. As far as I can tell, Dan's workaround of disabling CPUs does the trick without anything more, which suggests things really are racing here: some value is changed midstream and the assertion is tickled. I could not run `zpool import` for any pool name I passed, nor even a bare `zpool import` to scan for pools. I disabled all but one CPU and, like magic, it all worked as intended.

#7

Updated by Andrew Stormont almost 5 years ago

For some reason it processes every slice in a new thread, which looks up the VTOC/GPT and uses that information to update not only the slice it's looking at but neighbouring slices too. We run into the assert when another thread decides that the slice we've already read the label from cannot contain a usable config.

Instead of processing each slice (and there are 22 assuming the device has s[0-16] + p[0-4] slices) in a new thread and then looking up the VTOC/GPT again and again for each one, it should be processing each device in a new thread and reading the VTOC/GPT once and using that information when examining the slices on that device.

I have a webrev which fixes the code to do exactly this and I will post after it has received enough testing.

#8

Updated by Ian Collins almost 5 years ago

I think I'm seeing this trying to import the zones pool on a 20 (physical) core SmartOS box. I can't see a workaround, as this happens during boot. I'd naturally be interested in the fix.

I tried disabling 39 out of 40 cores and still hit the assert. The odd thing is the box was working perfectly until the pool became corrupt and I attempted to recreate it. It has been tested with OmniOS, SmartOS and Solaris 11.1. Now it only works with Solaris.

I have investigated this further and I think the problem (or at least one of them!) is much more basic.

I updated the code to print the drive name in zpool_open_func and then removed the thread pool to serialise the calls. The assertion still fired on the same drive as before. Stepping through, I realised the drive has a zero-sized first partition, so check_one_slice will set rn_nozpool true for s0, the name of the drive being checked. So the assert will always fire if there is a drive with a zero-sized s0. Removing the drive in question removed the assertion.

Here is the partition table for that drive:

Current partition table (original):
Total disk sectors available: 976756717 + 16384 (reserved sectors)

Part  Tag         Flag  First Sector  Size      Last Sector
 0    unassigned  wm    0             0         0
 1    usr         wm    524544        465.50GB  976756750
 2    unassigned  wm    0             0         0
 3    unassigned  wm    0             0         0
 4    unassigned  wm    0             0         0
 5    unassigned  wm    0             0         0
 6    unassigned  wm    0             0         0
 8    reserved    wm    976756751    8.00MB     976773134

The bitch of it was that it was the drive with a Solaris 11 install on it, which I had added to check the box... So it looks like dual booting with Solaris 11 is a no-no.

A couple of observations:

1) The comment in zpool_find_import_impl, before the thread pool creation, claiming that multiple threads updating rn_nozpool is benign is plain wrong: it can cause the assertion to fire.

2) 80 worker threads on a 40 core system is way over the top. Andrew's suggestion of one per device makes sense.

#9

Updated by Mr. Jester almost 5 years ago

I ran into this issue too. The workaround was to create partitions on all of the drives that weren't part of the array: just create basic Solaris2 partitions for 100% of the space.

#10

Updated by Andrew Stormont almost 4 years ago

  • Has duplicate Bug #4348: Assertion failed: rn->rn_nozpool == B_FALSE, file ../common/libzfs_import.c, line 1080, function zpool_open_func added
#11

Updated by Andrew Stormont almost 4 years ago

  • Related to Bug #934: FreeBSD's GPT not recognized added
#12

Updated by George Wilson almost 4 years ago

The code has the following comment:

                 * rn_nozpool is not protected, so this is racy in that
                 * multiple tasks could decide that the same slice can
                 * not hold a zpool, which is benign.

Looking closer at the code, it appears that rn_nozpool is just an optimization, and it would be safe to remove the assert without impacting the functionality of zpool_find_import_impl().

#13

Updated by Electric Monk almost 4 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

commit  bd0f709169e67f4bd34526e186a7c34f595f0d9b
Author: Andrew Stormont <andyjstormont@gmail.com>
Date:   2015-05-29T15:10:20.000Z

    1778 Assertion failed: rn->rn_nozpool == B_FALSE, file ../common/libzfs_import.c, line 1077, function zpool_open_func
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: George Wilson <george@delphix.com>
    Reviewed by: Richard Elling <richard.elling@richardelling.com>
    Approved by: Gordon Ross <gordon.ross@nexenta.com>
