Assertion failed: rn->rn_nozpool == B_FALSE, file ../common/libzfs_import.c, line 1077, function zpool_open_func
Debug build from ~Nov 9th.
- zpool export lacie
halt & unplug the external USB drive; then boot
- zpool import lacie
Assertion failed: ...
- zpool import lacie
Updated by Josef Sipek about 8 years ago
I hit this assertion again (with a different disk)...
$ sudo zpool import
  pool: toshiba
    id: 16816048008619957444
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        toshiba     ONLINE
          c3t0d0    ONLINE

$ sudo zpool import toshiba
Assertion failed: rn->rn_nozpool == B_FALSE, file ../common/libzfs_import.c, line 1077, function zpool_open_func
Updated by Dan McDonald over 6 years ago
I've seen this as well on oi_151a7 on a 24-way machine. By disabling 22/24 cores, I was able to not see this bug. I suspect it's a race, as there are comments like this:
/*
 * create a thread pool to do all of this in parallel;
 * rn_nozpool is not protected, so this is racy in that
 * multiple tasks could decide that the same slice can
 * not hold a zpool, which is benign. Also choose
 * double the number of processors; we hold a lot of
 * locks in the kernel, so going beyond this doesn't
 * buy us much.
 */
Updated by Sam Zaydel almost 6 years ago
I have been dealing with this same issue, so the problem persists, though apparently only in particular situations. As far as I can tell, Dan's workaround of disabling CPUs does the trick without anything more. This suggests that things really are racing here: some value is perhaps changed midstream and the assertion is tickled. I could not run `zpool import` at all, whether I passed a pool name or ran it bare to scan for pools. I killed all but one CPU and, like magic, it all just worked as intended.
Updated by Andrew Stormont almost 6 years ago
For some reason it processes every slice in a new thread, which looks up the VTOC/GPT and uses that information to update not only the slice it's looking at but neighbouring slices too. We run into the assert when another thread decides that the slice we've already read the label from cannot contain a usable config.
Instead of processing each slice (and there are 22 assuming the device has s[0-16] + p[0-4] slices) in a new thread and then looking up the VTOC/GPT again and again for each one, it should be processing each device in a new thread and reading the VTOC/GPT once and using that information when examining the slices on that device.
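The difference between the two schemes can be sketched with a toy model. This is not the libzfs code: `read_label`, `scan_per_slice`, `scan_per_device`, and the read counter are hypothetical names, used only to count how many VTOC/GPT lookups each approach performs for one device with 22 slices.

```c
#include <stddef.h>

#define SLICES_PER_DEVICE 22    /* s[0-16] + p[0-4], as counted above */

/* Hypothetical stand-in for a VTOC/GPT lookup; counts how often it runs. */
static int label_reads = 0;

struct label {
    unsigned long long size[SLICES_PER_DEVICE];  /* slice sizes, in sectors */
};

static void read_label(struct label *lbl)
{
    label_reads++;
    for (size_t i = 0; i < SLICES_PER_DEVICE; i++)
        lbl->size[i] = 0;
}

/* Per-slice scheme: every slice task looks the label up again. */
static int scan_per_slice(void)
{
    label_reads = 0;
    for (size_t s = 0; s < SLICES_PER_DEVICE; s++) {
        struct label lbl;
        read_label(&lbl);
        (void) lbl.size[s];     /* this task examines only its own slice */
    }
    return label_reads;
}

/* Per-device scheme: one task reads the label once, then walks the slices. */
static int scan_per_device(void)
{
    struct label lbl;
    label_reads = 0;
    read_label(&lbl);
    for (size_t s = 0; s < SLICES_PER_DEVICE; s++)
        (void) lbl.size[s];
    return label_reads;
}
```

In this model the per-slice scheme performs 22 label reads per device and the per-device scheme performs one. With a single task per device, only that task touches the flags of the device's slices, which is the property the proposed restructuring relies on.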
I have a webrev which fixes the code to do exactly this, and I will post it after it has received enough testing.
Updated by Ian Collins almost 6 years ago
I think I'm seeing this when trying to import the zones pool on a 20 (physical) core SmartOS box. I can't see a workaround, as this happens during boot. I'd naturally be interested in the fix.
I tried disabling 39 out of 40 cores and still hit the assert. The odd thing is the box was working perfectly until the pool became corrupt and I attempted to recreate it. It has been tested with OmniOS, SmartOS and Solaris 11.1. Now it only works with Solaris.
I have investigated this further and I think the problem (or at least one of them!) is much more basic.
I updated the code to print the drive name in zpool_open_func and then removed the thread pool to serialise the calls. The assertion still fired on the same drive as before. Stepping through, I realised the drive has a zero-sized first partition, so check_one_slice will set rn_nozpool true for s0, the name of the drive being checked. So the assert will always fire if there is a drive with a zero-sized s0. Removing the drive in question removed the assertion.
Here is the partition table for that drive:
Current partition table (original):
Total disk sectors available: 976756717 + 16384 (reserved sectors)

Part         Tag   Flag   First Sector        Size   Last Sector
   0  unassigned     wm              0           0             0
   1         usr     wm         524544    465.50GB     976756750
   2  unassigned     wm              0           0             0
   3  unassigned     wm              0           0             0
   4  unassigned     wm              0           0             0
   5  unassigned     wm              0           0             0
   6  unassigned     wm              0           0             0
   8    reserved     wm      976756751      8.00MB     976773134
The bitch of it was that this was the drive with a Solaris 11 install on it, which I had added to check the box.... So it looks like dual booting with Solaris 11 is a no-no.
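The zero-sized s0 case can be modelled in a few lines. This is a simplified sketch, not the libzfs source: `struct slice_node`, `check_slice`, `assertion_would_fire`, and the `min_sectors` threshold are hypothetical stand-ins, and it assumes (as the report implies) that a config was obtained for the node being opened.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-slice node. */
struct slice_node {
    const char *name;
    bool nozpool;       /* models rn_nozpool */
    bool has_config;    /* models "a config was read for this node" */
};

/*
 * Models the size check: a slice smaller than min_sectors is flagged
 * as unable to hold a pool. min_sectors is an assumed threshold, not
 * the real libzfs constant.
 */
static void check_slice(struct slice_node *n, unsigned long long sectors,
    unsigned long long min_sectors)
{
    if (sectors < min_sectors)
        n->nozpool = true;
}

/* True when the condition asserted in zpool_open_func would be violated. */
static bool assertion_would_fire(const struct slice_node *n)
{
    return n->has_config && n->nozpool;
}
```

Fed the sizes from the table above (0 sectors for slice 0, 976232207 sectors for slice 1), the model flags slice 0 and leaves slice 1 alone. No second thread is involved, which matches the observation that serialising the calls did not make the assertion go away.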
A couple of observations:
1) The comment in zpool_find_import_impl before the thread pool creation, which says that multiple threads updating rn_nozpool is benign, is plain wrong: it can cause the assertion to fire.
2) 80 worker threads on a 40 core system is way over the top. Andrew's suggestion of one per device makes sense.
Updated by George Wilson almost 5 years ago
The code has the following comment:
 * rn_nozpool is not protected, so this is racy in that
 * multiple tasks could decide that the same slice can
 * not hold a zpool, which is benign.
Looking closer at the code, it appears that rn_nozpool is just an optimization, and it would be safe to remove the assert without impacting the functionality of zpool_find_import_impl().
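One way to read this suggestion is sketched below. It is a toy model under stated assumptions, not the committed change: `struct slice_node`, `open_slice_no_assert`, and the `flagged_meanwhile` parameter (which simulates another task setting the flag while the label is being read) are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-slice node. */
struct slice_node {
    bool nozpool;        /* models rn_nozpool */
    bool config_found;   /* models a non-NULL config on the node */
};

enum open_result { SKIPPED, NO_CONFIG, CONFIG_KEPT };

/*
 * Models an open routine that uses the flag only to skip work.
 * label_present says whether a label would be read from the slice;
 * flagged_meanwhile simulates another task setting the flag midway.
 */
static enum open_result open_slice_no_assert(struct slice_node *n,
    bool label_present, bool flagged_meanwhile)
{
    if (n->nozpool)
        return SKIPPED;          /* optimization: nothing to read */

    if (flagged_meanwhile)
        n->nozpool = true;       /* simulated concurrent update */

    n->config_found = label_present;

    /* No assertion on n->nozpool here: a config that was read is kept. */
    return n->config_found ? CONFIG_KEPT : NO_CONFIG;
}
```

In this model a slice flagged in the middle of a read still yields its config, and the flag only saves a later redundant open of the same slice.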
Updated by Electric Monk almost 5 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
commit bd0f709169e67f4bd34526e186a7c34f595f0d9b
Author: Andrew Stormont <firstname.lastname@example.org>
Date: 2015-05-29T15:10:20.000Z

    1778 Assertion failed: rn->rn_nozpool == B_FALSE, file ../common/libzfs_import.c, line 1077, function zpool_open_func
    Reviewed by: Matthew Ahrens <email@example.com>
    Reviewed by: George Wilson <firstname.lastname@example.org>
    Reviewed by: Richard Elling <email@example.com>
    Approved by: Gordon Ross <firstname.lastname@example.org>