ZFS LU stuck in the offlining state
Houston, we have a problem - ZFS Logical Unit stuck in the offlining state with COMSTAR.
A necessary condition for reproducing the problem is I/O load on the Logical Unit.
stmfadm create-lu /dev/zvol/rdsk/GIGO/zvol200g
stmfadm add-view 600144F054A281000000513DB9770005
I mounted the iSCSI device on the initiator and began generating load on it with dd.
At this point, run either of these commands:
stmfadm delete-lu -k 600144F054A281000000513DB9770005
or
stmfadm offline-lu 600144F054A281000000513DB9770005
The Logical Unit will get stuck in the offlining state:
stmfadm list-lu -v
LU Name: 600144F054A281000000513DB9770005
    Operational Status: Offlining
    Provider Name     : sbd
    Alias             : /dev/zvol/rdsk/GIGO/zvol200g
    View Entry Count  : 0
    Data File         : /dev/zvol/rdsk/GIGO/zvol200g
    Meta File         : not set
    Size              : 214748364800
    Block Size        : 512
    Management URL    : not set
    Vendor ID         : SUN
    Product ID        : COMSTAR
    Serial Num        : not set
    Write Protect     : Disabled
    Writeback Cache   : Enabled
    Access State      : Active
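For anyone scripting around this, the state can be pulled out of the `stmfadm list-lu -v` output. This is only a minimal sketch: it assumes the output keeps the `LU Name:` / `Operational Status:` layout shown in this report, and the function name `lu_status` is hypothetical, not part of COMSTAR.

```shell
#!/bin/sh
# lu_status (hypothetical helper): print the Operational Status of one LU.
# Reads `stmfadm list-lu -v` style text on stdin; $1 is the LU name (GUID).
# Assumes the "LU Name: <guid>" / "Operational Status: <state>" layout
# seen in this report.
lu_status() {
    awk -v lu="$1" '
        # Remember which LU the following attribute lines belong to.
        /^LU Name:/ { current = $3 }
        # Print the state of the requested LU and stop.
        /Operational Status/ && current == lu {
            sub(/^.*Operational Status[ \t]*:[ \t]*/, "")
            print
            exit
        }
    '
}
```

It would be used as `stmfadm list-lu -v | lu_status 600144F054A281000000513DB9770005`, which for the LU in this report should print the stuck state.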
As a consequence, it cannot be unloaded from the COMSTAR framework:
stmfadm delete-lu 600144F054A281000000513DB9770005
stmfadm: resource busy
Hence the GIGO pool cannot be exported:
zpool export GIGO
cannot export 'GIGO': pool is busy
As a result, the cluster aborts the failover.
At this point only a reboot helps.
I tested it on OmniOS, OpenIndiana and Illumianos.
The problem occurs on all of them.
Can you fix this?
I found this bug while testing rsf-1.
rsf-1 support says that this bug is fixed in NexentaStor.
Updated by Christopher Siden over 6 years ago
- Status changed from In Progress to Closed
- % Done changed from 80 to 100
commit a49dc89
Author: Saso Kiselkov <firstname.lastname@example.org>
Date: Thu May 23 09:52:46 2013
3621 ZFS LU stuck in the offlining state
Reviewed by: Sebastien Roy <email@example.com>
Reviewed by: Jeff Biseda <firstname.lastname@example.org>
Reviewed by: Dan McDonald <email@example.com>
Approved by: Christopher Siden <firstname.lastname@example.org>
Updated by Bob Lu over 5 years ago
What is the expected result of this fix? I verified the fix on OI and found that the LU no longer gets stuck on delete-lu, but deleting it takes some time: the first couple of delete attempts still return "resource busy", and the delete only completes once the LU state has changed to offline.
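The behaviour described here (delete fails a few times, then succeeds) can be handled in a failover script with a bounded retry. This is a minimal sketch, not what rsf-1 or the fix itself does: the helper name `retry_until` is hypothetical, and it assumes that `stmfadm delete-lu` exits non-zero while it reports "resource busy".

```shell
#!/bin/sh
# retry_until (hypothetical helper): run a command once per second until it
# succeeds or TIMEOUT seconds have been spent waiting.
# Usage: retry_until TIMEOUT CMD [ARGS...]
retry_until() {
    timeout=$1
    shift
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # still failing (e.g. "resource busy") after the timeout
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

For example, `retry_until 60 stmfadm delete-lu 600144F054A281000000513DB9770005` would keep retrying for up to about a minute before giving up.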
My concerns are:
1. Is this the intended result of the fix?
2. Can we reduce the offlining time? It's important for failover.
The offlining time varies from 20ms to 39s:
dtrace: script 'stmf.d' matched 1 probe
CPU ID FUNCTION:NAME
1 6480 stmf_wait_ilu_tasks_finish:deadman-timeout-wait time:20000
5 6480 stmf_wait_ilu_tasks_finish:deadman-timeout-wait time:29480000
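To compare these waits more easily, the `time:` values can be converted to seconds. This is a minimal sketch that assumes the values are in microseconds (an assumption, chosen because 20000 then matches the 20ms mentioned above); `wait_seconds` is a hypothetical name.

```shell
#!/bin/sh
# wait_seconds (hypothetical helper): read dtrace output lines on stdin and
# print each deadman-timeout-wait "time:N" value converted to seconds.
# ASSUMPTION: N is in microseconds; the source does not state the unit.
wait_seconds() {
    awk '
        /deadman-timeout-wait/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^time:[0-9]+$/) {
                    split($i, kv, ":")
                    printf "%.3f\n", kv[2] / 1000000
                }
            }
        }
    '
}
```

It could be fed directly from the script, for example `dtrace -s stmf.d | wait_seconds`.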
Updated by Bob Lu about 5 years ago
Markus Kovero wrote:
Problem seems to be fixed in OI Hipster with illumos-fe2e029.
You mentioned illumos-fe2e029, is this the one?
Author: Robert Mustacchi <email@example.com>
Date: Mon Oct 6 13:07:51 2014 -0700
5202 want ctf(4)
Reviewed by: Keith M Wesolowski <firstname.lastname@example.org>
Reviewed by: Jerry Jelinek <email@example.com>
Reviewed by: Garrett D'Amore <firstname.lastname@example.org>
Approved by: Dan McDonald <email@example.com>
Updated by Markus Kovero almost 5 years ago
- Status changed from Closed to Feedback
I was referring to the illumos kernel version in Hipster. However, even though the problem has become much less frequent, it still exists on illumos-f8554bb, where it causes a kernel panic.
As the problem is no longer a stuck LU but a system crash, I'll continue this on https://www.illumos.org/issues/5079 once I get the crash dump uploaded.