
Bug #5666

Crash of the HA storage cluster and entanglement of hard drives

Added by Alexander Shvayakov almost 6 years ago. Updated almost 6 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
2015-02-26
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:

Description

Hello,

I have a ZFS storage cluster with two nodes, ad998/ad999 (Supermicro), and two Quanta M4240H JBODs.
OS: OmniOS v11 r151012 (omnios-10b9c79)
Clustering: RSF-1

This system worked for about 2-3 years without serious problems.

But in December I had to reinstall the operating system on both nodes. I did it without downtime by migrating the ZFS pools between the two nodes.
I couldn't find another way to add a network card; nothing else worked.

After the reinstall, the storage worked for several months without any problems, until Feb 24.

I got messages from Zabbix: "ZFS pools is DEGRADED".

All pools were located on ad999 at that moment.

I saw many messages in /var/adm/messages:

Feb 24 15:45:55 ad999 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3c0a@3,2/pci1000,3000@0 (mpt_sas1):
Feb 24 15:45:55 ad999 Unable to allocate dma memory for extra SGL.
Feb 24 15:45:55 ad999 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3c0a@3,2/pci1000,3000@0 (mpt_sas1):
Feb 24 15:45:55 ad999 MPT SGL mem alloc failed
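
For what it's worth, the FMA error log usually carries matching telemetry for warnings like these; the standard fmdump/fmadm invocations below are listed only so it's clear where the fmdump -eV output mentioned further down comes from:

fmdump -e       # one-line summary of the error telemetry (ereports)
fmdump -eV      # full detail for each ereport
fmadm faulty    # anything FMA has actually diagnosed as a fault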

I saw hardware errors in the output of iostat -en on all disks.
For example:

c2t5000CCA03E4B7A26d0 Soft Errors: 0 Hard Errors: 11 Transport Errors: 0
Vendor: HITACHI Product: HUS723030ALS640 Revision: A222 Serial No: YVHAHTWD
Size: 3000.59GB <3000592982016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 11 Recoverable: 0
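
For reference, the expanded per-device view above is what iostat -En prints (as far as I can tell); the compact totals table comes from the -en form:

iostat -en      # compact per-device error totals (soft / hard / transport)
iostat -En      # expanded view with vendor, serial number and error breakdown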

Status of the ZFS pool:

  pool: GIGO1
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Feb 24 17:28:54 2015
        2.93M scanned out of 6.26T at 334K/s, (scan is slow, no estimated time)
        1.41M resilvered, 0.00% done
config:

NAME                         STATE     READ WRITE CKSUM
GIGO1                        DEGRADED     0     4     0
  mirror-0                   DEGRADED     0    17     0
    c2t5000CCA03E4AF152d0    ONLINE       0    17    26
    c3t5000CCA03E3F024Ad0    DEGRADED     0    21    66  too many errors
  mirror-1                   DEGRADED     0    22     0
    c2t5000CCA03E4B7A0Ed0    DEGRADED     0    24   106  too many errors
    c3t5000CCA03E4B7A32d0    DEGRADED     1    34   117  too many errors
  mirror-2                   DEGRADED     0     7     0
    spare-0                  DEGRADED     0     5    78
      c2t5000CCA03E4B7A26d0  DEGRADED     0     8   103  too many errors
      c2t5000CCA03E4B7AEAd0  ONLINE       0     5   109  (resilvering)
    c3t5000CCA03E4B7AA2d0    DEGRADED     0    13    80  too many errors
  mirror-4                   DEGRADED     0    14     0
    spare-0                  DEGRADED     0     9    51
      c2t5000CCA03E441E72d0  DEGRADED     0    17   105  too many errors
      c3t5000CCA03E4B7B3Ed0  ONLINE       0     9    76  (resilvering)
    c3t5000CCA03E4B795Ad0    DEGRADED     1    16    53  too many errors
logs
  mirror-3                   UNAVAIL      0    19     0  insufficient replicas
    c2t50011731001B157Ad0    FAULTED      0    19     0  too many errors
    c3t50011731001B1586d0    FAULTED      0    19     0  too many errors
cache
  c1t50015178F36660FAd0      ONLINE       0     0     0
  c1t55CD2E40000942E9d0      ONLINE       0     0     0
spares
  c2t5000CCA03E4B7AEAd0      INUSE     currently in use
  c3t5000CCA03E4B7B3Ed0      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

GIGO1/p1:<0x0>

I decided to move the ZFS pools to ad998, but I got a kernel panic during the pool import.

Unfortunately, I didn't wait for the crash dump to complete, so I don't have data for debugging.
Now I understand that I should have waited, but I was in a hurry.
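
In case it saves someone else the same mistake: the stock dumpadm/savecore setup is all that is needed to make sure the next panic leaves a dump behind (a generic sketch, nothing specific to this system):

dumpadm              # show the current dump device and savecore directory
dumpadm -y           # re-enable automatic savecore on reboot if it was disabled
savecore             # after a panic, save the dump from the dump device for later mdb analysis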

I saw a fresh mpt_sas driver in the updates, so I decided to update both systems and reboot the nodes.
After the reboot I tried to migrate the pool, but got a panic again.

The DEGRADED pools are still running on ad999, where everything started.
I can't import (migrate) them to the second node (ad998).
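
A read-only, unmounted import is sometimes used to get at a pool that panics on a normal import, just to see whether it can be read at all; a minimal sketch with one of the pool names from above (-f only because the pool was last active on the other node):

zpool import -f -N -o readonly=on GIGO1
zpool status -v GIGO1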

Pools are resilvering now, but I see strange things:

NAME                         STATE     READ WRITE CKSUM
GIGO4                        DEGRADED     0     0     0
  mirror-0                   DEGRADED     0     0     0
    c2t5000CCA01A8B9696d0    ONLINE       0     0     6
    c3t5000CCA01A8E08C2d0    DEGRADED     0     0    13  too many errors (resilvering)
  mirror-1                   ONLINE       0     0     0
    spare-0                  ONLINE       0     0     3
      c2t5000CCA01A8D3116d0  ONLINE       0     0     3  (resilvering)
      c2t5000CCA03E461C52d0  ONLINE       0     0     4  (resilvering)
      c3t5000CCA03E4B7B8Ed0  ONLINE       0     0     3  (resilvering)
    c3t5000CCA01A8E5EFEd0    ONLINE       0     0     1
  mirror-2                   ONLINE       0     0     0
    c2t5000CCA01A8E60A2d0    ONLINE       0     0     3
    spare-1                  ONLINE       0     0    11
      c3t5000CCA01A8E5FE6d0  ONLINE       0     0    11  (resilvering)
      c3t5000CCA03E2427B2d0  ONLINE       0     0    11  (resilvering)
logs
  mirror-3                   ONLINE       0     0     0
    c2t50011731001B159Ed0    ONLINE       0     0     0
    c3t50011731001B15A6d0    ONLINE       0     0     0
cache
  c1t5001517BB29F6ABAd0      ONLINE       0     0     0
spares
  c2t5000CCA03E461C51d0      INUSE     currently in use
  c3t5000CCA03E2427B1d0      INUSE     currently in use
  c3t5000CCA03E4B7B8Ed0      INUSE     currently in use

For example, the list of spare disks shows c2t5000CCA03E461C51d0 (currently in use), but in mirror-1/spare-0 I see c2t5000CCA03E461C52d0.

I tried:
smartctl -i -d scsi -T permissive /dev/rdsk/c2t5000CCA03E461C51d0
And I got: Smartctl open device: /dev/rdsk/c2t5000CCA03E461C51d0 failed: No such device

I can't understand why, on ad999, pool GIGO4 shows c2t5000CCA03E461C51d0 as a spare. On ad999 the correct name of this disk is c2t5000CCA03E461C52d0.
The same disk has the name c2t5000CCA03E461C51d0 on ad998.
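
A way to confirm which physical disk a pool entry really refers to, independent of the cXtY name, is to compare ZFS label GUIDs, since the GUID follows the disk across renames; a sketch using the device path that exists on this node (s0 is the usual slice that carries the label):

zdb -l /dev/rdsk/c2t5000CCA03E461C52d0s0    # print the vdev labels, including 'guid' and 'path'
zdb -C GIGO4                                # cached pool config with the GUID recorded for each vdev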

There is more weird stuff.
For example, I saw c13t5000CCA01A8D3115d0s0 in the output of fmdump -eV.
You can see there the timestamp: Feb 24 2015 17:37:47.839312364

The drive names changed from "c13..." to "c2..." after the operating system was reinstalled.
This disk now has the name c2t5000CCA01A8D3115d0.
I reinstalled this system on December 18.
The old name is from another dimension.
That system doesn't exist any more.

Where did that name come from?
Maybe out of some computer hell?
I can't understand it.

Could you recommend something?

I have two tarballs (1-2 MB), one from each node, with the reports of various diagnostic tools.

But I can't attach them; I get error 413 Request Entity Too Large.
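
If splitting would help, the archives could be chopped into pieces under the upload limit with plain split/cat (the file name here is only a placeholder):

split -b 900k ad999-diag.tar.gz ad999-diag.part-
cat ad999-diag.part-* > ad999-diag.tar.gz    # reassemble the pieces on the other side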
