Bug #1197

Hang after resilver finished with mpt

Added by Roy Sigurd Karlsbakk over 6 years ago. Updated over 6 years ago.

Status: New
Start date: 2011-07-11
Priority: Urgent
Due date:
Assignee: -
% Done: 0%
Category: driver - device drivers
Target version: -
Difficulty: Medium
Tags: needs-triage

Description

Hi all

I just had a machine finish a resilver after a drive (well, two actually) died. When the resilver was finished, the Icinga (ex Nagios) check told me the pool was healthy again, so far so good. But then, about 15 minutes later, Icinga complained that the check had timed out, and the box was unavailable. From a remote session I could see OpenIndiana spamming the console with messages:

scsi: WARNING: /pci@0,0/pci8086,340e@7/pci1000,30a0@0... (mpt0):
Disconnected command timeout for Target 23
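
For what it's worth, here is roughly how I pull the affected targets out of the syslog - just plain grep/sed over /var/adm/messages, nothing specific to this box:

# count the "Disconnected command timeout" warnings per target
grep "Disconnected command timeout" /var/adm/messages | \
    sed 's/.*Target //' | sort | uniq -c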

This looks familiar: I have seen similar behaviour on other servers, also right after a resilver. The box uses LSI 3801 and 3081 controllers with the mpt driver. The current OS version is OpenIndiana b148.

It looks like this is the same bug I have hit before. I only became aware of the link to resilvering when this happened within two days on two different machines (the other is 1700 km from here, and I don't have a remote console for it yet - long story).

Is there anything I can do to debug this? I ran 'zpool status' from the console, and it just hangs there and never returns.
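
In case it helps, this is roughly what I plan to try next time it wedges, before rebooting (only a sketch - I'm assuming mdb -k and the FMA tools still respond while the ZFS I/O is stuck):

# kernel thread stacks filtered to the zfs module, to see where things hang
echo "::stacks -m zfs" | mdb -k

# the same for the mpt driver
echo "::stacks -m mpt" | mdb -k

# recent FMA error reports; disk/transport errors should show up here
fmdump -eV | tail -n 100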

Thank you for any help on this one!

roy

History

#1 Updated by Roy Sigurd Karlsbakk over 6 years ago

Just rebooted the box - I couldn't get anything more out of it anyway. After the reboot it reports that a few more drives have died, and the drives that had finished resilvering (c4t37d0 and c4t43d0) are resilvering once more. The full zpool status is below. After the earlier resilver no issues were reported, but the pool/machine still hung, probably because of a problem with c4t23d0, which is now marked as faulted. It should be noted that I have seen this and other OI machines with the same controllers kick out drives that later turned out to be fine, even after thorough testing.

roy

rsk@prv-backup:~$ zpool status pbpool
pool: pbpool
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Jul 11 21:37:12 2011
763G scanned out of 37.6T at 2.60G/s, 4h1m to go
29.8G resilvered, 1.99% done
config:

NAME                 STATE     READ WRITE CKSUM
pbpool               DEGRADED     0     0     0
  raidz2-0           ONLINE       0     0     0
    c4t0d0           ONLINE       0     0     0
    c4t1d0           ONLINE       0     0     0
    c4t2d0           ONLINE       0     0     0
    c4t3d0           ONLINE       0     0     0
    c4t4d0           ONLINE       0     0     0
    c4t5d0           ONLINE       0     0     0
    c4t6d0           ONLINE       0     0     0
  raidz2-1           ONLINE       0     0     0
    c4t7d0           ONLINE       0     0     0
    c4t8d0           ONLINE       0     0     0
    c4t9d0           ONLINE       0     0     0
    c4t10d0          ONLINE       0     0     0
    c4t11d0          ONLINE       0     0     0
    c4t12d0          ONLINE       0     0     0
    c4t13d0          ONLINE       0     0     0
  raidz2-2           ONLINE       0     0     0
    c4t14d0          ONLINE       0     0     0
    c4t15d0          ONLINE       0     0     0
    c4t16d0          ONLINE       0     0     0
    c4t17d0          ONLINE       0     0     0
    c4t18d0          ONLINE       0     0     0
    c4t19d0          ONLINE       0     0     0
    c4t20d0          ONLINE       0     0     0
  raidz2-3           DEGRADED     0     0     0
    c4t21d0          ONLINE       0     0     0
    c4t22d0          ONLINE       0     0     0
    spare-2          UNAVAIL      0     0     0
      c4t23d0        FAULTED      0     0     0  corrupted data
      c8t35d0        ONLINE       0     0     0  (resilvering)
    c4t24d0          ONLINE       0     0     0
    c4t25d0          ONLINE       0     0     0
    c4t26d0          ONLINE       0     0     0
    c4t27d0          ONLINE       0     0     0
  raidz2-4           DEGRADED     0     0     0
    c4t28d0          ONLINE       0     0     0
    c4t29d0          ONLINE       0     0     0
    c4t30d0          ONLINE       0     0     0
    c4t31d0          ONLINE       0     0     0
    c4t32d0          FAULTED      0     0     0  too many errors
    c4t33d0          ONLINE       0     0     0
    c4t34d0          ONLINE       0     0     0
  raidz2-5           DEGRADED     0     0     0
    c4t35d0          ONLINE       0     0     0
    c4t36d0          ONLINE       0     0     0
    replacing-2      DEGRADED     0     0     0
      c4t37d0/old    OFFLINE      0     0     0
      c4t37d0        ONLINE       0     0     0  (resilvering)
    c4t38d0          ONLINE       0     0     0
    c4t39d0          ONLINE       0     0     0
    c4t40d0          ONLINE       0     0     0
    c4t41d0          ONLINE       0     0     0
  raidz2-6           DEGRADED     0     0     0
    c4t42d0          ONLINE       0     0     0
    spare-1          DEGRADED     0     0     0
      replacing-0    DEGRADED     0     0     0
        c4t43d0/old  FAULTED      0     0     0  corrupted data
        c4t43d0      ONLINE       0     0     0  (resilvering)
      c8t34d0        ONLINE       0     0     0
    c4t44d0          ONLINE       0     0     0
    c8t2d0           ONLINE       0     0     0
    c8t3d0           ONLINE       0     0     0
    c8t4d0           ONLINE       0     0     0
    c8t5d0           ONLINE       0     0     0
  raidz2-7           ONLINE       0     0     0
    c8t6d0           ONLINE       0     0     0
    c8t7d0           ONLINE       0     0     0
    c8t8d0           ONLINE       0     0     0
    c8t9d0           ONLINE       0     0     0
    c8t10d0          ONLINE       0     0     0
    c8t11d0          ONLINE       0     0     0
    c8t12d0          ONLINE       0     0     0
  raidz2-8           ONLINE       0     0     0
    c8t13d0          ONLINE       0     0     0
    c8t14d0          ONLINE       0     0     0
    c8t15d0          ONLINE       0     0     0
    c8t16d0          ONLINE       0     0     0
    c8t17d0          ONLINE       0     0     0
    c8t18d0          ONLINE       0     0     0
    c8t19d0          ONLINE       0     0     0
  raidz2-9           ONLINE       0     0     0
    c8t20d0          ONLINE       0     0     0
    c8t21d0          ONLINE       0     0     0
    c8t22d0          ONLINE       0     0     0
    c8t23d0          ONLINE       0     0     0
    c8t24d0          ONLINE       0     0     0
    c8t25d0          ONLINE       0     0     0
    c8t26d0          ONLINE       0     0     0
  raidz2-10          ONLINE       0     0     0
    c8t27d0          ONLINE       0     0     0
    c8t28d0          ONLINE       0     0     0
    c8t29d0          ONLINE       0     0     0
    c8t30d0          ONLINE       0     0     0
    c8t31d0          ONLINE       0     0     0
    c8t32d0          ONLINE       0     0     0
    c8t33d0          ONLINE       0     0     0
logs
  mirror-11          ONLINE       0     0     0
    c6d1             ONLINE       0     0     0
    c7d1             ONLINE       0     0     0
cache
  c8t0d0             ONLINE       0     0     0
  c8t1d0             ONLINE       0     0     0
spares
  c8t34d0            INUSE     currently in use
  c8t35d0            INUSE     currently in use

errors: No known data errors
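
Since these controllers seem to kick out drives that later test fine, I'll also note what I run to see what FMA thinks of them - standard illumos FMA tooling, nothing pool-specific assumed:

# devices FMA currently considers faulted, with the suspect FRUs
fmadm faulty

# recent fault events, newest last
fmdump | tail -n 20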

#2 Updated by Roy Sigurd Karlsbakk over 6 years ago

Also, iostat -en doesn't show anything particularly alarming, and nothing at all for the failed drives.
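
For completeness, this is the error reporting I'm looking at (standard Solaris/illumos iostat; c4t23d0 below is just one of the kicked drives as an example):

# summary error counters per device
iostat -en

# extended device and error details for a single drive
iostat -En c4t23d0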

#3 Updated by Roy Sigurd Karlsbakk over 6 years ago

Just checked SMART info on the drives as well - nothing...
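
For reference, this is roughly how I read the SMART data (assuming smartmontools is installed; the right -d option depends on how the disks sit behind the LSI HBAs):

# SMART attributes for one of the suspect drives
# -d sat is for SATA disks behind a SAS HBA; plain SAS disks may not need it
smartctl -a -d sat /dev/rdsk/c4t23d0s0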
