Bug #5534

Single device pool I/O suspended, zpool export hangs indefinitely

Added by Keith Hall over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date: 2015-01-13
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

I currently have a single-disk pool named backup, which exists only to "zfs receive" from the main 8-disk SAS pool.

Backups run at midnight as a simple zfs send/receive of the snapshot delta, taking no more than 20 minutes.
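
For context, a minimal sketch of what such a nightly incremental send/receive could look like. The source dataset name (tank/data) and the snapshot labels are hypothetical placeholders; only the destination pool name "backup" comes from this report, and the actual job may use different flags. The function only prints the pipeline, it does not run it.

```shell
#!/bin/sh
# Hypothetical names: tank/data and the snapshot labels are placeholders,
# not taken from the affected system. Only "backup" is from this report.
SRC_DS=tank/data
DST_POOL=backup

# Print the incremental send/receive pipeline for two snapshot labels.
# Nothing is executed here; pipe the output to sh to actually run it.
incremental_cmd() {
    prev=$1
    curr=$2
    printf 'zfs send -i %s@%s %s@%s | zfs receive -d %s\n' \
        "$SRC_DS" "$prev" "$SRC_DS" "$curr" "$DST_POOL"
}

incremental_cmd 2015-01-10 2015-01-11
```

With -d, zfs receive drops the source pool name from the sent snapshot path, so tank/data would land as backup/data in this sketch.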

At some point this pool suffered a problem (interestingly, according to dmesg this was around 11am, when the pool would not have been accessed) and the pool I/O is now suspended. I had previously set failmode to continue on this pool.
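
For reference, a small sketch of how the failmode property is set. The helper name failmode_cmd is hypothetical; the three accepted values (wait, continue, panic) are the ones documented for the property. The helper only prints the zpool command rather than running it.

```shell
#!/bin/sh
# Print the zpool command that sets failmode, after checking the value
# against the three documented modes. failmode_cmd is a hypothetical helper.
failmode_cmd() {
    pool=$1
    mode=$2
    case $mode in
        wait|continue|panic)
            printf 'zpool set failmode=%s %s\n' "$mode" "$pool" ;;
        *)
            echo "unknown failmode: $mode" >&2
            return 1 ;;
    esac
}

failmode_cmd backup continue
```

The current value can be read back with "zpool get failmode backup".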

I have tried a number of things, listed below, but stopped short of cfgadm offline/unconfigure in case there is any useful data we can get from this, as zpool export is hanging and unkillable at the moment.

root@box:~# zpool status backup
pool: backup
state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://illumos.org/msg/ZFS-8000-JQ
scan: none requested
config:

NAME                     STATE     READ WRITE CKSUM
backup                   ONLINE   5.48K     5     0
  c3t5000C5004A7113A8d0  ONLINE       1     0     0

errors: 17 data errors, use '-v' for a list

zpool status -v reports "errors: List of errors unavailable (insufficient privileges)".

I tried to export the pool with both zpool export and zpool export -f. Both commands just sit there, completely unkillable, while other zpool commands, e.g. status, still access the drive.

root@box:~# kill -9 18753
root@box:~# kill -9 10956
root@box:~# ps -ef | grep export
root 11666 10964 0 09:30:01 pts/5 0:00 grep export
root 18753 1 0 00:20:11 ? 0:00 zpool export backup
root 10956 12469 0 09:15:06 pts/1 0:00 zpool export -f backup
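
If kernel stacks of the stuck processes would help, here is a rough sketch of how they could be collected. It assumes mdb is available on the box and that the usual ::pid2proc, ::walk thread and ::findstack dcmds are the right tools here; the helper name stack_cmds is hypothetical. It reads "ps -ef" output and only prints the mdb pipelines, one per "zpool export" process.

```shell
#!/bin/sh
# Read `ps -ef` output on stdin and print one mdb pipeline per
# `zpool export` process. The grep line is skipped because its command
# field is "grep". Assumes STIME is a single field (e.g. 00:20:11).
stack_cmds() {
    awk '$8 == "zpool" && $9 == "export" {
        printf "echo \"0t%s::pid2proc | ::walk thread | ::findstack -v\" | mdb -k\n", $2
    }'
}

# Demo using the ps lines from this report.
stack_cmds <<'EOF'
root 11666 10964 0 09:30:01 pts/5 0:00 grep export
root 18753 1 0 00:20:11 ? 0:00 zpool export backup
root 10956 12469 0 09:15:06 pts/1 0:00 zpool export -f backup
EOF
```

On the live system this would be used as "ps -ef | stack_cmds", and the printed commands run as root.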

The drive is online and readable via dd:

/usr/gnu/bin/dd if=/dev/dsk/c3t5000C5004A7113A8d0 of=/dev/null
^C2445191+0 records in
2445190+0 records out
1251937280 bytes (1.3 GB) copied, 7.47739 s, 167 MB/s

cfgadm output

c1 scsi-sas connected configured unknown
c1::w5000c5004a7113a8,0 disk-path connected configured unknown

dmesg output

Jan 11 11:07:41 box scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@17/pci1000,3020@0 (mpt_sas10):
Jan 11 11:07:41 box mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31110d00
Jan 11 11:07:41 box scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@17/pci1000,3020@0 (mpt_sas10):
Jan 11 11:07:41 box mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31110d00
Jan 11 11:07:44 box scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@17/pci1000,3020@0 (mpt_sas10):
Jan 11 11:07:44 box mptsas_handle_event_sync: IOCStatus=0x8000, IOCLogInfo=0x31170000
Jan 11 11:07:44 box scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci15ad,7a0@17/pci1000,3020@0 (mpt_sas10):
Jan 11 11:07:44 box mptsas_handle_event: IOCStatus=0x8000, IOCLogInfo=0x31170000

root@box:~# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jan 11 11:08:09 ac8e6988-748a-c465-84fc-a05a6033d006 ZFS-8000-FD Major

Host : box
Platform : VMware-Virtual-Platform Chassis_id : VMware-56-4d-d5-d0-8b-95-d8-73-3f-c1-70-75-09-e7-6d-76
Product_sn :

Fault class : fault.fs.zfs.vdev.io
Affects : zfs://pool=backup/vdev=fceeca4ca34cfcc5
faulted but still in service
Problem in : zfs://pool=backup/vdev=fceeca4ca34cfcc5
faulted but still in service

Description : The number of I/O errors associated with a ZFS device exceeded
acceptable levels. Refer to
http://illumos.org/msg/ZFS-8000-FD for more information.

Response : The device has been offlined and marked as faulted. An attempt
will be made to activate a hot spare if available.

Impact : Fault tolerance of the pool may be compromised.

Action : Run 'zpool status -x' and replace the bad device.

This drive is also power managed, so it is likely to have been spun down at that time. It is possible that the cabling was disturbed, but should a hot-swap device issue create a problem such as this?

The system is still running in this state, so if there is any further diagnosis or any dumps I can take to help, please let me know.

Keith
