Bug #1069 (open)

IO hang with bad disk on mpt and mpt_sas

Added by Alasdair Lumsden over 10 years ago. Updated about 10 years ago.

Status: New
Priority: High
Assignee: -
Category: kernel
Start date: 2011-05-26
Due date: -
% Done: 0%
Estimated time: -
Difficulty: Hard
Tags: needs-triage
Gerrit CR: -

Description

Hi,

On a storage server, I've encountered a failure mode where a single disk has hung, reporting 100% busy in iostat. As a consequence, all IO to the zpool that disk is a member of has come to a complete standstill, with all other disks in the pool reporting zero activity.

I've witnessed this behaviour on several different systems, with both SATA and SAS disks, on both mpt and mpt_sas. I've spoken to others, including George Wilson, who has seen this behaviour too. It's quite a serious issue, as it can crop up at any time on a production storage server, and when it does, it causes an outage that can only be resolved by pulling the disk or rebooting the box.

The storage node consists of:

root ~ (san01.ixlon1): /usr/bin/uname -a
SunOS san01.ixlon1.everycity.co.uk 5.11 oi_148 i86pc i386 i86pc

Supermicro X8DTH-6F Dual Intel E5504, 32GB RAM
6 x LSI 3081E-R mpt-based cards
48 x Western Digital SATA disks

"iostat -x 2" output

                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
...
sd27      0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd28      0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd29      0.0    0.0    0.0    0.0  0.0  4.0    0.0   0 100 
sd30      0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd31      0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
...

All IO to the pool has hung, so commands like "zpool list" stick.
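The signature above, one device pinned at 100% busy with commands outstanding (actv > 0) but zero throughput while every other disk is idle, can be spotted mechanically. A minimal sketch (a hypothetical helper, not part of any illumos tooling) that scans `iostat -x` lines for devices in that state:

```python
# Hypothetical helper: flag devices that look wedged in `iostat -x`
# output -- 100% busy with a non-zero active queue (actv), yet zero
# reads and writes per second. Field positions follow the header
# shown above: device r/s w/s kr/s kw/s wait actv svc_t %w %b.
def wedged_devices(iostat_lines):
    hung = []
    for line in iostat_lines:
        fields = line.split()
        if len(fields) != 10 or fields[0] == "device":
            continue  # skip headers and malformed lines
        dev = fields[0]
        r_s, w_s = float(fields[1]), float(fields[2])
        actv, busy = float(fields[6]), float(fields[9])
        if busy >= 100.0 and r_s == 0.0 and w_s == 0.0 and actv > 0:
            hung.append(dev)
    return hung

sample = [
    "device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b",
    "sd28      0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0",
    "sd29      0.0    0.0    0.0    0.0  0.0  4.0    0.0   0 100",
]
print(wedged_devices(sample))  # ['sd29']
```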

"iostat -E" output shows no drive errors at all:

http://linux01.everycity.co.uk/~alasdair/iostat_e.txt

"*sd_state::walk softstate | ::print -t struct sd_lun" mdb output:

http://linux01.everycity.co.uk/~alasdair/sd_lun_state.txt

No FMA events have been generated:

root ~ (san01.ixlon1): fmdump -eV
TIME CLASS
fmdump: warning: /var/fm/fmd/errlog is empty

sd_io_time has been tuned down to 7 seconds, with 3 retries configured for the Western Digital drives. Several hours have now passed, so sd_io_time is evidently being ignored:

set sd:sd_io_time=7 (/etc/system)
sd-config-list = "ATA WDC WD7501AALS-0", "retries-timeout:3"; (/kernel/drv/sd.conf)

root ~ (san01.ixlon1): echo "sd_io_time::print" | mdb -k
0x7
root ~ (san01.ixlon1): echo "::walk sd_state | ::grep '.!=0' | ::sd_state" | mdb -k | egrep "^un|un_retry_count|un_cmd_timeout" 
un: ffffff090f43c640
   un_retry_count = 0x3
   un_cmd_timeout = 0x7
un: ffffff09184a5940
   un_retry_count = 0x3
   un_cmd_timeout = 0x7
... (all report 0x3 (or 0x5 for some devices) and 0x7)
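Given those tunables, the stuck commands should have been abandoned within seconds, not hours. A back-of-envelope check (assuming the timeout applies per attempt and that the retry count means additional attempts beyond the first, which is an assumption about sd's accounting, not something confirmed here):

```python
# Back-of-envelope: worst-case time before sd should give up on a
# single command, assuming un_cmd_timeout applies per attempt and
# un_retry_count counts extra attempts beyond the first (an
# assumption, not confirmed from the driver source).
un_cmd_timeout = 0x7   # 7 seconds per attempt, from the mdb output above
un_retry_count = 0x3   # 3 retries, from sd.conf

worst_case = un_cmd_timeout * (1 + un_retry_count)
print(worst_case)  # 28 -- seconds; the observed hang lasted hours
```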

The zpool layout consists of:

zpool create -f data \
mirror c1t0d0 c7t0d0 \
mirror c1t1d0 c7t1d0 \
mirror c1t2d0 c7t2d0 \
mirror c1t3d0 c7t3d0 \
mirror c1t4d0 c7t4d0 \
mirror c1t5d0 c7t5d0 \
mirror c1t6d0 c7t6d0 \
mirror c1t7d0 c7t7d0 \
mirror c2t0d0 c8t0d0 \
mirror c2t1d0 c8t1d0 \
mirror c2t2d0 c8t2d0 \
mirror c2t3d0 c8t3d0 \
mirror c2t4d0 c8t4d0 \
mirror c2t5d0 c8t5d0 \
mirror c2t6d0 c8t6d0 \
mirror c2t7d0 c8t7d0 \
mirror c3t8d0 c9t0d0 \
mirror c3t9d0 c9t1d0 \
mirror c3t10d0 c9t2d0 \
mirror c3t11d0 c9t3d0 \
mirror c3t12d0 c9t4d0 \
mirror c3t13d0 c9t5d0 \
mirror c3t14d0 c9t6d0 \
spare c3t15d0 c9t7d0 \
log c3t0d0 \
c3t1d0 \
c3t2d0 \
c3t3d0

Some general mdb investigation:

::walk zio_root | ::zio -r
ADDRESS                                  TYPE  STAGE            WAITER

ffffff0918ca4cd0                         NULL  CHECKSUM_VERIFY  ffffff003da48c40
ffffff0915a27c80                        WRITE VDEV_IO_START    -
ffffff0918eafcc0                        WRITE VDEV_IO_START    -
ffffff0918d09968                        WRITE VDEV_IO_START    -
ffffff0918aadc98                        WRITE VDEV_IO_START    -
ffffff09195eb358                         NULL  OPEN             -
ffffff090dfecc88                         NULL  OPEN             -

ffffff0918ca4cd0::zio
ADDRESS                                  TYPE  STAGE            WAITER          
ffffff0918ca4cd0                         NULL  CHECKSUM_VERIFY  ffffff003da48c40
ffffff0918ca4cd0::zio -r
ADDRESS                                  TYPE  STAGE            WAITER          
ffffff0918ca4cd0                         NULL  CHECKSUM_VERIFY  ffffff003da48c40
ffffff0915a27c80                        WRITE VDEV_IO_START    -
ffffff0918eafcc0                        WRITE VDEV_IO_START    -
ffffff0918d09968                        WRITE VDEV_IO_START    -
ffffff0918aadc98                        WRITE VDEV_IO_START    -

ffffff0918ca4cd0::print -t struct zio
struct zio {
   zbookmark_t io_bookmark = {
       uint64_t zb_objset = 0
       uint64_t zb_object = 0
       int64_t zb_level = 0
       uint64_t zb_blkid = 0
   }
   zio_prop_t io_prop = {
       enum zio_checksum zp_checksum = 0 (ZIO_CHECKSUM_INHERIT)
       enum zio_compress zp_compress = 0 (ZIO_COMPRESS_INHERIT)
       dmu_object_type_t zp_type = 0 (DMU_OT_NONE)
       uint8_t zp_level = 0
       uint8_t zp_copies = 0
       uint8_t zp_dedup = 0
       uint8_t zp_dedup_verify = 0
   }
   zio_type_t io_type = 0 (ZIO_TYPE_NULL)
   enum zio_child io_child_type = 3 (ZIO_CHILD_LOGICAL)
   int io_cmd = 0
   uint8_t io_priority = 0
   uint8_t io_reexecute = 0
   uint8_t [2] io_state = [ 0x1, 0 ]
   uint64_t io_txg = 0
   spa_t *io_spa = 0xffffff090e71a580
   blkptr_t *io_bp = 0
   blkptr_t *io_bp_override = 0
   blkptr_t io_bp_copy = {
       dva_t [3] blk_dva = [
           dva_t {
               uint64_t [2] dva_word = [ 0, 0 ]
           },
           dva_t {
               uint64_t [2] dva_word = [ 0, 0 ]
           },
           dva_t {
               uint64_t [2] dva_word = [ 0, 0 ]
           },
       ]
       uint64_t blk_prop = 0         
       uint64_t [2] blk_pad = [ 0, 0 ]
       uint64_t blk_phys_birth = 0
       uint64_t blk_birth = 0
       uint64_t blk_fill = 0
       zio_cksum_t blk_cksum = {
           uint64_t [4] zc_word = [ 0, 0, 0, 0 ]
       }
   }
   list_t io_parent_list = {
       size_t list_size = 0x30
       size_t list_offset = 0x10
       struct list_node list_head = {
           struct list_node *list_next = 0xffffff0918ca4dc0
           struct list_node *list_prev = 0xffffff0918ca4dc0
       }
   }
   list_t io_child_list = {
       size_t list_size = 0x30
       size_t list_offset = 0x20
       struct list_node list_head = {
           struct list_node *list_next = 0xffffff0918da5788
           struct list_node *list_prev = 0xffffff090e24c380
       }
   }
   zio_link_t *io_walk_link = 0
   zio_t *io_logical = 0
   zio_transform_t *io_transform_stack = 0
   zio_done_func_t *io_ready = 0
   zio_done_func_t *io_done = 0
   void *io_private = 0xffffff003da489e8
   int64_t io_prev_space_delta = 0
   blkptr_t io_bp_orig = {
       dva_t [3] blk_dva = [
           dva_t {
               uint64_t [2] dva_word = [ 0, 0 ]
           },
           dva_t {
               uint64_t [2] dva_word = [ 0, 0 ]
           },
           dva_t {
               uint64_t [2] dva_word = [ 0, 0 ]
           },
       ]
       uint64_t blk_prop = 0
       uint64_t [2] blk_pad = [ 0, 0 ]
       uint64_t blk_phys_birth = 0
       uint64_t blk_birth = 0
       uint64_t blk_fill = 0
       zio_cksum_t blk_cksum = {
           uint64_t [4] zc_word = [ 0, 0, 0, 0 ]
       }
   }
   void *io_data = 0
   void *io_orig_data = 0
   uint64_t io_size = 0
   uint64_t io_orig_size = 0
   vdev_t *io_vd = 0
   void *io_vsd = 0
   const zio_vsd_ops_t *io_vsd_ops = 0
   uint64_t io_offset = 0
   uint64_t io_deadline = 0
   avl_node_t io_offset_node = {
       struct avl_node *[2] avl_child = [ 0, 0 ]
       uintptr_t avl_pcb = 0
   }
   avl_node_t io_deadline_node = {
       struct avl_node *[2] avl_child = [ 0, 0 ]
       uintptr_t avl_pcb = 0
   }
   avl_tree_t *io_vdev_tree = 0
   enum zio_flag io_flags = 0x140 (ZIO_FLAG_{CANFAIL|CONFIG_WRITER})
   enum zio_stage io_stage = 0x80000 (ZIO_STAGE_CHECKSUM_VERIFY)
   enum zio_stage io_pipeline = 0x108000 (ZIO_STAGE_{READY|DONE})
   enum zio_flag io_orig_flags = 0x140 (ZIO_FLAG_{CANFAIL|CONFIG_WRITER})
   enum zio_stage io_orig_stage = 0x1 (ZIO_STAGE_OPEN)
   enum zio_stage io_orig_pipeline = 0x108000 (ZIO_STAGE_{READY|DONE})
   int io_error = 0
   int [4] io_child_error = [ 0, 0, 0, 0 ]
   unsigned long [4][2] io_children = [
       unsigned long [2] [ 0, 0x4 ]
       unsigned long [2] [ 0, 0 ]
       unsigned long [2] [ 0, 0 ]
       unsigned long [2] [ 0, 0 ]
   ]
   uint64_t io_child_count = 0x4
   uint64_t io_parent_count = 0
   uint64_t *io_stall = 0xffffff0918ca4f60
   zio_t *io_gang_leader = 0
   zio_gang_node_t *io_gang_tree = 0
   void *io_executor = 0xffffff003da48c40
   void *io_waiter = 0xffffff003da48c40
   kmutex_t io_lock = {
       void *[1] _opaque = [ 0 ]
   }
   kcondvar_t io_cv = {
       ushort_t _opaque = 0x1
   }
   zio_cksum_report_t *io_cksum_report = 0
   uint64_t io_ena = 0
}
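The io_stage and io_pipeline fields in that dump are bitmasks. A small decoding sketch, using only the enum values visible in the output above (ZIO_STAGE_OPEN = 0x1, ZIO_STAGE_CHECKSUM_VERIFY = 0x80000, plus the READY/DONE split implied by ZIO_STAGE_{READY|DONE} = 0x108000); the remaining stage values aren't shown in the dump and are omitted here:

```python
# Decode zio stage bitmasks from the ::print output above. Only the
# stage values visible in the dump are included; READY = 0x8000 and
# DONE = 0x100000 are inferred from ZIO_STAGE_{READY|DONE} = 0x108000.
STAGES = {
    0x1: "OPEN",
    0x8000: "READY",
    0x80000: "CHECKSUM_VERIFY",
    0x100000: "DONE",
}

def decode(mask):
    return [name for bit, name in STAGES.items() if mask & bit]

print(decode(0x80000))   # ['CHECKSUM_VERIFY'] -- io_stage
print(decode(0x108000))  # ['READY', 'DONE']   -- io_pipeline
```

So the parent NULL zio is parked at CHECKSUM_VERIFY with a pipeline of only READY|DONE, waiting (io_stall, io_children) on the four WRITE children that never get past VDEV_IO_START.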

ffffff0918ca4cd0::print -t struct zio io_waiter
void *io_waiter = 0xffffff003da48c40

0xffffff003da48c40::findstack
stack pointer for thread ffffff003da48c40: ffffff003da48920
[ ffffff003da48920 _resume_from_idle+0xf1() ]
 ffffff003da48950 swtch+0x145()
 ffffff003da48980 cv_wait+0x61()
 ffffff003da489c0 zio_wait+0x5d()
 ffffff003da48a40 vdev_uberblock_sync_list+0x163()
 ffffff003da48ad0 vdev_config_sync+0x129()
 ffffff003da48b80 spa_sync+0x5cd()
 ffffff003da48c20 txg_sync_thread+0x247()
 ffffff003da48c30 thread_start+8()

George Wilson and Gordon Ross took a look on the box, here's the discussion between us from #illumos:

http://linux01.everycity.co.uk/~alasdair/illumos_log.txt

I can provide SSH login to this server (it has a public IP), but it will need to be returned to service within 24-48 hours, so if anyone would like to take a look, please get in touch at alasdairrr gmail com and I'll set up an account.

Cheers,

Alasdair


Related issues

Related to illumos gate - Bug #1343: mpt_sas sometimes stalls forever (New, 2011-08-05)

Related to illumos gate - Bug #1032: mpt_sas driver bottlenecks badly with single misbehaving disk on controller (New, 2011-05-14)
