Bug #2607 (open)

pool can suspend with spa_namespace_lock held by spa_config_update

Added by Albert Lee about 11 years ago. Updated about 11 years ago.

Status:          New
Priority:        High
Assignee:        -
Category:        zfs - Zettabyte File System
Start date:      2012-04-10
Due date:
% Done:          10%
Estimated time:
Difficulty:      Medium
Tags:            zfs, deadlock
Gerrit CR:
External Bug:

Description

Some intermittent transport errors from my SATA port multiplier caused a pool suspension, which has resulted in a hung zpool import. It looks like attempting to clear (spa_resume) a pool that is in this state results in a deadlock.
Analysis here:
http://pastie.org/3759669

ffffff001fcfcc40 fffffffffbc2eb00                0   0  99 ffffff04edfe7bca
  PC: _resume_from_idle+0xf1    THREAD: spa_async_thread()
  stack pointer for thread ffffff001fcfcc40: ffffff001fcfca40
  [ ffffff001fcfca40 _resume_from_idle+0xf1() ]
    swtch+0x141()
    cv_wait+0x70()
    txg_wait_synced+0x83()
    spa_config_update+0xdc()
    spa_config_update+0xfa()
    spa_async_thread+0xe1()
    thread_start+8()

ffffff00207d2c40 fffffffffbc2eb00                0   0  60 ffffff04f90e23b0
  PC: _resume_from_idle+0xf1    THREAD: txg_sync_thread()
  stack pointer for thread ffffff00207d2c40: ffffff00207d2960
  [ ffffff00207d2960 _resume_from_idle+0xf1() ]
    swtch+0x141()
    cv_wait+0x70()
    zio_wait+0x5b()
    dsl_pool_sync+0x26e()
    spa_sync+0x314()
    txg_sync_thread+0x204()
    thread_start+8()

ffffff04ec4e28c0 ffffff04ed7e4078 ffffff04ed226780   2  59 ffffffffc00378b0
  PC: _resume_from_idle+0xf1    CMD: zpool import -fFn -R /tmp/azupool azupool
  stack pointer for thread ffffff04ec4e28c0: ffffff00204a09a0
  [ ffffff00204a09a0 _resume_from_idle+0xf1() ]
    swtch+0x141()
    turnstile_block+0x262()
    mutex_vector_enter+0x3c5()
    spa_open_common+0x79()
    spa_open+0x1e()
    zfs_log_history+0x37()
    zfsdev_ioctl+0x168()
    cdev_ioctl+0x39()
    spec_ioctl+0x60()
    fop_ioctl+0x55()
    ioctl+0x9b()
    _sys_sysenter_post_swapgs+0x149()

> ffffff04ec4e28c0::findstack -v
stack pointer for thread ffffff04ec4e28c0: ffffff00204a09a0
[ ffffff00204a09a0 _resume_from_idle+0xf1() ]
  ffffff00204a09d0 swtch+0x141()
  ffffff00204a0a80 turnstile_block+0x262(ffffff04e764c128, 0, ffffffffc00378b0, 
  fffffffffbc079c0, 0, 0)
  ffffff00204a0af0 mutex_vector_enter+0x3c5(ffffffffc00378b0)
  ffffff00204a0b90 spa_open_common+0x79()
  ffffff00204a0bc0 spa_open+0x1e()
  ffffff00204a0c00 zfs_log_history+0x37(ffffff04edd0d000)
  ffffff00204a0c80 zfsdev_ioctl+0x168(5a00000000, 5a02, 80425a0, 100003, 
  ffffff04ec4ccd10, ffffff00204a0e68)
  ffffff00204a0cc0 cdev_ioctl+0x39()
  ffffff00204a0d10 spec_ioctl+0x60()
  ffffff00204a0da0 fop_ioctl+0x55(ffffff04e4d6ec80, 5a02, 80425a0, 100003, 
  ffffff04ec4ccd10, ffffff00204a0e68, 0)
  ffffff00204a0ec0 ioctl+0x9b(3, 5a02, 80425a0)
  ffffff00204a0f10 _sys_sysenter_post_swapgs+0x149()
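
# (for reference, spa_open_common() grabs the global spa_namespace_lock up front if the
# caller doesn't already hold it -- paraphrased from spa.c, not verbatim)

static int
spa_open_common(const char *pool, spa_t **spapp, void *tag, ...)
{
        ...
        if (!(locked = MUTEX_HELD(&spa_namespace_lock)))
                mutex_enter(&spa_namespace_lock);       /* <- the mutex_vector_enter above */

        if ((spa = spa_lookup(pool)) == NULL) {
                ...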

# someone's holding spa_namespace_lock

> ffffffffc00378b0::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
ffffffffc00378b0 adapt ffffff001fcfcc40      -      -     yes

# this guy is, and also waiting for the txg_sync_thread:

> ffffff001fcfcc40::findstack -v
stack pointer for thread ffffff001fcfcc40: ffffff001fcfca40
[ ffffff001fcfca40 _resume_from_idle+0xf1() ]
  ffffff001fcfca70 swtch+0x141()
  ffffff001fcfcab0 cv_wait+0x70(ffffff04edfe7bca, ffffff04edfe7b90)
  ffffff001fcfcb00 txg_wait_synced+0x83(ffffff04edfe7a00, 4cdb50)
  ffffff001fcfcb60 spa_config_update+0xdc()
  ffffff001fcfcbc0 spa_config_update+0xfa()
  ffffff001fcfcc20 spa_async_thread+0xe1(ffffff04f728a040)
  ffffff001fcfcc30 thread_start+8()
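
# this looks like the SPA_ASYNC_CONFIG_UPDATE task: spa_async_thread() holds spa_namespace_lock
# across the whole config update, and spa_config_update() ends by waiting for the new config to
# be synced out -- roughly (paraphrased, not verbatim):

        /* spa_async_thread() */
        if (tasks & SPA_ASYNC_CONFIG_UPDATE) {
                mutex_enter(&spa_namespace_lock);
                spa_config_update(spa, SPA_CONFIG_UPDATE_POOL);
                mutex_exit(&spa_namespace_lock);
        }

        /* spa_config_update() */
        ASSERT(MUTEX_HELD(&spa_namespace_lock));
        ...
        /* wait for the updated config to go out with the next txg */
        txg_wait_synced(spa->spa_dsl_pool, txg);
        ...
        if (what == SPA_CONFIG_UPDATE_POOL)
                spa_config_update(spa, SPA_CONFIG_UPDATE_VDEVS);  /* the recursive frame above */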

> ffffff04edfe7a00::print dsl_pool_t
{
    dp_spa = 0xffffff04f728a040
    dp_meta_objset = 0xffffff04ea414cc0
    dp_root_dir = 0xffffff04ede10400
    dp_mos_dir = 0xffffff04e7b55200
    dp_free_dir = 0xffffff04f245c200
    dp_origin_snap = 0xffffff04edf5a7c0
    dp_root_dir_obj = 0x2
    dp_vnrele_taskq = 0xffffff04ea7d2a18
    dp_meta_rootbp = {
        blk_dva = [
            {
                dva_word = [ 0, 0 ]
            },
            {
                dva_word = [ 0, 0 ]
            },
        ]
        blk_prop = 0x800b070300000003
> ::spa -v
ADDR                 STATE NAME                                                
ffffff04f728a040    ACTIVE azupool

    ADDR             STATE     AUX          DESCRIPTION                        
    ffffff04ee0120c0 HEALTHY   -            root
    ffffff04e97a8c80 HEALTHY   -              raidz
    ffffff04ec424080 HEALTHY   -                /dev/dsk/c2t610d0s0
    ffffff04e4dbc380 HEALTHY   -                /dev/dsk/c2t546d0s0
    ffffff04ea429940 HEALTHY   -                /dev/dsk/c2t514d0s0
    ffffff04f726b2c0 HEALTHY   -                /dev/dsk/c2t578d0s0
ffffff04e502c580    ACTIVE zones

    ffffff04ead53340 HEALTHY   -            root
    ffffff04ead526c0 HEALTHY   -              mirror
    ffffff04ead52d00 HEALTHY   -                /dev/dsk/c2t0d0s0
    ffffff04ead52080 HEALTHY   -                /dev/dsk/c2t1d0s0

> ffffff00207d2c40::findstack -v
stack pointer for thread ffffff00207d2c40: ffffff00207d2960
[ ffffff00207d2960 _resume_from_idle+0xf1() ]
  ffffff00207d2990 swtch+0x141()
  ffffff00207d29d0 cv_wait+0x70(ffffff04f90e23b0, ffffff04f90e23a8)
  ffffff00207d2a10 zio_wait+0x5b(ffffff04f90e20a8)
  ffffff00207d2aa0 dsl_pool_sync+0x26e()
  ffffff00207d2b70 spa_sync+0x314(ffffff04f728a040, 4cdb50)
  ffffff00207d2c20 txg_sync_thread+0x204(ffffff04edfe7a00)
  ffffff00207d2c30 thread_start+8()

> ffffff04f90e20a8::zio   
ADDRESS                                  TYPE  STAGE            WAITER

ffffff04f90e20a8                         NULL  DONE             ffffff00207d2c40
> ffffff04f90e20a8::zio -p
ADDRESS                                  TYPE  STAGE            WAITER

 ffffff04f8e3ce18                        NULL  OPEN             -
> ffffff04f90e20a8::zio -r
ADDRESS                                  TYPE  STAGE            WAITER

ffffff04f90e20a8                         NULL  DONE             ffffff00207d2c40
 ffffff04f9130df0                        WRITE DONE             -
  ffffff04f8db6e20                       WRITE DONE             -
   ffffff04f8db8e50                      WRITE DONE             -

> ffffff04f90e20a8::print zio_t io_executor
io_executor = 0xffffff0020158c40

> 0xffffff0020158c40::findstack -v
stack pointer for thread ffffff0020158c40: ffffff0020158990
[ ffffff0020158990 _resume_from_idle+0xf1() ]
  ffffff00201589c0 swtch+0x141()
  ffffff0020158a00 cv_wait+0x70(ffffff04f7a8d600, ffffff04f7a8d5f0)
  ffffff0020158a70 taskq_thread_wait+0xbe(ffffff04f7a8d5d0, ffffff04f7a8d5f0, 
  ffffff04f7a8d600, ffffff0020158ad0, ffffffffffffffff)
  ffffff0020158b30 taskq_thread+0x37c(ffffff04f7a8d5d0)
  ffffff0020158b40 thread_start+8()
> ffffff04f90e20a8::print zio_t io_stage   
io_stage = 0x100000 (ZIO_STAGE_DONE)
> ffffff04f90e20a8::print zio_t io_waiter
io_waiter = 0xffffff00207d2c40

> 0xffffff00207d2c40::findstack -v
stack pointer for thread ffffff00207d2c40: ffffff00207d2960
[ ffffff00207d2960 _resume_from_idle+0xf1() ]
  ffffff00207d2990 swtch+0x141()
  ffffff00207d29d0 cv_wait+0x70(ffffff04f90e23b0, ffffff04f90e23a8)
  ffffff00207d2a10 zio_wait+0x5b(ffffff04f90e20a8)
  ffffff00207d2aa0 dsl_pool_sync+0x26e()
  ffffff00207d2b70 spa_sync+0x314(ffffff04f728a040, 4cdb50)
  ffffff00207d2c20 txg_sync_thread+0x204(ffffff04edfe7a00)
  ffffff00207d2c30 thread_start+8()
> ffffff04f90e20a8::print zio_t io_state 
io_state = [ 0x1, 0x1 ]

> ::threadlist -v ! grep zio
    zio_wait+0x5b()
  PC: _resume_from_idle+0xf1    CMD: grep zio

# no other zio threads
# this zio is in io_stage == ZIO_STAGE_DONE, but still has a parent and io_executor
# zio_done would unparent the zio, set io_executor = NULL and ping the io_cv
#
# zio_execute in zio_wait should've called zio_done for zio->io_stage == ZIO_STAGE_DONE

# unless io_stage started out as ZIO_STAGE_DONE when zio_execute was called?
# zio_wait has an ASSERT(zio->io_stage == ZIO_STAGE_OPEN) but this is non-debug, so who knows?
# maybe zio_execute shouldn't be setting zio->io_executor = curthread if zio->io_stage is
# already ZIO_STAGE_DONE
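
# for context, zio_wait() looks roughly like this (paraphrased, not verbatim; this is a
# non-DEBUG kernel so the ASSERTs are compiled out):

int
zio_wait(zio_t *zio)
{
        int error;

        ASSERT(zio->io_stage == ZIO_STAGE_OPEN);
        ASSERT(zio->io_executor == NULL);

        zio->io_waiter = curthread;

        zio_execute(zio);       /* zio_done() should eventually clear io_executor
                                   and cv_broadcast(&zio->io_cv) */

        mutex_enter(&zio->io_lock);
        while (zio->io_executor != NULL)
                cv_wait(&zio->io_cv, &zio->io_lock);    /* <- txg_sync_thread is parked here */
        mutex_exit(&zio->io_lock);

        error = zio->io_error;
        zio_destroy(zio);

        return (error);
}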

> ffffff04f90e20a8::print zio_t io_reexecute
io_reexecute = 0x2

# so this has special treatment in zio_done
# #define ZIO_REEXECUTE_SUSPEND  0x02 
# zio_done sets zio->io_state[ZIO_WAIT_DONE] = 1 and returns ZIO_PIPELINE_STOP to zio_execute
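
# the reexecute handling at the end of zio_done() goes roughly like this (paraphrased to match
# the logic being discussed here, not verbatim):

        if (zio->io_reexecute) {
                ...
                zio->io_state[ZIO_WAIT_DONE] = 1;

                if ((pio = zio_unique_parent(zio)) != NULL) {
                        /*
                         * Not a root zio: just notify the parent, and don't
                         * propagate the error upward.
                         */
                        zio->io_flags |= ZIO_FLAG_DONT_PROPAGATE;
                        zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
                } else if (zio->io_reexecute & ZIO_REEXECUTE_SUSPEND) {
                        /*
                         * Root zio that would fail again if reexecuted now:
                         * suspend the pool and hang the zio off
                         * spa_suspend_zio_root until zio_resume() reexecutes it.
                         */
                        zio_suspend(spa, zio);
                } else {
                        ...     /* otherwise reexecution is handed off to a taskq */
                }
                return (ZIO_PIPELINE_STOP);
        }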

> ffffff04f90e20a8::print zio_t io_parent_list
io_parent_list = {
    io_parent_list.list_size = 0x30
    io_parent_list.list_offset = 0x10
    io_parent_list.list_head = {
        list_next = 0xffffff04fa6fbb48
        list_prev = 0xffffff04fa6fbb48
    }
}
> ffffff04f90e20a8::print zio_t io_walk_link  
io_walk_link = 0

# it would appear zio_unique_parent(zio) returned non-NULL so zio_notify_parent() would be
# called, rather than zio_suspend().
> ffffff04f90e20a8::print zio_t io_flags
io_flags = 0 (0)

# however, io_flags does not have ZIO_FLAG_DONT_PROPAGATE set, which zio_done should have set
# before calling zio_notify_parent
> ffffff04f90e20a8::zio -p                 
ADDRESS                                  TYPE  STAGE            WAITER

 ffffff04f8e3ce18                        NULL  OPEN             -
> ffffff04f8e3ce18::print zio_t io_reexecute
io_reexecute = 0
# zio_notify_parent would do pio->io_reexecute |= zio->io_reexecute, which evidently hasn't
# happened
# so, assuming zio_unique_parent failed and we called zio_suspend()...
> ::spa -v
ADDR                 STATE NAME                                                
ffffff04f728a040    ACTIVE azupool

    ADDR             STATE     AUX          DESCRIPTION                        
    ffffff04ee0120c0 HEALTHY   -            root
    ffffff04e97a8c80 HEALTHY   -              raidz
    ffffff04ec424080 HEALTHY   -                /dev/dsk/c2t610d0s0
    ffffff04e4dbc380 HEALTHY   -                /dev/dsk/c2t546d0s0
    ffffff04ea429940 HEALTHY   -                /dev/dsk/c2t514d0s0
    ffffff04f726b2c0 HEALTHY   -                /dev/dsk/c2t578d0s0
ffffff04e502c580    ACTIVE zones

    ffffff04ead53340 HEALTHY   -            root
    ffffff04ead526c0 HEALTHY   -              mirror
    ffffff04ead52d00 HEALTHY   -                /dev/dsk/c2t0d0s0
    ffffff04ead52080 HEALTHY   -                /dev/dsk/c2t1d0s0
> ffffff04f728a040::print spa_t spa_suspend_zio_root
spa_suspend_zio_root = 0xffffff04f8e3ce18
> ffffff04f728a040::print spa_t spa_suspended       
spa_suspended = 0x1
> ffffff04f90e20a8::zio -p
ADDRESS                                  TYPE  STAGE            WAITER

 ffffff04f8e3ce18  
# bingo
# I/O failed, we reparented the zio, but now what?
# zio_resume is only called via 'zpool clear' ioctl, but that needs the spa_namespace_lock
# which is being held by the thread waiting for the txg_sync_thread to complete, which will
# never happen since the spa is suspended
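
# to summarize the cycle, here's a toy user-space model (hypothetical names, NOT illumos
# code): the "async" thread holds the namespace lock while waiting for the "sync" thread,
# the "sync" thread waits for a suspended zio to be resumed, and the only resume path
# ("clear") needs the namespace lock. Built with cc deadlock.c -lpthread, it hangs forever,
# just like the hung zpool commands above.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t namespace_lock = PTHREAD_MUTEX_INITIALIZER;  /* "spa_namespace_lock" */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t txg_synced_cv = PTHREAD_COND_INITIALIZER;
static pthread_cond_t zio_resumed_cv = PTHREAD_COND_INITIALIZER;
static bool txg_synced, zio_resumed;

/* "spa_async_thread": holds the namespace lock across a wait on the sync thread */
static void *async_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&namespace_lock);
        printf("async: holding namespace_lock, waiting for txg sync\n");
        pthread_mutex_lock(&state_lock);
        while (!txg_synced)
                pthread_cond_wait(&txg_synced_cv, &state_lock);
        pthread_mutex_unlock(&state_lock);
        pthread_mutex_unlock(&namespace_lock);
        return (NULL);
}

/* "txg_sync_thread": can't finish the txg until the suspended zio is resumed */
static void *sync_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&state_lock);
        printf("sync: waiting for the suspended zio to be resumed\n");
        while (!zio_resumed)
                pthread_cond_wait(&zio_resumed_cv, &state_lock);
        txg_synced = true;
        pthread_cond_broadcast(&txg_synced_cv);
        pthread_mutex_unlock(&state_lock);
        return (NULL);
}

/* "zpool clear"/zio_resume: the only resume path, but it needs the namespace lock */
static void *clear_thread(void *arg)
{
        (void)arg;
        printf("clear: trying to take namespace_lock\n");
        pthread_mutex_lock(&namespace_lock);            /* blocks forever */
        pthread_mutex_lock(&state_lock);
        zio_resumed = true;
        pthread_cond_broadcast(&zio_resumed_cv);
        pthread_mutex_unlock(&state_lock);
        pthread_mutex_unlock(&namespace_lock);
        return (NULL);
}

int main(void)
{
        pthread_t a, s, c;

        pthread_create(&a, NULL, async_thread, NULL);
        sleep(1);       /* let the async thread win the namespace lock first */
        pthread_create(&s, NULL, sync_thread, NULL);
        pthread_create(&c, NULL, clear_thread, NULL);
        pthread_join(a, NULL);  /* never returns: three-way wait cycle */
        pthread_join(s, NULL);
        pthread_join(c, NULL);
        return (0);
}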

Actions #1

Updated by Albert Lee about 11 years ago

  • Subject changed from "clearing suspended pool results in spa_namespace_lock deadlock" to "pool can suspend with spa_namespace_lock held by spa_config_update"

Updated synopsis to mention spa_config_update. 6879915 was a related bug involving spa_vdev_setpath/spa_vdev_exit, fixed in onnv_129.

Actions #2

Updated by Albert Lee about 11 years ago

Looks like I have a preliminary fix:
[root@mameshiba ~]# zpool import -fFn -R /tmp/blah azupool

Would be able to return azupool to its state as of April 10, 2012 03:20:32 AM UTC.
cannot open 'azupool': pool I/O is currently suspended

dmesg:
2012-04-10T15:03:30.500279+00:00 mameshiba ahci: [ID 777486 kern.warning] WARNING: ahci0: ahci port 2 has interface fatal error
2012-04-10T15:03:30.500297+00:00 mameshiba ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 2 is trying to do error recovery
2012-04-10T15:03:30.500300+00:00 mameshiba ahci: [ID 551337 kern.warning] WARNING: ahci0: #011Recovered Data Integrity Error (I)#012#011Transient Data Integrity Error (T)#012#011Handshake Error (H)#012
2012-04-10T15:03:31.093320+00:00 mameshiba ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 2 succeed
2012-04-10T15:03:31.093335+00:00 mameshiba sata: [ID 801845 kern.info] /pci@0,0/pci1458,b005@1f,2:#012 SATA port 2:1 error#012
2012-04-10T15:03:31.093355+00:00 mameshiba sata: [ID 801845 kern.info] /pci@0,0/pci1458,b005@1f,2:#012 SATA port 2:2 error#012
2012-04-10T15:03:31.093374+00:00 mameshiba scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1458,b005@1f,2/disk@222,0 (sd9):#012#011Command failed to complete...Device is gone#012
2012-04-10T15:03:31.093438+00:00 mameshiba scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1458,b005@1f,2/disk@242,0 (sd10):#012#011Command failed to complete...Device is gone#012
2012-04-10T15:03:31.185661+00:00 mameshiba scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1458,b005@1f,2/disk@202,0 (sd8):#012#011Command failed to complete...Device is gone#012
2012-04-10T15:03:31.185805+00:00 mameshiba zfs: [ID 249136 kern.info] imported version 26 pool azupool using 28
2012-04-10T15:03:31.185825+00:00 mameshiba rsyslogd: Uncompression of a message failed with return code -3 - enable debug logging if you need further information. Message ignored.
2012-04-10T15:03:31.185944+00:00 mameshiba scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1458,b005@1f,2/disk@262,0 (sd11):#012#011Command failed to complete...Device is gone#012
2012-04-10T15:03:31.223367+00:00 mameshiba sata: [ID 511305 kern.warning] WARNING: Port multiplier is on the blacklist.

Actions #3

Updated by Albert Lee about 11 years ago

Strangely, -n is actually ignored - the pool was actually imported:
[root@mameshiba ~]# zpool status
pool: azupool
state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://illumos.org/msg/ZFS-8000-JQ
scan: scrub in progress since Mon Apr 9 22:28:56 2012
41.4M scanned out of 1.78T at 10.2K/s, (scan is slow, no estimated time)
8K repaired, 0.00% done
config:

NAME          STATE     READ WRITE CKSUM
azupool       ONLINE       0     0     6
  raidz1-0    ONLINE       0     0    18
    c2t610d0  ONLINE       0     6    10
    c2t546d0  ONLINE       0     3    12
    c2t514d0  ONLINE       0     3    10
    c2t578d0  ONLINE       0     3    10

errors: 3 data errors, use '-v' for a list

Actions #4

Updated by Albert Lee about 11 years ago

faulted again:
[root@mameshiba ~]# zpool status
pool: azupool
state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://illumos.org/msg/ZFS-8000-JQ
scan: scrub in progress since Mon Apr 9 22:28:56 2012
41.4M scanned out of 1.78T at 7.16K/s, (scan is slow, no estimated time)
8K repaired, 0.00% done
config:

NAME          STATE     READ WRITE CKSUM
azupool       UNAVAIL      0     0     0  insufficient replicas
  raidz1-0    UNAVAIL      0     0     0  insufficient replicas
    c2t610d0  FAULTED      0    18     0  too many errors
    c2t546d0  FAULTED      0    15     0  too many errors
    c2t514d0  FAULTED      0    12     0  too many errors
    c2t578d0  FAULTED      0     9     0  too many errors

errors: Permanent errors have been detected in the following files:

<metadata>:<0x0>
<metadata>:<0x6f>
<metadata>:<0x72>

zpool export is hanging now:
ffffff001f27ac40 fffffffffbc2eb00                0   0  99 ffffff04eddfcaca
  PC: _resume_from_idle+0xf1    THREAD: spa_async_thread()
  stack pointer for thread ffffff001f27ac40: ffffff001f27aa70
  [ ffffff001f27aa70 _resume_from_idle+0xf1() ]
    swtch+0x141()
    cv_wait+0x70()
    txg_wait_synced+0x8b()
    spa_config_update+0xe4()
    spa_config_update+0x123()
    spa_async_thread+0xcf()
    thread_start+8()

ffffff001f259c40 fffffffffbc2eb00                0   0  60 ffffff04f9fc7f20
  PC: _resume_from_idle+0xf1    THREAD: txg_sync_thread()
  stack pointer for thread ffffff001f259c40: ffffff001f259960
  [ ffffff001f259960 _resume_from_idle+0xf1() ]
    swtch+0x141()
    cv_wait+0x70()
    zio_wait+0x5e()
    dsl_pool_sync+0x277()
    spa_sync+0x308()
    txg_sync_thread+0x1e8()
    thread_start+8()

ffffff04e7411860 ffffff04edf24008 ffffff04fca7e380   2   1 ffffff04eddfcaca
  PC: _resume_from_idle+0xf1    CMD: zpool export azupool
  stack pointer for thread ffffff04e7411860: ffffff001ff129e0
  [ ffffff001ff129e0 _resume_from_idle+0xf1() ]
    swtch+0x141()
    cv_wait+0x70()
    txg_wait_synced+0x8b()
    dmu_tx_wait+0x50()
    dmu_tx_assign+0x47()
    spa_history_log+0x4f()
    zfs_log_history+0x82()
    zfs_ioc_pool_export+0x2e()
    zfsdev_ioctl+0x155()
    cdev_ioctl+0x39()
    spec_ioctl+0x60()
    fop_ioctl+0x55()
    ioctl+0x9b()
    _sys_sysenter_post_swapgs+0x149()

[root@mameshiba ~]# zpool clear azupool
cannot clear errors for azupool: I/O error
