Bug #2607
pool can suspend with spa_namespace_lock held by spa_config_update
Description
Some intermittent transport errors from my SATA port multiplier caused a pool suspension, which has resulted in a hung zpool import. It looks like attempting to clear (spa_resume) a pool that is in this state results in a deadlock.
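To make the shape of the hang concrete, here is a minimal userland sketch of the same cycle (plain pthreads with hypothetical names, not ZFS code): one thread holds the "namespace" lock while waiting for a sync that can never complete, and the thread that could resume I/O needs that same lock first. Built with cc -pthread, it hangs forever by design.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-ins for spa_namespace_lock and the txg sync cv. */
static pthread_mutex_t namespace_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sync_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t sync_cv = PTHREAD_COND_INITIALIZER;
static int txg_synced = 0;

/* Models spa_async_thread -> spa_config_update -> txg_wait_synced:
 * waits for the txg to sync with the namespace lock still held. */
static void *
async_thread(void *arg)
{
	pthread_mutex_lock(&namespace_lock);
	pthread_mutex_lock(&sync_lock);
	while (!txg_synced)	/* never signaled: the pool is suspended */
		pthread_cond_wait(&sync_cv, &sync_lock);
	pthread_mutex_unlock(&sync_lock);
	pthread_mutex_unlock(&namespace_lock);
	return (NULL);
}

/* Models the 'zpool clear' ioctl: it must take the namespace lock
 * before it can resume I/O, and resuming I/O is the only thing that
 * would let the sync (and thus async_thread) finish. */
static void *
clear_thread(void *arg)
{
	sleep(1);	/* let async_thread take the lock first */
	pthread_mutex_lock(&namespace_lock);	/* blocks forever */
	pthread_mutex_lock(&sync_lock);
	txg_synced = 1;
	pthread_cond_signal(&sync_cv);
	pthread_mutex_unlock(&sync_lock);
	pthread_mutex_unlock(&namespace_lock);
	return (NULL);
}

int
main(void)
{
	pthread_t a, c;
	(void) pthread_create(&a, NULL, async_thread, NULL);
	(void) pthread_create(&c, NULL, clear_thread, NULL);
	(void) pthread_join(a, NULL);	/* deadlock: never returns */
	(void) pthread_join(c, NULL);
	(void) puts("unreached");
	return (0);
}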
Analysis here:
http://pastie.org/3759669
ffffff001fcfcc40 fffffffffbc2eb00 0 0 99 ffffff04edfe7bca
  PC: _resume_from_idle+0xf1    THREAD: spa_async_thread()
  stack pointer for thread ffffff001fcfcc40: ffffff001fcfca40
  [ ffffff001fcfca40 _resume_from_idle+0xf1() ]
    swtch+0x141()
    cv_wait+0x70()
    txg_wait_synced+0x83()
    spa_config_update+0xdc()
    spa_config_update+0xfa()
    spa_async_thread+0xe1()
    thread_start+8()

ffffff00207d2c40 fffffffffbc2eb00 0 0 60 ffffff04f90e23b0
  PC: _resume_from_idle+0xf1    THREAD: txg_sync_thread()
  stack pointer for thread ffffff00207d2c40: ffffff00207d2960
  [ ffffff00207d2960 _resume_from_idle+0xf1() ]
    swtch+0x141()
    cv_wait+0x70()
    zio_wait+0x5b()
    dsl_pool_sync+0x26e()
    spa_sync+0x314()
    txg_sync_thread+0x204()
    thread_start+8()

ffffff04ec4e28c0 ffffff04ed7e4078 ffffff04ed226780 2 59 ffffffffc00378b0
  PC: _resume_from_idle+0xf1    CMD: zpool import -fFn -R /tmp/azupool azupool
  stack pointer for thread ffffff04ec4e28c0: ffffff00204a09a0
  [ ffffff00204a09a0 _resume_from_idle+0xf1() ]
    swtch+0x141()
    turnstile_block+0x262()
    mutex_vector_enter+0x3c5()
    spa_open_common+0x79()
    spa_open+0x1e()
    zfs_log_history+0x37()
    zfsdev_ioctl+0x168()
    cdev_ioctl+0x39()
    spec_ioctl+0x60()
    fop_ioctl+0x55()
    ioctl+0x9b()
    _sys_sysenter_post_swapgs+0x149()

> ffffff04ec4e28c0::findstack -v
stack pointer for thread ffffff04ec4e28c0: ffffff00204a09a0
[ ffffff00204a09a0 _resume_from_idle+0xf1() ]
  ffffff00204a09d0 swtch+0x141()
  ffffff00204a0a80 turnstile_block+0x262(ffffff04e764c128, 0, ffffffffc00378b0, fffffffffbc079c0, 0, 0)
  ffffff00204a0af0 mutex_vector_enter+0x3c5(ffffffffc00378b0)
  ffffff00204a0b90 spa_open_common+0x79()
  ffffff00204a0bc0 spa_open+0x1e()
  ffffff00204a0c00 zfs_log_history+0x37(ffffff04edd0d000)
  ffffff00204a0c80 zfsdev_ioctl+0x168(5a00000000, 5a02, 80425a0, 100003, ffffff04ec4ccd10, ffffff00204a0e68)
  ffffff00204a0cc0 cdev_ioctl+0x39()
  ffffff00204a0d10 spec_ioctl+0x60()
  ffffff00204a0da0 fop_ioctl+0x55(ffffff04e4d6ec80, 5a02, 80425a0, 100003, ffffff04ec4ccd10, ffffff00204a0e68, 0)
  ffffff00204a0ec0 ioctl+0x9b(3, 5a02, 80425a0)
  ffffff00204a0f10 _sys_sysenter_post_swapgs+0x149()

# someone's holding spa_namespace_lock
> ffffffffc00378b0::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
ffffffffc00378b0 adapt ffffff001fcfcc40      -      -     yes

# this guy is, and he's also waiting for the txg_sync_thread:
> ffffff001fcfcc40::findstack -v
stack pointer for thread ffffff001fcfcc40: ffffff001fcfca40
[ ffffff001fcfca40 _resume_from_idle+0xf1() ]
  ffffff001fcfca70 swtch+0x141()
  ffffff001fcfcab0 cv_wait+0x70(ffffff04edfe7bca, ffffff04edfe7b90)
  ffffff001fcfcb00 txg_wait_synced+0x83(ffffff04edfe7a00, 4cdb50)
  ffffff001fcfcb60 spa_config_update+0xdc()
  ffffff001fcfcbc0 spa_config_update+0xfa()
  ffffff001fcfcc20 spa_async_thread+0xe1(ffffff04f728a040)
  ffffff001fcfcc30 thread_start+8()

> ffffff04edfe7a00::print dsl_pool_t
{
    dp_spa = 0xffffff04f728a040
    dp_meta_objset = 0xffffff04ea414cc0
    dp_root_dir = 0xffffff04ede10400
    dp_mos_dir = 0xffffff04e7b55200
    dp_free_dir = 0xffffff04f245c200
    dp_origin_snap = 0xffffff04edf5a7c0
    dp_root_dir_obj = 0x2
    dp_vnrele_taskq = 0xffffff04ea7d2a18
    dp_meta_rootbp = {
        blk_dva = [
            { dva_word = [ 0, 0 ] },
            { dva_word = [ 0, 0 ] },
        ]
        blk_prop = 0x800b070300000003

> ::spa -v
ADDR                 STATE NAME
ffffff04f728a040    ACTIVE azupool

    ADDR             STATE     AUX          DESCRIPTION
    ffffff04ee0120c0 HEALTHY   -            root
    ffffff04e97a8c80 HEALTHY   -              raidz
    ffffff04ec424080 HEALTHY   -                /dev/dsk/c2t610d0s0
    ffffff04e4dbc380 HEALTHY   -                /dev/dsk/c2t546d0s0
    ffffff04ea429940 HEALTHY   -                /dev/dsk/c2t514d0s0
    ffffff04f726b2c0 HEALTHY   -                /dev/dsk/c2t578d0s0
ffffff04e502c580    ACTIVE zones
    ffffff04ead53340 HEALTHY   -            root
    ffffff04ead526c0 HEALTHY   -              mirror
    ffffff04ead52d00 HEALTHY   -                /dev/dsk/c2t0d0s0
    ffffff04ead52080 HEALTHY   -                /dev/dsk/c2t1d0s0

> ffffff00207d2c40::findstack -v
stack pointer for thread ffffff00207d2c40: ffffff00207d2960
[ ffffff00207d2960 _resume_from_idle+0xf1() ]
  ffffff00207d2990 swtch+0x141()
  ffffff00207d29d0 cv_wait+0x70(ffffff04f90e23b0, ffffff04f90e23a8)
  ffffff00207d2a10 zio_wait+0x5b(ffffff04f90e20a8)
  ffffff00207d2aa0 dsl_pool_sync+0x26e()
  ffffff00207d2b70 spa_sync+0x314(ffffff04f728a040, 4cdb50)
  ffffff00207d2c20 txg_sync_thread+0x204(ffffff04edfe7a00)
  ffffff00207d2c30 thread_start+8()

> ffffff04f90e20a8::zio
ADDRESS          TYPE  STAGE  WAITER
ffffff04f90e20a8 NULL  DONE   ffffff00207d2c40

> ffffff04f90e20a8::zio -p
ADDRESS          TYPE  STAGE  WAITER
ffffff04f8e3ce18 NULL  OPEN   -

> ffffff04f90e20a8::zio -r
ADDRESS          TYPE  STAGE  WAITER
ffffff04f90e20a8 NULL  DONE   ffffff00207d2c40
 ffffff04f9130df0 WRITE DONE  -
 ffffff04f8db6e20 WRITE DONE  -
 ffffff04f8db8e50 WRITE DONE  -

> ffffff04f90e20a8::print zio_t io_executor
io_executor = 0xffffff0020158c40
> 0xffffff0020158c40::findstack -v
stack pointer for thread ffffff0020158c40: ffffff0020158990
[ ffffff0020158990 _resume_from_idle+0xf1() ]
  ffffff00201589c0 swtch+0x141()
  ffffff0020158a00 cv_wait+0x70(ffffff04f7a8d600, ffffff04f7a8d5f0)
  ffffff0020158a70 taskq_thread_wait+0xbe(ffffff04f7a8d5d0, ffffff04f7a8d5f0, ffffff04f7a8d600, ffffff0020158ad0, ffffffffffffffff)
  ffffff0020158b30 taskq_thread+0x37c(ffffff04f7a8d5d0)
  ffffff0020158b40 thread_start+8()

> ffffff04f90e20a8::print zio_t io_stage
io_stage = 0x100000 (ZIO_STAGE_DONE)
> ffffff04f90e20a8::print zio_t io_waiter
io_waiter = 0xffffff00207d2c40
> 0xffffff00207d2c40::findstack -v
stack pointer for thread ffffff00207d2c40: ffffff00207d2960
[ ffffff00207d2960 _resume_from_idle+0xf1() ]
  ffffff00207d2990 swtch+0x141()
  ffffff00207d29d0 cv_wait+0x70(ffffff04f90e23b0, ffffff04f90e23a8)
  ffffff00207d2a10 zio_wait+0x5b(ffffff04f90e20a8)
  ffffff00207d2aa0 dsl_pool_sync+0x26e()
  ffffff00207d2b70 spa_sync+0x314(ffffff04f728a040, 4cdb50)
  ffffff00207d2c20 txg_sync_thread+0x204(ffffff04edfe7a00)
  ffffff00207d2c30 thread_start+8()

> ffffff04f90e20a8::print zio_t io_state
io_state = [ 0x1, 0x1 ]
> ::threadlist -v ! grep zio
    zio_wait+0x5b()
  PC: _resume_from_idle+0xf1    CMD: grep zio
# no other zio threads

# this zio is in io_stage == ZIO_STAGE_DONE, but still has a parent and io_executor
# zio_done would unparent the zio, set io_executor = NULL and ping the io_cv
#
# zio_execute in zio_wait should've called zio_done for zio->io_stage == ZIO_STAGE_DONE
# unless io_stage started out as ZIO_STAGE_DONE when zio_execute was called?
# zio_wait has an ASSERT(zio->io_stage == ZIO_STAGE_OPEN) but this is non-debug, so who knows?
# maybe zio_execute shouldn't be setting zio->io_executor = curthread if zio->io_stage is
# already ZIO_STAGE_DONE

> ffffff04f90e20a8::print zio_t io_reexecute
io_reexecute = 0x2
# so this has special treatment in zio_done
# #define ZIO_REEXECUTE_SUSPEND 0x02
# zio_done sets zio->io_state[ZIO_WAIT_DONE] = 1 and returns ZIO_PIPELINE_STOP to zio_execute

> ffffff04f90e20a8::print zio_t io_parent_list
io_parent_list = {
    io_parent_list.list_size = 0x30
    io_parent_list.list_offset = 0x10
    io_parent_list.list_head = {
        list_next = 0xffffff04fa6fbb48
        list_prev = 0xffffff04fa6fbb48
    }
}
> ffffff04f90e20a8::print zio_t io_walk_link
io_walk_link = 0
# it would appear zio_unique_parent(zio) returned non-NULL, so zio_notify_parent() would be
# called rather than zio_suspend().

> ffffff04f90e20a8::print zio_t io_flags
io_flags = 0 (0)
# however, io_flags does not have ZIO_FLAG_DONT_PROPAGATE set, as it should have been
# set by zio_done before calling zio_notify_parent
> ffffff04f90e20a8::zio -p
ADDRESS          TYPE  STAGE  WAITER
ffffff04f8e3ce18 NULL  OPEN   -
> ffffff04f8e3ce18::print zio_t io_reexecute
io_reexecute = 0
# zio_notify_parent would do pio->io_reexecute |= zio->io_reexecute, which evidently hasn't
# happened
# so, assuming zio_unique_parent failed and we called zio_suspend()...
> ::spa -v
ADDR                 STATE NAME
ffffff04f728a040    ACTIVE azupool

    ADDR             STATE     AUX          DESCRIPTION
    ffffff04ee0120c0 HEALTHY   -            root
    ffffff04e97a8c80 HEALTHY   -              raidz
    ffffff04ec424080 HEALTHY   -                /dev/dsk/c2t610d0s0
    ffffff04e4dbc380 HEALTHY   -                /dev/dsk/c2t546d0s0
    ffffff04ea429940 HEALTHY   -                /dev/dsk/c2t514d0s0
    ffffff04f726b2c0 HEALTHY   -                /dev/dsk/c2t578d0s0
ffffff04e502c580    ACTIVE zones
    ffffff04ead53340 HEALTHY   -            root
    ffffff04ead526c0 HEALTHY   -              mirror
    ffffff04ead52d00 HEALTHY   -                /dev/dsk/c2t0d0s0
    ffffff04ead52080 HEALTHY   -                /dev/dsk/c2t1d0s0

> ffffff04f728a040::print spa_t spa_suspend_zio_root
spa_suspend_zio_root = 0xffffff04f8e3ce18
> ffffff04f728a040::print spa_t spa_suspended
spa_suspended = 0x1
> ffffff04f90e20a8::zio -p
ADDRESS          TYPE  STAGE  WAITER
ffffff04f8e3ce18
# bingo
# I/O failed, we reparented the zio, but now what?
# zio_resume is only called via the 'zpool clear' ioctl, but that needs the spa_namespace_lock,
# which is being held by the thread waiting for the txg_sync_thread to complete, which will
# never happen since the spa is suspended
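For reference, the reexecute handling at the tail of zio_done() looked roughly like the sketch below. This is a paraphrase from memory of the illumos sources of that era, simplified and not compilable as-is; the field, flag, and function names are real. It shows the two paths the analysis above distinguishes: notify a unique parent (after setting ZIO_FLAG_DONT_PROPAGATE), or, for a root zio with ZIO_REEXECUTE_SUSPEND, call zio_suspend(), which reparents the zio under spa_suspend_zio_root.

/*
 * Paraphrased sketch of the reexecute handling in zio_done() (circa
 * 2012), reconstructed from memory of the illumos sources; simplified,
 * not verbatim:
 */
static int
zio_done(zio_t *zio)
{
	/* ... error aggregation etc. elided ... */

	if (zio->io_reexecute) {
		zio_t *pio;

		mutex_enter(&zio->io_lock);
		zio->io_state[ZIO_WAIT_DONE] = 1;	/* matches io_state = [ 0x1, 0x1 ] */
		mutex_exit(&zio->io_lock);

		if ((pio = zio_unique_parent(zio)) != NULL) {
			/*
			 * Not a root zio: notify the parent, but don't
			 * let the error propagate permanently.
			 */
			zio->io_flags |= ZIO_FLAG_DONT_PROPAGATE;
			zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
		} else if (zio->io_reexecute & ZIO_REEXECUTE_SUSPEND) {
			/*
			 * Root zio that would just fail again: suspend the
			 * pool.  zio_suspend() reparents the zio under
			 * spa->spa_suspend_zio_root, the parent seen above.
			 */
			zio_suspend(zio->io_spa, zio);
		}
		return (ZIO_PIPELINE_STOP);
	}

	/* ... normal completion: unparent, io_executor = NULL, broadcast io_cv ... */
	return (ZIO_PIPELINE_STOP);
}

Since io_flags is 0 and the parent's io_reexecute is 0 in the dump, the notify path evidently did not run; spa_suspended = 0x1 and spa_suspend_zio_root matching the observed parent confirm the zio_suspend() path was taken.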
Updated by Albert Lee about 11 years ago
- Subject changed from "clearing suspended pool results in spa_namespace_lock deadlock" to "pool can suspend with spa_namespace_lock held by spa_config_update"
Updated synopsis to mention spa_config_update. 6879915 was a related bug involving spa_vdev_setpath/spa_vdev_exit, fixed in onnv_129.
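For context, spa_config_update() at the time had roughly the following shape; this is a paraphrase from memory of the illumos sources, simplified and not verbatim, though the identifiers (spa_config_enter, spa_last_synced_txg, txg_wait_synced, SPA_CONFIG_UPDATE_POOL/VDEVS) are real. It accounts for both the two stacked spa_config_update frames and the txg_wait_synced() wait in the stack above, all with spa_namespace_lock held by the caller:

/*
 * Paraphrased sketch of spa_config_update() (circa 2012), simplified,
 * not verbatim:
 */
void
spa_config_update(spa_t *spa, int what)
{
	uint64_t txg;

	ASSERT(MUTEX_HELD(&spa_namespace_lock));	/* held throughout */

	spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
	txg = spa_last_synced_txg(spa) + 1;
	/* ... dirty the root vdev config / grow top-level vdevs ... */
	spa_config_exit(spa, SCL_ALL, FTAG);

	/*
	 * Wait for the new config to be synced out.  If the pool suspends
	 * before this txg syncs, we sleep here forever while still holding
	 * spa_namespace_lock, and 'zpool clear' can never get in.
	 */
	txg_wait_synced(spa->spa_dsl_pool, txg);

	/* ... update the on-disk config cache ... */

	if (what == SPA_CONFIG_UPDATE_POOL)
		spa_config_update(spa, SPA_CONFIG_UPDATE_VDEVS);	/* 2nd frame */
}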
Updated by Albert Lee about 11 years ago
Looks like I have a preliminary fix:
[root@mameshiba ~]# zpool import -fFn -R /tmp/blah azupool
Would be able to return azupool to its state as of April 10, 2012 03:20:32 AM UTC.
cannot open 'azupool': pool I/O is currently suspended
dmesg:
2012-04-10T15:03:30.500279+00:00 mameshiba ahci: [ID 777486 kern.warning] WARNING: ahci0: ahci port 2 has interface fatal error
2012-04-10T15:03:30.500297+00:00 mameshiba ahci: [ID 687168 kern.warning] WARNING: ahci0: ahci port 2 is trying to do error recovery
2012-04-10T15:03:30.500300+00:00 mameshiba ahci: [ID 551337 kern.warning] WARNING: ahci0: #011Recovered Data Integrity Error (I)#012#011Transient Data Integrity Error (T)#012#011Handshake Error (H)#012
2012-04-10T15:03:31.093320+00:00 mameshiba ahci: [ID 657156 kern.warning] WARNING: ahci0: error recovery for port 2 succeed
2012-04-10T15:03:31.093335+00:00 mameshiba sata: [ID 801845 kern.info] /pci@0,0/pci1458,b005@1f,2:#012 SATA port 2:1 error#012
2012-04-10T15:03:31.093355+00:00 mameshiba sata: [ID 801845 kern.info] /pci@0,0/pci1458,b005@1f,2:#012 SATA port 2:2 error#012
2012-04-10T15:03:31.093374+00:00 mameshiba scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1458,b005@1f,2/disk@222,0 (sd9):#012#011Command failed to complete...Device is gone#012
2012-04-10T15:03:31.093438+00:00 mameshiba scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1458,b005@1f,2/disk@242,0 (sd10):#012#011Command failed to complete...Device is gone#012
2012-04-10T15:03:31.185661+00:00 mameshiba scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1458,b005@1f,2/disk@202,0 (sd8):#012#011Command failed to complete...Device is gone#012
2012-04-10T15:03:31.185805+00:00 mameshiba zfs: [ID 249136 kern.info] imported version 26 pool azupool using 28
2012-04-10T15:03:31.185825+00:00 mameshiba rsyslogd: Uncompression of a message failed with return code -3 - enable debug logging if you need further information. Message ignored.
2012-04-10T15:03:31.185944+00:00 mameshiba scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1458,b005@1f,2/disk@262,0 (sd11):#012#011Command failed to complete...Device is gone#012
2012-04-10T15:03:31.223367+00:00 mameshiba sata: [ID 511305 kern.warning] WARNING: Port multiplier is on the blacklist.
Updated by Albert Lee about 11 years ago
Strangely, the -n flag is actually ignored; the pool was actually imported:
[root@mameshiba ~]# zpool status
  pool: azupool
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: scrub in progress since Mon Apr 9 22:28:56 2012
    41.4M scanned out of 1.78T at 10.2K/s, (scan is slow, no estimated time)
    8K repaired, 0.00% done
config:

        NAME          STATE     READ WRITE CKSUM
        azupool       ONLINE       0     0     6
          raidz1-0    ONLINE       0     0    18
            c2t610d0  ONLINE       0     6    10
            c2t546d0  ONLINE       0     3    12
            c2t514d0  ONLINE       0     3    10
            c2t578d0  ONLINE       0     3    10

errors: 3 data errors, use '-v' for a list
Updated by Albert Lee about 11 years ago
faulted again:
[root@mameshiba ~]# zpool status
  pool: azupool
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://illumos.org/msg/ZFS-8000-JQ
  scan: scrub in progress since Mon Apr 9 22:28:56 2012
    41.4M scanned out of 1.78T at 7.16K/s, (scan is slow, no estimated time)
    8K repaired, 0.00% done
config:

        NAME          STATE     READ WRITE CKSUM
        azupool       UNAVAIL      0     0     0  insufficient replicas
          raidz1-0    UNAVAIL      0     0     0  insufficient replicas
            c2t610d0  FAULTED      0    18     0  too many errors
            c2t546d0  FAULTED      0    15     0  too many errors
            c2t514d0  FAULTED      0    12     0  too many errors
            c2t578d0  FAULTED      0     9     0  too many errors

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x6f>
        <metadata>:<0x72>
zpool export is hanging now: zfs_log_history needs to assign and sync a txg, which can never complete while the pool is suspended:
ffffff001f27ac40 fffffffffbc2eb00 0 0 99 ffffff04eddfcaca
PC: _resume_from_idle+0xf1 THREAD: spa_async_thread()
stack pointer for thread ffffff001f27ac40: ffffff001f27aa70
[ ffffff001f27aa70 _resume_from_idle+0xf1() ]
swtch+0x141()
cv_wait+0x70()
txg_wait_synced+0x8b()
spa_config_update+0xe4()
spa_config_update+0x123()
spa_async_thread+0xcf()
thread_start+8()
ffffff001f259c40 fffffffffbc2eb00 0 0 60 ffffff04f9fc7f20
PC: _resume_from_idle+0xf1 THREAD: txg_sync_thread()
stack pointer for thread ffffff001f259c40: ffffff001f259960
[ ffffff001f259960 _resume_from_idle+0xf1() ]
swtch+0x141()
cv_wait+0x70()
zio_wait+0x5e()
dsl_pool_sync+0x277()
spa_sync+0x308()
txg_sync_thread+0x1e8()
thread_start+8()
ffffff04e7411860 ffffff04edf24008 ffffff04fca7e380 2 1 ffffff04eddfcaca
PC: _resume_from_idle+0xf1 CMD: zpool export azupool
stack pointer for thread ffffff04e7411860: ffffff001ff129e0
[ ffffff001ff129e0 _resume_from_idle+0xf1() ]
swtch+0x141()
cv_wait+0x70()
txg_wait_synced+0x8b()
dmu_tx_wait+0x50()
dmu_tx_assign+0x47()
spa_history_log+0x4f()
zfs_log_history+0x82()
zfs_ioc_pool_export+0x2e()
zfsdev_ioctl+0x155()
cdev_ioctl+0x39()
spec_ioctl+0x60()
fop_ioctl+0x55()
ioctl+0x9b()
_sys_sysenter_post_swapgs+0x149()
[root@mameshiba ~]# zpool clear azupool
cannot clear errors for azupool: I/O error