Bug #3994
closeduscsi_timeout of 0 sent to ses target driver can hang processes in biowait with a cmd stuck in HBA.
100%
Description
We encountered this at Nexenta with a user level command (sesctl) that sends a uscsi command with a timeout of zero to the ses target driver. The sesctl command hangs forever in biowait() and the command never expires from the mptsas HBA driver. Since sesctl never returns, any processes or users waiting for the results also wait forever.
Updated by Kevin Crowe almost 10 years ago
In rare cases that the command sent to a JBOD doesn't return you can end up with a stuck process & command that won't expire from mptsas. This is how it happens:
Looking at the process it's in biowait() after having sent the command:
> 0xffffff093948f380::findstack -v stack pointer for thread ffffff093948f380: ffffff003ead28a0 [ ffffff003ead28a0 _resume_from_idle+0xf1() ] ffffff003ead28d0 swtch+0x145() ffffff003ead2910 sema_p+0x1d9(ffffff090f5ccf00) ffffff003ead2940 biowait+0x76(ffffff090f5cce40) ffffff003ead2a40 default_physio+0x3d3(fffffffff84f9af0, ffffff090f5cce40, ffffffffffffffff, 40, fffffffff79296f0, ffffff003ead2aa0) ffffff003ead2a80 physio+0x25(fffffffff84f9af0, ffffff090f5cce40, ffffffffffffffff, 40, fffffffff79296f0, ffffff003ead2aa0) ffffff003ead2b40 scsi_uscsi_handle_cmd+0x21d(ffffffffffffffff, 0, ffffff093c937250, fffffffff84f9af0, ffffff090f5cce40 , 0) ffffff003ead2ba0 ses_uscsi_cmd+0x14f(ffffff093a5a3400, 80479a0, 100001) ffffff003ead2c30 ses_ioctl+0x95(1390000001b, 4c9, 80479a0, 100001, ffffff093613b918, ffffff003ead2d54) ffffff003ead2c70 cdev_ioctl+0x45(1390000001b, 4c9, 80479a0, 100001, ffffff093613b918, ffffff003ead2d54) ffffff003ead2cb0 spec_ioctl+0x5a(ffffff093873aa80, 4c9, 80479a0, 100001, ffffff093613b918, ffffff003ead2d54, 0) ffffff003ead2d30 fop_ioctl+0x7b(ffffff093873aa80, 4c9, 80479a0, 100001, ffffff093613b918, ffffff003ead2d54, 0) ffffff003ead2e30 ioctl+0x18e(4, 4c9, 80479a0) ffffff003ead2ec0 dtrace_systrace_syscall32+0x11a(4, 4c9, 80479a0, 1c3, 1, 0) ffffff003ead2f10 _sys_sysenter_post_swapgs+0x149() >
Dumping the mptsas data you find a command that never times out:
> ::mptsas mptsas_t inst ncmds suspend power ================================================================================ ffffff090e86a000 2 0 0 ON=D0 ffffff09393c1000 14 1 0 ON=D0 ffffff08fe489000 18 0 0 ON=D0 >
Because m_timebase = 0 in the mptsas_target_t structure it means we don't meet the 'if' condition in the bottom half of mptsas_watchsubr() and thus we never have a chance to hit mptsas_cmd_timeout() to timeout the command. You can't kill something in biowait, (at least not safely that I'm aware of) but if you poke a 0xa into m_timebase you can get that command to timeout.
Working backwards in time/software we find that the zero timeout got there from pkt_time in scsi_pkt which is set to zero when it's passed to mptsas_start_cmd. Working back further still we know it was sesctl that started all this, the top half of the ses driver stack looks like this:
ses`ses_start+0x30 genunix`default_physio+0x3cb genunix`physio+0x25 scsi`scsi_uscsi_handle_cmd+0x21d ses`ses_uscsi_cmd+0x14f ses`ses_ioctl+0x95 genunix`cdev_ioctl+0x45 specfs`spec_ioctl+0x5a genunix`fop_ioctl+0x7b genunix`ioctl+0x18e genunix`dtrace_systrace_syscall32+0x11a unix`_sys_sysenter_post_swapgs+0x149
and the from ses_start we proceed as follows:
8 -> ses_start 8 -> ses_get_pkt 8 -> scsi_init_pkt 8 -> mptsas_scsi_init_pkt 8 -> kmem_cache_alloc 8 <- kmem_cache_alloc 8 -> scsi_pkt_size 8 <- scsi_pkt_size ...which eventually gets to mptsas_accept_pkt, mptsas_save_cmd, mptsas_start_cmd with the zero timeout.
Other target drivers take care of timeouts like this:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/scsi/targets/sgen.c#1498
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/scsi/targets/sd.c#14234
If you look at how the ses target driver sets timeouts:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/scsi/targets/ses.c#1074
scroll to the bottom of the function and you find:
1106 pkt->pkt_time = scmd->uscsi_timeout;
So it blindly accepts a zero passed down. To fix this in ses it's just a matter of checking for a zero timeout and setting it to a default value.
diff --git a/usr/src/uts/common/io/scsi/targets/ses.c b/usr/src/uts/common/io/scsi/targets/ses.c index cc22429..dc6267e 100644 --- a/usr/src/uts/common/io/scsi/targets/ses.c +++ b/usr/src/uts/common/io/scsi/targets/ses.c @@ -24,6 +24,7 @@ * Copyright 2009 Sun Microsystems, Inc. All rights reserved. * Use is subject to license terms. * Copyright (c) 2011 Bayard G. Bell. All rights reserved. + * Copyright 2013 Nexenta Systems, Inc. All rights reserved. */ #include <sys/modctl.h> @@ -1103,7 +1104,10 @@ ses_get_pkt(struct buf *bp, int (*func)()) if (pkt == (struct scsi_pkt *)NULL) return; bcopy(scmd->uscsi_cdb, pkt->pkt_cdbp, (size_t)scmd->uscsi_cdblen); - pkt->pkt_time = scmd->uscsi_timeout; + + /* Set an upper bound timeout of ses_io_time if zero is passed in */ + pkt->pkt_time = (scmd->uscsi_timeout == 0) ? + ses_io_time : scmd->uscsi_timeout; pkt->pkt_comp = ses_callback; pkt->pkt_private = (opaque_t)ssc;
Updated by Dan McDonald almost 10 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
commit ffc5bd0fe44a6f6529e738f1c202f788b1355046
Author: Kevin Crowe <kevin.crowe@nexenta.com>
Date: Wed Aug 7 13:34:06 2013 -0400
3994 uscsi_timeout of 0 sent to ses target driver can hang processes in biow
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Jason King <jason.brian.king@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>