Project

General

Profile

Bug #3994

uscsi_timeout of 0 sent to ses target driver can hang processes in biowait with a cmd stuck in HBA.

Added by Kevin Crowe over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Kevin Crowe
Category:
driver - device drivers
Start date:
2013-08-05
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

We encountered this at Nexenta with a user level command (sesctl) that sends a uscsi command with a timeout of zero to the ses target driver. The sesctl command hangs forever in biowait() and the command never expires from the mptsas HBA driver. Since sesctl never returns, any processes or users waiting for the results also wait forever.

History

#1

Updated by Kevin Crowe over 6 years ago

In rare cases that the command sent to a JBOD doesn't return you can end up with a stuck process & command that won't expire from mptsas. This is how it happens:

Looking at the process it's in biowait() after having sent the command:

> 0xffffff093948f380::findstack -v
stack pointer for thread ffffff093948f380: ffffff003ead28a0
[ ffffff003ead28a0 _resume_from_idle+0xf1() ]
  ffffff003ead28d0 swtch+0x145()
  ffffff003ead2910 sema_p+0x1d9(ffffff090f5ccf00)
  ffffff003ead2940 biowait+0x76(ffffff090f5cce40)
  ffffff003ead2a40 default_physio+0x3d3(fffffffff84f9af0, ffffff090f5cce40, ffffffffffffffff, 40, fffffffff79296f0,
  ffffff003ead2aa0)
  ffffff003ead2a80 physio+0x25(fffffffff84f9af0, ffffff090f5cce40, ffffffffffffffff, 40, fffffffff79296f0,
  ffffff003ead2aa0)
  ffffff003ead2b40 scsi_uscsi_handle_cmd+0x21d(ffffffffffffffff, 0, ffffff093c937250, fffffffff84f9af0, ffffff090f5cce40
  , 0)
  ffffff003ead2ba0 ses_uscsi_cmd+0x14f(ffffff093a5a3400, 80479a0, 100001)
  ffffff003ead2c30 ses_ioctl+0x95(1390000001b, 4c9, 80479a0, 100001, ffffff093613b918, ffffff003ead2d54)
  ffffff003ead2c70 cdev_ioctl+0x45(1390000001b, 4c9, 80479a0, 100001, ffffff093613b918, ffffff003ead2d54)
  ffffff003ead2cb0 spec_ioctl+0x5a(ffffff093873aa80, 4c9, 80479a0, 100001, ffffff093613b918, ffffff003ead2d54, 0)
  ffffff003ead2d30 fop_ioctl+0x7b(ffffff093873aa80, 4c9, 80479a0, 100001, ffffff093613b918, ffffff003ead2d54, 0)
  ffffff003ead2e30 ioctl+0x18e(4, 4c9, 80479a0)
  ffffff003ead2ec0 dtrace_systrace_syscall32+0x11a(4, 4c9, 80479a0, 1c3, 1, 0)
  ffffff003ead2f10 _sys_sysenter_post_swapgs+0x149()
>

Dumping the mptsas data you find a command that never times out:

> ::mptsas
        mptsas_t inst ncmds suspend  power
================================================================================
ffffff090e86a000    2     0       0 ON=D0
ffffff09393c1000   14     1       0 ON=D0
ffffff08fe489000   18     0       0 ON=D0
>

Because m_timebase = 0 in the mptsas_target_t structure it means we don't meet the 'if' condition in the bottom half of mptsas_watchsubr() and thus we never have a chance to hit mptsas_cmd_timeout() to timeout the command. You can't kill something in biowait, (at least not safely that I'm aware of) but if you poke a 0xa into m_timebase you can get that command to timeout.

Working backwards in time/software we find that the zero timeout got there from pkt_time in scsi_pkt which is set to zero when it's passed to mptsas_start_cmd. Working back further still we know it was sesctl that started all this, the top half of the ses driver stack looks like this:

              ses`ses_start+0x30
              genunix`default_physio+0x3cb
              genunix`physio+0x25
              scsi`scsi_uscsi_handle_cmd+0x21d
              ses`ses_uscsi_cmd+0x14f
              ses`ses_ioctl+0x95
              genunix`cdev_ioctl+0x45
              specfs`spec_ioctl+0x5a
              genunix`fop_ioctl+0x7b
              genunix`ioctl+0x18e
              genunix`dtrace_systrace_syscall32+0x11a
              unix`_sys_sysenter_post_swapgs+0x149

and the from ses_start we proceed as follows:

  8        -> ses_start
  8          -> ses_get_pkt
  8            -> scsi_init_pkt
  8              -> mptsas_scsi_init_pkt
  8                -> kmem_cache_alloc
  8                <- kmem_cache_alloc
  8                -> scsi_pkt_size
  8                <- scsi_pkt_size

...which eventually gets to mptsas_accept_pkt, mptsas_save_cmd, mptsas_start_cmd with the zero timeout.

Other target drivers take care of timeouts like this:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/scsi/targets/sgen.c#1498
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/scsi/targets/sd.c#14234

If you look at how the ses target driver sets timeouts:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/io/scsi/targets/ses.c#1074
scroll to the bottom of the function and you find:
1106 pkt->pkt_time = scmd->uscsi_timeout;

So it blindly accepts a zero passed down. To fix this in ses it's just a matter of checking for a zero timeout and setting it to a default value.

diff --git a/usr/src/uts/common/io/scsi/targets/ses.c b/usr/src/uts/common/io/scsi/targets/ses.c
index cc22429..dc6267e 100644
--- a/usr/src/uts/common/io/scsi/targets/ses.c
+++ b/usr/src/uts/common/io/scsi/targets/ses.c
@@ -24,6 +24,7 @@
  * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
  * Copyright (c) 2011 Bayard G. Bell. All rights reserved.
+ * Copyright 2013 Nexenta Systems, Inc.  All rights reserved.
  */

 #include <sys/modctl.h>
@@ -1103,7 +1104,10 @@ ses_get_pkt(struct buf *bp, int (*func)())
        if (pkt == (struct scsi_pkt *)NULL)
                return;
        bcopy(scmd->uscsi_cdb, pkt->pkt_cdbp, (size_t)scmd->uscsi_cdblen);
-       pkt->pkt_time = scmd->uscsi_timeout;
+
+       /* Set an upper bound timeout of ses_io_time if zero is passed in */
+       pkt->pkt_time = (scmd->uscsi_timeout == 0) ?
+           ses_io_time : scmd->uscsi_timeout;

        pkt->pkt_comp = ses_callback;
        pkt->pkt_private = (opaque_t)ssc;
#2

Updated by Dan McDonald over 6 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

commit ffc5bd0fe44a6f6529e738f1c202f788b1355046
Author: Kevin Crowe <>
Date: Wed Aug 7 13:34:06 2013 -0400

3994 uscsi_timeout of 0 sent to ses target driver can hang processes in biow
Reviewed by: Dan McDonald &lt;&gt;
Reviewed by: Jason King &lt;&gt;
Approved by: Richard Lowe &lt;&gt;

Also available in: Atom PDF