Bug #4819

fix mpt_sas command timeout handling

Added by Hans Rosenfeld over 3 years ago. Updated over 3 years ago.

Status: Closed
Start date: 2014-04-28
Priority: Normal
Due date:
Assignee: Hans Rosenfeld
% Done: 100%
Category: -
Target version: -
Difficulty: Medium
Tags: needs-triage

Description

The existing timeout handling in mpt_sas is based on a per-target counter that is decremented every watchdog interval and reset when a new command arrives. Because it does not know when a command was actually issued, it handles timeouts shorter than the watchdog interval poorly, stretching them beyond the interval, and sometimes resets them multiple times. It also postpones timing out commands until the last command in an active slot has timed out. This is inaccurate and causes unpredictable behaviour when commands with different timeouts are issued over a period of time, particularly when larger timeouts such as 60 seconds are hardcoded in various subsystems.
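To illustrate the counter-reset problem, here is a minimal model of a per-target counter that is reloaded on every new command. It is only a sketch: the names (old_target_t, old_cmd_start, old_watchdog) are hypothetical and the logic is deliberately simplified, so it should not be read as the driver's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a per-target countdown; not the real mpt_sas code. */
typedef struct old_target {
	int ncmds;	/* commands currently outstanding */
	int timeout;	/* remaining time on the shared counter, in seconds */
} old_target_t;

/* Starting any command reloads the single counter shared by the target. */
static void
old_cmd_start(old_target_t *t, int cmd_timeout)
{
	assert(t != NULL);
	t->ncmds++;
	t->timeout = cmd_timeout;
}

/* Called once per watchdog interval; returns 1 when the counter runs out. */
static int
old_watchdog(old_target_t *t, int interval)
{
	if (t->ncmds == 0)
		return (0);
	t->timeout -= interval;
	return (t->timeout <= 0);
}
```

In this model, a stuck command with a 5 second timeout is never detected as long as another command is started before each watchdog tick, because every new command reloads the shared counter.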

We have replaced these heuristics with a per-target queue of active commands sorted by actual time of expiration, from last to first, and maintained as commands in active slots are started and removed. The watchdog is still used to process timeouts, and accurately identifies expired commands by comparing timestamps. Once a command has timed out, the timeout detection sets the throttle to drain to prevent additional commands from being issued. A target reset is performed only when the last command in the queue has timed out.
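The queue-based scheme described above can be sketched roughly as follows. This is a simplified illustration, not the code from the commit: the type and function names (cmd_t, target_t, target_cmd_start, target_cmd_done, target_watchdog) are hypothetical, time is an abstract tick count, and a plain singly linked list stands in for whatever queue structure the driver uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types; names are illustrative, not the driver's own. */
typedef struct cmd {
	uint64_t expire;	/* absolute expiration time */
	struct cmd *next;
} cmd_t;

typedef struct target {
	cmd_t *active;		/* active commands, latest expiration first */
	int draining;		/* nonzero once the throttle is set to drain */
} target_t;

typedef enum { WD_OK, WD_DRAIN, WD_RESET } wd_action_t;

/* Stamp the command and insert it, keeping descending expiration order. */
static void
target_cmd_start(target_t *t, cmd_t *c, uint64_t now, uint64_t timeout)
{
	cmd_t **pp;

	assert(t != NULL && c != NULL);
	c->expire = now + timeout;
	for (pp = &t->active; *pp != NULL && (*pp)->expire > c->expire;
	    pp = &(*pp)->next)
		;
	c->next = *pp;
	*pp = c;
}

/* Unlink a command when it completes. */
static void
target_cmd_done(target_t *t, cmd_t *c)
{
	cmd_t **pp;

	for (pp = &t->active; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == c) {
			*pp = c->next;
			c->next = NULL;
			return;
		}
	}
}

/* Watchdog tick: compare timestamps instead of counting down. */
static wd_action_t
target_watchdog(target_t *t, uint64_t now)
{
	cmd_t *c;

	if (t->active == NULL)
		return (WD_OK);
	/* The head expires last; if it has expired, every command has. */
	if (t->active->expire <= now)
		return (WD_RESET);
	/* The tail expires first; if it has expired, stop issuing commands. */
	for (c = t->active; c->next != NULL; c = c->next)
		;
	if (c->expire <= now) {
		t->draining = 1;
		return (WD_DRAIN);
	}
	return (WD_OK);
}
```

With one command started at time 0 with a timeout of 5 and another with a timeout of 60, this sketch reports WD_DRAIN from time 5 onward and WD_RESET only once time 60 is reached, assuming neither command completes in between.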

The watchdog interval has been reduced to 1s to handle short timeouts better.

History

#1 Updated by Hans Rosenfeld over 3 years ago

  • Assignee set to Hans Rosenfeld

webrev: http://ma.nexenta.com/~woodstock/mptsas-timeout/

This includes the following commits from Nexenta's nza-kernel tree:

1884358344 re #8346 rb2639 KT disk failures
7aa236e38f re #9517 rb4120 After single disk fault patch installed single disk fault still causes process hangs
ed00677f18 re #9517 rb4120 After single disk fault patch installed single disk fault still causes process hangs (fix gcc build)
75c7a269f5 re #12927 rb4203 LSI 2008 mpt_sas
ab642d0b21 OS-59 remove automated target removal mechanism from mpt_sas.
3919b35111 OS-60 mptsas watchdog resolution considered way to long for accurate CMD timeouts.
ab0fb3244c OS-87 pkt_reason not set accordingly when mpt_sas times out commands

#2 Updated by Electric Monk over 3 years ago

  • Status changed from New to Closed

git commit cb3e7fb42f8104f779abb6856ccf6e5b8e6419d8

commit  cb3e7fb42f8104f779abb6856ccf6e5b8e6419d8
Author: Albert Lee <trisk@nexenta.com>
Date:   2014-04-29T17:48:40.000Z

    4819 fix mpt_sas command timeout handling
    Reviewed by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
    Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
    Reviewed by: Keith Wesolowski <keith.wesolowski@joyent.com>
    Approved by: Dan McDonald <danmcd@omniti.com>
