Bug #7966

sd attach should not issue potentially slow scsi commands

Added by Yuri Pankov about 3 years ago. Updated almost 3 years ago.

Status: Rejected
Priority: Normal
Assignee:
Category: driver - device drivers
Start date: 2017-03-13
Due date:
% Done: 50%
Estimated time:
Difficulty: Medium
Tags:

Description

Description by Jeffry Molanus.

During sd_unit_attach(), a START_STOP_UNIT() is sent from sd_spinup_unit(). The drive(s) return Check_Condition, but the returned values are not diagnosed. As a result, retries are issued from sd as well as from mpt_sas. This results in a complete system hang, since the device tree is locked and we never bother to check why sd_spinup_unit() fails.
When the START_STOP_UNIT() takes too long, we should simply fail the attach and move on. This function can fail for anything from an uncharged STEC SSD to a simply broken device.
In other words, if we have a fully operational system with 500 drives, and we expand it by inserting a shiny new disk that is DOA, we grind to a halt.
In the HA case, both nodes will "see" the new drive at the same time, so both nodes will hang on the same disk, which prevents a fail-over from occurring. Note: the machine still responds to pings etc., so no automated fail-over may occur.
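To illustrate the behaviour argued for above, here is a minimal userland sketch of a spin-up policy that actually looks at the sense data returned with Check_Condition and gives up once a time budget is spent. It is not the sd driver's code: sense_decode(), spinup_decide(), the enum and the struct are hypothetical names, and the choice of which conditions are worth retrying is an assumption. Only the sense data layout (fixed format 70h/71h, descriptor format 72h/73h) and the NOT READY 04h/01h "in process of becoming ready" code come from the SCSI standard.

```c
#include <stddef.h>
#include <stdint.h>

/* SCSI sense keys used by this sketch. */
#define KEY_NOT_READY      0x02
#define KEY_UNIT_ATTENTION 0x06

/* Hypothetical outcome of one failed spin-up attempt. */
typedef enum { SPINUP_RETRY, SPINUP_FAIL_ATTACH } spinup_action_t;

typedef struct {
	uint8_t key;   /* sense key */
	uint8_t asc;   /* additional sense code */
	uint8_t ascq;  /* additional sense code qualifier */
} sense_info_t;

/*
 * Extract key/ASC/ASCQ from fixed or descriptor format sense data.
 * Returns 0 on success, -1 if the buffer is missing, short or unknown.
 */
int
sense_decode(const uint8_t *buf, size_t len, sense_info_t *out)
{
	if (buf == NULL || out == NULL || len < 1)
		return (-1);

	switch (buf[0] & 0x7f) {
	case 0x70:
	case 0x71:	/* fixed format: key in byte 2, ASC/ASCQ in 12/13 */
		if (len < 14)
			return (-1);
		out->key = buf[2] & 0x0f;
		out->asc = buf[12];
		out->ascq = buf[13];
		return (0);
	case 0x72:
	case 0x73:	/* descriptor format: key, ASC, ASCQ in bytes 1..3 */
		if (len < 4)
			return (-1);
		out->key = buf[1] & 0x0f;
		out->asc = buf[2];
		out->ascq = buf[3];
		return (0);
	default:
		return (-1);
	}
}

/*
 * Decide what to do after START STOP UNIT ended in Check Condition.
 * Assumed policy: retry only while the time budget lasts and the drive
 * reports a transient state; everything else fails the attach.
 */
spinup_action_t
spinup_decide(const uint8_t *sense, size_t len,
    unsigned elapsed_ms, unsigned budget_ms)
{
	sense_info_t si;

	if (elapsed_ms >= budget_ms)
		return (SPINUP_FAIL_ATTACH);
	if (sense_decode(sense, len, &si) != 0)
		return (SPINUP_FAIL_ATTACH);

	/* NOT READY, logical unit is in process of becoming ready. */
	if (si.key == KEY_NOT_READY && si.asc == 0x04 && si.ascq == 0x01)
		return (SPINUP_RETRY);
	/* UNIT ATTENTION is normally cleared by reissuing the command. */
	if (si.key == KEY_UNIT_ATTENTION)
		return (SPINUP_RETRY);

	return (SPINUP_FAIL_ATTACH);
}
```

A caller would invoke spinup_decide() after each failed attempt and stop as soon as it returns SPINUP_FAIL_ATTACH. The budget is a tunable; the point is only that it is bounded, so a dead drive cannot hold the device tree lock indefinitely.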
While working on this, it became clear to me that our whole attach routine requires a lot of rework. It should really be asynchronous, since the device tree is "locked" even when we attach a good device. All of this takes way too long:

real 1m20.461s
user 0m0.001s
sys 0m0.017s

In this case 3 drives are attached and format(1M) is issued; this is, however, being run on a debug build.
To make matters worse, even if we fail the attach, a "format" that scans the device tree will try to reattach the device, and we are back in the "hang". It is therefore important that we either:

1. retire the device, which needs to be done from user space using di_retire()
2. clean devfs, so the device is really "gone" from the system.

In case (2), reinserting the drive will make mpt_sas pick up the drive again, so in the case of an uncharged SSD, leave it for a minute or two and then try reinserting it.

History

#1

Updated by Yuri Pankov almost 3 years ago

  • Status changed from In Progress to Feedback
#2

Updated by Yuri Pankov almost 3 years ago

  • Status changed from Feedback to Rejected

withdrawn.
