sd attach should not issue potentially slow scsi commands
Description by Jeffry Molanus.
During sd_unit_attach(), sd_spinup_unit() sends a START_STOP_UNIT command. The drive(s) return Check Condition, but the values returned with it are never diagnosed. As a result, retries are issued by sd as well as by mpt_sas. This results in a complete system hang, since the device tree is locked and we never bother to check why sd_spinup_unit() fails.
When the START_STOP_UNIT takes too long, we should simply fail the attach and move on. The function can fail for many reasons, from an uncharged STEC SSD to a simply broken device.
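As a rough illustration of "diagnose the Check Condition, then give up after a bounded time", the sketch below classifies fixed-format sense data and decides between retrying and failing the attach. It is not the sd code: spinup_decide(), spinup_action_t and the time budget are hypothetical names and values invented for this example, and only fixed-format sense data (response code 0x70/0x71) is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SCSI sense keys as defined by SPC. */
#define KEY_NOT_READY		0x02
#define KEY_UNIT_ATTENTION	0x06

/* Hypothetical outcome type, not part of sd. */
typedef enum { SPINUP_RETRY, SPINUP_FAIL_ATTACH } spinup_action_t;

/*
 * Decide what to do after START STOP UNIT came back with CHECK CONDITION.
 * sense/len:   fixed-format sense data returned by the device
 * elapsed_ms:  time already spent trying to spin the unit up
 * budget_ms:   assumed upper bound on how long attach may block
 */
spinup_action_t
spinup_decide(const uint8_t *sense, size_t len,
    unsigned elapsed_ms, unsigned budget_ms)
{
	uint8_t rc, key, asc, ascq;

	assert(sense != NULL);

	/* Out of time: fail the attach no matter what the drive says. */
	if (elapsed_ms >= budget_ms)
		return (SPINUP_FAIL_ATTACH);

	/* Need at least bytes 0..13 to read key, ASC and ASCQ. */
	if (len < 14)
		return (SPINUP_FAIL_ATTACH);

	rc = sense[0] & 0x7f;
	if (rc != 0x70 && rc != 0x71)
		return (SPINUP_FAIL_ATTACH);	/* not fixed format */

	key = sense[2] & 0x0f;
	asc = sense[12];
	ascq = sense[13];

	/* 04h/01h: logical unit is in process of becoming ready. */
	if (key == KEY_NOT_READY && asc == 0x04 && ascq == 0x01)
		return (SPINUP_RETRY);

	/* Unit attention (e.g. after power on) is worth one more try. */
	if (key == KEY_UNIT_ATTENTION)
		return (SPINUP_RETRY);

	/* Anything else is treated as a dead or unusable device. */
	return (SPINUP_FAIL_ATTACH);
}
```

The point of the sketch is that the retry decision is made in one place and is always capped by the time budget, instead of sd and mpt_sas each retrying blindly.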
In other words, if we have a fully operational system with 500 drives and expand it by inserting a shiny new disk that is DOA, we grind to a halt.
In the case of HA, both nodes will "see" the new drive at the same time, so both nodes will hang on the same disk, which prevents a fail-over from occurring. Note: the machine still responds to pings etc., so automated fail-over may never be triggered.
While working on this, it became clear to me that our whole attach routine needs a lot of rework. It should really be asynchronous, since even when we attach a good device the device tree is "locked". All of this takes way too long:
In this case 3 drives are attached and format(1M) is issued. Note, however, that this is being run on a debug build.
To make matters worse, even if we fail the attach, a "format" that scans the device tree will try to attach the device again, and we are back in the "hang". It is therefore important that we either:
1. retire the device, which needs to be done from user space using di_retire(), or
2. clean devfs, so the device is really "gone" from the system.
In case of (2), reinserting the drive will make mptsas pick up the drive again, so in the case of an uncharged SSD, leave it for a minute or two and then try reinserting it.
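To make the reasoning above concrete, here is a toy model of why failing the attach alone is not enough: unless the failed device is remembered somewhere, every later scan attempts the attach again. Everything in it is hypothetical (retire_set_t, the helper names, the device path used in the usage note); it does not call di_retire() or touch devfs, it only mimics the bookkeeping in memory.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_RETIRED	16
#define PATH_LEN	128

/* Hypothetical in-memory stand-in for the system's retire state. */
typedef struct {
	char	paths[MAX_RETIRED][PATH_LEN];
	int	count;
} retire_set_t;

/* Return the slot holding path, or -1 if it is not retired. */
static int
retire_set_find(const retire_set_t *rs, const char *path)
{
	for (int i = 0; i < rs->count; i++) {
		if (strcmp(rs->paths[i], path) == 0)
			return (i);
	}
	return (-1);
}

/* Record a device whose attach failed; false if it cannot be stored. */
bool
retire_set_add(retire_set_t *rs, const char *path)
{
	size_t len = strlen(path);

	if (len >= PATH_LEN)
		return (false);
	if (retire_set_find(rs, path) >= 0)
		return (true);		/* already retired */
	if (rs->count >= MAX_RETIRED)
		return (false);
	memcpy(rs->paths[rs->count], path, len + 1);
	rs->count++;
	return (true);
}

/* Forget a device, e.g. after it has been pulled and reinserted. */
void
retire_set_remove(retire_set_t *rs, const char *path)
{
	int i = retire_set_find(rs, path);

	if (i < 0)
		return;
	rs->count--;
	if (i != rs->count)
		memcpy(rs->paths[i], rs->paths[rs->count], PATH_LEN);
}

/* A scan (such as the one format performs) would consult this first. */
bool
should_attempt_attach(const retire_set_t *rs, const char *path)
{
	return (retire_set_find(rs, path) < 0);
}
```

In this model, a scan after a failed attach of, say, "/devices/example/disk@1" skips the device until retire_set_remove() is called for it, which corresponds to reinserting the drive in case (2).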