sd open/close semaphore is unfortunate
The sd driver serializes all open/close operations such that any attempts to open the drive must block while any other instance is in open or close.
This turns out to be really devastating in the face of a bad drive, as non-blocking opens are not honored, and must wait for a blocking open to timeout. What's worse, is that because of the way this is structured, the open logic has a default long (60 second) timeout. If a number of processes are trying to open the device and it is not responsive, they will all get stuck for the entire wait period of all the callers ahead of them, and their own. Making matters worse, this is a semaphore and therefore does not honor interrupts.
We've observed (with some of our monitoring tools that used to attempt to open the drive frequently) some very very long queues, leading to the appearance of a stuck system call.
After doing a lot of analysis, I have become convinced that this open/close semaphore isn't protecting anything that is actually necessary. The opens_in_progress and layer_count fields are not actually useful for anything, although there is an attempt to synchronize with detach(9e) operations.
This is completely unnecessary. The modern DDI guarantees that a driver will not be detached while open files against that instance are present. (That's part of the purpose of the whole getinfo(9e) logic.). (On a historical note, before the invention of the LDI, some in-kernel consumers used to call the driver's open(9e) entry point directly, bypassing those safeties in the DDI. That hasn't been the case for a very long time now -- Solaris 9ish I think.)
We can dispense with the synchronization altogether. This also allows us to remove the detach_mutex.
RackTop has been running this way on systems with hundreds of disks attached (> 400 in some cases), and we have tested this behavior with various faulty and healthy disks, etc. We have seen no evidence of issue caused by removal of this logic.
Conversely, for some of our tools that use USCSI operations to issue CDBs directly to the disk, they can now immediately perform the open using the FNDELAY flag. This change has a had very substantial beneficial impact on our field operations at a number of customer sites.
I will be posting the full up review for this shortly.
Updated by Garrett D'Amore 2 days ago
For folks looking at my work in the driver:
- detach_count can go away entirely and the sd_detach_mutex -- the only things it serializes are the pm_idle_timeout shutdown, which is easier done by checking the timeout id and the existing locks, and things that the DDI already protects.
- no need to count layered opens or handle them separately -- the DDI takes care of all this (once upon a time it did not)
- the DDI guarantees not to call detach concurrently, and not to call it while files are open against the devinfo
- the sd_label_mutex is apparently an unused mutex, we can just remove it.