new driver for Smart Array storage controllers
This combines several changes into a new driver for the Smart Array storage controllers. This was done for several reasons:
- The existing driver does not function very well: it only uses INTx interrupts and has a number of other problems that often lead to stalls and lost I/Os in the field
- We wanted to add support for modern facilities in the OS
- We wanted to add support for HBA mode found in more recent controllers
This will bind to 6th generation and later devices, which were previously handled by the cpqary3 driver.
This combines a number of logical changes in illumos-joyent into one fully featured driver:
- OS-5564 new driver for Smart Array storage controllers
- OS-5799 Smart Array firmware 8.00 does not work with MSI
- OS-5800 identify and record firmware version of Smart Array controllers
- OS-5565 Need smrt HBA mode support
The issue in OS-5799 is as follows:
Though the smrt driver appears to function correctly with 6.XX and 7.XX firmware versions of the Smart Array firmware, it appears that version 8.00 has introduced a new defect. With prior versions of the controller, the system was able to use Message Signalled Interrupts (MSI) to receive notifications of command completion. Unfortunately, version 8.00 appears to have broken MSI support completely – the controller makes every indication that it is trying to interrupt the host, but the interrupt is nonetheless never triggered. By switching to MSI-X interrupts, as used in the Linux hpsa driver, completions start working again.
Adding to the complexity of this issue: at least one older firmware version (revision 6.00, running in an HP DL580 G7) does not appear to post MSI-X interrupts for Simple Transport Method command completions. As such, I've had to be more conservative in the use of MSI-X interrupts, restricting them to firmware revisions in the 8.XX range. This policy may require further refinement in the future, if HP introduces more broken behaviour between revisions.
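The policy described above can be sketched in C roughly as follows. This is an illustration only, not the driver's actual code: the type and function names are hypothetical, the revision is assumed to be available as a string of the form "8.00", and falling back to MSI for every revision outside the 8.XX range (including ones that cannot be parsed) is an assumption based on MSI having worked with the 6.XX and 7.XX firmware.

```c
#include <stdlib.h>

/* Hypothetical names for illustration; not the driver's identifiers. */
typedef enum {
	SKETCH_INTR_MSI,
	SKETCH_INTR_MSIX
} sketch_intr_type_t;

/*
 * Extract the major number from a revision string such as "8.00".
 * Returns -1 if the string does not begin with a decimal number
 * immediately followed by '.'.
 */
static int
sketch_fw_major(const char *rev)
{
	char *end;
	unsigned long major;

	if (rev == NULL || rev[0] < '0' || rev[0] > '9')
		return (-1);

	major = strtoul(rev, &end, 10);
	if (*end != '.' || major > 255)
		return (-1);

	return ((int)major);
}

/*
 * Use MSI-X only for firmware in the 8.XX range. Everything else,
 * including revisions we could not parse, stays on MSI (assumed default).
 */
static sketch_intr_type_t
sketch_choose_intr(const char *rev)
{
	if (sketch_fw_major(rev) == 8)
		return (SKETCH_INTR_MSIX);

	return (SKETCH_INTR_MSI);
}
```

With this sketch, sketch_choose_intr("8.00") selects MSI-X, while "6.00" and "7.22" stay on MSI. Keeping the decision in one small function would make it easy to adjust if a later firmware revision changes behaviour again.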
I have been running a battery of tests to generate a lot of ZFS I/O on both an HP DL360p G8 and an HP DL580 G7 in the SF lab. I have also been routinely scrubbing the pool without any problems.
It's somewhat hard to induce the controller to fail in the ways that I'd like in order to test all of the error handling. Because it's magical hardware RAID, it tries very hard to shield us from any kind of errors at all, so most or all of the SCSI commands we send to it succeed eventually. I pulled out a single disk from a RAID-5 set which appeared not to interrupt I/O to the logical volume. Removing a second disk appears to cause the controller to stop responding to commands of any kind, including pings, but not to stop updating its heartbeat register. Once this happens we eventually panic after trying, unsuccessfully, to reset the controller.
I don't think the failure handling is any worse than in the old driver.
I have tested dumping a couple of times, most recently in order to run ::findleaks (no leaks detected!). This certainly works better than the old driver.
Further testing was done as follows with the introduction of HBA mode:
So far we've been doing general testing of this by using debug and non-debug bits. Testing on the generation 9 boxes includes both HBA and RAID modes. More specifically we've been doing:
- Heavy I/O patterns with scrubs
- Verifying disk pulls while this happens and noting that they all get logged into the system log
- Verifying that we can dump during heavy load
- Verifying against older generation controllers, specifically those present in the DL580 G7 and the DL360 G8 (this means gen 6 and 7 controllers)