mpt_sas IOC reset races can cause panics
There are races present between event handling and the IOC reset path for mpt_sas.
The most common panic resulting from this occurs near the beginning of mptsas_handle_event:
ffffff002f0c8b90 mptsas_handle_event+0x1c(ffffff07450fb020)
ffffff002f0c8c40 taskq_thread+0x285(ffffff07306c9280)
ffffff002f0c8c50 thread_start+8()
Sometimes a stack trace will only show a mutex_enter in the event taskq:
ffffff002ffd4c60 fffffffffbc2dbf0 0 0 60 0
  PC: panicsys+0x9b    TASKQ: mpt_sas_mptsas_event_taskq
  stack pointer for thread ffffff002ffd4c60: ffffff002ffd46b0
    panic+0x94()
    die+0xdd()
    trap+0x177b()
    0xfffffffffb8001d6()
    mutex_enter+0xb()
    taskq_thread+0x285()
    thread_start+8()
There are three things that we don't want to run concurrently with an IOC reset attempt: the interrupt handlers, an address reply taskq thread, and a dynamic reconfigure (topology change) taskq thread. Any of these could reference resources that are reallocated or zeroed in mptsas_init_chip.
We call mptsas_init_chip to process a reset, and it reallocates the reply queue from which mptsas_handle_event takes an entry. We must grab mpt->m_mutex before checking the m_in_reset flag, and drop the event if we are in the middle of a reset.
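The check described above can be sketched in user space. This is only an illustration, not the driver code: it uses a pthread mutex in place of the kernel mutex, and mpt_sketch_t, sketch_handle_event, reply_queue and reply_count are hypothetical names. Only m_mutex and m_in_reset mirror names from the report.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the driver soft state. Only the two fields
 * named in the report (m_mutex, m_in_reset) mirror the real driver; the
 * rest is invented for this sketch.
 */
typedef struct mpt_sketch {
	pthread_mutex_t	m_mutex;	/* plays the role of mpt->m_mutex */
	bool		m_in_reset;	/* true while a reset is in progress */
	int		*reply_queue;	/* a reset may reallocate this */
	size_t		reply_count;	/* number of entries in reply_queue */
} mpt_sketch_t;

/*
 * Read one reply entry on behalf of an event. The flag is tested only
 * after the mutex is held, so a reset cannot swap the queue between the
 * test and the read. Returns false if the event was dropped.
 */
bool
sketch_handle_event(mpt_sketch_t *mpt, size_t slot, int *out)
{
	bool handled = false;

	pthread_mutex_lock(&mpt->m_mutex);
	if (!mpt->m_in_reset) {
		assert(slot < mpt->reply_count);
		*out = mpt->reply_queue[slot];
		handled = true;
	}
	pthread_mutex_unlock(&mpt->m_mutex);

	return (handled);
}
```

A reset path in this sketch would set m_in_reset while holding m_mutex before touching reply_queue, so an event that arrives mid-reset sees the flag and returns without dereferencing the old queue.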
In the solution, we have the interrupt handler exit if it's started in the middle of a reset (interrupts are already disabled, but either the polled or regular interrupt handler could still be running). We wait for the taskq threads to complete before proceeding. We also don't zero the address reply array since it's not necessary.
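The "exit if a reset is in progress, and wait for running threads before proceeding" pattern can be sketched as follows. This is a user-space analogue under stated assumptions, not the actual fix: the real driver uses kernel taskqs and mutexes, while this sketch uses a pthread mutex, a condition variable and an explicit count of active workers. All names (drain_sketch_t, worker_enter, worker_exit, reset_begin, reset_end) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for the sketch; not taken from the driver. */
typedef struct drain_sketch {
	pthread_mutex_t	lock;
	pthread_cond_t	idle_cv;	/* signalled when active drops to 0 */
	bool		in_reset;	/* true while a reset is in progress */
	int		active;		/* handlers/taskq jobs now running */
} drain_sketch_t;

void
drain_init(drain_sketch_t *d)
{
	pthread_mutex_init(&d->lock, NULL);
	pthread_cond_init(&d->idle_cv, NULL);
	d->in_reset = false;
	d->active = 0;
}

/* Called at the top of a handler; false means "exit immediately". */
bool
worker_enter(drain_sketch_t *d)
{
	bool ok;

	pthread_mutex_lock(&d->lock);
	ok = !d->in_reset;
	if (ok)
		d->active++;
	pthread_mutex_unlock(&d->lock);
	return (ok);
}

/* Called when a handler that entered successfully is done. */
void
worker_exit(drain_sketch_t *d)
{
	pthread_mutex_lock(&d->lock);
	assert(d->active > 0);
	if (--d->active == 0)
		pthread_cond_broadcast(&d->idle_cv);
	pthread_mutex_unlock(&d->lock);
}

/* Block new workers, then wait for the running ones to finish. */
void
reset_begin(drain_sketch_t *d)
{
	pthread_mutex_lock(&d->lock);
	d->in_reset = true;
	while (d->active > 0)
		pthread_cond_wait(&d->idle_cv, &d->lock);
	pthread_mutex_unlock(&d->lock);
}

/* Allow workers to run again once the reset has completed. */
void
reset_end(drain_sketch_t *d)
{
	pthread_mutex_lock(&d->lock);
	d->in_reset = false;
	pthread_mutex_unlock(&d->lock);
}
```

In this sketch the shared resources would only be reallocated between reset_begin and reset_end, at which point no worker that could reference them is still running.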
This was reported by a user on the developer list:
This was determined to be a duplicate of Nexenta #10443 and Bugster 6966172. We've fixed this in illumos-nexenta: