i40e spends too much time in interrupt context when optics are removed
I'm seeing the following problem on a system with a 4-port i40e:
When the fibers are pulled (including the optics, as I have been told), each i40e instance keeps one CPU 100% busy in the adminq interrupt, doing one link check after another. It seems that we stay in i40e_intr_adminq_work() for a very long time because more requests come in while we're still busy processing one. All of them are link check requests, and each link check is done while holding the general lock. We noticed that a system suffering from this interrupt storm will hang at boot or at shutdown.
Analyzing this with DTrace showed that every time we get a request from the queue, there are between 4 and 16 more requests remaining. The total runtime of a single i40e_intr_adminq_work() call can add up to several hundred milliseconds.
I have tried patching the driver to skip the link check, and with that change the system apparently behaves normally. I have also tried removing only the mutex_enter/mutex_exit of the general lock around the link check, but that didn't improve the situation.
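To illustrate why one link check per queued request is costly, here is a minimal, self-contained C sketch that compares it with checking the link once per drained batch. This is not the driver's code and not a proposed fix: `ev_kind_t`, `stats_t`, `link_check()`, `drain_naive()` and `drain_coalesced()` are hypothetical names, and the queue is a plain array standing in for the admin queue.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event kinds; these are not the real admin queue opcodes. */
typedef enum { EV_LINK_STATUS, EV_OTHER } ev_kind_t;

/* Counters so the effect of coalescing can be observed. */
typedef struct {
    unsigned link_checks;   /* number of (expensive) link checks performed */
    unsigned other_handled; /* number of non-link events handled */
} stats_t;

/* Stand-in for the expensive link check done under the general lock. */
void link_check(stats_t *st)
{
    st->link_checks++;
}

/* One link check per queued link event, as described in this report. */
void drain_naive(const ev_kind_t *q, size_t n, stats_t *st)
{
    for (size_t i = 0; i < n; i++) {
        if (q[i] == EV_LINK_STATUS)
            link_check(st);
        else
            st->other_handled++;
    }
}

/* Only note that a link event was seen, then check once after draining. */
void drain_coalesced(const ev_kind_t *q, size_t n, stats_t *st)
{
    bool link_pending = false;

    for (size_t i = 0; i < n; i++) {
        if (q[i] == EV_LINK_STATUS)
            link_pending = true;
        else
            st->other_handled++;
    }
    if (link_pending)
        link_check(st);
}
```

With 16 queued link events, `drain_naive()` performs 16 link checks while `drain_coalesced()` performs one. Whether coalescing is safe in the real driver depends on whether each event carries state that must be handled individually, which this sketch does not model.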
Updated by Hans Rosenfeld over 4 years ago
A few more properties of the device:
name='api-version' type=string items=1 dev=none value='1.4'
name='firmware-build' type=string items=1 dev=none value='892b'
name='firmware-version' type=string items=1 dev=none value='4.40'
name='printed-board-assembly' type=string items=1 dev=none value=''