Bug #9575
apix can lose interrupts after interrupt thread blocks
100%
Description
This is a similar issue to that described in bug #7725.
When an interrupt thread is scheduled, any further interrupts at the same or lower IPL are added to a pending list, and a flag (x_intr_pending) is set. When the interrupt thread completes, control passes back to the main interrupt function in apix, which detects and processes the pending interrupts before it completes. The problem arises when the interrupt thread blocks on a mutex: when that happens, it is converted to a standard thread. When this thread finishes, it cannot return to the main interrupt function and does a swtch() instead. The consequence is that any interrupts which have been queued as pending are not processed.
We are then in a similar situation to that described in #7725: should we now get an interrupt at the same IPL as those left unprocessed, the flag in x_intr_pending is cleared, the new interrupt is handled, and the pending interrupts are missed.
Fixed by:
- Before calling swtch(), check whether there are any pending interrupts and, if so, set a soft interrupt. When this soft interrupt is processed the pending interrupts will be cleared.
- When we get a new hard interrupt, check whether there is a vector pending at the same level; if there is, add the new vector to the pending list and process all pending vectors.
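The failure and the two parts of the fix can be sketched with a small user-space model. This is only an illustration under stated assumptions: `toy_cpu_t` and every `toy_*` function are hypothetical names invented here, the per-IPL list is reduced to a counter, and none of this is the actual apix code from the commit.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_IPL 16

/* Toy stand-in for the per-CPU apix state; NOT the real apix_impl_t. */
typedef struct toy_cpu {
	uint16_t pending;              /* bit n set: vectors queued at IPL n */
	int      queued[TOY_MAX_IPL];  /* vectors waiting at each IPL */
	int      softint_requested;    /* hypothetical "soft interrupt set" flag */
	int      handled;              /* vectors dispatched so far */
} toy_cpu_t;

/* Queue a vector that arrived while an interrupt thread was running. */
static void
toy_queue_pending(toy_cpu_t *c, int ipl)
{
	assert(ipl >= 0 && ipl < TOY_MAX_IPL);
	c->queued[ipl]++;
	c->pending |= (uint16_t)(1u << ipl);
}

/* Dispatch everything queued at one IPL and clear its flag bit. */
static void
toy_drain(toy_cpu_t *c, int ipl)
{
	c->handled += c->queued[ipl];
	c->queued[ipl] = 0;
	c->pending &= (uint16_t)~(1u << ipl);
}

/*
 * Behaviour described in the report before the fix: the flag bit is
 * cleared and only the new vector is handled; queued ones are stranded.
 */
static void
toy_hard_intr_old(toy_cpu_t *c, int ipl)
{
	c->pending &= (uint16_t)~(1u << ipl);
	c->handled++;
}

/* Fix, part 1: a blocked interrupt thread is about to swtch(). */
static void
toy_before_swtch(toy_cpu_t *c)
{
	if (c->pending != 0)
		c->softint_requested = 1;
}

/* The soft interrupt drains every IPL that still has its bit set. */
static void
toy_softint(toy_cpu_t *c)
{
	for (int ipl = TOY_MAX_IPL - 1; ipl >= 0; ipl--) {
		if (c->pending & (1u << ipl))
			toy_drain(c, ipl);
	}
	c->softint_requested = 0;
}

/*
 * Fix, part 2: if vectors are already pending at this level, queue the
 * new one behind them and process the whole list.
 */
static void
toy_hard_intr_new(toy_cpu_t *c, int ipl)
{
	if (c->pending & (1u << ipl)) {
		toy_queue_pending(c, ipl);
		toy_drain(c, ipl);
	} else {
		c->handled++;
	}
}
```

In this model, with two vectors queued at IPL 7, `toy_hard_intr_old()` leaves `pending` at zero while `queued[7]` is still 2, whereas `toy_hard_intr_new()` dispatches all three vectors and leaves nothing queued.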
Related issues
Updated by Paul Winder over 2 years ago
> *apixs+918*28::print -t apix_impl_t
apix_impl_t {
    processorid_t x_cpuid = 0x28
    uint16_t x_intr_pending = 0x80
    struct autovec *[16] x_intr_head = [ 0, 0, 0, 0, 0, 0, 0, 0xffffc2a59be7f158, 0, 0, 0, 0, 0, 0, 0, 0 ]
    struct autovec *[16] x_intr_tail = [ 0, 0, 0, 0, 0, 0, 0, 0xffffc2a59d4e6c98, 0, 0, 0, 0, 0, 0, 0, 0 ]
    apix_vector_t *x_obsoletes = 0
    apix_vector_t *[256] x_vectbl = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]
    lock_t x_lock = 0
}
The above is an entry in the apixs array for a CPU on which interrupts are missing. x_intr_pending == 0x80 indicates there are pending vectors at IPL 7, and x_intr_head and x_intr_tail are the list. As the code stood, x_intr_pending should never be non-zero unless an interrupt thread is active or apix_do_interrupt() is running on that CPU. In this example the system was idle. Another symptom is that x_intr_pending is zero but the corresponding array entries have queued vectors; this again is an invalid combination.
To verify the fix I added probes at the points where a soft or hard interrupt thread returns but does a swtch() because it had blocked on a mutex, and another probe in apix_do_interrupt() at the point where it dispatches the vector.
Below are some counts from the probes over a period of about 10 minutes on a system with heavy load across multiple NVMe drives. The softint and hardint counts confirm that threads are blocking on mutexes and hence not returning to apix_do_interrupt(), and the pendint counts confirm that once we are back in apix_do_interrupt() there are unexpected pending interrupts. So the conditions are ripe for interrupts to go missing.
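The two invalid combinations described above can be expressed as a consistency check between the flag word and the per-IPL list heads. The following is a minimal sketch, assuming bit n of the flag corresponds to IPL n (as the 0x80 / IPL 7 example indicates) and that an empty list has a null head; `toy_pending_mismatch()` and `toy_highest_pending_ipl()` are hypothetical helpers written for illustration, not functions from the apix source.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NIPL 16	/* matches the 16-entry x_intr_head array shown above */

/*
 * Return a bitmask of IPLs where the pending flag and the list head
 * disagree: either the bit is set with an empty list, or vectors are
 * queued while the bit is clear. Zero means the state is consistent.
 */
static uint16_t
toy_pending_mismatch(uint16_t pending, void *const head[TOY_NIPL])
{
	uint16_t bad = 0;

	for (int ipl = 0; ipl < TOY_NIPL; ipl++) {
		int flagged = (pending >> ipl) & 1;
		int queued = (head[ipl] != NULL);

		if (flagged != queued)
			bad |= (uint16_t)(1u << ipl);
	}
	return (bad);
}

/* Decode the highest IPL with its bit set, or -1 if none is set. */
static int
toy_highest_pending_ipl(uint16_t pending)
{
	int ipl = -1;

	while (pending != 0) {
		pending >>= 1;
		ipl++;
	}
	return (ipl);
}
```

Note that this check only catches a flag and list that disagree. The dump above has both the bit and the list head set for IPL 7, so it would pass; what made that state wrong was that the CPU was idle at the time.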
softints
    CPU06: 1     CPU13: 1     CPU42: 1     CPU48: 1
    CPU50: 1     CPU00: 519

hardints
    CPU52: 4930  CPU51: 4975  CPU16: 5023  CPU09: 5043
    CPU53: 5071  CPU14: 5080  CPU15: 5087  CPU06: 5090
    CPU17: 5091  CPU08: 5108  CPU50: 5115  CPU07: 5193
    CPU44: 5239  CPU58: 5243  CPU00: 5247  CPU45: 5300
    CPU43: 5351  CPU59: 5354  CPU42: 5401  CPU01: 5486
    CPU55: 6166  CPU49: 6180  CPU54: 6201  CPU56: 6247
    CPU10: 6292  CPU13: 6325  CPU40: 6330  CPU18: 6351
    CPU19: 6408  CPU12: 6411  CPU46: 6451  CPU11: 6469
    CPU47: 6515  CPU05: 6530  CPU41: 6624  CPU48: 6642
    CPU03: 6680  CPU57: 6700  CPU04: 6751  CPU02: 6783

pendints
    CPU05: 1     CPU19: 1     CPU46: 1     CPU06: 2
    CPU00: 311
Updated by Electric Monk over 2 years ago
- Status changed from New to Closed
- % Done changed from 90 to 100
git commit af7c4cb5225cc98e505e165d8fe7b59f98595bbc
commit af7c4cb5225cc98e505e165d8fe7b59f98595bbc
Author: Paul Winder <paul.winder@tegile.com>
Date: 2018-06-11T15:27:48.000Z

    9575 apix can lose interrupts after interrupt thread blocks
    Reviewed by: Igor Kozhukhov <igor@dilos.org>
    Reviewed by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
    Reviewed by: Ken Mays <kmays2000@gmail.com>
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
Updated by Marcel Telka over 2 years ago
- Related to Bug #7724: apix may lose interrupts occuring while softint is running at same IPL added