Bug #9575

apix can lose interrupts after interrupt thread blocks

Added by Paul Winder about 2 years ago. Updated about 2 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: kernel
Start date: 2018-06-01
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

This is a similar issue to that described in bug #7725

When an interrupt thread is scheduled, any further interrupts at the same or lower IPL are added to a pending list, and a flag (x_intr_pending) is set. When the interrupt thread completes, control passes back to the main interrupt function in apix, which then detects and processes the pending interrupts before it completes. The problem arises when the interrupt thread blocks on a mutex: when that happens, it is converted to a standard thread. When this thread finishes, it is not able to return to the main interrupt function and does a swtch() instead. The consequence is that any interrupts which have been queued as pending are not processed.

We are then in a similar situation to that described in #7725: should we now get an interrupt at the same IPL as those unprocessed, the flag in x_intr_pending is cleared, the new interrupt is handled, and the pending interrupts get missed.

Fixed by:
  1. Before calling swtch(), check if there are any pending interrupts and, if so, set a soft interrupt. When this soft interrupt is processed, the pending interrupts will be cleared.
  2. When we get a new hard interrupt, check if there is a vector pending at the same level; if so, add the new vector to the pending list and process all pending vectors.
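
The two steps above can be sketched with a small user-space model. This is not the illumos code: every name here (model_cpu_t, model_thread_exit_via_swtch() and so on) is hypothetical, a simple counter stands in for running the handlers, and the soft interrupt is reduced to a flag. It only illustrates the intended control flow under those assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAXIPL 16
#define MODEL_QLEN   8

/* Hypothetical per-CPU state; "pending" plays the role of x_intr_pending. */
typedef struct model_cpu {
    uint16_t pending;                     /* one bit per IPL with queued vectors */
    int queue[MODEL_MAXIPL][MODEL_QLEN];  /* queued vector numbers per IPL */
    int qlen[MODEL_MAXIPL];               /* entries used in each queue */
    int softint_raised;                   /* stands in for a posted soft interrupt */
    int handled;                          /* number of vectors serviced so far */
} model_cpu_t;

/* Queue a vector at the given IPL and set its pending bit. */
static void
model_queue_pending(model_cpu_t *cp, int ipl, int vec)
{
    assert(ipl >= 0 && ipl < MODEL_MAXIPL);
    assert(cp->qlen[ipl] < MODEL_QLEN);
    cp->queue[ipl][cp->qlen[ipl]++] = vec;
    cp->pending |= (uint16_t)(1u << ipl);
}

/* Service everything queued at one IPL, then clear its pending bit. */
static void
model_drain(model_cpu_t *cp, int ipl)
{
    cp->handled += cp->qlen[ipl];
    cp->qlen[ipl] = 0;
    cp->pending &= (uint16_t)~(1u << ipl);
}

/*
 * Step 1: an interrupt thread that blocked cannot return to the main
 * interrupt function, so before it switches away it posts a soft
 * interrupt if anything is still pending.
 */
static void
model_thread_exit_via_swtch(model_cpu_t *cp)
{
    if (cp->pending != 0)
        cp->softint_raised = 1;
}

/* The soft interrupt, when it runs, services all queued vectors. */
static void
model_softint(model_cpu_t *cp)
{
    if (!cp->softint_raised)
        return;
    cp->softint_raised = 0;
    for (int ipl = MODEL_MAXIPL - 1; ipl >= 0; ipl--) {
        if (cp->qlen[ipl] > 0)
            model_drain(cp, ipl);
    }
}

/*
 * Step 2: a new hard interrupt that finds vectors already pending at its
 * level joins the list and the whole list is processed, instead of the
 * pending bit being cleared with the older vectors still queued.
 */
static void
model_hard_interrupt(model_cpu_t *cp, int ipl, int vec)
{
    assert(ipl >= 0 && ipl < MODEL_MAXIPL);
    if (cp->pending & (1u << ipl)) {
        model_queue_pending(cp, ipl, vec);
        model_drain(cp, ipl);
    } else {
        cp->handled++;  /* nothing pending: service the new vector directly */
    }
}
```

In this model, queuing two vectors at IPL 7 and then calling model_thread_exit_via_swtch() followed by model_softint() leaves pending at zero with both vectors counted as handled, which is the outcome the first step is meant to guarantee.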

Related issues

Related to illumos gate - Bug #7724: apix may lose interrupts occuring while softint is running at same IPL (Closed, 2017-01-03)

History

#2

Updated by Paul Winder about 2 years ago

> *apixs+918*28::print -t apix_impl_t
apix_impl_t {
    processorid_t x_cpuid = 0x28
    uint16_t x_intr_pending = 0x80
    struct autovec *[16] x_intr_head = [ 0, 0, 0, 0, 0, 0, 0, 0xffffc2a59be7f158, 0, 0, 0, 0, 0, 0, 0, 0 ]
    struct autovec *[16] x_intr_tail = [ 0, 0, 0, 0, 0, 0, 0, 0xffffc2a59d4e6c98, 0, 0, 0, 0, 0, 0, 0, 0 ]
    apix_vector_t *x_obsoletes = 0
    apix_vector_t *[256] x_vectbl = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]
    lock_t x_lock = 0
}

The above is an entry in the apixs array for a CPU on which interrupts are missing. x_intr_pending == 0x80 indicates there are pending vectors at IPL 7, and x_intr_head and x_intr_tail hold the list. As the code stood, x_intr_pending should never be non-zero unless an interrupt thread is active or apix_do_interrupt() is running on the CPU. In this example the system was idle. Another symptom is that x_intr_pending is zero but the corresponding array entries have queued vectors, which is again an invalid combination.
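
As a rough illustration of how the mask relates to the lists, the sketch below decodes a pending mask into an IPL and checks that each bit agrees with whether the matching list head is populated. The struct and function names are hypothetical and cut down from the mdb output above; the check covers only the second symptom (bits and lists disagreeing), since the first also depends on whether the CPU is idle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAXIPL 16

/* Hypothetical, cut-down stand-in for the per-CPU state shown above. */
typedef struct model_apix {
    uint16_t pending;                /* models x_intr_pending */
    const void *head[MODEL_MAXIPL];  /* models x_intr_head */
} model_apix_t;

/* Highest IPL whose pending bit is set, or -1 if no bit is set. */
static int
model_highest_pending_ipl(uint16_t pending)
{
    for (int ipl = MODEL_MAXIPL - 1; ipl >= 0; ipl--) {
        if (pending & (1u << ipl))
            return (ipl);
    }
    return (-1);
}

/* Returns 1 if every pending bit matches a non-empty list, else 0. */
static int
model_state_consistent(const model_apix_t *ap)
{
    assert(ap != NULL);
    for (int ipl = 0; ipl < MODEL_MAXIPL; ipl++) {
        int bit = (ap->pending >> ipl) & 1;
        int queued = (ap->head[ipl] != NULL);
        if (bit != queued)
            return (0);
    }
    return (1);
}
```

With the values from the dump, a mask of 0x80 has only bit 7 set, so model_highest_pending_ipl() reports IPL 7, matching the single populated slot in x_intr_head.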

To verify the fix I added probes at the point where a soft or hard interrupt thread returns but does a swtch() because it had blocked on a mutex, and another probe in apix_do_interrupt() at the point where it dispatches the vector.

Below are some counts from the probes over a period of about 10 minutes on a system with heavy load across multiple NVMe drives. The softints and hardints counts confirm that threads are blocking on a mutex and hence not returning to apix_do_interrupt(), and the pendints counts confirm that once we are back in apix_do_interrupt() there are unexpected pending interrupts. So the conditions are ripe for interrupts to go missing.

softints
        CPU06: 1
        CPU13: 1
        CPU42: 1
        CPU48: 1
        CPU50: 1
        CPU00: 519

hardints
        CPU52: 4930
        CPU51: 4975
        CPU16: 5023
        CPU09: 5043
        CPU53: 5071
        CPU14: 5080
        CPU15: 5087
        CPU06: 5090
        CPU17: 5091
        CPU08: 5108
        CPU50: 5115
        CPU07: 5193
        CPU44: 5239
        CPU58: 5243
        CPU00: 5247
        CPU45: 5300
        CPU43: 5351
        CPU59: 5354
        CPU42: 5401
        CPU01: 5486
        CPU55: 6166
        CPU49: 6180
        CPU54: 6201
        CPU56: 6247
        CPU10: 6292
        CPU13: 6325
        CPU40: 6330
        CPU18: 6351
        CPU19: 6408
        CPU12: 6411
        CPU46: 6451
        CPU11: 6469
        CPU47: 6515
        CPU05: 6530
        CPU41: 6624
        CPU48: 6642
        CPU03: 6680
        CPU57: 6700
        CPU04: 6751
        CPU02: 6783

pendints
        CPU05: 1
        CPU19: 1
        CPU46: 1
        CPU06: 2
        CPU00: 311

#3

Updated by Electric Monk about 2 years ago

  • Status changed from New to Closed
  • % Done changed from 90 to 100

git commit af7c4cb5225cc98e505e165d8fe7b59f98595bbc

commit  af7c4cb5225cc98e505e165d8fe7b59f98595bbc
Author: Paul Winder <paul.winder@tegile.com>
Date:   2018-06-11T15:27:48.000Z

    9575 apix can lose interrupts after interrupt thread blocks
    Reviewed by: Igor Kozhukhov <igor@dilos.org>
    Reviewed by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
    Reviewed by: Ken Mays <kmays2000@gmail.com>
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

#4

Updated by Marcel Telka about 2 years ago

  • Related to Bug #7724: apix may lose interrupts occuring while softint is running at same IPL added
