Project

General

Profile

Bug #9836

EHCI appears to be causing a huge interrupt storm on my ivybridge system

Added by Adam Stylinski about 1 year ago. Updated 12 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
2018-09-12
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

I noticed one particular core was 40% on kernel, 60% idle. Using mpstat revealed that it's getting 100's of thousands of interrupts per second:

adam@sqlbox:/home/adam$ mpstat 1
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   63   0   14 195920 195677  255    3   14   16    0   290    1  41   0  59
  1   80   0    5   145   19  200    3   16   22    0   314    0   0   0  99
  2   82   0   54   164   67  193    3   15   22    0   299    0   0   0  99
  3   64   0    3   139   36  172    3   15   23    0   219    0   0   0  99
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    4   0   12 198278 198066  132    0    6    3    0   284    0  40   0  60
  1    0   0   13   146   34  138    0    9    1    0    20    0   0   0 100
  2    0   0   15    71    8   52    0    3    1    0    41    0   0   0 100
  3    0   0    0    61    9   44    0    2    0    0     6    0   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    2   0   12 198304 198089  134    1    5    5    0   283    0  40   0  60
  1    0   0   15   147   26  118    0    2    2    0    21    0   0   0 100
  2    0   0    3    94   12   69    0    3    1    0    53    0   0   0 100
  3    0   0   18    53    8   44    1    3    0    0     6    0   1   0  99
^C

Narrowed this down with a modified dtrace script to show aggregations of interrupts based on device name:

adam@sqlbox:/home/adam$ cat intby.d 
#!/usr/sbin/dtrace -s
/*
 * inttimes.d - print interrupt on-cpu time.
 *              Written using DTrace (Solaris 10 3/05)
 *
 * $Id: inttimes.d 3 2007-08-01 10:50:08Z brendan $
 *
 * USAGE:       inttimes.d      # wait several seconds, then hit Ctrl-C
 *
 * FIELDS:
 *              DEVICE          instance name of device driver
 *              TIME (ns)       sum of time spent servicing interrupt (ns)
 *
 * BASED ON: /usr/demo/dtrace/intr.d
 *
 * SEE ALSO:
 *          DTrace Guide "sdt Provider" chapter (docs.sun.com)
 *          intrstat(1M)
 *
 * PORTIONS: Copyright (c) 2005 Brendan Gregg.
 *

 *
 * 28-Jun-2005  Brendan Gregg   Created this.
 * 20-Apr-2006     "      "     Last update.
 */

#pragma D option quiet

dtrace:::BEGIN
{
        printf("Tracing... Hit Ctrl-C to end.\n");
}

sdt:::interrupt-complete
/arg0 != 0/
{
        this->devi = (struct dev_info *)arg0;
        /* this checks the pointer is valid, */
        self->name = this->devi != 0 ?
            stringof(`devnamesp[this->devi->devi_major].dn_name) : "?";
        this->inst = this->devi != 0 ? this->devi->devi_instance : 0;
        @num[self->name, this->inst] = count();
        self->name = 0;
}

dtrace:::END
{
        printf("%11s    %16s\n", "DEVICE", "Count");
        printa("%10s%-3d %@16d\n", @num);
}

The out revealed ehci is the culprit:

adam@sqlbox:/home/adam$ sudo ./intby.d 
Password: 
Tracing... Hit Ctrl-C to end.
^C
     DEVICE               Count
      ehci1                  2
    e1000g1                  9
   mpt_sas0                110
      ehci0             292199

Please let me know if there's anything I should try in terms of bios settings or system configuration.

History

#1

Updated by Adam Stylinski about 1 year ago

I should also add uname output:
adam@sqlbox:/home/adam$ uname -a
SunOS sqlbox 5.11 illumos-2eeda98682 i86pc i386 i86pc

#2

Updated by Adam Stylinski about 1 year ago

As it turns out, one boot environment prior does not suffer from this issue:

adam@sqlbox:/home/adam$ sudo intrstat 
Password: 

      device |      cpu0 %tim      cpu1 %tim      cpu2 %tim      cpu3 %tim
-------------+------------------------------------------------------------
    e1000g#1 |         0  0.0         0  0.0         3  0.0         0  0.0
      ehci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      ehci#1 |         0  0.0         0  0.0         0  0.0         0  0.0

      device |      cpu0 %tim      cpu1 %tim      cpu2 %tim      cpu3 %tim
-------------+------------------------------------------------------------
    e1000g#1 |         0  0.0         0  0.0         3  0.0         0  0.0
      ehci#0 |         1  0.0         0  0.0         0  0.0         0  0.0
      ehci#1 |         0  0.0         1  0.0         0  0.0         0  0.0

      device |      cpu0 %tim      cpu1 %tim      cpu2 %tim      cpu3 %tim
-------------+------------------------------------------------------------
    e1000g#1 |         0  0.0         0  0.0         1  0.0         0  0.0
      ehci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      ehci#1 |         0  0.0         0  0.0         0  0.0         0  0.0

      device |      cpu0 %tim      cpu1 %tim      cpu2 %tim      cpu3 %tim
-------------+------------------------------------------------------------
    e1000g#1 |         0  0.0         0  0.0         2  0.0         0  0.0
      ehci#0 |         1  0.0         0  0.0         0  0.0         0  0.0
      ehci#1 |         0  0.0         1  0.0         0  0.0         0  0.0

      device |      cpu0 %tim      cpu1 %tim      cpu2 %tim      cpu3 %tim
-------------+------------------------------------------------------------
    e1000g#1 |         0  0.0         0  0.0         1  0.0         0  0.0
      ehci#0 |         0  0.0         0  0.0         0  0.0         0  0.0
      ehci#1 |         0  0.0         0  0.0         0  0.0         0  0.0
   mpt_sas#0 |         0  0.0         0  0.0         0  0.0       132  0.1

      device |      cpu0 %tim      cpu1 %tim      cpu2 %tim      cpu3 %tim
-------------+------------------------------------------------------------
    e1000g#1 |         0  0.0         0  0.0         3  0.0         0  0.0
      ehci#0 |         1  0.0         0  0.0         0  0.0         0  0.0
      ehci#1 |         0  0.0         1  0.0         0  0.0         0  0.0
   mpt_sas#0 |         0  0.0         0  0.0         0  0.0         0  0.0

      device |      cpu0 %tim      cpu1 %tim      cpu2 %tim      cpu3 %tim
-------------+------------------------------------------------------------
    e1000g#1 |         0  0.0         0  0.0         2  0.0         0  0.0
      ehci#0 |         1  0.0         0  0.0         0  0.0         0  0.0
      ehci#1 |         0  0.0         1  0.0         0  0.0         0  0.0
   mpt_sas#0 |         0  0.0         0  0.0         0  0.0         0  0.0

adam@sqlbox:/home/adam$ uname -a
SunOS sqlbox 5.11 illumos-6640c13bea i86pc i386 i86pc

#3

Updated by Adam Stylinski about 1 year ago

Looking between those two revisions I only see two that could possibly do something like this: one that modifies an order for an atomic look during early CPU initialization, and the other more likely: a microcode update.

#4

Updated by Adam Stylinski 12 months ago

Just did another pkg image-update, the interrupt storm appears to be gone in illumos-e6ae04828f. Not sure if that helps any.

Also available in: Atom PDF