Bug #1333

High kernel cpu usage & dtrace hang on idle system

Added by Miha Krizaj over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: High
Category: kernel
Start date: 2011-08-04
Due date:
% Done: 100%
Estimated time:
Difficulty: Hard
Tags:

Description

We have a Supermicro X9SCA-F (Intel® C204 PCH chipset) with 16GB RAM and an E3-1240 Xeon CPU (quad core).

Clean install of OpenIndiana build 148 (server, no GUI). Mirrored disks on LSI2008-type controllers (two of them), and it has Intel NICs.

The problem is that kernel CPU usage increases gradually. It grows approximately 0.1% per hour on an idle system. When it reaches some 5-6%, any dtrace (arg0) hangs the system completely: it stops responding to pings, the console is hung, nothing works.

Once (out of some 20 hangs) I was able to start the shutdown by pressing the physical power button, so it seems the system still works in some way but is somehow consuming all the CPU/interrupts.

I thought this was related to https://www.illumos.org/issues/423 and https://defect.opensolaris.org/bz/show_bug.cgi?id=12799#c2, which is the "cpupm enable poll-mode" fix in power.conf, but setting it as per the above does not fix this problem.

I tried disabling C-states and disabling cpupm completely, both in the BIOS and in power.conf: using "cpupm enable poll-mode", then "cpu-deep-idle disable", and even "cpupm disable", but it still happens.
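
For reference, the /etc/power.conf variants tried looked roughly like the lines below, one variant at a time (not combined), with the rest of the file left at its defaults ('pmconfig' or a reboot is needed to apply each change):

# variant 1 - the poll-mode workaround from illumos #423 / OpenSolaris bug 12799
cpupm enable poll-mode

# variant 2 - keep cpupm but disallow deep C-state entry
cpu-deep-idle disable

# variant 3 - disable CPU power management entirely
cpupm disable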

After reboot the system starts at almost 0% kernel usage, and I can not find the cause of the increasing CPU usage as there is nothing to be seen. It takes 3-4 days to reach its breaking point.

When it grows to a higher percentage it hangs, which makes this hard to pinpoint.

I took some statistics while it was still possible:

prstat -s cpu -n 5: http://www.pastie.org/2241576

mpstat 5 5: http://www.pastie.org/2241586

powertop: http://www.pastie.org/2265444


Files

power_cpu_issue.txt (16 KB) power_cpu_issue.txt Jon Strabala, 2011-08-30 09:38 PM
power_cpu_issue_PLUS_44hours.txt (19 KB) power_cpu_issue_PLUS_44hours.txt Jon Strabala, 2011-09-01 07:02 PM
power_cpu_issue_PLUS_159ahours.txt (18.2 KB) power_cpu_issue_PLUS_159ahours.txt Jon Strabala, 2011-09-06 03:30 PM
power_cpu_issue_PLUS_0hours.txt (16.1 KB) power_cpu_issue_PLUS_0hours.txt Jon Strabala, 2011-09-06 05:34 PM
crash1.2011-09-06.tar.gz (83 KB) crash1.2011-09-06.tar.gz all hourly data and log files (note: I added more scripts as it was running) Jon Strabala, 2011-09-07 03:51 PM
mdb.log (107 KB) mdb.log Miha Krizaj, 2011-09-07 09:41 PM
data_logs_151a_freezeup_in_21hours.tar.gz (230 KB) data_logs_151a_freezeup_in_21hours.tar.gz log files from a 'clean' oi_151 - http://pkg.openindiana.org/dev-il/ Jon Strabala, 2011-09-09 03:59 PM
logs_data_scripts.151a.final.20110913.tar.gz (2.05 MB) logs_data_scripts.151a.final.20110913.tar.gz bundle of log, data, and modified, new dtrace scripts for 1/2 samples Jon Strabala, 2011-09-13 03:01 PM
logs_data_scripts.151a.final.20110914.tar.gz (545 KB) logs_data_scripts.151a.final.20110914.tar.gz bundle of log, data, and modified, new dtrace scripts for 20 minute Jon Strabala, 2011-09-14 03:14 PM
oi_151_released.console.png (63.2 KB) oi_151_released.console.png oi_151a released 101.5 hours - image of text console after crash Jon Strabala, 2011-09-19 06:33 PM
oi_151_released.101hours.txt (76.6 KB) oi_151_released.101hours.txt oi_151a released - analysis of dtrace and other scripts Jon Strabala, 2011-09-19 06:33 PM
dtrace_req2_1sec_5_sec_10sec.txt (44.3 KB) dtrace_req2_1sec_5_sec_10sec.txt dtrace -n "apix_do_interrupt:entry AT 1sec 5sec and 10sec Jon Strabala, 2011-09-20 03:12 AM
20111115_tcs_ga_after_boot.txt (7.04 KB) 20111115_tcs_ga_after_boot.txt after boot: uname -a, prtdiag, top, and prstat Jon Strabala, 2011-11-15 06:03 PM
20111118_tcs_ga_after_boot.txt (7.08 KB) 20111118_tcs_ga_after_boot.txt info_1 Jon Strabala, 2011-11-18 06:34 PM
20111118_tcs_ga_after_boot_some_dtrace_output.txt (30.5 KB) 20111118_tcs_ga_after_boot_some_dtrace_output.txt info_2_with_dtrace Jon Strabala, 2011-11-18 06:34 PM
20111118_tcs_ga_after_69hours.txt (9.87 KB) 20111118_tcs_ga_after_69hours.txt test of latest path after 69 hours Jon Strabala, 2011-11-21 07:34 PM
20111118_tcs_ga_after_69hours_fsflush_oddly_high_cpu.txt (132 KB) 20111118_tcs_ga_after_69hours_fsflush_oddly_high_cpu.txt other #1 look into unrelated fflush CPU and/or IO issue Jon Strabala, 2011-11-21 07:34 PM
20111118_tcs_ga_after_69hours_dtrace_oddly_high_zfs_activity.txt (200 KB) 20111118_tcs_ga_after_69hours_dtrace_oddly_high_zfs_activity.txt other #2 look into unrelated fflush CPU and/or IO issue Jon Strabala, 2011-11-21 07:34 PM

Related issues

Has duplicate: illumos gate - Bug #2129: interrupt increase on some Sandy Bridge CPUs (Rejected, 2012-02-14)


History

#1

Updated by Jon Strabala over 7 years ago

I have the exact same experience and basically the same system: a Supermicro X9SCA-F (Intel® C204 PCH chipset) with 16GB RAM and an Intel(R) Xeon(R) CPU E31270 @ 3.40GHz.

lab10# uname -a
SunOS lab10 5.11 oi_148b i86pc i386 i86pc Solaris
lab10#

The kernel is always at 0% after reboot but I see the same slow CPU usage growth over several days to the point where I can no longer log onto the system via ssh and have to do a hard power cycle via IPMI.

I have to do a "hard" power off. Immediately after the hard power cycle and a 'reboot':

Below I ran the following and captured the output; I have attached a text file due to formatting issues in this 'posting' tool. Note these are all right after a 'fresh' reboot.

uname -a

prstat -s cpu -n

mpstat 5 2

powertop

prtdiag

cat /etc/power.conf

dtrace -n 'profile-10ms{@[stack()] = count()}'

echo ::interrupts | mdb -k

echo ::interrupts -d | mdb -k

intrstat 30

powertop -t 30

No matter what I try I can not seem to prevent this issue. In fact, when it gets up past 20%, dtrace scripts will hang.

I have tried "cpupm enable poll-mode" in /etc/power.conf
I have tried "cpupm disable" in /etc/power.conf

Today I am trying to add the following to the bottom of the /etc/system file based on info in https://defect.opensolaris.org/bz/show_bug.cgi?id=12799

set cpupm_tw_gov_interval=800000
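
As a sanity check (assuming mdb -k behaves here as on other illumos builds), the value can be read back from the running kernel after boot with something like:

# print the tunable as a 32-bit decimal; adjust the format if the variable is wider
echo 'cpupm_tw_gov_interval/D' | mdb -k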

Any ideas or tests I can run would be greatly appreciated; I am stumped on how to debug or even collect data on this problem.

- Jon S

#2

Updated by Jon Strabala over 7 years ago

Doing some more testing ....

The CPU on this idle box was 99.3% idle at initial boot time; after about 44 hours it is only 98% idle (with "set cpupm_tw_gov_interval=800000" in /etc/system).

At time of boot ...
# date
August 31, 2011 07:07:16 AM PDT

# top -s 10
last pid: 1965; load avg: 0.00, 0.00, 0.00; up 0+15:26:34 07:06:57
62 processes: 61 sleeping, 1 on cpu
CPU states: 99.3% idle, 0.0% user, 0.7% kernel, 0.0% iowait, 0.0% swap
Kernel: 320 ctxsw, 1062 intr, 56 syscall
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
1204 root 3 59 0 59M 36M sleep 2:58 0.04% Xorg
1657 gdm 1 59 0 87M 25M sleep 0:19 0.00% gdm-simple-gree
1726 root 4 59 0 3068K 1948K sleep 0:00 0.00% automountd
1964 root 1 59 0 3840K 2476K cpu/3 0:00 0.00% top
372 root 4 59 0 6912K 5548K sleep 0:00 0.00% hald
401 root 1 59 0 3496K 2092K sleep 0:00 0.00% hald-addon-acpi
1642 gdm 1 59 0 82M 18M sleep 0:00 0.00% gnome-power-man
1713 root 21 59 0 7396K 5080K sleep 0:00 0.00% nscd

after 44 hours it has crept up as follows:

# date
September 1, 2011 11:33:56 AM PDT
##############################################################################
# top -s 10
last pid: 2236; load avg: 0.00, 0.00, 0.00; up 1+19:54:09 11:34:32
62 processes: 61 sleeping, 1 on cpu
CPU states: 98.0% idle, 0.1% user, 1.9% kernel, 0.0% iowait, 0.0% swap
Kernel: 386 ctxsw, 2 trap, 1073 intr, 62 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
1204 root 3 59 0 59M 36M sleep 8:25 0.04% Xorg
1657 gdm 1 59 0 87M 25M sleep 0:52 0.00% gdm-simple-gree
2235 root 1 59 0 3852K 2492K cpu/5 0:00 0.00% top
1733 admin 1 59 0 7736K 5168K sleep 0:00 0.00% sshd
372 root 4 59 0 6912K 5548K sleep 0:01 0.00% hald
401 root 1 59 0 3496K 2092K sleep 0:00 0.00% hald-addon-acpi
1642 gdm 1 59 0 82M 18M sleep 0:01 0.00% gnome-power-man
1713 root 21 59 0 8428K 5264K sleep 0:01 0.00% nscd
264 root 6 59 0 4356K 3156K sleep 0:00 0.00% devfsadm
10 root 16 59 0 16M 13M sleep 0:02 0.00% svc.startd

I will continue adding comments and samples (as per the attached file) until the system becomes unresponsive.

#3

Updated by Miha Krizaj over 7 years ago

In my case the kernel usage after 7 days uptime has reached almost 20% while idle, so I expect it to die any time now.

This problem must have been noticed elsewhere too. I think this is a chipset/CPU incompatibility in OI, as an identical system works great on Windows Server and VMware ESXi.

last pid: 2827; load avg: 0.09, 0.07, 0.07; up 7+01:42:27 10:25:30
53 processes: 52 sleeping, 1 on cpu
CPU states: 80.3% idle, 0.1% user, 19.6% kernel, 0.0% iowait, 0.0% swap
Kernel: 6850 ctxsw, 5 trap, 3638 intr, 424 syscall, 5 flt
Memory: 16G phys mem, 2580M free mem, 8187M total swap, 8187M free swap

PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
2826 root 1 52 0 3840K 1888K cpu/0 0:00 0.01% top
794 daemon 18 60 -20 3032K 2008K sleep 182:01 0.00% nfsd
349 root 43 59 0 15M 7340K sleep 0:09 0.00% nscd
580 root 21 59 0 22M 14M sleep 1:08 0.00% smbd
605 root 1 59 0 12M 6384K sleep 0:23 0.00% intrd
11 root 20 59 0 14M 13M sleep 0:20 0.00% svc.configd
567 daemon 9 59 0 15M 4664K sleep 0:12 0.00% idmapd
774 daemon 4 59 0 10M 2144K sleep 0:06 0.00% nfsmapid
1374 root 1 59 0 13M 4400K sleep 0:05 0.00% sshd
572 root 26 59 0 25M 13M sleep 0:04 0.00% fmd
454 root 5 59 0 8148K 6732K sleep 0:04 0.00% hald
590 root 1 59 0 6080K 2200K sleep 0:04 0.00% sendmail
519 root 4 59 0 8396K 1932K sleep 0:04 0.00% automountd
417 root 6 59 0 9856K 3448K sleep 0:03 0.00% devfsadm
459 root 1 59 0 3552K 2132K sleep 0:03 0.00% hald-addon-acpi
9 root 13 59 0 19M 12M sleep 0:02 0.00% svc.startd
507 root 4 59 0 10M 3828K sleep 0:02 0.00% inetd
553 root 11 59 0 9540K 2264K sleep 0:01 0.00% syslogd
777 daemon 2 60 -20 2960K 1924K sleep 0:00 0.00% lockd
247 root 5 60 -20 2608K 1536K sleep 0:00 0.00% zonestatd
1375 admin 1 59 0 13M 4996K sleep 0:00 0.00% sshd
147 daemon 3 59 0 13M 4928K sleep 0:00 0.00% kcfd
246 root 17 59 0 11M 3372K sleep 0:00 0.00% syseventd
332 root 4 59 0 9648K 3300K sleep 0:00 0.00% picld
49 netadm 3 59 0 3980K 2780K sleep 0:00 0.00% ipmgmtd
792 root 22 59 0 10M 2740K sleep 0:00 0.00% mountd
1387 root 1 59 0 9100K 2596K sleep 0:00 0.00% bash
1378 admin 1 59 0 9104K 2584K sleep 0:00 0.00% bash
458 root 1 59 0 3956K 2352K sleep 0:00 0.00% hald-addon-cpuf
456 root 1 59 0 3948K 2344K sleep 0:00 0.00% hald-addon-netw
#4

Updated by Jon Strabala over 7 years ago

Continuing testing. As Miha indicated, it is always in the 'kernel usage'; the problem never shows up in 'top' among normal user-space processes. However, there are some VERY interesting differences in the 'powertop' output as the kernel CPU rises.

The CPU on this idle box was 99.3% idle at initial boot time, and after about 44 hours it was only 98% idle (with "set cpupm_tw_gov_interval=800000" in /etc/system).

At time of boot ... (CPU usage via top WAS at 0.7%)
date
August 31, 2011 07:07:16 AM PDT

after 44 hours it has crept up as follows (CPU usage via top WAS at 2%):
date
September 1, 2011 11:33:56 AM PDT

after 159 hours it has crept up as follows (CPU usage via top NOW at 6.5% - but WHY is the question):
date
September 6, 2011 07:21:51 AM PDT

##############################################################################
# top -s 10
last pid:  3376;  load avg:  0.01,  0.01,  0.00;  up 6+15:48:02                                                                                 07:28:25
62 processes: 61 sleeping, 1 on cpu
CPU states: 93.4% idle,  0.1% user,  6.5% kernel,  0.0% iowait,  0.0% swap
Kernel: 319 ctxsw, 3 trap, 963 intr, 61 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  1204 root        3  58    0   59M   36M sleep   31:49  0.04% Xorg
  1657 gdm         1  59    0   87M   25M sleep    3:12  0.00% gdm-simple-gree
  3375 root        1  59    0 3852K 2492K cpu/3    0:00  0.00% top
  1713 root       21  59    0 8428K 5332K sleep    0:04  0.00% nscd
  1733 admin       1  59    0 7776K 5188K sleep    0:01  0.00% sshd
  1642 gdm         1  59    0   82M   18M sleep    0:04  0.00% gnome-power-man
   264 root        6  59    0 4356K 3156K sleep    0:02  0.00% devfsadm
  1057 root        6  59    0   26M   10M sleep    0:02  0.00% fbserver
   761 root       26  59    0   18M   12M sleep    0:02  0.00% fmd
    10 root       16  59    0   16M   13M sleep    0:02  0.00% svc.startd
    12 root       22  59    0   16M   14M sleep    0:07  0.00% svc.configd

Another log file with various output is attached showing some of the 'deltas', called 'power_cpu_issue_PLUS_159ahours.txt'; compare it to the older sample set called 'power_cpu_issue_PLUS_44hours.txt'.

I can not really tell if there is a true difference in 'mpstat':

root@lab10:/tmp# sdiff -w 155 aft_44_hours.txt aft_159_hours.txt
root@lab10:~# mpstat 5 2                                                        root@lab10:~# mpstat 5 2
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt id     CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt id
  0    0   0    0   510  101   74    0    1    0    0    11    0   4   0  9  |    0    0   0    1   514  101   90    0    1    0    0     9    0  13   0  8
  1    0   0    0    78   10   37    0    1    0    0    11    0   1   0  9  |    1    0   0    0    81    9   38    0    1    0    0     7    0   2   0  9
  2    1   0    0    74    9   37    0    1    0    0     9    0   1   0  9  |    2    0   0    0    73    8   35    0    1    0    0     6    0   2   0  9
  3    0   0    0    71    8   35    0    1    0    0     7    0   1   0  9  |    3    0   0    0    67    7   32    0    1    0    0     5    0   2   0  9
  4    1   0    0    76    9   39    0    1    0    0    10    0   1   0  9  |    4    0   0    0    75    8   37    0    1    0    0     6    0   2   0  9
  5    1   0    1    70    7   35    0    1    0    0     8    0   1   0  9  |    5    0   0    0    64    6   31    0    1    0    0     6    0   2   0  9
  6    1   0    0    71    7   35    0    1    0    0     8    0   1   0  9  |    6    0   0    0    65    6   31    0    1    0    0     6    0   2   0  9
  7    0   0    0    73    7   36    0    1    0    0     7    0   1   0  9  |    7    0   0    0    65    6   31    0    1    0    0     5    0   2   0  9
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt id     CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt id
  0    4   0    0   519  101   86    0    2    0    0    61    0   7   0  9  |    0    0   0    0   513  101  128    0    1    1    0    36    0  26   0  7
  1    0   0    0    78    9   34    0    2    0    0     7    0   1   0  9  |    1    0   0    0    56    2   29    0    0    0    0     6    0   3   0  9
  2    0   0    0    80    9   34    0    1    0    0    10    0   1   0  9  |    2    0   0    0    34    3   15    0    0    0    0     6    0   2   0  9
  3    0   0    0    51    5   25    0    1    0    0     0    0   1   0  9  |    3    0   0    0    80    6   35    0    0    1    0     0    0   4   0  9
  4    0   0    0    86   12   43    0    1    0    0     0    0   1   0  9  |    4    5   0    0    59    1   30    0    1    1    0    61    0   3   0  9
  5    0   0    0    91    4   43    0    1    0    0     0    0   1   0  9  |    5    0   0    0    72    4   33    0    1    0    0     0    0   4   0  9
  6    0   0    0    93   13   47    0    1    0    0    13    0   1   0  9  |    6    0   0    0    82   14   40    0    0    0    0     0    0   5   0  9
  7    0   0    0    23    0    9    0    1    0    0     2    0   0   0 10  |    7    0   0    0    32    3   16    0    0    0    0    12    0   2   0  9
root@lab10:~#                                                                   root@lab10:~#
root@lab10:/tmp#


Obviously the CPU C0 (running) residency has gone up from 3.1% to 10.4% as shown by the diff below, but I don't know why:

root@lab10:/tmp# diff powertop.5.samp1.44hours.txt powertop.5.samp2.159hours.txt
1,2c1
< root@lab10:~# powertop
< OpenIndiana PowerTOP version 1.2   (C) 2009 Intel Corporation
---
>                                                             OpenIndiana PowerTOP version 1.2
4c3,14
< Collecting data for 5.00 second(s)
---
> C-states (idle power)   Avg     Residency                                           P-states (frequencies)
> C0 (cpu running)                (9.9%)                                              1600 Mhz        0.0%
> C1                      0.0ms   (0.0%)                                              1800 Mhz        0.0%
> C2                      0.0ms   (0.0%)                                              2000 Mhz        0.0%
> C3                      2.4ms   (90.1%)                                             2200 Mhz        0.0%
>                                                                                     2400 Mhz        0.0%
>                                                                                     2600 Mhz        0.0%
>                                                                                     2800 Mhz        0.0%
>                                                                                     3000 Mhz        0.0%
>                                                                                     3200 Mhz        0.0%
>                                                                                     3400 Mhz        0.0%
>                                                                                     3800 Mhz(turbo) 100.0%
6,23c16
< root@lab10:~# vi /tmp/power.top.out
< root@lab10:~# cat  /tmp/power.top.out
<                                                     OpenIndiana PowerTOP version 1.2
<
< C-states (idle power)   Avg     Residency                                   P-states (frequencies)
< C0 (cpu running)                (3.4%)                                      1600 Mhz        0.0%
< C1                      0.0ms   (0.0%)                                      1800 Mhz        0.0%
< C2                      0.0ms   (0.0%)                                      2000 Mhz        0.0%
< C3                      2.4ms   (96.6%)                                     2200 Mhz        0.0%
<                                                                             2400 Mhz        0.0%
<                                                                             2600 Mhz        0.0%
<                                                                             2800 Mhz        0.0%
<                                                                             3000 Mhz        0.0%
<                                                                             3200 Mhz        0.0%
<                                                                             3400 Mhz        0.0%
<                                                                             3604 Mhz(turbo) 100.0%
<
< Wakeups-from-idle per second: 3201.6    interval: 5.0s
---
> Wakeups-from-idle per second: 3103.9    interval: 5.0s
27,29c20,22
< 89.0% (2849.0)                 sched :  <xcalls> unix`dtrace_xcall_func
<  3.1% (100.0)               <kernel> :  genunix`clock
<  2.4% ( 76.0)               <kernel> :  genunix`cv_wakeup
---
> 91.8% (2848.9)                 sched :  <xcalls> unix`dtrace_xcall_func
>  3.2% (100.2)               <kernel> :  genunix`clock
>  2.5% ( 76.2)               <kernel> :  genunix`cv_wakeup
31,32c24,25
<  0.2% (  8.0)               <kernel> :  cpudrv`cpudrv_monitor_disp
<  0.2% (  7.6)               <kernel> :  ehci`ehci_handle_root_hub_status_change
---
>  0.3% (  8.0)               <kernel> :  ehci`ehci_handle_root_hub_status_change
>  0.3% (  8.0)               <kernel> :  cpudrv`cpudrv_monitor_disp
34c27
<  0.1% (  2.0)            <interrupt> :  e1000g#0
---
>  0.1% (  2.2)            <interrupt> :  e1000g#0
36a30,31
>  0.0% (  1.0)               <kernel> :  genunix`sigalarm2proc
>  0.0% (  1.0)            <interrupt> :  ehci#0
38,40d32
<  0.0% (  0.8)            <interrupt> :  ehci#0
<  0.0% (  0.2)               <kernel> :  genunix`kmem_update
<  0.0% (  0.2)               <kernel> :  ahci`ahci_watchdog_handler
42c34
<  0.0% (  0.2)               <kernel> :  kcf`rnd_handler
---
>  0.0% (  0.2)               <kernel> :  ip`tcp_timer_callback
44c36,39
<  0.0% (  0.2)               <kernel> :  ibcm`ibcm_path_cache_timeout_cb
---
>  0.0% (  0.2)                  sched :  <xcalls> unix`hati_demap_func
>  0.0% (  0.2)               <kernel> :  genunix`kmem_update
>  0.0% (  0.2)               <kernel> :  kcf`rnd_handler
>  0.0% (  0.2)               <kernel> :  ahci`ahci_watchdog_handler
root@lab10:/tmp#

In addition I can not really tell if dtrace is showing anything (or if I am using dtrace correctly; I just time it in my head and hit Ctrl-C to stop it):

root@lab10:~# sdiff /tmp/dtrace_44_hours.txt /tmp/dtrace_159_hours.txt
root@lab10:~# dtrace -n 'profile-10ms{@[stack()] = count()}'    root@lab10:~# dtrace -n 'profile-10ms{@[stack()] = count()}'
dtrace: description 'profile-10ms' matched 1 probe              dtrace: description 'profile-10ms' matched 1 probe
^C                                                              ^C

              unix`av_check_softint_pending+0x50              |               unix`mutex_enter+0x10
              unix`av_dispatch_softvect+0x48                  |               unix`hment_mapcnt+0x1d
              apix`apix_dispatch_softint+0x34                 |               unix`hat_page_getshare+0x16
              unix`switch_sp_and_call+0x13                    |               genunix`fsflush_do_pages+0x236
                1                                             |               genunix`fsflush+0x39a
                                                              |               unix`thread_start+0x8
              unix`av_dispatch_softvect                       <
              unix`switch_sp_and_call+0x13                    <
                1                                             <
                                                              <
              unix`av_dispatch_softvect+0x1                   <
              unix`switch_sp_and_call+0x13                    <
                1                                                               1

              genunix`lbolt_event_driven+0x18                 |               unix`x86pte_mapin+0x47
              genunix`ddi_get_lbolt+0xd                       |               unix`x86pte_access_pagetable+0x3d
              SDC`sysdc_update_times+0x50                     |               unix`x86pte_set+0x8d
              SDC`sysdc_compute_pri+0x4b                      |               unix`hati_pte_map+0x187
              SDC`sysdc_setrun+0x35                           |               unix`hati_load_common+0x15d
              SDC`sysdc_wakeup+0x15                           |               unix`hat_memload+0x81
              genunix`sleepq_wakeone_chan+0x89                |               unix`segkmem_xalloc+0x13b
              genunix`cv_signal+0x8e                          |               unix`segkmem_alloc_vn+0xcd
              genunix`taskq_dispatch+0x374                    |               unix`segkmem_alloc+0x24
              zfs`zio_taskq_dispatch+0xab                     |               genunix`vmem_xalloc+0x546
              zfs`zio_interrupt+0x1c                          |               genunix`vmem_alloc+0x161
              zfs`vdev_disk_io_intr+0x6b                      |               genunix`kmem_alloc+0x64
              genunix`biodone+0x84                            |               genunix`kmem_zalloc+0x3b
              sd`sd_buf_iodone+0x67                           |               autofs`auto_calldaemon+0x22d
              sd`sd_mapblockaddr_iodone+0x4f                  |               autofs`auto_send_unmount_request+0x6b
              sd`sd_return_command+0x1d9                      |               autofs`unmount_node+0x181
              sd`sdintr+0x58d                                 |               autofs`try_unmount_node+0x13b
              scsi`scsi_hba_pkt_comp+0x15c                    |               autofs`unmount_subtree+0x79
              sata`sata_txlt_rw_completion+0x1d3              |               autofs`unmount_tree+0xcf
              ahci`ahci_flush_doneq+0x6a                      |               autofs`unmount_zone_tree+0x1e
                1                                                               1

              unix`0xfffffffffb85a6ca                         |               genunix`fsflush_do_pages+0x1da
              kstat`read_kstat_data+0x48d                     |               genunix`fsflush+0x39a
              kstat`kstat_ioctl+0x4a                          |               unix`thread_start+0x8
              genunix`cdev_ioctl+0x45                         <
              specfs`spec_ioctl+0x5a                          <
              genunix`fop_ioctl+0x7b                          <
              genunix`ioctl+0x18e                             <
              unix`_sys_sysenter_post_swapgs+0x149            <
                1                                                               1

              genunix`cpu_update_pct+0x3e                     |               unix`page_pptonum+0xc
              genunix`restore_mstate+0x40                     |               genunix`fsflush_do_pages+0x35d
              unix`swtch+0xfb                                 |               genunix`fsflush+0x39a
              unix`idle+0xc4                                  <
              unix`thread_start+0x8                                           unix`thread_start+0x8
                1                                                               1

              unix`av_dispatch_softvect+0x2c                  |                 7
              apix`apix_dispatch_softint+0x34                 <
              unix`switch_sp_and_call+0x13                    <
                2                                             <
                                                              <
              unix`av_dispatch_softvect+0x41                  <
              apix`apix_dispatch_softint+0x34                 <
              unix`switch_sp_and_call+0x13                    <
                2                                             <
                                                              <
                9                                             <
                                                              <
              apix`apix_dispatch_softint+0x27                 <
              unix`switch_sp_and_call+0x13                    <
               12                                             <

              unix`i86_mwait+0xd                                              unix`i86_mwait+0xd
              unix`cpu_idle_mwait+0xf1                                        unix`cpu_idle_mwait+0xf1
              unix`cpu_acpi_idle+0x8d                                         unix`cpu_acpi_idle+0x8d
              unix`cpu_idle_adaptive+0x19                                     unix`cpu_idle_adaptive+0x19
              unix`idle+0x114                                                 unix`idle+0x114
              unix`thread_start+0x8                                           unix`thread_start+0x8
               53                                             |                10

              unix`acpi_cpu_cstate+0x2ae                                      unix`acpi_cpu_cstate+0x2ae
              unix`cpu_acpi_idle+0x82                                         unix`cpu_acpi_idle+0x82
              unix`cpu_idle_adaptive+0x19                                     unix`cpu_idle_adaptive+0x19
              unix`idle+0x114                                                 unix`idle+0x114
              unix`thread_start+0x8                                           unix`thread_start+0x8
            15340                                             |             10027

Once again, I will continue adding comments and log file samples (as per the attached file) until the system becomes unresponsive.

#5

Updated by Jon Strabala over 7 years ago

Note the attached file "power_cpu_issue_PLUS_0hours.txt" is from an initial reboot; it can be compared with the prior two files "power_cpu_issue_PLUS_44hours.txt" and "power_cpu_issue_PLUS_159hours.txt". I will no longer capture these files; instead I will be capturing nine (9) dtrace outputs on the hour, until the system hangs.

I tried to run Brendan Gregg's 'hotkernel' from the DTT package at a 'rup' of 160 hours (CPU usage via top NOW at 6.5%).

root@lab10:/opt/DTT# ./hotkernel
Sampling... Hit Ctrl-C to end.
^C

I waited about three minutes; it just doesn't print anything or come back. Not only that, I can not even log into another shell via ssh.

However I was able to CTRL-Z it and then kill -9 the dtrace script.

^Z
[1]+  Stopped                 ./hotkernel
root@lab10:/opt/DTT#
root@lab10:/opt/DTT# kill %1
root@lab10:/opt/DTT#
root@lab10:/opt/DTT# jobs
[1]+  Running                 ./hotkernel &
root@lab10:/opt/DTT#
root@lab10:/opt/DTT# kill -TERM %1
root@lab10:/opt/DTT#
root@lab10:/opt/DTT# jobs
[1]+  Running                 ./hotkernel &
root@lab10:/opt/DTT# kill -9 %1
[1]+  Killed                  ./hotkernel
root@lab10:/opt/DTT#
root@lab10:/opt/DTT# jobs
root@lab10:/opt/DTT#

Okay, I got the system back, but playing around with dtrace from the DTT package eventually locked up my system, so this is the end of my log file series. Nothing I could do (^C, ^D or ^Z) could get the 'dtrace' script to die, and I couldn't open a new 'ssh' session, so I had to give up on the log file captures I had been doing (not that they seemed to tell me anything new).

So once again I used IPMI for a hard power off and reboot:

Power Off Server - Orderly Shutdown ( no effect ... )
Power Off Server - Immediate (worked but rather severe .... )
Power On Server ( to bring it back up online )

I captured the same data, BUT this time at 0 hours, about 2-5 minutes after the reboot:

See: power_cpu_issue_PLUS_0hours.txt

Right now, at about 2 minutes after reboot (CPU usage via top WAS at 0.1%), from the bottom of the file power_cpu_issue_PLUS_0hours.txt:

root@lab10:~# top -s 10
last pid:  1921;  load avg:  0.00,  0.02,  0.01;  up 0+00:10:18                                                                                  08:55:46
62 processes: 61 sleeping, 1 on cpu
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
Kernel: 317 ctxsw, 698 intr, 58 syscall
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  1282 root        3  59    0   60M   38M sleep    0:04  0.05% Xorg
  1595 gdm         1  59    0   87M   25M sleep    0:00  0.01% gdm-simple-gree
  1920 root        1  59    0 3852K 2520K cpu/1    0:00  0.00% top
  1594 gdm         1  59    0   82M   18M sleep    0:00  0.00% gnome-power-man
  1780 root       22  59    0 5728K 3668K sleep    0:00  0.00% nscd
  1585 gdm         1  59    0 7152K 5944K sleep    0:00  0.00% gconfd-2
  1821 admin       1  59    0 7736K 5180K sleep    0:00  0.00% sshd
  1122 root        6  59    0   26M   10M sleep    0:00  0.00% fbserver
   208 root        6  59    0 4348K 3152K sleep    0:00  0.00% devfsadm
    10 root       14  59    0   15M   13M sleep    0:01  0.00% svc.startd
   816 root        4  59    0 6728K 3872K sleep    0:00  0.00% inetd
    12 root       21  59    0   16M   14M sleep    0:04  0.00% svc.configd
   837 root       25  59    0   18M   11M sleep    0:00  0.00% fmd
   270 root        5  60  -20 2548K 1512K sleep    0:00  0.00% zonestatd
root@lab10:~#

Now at 17 minutes after reboot, I can easily run Brendan Gregg's 'hotkernel' from the DTT package at a 'rup' of 17 minutes (where it hung before at a 'rup' of 150 hours); contrast this with what happened above.

root@lab10:/opt/DTT# rup lab10
          lab10    up          17 mins,  load average: 0.00, 0.00, 0.01
root@lab10:/opt/DTT# ./hotkernel
Sampling... Hit Ctrl-C to end.
^C
FUNCTION                                                COUNT   PCNT
genunix`kmem_cache_alloc                                    1   0.0%
unix`av_clear_softint_pending                               1   0.0%
unix`cpu_idle_mwait                                         1   0.0%
zfs`metaslab_segsize_compare                                1   0.0%
unix`disp                                                   1   0.0%
unix`atomic_or_64                                           1   0.0%
unix`0xfffffffffb85                                         1   0.0%
unix`splr                                                   1   0.0%
unix`default_kstat_snapshot                                 1   0.0%
genunix`syscall_entry                                       1   0.0%
unix`tsc_gethrtimeunscaled_delta                            1   0.0%
e1000g`e1000g_m_stat                                        1   0.0%
apix`apix_dispatch_pending_autovect                         1   0.0%
genunix`callout_heap_delete                                 1   0.0%
unix`mutex_adaptive_tryenter                                1   0.0%
zfs`zio_vdev_io_start                                       1   0.0%
unix`gdt_update_usegd                                       1   0.0%
unix`do_splx                                                1   0.0%
unix`cpu_acpi_idle                                          1   0.0%
unix`unset_idle_cpu                                         1   0.0%
unix`mutex_enter                                            2   0.0%
unix`tsc_gethrtime_delta                                    2   0.0%
unix`i86_monitor                                            2   0.0%
zfs`lzjb_compress                                           2   0.0%
unix`disp_anywork                                          30   0.1%
unix`acpi_cpu_cstate                                      704   1.4%
unix`i86_mwait                                          50403  98.5%
root@lab10:/opt/DTT#

Cool, I can run 'hotkernel' now, but why does it become very non-responsive after a few days?

I made a modified version of the 'hotkernel' script that terminates after 30 seconds, and will run it every hour from cron to try to find out what changes as the system stays up a long time ....

root@lab10:/opt/DTT# diff hotkernel hotkernel_30sec
83a84,88
>       tick-1s
>       / i++ >= 30 /
>       {
>           exit(0);
>       }
93c98
< print "Sampling... Hit Ctrl-C to end.\n";
---
> print "Sampling... Hit Ctrl-C to end. -OR- wait 30 seconds for autoquit\n";
root@lab10:/opt/DTT#
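
Put together, the profiling part of the modified script boils down to the pattern below (a minimal sketch of the same idea, not the full hotkernel script, which also post-processes the counts into percentages):

# kernel profile that auto-quits after ~30 seconds, same idea as hotkernel_30sec
dtrace -n '
    profile-1001hz
    /arg0/
    {
        /* arg0 is the kernel PC when the sample lands in the kernel */
        @[func(arg0)] = count();
    }
    tick-1s
    /i++ >= 30/
    {
        exit(0);
    }'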

In addition I modified several other DTT scripts to exit in 10 sec or 30 sec (as above) and made an hourly cron file to capture a lot of dtrace outputs, e.g.:

#!/bin/csh -f
cd /opt/DTT
set yyyymmddhh=`date +"%Y-%m-%d_%H"`

echo "=========================================================================" 
echo "" 
echo "date" 
date

echo "" 
echo "rup lab10" 
rup lab10

echo "" 
echo "top -s 10 -n 15 -d 2 -b" >& /root/hotkernel_30/data/intoncpu_10sec.$yyyymmddhh.out
top -s 10 -n 15 -d 2 -b

echo "" 
echo "./hotkernel_30sec >& /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out" 
./hotkernel_30sec >& /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out

echo "" 
echo "wc -l /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out" 
wc -l /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out

echo "" 
echo "cat /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out" 
cat /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out

echo "" 
echo "take four dtrace CPU measures: cpuwalk.d 10, intoncpu_10sec.d, inttimes_10sec.d, xcallsbypid_10sec.d" 
cd /opt/DTT/Cpu
./cpuwalk.d 10 >& /root/hotkernel_30/data/cpuwalk.d_10.$yyyymmddhh.out
./intoncpu_10sec.d >& /root/hotkernel_30/data/intoncpu_10sec.$yyyymmddhh.out
./inttimes_10sec.d >& /root/hotkernel_30/data/inttimes_10sec.$yyyymmddhh.out
./xcallsbypid_10sec.d >& /root/hotkernel_30/data/xcallsbypid_10sec.$yyyymmddhh.out

echo "" 
echo "take a few Bin measures: sysbypid_10sec.d, topsyscall -C -s 5 2, iotop -C -t 10 5 4, cswstat_10sec.d" 
cd /opt/DTT/Bin
./sysbypid_10sec.d  >& /root/hotkernel_30/data/sysbypid_10sec.$yyyymmddhh.out
./topsyscall -C -s 5 2  >& /root/hotkernel_30/data/topsyscall-C-s_5_2.$yyyymmddhh.out
./iotop -C -t 10 5 4 >& /root/hotkernel_30/data/iotop-C-t_10_5_4.$yyyymmddhh.out
./cswstat_10sec.d >& /root/hotkernel_30/data/cswstat_10sec.$yyyymmddhh.out

The cron entry is as follows (every hour at 11 minutes past the hour), with the first set of data taken about 11 minutes after the latest reboot:

11 * * * * /root/hotkernel_30/runh30.csh >> /root/hotkernel_30/logs/runh30.log 2>&1

Hopefully it will lock up in about 3-4 days, and then I can look at the various 'dtrace' files after yet another hard reboot and actually isolate something specific about what is causing this system to hard hang.

#6

Updated by Jon Strabala over 7 years ago

With respect to Sandy Bridge power.conf and/or deep-cpu-idle issues for chips/chipsets like the E3-1240 Xeon, this issue #1333 and the older issue #423 exist, and in the illumos IRC chat we also have a few comments by "DontKnwMuch" and "lblume" ... I am now leaning towards a lack of integration of bug #6911650.

Despite the fact that I have started a set of hourly utility snapshots and 'dtrace' runs to try to isolate this, just like DontKnwMuch and Miha Krizaj I see that after a while kernel CPU goes up (a slow creep) to the point where scripts like 'hotkernel' will hang the system. Currently every hour I capture the following: http://www.pastie.org/2493215 (9 normal commands and 10 dtrace scripts). I would love to add more checks; the system will crash in about 2-3 days.

Note: if anyone can provide a 'dtrace' script or a link to one, I would be happy to add it to my test bed to try to figure this issue out. I would love to get it fixed prior to the release of oi_151a.

As I said I think this might be SUN-BUG #6911650 (part of Sun Patch-id 142910-04),

Synopsis: Deeper C-state entry regressed

Refer to http://wesunsolve.net/bugid.php/id/6911650 - I wonder if this update has been ported into the illumos fork/src tree.

SUN-BUG #6911650 says specifically

Dtrace shows acpi_cpu_cstate is never calling cpu_acpi_read_port()
to put CPUs into deeper c-states. Ah, the problem is a typo in the
*mcpu_mwait check with ACPI_ADR_SPACE_SYSTEM_IO
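
If someone wants to check that on a running system, one way (assuming the function is not inlined, so fbt can see it) would be to watch for entries into cpu_acpi_read_port() while the box idles:

# count calls into cpu_acpi_read_port() over 60 seconds; an empty count while
# the machine idles would match the symptom described in 6911650
dtrace -n 'fbt::cpu_acpi_read_port:entry { @ = count(); } tick-60s { exit(0); }'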

Perhaps illumos needs an update to the file src/uts/i86pc/os/cpupm/cpu_idle.c; indeed I can find a 'back-patch' of 6911650 when looking at the history of this file.

If I were actually building oi_151 from source (but I don't do that yet) I would test this out myself, e.g. trying to fix "cpu_idle.c" as per the URL above.

#7

Updated by Jon Strabala over 7 years ago

The system became unstable after 14 hours: I lost 'ssh' connectivity, and then even an iLOM GUI session was unusable, blinking on and off at about 15x per second.

In fact all 'pings' to the host's IP address time out (but pings to the iLOM IP address succeed).

Currently, during this "network failure" of the host (named lab10), the 'hotkernel' script is running and NOT completing. It should have timed out and exited in 30 seconds due to a custom modification in the script, BUT it has been running for 45 minutes (90X too long) prior to a needed reboot.

It is also possible, based on the 'inttimes' analysis, that there is an issue with 'ahci0'.

Fortunately every measure has been captured in a log file (each hour), so when I rebooted via the required hard power cycle no logs were lost.

ALL log files and data files are in: crash1.2011-09-06.tar.gz

There are lots of hourly sample files in the sub-directories data.crash1/ and logs/crash1/; note I added more scripts to capture more things during the run, thus there are more output files towards the end. If anyone has any ideas for more items ('std.' scripts, utilities, or 'dtrace' scripts) to capture additional info, let me know and I will run the analysis again.

So once again I used IPMI for a hard power off and reboot:

Power Off Server - Orderly Shutdown ( no effect ... )
Power Off Server - Immediate (worked but rather severe .... )
Power On Server ( to bring it back up online )

Analysis after reboot

Basic system info

lab10# uname -a
SunOS lab10 5.11 oi_148b i86pc i386 i86pc Solaris

Basically, with a lot of dtrace tracing, what we have is that on an idle system the kernel CPU crept from a measly 0.5% to just 3.0%, as measured by the 'top' command.

It is also interesting what 'hotkernel' shows for acpi_cpu_cstate, the only metric that grows (besides kernel CPU): unix`acpi_cpu_cstate grows from about 700 samples in a 30 second run to 83,235 in the last 30 second run.
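
(For reference, the per-hour acpi_cpu_cstate lines below can be pulled straight out of the hourly log files with a grep along these lines; exact file names may differ.)

# extract the hourly acpi_cpu_cstate sample counts from the run logs
grep "acpi_cpu_cstate" runh30.*.log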

Below is a start/stop 'rup' and an hourly 'hotkernel_30sec | grep acpi_cpu_cstate':

runh30.2011-09-06_10.log:          lab10    up           9 mins,  load average: 0.00, 0.03, 0.02
runh30.2011-09-06_10.log:unix`acpi_cpu_cstate                                      564   0.2%
runh30.2011-09-06_11.log:unix`acpi_cpu_cstate                                      682   0.3%
runh30.2011-09-06_12.log:unix`acpi_cpu_cstate                                      710   0.3%
runh30.2011-09-06_13.log:unix`acpi_cpu_cstate                                      743   0.3%
runh30.2011-09-06_14.log:unix`acpi_cpu_cstate                                      725   0.3%
runh30.2011-09-06_15.log:unix`acpi_cpu_cstate                                      820   0.3%
runh30.2011-09-06_16.log:unix`acpi_cpu_cstate                                      649   0.3%
runh30.2011-09-06_17.log:unix`acpi_cpu_cstate                                     1050   0.4%
runh30.2011-09-06_18.log:unix`acpi_cpu_cstate                                     7536   3.0%
runh30.2011-09-06_19.log:unix`acpi_cpu_cstate                                    18721   7.6%
runh30.2011-09-06_20.log:unix`acpi_cpu_cstate                                    41254  16.7%
runh30.2011-09-06_21.log:unix`acpi_cpu_cstate                                    55342  22.4%
runh30.2011-09-06_22.log:unix`acpi_cpu_cstate                                    60255  24.5%
runh30.2011-09-06_23.log:unix`acpi_cpu_cstate                                    83235  34.6%
runh30.2011-09-06_23.log:          lab10    up          12:57,    load average: 0.00, 0.00, 0.00

The full 'hotkernel_30sec' sample from 2011-09-06_10 (uptime 0 hours 6 minutes)

http://www.pastie.org/2497598

and tail -10 of the file

unix`mutex_exit                                             4   0.0%
apix`apix_dispatch_softint                                  5   0.0%
unix`tsc_read                                               5   0.0%
unix`do_splx                                                6   0.0%
unix`ddi_get32                                              8   0.0%
unix`vsnprintf                                              8   0.0%
unix`mutex_delay_default                                   11   0.0%
unix`mutex_enter                                           17   0.0%
unix`acpi_cpu_cstate                                      648   0.3%
unix`i86_mwait                                         247108  99.7%

The full 'hotkernel_30sec' sample from 2011-09-06_16 (uptime 6 hours 49 minutes)

http://www.pastie.org/2497600

unix`cpu_acpi_idle                                          2   0.0%
dtrace`dtrace_ioctl                                         2   0.0%
unix`mutex_delay_default                                    3   0.0%
unix`tsc_gethrtimeunscaled_delta                            3   0.0%
unix`mutex_enter                                            3   0.0%
unix`page_nextn                                             5   0.0%
genunix`fsflush_do_pages                                    5   0.0%
unix`do_splx                                               21   0.0%
unix`acpi_cpu_cstate                                      649   0.3%
unix`i86_mwait                                         247021  99.7%

The full 'hotkernel_30sec' sample from 2011-09-06_23 (uptime 12 hours 57 minutes)

http://www.pastie.org/2497629

zfs`lzjb_compress                                          24   0.0%
unix`tsc_gethrtimeunscaled_delta                           25   0.0%
unix`todpc_rtcget                                          30   0.0%
unix`ddi_get32                                             44   0.0%
unix`do_splx                                               48   0.0%
genunix`fsflush                                            59   0.0%
unix`page_nextn                                           118   0.0%
genunix`fsflush_do_pages                                  118   0.0%
unix`acpi_cpu_cstate                                    83235  34.6%
unix`i86_mwait                                         156461  65.0%

Looking at the change in 10 second samples of the 'dtrace' script 'inttimes':

Perhaps ahci0, which has grown from 875349 to 3110530, is causing an issue? I am not sure, but that is a 3.55x increase, from roughly 87,534 to 311,053 ns of interrupt time per second.
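
One way to tell whether that ahci0 growth is more interrupts or just longer ones would be to count interrupts per device directly; a rough sketch using the same sdt probe the DTT inttimes script hooks (assuming, as in that script, that arg0 is the dev_info pointer):

# count hardware interrupts per device over 10 seconds
dtrace -qn '
    sdt:::interrupt-start
    {
        /* arg0 is the struct dev_info * of the interrupting device */
        @[stringof(((struct dev_info *)arg0)->devi_node_name)] = count();
    }
    tick-10s
    {
        exit(0);
    }'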

lab10# cat inttimes_10sec.2011-09-06_10.out
Tracing... Hit Ctrl-C to end.
     DEVICE           TIME (ns)
      ehci1              67104
      ehci0              75402
    e1000g0             567550
      ahci0             875349

lab10# cat inttimes_10sec.2011-09-06_12.out
Tracing... Hit Ctrl-C to end.
     DEVICE           TIME (ns)
      ehci1              71231
      ehci0              74393
    e1000g0             102241
      ahci0             838477

lab10# cat inttimes_10sec.2011-09-06_14.out
Tracing... Hit Ctrl-C to end.
     DEVICE           TIME (ns)
      ehci1              72452
      ehci0              79223
    e1000g0             111892
      ahci0            1488218

lab10# cat inttimes_10sec.2011-09-06_16.out
Tracing... Hit Ctrl-C to end.
     DEVICE           TIME (ns)
      ehci1              72066
      ehci0              72927
    e1000g0             126142
      ahci0             814604

lab10# cat inttimes_10sec.2011-09-06_18.out
Tracing... Hit Ctrl-C to end.
dtrace: 313 dynamic variable drops with non-empty dirty list
dtrace: 47 dynamic variable drops with non-empty dirty list
dtrace: 389 dynamic variable drops with non-empty dirty list
dtrace: 159 dynamic variable drops with non-empty dirty list
dtrace: 151 dynamic variable drops with non-empty dirty list
dtrace: 445 dynamic variable drops with non-empty dirty list
     DEVICE           TIME (ns)
      ehci0              74580
      ehci1              77788
    e1000g0             100090
      ahci0            2574193

lab10# cat inttimes_10sec.2011-09-06_20.out
Tracing... Hit Ctrl-C to end.
dtrace: 771 dynamic variable drops with non-empty dirty list
dtrace: 1230 dynamic variable drops with non-empty dirty list
dtrace: 1640 dynamic variable drops with non-empty dirty list
dtrace: 481 dynamic variable drops with non-empty dirty list
dtrace: 55 dynamic variable drops with non-empty dirty list
dtrace: 599 dynamic variable drops with non-empty rinsing list
dtrace: 650 dynamic variable drops with non-empty dirty list
dtrace: 155 dynamic variable drops with non-empty dirty list
dtrace: 33 dynamic variable drops with non-empty dirty list
dtrace: 273 dynamic variable drops with non-empty dirty list
     DEVICE           TIME (ns)
      ehci1              77913
      ehci0              81590
    e1000g0             120083
      ahci0            2716996

lab10# cat inttimes_10sec.2011-09-06_21.out
Tracing... Hit Ctrl-C to end.
dtrace: 2091 dynamic variable drops with non-empty dirty list
dtrace: 1438 dynamic variable drops with non-empty dirty list
dtrace: 863 dynamic variable drops with non-empty dirty list
dtrace: 457 dynamic variable drops with non-empty dirty list
dtrace: 2280 dynamic variable drops with non-empty dirty list
dtrace: 1926 dynamic variable drops with non-empty dirty list
dtrace: 1802 dynamic variable drops with non-empty dirty list
dtrace: 2471 dynamic variable drops with non-empty dirty list
dtrace: 6 dynamic variable drops with non-empty rinsing list
dtrace: 1217 dynamic variable drops with non-empty dirty list
dtrace: 1431 dynamic variable drops with non-empty dirty list
dtrace: 1 dynamic variable drop with non-empty rinsing list
dtrace: 1328 dynamic variable drops with non-empty dirty list
     DEVICE           TIME (ns)
      ehci0              70275
      ehci1              72764
    e1000g0             115825
      ahci0            2150413

lab10# cat inttimes_10sec.2011-09-06_23.out
Tracing... Hit Ctrl-C to end.
dtrace: 10056 dynamic variable drops with non-empty dirty list
dtrace: 4 dynamic variable drops with non-empty rinsing list
dtrace: 12086 dynamic variable drops with non-empty dirty list
dtrace: 9 dynamic variable drops with non-empty rinsing list
dtrace: 6713 dynamic variable drops with non-empty dirty list
dtrace: 4820 dynamic variable drops with non-empty dirty list
dtrace: 4982 dynamic variable drops with non-empty dirty list
dtrace: 11144 dynamic variable drops with non-empty dirty list
dtrace: 7116 dynamic variable drops with non-empty dirty list
dtrace: 5434 dynamic variable drops with non-empty dirty list
dtrace: 7533 dynamic variable drops with non-empty dirty list
dtrace: 1 dynamic variable drop with non-empty rinsing list
dtrace: 3762 dynamic variable drops with non-empty dirty list
dtrace: 6260 dynamic variable drops with non-empty dirty list
     DEVICE           TIME (ns)
    e1000g0              48716
      ehci1              70819
      ehci0              73537
      ahci0            3110530

lab10#

Below is an hourly 'rup', a portion of 'top', and 'hotkernel_30sec | grep acpi_cpu_cstate':

==================================
Tue Sep  6 10:12:00 PDT 2011

          lab10    up           9 mins,  load average: 0.00, 0.03, 0.02

last pid:  2156;  load avg:  0.00,  0.02,  0.02;  up 0+00:09:48        10:12:03
64 processes: 63 sleeping, 1 on cpu
CPU states: 99.5% idle,  0.1% user,  0.4% kernel,  0.0% iowait,  0.0% swap
Kernel: 999 ctxsw, 2 trap, 1231 intr, 573 syscall, 2 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                      564   0.2%
==================================
Tue Sep  6 11:00:00 PDT 2011

          lab10    up          57 mins,  load average: 0.00, 0.00, 0.00

last pid:  2526;  load avg:  0.00,  0.00,  0.00;  up 0+00:57:48        11:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 99.5% idle,  0.1% user,  0.4% kernel,  0.0% iowait,  0.0% swap
Kernel: 393 ctxsw, 378 trap, 1217 intr, 1204 syscall, 330 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                      682   0.3%
==================================
Tue Sep  6 12:00:00 PDT 2011

          lab10    up           1:57,    load average: 0.00, 0.00, 0.00

last pid:  2782;  load avg:  0.00,  0.00,  0.00;  up 0+01:57:48        12:00:03
64 processes: 63 sleeping, 1 on cpu
CPU states: 99.1% idle,  0.1% user,  0.7% kernel,  0.0% iowait,  0.0% swap
Kernel: 658 ctxsw, 2 trap, 1321 intr, 540 syscall, 2 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                      710   0.3%
==================================
Tue Sep  6 13:00:00 PDT 2011

          lab10    up           2:57,    load average: 0.00, 0.00, 0.00

last pid:  2879;  load avg:  0.00,  0.00,  0.00;  up 0+02:57:48        13:00:03
64 processes: 63 sleeping, 1 on cpu
CPU states: 99.2% idle,  0.0% user,  0.8% kernel,  0.0% iowait,  0.0% swap
Kernel: 895 ctxsw, 2 trap, 1422 intr, 542 syscall, 2 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                      743   0.3%
==================================
Tue Sep  6 14:00:00 PDT 2011

          lab10    up           3:57,    load average: 0.00, 0.00, 0.00

last pid:  2945;  load avg:  0.00,  0.00,  0.00;  up 0+03:57:48        14:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 98.9% idle,  0.1% user,  1.0% kernel,  0.0% iowait,  0.0% swap
Kernel: 945 ctxsw, 2 trap, 1418 intr, 561 syscall, 2 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                      725   0.3%
==================================
Tue Sep  6 15:00:00 PDT 2011

          lab10    up           4:57,    load average: 0.00, 0.00, 0.00

last pid:  3055;  load avg:  0.00,  0.00,  0.00;  up 0+04:57:49        15:00:04
65 processes: 64 sleeping, 1 on cpu
CPU states: 98.8% idle,  0.1% user,  1.1% kernel,  0.0% iowait,  0.0% swap
Kernel: 917 ctxsw, 2 trap, 1477 intr, 545 syscall, 2 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                      820   0.3%
==================================
Tue Sep  6 16:00:00 PDT 2011

          lab10    up           5:57,    load average: 0.00, 0.00, 0.00

last pid:  3248;  load avg:  0.00,  0.00,  0.00;  up 0+05:57:48        16:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 98.5% idle,  0.0% user,  1.5% kernel,  0.0% iowait,  0.0% swap
Kernel: 328 ctxsw, 2 trap, 1150 intr, 570 syscall, 2 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                      649   0.3%
==================================
Tue Sep  6 17:00:00 PDT 2011

          lab10    up           6:57,    load average: 0.00, 0.00, 0.00

last pid:  3414;  load avg:  0.00,  0.00,  0.00;  up 0+06:57:48        17:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 98.6% idle,  0.0% user,  1.4% kernel,  0.0% iowait,  0.0% swap
Kernel: 904 ctxsw, 2 trap, 1444 intr, 547 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                     1050   0.4%
==================================
Tue Sep  6 18:00:00 PDT 2011

          lab10    up           7:57,    load average: 0.00, 0.00, 0.00

last pid:  3501;  load avg:  0.00,  0.00,  0.00;  up 0+07:57:48        18:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.9% idle,  0.1% user,  2.0% kernel,  0.0% iowait,  0.0% swap
Kernel: 916 ctxsw, 2 trap, 1430 intr, 565 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                     7536   3.0%
==================================
Tue Sep  6 19:00:00 PDT 2011

          lab10    up           8:57,    load average: 0.00, 0.00, 0.00

last pid:  3540;  load avg:  0.00,  0.00,  0.00;  up 0+08:57:48        19:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.9% idle,  0.1% user,  2.0% kernel,  0.0% iowait,  0.0% swap
Kernel: 316 ctxsw, 2 trap, 1032 intr, 558 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                    18721   7.6%
==================================
Tue Sep  6 20:00:00 PDT 2011

          lab10    up           9:57,    load average: 0.00, 0.00, 0.00

last pid:  3579;  load avg:  0.00,  0.00,  0.00;  up 0+09:57:48        20:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.7% idle,  0.0% user,  2.3% kernel,  0.0% iowait,  0.0% swap
Kernel: 1169 ctxsw, 2 trap, 1554 intr, 561 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                    41254  16.7%
==================================
Tue Sep  6 21:00:00 PDT 2011

          lab10    up          10:57,    load average: 0.00, 0.00, 0.00

last pid:  3618;  load avg:  0.00,  0.00,  0.00;  up 0+10:57:48        21:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.1% idle,  0.1% user,  2.7% kernel,  0.0% iowait,  0.0% swap
Kernel: 703 ctxsw, 3 trap, 1353 intr, 562 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                    55342  22.4%
==================================
Tue Sep  6 22:00:00 PDT 2011

          lab10    up          11:57,    load average: 0.00, 0.00, 0.00

last pid:  3657;  load avg:  0.00,  0.00,  0.00;  up 0+11:57:48        22:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.5% idle,  0.0% user,  2.5% kernel,  0.0% iowait,  0.0% swap
Kernel: 986 ctxsw, 3 trap, 1370 intr, 550 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                    60255  24.5%
==================================
Tue Sep  6 23:00:00 PDT 2011

          lab10    up          12:57,    load average: 0.00, 0.00, 0.00

last pid:  3696;  load avg:  0.00,  0.00,  0.00;  up 0+12:57:48        23:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.6% idle,  0.1% user,  2.2% kernel,  0.0% iowait,  0.0% swap
Kernel: 324 ctxsw, 2 trap, 1050 intr, 556 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

unix`acpi_cpu_cstate                                    83235  34.6%
==================================
Wed Sep  7 00:00:00 PDT 2011

          lab10    up          13:57,    load average: 0.00, 0.00, 0.00

last pid:  3735;  load avg:  0.00,  0.00,  0.00;  up 0+13:57:48        00:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.0% idle,  0.0% user,  3.0% kernel,  0.0% iowait,  0.0% swap
Kernel: 324 ctxsw, 2 trap, 1096 intr, 570 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

I don't think 'lockstat' shows anything important
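
Judging by the output file names, these snapshots come from an invocation along the lines of the following (the exact command line is an assumption reconstructed from the file name):

    # -I: profiling-interrupt sampling, -k: coalesce PCs into function names,
    # -W: group events by caller only, -D 20: show only the top 20 entries,
    # sampled for the duration of 'sleep 30'
    lockstat -kIW -D 20 sleep 30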

lab10# cat lockstat-kIW-D_20_sleep_30.2011-09-06_17.out

Profiling interrupt: 23288 events in 30.017 seconds (776 events/sec)

Count indv cuml rcnt     nsec Hottest CPU+PIL        Caller
-------------------------------------------------------------------------------
23267 100% 100% 0.00      507 cpu[0]                 acpi_cpu_cstate
   10   0% 100% 0.00      328 cpu[3]                 i86_mwait
   10   0% 100% 0.00      868 cpu[7]                 (usermode)
    1   0% 100% 0.00      384 cpu[6]                 page_unlock
-------------------------------------------------------------------------------
lab10# cat lockstat-kIW-D_20_sleep_30.2011-09-06_20.out

Profiling interrupt: 23288 events in 30.018 seconds (776 events/sec)

Count indv cuml rcnt     nsec Hottest CPU+PIL        Caller
-------------------------------------------------------------------------------
23263 100% 100% 0.00      514 cpu[2]                 acpi_cpu_cstate
   12   0% 100% 0.00      704 cpu[4]                 (usermode)
   10   0% 100% 0.00      323 cpu[3]                 i86_mwait
    1   0% 100% 0.00      804 cpu[1]                 page_nextn
    1   0% 100% 0.00      434 cpu[1]+11              sep_save
    1   0% 100% 0.00      345 cpu[0]                 space_map_seg_compare
-------------------------------------------------------------------------------
lab10# cat lockstat-kIW-D_20_sleep_30.2011-09-06_23.out

Profiling interrupt: 23288 events in 30.017 seconds (776 events/sec)

Count indv cuml rcnt     nsec Hottest CPU+PIL        Caller
-------------------------------------------------------------------------------
23262 100% 100% 0.00      332 cpu[2]                 acpi_cpu_cstate
   13   0% 100% 0.00      284 cpu[6]                 (usermode)
   10   0% 100% 0.00      302 cpu[0]                 i86_mwait
    1   0% 100% 0.00      143 cpu[1]                 kmem_zalloc
    1   0% 100% 0.00      436 cpu[5]                 fsflush_do_pages
    1   0% 100% 0.00      595 cpu[0]                 do_splx
-------------------------------------------------------------------------------
lab10#
#8

Updated by Miha Krizaj over 7 years ago

I left a system running for 14 days, until it reached roughly 25% kernel usage while completely idle (which in my experience is just below its breaking point), and then took a forced crash dump.

I am attaching output from mdb -k
::stacks
::cpuinfo -v
::cycinfo -v
::walk cpu | ::print cpu_t

and a forced crash dump file link (950MB), in case someone can find the cause/source of the high CPU usage from it... The crash dump was made when the system was using some 25% in kernel while idle.

http://www.blackstudio.com/dump.zip
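
For anyone poking at the dump, a minimal sketch of reproducing the mdb output above against the unpacked dump (the path is a placeholder, and I am assuming the usual unix.N / vmcore.N pair written by savecore):

    cd /var/crash/<hostname>     # wherever the unix.0 / vmcore.0 pair was unpacked
    mdb unix.0 vmcore.0
    > ::stacks
    > ::cpuinfo -v
    > ::cycinfo -v
    > ::walk cpu | ::print cpu_t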

#9

Updated by Jon Strabala over 7 years ago

My prior systems ran the oi_148b ISO; I reproduced the same thing on an oi_151a system, which lost networking after about 20 hours of operation and then froze while running 'hotkernel' at about the 21-hour mark.

Hourly log files of my scripts and dtrace outputs are attached as: data_logs_151a_freezeup_in_21hours.tar.gz

This is the same basic syndrome: an increase in 'ahci0' counts when running 'inttimes' and an increase in 'unix`acpi_cpu_cstate' when running 'hotkernel'.
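
For reference, 'hotkernel' from the DTrace Toolkit is essentially a profile-provider sample of kernel PCs; a rough one-liner equivalent (my approximation, not the exact script) that shows the same acpi_cpu_cstate vs. i86_mwait split is:

    # sample kernel PCs at ~997 Hz for 30 seconds, counting samples per kernel function
    dtrace -n 'profile-997 /arg0/ { @[func(arg0)] = count(); } tick-30s { exit(0); }'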

Note that for about 3 hours, from 2011-09-08_13 through the end of 2011-09-08_16, I was building ONNV_BUILDNUM=152 from the nightly source, so you might ignore those files for those hours when inspecting loads and/or other system activity; at all other times this system was essentially completely idle. I will try

Below are the two interesting items I could find in my set of 21 hourly samples, followed by a 'top' analysis showing the CPU on a FULLY idle system creeping up from 99.9% idle to 97.2% idle prior to the freeze in 'hotkernel'.

root@lab10:~/hotkernel_30/data.151a# cat info.2011-09-08_07.out
date
September  8, 2011 07:08:46 AM PDT

uname -a
SunOS lab10 5.11 oi_151a i86pc i386 i86pc Solaris

rup lab10
          lab10    up          15 mins,  load average: 0.00, 0.02, 0.02

root@lab10:~/hotkernel_30/data.151a# grep ahci0 inttimes_10sec.2011-09-0*
inttimes_10sec.2011-09-08_07.out:      ahci0            1041164
inttimes_10sec.2011-09-08_08.out:      ahci0            1118775
inttimes_10sec.2011-09-08_09.out:      ahci0            1112719
inttimes_10sec.2011-09-08_10.out:      ahci0            1299571
inttimes_10sec.2011-09-08_11.out:      ahci0            1166361
inttimes_10sec.2011-09-08_12.out:      ahci0            1139335
inttimes_10sec.2011-09-08_13.out:      ahci0            1556608  * built an illumos source tree, plus other work
inttimes_10sec.2011-09-08_14.out:      ahci0           23165583  * built an illumos source tree, plus other work
inttimes_10sec.2011-09-08_15.out:      ahci0             925821  * built an illumos source tree, plus other work
inttimes_10sec.2011-09-08_16.out:      ahci0            1833976  * built an illumos source tree, plus other work
inttimes_10sec.2011-09-08_17.out:      ahci0            1329873  ** full idle again
inttimes_10sec.2011-09-08_18.out:      ahci0            2528877
inttimes_10sec.2011-09-08_19.out:      ahci0            3968721
inttimes_10sec.2011-09-08_20.out:      ahci0            2308941
inttimes_10sec.2011-09-08_21.out:      ahci0            3021367
inttimes_10sec.2011-09-08_22.out:      ahci0            3545604
inttimes_10sec.2011-09-08_23.out:      ahci0            3254049
inttimes_10sec.2011-09-09_00.out:      ahci0            4223043
inttimes_10sec.2011-09-09_01.out:      ahci0            6785020
inttimes_10sec.2011-09-09_02.out:      ahci0            2955979  *** hard/crash loss of network after this point

root@lab10:~/hotkernel_30/data.151a# cat info.2011-09-09_03.out
date
Fri Sep  9 03:00:02 PDT 2011

uname -a
SunOS lab10 5.11 oi_151a i86pc i386 i86pc

rup lab10
          lab10    up          20:06,    load average: 0.00, 0.00, 0.00

root@lab10:~/hotkernel_30/data.151a# egrep acpi_cpu_cstate\|i86_mwait hotkernel_30sec.2011-09-0*
hotkernel_30sec.2011-09-08_07.out:unix`acpi_cpu_cstate                                      667   0.3%
hotkernel_30sec.2011-09-08_07.out:unix`i86_mwait                                         247286  99.7%
hotkernel_30sec.2011-09-08_08.out:unix`acpi_cpu_cstate                                      753   0.3%
hotkernel_30sec.2011-09-08_08.out:unix`i86_mwait                                         247174  99.7%
hotkernel_30sec.2011-09-08_09.out:unix`acpi_cpu_cstate                                      690   0.3%
hotkernel_30sec.2011-09-08_09.out:unix`i86_mwait                                         247250  99.7%
hotkernel_30sec.2011-09-08_10.out:unix`acpi_cpu_cstate                                      796   0.3%
hotkernel_30sec.2011-09-08_10.out:unix`i86_mwait                                         247115  99.7%
hotkernel_30sec.2011-09-08_11.out:unix`acpi_cpu_cstate                                      647   0.3%
hotkernel_30sec.2011-09-08_11.out:unix`i86_mwait                                         247179  99.7%
hotkernel_30sec.2011-09-08_12.out:unix`acpi_cpu_cstate                                      484   0.2%
hotkernel_30sec.2011-09-08_12.out:unix`i86_mwait                                         247358  99.8%
hotkernel_30sec.2011-09-08_13.out:unix`acpi_cpu_cstate                                      578   0.2%  * built illumos
hotkernel_30sec.2011-09-08_13.out:unix`i86_mwait                                         247191  99.7%  * built illumos
hotkernel_30sec.2011-09-08_15.out:unix`acpi_cpu_cstate                                      454   0.2%  * built illumos
hotkernel_30sec.2011-09-08_15.out:unix`i86_mwait                                         247137  99.8%  * built illumos
hotkernel_30sec.2011-09-08_16.out:unix`acpi_cpu_cstate                                      426   0.2%  * built illumos
hotkernel_30sec.2011-09-08_16.out:unix`i86_mwait                                         246996  99.7%  * built illumos 
hotkernel_30sec.2011-09-08_17.out:unix`acpi_cpu_cstate                                     3626   1.5%  ** full idle again
hotkernel_30sec.2011-09-08_17.out:unix`i86_mwait                                         243855  98.4%  ** full idle again
hotkernel_30sec.2011-09-08_18.out:unix`acpi_cpu_cstate                                    10192   4.1% 
hotkernel_30sec.2011-09-08_18.out:unix`i86_mwait                                         237204  95.8%
hotkernel_30sec.2011-09-08_19.out:unix`acpi_cpu_cstate                                    16476   6.7%
hotkernel_30sec.2011-09-08_19.out:unix`i86_mwait                                         230753  93.2%
hotkernel_30sec.2011-09-08_20.out:unix`acpi_cpu_cstate                                    23956   9.7%
hotkernel_30sec.2011-09-08_20.out:unix`i86_mwait                                         223047  90.2%
hotkernel_30sec.2011-09-08_21.out:unix`acpi_cpu_cstate                                    44800  18.1%
hotkernel_30sec.2011-09-08_21.out:unix`i86_mwait                                         202065  81.7%
hotkernel_30sec.2011-09-08_22.out:unix`acpi_cpu_cstate                                    49319  20.0%
hotkernel_30sec.2011-09-08_22.out:unix`i86_mwait                                         197101  79.8%
hotkernel_30sec.2011-09-08_23.out:unix`acpi_cpu_cstate                                    57275  23.2%
hotkernel_30sec.2011-09-08_23.out:unix`i86_mwait                                         188817  76.6%
hotkernel_30sec.2011-09-09_00.out:unix`acpi_cpu_cstate                                    62413  25.4%
hotkernel_30sec.2011-09-09_00.out:unix`i86_mwait                                         182381  74.2%
hotkernel_30sec.2011-09-09_01.out:unix`acpi_cpu_cstate                                    68875  28.2%
hotkernel_30sec.2011-09-09_01.out:unix`i86_mwait                                         173698  71.2%
hotkernel_30sec.2011-09-09_02.out:unix`acpi_cpu_cstate                                    79454  32.7%  *** hard/crash loss of network later
hotkernel_30sec.2011-09-09_02.out:unix`i86_mwait                                         153549  63.2%  *** hard/crash loss of network later
root@lab10:~/hotkernel_30/data.151a#

One other thing: after about 10 hours we start getting lots of messages similar to the two below:

dtrace: 2598 dynamic variable drops with non-empty dirty list
dtrace: 1409 dynamic variable drops with non-empty rinsing list
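
These drops generally just mean the script's dynamic-variable space (thread-local variables and the like) is filling up faster than DTrace can reclaim it. One way to quiet them, assuming the stock DTrace Toolkit script is being launched directly (this is only a workaround for the noise, not a fix for the underlying problem), would be to raise dynvarsize, e.g.:

    # path and script name are assumptions; dynvarsize enlarges DTrace's dynamic variable space
    dtrace -x dynvarsize=16m -s /opt/DTT/inttimes.d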

Looking at the files that show the above type of warnings:

root@lab10:~/hotkernel_30/data.151a# grep -l dynamic inttimes_10sec.2011-09-0*
inttimes_10sec.2011-09-08_17.out
inttimes_10sec.2011-09-08_18.out
inttimes_10sec.2011-09-08_19.out
inttimes_10sec.2011-09-08_20.out
inttimes_10sec.2011-09-08_21.out
inttimes_10sec.2011-09-08_22.out
inttimes_10sec.2011-09-08_23.out
inttimes_10sec.2011-09-09_00.out
inttimes_10sec.2011-09-09_01.out
inttimes_10sec.2011-09-09_02.out

Looking at some individual files:

root@lab10:~/hotkernel_30/data.151a# cat inttimes_10sec.2011-09-08_07.out
Tracing... Hit Ctrl-C to end.
     DEVICE           TIME (ns)
      ehci1              65598
      ehci0              78374
    e1000g0             180137
      ahci0            1041164

root@lab10:~/hotkernel_30/data.151a# cat inttimes_10sec.2011-09-08_16.out
Tracing... Hit Ctrl-C to end.
     DEVICE           TIME (ns)
      ehci1              68515
      ehci0              78892
    e1000g0             858248
      ahci0            1833976

root@lab10:~/hotkernel_30/data.151a# cat inttimes_10sec.2011-09-08_17.out
Tracing... Hit Ctrl-C to end.
dtrace: 109 dynamic variable drops with non-empty dirty list
     DEVICE           TIME (ns)
      ehci1              81694
      ehci0              85620
    e1000g0             102613
      ahci0            1329873

root@lab10:~/hotkernel_30/data.151a# cat inttimes_10sec.2011-09-08_18.out
Tracing... Hit Ctrl-C to end.
dtrace: 103 dynamic variable drops with non-empty dirty list
dtrace: 356 dynamic variable drops with non-empty dirty list
     DEVICE           TIME (ns)
      ehci1              72442
      ehci0              89250
    e1000g0             107444
      ahci0            2528877

root@lab10:~/hotkernel_30/data.151a#

root@lab10:~/hotkernel_30/data.151a# cat inttimes_10sec.2011-09-09_02.out
Tracing... Hit Ctrl-C to end.
dtrace: 3687 dynamic variable drops with non-empty dirty list
dtrace: 4403 dynamic variable drops with non-empty dirty list
dtrace: 5160 dynamic variable drops with non-empty dirty list
dtrace: 6109 dynamic variable drops with non-empty dirty list
dtrace: 10 dynamic variable drops with non-empty rinsing list
dtrace: 5320 dynamic variable drops with non-empty dirty list
dtrace: 5 dynamic variable drops with non-empty rinsing list
dtrace: 7076 dynamic variable drops with non-empty dirty list
dtrace: 10470 dynamic variable drops with non-empty dirty list
dtrace: 7951 dynamic variable drops with non-empty dirty list
dtrace: 12489 dynamic variable drops with non-empty dirty list
dtrace: 6375 dynamic variable drops with non-empty dirty list
dtrace: 5400 dynamic variable drops with non-empty dirty list
     DEVICE           TIME (ns)
    e1000g0              52463
      ehci1              75167
      ehci0              89074
      ahci0            2955979

root@lab10:~/hotkernel_30/data.151a#

And the CPU creep as shown by 'top':

lab10# cd /root/hotkernel_30/logs.151a
lab10#  foreach i ( *.log )
?  echo $i
?  egrep 'last pid'\|processes\|'CPU states'\|Kernel\|Memory $i | tail -4
?  echo "" 
?  end
runh30.2011-09-08_07.log
63 processes: 62 sleeping, 1 on cpu
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
Kernel: 359 ctxsw, 2 trap, 1049 intr, 83 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

runh30.2011-09-08_08.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
Kernel: 354 ctxsw, 3 trap, 1051 intr, 83 syscall, 3 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

runh30.2011-09-08_09.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 99.6% idle,  0.0% user,  0.3% kernel,  0.0% iowait,  0.0% swap
Kernel: 355 ctxsw, 2 trap, 1132 intr, 82 syscall, 3 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

runh30.2011-09-08_10.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 99.5% idle,  0.0% user,  0.5% kernel,  0.0% iowait,  0.0% swap
Kernel: 355 ctxsw, 2 trap, 1091 intr, 83 syscall, 3 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

runh30.2011-09-08_11.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 99.4% idle,  0.0% user,  0.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 354 ctxsw, 3 trap, 1061 intr, 83 syscall, 3 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

runh30.2011-09-08_12.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 99.2% idle,  0.0% user,  0.8% kernel,  0.0% iowait,  0.0% swap
Kernel: 365 ctxsw, 3 trap, 1093 intr, 96 syscall, 3 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

runh30.2011-09-08_13.log
65 processes: 64 sleeping, 1 on cpu
CPU states: 99.0% idle,  0.0% user,  0.9% kernel,  0.0% iowait,  0.0% swap
Kernel: 357 ctxsw, 2 trap, 1093 intr, 83 syscall, 3 flt
Memory: 16G phys mem, 11G free mem, 8187M total swap, 8187M free swap

runh30.2011-09-08_14.log
147 processes: 107 sleeping, 20 running, 13 zombie, 7 on cpu
CPU states:  0.0% idle, 50.5% user, 49.5% kernel,  0.0% iowait,  0.0% swap
Kernel: 1752 ctxsw, 55513 trap, 2719 intr, 112336 syscall, 69 fork, 49354 flt
Memory: 16G phys mem, 6058M free mem, 8187M total swap, 8187M free swap

runh30.2011-09-08_15.log
73 processes: 72 sleeping, 1 on cpu
CPU states: 98.8% idle,  0.0% user,  1.2% kernel,  0.0% iowait,  0.0% swap
Kernel: 350 ctxsw, 3 trap, 1053 intr, 140 syscall, 3 flt
Memory: 16G phys mem, 2624M free mem, 8187M total swap, 8182M free swap

runh30.2011-09-08_16.log
71 processes: 70 sleeping, 1 on cpu
CPU states: 98.8% idle,  0.0% user,  1.2% kernel,  0.0% iowait,  0.0% swap
Kernel: 362 ctxsw, 3 trap, 1003 intr, 154 syscall, 3 flt
Memory: 16G phys mem, 2324M free mem, 8187M total swap, 8182M free swap

runh30.2011-09-08_17.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 98.5% idle,  0.0% user,  1.5% kernel,  0.0% iowait,  0.0% swap
Kernel: 430 ctxsw, 5 trap, 1074 intr, 80 syscall, 4 flt
Memory: 16G phys mem, 7596M free mem, 8187M total swap, 8182M free swap

runh30.2011-09-08_18.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 98.3% idle,  0.0% user,  1.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 349 ctxsw, 4 trap, 1074 intr, 81 syscall, 4 flt
Memory: 16G phys mem, 7596M free mem, 8187M total swap, 8182M free swap

runh30.2011-09-08_19.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 98.2% idle,  0.0% user,  1.8% kernel,  0.0% iowait,  0.0% swap
Kernel: 352 ctxsw, 4 trap, 1107 intr, 80 syscall, 4 flt
Memory: 16G phys mem, 7595M free mem, 8187M total swap, 8182M free swap

runh30.2011-09-08_20.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 98.0% idle,  0.0% user,  2.0% kernel,  0.0% iowait,  0.0% swap
Kernel: 354 ctxsw, 5 trap, 1103 intr, 81 syscall, 4 flt
Memory: 16G phys mem, 7593M free mem, 8187M total swap, 8182M free swap

runh30.2011-09-08_21.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.9% idle,  0.0% user,  2.1% kernel,  0.0% iowait,  0.0% swap
Kernel: 353 ctxsw, 4 trap, 1100 intr, 83 syscall, 4 flt
Memory: 16G phys mem, 7593M free mem, 8187M total swap, 8182M free swap

runh30.2011-09-08_22.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.8% idle,  0.0% user,  2.2% kernel,  0.0% iowait,  0.0% swap
Kernel: 352 ctxsw, 4 trap, 1091 intr, 81 syscall, 4 flt
Memory: 16G phys mem, 7592M free mem, 8187M total swap, 8182M free swap

runh30.2011-09-08_23.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.6% idle,  0.0% user,  2.3% kernel,  0.0% iowait,  0.0% swap
Kernel: 353 ctxsw, 4 trap, 1082 intr, 82 syscall, 4 flt
Memory: 16G phys mem, 7592M free mem, 8187M total swap, 8182M free swap

runh30.2011-09-09_00.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.6% idle,  0.0% user,  2.4% kernel,  0.0% iowait,  0.0% swap
Kernel: 399 ctxsw, 4 trap, 1072 intr, 83 syscall, 4 flt
Memory: 16G phys mem, 7590M free mem, 8187M total swap, 8182M free swap

runh30.2011-09-09_01.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.4% idle,  0.0% user,  2.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 353 ctxsw, 4 trap, 1086 intr, 82 syscall, 4 flt
Memory: 16G phys mem, 7589M free mem, 8187M total swap, 8182M free swap

runh30.2011-09-09_02.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.2% idle,  0.0% user,  2.8% kernel,  0.0% iowait,  0.0% swap
Kernel: 357 ctxsw, 4 trap, 1087 intr, 82 syscall, 4 flt
Memory: 16G phys mem, 7589M free mem, 8187M total swap, 8182M free swap

runh30.2011-09-09_03.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.2% idle,  0.0% user,  2.8% kernel,  0.0% iowait,  0.0% swap
Kernel: 388 ctxsw, 4 trap, 1025 intr, 81 syscall, 4 flt
Memory: 16G phys mem, 7587M free mem, 8187M total swap, 8182M free swap

lab10#
#10

Updated by Jon Strabala over 7 years ago

I built the 'acpica-nightly-2011-09-10' BE with gcc, including updated 'acpica' bits (from danmcd; refer to IRC). The bits ran fine, but the same creeping CPU issue remains.

In about 15 hours the system went from 99.5% idle to 98.1% idle before hanging on a scheduled (via cron) dtrace command (an attempt to debug the issue itself).

To date, all variants (OI oi_148b, oi_151a, and, as of 2011-09-10, a freshly compiled nightly and nightly+'acpica' bits) seem to have the same issue on a Supermicro X9SCA-F (Intel® C204 PCH Chipset) with 16GB RAM and an E3-1270 Xeon CPU (i.e. a Sandy Bridge CPU).

A comment about dtrace:

Unfortunately, when I compile the nightly I have no 'dtrace' capability or 'powertop'.

If I run the following (as per https://www.illumos.org/projects/illumos-gate/wiki/How_To_Build_illumos) I still don't have 'dtrace' functionality:
    mcs -d -n .SUNW_ctf /kernel/misc/klmmod 
    mcs -d -n .SUNW_ctf /kernel/misc/klmops
    mcs -d -n .SUNW_ctf /kernel/misc/amd64/klmmod
    mcs -d -n .SUNW_ctf /kernel/misc/amd64/klmops

I seem to have to do a lot more, e.g.:
    find /kernel -type f -name "*" | xargs mcs -d -n .SUNW_ctf
    find /platform/i86pc/kernel -type f -name "*" | xargs mcs -d -n .SUNW_ctf
    bootadm update-archive
    init 6

just to get minimal 'dtrace' capability running.

Regardless of the limited symbol and 'dtrace' functionality, we see the same issue and syndromes: a) a slow creep in kernel CPU on an idle system, b) an increase in 'unix`acpi_cpu_cstate' when running 'hotkernel', c) the ability to crash the system in 14-24 hours via any simple 'dtrace' command, and d) a long-term (2-3 week) freeze of the idle system requiring a hard power cycle.

Basic information showing the same problem

root@lab10:~/hotkernel_30/data.acpica-nightly-2011-09-10# ../check.csh
uname -a
SunOS lab10 5.11 testws i86pc i386 i86pc Solaris

rup lab10
          lab10    up           1 min,   load average: 0.15, 0.05, 0.02

beadm list
BE                        Active Mountpoint      Space Policy Created
151a                      -      -               45.7M static 2011-09-07 13:41
acpica-nd                 -      -               4.89G static 2011-09-07 11:10
acpica-nightly-2011-09-09 -      -               56.0M static 2011-09-09 13:22
acpica-nightly-2011-09-10 NR     /               70.4G static 2011-09-10 16:48
nightly-2011-09-08        -      -               139M  static 2011-09-08 16:03
openindiana               -      -               45.2M static 2011-05-05 10:02
sandbox                   -      -               63.0K static 2011-09-07 10:46

----
CPU states: 99.5% idle,  0.0% user,  0.5% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.5% idle,  0.0% user,  0.5% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.4% idle,  0.0% user,  0.6% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.4% idle,  0.0% user,  0.6% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.3% idle,  0.0% user,  0.7% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.2% idle,  0.0% user,  0.9% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.0% idle,  0.0% user,  1.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.9% idle,  0.0% user,  1.1% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.9% idle,  0.0% user,  1.1% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.7% idle,  0.0% user,  1.3% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.8% idle,  0.0% user,  1.2% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.6% idle,  0.0% user,  1.4% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.5% idle,  0.0% user,  1.5% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.4% idle,  0.0% user,  1.6% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.4% idle,  0.0% user,  1.6% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.1% idle,  0.0% user,  1.9% kernel,  0.0% iowait,  0.0% swap

----

last pid: 101132;  load avg:  0.13,  0.05,  0.02;  up 0+00:01:34       20:21:23
50 processes: 49 sleeping, 1 on cpu
CPU states: 99.5% idle,  0.0% user,  0.5% kernel,  0.0% iowait,  0.0% swap
Kernel: 409 ctxsw, 4 trap, 1173 intr, 57 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 101213;  load avg:  0.00,  0.00,  0.00;  up 0+00:40:24       21:00:13
44 processes: 43 sleeping, 1 on cpu
CPU states: 99.5% idle,  0.0% user,  0.5% kernel,  0.0% iowait,  0.0% swap
Kernel: 354 ctxsw, 4 trap, 1020 intr, 53 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 101269;  load avg:  0.00,  0.00,  0.00;  up 0+01:40:24       22:00:13
44 processes: 43 sleeping, 1 on cpu
CPU states: 99.4% idle,  0.0% user,  0.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 347 ctxsw, 4 trap, 1126 intr, 53 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 101318;  load avg:  0.00,  0.00,  0.00;  up 0+02:40:25       23:00:14
44 processes: 43 sleeping, 1 on cpu
CPU states: 99.4% idle,  0.0% user,  0.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 355 ctxsw, 4 trap, 1087 intr, 53 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 101367;  load avg:  0.00,  0.00,  0.00;  up 0+03:40:25       00:00:14
44 processes: 43 sleeping, 1 on cpu
CPU states: 99.3% idle,  0.0% user,  0.7% kernel,  0.0% iowait,  0.0% swap
Kernel: 393 ctxsw, 4 trap, 1080 intr, 53 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 101443;  load avg:  0.00,  0.00,  0.00;  up 0+04:40:24       01:00:13
44 processes: 43 sleeping, 1 on cpu
CPU states: 99.2% idle,  0.0% user,  0.9% kernel,  0.0% iowait,  0.0% swap
Kernel: 353 ctxsw, 4 trap, 1196 intr, 53 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 101492;  load avg:  0.00,  0.00,  0.00;  up 0+05:40:24       02:00:13
44 processes: 43 sleeping, 1 on cpu
CPU states: 99.0% idle,  0.0% user,  1.0% kernel,  0.0% iowait,  0.0% swap
Kernel: 357 ctxsw, 4 trap, 1193 intr, 53 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 101541;  load avg:  0.00,  0.00,  0.00;  up 0+06:40:24       03:00:13
44 processes: 43 sleeping, 1 on cpu
CPU states: 98.9% idle,  0.0% user,  1.1% kernel,  0.0% iowait,  0.0% swap
Kernel: 351 ctxsw, 4 trap, 1126 intr, 53 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 101594;  load avg:  0.00,  0.00,  0.00;  up 0+07:40:24       04:00:13
44 processes: 43 sleeping, 1 on cpu
CPU states: 98.9% idle,  0.0% user,  1.1% kernel,  0.0% iowait,  0.0% swap
Kernel: 362 ctxsw, 4 trap, 1053 intr, 53 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 101643;  load avg:  0.00,  0.00,  0.00;  up 0+08:40:24       05:00:13
44 processes: 43 sleeping, 1 on cpu
CPU states: 98.7% idle,  0.0% user,  1.3% kernel,  0.0% iowait,  0.0% swap
Kernel: 368 ctxsw, 4 trap, 1219 intr, 53 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 101692;  load avg:  0.00,  0.00,  0.00;  up 0+09:40:24       06:00:13
44 processes: 43 sleeping, 1 on cpu
CPU states: 98.8% idle,  0.0% user,  1.2% kernel,  0.0% iowait,  0.0% swap
Kernel: 353 ctxsw, 4 trap, 1048 intr, 53 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 101838;  load avg:  0.00,  0.00,  0.00;  up 0+10:40:24       07:00:13
47 processes: 46 sleeping, 1 on cpu
CPU states: 98.6% idle,  0.0% user,  1.4% kernel,  0.0% iowait,  0.0% swap
Kernel: 391 ctxsw, 4 trap, 1128 intr, 54 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 102235;  load avg:  0.00,  0.00,  0.00;  up 0+11:40:24       08:00:13
48 processes: 47 sleeping, 1 on cpu
CPU states: 98.5% idle,  0.0% user,  1.5% kernel,  0.0% iowait,  0.0% swap
Kernel: 394 ctxsw, 6 trap, 1234 intr, 77 syscall, 7 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 102284;  load avg:  0.00,  0.00,  0.00;  up 0+12:40:24       09:00:13
48 processes: 47 sleeping, 1 on cpu
CPU states: 98.4% idle,  0.0% user,  1.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 391 ctxsw, 3 trap, 1229 intr, 76 syscall, 3 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 102340;  load avg:  0.00,  0.00,  0.00;  up 0+13:40:24       10:00:13
48 processes: 47 sleeping, 1 on cpu
CPU states: 98.4% idle,  0.0% user,  1.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 354 ctxsw, 3 trap, 1084 intr, 76 syscall, 3 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

last pid: 103035;  load avg:  0.00,  0.00,  0.00;  up 0+14:40:24       11:00:13
47 processes: 46 sleeping, 1 on cpu
CPU states: 98.1% idle,  0.0% user,  1.9% kernel,  0.0% iowait,  0.0% swap
Kernel: 382 ctxsw, 54 trap, 1191 intr, 117 syscall, 40 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

----

hotkernel_30sec.2011-09-10_20.out
unix`thread_affinity_set                                   20   0.0%
unix`tsc_read                                              23   0.0%
unix`lock_try                                              24   0.0%
genunix`kmem_hash_rescale                                  36   0.0%
unix`splr                                                  95   0.0%
unix`do_splx                                              128   0.1%
genunix`vmem_hash_delete                                  596   0.2%
unix`acpi_cpu_cstate                                      605   0.2%
unix`scan_memory                                         1031   0.4%
unix`i86_mwait                                         245224  98.9%

hotkernel_30sec.2011-09-10_21.out
apix`apix_dispatch_softint                                  3   0.0%
genunix`callout_list_get                                    3   0.0%
genunix`cyclic_coverage                                     3   0.0%
unix`do_splx                                                3   0.0%
unix`tsc_read                                               4   0.0%
unix`ddi_get32                                              5   0.0%
genunix`vmem_hash_rescale                                   6   0.0%
unix`page_nextn                                             7   0.0%
unix`acpi_cpu_cstate                                      709   0.3%
unix`i86_mwait                                         247197  99.7%

hotkernel_30sec.2011-09-10_22.out
genunix`fsflush_do_pages                                    3   0.0%
unix`mutex_enter                                            4   0.0%
genunix`cyclic_coverage                                     4   0.0%
apix`apix_dispatch_softint                                  5   0.0%
genunix`vmem_hash_rescale                                   6   0.0%
unix`do_splx                                                6   0.0%
unix`page_nextn                                             7   0.0%
unix`mutex_delay_default                                    8   0.0%
unix`acpi_cpu_cstate                                      672   0.3%
unix`i86_mwait                                         247224  99.7%

hotkernel_30sec.2011-09-10_23.out
unix`i86_monitor                                            3   0.0%
zfs`ddt_stat_add                                            4   0.0%
unix`tsc_read                                               4   0.0%
genunix`fsflush_do_pages                                    6   0.0%
genunix`vmem_hash_rescale                                   7   0.0%
unix`page_nextn                                             8   0.0%
dtrace`dtrace_ioctl                                        11   0.0%
unix`do_splx                                               16   0.0%
unix`acpi_cpu_cstate                                      702   0.3%
unix`i86_mwait                                         247152  99.7%

hotkernel_30sec.2011-09-11_00.out
unix`do_splx                                                3   0.0%
unix`page_nextn                                             4   0.0%
unix`sys_syscall                                            5   0.0%
apix`apix_dispatch_softint                                  6   0.0%
genunix`syscall_mstate                                      6   0.0%
unix`tsc_read                                               6   0.0%
genunix`fsflush_do_pages                                    6   0.0%
genunix`vmem_hash_rescale                                   7   0.0%
unix`acpi_cpu_cstate                                      828   0.3%
unix`i86_mwait                                         247054  99.6%

hotkernel_30sec.2011-09-11_01.out
genunix`syscall_mstate                                      5   0.0%
unix`page_nextn                                             5   0.0%
unix`mutex_enter                                            6   0.0%
unix`tsc_read                                               7   0.0%
unix`mutex_delay_default                                    7   0.0%
genunix`vmem_hash_rescale                                   7   0.0%
unix`do_splx                                                7   0.0%
genunix`fsflush_do_pages                                    9   0.0%
unix`acpi_cpu_cstate                                      493   0.2%
unix`i86_mwait                                         247328  99.8%

hotkernel_30sec.2011-09-11_02.out
genunix`cyclic_coverage_hash                                3   0.0%
unix`disp_getwork                                           3   0.0%
apix`apix_dispatch_softint                                  5   0.0%
genunix`fsflush_do_pages                                    5   0.0%
unix`tsc_read                                               6   0.0%
genunix`vmem_hash_rescale                                   7   0.0%
unix`page_nextn                                             8   0.0%
unix`ddi_get32                                             20   0.0%
unix`acpi_cpu_cstate                                      509   0.2%
unix`i86_mwait                                         247328  99.8%

hotkernel_30sec.2011-09-11_03.out
apix`apix_dispatch_softint                                  3   0.0%
unix`fakesoftint_return                                     3   0.0%
unix`tsc_read                                               4   0.0%
genunix`fsflush_do_pages                                    6   0.0%
dtrace`dtrace_ioctl                                         6   0.0%
genunix`vmem_hash_rescale                                   8   0.0%
unix`page_nextn                                            10   0.0%
unix`do_splx                                               16   0.0%
unix`acpi_cpu_cstate                                      567   0.2%
unix`i86_mwait                                         247216  99.7%

hotkernel_30sec.2011-09-11_04.out
TS`ts_update_list                                           6   0.0%
unix`tsc_read                                               6   0.0%
unix`mutex_enter                                            7   0.0%
unix`page_nextn                                             7   0.0%
genunix`vmem_hash_rescale                                   8   0.0%
unix`do_splx                                                8   0.0%
genunix`fsflush_do_pages                                   10   0.0%
genunix`thread_lock                                        11   0.0%
unix`acpi_cpu_cstate                                      631   0.3%
unix`i86_mwait                                         247092  99.7%

hotkernel_30sec.2011-09-11_05.out
dtrace`dtrace_dynvar_clean                                  3   0.0%
unix`fakesoftint_return                                     3   0.0%
unix`resume                                                 3   0.0%
unix`do_splx                                                5   0.0%
genunix`fsflush_do_pages                                    6   0.0%
unix`tsc_read                                               7   0.0%
genunix`vmem_hash_rescale                                   9   0.0%
unix`page_nextn                                            10   0.0%
unix`acpi_cpu_cstate                                      556   0.2%
unix`i86_mwait                                         247218  99.7%

hotkernel_30sec.2011-09-11_06.out
unix`mutex_enter                                            4   0.0%
unix`fakesoftint_return                                     5   0.0%
unix`tsc_read                                               6   0.0%
genunix`fsflush_do_pages                                    7   0.0%
genunix`vmem_hash_rescale                                   9   0.0%
unix`page_nextn                                            10   0.0%
unix`do_splx                                               17   0.0%
unix`ddi_get32                                             32   0.0%
unix`acpi_cpu_cstate                                      649   0.3%
unix`i86_mwait                                         247042  99.7%

hotkernel_30sec.2011-09-11_07.out
unix`splr                                                   4   0.0%
unix`ddi_get32                                              5   0.0%
unix`sys_syscall                                            5   0.0%
genunix`fsflush_do_pages                                    5   0.0%
unix`tsc_read                                               7   0.0%
unix`do_splx                                                9   0.0%
genunix`vmem_hash_rescale                                  10   0.0%
unix`page_nextn                                            11   0.0%
unix`acpi_cpu_cstate                                      537   0.2%
unix`i86_mwait                                         247127  99.7%

hotkernel_30sec.2011-09-11_08.out
unix`default_lock_delay                                     7   0.0%
dtrace`dtrace_ioctl                                         7   0.0%
genunix`dnlc_purge_vfsp                                     8   0.0%
unix`mutex_delay_default                                    9   0.0%
unix`page_nextn                                            10   0.0%
genunix`fsflush_do_pages                                   10   0.0%
genunix`vmem_hash_rescale                                  11   0.0%
unix`do_splx                                               14   0.0%
unix`acpi_cpu_cstate                                     3381   1.4%
unix`i86_mwait                                         244217  98.6%

hotkernel_30sec.2011-09-11_09.out
genunix`cyclic_coverage                                     4   0.0%
unix`disp_getwork                                           6   0.0%
unix`mutex_enter                                            7   0.0%
genunix`dnlc_purge_vfsp                                     7   0.0%
unix`do_splx                                                9   0.0%
genunix`fsflush_do_pages                                    9   0.0%
unix`page_nextn                                            12   0.0%
genunix`vmem_hash_rescale                                  12   0.0%
unix`acpi_cpu_cstate                                     6367   2.6%
unix`i86_mwait                                         241182  97.4%

hotkernel_30sec.2011-09-11_10.out
genunix`syscall_mstate                                      5   0.0%
unix`mutex_enter                                            6   0.0%
unix`fakesoftint_return                                     6   0.0%
genunix`dnlc_purge_vfsp                                     7   0.0%
genunix`fsflush_do_pages                                    9   0.0%
unix`do_splx                                               10   0.0%
unix`page_nextn                                            13   0.0%
genunix`vmem_hash_rescale                                  13   0.0%
unix`acpi_cpu_cstate                                    10260   4.1%
unix`i86_mwait                                         237285  95.8%

hotkernel_30sec.2011-09-11_11.out
unix`mutex_delay_default                                   10   0.0%
unix`tsc_read                                              12   0.0%
genunix`syscall_mstate                                     12   0.0%
unix`page_nextn                                            13   0.0%
genunix`vmem_hash_rescale                                  14   0.0%
genunix`fsflush_do_pages                                   14   0.0%
unix`do_splx                                               15   0.0%
unix`mutex_enter                                           18   0.0%
unix`acpi_cpu_cstate                                    14031   5.7%
unix`i86_mwait                                         233282  94.2%

root@lab10:~/hotkernel_30/data.acpica-nightly-2011-09-10#
#11

Updated by Jon Strabala over 7 years ago

Once again the kernel CPU usage on an idle system crept up, from 99.4% idle to 98.4% idle.

Oddly enough, this time the system hung relatively fast, in just six (6) hours, so this may be a different issue altogether that I have never seen before: a) I had keyboard I/O and screen output, which I never had before, and b) I never saw unix`acpi_cpu_cstate increase when running 'hotkernel'. Also, I ran many more 'dtrace' scripts than before; hopefully this isn't related to the number of times one kicks off dtrace. As such, I will re-run the last test one more time and re-post.

I have attached "logs_data_scripts.151a.final.20110913.tar.gz" a bundle of log, data, and modified, new dtrace (in opt/DTT ontop of the Dtrace Toolkit) scripts used for all my for 1/2 hour samples. The typical set of samples each 1/2 hour (taking about 9 minutes to run) are as follows and included in the 'gtar file (above):

-rw-r--r-- 1 root root     805 2011-09-12 16:51 info.2011-09-12_1651.out
-rw-r--r-- 1 root root    2899 2011-09-12 16:51 top_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root    1386 2011-09-12 16:51 mpstat_10_2.2011-09-12_1651.out
-rw-r--r-- 1 root root     391 2011-09-12 16:51 vmstat_3_3.2011-09-12_1651.out
-rw-r--r-- 1 root root    4773 2011-09-12 16:51 prstat-a-m_5_3.2011-09-12_1651.out
-rw-r--r-- 1 root root    2420 2011-09-12 16:51 prstat-L-c_4_2.2011-09-12_1651.out
-rw-r--r-- 1 root root     631 2011-09-12 16:51 profile-1001.kernel.2011-09-12_1651.out
-rw-r--r-- 1 root root     649 2011-09-12 16:52 profile-1001.user.2011-09-12_1651.out
-rw-r--r-- 1 root root  177565 2011-09-12 16:52 more.2011-09-12_1651.out
-rw-r--r-- 1 root root    4383 2011-09-12 16:52 profile-1971_5sec.2011-09-12_1651.out
-rw-r--r-- 1 root root 5729338 2011-09-12 16:52 adaptive.2011-09-12_1651.out
-rw-r--r-- 1 root root    1058 2011-09-12 16:52 syscall_cnt_by_syscall.2011-09-12_1651.out
-rw-r--r-- 1 root root    3516 2011-09-12 16:52 hotkernel_30sec.2011-09-12_1651.out
-rw-r--r-- 1 root root     756 2011-09-12 16:53 hotkernel_30sec_m.2011-09-12_1651.out
-rw-r--r-- 1 root root    1742 2011-09-12 16:53 cpuwalk.d_10.2011-09-12_1651.out
-rw-r--r-- 1 root root    1855 2011-09-12 16:53 intoncpu_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root     187 2011-09-12 16:54 inttimes_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root     154 2011-09-12 16:54 xcallsbypid_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root    1365 2011-09-12 16:54 sysbypid_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root    1557 2011-09-12 16:54 topsyscall-C-s_5_2.2011-09-12_1651.out
-rw-r--r-- 1 root root    1193 2011-09-12 16:55 iotop-C-t_10_5_4.2011-09-12_1651.out
-rw-r--r-- 1 root root     685 2011-09-12 16:55 cswstat_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root     273 2011-09-12 16:55 idle-state_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root     934 2011-09-12 16:55 watch_cpu_acpi_read_port.2011-09-12_1651.out
-rw-r--r-- 1 root root    3713 2011-09-12 16:56 watch_cstate.2011-09-12_1651.out
-rw-r--r-- 1 root root       1 2011-09-12 16:56 speedstep.2011-09-12_1651.out
-rw-r--r-- 1 root root     136 2011-09-12 16:57 watch_acpi_cpu_cstate_60.2011-09-12_1651.out
-rw-r--r-- 1 root root     142 2011-09-12 16:58 watch_cpu_acpi_read_port_60.2011-09-12_1651.out
-rw-r--r-- 1 root root    2909 2011-09-12 16:58 cpudists.2011-09-12_1651.out
-rw-r--r-- 1 root root     159 2011-09-12 16:58 cputimes.2011-09-12_1651.out
-rw-r--r-- 1 root root     213 2011-09-12 16:58 dladm_show-link-s.2011-09-12_1651.out
-rw-r--r-- 1 root root     380 2011-09-12 16:58 dlstat_show-link-r.2011-09-12_1651.out
-rw-r--r-- 1 root root     104 2011-09-12 16:58 dlstat.2011-09-12_1651.out
-rw-r--r-- 1 root root    2706 2011-09-12 16:58 intrstat_5_3.2011-09-12_1651.out
-rw-r--r-- 1 root root    7574 2011-09-12 16:59 powertop-d_3-t_10.2011-09-12_1651.out
-rw-r--r-- 1 root root     777 2011-09-12 16:59 lockstat-kIW-D_20_sleep_30.2011-09-12_1651.out
-rw-r--r-- 1 root root     156 2011-09-12 17:00 intrtime.pl.2011-09-12_1651.out

However, at this point I think I would need to capture a kernel core dump; if someone is willing to post instructions on a) how to do it and b) how/where to send it, I would be more than happy to do so.
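
For reference, a minimal sketch of the standard dumpadm/savecore workflow on oi_151a (defaults assumed; verify on the actual box before relying on it):

    dumpadm                  # check the configured dump device and savecore directory
    savecore -L              # capture a live dump of the running kernel without panicking it
    # or, while the machine is still responsive, force a panic plus dump and reboot:
    reboot -d
    # either way a unix.N / vmcore.N pair lands in the savecore directory
    # (by default /var/crash/<hostname>); that pair is what gets compressed and uploaded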

root@lab10:~/hotkernel_30/logs.151a.final# ls -ltr
total 145
-rw-r--r-- 1 root root 12762 2011-09-12 11:28 runh30.2011-09-12_1121.log
-rw-r--r-- 1 root root 11037 2011-09-12 11:58 runh30.2011-09-12_1151.log
-rw-r--r-- 1 root root 10219 2011-09-12 12:28 runh30.2011-09-12_1221.log
-rw-r--r-- 1 root root 11117 2011-09-12 12:58 runh30.2011-09-12_1251.log
-rw-r--r-- 1 root root 10291 2011-09-12 13:28 runh30.2011-09-12_1321.log
-rw-r--r-- 1 root root 10499 2011-09-12 13:58 runh30.2011-09-12_1351.log
-rw-r--r-- 1 root root 10640 2011-09-12 14:28 runh30.2011-09-12_1421.log
-rw-r--r-- 1 root root 11328 2011-09-12 14:58 runh30.2011-09-12_1451.log
-rw-r--r-- 1 root root 10500 2011-09-12 15:28 runh30.2011-09-12_1521.log
-rw-r--r-- 1 root root 12297 2011-09-12 15:58 runh30.2011-09-12_1551.log
-rw-r--r-- 1 root root 11327 2011-09-12 16:28 runh30.2011-09-12_1621.log
-rw-r--r-- 1 root root 11443 2011-09-12 16:58 runh30.2011-09-12_1651.log
-rw-r--r-- 1 root root  4707 2011-09-12 17:22 runh30.2011-09-12_1721.log
root@lab10:~/hotkernel_30/logs.151a.final#

As I said, the system hung rather fast (6 hours instead of nearly 20) ... and this was odd because I never saw the 'normal' increase in counts for unix`acpi_cpu_cstate; also, this one time I actually had a keyboard and screen display (a first) on the console, BUT no response at all when I typed anything. This obviously seems like a different issue altogether.

root@lab10:~/hotkernel_30/data.151a.final# ../check.csh
date
September 13, 2011 07:45:26 AM PDT

rup lab10
          lab10    up          21 mins,  load average: 0.00, 0.00, 0.00

uname -a
SunOS lab10 5.11 oi_151a i86pc i386 i86pc Solaris

beadm list
BE                        Active Mountpoint      Space Policy Created
151a                      NR     /               56.3G static 2011-09-07 13:41
acpica-nd                 -      -               4.89G static 2011-09-07 11:10
acpica-nightly-2011-09-09 -      -               56.0M static 2011-09-09 13:22
acpica-nightly-2011-09-10 -      -               22.8G static 2011-09-10 16:48
nightly-2011-09-08        -      -               139M  static 2011-09-08 16:03
openindiana               -      -               45.2M static 2011-05-05 10:02
sandbox                   -      -               63.0K static 2011-09-07 10:46

----
CPU states: 99.4% idle,  0.0% user,  0.6% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.4% idle,  0.0% user,  0.6% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.3% idle,  0.0% user,  0.7% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.2% idle,  0.0% user,  0.8% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.2% idle,  0.0% user,  0.8% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.0% idle,  0.0% user,  1.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.0% idle,  0.0% user,  1.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.9% idle,  0.0% user,  1.1% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.9% idle,  0.0% user,  1.1% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.8% idle,  0.0% user,  1.2% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.7% idle,  0.0% user,  1.3% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.7% idle,  0.0% user,  1.3% kernel,  0.0% iowait,  0.0% swap
CPU states: 98.4% idle,  0.0% user,  1.6% kernel,  0.0% iowait,  0.0% swap

----

last pid:  1277;  load avg:  0.00,  0.02,  0.01;  up 0+00:07:45        11:21:50
48 processes: 47 sleeping, 1 on cpu
CPU states: 99.4% idle,  0.0% user,  0.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 410 ctxsw, 58 trap, 1006 intr, 172 syscall, 45 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

last pid:  1429;  load avg:  0.00,  0.00,  0.00;  up 0+00:37:14        11:51:19
47 processes: 46 sleeping, 1 on cpu
CPU states: 99.4% idle,  0.0% user,  0.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 390 ctxsw, 5 trap, 1063 intr, 59 syscall, 5 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

last pid:  1545;  load avg:  0.00,  0.00,  0.00;  up 0+01:07:13        12:21:18
47 processes: 46 sleeping, 1 on cpu
CPU states: 99.3% idle,  0.0% user,  0.7% kernel,  0.0% iowait,  0.0% swap
Kernel: 342 ctxsw, 5 trap, 987 intr, 59 syscall, 5 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

last pid:  1668;  load avg:  0.00,  0.00,  0.00;  up 0+01:37:13        12:51:18
47 processes: 46 sleeping, 1 on cpu
CPU states: 99.2% idle,  0.0% user,  0.8% kernel,  0.0% iowait,  0.0% swap
Kernel: 348 ctxsw, 5 trap, 1051 intr, 62 syscall, 5 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

last pid:  1961;  load avg:  0.01,  0.11,  0.08;  up 0+02:07:13        13:21:18
47 processes: 46 sleeping, 1 on cpu
CPU states: 99.2% idle,  0.0% user,  0.8% kernel,  0.0% iowait,  0.0% swap
Kernel: 386 ctxsw, 2 trap, 1037 intr, 59 syscall, 3 flt
Memory: 16G phys mem, 5858M free mem, 8187M total swap, 8187M free swap

last pid:  2077;  load avg:  0.00,  0.00,  0.01;  up 0+02:37:13        13:51:18
47 processes: 46 sleeping, 1 on cpu
CPU states: 99.0% idle,  0.0% user,  1.0% kernel,  0.0% iowait,  0.0% swap
Kernel: 367 ctxsw, 5 trap, 1135 intr, 59 syscall, 5 flt
Memory: 16G phys mem, 5859M free mem, 8187M total swap, 8187M free swap

last pid:  2193;  load avg:  0.00,  0.00,  0.00;  up 0+03:07:13        14:21:18
47 processes: 46 sleeping, 1 on cpu
CPU states: 99.0% idle,  0.0% user,  1.0% kernel,  0.0% iowait,  0.0% swap
Kernel: 347 ctxsw, 5 trap, 1058 intr, 59 syscall, 5 flt
Memory: 16G phys mem, 5859M free mem, 8187M total swap, 8187M free swap

last pid:  2309;  load avg:  0.00,  0.00,  0.00;  up 0+03:37:13        14:51:18
47 processes: 46 sleeping, 1 on cpu
CPU states: 98.9% idle,  0.0% user,  1.1% kernel,  0.0% iowait,  0.0% swap
Kernel: 350 ctxsw, 5 trap, 1076 intr, 59 syscall, 5 flt
Memory: 16G phys mem, 5860M free mem, 8187M total swap, 8187M free swap

last pid:  2495;  load avg:  0.00,  0.00,  0.00;  up 0+04:07:13        15:21:18
47 processes: 46 sleeping, 1 on cpu
CPU states: 98.9% idle,  0.0% user,  1.1% kernel,  0.0% iowait,  0.0% swap
Kernel: 348 ctxsw, 5 trap, 1060 intr, 59 syscall, 5 flt
Memory: 16G phys mem, 5861M free mem, 8187M total swap, 8187M free swap

last pid:  2631;  load avg:  0.00,  0.00,  0.00;  up 0+04:37:13        15:51:18
47 processes: 46 sleeping, 1 on cpu
CPU states: 98.8% idle,  0.0% user,  1.2% kernel,  0.0% iowait,  0.0% swap
Kernel: 413 ctxsw, 143 trap, 1086 intr, 732 syscall, 114 flt
Memory: 16G phys mem, 5862M free mem, 8187M total swap, 8187M free swap

last pid:  2814;  load avg:  0.00,  0.00,  0.00;  up 0+05:07:14        16:21:19
47 processes: 46 sleeping, 1 on cpu
CPU states: 98.7% idle,  0.0% user,  1.3% kernel,  0.0% iowait,  0.0% swap
Kernel: 357 ctxsw, 5 trap, 1028 intr, 60 syscall, 5 flt
Memory: 16G phys mem, 5863M free mem, 8187M total swap, 8187M free swap

last pid:  2930;  load avg:  0.00,  0.00,  0.00;  up 0+05:37:13        16:51:18
47 processes: 46 sleeping, 1 on cpu
CPU states: 98.7% idle,  0.0% user,  1.3% kernel,  0.0% iowait,  0.0% swap
Kernel: 346 ctxsw, 5 trap, 1038 intr, 59 syscall, 5 flt
Memory: 16G phys mem, 5864M free mem, 8187M total swap, 8187M free swap

last pid:  3046;  load avg:  0.00,  0.00,  0.00;  up 0+06:07:13        17:21:18
47 processes: 46 sleeping, 1 on cpu
CPU states: 98.4% idle,  0.0% user,  1.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 343 ctxsw, 5 trap, 1133 intr, 59 syscall, 5 flt
Memory: 16G phys mem, 5865M free mem, 8187M total swap, 8187M free swap

----

hotkernel_30sec.2011-09-12_1121.out
genunix`set_active_fd                                       3   0.0%
unix`sys_syscall                                            3   0.0%
unix`page_nextn                                             3   0.0%
genunix`fsflush_do_pages                                    3   0.0%
unix`do_splx                                                3   0.0%
unix`mutex_delay_default                                    4   0.0%
unix`bzero                                                  4   0.0%
unix`mutex_enter                                            9   0.0%
unix`acpi_cpu_cstate                                      632   0.3%
unix`i86_mwait                                         247295  99.7%

hotkernel_30sec.2011-09-12_1151.out
unix`tsc_gethrtimeunscaled_delta                            3   0.0%
unix`i86_monitor                                            3   0.0%
unix`mutex_delay_default                                    3   0.0%
unix`page_nextn                                             3   0.0%
unix`setbackdq                                              4   0.0%
unix`mutex_enter                                            4   0.0%
genunix`fsflush_do_pages                                    5   0.0%
genunix`fsflush                                            30   0.0%
unix`acpi_cpu_cstate                                      642   0.3%
unix`i86_mwait                                         247265  99.7%

hotkernel_30sec.2011-09-12_1221.out
genunix`callout_list_expire                                 2   0.0%
genunix`dnlc_purge_vfsp                                     3   0.0%
dtrace`dtrace_dynvar_clean                                  3   0.0%
unix`default_lock_delay                                     4   0.0%
unix`page_nextn                                             4   0.0%
genunix`fsflush                                             9   0.0%
unix`todpc_rtcget                                          10   0.0%
genunix`fsflush_do_pages                                   11   0.0%
unix`acpi_cpu_cstate                                      615   0.2%
unix`i86_mwait                                         247313  99.7%

hotkernel_30sec.2011-09-12_1251.out
unix`page_nextn                                             3   0.0%
unix`mutex_exit                                             3   0.0%
genunix`fsflush_do_pages                                    3   0.0%
unix`bzero                                                  3   0.0%
unix`ddi_get32                                              4   0.0%
unix`0xfffffffffb85                                         4   0.0%
unix`do_splx                                                4   0.0%
unix`mutex_delay_default                                    6   0.0%
unix`acpi_cpu_cstate                                      758   0.3%
unix`i86_mwait                                         247156  99.7%

hotkernel_30sec.2011-09-12_1321.out
unix`mutex_enter                                            4   0.0%
unix`i86_monitor                                            4   0.0%
unix`mutex_delay_default                                    6   0.0%
unix`do_splx                                                6   0.0%
unix`tsc_gethrtimeunscaled_delta                            7   0.0%
genunix`dnlc_purge_vfsp                                     7   0.0%
unix`page_nextn                                            28   0.0%
genunix`fsflush_do_pages                                   29   0.0%
unix`acpi_cpu_cstate                                      688   0.3%
unix`i86_mwait                                         247113  99.7%

hotkernel_30sec.2011-09-12_1351.out
unix`mutex_delay_default                                    2   0.0%
unix`atomic_cas_32                                          2   0.0%
unix`mutex_enter                                            3   0.0%
unix`atomic_and_64                                          3   0.0%
genunix`dnlc_purge_vfsp                                     8   0.0%
unix`page_nextn                                            19   0.0%
genunix`fsflush_do_pages                                   20   0.0%
unix`todpc_rtcget                                          29   0.0%
unix`acpi_cpu_cstate                                      684   0.3%
unix`i86_mwait                                         247064  99.7%

hotkernel_30sec.2011-09-12_1421.out
unix`do_splx                                                2   0.0%
dtrace`dtrace_dynvar_clean                                  3   0.0%
unix`bzero                                                  3   0.0%
unix`tsc_gethrtime_delta                                    4   0.0%
unix`i86_monitor                                            4   0.0%
genunix`dnlc_purge_vfsp                                     9   0.0%
genunix`fsflush_do_pages                                   21   0.0%
unix`page_nextn                                            26   0.0%
unix`acpi_cpu_cstate                                      862   0.3%
unix`i86_mwait                                         246968  99.6%

hotkernel_30sec.2011-09-12_1451.out
unix`do_splx                                                4   0.0%
unix`tsc_gethrtimeunscaled_delta                            5   0.0%
unix`default_lock_delay                                     7   0.0%
genunix`dnlc_purge_vfsp                                     8   0.0%
unix`mutex_delay_default                                   12   0.0%
genunix`fsflush_do_pages                                   30   0.0%
unix`page_nextn                                            31   0.0%
unix`ddi_get32                                             32   0.0%
unix`acpi_cpu_cstate                                      665   0.3%
unix`i86_mwait                                         247079  99.7%

hotkernel_30sec.2011-09-12_1521.out
unix`bzero                                                  3   0.0%
unix`do_splx                                                3   0.0%
unix`tsc_gethrtimeunscaled_delta                            4   0.0%
unix`mutex_delay_default                                    5   0.0%
genunix`dnlc_purge_vfsp                                     9   0.0%
unix`ddi_get32                                             11   0.0%
unix`page_nextn                                            27   0.0%
genunix`fsflush_do_pages                                   31   0.0%
unix`acpi_cpu_cstate                                      805   0.3%
unix`i86_mwait                                         246906  99.6%

hotkernel_30sec.2011-09-12_1551.out
e1000g`e1000g_receive                                       6   0.0%
TS`ts_update_list                                           6   0.0%
unix`mutex_enter                                            7   0.0%
genunix`thread_lock                                         7   0.0%
genunix`dnlc_purge_vfsp                                     9   0.0%
unix`ddi_get32                                             12   0.0%
genunix`fsflush_do_pages                                   21   0.0%
unix`page_nextn                                            37   0.0%
unix`acpi_cpu_cstate                                      617   0.2%
unix`i86_mwait                                         246607  99.7%

hotkernel_30sec.2011-09-12_1621.out
unix`kstat_compare_bykid                                    5   0.0%
unix`default_lock_delay                                     5   0.0%
unix`0xfffffffffb85                                         6   0.0%
genunix`dnlc_purge_vfsp                                    10   0.0%
unix`bzero                                                 12   0.0%
unix`mutex_delay_default                                   21   0.0%
unix`page_nextn                                            26   0.0%
genunix`fsflush_do_pages                                   32   0.0%
unix`acpi_cpu_cstate                                      620   0.3%
unix`i86_mwait                                         247033  99.7%

hotkernel_30sec.2011-09-12_1651.out
unix`mutex_enter                                            5   0.0%
unix`bzero                                                  6   0.0%
unix`tsc_gethrtimeunscaled_delta                            8   0.0%
unix`do_splx                                                8   0.0%
genunix`dnlc_purge_vfsp                                    12   0.0%
unix`mutex_delay_default                                   12   0.0%
unix`page_nextn                                            29   0.0%
genunix`fsflush_do_pages                                   30   0.0%
unix`acpi_cpu_cstate                                      775   0.3%
unix`i86_mwait                                         246928  99.6%

Note that unix`acpi_cpu_cstate was at 775 (a normal value) and never went up to, say, 14000 as in all previous runs, and the hang only took six hours, much too quick - so maybe I hit a different bug.

As indicated above, I have attached "logs_data_scripts.151a.final.20110913.tar.gz", a bundle of logs, data, and modified/new dtrace scripts (in /opt/DTT on top of the DTrace Toolkit) used for all my 1/2-hour samples.

Basically I run /root/hotkernel_30/runh30.csh from cron to collect all my data, as follows:

21 * * * * /root/hotkernel_30/runh30.csh >> /root/hotkernel_30/logs/runh30.`date +\%Y-\%m-\%d_\%H\%M`.log 2>&1
51 * * * * /root/hotkernel_30/runh30.csh >> /root/hotkernel_30/logs/runh30.`date +\%Y-\%m-\%d_\%H\%M`.log 2>&1
 

This "master" script launches a bunch of dtrace and non-dtrace scripts (forgive the hard-coded paths; it was/is an ugly, organic growth) ...

#!/bin/csh -f
cd /opt/DTT
set yyyymmddhh=`date +"%Y-%m-%d_%H%M"`

echo "=========================================================================" 
echo "uname -a" 
uname -a

echo "" 
echo "date" 
date

echo "" 
echo "rup lab10" 
rup lab10

sleep 1
echo "date" >> /root/hotkernel_30/data/info.$yyyymmddhh.out
date >> /root/hotkernel_30/data/info.$yyyymmddhh.out
echo "" >> /root/hotkernel_30/data/info.$yyyymmddhh.out

sleep 1
echo "uname -a" >> /root/hotkernel_30/data/info.$yyyymmddhh.out
uname -a >> /root/hotkernel_30/data/info.$yyyymmddhh.out
echo "" >> /root/hotkernel_30/data/info.$yyyymmddhh.out

sleep 1
echo "rup lab10" >> /root/hotkernel_30/data/info.$yyyymmddhh.out
rup lab10 >> /root/hotkernel_30/data/info.$yyyymmddhh.out
echo "" >> /root/hotkernel_30/data/info.$yyyymmddhh.out

sleep 1
echo "beadm list" >> /root/hotkernel_30/data/info.$yyyymmddhh.out
beadm list >> /root/hotkernel_30/data/info.$yyyymmddhh.out
echo "" >> /root/hotkernel_30/data/info.$yyyymmddhh.out

echo "" 
echo "top -s 10 -n 15 -d 2 -b" >& /root/hotkernel_30/data/top_10sec.$yyyymmddhh.out
sleep 1
top -s 10 -n 15 -d 2 -b  >& /root/hotkernel_30/data/top_10sec.$yyyymmddhh.out
cat  /root/hotkernel_30/data/top_10sec.$yyyymmddhh.out

sleep 1
mpstat 10 2 >& /root/hotkernel_30/data/mpstat_10_2.$yyyymmddhh.out
sleep 1
vmstat 3 3  >& /root/hotkernel_30/data/vmstat_3_3.$yyyymmddhh.out
sleep 1
prstat -a -m 5 3  >& /root/hotkernel_30/data/prstat-a-m_5_3.$yyyymmddhh.out
sleep 1
prstat -L -c 4 2 >& /root/hotkernel_30/data/prstat-L-c_4_2.$yyyymmddhh.out

# http://prefetch.net/blog/index.php/2007/03/03/viewing-busy-code-paths-with-dtrace/
echo "----------" 
echo "busiest code paths in the kernel" 
echo "dtrace -n 'profile-1001 {@[stack(20)]=count()} END{trunc(@,2)} tick-1s / i++ >= 1 / { exit(0); }'" 
sleep 1
dtrace -n 'profile-1001 {@[stack(20)]=count()} END{trunc(@,2)} tick-1s / i++ >= 1 / { exit(0); }' >& /root/hotkernel_30/data/profile-1001.kernel.$yyyymmddhh.out
cat /root/hotkernel_30/data/profile-1001.kernel.$yyyymmddhh.out

echo "----------" 
echo "busiest userland code paths" 
echo "dtrace -n 'profile-1001 /arg1/ {@[execname,ustack()]=count()} END{trunc(@,2)} tick-1s / i++ >= 1 / { exit(0); }'" 
sleep 1
dtrace -n 'profile-1001 /arg1/ {@[execname,ustack()]=count()} END{trunc(@,2)} tick-1s / i++ >= 1 / { exit(0); }' >& /root/hotkernel_30/data/profile-1001.user.$yyyymmddhh.out
cat /root/hotkernel_30/data/profile-1001.user.$yyyymmddhh.out

echo "/root/hotkernel_30/more.csh >& /root/hotkernel_30/data/more.$yyyymmddhh.out" 
sleep 1
/root/hotkernel_30/more.csh >& /root/hotkernel_30/data/more.$yyyymmddhh.out

echo "grep i86_mwait:entry /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l" 
grep i86_mwait:entry /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l
echo "" 

echo "grep i86_mwait:return /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l" 
grep i86_mwait:return /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l
echo "" 

echo "grep cpu_idle_adaptive:entry /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l" 
grep cpu_idle_adaptive:entry /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l
echo "" 

echo "grep cpu_idle_adaptive:return /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l" 
grep cpu_idle_adaptive:return /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l
echo "" 

sleep 1
dtrace -n 'profile-1971 {@[stack()]=count();} tick-1s {trunc(@, 3); printa(@);trunc(@);} tick-1s / i++ >= 5 / { exit(0); }' >& /root/hotkernel_30/data/profile-1971_5sec.$yyyymmddhh.out

echo "dtrace on mutex_vector_enter:entry, adaptive-acquire, adaptive-release" 
echo "dtrace on mutex_vector_enter:entry, adaptive-acquire, adaptive-release" > /root/hotkernel_30/data/adaptive.$yyyymmddhh.out

echo "" >> /root/hotkernel_30/data/adaptive.$yyyymmddhh.out
sleep 1
echo "/root/hotkernel_30/6958602.1.d" 
echo "/root/hotkernel_30/6958602.1.d" >> /root/hotkernel_30/data/adaptive.$yyyymmddhh.out
/root/hotkernel_30/6958602.1.d >& /root/hotkernel_30/data/adaptive.$yyyymmddhh.out.1
cat /root/hotkernel_30/data/adaptive.$yyyymmddhh.out.1 >> /root/hotkernel_30/data/adaptive.$yyyymmddhh.out
rm /root/hotkernel_30/data/adaptive.$yyyymmddhh.out.1

echo "" 
tail -10 /root/hotkernel_30/data/adaptive.$yyyymmddhh.out

echo "----------------" 
echo "#Syscall count by syscall" 
echo "dtrace -n 'syscall:::entry { @num[probefunc] = count(); }  tick-1s / i++ >= 1 / { exit(0); }'" 
sleep 1
dtrace -n 'syscall:::entry { @num[probefunc] = count(); }  tick-1s / i++ >= 1 / { exit(0); }' >& /root/hotkernel_30/data/syscall_cnt_by_syscall.$yyyymmddhh.out
cat /root/hotkernel_30/data/syscall_cnt_by_syscall.$yyyymmddhh.out
echo "" 

echo "" 
echo "./hotkernel_30sec >& /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out" 
sleep 1
./hotkernel_30sec >& /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out

echo "" 
echo "wc -l /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out" 
wc -l /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out

echo "" 
echo "cat /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out" 
cat /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out

sleep 1
./hotkernel_30sec -m >& /root/hotkernel_30/data/hotkernel_30sec_m.$yyyymmddhh.out

echo "" 
echo "take four dtrace CPU measures: cpuwalk.d 10, intoncpu_10sec.d, inttimes_10sec.d, xcallsbypid_10sec.d" 
cd /opt/DTT/Cpu
sleep 1
./cpuwalk.d 10 >& /root/hotkernel_30/data/cpuwalk.d_10.$yyyymmddhh.out
sleep 1
./intoncpu_10sec.d >& /root/hotkernel_30/data/intoncpu_10sec.$yyyymmddhh.out
sleep 1
./inttimes_10sec.d >& /root/hotkernel_30/data/inttimes_10sec.$yyyymmddhh.out
sleep 1
./xcallsbypid_10sec.d >& /root/hotkernel_30/data/xcallsbypid_10sec.$yyyymmddhh.out

echo "" 
echo "take a few Bin measures: sysbypid_10sec.d, topsyscall -C -s 5 2, iotop -C -t 10 5 4, cswstat_10sec.d, idle-state_10sec.d" 
cd /opt/DTT/Bin
sleep 1
./sysbypid_10sec.d  >& /root/hotkernel_30/data/sysbypid_10sec.$yyyymmddhh.out
sleep 1
./topsyscall -C -s 5 2  >& /root/hotkernel_30/data/topsyscall-C-s_5_2.$yyyymmddhh.out
sleep 1
./iotop -C -t 10 5 4 >& /root/hotkernel_30/data/iotop-C-t_10_5_4.$yyyymmddhh.out
sleep 1
./cswstat_10sec.d >& /root/hotkernel_30/data/cswstat_10sec.$yyyymmddhh.out
sleep 1
./idle-state_10sec.d >& /root/hotkernel_30/data/idle-state_10sec.$yyyymmddhh.out
sleep 1
./watch_cpu_acpi_read_port.d >& /root/hotkernel_30/data/watch_cpu_acpi_read_port.$yyyymmddhh.out
sleep 1
./watch_cstate.d >& /root/hotkernel_30/data/watch_cstate.$yyyymmddhh.out
sleep 1
/opt/DTT/Cpu/speedstep.d >& /root/hotkernel_30/data/speedstep.$yyyymmddhh.out
sleep 1
/opt/DTT/Bin/watch_acpi_cpu_cstate_60.d >& /root/hotkernel_30/data/watch_acpi_cpu_cstate_60.$yyyymmddhh.out
sleep 1
/opt/DTT/Bin/watch_cpu_acpi_read_port_60.d >& /root/hotkernel_30/data/watch_cpu_acpi_read_port_60.$yyyymmddhh.out
sleep 1
/opt/DTT/Bin/cpudists >& /root/hotkernel_30/data/cpudists.$yyyymmddhh.out
sleep 1
/opt/DTT/Bin/cputimes >& /root/hotkernel_30/data/cputimes.$yyyymmddhh.out

echo 'other: prstat -a -m 5 3, dladm show-link -s, dlstat show-link -r, dlstat, mpstat 10 2, intrstat 5 3'

sleep 1
dladm show-link -s >& /root/hotkernel_30/data/dladm_show-link-s.$yyyymmddhh.out
sleep 1
dlstat show-link -r >& /root/hotkernel_30/data/dlstat_show-link-r.$yyyymmddhh.out
sleep 1
dlstat >& /root/hotkernel_30/data/dlstat.$yyyymmddhh.out
sleep 1
intrstat 5 3 >& /root/hotkernel_30/data/intrstat_5_3.$yyyymmddhh.out
sleep 1
powertop -d 3 -t 10 >& /root/hotkernel_30/data/powertop-d_3-t_10.$yyyymmddhh.out
sleep 1
lockstat -kIW -D 20 sleep 30 >& /root/hotkernel_30/data/lockstat-kIW-D_20_sleep_30.$yyyymmddhh.out
sleep 1
/root/hotkernel_30/intrtime.pl >& /root/hotkernel_30/data/intrtime.pl.$yyyymmddhh.out
#12

Updated by Jon Strabala over 7 years ago

Once again, kernel CPU usage on an idle system crept up, from 99.9% idle to 98.6% idle.

And like the last run, the system hung relatively fast, in twelve (12) hours - such that once again I never saw unix`acpi_cpu_cstate increase prior to the crash when running 'hotkernel'; typically the system would need to last about 8 more hours for that.

I have attached "logs_data_scripts.151a.final.20110914.tar.gz", a second bundle of logs, data, and modified/new dtrace scripts (in /opt/DTT on top of the DTrace Toolkit) used for all the 20-minute samples.

Note that in all cases this is "dev-il" 151a, not the newly released ISO repository.

#13

Updated by Jon Strabala over 7 years ago

Right now this looks like it might be purely a BIOS issue. A while back the motherboard vendor told me "we do not support OpenIndiana or any other derivative of OpenSolaris" when I opened a trouble ticket with them - and that was the end of SuperMicro's help.

However, SuperMicro just came out with a new BIOS for the X9SCA-F board (x9sca1.819.zip), so I did the following:

  1. I made a "released oi_151a" BE
  2. I reset my /etc/power.conf
  3. I updated/flashed my BIOS to x9sca1.819

And the initial results are very promising for my Sandy Bridge C204 E3-1270 system - I have been running for 5 hours now and I don't see the weird kernel CPU leak; my guess is that the system is stable now and will not hang. It is nice (and normal) to see my idle system report 100% idle when I run top.

Running for 5 hours, I don't see the CPU creep via basic 'top' (yes, I know 'prstat' is better than 'top'; I am sampling this system three (3) times an hour with 47 scripts, 2 of which are prstat commands, and most of which are dtrace based).

As seen below, I occasionally get 100% idle, and it looks like with the new BIOS there is no longer an unknown leak in the kernel CPU column.

top_10sec.2011-09-14_1240.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1240.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1300.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1300.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1320.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1320.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1340.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1340.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1400.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1400.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1420.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1420.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1440.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1440.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1500.out:CPU states: 99.9% idle,  0.1% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1500.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1520.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1520.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1540.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1540.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1600.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1600.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1620.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1620.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1640.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1640.out:CPU states: 99.8% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1700.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1700.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1720.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1720.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1740.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1740.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap

Compared with all my other runs, I would most likely have seen a significant problem by now, something like "CPU states: 98.8% idle, 0.0% user, 1.2% kernel, 0.0% iowait, 0.0% swap".

It might not matter at all, but my /etc/power.conf file is now exactly as shown below - I tried these settings several times before, so I am pretty sure it is the BIOS update that is making the difference.

#
# run 'pmconfig' if you change this file
#
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable poll-mode
#
# fun commands:
#      kstat -m cpu_info | grep "supported_frequencies_Hz" 
#      kstat -m cpu_info | grep "current_clock_Hz" 
#14

Updated by Jon Strabala over 7 years ago

I seem to have jumped the gun on claiming a BIOS fix; things looked great for the first 10 hours, and then I started seeing the kernel CPU creep again. The system has been up for about 19 hours now, BUT I now see a 'lost' 0.3% CPU in the kernel.

Based on past data runs with 151a (dev), it does seem to be leaking 'lost' kernel CPU cycles about 9X slower, so maybe the BIOS update just makes things harder to debug (perhaps it helps, or maybe there are multiple issues), or perhaps it was a combination of my /etc/power.conf and moving to a BE built from the newly released 151a.

I have provided a simple grep of my 20-minute 'top' samples below - I am still sampling this system three (3) times an hour with 47 scripts, 2 of which are prstat commands, most of them dtrace based. Perhaps this time I will see something of interest if/when it finally hangs.

root@lab10:~/hotkernel_30/data# rup lab10
          lab10    up          19:16,    load average: 0.00, 0.01, 0.01

root@lab10:~/hotkernel_30/data#  grep 'CPU s' top_*
top_10sec.2011-09-14_1240.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1240.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1300.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1300.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1320.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1320.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1340.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1340.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1400.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1400.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1420.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1420.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1440.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1440.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1500.out:CPU states: 99.9% idle,  0.1% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1500.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1520.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1520.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1540.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1540.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1600.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1600.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1620.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1620.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1640.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1640.out:CPU states: 99.8% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1700.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1700.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1720.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1720.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1740.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1740.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1800.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1800.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1820.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1820.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1840.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1840.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1900.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1900.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1920.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1920.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1940.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_1940.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2000.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2000.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2020.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2020.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2040.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2040.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2100.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2100.out:CPU states: 99.9% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2120.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2120.out:CPU states: 99.8% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2140.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2140.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2200.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2200.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2220.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2220.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2240.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2240.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2300.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2300.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2320.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2320.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2340.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-14_2340.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0000.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0000.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0020.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0020.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0040.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0040.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0100.out:CPU states: 99.8% idle,  0.1% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0100.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0120.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0120.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0140.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0140.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0200.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0200.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0220.out:CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0220.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0240.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0240.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0300.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0300.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0320.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0320.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0340.out:CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0340.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0400.out:CPU states: 99.3% idle,  0.0% user,  0.7% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0400.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0420.out:CPU states: 99.6% idle,  0.0% user,  0.4% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0420.out:CPU states: 99.7% idle,  0.0% user,  0.3% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0440.out:CPU states: 99.7% idle,  0.0% user,  0.3% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0440.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0500.out:CPU states: 99.6% idle,  0.0% user,  0.4% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0500.out:CPU states: 99.7% idle,  0.0% user,  0.3% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0520.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0520.out:CPU states: 99.7% idle,  0.0% user,  0.3% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0540.out:CPU states: 99.6% idle,  0.0% user,  0.4% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0540.out:CPU states: 99.7% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0600.out:CPU states: 99.4% idle,  0.0% user,  0.6% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0600.out:CPU states: 99.7% idle,  0.0% user,  0.3% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0620.out:CPU states: 99.6% idle,  0.1% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0620.out:CPU states: 99.7% idle,  0.0% user,  0.3% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0640.out:CPU states: 99.6% idle,  0.0% user,  0.4% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0640.out:CPU states: 99.7% idle,  0.0% user,  0.3% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0700.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0700.out:CPU states: 99.7% idle,  0.0% user,  0.3% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0720.out:CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
top_10sec.2011-09-15_0720.out:CPU states: 99.7% idle,  0.0% user,  0.3% kernel,  0.0% iowait,  0.0% swap
root@lab10:~/hotkernel_30/data#
#15

Updated by Aubrey Li over 7 years ago

Hi,

It looks like this is not related to CPU power management.

From the mpstat log in the first bug reporter's posts, the
interrupt count is higher than expected.

So, when the CPU% leak is present, what does the following command say?

#dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10"

Or if apix is not enabled,

#dtrace -n "do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10"

Thanks,
-Aubrey

#16

Updated by Miha Krizaj over 7 years ago

Hi,

I had a "scheduled" reboot today to keep the machine up, and before rebooting I ran the command from your post above (kernel usage was at some 15% or so):

dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10"

This did not hang the system. I rebooted anyway (using "reboot") to get another 10 or so days before it hangs, and I noticed that after rebooting, kernel usage was still high - even higher than before the reboot: the completely idle system was using 25% in kernel. (Until now the system had always hung and been powered off while I was trying various dtrace scripts, so this was the first plain reboot.)

After turning it off (init 5) and booting again, kernel usage was 0% or so. Perhaps I do not understand the reboot command, but I find this very interesting; I thought the reboot command restarts the kernel.
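
One guess on my side (not verified): if OpenIndiana's Fast Reboot default applies here, a plain "reboot" boots the new kernel directly and skips the firmware/BIOS POST, so APIC/chipset state may not be fully reset, whereas a full power-off (init 5) clearly resets it. Something like the following should show and work around that default:

# check whether fast reboot is the default boot behaviour
svcprop -p config/fastreboot_default svc:/system/boot-config:default

# force a reboot that goes through the firmware/BIOS instead
reboot -p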

Here is also my boot log after reboot if this can show something:

Reboot:
http://www.pastie.org/2554875

And here is the dtrace output:

unix`_interrupt+0xba
1
unix`_interrupt+0xba
unix`wrmsr+0xd
apix`apic_timer_reprogram+0x19
unix`cbe_reprogram+0x25
genunix`cyclic_fire+0xd5
unix`cbe_fire+0x38
apix`apix_dispatch_pending_autovect+0x12c
apix`apix_dispatch_pending_hilevel+0x15
unix`switch_sp_and_call+0x13
apix`apix_do_interrupt+0x31b
unix`_interrupt+0xba
unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`idle+0x114
unix`thread_start+0x8
1
unix`_interrupt+0xba
unix`wrmsr+0xd
apix`apic_timer_reprogram+0x19
unix`cbe_reprogram+0x25
genunix`cyclic_reprogram_cyclic+0xf0
genunix`cyclic_reprogram+0xa4
genunix`callout_heap_delete+0x1e3
genunix`callout_realtime+0x1e
genunix`cyclic_softint+0xdc
unix`cbe_low_level+0x17
unix`av_dispatch_softvect+0x5f
apix`apix_dispatch_softint+0x34
unix`switch_sp_and_call+0x13
1
unix`_interrupt+0xba
unix`fakesoftint_return
unix`cbe_restore_level+0x17
genunix`cyclic_reprogram_cyclic+0xfd
genunix`cyclic_reprogram+0xa4
genunix`callout_heap_delete+0x1e3
genunix`callout_realtime+0x1e
genunix`cyclic_softint+0xdc
unix`cbe_low_level+0x17
unix`av_dispatch_softvect+0x5f
apix`apix_dispatch_softint+0x34
unix`switch_sp_and_call+0x13
1
unix`_interrupt+0xba
unix`page_lookup_nowait+0x9a
zfs`zfs_getpage+0x1c9
genunix`fop_getpage+0x9a
genunix`segvn_fault+0xae1
genunix`as_fault+0x5ee
unix`pagefault+0x99
unix`trap+0xe63
unix`0xfffffffffb8001d6
1
unix`_interrupt+0xba
apix`deadline_timer_reprogram+0x34
apix`apic_timer_reprogram+0x19
unix`cbe_reprogram+0x25
genunix`cyclic_fire+0xd5
unix`cbe_fire+0x5a
apix`apix_dispatch_pending_autovect+0x12c
apix`apix_dispatch_pending_hilevel+0x15
unix`switch_sp_and_call+0x13
apix`apix_do_interrupt+0x31b
unix`_interrupt+0xba
unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`idle+0x114
unix`thread_start+0x8
130
unix`_interrupt+0xba
unix`wrmsr+0xd
apix`apic_timer_reprogram+0x19
unix`cbe_reprogram+0x25
genunix`cyclic_fire+0xd5
unix`cbe_fire+0x5a
apix`apix_dispatch_by_vector+0x93
apix`apix_dispatch_hilevel+0x15
unix`switch_sp_and_call+0x13
apix`apix_do_interrupt+0x25e
unix`_interrupt+0xba
unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`idle+0x114
unix`thread_start+0x8
1002
unix`_interrupt+0xba
unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`idle+0x114
unix`thread_start+0x8
1038
unix`_interrupt+0xba
unix`wrmsr+0xd
apix`apic_timer_reprogram+0x19
unix`cbe_reprogram+0x25
genunix`cyclic_fire+0xd5
unix`cbe_fire+0x5a
apix`apix_dispatch_pending_autovect+0x12c
apix`apix_dispatch_pending_hilevel+0x15
unix`switch_sp_and_call+0x13
apix`apix_do_interrupt+0xdc
unix`_interrupt+0xba
unix`fakesoftint_return
unix`cbe_restore_level+0x17
genunix`cyclic_reprogram_cyclic+0xfd
genunix`cyclic_reprogram+0xa4
genunix`callout_heap_delete+0x1e3
genunix`callout_realtime+0x1e
genunix`cyclic_softint+0xdc
unix`cbe_low_level+0x17
unix`av_dispatch_softvect+0x5f
1738
unix`_interrupt+0xba
unix`wrmsr+0xd
apix`apic_timer_reprogram+0x19
unix`cbe_reprogram+0x25
genunix`cyclic_fire+0xd5
unix`cbe_fire+0x5a
apix`apix_dispatch_pending_autovect+0x12c
apix`apix_dispatch_pending_hilevel+0x15
unix`switch_sp_and_call+0x13
apix`apix_do_interrupt+0x31b
unix`_interrupt+0xba
unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`idle+0x114
unix`thread_start+0x8
3697185
#17

Updated by Aubrey Li over 7 years ago

hmmm..., something is very interesting...

It looks like your system is suffering from an interrupt storm. There shouldn't be 3697185 interrupts in 10 seconds on CPU0.
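
For scale: 3697185 stacks in 10 seconds is roughly 370,000 interrupt entries per second on CPU0 alone, while the idle top samples earlier in this ticket report only around 1,000 intr/sec for the whole system, so this is several hundred times the expected rate.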

Can you please give the following command output?

$ echo hz/D | sudo mdb -k

$ echo apic_timer::print apic_timer_t | sudo mdb -k

#18

Updated by Miha Krizaj over 7 years ago

This number grows as kernel usage does, but it is hard to see until kernel usage reaches some 10-15%.

echo apic_timer::print apic_timer_t | sudo mdb -k
{
    mode = 0x2
    apic_timer_enable_ops = deadline_timer_enable
    apic_timer_disable_ops = deadline_timer_disable
    apic_timer_reprogram_ops = deadline_timer_reprogram
}
echo hz/D | sudo mdb -k
hz:
hz: 100
admin@deepspace:~#

#19

Updated by Aubrey Li over 7 years ago

Okay, thanks for reporting back...

Looking at the hottest calling stack, something is interesting too.

(1) intr occurs

unix`_interrupt+0xba

(2) immediately after wrmsr?? I think this is not coincidental,
because there are 3697185 identical stacks in 10 seconds.

    unix`wrmsr+0xd
    apix`apic_timer_reprogram+0x19
    unix`cbe_reprogram+0x25
    genunix`cyclic_fire+0xd5
    unix`cbe_fire+0x5a
    apix`apix_dispatch_pending_autovect+0x12c
    apix`apix_dispatch_pending_hilevel+0x15
    unix`switch_sp_and_call+0x13
    apix`apix_do_interrupt+0x31b
    unix`_interrupt+0xba

(3) the processor is in idle.
unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`idle+0x114
unix`thread_start+0x8
3697185

Here, from the stack, the scenario is:

- the system is idle
- a clock-related interrupt occurs; it goes through
  cbe->cyclic->cbe->apic and finally calls apic_timer_reprogram to
  program the apic timer register.

However, it looks like every time the register is programmed
(wrmsr is called), another interrupt occurs.

So, no wonder the kernel CPU% is creeping.

Can you please run the following dtrace script? I think 1s is enough. No need to wait for 10~15% kernel time; run it as soon as you see the same calling stack as you reported above.

==================================================
#!/usr/sbin/dtrace -s

deadline_timer_reprogram:entry
/cpu == 0/
{
        t = (hrtime_t)arg0;
        a = (long long)t;
        printf(" target = %llx - now = %x\n", a, timestamp);
}
==================================================
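
One way to run it, assuming it is saved as, say, /tmp/deadline.d (the shebang line means it can be executed directly, or it can be bounded with -c in the style of the earlier one-liners):

chmod +x /tmp/deadline.d
dtrace -s /tmp/deadline.d -c "sleep 1"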

Thanks,
-Aubrey

#20

Updated by Miha Krizaj over 7 years ago

The output of the script is:

0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a31031
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a313b4
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3172d
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a31aad
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a31e34
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a321b1
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a32530
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a328b2
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a32c2d
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a32fac
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3332b
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a336a9
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a33a2a
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a33da4
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a34124
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a344aa
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3482a
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a34ba4
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a34f24
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a352a9
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3562d
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a359a9
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a35d29
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a360a6
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a36425
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a367a1
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a36b2b
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a36ea8
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3722a
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a375a6
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a37923
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a37ca3
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a38021
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3839d
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3871a
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a38a9e
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a38e23
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a391a2
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a39521
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a398a2
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a39c25
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a39fa4
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3a327
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3a6ac
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3aa2b
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3adac
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3b129
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3b4d5
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3b858
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3bbdd
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3bf7a
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3c2f8
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3c695
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3ca15
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3cd96
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3d115
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3d49b
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3d81a
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3db9b
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3df19
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3e2a2
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3e650
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3e9db
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3ed59
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3f0e6
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3f464
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f6a3fb67
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73ac646
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73ad171
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73adb9f
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73ae56e
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73aef54
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73af95a
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b02f1
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b0c8e
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b166c
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b206d
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b2a61
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b3417
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b3d81
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b4748
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b511c
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b5ad2
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b64ec
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b6ec2
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b7882
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b826c
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b8c43
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73b966e
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73ba00b
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73ba9a7
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73bb39c
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73bbd8e
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73bc7c0
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73bd162
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73bdb2c
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73be539
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73bef0b
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73bf95c
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c02d8
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c0ca4
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c1661
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c2035
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c2a1a
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c3400
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c3dbd
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c47af
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c6dda
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c72e8
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c7794
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c7c5d
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c810d
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c85ef
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73c8ad1
0  30881   deadline_timer_reprogram:entry  target = 521f7d52280 - now = 521f73c9506
#21

Updated by Aubrey Li over 7 years ago

  • Status changed from New to Feedback

Okay, this looks quite weird: the interrupt occurs before the target is due.

But if you add the following line to /etc/system

set apix:apic_timer_preferred_mode = 0x0

and reboot, is the problem gone?

Thanks,
-Aubrey

#22

Updated by Jon Strabala over 7 years ago

Aubrey,

======================================

After adding the following line to /etc/system and rebooting (I will not know for a few days whether this solves the CPU kernel creep / interrupt problem):

set apix:apic_timer_preferred_mode = 0x0

With the mode set to 0x0 I see 0 events/sec. in deadline_timer_reprogram:entry - I am running that test now.

Below, with the mode at 0x2, I saw 1,257 events/sec. in deadline_timer_reprogram:entry at initial reboot.

=================================

The rest of this E-mail deals with a test I just finished where the mode was set to 0x2.

I just finished a 101-hour test; it ended in a crash, and my analysis is very similar to what you came up with, though without the 'deep' knowledge to put things in context.

The following test is without "set apix:apic_timer_preferred_mode = 0x0", i.e. apix:apic_timer_preferred_mode is at its default of 0x2.

     date
     Sun Sep 18 18:00:19 PDT 2011

     uname -a
     SunOS lab10 5.11 oi_151a i86pc i386 i86pc

     rup lab10
               lab10    up  4 days,  5:39,    load average: 0.00, 0.00, 0.02

     beadm list | egrep NR\|BE
     BE                        Active Mountpoint Space Policy Created
     oi_151a                   NR     /          57.6G static 2011-09-14 09:01

     /etc/system
     set ip:ip_squeue_worker_wait=0
     set snooping=1
     set pcplusmp:apic_panic_on_nmi=1

     /etc/power.conf
     cpu_deep_idle enable
     cpu-threshold 10s
     cpupm enable poll-mode

     echo hz/D | sudo mdb -k
     hz:
     hz:             100

     echo apic_timer::print apic_timer_t | sudo mdb -k
     {
         mode = 0x2
         apic_timer_enable_ops = deadline_timer_enable
         apic_timer_disable_ops = deadline_timer_disable
         apic_timer_reprogram_ops = deadline_timer_reprogram
     }

This time unix`acpi_cpu_cstate actually decreased (as opposed to increased) across the run, up to the kernel crash / panic-dump attempt at 101.5 hours.

I also see a correlation between the kernel CPU creep (or growth) and the following counters:

'psm_get_cpu_id'  - at boot about 1313, at crash about 4636
'cpu_idle_exit'   - at boot about 216, at crash about 6782
'apix_intr_enter' - at boot about 78, at crash about 7158

*** note: the initial values were determined after a reboot, as I wrote the logging scripts a few days into the test run ***

The system (which is completely idle) crashed on a dtrace script after the kernel CPU usage crept to about 1.6%, for no known reason.

panic[cpu0]/thread ffffff001e8cbc40: deadman: timed out after 50 seconds of clock inactivity

      *** skip stack dump, but see image "oi_151_released.console.png" for full details  ***
      *** this thread seems to be a 'highly' called thread whatever it is, see '6958602.d' below ***

syncing to file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
 2:23 100% done

Please see full console image attached as "oi_151_released.console.png"

The stack trace in the console image (attached) is similar to what I think is the major activity. It takes one of two possible forms, one with unix`wrmsr and one with apix`deadline_timer_reprogram; refer to "oi_151_released.101hours.txt" (also attached) for greater detail and actual counts:

  5  13355              cpu_idle_exit:entry
              apix`apix_do_interrupt+0x3e
              unix`_interrupt+0xba
              unix`wrmsr+0xd
              apix`apic_timer_reprogram+0x19
              unix`cbe_reprogram+0x25
              genunix`cyclic_fire+0xd5
              unix`cbe_fire+0x5a
              apix`apix_dispatch_pending_autovect+0x12c
              apix`apix_dispatch_pending_hilevel+0x15
              unix`switch_sp_and_call+0x13
              apix`apix_do_interrupt+0x31b
              unix`_interrupt+0xba
              genunix`fop_ioctl+0x7b
              genunix`ioctl+0x18e
              unix`sys_syscall+0x17a

  5  13355              cpu_idle_exit:entry
              apix`apix_do_interrupt+0x3e
              unix`_interrupt+0xba
              apix`deadline_timer_reprogram+0x34
              apix`apic_timer_reprogram+0x19
              unix`cbe_reprogram+0x25
              genunix`cyclic_fire+0xd5
              unix`cbe_fire+0x5a
              apix`apix_dispatch_pending_autovect+0x12c
              apix`apix_dispatch_pending_hilevel+0x15
              unix`switch_sp_and_call+0x13
              apix`apix_do_interrupt+0x31b
              unix`_interrupt+0xba
              genunix`fop_ioctl+0x7b
              genunix`ioctl+0x18e
              unix`sys_syscall+0x17a

displaying counts of stack traces across 1ms for cpu_idle_exit:entry
cpu_idle_exit_1ms.d.2011-09-14_1240.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_1620.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2000.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2340.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0320.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0700.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1040.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1420.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1800.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_2140.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0120.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0500.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0840.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1220.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1600.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_2100.out: 421
cpu_idle_exit_1ms.d.2011-09-17_0230.out: 519
cpu_idle_exit_1ms.d.2011-09-17_0800.out: 544
cpu_idle_exit_1ms.d.2011-09-17_1330.out: 531
cpu_idle_exit_1ms.d.2011-09-17_1900.out: 651
cpu_idle_exit_1ms.d.2011-09-18_0030.out: 660
cpu_idle_exit_1ms.d.2011-09-18_0600.out: 738
cpu_idle_exit_1ms.d.2011-09-18_1130.out: 780
cpu_idle_exit_1ms.d.2011-09-18_1700.out: 797
cpu_idle_exit_1ms.d.2011-09-18_1730.out: 758
cpu_idle_exit_1ms.d.2011-09-18_1800.out: 797
*** crash ***

displaying counts of stack traces across 1ms for cpu_idle_exit:entry that contain cbe_fire
cpu_idle_exit_1ms.d.2011-09-14_1240.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_1620.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2000.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2340.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0320.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0700.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1040.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1420.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1800.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_2140.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0120.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0500.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0840.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1220.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1600.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_2100.out: 412
cpu_idle_exit_1ms.d.2011-09-17_0230.out: 504
cpu_idle_exit_1ms.d.2011-09-17_0800.out: 543
cpu_idle_exit_1ms.d.2011-09-17_1330.out: 522
cpu_idle_exit_1ms.d.2011-09-17_1900.out: 639
cpu_idle_exit_1ms.d.2011-09-18_0030.out: 653
cpu_idle_exit_1ms.d.2011-09-18_0600.out: 737
cpu_idle_exit_1ms.d.2011-09-18_1130.out: 772
cpu_idle_exit_1ms.d.2011-09-18_1700.out: 792
cpu_idle_exit_1ms.d.2011-09-18_1730.out: 757
cpu_idle_exit_1ms.d.2011-09-18_1800.out: 796
*** crash ***

displaying counts of stack traces across 1ms for cpu_idle_exit:entry that contain deadline_timer_reprogram - typically we see either deadline_timer_reprogram or wrmsr, this counts deadline_timer_reprogram
cpu_idle_exit_1ms.d.2011-09-14_1240.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_1620.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2000.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2340.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0320.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0700.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1040.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1420.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1800.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_2140.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0120.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0500.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0840.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1220.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1600.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_2100.out: 57
cpu_idle_exit_1ms.d.2011-09-17_0230.out: 62
cpu_idle_exit_1ms.d.2011-09-17_0800.out: 21
cpu_idle_exit_1ms.d.2011-09-17_1330.out: 61
cpu_idle_exit_1ms.d.2011-09-17_1900.out: 73
cpu_idle_exit_1ms.d.2011-09-18_0030.out: 100
cpu_idle_exit_1ms.d.2011-09-18_0600.out: 90
cpu_idle_exit_1ms.d.2011-09-18_1130.out: 84
cpu_idle_exit_1ms.d.2011-09-18_1700.out: 114
cpu_idle_exit_1ms.d.2011-09-18_1730.out: 92
cpu_idle_exit_1ms.d.2011-09-18_1800.out: 114
*** crash ***

displaying counts of stack traces across 1ms for cpu_idle_exit:entry that contain wrmsr - typically we see either deadline_timer_reprogram or wrmsr, this counts wrmsr
cpu_idle_exit_1ms.d.2011-09-14_1240.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_1620.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2000.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2340.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0320.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0700.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1040.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1420.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1800.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_2140.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0120.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0500.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0840.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1220.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1600.out:       <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_2100.out: 355
cpu_idle_exit_1ms.d.2011-09-17_0230.out: 442
cpu_idle_exit_1ms.d.2011-09-17_0800.out: 522
cpu_idle_exit_1ms.d.2011-09-17_1330.out: 461
cpu_idle_exit_1ms.d.2011-09-17_1900.out: 566
cpu_idle_exit_1ms.d.2011-09-18_0030.out: 553
cpu_idle_exit_1ms.d.2011-09-18_0600.out: 647
cpu_idle_exit_1ms.d.2011-09-18_1130.out: 688
cpu_idle_exit_1ms.d.2011-09-18_1700.out: 678
cpu_idle_exit_1ms.d.2011-09-18_1730.out: 665
cpu_idle_exit_1ms.d.2011-09-18_1800.out: 682
*** crash ***

NOTE: this was run after the crash, on a reboot, with the same default mode = 0x2 for apix:apic_timer_preferred_mode as in the test run discussed above; this gives 12,571 events in 10 seconds.

root@lab10:~/hotkernel_30/data# cat ../aubrey.deadline_timer_reprogram.d
BEGIN {
  printf("looking at deadline_timer_reprogram:entry (requested by Aubrey Li)");
}
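/*
 * arg0 is the absolute hrtime target handed to deadline_timer_reprogram();
 * the built-in 'timestamp' is the current gethrtime() value when the probe fires.
 */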
deadline_timer_reprogram:entry
{
  t = (hrtime_t)arg0;
  a = (long long)t;
  printf("target = %llx - now = %x\n", a, timestamp);
}
tick-10s {
  exit(0);
}

root@lab10:~/hotkernel_30/data# ls -ltr aubrey.deadline_timer_reprogram.d.2011-09-19_1055.out
-rw-r--r-- 1 root root 1081308 2011-09-19 10:56 aubrey.deadline_timer_reprogram.d.2011-09-19_1055.out

root@lab10:~/hotkernel_30/data# head -10 aubrey.deadline_timer_reprogram.d.2011-09-19_1055.out
dtrace: script '/root/hotkernel_30/aubrey.deadline_timer_reprogram.d' matched 3 probes
CPU     ID                    FUNCTION:NAME
  4      1                           :BEGIN looking at deadline_timer_reprogram:entry (requested by Aubrey Li)
  0  29397   deadline_timer_reprogram:entry target = 13954f36180 - now = 13954f33ee2

  0  29397   deadline_timer_reprogram:entry target = 13954f36180 - now = 13954f3492b

  0  29397   deadline_timer_reprogram:entry target = 13954f36180 - now = 13954f34ffb

  0  29397   deadline_timer_reprogram:entry target = 13954f36180 - now = 13954f35696

root@lab10:~/hotkernel_30/data# grep deadline_timer_reprogram:entry aubrey.deadline_timer_reprogram.d.2011-09-19_1055.out | wc -l
12571

As indicated above, if I apply the /etc/system change so that mode = 0x0 is set for apix:apic_timer_preferred_mode, the above dtrace script, aubrey.deadline_timer_reprogram.d, emits nothing, i.e. nothing is captured.

#23

Updated by Aubrey Li over 7 years ago

Hi Jon,

It looks like your system is experiencing exactly the same issue as Miha's.

To check whether this issue is fixed, you can just run the following command on CPU0; there is no need to run it on all CPUs.

dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10"

From Miha's report, there were 3697185 intrs in the dtrace output before. Now you should
see only ~1000 intrs in the dtrace output if it's fixed (roughly the 100Hz clock over 10 seconds).

The high interrupt count is what I have observed so far; it should be one cause of the CPU% leak.
I'm not sure whether the system has any other issues causing the CPU utilization leak.

So please continue to run your testing and let me know the result.

Thanks,
-Aubrey

#24

Updated by Jon Strabala over 7 years ago

Aubrey,

My basic test suite takes about 4 days to be definitive, but after running for 8 hours with

root@lab10:~/hotkernel_30# echo apic_timer::print apic_timer_t | sudo mdb -k
{
    mode = 0
    apic_timer_enable_ops = oneshot_timer_enable
    apic_timer_disable_ops = oneshot_timer_disable
    apic_timer_reprogram_ops = oneshot_timer_reprogram
}
root@lab10:~/hotkernel_30#

dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10" gives 1093, in line with what Miha is seeing - refer to the "dtrace_req2_1sec_5_sec_10sec.txt" attachment.

Also the dtrace script based on your prior posting

BEGIN {
  printf("looking at deadline_timer_reprogram:entry (requested by Aubrey Li)");
}
deadline_timer_reprogram:entry
{
  t = (hrtime_t)arg0;
  a = (long long)t;
  printf("target = %llx - now = %x\n", a, timestamp);
}
tick-10s {
  exit(0);
}

The above script, "aubrey.deadline_timer_reprogram.d", still returns nothing of interest with the 0x0 setting (just the exit(0) event).
root@lab10:~/hotkernel_30# dtrace -s aubrey.deadline_timer_reprogram.d
dtrace: script 'aubrey.deadline_timer_reprogram.d' matched 3 probes
CPU     ID                    FUNCTION:NAME
  5      1                           :BEGIN looking at deadline_timer_reprogram:entry (requested by Aubrey Li)
  5  65176                        :tick-10s

root@lab10:~/hotkernel_30#

Contrast this "no output" for deadline_timer_reprogram:entry with the 12,571 count of deadline_timer_reprogram:entry under the 0x2 setting in my prior post.

At this point, when I run "top -q" and then, in the curses UI, "n 10" and "s 10", I still see mostly 99.9% idle and sometimes 100% idle. Note this is not very different from my last test with the 0x2 setting, as it is too early to get much discrimination (the CPU kernel leak may or may not be fixed).

last pid:  4373;  load avg:  0.00,  0.00,  0.00;  up 0+08:24:15                 19:58:17
53 processes: 52 sleeping, 1 on cpu
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
Kernel: 308 ctxsw, 589 intr, 22 syscall
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  4372 root        1  59  -20 3840K 1984K cpu/5    0:00  0.00% top
  4290 admin       1  59    0 7748K 5176K sleep    0:00  0.00% sshd
   318 root        7  59    0 4404K 3196K sleep    0:00  0.00% devfsadm
    10 root       13  59    0   15M   13M sleep    0:01  0.00% svc.startd
   892 root        4  59    0 6792K 3852K sleep    0:00  0.00% inetd
    12 root       24  59    0   15M   14M sleep    0:03  0.00% svc.configd
   370 root       29  59    0 7144K 5012K sleep    0:00  0.00% nscd
   904 root        1  59    0 2708K 1676K sleep    0:00  0.00% in.routed
   914 root       25  59    0   19M   12M sleep    0:00  0.00% fmd
  1055 root        1  59    0 6620K 5516K sleep    0:01  0.00% intrd

As my basic test suite takes about 4 days to be definitive, I will continue to keep you posted with updates in a day or two.

#25

Updated by Aubrey Li over 7 years ago

Glad to see the interrupt storm is gone. Let's wait and see whether the CPU kernel leak is fixed.

#26

Updated by Miha Krizaj over 7 years ago

Hi,

I also applied set apix:apic_timer_preferred_mode = 0x0 and, as far as I can see, kernel usage is still low after 14 hours of uptime: somewhere between 0 and 0.7%. I will wait a few more days to be really sure, but it looks like this fixed it :)

This was really very helpful, thank you!

Why does it actually happen? Is this a defect of some sort in hardware or software, and if so, how is it that this is not more widespread?

Thank you again!

#27

Updated by Jon Strabala over 7 years ago

Aubrey,

As Miha indicated, I too am interested in the root cause, and more importantly in whether switching from the 'deadline_timer_*' to the 'oneshot_*' ops impacts performance for workloads such as web servers, networking, disk, etc.

Right now, after 20 hours of testing on my system lab10, I cannot detect a CPU kernel leak. I will post again tomorrow.

Thanks

#28

Updated by Aubrey Li over 7 years ago

Please provide the output of the following commands

1. #psrinfo -vp
2. #prtdiag
3. #ucodeadm -v

Thanks,
-Aubrey

#29

Updated by Miha Krizaj over 7 years ago

Hi, here are the outputs of the commands. By the way, the kernel CPU usage is still very low, so this for sure fixed it! :) Thank you!

psrinfo -vp
The physical processor has 4 cores and 8 virtual processors (0-7)
  The core has 2 virtual processors (0 4)
  The core has 2 virtual processors (1 5)
  The core has 2 virtual processors (2 6)
  The core has 2 virtual processors (3 7)
    x86 (GenuineIntel 206A7 family 6 model 42 step 7 clock 3292 MHz)
      Intel(r) Xeon(r) CPU E31240 @ 3.30GHz

prtdiag
System Configuration: Supermicro X9SCI/X9SCA
BIOS Configuration: American Megatrends Inc. 1.0c 07/21/2011
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)

==== Processor Sockets ====================================

Version                          Location Tag
-------------------------------- --------------------------
Intel(R) Xeon(R) CPU E31240 @ 3.30GHz CPU

==== Memory Device Sockets ================================

Type        Status Set Device Locator      Bank Locator
----------- ------ --- ------------------- ----------------
Unknown     in use 0   DIMM_1A             BANK0
Unknown     in use 0   DIMM_2A             BANK1
Unknown     in use 0   DIMM_1B             BANK2
Unknown     in use 0   DIMM_2B             BANK3

==== On-Board Devices =====================================
   To Be Filled By O.E.M.

==== Upgradeable Slots ====================================

ID  Status    Type             Description
--- --------- ---------------- ----------------------------
0   in use    PCI              SLOT 1
1   in use    PCI              SLOT 2
2   in use    PCI              SLOT 3
3   in use    Unknown          SLOT 4
4   in use    Unknown          SLOT 5
5   in use    Unknown          SLOT 6

ucodeadm -v
CPU     Microcode Version
0       0x18
1       0x18
2       0x18
3       0x18
4       0x18
5       0x18
6       0x18
7       0x18

#30

Updated by Rich Lowe over 7 years ago

Hey Aubrey.

Notice how deadline_timer_reprogram is not actually doing a delta calculation, but comparing the time arg directly with <= 0? Seems suspicious (especially since hrtime_t's are unsigned and absolute?)

Though I could be wildly off-base.

#31

Updated by Jon Strabala over 7 years ago

Aubrey,

I concur with Miha, the issue is no longer happening (up 45 hours, no kernel CPU leak). Here are the outputs of the commands you asked for from my system (lab10), plus 'top -q -n 10', 'rup lab10', 'uname -a', 'cat /etc/release', and 'dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()} END { trunc(@a, 3); }" -c "sleep 10"'

lab10# psrinfo -vp

The physical processor has 4 cores and 8 virtual processors (0-7)
  The core has 2 virtual processors (0 4)
  The core has 2 virtual processors (1 5)
  The core has 2 virtual processors (2 6)
  The core has 2 virtual processors (3 7)
    x86 (GenuineIntel 206A7 family 6 model 42 step 7 clock 3392 MHz)
      Intel(r) Xeon(r) CPU E31270 @ 3.40GHz

lab10# prtdiag

System Configuration: Supermicro X9SCI/X9SCA
BIOS Configuration: American Megatrends Inc. 1.10 08/19/2011
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)

==== Processor Sockets ====================================

Version                          Location Tag
-------------------------------- --------------------------
Intel(R) Xeon(R) CPU E31270 @ 3.40GHz CPU

==== Memory Device Sockets ================================

Type        Status Set Device Locator      Bank Locator
----------- ------ --- ------------------- ----------------
Unknown     in use 0   DIMM_1A             BANK0
Unknown     in use 0   DIMM_2A             BANK1
Unknown     in use 0   DIMM_1B             BANK2
Unknown     in use 0   DIMM_2B             BANK3

==== On-Board Devices =====================================
   To Be Filled By O.E.M.

==== Upgradeable Slots ====================================

ID  Status    Type             Description
--- --------- ---------------- ----------------------------
0   in use    PCI              SLOT 1
1   in use    PCI              SLOT 2
2   in use    PCI              SLOT 3
3   in use    Unknown          SLOT 4
4   in use    Unknown          SLOT 5
5   in use    Unknown          SLOT 6

lab10# ucodeadm -v

CPU     Microcode Version
0       0x18
1       0x18
2       0x18
3       0x18
4       0x18
5       0x18
6       0x18
7       0x18

lab10# top -q -n 10

last pid: 15899;  load avg:  0.00,  0.00,  0.00;  up 1+21:18:24        08:52:26
45 processes: 44 sleeping, 1 on cpu
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
Kernel: 301 ctxsw, 11 trap, 589 intr, 364 syscall, 11 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 15898 root        1  59  -20 3672K 1700K cpu/3    0:00  0.02% top
  1055 root        1  59    0 6716K 5612K sleep    0:08  0.00% intrd
    12 root       24  59    0   15M   14M sleep    0:05  0.00% svc.configd
  1081 admin       1  59    0 7796K 5208K sleep    0:04  0.00% sshd
   759 root        4  59    0 3100K 1976K sleep    0:04  0.00% automountd
   370 root       29  59    0 7712K 5252K sleep    0:02  0.00% nscd
    10 root       13  59    0   15M   13M sleep    0:01  0.00% svc.startd
   914 root       25  59    0   19M   12M sleep    0:01  0.00% fmd
   361 root        4  59    0 6852K 5548K sleep    0:01  0.00% hald
   318 root        7  59    0 4404K 3196K sleep    0:01  0.00% devfsadm

lab10# rup lab10

          lab10    up  1 day,  21:18,    load average: 0.00, 0.00, 0.00

lab10# uname -a

SunOS lab10 5.11 oi_151a i86pc i386 i86pc Solaris

lab10# cat /etc/release

                      OpenIndiana Development oi_151a X86
        Copyright 2010 Oracle and/or its affiliates. All rights reserved.
                        Use is subject to license terms.
                           Assembled 01 September 2011

lab10# dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()} END { trunc(@a, 3); }" -c "sleep 10"

dtrace: description 'apix_do_interrupt:entry' matched 2 probes
dtrace: pid 22827 has exited
CPU     ID                    FUNCTION:NAME
  3      2                             :END

              unix`_interrupt+0xba
              genunix`clock+0x1b8
              genunix`cyclic_softint+0xdc
              unix`cbe_softclock+0x1a
              unix`av_dispatch_softvect+0x5f
              apix`apix_dispatch_softint+0x34
              unix`switch_sp_and_call+0x13
                8

              unix`_interrupt+0xba
              genunix`clock+0x1bc
              genunix`cyclic_softint+0xdc
              unix`cbe_softclock+0x1a
              unix`av_dispatch_softvect+0x5f
              apix`apix_dispatch_softint+0x34
              unix`switch_sp_and_call+0x13
               17

              unix`_interrupt+0xba
              unix`acpi_cpu_cstate+0x2ae
              unix`cpu_acpi_idle+0x82
              unix`cpu_idle_adaptive+0x19
              unix`idle+0x114
              unix`thread_start+0x8
             1067

The above count '1067' for 'apix_do_interrupt:entry' has remained stable for the last day.

And finally, hourly 'top' samples across about 46 hours:

CPU states: 99.9% idle,  0.1% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states: 93.5% idle,  5.8% user,  0.8% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.8% idle,  0.0% user,  0.2% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.5% idle,  0.0% user,  0.5% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.1% user,  0.0% kernel,  0.0% iowait,  0.0% swap

#32

Updated by Jon Strabala over 7 years ago

Rich,

Rich Lowe wrote:

Hey Aubrey.

Notice how deadline_timer_reprogram is not actually doing a delta calculation, but comparing the time arg directly with <= 0? Seems suspicious (especially since hrtime_t's are unsigned and absolute?)

Though I could be wildly off-base.

Do you mean /usr/src/uts/i86pc/io/pcplusmp/apic_timer.c, specifically:

/* deadline timer reprogram */
static void
deadline_timer_reprogram(hrtime_t time)
{
        uint64_t        ticks;

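        /* note: 'time' is an absolute gethrtime()-based target, not a delta */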
        if (time <= 0) {
                /*
                 * generate an immediate interrupt
                 */
                ticks = (uint64_t)tsc_read();
        } else {
                ticks = unscalehrtime(time);
        }

        wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
}

Perhaps the line "if (time <= 0) {" above should be replaced with the seven lines from the function oneshot_timer_reprogram?

        hrtime_t        now;
        int64_t         delta;
        uint_t          ticks;

        now = gethrtime();
        delta = time - now;

        if (delta <= 0) {

I believe the 'oneshot_timer_reprogram' call works correctly, and both functions have the same prototype.

But I also notice that 'oneshot_timer_reprogram' has a number of safety checks that deadline_timer_reprogram lacks - see below for the full function; a sketch of the substitution I have in mind follows it:

static void
oneshot_timer_reprogram(hrtime_t time)
{
        hrtime_t        now;
        int64_t         delta;
        uint_t          ticks;

        now = gethrtime();
        delta = time - now;

        if (delta <= 0) {
                /*
                 * requested to generate an interrupt in the past
                 * generate an interrupt as soon as possible
                 */
                ticks = apic_min_timer_ticks;
        } else if (delta > apic_nsec_max) {
                /*
                 * requested to generate an interrupt at a time
                 * further than what we are capable of. Set to max
                 * the hardware can handle
                 */
                ticks = APIC_MAXVAL;
#ifdef DEBUG
                cmn_err(CE_CONT, "apic_timer_reprogram, request at" 
                    "  %lld  too far in future, current time" 
                    "  %lld \n", time, now);
#endif
        } else {
                ticks = APIC_NSECS_TO_TICKS(delta);
        }

        if (ticks < apic_min_timer_ticks)
                ticks = apic_min_timer_ticks;

        apic_reg_ops->apic_write(APIC_INIT_COUNT, ticks);
}
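
For illustration, here is a rough, untested sketch of the substitution I have in mind (assuming unscalehrtime() converts a nanosecond delta into TSC ticks):

static void
deadline_timer_reprogram(hrtime_t time)        /* sketch only, not a tested patch */
{
        hrtime_t        now;
        int64_t         delta;

        now = gethrtime();
        delta = time - now;     /* signed delta, as in oneshot_timer_reprogram */

        if (delta <= 0) {
                /* target already in the past - fire as soon as possible */
                delta = 0;
        }

        /* rebase the delta onto the current TSC and arm the deadline MSR */
        wrmsr(IA32_DEADLINE_TSC_MSR, (uint64_t)tsc_read() + unscalehrtime(delta));
}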

- Jon

#33

Updated by Aubrey Li over 7 years ago

The TSC-deadline mode allows software to use the local APIC timer to signal an
interrupt at an absolute time.

Writing any value (> 0) to the TSC-deadline MSR arms the timer.
Writing 0 to the MSR disarms the local APIC timer.

An interrupt is generated when the logical processor’s TSC >= the target
value in the MSR.

Looking at the target value and TSC from Miha's report

0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3f0e6
0  30881   deadline_timer_reprogram:entry  target = 521f6a3f580 - now = 521f6a3f464
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f6a3fb67
0  30881   deadline_timer_reprogram:entry  target = 521f73c8c00 - now = 521f73ac646

The interrupt occurs while TSC < target. <=== misbehavior.

This issue may exist on Solaris only. Other OSes don't hit it because they don't use the deadline TSC.

Deadline TSC has its merits, it eliminats the drift introduced when OS/VMM programs a
timer tick. In solaris, and, if we look at the timer infrastructure, we can see the cyclic, cbe
are using an absolute target, as well as it on SPARC, that is, replacing one-shot to deadline on X86
also reduces the overhead by eliminating time delta calculation.

For now, the best solution is to report this issue to your machine vendor.

Thanks,
-Aubrey

#34

Updated by Miha Krizaj over 7 years ago

So this is actually a BIOS problem triggered by Solaris specifics?

Does the line which currently solves this cause any other side effects?

Thank you for the explanation. I will try to contact supermicro support to see if this can be somehow fixed in the future.

#35

Updated by Jon Strabala over 7 years ago

I have notified Supermicro; this is their Case #SM1108302601.

#36

Updated by Jon Strabala over 7 years ago

Supermicro's short reply for Case #SM1108302601 is as follows:

"Our current BIOS also do not support this TSC-deadline" 

I am trying to get them to escalate it to AMI, the BIOS provider.

#37

Updated by Josef Sipek over 7 years ago

I have a similar system; after a boot, it had ~42% kernel usage on core 0 according to mpstat. After about 10 hours, it had ~46%. Previously, I had run it for over 5 days and saw ~69% (I rebooted it to see if the load would disappear).

# prtdiag 
System Configuration: Supermicro X9SCL/X9SCM
BIOS Configuration: American Megatrends Inc. 4.6.4 04/27/2011
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)

==== Processor Sockets ====================================

Version                          Location Tag
-------------------------------- --------------------------
Intel(R) Xeon(R) CPU E31230 @ 3.20GHz CPU

==== Memory Device Sockets ================================

Type        Status Set Device Locator      Bank Locator
----------- ------ --- ------------------- ----------------
Unknown     in use 0   DIMM_1A             BANK0
Unknown     in use 0   DIMM_2A             BANK0
Unknown     in use 0   DIMM_1B             BANK0
Unknown     in use 0   DIMM_2B             BANK0

==== On-Board Devices =====================================
   To Be Filled By O.E.M.

==== Upgradeable Slots ====================================

ID  Status    Type             Description
--- --------- ---------------- ----------------------------
0   in use    PCI Express      SLOT 4
1   in use    PCI Express      SLOT 5
2   in use    PCI Express      SLOT 6
3   in use    PCI Express      SLOT 7

Rich Lowe pointed me to this bug. I added the following line to my /etc/system:

set apix:apic_timer_preferred_mode = 0x0

and my apic_timer went from deadline_timer to oneshot_timer. I see 0% kernel load now.

#38

Updated by Aubrey Li over 7 years ago

So far this is more like an OS issue.

#39

Updated by Jon Strabala over 7 years ago

Aubrey,

I could do the recompile and test, but I would have to compile with gcc (I am not sure whether I need to build the whole OS, because I cannot get the patched Sun compilers - no big deal, it only takes 18 minutes).

If you supply a link to the corrected file, I will download it and give it a try.

Also, should I use the latest dev tree or the current base oi_151a?

#40

Updated by Jon Strabala over 7 years ago

Aubrey,

I made a new BE, 'nightly-2011-10-07-aubrey', and booted into it.

The Illumos Project     SunOS 5.11      testws  Oct. 07, 2011
SunOS Internal Development: admin 2011-Oct-07 [illumos-gate]
# uname -a
SunOS lab10 5.11 testws i86pc i386 i86pc Solaris
# beadm list
BE                        Active Mountpoint Space Policy Created
nightly-2011-10-07-aubrey NR     /          54.7G static 2011-10-07 15:46
oi_151a                   -      -          16.2M static 2011-09-14 09:01

I commented out the following in /etc/system and did a reconfiguration reboot:
* set apix:apic_timer_preferred_mode = 0x0

Other output (after reboot):
# echo hz/D | sudo mdb -k
hz:
hz:             100

# echo apic_timer::print apic_timer_t | sudo mdb -k
{
    mode = 0x2
    apic_timer_enable_ops = deadline_timer_enable
    apic_timer_disable_ops = deadline_timer_disable
    apic_timer_reprogram_ops = deadline_timer_reprogram
}

# cat /etc/power.conf
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable poll-mode

Note, I cannot run dtrace scripts as I get errors (a side effect of the gcc build). I tried running 'mcs -d -n .SUNW_ctf modname' on various modules, but had no luck getting dtrace working.

I will run this setup for a few days and see what happens with respect to the CPU kernel creep, but since no dtrace is possible I cannot really sample anything.

#41

Updated by Aubrey Li over 7 years ago

Jon,

Thanks a lot for verification.

What errors did dtrace report? You can crudely comment out the offending lines in the dtrace script files.

#42

Updated by Jon Strabala over 7 years ago

My dtrace issues have to do with the way I build the oi_151a nightly with gcc. The below is most likely overkill, but it seems to be the only way I can get dtrace to work on any of my 'home-built' illumos kernels. This is what works for me, BUT I think I am doing way too much and don't really understand why it is needed; maybe it is because the closed bits are compiled with a Sun compiler.

# strip the .SUNW_ctf section from the modules under /kernel; '.conf' files
# are excluded because mcs reports errors on them
find /kernel -type f -name "*" | grep -v '.conf' |  xargs mcs -d -n .SUNW_ctf

# likewise for the platform modules
find /platform/i86pc/kernel -type f -name "*" | grep -v '.conf' | xargs mcs -d -n .SUNW_ctf

# update the boot archive
bootadm update-archive

# reboot
sync
init 6

Prior to applying the above commands to get dtrace to work, I ran things over the weekend for 60+ hours on the most recent illumos nightly with a patched apix module (and compared it to an oi_151 release). Without dtrace this is a very seat-of-the-pants test using only the Unix 'top' command.

Looking at the patched nightly build above, run 60+ hours without "set apix:apic_timer_preferred_mode = 0x0" in /etc/system:

root@lab10:/home/admin/top# grep  'CPU s' *  | sort +1 | grep 100 | wc -l
25
root@lab10:/home/admin/top# grep  'CPU s' *  | sort +1 | grep 99.9 | wc -l
81
root@lab10:/home/admin/top# grep  'CPU s' *  | sort +1 | grep 99.8 | wc -l
11
root@lab10:/home/admin/top# grep  'CPU s' *  | sort +1 | grep 99.7 | wc -l
0
root@lab10:/home/admin/top# grep  'CPU s' *  | sort +1 | grep 99.6 | wc -l
5
root@lab10:/home/admin/top# grep  'CPU s' *  | sort +1 | grep 99.5 | wc -l
3
root@lab10:/home/admin/top# grep  'CPU s' *  | sort +1 | grep 99.4 | wc -l
3

Looking at some old logs for a non-patched oi_151 from the latest ISO across 66 hours (just counts), with "set apix:apic_timer_preferred_mode = 0x0" in /etc/system:

# cat top_10* | grep 'CPU s' | grep 100 | wc -l
51
# cat top_10* | grep 'CPU s' | grep 99.9 | wc -l
75
# cat top_10* | grep 'CPU s' | grep 99.8 | wc -l
11
# cat top_10* | grep 'CPU s' | grep 99.7 | wc -l
0
# cat top_10* | grep 'CPU s' | grep 99.6 | wc -l
3
# cat top_10* | grep 'CPU s' | grep 99.5 | wc -l
0
# cat top_10* | grep 'CPU s' | grep 99.4  | wc -l
0 

From the above (just counts; you can visualize the histogram) I base my "feeling" that "set apix:apic_timer_preferred_mode = 0x0" makes the system appear slightly "more idle" than the patch does on a totally idle system. Yes, this is VERY unscientific, and I have no idea what happens under load between the two runs.

Odd issue with applying /cpu==0/ in my dtrace commands:

This is after applying the above commands (at top) to get dtrace to work, and a reboot. One funny thing I noticed: it is almost as if the nightly build has a slightly different dtrace syntax - or did my 'mcs' commands remove the cpu symbol?

# dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10" 
dtrace: invalid probe specifier apix_do_interrupt:entry /cpu==0/ { @a[stack()]=count()}: in predicate: failed to resolve cpu: Unknown variable name

# dtrace -n "do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10" 
dtrace: invalid probe specifier do_interrupt:entry /cpu==0/ { @a[stack()]=count()}: in predicate: failed to resolve cpu: Unknown variable name

I am confounded: why is cpu an unknown variable name?

Note, if I drop /cpu==0/, which used to work (the breakage might be due to my gcc compiles and/or the 'mcs' commands used to get dtrace working), and rerun:

# dtrace -n "apix_do_interrupt:entry { @a[stack()]=count()}" -c "sleep 10" | tail -50
                7

              unix`_interrupt+0x1e9
              unix`mutex_delay_default+0x7
              unix`mutex_vector_enter+0xb9
              zfs`vdev_queue_io_done+0x49
              zfs`zio_vdev_io_done+0xe8
              zfs`zio_execute+0xd7
              genunix`taskq_thread+0x32f
              unix`thread_start+0x8
                9

              unix`_interrupt+0x1e9
              unix`mutex_delay_default+0x7
              unix`lock_set_spin+0x7b
              genunix`disp_lock_enter_high+0x11
              unix`setbackdq+0x12b
              genunix`cv_unsleep+0x98
              genunix`setrun_locked+0x73
              genunix`setrun+0x20
              genunix`cv_wakeup+0x36
              genunix`callout_list_expire+0x91
              genunix`callout_expire+0x68
              genunix`callout_realtime+0x28
              genunix`cyclic_softint+0x3bb
              unix`cbe_low_level+0x14
              unix`av_dispatch_softvect+0x8d
              apix`apix_dispatch_softint+0x35
              unix`switch_sp_and_call+0x13
               10

              unix`_interrupt+0x1e9
              unix`i86_mwait+0xd
              unix`cpu_idle_mwait+0x125
              unix`cpu_acpi_idle+0x8c
              unix`cpu_idle_adaptive+0x13
              unix`idle+0x89
              unix`thread_start+0x8
               11

              unix`_interrupt+0x1e9
               23

              unix`_interrupt+0x1e9
              unix`acpi_cpu_cstate+0x11e
              unix`cpu_acpi_idle+0xcb
              unix`cpu_idle_adaptive+0x13
              unix`idle+0x89
              unix`thread_start+0x8
             4418

#dtrace -n "do_interrupt:entry { @a[stack()]=count()}" -c "sleep 10" 
dtrace: description 'do_interrupt:entry ' matched 1 probe
dtrace: pid 101749 has exited

# cat kern_100ms_ent_ret.d
BEGIN
{
        printf("\n100ms sample of all kernel entry and returns\n");
        start = timestamp;
}

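/* count every kernel function entry and return, keyed by function name */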
fbt:*::entry { @e[probefunc]=count() }
fbt:*::return { @r[probefunc]=count() }

END {
        printf("\n%20s %10s %10s\n", "PROGRAM", "entries", "returns");
        printa("%20s %10@d %10@d\n", @e, @r);
}

tick-100ms
{
        exit(0);
}

# dtrace -s kern_100ms_ent_ret.d | head -10
dtrace: script 'kern_100ms_ent_ret.d' matched 62736 probes
CPU     ID                    FUNCTION:NAME
  3      1                           :BEGIN
100ms sample of all kernel entry and returns

  3  65519                      :tick-100ms
  3      2                             :END
             PROGRAM    entries    returns
              strlen          0          2
               kcopy          0        133
         anon_create          1          1

# dtrace -s kern_100ms_ent_ret.d | tail -35
dtrace: script 'kern_100ms_ent_ret.d' matched 62736 probes
    page_vnode_mutex         67         67
     cyclic_coverage         70         70
cyclic_coverage_hash         70         70
           gethrtime         79         79
        rw_lock_held         82         82
       tsc_gethrtime         83         83
        as_segcompar        120        120
           set_errno        248        248
      audit_getstate        249        249
      clear_stale_fd        249        249
    lwp_getdatamodel        249        249
        post_syscall        249        249
      disp_lock_exit        255        255
             cpu_get        256        256
         cpu_get_all        256        256
            p_online        256        256
   p_online_internal        256        256
p_online_internal_locked        256        256
         thread_lock        257        257
            prunstop        261        261
             do_splx        273        273
                splr        273        273
  interrupts_enabled        274        274
      sys_rtt_common        279        279
          xc_extract        446        446
apix_get_pending_spl        546        546
         apix_setspl        546        546
  get_local_apic_pri        546        546
      psm_get_cpu_id        546        546
         mutex_owned        956        956
      syscall_mstate       1036       1036
  gethrtime_unscaled       1192       1192
tsc_gethrtimeunscaled       1192       1192

If there is anything else you want tested or any 'dtrace' script you might like me to try, let me know.

#43

Updated by Jon Strabala over 7 years ago

*It looks like a long test shows no issue with applying the patch to apic_timer.c to fix BUG #1333*

Running 11 days 20 hours, things look very good. This has been running nightly-2011-10-11 with both danmcd's new Joyent-contributed acpica bits (these bits have no relation to the 1333 bug; I just wanted to test them) and the patch to apic_timer.c to solve the 1333 bug.

So we can either add the patch to apic_timer.c or, alternatively, use the workaround described above (adding the line "set apix:apic_timer_preferred_mode = 0x0" to /etc/system). The 11+ day test here did not use the /etc/system workaround.

- Jon

root@lab10:~# echo hz/D | sudo mdb -k
hz:
hz:             100

root@lab10:~# echo apic_timer::print apic_timer_t | sudo mdb -k
{
    mode = 0x2
    apic_timer_enable_ops = deadline_timer_enable
    apic_timer_disable_ops = deadline_timer_disable
    apic_timer_reprogram_ops = deadline_timer_reprogram
}

root@lab10:~# cat /etc/power.conf
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable poll-mode
root@lab10:~#

Looking at some of my hourly 'top' measurements (dtrace doesn't work on this test build), I see the following:

root@lab10:/home/admin/top# grep 'CPU states' d.log2 | wc -l
568
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 100 | wc -l
78
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.9 | wc -l
382
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.8 | wc -l
59
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.7 | wc -l
0
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.6 | wc -l
18
root@lab10:/home/admin/top#  grep 'CPU states' d.log2 | grep 99.5 | wc -l
15
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.4 | wc -l
9
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.2 | wc -l
0
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.1 | wc -l
1
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.0 | wc -l
1
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 89. | wc -l
0
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | tail -10
CPU states: 99.5% idle,  0.1% user,  0.4% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
root@lab10:/home/admin/top#

And a recent 100% idle sample, as follows:

======================================
October 24, 2011 09:34:27 AM PDT
top -q -d 2 -s 10 -n 10 -S
last pid: 104899;  load avg:  0.00,  0.00,  0.00;  up 11+17:59:54      09:34:28
48 processes: 47 sleeping, 1 on cpu
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
Kernel: 308 ctxsw, 12 trap, 621 intr, 366 syscall, 11 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
104898 root        1  59  -20 3364K 1668K cpu/6    0:00  0.03% top
     3 root        1  60  -20    0K    0K sleep   10:55  0.00% fsflush
101106 root        1  59    0 6064K 2072K sleep    0:10  0.00% sendmail
101017 root        1  59    0 6632K 5512K sleep    0:53  0.00% intrd
100754 root        4  59    0 3060K 1992K sleep    0:39  0.00% automountd
100921 root       26  59    0   32M   25M sleep    0:19  0.00% fmd
100363 root       29  59    0 7768K 5232K sleep    0:16  0.00% nscd
     6 root      150  99  -20    0K    0K sleep    0:15  0.00% zpool-rpool
100005 root       23  59    0   15M   14M sleep    0:15  0.00% svc.configd
100352 root        4  59    0 6812K 5488K sleep    0:12  0.00% hald

last pid: 104899;  load avg:  0.00,  0.00,  0.00;  up 11+18:00:04      09:34:38
48 processes: 47 sleeping, 1 on cpu
CPU states: 99.9% idle,  0.0% user,  0.1% kernel,  0.0% iowait,  0.0% swap
Kernel: 381 ctxsw, 4 trap, 764 intr, 27 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
     6 root      150  99  -20    0K    0K sleep    0:15  0.03% zpool-rpool
     3 root        1  60  -20    0K    0K sleep   10:55  0.02% fsflush
104898 root        1  59  -20 3516K 1836K cpu/6    0:00  0.00% top
100352 root        4  59    0 6812K 5488K sleep    0:12  0.00% hald
100418 root        1  59    0 3492K 2120K sleep    0:12  0.00% hald-addon-acpi
     4 root        9  60  -20    0K    0K sleep    0:01  0.00% kcfpoold
101106 root        1  59    0 6064K 2072K sleep    0:10  0.00% sendmail
100279 root        7  59    0 4420K 3164K sleep    0:09  0.00% devfsadm
100363 root       29  59    0 7768K 5232K sleep    0:16  0.00% nscd
#44

Updated by Rich Lowe over 7 years ago

  • Status changed from Feedback to In Progress
  • Assignee set to Rich Lowe
  • Tags deleted (needs-triage)
#45

Updated by Rich Lowe over 7 years ago

  • Assignee changed from Rich Lowe to Garrett D'Amore
#46

Updated by Garrett D'Amore over 7 years ago

  • % Done changed from 0 to 90

Here is my attempt at fixing this, based on conversation with Rich Lowe and my own analysis of the code. I need help testing it. As you can see, there were several problems in the old code, including sign extension problems and confusion about the starting baseline of the tick counter coming in. (Specifically, the value written to the MSR via wrmsr is on the TSC timebase, which is not the same as what gethrtime() returns, but the nsec value coming in shares its starting base with gethrtime().) My code makes this function use gethrtime() as the base properly, calculates the delta properly, and avoids the sign extension problem -- I hope.


diff -r 417c34452f03 usr/src/uts/i86pc/io/pcplusmp/apic_timer.c
--- a/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c        Fri Nov 11 14:07:54 2011 -0800
+++ b/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c        Fri Nov 11 14:19:14 2011 -0800
@@ -25,6 +25,9 @@
  * Copyright (c) 2010, Intel Corporation.
  * All rights reserved.
  */
+/*
+ * Copyright 2011 Nexenta Systems, Inc.  All rights reserved.
+ */

 #include <sys/time.h>
 #include <sys/psm.h>
@@ -302,17 +305,15 @@
 static void
 deadline_timer_reprogram(hrtime_t time)
 {
+       int64_t         delta;
        uint64_t        ticks;

-       if (time <= 0) {
-               /*
-                * generate an immediate interrupt
-                */
-               ticks = (uint64_t)tsc_read();
-       } else {
-               ticks = unscalehrtime(time);
+       delta = time - gethrtime();
+       if (delta <= 0) {
+               /* generate an immediate interrupt */
+               delta = 0;
        }
-
+       ticks = (uint64_t)tsc_read() + delta;
        wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
 }

#47

Updated by Rich Lowe over 7 years ago

That's not the patch in the state it was in when I reviewed it; you need to be unscaling the time delta. (That might be a no-op, but it's not always a no-op; see unscalehrtimef.) On real hardware made in the last 20 years it should never be a no-op, actually.

You definitely need to unscale the delta to get it into TSC units.

#48

Updated by Garrett D'Amore over 7 years ago

Good catch. Here's the revised version:


diff -r 417c34452f03 usr/src/uts/i86pc/io/pcplusmp/apic_timer.c
--- a/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c        Fri Nov 11 14:07:54 2011 -0800
+++ b/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c        Fri Nov 11 19:55:03 2011 -0800
@@ -25,6 +25,9 @@
  * Copyright (c) 2010, Intel Corporation.
  * All rights reserved.
  */
+/*
+ * Copyright 2011 Nexenta Systems, Inc.  All rights reserved.
+ */

 #include <sys/time.h>
 #include <sys/psm.h>
@@ -302,17 +305,15 @@
 static void
 deadline_timer_reprogram(hrtime_t time)
 {
+       int64_t         delta;
        uint64_t        ticks;

-       if (time <= 0) {
-               /*
-                * generate an immediate interrupt
-                */
-               ticks = (uint64_t)tsc_read();
-       } else {
-               ticks = unscalehrtime(time);
+       delta = time - gethrtime();
+       if (delta <= 0) {
+               /* generate an immediate interrupt */
+               delta = 0;
        }
-
+       ticks = (uint64_t)tsc_read() + unscalehrtime(delta);
        wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
 }
#49

Updated by Josef Sipek over 7 years ago

I tried the patch (with unscalehrtime), but the moment I tried to run dtrace the whole system locked up hard.

#50

Updated by Josef Sipek over 7 years ago

I rebooted with -ks, hoping that I could F1-A the system when it locked up. Of course, no matter what I did the system stayed functional. I ran a simple dtrace one-liner to see how long the timer was supposed to be set for:

# dtrace -n '::deadline_timer_reprogram:entry{@=llquantize(arg0-timestamp,10,0,8,10);} tick-10sec{printa(@);}'
...
  2  67972                      :tick-10sec

           value  ------------- Distribution ------------- count
           10000 |                                         0
           20000 |                                         4
           30000 |                                         0
           40000 |                                         0
           50000 |                                         6
           60000 |                                         0
           70000 |                                         0
           80000 |                                         0
           90000 |                                         0
          100000 |                                         11
          200000 |                                         11
          300000 |                                         0
          400000 |                                         9
          500000 |                                         0
          600000 |                                         22
          700000 |                                         5
          800000 |                                         18
          900000 |                                         0
         1000000 |                                         131
         2000000 |@                                        167
         3000000 |                                         103
         4000000 |                                         85
         5000000 |                                         98
         6000000 |                                         126
         7000000 |                                         164
         8000000 |                                         104
         9000000 |@@@@@@@@@@@@@@@@@                        5689
        10000000 |@@@@@@@                                  2232
        20000000 |                                         79
        30000000 |                                         156
        40000000 |@                                        393
        50000000 |                                         104
        60000000 |                                         49
        70000000 |                                         96
        80000000 |                                         50
        90000000 |                                         96
       100000000 |@@                                       626
       200000000 |@@                                       653
       300000000 |@                                        336
       400000000 |@                                        333
       500000000 |@                                        321
       600000000 |@                                        331
       700000000 |@                                        246
       800000000 |@                                        168
       900000000 |@                                        195
   >= 1000000000 |                                         0

Not bad: the bulk of the deltas land in the 9-10 ms buckets, i.e. the 100Hz clock tick, with some shorter durations as well.

I rebooted, -k this time. I logged in to X and ran the same dtrace one-liner; it spat out one @ showing one negative delta. I ^C'd it, and about a second later the system locked up. F1-A & $<systemdump do nothing.

I saw these lockups before the patch as well, so, at the very least, I haven't noticed any regressions. ;)

#51

Updated by Aubrey Li over 7 years ago

Garrett D'Amore wrote:

Good catch. Here's the revised version:

[...]

Two comments on this patch:

1. Since we are using tsc_read() here, we should make sure that the thread can't
migrate while calculating the TSC tick and reprogramming the MSR.

2. Per the latest SDM V3a (link below), in xAPIC mode (in which the local-APIC
registers are memory-mapped), software must serialize between the memory-mapped
write to the LVT entry and the WRMSR to IA32_TSC_DEADLINE.

http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html

#52

Updated by Garrett D'Amore over 7 years ago

Thanks Aubrey. I also noticed by reading the manual that negative values are handled properly, so we don't need that if branch either.

I've got another version to try, which uses the algorithm suggested in the SDM to serialize the WRMSR value. I believe this should work. Note that we need not worry about thread migration, as I believe this routine is called in a context that excludes migration. Again, I need help testing this code.


diff -r 417c34452f03 usr/src/uts/i86pc/io/pcplusmp/apic_timer.c
--- a/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c        Fri Nov 11 14:07:54 2011 -0800
+++ b/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c        Sat Nov 12 09:27:47 2011 -0800
@@ -25,6 +25,9 @@
  * Copyright (c) 2010, Intel Corporation.
  * All rights reserved.
  */
+/*
+ * Copyright 2011 Nexenta Systems, Inc.  All rights reserved.
+ */

 #include <sys/time.h>
 #include <sys/psm.h>
@@ -302,17 +305,33 @@
 static void
 deadline_timer_reprogram(hrtime_t time)
 {
+       int64_t         delta;
        uint64_t        ticks;

-       if (time <= 0) {
-               /*
-                * generate an immediate interrupt
-                */
-               ticks = (uint64_t)tsc_read();
-       } else {
-               ticks = unscalehrtime(time);
-       }
+       /*
+        * According to Intel SDM v3a, negative time is ok.
+        * Note that this entire routine is called with
+        * CBE_HIGH_PIL, so we needn't worry about preemption.
+        */
+       delta = time - gethrtime();

+       /*
+        * Now we have to serialize this per the SDM.  The algorithm
+        * recommended by Intel is this:
+        *
+        * a) write a big value to the deadline register
+        * b) read the register back
+        * c) if it reads zero, go back to a and try again
+        * d) write the new value
+        */
+       do {
+               /* write a really big value */
+               ticks = (uint64_t)tsc_read() + (1u << 31);
+               wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
+               ticks = rdmsr(IA32_DEADLINE_TSC_MSR);
+       } while (ticks == 0);
+       /* Now we shouldn't be interrupted, we can set the deadline */
+       ticks = (uint64_t)tsc_read() + unscalehrtime(delta);
        wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
 }
#53

Updated by Aubrey Li over 7 years ago

Garrett D'Amore wrote:

Thanks Aubrey. I also noticed by reading the manual that negative values are handled properly, so we don't need that if branch either.

Yeah, the negative value can be handled properly by the deadline timer, but not by unscalehrtime(): there is a macro in that routine which bluntly converts the signed delta to an unsigned variable.
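A standalone two-liner (not the kernel code) shows the hazard:

int64_t  delta = -1000;            /* deadline already 1 us in the past   */
uint64_t ticks = (uint64_t)delta;  /* 18446744073709550616                */
/*
 * A conversion like that turns "fire immediately" into a deadline nearly
 * 2^64 ticks out -- centuries at 2.2 GHz -- so the timer never fires.
 */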

I've got another version to try, which uses the algorithm suggested in the SDM to serialize the WRMSR value.

The proposed patch is not the right way per the latest SDM. Writes to the IA32_TSC_DEADLINE MSR are not serializing, so system software should not use a WRMSR to IA32_TSC_DEADLINE as a serializing instruction; read and write accesses to IA32_TSC_DEADLINE and other MSRs occur in program order regardless. As the SDM indicates, the only scenario that needs a serializing write is xAPIC mode (in which the local-APIC registers are memory-mapped), where software must serialize between the memory-mapped write to the LVT entry and the WRMSR to IA32_TSC_DEADLINE. In x2APIC mode, no serialization is required between the two writes (by WRMSR) to the LVT and IA32_TSC_DEADLINE MSRs.

I believe this should work. Note that we need not worry about thread migration, as I believe this routine is called in a context that excludes migration. Again, I need help testing this code.

If this routine is called in interrupt context, in which IPL = 14, then we don't need to worry about thread migration.
However, I don't think that's always true: cbe_reprogram() will be called whenever the root is changed, and there are multiple paths into it, as I found out with dtrace, for example:

1.
genunix`cyclic_reprogram_cyclic+0xdd
genunix`cyclic_reprogram+0xa4
genunix`callout_heap_insert+0x6b
genunix`timeout_generic+0x32b
genunix`cv_timedwait_hires+0xad
genunix`cv_reltimedwait+0x4f
stmf`stmf_svc+0x41a
genunix`taskq_thread+0x22e
unix`thread_start+0x8
231

2.
genunix`cyclic_reprogram_cyclic+0xdd
genunix`cyclic_reprogram+0xa4
genunix`callout_heap_insert+0x6b
genunix`timeout_generic+0x32b
genunix`lwp_timer_enqueue+0x90
genunix`lwp_cond_wait+0x5be
unix`_sys_sysenter_post_swapgs+0x149
448

So, I think we still need to avoid migration here.
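If we do end up needing it, the usual shape of the guard would be something like this (illustration only; not part of the proposed patch):

kpreempt_disable();                     /* keep the thread on this CPU    */
ticks = (uint64_t)tsc_read() + unscalehrtime(delta);
wrmsr(IA32_DEADLINE_TSC_MSR, ticks);    /* TSC and MSR of the same CPU    */
kpreempt_enable();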

Thanks,
-Aubrey

#54

Updated by Garrett D'Amore over 7 years ago

cyclic_reprogram() disables kernel preemption and runs with the PIL set high, so I think we're OK here. Reading the code, all the code paths we care about go through cyclic_reprogram(), so no thread migration should occur.

I need to spend some more time with the serialization logic. I had assumed that the LVT was already written by the time we got here, and that I didn't need to serialize the deadline timer MSR to that, because it would have effectively already been done. Apart from that one step, I was trying to follow the recommendation of the SDM. I'll have another look.

#55

Updated by Aubrey Li over 7 years ago

Garrett D'Amore wrote:

cyclic_reprogram disable kernel preemption and runs with the PIL set high. I think we're ok here. Reading the code, I think all the code paths we care about go through cyclic_reprogram -- so no thread migration should occur.

You're right: irrespective of the calling context, cbe_set_level() is called with CY_HIGH_LEVEL, so we won't be preempted. A level-15 interrupt can pin this thread, but that's not a problem.

#56

Updated by Garrett D'Amore over 7 years ago

Ok, re-reading my code and the spec, I've got another try. I'm sorry I don't have h/w at the moment that I can test this on:


--- a/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c        Fri Nov 11 14:07:54 2011 -0800
+++ b/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c        Mon Nov 14 21:04:06 2011 -0800
@@ -25,6 +25,9 @@
  * Copyright (c) 2010, Intel Corporation.
  * All rights reserved.
  */
+/*
+ * Copyright 2011 Nexenta Systems, Inc.  All rights reserved.
+ */

 #include <sys/time.h>
 #include <sys/psm.h>
@@ -302,17 +305,40 @@
 static void
 deadline_timer_reprogram(hrtime_t time)
 {
+       int64_t         delta;
        uint64_t        ticks;

-       if (time <= 0) {
-               /*
-                * generate an immediate interrupt
-                */
-               ticks = (uint64_t)tsc_read();
-       } else {
-               ticks = unscalehrtime(time);
-       }

+       /*
+        * According to Intel SDM v3a, negative time is ok.
+        * Note that this entire routine is called with
+        * CBE_HIGH_PIL, so we needn't worry about preemption.
+        */
+       delta = time - gethrtime();
+
+       /* The unscalehrtime wants unsigned values. */
+       delta = max(delta, 0);
+
+       /*
+        * Now we have to serialize this per the SDM.  The algorithm
+        * recommended by Intel is this:
+        *
+        * a) write the LVT entry to specify AV_DEADLINE mode
+        * b) write a big value to the deadline register
+        * c) read the register back
+        * d) if it reads zero, go back to b and try again
+        * e) write the new value
+        */
+       apic_reg_ops->apic_write(APIC_LOCAL_TIMER,
+           (apic_clkvect + APIC_BASE_VECT) | AV_DEADLINE);
+       do {
+               /* write a really big value */
+               ticks = (uint64_t)tsc_read() + (1u << 31);
+               wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
+               ticks = rdmsr(IA32_DEADLINE_TSC_MSR);
+       } while (ticks == 0);
+       /* Now we shouldn't be interrupted, we can set the deadline */
+       ticks = (uint64_t)tsc_read() + unscalehrtime(delta);
        wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
 }
 
#57

Updated by Aubrey Li over 7 years ago

Garrett D'Amore wrote:

Ok, re-reading my code, and the spec, I've got another try. I'm sorry I don't have h/w atm I can test this on:

[...]

Since writes to the IA32_TSC_DEADLINE MSR are not serializing, and read and write accesses to IA32_TSC_DEADLINE and other MSRs occur in program order:
Would it be better to put the do {} while loop into deadline_timer_enable(), right after the LVT write? deadline_timer_disable() should be okay as-is, since we are disabling the timer and the value left in the MSR can be ignored.

#58

Updated by Jon Strabala over 7 years ago

Using my X9SCA-F running oi_151a, I applied Garrett's patch from [https://www.illumos.org/issues/1333#note-56] to a new nightly and made a new BE two hours ago. It seems to work about the same as the one Rich Lowe supplied me via e-mail back on Oct 28, 2011 (and via a build he made).

First check worked:

root@lab10:~# cat /etc/power.conf
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable poll-mode

Second check worked (this should invoke the code in the new patch):

root@lab10:~# cat /etc/power.conf
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable

I will post again if I notice any issues in the next few days (like kernel CPU creep). I can not run dtrace, and thus tools like 'powertop', with my GCC nightly builds. If you care, my system config and an initial 'top' and 'prstat' are in the attachment.

#59

Updated by Garrett D'Amore over 7 years ago

Aubrey Li wrote:

Garrett D'Amore wrote:

Ok, re-reading my code, and the spec, I've got another try. I'm sorry I don't have h/w atm I can test this on:

[...]

Since writes to the IA32_TSC_DEADLINE MSR are not serializing, and read and write accesses to IA32_TSC_DEADLINE and other MSRs occur in program order:

If I am understanding properly, the problem is that the timer could have expired just as we were trying to write a value; we want to make sure we aren't racing against an existing timer, IIUC. I think that is the basis of Intel's recommendation.

Is it better we put the do {} while just into deadline_timer_enable(), after LVT write? deadline_timer_disable should be okay since we are disabling the timer and the value in the MSR can be ignored.

I don't think so... because the enable logic doesn't have the new timer value that we get at reprogram().

#60

Updated by Garrett D'Amore over 7 years ago

Ok, I think I finally understood what was meant by "serialized": the problem is that issue order != execution order. Got it. So here's another try; hopefully this one is a bit cleaner (sorry for being so dense here):

--- a/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c        Fri Nov 11 14:07:54 2011 -0800
+++ b/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c        Tue Nov 15 13:11:11 2011 -0800
@@ -25,6 +25,9 @@
  * Copyright (c) 2010, Intel Corporation.
  * All rights reserved.
  */
+/*
+ * Copyright 2011 Nexenta Systems, Inc.  All rights reserved.
+ */

 #include <sys/time.h>
 #include <sys/psm.h>
@@ -286,8 +289,29 @@
 static void
 deadline_timer_enable(void)
 {
+       uint64_t ticks;
+
        apic_reg_ops->apic_write(APIC_LOCAL_TIMER,
            (apic_clkvect + APIC_BASE_VECT) | AV_DEADLINE);
+       /*
+        * Now we have to serialize this per the SDM.  That is to
+        * say, the above enabling can race in the pipeline with
+        * changes to the MSR.  We need to make sure the above
+        * operation is complete before we proceed to reprogram
+        * the deadline value in reprogram().  The algorithm
+        * recommended by the Intel SDM 3A in 10.5.1.4 is:
+        *
+        * a) write a big value to the deadline register
+        * b) read the register back
+        * c) if it reads zero, go back to a and try again
+        */
+
+       do {
+               /* write a really big value */
+               ticks = (uint64_t)tsc_read() + APIC_MAXVAL;
+               wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
+               ticks = rdmsr(IA32_DEADLINE_TSC_MSR);
+       } while (ticks == 0);
 }

 /* deadline timer disable */
@@ -302,17 +326,20 @@
 static void
 deadline_timer_reprogram(hrtime_t time)
 {
+       int64_t         delta;
        uint64_t        ticks;

-       if (time <= 0) {
-               /*
-                * generate an immediate interrupt
-                */
-               ticks = (uint64_t)tsc_read();
-       } else {
-               ticks = unscalehrtime(time);
-       }
+       /*
+        * Note that this entire routine is called with
+        * CBE_HIGH_PIL, so we needn't worry about preemption.
+        */
+       delta = time - gethrtime();

+       /* The unscalehrtime wants unsigned values. */
+       delta = max(delta, 0);
+
+       /* Now we shouldn't be interrupted, we can set the deadline */
+       ticks = (uint64_t)tsc_read() + unscalehrtime(delta);
        wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
 }

Again I need help testing.

#61

Updated by Aubrey Li over 7 years ago

Garrett D'Amore wrote:

Ok, I think I finally understood what was meant about "serialized" -- the problem is issue order != execution order. Got it. So, here's another try, hopefully this one is a bit cleaner (sorry for being so dense here):

I still have two comments, :)

1. I believe APIC_MAXVAL is not big enough. For example, if the CPU is 2.2 GHz, then 2,200,000,000 TSC ticks
equal 1 s. APIC_MAXVAL = 0xFFFFFFFF = 0t4,294,967,295, so APIC_MAXVAL / 2,200,000,000 < 2 s; that is, we are
scheduling a timer too early (see the arithmetic sketch after this list).

2. To be picky, should we check the variable x2apic_enable before doing the serializing write? The SDM says that in x2APIC mode no serialization is required between the two writes (by WRMSR) to the LVT and IA32_TSC_DEADLINE MSRs.
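To make comment 1 concrete, a minimal arithmetic sketch (illustration only; tsc_freq_hz is just a local name for the example's 2.2 GHz rate, not a kernel symbol):

uint64_t tsc_freq_hz = 2200000000ULL;               /* example 2.2 GHz    */
uint64_t guard_secs  = 0xFFFFFFFFULL / tsc_freq_hz; /* ~1.95 s, truncated */
/*
 * The "really big value" guard is less than 2 s in the future, i.e. a
 * deadline that could genuinely fire, rather than one safely out of
 * reach.  By contrast, (1ULL << 63) / 2200000000 is about 4.2e9 seconds,
 * roughly 130 years.
 */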

#62

Updated by Garrett D'Amore over 7 years ago

0xffffffff I figured would be big enough (anything over about 100 msec ought to be), but I can just write some arbitrarily large value. How about 1ULL << 63? Is that universally a safe value here? I'd be happy to just pick a number so large that I don't even have to look at the initial TSC value.

Wrt checking x2apic_enable, I think this may be pedantic. From what I can tell, the serialization here will only be called once during boot generally, and the extra logic should be harmless and yet always correct. Unless there is some reason that the WRMSR/RDMSR would behave very differently in x2APIC mode (I've not seen any indication that it would.)

#63

Updated by Aubrey Li over 7 years ago

Garrett D'Amore wrote:

so 0xffffffff I figured would be big enough (anything over about 100 msec ought to be big enough), but I can just write some arbitrarily large value. How about 1u << 63? Is this universally a safe value here? I'd be happy to just pick a number so large that I don't even have to look at the initial tsc value.

1ULL << 63 should work.

Wrt checking x2apic_enable, I think this may be pedantic. From what I can tell, the serialization here will only be called once during boot generally, and the extra logic should be harmless and yet always correct. Unless there is some reason that the WRMSR/RDMSR would behave very differently in x2APIC mode (I've not seen any indication that it would.)

Yeah, I agree it's pedantic; your call whether to add the check or not. ;)

#64

Updated by Jon Strabala over 7 years ago

FYI, I was sort of "brain dead": I had to apply the patch with the '-l' option (to ignore white space) since I had to cut-and-paste it from this thread:

patch -l apic_timer.c < patchfile_bug-1333_note-60.txt

Once again, using my X9SCA-F running oi_151a, I applied Garrett's revised patch (above), specifically [https://www.illumos.org/issues/1333#note-60], to a new nightly BE. It still seems to work about the same as the one Rich Lowe supplied me via e-mail back on Oct 28, 2011 (and via a build he made).

First check worked again:

root@lab10:~# cat /etc/power.conf
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable poll-mode

Second check worked again (this should invoke the code in the new patch):

root@lab10:~# cat /etc/power.conf
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable

Once again, I will post if I notice any issues in the next few days (like kernel CPU creep). Wow, it seems that with this new nightly I can now run dtrace (Woo Yeah!) and thus tools like 'powertop' with my GCC nightly builds. If you care, my system config and an initial 'top' and 'prstat' are in the first attachment, plus various dtrace outputs in the second (I plan to run this new BE today and over the weekend, about 3 days).

#65

Updated by Jon Strabala over 7 years ago

After about 69 hours, it seems the system is running fine WRT issue #1333 with Garrett's revised patch (above), specifically [https://www.illumos.org/issues/1333#note-60] - see the first attachment.

On an unrelated note, I seem to see a lot of activity in 'fsflush' and/or 'zpool-rpool' on a system that should be completely idle (this isn't related to the issue at hand; see the second and third attachments). Oddly, in 60 seconds I am seeing a high number of calls on a totally idle system: 'open' 38048, 'forksys' 421456, and 'umount2' 19804321.

#66

Updated by Rich Lowe over 7 years ago

If the fsflush stuff is separate, please open a separate bug.

#67

Updated by Jon Strabala over 7 years ago

After about 5 days it seems like the system is still running fine WRT issue #1333 with Garrett's revised patch (above), specifically [https://www.illumos.org/issues/1333#note-60].

last pid: 102619;  load avg  0.00,  0.00,  0.00;  up 5+01:39:48                                                                                     11:51:50
51 processes: 50 sleeping, 1 on cpu
CPU states:  100% idle,  0.0% user,  0.0% kernel,  0.0% iowait,  0.0% swap
Kernel: 311 ctxsw, 2 trap, 561 intr, 157 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
     3 root        1  60  -20    0K    0K sleep    5:11  0.10% fsflush
102618 root        1  59    0 3536K 1908K cpu/2    0:00  0.03% top
101072 admin       1  59    0 7756K 5136K sleep    0:00  0.01% sshd
101337 root        6  59    0   73M   46M sleep    0:03  0.00% fbserver
101077 root        1  59    0 6000K 2036K sleep    0:04  0.00% sendmail
100861 root        1  59    0 2668K 1608K sleep    0:04  0.00% in.routed
100269 root        6  59    0 4396K 3104K sleep    0:03  0.00% devfsadm
100645 root        4  59    0 3060K 2012K sleep    0:32  0.00% automountd
100859 root        1  59    0 6480K 5364K sleep    0:18  0.00% intrd
100005 root       22  59    0   15M   13M sleep    0:07  0.00% svc.configd

As to the 'fsflush' issue (unrelated to this issue), I have opened bug #1810 as Rich Lowe suggested.

#68

Updated by Garrett D'Amore over 7 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

changeset: 13533:6c1a51927cbd
tag: tip
user: Garrett D'Amore <>
date: Mon Nov 28 20:07:13 2011 -0800
description:
1333 High kernel cpu usage & dtrace hang on idle system
Reviewed by: Richard Lowe <>
Reviewed by: Aubrey Li <>
Reviewed by: Albert Lee <>
Reviewed by: Dan McDonald <>
Reviewed by: Robert Mustacchi <>
Approved by: Gordon Ross <>

Also available in: Atom PDF