Bug #1333
High kernel cpu usage & dtrace hang on idle system
Status: Closed (100% done)
Description
We have a Supermicro X9SCA-F (Intel® C204 PCH chipset) with 16GB RAM and an E3-1240 Xeon CPU (quad core).
Clean install of OpenIndiana build 148 (server, no GUI). Mirrored disks on two LSI 2008-type controllers, and Intel NICs.
The problem is that kernel CPU usage increases gradually, by approximately 0.1% per hour on an idle system. When it reaches some 5-6%, any dtrace invocation hangs the system completely: it stops responding to pings, the console is hung, nothing works.
Once (out of some 20 hangs) I was able to start a shutdown by pressing the physical power button, so it seems the system still works in some way but is consuming all the CPU/interrupts somehow.
I thought this was related to https://www.illumos.org/issues/423 and https://defect.opensolaris.org/bz/show_bug.cgi?id=12799#c2, which is the 'cpupm enable poll-mode' fix in power.conf, but setting that as described above does not fix this problem.
I tried disabling C-states and disabling cpupm completely, both in the BIOS and in power.conf:
using 'cpupm enable poll-mode', then 'cpu-deep-idle disable', and even 'cpupm disable' (sketched below), but it still happens.
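For reference, the /etc/power.conf variants tried look roughly like this (a sketch - one variant at a time, with the rest of the stock file assumed unchanged, and re-read with pmconfig):

# /etc/power.conf additions tried, one at a time
cpupm enable poll-mode
#cpu-deep-idle disable
#cpupm disable

# then have the kernel re-read the file:
pmconfig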
After a reboot the system starts at almost 0% kernel usage, and I cannot find the cause of the increasing CPU usage, as there is none to be seen. It takes 3-4 days to reach its breaking point.
Because it hangs once the percentage gets high, this is hard to pinpoint.
Took some statistics while it was still possible:
prstat -s cpu -n 5: http://www.pastie.org/2241576
mpstat 5 5: http://www.pastie.org/2241586
powertop: http://www.pastie.org/2265444
Updated by Jon Strabala almost 12 years ago
- File power_cpu_issue.txt added
I have the exact same experience and system: basically a Supermicro X9SCA-F (Intel® C204 PCH chipset) with 16GB RAM and an Intel(R) Xeon(R) CPU E31270 @ 3.40GHz.
lab10# uname -a
SunOS lab10 5.11 oi_148b i86pc i386 i86pc Solaris
lab10#
The kernel is always at 0% after a reboot, but I see the same slow CPU usage growth over several days, to the point where I can no longer log onto the system via ssh and have to do a hard power cycle via IPMI.
I have to do a "hard" power off. Immediately after the hard power cycle and a 'reboot':
Below I ran the following and captured the output; I have attached a text file due to formatting issues in this 'posting' tool. Note these are all from right after a 'fresh' reboot (a sketch for batching these captures follows the list).
uname -a
prstat -s cpu -n
mpstat 5 2
powertop
prtdiag
cat /etc/power.conf
dtrace -n 'profile-10ms{@[stack()] = count()}'
echo ::interrupts | mdb -k
echo ::interrupts -d | mdb -k
intrstat 30
powertop -t 30
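For what it's worth, here is a sketch of how those captures could be batched into a single timestamped file (the output path is illustrative; the commands and options are taken from the list above):

#!/bin/sh
# capture the post-reboot diagnostics listed above into one file
out=/var/tmp/diag.`date +%Y%m%d%H%M`.txt
{
  uname -a
  prstat -s cpu -n 5 1 1
  mpstat 5 2
  prtdiag
  cat /etc/power.conf
  echo ::interrupts -d | mdb -k
  intrstat 30 1
  powertop -t 30
} > $out 2>&1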
No matter what I try, I cannot seem to prevent this issue. In fact, once it gets past 20% kernel usage, dtrace scripts will hang.
I have tried "cpupm enable poll-mode" in /etc/power.conf
I have tried "cpupm disable" in /etc/power.conf
Today I am trying to add the following to the bottom of /etc/system, based on info in https://defect.opensolaris.org/bz/show_bug.cgi?id=12799:
set cpupm_tw_gov_interval=800000
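(/etc/system tunables only take effect after a reboot; assuming the variable name above is right, one way to confirm the new value took is to read it back with mdb - /D prints it as a 32-bit decimal; use /E instead if it turns out to be a 64-bit value:)

echo 'cpupm_tw_gov_interval/D' | mdb -k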
Any ideas or tests I can run would be greatly appreciated; I am stumped on how to debug or even collect data on this problem.
- Jon S
Updated by Jon Strabala almost 12 years ago
Doing some more testing ....
This box was 99.3% idle at initial boot time; after about 44 hours it is only 98% idle (with "set cpupm_tw_gov_interval=800000" in /etc/system).
At time of boot ...
# date
August 31, 2011 07:07:16 AM PDT
# top -s 10
last pid: 1965; load avg: 0.00, 0.00, 0.00; up 0+15:26:34 07:06:57
62 processes: 61 sleeping, 1 on cpu
CPU states: 99.3% idle, 0.0% user, 0.7% kernel, 0.0% iowait, 0.0% swap
Kernel: 320 ctxsw, 1062 intr, 56 syscall
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap
PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
1204 root 3 59 0 59M 36M sleep 2:58 0.04% Xorg
1657 gdm 1 59 0 87M 25M sleep 0:19 0.00% gdm-simple-gree
1726 root 4 59 0 3068K 1948K sleep 0:00 0.00% automountd
1964 root 1 59 0 3840K 2476K cpu/3 0:00 0.00% top
372 root 4 59 0 6912K 5548K sleep 0:00 0.00% hald
401 root 1 59 0 3496K 2092K sleep 0:00 0.00% hald-addon-acpi
1642 gdm 1 59 0 82M 18M sleep 0:00 0.00% gnome-power-man
1713 root 21 59 0 7396K 5080K sleep 0:00 0.00% nscd
after 44 hours it has crept up as follows:
# date
September 1, 2011 11:33:56 AM PDT
# top -s 10
last pid: 2236; load avg: 0.00, 0.00, 0.00; up 1+19:54:09 11:34:32
62 processes: 61 sleeping, 1 on cpu
CPU states: 98.0% idle, 0.1% user, 1.9% kernel, 0.0% iowait, 0.0% swap
Kernel: 386 ctxsw, 2 trap, 1073 intr, 62 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
1204 root 3 59 0 59M 36M sleep 8:25 0.04% Xorg
1657 gdm 1 59 0 87M 25M sleep 0:52 0.00% gdm-simple-gree
2235 root 1 59 0 3852K 2492K cpu/5 0:00 0.00% top
1733 admin 1 59 0 7736K 5168K sleep 0:00 0.00% sshd
372 root 4 59 0 6912K 5548K sleep 0:01 0.00% hald
401 root 1 59 0 3496K 2092K sleep 0:00 0.00% hald-addon-acpi
1642 gdm 1 59 0 82M 18M sleep 0:01 0.00% gnome-power-man
1713 root 21 59 0 8428K 5264K sleep 0:01 0.00% nscd
264 root 6 59 0 4356K 3156K sleep 0:00 0.00% devfsadm
10 root 16 59 0 16M 13M sleep 0:02 0.00% svc.startd
I will continue adding comments and samples (as per the attached file) until the system becomes unresponsive.
Updated by Miha Krizaj almost 12 years ago
In my case the kernel usage after 7 days of uptime has reached almost 20% while idle, so I expect it to die any time now.
This problem must have been noticed elsewhere too. I think this is a chipset/CPU incompatibility in OI, as an identical system works great under Windows Server and VMware ESXi.
last pid: 2827; load avg: 0.09, 0.07, 0.07; up 7+01:42:27 10:25:30
53 processes: 52 sleeping, 1 on cpu
CPU states: 80.3% idle, 0.1% user, 19.6% kernel, 0.0% iowait, 0.0% swap
Kernel: 6850 ctxsw, 5 trap, 3638 intr, 424 syscall, 5 flt
Memory: 16G phys mem, 2580M free mem, 8187M total swap, 8187M free swap
PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
2826 root 1 52 0 3840K 1888K cpu/0 0:00 0.01% top
794 daemon 18 60 -20 3032K 2008K sleep 182:01 0.00% nfsd
349 root 43 59 0 15M 7340K sleep 0:09 0.00% nscd
580 root 21 59 0 22M 14M sleep 1:08 0.00% smbd
605 root 1 59 0 12M 6384K sleep 0:23 0.00% intrd
11 root 20 59 0 14M 13M sleep 0:20 0.00% svc.configd
567 daemon 9 59 0 15M 4664K sleep 0:12 0.00% idmapd
774 daemon 4 59 0 10M 2144K sleep 0:06 0.00% nfsmapid
1374 root 1 59 0 13M 4400K sleep 0:05 0.00% sshd
572 root 26 59 0 25M 13M sleep 0:04 0.00% fmd
454 root 5 59 0 8148K 6732K sleep 0:04 0.00% hald
590 root 1 59 0 6080K 2200K sleep 0:04 0.00% sendmail
519 root 4 59 0 8396K 1932K sleep 0:04 0.00% automountd
417 root 6 59 0 9856K 3448K sleep 0:03 0.00% devfsadm
459 root 1 59 0 3552K 2132K sleep 0:03 0.00% hald-addon-acpi
9 root 13 59 0 19M 12M sleep 0:02 0.00% svc.startd
507 root 4 59 0 10M 3828K sleep 0:02 0.00% inetd
553 root 11 59 0 9540K 2264K sleep 0:01 0.00% syslogd
777 daemon 2 60 -20 2960K 1924K sleep 0:00 0.00% lockd
247 root 5 60 -20 2608K 1536K sleep 0:00 0.00% zonestatd
1375 admin 1 59 0 13M 4996K sleep 0:00 0.00% sshd
147 daemon 3 59 0 13M 4928K sleep 0:00 0.00% kcfd
246 root 17 59 0 11M 3372K sleep 0:00 0.00% syseventd
332 root 4 59 0 9648K 3300K sleep 0:00 0.00% picld
49 netadm 3 59 0 3980K 2780K sleep 0:00 0.00% ipmgmtd
792 root 22 59 0 10M 2740K sleep 0:00 0.00% mountd
1387 root 1 59 0 9100K 2596K sleep 0:00 0.00% bash
1378 admin 1 59 0 9104K 2584K sleep 0:00 0.00% bash
458 root 1 59 0 3956K 2352K sleep 0:00 0.00% hald-addon-cpuf
456 root 1 59 0 3948K 2344K sleep 0:00 0.00% hald-addon-netw
Updated by Jon Strabala almost 12 years ago
Continuing testing. As Miha indicated, the problem is always in kernel usage; it never shows up in 'top' among normal user-space processes. However, there are some VERY interesting differences in the 'powertop' output as the kernel CPU rises.
This box was 99.3% idle at initial boot time; after about 44 hours it was only 98% idle (with "set cpupm_tw_gov_interval=800000" in /etc/system).
At time of boot ... (CPU usage via top WAS at 0.7%)
date
August 31, 2011 07:07:16 AM PDT
after 44 hours it has crept up as follows (CPU usage via top WAS at 2%):
date
September 1, 2011 11:33:56 AM PDT
after 159 hours it has crept up as follows (CPU usage via top NOW at 6.5% - but WHY is the question):
date
September 6, 2011 07:21:51 AM PDT
##############################################################################
# top -s 10
last pid: 3376; load avg: 0.01, 0.01, 0.00; up 6+15:48:02 07:28:25
62 processes: 61 sleeping, 1 on cpu
CPU states: 93.4% idle, 0.1% user, 6.5% kernel, 0.0% iowait, 0.0% swap
Kernel: 319 ctxsw, 3 trap, 963 intr, 61 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
1204 root 3 58 0 59M 36M sleep 31:49 0.04% Xorg
1657 gdm 1 59 0 87M 25M sleep 3:12 0.00% gdm-simple-gree
3375 root 1 59 0 3852K 2492K cpu/3 0:00 0.00% top
1713 root 21 59 0 8428K 5332K sleep 0:04 0.00% nscd
1733 admin 1 59 0 7776K 5188K sleep 0:01 0.00% sshd
1642 gdm 1 59 0 82M 18M sleep 0:04 0.00% gnome-power-man
264 root 6 59 0 4356K 3156K sleep 0:02 0.00% devfsadm
1057 root 6 59 0 26M 10M sleep 0:02 0.00% fbserver
761 root 26 59 0 18M 12M sleep 0:02 0.00% fmd
10 root 16 59 0 16M 13M sleep 0:02 0.00% svc.startd
12 root 22 59 0 16M 14M sleep 0:07 0.00% svc.configd
Another log file with various output is attached showing some of the 'deltas', called 'power_cpu_issue_PLUS_159hours.txt'; compare with the older sample set called 'power_cpu_issue_PLUS_44hours.txt'.
I cannot really tell if there is a true difference in 'mpstat':
root@lab10:/tmp# sdiff -w 155 aft_44_hours.txt aft_159_hours.txt
root@lab10:~# mpstat 5 2                                root@lab10:~# mpstat 5 2
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt id    CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt id
0 0 0 0 510 101 74 0 1 0 0 11 0 4 0 9 | 0 0 0 1 514 101 90 0 1 0 0 9 0 13 0 8
1 0 0 0 78 10 37 0 1 0 0 11 0 1 0 9 | 1 0 0 0 81 9 38 0 1 0 0 7 0 2 0 9
2 1 0 0 74 9 37 0 1 0 0 9 0 1 0 9 | 2 0 0 0 73 8 35 0 1 0 0 6 0 2 0 9
3 0 0 0 71 8 35 0 1 0 0 7 0 1 0 9 | 3 0 0 0 67 7 32 0 1 0 0 5 0 2 0 9
4 1 0 0 76 9 39 0 1 0 0 10 0 1 0 9 | 4 0 0 0 75 8 37 0 1 0 0 6 0 2 0 9
5 1 0 1 70 7 35 0 1 0 0 8 0 1 0 9 | 5 0 0 0 64 6 31 0 1 0 0 6 0 2 0 9
6 1 0 0 71 7 35 0 1 0 0 8 0 1 0 9 | 6 0 0 0 65 6 31 0 1 0 0 6 0 2 0 9
7 0 0 0 73 7 36 0 1 0 0 7 0 1 0 9 | 7 0 0 0 65 6 31 0 1 0 0 5 0 2 0 9
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt id    CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt id
0 4 0 0 519 101 86 0 2 0 0 61 0 7 0 9 | 0 0 0 0 513 101 128 0 1 1 0 36 0 26 0 7
1 0 0 0 78 9 34 0 2 0 0 7 0 1 0 9 | 1 0 0 0 56 2 29 0 0 0 0 6 0 3 0 9
2 0 0 0 80 9 34 0 1 0 0 10 0 1 0 9 | 2 0 0 0 34 3 15 0 0 0 0 6 0 2 0 9
3 0 0 0 51 5 25 0 1 0 0 0 0 1 0 9 | 3 0 0 0 80 6 35 0 0 1 0 0 0 4 0 9
4 0 0 0 86 12 43 0 1 0 0 0 0 1 0 9 | 4 5 0 0 59 1 30 0 1 1 0 61 0 3 0 9
5 0 0 0 91 4 43 0 1 0 0 0 0 1 0 9 | 5 0 0 0 72 4 33 0 1 0 0 0 0 4 0 9
6 0 0 0 93 13 47 0 1 0 0 13 0 1 0 9 | 6 0 0 0 82 14 40 0 0 0 0 0 0 5 0 9
7 0 0 0 23 0 9 0 1 0 0 2 0 0 0 10 | 7 0 0 0 32 3 16 0 0 0 0 12 0 2 0 9
root@lab10:~#                                           root@lab10:~#
root@lab10:/tmp#
Obviously the CPU C0 time has gone up from 3.1% to 10.4% as shown by the diff below, but I don't know why:
root@lab10:/tmp# diff powertop.5.samp1.44hours.txt powertop.5.samp2.159hours.txt
1,2c1
< root@lab10:~# powertop
< OpenIndiana PowerTOP version 1.2 (C) 2009 Intel Corporation
---
> OpenIndiana PowerTOP version 1.2
4c3,14
< Collecting data for 5.00 second(s)
---
> C-states (idle power) Avg Residency P-states (frequencies)
> C0 (cpu running) (9.9%) 1600 Mhz 0.0%
> C1 0.0ms (0.0%) 1800 Mhz 0.0%
> C2 0.0ms (0.0%) 2000 Mhz 0.0%
> C3 2.4ms (90.1%) 2200 Mhz 0.0%
> 2400 Mhz 0.0%
> 2600 Mhz 0.0%
> 2800 Mhz 0.0%
> 3000 Mhz 0.0%
> 3200 Mhz 0.0%
> 3400 Mhz 0.0%
> 3800 Mhz(turbo) 100.0%
6,23c16
< root@lab10:~# vi /tmp/power.top.out
< root@lab10:~# cat /tmp/power.top.out
< OpenIndiana PowerTOP version 1.2
<
< C-states (idle power) Avg Residency P-states (frequencies)
< C0 (cpu running) (3.4%) 1600 Mhz 0.0%
< C1 0.0ms (0.0%) 1800 Mhz 0.0%
< C2 0.0ms (0.0%) 2000 Mhz 0.0%
< C3 2.4ms (96.6%) 2200 Mhz 0.0%
< 2400 Mhz 0.0%
< 2600 Mhz 0.0%
< 2800 Mhz 0.0%
< 3000 Mhz 0.0%
< 3200 Mhz 0.0%
< 3400 Mhz 0.0%
< 3604 Mhz(turbo) 100.0%
<
< Wakeups-from-idle per second: 3201.6 interval: 5.0s
---
> Wakeups-from-idle per second: 3103.9 interval: 5.0s
27,29c20,22
< 89.0% (2849.0) sched : <xcalls> unix`dtrace_xcall_func
< 3.1% (100.0) <kernel> : genunix`clock
< 2.4% ( 76.0) <kernel> : genunix`cv_wakeup
---
> 91.8% (2848.9) sched : <xcalls> unix`dtrace_xcall_func
> 3.2% (100.2) <kernel> : genunix`clock
> 2.5% ( 76.2) <kernel> : genunix`cv_wakeup
31,32c24,25
< 0.2% ( 8.0) <kernel> : cpudrv`cpudrv_monitor_disp
< 0.2% ( 7.6) <kernel> : ehci`ehci_handle_root_hub_status_change
---
> 0.3% ( 8.0) <kernel> : ehci`ehci_handle_root_hub_status_change
> 0.3% ( 8.0) <kernel> : cpudrv`cpudrv_monitor_disp
34c27
< 0.1% ( 2.0) <interrupt> : e1000g#0
---
> 0.1% ( 2.2) <interrupt> : e1000g#0
36a30,31
> 0.0% ( 1.0) <kernel> : genunix`sigalarm2proc
> 0.0% ( 1.0) <interrupt> : ehci#0
38,40d32
< 0.0% ( 0.8) <interrupt> : ehci#0
< 0.0% ( 0.2) <kernel> : genunix`kmem_update
< 0.0% ( 0.2) <kernel> : ahci`ahci_watchdog_handler
42c34
< 0.0% ( 0.2) <kernel> : kcf`rnd_handler
---
> 0.0% ( 0.2) <kernel> : ip`tcp_timer_callback
44c36,39
< 0.0% ( 0.2) <kernel> : ibcm`ibcm_path_cache_timeout_cb
---
> 0.0% ( 0.2) sched : <xcalls> unix`hati_demap_func
> 0.0% ( 0.2) <kernel> : genunix`kmem_update
> 0.0% ( 0.2) <kernel> : kcf`rnd_handler
> 0.0% ( 0.2) <kernel> : ahci`ahci_watchdog_handler
root@lab10:/tmp#
In addition, I cannot really tell if dtrace is showing anything
(or if I am using dtrace correctly - I just time it in my head and hit Ctrl-C to stop it):
root@lab10:~# sdiff /tmp/dtrace_44_hours.txt /tmp/dtrace_159_hours.txt
root@lab10:~# dtrace -n 'profile-10ms{@[stack()] = count()}'   root@lab10:~# dtrace -n 'profile-10ms{@[stack()] = count()}'
dtrace: description 'profile-10ms' matched 1 probe             dtrace: description 'profile-10ms' matched 1 probe
^C                                                             ^C
unix`av_check_softint_pending+0x50        | unix`mutex_enter+0x10
unix`av_dispatch_softvect+0x48            | unix`hment_mapcnt+0x1d
apix`apix_dispatch_softint+0x34           | unix`hat_page_getshare+0x16
unix`switch_sp_and_call+0x13              | genunix`fsflush_do_pages+0x236
1                                         | genunix`fsflush+0x39a
                                          | unix`thread_start+0x8
unix`av_dispatch_softvect                 <
unix`switch_sp_and_call+0x13              <
1                                         <
                                          <
unix`av_dispatch_softvect+0x1             <
unix`switch_sp_and_call+0x13              <
1                                           1
genunix`lbolt_event_driven+0x18           | unix`x86pte_mapin+0x47
genunix`ddi_get_lbolt+0xd                 | unix`x86pte_access_pagetable+0x3d
SDC`sysdc_update_times+0x50               | unix`x86pte_set+0x8d
SDC`sysdc_compute_pri+0x4b                | unix`hati_pte_map+0x187
SDC`sysdc_setrun+0x35                     | unix`hati_load_common+0x15d
SDC`sysdc_wakeup+0x15                     | unix`hat_memload+0x81
genunix`sleepq_wakeone_chan+0x89          | unix`segkmem_xalloc+0x13b
genunix`cv_signal+0x8e                    | unix`segkmem_alloc_vn+0xcd
genunix`taskq_dispatch+0x374              | unix`segkmem_alloc+0x24
zfs`zio_taskq_dispatch+0xab               | genunix`vmem_xalloc+0x546
zfs`zio_interrupt+0x1c                    | genunix`vmem_alloc+0x161
zfs`vdev_disk_io_intr+0x6b                | genunix`kmem_alloc+0x64
genunix`biodone+0x84                      | genunix`kmem_zalloc+0x3b
sd`sd_buf_iodone+0x67                     | autofs`auto_calldaemon+0x22d
sd`sd_mapblockaddr_iodone+0x4f            | autofs`auto_send_unmount_request+0x6b
sd`sd_return_command+0x1d9                | autofs`unmount_node+0x181
sd`sdintr+0x58d                           | autofs`try_unmount_node+0x13b
scsi`scsi_hba_pkt_comp+0x15c              | autofs`unmount_subtree+0x79
sata`sata_txlt_rw_completion+0x1d3        | autofs`unmount_tree+0xcf
ahci`ahci_flush_doneq+0x6a                | autofs`unmount_zone_tree+0x1e
1                                           1
unix`0xfffffffffb85a6ca                   | genunix`fsflush_do_pages+0x1da
kstat`read_kstat_data+0x48d               | genunix`fsflush+0x39a
kstat`kstat_ioctl+0x4a                    | unix`thread_start+0x8
genunix`cdev_ioctl+0x45                   <
specfs`spec_ioctl+0x5a                    <
genunix`fop_ioctl+0x7b                    <
genunix`ioctl+0x18e                       <
unix`_sys_sysenter_post_swapgs+0x149      <
1                                           1
genunix`cpu_update_pct+0x3e               | unix`page_pptonum+0xc
genunix`restore_mstate+0x40               | genunix`fsflush_do_pages+0x35d
unix`swtch+0xfb                           | genunix`fsflush+0x39a
unix`idle+0xc4                            <
unix`thread_start+0x8                       unix`thread_start+0x8
1                                           1
unix`av_dispatch_softvect+0x2c            | 7
apix`apix_dispatch_softint+0x34           <
unix`switch_sp_and_call+0x13              <
2                                         <
                                          <
unix`av_dispatch_softvect+0x41            <
apix`apix_dispatch_softint+0x34           <
unix`switch_sp_and_call+0x13              <
2                                         <
                                          <
9                                         <
                                          <
apix`apix_dispatch_softint+0x27           <
unix`switch_sp_and_call+0x13              <
12                                        <
unix`i86_mwait+0xd                          unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1                    unix`cpu_idle_mwait+0xf1
unix`cpu_acpi_idle+0x8d                     unix`cpu_acpi_idle+0x8d
unix`cpu_idle_adaptive+0x19                 unix`cpu_idle_adaptive+0x19
unix`idle+0x114                             unix`idle+0x114
unix`thread_start+0x8                       unix`thread_start+0x8
53                                        | 10
unix`acpi_cpu_cstate+0x2ae                  unix`acpi_cpu_cstate+0x2ae
unix`cpu_acpi_idle+0x82                     unix`cpu_acpi_idle+0x82
unix`cpu_idle_adaptive+0x19                 unix`cpu_idle_adaptive+0x19
unix`idle+0x114                             unix`idle+0x114
unix`thread_start+0x8                       unix`thread_start+0x8
15340                                     | 10027
root@lab10:~#
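(Incidentally, rather than timing the one-liner by hand and hitting Ctrl-C, one can let dtrace stop and print its aggregation by itself - a sketch using the tick provider:)

dtrace -n 'profile-10ms { @[stack()] = count(); } tick-30s { exit(0); }'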
Once again, I will continue adding comments and log file samples (as per the attached file) until the system becomes unresponsive.
Updated by Jon Strabala almost 12 years ago
Note: the attached file "power_cpu_issue_PLUS_0hours.txt" is from an initial reboot; it can be compared with the prior two files "power_cpu_issue_PLUS_44hours.txt" and "power_cpu_issue_PLUS_159hours.txt". I will no longer capture these files. Instead I will be capturing nine (9) dtrace outputs, on the hour, until the system hangs.
I tried to run Brendan Gregg's 'hotkernel' from the DTT package - at a 'rup' of 160 hours (CPU usage via top NOW at 6.5%):
root@lab10:/opt/DTT# ./hotkernel
Sampling... Hit Ctrl-C to end.
^C
I waited about three minutes - it just doesn't print anything or come back. Not only that, I cannot even log into another shell via ssh.
However I was able to CTRL-Z it and then kill -9 the dtrace script.
^Z
[1]+ Stopped                 ./hotkernel
root@lab10:/opt/DTT#
root@lab10:/opt/DTT# kill %1
root@lab10:/opt/DTT#
root@lab10:/opt/DTT# jobs
[1]+ Running                 ./hotkernel &
root@lab10:/opt/DTT#
root@lab10:/opt/DTT# kill -TERM %1
root@lab10:/opt/DTT#
root@lab10:/opt/DTT# jobs
[1]+ Running                 ./hotkernel &
root@lab10:/opt/DTT# kill -9 %1
[1]+ Killed                  ./hotkernel
root@lab10:/opt/DTT#
root@lab10:/opt/DTT# jobs
root@lab10:/opt/DTT#
Okay, I got the system back, but playing around with dtrace from the DTT package eventually locked up my system, so this is the end of my log file series. Nothing I could do (^C, ^D or ^Z) could get the 'dtrace' script to die, and I couldn't open a new 'ssh' session, so I had to give up on the log file captures I had been doing (not that they seemed to tell me anything new).
So once again I used IPMI for a hard power off and reboot:
Power Off Server - Orderly Shutdown (no effect ...)
Power Off Server - Immediate (worked, but rather severe ...)
Power On Server (to bring it back up online)
I captured the same data, BUT this time at 0 hours, about 2-5 minutes after the reboot:
See: power_cpu_issue_PLUS_0hours.txt
Right now, at about 2 minutes after the reboot, CPU usage via top was at 0.1%; see the bottom of the file power_cpu_issue_PLUS_0hours.txt:
root@lab10:~# top -s 10
last pid: 1921; load avg: 0.00, 0.02, 0.01; up 0+00:10:18 08:55:46
62 processes: 61 sleeping, 1 on cpu
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
Kernel: 317 ctxsw, 698 intr, 58 syscall
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap
PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
1282 root 3 59 0 60M 38M sleep 0:04 0.05% Xorg
1595 gdm 1 59 0 87M 25M sleep 0:00 0.01% gdm-simple-gree
1920 root 1 59 0 3852K 2520K cpu/1 0:00 0.00% top
1594 gdm 1 59 0 82M 18M sleep 0:00 0.00% gnome-power-man
1780 root 22 59 0 5728K 3668K sleep 0:00 0.00% nscd
1585 gdm 1 59 0 7152K 5944K sleep 0:00 0.00% gconfd-2
1821 admin 1 59 0 7736K 5180K sleep 0:00 0.00% sshd
1122 root 6 59 0 26M 10M sleep 0:00 0.00% fbserver
208 root 6 59 0 4348K 3152K sleep 0:00 0.00% devfsadm
10 root 14 59 0 15M 13M sleep 0:01 0.00% svc.startd
816 root 4 59 0 6728K 3872K sleep 0:00 0.00% inetd
12 root 21 59 0 16M 14M sleep 0:04 0.00% svc.configd
837 root 25 59 0 18M 11M sleep 0:00 0.00% fmd
270 root 5 60 -20 2548K 1512K sleep 0:00 0.00% zonestatd
root@lab10:~#
Now, at 17 minutes after reboot, I can easily run Brendan Gregg's 'hotkernel' from the DTT package - at a 'rup' of 17 minutes (where before it hung at a 'rup' of 150 hours); contrast this with what happened above.
root@lab10:/opt/DTT# rup lab10
lab10 up 17 mins, load average: 0.00, 0.00, 0.01
root@lab10:/opt/DTT# ./hotkernel
Sampling... Hit Ctrl-C to end.
^C
FUNCTION COUNT PCNT
genunix`kmem_cache_alloc 1 0.0%
unix`av_clear_softint_pending 1 0.0%
unix`cpu_idle_mwait 1 0.0%
zfs`metaslab_segsize_compare 1 0.0%
unix`disp 1 0.0%
unix`atomic_or_64 1 0.0%
unix`0xfffffffffb85 1 0.0%
unix`splr 1 0.0%
unix`default_kstat_snapshot 1 0.0%
genunix`syscall_entry 1 0.0%
unix`tsc_gethrtimeunscaled_delta 1 0.0%
e1000g`e1000g_m_stat 1 0.0%
apix`apix_dispatch_pending_autovect 1 0.0%
genunix`callout_heap_delete 1 0.0%
unix`mutex_adaptive_tryenter 1 0.0%
zfs`zio_vdev_io_start 1 0.0%
unix`gdt_update_usegd 1 0.0%
unix`do_splx 1 0.0%
unix`cpu_acpi_idle 1 0.0%
unix`unset_idle_cpu 1 0.0%
unix`mutex_enter 2 0.0%
unix`tsc_gethrtime_delta 2 0.0%
unix`i86_monitor 2 0.0%
zfs`lzjb_compress 2 0.0%
unix`disp_anywork 30 0.1%
unix`acpi_cpu_cstate 704 1.4%
unix`i86_mwait 50403 98.5%
root@lab10:/opt/DTT#
Cool, I can run 'hotkernel' now - but why, after a few days, does it become very non-responsive?
I made a modified version of the 'hotkernel' script to terminate after 30 seconds (diff below), and will run it every hour from cron to try to find out what changes as the system stays up a long time ...
root@lab10:/opt/DTT# diff hotkernel hotkernel_30sec
83a84,88
> tick-1s
> / i++ >= 30 /
> {
>     exit(0);
> }
93c98
< print "Sampling... Hit Ctrl-C to end.\n";
---
> print "Sampling... Hit Ctrl-C to end. -OR- wait 30 seconds for autoquit\n";
root@lab10:/opt/DTT#
In addition I modified several other DTT scripts to exit after 10 or 30 seconds (as above) and made an hourly cron file to capture a lot of DTrace outputs, e.g.
#!/bin/csh -f
cd /opt/DTT
set yyyymmddhh=`date +"%Y-%m-%d_%H"`
echo "========================================================================="
echo ""
echo "date"
date
echo ""
echo "rup lab10"
rup lab10
echo ""
echo "top -s 10 -n 15 -d 2 -b" >& /root/hotkernel_30/data/intoncpu_10sec.$yyyymmddhh.out
top -s 10 -n 15 -d 2 -b
echo ""
echo "./hotkernel_30sec >& /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out"
./hotkernel_30sec >& /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out
echo ""
echo "wc -l /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out"
wc -l /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out
echo ""
echo "cat /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out"
cat /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out
echo ""
echo "take four dtrace CPU measures: cpuwalk.d 10, intoncpu_10sec.d, inttimes_10sec.d, xcallsbypid_10sec.d"
cd /opt/DTT/Cpu
./cpuwalk.d 10 >& /root/hotkernel_30/data/cpuwalk.d_10.$yyyymmddhh.out
./intoncpu_10sec.d >& /root/hotkernel_30/data/intoncpu_10sec.$yyyymmddhh.out
./inttimes_10sec.d >& /root/hotkernel_30/data/inttimes_10sec.$yyyymmddhh.out
./xcallsbypid_10sec.d >& /root/hotkernel_30/data/xcallsbypid_10sec.$yyyymmddhh.out
echo ""
echo "take a few Bin measures: sysbypid_10sec.d, topsyscall -C -s 5 2, iotop -C -t 10 5 4, cswstat_10sec.d"
cd /opt/DTT/Bin
./sysbypid_10sec.d >& /root/hotkernel_30/data/sysbypid_10sec.$yyyymmddhh.out
./topsyscall -C -s 5 2 >& /root/hotkernel_30/data/topsyscall-C-s_5_2.$yyyymmddhh.out
./iotop -C -t 10 5 4 >& /root/hotkernel_30/data/iotop-C-t_10_5_4.$yyyymmddhh.out
./cswstat_10sec.d >& /root/hotkernel_30/data/cswstat_10sec.$yyyymmddhh.out
The cron entry is as follows (every hour at 11 minutes past the hour), with the first set of data taken about 11 minutes after the latest reboot:
11 * * * * /root/hotkernel_30/runh30.csh >> /root/hotkernel_30/logs/runh30.log 2>&1
Hopefully it will lock up in about 3-4 days, and then I can look at the various 'dtrace' files after yet another hard reboot and actually isolate something specific about what is causing this system to hard hang.
Updated by Jon Strabala almost 12 years ago
Regarding Sandy Bridge power.conf and/or cpu-deep-idle issues for chips/chipsets like E3-1240 Xeons: this issue #1333 and also the older issue #423 exist, and in the illumos IRC chat we also have a few comments by "DontKnwMuch" and "lblume" ... I am now leaning towards a lack of integration of bug #6911650.
Despite the fact that I have started a set of hourly utility snapshots and 'dtrace' runs to try to isolate this, just like DontKnwMuch and Miha Krizaj I see kernel CPU go up after a while (a slow creep) until the point where scripts like 'hotkernel' will hang the system. Currently every hour I capture the following: http://www.pastie.org/2493215 (9 normal commands and 10 dtrace scripts). I would love to add more checks; the system will crash in about 2-3 days.
Note: if anyone can provide a 'dtrace' script, or a link to one, I would be happy to add it to my test bed to try to figure this issue out - I would love to get this fixed prior to the release of oi_151a.
As I said, I think this might be SUN-BUG #6911650 (part of Sun Patch-id 142910-04),
Synopsis: Deeper C-state entry regressed
refer to http://wesunsolve.net/bugid.php/id/6911650 - I wonder if this update has been ported into the illumos fork/src tree.
SUN-BUG #6911650 says specifically:
Dtrace shows acpi_cpu_cstate is never calling cpu_acpi_read_port()
to put CPUs into deeper c-states. Ah, the problem is a typo in the
*mcpu_mwait check with ACPI_ADR_SPACE_SYSTEM_IO
Perhaps illumos needs an update to the file src/uts/i86pc/os/cpupm/cpu_idle.c - indeed, I cannot find a 'back-patch' of 6911650 when looking at the history of this file.
If I were actually building oi_151 from source (but I don't do that yet) I would test this out myself, e.g. trying to fix "cpu_idle.c" as per the URL above.
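For what it's worth, there is a quick way to check for that exact symptom on a live box (assuming both functions are visible to the fbt provider on this build, and that dtrace is still safe to run): if acpi_cpu_cstate fires but cpu_acpi_read_port never does, that matches what SUN-BUG #6911650 describes.

dtrace -n 'fbt::acpi_cpu_cstate:entry, fbt::cpu_acpi_read_port:entry { @[probefunc] = count(); } tick-30s { exit(0); }'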
Updated by Jon Strabala almost 12 years ago
- File crash1.2011-09-06.tar.gz added
The system became unstable after 14 hours; I lost 'ssh' connectivity, and then even an iLOM GUI session was unusable, blinking on and off at about 15x per second.
In fact all 'pings' to the host's IP address time out (but succeed to the iLOM IP address).
Currently, during this "network failure" of the host (named lab10), the 'hotkernel' script is running and NOT completing; it should have timed out and exited in 30 seconds due to the custom modification in the script, BUT it has been running for 45 minutes (90x too long) prior to a needed reboot.
It is also possible that there is an issue with 'ahci0', per the 'inttimes' analysis.
Fortunately every measure has been captured in a log file (on each hour), so when I rebooted via the required hard power cycle no logs were lost.
ALL log files and data files are in: crash1.2011-09-06.tar.gz
There are lots of hourly sample files in the subdirectories data.crash1/ and logs/crash1/; note I added more scripts to capture more things during the run, so there are more output files towards the end. If anyone has ideas for more 'std. scripts', 'utilities' or 'dtrace scripts' to capture additional info, let me know and I will run the analysis again.
So once again I used IPMI for a hard power off and reboot:
Power Off Server - Orderly Shutdown (no effect ...)
Power Off Server - Immediate (worked, but rather severe ...)
Power On Server (to bring it back up online)
Analysis after reboot
Basic system info
lab10# uname -a
SunOS lab10 5.11 oi_148b i86pc i386 i86pc Solaris
Basically, with a lot of dtrace tracing, what we have is that on an idle system the kernel CPU crept from a measly 0.5% to 3.0%, as per measurements from the 'top' command.
It is also interesting what 'hotkernel' shows for acpi_cpu_cstate - the only metric that grows (besides kernel CPU): unix`acpi_cpu_cstate grows from about 700 calls in a 30-second sample to 83,235 in the last 30-second sample.
Below are a start/stop 'rup' and an hourly 'hotkernel_30sec | grep acpi_cpu_cstate':
runh30.2011-09-06_10.log: lab10 up 9 mins, load average: 0.00, 0.03, 0.02
runh30.2011-09-06_10.log:unix`acpi_cpu_cstate 564 0.2%
runh30.2011-09-06_11.log:unix`acpi_cpu_cstate 682 0.3%
runh30.2011-09-06_12.log:unix`acpi_cpu_cstate 710 0.3%
runh30.2011-09-06_13.log:unix`acpi_cpu_cstate 743 0.3%
runh30.2011-09-06_14.log:unix`acpi_cpu_cstate 725 0.3%
runh30.2011-09-06_15.log:unix`acpi_cpu_cstate 820 0.3%
runh30.2011-09-06_16.log:unix`acpi_cpu_cstate 649 0.3%
runh30.2011-09-06_17.log:unix`acpi_cpu_cstate 1050 0.4%
runh30.2011-09-06_18.log:unix`acpi_cpu_cstate 7536 3.0%
runh30.2011-09-06_19.log:unix`acpi_cpu_cstate 18721 7.6%
runh30.2011-09-06_20.log:unix`acpi_cpu_cstate 41254 16.7%
runh30.2011-09-06_21.log:unix`acpi_cpu_cstate 55342 22.4%
runh30.2011-09-06_22.log:unix`acpi_cpu_cstate 60255 24.5%
runh30.2011-09-06_23.log:unix`acpi_cpu_cstate 83235 34.6%
runh30.2011-09-06_23.log: lab10 up 12:57, load average: 0.00, 0.00, 0.00
The full 'hotkernel_30sec' sample from 2011-09-06_10 (uptime 0 hours 6 minutes),
and a tail -10 of the file:
unix`mutex_exit 4 0.0%
apix`apix_dispatch_softint 5 0.0%
unix`tsc_read 5 0.0%
unix`do_splx 6 0.0%
unix`ddi_get32 8 0.0%
unix`vsnprintf 8 0.0%
unix`mutex_delay_default 11 0.0%
unix`mutex_enter 17 0.0%
unix`acpi_cpu_cstate 648 0.3%
unix`i86_mwait 247108 99.7%
The full 'hotkernel_30sec' sample from 2011-09-06_16 (uptime 6 hours 49 minutes):
unix`cpu_acpi_idle 2 0.0%
dtrace`dtrace_ioctl 2 0.0%
unix`mutex_delay_default 3 0.0%
unix`tsc_gethrtimeunscaled_delta 3 0.0%
unix`mutex_enter 3 0.0%
unix`page_nextn 5 0.0%
genunix`fsflush_do_pages 5 0.0%
unix`do_splx 21 0.0%
unix`acpi_cpu_cstate 649 0.3%
unix`i86_mwait 247021 99.7%
The full 'hotkernel_30sec' sample from 2011-09-06_23 (uptime 12 hours 57 minutes):
zfs`lzjb_compress 24 0.0%
unix`tsc_gethrtimeunscaled_delta 25 0.0%
unix`todpc_rtcget 30 0.0%
unix`ddi_get32 44 0.0%
unix`do_splx 48 0.0%
genunix`fsflush 59 0.0%
unix`page_nextn 118 0.0%
genunix`fsflush_do_pages 118 0.0%
unix`acpi_cpu_cstate 83235 34.6%
unix`i86_mwait 156461 65.0%
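(If the box is still responsive enough for dtrace, one way to dig further into acpi_cpu_cstate - assuming its fbt entry/return probes exist on this build - is to look at the distribution of time spent per call:)

dtrace -n 'fbt::acpi_cpu_cstate:entry { self->t = timestamp; }
fbt::acpi_cpu_cstate:return /self->t/ { @["ns per acpi_cpu_cstate call"] = quantize(timestamp - self->t); self->t = 0; }
tick-30s { exit(0); }'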
Looking at the change in 10-second samples of the 'inttimes' dtrace script:
Perhaps ahci0, which has grown from 875349 to 3110530, is causing an issue? I am not sure, but that is a 3.55x increase - and note these 'inttimes' figures are nanoseconds of interrupt handling time per 10-second sample (per the TIME (ns) column), not interrupts/second.
lab10# cat inttimes_10sec.2011-09-06_10.out
Tracing... Hit Ctrl-C to end.
DEVICE TIME (ns)
ehci1 67104
ehci0 75402
e1000g0 567550
ahci0 875349
lab10# cat inttimes_10sec.2011-09-06_12.out
Tracing... Hit Ctrl-C to end.
DEVICE TIME (ns)
ehci1 71231
ehci0 74393
e1000g0 102241
ahci0 838477
lab10# cat inttimes_10sec.2011-09-06_14.out
Tracing... Hit Ctrl-C to end.
DEVICE TIME (ns)
ehci1 72452
ehci0 79223
e1000g0 111892
ahci0 1488218
lab10# cat inttimes_10sec.2011-09-06_16.out
Tracing... Hit Ctrl-C to end.
DEVICE TIME (ns)
ehci1 72066
ehci0 72927
e1000g0 126142
ahci0 814604
lab10# cat inttimes_10sec.2011-09-06_18.out
Tracing... Hit Ctrl-C to end.
dtrace: 313 dynamic variable drops with non-empty dirty list
dtrace: 47 dynamic variable drops with non-empty dirty list
dtrace: 389 dynamic variable drops with non-empty dirty list
dtrace: 159 dynamic variable drops with non-empty dirty list
dtrace: 151 dynamic variable drops with non-empty dirty list
dtrace: 445 dynamic variable drops with non-empty dirty list
DEVICE TIME (ns)
ehci0 74580
ehci1 77788
e1000g0 100090
ahci0 2574193
lab10# cat inttimes_10sec.2011-09-06_20.out
Tracing... Hit Ctrl-C to end.
dtrace: 771 dynamic variable drops with non-empty dirty list
dtrace: 1230 dynamic variable drops with non-empty dirty list
dtrace: 1640 dynamic variable drops with non-empty dirty list
dtrace: 481 dynamic variable drops with non-empty dirty list
dtrace: 55 dynamic variable drops with non-empty dirty list
dtrace: 599 dynamic variable drops with non-empty rinsing list
dtrace: 650 dynamic variable drops with non-empty dirty list
dtrace: 155 dynamic variable drops with non-empty dirty list
dtrace: 33 dynamic variable drops with non-empty dirty list
dtrace: 273 dynamic variable drops with non-empty dirty list
DEVICE TIME (ns)
ehci1 77913
ehci0 81590
e1000g0 120083
ahci0 2716996
lab10# cat inttimes_10sec.2011-09-06_21.out
Tracing... Hit Ctrl-C to end.
dtrace: 2091 dynamic variable drops with non-empty dirty list
dtrace: 1438 dynamic variable drops with non-empty dirty list
dtrace: 863 dynamic variable drops with non-empty dirty list
dtrace: 457 dynamic variable drops with non-empty dirty list
dtrace: 2280 dynamic variable drops with non-empty dirty list
dtrace: 1926 dynamic variable drops with non-empty dirty list
dtrace: 1802 dynamic variable drops with non-empty dirty list
dtrace: 2471 dynamic variable drops with non-empty dirty list
dtrace: 6 dynamic variable drops with non-empty rinsing list
dtrace: 1217 dynamic variable drops with non-empty dirty list
dtrace: 1431 dynamic variable drops with non-empty dirty list
dtrace: 1 dynamic variable drop with non-empty rinsing list
dtrace: 1328 dynamic variable drops with non-empty dirty list
DEVICE TIME (ns)
ehci0 70275
ehci1 72764
e1000g0 115825
ahci0 2150413
lab10# cat inttimes_10sec.2011-09-06_23.out
Tracing... Hit Ctrl-C to end.
dtrace: 10056 dynamic variable drops with non-empty dirty list
dtrace: 4 dynamic variable drops with non-empty rinsing list
dtrace: 12086 dynamic variable drops with non-empty dirty list
dtrace: 9 dynamic variable drops with non-empty rinsing list
dtrace: 6713 dynamic variable drops with non-empty dirty list
dtrace: 4820 dynamic variable drops with non-empty dirty list
dtrace: 4982 dynamic variable drops with non-empty dirty list
dtrace: 11144 dynamic variable drops with non-empty dirty list
dtrace: 7116 dynamic variable drops with non-empty dirty list
dtrace: 5434 dynamic variable drops with non-empty dirty list
dtrace: 7533 dynamic variable drops with non-empty dirty list
dtrace: 1 dynamic variable drop with non-empty rinsing list
dtrace: 3762 dynamic variable drops with non-empty dirty list
dtrace: 6260 dynamic variable drops with non-empty dirty list
DEVICE TIME (ns)
e1000g0 48716
ehci1 70819
ehci0 73537
ahci0 3110530
lab10#
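(Since 'inttimes' reports time rather than counts, here is a sketch for counting interrupts per driver over a 10-second window - this assumes the same sdt interrupt probes the DTT scripts use, and that struct dev_info is visible to dtrace on this build:)

dtrace -n 'sdt:::interrupt-start { @[stringof(((struct dev_info *)arg0)->devi_node_name)] = count(); } tick-10s { exit(0); }'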
Below are an hourly 'rup', a portion of 'top', and 'hotkernel_30sec | grep acpi_cpu_cstate':
==================================
Tue Sep 6 10:12:00 PDT 2011
lab10 up 9 mins, load average: 0.00, 0.03, 0.02
last pid: 2156; load avg: 0.00, 0.02, 0.02; up 0+00:09:48 10:12:03
64 processes: 63 sleeping, 1 on cpu
CPU states: 99.5% idle, 0.1% user, 0.4% kernel, 0.0% iowait, 0.0% swap
Kernel: 999 ctxsw, 2 trap, 1231 intr, 573 syscall, 2 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 564 0.2%
==================================
Tue Sep 6 11:00:00 PDT 2011
lab10 up 57 mins, load average: 0.00, 0.00, 0.00
last pid: 2526; load avg: 0.00, 0.00, 0.00; up 0+00:57:48 11:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 99.5% idle, 0.1% user, 0.4% kernel, 0.0% iowait, 0.0% swap
Kernel: 393 ctxsw, 378 trap, 1217 intr, 1204 syscall, 330 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 682 0.3%
==================================
Tue Sep 6 12:00:00 PDT 2011
lab10 up 1:57, load average: 0.00, 0.00, 0.00
last pid: 2782; load avg: 0.00, 0.00, 0.00; up 0+01:57:48 12:00:03
64 processes: 63 sleeping, 1 on cpu
CPU states: 99.1% idle, 0.1% user, 0.7% kernel, 0.0% iowait, 0.0% swap
Kernel: 658 ctxsw, 2 trap, 1321 intr, 540 syscall, 2 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 710 0.3%
==================================
Tue Sep 6 13:00:00 PDT 2011
lab10 up 2:57, load average: 0.00, 0.00, 0.00
last pid: 2879; load avg: 0.00, 0.00, 0.00; up 0+02:57:48 13:00:03
64 processes: 63 sleeping, 1 on cpu
CPU states: 99.2% idle, 0.0% user, 0.8% kernel, 0.0% iowait, 0.0% swap
Kernel: 895 ctxsw, 2 trap, 1422 intr, 542 syscall, 2 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 743 0.3%
==================================
Tue Sep 6 14:00:00 PDT 2011
lab10 up 3:57, load average: 0.00, 0.00, 0.00
last pid: 2945; load avg: 0.00, 0.00, 0.00; up 0+03:57:48 14:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 98.9% idle, 0.1% user, 1.0% kernel, 0.0% iowait, 0.0% swap
Kernel: 945 ctxsw, 2 trap, 1418 intr, 561 syscall, 2 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 725 0.3%
==================================
Tue Sep 6 15:00:00 PDT 2011
lab10 up 4:57, load average: 0.00, 0.00, 0.00
last pid: 3055; load avg: 0.00, 0.00, 0.00; up 0+04:57:49 15:00:04
65 processes: 64 sleeping, 1 on cpu
CPU states: 98.8% idle, 0.1% user, 1.1% kernel, 0.0% iowait, 0.0% swap
Kernel: 917 ctxsw, 2 trap, 1477 intr, 545 syscall, 2 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 820 0.3%
==================================
Tue Sep 6 16:00:00 PDT 2011
lab10 up 5:57, load average: 0.00, 0.00, 0.00
last pid: 3248; load avg: 0.00, 0.00, 0.00; up 0+05:57:48 16:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 98.5% idle, 0.0% user, 1.5% kernel, 0.0% iowait, 0.0% swap
Kernel: 328 ctxsw, 2 trap, 1150 intr, 570 syscall, 2 flt
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 649 0.3%
==================================
Tue Sep 6 17:00:00 PDT 2011
lab10 up 6:57, load average: 0.00, 0.00, 0.00
last pid: 3414; load avg: 0.00, 0.00, 0.00; up 0+06:57:48 17:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 98.6% idle, 0.0% user, 1.4% kernel, 0.0% iowait, 0.0% swap
Kernel: 904 ctxsw, 2 trap, 1444 intr, 547 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 1050 0.4%
==================================
Tue Sep 6 18:00:00 PDT 2011
lab10 up 7:57, load average: 0.00, 0.00, 0.00
last pid: 3501; load avg: 0.00, 0.00, 0.00; up 0+07:57:48 18:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.9% idle, 0.1% user, 2.0% kernel, 0.0% iowait, 0.0% swap
Kernel: 916 ctxsw, 2 trap, 1430 intr, 565 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 7536 3.0%
==================================
Tue Sep 6 19:00:00 PDT 2011
lab10 up 8:57, load average: 0.00, 0.00, 0.00
last pid: 3540; load avg: 0.00, 0.00, 0.00; up 0+08:57:48 19:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.9% idle, 0.1% user, 2.0% kernel, 0.0% iowait, 0.0% swap
Kernel: 316 ctxsw, 2 trap, 1032 intr, 558 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 18721 7.6%
==================================
Tue Sep 6 20:00:00 PDT 2011
lab10 up 9:57, load average: 0.00, 0.00, 0.00
last pid: 3579; load avg: 0.00, 0.00, 0.00; up 0+09:57:48 20:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.7% idle, 0.0% user, 2.3% kernel, 0.0% iowait, 0.0% swap
Kernel: 1169 ctxsw, 2 trap, 1554 intr, 561 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 41254 16.7%
==================================
Tue Sep 6 21:00:00 PDT 2011
lab10 up 10:57, load average: 0.00, 0.00, 0.00
last pid: 3618; load avg: 0.00, 0.00, 0.00; up 0+10:57:48 21:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.1% idle, 0.1% user, 2.7% kernel, 0.0% iowait, 0.0% swap
Kernel: 703 ctxsw, 3 trap, 1353 intr, 562 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 55342 22.4%
==================================
Tue Sep 6 22:00:00 PDT 2011
lab10 up 11:57, load average: 0.00, 0.00, 0.00
last pid: 3657; load avg: 0.00, 0.00, 0.00; up 0+11:57:48 22:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.5% idle, 0.0% user, 2.5% kernel, 0.0% iowait, 0.0% swap
Kernel: 986 ctxsw, 3 trap, 1370 intr, 550 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 60255 24.5%
==================================
Tue Sep 6 23:00:00 PDT 2011
lab10 up 12:57, load average: 0.00, 0.00, 0.00
last pid: 3696; load avg: 0.00, 0.00, 0.00; up 0+12:57:48 23:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.6% idle, 0.1% user, 2.2% kernel, 0.0% iowait, 0.0% swap
Kernel: 324 ctxsw, 2 trap, 1050 intr, 556 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
unix`acpi_cpu_cstate 83235 34.6%
==================================
Wed Sep 7 00:00:00 PDT 2011
lab10 up 13:57, load average: 0.00, 0.00, 0.00
last pid: 3735; load avg: 0.00, 0.00, 0.00; up 0+13:57:48 00:00:03
65 processes: 64 sleeping, 1 on cpu
CPU states: 97.0% idle, 0.0% user, 3.0% kernel, 0.0% iowait, 0.0% swap
Kernel: 324 ctxsw, 2 trap, 1096 intr, 570 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
I don't think 'lockstat' shows anything important:
lab10# cat lockstat-kIW-D_20_sleep_30.2011-09-06_17.out
Profiling interrupt: 23288 events in 30.017 seconds (776 events/sec)
Count indv cuml rcnt nsec Hottest CPU+PIL Caller
-------------------------------------------------------------------------------
23267 100% 100% 0.00 507 cpu[0] acpi_cpu_cstate
10 0% 100% 0.00 328 cpu[3] i86_mwait
10 0% 100% 0.00 868 cpu[7] (usermode)
1 0% 100% 0.00 384 cpu[6] page_unlock
-------------------------------------------------------------------------------
lab10# cat lockstat-kIW-D_20_sleep_30.2011-09-06_20.out
Profiling interrupt: 23288 events in 30.018 seconds (776 events/sec)
Count indv cuml rcnt nsec Hottest CPU+PIL Caller
-------------------------------------------------------------------------------
23263 100% 100% 0.00 514 cpu[2] acpi_cpu_cstate
12 0% 100% 0.00 704 cpu[4] (usermode)
10 0% 100% 0.00 323 cpu[3] i86_mwait
1 0% 100% 0.00 804 cpu[1] page_nextn
1 0% 100% 0.00 434 cpu[1]+11 sep_save
1 0% 100% 0.00 345 cpu[0] space_map_seg_compare
-------------------------------------------------------------------------------
lab10# cat lockstat-kIW-D_20_sleep_30.2011-09-06_23.out
Profiling interrupt: 23288 events in 30.017 seconds (776 events/sec)
Count indv cuml rcnt nsec Hottest CPU+PIL Caller
-------------------------------------------------------------------------------
23262 100% 100% 0.00 332 cpu[2] acpi_cpu_cstate
13 0% 100% 0.00 284 cpu[6] (usermode)
10 0% 100% 0.00 302 cpu[0] i86_mwait
1 0% 100% 0.00 143 cpu[1] kmem_zalloc
1 0% 100% 0.00 436 cpu[5] fsflush_do_pages
1 0% 100% 0.00 595 cpu[0] do_splx
-------------------------------------------------------------------------------
lab10#
Updated by Miha Krizaj almost 12 years ago
I have left a system running for 14 days, until it reached 25% or so kernel usage while completely idle (which, in my experience, is just below its breaking point) and did a forced crash dump.
I am attaching output from mdb -k
::stacks
::cpuinfo -v
::cycinfo -v
::walk cpu | ::print cpu_t
and a link to the forced crash dump file (950Mb), in case someone can somehow find the cause/source of the high CPU usage from it... The crash dump was made when the system was using some 25% in kernel while idle.
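(For anyone wanting to reproduce the capture: assuming dumpadm already points at a dump device, a forced dump can be taken either on the way down or live, e.g.:)

reboot -d      # panic the box and write a crash dump while going down
savecore -L    # capture a live crash dump of the running system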
Updated by Jon Strabala over 11 years ago
My prior systems ran the oi_148b ISO; I reproduced the same thing on an oi_151a system - it lost networking after about 20 hours of operation and then froze running 'hotkernel' at about the 21-hour mark.
Hourly log files of my scripts and dtrace outputs are attached as: data_logs_151a_freezeup_in_21hours.tar.gz
This is the same basic syndrome: an increase in both the 'ahci0' numbers when running 'inttimes' and an increase in 'unix`acpi_cpu_cstate' when running 'hotkernel'.
Note: for about 3 hours, from 2011-09-08_13 through the end of 2011-09-08_16, I was building ONNV_BUILDNUM=152 from the nightly source, so you might ignore the files for those hours when inspecting loads and/or other system activity; at all other times this system was essentially completely idle.
Below are the two interesting items I could find from my set of 21 hourly samples, followed by a 'top' analysis showing the CPU on a FULLY idle system creeping up from 99.9% idle to 97.2% idle prior to the freeze in 'hotkernel'.
root@lab10:~/hotkernel_30/data.151a# cat info.2011-09-08_07.out
date
September 8, 2011 07:08:46 AM PDT
uname -a
SunOS lab10 5.11 oi_151a i86pc i386 i86pc Solaris
rup lab10
lab10 up 15 mins, load average: 0.00, 0.02, 0.02
root@lab10:~/hotkernel_30/data.151a# grep ahci0 inttimes_10sec.2011-09-0*
inttimes_10sec.2011-09-08_07.out: ahci0 1041164
inttimes_10sec.2011-09-08_08.out: ahci0 1118775
inttimes_10sec.2011-09-08_09.out: ahci0 1112719
inttimes_10sec.2011-09-08_10.out: ahci0 1299571
inttimes_10sec.2011-09-08_11.out: ahci0 1166361
inttimes_10sec.2011-09-08_12.out: ahci0 1139335
inttimes_10sec.2011-09-08_13.out: ahci0 1556608     * built a illumos source tree, plus other work
inttimes_10sec.2011-09-08_14.out: ahci0 23165583    * built a illumos source tree, plus other work
inttimes_10sec.2011-09-08_15.out: ahci0 925821      * built a illumos source tree, plus other work
inttimes_10sec.2011-09-08_16.out: ahci0 1833976     * built a illumos source tree, plus other work
inttimes_10sec.2011-09-08_17.out: ahci0 1329873     ** full idle again
inttimes_10sec.2011-09-08_18.out: ahci0 2528877
inttimes_10sec.2011-09-08_19.out: ahci0 3968721
inttimes_10sec.2011-09-08_20.out: ahci0 2308941
inttimes_10sec.2011-09-08_21.out: ahci0 3021367
inttimes_10sec.2011-09-08_22.out: ahci0 3545604
inttimes_10sec.2011-09-08_23.out: ahci0 3254049
inttimes_10sec.2011-09-09_00.out: ahci0 4223043
inttimes_10sec.2011-09-09_01.out: ahci0 6785020
inttimes_10sec.2011-09-09_02.out: ahci0 2955979     *** hard/crash loss of network after this point
root@lab10:~/hotkernel_30/data.151a# cat info.2011-09-09_03.out
date
Fri Sep 9 03:00:02 PDT 2011
uname -a
SunOS lab10 5.11 oi_151a i86pc i386 i86pc
rup lab10
lab10 up 20:06, load average: 0.00, 0.00, 0.00
root@lab10:~/hotkernel_30/data.151a# egrep acpi_cpu_cstate\|i86_mwait hotkernel_30sec.2011-09-0*
hotkernel_30sec.2011-09-08_07.out:unix`acpi_cpu_cstate 667 0.3%
hotkernel_30sec.2011-09-08_07.out:unix`i86_mwait 247286 99.7%
hotkernel_30sec.2011-09-08_08.out:unix`acpi_cpu_cstate 753 0.3%
hotkernel_30sec.2011-09-08_08.out:unix`i86_mwait 247174 99.7%
hotkernel_30sec.2011-09-08_09.out:unix`acpi_cpu_cstate 690 0.3%
hotkernel_30sec.2011-09-08_09.out:unix`i86_mwait 247250 99.7%
hotkernel_30sec.2011-09-08_10.out:unix`acpi_cpu_cstate 796 0.3%
hotkernel_30sec.2011-09-08_10.out:unix`i86_mwait 247115 99.7%
hotkernel_30sec.2011-09-08_11.out:unix`acpi_cpu_cstate 647 0.3%
hotkernel_30sec.2011-09-08_11.out:unix`i86_mwait 247179 99.7%
hotkernel_30sec.2011-09-08_12.out:unix`acpi_cpu_cstate 484 0.2%
hotkernel_30sec.2011-09-08_12.out:unix`i86_mwait 247358 99.8%
hotkernel_30sec.2011-09-08_13.out:unix`acpi_cpu_cstate 578 0.2%     * built illumos
hotkernel_30sec.2011-09-08_13.out:unix`i86_mwait 247191 99.7%       * built illumos
hotkernel_30sec.2011-09-08_15.out:unix`acpi_cpu_cstate 454 0.2%     * built illumos
hotkernel_30sec.2011-09-08_15.out:unix`i86_mwait 247137 99.8%       * built illumos
hotkernel_30sec.2011-09-08_16.out:unix`acpi_cpu_cstate 426 0.2%     * built illumos
hotkernel_30sec.2011-09-08_16.out:unix`i86_mwait 246996 99.7%       * built illumos
hotkernel_30sec.2011-09-08_17.out:unix`acpi_cpu_cstate 3626 1.5%    ** full idle again
hotkernel_30sec.2011-09-08_17.out:unix`i86_mwait 243855 98.4%       ** full idle again
hotkernel_30sec.2011-09-08_18.out:unix`acpi_cpu_cstate 10192 4.1%
hotkernel_30sec.2011-09-08_18.out:unix`i86_mwait 237204 95.8%
hotkernel_30sec.2011-09-08_19.out:unix`acpi_cpu_cstate 16476 6.7%
hotkernel_30sec.2011-09-08_19.out:unix`i86_mwait 230753 93.2%
hotkernel_30sec.2011-09-08_20.out:unix`acpi_cpu_cstate 23956 9.7%
hotkernel_30sec.2011-09-08_20.out:unix`i86_mwait 223047 90.2%
hotkernel_30sec.2011-09-08_21.out:unix`acpi_cpu_cstate 44800 18.1%
hotkernel_30sec.2011-09-08_21.out:unix`i86_mwait 202065 81.7%
hotkernel_30sec.2011-09-08_22.out:unix`acpi_cpu_cstate 49319 20.0%
hotkernel_30sec.2011-09-08_22.out:unix`i86_mwait 197101 79.8%
hotkernel_30sec.2011-09-08_23.out:unix`acpi_cpu_cstate 57275 23.2%
hotkernel_30sec.2011-09-08_23.out:unix`i86_mwait 188817 76.6%
hotkernel_30sec.2011-09-09_00.out:unix`acpi_cpu_cstate 62413 25.4%
hotkernel_30sec.2011-09-09_00.out:unix`i86_mwait 182381 74.2%
hotkernel_30sec.2011-09-09_01.out:unix`acpi_cpu_cstate 68875 28.2%
hotkernel_30sec.2011-09-09_01.out:unix`i86_mwait 173698 71.2%
hotkernel_30sec.2011-09-09_02.out:unix`acpi_cpu_cstate 79454 32.7%  *** hard/crash loss of network later
hotkernel_30sec.2011-09-09_02.out:unix`i86_mwait 153549 63.2%       *** hard/crash loss of network later
root@lab10:~/hotkernel_30/data.151a#
One other thing: after about 10 hours we start getting lots of items similar to the two below:
dtrace: 2598 dynamic variable drops with non-empty dirty list
dtrace: 1409 dynamic variable drops with non-empty rinsing list
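(As an aside, those drops usually mean the script's dynamic-variable space is too small for the event rate it is now seeing; assuming the stock DTT scripts, the space can be raised from the command line, e.g.:)

dtrace -x dynvarsize=8m -s ./inttimes_10sec.d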
Looking at the files that show the above type of warnings:
root@lab10:~/hotkernel_30/data.151a# grep -l dynamic inttimes_10sec.2011-09-0*
inttimes_10sec.2011-09-08_17.out
inttimes_10sec.2011-09-08_18.out
inttimes_10sec.2011-09-08_19.out
inttimes_10sec.2011-09-08_20.out
inttimes_10sec.2011-09-08_21.out
inttimes_10sec.2011-09-08_22.out
inttimes_10sec.2011-09-08_23.out
inttimes_10sec.2011-09-09_00.out
inttimes_10sec.2011-09-09_01.out
inttimes_10sec.2011-09-09_02.out
Looking at some individual files:
root@lab10:~/hotkernel_30/data.151a# cat inttimes_10sec.2011-09-08_07.out
Tracing... Hit Ctrl-C to end.
DEVICE TIME (ns)
ehci1 65598
ehci0 78374
e1000g0 180137
ahci0 1041164
root@lab10:~/hotkernel_30/data.151a# cat inttimes_10sec.2011-09-08_16.out
Tracing... Hit Ctrl-C to end.
DEVICE TIME (ns)
ehci1 68515
ehci0 78892
e1000g0 858248
ahci0 1833976
root@lab10:~/hotkernel_30/data.151a# cat inttimes_10sec.2011-09-08_17.out
Tracing... Hit Ctrl-C to end.
dtrace: 109 dynamic variable drops with non-empty dirty list
DEVICE TIME (ns)
ehci1 81694
ehci0 85620
e1000g0 102613
ahci0 1329873
root@lab10:~/hotkernel_30/data.151a# cat inttimes_10sec.2011-09-08_18.out
Tracing... Hit Ctrl-C to end.
dtrace: 103 dynamic variable drops with non-empty dirty list
dtrace: 356 dynamic variable drops with non-empty dirty list
DEVICE TIME (ns)
ehci1 72442
ehci0 89250
e1000g0 107444
ahci0 2528877
root@lab10:~/hotkernel_30/data.151a#
root@lab10:~/hotkernel_30/data.151a# cat inttimes_10sec.2011-09-09_02.out
Tracing... Hit Ctrl-C to end.
dtrace: 3687 dynamic variable drops with non-empty dirty list
dtrace: 4403 dynamic variable drops with non-empty dirty list
dtrace: 5160 dynamic variable drops with non-empty dirty list
dtrace: 6109 dynamic variable drops with non-empty dirty list
dtrace: 10 dynamic variable drops with non-empty rinsing list
dtrace: 5320 dynamic variable drops with non-empty dirty list
dtrace: 5 dynamic variable drops with non-empty rinsing list
dtrace: 7076 dynamic variable drops with non-empty dirty list
dtrace: 10470 dynamic variable drops with non-empty dirty list
dtrace: 7951 dynamic variable drops with non-empty dirty list
dtrace: 12489 dynamic variable drops with non-empty dirty list
dtrace: 6375 dynamic variable drops with non-empty dirty list
dtrace: 5400 dynamic variable drops with non-empty dirty list
DEVICE TIME (ns)
e1000g0 52463
ehci1 75167
ehci0 89074
ahci0 2955979
root@lab10:~/hotkernel_30/data.151a#
And the CPU creep shown by 'top'
lab10# cd /root/hotkernel_30/logs.151a
lab10# foreach f ( *.log )
? echo $i
? egrep 'last pid'\|processes\|'CPU states'\|Kernel\|Memory $i | tail -4
? echo ""
? end
i: Undefined variable
lab10# foreach i ( *.log )
? echo $i
? egrep 'last pid'\|processes\|'CPU states'\|Kernel\|Memory $i | tail -4
? echo ""
? end
runh30.2011-09-08_07.log
63 processes: 62 sleeping, 1 on cpu
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
Kernel: 359 ctxsw, 2 trap, 1049 intr, 83 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
runh30.2011-09-08_08.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
Kernel: 354 ctxsw, 3 trap, 1051 intr, 83 syscall, 3 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
runh30.2011-09-08_09.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 99.6% idle, 0.0% user, 0.3% kernel, 0.0% iowait, 0.0% swap
Kernel: 355 ctxsw, 2 trap, 1132 intr, 82 syscall, 3 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
runh30.2011-09-08_10.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 99.5% idle, 0.0% user, 0.5% kernel, 0.0% iowait, 0.0% swap
Kernel: 355 ctxsw, 2 trap, 1091 intr, 83 syscall, 3 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
runh30.2011-09-08_11.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 99.4% idle, 0.0% user, 0.6% kernel, 0.0% iowait, 0.0% swap
Kernel: 354 ctxsw, 3 trap, 1061 intr, 83 syscall, 3 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
runh30.2011-09-08_12.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 99.2% idle, 0.0% user, 0.8% kernel, 0.0% iowait, 0.0% swap
Kernel: 365 ctxsw, 3 trap, 1093 intr, 96 syscall, 3 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
runh30.2011-09-08_13.log
65 processes: 64 sleeping, 1 on cpu
CPU states: 99.0% idle, 0.0% user, 0.9% kernel, 0.0% iowait, 0.0% swap
Kernel: 357 ctxsw, 2 trap, 1093 intr, 83 syscall, 3 flt
Memory: 16G phys mem, 11G free mem, 8187M total swap, 8187M free swap
runh30.2011-09-08_14.log
147 processes: 107 sleeping, 20 running, 13 zombie, 7 on cpu
CPU states: 0.0% idle, 50.5% user, 49.5% kernel, 0.0% iowait, 0.0% swap
Kernel: 1752 ctxsw, 55513 trap, 2719 intr, 112336 syscall, 69 fork, 49354 flt
Memory: 16G phys mem, 6058M free mem, 8187M total swap, 8187M free swap
runh30.2011-09-08_15.log
73 processes: 72 sleeping, 1 on cpu
CPU states: 98.8% idle, 0.0% user, 1.2% kernel, 0.0% iowait, 0.0% swap
Kernel: 350 ctxsw, 3 trap, 1053 intr, 140 syscall, 3 flt
Memory: 16G phys mem, 2624M free mem, 8187M total swap, 8182M free swap
runh30.2011-09-08_16.log
71 processes: 70 sleeping, 1 on cpu
CPU states: 98.8% idle, 0.0% user, 1.2% kernel, 0.0% iowait, 0.0% swap
Kernel: 362 ctxsw, 3 trap, 1003 intr, 154 syscall, 3 flt
Memory: 16G phys mem, 2324M free mem, 8187M total swap, 8182M free swap
runh30.2011-09-08_17.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 98.5% idle, 0.0% user, 1.5% kernel, 0.0% iowait, 0.0% swap
Kernel: 430 ctxsw, 5 trap, 1074 intr, 80 syscall, 4 flt
Memory: 16G phys mem, 7596M free mem, 8187M total swap, 8182M free swap
runh30.2011-09-08_18.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 98.3% idle, 0.0% user, 1.6% kernel, 0.0% iowait, 0.0% swap
Kernel: 349 ctxsw, 4 trap, 1074 intr, 81 syscall, 4 flt
Memory: 16G phys mem, 7596M free mem, 8187M total swap, 8182M free swap
runh30.2011-09-08_19.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 98.2% idle, 0.0% user, 1.8% kernel, 0.0% iowait, 0.0% swap
Kernel: 352 ctxsw, 4 trap, 1107 intr, 80 syscall, 4 flt
Memory: 16G phys mem, 7595M free mem, 8187M total swap, 8182M free swap
runh30.2011-09-08_20.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 98.0% idle, 0.0% user, 2.0% kernel, 0.0% iowait, 0.0% swap
Kernel: 354 ctxsw, 5 trap, 1103 intr, 81 syscall, 4 flt
Memory: 16G phys mem, 7593M free mem, 8187M total swap, 8182M free swap
runh30.2011-09-08_21.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.9% idle, 0.0% user, 2.1% kernel, 0.0% iowait, 0.0% swap
Kernel: 353 ctxsw, 4 trap, 1100 intr, 83 syscall, 4 flt
Memory: 16G phys mem, 7593M free mem, 8187M total swap, 8182M free swap
runh30.2011-09-08_22.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.8% idle, 0.0% user, 2.2% kernel, 0.0% iowait, 0.0% swap
Kernel: 352 ctxsw, 4 trap, 1091 intr, 81 syscall, 4 flt
Memory: 16G phys mem, 7592M free mem, 8187M total swap, 8182M free swap
runh30.2011-09-08_23.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.6% idle, 0.0% user, 2.3% kernel, 0.0% iowait, 0.0% swap
Kernel: 353 ctxsw, 4 trap, 1082 intr, 82 syscall, 4 flt
Memory: 16G phys mem, 7592M free mem, 8187M total swap, 8182M free swap
runh30.2011-09-09_00.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.6% idle, 0.0% user, 2.4% kernel, 0.0% iowait, 0.0% swap
Kernel: 399 ctxsw, 4 trap, 1072 intr, 83 syscall, 4 flt
Memory: 16G phys mem, 7590M free mem, 8187M total swap, 8182M free swap
runh30.2011-09-09_01.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.4% idle, 0.0% user, 2.6% kernel, 0.0% iowait, 0.0% swap
Kernel: 353 ctxsw, 4 trap, 1086 intr, 82 syscall, 4 flt
Memory: 16G phys mem, 7589M free mem, 8187M total swap, 8182M free swap
runh30.2011-09-09_02.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.2% idle, 0.0% user, 2.8% kernel, 0.0% iowait, 0.0% swap
Kernel: 357 ctxsw, 4 trap, 1087 intr, 82 syscall, 4 flt
Memory: 16G phys mem, 7589M free mem, 8187M total swap, 8182M free swap
runh30.2011-09-09_03.log
61 processes: 60 sleeping, 1 on cpu
CPU states: 97.2% idle, 0.0% user, 2.8% kernel, 0.0% iowait, 0.0% swap
Kernel: 388 ctxsw, 4 trap, 1025 intr, 81 syscall, 4 flt
Memory: 16G phys mem, 7587M free mem, 8187M total swap, 8182M free swap
lab10#
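To make the creep easier to eyeball, the kernel column can be pulled out of those logs with a one-liner; a sketch (it just prints the 7th field of each 'CPU states' line, relying on the log names sorting chronologically):

awk '/CPU states/ { print FILENAME, $7 }' runh30.*.log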
Updated by Jon Strabala over 11 years ago
Built acpica-nightly-2011-09-10 with gcc, with the updated 'acpica' bits (from danmcd - refer to IRC). The bits ran fine, but the same creeping CPU issue remains.
In about 15 hours the system went from 99.5% idle to 98.1% idle, prior to hanging on a scheduled (via cron) dtrace command (an attempt to debug the issue itself).
To date, all variants - OI oi_148b, oi_151a, and (as of 2011-09-10) a freshly compiled nightly and nightly+'acpica' bits - seem to have the same issue WRT the Supermicro X9SCA-F (Intel® C204 PCH chipset) with 16GB RAM and an E3-1270 Xeon CPU (i.e. a Sandy Bridge CPU).
A comment about dtrace:
Unfortunately, when I compile the nightly I have no 'dtrace' capability, nor 'powertop'. Even if I run the following (as per https://www.illumos.org/projects/illumos-gate/wiki/How_To_Build_illumos):

mcs -d -n .SUNW_ctf /kernel/misc/klmmod
mcs -d -n .SUNW_ctf /kernel/misc/klmops
mcs -d -n .SUNW_ctf /kernel/misc/amd64/klmmod
mcs -d -n .SUNW_ctf /kernel/misc/amd64/klmops

I still don't have 'dtrace' functionality. I seem to have to do a lot more, e.g.:

find /kernel -type f -name "*" | xargs mcs -d -n .SUNW_ctf
find /platform/i86pc/kernel -type f -name "*" | xargs mcs -d -n .SUNW_ctf
bootadm update-archive
init 6

just to get minimal 'dtrace' capability running.
Regardless of the limited symbol and 'dtrace' functionality, we see the same issue and symptoms: a) a slow kernel CPU creep on an idle system, b) an increase in 'unix`acpi_cpu_cstate' when running 'hotkernel', c) the ability to crash the system in 14-24 hours via any simple 'dtrace' command, and d) a long-term (2-3 week) freeze of the idle system requiring a hard power cycle.
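For anyone following along without the DTrace Toolkit installed, a minimal hotkernel-style one-liner (my sketch, not the toolkit script itself; it prints raw sample counts rather than the toolkit's percentages) shows the same i86_mwait vs. acpi_cpu_cstate breakdown:

# sample kernel PCs at 1001 Hz for 30 seconds, keep the 10 hottest functions
dtrace -n 'profile-1001 /arg0/ { @[func(arg0)] = count(); }
           tick-30s { trunc(@, 10); printa(@); exit(0); }'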
Basic information showing the same problem
root@lab10:~/hotkernel_30/data.acpica-nightly-2011-09-10# ../check.csh uname -a SunOS lab10 5.11 testws i86pc i386 i86pc Solaris rup lab10 lab10 up 1 min, load average: 0.15, 0.05, 0.02 beadm list BE Active Mountpoint Space Policy Created 151a - - 45.7M static 2011-09-07 13:41 acpica-nd - - 4.89G static 2011-09-07 11:10 acpica-nightly-2011-09-09 - - 56.0M static 2011-09-09 13:22 acpica-nightly-2011-09-10 NR / 70.4G static 2011-09-10 16:48 nightly-2011-09-08 - - 139M static 2011-09-08 16:03 openindiana - - 45.2M static 2011-05-05 10:02 sandbox - - 63.0K static 2011-09-07 10:46 ---- CPU states: 99.5% idle, 0.0% user, 0.5% kernel, 0.0% iowait, 0.0% swap CPU states: 99.5% idle, 0.0% user, 0.5% kernel, 0.0% iowait, 0.0% swap CPU states: 99.4% idle, 0.0% user, 0.6% kernel, 0.0% iowait, 0.0% swap CPU states: 99.4% idle, 0.0% user, 0.6% kernel, 0.0% iowait, 0.0% swap CPU states: 99.3% idle, 0.0% user, 0.7% kernel, 0.0% iowait, 0.0% swap CPU states: 99.2% idle, 0.0% user, 0.9% kernel, 0.0% iowait, 0.0% swap CPU states: 99.0% idle, 0.0% user, 1.0% kernel, 0.0% iowait, 0.0% swap CPU states: 98.9% idle, 0.0% user, 1.1% kernel, 0.0% iowait, 0.0% swap CPU states: 98.9% idle, 0.0% user, 1.1% kernel, 0.0% iowait, 0.0% swap CPU states: 98.7% idle, 0.0% user, 1.3% kernel, 0.0% iowait, 0.0% swap CPU states: 98.8% idle, 0.0% user, 1.2% kernel, 0.0% iowait, 0.0% swap CPU states: 98.6% idle, 0.0% user, 1.4% kernel, 0.0% iowait, 0.0% swap CPU states: 98.5% idle, 0.0% user, 1.5% kernel, 0.0% iowait, 0.0% swap CPU states: 98.4% idle, 0.0% user, 1.6% kernel, 0.0% iowait, 0.0% swap CPU states: 98.4% idle, 0.0% user, 1.6% kernel, 0.0% iowait, 0.0% swap CPU states: 98.1% idle, 0.0% user, 1.9% kernel, 0.0% iowait, 0.0% swap ---- last pid: 101132; load avg: 0.13, 0.05, 0.02; up 0+00:01:34 20:21:23 50 processes: 49 sleeping, 1 on cpu CPU states: 99.5% idle, 0.0% user, 0.5% kernel, 0.0% iowait, 0.0% swap Kernel: 409 ctxsw, 4 trap, 1173 intr, 57 syscall, 4 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 101213; load avg: 0.00, 0.00, 0.00; up 0+00:40:24 21:00:13 44 processes: 43 sleeping, 1 on cpu CPU states: 99.5% idle, 0.0% user, 0.5% kernel, 0.0% iowait, 0.0% swap Kernel: 354 ctxsw, 4 trap, 1020 intr, 53 syscall, 4 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 101269; load avg: 0.00, 0.00, 0.00; up 0+01:40:24 22:00:13 44 processes: 43 sleeping, 1 on cpu CPU states: 99.4% idle, 0.0% user, 0.6% kernel, 0.0% iowait, 0.0% swap Kernel: 347 ctxsw, 4 trap, 1126 intr, 53 syscall, 4 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 101318; load avg: 0.00, 0.00, 0.00; up 0+02:40:25 23:00:14 44 processes: 43 sleeping, 1 on cpu CPU states: 99.4% idle, 0.0% user, 0.6% kernel, 0.0% iowait, 0.0% swap Kernel: 355 ctxsw, 4 trap, 1087 intr, 53 syscall, 4 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 101367; load avg: 0.00, 0.00, 0.00; up 0+03:40:25 00:00:14 44 processes: 43 sleeping, 1 on cpu CPU states: 99.3% idle, 0.0% user, 0.7% kernel, 0.0% iowait, 0.0% swap Kernel: 393 ctxsw, 4 trap, 1080 intr, 53 syscall, 4 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 101443; load avg: 0.00, 0.00, 0.00; up 0+04:40:24 01:00:13 44 processes: 43 sleeping, 1 on cpu CPU states: 99.2% idle, 0.0% user, 0.9% kernel, 0.0% iowait, 0.0% swap Kernel: 353 ctxsw, 4 trap, 1196 intr, 53 syscall, 4 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 
101492; load avg: 0.00, 0.00, 0.00; up 0+05:40:24 02:00:13 44 processes: 43 sleeping, 1 on cpu CPU states: 99.0% idle, 0.0% user, 1.0% kernel, 0.0% iowait, 0.0% swap Kernel: 357 ctxsw, 4 trap, 1193 intr, 53 syscall, 4 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 101541; load avg: 0.00, 0.00, 0.00; up 0+06:40:24 03:00:13 44 processes: 43 sleeping, 1 on cpu CPU states: 98.9% idle, 0.0% user, 1.1% kernel, 0.0% iowait, 0.0% swap Kernel: 351 ctxsw, 4 trap, 1126 intr, 53 syscall, 4 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 101594; load avg: 0.00, 0.00, 0.00; up 0+07:40:24 04:00:13 44 processes: 43 sleeping, 1 on cpu CPU states: 98.9% idle, 0.0% user, 1.1% kernel, 0.0% iowait, 0.0% swap Kernel: 362 ctxsw, 4 trap, 1053 intr, 53 syscall, 4 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 101643; load avg: 0.00, 0.00, 0.00; up 0+08:40:24 05:00:13 44 processes: 43 sleeping, 1 on cpu CPU states: 98.7% idle, 0.0% user, 1.3% kernel, 0.0% iowait, 0.0% swap Kernel: 368 ctxsw, 4 trap, 1219 intr, 53 syscall, 4 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 101692; load avg: 0.00, 0.00, 0.00; up 0+09:40:24 06:00:13 44 processes: 43 sleeping, 1 on cpu CPU states: 98.8% idle, 0.0% user, 1.2% kernel, 0.0% iowait, 0.0% swap Kernel: 353 ctxsw, 4 trap, 1048 intr, 53 syscall, 4 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 101838; load avg: 0.00, 0.00, 0.00; up 0+10:40:24 07:00:13 47 processes: 46 sleeping, 1 on cpu CPU states: 98.6% idle, 0.0% user, 1.4% kernel, 0.0% iowait, 0.0% swap Kernel: 391 ctxsw, 4 trap, 1128 intr, 54 syscall, 4 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 102235; load avg: 0.00, 0.00, 0.00; up 0+11:40:24 08:00:13 48 processes: 47 sleeping, 1 on cpu CPU states: 98.5% idle, 0.0% user, 1.5% kernel, 0.0% iowait, 0.0% swap Kernel: 394 ctxsw, 6 trap, 1234 intr, 77 syscall, 7 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 102284; load avg: 0.00, 0.00, 0.00; up 0+12:40:24 09:00:13 48 processes: 47 sleeping, 1 on cpu CPU states: 98.4% idle, 0.0% user, 1.6% kernel, 0.0% iowait, 0.0% swap Kernel: 391 ctxsw, 3 trap, 1229 intr, 76 syscall, 3 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 102340; load avg: 0.00, 0.00, 0.00; up 0+13:40:24 10:00:13 48 processes: 47 sleeping, 1 on cpu CPU states: 98.4% idle, 0.0% user, 1.6% kernel, 0.0% iowait, 0.0% swap Kernel: 354 ctxsw, 3 trap, 1084 intr, 76 syscall, 3 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap last pid: 103035; load avg: 0.00, 0.00, 0.00; up 0+14:40:24 11:00:13 47 processes: 46 sleeping, 1 on cpu CPU states: 98.1% idle, 0.0% user, 1.9% kernel, 0.0% iowait, 0.0% swap Kernel: 382 ctxsw, 54 trap, 1191 intr, 117 syscall, 40 flt Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap ---- hotkernel_30sec.2011-09-10_20.out unix`thread_affinity_set 20 0.0% unix`tsc_read 23 0.0% unix`lock_try 24 0.0% genunix`kmem_hash_rescale 36 0.0% unix`splr 95 0.0% unix`do_splx 128 0.1% genunix`vmem_hash_delete 596 0.2% unix`acpi_cpu_cstate 605 0.2% unix`scan_memory 1031 0.4% unix`i86_mwait 245224 98.9% hotkernel_30sec.2011-09-10_21.out apix`apix_dispatch_softint 3 0.0% genunix`callout_list_get 3 0.0% genunix`cyclic_coverage 3 0.0% unix`do_splx 3 0.0% unix`tsc_read 4 0.0% unix`ddi_get32 5 0.0% genunix`vmem_hash_rescale 6 0.0% 
unix`page_nextn 7 0.0% unix`acpi_cpu_cstate 709 0.3% unix`i86_mwait 247197 99.7% hotkernel_30sec.2011-09-10_22.out genunix`fsflush_do_pages 3 0.0% unix`mutex_enter 4 0.0% genunix`cyclic_coverage 4 0.0% apix`apix_dispatch_softint 5 0.0% genunix`vmem_hash_rescale 6 0.0% unix`do_splx 6 0.0% unix`page_nextn 7 0.0% unix`mutex_delay_default 8 0.0% unix`acpi_cpu_cstate 672 0.3% unix`i86_mwait 247224 99.7% hotkernel_30sec.2011-09-10_23.out unix`i86_monitor 3 0.0% zfs`ddt_stat_add 4 0.0% unix`tsc_read 4 0.0% genunix`fsflush_do_pages 6 0.0% genunix`vmem_hash_rescale 7 0.0% unix`page_nextn 8 0.0% dtrace`dtrace_ioctl 11 0.0% unix`do_splx 16 0.0% unix`acpi_cpu_cstate 702 0.3% unix`i86_mwait 247152 99.7% hotkernel_30sec.2011-09-11_00.out unix`do_splx 3 0.0% unix`page_nextn 4 0.0% unix`sys_syscall 5 0.0% apix`apix_dispatch_softint 6 0.0% genunix`syscall_mstate 6 0.0% unix`tsc_read 6 0.0% genunix`fsflush_do_pages 6 0.0% genunix`vmem_hash_rescale 7 0.0% unix`acpi_cpu_cstate 828 0.3% unix`i86_mwait 247054 99.6% hotkernel_30sec.2011-09-11_01.out genunix`syscall_mstate 5 0.0% unix`page_nextn 5 0.0% unix`mutex_enter 6 0.0% unix`tsc_read 7 0.0% unix`mutex_delay_default 7 0.0% genunix`vmem_hash_rescale 7 0.0% unix`do_splx 7 0.0% genunix`fsflush_do_pages 9 0.0% unix`acpi_cpu_cstate 493 0.2% unix`i86_mwait 247328 99.8% hotkernel_30sec.2011-09-11_02.out genunix`cyclic_coverage_hash 3 0.0% unix`disp_getwork 3 0.0% apix`apix_dispatch_softint 5 0.0% genunix`fsflush_do_pages 5 0.0% unix`tsc_read 6 0.0% genunix`vmem_hash_rescale 7 0.0% unix`page_nextn 8 0.0% unix`ddi_get32 20 0.0% unix`acpi_cpu_cstate 509 0.2% unix`i86_mwait 247328 99.8% hotkernel_30sec.2011-09-11_03.out apix`apix_dispatch_softint 3 0.0% unix`fakesoftint_return 3 0.0% unix`tsc_read 4 0.0% genunix`fsflush_do_pages 6 0.0% dtrace`dtrace_ioctl 6 0.0% genunix`vmem_hash_rescale 8 0.0% unix`page_nextn 10 0.0% unix`do_splx 16 0.0% unix`acpi_cpu_cstate 567 0.2% unix`i86_mwait 247216 99.7% hotkernel_30sec.2011-09-11_04.out TS`ts_update_list 6 0.0% unix`tsc_read 6 0.0% unix`mutex_enter 7 0.0% unix`page_nextn 7 0.0% genunix`vmem_hash_rescale 8 0.0% unix`do_splx 8 0.0% genunix`fsflush_do_pages 10 0.0% genunix`thread_lock 11 0.0% unix`acpi_cpu_cstate 631 0.3% unix`i86_mwait 247092 99.7% hotkernel_30sec.2011-09-11_05.out dtrace`dtrace_dynvar_clean 3 0.0% unix`fakesoftint_return 3 0.0% unix`resume 3 0.0% unix`do_splx 5 0.0% genunix`fsflush_do_pages 6 0.0% unix`tsc_read 7 0.0% genunix`vmem_hash_rescale 9 0.0% unix`page_nextn 10 0.0% unix`acpi_cpu_cstate 556 0.2% unix`i86_mwait 247218 99.7% hotkernel_30sec.2011-09-11_06.out unix`mutex_enter 4 0.0% unix`fakesoftint_return 5 0.0% unix`tsc_read 6 0.0% genunix`fsflush_do_pages 7 0.0% genunix`vmem_hash_rescale 9 0.0% unix`page_nextn 10 0.0% unix`do_splx 17 0.0% unix`ddi_get32 32 0.0% unix`acpi_cpu_cstate 649 0.3% unix`i86_mwait 247042 99.7% hotkernel_30sec.2011-09-11_07.out unix`splr 4 0.0% unix`ddi_get32 5 0.0% unix`sys_syscall 5 0.0% genunix`fsflush_do_pages 5 0.0% unix`tsc_read 7 0.0% unix`do_splx 9 0.0% genunix`vmem_hash_rescale 10 0.0% unix`page_nextn 11 0.0% unix`acpi_cpu_cstate 537 0.2% unix`i86_mwait 247127 99.7% hotkernel_30sec.2011-09-11_08.out unix`default_lock_delay 7 0.0% dtrace`dtrace_ioctl 7 0.0% genunix`dnlc_purge_vfsp 8 0.0% unix`mutex_delay_default 9 0.0% unix`page_nextn 10 0.0% genunix`fsflush_do_pages 10 0.0% genunix`vmem_hash_rescale 11 0.0% unix`do_splx 14 0.0% unix`acpi_cpu_cstate 3381 1.4% unix`i86_mwait 244217 98.6% hotkernel_30sec.2011-09-11_09.out genunix`cyclic_coverage 4 0.0% unix`disp_getwork 6 
0.0% unix`mutex_enter 7 0.0% genunix`dnlc_purge_vfsp 7 0.0% unix`do_splx 9 0.0% genunix`fsflush_do_pages 9 0.0% unix`page_nextn 12 0.0% genunix`vmem_hash_rescale 12 0.0% unix`acpi_cpu_cstate 6367 2.6% unix`i86_mwait 241182 97.4% hotkernel_30sec.2011-09-11_10.out genunix`syscall_mstate 5 0.0% unix`mutex_enter 6 0.0% unix`fakesoftint_return 6 0.0% genunix`dnlc_purge_vfsp 7 0.0% genunix`fsflush_do_pages 9 0.0% unix`do_splx 10 0.0% unix`page_nextn 13 0.0% genunix`vmem_hash_rescale 13 0.0% unix`acpi_cpu_cstate 10260 4.1% unix`i86_mwait 237285 95.8% hotkernel_30sec.2011-09-11_11.out unix`mutex_delay_default 10 0.0% unix`tsc_read 12 0.0% genunix`syscall_mstate 12 0.0% unix`page_nextn 13 0.0% genunix`vmem_hash_rescale 14 0.0% genunix`fsflush_do_pages 14 0.0% unix`do_splx 15 0.0% unix`mutex_enter 18 0.0% unix`acpi_cpu_cstate 14031 5.7% unix`i86_mwait 233282 94.2% root@lab10:~/hotkernel_30/data.acpica-nightly-2011-09-10#
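Since the telltale sign seems to be unix`acpi_cpu_cstate climbing (0.2% -> 5.7% in the samples above), a cheaper way to watch it between full hotkernel runs might be an fbt one-liner; a sketch, assuming fbt can attach to that function (it counts calls per minute, which is only a proxy for the time hotkernel measures - presumably similar to what my custom watch_acpi_cpu_cstate_60.d does):

dtrace -n 'fbt::acpi_cpu_cstate:entry { @ = count(); }
           tick-60s { printa("acpi_cpu_cstate calls/min: %@d\n", @); clear(@); }'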
Updated by Jon Strabala over 11 years ago
- File logs_data_scripts.151a.final.20110913.tar.gz logs_data_scripts.151a.final.20110913.tar.gz added
Once again kernel CPU usage on an idle system crept up, taking the system from 99.4% idle to 98.4% idle.
Oddly enough, this time the system hung relatively fast, in just six (6) hours, so this may be a different issue altogether, one I have never seen before: a) I had keyboard I/O and screen output, which I never had before, and b) I never saw unix`acpi_cpu_cstate increase when running 'hotkernel'. Also, I ran many more 'dtrace' scripts than before; hopefully this isn't related to the number of times one kicks off dtrace. As such I will re-run the last test one more time and re-post.
I have attached "logs_data_scripts.151a.final.20110913.tar.gz", a bundle of logs, data, and modified/new dtrace scripts (in /opt/DTT, on top of the DTrace Toolkit) used for all my half-hour samples. The typical set of samples taken each half hour (taking about 9 minutes to run) is as follows, and is included in the gtar file (above):
-rw-r--r-- 1 root root     805 2011-09-12 16:51 info.2011-09-12_1651.out
-rw-r--r-- 1 root root    2899 2011-09-12 16:51 top_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root    1386 2011-09-12 16:51 mpstat_10_2.2011-09-12_1651.out
-rw-r--r-- 1 root root     391 2011-09-12 16:51 vmstat_3_3.2011-09-12_1651.out
-rw-r--r-- 1 root root    4773 2011-09-12 16:51 prstat-a-m_5_3.2011-09-12_1651.out
-rw-r--r-- 1 root root    2420 2011-09-12 16:51 prstat-L-c_4_2.2011-09-12_1651.out
-rw-r--r-- 1 root root     631 2011-09-12 16:51 profile-1001.kernel.2011-09-12_1651.out
-rw-r--r-- 1 root root     649 2011-09-12 16:52 profile-1001.user.2011-09-12_1651.out
-rw-r--r-- 1 root root  177565 2011-09-12 16:52 more.2011-09-12_1651.out
-rw-r--r-- 1 root root    4383 2011-09-12 16:52 profile-1971_5sec.2011-09-12_1651.out
-rw-r--r-- 1 root root 5729338 2011-09-12 16:52 adaptive.2011-09-12_1651.out
-rw-r--r-- 1 root root    1058 2011-09-12 16:52 syscall_cnt_by_syscall.2011-09-12_1651.out
-rw-r--r-- 1 root root    3516 2011-09-12 16:52 hotkernel_30sec.2011-09-12_1651.out
-rw-r--r-- 1 root root     756 2011-09-12 16:53 hotkernel_30sec_m.2011-09-12_1651.out
-rw-r--r-- 1 root root    1742 2011-09-12 16:53 cpuwalk.d_10.2011-09-12_1651.out
-rw-r--r-- 1 root root    1855 2011-09-12 16:53 intoncpu_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root     187 2011-09-12 16:54 inttimes_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root     154 2011-09-12 16:54 xcallsbypid_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root    1365 2011-09-12 16:54 sysbypid_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root    1557 2011-09-12 16:54 topsyscall-C-s_5_2.2011-09-12_1651.out
-rw-r--r-- 1 root root    1193 2011-09-12 16:55 iotop-C-t_10_5_4.2011-09-12_1651.out
-rw-r--r-- 1 root root     685 2011-09-12 16:55 cswstat_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root     273 2011-09-12 16:55 idle-state_10sec.2011-09-12_1651.out
-rw-r--r-- 1 root root     934 2011-09-12 16:55 watch_cpu_acpi_read_port.2011-09-12_1651.out
-rw-r--r-- 1 root root    3713 2011-09-12 16:56 watch_cstate.2011-09-12_1651.out
-rw-r--r-- 1 root root       1 2011-09-12 16:56 speedstep.2011-09-12_1651.out
-rw-r--r-- 1 root root     136 2011-09-12 16:57 watch_acpi_cpu_cstate_60.2011-09-12_1651.out
-rw-r--r-- 1 root root     142 2011-09-12 16:58 watch_cpu_acpi_read_port_60.2011-09-12_1651.out
-rw-r--r-- 1 root root    2909 2011-09-12 16:58 cpudists.2011-09-12_1651.out
-rw-r--r-- 1 root root     159 2011-09-12 16:58 cputimes.2011-09-12_1651.out
-rw-r--r-- 1 root root     213 2011-09-12 16:58 dladm_show-link-s.2011-09-12_1651.out
-rw-r--r-- 1 root root     380 2011-09-12 16:58 dlstat_show-link-r.2011-09-12_1651.out
-rw-r--r-- 1 root root     104 2011-09-12 16:58 dlstat.2011-09-12_1651.out
-rw-r--r-- 1 root root    2706 2011-09-12 16:58 intrstat_5_3.2011-09-12_1651.out
-rw-r--r-- 1 root root    7574 2011-09-12 16:59 powertop-d_3-t_10.2011-09-12_1651.out
-rw-r--r-- 1 root root     777 2011-09-12 16:59 lockstat-kIW-D_20_sleep_30.2011-09-12_1651.out
-rw-r--r-- 1 root root     156 2011-09-12 17:00 intrtime.pl.2011-09-12_1651.out
However, at this point I think I would need to capture a kernel core dump; if someone is willing to post instructions on a) how to do it and b) how/where to send it, I would be more than happy to oblige.
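(For the record, the procedure I have pieced together from the docs - corrections welcome, none of this is tested on the hung box:

dumpadm          # verify a dump device and savecore directory are configured
savecore -L      # capture a live crash dump of the running kernel
reboot -d        # or force a panic + dump on the way down, if the box still responds

The resulting dump files should land in the savecore directory shown by dumpadm.)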
root@lab10:~/hotkernel_30/logs.151a.final# ls -ltr
total 145
-rw-r--r-- 1 root root 12762 2011-09-12 11:28 runh30.2011-09-12_1121.log
-rw-r--r-- 1 root root 11037 2011-09-12 11:58 runh30.2011-09-12_1151.log
-rw-r--r-- 1 root root 10219 2011-09-12 12:28 runh30.2011-09-12_1221.log
-rw-r--r-- 1 root root 11117 2011-09-12 12:58 runh30.2011-09-12_1251.log
-rw-r--r-- 1 root root 10291 2011-09-12 13:28 runh30.2011-09-12_1321.log
-rw-r--r-- 1 root root 10499 2011-09-12 13:58 runh30.2011-09-12_1351.log
-rw-r--r-- 1 root root 10640 2011-09-12 14:28 runh30.2011-09-12_1421.log
-rw-r--r-- 1 root root 11328 2011-09-12 14:58 runh30.2011-09-12_1451.log
-rw-r--r-- 1 root root 10500 2011-09-12 15:28 runh30.2011-09-12_1521.log
-rw-r--r-- 1 root root 12297 2011-09-12 15:58 runh30.2011-09-12_1551.log
-rw-r--r-- 1 root root 11327 2011-09-12 16:28 runh30.2011-09-12_1621.log
-rw-r--r-- 1 root root 11443 2011-09-12 16:58 runh30.2011-09-12_1651.log
-rw-r--r-- 1 root root  4707 2011-09-12 17:22 runh30.2011-09-12_1721.log
root@lab10:~/hotkernel_30/logs.151a.final#
As I said, the system hung rather fast - 6 hours instead of nearly 20 - and this was odd because I never saw the 'normal' increase in counts for unix`acpi_cpu_cstate. Also, this one time I actually had a keyboard and screen display on the console (a first), BUT no response at all when I typed anything. This obviously seems like a different issue altogether.
root@lab10:~/hotkernel_30/data.151a.final# ../check.csh date September 13, 2011 07:45:26 AM PDT rup lab10 lab10 up 21 mins, load average: 0.00, 0.00, 0.00 uname -a SunOS lab10 5.11 oi_151a i86pc i386 i86pc Solaris beadm list BE Active Mountpoint Space Policy Created 151a NR / 56.3G static 2011-09-07 13:41 acpica-nd - - 4.89G static 2011-09-07 11:10 acpica-nightly-2011-09-09 - - 56.0M static 2011-09-09 13:22 acpica-nightly-2011-09-10 - - 22.8G static 2011-09-10 16:48 nightly-2011-09-08 - - 139M static 2011-09-08 16:03 openindiana - - 45.2M static 2011-05-05 10:02 sandbox - - 63.0K static 2011-09-07 10:46 ---- CPU states: 99.4% idle, 0.0% user, 0.6% kernel, 0.0% iowait, 0.0% swap CPU states: 99.4% idle, 0.0% user, 0.6% kernel, 0.0% iowait, 0.0% swap CPU states: 99.3% idle, 0.0% user, 0.7% kernel, 0.0% iowait, 0.0% swap CPU states: 99.2% idle, 0.0% user, 0.8% kernel, 0.0% iowait, 0.0% swap CPU states: 99.2% idle, 0.0% user, 0.8% kernel, 0.0% iowait, 0.0% swap CPU states: 99.0% idle, 0.0% user, 1.0% kernel, 0.0% iowait, 0.0% swap CPU states: 99.0% idle, 0.0% user, 1.0% kernel, 0.0% iowait, 0.0% swap CPU states: 98.9% idle, 0.0% user, 1.1% kernel, 0.0% iowait, 0.0% swap CPU states: 98.9% idle, 0.0% user, 1.1% kernel, 0.0% iowait, 0.0% swap CPU states: 98.8% idle, 0.0% user, 1.2% kernel, 0.0% iowait, 0.0% swap CPU states: 98.7% idle, 0.0% user, 1.3% kernel, 0.0% iowait, 0.0% swap CPU states: 98.7% idle, 0.0% user, 1.3% kernel, 0.0% iowait, 0.0% swap CPU states: 98.4% idle, 0.0% user, 1.6% kernel, 0.0% iowait, 0.0% swap ---- last pid: 1277; load avg: 0.00, 0.02, 0.01; up 0+00:07:45 11:21:50 48 processes: 47 sleeping, 1 on cpu CPU states: 99.4% idle, 0.0% user, 0.6% kernel, 0.0% iowait, 0.0% swap Kernel: 410 ctxsw, 58 trap, 1006 intr, 172 syscall, 45 flt Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap last pid: 1429; load avg: 0.00, 0.00, 0.00; up 0+00:37:14 11:51:19 47 processes: 46 sleeping, 1 on cpu CPU states: 99.4% idle, 0.0% user, 0.6% kernel, 0.0% iowait, 0.0% swap Kernel: 390 ctxsw, 5 trap, 1063 intr, 59 syscall, 5 flt Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap last pid: 1545; load avg: 0.00, 0.00, 0.00; up 0+01:07:13 12:21:18 47 processes: 46 sleeping, 1 on cpu CPU states: 99.3% idle, 0.0% user, 0.7% kernel, 0.0% iowait, 0.0% swap Kernel: 342 ctxsw, 5 trap, 987 intr, 59 syscall, 5 flt Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap last pid: 1668; load avg: 0.00, 0.00, 0.00; up 0+01:37:13 12:51:18 47 processes: 46 sleeping, 1 on cpu CPU states: 99.2% idle, 0.0% user, 0.8% kernel, 0.0% iowait, 0.0% swap Kernel: 348 ctxsw, 5 trap, 1051 intr, 62 syscall, 5 flt Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap last pid: 1961; load avg: 0.01, 0.11, 0.08; up 0+02:07:13 13:21:18 47 processes: 46 sleeping, 1 on cpu CPU states: 99.2% idle, 0.0% user, 0.8% kernel, 0.0% iowait, 0.0% swap Kernel: 386 ctxsw, 2 trap, 1037 intr, 59 syscall, 3 flt Memory: 16G phys mem, 5858M free mem, 8187M total swap, 8187M free swap last pid: 2077; load avg: 0.00, 0.00, 0.01; up 0+02:37:13 13:51:18 47 processes: 46 sleeping, 1 on cpu CPU states: 99.0% idle, 0.0% user, 1.0% kernel, 0.0% iowait, 0.0% swap Kernel: 367 ctxsw, 5 trap, 1135 intr, 59 syscall, 5 flt Memory: 16G phys mem, 5859M free mem, 8187M total swap, 8187M free swap last pid: 2193; load avg: 0.00, 0.00, 0.00; up 0+03:07:13 14:21:18 47 processes: 46 sleeping, 1 on cpu CPU states: 99.0% idle, 0.0% user, 1.0% kernel, 0.0% iowait, 0.0% swap Kernel: 347 ctxsw, 5 trap, 
1058 intr, 59 syscall, 5 flt Memory: 16G phys mem, 5859M free mem, 8187M total swap, 8187M free swap last pid: 2309; load avg: 0.00, 0.00, 0.00; up 0+03:37:13 14:51:18 47 processes: 46 sleeping, 1 on cpu CPU states: 98.9% idle, 0.0% user, 1.1% kernel, 0.0% iowait, 0.0% swap Kernel: 350 ctxsw, 5 trap, 1076 intr, 59 syscall, 5 flt Memory: 16G phys mem, 5860M free mem, 8187M total swap, 8187M free swap last pid: 2495; load avg: 0.00, 0.00, 0.00; up 0+04:07:13 15:21:18 47 processes: 46 sleeping, 1 on cpu CPU states: 98.9% idle, 0.0% user, 1.1% kernel, 0.0% iowait, 0.0% swap Kernel: 348 ctxsw, 5 trap, 1060 intr, 59 syscall, 5 flt Memory: 16G phys mem, 5861M free mem, 8187M total swap, 8187M free swap last pid: 2631; load avg: 0.00, 0.00, 0.00; up 0+04:37:13 15:51:18 47 processes: 46 sleeping, 1 on cpu CPU states: 98.8% idle, 0.0% user, 1.2% kernel, 0.0% iowait, 0.0% swap Kernel: 413 ctxsw, 143 trap, 1086 intr, 732 syscall, 114 flt Memory: 16G phys mem, 5862M free mem, 8187M total swap, 8187M free swap last pid: 2814; load avg: 0.00, 0.00, 0.00; up 0+05:07:14 16:21:19 47 processes: 46 sleeping, 1 on cpu CPU states: 98.7% idle, 0.0% user, 1.3% kernel, 0.0% iowait, 0.0% swap Kernel: 357 ctxsw, 5 trap, 1028 intr, 60 syscall, 5 flt Memory: 16G phys mem, 5863M free mem, 8187M total swap, 8187M free swap last pid: 2930; load avg: 0.00, 0.00, 0.00; up 0+05:37:13 16:51:18 47 processes: 46 sleeping, 1 on cpu CPU states: 98.7% idle, 0.0% user, 1.3% kernel, 0.0% iowait, 0.0% swap Kernel: 346 ctxsw, 5 trap, 1038 intr, 59 syscall, 5 flt Memory: 16G phys mem, 5864M free mem, 8187M total swap, 8187M free swap last pid: 3046; load avg: 0.00, 0.00, 0.00; up 0+06:07:13 17:21:18 47 processes: 46 sleeping, 1 on cpu CPU states: 98.4% idle, 0.0% user, 1.6% kernel, 0.0% iowait, 0.0% swap Kernel: 343 ctxsw, 5 trap, 1133 intr, 59 syscall, 5 flt Memory: 16G phys mem, 5865M free mem, 8187M total swap, 8187M free swap ---- hotkernel_30sec.2011-09-12_1121.out genunix`set_active_fd 3 0.0% unix`sys_syscall 3 0.0% unix`page_nextn 3 0.0% genunix`fsflush_do_pages 3 0.0% unix`do_splx 3 0.0% unix`mutex_delay_default 4 0.0% unix`bzero 4 0.0% unix`mutex_enter 9 0.0% unix`acpi_cpu_cstate 632 0.3% unix`i86_mwait 247295 99.7% hotkernel_30sec.2011-09-12_1151.out unix`tsc_gethrtimeunscaled_delta 3 0.0% unix`i86_monitor 3 0.0% unix`mutex_delay_default 3 0.0% unix`page_nextn 3 0.0% unix`setbackdq 4 0.0% unix`mutex_enter 4 0.0% genunix`fsflush_do_pages 5 0.0% genunix`fsflush 30 0.0% unix`acpi_cpu_cstate 642 0.3% unix`i86_mwait 247265 99.7% hotkernel_30sec.2011-09-12_1221.out genunix`callout_list_expire 2 0.0% genunix`dnlc_purge_vfsp 3 0.0% dtrace`dtrace_dynvar_clean 3 0.0% unix`default_lock_delay 4 0.0% unix`page_nextn 4 0.0% genunix`fsflush 9 0.0% unix`todpc_rtcget 10 0.0% genunix`fsflush_do_pages 11 0.0% unix`acpi_cpu_cstate 615 0.2% unix`i86_mwait 247313 99.7% hotkernel_30sec.2011-09-12_1251.out unix`page_nextn 3 0.0% unix`mutex_exit 3 0.0% genunix`fsflush_do_pages 3 0.0% unix`bzero 3 0.0% unix`ddi_get32 4 0.0% unix`0xfffffffffb85 4 0.0% unix`do_splx 4 0.0% unix`mutex_delay_default 6 0.0% unix`acpi_cpu_cstate 758 0.3% unix`i86_mwait 247156 99.7% hotkernel_30sec.2011-09-12_1321.out unix`mutex_enter 4 0.0% unix`i86_monitor 4 0.0% unix`mutex_delay_default 6 0.0% unix`do_splx 6 0.0% unix`tsc_gethrtimeunscaled_delta 7 0.0% genunix`dnlc_purge_vfsp 7 0.0% unix`page_nextn 28 0.0% genunix`fsflush_do_pages 29 0.0% unix`acpi_cpu_cstate 688 0.3% unix`i86_mwait 247113 99.7% hotkernel_30sec.2011-09-12_1351.out unix`mutex_delay_default 2 0.0% 
unix`atomic_cas_32 2 0.0% unix`mutex_enter 3 0.0% unix`atomic_and_64 3 0.0% genunix`dnlc_purge_vfsp 8 0.0% unix`page_nextn 19 0.0% genunix`fsflush_do_pages 20 0.0% unix`todpc_rtcget 29 0.0% unix`acpi_cpu_cstate 684 0.3% unix`i86_mwait 247064 99.7% hotkernel_30sec.2011-09-12_1421.out unix`do_splx 2 0.0% dtrace`dtrace_dynvar_clean 3 0.0% unix`bzero 3 0.0% unix`tsc_gethrtime_delta 4 0.0% unix`i86_monitor 4 0.0% genunix`dnlc_purge_vfsp 9 0.0% genunix`fsflush_do_pages 21 0.0% unix`page_nextn 26 0.0% unix`acpi_cpu_cstate 862 0.3% unix`i86_mwait 246968 99.6% hotkernel_30sec.2011-09-12_1451.out unix`do_splx 4 0.0% unix`tsc_gethrtimeunscaled_delta 5 0.0% unix`default_lock_delay 7 0.0% genunix`dnlc_purge_vfsp 8 0.0% unix`mutex_delay_default 12 0.0% genunix`fsflush_do_pages 30 0.0% unix`page_nextn 31 0.0% unix`ddi_get32 32 0.0% unix`acpi_cpu_cstate 665 0.3% unix`i86_mwait 247079 99.7% hotkernel_30sec.2011-09-12_1521.out unix`bzero 3 0.0% unix`do_splx 3 0.0% unix`tsc_gethrtimeunscaled_delta 4 0.0% unix`mutex_delay_default 5 0.0% genunix`dnlc_purge_vfsp 9 0.0% unix`ddi_get32 11 0.0% unix`page_nextn 27 0.0% genunix`fsflush_do_pages 31 0.0% unix`acpi_cpu_cstate 805 0.3% unix`i86_mwait 246906 99.6% hotkernel_30sec.2011-09-12_1551.out e1000g`e1000g_receive 6 0.0% TS`ts_update_list 6 0.0% unix`mutex_enter 7 0.0% genunix`thread_lock 7 0.0% genunix`dnlc_purge_vfsp 9 0.0% unix`ddi_get32 12 0.0% genunix`fsflush_do_pages 21 0.0% unix`page_nextn 37 0.0% unix`acpi_cpu_cstate 617 0.2% unix`i86_mwait 246607 99.7% hotkernel_30sec.2011-09-12_1621.out unix`kstat_compare_bykid 5 0.0% unix`default_lock_delay 5 0.0% unix`0xfffffffffb85 6 0.0% genunix`dnlc_purge_vfsp 10 0.0% unix`bzero 12 0.0% unix`mutex_delay_default 21 0.0% unix`page_nextn 26 0.0% genunix`fsflush_do_pages 32 0.0% unix`acpi_cpu_cstate 620 0.3% unix`i86_mwait 247033 99.7% hotkernel_30sec.2011-09-12_1651.out unix`mutex_enter 5 0.0% unix`bzero 6 0.0% unix`tsc_gethrtimeunscaled_delta 8 0.0% unix`do_splx 8 0.0% genunix`dnlc_purge_vfsp 12 0.0% unix`mutex_delay_default 12 0.0% unix`page_nextn 29 0.0% genunix`fsflush_do_pages 30 0.0% unix`acpi_cpu_cstate 775 0.3% unix`i86_mwait 246928 99.6%
Note that unix`acpi_cpu_cstate was at 775 (a normal value) and never went up to, say, 14000 like in all previous runs, and this hang took only six hours, much too quickly - so maybe I hit a different bug.
As indicated above, I have attached "logs_data_scripts.151a.final.20110913.tar.gz", a bundle of logs, data, and modified/new dtrace scripts (in /opt/DTT, on top of the DTrace Toolkit) used for all my half-hour samples.
Basically I run /root/hotkernel_30/runh30.csh from cron to collect all my data, as follows:
21 * * * * /root/hotkernel_30/runh30.csh >> /root/hotkernel_30/logs/runh30.`date +\%Y-\%m-\%d_\%H\%M`.log 2>&1
51 * * * * /root/hotkernel_30/runh30.csh >> /root/hotkernel_30/logs/runh30.`date +\%Y-\%m-\%d_\%H\%M`.log 2>&1
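(The backslash-escaped % signs matter: an unescaped % in a crontab command field is treated as a newline, which would truncate the command at the date format string.)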
This "master" script launches a bunch of dtrace and non-dtrace scripts (forgive the hard pathing it was/is an ugly growth) ...
#!/bin/csh -f
cd /opt/DTT
set yyyymmddhh=`date +"%Y-%m-%d_%H%M"`
echo "========================================================================="
echo "uname -a"
uname -a
echo ""
echo "date"
date
echo ""
echo "rup lab10"
rup lab10
sleep 1
echo "date" >> /root/hotkernel_30/data/info.$yyyymmddhh.out
date >> /root/hotkernel_30/data/info.$yyyymmddhh.out
echo "" >> /root/hotkernel_30/data/info.$yyyymmddhh.out
sleep 1
echo "uname -a" >> /root/hotkernel_30/data/info.$yyyymmddhh.out
uname -a >> /root/hotkernel_30/data/info.$yyyymmddhh.out
echo "" >> /root/hotkernel_30/data/info.$yyyymmddhh.out
sleep 1
echo "rup lab10" >> /root/hotkernel_30/data/info.$yyyymmddhh.out
rup lab10 >> /root/hotkernel_30/data/info.$yyyymmddhh.out
echo "" >> /root/hotkernel_30/data/info.$yyyymmddhh.out
sleep 1
echo "beadm list" >> /root/hotkernel_30/data/info.$yyyymmddhh.out
beadm list >> /root/hotkernel_30/data/info.$yyyymmddhh.out
echo "" >> /root/hotkernel_30/data/info.$yyyymmddhh.out
echo ""
echo "top -s 10 -n 15 -d 2 -b" >& /root/hotkernel_30/data/top_10sec.$yyyymmddhh.out
sleep 1
top -s 10 -n 15 -d 2 -b >& /root/hotkernel_30/data/top_10sec.$yyyymmddhh.out
cat /root/hotkernel_30/data/top_10sec.$yyyymmddhh.out
sleep 1
mpstat 10 2 >& /root/hotkernel_30/data/mpstat_10_2.$yyyymmddhh.out
sleep 1
vmstat 3 3 >& /root/hotkernel_30/data/vmstat_3_3.$yyyymmddhh.out
sleep 1
prstat -a -m 5 3 >& /root/hotkernel_30/data/prstat-a-m_5_3.$yyyymmddhh.out
sleep 1
prstat -L -c 4 2 >& /root/hotkernel_30/data/prstat-L-c_4_2.$yyyymmddhh.out
# http://prefetch.net/blog/index.php/2007/03/03/viewing-busy-code-paths-with-dtrace/
echo "----------"
echo "busiest code paths in the kernel"
echo "dtrace -n 'profile-1001 {@[stack(20)]=count()} END{trunc(@,2)} tick-1s / i++ >= 1 / { exit(0); }'"
sleep 1
dtrace -n 'profile-1001 {@[stack(20)]=count()} END{trunc(@,2)} tick-1s / i++ >= 1 / { exit(0); }' >& /root/hotkernel_30/data/profile-1001.kernel.$yyyymmddhh.out
cat /root/hotkernel_30/data/profile-1001.kernel.$yyyymmddhh.out
echo "----------"
echo "busiest userland code paths"
echo "dtrace -n 'profile-1001 /arg1/ {@[execname,ustack()]=count()} END{trunc(@,2)} tick-1s / i++ >= 1 / { exit(0); }'"
sleep 1
dtrace -n 'profile-1001 /arg1/ {@[execname,ustack()]=count()} END{trunc(@,2)} tick-1s / i++ >= 1 / { exit(0); }' >& /root/hotkernel_30/data/profile-1001.user.$yyyymmddhh.out
cat /root/hotkernel_30/data/profile-1001.user.$yyyymmddhh.out
echo "/root/hotkernel_30/more.csh >& /root/hotkernel_30/data/more.$yyyymmddhh.out"
sleep 1
/root/hotkernel_30/more.csh >& /root/hotkernel_30/data/more.$yyyymmddhh.out
echo "grep i86_mwait:entry /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l"
grep i86_mwait:entry /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l
echo ""
echo "grep i86_mwait:return /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l"
grep i86_mwait:return /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l
echo ""
echo "grep cpu_idle_adaptive:entry /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l"
grep cpu_idle_adaptive:entry /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l
echo ""
echo "grep cpu_idle_adaptive:return /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l"
grep cpu_idle_adaptive:return /root/hotkernel_30/data/more.$yyyymmddhh.out | wc -l
echo ""
sleep 1
dtrace -n 'profile-1971 {@[stack()]=count();} tick-1s {trunc(@, 3); printa(@);trunc(@);} tick-1s / i++ >= 5 / { exit(0); }' >& /root/hotkernel_30/data/profile-1971_5sec.$yyyymmddhh.out
echo "dtrace on mutex_vector_enter:entry, adaptive-acquire, adaptive-release"
echo "dtrace on mutex_vector_enter:entry, adaptive-acquire, adaptive-release" > /root/hotkernel_30/data/adaptive.$yyyymmddhh.out
echo "" >> /root/hotkernel_30/data/adaptive.$yyyymmddhh.out
sleep 1
echo "/root/hotkernel_30/6958602.1.d"
echo "/root/hotkernel_30/6958602.1.d" >> /root/hotkernel_30/data/adaptive.$yyyymmddhh.out
/root/hotkernel_30/6958602.1.d >& /root/hotkernel_30/data/adaptive.$yyyymmddhh.out.1
cat /root/hotkernel_30/data/adaptive.$yyyymmddhh.out.1 >> /root/hotkernel_30/data/adaptive.$yyyymmddhh.out
rm /root/hotkernel_30/data/adaptive.$yyyymmddhh.out.1
echo ""
tail -10 /root/hotkernel_30/data/adaptive.$yyyymmddhh.out
echo "----------------"
echo "#Syscall count by syscall"
echo "dtrace -n 'syscall:::entry { @num[probefunc] = count(); } tick-1s / i++ >= 1 / { exit(0); }'"
sleep 1
dtrace -n 'syscall:::entry { @num[probefunc] = count(); } tick-1s / i++ >= 1 / { exit(0); }' >& /root/hotkernel_30/data/syscall_cnt_by_syscall.$yyyymmddhh.out
cat /root/hotkernel_30/data/syscall_cnt_by_syscall.$yyyymmddhh.out
echo ""
echo ""
echo "./hotkernel_30sec >& /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out"
sleep 1
./hotkernel_30sec >& /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out
echo ""
echo "wc -l /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out"
wc -l /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out
echo ""
echo "cat /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out"
cat /root/hotkernel_30/data/hotkernel_30sec.$yyyymmddhh.out
sleep 1
./hotkernel_30sec -m >& /root/hotkernel_30/data/hotkernel_30sec_m.$yyyymmddhh.out
echo ""
echo "take four dtrace CPU measures: cpuwalk.d 10, intoncpu_10sec.d, inttimes_10sec.d, xcallsbypid_10sec.d"
cd /opt/DTT/Cpu
sleep 1
./cpuwalk.d 10 >& /root/hotkernel_30/data/cpuwalk.d_10.$yyyymmddhh.out
sleep 1
./intoncpu_10sec.d >& /root/hotkernel_30/data/intoncpu_10sec.$yyyymmddhh.out
sleep 1
./inttimes_10sec.d >& /root/hotkernel_30/data/inttimes_10sec.$yyyymmddhh.out
sleep 1
./xcallsbypid_10sec.d >& /root/hotkernel_30/data/xcallsbypid_10sec.$yyyymmddhh.out
echo ""
echo "take a few Bin measures: sysbypid_10sec.d, topsyscall -C -s 5 2, iotop -C -t 10 5 4, cswstat_10sec.d, idle-state_10sec.d"
cd /opt/DTT/Bin
sleep 1
./sysbypid_10sec.d >& /root/hotkernel_30/data/sysbypid_10sec.$yyyymmddhh.out
sleep 1
./topsyscall -C -s 5 2 >& /root/hotkernel_30/data/topsyscall-C-s_5_2.$yyyymmddhh.out
sleep 1
./iotop -C -t 10 5 4 >& /root/hotkernel_30/data/iotop-C-t_10_5_4.$yyyymmddhh.out
sleep 1
./cswstat_10sec.d >& /root/hotkernel_30/data/cswstat_10sec.$yyyymmddhh.out
sleep 1
./idle-state_10sec.d >& /root/hotkernel_30/data/idle-state_10sec.$yyyymmddhh.out
sleep 1
./watch_cpu_acpi_read_port.d >& /root/hotkernel_30/data/watch_cpu_acpi_read_port.$yyyymmddhh.out
sleep 1
./watch_cstate.d >& /root/hotkernel_30/data/watch_cstate.$yyyymmddhh.out
sleep 1
/opt/DTT/Cpu/speedstep.d >& /root/hotkernel_30/data/speedstep.$yyyymmddhh.out
sleep 1
/opt/DTT/Bin/watch_acpi_cpu_cstate_60.d >& /root/hotkernel_30/data/watch_acpi_cpu_cstate_60.$yyyymmddhh.out
sleep 1
/opt/DTT/Bin/watch_cpu_acpi_read_port_60.d >& /root/hotkernel_30/data/watch_cpu_acpi_read_port_60.$yyyymmddhh.out
sleep 1
/opt/DTT/Bin/cpudists >& /root/hotkernel_30/data/cpudists.$yyyymmddhh.out
sleep 1
/opt/DTT/Bin/cputimes >& /root/hotkernel_30/data/cputimes.$yyyymmddhh.out
echo 'other: prstat -a -m 5 3, dladm show-link -s, dlstat show-link -r, dlstat, mpstat 10 2, intrstat 5 3'
sleep 1
dladm show-link -s >& /root/hotkernel_30/data/dladm_show-link-s.$yyyymmddhh.out
sleep 1
dlstat show-link -r >& /root/hotkernel_30/data/dlstat_show-link-r.$yyyymmddhh.out
sleep 1
dlstat >& /root/hotkernel_30/data/dlstat.$yyyymmddhh.out
sleep 1
intrstat 5 3 >& /root/hotkernel_30/data/intrstat_5_3.$yyyymmddhh.out
sleep 1
powertop -d 3 -t 10 >& /root/hotkernel_30/data/powertop-d_3-t_10.$yyyymmddhh.out
sleep 1
lockstat -kIW -D 20 sleep 30 >& /root/hotkernel_30/data/lockstat-kIW-D_20_sleep_30.$yyyymmddhh.out
sleep 1
/root/hotkernel_30/intrtime.pl >& /root/hotkernel_30/data/intrtime.pl.$yyyymmddhh.out
Updated by Jon Strabala over 11 years ago
- File logs_data_scripts.151a.final.20110914.tar.gz logs_data_scripts.151a.final.20110914.tar.gz added
Once again kernel CPU usage on an idle system crept up, taking the system from 99.9% idle to 98.6% idle.
And like the last run, the system hung relatively fast - twelve (12) hours - such that once again I never saw unix`acpi_cpu_cstate increase prior to the crash when running 'hotkernel'; typically the system would have needed to last about 8 more hours for that.
I have attached "logs_data_scripts.151a.final.20110914.tar.gz" a second bundle of log, data, and modified, new dtrace (in opt/DTT ontop of the Dtrace Toolkit) scripts used for all 20 minute samples.
Note that in all cases this is the "dev-il" 151a, not the newly released ISO repository.
Updated by Jon Strabala over 11 years ago
Right now this looks like it might be entirely a BIOS issue. A while back the motherboard vendor told me "we do not support OpenIndiana or any other derivative of OpenSolaris" when I opened a trouble ticket with them - thus the end of SuperMicro's help.
However, SuperMicro just came out with a new BIOS for the X9SCA-F board, x9sca1.819.zip. So I did the following:
- I made a "released oi_151a" BE
- I reset my /etc/power.conf
- I updated/flashed my BIOS to x9sca1.819
And the initial results are very promising for my Sandy Bridge C204 E3-1270 system - I have been running for 5 hours now and I don't see the weird kernel CPU leak; my guess is that the system is stable now and will not hang. It is nice (and normal) to see my idle system say 100% idle when I run top.
Running for 5 hours, I don't see the CPU creep via basic 'top' (yes, I know 'prstat' is better than 'top'; I am sampling this system three (3) times an hour with 47 scripts, 2 of which are prstat commands, and most of which are dtrace based).
As can be seen, I occasionally get 100% idle, and it looks like with the new BIOS there is no longer an unknown leak in the kernel CPU column.
top_10sec.2011-09-14_1240.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1240.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1300.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1300.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1320.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1320.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1340.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1340.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1400.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1400.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1420.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1420.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1440.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1440.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1500.out:CPU states: 99.9% idle, 0.1% user, 0.0% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1500.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1520.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1520.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1540.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1540.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1600.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1600.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1620.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1620.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1640.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1640.out:CPU states: 99.8% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1700.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1700.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1720.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1720.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1740.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1740.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
Compared with all my other runs, I would most likely have seen a significant problem by now, e.g. "CPU states: 98.8% idle, 0.0% user, 1.2% kernel, 0.0% iowait, 0.0% swap".
It might not matter at all, but my /etc/power.conf file is now exactly as below - I tried this configuration several times before, so I am pretty sure it is the BIOS update that is making the difference.
#
# run 'pmconfig' if you change this file
#
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable poll-mode
#
# fun commands:
# kstat -m cpu_info | grep "supported_frequencies_Hz"
# kstat -m cpu_info | grep "current_clock_Hz"
Updated by Jon Strabala over 11 years ago
I seem to have jumped the gun in claiming a BIOS fix: things looked great for the first 10 hours, and then I started seeing the kernel CPU creep again. The system has been up for about 19 hours now, BUT I now see a 'lost' 0.3% CPU in the kernel.
Based on past data runs with 151a (dev), it does seem to be leaking 'lost' kernel CPU cycles about 9X slower, so maybe the BIOS update just makes things harder to debug (perhaps it helps, or maybe there are multiple issues), or perhaps it was a combination of my /etc/power.conf and going to a BE from the newly released 151a.
I have provided a simple grep of my 20-minute 'top' samples below - I am still sampling this system three (3) times an hour with 47 scripts, 2 of which are prstat commands, most of them dtrace based. Perhaps this time I will see something of interest if/when it finally hangs.
root@lab10:~/hotkernel_30/data# rup lab10 lab10 up 19:16, load average: 0.00, 0.01, 0.01 root@lab10:~/hotkernel_30/data# grep 'CPU s' top_* top_10sec.2011-09-14_1240.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1240.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1300.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1300.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1320.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1320.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1340.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1340.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1400.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1400.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1420.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1420.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1440.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1440.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1500.out:CPU states: 99.9% idle, 0.1% user, 0.0% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1500.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1520.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1520.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1540.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1540.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1600.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1600.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1620.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1620.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1640.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1640.out:CPU states: 99.8% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1700.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1700.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1720.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1720.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1740.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1740.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1800.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap top_10sec.2011-09-14_1800.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% 
iowait, 0.0% swap
top_10sec.2011-09-14_1820.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1820.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1840.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1840.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1900.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1900.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1920.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1920.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1940.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_1940.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2000.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2000.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2020.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2020.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2040.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2040.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2100.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2100.out:CPU states: 99.9% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2120.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2120.out:CPU states: 99.8% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2140.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2140.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2200.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2200.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2220.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2220.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2240.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2240.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2300.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2300.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2320.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2320.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2340.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-14_2340.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0000.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0000.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0020.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0020.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0040.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0040.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0100.out:CPU states: 99.8% idle, 0.1% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0100.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0120.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0120.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0140.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0140.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0200.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0200.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0220.out:CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0220.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0240.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0240.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0300.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0300.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0320.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0320.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0340.out:CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0340.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0400.out:CPU states: 99.3% idle, 0.0% user, 0.7% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0400.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0420.out:CPU states: 99.6% idle, 0.0% user, 0.4% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0420.out:CPU states: 99.7% idle, 0.0% user, 0.3% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0440.out:CPU states: 99.7% idle, 0.0% user, 0.3% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0440.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0500.out:CPU states: 99.6% idle, 0.0% user, 0.4% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0500.out:CPU states: 99.7% idle, 0.0% user, 0.3% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0520.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0520.out:CPU states: 99.7% idle, 0.0% user, 0.3% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0540.out:CPU states: 99.6% idle, 0.0% user, 0.4% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0540.out:CPU states: 99.7% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0600.out:CPU states: 99.4% idle, 0.0% user, 0.6% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0600.out:CPU states: 99.7% idle, 0.0% user, 0.3% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0620.out:CPU states: 99.6% idle, 0.1% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0620.out:CPU states: 99.7% idle, 0.0% user, 0.3% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0640.out:CPU states: 99.6% idle, 0.0% user, 0.4% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0640.out:CPU states: 99.7% idle, 0.0% user, 0.3% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0700.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0700.out:CPU states: 99.7% idle, 0.0% user, 0.3% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0720.out:CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
top_10sec.2011-09-15_0720.out:CPU states: 99.7% idle, 0.0% user, 0.3% kernel, 0.0% iowait, 0.0% swap
root@lab10:~/hotkernel_30/data#
Updated by Aubrey Li over 11 years ago
Hi,
It looks like this is not related to CPU power management.
From the mpstat log in the first bug reporter's thread, the interrupt count is higher than expected.
So, when the CPU% has leaked, what does the following command say?
#dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10"
Or if apix is not enabled,
#dtrace -n "do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10"
Thanks,
-Aubrey
Updated by Miha Krizaj over 11 years ago
Hi,
I had a "scheduled" reboot today, to keep the machine up, and before rebooting I ran the command as per your post above (the kernel usage was at some 15% or so):
dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10"
This did not hang the system. I rebooted anyway (using "reboot") to get another 10 or so days before it hangs, and I noticed that after rebooting kernel usage was still high, even higher than before the reboot: the completely idle system was using 25% in kernel. (Until now I had always ended up with a hung system, and therefore a power-off, while trying various dtrace scripts.)
After turning it off (init 5), kernel usage after boot was 0% or so. Perhaps I do not understand the reboot command, but I find this very interesting; I thought the reboot command restarts the kernel.
Here is also my boot log after reboot if this can show something:
Reboot:
http://www.pastie.org/2554875
And here is the dtrace output:
unix`_interrupt+0xba
1
unix`_interrupt+0xba
unix`wrmsr+0xd
apix`apic_timer_reprogram+0x19
unix`cbe_reprogram+0x25
genunix`cyclic_fire+0xd5
unix`cbe_fire+0x38
apix`apix_dispatch_pending_autovect+0x12c
apix`apix_dispatch_pending_hilevel+0x15
unix`switch_sp_and_call+0x13
apix`apix_do_interrupt+0x31b
unix`_interrupt+0xba
unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`idle+0x114
unix`thread_start+0x8
1
unix`_interrupt+0xba
unix`wrmsr+0xd
apix`apic_timer_reprogram+0x19
unix`cbe_reprogram+0x25
genunix`cyclic_reprogram_cyclic+0xf0
genunix`cyclic_reprogram+0xa4
genunix`callout_heap_delete+0x1e3
genunix`callout_realtime+0x1e
genunix`cyclic_softint+0xdc
unix`cbe_low_level+0x17
unix`av_dispatch_softvect+0x5f
apix`apix_dispatch_softint+0x34
unix`switch_sp_and_call+0x13
1
unix`_interrupt+0xba
unix`fakesoftint_return
unix`cbe_restore_level+0x17
genunix`cyclic_reprogram_cyclic+0xfd
genunix`cyclic_reprogram+0xa4
genunix`callout_heap_delete+0x1e3
genunix`callout_realtime+0x1e
genunix`cyclic_softint+0xdc
unix`cbe_low_level+0x17
unix`av_dispatch_softvect+0x5f
apix`apix_dispatch_softint+0x34
unix`switch_sp_and_call+0x13
1
unix`_interrupt+0xba
unix`page_lookup_nowait+0x9a
zfs`zfs_getpage+0x1c9
genunix`fop_getpage+0x9a
genunix`segvn_fault+0xae1
genunix`as_fault+0x5ee
unix`pagefault+0x99
unix`trap+0xe63
unix`0xfffffffffb8001d6
1
unix`_interrupt+0xba
apix`deadline_timer_reprogram+0x34
apix`apic_timer_reprogram+0x19
unix`cbe_reprogram+0x25
genunix`cyclic_fire+0xd5
unix`cbe_fire+0x5a
apix`apix_dispatch_pending_autovect+0x12c
apix`apix_dispatch_pending_hilevel+0x15
unix`switch_sp_and_call+0x13
apix`apix_do_interrupt+0x31b
unix`_interrupt+0xba
unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`idle+0x114
unix`thread_start+0x8
130
unix`_interrupt+0xba
unix`wrmsr+0xd
apix`apic_timer_reprogram+0x19
unix`cbe_reprogram+0x25
genunix`cyclic_fire+0xd5
unix`cbe_fire+0x5a
apix`apix_dispatch_by_vector+0x93
apix`apix_dispatch_hilevel+0x15
unix`switch_sp_and_call+0x13
apix`apix_do_interrupt+0x25e
unix`_interrupt+0xba
unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`idle+0x114
unix`thread_start+0x8
1002
unix`_interrupt+0xba
unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`idle+0x114
unix`thread_start+0x8
1038
unix`_interrupt+0xba
unix`wrmsr+0xd
apix`apic_timer_reprogram+0x19
unix`cbe_reprogram+0x25
genunix`cyclic_fire+0xd5
unix`cbe_fire+0x5a
apix`apix_dispatch_pending_autovect+0x12c
apix`apix_dispatch_pending_hilevel+0x15
unix`switch_sp_and_call+0x13
apix`apix_do_interrupt+0xdc
unix`_interrupt+0xba
unix`fakesoftint_return
unix`cbe_restore_level+0x17
genunix`cyclic_reprogram_cyclic+0xfd
genunix`cyclic_reprogram+0xa4
genunix`callout_heap_delete+0x1e3
genunix`callout_realtime+0x1e
genunix`cyclic_softint+0xdc
unix`cbe_low_level+0x17
unix`av_dispatch_softvect+0x5f
1738
unix`_interrupt+0xba
unix`wrmsr+0xd
apix`apic_timer_reprogram+0x19
unix`cbe_reprogram+0x25
genunix`cyclic_fire+0xd5
unix`cbe_fire+0x5a
apix`apix_dispatch_pending_autovect+0x12c
apix`apix_dispatch_pending_hilevel+0x15
unix`switch_sp_and_call+0x13
apix`apix_do_interrupt+0x31b
unix`_interrupt+0xba
unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`idle+0x114
unix`thread_start+0x8
3697185
Updated by Aubrey Li over 11 years ago
hmmm..., something is very interesting...
It looks like your system is suffering from an interrupt storm. There shouldn't be 3697185 interrupts in 10s on CPU0; that is roughly 370,000 per second on an idle CPU.
Can you please give the following command output?
$ echo hz/D | sudo mdb -k
$ echo apic_timer::print apic_timer_t | sudo mdb -k
Updated by Miha Krizaj over 11 years ago
This number grows as kernel usage does, but it is harder to see until it reaches some 10-15% kernel usage.
echo apic_timer::print apic_timer_t | sudo mdb -k
{
mode = 0x2
apic_timer_enable_ops = deadline_timer_enable
apic_timer_disable_ops = deadline_timer_disable
apic_timer_reprogram_ops = deadline_timer_reprogram
}
echo hz/D | sudo mdb -k
hz:
hz: 100
admin@deepspace:~#
Updated by Aubrey Li over 11 years ago
Okay, thanks for reporting back...
Looking at the hottest calling stack, something is interesting too.
(1) intr occurs
unix`_interrupt+0xba
(2) immediately after wrmsr?? I think this is not coincidental,
because there are 3697185 identical stacks in 10 seconds.
unix`wrmsr+0xd
apix`apic_timer_reprogram+0x19
unix`cbe_reprogram+0x25
genunix`cyclic_fire+0xd5
unix`cbe_fire+0x5a
apix`apix_dispatch_pending_autovect+0x12c
apix`apix_dispatch_pending_hilevel+0x15
unix`switch_sp_and_call+0x13
apix`apix_do_interrupt+0x31b
unix`_interrupt+0xba
(3) the processor is in idle.
unix`i86_mwait+0xd
unix`cpu_idle_mwait+0xf1
unix`idle+0x114
unix`thread_start+0x8
3697185
Here, from the stack, the scenario is,
- when the system is in idle
- a clock-related interrupt occurs; it goes through
cbe->cyclic->cbe->apic and finally calls apic_timer_reprogram to
program the apic timer register.
However, it looks like every time the register is programmed
(wrmsr called), an interrupt occurs.
So, no wonder the kernel CPU% is creeping.
Can you please run the following dtrace script? I think 1s is enough.
No need to wait for 10~15% kernel time; just run it once you see the same calling stack as you reported above.
==================================================
#!/usr/sbin/dtrace -s
deadline_timer_reprogram:entry
/cpu == 0/
{
t = (hrtime_t)arg0;
a = (long long)t;
printf(" target = %llx - now = %x\n", a, timestamp);
}
==================================================
Thanks,
-Aubrey
Updated by Miha Krizaj over 11 years ago
The output of the script is:
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a31031
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a313b4
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3172d
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a31aad
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a31e34
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a321b1
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a32530
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a328b2
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a32c2d
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a32fac
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3332b
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a336a9
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a33a2a
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a33da4
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a34124
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a344aa
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3482a
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a34ba4
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a34f24
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a352a9
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3562d
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a359a9
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a35d29
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a360a6
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a36425
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a367a1
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a36b2b
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a36ea8
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3722a
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a375a6
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a37923
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a37ca3
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a38021
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3839d
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3871a
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a38a9e
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a38e23
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a391a2
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a39521
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a398a2
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a39c25
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a39fa4
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3a327
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3a6ac
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3aa2b
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3adac
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3b129
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3b4d5
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3b858
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3bbdd
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3bf7a
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3c2f8
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3c695
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3ca15
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3cd96
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3d115
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3d49b
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3d81a
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3db9b
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3df19
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3e2a2
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3e650
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3e9db
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3ed59
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3f0e6
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3f464
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f6a3fb67
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73ac646
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73ad171
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73adb9f
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73ae56e
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73aef54
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73af95a
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b02f1
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b0c8e
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b166c
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b206d
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b2a61
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b3417
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b3d81
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b4748
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b511c
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b5ad2
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b64ec
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b6ec2
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b7882
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b826c
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b8c43
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73b966e
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73ba00b
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73ba9a7
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73bb39c
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73bbd8e
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73bc7c0
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73bd162
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73bdb2c
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73be539
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73bef0b
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73bf95c
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c02d8
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c0ca4
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c1661
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c2035
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c2a1a
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c3400
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c3dbd
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c47af
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c6dda
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c72e8
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c7794
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c7c5d
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c810d
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c85ef
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73c8ad1
0 30881 deadline_timer_reprogram:entry target = 521f7d52280 - now = 521f73c9506
Updated by Aubrey Li over 11 years ago
- Status changed from New to Feedback
okay, this looks quite weird: the interrupt occurs before the target is due.
But if you add the following line to the /etc/system
set apix:apic_timer_preferred_mode = 0x0
and reboot, Is the problem gone?
Thanks,
-Aubrey
Updated by Jon Strabala over 11 years ago
- File oi_151_released.console.png oi_151_released.console.png added
- File oi_151_released.101hours.txt oi_151_released.101hours.txt added
Aubrey,
======================================
After adding the following line to /etc/system and rebooting (I will not know for a few days whether this solves the CPU kernel creep / interrupt problem):
set apix:apic_timer_preferred_mode = 0x0
If the mode is set to 0x0 I see 0 events/sec. in deadline_timer_reprogram:entry - I am running that test now.
Below, with the mode at 0x2, I saw 1,257 events/sec. in deadline_timer_reprogram:entry at initial reboot.
=================================
The rest of this e-mail deals with a test I just finished where the mode is set to 0x2.
I just finished a 101-hour test and got a crash; my analysis is very similar to what you came up with, but without the 'deep' knowledge to put things in context.
The following test is without "set apix:apic_timer_preferred_mode = 0x0", i.e. apix:apic_timer_preferred_mode is set to 0x2 (the default).
date
Sun Sep 18 18:00:19 PDT 2011
uname -a
SunOS lab10 5.11 oi_151a i86pc i386 i86pc
rup lab10
lab10 up 4 days, 5:39, load average: 0.00, 0.00, 0.02
beadm list | egrep NR\|BE
BE Active Mountpoint Space Policy Created
oi_151a NR / 57.6G static 2011-09-14 09:01
/etc/system
set ip:ip_squeue_worker_wait=0
set snooping=1
set pcplusmp:apic_panic_on_nmi=1
/etc/power.conf
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable poll-mode
echo hz/D | sudo mdb -k
hz:
hz: 100
echo apic_timer::print apic_timer_t | sudo mdb -k
{
    mode = 0x2
    apic_timer_enable_ops = deadline_timer_enable
    apic_timer_disable_ops = deadline_timer_disable
    apic_timer_reprogram_ops = deadline_timer_reprogram
}
This time unix`acpi_cpu_cstate actually decreased (as opposed to increased) across 101.5 hours, until a kernel panic/dump attempt at the 101.5-hour mark.
I also see a correlation between the kernel CPU creep (or growth) and the following counters:
'psm_get_cpu_id' - at boot about 1313, at crash about 4636
'cpu_idle_exit' - at boot about 216, at crash about 6782
'apix_intr_enter' - at boot about 78, at crash about 7158
*** note the initial values were determined right after a reboot, as I wrote the logging scripts a few days into the test run ***
The system (which is completely idle) crashed on a dtrace script after the kernel CPU crept to about 1.6% for no known reason.
panic[cpu0]/thread ffffff001e8cbc40: deadman: timed out after 50 seconds of clock inactivity
*** skip stack dump, but see image "oi_151_released.console.png" for full details ***
*** this thread seems to be a 'highly' called thread whatever it is, see '6958602.d' below ***
syncing to file systems... done
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
2:23 100% done
Please see full console image attached as "oi_151_released.console.png"
The stack trace in the console image (attached) is similar to what I think is the major activity, which takes one of two possible forms: one with unix`wrmsr and one with apix`deadline_timer_reprogram. Refer to "oi_151_released.101hours.txt" (also attached) for greater detail and actual counts:
5 13355 cpu_idle_exit:entry
  apix`apix_do_interrupt+0x3e
  unix`_interrupt+0xba
  unix`wrmsr+0xd
  apix`apic_timer_reprogram+0x19
  unix`cbe_reprogram+0x25
  genunix`cyclic_fire+0xd5
  unix`cbe_fire+0x5a
  apix`apix_dispatch_pending_autovect+0x12c
  apix`apix_dispatch_pending_hilevel+0x15
  unix`switch_sp_and_call+0x13
  apix`apix_do_interrupt+0x31b
  unix`_interrupt+0xba
  genunix`fop_ioctl+0x7b
  genunix`ioctl+0x18e
  unix`sys_syscall+0x17a

5 13355 cpu_idle_exit:entry
  apix`apix_do_interrupt+0x3e
  unix`_interrupt+0xba
  apix`deadline_timer_reprogram+0x34
  apix`apic_timer_reprogram+0x19
  unix`cbe_reprogram+0x25
  genunix`cyclic_fire+0xd5
  unix`cbe_fire+0x5a
  apix`apix_dispatch_pending_autovect+0x12c
  apix`apix_dispatch_pending_hilevel+0x15
  unix`switch_sp_and_call+0x13
  apix`apix_do_interrupt+0x31b
  unix`_interrupt+0xba
  genunix`fop_ioctl+0x7b
  genunix`ioctl+0x18e
  unix`sys_syscall+0x17a
displaying counts of stack traces across 1ms for cpu_idle_exit:entry
cpu_idle_exit_1ms.d.2011-09-14_1240.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_1620.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2000.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2340.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0320.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0700.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1040.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1420.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1800.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_2140.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0120.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0500.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0840.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1220.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1600.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_2100.out: 421
cpu_idle_exit_1ms.d.2011-09-17_0230.out: 519
cpu_idle_exit_1ms.d.2011-09-17_0800.out: 544
cpu_idle_exit_1ms.d.2011-09-17_1330.out: 531
cpu_idle_exit_1ms.d.2011-09-17_1900.out: 651
cpu_idle_exit_1ms.d.2011-09-18_0030.out: 660
cpu_idle_exit_1ms.d.2011-09-18_0600.out: 738
cpu_idle_exit_1ms.d.2011-09-18_1130.out: 780
cpu_idle_exit_1ms.d.2011-09-18_1700.out: 797
cpu_idle_exit_1ms.d.2011-09-18_1730.out: 758
cpu_idle_exit_1ms.d.2011-09-18_1800.out: 797
*** crash ***
displaying counts of stack traces across 1ms for cpu_idle_exit:entry that contain cbe_fire
cpu_idle_exit_1ms.d.2011-09-14_1240.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_1620.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2000.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2340.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0320.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0700.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1040.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1420.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1800.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_2140.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0120.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0500.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0840.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1220.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1600.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_2100.out: 412
cpu_idle_exit_1ms.d.2011-09-17_0230.out: 504
cpu_idle_exit_1ms.d.2011-09-17_0800.out: 543
cpu_idle_exit_1ms.d.2011-09-17_1330.out: 522
cpu_idle_exit_1ms.d.2011-09-17_1900.out: 639
cpu_idle_exit_1ms.d.2011-09-18_0030.out: 653
cpu_idle_exit_1ms.d.2011-09-18_0600.out: 737
cpu_idle_exit_1ms.d.2011-09-18_1130.out: 772
cpu_idle_exit_1ms.d.2011-09-18_1700.out: 792
cpu_idle_exit_1ms.d.2011-09-18_1730.out: 757
cpu_idle_exit_1ms.d.2011-09-18_1800.out: 796
*** crash ***
displaying counts of stack traces across 1ms for cpu_idle_exit:entry that contain deadline_timer_reprogram - typically we see either deadline_timer_reprogram or wrmsr; this counts deadline_timer_reprogram
cpu_idle_exit_1ms.d.2011-09-14_1240.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_1620.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2000.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2340.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0320.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0700.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1040.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1420.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1800.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_2140.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0120.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0500.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0840.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1220.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1600.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_2100.out: 57
cpu_idle_exit_1ms.d.2011-09-17_0230.out: 62
cpu_idle_exit_1ms.d.2011-09-17_0800.out: 21
cpu_idle_exit_1ms.d.2011-09-17_1330.out: 61
cpu_idle_exit_1ms.d.2011-09-17_1900.out: 73
cpu_idle_exit_1ms.d.2011-09-18_0030.out: 100
cpu_idle_exit_1ms.d.2011-09-18_0600.out: 90
cpu_idle_exit_1ms.d.2011-09-18_1130.out: 84
cpu_idle_exit_1ms.d.2011-09-18_1700.out: 114
cpu_idle_exit_1ms.d.2011-09-18_1730.out: 92
cpu_idle_exit_1ms.d.2011-09-18_1800.out: 114
*** crash ***
displaying counts of stack traces across 1ms for cpu_idle_exit:entry that contain wrmsr - typically we see either deadline_timer_reprogram or wrmsr; this counts wrmsr
cpu_idle_exit_1ms.d.2011-09-14_1240.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_1620.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2000.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-14_2340.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0320.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_0700.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1040.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1420.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_1800.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-15_2140.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0120.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0500.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_0840.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1220.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_1600.out: <<not found script was not active yet, or system crashed>>
cpu_idle_exit_1ms.d.2011-09-16_2100.out: 355
cpu_idle_exit_1ms.d.2011-09-17_0230.out: 442
cpu_idle_exit_1ms.d.2011-09-17_0800.out: 522
cpu_idle_exit_1ms.d.2011-09-17_1330.out: 461
cpu_idle_exit_1ms.d.2011-09-17_1900.out: 566
cpu_idle_exit_1ms.d.2011-09-18_0030.out: 553
cpu_idle_exit_1ms.d.2011-09-18_0600.out: 647
cpu_idle_exit_1ms.d.2011-09-18_1130.out: 688
cpu_idle_exit_1ms.d.2011-09-18_1700.out: 678
cpu_idle_exit_1ms.d.2011-09-18_1730.out: 665
cpu_idle_exit_1ms.d.2011-09-18_1800.out: 682
*** crash ***
NOTE: this was run after the crash, on a reboot, with the same default mode = 0x2 for apix:apic_timer_preferred_mode as the test run discussed; this is 12,571 events in 10 seconds.
root@lab10:~/hotkernel_30/data# cat ../aubrey.deadline_timer_reprogram.d
BEGIN
{
    printf("looking at deadline_timer_reprogram:entry (requested by Aubrey Li)");
}
deadline_timer_reprogram:entry
{
    t = (hrtime_t)arg0;
    a = (long long)t;
    printf("target = %llx - now = %x\n", a, timestamp);
}
tick-10s
{
    exit(0);
}
root@lab10:~/hotkernel_30/data# ls -ltr aubrey.deadline_timer_reprogram.d.2011-09-19_1055.out
-rw-r--r-- 1 root root 1081308 2011-09-19 10:56 aubrey.deadline_timer_reprogram.d.2011-09-19_1055.out
root@lab10:~/hotkernel_30/data# head -10 aubrey.deadline_timer_reprogram.d.2011-09-19_1055.out
dtrace: script '/root/hotkernel_30/aubrey.deadline_timer_reprogram.d' matched 3 probes
CPU     ID                    FUNCTION:NAME
  4      1                           :BEGIN looking at deadline_timer_reprogram:entry (requested by Aubrey Li)
  0  29397 deadline_timer_reprogram:entry target = 13954f36180 - now = 13954f33ee2
  0  29397 deadline_timer_reprogram:entry target = 13954f36180 - now = 13954f3492b
  0  29397 deadline_timer_reprogram:entry target = 13954f36180 - now = 13954f34ffb
  0  29397 deadline_timer_reprogram:entry target = 13954f36180 - now = 13954f35696
root@lab10:~/hotkernel_30/data# grep deadline_timer_reprogram:entry aubrey.deadline_timer_reprogram.d.2011-09-19_1055.out | wc -l
12571
As indicated above, if I apply the /etc/system change and mode = 0x0 is set for apix:apic_timer_preferred_mode, the above dtrace script, aubrey.deadline_timer_reprogram.d, emits nothing, i.e. nothing is captured.
Updated by Aubrey Li over 11 years ago
Hi Jon,
It looks like your system is experiencing exactly the same issue as Miha's.
To check if this issue is fixed, you can just run the following command on CPU0; no need to run it on all CPUs.
dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10"
From Miha's report, there were 3697185 intrs in the dtrace output before. Now you should see only ~1000 intrs in the dtrace output if it's fixed.
The high interrupt count is what I have observed so far; it should be one cause of the CPU% leak.
I'm not sure whether the system has any other issues causing the CPU utilization leak.
So please continue to run your testing and let me know the result.
Thanks,
-Aubrey
Updated by Jon Strabala over 11 years ago
Aubrey,
My basic test suite takes about 4 days to be definitive, but after running for 8 hours with
root@lab10:~/hotkernel_30# echo apic_timer::print apic_timer_t | sudo mdb -k
{
    mode = 0
    apic_timer_enable_ops = oneshot_timer_enable
    apic_timer_disable_ops = oneshot_timer_disable
    apic_timer_reprogram_ops = oneshot_timer_reprogram
}
root@lab10:~/hotkernel_30#
dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10" gives 1093 in line with what Miha's is seeing - refer to the "dtrace_req2_1sec_5_sec_10sec.txt" attachement.
Also the dtrace script based on your prior posting
BEGIN { printf("looking at deadline_timer_reprogram:entry (requested by Aubrey Li)"); } deadline_timer_reprogram:entry { t = (hrtime_t)arg0; a = (long long)t; printf("target = %llx - now = %x\n", a, timestamp); } tick-10s { exit(0); }
The above is called "aubrey.deadline_timer_reprogram.d"; with the 0x0 setting it still returns nothing of interest (just the exit(0) event).
root@lab10:~/hotkernel_30# dtrace -s aubrey.deadline_timer_reprogram.d
dtrace: script 'aubrey.deadline_timer_reprogram.d' matched 3 probes
CPU     ID                    FUNCTION:NAME
  5      1                           :BEGIN looking at deadline_timer_reprogram:entry (requested by Aubrey Li)
  5  65176                        :tick-10s
root@lab10:~/hotkernel_30#
Contrast the "no output" count for "deadline_timer_reprogram:entry" as above this with 12,571 count of deadline_timer_reprogram:entry" with the 0x2 setting for " in my prior post.
At this point, when I run "top -q" and then in the curses UI "n 10" then "s 10", I still see mostly 99.9% and sometimes 100% idle. Note this is not very different from my last test with the 0x2 setting, as it is too early to get much discrimination at this point (the CPU kernel leak may or may not be fixed).
last pid: 4373; load avg: 0.00, 0.00, 0.00; up 0+08:24:15 19:58:17
53 processes: 52 sleeping, 1 on cpu
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
Kernel: 308 ctxsw, 589 intr, 22 syscall
Memory: 16G phys mem, 15G free mem, 8187M total swap, 8187M free swap
PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
4372 root 1 59 -20 3840K 1984K cpu/5 0:00 0.00% top
4290 admin 1 59 0 7748K 5176K sleep 0:00 0.00% sshd
318 root 7 59 0 4404K 3196K sleep 0:00 0.00% devfsadm
10 root 13 59 0 15M 13M sleep 0:01 0.00% svc.startd
892 root 4 59 0 6792K 3852K sleep 0:00 0.00% inetd
12 root 24 59 0 15M 14M sleep 0:03 0.00% svc.configd
370 root 29 59 0 7144K 5012K sleep 0:00 0.00% nscd
904 root 1 59 0 2708K 1676K sleep 0:00 0.00% in.routed
914 root 25 59 0 19M 12M sleep 0:00 0.00% fmd
1055 root 1 59 0 6620K 5516K sleep 0:01 0.00% intrd
As my basic test suite takes about 4 days to be definitive, I will keep you posted with updates in a day or two.
Updated by Aubrey Li over 11 years ago
Glad to see the interrupt storm is gone. Let's wait and see if the CPU kernel leak is fixed.
Updated by Miha Krizaj over 11 years ago
Hi,
I also applied set apix:apic_timer_preferred_mode = 0x0 and, as far as I can see, kernel usage is still low after 14 hours of uptime: somewhere between 0 and 0.7%. I will wait a few more days to be really sure, but it looks like this fixed it :)
This was really very helpful, thank you!
Why does it actually happen? Is this a defect of some sort in hardware or software, and if so, how is it not more widespread?
Thank you again!
Updated by Jon Strabala over 11 years ago
Aubrey,
As Miha indicated, I too am interested in the root cause, and more importantly in whether switching from 'deadline_timer_*' to 'oneshot_*' impacts any performance, e.g. web servers, networking, disk, etc.
Right now, after 20 hours of testing on my system lab10, I cannot detect a CPU kernel leak. I will post again tomorrow.
Thanks
Updated by Aubrey Li over 11 years ago
Please provide the output of the following commands
1. #psrinfo -vp
2. #prtdiag
3. #ucodeadm -v
Thanks,
-Aubrey
Updated by Miha Krizaj over 11 years ago
Hi, here are the outputs of the commands. By the way, the kernel CPU usage is still very low, so this for sure fixed it! :) Thank you!
psrinfo -vp
The physical processor has 4 cores and 8 virtual processors (0-7)
The core has 2 virtual processors (0 4)
The core has 2 virtual processors (1 5)
The core has 2 virtual processors (2 6)
The core has 2 virtual processors (3 7)
x86 (GenuineIntel 206A7 family 6 model 42 step 7 clock 3292 MHz)
Intel(r) Xeon(r) CPU E31240 @ 3.30GHz
prtdiag
System Configuration: Supermicro X9SCI/X9SCA
BIOS Configuration: American Megatrends Inc. 1.0c 07/21/2011
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)
==== Processor Sockets ====================================
Version Location Tag
-------------------------------- --------------------------
Intel(R) Xeon(R) CPU E31240 @ 3.30GHz CPU
==== Memory Device Sockets ================================
Type Status Set Device Locator Bank Locator
----------- ------ --- ------------------- ----------------
Unknown in use 0 DIMM_1A BANK0
Unknown in use 0 DIMM_2A BANK1
Unknown in use 0 DIMM_1B BANK2
Unknown in use 0 DIMM_2B BANK3
==== On-Board Devices =====================================
To Be Filled By O.E.M.
==== Upgradeable Slots ====================================
ID Status Type Description
--- --------- ---------------- ----------------------------
0 in use PCI SLOT 1
1 in use PCI SLOT 2
2 in use PCI SLOT 3
3 in use Unknown SLOT 4
4 in use Unknown SLOT 5
5 in use Unknown SLOT 6
ucodeadm -v
CPU Microcode Version
0 0x18
1 0x18
2 0x18
3 0x18
4 0x18
5 0x18
6 0x18
7 0x18
Updated by Rich Lowe over 11 years ago
Hey Aubrey.
Notice how deadline_timer_reprogram is not actually doing a delta calculation, but comparing the time arg directly with <= 0? seems suspicious (especially since hrtime_t's are unsigned and absolute?)
Though I could be wildly off-base.
Updated by Jon Strabala over 11 years ago
Aubrey,
I concur with Miha: the issue is no longer happening (up 45 hours, no kernel CPU leak). Here are the outputs of the commands you asked for from my system (lab10), plus 'top -q -n 10', 'rup lab10', 'uname -a', 'cat /etc/release', and 'dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()} END { trunc(@a, 3); }" -c "sleep 10"'.
lab10# psrinfo -vp
The physical processor has 4 cores and 8 virtual processors (0-7)
  The core has 2 virtual processors (0 4)
  The core has 2 virtual processors (1 5)
  The core has 2 virtual processors (2 6)
  The core has 2 virtual processors (3 7)
    x86 (GenuineIntel 206A7 family 6 model 42 step 7 clock 3392 MHz)
    Intel(r) Xeon(r) CPU E31270 @ 3.40GHz
lab10# prtdiag
System Configuration: Supermicro X9SCI/X9SCA
BIOS Configuration: American Megatrends Inc. 1.10 08/19/2011
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)
==== Processor Sockets ====================================
Version Location Tag
-------------------------------- --------------------------
Intel(R) Xeon(R) CPU E31270 @ 3.40GHz CPU
==== Memory Device Sockets ================================
Type Status Set Device Locator Bank Locator
----------- ------ --- ------------------- ----------------
Unknown in use 0 DIMM_1A BANK0
Unknown in use 0 DIMM_2A BANK1
Unknown in use 0 DIMM_1B BANK2
Unknown in use 0 DIMM_2B BANK3
==== On-Board Devices =====================================
To Be Filled By O.E.M.
==== Upgradeable Slots ====================================
ID Status Type Description
--- --------- ---------------- ----------------------------
0 in use PCI SLOT 1
1 in use PCI SLOT 2
2 in use PCI SLOT 3
3 in use Unknown SLOT 4
4 in use Unknown SLOT 5
5 in use Unknown SLOT 6
lab10# ucodeadm -v
CPU Microcode Version
0 0x18
1 0x18
2 0x18
3 0x18
4 0x18
5 0x18
6 0x18
7 0x18
lab10# top -q -n 10
last pid: 15899; load avg: 0.00, 0.00, 0.00; up 1+21:18:24 08:52:26
45 processes: 44 sleeping, 1 on cpu
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
Kernel: 301 ctxsw, 11 trap, 589 intr, 364 syscall, 11 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap
PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
15898 root 1 59 -20 3672K 1700K cpu/3 0:00 0.02% top
1055 root 1 59 0 6716K 5612K sleep 0:08 0.00% intrd
12 root 24 59 0 15M 14M sleep 0:05 0.00% svc.configd
1081 admin 1 59 0 7796K 5208K sleep 0:04 0.00% sshd
759 root 4 59 0 3100K 1976K sleep 0:04 0.00% automountd
370 root 29 59 0 7712K 5252K sleep 0:02 0.00% nscd
10 root 13 59 0 15M 13M sleep 0:01 0.00% svc.startd
914 root 25 59 0 19M 12M sleep 0:01 0.00% fmd
361 root 4 59 0 6852K 5548K sleep 0:01 0.00% hald
318 root 7 59 0 4404K 3196K sleep 0:01 0.00% devfsadm
lab10# rup lab10
lab10 up 1 day, 21:18, load average: 0.00, 0.00, 0.00
lab10# uname -a
SunOS lab10 5.11 oi_151a i86pc i386 i86pc Solaris
lab10# cat /etc/release
OpenIndiana Development oi_151a X86
Copyright 2010 Oracle and/or its affiliates. All rights reserved.
Use is subject to license terms.
Assembled 01 September 2011
lab10# dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()} END { trunc(@a, 3); }" -c "sleep 10"
dtrace: description 'apix_do_interrupt:entry' matched 2 probes
dtrace: pid 22827 has exited
CPU     ID                    FUNCTION:NAME
  3      2                             :END

              unix`_interrupt+0xba
              genunix`clock+0x1b8
              genunix`cyclic_softint+0xdc
              unix`cbe_softclock+0x1a
              unix`av_dispatch_softvect+0x5f
              apix`apix_dispatch_softint+0x34
              unix`switch_sp_and_call+0x13
                8

              unix`_interrupt+0xba
              genunix`clock+0x1bc
              genunix`cyclic_softint+0xdc
              unix`cbe_softclock+0x1a
              unix`av_dispatch_softvect+0x5f
              apix`apix_dispatch_softint+0x34
              unix`switch_sp_and_call+0x13
               17

              unix`_interrupt+0xba
              unix`acpi_cpu_cstate+0x2ae
              unix`cpu_acpi_idle+0x82
              unix`cpu_idle_adaptive+0x19
              unix`idle+0x114
              unix`thread_start+0x8
             1067
The above count '1067' for 'apix_do_interrupt:entry' has remained stable for the last day.
And finally, hourly 'top' samples across about 46 hours:
CPU states: 99.9% idle, 0.1% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 93.5% idle, 5.8% user, 0.8% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.8% idle, 0.0% user, 0.2% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.5% idle, 0.0% user, 0.5% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.1% user, 0.0% kernel, 0.0% iowait, 0.0% swap
Updated by Jon Strabala over 11 years ago
Rich,
Rich Lowe wrote:
Hey Aubrey.
Notice how deadline_timer_reprogram is not actually doing a delta calculation, but comparing the time arg directly with <= 0? seems suspicious (especially since hrtime_t's are unsigned and absolute?)
Though I could be wildly off-base.
Do you mean /usr/src/uts/i86pc/io/pcplusmp/apic_timer.c , specifically
/* deadline timer reprogram */
static void
deadline_timer_reprogram(hrtime_t time)
{
    uint64_t ticks;

    if (time <= 0) {
        /*
         * generate an immediate interrupt
         */
        ticks = (uint64_t)tsc_read();
    } else {
        ticks = unscalehrtime(time);
    }
    wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
}
Perhaps the line "if (time <= 0) {" above should be replaced with the seven lines from the function oneshot_timer_reprogram ?
hrtime_t now;
int64_t delta;
uint_t ticks;

now = gethrtime();
delta = time - now;
if (delta <= 0) {
As I believe the 'oneshot_timer_reprogram' call works correctly, and both methods have the same prototype.
But I also notice that 'oneshot_timer_reprogram' has a lot of safety checks, unlike deadline_timer_reprogram; see below for the full function:
static void
oneshot_timer_reprogram(hrtime_t time)
{
    hrtime_t now;
    int64_t delta;
    uint_t ticks;

    now = gethrtime();
    delta = time - now;
    if (delta <= 0) {
        /*
         * requested to generate an interrupt in the past
         * generate an interrupt as soon as possible
         */
        ticks = apic_min_timer_ticks;
    } else if (delta > apic_nsec_max) {
        /*
         * requested to generate an interrupt at a time
         * further than what we are capable of. Set to max
         * the hardware can handle
         */
        ticks = APIC_MAXVAL;
#ifdef DEBUG
        cmn_err(CE_CONT, "apic_timer_reprogram, request at"
            " %lld too far in future, current time"
            " %lld \n", time, now);
#endif
    } else {
        ticks = APIC_NSECS_TO_TICKS(delta);
    }

    if (ticks < apic_min_timer_ticks)
        ticks = apic_min_timer_ticks;

    apic_reg_ops->apic_write(APIC_INIT_COUNT, ticks);
}
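For concreteness, here is a minimal sketch of what the suggested change might look like: deadline_timer_reprogram with the same delta calculation oneshot_timer_reprogram does, so "deadline already passed" is decided against the current time rather than by comparing the absolute time argument with 0. This is my own untested illustration reusing the kernel symbols from the two functions quoted above; it is not a reviewed patch.

/*
 * Sketch only (untested): fold oneshot-style delta handling into
 * deadline_timer_reprogram(). "time" is an absolute hrtime deadline.
 */
static void
deadline_timer_reprogram(hrtime_t time)
{
    hrtime_t now;
    int64_t delta;
    uint64_t ticks;

    now = gethrtime();
    delta = time - now;
    if (delta <= 0) {
        /*
         * requested deadline is now or in the past:
         * generate an immediate interrupt
         */
        ticks = (uint64_t)tsc_read();
    } else {
        /* convert the absolute hrtime deadline into TSC ticks */
        ticks = unscalehrtime(time);
    }
    wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
}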
- Jon
Updated by Aubrey Li over 11 years ago
The TSC-deadline mode allows software to use the local APIC timer to signal an interrupt at an absolute time.
Writing any value (>0) to the TSC-deadline MSR arms the timer.
Writing 0 to the MSR disarms the local-APIC timer.
An interrupt is generated when the logical processor's TSC >= the target
value in the MSR.
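To make those rules concrete, here is a tiny C model of the described MSR behavior (a simulation sketch only; not kernel or firmware code, and the function names are made up for illustration):

#include <stdint.h>

/* simulated IA32 TSC-deadline MSR */
static uint64_t tsc_deadline_msr;

/* writing any nonzero value arms the timer; writing 0 disarms it */
static void
write_tsc_deadline(uint64_t target_tsc)
{
    tsc_deadline_msr = target_tsc;
}

/* the timer should fire only when armed and TSC >= target */
static int
deadline_timer_fires(uint64_t tsc_now)
{
    return (tsc_deadline_msr != 0 && tsc_now >= tsc_deadline_msr);
}

The misbehavior shown below is exactly a violation of that last predicate: interrupts arriving while the TSC is still below the programmed target.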
Looking at the target value and TSC from Miha's report
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3f0e6
0 30881 deadline_timer_reprogram:entry target = 521f6a3f580 - now = 521f6a3f464
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f6a3fb67
0 30881 deadline_timer_reprogram:entry target = 521f73c8c00 - now = 521f73ac646
Interrupt occurs while TSC < target. <=== misbehavior.
This issue may exist in Solaris only. Other OSes don't have this issue because they don't use deadline TSC.
Deadline TSC has its merits: it eliminates the drift introduced when the OS/VMM programs a timer tick. In Solaris, if we look at the timer infrastructure, we can see that the cyclic and cbe layers use an absolute target (as they do on SPARC), so replacing one-shot with deadline on x86 also reduces overhead by eliminating the time delta calculation.
For now, the best solution is to report this issue to your machine vendor.
Thanks,
-Aubrey
Updated by Miha Krizaj over 11 years ago
So this is actually a BIOS problem exposed by Solaris specifics?
Does the line which currently solves this cause any other side effects?
Thank you for the explanation. I will try to contact Supermicro support to see if this can somehow be fixed in the future.
Updated by Jon Strabala over 11 years ago
I have notified Supermicro, specifically for Case #SM1108302601
Updated by Jon Strabala over 11 years ago
Supermicro's short reply for Case #SM1108302601 is as follows:
"Our current BIOS also do not support this TSC-deadline"
I am trying to get them to escalate it to AMI the BIOS provider.
Updated by Josef Sipek over 11 years ago
I have a similar system; after a boot, it had ~42% kernel usage on core 0 according to mpstat. After about 10 hours, it had ~46%. Previously, I had run it for over 5 days and saw ~69% (I rebooted it to see if the load would disappear).
# prtdiag
System Configuration: Supermicro X9SCL/X9SCM
BIOS Configuration: American Megatrends Inc. 4.6.4 04/27/2011
BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)
==== Processor Sockets ====================================
Version Location Tag
-------------------------------- --------------------------
Intel(R) Xeon(R) CPU E31230 @ 3.20GHz CPU
==== Memory Device Sockets ================================
Type Status Set Device Locator Bank Locator
----------- ------ --- ------------------- ----------------
Unknown in use 0 DIMM_1A BANK0
Unknown in use 0 DIMM_2A BANK0
Unknown in use 0 DIMM_1B BANK0
Unknown in use 0 DIMM_2B BANK0
==== On-Board Devices =====================================
To Be Filled By O.E.M.
==== Upgradeable Slots ====================================
ID Status Type Description
--- --------- ---------------- ----------------------------
0 in use PCI Express SLOT 4
1 in use PCI Express SLOT 5
2 in use PCI Express SLOT 6
3 in use PCI Express SLOT 7
Rich Lowe pointed me to this bug. I added the following line to my /etc/system:
set apix:apic_timer_preferred_mode = 0x0
and my apic_timer went from deadline_timer to oneshot_timer. I see 0% kernel load now.
Updated by Jon Strabala over 11 years ago
Aubrey,
I could do the recompile and test, but I would have to compile with gcc (I am not sure if I need to build the whole OS because I cannot get the patched Sun compilers; no big deal, it only takes 18 minutes).
If you supply a link to the corrected file I will download it and give it a try.
Also, should I use the latest dev tree or the current base oi_151a?
Updated by Jon Strabala over 11 years ago
Aubrey,
Then I made a new BE 'nightly-2011-10-07-aubrey' and booted into it.
The Illumos Project SunOS 5.11 testws Oct. 07, 2011
SunOS Internal Development: admin 2011-Oct-07 [illumos-gate]
# uname -a
SunOS lab10 5.11 testws i86pc i386 i86pc Solaris
# beadm list
BE Active Mountpoint Space Policy Created
nightly-2011-10-07-aubrey NR / 54.7G static 2011-10-07 15:46
oi_151a - - 16.2M static 2011-09-14 09:01
I commented out the following in /etc/system and did a reconfiguration reboot:
* set apix:apic_timer_preferred_mode = 0x0
Other (after reboot)
# echo hz/D | sudo mdb -k
hz:
hz: 100
# echo apic_timer::print apic_timer_t | sudo mdb -k
{
    mode = 0x2
    apic_timer_enable_ops = deadline_timer_enable
    apic_timer_disable_ops = deadline_timer_disable
    apic_timer_reprogram_ops = deadline_timer_reprogram
}
# cat /etc/power.conf
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable poll-mode
Note: I cannot run dtrace scripts as I get errors (a side effect of the gcc build). I tried running 'mcs -d -n .SUNW_ctf modname' on various modules but had no luck getting dtrace working.
I will run this setup for a few days and see what happens with respect to the CPU kernel creep, but since no dtrace is possible I cannot really sample anything.
Updated by Aubrey Li over 11 years ago
Jon,
Thanks a lot for verification.
What errors did dtrace report? You can crudely comment out the error line in the reported dtrace script files.
Updated by Jon Strabala over 11 years ago
My dtrace issues have to do with the way I build the oi_151a nightly with gcc. The below is most likely overkill, but it seems to be the only way I can get dtrace to work on any of my 'home-built' illumos kernels. This is what works for me, BUT I think I am doing way too much and don't really understand the reasons behind needing to do this; maybe it is because the closed bits are compiled with a Sun compiler.
# list errors, this is why I grep out '.conf'
find /kernel -type f -name "*" | grep -v '.conf' | xargs mcs -d -n .SUNW_ctf
# apply mcs -d -n .SUNW_ctf on the module
find /platform/i86pc/kernel -type f -name "*" | grep -v '.conf' | xargs mcs -d -n .SUNW_ctf
# update the boot archive
bootadm update-archive
# reboot
sync
init 6
Prior to the above application of commands to get dtrace to work, I ran things over the weekend for 60+ hours on the most recent illumos nightly with a patched apix module (and compared it to an oi_151 release). Without 'dtrace' this is a very seat-of-the-pants test using only the Unix 'top' command.
Looking at the patched nightly build above, run for about 60+ hours without "set apix:apic_timer_preferred_mode = 0x0" in /etc/system:
root@lab10:/home/admin/top# grep 'CPU s' * | sort +1 | grep 100 | wc -l
25
root@lab10:/home/admin/top# grep 'CPU s' * | sort +1 | grep 99.9 | wc -l
81
root@lab10:/home/admin/top# grep 'CPU s' * | sort +1 | grep 99.8 | wc -l
11
root@lab10:/home/admin/top# grep 'CPU s' * | sort +1 | grep 99.7 | wc -l
0
root@lab10:/home/admin/top# grep 'CPU s' * | sort +1 | grep 99.6 | wc -l
5
root@lab10:/home/admin/top# grep 'CPU s' * | sort +1 | grep 99.5 | wc -l
3
root@lab10:/home/admin/top# grep 'CPU s' * | sort +1 | grep 99.4 | wc -l
3
Looking at (some old logs for) non-patched oi_151 from the latest ISO across 66 hours (just counts), with "set apix:apic_timer_preferred_mode = 0x0" in /etc/system:
# cat top_10* | grep 'CPU s' | grep 100 | wc -l
51
# cat top_10* | grep 'CPU s' | grep 99.9 | wc -l
75
# cat top_10* | grep 'CPU s' | grep 99.8 | wc -l
11
# cat top_10* | grep 'CPU s' | grep 99.7 | wc -l
0
# cat top_10* | grep 'CPU s' | grep 99.6 | wc -l
3
# cat top_10* | grep 'CPU s' | grep 99.5 | wc -l
0
# cat top_10* | grep 'CPU s' | grep 99.4 | wc -l
0
From the above (just counts; you can visualize the histogram) I base my "feeling" that "set apix:apic_timer_preferred_mode = 0x0" makes the system appear slightly "more idle" than the patch on a totally idle system. Yes, this is VERY un-scientific, and I have no idea what happens under load between the two runs.
Odd issue with applying /cpu==0/ in my dtrace commands
After the above application of commands to get dtrace to work, one funny thing I noticed: it is almost as if the nightly build has a slightly different dtrace syntax, or my application of the 'mcs' commands removed the cpu symbol? This is after applying the commands above (at top) and a reboot:
# dtrace -n "apix_do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10" dtrace: invalid probe specifier apix_do_interrupt:entry /cpu==0/ { @a[stack()]=count()}: in predicate: failed to resolve cpu: Unknown variable name # dtrace -n "do_interrupt:entry/cpu==0/ { @a[stack()]=count()}" -c "sleep 10" dtrace: invalid probe specifier do_interrupt:entry /cpu==0/ { @a[stack()]=count()}: in predicate: failed to resolve cpu: Unknown variable name
I am confounded: why is cpu an unknown variable name?
Note: if I drop /cpu==0/, which used to work (the breakage might be from my gcc compiles and/or the 'mcs' commands used to get dtrace to work), and rerun:
# dtrace -n "apix_do_interrupt:entry { @a[stack()]=count()}" -c "sleep 10" | tail -50 7 unix`_interrupt+0x1e9 unix`mutex_delay_default+0x7 unix`mutex_vector_enter+0xb9 zfs`vdev_queue_io_done+0x49 zfs`zio_vdev_io_done+0xe8 zfs`zio_execute+0xd7 genunix`taskq_thread+0x32f unix`thread_start+0x8 9 unix`_interrupt+0x1e9 unix`mutex_delay_default+0x7 unix`lock_set_spin+0x7b genunix`disp_lock_enter_high+0x11 unix`setbackdq+0x12b genunix`cv_unsleep+0x98 genunix`setrun_locked+0x73 genunix`setrun+0x20 genunix`cv_wakeup+0x36 genunix`callout_list_expire+0x91 genunix`callout_expire+0x68 genunix`callout_realtime+0x28 genunix`cyclic_softint+0x3bb unix`cbe_low_level+0x14 unix`av_dispatch_softvect+0x8d apix`apix_dispatch_softint+0x35 unix`switch_sp_and_call+0x13 10 unix`_interrupt+0x1e9 unix`i86_mwait+0xd unix`cpu_idle_mwait+0x125 unix`cpu_acpi_idle+0x8c unix`cpu_idle_adaptive+0x13 unix`idle+0x89 unix`thread_start+0x8 11 unix`_interrupt+0x1e9 23 unix`_interrupt+0x1e9 unix`acpi_cpu_cstate+0x11e unix`cpu_acpi_idle+0xcb unix`cpu_idle_adaptive+0x13 unix`idle+0x89 unix`thread_start+0x8 4418 #dtrace -n "do_interrupt:entry { @a[stack()]=count()}" -c "sleep 10" dtrace: description 'do_interrupt:entry ' matched 1 probe dtrace: pid 101749 has exited
# cat kern_100ms_ent_ret.d
BEGIN
{
        printf("\n100ms sample of all kernel entry and returns\n");
        start = timestamp;
}

fbt:*::entry  { @e[probefunc]=count() }
fbt:*::return { @r[probefunc]=count() }

END
{
        printf("\n%20s %10s %10s\n", "PROGRAM", "entries", "returns");
        printa("%20s %10@d %10@d\n", @e, @r);
}

tick-100ms { exit(0); }

# dtrace -s kern_100ms_ent_ret.d | head -10
dtrace: script 'kern_100ms_ent_ret.d' matched 62736 probes
CPU     ID                    FUNCTION:NAME
  3      1                           :BEGIN
100ms sample of all kernel entry and returns

  3  65519                      :tick-100ms
  3      2                             :END
             PROGRAM    entries    returns
              strlen          0          2
               kcopy          0        133
         anon_create          1          1

# dtrace -s kern_100ms_ent_ret.d | tail -35
dtrace: script 'kern_100ms_ent_ret.d' matched 62736 probes
        page_vnode_mutex         67         67
         cyclic_coverage         70         70
    cyclic_coverage_hash         70         70
               gethrtime         79         79
            rw_lock_held         82         82
           tsc_gethrtime         83         83
            as_segcompar        120        120
               set_errno        248        248
          audit_getstate        249        249
          clear_stale_fd        249        249
        lwp_getdatamodel        249        249
            post_syscall        249        249
          disp_lock_exit        255        255
                 cpu_get        256        256
             cpu_get_all        256        256
                p_online        256        256
       p_online_internal        256        256
p_online_internal_locked        256        256
             thread_lock        257        257
                prunstop        261        261
                 do_splx        273        273
                    splr        273        273
      interrupts_enabled        274        274
          sys_rtt_common        279        279
              xc_extract        446        446
    apix_get_pending_spl        546        546
             apix_setspl        546        546
      get_local_apic_pri        546        546
          psm_get_cpu_id        546        546
             mutex_owned        956        956
          syscall_mstate       1036       1036
      gethrtime_unscaled       1192       1192
   tsc_gethrtimeunscaled       1192       1192
If there is anything else you want tested or any 'dtrace' script you might like me to try, let me know.
Updated by Jon Strabala over 11 years ago
It looks like a long test shows no issue with applying the patch to apic_timer.c to fix BUG #1333.
Running 11 days 20 hours, things look very good. This has been running nightly-2011-10-11 with both danmcd's new Joyent-contributed acpica bits (these bits have some relation to the 1333 bug; I just wanted to test them) and the patch to apic_timer.c to solve the 1333 bug.
So we can either add the patch to apic_timer.c, or alternatively use the workaround described above by adding the line "set apix:apic_timer_preferred_mode = 0x0" to /etc/system. The 11+ day test here did not use the /etc/system workaround.
- Jon
root@lab10:~# echo hz/D | sudo mdb -k
hz:
hz:             100
root@lab10:~# echo apic_timer::print apic_timer_t | sudo mdb -k
{
    mode = 0x2
    apic_timer_enable_ops = deadline_timer_enable
    apic_timer_disable_ops = deadline_timer_disable
    apic_timer_reprogram_ops = deadline_timer_reprogram
}
root@lab10:~# cat /etc/power.conf
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable poll-mode
root@lab10:~#
Looking at some of my hourly 'top' measurements (dtrace doesn't work on this test build), I see the following:
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | wc -l
568
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 100 | wc -l
78
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.9 | wc -l
382
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.8 | wc -l
59
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.7 | wc -l
0
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.6 | wc -l
18
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.5 | wc -l
15
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.4 | wc -l
9
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.2 | wc -l
0
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.1 | wc -l
1
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 99.0 | wc -l
1
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | grep 89. | wc -l
0
root@lab10:/home/admin/top# grep 'CPU states' d.log2 | tail -10
CPU states: 99.5% idle, 0.1% user, 0.4% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states:  100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states:  100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states:  100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
root@lab10:/home/admin/top#
And a recent 100% idle as follows:
======================================
October 24, 2011 09:34:27 AM PDT
top -q -d 2 -s 10 -n 10 -S

last pid: 104899;  load avg:  0.00,  0.00,  0.00;  up 11+17:59:54    09:34:28
48 processes: 47 sleeping, 1 on cpu
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
Kernel: 308 ctxsw, 12 trap, 621 intr, 366 syscall, 11 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
104898 root        1  59  -20 3364K 1668K cpu/6    0:00  0.03% top
     3 root        1  60  -20    0K    0K sleep   10:55  0.00% fsflush
101106 root        1  59    0 6064K 2072K sleep    0:10  0.00% sendmail
101017 root        1  59    0 6632K 5512K sleep    0:53  0.00% intrd
100754 root        4  59    0 3060K 1992K sleep    0:39  0.00% automountd
100921 root       26  59    0   32M   25M sleep    0:19  0.00% fmd
100363 root       29  59    0 7768K 5232K sleep    0:16  0.00% nscd
     6 root      150  99  -20    0K    0K sleep    0:15  0.00% zpool-rpool
100005 root       23  59    0   15M   14M sleep    0:15  0.00% svc.configd
100352 root        4  59    0 6812K 5488K sleep    0:12  0.00% hald

last pid: 104899;  load avg:  0.00,  0.00,  0.00;  up 11+18:00:04    09:34:38
48 processes: 47 sleeping, 1 on cpu
CPU states: 99.9% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
Kernel: 381 ctxsw, 4 trap, 764 intr, 27 syscall, 4 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
     6 root      150  99  -20    0K    0K sleep    0:15  0.03% zpool-rpool
     3 root        1  60  -20    0K    0K sleep   10:55  0.02% fsflush
104898 root        1  59  -20 3516K 1836K cpu/6    0:00  0.00% top
100352 root        4  59    0 6812K 5488K sleep    0:12  0.00% hald
100418 root        1  59    0 3492K 2120K sleep    0:12  0.00% hald-addon-acpi
     4 root        9  60  -20    0K    0K sleep    0:01  0.00% kcfpoold
101106 root        1  59    0 6064K 2072K sleep    0:10  0.00% sendmail
100279 root        7  59    0 4420K 3164K sleep    0:09  0.00% devfsadm
100363 root       29  59    0 7768K 5232K sleep    0:16  0.00% nscd
Updated by Rich Lowe over 11 years ago
- Status changed from Feedback to In Progress
- Assignee set to Rich Lowe
- Tags deleted (needs-triage)
Updated by Rich Lowe over 11 years ago
- Assignee changed from Rich Lowe to Garrett D'Amore
Updated by Garrett D'Amore over 11 years ago
- % Done changed from 0 to 90
Here is my attempt at fixing this, based on conversation with Rich Lowe and my own analysis of the code. I need help testing it. As you can see, there were several problems in the old code, including sign extension problems and confusion about the starting baseline of the tick counter coming in. (Specifically, the value written by WRMSR is not on the same timebase as the value gethrtime() returns, but the nsec value coming in is on the same starting base as gethrtime().) My code makes this function work using gethrtime() as the base properly, calculating the delta properly, and avoiding the sign extension problem -- I hope.
diff -r 417c34452f03 usr/src/uts/i86pc/io/pcplusmp/apic_timer.c
--- a/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c	Fri Nov 11 14:07:54 2011 -0800
+++ b/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c	Fri Nov 11 14:19:14 2011 -0800
@@ -25,6 +25,9 @@
  * Copyright (c) 2010, Intel Corporation.
  * All rights reserved.
  */
+/*
+ * Copyright 2011 Nexenta Systems, Inc. All rights reserved.
+ */
 
 #include <sys/time.h>
 #include <sys/psm.h>
@@ -302,17 +305,15 @@
 static void
 deadline_timer_reprogram(hrtime_t time)
 {
+	int64_t delta;
 	uint64_t ticks;
 
-	if (time <= 0) {
-		/*
-		 * generate an immediate interrupt
-		 */
-		ticks = (uint64_t)tsc_read();
-	} else {
-		ticks = unscalehrtime(time);
+	delta = time - gethrtime();
+	if (delta <= 0) {
+		/* generate an immediate interrupt */
+		delta = 0;
 	}
-
+	ticks = (uint64_t)tsc_read() + delta;
 	wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
 }
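To restate the bug this hunk addresses, here is a sketch only, reusing the helper names from the patch (gethrtime, tsc_read, unscalehrtime, NANOSEC, MILLISEC); nothing here is new API:

/* Illustration of the timebase mismatch the patch fixes. */
hrtime_t deadline = gethrtime() + 10 * (NANOSEC / MILLISEC);	/* 10ms out */

/*
 * Old code: unscales an absolute gethrtime()-based value straight into
 * TSC "ticks".  But the TSC's zero point is not gethrtime()'s zero
 * point, so the programmed deadline lands in the wrong base.
 */
uint64_t wrong = unscalehrtime(deadline);

/*
 * Fixed idea: unscale only the relative delta and add it to the current
 * TSC reading, which is independent of either base's zero point.
 */
uint64_t right = (uint64_t)tsc_read() + unscalehrtime(deadline - gethrtime());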
Updated by Rich Lowe over 11 years ago
That's not the patch in the state it was when I reviewed it; you need to be unscaling the time delta. (That might be a no-op, but it's not always a no-op; see unscalehrtimef.)
On real hardware made in the last 20 years, it should never be a no-op, actually.
You definitely need to unscale the delta to get it into TSC units.
Updated by Garrett D'Amore over 11 years ago
Good catch. Here's the revised version:
diff -r 417c34452f03 usr/src/uts/i86pc/io/pcplusmp/apic_timer.c
--- a/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c	Fri Nov 11 14:07:54 2011 -0800
+++ b/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c	Fri Nov 11 19:55:03 2011 -0800
@@ -25,6 +25,9 @@
  * Copyright (c) 2010, Intel Corporation.
  * All rights reserved.
  */
+/*
+ * Copyright 2011 Nexenta Systems, Inc. All rights reserved.
+ */
 
 #include <sys/time.h>
 #include <sys/psm.h>
@@ -302,17 +305,15 @@
 static void
 deadline_timer_reprogram(hrtime_t time)
 {
+	int64_t delta;
 	uint64_t ticks;
 
-	if (time <= 0) {
-		/*
-		 * generate an immediate interrupt
-		 */
-		ticks = (uint64_t)tsc_read();
-	} else {
-		ticks = unscalehrtime(time);
+	delta = time - gethrtime();
+	if (delta <= 0) {
+		/* generate an immediate interrupt */
+		delta = 0;
 	}
-
+	ticks = (uint64_t)tsc_read() + unscalehrtime(delta);
 	wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
 }
Updated by Josef Sipek over 11 years ago
I tried the patch (with unscalehrtime), but the moment I tried to run dtrace the whole system locked up hard.
Updated by Josef Sipek over 11 years ago
I rebooted with -ks, hoping that I could F1-A the system when it locked up. Of course, no matter what I did the system stayed functional. I ran a simple dtrace one-liner to see how long the timer was supposed to be set for:
# dtrace -n '::deadline_timer_reprogram:entry{@=llquantize(arg0-timestamp,10,0,8,10);} tick-10sec{printa(@);}'
...
  2  67972                      :tick-10sec

           value  ------------- Distribution ------------- count
           10000 |                                         0
           20000 |                                         4
           30000 |                                         0
           40000 |                                         0
           50000 |                                         6
           60000 |                                         0
           70000 |                                         0
           80000 |                                         0
           90000 |                                         0
          100000 |                                         11
          200000 |                                         11
          300000 |                                         0
          400000 |                                         9
          500000 |                                         0
          600000 |                                         22
          700000 |                                         5
          800000 |                                         18
          900000 |                                         0
         1000000 |                                         131
         2000000 |@                                        167
         3000000 |                                         103
         4000000 |                                         85
         5000000 |                                         98
         6000000 |                                         126
         7000000 |                                         164
         8000000 |                                         104
         9000000 |@@@@@@@@@@@@@@@@@                        5689
        10000000 |@@@@@@@                                  2232
        20000000 |                                         79
        30000000 |                                         156
        40000000 |@                                        393
        50000000 |                                         104
        60000000 |                                         49
        70000000 |                                         96
        80000000 |                                         50
        90000000 |                                         96
       100000000 |@@                                       626
       200000000 |@@                                       653
       300000000 |@                                        336
       400000000 |@                                        333
       500000000 |@                                        321
       600000000 |@                                        331
       700000000 |@                                        246
       800000000 |@                                        168
       900000000 |@                                        195
   >= 1000000000 |                                         0
Not bad, 100Hz...with some shorter durations as well.
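A quick sanity check on that reading (hedged arithmetic, not from the thread): the hz value reported earlier via mdb is 100, and a 100Hz clock fires every 10ms, which is exactly the 9-10ms bucket dominating the distribution above. A standalone sketch:

#include <stdio.h>

int
main(void)
{
	int hz = 100;				/* system clock rate, per mdb: hz/D */
	long interval_ns = 1000000000L / hz;	/* 10,000,000 ns = 10 ms */

	printf("%ld ns between clock ticks\n", interval_ns);
	return (0);
}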
I rebooted, -k this time. I logged in to X and ran the same dtrace one-liner; it spits out one @ showing one negative delta. I ^C it, and about a second later the system locks up. F1-A & $<systemdump do nothing.
I saw these lockups before the patch as well. So, at the very least, I haven't noticed any regressions. ;)
Updated by Aubrey Li over 11 years ago
Garrett D'Amore wrote:
Good catch. Here's the revised version:
[...]
Two comments on this patch:
1. Since we are using tsc_read() here, we should make sure that the thread can't migrate while calculating the TSC tick and reprogramming the MSR (see the sketch after this list).
2. From the latest SDM V3a (link below): in xAPIC mode (in which the local-APIC registers are memory-mapped), software must serialize between the memory-mapped write to the LVT entry and the WRMSR to IA32_TSC_DEADLINE.
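For what it's worth, a minimal sketch of the migration guard comment 1 asks for, assuming the same surroundings as apic_timer.c; the function name is hypothetical, and the discussion below concludes the guard is unnecessary because the caller already runs at high PIL:

/*
 * Sketch only, not the committed fix: if this path could ever run in a
 * preemptible context, pinning the thread makes sure tsc_read() and
 * wrmsr() hit the same CPU.  kpreempt_disable()/kpreempt_enable() are
 * the usual illumos primitives for this.
 */
static void
deadline_timer_reprogram_pinned(hrtime_t time)
{
	int64_t delta;
	uint64_t ticks;

	kpreempt_disable();		/* no migration from here on */
	delta = time - gethrtime();
	if (delta < 0)
		delta = 0;		/* deadline already past: fire ASAP */
	ticks = (uint64_t)tsc_read() + unscalehrtime(delta);
	wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
	kpreempt_enable();
}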
Updated by Garrett D'Amore over 11 years ago
Thanks Aubrey. I also noticed by reading the manual that negative values are handled properly, so we don't need that if branch either.
I've got another version to try, which uses the algorithm suggested in the SDM to serialize the WRMSR value. I believe this should work. Note that we need not worry about thread migration, as I believe this routine is called in a context that excludes migration. Again, I need help testing this code.
diff -r 417c34452f03 usr/src/uts/i86pc/io/pcplusmp/apic_timer.c
--- a/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c	Fri Nov 11 14:07:54 2011 -0800
+++ b/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c	Sat Nov 12 09:27:47 2011 -0800
@@ -25,6 +25,9 @@
  * Copyright (c) 2010, Intel Corporation.
  * All rights reserved.
  */
+/*
+ * Copyright 2011 Nexenta Systems, Inc. All rights reserved.
+ */
 
 #include <sys/time.h>
 #include <sys/psm.h>
@@ -302,17 +305,33 @@
 static void
 deadline_timer_reprogram(hrtime_t time)
 {
+	int64_t delta;
 	uint64_t ticks;
 
-	if (time <= 0) {
-		/*
-		 * generate an immediate interrupt
-		 */
-		ticks = (uint64_t)tsc_read();
-	} else {
-		ticks = unscalehrtime(time);
-	}
+	/*
+	 * According to Intel SDM v3a, negative time is ok.
+	 * Note that this entire routine is called with
+	 * CBE_HIGH_PIL, so we needn't worry about preemption.
+	 */
+	delta = time - gethrtime();
 
+	/*
+	 * Now we have to serialize this per the SDM.  The algorithm
+	 * recommended by Intel is this:
+	 *
+	 * a) write a big value to the deadline register
+	 * b) read the register back
+	 * c) if it reads zero, go back to a and try again
+	 * d) write the new value
+	 */
+	do {
+		/* write a really big value */
+		ticks = (uint64_t)tsc_read() + (1u << 31);
+		wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
+		ticks = rdmsr(IA32_DEADLINE_TSC_MSR);
+	} while (ticks == 0);
+	/* Now we shouldn't be interrupted, we can set the deadline */
+	ticks = (uint64_t)tsc_read() + unscalehrtime(delta);
 	wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
 }
Updated by Aubrey Li over 11 years ago
Garrett D'Amore wrote:
Thanks Aubrey. I also noticed by reading the manual that negative values are handled properly, so we don't need that if branch either.
Yeah, the negative value can be handled properly by the deadline timer, but not by unscalehrtime(): there is a macro in that routine which rudely converts the signed delta to an unsigned variable.
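A standalone illustration of the wrap Aubrey describes (plain user-land C, nothing kernel-specific): casting a negative 64-bit delta to an unsigned type yields an enormous tick count rather than an immediate deadline:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int64_t delta = -5000;			/* deadline 5000ns in the past */
	uint64_t ticks = (uint64_t)delta;	/* wraps to 2^64 - 5000 */

	/* prints 18446744073709546616 -- a deadline eons away, not "now" */
	printf("%llu\n", (unsigned long long)ticks);
	return (0);
}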
I've got another version to try, which uses the algorithm suggested in the SDM to serialize the WRMSR value.
The patch proposed is not the right way per the latest SDM. Writes to the IA32_TSC_DEADLINE MSR are not serialized; therefore, system software should not use WRMSR to the IA32_TSC_DEADLINE MSR as a serializing instruction. Read and write accesses to the IA32_TSC_DEADLINE and other MSR registers will occur in program order. As the SDM indicates, the only scenario that needs a serializing write is xAPIC mode (in which the local-APIC registers are memory-mapped): software must serialize between the memory-mapped write to the LVT entry and the WRMSR to IA32_TSC_DEADLINE. In x2APIC mode, no serialization is required between the two writes (by WRMSR) to the LVT and IA32_TSC_DEADLINE MSRs.
I believe this should work. Note that we need not worry about thread migration, as I believe this routine is called in a context that excludes migration. Again, I need help testing this code.
If this routine is called in interrupt context, in which IPL = 14, then we don't need to worry about thread migration.
However, I don't think that's always true: cbe_reprogram() will be called whenever the root is changed, and there are multiple paths, as I found out with dtrace; for example:
1.
genunix`cyclic_reprogram_cyclic+0xdd
genunix`cyclic_reprogram+0xa4
genunix`callout_heap_insert+0x6b
genunix`timeout_generic+0x32b
genunix`cv_timedwait_hires+0xad
genunix`cv_reltimedwait+0x4f
stmf`stmf_svc+0x41a
genunix`taskq_thread+0x22e
unix`thread_start+0x8
231
2.
genunix`cyclic_reprogram_cyclic+0xdd
genunix`cyclic_reprogram+0xa4
genunix`callout_heap_insert+0x6b
genunix`timeout_generic+0x32b
genunix`lwp_timer_enqueue+0x90
genunix`lwp_cond_wait+0x5be
unix`_sys_sysenter_post_swapgs+0x149
448
So, I think we still need to avoid migration here.
Thanks,
-Aubrey
Updated by Garrett D'Amore over 11 years ago
cyclic_reprogram disables kernel preemption and runs with the PIL set high. I think we're ok here. Reading the code, I think all the code paths we care about go through cyclic_reprogram -- so no thread migration should occur.
I need to spend some more time with the serialization logic. I had assumed that the LVT was already written by the time we got here, and that I didn't need to serialize the deadline timer MSR to that, because it would have effectively already been done. Apart from that one step, I was trying to follow the recommendation of the SDM. I'll have another look.
Updated by Aubrey Li over 11 years ago
Garrett D'Amore wrote:
cyclic_reprogram disables kernel preemption and runs with the PIL set high. I think we're ok here. Reading the code, I think all the code paths we care about go through cyclic_reprogram -- so no thread migration should occur.
You're right: irrespective of the context, cbe_set_level() is called with CY_HIGH_LEVEL, so we won't be preempted. A level-15 interrupt can still pin this thread, but that's not a problem.
Updated by Garrett D'Amore over 11 years ago
Ok, re-reading my code and the spec, I've got another try. I'm sorry I don't have h/w at the moment that I can test this on:
--- a/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c	Fri Nov 11 14:07:54 2011 -0800
+++ b/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c	Mon Nov 14 21:04:06 2011 -0800
@@ -25,6 +25,9 @@
  * Copyright (c) 2010, Intel Corporation.
  * All rights reserved.
  */
+/*
+ * Copyright 2011 Nexenta Systems, Inc. All rights reserved.
+ */
 
 #include <sys/time.h>
 #include <sys/psm.h>
@@ -302,17 +305,40 @@
 static void
 deadline_timer_reprogram(hrtime_t time)
 {
+	int64_t delta;
 	uint64_t ticks;
 
-	if (time <= 0) {
-		/*
-		 * generate an immediate interrupt
-		 */
-		ticks = (uint64_t)tsc_read();
-	} else {
-		ticks = unscalehrtime(time);
-	}
+	/*
+	 * According to Intel SDM v3a, negative time is ok.
+	 * Note that this entire routine is called with
+	 * CBE_HIGH_PIL, so we needn't worry about preemption.
+	 */
+	delta = time - gethrtime();
+
+	/* The unscalehrtime wants unsigned values. */
+	delta = max(delta, 0);
+
+	/*
+	 * Now we have to serialize this per the SDM.  The algorithm
+	 * recommended by Intel is this:
+	 *
+	 * a) write the LVT entry to specify AV_DEADLINE mode
+	 * b) write a big value to the deadline register
+	 * c) read the register back
+	 * d) if it reads zero, go back to b and try again
+	 * e) write the new value
+	 */
+	apic_reg_ops->apic_write(APIC_LOCAL_TIMER,
+	    (apic_clkvect + APIC_BASE_VECT) | AV_DEADLINE);
+	do {
+		/* write a really big value */
+		ticks = (uint64_t)tsc_read() + (1u << 31);
+		wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
+		ticks = rdmsr(IA32_DEADLINE_TSC_MSR);
+	} while (ticks == 0);
+	/* Now we shouldn't be interrupted, we can set the deadline */
+	ticks = (uint64_t)tsc_read() + unscalehrtime(delta);
 	wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
 }
Updated by Aubrey Li over 11 years ago
Garrett D'Amore wrote:
Ok, re-reading my code and the spec, I've got another try. I'm sorry I don't have h/w at the moment that I can test this on:
[...]
Recall that writes to the IA32_TSC_DEADLINE MSR are not serialized, and that read and write accesses to the IA32_TSC_DEADLINE and other MSR registers occur in program order.
Would it be better to put the do {} while into deadline_timer_enable(), after the LVT write? deadline_timer_disable() should be okay since we are disabling the timer and the value in the MSR can be ignored.
Updated by Jon Strabala over 11 years ago
Using my X9SCA-F running oi_151a, I applied Garrett's patch from [https://www.illumos.org/issues/1333#note-56] to a new nightly and made a new BE two hours ago. It seems to work about the same as the one Rich Lowe supplied me via e-mail back on Oct 28, 2011 (via a build he made).
First check worked:
root@lab10:~# cat /etc/power.conf
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable poll-mode
Second check worked (this should invoke the code in the new patch):
root@lab10:~# cat /etc/power.conf
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable
I will post again if I notice any issues in the next few days (like kernel CPU creep). I cannot run dtrace, and thus tools like 'powertop', with my GCC nightly builds. If you care, my system config and an initial 'top' and 'prstat' are in the attachment.
Updated by Garrett D'Amore over 11 years ago
Aubrey Li wrote:
Garrett D'Amore wrote:
Ok, re-reading my code and the spec, I've got another try. I'm sorry I don't have h/w at the moment that I can test this on:
[...]
Recall that writes to the IA32_TSC_DEADLINE MSR are not serialized, and that read and write accesses to the IA32_TSC_DEADLINE and other MSR registers occur in program order.
If I am understanding properly, the problem is that the timer could have expired just as we were trying to write a value. We want to make sure that we aren't racing against an existing timer, IIUC. I think that is the basis of Intel's recommendations.
Would it be better to put the do {} while into deadline_timer_enable(), after the LVT write? deadline_timer_disable() should be okay since we are disabling the timer and the value in the MSR can be ignored.
I don't think so... because the enable logic doesn't have the new timer value that we get at reprogram().
Updated by Garrett D'Amore over 11 years ago
Ok, I think I finally understood what was meant about "serialized" -- the problem is that issue order != execution order. Got it. So here's another try; hopefully this one is a bit cleaner (sorry for being so dense here):
--- a/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c	Fri Nov 11 14:07:54 2011 -0800
+++ b/usr/src/uts/i86pc/io/pcplusmp/apic_timer.c	Tue Nov 15 13:11:11 2011 -0800
@@ -25,6 +25,9 @@
  * Copyright (c) 2010, Intel Corporation.
  * All rights reserved.
  */
+/*
+ * Copyright 2011 Nexenta Systems, Inc. All rights reserved.
+ */
 
 #include <sys/time.h>
 #include <sys/psm.h>
@@ -286,8 +289,29 @@
 static void
 deadline_timer_enable(void)
 {
+	uint64_t ticks;
+
 	apic_reg_ops->apic_write(APIC_LOCAL_TIMER,
 	    (apic_clkvect + APIC_BASE_VECT) | AV_DEADLINE);
+	/*
+	 * Now we have to serialize this per the SDM.  That is to
+	 * say, the above enabling can race in the pipeline with
+	 * changes to the MSR.  We need to make sure the above
+	 * operation is complete before we proceed to reprogram
+	 * the deadline value in reprogram().  The algorithm
+	 * recommended by the Intel SDM 3A in 10.5.1.4 is:
+	 *
+	 * a) write a big value to the deadline register
+	 * b) read the register back
+	 * c) if it reads zero, go back to a and try again
+	 */
+
+	do {
+		/* write a really big value */
+		ticks = (uint64_t)tsc_read() + APIC_MAXVAL;
+		wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
+		ticks = rdmsr(IA32_DEADLINE_TSC_MSR);
+	} while (ticks == 0);
 }
 
 /* deadline timer disable */
@@ -302,17 +326,20 @@
 static void
 deadline_timer_reprogram(hrtime_t time)
 {
+	int64_t delta;
 	uint64_t ticks;
 
-	if (time <= 0) {
-		/*
-		 * generate an immediate interrupt
-		 */
-		ticks = (uint64_t)tsc_read();
-	} else {
-		ticks = unscalehrtime(time);
-	}
+	/*
+	 * Note that this entire routine is called with
+	 * CBE_HIGH_PIL, so we needn't worry about preemption.
+	 */
+	delta = time - gethrtime();
 
+	/* The unscalehrtime wants unsigned values. */
+	delta = max(delta, 0);
+
+	/* Now we shouldn't be interrupted, we can set the deadline */
+	ticks = (uint64_t)tsc_read() + unscalehrtime(delta);
 	wrmsr(IA32_DEADLINE_TSC_MSR, ticks);
 }
Again I need help testing.
Updated by Aubrey Li over 11 years ago
Garrett D'Amore wrote:
Ok, I think I finally understood what was meant about "serialized" -- the problem is that issue order != execution order. Got it. So here's another try; hopefully this one is a bit cleaner (sorry for being so dense here):
I still have two comments, :)
1. I believe APIC_MAXVAL is not big enough. For example, if the CPU is 2.2GHz, then 2,200,000,000 TSC ticks equal 1s; APIC_MAXVAL = 0xFFFFFFFF = 0t4,294,967,295, so APIC_MAXVAL / 2,200,000,000 < 2s -- that is, we are scheduling a timer too early (see the sketch after this list).
2. To be picky, should we check the variable x2apic_enable before doing the serializing write? The SDM says that in x2APIC mode no serialization is required between the two writes (by WRMSR) to the LVT and IA32_TSC_DEADLINE MSRs.
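A quick standalone check of the arithmetic in comment 1 (assumptions: a 2.2GHz invariant TSC, as in Aubrey's example, and APIC_MAXVAL = 0xFFFFFFFF as in the patch):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t apic_maxval = 0xFFFFFFFFULL;	/* 32 bits of TSC ticks */
	double tsc_hz = 2.2e9;			/* example 2.2GHz TSC */

	/*
	 * Prints ~1.95s: an idle CPU can easily stay quiet longer than
	 * that, so the "really big" placeholder deadline could fire
	 * before the real deadline gets written.
	 */
	printf("%.2f s\n", apic_maxval / tsc_hz);
	return (0);
}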
Updated by Garrett D'Amore over 11 years ago
I figured 0xffffffff would be big enough (anything over about 100 msec ought to be), but I can just write some arbitrarily large value. How about 1u << 63? Is this universally a safe value here? I'd be happy to just pick a number so large that I don't even have to look at the initial tsc value.
Wrt checking x2apic_enable, I think this may be pedantic. From what I can tell, the serialization here will only be called once during boot generally, and the extra logic should be harmless and yet always correct. Unless there is some reason that the WRMSR/RDMSR would behave very differently in x2APIC mode (I've not seen any indication that it would.)
Updated by Aubrey Li over 11 years ago
Garrett D'Amore wrote:
I figured 0xffffffff would be big enough (anything over about 100 msec ought to be), but I can just write some arbitrarily large value. How about 1u << 63? Is this universally a safe value here? I'd be happy to just pick a number so large that I don't even have to look at the initial tsc value.
1u << 63 should work.
Wrt checking x2apic_enable, I think this may be pedantic. From what I can tell, the serialization here will only be called once during boot generally, and the extra logic should be harmless and yet always correct. Unless there is some reason that the WRMSR/RDMSR would behave very differently in x2APIC mode (I've not seen any indication that it would.)
Yeah, I agree it's pedantic; your call whether to add the checking or not. ;)
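One caveat worth recording on the literal discussed above: in C, 1u is only as wide as unsigned int (32 bits), so 1u << 63 is undefined behavior; a 64-bit literal is needed to get the intended value. A minimal sketch:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/*
	 * 1u is a 32-bit unsigned int, so (1u << 63) would be undefined
	 * behavior; the 64-bit literal below yields the intended huge
	 * deadline value.
	 */
	uint64_t big = 1ULL << 63;	/* 0x8000000000000000 */

	printf("%llx\n", (unsigned long long)big);
	return (0);
}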
Updated by Jon Strabala over 11 years ago
- File 20111118_tcs_ga_after_boot.txt 20111118_tcs_ga_after_boot.txt added
- File 20111118_tcs_ga_after_boot_some_dtrace_output.txt 20111118_tcs_ga_after_boot_some_dtrace_output.txt added
FYI, I was sort of "brain dead": I had to apply the patch with the '-l' option (to loosely match whitespace) since I had to cut-and-paste it from this thread:
patch -l apic_timer.c < patchfile_bug-1333_note-60.txt
Once again, using my X9SCA-F running oi_151a, I applied Garrett's revised patch (above), specifically [https://www.illumos.org/issues/1333#note-60], to a new nightly BE. It still seems to work about the same as the one Rich Lowe supplied me via e-mail back on Oct 28, 2011 (via a build he made).
First check worked again:
root@lab10:~# cat /etc/power.conf
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable poll-mode
Second check worked again (this should invoke the code in the new patch):
root@lab10:~# cat /etc/power.conf
cpu_deep_idle enable
cpu-threshold 10s
cpupm enable
Once again, I will post if I notice any issues in the next few days (like kernel CPU creep). Wow -- it seems that with this new nightly I can now run dtrace (Woo Yeah!), and thus tools like 'powertop', with my GCC nightly builds. If you care, my system config and an initial 'top' and 'prstat' are in the first attachment, and various dtrace outputs are in the second attachment. (I plan to run this new BE today and over the weekend, about 3 days.)
Updated by Jon Strabala over 11 years ago
- File 20111118_tcs_ga_after_69hours.txt 20111118_tcs_ga_after_69hours.txt added
- File 20111118_tcs_ga_after_69hours_fsflush_oddly_high_cpu.txt 20111118_tcs_ga_after_69hours_fsflush_oddly_high_cpu.txt added
- File 20111118_tcs_ga_after_69hours_dtrace_oddly_high_zfs_activity.txt 20111118_tcs_ga_after_69hours_dtrace_oddly_high_zfs_activity.txt added
After about 69 hours the system seems to be running fine WRT issue #1333 with Garrett's revised patch (above), specifically [https://www.illumos.org/issues/1333#note-60] - see the first attachment.
On an unrelated note, I seem to see a lot of activity in 'fsflush' and/or 'zpool-rpool' on a system that should be completely idle (this isn't related to the issue at hand; see the second and third attachments). Oddly, in 60 seconds I am seeing a high number of calls on a totally idle system: 'open' 38048, 'forksys' 421456, and 'umount2' 19804321.
Updated by Rich Lowe over 11 years ago
If the fsflush stuff is separate, please open a separate bug.
Updated by Jon Strabala over 11 years ago
After about 5 days the system is still running fine WRT issue #1333 with Garrett's revised patch (above), specifically [https://www.illumos.org/issues/1333#note-60].
last pid: 102619;  load avg:  0.00,  0.00,  0.00;  up 5+01:39:48    11:51:50
51 processes: 50 sleeping, 1 on cpu
CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
Kernel: 311 ctxsw, 2 trap, 561 intr, 157 syscall, 2 flt
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
     3 root        1  60  -20    0K    0K sleep    5:11  0.10% fsflush
102618 root        1  59    0 3536K 1908K cpu/2    0:00  0.03% top
101072 admin       1  59    0 7756K 5136K sleep    0:00  0.01% sshd
101337 root        6  59    0   73M   46M sleep    0:03  0.00% fbserver
101077 root        1  59    0 6000K 2036K sleep    0:04  0.00% sendmail
100861 root        1  59    0 2668K 1608K sleep    0:04  0.00% in.routed
100269 root        6  59    0 4396K 3104K sleep    0:03  0.00% devfsadm
100645 root        4  59    0 3060K 2012K sleep    0:32  0.00% automountd
100859 root        1  59    0 6480K 5364K sleep    0:18  0.00% intrd
100005 root       22  59    0   15M   13M sleep    0:07  0.00% svc.configd
As to the 'fsflush' issue (unrelated to this issue), I have opened bug #1810 as Rich Lowe suggested.
Updated by Garrett D'Amore over 11 years ago
- Status changed from In Progress to Resolved
- % Done changed from 90 to 100
changeset: 13533:6c1a51927cbd
tag: tip
user: Garrett D'Amore <garrett@nexenta.com>
date: Mon Nov 28 20:07:13 2011 -0800
description:
1333 High kernel cpu usage & dtrace hang on idle system
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Aubrey Li <aubreylee@gmail.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Gordon Ross <gwr@nexenta.com>