Bug #1810
System call bursts in 'umount2' on an idle system up to 4 million/minute
Status: open
Description
While looking at some odd CPU use issues in 'fsflush' and 'zpool-rpool' I found that occasionally I have large bursts of system calls such as 'umount2' on a fully idle system.
I have three BE's as follows:
lab10# beadm list
BE                 Active Mountpoint Space Policy Created
oi_151a            -      -          33.5M static 2011-09-14 09:01
tsc-fix            -      -          826M  static 2011-10-28 15:08
nightly-2011-11-18 NR     /          64.3G static 2011-11-18 08:18
I run a script called 'cpu.d' which samples for 60 seconds and I do this 6 times.
BE oi_151a (a 148b upgrade to 151a) had 2 out of 6 cpu.d runs showing 'umount2' system calls > 1.8M; refer to FILE chk_fflush.20111123.oi_151a.out:

  umount2    1863075
  umount2    1866598

BE tsc-fix (built by Rich Lowe) had 2 out of 6 cpu.d runs showing 'umount2' system calls > 2.6M; refer to FILE chk_fflush.20111122.deadline-tsc-0-gf0ee1a4.out:

  umount2    2693302
  umount2    2870902

BE nightly-2011-11-18 had 3 out of 6 cpu.d runs showing 'umount2' system calls > 3.7M; refer to FILE chk_fflush.20111122.tsc-ga-20111118.out:

  umount2    3946642
  umount2    3764694
  umount2    3759205
Given the definition of 'umount2' below, from its man page, I cannot for the life of me figure out why an idle server would make millions of requests per minute (as high as 3,946,642 in 60 seconds, or 65,777/sec) to unmount a previously mounted file system contained on a block special device or directory.
The umount() function requests that a previously mounted file system contained on a block special device or directory be unmounted. The file argument is a pointer to the absolute pathname of the file system to be unmounted. After unmounting the file system, the directory upon which the file system was mounted reverts to its ordinary interpretation. The umount2() function is identical to umount(), with the additional capability of unmounting file systems even if there are open files active. The mflag argument must contain one of the following values:
The fact that this "idle" system occasionally executes the system call 'umount2' at a rate of 65,777 times a second (maybe higher, as this is averaged over a minute) seems rather high.
The script "cpu.d" script - note I run it six times (for every BE) and sometimes I don't see 'umount2' at all but at least 1/3 of the time I do. Sometimes I see millions so the issue shows itself in an occasional burst (sometimes almost hitting 4 million system calls / minute).
#!/usr/sbin/dtrace -s

syscall:::entry
{
        self->vstart = vtimestamp;
}

syscall:::return
/self->vstart/
{
        this->oncpu = vtimestamp - self->vstart;
        @[probefunc] = avg(this->oncpu);
        self->vstart = 0;
}

tick-60s
{
        exit(0);
}
Here is one example of 'cpu.d' output; refer to the attached files for an extensive amount of system info and other dtrace runs.
lab10# ./cpu.d
dtrace: script './cpu.d' matched 465 probes
CPU     ID                    FUNCTION:NAME
  1  65595                        :tick-60s

  p_online               82
  getpid                152
  sysconfig             168
  sigaction             363
  gtime                 398
  lwp_sigmask           453
  fcntl                 553
  pset                  587
  clock_gettime         742
  uucopy                829
  lwp_continue          960
  alarm                1186
  ioctl                1226
  c2audit              1326
  lwp_kill             1423
  setcontext           1615
  brk                  2169
  getmsg               2267
  mmap                 3140
  stat64               3308
  doorfs               3474
  read                 5213
  pread                5999
  close                7294
  write                9020
  statvfs64            9351
  lwp_park            10086
  schedctl            10100
  stat                13764
  writev              14845
  pollsys             17994
  open                25440
  lwp_create          27340
  so_socket           37327
  nanosleep           57335
  putmsg             297994
  umount2           3946642
The attached file 'scripts.chk_fflush.tar' contains several of the dtrace scripts that I used to make the three attached output files *.out:
iotop diskhits iopending hotspot.d iofile.d iofileb.d seeksize.d bitesize.d fsflush.d fspaging.d psio10 iosnoop hotkernel_60 cpu.d zfssnoop.d ziosnoop.d zfsstack.d
Updated by Jon Strabala about 11 years ago
After five days of uptime on BE nightly-2011-11-18, I re-ran the same dtrace scripts again as I did on 11/23/2011.
BE nightly-2011-11-18 had 3 out of 6 cpu.d runs (60 second samples) showing 'umount2' system calls > 3.5M; refer to FILE chk_fflush.20111128.tsc-ga-20111118.out:

  umount2    4587575
  umount2    4607936
  umount2    4583946

Also, BE nightly-2011-11-18 had 2 out of 6 cpu.d runs (60 second samples) with a very high forksys count > 400K:

  forksys    408747
  forksys    471988
This "idle" system is still executing the system call 'umount2' at a rate up to 76,798 times a second (maybe even higher as this is averaged over a minute).
Updated by Rich Lowe about 11 years ago
add execname to the aggregation too, and we'll see who's doing it
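A minimal sketch of that one-line change to cpu.d's aggregation key (my illustration, not from the ticket):

        @[probefunc, execname] = avg(this->oncpu);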
Updated by Jon Strabala about 11 years ago
Rich,
I see two potential items. The first is 'automountd' - this is the program calling 'umount2'.
The second is 'sendmail' - this is the program calling 'forksys'. I had some issues turning off 'sendmail'. Yes, in an IRC chat I indicated that I did a "svcadm disable sendmail", but I kept getting about 600 messages [time period is fuzzy] in /var/log/syslog on oi_151a like:
Oct 3 23:53:40 lab10 sendmail[2179]: [ID 801593 mail.info] p92M8das001995: to=postmaster, delay=1+08:45:01, xdelay=00:00:00, mailer=relay, pri=11858122, relay=[127.0.0.1], dsn=4.0.0, stat=Deferred: Connection refused by [127.0.0.1]
Also, mailq indicated "/var/spool/mqueue was empty". It seems like /usr/lib/update-manager/update-refresh.sh via cron produces some mail, and it also seems like some messages are generated when I build OI/illumos. This was triggering the syslog messages. Thus I partially turned 'sendmail' back on.
This is a lab box; I really don't want any 'sendmail' at all. Perhaps my inability to turn sendmail completely off (and my stubbornness in not fully configuring sendmail) is the culprit for both the 'forksys' and the 'umount2' issues?
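One way to confirm the 'forksys' attribution (an untested sketch of mine, not one of the attached scripts; it assumes the standard proc provider probes) would be to count forks and execs per process for a minute:

#!/usr/sbin/dtrace -s

/* count fork()s by the forking process */
proc:::create
{
        @forks[execname] = count();
}

/* count successful exec()s by the full argument string of the new image */
proc:::exec-success
{
        @execs[curpsinfo->pr_psargs] = count();
}

tick-60s
{
        exit(0);
}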
Still, 15,831,553 calls (yes, 15M) from 'automountd' to 'umount2' in 60 seconds for any reason seems insane - is my dtrace script (below) bad?
#!/usr/sbin/dtrace -s

syscall:::entry
{
        self->vstart = vtimestamp;
}

syscall:::return
/self->vstart/
{
        this->oncpu = vtimestamp - self->vstart;
        @[probefunc,execname] = avg(this->oncpu);
        self->vstart = 0;
}

tick-60s
{
        exit(0);
}
And two outputs: one a 60 second run catching 'umount2' and the other a 60 second run catching 'forksys'.
# -----------------------------------------------------------------------------
# script output/run for ./cpu_en.d
# ----------------------------
lab10# ./cpu_en.d
dtrace: script './cpu_en.d' matched 465 probes
CPU     ID                    FUNCTION:NAME
  0  73792                        :tick-60s

  p_online          dtrace                  76
  getpid            nscd                   155
  sysconfig         dtrace                 168
  fcntl             nscd                   245
  getpid            svc.configd            250
  gtime             sendmail               266
  gtime             utmpd                  279
  gtime             nscd                   345
  sigaction         dtrace                 360
  gtime             devfsadm               364
  lwp_sigmask       sendmail               394
  lwp_sigmask       nscd                   435
  gtime             intrd                  436
  gtime             automountd             533
  clock_gettime     hald-addon-acpi        547
  uucopy            nscd                   644
  fstat64           init                   679
  setcontext        nscd                   685
  lwp_continue      nscd                   702
  llseek            init                   706
  fcntl             svc.configd            718
  pset              sendmail               755
  ioctl             utmpd                  825
  ioctl             dtrace                 937
  llseek            utmpd                  990
  ioctl             init                  1033
  close             init                  1137
  llseek            nscd                  1173
  fstat64           utmpd                 1189
  ioctl             nscd                  1325
  c2audit           svc.configd           1339
  lwp_kill          nscd                  1430
  close             utmpd                 1468
  fstat64           nscd                  1492
  close             nscd                  1554
  schedctl          nscd                  1604
  brk               dtrace                1715
  doorfs            svc.configd           2511
  stat64            nscd                  2827
  read              hald-addon-acpi       2885
  doorfs            nscd                  3207
  kill              utmpd                 3330
  read              hald                  3591
  lwp_park          fmd                   3855
  pollsys           hald                  5660
  doorfs            automountd            5894
  read              utmpd                 5907
  read              init                  6293
  nanosleep         intrd                 6912
  open              init                  7118
  lwp_park          devfsadm              7357
  lwp_park          svc.configd           7734
  nanosleep         fmd                   7910
  stat              init                  7916
  lwp_park          dtrace                8037
  writev            hald                  8040
  write             init                  8573
  nanosleep         nscd                  8600
  lwp_park          nscd                  8635
  lwp_park          svc.startd            9285
  pollsys           hald-addon-acpi       9550
  pollsys           sendmail              9687
  pollsys           in.routed             9918
  pollsys           fmd                  10118
  rename            init                 10544
  ioctl             intrd                10856
  schedctl          dtrace               11397
  open              nscd                 11443
  read              nscd                 12101
  stat              nscd                 12616
  open              utmpd                13174
  lwp_create        nscd                 16285
  pread             utmpd                21000
  writev            hald-addon-acpi      22861
  mmap              dtrace               47378
  umount2           automountd         15831553

# -----------------------------------------------------------------------------
# script output/run for ./cpu_en.d
# ----------------------------
lab10# ./cpu_en.d
dtrace: script './cpu_en.d' matched 465 probes
CPU     ID                    FUNCTION:NAME
  0  73792                        :tick-60s

  p_online          dtrace                  76
  sysconfig         dtrace                 181
  gtime             nscd                   239
  lwp_sigmask       ttymon                 261
  getpid            sendmail               289
  gtime             sendmail               289
  lwp_self          sendmail               315
  lwp_sigmask       sac                    323
  lwp_sigmask       nscd                   340
  lwp_sigmask       sendmail               360
  sigaction         dtrace                 374
  gtime             devfsadm               388
  gtime             intrd                  395
  getuid            sendmail               459
  clock_gettime     hald-addon-acpi        595
  setcontext        nscd                   616
  uucopy            nscd                   633
  lwp_continue      nscd                   739
  sigaction         sendmail               745
  pset              sendmail               784
  getmsg            in.routed              938
  ioctl             dtrace                 983
  fcntl             sendmail              1001
  setitimer         sendmail              1205
  alarm             sac                   1208
  lwp_kill          nscd                  1219
  schedctl          nscd                  1399
  fstat64           sendmail              1478
  read              ttymon                1490
  close             sendmail              1601
  brk               dtrace                1739
  stat64            nscd                  2068
  setcontext        ttymon                2576
  setcontext        sendmail              2606
  setuid            sendmail              2776
  read              hald-addon-acpi       2958
  setcontext        sac                   3420
  waitsys           sendmail              3446
  lwp_park          fmd                   3657
  read              hald                  3814
  pollsys           ttymon                4375
  write             ttymon                4546
  openat            sendmail              4984
  pollsys           hald                  5804
  close             in.routed             6449
  lwp_park          svc.configd           7729
  lwp_park          devfsadm              7776
  nanosleep         fmd                   8081
  lwp_park          dtrace                8133
  writev            hald                  8155
  pause             sendmail              8254
  lwp_park          nscd                  8628
  read              sac                   8705
  pollsys           sendmail              8972
  write             sac                   9046
  lwp_park          svc.startd            9088
  pollsys           in.routed             9636
  getdents64        sendmail              9837
  pollsys           hald-addon-acpi       9839
  stat              nscd                 10616
  schedctl          dtrace               11100
  writev            hald-addon-acpi      11225
  so_socket         in.routed            11600
  pollsys           fmd                  11660
  nanosleep         nscd                 12605
  lwp_create        nscd                 16906
  schedctl          sendmail             19272
  ioctl             intrd                22788
  pread             utmpd                23557
  open              in.routed            26906
  open64            sendmail             32919
  ioctl             in.routed            45250
  mmap              dtrace               46774
  putmsg            in.routed           189065
  forksys           sendmail            233419
Updated by Rich Lowe about 11 years ago
I'm pretty sure the automounter shouldn't be attempting unmounts at anything like that rate, unless you have a lot of automounted filesystems, and most of them are mounted.
You could adjust the D to record the paths it was trying to unmount, and at what rate (a lot of paths at a normal rate? a few paths far too often?)
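Something along these lines (an untested sketch of mine, not one of the attached scripts) would show which paths automountd asks to unmount, and how often, over a minute:

#!/usr/sbin/dtrace -s

/* count umount2() calls from automountd, keyed by the path being unmounted */
syscall::umount2:entry
/execname == "automountd"/
{
        @paths[copyinstr(arg0)] = count();
}

tick-60s
{
        exit(0);
}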
Updated by Rich Lowe about 11 years ago
Have you done anything unusual to this system (changing the value of hz, or anything of that nature?)
Also, can you record both the user and kernel stack when umount2 is running (ustack() and stack())?
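For example (an untested sketch, not from the attached files):

#!/usr/sbin/dtrace -s

/* capture the user and kernel stacks leading to each umount2() call */
syscall::umount2:entry
{
        @[execname, ustack(), stack()] = count();
}

tick-60s
{
        exit(0);
}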
Updated by Rich Lowe about 11 years ago
- Status changed from New to Feedback
Ok, no, now I see it. I hadn't read your scripts, just your text.
What your scripts are measuring isn't what you're describing them measuring (or you're confusing which script is which, and thus confusing me).
You're measuring their average time to complete, in nsec, but describing it as calls/sec. So what you're really seeing is an average time for the unmount() system call to complete of about 4 thousandths of a second.
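To make the distinction concrete (my illustration, not from the ticket): the script as written reports the average on-CPU time per call in nanoseconds; a call-rate measurement would instead use count() and divide by the 60-second window, e.g.:

#!/usr/sbin/dtrace -s

/* count calls per syscall/process over 60 seconds; divide by 60 for calls/sec */
syscall:::entry
{
        @calls[probefunc, execname] = count();
}

tick-60s
{
        exit(0);
}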
Updated by Jon Strabala about 11 years ago
- File 20111128_tcs_ga_three_dtrace_oneliners_at_180s_and_10ms.txt 20111128_tcs_ga_three_dtrace_oneliners_at_180s_and_10ms.txt added
Sorry about misinterpreting the dtrace output (perils of someone like me, not understanding 'dtrace', blindly grabbing an analysis script and using it). I agree 4ms in umount() isn't too bad; after all, it does do something physical.
Before I let this one go, I dug a bit more - maybe the title should be 'fsflush' activity instead of 'umount2' - but maybe this is just normal kernel/zfs type activity.
So the 'one-liner' dtrace script below, using '= count();', seems to say that 'fsflush' enters kernel functions 4,500,931 times (4.5M) in 180 seconds (that would be about 25K/sec) - would this be considered high on an idle system doing absolutely nothing?
lab10# dtrace -n 'fbt::*:entry{ @[execname] = count(); } tick-180s { exit(0); }'
dtrace: description 'fbt::*:entry' matched 31473 probes
CPU     ID                    FUNCTION:NAME
  1  65604                       :tick-180s

  ttymon                  616
  sac                     635
  automountd              720
  in.routed              1344
  kcfpoold               1511
  svc.startd             3659
  utmpd                  3839
  hald-addon-acpi        8625
  hald                  10952
  svc.configd           18715
  devfsadm              19726
  nscd                  29862
  intrd                 35188
  fmd                   68451
  zpool-rpool          204841
  dtrace              2434749
  fsflush              4500931
  sched               17339936
More details are in the attached file "20111128_tcs_ga_three_dtrace_oneliners_at_180s_and_10ms.txt".
In an attempt to dig deeper I made another dtrace script called "chk_fsflush_automountd_zpool-rpool.d" to print some stats and specifically look at the fsflush, automountd and zpool-rpool processes. The actual script is shown below (all mistakes are mine, for example the output in ms for sched never looks right):
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* time spent on CPU, per process */
sched:::on-cpu
{
        self->on = vtimestamp;
}

sched:::off-cpu
/self->on/
{
        @oncpu[execname] = sum(vtimestamp - self->on);
        self->on = 0;
}

/* time spent waiting for anonymous page-ins, per process */
vminfo:::anonpgin
{
        self->anonpgin = 1;
}

::pageio_setup:return
{
        self->wait = timestamp;
}

::pageio_done:entry
/self->anonpgin == 1/
{
        self->anonpgin = 0;
        @pageintime[execname] = sum(timestamp - self->wait);
        self->wait = 0;
}

/* count every probe named 'entry' fired by the three processes of interest */
:::entry
/execname == "fsflush" || execname == "zpool-rpool" || execname == "automountd"/
{
        @exec_counts[probefunc,execname] = count();
}

tick-5s
{
        trunc(@oncpu, 10);
        trunc(@pageintime, 10);
        printf("\033[H\033[2J");
        normalize(@pageintime, 1000000);
        printf("Who's waiting for pagein (milliseconds):\n");
        printa(" %-50s %15@d\n", @pageintime);
        normalize(@oncpu, 1000000);
        printf("Who's on cpu (milliseconds):\n");
        printa(" %-50s %15@d\n", @oncpu);
        printf("Dump - fsflush zpool-rpool automountd count()'s\n");
        printf(" probefunc, execname, count()\n");
        printf(" ----------------------------------------------------------------------------------------------------\n");
        printa(" %-40s %-40s %15@d\n", @exec_counts);
        trunc(@oncpu);
        trunc(@pageintime);
        trunc(@exec_counts);
}

tick-180s
{
        exit(0);
}
I ran the above on an 'idle' system and cut-n-pasted the output into the attached file "chk_fsflush_automountd_zpool-rpool.out".
The most interesting item is that sometimes in a five (5) second period (there are 36 periods) I see some spikes like:
 probefunc, execname, count()
 ----------------------------------------------------------------------------------------------------
 **** many lines removed - see chk_fsflush_automountd_zpool-rpool.out for full output ****
 page_trylock                             fsflush                                             5037
 page_unlock                              fsflush                                             5037
 page_pptonum                             fsflush                                            20633
 page_nextn                               fsflush                                           634781
This becomes almost 127K calls/sec to 'page_nextn' (from the fsflush process), but once again the system is doing absolutely nothing, and all output goes to the screen via an ssh login to putty running on a PC (the screen buffer is on the PC). Remember this is 127K calls/sec averaged over a 5 second sample; the true peak/burst may actually be much higher.
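A quick way to see whether 'page_nextn' arrives in per-second bursts that line up with fsflush wakeups (an untested one-liner of mine, not from the attached files):

lab10# dtrace -n 'fbt::page_nextn:entry /execname == "fsflush"/ { @ = count(); } tick-1s { printa("page_nextn calls in the last second: %@d\n", @); clear(@); }'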
Is this 'page_nextn' activity (from fsflush) normal on a system doing essentially nothing? If it is, close the ticket.
Updated by Jon Strabala about 11 years ago
Added file "chk_fsflush_automountd_zpool-rpool.out.txt" referenced in https://www.illumos.org/issues/1810#note-7
Updated by Jon Strabala about 11 years ago
It seems like fsflush is indeed a process that should consume a bit of CPU, as it needs to scan memory. I don't know if the tuning 'advice' below is still accurate (for illumos), but it does seem to explain what is happening WRT fsflush calls.
From: http://oracle-dba-quickanswers.blogspot.com/2011/10/how-to-analyze-high-cpu-utilization-in.html
Fsflush Cpu Usage: The primary task of the file system flush daemon, fsflush, is to search for and flush dirty file system pages to the backing store of the file system that the file(s) reside in. By default, fsflush is configured to wake up every 5 seconds (in recent releases it is changed to 1 second) and search 1/6th of physical memory for dirty pages - thus flushing all of physical memory within 30 seconds. On large memory systems, the fsflush daemon will consume a large amount of cpu if there are many dirty pages in the file system page cache. In some circumstances, the daemon will not be able to complete its scan of 1/6th of physical memory at each scheduled wake up. The amount of time being spent by fsflush can be observed by taking multiple snapshots using ps(1) to view the cpu time (TIME column) consumed by the daemon:

# ps -aef |grep fsflush
     UID   PID  PPID   C    STIME TTY      TIME CMD
    root     3     0   0 10:14:13 ?      132:13 fsflush

The amount of time spent in fsflush can also be observed on an idle machine. A 96GB machine with many file system pages will use 100% of a CPU for approximately 3 seconds, each time the file system daemon wakes up:

# mpstat 5
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    4   0  109   201  101   15    0    0   12    0    64    0 99*   0   1
  1   39   0    4     1    1    6    0    0    3    0    92    0   0   0 100
  2   15   0 1473    10   10   14    0    0    1    0    21    2   0   2  95
  3   21   0 1726     1    1   56    0    0   47    0   286    1   0   3  96
  4   22   2  123     0    0   11    0    0    1    0    68    0   0   1  98

A larger value for the autoup parameter reduces the fsflush cpu usage at the cost of keeping dirty buffers in memory for a longer period. If fsflush runs for a longer period, it can affect application performance since it runs in kernel mode. It is therefore recommended to increase the default (30 seconds) to (nGB x 4). For example, a system configured with 96 GB should have an autoup value of 384. Edit the /etc/system file, add the following line, and reboot:

set autoup=384
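Applying the quoted rule of thumb (nGB x 4) to a 16 GB box like the one under test here would give autoup = 16 x 4 = 64; a hypothetical /etc/system entry, shown purely to illustrate the quoted advice rather than as a recommendation from this ticket, would be:

set autoup=64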
Now the system under test, hostname lab10 (16GB RAM, last reboot about 17 hours ago), usually looks like the following (nothing near as high as the 'mpstat 5' output shown in the URL above):
lab10# ps -aef |grep fsflush | grep -v grep
    root     3     0   0 17:36:04 ?        1:37 fsflush
lab10# mpstat 5
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    7   0   20   344  102   91    0    3    1    0    16    0   0   0 100
  1    9   0   38   104   24   72    0    3    1    0    23    0   0   0 100
  2    7   0   34   104   19   71    0    3    1    0    16    0   0   0 100
  3    7   0   22   109   17   75    0    3    1    0    17    0   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   312  101   89    0    4    0    0     4    0   0   0 100
  1    0   0    0    66   20   59    0    3    1    0     0    0   0   0 100
  2    0   0    0    74   28   68    0    3    1    0    56    0   0   0 100
  3    0   0    0    70    9   66    0    4    1    0     1    0   0   0 100
* * *
but occasionally (1 out of 20 samples) I do see a bit of non-idle CPU like:
lab10# mpstat 5
* * *
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   312  101   68    0    5    1    0     5    0   0   0 100
  1    0   0    0    88   37   87    0    4    0    0     1    0   1   0  99
  2    0   0    0    66   11   57    0    4    1    0    57    0   0   0 100
  3    0   0    0    82   15   75    0    5    1    0     1    0   0   0 100
Further information on fsflush and tunables can be found starting on page 35, http://www.scribd.com/doc/59153317/Solaris-10-System-Administrator-Collection-Solaris-Tunable-Parameters-Reference-Manual - but once again I don't know how relevant this is for illumos.
At this point the burst of CPU related to 'fsflush' is just a curiosity to me, as the above information (refer to the URLs) more or less explains what is going on.
So the only question that remains: is this 'page_nextn' activity (from fsflush) normal on a system doing essentially nothing? Should pages just be flushed once? Or does all the kernel, or perhaps even dtrace, activity guarantee that I will get spikes of 'fsflush' activity because pages are getting dirty? It still seems odd that on an idle system with 16GB RAM (top only shows 2GB used for anything - I assume mostly for ZFS read cache - with 100% of the swap free) I would have to write anything to disk.
lab10# top -q -S -o time
last pid: 102042; load avg 0.00, 0.00, 0.00; up 0+17:17:48                10:53:21
49 processes: 48 sleeping, 1 on cpu
CPU states: 100% idle, 0.0% user, 0.1% kernel, 0.0% iowait, 0.0% swap
Kernel: 287 ctxsw, 530 intr, 44 syscall
Memory: 16G phys mem, 14G free mem, 8187M total swap, 8187M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
     3 root        1  60  -20    0K    0K sleep   1:37  0.00% fsflush
     6 root      142  99  -20    0K    0K sleep   0:13  0.00% zpool-rpool
101073 admin       1  59    0 7700K 5116K sleep   0:04  0.00% sshd
100005 root       21  59    0   15M   14M sleep   0:04  0.00% svc.configd
100895 root        1  59    0 6488K 5372K sleep   0:02  0.00% intrd
100686 root        4  59    0 3060K 2012K sleep   0:02  0.00% automountd
100343 root       29  59    0 6040K 4352K sleep   0:01  0.00% nscd
100780 root       25  59    0   31M   25M sleep   0:01  0.00% fmd
100003 root       16  59    0   14M   12M sleep   0:01  0.00% svc.startd
100386 root        4  59    0 6772K 5420K sleep   0:01  0.00% hald
102041 root        1  59  -20 3536K 1912K cpu/2   0:00  0.01% top
100272 root        6  59    0 4396K 3104K sleep   0:00  0.00% devfsadm
     2 root        2  98  -20    0K    0K sleep   0:00  0.00% pageout
100265 root        5  60  -20 2476K 1532K sleep   0:00  0.00% zonestatd
     4 root        5  60  -20    0K    0K sleep   0:00  0.00% kcfpoold
100291 root        1  59    0   11M 5688K sleep   0:00  0.00% cupsd
100885 root        4  59    0 9740K 6600K sleep   0:00  0.00% rad
100784 smmsp       1  59    0 5936K 1660K sleep   0:00  0.00% sendmail
101072 root        1  59    0 5704K 3376K sleep   0:00  0.00% sshd
100241 root       17  59    0 5552K 3044K sleep   0:00  0.00% syseventd
100727 root        3  59    0 4896K 3484K sleep   0:00  0.00% inetd
100836 root        1  59    0 4312K 1920K sleep   0:00  0.00% sshd
100435 root        2  59    0 4256K 3096K sleep   0:00  0.00% console-kit-dae
100760 root       11  59    0 4212K 2204K sleep   0:00  0.00% syslogd
100497 root        1  59    0 4128K 2284K sleep   0:00  0.00% rmvolmgr
100395 root        1  59    0 3888K 2316K sleep   0:00  0.00% hald-addon-cpuf
100393 root        1  59    0 3884K 2312K sleep   0:00  0.00% hald-addon-netw
101092 root        1  59    0 3780K 2648K sleep   0:00  0.00% bash
100387 root        1  59    0 3776K 2260K sleep   0:00  0.00% hald-runner
101080 admin       1  59    0 3772K 2568K sleep   0:00  0.00% bash
In other words, if nothing in memory is dirty (needing to be flushed to disk), do I still need to spend the time to 'scan' all the system memory searching for dirty file system pages? Also, if the memory is dedicated to ZFS read cache, does it need to be scanned at all?
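For what it's worth, a back-of-the-envelope calculation (my numbers, not from the ticket, assuming the common defaults of autoup=30 and tune_t_fsflushr=1 and 4 KB pages) suggests the scan alone can explain the rate, since fsflush walks a fixed fraction of physical memory on every wakeup whether or not anything is dirty:

  16 GB / 4 KB                 = 4,194,304 physical pages
  4,194,304 pages / autoup(30) ≈ 139,810 pages examined per one-second wakeup

That is the same order of magnitude as the ~127K 'page_nextn' calls/sec seen above, so the bursts may simply be the scan itself rather than dirty pages being written out.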
If this type of behavior is completely normal, someone can either close or delete the ticket.