Bug #2855

vmu_calculate_proc() impeding page faults for thousands of seconds

Added by John Sonnenschein over 8 years ago.

Status: New
Priority: Normal
Category: -
Start date: 2012-06-08
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:
Description

From Joyent OS-987:

the w command is hanging because it's trying to get the pr_p_lock
on the prnode_t associated with 0xffffff19cc48d0a0:

0xffffff19cc48d0a0::ps

S PID PPID PGID SID UID FLAGS ADDR NAME
R 22596 6280 3904 3904 0 0x42000000 ffffff19cc48d0a0 redis-server
In terms of who owns this, it's quite likely related to the fact that
the process is currently in a pagefault:

0xffffff19cc48d0a0::walk thread | ::findstack

stack pointer for thread ffffff19b1987000: ffffff00b9240980
[ ffffff00b9240980 _resume_from_idle+0xf1() ]
ffffff00b92409b0 swtch+0x145()
ffffff00b9240a60 turnstile_block+0x760()
ffffff00b9240ad0 rw_enter_sleep+0x205()
ffffff00b9240c60 segvn_fault+0xb91()
ffffff00b9240d70 as_fault+0x5ee()
ffffff00b9240df0 pagefault+0x99()
ffffff00b9240f00 trap+0xe63()
ffffff00b9240f10 0xfffffffffb8001d6()
And it's been there for a while – over 404 seconds:

0xffffff19cc48d0a0::walk thread | ::print kthread_t t_hrtime | ::scalehrtime | ::eval .>t
*panic_hrtime-<t=E

404401240289
As for the owner:

ffffff19bedff030::rwlock

            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffff19bedff030 ffffff19b1946040  B101 ffffff19b1987000 (R)
                                    | |
                 WRITE_LOCKED ------+ |
                  HAS_WAITERS --------+

ffffff19b1946040::findstack

stack pointer for thread ffffff19b1946040: ffffff00ba3f09c0
ffffff00ba3f0ad0 kmem_log_event+0xa4()
ffffff00ba3f0b20 do_interrupt+0xf2()
ffffff00ba3f0b30 _interrupt+0xba()
ffffff00ba3f0c60 i_mod_hash_find_nosync+0x44()
ffffff00ba3f0ca0 vmu_find_insert_anon+0x2d()
ffffff00ba3f0d80 vmu_calculate_seg+0x45f()
ffffff00ba3f0dd0 vmu_calculate_proc+0x259()
ffffff00ba3f0e00 vmu_calculate+0xd2()
ffffff00ba3f0e80 vm_getusage+0x23a()
ffffff00ba3f0f00 sysconfig+0x10b()
ffffff00ba3f0f10 sys_syscall+0x17a()
This process, it turns out, is a bit of a monster from an anon perspective:

ffffff19cc48d0a0::pmap -q

SEG BASE SIZE VNODE
fffffff81e294c38 0000000000400000 264k fffffff81cbe7540
ffffff19bd600c00 0000000000451000 12k fffffff81cbe7540
ffffff19bee6c7f0 0000000000454000 4194292k [ anon ]
ffffff19b32eee60 0000000100451000 4194288k [ anon ]
ffffff1999e07808 000000020044d000 4194288k [ anon ]
ffffff19bee6c8b0 0000000300449000 4194288k [ anon ]
ffffff19cc3f63f8 0000000400445000 4194288k [ anon ]
ffffff19c0fdee20 0000000500441000 4194288k [ anon ]
fffffff81e294638 000000060043d000 4194288k [ anon ]
ffffff19ab3460e0 0000000700439000 4194288k [ anon ]
ffffff19cc242080 0000000800435000 4194240k [ anon ]
ffffff19cc246138 0000000900425000 1250032k [ anon ]
ffffff19c0fe5ed0 fffffd7ffe450000 64k [ anon ]
ffffff19cc17af08 fffffd7ffe46f000 4k ffffff19bf695b00
fffffff81c02bb70 fffffd7ffea50000 68k ffffff19ad676a40
ffffff19b2c03ef0 fffffd7ffea71000 4k ffffff19ad676a40
ffffff19bee65908 fffffd7ffeadd000 12k ffffff19ad676340
ffffff19bee0e8f8 fffffd7ffecc0000 560k ffffff19ad683800
ffffff19bd618908 fffffd7ffed5c000 12k ffffff19ad683800
ffffff19b26f2dc0 fffffd7ffed5f000 28k [ anon ]
ffffff19c0fe1d58 fffffd7ffeea0000 400k ffffff1999fb9840
ffffff19bf561c90 fffffd7ffef13000 20k ffffff1999fb9840
ffffff19bf60bd20 fffffd7fff0b0000 4k ffffff19bf4f6200
ffffff19c0ff2380 fffffd7fff0c0000 64k [ anon ]
ffffff19b2e83038 fffffd7fff0e0000 64k [ anon ]
ffffff19bee0e9b8 fffffd7fff0ff000 128k [ anon ]
ffffff19b221eb40 fffffd7fff120000 4k [ anon ]
ffffff19bee6c4f0 fffffd7fff133000 4k [ anon ]
ffffff19bee98000 fffffd7fff140000 24k [ anon ]
ffffff19bfb88c70 fffffd7fff150000 4k [ anon ]
ffffff19b26ee2e8 fffffd7fff160000 1584k ffffff1999d53e40
ffffff19bee98120 fffffd7fff2fc000 52k ffffff1999d53e40
ffffff19bee0e2f8 fffffd7fff309000 8k [ anon ]
ffffff19bee654e8 fffffd7fff310000 4k [ anon ]
fffffff81c02b150 fffffd7fff320000 4k [ anon ]
ffffff19cc39fa48 fffffd7fff330000 4k [ anon ]
ffffff19b2c33f48 fffffd7fff340000 4k [ anon ]
ffffff19b2c031d0 fffffd7fff350000 4k ffffff19bd379580
ffffff19cc4af608 fffffd7fff360000 4k [ anon ]
ffffff19b2c036b0 fffffd7fff370000 4k [ anon ]
ffffff19cc39f4a8 fffffd7fff380000 4k [ anon ]
ffffff19c0fe1578 fffffd7fff390000 4k [ anon ]
ffffff19cc3f6998 fffffd7fff393000 348k ffffff1999cd6200
ffffff19b26ff3b0 fffffd7fff3fa000 12k ffffff1999cd6200
ffffff19ab3472b8 fffffd7fff3fd000 8k [ anon ]
ffffff19ab257a30 fffffd7fffd7f000 516k [ anon ]
Across those ten big segments (each roughly 4 GB, or about a million 4 KB
pages), that's damn near 10 million anon structures – it wouldn't be at all
surprising to learn that vmu_calculate_seg() performs pathologically when
the numbers get this large.

So who is doing this? Normally, we would expect this calculation to be
done by the agent LWP from zoneadmd – but in this case, it appears to
be called directly from the "memsup" process spawned by beam:

0xffffff19adf0d048::ptree

fffffffffbc2e430 sched
ffffff19991b8010 init
ffffff19b2136070 run_erl
ffffff19ab34a038 beam.smp
ffffff19adf0d048 memsup
That's one area to investigate; another is the efficacy of the vmu_data's
vmu_cache: would we expect this result to have been served from the cache?
Finally, would we expect vmu_calculate() of such a process to take many
minutes, and is there something that should/needs to be done to speed
that up?
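
For reference on that cache: getvmusage(2) takes an age argument, the maximum
staleness in seconds the caller will tolerate, and a result computed within
that window can be served from vmu_data's vmu_cache rather than recomputed. A
minimal sketch of a caller, with zone_rss() as an invented name for
illustration only:

#include <sys/types.h>
#include <sys/vm_usage.h>

/*
 * Hypothetical wrapper (zone_rss() is not an existing interface): a result
 * computed within the last "age" seconds may be served from the vmusage
 * cache; an age of 0 effectively demands a fresh scan.
 */
int
zone_rss(time_t age, vmusage_t *result)
{
	size_t nres = 1;

	return (getvmusage(VMUSAGE_ZONE, age, result, &nres));
}

The test program below passes an age of 0, so each of its calls performs a
full scan.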

Further investigation revealed that the problem is that the "vmusage
anon hash" has a number of elements – 4,879,970 – that grossly outstrips
its mh_nchains parameter (256). This mod_hash – which, like all mod_hash
instantiations, is not dynamically sized – stores anon structures that
have reference counts greater than 1 to avoid double-counting pre-COW'd
anon pages. In most cases, this is fine, because most anon structures
will have a reference count of 1. For a MAP_PRIVATE mapping after a
fork(), however, the reference counts will be bumped – and any call
to vmu_calculate_proc() in that window will approach O(N^2) as the number
of anonymous pages exceeds 256. Indeed, this is what has happened in this
case:

ffffff19cc48d0a0::ptree

fffffffffbc2e430 sched
ffffff19991b8010 init
ffffff19b12930a8 svc.startd
fffffff81ce71010 redis-server
ffffff19cc48d0a0 redis-server
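
To make the failure mode concrete, here is a minimal userland sketch (this is
not the mod_hash code, just an imitation of its shape): 256 fixed chains, each
an unsorted singly-linked list, where every lookup that misses walks its whole
chain before inserting. Inserting N distinct keys this way costs roughly
N^2/512 comparisons.

/*
 * Userland sketch of the degradation: 256 fixed chains, each an unsorted
 * singly-linked list, and every lookup that misses walks its entire chain
 * before inserting.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define	NCHAINS	256

typedef struct node {
	uintptr_t key;
	struct node *next;
} node_t;

static node_t *chains[NCHAINS];
static unsigned long long comparisons;

static void
insert_or_find(uintptr_t key)
{
	node_t **chain = &chains[key % NCHAINS];
	node_t *n;

	for (n = *chain; n != NULL; n = n->next) {
		comparisons++;
		if (n->key == key)
			return;
	}

	n = malloc(sizeof (node_t));
	n->key = key;
	n->next = *chain;
	*chain = n;
}

int
main(void)
{
	/* One key per 4K page of a 1 GB MAP_PRIVATE segment. */
	uintptr_t nkeys = (1ULL << 30) / 4096;
	uintptr_t i;

	for (i = 0; i < nkeys; i++)
		insert_or_find(i);

	printf("%lu keys, %llu comparisons (N^2/512 = %llu)\n",
	    (unsigned long)nkeys, comparisons,
	    (unsigned long long)nkeys * nkeys / (2 * NCHAINS));

	return (0);
}

Scaled to the 4,879,970 elements observed in this dump, that works out to
something on the order of 46 billion chain-walk steps per scan, which is
consistent with a calculation that runs for minutes.
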
That this is the issue can be demonstrated with the following program:

#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <signal.h>
#include <sys/time.h>
#include <sys/varargs.h>
#include <sys/mman.h>
#include <sys/vm_usage.h>
#include <thread.h>

char *g_cmd = "vmu";

static void
fatal(char *fmt, ...)
{
	va_list ap;
	int error = errno;

	va_start(ap, fmt);
	(void) fprintf(stderr, "%s: ", g_cmd);
	(void) vfprintf(stderr, fmt, ap);

	if (fmt[strlen(fmt) - 1] != '\n')
		(void) fprintf(stderr, ": %s\n", strerror(error));

	exit(EXIT_FAILURE);
}

int
main(int argc, char **argv)
{
	vmusage_t usage[2];
	size_t nres = 1;
	size_t mb = argc < 2 ? 1 : atoi(argv[1]);
	size_t sz = mb * 1024 * 1024, i;
	hrtime_t ts;
	pid_t child;

	bzero(usage, sizeof (usage));

	volatile char *mem = mmap(0, sz, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);

	for (i = 0; i < sz; i += 0x1000)
		mem[i] = '!';

	ts = gethrtime();

	if (getvmusage(VMUSAGE_ZONE, 0, usage, &nres) != 0)
		fatal("getvmusage");

	printf("%ld %lld ", sz, gethrtime() - ts);
	fflush(stdout);

	if ((child = fork()) == 0) {
		pause();
		exit(0);
	}

	ts = gethrtime();

	if (getvmusage(VMUSAGE_ZONE, 0, usage, &nres) != 0)
		fatal("getvmusage");

	printf("%lld\n", gethrtime() - ts);

	kill(child, SIGUSR1);

	return (0);
}
Running this with a variety of sizes gives the following output (mapping size
in bytes, followed by the getvmusage() time in nanoseconds before and after
the fork()):

1048576 841933281 822410614
2097152 801560428 822042493
4194304 830167087 833025003
8388608 824213090 842133466
16777216 804305045 826148264
33554432 828869888 834539331
67108864 828798841 843833245
134217728 818885542 890337352
268435456 838127070 1032552518
536870912 839422118 1558559308
1073741824 885646801 3911471381
2147483648 940393046 17803540506
4294967296 1039827551 96370378024
8589934592 1250603557 489499022485

That is, time in the non-fork()'d case stays roughly linear, while time
in the fork()'d case becomes quadratic: each doubling of the mapping beyond
about 1 GB roughly quadruples or quintuples the post-fork() getvmusage()
time. (Graph of this attached.)
In terms of the fix: given that mod_hash is very upfront about its
limitations with respect to resizing, it seems the most reasonable path is
to implement vme_anon_hash not as a mod_hash but rather as an AVL tree.
This fix should be straightforward, and will be prototyped...
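
As a rough illustration of that direction (a sketch under assumptions, not the
eventual change; vmu_anon_t, vma_addr, vmu_anon_compare() and vmu_anon_seen()
are invented names), tracking already-counted anon structures in an AVL tree
keyed on the anon pointer via the avl(9F) interfaces might look like this:

/*
 * Sketch only: track which anon structures have already been counted in
 * an AVL tree keyed on the anon pointer, instead of a fixed-size mod_hash.
 */
#include <sys/types.h>
#include <sys/sysmacros.h>
#include <sys/avl.h>
#include <sys/kmem.h>

typedef struct vmu_anon {
	uintptr_t	vma_addr;	/* address of the anon structure */
	avl_node_t	vma_node;	/* linkage into the per-entity tree */
} vmu_anon_t;

static int
vmu_anon_compare(const void *lhs, const void *rhs)
{
	const vmu_anon_t *l = lhs;
	const vmu_anon_t *r = rhs;

	if (l->vma_addr < r->vma_addr)
		return (-1);

	return (l->vma_addr > r->vma_addr);
}

/* Would replace the mod_hash creation for vme_anon_hash. */
static void
vmu_anon_tree_init(avl_tree_t *tree)
{
	avl_create(tree, vmu_anon_compare, sizeof (vmu_anon_t),
	    offsetof(vmu_anon_t, vma_node));
}

/*
 * Insert the anon if it hasn't been seen; return non-zero if it was
 * already present (i.e. this page has already been counted).
 */
static int
vmu_anon_seen(avl_tree_t *tree, uintptr_t anon_addr)
{
	vmu_anon_t search, *vma;
	avl_index_t where;

	search.vma_addr = anon_addr;

	if (avl_find(tree, &search, &where) != NULL)
		return (1);

	vma = kmem_alloc(sizeof (vmu_anon_t), KM_SLEEP);
	vma->vma_addr = anon_addr;
	avl_insert(tree, vma, where);

	return (0);
}

Both the lookup and the insert are O(log n) per anon, so the scan stays
O(n log n) however large the process grows, at the cost of one avl_node_t of
linkage per tracked anon.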
