Bug #6202
htable_reap() kicks off htable crime wave, destroying system
Description
So we finally got an NMI'd dump when a system was in this state. (thoth dump b4ec5232c9a3b3928fc32636126676a3.)
The ::cpuinfo is almost comical:
> ::cpuinfo
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc32680  1f   12    0 169  yes   yes t-11   fffff001e82c5c40 sched
  1 fffff042ea274040  1f    9    0  60   no    no t-12   fffff043c03cf3c0 svc.configd
  2 fffff042ea6c6b00  1f    4    0  60   no    no t-0    fffff04383291820 PassengerLogging
  3 fffff042ea6bf500  1f    9    0  60   no    no t-3    fffff078a5b947c0 qemu-system-x86_
  4 fffff042ea6baac0  1f    4    0  60   no    no t-0    fffff154830083c0 zeus.monitor
  5 fffff042ea6b5580  1f    7    0  60   no    no t-0    fffff0431e9f7180 nscd
  6 fffff042ea6b4080  1f    6    0  60   no    no t-3    fffff04417b2fb40 java
  7 fffffffffbc3cf80  1b    4    0  60   no    no t-0    fffff1c40a5a5480 node
  8 fffff042ea84c540  1f    5    0  60   no    no t-0    fffff04399026740 java
  9 fffff042ea84b040  1f    3    0  60   no    no t-5    fffff1602a426440 nscd
 10 fffff042ea849b00  1f    7    0  60   no    no t-5    fffff0431e9f78c0 devfsadm
 11 fffff042ea846500  1f    2    0  60   no    no t-12   fffff0887d959160 qemu-system-x86_
 12 fffff042ea845000  1f    6    0  60   no    no t-10   fffff090e17b8020 svc.configd
 13 fffff042ea841ac0  1f    1    0  60   no    no t-8    fffff042ea92a080 rsyslogd
 14 fffff042ea838580  1f    5    0  60   no    no t-0    fffff08874789c40 nscd
 15 fffff042ea837080  1f    5    0  60   no    no t-5    fffff04380471820 node
 16 fffff042ea831a80  1f    5    0  60   no    no t-17   fffff088052ea000 qemu-system-x86_
 17 fffff042ea913540  1f    4    0  60   no    no t-10   fffff092541520e0 java
 18 fffff042ea912040  1f    2    0  60   no    no t-0    fffff043ed036c60 node
 19 fffff042ea90eb00  1f    4    0  60   no    no t-3    fffff0438332a3a0 svc.startd
 20 fffff042ea90b500  1f    2    0  60   no    no t-0    fffff043836317c0 svc.startd
 21 fffff042ea90a000  1f    4    0  60   no    no t-15   fffff088335ca7e0 epmd
 22 fffff042ea908ac0  1f    2    0  60   no    no t-8    fffff04383812820 nscd
 23 fffff042ea901580  1f    3    0  60   no    no t-8    fffff16af2f21b00 rsyslogd
 24 fffff042ea8ff500  1f    2    0  60   no    no t-0    fffff043bfd6e520 httpd
 25 fffff042ea8fe000  1f    2    0 100   no    no t-3    fffff088133b2160 node
 26 fffff042ea8faac0  1f    4    0  60   no    no t-3    fffff0438901f820 node
 27 fffff042ea8f7580  1f    3    0  60   no    no t-0    fffff043e770e500 nscd
 28 fffff042ea8f6080  1f    5    0  60   no    no t-0    fffff08876f1a460 qemu-system-x86_
 29 fffff042eaa1ea80  1f    5    0  60   no    no t-1    fffff088038ffc40 zeus.admin
 30 fffff042eaa1b540  1f    6    0  60   no    no t-3    fffff090e3d168c0 zabbix_agentd
 31 fffff042eaa1a040  1f    4    0  60   no    no t-23   fffff0886b843c00 rsyslogd
 32 fffff042eaa18b00  1f    7    0  60   no    no t-0    fffff0879d25e0a0 zeus.zxtm
 33 fffff042eaa11500  1f    4    0  60   no    no t-5    fffff09291740ba0 qemu-system-x86_
 34 fffff042eaa10000  1f    6    0  60   no    no t-0    fffff16ae02bf000 svc.startd
 35 fffff042eaa0c040  1f    4    0  60   no    no t-8    fffff04383336400 mysqld
 36 fffff042eaa06b00  1f    6    0  60   no    no t-1    fffff078aa978000 node
 37 fffff042eaa01500  1f    4    0  60   no    no t-1    fffff088066b3820 zeus.zxtm
 38 fffff042eaa00000  1f    5    0  60   no    no t-0    fffff1627df3a860 node
 39 fffff042eab06ac0  1f    4    0  60   no    no t-26   fffff0439ec567e0 ipmon
 40 fffff042eab03580  1f    5    0  60   no    no t-3    fffff0437734b4c0 svc.startd
 41 fffff042eab02080  1f    4    0  60   no    no t-3    fffff043ac366b20 nscd
 42 fffff042eaafca80  1f    3    0  60   no    no t-0    fffff08a47400440 nscd
 43 fffff042eaaf9540  1f    4    0  60   no    no t-13   fffff04383338b20 svc.configd
 44 fffff042eaaf8040  1f    2    0  97   no    no t-118  fffff001e9c83c40 pageout
 45 fffff042eaaf4080  1f    5    0  60   no    no t-1    fffff08806311140 mysqld
 46 fffff042eaaf0a80  1f   12    0  60   no    no t-4    fffff04383336060 inetd
 47 fffff042eabe2040  1f    7    0  60   no    no t-15   fffff043b4370c00 nginx
A Noah's Ark of programs – all runnable! It turns out that they are all in htable_steal():
> ::stacks -t run
THREAD           STATE    SOBJ                COUNT
fffff0431bddb0e0 RUN      <NONE>                572
                 <thread not in memory>

fffff0431c47fbc0 RUN      <NONE>                142
                 swtch+0x141
                 turnstile_block+0x21a
                 mutex_vector_enter+0x3a3
                 htable_steal_active+0x85
                 htable_steal+0x1ce
                 htable_alloc+0xf9
                 htable_create+0x1e9
                 hati_load_common+0x198
                 hat_memload+0x75
                 hat_memload_region+0x25
                 segvn_faultpage+0x462
                 segvn_fault+0x8e6
                 as_fault+0x37a
                 pagefault+0x96
                 trap+0x2c7

fffff0883dd677e0 RUN      <NONE>                 34
                 swtch+0x141
                 turnstile_block+0x21a
                 mutex_vector_enter+0x3a3
                 htable_steal_active+0x85
                 htable_steal+0x1ce
                 htable_alloc+0xf9
                 htable_create+0x1e9
                 hati_load_common+0x198
                 hat_memload+0x75
                 hat_memload_region+0x25
                 segvn_faultpage+0x462
                 segvn_fault+0x8e6
                 as_fault+0x37a
                 pagefault+0x96
                 trap+0xd23
                 0xfffffffffb8001d6
                 bcopy+0x1b4
                 lwp_park+0x2bb
                 syslwp_park+0x63
                 _sys_sysenter_post_swapgs+0x153

fffff078b9a67b60 RUN      <NONE>                 24
                 swtch+0x141
                 swapout_lwp+0x8e
                 fss_trapret+0x5f
                 post_syscall+0x5dc
                 syscall_exit+0x68
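Every one of those stacks tells the same story: a thread took a page fault, needed a new page table, couldn't get a page for it, and went stealing. The fallback in htable_alloc() looks roughly like this (a paraphrase, not the verbatim source; the htable_create -> htable_alloc -> htable_steal frames above confirm the path):

	/*
	 * Paraphrase of the htable_alloc() fallback (not verbatim source):
	 * if ptable_alloc() couldn't produce a page for the new page table,
	 * steal one out from under another process.  Every victim of theft
	 * is thus promptly converted into a thief.
	 */
	if (ht->ht_pfn == PFN_INVALID) {
		/* no page to be had; go take someone else's page table */
		ht = htable_steal(1);
	}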
The thing that's really annoying here is that the ARC is huge:
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   11436030             44671   17%
ZFS File Data            38191184            149184   57%
Anon                     17103265             66809   25%
Exec and libs                 268                 1    0%
Page cache                  23610                92    0%
Free (cachelist)           224953               878    0%
Free (freelist)            102515               400    0%

Total                    67081825            262038
Physical                 67081824            262038
So why isn't the ARC shrinking?!
> ::stacks -c arc_reclaim_thread
THREAD           STATE    SOBJ                COUNT
fffff001ea593c40 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait+0x70
                 taskq_wait+0x43
                 kmem_cache_reap_now+0x44
                 arc_kmem_reap_now+0x68
                 arc_reclaim_thread+0x21a
                 thread_start+8
So we're blocking on kmem reaping; how is that coming along?
> *kmem_taskq::taskq -t
ADDR             NAME                             ACT/THDS Q'ED  MAXQ INST
fffff042dfd78d28 kmem_taskq                           1/ 1  389   542    0

THREAD           STATE    SOBJ                COUNT
fffff001e8217c40 RUN      <NONE>                  1
                 swtch+0x141
                 turnstile_block+0x21a
                 mutex_vector_enter+0x3a3
                 htable_steal_active+0x85
                 htable_steal+0x1ce
                 htable_reap+0x6f
                 kmem_cache_reap+0x3b
                 taskq_thread+0x2d0
                 thread_start+8
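That explains the ARC thread's predicament: kmem_cache_reap_now() dispatches the reap onto kmem_taskq and then waits for the entire taskq to drain, so the ARC is stuck not merely behind its own caches but behind all 389 queued tasks, including the htable_reap() wedged in htable_steal(). Paraphrased from the illumos kmem.c of the era (shape only, not verbatim; the taskq_wait() frame in the ARC stack above matches):

	/*
	 * The shape of kmem_cache_reap_now() at the time (paraphrased):
	 * note that taskq_wait() waits for the entire kmem_taskq to
	 * drain, not just for this cache's reap task.
	 */
	void
	kmem_cache_reap_now(kmem_cache_t *cp)
	{
		kmem_depot_ws_update(cp);

		(void) taskq_dispatch(kmem_taskq,
		    (task_func_t *)kmem_depot_ws_reap, cp, TQ_SLEEP);
		taskq_wait(kmem_taskq);
	}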
And there you have it: we're stealing htables because there's no memory – but we can't free memory because we're fighting over stealing htables. In terms of where the actual bug lies in this seizing contraption, it's telling that we have not, in fact, had an allocation failure out of the htable cache:
> *htable_cache::print kmem_cache_t cache_alloc_fail
cache_alloc_fail = 0
Or indeed, any cache:
> ::walk kmem_cache | ::print kmem_cache_t cache_alloc_fail ! sort | uniq -c
    501 cache_alloc_fail = 0
So beyond the stealing induced by htable_reap(), why are we even stealing at all? We're stealing because ptable_alloc() actually fails if freemem is below throttlefree:
static pfn_t
ptable_alloc(uintptr_t seed)
{
	pfn_t pfn;
	page_t *pp;

	pfn = PFN_INVALID;

	/*
	 * The first check is to see if there is memory in the system. If we
	 * drop to throttlefree, then fail the ptable_alloc() and let the
	 * stealing code kick in. Note that we have to do this test here,
	 * since the test in page_create_throttle() would let the NOSLEEP
	 * allocation go through and deplete the page reserves.
	 *
	 * The !NOMEMWAIT() lets pageout, fsflush, etc. skip this check.
	 */
	if (!NOMEMWAIT() && freemem <= throttlefree + 1)
		return (PFN_INVALID);

In terms of what's inducing the sudden memory pressure, it should come as no surprise that there's a big KVM provision on the scene:

> ::stacks -c memcntl
THREAD           STATE    SOBJ                COUNT
fffff16588cd67e0 RUN      <NONE>                  1
                 swtch+0x141
                 cv_wait+0x70
                 page_create_throttle+0x16a
                 page_create_va+0x5d4
                 swap_getapage+0x113
                 swap_getpage+0x90
                 fop_getpage+0x7e
                 anon_zero+0xb6
                 segvn_lockop+0xd1a
                 as_ctl+0x2c5
                 memcntl+0x4de
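For a sense of scale, here is a user-space model (not kernel code) of the paging thresholds, assuming the stock default tuning (lotsfree = physmem/64, desfree = lotsfree/2, minfree = desfree/2, throttlefree = minfree; all four are tunables): on this 256GB box, freemem has to be driven below roughly 1GB before ptable_alloc() starts failing and the stealing machinery engages.

	/*
	 * A user-space sketch of the stock illumos paging thresholds,
	 * assuming default tuning; physmem comes from the ::memstat
	 * output above (4K pages).
	 */
	#include <stdio.h>

	int
	main(void)
	{
		long physmem = 67081824;	/* pages, per ::memstat */
		long lotsfree = physmem / 64;	/* pageout starts scanning */
		long desfree = lotsfree / 2;	/* pageout gets desperate */
		long minfree = desfree / 2;	/* swapping may begin */
		long throttlefree = minfree;	/* ptable_alloc() fails */

		printf("lotsfree     = %8ld pages (~%ld MB)\n",
		    lotsfree, lotsfree / 256);
		printf("desfree      = %8ld pages (~%ld MB)\n",
		    desfree, desfree / 256);
		printf("minfree      = %8ld pages (~%ld MB)\n",
		    minfree, minfree / 256);
		printf("throttlefree = %8ld pages (~%ld MB)\n",
		    throttlefree, throttlefree / 256);
		return (0);
	}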
So based on this, we can describe what's happening in its entirety: after the system has been up for a while and has grown a pretty big ARC (~150G), we have a 64G KVM provision. The need to lock this memory (and therefore displace ARC) quickly drives freemem down to the point at which the ARC will aggressively shrink (namely, when freemem gets below lotsfree + needfree + desfree). However, ARC reclamation blocks on its ability to reap several key kmem caches – which in turn blocks on the complete kmem_reap() kicked off by schedpaging().

ARC reclamation being blocked behind all of kmem reaping might be fine were it not for htable_reap(), which is making the rather capricious decision to steal 11,525 of the system's ~230,000 htables, only to return them to the system. But because the system is active and running, and because the htable stealing logic is so ill-conceived (essentially no consideration is given to the level of activity of the victim), the victims quickly become perpetrators themselves: they page fault on their missing htables, and then they themselves steal an htable from someone else. Turning essentially every process in the system into an htable thief causes lock contention to spike to astronomical levels and throughput to drop to nearly zero as thieves back up on locks. Indeed, even ARC reclamation is caught in the disaster of its own making, and can't meaningfully progress to the one act that would relieve the system: shrinking the ARC.
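The shrink trigger in question is the freemem check in arc_reclaim_needed(), which looks approximately like this (shape only, paraphrased, not verbatim source):

	#include <sys/types.h>		/* pgcnt_t (illumos) */

	extern pgcnt_t freemem, lotsfree, needfree, desfree;

	/*
	 * Paraphrase of the memory-pressure check in the era's
	 * arc_reclaim_needed(): once freemem dips below lotsfree +
	 * needfree + desfree, the ARC decides it must shrink.  Actually
	 * shrinking, as shown above, is gated on kmem reaping.
	 */
	static int
	arc_reclaim_needed(void)
	{
		if (freemem < lotsfree + needfree + desfree)
			return (1);
		return (0);
	}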
The executive summary is that 100% of this stealing activity is being induced by the htable_steal() from kmem reaping – this wound is entirely self-inflicted. It's particularly galling that:
- The reap callback has no tunables – it always decides to reap up to 5% of physical memory (!!), with a minimum of ten pages of page tables (see the sketch after this list).
- htable theft from the reap callback is not meaningfully differentiated from theft due to more dire memory scarcity: htable_steal() will steal active page tables, even though this is really beyond the spirit of the reap callback.
- Active page tables are stolen in an incredibly ham-handed fashion. Not only are the targets essentially random, there is seemingly no understanding in the code that reaping may put the system into a prolonged state of cascading theft.
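To make the first bullet's arithmetic concrete, the reap target works out to something like this (a paraphrase of the computation as described above, not the verbatim htable_reap() source; the bound on active page tables is my reading of it):

	/*
	 * Paraphrase of the reap target (not verbatim source): up to 5%
	 * of physmem, bounded by 5% of the active page tables, with a
	 * floor of ten -- and none of it tunable.  On this system, the
	 * active-ptable term binds: ~230,000 / 20 ~= the 11,525 htables
	 * stolen above.
	 */
	reap_cnt = MAX(MIN(physmem / 20, active_ptables / 20), 10);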
There are a bunch of ways to plausibly fix this:
- Do not have a kmem reap callback at all for the htable cache, because the system will already start to steal when memory is dangerously low (namely, when KM_NOSLEEP allocations start to fail).
- Have a kmem reap callback, but make the number of page tables stolen by htable_steal() via the reap callback a tunable – and tune it much, much lower than 5%. (Much lower.)
- Do not allow the reap callback to ever steal active page tables (a sketch of this option follows below).
- Do not allow the reap callback to make more than a single pass.
(Some combinations of options are obviously redundant and others are mutually exclusive.)
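As a purely hypothetical illustration of the third option, the reap path could be distinguished from genuine allocation-failure stealing with a flag. The parameter and structure below are invented for illustration; they are not taken from the actual fix:

	/*
	 * Hypothetical sketch (names and structure invented): plumb a
	 * "reap" flag into the stealing code so that reap-induced theft
	 * never escalates to page tables in active use.
	 */
	static htable_t *
	htable_steal(uint_t cnt, boolean_t reap)
	{
		htable_t *list = NULL;

		/* ... first pass: harvest only unused htables, as today ... */

		/*
		 * If we are stealing on behalf of the kmem reaper rather
		 * than a failing allocation, stop here rather than
		 * proceeding to htable_steal_active() and the cascade
		 * described above.
		 */
		if (reap)
			return (list);

		/* ... remaining passes: steal active page tables ... */
		return (list);
	}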
Updated by Electric Monk
- Status changed from New to Closed
commit 091194e6fc9fb462c843bd5f1beea58d43650378
Author: Bryan Cantrill <bryan@joyent.com>
Date:   2015-09-13T14:56:21.000Z

    6202 htable_reap() kicks off htable crime wave, destroying system
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
    Reviewed by: Richard PALO <richard@netbsd.org>
    Approved by: Albert Lee <trisk@omniti.com>