Bug #6202

htable_reap() kicks off htable crime wave, destroying system

Added by Robert Mustacchi about 4 years ago. Updated about 4 years ago.

Status:
Closed
Priority:
Normal
Category:
kernel
Start date:
2015-09-03
Due date:
% Done:

100%

Estimated time:
Difficulty:
Hard
Tags:

Description

So we finally got an NMI'd dump when a system was in this state. (thoth dump b4ec5232c9a3b3928fc32636126676a3.)

The ::cpuinfo is almost comic:

> ::cpuinfo
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc32680  1f   12    0 169  yes   yes t-11   fffff001e82c5c40 sched
  1 fffff042ea274040  1f    9    0  60   no    no t-12   fffff043c03cf3c0 svc.configd
  2 fffff042ea6c6b00  1f    4    0  60   no    no t-0    fffff04383291820 PassengerLogging
  3 fffff042ea6bf500  1f    9    0  60   no    no t-3    fffff078a5b947c0 qemu-system-x86_
  4 fffff042ea6baac0  1f    4    0  60   no    no t-0    fffff154830083c0 zeus.monitor
  5 fffff042ea6b5580  1f    7    0  60   no    no t-0    fffff0431e9f7180 nscd
  6 fffff042ea6b4080  1f    6    0  60   no    no t-3    fffff04417b2fb40 java
  7 fffffffffbc3cf80  1b    4    0  60   no    no t-0    fffff1c40a5a5480 node
  8 fffff042ea84c540  1f    5    0  60   no    no t-0    fffff04399026740 java
  9 fffff042ea84b040  1f    3    0  60   no    no t-5    fffff1602a426440 nscd
 10 fffff042ea849b00  1f    7    0  60   no    no t-5    fffff0431e9f78c0 devfsadm
 11 fffff042ea846500  1f    2    0  60   no    no t-12   fffff0887d959160 qemu-system-x86_
 12 fffff042ea845000  1f    6    0  60   no    no t-10   fffff090e17b8020 svc.configd
 13 fffff042ea841ac0  1f    1    0  60   no    no t-8    fffff042ea92a080 rsyslogd
 14 fffff042ea838580  1f    5    0  60   no    no t-0    fffff08874789c40 nscd
 15 fffff042ea837080  1f    5    0  60   no    no t-5    fffff04380471820 node
 16 fffff042ea831a80  1f    5    0  60   no    no t-17   fffff088052ea000 qemu-system-x86_
 17 fffff042ea913540  1f    4    0  60   no    no t-10   fffff092541520e0 java
 18 fffff042ea912040  1f    2    0  60   no    no t-0    fffff043ed036c60 node
 19 fffff042ea90eb00  1f    4    0  60   no    no t-3    fffff0438332a3a0 svc.startd
 20 fffff042ea90b500  1f    2    0  60   no    no t-0    fffff043836317c0 svc.startd
 21 fffff042ea90a000  1f    4    0  60   no    no t-15   fffff088335ca7e0 epmd
 22 fffff042ea908ac0  1f    2    0  60   no    no t-8    fffff04383812820 nscd
 23 fffff042ea901580  1f    3    0  60   no    no t-8    fffff16af2f21b00 rsyslogd
 24 fffff042ea8ff500  1f    2    0  60   no    no t-0    fffff043bfd6e520 httpd
 25 fffff042ea8fe000  1f    2    0 100   no    no t-3    fffff088133b2160 node
 26 fffff042ea8faac0  1f    4    0  60   no    no t-3    fffff0438901f820 node
 27 fffff042ea8f7580  1f    3    0  60   no    no t-0    fffff043e770e500 nscd
 28 fffff042ea8f6080  1f    5    0  60   no    no t-0    fffff08876f1a460 qemu-system-x86_
 29 fffff042eaa1ea80  1f    5    0  60   no    no t-1    fffff088038ffc40 zeus.admin
 30 fffff042eaa1b540  1f    6    0  60   no    no t-3    fffff090e3d168c0 zabbix_agentd
 31 fffff042eaa1a040  1f    4    0  60   no    no t-23   fffff0886b843c00 rsyslogd
 32 fffff042eaa18b00  1f    7    0  60   no    no t-0    fffff0879d25e0a0 zeus.zxtm
 33 fffff042eaa11500  1f    4    0  60   no    no t-5    fffff09291740ba0 qemu-system-x86_
 34 fffff042eaa10000  1f    6    0  60   no    no t-0    fffff16ae02bf000 svc.startd
 35 fffff042eaa0c040  1f    4    0  60   no    no t-8    fffff04383336400 mysqld
 36 fffff042eaa06b00  1f    6    0  60   no    no t-1    fffff078aa978000 node
 37 fffff042eaa01500  1f    4    0  60   no    no t-1    fffff088066b3820 zeus.zxtm
 38 fffff042eaa00000  1f    5    0  60   no    no t-0    fffff1627df3a860 node
 39 fffff042eab06ac0  1f    4    0  60   no    no t-26   fffff0439ec567e0 ipmon
 40 fffff042eab03580  1f    5    0  60   no    no t-3    fffff0437734b4c0 svc.startd
 41 fffff042eab02080  1f    4    0  60   no    no t-3    fffff043ac366b20 nscd
 42 fffff042eaafca80  1f    3    0  60   no    no t-0    fffff08a47400440 nscd
 43 fffff042eaaf9540  1f    4    0  60   no    no t-13   fffff04383338b20 svc.configd
 44 fffff042eaaf8040  1f    2    0  97   no    no t-118  fffff001e9c83c40 pageout
 45 fffff042eaaf4080  1f    5    0  60   no    no t-1    fffff08806311140 mysqld
 46 fffff042eaaf0a80  1f   12    0  60   no    no t-4    fffff04383336060 inetd
 47 fffff042eabe2040  1f    7    0  60   no    no t-15   fffff043b4370c00 nginx

A Noah's Ark of programs – all runnable! It turns out that they are all in htable_steal():

> ::stacks -t run
THREAD           STATE    SOBJ                COUNT
fffff0431bddb0e0 RUN      <NONE>                572
                 <thread not in memory>

fffff0431c47fbc0 RUN      <NONE>                142
                 swtch+0x141
                 turnstile_block+0x21a
                 mutex_vector_enter+0x3a3
                 htable_steal_active+0x85
                 htable_steal+0x1ce
                 htable_alloc+0xf9
                 htable_create+0x1e9
                 hati_load_common+0x198
                 hat_memload+0x75
                 hat_memload_region+0x25
                 segvn_faultpage+0x462
                 segvn_fault+0x8e6
                 as_fault+0x37a
                 pagefault+0x96
                 trap+0x2c7

fffff0883dd677e0 RUN      <NONE>                 34
                 swtch+0x141
                 turnstile_block+0x21a
                 mutex_vector_enter+0x3a3
                 htable_steal_active+0x85
                 htable_steal+0x1ce
                 htable_alloc+0xf9
                 htable_create+0x1e9
                 hati_load_common+0x198
                 hat_memload+0x75
                 hat_memload_region+0x25
                 segvn_faultpage+0x462
                 segvn_fault+0x8e6
                 as_fault+0x37a
                 pagefault+0x96
                 trap+0xd23
                 0xfffffffffb8001d6
                 bcopy+0x1b4
                 lwp_park+0x2bb
                 syslwp_park+0x63
                 _sys_sysenter_post_swapgs+0x153

fffff078b9a67b60 RUN      <NONE>                 24
                 swtch+0x141          
                 swapout_lwp+0x8e
                 fss_trapret+0x5f
                 post_syscall+0x5dc
                 syscall_exit+0x68

The thing that's really annoying here is that the ARC is huge:

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   11436030             44671   17%
ZFS File Data            38191184            149184   57%
Anon                     17103265             66809   25%
Exec and libs                 268                 1    0%
Page cache                  23610                92    0%
Free (cachelist)           224953               878    0%
Free (freelist)            102515               400    0%

Total                    67081825            262038
Physical                 67081824            262038

So why isn't the ARC shrinking?!

> ::stacks -c arc_reclaim_thread
THREAD           STATE    SOBJ                COUNT
fffff001ea593c40 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait+0x70
                 taskq_wait+0x43
                 kmem_cache_reap_now+0x44
                 arc_kmem_reap_now+0x68
                 arc_reclaim_thread+0x21a
                 thread_start+8

So we're blocking on kmem reaping; how is that coming along?

> *kmem_taskq::taskq -t
ADDR             NAME                             ACT/THDS Q'ED  MAXQ INST
fffff042dfd78d28 kmem_taskq                         1/   1  389   542    0
    THREAD           STATE    SOBJ                COUNT
    fffff001e8217c40 RUN      <NONE>                  1
                     swtch+0x141
                     turnstile_block+0x21a
                     mutex_vector_enter+0x3a3
                     htable_steal_active+0x85
                     htable_steal+0x1ce
                     htable_reap+0x6f
                     kmem_cache_reap+0x3b
                     taskq_thread+0x2d0
                     thread_start+8

And there you have it: we're stealing htables because there's no memory – but we can't free memory because we're fighting over stealing htables. In terms of where the actual bug lies in this seizing contraption, it's telling that we have not, in fact, had an allocation failure out of the htable cache:

> *htable_cache::print kmem_cache_t cache_alloc_fail 
cache_alloc_fail = 0

Or indeed, any cache:

> ::walk kmem_cache | ::print kmem_cache_t cache_alloc_fail ! sort | uniq -c
    501 cache_alloc_fail = 0

So beyond the stealing induced by htable_reap(), why are we even stealing at all? We're stealing because ptable_alloc() actually fails if freemem is below throttlefree:

static pfn_t
ptable_alloc(uintptr_t seed)
{
        pfn_t pfn;
        page_t *pp; 

        pfn = PFN_INVALID;

        /*
         * The first check is to see if there is memory in the system. If we
         * drop to throttlefree, then fail the ptable_alloc() and let the
         * stealing code kick in. Note that we have to do this test here,
         * since the test in page_create_throttle() would let the NOSLEEP
         * allocation go through and deplete the page reserves.
         *
         * The !NOMEMWAIT() lets pageout, fsflush, etc. skip this check.
         */
        if (!NOMEMWAIT() && freemem <= throttlefree + 1)
                return (PFN_INVALID);

In terms of what's inducing the sudden memory pressure, it should come as no surprise that there's a big KVM provision on the scene:

> ::stacks -c memcntl
THREAD           STATE    SOBJ                COUNT
fffff16588cd67e0 RUN      <NONE>                  1
                 swtch+0x141
                 cv_wait+0x70
                 page_create_throttle+0x16a
                 page_create_va+0x5d4
                 swap_getapage+0x113
                 swap_getpage+0x90
                 fop_getpage+0x7e
                 anon_zero+0xb6
                 segvn_lockop+0xd1a
                 as_ctl+0x2c5
                 memcntl+0x4de

So based on this, we can entirely describe what's happening: after the system has been up for a while and has grown a pretty big ARC (~150G), we have a 64G KVM provision. The need to lock this memory (and therefore displace ARC) quickly drives memory down to the point at which the ARC will aggressively shrink (namely, when freemem gets below lotsfree + needfree + desfree). However, ARC reclamation will block on its ability to reap several key kmem caches – which in turn blocks on the complete kmem_reap() kicked off by schedpaging().

The ARC reclamation being blocked behind all of kmem reaping might be fine were it not for htable_reap(), which is making the rather capricious decision to steal, and then return to the system, 11,525 of the system's ~230,000 htables. But because the system is active and running, and because the htable stealing logic is so ill-conceived (essentially no consideration is given to the level of activity of the victim), the victims quickly become perpetrators themselves: they page fault on their missing htables and then steal an htable from someone else.

Turning essentially every process in the system into an htable thief causes lock contention to spike to astronomical levels and throughput to drop to nearly zero as thieves back up on locks. Indeed, even the ARC reclamation is caught in the disaster of its own making, and can't meaningfully progress to the one act that would relieve the system: shrinking the ARC.

The executive summary is that 100% of this stealing activity is being induced by the htable_steal() from kmem reaping – this wound is entirely self-inflicted. It's particularly galling that:

  • The reap callback has no tunables – it always decides to reap up to 5% of physical memory (!!), with a minimum of ten pages of page tables.
  • htable theft from the reap callback is not meaningfully differentiated from theft due to more dire memory scarcity: htable_steal() will steal active page tables even though this is really beyond the spirit of the reap callback.
  • Active page tables are stolen in an incredibly ham-handed fashion. Not only are the targets essentially random, but there is seemingly no acknowledgment in the code that reaping may put the system into a prolonged state of cascading theft.

There are a bunch of ways to plausibly fix this:

  • Do not have a kmem reap callback at all for the htable cache, because the system will already start to steal when memory is dangerously low (namely, when KM_NOSLEEP allocations start to fail).
  • Have a kmem reap callback, but make the number of page tables stolen by htable_steal() via the reap callback a tunable – and tune it much, much lower than 5%. (Much lower.)
  • Do not allow the reap callback to ever steal active page tables.
  • Do not allow the reap callback to make more than a single pass.

(Some combinations of options are obviously redundant and others are mutually exclusive.)

History

#1

Updated by Electric Monk about 4 years ago

  • Status changed from New to Closed

git commit 091194e6fc9fb462c843bd5f1beea58d43650378

commit  091194e6fc9fb462c843bd5f1beea58d43650378
Author: Bryan Cantrill <bryan@joyent.com>
Date:   2015-09-13T14:56:21.000Z

    6202 htable_reap() kicks off htable crime wave, destroying system
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
    Reviewed by: Richard PALO <richard@netbsd.org>
    Approved by: Albert Lee <trisk@omniti.com>
