Bug #13097

improve VM tunables for modern systems

Added by Joshua M. Clulow 3 months ago. Updated 3 months ago.

Status:
New
Priority:
Normal
Category:
kernel
Start date:
Due date:
% Done:

0%

Estimated time:
Difficulty:
Hard
Tags:
Gerrit CR:

Description

Modern systems tend to be configured to swap to a ZFS zvol device. ZFS pool I/O requires many more (mostly transient) allocations when compared to more traditional swap targets like UFS-backed files or even raw disk devices. In addition to this, modern systems are quite fast at allocating and consuming memory. It seems that the only hard stop on the provision of pages from the free and cache lists to users is throttlefree, and the only hard stop on allocation to the rest of the kernel outside of the pageout() machinery is pageout_reserve.

It is remarkably easy for a current illumos system to end up deadlocked trying to page out, with even a single Python thread allocating as fast as possible. In a 1GB system, pageout_reserve is a little under 2MB, and throttlefree is a little under 4MB. ZFS requires occasional larger bursts of allocation (e.g., when loading metaslabs, adjusting range trees, compressing blocks, assembling I/O buffers for the trip to disk, etc.). 2MB is not enough slack to ensure that pageout() and ZFS can work together to steadily evict memory to disk in these low memory situations.

Even on a larger 8GB system, where pageout_reserve is 16MB today (and throttlefree is 32MB), we can relatively easily hit the deadlock condition, especially when we get unlucky and things like kmem reaping and ARC reduction don't happen in the correct sequence.

Part of the issue is that while lotsfree seems large enough, the actual hard stops are only at the lower thresholds -- which are halved and halved and halved again from there. Instead of making each subsequent threshold half of the threshold above it, it seems we could have a shallower ramp; e.g., some of the thresholds could be 3/4 of the size of the threshold above. Indeed, we could also mandate a larger minimum lotsfree of something like 16MB, so that even 1GB systems will have a reserve pool of around 4MB for pageout.

Finally, for larger systems we probably set too much memory aside! Once pageout_reserve is high enough to make forward progress, the requirement for memory would seem to be more tightly related to the rate at which pageout can or must occur -- rather than the overall size of memory. On a 512GB system, fully 2GB is below throttlefree -- which seems like a lot! We should consider capping lotsfree, or at least greatly detuning its growth, above a certain physical memory size.

Proposed Change

  • lotsfree remains 1/64 of physical memory, but is clamped in the range 16MB through 2048MB
  • minfree (and thus also throttlefree) changes to be 3/4 of desfree
  • pageout_reserve changes to be 3/4 of throttlefree

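For concreteness, here is a minimal user-space sketch of that derivation (illustrative only -- the clamp() helper and MB_TO_PAGES() conversion are made up for the example, and the real computation would live in the kernel's pageout tuning code, not in main()):

#include <stdio.h>

/* Illustrative only: assume 4 KiB pages, matching the tables below. */
#define	PAGESIZE	4096UL
#define	MB_TO_PAGES(mb)	((unsigned long)(mb) * 1048576UL / PAGESIZE)

static unsigned long
clamp(unsigned long v, unsigned long lo, unsigned long hi)
{
	return (v < lo ? lo : (v > hi ? hi : v));
}

int
main(void)
{
	unsigned long physmem = 262144;		/* pages; e.g., a 1GB system */

	/* lotsfree stays at 1/64 of physmem, clamped to [16MB, 2048MB]: */
	unsigned long lotsfree = clamp(physmem / 64,
	    MB_TO_PAGES(16), MB_TO_PAGES(2048));

	/* desfree remains half of lotsfree: */
	unsigned long desfree = lotsfree / 2;

	/* minfree (and throttlefree) become 3/4 of desfree: */
	unsigned long minfree = 3 * desfree / 4;
	unsigned long throttlefree = minfree;

	/* ...and pageout_reserve becomes 3/4 of throttlefree: */
	unsigned long pageout_reserve = 3 * throttlefree / 4;

	printf("lotsfree=%lu desfree=%lu minfree=%lu "
	    "throttlefree=%lu pageout_reserve=%lu (pages)\n",
	    lotsfree, desfree, minfree, throttlefree, pageout_reserve);
	return (0);
}
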
A comparison of original sizes and proposed new sizes for these values with a range of system memory sizes appears below:

======================================================================
physmem          |  131072 p   512 MB
                 | NEW VALUES                  | ORIGINAL VALUES
lotsfree         |    4096 p    16 MB ( 3.12%) |    2048 p     8 MB ( 1.56%)
desfree          |    2048 p     8 MB ( 1.56%) |    1024 p     4 MB ( 0.78%)
minfree          |    1536 p     6 MB ( 1.17%) |     512 p     2 MB ( 0.39%)
throttlefree     |    1536 p     6 MB ( 1.17%) |     512 p     2 MB ( 0.39%)
pageout_reserve  |    1152 p     4 MB ( 0.88%) |     256 p     1 MB ( 0.20%)

======================================================================
physmem          |  262144 p  1024 MB
                 | NEW VALUES                  | ORIGINAL VALUES
lotsfree         |    4096 p    16 MB ( 1.56%) |    4096 p    16 MB ( 1.56%)
desfree          |    2048 p     8 MB ( 0.78%) |    2048 p     8 MB ( 0.78%)
minfree          |    1536 p     6 MB ( 0.59%) |    1024 p     4 MB ( 0.39%)
throttlefree     |    1536 p     6 MB ( 0.59%) |    1024 p     4 MB ( 0.39%)
pageout_reserve  |    1152 p     4 MB ( 0.44%) |     512 p     2 MB ( 0.20%)

======================================================================
physmem          |  524288 p  2048 MB
                 | NEW VALUES                  | ORIGINAL VALUES
lotsfree         |    8192 p    32 MB ( 1.56%) |    8192 p    32 MB ( 1.56%)
desfree          |    4096 p    16 MB ( 0.78%) |    4096 p    16 MB ( 0.78%)
minfree          |    3072 p    12 MB ( 0.59%) |    2048 p     8 MB ( 0.39%)
throttlefree     |    3072 p    12 MB ( 0.59%) |    2048 p     8 MB ( 0.39%)
pageout_reserve  |    2304 p     9 MB ( 0.44%) |    1024 p     4 MB ( 0.20%)

======================================================================
physmem          | 1048576 p  4096 MB
                 | NEW VALUES                  | ORIGINAL VALUES
lotsfree         |   16384 p    64 MB ( 1.56%) |   16384 p    64 MB ( 1.56%)
desfree          |    8192 p    32 MB ( 0.78%) |    8192 p    32 MB ( 0.78%)
minfree          |    6144 p    24 MB ( 0.59%) |    4096 p    16 MB ( 0.39%)
throttlefree     |    6144 p    24 MB ( 0.59%) |    4096 p    16 MB ( 0.39%)
pageout_reserve  |    4608 p    18 MB ( 0.44%) |    2048 p     8 MB ( 0.20%)

======================================================================
physmem          | 2097152 p  8192 MB
                 | NEW VALUES                  | ORIGINAL VALUES
lotsfree         |   32768 p   128 MB ( 1.56%) |   32768 p   128 MB ( 1.56%)
desfree          |   16384 p    64 MB ( 0.78%) |   16384 p    64 MB ( 0.78%)
minfree          |   12288 p    48 MB ( 0.59%) |    8192 p    32 MB ( 0.39%)
throttlefree     |   12288 p    48 MB ( 0.59%) |    8192 p    32 MB ( 0.39%)
pageout_reserve  |    9216 p    36 MB ( 0.44%) |    4096 p    16 MB ( 0.20%)

======================================================================
physmem          | 4194304 p 16384 MB
                 | NEW VALUES                  | ORIGINAL VALUES
lotsfree         |   65536 p   256 MB ( 1.56%) |   65536 p   256 MB ( 1.56%)
desfree          |   32768 p   128 MB ( 0.78%) |   32768 p   128 MB ( 0.78%)
minfree          |   24576 p    96 MB ( 0.59%) |   16384 p    64 MB ( 0.39%)
throttlefree     |   24576 p    96 MB ( 0.59%) |   16384 p    64 MB ( 0.39%)
pageout_reserve  |   18432 p    72 MB ( 0.44%) |    8192 p    32 MB ( 0.20%)

======================================================================
physmem          | 8388608 p 32768 MB
                 | NEW VALUES                  | ORIGINAL VALUES
lotsfree         |  131072 p   512 MB ( 1.56%) |  131072 p   512 MB ( 1.56%)
desfree          |   65536 p   256 MB ( 0.78%) |   65536 p   256 MB ( 0.78%)
minfree          |   49152 p   192 MB ( 0.59%) |   32768 p   128 MB ( 0.39%)
throttlefree     |   49152 p   192 MB ( 0.59%) |   32768 p   128 MB ( 0.39%)
pageout_reserve  |   36864 p   144 MB ( 0.44%) |   16384 p    64 MB ( 0.20%)

======================================================================
physmem          | 16777216 p 65536 MB
                 | NEW VALUES                  | ORIGINAL VALUES
lotsfree         |  262144 p  1024 MB ( 1.56%) |  262144 p  1024 MB ( 1.56%)
desfree          |  131072 p   512 MB ( 0.78%) |  131072 p   512 MB ( 0.78%)
minfree          |   98304 p   384 MB ( 0.59%) |   65536 p   256 MB ( 0.39%)
throttlefree     |   98304 p   384 MB ( 0.59%) |   65536 p   256 MB ( 0.39%)
pageout_reserve  |   73728 p   288 MB ( 0.44%) |   32768 p   128 MB ( 0.20%)

======================================================================
physmem          | 33554432 p 131072 MB
                 | NEW VALUES                  | ORIGINAL VALUES
lotsfree         |  524288 p  2048 MB ( 1.56%) |  524288 p  2048 MB ( 1.56%)
desfree          |  262144 p  1024 MB ( 0.78%) |  262144 p  1024 MB ( 0.78%)
minfree          |  196608 p   768 MB ( 0.59%) |  131072 p   512 MB ( 0.39%)
throttlefree     |  196608 p   768 MB ( 0.59%) |  131072 p   512 MB ( 0.39%)
pageout_reserve  |  147456 p   576 MB ( 0.44%) |   65536 p   256 MB ( 0.20%)

======================================================================
physmem          | 67108864 p 262144 MB
                 | NEW VALUES                  | ORIGINAL VALUES
lotsfree         |  524288 p  2048 MB ( 0.78%) | 1048576 p  4096 MB ( 1.56%)
desfree          |  262144 p  1024 MB ( 0.39%) |  524288 p  2048 MB ( 0.78%)
minfree          |  196608 p   768 MB ( 0.29%) |  262144 p  1024 MB ( 0.39%)
throttlefree     |  196608 p   768 MB ( 0.29%) |  262144 p  1024 MB ( 0.39%)
pageout_reserve  |  147456 p   576 MB ( 0.22%) |  131072 p   512 MB ( 0.20%)

======================================================================
physmem          | 134217728 p 524288 MB
                 | NEW VALUES                  | ORIGINAL VALUES
lotsfree         |  524288 p  2048 MB ( 0.39%) | 2097152 p  8192 MB ( 1.56%)
desfree          |  262144 p  1024 MB ( 0.20%) | 1048576 p  4096 MB ( 0.78%)
minfree          |  196608 p   768 MB ( 0.15%) |  524288 p  2048 MB ( 0.39%)
throttlefree     |  196608 p   768 MB ( 0.15%) |  524288 p  2048 MB ( 0.39%)
pageout_reserve  |  147456 p   576 MB ( 0.11%) |  262144 p  1024 MB ( 0.20%)
#1

Updated by Joshua M. Clulow 3 months ago

Note that my current plan is to play with the values until I cannot deadlock the system just by allocating anymore -- especially with the smaller memory virtual machines that are so easy to topple over today.
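
For reference, the kind of pressure described above does not need anything elaborate; an allocator along these lines (a user-space sketch, not part of the proposed change) is enough to push a small guest through all of the thresholds:

#include <stdlib.h>
#include <string.h>

/*
 * Illustrative stress test: allocate and dirty memory as fast as possible
 * until allocation fails, forcing the system deep into pageout.
 */
int
main(void)
{
	const size_t chunk = 16 * 1024 * 1024;	/* 16MB per allocation */

	for (;;) {
		void *p = malloc(chunk);
		if (p == NULL)
			break;
		/* Touch every page so it must be backed (and later paged out). */
		memset(p, 0xa5, chunk);
	}
	return (0);
}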

#2

Updated by Joshua M. Clulow 3 months ago

Notes from a block comment I intend to integrate with this change:

/*
 * FREE MEMORY MANAGEMENT
 *
 * Management of the pool of free pages is a tricky business.  There are
 * several critical threshold values which constrain our allocation of new
 * pages and inform the rate of paging out of memory to swap.  These threshold
 * values, and the behaviour they induce, are described below in descending
 * order of size -- and thus increasing order of severity!
 *
 *   +---------------------------------------------------- physmem (all memory)
 *   |
 *   | Ordinarily there are no particular constraints placed on page
 *   v allocation.  The page scanner is not running and page_create_va()
 *   | will effectively grant all page requests (whether from the kernel
 *   | or from user processes) without artificial delay.
 *   |
 *   +------------------------------- lotsfree (1.56% of physmem, minimum 32MB)
 *   |
 *   | When we have less than "lotsfree" pages, pageout_scanner() is
 *   v signalled by schedpaging() to begin looking for pages that can
 *   | be evicted to disk to bring us back above lotsfree.  At this
 *   | stage there is still no constraint on allocation of free pages.
 *   |
 *   +------------------ desfree (1/2 of lotsfree, 0.78% of physmem, min. 16MB)
 *   |
 *   | When we drop below desfree, a number of kernel facilities will
 *   v wait before allocating more memory, under the assumption that
 *   | pageout or reaping will make progress and free up some memory.
 *   | This behaviour is not especially coordinated; look for comparisons
 *   | of desfree and freemem.
 *   |
 *   | In addition to various attempts at advisory caution, clock()
 *   | will wake up the thread that is ordinarily parked in sched().
 *   | This routine is responsible for the heavy-handed swapping out
 *   v of entire processes in an attempt to arrest the slide of free
 *   | memory.  See comments in sched.c for more details.
 *   |
 *   +---- minfree & throttlefree (3/4 of desfree, 0.59% of physmem, min. 12MB)
 *   |
 *   | These two separate tunables have, by default, the same value.
 *   v Various parts of the kernel use minfree to signal the need for
 *   | more aggressive reclamation of memory, and sched() is more
 *   | aggressive at swapping processes out.
 *   |
 *   | If free memory falls below throttlefree, page_create_va() will
 *   | use page_create_throttle() to begin holding most requests for
 *   | new pages while pageout and reaping free up memory.  Sleeping
 *   v allocations (e.g., KM_SLEEP) are held here while we wait for
 *   | more memory.  Non-sleeping allocations are generally allowed to
 *   | proceed.
 *   |
 *   +------- pageout_reserve (3/4 of throttlefree, 0.44% of physmem, min. 9MB)
 *   |
 *   | When we hit throttlefree, the situation is already dire.  The
 *   v system is generally paging out memory and swapping out entire
 *   | processes in order to free up memory for continued operation.
 *   |
 *   | Unfortunately, evicting memory to disk generally requires short
 *   | term use of additional memory; e.g., allocation of buffers for
 *   | storage drivers, updating maps of free and used blocks, etc.
 *   | As such, pageout_reserve is the number of pages that we keep in
 *   | special reserve for use by pageout() and sched() and by any
 *   v other parts of the kernel that need to be working for those to
 *   | make forward progress such as the ZFS I/O pipeline.
 *   |
 *   | When we are below pageout_reserve, we fail or hold any allocation
 *   | that has not explicitly requested access to the reserve pool.
 *   | Access to the reserve is generally granted via the KM_PUSHPAGE
 *   | flag, or by marking a thread T_PUSHPAGE such that all allocations
 *   | can implicitly tap the reserve.  For more details, see the
 *   v NOMEMWAIT() macro, the T_PUSHPAGE thread flag, the KM_PUSHPAGE
 *   | and VM_PUSHPAGE allocation flags, and page_create_throttle().
 *   |
 *   +---------------------------------------------------------- no free memory
 *   |
 *   | If we have arrived here, things are very bad indeed.  It is
 *   v surprisingly difficult to tell if this condition is even fatal,
 *   | as enough memory may have been granted to pageout() and to the
 *   | ZFS I/O pipeline that requests for eviction that have already been
 *   | made will complete and free up memory some time soon.
 *   |
 *   | If free memory does not materialise, the system generally remains
 *   | deadlocked.  The pageout_deadman() below is run once per second
 *   | from clock(), seeking to limit the amount of time a single request
 *   v to page out can be blocked before the system panics to get a crash
 *   | dump and return to service.
 *   |
 *   +-------------------------------------------------------------------------
 */
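
For illustration, an allocation made on behalf of the pageout path might tap the reserve described above like this (a sketch only -- pushout_buf_alloc() is a hypothetical helper, but KM_PUSHPAGE and kmem_zalloc() are the real kmem interfaces referenced in the comment):

#include <sys/types.h>
#include <sys/kmem.h>

/*
 * Hypothetical helper on the pageout/ZFS I/O path: buffers needed to push
 * pages to disk must not block waiting for the very memory they are trying
 * to free, so they are allocated with KM_PUSHPAGE and may draw the free
 * list down as far as pageout_reserve.
 */
static void *
pushout_buf_alloc(size_t size)
{
	/*
	 * A plain KM_SLEEP allocation would instead be held in
	 * page_create_throttle() once freemem drops below throttlefree,
	 * which is exactly how pageout can deadlock against itself.
	 */
	return (kmem_zalloc(size, KM_PUSHPAGE));
}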
#3

Updated by Peter Tribble 3 months ago

So, a couple of thoughts:

How many people are using illumos with ZFS on a 512M system? I can imagine that would be somewhat painful. And if ZFS is problematic, IPS is doubly so. For Tribblix, I know that users down at that end of the spectrum often use UFS. Are the new settings for a 512M system excessive (it seems like an extremely large change to me)?

Is there significant use of systems with no swap device at all? I've occasionally seen this concept promoted. It doesn't apply to me (in the Java world, you want to have significant swap for anonymous reservations even if you never intend to actually write to it).

#4

Updated by Joshua M. Clulow 3 months ago

Those are fair questions. I'm not completely convinced that ZFS isn't doable in 512MB. There are some things that stand in the way of that today, like pageout deadlocks and some of the other issues I've just filed, but I don't yet believe it's impossible. It's pretty cheap for me to look at it as I'm testing this anyway. I have been focused on the 8GB VM I've been using and have been able to deadlock it with surprising (to me) ease -- but I'm going to go run the stress tests on smaller VMs too.

IPS isn't inherently impossible, but you do have to keep the repository free of old versions. It helps a lot to be able to swap without the system falling over as well.

When you say you see issues with ZFS on smaller systems, what are they exactly? Hangs? Just slow performance? Not much free memory left for user processes? Something else?

I appreciate the feedback -- I know you've worked pretty extensively on smaller memory footprints with Tribblix in the past.

#5

Updated by Joshua M. Clulow 3 months ago

Oh, and yes, I think a swap file of at least some size is pretty critical in a world without overcommitted memory -- at least if you want to be able to fork!

#6

Updated by Joshua M. Clulow 3 months ago

  • Description updated (diff)
#7

Updated by Joshua M. Clulow 3 months ago

Peter, I have done some more testing in 512MB guests with this fix rebased on the end of the ZFS (and other) changes I have queued up in my pageout branch: https://github.com/illumos/illumos-gate/compare/master...jclulow:pageout

I think I can drop the lotsfree floor from 48MB to 16MB and still be relatively deadlock resistant. I tried 12MB, but had at least one deadlock during testing with that value.

#8

Updated by Peter Tribble 3 months ago

Thanks.

As for small memory systems, I suspect the case of interest is something like AWS (or VirtualBox) where you can vary the size of the instance. I don't think you can install to ZFS on a 512M system (even with Tribblix, which is pretty light). So the use cases for such small systems are quite different - either an AWS [other providers may be available] nano instance with an existing installed AMI, or an old physical system. I'm slightly wary of optimizing the one case at the expense of the other.

#9

Updated by Electric Monk 3 months ago

  • Gerrit CR set to 890
#10

Updated by Rich Lowe 3 months ago

Peter Tribble wrote:

Thanks.

As for small memory systems, I suspect the case of interest is something like AWS (or VirtualBox) where you can vary the size of the instance. I don't think you can install to ZFS on a 512M system (even with Tribblix, which is pretty light). So the use cases for such small systems are quite different - either an AWS [other providers may be available] nano instance with an existing installed AMI, or an old physical system. I'm slightly wary of optimizing the one case at the expense of the other.

I don't think that's a risk, and I think with these changes you could install into 512M. The whole point here is that the reason things lock up in such cases is poor tuning and increased allocation in the pageout path. With an installer that configured swap for itself early (as would be wise), it has as much chance of success as anything.

The major difference would be that an installer may be running in part out of a ramdisk, further increasing memory pressure. That wouldn't be affected by changes such as these (unless the ramdisk code is poor).
