Bug #12701


segspt_minfree needs right-sizing

Added by John Levon about 2 years ago. Updated about 2 years ago.

Actions #1

Updated by John Levon about 2 years ago

We discovered in production that sometimes restarting postgres would fail (OS-7244):

    applying pg config: postgres exited unexpectedly (code 1); stdout = , stderr = 2017-10-15 05:42:00 UTC [27792]: [1-1] user=,db=,app=,client= FATAL:  shmat(id=10, addr=0, flags=0x4000) failed: Not enough space

The behaviour was captured via DTrace.

My analysis from the above bug:

Note that previously we will have gone through shmget() to create the segment.
For that to work, the segment size is checked against the rctls.
As can be seen above, these should be fine. We also, though, pass through anon_resv().

In our case, this will basically look at:

pswap_pages = k_anoninfo.ani_max - k_anoninfo.ani_phys_resv

struct k_anoninfo {
    pgcnt_t ani_max = 0x7fbffff
    pgcnt_t ani_free = 0x7ee97a2
    pgcnt_t ani_phys_resv = 0x2fce4d
    pgcnt_t ani_mem_resv = 0
    pgcnt_t ani_locked_swap = 0

So, since we clearly have 8Gb of physical swap available, anon_resvmem() happily
adjusts k_anoninfo.ani_phys_resv and everything is hunky-dory. It has no way of knowing
that we are in fact about to request non-SHM_PAGEABLE memory when we actually
try to attach, so we will not go down the path that reserves into ani_mem_resv here.

Note that when we have attached one of these segments, we should find that
those are covered by ani_locked_swap (and ani_mem_resv is at least that value too).

Taking one of these DTrace outputs and re-arranging in time order:

T8236879223522:pid363695t1 shmat(9, 0)
T8236879554823:pid363695t1 segspt_create()
T8236879562408:pid363695t1 anon_swap_adjust(2163198)

So we got as far as segspt_create() and:

 543         if ((sptcargs->flags & SHM_PAGEABLE) == 0) {                             
 544                 if (err = anon_swap_adjust(npages))                              
 545                         return (err);                                            
 546         }

$ unit 2163198 p | grep gigabytes
8.251945 gigabytes                                  

So we're asking to lock 8Gb in memory, and need to adjust for that in the memory swap accounting.

3517         unlocked_mem_swap = k_anoninfo.ani_mem_resv                              
3518             - k_anoninfo.ani_locked_swap;                                        
3519         if (npages > unlocked_mem_swap) {                                        
3520                 spgcnt_t adjusted_swap = npages - unlocked_mem_swap;             
3522                 /*                                                               
3523                  * if there is not enough unlocked mem swap we take missing      
3524                  * amount from phys swap and give it to mem swap                 
3525                  */                                                              
3526                 if (!page_reclaim_mem(adjusted_swap, segspt_minfree, 1)) {       
3527                         mutex_exit(&anoninfo_lock);                              
3528                         return (ENOMEM);                                         
3529                 }                                                                

unlocked_mem_swap is zero, as we discussed above. So, we must now grab our 8Gb
from free memory i.e. by adjusting availrmem in page_reclaim_mem().

We're here:

5706 int                                                                              
5707 page_reclaim_mem(pgcnt_t npages, pgcnt_t epages, int adjust)

T8236879571028:pid363695t1 page_reclaim_mem(2163198, 6182286, 1); availrmem = 7916950, t_minarmem = 25

So we're asking for our npages=2163198, with segspt_minfree at about 23.5Gb. We'll loop, temporarily adjusting needfree, kicking off a kmem_reap, and hoping enough turns up in availrmem for our needs, namely 2163198p + 6182286p + tune.t_minarmem ~= 31Gb. availrmem is at about 30Gb.

So we eventually find sadness:

T8245878764684:pid363695t1 page_reclaim_mem() -> 0 after 18 iterations, availrmem now 7976482

So we reaped ... a bit? Either way, we're way off what we supposedly need here.

Now, this is all clearly a little bit bonkers because of segspt_minfree. It also explains why a simple reboot of the zone can cause our issue: imagine we have just enough to satisfy the allocation initially (that is, our 8Gb plus the segspt_minfree headroom). After this, something else, such as kmem, eats into availrmem. When we shut down, we release the 8Gb back, so in theory we should be able to get it again. But in the meantime, the other user (which is not bound by segspt_minfree) has changed the equation such that we can't get our 8Gb back with the segspt_minfree headroom.

This could happen generally, of course. But we're really not short on memory on these systems for what we're doing here. The problem appears to be all around an unnecessarily large segspt_minfree setting.

  62 /*                                                                               
  63  * segspt_minfree is the memory left for system after ISM                        
  64  * locked its pages; it is set up to 5% of availrmem in                          
  65  * sptcreate when ISM is created.  ISM should not use more                       
  66  * than ~90% of availrmem; if it does, then the performance                      
  67  * of the system may decrease. Machines with large memories may                  
  68  * be able to use up more memory for ISM so we set the default                   
  69  * segspt_minfree to 5% (which gives ISM max 95% of availrmem.                   
  70  * If somebody wants even more memory for ISM (risking hanging                   
  71  * the system) they can patch the segspt_minfree to smaller number.              
  72  */                                                                              
  73 pgcnt_t segspt_minfree = 0;                                                      

Of course, this comment is pretty ancient. The old Solaris tuning doc says this:

When to Change

On database servers with large amounts of physical memory using ISM, this parameter can be tuned downward. If ISM segments are not used, this parameter has no effect. A maximum value of 128 Mbytes (0x4000) is almost certainly sufficient on large memory machines.

which is ... slightly different ... from the 23.5Gb we're getting right now.

It seems clear we can tune this down, especially on our systems where we should be protected by the zone rctls anyway. What's not yet obvious is what the calculation should be, and whether it's something we can deploy by default, or if we need /etc/system tweaks.

To fix this, I added some basic tuning of this value on larger systems.

The fix has been running a while at Joyent. I also verified quickly on OI that Postgres 12 started up and operated OK.

Actions #2

Updated by John Levon about 2 years ago

Note that bug #7241 apparently saw a similar issue previously; however, the fix there (try a bit harder) was not sufficient.

Actions #3

Updated by Electric Monk about 2 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 6fca39246c7001c1b47c1fc2bc82de657dbd8a8e

commit  6fca39246c7001c1b47c1fc2bc82de657dbd8a8e
Author: John Levon <>
Date:   2020-05-07T15:28:34.000Z

    12701 segspt_minfree needs right-sizing
    Reviewed by: Jerry Jelinek <>
    Reviewed by: Patrick Mooney <>
    Approved by: Gordon Ross <>

