Bug #12701
segspt_minfree needs right-sizing (closed)
Added by John Levon about 2 years ago. Updated about 2 years ago.
% Done: 100%
Updated by John Levon about 2 years ago
We discovered in production that sometimes restarting postgres would fail (OS-7244):
applying pg config: postgres exited unexpectedly (code 1); stdout = , stderr = 2017-10-15 05:42:00 UTC [27792]: [1-1] user=,db=,app=,client= FATAL: shmat(id=10, addr=0, flags=0x4000) failed: Not enough space
The behaviour was captured via https://github.com/jlevon/grot/blob/master/manatee-381/manatee-381.d
My analysis from the above bug:
Note that previously we will have gone through shm_get() to create the segment.
For that to work, the segment size is checked against the rctls.
As can be seen above, these should be fine. We also, though, pass through anon_resv().
In our case, this will basically look at:
    pswap_pages = k_anoninfo.ani_max - k_anoninfo.ani_phys_resv

    struct k_anoninfo {
        pgcnt_t ani_max = 0x7fbffff
        pgcnt_t ani_free = 0x7ee97a2
        pgcnt_t ani_phys_resv = 0x2fce4d
        pgcnt_t ani_mem_resv = 0
        pgcnt_t ani_locked_swap = 0
    }
So, since we clearly have 8GB of physical swap available, anon_resvmem() happily
adjusts k_anoninfo.ani_phys_resv and everything is hunky-dory. It has no way of knowing
that in fact, we are about to request non-SHM_PAGEABLE memory when we actually
try to attach to this. So we will not go down the path that reserves into ani_mem_resv here.
Note that when we have attached one of these segments, we should find that
those are covered by ani_locked_swap (and ani_mem_resv is at least that value too).
Taking one of these DTrace outputs and re-arranging in time order:
    T8236879223522:pid363695t1 shmat(9, 0)
    ...
    T8236879554823:pid363695t1 segspt_create()
    ...
    T8236879562408:pid363695t1 anon_swap_adjust(2163198)
So we got as far as segspt_create() and:
    if ((sptcargs->flags & SHM_PAGEABLE) == 0) {
        if (err = anon_swap_adjust(npages))
            return (err);
    }

    $ unit 2163198 p | grep gigabytes
    8.251945 gigabytes
So we're asking to lock about 8GB in memory, and need to adjust for that in the memory swap accounting.
    unlocked_mem_swap = k_anoninfo.ani_mem_resv
        - k_anoninfo.ani_locked_swap;
    if (npages > unlocked_mem_swap) {
        spgcnt_t adjusted_swap = npages - unlocked_mem_swap;

        /*
         * if there is not enough unlocked mem swap we take missing
         * amount from phys swap and give it to mem swap
         */
        if (!page_reclaim_mem(adjusted_swap, segspt_minfree, 1)) {
            mutex_exit(&anoninfo_lock);
            return (ENOMEM);
        }
unlocked_mem_swap is zero, as discussed above. So we must now grab our 8GB
from free memory, i.e. by adjusting availrmem in page_reclaim_mem().
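
The shortfall calculation can be sketched in isolation. This is a user-space illustration of the arithmetic in the snippet above, not the kernel function; mem_swap_shortfall() is a hypothetical name, and it assumes ani_mem_resv is at least ani_locked_swap, as noted earlier.

```c
#include <stdint.h>

/*
 * Pages that have to move from phys swap to mem swap before npages can be
 * locked. Mirrors the unlocked_mem_swap arithmetic quoted above.
 * Returns 0 when the existing unlocked mem swap already covers the request.
 * Assumption: ani_mem_resv >= ani_locked_swap, so the subtraction cannot wrap.
 */
static uint64_t
mem_swap_shortfall(uint64_t ani_mem_resv, uint64_t ani_locked_swap,
    uint64_t npages)
{
	uint64_t unlocked_mem_swap = ani_mem_resv - ani_locked_swap;

	if (npages > unlocked_mem_swap)
		return (npages - unlocked_mem_swap);
	return (0);
}
```

With ani_mem_resv and ani_locked_swap both zero, the shortfall is the full npages=2163198, which is what gets handed to page_reclaim_mem().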
We're here:
    int
    page_reclaim_mem(pgcnt_t npages, pgcnt_t epages, int adjust)

    T8236879571028:pid363695t1 page_reclaim_mem(2163198, 6182286, 1); availrmem = 7916950, t_minarmem = 25
So we're asking for our npages=2163198, with segspt_minfree=6182286 pages (about 23.5GB). We'll loop, temporarily adjusting needfree, kicking off a kmem_reap, and hoping enough turns up in availrmem for our needs, which is 2163198p + 6182286p + tune.t_minarmem ~= 31GB. availrmem is at about 30GB.
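
The target that availrmem has to reach can be worked out directly from the traced arguments. A small sketch, assuming 4K pages; the helper names are mine, and the condition is the one described in the paragraph above (npages + epages + t_minarmem), not a copy of the kernel loop.

```c
#include <stdint.h>

/* Assumption: 4 KiB base pages. */
#define PAGE_BYTES 4096ULL

/* Pages availrmem must reach: npages + epages + t_minarmem. */
static uint64_t
reclaim_target(uint64_t npages, uint64_t epages, uint64_t t_minarmem)
{
	return (npages + epages + t_minarmem);
}

/* Pages still missing from availrmem; 0 if the target is already met. */
static uint64_t
reclaim_deficit(uint64_t availrmem, uint64_t npages, uint64_t epages,
    uint64_t t_minarmem)
{
	uint64_t target = reclaim_target(npages, epages, t_minarmem);

	return (availrmem >= target ? 0 : target - availrmem);
}

/* Convert a page count to GiB for readability. */
static double
pages_to_gib(uint64_t pages)
{
	return ((double)(pages * PAGE_BYTES) / (double)(1ULL << 30));
}
```

Plugging in the traced values (npages=2163198, epages=6182286, t_minarmem=25, availrmem=7916950) gives a target of 8345509 pages and a deficit of 428559 pages, roughly 1.6GB short.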
So we eventually find sadness:
T8245878764684:pid363695t1 page_reclaim_mem() -> 0 after 18 iterations, availrmem now 7976482
So we reaped ... a bit? Either way, we're way off what we supposedly need here.
Now, this is all clearly a little bit bonkers because of segspt_minfree. It also explains why a simple reboot of the zone can cause our issue: imagine we have just enough to satisfy the allocation initially (that is, our 8GB plus the segspt_minfree). After this, something else, such as kmem, eats into availrmem. Now when we shut down, we'll release the 8GB back, so in theory we should be able to get it back. But in the meantime, the other user, which is not bound by segspt_minfree, has changed the equation such that we can't get our 8GB back with the segspt_minfree headroom.
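
The restart scenario can be replayed with a toy model. This is only a sketch under simplifying assumptions: reclaim is ignored entirely, availrmem is the only state, and all names here (mem_model, ism_lock and so on) are hypothetical rather than kernel interfaces.

```c
#include <stdint.h>

/* Toy model of the accounting; not kernel state. */
struct mem_model {
	uint64_t availrmem;
	uint64_t minfree;	/* stands in for segspt_minfree */
	uint64_t t_minarmem;
};

/* Returns 1 and deducts npages on success, 0 where the kernel gives ENOMEM. */
static int
ism_lock(struct mem_model *m, uint64_t npages)
{
	if (m->availrmem < npages + m->minfree + m->t_minarmem)
		return (0);
	m->availrmem -= npages;
	return (1);
}

/* Releasing the segment hands the pages back. */
static void
ism_unlock(struct mem_model *m, uint64_t npages)
{
	m->availrmem += npages;
}

/* A consumer that is not bound by minfree, such as kmem. */
static void
other_consume(struct mem_model *m, uint64_t pages)
{
	m->availrmem -= pages;
}

/*
 * Replay: start, outside consumption, stop, restart.
 * Returns -1 if the first start fails, else the result of the restart.
 * The model is passed by value so the caller's copy is untouched.
 */
static int
restart_succeeds(struct mem_model m, uint64_t npages, uint64_t eaten)
{
	if (!ism_lock(&m, npages))
		return (-1);
	other_consume(&m, eaten);
	ism_unlock(&m, npages);
	return (ism_lock(&m, npages));
}
```

Starting with exactly enough availrmem for the first attach, a single page consumed elsewhere is enough to make the restart fail in this model, while a much smaller minfree leaves plenty of slack.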
This could happen generally, of course. But we're really not short on memory on these systems for what we're doing here. The problem appears to be all around an unnecessarily large segspt_minfree setting.
    /*
     * segspt_minfree is the memory left for system after ISM
     * locked its pages; it is set up to 5% of availrmem in
     * sptcreate when ISM is created. ISM should not use more
     * than ~90% of availrmem; if it does, then the performance
     * of the system may decrease. Machines with large memories may
     * be able to use up more memory for ISM so we set the default
     * segspt_minfree to 5% (which gives ISM max 95% of availrmem.
     * If somebody wants even more memory for ISM (risking hanging
     * the system) they can patch the segspt_minfree to smaller number.
     */
    pgcnt_t segspt_minfree = 0;
Of course, this comment is pretty ancient. The old Solaris tuning doc says this:
When to Change
On database servers with large amounts of physical memory using ISM, this parameter can be tuned downward. If ISM segments are not used, this parameter has no effect. A maximum value of 128 Mbytes (0x4000) is almost certainly sufficient on large memory machines.
which is... slightly different... from the 23GB we're getting right now.
It seems clear we can tune this down, especially on our systems where we should be protected by the zone rctls anyway. What's not yet obvious is what the calculation should be, and whether it's something we can deploy by default, or if we need /etc/system tweaks.
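
To put the two sizings side by side, here is a sketch of the 5% default next to one possible capped variant. The cap is purely an illustration of the kind of calculation being considered, not the change that was committed; the function names and the 4K page size are assumptions.

```c
#include <stdint.h>

/* Assumption: 4 KiB base pages. */
#define PAGE_BYTES 4096ULL

/* Default described in the comment above: 5% of availrmem, in pages. */
static uint64_t
minfree_default(uint64_t availrmem)
{
	return (availrmem / 20);
}

/*
 * Hypothetical alternative: the same 5%, but never more than cap_bytes.
 * Illustration only; this is not the committed fix.
 */
static uint64_t
minfree_capped(uint64_t availrmem, uint64_t cap_bytes)
{
	uint64_t pct = availrmem / 20;
	uint64_t cap = cap_bytes / PAGE_BYTES;

	return (pct < cap ? pct : cap);
}
```

If the 5% rule produced the traced 6182286 pages, availrmem was around 123.6 million pages when the value was set; capping that at the tuning doc's 128 Mbytes would give 32768 4K pages instead.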
To fix this, I added some basic tuning of this value on larger systems.
The fix has been running a while at Joyent. I also verified quickly on OI that Postgres 12 started up and operated OK.
Updated by John Levon about 2 years ago
Note that bug #7241 apparently saw a similar issue previously; however the fix there (try a bit harder) was not sufficient.
Updated by Electric Monk about 2 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit 6fca39246c7001c1b47c1fc2bc82de657dbd8a8e
commit 6fca39246c7001c1b47c1fc2bc82de657dbd8a8e
Author: John Levon <john.levon@joyent.com>
Date: 2020-05-07T15:28:34.000Z

    12701 segspt_minfree needs right-sizing
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
    Approved by: Gordon Ross <gordon.w.ross@gmail.com>