Bug #7056

tmpfs in a non-global zone does not abide by tmpfs_minfree

Added by Robert Mustacchi over 4 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Category: kernel
Start date: 2016-06-06
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

We've seen a number of core dumps in us-beta from gsort that match this signature:

core '00-25-90-94-c0-38/core.gsort.35797' of 35797:    sort
 fee73fc4 _ndoprnt (, , , 0, , fefc36e0) + c
 fee77db0 snprintf (, 7d0, , , , ) + 78
 080637ad vasnprintf (, , , , 2, feb24608) + 89d
 08061ab2 rpl_vfprintf (, , , , , ) + 42
 08060433 error_tail (, , , fef42bd3, , ) + 23
 080604ec error    (0, 1c, , , , ) + 4c
 080557b7 die      (, 1, 8, , , ) + 37
 08056806 ???????? (40, fef22fc0, 1, , f, f)
 080593d5 mergefps (, , , 0, 0, 0) + 525
 08059527 mergefiles (, , 18, feb11000, 820, ) + 47
 08059797 merge    (0, , , , , feffb0a4) + 77
 080699e4 main     (1, , , , , 0) + 2144
 080551e3 _start   (1, , 0, , , ) + 83

This may be related to being out of memory; the problem may have gone away when the size of the zone (a reducer) was increased.


In all cases, we died (somewhat paradoxically) trying to grow the stack while trying to emit a fatal error. This is odd because (1) the dumps themselves are not that large and (2) so many pages have been successfully allocated before the failed stack growth.

The key is the nature of the error; here's the stack of one of the dumps:

libc_hwcap1.so.1`_ndoprnt+0xc(8046910, 80468ec, 80468c0, 0, 80468d0, fefc36e0)
libc_hwcap1.so.1`snprintf+0x78(80471a0, 7d0, 8046910, 806b849, 80469b0, 8046a2c)
vasnprintf+0x89d(80471a0, 804719c, 806b7f1, 80479ec, 80bdf00, feb24194)
rpl_vfprintf+0x42(807e5a0, 806b7f1, 80479ec, 893ee01, 893ee01, 806b849)
error_tail+0x23(80479ec, 806bb8c, 8047ec0, fef42bd3, 8105fd0, 806b849)
error+0x4c(0, 1c, 806b7f1, 806b849, 893ee01, 806b849)
die+0x37(806b849, 1, 8, 807e5b0, 892fb40, 0)
0x8056806(892fb40, 8039528, 8047aa8, fee9c7fc, 892ff30, 0)
sortlines.isra.13+0x224(892fb10, 8047ca4, 807e5b0, 893ee01, 8047bb4, feffb0a4)
main+0xd1c(1, 8047e30, 8047e38, 8055182, 8067880, 0)
_start+0x83(1, 8047ec0, 0, 8047ec5, 8047f18, 8047f2b)

Here's the message it's attempting to write out:

> 806b849/s
0x806b849:      write failed

The errno is 0x1c (28 decimal) – ENOSPC. pfiles (and its new ability to run on a core dump) shows what's going on:

pfiles core.gsort.81257
core 'core.gsort.81257' of 81257:    sort
   0: S_IFIFO mode:0000 dev:530,0 ino:30244602 uid:0 gid:0 rdev:0,0
      O_RDWR
      offset:5120
   1: S_IFIFO mode:0000 dev:530,0 ino:30244604 uid:0 gid:0 rdev:0,0
      O_RDWR
      offset:0
   2: S_IFREG mode:0644 dev:90,66482 ino:513096 uid:0 gid:0 size:6
      O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
      /var/tmp/marlin_task/stderr
      offset:6
   3: S_IFREG mode:0600 dev:537,22679 ino:1772486441 uid:0 gid:0 size:647168
      O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE
      /tmp/sortBWasRN
      offset:647168

The problem, then, is that we have run out of space in /tmp – which is to say that we've run out of swap. Our attempt to record this error results in stack growth, which requires swap to back it – but because swap has been entirely exhausted, this fails and we die on a bus error or seg fault. Now, something has exhausted all of /tmp – and it's clearly not this single file.

The tmpfs exhaustion immediately brings up three questions: (1) Should Marlin zones not have /tmp backed by DRAM? (2) Should Marlin zones have /tmp cleaned even when the zone is not being reset? (3) Should the swap backing a zone be much larger than our current 2X DRAM? It seems that the answer to the first two is likely "no" – and even if the answer to the third is to significantly increase the size of swap, this failure mode is clearly still both possible and undesirable. This leads to the most actionable question: should tmpfs be able to consume every last page of swap within a zone? There is, in fact, a tunable – tmpfs_minfree – that dictates the amount of swap that tmpfs is to deliberately not consume. This value is non-zero and should be sufficiently so (it's 2MB), so why is this mechanism not working?

A little DTrace reveals the root cause: if takemem is not set and the zone is non-NULL, anon_resvmem() makes no attempt to check the zone's max-swap resource control. tmp_resv() very much relies on being able to accurately perform this check via its use of the anon_checkspace() macro; if anon_checkspace() succeeds with the number inflated by tmpfs_minfree, it will perform the true allocation with the actual number – explaining why use of tmpfs was able to reserve beyond tmpfs_minfree. (For whatever it's worth, it looks like this has been broken since the introduction of per-zone swap controls in 2006.) While the other questions remain legit (especially the swap sizing), that tmpfs use in a non-global zone does not abide by tmpfs_minfree is clearly busted; had that mechanism been operating properly, gsort would not have dumped core, and we would have seen a much crisper failure mode.
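To make the check-then-reserve interaction concrete, here is a small user-space model of it. This is only a sketch and not the kernel code: model_zone_t, model_resvmem(), model_tmp_resv(), model_fill() and the check_cap_on_query flag are all hypothetical names invented for illustration, the global swap check is left out entirely, and the "fixed" variant simply assumes that the query path should consult the zone cap too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MB ((size_t)1024 * 1024)

/* Hypothetical stand-in for a zone's swap accounting (not the kernel's zone_t). */
typedef struct model_zone {
	size_t max_swap;	/* the zone's max-swap cap, in bytes */
	size_t swap_used;	/* bytes already reserved against the cap */
} model_zone_t;

/*
 * Model of a reservation routine. With takemem false it only answers
 * "would this fit?"; with takemem true it also charges the zone.
 * check_cap_on_query selects between the behavior described above
 * (zone cap ignored when takemem is false) and an assumed fixed variant.
 */
bool
model_resvmem(model_zone_t *z, size_t size, bool takemem,
    bool check_cap_on_query)
{
	assert(z != NULL && z->swap_used <= z->max_swap);
	bool fits = size <= z->max_swap - z->swap_used;

	if (!takemem)
		return (check_cap_on_query ? fits : true);
	if (!fits)
		return (false);
	z->swap_used += size;
	return (true);
}

/*
 * Model of the tmpfs side: query with the size inflated by minfree,
 * then perform the true reservation with the actual size.
 */
bool
model_tmp_resv(model_zone_t *z, size_t delta, size_t minfree, bool fixed)
{
	if (!model_resvmem(z, delta + minfree, false, fixed))
		return (false);
	return (model_resvmem(z, delta, true, fixed));
}

/* Reserve 1MB at a time until a reservation fails; return the count. */
int
model_fill(model_zone_t *z, size_t minfree, bool fixed)
{
	int n = 0;

	while (model_tmp_resv(z, MODEL_MB, minfree, fixed))
		n++;
	return (n);
}
```

With a 10MB cap and a 2MB minfree, filling the modeled zone with check_cap_on_query false consumes all 10MB, because the inflated query never fails and only the un-inflated reservation is checked against the cap. With the flag true, the loop stops once only 2MB remain, which is the behavior tmpfs_minfree is meant to provide.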


Related issues

Has duplicate illumos gate - Bug #3482: tmpfs in a non-global zone does not abide by tmpfs_minfree (New, Bryan Cantrill, 2013-01-18)

#1

Updated by Electric Monk over 4 years ago

  • Status changed from New to Closed

git commit 7de21d7cba3fddc724c51930845ad91d8082e0c7

commit  7de21d7cba3fddc724c51930845ad91d8082e0c7
Author: Bryan Cantrill <bryan@joyent.com>
Date:   2016-06-10T21:22:53.000Z

    7056 tmpfs in a non-global zone does not abide by tmpfs_minfree
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Approved by: Richard Lowe <richlowe@richlowe.net>

#2

Updated by Gergő Mihály Doma over 1 year ago

  • Has duplicate Bug #3482: tmpfs in a non-global zone does not abide by tmpfs_minfree added