tmpfs in a non-global zone does not abide by tmpfs_minfree
We've seen a number of core dumps in us-beta from gsort that match this signature:
core '00-25-90-94-c0-38/core.gsort.35797' of 35797: sort
 fee73fc4 _ndoprnt (, , , 0, , fefc36e0) + c
 fee77db0 snprintf (, 7d0, , , , ) + 78
 080637ad vasnprintf (, , , , 2, feb24608) + 89d
 08061ab2 rpl_vfprintf (, , , , , ) + 42
 08060433 error_tail (, , , fef42bd3, , ) + 23
 080604ec error (0, 1c, , , , ) + 4c
 080557b7 die (, 1, 8, , , ) + 37
 08056806 ???????? (40, fef22fc0, 1, , f, f)
 080593d5 mergefps (, , , 0, 0, 0) + 525
 08059527 mergefiles (, , 18, feb11000, 820, ) + 47
 08059797 merge (0, , , , , feffb0a4) + 77
 080699e4 main (1, , , , , 0) + 2144
 080551e3 _start (1, , 0, , , ) + 83
This may be related to being out of memory; the problem may have gone away when the size of the zone (a reducer) was increased.
In all cases, we died (somewhat paradoxically) trying to grow the stack while trying to emit a fatal error. This is odd because (1) the dumps themselves are not that large and (2) so many pages have been successfully allocated before the failed stack growth.
The key is the nature of the error; here's the stack of one of the dumps:
libc_hwcap1.so.1`_ndoprnt+0xc(8046910, 80468ec, 80468c0, 0, 80468d0, fefc36e0)
libc_hwcap1.so.1`snprintf+0x78(80471a0, 7d0, 8046910, 806b849, 80469b0, 8046a2c)
vasnprintf+0x89d(80471a0, 804719c, 806b7f1, 80479ec, 80bdf00, feb24194)
rpl_vfprintf+0x42(807e5a0, 806b7f1, 80479ec, 893ee01, 893ee01, 806b849)
error_tail+0x23(80479ec, 806bb8c, 8047ec0, fef42bd3, 8105fd0, 806b849)
error+0x4c(0, 1c, 806b7f1, 806b849, 893ee01, 806b849)
die+0x37(806b849, 1, 8, 807e5b0, 892fb40, 0)
0x8056806(892fb40, 8039528, 8047aa8, fee9c7fc, 892ff30, 0)
sortlines.isra.13+0x224(892fb10, 8047ca4, 807e5b0, 893ee01, 8047bb4, feffb0a4)
main+0xd1c(1, 8047e30, 8047e38, 8055182, 8067880, 0)
_start+0x83(1, 8047ec0, 0, 8047ec5, 8047f18, 8047f2b)
Here's the message it's attempting to write out:
> 806b849/s
0x806b849: write failed
The errno is 0x1c (28 decimal) – ENOSPC. pfiles (and its new ability to run on a core dump) shows what's going on:
pfiles core.gsort.81257
core 'core.gsort.81257' of 81257: sort
   0: S_IFIFO mode:0000 dev:530,0 ino:30244602 uid:0 gid:0 rdev:0,0
      O_RDWR
      offset:5120
   1: S_IFIFO mode:0000 dev:530,0 ino:30244604 uid:0 gid:0 rdev:0,0
      O_RDWR
      offset:0
   2: S_IFREG mode:0644 dev:90,66482 ino:513096 uid:0 gid:0 size:6
      O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
      /var/tmp/marlin_task/stderr
      offset:6
   3: S_IFREG mode:0600 dev:537,22679 ino:1772486441 uid:0 gid:0 size:647168
      O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE
      /tmp/sortBWasRN
      offset:647168
The problem, then, is that we have run out of space in /tmp – which is to say that we've run out of swap. Our attempt to record this error results in stack growth, which requires swap to back it – but because swap has been entirely exhausted, this fails and we die on a bus error or seg fault. Now, something has exhausted all of /tmp – and it's clearly not this single file.
The tmpfs exhaustion immediately brings up three questions: (1) Should Marlin zones not have /tmp backed by DRAM? (2) Should Marlin zones have /tmp cleaned even when the zone is not being reset? (3) Should the swap backing a zone be much larger than our current 2X DRAM? It seems that the answer to the first two is likely "no" – and even if the answer to the third is to significantly increase the size of swap, this failure mode is clearly still both possible and undesirable. This leads to the most actionable question: should tmpfs be able to consume every last page of swap within a zone? There is, in fact, a tunable – tmpfs_minfree – that dictates the amount of swap that tmpfs is to deliberately not consume. This value is non-zero and should be sufficiently so (it's 2MB), so why is this mechanism not working?
A little DTrace reveals the root cause: if takemem is not set and the zone is non-NULL, anon_resvmem() makes no attempt to check the zone's max-swap resource control. tmp_resv() very much relies on being able to accurately perform this check via its use of the anon_checkspace() macro; if anon_checkspace() succeeds with the number inflated by tmpfs_minfree, it will perform the true allocation with the actual number – explaining why use of tmpfs was able to reserve beyond tmpfs_minfree. (For whatever it's worth, it looks like this has been broken since the introduction of per-zone swap controls in 2006.) While the other questions remain legit (especially the swap sizing), that tmpfs use in a non-global zone does not abide by tmpfs_minfree is clearly busted; had that mechanism been operating properly, gsort would not have dumped core, and we would have seen a much crisper failure mode.
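To make the failure mode concrete, here is a toy userland model of the two-step reservation described above. It is a sketch only and is not the kernel code: the names swap_model_t, resv_model(), tmpfs_resv_model() and fill_until_fail() are all hypothetical, the zone's max-swap resource control is reduced to a single byte counter, and the "fixed" flag merely assumes that the repaired check consults the zone cap whether or not takemem is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	MB	((size_t)1024 * 1024)

/* Hypothetical stand-in for system swap plus one zone's max-swap rctl. */
typedef struct swap_model {
	size_t sm_sys_avail;	/* system-wide swap still available */
	size_t sm_zone_cap;	/* the zone's max-swap cap */
	size_t sm_zone_used;	/* swap already charged to the zone */
} swap_model_t;

/*
 * Models a reservation. With takemem false this is a dry run (the role
 * anon_checkspace() plays in the text). When "fixed" is false, the zone
 * cap is consulted only if takemem is set, which is the bug described.
 */
static bool
resv_model(swap_model_t *sm, size_t size, bool takemem, bool fixed)
{
	assert(sm != NULL);

	if (takemem || fixed) {
		if (sm->sm_zone_used + size > sm->sm_zone_cap)
			return (false);
	}
	if (size > sm->sm_sys_avail)
		return (false);
	if (takemem) {
		sm->sm_sys_avail -= size;
		sm->sm_zone_used += size;
	}
	return (true);
}

/* Models tmpfs: dry run inflated by minfree, then the real reservation. */
static bool
tmpfs_resv_model(swap_model_t *sm, size_t size, size_t minfree, bool fixed)
{
	if (!resv_model(sm, size + minfree, false, fixed))
		return (false);
	return (resv_model(sm, size, true, fixed));
}

/* Fill tmpfs in chunks until it fails; return the zone swap left over. */
static size_t
fill_until_fail(size_t cap, size_t minfree, size_t chunk, bool fixed)
{
	swap_model_t sm = {
		.sm_sys_avail = cap * 100,	/* system swap is plentiful */
		.sm_zone_cap = cap,
		.sm_zone_used = 0
	};

	while (tmpfs_resv_model(&sm, chunk, minfree, fixed))
		continue;
	return (sm.sm_zone_cap - sm.sm_zone_used);
}
```

With a 10MB cap, a 2MB minfree and 1MB writes, fill_until_fail(10 * MB, 2 * MB, 1 * MB, false) leaves nothing in the zone: the inflated dry run only ever sees the plentiful system-wide swap, so tmpfs keeps going until the real reservation hits the cap. With the flag set to true it stops with 2MB still free, which is the headroom a process would need to grow its stack and report ENOSPC cleanly.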
Updated by Electric Monk over 4 years ago
- Status changed from New to Closed
commit 7de21d7cba3fddc724c51930845ad91d8082e0c7
Author: Bryan Cantrill <firstname.lastname@example.org>
Date: 2016-06-10T21:22:53.000Z
7056 tmpfs in a non-global zone does not abide by tmpfs_minfree
Reviewed by: Jerry Jelinek <email@example.com>
Reviewed by: Robert Mustacchi <firstname.lastname@example.org>
Reviewed by: Garrett D'Amore <email@example.com>
Approved by: Richard Lowe <firstname.lastname@example.org>