tmpfs in a non-global zone does not abide by tmpfs_minfree
We have seen cases recently of gsort in non-global zones dying under somewhat odd conditions. In all cases, we died (somewhat paradoxically) trying to grow the stack while trying to emit a fatal error. This is odd because (1) the dumps themselves are not that large and (2) so many pages have been successfully allocated before the failed stack growth.
The key is the nature of the error; here's the stack of one of the dumps:
libc_hwcap1.so.1`_ndoprnt+0xc(8046910, 80468ec, 80468c0, 0, 80468d0, fefc36e0)
libc_hwcap1.so.1`snprintf+0x78(80471a0, 7d0, 8046910, 806b849, 80469b0, 8046a2c)
vasnprintf+0x89d(80471a0, 804719c, 806b7f1, 80479ec, 80bdf00, feb24194)
rpl_vfprintf+0x42(807e5a0, 806b7f1, 80479ec, 893ee01, 893ee01, 806b849)
error_tail+0x23(80479ec, 806bb8c, 8047ec0, fef42bd3, 8105fd0, 806b849)
error+0x4c(0, 1c, 806b7f1, 806b849, 893ee01, 806b849)
die+0x37(806b849, 1, 8, 807e5b0, 892fb40, 0)
0x8056806(892fb40, 8039528, 8047aa8, fee9c7fc, 892ff30, 0)
sortlines.isra.13+0x224(892fb10, 8047ca4, 807e5b0, 893ee01, 8047bb4, feffb0a4)
main+0xd1c(1, 8047e30, 8047e38, 8055182, 8067880, 0)
_start+0x83(1, 8047ec0, 0, 8047ec5, 8047f18, 8047f2b)
Here's the message it's attempting to write out:
> 806b849/s
0x806b849: write failed
The errno is 0x1c (28 decimal) -- ENOSPC. pfiles (and its new ability to run on a core dump!) shows what's going on:
% pfiles core.gsort.81257
core 'core.gsort.81257' of 81257: sort
   0: S_IFIFO mode:0000 dev:530,0 ino:30244602 uid:0 gid:0 rdev:0,0
      O_RDWR
      offset:5120
   1: S_IFIFO mode:0000 dev:530,0 ino:30244604 uid:0 gid:0 rdev:0,0
      O_RDWR
      offset:0
   2: S_IFREG mode:0644 dev:90,66482 ino:513096 uid:0 gid:0 size:6
      O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
      /var/tmp/marlin_task/stderr
      offset:6
   3: S_IFREG mode:0600 dev:537,22679 ino:1772486441 uid:0 gid:0 size:647168
      O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE
      /tmp/sortBWasRN
      offset:647168
The problem, then, is that we have run out of space in /tmp -- which is to say that we've run out of swap. Our attempt to record this error results in stack growth, which requires swap to back it -- but because swap has been entirely exhausted, this fails and we die on a bus error or seg fault. Now, something has exhausted all of /tmp -- and it's clearly not this single file.
The tmpfs exhaustion immediately brings up several sizing questions (we currently size a zone's swap at a multiplicative factor of its DRAM), but there's a deeper question: should tmpfs be able to consume every last page of swap within a zone? There is, in fact, a tunable -- tmpfs_minfree -- that dictates the amount of swap that tmpfs is to deliberately not consume. This value is non-zero (it's 2MB), which should be enough headroom for a process to report an error, so why is this mechanism not working?
A little DTrace reveals the root cause: in anon_resvmem(), if takemem is not set and the zone is non-NULL, there is no attempt to check the zone's max-swap resource control. tmp_resv() very much relies on being able to accurately perform this check via its use of the anon_checkspace() macro; if anon_checkspace() succeeds with the number inflated by tmpfs_minfree, tmp_resv() will perform the true allocation with the actual number. Because the inflated check is never tested against the zone's limit, tmpfs was able to reserve beyond tmpfs_minfree. (For whatever it's worth, it looks like this has been broken since the introduction of per-zone swap controls in 2006.) While the other questions remain legit (especially the swap sizing), that tmpfs use in a non-global zone does not abide by tmpfs_minfree is clearly busted; had that mechanism been operating properly, gsort would not have dumped core, and we would have seen a much crisper failure mode.
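The probe-then-reserve interaction described above can be sketched as a small userspace model. This is not the kernel code: struct zone_model, model_resvmem(), model_tmp_resv(), model_fill() and the check_cap_on_probe flag are all hypothetical names made up for illustration, and the model tracks only a per-zone page count against a cap. It assumes the probe corresponds to the anon_checkspace() call (takemem not set) and that the cap stands in for the zone's max-swap resource control.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a zone's swap accounting, in pages. */
struct zone_model {
	size_t swap_used;	/* pages currently reserved */
	size_t swap_cap;	/* stand-in for the zone's max-swap limit */
};

/*
 * Toy model of the reservation routine.  With check_cap_on_probe == false
 * it mimics the behavior described above: the cap is only tested when
 * takemem is set, so a pure probe always succeeds.  With true, the probe
 * is tested against the cap as well.
 */
static bool
model_resvmem(struct zone_model *z, size_t npages, bool takemem,
    bool check_cap_on_probe)
{
	bool enforce = takemem || check_cap_on_probe;

	if (enforce && z->swap_used + npages > z->swap_cap)
		return (false);
	if (takemem)
		z->swap_used += npages;
	return (true);
}

/*
 * Toy model of the tmpfs reservation: probe with the request inflated
 * by minfree, then take the actual amount.
 */
static bool
model_tmp_resv(struct zone_model *z, size_t npages, size_t minfree,
    bool check_cap_on_probe)
{
	if (!model_resvmem(z, npages + minfree, false, check_cap_on_probe))
		return (false);
	return (model_resvmem(z, npages, true, check_cap_on_probe));
}

/*
 * Reserve one page at a time until the reservation fails, and return
 * the number of pages left unreserved under the cap.
 */
static size_t
model_fill(struct zone_model *z, size_t minfree, bool check_cap_on_probe)
{
	while (model_tmp_resv(z, 1, minfree, check_cap_on_probe))
		continue;
	return (z->swap_cap - z->swap_used);
}
```

With a cap of 1000 pages and a minfree of 512 pages (2MB, assuming 4K pages), model_fill() leaves 0 pages free when the probe skips the cap and 512 pages free when the probe is tested against it. The first case is the condition under which a later stack growth has nothing left to reserve.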