Bug #6343
zoneadmd parent needs to close open fds
100%
Description
Jobs were hanging on us-beta. The reason was that system c0-90 had all of its compute zones in the "disabled" or "uninitialized" states, so there was nowhere to run work and jobs were just waiting.
I took a look at one of these and found:
[2013-02-07T00:44:38.134Z] ERROR: MarlinAgent/Zone-2240ba1e-b6e1-4e02-89a8-7bf667759f86/51826 on 00-25-90-94-c0-90: zone removed from service
zone could not be made ready: Command failed: zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': attempting to cleanup mount /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': Error: GZ process 96284 under /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta blocking shutdown
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': attempting to cleanup mount /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': Error: GZ process 96284 under /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta blocking shutdown
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unmount of '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta' failed
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount file systems in zone
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to destroy zone
    at mAgent.zoneReady (/opt/smartdc/agents/lib/node_modules/marlin/lib/agent/agent.js:2136:7)
    at /opt/smartdc/agents/lib/node_modules/marlin/lib/agent/zone.js:255:4
    at next (/opt/smartdc/agents/lib/node_modules/marlin/node_modules/vasync/lib/vasync.js:178:4)
    at /opt/smartdc/agents/lib/node_modules/marlin/lib/agent/zone.js:283:4
    at ChildProcess.exithandler (child_process.js:544:7)
    at ChildProcess.EventEmitter.emit (events.js:99:17)
    at maybeClose (child_process.js:638:16)
    at Process._handle.onexit (child_process.js:680:5)
Caused by: Error: Command failed: zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': attempting to cleanup mount /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': Error: GZ process 96284 under /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta blocking shutdown
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': attempting to cleanup mount /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': Error: GZ process 96284 under /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta blocking shutdown
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unmount of '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta' failed
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount file systems in zone
zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to destroy zone
    at ChildProcess.exithandler (child_process.js:540:15)
    at ChildProcess.EventEmitter.emit (events.js:99:17)
    at maybeClose (child_process.js:638:16)
    at Process._handle.onexit (child_process.js:680:5)
So we took zone 2240ba1e-b6e1-4e02-89a8-7bf667759f86 out of service because we couldn't halt it: process 96284 was holding files open under its /manta mount. Thankfully, that process is still around, and it turns out to be zoneadmd for a different zone:
96284:  zoneadmd -z 930279cb-40d1-4322-b083-797489c5b8f2
  Current rlimit: 65536 file descriptors
   0: S_IFSOCK mode:0666 dev:540,0 ino:13830 uid:0 gid:0 rdev:0,0
      O_RDWR
        SOCK_STREAM
        SO_SNDBUF(16384),SO_RCVBUF(5120)
        sockname: AF_UNIX
        peer: node[51826] zone: global[0]
   1: S_IFSOCK mode:0666 dev:540,0 ino:52337 uid:0 gid:0 rdev:0,0
      O_RDWR
        SOCK_STREAM
        SO_SNDBUF(16384),SO_RCVBUF(5120)
        sockname: AF_UNIX
        peer: node[51826] zone: global[0]
   2: S_IFSOCK mode:0666 dev:540,0 ino:48945 uid:0 gid:0 rdev:0,0
      O_RDWR
        SOCK_STREAM
        SO_SNDBUF(16384),SO_RCVBUF(5120)
        sockname: AF_UNIX
        peer: node[51826] zone: global[0]
   4: S_IFCHR mode:0666 dev:532,0 ino:47185924 uid:0 gid:3 rdev:90,0
      O_RDWR
      /devices/pseudo/zfs@0:zfs
      offset:0
   5: S_IFREG mode:0444 dev:536,1 ino:2 uid:0 gid:0 rdev:0,0
      O_RDONLY
      /etc/mnttab
      offset:0
   6: S_IFREG mode:0444 dev:539,1 ino:128 uid:0 gid:0 rdev:0,0
      O_RDONLY
      /etc/dfs/sharetab
      offset:0
   7: S_IFCHR mode:0666 dev:532,0 ino:47185924 uid:0 gid:3 rdev:90,0
      O_RDWR
      /devices/pseudo/zfs@0:zfs
      offset:0
   8: S_IFREG mode:0600 dev:537,3 ino:1721164976 uid:0 gid:0 size:0
      O_RDWR|O_CREAT
      advisory write lock set by process 96283
      /var/run/zones/930279cb-40d1-4322-b083-797489c5b8f2.zoneadm.lock
      offset:0
  58: S_IFDIR mode:0755 dev:548,4305 ino:2003510738 uid:0 gid:0 rdev:0,0
      O_RDONLY|O_LARGEFILE
      /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta
      offset:0
 121: S_IFSOCK mode:0666 dev:540,0 ino:19608 uid:0 gid:0 rdev:0,0
      O_RDWR|O_NONBLOCK
        SOCK_STREAM
        SO_SNDBUF(16384),SO_RCVBUF(5120)
        sockname: AF_UNIX /tmp/.marlin.sock
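The descriptor that keeps the other zone's mount busy can also be picked out of this output mechanically. The following is only a rough sketch in Python: the helper names fds_with_paths and fds_under are made up for illustration, the sample text is abridged from the pfiles output above, and the parsing assumes nothing more than that each descriptor entry starts with "<fd>: S_IF...".

```python
import re

# Abridged from the pfiles output above; the whitespace layout does not matter.
PFILES = """
96284: zoneadmd -z 930279cb-40d1-4322-b083-797489c5b8f2
 5: S_IFREG mode:0444 dev:536,1 ino:2 uid:0 gid:0 rdev:0,0 O_RDONLY /etc/mnttab offset:0
 58: S_IFDIR mode:0755 dev:548,4305 ino:2003510738 uid:0 gid:0 rdev:0,0 O_RDONLY|O_LARGEFILE /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta offset:0
 121: S_IFSOCK mode:0666 dev:540,0 ino:19608 uid:0 gid:0 rdev:0,0 O_RDWR|O_NONBLOCK SOCK_STREAM SO_SNDBUF(16384),SO_RCVBUF(5120) sockname: AF_UNIX /tmp/.marlin.sock
"""


def fds_with_paths(pfiles_text):
    """Map fd number -> first absolute path listed for that descriptor."""
    result = {}
    # Split on the "<fd>: S_IF..." marker that begins each descriptor entry.
    parts = re.split(r"(?:^|\s)(\d+):\s+(?=S_IF)", pfiles_text)
    # parts is [preamble, fd, body, fd, body, ...]
    for fd, body in zip(parts[1::2], parts[2::2]):
        paths = [tok for tok in body.split() if tok.startswith("/")]
        if paths:
            result[int(fd)] = paths[0]
    return result


def fds_under(pfiles_text, prefix):
    """Return the sorted fd numbers whose path lies at or below `prefix`."""
    base = prefix.rstrip("/")
    return sorted(
        fd
        for fd, path in fds_with_paths(pfiles_text).items()
        if path == base or path.startswith(base + "/")
    )
```

Calling fds_under(PFILES, "/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86") on this sample singles out descriptor 58, the directory under the zone that could not be unmounted.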
File descriptors 58 and 121 are only ever opened by the marlin agent – how could they possibly wind up open in a zoneadmd process? It turns out that various "zoneadm" commands (which are invoked by the marlin agent) fork a zoneadmd process. That process forks a child that does closefrom(0), but the first zoneadmd process doesn't, leaving these files open. So in this case, a "zoneadm boot" (or similar) for zone 930279cb-40d1-4322-b083-797489c5b8f2 left an extra reference on a file in zone 2240ba1e-b6e1-4e02-89a8-7bf667759f86, breaking that zone.
I believe we're missing a closefrom() call in zoneadmd.c's main(), in the parent process after it forks.
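The mechanism is easy to reproduce in miniature: a forked process keeps every descriptor its parent had open unless it closes them itself. The sketch below is Python rather than zoneadmd's C, and it uses os.closerange(3, MAX_FD) as a rough stand-in for closefrom(3). The bound MAX_FD, the choice of 3 as the lowest descriptor to close, and the helper names are all assumptions made for illustration, not details of the actual fix.

```python
import fcntl
import os
import tempfile

MAX_FD = 4096  # assumed bound for this demo; closefrom() itself needs none


def fd_is_open(fd):
    """Return True if `fd` refers to an open file in this process."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


def forked_child_holds(fd, close_inherited):
    """Fork a child and report whether it still holds `fd` open."""
    pid = os.fork()
    if pid == 0:
        # Child: plays the role of the first zoneadmd process.
        status = 2  # 2 means something unexpected went wrong
        try:
            if close_inherited:
                # Rough analogue of closefrom(3): keep only stdio.
                os.closerange(3, MAX_FD)
            status = 1 if fd_is_open(fd) else 0
        finally:
            os._exit(status)
    _, wstatus = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wstatus) == 1


def demo():
    """Return (leaked_without_close, leaked_with_close)."""
    with tempfile.TemporaryDirectory() as d:
        # The "agent" opens a directory, as the marlin agent did with /manta.
        raw = os.open(d, os.O_RDONLY)
        # Duplicate to a descriptor >= 3 so it is clear of stdin/stdout/stderr.
        fd = fcntl.fcntl(raw, fcntl.F_DUPFD, 3)
        os.close(raw)
        try:
            return (forked_child_holds(fd, False), forked_child_holds(fd, True))
        finally:
            os.close(fd)
```

Under these assumptions, demo() reports the directory descriptor as still open in the first child and as closed in the second, which is the difference the missing closefrom() makes.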
Updated by Electric Monk over 5 years ago
- Status changed from New to Closed
git commit 056d3a7d553516b590a0543f4df3152a3144b42b
commit 056d3a7d553516b590a0543f4df3152a3144b42b
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Date: 2015-10-25T02:06:56.000Z

    6343 zoneadmd parent needs to close open fds
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Approved by: Garrett D'Amore <garrett@damore.org>