Bug #6343

zoneadmd parent needs to close open fds

Added by Robert Mustacchi over 4 years ago. Updated about 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: zones
Start date: 2015-10-16
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:

Description

Jobs were hanging on us-beta because every compute zone on system c0-90 was in the "disabled" or "uninitialized" state. With nowhere to run, work was just waiting.

I took a look at one of these zones and found:

[2013-02-07T00:44:38.134Z] ERROR: MarlinAgent/Zone-2240ba1e-b6e1-4e02-89a8-7bf667759f86/51826 on 00-25-90-94-c0-90: zone removed from service
    zone could not be made ready: Command failed: zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': attempting to cleanup mount /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': Error: GZ process 96284 under /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta blocking shutdown
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': attempting to cleanup mount /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': Error: GZ process 96284 under /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta blocking shutdown
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unmount of '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta' failed
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount file systems in zone
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to destroy zone

        at mAgent.zoneReady (/opt/smartdc/agents/lib/node_modules/marlin/lib/agent/agent.js:2136:7)
        at /opt/smartdc/agents/lib/node_modules/marlin/lib/agent/zone.js:255:4
        at next (/opt/smartdc/agents/lib/node_modules/marlin/node_modules/vasync/lib/vasync.js:178:4)
        at /opt/smartdc/agents/lib/node_modules/marlin/lib/agent/zone.js:283:4
        at ChildProcess.exithandler (child_process.js:544:7)
        at ChildProcess.EventEmitter.emit (events.js:99:17)
        at maybeClose (child_process.js:638:16)
        at Process._handle.onexit (child_process.js:680:5)
    Caused by: Error: Command failed: zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta', retrying in 2 seconds
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': attempting to cleanup mount /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': Error: GZ process 96284 under /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta blocking shutdown
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': attempting to cleanup mount /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': Error: GZ process 96284 under /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta blocking shutdown
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unmount of '/zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta' failed
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to unmount file systems in zone
    zone '2240ba1e-b6e1-4e02-89a8-7bf667759f86': unable to destroy zone

        at ChildProcess.exithandler (child_process.js:540:15)
        at ChildProcess.EventEmitter.emit (events.js:99:17)
        at maybeClose (child_process.js:638:16)
        at Process._handle.onexit (child_process.js:680:5)

So we took zone 2240ba1e-b6e1-4e02-89a8-7bf667759f86 out of service because we couldn't halt it: process 96284 was holding a file open under the zone's root/manta mount. Thankfully, that process is still around, and it turns out to be zoneadmd for a different zone:

96284:    zoneadmd -z 930279cb-40d1-4322-b083-797489c5b8f2
  Current rlimit: 65536 file descriptors
   0: S_IFSOCK mode:0666 dev:540,0 ino:13830 uid:0 gid:0 rdev:0,0
      O_RDWR
    SOCK_STREAM
    SO_SNDBUF(16384),SO_RCVBUF(5120)
    sockname: AF_UNIX 
    peer: node[51826] zone: global[0]
   1: S_IFSOCK mode:0666 dev:540,0 ino:52337 uid:0 gid:0 rdev:0,0
      O_RDWR
    SOCK_STREAM
    SO_SNDBUF(16384),SO_RCVBUF(5120)
    sockname: AF_UNIX 
    peer: node[51826] zone: global[0]
   2: S_IFSOCK mode:0666 dev:540,0 ino:48945 uid:0 gid:0 rdev:0,0
      O_RDWR
    SOCK_STREAM
    SO_SNDBUF(16384),SO_RCVBUF(5120)
    sockname: AF_UNIX 
    peer: node[51826] zone: global[0]
   4: S_IFCHR mode:0666 dev:532,0 ino:47185924 uid:0 gid:3 rdev:90,0
      O_RDWR
      /devices/pseudo/zfs@0:zfs
      offset:0
   5: S_IFREG mode:0444 dev:536,1 ino:2 uid:0 gid:0 rdev:0,0
      O_RDONLY
      /etc/mnttab
      offset:0
   6: S_IFREG mode:0444 dev:539,1 ino:128 uid:0 gid:0 rdev:0,0
      O_RDONLY
      /etc/dfs/sharetab
      offset:0
   7: S_IFCHR mode:0666 dev:532,0 ino:47185924 uid:0 gid:3 rdev:90,0
      O_RDWR
      /devices/pseudo/zfs@0:zfs
      offset:0
   8: S_IFREG mode:0600 dev:537,3 ino:1721164976 uid:0 gid:0 size:0
      O_RDWR|O_CREAT
      advisory write lock set by process 96283
      /var/run/zones/930279cb-40d1-4322-b083-797489c5b8f2.zoneadm.lock
      offset:0
  58: S_IFDIR mode:0755 dev:548,4305 ino:2003510738 uid:0 gid:0 rdev:0,0
      O_RDONLY|O_LARGEFILE
      /zones/2240ba1e-b6e1-4e02-89a8-7bf667759f86/root/manta
      offset:0
 121: S_IFSOCK mode:0666 dev:540,0 ino:19608 uid:0 gid:0 rdev:0,0
      O_RDWR|O_NONBLOCK
    SOCK_STREAM
    SO_SNDBUF(16384),SO_RCVBUF(5120)
    sockname: AF_UNIX /tmp/.marlin.sock

File descriptors 58 and 121 are only ever opened by the marlin agent – how could they possibly wind up open in a zoneadmd process? It turns out that various "zoneadm" commands (which are invoked by the marlin agent) fork a zoneadmd process. That process forks a child that does closefrom(0), but the first zoneadmd process doesn't, leaving these files open. So in this case, a "zoneadm boot" (or similar) for zone 930279cb-40d1-4322-b083-797489c5b8f2 left an extra reference on a file in zone 2240ba1e-b6e1-4e02-89a8-7bf667759f86, breaking that zone.
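
To illustrate the inheritance described above, here is a minimal Python sketch (not the zoneadmd code). A process opens a directory and forks, and the forked process checks whether the descriptor is valid even though it never opened it. The name child_inherits_fd and the temporary directory are hypothetical stand-ins for the marlin agent and the zone's manta directory. The sketch covers only fork; descriptors also survive exec unless they are marked close-on-exec.

```python
import os
import tempfile


def child_inherits_fd(path):
    """Open `path`, fork, and report whether the forked process holds the fd."""
    fd = os.open(path, os.O_RDONLY)  # stands in for the agent's open directory
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Forked process: it never opened `path`, yet `fd` refers to it here.
        try:
            os.close(r)
            try:
                os.fstat(fd)
                os.write(w, b"1")  # descriptor is valid in the forked process
            except OSError:
                os.write(w, b"0")
        finally:
            os._exit(0)
    # Original process: collect the answer and clean up.
    os.close(w)
    answer = os.read(r, 1)
    os.close(r)
    os.waitpid(pid, 0)
    os.close(fd)
    return answer == b"1"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(child_inherits_fd(d))
```

As long as the forked process lives and keeps that descriptor, the directory stays referenced, which matches the extra reference that blocked the unmount here.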

I believe we're missing a closefrom() call in the parent process, after forking, in zoneadmd.c's main().
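
A rough sketch of that remedy, in Python rather than zoneadmd's C: the forked process closes every inherited descriptor above stdio before doing anything else, so it no longer pins the directory. close_from is a hypothetical helper that approximates closefrom(3C), and starting at descriptor 3 is an assumption made for the sketch; the actual change may choose a different starting point.

```python
import os
import resource
import tempfile


def close_from(lowfd):
    """Approximate closefrom(3C): close every open descriptor >= lowfd."""
    try:
        fds = [int(name) for name in os.listdir("/dev/fd")]
    except OSError:
        # Fallback when /dev/fd is unavailable; the 65536 cap is an assumption.
        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        limit = 65536 if soft == resource.RLIM_INFINITY else min(soft, 65536)
        os.closerange(lowfd, limit)
        return
    for fd in fds:
        if fd >= lowfd:
            try:
                os.close(fd)
            except OSError:
                pass  # e.g. the descriptor listdir() used, already closed


def forked_process_releases_fd(path):
    """Fork with `path` open; the forked process drops inherited fds first."""
    fd = os.open(path, os.O_RDONLY)  # stands in for the agent's open directory
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            close_from(3)  # keep stdio only
            try:
                os.fstat(fd)
            except OSError:
                status = 0  # descriptor is gone, as intended
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    os.close(fd)
    return os.WEXITSTATUS(wait_status) == 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(forked_process_releases_fd(d))
```

The result travels back through the exit status rather than a pipe because close_from would close the pipe as well.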

History

#1

Updated by Electric Monk about 4 years ago

  • Status changed from New to Closed

commit  056d3a7d553516b590a0543f4df3152a3144b42b
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Date:   2015-10-25T02:06:56.000Z

    6343 zoneadmd parent needs to close open fds
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Approved by: Garrett D'Amore <garrett@damore.org>
