Bug #13091

process contract escaped SMF

Added by David Pacheco 3 months ago. Updated 3 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
% Done: 0%
Difficulty: Medium

Description

I originally saw this problem with OmniOS's "ntp" SMF service when I configured it to wait for sync at startup. I've been able to reproduce this with a separate SMF service that's very simple. Here's the manifest:

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<!-- 
    Manifest automatically generated by smfgen.
 -->
<service_bundle type="manifest" name="application-bug-demo" >
    <service name="application/bug-demo" type="service" version="1" >
        <create_default_instance enabled="true" />
        <dependency name="dep0" grouping="require_all" restart_on="error" type="service" >
            <service_fmri value="svc:/milestone/multi-user:default" />
        </dependency>
        <exec_method type="method" name="start" exec="/var/tmp/smfbug/start.sh" timeout_seconds="600" />
        <exec_method type="method" name="stop" exec=":kill" timeout_seconds="30" />
        <template >
            <common_name >
                <loctext xml:lang="C" >bug demo</loctext>
            </common_name>
        </template>
    </service>
</service_bundle>

and here's the start method:

#!/bin/bash

#
# start method that forks a process that sleeps for a while so that the caller
# can try to kill it.
#

set -o errexit
set -o xtrace

. /lib/svc/share/smf_include.sh

sleep 800 &

if ! sleep 300 ; then
        exit $SMF_EXIT_ERR_CONFIG
fi

The key pieces of the start method are:

- it forks off "sleep" in the background for a pretty long time (indefinitely, as far as we care here)
- it executes a separate "sleep 300", which is long enough for us to kill it by hand, which will trigger the bug
- it exits with SMF_EXIT_ERR_CONFIG if this second sleep command fails
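
The shell-level half of this can be sketched without SMF at all. The snippet below is only an illustration, not the start method from this report: the durations are shortened, the stand-in method is an inline `bash -c` script, and the variable names are made up. It shows that a backgrounded child keeps running after the script that forked it exits with status 96. That much is ordinary shell behavior; the bug reported here is that SMF did not then clean up the remaining member of the contract.

```shell
#!/bin/bash
# Illustrative sketch only: no SMF involved, durations shortened.

# Stand-in "start method": background a long sleep, print its PID, then
# exit with 96 (the value of SMF_EXIT_ERR_CONFIG per smf_include.sh).
# The child's stdout/stderr are redirected so the command substitution
# does not wait for it.
method_status=0
bg_pid=$(bash -c 'sleep 30 >/dev/null 2>&1 & echo $!; exit 96') || method_status=$?

# kill -0 sends no signal; it only tests whether the process still exists.
if kill -0 "$bg_pid" 2>/dev/null; then
    survivor=yes
else
    survivor=no
fi

echo "method exit status: $method_status"
echo "background child still running: $survivor"

# Clean up the stray child so the demo leaves nothing behind.
kill "$bg_pid" 2>/dev/null || true
```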

I've got both of these files in "/var/tmp/smfbug":

dap@lennier:/var/tmp/smfbug$ ls -l
total 18
-rw-r--r--   1 dap      other        901 Sep  1 20:49 bug-demo.xml
-rwxr-xr-x   1 dap      other        247 Sep  1 20:51 start.sh
dap@lennier:/var/tmp/smfbug$ 

The service doesn't exist yet:

dap@lennier:/var/tmp/smfbug$ svcs bug-demo
svcs: Pattern 'bug-demo' doesn't match any instances
STATE          STIME    FMRI
dap@lennier:/var/tmp/smfbug$

Let's import it and see what's running:

dap@lennier:/var/tmp/smfbug$ svccfg import bug-demo.xml
dap@lennier:/var/tmp/smfbug$ svcs -p bug-demo
STATE          STIME    FMRI
offline*       20:51:29 svc:/application/bug-demo:default
               20:51:29     3384 start.sh
               20:51:29     3385 sleep
               20:51:29     3386 sleep
dap@lennier:/var/tmp/smfbug$ ps -opid,ctid,args -p "3384 3385 3386" 
  PID  CTID COMMAND
 3384   132 /bin/bash /var/tmp/smfbug/start.sh
 3385   132 sleep 800
 3386   132 sleep 300
dap@lennier:/var/tmp/smfbug$ ctstat -i 132 -v
CTID    ZONEID  TYPE    STATE   HOLDER  EVENTS  QTIME   NTIME   
132     0       process owned   9       0       -       -       
        cookie:                0x20
        informative event set: none
        critical event set:    core signal hwerr empty
        fatal event set:       none
        parameter set:         inherit regent
        member processes:      3384 3385 3386
        inherited contracts:   none
        service fmri:          svc:/application/bug-demo:default
        service fmri ctid:     132
        creator:               svc.startd
        aux:                   start

That's all good. Now, let's kill the shorter-running "sleep" process. We expect the start method to exit with $SMF_EXIT_ERR_CONFIG, which will send the service into maintenance:

dap@lennier:/var/tmp/smfbug$ pfexec kill 3386
dap@lennier:/var/tmp/smfbug$ svcs bug-demo
STATE          STIME    FMRI
maintenance    20:52:07 svc:/application/bug-demo:default

The SMF log contains:

[ Sep  1 20:51:29 Executing start method ("/var/tmp/smfbug/start.sh"). ]
+ . /lib/svc/share/smf_include.sh
++ SMF_EXIT_OK=0
++ SMF_EXIT_NODAEMON=94
++ SMF_EXIT_ERR_FATAL=95
++ SMF_EXIT_ERR_CONFIG=96
++ SMF_EXIT_MON_DEGRADE=97
++ SMF_EXIT_MON_OFFLINE=98
++ SMF_EXIT_ERR_NOSMF=99
++ SMF_EXIT_ERR_PERM=100
+ sleep 300
+ sleep 800
Terminated
+ exit 96
[ Sep  1 20:52:07 Method "start" exited with status 96. ]

That's all good. But the long-running "sleep" process is still running, and the contract is still around!

dap@lennier:/var/tmp/smfbug$ ps -opid,args -p 3385
  PID COMMAND
 3385 sleep 800
dap@lennier:/var/tmp/smfbug$ ctstat -i 132 -v
CTID    ZONEID  TYPE    STATE   HOLDER  EVENTS  QTIME   NTIME   
132     0       process orphan  -       0       -       -       
        cookie:                0x20
        informative event set: none
        critical event set:    core signal hwerr empty
        fatal event set:       none
        parameter set:         inherit regent
        member processes:      3385
        inherited contracts:   none
        service fmri:          svc:/application/bug-demo:default
        service fmri ctid:     132
        creator:               svc.startd
        aux:                   start
dap@lennier:/var/tmp/smfbug$ 

This can cause all kinds of problems for services that expect only one instance to be running (e.g., because they need to bind to a particular TCP port).
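
As a rough way to spot this condition from a script, one could pull the PIDs out of the "member processes" line of `ctstat -v` output. This is a sketch under assumptions: `contract_members` is a hypothetical helper (not part of SMF or ctstat), and it assumes the line format matches the output shown in this report.

```shell
#!/bin/bash
# contract_members is a hypothetical helper, not an SMF or ctstat feature.
# It reads `ctstat -v` output on stdin and prints member PIDs, one per line.
# Assumes the "member processes:" line looks like the output in this report.
contract_members() {
    awk -F: '/member processes/ {
        n = split($2, pids, " ")
        for (i = 1; i <= n; i++)
            if (pids[i] ~ /^[0-9]+$/)   # skip non-numeric values
                print pids[i]
    }'
}

# Sample line copied from the ctstat output after the method failed.
sample='        member processes:      3385'
leftover=$(printf '%s\n' "$sample" | contract_members)
echo "leftover contract members: $leftover"
```

On a live system the input would come from something like `ctstat -i 132 -v | contract_members`, using the contract ID from this report.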

#1

Updated by David Pacheco 3 months ago

These are the only relevant lines in startd's log file:

Sep  1 20:52:07/252 ERROR: svc:/application/bug-demo:default: Method "/var/tmp/smfbug/start.sh" failed with exit status 96.
Sep  1 20:52:07/252: application/bug-demo:default misconfigured: transitioned to maintenance (see 'svcs -xv' for details)
