Bug #3990

closed

a misconfigured smf service can cause svc.configd to leak memory and eventually hang

Added by Robert Mustacchi about 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: cmd - userland programs
Start date: 2013-08-04
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug:

Description

Elijah collected a bunch of logs and the smf configuration for me. The problem was that the svc was missing its start method and was configured as a "wait" model svc:

<property_group name='startd' type='framework'>
  <propval name='duration' type='astring' value='child'/>
  <propval name='ignore_error' type='astring' value='core,signal'/>
</property_group>

From the svc.startd man page:

"Wait" model services are restarted whenever the child process
     associated with the service exits. A child process that exits is
     not considered an error for "wait" model services, and repeated
     failures do not lead to a transition to maintenance state.

In their svc.startd log file we see this error repeated many times per second for days:

May 23 03:10:04/6: instance svc:/application/redis/queue1:default exited with status 127

This occurred on multiple different svcs until they added the missing start method code.
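A restart loop like this can be spotted by counting the exit messages per instance. The following is a minimal sketch, assuming the message format matches the excerpt above; the function name `count_exits` and the regular expression are illustrative and not part of any SMF tooling. It tolerates the message being wrapped across two lines.

```python
import re
from collections import Counter

# Matches "instance <FMRI> exited with status <N>", allowing the message
# to be wrapped onto a second line (\s+ also matches newlines).
EXIT_RE = re.compile(r"instance\s+(svc:/\S+)\s+exited with status (\d+)")


def count_exits(log_text):
    """Return a Counter keyed by (fmri, exit_status)."""
    return Counter(
        (m.group(1), int(m.group(2))) for m in EXIT_RE.finditer(log_text)
    )


# Sample input: the log message from this report, repeated twice.
sample = (
    "May 23 03:10:04/6: instance svc:/application/redis/queue1:default\n"
    " exited with status 127\n"
    "May 23 03:10:04/6: instance svc:/application/redis/queue1:default\n"
    " exited with status 127\n"
)

for (fmri, status), n in count_exits(sample).most_common():
    print(f"{n} exits with status {status}: {fmri}")
```

An instance whose count climbs by several per second is a candidate for the failure mode described here.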

Old OpenSolaris discussions of this exact issue show that this is how SMF is designed to behave. This particular failure mode then tickles other bugs in svc.configd, leading it to eventually hang. However, I think the SMF design is broken here, since it assumes all svc developers are knowledgeable and always create and test their svcs before deploying them. This has not been the case for us. I think we should fix startd so that if the start method does not exist, the svc immediately goes into maintenance instead of retrying forever. Also, if the svc is repeatedly failing, we should probably throttle it down so that it is not restarting 10 times a second.
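The policy proposed above can be sketched roughly as follows. This is an illustration only, not the change that resolved this bug and not how svc.startd is implemented: the names `next_action` and `default_is_executable`, and the delay parameters, are all hypothetical. It assumes the restarter knows the start method path and how many times in a row the child has failed.

```python
import os

MAINTENANCE = "maintenance"
RESTART = "restart"


def default_is_executable(path):
    """True if path is a regular file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def next_action(method_path, consecutive_failures,
                base_delay=1.0, max_delay=60.0,
                is_executable=default_is_executable):
    """Decide what to do after a wait-model child exits.

    Returns a tuple (action, delay_seconds).
    """
    # A missing or non-executable start method can never succeed,
    # so give up immediately instead of retrying forever.
    if not is_executable(method_path):
        return (MAINTENANCE, 0.0)
    # Otherwise restart, doubling the delay for each consecutive
    # failure and capping it at max_delay.
    delay = min(max_delay, base_delay * (2 ** consecutive_failures))
    return (RESTART, delay)


# Hypothetical path that does not exist on the system.
print(next_action("/nonexistent/start-method", 0))
```

The executable check is passed in as a parameter so the backoff arithmetic can be exercised without touching the filesystem. Whether a repeatedly failing "wait" service should ever reach maintenance on its own is a separate design question that this sketch leaves open.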

Actions #1

Updated by Robert Mustacchi about 10 years ago

Resolved in 2a17138d7a5102bc6e0bf0444224cd0c416d98f0.

Actions #2

Updated by Robert Mustacchi about 10 years ago

  • Status changed from New to Resolved