a misconfigured smf service can cause svc.configd to leak memory and eventually hang
Elijah collected a bunch of logs and the smf configuration for me. The problem was that the svc was missing its start method and was also configured as a "wait" model svc:
<property_group name='startd' type='framework'>
    <propval name='duration' type='astring' value='child'/>
    <propval name='ignore_error' type='astring' value='core,signal'/>
</property_group>
From the svc.startd man page:
"Wait" model services are restarted whenever the child process associated with the service exits. A child process that exits is not considered an error for "wait" model services, and repeated failures do not lead to a transition to maintenance state.
In their svc.startd log file we see this error repeated many times per second for days:
May 23 03:10:04/6: instance svc:/application/redis/queue1:default exited with status 127
This occurred on multiple different svcs until they added the missing start method code.
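For context on the exit status in the log line above: by POSIX shell convention, 127 means the command could not be found. The sketch below only demonstrates that convention with an ordinary shell. It assumes the 127 in the startd log has the same cause (the start method could not be executed), which the logs suggest but do not prove. The path and the helper name exit_status_of are made up for illustration.

```python
import subprocess

# Hypothetical path, chosen only because it should not exist on any system.
MISSING_METHOD = "/nonexistent/path/to/start-method"


def exit_status_of(command: str) -> int:
    """Run a command string through /bin/sh and return its exit status."""
    result = subprocess.run(
        ["/bin/sh", "-c", command],
        stderr=subprocess.DEVNULL,  # hide the shell's "not found" message
    )
    return result.returncode


if __name__ == "__main__":
    # A POSIX shell reports 127 when the command cannot be found.
    print(exit_status_of(MISSING_METHOD))
```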
From digging up old OpenSolaris discussions on this exact issue, this is how SMF is designed to behave. This particular failure mode then tickles other bugs in svc.configd, leading it to eventually hang.
However, I think the SMF design is broken here, since it assumes all svc developers are knowledgeable and always create and test their svcs before deploying them. This has not been the case for us. I think we should fix startd so that if the start method does not exist, the svc immediately goes into maintenance instead of retrying forever. Also, if the svc is repeatedly failing, we should probably throttle it down so that it is not restarting 10 times a second.
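To make the proposal concrete, here is a minimal sketch of the restart policy I have in mind. It is written in Python purely for illustration and is not how svc.startd is implemented. The class name RestartPolicy, the action strings, and the delay values are all assumptions. The exponential backoff is one possible way to throttle; a fixed rate limit would work too.

```python
import os
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RestartPolicy:
    """Hypothetical restart policy for a "wait" model service."""

    base_delay: float = 0.1   # seconds before the first throttled restart (assumed value)
    max_delay: float = 60.0   # upper bound on the restart delay (assumed value)
    failures: int = 0         # consecutive non-zero exits seen so far

    def next_action(self, method_path: str, exit_status: int) -> Tuple[str, float]:
        """Return (action, delay_seconds) after the service's child exits."""
        # A start method that is missing or not executable can never succeed,
        # so go straight to maintenance instead of retrying forever.
        if not (os.path.isfile(method_path) and os.access(method_path, os.X_OK)):
            return ("maintenance", 0.0)

        # A clean exit is normal for a "wait" model service: restart at once.
        if exit_status == 0:
            self.failures = 0
            return ("restart", 0.0)

        # Repeated failures: double the delay each time, up to max_delay.
        # The exponent is capped so a service that fails for days cannot
        # overflow the float conversion.
        self.failures += 1
        exponent = min(self.failures - 1, 32)
        delay = min(self.base_delay * (2 ** exponent), self.max_delay)
        return ("restart", delay)


if __name__ == "__main__":
    policy = RestartPolicy()
    # Hypothetical path standing in for the missing start method.
    print(policy.next_action("/nonexistent/start-method", 127))
```

With this policy the redis instance above would have gone into maintenance on the first attempt, and a svc whose method exists but keeps failing would settle at one restart per max_delay rather than many per second.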