Bug #7697

svc.startd aborting in method_ready_contract

Added by Robert Mustacchi over 3 years ago.

Status: New
Priority: Normal
Assignee:
Category: cmd - userland programs
Start date: 2016-12-29
Due date:
% Done: 90%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

We have a bunch of svc.startd cores on the east-3b HN, all with the following stack:

> $C
f34aedb8 libc.so.1`_lwp_kill+0x15(d4, 6, 334, fef5c000, fef5c000, 835f940)
f34aedd8 libc.so.1`raise+0x2b(6, 0, f34aedf0, fef5f4c0, 0, 0)
f34aee28 libc.so.1`abort+0x10e(835f940, 832c748, f34aee58, 0, 835f940, 832c748)
f34aee58 method_ready_contract+0x232(835f940, 0, 0, 0)
f34aef78 method_run+0x471(f34aefac, 0, f34aefa4, 8355a28, fef5c000, f899a240)
f34aefc8 method_thread+0x179(836fdc0, 0, 0, 0)
f34aefe8 libc.so.1`_thrp_setup+0x88(f899a240)
f34aeff8 libc.so.1`_lwp_start(f899a240, 0, 0, 0, 0, 0)

This looks like we're at the following:

bad_error("ct_pr_tmpl_set_transfer", ret);
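The stack shows abort() called directly from method_ready_contract, which is what a "report the unexpected error, then abort" helper would produce. Below is a minimal sketch of that pattern in portable C. The names format_unexpected and die_unexpected and the message wording are hypothetical; this is not the svc.startd definition of bad_error.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Format a message naming the failing function and its error code.
 * Returns 0 on success, -1 if the message did not fit in buf.
 * Hypothetical helper; the wording is an assumption.
 */
int
format_unexpected(char *buf, size_t len, const char *func, int err)
{
	int n;

	assert(buf != NULL && func != NULL);
	n = snprintf(buf, len,
	    "%s() failed with unexpected error %d. Aborting.", func, err);
	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}

/*
 * Report the error on stderr and abort so that a core is left behind
 * for debugging. The caller's frame sits directly above abort() in the
 * resulting stack.
 */
void
die_unexpected(const char *func, int err)
{
	char msg[160];

	if (format_unexpected(msg, sizeof (msg), func, err) == 0)
		(void) fprintf(stderr, "%s\n", msg);
	abort();
}
```

A caller would invoke die_unexpected("ct_pr_tmpl_set_transfer", ret) for any return value it has no handling for, which leaves a core like the ones described here.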

Also note that the svc.startd log contains many messages like the following from around the time of the cores:

Nov  3 13:54:44/176: failed to abandon contract 729828: Permission denied
Nov  3 13:59:54/90: failed to abandon contract 728107: Permission denied
Nov  3 15:19:42/184: failed to abandon contract 717157: Permission denied
Nov  3 15:19:50/189: failed to abandon contract 717162: Permission denied
Nov  3 15:19:50/191: failed to abandon contract 717161: Permission denied
Nov  3 15:20:13/202: failed to abandon contract 717167: Permission denied
Nov  3 15:20:19/206: failed to abandon contract 729844: Permission denied
Nov  3 16:35:50/178: failed to abandon contract 770266: No such file or directory
Nov  3 16:35:59/187: failed to abandon contract 770273: No such file or directory
Nov  3 16:35:59/189: failed to abandon contract 770272: No such file or directory
Nov  3 16:36:19/197: failed to abandon contract 770280: No such file or directory
Nov  3 16:36:24/201: failed to abandon contract 770301: No such file or directory

Here are the messages from /var/adm/messages from around the time of the first core dump:

Nov  3 13:54:43 headnode svc.ipfd[20418]: [ID 404139 daemon.error] _scf_notify_wait failed: connection to repository broken
Nov  3 13:54:44 headnode svc.startd[3915]: [ID 575841 daemon.notice] failed to abandon contract 729828: Permission denied
Nov  3 13:55:25 headnode svc.ipfd[20418]: [ID 404139 daemon.error] _scf_notify_wait failed: connection to repository broken
Nov  3 13:59:53 headnode last message repeated 6 times

We hit this again and were able to figure out what happened. The / file system filled up and init got into the mode where it tries to restart svc.startd. That failed three times in quick succession, so init went into its maintenance mode. We removed the file that had filled root, but init stayed in maintenance until we got on the console and typed ^D. That took it out of maintenance, and it then restarted svc.startd and svc.configd successfully.

> startd_failure_time::array hrtime_t 3
806b7b8
806b7c0
806b7c8

> $C
08047548 libc.so.1`__sigsuspend+7(8047560, 8047570, 8047580, feedd8f4)
08047598 waitproc+0x62(806bbdc, 0, 4, 8058a63)
080475b8 enter_maintenance+0x14f(1, 0, 12c, 806f830, 0, 0)
080475d8 contract_event+0xcf(806b7a4, 1, 493e0, 8058e8a)
08047f18 main+0x494(fee10140, fef7b728, 8047f40, 8054343, 1, 8047f4c)
08047f40 _start+0x83(1, 8047fd0, 0, 0, 7d8, 8047fdb)
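The three-element startd_failure_time array and the restart-failure behavior described above suggest a small ring buffer of failure timestamps that is checked after each failure. The following is a rough sketch of that idea, not a copy of init's code. The type and function names are hypothetical, and the 5-second threshold is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and threshold, for illustration only. */
#define	NFAILURES		3	/* matches the 3-element array in mdb */
#define	FAILURE_RATE_NS		5000000000LL	/* assumed: 1 failure per 5 s */

typedef struct failure_ring {
	int64_t		fr_time[NFAILURES];	/* timestamps in nanoseconds */
	unsigned int	fr_count;		/* total failures recorded */
} failure_ring_t;

/* Record one failure, overwriting the oldest slot once the ring is full. */
void
failure_ring_record(failure_ring_t *fr, int64_t now_ns)
{
	assert(fr != NULL);
	fr->fr_time[fr->fr_count % NFAILURES] = now_ns;
	fr->fr_count++;
}

/*
 * Return 1 if the last NFAILURES failures arrived faster, on average,
 * than one per FAILURE_RATE_NS; otherwise return 0.
 */
int
failure_rate_critical(const failure_ring_t *fr)
{
	int64_t newest, oldest;

	assert(fr != NULL);
	if (fr->fr_count < NFAILURES)
		return (0);

	newest = fr->fr_time[(fr->fr_count - 1) % NFAILURES];
	oldest = fr->fr_time[fr->fr_count % NFAILURES];
	return ((newest - oldest) / NFAILURES < FAILURE_RATE_NS);
}
```

With a check of this shape, three failures one second apart would trip the threshold, while three failures spread over several minutes would not.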

It would be good if init logged this condition to /var/adm/messages, since that file is on a different file system with plenty of space. Currently it is hard to tell that init is in this state.
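One possible shape for such logging, using the standard syslog(3C) interface, is sketched below. The function names and message text are hypothetical, and the sketch assumes syslog is usable from init at the point where it enters maintenance, which this report does not establish.

```c
#include <assert.h>
#include <stdio.h>
#include <syslog.h>

/*
 * Build the message text. Returns 0 on success, -1 if it did not fit.
 * Hypothetical helper; the wording is an assumption.
 */
int
format_maintenance_msg(char *buf, size_t len, int failures)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len,
	    "svc.startd failed %d times in quick succession; "
	    "init is entering maintenance mode", failures);
	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}

/* Send the message to the system log at daemon.crit. */
void
log_maintenance(int failures)
{
	char msg[128];

	openlog("init", LOG_PID, LOG_DAEMON);
	if (format_maintenance_msg(msg, sizeof (msg), failures) == 0)
		syslog(LOG_CRIT, "%s", msg);
	else
		syslog(LOG_CRIT, "init is entering maintenance mode");
	closelog();
}
```

Calling log_maintenance(3) just before entering maintenance would leave a record in the system log, provided syslogd is running and its configuration routes daemon.crit to /var/adm/messages.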
