regression after 7267 SMF is fast and loose with optional dependencies
Updated the test VM to the latest illumos-gate, and after boot many services are in a stopped state:
root@grub:/# svcs -vx
svc:/milestone/name-services:default (name services milestone)
State: offline since September 10, 2016 01:10:07 PM CEST
Impact: 17 dependent services are not running:
- svcadm disable physical:nwam
root@grub:/root# svcadm enable physical:nwam
root@grub:/root# Assertion failed: v->gv_state == RESTARTER_STATE_DEGRADED || v->gv_state == RESTARTER_STATE_ONLINE, file graph.c, line 891
Sep 10 13:19:36 svc.startd[963]: restarting after interruption
Sep 10 13:19:38 grub in.routed[669]: route 0.0.0.0/8 --> 0.0.0.0 nexthop is not directly connected
Sep 10 13:19:42 grub nwamd[890]: 1: nwamd_down_interface: ipadm_delete_addr failed on e1000g0: Could not communicate with dhcpagent
So I activated the previous BE and rebooted, and everything is back to normal.
Updated by Toomas Soome about 5 years ago
Andrew Stormont wrote:
Toomas do you have network/physical:default enabled? Offline would count as enabled.
I found the issue with the default OI config: physical:default disabled, physical:nwam enabled with DHCP. While testing the problem, I also disabled nwam, enabled default, and configured the interface with ipadm create-addr -T dhcp. It still had problems (the same message about communicating with dhcpagent), but it behaved better in the sense that the dependent services were online; only the network interfaces were not configured.
Updated by Andrew Stormont about 5 years ago
It seems there are a number of issues at play here:
1. SMF does not stop excluded services from starting. This makes any attempt to switch between network/physical:nwam and network/physical:default prone to issues unless you disable the other service first.
2. There's a bug in offline_subtree_leaves that prevents services in offline or maintenance mode from being disabled (unless they've been marked by GV_TODISABLE, which can only happen if "svcadm disable" was done on them while they were in "offline" or "degraded" state).
3. The network/physical:nwam service introduces a cyclic optional_all dependency (name-services -> network/physical:nwam -> name-services). This is why svc:/milestone/name-services:default is wedged in the offline state. The dependency resolution code has no awareness of this kind of unsatisfiability.
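To illustrate the cycle in point 3, here is a minimal sketch of how such a loop could be detected. It is a toy model, not svc.startd code: the service names come from this thread, while the edge table and the find_cycle helper are assumptions made up for the example.

```python
# Toy model only: service names are taken from the thread; the edge table
# and find_cycle() are hypothetical and do not reflect svc.startd internals.
def find_cycle(deps, start):
    """Return the nodes of a dependency cycle reachable from start, or None."""
    path = []          # current depth-first search path, in order
    on_path = set()    # same nodes as path, for fast membership tests
    done = set()       # nodes fully explored without finding a cycle

    def visit(node):
        if node in on_path:
            # Walked back onto the current path: slice out the loop.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        on_path.add(node)
        for dep in deps.get(node, ()):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(start)


# Assumed optional_all edges, as described in point 3.
optional_all = {
    "milestone/name-services": ["network/physical:nwam"],
    "network/physical:nwam": ["milestone/name-services"],
}

cycle = find_cycle(optional_all, "milestone/name-services")
print(" -> ".join(cycle))
```

A check like this only reports that a loop exists; what the resolver should do with a cyclic optional_all dependency is a separate design question.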
Updated by Andrew Stormont almost 5 years ago
Just a little bit more info on what went wrong here:
The mark_subtree function sets GV_TOOFFLINE on offline instances. This causes the offline_subtree_leaves function to attempt to offline them (running afoul of the assert in vertex_send_event) and the propagate functions to overlook them when their dependencies are satisfied (leaving them stuck in the offline state).
The fix was to stop marking offline services with GV_TOOFFLINE and make the dependency code (even) more robust so instances don't start while their dependents are transitioning.
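The marking change can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in graph.c: the Vertex class, the to_offline field (standing in for GV_TOOFFLINE), and the skip_offline switch are all hypothetical names invented for the example.

```python
# Hypothetical toy model: state names mirror SMF states, but Vertex and
# mark_subtree() here are illustrative and are not the graph.c versions.
from dataclasses import dataclass, field


@dataclass
class Vertex:
    name: str
    state: str                          # "online", "degraded", "offline", ...
    dependents: list = field(default_factory=list)
    to_offline: bool = False            # stands in for GV_TOOFFLINE


def mark_subtree(root, skip_offline):
    """Flag root and its dependents for offlining; return the names flagged.

    skip_offline=False models the old behaviour (every instance is flagged);
    skip_offline=True models the fix (only running instances are flagged).
    """
    flagged = []
    seen = set()
    stack = [root]
    while stack:
        cur = stack.pop()
        if cur.name in seen:
            continue
        seen.add(cur.name)
        running = cur.state in ("online", "degraded")
        if running or not skip_offline:
            cur.to_offline = True
            flagged.append(cur.name)
        stack.extend(cur.dependents)
    return flagged


def build():
    # Assumed two-node graph: an online service with one offline dependent.
    dep = Vertex("milestone/name-services", "offline")
    return Vertex("network/physical:nwam", "online", [dep])


old = mark_subtree(build(), skip_offline=False)
new = mark_subtree(build(), skip_offline=True)
print("old:", old)
print("new:", new)
```

In the old behaviour the offline dependent is flagged as well, which in this model corresponds to the instance that would later trip the assert and be skipped during propagation. With the change, only the running instance is flagged.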