Bug #7369
regression after 7267 SMF is fast and loose with optional dependencies
Added by Toomas Soome over 4 years ago.
Updated almost 4 years ago.
Category:
cmd - userland programs
Description
I updated the test VM to the latest illumos-gate, and after boot many services on the system are stuck in a stopped state:
root@grub:/# svcs -vx
svc:/milestone/name-services:default (name services milestone)
State: offline since September 10, 2016 01:10:07 PM CEST
Reason: Unknown.
See: http://illumos.org/msg/SMF-8000-AR
Impact: 17 dependent services are not running:
svc:/milestone/multi-user:default
svc:/system/boot-config:default
As the network was also not configured, I did:
- svcadm disable physical:nwam
root@grub:/root# svcadm enable physical:nwam
root@grub:/root# Assertion failed: v->gv_state == RESTARTER_STATE_DEGRADED || v->gv_state == RESTARTER_STATE_ONLINE, file graph.c, line 891
Sep 10 13:19:36 svc.startd[963]: restarting after interruption
Sep 10 13:19:38 grub in.routed[669]: route 0.0.0.0/8 --> 0.0.0.0 nexthop is not directly connected
Sep 10 13:19:42 grub nwamd[890]: 1: nwamd_down_interface: ipadm_delete_addr failed on e1000g0: Could not communicate with dhcpagent
So I activated the previous BE and rebooted, and everything is back to normal.
It seems like this might be down to a bug in the exclude_all dependency handling code. I'm looking into it. Chances are this will also require a small tweak to the nwam manifest file too, since it's incorrect.
- Subject changed from regression after 7267 SMF is fast and loose with optional dependencies to SMF is fast and loose with exclude dependencies
- Related to Bug #7267: SMF is fast and loose with optional dependencies added
Toomas do you have network/physical:default enabled? Offline would count as enabled.
Andrew Stormont wrote:
Toomas do you have network/physical:default enabled? Offline would count as enabled.
I found the issue with the default OI config: physical:default disabled, physical:nwam enabled, with DHCP. While testing the problem, I disabled nwam, enabled default, and configured the address with ipadm create-addr -T dhcp. It still had problems (the same message about communication with dhcpagent), but it behaved better in the sense that the dependent services were online; only the network interfaces were not configured.
It seems there are a number of issues at play here:
1. SMF does not inhibit services that are being excluded from starting. This makes any attempts to switch between network/physical:nwam and network/physical:default prone to issues if you don't disable the other service first.
2. There's a bug in offline_subtree_leaves that prevents services in offline or maintenance mode from being disabled (unless they've been marked by GV_TODISABLE, which can only happen if "svcadm disable" was done on them while they were in "offline" or "degraded" state).
3. The network/physical:nwam service introduces a cyclic optional_all dependency (name-services -> network/physical:nwam -> name-services). This is why svc:/milestone/name-services:default is wedged in offline state. The resolution code does not have any awareness of this type of "unsatisfiable"-ness.
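The cycle in item 3 can be illustrated with a small depth-first search. This is only a sketch, not the svc.startd resolution code: the dictionary-based graph and the find_cycle helper are hypothetical, and only the two edges named above are modeled.

```python
from typing import Dict, List, Optional

# Hypothetical, simplified graph: each service maps to the services it
# depends on. Only the two optional_all edges from item 3 are modeled.
DEPS: Dict[str, List[str]] = {
    "svc:/milestone/name-services:default": ["svc:/network/physical:nwam"],
    "svc:/network/physical:nwam": ["svc:/milestone/name-services:default"],
}


def find_cycle(deps: Dict[str, List[str]], start: str) -> Optional[List[str]]:
    """Return a dependency path leading from start back to start, or None."""
    path: List[str] = []
    visited = set()

    def visit(node: str) -> bool:
        path.append(node)
        for dep in deps.get(node, []):
            if dep == start:
                # Found an edge back to the starting service: close the loop.
                path.append(dep)
                return True
            if dep not in visited:
                visited.add(dep)
                if visit(dep):
                    return True
        path.pop()
        return False

    return path if visit(start) else None
```

Calling find_cycle(DEPS, "svc:/milestone/name-services:default") returns the path name-services -> network/physical:nwam -> name-services. A resolver that ran such a check could report the dependency as unsatisfiable instead of leaving the milestone waiting in offline state.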
- Subject changed from SMF is fast and loose with exclude dependencies to regression after 7267 SMF is fast and loose with optional dependencies
It turns out there are a bunch more problems with the dependency handling code that weren't showing up before. I think it has something to do with all the enabling/disabling that nwam does in its method file.
Just a little bit more info on what went wrong here:
The mark_subtree function sets GV_TOOFFLINE on offline instances. This causes the offline_subtree_leaves function to attempt to offline them (running afoul of the assert in vertex_send_event) and the propagate functions to overlook them when their dependencies are satisfied (causing them to be stuck in offline state).
The fix was to stop marking offline services with GV_TOOFFLINE and make the dependency code (even) more robust so instances don't start while their dependents are transitioning.
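The effect of the marking change can be sketched with a toy model. This is an illustration under stated assumptions, not the graph.c code: the Vertex class, the string state names, the flag value, and the skip_offline parameter are all hypothetical, and the graph is assumed to be a small acyclic tree. Only the names mark_subtree and GV_TOOFFLINE come from the discussion above.

```python
from dataclasses import dataclass, field
from typing import List

GV_TOOFFLINE = 0x1  # placeholder value for illustration, not the real constant


@dataclass
class Vertex:
    name: str
    state: str  # e.g. "online", "degraded", "offline"
    flags: int = 0
    dependents: List["Vertex"] = field(default_factory=list)


def mark_subtree(v: Vertex, skip_offline: bool) -> None:
    """Flag v and its dependents as going offline (toy version).

    With skip_offline=True, instances already in offline state are not flagged.
    """
    if not (skip_offline and v.state == "offline"):
        v.flags |= GV_TOOFFLINE
    for d in v.dependents:
        mark_subtree(d, skip_offline)


def stop_candidates(v: Vertex) -> List[Vertex]:
    """Collect flagged vertices; in this model each would be sent a stop event."""
    out = [v] if v.flags & GV_TOOFFLINE else []
    for d in v.dependents:
        out.extend(stop_candidates(d))
    return out


def can_stop(v: Vertex) -> bool:
    """Mirror the asserted condition: only online or degraded may be stopped."""
    return v.state in ("online", "degraded")


def build() -> Vertex:
    """Build a two-node tree: an online root with one offline dependent."""
    leaf = Vertex("leaf", "offline")
    return Vertex("root", "online", dependents=[leaf])
```

In this model, marking with skip_offline=False flags the offline leaf, so it becomes a stop candidate that fails the can_stop check, which corresponds to the assertion failure reported earlier. With skip_offline=True only the online root is flagged, and the leaf stays eligible to start once its dependencies are satisfied.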
- Status changed from New to Resolved
The last attempt to solve #7267 fixes these issues.