Bug #7369
closed
regression after 7267 SMF is fast and loose with optional dependencies
0%
Description
Updated the test VM to the latest illumos-gate; after boot, many services are stuck in a stopped state:
root@grub:/# svcs -vx
svc:/milestone/name-services:default (name services milestone)
State: offline since September 10, 2016 01:10:07 PM CEST
Reason: Unknown.
See: http://illumos.org/msg/SMF-8000-AR
Impact: 17 dependent services are not running:
svc:/milestone/multi-user:default
svc:/system/boot-config:default
- svcadm disable physical:nwam
root@grub:/root# svcadm enable physical:nwam
root@grub:/root# Assertion failed: v->gv_state == RESTARTER_STATE_DEGRADED || v->gv_state == RESTARTER_STATE_ONLINE, file graph.c, line 891
Sep 10 13:19:36 svc.startd[963]: restarting after interruption
Sep 10 13:19:38 grub in.routed[669]: route 0.0.0.0/8 --> 0.0.0.0 nexthop is not directly connected
Sep 10 13:19:42 grub nwamd[890]: 1: nwamd_down_interface: ipadm_delete_addr failed on e1000g0: Could not communicate with dhcpagent
So I activated the previous BE and rebooted, and everything is back to normal.
Updated by Andrew Stormont almost 6 years ago
It seems like this might be down to a bug in the exclude_all dependency handling code. I'm looking into it. Chances are this will also require a small tweak to the nwam manifest file too, since it's incorrect.
Updated by Andrew Stormont almost 6 years ago
- Subject changed from regression after 7267 SMF is fast and loose with optional dependencies to SMF is fast and loose with exclude dependencies
Updated by Andrew Stormont almost 6 years ago
- Related to Bug #7267: SMF is fast and loose with optional dependencies added
Updated by Andrew Stormont almost 6 years ago
Toomas, do you have network/physical:default enabled? Offline would count as enabled.
Updated by Toomas Soome almost 6 years ago
Andrew Stormont wrote:
Toomas, do you have network/physical:default enabled? Offline would count as enabled.
I found the issue with the default OI config: physical:default disabled, physical:nwam + dhcp. While testing the problem, I disabled nwam, enabled default, and configured ipadm create-addr -T dhcp. It still had problems (the same message about communicating with dhcpagent), but it behaved better in the sense that the dependent services were online; just the network interfaces were not configured.
Updated by Andrew Stormont almost 6 years ago
It seems there are a number of issues at play here:
1. SMF does not inhibit services that are being excluded from starting. This makes any attempts to switch between network/physical:nwam and network/physical:default prone to issues if you don't disable the other service first.
2. There's a bug in offline_subtree_leaves that prevents services in offline or maintenance mode from being disabled (unless they've been marked by GV_TODISABLE, which can only happen if "svcadm disable" was done on them while they were in "offline" or "degraded" state).
3. The network/physical:nwam service introduces a cyclic optional_all dependency (name-services -> network/physical:nwam -> name-services). This is why svc:/milestone/name-services:default is wedged in offline state. The resolution code has no awareness of this kind of unsatisfiable dependency.
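To illustrate the third point, here is a minimal sketch of how a cycle like name-services -> network/physical:nwam -> name-services could be detected with a depth-first search. This is not the svc.startd code: toy_vertex_t, toy_add_dep and toy_has_cycle are hypothetical names made up for the example, and the real graph in graph.c is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	TOY_MAX_DEPS	4

/* Hypothetical toy vertex; NOT the graph_vertex_t used by svc.startd. */
typedef struct toy_vertex {
	const char *tv_name;
	struct toy_vertex *tv_deps[TOY_MAX_DEPS]; /* what this one depends on */
	size_t tv_ndeps;
	int tv_color;	/* 0 = unvisited, 1 = on the DFS stack, 2 = done */
} toy_vertex_t;

/* Record that "from" depends on "to". */
void
toy_add_dep(toy_vertex_t *from, toy_vertex_t *to)
{
	assert(from->tv_ndeps < TOY_MAX_DEPS);
	from->tv_deps[from->tv_ndeps++] = to;
}

/*
 * Returns true if a dependency cycle is reachable from v.  The colours
 * are left in place afterwards, so this is a one-shot check per graph.
 */
bool
toy_has_cycle(toy_vertex_t *v)
{
	if (v->tv_color == 1)
		return (true);	/* back edge: v is already on the stack */
	if (v->tv_color == 2)
		return (false);	/* already fully explored, no cycle here */

	v->tv_color = 1;
	for (size_t i = 0; i < v->tv_ndeps; i++) {
		if (toy_has_cycle(v->tv_deps[i]))
			return (true);
	}
	v->tv_color = 2;
	return (false);
}
```

Building two vertices for milestone/name-services and network/physical:nwam, adding a dependency in each direction, and calling toy_has_cycle() on either one reports the loop. A resolver with a check along these lines could at least flag the dependency as unsatisfiable rather than leaving the milestone wedged in offline.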
Updated by Andrew Stormont almost 6 years ago
This webrev attempts to fix the first two issues. The last one will be tackled as part of #7267.
Updated by Andrew Stormont almost 6 years ago
- Subject changed from SMF is fast and loose with exclude dependencies to regression after 7267 SMF is fast and loose with optional dependencies
Updated by Andrew Stormont almost 6 years ago
It turns out there are a bunch more problems with the dependency handling code that weren't showing up before. I think it has something to do with all the enabling/disabling that nwam does in its method file.
Updated by Andrew Stormont almost 6 years ago
This updated version of 7267 should solve your issues: http://cr.illumos.org/~webrev/andy_js/7267-2/
Updated by Andrew Stormont almost 6 years ago
Just a little bit more info on what went wrong here:
The mark_subtree function sets GV_TOOFFLINE on offline instances. This causes the offline_subtree_leaves function to attempt to offline them (falling foul of the assert in vertex_send_event) and the propagate functions to overlook them when their dependencies are satisfied (causing them to be stuck in offline state).
The fix was to stop marking offline services with GV_TOOFFLINE and make the dependency code (even) more robust so instances don't start while their dependents are transitioning.
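A toy model of the marking part of that fix, for illustration only. The types and functions here (toy_inst_t, toy_mark, toy_offline_leaf, TOY_TOOFFLINE) are hypothetical stand-ins and do not reproduce the actual mark_subtree or offline_subtree_leaves code. The sketch assumes only what is described above: a stop may only be sent to an instance that is online or degraded, so only such instances should be flagged for offlining.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the restarter states and the vertex flag. */
typedef enum {
	TOY_STATE_OFFLINE,
	TOY_STATE_ONLINE,
	TOY_STATE_DEGRADED,
	TOY_STATE_DISABLED
} toy_state_t;

#define	TOY_TOOFFLINE	0x1u

typedef struct toy_inst {
	toy_state_t ti_state;
	unsigned int ti_flags;
} toy_inst_t;

/* True if the instance is in a state that a stop event is valid for. */
bool
toy_is_running(const toy_inst_t *inst)
{
	return (inst->ti_state == TOY_STATE_ONLINE ||
	    inst->ti_state == TOY_STATE_DEGRADED);
}

/*
 * Marking step: flag only running instances.  Flagging an instance that
 * is already offline is what would later trip the assertion below.
 */
void
toy_mark(toy_inst_t *inst)
{
	if (toy_is_running(inst))
		inst->ti_flags |= TOY_TOOFFLINE;
}

/*
 * Offlining step: "send a stop" to a flagged instance.  Returns true if
 * a stop was sent.  The assert mirrors the kind of state check that the
 * thread says fired in vertex_send_event.
 */
bool
toy_offline_leaf(toy_inst_t *inst)
{
	if ((inst->ti_flags & TOY_TOOFFLINE) == 0)
		return (false);

	assert(toy_is_running(inst));
	inst->ti_state = TOY_STATE_OFFLINE;
	inst->ti_flags &= ~TOY_TOOFFLINE;
	return (true);
}
```

With the state check in toy_mark(), an instance that is already offline never gets the flag, so the offlining step skips it rather than asserting, and nothing is left flagged on an instance that is merely waiting for its dependencies.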
Updated by Andrew Stormont over 5 years ago
- Status changed from New to Resolved
The last attempt to solve #7267 fixes these issues.