Bug #10001

open

creating duplicate vnics is possible, and breaks dladm

Added by Rich Lowe about 5 years ago. Updated almost 5 years ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
networking
Start date:
2018-11-21
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

A user on IRC reports:

<jlinnosa> Hi all. I just managed to create vnics with same name as existing
           vnics and now datalink stuff won't work... Any advice? 
<richlowe> assuming it's not Joyent's zone stuff, I think your only option is
           probably editing the cfg's by hand, and filing a bug to say how you
           managed it   
<richlowe> /etc/dladm/*.conf I think, but you may want to ask someone like
           danmcd who's more regularly thinking about the network stuff
<richlowe> I think, were I editing the files even though it said not to, I'd
           probably try my hand at just renaming the duplicate so I could
           delete it/edit it properly 
<jlinnosa> it's omnios r151026, and i just basically copy-pasted 'dladm
           create-vnic -l aggr0 netns20; dladm create-vnic -l igb4 netns21'
           twice...  
<richlowe> Wow
<richlowe> See, I thought the tools were like... at least that not broken
<richlowe> :\
<jlinnosa> right after 'dladm show-vnic' listed everything nicely, but
           everything stopped working after 'dladm delete-vnic netns20'...
                                                                        [15:39]

(interspersed with my responses).

Basically, dladm allowed a duplicate vnic to be created (it shouldn't), and then failed


Related issues

Related to illumos gate - Bug #7464: crossbow allows to create a vnic with existing name (New, 2016-10-10)

Actions #1

Updated by Rich Lowe about 5 years ago

<jlinnosa> yeah, 'svc:/network/datalink-management:default (data-link
           management daemon)' is in maintenance
<jlinnosa> 'Reason: Start method exited with $SMF_EXIT_ERR_FATAL.'
<jlinnosa> '[ Nov 21 15:49:37 Stopping because process dumped core. ]'
Actions #2

Updated by Jaakko Linnosaari about 5 years ago

From syslog:

Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('unix', 0, 'netns20'): namespace collision
Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('netns20', 0, 'mac_misc_stat'): namespace collision
Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('netns20', 0, 'mac_rx_swlane0'): namespace collision
Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('netns20', 0, 'mac_tx_swlane0'): namespace collision
Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('netns20', 0, 'mac_tx_hwlane1'): namespace collision
Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('netns20', 0, 'mac_tx_hwlane0'): namespace collision
Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('netns20', 0, 'mac_rx_swlane0_fanout0'): namespace collision
Nov 21 15:48:20 brunhes mac: [ID 469746 kern.info] NOTICE: vnic1056 registered
Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('link', 0, 'netns20'): namespace collision
Nov 21 15:48:20 brunhes mac: [ID 435574 kern.info] NOTICE: vnic1056 link up, 1000 Mbps, unknown duplex
Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('unix', 0, 'netns21'): namespace collision
Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('netns21', 0, 'mac_misc_stat'): namespace collision
Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('netns21', 0, 'mac_rx_swlane0'): namespace collision
Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('netns21', 0, 'mac_tx_hwlane0'): namespace collision
Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('netns21', 0, 'mac_rx_swlane0_fanout0'): namespace collision
Nov 21 15:48:20 brunhes mac: [ID 469746 kern.info] NOTICE: vnic1057 registered
Nov 21 15:48:20 brunhes unix: [ID 665567 kern.warning] WARNING: kstat_create('link', 0, 'netns21'): namespace collision
Nov 21 15:48:20 brunhes mac: [ID 435574 kern.info] NOTICE: vnic1057 link up, 1000 Mbps, unknown duplex
Nov 21 15:49:00 brunhes mac: [ID 469746 kern.info] NOTICE: vnic1058 registered
Nov 21 15:49:00 brunhes mac: [ID 435574 kern.info] NOTICE: vnic1058 link up, 1000 Mbps, unknown duplex
Nov 21 15:49:00 brunhes mac: [ID 469746 kern.info] NOTICE: vnic1059 registered
Nov 21 15:49:00 brunhes mac: [ID 435574 kern.info] NOTICE: vnic1059 link up, 1000 Mbps, unknown duplex
Nov 21 15:49:37 brunhes zoneadmd[7399]: [ID 702911 daemon.error] [zone 'ns2'] datalinks remain in zone after shutdown
Nov 21 15:49:37 brunhes zoneadmd[7399]: [ID 702911 daemon.error] [zone 'ns2'] unable to unconfigure network interfaces in zone
Nov 21 15:49:37 brunhes zoneadmd[7399]: [ID 702911 daemon.error] [zone 'ns2'] unable to destroy zone
Nov 21 15:49:37 brunhes dlmgmtd[17590]: [ID 597718 daemon.warning] dlmgmt_process_db_onereq() read operation on persistent configuration failed: Invalid argument
Nov 21 15:49:37 brunhes dlmgmtd[17590]: [ID 807205 daemon.error] unable to initialize daemon: Invalid argument
Nov 21 15:49:37 brunhes svc.startd[9]: [ID 652011 daemon.warning] svc:/network/datalink-management:default: Method "/lib/svc/method/svc-dlmgmtd" failed with exit status 95.
Nov 21 15:49:37 brunhes svc.startd[9]: [ID 748625 daemon.error] network/datalink-management:default failed fatally: transitioned to maintenance (see 'svcs -xv' for details)
Nov 21 15:49:37 brunhes fmd: [ID 377184 daemon.error] SUNW-MSG-ID: SMF-8000-YX, TYPE: defect, VER: 1, SEVERITY: major
Nov 21 15:49:37 brunhes EVENT-TIME: Wed Nov 21 15:49:37 EET 2018
Nov 21 15:49:37 brunhes PLATFORM: ProLiant-DL360e-Gen8, CSN: xxxxxxxxxx, HOSTNAME: brunhes
Nov 21 15:49:37 brunhes SOURCE: software-diagnosis, REV: 0.1
Nov 21 15:49:37 brunhes EVENT-ID: 6069c745-b588-caeb-eee3-df15e7f01b2f
Nov 21 15:49:37 brunhes DESC: A service failed - a start, stop or refresh method failed.
Nov 21 15:49:37 brunhes   Refer to http://illumos.org/msg/SMF-8000-YX for more information.
Nov 21 15:49:37 brunhes AUTO-RESPONSE: The service has been placed into the maintenance state.
Nov 21 15:49:37 brunhes IMPACT: svc:/network/datalink-management:default is unavailable.
Nov 21 15:49:37 brunhes REC-ACTION: Run 'svcs -xv svc:/network/datalink-management:default' to determine the generic reason why the service failed, the location of any logfiles, and a list of other services impacted.

Actions #3

Updated by Jaakko Linnosaari about 5 years ago

# uname -a
SunOS brunhes 5.11 omnios-r151026-d9b45886bd i86pc i386 i86pc
Actions #4

Updated by Peter Tribble about 5 years ago

Tested in tribblix m20 and m20.5, and omnitribblix m20.5 - all fail, so this confirms it's a problem with vanilla illumos-gate

To reproduce:

create an exclusive-ip zone with a vnic
then, with the zone running, attempt to create a vnic of the same name in the global zone

Actions #5

Updated by Carlos Neira almost 5 years ago

Peter Tribble wrote:

Tested in tribblix m20 and m20.5, and omnitribblix m20.5 - all fail, so this confirms it's a problem with vanilla illumos-gate

To reproduce:

create an exclusive-ip zone with a vnic
then, with the zone running, attempt to create a vnic of the same name in the global zone

This change (https://github.com/omniosorg/illumos-omnios/pull/367) only addresses creating duplicate vnics using dladm. I don't know if this is enough or whether some other checks should also be in place.

Actions #6

Updated by Andy Fiddaman almost 5 years ago

  • Status changed from New to In Progress
  • Assignee set to Andy Fiddaman
  • Tags deleted (needs-triage)
Actions #7

Updated by Andy Fiddaman almost 5 years ago

The problem occurs because when the duplicate link is returned to the global zone, dlmgmtd crashes with this stack:

--------------------- thread# 2 / lwp# 2 ---------------------
 feeec827 _lwp_kill (fed1e974, fed1e974, b0, feed2ee4) + 7
 feed30d8 _assfail (fed6167c, fed61634, 289) + 205
 feed31ed assfail  (fed6167c, fed61634, 289) + 21
 fed6141e avl_add  (806bb9c, 806d930) + 48
 080558a4 dlmgmt_setzoneid (fed1edf4, fed1ed80, fed1eda8, 0, 806cd58) + 175
 0805548a dlmgmt_handler (0, fed1edf4, c, 0, 0) + 96
 feeecebd __door_return () + 3d

There is a block of code in dlmgmt_setzoneid() that is supposed to detect duplicates in the destination zone:

        if (zoneid != GLOBAL_ZONEID &&
            link_by_name(linkp->ll_link, newzoneid) != NULL) {
                err = EEXIST;
                goto done;
        }

but this never triggers because there is an earlier check for zoneid != GLOBAL_ZONEID that returns EACCES. I think the original author actually intended if (newzoneid != GLOBAL_ZONEID) here, but that is still not enough because if the new zone is the GZ then link_by_name() also consults the on-loan-link list.
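
To make the dead-code point concrete, here is a small self-contained model of the order of the two checks. It is only a sketch, not the dlmgmtd source: the toy_* names, the two-entry link table and the exact-match lookup are all made up for illustration, and it assumes that zoneid is the caller's zone and newzoneid is the destination.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define	GLOBAL_ZONEID	0	/* value of the real illumos constant */

/* Toy stand-in for the link table: a name and the zone that holds it. */
typedef struct toy_link {
	const char *name;
	int zoneid;
} toy_link_t;

static toy_link_t toy_links[] = {
	{ "test0", GLOBAL_ZONEID },	/* VNIC created in the global zone */
	{ "test0", 1 },			/* same-named VNIC held by zone 1 */
};

/* Hypothetical simplification of link_by_name(): match name and zone. */
static const toy_link_t *
toy_link_by_name(const char *name, int zoneid)
{
	size_t n = sizeof (toy_links) / sizeof (toy_links[0]);

	for (size_t i = 0; i < n; i++) {
		if (toy_links[i].zoneid == zoneid &&
		    strcmp(toy_links[i].name, name) == 0)
			return (&toy_links[i]);
	}
	return (NULL);
}

/* Models only the ordering of the two checks described above. */
static int
toy_setzoneid_check(int zoneid, const char *name, int newzoneid)
{
	assert(name != NULL);

	if (zoneid != GLOBAL_ZONEID)
		return (EACCES);	/* earlier check: caller must be the GZ */

	/* zoneid == GLOBAL_ZONEID from here on, so this is never true. */
	if (zoneid != GLOBAL_ZONEID &&
	    toy_link_by_name(name, newzoneid) != NULL)
		return (EEXIST);

	return (0);	/* the move proceeds, duplicate or not */
}
```

In this model, toy_setzoneid_check(GLOBAL_ZONEID, "test0", GLOBAL_ZONEID) returns 0 even though the table already holds a test0 in the global zone, and any non-global caller gets EACCES before the duplicate test is reached.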

There are two ways to fix this:
1. Prevent allocation of a duplicate-named VNIC. This is already effectively enforced when zones are not involved as the backend call to zone_add_datalink() already returns EEXIST for that case.
2. Fix the code block above to prevent a link changing zone if that would result in duplicates in the destination zone.

I'm going for 2. Either way, for illumos-gate*, VNICs should all be given unique names. With this fix in place, shutting down a zone that has a VNIC with the same name as one already in the GZ will result in the zone hanging in the 'down' state, requiring administrator intervention; I think this is appropriate.
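
For illustration only, the shape of the check that option 2 needs can be sketched as below. This is not the change under review: sketch_link_t and sketch_check_move() are hypothetical, and a flat array stands in for dlmgmtd's real data structures. The one point it tries to capture is that the link being moved must be excluded from the lookup, so that a link on loan does not collide with itself when it returns to the GZ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, flattened view of a datalink for this sketch. */
typedef struct sketch_link {
	int id;			/* unique link id */
	const char *name;	/* link name, e.g. "test0" */
	int zoneid;		/* zone the link currently lives in */
} sketch_link_t;

/*
 * Return 0 if 'moving' may be reassigned to 'newzoneid', or EEXIST if a
 * different link with the same name already lives in that zone.
 */
static int
sketch_check_move(const sketch_link_t *links, size_t nlinks,
    const sketch_link_t *moving, int newzoneid)
{
	assert(links != NULL && moving != NULL);

	for (size_t i = 0; i < nlinks; i++) {
		const sketch_link_t *lp = &links[i];

		if (lp->id == moving->id)
			continue;	/* never compare the link with itself */
		if (lp->zoneid == newzoneid &&
		    strcmp(lp->name, moving->name) == 0)
			return (EEXIST);
	}
	return (0);
}
```

With a table holding test0 in zone 1 and another test0 in zone 0, asking to move the zone 1 link to zone 0 yields EEXIST, which corresponds to the case where the zone is left in the 'down' state for the administrator to sort out.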

  • * illumos-joyent has 20 additional commits in this area which restructure the way VNICs work, including (effectively) support for VNICs in different zones having the same name. I did port these to omnios and experiment, but they bring other enhancements too, such as automatically removing VNICs during zone shutdown. Either way, Joyenteurs, you'll find that the fix in illumos-joyent for this precise problem (OS-1457 dladm won't show or create vnics) is different, because so much else is different there, including the data structures and lookup functions.
Actions #8

Updated by Andy Fiddaman almost 5 years ago

Review at https://illumos.org/rb/r/1482/

Testing notes, with the new dlmgmtd in place:

bloody# dladm show-vnic
LINK         OVER         SPEED  MACADDRESS        MACADDRTYPE         VID
test0        vioif0       1000   2:8:20:67:d4:82   random              0

Normal case, zone startup/shutdown

bloody# zoneadm -z test boot
bloody# zlogin test dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
test0       vnic      1500   up       --         ?
bloody# zoneadm -z test halt

Error case:

bloody# dladm create-vnic -l vioif0 test0
dladm: vnic creation over vioif0 failed: object already exists
bloody# zoneadm -z test boot
bloody# dladm create-vnic -l vioif0 test0
bloody# dladm show-vnic
LINK         OVER         SPEED  MACADDRESS        MACADDRTYPE         VID
test0        vioif0       1000   2:8:20:67:d4:82   random              0
test0        vioif0       1000   2:8:20:68:da:d6   random              0
bloody# zoneadm -z test halt
zone 'test': datalinks remain in zone after shutdown
zone 'test': unable to unconfigure network interfaces in zone
zone 'test': unable to destroy zone
bloody#
bloody# svcs -x
bloody# zoneadm list -vc
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              ipkg     shared
   1 test             down       /zones/test                    pkgsrc   excl

Actions #9

Updated by Andy Fiddaman almost 5 years ago

  • Related to Bug #7464: crossbow allows to create a vnic with existing name added