Project

General

Profile

Bug #8437

dlmgmt_setzoneid() doesn't enforce uniqueness

Added by Ryan Zezeski over 2 years ago. Updated about 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
networking
Start date:
2017-06-28
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

As reported in great detail by OS-1457, the dlmgmt_setzoneid() function doesn't properly enforce link name uniqueness. If you create two links with the same name but different mac addrs the second create will fail but leave various datalink structures in an inconsistent state. If dlmgmt_setzoneid() returned EEXIST, as it should, this scenario is avoided.

History

#1

Updated by Ryan Zezeski over 2 years ago

Reproducing this on OmniOS was a bit different than SmartOS because the gate code doesn't allow setting properties during vnic-create if the link is temporary (something SmartOS changed a long time ago). So if you want to set the zone property on a temporary vnic you first have to create the vnic and then call set-linkprop. But other than that I was able to trip the same assert in dlmgmtd on OmniOS as I did on SmartOS. Here are the steps to reproduce on OmniOS:

  1. Create a zone so you can assign vnics to it.
  2. Create vnic.
  3. Set zone property.
  4. Create the same vnic as step 1 but change the MAC addr.
  5. Set it's zone property to the same as step 2.
# dladm create-vnic -l e1000g0 -t -m 12:22:45:01:01:01 vn3
# dladm set-linkprop -t -p zone=zone1 vn3
# dladm create-vnic -l e1000g0 -t -m 12:22:46:01:01:01 vn3
# dladm set-linkprop -t -p zone=zone1 vn3
dladm: warning: cannot set link property 'zone' on 'vn3': operation failed

If you pgrep dlmgmtd you'll notice the pid changed because it crashed. If you have coreadm setup properly should see the following status:

root@mt-corel:/root# mdb /cores/core.dlmgmtd.3558
Loading modules: [ libc.so.1 libavl.so.1 libnvpair.so.1 ld.so.1 ]
> ::status
debugging core file of dlmgmtd (32-bit) from mt-corel
file: /sbin/dlmgmtd
initial argv: /sbin/dlmgmtd
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=3558 uid=0 code=-1
panic message:
assertion failed for thread 0xfec20240, thread-id 2: avl_find() succeeded inside avl_add(), file /export/home/danmcd/build/illumos-omnios/usr/src/common/a
vl/avl.c, line 649
> $C
fed2e94c libc.so.1`_lwp_kill+0x15(fed2e974, fed2e974, ad, fed2ea1d, 2, 8)
fed2ecac libc.so.1`_assfail+0x25d(fed71658, fed71614, 289, fed71658, fed2ed0c)
fed2eccc libc.so.1`assfail+0x2b(fed71658, fed71614, 289, 806d0cc)
fed2ed0c libavl.so.1`avl_add+0x48(806bb70, 806d240, fed2ed2c, fef5f000, 114, fed2edac)
fed2ed4c dlmgmt_setzoneid+0x14c(fed2edf4, fed2ed80, fed2eda8, 0, 806d0a0, 806d0a0)
fed2edcc dlmgmt_handler+0x96(0, fed2edf4, c, 0, 0, 805516c)
00000000 libc.so.1`__door_return+0x4b()
#2

Updated by Ryan Zezeski over 2 years ago

webrev: http://us-east.manta.joyent.com/rpz/public/webrev/gate/8437/index.html

The zoneid != GLOBAL_ZONE has no business being there. There is an earlier check that makes sure the calling zone is the GZ, otherwise it returns with EACCES. Since an NGZ can never execute this function the link_by_name() function is always skipped and it causes an abort later when avl_add() is called with the same link name and zone id. By removing the GZ check the link_by_name() function is allowed to execute and this function will return EEXIST when the link already exists.

#3

Updated by Electric Monk over 2 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit b2519362c825a494fb6e93549e2e32a425011563

commit  b2519362c825a494fb6e93549e2e32a425011563
Author: Ryan Zezeski <rpz@joyent.com>
Date:   2017-07-07T15:46:33.000Z

    8437 dlmgmt_setzoneid() doesn't enforce uniqueness
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Rich Lowe <richlowe@richlowe.net>
    Approved by: Dan McDonald <danmcd@joyent.com>

#4

Updated by Ryan Zezeski about 2 years ago

I've requested that the advocates backout this change as it was found to break zone halt on OmniOS (and probably all other distros which are not SmartOS). Here is the original report from Andy Fiddaman:

Since integrating "8437 dlmgmt_setzoneid() doesn't enforce uniqueness" 
into OmniOS bloody, zones with VNICs no longer shut down properly.
dlmgmtd is asked to move the VNIC back to the global zone and since it
already exists there, the code now returns EEXIST.

As far as I can tell, the proper fix for the original issue is:

@@ -1259,7 +1258,8 @@ dlmgmt_setzoneid(void *argp, void *retp, size_t *sz, zoneid_t zoneid,
         * Before we remove the link from its current zone, make sure that
         * there isn't a link with the same name in the destination zone.
         */
-       if (zoneid != GLOBAL_ZONEID &&
+       if (newzoneid != GLOBAL_ZONEID &&
                link_by_name(linkp->ll_link, newzoneid) != NULL) {

but I'm happy to be corrected! We have reverted this commit in OmniOS for
now.

And here is my reply:

Sorry for the delay. I had a bit of trouble getting my environment
setup. I reproduced this issue but I stand by my original fix. It just
happens to rely on other changes present in SmartOS that aren't
upstream.

When I tested my fix I didn't consider zone halt. I only verified that
assigning two vnics with the same name to the same zone returned
EBUSY. I should have also tested zone halt, that's my bad. My change
broke zone halt because it relies on other changes in SmartOS which
are not in gate. Specifically, gate still has the dlmgmt_loan_avl and
SmartOS doesn't. In the gate code, if the newzoneid is the GZ then
link_by_name() will return a pointer to the link whether it's under
the GZ or "loaned" to an NGZ. So when zone halt calls
dlmgmt_setzoneid() to set the link back to the GZ it considers the
link as already under the GZ and returns EEXISTS, and this causes the
halt to fail.

Changing the logic to "newzoneid != GLOBAL_ZONEID" is not the correct
fix. If an NGZ and the GZ have a link with the same name then dlmgtmd
will abort on avl_add() when the NGZ halts and attempts to set it to
the GZ (and it will also cause the halt to fail and put the zone in
the down state). I verified this behavior with a new gate ONU'd onto
omniosce -- dlmgmtd aborted as I expected.

I think the best solution is to revert my change and wait until other
SmartOS networking changes make it in.

Also available in: Atom PDF