Bug #12686

dladm: vnic creation over bge0 failed: object already exists

Added by Chris Ridd 7 months ago. Updated 7 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: driver - device drivers
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

Some changes between joyent_20200408T231825Z and joyent_20200424T174452Z from illumos have caused a critical regression.

Native and LX zones do not start up, and cannot be created. For example a new native zone:

# cat swift.json 
{
  "brand": "joyent",
  "image_uuid": "7b021460-73f9-11ea-a6da-07800cddb2ca",
  "alias": "swift",
  "hostname": "swift",
  "max_physical_memory": 512,
  "resolvers": ["8.8.8.8","4.4.4.4"],
  "nics": [
  {
    "nic_tag": "admin",
    "ip": "dhcp"
  }
  ]
}

# vmadm create -f swift.json 
first of 1 error: first of 1 error: Command failed: zone '79b2c4da-3e85-c54f-c3a3-f099ba91b2c7': error creating VNIC net0 (global NIC admin)
zone '79b2c4da-3e85-c54f-c3a3-f099ba91b2c7': msg: dladm: vnic creation over bge0 failed: object already exists
zone '79b2c4da-3e85-c54f-c3a3-f099ba91b2c7': Failed cmd: dladm create-vnic -t -l bge0 -p mtu=1500,zone=79b2c4da-3e85-c54f-c3a3-f099ba91b2c7 -m 32:65:8c:63:a4:ee tmp90870
zone '79b2c4da-3e85-c54f-c3a3-f099ba91b2c7': destroying snapshot: No such zone configured
zoneadm: zone '79b2c4da-3e85-c54f-c3a3-f099ba91b2c7': call to zoneadmd failed

I would suspect one of these bge-related changes:

# prtconf -v /dev/bge
pci103c,2133, instance #0
    System software properties:
        name='bge-known-subsystems' type=int items=16
            value=108e1647.108e1648.108e16a7.108e16a8.17c20010.17341013.101402a6.10f12885.17c20020.10b71006.10280109.102801f8.1028865d.0e11005a.0e1100cb.103c12bc
        name='bge-rx-rings' type=int items=1
            value=00000010
        name='bge-tx-rings' type=int items=1
            value=00000001
    Driver properties:
        name='fm-accchk-capable' type=boolean dev=none
        name='fm-dmachk-capable' type=boolean dev=none
        name='fm-errcb-capable' type=boolean dev=none
        name='fm-ereport-capable' type=boolean dev=none
    Hardware properties:
        name='pci-msix-capid-pointer' type=int items=1
            value=000000a0
        name='pci-msi-capid-pointer' type=int items=1
            value=00000058
        name='pcie-link-supported-speeds' type=int64 items=2
            value=000000009502f900.000000012a05f200
        name='pcie-link-target-speed' type=int64 items=1
            value=000000009502f900
        name='pcie-link-maximum-speed' type=int64 items=1
            value=000000012a05f200
        name='pcie-link-current-speed' type=int64 items=1
            value=000000012a05f200
        name='pcie-link-current-width' type=int items=1
            value=00000001
        name='pcie-link-maximum-width' type=int items=1
            value=00000002
        name='pcie-aspm-state' type=string items=1
            value='disabled'
        name='pcie-aspm-support' type=string items=1
            value='l0s,l1'
        name='pcie-serial' type=byte items=8
            value=f8.07.d8.06.51.64.00.00
        name='assigned-addresses' type=int items=15
            value=c3030010.00000000.fab30000.00000000.00010000.c3030018.00000000.fab40000.00000000.00010000.c3030020.00000000.fab50000.00000000.00010000
        name='reg' type=int items=20
            value=00030000.00000000.00000000.00000000.00000000.43030010.00000000.00000000.00000000.00010000.43030018.00000000.00000000.00000000.00010000.43030020.00000000.00000000.00000000.00010000
        name='compatible' type=string items=15
            value='pciex14e4,165f.103c.2133.0' + 'pciex14e4,165f.103c.2133' + 'pciex14e4,165f.0' + 'pciex14e4,165f' + 'pciexclass,020000' + 'pciexclass,0200' + 'pci14e4,165f.103c.2133.0' + 'pci14e4,165f.103c.2133' + 'pci103c,2133,s' + 'pci103c,2133' + 'pci14e4,165f.0' + 'pci14e4,165f,p' + 'pci14e4,165f' + 'pciclass,020000' + 'pciclass,0200'
        name='model' type=string items=1
            value='Ethernet controller'
        name='power-consumption' type=int items=2
            value=00000001.00000001
        name='devsel-speed' type=int items=1
            value=00000000
        name='interrupts' type=int items=1
            value=00000001
        name='subsystem-vendor-id' type=int items=1
            value=0000103c
        name='subsystem-id' type=int items=1
            value=00002133
        name='unit-address' type=string items=1
            value='0'
        name='class-code' type=int items=1
            value=00020000
        name='revision-id' type=int items=1
            value=00000000
        name='vendor-id' type=int items=1
            value=000014e4
        name='device-id' type=int items=1
            value=0000165f
        name='vendor-name' type=string items=1
            value='Broadcom Inc. and subsidiaries'
        name='device-name' type=string items=1
            value='NetXtreme BCM5720 2-port Gigabit Ethernet PCIe'
        name='subsystem-name' type=string items=1
            value='NC332i Adapter'
    Device Minor Nodes:
        dev=(116,1)
            dev_path=/pci@0,0/pci8086,1c18@1c,4/pci103c,2133@0:bge0
                spectype=chr type=minor
                dev_link=/dev/bge0
        dev=(116,1002)
            dev_path=<clone>
            Device Minor Layered Under:
                mod=udp accesstype=chr
                    dev_path=/pseudo/udp@0
        dev=(116,1003)
            dev_path=<clone>
            Device Minor Layered Under:
                mod=udp accesstype=chr
                    dev_path=/pseudo/udp@0

Files

bge-bad.out.gz (53.9 KB), Chris Ridd, 2020-05-02 05:41 PM
bge-good.out.gz (55.5 KB), Chris Ridd, 2020-05-02 05:41 PM
bge.out_vmadm_create__when_on_single_dl.zip (1.51 KB), Olivia Mocellini Merbeller, 2020-05-05 02:03 PM
bge.out_dladm_create-vnic__when_on_single_dl.zip (1.49 KB), Olivia Mocellini Merbeller, 2020-05-05 02:03 PM
bge.out_vmadm_create__when_on_aggr.zip (1.76 KB), Olivia Mocellini Merbeller, 2020-05-05 02:03 PM
bge.out_dladm_create-vnic__when_on_aggr.zip (1.64 KB), Olivia Mocellini Merbeller, 2020-05-05 02:03 PM

Related issues

Related to illumos gate - Bug #12713: Network error after udating to 2020.04 / bge0: DL_BIND_REQ failed (Duplicate; Robert Mustacchi)
Related to illumos gate - Feature #12450: Add support for BCM57765 family devices to bge (Closed; Robert Mustacchi)
Related to illumos gate - Bug #12496: bge mac address initialization is wrong (Closed; Robert Mustacchi)
Related to illumos gate - Bug #12497: bge ape locking left always disabled after 7513 (Closed; Robert Mustacchi)
Related to illumos gate - Bug #12498: bge ring interrupt masking logic is broken (Closed; Robert Mustacchi)
Has duplicate illumos gate - Bug #12713: Network error after udating to 2020.04 / bge0: DL_BIND_REQ failed (Duplicate; Robert Mustacchi)

#1

Updated by Chris Ridd 7 months ago

Attaching the output of dtrace -n 'fbt:bge::{ trace(arg0); trace(arg1); }' -o /var/tmp/bge.out while trying to start the zone on the current bad smartos build (joyent_20200424T174452Z), and on the previous good smartos build (joyent_20200408T231825Z).

#2

Updated by Robert Mustacchi 7 months ago

Thanks for that Chris. That helps point us in the right direction. What's happening is that we're failing this bit of the bge_addmac implementation, I think:

...
        if ((err = bge_unicst_set(bgep, mac_addr, slot)) != 0)
                goto fail;

        /* A rule is already here. Deny this.  */
        if (rrp->mac_addr_rule != NULL) {
                err = ether_cmp(mac_addr, rrp->mac_addr_val) ? EEXIST : EBUSY;
                goto fail;
        }
...

That would explain the EEXIST we're getting. I'm not that familiar with this part of the bge driver, so I've written some general DTrace below to try to get more information about how this ring is configured:

fbt::bge_addmac:entry
{
        self->t = 1;
        stack();
        self->ring = (recv_ring_t *)arg0;
        self->bge = self->ring->bgep;
        printf("attempting to add mac:\n");
        tracemem(arg1, 6);
        print(*self->ring);

        printf("slot 0\n");
        print(self->bge->curr_addr[0]);
        print(self->bge->curr_addr[1]);
        print(self->bge->curr_addr[2]);
        print(self->bge->curr_addr[3]);
}

fbt::bge_addmac:entry
/self->ring != NULL && self->ring->mac_addr_rule != NULL/
{
        printf("ring rule\n");
        print(*self->ring->mac_addr_rule);
}

fbt::bge_unicst_find:return
/self->t/
{
        printf("picked slot %d\n", args[1]);
        print(self->bge->curr_addr[args[1]]);
}

fbt::bge_addmac:return
{
        self->t = 0;
        self->ring = NULL;
        self->bge = NULL;
}
#3

Updated by Olivia Mocellini Merbeller 7 months ago

Hi there,
Attached are four files of DTrace output.
Two were captured on a single datalink, where creation of the VNIC fails, and two on an aggregate datalink, where creation still works, just in case the comparison is useful.
Each set has one output from running a simple dladm create-vnic and another from creating a zone with vmadm create -f <filename>.

Please let me know if you need anything else.

#4

Updated by Chris Ridd 7 months ago

# dladm show-phys -m
LINK         SLOT     ADDRESS            INUSE CLIENT
bge0         primary  64:51:6:d8:7:f8    yes  bge0
bge1         primary  64:51:6:d8:7:f9    yes  bge1

DTrace output from running vmadm start <my swift zone's uuid>:

CPU     ID                    FUNCTION:NAME
  0  82168                 bge_addmac:entry 
              mac`mac_group_addmac+0x7f
              mac`mac_add_macaddr_vlan+0x14a
              mac`mac_datapath_setup+0x6cb
              mac`mac_client_datapath_setup+0x298
              mac`i_mac_unicast_add+0x55e
              mac`mac_unicast_add+0x6e
              vnic`vnic_unicast_add+0x206
              vnic`vnic_dev_create+0x3d9
              vnic`vnic_ioc_create+0xea
              dld`drv_ioctl+0x1ef
              genunix`cdev_ioctl+0x2b
              specfs`spec_ioctl+0x45
              genunix`fop_ioctl+0x5b
              genunix`ioctl+0x153
              unix`_sys_sysenter_post_swapgs+0x159
attempting to add mac:

             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 52 10 62 36 01 6f                                R.b6.o
recv_ring_t {
    dma_area_t desc = {
        ddi_acc_handle_t acc_hdl = 0xfffffe16dd309b80
        void *mem_va = 0xfffffe16dd492000
        uint32_t nslots = 0x200
        uint32_t size = 0x20
        size_t alength = 0x4000
        ddi_dma_handle_t dma_hdl = 0xfffffe16dd466a80
        offset_t offset = 0
        ddi_dma_cookie_t cookie = {
            union _dmu = {
                uint64_t _dmac_ll = 0x20999000
                uint32_t [2] _dmac_la = [ 0x20999000, 0 ]
            }
            size_t dmac_size = 0x4000
            uint_t dmac_type = 0
        }
        uint32_t ncookies = 0x1
        uint32_t token = 0xbcd573cb
    }
    bge_rcb_t hw_rcb = {
        uint64_t host_ring_addr = 0x20999000
        uint32_t nic_ring_addr = 0
        uint16_t flags = 0
        uint16_t max_len = 0x200
    }
    struct bge *bgep = 0xfffffe16dd122000
    ddi_softintr_t rx_softint = 0
    volatile uint16_t *prod_index_p = 0xfffffe16dd49c814
    bge_regno_t chip_mbx_reg = 0x280
    kmutex_t [1] rx_lock = [ 
        kmutex_t {
            void *[1] _opaque = [ 0 ]
        }
    ]
    uint64_t rx_next = 0x4
    mac_ring_handle_t ring_handle = 0xfffffe16ddc30e60
    mac_group_handle_t ring_group_handle = 0xfffffe16dd309680
    uint64_t ring_gen_num = 0
    bge_rule_info_t *mac_addr_rule = 0xfffffe171fa1d638
    uint8_t [6] mac_addr_val = [ 0x64, 0x51, 0x6, 0xd8, 0x7, 0xf8 ]
    int poll_flag = 0
    uint64_t rx_pkts = 0x1204
    uint64_t rx_bytes = 0xb6047
}slot 0
bge_mac_addr_t {
    ether_addr_t addr = [ 0x64, 0x51, 0x6, 0xd8, 0x7, 0xf8 ]
    uint8_t spare = 0
    boolean_t set = B_TRUE
}bge_mac_addr_t {
    ether_addr_t addr = [ 0, 0, 0, 0, 0, 0 ]
    uint8_t spare = 0
    boolean_t set = B_FALSE
}bge_mac_addr_t {
    ether_addr_t addr = [ 0, 0, 0, 0, 0, 0 ]
    uint8_t spare = 0
    boolean_t set = B_FALSE
}bge_mac_addr_t {
    ether_addr_t addr = [ 0, 0, 0, 0, 0, 0 ]
    uint8_t spare = 0
    boolean_t set = B_FALSE
}
  0  82168                 bge_addmac:entry ring rule
bge_rule_info_t {
    int start = 0
    int count = 0x2
}
  0  82403           bge_unicst_find:return picked slot -1
bge_mac_addr_t {
    ether_addr_t addr = [ 0, 0, 0, 0, 0, 0 ]
    uint8_t spare = 0
    boolean_t set = B_FALSE
}

#5

Updated by Robert Mustacchi 7 months ago

Thanks for providing this everyone. With this, I think I understand everything that's happening here. Let me recap a bit.

The hardware here has up to four MAC address registers, that is, places where we can actually program an address. It also has up to eight MAC address receive rules, which are used to direct a packet to a given ring. In general, it takes two of these rules to map a single MAC address to a ring.
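
To make the arithmetic concrete, here is a minimal sketch in C. Only the figures (four registers, eight rules, two rules per address) come from the description above; the macro and function names are hypothetical and do not appear in the bge driver.

```c
#include <assert.h>

/* Figures from the description above. All names here are hypothetical. */
#define SKETCH_MAC_ADDR_REGS    4   /* MAC address registers */
#define SKETCH_RECV_RULES       8   /* MAC address receive rules */
#define SKETCH_RULES_PER_ADDR   2   /* rules needed to map one address to a ring */

/*
 * How many MAC addresses can be mapped to rings at once. Whichever
 * resource runs out first sets the limit.
 */
int
sketch_max_ring_addrs(int addr_regs, int recv_rules, int rules_per_addr)
{
        int by_rules;

        assert(addr_regs >= 0 && recv_rules >= 0 && rules_per_addr > 0);
        by_rules = recv_rules / rules_per_addr;
        return (addr_regs < by_rules ? addr_regs : by_rules);
}

/* The limit for the figures quoted above. */
int
sketch_bge_ring_addr_limit(void)
{
        return (sketch_max_ring_addrs(SKETCH_MAC_ADDR_REGS,
            SKETCH_RECV_RULES, SKETCH_RULES_PER_ADDR));
}
```

With the quoted figures, both resources allow four addresses, so neither one is the tighter constraint.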

A side effect of this is that the driver is designed to use up to one ring for each group, and each of those groups gets a single MAC address, which requires one of the four MAC address registers and two of the eight receive rules. As the driver is written today, if it finds that the ring already has a rule, it fails. The way it does this is unfortunate for a couple of reasons. Here's the code in question:

        /* A rule is already here. Deny this.  */
        if (rrp->mac_addr_rule != NULL) {
                err = ether_cmp(mac_addr, rrp->mac_addr_val) ? EEXIST : EBUSY;
                goto fail;
        }

If the ring already has an address, we're going to fail. The code checks whether the new address matches the existing one so that it can return something more meaningful, but there are two problems here:

1) If the device has used up its slots it should return ENOSPC, not EBUSY. This will allow MAC to fall back into promiscuous mode.

2) ether_cmp indicates that two addresses are the same by returning 0. That means that when the addresses don't match and it returns 1, as in this case, we return EEXIST.

So, this should all be changed to do what mac expects. Because of how the mc_unicst() entry point was being used previously, this didn't end up mattering very much.
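
The two problems can be seen side by side in a small standalone sketch. This is not the driver code and not the committed fix: sketch_ether_cmp() is a stand-in that only follows the return convention described above, and the "corrected" variant is just one plausible reading of points 1 and 2. All names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_ETHERADDRL 6

/*
 * Stand-in for ether_cmp(), following the convention described above:
 * 0 when the two addresses are the same, nonzero when they differ.
 */
int
sketch_ether_cmp(const unsigned char *a, const unsigned char *b)
{
        assert(a != NULL && b != NULL);
        return (memcmp(a, b, SKETCH_ETHERADDRL) != 0);
}

/* The check as quoted: a nonzero result (addresses differ) picks EEXIST. */
int
sketch_errno_as_quoted(const unsigned char *new_addr,
    const unsigned char *ring_addr)
{
        return (sketch_ether_cmp(new_addr, ring_addr) ? EEXIST : EBUSY);
}

/*
 * One possible correction (an assumption, not the committed fix):
 * EEXIST only when the address really is already there, ENOSPC when
 * the ring's slot is taken by a different address.
 */
int
sketch_errno_corrected(const unsigned char *new_addr,
    const unsigned char *ring_addr)
{
        return (sketch_ether_cmp(new_addr, ring_addr) == 0 ? EEXIST : ENOSPC);
}
```

Feeding in the addresses from the trace (the new VNIC address 52:10:62:36:01:6f against the ring's 64:51:06:d8:07:f8), the quoted check yields EEXIST even though the addresses differ, while the sketched correction yields ENOSPC.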

Now, this doesn't answer the question of why we are even seeing this on the 5720 and the other parts. While the hardware could support more than one ring, it turns out that every device has been told that it only supports a single ring by default, and we haven't increased this. This is why we're coming back on group 0 and ring 0 when the first VNIC is created.

I think the fix here is to change this around such that we properly return ENOSPC so that it'll put us into promisc mode. We can look at adding more rings and testing that once we have the rest of this working and folks aren't experiencing regressions. I have a prototype for this I'm building and will post for folks to test once I have this ready.
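
As a rough illustration of why the error code matters to the caller, here is a sketch of a client that treats ENOSPC as "no hardware slot left, use promiscuous mode instead" and treats every other error as fatal. It is a simplified model under that one assumption, not the MAC framework's actual code, and every name in it is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver entry point: returns 0 or an errno value. */
typedef int (*sketch_addmac_fn)(const unsigned char *addr);

typedef struct sketch_client {
        bool hw_filtered;   /* address was programmed into the hardware */
        bool promisc;       /* fell back to promiscuous mode */
} sketch_client_t;

/*
 * Try to add a unicast address. Only ENOSPC triggers the fallback;
 * any other error is passed back and the client is left unconfigured.
 */
int
sketch_client_add_addr(sketch_client_t *c, sketch_addmac_fn addmac,
    const unsigned char *addr)
{
        int err;

        assert(c != NULL && addmac != NULL);
        c->hw_filtered = false;
        c->promisc = false;

        err = addmac(addr);
        if (err == 0) {
                c->hw_filtered = true;
                return (0);
        }
        if (err == ENOSPC) {
                c->promisc = true;
                return (0);
        }
        return (err);
}

/* Two toy drivers: one reports "out of slots", one reports EEXIST. */
int
sketch_driver_full(const unsigned char *addr)
{
        (void) addr;
        return (ENOSPC);
}

int
sketch_driver_eexist(const unsigned char *addr)
{
        (void) addr;
        return (EEXIST);
}
```

In this model, a driver that answers EEXIST makes the whole operation fail, which is the shape of the failure in this report, while a driver that answers ENOSPC lets the client carry on in promiscuous mode.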

#6

Updated by Robert Mustacchi 7 months ago

I've put together a potential prototype at https://code.illumos.org/c/illumos-gate/+/653. If folks would be able to test that out, that'd be appreciated. If I can help put together materials to help with that, please let me know.

#7

Updated by Marcel Telka 7 months ago

  • Related to Bug #12713: Network error after udating to 2020.04 / bge0: DL_BIND_REQ failed added
#8

Updated by Marcel Telka 7 months ago

  • Related to Feature #12450: Add support for BCM57765 family devices to bge added
#9

Updated by Marcel Telka 7 months ago

  • Related to Bug #12496: bge mac address initialization is wrong added
#10

Updated by Marcel Telka 7 months ago

  • Related to Bug #12497: bge ape locking left always disabled after 7513 added
#11

Updated by Marcel Telka 7 months ago

  • Related to Bug #12498: bge ring interrupt masking logic is broken added
#12

Updated by Chris Ridd 7 months ago

Robert Mustacchi wrote:

I've put together a potential prototype at https://code.illumos.org/c/illumos-gate/+/653. If folks would be able to test that out, that'd be appreciated. If I can help put together materials to help with that, please let me know.

I've tested Jason's updated debug build at https://us-east.manta.joyent.com/jbk/public/tmp/bge/platform-20200510T182623Z.usb.gz and vmadm can now successfully start my zones!

#13

Updated by Robert Mustacchi 7 months ago

  • Has duplicate Bug #12713: Network error after udating to 2020.04 / bge0: DL_BIND_REQ failed added
#14

Updated by Robert Mustacchi 7 months ago

Ultimately we've been able to conclude that the earlier analysis was correct. This was introduced as a side effect of the attempts I made to fix the panics and probable corruption in 12496. My apologies for the breakage. Chris, Olivia, thank you very much for your help here.

#15

Updated by Olivia Mocellini Merbeller 7 months ago

Hi guys, I can confirm that VNIC creation also works on NetXtreme BCM57781 with same PI (20200510T182623Z). :)

#16

Updated by Electric Monk 7 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 8291b3b94350ddd6df6ecd55435b59079f7a3dd2

commit  8291b3b94350ddd6df6ecd55435b59079f7a3dd2
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   2020-05-12T16:04:22.000Z

    12686 dladm: vnic creation over bge0 failed: object already exists
    Reviewed by: Ryan Zezeski <ryan@zinascii.com>
    Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
