Bug #12686
Closed
dladm: vnic creation over bge0 failed: object already exists
100%
Description
Some changes between joyent_20200408T231825Z and joyent_20200424T174452Z from illumos have caused a critical regression.
Native and LX zones do not start up, and cannot be created. For example a new native zone:
# cat swift.json
{
  "brand": "joyent",
  "image_uuid": "7b021460-73f9-11ea-a6da-07800cddb2ca",
  "alias": "swift",
  "hostname": "swift",
  "max_physical_memory": 512,
  "resolvers": ["8.8.8.8","4.4.4.4"],
  "nics": [
    {
      "nic_tag": "admin",
      "ip": "dhcp"
    }
  ],
}
# vmadm create -f swift.json
first of 1 error: first of 1 error: Command failed: zone '79b2c4da-3e85-c54f-c3a3-f099ba91b2c7': error creating VNIC net0 (global NIC admin)
zone '79b2c4da-3e85-c54f-c3a3-f099ba91b2c7': msg: dladm: vnic creation over bge0 failed: object already exists
zone '79b2c4da-3e85-c54f-c3a3-f099ba91b2c7': Failed cmd: dladm create-vnic -t -l bge0 -p mtu=1500,zone=79b2c4da-3e85-c54f-c3a3-f099ba91b2c7 -m 32:65:8c:63:a4:ee tmp90870
zone '79b2c4da-3e85-c54f-c3a3-f099ba91b2c7': destroying snapshot: No such zone configured
zoneadm: zone '79b2c4da-3e85-c54f-c3a3-f099ba91b2c7': call to zoneadmd failed
I would suspect one of these bge-related changes:
- https://www.illumos.org/issues/12450
- https://www.illumos.org/issues/12496
- https://www.illumos.org/issues/12497
- https://www.illumos.org/issues/12498
# prtconf -v /dev/bge
pci103c,2133, instance #0
    System software properties:
        name='bge-known-subsystems' type=int items=16
            value=108e1647.108e1648.108e16a7.108e16a8.17c20010.17341013.101402a6.10f12885.17c20020.10b71006.10280109.102801f8.1028865d.0e11005a.0e1100cb.103c12bc
        name='bge-rx-rings' type=int items=1
            value=00000010
        name='bge-tx-rings' type=int items=1
            value=00000001
    Driver properties:
        name='fm-accchk-capable' type=boolean dev=none
        name='fm-dmachk-capable' type=boolean dev=none
        name='fm-errcb-capable' type=boolean dev=none
        name='fm-ereport-capable' type=boolean dev=none
    Hardware properties:
        name='pci-msix-capid-pointer' type=int items=1
            value=000000a0
        name='pci-msi-capid-pointer' type=int items=1
            value=00000058
        name='pcie-link-supported-speeds' type=int64 items=2
            value=000000009502f900.000000012a05f200
        name='pcie-link-target-speed' type=int64 items=1
            value=000000009502f900
        name='pcie-link-maximum-speed' type=int64 items=1
            value=000000012a05f200
        name='pcie-link-current-speed' type=int64 items=1
            value=000000012a05f200
        name='pcie-link-current-width' type=int items=1
            value=00000001
        name='pcie-link-maximum-width' type=int items=1
            value=00000002
        name='pcie-aspm-state' type=string items=1
            value='disabled'
        name='pcie-aspm-support' type=string items=1
            value='l0s,l1'
        name='pcie-serial' type=byte items=8
            value=f8.07.d8.06.51.64.00.00
        name='assigned-addresses' type=int items=15
            value=c3030010.00000000.fab30000.00000000.00010000.c3030018.00000000.fab40000.00000000.00010000.c3030020.00000000.fab50000.00000000.00010000
        name='reg' type=int items=20
            value=00030000.00000000.00000000.00000000.00000000.43030010.00000000.00000000.00000000.00010000.43030018.00000000.00000000.00000000.00010000.43030020.00000000.00000000.00000000.00010000
        name='compatible' type=string items=15
            value='pciex14e4,165f.103c.2133.0' + 'pciex14e4,165f.103c.2133' + 'pciex14e4,165f.0' + 'pciex14e4,165f' + 'pciexclass,020000' + 'pciexclass,0200' + 'pci14e4,165f.103c.2133.0' + 'pci14e4,165f.103c.2133' + 'pci103c,2133,s' + 'pci103c,2133' + 'pci14e4,165f.0' + 'pci14e4,165f,p' + 'pci14e4,165f' + 'pciclass,020000' + 'pciclass,0200'
        name='model' type=string items=1
            value='Ethernet controller'
        name='power-consumption' type=int items=2
            value=00000001.00000001
        name='devsel-speed' type=int items=1
            value=00000000
        name='interrupts' type=int items=1
            value=00000001
        name='subsystem-vendor-id' type=int items=1
            value=0000103c
        name='subsystem-id' type=int items=1
            value=00002133
        name='unit-address' type=string items=1
            value='0'
        name='class-code' type=int items=1
            value=00020000
        name='revision-id' type=int items=1
            value=00000000
        name='vendor-id' type=int items=1
            value=000014e4
        name='device-id' type=int items=1
            value=0000165f
        name='vendor-name' type=string items=1
            value='Broadcom Inc. and subsidiaries'
        name='device-name' type=string items=1
            value='NetXtreme BCM5720 2-port Gigabit Ethernet PCIe'
        name='subsystem-name' type=string items=1
            value='NC332i Adapter'
    Device Minor Nodes:
        dev=(116,1)
            dev_path=/pci@0,0/pci8086,1c18@1c,4/pci103c,2133@0:bge0
                spectype=chr type=minor
                dev_link=/dev/bge0
        dev=(116,1002)
            dev_path=<clone>
                Device Minor Layered Under:
                    mod=udp accesstype=chr
                        dev_path=/pseudo/udp@0
        dev=(116,1003)
            dev_path=<clone>
                Device Minor Layered Under:
                    mod=udp accesstype=chr
                        dev_path=/pseudo/udp@0
Files
Related issues
Updated by Chris Ridd over 3 years ago
- File bge-bad.out.gz bge-bad.out.gz added
- File bge-good.out.gz bge-good.out.gz added
Attaching the output of dtrace -n 'fbt:bge::{ trace(arg0); trace(arg1); }' -o /var/tmp/bge.out
while trying to start the zone on the current bad smartos build (joyent_20200424T174452Z), and on the previous good smartos build (joyent_20200408T231825Z).
Updated by Robert Mustacchi over 3 years ago
Thanks for that Chris. That helps point us in the right direction. What's happening is that we're failing this bit of the bge_addmac implementation, I think:
    ...
    if ((err = bge_unicst_set(bgep, mac_addr, slot)) != 0)
        goto fail;

    /* A rule is already here. Deny this. */
    if (rrp->mac_addr_rule != NULL) {
        err = ether_cmp(mac_addr, rrp->mac_addr_val) ? EEXIST : EBUSY;
        goto fail;
    }
    ...
That would explain the EEXIST we're getting back. I'm not that familiar with this part of the bge driver, so I've written some general DTrace below to try to get more information about how this ring is configured:
fbt::bge_addmac:entry
{
    self->t = 1;
    stack();
    self->ring = (recv_ring_t *)arg0;
    self->bge = self->ring->bgep;
    printf("attempting to add mac:\n");
    tracemem(arg1, 6);
    print(*self->ring);
    printf("slot 0\n");
    print(self->bge->curr_addr[0]);
    print(self->bge->curr_addr[1]);
    print(self->bge->curr_addr[2]);
    print(self->bge->curr_addr[3]);
}

fbt::bge_addmac:entry
/self->ring != NULL && self->ring->mac_addr_rule != NULL/
{
    printf("ring rule\n");
    print(*self->ring->mac_addr_rule);
}

fbt::bge_unicst_find:return
/self->t/
{
    printf("picked slot %d\n", args[1]);
    print(self->bge->curr_addr[args[1]]);
}

fbt::bge_addmac:return
{
    self->t = 0;
    self->ring = NULL;
    self->bge = NULL;
}
Updated by Olivia Mocellini Merbeller over 3 years ago
- File bge.out_vmadm_create__when_on_single_dl.zip bge.out_vmadm_create__when_on_single_dl.zip added
- File bge.out_dladm_create-vnic__when_on_single_dl.zip bge.out_dladm_create-vnic__when_on_single_dl.zip added
- File bge.out_vmadm_create__when_on_aggr.zip bge.out_vmadm_create__when_on_aggr.zip added
- File bge.out_dladm_create-vnic__when_on_aggr.zip bge.out_dladm_create-vnic__when_on_aggr.zip added
Hi there,
Attached are 4 files with the dtrace output.
Two of them were captured while running over a single datalink, where VNIC creation fails, and two while running over an aggregate datalink, where creation still works, in case the comparison is useful.
Each of these two sets contains one output from running a simple dladm create-vnic and another from creating a zone with vmadm create -f <filename>.
Please let me know if you need anything else.
Updated by Chris Ridd over 3 years ago
# dladm show-phys -m
LINK         SLOT     ADDRESS            INUSE CLIENT
bge0         primary  64:51:6:d8:7:f8    yes   bge0
bge1         primary  64:51:6:d8:7:f9    yes   bge1
DTrace output from running vmadm start <my swift zone's uuid>:
CPU     ID                    FUNCTION:NAME
  0  82168                 bge_addmac:entry
              mac`mac_group_addmac+0x7f
              mac`mac_add_macaddr_vlan+0x14a
              mac`mac_datapath_setup+0x6cb
              mac`mac_client_datapath_setup+0x298
              mac`i_mac_unicast_add+0x55e
              mac`mac_unicast_add+0x6e
              vnic`vnic_unicast_add+0x206
              vnic`vnic_dev_create+0x3d9
              vnic`vnic_ioc_create+0xea
              dld`drv_ioctl+0x1ef
              genunix`cdev_ioctl+0x2b
              specfs`spec_ioctl+0x45
              genunix`fop_ioctl+0x5b
              genunix`ioctl+0x153
              unix`_sys_sysenter_post_swapgs+0x159
attempting to add mac:
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 52 10 62 36 01 6f                                R.b6.o
recv_ring_t {
    dma_area_t desc = {
        ddi_acc_handle_t acc_hdl = 0xfffffe16dd309b80
        void *mem_va = 0xfffffe16dd492000
        uint32_t nslots = 0x200
        uint32_t size = 0x20
        size_t alength = 0x4000
        ddi_dma_handle_t dma_hdl = 0xfffffe16dd466a80
        offset_t offset = 0
        ddi_dma_cookie_t cookie = {
            union _dmu = {
                uint64_t _dmac_ll = 0x20999000
                uint32_t [2] _dmac_la = [ 0x20999000, 0 ]
            }
            size_t dmac_size = 0x4000
            uint_t dmac_type = 0
        }
        uint32_t ncookies = 0x1
        uint32_t token = 0xbcd573cb
    }
    bge_rcb_t hw_rcb = {
        uint64_t host_ring_addr = 0x20999000
        uint32_t nic_ring_addr = 0
        uint16_t flags = 0
        uint16_t max_len = 0x200
    }
    struct bge *bgep = 0xfffffe16dd122000
    ddi_softintr_t rx_softint = 0
    volatile uint16_t *prod_index_p = 0xfffffe16dd49c814
    bge_regno_t chip_mbx_reg = 0x280
    kmutex_t [1] rx_lock = [
        kmutex_t {
            void *[1] _opaque = [ 0 ]
        }
    ]
    uint64_t rx_next = 0x4
    mac_ring_handle_t ring_handle = 0xfffffe16ddc30e60
    mac_group_handle_t ring_group_handle = 0xfffffe16dd309680
    uint64_t ring_gen_num = 0
    bge_rule_info_t *mac_addr_rule = 0xfffffe171fa1d638
    uint8_t [6] mac_addr_val = [ 0x64, 0x51, 0x6, 0xd8, 0x7, 0xf8 ]
    int poll_flag = 0
    uint64_t rx_pkts = 0x1204
    uint64_t rx_bytes = 0xb6047
}slot 0
bge_mac_addr_t {
    ether_addr_t addr = [ 0x64, 0x51, 0x6, 0xd8, 0x7, 0xf8 ]
    uint8_t spare = 0
    boolean_t set = B_TRUE
}bge_mac_addr_t {
    ether_addr_t addr = [ 0, 0, 0, 0, 0, 0 ]
    uint8_t spare = 0
    boolean_t set = B_FALSE
}bge_mac_addr_t {
    ether_addr_t addr = [ 0, 0, 0, 0, 0, 0 ]
    uint8_t spare = 0
    boolean_t set = B_FALSE
}bge_mac_addr_t {
    ether_addr_t addr = [ 0, 0, 0, 0, 0, 0 ]
    uint8_t spare = 0
    boolean_t set = B_FALSE
}
  0  82168                 bge_addmac:entry ring rule
bge_rule_info_t {
    int start = 0
    int count = 0x2
}
  0  82403            bge_unicst_find:return picked slot -1
bge_mac_addr_t {
    ether_addr_t addr = [ 0, 0, 0, 0, 0, 0 ]
    uint8_t spare = 0
    boolean_t set = B_FALSE
}
Updated by Robert Mustacchi over 3 years ago
Thanks for providing this everyone. With this, I think I understand everything that's happening here. Let me recap a bit.
The hardware here has up to four MAC address registers, that is, places where we can actually program an address. It also has up to 8 MAC address receive rules. Basically these are rules that can be used to direct a packet to a given ring. In general, it takes two of these to map a single MAC address to a ring.
A side effect of this is that, the way the driver is designed, it wants to use up to one ring for each group, and each of those groups gets a single MAC address, which requires one of the four MAC address registers and 2 of the 8 receive rules. The way the driver is written today, if it finds that the ring already has a rule, it'll fail. Now, the way it does this is a bit unfortunate for a few reasons. Here's the code in question:
    /* A rule is already here. Deny this. */
    if (rrp->mac_addr_rule != NULL) {
        err = ether_cmp(mac_addr, rrp->mac_addr_val) ? EEXIST : EBUSY;
        goto fail;
    }
If the ring already has an address, we're going to fail. The code tries to see whether the new address matches the existing one so that it can return something more meaningful. But there are two problems here:
1) If the device has used up its slots, it should return ENOSPC, not EBUSY. This will allow MAC to fall back into promiscuous mode.
2) ether_cmp indicates that two addresses are the same by returning 0. That means that when they don't match and it returns 1, as in this case, we return EEXIST.
So, this should all be changed to do what mac expects. Because of how the mc_unicst() entry point was being used previously, this didn't end up mattering very much.
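To make the inverted test concrete, here is a minimal user-space sketch in C. It is not the driver code and not the committed fix: sketch_ether_cmp is a stand-in that only mirrors the "returns 0 when equal" convention described above, and sketch_errno_expected is an assumption about what "do what mac expects" means based on the two points listed (EEXIST for a true duplicate, ENOSPC when the slot is simply taken).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define	SKETCH_ETHERADDRL	6

/*
 * Hypothetical stand-in for ether_cmp(): returns 0 when the two
 * addresses are equal and non-zero when they differ.
 */
static int
sketch_ether_cmp(const uint8_t *a, const uint8_t *b)
{
	return (memcmp(a, b, SKETCH_ETHERADDRL) != 0);
}

/*
 * The check as quoted in this issue: a non-zero result (addresses
 * differ) selects EEXIST, and a match selects EBUSY.
 */
static int
sketch_errno_as_quoted(const uint8_t *new_addr, const uint8_t *ring_addr)
{
	return (sketch_ether_cmp(new_addr, ring_addr) ? EEXIST : EBUSY);
}

/*
 * Assumed corrected mapping: EEXIST only for a genuine duplicate,
 * ENOSPC when the ring's single slot holds a different address.
 */
static int
sketch_errno_expected(const uint8_t *new_addr, const uint8_t *ring_addr)
{
	return (sketch_ether_cmp(new_addr, ring_addr) == 0 ? EEXIST : ENOSPC);
}
```

With the two addresses from the trace above (ring address 64:51:6:d8:7:f8, new VNIC address 52:10:62:36:01:6f), the quoted check yields EEXIST even though the addresses differ, which is the "object already exists" that dladm reports.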
Now, this doesn't answer the question of why we are even seeing this on the 5720 and the other parts. While hardware could support more than one ring, it turns out that all of the devices have been told that they only support a single ring. This is why we're coming back on group 0 and ring 0 on the first vnic being created. It turns out that all the chips have decided to limit themselves to one by default and we haven't increased this.
I think the fix here is to change this around such that we properly return ENOSPC so that it'll put us into promisc mode. We can look at adding more rings and testing that once we have the rest of this working and folks aren't experiencing regressions. I have a prototype for this I'm building and will post for folks to test once I have this ready.
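As a rough illustration of why the error code matters, the following C sketch models a single ring with one address slot and a caller that treats ENOSPC as "fall back to promiscuous mode". Everything here is a toy model: the sketch_* types and functions are hypothetical and are not the MAC framework or bge interfaces, and the real fallback logic lives in mac, not in code shaped like this.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy ring with room for exactly one unicast address (hypothetical). */
typedef struct sketch_ring {
	bool	sr_has_rule;
	uint8_t	sr_addr[6];
} sketch_ring_t;

/* Toy MAC client that records whether it had to go promiscuous. */
typedef struct sketch_client {
	bool	sc_promisc;
} sketch_client_t;

/*
 * Driver side of the model: succeed on an empty ring, report EEXIST
 * for a duplicate address and ENOSPC when the slot is already used.
 */
static int
sketch_ring_addmac(sketch_ring_t *ring, const uint8_t *addr)
{
	if (ring->sr_has_rule) {
		return (memcmp(addr, ring->sr_addr, 6) == 0 ?
		    EEXIST : ENOSPC);
	}
	memcpy(ring->sr_addr, addr, 6);
	ring->sr_has_rule = true;
	return (0);
}

/*
 * Caller side of the model: ENOSPC is not fatal, it just means the
 * client has to receive its traffic via promiscuous mode instead.
 * Any other error is passed back unchanged.
 */
static int
sketch_client_add(sketch_ring_t *ring, sketch_client_t *client,
    const uint8_t *addr)
{
	int err = sketch_ring_addmac(ring, addr);

	if (err == ENOSPC) {
		client->sc_promisc = true;
		return (0);
	}
	return (err);
}
```

In this model the primary address takes the ring's only slot, a second address for a VNIC gets ENOSPC from the driver side and still succeeds by going promiscuous, whereas any other error (such as the EEXIST seen in this bug) would be returned to the caller and fail VNIC creation.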
Updated by Robert Mustacchi over 3 years ago
I've put together a potential prototype at https://code.illumos.org/c/illumos-gate/+/653. If folks would be able to test that out, that'd be appreciated. If I can help put together materials to help with that, please let me know.
Updated by Marcel Telka over 3 years ago
- Related to Bug #12713: Network error after udating to 2020.04 / bge0: DL_BIND_REQ failed added
Updated by Marcel Telka over 3 years ago
- Related to Feature #12450: Add support for BCM57765 family devices to bge added
Updated by Marcel Telka over 3 years ago
- Related to Bug #12496: bge mac address initialization is wrong added
Updated by Marcel Telka over 3 years ago
- Related to Bug #12497: bge ape locking left always disabled after 7513 added
Updated by Marcel Telka over 3 years ago
- Related to Bug #12498: bge ring interrupt masking logic is broken added
Updated by Chris Ridd over 3 years ago
Robert Mustacchi wrote:
I've put together a potential prototype at https://code.illumos.org/c/illumos-gate/+/653. If folks would be able to test that out, that'd be appreciated. If I can help put together materials to help with that, please let me know.
I've tested Jason's updated debug build at https://us-east.manta.joyent.com/jbk/public/tmp/bge/platform-20200510T182623Z.usb.gz and vmadm can now successfully start my zones!
Updated by Robert Mustacchi over 3 years ago
- Has duplicate Bug #12713: Network error after udating to 2020.04 / bge0: DL_BIND_REQ failed added
Updated by Robert Mustacchi over 3 years ago
Ultimately we've been able to conclude that the earlier analysis was correct. This was introduced as a side effect of the attempts I made to fix the panics and probable corruption in 12496. My apologies for the breakage. Chris, Olivia, thank you very much for your help here.
Updated by Olivia Mocellini Merbeller over 3 years ago
Hi guys, I can confirm that VNIC creation also works on a NetXtreme BCM57781 with the same PI (20200510T182623Z). :)
Updated by Electric Monk over 3 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit 8291b3b94350ddd6df6ecd55435b59079f7a3dd2
commit 8291b3b94350ddd6df6ecd55435b59079f7a3dd2
Author: Robert Mustacchi <rm@fingolfin.org>
Date: 2020-05-12T16:04:22.000Z

    12686 dladm: vnic creation over bge0 failed: object already exists
    Reviewed by: Ryan Zezeski <ryan@zinascii.com>
    Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
    Approved by: Dan McDonald <danmcd@joyent.com>