Bug #15167

Panic when halting a zone with self-created links

Added by Nils Nieuwejaar over 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Assignee: Andy Fiddaman
Category: networking
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR: 2662
External Bug: oxide:stlouis#246

Description

The panic comes from an assertion failure:

assertion failed: ddp->dd_ref > 1, file: ../../common/io/dls/dls_mgmt.c, line: 1108

Stack:

fffffe00210879c0 fffffffffbdcb895 ()
fffffe00210879f0 dls:dls_devnet_rele+70 ()
fffffe0021087a50 dls:dls_devnet_setzid+9c ()
fffffe0021087a80 dls:dls_zone_remove+2f ()
fffffe0021087b00 genunix:zone_datalink_walk+106 ()
fffffe0021087b20 dls:dls_stack_shutdown+19 ()
fffffe0021087ba0 genunix:netstack_apply_shutdown+10d ()
fffffe0021087be0 genunix:apply_all_modules_reverse+3f ()
fffffe0021087c20 genunix:netstack_zone_shutdown+13b ()
fffffe0021087c90 genunix:zsd_apply_shutdown+24d ()
fffffe0021087ce0 genunix:zsd_apply_all_keys+4e ()
fffffe0021087d30 genunix:zone_zsd_callbacks+e4 ()
fffffe0021087d70 genunix:zone_shutdown+11f ()
fffffe0021087eb0 genunix:zone+2ff ()
fffffe0021087f00 unix:brand_sys_syscall32+290 ()

When moving a link from the global zone to a non-global zone, we bump up the reference count in dls_devnet_setzid(). When moving a link back to the global zone, we drop the reference count in the same routine.

When a link is created within the zone, dls_devnet_set() calls i_dls_devnet_setzid() directly, bypassing the reference count manipulation.

When halting a zone, the links within it are returned to the global zone via dls_devnet_setzid(). Any links created within the zone will be missing the expected extra reference, causing this panic.
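
To make the asymmetry concrete, here is a simplified sketch of the reference
flow described above. It is illustrative only: the field names mirror
dls_devnet_t, but this is not the actual dls_mgmt.c code.

/*
 * Simplified sketch of the reference flow; not the actual dls_mgmt.c code.
 */
typedef struct devnet_sketch {
        int     dd_ref;         /* reference count (illustrative) */
        int     dd_zid;         /* zone the link currently sits in */
        int     dd_owner_zid;   /* zone that created the link */
} devnet_sketch_t;

/*
 * Moving a link between the GZ and an NGZ: the GZ->NGZ direction takes an
 * extra hold, the NGZ->GZ direction drops it again.
 */
void
sketch_setzid(devnet_sketch_t *ddp, int newzid)
{
        if (newzid != 0)
                ddp->dd_ref++;          /* loaned into the zone */
        else
                ddp->dd_ref--;          /* assumes the matching hold exists */
        ddp->dd_zid = newzid;
}

/*
 * A link created inside the zone has its zone id set directly, so the extra
 * hold above is never taken.
 */
void
sketch_create_in_zone(devnet_sketch_t *ddp, int zid)
{
        ddp->dd_owner_zid = zid;
        ddp->dd_zid = zid;
}

On zone halt, the equivalent of sketch_setzid(ddp, 0) runs for every link in
the zone; for a zone-created link the decrement has no matching increment,
which is the dd_ref > 1 assertion failing above.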

I can imagine multiple ways of attacking this problem: explicitly taking an extra reference on zone-created links, checking dd_owner_zid in dls_devnet_setzid() and only manipulating the refcount when moving a link to/from its owned zone, and probably others.

I'm happy to take on this bug or defer to anybody with more familiarity with this code.


Related issues

Related to illumos gate - Bug #15553: dlmgmtd asserts on processing NGZ persistent configuration file (Closed; Andy Fiddaman)

Actions #1

Updated by Nils Nieuwejaar over 1 year ago

After looking at this a bit more, the reference count isn't really the problem - it's a symptom of the problem.

It's problematic to move a zone-created link into the global zone as part of the zone destroy as there is state that is shared by the link and the zone. We also can't simply destroy the link as the zone is destroyed, as there is link-type bookkeeping that needs to be done at a higher level.

It seems that the cleanest solution is to prevent a zone from being shut down if there are non-global links in it, which can be achieved simply by not moving the links to the global zone in dls_zone_remove(). This leaves the zone in the "down" state:

11:42:14 odev:~$ pfexec zoneadm -z test halt
zone 'test': datalinks remain in zone after shutdown
zone 'test': unable to unconfigure network interfaces in zone
zone 'test': unable to destroy zone

Unfortunately, once the zone is in this state, there doesn't seem to be any means to destroy the links. They aren't visible in the global zone, and we can no longer run in the context of the non-global zone.

Actions #2

Updated by Andy Fiddaman over 1 year ago

I haven't been successful in triggering the panic reported here, but I can replicate the suspect reference handling.

I created a zone and gave it the sys_dl_config privilege, and then created a VNIC over a spare physical interface in the GZ, and passed both in. Once in the zone, I created a second VNIC over the physical interface and compared:

This is how it looks in the zone:

root@lipkg:~# dladm create-vnic -l vioif1 zvnic1
root@lipkg:~# dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
vioif1      phys      1500   up       ?          --
gzvnic0     vnic      1500   up       --         vioif1
zvnic1      vnic      1500   up       --         vioif1

and the two different VNICs, both over vioif1:

> fffffe2d1a01f5a8::print dls_devnet_t
{
    dd_linkid = 0x12
    dd_linkname = [ "gzvnic0" ]              <--- was created in GZ
    dd_mac = [ "vnic1020" ]
    dd_ksp = 0xfffffe2d43de63a0
    dd_zone_ksp = 0xfffffe2d7070ca00
    dd_ref = 0x2                             <--- ref for both GZ and zone
    dd_tref = 0
    dd_flags = 0
    dd_owner_zid = 0                         <--- owned by GZ
    dd_zid = 0x3                             <--- in zone
    dd_prop_loaded = 0x1 (B_TRUE)
    dd_prop_taskid = 0
}
> fffffe2d43efae20::print dls_devnet_t
{
    dd_linkid = 0x17
    dd_linkname = [ "zvnic1" ]               <--- was created in zone
    dd_mac = [ "vnic1024" ]
    dd_ksp = 0xfffffe2d743933a0
    dd_zone_ksp = 0
    dd_ref = 0x1                             <--- ref for zone only
    dd_tref = 0
    dd_flags = 0
    dd_owner_zid = 0x3                       <--- owned by zone
    dd_zid = 0x3                             <--- in zone
    dd_prop_loaded = 0x1 (B_TRUE)
    dd_prop_taskid = 0
}

Actions #3

Updated by Andy Fiddaman over 1 year ago

Once the zone is halted:

> fffffe2d1a01f5a8::print dls_devnet_t
{
    dd_linkid = 0x12
    dd_linkname = [ "gzvnic0" ]
    dd_mac = [ "vnic1020" ]
    dd_ksp = 0xfffffe2d43de63a0
    dd_zone_ksp = 0
    dd_ref = 0x1
    dd_mutex = {
        _opaque = [ 0 ]
    }
    dd_cv = {
        _opaque = 0
    }
    dd_tref = 0
    dd_flags = 0
    dd_owner_zid = 0
    dd_zid = 0
    dd_prop_loaded = 0x1 (B_TRUE)
    dd_prop_taskid = 0
}
> fffffe2d43efae20::print dls_devnet_t
{
    dd_linkid = 0x17
    dd_linkname = [ "zvnic1" ]
    dd_mac = [ "vnic1024" ]
    dd_ksp = 0
    dd_zone_ksp = 0
    dd_ref = 0
    dd_mutex = {
        _opaque = [ 0 ]
    }
    dd_cv = {
        _opaque = 0
    }
    dd_tref = 0
    dd_flags = 0
    dd_owner_zid = 0x3
    dd_zid = 0
    dd_prop_loaded = 0x1 (B_TRUE)
    dd_prop_taskid = 0
}

Actions #4

Updated by Andy Fiddaman over 1 year ago

and the VNIC cannot be deleted in the GZ without changing the reference count:

bloody# dladm delete-vnic zvnic1
dladm: vnic deletion failed: link busy
bloody# mdb -kwe 'fffffe2d43efae20::print -a dls_devnet_t dd_ref | /w1'
0xfffffe2d43efaf58:             0       =       0x1
bloody# dladm delete-vnic zvnic1
bloody#
Actions #5

Updated by Andy Fiddaman over 1 year ago

  • Category changed from kernel to networking
  • Status changed from New to In Progress
  • Assignee set to Andy Fiddaman
  • External Bug set to oxide:stlouis#246
Actions #6

Updated by Electric Monk over 1 year ago

  • Gerrit CR set to 2662
Actions #7

Updated by Andy Fiddaman over 1 year ago

I've put a change up for review at
https://code.illumos.org/c/illumos-gate/+/2662 which takes pieces of the work
that's been done around this area in SmartOS, with an eye on reducing the delta
between SmartOS and gate and taking advantage of the experience gained there.
When there were multiple possible approaches, I chose to reduce the delta.

  • OS-327 dlmgmtd/zoneadmd interdependencies prevent dlmgmtd from starting
  • OS-375 i_dls_mgmt_upcall()/dlmgmt_zfop() deadlock in dlmgmtd
  • OS-2608 dlmgmtd needs to record zone identifiers
  • OS-3492 zone_free asserts to its destruction when dlmgmtd has fallen
  • OS-5363 dlmgmtd forking with write locks results in deadlock chaos
  • OS-6274 links owned by NGZ erroneously marked as on loan

The nub of this, and what ultimately fixes the panic, is the introduction of a
new concept of a "transient" link. Such a link is bound to the lifetime of a
zone and will be destroyed when (or around the time that) the zone exits. Links
created within a zone acquire this new property. In addition, links created
within a zone are now required to also be "temporary" - persistent links
already don't work here, it just wasn't previously enforced. They could be made
to work, and might be useful, but that's a change beyond what I've done.
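
As a rough sketch of the idea, using hypothetical names (the structure and
helpers below are not those in the change under review), the per-link
decision at zone halt becomes:

#include <stdio.h>

/* Hypothetical, simplified link state; not the structures in the review. */
typedef struct link_sketch {
        const char      *name;
        int             zid;            /* zone the link currently sits in */
        int             owner_zid;      /* zone that created it */
        int             transient;      /* bound to the zone's lifetime */
} link_sketch_t;

static void
sketch_destroy_link(link_sketch_t *lp)
{
        (void) printf("%s: destroyed along with its zone\n", lp->name);
}

static void
sketch_return_to_gz(link_sketch_t *lp)
{
        lp->zid = 0;
        (void) printf("%s: returned to the global zone\n", lp->name);
}

/* Per-link action when the zone is halted (hypothetical walk callback). */
void
sketch_zone_remove(link_sketch_t *lp)
{
        if (lp->transient)
                sketch_destroy_link(lp);        /* created inside the zone */
        else
                sketch_return_to_gz(lp);        /* loaned in from the GZ */
}

The dlinks output later in this comment shows the same split: the zone-created
vnic0, tfport0 and gimlet30 have transient=1 and are gone after the halt,
while the loaned links return to zid=0.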

It's worth noting that SmartOS' aims are a bit different in a few places, so
this change set reduces the delta but does not eliminate it. I also found some
additional bugs along the way that don't show up in general SmartOS usage given
the limited scope of what is done with datalinks there (and I suspect very few
people outside of Oxide have created datalinks directly in a zone that has the
appropriate privileges). The biggest difference comes from the fact that, on
SmartOS, once a temporary VNIC is "given" to a zone, it becomes transient -
that isn't something desirable for a general distribution that wants links to
move into and out of a zone as it boots and halts (even temporary ones). In the
end, although the transient concept is the same, when it is applied differs.
There is also a small amount of other work that was done in SmartOS which I
have not taken as part of this change - such as the work to remove the onloan
AVL tree and some things around GZ visibility.

To summarise the two ways in which links can end up in a zone:

  • By being created in the GZ and then subsequently having its zone link property set (such as by zoneadmd:vplat, or by dladm in some way). These links are on-loan. They can be persistent or temporary.
  • By being created within a zone that has the sys_dl_config privilege. These can only be temporary.

SmartOS flags links in the first of these categories as transient if they are
temporary. While these two concepts are really orthogonal, this works well
for SmartOS. My change flags links in the second category. It is arguable
whether creating a temporary link and assigning it to a zone in one step
(dladm create-vnic -t -p zone=ngz vnic0) could also be considered transient;
as of this change it is not -- it also appears not to work properly on gate
(the property is not set), and I have not investigated this.

In addition to testing that this change resolves the original panic, I've also
tested that:

  • Restarting dlmgmtd now results in its internal state matching that from before the restart. This is true for zones with a variety of loaned and transient interfaces;
  • Halting a zone without dlmgmtd running still fails, leaving the zone in the down state (and there is no longer a kernel panic), and then...
  • Restarting dlmgmtd and then re-running zoneadm -z ngz halt recovers the situation with all datalinks being moved back to the GZ or destroyed, as appropriate.
helios # zonecfg -z test info net | egrep '^n|phy' | paste - -
net:            physical: tfpkt0
net:            physical: cxgbe0
net:            physical: vnic1
helios # zoneadm -z test boot
helios # zlogin test dladm create-vnic -tl cxgbe0 vnic0
helios # zlogin test dladm create-tfport -p0 -l tfpkt0 tfport0
helios # zlogin test dladm create-vlan -tv 258 -l tfport0 gimlet30

helios # dlinks
* KERNEL *

      igb0       igb0 ref=4 transient=0 zid=0 ownerzid=0
     vnic0   vnic1014 ref=1 transient=1 zid=7 ownerzid=7
     vnic1   vnic1005 ref=2 transient=0 zid=7 ownerzid=0
    cxgbe0     cxgbe0 ref=2 transient=0 zid=7 ownerzid=0
    cxgbe1     cxgbe1 ref=1 transient=0 zid=0 ownerzid=0
    tfpkt0     tfpkt0 ref=2 transient=0 zid=7 ownerzid=0
   tfport0 tfport1015 ref=1 transient=1 zid=7 ownerzid=7
  gimlet30   vnic1016 ref=1 transient=1 zid=7 ownerzid=7

* DLMGMTD *

      igb0 transient=0 zid=0 flags=3 loaned=0
     vnic0 transient=1 zid=7 flags=1 loaned=0
     vnic1 transient=0 zid=7 flags=3 loaned=1
    cxgbe0 transient=0 zid=7 flags=3 loaned=1
    cxgbe1 transient=0 zid=0 flags=3 loaned=0
    tfpkt0 transient=0 zid=7 flags=3 loaned=1
   tfport0 transient=1 zid=7 flags=1 loaned=0
  gimlet30 transient=1 zid=7 flags=1 loaned=0

* GZ *

igb0    linkid=int,1;zone=string,global;class=int,1;media=int,4;
tfpkt0  linkid=int,2;zone=string,global;class=int,1;media=int,4;
cxgbe0  linkid=int,3;zone=string,global;class=int,1;media=int,4;
cxgbe1  linkid=int,4;zone=string,global;class=int,1;media=int,4;
vnic1   linkid=int,5;zone=string,global;class=int,8;media=int,4;
vnic0   linkid=int,6;zone=string,test;class=int,8;media=int,4;transient=boolean,1;
tfport0 linkid=int,7;zone=string,test;class=int,2048;media=int,0;transient=boolean,1;
gimlet30        linkid=int,8;zone=string,test;class=int,2;media=int,0;transient=boolean,1;

* ZONE *

tfpkt0  linkid=int,2;zone=string,test;class=int,1;media=int,4;
cxgbe0  linkid=int,3;zone=string,test;class=int,1;media=int,4;

helios # dlinks > /tmp/old
helios # svcadm restart datalink-management
helios # dlinks > /tmp/new
helios # diff /tmp/{old,new}
helios # diff -u /tmp/{old,new}
No differences encountered

helios # zoneadm -z test halt
helios #
helios # dlinks
* KERNEL *

      igb0       igb0 ref=4 transient=0 zid=0 ownerzid=0
     vnic1   vnic1005 ref=1 transient=0 zid=0 ownerzid=0
    cxgbe0     cxgbe0 ref=1 transient=0 zid=0 ownerzid=0
    cxgbe1     cxgbe1 ref=1 transient=0 zid=0 ownerzid=0
    tfpkt0     tfpkt0 ref=1 transient=0 zid=0 ownerzid=0

* DLMGMTD *

      igb0 transient=0 zid=0 flags=3 loaned=0
     vnic1 transient=0 zid=0 flags=3 loaned=0
    cxgbe0 transient=0 zid=0 flags=3 loaned=0
    cxgbe1 transient=0 zid=0 flags=3 loaned=0
    tfpkt0 transient=0 zid=0 flags=3 loaned=0

* GZ *

igb0    linkid=int,1;zone=string,global;class=int,1;media=int,4;
tfpkt0  linkid=int,2;zone=string,global;class=int,1;media=int,4;
cxgbe0  linkid=int,3;zone=string,global;class=int,1;media=int,4;
cxgbe1  linkid=int,4;zone=string,global;class=int,1;media=int,4;
vnic1   linkid=int,5;zone=string,global;class=int,8;media=int,4;
helios # zoneadm -z test boot
helios # zlogin test dladm create-vnic -tl cxgbe0 vnic0
helios # zlogin test dladm create-tfport -p0 -l tfpkt0 tfport0
helios # zlogin test dladm create-vlan -tv 258 -l tfport0 gimlet30
helios #
helios # svcadm disable datalink-management
helios # zoneadm -z test halt
zone 'test': clear link 11 'protection' link property: operation failed
zone 'test': clear link 11 'allowed-ips' link property: operation failed
zone 'test': clear link 10 'protection' link property: operation failed
zone 'test': clear link 10 'allowed-ips' link property: operation failed
zone 'test': clear link 9 'protection' link property: operation failed
zone 'test': clear link 9 'allowed-ips' link property: operation failed
zone 'test': clear link 5 'protection' link property: operation failed
zone 'test': clear link 5 'allowed-ips' link property: operation failed
zone 'test': clear link 3 'protection' link property: operation failed
zone 'test': clear link 3 'allowed-ips' link property: operation failed
zone 'test': clear link 2 'protection' link property: operation failed
zone 'test': clear link 2 'allowed-ips' link property: operation failed
zone 'test': failed to get link info for 11: operation failed
zone 'test': failed to get link info for 10: operation failed
zone 'test': failed to get link info for 9: operation failed
zone 'test': failed to get link info for 5: operation failed
zone 'test': failed to get link info for 3: operation failed
zone 'test': failed to get link info for 2: operation failed
zone 'test': 6 datalink(s) remain in zone after shutdown
zone 'test': unable to unconfigure network interfaces in zone
zone 'test': unable to destroy zone
helios #
helios # svcadm enable datalink-management
helios # zoneadm -z test halt
helios # zoneadm list -vc
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              ipkg     shared
   - test             installed  /zones/test                    omicron1 excl
helios # dlinks
* KERNEL *

      igb0       igb0 ref=4 transient=0 zid=0 ownerzid=0
     vnic1   vnic1005 ref=1 transient=0 zid=0 ownerzid=0
    cxgbe0     cxgbe0 ref=1 transient=0 zid=0 ownerzid=0
    cxgbe1     cxgbe1 ref=1 transient=0 zid=0 ownerzid=0
    tfpkt0     tfpkt0 ref=1 transient=0 zid=0 ownerzid=0

* DLMGMTD *

      igb0 transient=0 zid=0 flags=3 loaned=0
     vnic1 transient=0 zid=0 flags=3 loaned=0
    cxgbe0 transient=0 zid=0 flags=3 loaned=0
    cxgbe1 transient=0 zid=0 flags=3 loaned=0
    tfpkt0 transient=0 zid=0 flags=3 loaned=0

* GZ *

igb0    linkid=int,1;zone=string,global;class=int,1;media=int,4;
tfpkt0  linkid=int,2;zone=string,global;class=int,1;media=int,4;
cxgbe0  linkid=int,3;zone=string,global;class=int,1;media=int,4;
cxgbe1  linkid=int,4;zone=string,global;class=int,1;media=int,4;
vnic1   linkid=int,5;zone=string,global;class=int,8;media=int,4;
Actions #8

Updated by Patrick Mooney over 1 year ago

Linking and transcribing the downstream tickets for posterity:

OS-327 dlmgmtd/zoneadmd interdependencies prevent dlmgmtd from starting
OS-375 i_dls_mgmt_upcall()/dlmgmt_zfop() deadlock in dlmgmtd
OS-2608 dlmgmtd needs to record zone identifiers
OS-3492 zone_free asserts to its destruction when dlmgmtd has fallen
OS-5363 dlmgmtd forking with write locks results in deadlock chaos
OS-6274 links owned by NGZ erroneously marked as on loan

OS-327

We had a situation in no.de production where dlmgmtd refused to start because of a down zone. truss output from dlmgmtd's failed attempt to come up:

17160:  getpid()                                        = 17160 [17141]
17160:  lwp_self()                                      = 1
17160:  munmap(0xFEBE1000, 1040384)                     = 0
17160:  lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000)
 = 0xFFBFFEFF [0xFFFFFFFF]
17160:  ioctl(5, (('c'<<24)|('t'<<16)|('t'<<8)|1), 0x08046748) = 0
17160:  close(5)                                        = 0
17160:  schedctl()                                      = 0xFED5C000
17160:  getprivimplinfo(0x08045E80, 2076)               = 0
17160:  sysconfig(_CONFIG_NGROUPS)                      = 16
17160:  zone_lookup(0x00000000)                         = 0
17160:  zone_getattr(0, ZONE_ATTR_PRIVSET, 0xFEE10288, 12) = 12
17160:  setppriv(PRIV_SET, PRIV_EFFECTIVE, {ffffffffffffffffffffffff}) = 0
17160:  zone_enter(4)                                   Err#22 EINVAL
17160:  _exit(22)

zoneadm output:

[root@a4-ba-db-11-91-ff /var/tmp]# zoneadm list -v
  ID NAME             STATUS     PATH                           BRAND    IP    
   0 global           running    /                              liveimg  shared
   1 fgbsuqaa         running    /zones/fgbsuqaa                joyent   excl  
   2 gnuda3aa         running    /zones/gnuda3aa                joyent   excl  
   3 gpyt5eac         running    /zones/gpyt5eac                joyent   excl  
   4 fhfzf8aa         down       /zones/fhfzf8aa                joyent   excl  
   5 gmp54vaa         running    /zones/gmp54vaa                joyent   excl  
   6 gmqggxaa         running    /zones/gmqggxaa                joyent   excl  
...

Attempting to manually halt the errant zone:

[root@a4-ba-db-11-91-ff /var/tmp]# zoneadm -z fhfzf8aa halt
zone 'fhfzf8aa': operation failed
zone 'fhfzf8aa': unable clear datalink protect property
zone 'fhfzf8aa': unable to destroy zone

And then a truss of the failing zoneadmd:

/3:     zone_lookup("fhfzf8aa")                         = 4
/3:     zone_getattr(4, ZONE_ATTR_FLAGS, 0xFE64DA3A, 2) = 2
/3:     zone_list_datalink(4, 0xFE64DA34, 0x00000000)   = 0
/3:     zone_list_datalink(4, 0xFE64DA34, 0x08092510)   = 0
/3:     door_call(5, 0xFE64D880)                        Err#9 EBADF
/3:     close(5)                                        = 0
/3:     open("/etc/svc/volatile/dladm/dlmgmt_door", O_RDONLY) = 5

That is, we're failing to remove the datalink protect property because dlmgmtd is not running, and we therefore cannot reboot the zone -- but dlmgmtd cannot start because the zone is down. Note that, because of an earlier bug, dlmgmtd had recently been sent a SIGKILL, which may have left it in a partially invalid state. Regardless, the inability to get out of this state is a bug.

...

I was able to reproduce something similar to this even with the new
vnic on demand code on the latest develop branch.

I disabled the datalink svc so that dlmgmtd won't run or restart.
When I try to halt a zone, it won't halt and it stays in the "running"
state. We get the following message, which is what we saw with the old
code above:

zone 'adminui': unable clear datalink protect property
zone 'adminui': unable to destroy zone

This is coming from zoneadmd vplat_teardown() which calls
remove_datalink_pool() and prints the message if that has an
error. It also stops the zone from shutting down which is why
it stays in the "running" state.

I need to change this code so that it is more tolerant of errors
and the zone continues to shut down even in the face of a failure like
this one.
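
A minimal sketch of that error-tolerance pattern, with hypothetical helper
names (this is not the actual zoneadmd change):

#include <stdio.h>

/* Hypothetical stand-in for the door call to dlmgmtd. */
static int
sketch_remove_datalink_pool(const char *zonename)
{
        (void) zonename;
        return (-1);            /* pretend dlmgmtd is unavailable */
}

static void
sketch_warn(const char *zonename, const char *msg)
{
        (void) fprintf(stderr, "zone '%s': %s\n", zonename, msg);
}

/* Tear down datalinks, but keep going if the door call fails. */
void
sketch_teardown_datalinks(const char *zonename)
{
        if (sketch_remove_datalink_pool(zonename) != 0) {
                /* Warn rather than aborting the halt. */
                sketch_warn(zonename, "unable to clear datalink protect property");
        }
        /* ... continue with the rest of the teardown ... */
}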

I've fixed up usr/src/uts/common/io/dls/dls_mgmt.c so that dladm
can delete orphaned vnics. I also reproduced the dlmgmtd startup
problem with down zones using the following sequence:

1) all zones up and running
2) svcadm disable datalink-management
3) cd into a zone mountpoint
4) zoneadm halt that zone (this will cause the zone to stick in the "down" state since the fs can't umount)
5) svcadm enable datalink-management

The svc will be in maintenance since dlmgmtd will have failed to enter the zone
in the down state.

In addition to the changes I've made so far, I need to make dlmgmtd more tolerant of failures when entering zones in a bad state.

I've updated dlmgmtd to skip over zones that are going down and now
it comes up clean even when I have a zone stuck in the down state.
I'll do some more testing and push these hardening changes tomorrow.

The set of changes is described in the comments. In summary:

  • zones will now halt even if dlmgmtd is unavailable
  • dlmgmtd will come up and initialize all of the zones it needs to, ignoring those that don't need to be initialized
  • dladm can be used to clear orphaned vnics
  • zoneadmd will report errors a little better

OS-375

In no.de production, we saw several (specifically, three) problems due to dlmgmtd. The specific pathologies of each of these varied, but they all had the same design issue at their root: the upcall from the kernel to dlmgmtd inducing complicated, multi-way deadlocks between user- and kernel-level.

Here are the (trimmed) in-kernel stacks for dlmgmtd for the first case:

THREAD           STATE    SOBJ                COUNT
ffffff0d38a1c020 SLEEP    CV                     48
                 swtch+0x145
                 cv_wait_sig_swap_core+0x174
                 cv_wait_sig_swap+0x18
                 cv_waituntil_sig+0x13c
                 lwp_park+0x157
                 syslwp_park+0x31
                 sys_syscall32+0xff

ffffff0d56ea2c40 STOPPED  <NONE>                  2
                 swtch+0x145
                 stop+0x82e
                 issig_forreal+0x1d7
                 issig+0x20
                 door_return+0x3e9
                 doorfs32+0x134
                 sys_syscall32+0xff
...
ffffff0d56fd73c0 SLEEP    RWLOCK                  1
                 swtch+0x145
                 turnstile_block+0x760
                 rw_enter_sleep+0x1a3
                 dls_devnet_hold_common+0x58
                 dls_devnet_hold_tmp+0x1a
                 drv_ioc_phys_attr+0x8a
                 drv_ioctl+0x190
                 cdev_ioctl+0x45
                 spec_ioctl+0x5a
                 fop_ioctl+0x7b
                 ioctl+0x18e
                 sys_syscall32+0xff
...
ffffff0d56efd7c0 SLEEP    SHUTTLE                 1
                 swtch_to+0xe6
                 shuttle_resume+0x325
                 door_upcall+0x243
                 door_ki_upcall_limited+0x69
                 stubs_common_code+0x51
                 i_dls_mgmt_upcall+0x100
                 i_dls_mgmt_get_linkattr+0x6e
                 dls_mgmt_get_phydev+0x3d
                 drv_ioc_phys_attr+0x5e
                 drv_ioctl+0x190
                 cdev_ioctl+0x45
                 spec_ioctl+0x5a
                 fop_ioctl+0x7b
                 ioctl+0x18e
                 sys_syscall32+0xff

And here are the stacks (again, trimmed) at user-level:

THREAD   STATE    SOBJ        COUNT
2        PARKED   MX             43
         libc_hwcap1.so.1`mutex_lock_impl+0x163
         libc_hwcap1.so.1`mutex_lock+0x10
         libc_hwcap1.so.1`malloc+0x2a
         libc_hwcap1.so.1`_ucred_alloc+0x21
         libc_hwcap1.so.1`door_ucred+0x23
         dlmgmt_handler+0x40
         libc_hwcap1.so.1`__door_return+0x3e

f        UNPARKED <NONE>          2
         dlmgmt_handler+0xf2
         libc_hwcap1.so.1`__door_return+0x3e

d        UNPARKED <NONE>          1
         libc_hwcap1.so.1`suspend_fork+0x66
         libc_hwcap1.so.1`forkx+0xd6
         libc_hwcap1.so.1`fork+0x1d
         dlmgmt_zfop+0xc0
         dlmgmt_zfopen+0xa0
         dlmgmt_process_db_onereq+0x5e
         dlmgmt_process_db_req+0x4d
         dlmgmt_db_update+0x5a
         dlmgmt_write_db_entry+0x27
         dlmgmt_remapid+0xe7
         dlmgmt_handler+0x84
         libc_hwcap1.so.1`__door_return+0x3e

Note that there are two threads in the dlmgmt_handler() trying to call door_ucred() to get the credentials of their caller -- and that these threads are blocked in malloc(). The holder of the libc_malloc_lock is thread 0xd -- who, having acquired the malloc locks -- is attempting to suspend every thread in preparation to fork. Of course, we will not be able to suspend the thread performing the upcall, and this is the deadlock: we cannot serve door requests because we're blocked in malloc(), but that malloc() is blocked by a fork() that is in turn blocked on the in-kernel thread making the door upcall. It is not clear how this deadlock can be avoided; as long as the process that is the target of an upcall can possibly fork (under any condition), this deadlock will be theoretically possible.

OS-5363

Looking at RM08211:

We have a bunch of dladm processes which are in a bad state:

[root@RM08211 (us-east-2) ~]# pgrep dladm | wc -l
      50

Based on this let's see what's going on with dlmgmtd:

THREAD           STATE    SOBJ                COUNT
ffffff431dc74c60 SLEEP    SHUTTLE              1553
                 swtch+0x141
                 shuttle_swtch+0x203
                 door_return+0x214
                 doorfs32+0x180
                 sys_syscall32+0x14a

ffffff431dc7f8a0 STOPPED  <NONE>                474
                 swtch+0x141
                 stop+0x386
                 issig_forreal+0x3e4
                 issig+0x25
                 door_return+0x3ef
                 doorfs32+0x180
                 sys_syscall32+0x14a

ffffff4332614140 SLEEP    CV                    130
                 swtch+0x141
                 cv_wait_sig_swap_core+0x1b9
                 cv_wait_sig_swap+0x17
                 cv_waituntil_sig+0xbd
                 lwp_park+0x15e
                 syslwp_park+0x63
                 sys_syscall32+0x14a

ffffff43337024c0 STOPPED  <NONE>                 22
                 swtch+0x141
                 stop+0x386
                 issig_forreal+0x3e4
                 issig+0x25
                 cv_wait_sig_swap_core+0x303
                 cv_wait_sig_swap+0x17
                 cv_waituntil_sig+0xbd
                 lwp_park+0x15e
                 syslwp_park+0x63
                 sys_syscall32+0x14a

ffffff4333702c00 SLEEP    RWLOCK                  2
                 swtch+0x141
                 turnstile_block+0x21a
                 rw_enter_sleep+0x19b
                 vnic_dev_delete+0x43
                 vnic_ioc_delete+0x28
                 drv_ioctl+0x1e4
                 cdev_ioctl+0x39
                 spec_ioctl+0x60
                 fop_ioctl+0x55
                 ioctl+0x9b
                 sys_syscall32+0x14a

ffffff44d103e3a0 SLEEP    CV                      1
                 swtch+0x141
                 cv_timedwait_sig_hires+0x39d
                 cv_waituntil_sig+0xfa
                 nanosleep+0x19f
                 sys_syscall32+0x14a

ffffff4e433544c0 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait_sig+0x185
                 lwp_suspend+0xa4
                 syslwp_suspend+0x48
                 sys_syscall32+0x14a  

ffffff431de23060 SLEEP    CV                      1
                 swtch+0x141
                 cv_wait_sig_swap_core+0x1b9
                 cv_wait_sig_swap+0x17
                 pause+0x45
                 sys_syscall32+0x14a

Let's look at the thread that's blocked on an rwlock (always a bad sign) and
see who holds that lock:

> ffffff4333702c00::findstack -v
stack pointer for thread ffffff4333702c00: ffffff01fc0bf9e0
[ ffffff01fc0bf9e0 _resume_from_idle+0x112() ]
  ffffff01fc0bfa10 swtch+0x141()
  ffffff01fc0bfab0 turnstile_block+0x21a(ffffff45f2578520, 0, ffffffffc011b270, fffffffffbc08ce0, 0, 0)
  ffffff01fc0bfb20 rw_enter_sleep+0x19b(ffffffffc011b270, 0)
  ffffff01fc0bfb90 vnic_dev_delete+0x43(24483, 0, ffffffefb9d39a28)
  ffffff01fc0bfbd0 vnic_ioc_delete+0x28(ffffff4562ecaec8, fd6cfd04, 100003, ffffffefb9d39a28, ffffff01fc0bfe58)
  ffffff01fc0bfc70 drv_ioctl+0x1e4(1200000000, 1710002, fd6cfd04, 100003, ffffffefb9d39a28, ffffff01fc0bfe58)
  ffffff01fc0bfcb0 cdev_ioctl+0x39(1200000000, 1710002, fd6cfd04, 100003, ffffffefb9d39a28, ffffff01fc0bfe58)
  ffffff01fc0bfd00 spec_ioctl+0x60(ffffff431dd93980, 1710002, fd6cfd04, 100003, ffffffefb9d39a28, ffffff01fc0bfe58, 0)
  ffffff01fc0bfd90 fop_ioctl+0x55(ffffff431dd93980, 1710002, fd6cfd04, 100003, ffffffefb9d39a28, ffffff01fc0bfe58, 0)
  ffffff01fc0bfeb0 ioctl+0x9b(0, 1710002, fd6cfd04)
  ffffff01fc0bff10 sys_syscall32+0x14a()
> ffffffffc011b270::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffffffc011b270 ffffff448549dbe0  B111 ffffff462bd1d760 (W)
                                    ||| ffffff4333702c00 (W)
                 WRITE_LOCKED ------+|| ffffff433448b880 (W)
                 WRITE_WANTED -------+|
                  HAS_WAITERS --------+
> ffffffffc011b270=p
                vnic_lock       
> ffffff448549dbe0::findstack -v
stack pointer for thread ffffff448549dbe0: ffffff01eb656010
[ ffffff01eb656010 _resume_from_idle+0x112() ]
  ffffff01eb656050 swtch_to+0xb6(ffffff473202cb80)
  ffffff01eb6560a0 shuttle_resume+0x2af(ffffff473202cb80, ffffffffc0014ed0)
  ffffff01eb656150 door_upcall+0x212(ffffff431de42140, ffffff01eb656240, ffffff42e273ce18, ffffffffffffffff, 0)
  ffffff01eb6561d0 door_ki_upcall_limited+0x67(ffffff431dccc5d8, ffffff01eb656240, ffffff42e273ce18, ffffffffffffffff, 0)
  ffffff01eb656220 stubs_common_code+0x59()
  ffffff01eb6562c0 i_dls_mgmt_upcall+0xbf(ffffff01eb6566f0, 28, ffffff01eb6562e0, 410)
  ffffff01eb656760 i_dls_mgmt_get_linkattr+0x76(6, fffffffffbbec385, ffffff01eb656788, ffffff01eb656778)
  ffffff01eb6567c0 dls_mgmt_get_phydev+0x40(6, ffffff01eb6567e0)
  ffffff01eb656830 dls_devnet_hold_common+0x41(6, ffffff01eb6568c8, 1)
  ffffff01eb656850 dls_devnet_hold_tmp+0x1a(6, ffffff01eb6568c8)
  ffffff01eb6568a0 stubs_common_code+0x59()
  ffffff01eb6568f0 mac_open_by_linkid+0x20(6, ffffff45fd86c088)
  ffffff01eb656af0 vnic_dev_create+0x543(244b1, 6, ffffff01eb656b54, ffffff01eb656b5c, ffffff01eb656b60, ffffff01eb656b58, ffffff420000000c, 
  ffffff42000008fc, ffffff0100000000, ffffffff00000000, ffffff7fccaa6044, ffffff4200000000, ffffff01eb656b50, ffffffb7fdbf61a8)
  ffffff01eb656bd0 vnic_ioc_create+0x10d(ffffff7fccaa6000, 80401c0, 100003, ffffffb7fdbf61a8, ffffff01eb656e58)
  ffffff01eb656c70 drv_ioctl+0x1e4(1200000000, 1710001, 80401c0, 100003, ffffffb7fdbf61a8, ffffff01eb656e58)
  ffffff01eb656cb0 cdev_ioctl+0x39(1200000000, 1710001, 80401c0, 100003, ffffffb7fdbf61a8, ffffff01eb656e58)
  ffffff01eb656d00 spec_ioctl+0x60(ffffff431dd93980, 1710001, 80401c0, 100003, ffffffb7fdbf61a8, ffffff01eb656e58, 0)
  ffffff01eb656d90 fop_ioctl+0x55(ffffff431dd93980, 1710001, 80401c0, 100003, ffffffb7fdbf61a8, ffffff01eb656e58, 0)
  ffffff01eb656eb0 ioctl+0x9b(4, 1710001, 80401c0)
  ffffff01eb656f10 _sys_sysenter_post_swapgs+0x153()
> ffffff448549dbe0::print kthread_t t_procp->p_user.u_psargs
t_procp->p_user.u_psargs = [ "dladm create-vnic -t -l ixgbe1 -p mtu=1500,zone=42f6b9ae-20fa-4d74-a5eb-a0d55a9" ]

Okay, so dlmgmtd is trying to grab the vnic_lock which it needs to
delete something. It presumably is holding a table lock while this is
happening, but we'll need to verify that. On the other hand, this other
thread, which appears to be trying to create a VNIC, holds the table
lock.

So based on this, let's take a look and see what's going on with
dlmgmtd and its internal locks:

> ::walk thread | ::stacks
THREAD   STATE    SOBJ        COUNT
2        UNPARKED <NONE>       2008
         dlmgmt_handler+0xf4
         libc.so.1`__door_return+0x3d

13       PARKED   MX            138
         libc.so.1`mutex_lock_impl+0x22f
         libc.so.1`mutex_lock+0x10
         libc.so.1`malloc+0x41
         libc.so.1`_ucred_alloc+0x21
         libc.so.1`door_ucred+0x23
         dlmgmt_handler+0x3f
         libc.so.1`__door_return+0x3d

12       UNPARKED <NONE>         11
         dlmgmt_handler+0x10b
         libc.so.1`__door_return+0x3d

1a       PARKED   CV              9
         libc.so.1`cond_wait_queue+0x5e
         libc.so.1`__cond_wait+0x83
         libc.so.1`cond_wait+0x24
         libc.so.1`pthread_cond_wait+0x1e
         dlmgmt_table_lock+0x2b
         dlmgmt_getnext+0x22
         dlmgmt_handler+0x96
         libc.so.1`__door_return+0x3d

881      UNPARKED <NONE>          6
         libc.so.1`door_create_func+0x29
         libc.so.1`_thrp_setup+0x8c
         libc.so.1`_lwp_start

17       UNPARKED <NONE>          2

2a1      PARKED   CV              2
         libc.so.1`cond_wait_queue+0x5e
         libc.so.1`__cond_wait+0x83
         libc.so.1`cond_wait+0x24
         libc.so.1`pthread_cond_wait+0x1e
         dlmgmt_table_lock+0x2b
         dlmgmt_upcall_getattr+0x15
         dlmgmt_handler+0x96
         libc.so.1`__door_return+0x3d

40       PARKED   CV              2
         libc.so.1`cond_wait_queue+0x5e
         libc.so.1`__cond_wait+0x83
         libc.so.1`cond_wait+0x24
         libc.so.1`pthread_cond_wait+0x1e
         dlmgmt_table_lock+0x2b
         dlmgmt_zonehalt+0xb6
         dlmgmt_handler+0x96
         libc.so.1`__door_return+0x3d

322      UNPARKED <NONE>          1
         libc.so.1`cond_wait_queue+0x5e
         libc.so.1`__cond_wait+0x83
         libc.so.1`cond_wait+0x24
         libc.so.1`pthread_cond_wait+0x1e
         dlmgmt_table_lock+0x2b
         dlmgmt_getlinkid+0x2a
         dlmgmt_handler+0x96
         libc.so.1`__door_return+0x3d 

149      UNPARKED <NONE>          1
         libc.so.1`cond_wait_queue+0x5e
         libc.so.1`__cond_wait+0x83
         libc.so.1`cond_wait+0x24
         libc.so.1`pthread_cond_wait+0x1e
         dlmgmt_table_lock+0x2b
         dlmgmt_getnext+0x22
         dlmgmt_handler+0x96
         libc.so.1`__door_return+0x3d

500      UNPARKED <NONE>          1
         libc.so.1`sleep+0x2d
         dlmgmt_zonehalt+0x6f
         dlmgmt_handler+0x96
         libc.so.1`__door_return+0x3d

148      UNPARKED <NONE>          1
         libc.so.1`suspend_fork+0x97
         libc.so.1`forkx+0x121
         libc.so.1`fork+0x1a
         dlmgmt_zfop+0xbf
         dlmgmt_zfopen+0xb8
         dlmgmt_process_db_onereq+0x54
         dlmgmt_process_db_req+0x4a
         dlmgmt_db_init+0xcc
         dlmgmt_zone_init+0x12e
         dlmgmt_zoneboot+0x55
         dlmgmt_handler+0x96
         libc.so.1`__door_return+0x3d

1        UNPARKED <NONE>          1
         main+0xe2
         _start+0x83

18       PARKED   CV              1
         libc.so.1`cond_wait_queue+0x5e
         libc.so.1`__cond_wait+0x83
         libc.so.1`cond_wait+0x24
         libc.so.1`pthread_cond_wait+0x1e
         dlmgmt_table_lock+0x2b
         dlmgmt_getlinkid+0x2a
         dlmgmt_handler+0x96
         libc.so.1`__door_return+0x3d

It appears that tid 2a1 is the one handling the vnic creation, insofar
as right now it's asking for physical attributes. It is, of course,
trying to grab the table lock, which is an rwlock. So let's see if it's
trying to grab a read lock or a write lock and, if so, what's going on:

> 0x2a1::findstack -v
stack pointer for thread 2a1: d4cd9840
[ d4cd9840 libc.so.1`__lwp_park+0xb() ]
  d4cd9870 libc.so.1`cond_wait_queue+0x5e(806b6d8, 806b6c0, 0, feefbcf8)
  d4cd98b0 libc.so.1`__cond_wait+0x83(806b6d8, 806b6c0, d4cd98d0,
feefbda1, fef73000, d4cd9dd8)
  d4cd98d0 libc.so.1`cond_wait+0x24(806b6d8, 806b6c0, d4cd9e00, f5f00)
  d4cd98f0 libc.so.1`pthread_cond_wait+0x1e(806b6d8, 806b6c0, 0,
d4cd9e00)
  d4cd9910 dlmgmt_table_lock+0x2b(0, d4cda000)
  d4cd9930 dlmgmt_upcall_getattr+0x15(d4cd9dd8, d4cd9960, d4cd9d8c, 0,
8075730, 410)
  d4cd9db0 dlmgmt_handler+0x96(0, d4cd9dd8, 28, 0, 0, 8055196)
  00000000 libc.so.1`__door_return+0x3d()

So actually, all it wants is a read lock. We're also in the midst of a
fork, so let's verify they're not on the scene:

> 0x2a1::tid2ulwp | ::print ulwp_t ul_uberdata->atforklist[]
ul_uberdata->atforklist = {
    ul_uberdata->atforklist->forw = 0xfee10300
    ul_uberdata->atforklist->back = 0xfee10300
    ul_uberdata->atforklist->prepare = libc.so.1`libc_prepare_atfork
    ul_uberdata->atforklist->parent = libc.so.1`libc_parent_atfork
    ul_uberdata->atforklist->child = libc.so.1`libc_child_atfork
}
> 0x2a1::tid2ulwp | ::print -at ulwp_t ul_uberdata->atforklist[]
fee10080 atfork_t ul_uberdata->atforklist = {
    fee10080 struct atfork *ul_uberdata->atforklist->forw = 0xfee10300
    fee10084 struct atfork *ul_uberdata->atforklist->back = 0xfee10300
    fee10088 void (*)() ul_uberdata->atforklist->prepare = libc.so.1`libc_prepare_atfork
    fee1008c void (*)() ul_uberdata->atforklist->parent = libc.so.1`libc_parent_atfork
    fee10090 void (*)() ul_uberdata->atforklist->child = libc.so.1`libc_child_atfork
}
> 0xfee10300::print struct atfork
{
    forw = 0xfee10080
    back = 0xfee10080
    prepare = libnsl.so.1`_libnsl_prefork
    parent = libnsl.so.1`_libnsl_parent_atfork
    child = libnsl.so.1`_libnsl_child_atfork
}

Okay, so fork isn't a problem. So let's go and see who, if anyone, is
holding the lock and what its current state is:

> dlmgmt_avl_lock::print rwlock_t
{
    readers = 0x40000000
    type = 0
    magic = 0x5257
    mutex = {
        flags = {
            flag1 = 0
            flag2 = 0
            ceiling = 0
            mbcp_type_un = {
                bcptype = 0
                mtype_rcount = {
                    count_type1 = 0
                    count_type2 = 0
                }
            }
            magic = 0x4d58
        }
        lock = {
            lock64 = {
                pad = [ 0, 0, 0, 0, 0, 0, 0, 0 ]
            }
            lock32 = {
                ownerpid = 0
                lockword = 0
            }
            owner64 = 0
        }
        data = 0
    }
    readercv = {
        flags = {
            flag = [ 0, 0, 0, 0 ]
            type = 0
            magic = 0x4356
        }
        data = 0xf0ba3a40
    }
    writercv = {
        flags = {
            flag = [ 0, 0, 0, 0 ]
            type = 0
            magic = 0x4356
        }
        data = 0
    }
}

Okay, based on the readers value we know it's write locked. And from
here, we should be able to determine who the owner is. And... the
owner is the thread that's trying to fork!

> dlmgmt_avl_lock::print rwlock_t readercv.data
readercv.data = 0xf0ba3a40
> 0xf0ba3a40::print ulwp_t ul_lwpid
ul_lwpid = 0x148
> 0x148::findstack -v
stack pointer for thread 148: ea5cf5e0
[ ea5cf5e0 libc.so.1`___lwp_suspend+7() ]
  ea5cf630 libc.so.1`suspend_fork+0x97(f0ba3a40)
  ea5cf670 libc.so.1`forkx+0x121(0, 0, ea5cf760, 8057971)
  ea5cf690 libc.so.1`fork+0x1a(ea5d0558, 8059df1, 17, feec5474, 2336, ea5d0458)
  ea5cf760 dlmgmt_zfop+0xbf(ea5cfbe4, 2336, 8057c1c, ea5cf784, 806fb20, 0)
  ea5cf7b0 dlmgmt_zfopen+0xb8(ea5cfbe4, 8059fd3, 2336, ea5cffe4, ea5cfbe4, 400)
  ea5d0000 dlmgmt_process_db_onereq+0x54(806fb20, 0, 0, 0, 0, 0)
  ea5d0040 dlmgmt_process_db_req+0x4a(806fb20, 805a04d, 80593f7, 0, 0, 0)
  ea5d0480 dlmgmt_db_init+0xcc(2336, ea5d0928, 41, 400)
  ea5d0d30 dlmgmt_zone_init+0x12e(2336, ea5d0df8, ea5d0d50, 8055448, 806b5e0, ea5d0d80)
  ea5d0d50 dlmgmt_zoneboot+0x55(ea5d0df8, ea5d0d80, ea5d0dac, 0, 8073c60, 806b4d0)
  ea5d0dd0 dlmgmt_handler+0x96(0, ea5d0df8, 8, 0, 0, 8055196)
  00000000 libc.so.1`__door_return+0x3d()

Okay, so the dlmgmt_zoneboot() function grabs the table lock for
write. It then proceeds to try and fork and exec. So why can't that
thread make progress? It's trying to suspend a thread. Which thread is
it trying to suspend? Let's go back to our earlier kernel stacks.
There's one thread calling lwp_suspend. Let's see what tid it's trying
to call it on and see what that tid is in our userland process:

> ffffff4e433544c0::findstack -v
stack pointer for thread ffffff4e433544c0: ffffff01f8c3ada0
[ ffffff01f8c3ada0 _resume_from_idle+0x112() ]
  ffffff01f8c3add0 swtch+0x141()
  ffffff01f8c3ae40 cv_wait_sig+0x185(ffffff431ded80d6, ffffff42e28c6380)
  ffffff01f8c3ae80 lwp_suspend+0xa4(ffffff433448b880)
  ffffff01f8c3aeb0 syslwp_suspend+0x48(339)
  ffffff01f8c3af10 sys_syscall32+0x14a()

So we see that it's tid 0x339 and in userland, we see that it's not
having a good day:

> 0x339::findstack -v
stack pointer for thread 339: 0

So, dear kernel, who is our mystery thread?

> ::pgrep dlmgmtd | ::walk thread | ::printf "%p %x\n" kthread_t . t_tid ! grep 339
ffffff477c339880 182
ffffff433448b880 339
ffffff43339f8180 35c
ffffff43339ffc00 518
ffffff43339f63a0 595
ffffff43339f6000 615
ffffff47943394a0 64b
> ffffff433448b880::findstack -v
stack pointer for thread ffffff433448b880: ffffff02051d19e0
[ ffffff02051d19e0 _resume_from_idle+0x112() ]
  ffffff02051d1a10 swtch+0x141()
  ffffff02051d1ab0 turnstile_block+0x21a(0, 0, ffffffffc011b270, fffffffffbc08ce0, 0, 0)
  ffffff02051d1b20 rw_enter_sleep+0x19b(ffffffffc011b270, 0)
  ffffff02051d1b90 vnic_dev_delete+0x43(24490, 0, ffffffefb9d39a28)
  ffffff02051d1bd0 vnic_ioc_delete+0x28(ffffff447c0447d0, cb563d04, 100003, ffffffefb9d39a28, ffffff02051d1e58)
  ffffff02051d1c70 drv_ioctl+0x1e4(1200000000, 1710002, cb563d04, 100003, ffffffefb9d39a28, ffffff02051d1e58)
  ffffff02051d1cb0 cdev_ioctl+0x39(1200000000, 1710002, cb563d04, 100003, ffffffefb9d39a28, ffffff02051d1e58)
  ffffff02051d1d00 spec_ioctl+0x60(ffffff431dd93980, 1710002, cb563d04, 100003, ffffffefb9d39a28, ffffff02051d1e58, 0)
  ffffff02051d1d90 fop_ioctl+0x55(ffffff431dd93980, 1710002, cb563d04, 100003, ffffffefb9d39a28, ffffff02051d1e58, 0)
  ffffff02051d1eb0 ioctl+0x9b(0, 1710002, cb563d04)
  ffffff02051d1f10 sys_syscall32+0x14a()

Aha, and there we have it. There's our problem. So let's see if we can
figure out anything about our userland stack to confirm that this thread
has no locks or where it's coming from. Not surprisingly, it's a door
call:

> 0x339::tid2ulwp | ::print ulwp_t ul_startpc
ul_startpc = libc.so.1`door_create_func

So, let's go stack dumpster diving.

> ffffff02051d1eb0,30/nap
0xffffff02051d1eb0:             
0xffffff02051d1eb0:             0xffffff02051d1f10
0xffffff02051d1eb8:             sys_syscall32+0x14a
0xffffff02051d1ec0:             0x51d1f10       
0xffffff02051d1ec8:             0xfb800e6e      
0xffffff02051d1ed0:             0               
0xffffff02051d1ed8:             0x1710002       
0xffffff02051d1ee0:             0xcb563d04      
0xffffff02051d1ee8:             0xffffff433448b880
0xffffff02051d1ef0:             1               
0xffffff02051d1ef8:             0               
0xffffff02051d1f00:             0xffffff02051d1f10
0xffffff02051d1f08:             sys_syscall32+0xe3
0xffffff02051d1f10:             0               
0xffffff02051d1f18:             0               
0xffffff02051d1f20:             0x2316          
0xffffff02051d1f28:             0xcb563d04      
0xffffff02051d1f30:             0xfeefae1e      
0xffffff02051d1f38:             0xfed74980      
0xffffff02051d1f40:             0               
0xffffff02051d1f48:             0xffffff433448b880
0xffffff02051d1f50:             0x36            
0xffffff02051d1f58:             0x8071440       
0xffffff02051d1f60:             0xcb563d20      
0xffffff02051d1f68:             0               
0xffffff02051d1f70:             0               
0xffffff02051d1f78:             0               
0xffffff02051d1f80:             0               
0xffffff02051d1f88:             0xffffffa4f4144880
0xffffff02051d1f90:             0xffffff433448b880
0xffffff02051d1f98:             0               
0xffffff02051d1fa0:             0               
0xffffff02051d1fa8:             0x4b            
0xffffff02051d1fb0:             0x4b            
0xffffff02051d1fb8:             0               
0xffffff02051d1fc0:             0x1c3           
0xffffff02051d1fc8:             0xe             
0xffffff02051d1fd0:             sys_syscall_int+8
0xffffff02051d1fd8:             0xfef053c7      
0xffffff02051d1fe0:             0x43            
0xffffff02051d1fe8:             0x292           
0xffffff02051d1ff0:             0xcb563ce4      
0xffffff02051d1ff8:             0x4b            
mdb: failed to read data from target: no mapping for address
0xffffff02051d2000:             
mdb: failed to read data from target: no mapping for address
0xffffff02051d2000:             
mdb: failed to read data from target: no mapping for address
0xffffff02051d2000:             
mdb: failed to read data from target: no mapping for address
0xffffff02051d2000:             
mdb: failed to read data from target: no mapping for address
0xffffff02051d2000:             
mdb: failed to read data from target: no mapping for address
0xffffff02051d2000:            

If we try a few different things, we get something that looks like the
following:

> 0xcb563ce4$C            
8510c483 0xe81979c0()
> 0xcb563d20$C
cb563d50 dlmgmt_zonehalt+0xc1(cb563df8, cb563d80, cb563dac, 0, 80735f0, 8)
cb563dd0 dlmgmt_handler+0x96(0, cb563df8, 8, 0, 0, 8055196)
00000000 libc.so.1`__door_return+0x3d()
> 0xcb563d04
00024490 dlmgmtd`dlmgmt_avl_mutex()
> dlmgmt_zonehalt+0xc1::dis
dlmgmt_zonehalt+0x97:           pushl  $0x806b72c
<dlmgmtd`my_pid.8955>
dlmgmt_zonehalt+0x9c:           pushl  %eax
dlmgmt_zonehalt+0x9d:           call   -0x177c  <PLT=libc.so.1`write>
dlmgmt_zonehalt+0xa2:           movl   %ebx,(%esp)
dlmgmt_zonehalt+0xa5:           call   -0x1894  <PLT=libc.so.1`close>
dlmgmt_zonehalt+0xaa:           movl   $0x1,(%esp)
dlmgmt_zonehalt+0xb1:           call   +0x209d  <dlmgmt_table_lock>
dlmgmt_zonehalt+0xb6:           addl   $0x4,%esp
dlmgmt_zonehalt+0xb9:           pushl  0x4(%edi)
dlmgmt_zonehalt+0xbc:           call   +0x404f  <dlmgmt_db_fini>
dlmgmt_zonehalt+0xc1:           call   +0x2024  <dlmgmt_table_unlock>
dlmgmt_zonehalt+0xc6:           movl   $0x8059eaa,(%esp)
dlmgmt_zonehalt+0xcd:           call   -0x160c  <PLT=libc.so.1`unlink>
dlmgmt_zonehalt+0xd2:           addl   $0x10,%esp
dlmgmt_zonehalt+0xd5:           movl   0xc(%ebp),%eax
dlmgmt_zonehalt+0xd8:           movl   %esi,(%eax)
dlmgmt_zonehalt+0xda:           leal   -0xc(%ebp),%esp
dlmgmt_zonehalt+0xdd:           popl   %ebx
dlmgmt_zonehalt+0xde:           popl   %esi
dlmgmt_zonehalt+0xdf:           popl   %edi
dlmgmt_zonehalt+0xe0:           leave 

So this thread is trying to destroy the zone. It's not actively holding
the lock, which makes sense. We drop the lock around deletions due to
OS-3342. However, because the forking thread is trying to suspend us, we
get to an even deeper twisted web. Basically it looks like this:

Thread 0x339 is trying to delete a vnic. It holds no locks, but is
blocked in the kernel. Normally this is fine. This is what OS-3342 tries
to address in particular.

The door call thread we identified was 0x2a1, which is blocked trying to
get the table lock. Note it could also potentially be thread 0x149; in
this case, it doesn't really matter. Normally it wouldn't have trouble
eventually acquiring the lock. However, here that's impossible.

The table lock is held by our friend, thread 0x148, which is trying to
suspend every thread and fork. It's not clear how this could generally
work given that it holds a write lock across the fork. And that is
what's going on.

We have a thread in dlmgmtd that's trying to delete a vnic. It's blocked
on the rwlock:

> ffffff453c117520::findstack -v
stack pointer for thread ffffff453c117520: ffffff01ed5b59e0
[ ffffff01ed5b59e0 _resume_from_idle+0x112() ]
  ffffff01ed5b5a10 swtch+0x141()
  ffffff01ed5b5ab0 turnstile_block+0x21a(ffffff42e2732048, 0, ffffffffc011b550, fffffffffbc08ce0, 0, 0)
  ffffff01ed5b5b20 rw_enter_sleep+0x19b(ffffffffc011b550, 0)
  ffffff01ed5b5b90 vnic_dev_delete+0x43(2c764, 0, ffffffcd24704d50)
  ffffff01ed5b5bd0 vnic_ioc_delete+0x28(ffffff5016e3ec50, 98d7dd04, 100003, ffffffcd24704d50, ffffff01ed5b5e58)
  ffffff01ed5b5c70 drv_ioctl+0x1e4(1200000000, 1710002, 98d7dd04, 100003, ffffffcd24704d50, ffffff01ed5b5e58)
  ffffff01ed5b5cb0 cdev_ioctl+0x39(1200000000, 1710002, 98d7dd04, 100003, ffffffcd24704d50, ffffff01ed5b5e58)
  ffffff01ed5b5d00 spec_ioctl+0x60(ffffff431deb2440, 1710002, 98d7dd04, 100003, ffffffcd24704d50, ffffff01ed5b5e58, 0)
  ffffff01ed5b5d90 fop_ioctl+0x55(ffffff431deb2440, 1710002, 98d7dd04, 100003, ffffffcd24704d50, ffffff01ed5b5e58, 0)
  ffffff01ed5b5eb0 ioctl+0x9b(0, 1710002, 98d7dd04)
  ffffff01ed5b5f10 sys_syscall32+0x14a()
> ffffffffc011b550::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffffffc011b550 ffffffc2251173a0  B111 ffffff463aed4060 (W)
                                    ||| ffffff7bf9fbb420 (W)
                 WRITE_LOCKED ------+|| ffffff47c1dd1760 (W)
                 WRITE_WANTED -------+| ffffff453c117520 (W)
                  HAS_WAITERS --------+
> ffffffc2251173a0::findstack -v
stack pointer for thread ffffffc2251173a0: ffffff01f8fa3450
[ ffffff01f8fa3450 _resume_from_idle+0x112() ]
  ffffff01f8fa3490 swtch_to+0xb6(ffffff46bdf24400)
  ffffff01f8fa34e0 shuttle_resume+0x2af(ffffff46bdf24400, ffffffffc0014ed0)
  ffffff01f8fa3590 door_upcall+0x212(ffffff431deb2040, ffffff01f8fa3680, ffffff42e273ce18, ffffffffffffffff, 0)
  ffffff01f8fa3610 door_ki_upcall_limited+0x67(ffffff431ddaa7d8, ffffff01f8fa3680, ffffff42e273ce18, ffffffffffffffff, 0)
  ffffff01f8fa3660 stubs_common_code+0x59()
  ffffff01f8fa3700 i_dls_mgmt_upcall+0xbf(ffffff01f8fa3720, 8, ffffff01f8fa3730, 30)
  ffffff01f8fa37d0 dls_mgmt_get_linkinfo+0x7c(2c7bb, ffffff01f8fa3800, ffffff01f8fa37fc, 0, 0)
  ffffff01f8fa3880 dls_devnet_set+0xb0(ffffff44c05e6bd8, 2c7bb, 0, ffffff01f8fa38b0)
  ffffff01f8fa38f0 dls_devnet_create+0x4c(ffffff44c05e6bc0, 2c7bb, 0)
  ffffff01f8fa3af0 vnic_dev_create+0x6a0(2c7bb, 6, ffffff01f8fa3b54, ffffff01f8fa3b5c, ffffff01f8fa3b60, ffffff01f8fa3b58, ffffff420000000c, 
  ffffff420000054b, ffffff0100000000, ffffffff00000000, ffffff4618f59044, ffffff4200000000, ffffff01f8fa3b50, ffffffced7982400)
  ffffff01f8fa3bd0 vnic_ioc_create+0x10d(ffffff4618f59000, 80401b0, 100003, ffffffced7982400, ffffff01f8fa3e58)
  ffffff01f8fa3c70 drv_ioctl+0x1e4(1200000000, 1710001, 80401b0, 100003, ffffffced7982400, ffffff01f8fa3e58)
  ffffff01f8fa3cb0 cdev_ioctl+0x39(1200000000, 1710001, 80401b0, 100003, ffffffced7982400, ffffff01f8fa3e58)
  ffffff01f8fa3d00 spec_ioctl+0x60(ffffff431deb2440, 1710001, 80401b0, 100003, ffffffced7982400, ffffff01f8fa3e58, 0)
  ffffff01f8fa3d90 fop_ioctl+0x55(ffffff431deb2440, 1710001, 80401b0, 100003, ffffffced7982400, ffffff01f8fa3e58, 0)
  ffffff01f8fa3eb0 ioctl+0x9b(4, 1710001, 80401b0)
  ffffff01f8fa3f10 _sys_sysenter_post_swapgs+0x153()
> ffffff452f83a400::findstack -v
stack pointer for thread ffffff452f83a400: ffffff02070d1da0
[ ffffff02070d1da0 _resume_from_idle+0x112() ]
  ffffff02070d1dd0 swtch+0x141()
  ffffff02070d1e40 cv_wait_sig+0x185(ffffff431de960de, ffffff42e28c6380)
  ffffff02070d1e80 lwp_suspend+0xa4(ffffff453c117520)
  ffffff02070d1eb0 syslwp_suspend+0x48(661)
  ffffff02070d1f10 sys_syscall32+0x14a()
> ffffff453c117520::print kthread_t t_tid
t_tid = 0x661

So we're trying to suspend a thread that's blocked trying to grab a lock held
by a thread that's doing a door upcall. Classic OS-5363.

If we could avoid the fork, we could avoid this deadlock. In this particular case the fork is done to read the datalink.conf file in the zone:

[root@testsos ~]# dtrace -qn 'pid$target::zone_file_exists:entry { printf("zone_file_eixsts: %s%s\n", stringof(args[0]), stringof(args[1])); } pid$target::dlmgmt_zfopen:entry /arg2 != 0/ { printf("open %s with %s mode in zone %d\n", stringof(args[0]), stringof(args[1]), arg2); ustack(); self->t=1; } proc:::create /self->t/ { printf("dlmgmt forked\n"); } pid$target::dlmgmt_zfopen:return /self->t/ { self->t=0; }' -p 100027
zone_file_eixsts: /zones/e0b64261-4708-c1f3-971b-dd561161b3dc/root/native/etc/svc/volatile/dladm
zone_file_eixsts: /zones/e0b64261-4708-c1f3-971b-dd561161b3dc/root/etc/svc/volatile/dladm/network-datalink-management:default.cache
zone_file_eixsts: /zones/e0b64261-4708-c1f3-971b-dd561161b3dc/root/etc/dladm/datalink.conf
open /etc/dladm/datalink.conf with r mode in zone 18

              dlmgmtd`dlmgmt_zfopen
              dlmgmtd`dlmgmt_process_db_onereq+0x54
              dlmgmtd`dlmgmt_process_db_req+0x4a
              dlmgmtd`dlmgmt_db_init+0xcc
              dlmgmtd`dlmgmt_zone_init+0x12e
              dlmgmtd`dlmgmt_zoneboot+0x55
              dlmgmtd`dlmgmt_handler+0x96
              libc.so.1`__door_return+0x4b
dlmgmt forked

In our case this file is always empty since we aren't creating persistent links in a zone. Perhaps we could add a special case to the code to avoid this particular dlmgmt_zfopen(), similar to how we added the zone_file_exists() special case to avoid dlmgmt upcall/fork deadlocks:

/*
 * Attempt to mitigate one of the deadlocks in the dlmgmtd architecture.
 *
 * dlmgmt_db_init() calls dlmgmt_process_db_req() which eventually gets to
 * dlmgmt_zfop() which tries to fork, enter the zone and read the file.
 * Because of the upcall architecture of dlmgmtd this can lead to deadlock
 * with the following scenario:
 *    a) the thread preparing to fork will have acquired the malloc locks
 *       then attempt to suspend every thread in preparation to fork.
 *    b) all of the upcalls will be blocked in door_ucred() trying to malloc()
 *       and get the credentials of their caller.
 *    c) we can't suspend the in-kernel thread making the upcall.
 *
 * Thus, we cannot serve door requests because we're blocked in malloc()
 * which fork() owns, but fork() is in turn blocked on the in-kernel thread
 * making the door upcall.  This is a fundamental architectural problem with
 * any server handling upcalls and also trying to fork().
 *
 * To minimize the chance of this deadlock occuring, we check ahead of time to
 * see if the file we want to read actually exists in the zone (which it almost
 * never does), so we don't need fork in that case (i.e. rarely to never).
 */
static boolean_t
zone_file_exists(char *zoneroot, char *filename)
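
The quoted snippet ends at the signature. A plausible body, written here as a
standalone hedged sketch (int in place of boolean_t, a hypothetical name, and
not necessarily the exact illumos implementation), simply composes the path
under the zone root and tests whether the file exists, so that the fork and
zone-enter can be skipped when it does not:

#include <stdio.h>
#include <unistd.h>

/* Hedged sketch of the existence check described above. */
int
sketch_zone_file_exists(const char *zoneroot, const char *filename)
{
        char path[1024];

        (void) snprintf(path, sizeof (path), "%s%s", zoneroot, filename);
        return (access(path, F_OK) == 0);
}
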
Actions #9

Updated by Electric Monk about 1 year ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit 85dff7a05711e1238299281f8a94d2d40834c775

commit  85dff7a05711e1238299281f8a94d2d40834c775
Author: Andy Fiddaman <illumos@fiddaman.net>
Date:   2023-04-07T16:22:22.000Z

    15167 Panic when halting a zone with self-created links
    15407 zone_free asserts to its destruction when dlmgmtd has fallen
    15408 dlmgmtd/zoneadmd interdependencies prevent dlmgmtd from starting
    15409 dlmgmtd forking with write locks results in deadlock chaos
    Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Portions contributed by: Joshua M. Clulow <jmc@joyent.com>
    Portions contributed by: Robert Mustacchi <rm@joyent.com>
    Portions contributed by: Ryan Zezeski <rpz@joyent.com>
    Reviewed by: Robert Mustacchi <rm+illumos@fingolfin.org>
    Approved by: Patrick Mooney <pmooney@pfmooney.com>

Actions #10

Updated by Andy Fiddaman about 1 year ago

  • Related to Bug #15553: dlmgmtd asserts on processing NGZ persistent configuration file added