Project

General

Profile

Actions

Bug #5883

closed

race in netstack_unregister and zone creation

Added by Robert Mustacchi about 7 years ago. Updated about 7 years ago.

Status:
Closed
Priority:
Normal
Category:
kernel
Start date:
2015-04-30
Due date:
% Done:

100%

Estimated time:
Difficulty:
Hard
Tags:
Gerrit CR:

Description

Bill hit this again. I did a fresh analysis not looking at what we currently had.

Our stack looks like:

> $C
ffffff00b81e3b30 0xfffffffff7fa7186()
ffffff00b81e3b60 ip_ne_queue_func+0x53(ffffff19e3a5ad80)
ffffff00b81e3c20 taskq_thread+0x2d0(ffffff19929c4c18)
ffffff00b81e3c30 thread_start+8()

Now, did we really call off into no where land?

> ip_ne_queue_func+0x53::dis
...
ip_ne_queue_func+0x4e:          call   +0x4767d <hook_run>
ip_ne_queue_func+0x53:          movq   %r12,%rdi

No, We're missing a call to hook_run() on the stack, could be simply
tail call or just that wherever we ended up put is in crazy town.

The arg to ip_ne_queue_func is a hook_nic_event_t:

> ffffff19e3a5ad80::print hook_nic_event_int_t
{
    hnei_stackid = 0x2a
    hnei_event = {
        hne_protocol = 0xffffff19ac92a2c0
        hne_nic = 0x1
        hne_lif = 0x1
        hne_event = 1 (NE_PLUMB)
        hne_data = 0xffffff19cbe2f3e0
        hne_datalen = 0x4
    }
}

Because we know that stackid zoneid we know what zone we are in
123c3352-e142-49f9-ae77-b9b906df727c. The zone state is running, but, it
doesn't seem to be doing much, we're booting probably:

> ffffff19b49672c0::print zone_t zone_proc_initpid | ::pid2proc | ::ptree
fffffffffbc2eb80  sched
     ffffff19931bd010  init
          ffffff19d349c000  zsched
               ffffff19deb46008  init

This makes sense that we are early on in boot if we are getting a plumb
nic event. Thankfully nestack id zoneid, therefore we know our
netstack pointer:

> ffffff19ac4363c0::netstack
            ADDR STACKID  FLAGS
ffffff19ac4363c0     42    000000

> ffffff19ac4363c0::print netstack_t netstack_u.nu_s.nu_ip | ::print ip_stack_t ips_ipv6_net_data
ips_ipv6_net_data = 0xffffff19ac92a200

The hne_protocol does not equal the ipv6 net data which means this is
IPv4.

So the arguments to our missing hook_run call are:

hook_run (0xffffff19d4e4e100, 0xffffff19cb158640, ffffff19e3a5ad88)

If we peak around our stack then we can even find a bit of the hook_run,
presumably a return address!

> ffffff00b81e3a00,50/np
0xffffff00b81e3a00:             
...
                0xfffffffff7fa7186
                0x30            
                0x10246         
                0xffffff00b81e3ac8
                0x38            
                0xffffff00b81e3b30
                hook_run+0xaf   
                0xffffff19ac4363c0
                0xffffff19ac4364a8
                0xffffff19929c4c38
                0xffffff19e3a5ad80
                0xffffff19ac4363c0
                1               
                0xffffff19929c4c40
                0xffffff00b81e3bc0
                0xffffff19929c4c38
                0xffffff19e3a5ad88
                0xffffff19cb158640
                0xffffff19cace5c00
                0xffffff00b81e3b60
                ip_ne_queue_func+0x53
                0xffffff19929c4c18
                0xffffff19929c4c38
                0xffffff00b81e3b60
                0xffffff19e3a5ad80
                0xffffff00b81e3c20
                taskq_thread+0x2d0

And if we look at that in hook_run, that brings us back to our good old
call!

> hook_run+0xaf::dis
hook_run+0xac:                  call   *0x18(%rbx)
hook_run+0xaf:                  movq   %rbx,%rdx

So what did we think the hook should have been.

Let's walk through that tailq. The head is:

> 0xffffff19cb158640::hooklist
            ADDR     FLAG                 FUNC HINT
NAME 
               HINTVALUE
ffffff19d4714d80        0 fffffffff7fa7180    0          ipnet_nicevents 

Now, if we look at the function that we actually have there:

> ffffff19d4714d80::print hook_int_t
{
    hi_entry = {
        tqe_next = 0
        tqe_prev = 0xffffff19cb158660
    }
    hi_hook = {
        h_version = 0x1
        h_func = 0xfffffffff7fa7180
        h_name = 0xffffff19a773a948 "ipnet_nicevents" 
        h_flags = 0
        h_hint = 0 (HH_NONE)
        h_hintvalue = 0xfffffffff7b0ef39
        h_arg = 0xffffff19d41eb680
    }
> 0xfffffffff7fa7180::whatis
fffffffff7fa7180 is in smbsrv's text segment

And there we go, that's definitely a blow up.

Now, let's look at the function that generates this:

> ipnet_register_netihook::dis
ipnet_register_netihook:        pushq  %rbp
ipnet_register_netihook+1:      movq   %rsp,%rbp
ipnet_register_netihook+4:      subq   $0x40,%rsp
ipnet_register_netihook+8:      movq   %rbx,-0x30(%rbp)
ipnet_register_netihook+0xc:    movq   %rdi,-0x8(%rbp)
ipnet_register_netihook+0x10:   movq   %rdi,%rbx
ipnet_register_netihook+0x13:   movl   $0x1,%edi
ipnet_register_netihook+0x18:   movq   %r12,-0x28(%rbp)
ipnet_register_netihook+0x1c:   movq   %r13,-0x20(%rbp)
ipnet_register_netihook+0x20:   movq   %r14,-0x18(%rbp)
ipnet_register_netihook+0x24:   call   -0x482299        <hook_alloc>
ipnet_register_netihook+0x29:   movq   %rax,0x18(%rbx)
ipnet_register_netihook+0x2d:   movq   $0xfffffffff7f93180,0x8(%rax)
ipnet_register_netihook+0x35:   movq   0x18(%rbx),%rax
ipnet_register_netihook+0x39:   movq   $0xfffffffff7f94c2d,0x10(%rax)

> 0xfffffffff7f93180::whatis
fffffffff7f93180 is ipnet_nicevent_cb, in ipnet's text segment

So, we do have it there. Those values get put in place by krtld. It is
relocatable ode, so it should be pretty straightforward for it to do
that.

The only way that it seems to make sense for an address and a module to
move around is after we've called _fini().

Now if we look at the flags for this module and its state:

> ffffff19ac4363c0::print netstack_t netstack_m_state[0t17]
netstack_m_state[0t17] = {
    netstack_m_state[0t17].nms_flags = 0x4
    netstack_m_state[0t17].nms_cv = {
        _opaque = 0
    }
}

So that just means that it has been created. Furthermore, If we look
over the list of all the other modules, we find that we are the only
ones who have this old value! We can see this with the following
expression:

> ::walk netstack | ::print -t netstack_t netstack_u.nu_s.nu_ip | ::print ip_stack_t ips_ipv4nicevents | ::hooklist
...
            ADDR     FLAG                 FUNC HINT                     NAME 
               HINTVALUE
ffffff198d7cf000        0    ipnet_nicevent_cb    0          ipnet_nicevents 

            ADDR     FLAG                 FUNC HINT                     NAME 
               HINTVALUE
ffffff198d7c8940        0    ipnet_nicevent_cb    0          ipnet_nicevents 

            ADDR     FLAG                 FUNC HINT                     NAME 
               HINTVALUE
ffffff19d4714d80        0 fffffffff7fa7180    0          ipnet_nicevents 

            ADDR     FLAG                 FUNC HINT                     NAME 
               HINTVALUE
ffffff19d56bb7c0        0    ipnet_nicevent_cb    0          ipnet_nicevents 

            ADDR     FLAG                 FUNC HINT                     NAME 
               HINTVALUE
ffffff19b3d81040        0    ipnet_nicevent_cb    0          ipnet_nicevents 

            ADDR     FLAG                 FUNC HINT                     NAME 
               HINTVALUE
ffffff19d23887c0        0    ipnet_nicevent_cb    0          ipnet_nicevents 
...

From inspecting how the module unload process works, we know that the
module text segment will on move once _fini has been called. In that
case, we know that we will appropriately clean up all currently active
modules.

However, this raises its own question. What happens if a zone is created
as the module is tearing down its own netstack hooks?

Let's look at each of these cases in detail, especially with respect to
locking.

When we unregister a netstack module, this is the ordering:

-grab netstack global lock
-Walk each netstack:
-Grab its lock
-If its state flag is CREATE_COMPLETED mark for both shutdown
and destroy
-Release its lock
-Mark module as dying
-Release the global lock
-apply to all netstacks shutdown
-apply to all netstacks destroy

When a zone with an exclusive stack is created we do the following:

Zone Create Phase 1
-Grab the global lock
-Create our netstack_t
-Add ourselves to the global list
-Grab our netstack lock
-Walk each of the modules that are registered
-For each one that is registered and not dying, flag it as a create
needed iff no other create flags exist
-Release our stack lock
-Release the global lock
Zone Create Phase 2
Now we call into apply_all_modules and call netstack_apply_create
which grabs the global lock
-Grab our local lock
-If already in progres in another thread, wait for it to finish
-Set flag to indicate that we are in progress of creating
-drop netstack lock
-drop global lock
-Call create function
-grab global lock
-grab netstack lock
-set flag to CREATE_COMPLETED
-release stack lock
-release global lock

So here is how the race would occur:

1. We finish Phase 1 of zone_create. This means that we walk all the
modules that currently exist and are not dying and flag them as needing
a create.

2. We next run the netstack_unregister and flag the module as dying. We also
walk through every netstack and check to see if that module is noted as
having successfully completed its creation. If that's the case, we flag
it for either shutdown and/or destruction. Note that this is the only
place that we flag it. It doesn't matter if when we apply the shutdown
or destroy that the create is in progress, because while we wait for it
kindly to finish, it doesn't matter, nothing is going to go back and
check for the in-progress pieces.

Say that we continued to run, but a zone was in the progress of being
created, we would not apply the shutdown or destroy to it because we
would wait until the netstack had finished being created.

3. We go ahead and look at all the modules that need to be created which
is Phase 2. All we do is check whether or not we need to create
something or not, that's it, so we go ahead and create it. We finish by
signalling to anyone who might have been waiting for us to finish, e.g.
someone stuck doing the second part of the netstack_unregister.

4. We now go ahead and loop through the shutdown and destroy. Because we
never had flagged this netstack's rendition of the module as needing
eiher, we end up jsut keeping it around until the module next unloads
itself. If it ends up firing again during that time, then we're in bad
shape... d'oh.

Actions #1

Updated by Electric Monk about 7 years ago

  • Status changed from New to Closed

git commit 589efa9501f3347f21e60905a96ca39427169e10

commit  589efa9501f3347f21e60905a96ca39427169e10
Author: Robert Mustacchi <rm@joyent.com>
Date:   2015-05-16T06:06:54.000Z

    5883 race in netstack_unregister and zone creation
    5884 want zid2zone
    5885 want a dcmd for going from netstack id to nestack_t
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Bryan Cantrill <bryan@joyent.com>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Approved by: Gordon Ross <gwr@nexenta.com>

Actions

Also available in: Atom PDF