Bug #5883
closedrace in netstack_unregister and zone creation
100%
Description
Bill hit this again. I did a fresh analysis not looking at what we currently had.
Our stack looks like:
> $C ffffff00b81e3b30 0xfffffffff7fa7186() ffffff00b81e3b60 ip_ne_queue_func+0x53(ffffff19e3a5ad80) ffffff00b81e3c20 taskq_thread+0x2d0(ffffff19929c4c18) ffffff00b81e3c30 thread_start+8()
Now, did we really call off into no where land?
> ip_ne_queue_func+0x53::dis ... ip_ne_queue_func+0x4e: call +0x4767d <hook_run> ip_ne_queue_func+0x53: movq %r12,%rdi
No, We're missing a call to hook_run() on the stack, could be simply
tail call or just that wherever we ended up put is in crazy town.
The arg to ip_ne_queue_func is a hook_nic_event_t:
> ffffff19e3a5ad80::print hook_nic_event_int_t { hnei_stackid = 0x2a hnei_event = { hne_protocol = 0xffffff19ac92a2c0 hne_nic = 0x1 hne_lif = 0x1 hne_event = 1 (NE_PLUMB) hne_data = 0xffffff19cbe2f3e0 hne_datalen = 0x4 } }
Because we know that stackid zoneid we know what zone we are in
123c3352-e142-49f9-ae77-b9b906df727c. The zone state is running, but, it
doesn't seem to be doing much, we're booting probably:
> ffffff19b49672c0::print zone_t zone_proc_initpid | ::pid2proc | ::ptree fffffffffbc2eb80 sched ffffff19931bd010 init ffffff19d349c000 zsched ffffff19deb46008 init
This makes sense that we are early on in boot if we are getting a plumb
nic event. Thankfully nestack id zoneid, therefore we know our
netstack pointer:
> ffffff19ac4363c0::netstack ADDR STACKID FLAGS ffffff19ac4363c0 42 000000 > ffffff19ac4363c0::print netstack_t netstack_u.nu_s.nu_ip | ::print ip_stack_t ips_ipv6_net_data ips_ipv6_net_data = 0xffffff19ac92a200
The hne_protocol does not equal the ipv6 net data which means this is
IPv4.
So the arguments to our missing hook_run call are:
hook_run (0xffffff19d4e4e100, 0xffffff19cb158640, ffffff19e3a5ad88)
If we peak around our stack then we can even find a bit of the hook_run,
presumably a return address!
> ffffff00b81e3a00,50/np 0xffffff00b81e3a00: ... 0xfffffffff7fa7186 0x30 0x10246 0xffffff00b81e3ac8 0x38 0xffffff00b81e3b30 hook_run+0xaf 0xffffff19ac4363c0 0xffffff19ac4364a8 0xffffff19929c4c38 0xffffff19e3a5ad80 0xffffff19ac4363c0 1 0xffffff19929c4c40 0xffffff00b81e3bc0 0xffffff19929c4c38 0xffffff19e3a5ad88 0xffffff19cb158640 0xffffff19cace5c00 0xffffff00b81e3b60 ip_ne_queue_func+0x53 0xffffff19929c4c18 0xffffff19929c4c38 0xffffff00b81e3b60 0xffffff19e3a5ad80 0xffffff00b81e3c20 taskq_thread+0x2d0
And if we look at that in hook_run, that brings us back to our good old
call!
> hook_run+0xaf::dis hook_run+0xac: call *0x18(%rbx) hook_run+0xaf: movq %rbx,%rdx
So what did we think the hook should have been.
Let's walk through that tailq. The head is:
> 0xffffff19cb158640::hooklist ADDR FLAG FUNC HINT NAME HINTVALUE ffffff19d4714d80 0 fffffffff7fa7180 0 ipnet_nicevents
Now, if we look at the function that we actually have there:
> ffffff19d4714d80::print hook_int_t { hi_entry = { tqe_next = 0 tqe_prev = 0xffffff19cb158660 } hi_hook = { h_version = 0x1 h_func = 0xfffffffff7fa7180 h_name = 0xffffff19a773a948 "ipnet_nicevents" h_flags = 0 h_hint = 0 (HH_NONE) h_hintvalue = 0xfffffffff7b0ef39 h_arg = 0xffffff19d41eb680 } > 0xfffffffff7fa7180::whatis fffffffff7fa7180 is in smbsrv's text segment
And there we go, that's definitely a blow up.
Now, let's look at the function that generates this:
> ipnet_register_netihook::dis ipnet_register_netihook: pushq %rbp ipnet_register_netihook+1: movq %rsp,%rbp ipnet_register_netihook+4: subq $0x40,%rsp ipnet_register_netihook+8: movq %rbx,-0x30(%rbp) ipnet_register_netihook+0xc: movq %rdi,-0x8(%rbp) ipnet_register_netihook+0x10: movq %rdi,%rbx ipnet_register_netihook+0x13: movl $0x1,%edi ipnet_register_netihook+0x18: movq %r12,-0x28(%rbp) ipnet_register_netihook+0x1c: movq %r13,-0x20(%rbp) ipnet_register_netihook+0x20: movq %r14,-0x18(%rbp) ipnet_register_netihook+0x24: call -0x482299 <hook_alloc> ipnet_register_netihook+0x29: movq %rax,0x18(%rbx) ipnet_register_netihook+0x2d: movq $0xfffffffff7f93180,0x8(%rax) ipnet_register_netihook+0x35: movq 0x18(%rbx),%rax ipnet_register_netihook+0x39: movq $0xfffffffff7f94c2d,0x10(%rax) > 0xfffffffff7f93180::whatis fffffffff7f93180 is ipnet_nicevent_cb, in ipnet's text segment
So, we do have it there. Those values get put in place by krtld. It is
relocatable ode, so it should be pretty straightforward for it to do
that.
The only way that it seems to make sense for an address and a module to
move around is after we've called _fini().
Now if we look at the flags for this module and its state:
> ffffff19ac4363c0::print netstack_t netstack_m_state[0t17] netstack_m_state[0t17] = { netstack_m_state[0t17].nms_flags = 0x4 netstack_m_state[0t17].nms_cv = { _opaque = 0 } }
So that just means that it has been created. Furthermore, If we look
over the list of all the other modules, we find that we are the only
ones who have this old value! We can see this with the following
expression:
> ::walk netstack | ::print -t netstack_t netstack_u.nu_s.nu_ip | ::print ip_stack_t ips_ipv4nicevents | ::hooklist ... ADDR FLAG FUNC HINT NAME HINTVALUE ffffff198d7cf000 0 ipnet_nicevent_cb 0 ipnet_nicevents ADDR FLAG FUNC HINT NAME HINTVALUE ffffff198d7c8940 0 ipnet_nicevent_cb 0 ipnet_nicevents ADDR FLAG FUNC HINT NAME HINTVALUE ffffff19d4714d80 0 fffffffff7fa7180 0 ipnet_nicevents ADDR FLAG FUNC HINT NAME HINTVALUE ffffff19d56bb7c0 0 ipnet_nicevent_cb 0 ipnet_nicevents ADDR FLAG FUNC HINT NAME HINTVALUE ffffff19b3d81040 0 ipnet_nicevent_cb 0 ipnet_nicevents ADDR FLAG FUNC HINT NAME HINTVALUE ffffff19d23887c0 0 ipnet_nicevent_cb 0 ipnet_nicevents ...
From inspecting how the module unload process works, we know that the
module text segment will on move once _fini has been called. In that
case, we know that we will appropriately clean up all currently active
modules.
However, this raises its own question. What happens if a zone is created
as the module is tearing down its own netstack hooks?
Let's look at each of these cases in detail, especially with respect to
locking.
When we unregister a netstack module, this is the ordering:
-grab netstack global lock
-Walk each netstack:
-Grab its lock
-If its state flag is CREATE_COMPLETED mark for both shutdown
and destroy
-Release its lock
-Mark module as dying
-Release the global lock
-apply to all netstacks shutdown
-apply to all netstacks destroy
When a zone with an exclusive stack is created we do the following:
Zone Create Phase 1
-Grab the global lock
-Create our netstack_t
-Add ourselves to the global list
-Grab our netstack lock
-Walk each of the modules that are registered
-For each one that is registered and not dying, flag it as a create
needed iff no other create flags exist
-Release our stack lock
-Release the global lock
Zone Create Phase 2
Now we call into apply_all_modules and call netstack_apply_create
which grabs the global lock
-Grab our local lock
-If already in progres in another thread, wait for it to finish
-Set flag to indicate that we are in progress of creating
-drop netstack lock
-drop global lock
-Call create function
-grab global lock
-grab netstack lock
-set flag to CREATE_COMPLETED
-release stack lock
-release global lock
So here is how the race would occur:
1. We finish Phase 1 of zone_create. This means that we walk all the
modules that currently exist and are not dying and flag them as needing
a create.
2. We next run the netstack_unregister and flag the module as dying. We also
walk through every netstack and check to see if that module is noted as
having successfully completed its creation. If that's the case, we flag
it for either shutdown and/or destruction. Note that this is the only
place that we flag it. It doesn't matter if when we apply the shutdown
or destroy that the create is in progress, because while we wait for it
kindly to finish, it doesn't matter, nothing is going to go back and
check for the in-progress pieces.
Say that we continued to run, but a zone was in the progress of being
created, we would not apply the shutdown or destroy to it because we
would wait until the netstack had finished being created.
3. We go ahead and look at all the modules that need to be created which
is Phase 2. All we do is check whether or not we need to create
something or not, that's it, so we go ahead and create it. We finish by
signalling to anyone who might have been waiting for us to finish, e.g.
someone stuck doing the second part of the netstack_unregister.
4. We now go ahead and loop through the shutdown and destroy. Because we
never had flagged this netstack's rendition of the module as needing
eiher, we end up jsut keeping it around until the module next unloads
itself. If it ends up firing again during that time, then we're in bad
shape... d'oh.
Updated by Electric Monk about 7 years ago
- Status changed from New to Closed
git commit 589efa9501f3347f21e60905a96ca39427169e10
commit 589efa9501f3347f21e60905a96ca39427169e10 Author: Robert Mustacchi <rm@joyent.com> Date: 2015-05-16T06:06:54.000Z 5883 race in netstack_unregister and zone creation 5884 want zid2zone 5885 want a dcmd for going from netstack id to nestack_t Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Bryan Cantrill <bryan@joyent.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Gordon Ross <gwr@nexenta.com>