Project

General

Profile

Bug #3933

contract adoption can race

Added by Robert Mustacchi over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
kernel
Start date:
2013-07-28
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

We saw several panics of the following form:

list_next+0x14(ffffff19d17edd48, ffffffffffffffa8)
cte_trim+0x4d(ffffff19d17edd28, ffffff19e152b540)
contract_abandon+0x1bb(ffffff19e152b540, ffffff0f372b8008, 0)
contract_exit+0x49(ffffff0f372b8008)
proc_exit+0x4dd(2, 9)
exit+0x15(2, 9)
psig+0x582()
post_syscall+0x49e(4, 0)
syscall_exit+0x59(ffffff151e06f120, 4, 0)
0xfffffffffb800ea9()

In all three cases, it appears that the ctq_events list for a non-global zone's svc.startd has been corrupted with an event that has a zeroed list_t. That is:

> ffffff19d17edd28::print ct_equeue_t ctq_events | ::walk list
ffffff0f66284658
ffffff184e302640
ffffff10fe0ebd38
ffffff0f72d8bb50
ffffff0f03decaa8
ffffff0fea635518
mdb: failed to read list element at 0xffffffffffffffa8: no mapping for address

And the corrupt ct_kevent_t:

{
    cte_lock = {
        _opaque = [ 0 ]
    }
    cte_id = 0x3d92874
    cte_type = 0x1
    cte_refs = 0x3
    cte_nodes = [
        {
            ctm_node = {
                list_next = 0xffffff196ac6a720
                list_prev = 0xffffff196ac6a720
            }
            ctm_refs = 0
            ctm_trimmed = 0
            ctm_nreliable = 0
        },
        {
            ctm_node = {
                list_next = 0xffffff150d3e09a8
                list_prev = 0xffffff0fdee71908
            }
            ctm_refs = 0
            ctm_trimmed = 0
            ctm_nreliable = 0
        },
        {
            ctm_node = {
                list_next = 0
                list_prev = 0
            }
            ctm_refs = 0
            ctm_trimmed = 0
            ctm_nreliable = 0
        },
    ]
    cte_flags = 0
    cte_data = 0xffffff19c26d0260
    cte_gdata = 0
    cte_contract = 0xffffff196ac6a600
}

Assuming it's due to logical error and not memory corruption (which seems a very reasonable assumption given that we saw this three times with essentially identical symptoms), the list code doesn't allow this state to be possible in too many different ways. One pathology that can lead to the observed state is if an element is placed on the list twice, in which case the removal of the element would cause its previous and next pointers to be NULL'd out, but it would remain linked in on the list (at a different point). However, looking at the code that manipulates the ctq_events list, it's not immediately clear how these conditions might be possible: cte_publish() adds an event to the queue, but always does it with a freshly allocated event; cte_copy() copies an event to the queue, but that should only be due to contract adoption.

Based on additional correspondence:
It appears adoption can race with an incoming event being processed by cte_publish_all(). The event is added first to the contract queue and (possibly) the type queue; contract_adopt() steps in, establishes a contract owner, and adds the event just published to the contract queue to the new owner's process queue; and finally ct_publish_all() continues, adding the event to the process queue a second time.

There are many ways to close this race, but it seems the simplest is to simply perform a check in cte_publish that the event is not on the specified queue.

#1

Updated by Robert Mustacchi over 7 years ago

  • Status changed from New to Resolved
  • % Done changed from 90 to 100

Resolved in a81df0a5d715363cc1841810a87818dfa95675c0.

Also available in: Atom PDF