Project

General

Profile

Actions

Bug #14619

closed

Race between udp_activate() and conn_get_socket_info()

Added by Andy Fiddaman 4 months ago. Updated 4 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

An OmniOS user experienced the following kernel panic while testing the upcoming r151042 release.

> ::status
debugging crash dump vmcore.1 (64-bit) from corona
operating system: 5.11 omnios-r151042-5b105d0fdc (i86pc)
build version: gfx-drm - heads/master-0-g77f745e
illumos-kvm - heads/master-0-g93cd164
heads/master-0-g5b105d0fdc

image uuid: 8888bdcf-2a57-6ce3-c171-e2776cfad812
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffcc273846520 addr=58 occurred in module "ip" due to a NULL pointer dereference
dump content: kernel pages only
> $C
fffffcc2738466d0 conn_get_socket_info+0x4f(fffffe2391b985c0, fffffcc273846780)
fffffcc2738468c0 udp_snmp_get+0x407(fffffe28e3ef7108, fffffe24fdbcf100, 0)
fffffcc273846920 ip_snmp_get+0xd3(fffffe28e3ef7108, fffffe2386724cc0, 404, 0)
fffffcc2738469a0 snmpcom_req+0x21a(fffffe28e3ef7108, fffffe36d57adc60, fffffffff7b86470, fffffffff7b832f0, fffffe24dffd3338)
fffffcc2738469f0 ip_wput_nondata+0x26f(fffffe28e3ef7108, fffffe36d57adc60)
fffffcc273846a60 putnext+0x2d5(fffffe24dee263a0, fffffe36d57adc60)
fffffcc273846c50 strput+0x35f(fffffe24682f34d0, fffffe36d57adc60, fffffcc273846cb8, fffffcc273846ca0, 0, 0, 4)
fffffcc273846d50 strputmsg+0x2ee(fffffe23c5cc6340, fffffcc273846dc8, fffffcc273846dd8, 0, c, 3)
fffffcc273846e50 msgio32+0x1b8(8, 80472b4, 0, fffffcc273846e84, 2, fffffcc273846e7f, fffffcc273846e80)
fffffcc273846eb0 putmsg32+0x64(8, 80472b4, 0, 8)
fffffcc273846f10 _sys_sysenter_post_swapgs+0x159()

This crash corresponds to this code segment:

        if (connp->conn_upper_handle != NULL) {
                vn = (*connp->conn_upcalls->su_get_vnode)
                    (connp->conn_upper_handle);
        } 

which implies that conn_upper_handle is non-NULL while conn_upcalls is NULL, let's check:

> fffffe2391b985c0::print conn_t
{
    conn_ref = 0x2
    conn_flags = 0x1008
    conn_proto = 0x11
    conn_recv = udp_input
    conn_recvicmp = udp_icmp_input
...
    conn_upcalls = 0
    conn_upper_handle = 0xfffffe24bdce70a0
...
}

This is similar to the panic that was fixed in #12976, and that fix aimed to always reset these two fields together, while protected by conn_lock.

Let's see what else was happening in the ip module at the time:

> ::stacks -m ip
THREAD           STATE    SOBJ                COUNT
fffffe23b2d9f440 ONPROC   <NONE>                  1
                 udp_activate+0x49
                 socket_init_common+0x215
                 so_init+0x25
                 socket_create+0xe6
                 so_socket+0x1cd
                 _sys_sysenter_post_swapgs+0x159

fffffe252dadf7e0 PANIC    <NONE>                  1
                 0xfffffffffb8a9461
                 die+0xc0
                 trap+0xa49
                 cmntrap_pushed+0x3c
                 conn_get_socket_info+0x4f
                 udp_snmp_get+0x407
                 ip_snmp_get+0xd3
                 snmpcom_req+0x21a
                 ip_wput_nondata+0x26f
                 putnext+0x2d5
                 strput+0x35f
                 strputmsg+0x2ee
                 msgio32+0x1b8
                 putmsg32+0x64
                 _sys_sysenter_post_swapgs+0x159

I suspect we have a winner, if it's working on the same connection:

> fffffe23b2d9f440::findstack -v
stack pointer for thread fffffe23b2d9f440 (mdnsd/1): fffffcc26de3ab10
[ fffffcc26de3ab10 _resume_from_idle+0x12b() ]
  fffffcc26de3aca0 udp_activate+0x49(fffffe2391b985c0, fffffe24bdce70a0, fffffffffbdb8f40, 0, fffffe233e058178)
  fffffcc26de3ad10 socket_init_common+0x215(fffffe24bdce70a0, 0, 0, fffffe233e058178)
  fffffcc26de3ad40 so_init+0x25(fffffe24bdce70a0, 0, fffffe233e058178, 0)
  fffffcc26de3adf0 socket_create+0xe6(2, 1, 0, 0, 0, 0, fffffcc200000001, fffffe233e058178, fffffcc26de3ae3c)

> fffffe252dadf7e0::findstack -v
stack pointer for thread fffffe252dadf7e0 (in.mpathd/1): fffffcc273846320
  fffffcc2738466d0 conn_get_socket_info+0x4f(fffffe2391b985c0, fffffcc273846780)
  fffffcc2738468c0 udp_snmp_get+0x407(fffffe28e3ef7108, fffffe24fdbcf100, 0)
  fffffcc273846920 ip_snmp_get+0xd3(fffffe28e3ef7108, fffffe2386724cc0, 404, 0)
  fffffcc2738469a0 snmpcom_req+0x21a(fffffe28e3ef7108, fffffe36d57adc60, fffffffff7b86470, fffffffff7b832f0, fffffe24dffd3338)

and indeed, they're both using fffffe2391b985c0

void
udp_activate(sock_lower_handle_t proto_handle, sock_upper_handle_t sock_handle,
    sock_upcalls_t *sock_upcalls, int flags, cred_t *cr)
{
        conn_t          *connp = (conn_t *)proto_handle;
        struct sock_proto_props sopp;

        /* All Solaris components should pass a cred for this operation. */
        ASSERT(cr != NULL);

        connp->conn_upcalls = sock_upcalls;
        connp->conn_upper_handle = sock_handle;
...

The thread was at udp_activate+0x49:

udp_activate+0x42:              movq   %rsi,0x1b0(%rdi)             <-- sock_handle
udp_activate+0x49:              leaq   -0x98(%rbp),%rsi        <-- here when the panic occurred
udp_activate+0x50:              movq   %rdx,0x1a8(%rdi)             <-- sock_upcalls

It looks like the compiler has re-ordered these* - %rsi is the sock_handle and %rdx is the sock_upcalls
and so the other thread saw the handle as non-NULL while the upcalls were still NULL. That's a tight race!

It's likely this pair of assignments should be wrapped in a lock on connp->conn_lock.

  • - full disclosure, the upcoming r151042 release candidate is built with a gcc10 primary, but looking at the
    assembly from a gcc7 primary, the re-ordering still occurs, but the race window is a bit bigger:
udp_activate+0x15:              movq   %rsi,0x1b0(%rdi)             <-- sock_handle
udp_activate+0x1c:              movl   $0xffcf,%esi
udp_activate+0x21:              movq   +0x3e716c8(%rip),%rax    <__stack_chk_guard>
udp_activate+0x28:              movq   %rax,-0x38(%rbp)
udp_activate+0x2c:              xorl   %eax,%eax
udp_activate+0x2e:              movl   0x1d8(%rdi),%eax
udp_activate+0x34:              cmpw   $0x2,0x1ca(%rdi)
udp_activate+0x3c:              movq   %rdx,0x1a8(%rdi)             <-- sock_upcalls

Related issues

Related to illumos gate - Bug #12976: system panics with error in IP moduleClosed

Actions
Actions #1

Updated by Andy Fiddaman 4 months ago

  • Related to Bug #12976: system panics with error in IP module added
Actions #2

Updated by Electric Monk 4 months ago

  • Gerrit CR set to 2099
Actions #3

Updated by Andy Fiddaman 4 months ago

  • Gerrit CR changed from 2099 to 2101
Actions #4

Updated by Dan McDonald 4 months ago

Andy Fiddaman wrote:

SNIP!

It's likely this pair of assignments should be wrapped in a lock on connp->conn_lock.

  • - full disclosure, the upcoming r151042 release candidate is built with a gcc10 primary, but looking at the
    assembly from a gcc7 primary, the re-ordering still occurs, but the race window is a bit bigger:

An alternative approach is to make sure conn_get_socket_info() has two legitimate pointers (upcall table AND argument/handle). If either one is NULL, "drop the event/packet" because you can treat the case like the socket isn't fully there yet.

This approach works because a new conn_t gets allocated with those set to NULL (which ip_conn_constructor() does with a bzero() call up front).

Actions #5

Updated by Dan McDonald 4 months ago

  • Assignee set to Dan McDonald
Actions #6

Updated by Dan McDonald 4 months ago

Testing from the bug reporter(s) indicates that the fix does not induce the panic (though the race is really tight).

Actions #7

Updated by Electric Monk 4 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 041297c2d66302c15134da1d1bdd91cf787a945a

commit  041297c2d66302c15134da1d1bdd91cf787a945a
Author: Dan McDonald <danmcd@joyent.com>
Date:   2022-04-06T21:07:03.000Z

    14619 Race between udp_activate() and conn_get_socket_info()
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Reviewed by: Marco van Wieringen <marco.van.wieringen@planets.elm.net>
    Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
    Approved by: Robert Mustacchi <rm@fingolfin.org>

Actions

Also available in: Atom PDF