Bug #14619
closedRace between udp_activate() and conn_get_socket_info()
100%
Description
An OmniOS user experienced the following kernel panic while testing the upcoming r151042 release.
> ::status debugging crash dump vmcore.1 (64-bit) from corona operating system: 5.11 omnios-r151042-5b105d0fdc (i86pc) build version: gfx-drm - heads/master-0-g77f745e illumos-kvm - heads/master-0-g93cd164 heads/master-0-g5b105d0fdc image uuid: 8888bdcf-2a57-6ce3-c171-e2776cfad812 panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffcc273846520 addr=58 occurred in module "ip" due to a NULL pointer dereference dump content: kernel pages only > $C fffffcc2738466d0 conn_get_socket_info+0x4f(fffffe2391b985c0, fffffcc273846780) fffffcc2738468c0 udp_snmp_get+0x407(fffffe28e3ef7108, fffffe24fdbcf100, 0) fffffcc273846920 ip_snmp_get+0xd3(fffffe28e3ef7108, fffffe2386724cc0, 404, 0) fffffcc2738469a0 snmpcom_req+0x21a(fffffe28e3ef7108, fffffe36d57adc60, fffffffff7b86470, fffffffff7b832f0, fffffe24dffd3338) fffffcc2738469f0 ip_wput_nondata+0x26f(fffffe28e3ef7108, fffffe36d57adc60) fffffcc273846a60 putnext+0x2d5(fffffe24dee263a0, fffffe36d57adc60) fffffcc273846c50 strput+0x35f(fffffe24682f34d0, fffffe36d57adc60, fffffcc273846cb8, fffffcc273846ca0, 0, 0, 4) fffffcc273846d50 strputmsg+0x2ee(fffffe23c5cc6340, fffffcc273846dc8, fffffcc273846dd8, 0, c, 3) fffffcc273846e50 msgio32+0x1b8(8, 80472b4, 0, fffffcc273846e84, 2, fffffcc273846e7f, fffffcc273846e80) fffffcc273846eb0 putmsg32+0x64(8, 80472b4, 0, 8) fffffcc273846f10 _sys_sysenter_post_swapgs+0x159()
This crash corresponds to this code segment:
if (connp->conn_upper_handle != NULL) {
vn = (*connp->conn_upcalls->su_get_vnode)
(connp->conn_upper_handle);
}
which implies that conn_upper_handle
is non-NULL while conn_upcalls
is NULL, let's check:
> fffffe2391b985c0::print conn_t { conn_ref = 0x2 conn_flags = 0x1008 conn_proto = 0x11 conn_recv = udp_input conn_recvicmp = udp_icmp_input ... conn_upcalls = 0 conn_upper_handle = 0xfffffe24bdce70a0 ... }
This is similar to the panic that was fixed in #12976, and that fix aimed to always reset these two fields together, while protected by conn_lock.
Let's see what else was happening in the ip module at the time:
> ::stacks -m ip THREAD STATE SOBJ COUNT fffffe23b2d9f440 ONPROC <NONE> 1 udp_activate+0x49 socket_init_common+0x215 so_init+0x25 socket_create+0xe6 so_socket+0x1cd _sys_sysenter_post_swapgs+0x159 fffffe252dadf7e0 PANIC <NONE> 1 0xfffffffffb8a9461 die+0xc0 trap+0xa49 cmntrap_pushed+0x3c conn_get_socket_info+0x4f udp_snmp_get+0x407 ip_snmp_get+0xd3 snmpcom_req+0x21a ip_wput_nondata+0x26f putnext+0x2d5 strput+0x35f strputmsg+0x2ee msgio32+0x1b8 putmsg32+0x64 _sys_sysenter_post_swapgs+0x159
I suspect we have a winner, if it's working on the same connection:
> fffffe23b2d9f440::findstack -v stack pointer for thread fffffe23b2d9f440 (mdnsd/1): fffffcc26de3ab10 [ fffffcc26de3ab10 _resume_from_idle+0x12b() ] fffffcc26de3aca0 udp_activate+0x49(fffffe2391b985c0, fffffe24bdce70a0, fffffffffbdb8f40, 0, fffffe233e058178) fffffcc26de3ad10 socket_init_common+0x215(fffffe24bdce70a0, 0, 0, fffffe233e058178) fffffcc26de3ad40 so_init+0x25(fffffe24bdce70a0, 0, fffffe233e058178, 0) fffffcc26de3adf0 socket_create+0xe6(2, 1, 0, 0, 0, 0, fffffcc200000001, fffffe233e058178, fffffcc26de3ae3c) > fffffe252dadf7e0::findstack -v stack pointer for thread fffffe252dadf7e0 (in.mpathd/1): fffffcc273846320 fffffcc2738466d0 conn_get_socket_info+0x4f(fffffe2391b985c0, fffffcc273846780) fffffcc2738468c0 udp_snmp_get+0x407(fffffe28e3ef7108, fffffe24fdbcf100, 0) fffffcc273846920 ip_snmp_get+0xd3(fffffe28e3ef7108, fffffe2386724cc0, 404, 0) fffffcc2738469a0 snmpcom_req+0x21a(fffffe28e3ef7108, fffffe36d57adc60, fffffffff7b86470, fffffffff7b832f0, fffffe24dffd3338)
and indeed, they're both using fffffe2391b985c0
void
udp_activate(sock_lower_handle_t proto_handle, sock_upper_handle_t sock_handle,
sock_upcalls_t *sock_upcalls, int flags, cred_t *cr)
{
conn_t *connp = (conn_t *)proto_handle;
struct sock_proto_props sopp;
/* All Solaris components should pass a cred for this operation. */
ASSERT(cr != NULL);
connp->conn_upcalls = sock_upcalls;
connp->conn_upper_handle = sock_handle;
...
The thread was at udp_activate+0x49
:
udp_activate+0x42: movq %rsi,0x1b0(%rdi) <-- sock_handle
udp_activate+0x49: leaq -0x98(%rbp),%rsi <-- here when the panic occurred
udp_activate+0x50: movq %rdx,0x1a8(%rdi) <-- sock_upcalls
It looks like the compiler has re-ordered these* - %rsi
is the sock_handle and %rdx
is the sock_upcalls
and so the other thread saw the handle as non-NULL while the upcalls were still NULL. That's a tight race!
It's likely this pair of assignments should be wrapped in a lock on connp->conn_lock.
- - full disclosure, the upcoming r151042 release candidate is built with a gcc10 primary, but looking at the
assembly from a gcc7 primary, the re-ordering still occurs, but the race window is a bit bigger:
udp_activate+0x15: movq %rsi,0x1b0(%rdi) <-- sock_handle
udp_activate+0x1c: movl $0xffcf,%esi
udp_activate+0x21: movq +0x3e716c8(%rip),%rax <__stack_chk_guard>
udp_activate+0x28: movq %rax,-0x38(%rbp)
udp_activate+0x2c: xorl %eax,%eax
udp_activate+0x2e: movl 0x1d8(%rdi),%eax
udp_activate+0x34: cmpw $0x2,0x1ca(%rdi)
udp_activate+0x3c: movq %rdx,0x1a8(%rdi) <-- sock_upcalls
Related issues
Updated by Andy Fiddaman 4 months ago
- Related to Bug #12976: system panics with error in IP module added
Updated by Dan McDonald 4 months ago
Andy Fiddaman wrote:
SNIP!
It's likely this pair of assignments should be wrapped in a lock on connp->conn_lock.
- - full disclosure, the upcoming r151042 release candidate is built with a gcc10 primary, but looking at the
assembly from a gcc7 primary, the re-ordering still occurs, but the race window is a bit bigger:
An alternative approach is to make sure conn_get_socket_info() has two legitimate pointers (upcall table AND argument/handle). If either one is NULL, "drop the event/packet" because you can treat the case like the socket isn't fully there yet.
This approach works because a new conn_t gets allocated with those set to NULL (which ip_conn_constructor() does with a bzero() call up front).
Updated by Dan McDonald 4 months ago
Testing from the bug reporter(s) indicates that the fix does not induce the panic (though the race is really tight).
Updated by Electric Monk 4 months ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit 041297c2d66302c15134da1d1bdd91cf787a945a
commit 041297c2d66302c15134da1d1bdd91cf787a945a Author: Dan McDonald <danmcd@joyent.com> Date: 2022-04-06T21:07:03.000Z 14619 Race between udp_activate() and conn_get_socket_info() Reviewed by: Andy Fiddaman <andy@omnios.org> Reviewed by: Marco van Wieringen <marco.van.wieringen@planets.elm.net> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Approved by: Robert Mustacchi <rm@fingolfin.org>