Bug #12897

bhyve mevent can mistakenly handle events twice

Added by Michael Zeller 4 months ago. Updated 4 months ago.

Status:
Closed
Priority:
Normal
Category:
bhyve
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

While working on bhyve I had sercons connected to a guest's com1 socket. When I disconnected bhyve segfaulted with the following stack:

> ::status
debugging core file of bhyve (64-bit) from ellie
file: /usr/sbin/bhyve
initial argv: bhyve -H -w -s 0,hostbridge -B 1,product=illumo bhyve -s 31,lpc -l bootrom,/usr
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=20
> ::stack
mevent_delete_event+0x23(0, 1)
mevent_delete_close+0x13(0)
uart_sock_drain+0xb7(ffffffff, 0, c493f0)
mevent_handle_pe+0x23(fffffc7fffdfc4a0)
mevent_dispatch+0x59()
main+0x575(18, fffffc7fffdfc598)
_start_crt+0x83()
_start+0x18()

> mevent_delete_event+0x23/ain
mevent_delete_event+0x23:
mevent_delete_event+0x23:       movl   0x20(%rbx),%eax

> mevent_delete_event::dis
mevent_delete_event:            pushq  %rbp
mevent_delete_event+1:          movq   %rsp,%rbp
mevent_delete_event+4:          movq   %rbx,-0x20(%rbp)
mevent_delete_event+8:          movq   %r12,-0x18(%rbp)
mevent_delete_event+0xc:        movq   %rdi,%rbx
mevent_delete_event+0xf:        movq   %rsi,-0x10(%rbp)
mevent_delete_event+0x13:       movq   %rdi,-0x8(%rbp)
mevent_delete_event+0x17:       subq   $0x20,%rsp
mevent_delete_event+0x1b:       movl   %esi,%r12d
mevent_delete_event+0x1e:       call   -0x763   <mevent_qlock>
mevent_delete_event+0x23:       movl   0x20(%rbx),%eax

> <rbx=K
                0

We can see that mevent_delete_event moves rdi (the 1st argument) into rbx, which is 0, so when we hit offset 0x23 we segfault.
If we follow the stack trace, the mevent passed into mevent_handle_pe looks like:

> fffffc7fffdfc4a0::print port_event_t portev_user | ::print struct mevent
{
    me_func = uart_sock_drain
    me_fd = 0xffffffff
    me_timid = 0
    me_type = 0 (EVF_READ)
    me_param = 0xc493f0
    me_cq = 0
    me_state = 0x4
    me_closefd = 0x1
    me_notify = {
        portnfy_port = 0x7
        portnfy_user = 0xc49f40
    }
    me_sigev = {
        sigev_notify = 0
        sigev_signo = 0
        sigev_value = {
            sival_int = 0
            sival_ptr = 0
        }
        sigev_notify_function = 0
        sigev_notify_attributes = 0
        __sigev_pad2 = 0
    }
    me_auto_requeue = 0 (0)
    me_list = {
        le_next = 0
        le_prev = bhyve`change_head
    }
}

We can see that the code calls the uart_sock_drain function pointer with an fd of -1.
Looking at the code, the only place me_fd gets set to -1 is when mevent_update_pending() sees that me_closefd is true.
It looks like this event has already been processed and its fd has already been closed.
Studying the code a bit further, it appears that mevent_dispatch() can handle an event twice if port_get() returns EINTR.

To test this hypothesis I wrote the following D script to try to reproduce the issue with the same stack trace.

pid$target::mevent_handle_pe:entry
{
        self->m = (userland struct pid`mevent *)args[0]->portev_user;
}

pid$target::mevent_handle_pe:return
/(uintptr_t)self->m->me_func == 0x451290 && self->m->me_state == 0x4/
{
        printf("closing uart connection\n");
        stop();
        exit(0);
}

The idea here is that when I trigger a disconnect from com1, the D script will fire and stop the bhyve process. Once stopped, I send the process a SIGCONT in the hope that
the port_get() call in the mevent_dispatch() loop will fail with EINTR. If it does, the variable holding the previous event should still contain the old mevent with its fd set to -1.

I ran the D script, and once the process was stopped I ran prun(1) on it; sure enough, the bhyve process crashed with the same stack:

root@ellie:~# mdb core
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]
> $C
fffffc7fffdf3460 mevent_delete_event+0x23(0, 1)
fffffc7fffdf3480 mevent_delete_close+0x13(0)
fffffc7fffdf34d0 uart_sock_drain+0xb7(ffffffff, 0, 1206d40)
fffffc7fffdf3500 mevent_handle_pe+0x23(fffffc7fffdf3510)
fffffc7fffdf3540 mevent_dispatch+0x59()
fffffc7fffdf35b0 main+0x575(18, fffffc7fffdf3608)
fffffc7fffdf35e0 _start_crt+0x83()
fffffc7fffdf35f0 _start+0x18()
> fffffc7fffdf3510::print port_event_t portev_user | ::whatis
1206ac0 is 1206ab0+10, freed from umem_alloc_160
> ::status
debugging core file of bhyve (64-bit) from ellie
file: /usr/sbin/bhyve
initial argv: bhyve -H -w -s 0,hostbridge -B 1,product=illumo bhyve -s 31,lpc -l bootrom,/usr
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=20

To be sure, note that I started the bhyve process with libumem, since the mevent we are processing should already have been freed. Sure enough, umem reports that we are processing a freed mevent.

History

#1

Updated by Michael Zeller 4 months ago

  • Subject changed from bhyve mevent handles event twice to bhyve mevent can mistakenly handle events twice
#2

Updated by Electric Monk 4 months ago

  • Gerrit CR set to 754
#3

Updated by Michael Zeller 4 months ago

  • Status changed from New to In Progress
#4

Updated by Michael Zeller 4 months ago

To test this I ran the D script included in the ticket, which stops bhyve at the point of serial console disconnect (uart_sock_drain). Once the process was stopped I ran prun <pid>.

Without the patch applied, I observed bhyve take a SIGSEGV when it re-handled the mevent.

With the patch applied, I observed the bhyve VM resume execution without crashing after receiving the signal.

#5

Updated by Electric Monk 4 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit cf7690ebb38fa81bd6f3904ba5ad4649c0ea3c0b

commit  cf7690ebb38fa81bd6f3904ba5ad4649c0ea3c0b
Author: Mike Zeller <mike@mikezeller.net>
Date:   2020-07-03T15:45:17.000Z

    12897 bhyve mevent can mistakenly handle events twice
    Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
    Reviewed by: Andy Fiddaman <andy@omniosce.org>
    Approved by: Dan McDonald <danmcd@joyent.com>
