Feature #12476

cyclic reprogramming can race with removal

Added by Patrick Mooney over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Category: kernel
% Done: 100%
Difficulty: Medium

Description

Upstreaming OS-6684 for bhyve:

While running vmadm start / stop -f in a tight loop, I hit the following:

> ::status
debugging crash dump vmcore.22 (64-bit) from dent
operating system: 5.11 joyent_20180216T202957Z (i86pc)
image uuid: (not set)
panic message: attempt to reprogram non-existent cyclic
dump content: kernel pages only
> $C
ffffff000f8328d0 vpanic()
ffffff000f832950 cyclic_reprogram_cyclic+0xb0(ffffff03eb144000, b, 2cc15105498)
ffffff000f832990 cyclic_reprogram+0x1d8(ffffff0437fa49d0, 2cc15105498)
ffffff000f832a00 vmm_glue_callout_reset_sbt+0x8b(ffffff03e8e0d788, c038ccb3532, e0ef1f1, fffffffff7b1fbb0, ffffff03e8e0d7b0, 200)
ffffff000f832a40 pit_timer_start_cntr0+0xab(ffffff03e8e0d740)
ffffff000f832a80 vatpit_callout_handler+0x98(ffffff03e8e0d7b0)
ffffff000f832aa0 vmm_glue_callout_handler+0x2f(ffffff03e8e0d788)
ffffff000f832b90 cyclic_softint+0x209(ffffff03eb0c5580, 0)
ffffff000f832ba0 cbe_low_level+0x14()
ffffff000f832bf0 av_dispatch_softvect+0x88(2)
ffffff000f832c20 apix_dispatch_softint+0x35(0, 0)
ffffff000f9d4a10 switch_sp_and_call+0x13()
ffffff000f9d4a60 apix_do_softint+0x44(ffffff000f9d4ae0)
ffffff000f9d4ad0 apix_do_interrupt+0x480(ffffff000f9d4ae0, ffffff03eb0daea0)
ffffff000f9d4ae0 _interrupt+0x1e9()
ffffff000f9d4bd0 mach_cpu_idle+6()
ffffff000f9d4c00 cpu_idle+0x11a()
ffffff000f9d4c20 idle+0xa7()
ffffff000f9d4c30 thread_start+8()

After looking at the vhpet-related dump and checking back with @john.levon's original, I've come up with a hypothesis regarding the issue.
In both cases, the panic occurs during an attempted timer reprogramming from a virtual device's running callout handler at the same time that a reboot is occurring.

  fffffb0410bc9980 cyclic_reprogram_cyclic+0x97(fffffdc5496f0d40, c3, 370a98496f2f04)
  fffffb0410bc99b0 cyclic_reprogram+0x108(fffffdce41a295c0, 370a98496f2f04)
  fffffb0410bc9a00 vmm_glue_callout_reset_sbt+0x8b(fffffdc859b8ac90, ec66a4e704be61, 3ff6900, fffffffff8494aa0, fffffdc859b8acb8, 200)
  fffffb0410bc9a60 vhpet_start_timer+0x99(fffffdc859b8ac40, 0, 3e391951, ec66a4e3055561)
  fffffb0410bc9ab0 vhpet_handler+0x76(fffffdc859b8acb8)
  fffffb0410bc9ad0 vmm_glue_callout_handler+0x2f(fffffdc859b8ac90)
  fffffb0410bc9b70 cyclic_softint+0xfd(fffffdc544958000, 0)
  fffffb0410bc9b80 cbe_low_level+0x14()
  fffffb0410bc9bd0 av_dispatch_softvect+0x78(2)
  fffffb0410bc9c00 apix_dispatch_softint+0x35(0, 0)
  fffffb0410b789f0 switch_sp_and_call+0x13()
  fffffb0410b78a40 apix_do_softint+0x6c(fffffb0410b78ab0)
  fffffb0410b78aa0 apix_do_interrupt+0x382(fffffb0410b78ab0, 0)
  fffffb0410b78ab0 _interrupt+0xba()
  fffffb0410b78ba0 i86_mwait+0x14()
  fffffb0410b78be0 cpu_idle_mwait+0x109()
  fffffb0410b78c00 idle+0xa7()
  fffffb0410b78c10 thread_start+8()

In another thread, we can see the re-init (read: reboot) path in the process of removing that cyclic:

stack pointer for thread fffffe55d37b8080 (bhyve/1 [vcpu 0]): fffffb04249a9560
[ fffffb04249a9560 _resume_from_idle+0x12b() ]
  fffffb04249a9590 swtch+0x141()
  fffffb04249a95d0 sema_p+0x1d7(fffffdc5496f0de0)
  fffffb04249a9650 cyclic_remove_here+0xac(fffffdc5496f0d40, c3, 0, 0)
  fffffb04249a9690 cyclic_remove+0x38(fffffdce41a295c0)
  fffffb04249a96c0 vmm_glue_callout_drain+0x27(fffffdc859b8ac90)
  fffffb04249a96f0 vhpet_cleanup+0x40(fffffdc859b8ac40)
  fffffb04249a9730 vm_cleanup+0x66(fffffdca49aad000, 0)
  fffffb04249a9760 vm_reinit+0x38(fffffdca49aad000)
  fffffb04249a9c20 vmmdev_do_ioctl+0xc43(fffffdcba98b4dc0, 20007605, 0, 202003, fffffdcb461ef0e0, fffffb04249a9ea8)
  fffffb04249a9cc0 vmm_ioctl+0x12f(13100000017, 20007605, 0, 202003, fffffdcb461ef0e0, fffffb04249a9ea8)
  fffffb04249a9d00 cdev_ioctl+0x39(13100000017, 20007605, 0, 202003, fffffdcb461ef0e0, fffffb04249a9ea8)
  fffffb04249a9d50 spec_ioctl+0x60(fffffdcb12aa0a00, 20007605, 0, 202003, fffffdcb461ef0e0, fffffb04249a9ea8, 0)
  fffffb04249a9de0 fop_ioctl+0x55(fffffdcb12aa0a00, 20007605, 0, 202003, fffffdcb461ef0e0, fffffb04249a9ea8, 0)
  fffffb04249a9f00 ioctl+0x9b(3, 20007605, 0)
  fffffb04249a9f10 sys_syscall+0x19f()

Now cyclic_remove will wait for the pending handler to complete before returning to its caller, as evidenced by the blocking sema_p call. The tricky part is that the handler is trying to reprogram that very cyclic. I've theorized the following sequence of events leading to this condition (a toy model of the interleaving follows the list):

1. The cyclic for vhpet_handler triggers, resulting in an interrupt on that CPU (at CBE_LOW_PIL of 2).
2. That handler executes some logic, but has not yet reached the point at which it reprograms the cyclic.
3. On a different CPU, a thread calls vm_reinit, which, as part of its cleanup, goes to remove the cyclic.
4. The cyclic_remove call results in a cyclic xcall to the CPU currently executing the handler, at CBE_HIGH_PIL, effectively preempting it.
5. In cyclic_remove_xcall, the cyclic is removed from the cyc_cpu_t heap and marked CYF_FREE.
6. Because the handler is still running at that moment (cy_pend != 0), the cyclic_remove_here call is forced to wait for its completion via sema_p.
7. With the CBE_HIGH_PIL interrupt handling complete, the cyclic handler continues and, as part of its duties, attempts to reprogram the now-removed cyclic.
8. vpanic ensues.
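
To make the interleaving concrete, here is a small userland model of the sequence above. It is a toy, not kernel code: the struct and field names are stand-ins for the cyc_cpu_t heap state, CYF_FREE, and cy_pend mentioned above, and the two racing contexts are replayed on a single thread in the listed order.

#include <stdio.h>
#include <stdbool.h>

typedef struct toy_cyclic {
    bool freed;             /* models CYF_FREE after removal */
    bool handler_pending;   /* models cy_pend != 0 */
} toy_cyclic_t;

/* Steps 4-5: the removal xcall pulls the cyclic out of the heap and
 * marks it free, even though the low-PIL handler on this CPU has not
 * finished running. */
static void
toy_remove_xcall(toy_cyclic_t *cyc)
{
    cyc->freed = true;
    /* Step 6: cyclic_remove_here() would now block in sema_p(),
     * because handler_pending is still set. */
}

/* Step 7: the interrupted handler resumes and tries to rearm itself. */
static void
toy_handler_reprogram(toy_cyclic_t *cyc)
{
    if (cyc->freed) {
        /* Step 8: this is the state in which the real code vpanic()s. */
        printf("attempt to reprogram non-existent cyclic\n");
        return;
    }
    /* otherwise the expiration would simply be updated */
}

int
main(void)
{
    toy_cyclic_t cyc = { .freed = false, .handler_pending = true };

    toy_remove_xcall(&cyc);         /* reinit thread, at CBE_HIGH_PIL */
    toy_handler_reprogram(&cyc);    /* vhpet handler, back at CBE_LOW_PIL */
    return (0);
}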

It seems that the cyclic_reprogram logic needs to be updated to detect when a cyclic has been removed out from under its running handler, and to report that failure back to the caller rather than panicking the box.
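
One plausible shape for such a change, sketched in the same toy style rather than as the actual committed diff: the reprogram path checks whether the cyclic has been freed and returns a status instead of calling vpanic(), and the caller (loosely modeled on vmm_glue_callout_reset_sbt()) treats a failure as "the callout is being drained, do not rearm". The names and the exact return convention below are assumptions for illustration, not the real illumos definitions.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t hrtime_t;   /* stand-in for the real hrtime_t */

typedef struct toy_cyclic {
    bool     freed;         /* models CYF_FREE */
    hrtime_t expire;        /* models the cyclic's expiration time */
} toy_cyclic_t;

/*
 * Failure-reporting reprogram: returns true if the new expiration was
 * applied, false if the cyclic was removed out from under its running
 * handler (the case where the current code panics).
 */
static bool
toy_cyclic_reprogram(toy_cyclic_t *cyc, hrtime_t expiration)
{
    if (cyc->freed)
        return (false);
    cyc->expire = expiration;
    return (true);
}

/* Caller loosely modeled on vmm_glue_callout_reset_sbt(): a failed
 * reprogram means cyclic_remove() is already tearing the callout down,
 * so the handler declines to rearm and lets the removal complete. */
static void
toy_callout_reset(toy_cyclic_t *cyc, hrtime_t target)
{
    if (!toy_cyclic_reprogram(cyc, target))
        printf("cyclic already removed; not rearming\n");
}

int
main(void)
{
    toy_cyclic_t cyc = { .freed = true, .expire = 0 };

    toy_callout_reset(&cyc, 12345); /* exercise the failure path */
    return (0);
}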

#1

Updated by Patrick Mooney over 3 years ago

After merging this into SmartOS, we stopped seeing those cyclic_reprogram() panics. Without bhyve (and a sufficient workload), it's difficult to reproduce the conditions for testing.

#2

Updated by Electric Monk over 3 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

commit  cb2d1b02582006a823e154df6807c505b73b1cb6
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2020-04-25T02:21:17.000Z

    12476 cyclic reprogramming can race with removal
    Reviewed by: John Levon <john.levon@joyent.com>
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Yuri Pankov <ypankov@fastmail.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
