Project

General

Profile

Actions

Bug #14827

closed

overlay_m_stop() blows verify in race with overlay_target_inject()

Added by Dan McDonald 4 months ago. Updated 4 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
networking
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:
External Bug:

Description

I have a customer kernel dump with the following two threads:

PANIC THREAD:

> ffffffb88f7ae0c0::findstack -v
stack pointer for thread ffffffb88f7ae0c0 (zoneadmd/3): fffff946163f29a0
  fffff946163f2a10 0xfffffffffb8a73d1()
  fffff946163f2a60 0xfffffffffba4c945()
  fffff946163f2a90 overlay_m_stop+0x8c(fffffd67c717b5c0)
  fffff946163f2ac0 mac_stop+0x68(fffffd6b350ea020)
  fffff946163f2b20 mac_client_datapath_teardown+0x1c9(fffffda307882e20, fffffdc2f513dba0, 0)
  fffff946163f2b70 mac_unicast_remove+0x29b(fffffda307882e20, fffffdc2f513dba0)
  fffff946163f2be0 vnic_dev_delete+0x146(150e0, 0, fffffda2b34af328)
  fffff946163f2c20 vnic_ioc_delete+0x1a(fffffdc2dd677210, fffff7ffee0ecdc4, 202003, fffffda2b34af328, fffff946163f2e18)
  fffff946163f2cc0 drv_ioctl+0x1ef(1200000000, 1710002, fffff7ffee0ecdc4, 202003, fffffda2b34af328, fffff946163f2e18)
  fffff946163f2d00 cdev_ioctl+0x2b(1200000000, 1710002, fffff7ffee0ecdc4, 202003, fffffda2b34af328, fffff946163f2e18)
  fffff946163f2d50 spec_ioctl+0x45(fffffd665eef6080, 1710002, fffff7ffee0ecdc4, 202003, fffffda2b34af328, fffff946163f2e18, 0)
  fffff946163f2de0 fop_ioctl+0x5b(fffffd665eef6080, 1710002, fffff7ffee0ecdc4, 202003, fffffda2b34af328, fffff946163f2e18, 0)
  fffff946163f2f00 ioctl+0x153(4, 1710002, fffff7ffee0ecdc4)
  fffff946163f2f10 sys_syscall+0x1a8()
> ::status
debugging crash dump vmcore.0 (64-bit) from <CENSORED>
operating system: 5.11 joyent_20220505T001410Z (i86pc)
git branch: release-20220505
git rev: af828066674374b72d7841d04a2224b69b070fc9
image uuid: (not set)
panic message: assertion failed: (odd->odd_flags & OVERLAY_F_STOPMASK) == 0, file: ../../common/io/overlay/overlay.c, line: 1013
dump content: kernel pages only
> 

This is zoneadmd calling dladm_vnic_delete(), presumably as part of zone shutdown.

OTHER THREAD:

> fffffd6773691460::findstack -v
stack pointer for thread fffffd6773691460 (varpd/167): fffff94605169900
  fffff94605169970 apix_hilevel_intr_prolog+0x85(fffffd665deab000, f, 0, fffff946051699f0)
  fffff946051699e0 apix_do_interrupt+0x333(fffff946051699f0, 4)
  fffff946051699f0 _interrupt+0xc3()
  fffff94605169b10 mutex_delay_default()
  fffff94605169b80 mutex_vector_enter+0x119(fffffd67c717b5c0)
  fffff94605169bd0 overlay_target_inject+0xcd(fffffd67c24383c0, fffffdc2e020b268)
  fffff94605169c70 overlay_target_ioctl+0x123(c400000001, 6f767413, f44c3b20, 100403, fffffd67cccf44c8, fffff94605169dc8)
  fffff94605169cb0 cdev_ioctl+0x2b(c400000001, 6f767413, f44c3b20, 100403, fffffd67cccf44c8, fffff94605169dc8)
  fffff94605169d00 spec_ioctl+0x45(fffffd67728f3080, 6f767413, f44c3b20, 100403, fffffd67cccf44c8, fffff94605169dc8, 0)
  fffff94605169d90 fop_ioctl+0x5b(fffffd67728f3080, 6f767413, f44c3b20, 100403, fffffd67cccf44c8, fffff94605169dc8, 0)
  fffff94605169eb0 ioctl+0x153(4, 6f767413, f44c3b20)
  fffff94605169f10 _sys_sysenter_post_swapgs+0x159()
> fffffd67c717b5c0::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
fffffd67c717b5c0 adapt ffffffb88f7ae0c0      -      -      no
> 

This is varpd returning something that will be injected into the soon-to-be-torn-down overlay's receive-side path.

The VERIFY itself in overlay_m_stop() blows likely one of the flags covered by OVERLAY_F_STOPMASK is set.


> fffffd67c717b5c0::print overlay_dev_t odd_flags odd_rxcount odd_ref     
odd_flags = 0t41 (???)
odd_rxcount = 0x1
odd_ref = 0x5757e    /* XXX KEBE notes this seems high, FWIW! */
> 0t41=X
                29              /* XXX KEBE SAYS ( VARPD | IN_RX | ACTIVATED ) */
> 

OVERLAY_F_IN_RX is our problem. And it's being set by the varpd injection.

Let's look at overlay_m_stop(), it's straightforward, and I'll point out where things can go wrong:

static void
overlay_m_stop(void *arg)
{
    overlay_dev_t *odd = arg;

    /*
     * The MAC Perimeter is held here, so we don't have to worry about
     * synchronizing this with respect to metadata operations.
     */
    mutex_enter(&odd->odd_lock);
    VERIFY(odd->odd_flags & OVERLAY_F_IN_MUX);
    VERIFY(!(odd->odd_flags & OVERLAY_F_MDDROP));
    odd->odd_flags |= OVERLAY_F_MDDROP;
    overlay_io_wait(odd, OVERLAY_F_IOMASK);
/* SO WE WAITED FOR IO, but WE DROP THE LOCK... */
    mutex_exit(&odd->odd_lock);

    overlay_mux_remove_dev(odd->odd_mux, odd);
    overlay_mux_close(odd->odd_mux);
    odd->odd_mux = NULL;

/* AHA!  SOMEONE COULD SET RX HERE! */

    mutex_enter(&odd->odd_lock);
/* AND IT COULD STILL BE SET WHEN WE REACH HERE */
/* PERHAPS WE SHOULD WAIT FOR RX?!? */
    odd->odd_flags &= ~OVERLAY_F_IN_MUX;
    odd->odd_flags &= ~OVERLAY_F_MDDROP;
    VERIFY((odd->odd_flags & OVERLAY_F_STOPMASK) == 0);
    mutex_exit(&odd->odd_lock);
}

If you look at the end of overlay_target_inject(), you see where the DO go wrong:


    mutex_enter(&odd->odd_lock);
    overlay_io_start(odd, OVERLAY_F_IN_RX);
/* OKAY I SET RX HERE */
    mutex_exit(&odd->odd_lock);

/* SOMEWHERE BETWEEN HERE... */
    mac_rx(odd->odd_mh, NULL, mp);
/* ... AND HERE overlay_m_stop grabs odd_lock */

    mutex_enter(&odd->odd_lock);  /* SO I BLOCK HERE! */
    overlay_io_done(odd, OVERLAY_F_IN_RX);
    mutex_exit(&odd->odd_lock);

I'm thinking the right answer is to have a overlay_io_wait(odd, OVERLAY_F_IN_RX) where I says PERHAPS in above.

Regarding the high odd_ref, I'm running a kgrep on the overlay_dev_t right now, but I suspect we're either leaking, or the zone-shutdown is overriding all of the other reference-holds.


Related issues

Related to illumos gate - Bug #14853: overlay_setprop_vnetid bungles OVERLAY_F_MDDROPClosedDan McDonald

Actions
Actions #1

Updated by Electric Monk 4 months ago

  • Gerrit CR set to 2251
Actions #2

Updated by Dan McDonald 4 months ago

  • Related to Bug #14853: overlay_setprop_vnetid bungles OVERLAY_F_MDDROP added
Actions #3

Updated by Dan McDonald 4 months ago

Initial code reviews revealed that:

1.) "While semantically the OVERLAY_F_MDDROP was about metadata, I think it should probably apply to all of the injection paths. So really to me this reads as the code in overlay_target_inject should probably look to whether or not we're actually active before proceeding to inject a packet, otherwise failing to do so (or just dropping it)." - Robert Mustacchi.

2.) There is an outright wrong flag being set, but that's covered by #14853 .

Testing this may be annoying, but should involve a program that rapidly injects packets into /dev/overlay while an overlay-using zone shuts down.

Actions #4

Updated by Dan McDonald 4 months ago

This fix landed in the release-20220728 branch of illumos-joyent (and the accompanying SmartOS release).

Actions #5

Updated by Electric Monk 4 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit bdb5139270356ff627abb9467f2b4fc8db3fb81d

commit  bdb5139270356ff627abb9467f2b4fc8db3fb81d
Author: Dan McDonald <danmcd@mnx.io>
Date:   2022-08-04T19:07:27.000Z

    14827 overlay_m_stop() blows verify in race with overlay_target_inject()
    14853 overlay_setprop_vnetid bungles OVERLAY_F_MDDROP
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Approved by: Gordon Ross <Gordon.W.Ross@gmail.com>

Actions

Also available in: Atom PDF