Bug #13463


bhyve locks up when viona ring buffers on loan to a local zone are not returned by sockfs

Added by Jorge Schrauwen over 2 years ago. Updated 6 months ago.

Tags: bhyve, viona


Filing this since neither jbk nor I can seem to locate the previous ticket.

I've been suffering from an issue for a while where bhyve completely locks up.
Forcefully shutting the VM down leaves the zone (SmartOS) stuck in shutting_down. Sometimes waiting a very long time unfreezes the VM, presumably once the buffers are eventually returned, but that is not always the case.

Hopefully jbk can fill in what he remembers, as he looked at this previously and found some leads.

I can trigger this with the following setup with a high repeatability rate:

.---- [ cn1 ]-----------------.
| .-[bhyve_vm1]---.           |
| | OS: FreeBSD   |<===============> [switch] <===> [modem]
| | Role: fw      |           |
| '---------------'           |
|    ^                        |
|    +                        |
|    v                        |
| .-[native zone]---.         |
| | OS: native      |         |
| | Image: pkgbuild |         |
| '-----------------'         |

There is a caching service in the pkgbuild zone that caches the packages used when running the sandboxed builds; it seems to be very good at holding onto the ring buffers, so building a larger package like samba usually stalls the VM.
Since the VM acts as a firewall and forwards traffic, the packages downloaded by the pkgbuild zone come from one of the viona interfaces in the VM.

Subtasks (1 open, 0 closed)

Bug #15464: Change the viona default to copy tx buffers instead of loaning them by default (New)


Related issues

Related to illumos gate - Feature #15271: Want tunable to force viona to always copy tx buffers (New, assigned to Jorge Schrauwen)

Actions #1

Updated by Jorge Schrauwen over 2 years ago

Adding some context from yesterday on IRC when this came up again

17:12 sjorge: jbk had anymore time to look at the viona <-> sockfs not returning the buffer issue?
17:12 sjorge: I ran into it again today when doing an scp
17:36 jbk: no i haven't
18:04 jbk: this problem is something with either tcp or sockfs + mac loopback path
18:04 sjorge: jbk ack
18:05 jbk: basically packets from viona->non-viona zone on the same box get queued up in the netstack of the non-viona zone
18:05 jbk: since they're loaned from viona
18:05 jbk: once viona runs out of available descriptors, tx from the bhyve instance stops
18:07 jbk: yeah, i could kinda of reproduce it with scp.. and after a while some timer appears to kick in which unclogs things.. but for non-scp stuff, that doesn't seem to happen
18:56 LeftWing: Don't you just need a simple TCP server in the zone which maxes out the socket buffer size and doesn't read anything that it gets sent?
18:57 LeftWing: If it's a lending issue
19:32 jbk: well the problem is there's supposed to be a 'please stop sending me stuff' signal that appears to get ignored
19:33 jbk: which if that was working, i think would at least stop things before the senders descriptors were exhausted, allowing other TX traffic to proceed
19:35 jbk: there's also probably an argument to be made that when stuff is lent, there should be a way for the receiver to know and decide 'hey it's going to be a while, let me copy it and release the loaned resource'.. though how to determine when to do that gets tricky
19:35 jbk: we (amusingly) have some old code (I think related to zero copy) to deal with that on the outbound side, but not for receive
19:52 pmooney: it's a pretty sticky problem, IMO
19:53 pmooney: since normally there's no impact for those long lending times
19:53 pmooney: I'm not sure anything besides viona would benefit from a system for copying those buffers after a spell
19:53 LeftWing: Could we just stop lending once you've leant half of the available resources
19:54 pmooney: we could
19:54 pmooney: but it's still visible from the guest
19:55 pmooney: if I ask a NIC to send a packet, and it doesn't for 30 seconds or more...
19:55 LeftWing: Yeah but at least it wouldn't entirely stall
19:56 rmustacc: With most NIC drivers we reserve an upfront amount of descriptors for loaning, fwiw.
19:57 rmustacc: And fall back to alloc/bcopy when we run out.
19:57 pmooney: LeftWing: And there's still the issue of VM shutdown
19:58 LeftWing: I guess you can't just leave the relevant pages around with the current architecture
19:59 pmooney: definitely not today
19:59 pmooney: even if you could, teardown of viona gets a little weird
20:00 pmooney: closing the rings down with outstand mblks which expect to make transmit-complete notification callbacks would be a little grimy
20:01 pmooney: *outstanding
20:02 pmooney: one would certainly need to take exquisite care to not leak anything (or use-after-free) if mblks were allowed to be left hanging like that
20:02 rmustacc: grimy subsystems is my second middle name. I suspect we can figure out a way to make it work.
20:02 pmooney: well, I think a lot of that grime woudl fall to viona
20:03 pmooney: wouldn't shock me if the ultimate fix was to just bcopy in the case of loopback :-/
20:03 pmooney: we already had to undo the checksum offload for loopback
20:03 pmooney: *checksum ellision
20:03 rmustacc: Yes, but I think that'll change over time.
20:04 rmustacc: There's a big difference between opt-in versus opt-out.
20:04 pmooney: perhaps!
20:04 rmustacc: I mean, there is a long history of IP opt-in already for this stuff.
22:08 sjorge: So far the pkgsrc cache thing in a pkgbuild zone triggers it and from what I can tell in today's case... a cherrypy webserver after sending it a few requests.
22:09 sjorge: Killing the zone fixed the issue, killing just the bhyve vm... make it unable to boot again, probably as hinted above because the viona thingy is not torn down yet
22:09 sjorge: I'd defiantly personally
22:10 sjorge: Opt for less performance over not having to deal with this.
22:10 sjorge: Max I waited was about and hour and it didn't continue... so I assume some magic timeout does not fix it in all casss
23:05 LeftWing: I wonder if we could first add a tunable that forces copies across the board here
23:05 LeftWing: And you could tune it on
23:05 LeftWing: Without hopefully a great deal of upheaval
23:06 LeftWing: We should be correct before we are fast
23:06 LeftWing: Is there a bug filed for the hang/stall
23:54 jbk: i thought so.. let me see if i can find it

I spent a bit of time cleaning this up before remembering I could just link it.
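LeftWing's suggested reproducer from the log above (a TCP server in the zone that accepts a connection but never reads from it) could be sketched in Python roughly as follows. This is an illustrative sketch, not a tested reproducer; the buffer size is arbitrary:

```python
import socket

def make_greedy_server(port=0):
    """Create a listening socket whose owner will accept connections
    but deliberately never read from them.

    Data a bhyve guest sends to such a socket in a local zone sits in
    the kernel receive queue, and the loaned viona TX buffers backing
    it are not returned until the data is read or the socket closes,
    which is the starvation described in this ticket.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Ask for a large receive buffer so plenty of data can queue up.
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    return srv

# To trigger the stall one would accept() here and then sleep forever,
# while the guest (e.g. via scp) pushes data at the listening port.
listener = make_greedy_server()
print("listening on port", listener.getsockname()[1])
listener.close()
```

The key property is simply accepting without ever calling recv(), so the sender's data piles up in the receive queue.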

Actions #2

Updated by Robert Mustacchi over 2 years ago

  • Project changed from site to illumos gate
Actions #3

Updated by Michael Zeller over 2 years ago

  • Tags changed from byve, viona to bhyve, viona
Actions #4

Updated by Michael Zeller over 2 years ago

We have an internal ticket filed for the customer that hit this particular issue. These are some relevant comments posted in that ticket from when jbk and I took a look at this a while back:

We are able to track down a bunch of the viona_desb_t structures from the viona ring.

> ::walk streams_dblk_esb e | ::print dblk_t db_frtnp | ::print -t frtn_t free_func | ::grep ".==fffffffff853d230" | ::eval '<e::print dblk_t db_frtnp | ::print -t frtn_t free_arg | ::print -t viona_desb_t d_ring | ::grep ".==fffffd6bb6f288f8"' ! wc -l


So we know the reason the zone isn't shutting down is because the zone's viona instance has loaned out transmit buffers and is waiting to get them back. Our previous efforts were working on trying to track down what else in the kernel is holding on to those buffers.

> fffffd6b73590db8::print viona_vring_t
    vr_link = 0xfffffd6b73590c40
    vr_lock = {
        _opaque = [ 0 ]
    vr_cv = {
        _opaque = 0x1
    vr_state = 0x4
    vr_state_flags = 0
    vr_xfer_outstanding = 0x2b6
    vr_worker_thread = 0xfffffd6afb0dec00
    vr_lease = 0xfffffd6be088ec30
    vr_txdesb = 0xfffffd6b3c0f7000
    vr_txiov = 0xfffffd6b16da3000
    vr_intr_enabled = 0
    vr_msi_addr = 0xfee00000
    vr_msi_msg = 0x40b1
    vr_a_mutex = {
        _opaque = [ 0 ]
    vr_u_mutex = {
        _opaque = [ 0 ]
    vr_pa = 0x42b648000
    vr_size = 0x400
    vr_mask = 0x3ff
    vr_cur_aidx = 0x7e1a
    vr_descr = 0xfffff95369848000
    vr_avail_flags = 0xfffff9536984c000
    vr_avail_idx = 0xfffff9536984c002
    vr_avail_ring = 0xfffff9536984c004
    vr_avail_used_event = 0xfffff9536984c804
    vr_used_flags = 0xfffff9536984d000
    vr_used_idx = 0xfffff9536984d002
    vr_used_ring = 0xfffff9536984d004
    vr_used_avail_event = 0xfffff9536984f004
    vr_stats = {
        rs_ndesc_too_high = 0
        rs_bad_idx = 0
        rs_indir_bad_len = 0
        rs_indir_bad_nest = 0
        rs_indir_bad_next = 0
        rs_no_space = 0
        rs_too_many_desc = 0
        rs_desc_bad_len = 0
        rs_bad_ring_addr = 0
        rs_fail_hcksum = 0
        rs_fail_hcksum6 = 0
        rs_fail_hcksum_proto = 0
        rs_bad_rx_frame = 0
        rs_rx_merge_overrun = 0
> 0xfffffd6b3c0f7000::print viona_desb_t
    d_frtn = {
        free_func = viona_desb_release
        free_arg = 0xfffffd6b3c0f7000
    d_ring = 0xfffffd6b73590db8
    d_ref = 0x3
    d_len = 0x1ee
    d_cookie = 0
    d_headers = 0xfffffd6bd5184518

Helpfully, the first buffer in the ring appears to be loaned out ({{d_ref > 0}}). The buffer is used to hold the headers for the packets, so let's see what has a copy of it.

> 0xfffffd6bd5184518::kgrep

The first one is the pointer in the {{viona_vring_t}}, so the second one is more interesting.

> fffffd687d3d9388::whatis
fffffd687d3d9388 is fffffd687d3d9380+8, allocated from streams_dblk_esb

So that's a {{dblk_t}} pointing to that buffer as {{db_base}}:

> fffffd687d3d9380::print dblk_t
    db_frtnp = 0xfffffd6b3c0f7000
    db_base = 0xfffffd6bd5184518
    db_lim = 0xfffffd6bd51845a2
    db_ref = 0x1
    db_type = 0
    db_flags = 0
    db_struioflag = 0
    db_cpid = 0xffffffff
    db_cache = 0xfffffd64e605a008
    db_mblk = 0xfffffdace8a9dda0
    db_free = dblk_lastfree_desb
    db_lastfree = dblk_lastfree_desb
    db_cksumstart = 0x14
    db_cksumend = 0x1d4
    db_cksumstuff = 0x24
    db_struioun = {
        enforce_alignment = +2.0658925e-281
        data = [ 0, 0, 0xdd, 0xba, 0x8, 0, 0xa8, 0x5 ]
        cksum = {
            cksum_val = {
                u32 = 0xbadd0000
                u16 = 0
            flags = 0x8
            pad = 0x5a8
    db_fthdr = 0
    db_credp = 0

Every {{dblk_t}} should be part of an {{mblk_t}}, so let's try to find it:

> fffffd687d3d9380::kgrep
> fffffdace8a9ddc8::whatis
fffffdace8a9ddc8 is fffffdace8a9dda0+28, allocated from streams_mblk
> fffffdace8a9dda0::print mblk_t
    b_next = 0
    b_prev = 0
    b_cont = 0xfffffdcb17d88260
    b_rptr = 0xfffffd6bd518455a
    b_wptr = 0xfffffd6bd51845a2
    b_datap = 0xfffffd687d3d9380
    b_band = 0
    b_tag = 0
    b_flag = 0
    b_queue = 0

This all seems correct -- {{b_cont}} is non-NULL, which makes sense if the data buffer we're tracking down holds headers (one {{dblk_t}} for the headers, and another for the data from the guest).
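As an aside, the {{mblk_t}}/{{dblk_t}} layout being walked here can be sketched as a simplified Python model (these are stand-in classes for illustration, not the real STREAMS structures):

```python
class Dblk:
    """Simplified stand-in for dblk_t: owns a data buffer."""
    def __init__(self, base):
        self.db_base = base   # start of the data buffer
        self.db_mblk = None   # back-pointer to the mblk_t using it

class Mblk:
    """Simplified stand-in for mblk_t: one fragment of a message."""
    def __init__(self, dblk):
        self.b_datap = dblk   # data block holding this fragment's bytes
        self.b_cont = None    # next fragment of the same packet
        dblk.db_mblk = self

# A viona TX packet: one fragment for the (copied) headers, chained
# via b_cont to a fragment for the loaned guest data.
headers = Mblk(Dblk(base="d_headers buffer"))
payload = Mblk(Dblk(base="guest memory (loaned)"))
headers.b_cont = payload

# Walking b_cont forward recovers every fragment of the packet:
frags = []
m = headers
while m is not None:
    frags.append(m.b_datap.db_base)
    m = m.b_cont
assert frags == ["d_headers buffer", "guest memory (loaned)"]
```

This is why finding the data {{dblk_t}} above leads back, via {{db_mblk}} and the {{b_cont}} chain, to the packet holding the loaned viona buffer.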

So now that we have the {{mblk_t}}, let's see what's referencing it:

> fffffdace8a9dda0::kgrep

The first one is the reference to the {{mblk_t}} from the {{dblk_t}}. The second is:

> fffffd6ddd66eb10::whatis
fffffd6ddd66eb10 is fffffd6ddd66eb00+10, allocated from streams_mblk
> fffffd6ddd66eb00::kgrep
> fffffdce90c55bd0::whatis
fffffdce90c55bd0 is fffffdce90c55bc0+10, allocated from streams_mblk
> fffffdce90c55bc0::kgrep
>  fffffdcc8fd30170::whatis
fffffdcc8fd30170 is fffffdcc8fd30160+10, allocated from streams_mblk
> fffffdcc8fd30160::kgrep
> fffffd6875809228::whatis
fffffd6875809228 is fffffd6875809200+28, allocated from streams_dblk_esb
> fffffdcc17617f50::whatis
fffffdcc17617f50 is fffffdcc17617f40+10, allocated from streams_mblk
> fffffd6835ed5228::whatis
fffffd6835ed5228 is fffffd6835ed5200+28, allocated from streams_dblk_esb
> fffffe0f720c23b0::whatis
fffffe0f720c23b0 is fffffe0f720c23a0+10, allocated from streams_mblk
> fffffe0f720c23a0::kgrep
> fffffecef10bbf70::whatis
fffffecef10bbf70 is fffffecef10bbf60+10, allocated from streams_mblk
> fffffecef10bbf60::kgrep
> fffffe0c706b4bb0::whatis
fffffe0c706b4bb0 is fffffe0c706b4ba0+10, allocated from streams_mblk
> fffffe0c706b4ba0::kgrep
> fffffd66ef8574a8::whatis
fffffd66ef8574a8 is fffffd66ef857480+28, allocated from streams_dblk_esb
> fffffdcbdca99f70::whatis
fffffdcbdca99f70 is fffffdcbdca99f60+10, allocated from streams_mblk
> fffffdcbdca99f60::kgrep
> fffffd682c6247a8::whatis
fffffd682c6247a8 is fffffd682c624780+28, allocated from streams_dblk_esb
> fffffda90b5b51d0::whatis
fffffda90b5b51d0 is fffffda90b5b51c0+10, allocated from streams_mblk
> fffffda90b5b51c0::kgrep
> fffffda90b5b5190::whatis
fffffda90b5b5190 is fffffda90b5b5180+10, allocated from streams_mblk
> fffffda90b5b5180::kgrep
> fffffdcb358868b0::whatis
fffffdcb358868b0 is fffffdcb358868a0+10, allocated from streams_mblk
> fffffdcb358868a0::kgrep
> fffffd67abfba728::whatis
fffffd67abfba728 is fffffd67abfba700+28, allocated from streams_dblk_esb
> fffffecf991f9b90::whatis
fffffecf991f9b90 is fffffecf991f9b80+10, allocated from streams_mblk
> fffffecf991f9b80::kgrep
> fffffd683ec4ada8::whatis
fffffd683ec4ada8 is fffffd683ec4ad80+28, allocated from streams_dblk_esb
> fffffeb6f6d5f0b0::whatis
fffffeb6f6d5f0b0 is fffffeb6f6d5f0a0+10, allocated from streams_mblk

Trying to find a better way.. doing this seems to have sped up the process:

> ::walk streams_mblk | ::print -a mblk_t b_cont ! cat > out

I then used a bit of Python to process the output:


import sys

data = {}
f = open(sys.argv[1], 'r')
for line in f:
    parts = line.split(' ')
    mblk = int(parts[0], 16) - 16   # b_cont is at offset 0x10 in mblk_t
    cont = int(parts[3], 16)
    if cont == 0:
        continue
    data[cont] = mblk               # map b_cont target -> containing mblk_t

next_addr = int(sys.argv[2], 16)
prev = next_addr
while next_addr in data:
    prev = next_addr
    next_addr = data[next_addr]

# NB: prints prev, which is one entry past the true head (see below)
print('0x', hex(prev))

(not the greatest, but it did its job)

That gave {{0xfffffe2ec8245780}} as the first {{mblk_t}} in the (extra-long) packet. Double checking:

> 0xfffffe2ec8245780::whatis
fffffe2ec8245780 is allocated from streams_mblk
> fffffe2ec8245780::list mblk_t b_cont | ::grep '.==fffffdace8a9dda0'

So walking the {{b_cont}} pointers forward from that {{mblk_t}} gets us back to our loaned-out viona buffer. It actually stopped one too short (a bug in the python script); the first {{mblk_t}} is {{fffffdd15c24c7d0}} (from grepping the output file).

Looking at that:

> fffffdd15c24c7d0::print mblk_t
    b_next = 0xfffffe2ec8245780
    b_prev = 0xfffffd6cc59abbb6
    b_cont = 0xfffffd6cc59ac15e
    b_rptr = 0xfffffd6cc59abb00
    b_wptr = 0xbaddcafe00000000
    b_datap = 0
    b_band = 0x80
    b_tag = 0x82
    b_flag = 0x7bf3
    b_queue = 0xfffffdcc945b4a00

Suggesting things have been leaked...
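As an aside, the off-by-one mentioned above is visible in the script's final print: when the loop exits, {{next_addr}} (not {{prev}}) holds the address with no recorded predecessor, i.e. the head of the chain. A toy illustration with made-up addresses:

```python
# data maps "mblk pointed to by a b_cont" -> "mblk containing that
# b_cont", so walking data[] steps backwards toward the packet head.
data = {0x30: 0x20, 0x20: 0x10}  # chain: 0x10 -> 0x20 -> 0x30 via b_cont

next_addr = 0x30          # start from the mblk we found by kgrep
prev = next_addr
while next_addr in data:
    prev = next_addr
    next_addr = data[next_addr]

# The script printed prev (0x20, one short of the head); the head is:
print(hex(next_addr))     # 0x10
```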

Actions #5

Updated by Michael Zeller over 2 years ago


When a network interface for a bhyve VM is created, it has a fixed number of TX resources (TX descriptors; this is a limitation of the virtio spec, nothing particular to us) -- we happen to use 1024. Each transmitted packet consumes one of these TX descriptors until the underlying NIC completes transmitting the packet and releases the descriptor for reuse. If no TX descriptors are available, the virtual NIC in the bhyve instance cannot transmit. The bhyve instance also cannot complete shutdown until all of those TX descriptors have been released.

What's happening here is instead of transmitting out a physical NIC, the destination IP is another zone on the same machine. The kernel detects this, and instead sends the packets directly to the network stack of the destination zone, bypassing the physical NIC altogether. At that point the packet is sent up the network stack. As a result, the packet (and the TX descriptor for it) are tied up until the data is read by the destination zone (or the socket is closed, or the destination zone is rebooted/shutdown). If all of the TX descriptors get tied up in this manner, that will prevent the source bhyve instance from transmitting (or shutting down) until they are released.
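The descriptor-exhaustion mechanics described above can be sketched as a toy simulation (illustrative only; the ring size of 1024 comes from the previous paragraph, everything else is invented):

```python
from collections import deque

RING_SIZE = 1024  # TX descriptors per viona ring (per the text above)

free_descriptors = RING_SIZE
receiver_queue = deque()   # loaned buffers parked unread in the peer zone

def transmit():
    """Guest TX: consumes one descriptor; the loopback path loans the
    buffer to the receiving zone instead of copying it."""
    global free_descriptors
    if free_descriptors == 0:
        return False       # ring exhausted: the guest NIC stalls
    free_descriptors -= 1
    receiver_queue.append(object())
    return True

def receiver_reads(n):
    """The peer zone finally reads n buffers, freeing their descriptors."""
    global free_descriptors
    for _ in range(min(n, len(receiver_queue))):
        receiver_queue.popleft()
        free_descriptors += 1

# Against a receiver that never reads, TX stalls after RING_SIZE packets,
# and (as described above) the VM also cannot finish shutting down:
sent = 0
while transmit():
    sent += 1
assert sent == RING_SIZE and free_descriptors == 0

# Only the receiver reading (or its socket/zone going away) unblocks TX:
receiver_reads(10)
assert transmit()
```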

We don't have a fix at the moment, but a workaround is to restart whatever process is holding up the TX descriptors. It takes a bit of digging in mdb to locate the process, so there's no easy script I can suggest; however, since they haven't encountered it since, it seems rare enough that we could walk them through it if it happens again before we can get a fix written.

Actions #6

Updated by Jason King over 2 years ago

Some additional notes from prior crash dumps:

> ::zone
            ADDR     ID STATUS        NAME                 PATH
fffffffffbfeb540      0 ???           global               ??
fffffe59944a9c40      1 ???           eceec5eb-22c7-e6...  ??
fffffe59d1b20d80      2 ???           99e40ee7-a8f9-4b...  ??
fffffe5d67b39400      3 ???           905aa94c-5bb0-6e...  ??

The 99e40... zone is the bhyve zone and 905aa... is the pkgsrc zone (the hang is triggered by loaned viona tx buffers from the bhyve zone to the pkgsrc zone). The addresses of the {{zone_t}}s will be useful in a bit...

Let's find an outstanding viona_desb_t.

> ::stacks -c viona_worker_tx
THREAD           STATE    SOBJ                COUNT
fffffe59912c5060 SLEEP    CV                      5

fffffe59a42d0440 SLEEP    CV                      1

> fffffe59a42d0440::findstack -v
stack pointer for thread fffffe59a42d0440 (/48 [viona_tx_fffffe5c2c826d78]): fffffe007be24d90
[ fffffe007be24d90 _resume_from_idle+0x12b() ]
  fffffe007be24dc0 swtch+0x16b()
  fffffe007be24df0 cv_wait+0x89(fffffe5c2c826d88, fffffe5c2c826d80)
  fffffe007be24e30 viona_tx_wait_outstanding+0x43(fffffe5c2c826d78)
  fffffe007be24e90 viona_worker_tx+0x1eb(fffffe5c2c826d78, fffffe5c2c826c00)
  fffffe007be24ee0 viona_worker+0x1cd(fffffe5c2c826d78)
  fffffe007be24ef0 thread_start+0xb()
> fffffe5c2c826d78::print viona_vring_t vr_txdesb
vr_txdesb = 0xfffffe5d09923000
> 0xfffffe5d09923000::print viona_desb_t
    d_frtn = {
        free_func = viona_desb_release
        free_arg = 0xfffffe5d09923000
    d_ring = 0xfffffe5c2c826d78
    d_ref = 0x3
    d_len = 0x5ee
    d_cookie = 0
    d_headers = 0xfffffe5d08dc58b0

{{d_ref > 0}} means this viona descriptor is in use. The first ref is unimportant here; the additional refs are one per {{mblk_t}} that references the descriptor (one {{mblk_t}} references a pre-allocated buffer for the headers ({{db_base == d_headers}}); the other refers to the mapped tx data from the guest).

> 0xfffffe5d09923000::kgrep | ::whatis
fffffe5d84f46900 is allocated from streams_dblk_esb:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
fffffe5d90bfc220 fffffe5d84f46900     e3dc101f9af1 fffffe59a42d0440
                 fffffe595fe0e008 fffffe58e63d1780                0
fffffe5d84f469c0 is allocated from streams_dblk_esb:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
fffffe5d90bfc148 fffffe5d84f469c0     e3dc101f94fd fffffe59a42d0440
                 fffffe595fe0e008 fffffe58e63d1600                0
> fffffe5d84f469c0::print dblk_t
    db_frtnp = 0xfffffe5d09923000
    db_base = 0xfffffe5d08dc58b0
    db_lim = 0xfffffe5d08dc593a
    db_ref = 0x1
    db_type = 0
    db_flags = 0
    db_struioflag = 0
    db_cpid = 0xffffffff
    db_cache = 0xfffffe595fe0e008
    db_mblk = 0xfffffe5995f686c0
    db_free = dblk_lastfree_desb
    db_lastfree = dblk_lastfree_desb
    db_cksumstart = 0xbaddcafebaddcafe
    db_cksumend = 0xbaddcafebaddcafe
    db_cksumstuff = 0xbaddcafebaddcafe
    db_struioun = {
        enforce_alignment = -3.7481636e-25
        data = [ 0xfe, 0xca, 0xdd, 0xba, 0, 0, 0xdd, 0xba ]
        cksum = {
            cksum_val = {
                u32 = 0xbaddcafe
                u16 = 0xcafe
            flags = 0
            pad = 0xbadd
    db_fthdr = 0
    db_credp = 0
> fffffe5d84f469c0::print dblk_t db_base
db_base = 0xfffffe5d08dc58b0
> fffffe5d84f46900::print dblk_t db_base
db_base = 0xfffffe02e6ca709e

Since fffffe5d84f46900's db_base != d_headers, it contains data, so we'll use the mblk_t for this dblk_t.

> fffffe5d84f46900::print dblk_t db_mblk
db_mblk = 0xfffffe5995f68660

Let's find this mblk_t. Assume it's somewhere in the destination zone's netstack:

> fffffe5d67b39400::print zone_t zone_netstack
zone_netstack = 0xfffffe5d672863c0
> 0xfffffe5c2bf35dc0::walk tcp_conn_cache s | ::print conn_t conn_upper_handle | ::print sonode_t so_rcv_head | ::walk b_cont | ::grep '.==0xfffffe5995f68660' | ::eval '<s::print conn_t'
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
    conn_lock = {
        _opaque = [ 0 ]
    conn_ref = 0x3
    conn_flags = 0x40001001
    conn_proto_priv = {
        cp_tcp = 0xfffffe5d9019aac0
        cp_udp = 0xfffffe5d9019aac0
        cp_icmp = 0xfffffe5d9019aac0
        cp_rts = 0xfffffe5d9019aac0
        cp_iptun = 0xfffffe5d9019aac0
        cp_sctp = 0xfffffe5d9019aac0
        cp_priv = 0xfffffe5d9019aac0
    conn_cv = {
        _opaque = 0
    conn_proto = 0x6
    conn_recv = tcp_input_data
    conn_recvicmp = tcp_icmp_input
    conn_verifyicmp = tcp_verifyicmp
    conn_ixa = 0xfffffe598add4000
    conn_recv_ancillary = {
        crbu = {
            crbu_all = 0
            crbb = {
                crbb_recvdstaddr = 0
                crbb_recvopts = 0
                crbb_recvif = 0
                crbb_recvslla = 0
                crbb_recvttl = 0
                crbb_ip_recvpktinfo = 0
                crbb_ipv6_recvhoplimit = 0
                crbb_ipv6_recvhopopts = 0
                crbb_ipv6_recvdstopts = 0
                crbb_ipv6_recvrthdr = 0
                crbb_old_ipv6_recvdstopts = 0
                crbb_ipv6_recvrthdrdstopts = 0
                crbb_ipv6_recvtclass = 0
                crbb_recvucred = 0
                crbb_timestamp = 0
    conn_sqp = 0xfffffe597b084900
    conn_state_flags = 0
    conn_lingertime = 0
    conn_on_sqp = 0
    conn_linger = 0
    conn_useloopback = 0
    conn_broadcast = 0
> 0xfffffe5c2bf35dc0::walk tcp_conn_cache s | ::print conn_t conn_upper_handle | ::print sonode_t so_rcv_head | ::walk b_cont | ::grep '.==0xfffffe5995f68660' | ::eval '<s=J'
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
> fffffe5d9019a2c0::print conn_t conn_upper_handle | ::print sonode_t so_rcv_queued
so_rcv_queued = 0x1565a0
> fffffe5d9019a2c0::print conn_t conn_upper_handle | ::print sonode_t
    so_vnode = 0xfffffe5d7038e700
    so_ops = so_sonodeops
    so_priv = 0
    so_fallback_rwlock = {
        _opaque = [ 0 ]
    so_lock = {
        _opaque = [ 0 ]
    so_state_cv = {
        _opaque = 0
    so_single_cv = {
        _opaque = 0
    so_read_cv = {
        _opaque = 0
    so_state = 0x800081
    so_mode = 0xc28
    so_flag = 0
    so_count = 0x1
    so_proto_connid = 0x1
    so_error = 0
    so_sockparams = 0xfffffe5989eba9c0
    so_family = 0x2
    so_type = 0x2
    so_protocol = 0
    so_version = 0x4
    so_acceptq_lock = {
        _opaque = [ 0 ]
    so_acceptq_list = {
        list_size = 0x290
        list_offset = 0xa8
        list_head = {
            list_next = 0xfffffe5c048dcb18
            list_prev = 0xfffffe5c048dcb18
    so_acceptq_defer = {
        list_size = 0x290
        list_offset = 0xa8
        list_head = {
            list_next = 0xfffffe5c048dcb38
            list_prev = 0xfffffe5c048dcb38
    so_acceptq_node = {
        list_next = 0
        list_prev = 0
    so_acceptq_len = 0
    so_backlog = 0
    so_acceptq_cv = {
        _opaque = 0
    so_listener = 0
    so_options = 0
    so_linger = {
        l_onoff = 0
        l_linger = 0
    so_xpg_rcvbuf = 0
    so_sndtimeo = 0
    so_rcvtimeo = 0
    so_oobmsg = 0
    so_oobmark = 0
    so_pgrp = 0
    so_peercred = 0
    so_cpid = 0xbaddcafe
    so_zoneid = 0x3
    so_poll_list = {
        ph_list = 0xfffffe5d8bad7400
        ph_pad1 = 0
        ph_pad2 = 0
    so_pollev = 0
    so_rcv_queued = 0x1565a0
    so_rcv_q_head = 0
    so_rcv_q_last_head = 0
    so_rcv_head = 0xfffffe595fe4ade0
    so_rcv_last_head = 0xfffffe595fe4ade0
    so_rcv_cv = {
        _opaque = 0
    so_rcv_wanted = 0
    so_rcv_timer_tid = 0
    so_snd_cv = {
        _opaque = 0
    so_snd_qfull = 0
    so_rcv_wakeup = 0
    so_snd_wakeup = 0
    so_not_str = 0x1
    so_pad_to_bit_31 = 0xaddcafe
    so_proto_handle = 0xfffffe5d9019a2c0
    so_downcalls = sock_tcp_downcalls
    so_proto_props = {
        sopp_flags = 0
        sopp_wroff = 0x78
        sopp_txhiwat = 0
        sopp_txlowat = 0
        sopp_rxhiwat = 0x1004a0
        sopp_rxlowat = 0x400
        sopp_maxblk = 0xffffffffffffffff
        sopp_maxpsz = 0x1f4a0
        sopp_minpsz = 0
        sopp_tail = 0
        sopp_zcopyflag = 0
        sopp_oobinline = 0 (0)
        sopp_rcvtimer = 0x32
        sopp_rcvthresh = 0x1800
        sopp_maxaddrlen = 0x20
        sopp_loopback = 0 (0)
    so_flowctrld = 0x1 (B_TRUE)
    so_copyflag = 0
    so_copy_cv = {
        _opaque = 0
    so_ksock_callbacks = {
        ksock_cb_flags = 0
        ksock_cb_connected = 0
        ksock_cb_connectfailed = 0
        ksock_cb_disconnected = 0
        ksock_cb_newdata = 0
        ksock_cb_newconn = 0
        ksock_cb_cansend = 0
        ksock_cb_oobdata = 0
        ksock_cb_cantsendmore = 0
        ksock_cb_cantrecvmore = 0
        ksock_cb_error = 0
    so_ksock_cb_arg = 0
    so_closing_cv = {
        _opaque = 0
    so_direct = 0
    so_filter_active = 0
    so_filter_tx = 0
    so_filter_top = 0
    so_filter_bottom = 0
    so_filter_defertime = 0
    so_krecv_cb = 0
    so_krecv_arg = 0

So this socket seems to be added to an event port:

> 0xfffffe5d8bad7400::print 'struct polldat'
    pd_fd = 0x18
    pd_events = 0x1
    pd_fp = 0xfffffe5d67099638
    pd_php = 0xfffffe5c048dcbb8
    pd_thread = 0
    pd_pcache = 0xfffffe5d585011c8
    pd_next = 0
    pd_hashnext = 0
    pd_count = 0
    pd_nsets = 0
    pd_ref = 0
    pd_portev = 0xfffffe5c1c365150
    pd_gen = 0
    pd_epolldata = 0
> 0xfffffe5c1c365150::print port_kevent_t
    portkev_lock = {
        _opaque = [ 0 ]
    portkev_source = 0x4
    portkev_events = 0x1
    portkev_flags = 0x2
    portkev_pid = 0x1a8f4
    portkev_object = 0x18
    portkev_user = 0
    portkev_callback = port_fd_callback
    portkev_arg = 0xfffffe5d8bad7400
    portkev_port = 0xfffffe5d59018d00
    portkev_node = {
        list_next = 0
        list_prev = 0

The process is blocked in {{port_getn}} for that event port (with {{*nget == 1}}):

> fffffe5d81f8efd8::walk thread | ::findstack -v
stack pointer for thread fffffe5c0c041080 (/1): fffffe007cf4bbd0
[ fffffe007cf4bbd0 _resume_from_idle+0x12b() ]
  fffffe007cf4bc00 swtch+0x16b()
  fffffe007cf4bc80 cv_wait_sig_swap_core+0x182(fffffe5cc8957a64, fffffe5d59018d60, 0)
  fffffe007cf4bca0 cv_wait_sig_swap+0xd(fffffe5cc8957a64, fffffe5d59018d60)
  fffffe007cf4bd20 cv_waituntil_sig+0xd5(fffffe5cc8957a64, fffffe5d59018d60, 0, fffffe5d)
  fffffe007cf4be20 port_getn+0x198(fffffe5d59018d00, fffffc7fffdf9840, 400, fffffe007cf4be44, fffffe007cf4be60)
  fffffe007cf4bef0 portfs+0x2a0(6, f, fffffc7fffdf9840, 400, 1, 0)
  fffffe007cf4bf00 sys_syscall+0x2ae()
> fffffe007cf4be44/U
0xfffffe007cf4be44:             1

Why didn't the process wake up?

Actions #7

Updated by Jason King over 2 years ago

Where I was going to look next (but haven't had the opportunity) is how notifications happen for port_get(3C), poll(2), and read(2), and to try to observe both the working case and (if I could recreate the hang on a test system) the non-working case. It seems like a notification is either getting lost or never generated: while a process is waiting for data to read, and there's plenty of data available, the process isn't being woken up.

Actions #8

Updated by Jorge Schrauwen 6 months ago

I'm still hitting this every so often, and the issue was brought up again in #illumos yesterday, so I'm summarizing here.

<sjorge> jbk / papertigers would the tcp CC have any effect on when a zone/socket would release it's held buffers?
<sjorge> I switched to cubic on the host side and switched the linux VMs back cubic (had the VMs on newreno to match the native stuff)
<sjorge> I seem to hit the viona starvation less
<sjorge> Still hit it, though
<sjorge> Not sure if placebo or not
<jclulow> I would have to guess that it's a placebo 
<jclulow> But anything is possible 

The original reason for bringing this up follows below; it's probably not why I haven't hit the issue in a whole week (which is rare), and may not be related at all.

<sjorge> is the 16k a virtio max or is that something on our side?
<sjorge> (for vqsize)
<richlowe> that might be a legacy thing
<richlowe> in that I remember thinking that, pmooney is the expert.
<richlowe> oh, that's memory from me merging the virtio drivers together, so especially don't trust that.
<pmooney> max virtqueue length is 32k
<pmooney> that is, the number of entries in the virtqueue
<richlowe> the mmio stuff caps at some low number, the pci stuff just lets stuff default I think?
<pmooney> well and for legacy virtio, the virtqueue size is not negotiated, but rather specified by the device
<pmooney> so cranking it way up can be a bit onerous for the guest, having to allocate adequate memory for that dictated size
<richlowe> ok, that'd be what I was seeing and confused by 
<richlowe> at some point I hope someone smart updates us to the current spec, and takes a look at non-pci stuff.
<richlowe> because I'm trying to get weasel out of it
<richlowe> s/get//
<sjorge> i'll see if i can bump it to 32k tomorrow, i'd rather wast memory if it means less issues 

My own notes had vqsize for the viona vnics (using zadm on OmniOS these days) set to 16384, which makes it less likely to hit the issue; I had noted that as the max I could set it to, but that seems to have been an error on my part, and 32768 is also accepted. I've bumped the vqsize for all my VMs to that value; hopefully I'll hit this even less frequently now.

<jclulow> so the VM transmits packets to the local zone, by loaning buffers, basically?
<sjorge> As far as i understand it, yes
<jclulow> It really feels like we could at least add an /etc/system tuneable or something to just force it to copy always
<sjorge> in my case its our nfs/server in a zone not returning them in some cases
<jclulow> so that it would at least work
<richlowe> wait, is it in nfs _in specific_ doing something weird?
<richlowe> or does that just happen to be what screws you, but anything could?
<jclulow> I believe if you have a process in a zone that accepts a connection but never reads from it
<jclulow> you would end up in the same boat
<jclulow> nothing releases that memory back, it'll just sit in your recvq
<sjorge> not nfs specific
<jclulow> The complex but complete solution would be to claw them back from queues
<jclulow> But the absolute minimum bandaid here is to force a copy instead of a loan in this case
<jclulow> so that it just never happens
<jbk> jclulow: at least from what I recall, and I think is in the ticket, even a process using port_get() or poll() to wait to read can trigger it (as it seems that somehow we're not signaling when data is there, so the mblk_ts keep getting linked) 

A possible bandaid could be a new tunable to always force the copy, as suggested by jclulow.
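If such a tunable materializes, the bandaid would presumably end up being a single /etc/system line along these lines (the parameter name below is purely hypothetical; see #15271 for the actual work):

```
* Hypothetical: force viona to bcopy TX buffers instead of loaning them
set viona:viona_force_copy_tx = 1
```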

For a full IRC log see

Actions #9

Updated by Jorge Schrauwen 6 months ago

  • Related to Feature #15271: Want tunable to force viona to always copy tx buffers added
