Bug #13463
bhyve locks up when viona ring buffers on loan to a local zone are not returned by sockfs
Description
Since neither jbk nor I can seem to locate the earlier ticket, I'm filing this one.
I've been suffering from an issue for a while where bhyve would completely lock up.
Giving it a forceful shutdown would leave the zone (SmartOS) stuck in shutting_down. Sometimes waiting a very long time makes the VM unfreeze, presumably once the buffers are eventually returned, but this is not always the case.
Hopefully jbk can fill in what he remembers from this, as he did look at it and found some things.
I can trigger this with the following setup with a high repeatability rate:
.---- [ cn1 ]-----------------.
|  .-[bhyve_vm1]---.          |
|  | OS: FreeBSD   |<===============> [switch] <===> [modem]
|  | Role: fw      |          |
|  '---------------'          |
|          ^                  |
|          +                  |
|          v                  |
|  .-[native zone]---.        |
|  | OS: native      |        |
|  | Image: pkgbuild |        |
|  '-----------------'        |
'-----------------------------'
There is a caching service in the pkgbuild zone that caches the packages used when running the sandbox to do builds. It seems to be very good at holding onto the ring buffers, so building a larger package like samba usually stalls the VM.
Since the VM is acting as a firewall and forwarding traffic, the packages downloaded by the pkgbuild zone come from one of the viona interfaces in the VM.
Updated by Jorge Schrauwen over 2 years ago
Adding some context from yesterday on IRC when this came up again
17:12 sjorge: jbk had anymore time to look at the viona <-> sockfs not returning the buffer issue?
17:12 sjorge: I ran into it again today when doing an scp
17:36 jbk: no i haven't
18:04 jbk: this problem is something with either tcp or sockfs + mac loopback path
18:04 sjorge: jbk ack
18:05 jbk: basically packets from viona->non-viona zone on the same box get queued up in the netstack of the non-viona zone
18:05 jbk: since they're loaned from viona
18:05 jbk: once viona runs out of available descriptors, tx from the bhyve instance stops
18:07 jbk: yeah, i could kinda of reproduce it with scp.. and after a while some timer appears to kick in which unclogs things.. but for non-scp stuff, that doesn't seem to happen
18:56 LeftWing: Don't you just need a simple TCP server in the zone which maxes out the socket buffer size and doesn't read anything that it gets sent?
18:57 LeftWing: If it's a lending issue
19:32 jbk: well the problem is there's supposed to be a 'please stop sending me stuff' signal that appears to get ignored
19:33 jbk: which if that was working, i think would at least stop things before the senders descriptors were exhausted, allowing other TX traffic to proceed
19:35 jbk: there's also probably an argument to be made that when stuff is lent, there should be a way for the receiver to know and decide 'hey it's going to be a while, let me copy it and release the loaned resource'.. though how to determine when to do that gets tricky
19:35 jbk: we (amusingly) have some old code (I think related to zero copy) to deal with that on the outbound side, but not for receive
19:52 pmooney: it's a pretty sticky problem, IMO
19:53 pmooney: since normally there's no impact for those long lending times
19:53 pmooney: I'm not sure anything besides viona would benefit from a system for copying those buffers after a spell
19:53 LeftWing: Could we just stop lending once you've leant half of the available resources
19:54 pmooney: we could
19:54 pmooney: but it's still visible from the guest
19:55 pmooney: if I ask a NIC to send a packet, and it doesn't for 30 seconds or more...
19:55 LeftWing: Yeah but at least it wouldn't entirely stall
19:56 rmustacc: With most NIC drivers we reserve an upfront amount of descriptors for loaning, fwiw.
19:57 rmustacc: And fall back to alloc/bcopy when we run out.
19:57 pmooney: LeftWing: And there's still the issue of VM shutdown
19:58 LeftWing: I guess you can't just leave the relevant pages around with the current architecture
19:59 pmooney: definitely not today
19:59 pmooney: even if you could, teardown of viona gets a little weird
20:00 pmooney: closing the rings down with outstand mblks which expect to make transmit-complete notification callbacks would be a little grimy
20:01 pmooney: *outstanding
20:02 pmooney: one would certainly need to take exquisite care to not leak anything (or use-after-free) if mblks were allowed to be left hanging like that
20:02 rmustacc: grimy subsystems is my second middle name. I suspect we can figure out a way to make it work.
20:02 pmooney: well, I think a lot of that grime woudl fall to viona
20:03 pmooney: wouldn't shock me if the ultimate fix was to just bcopy in the case of loopback :-/
20:03 pmooney: we already had to undo the checksum offload for loopback
20:03 pmooney: *checksum ellision
20:03 rmustacc: Yes, but I think that'll change over time.
20:04 rmustacc: There's a big difference between opt-in versus opt-out.
20:04 pmooney: perhaps!
20:04 rmustacc: I mean, there is a long history of IP opt-in already for this stuff.
22:08 sjorge: So far the pkgsrc cache thing in a pkgbuild zone triggers it and from what I can tell in today's case... a cherrypy webserver after sending it a few requests.
22:09 sjorge: Killing the zone fixed the issue, killing just the bhyve vm... make it unable to boot again, probably as hinted above because the viona thingy is not torn down yet
22:09 sjorge: I'd defiantly personally
22:10 sjorge: Opt for less performance over not having to deal with this.
22:10 sjorge: Max I waited was about and hour and it didn't continue... so I assume some magic timeout does not fix it in all casss
23:05 LeftWing: I wonder if we could first add a tunable that forces copies across the board here
23:05 LeftWing: And you could tune it on
23:05 LeftWing: Without hopefully a great deal of upheaval
23:06 LeftWing: We should be correct before we are fast
23:06 LeftWing: Is there a bug filed for the hang/stall
23:54 jbk: i thought so.. let me see if i can find it
I spent a bit of time cleaning this up, only to remember I can just link it.
Updated by Robert Mustacchi over 2 years ago
- Project changed from site to illumos gate
Updated by Michael Zeller over 2 years ago
- Tags changed from byve, viona to bhyve, viona
Updated by Michael Zeller over 2 years ago
We have an internal ticket filed for the customer that hit this particular issue. These are some relevant comments posted in that ticket from when jbk and I took a look at this a while back:
We were able to track down a bunch of the {{viona_desb_t}} structures from the viona ring.
> ::walk streams_dblk_esb e | ::print dblk_t db_frtnp | ::print -t frtn_t free_func | ::grep ".==fffffffff853d230" | ::eval '<e::print dblk_t db_frtnp | ::print -t frtn_t free_arg | ::print -t viona_desb_t d_ring | ::grep ".==fffffd6bb6f288f8"' ! wc -l
1388
So we know the reason the zone isn't shutting down is because the zone's viona instance has loaned out transmit buffers and is waiting to get them back. Our previous efforts were working on trying to track down what else in the kernel is holding on to those buffers.
> fffffd6b73590db8::print viona_vring_t
{
    vr_link = 0xfffffd6b73590c40
    vr_lock = { _opaque = [ 0 ] }
    vr_cv = { _opaque = 0x1 }
    vr_state = 0x4
    vr_state_flags = 0
    vr_xfer_outstanding = 0x2b6
    vr_worker_thread = 0xfffffd6afb0dec00
    vr_lease = 0xfffffd6be088ec30
    vr_txdesb = 0xfffffd6b3c0f7000
    vr_txiov = 0xfffffd6b16da3000
    vr_intr_enabled = 0
    vr_msi_addr = 0xfee00000
    vr_msi_msg = 0x40b1
    vr_a_mutex = { _opaque = [ 0 ] }
    vr_u_mutex = { _opaque = [ 0 ] }
    vr_pa = 0x42b648000
    vr_size = 0x400
    vr_mask = 0x3ff
    vr_cur_aidx = 0x7e1a
    vr_descr = 0xfffff95369848000
    vr_avail_flags = 0xfffff9536984c000
    vr_avail_idx = 0xfffff9536984c002
    vr_avail_ring = 0xfffff9536984c004
    vr_avail_used_event = 0xfffff9536984c804
    vr_used_flags = 0xfffff9536984d000
    vr_used_idx = 0xfffff9536984d002
    vr_used_ring = 0xfffff9536984d004
    vr_used_avail_event = 0xfffff9536984f004
    vr_stats = {
        rs_ndesc_too_high = 0
        rs_bad_idx = 0
        rs_indir_bad_len = 0
        rs_indir_bad_nest = 0
        rs_indir_bad_next = 0
        rs_no_space = 0
        rs_too_many_desc = 0
        rs_desc_bad_len = 0
        rs_bad_ring_addr = 0
        rs_fail_hcksum = 0
        rs_fail_hcksum6 = 0
        rs_fail_hcksum_proto = 0
        rs_bad_rx_frame = 0
        rs_rx_merge_overrun = 0
> 0xfffffd6b3c0f7000::print viona_desb_t
{
    d_frtn = { free_func = viona_desb_release free_arg = 0xfffffd6b3c0f7000 }
    d_ring = 0xfffffd6b73590db8
    d_ref = 0x3
    d_len = 0x1ee
    d_cookie = 0
    d_headers = 0xfffffd6bd5184518
}
Helpfully, the first buffer in the ring appears to be loaned out ({{d_ref > 0}}). The buffer is used to hold the headers for the packets, so let's see what has a copy of it.
> 0xfffffd6bd5184518::kgrep
fffffd687d3d9388
fffffd6b3c0f7028
The second one is viona's own {{d_headers}} pointer inside the {{viona_desb_t}}, so the first one is the more interesting.
> fffffd687d3d9388::whatis
fffffd687d3d9388 is fffffd687d3d9380+8, allocated from streams_dblk_esb
So that's a {{dblk_t}} pointing to that buffer as {{db_base}}:
> fffffd687d3d9380::print dblk_t
{
    db_frtnp = 0xfffffd6b3c0f7000
    db_base = 0xfffffd6bd5184518
    db_lim = 0xfffffd6bd51845a2
    db_ref = 0x1
    db_type = 0
    db_flags = 0
    db_struioflag = 0
    db_cpid = 0xffffffff
    db_cache = 0xfffffd64e605a008
    db_mblk = 0xfffffdace8a9dda0
    db_free = dblk_lastfree_desb
    db_lastfree = dblk_lastfree_desb
    db_cksumstart = 0x14
    db_cksumend = 0x1d4
    db_cksumstuff = 0x24
    db_struioun = {
        enforce_alignment = +2.0658925e-281
        data = [ 0, 0, 0xdd, 0xba, 0x8, 0, 0xa8, 0x5 ]
        cksum = { cksum_val = { u32 = 0xbadd0000 u16 = 0 } flags = 0x8 pad = 0x5a8 }
    }
    db_fthdr = 0
    db_credp = 0
}
Every {{dblk_t}} should be part of an {{mblk_t}}, so let's find it:
> fffffd687d3d9380::kgrep
fffffdace8a9ddc8
> fffffdace8a9ddc8::whatis
fffffdace8a9ddc8 is fffffdace8a9dda0+28, allocated from streams_mblk
> fffffdace8a9dda0::print mblk_t
{
    b_next = 0
    b_prev = 0
    b_cont = 0xfffffdcb17d88260
    b_rptr = 0xfffffd6bd518455a
    b_wptr = 0xfffffd6bd51845a2
    b_datap = 0xfffffd687d3d9380
    b_band = 0
    b_tag = 0
    b_flag = 0
    b_queue = 0
}
This all seems correct -- {{b_cont}} is non-NULL, which makes sense if the data buffer we're tracking down holds the headers (one {{dblk_t}} for the headers, and another for the data from the guest).
So now that we have the {{mblk_t}}, let's see what's referencing it:
> fffffdace8a9dda0::kgrep
fffffd687d3d93a8
fffffd6ddd66eb10
The first one is the reference to the {{mblk_t}} from the {{dblk_t}}. The second is:
> fffffd6ddd66eb10::whatis
fffffd6ddd66eb10 is fffffd6ddd66eb00+10, allocated from streams_mblk
> fffffd6ddd66eb00::kgrep
fffffdce90c55bd0
> fffffdce90c55bd0::whatis
fffffdce90c55bd0 is fffffdce90c55bc0+10, allocated from streams_mblk
> fffffdce90c55bc0::kgrep
fffffdcc8fd30170
> fffffdcc8fd30170::whatis
fffffdcc8fd30170 is fffffdcc8fd30160+10, allocated from streams_mblk
> fffffdcc8fd30160::kgrep
fffffd6875809228
fffffdcc17617f50
> fffffd6875809228::whatis
fffffd6875809228 is fffffd6875809200+28, allocated from streams_dblk_esb
> fffffdcc17617f50::whatis
fffffdcc17617f50 is fffffdcc17617f40+10, allocated from streams_mblk
> fffffd6835ed5228::whatis
fffffd6835ed5228 is fffffd6835ed5200+28, allocated from streams_dblk_esb
> fffffe0f720c23b0::whatis
fffffe0f720c23b0 is fffffe0f720c23a0+10, allocated from streams_mblk
> fffffe0f720c23a0::kgrep
fffffecef10bbf70
> fffffecef10bbf70::whatis
fffffecef10bbf70 is fffffecef10bbf60+10, allocated from streams_mblk
> fffffecef10bbf60::kgrep
fffffe0c706b4bb0
> fffffe0c706b4bb0::whatis
fffffe0c706b4bb0 is fffffe0c706b4ba0+10, allocated from streams_mblk
> fffffe0c706b4ba0::kgrep
fffffd66ef8574a8
fffffdcbdca99f70
> fffffd66ef8574a8::whatis
fffffd66ef8574a8 is fffffd66ef857480+28, allocated from streams_dblk_esb
> fffffdcbdca99f70::whatis
fffffdcbdca99f70 is fffffdcbdca99f60+10, allocated from streams_mblk
> fffffdcbdca99f60::kgrep
fffffd682c6247a8
fffffda90b5b51d0
> fffffd682c6247a8::whatis
fffffd682c6247a8 is fffffd682c624780+28, allocated from streams_dblk_esb
> fffffda90b5b51d0::whatis
fffffda90b5b51d0 is fffffda90b5b51c0+10, allocated from streams_mblk
> fffffda90b5b51c0::kgrep
fffffda90b5b5190
> fffffda90b5b5190::whatis
fffffda90b5b5190 is fffffda90b5b5180+10, allocated from streams_mblk
> fffffda90b5b5180::kgrep
fffffdcb358868b0
> fffffdcb358868b0::whatis
fffffdcb358868b0 is fffffdcb358868a0+10, allocated from streams_mblk
> fffffdcb358868a0::kgrep
fffffd67abfba728
fffffecf991f9b90
> fffffd67abfba728::whatis
fffffd67abfba728 is fffffd67abfba700+28, allocated from streams_dblk_esb
> fffffecf991f9b90::whatis
fffffecf991f9b90 is fffffecf991f9b80+10, allocated from streams_mblk
> fffffecf991f9b80::kgrep
fffffd683ec4ada8
fffffeb6f6d5f0b0
> fffffd683ec4ada8::whatis
fffffd683ec4ada8 is fffffd683ec4ad80+28, allocated from streams_dblk_esb
> fffffeb6f6d5f0b0::whatis
fffffeb6f6d5f0b0 is fffffeb6f6d5f0a0+10, allocated from streams_mblk
Trying to find a better way.. doing this seems to have sped up the process:
> ::walk streams_mblk | ::print -a mblk_t b_cont ! cat > out
I then used a bit of Python to process the output:
#!/opt/local/bin/python
import sys
data = {}
f = open(sys.argv[1], 'r')
for line in f:
    parts = line.split(' ')
    mblk = int(parts[0], 16) - 16
    cont = int(parts[3], 16)
    if cont == 0:
        continue
    data[cont] = mblk
next_addr = int(sys.argv[2], 16)
prev = next_addr
while next_addr in data:
    prev = next_addr
    next_addr = data[next_addr]
print ('0x', hex(prev))
(not the greatest, but it did its job)
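For reference, it was run roughly like this ({{walk_cont.py}} is just an illustrative name for the script above; the arguments are the output file from the ::walk command and the {{mblk_t}} being chased):

$ ./walk_cont.py out fffffdace8a9dda0
0x 0xfffffe2ec8245780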
That gave {{0xfffffe2ec8245780}} as the first {{mblk_t}} in the (extra-long) packet. Double checking:
> 0xfffffe2ec8245780::whatis
fffffe2ec8245780 is allocated from streams_mblk
> fffffe2ec8245780::list mblk_t b_cont | ::grep '.==fffffdace8a9dda0'
fffffdace8a9dda0
So walking the {{b_cont}} pointers forward from that {{mblk_t}} gets us back to our loaned-out viona buffer. It actually stopped one too short (a bug in the Python script): the first {{mblk_t}} is {{fffffdd15c24c7d0}} (from grepping the output file).
Looking at that
> fffffdd15c24c7d0::print mblk_t
{
    b_next = 0xfffffe2ec8245780
    b_prev = 0xfffffd6cc59abbb6
    b_cont = 0xfffffd6cc59ac15e
    b_rptr = 0xfffffd6cc59abb00
    b_wptr = 0xbaddcafe00000000
    b_datap = 0
    b_band = 0x80
    b_tag = 0x82
    b_flag = 0x7bf3
    b_queue = 0xfffffdcc945b4a00
}
Suggesting things have been leaked...
Updated by Michael Zeller over 2 years ago
jbk:
When a network interface for a bhyve VM is created, it is created with a fixed number of TX resources (TX descriptors) (this is a limitation of the virtio spec, nothing particular to us) – we happen to use 1024. Each transmitted packet consumes one of these TX resources until the underlying NIC completes transmitting the packet and releases the resource for reuse. If no TX descriptors are available, then the virtual NIC in the bhyve instance cannot transmit. The Bhyve instance also cannot complete shutdown until all of those TX descriptors have been released.
What's happening here is that instead of the traffic going out a physical NIC, the destination IP is another zone on the same machine. The kernel detects this and instead sends the packets directly to the network stack of the destination zone, bypassing the physical NIC altogether. At that point the packet is sent up the destination's network stack. As a result, the packet (and the TX descriptor for it) are tied up until the data is read by the destination zone (or the socket is closed, or the destination zone is rebooted/shutdown). If all of the TX descriptors get tied up in this manner, that will prevent the source bhyve instance from transmitting (or shutting down) until they are released.
We don't have a fix at the moment, but a workaround would be to restart whatever process is holding up the TX descriptors. It takes a bit of digging in mdb to locate the process, so there's not an easy script I can suggest; however, since they haven't encountered it since then, it seems rare enough that we could walk them through it if it happens again before we can get a fix written.
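To make the failure mode concrete, here is a minimal sketch of the kind of receiver that can pin the loaned buffers, along the lines of LeftWing's "TCP server that never reads" suggestion in the IRC log above. It is illustrative only: the listening port and placement are assumptions, and any slow or stalled reader (the pkgsrc cache, an NFS server, a cherrypy app) has the same effect.

#!/opt/local/bin/python
# Hypothetical reproduction aid: accept a TCP connection and never read from it.
# Run inside the non-viona zone and point a sender in the bhyve VM at it; data
# queued on this socket stays in the zone's netstack and, over the MAC loopback
# path, keeps the loaned viona TX descriptors outstanding.
import socket
import time

LISTEN_PORT = 9999  # illustrative port, not from the ticket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", LISTEN_PORT))
srv.listen(1)

conn, peer = srv.accept()
print("accepted connection from %s; never reading from it" % (peer,))

# Hold the connection open indefinitely without ever calling recv(), so the
# kernel keeps the received mblks queued on the socket.
while True:
    time.sleep(60)

Left running against a sender in the bhyve VM, each queued segment keeps a viona TX descriptor outstanding; once all of them (1024 by default, per the comment above) are out on loan, the guest's virtual NIC stops transmitting.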
Updated by Jason King over 2 years ago
Some additional notes from prior crash dumps:
> ::zone
ADDR             ID STATUS NAME                PATH
fffffffffbfeb540  0 ???    global              ??
fffffe59944a9c40  1 ???    eceec5eb-22c7-e6... ??
fffffe59d1b20d80  2 ???    99e40ee7-a8f9-4b... ??
fffffe5d67b39400  3 ???    905aa94c-5bb0-6e... ??
The {{99e40...}} zone is the bhyve zone and {{905aa...}} is the pkgsrc zone (the hang is triggered by loaned viona tx buffers from the bhyve zone to the pkgsrc zone). The address of the {{zone_t}}s will be useful in a bit...
Let's find an outstanding viona_desb_t.
> ::stacks -c viona_worker_tx
THREAD           STATE    SOBJ                COUNT
fffffe59912c5060 SLEEP    CV                      5
                 swtch+0x16b
                 cv_wait_sig+0x19c
                 viona_worker_tx+0x1a8
                 viona_worker+0x1cd
                 thread_start+0xb

fffffe59a42d0440 SLEEP    CV                      1
                 swtch+0x16b
                 cv_wait+0x89
                 viona_tx_wait_outstanding+0x43
                 viona_worker_tx+0x1eb
                 viona_worker+0x1cd
                 thread_start+0xb

> fffffe59a42d0440::findstack -v
stack pointer for thread fffffe59a42d0440 (/48 [viona_tx_fffffe5c2c826d78]): fffffe007be24d90
[ fffffe007be24d90 _resume_from_idle+0x12b() ]
  fffffe007be24dc0 swtch+0x16b()
  fffffe007be24df0 cv_wait+0x89(fffffe5c2c826d88, fffffe5c2c826d80)
  fffffe007be24e30 viona_tx_wait_outstanding+0x43(fffffe5c2c826d78)
  fffffe007be24e90 viona_worker_tx+0x1eb(fffffe5c2c826d78, fffffe5c2c826c00)
  fffffe007be24ee0 viona_worker+0x1cd(fffffe5c2c826d78)
  fffffe007be24ef0 thread_start+0xb()
> fffffe5c2c826d78::print viona_vring_t vr_txdesb
vr_txdesb = 0xfffffe5d09923000
> 0xfffffe5d09923000::print viona_desb_t
{
    d_frtn = { free_func = viona_desb_release free_arg = 0xfffffe5d09923000 }
    d_ring = 0xfffffe5c2c826d78
    d_ref = 0x3
    d_len = 0x5ee
    d_cookie = 0
    d_headers = 0xfffffe5d08dc58b0
}
{{d_ref > 0}} means this viona descriptor is in use. The first ref is unimportant for this; the additional refs are for each {{mblk_t}} that references this (one {{mblk_t}} references a pre-allocated buffer for the headers ({{db_base == d_headers}}), the other is used to refer to the mapped tx data from the guest).
> 0xfffffe5d09923000::kgrep | ::whatis
...
fffffe5d84f46900 is allocated from streams_dblk_esb:
ADDR             BUFADDR          TIMESTAMP    THREAD           CACHE            LASTLOG          CONTENTS
fffffe5d90bfc220 fffffe5d84f46900 e3dc101f9af1 fffffe59a42d0440 fffffe595fe0e008 fffffe58e63d1780 0
                 kmem_cache_alloc_debug+0x2fc
                 kmem_cache_alloc+0x25b
                 gesballoc+0x65
                 desballoc+0x11d
                 viona_tx+0x2ad
                 viona_worker_tx+0xb3
                 viona_worker+0x1cd
                 thread_start+0xb

fffffe5d84f469c0 is allocated from streams_dblk_esb:
ADDR             BUFADDR          TIMESTAMP    THREAD           CACHE            LASTLOG          CONTENTS
fffffe5d90bfc148 fffffe5d84f469c0 e3dc101f94fd fffffe59a42d0440 fffffe595fe0e008 fffffe58e63d1600 0
                 kmem_cache_alloc_debug+0x2fc
                 kmem_cache_alloc+0x25b
                 gesballoc+0x65
                 desballoc+0x11d
                 viona_tx+0x16c
                 viona_worker_tx+0xb3
                 viona_worker+0x1cd
                 thread_start+0xb
...
> fffffe5d84f469c0::print dblk_t
{
    db_frtnp = 0xfffffe5d09923000
    db_base = 0xfffffe5d08dc58b0
    db_lim = 0xfffffe5d08dc593a
    db_ref = 0x1
    db_type = 0
    db_flags = 0
    db_struioflag = 0
    db_cpid = 0xffffffff
    db_cache = 0xfffffe595fe0e008
    db_mblk = 0xfffffe5995f686c0
    db_free = dblk_lastfree_desb
    db_lastfree = dblk_lastfree_desb
    db_cksumstart = 0xbaddcafebaddcafe
    db_cksumend = 0xbaddcafebaddcafe
    db_cksumstuff = 0xbaddcafebaddcafe
    db_struioun = {
        enforce_alignment = -3.7481636e-25
        data = [ 0xfe, 0xca, 0xdd, 0xba, 0, 0, 0xdd, 0xba ]
        cksum = { cksum_val = { u32 = 0xbaddcafe u16 = 0xcafe } flags = 0 pad = 0xbadd }
    }
    db_fthdr = 0
    db_credp = 0
}
> fffffe5d84f469c0::print dblk_t db_base
db_base = 0xfffffe5d08dc58b0
> fffffe5d84f46900::print dblk_t db_base
db_base = 0xfffffe02e6ca709e
Since {{fffffe5d84f46900}}'s {{db_base != d_headers}}, it contains data, so we'll use the {{mblk_t}} for this {{dblk_t}}.
> fffffe5d84f46900::print dblk_t db_mblk
db_mblk = 0xfffffe5995f68660
Let's find this {{mblk_t}}. Assume it's somewhere in the destination zone's netstack:
> fffffe5d67b39400::print zone_t zone_netstack
zone_netstack = 0xfffffe5d672863c0
> 0xfffffe5c2bf35dc0::walk tcp_conn_cache s | ::print conn_t conn_upper_handle | ::print sonode_t so_rcv_head | ::walk b_cont | ::grep '.==0xfffffe5995f68660' | ::eval '<s::print conn_t'
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
{
    conn_lock = { _opaque = [ 0 ] }
    conn_ref = 0x3
    conn_flags = 0x40001001
    conn_proto_priv = {
        cp_tcp = 0xfffffe5d9019aac0
        cp_udp = 0xfffffe5d9019aac0
        cp_icmp = 0xfffffe5d9019aac0
        cp_rts = 0xfffffe5d9019aac0
        cp_iptun = 0xfffffe5d9019aac0
        cp_sctp = 0xfffffe5d9019aac0
        cp_priv = 0xfffffe5d9019aac0
    }
    conn_cv = { _opaque = 0 }
    conn_proto = 0x6
    conn_recv = tcp_input_data
    conn_recvicmp = tcp_icmp_input
    conn_verifyicmp = tcp_verifyicmp
    conn_ixa = 0xfffffe598add4000
    conn_recv_ancillary = {
        crbu = {
            crbu_all = 0
            crbb = {
                crbb_recvdstaddr = 0
                crbb_recvopts = 0
                crbb_recvif = 0
                crbb_recvslla = 0
                crbb_recvttl = 0
                crbb_ip_recvpktinfo = 0
                crbb_ipv6_recvhoplimit = 0
                crbb_ipv6_recvhopopts = 0
                crbb_ipv6_recvdstopts = 0
                crbb_ipv6_recvrthdr = 0
                crbb_old_ipv6_recvdstopts = 0
                crbb_ipv6_recvrthdrdstopts = 0
                crbb_ipv6_recvtclass = 0
                crbb_recvucred = 0
                crbb_timestamp = 0
            }
        }
    }
    conn_sqp = 0xfffffe597b084900
    conn_state_flags = 0
    conn_lingertime = 0
    conn_on_sqp = 0
    conn_linger = 0
    conn_useloopback = 0
    conn_broadcast = 0
> 0xfffffe5c2bf35dc0::walk tcp_conn_cache s | ::print conn_t conn_upper_handle | ::print sonode_t so_rcv_head | ::walk b_cont | ::grep '.==0xfffffe5995f68660' | ::eval '<s=J'
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
mdb: failed to read pointer at 148: no mapping for address
mdb: failed to print type: no mapping for address
fffffe5d9019a2c0
> fffffe5d9019a2c0::print conn_t conn_upper_handle | ::print sonode_t so_rcv_queued
so_rcv_queued = 0x1565a0
> fffffe5d9019a2c0::print conn_t conn_upper_handle | ::print sonode_t
{
    so_vnode = 0xfffffe5d7038e700
    so_ops = so_sonodeops
    so_priv = 0
    so_fallback_rwlock = { _opaque = [ 0 ] }
    so_lock = { _opaque = [ 0 ] }
    so_state_cv = { _opaque = 0 }
    so_single_cv = { _opaque = 0 }
    so_read_cv = { _opaque = 0 }
    so_state = 0x800081
    so_mode = 0xc28
    so_flag = 0
    so_count = 0x1
    so_proto_connid = 0x1
    so_error = 0
    so_sockparams = 0xfffffe5989eba9c0
    so_family = 0x2
    so_type = 0x2
    so_protocol = 0
    so_version = 0x4
    so_acceptq_lock = { _opaque = [ 0 ] }
    so_acceptq_list = {
        list_size = 0x290
        list_offset = 0xa8
        list_head = { list_next = 0xfffffe5c048dcb18 list_prev = 0xfffffe5c048dcb18 }
    }
    so_acceptq_defer = {
        list_size = 0x290
        list_offset = 0xa8
        list_head = { list_next = 0xfffffe5c048dcb38 list_prev = 0xfffffe5c048dcb38 }
    }
    so_acceptq_node = { list_next = 0 list_prev = 0 }
    so_acceptq_len = 0
    so_backlog = 0
    so_acceptq_cv = { _opaque = 0 }
    so_listener = 0
    so_options = 0
    so_linger = { l_onoff = 0 l_linger = 0 }
    so_xpg_rcvbuf = 0
    so_sndtimeo = 0
    so_rcvtimeo = 0
    so_oobmsg = 0
    so_oobmark = 0
    so_pgrp = 0
    so_peercred = 0
    so_cpid = 0xbaddcafe
    so_zoneid = 0x3
    so_poll_list = { ph_list = 0xfffffe5d8bad7400 ph_pad1 = 0 ph_pad2 = 0 }
    so_pollev = 0
    so_rcv_queued = 0x1565a0
    so_rcv_q_head = 0
    so_rcv_q_last_head = 0
    so_rcv_head = 0xfffffe595fe4ade0
    so_rcv_last_head = 0xfffffe595fe4ade0
    so_rcv_cv = { _opaque = 0 }
    so_rcv_wanted = 0
    so_rcv_timer_tid = 0
    so_snd_cv = { _opaque = 0 }
    so_snd_qfull = 0
    so_rcv_wakeup = 0
    so_snd_wakeup = 0
    so_not_str = 0x1
    so_pad_to_bit_31 = 0xaddcafe
    so_proto_handle = 0xfffffe5d9019a2c0
    so_downcalls = sock_tcp_downcalls
    so_proto_props = {
        sopp_flags = 0
        sopp_wroff = 0x78
        sopp_txhiwat = 0
        sopp_txlowat = 0
        sopp_rxhiwat = 0x1004a0
        sopp_rxlowat = 0x400
        sopp_maxblk = 0xffffffffffffffff
        sopp_maxpsz = 0x1f4a0
        sopp_minpsz = 0
        sopp_tail = 0
        sopp_zcopyflag = 0
        sopp_oobinline = 0 (0)
        sopp_rcvtimer = 0x32
        sopp_rcvthresh = 0x1800
        sopp_maxaddrlen = 0x20
        sopp_loopback = 0 (0)
    }
    so_flowctrld = 0x1 (B_TRUE)
    so_copyflag = 0
    so_copy_cv = { _opaque = 0 }
    so_ksock_callbacks = {
        ksock_cb_flags = 0
        ksock_cb_connected = 0
        ksock_cb_connectfailed = 0
        ksock_cb_disconnected = 0
        ksock_cb_newdata = 0
        ksock_cb_newconn = 0
        ksock_cb_cansend = 0
        ksock_cb_oobdata = 0
        ksock_cb_cantsendmore = 0
        ksock_cb_cantrecvmore = 0
        ksock_cb_error = 0
    }
    so_ksock_cb_arg = 0
    so_closing_cv = { _opaque = 0 }
    so_direct = 0
    so_filter_active = 0
    so_filter_tx = 0
    so_filter_top = 0
    so_filter_bottom = 0
    so_filter_defertime = 0
    so_krecv_cb = 0
    so_krecv_arg = 0
}
So this socket seems to be added to an event port:
> 0xfffffe5d8bad7400::print 'struct polldat'
{
    pd_fd = 0x18
    pd_events = 0x1
    pd_fp = 0xfffffe5d67099638
    pd_php = 0xfffffe5c048dcbb8
    pd_thread = 0
    pd_pcache = 0xfffffe5d585011c8
    pd_next = 0
    pd_hashnext = 0
    pd_count = 0
    pd_nsets = 0
    pd_ref = 0
    pd_portev = 0xfffffe5c1c365150
    pd_gen = 0
    pd_epolldata = 0
}
> 0xfffffe5c1c365150::print port_kevent_t
{
    portkev_lock = { _opaque = [ 0 ] }
    portkev_source = 0x4
    portkev_events = 0x1
    portkev_flags = 0x2
    portkev_pid = 0x1a8f4
    portkev_object = 0x18
    portkev_user = 0
    portkev_callback = port_fd_callback
    portkev_arg = 0xfffffe5d8bad7400
    portkev_port = 0xfffffe5d59018d00
    portkev_node = { list_next = 0 list_prev = 0 }
}

and the process is blocked in port_getn for that event port (w/ *nget == 1):

> fffffe5d81f8efd8::walk thread | ::findstack -v
stack pointer for thread fffffe5c0c041080 (/1): fffffe007cf4bbd0
[ fffffe007cf4bbd0 _resume_from_idle+0x12b() ]
  fffffe007cf4bc00 swtch+0x16b()
  fffffe007cf4bc80 cv_wait_sig_swap_core+0x182(fffffe5cc8957a64, fffffe5d59018d60, 0)
  fffffe007cf4bca0 cv_wait_sig_swap+0xd(fffffe5cc8957a64, fffffe5d59018d60)
  fffffe007cf4bd20 cv_waituntil_sig+0xd5(fffffe5cc8957a64, fffffe5d59018d60, 0, fffffe5d)
  fffffe007cf4be20 port_getn+0x198(fffffe5d59018d00, fffffc7fffdf9840, 400, fffffe007cf4be44, fffffe007cf4be60)
  fffffe007cf4bef0 portfs+0x2a0(6, f, fffffc7fffdf9840, 400, 1, 0)
  fffffe007cf4bf00 sys_syscall+0x2ae()
...
> fffffe007cf4be44/U
0xfffffe007cf4be44:             1
why didn't the process wake up?
Updated by Jason King over 2 years ago
Where I was going to look next (but haven't had an opportunity) is how notifications happen for port_get(3C), poll(2), and read(2), and to observe both the working case and (if I could recreate the hang on a test system) the non-working case. It seems like a notification is either getting lost or not generated -- while a process is waiting for data to read, and there's plenty of data available to read, the process isn't being woken up.
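For illustration (not from the ticket), here is a minimal sketch of the kind of observation described above: a poll(2)-based reader one could run against the stalled socket in the destination zone. If mdb shows data queued on the socket ({{so_rcv_queued > 0}}) while this loop never sees POLLIN, that points at a lost or never-generated wakeup. The port number is an assumption.

#!/opt/local/bin/python
# Hypothetical observation aid: wait for readability with poll(2) and report
# how long each wakeup takes.  If data is known to be queued (so_rcv_queued
# in mdb) but poll() keeps timing out, the wakeup/notification path is suspect.
import select
import socket
import time

LISTEN_PORT = 9998  # illustrative, not from the ticket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("0.0.0.0", LISTEN_PORT))
srv.listen(1)
conn, peer = srv.accept()

poller = select.poll()
poller.register(conn.fileno(), select.POLLIN)

while True:
    start = time.time()
    events = poller.poll(30 * 1000)  # 30 second timeout, in milliseconds
    waited = time.time() - start
    if not events:
        print("no POLLIN after %.1fs despite possibly queued data" % waited)
        continue
    data = conn.recv(65536)
    print("POLLIN after %.3fs, read %d bytes" % (waited, len(data)))
    if not data:
        break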
Updated by Jorge Schrauwen 6 months ago
I'm still hitting this every so often and this issue got brought up again in #illumos yesterday, so summarizing here.
[17:19:51] <sjorge> jbk / papertigers would the tcp CC have any effect on when a zone/socket would release it's held buffers?
[17:20:22] <sjorge> I switched to cubic on the host side and switched the linux VMs back cubic (had the VMs on newreno to match the native stuff)
[17:20:37] <sjorge> I seem to hit the viona starvation less
[17:20:44] <sjorge> Still hit it, though
[17:20:48] <sjorge> Not sure if placebo or not
[18:13:54] <jclulow> I would have to guess that it's a placebo
[18:14:01] <jclulow> But anything is possible
The original reason for bringing this up follows; it is probably not related to why I haven't hit the issue in a whole week (which is rare for me).
[22:51:09] <sjorge> is the 16k a virtio max or is that something on our side?
[22:51:17] <sjorge> (for vqsize)
[22:56:17] <richlowe> that might be a legacy thing
[22:56:27] <richlowe> in that I remember thinking that, pmooney is the expert.
[22:57:45] <richlowe> oh, that's memory from me merging the virtio drivers together, so especially don't trust that.
[22:58:11] <pmooney> max virtqueue length is 32k
[22:58:21] <pmooney> that is, the number of entries in the virtqueue
[22:58:48] <richlowe> the mmio stuff caps at some low number, the pci stuff just lets stuff default I think?
[23:00:17] <pmooney> well and for legacy virtio, the virtqueue size is not negotiated, but rather specified by the device
[23:00:44] <pmooney> so cranking it way up can be a bit onerous for the guest, having to allocate adequate memory for that dictated size
[23:01:50] <richlowe> ok, that'd be what I was seeing and confused by
[23:02:53] <richlowe> at some point I hope someone smart updates us to the current spec, and takes a look at non-pci stuff.
[23:02:59] <richlowe> because I'm trying to get weasel out of it
[23:03:02] <richlowe> s/get//
[23:07:04] <sjorge> i'll see if i can bump it to 32k tomorrow, i'd rather wast memory if it means less issues
My own notes have vqsize for the viona vnics (using zadm on OmniOS these days) set to 16384, which makes it less likely to hit the issue; I had noted that as the maximum I could set it to. It seems that was an error on my part and 32768 is also accepted. I bumped the vqsize for all my VMs to this value; hopefully I'll hit the issue even less frequently now.
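For a rough sense of the guest-side memory cost pmooney mentions (my own arithmetic, assuming the legacy virtio ring layout of 16-byte descriptors plus the avail and used rings): a 32768-entry virtqueue needs 16 * 32768 = 512 KiB for the descriptor table, about 64 KiB for the avail ring (2 bytes per entry) and about 256 KiB for the used ring (8 bytes per entry), so on the order of 832 KiB of guest memory per ring, before counting the packet buffers the descriptors point at.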
[23:15:25] <jclulow> so the VM transmits packets to the local zone, by loaning buffers, basically?
[23:15:58] <sjorge> As far as i understand it, yes
[23:16:11] <jclulow> It really feels like we could at least add an /etc/system tuneable or something to just force it to copy always
[23:16:18] <sjorge> in my case its our nfs/server in a zone not returning them in some cases
[23:16:20] <jclulow> so that it would at least work
[23:18:05] <richlowe> wait, is it in nfs _in specific_ doing something weird?
[23:18:12] <richlowe> or does that just happen to be what screws you, but anything could?
[23:18:47] <jclulow> I believe if you have a process in a zone that accepts a connection but never reads from it
[23:18:50] <jclulow> you would end up in the same boat
[23:19:06] <jclulow> nothing releases that memory back, it'll just sit in your recvq
[23:19:18] <sjorge> not nfs specific
[23:19:24] <jclulow> The complex but complete solution would be to claw them back from queues
[23:19:47] <jclulow> But the absolute minimum bandaid here is to force a copy instead of a loan in this case
[23:19:53] <jclulow> so that it just never happens
[00:33:08] <jbk> jclulow: at least from what I recall, and I think is in the ticket, even a process using port_get() or poll() to wait to read can trigger it (as it seems that somehow we're not signaling when data is there, so the mblk_ts keep getting linked)
A possible bandaid could be a new tunable to always force the copy as suggested by jclulow.
For a full IRC log see https://log.omnios.org/illumos/2022-12-18
Updated by Jorge Schrauwen 6 months ago
- Related to Feature #15271: Want tunable to force viona to always copy tx buffers added