Bug #6576
Kernel panic: panic[cpu18]/thread=2a102549c80: assertion failed: tcp->tcp_ecn_ok || tcp->tcp_state < TCPS_ESTABLISHED, file: ../../common/inet/tcp/tcp_output.c, line: 399
Description
This is a SPARC LDOM running OI151a8 illumos. It crashed during normal operation, without anything special going on.
root@dev:/apppool/data# mdb *0
Loading modules: [ unix genunix specfs dtrace zfs scsi_vhci ldc mac ip hook neti sockfs arp stmf stmf_sbd fctl random idm sppp crypto lofs cpc fcip ufs logindmux ptm nfs ipc ]
> ::status
debugging crash dump vmcore.0 (64-bit) from dev
operating system: 5.11 glasgall_oi_151a_prestable8 (sun4v)
image uuid: 152032db-df1b-671e-b4ec-8aac798a101d
panic message: assertion failed: tcp->tcp_ecn_ok || tcp->tcp_state < TCPS_ESTABLISHED, file: ../../common/inet/tcp/tcp_output.c, line: 399
dump content: kernel pages only
> ::msgbuf
[...]
panic[cpu18]/thread=2a102549c80:
assertion failed: tcp->tcp_ecn_ok || tcp->tcp_state < TCPS_ESTABLISHED, file: ../../common/inet/tcp/tcp_output.c, line: 399

000002a1025482e0 genunix:assfail+28 (7b7479f0, 7b747808, 18f, 2a10254847c, 2, 16)
  %l0-3: 000000006bb5f6f5 00000000019b8400 0040335d1b532b6b 0000000000000480
  %l4-7: 0000000002711637 0000060015669000 0000000000000014 0000030028ed2bd4
000002a102548390 ip:tcp_wput_data+210 (60015669000, 20, 0, 270fef7, 30028ed2ba8, 0)
  %l0-3: 0000060015668a80 000000006bb5c92a 0000000000000000 0000000000000450
  %l4-7: 0000000000000000 00000333e8dc0b80 00000000000005a8 0000000000000034
000002a102548480 ip:tcp_output+f8 (60015668a80, 60017694c80, 60014a3d000, 0, 3a7de800, 0)
  %l0-3: 0000060015669000 00000600156693b0 00000333e8dc0b80 00000000000005a8
  %l4-7: 0000060019fb6d30 0000000000000000 00000000000005a8 00000333e8dc0b80
000002a102548530 ip:squeue_drain+490 (60014a3d000, 4, 8000000000000000, 7fffffffffffffff, 7b73ec88, 7b73ec70)
  %l0-3: 0000000000000000 000000007b73e928 0000000000000000 000002a102548630
  %l4-7: 000000007b73eac8 000000007b73eae0 000000007b73eaf8 0000060014a3d010
000002a1025487c0 ip:squeue_enter+708 (60014a3d000, 600177dfe60, 600177dfe60, 1, 2a102548cd0, 4)
  %l0-3: 0000000000000004 0000060014a3d010 00000600177dfe60 0000000000000001
  %l4-7: 0000000000000000 0000030028ed2ba8 000000007ba4a578 000002a1025498f8
000002a102548880 ip:ip_fanout_v4+1200 (60017feb820, 30028ed2b94, 2a102548cd0, 0, 334a646c000, 40)
  %l0-3: 00000600134e7be8 000000007b740800 0000060015668a80 00000600180f7e00
  %l4-7: 00000600134e7be8 0000000000000006 0000000000000000 0000000000000006
000002a102548980 ip:ip_input_local_v4+10c (60012f1c038, 60017feb820, 30028ed2b94, 2a102548cd0, 2000, 7ba4a470)
  %l0-3: 0000000000000000 00000600134e7be8 0040335d1b51ca71 0000000000000480
  %l4-7: 0000000000000000 000002a102549a88 0000000000000000 0000000000000000
000002a102548a30 ip:ire_recv_local_v4+11c (60012f1c038, 60017feb820, 30028ed2b94, 2a102548cd0, 30028ed2b94, 0)
  %l0-3: 00000600134e7be8 00000600134e7be8 0000060012f1c0e8 0000000000400000
  %l4-7: 0000000000000000 0000000000000000 000000007b740bd0 ffffffffffffffff
000002a102548ae0 ip:ill_input_short_v4+244 (60017feb820, 30028ed2b94, 30028ed2ba4, 2a102548cd0, 2a102548e60, 0)
  %l0-3: 00000600134e7be8 0000030028ed2b94 00000334a61fa000 00000000ac140314
  %l4-7: 0000060012f1c038 000003009bccbd40 0000060011114c20 000002a102548d38
000002a102548c10 ip:ip_input_common_v4+2cc (0, 7b740be8, 80000060, 7b740bb8, 0, 7b740b88)
  %l0-3: 00000600134e7be8 0000000000000000 000002a102548cd0 0000000000000000
  %l4-7: 0000000000000000 0000000000000000 000000007b740bd0 ffffffffffffffff
000002a102548e80 ip:ip_input+1c (600134e7be8, 334a7d02088, 60017feb820, 0, 0, 10)
  %l0-3: 000000006bb5f6f5 00000000019b8400 0040335d1b518311 0000000000000480
  %l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
000002a102548f40 mac:mac_rx_soft_ring_process+350 (3002880bbd0, 60010f82e80, 60017feb820, 60017feb820, 1, 0)
  %l0-3: 0000060010f7d9c0 000000007b697b20 00000600134e7be8 00000334a7d02088
  %l4-7: 0000060010599990 00000600105b0200 0000000000630000 00000600105b0b48
000002a102548ff0 mac:mac_rx_srs_fanout+468 (60010f7d9c0, 60017feb820, 7ba4a520, 70467e60, 7ba4a470, 70467e54)
  %l0-3: 0000060010f82e80 0000000000000008 000003002880bbd0 0000000000000001
  %l4-7: 0000000000000008 000000007ba4a470 000000007ba4a578 000002a1025498f8
000002a1025498f0 mac:mac_rx_srs_drain+36c (60010f7d9c0, 100, 7ba4a768, 60010f7d9ea, 60010f7daa8, 7ba4a610)
  %l0-3: 000003002880bbd0 0000000000000001 0000000000140001 0000060017feb820
  %l4-7: 000000007ba4a470 0000000000020000 0000000000010000 00000000fffeffff
000002a1025499c0 mac:mac_srs_worker+4f4 (60010f7d9c0, 7ba4a848, 7ba4a610, 7ba4a470, 2000, 7ba4a470)
  %l0-3: 000002a102549a8a 0000060010f7d9e8 0000000000140000 000000007ba4a470
  %l4-7: 000000007ba4a5a8 000002a102549a88 0000000000000000 0000000000000000

panic[cpu18]/thread=2a102549c80: sync initiated
WARNING: interrupt 0x7002e070 at level 15 not serviced
syncing file systems...
I'll keep the core for further analysis...
Files
Updated by Dan McDonald over 6 years ago
NOTE: I was reviewing this in person with Olaf at FOSDEM.
I need this from the coredump:
> 60015669000::print tcp_t
I've seen it. The congestion window is down to 0 on a non-ECN TCP session. The big question is WHY AND HOW tcp_cwnd is set to 0. This happens infrequently, according to Olaf.
Updated by Olaf Bohlen over 6 years ago
Hi Dan,
as requested:
{
    tcp_time_wait_next = 0
    tcp_time_wait_prev = 0
    tcp_time_wait_expire = 0
    tcp_connp = 0x60015668a80
    tcp_tcps = 0x333e8dc0b80
    tcp_state = 0
    tcp_rcv_ws = 0x2
    tcp_snd_ws = 0x2
    tcp_ts_recent = 0x6bb5c709
    tcp_rto = 0x1bb
    tcp_last_rcv_lbolt = 0x6bb5c92a
    tcp_rto_initial = 0x3e8
    tcp_rto_min = 0x190
    tcp_rto_max = 0xea60
    tcp_snxt = 0x2711637
    tcp_swnd = 0x22f40
    tcp_mss = 0x5a8
    tcp_iss = 0xea9d1762
    tcp_rnxt = 0xb8a98c16
    tcp_rwnd = 0x20c4c
    tcp_xmit_head = 0x600111a3080
    tcp_xmit_last = 0x60017694c80
    tcp_xmit_tail = 0x33a56bb1b20
    tcp_unsent = 0x450
    tcp_xmit_tail_unsent = 0x40
    tcp_suna = 0x270fef7
    tcp_rexmit_nxt = 0x270fef7
    tcp_rexmit_max = 0x2711637
    tcp_snd_burst = 0x3
    tcp_cwnd = 0
    tcp_cwnd_cnt = 0
    tcp_ibsegs = 0x55975
    tcp_obsegs = 0xa5e2b
    tcp_naglim = 0x1
    tcp_valid_bits = 0
    tcp_timer_tid = 0x6001230e858
    tcp_timer_backoff = 0
    tcp_last_recv_time = 0x6bb5c92a
    tcp_init_cwnd = 0
    tcp_urp_last_valid = 0
    tcp_hard_binding = 0
    tcp_fin_acked = 0
    tcp_fin_rcvd = 0
    tcp_fin_sent = 0
    tcp_ordrel_done = 0
    tcp_detached = 0
    tcp_zero_win_probe = 0
    tcp_loopback = 0
    tcp_localnet = 0
    tcp_syn_defense = 0
    tcp_set_timer = 0
    tcp_active_open = 0
    tcp_rexmit = 0x1
    tcp_snd_sack_ok = 0x1
    tcp_hwcksum = 0
    tcp_ip_forward_progress = 0
    tcp_ecn_ok = 0
    tcp_ecn_echo_on = 0
    tcp_ecn_cwr_sent = 0
    tcp_cwr = 0
    tcp_pad_to_bit31 = 0
    tcp_snd_ts_ok = 0x1
    tcp_snd_ws_ok = 0x1
    tcp_reserved_port = 0
    tcp_in_free_list = 0
    tcp_snd_zcopy_on = 0
    tcp_snd_zcopy_aware = 0
    tcp_xmit_zc_clean = 0
    tcp_wait_for_eagers = 0
    tcp_accept_error = 0
    tcp_send_discon_ind = 0
    tcp_cork = 0
    tcp_tconnind_started = 0x1
    tcp_lso = 0
    tcp_is_wnd_shrnk = 0
    tcp_pad_to_bit_31 = 0
    tcp_initial_pmtu = 0x2328
    tcp_reass_head = 0
    tcp_reass_tail = 0
    tcp_sack_info = {
        tcp_pipe = 0xb30
        tcp_fack = 0x2711637
        tcp_sack_snxt = 0x2710697
        tcp_max_sack_blk = 0x3
        tcp_num_sack_blk = 0
        tcp_sack_list = [
            { begin = 0xb8a793e6 end = 0xb8a79476 },
            { begin = 0 end = 0 },
            { begin = 0 end = 0 },
            { begin = 0 end = 0 },
            { begin = 0 end = 0 },
        ]
        tcp_num_notsack_blk = 0x1
        tcp_cnt_notsack_list = 0x7a0
        tcp_notsack_list = 0x600175885e0
    }
    tcp_rcv_list = 0
    tcp_rcv_last_head = 0
    tcp_rcv_last_tail = 0
    tcp_rcv_cnt = 0
    tcp_cwnd_ssthresh = 0x45e8
    tcp_cwnd_max = 0x100000
    tcp_csuna = 0x2711637
    tcp_rtt_sa = 0x11c
    tcp_rtt_sd = 0x190
    tcp_rtt_update = 0x4f11b
    tcp_ms_we_have_waited = 0
    tcp_swl1 = 0xb8a98c16
    tcp_swl2 = 0x270fef7
    tcp_rack = 0xb8a98c16
    tcp_rack_cnt = 0
    tcp_rack_cur_max = 0x2
    tcp_rack_abs_max = 0x2
    tcp_ack_tid = 0
    tcp_push_tid = 0
    tcp_max_swnd = 0x22f40
    tcp_listener = 0
    tcp_irs = 0xb89ff961
    tcp_fss = 0xea9d1761
    tcp_urg = 0
    tcp_first_timer_threshold = 0x2710
    tcp_second_timer_threshold = 0x493e0
    tcp_first_ctimer_threshold = 0x2710
    tcp_second_ctimer_threshold = 0x2bf20
    tcp_urp_last = 0
    tcp_urp_mp = 0
    tcp_urp_mark_mp = 0
    tcp_conn_req_cnt_q0 = 0
    tcp_conn_req_cnt_q = 0
    tcp_conn_req_max = 0
    tcp_conn_req_seqnum = 0x85184
    tcp_eager_lock = { _opaque = [ 0 ] }
    tcp_eager_next_q = 0
    tcp_eager_last_q = 0
    tcp_eager_next_q0 = 0
    tcp_eager_prev_q0 = 0
    tcp_conn_def_q0 = 0 (0)
    tcp_conn = {
        tcp_eager_conn_ind = 0x600110ef920
        tcp_opts_conn_req = 0x600110ef920
    }
    tcp_syn_rcvd_timeout = 0
    tcp_ka_last_intrvl = 0
    tcp_ka_tid = 0x60013fd2388
    tcp_ka_interval = 0x6ddd00
    tcp_ka_rinterval = 0
    tcp_ka_abort_thres = 0x75300
    tcp_ka_cnt = 0
    tcp_client_errno = 0
    tcp_ipha = 0x33e85c6e208
    tcp_ip6h = 0
    tcp_tcpha = 0x33e85c6e21c
    tcp_last_sent_len = 0x390
    tcp_last_recv_len = 0
    tcp_dupack_cnt = 0
    tcp_acceptor_lockp = 0
    tcp_ordrel_mp = 0
    tcp_acceptor_id = 0
    tcp_ipsec_overhead = 0
    tcp_recvifindex = 0
    tcp_recvhops = 0xffffffff
    tcp_recvtclass = 0
    tcp_hopopts = 0
    tcp_dstopts = 0
    tcp_rthdrdstopts = 0
    tcp_rthdr = 0
    tcp_hopoptslen = 0
    tcp_dstoptslen = 0
    tcp_rthdrdstoptslen = 0
    tcp_rthdrlen = 0
    tcp_timercache = 0x60013fd2a18
    tcp_closelock = { _opaque = [ 0 ] }
    tcp_closecv = { _opaque = 0 }
    tcp_closed = 0
    tcp_closeflags = 0
    tcp_closemp = {
        b_next = 0
        b_prev = 0
        b_cont = 0
        b_rptr = 0
        b_wptr = 0
        b_datap = 0
        b_band = 0
        b_tag = 0
        b_flag = 0
        b_queue = 0
    }
    tcp_linger_tid = 0
    tcp_acceptor_hash = 0
    tcp_ptpahn = 0
    tcp_bind_hash = 0
    tcp_bind_hash_port = 0x60015f99180
    tcp_ptpbhn = 0x600173ccaa8
    tcp_maxpsz_multiplier = 0xa
    tcp_lso_max = 0
    tcp_ofo_fin_seq = 0
    tcp_cwr_snd_max = 0
    tcp_saved_listener = 0x60015f99180
    tcp_in_ack_unsent = 0
    tcp_loopback_peer = 0
    tcp_fused_sigurg_mp = 0
    tcp_fused = 0
    tcp_unfusable = 0
    tcp_fused_sigurg = 0
    tcp_fuse_to_bit_31 = 0
    tcp_non_sq_lock = { _opaque = [ 0 ] }
    tcp_issocket = 0x1 (B_TRUE)
    tcp_squeue_bytes = 0
    tcp_closemp_used = 0 (0)
    tcp_eager_prev_drop_q0 = 0
    tcp_eager_next_drop_q0 = 0
    tcp_flow_stopped = 0 (0)
    tcp_snxt_shrunk = 0
    tcp_connid = 0x1
    tcp_rsrv_mp = 0x60011135200
    tcp_rsrv_mp_lock = { _opaque = [ 0 ] }
    tcp_listen_cnt = 0
    tcp_reass_tid = 0
    tcp_fin_wait_2_flush_interval = 0xea60
    tcmp_stk = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
}
Thanks for having a look.
Updated by Olaf Bohlen over 6 years ago
I had the third crash tonight. It seems to happen while producing a lot of I/O via the ssh protocol (I think it panics when the illumos build produces a lot of output, which I tail in this ssh session). I'll try to set up a reproducible setup.
Updated by Dan McDonald over 6 years ago
It's just an ssh session that generates a LOT of output ("tail -f nightly.log"). Something during this massive burst of output is pushing tcp_cwnd to 0, which tickles the ASSERT. tcp_cwnd is set in SO MANY PLACES, however, that tracing it may be difficult.
Updated by Yuri Pankov over 6 years ago
I just thought that, since we aren't all seeing the same problem, it could be the ssh client's OS doing something weird TCP-wise.
Updated by Olaf Bohlen over 6 years ago
It's all on the same box:
A zone named pshell, running on an LDOM called prod, does an ssh to a zone called olafdev running on another LDOM called olafdev.
It goes like this: vnic -> etherstub -> vnic -> vnet -> vswitch -> vnet -> vnic -> etherstub -> vnic
Updated by Olaf Bohlen over 6 years ago
If I run this dtrace:
#!/usr/sbin/dtrace -qCs

/*
 * Entry/exit for various tcp functions. Check tcp_cwnd before, and stop
 * if a function exits with cwnd set to 0.
 */

/* Pick a TCP function to start with */
tcp_input_data:entry
/self->tcp == NULL/
{
	self->tcp = (tcp_t *)((conn_t *)arg0)->conn_proto_priv.cp_tcp;
	printf("%s", self->tcp->tcp_cwnd == 0 ? "ALREADY ZERO" : "");
}

tcp_input_data:return
/self->tcp != NULL/
{
	self->tcp = NULL;
}

tcp*:return
/self->tcp != NULL && self->tcp->tcp_cwnd == 0/
{
	printf("OH SHIT! %p %s set tcp_cwnd to 0\n", self->tcp, probefunc);
	stack();
	self->tcp = NULL;
}
The system does not panic on the ASSERT while tailing nightly.log with this dtrace script running. If I turn off dtrace and again start building illumos-gate while tailing nightly.log, it panics reproducibly.
I now started kmdb, got myself the address of the specific tcp_cwnd and set a watchpoint on it:
[0]> ::netstat -f inet
TCPv4    State    Local Address        Remote Address       Stack  Zone
[...]
0000060019ebe740    0    172.20.3.20.22       172.20.2.22.46781    2      0
[...]
[0]> 0000060019ebe740::print -a tcp_t tcp_cwnd
60019ebe7e0 tcp_cwnd = 0x68dc
[0]> 60019ebe7e0::wp -L 8 -c ::stack -w
[0]> $b
ID    S TA HT LM Description                              Action
----- - -- -- -- ---------------------------------------- ----------------------
[ 1 ] -        0  0 stop on write of [0x60019ebe7e0, 0x60019 ::stack
So I'll wait and see whether I get something useful.
Updated by Olaf Bohlen over 6 years ago
With this watchpoint set in kmdb, the system also does not panic...
Next thing: re-try with kmdb active but without the watchpoint, and see whether it drops into kmdb (i.e. panics).
It looks like dtrace and kmdb alter the timing of this code path in a way that the error cannot occur...
Updated by Olaf Bohlen over 6 years ago
Actually, I can confirm that the system will not panic as long as kmdb is loaded.
Updated by Olaf Bohlen over 6 years ago
- File toy-network-virt.png toy-network-virt.png added
- File toy-ldom-virt.png toy-ldom-virt.png added
Just for completeness, here is a drawing of a small part of the network setup on this system:
The tail that causes the panic runs in the zone "olafdev" (in an ssh session initiated from the pshell zone), which owns a vnic on the "devsw0" etherstub. The LDOM "dev" routes this via the "dev vnet" interface to the "prod vnet0" interface of LDOM prod. There the traffic is routed to the "pshell" zone via the "prodsw0" etherstub and the attached "pshell0" vnic.
This is the setup of the LDOMs and zones: