Bug #12976
closedsystem panics with error in IP module
100%
Description
Upgraded some systems to newer releases and started seeing crashes. These systems make heavy use of NAT. Let me know what other information I can provide.
root@dbb006:/var/crash/volatile# uname -a && mdb vmcore.0 SunOS dbb006 5.11 joyent_20200701T231659Z i86pc i386 i86pc Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba fctl stmf_sbd stmf zfs mm sd lofs idm mpt_sas sata mr_sas crypto random cpc fcp logindmux ptm kvm sppp nsmb smbsrv nfs ipc ] > ::status debugging crash dump vmcore.0 (64-bit) from dbb006 operating system: 5.11 joyent_20200701T231659Z (i86pc) git branch: release-20200702 git rev: 4e1f938f9e4d41dcfe106b432ae2b08d3d5ab7c5 image uuid: (not set) panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffbe33059a910 addr=30 occurred in module "ip" due to a NULL pointer dereference dump content: kernel pages only
root@dbb005:/var/crash/volatile# uname -a ; mdb vmcore.1 SunOS dbb005 5.11 joyent_20200603T203505Z i86pc i386 i86pc Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba fctl stmf_sbd stmf zfs sd mm lofs idm mpt_sas mr_sas sata crypto random cpc fcp logindmux ptm kvm sppp nsmb smbsrv nfs ipc ] > ::status debugging crash dump vmcore.1 (64-bit) from dbb005 operating system: 5.11 joyent_20200603T203505Z (i86pc) git branch: release-20200604 git rev: 71b43f2a12f58ef8bc5a1965a3b742749bb49231 image uuid: (not set) panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffbe332e29910 addr=30 occurred in module "ip" due to a NULL pointer dereference dump content: kernel pages only
Related issues
Updated by Dan McDonald almost 2 years ago
Please make the vmdump available? If there's sensitive information on it, I can assure you (esp. given it's a SmartOS dump) we can respect the privacy of the dump in question.
Updated by Dan McDonald almost 2 years ago
Got the coredump. Some initial observations:
> $C fffffbe33059ab10 udp_do_bind+0x76(fffffdfa95564580, fffffbe33059ab28, 20, fffffdfca739fc88, 0) fffffbe33059ab60 udp_implicit_bind+0x61(fffffdfa95564580, fffffdfca739fc88) fffffbe33059ac00 udp_send+0x1a8(fffffdfa95564580, fffffdf586b34520, fffffbe33059ad18, fffffdfca739fc88) fffffbe33059ac90 so_sendmsg+0x252(fffffdfde0f502d8, fffffbe33059ad18, fffffbe33059ae78, fffffdfca739fc88) fffffbe33059acf0 socket_sendmsg+0x62(fffffdfde0f502d8, fffffbe33059ad18, fffffbe33059ae78, fffffdfca739fc88) fffffbe33059ad90 socket_vop_write+0x61(fffffdfde0f4fd80, fffffbe33059ae78, 0, fffffdfca739fc88, 0) fffffbe33059ae30 fop_write+0x111(fffffdfde0f4fd80, fffffbe33059ae78, 0, fffffdfca739fc88, 0) fffffbe33059af00 write+0x2c6(14b, c02e326700, 350) fffffbe33059af10 sys_syscall+0x1a8()
So it died in udp_do_bind() as part of the implicit-bind one does with a sendto() or sendmsg() (or write() in this case?).
A quick look at the code suggests the conn_t has a NULL netstack for it to panic there, and sure enough:
> fffffdfa95564580::print -at conn_t conn_netstack fffffdfa95564740 netstack_t *conn_netstack = 0 >
So what's interesting about this bug & dump is: How did a conn_t end up with a NULL netstack pointer? To be continued...
Updated by Dan McDonald almost 2 years ago
One more piece of data, this appears to be use-after-free:
> fffffdfa95564580::whatis fffffdfa95564580 is freed from udp_conn_cache
So now to figure out what two threads running together can cause this.
Updated by Dan McDonald almost 2 years ago
The bug filer, running a DEBUG kernel in his environment, has exposed a race between socket closing and the new-ish #9531 implementation. Basically, for TCP sockets, the "so_closed" upcall destroys state that the #9531 code accesses. This webrev: https://kebe.com/~danmcd/webrevs/12976/ narrows or eliminates the race by making the set-upcall-data-to-NULL portion protected by conn_lock, and the #9511 code call the upcall while under conn_lock as well.
It has been tested by the filer with DEBUG without a panic. This coming week, the filer will get a non-DEBUG kernel with this fix.
Updated by Dan McDonald almost 2 years ago
Jason now has non-DEBUG production time up with this fix too, and he is happy.
Updated by Dan McDonald almost 2 years ago
Note that the #9531 implementation is tickled by netstat(1) processes. Additional testing has demonstrated that `netstat -u` continues to function as before.
Updated by Electric Monk almost 2 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit 2ad530425ac9cd3f429e64463a85f6f58703061c
commit 2ad530425ac9cd3f429e64463a85f6f58703061c Author: Dan McDonald <danmcd@joyent.com> Date: 2020-09-02T19:51:59.000Z 12976 system panics with error in IP module Reviewed by: Andy Fiddaman <andy@omniosce.org> Reviewed by: Paul Winder <p.winder@me.com> Approved by: Robert Mustacchi <rm@fingolfin.org>
Updated by Andy Fiddaman 3 months ago
- Related to Bug #14619: Race between udp_activate() and conn_get_socket_info() added