Bug #4925
open
Strange in-kernel memory leak on upgrading to il-gate nightly 2014-06-10
0%
Description
Upgraded to a nightly build of il-gate from 2014-06-10 to see if #4920 would be resolved.
System panicked within minutes, logging "no resources for dumping, retrying" then "giving up". So that's exciting.
Keith helpfully suggested that this was probably a memory allocation failure, looking at the relevant portion of sd.c - on reboot, ::meminfo seems to agree with that...
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   31561422            123286   94%
ZFS File Data              174541               681    1%
Anon                       175344               684    1%
Exec and libs                1730                 6    0%
Page cache                   5550                21    0%
Free (cachelist)             3155                12    0%
Free (freelist)           1626249              6352    5%

Total                    33547991            131046
Physical                 33547990            131046
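As a sanity check on the ::meminfo figures, here is a minimal Python sketch that recomputes the MB and %Tot columns from the page counts. It assumes 4 KiB base pages, which the report does not state; the names `PAGE_SIZE`, `to_mb` and `pct_of_total` are made up for illustration.

```python
# Assumption: 4 KiB base pages (typical on x86); not stated in the report.
PAGE_SIZE = 4096

# Page counts copied from the ::meminfo output in this report.
pages = {
    "Kernel": 31561422,
    "ZFS File Data": 174541,
    "Anon": 175344,
    "Exec and libs": 1730,
    "Page cache": 5550,
    "Free (cachelist)": 3155,
    "Free (freelist)": 1626249,
}
TOTAL_PAGES = 33547991


def to_mb(page_count):
    """Convert a page count to whole MiB, truncating any remainder."""
    return page_count * PAGE_SIZE // (1024 * 1024)


def pct_of_total(page_count):
    """Share of all pages, rounded to the nearest whole percent."""
    return round(100 * page_count / TOTAL_PAGES)


for name, count in pages.items():
    print(f"{name:<18}{count:>10}{to_mb(count):>8}{pct_of_total(count):>4}%")
```

Under that page-size assumption the recomputed values line up with the table, so the 94% kernel figure is at least internally consistent with the page counts.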
kmastat attached; going to go take something for a migraine and then try tracing kmem_cache_alloc on streams_dblk_12048.
Just wanted to write it up before I forgot to...
Updated by Rich Ercolani over 8 years ago
I..._think_ this might be in mpt_sas.
I'm still waiting for ::findleaks to come back (120 GB is a lot of pages), but the vast majority of kmem_firewall vmem_arena allocations seem to be coming from mpt_sas/sd, of the form:
genunix`kmem_slab_create+0xaf
genunix`kmem_slab_alloc+0x18d
genunix`kmem_cache_alloc+0x283
rootnex`rootnex_coredma_allochdl+0x61
rootnex`rootnex_dma_allochdl+0xdc
genunix`ddi_dma_allochdl+0x6e
pcieb`pcieb_dma_allochdl+0x3e
genunix`ddi_dma_allochdl+0x6e
pcieb`pcieb_dma_allochdl+0x3e
genunix`ddi_dma_allochdl+0x6e
pcieb`pcieb_dma_allochdl+0x3e
genunix`ddi_dma_allochdl+0x6e
genunix`ddi_dma_alloc_handle+0x7f
mpt_sas`mptsas_kmem_cache_constructor+0x78
genunix`kmem_cache_alloc_debug+0x104
genunix`kmem_cache_alloc+0xdd
mpt_sas`mptsas_scsi_init_pkt+0x513
scsi`scsi_init_pkt+0xaf
sd`sd_setup_rw_pkt+0xcc
sd`sd_initpkt_for_buf+0x10e
I had thought this might be in the 10GbE driver I was using, but that seems like it might have just been a correlation because the machine workload would intensify when the 10GbE interface was up (versus "just" having the internal management interface up and no client-facing interfaces).
We'll see.
Updated by Rich Ercolani over 8 years ago
I find myself questioning the conclusions of ::findleaks.
findleaks:
findleaks: potential pointers => -763768478
findleaks:         dismissals => -855425956 ( 0.0%)
findleaks:             misses => 15000168 ( 0.0%)
findleaks:               dups => 58183872 ( 0.0%)
findleaks:            follows => 18473438 ( 0.0%)
findleaks:
findleaks:  peak memory usage => 435156 kB
findleaks:   elapsed CPU time => 123.9 seconds
findleaks:  elapsed wall time => 14373.0 seconds
findleaks:
findleaks: no memory leaks detected
streams_dblk_12048        12160 7287238 7287238    85397M   8225918     0
Total [kmem_firewall]                              91736M  67614494     0
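To put those two kmastat lines in proportion, here is a rough Python sketch of the arithmetic. It assumes the columns follow the usual ::kmastat layout (buf size, bufs in use, bufs total, memory in use, allocations succeeded, allocations failed); the variable names are illustrative only.

```python
# Figures copied from the kmastat lines in this report.
# Assumption: columns are buf size, bufs in use, bufs total, memory in use, ...
BUF_SIZE = 12160          # bytes per streams_dblk_12048 buffer
BUFS_IN_USE = 7287238
CACHE_MEM_MB = 85397      # "memory in use" for streams_dblk_12048
ARENA_TOTAL_MB = 91736    # Total [kmem_firewall]

# Memory held by the buffers themselves, ignoring slab overhead.
payload_mb = BUF_SIZE * BUFS_IN_USE // (1024 * 1024)

# Fraction of the kmem_firewall total attributable to this one cache.
share = 100 * CACHE_MEM_MB / ARENA_TOTAL_MB

print(f"buffers alone: {payload_mb} MB")
print(f"share of kmem_firewall: {share:.1f}%")
```

If the column reading is right, this single cache accounts for roughly 93% of the kmem_firewall total, which is why it looks like the place to dig.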
# ls -alh {unix,vmcore}.1
-rw-r--r-- 1 root root 2.1M 2014-06-17 23:41 unix.1
-rw-r--r-- 1 root root 119G 2014-06-17 23:45 vmcore.1
Updated by Rich Ercolani over 8 years ago
...oh my.
> ::mptsas
        mptsas_t inst     ncmds suspend  power
================================================================================
ffffff232622e000    0 264889920       0OFF=D3
ffffff2359d0f000    1         0       0OFF=D3
ffffff2359d0a000    2         0       0OFF=D3
That seems wrong, if I'm reading these columns reasonably.
e: Ah, this seems to just be an artifact of running mdb from a system not running the same environment as the kernel. More sane results (for some value of sane) come out if you run it from the same system:
> ::mptsas
        mptsas_t inst     ncmds suspend  power
================================================================================
ffffff232622e000    0         0       0  ON=D0
ffffff2359d0f000    1         1       0  ON=D0
ffffff2359d0a000    2         1       0  ON=D0
Now to see if that affected ::findleaks...
Updated by Rich Ercolani over 8 years ago
Nope, findleaks still doesn't find anything.
Gonna try git bisect between OI151a9 and this version of il-gate and see if I can find the commit where this starts transpiring (it shouldn't be hard...)
Updated by Rich Ercolani over 8 years ago
I have to pause bisecting this for other fires, so I'm serializing the git bisect state here in case someone else runs into this and can finish it before I do, or in case I lose my state somehow:
git bisect start
# bad: [b89e420ae1290e425c29db875ec0c0546006eec7] 4888 Undocument dma_req(9s) 4884 EOF scsi_hba_attach 4886 EOF ddi_dmae_getlim 4887 EOF ddi_iomin 4634 undocument scsi_hba_attach() and ddi_dma_lim(9s) 4630 clean stale references to ddi_iopb_alloc and ddi_iopb_free Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com>
git bisect bad b89e420ae1290e425c29db875ec0c0546006eec7
# good: [52e13e00baadc11c135170cd60a8b3c4f253a694] 4099 SMF methods without absolute paths no longer work Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Dan McDonald <danmcd@nexenta.com>
git bisect good 52e13e00baadc11c135170cd60a8b3c4f253a694
# good: [2af88515229ddbf3a029426040c385a053f2bfb3] 4511 Update timezone info db to 2013i (fix packaging) Reviewed by: Joshua Clulow <jmc@joyent.com>
git bisect good 2af88515229ddbf3a029426040c385a053f2bfb3
# good: [5131caa123fd046a511d18145869507ae9b7b9dd] 4462 clnt_vc_control()/clnt_dg_control() could return RPC_FAILED instead of FALSE Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Gordon Ross <gwr@nexenta.com>
git bisect good 5131caa123fd046a511d18145869507ae9b7b9dd