Bug #4925

Strange in-kernel memory leak on upgrading to il-gate nightly 2014-06-10

Added by Rich Ercolani over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date: 2014-06-14
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

Upgraded to a nightly build of il-gate from 2014-06-10 to see if #4920 would be resolved.

System panicked within minutes, logging "no resources for dumping, retrying" then "giving up". So that's exciting.

Keith helpfully suggested that this was probably a memory allocation failure, looking at the relevant portion of sd.c - on reboot, ::meminfo seems to agree with that...

Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   31561422            123286   94%
ZFS File Data              174541               681    1%
Anon                       175344               684    1%
Exec and libs                1730                 6    0%
Page cache                   5550                21    0%
Free (cachelist)             3155                12    0%
Free (freelist)           1626249              6352    5%

Total                    33547991            131046
Physical                 33547990            131046
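As a sanity check on the summary above, the MB and %Tot columns can be reproduced from the page counts. This is a minimal sketch, assuming 4 KiB base pages (the report does not state the page size) and assuming MB is truncated while %Tot is rounded to nearest; both rules are inferred from the numbers shown, not taken from the mdb source. The function names are illustrative.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages, not stated in the report

# Page counts copied from the ::meminfo output above.
MEMINFO_PAGES = {
    "Kernel": 31561422,
    "ZFS File Data": 174541,
    "Anon": 175344,
    "Exec and libs": 1730,
    "Page cache": 5550,
    "Free (cachelist)": 3155,
    "Free (freelist)": 1626249,
}
TOTAL_PAGES = 33547991


def pages_to_mb(pages, page_size=PAGE_SIZE):
    """Convert a page count to whole MB (truncating, as the table appears to)."""
    return pages * page_size // (1024 * 1024)


def percent_of_total(pages, total=TOTAL_PAGES):
    """Share of total pages, rounded to the nearest whole percent."""
    return (pages * 100 + total // 2) // total


if __name__ == "__main__":
    for name, pages in MEMINFO_PAGES.items():
        print(f"{name:<18} {pages:>10} {pages_to_mb(pages):>8} MB "
              f"{percent_of_total(pages):>3}%")
```

Under those assumptions the Kernel row works out to 123286 MB and 94%, matching the table, which is what makes the kernel the obvious place to look.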

kmastat attached; going to go take something for a migraine and then try tracing kmem_cache_alloc on streams_dblk_12048.

Just wanted to write it up before I forgot to...


Files

kmastat.txt (41.5 KB) Rich Ercolani, 2014-06-14 03:11 AM

History

#1

Updated by Rich Ercolani over 5 years ago

I..._think_ this might be in mpt_sas.

I'm still waiting for ::findleaks to come back (120 GB is a lot of pages), but the vast majority of kmem_firewall vmem_arena allocations seem to be coming from mpt_sas/sd, of the form:

              genunix`kmem_slab_create+0xaf
              genunix`kmem_slab_alloc+0x18d
              genunix`kmem_cache_alloc+0x283
              rootnex`rootnex_coredma_allochdl+0x61
              rootnex`rootnex_dma_allochdl+0xdc
              genunix`ddi_dma_allochdl+0x6e
              pcieb`pcieb_dma_allochdl+0x3e
              genunix`ddi_dma_allochdl+0x6e
              pcieb`pcieb_dma_allochdl+0x3e
              genunix`ddi_dma_allochdl+0x6e
              pcieb`pcieb_dma_allochdl+0x3e
              genunix`ddi_dma_allochdl+0x6e
              genunix`ddi_dma_alloc_handle+0x7f
              mpt_sas`mptsas_kmem_cache_constructor+0x78
              genunix`kmem_cache_alloc_debug+0x104
              genunix`kmem_cache_alloc+0xdd
              mpt_sas`mptsas_scsi_init_pkt+0x513
              scsi`scsi_init_pkt+0xaf
              sd`sd_setup_rw_pkt+0xcc
              sd`sd_initpkt_for_buf+0x10e

I had thought this might be in the 10GbE driver I was using, but that seems like it might have just been a correlation because the machine workload would intensify when the 10GbE interface was up (versus "just" having the internal management interface up and no client-facing interfaces).

We'll see.

#2

Updated by Rich Ercolani over 5 years ago

I find myself questioning the conclusions of ::findleaks.

findleaks:
findleaks:             potential pointers => -763768478
findleaks:                     dismissals => -855425956    ( 0.0%)
findleaks:                         misses => 15000168      ( 0.0%)
findleaks:                           dups => 58183872      ( 0.0%)
findleaks:                        follows => 18473438      ( 0.0%)
findleaks:
findleaks:              peak memory usage => 435156 kB
findleaks:               elapsed CPU time => 123.9 seconds
findleaks:              elapsed wall time => 14373.0 seconds
findleaks:
findleaks: no memory leaks detected

streams_dblk_12048         12160 7287238 7287238      85397M   8225918     0
Total [kmem_firewall]                               91736M  67614494     0
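To put the two ::kmastat rows above in proportion, here is a rough sketch of the arithmetic. It assumes the usual ::kmastat column order (cache name, buf size, bufs in use, bufs total, memory in use, allocations succeeded, allocations failed) and that the "M" suffix means MiB; neither is spelled out in the paste, and the helper names are made up for illustration.

```python
# Values copied from the ::kmastat rows above (column meaning is assumed).
BUF_SIZE = 12160            # bytes per streams_dblk_12048 buffer
BUFS_IN_USE = 7287238       # buffers currently allocated
CACHE_MEM_MB = 85397        # "memory in use" for streams_dblk_12048
FIREWALL_TOTAL_MB = 91736   # "Total [kmem_firewall]" memory in use


def buffers_mb(buf_size, count):
    """Raw payload of `count` buffers of `buf_size` bytes, in MiB."""
    return buf_size * count / (1024 * 1024)


def share_percent(part, whole):
    """`part` as a percentage of `whole`."""
    return 100.0 * part / whole


if __name__ == "__main__":
    payload = buffers_mb(BUF_SIZE, BUFS_IN_USE)
    print(f"buffer payload:         {payload:.0f} MB")
    print(f"reported memory in use: {CACHE_MEM_MB} MB")
    print(f"share of kmem_firewall: "
          f"{share_percent(CACHE_MEM_MB, FIREWALL_TOTAL_MB):.1f}%")
```

The raw buffer payload comes out a little below the reported memory in use, which is expected if the reported figure includes slab overhead, and this one cache accounts for the large majority of the kmem_firewall arena.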

# ls -alh {unix,vmcore}.1
-rw-r--r-- 1 root root 2.1M 2014-06-17 23:41 unix.1
-rw-r--r-- 1 root root 119G 2014-06-17 23:45 vmcore.1

#3

Updated by Rich Ercolani over 5 years ago

...oh my.

> ::mptsas
        mptsas_t inst ncmds suspend  power
================================================================================
ffffff232622e000    0 264889920       0OFF=D3
ffffff2359d0f000    1     0       0OFF=D3
ffffff2359d0a000    2     0       0OFF=D3

That seems wrong, if I'm reading these columns reasonably.

Edit: Ah, this seems to be just an artifact of running mdb on a system that isn't running the same environment as the kernel in the dump. Saner results (for some value of sane) come out if you run it on the same system:

> ::mptsas
        mptsas_t inst ncmds suspend  power
================================================================================
ffffff232622e000    0     0       0 ON=D0
ffffff2359d0f000    1     1       0 ON=D0
ffffff2359d0a000    2     1       0 ON=D0

Now to see if that affected ::findleaks...

#4

Updated by Rich Ercolani over 5 years ago

Nope, findleaks still doesn't find anything.

Gonna try git bisect between OI151a9 and this version of il-gate and see if I can find the commit where this starts transpiring (it shouldn't be hard...)

#5

Updated by Rich Ercolani over 5 years ago

I have to pause bisecting this for other fires, so I'm serializing the git bisect state here in case someone else runs into this and can complete it before I do, or in case I lose my state somehow:

git bisect start
# bad: [b89e420ae1290e425c29db875ec0c0546006eec7] 4888 Undocument dma_req(9s) 4884 EOF scsi_hba_attach 4886 EOF ddi_dmae_getlim 4887 EOF ddi_iomin 4634 undocument scsi_hba_attach() and ddi_dma_lim(9s) 4630 clean stale references to ddi_iopb_alloc and ddi_iopb_free Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com>
git bisect bad b89e420ae1290e425c29db875ec0c0546006eec7
# good: [52e13e00baadc11c135170cd60a8b3c4f253a694] 4099 SMF methods without absolute paths no longer work Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Dan McDonald <danmcd@nexenta.com>
git bisect good 52e13e00baadc11c135170cd60a8b3c4f253a694
# good: [2af88515229ddbf3a029426040c385a053f2bfb3] 4511 Update timezone info db to 2013i (fix packaging) Reviewed by: Joshua Clulow <jmc@joyent.com>
git bisect good 2af88515229ddbf3a029426040c385a053f2bfb3
# good: [5131caa123fd046a511d18145869507ae9b7b9dd] 4462 clnt_vc_control()/clnt_dg_control() could return RPC_FAILED instead of FALSE Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Gordon Ross <gwr@nexenta.com>
git bisect good 5131caa123fd046a511d18145869507ae9b7b9dd
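For anyone picking this up: a log in this form can be saved to a file and fed back to git with `git bisect replay <file>`. The sketch below is only a convenience for eyeballing the saved state first; it pulls the good and bad markers out of a log like the one above. The function name and the `LOG` variable are illustrative, and the commit subjects are omitted from the sample for brevity.

```python
# Command lines copied from the bisect log above (comment lines dropped).
LOG = """\
git bisect start
git bisect bad b89e420ae1290e425c29db875ec0c0546006eec7
git bisect good 52e13e00baadc11c135170cd60a8b3c4f253a694
git bisect good 2af88515229ddbf3a029426040c385a053f2bfb3
git bisect good 5131caa123fd046a511d18145869507ae9b7b9dd
"""


def summarize_bisect_log(text):
    """Collect commits marked good/bad from a `git bisect log` style dump.

    Lines starting with '#' and anything other than
    'git bisect good|bad <hash>' are ignored.
    """
    state = {"good": [], "bad": []}
    for line in text.splitlines():
        parts = line.split()
        if (len(parts) == 4 and parts[:2] == ["git", "bisect"]
                and parts[2] in state):
            state[parts[2]].append(parts[3])
    return state


if __name__ == "__main__":
    summary = summarize_bisect_log(LOG)
    print("bad: ", ", ".join(h[:12] for h in summary["bad"]))
    print("good:", ", ".join(h[:12] for h in summary["good"]))
```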
