Bug #2970

oi151a5 has a problem in /lib (libc?)

Added by Roger Fujii over 8 years ago. Updated about 8 years ago.

Status: Closed
Priority: High
Assignee: -
Category: OS/Net (Kernel and Userland)
Target version: -
Start date: 2012-07-06
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

Ever since upgrading to oi151a5, my amavisd/maia process has been going berserk: child processes consume lots of CPU, but trussing them just hangs (i.e., produces no output). I narrowed it down to a change in /lib by creating a /lib1 from an older BE and running "LD_LIBRARY_PATH=/lib1 /opt/csw/sbin/amavisd-maia" on the oi151a5 machine, after which the issue went away.

From the little I can gather from the truss output of the parent (with -f), before output stops for the task, the new libraries cause perl to SEGFAULT, and that cascades to a point where the child process consumes CPU but is unresponsive, though a kill -9 does clean it up OK. I would guess this is a libc issue, but that's just speculation.

I'd be happy to help track this down, though with truss not working on this, I'm not sure what tool to use.
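
On the question of tooling: truss reports system calls, signals and faults, so a process spinning purely in user code would give it nothing to print. That is only one possible reading of the silence. The illumos proc tools inspect a process from the outside instead. The sketch below is a hypothetical helper, not part of the original report; it assumes pflags, psig and pstack are on PATH, and the function names are made up for illustration.

```python
import shutil
import subprocess

# Standard illumos proc tools; each takes a pid as its argument.
PROC_TOOLS = ("pflags", "psig", "pstack")


def inspection_commands(pid):
    """Build the command lines to run against a stuck pid (hypothetical helper)."""
    return [[tool, str(pid)] for tool in PROC_TOOLS]


def inspect(pid):
    """Run each tool that is installed and print whatever it reports."""
    for argv in inspection_commands(pid):
        if shutil.which(argv[0]) is None:
            print(f"{argv[0]}: not found, skipping")
            continue
        result = subprocess.run(argv, capture_output=True, text=True)
        print(f"$ {' '.join(argv)}")
        print(result.stdout + result.stderr)
```

Calling inspect() with the pid of a spinning child would show its flags, signal dispositions and stack, provided the tools are available on that machine.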

#1

Updated by Roger Fujii over 8 years ago

Spoke too soon. Still goes berserk with the old libc, but not nearly as often (go figure).

#2

Updated by Milan Jurik over 8 years ago

  • Status changed from New to Feedback

Do I understand correctly that perl segfaults? Do you have a core file?

#3

Updated by Roger Fujii over 8 years ago

I can truss the parent process with a -f, so I can see the child traces, and this is what I see:

25973: write(2, " J u l 9 0 7 : 0 2".., 64) = 64
25973: write(2, " ( 2 5 9 7 3 - 0 8 ) ", 11) = 11
25973: write(2, " c a l l i n g S A p".., 34) = 34
25973: write(2, "\n", 1) = 1
25973: Incurred fault #6, FLTBOUNDS %pc = 0xFE8C92F0
25973: siginfo: SIGSEGV SEGV_MAPERR addr=0x75676F74
25973: Received signal #11, SIGSEGV [default]
25973: siginfo: SIGSEGV SEGV_MAPERR addr=0x75676F74
16780: Received signal #18, SIGCLD, in pollsys() [caught]
16780: siginfo: SIGCLD CLD_KILLED pid=25973 status=0x000B
16780: pollsys(0x08047420, 0, 0x08047498, 0x00000000) Err#4 EINTR

So something is getting a SEGFAULT. Mind you, this isn't the OI perl but the one from OpenCSW, yet these binaries didn't hang processes like this before a5. Looking further in the trace for the SEGFAULTing process (25973), the following occurs:

25973: stat64("/opt/csw/share/perl/csw/auto/DBI/DESTROY.al", 0x08046F80) Err#2 ENOENT
25973: stat64("/opt/csw/lib/perl/5.10.1/auto/DBI/DESTROY.al", 0x08046F80) Err#2 ENOENT
25973: stat64("/opt/csw/share/perl/5.10.1/auto/DBI/DESTROY.al", 0x08046F80) Err#2 ENOENT
25973: setcontext(0x08046E50)
25973: close(4) = 0
25973: mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFDD50000
25973: _exit(2)
16780: siginfo: SIGCLD CLD_EXITED pid=25973 status=0x0002
16780: forkx(0) = 25973
25973: forkx() (returning as child ...) = 16780
25973: getpid() = 25973 [16780]
25973: lwp_self() = 1
25973: lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
25973: getpid() = 25973 [16780]
25973: schedctl()

So it looks like 25973 exited, but the next forkx returned 25973, which I'm pretty sure is not a good thing. This seems to be where the semi-zombied process starts, since I think various lock files get confused about the state of the lock. Putting aside why perl segfaulted, something is screwy if you can't truss a process that is chewing up CPU. I suspect a race condition was introduced into the kernel with the last update, but I have to figure out how to check out the OI code to see.
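
Interleaved truss -f output is easier to follow once it is grouped by pid and the SIGCLD reports are pulled out. The following is a minimal sketch of that idea, not a tool used in this ticket. The sample lines are copied from the two excerpts above, and the function names are hypothetical. It relies on the siginfo convention that the status of a CLD_KILLED report is a signal number and the status of a CLD_EXITED report is an exit value.

```python
import re
from collections import defaultdict

# Sample lines copied from the two truss excerpts in this comment.
TRACE = """\
25973: Incurred fault #6, FLTBOUNDS %pc = 0xFE8C92F0
25973: siginfo: SIGSEGV SEGV_MAPERR addr=0x75676F74
25973: Received signal #11, SIGSEGV [default]
16780: Received signal #18, SIGCLD, in pollsys() [caught]
16780: siginfo: SIGCLD CLD_KILLED pid=25973 status=0x000B
25973: _exit(2)
16780: siginfo: SIGCLD CLD_EXITED pid=25973 status=0x0002
16780: forkx(0) = 25973
"""

LINE = re.compile(r"^(\d+):\s+(.*)$")
CLD = re.compile(r"siginfo: SIGCLD (CLD_\w+) pid=(\d+) status=(0x[0-9A-Fa-f]+)")


def group_by_pid(text):
    """Map each traced pid to its own lines, in the order they appeared."""
    events = defaultdict(list)
    for raw in text.splitlines():
        match = LINE.match(raw)
        if match:
            events[int(match.group(1))].append(match.group(2))
    return dict(events)


def child_reports(text):
    """Return (reason, child_pid, status) for every SIGCLD siginfo line."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3), 16))
        for m in CLD.finditer(text)
    ]


for reason, pid, status in child_reports(TRACE):
    print(reason, pid, status)
```

Read this way, the parent saw pid 25973 reported twice: once killed by signal 11 (matching the SIGSEGV line) and once exiting with value 2 (matching the _exit(2) call). That suggests two separate lifetimes of the same pid number rather than one.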

#4

Updated by Roger Fujii over 8 years ago

OK, the wonders a little sleep can do. Scratch what I said above about the race condition, but it certainly looks like the SEGFAULT is what triggers this. What I do not understand is why, after the fault, the processes in question don't die: I checked them with psig, and all have the default handler for SEGV. Could some setting in coreadm cause a busy spin while attempting to dump core? I'll try varying the coreadm settings and see if anything changes.
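
Since the SEGFAULT looks like the trigger, the fault address reported in comment #3 (addr=0x75676F74) may deserve a second look. Every byte of it is printable ASCII. A pointer that decodes to readable text is often a hint that string data overwrote a pointer, though it can also be coincidence. The sketch below assumes a 32-bit little-endian (x86) process, which the 32-bit addresses in the trace suggest but do not prove. The function name is hypothetical.

```python
def address_as_text(addr, width=4):
    """Return the address bytes as text if all are printable ASCII, else None.

    Assumes little-endian byte order, so the low byte comes first in memory.
    """
    raw = addr.to_bytes(width, "little")
    if all(0x20 <= byte < 0x7F for byte in raw):
        return raw.decode("ascii")
    return None


print(address_as_text(0x75676F74))  # the SEGV_MAPERR address from the trace
print(address_as_text(0xFE8C92F0))  # the faulting %pc, for comparison
```

The fault address decodes to "togu", while the program counter does not decode to text at all. This is a hint for further investigation, not a diagnosis.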

#5

Updated by Ken Mays about 8 years ago

  • Status changed from Feedback to Closed
  • % Done changed from 0 to 100

Perl packages were updated to perl 5.12.4 and oi_151a7. Please reopen if you see this issue under oi_151a7; I have seen no reports for perl on oi_151a7 yet. Closing ticket.
