Bug #3671

left behind enemy lines, agent LWP can go rogue

Added by Robert Mustacchi about 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Category: lib - userland libraries
Start date: 2013-04-02
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

From the Joyent bug report:

With an errant /proc consumer, the agent LWP can be mistakenly left in a (stopped) target process. libproc tries to compensate for this by cleaning up any agent it finds when a new agent is created, and libproc consumers like prun(1) rely on this to clean up any agent that they find in a target process. However, there is an important distinction to be made in how the agent was abandoned. If a consumer is using libproc and simply neglects to destroy the agent, the agent LWP will be stopped on PR_SYSEXIT, e.g.:

68825:    ./foobar
    data model = _LP64  flags = RLC|MSACCT|MSFORK
 /1:    flags = STOPPED|ISTOP
    why = PR_REQUESTED
 /2:    flags = DETACH|STOPPED|ISTOP|AGENT  unlinkat(0xffd19553,0xfffffd7fffdffb7b,0x0)
    why = PR_SYSEXIT  what = unlinkat
    sigmask = 0xffbffeff,0xffffffff,0x000000ff

In this case, (re)creating the agent will involve finding a syscall instruction in the target process's program text, setting the %rip to point to it, setting PCSENTRY to stop the agent LWP on entry to the syscall, running the agent LWP and waiting for it to stop. This works fine.
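
The first of those steps, locating a syscall instruction in the target's text, might look roughly like the following on amd64. This is only a sketch: find_syscall() is a hypothetical helper, not a libproc function, and it scans a local buffer rather than reading the target's address space through /proc. The one fact it relies on is that the amd64 syscall instruction encodes as the two bytes 0x0f 0x05.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the offset of the first amd64 `syscall`
 * opcode (0x0f 0x05) in a buffer of program text, or -1 if there is none.
 * This is a naive byte scan; it does not decode instructions, so a match
 * is not guaranteed to start on an instruction boundary.
 */
static long
find_syscall(const uint8_t *text, size_t len)
{
	for (size_t i = 0; i + 1 < len; i++) {
		if (text[i] == 0x0f && text[i + 1] == 0x05)
			return ((long)i);
	}
	return (-1);
}
```

The offset returned here, added to the base address of the scanned mapping, would be the value loaded into %rip before the agent is set running.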

However: if the agent LWP is stopped not on system call exit but rather system call entry, this logic will not behave correctly. That is, take an agent that looks like this:

68727:    ./foobar
    data model = _LP64  flags = MSACCT|MSFORK
    entryset = 0x00000000 0x00000000 0x00000001 0x00000000
               0x00000000 0x00000000 0x00000000 0x00000000
 /1:    flags = STOPPED|ISTOP
    why = PR_REQUESTED
 /2:    flags = DETACH|STOPPED|ISTOP|AGENT  unlinkat(0xfffffd7fff333200,0xfffffd7fff33f6c4,0x0)
    why = PR_SYSENTRY  what = unlinkat
    sigmask = 0xffbffeff,0xffffffff,0x000000ff

In this case, running the agent will result not in it returning to the designated %rip, but rather to %rip + 2 (as it finishes the system call that it had been stopped from making). At that point, the agent is rogue: it may well die on the next instruction (especially if that instruction, by chance, has been plucked from data and not true text), or it may execute for tens or hundreds of instructions before returning into outer space, often dying with a NULL or otherwise outrageous %rip. These are the symptoms that we have seen in HVM-771.
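
The "+ 2" is simply the length of the amd64 syscall instruction. A toy model of the two cases, assuming nothing beyond what is described above, might be written as follows; stop_why_t and agent_resume_pc() are illustrative names, not part of libproc or /proc.

```c
#include <stdint.h>

#define	SYSCALL_INSN_LEN	2	/* amd64 `syscall` is 0x0f 0x05 */

/* Illustrative stand-ins for the two stop reasons shown above. */
typedef enum { STOP_SYSEXIT, STOP_SYSENTRY } stop_why_t;

/*
 * Toy model: given why the leftover agent was stopped and the address of
 * the syscall instruction that %rip was pointed at, return the address of
 * the first user-level instruction the agent executes once it is running.
 */
static uint64_t
agent_resume_pc(stop_why_t why, uint64_t designated_rip)
{
	if (why == STOP_SYSENTRY) {
		/* The pending system call completes; execution lands past it. */
		return (designated_rip + SYSCALL_INSN_LEN);
	}
	/* Stopped on exit: the agent executes the designated syscall. */
	return (designated_rip);
}
```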

The fix for this (such as it is) is that libproc must recognize this condition (namely, an agent stopped on PR_SYSENTRY) on agent creation, treating it as it does the sleeping syscall case and calling Pabort_agent() to abort the current syscall.
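
The shape of that check could be sketched as below. This is not the code from the actual change: the enum, the struct, and agent_needs_abort() are hypothetical stand-ins for the LWP status that libproc reads from /proc, and the only real interface referred to is Pabort_agent(), which appears here in a comment only.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the stop reasons reported for an LWP. */
typedef enum {
	AGENT_WHY_REQUESTED,
	AGENT_WHY_SYSENTRY,
	AGENT_WHY_SYSEXIT
} agent_why_t;

/* Hypothetical summary of a leftover agent LWP's state. */
typedef struct {
	bool		as_asleep;	/* sleeping in a system call */
	agent_why_t	as_why;		/* why the LWP is stopped */
} agent_state_t;

/*
 * Decide whether a leftover agent's current system call must be aborted
 * (in libproc, by calling Pabort_agent()) before the agent is reused.
 * An agent stopped on system call entry is treated like a sleeping one.
 */
static bool
agent_needs_abort(const agent_state_t *as)
{
	return (as->as_asleep || as->as_why == AGENT_WHY_SYSENTRY);
}
```

An agent stopped on PR_SYSEXIT, as in the first example, would not need the abort and can be redirected as before.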

Finally, it must be said that it is very hard to induce this latter condition, and it's not clear how it can be done from an errant consumer that does anything short of failing explosively. (Indeed, this condition was induced with a carefully constructed DTrace predicate and a raise() action.) As such, this bug may be an exacerbating factor for HVM-771, but it's hardly the root cause.

History

#1

Updated by Robert Mustacchi about 7 years ago

  • % Done changed from 90 to 100

Resolved in f971a3462face662ae8ef220a18a98354d625d54.

#2

Updated by Yuri Pankov about 7 years ago

  • Status changed from New to Resolved
