left behind enemy lines, agent LWP can go rogue
From the Joyent bug report:
With an errant /proc consumer, the agent LWP can be mistakenly left in a (stopped) target process. libproc tries to compensate for this by cleaning up any found agent when an agent is created, and libproc consumers like prun(1) use this to clean up any agent that they find in a target process. However, there is an important distinction to be made in how the agent is abandoned. If a consumer is using libproc and simply neglects to destroy the agent, the agent LWP will be stopped on PR_SYSEXIT, e.g.:
68825:  ./foobar
        data model = _LP64  flags = RLC|MSACCT|MSFORK
 /1:    flags = STOPPED|ISTOP
        why = PR_REQUESTED
 /2:    flags = DETACH|STOPPED|ISTOP|AGENT  unlinkat(0xffd19553,0xfffffd7fffdffb7b,0x0)
        why = PR_SYSEXIT  what = unlinkat
        sigmask = 0xffbffeff,0xffffffff,0x000000ff
In this case, (re)creating the agent will involve finding a syscall instruction in the target process's program text, setting the %rip to point to it, setting PCSENTRY to stop the agent LWP on entry to the syscall, running the agent LWP and waiting for it to stop. This works fine.
However: if the agent LWP is stopped not on system call exit but rather system call entry, this logic will not behave correctly. That is, take an agent that looks like this:
68727:  ./foobar
        data model = _LP64  flags = MSACCT|MSFORK
        entryset = 0x00000000 0x00000000 0x00000001 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
 /1:    flags = STOPPED|ISTOP
        why = PR_REQUESTED
 /2:    flags = DETACH|STOPPED|ISTOP|AGENT  unlinkat(0xfffffd7fff333200,0xfffffd7fff33f6c4,0x0)
        why = PR_SYSENTRY  what = unlinkat
        sigmask = 0xffbffeff,0xffffffff,0x000000ff
In this case, running the agent will result not in it returning to the designated %rip, but rather %rip + 2 (as it finishes the system call that it had been stopped from making). At that point, the agent is rogue: it may well die on the next instruction (especially if that instruction, by chance, has been plucked from data and not true text), or it may execute for tens or hundreds of instructions before returning into outer space, often dying with a NULL or otherwise outrageous %rip. These are the symptoms that we have seen in HVM-771.
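The off-by-two described above can be sketched as a small model. This is only an illustration, not libproc code: it assumes the "+ 2" is the length of the two-byte amd64 syscall instruction (0x0f 0x05) at the designated %rip, and the names demo_stop, DEMO_SYSCALL_LEN and demo_first_user_rip are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The amd64 `syscall` instruction is two bytes long (0x0f 0x05). */
#define DEMO_SYSCALL_LEN 2

/* Illustrative stand-ins for the two stop reasons discussed above. */
enum demo_stop { DEMO_STOP_SYSEXIT, DEMO_STOP_SYSENTRY };

/*
 * Given the %rip designated for the agent (the address of a syscall
 * instruction), return the address of the first userland instruction
 * the agent executes once it is set running.
 */
uint64_t
demo_first_user_rip(enum demo_stop stop, uint64_t designated_rip)
{
	assert(stop == DEMO_STOP_SYSEXIT || stop == DEMO_STOP_SYSENTRY);

	if (stop == DEMO_STOP_SYSENTRY) {
		/*
		 * The pending system call completes first, so the agent
		 * resumes just past the syscall instruction and runs
		 * whatever bytes happen to follow it.
		 */
		return (designated_rip + DEMO_SYSCALL_LEN);
	}

	/* Stopped on exit: the agent executes the designated syscall. */
	return (designated_rip);
}
```

In this model an agent stopped on exit with a designated %rip of 0x1000 starts at 0x1000, while one stopped on entry starts at 0x1002, which is the rogue case.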
The fix for this (such as it is) is that libproc must recognize this condition (namely, agent stopped on PR_SYSENTRY) on agent creation, treating it as it does the sleeping syscall case and calling Pabort_agent() to abort the current syscall.
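The decision described above might look roughly like the following. This is a minimal sketch rather than the actual libproc change: the struct and enum are simplified stand-ins for the real status structures in <sys/procfs.h>, and every demo_ name is hypothetical. A true result is where the real code would call Pabort_agent().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the procfs stop reasons; the values are illustrative. */
enum demo_why { DEMO_REQUESTED, DEMO_SYSENTRY, DEMO_SYSEXIT };

/* Hypothetical, trimmed-down view of one LWP's status. */
struct demo_lwpstatus {
	bool is_agent;		/* the AGENT flag is set */
	bool asleep;		/* sleeping inside a system call */
	enum demo_why why;	/* reason the LWP is stopped */
};

/*
 * Return true when a leftover agent must have its current system call
 * aborted before the agent can safely be (re)created.
 */
bool
demo_agent_needs_abort(const struct demo_lwpstatus *lsp)
{
	assert(lsp != NULL);

	if (!lsp->is_agent)
		return (false);

	/* Existing case: sleeping in a syscall. New case: stopped on entry. */
	return (lsp->asleep || lsp->why == DEMO_SYSENTRY);
}
```

With this check, the first agent shown above (stopped on PR_SYSEXIT) is reused as before, while the second (stopped on PR_SYSENTRY) has its pending unlinkat aborted first.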
Finally, it must be said that it is very hard to induce this latter condition, and it's not clear how it can be done from an errant consumer that does anything short of failing explosively. (Indeed, this condition was induced with a carefully constructed DTrace predicate and a raise() action.) As such, this bug may be an exacerbating factor for HVM-771, but it's hardly the root cause.