Bug #4736
spamming zlogin with SIGINT will halt the target zone
100%
Description
bdha reported on twitter that holding ^C with zlogin with a shell loop will halt some percentage of his zones. This is, with a bit of care, easy to reproduce.
We assume a stock zone called 'test' exists.
Do this.
$ trap echo INT
$ for i in $(seq 1 1000); do (pfexec zlogin test uname -a); done
<now hold ^C down with all your might>
(I'm using bash for this, that may be important, as far as the lengths we're going to to not interrupt the shell, to make this easier to reproduce)
After some period of time, you'll perhaps get errors from zlogin stating that you can't log in to a zone in the shutting_down state (and then the installed state), either way after a period of time the loop will terminate, and the zone 'test' will have halted
The zone should not have halted, whatever (presumably) received our SIGINT shouldn't have.
Updated by Rich Lowe almost 7 years ago
a bit of playing with dtrace
dtrace -qn ':::signal-send /args[2] == 2/ { printf("%s[%d] --> %s[%d]\\\\n", execname, pid, stringof(args[1]->pr_fname), args[1]->pr_pid); }' -o ~/sig.log
Shows that we do go down because we killed init:
$ grep 9944 sig.log sshd[11624] --> zlogin[9944] zlogin[9944] --> ksh93[9946] zlogin[9944] --> zlogin[9945] zlogin[9944] --> init[9349] svc.startd[9386] --> init[9349]
Updated by Rich Lowe almost 7 years ago
So, this is zlogin.c's sig_forward actually passing the signal along. The question then becomes, "How the hell is init in our process group?"
Are we racing against the setpgid() noninteractive_login performs in the child?
Updated by Rich Lowe almost 7 years ago
in (the kernel) zone_enter, we do indeed explicitly join zsched's session and pg. Which sets us up for racing like this. Why we do that, rather than getting a new session/pg I don't understand. Presumably there's a good reason?
It says
5765 * By joining zsched's session here, we mimic the behavior in the 5766 * global zone of init's sid being the pid of sched. We extend this 5767 * to all zlogin-like zone_enter()'ing processes as well.
That second sentence seems rather off hand about it.
I don't know of an easy way to close the race (it can be made shorter by doing the setpgid immediately after zone_enter, but it would still, barely, exist).
We could use sigsendset() rather than sigsend() and thereby protect "special" processes in the zone, but that's... well, a bit crap.
I don't remember my semantics well enough to know whether sending to the whole PG is correct here, either, v. just killing the immediate child.
Updated by Alexander Eremin almost 7 years ago
Possible working fix:
--- a/usr/src/cmd/zlogin/zlogin.c +++ b/usr/src/cmd/zlogin/zlogin.c @@ -554,9 +554,7 @@ static void sig_forward(int s) { if (child_pid != -1) { - pid_t pgid = getpgid(child_pid); - if (pgid != -1) - (void) sigsend(P_PGID, pgid, s); + (void) sigsend(P_PGID, child_pid, s); } }
Updated by Rich Lowe almost 7 years ago
Yeah, I'd been wondering why we weren't doing that, and assuming there had to be some kind of reason. Perhaps there isn't.
Updated by Rich Lowe almost 7 years ago
Hey Alexander, I think that's probably fine. The only thing that'd change is our behaviour if the child process changes its pgrp again (and to something other than its own pid), which is incredibly unlikely (and, were it to happen, I think not killing it is probably the right thing to do).
Updated by Alexander Eremin almost 7 years ago
- Status changed from New to In Progress
- Assignee set to Alexander Eremin
- % Done changed from 0 to 20
Updated by Alexander Eremin over 6 years ago
- Due date set to 2014-04-22
- Status changed from In Progress to Pending RTI
- % Done changed from 20 to 100
Updated by Electric Monk over 6 years ago
git commit 136dc9cbe3ceda200434238d1015d39bca1d2d1a
commit 136dc9cbe3ceda200434238d1015d39bca1d2d1a Author: Alexander Eremin <a.eremin@nexenta.com> Date: 2014-04-22T14:55:37.000Z 4736 spamming zlogin with SIGINT will halt the target zone Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Gordon Ross <gordon.ross@nexenta.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Dan McDonald <danmcd@omniti.com>
Updated by Joshua M. Clulow over 6 years ago
- Status changed from Pending RTI to Closed
- Tags deleted (
needs-triage)