Project

General

Profile

Bug #4736

spamming zlogin with SIGINT will halt the target zone

Added by Rich Lowe over 5 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Normal
Category:
zones
Start date:
2014-04-09
Due date:
2014-04-22
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:

Description

bdha reported on twitter that holding ^C with zlogin with a shell loop will halt some percentage of his zones. This is, with a bit of care, easy to reproduce.

We assume a stock zone called 'test' exists.

Do this.

$ trap echo INT
$ for i in $(seq 1 1000); do (pfexec zlogin test uname -a); done
<now hold ^C down with all your might>

(I'm using bash for this, that may be important, as far as the lengths we're going to to not interrupt the shell, to make this easier to reproduce)

After some period of time, you'll perhaps get errors from zlogin stating that you can't log in to a zone in the shutting_down state (and then the installed state), either way after a period of time the loop will terminate, and the zone 'test' will have halted

The zone should not have halted, whatever (presumably) received our SIGINT shouldn't have.

History

#1

Updated by Rich Lowe over 5 years ago

a bit of playing with dtrace

dtrace -qn ':::signal-send /args[2] == 2/ { printf("%s[%d] --> %s[%d]\\\\n", execname, pid, stringof(args[1]->pr_fname), args[1]->pr_pid); }' -o ~/sig.log                                   

Shows that we do go down because we killed init:

$ grep 9944 sig.log
sshd[11624] --> zlogin[9944]
zlogin[9944] --> ksh93[9946]
zlogin[9944] --> zlogin[9945]
zlogin[9944] --> init[9349]
svc.startd[9386] --> init[9349]
#2

Updated by Rich Lowe over 5 years ago

So, this is zlogin.c's sig_forward actually passing the signal along. The question then becomes, "How the hell is init in our process group?"

Are we racing against the setpgid() noninteractive_login performs in the child?

#3

Updated by Rich Lowe over 5 years ago

in (the kernel) zone_enter, we do indeed explicitly join zsched's session and pg. Which sets us up for racing like this. Why we do that, rather than getting a new session/pg I don't understand. Presumably there's a good reason?

It says

5765     * By joining zsched's session here, we mimic the behavior in the
5766     * global zone of init's sid being the pid of sched.  We extend this
5767     * to all zlogin-like zone_enter()'ing processes as well.

That second sentence seems rather off hand about it.

I don't know of an easy way to close the race (it can be made shorter by doing the setpgid immediately after zone_enter, but it would still, barely, exist).

We could use sigsendset() rather than sigsend() and thereby protect "special" processes in the zone, but that's... well, a bit crap.

I don't remember my semantics well enough to know whether sending to the whole PG is correct here, either, v. just killing the immediate child.

#4

Updated by Alexander Eremin over 5 years ago

Possible working fix:

--- a/usr/src/cmd/zlogin/zlogin.c
+++ b/usr/src/cmd/zlogin/zlogin.c
@@ -554,9 +554,7 @@ static void
 sig_forward(int s)
 {
        if (child_pid != -1) {
-               pid_t pgid = getpgid(child_pid);
-               if (pgid != -1)
-                       (void) sigsend(P_PGID, pgid, s);
+                       (void) sigsend(P_PGID, child_pid, s);
        }
 }

#5

Updated by Rich Lowe over 5 years ago

Yeah, I'd been wondering why we weren't doing that, and assuming there had to be some kind of reason. Perhaps there isn't.

#6

Updated by Rich Lowe over 5 years ago

Hey Alexander, I think that's probably fine. The only thing that'd change is our behaviour if the child process changes its pgrp again (and to something other than its own pid), which is incredibly unlikely (and, were it to happen, I think not killing it is probably the right thing to do).

#7

Updated by Alexander Eremin over 5 years ago

  • Status changed from New to In Progress
  • Assignee set to Alexander Eremin
  • % Done changed from 0 to 20
#8

Updated by Alexander Eremin over 5 years ago

  • Due date set to 2014-04-22
  • Status changed from In Progress to Pending RTI
  • % Done changed from 20 to 100
#9

Updated by Electric Monk over 5 years ago

git commit 136dc9cbe3ceda200434238d1015d39bca1d2d1a

commit  136dc9cbe3ceda200434238d1015d39bca1d2d1a
Author: Alexander Eremin <a.eremin@nexenta.com>
Date:   2014-04-22T14:55:37.000Z

    4736 spamming zlogin with SIGINT will halt the target zone
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
    Reviewed by: Garrett D'Amore <garrett@damore.org>
    Approved by: Dan McDonald <danmcd@omniti.com>

#10

Updated by Joshua M. Clulow over 5 years ago

  • Status changed from Pending RTI to Closed
  • Tags deleted (needs-triage)

Also available in: Atom PDF