Project

General

Profile

Actions

Bug #12069

closed

Backport sh_delay() and tvsleep() from ksh-2020.0.0

Added by Marcel Telka almost 2 years ago. Updated almost 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
lib - userland libraries
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

We faces few cases when /usr/bin/sleep slept forever. Here is truss output to show what happened:

18962:  sigaction(SIGALRM, 0x080468E0, 0x08046960)      = 0
18962:      new: hand = 0xFEEDC869 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000
18962:      old: hand = 0xFEEDC869 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000
18962:  setitimer(ITIMER_REAL, 0x080469A0, 0x08046990)  = 0
18962:           value:  interval:    0.000000 sec  value:  399.000000 sec
18962:          ovalue:  interval:    0.000000 sec  value:    0.000000 sec
18962:  pause()                         (sleeping...)
18962:      Received signal #14, SIGALRM, in pause() [caught]
18962:  pause()                                         Err#4 EINTR
18962:  lwp_sigmask(SIG_SETMASK, 0x00002000, 0x00000000, 0x00000000, 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
18962:  alarm(0)                                        = 0
18962:  lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
18962:  sigaction(SIGALRM, 0x08046610, 0x08046690)      = 0
18962:      new: hand = 0xFEEDC869 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000
18962:      old: hand = 0xFEEDC869 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000
18962:  setitimer(ITIMER_REAL, 0x080466D0, 0x080466C0)  = 0
18962:           value:  interval:    0.000000 sec  value:    0.001263 sec
18962:          ovalue:  interval:    0.000000 sec  value:    0.000000 sec
18962:      Received signal #14, SIGALRM [caught]
18962:  lwp_sigmask(SIG_SETMASK, 0x00002000, 0x00000000, 0x00000000, 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF]
18962:  alarm(0)                                        = 0
18962:  setcontext(0x080461E0)
18962:  setcontext(0x08046540)
18962:  getpid()                                        = 18962 [18961]
18962:  pause()                         (sleeping...)

Obviously, the problem is the second setitimer() call that sets the timeout to very short value and the SIGALARM comes almost immediately. The signal is then cleared by the subsequent alarm(0) call.

The root cause for this problem could be seen in the sleep() function in usr/src/lib/libshell/common/bltins/sleep.c (called from sh_delay() for delays longer than 30 seconds). First, the sh_timeradd() is called to set the alarm (line 133), later pause() is called to wait for the signal (line 137). In a case the signal arrived before the timeout the do-while loop (lines 134-152) calls the pause() again and it is expected that the sigalrm() handler configures new alarm by calling setalarm() in the meantime (in background). This obviously happens, but in a case the 2nd scheduled alarm interval is too short (in our case 0.001263 sec) the actual signal could arrive even before we called pause() 2nd time. This causes the pause() to hang basically indefinitely (or until some random signal arrives, which could never happen).


Related issues

Related to illumos gate - Bug #6520: ksh: sleep could wait foreverClosed2015-12-18

Actions
Related to illumos gate - Feature #11876: Want a native sleepClosedRobert Mustacchi

Actions
Actions #1

Updated by Marcel Telka almost 2 years ago

  • Related to Bug #6520: ksh: sleep could wait forever added
Actions #3

Updated by Marcel Telka almost 2 years ago

Testing Done

I confirmed that /usr/bin/sleep with the newly built libast and libshell calls nanosleep() to implement the delay.

Actions #5

Updated by Andy Fiddaman almost 2 years ago

Actions #6

Updated by Marcel Telka almost 2 years ago

Andy Fiddaman wrote:

See also https://www.illumos.org/issues/11876

Yes, I'm aware of that (I reviewed it), but it solves half of the problem only - /usr/bin/sleep, not ksh's internal sleep (often used by scripts written in ksh).

Actions #7

Updated by Marcel Telka almost 2 years ago

Testing

Without the fix:

$ truss -f -v all -w all /usr/bin/sleep 399
...
21615:  stat64("/dev/null", 0x08036E50)                 = 0
21615:      d=0x08300000 i=71303176 m=0020666 l=1  u=0     g=3     rdev=0x02200002
21615:          at = Dec 11 00:16:32 CET 2019  [ 1576019792.000000000 ]
21615:          mt = Dec 11 03:10:13 CET 2019  [ 1576030213.000000000 ]
21615:          ct = Oct 22 15:57:45 CEST 2019  [ 1571752665.000000000 ]
21615:      bsz=8192  blks=0     fs=devfs
21615:  close(3)                                        Err#9 EBADF
21615:  close(2)                                        = 0
21615:  fcntl(11, F_DUPFD, 0x00000002)                  = 2
21615:  close(11)                                       = 0
21615:  close(1)                                        = 0
21615:  fcntl(10, F_DUPFD, 0x00000001)                  = 1
21615:  close(10)                                       = 0
21615:  sigaction(SIGALRM, 0x08037400, 0x08037470)      = 0
21615:      new: hand = 0xFEEB9F7C mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000
21615:      old: hand = 0x00000000 mask = 0 0 0 0 flags = 0x0000
21615:  time()                                          = 1576056642
21615:  getpid()                                        = 21615 [21614]
21615:  sigaction(SIGALRM, 0x08037360, 0x080373D0)      = 0
21615:      new: hand = 0xFEEB9F7C mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000
21615:      old: hand = 0xFEEB9F7C mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000
21615:  setitimer(ITIMER_REAL, 0x08037410, 0x08037400)  = 0
21615:           value:  interval:    0.000000 sec  value:  399.000000 sec
21615:          ovalue:  interval:    0.000000 sec  value:    0.000000 sec
21615:  pause()                         (sleeping...)
^C21615:            Received signal #2, SIGINT, in pause() [default]
$

With the fix:

$ LD_LIBRARY_PATH=proto/root_i386/usr/lib/ truss -f -v all -w all /usr/bin/sleep 399
...
21667:  stat64("/dev/null", 0x080366C0)                 = 0
21667:      d=0x08300000 i=71303176 m=0020666 l=1  u=0     g=3     rdev=0x02200002
21667:          at = Dec 11 00:16:32 CET 2019  [ 1576019792.000000000 ]
21667:          mt = Dec 11 03:10:13 CET 2019  [ 1576030213.000000000 ]
21667:          ct = Oct 22 15:57:45 CEST 2019  [ 1571752665.000000000 ]
21667:      bsz=8192  blks=0     fs=devfs
21667:  close(3)                                        Err#9 EBADF
21667:  close(2)                                        = 0
21667:  fcntl(11, F_DUPFD, 0x00000002)                  = 2
21667:  close(11)                                       = 0
21667:  close(1)                                        = 0
21667:  fcntl(10, F_DUPFD, 0x00000001)                  = 1
21667:  close(10)                                       = 0
21667:  sigaction(SIGALRM, 0x08036C70, 0x08036CE0)      = 0
21667:      new: hand = 0xFEF23AC7 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000
21667:      old: hand = 0x00000000 mask = 0 0 0 0 flags = 0x0000
21667:  time()                                          = 1576056722
21667:  nanosleep(0x08036CF8, 0x08036CF0) (sleeping...)
21667:          tmout: 399.000000000 sec
^C21667:            Received signal #2, SIGINT, in nanosleep() [default]
$
Actions #8

Updated by Marcel Telka almost 2 years ago

  • Status changed from In Progress to Pending RTI
Actions #9

Updated by Electric Monk almost 2 years ago

  • Status changed from Pending RTI to Closed
  • % Done changed from 0 to 100

git commit 5947648b7f5c085635051e1f7aa083a309542467

commit  5947648b7f5c085635051e1f7aa083a309542467
Author: Marcel Telka <marcel@telka.sk>
Date:   2019-12-17T23:10:01.000Z

    12069 Backport sh_delay() and tvsleep() from ksh-2020.0.0
    Reviewed by: Dan McDonald <danmcd@joyent.com>
    Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
    Approved by: Robert Mustacchi <rm@fingolfin.org>

Actions

Also available in: Atom PDF