Bug #12069
closedBackport sh_delay() and tvsleep() from ksh-2020.0.0
100%
Description
We faces few cases when /usr/bin/sleep
slept forever. Here is truss output to show what happened:
18962: sigaction(SIGALRM, 0x080468E0, 0x08046960) = 0 18962: new: hand = 0xFEEDC869 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000 18962: old: hand = 0xFEEDC869 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000 18962: setitimer(ITIMER_REAL, 0x080469A0, 0x08046990) = 0 18962: value: interval: 0.000000 sec value: 399.000000 sec 18962: ovalue: interval: 0.000000 sec value: 0.000000 sec 18962: pause() (sleeping...) 18962: Received signal #14, SIGALRM, in pause() [caught] 18962: pause() Err#4 EINTR 18962: lwp_sigmask(SIG_SETMASK, 0x00002000, 0x00000000, 0x00000000, 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF] 18962: alarm(0) = 0 18962: lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000, 0x00000000, 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF] 18962: sigaction(SIGALRM, 0x08046610, 0x08046690) = 0 18962: new: hand = 0xFEEDC869 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000 18962: old: hand = 0xFEEDC869 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000 18962: setitimer(ITIMER_REAL, 0x080466D0, 0x080466C0) = 0 18962: value: interval: 0.000000 sec value: 0.001263 sec 18962: ovalue: interval: 0.000000 sec value: 0.000000 sec 18962: Received signal #14, SIGALRM [caught] 18962: lwp_sigmask(SIG_SETMASK, 0x00002000, 0x00000000, 0x00000000, 0x00000000) = 0xFFBFFEFF [0xFFFFFFFF] 18962: alarm(0) = 0 18962: setcontext(0x080461E0) 18962: setcontext(0x08046540) 18962: getpid() = 18962 [18961] 18962: pause() (sleeping...)
Obviously, the problem is the second setitimer()
call that sets the timeout to very short value and the SIGALARM
comes almost immediately. The signal is then cleared by the subsequent alarm(0)
call.
The root cause for this problem could be seen in the sleep()
function in usr/src/lib/libshell/common/bltins/sleep.c
(called from sh_delay()
for delays longer than 30 seconds). First, the sh_timeradd()
is called to set the alarm (line 133), later pause()
is called to wait for the signal (line 137). In a case the signal arrived before the timeout the do-while loop (lines 134-152) calls the pause()
again and it is expected that the sigalrm()
handler configures new alarm by calling setalarm()
in the meantime (in background). This obviously happens, but in a case the 2nd scheduled alarm interval is too short (in our case 0.001263 sec) the actual signal could arrive even before we called pause()
2nd time. This causes the pause()
to hang basically indefinitely (or until some random signal arrives, which could never happen).
Related issues
Updated by Marcel Telka over 3 years ago
- Related to Bug #6520: ksh: sleep could wait forever added
Updated by Marcel Telka over 3 years ago
Updated by Marcel Telka over 3 years ago
Testing Done
I confirmed that /usr/bin/sleep
with the newly built libast
and libshell
calls nanosleep()
to implement the delay.
Updated by Andy Fiddaman over 3 years ago
See also https://www.illumos.org/issues/11876
Updated by Andy Fiddaman over 3 years ago
- Related to Feature #11876: Want a native sleep added
Updated by Marcel Telka over 3 years ago
Andy Fiddaman wrote:
See also https://www.illumos.org/issues/11876
Yes, I'm aware of that (I reviewed it), but it solves half of the problem only - /usr/bin/sleep
, not ksh's internal sleep (often used by scripts written in ksh).
Updated by Marcel Telka over 3 years ago
Testing
Without the fix:
$ truss -f -v all -w all /usr/bin/sleep 399 ... 21615: stat64("/dev/null", 0x08036E50) = 0 21615: d=0x08300000 i=71303176 m=0020666 l=1 u=0 g=3 rdev=0x02200002 21615: at = Dec 11 00:16:32 CET 2019 [ 1576019792.000000000 ] 21615: mt = Dec 11 03:10:13 CET 2019 [ 1576030213.000000000 ] 21615: ct = Oct 22 15:57:45 CEST 2019 [ 1571752665.000000000 ] 21615: bsz=8192 blks=0 fs=devfs 21615: close(3) Err#9 EBADF 21615: close(2) = 0 21615: fcntl(11, F_DUPFD, 0x00000002) = 2 21615: close(11) = 0 21615: close(1) = 0 21615: fcntl(10, F_DUPFD, 0x00000001) = 1 21615: close(10) = 0 21615: sigaction(SIGALRM, 0x08037400, 0x08037470) = 0 21615: new: hand = 0xFEEB9F7C mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000 21615: old: hand = 0x00000000 mask = 0 0 0 0 flags = 0x0000 21615: time() = 1576056642 21615: getpid() = 21615 [21614] 21615: sigaction(SIGALRM, 0x08037360, 0x080373D0) = 0 21615: new: hand = 0xFEEB9F7C mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000 21615: old: hand = 0xFEEB9F7C mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000 21615: setitimer(ITIMER_REAL, 0x08037410, 0x08037400) = 0 21615: value: interval: 0.000000 sec value: 399.000000 sec 21615: ovalue: interval: 0.000000 sec value: 0.000000 sec 21615: pause() (sleeping...) ^C21615: Received signal #2, SIGINT, in pause() [default] $
With the fix:
$ LD_LIBRARY_PATH=proto/root_i386/usr/lib/ truss -f -v all -w all /usr/bin/sleep 399 ... 21667: stat64("/dev/null", 0x080366C0) = 0 21667: d=0x08300000 i=71303176 m=0020666 l=1 u=0 g=3 rdev=0x02200002 21667: at = Dec 11 00:16:32 CET 2019 [ 1576019792.000000000 ] 21667: mt = Dec 11 03:10:13 CET 2019 [ 1576030213.000000000 ] 21667: ct = Oct 22 15:57:45 CEST 2019 [ 1571752665.000000000 ] 21667: bsz=8192 blks=0 fs=devfs 21667: close(3) Err#9 EBADF 21667: close(2) = 0 21667: fcntl(11, F_DUPFD, 0x00000002) = 2 21667: close(11) = 0 21667: close(1) = 0 21667: fcntl(10, F_DUPFD, 0x00000001) = 1 21667: close(10) = 0 21667: sigaction(SIGALRM, 0x08036C70, 0x08036CE0) = 0 21667: new: hand = 0xFEF23AC7 mask = 0xFFBFFEFF 0xFFFFFFFF 0x000001FF 0 flags = 0x0000 21667: old: hand = 0x00000000 mask = 0 0 0 0 flags = 0x0000 21667: time() = 1576056722 21667: nanosleep(0x08036CF8, 0x08036CF0) (sleeping...) 21667: tmout: 399.000000000 sec ^C21667: Received signal #2, SIGINT, in nanosleep() [default] $
Updated by Marcel Telka over 3 years ago
- Status changed from In Progress to Pending RTI
Updated by Electric Monk over 3 years ago
- Status changed from Pending RTI to Closed
- % Done changed from 0 to 100
git commit 5947648b7f5c085635051e1f7aa083a309542467
commit 5947648b7f5c085635051e1f7aa083a309542467 Author: Marcel Telka <marcel@telka.sk> Date: 2019-12-17T23:10:01.000Z 12069 Backport sh_delay() and tvsleep() from ksh-2020.0.0 Reviewed by: Dan McDonald <danmcd@joyent.com> Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk> Approved by: Robert Mustacchi <rm@fingolfin.org>